Kai Zhang



Last Name: Zhang
First Name: Kai


Publications 1 - 10 of 10
  • Liang, Jingyun; Fang, Yuchen; Zhang, Kai; et al. (2025)
    Lecture Notes in Computer Science ~ Computer Vision – ECCV 2024
    While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the latter describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or is generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, the optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
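The flow-based warping and occlusion masking described in this abstract can be illustrated with a minimal numpy sketch; `warp_with_flow` and `occlusion_mask` are illustrative names, and nearest-neighbour sampling is a simplification of the paper's latent-space warping:

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp `frame` (H, W, C) by optical `flow` (H, W, 2)
    with nearest-neighbour sampling: each output pixel reads from
    the location the flow points to, clipped to the image bounds."""
    H, W, _ = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]

def occlusion_mask(fwd_flow, bwd_flow, thresh=1.0):
    """Forward-backward consistency check: pixels where the forward
    and warped backward flows disagree are marked occluded (0)."""
    wb = warp_with_flow(bwd_flow, fwd_flow)
    err = np.linalg.norm(fwd_flow + wb, axis=-1)
    return (err < thresh).astype(np.float32)
```

With a zero flow the warp is the identity and every pixel passes the consistency check, which is a quick sanity test for the sketch.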
  • Zhao, Zixiang; Bai, Haowen; Zhu, Yuanzhi; et al. (2023)
    2023 IEEE/CVF International Conference on Computer Vision (ICCV)
    Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we require is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at https://github.com/Zhaozixiang1228/MMIF-DDFM.
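The split into an unconditional generation subproblem plus a likelihood correction can be sketched in one toy reverse-diffusion step; `eps_model` and `likelihood_grad` are assumed callables standing in for the pre-trained network and the EM-derived correction, not the paper's implementation:

```python
import numpy as np

def ddpm_fusion_step(x_t, t, eps_model, betas, likelihood_grad):
    """One reverse-diffusion step with a likelihood correction:
    first the standard unconditional DDPM posterior mean, then a
    nudge toward the source images (toy 1-D sketch)."""
    beta = betas[t]
    alpha = 1.0 - beta
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    # unconditional DDPM posterior mean
    eps = eps_model(x_t, t)
    mean = (x_t - beta / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha)
    # maximum-likelihood correction from the source images
    mean = mean + beta * likelihood_grad(mean)
    noise = np.random.randn(*x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(beta) * noise
```

The point of the sketch is structural: the conditioning enters only through `likelihood_grad`, so the generative model itself stays unconditional and untouched, matching the "no fine-tuning" claim.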
  • Liu, Shijie; Yan, Kang; Qin, Feiwei; et al. (2024)
    Lecture Notes in Computer Science ~ Advanced Intelligent Computing Technology and Applications
    Single image super-resolution (SR) is an established pixel-level vision task aimed at reconstructing a high-resolution image from its degraded low-resolution counterpart. Despite the notable advancements achieved by leveraging deep neural networks for SR, most existing deep learning architectures feature an extensive number of layers, leading to high computational complexity and substantial memory demands. To mitigate these challenges, we introduce a novel, efficient, and precise single infrared image SR model, termed the Lightweight Information Split Network (LISN). The LISN comprises four main components: shallow feature extraction, deep feature extraction, dense feature fusion, and high-resolution infrared image reconstruction. A key innovation within this model is the introduction of the Lightweight Information Split Block (LISB) for deep feature extraction. The LISB employs a sequential process to extract hierarchical features, which are then aggregated based on the relevance of the features under consideration. By integrating channel splitting and shift operations, the LISB successfully strikes an optimal balance between enhanced SR performance and a lightweight framework. Comprehensive experimental evaluations reveal that the proposed LISN achieves superior performance over contemporary state-of-the-art methods in terms of both SR quality and model complexity, affirming its efficacy for practical deployment in resource-constrained infrared imaging applications.
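The channel-splitting-and-shift idea behind the LISB can be illustrated with a toy numpy block; the function name and the single-pixel shift are illustrative, not the published layer:

```python
import numpy as np

def split_shift_block(x, shift=1):
    """Toy channel-split-and-shift: half the channels pass through
    unchanged, the other half are spatially shifted before
    re-concatenation, mixing information across positions at
    essentially zero parameter cost."""
    c = x.shape[0] // 2
    keep, move = x[:c], x[c:]
    shifted = np.roll(move, shift, axis=-1)  # shift along width
    return np.concatenate([keep, shifted], axis=0)
```

Shift operations like this are popular in lightweight models precisely because they enlarge the receptive field without adding multiply-accumulates.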
  • Li, Mu; Zhang, Kai; Li, Jinxing; et al. (2023)
    IEEE Transactions on Neural Networks and Learning Systems
    The entropy of the codes usually serves as the rate loss in recent learned lossy image compression methods. Precise estimation of the probabilistic distribution of the codes plays a vital role in reducing the entropy and boosting the joint rate-distortion performance. However, existing deep learning based entropy models generally assume the latent codes are statistically independent or depend on some side information or local context, failing to take the global similarity within the context into account and thus hindering accurate entropy estimation. To address this issue, we propose a special nonlocal operation for context modeling by employing the global similarity within the context. Specifically, due to the causal constraint of context modeling, the nonlocal operation cannot be computed directly. We exploit the relationship between the code maps produced by deep neural networks and introduce proxy similarity functions as a workaround. Then, we combine the local and the global context via a nonlocal attention block and employ it in masked convolutional networks for entropy modeling. Taking into consideration that the width of the transforms is essential in training low-distortion models, we finally introduce a U-net block in the transforms to increase the width with manageable memory consumption and time complexity. Experiments on the Kodak and Tecnick datasets demonstrate the superiority of the proposed context-based nonlocal attention block in entropy modeling and of the U-net block in low-distortion situations. On the whole, our model performs favorably against existing image compression standards and recent deep image compression models.
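The role of the entropy model can be made concrete with the standard rate estimate R = -Σ log2 p(code): a sharper, better-conditioned distribution assigns higher probability to the actual codes and therefore fewer bits. A minimal sketch follows; the pmf here is a plain per-symbol table, not the paper's context model:

```python
import numpy as np

def rate_bits(codes, pmf):
    """Estimated bitrate of integer `codes` under a per-symbol pmf:
    R = -sum log2 p(code). Improving the entropy model means raising
    p(code) for the codes that actually occur, which lowers R."""
    p = np.clip(pmf[codes], 1e-9, 1.0)
    return -np.log2(p).sum()
```

For example, 8 symbols under a uniform pmf over 4 values cost exactly 2 bits each, while a pmf peaked on the observed symbol costs far less, which is the gain context modeling is after.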
  • Zhang, Kai; Zuliani, Riccardo; Balta, Efe C.; et al. (2024)
    IEEE Control Systems Letters
    This letter introduces the Data-Enabled Predictive iteRative Control (DeePRC) algorithm, a direct data-driven approach for iterative LTI systems. The DeePRC learns from previous iterations to improve its performance and achieves the optimal cost. By utilizing a tube-based variation of the DeePRC scheme, we propose a two-stage approach that enables safe active exploration using a left-kernel-based input disturbance design. This method generates informative trajectories to enrich the historical data, which extends the maximum achievable prediction horizon and leads to faster iteration convergence. In addition, we present an end-to-end formulation of the two-stage approach, integrating the disturbance design procedure into the planning phase. We showcase the effectiveness of the proposed algorithms in a numerical experiment.
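Direct data-driven methods in the DeePC family build their predictions from a block-Hankel matrix of recorded trajectories rather than from an identified model; a minimal construction (illustrative, not the paper's code) looks like:

```python
import numpy as np

def hankel(data, L):
    """Block-Hankel matrix of depth L from a trajectory (T, m).
    Each column stacks one length-L window of the recorded data;
    spans of these columns serve as the predictor in DeePC-style
    control, so longer data and richer excitation allow a longer
    prediction horizon L."""
    T, m = data.shape
    cols = T - L + 1
    return np.column_stack([data[i : i + L].reshape(-1) for i in range(cols)])
```

This also shows why the exploration step in the abstract matters: enriching the historical data adds independent columns, which is what "extends the maximum achievable prediction horizon" refers to.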
  • Zhang, Kai; Li, Yawei; Liang, Jingyun; et al. (2023)
    Machine Intelligence Research
    While recent years have witnessed a dramatic upsurge of exploiting deep neural networks toward solving image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspective of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block to incorporate the local modeling ability of the residual convolutional layer and the non-local modeling ability of the swin transformer block, and then plug it as the main building block into the widely-used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model which takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noises) and resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve the practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet.
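The random-shuffle degradation strategy can be sketched as applying a few noise operators in a random order; the parameters below are illustrative, and the JPEG compression, resizing, and double-degradation parts are omitted for brevity:

```python
import numpy as np

def degrade(img, rng):
    """Toy shuffled degradation pipeline on an image in [0, 1]:
    Gaussian, Poisson and speckle noise are applied in a random
    order so the network cannot overfit one fixed noise sequence."""
    ops = [
        lambda x: x + rng.normal(0, 0.05, x.shape),             # AWGN
        lambda x: rng.poisson(np.clip(x, 0, 1) * 255) / 255.0,  # Poisson
        lambda x: x * (1 + rng.normal(0, 0.05, x.shape)),       # speckle
    ]
    rng.shuffle(ops)
    for op in ops:
        img = op(img)
    return np.clip(img, 0, 1)
```

Shuffling the operator order is the key trick: it widens the set of synthetic degradations so the trained denoiser generalizes better to real camera noise.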
  • Shen, Zhengwei; Qin, Feiwei; Ge, Ruiquan; et al. (2025)
    Alexandria Engineering Journal
    Image denoising is a quintessential challenge in computer vision, intending to produce high-quality, clean images from degraded, noisy counterparts. Infrared imaging holds a pivotal position across many research domains, attributed to its inherent benefits such as concealment and noninvasiveness. Despite these advantages, infrared images are often plagued by hardware-related imperfections resulting in poor contrast, diminished quality, and noise contamination. Extracting and characterizing features amidst these unique feature patterns in infrared imagery are taxing tasks. To surmount these obstacles, we introduce the Infrared image Denoising Transformer (IDTransformer), encapsulated in a symmetrical encoder–decoder architecture. Central to our approach is the Convolutional Transposed Self-Attention Block (CTSAB), which is ingeniously conceived to capture long-range dependencies via channel-wise self-attention, while simultaneously encapsulating local context through depth-wise convolution. In addition, we refine the conventional feed-forward network by integrating Convolutional Gated Linear Units (CGLU) and deploy the Channel Coordinate Attention Block (CCAB) during the feature fusion phase to dynamically apportion weights across the feature map, thereby facilitating a more nuanced representation of pattern features endemic to infrared images. Through rigorous experimentation, we establish that our IDTransformer attains superior visual enhancement across five infrared image datasets, compared with the state-of-the-art methods. The source codes are available at https://github.com/szw811/IDTransformer.
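The channel-wise ("transposed") self-attention at the heart of the CTSAB attends over a C x C channel-similarity matrix rather than the HW x HW spatial one, so its cost is linear in the number of pixels. A toy numpy sketch follows, with shared projections for brevity; it is not the published block:

```python
import numpy as np

def channel_attention(x):
    """Channel-wise self-attention on a (C, N) feature map
    (C channels, N flattened pixels): softmax over a (C, C)
    channel-similarity matrix, then reweight the channels."""
    C, N = x.shape
    q = k = v = x                        # shared projection for brevity
    attn = q @ k.T / np.sqrt(N)          # (C, C) similarity
    attn = np.exp(attn - attn.max(1, keepdims=True))
    attn /= attn.sum(1, keepdims=True)   # row-wise softmax
    return attn @ v                      # (C, N) reweighted features
```

Because the attention matrix is C x C instead of N x N, the block stays affordable at the full-resolution feature maps a denoiser must process.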
  • Calvi, Marco; Liang, Xiaoyang; Ferrari, Eugenio; et al. (2023)
    Journal of Synchrotron Radiation
    The Paul Scherrer Institute is implementing laser-based seeding in the soft X-ray beamline (Athos) of its free-electron laser, SwissFEL, to enhance the temporal and spectral properties of the delivered photon pulses. This technique requires, among other components, two identical modulators for coupling the electron beam with an external laser with a wavelength range between 260 and 1600 nm. The design, magnetic measurement results, alignment, operation and details of the novel and exotic magnetic configuration of the prototype are described.
  • Qin, Feiwei; Yan, Kang; Wang, Changmiao; et al. (2024)
    Multimedia Tools and Applications
    Given the broad application of infrared technology across diverse fields, there is an increasing emphasis on investigating super-resolution techniques for infrared images within the realm of deep learning. Despite the impressive results of current Transformer-based methods in image super-resolution tasks, their reliance on the self-attention mechanism intrinsic to the Transformer architecture results in images being treated as one-dimensional sequences, thereby neglecting their inherent two-dimensional structure. Moreover, infrared images exhibit a uniform pixel distribution and a limited gradient range, posing challenges for the model to capture effective feature information. Consequently, we suggest a potent Transformer model, termed Large Kernel Transformer (LKFormer), to address this issue. Specifically, we have designed a Large Kernel Residual Depth-wise Convolutional Attention (LKRDA) module with linear complexity. This mainly employs depth-wise convolution with large kernels to execute non-local feature modeling, thereby substituting the standard self-attention layer. Additionally, we have devised a novel feed-forward network structure called Gated-Pixel Feed-Forward Network (GPFN) to augment the LKFormer's capacity to manage the information flow within the network. Comprehensive experimental results reveal that our method surpasses the most advanced techniques available, using fewer parameters and yielding considerably superior performance.
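The substitution of large-kernel depth-wise convolution for self-attention can be illustrated in one dimension: each channel is filtered independently, so the cost grows with kernel size times channels rather than quadratically with sequence length. This is a sketch of the idea, not the LKRDA module:

```python
import numpy as np

def depthwise_conv1d_large(x, kernel):
    """Depth-wise 1-D convolution with a large kernel on a (C, N)
    feature map: every channel is filtered independently with the
    same kernel, giving a wide (non-local) receptive field at
    linear complexity in N."""
    C, N = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        out[c] = np.convolve(xp[c], kernel, mode="valid")[:N]
    return out
```

An averaging kernel leaves a constant signal unchanged, which makes for a simple correctness check of the padding and output-length bookkeeping.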
  • Cao, Jiezhang; Shi, Yue; Zhang, Kai; et al. (2024)
    2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
    Diffusion model-based image restoration (IR) aims to use diffusion models to recover high-quality (HQ) images from degraded images, achieving promising performance. Due to the inherent property of diffusion models, most existing methods need long serial sampling chains to restore HQ images step-by-step, resulting in expensive sampling time and high computation costs. Moreover, such long sampling chains hinder understanding the relationship between inputs and restoration results, since it is hard to compute the gradients in the whole chains. In this work, we aim to rethink the diffusion model-based IR models through a different perspective, i.e., a deep equilibrium (DEQ) fixed point system, called DeqIR. Specifically, we derive an analytical solution by modeling the entire sampling chain in these IR models as a joint multivariate fixed point system. Based on the analytical solution, we can conduct parallel sampling and restore HQ images without training. Furthermore, we compute fast gradients via DEQ inversion and find that initialization optimization can boost image quality and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our method on typical IR tasks and real-world settings.
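Viewing the whole sampling chain as an equilibrium x* = f(x*) is the core of the DEQ formulation; a naive solver (the paper's analytical, parallel treatment is more refined) can be sketched as:

```python
import numpy as np

def fixed_point(f, x0, tol=1e-8, max_iter=500):
    """Plain fixed-point iteration x <- f(x): instead of unrolling a
    long serial chain, solve for the equilibrium of the whole system.
    Gradients can then be obtained at the equilibrium (DEQ inversion)
    without backpropagating through every step."""
    x = x0
    for _ in range(max_iter):
        x_new = f(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

For a contraction such as f(x) = 0.5x + 1 the iteration converges to the unique fixed point x* = 2, independent of the starting point, which is the property the DEQ view exploits.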