Jiezhang Cao
Publications 1 - 10 of 10
- VRT: A Video Restoration Transformer
Item type: Journal Article
IEEE Transactions on Image Processing
Liang, Jingyun; Cao, Jiezhang; Fan, Yuchen; et al. (2024)
Video restoration aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding-window strategy or a recurrent architecture, both of which are restricted to frame-by-frame restoration. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction ability. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal reciprocal self attention (TRSA) and parallel warping. TRSA divides the video into small clips, on which reciprocal attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping further fuses information from neighboring frames via parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets. The code is available at https://github.com/JingyunLiang/VRT.
- DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model
Item type: Conference Paper
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Pan, Zhenghao; Zeng, Haijin; Cao, Jiezhang; et al. (2024)
This paper endeavors to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral images (MSIs). To achieve this, we integrate the advantageous attributes of established SCI techniques with an image generative model and propose a novel structured zero-shot diffusion model, dubbed DiffSCI. DiffSCI leverages the structural insights of deep-prior and optimization-based methodologies, complemented by the generative capabilities offered by the contemporary denoising diffusion model. Specifically, we first employ a pre-trained diffusion model, trained on a substantial corpus of RGB images, as the generative denoiser within the Plug-and-Play framework for the first time. This integration enables successful SCI reconstruction, especially in cases that current methods struggle to address effectively. Second, we systematically account for spectral band correlations and introduce a robust methodology to mitigate wavelength mismatch, thus enabling seamless adaptation of the RGB diffusion model to MSIs. Third, an accelerated algorithm is implemented to expedite the resolution of the data subproblem. This augmentation not only accelerates the convergence rate but also elevates the quality of the reconstruction. Extensive testing shows that DiffSCI exhibits discernible performance enhancements over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts on both simulated and real datasets. Code is at https://github.com/PAN083/DiffSCI.
- Inheriting Bayer's Legacy: Joint Remosaicing and Denoising for Quad Bayer Image Sensor
Item type: Journal Article
International Journal of Computer Vision
Zeng, Haijin; Feng, Kai; Cao, Jiezhang; et al. (2024)
Pixel-binning-based Quad sensors (mega-pixel resolution camera sensors) offer a promising solution to the hardware limitations of compact cameras for low-light imaging. However, the binning process reduces spatial resolution and introduces non-Bayer CFA artifacts. In this paper, we propose a Quad CFA-driven remosaicing model that effectively converts noisy Quad Bayer patterns to standard Bayer patterns compatible with existing Image Signal Processors (ISPs), without any loss in resolution. To enhance the practicality of the remosaicing model for real-world images affected by mixed noise, we introduce a novel dual-head joint remosaicing and denoising network (DJRD), which sidesteps the question of whether to denoise or remosaic first by performing the two in parallel. In DJRD, we customize two denoising branches for Quad Bayer and Bayer inputs. These branches model non-local and local dependencies, CFA location, and frequency information using residual convolutional layers, a Swin Transformer, and a wavelet transform-based CNN. Furthermore, to improve the model's performance on challenging cases, we fine-tune DJRD on difficult scenarios by identifying problematic patches through moiré and zipper detection metrics. This post-training phase allows the model to focus on resolving complex image regions. Extensive experiments on simulated and real images in both the Bayer and sRGB domains demonstrate that DJRD outperforms competing models by approximately 3 dB while maintaining simplicity of implementation without adding any hardware.
- Dual Prior Unfolding for Snapshot Compressive Imaging
Item type: Conference Paper
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zhang, Jiancheng; Zeng, Haijin; Cao, Jiezhang; et al. (2024)
Recently, deep unfolding methods have achieved remarkable success in Snapshot Compressive Imaging (SCI) reconstruction. However, existing methods all follow the iterative framework of a single image prior, which limits the efficiency of unfolding methods and makes it difficult to use other priors simply and effectively. To break out of this box, we derive an effective Dual Prior Unfolding (DPU), which jointly utilizes multiple deep priors and greatly improves iteration efficiency. Our unfolding method is implemented through two parts, i.e., a Dual Prior Framework (DPF) and Focused Attention (FA). In brief, in addition to the normal image prior, DPF introduces a residual into the iteration formula and constructs a degradation prior for the residual by considering various degradations, thereby establishing the unfolding framework. To improve the effectiveness of the self-attention-based image prior, FA adopts a novel mechanism inspired by PCA denoising to scale and filter attention, which lets the attention focus on effective features at little computational cost. Besides, an asymmetric backbone is proposed to further improve the efficiency of hierarchical self-attention. Remarkably, our 5-stage DPU achieves state-of-the-art (SOTA) performance with the fewest FLOPs and parameters compared to previous methods, while our 9-stage DPU significantly outperforms other unfolding methods with lower computational requirements. Code: https://github.com/ZhangJC-2k/DPU
- Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising
Item type: Conference Paper
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Zeng, Haijin; Cao, Jiezhang; Zhang, Kai; et al. (2024)
Hyperspectral images (HSIs) have extensive applications in fields such as medicine, agriculture, and industry. Nevertheless, acquiring high signal-to-noise ratio HSIs poses a challenge due to narrow-band spectral filtering. Consequently, HSI denoising is of substantial importance, especially for snapshot hyperspectral imaging technology. While most previous HSI denoising methods are supervised, creating supervised training datasets for the diverse scenes, hyperspectral cameras, and scan parameters is impractical. In this work, we present Diff-Unmix, a self-supervised denoising method for HSIs using diffusion denoising generative models. Specifically, Diff-Unmix addresses the challenge of recovering noise-degraded HSIs through a fusion of spectral unmixing and conditional abundance generation. First, it employs a learnable block-based spectral unmixing strategy, complemented by a pure transformer-based backbone. Then, we introduce a self-supervised generative diffusion network to enhance the abundance maps from the spectral unmixing block. This network reconstructs noise-free unmixing probability distributions, effectively mitigating noise-induced degradations within these components. Finally, the denoised HSI is reconstructed by blending the diffusion-adjusted abundance map with the spectral endmembers. Experimental results on both simulated and real-world noisy datasets show that Diff-Unmix achieves state-of-the-art performance.
- Towards Lightweight Super-Resolution With Dual Regression Learning
Item type: Journal Article
IEEE Transactions on Pattern Analysis and Machine Intelligence
Guo, Yong; Tan, Mingkui; Deng, Zeshuai; et al. (2024)
Deep neural networks have exhibited remarkable performance in image super-resolution (SR) tasks by learning a mapping from low-resolution (LR) images to high-resolution (HR) images. However, the SR problem is typically ill-posed, and existing methods come with several limitations. First, the possible mapping space of SR can be extremely large, since many different HR images can be super-resolved from the same LR image. As a result, it is hard to directly learn a promising SR mapping from such a large space. Second, it is often inevitable to develop very large models with extremely high computational cost to yield promising SR performance. In practice, one can use model compression techniques to obtain compact models by reducing model redundancy. Nevertheless, it is hard for existing model compression methods to accurately identify the redundant components due to the extremely large SR mapping space. To alleviate the first challenge, we propose a dual regression learning scheme to reduce the space of possible SR mappings. Specifically, in addition to the mapping from LR to HR images, we learn an additional dual regression mapping to estimate the downsampling kernel and reconstruct LR images. In this way, the dual mapping acts as a constraint that reduces the space of possible mappings. To address the second challenge, we propose a dual regression compression (DRC) method to reduce model redundancy at both the layer level and the channel level based on channel pruning. Specifically, we first develop a channel number search method that minimizes the dual regression loss to determine the redundancy of each layer. Given the searched channel numbers, we further exploit the dual regression manner to evaluate the importance of channels and prune the redundant ones. Extensive experiments show the effectiveness of our method in obtaining accurate and efficient SR models.
- LocalViT: Analyzing Locality in Vision Transformers
Item type: Conference Paper
2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Li, Yawei; Zhang, Kai; Cao, Jiezhang; et al. (2023)
The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between token embeddings can be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated through carefully designed controlled experiments. We add locality to vision transformers by modifying the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to vision transformers with different architecture designs, which shows the generality of the locality concept. For ImageNet2012 classification, the locality-enhanced transformers outperform the baselines Swin-T [1], DeiT-T [2] and PVT-T [3] by 1.0%, 2.6% and 3.1%, respectively, with a negligible increase in the number of parameters and computational effort. Code is available at https://github.com/ofsoundof/LocalViT.
- Degradation-Noise-Aware Deep Unfolding Transformer for Hyperspectral Image Denoising
Item type: Journal Article
IEEE Transactions on Geoscience and Remote Sensing
Zeng, Haijin; Feng, Kai; Zhao, Xudong; et al. (2025)
Hyperspectral images (HSIs) play a pivotal role in fields such as medical diagnosis and agriculture. However, they often contend with significant noise stemming from narrow-band spectral filtering. Existing denoising techniques have their limitations: model-driven methods rely on manual priors and hyperparameters, while learning-based methods struggle to discern intrinsic noise patterns, as they require paired images with specific example noise for training and fail to capture critical noise distribution information, leading to non-robust denoising results. This work addresses the issue by presenting a degradation-noise-aware unfolding network (DNA-Net). Instead of training directly with simulated noise, DNA-Net initially models general sparse and Gaussian noise through statistical distributions. It then explicitly represents image priors with a customized spectral transformer. The model is subsequently unfolded into an end-to-end (E2E) network, with hyperparameters adaptively estimated from the noisy HSI and degradation models, effectively regulating each iteration. Furthermore, a novel U-shaped local-nonlocal-spectral transformer (U-LNSA) is introduced, simultaneously capturing spectral correlations, local features, and nonlocal dependencies. The integration of U-LNSA into DNA-Net establishes the first Transformer-based deep unfolding method for HSI denoising. Experimental results on synthetic and real noise validate DNA-Net's superior performance over state-of-the-art (SOTA) methods. Moreover, DNA-Net, trained exclusively on mixed Gaussian and impulse noise, generalizes to unseen noise present in real images. Code and models will be released at: https://github.com/NavyZeng/DNA-Net.
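Several entries above (DPU, DNA-Net) build on deep unfolding, where an iterative optimization algorithm is unrolled into network stages that alternate a data-fidelity step with a prior (denoising) step. The toy sketch below illustrates that skeleton only; it is not the code of any paper listed here, and the soft-thresholding "prior" merely stands in for the learned denoisers those networks use.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 norm; a stand-in for a learned denoiser."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unfolded_reconstruction(A, y, n_stages=300, tau=0.01):
    """Unrolled ISTA-style stages: gradient step on ||Ax - y||^2 + prior step."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # step size from the spectral norm
    x = A.T @ y                             # initialize from the adjoint
    for _ in range(n_stages):
        x = x - step * A.T @ (A @ x - y)    # data-fidelity (gradient) step
        x = soft_threshold(x, step * tau)   # prior (denoising) step
    return x

# Toy compressive measurement of a sparse signal.
rng = np.random.default_rng(0)
A = rng.standard_normal((32, 64)) / np.sqrt(32)
x_true = np.zeros(64)
x_true[[3, 17, 40]] = [1.0, -0.5, 0.8]
y = A @ x_true
x_hat = unfolded_reconstruction(A, y)
```

In a trained unfolding network such as DNA-Net or DPU, the step sizes and the prior step are learnable modules and the stage count is small (e.g., 5 or 9), but the alternation is the same.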
- Deep Beyond Pixels: Enhancing Super-Resolution via Deep Learning
Item type: Doctoral Thesis
Cao, Jiezhang (2024)
Super-resolution (SR) aims at restoring high-resolution (HR) images or videos from low-resolution (LR) counterparts. Recently, the rise of deep learning has significantly advanced SR, enabling impressive real-world applications through deep neural networks. Despite tremendous progress, SR faces critical challenges, including integrating cross-information, scaling effectively across image resolutions, handling multiple degradations simultaneously, and extending across dimensions. These challenges manifest in practical issues such as limited information for restoration, lack of generalization to out-of-scale images, difficulty in addressing multiple kinds of degradation, and limited real-world applicability in video SR. To address these challenges, this dissertation proposes the following SR methods. First, for the cross-information challenge, we propose a deformable attention Transformer, namely DATSR, to exploit more information from reference images. The method consists of a texture feature encoder (TFE) module, a reference-based deformable attention (RDA) module and a residual feature aggregation (RFA) module. Specifically, TFE first extracts features for the LR and Ref images that are insensitive to image transformations (e.g., brightness, contrast and hue), then RDA exploits multiple relevant textures to compensate LR features with more information, and finally RFA aggregates the LR features and relevant textures to produce a more visually pleasing result. Extensive experiments demonstrate that more information helps improve SR performance, and our DATSR achieves state-of-the-art performance on benchmark datasets. Second, we propose a continuous implicit attention-in-attention network for SR, called CiaoSR, to address the cross-scale challenge. Specifically, we explicitly design an implicit attention network to learn the ensemble weights for nearby local features. Furthermore, we embed scale-aware attention in this network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate that CiaoSR achieves state-of-the-art performance on the arbitrary-scale SR task. More importantly, CiaoSR can be flexibly integrated into any backbone to improve cross-scale performance. Third, to tackle the cross-degradation challenge, we propose a diffusion model-based image restoration (IR) method built on a deep equilibrium fixed point system, called DeqIR. Specifically, we first formulate several IR tasks as linear inverse problems. Existing diffusion methods solve these inverse problems using long sequential sampling chains, resulting in expensive sampling time and high computation costs. To address this, we derive an analytical solution by modeling the entire sampling chain as a joint multivariate fixed point system. Based on the analytical solution, we can conduct parallel sampling and restore high-quality images without training. Extensive experiments demonstrate that our method generalizes well to different degradations in typical IR tasks and real-world settings. Lastly, for the cross-dimension challenge, we extend the image SR method to a cross-dimension application, i.e., a practical space-time video SR task. We propose a new method that leverages both model-based and learning-based approaches. Specifically, we first formulate this task as a joint video deblurring, frame interpolation, and super-resolution problem, and solve it as two sub-problems in an alternating way. For the first sub-problem, we derive an interpretable analytical solution and formulate it as a Fourier data transform layer. Then, we propose a recurrent video enhancement layer for the second sub-problem to recover high-frequency details. Extensive experiments demonstrate that our method applies successfully to the practical space-time video SR task and achieves superior performance. All in all, this dissertation contributes to image and video SR, achieving state-of-the-art performance on benchmark datasets. We believe that our proposed SR methods have broad applications, including entertainment (e.g., restoration of old films and photos), smartphones, digital cameras, medical imaging, video conferencing, and video games.
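The arbitrary-scale idea behind CiaoSR, rendering an output of any resolution by querying a feature map at continuous coordinates and blending nearby latent features, can be sketched as below. This is an illustration only: CiaoSR learns the ensemble weights with an implicit attention network, whereas this sketch uses fixed bilinear area weights.

```python
import numpy as np

def query_feature(feat, yx):
    """Query an H x W x C feature map at a continuous (y, x) coordinate by
    blending the four nearest latent features with area-based weights."""
    H, W, _ = feat.shape
    y = np.clip(yx[0], 0, H - 1)
    x = np.clip(yx[1], 0, W - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def upsample(feat, scale):
    """Render an output at an arbitrary scale by querying every output pixel's
    continuous location in the input feature grid."""
    H, W, C = feat.shape
    out = np.empty((int(H * scale), int(W * scale), C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = query_feature(feat, (i / scale, j / scale))
    return out

feat = np.arange(12, dtype=float).reshape(2, 2, 3)
out = upsample(feat, 1.5)  # a non-integer scale works the same way
```

Because the query takes any real-valued coordinate, one trained model can serve every scale factor; an attention network over the local features replaces the fixed weights in the actual method.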
- Deep Equilibrium Diffusion Restoration with Parallel Sampling
Item type: Conference Paper
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Cao, Jiezhang; Shi, Yue; Zhang, Kai; et al. (2024)
Diffusion model-based image restoration (IR) aims to use diffusion models to recover high-quality (HQ) images from degraded images, achieving promising performance. Due to the inherent properties of diffusion models, most existing methods need long serial sampling chains to restore HQ images step by step, resulting in expensive sampling time and high computation costs. Moreover, such long sampling chains hinder understanding of the relationship between inputs and restoration results, since it is hard to compute gradients through the whole chain. In this work, we rethink diffusion model-based IR models from a different perspective, i.e., as a deep equilibrium (DEQ) fixed point system, called DeqIR. Specifically, we derive an analytical solution by modeling the entire sampling chain in these IR models as a joint multivariate fixed point system. Based on the analytical solution, we can conduct parallel sampling and restore HQ images without training. Furthermore, we compute fast gradients via DEQ inversion and find that initialization optimization can boost image quality and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our method on typical IR tasks and real-world settings.
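The core trick of viewing a serial sampling chain as one joint fixed point system can be shown on a toy chain. The sketch below is a hypothetical illustration, not DeqIR's code: a scalar "denoising" transition replaces the diffusion update, and a plain Jacobi-style iteration replaces DEQ solvers, but the structure, updating all chain states simultaneously until they stop changing, is the same.

```python
import numpy as np

# Serial chain: x_{t-1} = f(t, x_t), starting from x_T. Stacking all states
# z = (x_{T-1}, ..., x_0) turns the chain into one system z = F(z), which a
# fixed-point iteration can solve with every state updated in parallel.

T = 6

def f(t, x):
    """One toy transition step (a contraction, so the iteration converges)."""
    return 0.5 * x + 0.1 * t

x_T = np.array([4.0])

# Serial sampling: T sequential steps, each waiting for the previous one.
x = x_T.copy()
serial = []
for t in range(T, 0, -1):
    x = f(t, x)
    serial.append(x.copy())
serial_final = serial[-1]

# Parallel sampling: iterate the joint update z <- F(z); each sweep updates
# all T states simultaneously from the previous sweep's values.
z = np.zeros((T, 1))  # z[i] approximates the state after step t = T - i
for _ in range(100):
    z_prev = np.concatenate([x_T[None], z[:-1]])  # input to each step
    z = np.stack([f(T - i, z_prev[i]) for i in range(T)])
parallel_final = z[-1]
```

After at most T sweeps the joint iteration reproduces the serial result exactly here, and each sweep is embarrassingly parallel across timesteps; DeqIR additionally exploits the DEQ formulation to backpropagate through the equilibrium without storing the whole chain.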