This repository contains the code and resources for a project on single-image text deblurring using Conditional Flow Matching (CFM). The goal is to restore clarity to blurred text images, a crucial task for applications like document analysis, Optical Character Recognition (OCR), and image-based search.
Conditional Flow Matching (CFM) [1] is a generative modeling technique that offers an alternative framework for training Continuous Normalizing Flows (CNFs) without the need to simulate full probability paths. CNFs learn an invertible mapping from a simple prior distribution (e.g., Gaussian) to a complex target data distribution, defined by an Ordinary Differential Equation (ODE):

$$\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad x_0 \sim p_0.$$
The core idea of CFM is to reformulate the standard Flow Matching (FM) objective, which minimizes the difference between a learnable vector field $v_\theta$ and a target field $u_t$, in terms of conditional probability paths $p_t(x \mid x_1)$ whose conditional vector fields are tractable. The CFM objective is:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, q(x_1),\, p_t(x_t \mid x_1)} \left[ \| v_\theta(x_t, t) - u_t(x_t \mid x_1) \|^2 \right].$$
This project specifically adopts Independent Conditional Flow Matching (I-CFM), a variant in which the conditional probability path is the Gaussian

$$p_t(x \mid x_0, x_1) = \mathcal{N}\left(x \mid \mu_t, \sigma^2 I\right),$$

with mean $\mu_t = (1-t)\,x_0 + t\,x_1$ and a small constant standard deviation $\sigma$. This results in a straight-line interpolation between samples, which contributes to stable and efficient training. I-CFM allows for simulation-free training of CNFs, leading to benefits such as faster convergence, reduced complexity, and strong performance in generative modeling.
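The straight-line interpolation above can be illustrated with a minimal NumPy sketch (the project's actual implementation lives in `src/deblur/cfm/cfm.py` and uses PyTorch; the function name here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_icfm_path(x0, x1, t, sigma=0.01, rng=rng):
    """Straight-line I-CFM interpolation between a noise sample x0 and a
    data sample x1 at time t, with a small Gaussian jitter of scale sigma."""
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * x1 + sigma * eps  # intermediate point
    u_t = x1 - x0                                # target vector field
    return x_t, u_t

# Toy 4x4 "images" standing in for a noise sample and a sharp image.
x0 = rng.standard_normal((4, 4))
x1 = rng.standard_normal((4, 4))
x_t, u_t = sample_icfm_path(x0, x1, t=0.5)
```

Note that the target field $u_t = x_1 - x_0$ is constant in $t$, a direct consequence of the straight-line path.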
The model, a U-Net architecture [2] in this project, is trained to predict the time-dependent vector field $u_t(x_t \mid x_0, x_1)$. Training proceeds as follows:
- **Path Sampling:** For each sharp image $x_1$ and its corresponding blurred version $y$:
  - Sample a noise image $x_0 \sim \mathcal{N}(0, I)$.
  - Sample a time $t \sim U[0, 1]$.
  - Compute the intermediate point $x_t = (1-t)\,x_0 + t\,x_1 + \sigma \epsilon'$, where $\epsilon' \sim \mathcal{N}(0, I)$ and $\sigma = 0.01$.
  - Define the target vector field $u_t(x_t \mid x_0, x_1) = x_1 - x_0$.
- **Vector Field Prediction:** The U-Net model $v_\theta$ takes $x_t$, the time $t$ (as a continuous embedding), and the conditioning blurred image $y$ (concatenated along the channel dimension) as input, and outputs the predicted vector field $v_\theta(x_t, t, y)$.
- **Loss Computation:** The objective is the mean squared error between the predicted and target vector fields:

  $$\mathcal{L}_{\text{I-CFM}} = \mathbb{E}_{t, x_0, x_1, y}\left[\| v_\theta(x_t, t, y) - (x_1 - x_0) \|^2\right]$$

- **Optimization:** Model parameters are updated using the Adam optimizer.
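The loss computation above can be sketched in a few lines of NumPy (a minimal illustration, not the project's training loop, which lives in `src/deblur/train_cfm.py`; `v_pred` stands in for the U-Net output $v_\theta(x_t, t, y)$):

```python
import numpy as np

rng = np.random.default_rng(0)

def icfm_loss(v_pred, x0, x1):
    """Mean squared error between the predicted vector field and the
    I-CFM target u_t = x1 - x0, averaged over batch and pixels."""
    target = x1 - x0
    return float(np.mean((v_pred - target) ** 2))

# Toy batch of 2 single-channel 4x4 images.
x0 = rng.standard_normal((2, 1, 4, 4))  # noise samples
x1 = rng.standard_normal((2, 1, 4, 4))  # sharp images
v_pred = np.zeros_like(x0)              # stand-in for the model's prediction

loss = icfm_loss(v_pred, x0, x1)
```

A perfect predictor, one that outputs exactly $x_1 - x_0$, drives this loss to zero.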
Once the U-Net $v_\theta$ is trained, a deblurred image is generated from a blurred input $y$ as follows:

- **Initialization:** Sample an initial noise image $x_0 \sim \mathcal{N}(0, I)$.
- **ODE Integration:** The sharp image $x_1$ is obtained by solving the initial value problem $\frac{dx_t}{dt} = v_\theta(x_t, t, y)$ with $x(0) = x_0$, integrating over $t \in [0, 1]$. This project uses the Dormand-Prince (DOPRI5) Runge-Kutta solver with adaptive step size [3].
- **Final Output:** The solution at $t = 1$, denoted $x_1$, is the generated deblurred image.
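The sampling procedure above can be sketched with a fixed-step Euler integrator in NumPy (a simplification for illustration only; the project itself uses the adaptive DOPRI5 solver, and `v_field` stands in for the trained network):

```python
import numpy as np

def integrate_ode(v_field, x0, n_steps=100):
    """Integrate dx/dt = v_field(x, t) from t=0 to t=1 with fixed-step
    Euler. A stand-in for the adaptive DOPRI5 solver used in the project."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# With a perfectly learned I-CFM field, v(x, t) = x1 - x0 is constant in t,
# so integrating from x0 recovers x1 (up to floating-point error).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
x1 = rng.standard_normal((4, 4))
x_hat = integrate_ode(lambda x, t: x1 - x0, x0)
```

The constant-field case makes the straight-line structure of I-CFM concrete: the ODE trajectory is literally the line from $x_0$ to $x_1$, which is why even coarse solvers behave well on these paths.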
The key files related to the Conditional Flow Matching implementation are:
- `src/deblur/cfm/cfm.py`: Core implementation of the `ConditionalFlowMatcher` class, including methods for computing probability paths, conditional vector fields, and sampling.
- `src/deblur/train_cfm.py`: Script for training the CFM model for image deblurring. It handles data loading, model initialization, the training loop, validation, and checkpointing.
- `src/deblur/train_cfm_dist.py`: Script for distributed training of the CFM model using DDP (Distributed Data Parallel).
- `src/deblur/test_cfm.py`: Script for evaluating a trained CFM model. It loads a checkpoint, performs inference on a test set, and computes evaluation metrics.
- `src/models/unet.py`: Defines the U-Net architecture used as the backbone for the CFM model.
- `src/deblur/datasets/deblur.py`: Contains the `DeblurDataset` class for loading paired blurred and sharp images, and a `make_deblur_splits` function for creating training and validation dataset splits.
- **Dataset:** Prepare your dataset of blurred and sharp image pairs. The `DeblurDataset` class in `src/deblur/datasets/deblur.py` expects corresponding blurred and sharp images to have filenames that allow pairing (e.g., `image_001_blurred.png` and `image_001_sharp.png`). The dataset generation process in this project was inspired by various sources for text ([4]), fonts ([5]), and textures ([6]).
- **Training:**
  - For single-GPU training, use `src/deblur/train_cfm.py`.
  - For distributed training, use `src/deblur/train_cfm_dist.py`.
  - Adjust parameters in the script's argument parser as needed, such as data paths, batch size, learning rate, model architecture, and CFM-specific parameters (e.g., `cfm_sigma`).
- **Evaluation:**
  - Use `src/deblur/test_cfm.py` to evaluate a trained model checkpoint.
  - Provide paths to the checkpoint and to the blurred and sharp image directories, and configure the ODE solver parameters.
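The filename pairing convention described under **Dataset** can be sketched as follows (a hypothetical helper for illustration; the project's `DeblurDataset` implements its own pairing logic):

```python
import tempfile
from pathlib import Path

def make_pairs(data_dir):
    """Pair *_blurred.png files with the *_sharp.png counterparts that
    share their filename stem (e.g. image_001). Orphans are skipped."""
    data_dir = Path(data_dir)
    sharp = {p.name[:-len("_sharp.png")]: p
             for p in data_dir.glob("*_sharp.png")}
    return [(b, sharp[b.name[:-len("_blurred.png")]])
            for b in sorted(data_dir.glob("*_blurred.png"))
            if b.name[:-len("_blurred.png")] in sharp]

# Demo on a throwaway directory: one complete pair and one orphan.
root = Path(tempfile.mkdtemp())
for name in ("image_001_blurred.png", "image_001_sharp.png",
             "image_002_blurred.png"):
    (root / name).touch()
pairs = make_pairs(root)
```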
The CFM model demonstrated strong performance in the text deblurring task. It achieved a PSNR of 25.93 dB and an SSIM of 0.72 on the test set. Notably, the CFM model converged faster during training (reaching its best PSNR in 14 epochs compared to 23 for a DDIM [7] baseline) and offered faster sampling times (20 seconds per image vs. 25 seconds for DDIM on an NVIDIA RTX 4060 Laptop GPU). These results highlight the efficiency and effectiveness of CFM for this image restoration task.
Qualitatively, the CFM model successfully reconstructed text with high fidelity, preserving font and letter structure even in challenging cases with strong blur and textured backgrounds.
For more details, please refer to the full Report.pdf.
[1] Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., & Bengio, Y. (2024). Improving and generalizing flow-based generative models with minibatch optimal transport. TMLR.
[2] Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. NeurIPS, 34, 8780-8794.
[3] Dormand, J. R., & Prince, P. J. (1980). A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1), 19-26.
[4] Project Gutenberg Literary Archive Foundation. (accessed 2025-05-09). www.gutenberg.org.
[5] Google Fonts. (accessed 2025-05-09). github.com/google/fonts.
[6] TextureLabs. (accessed 2025-05-09). texturelabs.org.
[7] Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. ICLR.
