WarpI2I: Image Warping for Image-to-Image Translation

Technical Contributions

  • We introduce a saliency-guided image warping framework that enlarges salient regions to better preserve fine details under latent compression.
  • Our approach is model-agnostic, requiring no architectural modification, and is compatible with existing latent I2I frameworks.
  • Our approach is efficient, introducing negligible computational overhead without additional learnable parameters.
  • We further introduce a synthetic data generation pipeline to produce high-quality paired data for human and driving scene relighting.

Image-to-image translation transforms an input image into another with modified appearance while preserving scene content. In paired translation, we show human relighting from neutral light to moonlight; paired training yields better results but relies on hard-to-obtain real-world paired data. In unpaired translation, we show driving-scene translation from night to day; unpaired training is typically less stable and yields lower-quality results.

Existing low-capacity image-to-image translation models often struggle to preserve fine details because they compress high-resolution input images into a small, heavily downsampled latent space.
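
To make the compression bottleneck concrete, here is a minimal illustration (assuming simple 8x average-pooling as a stand-in for the actual learned encoder, which this sketch does not model): detail finer than the latent grid is averaged away and cannot be recovered by upsampling.

```python
import numpy as np

# Illustrative only: approximate an 8x latent compression with 8x
# average-pooling. Detail finer than the latent grid is averaged away.
f = 8
x = np.zeros((64, 64))
x[:, ::2] = 1.0  # one-pixel-wide vertical stripes (fine detail)

# 8x downsample by block averaging, then naive 8x nearest upsample.
latent = x.reshape(64 // f, f, 64 // f, f).mean(axis=(1, 3))
recon = np.kron(latent, np.ones((f, f)))

# The stripes collapse to a flat 0.5: the fine structure is gone.
print(float(x.std()), float(recon.std()))  # 0.5 0.0
```

This is why enlarging a salient region before encoding helps: the region then occupies more cells of the latent grid, so its fine structure survives the pooling.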

Our saliency-guided framework enlarges salient regions via image warping to better preserve fine details under extreme latent compression (e.g., 8x).
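
A minimal sketch of one way such a warp can be built (a simplified separable inverse-CDF resampling, not the paper's exact formulation): rows and columns with high saliency receive proportionally more output pixels, i.e. they are magnified.

```python
import numpy as np

def saliency_warp(image, saliency, strength=0.8):
    """Separable saliency-guided warp: rows/columns with high saliency
    receive more output pixels (magnification). Illustrative sketch only."""
    H, W = saliency.shape

    def axis_source(marginal, n):
        # Blend saliency with a uniform density so every input pixel
        # keeps nonzero output area and the warp stays invertible.
        density = (1.0 - strength) / n + strength * marginal / max(marginal.sum(), 1e-8)
        cdf = np.cumsum(density)
        cdf /= cdf[-1]
        # Inverse-CDF lookup: where the CDF is steep (high saliency),
        # many uniform output coordinates map to the same input span,
        # so that span is enlarged in the output.
        u = (np.arange(n) + 0.5) / n
        return np.interp(u, cdf, np.arange(n))

    src_y = axis_source(saliency.sum(axis=1), H)
    src_x = axis_source(saliency.sum(axis=0), W)
    yy = np.clip(np.round(src_y).astype(int), 0, H - 1)
    xx = np.clip(np.round(src_x).astype(int), 0, W - 1)
    return image[np.ix_(yy, xx)]  # nearest-neighbor resampling for brevity
```

Because the per-axis mapping is monotonic, the warp can be inverted after translation to restore the original layout; the sketch uses nearest-neighbor sampling where a real implementation would interpolate bilinearly.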

Without warping, the model struggles to maintain fine structures such as facial features. By enlarging important regions before translation, warping helps the model generate more accurate and consistent results.

We design a synthetic data pipeline to generate training pairs for our model. Starting from an original image, we first apply FLUX outpainting to create base and relit scenes. We then estimate depth and perform depth-conditioned generation. Finally, ChatGPT verifies the pair before adding it to the training set.
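
The steps above can be sketched as an orchestration skeleton. Every function name below is a hypothetical stub standing in for the real component (FLUX outpainting, a monocular depth estimator, a depth-conditioned generator, and a ChatGPT-based verifier); this shows only one plausible wiring of the pipeline, not its actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingPair:
    base: str   # base-lighting image (placeholder type for illustration)
    relit: str  # relit image (placeholder type for illustration)

def flux_outpaint(image: str, prompt: str) -> str:
    return f"outpaint({image}, {prompt})"            # stub for FLUX outpainting

def estimate_depth(image: str) -> str:
    return f"depth({image})"                         # stub for a depth model

def depth_conditioned_generate(depth: str, prompt: str) -> str:
    return f"gen({depth}, {prompt})"                 # stub for generation

def verify_pair(base: str, relit: str) -> bool:
    return True                                      # stub for ChatGPT check

def build_pair(image: str, relight_prompt: str) -> Optional[TrainingPair]:
    base = flux_outpaint(image, "base scene")
    depth = estimate_depth(base)                     # shared geometry for the pair
    relit = depth_conditioned_generate(depth, relight_prompt)
    return TrainingPair(base, relit) if verify_pair(base, relit) else None
```

Conditioning the relit image on depth from the base scene keeps the pair geometrically aligned, which is what makes it usable as paired supervision.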

Synthetic training pairs generated by our pipeline provide diverse and high-quality supervision for paired training.

Prior methods often distort facial and clothing identity or produce unrealistic lighting that does not match the prompt. In contrast, our method with warping preserves facial details and generates more realistic relighting results.

Compared with prior methods, our approach with warping better preserves facial details and produces more realistic lighting.

Previous methods often change the shape or color of traffic arrows and hallucinate details in the sky and traffic signs. In contrast, our method with warping preserves these structures and produces more realistic results.

Compared with prior methods, our approach with warping better preserves scene structures and produces more consistent fog effects.

Compared with the no-warp baseline, our method with warping produces more realistic lighting and better preserves scene structures.