WarpI2I: Image Warping for Image-to-Image Translation

Technical Contributions

  • We introduce a saliency-guided image warping framework that enlarges salient regions to better preserve fine details under latent compression.
  • Our approach is model-agnostic, requiring no architectural modification, and is compatible with existing latent I2I frameworks.
  • Our approach is efficient, introducing negligible computational overhead without additional learnable parameters.
  • We further introduce a synthetic data generation pipeline to produce high-quality paired data for human and driving scene relighting.

Image-to-image translation transforms an input image into another with modified appearance while preserving scene content. In paired translation, we show human relighting from neutral light to moonlight; paired training yields better results but relies on hard-to-obtain real-world paired data. In unpaired translation, we show driving-scene translation from night to day; unpaired training is typically less stable and yields lower-quality results.

Existing low-capacity image-to-image translation models often struggle to preserve fine details because they compress high-resolution input images into a small, heavily downsampled latent space.
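
To make the compression bottleneck concrete, here is a minimal illustration (assuming simple 8x average-pooling as a stand-in for the actual learned encoder, which this sketch does not model): detail finer than the latent grid is averaged away and cannot be recovered by upsampling.

```python
import numpy as np

# Illustrative only: approximate an 8x latent compression with 8x
# average-pooling. Detail finer than the latent grid is averaged away.
f = 8
x = np.zeros((64, 64))
x[:, ::2] = 1.0  # one-pixel-wide vertical stripes (fine detail)

# 8x downsample by block averaging, then naive 8x nearest upsample.
latent = x.reshape(64 // f, f, 64 // f, f).mean(axis=(1, 3))
recon = np.kron(latent, np.ones((f, f)))

# The stripes collapse to a flat 0.5: the fine structure is gone.
print(float(x.std()), float(recon.std()))  # 0.5 0.0
```

This is why enlarging a salient region before encoding helps: the region then occupies more cells of the latent grid, so its fine structure survives the pooling.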

Our saliency-guided framework enlarges salient regions via image warping to better preserve fine details under extreme latent compression (e.g., 8x).
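
A minimal sketch of one way such a warp can be built (a simplified separable inverse-CDF resampling, not the paper's exact formulation): rows and columns with high saliency receive proportionally more output pixels, i.e. they are magnified.

```python
import numpy as np

def saliency_warp(image, saliency, strength=0.8):
    """Separable saliency-guided warp: rows/columns with high saliency
    receive more output pixels (magnification). Illustrative sketch only."""
    H, W = saliency.shape

    def axis_source(marginal, n):
        # Blend saliency with a uniform density so every input pixel
        # keeps nonzero output area and the warp stays invertible.
        density = (1.0 - strength) / n + strength * marginal / max(marginal.sum(), 1e-8)
        cdf = np.cumsum(density)
        cdf /= cdf[-1]
        # Inverse-CDF lookup: where the CDF is steep (high saliency),
        # many uniform output coordinates map to the same input span,
        # so that span is enlarged in the output.
        u = (np.arange(n) + 0.5) / n
        return np.interp(u, cdf, np.arange(n))

    src_y = axis_source(saliency.sum(axis=1), H)
    src_x = axis_source(saliency.sum(axis=0), W)
    yy = np.clip(np.round(src_y).astype(int), 0, H - 1)
    xx = np.clip(np.round(src_x).astype(int), 0, W - 1)
    return image[np.ix_(yy, xx)]  # nearest-neighbor resampling for brevity
```

Because the per-axis mapping is monotonic, the warp can be inverted after translation to restore the original layout; the sketch uses nearest-neighbor sampling where a real implementation would interpolate bilinearly.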

Without warping, the model struggles to maintain fine structures such as facial features. By enlarging important regions before translation, warping helps the model generate more accurate and consistent results.

We design a synthetic data pipeline to generate training pairs for our model. Starting from an original image, we first apply FLUX outpainting to create base and relit scenes. We then estimate depth and perform depth-conditioned generation. Finally, ChatGPT verifies the pair before adding it to the training set.
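
The steps above can be sketched as an orchestration skeleton. Every function name below is a hypothetical stub standing in for the real component (FLUX outpainting, a monocular depth estimator, a depth-conditioned generator, and a ChatGPT-based verifier); this shows only one plausible wiring of the pipeline, not its actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingPair:
    base: str   # base-lighting image (placeholder type for illustration)
    relit: str  # relit image (placeholder type for illustration)

def flux_outpaint(image: str, prompt: str) -> str:
    return f"outpaint({image}, {prompt})"            # stub for FLUX outpainting

def estimate_depth(image: str) -> str:
    return f"depth({image})"                         # stub for a depth model

def depth_conditioned_generate(depth: str, prompt: str) -> str:
    return f"gen({depth}, {prompt})"                 # stub for generation

def verify_pair(base: str, relit: str) -> bool:
    return True                                      # stub for ChatGPT check

def build_pair(image: str, relight_prompt: str) -> Optional[TrainingPair]:
    base = flux_outpaint(image, "base scene")
    depth = estimate_depth(base)                     # shared geometry for the pair
    relit = depth_conditioned_generate(depth, relight_prompt)
    return TrainingPair(base, relit) if verify_pair(base, relit) else None
```

Conditioning the relit image on depth from the base scene keeps the pair geometrically aligned, which is what makes it usable as paired supervision.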

Synthetic training pairs generated by our pipeline provide diverse and high-quality supervision for paired training.

Prior methods often distort facial and clothing identity or produce unrealistic lighting that does not match the prompt. In contrast, our method with warping preserves facial details and generates more realistic relighting results.

Compared with prior methods, our approach with warping better preserves facial details and produces more realistic lighting.

Previous methods often change the shape or color of traffic arrows and hallucinate details in the sky and traffic signs. In contrast, our method with warping preserves these structures and produces more realistic results.

Compared with prior methods, our approach with warping better preserves scene structures and produces more consistent fog effects.

Compared with the no-warp baseline, our method with warping produces more realistic lighting and better preserves scene structures.