Recent advances in text-driven 3D scene editing and stylization, which leverage the powerful capabilities of 2D generative models, have demonstrated promising results. However, challenges remain in ensuring high-quality stylization and view consistency simultaneously. Moreover, applying style consistently to different regions or objects of the scene with semantic correspondence remains challenging. To address these limitations, we introduce techniques that enhance the quality of 3D stylization while maintaining view consistency and providing optional region-controlled style transfer. Our method achieves stylization by re-training an initial 3D representation using stylized multi-view 2D images of the source views. Therefore, ensuring both style consistency and view consistency of the stylized multi-view images is crucial. We achieve this by extending the style-aligned depth-conditioned view generation framework, replacing the fully shared attention mechanism with a single reference-based attention-sharing mechanism, which effectively aligns style across different viewpoints. Additionally, inspired by recent 3D inpainting methods, we utilize a grid of multiple depth maps as a single-image reference to further strengthen view consistency among stylized images. Finally, we propose a Multi-Region Importance-Weighted Sliced Wasserstein Distance Loss, allowing styles to be applied to distinct image regions using segmentation masks from off-the-shelf models. We demonstrate that this optional feature enhances the faithfulness of style transfer and enables the mixing of different styles across distinct regions of the scene. Experimental evaluations, both qualitative and quantitative, demonstrate that our pipeline effectively improves the results of text-driven 3D stylization.
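To make the region-controlled loss concrete, the following is a minimal sketch of a multi-region, importance-weighted sliced Wasserstein loss. It is not the paper's implementation: the feature extractor, mask source, and resampling strategy are assumptions, and the function names (`sliced_wasserstein`, `multi_region_swd`) are hypothetical placeholders. The sliced Wasserstein distance itself follows the standard recipe: project feature vectors onto random directions, sort the 1D projections, and compare them.

```python
import numpy as np

def sliced_wasserstein(feats_a, feats_b, n_proj=64, rng=None):
    """Sliced Wasserstein distance between two (N, C) feature sets.

    Projects both sets onto n_proj random unit directions, sorts each
    1D projection, and averages the squared differences. Assumes both
    sets contain the same number of samples N.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    dirs = rng.standard_normal((feats_a.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit directions
    proj_a = np.sort(feats_a @ dirs, axis=0)
    proj_b = np.sort(feats_b @ dirs, axis=0)
    return float(np.mean((proj_a - proj_b) ** 2))

def multi_region_swd(feats, style_feats, masks, weights, rng=None):
    """Importance-weighted sum of per-region sliced Wasserstein losses.

    feats:       (N, C) features of the rendered image (e.g. from a VGG
                 or diffusion encoder -- an assumption of this sketch)
    style_feats: list of (M_i, C) reference style features, one per region
    masks:       list of boolean (N,) segmentation masks, one per region
    weights:     list of per-region importance weights
    """
    rng = np.random.default_rng(0) if rng is None else rng
    total, wsum = 0.0, 0.0
    for mask, sf, w in zip(masks, style_feats, weights):
        n = int(mask.sum())
        if n == 0:
            continue
        # resample style features so sample counts match (sketch choice)
        idx = rng.integers(0, sf.shape[0], size=n)
        total += w * sliced_wasserstein(feats[mask], sf[idx], rng=rng)
        wsum += w
    return total / max(wsum, 1e-8)
```

Because each region's loss is computed only over its own mask and weighted separately, different style references can be assigned to different regions, which is what enables the style-mixing behavior described above.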
Our method consists of two main stages: (1) a training stage, in which we re-train the initial 3D scene representation on stylized multi-view images of the source views while ensuring both style consistency and view consistency, and (2) an inference stage, in which the trained representation is used to render stylized images from novel viewpoints.
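The training stage can be illustrated with a deliberately tiny, self-contained sketch: a "scene representation" reduced to a shared parameter image, fitted by gradient descent so that its per-view renderings match the stylized multi-view targets. The `render` function and the parameterization are hypothetical stand-ins, not the actual 3D representation or renderer used in the paper.

```python
import numpy as np

def render(params, view):
    # hypothetical toy renderer: a per-view brightness shift of the
    # shared parameters (a real pipeline would render a 3D scene here)
    return params + 0.1 * view

def retrain(targets, n_steps=200, lr=0.5):
    """Fit shared parameters to stylized multi-view targets.

    Minimizes the mean squared error between each view's rendering and
    its stylized target image by plain gradient descent.
    """
    params = np.zeros_like(targets[0])
    for _ in range(n_steps):
        grad = np.zeros_like(params)
        for v, tgt in enumerate(targets):
            # gradient of mean over views of ||render(params, v) - tgt||^2
            grad += 2.0 * (render(params, v) - tgt) / len(targets)
        params -= lr * grad
    return params
```

The point of the sketch is only the structure of the stage: stylized multi-view images act as fixed supervision targets, and the scene representation is optimized until its renderings agree with them, after which novel views can be rendered at inference time.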
Our method can apply 3D style transfer to different semantic regions based on segmentation masks, enabling spatially controlled stylization. In the example below, we apply three different styles to distinct regions while leaving the background unchanged, demonstrating the ability to selectively transfer style while preserving parts of the original scene.
We evaluate our method on various scenes and styles, demonstrating its effectiveness in generating high-quality stylized images with view consistency.
@inproceedings{fujiwara2025improved,
title = {Improved 3D Scene Stylization via Text-Guided Generative Image Editing with Region-Based Control},
author = {Haruo Fujiwara and Yusuke Mukuta and Tatsuya Harada},
booktitle = {Pacific Graphics 2025 Conference Papers},
year = {2025}
}