15. Advanced Image Generation

Overview

This lesson covers recent image generation techniques that build on Stable Diffusion. You will learn practical methods such as SDXL, ControlNet, IP-Adapter, and Latent Consistency Models (LCM).


1. SDXL (Stable Diffusion XL)

1.1 ์•„ํ‚คํ…์ฒ˜ ๊ฐœ์„ 

┌──────────────────────────────────────────────────────────────────┐
│                    SDXL vs SD 1.5 Comparison                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SD 1.5:                                                         │
│  - UNet: 860M params                                             │
│  - Text Encoder: CLIP ViT-L/14 (77 tokens)                       │
│  - Resolution: 512×512                                           │
│  - VAE: 8× spatial downscale                                     │
│                                                                  │
│  SDXL:                                                           │
│  - UNet: 2.6B params (3× larger)                                 │
│  - Text Encoders: CLIP ViT-L + OpenCLIP ViT-bigG (dual)          │
│  - Resolution: 1024×1024                                         │
│  - VAE: improved VAE (fine-tuned)                                │
│  - Refiner model (optional)                                      │
│                                                                  │
│  Key improvements:                                               │
│  - Richer text understanding (dual encoders)                     │
│  - High-resolution generation (4× more pixels)                   │
│  - Micro-conditioning (size, crop, aspect ratio)                 │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
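
Because SDXL carries two text encoders, the diffusers pipeline can feed them different prompts. A minimal sketch (assuming the standard StableDiffusionXLPipeline call signature, where prompt is routed to CLIP ViT-L and prompt_2 to OpenCLIP ViT-bigG; omitting prompt_2 reuses prompt for both encoders):

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Different text for each encoder: subject description vs. style keywords
image = pipe(
    prompt="a photo of an astronaut riding a horse",        # CLIP ViT-L
    prompt_2="dramatic lighting, film grain, 35mm photo",   # OpenCLIP ViT-bigG
    num_inference_steps=30,
).images[0]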

1.2 Using SDXL

from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch

def sdxl_generation():
    """SDXL image generation"""

    # Load the base model
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True
    )

    # Memory optimizations; model CPU offload manages device placement,
    # so an explicit .to("cuda") is not needed here
    pipe.enable_model_cpu_offload()
    pipe.enable_vae_slicing()

    # Generate
    prompt = "A majestic lion in a savanna at sunset, photorealistic, 8k"
    negative_prompt = "blurry, low quality, distorted"

    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=30,
        guidance_scale=7.5,
        height=1024,
        width=1024,
    ).images[0]

    return image


def sdxl_with_refiner():
    """SDXL Base + Refiner pipeline"""

    # Base
    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16
    ).to("cuda")

    # Refiner
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        torch_dtype=torch.float16
    ).to("cuda")

    prompt = "A cyberpunk city at night, neon lights, rain"

    # Stage 1: Base (80% denoising)
    high_noise_frac = 0.8
    base_output = base(
        prompt=prompt,
        num_inference_steps=40,
        denoising_end=high_noise_frac,
        output_type="latent"
    ).images

    # Stage 2: Refiner (20% denoising)
    refined_image = refiner(
        prompt=prompt,
        image=base_output,
        num_inference_steps=40,
        denoising_start=high_noise_frac
    ).images[0]

    return refined_image

1.3 Micro-Conditioning

def sdxl_micro_conditioning():
    """Using SDXL micro-conditioning"""

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16
    ).to("cuda")

    prompt = "A portrait of a woman"

    # Generate at several aspect ratios
    aspect_ratios = [
        (1024, 1024),  # 1:1
        (1152, 896),   # 4:3
        (896, 1152),   # 3:4
        (1216, 832),   # ~3:2
        (832, 1216),   # ~2:3
    ]

    images = []
    for width, height in aspect_ratios:
        # Micro-conditioning: resolution hints
        image = pipe(
            prompt=prompt,
            height=height,
            width=width,
            original_size=(height, width),  # "original" training-size hint
            target_size=(height, width),    # target output size
            crops_coords_top_left=(0, 0),   # crop coordinates (top-left)
        ).images[0]
        images.append(image)

    return images

2. ControlNet

2.1 Concept

ControlNet: adding conditional control

Extra control signals are injected without modifying the original diffusion model.

Supported conditions:
- Canny edge (outlines)
- Depth map
- Pose
- Segmentation
- Normal map
- Scribble
- Line art

How it works:
1. Condition image → condition encoder
2. Encoded condition → injected into the UNet (via zero convolutions)
3. The original model weights stay frozen; only the ControlNet copy is trained

2.2 Implementation and Usage

from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    UniPCMultistepScheduler
)
from controlnet_aux import CannyDetector, OpenposeDetector
from PIL import Image
import cv2
import numpy as np
import torch

class ControlNetGenerator:
    """ControlNet-based image generation"""

    def __init__(self, base_model: str = "runwayml/stable-diffusion-v1-5"):
        self.base_model = base_model
        self.controlnets = {}
        self.detectors = {
            'canny': CannyDetector(),
            'openpose': OpenposeDetector.from_pretrained("lllyasviel/Annotators"),
        }

    def load_controlnet(self, control_type: str):
        """Load a ControlNet"""
        controlnet_models = {
            'canny': "lllyasviel/sd-controlnet-canny",
            'depth': "lllyasviel/sd-controlnet-depth",
            'openpose': "lllyasviel/sd-controlnet-openpose",
            'scribble': "lllyasviel/sd-controlnet-scribble",
            'seg': "lllyasviel/sd-controlnet-seg",
        }

        if control_type not in self.controlnets:
            self.controlnets[control_type] = ControlNetModel.from_pretrained(
                controlnet_models[control_type],
                torch_dtype=torch.float16
            )

        return self.controlnets[control_type]

    def generate_with_canny(
        self,
        image: np.ndarray,
        prompt: str,
        low_threshold: int = 100,
        high_threshold: int = 200
    ):
        """Canny edge control"""

        # Extract Canny edges and convert to a 3-channel PIL image
        canny_image = cv2.Canny(image, low_threshold, high_threshold)
        canny_image = Image.fromarray(np.stack([canny_image] * 3, axis=-1))

        # Load the ControlNet
        controlnet = self.load_controlnet('canny')

        # Pipeline
        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            self.base_model,
            controlnet=controlnet,
            torch_dtype=torch.float16
        ).to("cuda")

        pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

        # ์ƒ์„ฑ
        output = pipe(
            prompt=prompt,
            image=canny_image,
            num_inference_steps=20,
            guidance_scale=7.5,
            controlnet_conditioning_scale=1.0,  # control strength
        ).images[0]

        return output, canny_image

    def generate_with_pose(self, image: np.ndarray, prompt: str):
        """Pose control"""

        # Extract an OpenPose skeleton image
        pose_image = self.detectors['openpose'](image)

        controlnet = self.load_controlnet('openpose')

        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            self.base_model,
            controlnet=controlnet,
            torch_dtype=torch.float16
        ).to("cuda")

        output = pipe(
            prompt=prompt,
            image=pose_image,
            num_inference_steps=20,
        ).images[0]

        return output, pose_image

    def multi_controlnet(
        self,
        image: np.ndarray,
        prompt: str,
        control_types: list = ['canny', 'depth']
    ):
        """Multiple ControlNets at once"""

        # Load several ControlNets
        controlnets = [self.load_controlnet(ct) for ct in control_types]

        # Extract a condition image for each control type
        control_images = []
        for ct in control_types:
            if ct == 'canny':
                canny = cv2.Canny(image, 100, 200)
                control_images.append(np.stack([canny]*3, axis=-1))
            elif ct == 'depth':
                # Depth estimation (e.g. with MiDaS); extract_depth is a
                # placeholder helper, not implemented in this lesson
                depth = self.extract_depth(image)
                control_images.append(depth)

        # ๋‹ค์ค‘ ControlNet ํŒŒ์ดํ”„๋ผ์ธ
        pipe = StableDiffusionControlNetPipeline.from_pretrained(
            self.base_model,
            controlnet=controlnets,
            torch_dtype=torch.float16
        ).to("cuda")

        output = pipe(
            prompt=prompt,
            image=control_images,
            controlnet_conditioning_scale=[1.0, 0.5],  # per-ControlNet strength
        ).images[0]

        return output


# Usage example
generator = ControlNetGenerator()

# ์ฐธ์กฐ ์ด๋ฏธ์ง€์—์„œ ๊ตฌ๋„ ์œ ์ง€ํ•˜๋ฉฐ ์Šคํƒ€์ผ ๋ณ€๊ฒฝ
reference_image = cv2.imread("reference.jpg")
result, canny = generator.generate_with_canny(
    reference_image,
    "A beautiful anime girl, studio ghibli style"
)

3. IP-Adapter (Image Prompt Adapter)

3.1 Concept

IP-Adapter: ์ด๋ฏธ์ง€๋ฅผ ํ”„๋กฌํ”„ํŠธ๋กœ ์‚ฌ์šฉ

ํ…์ŠคํŠธ ๋Œ€์‹ /ํ•จ๊ป˜ ์ด๋ฏธ์ง€๋กœ ์Šคํƒ€์ผ/๋‚ด์šฉ ์ง€์‹œ

┌────────────────────────────────────────────────────────────┐
│                    IP-Adapter structure                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Reference image → CLIP Image Encoder → image features     │
│                         ↓                                  │
│                  Projection layer (trained)                │
│                         ↓                                  │
│              Injected into cross-attention                 │
│                         ↓                                  │
│  Text prompt + image features → UNet → generated image     │
│                                                            │
│  Uses:                                                     │
│  - Style transfer (style reference)                        │
│  - Preserving facial identity (face reference)             │
│  - Composition / color reference                           │
│                                                            │
└────────────────────────────────────────────────────────────┘
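
Concretely, the adapter adds a second, trainable key/value projection for the image tokens in every cross-attention layer and sums the two attention results ("decoupled cross-attention"); the scale set later via set_ip_adapter_scale() weights the image branch. A minimal single-head PyTorch sketch of the idea (illustrative; the class name and shapes are ours, not the diffusers implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttentionSketch(nn.Module):
    """Single-head sketch of IP-Adapter-style decoupled cross-attention."""
    def __init__(self, dim, text_dim, image_dim, scale=0.6):
        super().__init__()
        self.scale = scale                                        # image-branch weight
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k_text = nn.Linear(text_dim, dim, bias=False)     # frozen in the real model
        self.to_v_text = nn.Linear(text_dim, dim, bias=False)
        self.to_k_image = nn.Linear(image_dim, dim, bias=False)   # newly trained for image tokens
        self.to_v_image = nn.Linear(image_dim, dim, bias=False)

    def forward(self, hidden_states, text_embeds, image_embeds):
        q = self.to_q(hidden_states)
        # Text branch: ordinary cross-attention against the text tokens
        out_text = F.scaled_dot_product_attention(
            q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        # Image branch: same queries, separate key/value projections for image tokens
        out_image = F.scaled_dot_product_attention(
            q, self.to_k_image(image_embeds), self.to_v_image(image_embeds))
        return out_text + self.scale * out_image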

3.2 Usage

from diffusers import StableDiffusionPipeline
from PIL import Image
import torch

def use_ip_adapter():
    """Using IP-Adapter"""

    # ๊ธฐ๋ณธ ํŒŒ์ดํ”„๋ผ์ธ
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    # Load the IP-Adapter weights
    pipe.load_ip_adapter(
        "h94/IP-Adapter",
        subfolder="models",
        weight_name="ip-adapter_sd15.bin"
    )

    # ์Šค์ผ€์ผ ์„ค์ • (0~1, ๋†’์„์ˆ˜๋ก ์ฐธ์กฐ ์ด๋ฏธ์ง€ ์˜ํ–ฅ ํผ)
    pipe.set_ip_adapter_scale(0.6)

    # ์ฐธ์กฐ ์ด๋ฏธ์ง€
    from PIL import Image
    style_image = Image.open("style_reference.jpg")

    # ์ƒ์„ฑ
    output = pipe(
        prompt="A portrait of a woman",
        ip_adapter_image=style_image,
        num_inference_steps=30,
    ).images[0]

    return output


def ip_adapter_face():
    """IP-Adapter Face: preserve facial identity"""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    # Face-specific IP-Adapter weights
    pipe.load_ip_adapter(
        "h94/IP-Adapter",
        subfolder="models",
        weight_name="ip-adapter-full-face_sd15.bin"
    )

    pipe.set_ip_adapter_scale(0.7)

    # Reference face
    face_image = Image.open("face_reference.jpg")

    # ๋‹ค์–‘ํ•œ ์Šคํƒ€์ผ๋กœ ์ƒ์„ฑ
    prompts = [
        "A person in a business suit, professional photo",
        "A person as a superhero, comic book style",
        "A person in ancient Rome, oil painting"
    ]

    results = []
    for prompt in prompts:
        output = pipe(
            prompt=prompt,
            ip_adapter_image=face_image,
            num_inference_steps=30,
        ).images[0]
        results.append(output)

    return results


def ip_adapter_plus():
    """IP-Adapter Plus: stronger image conditioning"""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    # Plus variant (finer-grained control)
    pipe.load_ip_adapter(
        "h94/IP-Adapter",
        subfolder="models",
        weight_name="ip-adapter-plus_sd15.bin"
    )

    # ๋‹ค์ค‘ ์ด๋ฏธ์ง€ ์ฐธ์กฐ
    style_images = [
        Image.open("style1.jpg"),
        Image.open("style2.jpg")
    ]

    output = pipe(
        prompt="A landscape",
        ip_adapter_image=style_images,
        num_inference_steps=30,
    ).images[0]

    return output

4. Latent Consistency Models (LCM)

4.1 Concept

LCM: ultra-fast image generation

Standard diffusion: needs 20-50 steps
LCM: high-quality results in 2-4 steps

How it works:
1. Distill the original diffusion model with a consistency objective
2. Map any noise level directly to a clean image
3. Generate in a single step or just a few steps

Advantages:
- Near real-time generation (< 1 s)
- Interactive applications
- Feasible on low-power devices
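
Informally, the distilled model learns a consistency function f_theta that jumps from any point of the denoising trajectory straight to the clean sample, which is why a handful of steps is enough. In the notation of the consistency-models formulation (a sketch of the idea, not the exact latent-space loss used by LCM):

f_\theta(x_t, t) \approx x_0, \qquad f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \text{ on the same trajectory}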

4.2 Usage

from diffusers import (
    DiffusionPipeline,
    LCMScheduler,
    AutoPipelineForText2Image
)
import torch

def lcm_generation():
    """Fast generation with LCM"""

    # Use LCM-LoRA (applied on top of an existing model)
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16"
    ).to("cuda")

    # Load the LCM-LoRA weights
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

    # Swap in the LCM scheduler
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

    # ๋น ๋ฅธ ์ƒ์„ฑ (4 ์Šคํ…!)
    image = pipe(
        prompt="A beautiful sunset over mountains",
        num_inference_steps=4,  # very few steps
        guidance_scale=1.5,     # low guidance is recommended for LCM
    ).images[0]

    return image


def lcm_real_time():
    """Near real-time generation demo"""
    import time

    pipe = DiffusionPipeline.from_pretrained(
        "SimianLuo/LCM_Dreamshaper_v7",
        torch_dtype=torch.float16
    ).to("cuda")

    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

    prompts = [
        "A red apple",
        "A blue car",
        "A green forest",
        "A yellow sun"
    ]

    for prompt in prompts:
        start = time.time()
        image = pipe(
            prompt=prompt,
            num_inference_steps=4,
            guidance_scale=1.0,
            height=512,
            width=512
        ).images[0]
        elapsed = time.time() - start

        print(f"'{prompt}': {elapsed:.2f}s")


def turbo_generation():
    """SDXL-Turbo: 1-4 step generation"""

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo",
        torch_dtype=torch.float16,
        variant="fp16"
    ).to("cuda")

    # ๋‹จ 1 ์Šคํ…!
    image = pipe(
        prompt="A cinematic shot of a cat wearing a hat",
        num_inference_steps=1,
        guidance_scale=0.0,  # Turbo is trained to work without CFG
    ).images[0]

    return image

5. Advanced Techniques

5.1 Inpainting & Outpainting

from diffusers import StableDiffusionInpaintPipeline
from PIL import Image
import torch

def inpainting_example():
    """Edit a masked region (inpainting)"""

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16
    ).to("cuda")

    # ์›๋ณธ ์ด๋ฏธ์ง€์™€ ๋งˆ์Šคํฌ
    image = Image.open("original.jpg")
    mask = Image.open("mask.png")  # ํฐ์ƒ‰ = ์ˆ˜์ •ํ•  ์˜์—ญ

    result = pipe(
        prompt="A cat sitting on the couch",
        image=image,
        mask_image=mask,
        num_inference_steps=30,
    ).images[0]

    return result


def outpainting_example():
    """Extend an image beyond its borders (outpainting)"""

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16
    ).to("cuda")

    # ์›๋ณธ ์ด๋ฏธ์ง€๋ฅผ ์บ”๋ฒ„์Šค์— ๋ฐฐ์น˜
    original = Image.open("original.jpg")
    canvas_size = (1024, 1024)
    canvas = Image.new("RGB", canvas_size, (128, 128, 128))

    # ์ค‘์•™์— ๋ฐฐ์น˜
    offset = ((canvas_size[0] - original.width) // 2,
              (canvas_size[1] - original.height) // 2)
    canvas.paste(original, offset)

    # ๋งˆ์Šคํฌ: ์›๋ณธ ์˜์—ญ ์™ธ ํฐ์ƒ‰
    mask = Image.new("L", canvas_size, 255)
    mask.paste(0, offset, (offset[0] + original.width, offset[1] + original.height))

    # Outpaint the surrounding area
    result = pipe(
        prompt="A beautiful landscape extending the scene",
        image=canvas,
        mask_image=mask,
    ).images[0]

    return result

5.2 Image-to-Image Translation

from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
import torch

def style_transfer():
    """Style transfer via img2img"""

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    # ์ž…๋ ฅ ์ด๋ฏธ์ง€
    init_image = Image.open("photo.jpg").resize((512, 512))

    # ์Šคํƒ€์ผ ๋ณ€ํ™˜
    result = pipe(
        prompt="oil painting, impressionist style, vibrant colors",
        image=init_image,
        strength=0.75,  # 0-1: higher values change the input more
        num_inference_steps=30,
    ).images[0]

    return result

5.3 Manipulating Text Embeddings

from compel import Compel
from diffusers import StableDiffusionPipeline
import torch

def prompt_weighting():
    """Prompt weighting with compel"""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

    # Weighting syntax
    prompts = [
        "a (beautiful)++ sunset",           # ++ = 1.21x weight
        "a (beautiful)+++ sunset",          # +++ = 1.33x weight
        "a (ugly)-- sunset",                # -- = 0.83x weight
        "a (red:1.5) and (blue:0.5) sunset" # explicit weights
    ]

    images = []
    for prompt in prompts:
        conditioning = compel.build_conditioning_tensor(prompt)

        image = pipe(
            prompt_embeds=conditioning,
            num_inference_steps=30,
        ).images[0]
        images.append(image)

    return images


def prompt_blending():
    """Prompt blending"""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    compel = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)

    # Blend two prompts in embedding space
    prompt1 = "a photo of a cat"
    prompt2 = "a photo of a dog"

    cond1 = compel.build_conditioning_tensor(prompt1)
    cond2 = compel.build_conditioning_tensor(prompt2)

    # 50:50 blend
    blended = (cond1 + cond2) / 2

    image = pipe(
        prompt_embeds=blended,
        num_inference_steps=30,
    ).images[0]

    return image

6. ์ตœ์ ํ™” ๊ธฐ๋ฒ•

6.1 ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”

from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline
import torch

def optimize_memory():
    """Memory optimization techniques"""

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16
    )

    # 1. CPU Offload
    pipe.enable_model_cpu_offload()

    # 2. Sequential CPU offload (slower, but saves more memory)
    # pipe.enable_sequential_cpu_offload()

    # 3. VAE slicing (decodes large batches one image at a time)
    pipe.enable_vae_slicing()

    # 4. VAE tiling (for very large images)
    pipe.enable_vae_tiling()

    # 5. Attention Slicing
    pipe.enable_attention_slicing(slice_size="auto")

    # 6. xFormers memory-efficient attention (requires the xformers package)
    pipe.enable_xformers_memory_efficient_attention()

    return pipe


def batch_generation():
    """Batched generation"""

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    prompts = [
        "A red apple",
        "A blue car",
        "A green tree",
        "A yellow sun",
    ]

    # ๋ฐฐ์น˜ ์ƒ์„ฑ (๋” ํšจ์œจ์ )
    images = pipe(
        prompt=prompts,
        num_inference_steps=30,
    ).images

    return images

์ฐธ๊ณ  ์ž๋ฃŒ

๋…ผ๋ฌธ

  • Podell et al. (2023). "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis"
  • Zhang et al. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models" (ControlNet)
  • Ye et al. (2023). "IP-Adapter: Text Compatible Image Prompt Adapter"
  • Luo et al. (2023). "Latent Consistency Models"

๋ชจ๋ธ

๊ด€๋ จ ๋ ˆ์Šจ

to navigate between lessons