
ControlNet

news Source: original 2025/6/10 11:59:08

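Contents

Summary
Abstract
1.ControlNet
- 1.1 Abstract of the original paper
- 1.2 Model architecture
- 1.3 Architecture details
- 1.4 Training loss function
- 1.5 Experiments
- 1.6 Conclusion
2.Closing summary
References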

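Summary

ControlNet, the topic of this week's study, is a conditional control method for text-to-image diffusion models such as Stable Diffusion. It freezes the pretrained diffusion model and creates a trainable copy that learns the additional conditioning. The key technique is the zero convolution, which leaves the original network unaffected at the start of training and lets control information enter gradually. ControlNet accepts many kinds of conditional input (Canny edge maps, for example) and achieves precise structural control while preserving image quality. The experiments show that the method works stably under different conditions and effectively improves the controllability of diffusion models.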

Abstract

ControlNet is a conditional control method for text-to-image diffusion models such as Stable Diffusion. It freezes the pre-trained diffusion model and creates a trainable copy, enabling it to learn additional conditional information. The key technique is the zero convolution, which ensures that the original network is not affected at the start of training while control information is introduced gradually. ControlNet can accept a variety of conditional inputs (such as Canny edge maps) and achieves precise structural control while maintaining high-quality image generation. Experimental results show that the method works stably under different conditions and effectively enhances the controllability of diffusion models.

1.ControlNet

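1.1 Abstract of the original paper

Both T2I-Adapter, studied last week, and ControlNet are extensions of Stable Diffusion that explicitly inject conditions (many kinds are supported) into the pretrained network.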
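The first sentence of the paper's abstract states that ControlNet adds conditions to text-to-image (T2I) diffusion models in order to control what they generate. It freezes the pretrained diffusion model and reuses its layers to learn control for a variety of conditions, and it uses zero convolutions to ensure that harmful noise does not affect fine-tuning. A zero convolution is a convolution layer with a 1×1 kernel whose weights and biases are both initialized to 0; its role in the model is to connect ControlNet to SD.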
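
A natural question is how a layer whose weights start at zero can learn at all. The following pure-Python sketch of a 1×1 convolution illustrates the point: the output is zero for any input, yet the gradient with respect to the weights depends on the input rather than on the weights, so it is generally non-zero. The helper names (`conv1x1`, `weight_grad`), the toy input, and the squared-error objective are illustrative assumptions and are not taken from the ControlNet code.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution over a [C_in][H][W] nested list: a per-pixel linear map across channels."""
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [[[bias[o] + sum(weight[o][i] * x[i][r][c] for i in range(c_in))
              for c in range(width)]
             for r in range(height)]
            for o in range(len(weight))]


def weight_grad(x, grad_y):
    """dL/dW[o][i] = sum over pixels of grad_y[o][r][c] * x[i][r][c]."""
    height, width = len(x[0]), len(x[0][0])
    return [[sum(grad_y[o][r][c] * x[i][r][c]
                 for r in range(height) for c in range(width))
             for i in range(len(x))]
            for o in range(len(grad_y))]


# Toy input: 2 channels, 2x2 pixels (values are arbitrary).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.5, -1.0], [2.0, 0.0]]]

# Zero convolution: 3 output channels, weights and biases all zero.
weight = [[0.0, 0.0] for _ in range(3)]
bias = [0.0] * 3

y0 = conv1x1(x, weight, bias)  # all zeros, whatever x is

# Hypothetical objective L = sum (y - 1)^2, so dL/dy = 2 * (y - 1).
grad_y = [[[2.0 * (v - 1.0) for v in row] for row in ch] for ch in y0]
grad_w = weight_grad(x, grad_y)  # non-zero: it depends on x, not on the weights

# One plain gradient-descent step on the weights (bias left untouched for brevity).
lr = 0.01
weight = [[w - lr * g for w, g in zip(w_row, g_row)]
          for w_row, g_row in zip(weight, grad_w)]
y1 = conv1x1(x, weight, bias)  # no longer all zeros
```

At initialization the gradient with respect to the layer's input is still zero, because that gradient is proportional to the weights. Once the weights have moved away from zero, gradients flow through the layer as usual.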

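1.2 Model architecture

[Figure: model architecture, with ControlNet on the right]
Both ControlNet and T2I-Adapter supply controllable input conditions to today's large pretrained text-to-image diffusion models, which makes generation more controllable. ControlNet is on the right of the figure above: it first freezes the model and makes a trainable copy (same architecture and parameters as the frozen network; one is updated and the other is not). The condition c is then fed into ControlNet, and the output of this condition network is added to the output of the diffusion model. The zero convolutions mentioned above are what connect the two networks.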
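
The wiring of a single block can be written as y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2), where F is the block, Θ the frozen weights, Θ_c the trainable copy, and Z a zero convolution. Below is a minimal sketch of that formula. The toy `block` (an elementwise tanh), the scalar `zero_layer`, and all numeric values are assumptions chosen for illustration; they stand in for real network layers.

```python
import copy
import math


def block(x, params):
    """Toy stand-in for one network block: elementwise tanh(scale * x + shift)."""
    return [math.tanh(params["scale"] * v + params["shift"]) for v in x]


def zero_layer(x, params):
    """Scalar stand-in for a zero convolution: w * x + b, with w = b = 0 at initialization."""
    return [params["w"] * v + params["b"] for v in x]


frozen = {"scale": 0.7, "shift": 0.1}   # pretrained weights, never updated
trainable = copy.deepcopy(frozen)       # trainable copy starts from the same weights
zero_in = {"w": 0.0, "b": 0.0}          # zero layer applied to the condition c
zero_out = {"w": 0.0, "b": 0.0}         # zero layer applied to the copy's output


def controlled_block(x, c):
    """y_c = F(x; frozen) + Z_out(F(x + Z_in(c); trainable))."""
    injected = [xv + cv for xv, cv in zip(x, zero_layer(c, zero_in))]
    control = zero_layer(block(injected, trainable), zero_out)
    return [yv + dv for yv, dv in zip(block(x, frozen), control)]


x = [0.2, -1.0, 3.0]   # toy feature vector
c = [1.0, 1.0, 0.0]    # toy condition features

y_plain = block(x, frozen)
y_init = controlled_block(x, c)   # identical to y_plain while the zero layers are zero

# Pretend training has moved the zero layers away from zero:
zero_in.update(w=0.5)
zero_out.update(w=0.3)
y_trained = controlled_block(x, c)   # the condition now changes the output
```

Because both zero layers output 0 at initialization, the controlled block reproduces the frozen block exactly before training. The condition only influences the result once those layers have been updated.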

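1.3 Architecture details

[Figure: the frozen SD model (left) and ControlNet (right)]
On the left is the frozen SD model; on the right is ControlNet, which mainly copies two parts of SD, the encoder blocks and the middle block. After the input condition passes through ControlNet, the outputs go through zero convolutions and are connected to the corresponding layers of SD. On the first forward pass, before any weight update, a zero convolution outputs 0 whatever its input, so at the start of training ControlNet does not interfere with SD at all.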
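
The routing can be sketched as below, assuming the SD 1.5 layout of 12 encoder blocks plus 1 middle block (which matches the 13 entries of `control_scales` in the demo code later in this post). Features are plain floats and channel concatenation is replaced by addition, so only the bookkeeping is meaningful. `decode_with_control` is a hypothetical name.

```python
def decode_with_control(skips, middle, control, scales):
    """Add scaled ControlNet residuals to the middle feature and to each skip connection.

    skips   : encoder outputs, shallowest first (12 for the assumed SD 1.5 layout)
    middle  : output of the middle block
    control : ControlNet outputs, one per skip followed by one for the middle block
    scales  : one multiplier per ControlNet output
    """
    skips = list(skips)  # do not mutate the caller's list
    residuals = [r * s for r, s in zip(control, scales)]
    assert len(residuals) == len(skips) + 1

    # The last ControlNet output belongs to the middle block.
    h = middle + residuals.pop()
    # The decoder then consumes skip connections deepest first, each with its residual.
    # (A real U-Net concatenates channels here; addition keeps the toy scalar.)
    while skips:
        h = h + (skips.pop() + residuals.pop())
    return h


skips = [float(i) for i in range(12)]
middle = 100.0

without_control = decode_with_control(skips, middle, [0.0] * 13, [1.0] * 13)
with_control = decode_with_control(skips, middle, [1.0] * 13, [1.0] * 13)
half_strength = decode_with_control(skips, middle, [1.0] * 13, [0.5] * 13)
```

With all residuals at zero, which is the situation at the start of training, the decoder sees exactly the features of the frozen model.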

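1.4 Training loss function

$\mathcal{L}=\mathbb{E}_{\boldsymbol{z}_{0},\boldsymbol{t},\boldsymbol{c}_{t},\boldsymbol{c}_{\mathrm{f}},\epsilon\sim\mathcal{N}(0,1)}\left[\|\epsilon-\epsilon_{\theta}(\boldsymbol{z}_{t},\boldsymbol{t},\boldsymbol{c}_{t},\boldsymbol{c}_{\mathrm{f}})\|_{2}^{2}\right]$
The objective is very similar to that of T2I-Adapter: on top of the original time step, text, and current noisy latent, an extra condition input is added. The overall goal is still to estimate the noise added at the current time step and to take an L2 loss against the true noise. Here $\boldsymbol{z}_{t}$ is the noisy latent at time step $\boldsymbol{t}$, $\boldsymbol{c}_{t}$ is the text condition, $\boldsymbol{c}_{\mathrm{f}}$ is the additional condition, and $\epsilon_{\theta}$ is the noise-prediction network.
Seen from a high level, ControlNet uses one network to inject conditions into another. The injecting network is effectively a small network that supplies conditions to a large one, in the hope that the large network gains some additional capability.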
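
A single Monte Carlo sample of this objective can be sketched as follows. The forward step z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε is the standard DDPM noising formula and is an assumption here, since the post does not spell it out. The schedule values, the function names, and the dummy predictor are likewise illustrative only.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def training_step(eps_theta, z0, t, c_t, c_f, alpha_bars, rng):
    """One sample of the objective for a single latent z0."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]   # eps ~ N(0, 1)
    z_t = add_noise(z0, eps, alpha_bars[t])
    # The network sees the noisy latent, the time step, the text, and the extra condition.
    eps_pred = eps_theta(z_t, t, c_t, c_f)
    return noise_prediction_loss(eps, eps_pred)


alpha_bars = [0.99, 0.9, 0.5]      # hypothetical cumulative noise schedule
z0 = [0.3, -0.7, 1.2, 0.0]         # toy latent
c_f = [0.0, 1.0, 1.0, 0.0]         # toy condition, e.g. a flattened edge map


def zero_predictor(z_t, t, c_t, c_f):
    """Dummy network that always predicts zero noise."""
    return [0.0] * len(z_t)


loss_zero = training_step(zero_predictor, z0, 1, "cute dog", c_f, alpha_bars, random.Random(0))
```

In actual ControlNet training only the trainable copy and the zero convolutions receive gradient updates from this loss; the frozen SD weights stay fixed.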

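1.5 Experiments

[Figure: experiment results]
In the experiment above, the zero convolution layers are replaced with standard convolution layers initialized with Gaussian weights.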
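
The practical difference between the two initializations shows up before any training has happened. The sketch below compares the residual that a connecting layer would add to the frozen model at step 0 under zero initialization and under Gaussian initialization. The layer sizes, the standard deviation 0.02, and the helper names are arbitrary assumptions for illustration.

```python
import random


def pointwise_layer(features, weight, bias):
    """What a 1x1 convolution computes at one pixel: bias + weight @ features."""
    return [b + sum(w * f for w, f in zip(row, features))
            for row, b in zip(weight, bias)]


def init_weight(n_out, n_in, std, rng):
    """std == 0 gives a zero convolution; std > 0 gives a Gaussian-initialized layer."""
    return [[rng.gauss(0.0, std) if std > 0 else 0.0 for _ in range(n_in)]
            for _ in range(n_out)]


rng = random.Random(0)
copy_features = [rng.gauss(0.0, 1.0) for _ in range(8)]   # toy output of the trainable copy

perturbation = {}
for name, std in [("zero", 0.0), ("gaussian", 0.02)]:
    weight = init_weight(4, 8, std, rng)
    residual = pointwise_layer(copy_features, weight, [0.0] * 4)
    # Largest change this layer would add to the frozen model's features before training.
    perturbation[name] = max(abs(r) for r in residual)
```

With zero initialization the perturbation is exactly 0, so the first training steps start from the unmodified pretrained model. With Gaussian initialization the frozen model's features are already disturbed at step 0.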
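Experiment code
Input prompt: cute dog

# Imports assumed from the layout of the official ControlNet repository (Canny demo script).
import random

import einops
import gradio as gr
import numpy as np
import torch
from pytorch_lightning import seed_everything

import config
from annotator.canny import CannyDetector
from annotator.util import resize_image, HWC3
from cldm.ddim_hacked import DDIMSampler
from cldm.model import create_model, load_state_dict

apply_canny = CannyDetector()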

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_canny.pth', location='cuda'))
model = model.cuda()
ddim_sampler = DDIMSampler(model)


def process(input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, low_threshold, high_threshold):
    with torch.no_grad():
        img = resize_image(HWC3(input_image), image_resolution)
        H, W, C = img.shape

        detected_map = apply_canny(img, low_threshold, high_threshold)
        detected_map = HWC3(detected_map)

        control = torch.from_numpy(detected_map.copy()).float().cuda() / 255.0
        control = torch.stack([control for _ in range(num_samples)], dim=0)
        control = einops.rearrange(control, 'b h w c -> b c h w').clone()

        if seed == -1:
            seed = random.randint(0, 65535)
        seed_everything(seed)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        cond = {"c_concat": [control], "c_crossattn": [model.get_learned_conditioning([prompt + ', ' + a_prompt] * num_samples)]}
        un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
        shape = (4, H // 8, W // 8)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=True)

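        # Guess mode scales the 13 ControlNet outputs geometrically, from strength * 0.825**12 (shallowest) up to strength (middle block).
        # The base 0.825 is an empirical magic number: 0.825**12 ≈ 0.099 < 0.1, while 0.826**12 ≈ 0.101 > 0.1.
        model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)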
        samples, intermediates = ddim_sampler.sample(ddim_steps, num_samples,
                                                     shape, cond, verbose=False, eta=eta,
                                                     unconditional_guidance_scale=scale,
                                                     unconditional_conditioning=un_cond)

        if config.save_memory:
            model.low_vram_shift(is_diffusing=False)

        x_samples = model.decode_first_stage(samples)
        x_samples = (einops.rearrange(x_samples, 'b c h w -> b h w c') * 127.5 + 127.5).cpu().numpy().clip(0, 255).astype(np.uint8)

        results = [x_samples[i] for i in range(num_samples)]
    return [255 - detected_map] + results


block = gr.Blocks().queue()
with block:
    with gr.Row():
        gr.Markdown("## Control Stable Diffusion with Canny Edge Maps")
    with gr.Row():
        with gr.Column():
            input_image = gr.Image(source='upload', type="numpy")
            prompt = gr.Textbox(label="Prompt")
            run_button = gr.Button(label="Run")
            with gr.Accordion("Advanced options", open=False):
                num_samples = gr.Slider(label="Images", minimum=1, maximum=12, value=1, step=1)
                image_resolution = gr.Slider(label="Image Resolution", minimum=256, maximum=768, value=512, step=64)
                strength = gr.Slider(label="Control Strength", minimum=0.0, maximum=2.0, value=1.0, step=0.01)
                guess_mode = gr.Checkbox(label='Guess Mode', value=False)
                low_threshold = gr.Slider(label="Canny low threshold", minimum=1, maximum=255, value=100, step=1)
                high_threshold = gr.Slider(label="Canny high threshold", minimum=1, maximum=255, value=200, step=1)
                ddim_steps = gr.Slider(label="Steps", minimum=1, maximum=100, value=20, step=1)
                scale = gr.Slider(label="Guidance Scale", minimum=0.1, maximum=30.0, value=9.0, step=0.1)
                seed = gr.Slider(label="Seed", minimum=-1, maximum=2147483647, step=1, randomize=True)
                eta = gr.Number(label="eta (DDIM)", value=0.0)
                a_prompt = gr.Textbox(label="Added Prompt", value='best quality, extremely detailed')
                n_prompt = gr.Textbox(label="Negative Prompt",
                                      value='longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality')
        with gr.Column():
            result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery").style(grid=2, height='auto')
    ips = [input_image, prompt, a_prompt, n_prompt, num_samples, image_resolution, ddim_steps, guess_mode, strength, scale, seed, eta, low_threshold, high_threshold]
    run_button.click(fn=process, inputs=ips, outputs=[result_gallery])


# Start the Gradio app; without this call the interface is built but never served.
block.launch()
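
The `control_scales` line in `process` is the least obvious part of the demo, so here it is isolated as a small standalone function. The function name and the `n_outputs` parameter are introduced only for this sketch; the arithmetic is the same as in the demo code.

```python
def control_scales(strength, guess_mode, n_outputs=13):
    """Per-output multipliers for the ControlNet residuals, as computed in the demo code."""
    if guess_mode:
        # Geometric ramp: output i is scaled by strength * 0.825 ** (n_outputs - 1 - i),
        # so the first output is damped the most and the last one gets the full strength.
        return [strength * (0.825 ** float(n_outputs - 1 - i)) for i in range(n_outputs)]
    # Normal mode: every output is scaled by the same strength.
    return [strength] * n_outputs


normal = control_scales(1.0, guess_mode=False)
guess = control_scales(1.0, guess_mode=True)

# The first guess-mode scale is 0.825 ** 12, about 0.0994, i.e. just under 0.1.
print(round(guess[0], 4), guess[-1])
```

In guess mode the demo also omits the control image from the unconditional branch (`"c_concat": None`), so the ramp above is applied only on the conditional side of classifier-free guidance.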


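1.6 Conclusion

ControlNet is a neural network architecture for learning conditional control of large pretrained text-to-image diffusion models. The original model and the trainable copy are connected through zero convolution layers, which eliminates harmful noise during training. Extensive experiments in the paper show that ControlNet can effectively control SD with single or multiple conditions, with or without prompts.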

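2.Closing summary

By adding a trainable control network on top of Stable Diffusion, ControlNet achieves precise control over image generation. Its core advantage is that the original diffusion model does not need to be modified; a separate trainable branch learns the condition mapping, which improves controllability. Zero convolutions keep training stable and avoid excessive interference with the diffusion model. Experiments show that ControlNet works across different tasks (edge detection, depth maps, pose guidance, and others), which makes text-to-image generation more flexible and diverse and opens up broader possibilities for practical use of diffusion models.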