ControlNet on diffusers

Reference: https://huggingface.co/docs/diffusers/using-diffusers/controlnet (v0.24.0)

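ControlNet controls the output of a diffusion model by giving it an additional input image as a condition. The conditioning image can take many forms, such as Canny edges, a user sketch, a human pose, or a depth map. This is very useful: we get much finer control over the generated image, without repeatedly tweaking the text prompt or parameters such as the number of denoising steps and hoping for a lucky result.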

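A ControlNet model contains two sets of weights (or blocks), connected by zero convolution layers:

A locked copy, whose weights are frozen so that the capabilities of the pretrained diffusion model are not lost
A trainable copy, which is trained on the additional conditioning image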

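Because the locked copy preserves what the pretrained model already knows, training a ControlNet on a new condition is about as fast as fine-tuning any other model, since nothing has to be trained from scratch.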
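
The role of the zero convolution can be illustrated with a small NumPy sketch. Everything here is a stand-in: `locked_block` and `trainable_block` are hypothetical one-layer blocks, not the real UNet, and the shapes are arbitrary. The point is only that when the connecting convolutions start with all-zero weights and biases, the trainable branch contributes nothing, so the combined output initially equals the output of the frozen model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight, bias):
    """1x1 convolution on a (channels, height, width) feature map."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

channels = 4
features = rng.normal(size=(channels, 8, 8))   # input to one block
condition = rng.normal(size=(channels, 8, 8))  # encoded conditioning image

# Hypothetical pretrained block; the trainable copy starts from the same weights.
locked_weight = rng.normal(size=(channels, channels))
locked_bias = rng.normal(size=channels)
trainable_weight = locked_weight.copy()
trainable_bias = locked_bias.copy()

def locked_block(x):
    return np.tanh(conv1x1(x, locked_weight, locked_bias))

def trainable_block(x):
    return np.tanh(conv1x1(x, trainable_weight, trainable_bias))

# Zero convolutions: weights and biases are initialized to exactly zero.
zero_in = (np.zeros((channels, channels)), np.zeros(channels))
zero_out = (np.zeros((channels, channels)), np.zeros(channels))

def controlled_block(x, c):
    conditioned = x + conv1x1(c, *zero_in)                      # inject the condition
    residual = conv1x1(trainable_block(conditioned), *zero_out)  # trainable branch
    return locked_block(x) + residual

baseline = locked_block(features)
controlled = controlled_block(features, condition)
max_difference = float(np.abs(controlled - baseline).max())
print(max_difference)
```

Once training updates the zero convolutions, the residual becomes non-zero and the condition starts to steer the output, while the locked weights stay untouched.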

This article shows how to use ControlNet for text-to-image, image-to-image, inpainting, and more. ControlNet also supports many other kinds of conditioning inputs, which you are free to explore.

Before you begin, install the following dependencies:

pip install -q diffusers transformers accelerate opencv-python

text-to-image

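For ordinary text-to-image generation, we only pass a text prompt. With ControlNet, we can specify an additional condition through an extra input image. Here we use Canny edges as the example. Canny is an edge detector that traces the edge contours of an image. When the edge map is used as the conditioning image, the generated result follows the text prompt while keeping the same edge contours.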

First, extract the Canny edge map with OpenCV:

from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np

original_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
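
The last three lines of the snippet above turn the single-channel edge map into a three-channel image. A minimal sketch of just that step, using a small hand-made array as a stand-in for the output of cv2.Canny (the array contents are made up for illustration):

```python
import numpy as np

# Stand-in for a Canny result: single-channel uint8, edges are 255, the rest 0.
edges = np.zeros((4, 6), dtype=np.uint8)
edges[1, 2:5] = 255

stacked = edges[:, :, None]                       # (4, 6) -> (4, 6, 1)
stacked = np.concatenate([stacked] * 3, axis=2)   # (4, 6, 1) -> (4, 6, 3)

print(edges.shape, stacked.shape)
```

All three channels carry the same edge map, so the result can be handed to Image.fromarray as an ordinary RGB image.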

Next, load a ControlNet model conditioned on Canny edges and pass it to StableDiffusionControlNetPipeline:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Write a prompt and pass it to the pipeline together with the Canny conditioning image:

output = pipe(
    "the mona lisa", image=canny_image
).images[0]
make_image_grid([original_image, canny_image, output], rows=1, cols=3)

The generated result not only takes on the Mona Lisa style from the prompt but also keeps the edge contours of the original image.

image-to-image

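For ordinary image-to-image generation, we pass an initial image and a text prompt to generate a new image. With ControlNet, we can also pass a conditioning image to guide the model. This time we use a depth map as the example. A depth map encodes the spatial depth of a picture, so when it is used as guidance, the generated result preserves the spatial layout of the original image.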

First, use the depth-estimation pipeline from transformers to extract the depth map of the original image:

import torch
import numpy as np

from transformers import pipeline
from diffusers.utils import load_image, make_image_grid

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
)

def get_depth_map(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    depth_map = detected_map.permute(2, 0, 1)
    return depth_map

depth_estimator = pipeline("depth-estimation")
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
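
The layout changes inside get_depth_map are easy to lose track of, so here is a CPU-only sketch that mirrors them with NumPy instead of torch. The 2x2 depth values are made up, and `transpose` and `astype(np.float16)` stand in for `permute` and `half()`; this is an illustration of the shapes, not a replacement for the function above.

```python
import numpy as np

depth = np.array([[0, 128], [255, 64]], dtype=np.uint8)  # hypothetical 2x2 depth image

hwc = np.concatenate([depth[:, :, None]] * 3, axis=2)    # (2, 2, 3): height, width, channels
scaled = hwc.astype(np.float32) / 255.0                  # values in [0, 1]
chw = scaled.transpose(2, 0, 1)                          # (3, 2, 2): channels first
batch = chw[None].astype(np.float16)                     # (1, 3, 2, 2): batch of one, half precision

print(batch.shape, batch.dtype)
```

The pipeline therefore receives a batched, channels-first, half-precision tensor with values between 0 and 1.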

Then load a ControlNet conditioned on depth maps and pass it to StableDiffusionControlNetImg2ImgPipeline:

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
import torch

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Then pass the text prompt, the input image, and the depth map to the pipeline to get the result:

from diffusers.utils import pt_to_pil

output = pipe(
    "lego batman and robin", image=image, control_image=depth_map,
).images[0]
make_image_grid([image, pt_to_pil(depth_map)[0], output], rows=1, cols=3)

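In the generated result, the characters match the text prompt, their outlines match the original image, and the foreground and background follow the spatial information in the depth map.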

inpainting

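For inpainting, we normally pass the initial image, a mask image, and a text prompt. With ControlNet, we can specify an additional conditioning image. This time we use the inpainting mask as the condition, so that ControlNet generates content only inside the masked region.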

First, load an initial image and a mask image:

from diffusers.utils import load_image, make_image_grid

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
)
init_image = init_image.resize((512, 512))

mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))

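Next, write a function that builds the conditioning image from the initial image and the mask. It marks a pixel of the initial image as masked wherever the corresponding pixel of the mask image is above a threshold: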

import numpy as np
import torch

def make_inpaint_condition(image, image_mask):
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0

    assert image.shape[0:1] == image_mask.shape[0:1]
    image[image_mask > 0.5] = 1.0  # set as masked pixel
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image

control_image = make_inpaint_condition(init_image, mask_image)
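
To make the masking step concrete, the following sketch applies the same thresholding to a tiny hand-made image, using NumPy only. The 2x2 image and mask values are invented for illustration; the threshold (0.5) and the marker value (1.0) are copied from the function above.

```python
import numpy as np

image = np.full((2, 2, 3), 0.25, dtype=np.float32)            # hypothetical RGB image in [0, 1]
mask = np.array([[0.0, 1.0], [0.0, 0.0]], dtype=np.float32)   # 1.0 marks the pixel to repaint

condition = image.copy()
condition[mask > 0.5] = 1.0   # overwrite all three channels of every masked pixel
condition = np.expand_dims(condition, 0).transpose(0, 3, 1, 2)  # (1, 3, 2, 2): batch, channels, H, W

print(condition.shape)
```

Only the top-right pixel is overwritten; every other pixel keeps its original value, which is how the ControlNet can tell the region to repaint from the region to keep.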

Then load a ControlNet conditioned on inpainting and pass it to StableDiffusionControlNetInpaintPipeline:

from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True
)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)

pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

Finally, pass the text prompt, the initial image, the mask image, and the conditioning image to the pipeline:

from diffusers.utils import pt_to_pil  # needed for displaying control_image below

output = pipe(
    "corgi face with large ears, detailed, pixar, animated, disney",
    num_inference_steps=20,
    eta=1.0,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
).images[0]
make_image_grid([init_image, mask_image, pt_to_pil(control_image)[0], output], rows=1, cols=4)

The model repaints the specified region to produce a new image that matches the meaning of the text prompt.

Guess Mode

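In guess mode, we do not need to supply a prompt to ControlNet at all (the example below passes an empty string). This forces the ControlNet encoder to do its best to "guess" the content of the conditioning image (a depth map? a human pose? Canny edges?).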

Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the MidBlock output becomes 1.0.
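
A log-spaced schedule is one way to reproduce the scaling described above. The sketch below assumes 12 down-block residuals plus one mid-block residual, which matches my understanding of an SD 1.5 ControlNet; the count and the variable names are assumptions for illustration, not taken from the article.

```python
import numpy as np

num_down_residuals = 12  # assumption: number of down-block residuals
scales = np.logspace(-1, 0, num_down_residuals + 1)  # 10**-1 ... 10**0

down_scales, mid_scale = scales[:-1], scales[-1]
ratios = scales[1:] / scales[:-1]  # constant ratio means exponential growth

print(np.round(scales, 3))
```

The shallowest residual is multiplied by 0.1, each deeper one by a fixed factor more, and the mid-block residual by 1.0.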

Set guess_mode=True in the pipeline. A guidance_scale between 3.0 and 5.0 is recommended.

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
from PIL import Image
import cv2

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True
).to("cuda")

original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

ControlNet with Stable Diffusion XL

At the moment there are not many ControlNets compatible with SDXL, but the diffusers team has trained two ControlNet models for SDXL, one conditioned on Canny edges and one on depth maps.

Here we use Canny edges as the example. First, load an image and generate its Canny edge map:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
import torch

original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)

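Next, load the Canny ControlNet model for SDXL and pass it to StableDiffusionXLControlNetPipeline. The loading code is not reproduced at this step; the guess mode example later in this section shows how controlnet, vae, and pipe are created.

Then, as before, pass the prompt and the Canny conditioning image to the pipeline to generate an image: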

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

The SDXL ControlNet pipeline also supports guess mode:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
import cv2
from PIL import Image

prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"

original_image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()

image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)

image = pipe(
    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

MultiControlNet

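Several ControlNet conditions can be combined, which is called MultiControlNet. The following tips generally help to get better results:

Mask the conditions so that they do not overlap (for example, remove the Canny edge information from the area where the human pose is placed)
You may need to experiment with the controlnet_conditioning_scale parameter to decide how much weight each condition gets.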

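The example below uses SDXL. To use multiple conditions with Stable Diffusion instead, replace the SDXL model with a model such as runwayml/stable-diffusion-v1-5.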

In this example, we combine Canny edges and a human pose to generate an image.

First, prepare the Canny edge map of an image:

from diffusers.utils import load_image, make_image_grid
from PIL import Image
import numpy as np
import cv2

original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
image = np.array(original_image)

low_threshold = 100
high_threshold = 200

image = cv2.Canny(image, low_threshold, high_threshold)

# The human pose condition will be placed in the middle of the image later,
# so zero out that region of the Canny map first (tip 1)
zero_start = image.shape[1] // 4
zero_end = zero_start + image.shape[1] // 2
image[:, zero_start:zero_end] = 0

image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)
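
The column arithmetic that blanks out the middle of the edge map can be checked on a tiny array. This sketch uses a made-up 2x8 edge map in which every pixel is an edge; only the slicing logic is taken from the code above.

```python
import numpy as np

edges = np.full((2, 8), 255, dtype=np.uint8)  # stand-in edge map, every pixel is an edge

zero_start = edges.shape[1] // 4              # 8 // 4 = 2
zero_end = zero_start + edges.shape[1] // 2   # 2 + 4 = 6
edges[:, zero_start:zero_end] = 0             # clear the middle half of the columns

print(edges[0].tolist())  # [255, 255, 0, 0, 0, 0, 255, 255]
```

The left and right quarters keep their edges, while the middle half is left empty for the pose condition.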

To prepare the human pose image, first install the controlnet_aux package:

# !pip install -q controlnet-aux
from controlnet_aux import OpenposeDetector

openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(original_image)
make_image_grid([original_image, openpose_image], rows=1, cols=2)

Load the two ControlNets for human pose and Canny edges, put them in a list, and pass the list to the pipeline:

from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
import torch

controlnets = [
    ControlNetModel.from_pretrained("thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True),
]

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

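Then pass the text prompt, the human pose image, and the Canny edge map to the pipeline. The images must be listed in the same order as the ControlNets: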

prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"

generator = torch.manual_seed(1)

images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]

images = pipe(
    prompt,
    image=images,
    num_inference_steps=25,
    generator=generator,
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
).images
make_image_grid(
    [original_image, canny_image, openpose_image, images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))],
    rows=2, cols=3,
)
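
As a rough mental model of controlnet_conditioning_scale=[1.0, 0.8], each ControlNet's residuals can be thought of as being multiplied by its own weight and then summed before they are added to the UNet features. The sketch below is a simplified illustration under that assumption; the random arrays are stand-ins for real residual feature maps, and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the residual feature maps produced by each ControlNet
pose_residual = rng.normal(size=(4, 8, 8))
canny_residual = rng.normal(size=(4, 8, 8))

conditioning_scales = [1.0, 0.8]  # same order as the ControlNets and images

# Weight each residual by its scale, then sum them into one combined residual
combined = sum(
    scale * residual
    for scale, residual in zip(conditioning_scales, [pose_residual, canny_residual])
)

print(combined.shape)
```

Lowering a scale weakens the influence of that condition, which is why adjusting these values is a useful knob when one condition dominates the result.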