Implementations of Stable Diffusion, across versions, share one parameter: scale_factor = 0.18215.

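scale_factor was introduced in the paper High-Resolution Image Synthesis with Latent Diffusion Models. Its main purpose is to let different latent spaces be handled with similar noise schedules: latent spaces produced by different autoencoders can be scaled quite differently from images.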

scale_factor ensures that the initial latent space on which the diffusion model operates has approximately unit variance.
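
To make the unit-variance idea concrete, here is a minimal sketch in plain Python. It assumes the raw latents are roughly zero-mean Gaussian with standard deviation 1 / 0.18215; that distribution, the helper names scale_latents and unscale_latents, and the synthetic data are illustrative assumptions, not part of any library.

```python
import random
import statistics

SCALE_FACTOR = 0.18215  # the value quoted above


def scale_latents(latents, scale_factor=SCALE_FACTOR):
    """Multiply raw autoencoder latents by scale_factor before diffusion."""
    return [z * scale_factor for z in latents]


def unscale_latents(latents, scale_factor=SCALE_FACTOR):
    """Undo the scaling before latents are handed back to the decoder."""
    return [z / scale_factor for z in latents]


# Synthetic stand-in for raw latents (assumption: zero mean, std = 1 / 0.18215).
rng = random.Random(0)
raw = [rng.gauss(0.0, 1 / SCALE_FACTOR) for _ in range(100_000)]

scaled = scale_latents(raw)
restored = unscale_latents(scaled)

print(f"raw std    = {statistics.pstdev(raw):.3f}")
print(f"scaled std = {statistics.pstdev(scaled):.3f}")
```

With real models the same two steps apply to tensors: multiply the encoder output by the factor, and divide by it again before decoding.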

Computing scale_factor

from diffusers import AutoencoderKL
import torch
import torchvision
from torchvision.datasets.utils import download_and_extract_archive
from torchvision import transforms


num_workers = 4
batch_size = 12
# From https://github.com/fastai/imagenette
IMAGENETTE_URL = 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz'

torch.manual_seed(0)
torch.set_grad_enabled(False)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

pretrained_model_name_or_path = 'CompVis/stable-diffusion-v1-4'
vae = AutoencoderKL.from_pretrained(
    pretrained_model_name_or_path,
    subfolder='vae',
    revision=None,
)
vae.to(device)

size = 512
image_transform = transforms.Compose([
    transforms.Resize(size),
    transforms.CenterCrop(size),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

root = 'dataset'
download_and_extract_archive(IMAGENETTE_URL, root)

dataset = torchvision.datasets.ImageFolder(root, transform=image_transform)
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
)

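# Encode every image and collect the sampled latents on the CPU.
all_latents = []
for image_data, _ in loader:  # class labels are not needed here
    image_data = image_data.to(device)
    latents = vae.encode(image_data).latent_dist.sample()
    all_latents.append(latents.cpu())

# Standard deviation over all latent values; its reciprocal is the scale factor.
all_latents_tensor = torch.cat(all_latents)
std = all_latents_tensor.std().item()
normalizer = 1 / std
print(f'{normalizer = }')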

Output:

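normalizer = 0.19503

This is close to 0.18215 but not identical, since the estimate depends on which images are encoded (Imagenette here).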
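
The script above keeps every latent in memory before taking the standard deviation. If that becomes a problem, the same statistic can be accumulated batch by batch. The sketch below is an alternative and not the method used above: RunningStd is a hypothetical helper, the batches are synthetic numbers standing in for flattened latents, and the sum-of-squares formula is less numerically robust than Welford's algorithm.

```python
import math
import random
import statistics


class RunningStd:
    """Accumulate count, sum and sum of squares to obtain a global std."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, values):
        # Fold one batch of flattened values into the running sums.
        for v in values:
            self.n += 1
            self.total += v
            self.total_sq += v * v

    def std(self):
        # Unbiased estimate (n - 1 denominator), as torch.Tensor.std uses by default.
        mean = self.total / self.n
        var = (self.total_sq - self.n * mean * mean) / (self.n - 1)
        return math.sqrt(var)


# Synthetic batches (assumption: stand-ins for flattened latent values).
rng = random.Random(0)
batches = [[rng.gauss(0.3, 5.0) for _ in range(12)] for _ in range(50)]

acc = RunningStd()
for batch in batches:
    acc.update(batch)

normalizer = 1 / acc.std()
print(f"{normalizer = }")
```

Only three numbers are stored between batches, so memory use no longer grows with the dataset.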

References

[1] - Explanation of the 0.18215 factor in textual_inversion?

Last modification: June 29th, 2023 at 11:46 am