Every version of the Stable Diffusion implementation contains the parameter scale_factor=0.18215.
scale_factor was introduced in the paper High-Resolution Image Synthesis with Latent Diffusion Models. It exists so that the latent spaces produced by different autoencoders can be handled with similar noise schedules: depending on the autoencoder, a latent space can have a scale entirely different from that of images.
scale_factor ensures that the latent space the diffusion model starts from has roughly unit variance.
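Concretely, the factor is multiplied in right after encoding and divided out right before decoding, so the diffusion model only ever sees latents with roughly unit standard deviation. A minimal sketch of this convention, with the VAE stubbed out by random tensors whose spread of about 1/0.18215 mimics real Stable Diffusion latents (an assumption for illustration only):

```python
import torch

scale_factor = 0.18215

torch.manual_seed(0)
# Stand-in for vae.encode(x).latent_dist.sample(): real SD latents have
# std ≈ 1 / 0.18215 ≈ 5.49, so we fake that spread here.
raw_latents = torch.randn(16, 4, 64, 64) * (1 / scale_factor)

# Going into the diffusion model: scale down to ~unit variance.
latents = raw_latents * scale_factor
print(f'std before scaling: {raw_latents.std():.3f}')  # ≈ 5.49
print(f'std after scaling:  {latents.std():.3f}')      # ≈ 1.0

# Going back to the VAE decoder: undo the scaling.
recovered = latents / scale_factor
assert torch.allclose(recovered, raw_latents)
```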
Computing scale_factor
from diffusers import AutoencoderKL
import torch
import torchvision
from torchvision.datasets.utils import download_and_extract_archive
from torchvision import transforms

num_workers = 4
batch_size = 12

# From https://github.com/fastai/imagenette
IMAGENETTE_URL = 'https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz'

torch.manual_seed(0)
torch.set_grad_enabled(False)  # inference only, no gradients needed
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the pretrained VAE from Stable Diffusion v1-4.
pretrained_model_name_or_path = 'CompVis/stable-diffusion-v1-4'
vae = AutoencoderKL.from_pretrained(
    pretrained_model_name_or_path,
    subfolder='vae',
    revision=None,
)
vae.to(device)

# Resize/crop to 512x512 and map pixel values to [-1, 1], as the VAE expects.
size = 512
image_transform = transforms.Compose([
    transforms.Resize(size),
    transforms.CenterCrop(size),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

root = 'dataset'
download_and_extract_archive(IMAGENETTE_URL, root)
dataset = torchvision.datasets.ImageFolder(root, transform=image_transform)
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers,
)

# Encode every image and collect the sampled latents.
all_latents = []
for image_data, _ in loader:
    image_data = image_data.to(device)
    latents = vae.encode(image_data).latent_dist.sample()
    all_latents.append(latents.cpu())

# The scale factor is the reciprocal of the latents' standard deviation.
all_latents_tensor = torch.cat(all_latents)
std = all_latents_tensor.std().item()
normalizer = 1 / std
print(f'{normalizer = }')
Output:

normalizer = 0.19503

The result is close to, but not exactly, 0.18215. The official factor was reportedly estimated on a batch of the model's own training data, so a different dataset (Imagenette here) gives a slightly different estimate [1].
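Concatenating every latent tensor in memory (all_latents above) does not scale to large datasets. The same standard deviation can be accumulated batch by batch from running sums; the variable names below are hypothetical, and random tensors stand in for real VAE latents:

```python
import torch

torch.manual_seed(0)

# Running sums for a streaming (population) standard deviation.
count = 0
total = 0.0
total_sq = 0.0

def fake_latent_batches():
    # Stand-in for vae.encode(...).latent_dist.sample() over a DataLoader;
    # the * 5.0 spread is arbitrary, chosen only for the demo.
    for _ in range(10):
        yield torch.randn(12, 4, 64, 64) * 5.0

for latents in fake_latent_batches():
    count += latents.numel()
    total += latents.sum().item()
    total_sq += latents.pow(2).sum().item()

mean = total / count
# Var = E[x^2] - (E[x])^2; torch.Tensor.std() is the unbiased estimate,
# but at this sample size the difference is negligible.
std = (total_sq / count - mean ** 2) ** 0.5
normalizer = 1 / std
print(f'{normalizer = :.5f}')  # ≈ 1 / 5 = 0.2
```

Only three scalars are kept across batches, so memory use is independent of dataset size.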
References
[1] Explanation of the 0.18215 factor in textual_inversion? (huggingface/diffusers discussion)