Faiss 构建了几个非常高效的实现:K-means 聚类、PCA降维、PQ量化(编码/解码).
1. K-means 聚类
Faiss K-means 与 Scikit-learn K-means 实现的对比可参考:
K-Means 8x faster, 27x lower error than Scikit-learn’s in 25 lines
K-means 迭代(From: https://en.wikipedia.org/wiki/K-means_clustering#/media/File:K-means_convergence.gif)
例如,将给定 2D 张量 x 进行聚类的实现如:
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
#cpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
#gpu,使用所有的gpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
#gpu,使用 3 个gpu
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=3)
#
kmeans.train(x)
cluster_centers = kmeans.centroids #聚类后的聚类中心
obj = kmeans.obj #目标函数,kmeans 中为总的平方差
iteration_stats = kmeans.iteration_stats #聚类中的统计信息
1.1. 参数说明
Kmeans 对象是 C++ Clustering
的封装层,其参数如:
struct ClusteringParameters {
int niter; // 聚类迭代次数
int nredo; // 重复聚类的次数,保留最佳的centroids
bool verbose; // 是否使聚类更加冗长(verbose)
bool spherical; /// 是否球形 kmeans;是否需要在每次迭代后将 centroids 进行 L2 归一化
bool int_centroids; /// 是否将 centroids 坐标取整
bool update_index; /// 是否每次迭代后重训练索引
bool frozen_centroids; /// 是否使用预提供的 centroids 作为输入,并在迭代中更新 centroids
int min_points_per_centroid; //
int max_points_per_centroid; //
int seed; // 随机数生成器种子
size_t decode_block_size; //
ClusteringParameters();
};
1.2. 计算归属聚类
当 Kmeans 训练完成后,可以将向量 x 集映射到聚类中心,如:
D, I = kmeans.index.search(x, 1)
其中, I
中为 x 中每一行的向量所对应的最接近的聚类(centroid),D
包含了对应的平方 L2 距离.
对于倒排才做,如,查找 x 的 15 个最近邻点,需要建立索引,如:
index = faiss.IndexFlatL2 (d)
index.add (x)
D, I = index.search (kmeans.centroids, 15)
其中,I
的大小为 (ncentroids,15),包含了每个 centroid 的最近邻点.
1.3. FaissKmeans 使用
import faiss
import numpy as np
class FaissKMeans:
def __init__(self, n_clusters=8, n_init=10, max_iter=300):
self.n_clusters = n_clusters
self.n_init = n_init
self.max_iter = max_iter
self.kmeans = None
self.cluster_centers_ = None
self.inertia_ = None
def fit(self, X, y):
self.kmeans = faiss.Kmeans(d=X.shape[1],
k=self.n_clusters,
niter=self.max_iter,
nredo=self.n_init)
self.kmeans.train(X.astype(np.float32))
self.cluster_centers_ = self.kmeans.centroids
self.inertia_ = self.kmeans.obj[-1]
def predict(self, X):
return self.kmeans.index.search(X.astype(np.float32), 1)[1]
2. PCA 计算
例如,将 40D 向量降维到 10D,
#随机生成训练数据
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix (40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
#print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
3. PQ 量化
如:
d = 32 # data dimension
cs = 4 # code size (bytes)
#随机生成数据集
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
#
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# encode
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
标量量化(scalar quantizer):
d = 32 # data dimension
# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)
# encode
codes = sq.compute_codes(x)
# decode
x2 = sq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()