Scikit-learn - Clustering 聚类算法汇总

AIHGF

May 21, 2018

166views

No comments

5244 words

机器学习

scikit-learning - Clustering

scikit-learning 提供了 sklearn.cluster 模块，以用于无标签数据的聚类.

sklearn.cluster 模块共提供了一下几种聚类算法：

[1] - K-means

[2] - Affinity Propagation

[3] - Mean Shift

[4] - Spectral clustering

[5] - Hierarchical clustering

[6] - DBSCAN

[7] - OPTICS

[8] - Birch

每一种聚类算法包含两部分：

[1] - 一个类(class)，实现 fit 方法，以对训练数据进行聚类.

[2] - 一个函数(function)，对于给定训练数据，返回对应于不同聚类的整数类别标签数组.

对于该 class，训练数据的聚类标签可以通过 labels_ 属性得到.

输入数据：

[1] - sklearn.cluster 中的聚类算法支持不同类型的矩阵作为输入.

[2] - 所有的聚类算法都支持标准的 [nsamples, nfeatures] 形式的数据矩阵.

特征提取模块：sklearn.feature_extraction

[3] - AffinityPropagation, SpectralClustering 和 DBSCAN 还支持 [nsamples, nsamples]形式的相似性矩阵作为输入.

相似性矩阵计算模块：sklearn.metrics.pairwise

1. 不同聚类算法对比

算法名	参数	Scalability	使用场景	Geometry (metric used)
K-Means	number of clusters	Very large `n_samples`, medium `n_clusters` with MiniBatch code	General-purpose, even cluster size, flat geometry, not too many clusters	Distances between points
Affinity propagation	damping, sample preference	Not scalable with n_samples	Many clusters, uneven cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Mean-shift	bandwidth	Not scalable with `n_samples`	Many clusters, uneven cluster size, non-flat geometry	Distances between points
Spectral clustering	number of clusters	Medium `n_samples`, small `n_clusters`	Few clusters, even cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering	number of clusters or distance threshold	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints	Distances between points
Agglomerative clustering	number of clusters or distance threshold, linkage type, distance	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints, non Euclidean distances	Any pairwise distance
DBSCAN	neighborhood size	Very large `n_samples`, medium `n_clusters`	Non-flat geometry, uneven cluster sizes	Distances between nearest points
OPTICS	minimum cluster membership	Very large `n_samples`, large `n_clusters`	Non-flat geometry, uneven cluster sizes, variable cluster density	Distances between points
Gaussian mixtures	many	Not scalable	Flat geometry, good for density estimation	Mahalanobis distances to centers
Birch	branching factor, threshold, optional global clusterer.	Large `n_clusters` and `n_samples`	Large dataset, outlier removal, data reduction.	Euclidean distance between points

Scikit-learn - Clustering 聚类算法汇总

AIHGF • 2018 年 05 月 21 日

scikit-learning - Clustering

scikit-learning 提供了 sklearn.cluster 模块，以用于无标签数据的聚类.

sklearn.cluster 模块共提供了一下几种聚类算法：

[1] - K-means

[2] - Affinity Propagation

[3] - Mean Shift

[4] - Spectral clustering

[5] - Hierarchical clustering

[6] - DBSCAN

[7] - OPTICS

[8] - Birch

每一种聚类算法包含两部分：

[1] - 一个类(class)，实现 fit 方法，以对训练数据进行聚类.

[2] - 一个函数(function)，对于给定训练数据，返回对应于不同聚类的整数类别标签数组.

对于该 class，训练数据的聚类标签可以通过 labels_ 属性得到.

输入数据：

[1] - sklearn.cluster 中的聚类算法支持不同类型的矩阵作为输入.

[2] - 所有的聚类算法都支持标准的 [nsamples, nfeatures] 形式的数据矩阵.

特征提取模块：sklearn.feature_extraction

[3] - AffinityPropagation, SpectralClustering 和 DBSCAN 还支持 [nsamples, nsamples]形式的相似性矩阵作为输入.

相似性矩阵计算模块：sklearn.metrics.pairwise

1. 不同聚类算法对比

算法名	参数	Scalability	使用场景	Geometry (metric used)
K-Means	number of clusters	Very large `n_samples`, medium `n_clusters` with MiniBatch code	General-purpose, even cluster size, flat geometry, not too many clusters	Distances between points
Affinity propagation	damping, sample preference	Not scalable with n_samples	Many clusters, uneven cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Mean-shift	bandwidth	Not scalable with `n_samples`	Many clusters, uneven cluster size, non-flat geometry	Distances between points
Spectral clustering	number of clusters	Medium `n_samples`, small `n_clusters`	Few clusters, even cluster size, non-flat geometry	Graph distance (e.g. nearest-neighbor graph)
Ward hierarchical clustering	number of clusters or distance threshold	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints	Distances between points
Agglomerative clustering	number of clusters or distance threshold, linkage type, distance	Large `n_samples` and `n_clusters`	Many clusters, possibly connectivity constraints, non Euclidean distances	Any pairwise distance
DBSCAN	neighborhood size	Very large `n_samples`, medium `n_clusters`	Non-flat geometry, uneven cluster sizes	Distances between nearest points
OPTICS	minimum cluster membership	Very large `n_samples`, large `n_clusters`	Non-flat geometry, uneven cluster sizes, variable cluster density	Distances between points
Gaussian mixtures	many	Not scalable	Flat geometry, good for density estimation	Mahalanobis distances to centers
Birch	branching factor, threshold, optional global clusterer.	Large `n_clusters` and `n_samples`	Large dataset, outlier removal, data reduction.	Euclidean distance between points

Scikit-learn - Clustering 聚类算法汇总

1. 不同聚类算法对比

※相关文章推荐※

※最新文章推荐※

Leave a Comment Cancel reply

Scikit-learn - Clustering 聚类算法汇总

1. 不同聚类算法对比