Clustering Analysis and the Principles of K-means & DBSCAN
Clustering Concepts
- Unsupervised problem: no labels are available
- Clustering: group similar items together
- Difficulties: how to evaluate the result and how to tune the parameters
The K-means Algorithm
Basic Concepts
- The number of clusters must be specified up front as the value K
- Centroid: the mean of a cluster, i.e. the average of each vector dimension (used during the iterations)
- Distance metric: commonly Euclidean distance or cosine similarity (normalize the data first)
- Optimization objective (i.e. minimize the loss): $\min\sum_{i=1}^{K}\sum_{x\in C_i}\mathrm{dist}(c_i,x)^2$
Procedure
- First choose a value of k, which fixes the number of clusters
- Randomly pick k points as the initial centroids
- Compute the distance from every point to each centroid and assign each point to the cluster of its nearest centroid
- Once the k clusters are formed, recompute the centroids, then repeat the steps above until no point changes cluster
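The assignment/update loop above can be sketched in a few lines of NumPy on a toy one-dimensional dataset (a fuller implementation appears later in the post; the data here is made up for illustration):

```python
import numpy as np

def kmeans_step(data, centroids):
    # Assignment step: distance from every point to every centroid
    dists = np.abs(data[:, None] - centroids[None, :])
    labels = np.argmin(dists, axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([data[labels == k].mean() for k in range(len(centroids))])
    return labels, new_centroids

data = np.array([1.0, 1.2, 0.8, 8.0, 8.2, 7.8])
centroids = np.array([0.0, 10.0])   # k = 2, arbitrary starting positions
for _ in range(5):                  # in practice: loop until labels stop changing
    labels, centroids = kmeans_step(data, centroids)
print(labels)     # [0 0 0 1 1 1]
print(centroids)  # [1. 8.]
```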
Strengths and Weaknesses
Strengths:
- Simple, fast, and works well on regular (blob-like) datasets
Weaknesses:
- The value of k is hard to determine
- Complexity scales linearly with the number of samples
- Struggles to discover clusters of arbitrary shape (e.g. ring-shaped or intertwined clusters)
- The choice of initial centroids matters a lot (so in practice the algorithm is run several times with different random initializations)
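The arbitrary-shape weakness is easy to demonstrate by comparing K-means with DBSCAN (introduced below) on sklearn's two-moons data; the dataset and parameter values here are illustrative choices, not from the original post:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons: clusters that are not convex blobs
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

def accuracy(true, pred):
    # Cluster IDs are arbitrary, so score the better of the two matchings
    return max(np.mean(true == pred), np.mean(true == 1 - pred))

print('KMeans:', accuracy(y, km_labels))  # splits the moons with a straight cut
print('DBSCAN:', accuracy(y, db_labels))  # recovers both moons
```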
The DBSCAN Algorithm
Basic Concepts
- Core point: a point whose density reaches the threshold set for the algorithm, i.e. its r-neighborhood contains at least minPts points (minPts is user-specified)
- epsilon-neighborhood distance threshold: the user-specified radius r
- Directly density-reachable: if a point p lies in the r-neighborhood of a point q, and q is a core point, then p is directly density-reachable from q
- Density-reachable: if there is a sequence of points q0, q1, …, qk in which every qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is essentially the "propagation" of direct density-reachability
- Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected.
- Border point: a non-core point that belongs to some cluster; it cannot "recruit" further points into the cluster
- Noise point: a point that belongs to no cluster and is not density-reachable from any core point
Procedure
- Input: dataset D, radius epsilon, density threshold MinPts
- Mark all objects as unvisited
- Randomly pick an unvisited object p and mark it as visited
- Case 1: if the epsilon-neighborhood of p contains at least MinPts objects, create a new cluster C and add p to C
- Let N be the set of objects in the epsilon-neighborhood of p; for every point q in N:
- If q is unvisited, mark it as visited; if the epsilon-neighborhood of q contains at least MinPts objects, add those objects to N;
- If q is not yet a member of any cluster, add q to C
- Output C
- Case 2: otherwise, mark p as noise
- Repeat until no object is left marked unvisited
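The procedure above is what sklearn.cluster.DBSCAN implements; a minimal usage sketch on a hand-made toy dataset (the coordinates and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should end up as noise
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1], [8.0, 7.9],
              [4.0, 12.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)               # [ 0  0  0  0  1  1  1  1 -1]; noise is labelled -1
print(db.core_sample_indices_)  # indices of the core points
```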
Parameter Selection
The radius epsilon can be set from the k-distance curve by looking for the knee, i.e. the abrupt change point (pick a point pi, compute the distances from pi to all other points, sort these distances in ascending order, and use the abrupt jump as the cut-off).
k-distance: given a dataset P = {pi, i = 0, 1, 2, …, n}, compute the distances from P(i) to all points in the subset S, and sort them in ascending order; the k-th of these sorted distances, d(k), is called the k-distance.
MinPts: the k used in the k-distance; it is usually taken fairly small and tuned over several attempts
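The sorted k-distance curve described above can be computed with sklearn's NearestNeighbors; a sketch on synthetic data (the blob layout and the k value are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two dense blobs plus a sprinkling of uniform background noise
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2)),
               rng.uniform(-2, 7, (20, 2))])

k = 4  # candidate MinPts; the k-distance is the distance to the k-th neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbour
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])  # ascending k-distance curve
print(k_dist[0], k_dist[-1])   # dense points have a small k-distance, noise a large one
# plt.plot(k_dist) would show the curve; set eps near the knee where it jumps
```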
Strengths and Weaknesses
Strengths
- No need to specify the number of clusters
- Can discover clusters of arbitrary shape
- Good at finding outliers (useful for detection tasks)
- Only two parameters are needed
Weaknesses
- Struggles somewhat with high-dimensional data (dimensionality reduction can help)
- The parameters are hard to choose (and have a very large effect on the result)
- The sklearn implementation is quite slow (data-reduction strategies can help)
K-means Algorithm: Code Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
class Kmeans:
    def __init__(self, data, num_clustres):
        self.data = data
        self.num_clustres = num_clustres

    def train(self, max_iterations):
        # Initialize centroids, then alternate assignment and update steps
        centroids = Kmeans.centroids_init(self.data, self.num_clustres)
        num_examples = self.data.shape[0]
        cloest_centroids_ids = np.empty((num_examples, 1))
        for _ in range(max_iterations):
            cloest_centroids_ids = Kmeans.centroids_find_closest(self.data, centroids)
            centroids = Kmeans.centroids_compute(self.data, cloest_centroids_ids, self.num_clustres)
        return centroids, cloest_centroids_ids

    @staticmethod
    def centroids_init(data, num_clustres):
        # Pick num_clustres distinct samples at random as the initial centroids
        num_examples = data.shape[0]
        random_ids = np.random.permutation(num_examples)
        centroids = data[random_ids[:num_clustres], :]
        return centroids

    @staticmethod
    def centroids_find_closest(data, centroids):
        # Assignment step: label each sample with the index of its nearest centroid
        num_example = data.shape[0]
        num_centroids = centroids.shape[0]
        cloest_centroids_ids = np.zeros((num_example, 1))
        for example_index in range(num_example):
            distance = np.zeros((num_centroids, 1))
            for centroids_index in range(num_centroids):
                distance_diff = data[example_index, :] - centroids[centroids_index, :]
                distance[centroids_index] = np.sum(distance_diff ** 2)
            cloest_centroids_ids[example_index] = np.argmin(distance)
        return cloest_centroids_ids

    @staticmethod
    def centroids_compute(data, cloest_centroids_ids, num_clustres):
        # Update step: move each centroid to the mean of its assigned samples
        num_features = data.shape[1]
        centroids = np.zeros((num_clustres, num_features))
        for centroids_id in range(num_clustres):
            cloest_ids = cloest_centroids_ids == centroids_id
            centroids[centroids_id] = np.mean(data[cloest_ids.flatten(), :], axis=0)
        return centroids
The Iris Dataset
data = pd.read_csv('Desktop\数据分析\data\iris.csv')
data.head()
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
new_data = data.drop(columns=['Unnamed: 0','Species'])
def x_stander(x):
    # Standardize each column to zero mean and unit variance
    x_mean = np.mean(x)
    x_std = np.std(x)
    return (x - x_mean) / x_std
data_std = x_stander(new_data)
ew,ev = np.linalg.eig(np.cov(data_std.T))
ew,ev
(array([2.93808505, 0.9201649 , 0.14774182, 0.02085386]),
array([[ 0.52106591, -0.37741762, -0.71956635, 0.26128628],
[-0.26934744, -0.92329566, 0.24438178, -0.12350962],
[ 0.5804131 , -0.02449161, 0.14212637, -0.80144925],
[ 0.56485654, -0.06694199, 0.63427274, 0.52359713]]))
ew_sort = ew[np.argsort(ew)[::-1]]
ev_sort = ev[:, np.argsort(ew)[::-1]]  # eigenvectors are columns, so reorder columns, not rows
sum_total = np.sum(ew_sort)
sum_i = 0
n = 1
for i in ew_sort:
    sum_i += i
    percent = round(sum_i / sum_total, 2) * 100
    print(f'Cumulative variance ratio of the first {n} component(s): {percent}%')
    n += 1
Cumulative variance ratio of the first 1 component(s): 73.0%
Cumulative variance ratio of the first 2 component(s): 96.0%
Cumulative variance ratio of the first 3 component(s): 99.0%
Cumulative variance ratio of the first 4 component(s): 100.0%
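The manual eigendecomposition route above can be cross-checked against sklearn.decomposition.PCA; a sketch on synthetic standardized data (so the block does not depend on the local iris.csv path):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # same standardization as x_stander

# Manual route: eigenvalues of the covariance matrix, sorted descending
ew, ev = np.linalg.eig(np.cov(X_std.T))
ratio_manual = np.sort(ew)[::-1] / ew.sum()

# Library route: explained variance ratio per principal component
ratio_pca = PCA(n_components=4).fit(X_std).explained_variance_ratio_
print(np.allclose(ratio_manual, ratio_pca))  # True
```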
V = ev_sort[:,:2]
data_1 = new_data.dot(V)
data_1.head()
          0         1
0  2.640270 -5.204041
1  2.670730 -4.666910
2  2.454606 -4.773636
3  2.545517 -4.648463
4  2.561228 -5.258629
data['Species'].value_counts()
iris_types=['setosa','versicolor','virginica']
plt.figure(figsize=(12,5))
plt.scatter(data_1[0][:],data_1[1][:])
<matplotlib.collections.PathCollection at 0x27311641be0>
num_example = data_1.shape[0]
data_2 = data_1[:].values.reshape(num_example,2)
num_clustres = 3
max_iterations = 50
data_2.dtype
dtype('float64')
kmeans = Kmeans(data_2,num_clustres)
centroids,cloest_centroids_ids = kmeans.train(max_iterations)
plt.figure(figsize=(12,8))
for cloest_centroids_id, centroid in enumerate(centroids):
    current_examples_index = (cloest_centroids_id == cloest_centroids_ids).flatten()
    plt.scatter(data_2[:,0][current_examples_index], data_2[:,1][current_examples_index], label=cloest_centroids_id)
plt.show()
Clustering Algorithms in Practice
- The K-means and DBSCAN algorithms
- An approach to semi-supervised problems
- Evaluating clustering algorithms
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize']=14
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
from sklearn.datasets import make_blobs
blob_centers = np.array([[0.2,2.3],
[-1.5,2.3],
[-2.8,1.8],
[-2.8,2.8],
[-2.8,1.3]])
blob_std = np.array([0.4,0.3,0.1,0.1,0.1])
X,y = make_blobs(n_samples=2000,centers=blob_centers,n_features=2,
cluster_std=blob_std,random_state=7)
X,y
(array([[-2.69823941, 1.3454702 ],
[-2.87459835, 1.8097575 ],
[ 0.96077126, 1.17046777],
...,
[-2.80303543, 2.72948115],
[ 0.24057359, 2.40103109],
[-2.63807768, 1.95621065]]),
array([4, 2, 0, ..., 3, 0, 2]))
def plot_clusters(X, y=None):
    plt.scatter(X[:,0], X[:,1], c=y, s=1)
    plt.xlabel('$x_1$', fontsize=14)
    plt.ylabel('$y_1$', fontsize=14, rotation=0)
plt.figure(figsize=(8,4))
plot_clusters(X)
plt.show()
Decision Boundaries
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters = k, random_state = 42)
y_pred = kmeans.fit_predict(X)
fit_predict(X) returns the same result as kmeans.labels_
kmeans.cluster_centers_: the coordinates of the cluster centers
transform: the distance from each point to each cluster center
y_pred,kmeans.labels_
(array([4, 1, 0, ..., 3, 0, 1]), array([4, 1, 0, ..., 3, 0, 1]))
kmeans.cluster_centers_
array([[ 0.20876306, 2.25551336],
[-2.80389616, 1.80117999],
[-1.46679593, 2.28585348],
[-2.79290307, 2.79641063],
[-2.80037642, 1.30082566]])
X_new = np.array([[0,2],[3,2],[-3,3],[-3,2.5]])
kmeans.predict(X_new)
array([0, 0, 3, 3])
kmeans.transform(X_new)
array([[0.32995317, 2.81093633, 1.49439034, 2.9042344 , 2.88633901],
[2.80290755, 5.80730058, 4.4759332 , 5.84739223, 5.84236351],
[3.29399768, 1.21475352, 1.69136631, 0.29040966, 1.71086031],
[3.21806371, 0.72581411, 1.54808703, 0.36159148, 1.21567622]])
def plot_data(X):
    plt.plot(X[:,0], X[:,1], 'k.', markersize=2)

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max()/10]
    plt.scatter(centroids[:,0], centroids[:,1],
                marker='o', s=50, linewidths=8, color=circle_color,
                zorder=11, alpha=0.9)
    plt.scatter(centroids[:,0], centroids[:,1],
                marker='x', s=10, linewidths=10, color=cross_color,
                zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                 cmap='Pastel2')
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)
    if show_xlabels:
        plt.xlabel('$x_1$', fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel('$y_1$', fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
plt.figure(figsize=(8,4))
plot_decision_boundaries(kmeans,X)
plt.show()
Algorithm Workflow
- n_clusters: the number of clusters
- init: placement of the initial centroids; 'random' for random placement, or an explicit ndarray of positions
- n_init: defaults to 10; the algorithm is run 10 times and the best run is kept
- max_iter: the maximum number of iterations
kmeans_iter1 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=1,random_state=1)
kmeans_iter2 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=2,random_state=1)
kmeans_iter3 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=3,random_state=1)
kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)
KMeans(init='random', max_iter=3, n_clusters=5, n_init=1, random_state=1)
plt.figure(figsize=(12,8))
plt.subplot(321)
plot_data(X)
plot_centroids(kmeans_iter1.cluster_centers_,circle_color='r',cross_color='k')
plt.title('Update cluster_centers')
plt.subplot(322)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.subplot(323)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter2.cluster_centers_)
plt.subplot(324)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.subplot(325)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter3.cluster_centers_)
plt.subplot(326)
plot_decision_boundaries(kmeans_iter3,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.show()
Unstable Results
Different initial positions make the iterations converge to different results
def plot_clusterer_comparison(c1, c2, X):
    c1.fit(X)
    c2.fit(X)
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plot_decision_boundaries(c1, X)
    plt.subplot(122)
    plot_decision_boundaries(c2, X)
c1 = KMeans(n_clusters=5,init='random',n_init=1,random_state=11)
c2 = KMeans(n_clusters=5,init='random',n_init=1,random_state=19)
plot_clusterer_comparison(c1,c2,X)
Evaluation Metrics
- Inertia: the sum of squared distances from each sample to its own centroid (lower means tighter clusters)
kmeans.inertia_
211.59853725816828
X_dist = kmeans.transform(X)
X_dist,kmeans.labels_
(array([[3.04611916, 0.46779778, 1.54944305, 1.45402521, 0.11146795],
[3.11541584, 0.07122059, 1.48612753, 0.99002955, 0.51431557],
[1.32016676, 3.81713488, 2.67154781, 4.09069201, 3.76340605],
...,
[3.04886464, 0.92830156, 1.40795651, 0.06769209, 1.42865797],
[0.14895409, 3.10300136, 1.71125 , 3.05913478, 3.23385668],
[2.8625311 , 0.22700281, 1.21678483, 0.85434589, 0.67518173]]),
array([4, 1, 0, ..., 3, 0, 1]))
X_dist[np.arange(len(X_dist)),kmeans.labels_]
array([0.11146795, 0.07122059, 1.32016676, ..., 0.06769209, 0.14895409,
0.22700281])
np.sum(X_dist[np.arange(len(X_dist)),kmeans.labels_]**2)
211.59853725816862
kmeans.score(X)
-211.59853725816836
c1.inertia_,c2.inertia_
(223.2910857281904, 237.4624916944286)
Finding the Best Number of Clusters
As k grows, the inertia is guaranteed to keep shrinking, so inertia alone cannot pick k
The Elbow Method
kmeans_per_k = [KMeans(n_clusters = k).fit(X) for k in range(1,10)]
inertias = [model.inertia_ for model in kmeans_per_k]
plt.figure(figsize=(8,4))
plt.plot(range(1,10),inertias,'bo-')
plt.axis([1.8,5,0,1300])
plt.show()
Silhouette Coefficient
- a_i: the average distance from sample i to the other samples in its own cluster. The smaller a_i is, the more sample i belongs in that cluster; a_i is called the intra-cluster dissimilarity of sample i.
- b_i: the average distance from sample i to all samples of another cluster c_j is the dissimilarity between sample i and cluster c_j, written b_ij; the inter-cluster dissimilarity of sample i is defined as b_i = min{b_i1, b_i2, …, b_ik}
$s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}$
$s_i = \begin{cases} 1 - \frac{a_i}{b_i}, & a_i < b_i \\ 0, & a_i = b_i \\ \frac{b_i}{a_i} - 1, & a_i > b_i \end{cases}$
Conclusions:
- s_i close to 1: sample i is clustered reasonably
- s_i close to -1: sample i should rather be assigned to another cluster
- s_i near 0: sample i lies on the boundary between two clusters
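These quantities need not be computed by hand: sklearn.metrics provides silhouette_score (the mean of s_i over all samples). A sketch that scans candidate k values on synthetic blob data (the dataset here is an illustrative stand-in):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated blobs, so the true cluster count is 4
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette coefficient in [-1, 1]
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)  # highest mean silhouette -> preferred k
print('best k:', best_k)
```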
Original: https://blog.csdn.net/qq_45003520/article/details/123688989
Author: indigo女孩
Title: Clustering Analysis and the Principles of K-means & DBSCAN
Original articles are protected by copyright. Please cite the source when reprinting: https://www.johngo689.com/550134/
Reprinted articles remain under the original author's copyright. Please credit the original author when reprinting!