Clustering Analysis and the Principles of K-means & DBSCAN


Clustering concepts

  • Unsupervised problem: no labels are available
  • Clustering: group similar items together
  • Difficulties: how to evaluate the result and how to tune the parameters

The K-means algorithm

Basic concepts

  • The number of clusters K must be specified in advance
  • Centroid: the mean of a cluster, i.e. the average over each vector dimension (recomputed at every iteration)
  • Distance metric: Euclidean distance and cosine similarity are common (normalize the data first)
  • Optimization objective (i.e. minimize the loss): $\min \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$ (see the sketch below)
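
A minimal NumPy check of this objective on made-up toy data (the array values here are chosen purely for illustration):

import numpy as np

# toy data: 4 points, 2 clusters (illustrative values only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])   # cluster assignment of each point
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])

# sum over clusters of squared Euclidean distances to the cluster centroid
objective = sum(((X[labels == i] - c) ** 2).sum()
                for i, c in enumerate(centroids))
print(objective)  # 4 points x 0.25 each = 1.0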

Procedure

  • First choose k, the number of clusters
  • Randomly pick k points as the initial centroids
  • Compute each point's distance to every centroid and assign it to the nearest cluster
  • With k clusters formed, recompute the centroids and repeat until the assignments no longer change (a full NumPy implementation appears in the code section below)

Strengths and weaknesses

Strengths:
  • Simple, fast, well suited to regular datasets
Weaknesses:
  • k is hard to determine
  • the running time grows linearly with the number of samples
  • clusters of arbitrary shape are hard to discover (e.g. rings, intertwined shapes)
  • the choice of initial centroids matters a great deal (so in practice the algorithm is run several times with different random initializations)

The DBSCAN algorithm

Basic concepts

  • Core object: a point whose density reaches the threshold set for the algorithm, i.e. its r-neighborhood contains at least minPts points (minPts is user-chosen; see the sketch after this list)
  • Epsilon-neighborhood distance threshold: the radius r (also user-chosen)
  • Directly density-reachable: if point p lies in the r-neighborhood of point q and q is a core point, then p is directly density-reachable from q
  • Density-reachable: if there is a sequence of points q0, q1, ..., qk such that each qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is in effect the "propagation" of direct density-reachability
  • Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected
  • Border point: a non-core point that belongs to a cluster; it cannot extend the cluster any further
  • Noise point: a point that belongs to no cluster and is not density-reachable from any core point
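
To make the core-point test concrete, here is a minimal sketch (the function names region_query and is_core_point are my own, not from any library, and the data values are illustrative):

import numpy as np

def region_query(X, i, r):
    # indices of all points within distance r of point i (its r-neighborhood)
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= r)[0]

def is_core_point(X, i, r, min_pts):
    # core point: the r-neighborhood holds at least min_pts points
    return len(region_query(X, i, r)) >= min_pts

X = np.array([[0, 0], [0.1, 0], [0, 0.1], [5, 5]])
print(is_core_point(X, 0, r=0.2, min_pts=3))   # True: points 0, 1, 2 fall within r
print(is_core_point(X, 3, r=0.2, min_pts=3))   # False: point 3 is isolated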

Procedure

  1. Input: dataset D, radius epsilon, density threshold MinPts
  2. Mark all objects as unvisited
  3. Randomly pick an unvisited object p and mark it visited
  4. Case 1: if the epsilon-neighborhood of p contains at least MinPts objects, create a new cluster C and add p to C
  5. Let N be the set of objects in the epsilon-neighborhood of p; for each point q in N:
  6. if q is unvisited, mark it visited; if the epsilon-neighborhood of q contains at least MinPts objects, add those objects to N;
  7. if q is not yet a member of any cluster, add q to C
  8. Output C
  9. Case 2: otherwise, mark p as noise
  10. Repeat until no object remains marked unvisited (a runnable library sketch follows)
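
The article implements K-means from scratch below but shows no DBSCAN code, so here is a hedged sketch using sklearn's DBSCAN on a shape K-means handles badly (the eps and min_samples values are illustrative, not tuned):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaved half-moons: clusters K-means cannot separate
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = radius r, min_samples = MinPts
labels = db.fit_predict(X_moons)      # the label -1 marks noise points
print(set(labels))                    # typically {0, 1}, plus -1 if any noise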

Parameter selection

The radius epsilon can be set from the k-distance by looking for the knee point: pick a point pi, compute its distances to all other points, and sort them in ascending order; the abrupt jump marks the cutoff (see the sketch below).
k-distance: given the dataset P = {pi, i = 0, 1, 2, ..., n}, compute the distances from each point to all other points and sort them in ascending order; the k-th value d(k) is called the k-distance.
MinPts: the k used for the k-distance; it is usually taken fairly small and tuned over several tries.
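
A sketch of the k-distance plot using sklearn's NearestNeighbors (k = 4 is an arbitrary illustrative choice; read epsilon off the knee where the curve bends sharply upward):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    # distance from each point to its k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0-th neighbor
    dists, _ = nn.kneighbors(X)
    k_dists = np.sort(dists[:, -1])                   # ascending, as described above
    plt.plot(k_dists)
    plt.xlabel('points sorted by k-distance')
    plt.ylabel(f'{k}-distance')
    plt.show()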

Strengths and weaknesses

Strengths
  • No need to specify the number of clusters
  • Can discover clusters of arbitrary shape
  • Good at finding outliers (detection tasks)
  • Only two parameters are needed
Weaknesses
  • Struggles somewhat with high-dimensional data (dimensionality reduction can help)
  • The parameters are hard to choose (and they strongly affect the result)
  • The sklearn implementation is slow on large datasets (consider data-reduction strategies)

K-means implementation in code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

class Kmeans:
    def __init__(self, data, num_clustres):
        self.data = data
        self.num_clustres = num_clustres

    def train(self, max_iterations):
        # 1. pick random initial centroids
        centroids = Kmeans.centroids_init(self.data, self.num_clustres)
        num_examples = self.data.shape[0]
        cloest_centroids_ids = np.empty((num_examples, 1))
        for _ in range(max_iterations):
            # 2. assign every sample to its nearest centroid
            cloest_centroids_ids = Kmeans.centroids_find_closest(self.data, centroids)
            # 3. move each centroid to the mean of its assigned samples
            centroids = Kmeans.centroids_compute(self.data, cloest_centroids_ids, self.num_clustres)
        return centroids, cloest_centroids_ids

    @staticmethod
    def centroids_init(data, num_clustres):
        # choose num_clustres distinct samples at random as the initial centroids
        num_examples = data.shape[0]
        random_ids = np.random.permutation(num_examples)
        centroids = data[random_ids[:num_clustres], :]
        return centroids

    @staticmethod
    def centroids_find_closest(data, centroids):
        # for each sample, find the index of the centroid with the smallest
        # squared Euclidean distance
        num_example = data.shape[0]
        num_centroids = centroids.shape[0]
        cloest_centroids_ids = np.zeros((num_example, 1))
        for example_index in range(num_example):
            distance = np.zeros((num_centroids, 1))
            for centroids_index in range(num_centroids):
                distance_diff = data[example_index, :] - centroids[centroids_index, :]
                distance[centroids_index] = np.sum(distance_diff ** 2)
            cloest_centroids_ids[example_index] = np.argmin(distance)
        return cloest_centroids_ids

    @staticmethod
    def centroids_compute(data, cloest_centroids_ids, num_clustres):
        # recompute each centroid as the mean of the samples assigned to it
        num_features = data.shape[1]
        centroids = np.zeros((num_clustres, num_features))
        for centroids_id in range(num_clustres):
            cloest_ids = cloest_centroids_ids == centroids_id
            centroids[centroids_id] = np.mean(data[cloest_ids.flatten(), :], axis=0)
        return centroids

The iris dataset

data = pd.read_csv('Desktop\数据分析\data\iris.csv')
data.head()

   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa


new_data = data.drop(columns=['Unnamed: 0','Species'])
def x_stander(x):
    # standardize each column: zero mean, unit variance
    x_mean = np.mean(x)
    x_std = np.std(x)
    return (x - x_mean) / x_std
data_std = x_stander(new_data)
# eigenvalues and eigenvectors of the covariance matrix (PCA by hand)
ew, ev = np.linalg.eig(np.cov(data_std.T))
ew, ev
(array([2.93808505, 0.9201649 , 0.14774182, 0.02085386]),
 array([[ 0.52106591, -0.37741762, -0.71956635,  0.26128628],
        [-0.26934744, -0.92329566,  0.24438178, -0.12350962],
        [ 0.5804131 , -0.02449161,  0.14212637, -0.80144925],
        [ 0.56485654, -0.06694199,  0.63427274,  0.52359713]]))
ew_sort = ew[np.argsort(ew)[::-1]]
# eigenvectors are the COLUMNS of ev, so sort columns; the original indexed rows,
# which only worked here because ew happened to come out already sorted
ev_sort = ev[:, np.argsort(ew)[::-1]]
sum_total = np.sum(ew_sort)
sum_i = 0
n = 1
for i in ew_sort:
    sum_i += i
    percent = round(sum_i / sum_total, 2) * 100
    print(f'Cumulative variance ratio of the first {n} component(s): {percent}%')
    n += 1
Cumulative variance ratio of the first 1 component(s): 73.0%
Cumulative variance ratio of the first 2 component(s): 96.0%
Cumulative variance ratio of the first 3 component(s): 99.0%
Cumulative variance ratio of the first 4 component(s): 100.0%
V = ev_sort[:, :2]            # keep the top-2 principal directions
data_1 = new_data.dot(V)      # note: projects the unstandardized data, as in the original post
data_1.head()

          0         1
0  2.640270 -5.204041
1  2.670730 -4.666910
2  2.454606 -4.773636
3  2.545517 -4.648463
4  2.561228 -5.258629

data['Species'].value_counts()
iris_types=['setosa','versicolor','virginica']
plt.figure(figsize=(12,5))
plt.scatter(data_1[0][:],data_1[1][:])
<matplotlib.collections.PathCollection at 0x27311641be0>

[Figure: the iris samples projected onto the first two principal components]
num_example = data_1.shape[0]
data_2 = data_1[:].values.reshape(num_example,2)
num_clustres = 3
max_iterations = 50
data_2.dtype
dtype('float64')
kmeans = Kmeans(data_2,num_clustres)
centroids,cloest_centroids_ids = kmeans.train(max_iterations)
plt.figure(figsize=(12,8))
for cloest_centroids_id, centroid in enumerate(centroids):
    current_examples_index = (cloest_centroids_id == cloest_centroids_ids).flatten()
    plt.scatter(data_2[:,0][current_examples_index],data_2[:,1][current_examples_index],label=cloest_centroids_id)

plt.show()

[Figure: K-means clustering result (k = 3) on the projected iris data]

Clustering algorithms in practice

  • The K-means and DBSCAN algorithms
  • Approaches to semi-supervised problems
  • Evaluating clustering algorithms
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize']=14
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2,2.3],
                         [-1.5,2.3],
                         [-2.8,1.8],
                         [-2.8,2.8],
                         [-2.8,1.3]])

blob_std = np.array([0.4,0.3,0.1,0.1,0.1])
X,y = make_blobs(n_samples=2000,centers=blob_centers,n_features=2,
                 cluster_std=blob_std,random_state=7)

X,y
(array([[-2.69823941,  1.3454702 ],
        [-2.87459835,  1.8097575 ],
        [ 0.96077126,  1.17046777],
        ...,
        [-2.80303543,  2.72948115],
        [ 0.24057359,  2.40103109],
        [-2.63807768,  1.95621065]]),
 array([4, 2, 0, ..., 3, 0, 2]))
def plot_clusters(X,y=None):
    plt.scatter(X[:,0],X[:,1],c=y,s=1)
    plt.xlabel('$x_1$',fontsize=14)
    plt.ylabel('$y_1$',fontsize=14,rotation=0)
plt.figure(figsize=(8,4))
plot_clusters(X)
plt.show()

[Figure: scatter plot of the five generated blobs]

Decision boundaries

from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters = k, random_state = 42)
y_pred = kmeans.fit_predict(X)

fit_predict(X) returns the same labels as kmeans.labels_
kmeans.cluster_centers_ holds the cluster centers
transform gives each point's distance to every cluster center

y_pred,kmeans.labels_
(array([4, 1, 0, ..., 3, 0, 1]), array([4, 1, 0, ..., 3, 0, 1]))
kmeans.cluster_centers_
array([[ 0.20876306,  2.25551336],
       [-2.80389616,  1.80117999],
       [-1.46679593,  2.28585348],
       [-2.79290307,  2.79641063],
       [-2.80037642,  1.30082566]])
X_new = np.array([[0,2],[3,2],[-3,3],[-3,2.5]])
kmeans.predict(X_new)
array([0, 0, 3, 3])
kmeans.transform(X_new)
array([[0.32995317, 2.81093633, 1.49439034, 2.9042344 , 2.88633901],
       [2.80290755, 5.80730058, 4.4759332 , 5.84739223, 5.84236351],
       [3.29399768, 1.21475352, 1.69136631, 0.29040966, 1.71086031],
       [3.21806371, 0.72581411, 1.54808703, 0.36159148, 1.21567622]])
def plot_data(X):
    plt.plot(X[:,0],X[:,1],'k.',markersize=2)

def plot_centroids(centroids,weights=None,circle_color='w',cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max()/10]
    plt.scatter(centroids[:,0],centroids[:,1],
               marker='o',s=50,linewidths=8,color=circle_color,
               zorder=11,alpha=0.9)
    plt.scatter(centroids[:,0],centroids[:,1],
               marker='x',s=10,linewidths=10,color=cross_color,
               zorder=11,alpha=1)

def plot_decision_boundaries(clusterer,X,resolution=1000,show_centroids=True,
                            show_xlabels=True,show_ylabels=True):
    mins = X.min(axis=0)-0.1
    maxs = X.max(axis=0)+0.1
    xx,yy = np.meshgrid(np.linspace(mins[0],maxs[0],resolution),
                       np.linspace(mins[1],maxs[1],resolution))
    Z = clusterer.predict(np.c_[xx.ravel(),yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(Z,extent=(mins[0],maxs[0],mins[1],maxs[1]),
                 cmap='Pastel2')
    plt.contour(Z,extent=(mins[0],maxs[0],mins[1],maxs[1]),
                 linewidths=1,colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)
    if show_xlabels:
        plt.xlabel('$x_1$',fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel('$y_1$',fontsize=14,rotation=0)
    else:
        plt.tick_params(labelleft=False)
plt.figure(figsize=(8,4))
plot_decision_boundaries(kmeans,X)
plt.show()

[Figure: K-means decision boundaries for k = 5]

Algorithm workflow

  • n_clusters: the number of clusters
  • init: how the initial centroids are placed; 'random', or a user-supplied ndarray
  • n_init: defaults to 10; the algorithm is run 10 times and the best run is kept
  • max_iter: the maximum number of iterations
kmeans_iter1 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=1,random_state=1)
kmeans_iter2 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=2,random_state=1)
kmeans_iter3 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=3,random_state=1)

kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)
KMeans(init='random', max_iter=3, n_clusters=5, n_init=1, random_state=1)
plt.figure(figsize=(12,8))

plt.subplot(321)
plot_data(X)
plot_centroids(kmeans_iter1.cluster_centers_,circle_color='r',cross_color='k')
plt.title('Update cluster_centers')

plt.subplot(322)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.subplot(323)

plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter2.cluster_centers_)

plt.subplot(324)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.subplot(325)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter3.cluster_centers_)

plt.subplot(326)
plot_decision_boundaries(kmeans_iter3,X,show_xlabels=False,show_ylabels=False)
plt.title('label')

plt.show()

[Figure: centroid updates (left) and the resulting labels (right) over the first three iterations]

Unstable results

Different initial centroid positions can make the iterations converge to different results.

def plot_clusterer_comparison(c1,c2,X):
    c1.fit(X)
    c2.fit(X)

    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plot_decision_boundaries(c1,X)
    plt.subplot(122)
    plot_decision_boundaries(c2,X)
c1 = KMeans(n_clusters=5,init='random',n_init=1,random_state=11)
c2 = KMeans(n_clusters=5,init='random',n_init=1,random_state=19)
plot_clusterer_comparison(c1,c2,X)

[Figure: two K-means runs with different random initializations converging to different solutions]

Evaluation metrics

  • Inertia: the sum of squared distances from each sample to its assigned centroid (verified against kmeans.inertia_ below)
kmeans.inertia_

211.59853725816828
X_dist = kmeans.transform(X)
X_dist,kmeans.labels_
(array([[3.04611916, 0.46779778, 1.54944305, 1.45402521, 0.11146795],
        [3.11541584, 0.07122059, 1.48612753, 0.99002955, 0.51431557],
        [1.32016676, 3.81713488, 2.67154781, 4.09069201, 3.76340605],
        ...,
        [3.04886464, 0.92830156, 1.40795651, 0.06769209, 1.42865797],
        [0.14895409, 3.10300136, 1.71125   , 3.05913478, 3.23385668],
        [2.8625311 , 0.22700281, 1.21678483, 0.85434589, 0.67518173]]),
 array([4, 1, 0, ..., 3, 0, 1]))
X_dist[np.arange(len(X_dist)),kmeans.labels_]

array([0.11146795, 0.07122059, 1.32016676, ..., 0.06769209, 0.14895409,
       0.22700281])
np.sum(X_dist[np.arange(len(X_dist)),kmeans.labels_]**2)
211.59853725816862
kmeans.score(X)
-211.59853725816836
c1.inertia_,c2.inertia_
(223.2910857281904, 237.4624916944286)

Finding the best number of clusters

As k grows, the inertia necessarily keeps shrinking, so its raw value alone cannot select k.

The elbow method

kmeans_per_k = [KMeans(n_clusters = k).fit(X) for k in range(1,10)]
inertias = [model.inertia_ for model in kmeans_per_k]
plt.figure(figsize=(8,4))
plt.plot(range(1,10),inertias,'bo-')
plt.axis([1.8,5,0,1300])
plt.show()

[Figure: inertia versus the number of clusters k (elbow plot)]

Silhouette coefficient

  • ai: the average distance from sample i to the other samples in its own cluster. The smaller ai, the better sample i fits that cluster; ai is called the intra-cluster dissimilarity of sample i.
  • bi: for each other cluster cj, compute the average distance bij from sample i to all samples of cj, called the dissimilarity between sample i and cluster cj. The inter-cluster dissimilarity of sample i is then defined as bi = min{bi1, bi2, ..., bik}.

$s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}$

$s_i = \begin{cases} 1 - \frac{a_i}{b_i}, & a_i < b_i \\ 0, & a_i = b_i \\ \frac{b_i}{a_i} - 1, & a_i > b_i \end{cases}$

Conclusions:

  • si close to 1: sample i is clustered reasonably
  • si close to -1: sample i would fit better in another cluster
  • si near 0: sample i lies on the boundary between two clusters (see the sklearn sketch below)
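
sklearn can compute the mean silhouette coefficient directly, giving a second way (besides the elbow plot above) to compare candidate values of k; a sketch reusing the blobs data X from earlier:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# mean silhouette coefficient for each candidate k (higher is better)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))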


Original: https://blog.csdn.net/qq_45003520/article/details/123688989
Author: indigo女孩
Title: 聚类分析和K-means&Dbscan原理
