Clustering Analysis and the Principles of K-means & DBSCAN
Clustering Concepts
- Unsupervised problem: no labels are available
- Clustering: group similar items together
- Difficulties: how to evaluate the result and how to tune the parameters
The K-means Algorithm
Basic Concepts
- The number of clusters must be specified up front as the value K
- Centroid: the mean of a cluster, i.e. the average of each vector dimension (used during the iterations)
- Distance metric: commonly Euclidean distance or cosine similarity (normalize the data first)
- Optimization objective (i.e. minimize the loss): $\min\sum_{i=1}^{K}\sum_{x\in C_i}\mathrm{dist}(c_i,x)^2$
Procedure
- First choose a value of k, which fixes the number of clusters
- Randomly pick k points as the initial centroids
- Compute the distance from every point to each centroid and assign each point to the cluster of its nearest centroid
- Once the k clusters are formed, recompute the centroids, then repeat the steps above until no point changes cluster
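The assignment/update loop above can be sketched in a few lines of NumPy on a toy one-dimensional dataset (a fuller implementation appears later in the post; the data here is made up for illustration):

```python
import numpy as np

def kmeans_step(data, centroids):
    # Assignment step: distance from every point to every centroid
    dists = np.abs(data[:, None] - centroids[None, :])
    labels = np.argmin(dists, axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([data[labels == k].mean() for k in range(len(centroids))])
    return labels, new_centroids

data = np.array([1.0, 1.2, 0.8, 8.0, 8.2, 7.8])
centroids = np.array([0.0, 10.0])   # k = 2, arbitrary starting positions
for _ in range(5):                  # in practice: loop until labels stop changing
    labels, centroids = kmeans_step(data, centroids)
print(labels)     # [0 0 0 1 1 1]
print(centroids)  # [1. 8.]
```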
Strengths and Weaknesses
Strengths:
- Simple, fast, and works well on regular (blob-like) datasets
Weaknesses:
- The value of k is hard to determine
- Complexity scales linearly with the number of samples
- Struggles to discover clusters of arbitrary shape (e.g. ring-shaped or intertwined clusters)
- The choice of initial centroids matters a lot (so in practice the algorithm is run several times with different random initializations)
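The arbitrary-shape weakness is easy to demonstrate by comparing K-means with DBSCAN (introduced below) on sklearn's two-moons data; the dataset and parameter values here are illustrative choices, not from the original post:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaved half-moons: clusters that are not convex blobs
X, y = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

def accuracy(true, pred):
    # Cluster IDs are arbitrary, so score the better of the two matchings
    return max(np.mean(true == pred), np.mean(true == 1 - pred))

print('KMeans:', accuracy(y, km_labels))  # splits the moons with a straight cut
print('DBSCAN:', accuracy(y, db_labels))  # recovers both moons
```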
The DBSCAN Algorithm
Basic Concepts
- Core point: a point whose density reaches the threshold set for the algorithm, i.e. its r-neighborhood contains at least minPts points (minPts is user-specified)
- epsilon-neighborhood distance threshold: the user-specified radius r
- Directly density-reachable: if a point p lies in the r-neighborhood of a point q, and q is a core point, then p is directly density-reachable from q
- Density-reachable: if there is a sequence of points q0, q1, …, qk in which every qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is essentially the "propagation" of direct density-reachability
- Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected.
- Border point: a non-core point that belongs to some cluster; it cannot "recruit" further points into the cluster
- Noise point: a point that belongs to no cluster and is not density-reachable from any core point
Procedure
- Input: dataset D, radius epsilon, density threshold MinPts
- Mark all objects as unvisited
- Randomly pick an unvisited object p and mark it as visited
- Case 1: if the epsilon-neighborhood of p contains at least MinPts objects, create a new cluster C and add p to C
- Let N be the set of objects in the epsilon-neighborhood of p; for every point q in N:
- If q is unvisited, mark it as visited; if the epsilon-neighborhood of q contains at least MinPts objects, add those objects to N;
- If q is not yet a member of any cluster, add q to C
- Output C
- Case 2: otherwise, mark p as noise
- Repeat until no object is left marked unvisited
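The procedure above is what sklearn.cluster.DBSCAN implements; a minimal usage sketch on a hand-made toy dataset (the coordinates and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should end up as noise
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [8.0, 8.0], [8.1, 8.0], [7.9, 8.1], [8.0, 7.9],
              [4.0, 12.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)               # [ 0  0  0  0  1  1  1  1 -1]; noise is labelled -1
print(db.core_sample_indices_)  # indices of the core points
```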
Parameter Selection
The radius epsilon can be set from the k-distance curve by looking for the knee, i.e. the abrupt change point (pick a point pi, compute the distances from pi to all other points, sort these distances in ascending order, and use the abrupt jump as the cut-off).
k-distance: given a dataset P = {pi, i = 0, 1, 2, …, n}, compute the distances from P(i) to all points in the subset S, and sort them in ascending order; the k-th of these sorted distances, d(k), is called the k-distance.
MinPts: the k used in the k-distance; it is usually taken fairly small and tuned over several attempts
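The sorted k-distance curve described above can be computed with sklearn's NearestNeighbors; a sketch on synthetic data (the blob layout and the k value are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Two dense blobs plus a sprinkling of uniform background noise
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2)),
               rng.uniform(-2, 7, (20, 2))])

k = 4  # candidate MinPts; the k-distance is the distance to the k-th neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbour
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])  # ascending k-distance curve
print(k_dist[0], k_dist[-1])   # dense points have a small k-distance, noise a large one
# plt.plot(k_dist) would show the curve; set eps near the knee where it jumps
```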
Strengths and Weaknesses
Strengths
- No need to specify the number of clusters
- Can discover clusters of arbitrary shape
- Good at finding outliers (useful for detection tasks)
- Only two parameters are needed
Weaknesses
- Struggles somewhat with high-dimensional data (dimensionality reduction can help)
- The parameters are hard to choose (and have a very large effect on the result)
- The sklearn implementation is quite slow (data-reduction strategies can help)
K-means Algorithm: Code Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
class Kmeans:
    def __init__(self, data, num_clustres):
        self.data = data
        self.num_clustres = num_clustres

    def train(self, max_iterations):
        # Initialize centroids, then alternate assignment and update steps
        centroids = Kmeans.centroids_init(self.data, self.num_clustres)
        num_examples = self.data.shape[0]
        cloest_centroids_ids = np.empty((num_examples, 1))
        for _ in range(max_iterations):
            cloest_centroids_ids = Kmeans.centroids_find_closest(self.data, centroids)
            centroids = Kmeans.centroids_compute(self.data, cloest_centroids_ids, self.num_clustres)
        return centroids, cloest_centroids_ids

    @staticmethod
    def centroids_init(data, num_clustres):
        # Pick num_clustres distinct samples at random as the initial centroids
        num_examples = data.shape[0]
        random_ids = np.random.permutation(num_examples)
        centroids = data[random_ids[:num_clustres], :]
        return centroids

    @staticmethod
    def centroids_find_closest(data, centroids):
        # Assignment step: label each sample with the index of its nearest centroid
        num_example = data.shape[0]
        num_centroids = centroids.shape[0]
        cloest_centroids_ids = np.zeros((num_example, 1))
        for example_index in range(num_example):
            distance = np.zeros((num_centroids, 1))
            for centroids_index in range(num_centroids):
                distance_diff = data[example_index, :] - centroids[centroids_index, :]
                distance[centroids_index] = np.sum(distance_diff ** 2)
            cloest_centroids_ids[example_index] = np.argmin(distance)
        return cloest_centroids_ids

    @staticmethod
    def centroids_compute(data, cloest_centroids_ids, num_clustres):
        # Update step: move each centroid to the mean of its assigned samples
        num_features = data.shape[1]
        centroids = np.zeros((num_clustres, num_features))
        for centroids_id in range(num_clustres):
            cloest_ids = cloest_centroids_ids == centroids_id
            centroids[centroids_id] = np.mean(data[cloest_ids.flatten(), :], axis=0)
        return centroids
The Iris Dataset
data = pd.read_csv('Desktop\数据分析\data\iris.csv')
data.head()
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
new_data = data.drop(columns=['Unnamed: 0','Species'])
def x_stander(x):
    # Standardize each column to zero mean and unit variance
    x_mean = np.mean(x)
    x_std = np.std(x)
    return (x - x_mean) / x_std
data_std = x_stander(new_data)
ew,ev = np.linalg.eig(np.cov(data_std.T))
ew,ev
(array([2.93808505, 0.9201649 , 0.14774182, 0.02085386]),
array([[ 0.52106591, -0.37741762, -0.71956635, 0.26128628],
[-0.26934744, -0.92329566, 0.24438178, -0.12350962],
[ 0.5804131 , -0.02449161, 0.14212637, -0.80144925],
[ 0.56485654, -0.06694199, 0.63427274, 0.52359713]]))
ew_sort = ew[np.argsort(ew)[::-1]]
ev_sort = ev[:, np.argsort(ew)[::-1]]  # eigenvectors are columns, so reorder columns, not rows
sum_total = np.sum(ew_sort)
sum_i = 0
n = 1
for i in ew_sort:
    sum_i += i
    percent = round(sum_i / sum_total, 2) * 100
    print(f'Cumulative variance ratio of the first {n} component(s): {percent}%')
    n += 1
Cumulative variance ratio of the first 1 component(s): 73.0%
Cumulative variance ratio of the first 2 component(s): 96.0%
Cumulative variance ratio of the first 3 component(s): 99.0%
Cumulative variance ratio of the first 4 component(s): 100.0%
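The manual eigendecomposition route above can be cross-checked against sklearn.decomposition.PCA; a sketch on synthetic standardized data (so the block does not depend on the local iris.csv path):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # same standardization as x_stander

# Manual route: eigenvalues of the covariance matrix, sorted descending
ew, ev = np.linalg.eig(np.cov(X_std.T))
ratio_manual = np.sort(ew)[::-1] / ew.sum()

# Library route: explained variance ratio per principal component
ratio_pca = PCA(n_components=4).fit(X_std).explained_variance_ratio_
print(np.allclose(ratio_manual, ratio_pca))  # True
```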
V = ev_sort[:,:2]
data_1 = new_data.dot(V)
data_1.head()
          0         1
0  2.640270 -5.204041
1  2.670730 -4.666910
2  2.454606 -4.773636
3  2.545517 -4.648463
4  2.561228 -5.258629
data['Species'].value_counts()
iris_types=['setosa','versicolor','virginica']
plt.figure(figsize=(12,5))
plt.scatter(data_1[0][:],data_1[1][:])
<matplotlib.collections.PathCollection at 0x27311641be0>
num_example = data_1.shape[0]
data_2 = data_1[:].values.reshape(num_example,2)
num_clustres = 3
max_iterations = 50
data_2.dtype
dtype('float64')
kmeans = Kmeans(data_2,num_clustres)
centroids,cloest_centroids_ids = kmeans.train(max_iterations)
plt.figure(figsize=(12,8))
for cloest_centroids_id, centroid in enumerate(centroids):
    current_examples_index = (cloest_centroids_id == cloest_centroids_ids).flatten()
    plt.scatter(data_2[:,0][current_examples_index], data_2[:,1][current_examples_index], label=cloest_centroids_id)
plt.show()
Clustering Algorithms in Practice
- The K-means and DBSCAN algorithms
- An approach to semi-supervised problems
- Evaluating clustering algorithms
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['axes.labelsize']=14
plt.rcParams['xtick.labelsize']=12
plt.rcParams['ytick.labelsize']=12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
from sklearn.datasets import make_blobs
blob_centers = np.array([[0.2,2.3],
[-1.5,2.3],
[-2.8,1.8],
[-2.8,2.8],
[-2.8,1.3]])
blob_std = np.array([0.4,0.3,0.1,0.1,0.1])
X,y = make_blobs(n_samples=2000,centers=blob_centers,n_features=2,
cluster_std=blob_std,random_state=7)
X,y
(array([[-2.69823941, 1.3454702 ],
[-2.87459835, 1.8097575 ],
[ 0.96077126, 1.17046777],
...,
[-2.80303543, 2.72948115],
[ 0.24057359, 2.40103109],
[-2.63807768, 1.95621065]]),
array([4, 2, 0, ..., 3, 0, 2]))
def plot_clusters(X, y=None):
    plt.scatter(X[:,0], X[:,1], c=y, s=1)
    plt.xlabel('$x_1$', fontsize=14)
    plt.ylabel('$y_1$', fontsize=14, rotation=0)
plt.figure(figsize=(8,4))
plot_clusters(X)
plt.show()
Decision Boundaries
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters = k, random_state = 42)
y_pred = kmeans.fit_predict(X)
fit_predict(X) returns the same result as kmeans.labels_
kmeans.cluster_centers_: the coordinates of the cluster centers
transform: the distance from each point to each cluster center
y_pred,kmeans.labels_
(array([4, 1, 0, ..., 3, 0, 1]), array([4, 1, 0, ..., 3, 0, 1]))
kmeans.cluster_centers_
array([[ 0.20876306, 2.25551336],
[-2.80389616, 1.80117999],
[-1.46679593, 2.28585348],
[-2.79290307, 2.79641063],
[-2.80037642, 1.30082566]])
X_new = np.array([[0,2],[3,2],[-3,3],[-3,2.5]])
kmeans.predict(X_new)
array([0, 0, 3, 3])
kmeans.transform(X_new)
array([[0.32995317, 2.81093633, 1.49439034, 2.9042344 , 2.88633901],
[2.80290755, 5.80730058, 4.4759332 , 5.84739223, 5.84236351],
[3.29399768, 1.21475352, 1.69136631, 0.29040966, 1.71086031],
[3.21806371, 0.72581411, 1.54808703, 0.36159148, 1.21567622]])
def plot_data(X):
    plt.plot(X[:,0], X[:,1], 'k.', markersize=2)

def plot_centroids(centroids, weights=None, circle_color='w', cross_color='k'):
    if weights is not None:
        centroids = centroids[weights > weights.max()/10]
    plt.scatter(centroids[:,0], centroids[:,1],
                marker='o', s=50, linewidths=8, color=circle_color,
                zorder=11, alpha=0.9)
    plt.scatter(centroids[:,0], centroids[:,1],
                marker='x', s=10, linewidths=10, color=cross_color,
                zorder=11, alpha=1)

def plot_decision_boundaries(clusterer, X, resolution=1000, show_centroids=True,
                             show_xlabels=True, show_ylabels=True):
    mins = X.min(axis=0) - 0.1
    maxs = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(mins[0], maxs[0], resolution),
                         np.linspace(mins[1], maxs[1], resolution))
    Z = clusterer.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                 cmap='Pastel2')
    plt.contour(Z, extent=(mins[0], maxs[0], mins[1], maxs[1]),
                linewidths=1, colors='k')
    plot_data(X)
    if show_centroids:
        plot_centroids(clusterer.cluster_centers_)
    if show_xlabels:
        plt.xlabel('$x_1$', fontsize=14)
    else:
        plt.tick_params(labelbottom=False)
    if show_ylabels:
        plt.ylabel('$y_1$', fontsize=14, rotation=0)
    else:
        plt.tick_params(labelleft=False)
plt.figure(figsize=(8,4))
plot_decision_boundaries(kmeans,X)
plt.show()
Algorithm Workflow
- n_clusters: the number of clusters
- init: placement of the initial centroids; 'random' for random placement, or an explicit ndarray of positions
- n_init: defaults to 10; the algorithm is run 10 times and the best run is kept
- max_iter: the maximum number of iterations
kmeans_iter1 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=1,random_state=1)
kmeans_iter2 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=2,random_state=1)
kmeans_iter3 = KMeans(n_clusters = 5,init = 'random',n_init=1,max_iter=3,random_state=1)
kmeans_iter1.fit(X)
kmeans_iter2.fit(X)
kmeans_iter3.fit(X)
KMeans(init='random', max_iter=3, n_clusters=5, n_init=1, random_state=1)
plt.figure(figsize=(12,8))
plt.subplot(321)
plot_data(X)
plot_centroids(kmeans_iter1.cluster_centers_,circle_color='r',cross_color='k')
plt.title('Update cluster_centers')
plt.subplot(322)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.subplot(323)
plot_decision_boundaries(kmeans_iter1,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter2.cluster_centers_)
plt.subplot(324)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.subplot(325)
plot_decision_boundaries(kmeans_iter2,X,show_xlabels=False,show_ylabels=False)
plot_centroids(kmeans_iter3.cluster_centers_)
plt.subplot(326)
plot_decision_boundaries(kmeans_iter3,X,show_xlabels=False,show_ylabels=False)
plt.title('label')
plt.show()
Unstable Results
Different initial positions make the iterations converge to different results
def plot_clusterer_comparison(c1, c2, X):
    c1.fit(X)
    c2.fit(X)
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plot_decision_boundaries(c1, X)
    plt.subplot(122)
    plot_decision_boundaries(c2, X)
c1 = KMeans(n_clusters=5,init='random',n_init=1,random_state=11)
c2 = KMeans(n_clusters=5,init='random',n_init=1,random_state=19)
plot_clusterer_comparison(c1,c2,X)
Evaluation Metrics
- Inertia: the sum of squared distances from each sample to its own centroid (lower means tighter clusters)
kmeans.inertia_
211.59853725816828
X_dist = kmeans.transform(X)
X_dist,kmeans.labels_
(array([[3.04611916, 0.46779778, 1.54944305, 1.45402521, 0.11146795],
[3.11541584, 0.07122059, 1.48612753, 0.99002955, 0.51431557],
[1.32016676, 3.81713488, 2.67154781, 4.09069201, 3.76340605],
...,
[3.04886464, 0.92830156, 1.40795651, 0.06769209, 1.42865797],
[0.14895409, 3.10300136, 1.71125 , 3.05913478, 3.23385668],
[2.8625311 , 0.22700281, 1.21678483, 0.85434589, 0.67518173]]),
array([4, 1, 0, ..., 3, 0, 1]))
X_dist[np.arange(len(X_dist)),kmeans.labels_]
array([0.11146795, 0.07122059, 1.32016676, ..., 0.06769209, 0.14895409,
0.22700281])
np.sum(X_dist[np.arange(len(X_dist)),kmeans.labels_]**2)
211.59853725816862
kmeans.score(X)
-211.59853725816836
c1.inertia_,c2.inertia_
(223.2910857281904, 237.4624916944286)
Finding the Best Number of Clusters
As k grows, the inertia is guaranteed to keep shrinking, so inertia alone cannot pick k
The Elbow Method
kmeans_per_k = [KMeans(n_clusters = k).fit(X) for k in range(1,10)]
inertias = [model.inertia_ for model in kmeans_per_k]
plt.figure(figsize=(8,4))
plt.plot(range(1,10),inertias,'bo-')
plt.axis([1.8,5,0,1300])
plt.show()
Silhouette Coefficient
- a_i: the average distance from sample i to the other samples in its own cluster. The smaller a_i is, the more sample i belongs in that cluster; a_i is called the intra-cluster dissimilarity of sample i.
- b_i: the average distance from sample i to all samples of another cluster c_j is the dissimilarity between sample i and cluster c_j, written b_ij; the inter-cluster dissimilarity of sample i is defined as b_i = min{b_i1, b_i2, …, b_ik}
$s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}$
$s_i = \begin{cases} 1 - \frac{a_i}{b_i}, & a_i < b_i \\ 0, & a_i = b_i \\ \frac{b_i}{a_i} - 1, & a_i > b_i \end{cases}$
Conclusions:
- s_i close to 1: sample i is clustered reasonably
- s_i close to -1: sample i should rather be assigned to another cluster
- s_i near 0: sample i lies on the boundary between two clusters
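These quantities need not be computed by hand: sklearn.metrics provides silhouette_score (the mean of s_i over all samples). A sketch that scans candidate k values on synthetic blob data (the dataset here is an illustrative stand-in):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated blobs, so the true cluster count is 4
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette coefficient in [-1, 1]
    print(k, round(scores[k], 3))

best_k = max(scores, key=scores.get)  # highest mean silhouette -> preferred k
print('best k:', best_k)
```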
Original: https://blog.csdn.net/qq_45003520/article/details/123688989
Author: indigo女孩
Title: Clustering Analysis and the Principles of K-means & DBSCAN
Original articles are protected by copyright. Please cite the source when reprinting: https://www.johngo689.com/550134/
Reprinted articles remain under the original author's copyright. Please credit the original author when reprinting!