II. Clustering Concepts
1. Clustering
2. Types of clustering
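3. Types of clusters
- Center-based clusters: a group of samples whose points are close (or similar) to the cluster's own "center" (centroid or medoid) and far from the "centers" of other clusters.
- Contiguity-based clusters: each point is closer to at least one point of its own cluster than to any point of another cluster.
- Density-based clusters: clusters are regions of high density, separated from one another by regions of low density.
- Conceptual clusters: the points of a cluster share some property that is derived from the whole set and is usually not based on centers, adjacency, or density.
4. The "three elements" of cluster analysis
How "near" or "far" samples are: measured with a similarity or distance function.
The quality of the resulting clusters: assessed with an evaluation function.
How the clusters are obtained: how clusters are represented, how the partitioning and optimization algorithm is designed, and when the algorithm stops.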
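III. Distance Metric Functions
1. Distance functions
A distance metric function d(x_i, x_j) satisfies:
- Non-negativity: d(x_i, x_j) ≥ 0, with d(x_i, x_j) = 0 if and only if x_i = x_j
- Symmetry: d(x_i, x_j) = d(x_j, x_i)
- Triangle inequality: d(x_i, x_j) ≤ d(x_i, x_k) + d(x_k, x_j)
3. Data preprocessing
4. Cosine similarity
Cosine of the angle: treat the two variables x_i and x_j as two vectors in D-dimensional space; the cosine of the angle between them is
cos(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)
5. Correlation coefficient
6. Jaccard similarity coefficient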
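The three similarity measures named in the headings above can be sketched in a few lines of NumPy. This is a minimal illustration using the textbook definitions, not code from the original post; the function names are hypothetical, and treating two empty sets as identical is an assumption.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two D-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation_coefficient(x, y):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return cosine_similarity(x - x.mean(), y - y.mean())

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)
```

Cosine similarity ignores vector length, the correlation coefficient additionally ignores a constant offset, and the Jaccard coefficient applies to set-valued (or binary) data.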
IV. Clustering Performance Evaluation Metrics
1. Methods for evaluating clustering performance
2. Reference model
3. External indices
4. Intra-cluster similarity
5. Inter-cluster similarity
6. Internal evaluation indices
7. Silhouette index
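As a rough illustration of the silhouette index, the sketch below uses the standard definition s(i) = (b − a) / max(a, b), where a is the mean distance from sample i to the other members of its own cluster and b is the smallest mean distance from i to the members of any other cluster. The function name, the use of Euclidean distance, and the convention that singleton clusters score 0 are assumptions; at least two clusters are required.

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-sample silhouette values; assumes Euclidean distance and >= 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # assumed convention: a singleton cluster gets score 0
        # a: mean distance to the other members of the same cluster
        # (the distance of i to itself is 0, so divide by size - 1)
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores
```

Values close to 1 indicate a sample that sits well inside its cluster; negative values suggest the sample would fit better in a neighboring cluster.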
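V. Clustering Algorithms
1. K-means clustering (a partition-based, center-based clustering method; the data live in a vector space, and for non-vector spaces similarity can be measured with something like a kernel function)
Input: data {x_1, …, x_N}, number of clusters K
- Randomly select K seed data points as the centers of the K clusters
- repeat
  - for each data point x_i
    - assign x_i to the cluster whose center is nearest
  - end for
  - recompute the positions of the K cluster centers from the points currently in each cluster
- until the cluster centers are no longer updated
Essence
K-means performs coordinate descent on its objective function J. Because J is non-convex, coordinate descent on J is not guaranteed to converge to the global minimum. It is best to run K-means several times and keep the best result.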
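The loop above, together with the advice to run it several times, can be sketched as follows. This is a minimal NumPy illustration, not the original author's code; the names `kmeans` and `best_of`, the iteration cap, and the choice to leave an empty cluster's center unchanged are assumptions.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means with random seed points; returns (centers, labels, J)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: recompute each center from its current members
        # (assumption: an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break  # centers no longer change
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centers, labels, float(d2.min(axis=1).sum())

def best_of(X, K, n_restarts=10):
    """Run K-means from several seeds and keep the run with the lowest J."""
    runs = [kmeans(X, K, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Each step (assignment with centers fixed, update with assignments fixed) can only lower J, which is why the procedure converges, but only to a local minimum that depends on the seeds.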
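2. Choosing K: the elbow method
3. Choosing K: hypothesis testing
For each K, run a hypothesis test with null hypothesis H0: the number of clusters is K, against the alternative hypothesis H1: the number of clusters is not K.
Start from K=1. If H0 is rejected, go on to test K=2, and so on until the null hypothesis is accepted; the chosen K is the first value whose hypothesis is not rejected.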
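4. Initializing K-means
Heuristic: pick the center of the first cluster at random, then place each further center as far as possible from the centers already chosen. In Scikit-Learn's K-means implementation, the init parameter sets the initialization method; its default value, "k-means++", spreads the initial centroids apart and gives better results than purely random initialization.
k-means++:
Given: data {x_1, …, x_N}, number of clusters K
- Select one sample point uniformly at random from the data and call it μ_1
- For k = 2, 3, …, K
  - For each sample x_i, compute the shortest distance to the existing cluster centers: D(x_i) = min over the chosen centers μ of ‖x_i − μ‖
  - Draw the sample point μ_k with the following probability: P(x_i) = D(x_i)² / Σ_n D(x_n)²
- end for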
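The seeding procedure above can be sketched in NumPy as follows. The function name `kmeans_pp_init` is hypothetical, and the sketch assumes the data contain at least two distinct points (otherwise the sampling probabilities are undefined). In Scikit-Learn the same idea is selected with `KMeans(init="k-means++")`.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers with k-means++ style seeding."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # First center: one sample chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Squared shortest distance from every sample to the chosen centers.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to D(x)^2.
        # Assumption: d2.sum() > 0, i.e. not all points coincide.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Points that coincide with an existing center have probability zero, so the same location is never picked twice, while distant points are strongly favored.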
5. Preprocessing and postprocessing
Preprocessing: standardize the data (e.g. scale to unit standard deviation); remove outliers.
6. Strengths and limitations of K-means
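Limitations: K-means has trouble when clusters differ in size or density or are non-spherical, and may then fail to give a satisfactory clustering; points are hard-assigned to clusters, so a small perturbation of the data can move a point into a different cluster.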
7. K-medoids
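Comparison: 1. The mean may well not coincide with any actual sample and so may not be representative of the cluster, whereas the medoid is a sample that really exists in the set; 2. compared with the mean, the medoid is less sensitive to noise (isolated points, outliers); 3. however, computing the medoid requires ordering all samples in the cluster, which is computationally expensive.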
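A small sketch of the comparison above: the medoid is taken here as the sample with the smallest total distance to the other samples in the cluster, which is one common definition and an assumption about what the source intends. The data are made up for illustration.

```python
import numpy as np

def medoid(X):
    """Return the sample whose summed distance to all other samples is smallest."""
    X = np.asarray(X, dtype=float)
    # All pairwise distances: this O(n^2) step is what makes the medoid costly.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return X[dist.sum(axis=1).argmin()]

# Illustrative cluster; the last point is an outlier.
cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [9.0, 9.0]])
print("mean:  ", cluster.mean(axis=0))
print("medoid:", medoid(cluster))
```

The outlier drags the mean well away from the four nearby points, while the medoid remains one of them.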
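2. Gaussian Mixture Models and the EM Algorithm
1. Gaussian mixture model
2. Introducing latent variables
Parameter estimation: maximum likelihood estimation
3. Solution method: the EM algorithm
As in K-means clustering, (block) coordinate descent is used; here it is called the EM algorithm.
E-step: with the parameters fixed, compute the posterior probability that each sample belongs to each component; this can be viewed as an estimate, or "explanation", of the sample's membership in the k-th cluster.
M-step: based on the current expectation, re-estimate the parameter values.
4. The general EM algorithm
5. EM for GMM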
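To make the E-step and M-step concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It uses the standard update formulas rather than anything recovered from the original post; the function name, the quantile-based initialization of the means, and the small variance floor are assumptions made for the illustration.

```python
import numpy as np

def gmm_em_1d(x, K, n_iter=200):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, responsibilities)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    pi = np.full(K, 1.0 / K)                          # mixing weights
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)     # assumption: quantile init
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: responsibility r[i, k] = posterior that sample i came from component k.
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibility-weighted samples.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6  # floor avoids collapse
    return pi, mu, var, r
```

The responsibilities are soft assignments: each row sums to 1, and replacing them with a one-hot argmax (with equal weights and fixed equal variances) recovers the K-means assignment step.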
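6. K-means versus Gaussian mixture models (GMM)
K-means: the loss function to be minimized is the sum of squared distances; sample points are hard-assigned to a cluster; every cluster is assumed to be equally probable and spherical.
GMM: minimizes the negative log-likelihood; the assignment of points to clusters is soft; it can model elliptical clusters, and the clusters can have different probabilities.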
3. Hierarchical Clustering
1. Advantages of hierarchical clustering
2. Categories of hierarchical clustering
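3. Defining inter-cluster similarity
- Minimum distance (single linkage): advantage: can form non-spherical, non-convex clusters; problem: chaining effect
- Maximum distance (complete linkage): more robust to noise (no chaining); problem: tends to break up large clusters and favors spherical clusters
- Average distance: a compromise between minimum and maximum distance
- Centroid distance: problem: inversion effect (clusters merged later may be closer together than clusters merged earlier)
- Ward's method uses squared error: the similarity of two clusters is based on the increase in squared error when they are merged; it is less affected by noise and outliers, favors spherical clusters, and is the hierarchical version of K-means, so it can be used to initialize K-means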
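The first three linkage criteria above differ only in how the distance between two clusters is reduced from the pairwise point distances. The naive bottom-up sketch below makes that explicit; it is an illustration written for clarity rather than speed, the function name and `linkage` argument are hypothetical, and centroid and Ward linkage are left out.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive agglomerative clustering with single, complete, or average linkage."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Minimum, maximum, or mean of the pairwise distances between two clusters.
    cluster_dist = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]   # start with one cluster per point
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(dist[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

On compact, well-separated groups all three criteria agree; they diverge on elongated or noisy data, where single linkage tends to chain and complete linkage tends to split large clusters.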
4. Determining the number of clusters in hierarchical clustering
5. Limitations of hierarchical clustering
4. Density-Based Clustering
1. Density-based clustering
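2. DBSCAN
Density = the number of points within a given radius
Core point: has at least the specified number MinPts of points within the given radius
Border point: has fewer than MinPts points within the radius, but lies in the neighborhood of some core point
Noise point: any point that is neither a core point nor a border point
Point q is density-reachable from point p if there is a path from p to q in which each step stays within the radius and every point on the path, except possibly q itself, is a core point. (If p is a core point, the points density-reachable from it form a cluster.)
Points q and p are density-connected if there exists a point o from which both q and p are density-reachable.
A cluster satisfies two properties: 1. Connectivity: any two points in the cluster are density-connected. 2. Maximality: if a point is density-reachable from any point in a cluster, then it belongs to that cluster.
DBSCAN algorithm:
Input: data {x_1, …, x_N}, radius ε, MinPts
# Determine the core points
for i = 1, 2, …, N
    find the ε-neighborhood N_ε(x_i) of x_i
    if |N_ε(x_i)| ≥ MinPts
        add x_i to the core-point set Ω
    end if
end for
# For all core points
while Ω is not empty
    pick a core point o ∈ Ω, initialize the queue Q = <o>, and mark o as visited
    while Q is not empty
        remove the first sample point q from the queue Q
        if |N_ε(q)| ≥ MinPts
            append the unvisited points of N_ε(q) to Q and mark them as visited
        end if
    end while
    the points visited in this round form a new cluster; remove them from Ω
end while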
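A runnable sketch of this procedure is given below. It follows the usual queue-based formulation of DBSCAN and is not the original author's code; the function name, the brute-force neighborhood computation, counting a point as a member of its own neighborhood, and the label -1 for noise are assumptions.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Return (labels, is_core); label -1 marks noise points."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # eps-neighborhood of every point (assumption: the point itself is included).
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(N)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(N, -1)
    cluster = 0
    for i in range(N):
        if not is_core[i] or labels[i] != -1:
            continue  # only unassigned core points start a new cluster
        labels[i] = cluster
        queue = deque([i])
        while queue:
            q = queue.popleft()
            if not is_core[q]:
                continue  # border points join the cluster but do not expand it
            for j in neighbors[q]:
                if labels[j] == -1:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1
    return labels, is_core
```

Points that are never reached from a core point keep the label -1, which corresponds to the noise points defined above.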
3. Cluster centers
ρ_i: the local density of point i
5. Graph-Based Clustering
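1. Creating edges
2. Edge weights
Define the matrix W as the adjacency matrix: its element w_{i,j} is the similarity between nodes v_i and v_j, i.e. the weight of edge e_{i,j}.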
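3. Graph theory basics
Adjacency matrix W: the weights w_{i,j} between all pairs of points form the adjacency matrix W of the graph, an N×N symmetric matrix.
Degree matrix D: a diagonal matrix whose i-th diagonal entry is the degree of node v_i, d_i = Σ_j w_{i,j}.
Laplacian matrix L: L = D − W
- L is a symmetric matrix, so all of its eigenvalues are real.
- The smallest eigenvalue of L is 0, and the all-ones vector is an eigenvector for the eigenvalue 0.
- The multiplicity of the eigenvalue 0 equals the number of connected components of the graph.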
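These properties are easy to check numerically. The sketch below builds the Laplacian of a small, made-up weighted graph with two connected components (a triangle and a single edge) and inspects its spectrum; the graph and the tolerance are illustrative assumptions.

```python
import numpy as np

# Illustrative adjacency matrix: nodes 0-2 form a triangle, nodes 3-4 share one edge.
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
    [0, 0, 0, 2, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

# eigh is appropriate because L is symmetric; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
n_components = int(np.sum(np.isclose(eigvals, 0.0, atol=1e-9)))

print("eigenvalues:", np.round(eigvals, 6))
print("zero eigenvalues (connected components):", n_components)
print("L @ 1 =", L @ np.ones(len(W)))   # all-ones vector is mapped to zero
```

Each row of L sums to zero, which is exactly why the all-ones vector is an eigenvector with eigenvalue 0.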
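Normalized Laplacian matrices:
L_sym = D^{-1/2} L D^{-1/2},  L_rw = D^{-1} L
Cut:
A cut of the undirected graph G: split G(V, E) into K mutually disconnected subgraphs with vertex sets A_1, …, A_K, satisfying A_i ∩ A_j = ∅ for i ≠ j and A_1 ∪ … ∪ A_K = V.
Best partition: the cut of the graph is minimal. Drawback: the cut weight is proportional to the number of edges cut, so the minimum cut tends to split off small, isolated components.
4. Optimization objective
Representation of the matrix Q: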
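5. Ncut via eigendecomposition
- Build the adjacency matrix W and the degree matrix D from the input similarity matrix
- Compute the Laplacian matrix L
- Compute the eigenvectors v_k, k = 1, …, K1, of L_sym (or L_rw) that correspond to the K1 smallest eigenvalues
- Use the v_k as the columns of a matrix V, giving an N×K1 feature matrix V (for L_sym, additionally normalize each row of V to unit length)
- Treat each row of V as a K1-dimensional sample and cluster the resulting N samples into K2 clusters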
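The steps above can be sketched end to end as follows, using L_sym and taking K1 = K2 = K. This is an illustrative implementation under stated assumptions, not the original author's code: the Gaussian-kernel similarity with bandwidth `sigma`, the farthest-point initialization of the final K-means step, and both function names are choices made for the example, and every node is assumed to have non-zero degree.

```python
import numpy as np

def simple_kmeans(V, K, n_iter=50):
    """Small K-means with deterministic farthest-point seeding (illustrative choice)."""
    centers = [V[0]]
    for _ in range(1, K):
        d2 = ((V[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(V[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([V[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                            for k in range(K)])
    return labels

def spectral_clustering(X, K, sigma=1.0):
    X = np.asarray(X, dtype=float)
    # Adjacency matrix W from a Gaussian kernel (assumed similarity), no self-loops.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)                      # assumed to be strictly positive
    L = np.diag(deg) - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt
    # Eigenvectors of the K smallest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L_sym)
    V = vecs[:, :K]
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # row normalization for L_sym
    return simple_kmeans(V, K)

# Two concentric rings: not separable by centers, but separable by graph connectivity.
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]
ring_data = np.vstack([inner, 4 * inner])
ring_labels = spectral_clustering(ring_data, K=2, sigma=0.5)
```

The result depends strongly on `sigma`: it must be small enough that the two rings are almost disconnected in the similarity graph, yet large enough that neighbors on the same ring stay connected.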
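6. Advantages and disadvantages
Advantages: the adjacency matrix can be chosen flexibly; it is effective on sparse data; spectral clustering works well when the number of clusters is small; it is grounded in spectral graph theory, can cluster sample spaces of arbitrary shape, and converges to the global optimum of the relaxed objective.
Disadvantages: it is very sensitive to the choice of similarity graph and clustering parameters; it is best suited to problems in which the clusters are of balanced size.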
Original: https://blog.csdn.net/weixin_43939890/article/details/121735451
Author: 露(
Title: Pattern Recognition and Machine Learning, Chapter 8: Clustering