[基本功]聚类方法:层次聚类&k均值聚类

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:c980cd59-8cea-4cd4-862b-84efbf574318

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:4572ba80-5308-4c27-a4f8-f3ca7bb7814b

1.1 距离&相似度

1.2 类&簇

  • 类/簇:设T为给定正数,若集合G中任意两个样本x i , x j x_i,x_j x i ​,x j ​,有d i j < = T d_{ij},则称G为一个类或簇
  • 类的中心x ‾ G \overline x_G x G ​:类的均值
  • 类的直径D G D_G D G ​:类中任意两个样本之间的最大距离

1.3 类与类的距离(连接)

D p q = m a x { d i j ∣ x i ∈ G p , x j ∈ G q } D_{pq}=max{d_{ij}|x_i\in G_p,x_j\in G_q}D p q ​=m a x {d i j ​∣x i ​∈G p ​,x j ​∈G q ​}

三要素:

距离or相似度:闵可夫斯基距离、马氏距离、相关系数、夹角余弦

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:68518bcd-7a8b-41a7-98b7-1f292d5ca89e

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:d864813a-4d7a-4e5a-9e4d-e1c580220c49

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:3306328c-3fd5-49e9-9c88-42dacf464258

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:1c6f5f75-4a8c-4a3b-ba67-c7510f43f42d

2.1 聚合(自下而上)

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:9a934506-0582-4b3e-b56f-24c0d1f4face

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:7b754b46-a435-4396-a58b-965738c89406

输入:n个样本组成的样本集合

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:b3da0309-7325-48bf-b709-504cac2e3ef2

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:b4d21d37-fb33-407e-b9e1-bdb54da26452

算法:

(1)计算n个样本两两之间的欧氏距离d i j d_{ij}d i j ​,记作矩阵D = [ d i j ] n ∗ n D=[d_{ij}]_{n*n}D =[d i j ​]n ∗n ​

(2)构造n个类,每个类只包含一个样本

(3)合并类间距离最小的两个类,构建一个新类

(4)计算新类与当前各类的距离。若类的个数为1,终止计算,否则回到(3)

2.2 分裂(自上而下)

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:72d765ab-747d-4c1f-99c1-d48f86058fc2

[En]

[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:a8ef670c-e926-4563-9133-81ad6dad79e5

  • 基于中心,通过迭代,将样本分到k个类中,使得每个样本与其所属类的中心或均值最近,得到k个”平坦的”、非层次化的类别,构成对空间的划分。
  • k均值聚类是对样本集合X的划分,或者从样本到类的函数的选择问题
  • 策略:损失函数最小化

    采用欧氏距离平方作为样本间的距离 损失函数:样本与其所属类中心的距离的总和
    W ( C ) = ∑ l = 1 k ∑ C ( i ) = l ∣ ∣ x i − x ‾ l ∣ ∣ 2 W(C)=\sum_{l=1}^k\sum_{C(i)=l}||x_i-\overline x_l||^2 W (C )=l =1 ∑k ​C (i )=l ∑​∣∣x i ​−x l ​∣∣2
    目标:求解最优化问题
    C ∗ = a r g m i n C W ( C ) = a r g m i n C ∑ l = 1 k ∑ C ( i ) = l ∣ ∣ x i − x ‾ l ∣ ∣ 2 C^*=argmin_CW(C)=argmin_C\sum_{l=1}^k\sum_{C(i)=l}||x_i-\overline x_l||^2 C ∗=a r g m i n C ​W (C )=a r g m i n C ​l =1 ∑k ​C (i )=l ∑​∣∣x i ​−x l ​∣∣2
    该问题为NP困难问题,现实中采用迭代方法求解。

  • 算法 首先选择k个类的中心,将样本逐个指派到与其最近的中心的类中,得到一个聚类结果;然后更新每个类的样本的均值,作为类的新的中心;重复以上步骤,直到收敛为止。 输入:n个样本的集合X 输入:样本集合的聚类C ∗ C^C ∗ 算法: (1)初始化。令t=0,随机选择k个样本点作为初始聚类中心m ( 0 ) = ( m 1 ( 0 ) , m 2 ( 0 ) , . . . , m k ( 0 ) ) m^{(0)}=(m_1^{(0)},m_2^{(0)},…,m_k^{(0)})m (0 )=(m 1 (0 )​,m 2 (0 )​,…,m k (0 )​) (2)对样本进行聚类。对固定的类中心m ( t ) = ( m 1 ( t ) , m 2 ( t ) , . . . , m k ( t ) ) m^{(t)}=(m_1^{(t)},m_2^{(t)},…,m_k^{(t)})m (t )=(m 1 (t )​,m 2 (t )​,…,m k (t )​),计算每个样本到类中心的距离,将每个样本指派到与其最近的中心的类中,构成聚类结果C ( t ) C^{(t)}C (t ) (3)计算新的类中心。对聚类结果C ( t ) C^{(t)}C (t ),计算当前各个类中的样本的均值,作为新的类中心m ( t + 1 ) = ( m 1 ( t + 1 ) , m 2 ( t + 1 ) , . . . , m k ( t + 1 ) ) m^{(t+1)}=(m_1^{(t+1)},m_2^{(t+1)},…,m_k^{(t+1)})m (t +1 )=(m 1 (t +1 )​,m 2 (t +1 )​,…,m k (t +1 )​) (4)如果迭代收敛或符合停止条件,输出C ∗ = C ( t ) C^=C^{(t)}C ∗=C (t )。否则令t = t + 1 t=t+1 t =t +1,返回(2)
    [基本功]聚类方法:层次聚类&k均值聚类

Original: https://blog.csdn.net/weixin_52093054/article/details/122073664
Author: 女青年学习日记
Title: [基本功]聚类方法:层次聚类&k均值聚类

原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/561128/

转载文章受原作者版权保护。转载请注明原作者出处!

(0)

大家都在看

亲爱的 Coder【最近整理,可免费获取】👉 最新必读书单  | 👏 面试题下载  | 🌎 免费的AI知识星球