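"Machine Learning" Theory: Speed-Reading Study 2, Common Methods (3)

This series consists of personal reading notes and summaries. No organization or individual may reproduce it for commercial purposes!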
time: 2021-12-24

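  • Chapter 9: Clustering
  • Chapter 10: Dimensionality Reduction and Metric Learning

Chapter 9: Clustering

In unsupervised learning, the label information of the training samples is unknown. The goal is to learn from unlabeled training samples in order to reveal the inherent properties and regularities of the data, providing a basis for further data analysis. The most studied task of this kind is clustering.

Besides clustering, common unsupervised learning tasks include density estimation and anomaly detection.

Clustering tries to partition the samples in a dataset into several subsets that are usually disjoint; each subset is called a cluster (in the context of clustering algorithms, a sample cluster is also called a class).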


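The concepts that clusters correspond to are unknown in advance: the clustering process can only form the cluster structure automatically, and the concept behind each cluster has to be interpreted by the user.

Labeled training samples can also be used in clustering tasks, as in semi-supervised clustering, but note that a sample's class label differs from the cluster produced by clustering; that is, a class label is not the same as a cluster label.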


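We already covered performance measures for supervised learning in Chapter 2.

A clustering performance measure, also called a clustering "validity index", plays a role similar to that of performance measures in supervised learning.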


What makes a good clustering:


* In other words, the clustering result should have high intra-cluster similarity and low inter-cluster similarity.


  • Comparing the clustering result with a reference model gives an external index:
    • Jaccard Coefficient (JC)
    • Fowlkes and Mallows Index (FMI)
    • Rand Index (RI)
    • All three measures take values in [0, 1]; larger is better.
  • Examining the clustering result directly, without any reference model, gives an internal index:
    • Davies-Bouldin Index (DBI); smaller is better
    • Dunn Index (DI); larger is better
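The external indices listed above are commonly defined by counting sample pairs: a is the number of pairs placed in the same cluster by both the clustering and the reference, b the pairs together only in the clustering, c the pairs together only in the reference, and d the pairs separated in both. A minimal sketch under that definition follows; the function name `external_indices` and the handling of degenerate cases (returning 0.0) are assumptions made for illustration, not part of the notes.

```python
from itertools import combinations
from math import sqrt

def external_indices(pred, ref):
    """Return (JC, FMI, RI) for cluster labels `pred` against reference labels `ref`.

    Assumes both sequences have the same length and at least two samples.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_ref = ref[i] == ref[j]
        if same_pred and same_ref:
            a += 1          # together in both
        elif same_pred:
            b += 1          # together only in the clustering
        elif same_ref:
            c += 1          # together only in the reference
        else:
            d += 1          # separated in both
    # Degenerate denominators are mapped to 0.0 (an assumption of this sketch).
    jc = a / (a + b + c) if (a + b + c) else 0.0
    fmi = sqrt((a / (a + b)) * (a / (a + c))) if a else 0.0
    ri = (a + d) / (a + b + c + d)
    return jc, fmi, ri

print(external_indices([0, 0, 1, 1], [0, 0, 1, 1]))
```

Cluster labels only need to be comparable for equality, so the clustering and the reference may use different label names.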

For a function dist(·,·) to be a distance measure, it must satisfy:

  • Non-negativity: dist(xi, xj) >= 0
  • Identity: dist(xi, xj) = 0 if and only if xi = xj
  • Symmetry: the distance from xi to xj equals the distance from xj to xi
  • Triangle inequality: dist(xi, xj) <= dist(xi, xk) + dist(xk, xj)

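Given samples xi and xj, each represented as a vector of attribute values, the most commonly used distance measure is the Minkowski distance.

  • When p = 2, it is the Euclidean distance
  • When p = 1, it is the Manhattan distance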
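The Minkowski distance takes the p-th root of the summed p-th powers of the per-attribute absolute differences. A short sketch, assuming NumPy and a hypothetical helper name `minkowski`:

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two attribute vectors of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

print(minkowski([0, 0], [3, 4], p=2))  # Euclidean distance
print(minkowski([0, 0], [3, 4], p=1))  # Manhattan distance
```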

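A continuous attribute is also called a numerical attribute; a discrete (categorical) attribute is also called a nominal attribute. When discussing distance computation, whether an order relation is defined on an attribute is often important; this gives the notions of ordinal attribute and non-ordinal attribute.

  • The Minkowski distance can be used for ordinal attributes;
  • For non-ordinal attributes, the VDM (Value Difference Metric) distance can be used;
  • Mixed attributes can be handled by combining the Minkowski distance with VDM, usually with the ordinal attributes arranged before the non-ordinal ones;
  • When attributes in the sample space differ in importance, a weighted distance can be used (the weights usually sum to 1);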
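The notes do not spell out VDM. As it is commonly defined, the VDM between two values a and b of one non-ordinal attribute sums, over all clusters, the p-th power of the difference between the fraction of a-valued samples in that cluster and the fraction of b-valued samples in that cluster. The sketch below follows that definition; the function name `vdm` and its argument layout are assumptions, and both values are assumed to occur in the data.

```python
from collections import Counter

def vdm(values, clusters, a, b, p=2):
    """VDM between values `a` and `b` of a single non-ordinal attribute.

    values[i]   : attribute value of sample i
    clusters[i] : cluster index of sample i
    """
    total = Counter(values)                       # samples taking each value
    per_cluster = Counter(zip(values, clusters))  # samples per (value, cluster)
    return sum(
        abs(per_cluster[(a, i)] / total[a] - per_cluster[(b, i)] / total[b]) ** p
        for i in set(clusters)
    )

colors = ["red", "red", "green", "green"]
cluster_ids = [0, 0, 1, 1]
print(vdm(colors, cluster_ids, "red", "green"))
```

Two values that are distributed across clusters in the same proportions get distance 0, even though the values themselves carry no order.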

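We usually define a similarity measure based on some form of distance: the larger the distance, the smaller the similarity.

However, a distance used for measuring similarity need not satisfy all the basic properties of a distance measure, in particular the triangle inequality; a distance that violates the triangle inequality is called a non-metric distance.

Euclidean distance and cosine similarity: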



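import numpy as np
from numpy import linalg

# A and B are assumed to be 1-D NumPy arrays of equal length.
# Euclidean distance mapped to a similarity in (0, 1]
dist = linalg.norm(A - B)
sim = 1.0 / (1.0 + dist)

# Cosine similarity rescaled from [-1, 1] to [0, 1]
num = float(np.dot(A, B))
denom = linalg.norm(A) * linalg.norm(B)
cos = num / denom
sim = 0.5 + 0.5 * cos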

A prototype is a representative point in the sample space; prototype clustering is also called prototype-based clustering.


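k-means algorithm

  • The k-means algorithm minimizes the squared error E of the cluster partition obtained by clustering;
  • E describes, to some extent, how tightly the samples in each cluster surround the cluster's mean vector: the smaller E is, the more similar the samples within a cluster are;
  • Minimizing E is not easy: finding the optimum requires examining all possible cluster partitions of the sample set D, which is an NP-hard problem;
  • k-means therefore adopts a greedy strategy and approximates the solution through iteration;
  • Initialize the mean vectors: assuming k = 3 clusters, the algorithm starts by randomly selecting three samples as the initial mean vectors;
  • Iteratively update the current cluster partition and mean vectors until the clustering result no longer changes (or the change falls below a set threshold; a maximum number of iterations can also be set);
    • Take sample x1, compute its distance to each current mean vector to decide which cluster x1 is assigned to, and do the same for all samples;
    • This yields a new cluster partition, from which the mean vector of each cluster can be recomputed;
    • Examine all samples again against the new mean vectors to obtain the next partition, and keep repeating this process;
    • Stop when the termination condition is met;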
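The iteration described above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the choice to keep the old mean vector when a cluster becomes empty are assumptions made for illustration.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Greedy k-means: alternate assignment and mean update until stable."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Randomly pick k distinct samples as the initial mean vectors.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every sample to its nearest mean vector.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each mean; keep the old one if a cluster is empty.
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break  # mean vectors (and hence the partition) stopped changing
        means = new_means
    return labels, means

labels, means = k_means([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
print(labels, means)
```

Because the result depends on the random initial mean vectors, a common practice is to run the procedure several times with different seeds and keep the partition with the smallest squared error.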

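Learning vector quantization

  • Learning Vector Quantization (LVQ), like k-means, also tries to find a set of prototype vectors to characterize the cluster structure;
  • Unlike general clustering algorithms, LVQ assumes that the samples carry class labels, and the learning process uses this supervised information to assist clustering;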

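  • For each class, a labeled sample of that class is randomly selected to initialize a prototype vector;
  • In each iteration, find the prototype vector closest to the training sample;
    • If the two have the same class, move the prototype vector toward the training sample: new prototype = prototype + (sample - prototype) * learning rate;
    • If the classes differ, move the prototype vector away from the training sample: new prototype = prototype - (sample - prototype) * learning rate;
  • Each learned prototype vector defines a region associated with it, in which every sample is no farther from that prototype vector than from any other prototype vector;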
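The update rule above translates almost directly into code. The following is a minimal sketch, assuming NumPy; the function name `lvq`, the fixed number of iterations `n_iter`, and the `proto_labels` argument (the preset class of each prototype) are illustrative assumptions.

```python
import numpy as np

def lvq(X, y, proto_labels, lr=0.1, n_iter=200, seed=0):
    """Learn one prototype vector per entry of `proto_labels`."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Initialise each prototype with a random sample of its own class.
    protos = np.array([X[rng.choice(np.flatnonzero(y == t))] for t in proto_labels])
    for _ in range(n_iter):
        j = rng.integers(len(X))                             # pick a training sample
        i = np.linalg.norm(protos - X[j], axis=1).argmin()   # nearest prototype
        step = lr * (X[j] - protos[i])
        if proto_labels[i] == y[j]:
            protos[i] += step   # same class: move toward the sample
        else:
            protos[i] -= step   # different class: move away from the sample
    return protos

protos = lvq([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1], proto_labels=[0, 1])
print(protos)
```

A new sample can then be assigned to the cluster of its nearest prototype, which is exactly the region each prototype defines.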

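SOM (Self-Organizing Map) is a clustering algorithm based on unlabeled samples; LVQ can be viewed as an extension of SOM that uses supervised information.

If every sample in a sample set is represented by its prototype vector, the data is compressed with loss (lossy compression); this process is called vector quantization, which is where LVQ gets its name.

The resulting partition of the sample space into clusters is usually called a Voronoi tessellation.

Gaussian mixture clustering

  • Unlike the previous two methods, Mixture-of-Gaussian clustering uses a probabilistic model (Gaussian distributions) to express the cluster prototypes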

  • The model parameters are updated iteratively with the EM algorithm;
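To make the EM iteration concrete, here is a deliberately simplified sketch for one-dimensional data, assuming NumPy. The function name `gmm_em_1d`, the restriction to scalar samples, the quantile-based initial means, and the small variance floor are all assumptions chosen to keep the example short and reproducible; they are not taken from the notes.

```python
import numpy as np

def gmm_em_1d(x, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with EM and return hard labels."""
    x = np.asarray(x, dtype=float)
    # Initial parameters: evenly spaced quantiles as means (an assumption).
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var() + 1e-6)
    alpha = np.full(k, 1.0 / k)                 # mixing coefficients
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = alpha * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted samples.
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        alpha = nk / len(x)
    # Each sample joins the component with the largest posterior probability.
    return gamma.argmax(axis=1), mu, var, alpha

data = [0.0, 0.2, -0.1, 0.1, 10.0, 10.2, 9.9, 10.1]
labels, mu, var, alpha = gmm_em_1d(data, k=2)
print(labels, mu)
```

Unlike k-means, the posterior matrix `gamma` gives a soft assignment; the hard cluster label is only taken at the end.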

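Density clustering, i.e. density-based clustering, assumes that the cluster structure can be determined by how tightly the samples are distributed. Typically, such algorithms examine the connectivity between samples from the viewpoint of sample density, and keep expanding clusters from connectable samples to obtain the final clustering result.

The directly density-reachable relation usually does not satisfy symmetry; the density-reachable relation satisfies transitivity but not symmetry; the density-connected relation satisfies symmetry.

DBSCAN is a well-known density clustering algorithm that characterizes how tightly samples are distributed using a set of neighborhood parameters (ε, MinPts).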


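* Directly density-reachable: if xi is a core object (its ε-neighborhood contains at least MinPts samples) and xj lies in that neighborhood, then xj is directly density-reachable from xi;
* Density-reachable: if xj can be reached from xi through a chain of directly density-reachable steps, then xj is density-reachable from xi;
* Density-connected: if xi and xj are both density-reachable from the same core object, they are density-connected;

Samples that do not belong to any cluster are generally regarded as noise or anomalies.

Based on these concepts, DBSCAN defines a cluster as the maximal set of density-connected samples derived from the density-reachable relation.

  • DBSCAN first picks an arbitrary core object in the dataset as a seed, then starts from it to determine the corresponding cluster;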
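The seed-and-expand procedure can be sketched as follows, assuming NumPy and Euclidean distance. The function name `dbscan`, the use of -1 for noise, and visiting seeds in index order instead of picking them arbitrarily are simplifying assumptions of this sketch.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each sample with a cluster index, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # The eps-neighborhood of a sample includes the sample itself.
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = np.full(n, -1)
    cluster = 0
    for seed in range(n):
        if not core[seed] or labels[seed] != -1:
            continue
        # Grow a cluster from this core object through density-reachability.
        labels[seed] = cluster
        queue = [seed]
        while queue:
            i = queue.pop()
            if not core[i]:
                continue  # border samples join the cluster but do not expand it
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    queue.append(j)
        cluster += 1
    return labels

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [50, 50]]
print(dbscan(points, eps=1.5, min_pts=3))
```

The isolated sample is never reached from any core object, so it keeps the noise label.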

Hierarchical clustering tries to partition the dataset at different levels, forming a tree-shaped cluster structure.


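AGNES (AGglomerative NESting) is a hierarchical clustering algorithm that uses a bottom-up strategy: it starts by treating every sample as its own cluster and repeatedly merges the two closest clusters until a preset number of clusters is reached.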


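The distance between sets is often computed with the Hausdorff distance.

The distance between clusters can be the minimum distance (determined by the closest pair of samples from the two sets), the maximum distance (determined by the farthest pair), or the average distance (determined jointly by all samples of the two clusters); the corresponding AGNES variants are called single-linkage, complete-linkage, and average-linkage.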
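The three linkage choices differ only in how the pairwise sample distances between two clusters are aggregated. A minimal sketch of bottom-up merging, assuming NumPy and Euclidean distance; the function name `agnes`, the `linkage` keyword values, and stopping once `k` clusters remain are assumptions for illustration.

```python
import numpy as np

def agnes(X, k, linkage="average"):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]   # every sample starts as its own cluster
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Aggregate all sample-to-sample distances between the two clusters.
                d = agg(dists[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)        # merge cluster b into cluster a
    return clusters

print(agnes([[0], [1], [10], [11]], k=2, linkage="single"))
```

This brute-force search recomputes every cluster distance in each round, which is fine for a handful of samples but too slow for large datasets.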

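Clustering ensemble combines multiple clustering learners and can effectively reduce the adverse effects of a mismatch between the clustering assumption and the true cluster structure, as well as of random factors in the clustering process.

Anomaly detection is often performed with the help of clustering or distance computation, for example by treating samples far from all cluster centers as anomalies, or treating samples in regions of extremely low density as anomalies. (There are also methods that detect anomalies quickly based on isolation.)

Chapter 10: Dimensionality Reduction and Metric Learning

k-Nearest Neighbor (kNN) learning is a commonly used supervised learning method. It works as follows: given a test sample, find the k training samples closest to it under some distance measure, then make a prediction based on the information from these k neighbors.

Classification tasks usually use voting, choosing the class label that appears most often among the k samples as the prediction; regression tasks use averaging, taking the mean of the real-valued output labels of the k samples as the prediction. Weighted averaging or weighted voting based on distance can also be used, with closer samples receiving larger weights.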
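Voting and averaging can both be shown with one small function. This is a sketch assuming NumPy and Euclidean distance; the name `knn_predict` and the `task` switch are hypothetical, and the distance-weighted variant is left out for brevity.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classify"):
    """Predict for one test sample x from its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest samples
    if task == "classify":
        # Voting: the most frequent label among the neighbors wins.
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
    # Averaging: mean of the neighbors' real-valued outputs.
    return float(np.mean([y_train[i] for i in nearest]))

X_train = [[0], [1], [2], [10], [11]]
print(knn_predict(X_train, ["a", "a", "a", "b", "b"], [1.5], k=3))
print(knn_predict(X_train, [0, 1, 2, 10, 11], [1.2], k=3, task="regress"))
```

Note that nothing is computed before a query arrives; all the work happens inside the prediction call.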

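k-nearest-neighbor learning has one notable characteristic: it appears to have no explicit training process. In fact, it is a well-known representative of lazy learning: such techniques merely store the samples during the training phase, so the training time cost is zero, and processing is deferred until a test sample arrives. Methods that process the samples during the training phase are called eager learning.

Clearly, both the value of k and the way distance is computed have a large effect on the result.

When k = 1 (1NN), the classifier for a classification problem is the nearest-neighbor classifier. Given a test sample x whose nearest neighbor is z, the probability that the nearest-neighbor classifier errs is the probability that x and z have different class labels. Working through the derivation gives the conclusion: although the nearest-neighbor classifier is simple, its generalization error rate does not exceed twice the error rate of the Bayes optimal classifier. (This assumes a suitable distance measure and a sufficiently high sampling density of training samples, i.e. dense sampling: dense enough that a training sample can be found within an arbitrarily small distance of any test sample.)

The dense sample assumption is generally hard to satisfy because it requires far too many samples. In addition, many learning methods involve distance computation, and such computations become even harder in high-dimensional spaces.

The problems that arise in high dimensions, such as sparse data samples and difficult distance computation, are called the "curse of dimensionality".

Two important ways to alleviate the curse of dimensionality are dimension reduction (also called dimensionality reduction) and feature selection.

Dimension reduction transforms the original high-dimensional attribute space into a low-dimensional subspace through some mathematical transformation; in this subspace the sample density increases substantially and distance computation becomes easier.

Low-dimensional embedding: in many cases, although the observed or collected sample data are high-dimensional, what is relevant to the actual learning task may be only some low-dimensional distribution, which is a low-dimensional embedding in the original high-dimensional space.

Multiple Dimensional Scaling (MDS) requires that the distances between samples in the original space be preserved in the low-dimensional space.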
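A minimal sketch of classical MDS in its common eigendecomposition form, assuming NumPy and a matrix of pairwise Euclidean distances as input. The function name `classical_mds` and the clipping of tiny negative eigenvalues are assumptions made for this illustration.

```python
import numpy as np

def classical_mds(D, d):
    """Embed m samples in d dimensions from their m x m distance matrix D."""
    D = np.asarray(D, dtype=float)
    m = len(D)
    J = np.eye(m) - np.ones((m, m)) / m       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # inner-product matrix of the embedding
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]     # keep the d largest
    lam = np.clip(eigvals[order], 0.0, None)  # guard against round-off negatives
    return eigvecs[:, order] * np.sqrt(lam)

X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Z = classical_mds(D, d=2)
print(np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2))
```

The recovered coordinates may be rotated or reflected relative to the originals, but the pairwise distances are preserved.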


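The effect of dimensionality reduction is usually evaluated by comparing the learner's performance before and after the reduction; if performance improves, the reduction is considered to have helped.

Principal Component Analysis (PCA) is the most commonly used dimensionality reduction method (an unsupervised linear method).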


Note: the detailed derivation is omitted; it is enough to understand nearest reconstruction and maximum separability.
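Both views lead to the same computation: center the samples, take the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. A short sketch assuming NumPy; the function name `pca` and its return values are illustrative choices.

```python
import numpy as np

def pca(X, d):
    """Project samples onto the top-d principal directions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center the samples
    cov = Xc.T @ Xc / (len(X) - 1)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]    # directions of largest variance
    W = eigvecs[:, order]
    return Xc @ W, W

X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
Z, W = pca(X, d=1)
# Samples on a line are reconstructed exactly from one component.
print(Z @ W.T + X.mean(axis=0))
```

Here the projected coordinates keep all of the variance (maximum separability) and the reconstruction error is zero (nearest reconstruction), because the data lie exactly on a line.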


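For example, if samples are drawn from a rectangular region in two-dimensional space and embedded in three-dimensional space as an S-shaped surface, applying linear dimensionality reduction directly will lose the original low-dimensional structure.

The low-dimensional space from which the samples were originally drawn is called the intrinsic low-dimensional space; note that it must be distinguished from the low-dimensional space obtained after dimensionality reduction.

A common approach to nonlinear dimensionality reduction is to kernelize a linear method using the kernel trick: Kernelized PCA (KPCA).

Manifold learning.

Metric learning, also called distance metric learning.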


Original: https://blog.csdn.net/baby_hua/article/details/122456515
Author: baby_hua
Title: "Machine Learning" Theory: Speed-Reading Study 2, Common Methods (3)

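Original articles are protected by copyright. When reposting, please cite the source: https://www.johngo689.com/563363/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author and source!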
