Classic Clustering Algorithms (Part 1)

June 2, 2023, 5:31 PM • Artificial Intelligence • 80 views

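Notes on clustering algorithms, written while working through a lab assignment
If the images do not load, please visit the original site: http://taoblog421.cn/posts/27782ca8/
Reference article: https://developer.ibm.com/zh/articles/ba-1607-clustering-algorithm/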

1. Classification and Clustering

Supervised vs. unsupervised learning

Everyday example:


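2. K-Means Algorithm

2.1 Basics

K-means is the most basic, and one of the most important, clustering algorithms. It works well on the vast majority of data, but it cannot handle irregularly shaped clusters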

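Pick k random points as the k initial centroids;
Compute the distance from every other point to each of the k centroids;
If point p is closest to the n-th centroid, it belongs to cluster n; label it by setting point p.label = n, where n is one of the k cluster indices;
Recompute each centroid as the mean of the points that carry its label;
Repeat the assignment and update steps until no centroid changes.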
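The loop just described, together with the usual recompute-and-repeat part, can be sketched in a few lines of Python. This is only a minimal sketch under assumptions of my own: the function name `kmeans`, its parameters, random seeding and Euclidean distance. It is not the Java implementation given later in this post.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Pick k distinct input points as the initial centroids (random seeding is an assumption).
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda n: math.dist(p, centroids[n]))
            for p in points
        ]
        # Update: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for n in range(k):
            members = [p for p, lab in zip(points, labels) if lab == n]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster simply keeps its old centroid.
                new_centroids.append(centroids[n])
        # Stop as soon as no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(sample, 2))
```

Because the seeding is random, a different `seed` may number the clusters differently, and on harder data it may settle on a different local optimum.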

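Estimating K

The value of k has to be known in advance, which is one of the drawbacks of k-means. There are several ways to estimate k; here I use the average-diameter method

First treat all the points as one big cluster and take the mean of all pairwise distances as that cluster's average diameter. To choose the initial centroids, start with the two points that are farthest apart. Then any point that is far from both of them (meaning its distance to each of the two farthest points is greater than the average diameter of the whole cluster) counts as a newly discovered centroid; otherwise it does not. This gives the number of centroids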
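As a rough illustration of the average-diameter idea, here is a small Python sketch. The function name `estimate_initial_centroids` is hypothetical, and it uses plain Euclidean distance, which is an assumption on my part.

```python
import math
from itertools import combinations


def estimate_initial_centroids(points):
    """Pick initial centroids with the average-diameter heuristic; k is len(result)."""
    pairs = list(combinations(points, 2))
    # Average diameter: the mean of all pairwise distances.
    avg_diameter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    # Start from the two points that are farthest apart.
    far_a, far_b = max(pairs, key=lambda ab: math.dist(*ab))
    centroids = [far_a, far_b]
    # A point farther than the average diameter from both of them is a new centroid.
    for p in points:
        if math.dist(p, far_a) > avg_diameter and math.dist(p, far_b) > avg_diameter:
            centroids.append(p)
    return centroids


if __name__ == "__main__":
    two_blobs = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(len(estimate_initial_centroids(two_blobs)))
```

Note that every candidate is compared only with the first two centroids, never with centroids found later, so several points from the same region can all be accepted.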

2.2 Implementation

Test data kmeans.txt

1,1
2,1
1,2
2,2
6,1
6,2
7,1
7,2
1,5
1,6
2,5
2,6
6,5
6,6
7,5
7,6

Implementation (Java):

Text file reading utility class:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class FileRead {
    public static void main(String[] args) {
        String filePath = "path/to/file"; // replace with the directory and name of your file
        List<String> list = FileRead.read(filePath);
        for (String s : list) {
            System.out.println(s);
        }
    }

    // Read the file line by line (decoded as GBK) and return all of its lines.
    public static List<String> read(String filePath){
        List<String> data = new ArrayList<String>();

        // try-with-resources closes the reader and the underlying stream automatically
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath),"gbk"))) {
            String str = null;
            while((str = bufferedReader.readLine()) != null)
            {
                data.add(str);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return data;
    }
}

Kmeans.java:

import java.util.*;

public class Kmeans {

    private static List<Node> nodeList;

    private static Map<Integer,Node> centroid;

    public static void main(String[] args) {

        nodeList = getNodeListByPath("g:/kmeans.txt");

        centroid = computeK(nodeList);

        doIteration();

        printResult();

    }

    public static void printResult() {
        for (Node node : nodeList) {
            System.out.println(Arrays.toString(node.getAttributes()) + " belongs to cluster " + node.getLabel());
        }
    }

    public static void doIteration() {

        while (true) {

            // Assignment step: label each node with the key of its nearest centroid.
            for (Node node : nodeList) {
                double min = Double.MAX_VALUE;
                int nearest = -1;
                for (Map.Entry<Integer, Node> entry : centroid.entrySet()) {
                    double d = getDistance(node, entry.getValue());
                    if (d < min) {
                        min = d;
                        nearest = entry.getKey();
                    }
                }
                node.setLabel(nearest);
            }

            Map<Integer,Node> oldCentroid = centroid;

            centroid = new HashMap<>();

            // Update step: group the nodes by label.
            List<Integer> labelList = new ArrayList<>();
            List<CentroidSupport> centroidSupportList = new ArrayList<>();
            for (Node node : nodeList) {
                if (! labelList.contains(node.getLabel())) {
                    CentroidSupport centroidSupport = new CentroidSupport(node.getLabel());
                    labelList.add(node.getLabel());
                    centroidSupport.getNodeList().add(node);
                    centroidSupportList.add(centroidSupport);
                } else {
                    for (CentroidSupport centroidSupport : centroidSupportList) {
                        if (centroidSupport.getLabel() == node.getLabel()) {
                            centroidSupport.getNodeList().add(node);
                        }
                    }
                }
            }

            // The new centroid of each cluster is the mean of its nodes, stored under the cluster's own label.
            for (CentroidSupport centroidSupport : centroidSupportList) {
                Node avg = CentroidSupport.getAvg(centroidSupport.getNodeList());
                centroid.put(centroidSupport.getLabel(), avg);
            }

            // Converged once every centroid has the same coordinates as in the previous iteration.
            boolean converged = true;
            for (Map.Entry<Integer, Node> entry : centroid.entrySet()) {
                Node oldNode = oldCentroid.get(entry.getKey());
                if (!Arrays.equals(entry.getValue().getAttributes(), oldNode.getAttributes())) {
                    converged = false;
                }
            }
            if (converged) {
                break;
            }

        }

    }

    public static List<Node> getNodeListByPath(String path) {
        List<Node> result = new ArrayList<>();
        List<String> list = FileRead.read(path);
        for (String s : list) {
            List<String> list1 = Arrays.asList(s.split(","));
            Node node = new Node();
            double[] attArr = new double[6];
            for (int i = 0; i < list1.size(); i++) {
                attArr[i] = Double.parseDouble(list1.get(i));
            }
            node.setAttributes(attArr);
            result.add(node);
        }
        return result;
    }

    public static Map<Integer,Node> computeK(List<Node> nodeList) {

        List<NodeSupport> nodeSupportList = new ArrayList<>();

        double distanceSum = 0;
        double distanceAvg;
        int count = 0;

        Map<Integer,Node> resultMap = new HashMap<>();

        for (int i = 0; i < nodeList.size(); i++) {
            for (int j = i+1; j < nodeList.size(); j++) {
                distanceSum += getDistance(nodeList.get(i),nodeList.get(j));

                nodeSupportList.add(new NodeSupport(nodeList.get(i),nodeList.get(j),getDistance(nodeList.get(i),nodeList.get(j))));
                count ++;
            }
        }

        distanceAvg = distanceSum/count;

        count = 3;

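        // Compare with Double.compare; casting the difference to int would truncate fractional differences to 0.
        NodeSupport max = Collections.max(nodeSupportList, (n1, n2) -> Double.compare(n1.getDistance(), n2.getDistance()));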

        resultMap.put(1,max.getNode1());
        resultMap.put(2,max.getNode2());

        for (Node node : nodeList) {
            if (getDistance(node,max.getNode1()) > distanceAvg && getDistance(node,max.getNode2()) > distanceAvg) {

                resultMap.put(count,node);
                count ++;
            }
        }

        return resultMap;
    }

    // Squared Euclidean distance (no square root taken); it orders points the same way as the true Euclidean distance.
    public static double getDistance(Node n1,Node n2) {
        double distance = 0;
        for (int i = 0; i < n1.getAttributes().length; i++) {
            double diff = n1.getAttributes()[i] - n2.getAttributes()[i];
            distance += diff * diff;
        }
        return distance;
    }

}

class Node {

    private int label;
    private double[] attributes = new double[6];

    public int getLabel() {
        return label;
    }

    public void setLabel(int label) {
        this.label = label;
    }

    public double[] getAttributes() {
        return attributes;
    }

    public void setAttributes(double[] attributes) {
        this.attributes = attributes;
    }

    @Override
    public String toString() {
        return "Node{" +
                "label=" + label +
                ", attributes=" + Arrays.toString(attributes) +
                '}';
    }
}

class NodeSupport{

    private Node node1;
    private Node node2;
    private double distance;

    public NodeSupport(Node node1, Node node2, double distance) {
        this.node1 = node1;
        this.node2 = node2;
        this.distance = distance;
    }

    public Node getNode1() {
        return node1;
    }

    public void setNode1(Node node1) {
        this.node1 = node1;
    }

    public Node getNode2() {
        return node2;
    }

    public void setNode2(Node node2) {
        this.node2 = node2;
    }

    public double getDistance() {
        return distance;
    }

    public void setDistance(double distance) {
        this.distance = distance;
    }

    @Override
    public String toString() {
        return "NodeSupport{" +
                "node1=" + node1 +
                ", node2=" + node2 +
                ", distance=" + distance +
                '}';
    }
}

class CentroidSupport{

    private Integer label;
    private List<Node> nodeList = new ArrayList<>();

    public static Node getAvg(List<Node> list) {
        Node sum = new Node();
        for (Node node : list) {
            for (int i = 0; i < node.getAttributes().length; i++) {
                sum.getAttributes()[i] += node.getAttributes()[i];
            }
        }
        for (int i = 0; i < sum.getAttributes().length; i++) {
            sum.getAttributes()[i] /= list.size();
        }
        return sum;
    }

    public Integer getLabel() {
        return label;
    }

    public void setLabel(Integer label) {
        this.label = label;
    }

    public List<Node> getNodeList() {
        return nodeList;
    }

    public void setNodeList(List<Node> nodeList) {
        this.nodeList = nodeList;
    }

    public CentroidSupport(Integer label) {
        this.label = label;
    }

    @Override
    public String toString() {
        return "CentroidSupport{" +
                "label=" + label +
                ", nodeList=" + nodeList +
                '}';
    }
}

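Output:

[Image: https://i.loli.net/2020/12/29/1uRi9t7BlemLwQT.png]

This implementation assumes the data has at most six dimensions. In hindsight a fixed-size array was the wrong way to store the attributes, but by the time I noticed it was too late to change. To handle more dimensions, modify the Node class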

3. Hierarchical Clustering

3.1 Basics


Cluster proximity criteria:

min (single linkage), max (complete linkage), and group average. The figure below makes them easy to understand

[Image: https://i.loli.net/2020/12/29/jeldwvnT7KsQDL8.png]
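In code, the three criteria differ only in how the point-to-point distances between two clusters are reduced to a single number. A minimal sketch, assuming Euclidean distance between points; the function name `cluster_distance` and the `criterion` strings are my own choices:

```python
import math
from itertools import product


def cluster_distance(cluster_a, cluster_b, criterion="min"):
    """Proximity of two clusters under the min, max, or group-average criterion."""
    # All point-to-point distances with one point taken from each cluster.
    dists = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if criterion == "min":
        return min(dists)
    if criterion == "max":
        return max(dists)
    if criterion == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown criterion: {criterion}")


if __name__ == "__main__":
    a = [(0, 0), (1, 0)]
    b = [(3, 0), (5, 0)]
    for name in ("min", "max", "average"):
        print(name, cluster_distance(a, b, name))
```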

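Distance measures:

References for this part:
https://blog.csdn.net/sinat_30353259/article/details/80885702
https://blog.csdn.net/zmdsjtu/article/details/77965222

Euclidean distance:

The point-to-point distance formula used all the time in middle- and high-school math

Manhattan distance:

The name more or less gives it away: the distance from one intersection to another in Manhattan's street grid is certainly not the straight-line distance but the distance you actually drive, unless you fly straight over on an Oppressor Mk II…

Chebyshev distance:

I seem to remember learning Chebyshev's inequality at some point; no idea whether the two are related…

In chess the king can move one square in any direction per move. The minimum number of moves to get from (x1,y1) to (x2,y2) is the Chebyshev distance, which equals max(|x1 - x2|, |y1 - y2|)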
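The three distances so far fit in a few lines of Python. This is a small sketch with function names of my own choosing; each function takes two equal-length coordinate tuples.

```python
import math


def euclidean(a, b):
    """Straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the per-axis absolute differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest per-axis absolute difference (number of king moves)."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    p, q = (1, 2), (4, 6)
    print(euclidean(p, q))  # 5.0
    print(manhattan(p, q))  # 7
    print(chebyshev(p, q))  # 4
```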

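Hamming distance:

The number of positions at which two strings of equal length have different characters

The Hamming distance between 1011101 and 1001001 is 2

The Hamming distance between 2143896 and 2233796 is 3

The Hamming distance between "toned" and "roses" is 3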
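The three examples above can be checked with a short function (a sketch; the name `hamming` is my own):

```python
def hamming(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    print(hamming("1011101", "1001001"))  # 2
    print(hamming("2143896", "2233796"))  # 3
    print(hamming("toned", "roses"))      # 3
```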

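Cosine distance:

1 minus the cosine of the angle between two vectors (flashback to high school: the law of cosines??). That cosine value is also called the cosine similarity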
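A minimal sketch of that definition, using the dot product divided by the product of the vector lengths as the cosine. The function names are my own, and both vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


if __name__ == "__main__":
    print(cosine_distance((1, 0), (0, 1)))  # perpendicular vectors
    print(cosine_distance((1, 2), (2, 4)))  # same direction
```

Vectors pointing the same way give a distance near 0, perpendicular ones give 1, and opposite ones give 2; the length of the vectors does not matter.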

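Minkowski distance:

Not a single distance but a family of distance definitions, a general formulation that covers several distance metrics.

The Minkowski distance between two n-dimensional variables a(x11,x12,…,x1n) and b(x21,x22,…,x2n) is defined as:

$$d_{12} = \left( \sum_{k=1}^{n} \left| x_{1k} - x_{2k} \right|^{p} \right)^{1/p}$$

where p is a variable parameter:
when p=1, it is the Manhattan distance;
when p=2, it is the Euclidean distance;
when p→∞, it is the Chebyshev distance.
Depending on p, the Minkowski distance can represent a whole class of distances.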
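The three special cases can be checked numerically. A small sketch, with a function name of my own; `p = math.inf` is handled as the limiting case, since the formula cannot be evaluated with an infinite exponent.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length coordinate tuples."""
    if p == math.inf:
        # Limit p -> infinity: only the largest coordinate difference survives.
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


if __name__ == "__main__":
    a, b = (1, 2), (4, 6)
    print(minkowski(a, b, 1))         # Manhattan: 7.0
    print(minkowski(a, b, 2))         # Euclidean: 5.0
    print(minkowski(a, b, math.inf))  # Chebyshev: 4
    print(minkowski(a, b, 50))        # already very close to 4
```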

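Absolute distance:

Baidu Baike: the sum of the absolute difference of the x-coordinates and the absolute difference of the y-coordinates of two points in the Cartesian plane is called the absolute distance between the two points

d = |x1 - x2| + |y1 - y2|

…(these should be enough)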

3.2 Implementation


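[Image: https://i.loli.net/2020/12/30/HDZx5t6bTwlpeBO.png]

Click to download test3-1.csv

For the proximity criterion, max and min are clearly the simplest choices. For the distance I chose Euclidean and Chebyshev (I originally picked squared Euclidean distance, but did not find it among the parameters of the linkage function; SciPy's pdist does list 'sqeuclidean' as a metric name, so it may be worth another try)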
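To see how the proximity criterion and the distance measure plug into the algorithm, here is a naive agglomerative clustering sketch in plain Python. It is an illustration under my own assumptions (the names `agglomerate` and `chebyshev`, and a fixed target number of clusters), not how SPSS or SciPy implement it, and it is far too slow for large data sets.

```python
import math
from itertools import combinations


def chebyshev(a, b):
    """Largest per-axis absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def agglomerate(points, n_clusters, linkage=min, metric=math.dist):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    linkage=min is the min (single) criterion, linkage=max the max (complete) one.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(
                metric(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


if __name__ == "__main__":
    data = [(1, 1), (2, 1), (1, 2), (2, 2), (6, 1), (6, 2), (7, 1), (7, 2),
            (1, 5), (1, 6), (2, 5), (2, 6), (6, 5), (6, 6), (7, 5), (7, 6)]
    print(agglomerate(data, 4, linkage=min, metric=math.dist))
    print(agglomerate(data, 4, linkage=max, metric=chebyshev))
```

The sample points are the kmeans.txt data from section 2.2; both combinations recover the same four corner groups there, because the gaps between the groups are much larger than the distances inside them.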

That gives four combinations. I tried SPSS first; the proximity criterion and the distance measure are chosen in these two places respectively

(There is no need to run all four combinations. As long as each option gets used once, two combinations are enough.)

[Image: https://i.loli.net/2020/12/30/BsM14qGviOyIDdR.png]

Well, one click and it errors out right away

[Image: https://i.loli.net/2020/12/30/63IXEzOGDjhsq12.png]


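A web search says this is an installation-directory problem: SPSS has to be installed in its default directory. I had no choice but to uninstall and reinstall it (time is really short with finals approaching, so for now I will assume this fixes it)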

Results:

Minimum distance, Euclidean distance
[Image: https://i.loli.net/2020/12/30/ReGQ1w2yEtg7voz.png]
Maximum distance, Chebyshev distance

[Image: https://i.loli.net/2020/12/30/aYS3xN4X2he8dwC.png]

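The two look about the same, and both seem reasonable

Since the assignment asks for Python, I will use that as well

It turns out that in Python you can simply call a library without implementing the algorithm yourself, which is rather nice

Code source: the blog of an expert at my school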

import pandas as pd
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt

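# Load the data set (the CSV file is GBK-encoded)
data = pd.read_csv("test3-1.csv",encoding="gbk")

# Drop the '酒' column, then scale every remaining column to [0, 1]
data = data.drop(['酒'], axis=1)
df = MinMaxScaler().fit_transform(data)

# Agglomerative clustering into 3 clusters (scikit-learn's default linkage is 'ward')
model = AgglomerativeClustering(n_clusters=3)
model.fit(df)
data['类别标签'] = model.labels_  # '类别标签' means "cluster label"
print(data.head())

# Dendrogram 1: min criterion (single linkage) with Euclidean distance
ss = sch.linkage(df,method='single', metric='euclidean')
sch.dendrogram(ss)
plt.show()

# Dendrogram 2: max criterion (complete linkage) with Chebyshev distance
ss = sch.linkage(df,method='complete', metric='chebyshev')
sch.dendrogram(ss)
plt.show()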

[Image: https://i.loli.net/2020/12/30/M1noeHRkzWSiKF9.jpg]


pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

Then install all the libraries again

Output:

[Image: https://i.loli.net/2020/12/30/fDkha24dYy7rnZv.png]

[Image: https://i.loli.net/2020/12/30/Q6qXSY4tlfZ29BD.png]

Original: https://blog.csdn.net/qq_35210105/article/details/112059277
Author: cyf__wlp
Title: Classic Clustering Algorithms (Part 1)
