完成实验的过程学习下聚类分析算法
内容图片如无法查看请前往原站点访问:http://taoblog421.cn/posts/27782ca8/
参考文章:https://developer.ibm.com/zh/articles/ba-1607-clustering-algorithm/
1、分类和聚类
监督学习与非监督学习
生活例子:
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:3de020a4-7ed3-48f6-90a3-ea00b495077c
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:744a9580-cf82-4759-8111-07addcb4e2fb
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:f49ef32a-5154-48f9-bdd3-6bf2882f3f85
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:65fd2ac1-73e7-4cdb-ae8b-33602a13a5bc
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:be572737-d075-491a-b519-4ed796e05ea5
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:8106391f-d817-46cc-be10-3ef52101cc13
2、K均值算法
2.1 基础知识
k均值算法为聚类算法中最基础也最重要的算法,对绝大多数数据有效,但是不能处理异型簇
- 随机的取 k 个点作为 k 个初始质心;
- 计算其他点到这个 k 个质心的距离;
- 如果某个点 p 离第 n 个质心的距离更近,则该点属于 cluster n,并对其打标签,标注 point p.label=n,其中 n
K值的估计
对于k的值必须提前知道,这也是kmeans算法的一个缺点,对于k值有多种估计方法,这里使用 平均直径法来估计
就是首先视所有的点为一个大的整体 cluster,计算所有点之间距离的平均值作为该 cluster 的平均直径。选择初始质心的时候,先选择最远的两个点,接下来从这最两个点开始,与这最两个点距离都很远的点(远的程度为,该点到之前选择的最远的两个点的距离都大于整体 cluster 的平均直径)可视为新发现的质心,否则不视之为质心。这样就可以得到质心的数量
2.2 代码实现
测试数据 kmeans.txt
1,1
2,1
1,2
2,2
6,1
6,2
7,1
7,2
1,5
1,6
2,5
2,6
6,5
6,6
7,5
7,6
代码实现(java):
文本读取工具类:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
public class FileRead {
public static void main(String[] args) {
String filePath = "文件所在路径/文件名";
List<String> list = FileRead.read(filePath);
for (String s : list) {
System.out.println(s);
}
}
public static List<String> read(String filePath){
FileInputStream inputStream = null;
BufferedReader bufferedReader = null;
List<String> data = new ArrayList<String>();
try {
inputStream = new FileInputStream(filePath);
bufferedReader = new BufferedReader(new InputStreamReader(inputStream,"gbk"));
String str = null;
while((str = bufferedReader.readLine()) != null)
{
data.add(str);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
if (inputStream != null) {
try {
inputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if (bufferedReader != null) {
try {
bufferedReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return data;
}
}
Kmeans.java:
import java.util.*;
public class Kmeans {
private static List<Node> nodeList;
private static Map<Integer,Node> centroid;
public static void main(String[] args) {
nodeList = getNodeListByPath("g:/kmeans.txt");
centroid = computeK(nodeList);
doIteration();
printResult();
}
public static void printResult() {
for (Node node : nodeList) {
System.out.println(Arrays.toString(node.getAttributes()) + "belongs to cluster " + node.getLabel());
}
}
public static void doIteration() {
while (true) {
for (Node node : nodeList) {
Map<Double,Integer> distance = new HashMap<>();
for (Map.Entry<Integer, Node> entry : centroid.entrySet()) {
distance.put(getDistance(node,entry.getValue()), entry.getKey());
}
double min = 0;
for (Double value : distance.keySet()) {
if (min < value) {
min = value;
}
}
node.setLabel(distance.get(min));
}
Map<Integer,Node> oldCentroid = centroid;
centroid = new HashMap<>();
int count = 1;
List<Integer> labelList = new ArrayList<>();
List<CentroidSupport> centroidSupportList = new ArrayList<>();
for (Node node : nodeList) {
if (! labelList.contains(node.getLabel())) {
CentroidSupport centroidSupport = new CentroidSupport(node.getLabel());
labelList.add(node.getLabel());
centroidSupport.getNodeList().add(node);
centroidSupportList.add(centroidSupport);
} else {
for (CentroidSupport centroidSupport : centroidSupportList) {
if (centroidSupport.getLabel() == node.getLabel()) {
centroidSupport.getNodeList().add(node);
}
}
}
}
for (CentroidSupport centroidSupport : centroidSupportList) {
Node avg = CentroidSupport.getAvg(centroidSupport.getNodeList());
centroid.put(count,avg);
count ++;
}
boolean falg = false;
for (Map.Entry<Integer, Node> entry : centroid.entrySet()) {
Node node1 = centroid.get(entry.getKey());
Node node2 = oldCentroid.get(entry.getKey());
if (node1.getLabel() == node2.getLabel() && node1.getAttributes() != node2.getAttributes()) {
falg = true;
}
}
if (falg) {
break;
}
}
}
public static List<Node> getNodeListByPath(String path) {
List<Node> result = new ArrayList<>();
List<String> list = FileRead.read(path);
for (String s : list) {
List<String> list1 = Arrays.asList(s.split(","));
Node node = new Node();
double[] attArr = new double[6];
for (int i = 0; i < list1.size(); i++) {
attArr[i] = Double.parseDouble(list1.get(i));
}
node.setAttributes(attArr);
result.add(node);
}
return result;
}
public static Map<Integer,Node> computeK(List<Node> nodeList) {
List<NodeSupport> nodeSupportList = new ArrayList<>();
double distanceSum = 0;
double distanceAvg;
int count = 0;
Map<Integer,Node> resultMap = new HashMap<>();
for (int i = 0; i < nodeList.size(); i++) {
for (int j = i+1; j < nodeList.size(); j++) {
distanceSum += getDistance(nodeList.get(i),nodeList.get(j));
nodeSupportList.add(new NodeSupport(nodeList.get(i),nodeList.get(j),getDistance(nodeList.get(i),nodeList.get(j))));
count ++;
}
}
distanceAvg = distanceSum/count;
count = 3;
NodeSupport max = Collections.max(nodeSupportList, (n1, n2) -> (int) (n1.getDistance() - n2.getDistance()));
resultMap.put(1,max.getNode1());
resultMap.put(2,max.getNode2());
for (Node node : nodeList) {
if (getDistance(node,max.getNode1()) > distanceAvg && getDistance(node,max.getNode2()) > distanceAvg) {
resultMap.put(count,node);
count ++;
}
}
return resultMap;
}
public static double getDistance(Node n1,Node n2) {
double distance = 0;
for (int i = 0; i < n1.getAttributes().length; i++) {
distance += (n1.getAttributes()[i] - n2.getAttributes()[i]) * (n1.getAttributes()[i] - n2.getAttributes()[i]);
}
return distance;
}
}
class Node {
private int label;
private double[] attributes = new double[6];
public int getLabel() {
return label;
}
public void setLabel(int label) {
this.label = label;
}
public double[] getAttributes() {
return attributes;
}
public void setAttributes(double[] attributes) {
this.attributes = attributes;
}
@Override
public String toString() {
return "Node{" +
"label=" + label +
", attributes=" + Arrays.toString(attributes) +
'}';
}
}
class NodeSupport{
private Node node1;
private Node node2;
private double distance;
public NodeSupport(Node node1, Node node2, double distance) {
this.node1 = node1;
this.node2 = node2;
this.distance = distance;
}
public Node getNode1() {
return node1;
}
public void setNode1(Node node1) {
this.node1 = node1;
}
public Node getNode2() {
return node2;
}
public void setNode2(Node node2) {
this.node2 = node2;
}
public double getDistance() {
return distance;
}
public void setDistance(double distance) {
this.distance = distance;
}
@Override
public String toString() {
return "NodeSupport{" +
"node1=" + node1 +
", node2=" + node2 +
", distance=" + distance +
'}';
}
}
class CentroidSupport{
private Integer label;
private List<Node> nodeList = new ArrayList<>();
public static Node getAvg(List<Node> list) {
Node sum = new Node();
for (Node node : list) {
for (int i = 0; i < node.getAttributes().length; i++) {
sum.getAttributes()[i] += node.getAttributes()[i];
}
}
for (int i = 0; i < sum.getAttributes().length; i++) {
sum.getAttributes()[i] /= list.size();
}
return sum;
}
public Integer getLabel() {
return label;
}
public void setLabel(Integer label) {
this.label = label;
}
public List<Node> getNodeList() {
return nodeList;
}
public void setNodeList(List<Node> nodeList) {
this.nodeList = nodeList;
}
public CentroidSupport(Integer label) {
this.label = label;
}
@Override
public String toString() {
return "CentroidSupport{" +
"label=" + label +
", nodeList=" + nodeList +
'}';
}
}
运行结果:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Fi17BdY4-1609473021134)(https://i.loli.net/2020/12/29/1uRi9t7BlemLwQT.png)]
这里默认数据最多是六维,设计的时候不应该使用数组存储,发现已经来不及改,若需要增加多维数据修改Node类即可
3、层次聚类算法
3.1 基础知识
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:983e230a-4c80-4fe4-85a7-0cec5f3db61c
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:f77588af-0f2b-462e-9de7-8fe9c41ba7b1
簇的邻近准则:
min 、max、组平均。下面这张图可以很好的理解
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-jgQxNHR7-1609473021138)(https://i.loli.net/2020/12/29/jeldwvnT7KsQDL8.png)]
各种距离:
该部分参考:
https://blog.csdn.net/sinat_30353259/article/details/80885702
https://blog.csdn.net/zmdsjtu/article/details/77965222
欧式距离:
初中高中数学经常使用的两点之间的距离公式
曼哈顿距离:
名字可以猜测个大概,从曼哈顿街区一个十字路口到另一个路口的距离肯定不是直线距离,而是实际驾驶的距离,除非暴君mk2直接飞过去…
切比雪夫距离:
之前好像学过个切比雪夫不等式,不知道有啥关系没有…
国际象棋里国王的走法每次只能向各个方向走一步,从点(x1,y1)走到(x2,y2)的最少步数为切比雪夫距离,这个距离为 max(|x1 – x2|,|y1 -y2|)
汉明距离:
两个字符串对应字符不一样的个数
1011101 与 1001001 之间的汉明距离是 2
2143896 与 2233796 之间的汉明距离是 3
“toned” 与 “roses” 之间的汉明距离是 3
余弦距离:
1 – 两个向量角的余弦值(梦回高中,余弦定理 ??),这个余弦值又叫做余弦相似度
闵氏距离:
不是一种距离, 而是一组距离的定义, 是对多个距离度量公式的概括性的表述。
两个n维变量a(x11,x12,…,x1n)与b(x21,x22,…,x2n)间的闵可夫斯基距离定义为:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZesCzFVw-1609473021140)(https://i.loli.net/2020/12/30/pSQmR7batZou4vA.png)]
其中p是一个变参数:
当p=1时, 就是曼哈顿距离;
当p=2时, 就是欧氏距离;
当p→∞时, 就是切比雪夫距离。
根据p的不同, 闵氏距离可以表示某一类/种的距离。
绝对距离:
百度百科:平面直角坐标系中两点的横坐标的差的绝对值与纵坐标的差的绝对值的和叫做这两点的绝对距离
d = |x1 – x2| + |y1 + y2|
…(这些应该够用了)
3.2 代码实现
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:fc7a918b-4f77-40a7-aa73-365500978fad
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:3f20ec4e-4a88-4bb9-832d-8363ed73943b
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CQRXIVGR-1609473021143)(https://i.loli.net/2020/12/30/HDZx5t6bTwlpeBO.png)]
临近准则就选 最大和 最小显然比较简单,距离选择 欧氏距离和 切比雪夫距离(本来选的平方欧式距离,结果linkage这个函数的参数里没有平方欧式距离)
这样就有四种组合,先用spss软件操作一波,在这两个地方分布选择临近准则和距离衡量
(四种组合就不用了,应该是用到就行,不用覆盖每个组合,两种就行这样)
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CNpilRPT-1609473021145)(https://i.loli.net/2020/12/30/BsM14qGviOyIDdR.png)]
好家伙,一点直接报错
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-c2ZeAt7Q-1609473021146)(https://i.loli.net/2020/12/30/63IXEzOGDjhsq12.png)]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:32fdb307-160a-4158-8664-f7112324690d
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:bf71fd1d-135a-4743-88b8-02fb2817d65a
网上查了是安装目录的问题,需要将spss安装在默认目录下,无奈卸载重装(时间实在有限,期末将至,暂且认为这样可以解决)
结果:
- 最小距离,欧式距离
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-4euOOA95-1609473021148)(https://i.loli.net/2020/12/30/ReGQ1w2yEtg7voz.png)] - 最大距离,切比雪夫距离
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ha4qT1AJ-1609473021148)(https://i.loli.net/2020/12/30/aYS3xN4X2he8dwC.png)]
貌似都差不多,根据不错
既然要求了python就只好先用了
原来python直接调库就行,都不用实现算法,还是有点爽
代码来源:本校大佬博客
import pandas as pd
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
data = pd.read_csv("test3-1.csv",encoding="gbk")
data = data.drop(['酒'], axis=1)
df = MinMaxScaler().fit_transform(data)
model = AgglomerativeClustering(n_clusters=3)
model.fit(df)
data['类别标签'] = model.labels_
print(data.head())
ss = sch.linkage(df,method='single', metric='euclidean')
sch.dendrogram(ss)
plt.show()
ss = sch.linkage(df,method='complete', metric='chebychev')
sch.dendrogram(ss)
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-KF07q5nI-1609473021149)(https://i.loli.net/2020/12/30/M1noeHRkzWSiKF9.jpg)]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:0e510998-13c2-48cc-97c0-6c3d234b67cc
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:e73e9490-c776-4988-8201-461cbdd5de55
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:2221d9b9-a6e0-42cd-95e4-493fcd6720b3
[En]
[TencentCloudSDKException] code:FailedOperation.ServiceIsolate message:service is stopped due to arrears, please recharge your account in Tencent Cloud requestId:aebe3fad-c4aa-4300-9feb-b51594b81acb
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
再全部安装一遍库
运行结果:
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5yi5vJH4-1609473021150)(https://i.loli.net/2020/12/30/fDkha24dYy7rnZv.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rG708ngj-1609473021151)(https://i.loli.net/2020/12/30/Q6qXSY4tlfZ29BD.png)]
Original: https://blog.csdn.net/qq_35210105/article/details/112059277
Author: cyf__wlp
Title: 聚类分析经典算法(一)
原创文章受到原创版权保护。转载请注明出处:https://www.johngo689.com/561595/
转载文章受原作者版权保护。转载请注明原作者出处!