Data Analysis - Clustering - Case Study

Contents

0. Dataset Introduction

1. Import Required Packages

2. Read the Data

3. Data Exploration

4. Data Preprocessing

5. Modeling

5.1 KMeans

Finding the optimal K

5.2 MeanShift

5.3 AgglomerativeClustering

5.4 DBSCAN

5.5 SpectralClustering

0. Dataset Introduction

The data consists of technical specifications of cars. The dataset was downloaded from the UCI Machine Learning Repository:

UCI Machine Learning Repository: Auto MPG Data Set

Content

  1. Title: Auto-Mpg Data
  2. Sources:
     (a) Origin: This dataset was taken from the StatLib library which is
         maintained at Carnegie Mellon University. The dataset was
         used in the 1983 American Statistical Association Exposition.
     (c) Date: July 7, 1993
  3. Past Usage:
     - See 2b (above)
     - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning.
       In Proceedings on the Tenth International Conference of Machine
       Learning, 236-243, University of Massachusetts, Amherst. Morgan
       Kaufmann.
  4. Relevant Information: This dataset is a slightly modified version of the dataset provided in
     the StatLib library. In line with the use by Ross Quinlan (1993) in
     predicting the attribute "mpg", 8 of the original instances were removed
     because they had unknown values for the "mpg" attribute. The original
     dataset is available in the file "auto-mpg.data-original".
     "The data concerns city-cycle fuel consumption in miles per gallon,
     to be predicted in terms of 3 multivalued discrete and 5 continuous
     attributes." (Quinlan, 1993)
  5. Number of Instances: 398
  6. Number of Attributes: 9 including the class attribute
  7. Attribute Information:
     1. mpg: continuous
     2. cylinders: multi-valued discrete
     3. displacement: continuous
     4. horsepower: continuous
     5. weight: continuous
     6. acceleration: continuous
     7. model year: multi-valued discrete
     8. origin: multi-valued discrete
     9. car name: string (unique for each instance)
  8. Missing Attribute Values: horsepower has 6 missing values

1. Import Required Packages

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

2. Read the Data

data=pd.read_csv("d:/datasets/auto-mpg.csv")
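If only the raw UCI file is available instead of a ready-made CSV, it can be read directly with pandas. A minimal sketch, assuming the usual UCI download URL; note that this variant uses comment="\t" to drop the quoted car-name field, so the later removal of "car name" would not be needed:

# a sketch for loading the raw UCI file; this variant omits the "car name" column
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
cols = ["mpg","cylinders","displacement","horsepower","weight","acceleration","model year","origin"]
data_raw = pd.read_csv(url, names=cols, na_values="?", comment="\t", sep=" ", skipinitialspace=True)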

3. Data Exploration

data.head()
data.info()
data.describe()
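data.info() will report horsepower as an object (string) column, because the missing entries are stored as the literal "?". A quick check, a sketch assuming the CSV keeps that placeholder:

# how many "?" placeholders does horsepower contain? (the UCI notes say 6)
print((data["horsepower"]=="?").sum())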

4. Data Preprocessing

data_auto=data.drop("car name",axis=1)
print(data_auto[data_auto["horsepower"]=="?"])
horse=data_auto["horsepower"].value_counts()  # count value occurrences ("?" marks the missing entries)
# drop the incomplete samples
data_auto.drop(data_auto[data_auto["horsepower"]=="?"].index,inplace=True)
data_auto.horsepower=data_auto.horsepower.astype("int64")
# standardize the features
model_sc=StandardScaler()
model_sc.fit(data_auto)
data_auto_sc=model_sc.transform(data_auto)
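As a sanity check, every standardized column should now have mean close to 0 and standard deviation close to 1; a minimal sketch:

# verify the standardization: column means ~0, column standard deviations ~1
print(np.round(data_auto_sc.mean(axis=0),4))
print(np.round(data_auto_sc.std(axis=0),4))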

5. Modeling

5.1 KMeans

model_km=KMeans(n_clusters=3,random_state=10)
model_km.fit(data_auto_sc)
auto_label=model_km.labels_
auto_cluster=model_km.cluster_centers_
pd.Series(auto_label).value_counts()
print(auto_cluster)
print(model_sc.inverse_transform(auto_cluster))
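The inverse-transformed centers are easier to interpret as a labelled table; a sketch reusing the column names of data_auto:

# cluster centers back on the original scale, one row per cluster
centers = pd.DataFrame(model_sc.inverse_transform(auto_cluster), columns=data_auto.columns)
print(centers.round(2))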

Finding the optimal K

for k in [2,3,4,6,300]:
    model_km=KMeans(n_clusters=k,random_state=10).fit(data_auto_sc)
    auto_label=model_km.labels_
    auto_cluster=model_km.cluster_centers_
    print(k,"   ",round(metrics.silhouette_score(data_auto_sc,auto_label),4))
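The silhouette score is one criterion; the elbow method on the within-cluster sum of squares (exposed by KMeans as inertia_) is a common complementary check. A sketch:

# elbow method: inertia (within-cluster sum of squares) versus k
inertias=[]
k_grid=range(2,11)
for k in k_grid:
    inertias.append(KMeans(n_clusters=k,random_state=10).fit(data_auto_sc).inertia_)
plt.plot(list(k_grid),inertias,marker="o")
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()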

5.2 MeanShift

model_mn=MeanShift(bandwidth=2).fit(data_auto_sc)
auto_label=model_mn.labels_
auto_cluster=model_mn.cluster_centers_
bandwidth_grid=np.arange(1,2.5,0.2)
cluster_number=[]
slt_score=[]
for i in bandwidth_grid:
    model=MeanShift(bandwidth=i).fit(data_auto_sc)
    cluster_number.append(len(np.unique(model.labels_)))
    slt_score.append(metrics.silhouette_score(data_auto_sc,model.labels_))
from prettytable import PrettyTable
x = PrettyTable(["bandwidth","number of clusters","silhouette score"])
#x.align["bandwidth"] = "l"  # left-align the bandwidth column
#x.padding_width = 1  # padding width
for i,j,k in zip(bandwidth_grid,cluster_number,slt_score):
    x.add_row([i,j,k])
print(x)
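Instead of scanning a hand-picked grid, sklearn's estimate_bandwidth can suggest a bandwidth from the data itself; a sketch (the quantile value here is an arbitrary choice):

# let sklearn suggest a bandwidth from the pairwise-distance distribution
from sklearn.cluster import estimate_bandwidth
bw = estimate_bandwidth(data_auto_sc, quantile=0.2, random_state=10)
model_bw = MeanShift(bandwidth=bw).fit(data_auto_sc)
print(bw, len(np.unique(model_bw.labels_)))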

5.3 AgglomerativeClustering

model=AgglomerativeClustering(n_clusters=3,linkage="average").fit(data_auto_sc)
auto_label=model.labels_
lbs=pd.Series(auto_label).value_counts()
#plt.bar(x=lbs.index,height=lbs )
lbs.plot(kind="bar",rot=0)
Plot the dendrogram

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# use scipy's pdist, linkage and dendrogram functions to draw the dendrogram
# pdist returns the condensed distance matrix; linkage returns an ndarray describing how clusters are merged
# dendrogram draws the dendrogram itself
row_clusters = linkage(pdist(data_auto_sc,metric='euclidean'),method='ward')
fig = plt.figure(figsize=(16,8))
# p and truncate_mode truncate the dendrogram: subtrees of some nodes are pruned,
# and the x-axis shows the number of samples contained in each truncated node
row_dendr = dendrogram(row_clusters, p=50, truncate_mode='lastp',color_threshold=5)
plt.tight_layout()
plt.title('Dendrogram', fontsize=15)
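The linkage criterion has a large effect on the result; a quick comparison by silhouette score, a sketch assuming three clusters as above:

# compare linkage criteria for AgglomerativeClustering at n_clusters=3
for lk in ["ward","complete","average","single"]:
    lbl = AgglomerativeClustering(n_clusters=3,linkage=lk).fit_predict(data_auto_sc)
    print(lk, round(metrics.silhouette_score(data_auto_sc,lbl),4))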


5.4 DBSCAN

Train the model
model = DBSCAN(eps=1,min_samples=2).fit(data_auto_sc)
Model output
auto_label = model.labels_
Indices of the core samples
model.core_sample_indices_
The core samples themselves
model.components_
clu_num=[]
for min_ in [1,3,5,7,9]:
    model = DBSCAN(eps=1,min_samples=min_).fit(data_auto_sc)
    # cluster labels from the fitted model
    labels=model.labels_
    # number of clusters, excluding the noise label -1
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    clu_num.append(n_clusters_)
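A common way to choose eps is the k-distance plot: sort each sample's distance to its k-th nearest neighbor and look for the knee in the curve. A sketch using NearestNeighbors (k=5 is an arbitrary choice):

# k-distance plot to help choose eps for DBSCAN (look for the "knee")
from sklearn.neighbors import NearestNeighbors
k = 5
distances, _ = NearestNeighbors(n_neighbors=k).fit(data_auto_sc).kneighbors(data_auto_sc)
plt.plot(np.sort(distances[:,-1]))
plt.ylabel("distance to %d-th nearest neighbor" % k)
plt.show()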

5.5 SpectralClustering

from sklearn.cluster import SpectralClustering
model= SpectralClustering(n_clusters=3)
model.fit(data_auto_sc)
auto_label=model.labels_
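SpectralClustering does not expose cluster centers, but its labels can still be compared with the other models using the same silhouette metric; a short sketch:

# cluster sizes and silhouette score for the spectral clustering result
print(pd.Series(auto_label).value_counts())
print(round(metrics.silhouette_score(data_auto_sc,auto_label),4))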

Original: https://blog.csdn.net/it_liujh/article/details/123317735
Author: ITLiu_JH
Title: Data Analysis - Clustering - Case Study
