MachineLearning 3. Cluster Analysis



; Preface

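Cluster analysis, also known as group analysis, is a multivariate statistical method that classifies samples or variables on the principle that "birds of a feather flock together". It deals with large numbers of samples and aims to classify them sensibly according to their own characteristics, without any reference pattern to follow; that is, it proceeds without prior knowledge. Cluster analysis originated in taxonomy. In classical taxonomy, people relied mainly on experience and domain knowledge and rarely used mathematical tools for quantitative classification. As science and technology advanced, the demands on classification grew until experience and domain knowledge alone were often not enough to classify accurately. Mathematical tools were therefore gradually brought into taxonomy, giving rise to numerical taxonomy, and multivariate analysis techniques were later introduced into numerical taxonomy to form cluster analysis.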



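Clustering algorithms fall into five main categories:

  1. Partitioning methods;
  2. Hierarchical methods;
  3. Density-based methods;
  4. Grid-based methods;
  5. Model-based methods.

  1. Partitioning methods first create k partitions, where k is the number of partitions to create. They then use an iterative relocation technique that moves objects from one partition to another to improve the quality of the partitioning. Typical partitioning methods include:

a. k-means, k-medoids (PAM: Partitioning Around Medoids), CLARA (Clustering LARge Applications);

b. CLARANS (Clustering Large Applications based upon RANdomized Search).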
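The iterative relocation idea behind partitioning methods can be sketched in a few lines of base R. This is a minimal illustration of Lloyd-style k-means, not the implementation used later in the article; the function name `simple_kmeans`, the use of the built-in `iris` data, and the fixed number of iterations are all assumptions made for the example.

```r
# Minimal k-means sketch: assign points to the nearest center, then move
# each center to the mean of its points, and repeat.
simple_kmeans <- function(x, k, iters = 20, seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  # start from k randomly chosen observations
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (it in seq_len(iters)) {
    # squared Euclidean distance of every point to every center (n x k)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # relocate each point to its closest center
    cl <- max.col(-d2, ties.method = "first")
    # recompute centers; an empty cluster keeps its previous center
    for (j in seq_len(k)) {
      if (any(cl == j)) centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
  }
  list(cluster = cl, centers = centers)
}

res <- simple_kmeans(iris[, 1:4], k = 3)
table(res$cluster)
```

In practice `kmeans()` from the stats package should be preferred, since it adds convergence checks and multiple random starts (`nstart`).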


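  2. Hierarchical methods create a hierarchy that decomposes the given data set. They can work top-down (divisive) or bottom-up (agglomerative). To compensate for the weaknesses of splitting and merging, hierarchical merging is often combined with other clustering techniques, such as iterative relocation. Typical methods include:

a. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), which first partitions the objects using a tree structure and then refines these clusters with other clustering methods;

b. CURE (Clustering Using REpresentatives), which represents each cluster by a fixed number of representative objects and then shrinks them toward the cluster center by a specified fraction;

c. ROCK, which merges clusters based on the links between them;

d. CHAMELEON, which builds a dynamic model during hierarchical clustering.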
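The two directions of hierarchical clustering can be compared directly in R. The sketch below is only an illustration: it assumes the built-in `USArrests` data and the `cluster` package, uses `hclust()` for the bottom-up (agglomerative) direction and `diana()` for the top-down (divisive) direction, and cuts both trees into three groups, a number chosen arbitrarily.

```r
library(cluster)  # provides diana() for divisive clustering

x <- scale(USArrests)

# bottom-up: start from single observations and merge
agg <- hclust(dist(x), method = "average")

# top-down: start from one cluster and split
div <- diana(x)

# cross-tabulate the three-cluster solutions of both trees
tab <- table(agglomerative = cutree(agg, k = 3),
             divisive = cutree(as.hclust(div), k = 3))
tab
```

Large off-diagonal counts in the table would indicate that the two directions disagree about the grouping.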

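  3. Density-based methods cluster objects according to density, growing a cluster continuously according to the density around each object (as DBSCAN does). Typical density-based methods include:

a. DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which clusters by repeatedly growing regions of sufficiently high density. It can find clusters of arbitrary shape in spatial databases that contain noise, and it defines a cluster as a set of "density-connected" points;

b. OPTICS (Ordering Points To Identify the Clustering Structure), which does not produce a clustering explicitly but computes an augmented cluster ordering for automatic and interactive cluster analysis.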
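To make "growing regions of sufficiently high density" concrete, here is a small base-R sketch of the DBSCAN idea. It is a teaching illustration under stated assumptions, not the `fpc::dbscan` implementation used later: the function name `simple_dbscan`, the simulated two-blob data, and the values of `eps` and `min_pts` are all hypothetical.

```r
# Minimal DBSCAN sketch: a point is a core point if at least min_pts points
# (itself included) lie within eps; clusters grow outward from core points.
simple_dbscan <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  core <- lengths(neighbors) >= min_pts
  cluster <- integer(n)  # 0 means noise or not yet assigned
  cid <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !core[i]) next
    cid <- cid + 1L
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      if (cluster[p] != 0L) next
      cluster[p] <- cid
      # only core points extend the cluster; border points are absorbed but not expanded
      if (core[p]) {
        nb <- neighbors[[p]]
        queue <- c(queue, nb[cluster[nb] == 0L])
      }
    }
  }
  cluster
}

set.seed(42)
pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.2), ncol = 2),  # blob 1
             matrix(rnorm(60, mean = 5, sd = 0.2), ncol = 2),  # blob 2
             c(2.5, 2.5))                                      # isolated point
cl <- simple_dbscan(pts, eps = 0.5, min_pts = 4)
table(cl)
```

The isolated point keeps label 0, which is how noise is reported in this sketch.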


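  4. Grid-based methods. Typical methods include:

a. STING (STatistical INformation Grid), a grid-based clustering method that uses the statistical information stored in grid cells;

b. CLIQUE (CLustering In QUEst) and WaveCluster, which combine the grid-based and density-based approaches.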
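The core step shared by grid-based methods, summarising the data per grid cell and then working on cells instead of points, can be sketched with base R. This is neither STING nor CLIQUE; the simulated data, the 8 x 8 grid, and the density threshold of 5 points per cell are assumptions chosen only for illustration.

```r
set.seed(1)
pts <- data.frame(
  x = c(rnorm(50, mean = 1, sd = 0.2), rnorm(50, mean = 3, sd = 0.2)),
  y = c(rnorm(50, mean = 1, sd = 0.2), rnorm(50, mean = 3, sd = 0.2))
)

# split each axis into 8 equal-width intervals and count points per cell
cell_x <- cut(pts$x, breaks = 8)
cell_y <- cut(pts$y, breaks = 8)
counts <- table(cell_x, cell_y)

# cells holding at least 5 points are treated as dense
dense <- which(counts >= 5, arr.ind = TRUE)
nrow(dense)
```

A full grid-based algorithm would then join adjacent dense cells into clusters; that step is omitted here.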





















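# Load each package; install it from CRAN first if it is missing
# (the install step is an assumption: the original block bodies are truncated)
pkgs <- c("flexclust", "NbClust", "cluster", "fMultivar", "Hmisc", "ggplot2",
    "mclust", "fpc", "optpart", "factoextra")
for (p in pkgs) {
    if (!require(p, character.only = TRUE)) {
        install.packages(p)
        library(p, character.only = TRUE)
    }
}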



DEG = read.table("DEG-resdata.xls", sep = "\t", check.names = F, header = T)
## Down   Up
## 1296 2832
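group <- read.table("deg-group.xls", sep = "\t", check.names = F, header = T)
table(group$group)
##  NT  TP
##  41 478
# The arguments after the file name are lost in the source;
# a tab separator and a header row are assumed here
len <- read.table("all_hg19gene_len.txt", sep = "\t", header = T)
head(len, 2)
##      Gene Length
## 1 DDX11L1
## 2  WASH7P
## (the lengths 1351 and 1735 appear in the source, but their row order is not preserved)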


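library(clusterProfiler)  # provides bitr()
geneList <- DEG$Row.names
# OrgDb is blank in the source; org.Hs.eg.db is assumed because the IDs are human Ensembl genes
eg <- bitr(geneList, fromType = "ENSEMBL", toType = c("ENTREZID", "ENSEMBL", "SYMBOL"),
    OrgDb = "org.Hs.eg.db")
head(eg)
##           ENSEMBL   SYMBOL
## 1 ENSG00000142959    BEST4
## 2 ENSG00000163815   CLEC3B
## 3 ENSG00000107611     CUBN
## 4 ENSG00000162461 SLC25A34
## 5 ENSG00000163959   SLC51A
## 6 ENSG00000144410      CPO
## (the ENTREZID values 7123, 8029, 130749, 200931, 266675, 284723 lost their row alignment in the source)
mergedata <- merge(eg, DEG, by.y = "Row.names", by.x = "ENSEMBL")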

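Because we kept raw read counts earlier, we need to convert them to RPKM. The conversion rules are covered in earlier posts on our WeChat public account: "RNA 12. Tumor immune infiltration calculation methods in SCI articles: CIBERSORT" and "RNA 8. Differential gene expression in SCI articles: heatmap".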

exp <- merge(mergedata, len, by.x = "SYMBOL", by.y = "Gene")
exp <- exp[!duplicated(exp$SYMBOL), ]
kb <- exp$Length/1000  # gene length in kilobases
countdata <- exp[, 10:ncol(exp)]
rpk <- countdata/kb  # reads per kilobase
expMat <- t(t(rpk)/colSums(countdata) * 10^6)  # scale by library size in millions
rownames(expMat) = exp$SYMBOL
expMat[1:3, 1:3]
##        TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-4N-A93T-01A-11R-A37K-07
## A2ML1                     0.3428737                    0.1904078
## AACSP1                    0.1481661                    0.0000000
## AADAC                     4.5036087                    3.7514781
##        TCGA-4T-AA8H-01A-11R-A41B-07
## A2ML1                     9.9516816
## AACSP1                    0.2905691
## AADAC                     1.1776060
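The count-to-RPKM conversion can be checked by hand on a toy matrix. The gene names, sample names, counts, and gene lengths below are made up for illustration; only the arithmetic (divide by gene length in kilobases, then by library size in millions) mirrors the step above.

```r
# two genes, two samples; matrix() fills column by column
counts <- matrix(c(100, 200, 300, 400), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
len_bp <- c(geneA = 1000, geneB = 2000)  # gene lengths in base pairs

rpk <- counts / (len_bp / 1000)                 # reads per kilobase
rpkm <- t(t(rpk) / colSums(counts) * 1e6)       # per million mapped reads
rpkm
```

For sample s1 both genes end up with the same RPKM: geneB has twice the reads of geneA but is also twice as long.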


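sam <- sample(ncol(expMat), 100, replace = FALSE)
expMat = expMat[, sam]
dim(expMat)
## [1] 2534  100
library(dplyr)
set.seed(1234)
# sample_n() needs a data frame; its remaining arguments are lost in the source
exp_sample <- sample_n(as.data.frame(expMat), 50)
# exp_sample<-na.omit(exp_sample)
# Transpose so that rows are samples and columns are genes
exp_sample <- as.data.frame(t(exp_sample))
dim(exp_sample)
## [1] 100  50
exp_sample[1:3, 1:3]
##                                  IL17F         F7  IGHV3-7
## TCGA-AZ-4614-01A-01R-1410-07 1.5826763 192.233246  0.00000
## TCGA-AA-3524-01A-02R-0821-07 0.0000000  40.575613
## TCGA-DM-A1D9-01A-11R-A155-07 0.5863731   1.813507 11.64019
## (the IGHV3-7 value of the second row is missing in the source)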



1. hclust {stats}


Distance measures accepted by dist():

  1. euclidean;
  2. maximum;
  3. manhattan;
  4. canberra;
  5. binary;
  6. minkowski.
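A quick way to see how these measures differ is to apply each of them to the same pair of points. The two points below, (0, 0) and (3, 4), and the Minkowski power `p = 3` are arbitrary choices for the illustration.

```r
pts <- rbind(a = c(0, 0), b = c(3, 4))

methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary")
d <- sapply(methods, function(m) as.numeric(dist(pts, method = m)))

# minkowski needs its power p; p = 2 would reproduce the Euclidean distance
d["minkowski (p = 3)"] <- as.numeric(dist(pts, method = "minkowski", p = 3))
d
```

The Euclidean distance is the familiar 3-4-5 triangle, the Manhattan distance sums the coordinate differences, and the maximum distance keeps only the largest one.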


Linkage methods accepted by hclust():

  1. ward.D;
  2. ward.D2;
  3. single;
  4. complete;
  5. average (= UPGMA);
  6. mcquitty (= WPGMA);
  7. median (= WPGMC);
  8. centroid (= UPGMC).
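One common numeric aid for choosing among linkage methods is the cophenetic correlation, which measures how well the tree heights preserve the original distances. The article itself compares dendrograms visually, so this is a complementary sketch; the built-in `USArrests` data and the four linkages shown are assumptions for the example.

```r
x <- scale(USArrests)
d <- dist(x)

linkages <- c("ward.D2", "single", "complete", "average")

# correlation between original distances and the distances implied by each tree
coph <- sapply(linkages, function(m) cor(d, cophenetic(hclust(d, method = m))))
round(coph, 3)
```

A higher value means the dendrogram distorts the pairwise distances less; it does not by itself say that the clusters are more meaningful.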


pdf("cluster.pdf", h = 96, w = 12)
par(mfrow = c(24, 2))
dis = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski")
meds <- c("ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median",
    "centroid")
nom <- scale(exp_sample)
# Draw one dendrogram for every distance/linkage combination
for (i in 1:length(dis)) {
    for (j in 1:length(meds)) {
        d <- dist(nom, method = dis[i])
        hc <- hclust(d, method = meds[j])
        plot(hc, hang = -1, cex = 0.8, main = paste(dis[i], meds[j], sep = " and "),
            labels = FALSE, xlab = "", sub = "")
    }
}
dev.off()
## png
##   2


The resulting plots show that the combinations that cluster best are maximum and ward.D or maximum and ward.D2, so we use the first of these for the following steps:

nom <- scale(exp_sample)
d <- dist(nom, method = "maximum")
hc <- hclust(d, method = "ward.D")
plot(hc, hang = -1, cex = 0.8, main = "maximum and ward.D", labels = FALSE, xlab = "",
    sub = "")


2. NbClust {NbClust}


nc <- NbClust(nom, distance = "maximum", min.nc = 2, max.nc = 15, method = "ward.D")

## *** : The Hubert index is a graphical method of determining the number of clusters.

##                 In the plot of Hubert index, we seek a significant knee that corresponds to a
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot.


## *** : The D index is a graphical method of determining the number of clusters.

##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure.

## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 8 proposed 12 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 2 proposed 15 as the best number of clusters
##                    ***** Conclusion *****
## * According to the majority rule, the best number of clusters is  12
## *******************************************************************
table(nc$Best.n[1, ])
##  0  1  2  3  4 11 12 13 14 15
##  2  1  7  2  1  1  8  1  1  2

barplot(table(nc$Best.n[1, ]), xlab = "Number of Clusters", ylab = "Number of Criteria",
    main = "Number of Clusters Chosen by 15 Criteria", col = "lightblue")
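The majority rule reported by NbClust is simply the most frequent entry of the vote table. The sketch below re-enters the counts printed above by hand (the vector `votes` is typed in for illustration, not read from the `nc` object) and also extracts the runner-up.

```r
# votes per proposed number of clusters, copied from the table above
votes <- c("0" = 2, "1" = 1, "2" = 7, "3" = 2, "4" = 1,
           "11" = 1, "12" = 8, "13" = 1, "14" = 1, "15" = 2)

best_k <- names(votes)[which.max(votes)]
runner_up <- names(sort(votes, decreasing = TRUE))[2]

best_k
runner_up
```

The margin is narrow: 12 clusters win with 8 votes against 7 votes for 2 clusters, so the majority rule alone is weak evidence here.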


Listing 16.3 - Obtaining the final cluster solution
clusters <- cutree(hc, k = 2)
table(clusters)
## clusters
## (cluster sizes in the source: 24 and 76; which label holds which size is not preserved)
# Median expression of each gene per cluster, on the original scale
aggregate(exp_sample, by = list(cluster = clusters), median)
## (output: per-cluster medians for the 50 genes IL17F, F7, IGHV3-7, ..., MIR3941;
##  the column alignment is lost in the source)
# The arguments of the second aggregate() call are lost in the source;
# the scaled data nom is assumed here
aggregate(as.data.frame(nom), by = list(cluster = clusters), median)
## (output: per-cluster medians on the standardized scale; alignment lost in the source)
plot(hc, hang = -1, cex = 0.8, main = "Maximum Linkage Clustering\n2 Cluster Solution",
    labels = FALSE, sub = "", xlab = "")
# Outline the two clusters on the dendrogram (k = 2 assumed from the plot title)
rect.hclust(hc, k = 2)



Plot function for within groups sum of squares by number of clusters
wssplot <- function(data, nc = 15, seed = 1234) {
    # Total sum of squares, i.e. the within-groups value for a single cluster
    wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
    for (i in 2:nc) {
        set.seed(seed)
        wss[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:nc, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
}
wssplot(nom)


3. K-means


nc <- NbClust(nom, min.nc = 2, max.nc = 15, method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.

##                 In the plot of Hubert index, we seek a significant knee that corresponds to a
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot.


## *** : The D index is a graphical method of determining the number of clusters.

##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure.

## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 4 proposed 4 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
##                    ***** Conclusion *****
## * According to the majority rule, the best number of clusters is  3
## *******************************************************************
table(nc$Best.n[1, ])  # determine the number of clusters
##  0  2  3  4 12 13
##  2  6 12  4  1  1


barplot(table(nc$Best.n[1, ]), xlab = "Number of Clusters", ylab = "Number of Criteria",
    main = "Number of Clusters Chosen by 13 Criteria", col = "lightblue")



set.seed(1234)
# The object name is missing in the source; fit.km is used here as a stand-in
fit.km <- kmeans(nom, 3, nstart = 25)  # run the k-means cluster analysis
head(fit.km$size)
## (cluster sizes in the source: 11, 88 and 1; the print order is not preserved)
head(fit.km$centers)
## (output: standardized cluster centers for the 50 genes IL17F, F7, IGHV3-7, ..., MIR3941;
##  the column alignment is lost in the source)
aggregate(exp_sample, by = list(cluster = fit.km$cluster), mean)
## (output: per-cluster means on the original RPKM scale; alignment lost in the source)


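NT <- group[group$group == "NT", ]
# Encode the known sample type: 1 = normal tissue (NT), 2 = tumor
exp_sample$group = ifelse(rownames(exp_sample) %in% NT$sample, 1, 2)
# The object names are missing in the source; fit.km and ct.km are stand-ins
ct.km <- table(exp_sample$group, fit.km$cluster)
ct.km
## (counts in the source: the 8 NT samples fall in one cluster; the tumor samples
##  split 3, 88 and 1 across the three clusters; the column order is not preserved)
randIndex(ct.km)
##       ARI
## 0.7449018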
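The adjusted Rand index (ARI) used here can be computed by hand from a contingency table, which makes clear what the number measures: agreement between two labelings, corrected for chance. The function `adjusted_rand` below is a hypothetical helper written for this illustration, not the `flexclust::randIndex` implementation. The table `ct` is a reconstruction from the flattened counts in the source (8, 3, 88, 1), with the column order assumed; the ARI does not depend on that order.

```r
adjusted_rand <- function(tab) {
  tab <- as.matrix(tab)
  pairs <- function(x) sum(choose(x, 2))   # number of point pairs inside each cell
  n_pairs <- choose(sum(tab), 2)
  index <- pairs(tab)
  expected <- pairs(rowSums(tab)) * pairs(colSums(tab)) / n_pairs
  maximum <- (pairs(rowSums(tab)) + pairs(colSums(tab))) / 2
  (index - expected) / (maximum - expected)
}

# rows: known group (1 = NT, 2 = tumor); columns: k-means clusters (order assumed)
ct <- matrix(c(8, 3, 0, 88, 0, 1), nrow = 2,
             dimnames = list(group = c("1", "2"), cluster = c("1", "2", "3")))
ari <- adjusted_rand(ct)
round(ari, 7)
```

With these counts the hand computation agrees with the ARI of 0.7449018 reported above, which supports the reconstruction of the table.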

2. Partitioning clustering pam {cluster}

We run the analysis with PAM (Partitioning Around Medoids), as follows:

Listing 16.5 - Partitioning around medoids for the DEG data
fit.pam <- pam(exp_sample, k = 3, stand = TRUE)
head(fit.pam$medoids)
## (output: the three medoid samples are TCGA-T9-A92H-01A-11R-A37K-07,
##  TCGA-AZ-6599-11A-01R-1774-07 and TCGA-CM-4748-01A-01R-1410-07; their values for
##  the 50 genes and the group column lost their alignment in the source)
clusplot(fit.pam, main = "Bivariate Cluster Plot", color = TRUE, shade = TRUE, labels = 1,
    lines = 0)


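evaluate clustering
ct.pam <- table(exp_sample$group, fit.pam$clustering)
ct.pam
## (counts in the source: the 8 NT samples fall in one cluster; the tumor samples
##  split 91 and 1 across the other two; the column order is not preserved)
randIndex(ct.pam)
##       ARI
## 0.9309073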

3. Model-based clustering Mclust {mclust}


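fit.m <- Mclust(exp_sample)
summary(fit.m)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 5 components:
##
##  log-likelihood   n  df      BIC
##        -19190.8 100 310 -39809.2
## (the ICL value is not separately preserved in the source)
##
## Clustering table: one component holds 96 samples and the other four hold one
## sample each (the label order is not preserved in the source)
plot(fit.m, what = "classification")  # plot the classification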
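The summary figures can be cross-checked with a few lines of arithmetic. mclust reports BIC as 2 x log-likelihood minus df x log(n), and for the EEI model the parameter count is (G - 1) mixing proportions, G x d means, and d shared diagonal variances. The values below are taken from the flattened summary; the dimension d = 51 (50 genes plus the group column added earlier) and G = 5 components are inferred, not stated explicitly in the source.

```r
loglik <- -19190.8
n <- 100   # samples
d <- 51    # assumed: 50 genes + the group column
G <- 5     # assumed number of mixture components

# EEI: mixing proportions + component means + one shared diagonal covariance
df <- (G - 1) + G * d + d
bic <- 2 * loglik - df * log(n)

df
round(bic, 1)
```

Both results agree with the df of 310 and the BIC of -39809.2 in the summary, which supports reading the garbled output as a five-component model.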


4. Density-based clustering dbscan {fpc}

Here we introduce another clustering method, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm. Differentially expressed genes are not well suited to this kind of clustering, so the result is poor, as shown below:

db <- dbscan(exp_sample, eps = 0.15, MinPts = 5)
fviz_cluster(db, data = exp_sample, stand = FALSE, ellipse = FALSE, show.clust.cent = FALSE,
    geom = "point", palette = "Set2", ggtheme = theme_classic())


# multishapes is an example data set shipped with factoextra
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]
set.seed(123)
db <- dbscan(df, eps = 0.15, MinPts = 5)
fviz_cluster(db, data = df, stand = FALSE, ellipse = FALSE, show.clust.cent = FALSE,
    geom = "point", palette = "jco", ggtheme = theme_classic())


5. Grid-based clustering clique {optpart}


cq <- clique(d, 0.5)
summary(cq)
## 100 maximal cliques at alphac = 0.5
## minimum size = 1
## maximum size = (value not preserved in the source)
plot(cq, panel = "all")

## hit return to continue :


iris2 <- iris[-5]
dist.e = dist(iris2, method = "euclidean")
cq <- clique(dist.e, 0.5)
# summary(cq)
plot(cq)

## hit return to continue :


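Topic 1. Clonal evolution: sciClone
Topic 2. Clonal evolution: ClonEvol
Topic 3. Clonal evolution: fishplot
Topic 4. Clonal evolution: Pyclone
Topic 5. Clonal evolution: CITUP
Topic 6. Clonal evolution: Canopy
Topic 7. Clonal evolution: Cardelino
Topic 8. Clonal evolution: RobustClone
Topic 9. Clonal evolution: TimeScape
Clone 1. Tumor clonal evolution: past and present
Clone 2. Tumor clonal evolution: different evolutionary mod...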


; References:

  1. Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 289-317.

  2. Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.

  3. Fraley C., Raftery A. E., Murphy T. B. and Scrucca L. (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.

  4. C. Fraley and A. E. Raftery (2007) Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24, 155-18

Author: 桓峰基因
Title: MachineLearning 3. Cluster Analysis




