K-Means Clustering


Clustering is an unsupervised learning technique that groups similar data points into the same cluster. K-means is so named because it partitions the data into k distinct clusters, and each cluster's center is computed as the mean of the points the cluster contains.

The K-Means Algorithm

Algorithm Idea

K-means partitions the dataset into k clusters, where k is specified by the user; each cluster's center point is represented by its centroid, the mean of the points assigned to it.

Main steps:

  • Randomly choose k initial points as the centroids
  • For each data point in the dataset, find the nearest centroid and assign the point to that cluster
  • For each cluster, compute the mean of all its points and make that mean the new centroid
  • Repeat from step 2 until no point's cluster assignment changes

Implementation

from numpy import *
import matplotlib
import matplotlib.pyplot as plt

def loadDataSet(fileName):      # general function to parse tab-delimited floats
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # map all elements to float (list() needed in Python 3)
        dataMat.append(fltLine)
    return dataMat

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance; la.norm(vecA-vecB) also works

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create centroid mat
    for j in range(n):  # random cluster centers, within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))  # per point: assigned centroid index and squared error
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        for cent in range(k):  # recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            if len(ptsInClust):  # guard against an empty cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)  # move centroid to the mean
    return centroids, clusterAssment
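As a quick sanity check, the algorithm above can be exercised on a tiny synthetic dataset. The sketch below restates it compactly in plain NumPy so the snippet runs on its own; the helper name `k_means` and the toy data are illustrative, not part of the original listing.

```python
import numpy as np

def k_means(data, k, seed=0):
    """Compact Lloyd's algorithm on a plain (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # random centroids within the bounds of each dimension, as in randCent above
    mins, maxs = data.min(axis=0), data.max(axis=0)
    cents = mins + (maxs - mins) * rng.random((k, data.shape[1]))
    assign = np.zeros(len(data), dtype=int)
    changed = True
    while changed:                      # loop until no assignment changes
        changed = False
        for i, x in enumerate(data):
            j = int(np.argmin(np.linalg.norm(cents - x, axis=1)))
            if assign[i] != j:
                assign[i], changed = j, True
        for c in range(k):
            pts = data[assign == c]
            if len(pts):                # skip empty clusters
                cents[c] = pts.mean(axis=0)
    return cents, assign

# two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
cents, assign = k_means(data, 2)
print(sorted(round(float(c)) for c in cents[:, 0]))  # centroids settle near x=0 and x=5
```

With two well-separated blobs, the loop converges in a couple of passes and the centroids land on the blob means.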

def clusterClubs(numClust=5):
    datList = []
    for line in open('places.txt').readlines():
        lineArr = line.split('\t')
        datList.append([float(lineArr[4]), float(lineArr[3])])  # longitude, latitude
    datMat = mat(datList)
    # the book version uses a spherical distance (distSLC, not defined in this
    # post) and a map background; plain Euclidean distance is used here
    myCentroids, clustAssing = biKMeans(datMat, numClust)
    fig = plt.figure()
    rect = [0.1, 0.1, 0.8, 0.8]
    scatterMarkers = ['s', 'o', '^', '8', 'p',
                      'd', 'v', 'h', '>', '<']
    ax = fig.add_axes(rect)
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:, 0].A == i)[0], :]
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax.scatter(ptsInCurrCluster[:, 0].flatten().A[0],
                   ptsInCurrCluster[:, 1].flatten().A[0],
                   marker=markerStyle, s=90)
    ax.scatter(myCentroids[:, 0].flatten().A[0],
               myCentroids[:, 1].flatten().A[0],
               marker='+', s=300)
    plt.show()

Results

(Figure: K-means clustering result)

Convergence

Set the objective function to

$$J(c, \mu) = \sum_{i=1}^{m} (x_i - \mu_{c^{(i)}})^2$$

The K-means algorithm minimizes J: each time the centroids are adjusted, J decreases, and c and $\mu$ converge along with it. However, because J is non-convex, the algorithm is not guaranteed to reach the global optimum; it can only guarantee a local optimum.
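This monotone decrease of J can be checked numerically. The sketch below (plain NumPy, independent of the listings above) records J after every assignment step; since the assignment step and the centroid-update step each can only lower J, the recorded sequence never increases:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
k = 2
cents = data[rng.choice(len(data), k, replace=False)].copy()

sse_history = []
for _ in range(10):
    # assignment step: nearest centroid for every point
    d = np.linalg.norm(data[:, None, :] - cents[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # J(c, mu): total squared distance of points to their assigned centroids
    sse_history.append(float(((data - cents[assign]) ** 2).sum()))
    # update step: move each centroid to the mean of its cluster
    for c in range(k):
        if (assign == c).any():
            cents[c] = data[assign == c].mean(axis=0)

# J never increases from one iteration to the next
print(all(a >= b for a, b in zip(sse_history, sse_history[1:])))  # → True
```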

Bisecting K-Means

To overcome the problem that K-means converges to a local minimum, the bisecting K-means algorithm was proposed.

Algorithm Idea

The algorithm starts with all points in a single cluster, splits it into two, and then repeatedly chooses one cluster to split further; the cluster chosen is the one whose split most reduces the total SSE (the objective function).

Main steps:

  • Treat all points as a single cluster
  • Compute the total error (SSE) of each cluster
  • For each candidate cluster, run K-means with k=2 and compute the total error that would result from splitting it in two
  • Commit the split that yields the lowest total error
  • Repeat from step 2 until the required number of clusters is reached

Implementation

def biKMeans(dataSet, k, distMeans=distEclud):
    m, n = shape(dataSet)
    clusterAssment = mat(zeros((m, 2)))  # start with every point assigned to cluster 0
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]
    for i in range(m):
        clusterAssment[i, 1] = distMeans(mat(centroid0), dataSet[i, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            cluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # points currently in cluster i
            centroidMat, splitCluster = kMeans(cluster, 2, distMeans)
            sseSplit = sum(splitCluster[:, 1])  # SSE of cluster i after splitting it in two
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of the rest
            if sseSplit + sseNotSplit < lowestSSE:
                bestCentToSplit = i
                bestNewCent = centroidMat
                bestClust = splitCluster.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the two halves of the best split
        bestClust[nonzero(bestClust[:, 0].A == 1)[0], 0] = len(centList)
        bestClust[nonzero(bestClust[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is:', bestCentToSplit)
        print('the len of bestClustAss is:', len(bestClust))
        centList[bestCentToSplit] = bestNewCent[0, :].tolist()[0]
        centList.append(bestNewCent[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClust
    return mat(centList), clusterAssment
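For intuition, the bisecting procedure can also be sketched compactly in plain NumPy. The helper names `two_means`, `sse`, and `bisecting_kmeans`, the deterministic first/last-row initialization, and the toy data are all illustrative choices, not part of the original listing (which initializes randomly and works with matrices):

```python
import numpy as np

def two_means(data, iters=20):
    """Split one cluster in two with plain k-means (k=2), initialized
    from the first and last row (illustrative deterministic choice)."""
    cents = np.vstack([data[0], data[-1]]).astype(float)
    for _ in range(iters):
        d = np.linalg.norm(data[:, None, :] - cents[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for c in range(2):
            if (assign == c).any():
                cents[c] = data[assign == c].mean(axis=0)
    return cents, assign

def sse(points):
    """Total squared distance of points to their own mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(data, k):
    clusters = [data]                     # start with one cluster holding every point
    while len(clusters) < k:
        best = None
        for i, clust in enumerate(clusters):  # try splitting each cluster
            if len(clust) < 2:
                continue
            cents, assign = two_means(clust)
            total = (sum(sse(clust[assign == c]) for c in range(2) if (assign == c).any())
                     + sum(sse(c2) for j, c2 in enumerate(clusters) if j != i))
            if best is None or total < best[0]:  # keep the split with the lowest total SSE
                best = (total, i, [clust[assign == 0], clust[assign == 1]])
        _, i, parts = best
        clusters = clusters[:i] + parts + clusters[i + 1:]
    return clusters

# three tight blobs around 0, 4 and 8
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.1, (25, 2)) for loc in (0, 4, 8)])
clusters = bisecting_kmeans(data, 3)
print(sorted(round(float(c.mean(axis=0)[0])) for c in clusters))  # one cluster per blob
```

Because each round commits only the single best split, the algorithm needs k-1 rounds to reach k clusters, and every round compares the post-split SSE of each candidate against the SSE of the untouched clusters.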

Results

Original: https://www.cnblogs.com/coder2012/p/4712718.html
Author: cococo点点
Title: K-均值聚类算法
