Machine Learning Algorithms (21): Kernel Density Estimation (KDE)

Source: https://blog.csdn.net/weixin_39910711/article/details/107307509

This is a collected reprint; readers are advised to check the original text at the link above.

1 Density functions

1.1 Parametric estimation

1.2 Non-parametric estimation

2 From histograms to kernel density estimation

2.1 Kernel functions

2.2 Bandwidth selection

2.2.1 Adaptive (variable-bandwidth) kernel density estimation

2.3 The multivariate case

1 Density functions
Given a sample set, how can we recover its distribution density function? There are two families of methods:

1.1 Parametric estimation
In short, we assume the sample set follows some parametric probability distribution and then fit the distribution's parameters to the data, e.g. by maximum likelihood estimation or with a Gaussian mixture. Because parametric methods inject subjective prior knowledge, the fitted model often fails to match the true distribution.

1.2 Non-parametric estimation
Unlike parametric estimation, non-parametric estimation adds no prior knowledge: it fits the distribution directly from the characteristics and properties of the data, and can therefore yield a better model than parametric methods. Kernel density estimation (KDE) is one such non-parametric estimator; it was proposed by Rosenblatt (1956) and Emanuel Parzen (1962) and is also known as the Parzen window. Ruppert and Cline later proposed an improved kernel density estimator based on clustering of the data-set density function.

2 From histograms to kernel density estimation
Given a data set, we want to see how the samples are distributed, and the usual intuitive tool is a histogram. Histograms are easy to understand, but they have three drawbacks: (1) the density function is not smooth; (2) the density function depends heavily on the bin width — the same raw data with different bin choices can give completely different results. In the first two panels of the figure below, the second panel merely shifts the bin boundaries by 0.75 relative to the first, yet the displayed density looks very different; (3) a histogram can show at most two-dimensional data and cannot effectively display higher dimensions.

Histograms have a further problem: the distribution curve they show is not smooth — all samples within one bin get the same probability density, which is clearly unsatisfactory. One remedy is to increase the number of bins. Taken to the extreme, where every sample value gets its own bin, each observed point has its own probability, but new problems appear: values that do not occur in the sample get probability 0, and the probability density function becomes discontinuous, which is itself a serious problem. If we could connect these discontinuous intervals, the result would largely meet our needs. One idea is to let the probability density at a point use information from its neighborhood; this greatly alleviates the discontinuity. For easier observation, consider the next figure.

A natural idea: to find the density at X = x, pick a small interval around x (as a histogram does), count the points inside it, and divide by the total count — this should be a reasonably good estimate. In mathematical language, recalling the definition of the derivative, the density function can be written as:

f(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h}

Now suppose we want the density at x. Following the idea above, take the neighborhood [x - h, x + h]; as h → 0, the average density over this neighborhood can be taken as the density at x. In symbols:

\hat{f}(x) = \frac{1}{2nh}\,\#\{\, x_i : x - h \le x_i \le x + h \,\}

Here \#\{x_i : x - h \le x_i \le x + h\} is the number of sample points in the neighborhood and n is the total size of the sample set; averaging the density over the neighborhood gives the density estimate \hat{f}(x) at x. Rewriting the formula above yields kernel density estimation (KDE), a non-parametric method for estimating a probability density function. For n sample points x_1, \dots, x_n drawn i.i.d. from a distribution F with density f:

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2h}\,\mathbf{1}(x - h \le x_i \le x + h)

If h is chosen too large, it clearly violates the requirement that h tend to 0; if h is too small, very few points actually contribute to the estimate of f(x). This is the bias–variance tradeoff of non-parametric estimation. Even then, a problem remains: the estimated density function is still not smooth (it is piecewise constant, jumping as sample points enter or leave the interval).

Write K(t) = \tfrac{1}{2}\,\mathbf{1}(|t| \le 1); then:

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)

Because the integral of a probability density must be 1, we have

\int \hat{f}_h(x)\,dx = \frac{1}{nh} \sum_{i=1}^{n} \int K\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n} \sum_{i=1}^{n} \int K(t)\,dt = \int K(t)\,dt

so requiring \int K(t)\,dt = 1 guarantees that \hat{f}_h integrates to 1. If we take K(t) to be any other known probability density function, the problem is solved and the final density estimate is continuous.
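To make the construction concrete, here is a minimal from-scratch sketch of the estimator above (the sample values are arbitrary illustrative data, not from the original post): the box kernel K(t) = ½·1(|t| ≤ 1) reproduces the counting estimator, swapping in a Gaussian kernel makes the estimate smooth, and a crude Riemann sum confirms that the estimate integrates to about 1.

```python
import math

def kde(x, samples, h, kernel):
    """Fixed-bandwidth kernel density estimate at a point x."""
    n = len(samples)
    return sum(kernel((x - xi) / h) for xi in samples) / (n * h)

def box(t):
    """Uniform kernel K(t) = 1/2 on [-1, 1]: recovers the counting estimator."""
    return 0.5 if abs(t) <= 1.0 else 0.0

def gauss(t):
    """Gaussian kernel: the standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

samples = [0.8, 1.1, 1.9, 3.0, 4.2]
h = 1.5

# The box-kernel estimate is piecewise constant; the Gaussian one is smooth.
print("box KDE at 0:  ", kde(0.0, samples, h, box))
print("gauss KDE at 0:", kde(0.0, samples, h, gauss))

# Riemann-sum check that the Gaussian-kernel estimate integrates to ~1.
step = 0.01
area = sum(kde(-12.0 + k * step, samples, h, gauss) * step for k in range(3600))
print("integral ~", round(area, 3))
```

Any kernel that is itself a density (integrates to 1) can be dropped into `kde` unchanged, which is exactly the point made above.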

2.1 Kernel functions

Kernel functions also appear in support vector machines and mean-shift; the kernel is a theoretical concept, but its role differs in each setting. The kernels used here include the uniform, triangular, biweight, triweight, Epanechnikov, and normal (Gaussian) kernels. Their graphs look roughly as follows:

It is often stated that the Epanechnikov kernel is optimal in the mean-squared-error sense and that the efficiency loss from other kernels is small. Because of its convenient mathematical properties, the Gaussian kernel K(x) = ϕ(x), with ϕ the standard normal density, is also frequently used.
The derivation above gives the probability density contributed around a single sample point; how is the whole sample set fitted? Given N sample points, fit a kernel at each of them and superimpose (with normalization) the N component densities to obtain the density of the whole set. For example, a Gaussian-kernel "fit" of the six points X = {x1 = −2.1, x2 = −1.3, x3 = −0.4, x4 = 1.9, x5 = 5.1, x6 = 6.2} gives the following result:

In the histogram, each bin on the horizontal axis has width 2; whenever a data point falls into a bin, that bin's height on the y-axis increases by 1/12 (six points, bin width 2).

In the kernel density estimate, each normal component has variance 2.25; the red dashed curves show the normal distribution placed on each data point, and superimposing them gives the kernel density estimate, shown in blue.
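The blue curve in that figure can be reproduced numerically. A sketch in pure Python, using Gaussian components with variance 2.25 (i.e. h = 1.5) on the six points listed above:

```python
import math

X = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
h = 1.5  # component standard deviation; variance h^2 = 2.25 as in the figure

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def kde_at(x):
    """Blue curve: the average of the six red dashed normal components."""
    return sum(normal_pdf(x, xi, h) for xi in X) / len(X)

# The estimate is bimodal: a bump over the cluster near -2..0 and
# another over the cluster near 5..6, with a valley in between.
for x in (-2.0, 0.0, 3.5, 6.0):
    print(f"f_hat({x:+.1f}) = {kde_at(x):.4f}")
```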

The question, then, is how to choose the "variance" of the kernel. This is governed by h: the estimates obtained under different bandwidths differ greatly, as the following figure shows:

(Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.)

2.2 Bandwidth selection
Once the kernel function is fixed — say the Gaussian kernel chosen above — how large should its variance be? That is, how should h (the bandwidth, i.e. the neighborhood radius we discussed) be chosen? Different bandwidths lead to very different final fits. As noted above, in theory h → 0, but if h is too small, too few points take part in the fit within the neighborhood. With the help of machine-learning methodology, we can of course use cross-validation to choose the best h. In addition, a theoretical derivation provides some guidance on choosing h.
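As one illustration of the cross-validation idea (a sketch, not a procedure from the original post): pick h from a small grid by maximizing the leave-one-out log-likelihood of the estimator, so that each point is scored by a density fitted without it.

```python
import math

def gauss(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of the Gaussian-kernel KDE for bandwidth h."""
    n = len(samples)
    total = 0.0
    for i, x in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        f = sum(gauss((x - xj) / h) for xj in others) / ((n - 1) * h)
        total += math.log(max(f, 1e-300))  # guard against log(0)
    return total

samples = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
grid = [0.3, 0.6, 1.0, 1.5, 2.0, 3.0]
best_h = max(grid, key=lambda h: loo_log_likelihood(samples, h))
print("best bandwidth on the grid:", best_h)
```

A very small h is punished automatically here: a left-out point far from its neighbors receives almost no density, driving the log-likelihood down — the same bias–variance tension described above.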

Given only the sample set, we can evaluate the estimated density only at the sample points, and the fitted density should be close to the underlying one. On this basis we define an error function; minimizing it provides a general direction for choosing h. A standard choice is to minimize the L2 risk, the mean integrated squared error (MISE), defined as:

\mathrm{MISE}(h) = \mathbb{E}\int \left( \hat{f}_h(x) - f(x) \right)^2 dx

Under weak assumptions, \mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\!\left(\tfrac{1}{nh} + h^4\right), where AMISE is the asymptotic MISE, given by:

\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f'')

where:

R(K) = \int K(t)^2\,dt, \qquad m_2(K) = \int t^2 K(t)\,dt, \qquad R(f'') = \int f''(x)^2\,dx

Minimizing MISE(h) is asymptotically equivalent to minimizing AMISE(h). Differentiating with respect to h and setting the derivative to 0 gives:

h_{\mathrm{AMISE}} = \left( \frac{R(K)}{m_2(K)^2\, R(f'')} \right)^{1/5} n^{-1/5}

Once the kernel function is fixed, the quantities R(K), m_2(K), and R(f'') in the formula for h can be determined, and h has a closed-form solution.
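As a worked check of this formula (my own arithmetic, not from the original post): for the Gaussian kernel, R(K) = 1/(2√π) and m₂(K) = 1, and if we further assume the unknown f is itself normal with standard deviation σ, then R(f'') = 3/(8√π σ⁵). Plugging these into h_AMISE recovers the familiar ≈ 1.06 σ n^(−1/5) rule of thumb that appears below.

```python
import math

# Gaussian kernel constants: R(K) = 1/(2*sqrt(pi)), m2(K) = 1.
# Assumption for this check: the true f is normal with std sigma,
# so R(f'') = 3 / (8*sqrt(pi)*sigma**5).
sigma, n = 1.0, 100
R_K = 1.0 / (2.0 * math.sqrt(math.pi))
m2 = 1.0
R_f2 = 3.0 / (8.0 * math.sqrt(math.pi) * sigma ** 5)

# h_AMISE = (R(K) / (m2^2 * R(f'')))^(1/5) * n^(-1/5)
h_amise = (R_K / (m2 ** 2 * R_f2)) ** 0.2 * n ** -0.2
print("h_AMISE           =", round(h_amise, 4))
print("1.06*sigma*n^-1/5 =", round(1.06 * sigma * n ** -0.2, 4))
```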

If the bandwidth is not fixed but varies with the estimation location (a balloon estimator) or with the sample points (a pointwise estimator), the result is a very powerful method called adaptive or variable-bandwidth kernel density estimation.

If a Gaussian kernel is used for the kernel density estimate, the optimal choice of h (that is, the bandwidth that minimizes the mean integrated squared error) is

h = \left( \frac{4\hat{\sigma}^5}{3n} \right)^{1/5} \approx 1.06\,\hat{\sigma}\, n^{-1/5}

where \hat{\sigma} is the sample standard deviation. This approximation is called the normal-distribution approximation, the Gaussian approximation, or Silverman's (1986) rule of thumb. Although easy to compute, the rule should be used with caution: when the true density is far from normal, it can produce estimates that generalize very poorly.
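A short sketch of the rule of thumb applied to the six-point example used earlier:

```python
import math

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth h = (4*sigma^5 / (3*n))**(1/5)."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample standard deviation (n - 1 in the denominator).
    sigma = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    return (4.0 * sigma ** 5 / (3.0 * n)) ** 0.2

samples = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
h = silverman_bandwidth(samples)
print("Silverman bandwidth:", round(h, 3))
```

Note the caveat above: this sample is visibly bimodal, so the rule's normality assumption is already strained here and the resulting h tends to oversmooth.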

Briefly, on the role of the bandwidth:

In data visualization, the bandwidth determines how smooth the kernel density estimate (KDE) is: the smaller the bandwidth, the more undersmoothed the curve, and the larger the bandwidth, the more oversmoothed it is.

In POI (point-of-interest) recommendation and location-based services, the bandwidth h is set mainly according to the analysis scale and the characteristics of the geographic phenomenon. A smaller bandwidth lets more high- and low-value regions appear in the density surface and is suited to revealing local features of the distribution, while a larger bandwidth makes hot-spot regions stand out more clearly at the global scale. In addition, the bandwidth should be positively correlated with how dispersed the points of interest are: a sparse POI distribution calls for a larger bandwidth, a dense one for a smaller bandwidth.
2.2.1 Adaptive (variable-bandwidth) kernel density estimation
If the bandwidth is not fixed but varies with location — depending on the estimation point (balloon estimator) or on the sample points (pointwise estimator) — we obtain a particularly powerful method, adaptive or variable-bandwidth kernel density estimation. In POI recommendation, check-in density is high in dense urban areas and low in sparsely populated rural areas; different locations should therefore be analyzed at different scales, which is why a non-fixed bandwidth is used to estimate the kernel density.

For readers who are not sure what POI recommendation means: POI stands for Point of Interest, so the task is simply recommending places a user might find interesting. It is a very active research topic in recommender systems, and kernel density estimation is used there — for example, Jia-Dong Zhang, Chi-Yin Chow (2015), "GeoSoCa: Exploiting Geographical, Social and Categorical Correlations for Point-of-Interest Recommendations", SIGIR '15, August 09–13, 2015, Santiago, Chile, uses variable-bandwidth kernel density estimation.

Here we briefly describe the adaptive-bandwidth kernel density estimator. It is obtained from the fixed-bandwidth kernel density function by modifying the bandwidth parameter, and takes the following form:

\hat{f}(x) = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{h_j}\, K\!\left( \frac{x - x_j}{h_j} \right)

Here \hat{f}(x) is the kernel density estimate with per-point bandwidth h_j, and M is the number of samples; as you can see, every point j has its own bandwidth h_j, which is what makes the estimator adaptive or variable. K(\cdot) is the kernel function — a Gaussian kernel here, though other kernels can be used. The sensitivity factor \alpha satisfies 0 \le \alpha \le 1 and is usually taken as \alpha = 0.5; with \alpha = 0, the adaptive-bandwidth estimator reduces to the fixed-bandwidth kernel density estimator described earlier. \omega denotes the bandwidth parameter from which each h_j is derived.
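One common way to realize the per-point bandwidths is an Abramson-style construction: run a fixed-bandwidth pilot estimate, then set h_j = ω · (pilot(x_j)/g)^(−α), where g is the geometric mean of the pilot densities. This particular normalization is an illustrative assumption — the post's exact formula for h_j in terms of ω is not reproduced here — but it matches the behavior described above: α = 0 recovers the fixed-bandwidth estimator.

```python
import math

def gauss(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def fixed_kde(x, samples, h):
    """Fixed-bandwidth Gaussian KDE (the pilot estimate)."""
    n = len(samples)
    return sum(gauss((x - xi) / h) for xi in samples) / (n * h)

def adaptive_bandwidths(samples, omega, alpha=0.5):
    """h_j = omega * (pilot(x_j)/g)**(-alpha): smaller bandwidth where data
    are dense, larger where they are sparse (g = geometric mean of pilots)."""
    pilot = [fixed_kde(xj, samples, omega) for xj in samples]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
    return [omega * (p / g) ** (-alpha) for p in pilot]

def adaptive_kde(x, samples, omega, alpha=0.5):
    hs = adaptive_bandwidths(samples, omega, alpha)
    M = len(samples)
    return sum(gauss((x - xj) / hj) / hj for xj, hj in zip(samples, hs)) / M

samples = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
print("adaptive (alpha=0.5):", adaptive_kde(0.0, samples, omega=1.5))
# With alpha = 0 every h_j equals omega, recovering the fixed-bandwidth KDE.
print("alpha=0 vs fixed:   ", adaptive_kde(0.0, samples, 1.5, alpha=0.0),
      fixed_kde(0.0, samples, 1.5))
```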

2.3 The multivariate case
Kernel density estimation also extends to multiple dimensions:

\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right), \qquad K(u) = \prod_{k=1}^{d} k(u_k)

where d is the dimension of x and K is a multivariate kernel, generally taken to be the product of d one-dimensional kernels.
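A minimal sketch of the product-kernel form in two dimensions, on arbitrary illustrative points:

```python
import math

def gauss(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde_multi(x, samples, h):
    """d-dimensional KDE: a product of d one-dimensional Gaussian kernels,
    normalized by n * h**d as in the formula above."""
    d = len(x)
    n = len(samples)
    total = 0.0
    for xi in samples:
        prod = 1.0
        for k in range(d):
            prod *= gauss((x[k] - xi[k]) / h)
        total += prod
    return total / (n * h ** d)

pts = [(0.0, 0.0), (0.5, -0.2), (1.0, 1.0), (3.0, 3.2)]
print("near the cluster:", kde_multi((0.5, 0.0), pts, h=0.8))
print("far away:        ", kde_multi((10.0, 10.0), pts, h=0.8))
```

Note that using one h for all d coordinates is the simplest option; a per-coordinate bandwidth h_k (or a full bandwidth matrix) is the more general treatment.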

References:

核密度估计 Kernel Density Estimation (KDE) — NeverMore_7, CSDN blog

核密度估计 (Kernel density estimation) — Starworks, CSDN blog

什么是核密度估计？ — Zhihu

核密度估计 Kernel Density Estimation (KDE) 概述 — 简书

For adaptive-bandwidth kernel density estimation, see Wikipedia: https://en.wikipedia.org/wiki/Variable_kernel_density_estimation

Copyright notice: this article is an original article by a CSDN blogger, licensed under the CC 4.0 BY-SA agreement. Please attach the original source link and this statement when reprinting.

Original link: https://blog.csdn.net/weixin_39910711/article/details/107307509

Original: https://www.cnblogs.com/dhcn/p/16377703.html
Author: 辉–
Title: 机器学习算法(二十一):核密度估计 Kernel Density Estimation(KDE)

