Nonparametric Estimation: Kernel Density Estimation (KDE)

https://blog.csdn.net/pipisorry/article/details/53635895

Overview of kernel density estimation (KDE)

The density estimation problem

Estimating the probability density function of a random variable from a given sample is one of the basic problems in probability and statistics. Approaches to it fall into two groups: parametric estimation and nonparametric estimation.

Parametric estimation

Parametric estimation can be divided into parametric regression analysis and parametric discriminant analysis. In parametric regression analysis, one assumes that the data follow a specific form, such as linear, linearizable, or exponential behaviour, and then searches the corresponding family of functions for a particular solution, i.e. determines the unknown parameters of the regression model. In parametric discriminant analysis, one must assume that the randomly drawn samples used as the basis for discrimination follow a specific distribution within each possible class. Experience and theory both show that there is often a considerable gap between these basic assumptions and the actual physical model, so parametric methods do not always give satisfactory results.

[Parametric estimation: maximum likelihood estimation (MLE)] [Parametric estimation: parameter estimation methods for text analysis]

Nonparametric estimation

To address these shortcomings, Rosenblatt and Parzen proposed a nonparametric method: kernel density estimation. It uses no prior knowledge about the data distribution and imposes no assumptions on it; it studies the distribution starting from the sample itself. For this reason it has received a great deal of attention in both statistical theory and applications.

Kernel density estimation is a nonparametric method used in probability theory to estimate an unknown density function. It was proposed by Rosenblatt (1955) and Emanuel Parzen (1962) and is also known as the Parzen window method. Ruppert and Cline later proposed a modified kernel density estimator based on a clustering algorithm for the data-set density function.

Kernel density estimation suffers from boundary effects when estimating the density near the edge of the support.

[https://zh.wikipedia.org/zh-hans/核密度估计]
In short, kernel density estimation (KDE) is a nonparametric method for estimating an unknown density function.

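One widely used density estimator is the histogram, shown in the first and second panels of the figure below (titled Histogram and Histogram, bins shifted). Histograms are simple and easy to understand, but they have three drawbacks. First, the estimated density is not smooth. Second, the estimate depends heavily on the bins (the width and position of each bar): the same raw data can look completely different under a different binning. In the first two panels, the second differs from the first only in that the bins are shifted by 0.75, yet the two densities look very different. Third, a histogram can display at most two-dimensional data; it cannot show higher dimensions effectively.

Kernel density estimation supports several kernels. Panel 3 (Tophat Kernel Density) uses a non-smooth kernel and panel 4 (Gaussian Kernel Density, bandwidth=0.75) uses a smooth one. In many cases smooth kernels, such as the Gaussian kernel, are the more common choice.

Different kernels lead to broadly consistent conclusions (the overall trend and the pattern of the density are much the same), but kernel density estimation is not perfect either. Besides the choice of kernel, the bandwidth also affects the estimate: a bandwidth that is too large or too small distorts the result. See the last three panels of the figure above, titled Gaussian Kernel Density, bandwidth=0.75; Gaussian Kernel Density, bandwidth=0.25; and Gaussian Kernel Density, bandwidth=0.55.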
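The effects described above can be checked numerically. The following is a minimal sketch, assuming a synthetic bimodal sample (not the data behind the original figures) and sklearn's KernelDensity; the helper names kde_curve and roughness are made up for this illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic bimodal sample: an assumption for illustration only.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(5.0, 1.0, 70)])

# Histogram: same data and bin width, but the bin edges are shifted by 0.75.
edges = np.arange(-6.0, 12.1, 1.5)
hist_a, _ = np.histogram(x, bins=edges, density=True)
hist_b, _ = np.histogram(x, bins=edges + 0.75, density=True)

# KDE evaluated on a fine grid for several kernel / bandwidth choices.
grid = np.linspace(-6.0, 12.0, 721)
dx = grid[1] - grid[0]

def kde_curve(kernel, bandwidth):
    model = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(x[:, None])
    # score_samples returns the log-density, so exponentiate it.
    return np.exp(model.score_samples(grid[:, None]))

curves = {
    "tophat, h=0.75": kde_curve("tophat", 0.75),
    "gaussian, h=0.75": kde_curve("gaussian", 0.75),
    "gaussian, h=0.25": kde_curve("gaussian", 0.25),
}

def roughness(curve):
    # Total variation of the curve: larger means more wiggles.
    return np.abs(np.diff(curve)).sum()

print("histogram A:", np.round(hist_a, 3))
print("histogram B:", np.round(hist_b, 3))
for name, curve in curves.items():
    print(f"{name}: area={curve.sum() * dx:.3f}, roughness={roughness(curve):.3f}")
```

The two histograms come from identical data and differ only in where the bins start. For the Gaussian kernel, the smaller bandwidth should give a visibly rougher curve, while the area under every curve stays close to 1.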

Applications of kernel density estimation

Risk forecasting for stocks and finance: a value-at-risk (VaR) prediction model can be built on top of univariate kernel density estimation. Different VaR prediction models can be obtained by weighting the coefficient of variation of the kernel density estimate.

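The most widely used density estimation techniques are Gaussian mixture models and neighbor-based kernel density estimation. Gaussian mixture models are used more often in clustering scenarios.

[Kernel Density Estimation (KDE)]

Kernel density analysis can be used to measure building density, to obtain crime reports, or to find roads or utility lines that affect a town or a wildlife habitat. The population field can be used to give some features more weight than others according to their importance, and it also allows a single point to represent several observations. For example, one address might represent a six-unit apartment building, or some crimes might be weighted more heavily than others when determining overall crime levels. For line features, a divided highway probably has more impact than a narrow dirt road, and a high-tension line has more impact than a standard electric pole. [ArcGIS documentation]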

You have probably heard of heat maps. A heat map is in fact a kernel density estimate.

In short, kernel density is used to estimate density, and if you have a set of spatial point data, kernel density estimation is often a better way to visualize it.

Pipi's blog

Kernel density estimation

Kernel density estimation fits the observed data points with a smooth peaked function (the "kernel") in order to approximate the true probability density curve.

Kernel density estimation is a nonparametric way to estimate a probability density function. Let x_1, x_2, ..., x_n be n independent and identically distributed sample points from a distribution F with probability density function f. The kernel density estimate is:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

K(.) is the kernel function: it is non-negative, integrates to 1 (so it has the properties of a probability density), and has mean 0. Many kernels are available: uniform, triangular, biweight, triweight, Epanechnikov, normal, and so on.

h > 0 is a smoothing parameter called the bandwidth; some authors call it the window.

K_h(x) = 1/h K(x/h) is the scaled kernel.

The principle behind kernel density estimation is fairly simple. When we are trying to learn the probability distribution of some quantity and a particular value shows up among the observations, we can assume that the probability density at that value is relatively high. Values close to it will also have a relatively high density, while values far from it will have a relatively low one.

Based on this idea, for an observed value (say, the first one) we use K to fit a density that is high nearby and low far away. We then average the densities fitted around every observation. If some observations matter more than others, a weighted average can be used. Note that kernel density estimation is not meant to find the true distribution function.

Note: put differently, KDE takes each data point together with the bandwidth as the parameters of a kernel (e.g. a Gaussian), which gives N kernel functions. Superposing them linearly gives the kernel density estimate, and after normalization the result is a probability density function.

Take a one-dimensional dataset of three points as an example: 5, 10, 15

Plotted as a histogram and then as a KDE curve (figures not included):
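As a rough numerical stand-in for the plots, the sketch below evaluates a Gaussian KDE on the three points 5, 10, 15 directly from the definition (an average of scaled kernels). The names gaussian_kernel and kde, the bandwidth h=2, and the optional weights argument are assumptions made for illustration; they do not come from the original post.

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: non-negative, integrates to 1, mean 0.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, samples, h, kernel=gaussian_kernel, weights=None):
    """Evaluate sum_i w_i * (1/h) * K((x - x_i) / h); w_i = 1/n by default."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    if weights is None:
        weights = np.full(samples.shape, 1.0 / samples.size)
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()  # keep the total area equal to 1
    u = (x[:, None] - samples[None, :]) / h  # shape (len(x), n)
    return (kernel(u) / h) @ weights

samples = [5.0, 10.0, 15.0]
grid = np.linspace(-20.0, 40.0, 6001)
dx = grid[1] - grid[0]

density = kde(grid, samples, h=2.0)
print("area under the curve:", density.sum() * dx)  # should be close to 1
print("density at 5, 10, 15:", kde([5.0, 10.0, 15.0], samples, h=2.0))

# Weighted variant: the middle observation counts twice as much.
weighted = kde(grid, samples, h=2.0, weights=[1.0, 2.0, 1.0])
print("weighted area:", weighted.sum() * dx)
```

Each observation contributes one bump centred on itself; the estimate is their (possibly weighted) average, so the total area stays at 1 regardless of the weights.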

KDE kernel function K(.)
In theory, any smooth peaked function can serve as the kernel of a KDE, as long as the normalized KDE (whose plot describes the probability of the data occurring) has a total area of 1 under its curve.

With a single data point, the area under the single peak is 1; with several data points, the areas under all the peaks sum to 1. In short, the curve has to cover all possible data values.

Commonly used kernels include the rectangular, Epanechnikov, and Gaussian curves. They share two features: a peak at the data point, and an area of 1 under the curve.

For a single data point, these kernels look as follows:

Epanechnikov curve

Gaussian curve

[Probability theory: Gaussian/normal distribution]

Kernels implemented in sklearn

Forms of the sklearn kernels

Gaussian kernel (kernel = 'gaussian')

Tophat kernel (kernel = 'tophat')

Epanechnikov kernel (kernel = 'epanechnikov')

Exponential kernel (kernel = 'exponential')

Linear kernel (kernel = 'linear')

Cosine kernel (kernel = 'cosine')

[Kernel Density Estimation]
Plots of various kernel functions on Wikipedia

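Uniform kernel: k(x) = 1/2, -1 ≤ x ≤ 1. With bandwidth h: kh(x) = 1/(2h), -h ≤ x ≤ h

Triangular kernel: k(x) = 1 - |x|, -1 ≤ x ≤ 1. With bandwidth h: kh(x) = (h - |x|)/h^2, -h ≤ x ≤ h

Gamma kernel: kxi(x) = [x^(α-1) exp{-xα/xi}] / [(xi/α)^α Γ(α)]

Gaussian kernel: K(x, xc) = exp(-||x - xc||^2 / (2σ^2)), where xc is the centre of the kernel and σ is its width parameter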
[https://zh.wikipedia.org/zh-hans/%E6%A0%B8%E5%AF%86%E5%BA%A6%E4%BC%B0%E8%AE%A1]
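To make the effect of the bandwidth on these formulas concrete, here is a small sketch that defines a few unscaled kernels, applies the scaling K_h(x) = (1/h) K(x/h), and checks numerically that the area stays at 1. The dictionary kernels and the helper scaled are names chosen for this example; the Epanechnikov formula 3/4 (1 - u^2) on [-1, 1] is the standard textbook form rather than something stated in the post.

```python
import numpy as np

# Unscaled kernels (bandwidth 1).
kernels = {
    "uniform": lambda u: np.where(np.abs(u) <= 1, 0.5, 0.0),
    "triangular": lambda u: np.where(np.abs(u) <= 1, 1.0 - np.abs(u), 0.0),
    "epanechnikov": lambda u: np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0),
    "gaussian": lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi),
}

def scaled(kernel, h):
    # K_h(x) = (1/h) * K(x / h)
    return lambda x: kernel(x / h) / h

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

areas = {}
for name, k in kernels.items():
    for h in (0.5, 1.0, 2.0):
        areas[(name, h)] = scaled(k, h)(x).sum() * dx  # Riemann sum of the area
        print(f"{name:12s} h={h}: area ~ {areas[(name, h)]:.4f}")

# Spot checks against the closed forms quoted above, with h = 2:
print(scaled(kernels["uniform"], 2.0)(0.0))     # 1/(2h)
print(scaled(kernels["triangular"], 2.0)(1.0))  # (h - |x|)/h^2
```

Widening a kernel by a factor h lowers its height by the same factor, which is why the area is unchanged.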

Comparison of different kernels

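The Epanechnikov kernel is optimal in the mean squared error sense, although the other kernels listed above lose only a little efficiency relative to it.

Because of its convenient mathematical properties, the Gaussian kernel is also used very often: K(x) = ϕ(x), where ϕ(x) is the standard normal probability density function.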

For a KDE curve over many data points: because neighbouring peaks blend together, the shape of the final curve depends only weakly on the choice of kernel. Since the Gaussian is easy to work with when summing the peaks, the Gaussian curve (normal density) is usually used as the KDE kernel.

KDE algorithm: index trees
I noticed that sklearn's implementation takes an algorithm parameter, e.g. algorithm='auto'. On reflection, it is there to speed up the computation.

Once we have the formula for the KDE density, all we need to do is traverse every point of the output image and evaluate the estimated kernel density there.

On reflection, though, this procedure is very wasteful. If there are many points (n is large) and the output image is large, every pixel needs n additions, most of which add 0: in general only a few points, far fewer than n, lie near a given pixel, and most of the rest are farther than the kernel radius r from it. The result is a lot of redundant computation.

The fix is simple: build an index, use it to look up the points near a pixel when computing that pixel's estimated density, and sum the kernel values of only those points.

For example, Dotspatial ships with several spatial indexes, including the R-tree, R*-tree, and KD-tree; sklearn ships with the kd tree, ball tree, and so on.

If you only need to find nearby points, the demands on the index are modest and any of these indexes will do.
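The index idea can be sketched with sklearn's KDTree. This is an illustrative comparison under stated assumptions: random 2-D points, a 2-D Epanechnikov kernel with compact support of radius r (so far-away points contribute exactly 0), and made-up helper names density_brute and density_indexed. It is not the Dotspatial implementation referenced in this post.

```python
import numpy as np
from sklearn.neighbors import KDTree

def epanechnikov_2d(d, r):
    # 2-D Epanechnikov kernel of radius r; exactly zero for distances d >= r.
    u = d / r
    return np.where(u < 1.0, (2.0 / (np.pi * r ** 2)) * (1.0 - u ** 2), 0.0)

def density_brute(pixels, points, r):
    # n kernel evaluations per pixel, most of which add 0.
    d = np.linalg.norm(pixels[:, None, :] - points[None, :, :], axis=2)
    return epanechnikov_2d(d, r).sum(axis=1) / len(points)

def density_indexed(pixels, points, r):
    # Build the index once, then visit only the points within r of each pixel.
    tree = KDTree(points)
    neighbours = tree.query_radius(pixels, r=r)  # one index array per pixel
    out = np.zeros(len(pixels))
    for i, idx in enumerate(neighbours):
        if idx.size:
            d = np.linalg.norm(points[idx] - pixels[i], axis=1)
            out[i] = epanechnikov_2d(d, r).sum() / len(points)
    return out

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(1000, 2))
xs, ys = np.meshgrid(np.linspace(0.0, 100.0, 30), np.linspace(0.0, 100.0, 30))
pixels = np.column_stack([xs.ravel(), ys.ravel()])
r = 5.0

dens_brute = density_brute(pixels, points, r)
dens_index = density_indexed(pixels, points, r)
print("same result:", np.allclose(dens_brute, dens_index))
```

Both functions compute the same density image; the indexed version simply skips the points that would have added 0.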

[Implementing kernel density estimation for spatial point clouds, using Dotspatial as the base GIS library]
KDE bandwidth h
How do we choose the "variance" of the kernel? It is determined by the bandwidth h, and the estimate can differ greatly from one bandwidth to another.

The bandwidth controls how flat the KDE curve is overall, i.e. how much weight each observed data point carries in shaping the curve. The larger the bandwidth, the less each observed point contributes to the final shape and the flatter the KDE curve; the smaller the bandwidth, the more each observed point contributes and the steeper the curve.

Taking the one-dimensional dataset of the three points above again, increasing the bandwidth flattens the KDE curve:

Increasing the bandwidth further flattens the curve until the peaks merge:

Conversely, reducing the bandwidth makes the KDE curve steeper:

Mathematically, for a data point x_i and bandwidth h, the curve formed at x_i is (where K is the kernel function):

$$\frac{1}{h} K\left(\frac{x - x_i}{h}\right)$$

In this expression, the h in the denominator inside K adjusts the width of the curve, while the h in the denominator outside K ensures that the area under the curve obeys the KDE rule (the areas under the KDE curve sum to 1).
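The behaviour described for the three-point dataset 5, 10, 15 can be reproduced numerically. The sketch below assumes a Gaussian kernel and an arbitrary set of bandwidths (1, 2, 4, 8); gaussian_kde and count_peaks are helper names invented here.

```python
import numpy as np

samples = np.array([5.0, 10.0, 15.0])
grid = np.linspace(-30.0, 50.0, 8001)
dx = grid[1] - grid[0]

def gaussian_kde(x, samples, h):
    # Average of the scaled kernels (1/h) * K((x - x_i) / h) with Gaussian K.
    u = (x[:, None] - samples[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))).mean(axis=1)

def count_peaks(y):
    # Number of strict interior local maxima on the grid.
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

results = {}
for h in (1.0, 2.0, 4.0, 8.0):
    y = gaussian_kde(grid, samples, h)
    results[h] = (count_peaks(y), y.max(), y.sum() * dx)
    print(f"h={h}: peaks={results[h][0]}, max={results[h][1]:.4f}, area={results[h][2]:.4f}")
```

With a small bandwidth each observation keeps its own peak; as h grows the maximum drops and the peaks eventually merge into one, while the area under the curve stays at 1 thanks to the 1/h factor.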

Choosing the bandwidth

The choice of bandwidth depends largely on subjective judgment: if you believe the true probability density curve is fairly flat, choose a larger bandwidth; if you believe it is steeper, choose a smaller one.

There are also systematic methods for computing the bandwidth; for example, R uses the "nrd0" method by default.
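For reference, the rule behind R's nrd0 is documented as 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34, times n^(-1/5) (Silverman's rule of thumb). Below is a Python sketch of that rule; the function name bw_nrd0 is borrowed from R for readability, and the fallback for a zero spread follows my reading of R's behaviour, so treat it as an approximation rather than a verified port.

```python
import numpy as np

def bw_nrd0(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = np.std(x, ddof=1)                 # sample standard deviation
    q75, q25 = np.percentile(x, [75, 25])  # linear interpolation, as in R's default
    spread = min(sd, (q75 - q25) / 1.34)
    if spread == 0:
        # Guarantee a positive bandwidth when the quartiles coincide.
        spread = sd or abs(x[0]) or 1.0
    return 0.9 * spread * x.size ** (-0.2)

print(bw_nrd0([5.0, 10.0, 15.0]))
```

The returned value can be used as a starting bandwidth for a Gaussian kernel and then adjusted by eye.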

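How should h be chosen? Naturally, h should make the error as small as possible. The quality of a given h is measured by the mean integrated squared error (MISE):

$$\mathrm{MISE}(h) = \mathrm{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]$$

Under weak assumptions, MISE(h) = AMISE(h) + o(1/(nh) + h^4), where AMISE is the asymptotic MISE:

$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4} m_2(K)^2 h^4 R(f'')$$

with $R(g) = \int g(x)^2 dx$ and $m_2(K) = \int x^2 K(x) dx$.

To minimize MISE(h), the problem is turned into finding an extremum: setting the derivative of AMISE(h) with respect to h to zero gives

$$h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5} R(f'')^{1/5} n^{1/5}}$$

Once the kernel is fixed, R(K) and m_2(K) are known constants; R(f'') depends on the unknown density f and has to be estimated or approximated in practice. In either case h_AMISE ~ n^(-1/5) and AMISE(h) = O(n^(-4/5)).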
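As a worked check of the n^(-1/5) rate, the sketch below evaluates the standard AMISE-optimal bandwidth h = [R(K) / (m_2(K)^2 R(f'') n)]^(1/5), where R(g) is the integral of g(x)^2 and m_2(K) is the integral of x^2 K(x). It assumes a Gaussian kernel and, purely for illustration, that the true density f is standard normal. Under those assumptions the result should match the familiar closed form (4 / (3n))^(1/5), roughly 1.06 n^(-1/5).

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 240001)
dx = x[1] - x[0]
phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel and assumed true density

R_K = (phi ** 2).sum() * dx        # R(K): integral of K(x)^2
m2_K = (x ** 2 * phi).sum() * dx   # m_2(K): integral of x^2 K(x)
f_dd = (x ** 2 - 1.0) * phi        # second derivative of the standard normal density
R_fdd = (f_dd ** 2).sum() * dx     # R(f''): integral of f''(x)^2

def h_amise(n):
    # AMISE-optimal bandwidth for sample size n under the assumptions above.
    return (R_K / (m2_K ** 2 * R_fdd * n)) ** 0.2

for n in (100, 3200):
    print(n, h_amise(n), (4.0 / (3.0 * n)) ** 0.2)
```

Multiplying n by 32 halves the bandwidth, which is exactly the n^(-1/5) scaling.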

If the bandwidth is not held fixed but varies with the location of either the estimate (balloon estimator) or the sample points (pointwise estimator), the result is a particularly powerful method called adaptive or variable-bandwidth kernel density estimation.

[Kernel density estimation]
With a suitable kernel and bandwidth, KDE can approximate the true probability density curve and give a smooth, good-looking result. Taking nearly 200 CPU utilization measurements as an example, the KDE plot is as follows:

[Visualizing one-dimensional data: Kernel Density Estimates]


Implementations of kernel density estimation

KDE in Python: sklearn
sklearn.neighbors.KernelDensity(bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)

from sklearn.neighbors import KernelDensity
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# Fit a Gaussian KDE with bandwidth 0.2 to the six 2-D points
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
print(kde.score_samples(X))          # log-density at each sample
print(np.exp(kde.score_samples(X)))  # density at each sample
# Output:
# [-0.41075698 -0.41075698 -0.41076071 -0.41075698 -0.41075698 -0.41076071]
# [ 0.66314807 0.66314807 0.6631456 0.66314807 0.66314807 0.6631456 ]

score_samples(X)

Evaluate the density model on the data.

Parameters:
X : array_like, shape (n_samples, n_features)

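kde.score_samples(X) returns the log of the density at each point x; apply exp to recover the density itself.

Note: the recovered density values lie in [0, infinity), not in [0, 1]. The only constraint is that the area under the curve for one-dimensional data, or the volume under the surface for two-dimensional data, equals 1.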
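A quick way to see this is a tightly clustered 1-D sample with a small bandwidth. The data and bandwidth below are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Three closely spaced 1-D observations (hypothetical data).
X = np.array([[0.0], [0.1], [0.2]])
model = KernelDensity(kernel="gaussian", bandwidth=0.05).fit(X)

grid = np.linspace(-1.0, 1.2, 22001)[:, None]
dx = grid[1, 0] - grid[0, 0]

log_density = model.score_samples(grid)  # natural log of the density
density = np.exp(log_density)

print("max density:", density.max())             # well above 1
print("area under curve:", density.sum() * dx)   # approximately 1
```

A density value is not a probability; only its integral over a region is.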

[Density Estimation]

[sklearn.neighbors.KernelDensity]

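KDE in Spark
MLlib supports kernel density estimation with the Gaussian kernel only.

[Kernel density estimation]

KDE in R
In R, KDE plots are produced with the density() function: density() computes the KDE model, and plot() then draws the KDE curve: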
x

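density() accepts the following 7 kernel options:
gaussian. Gaussian curve, the default. Models a normal distribution at each data point.
epanechnikov. Epanechnikov curve.
rectangular. Rectangular kernel.
triangular. Triangular kernel.
biweight.
cosine. Cosine curve.
optcosine.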

from: http://blog.csdn.net/pipisorry/article/details/53635895

ref: [Kernel density estimation on bounded intervals]

Copyright notice: this is an original article by CSDN blogger "–柚皮-", licensed under CC 4.0 BY-SA. Reprints must include a link to the original source and this notice.

Original link: https://blog.csdn.net/pipisorry/article/details/53635895

Original: https://www.cnblogs.com/dhcn/p/16454853.html
Author: 辉–
Title: Nonparametric Estimation: Kernel Density Estimation (KDE)
