# Nonparametric Estimation: Kernel Density Estimation (KDE)

https://blog.csdn.net/pipisorry/article/details/53635895

## Overview of kernel density estimation (KDE)

### The density estimation problem

Estimating the distribution density function of a random variable from a given sample set is one of the basic problems of probability and statistics. Approaches to this problem include parametric estimation and nonparametric estimation.

### Parametric estimation

Parametric estimation can be divided into parametric regression analysis and parametric discriminant analysis. In parametric regression analysis, one assumes that the data follow a specific functional form, such as linear, linearizable, or exponential behavior, and then searches for a particular solution within that family of objective functions, i.e., determines the unknown parameters of the regression model. In parametric discriminant analysis, one must assume that the samples used as the basis for discrimination follow a specific distribution within each possible category. Experience and theory show that there is often a large gap between the basic assumptions of a parametric model and the actual physical process, so these methods do not always achieve satisfactory results.

[Parameter estimation: maximum likelihood estimation (MLE)] [Parameter estimation: parameter estimation methods for text analysis]

### Nonparametric estimation

To address these shortcomings, Rosenblatt and Parzen proposed a nonparametric approach: the kernel density estimation method. Because kernel density estimation uses no prior knowledge about the data distribution and attaches no assumptions to it, it studies the characteristics of the distribution from the data sample itself; for this reason it has received great attention in both statistical theory and applications.

Kernel density estimation (KDE) is used in probability theory to estimate an unknown density function; it is one of the nonparametric methods. It was proposed by Rosenblatt (1956) and Emanuel Parzen (1962) and is also known as the Parzen window. Ruppert and Cline proposed a modified kernel density estimation method based on a clustering algorithm for the data-set density function.

Kernel density estimation suffers from a boundary effect when estimating near the boundary of the support.

[https://zh.wikipedia.org/zh-hans/核密度估计]

In short, kernel density estimation (KDE) estimates an unknown density function and is one of the nonparametric methods of probability theory.

## Application scenarios of kernel density estimation

Risk prediction in stocks, finance, and similar fields: on the basis of univariate kernel density estimation, a value-at-risk prediction model can be established, and by weighting with the coefficient of variation of the kernel density estimate, different value-at-risk prediction models can be built.

The most widely used algorithms in density estimation are the Gaussian mixture model and nearest-neighbor-based kernel density estimation. Gaussian mixture models are used more often in clustering scenarios.

[Kernel Density Estimation (KDE)]

You have probably heard of heat maps; a heat map is in fact a kernel density estimate.

In short, kernel density estimation estimates a density; if you have a series of spatial point data, kernel density estimation is often a good visualization method.


## Kernel density estimation

Kernel density estimation fits the observed data points with a smooth peak function (the "kernel") in order to approximate the true probability density curve.

Kernel density estimation is a nonparametric method for estimating a probability density function. For n i.i.d. sample points x_1, ..., x_n drawn from a distribution with density f, the kernel density estimate is:

f_h(x) = (1/n) * sum_{i=1..n} K_h(x - x_i) = (1/(n*h)) * sum_{i=1..n} K((x - x_i)/h)

K(.) is the kernel function: non-negative, integrating to 1 (consistent with the properties of a probability density), and with mean 0. There are many kernel functions: uniform, triangular, biweight, triweight, Epanechnikov, normal, etc.

h > 0 is a smoothing parameter called the bandwidth (some authors call it the window width).

K_h(x) = (1/h) K(x/h) is the scaled kernel.

The principle of kernel density estimation is fairly simple. Suppose we want the probability distribution of some quantity: if a value appears among the observations, we can take the probability density at that value to be large; the density at nearby values should also be fairly large, while the density at values far away should be relatively small.

Based on this idea, for each value in the observations we can use a kernel K to fit this imagined "small when far, large when near" probability density. The density functions fitted around the individual observations are then averaged; if some observations are more important, a weighted average can be used. Note that kernel density estimation does not aim to recover the true distribution function.

Note: kernel density estimation feeds each data point, together with the bandwidth, into a kernel function (e.g. a Gaussian) as its parameters, yielding N kernel functions; their linear superposition forms the kernel density estimate, which after normalization is the kernel density probability density function.
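As a minimal sketch of this superposition (assuming a Gaussian kernel and a hand-picked bandwidth; the function names here are illustrative, not from any library):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel: non-negative, integrates to 1, mean 0.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde_at(x, data, h):
    # f_h(x) = 1/(n*h) * sum_i K((x - x_i)/h): one scaled kernel per
    # data point, linearly superposed and normalized by n.
    data = np.asarray(data, dtype=float)
    return gaussian_kernel((x - data) / h).sum() / (len(data) * h)

data = [5, 10, 15]
xs = np.linspace(-5, 25, 601)
ys = np.array([kde_at(x, data, h=2.0) for x in xs])

# The superposition is itself a density: the area under the curve is ~1.
print(ys.sum() * (xs[1] - xs[0]))
```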

Take a one-dimensional dataset of the following three data points as an example: 5, 10, 15.

The original post shows a histogram of these three points alongside the corresponding KDE curve.

## The KDE kernel function K(.)

In theory, any smooth peak function can serve as a KDE kernel, as long as the normalized KDE (whose curve depicts the probability density of the data) has total area under the curve equal to 1.

When there is only one data point, the area under its single peak is 1; when there are multiple data points, the areas under all the peaks sum to 1. In short, the curve must cover all possible data values.

Commonly used kernels include the rectangular kernel, the Epanechnikov curve, and the Gaussian curve. They share two characteristics: the peak sits at the data point, and the area under the curve is 1.

The original post illustrates each of these kernels for a single data point:

Epanechnikov curve

Gaussian curve

[Probability theory: the Gaussian/normal distribution]

### Kernels implemented in sklearn

Gaussian kernel (kernel='gaussian')

Tophat kernel (kernel='tophat')

Epanechnikov kernel (kernel='epanechnikov')

Exponential kernel (kernel='exponential')

Linear kernel (kernel='linear')

Cosine kernel (kernel='cosine')

[Kernel Density Estimation]

Wikipedia shows the shapes of the various kernels:

[https://zh.wikipedia.org/zh-hans/%E6%A0%B8%E5%AF%86%E5%BA%A6%E4%BC%B0%E8%AE%A1]

Comparison of different kernels:

The Epanechnikov kernel is optimal in the mean-squared-error sense, and the efficiency loss of the other common kernels is small.

For a KDE curve built from multiple data points, waveform synthesis occurs between adjacent peaks, so the shape of the final curve is not strongly tied to the chosen kernel. Considering how easy the function is to work with when combining waveforms, the Gaussian (normal distribution) curve is generally used as the KDE kernel.
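A quick sketch with sklearn's KernelDensity (using kernel names from the list above) illustrates this: with the same, reasonably large bandwidth, different kernels produce broadly similar curves, each still a valid density:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

data = np.array([[5.0], [10.0], [15.0]])          # the three-point example
xs = np.linspace(-5.0, 25.0, 601).reshape(-1, 1)  # evaluation grid
dx = xs[1, 0] - xs[0, 0]

for name in ('gaussian', 'epanechnikov', 'tophat'):
    kde = KernelDensity(kernel=name, bandwidth=3.0).fit(data)
    ys = np.exp(kde.score_samples(xs))   # score_samples returns log-density
    print(name, round(ys.sum() * dx, 3)) # area under each curve is ~1
```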

## KDE algorithm: index trees

The sklearn implementation has an algorithm parameter, e.g. algorithm='auto'; a little thought shows it is there for acceleration.

Once the KDE probability density formula above is in hand, we only need to traverse each point of the output image and compute the kernel density estimate there.

But a little thought shows that this program is highly redundant: if there are many points (n is large) and the output image is large, each pixel requires n additions, most of which add 0 (generally there are few points near any given pixel, far fewer than n, and most of the rest lie farther than the kernel radius r from the pixel), resulting in much wasted computation.

The solution is simple: build a spatial index, then, when computing the kernel density estimate at a pixel, use the index to find the nearby points and accumulate only their kernel contributions.

If you only need to find nearby points, the requirements on the index are low, and almost any spatial index will do.

[Implementation of a spatial point-cloud kernel density estimation algorithm, based on the DotSpatial GIS library]
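sklearn's KernelDensity works this way: it builds a KD-tree or ball tree over the sample (chosen via the algorithm parameter) and uses the tolerance parameters to prune far-away contributions during evaluation. A sketch:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))  # many 2-D sample points

# A KD-tree is built over X at fit time; rtol > 0 lets the tree stop
# accumulating contributions that cannot change the result by much.
kde = KernelDensity(kernel='gaussian', bandwidth=0.3,
                    algorithm='kd_tree', rtol=1e-4).fit(X)

grid = np.array([[0.0, 0.0], [3.0, 3.0]])
dens = np.exp(kde.score_samples(grid))
print(dens)  # density near the center of the data is much higher
```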
## The KDE bandwidth h

How should the "variance" of the kernel be chosen? It is determined by the bandwidth h, and kernel density estimates differ greatly under different bandwidths.

The bandwidth reflects the overall flatness of the KDE curve, i.e., how much weight each observed data point carries in shaping the curve. The larger the bandwidth, the smaller each observation's share in the final curve shape and the flatter the overall KDE curve; the smaller the bandwidth, the greater each observation's share and the steeper the curve.

Taking the one-dimensional dataset of three points above as an example (the original post shows the corresponding figures): increasing the bandwidth flattens the resulting KDE curve; increasing it further causes the peaks to merge as the curve flattens; conversely, reducing the bandwidth makes the KDE curve steeper.

Mathematically, for a data point Xi with bandwidth h, the curve contributed at Xi is (where K is the kernel function):

(1/h) K((x - Xi)/h)

In this function, the h in the denominator inside K adjusts the width of the KDE curve, while the 1/h factor outside K ensures that the area under the curve obeys the KDE rule (the total area under the KDE curve is 1).

### Selection of the bandwidth

The choice of bandwidth largely depends on subjective judgment: if you believe the true probability density curve is relatively flat, choose a larger bandwidth; conversely, if you believe it is steep, choose a smaller bandwidth.

There are also computational rules for choosing the bandwidth; for example, R uses the "nrd0" rule by default when computing a bandwidth.
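For reference, R's default "nrd0" is Silverman's rule of thumb. A Python sketch mirroring the R formula (with a simplified fallback for degenerate samples; the function name is illustrative):

```python
import numpy as np

def nrd0(x):
    # Silverman's rule of thumb, as used by R's bw.nrd0():
    #   0.9 * min(sd, IQR/1.34) * n^(-1/5)
    x = np.asarray(x, dtype=float)
    n = len(x)
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    spread = min(sd, iqr / 1.34)
    if spread == 0:  # degenerate sample; simplified version of R's fallback
        spread = sd or abs(x[0]) or 1.0
    return 0.9 * spread * n ** (-0.2)

rng = np.random.default_rng(42)
sample = rng.normal(size=500)
print(nrd0(sample))  # a plausible bandwidth for ~N(0,1) data with n=500
```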

How should h be chosen? Clearly, it should be chosen to minimize the error. The quality of h is measured by the mean integrated squared error (MISE):

MISE(h) = E ∫ ( f_h(x) - f(x) )^2 dx

Minimizing MISE(h) then becomes a problem of finding the optimal h.

If the bandwidth is not fixed but varies with the estimation location (balloon estimator) or with the sample point (pointwise estimator), the result is an especially powerful method called adaptive or variable-bandwidth kernel density estimation.

[Kernel density estimation]

After choosing an appropriate kernel and bandwidth, KDE can approximate the true probability density curve and produce smooth, attractive results. The original post illustrates this with the KDE curve of the CPU utilization of about 200 sample points.

[One-dimensional data visualization: kernel density estimates]


## Implementation of kernel density estimation

### KDE in Python: sklearn

sklearn.neighbors.KernelDensity(bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
print(kde.score_samples(X))
print(np.exp(kde.score_samples(X)))
# [-0.41075698 -0.41075698 -0.41076071 -0.41075698 -0.41075698 -0.41076071]
# [ 0.66314807  0.66314807  0.6631456   0.66314807  0.66314807  0.6631456 ]
```

score_samples(X)

Evaluate the density model on the data.

Parameters:
X : array_like, shape (n_samples, n_features)

kde.score_samples(X) returns the log of the density at each point x; apply exp to recover the density itself.

Note: the recovered values are densities, not probabilities; an individual value can lie anywhere in [0, ∞). Only the area under the curve (for 1-D data) or the volume under the surface (for 2-D data) equals 1.
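To see that these values are densities rather than probabilities, consider a hypothetical tightly packed sample: pointwise values can exceed 1 even though the area under the curve is still 1:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.zeros((5, 1))  # five identical 1-D points at 0
kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(X)

dens = np.exp(kde.score_samples([[0.0]]))
print(dens[0])  # about 3.99: far above 1, yet still a valid density value
```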

[Density Estimation]

[sklearn.neighbors.KernelDensity]

### KDE in Spark

In MLlib, kernel density estimation is supported only with a Gaussian kernel.

[Kernel density estimation]

### KDE in R

The density() function accepts the following seven kernel options:

gaussian: Gaussian curve, the default; places a normal distribution at each data point.
epanechnikov: the Epanechnikov curve.
rectangular: rectangular kernel.
triangular: triangular kernel.
biweight.
cosine: cosine curve.
optcosine.

from: http://blog.csdn.net/pipisorry/article/details/53635895

ref: [Kernel density estimation on bounded intervals]

Original: https://www.cnblogs.com/dhcn/p/16454853.html
Author: 辉–
Title: 非参数估计：核密度估计KDE
