# Machine Learning Algorithms (21): Kernel Density Estimation (KDE)

This post is a collected repost; it is recommended to read the original (link at the end).

1 Distribution density functions

1.1 Parametric estimation

1.2 Nonparametric estimation

2 From histograms to kernel density estimation

2.1 Kernel functions

2.2 Choosing the bandwidth

2.2.1 Adaptive (variable-bandwidth) kernel density estimation

2.3 The multidimensional case

1 Distribution density functions

Given a sample set, how do we obtain its underlying distribution density function? There are two broad approaches:

1.1 Parametric estimation

In short, we assume the sample conforms to some known family of probability distributions and then fit the parameters of that distribution to the sample, as in maximum likelihood estimation or Gaussian mixture models. Because parametric estimation injects subjective prior knowledge, the fitted model often struggles to match the true distribution.

1.2 Nonparametric estimation

Unlike parametric estimation, nonparametric estimation adds no prior assumptions about the distribution family; it fits the distribution from the characteristics of the data themselves, and can therefore yield a better model than parametric methods when the assumed family is wrong. Kernel density estimation is one such nonparametric estimator, proposed by Rosenblatt (1956) and Emanuel Parzen (1962), and is also known as the Parzen window method. Ruppert and Cline later proposed a modified kernel density estimation method based on clustering of the data-set density function.

2 From histograms to kernel density estimation

Given a data set, we often want to inspect how the samples are distributed, and the histogram is the usual intuitive tool. Histograms are easy to understand, but they have three drawbacks: (1) the density function they depict is not smooth; (2) the result is strongly affected by the choice of bins: the same data with different bin placement can look completely different. In the first two panels of the figure in the original post, the second differs from the first only by shifting the bins by 0.75, yet the displayed density looks very different; (3) a histogram can effectively display at most 2-D data; higher-dimensional data cannot be shown this way.
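Point (2) is easy to reproduce. The following sketch (with illustrative data, not from the original post) builds two histograms of the same sample with the same bin width but origins shifted by half a bin; the two normalized densities come out visibly different:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=100)            # toy data: 100 draws from N(0, 1)

width = 0.75
edges_a = np.arange(-4.0, 4.0 + width, width)   # one choice of bin origin
edges_b = edges_a + width / 2                   # same width, origin shifted half a bin
hist_a, _ = np.histogram(data, bins=edges_a, density=True)
hist_b, _ = np.histogram(data, bins=edges_b, density=True)

# Both are properly normalized densities over their bins,
# yet their bar heights (the depicted "density") differ.
```

With `density=True`, each histogram integrates to 1 over its bins, so the disagreement is purely an artifact of bin placement, not of normalization.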

There is a further problem: within one bin, every point is assigned the same probability density, which is clearly unsatisfactory. One remedy is to increase the number of bins. In the limit, when there are as many bins as distinct sample values, every sample point gets its own density spike, but this introduces new problems: any value not present in the sample has probability density 0, and the density function is discontinuous. If we could connect these discontinuous pieces, the result would largely meet our requirements. One idea is to estimate the density at a point using information from its neighborhood, which greatly alleviates the discontinuity (the original post illustrates this with a figure).

Counting sample points in a neighborhood of $x$ gives a first estimate: if $k$ of the $n$ sample points fall in the interval $[x-h,\, x+h]$, then

$$\hat{f}(x) = \frac{k}{2nh},$$

i.e., the density value $f(x)$ is obtained by averaging over the neighborhood. Rewriting this counting rule with a general weighting function yields kernel density estimation (KDE), a nonparametric method for estimating a probability density function. For $n$ i.i.d. sample points $x_1, \dots, x_n$ drawn from a distribution $F$ with density $f$, the estimator is

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ is the kernel and $h > 0$ is the bandwidth.
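As a concrete illustration (not from the original post), the estimator can be implemented directly with NumPy; the Gaussian kernel and the sample data here are illustrative choices:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density, a common choice for K
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, samples, h):
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i) / h)
    u = (x_grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=200)           # toy data: 200 draws from N(0, 1)
xs = np.linspace(-5.0, 5.0, 1001)
density = kde(xs, samples, h=0.4)

# The estimate is a valid density: non-negative, integrating to ~1
area = density.sum() * (xs[1] - xs[0])
```

The Riemann-sum check at the end confirms the normalization property discussed next.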

Because the estimate must integrate to 1, the kernel itself must satisfy

$$\int K(u)\,du = 1, \qquad K(u) \ge 0,$$

which guarantees that $\hat{f}_h$ is a valid probability density.

2.1 Kernel functions

Consider the classic six-point example: in the histogram, the bins have width 2; each data point falls into some bin and raises that bin's height by 1/12 (each point carries total probability mass 1/6, spread over a width-2 bin).

In the kernel density estimate of the same data, a normal kernel with variance 2.25 is placed on each data point (the red dashed curves in the figure); summing these contributions gives the kernel density estimate, shown in blue.

So the question becomes: how do we choose the "variance" of the kernel? This is determined by the bandwidth $h$. Under different bandwidths the kernel estimates differ greatly, as shown in the following figure:

（Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.）

2.2 Choosing the bandwidth

Once the kernel is fixed, such as the Gaussian kernel above, how large should its "variance", i.e. the bandwidth $h$ (the neighborhood size discussed earlier), be chosen? Different bandwidths lead to very different fits. As noted above, in theory $h \to 0$ as the sample grows, but if $h$ is too small, too few points fall in each neighborhood to fit well. Borrowing from machine learning practice, we can of course use cross-validation to choose the best $h$. In addition, there is a theoretical derivation that offers guidance on choosing $h$.
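The cross-validation idea can be sketched as follows; the leave-one-out log-likelihood used here as the score is one common criterion (an illustrative choice, not prescribed by the original post):

```python
import numpy as np

def loo_log_likelihood(samples, h):
    # Leave-one-out log-likelihood of a Gaussian-kernel KDE, as a score for h
    n = len(samples)
    u = (samples[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)                 # exclude each point from its own estimate
    f_loo = k.sum(axis=1) / ((n - 1) * h)
    return np.log(f_loo).sum()

rng = np.random.default_rng(1)
samples = rng.normal(size=300)
grid = np.linspace(0.05, 2.0, 40)            # candidate bandwidths
scores = [loo_log_likelihood(samples, h) for h in grid]
h_best = grid[int(np.argmax(scores))]        # bandwidth maximizing the LOO score
```

Very small `h` overfits (each point only "sees" itself) and very large `h` oversmooths, so the score peaks at an intermediate bandwidth.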

Given only the sample set, we can evaluate the density only at the sample points, and the fitted density should be close to those values. On this basis we define an error function; minimizing it then provides a general direction for choosing $h$. A common choice is to minimize the L2 risk, the mean integrated squared error (MISE), defined as:

$$\mathrm{MISE}(h) = \mathbb{E}\!\left[\int \bigl(\hat{f}_h(x) - f(x)\bigr)^2 \, dx\right].$$

Under weak assumptions, MISE is approximated by its asymptotic form (AMISE), whose minimizer has the closed form

$$h = \left(\frac{R(K)}{m_2(K)^2\, R(f'')}\right)^{1/5} n^{-1/5},$$

where $R(g) = \int g(x)^2\,dx$ and $m_2(K) = \int u^2 K(u)\,du$. Once the kernel is determined, $R(K)$ and $m_2(K)$ in the formula for $h$ are known; only $R(f'')$, which involves the unknown density, remains to be estimated, after which $h$ has an analytical solution.


If a Gaussian kernel is used and the underlying density is assumed to be approximately normal, the $h$ that minimizes the mean integrated squared error has the well-known rule-of-thumb form (Silverman's rule):

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\, n^{-1/5},$$

where $\hat{\sigma}$ is the sample standard deviation.
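Numerically, the rule of thumb is a one-liner (an illustrative sketch with toy data, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
n = len(samples)
sigma = samples.std(ddof=1)                  # sample standard deviation

# Silverman's rule of thumb: h = (4*sigma^5 / (3n))^(1/5) ~= 1.06 * sigma * n^(-1/5)
h = (4.0 * sigma**5 / (3.0 * n)) ** 0.2
h_approx = 1.06 * sigma * n ** (-0.2)
```

Since (4/3)^(1/5) ≈ 1.059, the exact form and the 1.06 approximation agree to within a fraction of a percent.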

Briefly, the role of the bandwidth:

In data visualization, the bandwidth determines how smooth the kernel density estimate (KDE) is: the smaller the bandwidth, the more undersmoothed (spiky) the curve; the larger the bandwidth, the more oversmoothed it becomes.

2.2.1 Adaptive (variable-bandwidth) kernel density estimation

If the bandwidth is not fixed but varies, either with the location where the density is estimated (the balloon estimator) or with the sample points (the pointwise, or sample-point, estimator), a particularly powerful method results, called adaptive or variable-bandwidth kernel density estimation. In POI (point-of-interest) recommendation, for example, check-in density is high in dense urban areas and low in sparsely populated rural areas; different locations should therefore be analyzed at different scales, which motivates using a non-fixed bandwidth for the kernel density estimate.

Here we briefly describe the adaptive-bandwidth variant. It is obtained from the fixed-bandwidth kernel density estimator by letting the bandwidth depend on each sample point, taking the form:

$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_i}\, K\!\left(\frac{x - x_i}{h_i}\right),$$

where $h_i$ is the local bandwidth attached to sample point $x_i$ (typically larger where the data are sparse).
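A sketch of the sample-point (adaptive) estimator follows; the pilot-KDE construction with a geometric-mean normalizer and exponent `alpha=0.5` is one standard recipe, used here illustratively rather than taken from the original post:

```python
import numpy as np

def kde(x, samples, h):
    # Fixed-bandwidth Gaussian KDE, used as the pilot estimate
    u = (x[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.sum(axis=1) / (len(samples) * h)

def adaptive_kde(x_grid, samples, h0, alpha=0.5):
    # Sample-point adaptive KDE: each x_i gets its own bandwidth
    f_pilot = kde(samples, samples, h0)          # pilot estimate at the sample points
    g = np.exp(np.mean(np.log(f_pilot)))         # geometric mean as the normalizer
    h_i = h0 * (f_pilot / g) ** (-alpha)         # wider kernels where data are sparse
    u = (x_grid[:, None] - samples[None, :]) / h_i[None, :]
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return (k / h_i[None, :]).sum(axis=1) / len(samples)

rng = np.random.default_rng(3)
samples = rng.normal(size=400)
xs = np.linspace(-6.0, 6.0, 1201)
density = adaptive_kde(xs, samples, h0=0.3)
area = density.sum() * (xs[1] - xs[0])           # should be ~1
```

Points in low-density regions receive larger bandwidths, so the tails are smoothed more than the well-populated center.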

2.3 The multidimensional case

Kernel density estimation also extends to $d$ dimensions by replacing the scalar bandwidth with a bandwidth matrix $H$:

$$\hat{f}_H(x) = \frac{1}{n}\sum_{i=1}^{n} |H|^{-1/2}\, K\!\bigl(H^{-1/2}(x - x_i)\bigr),$$

where $x, x_i \in \mathbb{R}^d$, $H$ is a symmetric positive-definite $d \times d$ bandwidth matrix, and $K$ is a multivariate kernel (e.g. the standard multivariate normal density).
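For the common special case of a diagonal bandwidth matrix $H = h^2 I$, the multivariate kernel factorizes into per-dimension Gaussian kernels. A 2-D sketch (illustrative data and bandwidth, not from the original post):

```python
import numpy as np

def kde_2d(points, samples, h):
    # 2-D KDE with diagonal bandwidth matrix H = h^2 * I
    n, d = samples.shape
    diff = (points[:, None, :] - samples[None, :, :]) / h   # shape (m, n, d)
    # Standard bivariate normal kernel: normalizer (2*pi)^(d/2) with d = 2
    k = np.exp(-0.5 * (diff**2).sum(axis=2)) / (2.0 * np.pi)
    return k.sum(axis=1) / (n * h**d)

rng = np.random.default_rng(4)
samples = rng.normal(size=(500, 2))              # toy data: 500 draws from N(0, I)
g = np.linspace(-5.0, 5.0, 81)
gx, gy = np.meshgrid(g, g)
pts = np.column_stack([gx.ravel(), gy.ravel()])
density = kde_2d(pts, samples, h=0.5)

# Riemann-sum check that the estimate integrates to ~1 over the grid
area = density.sum() * (g[1] - g[0]) ** 2
```

A full (non-diagonal) $H$ additionally rotates and stretches the kernels, which matters when the data dimensions are correlated.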

Reference: What is kernel density estimation? How can it be understood intuitively? - Zhihu


Original: https://www.cnblogs.com/dhcn/p/16377703.html
Author: 辉–
Title: 机器学习算法（二十一）：核密度估计 Kernel Density Estimation(KDE)
