# Common Machine Learning Interview Questions

(1) What is the difference between supervised and unsupervised algorithms?

Supervised learning:

The model learns from training samples that carry concept labels (classes), in order to label (classify) and predict data outside the training set as well as possible. Here all labels (classes) are known in advance, so the ambiguity of the training samples is low.

Unsupervised learning:

The model learns from training samples without concept labels (classes), in order to discover structural knowledge within the training set. Here all labels (classes) are unknown, so the ambiguity of the training samples is high. Clustering is a typical example of unsupervised learning.

(2) Derivation and properties of SVM? How is multi-class classification handled?

Start from the linearly separable case: the primal problem, the dual problem after feature transformation, the introduction of kernels (linear, polynomial, Gaussian), and finally the soft margin.

Linear kernel: simple and fast, but requires the data to be (approximately) linearly separable.

Polynomial kernel: stronger fitting ability than the linear kernel, and the target dimension is known explicitly, but high degrees are prone to numerical instability and there are more parameters to choose.

Gaussian kernel: the strongest fitting ability, so watch out for overfitting; on the other hand, only one parameter needs to be tuned.

For multi-class problems, there are generally three ways to extend binary classification to multiple classes: one-vs-one, one-vs-rest, and many-vs-many.

One-vs-one: train a binary classifier for every pair of classes, N(N−1)/2 classifiers in total, and predict by majority vote among them.

One-vs-rest: each time, take one class as the positive examples and all the others as negative examples, training N classifiers in total. If exactly one classifier predicts the positive class, the corresponding category is the final result; if several do, we generally choose the one with the highest confidence. One-vs-one trains more classifiers, but each uses only two classes' worth of data at a time, so its overhead is usually lower when there are many categories (as long as the training complexity is higher than O(N)).

Many-vs-many: several classes are treated as positive and several as negative. Note that the positive/negative groupings must be specially designed (e.g., with error-correcting output codes).
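The classifier counts behind the overhead comparison can be sketched directly (plain Python, no ML library assumed):

```python
# One-vs-rest trains N binary classifiers; one-vs-one trains N*(N-1)/2,
# but each one-vs-one classifier only sees two classes' worth of data.

def ovr_count(n_classes):
    """Number of binary classifiers trained by one-vs-rest."""
    return n_classes

def ovo_count(n_classes):
    """Number of binary classifiers trained by one-vs-one."""
    return n_classes * (n_classes - 1) // 2

for n in (3, 10, 100):
    print(n, ovr_count(n), ovo_count(n))
```

For 100 classes, one-vs-one needs 4950 classifiers versus 100 for one-vs-rest, which is why the per-classifier training cost matters in the comparison above.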

(3) Derivation and properties of LR (logistic regression)?

The advantages of LR are that it is simple to implement, computationally very cheap, fast, and low on storage. The drawback is that, because the model is simple, it underfits in complex situations, and it only handles binary classification directly (it can be extended via the usual binary-to-multiclass reductions or via softmax regression).
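A minimal sketch of binary logistic regression trained by gradient descent, on a made-up 1-D toy dataset (the data and hyperparameters here are illustrative, not from the article):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=2000):
    """Fit w, b by batch gradient descent on the cross-entropy loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            gw += (p - y) * x   # dL/dw for cross-entropy + sigmoid
            gb += (p - y)       # dL/db
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Points below 0 are class 0, above 0 are class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(preds)
```

The gradient has the simple form (p − y)·x precisely because the sigmoid is paired with the cross-entropy loss, which is part of why training is so cheap.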

(4) Properties of decision trees?

A decision tree makes decisions based on a tree structure, which is very similar to how humans handle problems. Its characteristic is that it must select an attribute to branch on, and during branching it selects the attribute with the largest information gain, defined as the entropy of the node minus the weighted entropy of the subsets induced by the attribute:

Gain(D, a) = Ent(D) − Σ_v (|D^v| / |D|) · Ent(D^v)

When splitting, we want the samples contained in each branch node of the decision tree to belong to the same class as far as possible, i.e., the purity of the nodes keeps increasing. Decision trees have the advantages of simple computation and strong interpretability, are comparatively well suited to samples with missing attribute values, and can handle irrelevant features, but they overfit easily, so pruning or random forests are needed. Information gain is entropy minus conditional entropy, representing the reduction in uncertainty: the greater the information gain, the greater the reduction in uncertainty, so the feature is important for classification. Because the information gain criterion prefers attributes with many possible values, the information gain ratio is generally used instead (C4.5).

The denominator of the gain ratio can be regarded as the entropy of the attribute itself: the more possible values the attribute has, the larger this entropy is.

The CART decision tree uses the Gini index to select the splitting attribute. Intuitively, Gini(D) reflects the probability that two samples drawn at random from dataset D carry inconsistent class labels, so the smaller the Gini index, the higher the purity of D. To prevent overfitting, pruning is generally performed; there are pre-pruning and post-pruning, usually guided by a cross-validation set.

For the handling of continuous and missing values: for a continuous attribute a, sort its distinct values on D and split D into two subsets by a partition point t. Generally the midpoint of each pair of consecutive values is taken as a candidate partition point, and the best one is then chosen by information gain. Unlike a discrete attribute, if the current node splits on a continuous attribute, that attribute can still be used as a splitting attribute in its descendants.
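The entropy and information gain definitions above can be sketched in a few lines (toy labels invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum p_k * log2 p_k over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Ent(D) minus the weighted entropy of the subsets induced by the attribute."""
    n = len(labels)
    subsets = {}
    for y, v in zip(labels, attribute_values):
        subsets.setdefault(v, []).append(y)
    cond = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - cond

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]    # splits the classes perfectly
useless = ["a", "b", "a", "b"]    # tells us nothing about the class
print(information_gain(labels, perfect))  # → 1.0
print(information_gain(labels, useless))  # → 0.0
```

The perfect split removes all uncertainty (gain equals the full entropy of 1 bit), while the useless split leaves the conditional entropy unchanged (gain 0), matching the "reduction in uncertainty" reading above.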

(5) Comparison of SVM, LR, and decision trees?

SVM can be used for both classification and regression, and kernels make the computation efficient. LR is simple to implement and very fast to train, but the model is relatively simple. Decision trees overfit easily and need pruning or similar remedies. From the viewpoint of the optimization objective, soft-margin SVM uses the hinge loss, L2-regularized LR corresponds to the cross-entropy loss, and AdaBoost corresponds to the exponential loss. As a result, LR is sensitive to outliers, while SVM is not very sensitive to them because it only cares about the support vectors. SVM can map features into an infinite-dimensional space, which LR cannot. On small datasets SVM is usually somewhat better than LR, but LR can output probabilities while SVM cannot. SVM depends on the metric of the data, so normalization is needed first; LR generally does not require it. For large amounts of data LR is used more widely, and LR extends to multi-class problems more directly. For class imbalance, SVM generally uses class weights, i.e., different costs for positive and negative samples in the objective; LR can use the same general approach, or simply adjust the final decision via the threshold. On small data with high-dimensional samples, SVM usually performs somewhat better.
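The loss-function contrast can be checked numerically. Writing the margin as m = y·f(x), the hinge loss is max(0, 1 − m) and the logistic (cross-entropy) loss is log(1 + e^(−m)); a correctly classified point beyond the margin contributes nothing to the hinge loss but always contributes something to the logistic loss (a minimal sketch):

```python
import math

def hinge_loss(margin):
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    return math.log(1.0 + math.exp(-margin))

for m in (-2.0, 0.0, 1.0, 5.0):
    print(m, hinge_loss(m), round(logistic_loss(m), 4))

# Beyond the margin (m >= 1) the hinge loss is exactly zero, so such
# points stop influencing the SVM solution; the logistic loss never
# reaches zero, so every point keeps pulling on the LR solution.
```

This is the mechanism behind "SVM only cares about the support vectors" in the comparison above.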

(6) Difference between GBDT and random forests?

Random forests adopt the idea of bagging (bootstrap aggregation): multiple sampling sets are obtained by bootstrap sampling from the training set, a base learner is trained on each sampling set, and the base learners are then combined. On top of bagging over decision trees, random forests introduce random attribute selection during tree training: where a traditional decision tree selects the optimal attribute from the current node's full attribute set, a random forest first randomly selects a subset of k attributes for the node and then selects the best attribute from that subset; k is a parameter controlling the degree of randomness introduced.

GBDT, by contrast, is trained on the idea of boosting: each iteration fits the errors (residuals) of the current model, so it is a sequential, serial method, whereas a random forest follows the bagging idea and is therefore a parallel method.

(7) How to judge whether a function is convex or non-convex? What is convex optimization?

A function f is convex if, for all x, y in its domain and all θ in [0, 1]:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)

The condition above also leads to more general results.

If the function is twice differentiable, then it is convex if its second derivative is non-negative, or, for a multivariate function, if its Hessian matrix is positive semidefinite.

(This may also lead to SVM, to the proof that a local optimum of a convex function is also a global optimum, or to Jensen's inequality, the expectation form of the formula above.)
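The second-derivative criterion can be sanity-checked numerically (a sketch, not a proof): estimate f'' by central finite differences on a grid and check that it is non-negative everywhere sampled.

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def looks_convex(f, lo=-3.0, hi=3.0, steps=61):
    """Check f'' >= 0 (up to float noise) on a uniform grid of sample points."""
    xs = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return all(second_derivative(f, x) >= -1e-6 for x in xs)

print(looks_convex(lambda x: x * x))   # x^2: f'' = 2 everywhere  -> True
print(looks_convex(lambda x: x ** 3))  # x^3: f'' = 6x < 0 for x<0 -> False
```

A passing check is only evidence, not a certificate: sampling can miss non-convex regions, which is why the analytic Hessian condition is the real criterion.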

(8) How to deal with class imbalance?

(9) Explain the concept of duality.

An optimization problem can be examined from two angles: one is the primal problem and the other is the dual problem. In general, the dual problem gives a lower bound on the optimal value of the primal problem; under strong duality, the optimal value of the primal problem can be obtained exactly from the dual problem. The dual problem is a convex optimization problem, so it can be solved well. In SVM, the primal problem is transformed into the dual problem to solve, which in turn introduces the idea of kernel functions.

(10) How to perform feature selection?

Feature selection is an important data preprocessing step, mainly for two reasons. First, real tasks run into the curse of dimensionality (sample density becomes very sparse); if we can select a subset of the features, this problem is greatly alleviated. Second, removing irrelevant features reduces the difficulty of the learning task and improves the generalization ability of the model. A redundant feature is one whose information can be deduced from other features, but that does not mean a redundant feature is necessarily useless; for example, redundant features can be added in underfitting situations to increase the complexity of a too-simple model.

In theory, without domain knowledge as a prior, we could only traverse all possible subsets, but this is obviously infeasible because the number of subsets grows combinatorially. Generally, the process is divided into two parts: subset search and subset evaluation. Subset search usually uses a greedy algorithm, adding or removing candidate features in each round, called forward search and backward search respectively, or a combination of the two (bidirectional search). Subset evaluation generally uses information gain, and for continuous data the midpoint between consecutive values is often chosen as the partition point.

Common feature selection methods fall into three families: filter, wrapper, and embedded. Filter methods first select features on the dataset and then train the learner. Wrapper methods directly use the performance of the final learner as the evaluation criterion for a feature subset, generally by repeatedly generating candidate subsets and updating them via cross-validation, which is usually computationally expensive. Embedded feature selection integrates feature selection with the training process, so features are selected automatically during training. For example, L1 regularization tends to yield sparse solutions, while L2 regularization mainly makes overfitting harder. L1-regularized objectives can be solved by PGD (proximal gradient descent).
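The link between L1 regularization, sparsity, and PGD can be sketched with the proximal operator of the L1 norm, the soft-thresholding function: each PGD step shrinks every coefficient toward zero by λ and sets the small ones to exactly zero (a minimal illustration of one prox step, not a full solver; the weights are made up):

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shrink toward zero, clip small values to zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [2.5, -0.3, 0.05, -1.7, 0.0]
sparse = [soft_threshold(w, 0.5) for w in weights]
print([round(v, 4) for v in sparse])  # → [2.0, 0.0, 0.0, -1.2, 0.0]
```

This is exactly why L1 produces sparse solutions while L2 (whose prox is a uniform rescaling) merely shrinks all coefficients without zeroing any.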

(11) Why does overfitting occur, and what methods can prevent or overcome it?

In machine learning, the error of a learner on the training set is called the training error or empirical error, and the error on new samples is called the generalization error. Obviously we want a learner with a small generalization error, but since we do not know the new samples in advance, we often try to minimize the empirical error instead. However, when the learner fits the training samples too well, it may treat peculiarities of the training samples as general properties of the underlying distribution. This causes generalization performance to drop, which is called overfitting. Conversely, underfitting generally means that the general properties of the training samples have not been learned well, so the error on the training set itself is still large.

Underfitting: generally easier to solve, for example by increasing the complexity of the model, adding branches in a decision tree, increasing the number of training epochs in a neural network, and so on.

Overfitting: it is generally believed that overfitting cannot be completely avoided, because the problems faced by machine learning are generally NP-hard while an effective solver must run in polynomial time, so some generalization ability is sacrificed. Common remedies include increasing the number of samples, reducing the sample dimensionality, reducing the model complexity, using prior knowledge (L1 and L2 regularization), using cross-validation, early stopping, and so on.

(12) What are bias and variance?

The generalization error can be decomposed into the square of the bias, plus the variance, plus the noise. The bias measures the gap between the expected prediction of the learning algorithm and the true result, describing the fitting ability of the algorithm itself. The variance measures the change in performance caused by changes in equally sized training sets, describing the impact of data perturbations. The noise expresses the lower bound of the expected generalization error achievable by any learning algorithm on the current task, describing the difficulty of the problem itself. As training proceeds, the bias becomes smaller and the variance becomes larger, so the generalization error generally has a minimum somewhere in between. Large bias with small variance is generally called underfitting, while small bias with large variance is called overfitting.
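The decomposition described above can be written out explicitly (standard notation: f(x; D) is the prediction of the model trained on set D, \bar{f}(x) = \mathbb{E}_D[f(x; D)], y_D is the observed label, and y the true label, with the cross terms vanishing in expectation):

```latex
\begin{aligned}
E(f; D) &= \mathbb{E}_D\big[(f(x; D) - y_D)^2\big] \\
        &= \underbrace{\big(\bar{f}(x) - y\big)^2}_{\text{bias}^2}
         + \underbrace{\mathbb{E}_D\big[(f(x; D) - \bar{f}(x))^2\big]}_{\text{variance}}
         + \underbrace{\mathbb{E}_D\big[(y_D - y)^2\big]}_{\text{noise}}
\end{aligned}
```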

(13) What is the principle of neural networks, and how are they trained?

Neural networks have grown into a very large discipline. Generally speaking, a neural network is composed of single neurons and the connections between them, and different connection structures yield different neural networks. The most common one is the multi-layer feedforward neural network; besides the input and output layers, the hidden layers in the middle determine the depth of the network. BP (backpropagation) is the most famous algorithm for training neural networks; in essence it is gradient descent plus the chain rule.
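The "gradient descent plus chain rule" essence can be checked on a tiny hypothetical network: one tanh hidden neuron, one linear output, squared loss. Backpropagation applies the chain rule layer by layer, and the result can be verified against a finite-difference gradient:

```python
import math

def forward(w1, w2, x, y):
    h = math.tanh(w1 * x)          # hidden activation
    out = w2 * h                   # linear output
    loss = 0.5 * (out - y) ** 2
    return h, out, loss

def backprop(w1, w2, x, y):
    """Gradients of the loss w.r.t. w1, w2 via the chain rule."""
    h, out, _ = forward(w1, w2, x, y)
    d_out = out - y                          # dL/d_out
    g2 = d_out * h                           # dL/dw2
    g1 = d_out * w2 * (1 - h * h) * x        # chain rule through tanh
    return g1, g2

w1, w2, x, y = 0.5, -0.3, 1.2, 0.7
g1, g2 = backprop(w1, w2, x, y)

# Finite-difference check of dL/dw1: the two should agree closely.
eps = 1e-6
num_g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
print(abs(g1 - num_g1) < 1e-8)  # → True
```

Training is then just repeating w ← w − η·g with these gradients; this gradient-checking trick is also a standard way to debug hand-written backpropagation.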

(14) Introduce convolutional neural networks. What is the difference from DBN?

A convolutional neural network is characterized by its convolution kernels: CNNs use weight sharing, and different feature representations are obtained by applying convolutions repeatedly. The sampling layer, also called the pooling layer, sub-samples based on the principle of local correlation, reducing the amount of data while retaining useful information. A DBN is a deep belief network: each layer is an RBM, and the whole network can be regarded as a stack of RBMs. It is usually trained with unsupervised layer-by-layer pre-training, starting from the first layer, each layer training on the output of the previous layer; after the layer-wise training ends, the whole network is fine-tuned with the BP algorithm.

(15) Which models are solved with the EM algorithm, and why not use Newton's method or gradient descent?

(16) Use the EM algorithm to derive and explain k-means.

The k-means algorithm is a special case of Gaussian mixture clustering in which the mixture components have equal variance and each sample is assigned to exactly one mixture component. Note that k-means needs normalization before running, otherwise distance computation can break down because samples are too large along some dimensions. In k-means, the class each sample belongs to can be seen as a latent variable. In the E-step, we fix the center of each class and optimize the objective by assigning each sample to the nearest class. In the M-step, we re-update the center of each class; this step can be derived by differentiating the objective, and the new class center turns out to be the mean of the samples in that class.
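The E-step/M-step alternation above can be sketched directly on made-up 1-D data, with fixed initial centers so the run is deterministic:

```python
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # E-step: assign each point to its nearest center (the latent variable).
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[idx].append(p)
        # M-step: each center becomes the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

points = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(kmeans(points, centers=[0.0, 10.0]))  # → [0.5, 9.5]
```

Each E-step and each M-step can only decrease (or keep) the within-cluster sum of squares, which is why the alternation converges, exactly as in EM.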

(17) Which clustering algorithms have you used? Explain density-based clustering.

(19) What distance metrics are used in clustering algorithms?

(20) Explain Bayes' formula and naive Bayes classification.

Naive Bayes assumes the features are conditionally independent given the class and multiplies the per-feature conditional probabilities; if a feature value never co-occurs with some class in the training set, its estimated probability is zero and wipes out the whole product, so Laplacian smoothing adds one to each count. This both ensures the normalization of the probabilities and avoids the phenomenon above.
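A minimal sketch of the Laplace-smoothed estimate of P(feature = v | class), as used in naive Bayes (the counts are invented for illustration):

```python
from collections import Counter

def smoothed_prob(value, values_in_class, n_possible_values):
    """(count + 1) / (total + number of possible values): Laplace smoothing."""
    counts = Counter(values_in_class)
    return (counts[value] + 1) / (len(values_in_class) + n_possible_values)

# "red" never appears in this class; without smoothing P would be 0
# and would zero out the whole naive Bayes product.
obs = ["green", "green", "blue"]
p_red = smoothed_prob("red", obs, n_possible_values=3)
p_green = smoothed_prob("green", obs, 3)
p_blue = smoothed_prob("blue", obs, 3)
print(p_red)                      # (0 + 1) / (3 + 3), strictly positive
print(p_red + p_green + p_blue)   # still sums to 1 over all possible values
```

Adding the number of possible values to the denominator is what keeps the smoothed probabilities normalized.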

(22) Explain the roles of L1 and L2 regularization.

(23) What is TF-IDF?

TF stands for term frequency, and IDF stands for inverse document frequency. The algorithm can be used to extract keywords from a document. The starting point is that words appearing many times in an article are likely keywords, and term frequency captures this. However, some words are stop words, such as 的, 是, 有, which appear in huge numbers and must be filtered out first. Suppose that after filtering, the words 中国 (China), 蜜蜂 (bee), and 养殖 (breeding) have nearly identical term frequencies; since 中国 appears in other articles with much higher probability than the other two, the latter two should be considered more representative of the article's topic. IDF captures exactly this information, and computing it requires a corpus: the smaller the probability that a word appears in the corpus, the larger its IDF should be. Generally, TF is computed as (number of occurrences of the word in the article / total number of words in the article), which removes the advantage long articles have in raw counts, and IDF is computed as log(total number of articles in the corpus / (number of articles containing the word + 1)). Multiplying the two gives the word's TF-IDF. Traditional TF-IDF does not consider where in the document a word occurs; different positions can be given different weights as a correction. Such corrections are effective precisely because people have observed large amounts of text and thereby formed a prior estimate, which is folded into the algorithm to make it more effective.
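The formulas above can be sketched over a toy corpus of tokenized documents (the documents are made up; this follows the exact TF and IDF definitions given in the text):

```python
import math

def tf(word, doc):
    """Occurrences of the word divided by the document length."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """log(N / (df + 1)) with N documents and df documents containing the word."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (df + 1))

def tf_idf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

corpus = [
    ["china", "bee", "breeding", "china"],
    ["china", "economy"],
    ["china", "travel"],
]
doc = corpus[0]
# "china" appears in every document, so its IDF (and hence TF-IDF) is low
# even though its TF in this document is the highest.
print(round(tf_idf("china", doc, corpus), 4))
print(round(tf_idf("bee", doc, corpus), 4))
```

Note that with the "+1" in the denominator, a word appearing in every document can even get a slightly negative IDF; some variants use log(N / df) + 1 instead to avoid this.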

(24) What is cosine distance in text, and what is it used for?

Cosine similarity (often loosely called cosine distance) measures the angle between two vectors, with a value between -1 and 1. A value of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. TF-IDF and cosine similarity can be combined to find articles with similar content: first use TF-IDF to find the keywords of two articles, take the top k keywords (10 to 20) of each article, count the frequencies of these keywords to build a word-frequency vector for each article, and then use cosine similarity to compute how similar the two articles are.
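The similarity measure itself fits in a few lines (toy vectors, invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2, 3], [2, 4, 6]))   # same direction (≈ 1)
print(cosine_similarity([1, 0], [0, 1]))         # orthogonal (0)
print(cosine_similarity([1, 2], [-1, -2]))       # opposite (≈ -1)
```

Because it depends only on the angle, cosine similarity ignores document length, which is exactly what is wanted when comparing word-frequency vectors of articles of different sizes.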

Original: https://www.cnblogs.com/mfryf/p/15293514.html
Author: 知识天地
Title: 机器学习面试常见问题
