Common Machine Learning Interview Questions

(1) What is the difference between supervised and unsupervised algorithms?

Supervised learning:

Learning is performed on training samples that carry concept labels (classes), with the goal of labeling (classifying) or predicting data outside the training set as accurately as possible. Here all labels (classes) are known in advance, so the ambiguity of the training samples is low.

Unsupervised learning:

Learning is performed on training samples without concept labels (classes), with the goal of discovering structural knowledge within the training set. Here all labels (classes) are unknown, so the ambiguity of the training samples is high. Clustering is a typical example of unsupervised learning.

(2) Derivation and properties of SVM? How is multi-class classification handled?


The derivation proceeds from the linearly separable case to the primal problem, then to the dual problem after feature transformation, the introduction of kernels (linear, polynomial, Gaussian), and finally the soft margin.

Linear kernel: simple and fast, but requires the data to be linearly separable.

Polynomial kernel: fits better than the linear kernel and the dimensionality of the feature space is known explicitly, but high orders are prone to numerical instability and there are more parameters to choose.

Gaussian kernel: the strongest fitting ability, but care must be taken to avoid overfitting; on the other hand, only one parameter needs to be tuned.

For multi-class problems, there are generally three ways to extend binary classification to multiple classes: one-vs-one, one-vs-rest, and many-vs-many.

One-vs-one:

Pair up the N classes two by two, producing N(N-1)/2 binary classification tasks. At test time, a new sample is given to all classifiers, and the final result is decided by voting.

One-vs-rest:

Each time, the samples of one class are taken as positive examples and the samples of all other classes as negative examples, training N classifiers in total. If only one classifier predicts the positive class, the corresponding class is the final result; if more than one does, the prediction with the highest confidence is usually chosen. One-vs-one requires more classifiers, but each one only uses the data of two classes at a time, so when there are many classes the one-vs-one overhead is usually lower (as long as the training complexity grows faster than O(N)).

Many-vs-many:

Several classes are treated together as the positive class and several others as the negative class. Note that the positive/negative groupings must be specially designed; a common technique is error-correcting output codes (ECOC).
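As a minimal sketch (assuming scikit-learn is available), the one-vs-one and one-vs-rest strategies above can be applied to a binary SVM with the library's ready-made wrappers; the dataset and kernel parameters here are purely illustrative.

    # One-vs-one / one-vs-rest multi-class SVM (illustrative sketch)
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)                      # 3-class toy data

    # N(N-1)/2 pairwise classifiers, final label chosen by voting
    ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)

    # N classifiers, each separating one class from the rest;
    # the most confident positive prediction wins
    ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)

    print(ovo.predict(X[:5]), ovr.predict(X[:5]))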

(3) Derivation and properties of LR (logistic regression)?

The advantages of LR are that it is simple to implement, computationally very cheap, fast, and light on storage. The disadvantage is that, because the model is simple, it underfits in complex situations, and it can only handle binary classification directly (it can be extended to multi-class problems via the usual binary-to-multiclass reductions or via softmax regression).
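A minimal NumPy sketch of the model and its gradient-descent update, which also illustrates why training is cheap and fast; the toy data, learning rate, and iteration count are illustrative assumptions.

    import numpy as np

    # Toy binary data: X is (n_samples, n_features), y in {0, 1}
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(X.shape[1])
    lr = 0.1
    for _ in range(1000):
        p = sigmoid(X @ w)             # predicted probabilities
        grad = X.T @ (p - y) / len(y)  # gradient of the negative log-likelihood
        w -= lr * grad                 # one cheap vectorized update per step

    print(w)  # learned weights; predict with sigmoid(X @ w) > 0.5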

(4) Properties of decision trees?

A decision tree makes decisions based on a tree structure, which is very similar to the mechanism humans use when facing a problem. Its characteristic is that an attribute must be chosen for each split, and the attribute with the largest information gain is selected during splitting, defined as follows.

    Ent(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k, \qquad Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)

In splitting, we hope that the samples contained in a branch node belong to the same class, i.e., that the purity of the nodes keeps increasing. Decision trees have the advantages of simple computation and strong interpretability; they are well suited to samples with missing attribute values and can handle irrelevant features, but they overfit easily, so pruning or random forests are needed. Information gain is the information entropy minus the conditional entropy; it represents how much the uncertainty about the class is reduced. The larger the information gain, the greater the reduction in uncertainty, which indicates that the feature is important for classification. Because the information-gain criterion is biased toward attributes with many possible values, the information gain ratio (C4.5) is generally used instead.

    Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}, \qquad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}

The denominator IV(a) can be regarded as the entropy of the attribute itself: the more possible values the attribute has, the larger its entropy.

CART decision trees use the Gini index to select the splitting attribute. Intuitively, Gini(D) reflects the probability that two samples drawn at random from dataset D carry different class labels, so the smaller the Gini index, the higher the purity of D. To prevent overfitting, pruning is generally applied, either pre-pruning or post-pruning, usually guided by a cross-validation set.

For the handling of continuous and missing values: for a continuous attribute a, its n distinct values on D are sorted, and D is split into two subsets according to a split point t; the midpoint of each pair of adjacent values is usually taken as a candidate split point, and the one with the largest information gain is then selected. Unlike a discrete attribute, a continuous attribute used to split the current node can still be used as a splitting attribute in its descendants.
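A minimal NumPy sketch of the splitting criteria above (entropy, information gain, gain ratio, Gini index) plus the midpoint candidate split points for a continuous attribute; the toy label and attribute arrays are illustrative.

    import numpy as np

    def entropy(y):
        # Ent(D) = -sum_k p_k log2 p_k over the class labels y
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gini(y):
        # Gini(D): probability that two random samples carry different labels
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def info_gain(y, a):
        # Gain(D, a) for a discrete attribute column a
        gain = entropy(y)
        for v in np.unique(a):
            mask = a == v
            gain -= mask.mean() * entropy(y[mask])
        return gain

    def gain_ratio(y, a):
        # Gain_ratio(D, a) = Gain(D, a) / IV(a)
        _, counts = np.unique(a, return_counts=True)
        p = counts / counts.sum()
        iv = -np.sum(p * np.log2(p))
        return info_gain(y, a) / iv

    def candidate_splits(a_continuous):
        # midpoints of adjacent sorted values, used as candidate split points t
        v = np.unique(a_continuous)
        return (v[:-1] + v[1:]) / 2.0

    # Illustrative data: 6 samples, one discrete and one continuous attribute
    y = np.array([1, 1, 0, 0, 1, 0])
    a_disc = np.array(["s", "s", "m", "m", "l", "l"])
    a_cont = np.array([0.2, 0.4, 0.5, 0.7, 0.8, 0.9])
    print(info_gain(y, a_disc), gain_ratio(y, a_disc), gini(y), candidate_splits(a_cont))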

(5) Comparison of SVM, LR, and decision trees?

SVM can be used for both classification and regression and can be computed efficiently through kernel functions; LR is simple to implement and very fast to train, but the model is relatively simple; decision trees overfit easily and require pruning. In terms of the objective function, soft-margin SVM uses the hinge loss, LR with L2 regularization corresponds to the cross-entropy loss, and AdaBoost corresponds to the exponential loss. As a result, LR is sensitive to outliers while SVM is not very sensitive to them, because SVM only depends on the support vectors. SVM can map the features into an infinite-dimensional space, which LR cannot; on small datasets SVM is usually somewhat better than LR, but LR can output probabilities while SVM cannot. SVM depends on the scale of the data and requires normalization first, which LR generally does not. For large amounts of data LR is used more widely, and its extension to multi-class problems is more direct. For class imbalance, SVM is usually handled with class weights, i.e., different costs for positive and negative samples in the objective, while LR can use the general methods or simply adjust the final decision threshold. On small datasets with high-dimensional samples, SVM tends to perform better.
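A small NumPy sketch comparing the three losses just mentioned as a function of the margin m = y * f(x), with labels in {-1, +1}; the grid of margin values is illustrative.

    import numpy as np

    m = np.linspace(-2, 2, 9)               # margin m = y * f(x), illustrative grid

    hinge = np.maximum(0.0, 1.0 - m)        # soft-margin SVM
    logistic = np.log(1.0 + np.exp(-m))     # cross-entropy loss of LR with +/-1 labels
    exponential = np.exp(-m)                # AdaBoost

    for row in zip(m, hinge, logistic, exponential):
        print("m=%+.1f  hinge=%.3f  logistic=%.3f  exp=%.3f" % row)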

(6) What is the difference between GBDT and random forests?

Random forests adopt the idea of bagging (bootstrap aggregating): multiple bootstrap sample sets are drawn from the training set, a base learner is trained on each sample set, and the base learners are then combined. On top of bagging decision trees, random forests additionally introduce random attribute selection into the tree-training process: when choosing a splitting attribute, a traditional decision tree picks the optimal attribute from the full attribute set of the current node, whereas a random forest first randomly selects a subset of k attributes for the node and then picks the optimal one from that subset; the parameter k controls how much randomness is introduced.

In addition, GBDT training is based on the idea of boosting: each new tree is fitted to the errors (residuals) of the current ensemble at each iteration, so it is a sequential method, whereas random forests follow the bagging idea and are therefore a parallel method.
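A minimal scikit-learn sketch contrasting the two ensembles (assuming scikit-learn is available): the random forest trains independent trees and can parallelize them with n_jobs, while gradient boosting builds its trees strictly one after another. The dataset and hyperparameters are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Bagging + random attribute selection; trees are independent, so n_jobs can parallelize them
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", n_jobs=-1, random_state=0)

    # Boosting: each tree is fitted to the residuals of the current ensemble, built sequentially
    gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

    print(rf.fit(X, y).score(X, y), gbdt.fit(X, y).score(X, y))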

(7) How do you determine whether a function is convex or non-convex? What is convex optimization?

If, for any x, y in the domain of f and any \theta \in [0, 1],

    f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y),

then the function is convex. The condition above also leads to a more general result:

    f\left(\sum_{i} \theta_i x_i\right) \le \sum_{i} \theta_i f(x_i), \qquad \theta_i \ge 0, \ \sum_{i} \theta_i = 1

If the function is twice differentiable, it is convex when its second derivative is non-negative, or, for a multivariate function, when its Hessian matrix is positive semidefinite. Convex optimization refers to minimizing a convex objective over a convex feasible set; in that setting any local optimum is also a global optimum.

(This may also lead into SVM, the proof that a local optimum of a convex function is also a global optimum, or Jensen's inequality as the expectation form of the formula above.)
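A small NumPy sketch of the second-order check: estimate the Hessian numerically and test whether it is positive semidefinite via its eigenvalues. The test function here is an illustrative convex quadratic.

    import numpy as np

    def numerical_hessian(f, x, eps=1e-5):
        # finite-difference approximation of the Hessian of f at point x
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
                H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                           - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
        return H

    def f(x):
        # illustrative convex quadratic x^T A x with a positive definite A
        A = np.array([[2.0, 0.5], [0.5, 1.0]])
        return x @ A @ x

    H = numerical_hessian(f, np.array([1.0, -1.0]))
    print(np.linalg.eigvalsh(H) >= -1e-6)   # all True -> Hessian PSD at this point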

(8) How do you deal with class imbalance?

Common approaches include undersampling the majority class, oversampling the minority class (e.g., SMOTE), threshold moving (rescaling the decision threshold according to the class prior), and cost-sensitive learning that assigns a higher misclassification cost to the minority class.

(9) Explain the concept of duality.

An optimization problem can be examined from two angles: one is the primal problem and the other is the dual problem. In general, the dual problem gives a lower bound on the optimal value of the primal problem; in the case of strong duality, the optimal value of the primal problem can be obtained from the dual problem. The dual problem is a convex optimization problem and can be solved well. In SVM, the primal problem is transformed into the dual problem to be solved, which further leads to the idea of kernel functions.
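In standard notation, for a primal problem min f_0(x) subject to f_i(x) <= 0 and h_j(x) = 0, the Lagrangian and the dual function are

    L(x, \lambda, \nu) = f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x), \qquad g(\lambda, \nu) = \inf_x L(x, \lambda, \nu)

The dual problem maximizes g(\lambda, \nu) subject to \lambda \ge 0. The dual function g is always concave, and weak duality gives d^* \le p^*; for convex problems, strong duality (d^* = p^*) holds for example under Slater's condition, which is the setting exploited by SVM.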

(10) How do you perform feature selection?

Feature selection is an important data preprocessing step, mainly for two reasons. First, in real tasks we encounter the curse of dimensionality (sample density becomes very sparse); if we can select a subset of the features, this problem is greatly alleviated. Second, removing irrelevant features reduces the difficulty of the learning task and improves the generalization ability of the model. A redundant feature is one whose information can be inferred from other features, but this does not mean a redundant feature is necessarily useless; for example, redundant features can be added in an underfitting situation to increase the complexity of a simple model.

In theory, without domain knowledge as a prior assumption, we could only enumerate all possible subsets, which is obviously infeasible because the number of subsets grows combinatorially. In general, the procedure is divided into two steps: subset search and subset evaluation. Subset search usually uses a greedy strategy, adding or removing candidate features in each round, which gives forward search and backward search respectively, or a combination of the two (bidirectional search). Subset evaluation commonly uses information gain, and for continuous data the midpoints are usually chosen as split points.

Common feature selection methods fall into three families: filter, wrapper, and embedded. Filter methods first select features on the dataset and then train the learner. Wrapper methods use the performance of the final learner directly as the evaluation criterion of a feature subset, typically generating candidate subsets iteratively and updating them with a cross-validation procedure, which is usually computationally expensive. Embedded feature selection integrates feature selection with the training process, so selection happens automatically during training; for example, L1 regularization more easily yields sparse solutions, while L2 regularization mainly makes overfitting harder. L1-regularized objectives can be solved with proximal gradient descent (PGD).
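A minimal scikit-learn sketch of embedded selection via L1 regularization: the Lasso drives some coefficients exactly to zero, and the surviving features are the selected ones. The data and the regularization strength alpha are illustrative.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # 50 features, only 5 of which are actually informative
    X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                           noise=1.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)       # L1 penalty -> sparse coefficients
    selected = np.flatnonzero(lasso.coef_)   # indices of the non-zero coefficients
    print(len(selected), selected)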

(11) Why does overfitting occur, and what methods can prevent or overcome it?

In general, in machine learning the error of a learner on the training set is called the training error or empirical error, and the error on new samples is called the generalization error. Obviously we want a learner with a small generalization error, but since we do not know the new samples in advance, we usually try to minimize the empirical error. However, when the learner fits the training samples too well, it may treat characteristics of the training samples themselves as general properties of the underlying distribution, which degrades the generalization performance; this is called overfitting. Conversely, underfitting generally means that the general properties of the training samples have not been learned well, and the error on the training set is still large.

Underfitting: generally speaking, underfitting is easier to address, for example by increasing the complexity of the model, adding branches in a decision tree, or increasing the number of training epochs of a neural network.

Overfitting: it is generally believed that overfitting cannot be completely avoided, because the problems faced by machine learning are usually NP-hard while an effective solver must run in polynomial time, so some generalization ability has to be sacrificed. Typical remedies include increasing the number of samples, reducing the sample dimensionality, reducing the model complexity, using prior knowledge (L1 and L2 regularization), using cross-validation, early stopping, and so on.
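A minimal sketch of early stopping, one of the remedies listed above: training stops once the validation loss has not improved for a number of rounds. The train_one_epoch and validation_loss callables are hypothetical placeholders for whatever model is being trained.

    def early_stopping_train(train_one_epoch, validation_loss, max_epochs=100, patience=5):
        # stop when the validation loss has not improved for `patience` epochs
        best_loss = float("inf")
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch()              # hypothetical: one pass over the training set
            val_loss = validation_loss()   # hypothetical: loss on a held-out set
            if val_loss < best_loss:
                best_loss = val_loss
                epochs_without_improvement = 0    # improvement: reset the counter
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                         # generalization stopped improving
        return best_loss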

(12) What are bias and variance?

The generalization error can be decomposed into the bias squared plus the variance plus the noise. Bias measures the deviation between the expected prediction of the learning algorithm and the true result and describes the fitting ability of the algorithm itself; variance measures the change in learning performance caused by perturbations of a training set of the same size and describes the impact of data disturbance; noise expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task and describes the difficulty of the problem itself. The more thoroughly the model is trained, the smaller the bias and the larger the variance, so the generalization error generally has a minimum somewhere in between. Large bias with small variance is generally called underfitting, while small bias with large variance is called overfitting.

    E(f; D) = bias^2(x) + var(x) + \varepsilon^2

    bias^2(x) = (\bar{f}(x) - y)^2, \quad var(x) = E_D\big[(f(x; D) - \bar{f}(x))^2\big], \quad \varepsilon^2 = E_D\big[(y_D - y)^2\big], \quad \bar{f}(x) = E_D\big[f(x; D)\big]

(13) What is the principle behind neural networks, and how are they trained?

Neural networks have grown into a very large field. Generally speaking, a neural network is considered to be composed of individual neurons and the connections between them, and different structures give different networks. The most common one is the multi-layer feedforward neural network; apart from the input and output layers, the number of hidden layers in between determines the depth of the network. The BP (backpropagation) algorithm is the best-known algorithm for training neural networks, and its essence is gradient descent plus the chain rule.
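A minimal NumPy sketch of backpropagation for a one-hidden-layer network on a toy XOR problem, just to make the "gradient descent plus chain rule" essence concrete; the layer sizes, learning rate, and number of epochs are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)           # XOR targets

    W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)  # hidden layer
    W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)  # output layer
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    lr = 1.0

    for _ in range(5000):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: chain rule on the squared-error loss
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # gradient descent updates
        W2 -= lr * h.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)

    print(out.round(3).ravel())  # should approach [0, 1, 1, 0]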

(14) Describe convolutional neural networks. How do they differ from DBNs?

The defining feature of a convolutional neural network is the convolution kernel. CNNs use weight sharing, and different feature representations are obtained by repeatedly applying convolutions. The subsampling layer, also called the pooling layer, performs subsampling based on the principle of local correlation, reducing the amount of data while keeping the useful information. A DBN is a deep belief network: every layer is an RBM, and the whole network can be regarded as a stack of RBMs. It is usually trained with unsupervised layer-by-layer training, starting from the first layer, with each layer trained on the output of the previous layer; after the layer-wise training is finished, the whole network is fine-tuned with the BP algorithm.
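A minimal NumPy sketch of the two operations just described: a "valid" 2-D convolution with a single shared kernel (weight sharing) followed by 2x2 max pooling. The input image and kernel values are illustrative.

    import numpy as np

    def conv2d_valid(image, kernel):
        # "valid" 2-D convolution: the same kernel (shared weights) slides over the image
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool2x2(feature_map):
        # 2x2 max pooling: keep the strongest local response, reduce the data volume
        h, w = feature_map.shape
        h, w = h - h % 2, w - w % 2
        fm = feature_map[:h, :w]
        return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
    kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 shared kernel
    print(max_pool2x2(conv2d_valid(image, kernel)))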

(15) Which models are solved with the EM algorithm, and why not use Newton's method or gradient descent?

Models typically solved with the EM algorithm include GMMs and collaborative filtering; k-means is in fact also an instance of EM. The EM algorithm is guaranteed to converge, but it may converge to a local optimum. Because the number of terms in the summation grows exponentially with the number of latent variables, gradient computation becomes troublesome.

(16) Use the EM algorithm to derive and explain k-means.

The k-means algorithm is a special case of Gaussian mixture clustering in which the mixture components have equal variance and each sample is assigned to exactly one mixture component. Note that k-means requires normalization before running, otherwise the distance computation may be dominated by dimensions on which the samples take overly large values. The cluster to which each sample belongs can be viewed as a latent variable: in the E-step we fix the center of each cluster and optimize the objective by assigning every sample to its nearest cluster; in the M-step we re-estimate the center of each cluster, which can be done by differentiating the objective, and the new center turns out to be the mean of the samples in the cluster.
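A minimal NumPy sketch of exactly these two alternating steps (hard assignment in the E-step, mean update in the M-step); the toy data and the value of k are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
    k = 2
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers

    for _ in range(20):
        # E-step: assign each sample to its nearest center (the latent variable)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: each center becomes the mean of the samples assigned to it
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    print(centers)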

(17) Which clustering algorithms have you used? Explain density-based clustering.

Commonly used clustering algorithms include k-means, hierarchical (agglomerative) clustering, Gaussian mixture clustering, and density-based clustering. Density-based clustering (e.g., DBSCAN) characterizes clusters by how densely the samples are packed: a sample is a core object if its epsilon-neighborhood contains at least MinPts samples, clusters are grown from core objects through density-reachability, and samples that belong to no cluster are treated as noise. Such methods can find clusters of arbitrary shape and do not require the number of clusters to be specified in advance.

(19) What distance metrics are used in clustering algorithms?

Distance metrics in clustering generally use the Minkowski distance, which corresponds to different distances for different values of p: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and p = inf the Chebyshev distance. Other options include the Jaccard distance, the power distance (a more general form of the Minkowski distance), cosine similarity, weighted distances, and the Mahalanobis distance (a kind of weighted distance). To serve as a distance metric, a function must satisfy non-negativity, identity, symmetry, and the triangle inequality; the Minkowski distance satisfies these properties when p >= 1. For unordered discrete attributes such as {airplane, train, ship}, the distance cannot be computed directly on the attribute values; the VDM (Value Difference Metric) can be used instead, and the VDM distance between two discrete values a and b of attribute u is defined as

    VDM_p(a, b) = \sum_{i=1}^{k} \left| \frac{m_{u,a,i}}{m_{u,a}} - \frac{m_{u,b,i}}{m_{u,b}} \right|^p

where m_{u,a} is the number of samples that take value a on attribute u, m_{u,a,i} is the number of such samples within cluster i, and k is the number of clusters.
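A small NumPy sketch of the Minkowski family mentioned above, showing how p = 1, 2, and infinity recover the Manhattan, Euclidean, and Chebyshev distances; the two vectors are illustrative.

    import numpy as np

    def minkowski(x, y, p):
        # Minkowski distance; p may be a positive number or np.inf (Chebyshev)
        diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        if np.isinf(p):
            return diff.max()
        return (diff ** p).sum() ** (1.0 / p)

    x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(minkowski(x, y, 1))       # Manhattan: 5.0
    print(minkowski(x, y, 2))       # Euclidean: sqrt(13)
    print(minkowski(x, y, np.inf))  # Chebyshev: 3.0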

(20) Explain Bayes' formula and naive Bayes classification.

Bayes' formula expresses the posterior probability of class c given sample x in terms of the prior and the class-conditional likelihood:

    P(c \mid x) = \frac{P(c)\, P(x \mid c)}{P(x)}

Naive Bayes further assumes that the attributes are conditionally independent given the class, so the classifier becomes

    h(x) = \arg\max_{c} P(c) \prod_{i=1}^{d} P(x_i \mid c)

with P(c) and P(x_i | c) estimated from frequencies in the training set. To avoid estimating a probability as zero when some attribute value never appears together with a class in the training set, the Laplacian correction (smoothing) is applied:

    \hat{P}(c) = \frac{|D_c| + 1}{|D| + N}, \qquad \hat{P}(x_i \mid c) = \frac{|D_{c, x_i}| + 1}{|D_c| + N_i}

where N is the number of classes, |D_c| the number of training samples of class c, |D_{c, x_i}| the number of class-c samples taking value x_i on the i-th attribute, and N_i the number of possible values of the i-th attribute.

This both keeps the probabilities normalized and avoids the zero-probability problem described above.

(22) Explain the roles of L1 and L2 regularization.

Both add a penalty on the model weights to the objective and can be interpreted as priors on the parameters. L1 regularization (a Laplace prior) tends to drive some weights exactly to zero, so it yields sparse solutions and performs implicit feature selection; L2 regularization (a Gaussian prior) shrinks the weights smoothly toward zero, making the model less sensitive to individual features and harder to overfit, consistent with the feature-selection discussion above.

(23) What is TF-IDF?

TF stands for term frequency and IDF for inverse document frequency. The algorithm can be used to extract keywords from a document. Generally, words that appear many times in an article are considered keywords, and term frequency captures this; however, some of them are stop words such as "the", "is", and "have" that appear in large numbers and must be filtered out first. Suppose that after filtering, the words "China", "bee", and "farming" have almost the same term frequency, but "China" is much more likely to appear in other articles than the other two; we should therefore consider the latter two words more representative of the article's topic. IDF captures exactly this information, and computing it requires a corpus: the lower the probability that a word appears in the corpus, the larger its IDF should be. Typically, TF is computed as (number of occurrences of the word in the article / total number of words in the article), which removes the effect of long articles simply containing more occurrences, and IDF is computed as log(total number of articles in the corpus / (number of articles containing the word + 1)). Multiplying the two gives the word's TF-IDF. Traditional TF-IDF does not take the position of a word into account; this can be corrected by assigning different weights to different positions. Note that such corrections are effective precisely because humans have observed a large amount of information and thereby provided a prior estimate, which is blended into the algorithm and makes it more effective.
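A minimal pure-Python sketch of the TF and IDF formulas just given, applied to a tiny illustrative corpus of already tokenized and stop-word-filtered documents.

    import math

    corpus = [
        ["china", "bee", "farming", "bee", "farming"],
        ["china", "economy", "growth"],
        ["china", "policy", "trade"],
    ]

    def tf(word, doc):
        # term frequency: occurrences of the word / total words in the document
        return doc.count(word) / len(doc)

    def idf(word, docs):
        # inverse document frequency: log(#docs / (#docs containing the word + 1))
        containing = sum(1 for d in docs if word in d)
        return math.log(len(docs) / (containing + 1))

    doc = corpus[0]
    scores = {w: tf(w, doc) * idf(w, corpus) for w in set(doc)}
    # "bee" and "farming" outrank "china", which appears in every document
    print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))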

(24) What is cosine distance in text processing, and what is it used for?

Cosine distance is a measure of the angle between two vectors, with values between -1 and 1: a value of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. TF-IDF and cosine distance can be combined to find articles with similar content: for example, first use TF-IDF to find the keywords of two articles, then take the top k keywords (10 to 20) of each article, count the term frequencies of these keywords, generate the term-frequency vectors of the two articles, and finally use the cosine measure to compute their similarity.
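A minimal NumPy sketch of the last step of that pipeline: the cosine similarity between two term-frequency vectors; the vectors are illustrative.

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = a.b / (|a| |b|); 1 = same direction, 0 = orthogonal, -1 = opposite
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Illustrative term-frequency vectors over a shared keyword vocabulary
    doc1 = [3, 0, 2, 1, 0]
    doc2 = [2, 1, 2, 0, 0]
    print(cosine_similarity(doc1, doc2))   # close to 1 -> the articles are similar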

Original: https://www.cnblogs.com/mfryf/p/15293514.html
Author: 知识天地
Title: 机器学习面试常见问题
