# 100 Fundamental Questions for AI Engineers

100 basic AI interview questions

1. What is the difference between covariance and correlation?

Correlation is the standardized form of covariance. Covariance values are hard to compare on their own. For example, if we compute the covariance of salary (in dollars) and age (in years), the two variables have different units, so the resulting covariances are on different scales and cannot be compared.

To solve this problem, we compute the correlation instead, which yields a value between -1 and 1 regardless of the variables' units.
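
Covariance's unit-dependence is easy to demonstrate with a small pure-Python sketch (the salary/age numbers are toy values assumed for illustration):

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

salary = [5000.0, 6000.0, 7000.0, 8000.0]   # dollars
age = [25.0, 30.0, 35.0, 40.0]              # years
salary_k = [s / 1000 for s in salary]       # same data in thousands of dollars

# Covariance changes with the unit of measurement; correlation does not.
print(covariance(salary, age), covariance(salary_k, age))
print(correlation(salary, age), correlation(salary_k, age))
```

Rescaling salary by 1000 rescales the covariance by the same factor, while the correlation is unchanged.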

2. How does xgboost find the optimal feature? Does it sample with or without replacement?

During training, xgboost assigns a gain score to each candidate feature, and the feature with the largest gain is selected as the split criterion. The model thereby records each feature's importance during training: the number of times a feature appears in internal nodes from root to leaf serves as its importance ranking. xgboost is a boosting ensemble method, so samples are drawn without replacement, and no sample is repeated within a round. Separately, xgboost supports row subsampling, so each round may use only part of the samples to reduce overfitting. It also supports column subsampling, randomly selecting a percentage of the features in each round, which both speeds up computation and reduces overfitting.

3. Discuss discriminative models versus generative models.

A discriminative model can be derived from a generative model, but a generative model cannot be derived from a discriminative model: a generative model learns the joint distribution P(x, y), from which the conditional P(y|x) follows by Bayes' rule, whereas a discriminative model learns P(y|x) directly, so the joint distribution cannot be recovered from it.

4. Differences between linear and non-linear classifiers, and their pros and cons

A linear classifier is well interpretable and computationally cheap; its weakness is comparatively limited fitting power.

A non-linear classifier has strong fitting power; its weaknesses are that it overfits easily when data is scarce, has higher computational complexity, and is less interpretable.

SVM can be either, depending on the kernel (linear kernel vs. Gaussian kernel).

5. What prior distributions do the L1 and L2 regularization terms correspond to?

L1 regularization corresponds to a Laplace prior on the weights; L2 regularization corresponds to a Gaussian prior.

A prior is the starting point of the optimization. Its advantage is good generalization on smaller datasets, provided the prior distribution is close to the true distribution. From an information-theoretic point of view, adding correct prior information to a system will certainly improve its performance.

6. Briefly introduce logistic regression

In practical work we may encounter classification problems such as:

- predicting whether a user will click on a specific product;

- determining the gender of a user;

- predicting whether a user will buy a given category;

- judging whether a comment is positive or negative;

- spell checking.
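
As a sketch of how such a click-prediction model works, here is a minimal batch-gradient-descent logistic regression in pure Python (the one-feature toy data is assumed for illustration; real work would use a library such as scikit-learn):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Minimal batch-gradient-descent logistic regression (weights + bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy "will the user click?" data: one feature, clicks when the feature is large.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)

def predict(x):
    return sigmoid(w[0] * x + b)
```

The learned boundary sits between the two groups, so small feature values map below 0.5 and large ones above it.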

9. Why is naive Bayes so "naive"?

Because it assumes that all features in the dataset are equally important and mutually independent. As we know, this assumption rarely holds in the real world, so naive Bayes really is "naive".

10. Briefly compare the differences between pLSA and LDA

11. Explain the EM algorithm in detail

12. How is K chosen in KNN?

The choice of K has a major impact on the result of the K-nearest-neighbor algorithm. As Dr. Li Hang's book "Statistical Learning Methods" explains:

If K is too small, the model is sensitive to noise and prone to overfitting. If K = N, the model is useless: whatever the input instance is, it simply predicts the majority class of the training set. Such a model is far too simple and ignores the large amount of useful information in the training instances.

13. Methods to prevent overfitting

Overfitting arises because the algorithm's learning capacity is too strong; because some assumptions (such as samples being independent and identically distributed) may not hold; or because there are too few training samples to estimate the distribution of the whole space.

1. Early stopping: stop training when the model's performance no longer improves significantly over several iterations.

2. Data augmentation: collect more data, add random noise to existing data, or resample.

3. Regularization, which limits the complexity of the model.

4. Cross-validation.

5. Feature selection / dimensionality reduction.

6. Holding out a validation set is the most basic defense against overfitting: the model we finally keep should perform well on the validation set, not just the training set.

14. In machine learning, why do we often normalize the data?

Generally speaking, in applied machine learning most of the time is spent on feature processing, and a key step there is normalizing the feature data.

Why normalize? Two reasons, as summarized on Wikipedia:

1) Normalization speeds up the convergence of gradient descent toward the optimum.

2) Normalization may improve accuracy.

Let's briefly expand on these two points.
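
The two standard transforms are easy to write down; a minimal sketch (the height data is a toy example assumed for illustration):

```python
def min_max_scale(xs):
    """Interval scaling: rescale values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardization: zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / std for x in xs]

heights = [150.0, 160.0, 170.0, 180.0]
scaled = min_max_scale(heights)       # values in [0, 1]
standardized = z_score(heights)       # mean 0, variance 1
print(scaled)
print(standardized)
```

After either transform, all features live on a comparable scale, which is what makes the gradient-descent contours rounder and the steps better behaved.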

15. What is the method of least squares?

In everyday speech we say: "generally speaking", "on average". For example, on average non-smokers are healthier than smokers; the word "average" is needed because there are always exceptions — someone may smoke yet, thanks to regular exercise, be healthier than his non-smoking friends. The simplest example of a least-squares estimate is the arithmetic mean.

The method of least squares (also called the least-squares method) is a mathematical optimization technique. It finds the best-fitting function for the data by minimizing the sum of squared errors: the unknown parameters are easily obtained by requiring that the sum of squared differences between the fitted values and the observed data be as small as possible.
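
That the arithmetic mean is itself a least-squares estimate can be checked directly: among all constant fits c, the mean minimizes the sum of squared errors (toy data assumed):

```python
def sse(xs, c):
    """Sum of squared errors of the constant fit c."""
    return sum((x - c) ** 2 for x in xs)

data = [2.0, 3.0, 5.0, 10.0]
mean = sum(data) / len(data)   # the arithmetic mean, 5.0

# Nearby candidate constants all do worse than the mean.
for c in [mean - 1, mean, mean + 1]:
    print(c, sse(data, c))
```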

16. Does gradient descent always move in the direction of fastest descent?

The negative gradient is the direction of steepest descent only locally, at the current point; over a whole trajectory it is not necessarily the fastest route to the minimum. Going downhill is the usual analogy: suppose you are at the top of a mountain and must reach the lake at the foot of the mountain (the lowest point of the valley), but you are blindfolded and cannot tell where you are heading — you can only feel the slope under your feet.

17. Briefly state Bayes' theorem

Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B). It converts a prior probability P(A) into a posterior probability P(A|B) after observing evidence B.

18. Why can decision trees and xgboost handle missing values, while some models (e.g. SVM) are sensitive to them?

First, let us address the confusion from two angles:

19. What is the difference between standardization and normalization?

20. How does a random forest handle missing values?

21. How does a random forest evaluate feature importance?

1) Decrease GINI: for regression problems, the criterion is argmax(Var − VarLeft − VarRight), i.e. the variance Var of the current node's training set minus the variance VarLeft of the left child and the variance VarRight of the right child.

2) Decrease Accuracy: for a tree Tb(x), the OOB samples give a test error err1; then randomly permute column j of the OOB samples (keep the other columns fixed, shuffle column j up and down) to get err2. The importance of variable j can then be measured by the difference between the two errors. The idea is that if variable j is important enough, perturbing it will greatly increase the test error; conversely, if perturbing it leaves the test error unchanged, the variable is not that important.

22. How can K-means be optimized?

23. How are the initial cluster centers chosen in K-means?

The basic idea of the k-means++ seeding algorithm is that the initial cluster centers should be as far away from each other as possible.

1. Randomly choose one point from the input data set as the first cluster center.
2. For every point x in the data set, compute D(x), its distance to the nearest already-chosen cluster center.
3. Choose a new data point as the next cluster center, with points of larger D(x) more likely to be chosen.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
5. Run standard k-means using these k initial centers.
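
The five steps above can be sketched in pure Python; note that the canonical k-means++ rule samples proportionally to D(x)² (1-D toy points and a fixed seed are assumed for illustration):

```python
import random

def kmeans_pp_init(points, k, rng=random.Random(0)):
    """k-means++ seeding: each new center is sampled with probability
    proportional to D(x)^2, the squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]          # step 1: random first center
    while len(centers) < k:
        # step 2: squared distance from each point to its nearest chosen center
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        # step 3: sample a new center proportionally to D(x)^2
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

# 1-D toy data with two well-separated groups
points = [0.0, 0.01, 0.02, 100.0, 100.01, 100.02]
centers = kmeans_pp_init(points, 2)
print(centers)
```

Wherever the first center falls, almost all of the D(x)² mass sits in the other group, so the two seeds end up in different clusters, which is exactly the point of the method.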

24. Explain the concept of duality.

25. How do you perform feature selection?

1. Remove features with low variance.

2. Regularization. L1 regularization produces sparse models; L2 regularization behaves more stably, since useful features tend to get non-zero coefficients.

3. Random forests. For classification the split criterion is usually Gini impurity or information gain; for regression it is usually variance or least-squares fit. Tedious steps such as feature engineering and hyperparameter tuning are generally unnecessary. Its two main problems are:

   1) important features may receive a low score (the correlated-features problem);

   2) the method favors features with many distinct categories (the bias problem).

4. Stability selection.

26. How do you measure the quality of a classifier?

Several commonly used metrics:

F1 score: 2/F1 = 1/recall + 1/precision

ROC curve: ROC space is the plane of a two-dimensional coordinate system with the false positive rate (FPR) on the X axis and the true positive rate (TPR) on the Y axis, where TPR = TP / P = recall and FPR = FP / N.
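
These metrics follow directly from the confusion-matrix counts; a minimal sketch (the counts are toy values assumed for illustration):

```python
def classifier_metrics(tp, fp, fn, tn):
    """Precision, recall/TPR, FPR and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)      # = TPR = TP / P
    fpr = fp / (fp + tn)         # = FP / N
    # F1 is the harmonic mean: 2/F1 = 1/recall + 1/precision
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f1

p, r, fpr, f1 = classifier_metrics(tp=8, fp=2, fn=2, tn=88)
print(p, r, fpr, f1)
```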

27. What is the physical meaning of AUC in machine learning and statistics?

AUC is one of the common metrics for evaluating a model. Its physical meaning: AUC equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative sample by the model. This answer is adapted from https://www.zhihu.com/question/39840928

1) What is AUC? AUC is a model evaluation metric that applies only to binary classification models. For binary classification there are many other metrics, such as logloss, accuracy and precision. If you follow data-mining competitions such as Kaggle, you will find that AUC and logloss are essentially the most common evaluation metrics.
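
The ranking interpretation of AUC — the probability that a random positive sample is scored above a random negative one — can be computed directly from that definition (toy scores assumed):

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

pos = [0.9, 0.8, 0.4]   # model scores of positive samples
neg = [0.5, 0.3, 0.2]   # model scores of negative samples
print(auc(pos, neg))
```

Only one of the nine pairs is mis-ranked (0.4 vs. 0.5), so the AUC here is 8/9. This pairwise form is O(n²); production implementations sort the scores instead.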

28. Data preprocessing.

1. Missing values — fill with fillna:

   i. discrete: None;

   ii. continuous: the mean;

   iii. if a column has too many missing values, drop the column entirely.

2. Continuous values: discretization. Some models (such as decision trees) need discrete values.

3. Binarize quantitative features. The core is to set a threshold: values above the threshold become 1, values at or below it become 0 (as in image operations).

4. Pearson correlation coefficient: remove highly correlated columns.

29. Observing the gain: the larger alpha and gamma are, the smaller the gain?

xgboost's criterion for finding a split point is to maximize gain. Because the traditional greedy method of enumerating every possible split point of every feature is too slow, xgboost implements an approximate algorithm: roughly, it lists a few split-point candidates based on percentiles, computes the gain for each candidate, and takes the candidate with the maximum gain as the best split.

In the gain formula, the first term is the weight score of the left child after the split, the second is the right child's, the third is the score of the unsplit node, and the last is the complexity penalty for introducing a new node.

https://zhidao.baidu.com/question/2121727290086699747.html?fr=iks&word=xgboost+lamda&ie=gbk

lambda [default 1]: L2 regularization term on the weights (analogous to ridge regression).

alpha [default 1]: L1 regularization term on the weights (analogous to lasso regression); applicable in very high-dimensional settings, where it can make the algorithm faster.

gamma [default 0]: a node is split only if the split decreases the loss function; gamma specifies the minimum loss reduction required to make a split. The larger this parameter, the more conservative the algorithm.

30. What causes the vanishing gradient problem?

See "Yes you should understand backprop" by Andrej Karpathy, and "How does the ReLU solve the vanishing gradient problem?"

Vanishing gradients make the weights update slowly and the model harder to train. One cause is that many activation functions squeeze their output into a very small interval, so that over most of the domain, at both ends of the activation function, the gradient is 0 and learning stops.
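
The saturation effect is easy to see with the sigmoid, whose derivative peaks at 0.25 and vanishes in both tails:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1 - s)

# Near zero the gradient is healthy; far from zero it vanishes.
print(sigmoid_grad(0.0))
print(sigmoid_grad(10.0))
```

Chaining many such layers multiplies these small factors together, which is why deep sigmoid networks stop learning; ReLU keeps a gradient of 1 over its positive half, avoiding the squeeze.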

31. What exactly is feature engineering?

First, what do most machine learning practitioners actually do at their companies? Not mathematical derivations, nor inventing sophisticated algorithms, but feature engineering.

34. The class-imbalance problem

This is mainly caused by an uneven class distribution in the data. Solutions include:

- sampling: oversample the minority class with added noise, and downsample the majority class;

- data generation: use known samples to synthesize new samples;

- adopting an algorithm that is insensitive to unbalanced data sets;

- incorporating the prior distribution of the data into the model design.

35. When there are more features than data points, what classifier should you choose?

A linear classifier: when the dimension is high, the data is generally sparse in the feature space and is likely to be linearly separable.

36. What are the common classification algorithms? What are their respective pros and cons?

Naive Bayes

Advantages:

1) Few parameters to estimate; insensitive to missing data.

2) A solid mathematical foundation and stable classification efficiency.

Disadvantages:

1) Assumes attributes are mutually independent, which often does not hold (someone may like tomatoes and eggs but dislike tomato-and-egg stir-fry).

2) Requires the prior probabilities to be known.

3) The classification decision carries an error rate.

Decision tree

Advantages:

1) Requires no domain knowledge or parameter assumptions.

2) Suitable for high-dimensional data.

3) Simple and easy to understand.

4) Processes large amounts of data in a short time and produces feasible, reasonably good results.

5) Can handle both numerical and categorical attributes.

Disadvantages:

1) When class sample sizes are unbalanced, information gain is biased toward features with more distinct values.

2) Prone to overfitting.

3) Ignores correlations between attributes.

4) Does not support online learning.

SVM

Advantages:

1) Works for machine learning with small samples.

2) Improves generalization performance.

3) Can solve high-dimensional, non-linear problems; still popular for ultra-high-dimensional text classification.

4) Avoids the problems of neural-network structure selection and local minima.

Disadvantages:

1) Sensitive to missing data.

2) High memory consumption; hard to interpret.

3) Running and tuning it is somewhat tedious.

K-nearest neighbors

Advantages:

1) Simple idea, mature theory; usable for both classification and regression.

2) Usable for non-linear classification.

3) Training time complexity is O(n).

4) High accuracy, no assumptions about the data, insensitive to outliers.

Disadvantages:

1) Heavy computation.

2) Misjudges when the class distribution is unbalanced.

3) Requires a lot of memory.

4) Weak interpretability of the output.

Logistic regression

Advantages:

1) Fast.

2) Simple and easy to understand; the weight of each feature is directly visible.

3) The model easily absorbs new data through updates.

4) Provides a probability framework, so the classification threshold can be adjusted dynamically.

Disadvantage: feature processing is complex; normalization and extra feature engineering are needed.

Neural network

Advantages:

1) High classification accuracy.

2) Strong parallel-processing capability.

3) Strong distributed storage and learning capability.

4) Robust; not easily affected by noise.

Disadvantages:

1) Requires many parameters (network topology, thresholds).

2) Results are hard to interpret.

3) Training time is long.

Adaboost

Advantages:

3) When simple base classifiers are used, the computed results are understandable, and the weak classifiers are extremely simple to construct.

4) Simple; no feature screening needed.

5) Little worry about overfitting.

37. What are the common supervised learning algorithms?

38. Describe common optimization algorithms and their pros and cons.

Tip: when answering an interviewer's question, answer from the broad picture first rather than diving straight into a narrow technical detail — it is easy to talk yourself into a corner. Keep the long story short.

1) Stochastic gradient descent. Advantage: can to some extent escape local optima. Disadvantage: slower convergence.

2) Batch gradient descent. Advantage: faster convergence. Disadvantage: easily trapped in local optima.

3) Mini-batch gradient descent: a compromise combining the advantages of stochastic and batch gradient descent.

4) Newton's method: requires computing the Hessian matrix at each iteration, which is difficult in high dimensions.

5) Quasi-Newton methods: designed to avoid computing the Hessian of Newton's method at every iteration; they work by approximating the Hessian instead.

39. What methods are there for normalizing feature vectors?

40. What are the differences and connections between random forests (RF) and GBDT?

1) Similarity: both consist of many trees, and the final result is jointly determined by all the trees.

2) Differences:

a. The trees in a random forest can be classification or regression trees, whereas GBDT consists only of regression trees.

b. The trees in a random forest can be built in parallel, whereas GBDT builds them sequentially.

c. A random forest's result is a majority vote, whereas GBDT sums the outputs of all its trees.

d. Random forests are insensitive to outliers, whereas GBDT is rather sensitive to them.

e. Random forests reduce the model's variance, whereas GBDT reduces its bias.

f. Random forests do not require feature normalization, whereas GBDT does.

42. Compare the EM algorithm, HMM and CRF

Strictly speaking these three do not belong together, but they are related to each other, so we discuss them jointly. Focus on the idea behind each algorithm.

（1）EM algorithm

The EM algorithm is used for maximum-likelihood or maximum-a-posteriori estimation in models with latent variables. It consists of two steps:

E step: compute the expectation;

M step: maximize.

（2）HMM

Prediction problem: given the model and an observation sequence, solve for the corresponding state sequence. Solved by the approximate (greedy) algorithm or by the Viterbi algorithm (dynamic programming for the optimal path).

（3）Conditional random fields (CRF)

（4）HMM vs. CRF

The fundamental difference lies in their basic ideas: one is a generative model, the other a discriminative model, which leads to their different solution methods.

43. Why can a kernelized SVM classify non-linear problems?

44. Describe the commonly used kernel functions and the conditions on kernels.

Linear kernel: mainly used in the linearly separable case. The feature space and the input space have the same dimension; it has few parameters and is fast, and for linearly separable data the classification effect is very good. We usually try a linear kernel first to see how it performs, and if that fails, switch to another kernel such as a polynomial kernel.

Gaussian (RBF) kernel: a kernel with strong locality that can map a sample into a higher-dimensional space. It is the most widely used kernel, performing well for both large and small samples. Moreover, it has fewer parameters than the polynomial kernel, so in most cases, when we do not know which kernel to use, the Gaussian kernel is the first choice.

Sigmoid kernel

Therefore, when choosing a kernel function: if we have prior knowledge about the data, use the prior to select a kernel that matches the data distribution; if not, usually use cross-validation to try different kernels — the kernel with the lowest error performs best — or combine multiple kernels into a mixed kernel.

45. Explain concretely the difference between Boosting and Bagging

（1）Bagging: random forests

Random forests fix the decision tree's tendency to overfit, mainly through two operations:

1) Bootstrap: draw samples from the bag with replacement.

2) At each split, randomly draw a certain number of features (usually sqrt(n)).

For regression problems, simply average the outputs of the individual trees.

Main hyperparameters:

1. maximum tree depth

2. number of trees

3. minimum number of samples at a node

4. number of features per split (sqrt(n))

OOB (out-of-bag): the samples not drawn for a given tree serve as its test samples, and the resulting error statistic gives an error rate; this can be computed in parallel.

Random forests need no feature selection, can summarize feature importance, and can handle missing data without any additional design. On regression, however, the output on the test set cannot be continuous beyond what the trees have seen.

（3）Boosting: GBDT

（4）Boosting: Xgboost

This tool has the following main features:

- supports linear classifiers as base learners;

- the loss function can be customized, and second-order derivatives can be used;

- supports parallelism in certain places: only in the tree-building stage, where each node can search for its split feature in parallel.

46. Logistic regression questions

（1）You must be able to derive the formulas.

（2）Basic concepts of logistic regression

（3）L1-norm and L2-norm

（4）LR vs. SVM

Second, both are linear models.

（5）LR vs. random forests

（6）Common optimization methods

Newton's method and quasi-Newton methods:

47. What is collinearity, and how is it related to overfitting?

Collinearity: in multivariate linear regression, regression estimates become inaccurate because of high correlation between variables. Collinearity leads to redundancy and overfitting.

Solution: remove the correlated variables, or add weight regularization.

48. What feature-selection engineering methods are there in machine learning?

1 What is feature engineering?

2 Data preprocessing

2.1 Removing dimensional effects

2.1.1 Standardization

2.1.2 Interval scaling

2.1.3 The difference between standardization and normalization

2.2 Binarizing quantitative features

2.3 Dummy encoding of qualitative features

2.4 Missing-value imputation

2.5 Data transformation

2.6 Review

3 Feature selection

3.1 Filter

3.1.1 Variance-threshold selection

3.1.2 Correlation-coefficient method

3.1.3 Chi-squared test

3.1.4 Mutual-information method

3.2 Wrapper

3.2.1 Recursive feature elimination

3.3 Embedded

3.3.1 Penalty-based feature selection

3.3.2 Tree-model-based feature selection

3.4 Review

4 Dimensionality reduction

4.1 Principal component analysis (PCA)

4.2 Linear discriminant analysis (LDA)

4.3 Review

5 Summary

6 References

49. Use Bayesian probability to explain the principle of Dropout

Dropout aims to approximate, at low cost, the process of averaging over an exponential number of neural networks. Dropout training is not quite the same as Bagging training: in Bagging, all the models are independent, whereas under Dropout the sub-networks share parameters.

50. For extremely low-dimensional features, should you choose a linear or a non-linear classifier?

A non-linear classifier: in a low-dimensional space many samples may crowd together, making the data linearly inseparable.

1. If the number of features is large, comparable to the number of samples, choose LR or a linear-kernel SVM.

2. If the number of features is small and the sample size is moderate — neither large nor small — choose an SVM with a Gaussian kernel.

3. If the number of features is small but the sample size is very large, manually add features to reduce the problem to the first case.

51. How do you handle missing values in feature vectors?

When there are fewer missing values — say less than 10% of the remaining data — we can handle them in many ways:

1) Treat NaN directly as a feature value, e.g. represented by 0;

2) Fill with the mean;

3) Predict the missing values with an algorithm such as a random forest.

52. Compare SVM, LR and decision trees.

53. What is an ill-conditioned problem?

If, after the model is trained, a small modification of the test samples produces very different results, that is an ill-conditioned problem: the model's predictive ability on unknown data is poor, i.e. the generalization error is large.

54. Briefly describe the procedure of the KNN nearest-neighbor classification algorithm.

1. Compute the distance between the test sample and every sample point in the training set (common distance metrics include Euclidean distance and Mahalanobis distance);

2. Sort all the distance values;

3. Select the k samples with the smallest distances;

4. Vote using the labels of these k samples to obtain the final class.
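
The four steps above can be sketched directly (the 2-D points are toy data assumed for illustration):

```python
import math
from collections import Counter

def knn_classify(test_point, train_points, train_labels, k=3):
    """KNN: compute distances, sort, take the k nearest, majority-vote."""
    # step 1: Euclidean distance to every training point
    dists = [
        (math.dist(test_point, p), label)
        for p, label in zip(train_points, train_labels)
    ]
    # steps 2-3: sort and keep the k smallest
    dists.sort(key=lambda t: t[0])
    nearest = [label for _, label in dists[:k]]
    # step 4: majority vote
    return Counter(nearest).most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify((0.5, 0.5), train, labels))
```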

55. What are the common families of clustering methods? List representative algorithms.

1. Partition-based clustering: K-means, k-medoids, CLARANS.

2. Hierarchical clustering: AGNES (bottom-up), DIANA (top-down).

3. Density-based clustering: DBSCAN, OPTICS, BIRCH (CF-Tree), CURE.

4. Grid-based methods: STING, WaveCluster.

5. Model-based clustering: EM, SOM, COBWEB.

56. What are bias and variance?

57. Which models are fitted with the EM algorithm, and why not use Newton's method or gradient descent instead?

58. How does xgboost score features?

```python
from matplotlib import pyplot
from xgboost import plot_importance

# `model` is assumed to be an already-trained xgboost model (e.g. XGBClassifier)

# feature importance scores
print(model.feature_importances_)

# plot the scores manually
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

# or use xgboost's built-in helper
plot_importance(model)
pyplot.show()
```

59. What is OOB? How is OOB computed in a random forest, and what are its pros and cons?

In bagging, each bootstrap draw leaves out roughly 1/3 of the samples: they do not appear in the bootstrap sample set and take no part in building that decision tree. This one-third is called the out-of-bag (OOB) data, and it can replace a held-out test set for error estimation.
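
The "roughly 1/3" figure is the limit (1 − 1/n)ⁿ → 1/e ≈ 0.368, which a quick bootstrap simulation reproduces (sample size and seed are arbitrary choices here):

```python
import random

def oob_fraction(n, rng=random.Random(42)):
    """Fraction of samples left out of one bootstrap draw of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

# For large n the out-of-bag fraction approaches 1/e ~ 0.368,
# the "roughly 1/3" quoted above.
frac = oob_fraction(100000)
print(frac)
```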

60. Derive the naive Bayes classification P(c|d): for a document d consisting of several words, find the probability that the document belongs to class c, and indicate which probabilities in the formula can be computed from the training set.

61. Write down your understanding of the VC dimension

The VC dimension measures the complexity of a model: the larger the hypothesis space, the higher the VC dimension. To some extent, the VC dimension provides theoretical support for the learnability of machine learning.

1. Is the test-set loss close to the training-set loss? The smaller the VC dimension, the closer they are in theory, and the less likely overfitting is.

2. Is the training-set loss small enough? The larger the VC dimension, the smaller the loss can be in theory, and the less likely underfitting is.

62. In k-means clustering, how do you determine the size of k?

This is a classic question that comes up often in interviews.

The K-means algorithm first picks k center positions at random, then assigns each data item to the nearest center. After the assignment, each cluster center moves to the average position of all the nodes assigned to that cluster, and the whole assignment process starts again. The process repeats until the assignments no longer change. To choose k itself, common approaches are the elbow method (plot the within-cluster sum of squares against k and look for the bend) and the silhouette coefficient.

63. Implement linear regression in Python, and think about more efficient implementations

Linear programming is one of the important fields in optimization. Many practical problems in operations research can be expressed as linear programs. Certain special cases of linear programming, such as network flow and multi-commodity flow, are considered important enough to have generated much specialized research on their algorithms. Many other kinds of optimization algorithms decompose the problem into linear-programming subproblems and solve those.

Historically, many concepts derived from linear programming inspired core concepts of optimization theory, such as duality, decomposition, and the importance and generality of convexity. Similarly, in microeconomics and business management, linear programming is widely applied to problems such as maximizing income or minimizing the cost of a production process.
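
Returning to what the question actually asks: a minimal closed-form least-squares implementation for one feature (toy data assumed; for many features, the normal equations or a routine such as numpy.linalg.lstsq would be the more efficient route):

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: the slope and intercept
    minimizing the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```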

64. You are given a training dataset with 1000 columns and 1 million rows, for a classification problem. Your manager asks you to reduce the dimensionality of this dataset to cut model computation time. Your machine has limited memory. What would you do? (You are free to make practical assumptions.)

A: Your interviewer should be well aware that it is difficult to process high-dimensional data with limited memory. Here are approaches you can use:

1. Since our RAM is small, first close the other programs running on the machine, including the web browser, to ensure most of the memory is available.

2. We can randomly sample the dataset: create a smaller dataset, say with 1000 variables and 300,000 rows, and compute on that.

3. To reduce dimensionality, separate the numerical and categorical variables and remove the correlated ones. For numerical variables we use correlation analysis; for categorical variables, the chi-squared test.

4. We can also use PCA (principal component analysis) and keep the components that explain the greatest variance in the dataset.

5. Online learning algorithms such as Vowpal Wabbit (available in Python) are a possible option.

7. We can also use our understanding of the business to estimate how strongly each predictor influences the response variable. But this is a subjective method, and failing to identify the useful predictors may cause significant information loss.

65. Is a rotation transformation necessary in PCA? If so, why? What happens if you do not rotate the components?

66. You are given a dataset with missing values that fall within 1 standard deviation of the median. What percentage of the data is unaffected, and why?

A: This question gives you enough hints to start thinking! Since the data is distributed around the median, let us assume a normal distribution. In a normal distribution, about 68% of the data lies within one standard deviation of the mean (which coincides with the mode and median), so the remaining ~32% of the data is unaffected.

67. You are given a cancer-detection dataset. You built a classification model achieving 96% accuracy. Why are you still not satisfied with the model's performance? What can you do about it?

A: If you have analyzed enough datasets, you should be able to tell that cancer-detection data is imbalanced. On an imbalanced dataset, accuracy should not be used as the performance measure, because 96% (as given) may come merely from predicting the majority class correctly, while our interest lies in the minority class (4%): the people actually diagnosed with cancer. Hence:

1. We can use undersampling, oversampling or SMOTE to balance the data.

2. We can adjust the prediction threshold, using probability calibration and the AUC-ROC curve to find the optimal threshold.

3. We can assign class weights so that the minority class receives a larger weight.

4. We can also use anomaly detection.

68. Explain prior probability, likelihood and marginal likelihood in the naive Bayes algorithm.

A: The prior probability is the proportion of the dependent variable (binary) in the dataset. It is the best guess you can make about a class when you have no further information.

For example, in a dataset the dependent variable is binary (1 and 0), with the proportion of 1 (spam) at 70% and of 0 (non-spam) at 30%. Then we can estimate that any new e-mail has a 70% chance of being spam.

69. You are working on a time-series dataset and your manager asks for a high-accuracy model. You start with a decision tree, knowing it performs well on all kinds of data. Later you try a time-series regression model and get higher accuracy than the decision tree. Can this happen? Why?

A: As we know, time-series data often has a linear structure, while the decision tree is known as one of the best algorithms for detecting non-linear interactions.

The decision tree fails to give as good a prediction because it cannot capture a linear relationship as well as a regression model can.

Therefore, if the dataset satisfies the linearity assumption, a linear regression model can provide robust predictions.

70. You are assigned a new project: helping a food-delivery company save money. The problem is that the delivery team cannot deliver on time, leaving customers unhappy; to appease them the company ends up waiving the meal fee. Which machine learning algorithm can save them?

A: All kinds of machine learning algorithms may be flashing through your mind — but wait! This way of asking is just testing your machine learning fundamentals. This is not a machine learning problem; it is a route-optimization problem. A machine learning problem consists of three things:

1. A pattern exists.

2. It cannot be solved analytically (not even with exponential equations).

3. Relevant data is available.

71. You realize your model suffers from low bias and high variance. Which algorithm should you use to solve this, and why?

A: Low bias means the model's predictions are close to the actual values; in other words, the model is flexible enough to mimic the training-data distribution. That looks good, but a flexible model may have no generalization ability, so when it is used on an unseen dataset it can be disappointing. Bagging algorithms such as random forests are the usual remedy, since averaging many uncorrelated models reduces variance.

In addition, to cope with high variance we can:

1. use regularization to penalize large model coefficients, reducing model complexity;

2. use the top n features from a variable-importance chart. This helps when an algorithm struggles to find meaningful signal among all the variables in the dataset.

72. You are given a dataset with many variables, some of which you know are highly correlated. Your manager asks you to run PCA. Would you remove the correlated variables first? Why?

73. After hours of work you are in a hurry to build a high-accuracy model. You build 5 GBMs (Gradient Boosted Models), expecting boosting to work its magic. Unfortunately, none of them beats the baseline model. Finally you decide to combine the models; although ensembles are known to be accurate, you are out of luck. Where did you go wrong?

74. How do KNN and k-means clustering differ?

The k-means algorithm partitions a dataset into clusters so that each cluster is homogeneous, with the points within a cluster close to one another, and it tries to keep the clusters sufficiently separated. Because of its unsupervised nature, the clusters carry no labels. The KNN algorithm, by contrast, tries to classify an unlabeled observation based on its k (any number of) surrounding neighbors. It is also called a lazy learner because it involves minimal model training: it does not use the training data to build a generalization for unseen data.

75. What is the relationship between the true positive rate and recall? Write the equation.

They are the same quantity: TPR = recall = TP / (TP + FN).

76. You built a multiple regression model, but its R² is not as good as you expected. To improve it you remove the intercept term, and R² jumps from 0.3 to 0.8. Is this possible? How can this happen?

77. After analyzing your model, your manager tells you it has multicollinearity. How would you verify that he is right? Can you build a better model without losing any information?

Check the VIF (variance inflation factor) values.

We can also add some random noise to the correlated variables to make them distinct from each other. However, adding noise may affect prediction accuracy, so this approach should be used with caution.

78. When is ridge regression preferable to lasso regression?

79. Rising global average temperatures have coincided with a decline in the number of pirates worldwide. Does this mean the decline in the number of pirates causes climate change?

A: Reading this question, you should recognize the classic case of causation versus correlation. We cannot conclude that the decline in the number of pirates causes climate change, because other (latent or confounding) factors may drive the phenomenon. Global average temperature and the number of pirates may be correlated, but based on this information alone we cannot say that pirates died out because of the rise in global average temperature.

80. How do you select important variables on a dataset? Explain.

A: Methods you can use:

1. Remove correlated variables before selecting important ones.

2. Use linear regression and select variables based on p-values.

3. Use forward selection, backward elimination, or stepwise selection.

4. Use random forest or Xgboost and plot the variable-importance chart.

5. Use lasso regression.

6. Measure the information gain of the available feature set and select the top n features accordingly.

81. Is it possible to capture the correlation between a continuous variable and a categorical variable? If so, how?

Bagging runs in parallel. Boosting, after the first round of predictions, gives higher weight to the misclassified predictions so that they can be corrected in subsequent rounds; this sequential up-weighting of misclassified predictions continues until a stopping criterion is reached. Random forests improve accuracy mainly by reducing variance; the trees are grown to be uncorrelated so that the variance reduction is maximized. GBM, on the other hand, improves accuracy while reducing both the bias and the variance of the model.

83. Running a binary classification tree algorithm is easy, but do you know how the tree makes its splits, i.e. how it decides which variables go to which root node and which subsequent nodes?

A: A classification tree uses the Gini index and node entropy to make these decisions. In short, the tree algorithm finds the best possible feature — the one that can split the dataset into the purest possible child nodes. The Gini index says: if the population is completely pure, then two samples drawn at random from it must belong to the same class, and the probability of this is 1. We can compute the Gini index as follows:

1. compute the Gini index of a child node from the squared probabilities of success and failure, p² + q²;

2. compute the Gini index of the split as the weighted Gini score of the resulting nodes.

Entropy is a measure of the impurity of a node (two-class case): Entropy = −p·log2(p) − q·log2(q).
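
Both impurity measures are one-liners in the two-class case, using the p² + q² form of the Gini index quoted above:

```python
import math

def gini(p):
    """Gini index of a node with class probabilities p and 1 - p.
    A pure node (p = 0 or 1) scores 1; an even mix scores 0.5."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    """Binary entropy -p*log2(p) - q*log2(q):
    0 for a pure node, 1 for an even mix."""
    q = 1 - p
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)

print(gini(1.0), entropy(1.0))   # pure node
print(gini(0.5), entropy(0.5))   # maximally impure node
```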

84. You built a random forest model with 10000 trees and were delighted to get a training error of 0.00. But the validation error is 34.23. What is going on? Haven't you trained your model well?

A: The model is overfitted. A training error of 0.00 means the classifier has memorized the training data, and such a classifier cannot be used on unseen data.

Therefore, when the classifier is applied to unseen samples, it cannot find the memorized patterns and its predictions carry a high error rate. In the random forest algorithm, this happens when far more trees are used than necessary; to avoid it, we should tune the number of trees with cross-validation.

85. You have a dataset where the number of variables p is greater than the number of observations n. Why is OLS a bad choice? Which technique is best, and why?

A: When p > n, ordinary least squares has no unique solution, so penalized regression such as ridge or lasso is preferable. Other methods include subset regression and forward stepwise regression.

86. What is a convex hull? (Hint: think about SVM.)

A: When the data is linearly separable, the convex hulls represent the outer boundaries of the two groups of data points.

87. We know that one-hot encoding increases the dimensionality of a dataset, but label encoding does not. Why?

A: With one-hot encoding, the dimensionality (i.e. the number of features) of the dataset increases because it creates one variable for every level present in the categorical variable. For example, suppose a variable "color" has three levels: red, blue and green; one-hot encoding creates three columns, while label encoding simply maps the levels to numbers within a single column.

88. What cross-validation technique would you use on a time-series dataset: k-fold or LOOCV?

A: Neither, as-is, because both would let the model train on future data and test on the past. For time series, use forward chaining: each fold trains on the past and tests on the next period.

fold 1: training [1], test [2]

fold 2: training [1 2], test [3]

fold 3: training [1 2 3], test [4]

fold 4: training [1 2 3 4], test [5]

fold 5: training [1 2 3 4 5], test [6]

where 1, 2, 3, 4, 5, 6 represent years.
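
The folds listed above can be generated mechanically; a minimal forward-chaining sketch:

```python
def forward_chaining_folds(periods):
    """Yield (train, test) splits like the fold list above:
    train on all periods up to t, test on period t + 1."""
    for i in range(1, len(periods)):
        yield periods[:i], [periods[i]]

years = [1, 2, 3, 4, 5, 6]
folds = list(forward_chaining_folds(years))
for train, test in folds:
    print("training", train, "test", test)
```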

89. You are given a dataset where more than 30% of the values are missing — say, 8 of the 50 variables each have more than 30% missing values. How do you handle this?

A: We can handle it in the following ways:

1. Treat the missing values as a separate category; they may well carry trend information.

2. Simply remove them without further ado.

3. Or check their distribution against the target variable: if any pattern emerges, keep those missing values, assign them a new category, and delete the rest.

90. "Customers who bought this also bought..." — which algorithm produces Amazon's recommendations?

A: The basic idea behind this recommendation engine is collaborative filtering.

Collaborative filtering algorithms use "user behavior" to recommend items. They exploit other users' purchase behavior and transaction history — ratings, selections and purchase information. The behavior and preferences of other users of a product are used to recommend items (goods) to new users. In this case, the features of the item itself are unknown.

91. How do you understand Type I and Type II errors?

A: A Type I error is rejecting the null hypothesis when it is true, also called a "false positive". A Type II error is accepting the null hypothesis when it is false, also called a "false negative".

In terms of the confusion matrix: a Type I error occurs when we classify a value as positive (1) although it is actually negative (0); a Type II error occurs when we classify a value as negative (0) although it is actually positive (1).

92. While solving a classification problem, you randomly split the training data into a training set and a validation set for validation purposes. You are confident your model will perform well on unseen data because the validation accuracy is high — but then you are disappointed by very poor test accuracy. What went wrong?

A: For classification problems we should use stratified sampling instead of random sampling. Random sampling does not take the proportions of the target classes into account; stratified sampling, by contrast, preserves the distribution of the target variable in the resulting samples.
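
A minimal sketch of stratified splitting (toy labels assumed): each class is split in the same proportion, so the class distribution is preserved in both parts.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, rng=random.Random(0)):
    """Split sample indices so that each class contributes the same
    proportion to the train and test parts."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train_idx, test_idx = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx

# Imbalanced toy labels: 20% positive, 80% negative.
labels = ["pos"] * 20 + ["neg"] * 80
train_idx, test_idx = stratified_split(labels)
```

With a plain random split the test set could easily contain too few (or no) positives; here both parts stay at 20% positives by construction.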

93. Briefly state the respective pros and cons of decision trees, regression, SVM, neural networks and other algorithms.

Regularization, for instance, is an extension of another method (usually a regression method) that penalizes it according to model complexity; it favors models that are relatively simple and generalize better.

94. What are the steps to correct and clean data before applying a machine learning algorithm?

1. Import the data.

2. Look at the data, focusing on the metadata — field descriptions, data sources and similar information; after importing, inspect a sample of the data.

3. Clean missing values:

• handle missing values as needed, by deleting records or filling in values;

• re-collect: if some crucial fields are missing, check with whoever collects the data whether they can be obtained again.

4. Clean data formats: unify the display formats for times, dates, full-width/half-width characters, and so on.

5. Handle logically erroneous data:

• duplicate records;

• implausible values.

6. Handle inconsistency errors, i.e. correct contradictory content — most commonly, an ID number that does not match the date of birth. Data-cleaning tasks differ slightly between businesses; for example, when data comes from several sources, format cleaning and inconsistency handling become especially prominent. Data preprocessing is an important part of the work in data-related roles.

95. What is the K-means clustering algorithm?

K-means is the simplest of the clustering algorithms, yet the ideas inside it are far from trivial. I first used and implemented this algorithm while studying Jiawei Han's data-mining book, which leans toward applications; only after reading Andrew Ng's lecture notes did I come to understand the EM idea behind K-means.

96. Explain text feature extraction in detail

97. Explain image feature extraction in detail

Computer vision is the science of making machines "see", i.e. teaching computers to process and understand images; this sometimes requires the help of machine learning.

This section introduces some basic machine learning techniques in the field of computer vision. Features extracted from pixel values: a digital image is usually a raster (pixel) image, with colors mapped to grid coordinates. A picture can be seen as a matrix in which each element is a color value, and the most basic image feature is obtained by concatenating the rows of this matrix into a single row vector.

98. Do you know xgboost? Explain its principle in detail.

99. Explain the principle of gradient boosted decision trees (GBDT) in detail.

https://blog.csdn.net/haidao2009/article/details/7514787

Original: https://www.cnblogs.com/Anita9002/p/11218932.html
Author: Anita-ff
Title: AI工程师基础知识100题
