# Feature Selection

## Tree Pruning

Decision trees are prone to overfitting, so pruning strategies are needed to prevent it.

## How to Select Features

The probabilities in the definition above are estimated from data, which is why this quantity is called the empirical entropy of the sample set. The more uniform the class distribution of a sample set, the greater its entropy: the set is more mixed, its purity is lower, and its impurity is higher. Entropy reaches its maximum when every class accounts for the same proportion of samples, and equals 0 when all samples belong to a single class.
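The empirical entropy described above can be computed directly from class frequencies; a minimal sketch (the function name `empirical_entropy` is my own):

```python
import numpy as np
from collections import Counter

def empirical_entropy(labels):
    """Empirical entropy H(D) = -sum_k p_k * log2(p_k),
    with p_k estimated as the class frequency in the sample."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum()) + 0.0  # +0.0 normalizes -0.0 to 0.0

print(empirical_entropy(list("abab")))  # uniform classes: maximal entropy, 1.0
print(empirical_entropy(list("aaaa")))  # single class: entropy 0.0
```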

# Decision Tree Generation

## Comparison of the Three Algorithms

ID3: uses information gain as the feature-selection criterion.
C4.5 improves on ID3 in three respects and is otherwise identical:
(1) ID3 tends to favor features with many distinct values. C4.5 therefore first keeps the candidate features whose information gain is above average, and then picks from those the one with the highest information gain ratio.
(2) ID3 cannot handle continuous features. C4.5 discretizes a continuous feature by sorting its values and splitting on interval thresholds.
(3) ID3 trees overfit easily: a tree that branches too finely fits the training data very well but predicts poorly on new data, i.e. it generalizes badly. C4.5 introduces a regularization coefficient for preliminary pruning to alleviate overfitting.
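C4.5's two-stage selection rule from point (1) can be sketched as follows; the function names (`info_gain`, `gain_ratio`, `c45_select`) and the toy dataset are illustrative, not from any library:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A) for a discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, fv in zip(labels, feature_values) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(labels, feature_values):
    """g_R(D, A) = g(D, A) / H_A(D): penalizes many-valued features."""
    split_info = entropy(feature_values)  # intrinsic value H_A(D)
    return info_gain(labels, feature_values) / split_info if split_info else 0.0

def c45_select(labels, features):
    """features: dict name -> value list. Keep features with above-average
    information gain, then pick the one with the highest gain ratio."""
    gains = {name: info_gain(labels, vals) for name, vals in features.items()}
    avg = sum(gains.values()) / len(gains)
    candidates = [name for name, g in gains.items() if g >= avg]
    return max(candidates, key=lambda name: gain_ratio(labels, features[name]))

# f2 splits perfectly but has one value per sample; the gain ratio prefers f1
data = {"f1": ["a", "a", "b", "b"], "f2": ["a", "b", "c", "d"]}
print(c45_select(["yes", "yes", "no", "no"], data))  # f1
```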
CART (Classification And Regression Tree):
(1) ID3 and C4.5 must compute logarithms to evaluate entropy; CART uses the Gini index instead, which simplifies the computation.
(2) ID3 and C4.5 split on a feature with a multiway branch, one subtree per distinct value, and then remove that feature from the candidate set. This way of splitting is coarse and uses the feature's information inefficiently; C4.5's interval discretization of continuous values also loses some information. CART performs a binary split on every feature.
(3) ID3 and C4.5 can only be used for classification; CART can be used for both classification and regression. For regression, CART takes the split that minimizes the squared error as the optimal split.

(4) CART prediction output:
Classification: each leaf node outputs the majority class among the samples it contains as its predicted class.
Regression: each leaf node outputs the mean (or median) of the label values of the samples it contains as its predicted value.
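The Gini index from point (1) and the leaf outputs from point (4) can be illustrated with small helpers (the names are my own, not CART internals):

```python
import numpy as np
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 -- no logarithms needed, unlike entropy."""
    p = np.array(list(Counter(labels).values()), dtype=float)
    p /= p.sum()
    return float(1.0 - (p ** 2).sum())

def leaf_predict_classification(labels):
    """CART classification leaf: output the majority class."""
    return Counter(labels).most_common(1)[0][0]

def leaf_predict_regression(values, use_median=False):
    """CART regression leaf: output the mean (or median) of the labels."""
    return float(np.median(values) if use_median else np.mean(values))

print(gini(["a", "b", "a", "b"]))                    # 0.5: maximally mixed two-class set
print(gini(["a", "a", "a", "a"]))                    # 0.0: pure node
print(leaf_predict_classification(["a", "a", "b"]))  # a
print(leaf_predict_regression([1.0, 2.0, 3.0]))      # 2.0
```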

# Decision Tree Pruning

The pruning strategy has a large influence on the resulting decision tree and is the core of optimizing the algorithm. There are two common approaches: pre-pruning and post-pruning.
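In scikit-learn, for example, pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is available as cost-complexity pruning via `ccp_alpha`; a rough sketch on the iris data (the value `ccp_alpha=0.02` is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early with depth / leaf-size limits
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Post-pruning: grow a full tree, then prune by cost-complexity
# (a larger ccp_alpha prunes more aggressively, yielding a smaller tree)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(full.get_n_leaves(), post.get_n_leaves())  # the pruned tree has fewer leaves
```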

# Case Study: Red Wine Classification

If the dataset is very large and you already expect the tree to be pruned anyway, it is better to set these parameters in advance to control the complexity and size of the tree.
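A minimal version of this idea with scikit-learn's built-in wine dataset, setting the complexity controls up front rather than growing a full tree first (the parameter values here are arbitrary illustrations, not tuned):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Pre-set depth and leaf-size limits to bound the tree's complexity
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # test-set accuracy
```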

# Case Study: Fitting a Noisy Sine Curve

Sklearn's regression tree can measure the quality of nodes and splits with the following criteria:
(1) criterion="mse": mean squared error. The reduction in MSE from the parent node to its children is used as the feature-selection criterion; this minimizes the L2 loss using the mean of each leaf node. (This is the default when the parameter is omitted.)
(2) criterion="friedman_mse": Friedman's mean squared error, an MSE variant improved for evaluating potential splits.
(3) criterion="mae": mean absolute error, which minimizes the L1 loss using the median of each leaf node.
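A sketch of the noisy-sine setup, modeled on the classic scikit-learn regression-tree example. Note that recent scikit-learn versions renamed these criteria to "squared_error", "friedman_mse", and "absolute_error" (the old "mse"/"mae" names were removed), so the code below simply relies on the default MSE-based criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))  # inject noise into every 5th target

# Default criterion (MSE-based): a shallow tree smooths over the noise,
# while a deep tree chases it and overfits
reg_shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
reg_deep = DecisionTreeRegressor(max_depth=10).fit(X, y)

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
print(reg_shallow.predict(X_test).shape)  # (500,)
```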

Original: https://blog.csdn.net/weixin_50481708/article/details/125512061
Author: 跳楼梯企鹅
Title: 【人工智能】机器学习中的决策树
