Python Implementation of the CART Decision Tree Algorithm (with Detailed Comments)

1. A Brief Introduction to the CART Decision Tree Algorithm

CART (Classification And Regression Trees) is a tree-building algorithm that can be used for both classification and regression. Whereas ID3 handles only discrete features and both ID3 and C4.5 are limited to classification tasks, CART is much more broadly applicable: it can handle discrete as well as continuous features, and it supports both classification and regression.

This article covers only the construction of a basic CART classification tree; regression trees and pruning are not discussed.

First, a few points to keep in mind:
1. CART produces binary splits: the decision tree generated by CART is a binary tree, whereas the trees generated by ID3 and C4.5 are multiway trees. In terms of runtime efficiency, a binary tree model is generally cheaper to evaluate than a multiway one.
2. CART selects the optimal feature using the Gini index.

2. The Gini Index

The Gini index measures a model's impurity: the smaller the Gini index, the lower the impurity. Note that this runs in the opposite direction from C4.5's information gain ratio, where larger values are better.

In a classification problem with K classes, let p_k denote the probability that a sample point belongs to class k. The Gini index of the probability distribution is defined as:

$$Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
If CART is used for a two-class classification problem (note that CART is not limited to two classes), the Gini index of the probability distribution simplifies to:

$$Gini(p) = 2p(1 - p)$$
Suppose feature A is used to split dataset D into two parts, D1 and D2. The Gini index of D under this split on feature A is:
$$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$$
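As a quick numerical illustration (not part of the original post), suppose D contains 10 samples and feature A splits it into D1 with 4 samples (3 positive, 1 negative) and D2 with 6 samples (2 positive, 4 negative). Then Gini(D1) = 1 - 0.75^2 - 0.25^2 = 0.375, Gini(D2) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444, and Gini(D, A) = 0.4 × 0.375 + 0.6 × 0.444 ≈ 0.417. The same computation with a throwaway Python helper (gini here is only for this illustration, not the calcGini function defined later):

def gini(probabilities):
    # Gini index of a class-probability distribution: 1 - sum of squared p_k.
    return 1 - sum(p * p for p in probabilities)

gini_d1 = gini([3 / 4, 1 / 4])              # 0.375
gini_d2 = gini([2 / 6, 4 / 6])              # about 0.444
gini_d_a = (4 / 10) * gini_d1 + (6 / 10) * gini_d2
print(gini_d_a)                             # about 0.417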

3. The CART Tree-Growing Algorithm

Input: training dataset D and the conditions for stopping.
Output: a CART decision tree.
Starting from the root node, recursively perform the following steps on each node to build a binary decision tree:
(1) Compute the Gini index of the dataset for each available feature and each candidate split value, as defined above.
(2) Take the feature whose split yields the smallest Gini index as the optimal feature, and the corresponding value as the optimal split point (if several features or split points attain the minimum, any one of them may be chosen).
(3) Using the optimal feature and optimal split point, create two child nodes from the current node and distribute the training examples to the two children according to whether they match the split value.
(4) Recursively apply steps (1)-(3) to the two child nodes until the stopping conditions are met.
(5) The result is the CART tree.
Stopping conditions: the number of samples in a node falls below a preset threshold, or the Gini index of the node's sample set falls below a preset threshold (the samples essentially belong to one class; the Gini index is exactly 0 when they all do), or no features remain.

Note: the optimal split point is what divides the current samples into two groups (we are building a binary tree). For a discrete feature, the optimal split point is one of the values taken by the optimal feature; for a continuous feature, it is a specific numeric threshold. In practice, all candidate split points must be enumerated to find the optimal one.
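The implementation in the next section handles only discrete features. For a continuous feature, one common way to enumerate candidate split points (a sketch under that assumption; the helper names below are not part of the original code) is to sort the distinct values and test the midpoints between adjacent values:

def candidate_thresholds(dataset, index):
    # Candidate thresholds for a numeric feature: midpoints between
    # adjacent distinct values in sorted order.
    values = sorted(set(example[index] for example in dataset))
    return [(a + b) / 2 for a, b in zip(values, values[1:])]

def split_dataset_continuous(dataset, index, threshold):
    # Binary split on a numeric feature: "<= threshold" vs "> threshold".
    # The column is kept, because a numeric feature may be reused deeper
    # in the tree with a different threshold.
    left = [example for example in dataset if example[index] <= threshold]
    right = [example for example in dataset if example[index] > threshold]
    return left, right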

4. Python Implementation of CART

from math import log

def create_dataset():
    # Toy loan-application dataset. Each example is
    # [age, has a job, owns a house, credit rating, class label (approve the loan?)].
    dataset = [['youth', 'no', 'no', 'just so-so', 'no'],
               ['youth', 'no', 'no', 'good', 'no'],
               ['youth', 'yes', 'no', 'good', 'yes'],
               ['youth', 'yes', 'yes', 'just so-so', 'yes'],
               ['youth', 'no', 'no', 'just so-so', 'no'],
               ['midlife', 'no', 'no', 'just so-so', 'no'],
               ['midlife', 'no', 'no', 'good', 'no'],
               ['midlife', 'yes', 'yes', 'good', 'yes'],
               ['midlife', 'no', 'yes', 'great', 'yes'],
               ['midlife', 'no', 'yes', 'great', 'yes'],
               ['geriatric', 'no', 'yes', 'great', 'yes'],
               ['geriatric', 'no', 'yes', 'good', 'yes'],
               ['geriatric', 'yes', 'no', 'good', 'yes'],
               ['geriatric', 'yes', 'no', 'great', 'yes'],
               ['geriatric', 'no', 'no', 'just so-so', 'no']]
    features = ['age', 'work', 'house', 'credit']
    return dataset, features

def calcGini(dataset):
    # Gini index of a dataset: Gini = 1 - sum_k p_k^2, where p_k is the
    # fraction of examples that belong to class k.
    num_of_examples = len(dataset)
    labelCnt = {}
    # Count how many examples carry each class label (the last column).
    for example in dataset:
        currentLabel = example[-1]
        if currentLabel not in labelCnt.keys():
            labelCnt[currentLabel] = 0
        labelCnt[currentLabel] += 1
    # Convert each count to its squared class probability p_k^2.
    for key in labelCnt:
        labelCnt[key] /= num_of_examples
        labelCnt[key] = labelCnt[key] * labelCnt[key]
    Gini = 1 - sum(labelCnt.values())
    return Gini

def create_sub_dataset(dataset, index, value):
    # Return the examples whose feature at position `index` equals `value`,
    # with that feature column removed. (Not used by the tree builder below;
    # split_dataset returns both halves of a split at once.)
    sub_dataset = []
    for example in dataset:
        if example[index] == value:
            current_list = example[:index]
            current_list.extend(example[index + 1:])
            sub_dataset.append(current_list)
    return sub_dataset

def split_dataset(dataset, index, value):
    # Binary split: examples whose feature at position `index` equals `value`
    # go into sub_dataset1, all other examples go into sub_dataset2.
    # The feature column used for the split is removed from both halves.
    sub_dataset1 = []
    sub_dataset2 = []
    for example in dataset:
        current_list = example[:index]
        current_list.extend(example[index + 1:])
        if example[index] == value:
            sub_dataset1.append(current_list)
        else:
            sub_dataset2.append(current_list)
    return sub_dataset1, sub_dataset2

def choose_best_feature(dataset):
    # Try every (feature, value) pair as a binary split and return the pair
    # that gives the smallest weighted Gini index.
    numFeatures = len(dataset[0]) - 1
    # Smallest weighted Gini index seen so far; 1 is an upper bound.
    bestGini = 1
    index_of_best_feature = -1
    best_split_point = None
    for i in range(numFeatures):
        # Candidate split values are the distinct values taken by feature i.
        uniqueVals = set(example[i] for example in dataset)
        Gini = {}
        for value in uniqueVals:
            # Split into "feature i == value" and "feature i != value".
            sub_dataset1, sub_dataset2 = split_dataset(dataset, i, value)
            prob1 = len(sub_dataset1) / float(len(dataset))
            prob2 = len(sub_dataset2) / float(len(dataset))
            Gini_of_sub_dataset1 = calcGini(sub_dataset1)
            Gini_of_sub_dataset2 = calcGini(sub_dataset2)
            # Weighted Gini index of the dataset under this split.
            Gini[value] = prob1 * Gini_of_sub_dataset1 + prob2 * Gini_of_sub_dataset2
            if Gini[value] < bestGini:
                bestGini = Gini[value]
                index_of_best_feature = i
                best_split_point = value
    return index_of_best_feature, best_split_point

def find_label(classList):
    # Majority vote: return the class label that occurs most often.
    labelCnt = {}
    for key in classList:
        if key not in labelCnt.keys():
            labelCnt[key] = 0
        labelCnt[key] += 1
    # Sort the (label, count) pairs by count in descending order.
    sorted_labelCnt = sorted(labelCnt.items(), key=lambda a: a[1], reverse=True)
    return sorted_labelCnt[0][0]

def create_decision_tree(dataset, features):
    # Class labels of all examples reaching the current node.
    label_list = [example[-1] for example in dataset]
    # Stop if every example belongs to the same class.
    if label_list.count(label_list[0]) == len(label_list):
        return label_list[0]
    # Stop if no features are left; fall back to a majority vote.
    if len(dataset[0]) == 1:
        return find_label(label_list)
    # Pick the (feature, value) pair with the smallest weighted Gini index.
    index_of_best_feature, best_split_point = choose_best_feature(dataset)
    best_feature = features[index_of_best_feature]
    decision_tree = {best_feature: {}}
    # The chosen feature is consumed at this node.
    del(features[index_of_best_feature])
    # Split into the branch matching best_split_point and the 'others' branch.
    sub_dataset1, sub_dataset2 = split_dataset(dataset, index_of_best_feature, best_split_point)
    # Each child gets its own copy of the remaining feature names, so that
    # one branch cannot shorten the list the other branch still relies on.
    decision_tree[best_feature][best_split_point] = create_decision_tree(sub_dataset1, features[:])
    decision_tree[best_feature]['others'] = create_decision_tree(sub_dataset2, features[:])
    return decision_tree

def classify(decision_tree, features, test_example):
    # The feature tested at the current node is the tree's only top-level key.
    first_feature = list(decision_tree.keys())[0]
    second_dict = decision_tree[first_feature]
    # Position of that feature within the test example.
    index_of_first_feature = features.index(first_feature)
    for key in second_dict.keys():
        if key != 'others':
            if test_example[index_of_first_feature] == key:
                # The example matches the split value: follow that branch.
                if type(second_dict[key]).__name__ == 'dict':
                    classLabel = classify(second_dict[key], features, test_example)
                else:
                    classLabel = second_dict[key]
            else:
                # No match: follow the 'others' branch.
                if isinstance(second_dict['others'], str):
                    classLabel = second_dict['others']
                else:
                    classLabel = classify(second_dict['others'], features, test_example)
    return classLabel

if __name__ == '__main__':
    dataset, features = create_dataset()
    decision_tree = create_decision_tree(dataset, features)
    # Print the tree as a nested dictionary.
    print(decision_tree)
    # create_decision_tree removes entries from `features`, so rebuild the
    # list before classifying a new example.
    features = ['age', 'work', 'house', 'credit']
    test_example = ['midlife', 'yes', 'no', 'great']
    print(classify(decision_tree, features, test_example))
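As a quick sanity check on the helper functions (not part of the original post, and assuming the functions above are already defined in the same module), splitting this dataset on the 'house' feature at column index 2 gives a weighted Gini index of about 0.27, the smallest of any split, which is why 'house' ends up at the root of the tree:

dataset, features = create_dataset()
# Split on the 'house' feature (column index 2) with the value 'yes'.
with_house, without_house = split_dataset(dataset, 2, 'yes')
weighted_gini = (len(with_house) / len(dataset)) * calcGini(with_house) \
              + (len(without_house) / len(dataset)) * calcGini(without_house)
print(weighted_gini)  # about 0.2667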

For a two-class problem, the functions calcGini and choose_best_feature can be simplified as follows:


def calcProbabilityEnt(dataset):
    # Probability p that an example belongs to the same class as the first
    # example; with only two classes, the other class has probability 1 - p.
    numEntries = len(dataset)
    count = 0
    label = dataset[0][-1]
    for example in dataset:
        if example[-1] == label:
            count += 1
    probabilityEnt = float(count) / numEntries
    return probabilityEnt

def choose_best_feature(dataset):
    # Same search as above, but the Gini index of each half is computed with
    # the two-class shortcut Gini = 2 * p * (1 - p).
    numFeatures = len(dataset[0]) - 1
    bestGini = 1
    index_of_best_feature = -1
    best_split_point = None
    for i in range(numFeatures):
        uniqueVals = set(example[i] for example in dataset)
        Gini = {}
        for value in uniqueVals:
            sub_dataset1, sub_dataset2 = split_dataset(dataset, i, value)
            prob1 = len(sub_dataset1) / float(len(dataset))
            prob2 = len(sub_dataset2) / float(len(dataset))
            probabilityEnt1 = calcProbabilityEnt(sub_dataset1)
            probabilityEnt2 = calcProbabilityEnt(sub_dataset2)
            # Weighted two-class Gini index of this split.
            Gini[value] = prob1 * 2 * probabilityEnt1 * (1 - probabilityEnt1) + prob2 * 2 * probabilityEnt2 * (1 - probabilityEnt2)
            if Gini[value] < bestGini:
                bestGini = Gini[value]
                index_of_best_feature = i
                best_split_point = value
    return index_of_best_feature, best_split_point
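To see that the two-class shortcut agrees with the general definition, note that 1 - p^2 - (1 - p)^2 = 2p(1 - p); a few lines of Python confirm this numerically (a sanity check, not from the original post):

# For two classes with probabilities p and 1 - p, the general Gini index
# 1 - p^2 - (1 - p)^2 equals the shortcut 2 * p * (1 - p).
for p in [0.0, 0.1, 0.3, 0.5, 0.9, 1.0]:
    general = 1 - p ** 2 - (1 - p) ** 2
    shortcut = 2 * p * (1 - p)
    assert abs(general - shortcut) < 1e-12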

5. Results

(The original post shows a screenshot of the program's output here: the decision tree printed as a nested dictionary, followed by the predicted class for the test example.)

Original: https://blog.csdn.net/qq_45717425/article/details/120992980
Author: Polaris_T
Title: Python Implementation of the CART Decision Tree Algorithm (with Detailed Comments)
