Hand-Written Classification Decision Tree (Iris Dataset)

Table of Contents

1. Experiment Overview and Dataset
2. Algorithm Analysis
3. Implementation
   3.1 Data Structure
   3.2 How to Split a Node
       3.2.1 Gain
       3.2.2 Finding the Threshold for a Given Feature
       3.2.3 Finding the Best Feature and Its Threshold
   3.3 Building the Decision Tree
   3.4 Prediction
   3.5 Full Code
4. Results
5. Summary

1. Experiment Overview and Dataset

In this experiment we implement a simple classification decision tree and use it to make predictions on the iris dataset. The iris dataset contains 150 samples with four continuous-valued features, spread across three classes.
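As a quick sanity check, the dataset can be inspected with the same sklearn loader used in the full code later on:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, four continuous features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']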

2. Algorithm Analysis

Using a classification decision tree for prediction involves two parts.
The first part is building the tree. At every node we need to decide which feature to use to split the data into left and right children, and store that information in the node; here the splitting feature is chosen by maximizing the "gain" of the split. A leaf node stores the class label for the data that reaches it, which is what makes prediction possible.
The second part is prediction itself: a sample is fed into the tree, the information stored at each node decides which way the sample goes, and the class label of the leaf it finally reaches is the prediction for that sample.

3. Implementation

3.1 Data Structure

The tree's data structure is defined by the DecisionNode class.
Since the iris features are continuous, each internal node stores a threshold that splits the data into two branches.

class DecisionNode(object):
    def __init__(self, f_idx, threshold, value=None, L=None, R=None):
        self.f_idx = f_idx          # index of the feature used for the split
        self.threshold = threshold  # split threshold for that feature
        self.value = value          # predicted class label (only set on leaf nodes)
        self.L = L                  # left child  (feature value <  threshold)
        self.R = R                  # right child (feature value >= threshold)
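By convention in the code that follows, an internal node has value=None and both children set, while a leaf node only carries the predicted class. A minimal illustration (the numbers here are hypothetical, just to show the convention):

# leaf nodes: only the class label is stored
leaf0 = DecisionNode(None, None, value=0)
leaf1 = DecisionNode(None, None, value=1)

# internal node: split on feature index 2 at threshold 2.45;
# values below the threshold go to the left child, the rest to the right child
node = DecisionNode(2, 2.45, L=leaf0, R=leaf1)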

3.2 How to Split a Node

When a sample passes through a node, how should the data be split? This is the key question for a decision tree, and it is covered in the three subsections below.

3.2.1 Gain

We want the samples that end up in each branch node to belong to the same class as far as possible, i.e. the purer the node, the better. The "gain" of a split measures how much the purity improves after splitting: the higher the gain, the better the split. Here the "gain" can be the information gain, the information gain ratio, or the Gini index (for the Gini index, a lower value indicates a better split, as the code below reflects).
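For reference, the standard definitions that the functions below implement, written for a binary split of dataset D into D_L and D_R, where p_k is the fraction of samples in D belonging to class k:

$$\mathrm{Ent}(D) = -\sum_{k} p_k \log_2 p_k$$

$$\mathrm{Gain}(D) = \mathrm{Ent}(D) - \frac{|D_L|}{|D|}\,\mathrm{Ent}(D_L) - \frac{|D_R|}{|D|}\,\mathrm{Ent}(D_R)$$

$$\mathrm{GainRatio}(D) = \frac{\mathrm{Gain}(D)}{\mathrm{IV}}, \qquad \mathrm{IV} = -\frac{|D_L|}{|D|}\log_2\frac{|D_L|}{|D|} - \frac{|D_R|}{|D|}\log_2\frac{|D_R|}{|D|}$$

$$\mathrm{Gini}(D) = 1 - \sum_k p_k^2, \qquad \mathrm{GiniIndex}(D) = \frac{|D_L|}{|D|}\,\mathrm{Gini}(D_L) + \frac{|D_R|}{|D|}\,\mathrm{Gini}(D_R)$$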
The three gain measures are implemented as follows:


def calculate_entropy(dataset: np.ndarray):
    # count the occurrences of each class label (stored in the last column)
    scale = dataset.shape[0]
    d = {}
    for data in dataset:
        key = data[-1]
        if key in d:
            d[key] += 1
        else:
            d[key] = 1

    # Shannon entropy of the class distribution
    entropy = 0.0
    for key in d.keys():
        p = d[key] / scale
        entropy -= p * math.log(p, 2)
    return entropy

def calculate_gain(dataset, l, r):
    # information gain: parent entropy minus the weighted entropy of the two children
    e1 = calculate_entropy(dataset)
    e2 = len(l) / len(dataset) * calculate_entropy(l) + len(r) / len(dataset) * calculate_entropy(r)
    gain = e1 - e2
    return gain

def calculate_gain_ratio(dataset, l, r):
    # gain ratio: information gain normalized by the intrinsic value of the split
    gain = calculate_gain(dataset, l, r)
    p1 = len(l) / len(dataset)
    p2 = len(r) / len(dataset)

    # with candidate thresholds taken between distinct sorted values neither side
    # is ever empty, but guard against a zero intrinsic value all the same
    if p1 == 0 or p2 == 0:
        return 0.0

    s = - p1 * math.log(p1, 2) - p2 * math.log(p2, 2)

    gain_ratio = gain / s
    return gain_ratio

def calculate_gini(dataset: np.ndarray):
    scale = dataset.shape[0]
    d = {}
    for data in dataset:
        key = data[-1]
        if key in d:
            d[key] += 1
        else:
            d[key] = 1

    gini = 1.0
    for key in d.keys():
        p = d[key] / scale
        gini -= p * p
    return gini

def calculate_gini_index(dataset, l, r):
    gini_index = len(l) / len(dataset) * calculate_gini(l) + len(r) / len(dataset) * calculate_gini(r)
    return gini_index
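A small worked example on hypothetical toy data (labels in the last column), showing how these functions are called and what they return:

import numpy as np

# toy dataset: one feature plus the class label in the last column
toy = np.array([[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]])
left, right = toy[:2], toy[2:]                  # a perfect split on the feature

print(calculate_entropy(toy))                   # 1.0  (two balanced classes)
print(calculate_gain(toy, left, right))         # 1.0  (the split removes all uncertainty)
print(calculate_gini(toy))                      # 0.5
print(calculate_gini_index(toy, left, right))   # 0.0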

3.2.2 Finding the Threshold for a Given Feature

With the gain computations in place, we can find a threshold for a given feature: samples whose value for that feature is below the threshold go to the left subtree, and the remaining samples (at or above it) go to the right subtree.

The steps for finding the threshold are:

  1. Sort the values of the feature and remove duplicates.
  2. Take the midpoint of every pair of adjacent values as a candidate threshold.
  3. For each candidate threshold:
     split the data into the samples below the candidate and the rest,
     and compute the gain of that split.
  4. Pick the candidate threshold with the best gain as the feature's threshold.

The implementation is given below (all three "gain" criteria are supported, selected by the split_choice argument):

def find_best_threshold(dataset: np.ndarray, f_idx: int, split_choice: str):
    best_gain = -math.inf
    best_gini = math.inf
    best_threshold = None
    dataset_sorted = sorted(list(set(dataset[:, f_idx].reshape(-1))))
    candidate = []

    for i in range(len(dataset_sorted) - 1):
        candidate.append(round((dataset_sorted[i] + dataset_sorted[i + 1]) / 2.0, 2))

    for threshold in candidate:
        L, R = split_dataset(dataset, f_idx, threshold)
        gain = None
        if split_choice == "gain":
            gain = calculate_gain(dataset, L, R)
            if gain > best_gain:
                best_gain = gain
                best_threshold = threshold
        if split_choice == "gain_ratio":
            gain = calculate_gain_ratio(dataset, L, R)
            if gain > best_gain:
                best_gain = gain
                best_threshold = threshold
        if split_choice == "gini":
            gini = calculate_gini_index(dataset, L, R)
            if gini < best_gini:
                best_gini = gini
                best_threshold = threshold

    # return the score that was actually optimized so that the caller can
    # compare features consistently under the chosen criterion
    if split_choice == "gini":
        return best_threshold, best_gini
    return best_threshold, best_gain

The split_dataset helper (which splits the data into the left and right subsets) is:

def split_dataset(X: np.ndarray, f_idx: int, threshold: float):
    # boolean mask: True where the feature value is below the threshold
    L = X[:, f_idx] < threshold
    R = ~L
    return X[L], X[R]
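A quick usage sketch (assuming the iris data has been loaded and the labels appended as the last column, as the fit method does later):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
dataset = np.c_[iris.data, iris.target]    # labels become the last column

# best threshold for feature 0 (sepal length) under the information-gain criterion
threshold, gain = find_best_threshold(dataset, 0, "gain")
print(threshold, gain)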

3.2.3 Finding the Best Feature and Its Threshold

With the above in place, we can iterate over all the features and choose the one whose split yields the best gain as the optimal feature. The list f_idx_list holds the feature indices (0, 1, 2, 3). Note that a feature that has already been used may not be reused further down the tree, so it is removed from f_idx_list before recursing. The fragment below is taken from build_tree in Section 3.3:

        best_gain = -math.inf
        best_gini = math.inf
        best_threshold = None
        best_f_idx = None

        for i in f_idx_list:
            threshold, gain = find_best_threshold(dataset, i, split_choice)
            if split_choice == "gini":
                if gain < best_gini:
                    best_gini = gain
                    best_threshold = threshold
                    best_f_idx = i
            if split_choice == "gain" or split_choice == "gain_ratio" :
                if gain > best_gain:
                    best_gain = gain
                    best_threshold = threshold
                    best_f_idx = i

        son_f_idx_list = f_idx_list.copy()
        son_f_idx_list.remove(best_f_idx)

At this point we have the optimal splitting feature (its index) and the corresponding threshold.

3.3 Building the Decision Tree

We build the decision tree recursively (if recursion is unfamiliar, reviewing its basic idea first will make the tree-building process much clearer).
First, the recursion stops when either of these conditions holds:
1. All remaining samples have the same class label; the node is labeled with that class and returned as a leaf.
2. All splitting features have been used; the node is labeled with the majority class of the remaining samples and returned as a leaf.
If neither condition holds, a new split is created:

def build_tree(dataset: np.ndarray, f_idx_list: list, split_choice: str):

    class_list = [data[-1] for data in dataset]

    # stop condition 1: every sample has the same class -> leaf with that class
    if class_list.count(class_list[0]) == len(class_list):
        return DecisionNode(None, None, value=class_list[0])

    # stop condition 2: no features left to split on -> leaf with the majority class
    elif len(f_idx_list) == 0:
        value = collections.Counter(class_list).most_common(1)[0][0]
        return DecisionNode(None, None, value=value)

    else:

        best_gain = -math.inf
        best_gini = math.inf
        best_threshold = None
        best_f_idx = None

        for i in f_idx_list:
            threshold, gain = find_best_threshold(dataset, i, split_choice)
            if split_choice == "gini":
                if gain < best_gini:
                    best_gini = gain
                    best_threshold = threshold
                    best_f_idx = i
            if split_choice == "gain" or split_choice == "gain_ratio" :
                if gain > best_gain:
                    best_gain = gain
                    best_threshold = threshold
                    best_f_idx = i

        son_f_idx_list = f_idx_list.copy()
        son_f_idx_list.remove(best_f_idx)

        L, R = split_dataset(dataset, best_f_idx, best_threshold)
        if len(L) == 0:
            L_tree = DecisionNode(None, None, majority_count(dataset))
        else:
            L_tree = build_tree(L, son_f_idx_list, split_choice)

        if len(R) == 0:
            R_tree = DecisionNode(None, None, majority_count(dataset))
        else:
            R_tree = build_tree(R, son_f_idx_list, split_choice)
        return DecisionNode(best_f_idx, best_threshold, value=None, L=L_tree, R=R_tree)
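A minimal usage sketch (assuming X_train and y_train come from a train/test split, as in the full code in Section 3.5):

import numpy as np

dataset_in = np.c_[X_train, y_train]          # append the labels as the last column
f_idx_list = list(range(X_train.shape[1]))    # feature indices: [0, 1, 2, 3]
tree_root = build_tree(dataset_in, f_idx_list, "gain")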

3.4 Prediction

A single sample is predicted (given the model and the sample) using recursion as well:

def predict_one(model: DecisionNode, data):
    # a leaf node stores the predicted class in .value
    if model.value is not None:
        return model.value
    else:
        # internal node: follow the branch indicated by the stored feature and threshold
        feature_one = data[model.f_idx]
        if feature_one >= model.threshold:
            branch = model.R
        else:
            branch = model.L
        return predict_one(branch, data)
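Continuing the sketch above (X_test from the same train/test split), predicting a whole test set is just a loop over predict_one, which is exactly what SimpleDecisionTree.predict does in the full code:

y_hat = [predict_one(tree_root, sample) for sample in X_test]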

3.5 Full Code

import math
import collections
from typing import Union

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

class DecisionNode(object):
    def __init__(self, f_idx, threshold, value=None, L=None, R=None):
        self.f_idx = f_idx
        self.threshold = threshold
        self.value = value
        self.L = L
        self.R = R

def find_best_threshold(dataset: np.ndarray, f_idx: int, split_choice: str):
    best_gain = -math.inf
    best_gini = math.inf
    best_threshold = None
    dataset_sorted = sorted(list(set(dataset[:, f_idx].reshape(-1))))
    candidate = []

    for i in range(len(dataset_sorted) - 1):
        candidate.append(round((dataset_sorted[i] + dataset_sorted[i + 1]) / 2.0, 2))

    for threshold in candidate:
        L, R = split_dataset(dataset, f_idx, threshold)
        gain = None
        if split_choice == "gain":
            gain = calculate_gain(dataset, L, R)
            if gain > best_gain:
                best_gain = gain
                best_threshold = threshold
        if split_choice == "gain_ratio":
            gain = calculate_gain_ratio(dataset, L, R)
            if gain > best_gain:
                best_gain = gain
                best_threshold = threshold
        if split_choice == "gini":
            gini = calculate_gini_index(dataset, L, R)
            if gini < best_gini:
                best_gini = gini
                best_threshold = threshold

    # return the score that was actually optimized so that the caller can
    # compare features consistently under the chosen criterion
    if split_choice == "gini":
        return best_threshold, best_gini
    return best_threshold, best_gain

def calculate_entropy(dataset: np.ndarray):
    scale = dataset.shape[0]
    d = {}
    for data in dataset:
        key = data[-1]
        if key in d:
            d[key] += 1
        else:
            d[key] = 1

    entropy = 0.0
    for key in d.keys():
        p = d[key] / scale
        entropy -= p * math.log(p, 2)
    return entropy

def calculate_gain(dataset, l, r):
    e1 = calculate_entropy(dataset)
    e2 = len(l) / len(dataset) * calculate_entropy(l) + len(r) / len(dataset) * calculate_entropy(r)
    gain = e1 - e2
    return gain

def calculate_gain_ratio(dataset, l, r):
    gain = calculate_gain(dataset, l, r)
    p1 = len(l) / len(dataset)
    p2 = len(r) / len(dataset)

    # guard against a zero intrinsic value (one side of the split being empty)
    if p1 == 0 or p2 == 0:
        return 0.0

    s = - p1 * math.log(p1, 2) - p2 * math.log(p2, 2)

    gain_ratio = gain / s
    return gain_ratio

def calculate_gini(dataset: np.ndarray):
    scale = dataset.shape[0]
    d = {}
    for data in dataset:
        key = data[-1]
        if key in d:
            d[key] += 1
        else:
            d[key] = 1

    gini = 1.0
    for key in d.keys():
        p = d[key] / scale
        gini -= p * p
    return gini

def calculate_gini_index(dataset, l, r):
    gini_index = len(l) / len(dataset) * calculate_gini(l) + len(r) / len(dataset) * calculate_gini(r)
    return gini_index

def split_dataset(X: np.ndarray, f_idx: int, threshold: float):

    L = X[:, f_idx] < threshold
    R = ~L
    return X[L], X[R]

def majority_count(dataset):
    class_list = [data[-1] for data in dataset]
    return collections.Counter(class_list).most_common(1)[0][0]

def build_tree(dataset: np.ndarray, f_idx_list: list, split_choice: str):

    class_list = [data[-1] for data in dataset]

    if class_list.count(class_list[0]) == len(class_list):
        return DecisionNode(None, None, value=class_list[0])

    elif len(f_idx_list) == 0:
        value = collections.Counter(class_list).most_common(1)[0][0]
        return DecisionNode(None, None, value=value)

    else:

        best_gain = -math.inf
        best_gini = math.inf
        best_threshold = None
        best_f_idx = None

        for i in f_idx_list:
            threshold, gain = find_best_threshold(dataset, i, split_choice)
            if split_choice == "gini":
                if gain < best_gini:
                    best_gini = gain
                    best_threshold = threshold
                    best_f_idx = i
            if split_choice == "gain" or split_choice == "gain_ratio" :
                if gain > best_gain:
                    best_gain = gain
                    best_threshold = threshold
                    best_f_idx = i

        son_f_idx_list = f_idx_list.copy()
        son_f_idx_list.remove(best_f_idx)

        L, R = split_dataset(dataset, best_f_idx, best_threshold)
        if len(L) == 0:
            L_tree = DecisionNode(None, None, majority_count(dataset))
        else:
            L_tree = build_tree(L, son_f_idx_list, split_choice)

        if len(R) == 0:
            R_tree = DecisionNode(None, None, majority_count(dataset))
        else:
            R_tree = build_tree(R, son_f_idx_list, split_choice)
        return DecisionNode(best_f_idx, best_threshold, value=None, L=L_tree, R=R_tree)

def predict_one(model: DecisionNode, data):
    if model.value is not None:
        return model.value
    else:
        feature_one = data[model.f_idx]
        branch = None
        if feature_one >= model.threshold:
            branch = model.R
        else:
            branch = model.L
        return predict_one(branch, data)

def predict_accuracy(y_predict, y_test):
    y_predict = y_predict.tolist()
    y_test = y_test.tolist()
    count = 0

    for i in range(len(y_predict)):
        if int(y_predict[i]) == y_test[i]:
            count = count + 1
    accuracy = count / len(y_predict)
    return accuracy

class SimpleDecisionTree(object):
    def __init__(self, split_choice, min_samples: int = 1, min_gain: float = 0, max_depth: Union[int, None] = None,
                 max_leaves: Union[int, None] = None):
        # only split_choice is used here; the other arguments are unused placeholders
        self.split_choice = split_choice

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        dataset_in = np.c_[X, y]
        f_idx_list = [i for i in range(X.shape[1])]
        self.my_tree = build_tree(dataset_in, f_idx_list, self.split_choice)

    def predict(self, X: np.ndarray) -> np.ndarray:
        predict_list = []
        for data in X:
            predict_list.append(predict_one(self.my_tree, data))

        return np.array(predict_list)

if __name__ == "__main__":

    predict_accuracy_all = []

    for i in range(10):
        iris = load_iris()
        x = iris.data
        y = iris.target
        X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

        predict_accuracy_list = []
        split_choice_list = ["gain", "gain_ratio", "gini"]
        for split_choice in split_choice_list:
            m = SimpleDecisionTree(split_choice)
            m.fit(X_train, y_train)
            y_predict = m.predict(X_test)

            y_predict_accuracy = predict_accuracy(y_predict, y_test.reshape(-1))
            predict_accuracy_list.append(y_predict_accuracy)

        clf = DecisionTreeClassifier()
        clf.fit(X_train, y_train)
        predicted = clf.predict(X_test)
        predict_accuracy_list.append(clf.score(X_test, y_test))

        predict_accuracy_all.append(predict_accuracy_list)

    p = np.array(predict_accuracy_all)
    p = np.round(p, decimals=3)
    for i in p:
        print(i)

    print(p.mean(axis=0))

4. Results

The experiment was run with the three gain criteria (information gain, gain ratio, Gini index). The prediction accuracies are listed in the table below, together with the accuracy of sklearn's DecisionTreeClassifier for comparison.

Run  | Information Gain | Gain Ratio | Gini Index | sklearn
  1  | 0.967            | 0.967      | 0.933      | 0.967
  2  | 0.967            | 0.967      | 0.933      | 0.933
  3  | 0.967            | 0.967      | 0.833      | 0.833
  4  | 0.9              | 0.9        | 0.9        | 0.967
  5  | 0.967            | 0.967      | 0.967      | 0.967
  6  | 0.933            | 0.933      | 0.967      | 0.933
  7  | 1.0              | 1.0        | 0.867      | 1.0
  8  | 0.867            | 0.867      | 0.833      | 0.9
  9  | 0.933            | 0.933      | 0.9        | 0.933
 10  | 0.967            | 0.967      | 0.933      | 0.933
Mean | 0.9468           | 0.9468     | 0.9066     | 0.9366

Using the Gini index gives somewhat lower accuracy than the other criteria, although overall all of the methods perform reasonably well. The accuracy also fluctuates quite a bit between runs: with information gain, for example, it ranges from a low of 0.867 up to 1.0. A likely reason is that every run draws a different random train/test split.
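One way to check this explanation (not part of the original experiment) is to average over cross-validation folds instead of repeated random splits, or to fix the random_state of train_test_split; a hedged sketch using sklearn's built-in tree:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# 5-fold cross-validation: one averaged score instead of ten noisy ones
scores = cross_val_score(DecisionTreeClassifier(random_state=0), iris.data, iris.target, cv=5)
print(scores.mean(), scores.std())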

The decision tree produced by the sklearn model is shown below:

[Figure: decision tree plot generated by sklearn's DecisionTreeClassifier]

5. Summary

This experiment was fairly challenging: my understanding of how decision trees work was not deep at first, I was unsure how to design the tree's data structure and how to build the tree, and the recursive implementation was also difficult, so the experiment took quite a lot of time. In addition, being unfamiliar with numpy made a lot of the code more verbose than it needs to be; several parts could be written more concisely (see the sketch below), so I need to spend more time with numpy.
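For instance, the counting loop in calculate_entropy could be replaced with np.unique (a hedged sketch, not part of the code above):

import numpy as np

def calculate_entropy_np(dataset: np.ndarray) -> float:
    # class labels live in the last column; count every label in one call
    _, counts = np.unique(dataset[:, -1], return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())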

Original: https://blog.csdn.net/qq_51879318/article/details/125190507
Author: ShowerSong