ID3 Decision Tree Algorithm and Its Python Implementation

1. Decision Tree Algorithms

A decision tree is a classification algorithm that makes decisions by means of a tree structure: from a given training set we want to learn a model (the decision tree) and use it to classify new samples. A decision tree shows the classification process and result very intuitively, and once the model has been built, classifying new samples is also very efficient.
The classic decision tree algorithms are ID3, C4.5 and CART. ID3 was proposed first and handles classification over discrete attributes; C4.5 and CART can handle more complex classification problems. This article focuses on ID3.

Basic Theory

  1. Purity
    For a branch node, if all of the samples it contains belong to the same class, its purity is 1. We always want the purity to be as high as possible, i.e. as many samples as possible belonging to a single class. How, then, do we measure "purity"? This is where information entropy comes in.
  2. Information entropy
    Suppose the proportion of class-k samples in the current sample set D is p_k (k = 1, 2, …, |y|). The information entropy of D is defined as:
    Ent(D) = -∑_{k=1}^{|y|} p_k·log2(p_k)   (with the convention that p·log2(p) = 0 when p = 0)
    Clearly, the smaller Ent(D) is, the higher the purity of D, since 0 ≤ Ent(D) ≤ log2|y|.
  3. Information gain
    Suppose a discrete attribute a has V possible values {a¹, a², …, a^V}. Splitting the sample set D on a produces V branch nodes; let D^v denote the samples of D that take value a^v on attribute a. Because the branches contain different numbers of samples, each branch is weighted by |D^v|/|D|, so branches with more samples have more influence. The information gain obtained by splitting D on attribute a is then defined as: Gain(D, a) = Ent(D) - ∑_{v=1}^{V} (|D^v|/|D|)·Ent(D^v)

Here Ent(D) is the information entropy of D before the split, and ∑_{v=1}^{V} (|D^v|/|D|)·Ent(D^v) is the entropy after the split. The difference ("before minus after") is the reduction in entropy achieved by the split, i.e. the gain in purity. The larger Gain(D, a), the larger the purity gain and the better the split.

  4. Gain ratio
    The information-gain criterion for picking the optimal splitting attribute is biased toward attributes with many possible values. The C4.5 algorithm therefore selects the splitting attribute by the gain ratio instead of the raw information gain; the gain ratio is defined as: Gain_ratio(D, a) = Gain(D, a)/IV(a)

where

          IV(a) = -∑_{v=1}^{V} (|D^v|/|D|)·log2(|D^v|/|D|)

is called the intrinsic value of attribute a. The more possible values a has (i.e. the larger V), the larger IV(a) usually is, which to some extent removes the preference for attributes with many possible values.

In fact, the gain-ratio criterion is biased toward attributes with few possible values, so C4.5 does not apply it directly: it first selects, from the candidate attributes, those whose information gain is above average, and then chooses the one with the highest gain ratio among them.

  5. Gini index
    The CART decision tree uses the Gini index to select the splitting attribute. The Gini index is defined as:
       Gini(D) = ∑_{k=1}^{|y|} ∑_{k'≠k} p_k·p_{k'} = 1 - ∑_{k=1}^{|y|} p_k²

The Gini index can be read as follows: it is the probability that two samples drawn at random from D carry different class labels. The smaller Gini(D), the higher the purity. The Gini index of attribute a is defined as:

      Gini_index(D, a) = ∑_{v=1}^{V} (|D^v|/|D|)·Gini(D^v)

When the Gini index is used, the optimal splitting attribute is simply the attribute whose split yields the smallest Gini index.
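To make these definitions concrete, the following minimal sketch evaluates Ent(D), Gain(D, a), the gain ratio and the Gini index for the 色泽 attribute of the watermelon data used in Section 2 (8 good melons out of 17; branch counts taken from the hand computation there: 乌黑 6 with 4 good, 青绿 6 with 3 good, 浅白 5 with 1 good):

import math

def ent(probs):
    # information entropy; 0·log2(0) is treated as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini(D) = 1 - sum of squared class probabilities
    return 1 - sum(p * p for p in probs)

ent_D = ent([8/17, 9/17])                                   # Ent(D), roughly 0.998

# branches of 色泽: (branch size, class distribution inside the branch)
branches = [(6, [4/6, 2/6]), (6, [3/6, 3/6]), (5, [1/5, 4/5])]

gain = ent_D - sum(n/17 * ent(p) for n, p in branches)      # Gain(D, 色泽), roughly 0.109
iv = ent([n/17 for n, _ in branches])                       # IV(色泽), entropy of the branch sizes
gain_ratio = gain / iv                                      # Gain_ratio(D, 色泽)
gini_index = sum(n/17 * gini(p) for n, p in branches)       # Gini_index(D, 色泽)

print(round(gain, 3), round(gain_ratio, 3), round(gini_index, 3))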

The Decision Tree Learning Process

Building a decision tree involves three main steps:

Feature selection: choosing one feature, out of the many features in the training data, as the splitting criterion at the current node. The many different quantitative criteria for making this choice give rise to the different decision tree algorithms.

Tree generation: based on the chosen criterion, child nodes are generated recursively from top to bottom, and the tree stops growing once the data can no longer be split. For a tree structure, recursion is the most natural way to think about this.

Pruning: decision trees overfit easily, so they generally need to be pruned to shrink the tree and mitigate overfitting. There are two pruning techniques: pre-pruning and post-pruning.

Before turning to feature selection, let us go over a few concepts.

Impurity of a decision tree node


ID3 Algorithm

ID3 was the earliest decision tree algorithm to be proposed. Its core idea is to apply the information-gain criterion at every node to select a feature and to build the tree recursively. Concretely: starting from the root node, compute the information gain of every candidate feature on the node's data and pick the feature with the largest gain as the node's splitting feature, creating one child node for each of its values; then apply the same procedure recursively to the child nodes, until the information gain of every remaining feature is very small or no features are left to choose from.
The decision tree selects features by the information gain, which is defined as

    Gain(D, a) = Ent(D) - ∑_{v=1}^{V} (|D^v|/|D|)·Ent(D^v)
where D is the full sample set, a is an attribute, and D^v is the subset of samples taking the v-th value of attribute a. The larger the information gain, the more relevant the attribute is to the classification. Ent(·) denotes the information entropy, given by:
    Ent(D) = -∑_{k=1}^{|y|} p_k·log2(p_k)

Here k indexes the k-th class of samples in D, and p_k is the proportion of class-k samples in the whole set. By analogy with entropy in the physical sense, the smaller the information entropy, the higher the purity.
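Section 2 below computes each information gain by hand; for reference, the recursive procedure just described can also be sketched compactly. This is a simplified sketch, not the code used later in this article: it assumes a pandas DataFrame with discrete attribute columns and a class column (好瓜 by default), and it stops when a node is pure or no attributes remain.

import math
import pandas as pd

def entropy(labels):
    probs = labels.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(df, attr, target):
    # entropy before the split minus the weighted entropy after the split
    cond = sum(len(sub) / len(df) * entropy(sub[target])
               for _, sub in df.groupby(attr))
    return entropy(df[target]) - cond

def id3(df, attrs, target='好瓜'):
    if df[target].nunique() == 1:          # all samples in one class -> leaf
        return df[target].iloc[0]
    if not attrs:                          # no attribute left -> majority class
        return df[target].mode()[0]
    best = max(attrs, key=lambda a: info_gain(df, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3(sub, rest, target) for v, sub in df.groupby(best)}}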

2. Implementing ID3 on the Watermelon Dataset

The watermalon.txt file:

[Figure: contents of watermalon.txt, the 17-sample watermelon dataset]

Implementation Code

import numpy as np
import pandas as pd
import math
data = pd.read_csv('./watermalon.txt')
data

def info(x, y):
    """Entropy of a node containing x positive samples out of y samples in total."""
    if x != y and x != 0:
        return -(x/y)*math.log2(x/y) - ((y-x)/y)*math.log2((y-x)/y)
    if x == y or x == 0:
        return 0

info_D = info(8, 17)   # entropy of the whole set: 8 good melons out of 17 samples
info_D

# 色泽: 乌黑 6 samples (4 good), 青绿 6 (3 good), 浅白 5 (1 good)
seze_black_entropy = -(4/6)*math.log2(4/6)-(2/6)*math.log2(2/6)
seze_green_entropy = -(3/6)*math.log2(3/6)*2
seze_white_entropy = -(1/5)*math.log2(1/5)-(4/5)*math.log2(4/5)

# weighted entropy after splitting on 色泽
seze_entropy = (6/17)*seze_black_entropy+(6/17)*seze_green_entropy+(5/17)*seze_white_entropy
print(seze_entropy)

info_D - seze_entropy   # information gain of 色泽
data.根蒂.value_counts()

print(data[data.根蒂=='蜷缩'])
print(data[data.根蒂=='稍蜷'])
print(data[data.根蒂=='硬挺'])
# 根蒂: 蜷缩 8 samples (5 good), 稍蜷 7 (3 good), 硬挺 2 (0 good)
gendi_entropy = (8/17)*info(5,8)+(7/17)*info(3,7)+(2/17)*info(0,2)
gain_col = info_D - gendi_entropy   # information gain of 根蒂
gain_col

data.敲声.value_counts()

print(data[data.敲声=='浊响'])
print(data[data.敲声=='沉闷'])
print(data[data.敲声=='清脆'])
# 敲声: 浊响 10 samples (6 good), 沉闷 5 (2 good), 清脆 2 (0 good)
qiaosheng_entropy = (10/17)*info(6,10)+(5/17)*info(2,5)+(2/17)*info(0,2)
info_gain = info_D - qiaosheng_entropy   # information gain of 敲声
info_gain

data.纹理.value_counts()

print(data[data.纹理=="清晰"])
print(data[data.纹理=="稍糊"])
print(data[data.纹理=="模糊"])
# 纹理: 清晰 9 samples (7 good), 稍糊 5 (1 good), 模糊 3 (0 good)
wenli_entropy = (9/17)*info(7,9)+(5/17)*info(1,5)+(3/17)*info(0,3)
info_gain = info_D - wenli_entropy   # information gain of 纹理
info_gain

data.脐部.value_counts()

print(data[data.脐部=="凹陷"])
print(data[data.脐部=="稍凹"])
print(data[data.脐部=="平坦"])
# 脐部: 凹陷 7 samples (5 good), 稍凹 6 (3 good), 平坦 4 (0 good)
qidai_entropy = (7/17)*info(5,7)+(6/17)*info(3,6)+(4/17)*info(0,4)
info_gain = info_D - qidai_entropy   # information gain of 脐部
info_gain


data.触感.value_counts()

print(data[data.触感=="硬滑"])
print(data[data.触感=="软粘"])
# 触感: 硬滑 12 samples (6 good), 软粘 5 (2 good)
chugna_entropy = (12/17)*info(6,12)+(5/17)*info(2,5)
info_D - chugna_entropy   # information gain of 触感
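The per-attribute calculations above can also be reproduced in one loop over the discrete columns, which is a convenient cross-check. This is a sketch that assumes the `data` DataFrame loaded above, with 好瓜 as the class column and 色泽 as the colour column (matching the seze_* variables):

def column_gain(df, attr, target='好瓜'):
    # information gain of a discrete attribute column
    ent = lambda s: -sum(p * math.log2(p)
                         for p in s.value_counts(normalize=True) if p > 0)
    cond = sum(len(sub) / len(df) * ent(sub[target])
               for _, sub in df.groupby(attr))
    return ent(df[target]) - cond

for col in ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感']:
    print(col, round(column_gain(data, col), 3))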

  • Plotting the decision tree
import matplotlib.pylab as plt
import matplotlib

matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.serif'] = ['SimHei']

decisionNode = dict(boxstyle="sawtooth", fc="0.8")

leafNode = dict(boxstyle="round4", fc="0.8")

arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
"""
    绘制一个节点
    :param nodeTxt: 描述该节点的文本信息
    :param centerPt: 文本的坐标
    :param parentPt: 点的坐标,这里也是指父节点的坐标
    :param nodeType: 节点类型,分为叶子节点和决策节点
    :return:
"""
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def getNumLeafs(myTree):
"""
    获取叶节点的数目
    :param myTree:
    :return:
"""

    numLeafs = 0

    firstStr = list(myTree.keys())[0]

    secondDict = myTree[firstStr]

    for key in secondDict.keys():

        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])

        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
"""
    得到数的深度层数
    :param myTree:
    :return:
"""

    maxDepth = 0

    firstStr = list(myTree.keys())[0]

    secondDic = myTree[firstStr]

    for key in secondDic.keys():

        if type(secondDic[key]).__name__ == 'dict':

            thisDepth = 1 + getTreeDepth(secondDic[key])

        else:
            thisDepth = 1

        if thisDepth > maxDepth:
            maxDepth = thisDepth

    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):
"""
    计算出父节点和子节点的中间位置,填充信息
    :param cntrPt: 子节点坐标
    :param parentPt: 父节点坐标
    :param txtString: 填充的文本信息
    :return:
"""

    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]

    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]

    createPlot.ax1.text(xMid, yMid, txtString)

def plotTree(myTree, parentPt, nodeTxt):
"""
    绘制出树的所有节点,递归绘制
    :param myTree: 树
    :param parentPt: 父节点的坐标
    :param nodeTxt: 节点的文本信息
    :return:
"""

    numLeafs = getNumLeafs(myTree=myTree)

    depth = getTreeDepth(myTree=myTree)

    firstStr = list(myTree.keys())[0]

    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)

    plotMidText(cntrPt, parentPt, nodeTxt)

    plotNode(firstStr, cntrPt, parentPt, decisionNode)

    secondDict = myTree[firstStr]

    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD

    for key in secondDict.keys():

        if isinstance(secondDict[key], dict):
            plotTree(secondDict[key], cntrPt, str(key))
        else:

            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW

            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)

            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))

    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD

def createPlot(inTree):
"""
    需要绘制的决策树
    :param inTree: 决策树字典
    :return:
"""

    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)

    plotTree.totalW = float(getNumLeafs(inTree))

    plotTree.totalD = float(getTreeDepth(inTree))

    plotTree.xOff = -0.5/plotTree.totalW

    plotTree.yOff = 1.0

    plotTree(inTree, (0.5, 1.0), '')

    plt.show()

if __name__ == '__main__':
    createPlot(mytree)   # mytree: the decision tree to draw, as a nested dict (see the note below)
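Note that `mytree` is never built in the listing above, so `createPlot(mytree)` will raise a NameError as written. The plotting helpers expect a nested dict keyed first by attribute name and then by attribute value, with class labels at the leaves. A hand-written example of that shape (illustrative only, not the tree actually induced from the data):

mytree = {'纹理': {'清晰': {'根蒂': {'蜷缩': '好瓜', '硬挺': '坏瓜'}},
                 '稍糊': {'触感': {'硬滑': '坏瓜', '软粘': '好瓜'}},
                 '模糊': '坏瓜'}}

Defining such a dict (or one produced by an ID3 routine) before the `__main__` block makes the script runnable end to end.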

[Figure: the plotted decision tree]

3. Code for the C4.5 and CART Algorithms

The C4.5 Algorithm

  • C4.5 is one of the decision tree algorithms. As a classification algorithm, a decision tree assigns n samples with p-dimensional features to c classes; in effect it performs a mapping c = f(n) that transforms a sample into a class label. To do this, the classification process is represented as a tree, and at each split one feature p_i is chosen to branch on.
    Gain ratio

For a training set D and a feature a, the gain ratio is the information gain divided by the intrinsic value IV(a) of the feature (the entropy of D with respect to the values of a):

    Gain_ratio(D, a) = Gain(D, a)/IV(a)
Code:
from math import log, inf

# calcShannonEnt, splitDataSet and calcTotalWeight are helper functions from the
# original code base; they are not shown in this article.

def calcGainRatio(dataSet, labelIndex, labelPropertyi):
"""
    type: (list, int, int) -> float, int
    计算信息增益率,返回信息增益率和连续属性的划分点
    dataSet: 数据集
    labelIndex: 特征值索引
    labelPropertyi: 特征值类型,0为离散,1为连续
"""
    baseEntropy = calcShannonEnt(dataSet, labelIndex)
    featList = [example[labelIndex] for example in dataSet]
    uniqueVals = set(featList)
    newEntropy = 0.0
    bestPartValuei = None
    IV = 0.0
    totalWeight = 0.0
    totalWeightV = 0.0
    totalWeight = calcTotalWeight(dataSet, labelIndex, True)
    totalWeightV = calcTotalWeight(dataSet, labelIndex, False)
    if labelPropertyi == 0:
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, labelIndex, value)
            totalWeightSub = 0.0
            totalWeightSub = calcTotalWeight(subDataSet, labelIndex, True)
            if value != 'N':
                prob = totalWeightSub / totalWeightV
                newEntropy += prob * calcShannonEnt(subDataSet, labelIndex)
            prob1 = totalWeightSub / totalWeight
            IV -= prob1 * log(prob1, 2)
    else:
        uniqueValsList = list(uniqueVals)
        if 'N' in uniqueValsList:
            uniqueValsList.remove('N')

            totalWeightN = 0.0
            dataSetNull = splitDataSet(dataSet, labelIndex, 'N')
            totalWeightN = calcTotalWeight(dataSetNull, labelIndex, True)
            probNull = totalWeightN / totalWeight
            if probNull > 0.0:
                IV += -1 * probNull * log(probNull, 2)

        sortedUniqueVals = sorted(uniqueValsList)
        listPartition = []
        minEntropy = inf

        if len(sortedUniqueVals) == 1:
            totalWeightLeft = calcTotalWeight(dataSet, labelIndex, True)
            probLeft = totalWeightLeft / totalWeightV
            minEntropy = probLeft * calcShannonEnt(dataSet, labelIndex)
            IV = -1 * probLeft * log(probLeft, 2)
        else:
            for j in range(len(sortedUniqueVals) - 1):
                partValue = (float(sortedUniqueVals[j]) + float(
                    sortedUniqueVals[j + 1])) / 2

                dataSetLeft = splitDataSet(dataSet, labelIndex, partValue, 'L')
                dataSetRight = splitDataSet(dataSet, labelIndex, partValue, 'R')
                totalWeightLeft = 0.0
                totalWeightLeft = calcTotalWeight(dataSetLeft, labelIndex, True)
                totalWeightRight = 0.0
                totalWeightRight = calcTotalWeight(dataSetRight, labelIndex, True)
                probLeft = totalWeightLeft / totalWeightV
                probRight = totalWeightRight / totalWeightV
                Entropy = probLeft * calcShannonEnt(
                    dataSetLeft, labelIndex) + probRight * calcShannonEnt(dataSetRight, labelIndex)
                if Entropy < minEntropy:
                    minEntropy = Entropy
                    bestPartValuei = partValue
                    probLeft1 = totalWeightLeft / totalWeight
                    probRight1 = totalWeightRight / totalWeight
                    IV = -1 * (probLeft * log(probLeft, 2) + probRight * log(probRight, 2))

        newEntropy = minEntropy
    gain = totalWeightV / totalWeight * (baseEntropy - newEntropy)
    if IV == 0.0:
        IV = 0.0000000001
    gainRatio = gain / IV
    return gainRatio, bestPartValuei

The CART Algorithm

CART builds a binary decision tree. After the tree has been constructed it likewise needs to be pruned so that it generalizes better to unseen data. When building the tree, CART selects features by the Gini index.

def entropy(data):
    length = data.size
    ent = 0
    for i in data.value_counts():
        prob = i / length
        ent += - prob * (np.log2(prob))
    return ent
print('--------')
entD = entropy(data['好瓜'])
print(entD)

def gini_discrete(data, input_column, output_column):
    """Gini index of the discrete attribute input_column with respect to the class column output_column."""
    ret = 0
    lens = data[output_column].size
    all_attribute = data[input_column].value_counts()
    for name in data[input_column].unique():
        print(name)
        temp = 1
        for i in range(len(data[output_column].unique())):
            attribute_num = data[input_column].where(data[output_column] == data[output_column].unique()[i]).value_counts()
            try:
                prob = int(attribute_num[name]) / int(all_attribute[name])
            except KeyError:
                prob = 0
            if prob == 0:
                temp += 0
            else:
                temp -= prob * prob

        ret += temp * (all_attribute[name] / lens)
    return ret

def gini_continuous(data, input_column, output_column):
    """Find the best split point of the continuous attribute input_column by minimising the Gini index."""
    lens = data[output_column].size
    gini = 0
    T = []
    Gini = []
    values = sorted(data[input_column].values)
    for i in range(lens - 1):
        good_n = 0
        good_p = 0
        bad_n = 0
        bad_p = 0
        t = round(((values[i] + values[i+1]) / 2), 3)
        T.append(t)
        for index in data.index:
            if data[input_column].values[index-1] < t:
                if data['好瓜'].values[index-1] == '是' :
                    good_n += 1
                else:
                    bad_n += 1
            else:
                if data['好瓜'].values[index-1] == '是' :
                    good_p += 1
                else:
                    bad_p += 1

        dn_sum = i + 1
        dp_sum = lens - i - 1
        prob = dn_sum / lens
        gini_n = 1 - (np.square(good_n / dn_sum) + np.square(bad_n / dn_sum))
        gini_p = 1 - (np.square(good_p / dp_sum) + np.square(bad_p / dp_sum))
        gini = prob * gini_n + (1 - prob) * gini_p
        Gini.append(gini)

    print("对应划分点为:",T[Gini.index(min(Gini))])
    return T[Gini.index(min(Gini))], min(Gini)
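Assuming `data` is the watermelon DataFrame used above and also contains the continuous 密度 and 含糖率 columns that `createTree` below relies on (i.e. watermelon dataset 3.0), the two helpers can be exercised directly. Note that the row loops in `gini_continuous` and `createTree` index rows with `values[index-1]`, which presumes a DataFrame index starting at 1; the sketch below makes that assumption explicit:

data.index = range(1, len(data) + 1)             # assumption: shift to a 1-based index, as the row loops expect

print(gini_discrete(data, '纹理', '好瓜'))        # Gini index of a discrete attribute
t, g = gini_continuous(data, '密度', '好瓜')      # best split point and its Gini index
print(t, g)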

max_n_features = 4

decisionTree = {}

def transLabel(label):
    if label == '是':
        return '好瓜'
    else:
        return '坏瓜'

def createTree(data, features):
"""
    data: input the name of DataFrame
    features: input the list of features
"""
    bestFeaIndex, bestFeatureName, at = chooseBestFeature(data)  # helper not shown here: returns the best feature and, for a continuous feature, its split point 'at'
    bestFeatureValue = data[bestFeatureName].values
    attrCount = 0
    attr = []

    sameLvTree = {}
    tempTree = {}

    DataGood = pd.DataFrame()
    DataBad = pd.DataFrame()

    features.remove(bestFeatureName)

    if bestFeatureName != '密度' and bestFeatureName != '含糖率':
        for name in data[bestFeatureName].unique():
            attrCount += 1
            attr.append(name)
        for i in range(attrCount):
            dataName='DataSubSet'+str(i)
            locals()['DataSubSet'+str(i)] = pd.DataFrame()
        for index in data.index:
            for i in range(attrCount):
                if data[bestFeatureName].values[index-1] == attr[i]:
                    locals()['DataSubSet'+str(i)] = locals()['DataSubSet'+str(i)].append(data[index-1:index], sort=False)
                    break

        outputCount = 0
        for i in range(attrCount):
            print()
            print(attr[i])
            locals()['DataSubSet'+str(i)] = locals()['DataSubSet'+str(i)].drop(columns = [bestFeatureName])
            locals()['DataSubSet'+str(i)] = locals()['DataSubSet'+str(i)].reset_index(drop=True)
            locals()['DataSubSet'+str(i)].index += 1
            print(locals()['DataSubSet'+str(i)])
            print()
            for name in locals()['DataSubSet'+str(i)]['好瓜'].unique():
                outputCount += 1
            if outputCount == 1:
                print("*******",bestFeatureName, attr[i], locals()['DataSubSet'+str(i)]['好瓜'].values[0])
                sameLvTree[attr[i]] = transLabel(locals()['DataSubSet'+str(i)]['好瓜'].values[0])
                tempTree[bestFeatureName] = sameLvTree
                outputCount = 0
            else:
                print("*******",bestFeatureName, attr[i], '?')
                outputCount = 0
                if len(features) > max_n_features:
                    sameLvTree[attr[i]] = createTree(locals()['DataSubSet'+str(i)], features)
                    print(sameLvTree[attr[i]])
                    tempTree[bestFeatureName] = sameLvTree

    else:

        DataN = pd.DataFrame()
        DataP = pd.DataFrame()
        for index in data.index:
            if data[bestFeatureName].values[index-1] < at:
                DataN = DataN.append(data[index-1:index], sort=False)
            else:
                DataP = DataP.append(data[index-1:index], sort=False)

        outputCount = 0

        print()
        print('<', at)
        DataN = DataN.drop(columns = [bestFeatureName])
        DataN = DataN.reset_index(drop=True)
        DataN.index += 1
        print(DataN)
        print()
        for name in DataN['好瓜'].unique():
            outputCount += 1
        if outputCount == 1:
            print("*******",bestFeatureName, '.format(at), DataN['好瓜'].values[0])
            sameLvTree['.format(at)] = transLabel(DataN['好瓜'].values[0])
            tempTree[bestFeatureName] = sameLvTree
            print(tempTree)
            outputCount = 0
        else:
            print("*******",bestFeatureName, '.format(at), '?')
            outputCount = 0
            if len(features) > max_n_features:
                sameLvTree['<{}'.format(at)] = createTree(DataN, features)
                tempTree[bestFeatureName] = sameLvTree
                print(tempTree)

        print()
        print('>', at)
        DataP = DataP.drop(columns = [bestFeatureName])
        DataP = DataP.reset_index(drop=True)
        DataP.index += 1
        print(DataP)
        print()
        for name in DataP['好瓜'].unique():
            outputCount += 1
        if outputCount == 1:
            print("*******",bestFeatureName, '>{}'.format(at), DataP['好瓜'].values[0])
            print(bestFeatureName)
            sameLvTree['>{}'.format(at)] = transLabel(DataP['好瓜'].values[0])
            tempTree[bestFeatureName] = sameLvTree
            outputCount = 0
        else:
            print("*******",bestFeatureName, '>{}'.format(at), '?')
            outputCount = 0
            if len(features) > max_n_features:
                sameLvTree['>{}'.format(at)] = createTree(DataP, features)
                tempTree[bestFeatureName] = sameLvTree

    return tempTree

features = list(data.columns[0:-1])   # every attribute column except the class column 好瓜
decisionTree = createTree(data, features)
print(decisionTree)


Summary

The information-gain criterion used by ID3 is biased toward attributes with many possible values. C4.5 follows essentially the same procedure as ID3, but it selects features by the gain ratio rather than by the raw information gain. CART builds a binary decision tree; after construction the tree likewise needs to be pruned so that it generalizes better to unseen data, and CART selects features by the Gini index when building the tree.


