Python Learning: Probability Calibration for 3-class Classification

July 2, 2023, 1:29 PM • Artificial Intelligence • 57 views

This example illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).

Data

Below, we generate a classification dataset with 2000 samples, 2 features and 3 target classes. We then split the data as follows:

train: 600 samples (for training the classifier)
valid: 400 samples (for calibrating predicted probabilities)
test: 1000 samples

Note that we also create X_train_valid and y_train_valid, which consist of both the train and valid subsets. These are used when we only want to train the classifier but not calibrate the predicted probabilities.

# Author: Jan Hendrik Metzen
# License: BSD Style.

import numpy as np
from sklearn.datasets import make_blobs

np.random.seed(0)

X, y = make_blobs(
    n_samples=2000, n_features=2, centers=3, random_state=42, cluster_std=5.0
)
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:1000], y[600:1000]
X_train_valid, y_train_valid = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

Fitting and calibration

First, we will train a RandomForestClassifier with 25 base estimators (trees) on the concatenated train and validation data (1000 samples). This is the uncalibrated classifier:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train_valid, y_train_valid)

To train the calibrated classifier, we start with the same RandomForestClassifier but train it using only the train data subset (600 samples), then calibrate it with method='sigmoid' on the valid data subset (400 samples) in a 2-stage process.

from sklearn.calibration import CalibratedClassifierCV

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
cal_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
cal_clf.fit(X_valid, y_valid)
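
For intuition about what method="sigmoid" fits, here is a minimal sketch in plain Python. It is not scikit-learn's implementation. It assumes the usual Platt-scaling form, p = 1 / (1 + exp(a * f + b)), applied to each class's score in a one-vs-rest fashion and followed by renormalization. The function names and the (a, b) values are hypothetical; in practice the parameters are learned from the validation set.

```python
import math


def platt_sigmoid(score, a, b):
    """Map a raw score to a probability using 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def calibrate_one_vs_rest(probs, params):
    """Apply one sigmoid per class, then renormalize so the result sums to 1."""
    raw = [platt_sigmoid(p, a, b) for p, (a, b) in zip(probs, params)]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical (a, b) pairs, one per class; real values would be fitted on X_valid.
params = [(-4.0, 2.0), (-4.0, 2.0), (-4.0, 2.0)]

# An over-confident uncalibrated prediction: the third class gets probability 0.
uncalibrated = [0.96, 0.04, 0.0]
calibrated = calibrate_one_vs_rest(uncalibrated, params)
print([round(p, 3) for p in calibrated])
```

With these made-up parameters the zero probability becomes strictly positive and the largest probability shrinks, while the ordering of the classes is preserved.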

Comparing probabilities

Below we plot a 2-simplex with arrows showing the change in predicted probabilities of the test samples.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
colors = ["r", "g", "b"]

clf_probs = clf.predict_proba(X_test)
cal_clf_probs = cal_clf.predict_proba(X_test)
# Plot arrows
for i in range(clf_probs.shape[0]):
    plt.arrow(
        clf_probs[i, 0],
        clf_probs[i, 1],
        cal_clf_probs[i, 0] - clf_probs[i, 0],
        cal_clf_probs[i, 1] - clf_probs[i, 1],
        color=colors[y_test[i]],
        head_width=1e-2,
    )

# Plot perfect predictions, at each vertex
plt.plot([1.0], [0.0], "ro", ms=20, label="Class 1")
plt.plot([0.0], [1.0], "go", ms=20, label="Class 2")
plt.plot([0.0], [0.0], "bo", ms=20, label="Class 3")

# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], "k", label="Simplex")

# Annotate 6 points around the simplex, and the mid point inside the simplex
plt.annotate(
    r"($\frac{1}{3}$, $\frac{1}{3}$, $\frac{1}{3}$)",
    xy=(1.0 / 3, 1.0 / 3),
    xytext=(1.0 / 3, 0.23),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.plot([1.0 / 3], [1.0 / 3], "ko", ms=5)
plt.annotate(
    r"($\frac{1}{2}$, $0$, $\frac{1}{2}$)",
    xy=(0.5, 0.0),
    xytext=(0.5, 0.1),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.annotate(
    r"($0$, $\frac{1}{2}$, $\frac{1}{2}$)",
    xy=(0.0, 0.5),
    xytext=(0.1, 0.5),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.annotate(
    r"($\frac{1}{2}$, $\frac{1}{2}$, $0$)",
    xy=(0.5, 0.5),
    xytext=(0.6, 0.6),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.annotate(
    r"($0$, $0$, $1$)",
    xy=(0, 0),
    xytext=(0.1, 0.1),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.annotate(
    r"($1$, $0$, $0$)",
    xy=(1, 0),
    xytext=(1, 0.1),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
plt.annotate(
    r"($0$, $1$, $0$)",
    xy=(0, 1),
    xytext=(0.1, 1),
    xycoords="data",
    arrowprops=dict(facecolor="black", shrink=0.05),
    horizontalalignment="center",
    verticalalignment="center",
)
# Add grid
plt.grid(False)
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    plt.plot([0, x], [x, 0], "k", alpha=0.2)
    plt.plot([0, 0 + (1 - x) / 2], [x, x + (1 - x) / 2], "k", alpha=0.2)
    plt.plot([x, x + (1 - x) / 2], [0, 0 + (1 - x) / 2], "k", alpha=0.2)

plt.title("Change of predicted probabilities on test samples after sigmoid calibration")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
_ = plt.legend(loc="best")
plt.show()

In the figure above, each vertex of the simplex represents a perfectly predicted class (e.g., 1, 0, 0). The mid point inside the simplex represents predicting the three classes with equal probability (i.e., 1/3, 1/3, 1/3). Each arrow starts at the uncalibrated probabilities and ends with the arrow head at the calibrated probabilities. The color of the arrow represents the true class of that test sample.
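
The plot uses only the first two probabilities as x and y coordinates, because the third is determined by p3 = 1 - p1 - p2. A small sketch shows where a few probability vectors land; the helper name to_plot_xy is hypothetical and introduced only for illustration.

```python
def to_plot_xy(prob_vector):
    """Return the (x, y) plot position of a 3-class probability vector."""
    p1, p2, p3 = prob_vector
    # The three probabilities must sum to 1, so p3 carries no extra information.
    assert abs(p1 + p2 + p3 - 1.0) < 1e-9
    return (p1, p2)


print(to_plot_xy((1.0, 0.0, 0.0)))  # class 1 vertex -> (1.0, 0.0)
print(to_plot_xy((0.0, 1.0, 0.0)))  # class 2 vertex -> (0.0, 1.0)
print(to_plot_xy((0.0, 0.0, 1.0)))  # class 3 vertex -> (0.0, 0.0)
print(to_plot_xy((0.5, 0.0, 0.5)))  # midpoint of the class 1 / class 3 edge -> (0.5, 0.0)
```

This is why the class 3 vertex sits at the origin and why the edge where class 3 has probability 0 is the diagonal from (1, 0) to (0, 1).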

The uncalibrated classifier is overly confident in its predictions and incurs a large log loss. The calibrated classifier incurs a lower log loss due to two factors. First, notice in the figure above that the arrows generally point away from the edges of the simplex, where the probability of one class is 0. Second, a large proportion of the arrows point towards the true class, e.g., green arrows (samples where the true class is ‘green’) generally point towards the green vertex. This results in fewer over-confident, 0 predicted probabilities and at the same time an increase in the predicted probabilities of the correct class. Thus, the calibrated classifier produces more accurate predicted probabilities that incur a lower log loss.

We can show this objectively by comparing the log loss of the uncalibrated and calibrated classifiers on the predictions of the 1000 test samples. Note that an alternative would have been to increase the number of base estimators (trees) of the RandomForestClassifier, which would have resulted in a similar decrease in log loss.


from sklearn.metrics import log_loss

score = log_loss(y_test, clf_probs)
cal_score = log_loss(y_test, cal_clf_probs)

print("Log-loss of")
print(f" * uncalibrated classifier: {score:.3f}")
print(f" * calibrated classifier: {cal_score:.3f}")

Log-loss of
* uncalibrated classifier: 1.290
* calibrated classifier: 0.549
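
To see why predicted probabilities of exactly 0 are so costly, here is a minimal sketch of the per-sample log loss, -log(p_true), in plain Python. The clipping constant eps is an assumption made for illustration (sklearn.metrics.log_loss also clips probabilities, but its threshold depends on the version), and the probability vectors are made up.

```python
import math


def sample_log_loss(prob_vector, true_class, eps=1e-15):
    """Negative log of the probability given to the true class, clipped to avoid log(0)."""
    p = min(max(prob_vector[true_class], eps), 1.0 - eps)
    return -math.log(p)


# Made-up predictions for a sample whose true class has index 2.
overconfident = [0.96, 0.04, 0.0]  # true class gets probability 0
softened = [0.77, 0.12, 0.11]      # true class gets a small but non-zero probability

print(f"over-confident: {sample_log_loss(overconfident, 2):.3f}")
print(f"softened:       {sample_log_loss(softened, 2):.3f}")
```

With this clipping, a single zero on the true class costs about 34.5, versus about 2.2 for the softened prediction, so a handful of such samples can dominate the average over the test set.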

Finally, we generate a grid of possible uncalibrated probabilities over the 2-simplex, compute the corresponding calibrated probabilities and plot arrows for each. The arrows are colored according to the highest uncalibrated probability. This illustrates the learned calibration map:

plt.figure(figsize=(10, 10))
# Generate grid of probability values
p1d = np.linspace(0, 1, 20)
p0, p1 = np.meshgrid(p1d, p1d)
p2 = 1 - p0 - p1
p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]
p = p[p[:, 2] >= 0]

# Use the three class-wise calibrators to compute calibrated probabilities
calibrated_classifier = cal_clf.calibrated_classifiers_[0]
prediction = np.vstack(
    [
        calibrator.predict(this_p)
        for calibrator, this_p in zip(calibrated_classifier.calibrators, p.T)
    ]
).T

# Re-normalize the calibrated predictions to make sure they stay inside the
# simplex. This same renormalization step is performed internally by the
# predict method of CalibratedClassifierCV on multiclass problems.
prediction /= prediction.sum(axis=1)[:, None]

# Plot changes in predicted probabilities induced by the calibrators
for i in range(prediction.shape[0]):
    plt.arrow(
        p[i, 0],
        p[i, 1],
        prediction[i, 0] - p[i, 0],
        prediction[i, 1] - p[i, 1],
        head_width=1e-2,
        color=colors[np.argmax(p[i])],
    )

# Plot the boundaries of the unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], "k", label="Simplex")

plt.grid(False)
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    plt.plot([0, x], [x, 0], "k", alpha=0.2)
    plt.plot([0, 0 + (1 - x) / 2], [x, x + (1 - x) / 2], "k", alpha=0.2)
    plt.plot([x, x + (1 - x) / 2], [0, 0 + (1 - x) / 2], "k", alpha=0.2)

plt.title("Learned sigmoid calibration map")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)

plt.show()
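
The grid construction above keeps only the points of the 20 × 20 mesh that lie inside the simplex, that is, those with p2 = 1 - p0 - p1 >= 0. As a sketch, the same filter can be reproduced with exact fractions to count how many points should survive. The NumPy version works in floating point, so points lying exactly on the edge p2 = 0 could in principle be affected by rounding. The variable names below are for illustration only.

```python
from fractions import Fraction

n = 20  # same number of grid values per axis as np.linspace(0, 1, 20)
values = [Fraction(k, n - 1) for k in range(n)]

# Keep (p0, p1, p2) only when the implied third probability is non-negative.
grid = [
    (p0, p1, 1 - p0 - p1)
    for p1 in values
    for p0 in values
    if 1 - p0 - p1 >= 0
]

print(len(values) ** 2, "candidate points,", len(grid), "inside the simplex")
```

In exact arithmetic, 210 of the 400 candidate points satisfy the constraint, and each of them is the starting point of one arrow in the calibration map.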

Original: https://blog.csdn.net/m0_38127487/article/details/124644809
Author: 临风暖阳
Title: Python Learning: Probability Calibration for 3-class Classification
