Model-based learning: a simple hands-on example

June 4, 2023, 11:05 AM • Artificial Intelligence • 94 views

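One way to generalize from a set of examples is to build a model of those examples and then use that model to make predictions. This is called model-based learning.

For example, suppose you want to know whether money makes people happy. The following is a simple example based on a linear model that relates life satisfaction to GDP per capita.

Data source: https://github.com/ageron/handson-ml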

Python ≥3.5
import sys
assert sys.version_info >= (3, 5)

Scikit-Learn ≥0.20
import sklearn
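# Compare the numeric (major, minor) parts; a plain string comparison misorders versions such as "0.9" and "0.20".
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)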

Load the data

Set the data path
import os
datapath = os.path.join("datasets", "lifesat", "")
print(datapath)

datasets/lifesat/

The Better Life Index data was downloaded from the OECD website:

import numpy as np
import pandas as pd

oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')  # thousands sets the thousands separator
oecd_bli.head()

   LOCATION  Country         INDICATOR  Indicator                           MEASURE  Measure  INEQUALITY  Inequality  Unit Code  Unit        PowerCode Code  PowerCode  Reference Period Code  Reference Period  Value  Flag Codes  Flags
0  AUS       Australia       HO_BASE    Dwellings without basic facilities  L        Value    TOT         Total       PC         Percentage  0               units      NaN                    NaN               1.1    E           Estimated value
1  AUT       Austria         HO_BASE    Dwellings without basic facilities  L        Value    TOT         Total       PC         Percentage  0               units      NaN                    NaN               1.0    NaN         NaN
2  BEL       Belgium         HO_BASE    Dwellings without basic facilities  L        Value    TOT         Total       PC         Percentage  0               units      NaN                    NaN               2.0    NaN         NaN
3  CAN       Canada          HO_BASE    Dwellings without basic facilities  L        Value    TOT         Total       PC         Percentage  0               units      NaN                    NaN               0.2    NaN         NaN
4  CZE       Czech Republic  HO_BASE    Dwellings without basic facilities  L        Value    TOT         Total       PC         Percentage  0               units      NaN                    NaN               0.9    NaN         NaN

The GDP per capita data was downloaded from the IMF:

gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',',  # GDP per person
                             delimiter='\t', encoding='latin1', na_values="n/a")
gdp_per_capita.head()

   Country              Subject Descriptor                                 Units         Scale  Country/Series-specific Notes                    2015       Estimates Start After
0  Afghanistan          Gross domestic product per capita, current prices  U.S. dollars  Units  See notes for: Gross domestic product, curren…  599.994    2013.0
1  Albania              Gross domestic product per capita, current prices  U.S. dollars  Units  See notes for: Gross domestic product, curren…  3995.383   2010.0
2  Algeria              Gross domestic product per capita, current prices  U.S. dollars  Units  See notes for: Gross domestic product, curren…  4318.135   2014.0
3  Angola               Gross domestic product per capita, current prices  U.S. dollars  Units  See notes for: Gross domestic product, curren…  4100.315   2014.0
4  Antigua and Barbuda  Gross domestic product per capita, current prices  U.S. dollars  Units  See notes for: Gross domestic product, curren…  14414.302  2011.0

Prepare the data

This function just merges the OECD’s life satisfaction data and the IMF’s GDP per capita data. It’s a bit too long and boring and it’s not specific to Machine Learning, which is why I left it out of the book.

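def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the rows whose inequality code is "TOT" (Total).
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"] == "TOT"]
    # Reshape to one row per country and one column per indicator.
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # Rename and re-index on copies so the caller's DataFrame is not modified
    # and the function can safely be called more than once.
    gdp_per_capita = gdp_per_capita.rename(columns={"2015": "GDP per capita"})
    gdp_per_capita = gdp_per_capita.set_index("Country")
    # Join the two tables on their country index.
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    full_country_stats = full_country_stats.sort_values(by="GDP per capita")
    # Drop a fixed set of rows by position (after sorting by GDP per capita).
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = sorted(set(range(36)) - set(remove_indices))
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]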

country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
country_stats.head()

                 GDP per capita  Life satisfaction
Country
Russia                 9054.914                6.0
Turkey                 9437.372                5.6
Hungary               12239.894                4.9
Poland                12495.334                5.8
Slovak Republic       15991.736                6.1
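
To make the reshaping in prepare_country_stats easier to follow, here is a minimal sketch of the same pivot, merge, and sort steps on a tiny made-up table. The country names "A", "B", "C" and all values are invented for illustration and are not taken from the OECD or IMF files.

```python
import pandas as pd

# Hypothetical long-format rows shaped like the OECD table (values are made up).
long_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Indicator": ["Life satisfaction", "Air pollution",
                  "Life satisfaction", "Air pollution"],
    "Value": [7.0, 10.0, 6.0, 20.0],
})

# pivot: one row per country, one column per indicator.
wide_df = long_df.pivot(index="Country", columns="Indicator", values="Value")

# Hypothetical GDP table; country "C" has no indicator rows above.
gdp_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "GDP per capita": [30000.0, 20000.0, 10000.0],
}).set_index("Country")

# merge on the index: the default inner join keeps only countries present in both tables.
merged = pd.merge(left=wide_df, right=gdp_df, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")

print(merged[["GDP per capita", "Life satisfaction"]])
```

Because pd.merge defaults to an inner join, country "C" is dropped here: it appears in only one of the two tables.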

Visualize the data

import matplotlib.pyplot as plt
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

Linear regression

import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()

Train the model

X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]
model.fit(X, y)

LinearRegression()
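
What fit actually learns is an intercept and a slope. The sketch below shows how to read them back, assuming made-up points that lie exactly on a known line; it does not use the OECD/IMF data. The names toy_model, theta0, and theta1 are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 4.0 + 5e-5 * x (not the real dataset).
gdp = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
satisfaction = 4.0 + 5e-5 * gdp

X_toy = gdp.reshape(-1, 1)  # a single feature column, shape (4, 1)
toy_model = LinearRegression().fit(X_toy, satisfaction)

theta0 = float(toy_model.intercept_)  # learned intercept
theta1 = float(toy_model.coef_[0])    # learned slope

# A prediction is simply theta0 + theta1 * x.
manual = theta0 + theta1 * 22587
from_model = float(toy_model.predict([[22587]])[0])
print(theta0, theta1, manual, from_model)
```

On the real data the same two attributes, model.intercept_ and model.coef_, hold the fitted parameters of the line drawn through the scatter plot.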

Make a prediction with the model

X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]

[[5.96242338]]
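
The prediction comes back as a nested [[...]] array because y was built with np.c_ and is therefore a column of shape (n, 1). The following sketch, on made-up data, contrasts this with a 1-D target; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0          # made-up points on an exact line

X_col = np.c_[x]           # shape (4, 1): one feature column
y_col = np.c_[y]           # shape (4, 1): target as a column, as in the text

pred_2d = LinearRegression().fit(X_col, y_col).predict([[5.0]])
pred_1d = LinearRegression().fit(X_col, y).predict([[5.0]])

print(pred_2d.shape)  # (1, 1)
print(pred_1d.shape)  # (1,)
```

Either form works. Passing a 1-D y just gives a flatter result that is slightly easier to index.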

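Summary

read_csv parameters

thousands=',' : the thousands separator; lets "1,000" be parsed as the number 1000 instead of being kept as a string;
delimiter='\t' : an alias for sep; the field separator of a CSV file may be "," or "\t", and you can check which one by opening the file in an editor such as Sublime Text;
encoding='latin1' : the file is decoded correctly only if the right encoding is given; opening the file in Vim and running :set fileencoding shows the encoding Vim detected;
na_values="n/a" : additional strings to treat as missing values (NaN); see https://blog.csdn.net/weixin_44520259/article/details/106053987 ;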
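
The effect of these parameters can be seen on a small in-memory example. This is a sketch with made-up tab-separated text; encoding is left out because io.StringIO already holds decoded text.

```python
import io
import pandas as pd

# Made-up tab-separated text; "n/a" marks a missing value.
raw = "Country\t2015\nAlpha\t1,234.5\nBeta\tn/a\nGamma\t10,000\n"

df = pd.read_csv(io.StringIO(raw), delimiter="\t", thousands=",", na_values="n/a")

print(df)
print(df["2015"].dtype)
```

Without thousands=",", the values containing commas would not parse as numbers and the whole column would be read as strings.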

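The focus here is on machine learning principles. For libraries such as NumPy and pandas, learn the unfamiliar parts as you run into them; there is no need to study them systematically first. Stay focused on what matters!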

Original: https://www.cnblogs.com/kphang/p/16359908.html
Author: KpHang
Title: Model-based learning: a simple hands-on example
