Machine Learning Example: Predicting Median House Prices (with Code)

June 17, 2023, 2:59 PM • Artificial Intelligence • 104 reads

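Prerequisites:

1. Some Python programming experience.
2. Familiarity with Python's main scientific libraries, in particular numpy, pandas, and matplotlib.
3. Preferably, use Jupyter for coding. (If you don't have it, I suggest installing Anaconda, which includes it.)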

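I. Download the data:

1. Just download the single compressed file housing.tgz, which contains housing.csv (it already holds all the data), and extract the CSV file with tar xzf housing.tgz. The function below does both steps from Python: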

import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

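After that, just call the function. It is best to run Jupyter in Google Chrome; otherwise you may get an error (no permission to access the site).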

fetch_housing_data()

2. Load the data with pandas; the function returns a DataFrame object containing all the data.

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path=os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)
load_housing_data(HOUSING_PATH)

Take a look at the data structure:


housing = load_housing_data()
housing.head()
housing.info()

housing.describe()

%matplotlib inline
import matplotlib.pyplot as plt

housing.hist(bins=50,figsize=(20,15))
plt.show()

3. Create a test set (typically 20% of the dataset; the larger the dataset, the smaller this share can be).


import numpy as np
np.random.seed(42)

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
len(train_set)
len(test_set)

from zlib import crc32

def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
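
import hashlib

# Alternative versions of test_set_check that use MD5 instead of CRC32.
# Each definition replaces the previous one, so only the last one is in effect.
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

# Same check, with the digest wrapped in bytearray so it also works on Python 2.
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio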

housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

test_set.head()
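
The hash-based split above is meant to keep each instance in the same set even after the dataset is refreshed. The following is a minimal sketch of that idea on plain integer ids, using only the standard library. The helper name in_test_set is hypothetical, and encoding the id as 8 little-endian bytes is an assumption chosen to mirror what np.int64 looks like in memory on common platforms.

```python
from zlib import crc32

def in_test_set(identifier: int, test_ratio: float) -> bool:
    # Hash the id and keep it in the test set if the hash falls
    # in the lowest `test_ratio` share of the 32-bit range.
    data = identifier.to_bytes(8, "little", signed=True)  # assumed encoding
    return (crc32(data) & 0xFFFFFFFF) < test_ratio * 2**32

ids = list(range(10_000))
test_ids = {i for i in ids if in_test_set(i, 0.2)}
share = len(test_ids) / len(ids)  # roughly 0.2, not exactly

# Append 2,000 new ids: the old ids must not switch sets.
more_ids = list(range(12_000))
test_ids_after = {i for i in more_ids if in_test_set(i, 0.2)}
stable = test_ids == {i for i in test_ids_after if i < 10_000}

print(f"test share: {share:.3f}, old ids unchanged: {stable}")
```

Because the decision depends only on the id itself, adding rows cannot move existing rows between the training and test sets, which a seeded random permutation does not guarantee.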

4. Build the test set with Scikit-Learn, using both a random split and stratified sampling:


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

test_set.head()
housing["median_income"].hist()
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

housing["income_cat"].value_counts()
housing["income_cat"].hist()

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

strat_test_set["income_cat"].value_counts() / len(strat_test_set)
housing["income_cat"].value_counts() / len(housing)

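5. Next, compare the income category proportions in the full dataset, the stratified test set, and the random test set.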

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props

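The results show that only the random split has a noticeable bias. We can now drop the income_cat attribute so the data is back to its original state: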

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
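
To see why stratified sampling preserves the category proportions, here is a small standard-library sketch on made-up category labels. The function stratified_split and the label counts are illustrative assumptions; this is not how StratifiedShuffleSplit is implemented internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_ratio, seed=42):
    """Return (train_idx, test_idx), sampling test_ratio from each category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for idx, cat in enumerate(labels):
        by_cat[cat].append(idx)
    test_idx = []
    for idxs in by_cat.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_ratio)
        test_idx.extend(idxs[:n_test])
    test_lookup = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_lookup]
    return train_idx, sorted(test_idx)

# Made-up income categories for 1,000 districts.
labels = [1] * 40 + [2] * 320 + [3] * 350 + [4] * 180 + [5] * 110
train_idx, test_idx = stratified_split(labels, 0.2)

overall = {c: n / len(labels) for c, n in Counter(labels).items()}
test_counts = Counter(labels[i] for i in test_idx)
test_props = {c: n / len(test_idx) for c, n in test_counts.items()}

for cat in sorted(overall):
    print(cat, overall[cat], test_props[cat])
```

Each category contributes the same share of its members to the test set, so the test proportions match the overall ones up to rounding.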

II. Data exploration

First (to avoid damaging the data, let's work on a copy):

housing = strat_train_set.copy()

1. Visualize the geographical data:


housing.plot(kind="scatter", x="longitude", y="latitude")

housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()

2. Look for correlations:


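# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity,
# which would otherwise raise an error in pandas 2.x
corr_matrix = housing.corr(numeric_only=True)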

corr_matrix["median_house_value"].sort_values(ascending=False)

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
# save_fig() is a helper from the book's notebook and is not defined in this post
plt.savefig("income_vs_house_value_scatterplot.png")
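
The values in the correlation matrix are Pearson correlation coefficients, which only measure linear relationships. As a rough illustration, here is a hand-written version on toy numbers; pearson_r and the sample lists are hypothetical and are not taken from the housing data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

income = [1.0, 2.0, 3.0, 4.0, 5.0]
value_linear = [2 * x + 1 for x in income]     # rises with income
value_inverse = [-x for x in income]           # falls with income
value_u_shape = [(x - 3) ** 2 for x in income] # strong but non-linear link

print(pearson_r(income, value_linear))   # close to 1
print(pearson_r(income, value_inverse))  # close to -1
print(pearson_r(income, value_u_shape))  # close to 0
```

The last case is the reason for also looking at scatter plots: a coefficient near zero does not mean the attributes are unrelated.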

3. Experiment with combinations of attributes (feature extraction):


housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()
housing.describe()

III. Data preparation

First, go back to a clean training set (drop() returns a copy, so strat_train_set itself is untouched) ^ ^

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

1. Data cleaning (for missing values, I fill in the median computed on the training data):


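# .copy() so that filling values below does not act on a view of housing
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head().copy()
sample_incomplete_rows
median = housing["total_bedrooms"].median()
# Assign the result back instead of using inplace=True on a column selection
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
sample_incomplete_rows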

2. Scikit-Learn's design:


from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
imputer.statistics_
housing_num.median().values
X = imputer.transform(housing_num)
imputer.strategy
# Wrap the transformed NumPy array back into a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
housing_tr.loc[sample_incomplete_rows.index.values]

housing_tr.head()
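
The imputer follows the usual Scikit-Learn pattern: fit() learns statistics from the training data and transform() applies them to any data. The sketch below imitates that pattern with a hypothetical ToyMedianImputer on lists of rows, using None for missing values. It is an illustration of the idea under those assumptions, not SimpleImputer's actual code.

```python
from statistics import median

class ToyMedianImputer:
    """Learns one median per column in fit(), fills None with it in transform()."""

    def fit(self, rows):
        n_cols = len(rows[0])
        self.statistics_ = [
            median(row[j] for row in rows if row[j] is not None)
            for j in range(n_cols)
        ]
        return self

    def transform(self, rows):
        return [
            [self.statistics_[j] if value is None else value
             for j, value in enumerate(row)]
            for row in rows
        ]

train_rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [None, 50.0]]
new_rows = [[None, None]]

toy_imputer = ToyMedianImputer().fit(train_rows)
print(toy_imputer.statistics_)           # medians learned from train_rows
print(toy_imputer.transform(new_rows))   # new data filled with those medians
```

The point of the split is that new data, including the test set, is filled with the medians learned from the training set rather than with its own.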

3. Handle text and categorical attributes:
So far we have only dealt with numerical attributes. Now let's look at the text attribute.

housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder =OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

ordinal_encoder.categories_

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

housing_cat_1hot.toarray()

cat_encoder.categories_
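
The difference between the two encoders is easiest to see on a handful of values. This sketch rebuilds both encodings by hand for a few made-up ocean_proximity values; it assumes the categories are sorted alphabetically and is not the encoders' real implementation.

```python
samples = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "INLAND"]

# Sorted unique categories, one position per category.
categories = sorted(set(samples))

# Ordinal encoding: one integer per category. A model may wrongly
# assume that nearby integers mean similar categories.
ordinal = [categories.index(value) for value in samples]

# One-hot encoding: one binary column per category, so no order is implied.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in samples]

print(categories)  # ['<1H OCEAN', 'INLAND', 'NEAR OCEAN']
print(ordinal)     # [0, 1, 2, 1]
print(one_hot)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

With many categories most one-hot entries are zero, which is why OneHotEncoder returns a sparse matrix and toarray() is needed to view it densely.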

4. Custom transformers:


from sklearn.base import BaseEstimator, TransformerMixin

# Column indices of the attributes to combine
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

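5. Feature scaling (the scaling itself is done later by StandardScaler inside the pipeline; the code here only looks up the column indices by name and wraps the extra attributes in a DataFrame):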

col_names = "total_rooms", "total_bedrooms", "population", "households"
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c) for c in col_names]

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()
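
For reference, the two common scaling schemes can be written in a few lines. This is a minimal sketch on a made-up list of values; standardize and min_max are hypothetical helper names, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

rooms = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_scores = standardize(rooms)
scaled = min_max(rooms)

print(z_scores)  # zero mean, unit standard deviation
print(scaled)    # all values between 0 and 1
```

Standardization does not bound the values to a fixed range, but it is much less affected by a single extreme value than min-max scaling.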

6. Transformation pipelines:
(The data transformations have to be executed in the right order.)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num",num_pipeline , num_attribs ),
    ("cat" , OneHotEncoder(),cat_attribs ),
])
housing_prepared = full_pipeline.fit_transform(housing)

housing_prepared
housing_prepared.shape
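
Conceptually, a pipeline just feeds the output of each step into the next one. The sketch below shows this with toy steps on a single column; FillWithMedian, Center, and ToyPipeline are all hypothetical names written for this illustration and only mimic the fit/transform interface.

```python
from statistics import mean, median

class FillWithMedian:
    def fit(self, values):
        self.median_ = median(v for v in values if v is not None)
        return self

    def transform(self, values):
        return [self.median_ if v is None else v for v in values]

class Center:
    def fit(self, values):
        self.mean_ = mean(values)  # would fail if None were still present
        return self

    def transform(self, values):
        return [v - self.mean_ for v in values]

class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        # Fit each step on the output of the previous one, in order.
        for _name, step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        # Reuse the statistics learned during fit_transform().
        for _name, step in self.steps:
            values = step.transform(values)
        return values

pipeline = ToyPipeline([("imputer", FillWithMedian()), ("center", Center())])
train_out = pipeline.fit_transform([1.0, None, 3.0])
new_out = pipeline.transform([None, 5.0])

print(train_out)  # [-1.0, 0.0, 1.0]
print(new_out)    # [0.0, 3.0]
```

Swapping the two steps would break the fit, since the mean cannot be computed while missing values remain. That is the sense in which order matters.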

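IV. Select and train a model:

Now we get to the machine learning algorithms:
In total I trained a linear regression model, a decision tree, and a random forest, and compared them with cross-validation to see which one generalizes better. The test set is kept for the final evaluation only.
1. Train and evaluate on the training set: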


from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)

some_data =housing.iloc[:5]
some_labels = housing_labels.iloc[: 5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:",lin_reg.predict(some_data_prepared))

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse  = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

This result is not great, though: an RMSE of 68628.198 is rather large. Let's look at a decision tree next:

from sklearn.tree  import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels,housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

The result is 0.0, so the model has most likely overfit the training data badly.
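
Both figures above are root mean squared errors measured on the training set itself. A minimal sketch of the metric on made-up numbers shows why a model that simply memorizes its training labels scores exactly 0.0; the rmse helper and the sample values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

labels = [100.0, 200.0, 300.0, 400.0]
rough_predictions = [110.0, 190.0, 330.0, 360.0]
memorized_predictions = list(labels)  # what a fully grown tree can do on training data

print(rmse(labels, rough_predictions))      # about 25.98
print(rmse(labels, memorized_predictions))  # 0.0
```

A perfect score on data the model has already seen says nothing about new data, which is why a separate validation step is needed.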

2. Better evaluation using cross-validation:


from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg,housing_prepared , housing_labels ,
                        scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:",scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

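You will then see that the decision tree really is overfitting, and that it performs even worse than linear regression. Let's try a random forest next: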


from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared,housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
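
With cv=10, the training set is cut into 10 folds and each model is trained 10 times, each time validating on a different fold. The sketch below shows how such folds can be formed for a tiny dataset. The helper k_fold_indices is hypothetical and uses contiguous, unshuffled folds, which is an assumption made for clarity.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, validation_indices), one pair per fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds

folds = k_fold_indices(10, 3)
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, so the mean and standard deviation over the folds give an estimate of both performance and its variability. The scores come back as negative MSE because Scikit-Learn expects a score where higher is better, hence the np.sqrt(-scores) in the code above.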

V. Fine-tune the model:

1. Grid search:
Tuning the hyperparameters:


from sklearn.model_selection import GridSearchCV

param_grid = [

    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},

    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_

grid_search.best_estimator_

cvres = grid_search.cv_results_
for mean_score ,params in zip(cvres["mean_test_score"],cvres["params"]):
    print(np.sqrt(-mean_score),params)
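
It helps to count how much work this grid implies. The sketch below expands the same param_grid with itertools.product; the expand helper is hypothetical and only reproduces the counting, not GridSearchCV's behaviour.

```python
from itertools import product

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

def expand(grid):
    """List every hyperparameter combination described by the grid."""
    combos = []
    for sub_grid in grid:
        keys = sorted(sub_grid)
        for values in product(*(sub_grid[key] for key in keys)):
            combos.append(dict(zip(keys, values)))
    return combos

combos = expand(param_grid)
n_folds = 5
n_fits = len(combos) * n_folds

print(len(combos))  # 18 combinations: 3 * 4 + 1 * 2 * 3
print(n_fits)       # 90 training runs with cv=5
```

The count grows multiplicatively with each added hyperparameter, which is the main reason to switch to a randomized search for larger search spaces.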

2. Randomized search (suitable when the hyperparameter search space is large):


from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

rnd_search.best_params_

cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
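
Instead of trying every combination, a randomized search draws a fixed number of candidates from the given distributions. The sketch below imitates the sampling step with the standard library. It uses random.randrange because, like scipy.stats.randint, its upper bound is exclusive; the seed and the candidate list are illustrative, and the actual values drawn by RandomizedSearchCV will differ.

```python
import random

rng = random.Random(42)
n_iter = 10

# Same ranges as param_distribs: n_estimators in 1..199, max_features in 1..7.
candidates = [
    {
        "n_estimators": rng.randrange(1, 200),
        "max_features": rng.randrange(1, 8),
    }
    for _ in range(n_iter)
]

for candidate in candidates:
    print(candidate)
```

The budget is set directly by n_iter, regardless of how many hyperparameters are searched or how wide their ranges are.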

3. Analyze the best model and its errors:

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

Display the importance scores next to their corresponding attribute names:

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]

cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

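4. Evaluate the system on the test set:
By now we finally have a reasonably good system. Let's run the final evaluation; this is the moment of truth.
Evaluating the final model: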


final_model = grid_search.best_estimator_

x_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

x_test_prepared = full_pipeline.transform(x_test)
final_predictions = final_model.predict(x_test_prepared)

final_mse = mean_squared_error(y_test , final_predictions)
final_rmse =  np.sqrt(final_mse)

final_rmse

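The result is decent, but a single point estimate does not tell us how far off the generalization error could be.
So we compute a 95% confidence interval for the generalization error: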

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
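
The interval above is built around the mean squared error and then square-rooted. Below is a standard-library sketch of the same construction on made-up errors. Note the swap: it uses the normal quantile from statistics.NormalDist instead of the Student t quantile, an approximation that is only reasonable for large samples. The helper rmse_interval and the toy errors are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev

def rmse_interval(squared_errors, confidence=0.95):
    """Approximate confidence interval for the RMSE (normal approximation)."""
    n = len(squared_errors)
    center = mean(squared_errors)
    standard_error = stdev(squared_errors) / math.sqrt(n)
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    low = center - z * standard_error
    high = center + z * standard_error
    # Guard against a negative lower bound before taking the square root.
    return math.sqrt(max(low, 0.0)), math.sqrt(high)

# Made-up prediction errors: -30, -20, ..., 30, repeated 100 times.
errors = [(i % 7 - 3) * 10.0 for i in range(700)]
squared_errors = [e ** 2 for e in errors]

point_rmse = math.sqrt(mean(squared_errors))
low, high = rmse_interval(squared_errors)

print(point_rmse)  # 20.0
print(low, high)   # an interval around 20
```

A narrow interval means the test-set RMSE is a stable estimate; a wide one means the reported number could easily have come out quite differently on another sample.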

VI. Launch!!


full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)

Save the trained model so it can be reused later. ^^


my_model = full_pipeline_with_predictor
import joblib
joblib.dump(my_model, "my_model.pkl")

my_model_loaded = joblib.load("my_model.pkl")

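Closing remarks:

I learned this by following the book Hands-On Machine Learning (《机器学习实战》), and the content above is essentially taken from it. The book is cited below.
I am no expert and my analysis is not very thorough; if you find any mistakes, please point them out in the comments. Thanks!
Full code: the version I typed up
Or: the original author's version
Finally:
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron, published by O'Reilly, ISBN 978-1-492-03264-9.
I recommend buying a copy; it is very good. 🆗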

Original: https://blog.csdn.net/qq_51153436/article/details/121527662
Author: 看到我你要笑一下
Title: Machine Learning Example: Predicting Median House Prices (with Code)
