R Example: A Data Analysis Report on the Boston Dataset — Classification Models with Logistic Regression, LDA (Linear Discriminant Analysis), and K-Nearest Neighbors (k=1 and k=5) to Predict Whether a Region's Crime Rate Is Above the Median


The Problem

Analyze the Boston dataset and write a data analysis report.

The report should address the following two questions.

  • Build classification models with logistic regression, LDA (linear discriminant analysis), and K-nearest neighbors (k=1 and k=5). The goal is to predict whether a region's crime rate is above the median of all crime rates. For each type of model, fit three models using three different subsets of predictors. From each set of three candidates, pick the one you consider best, basing the choice on cross-validation. Discuss your results.
  • Build a regression model using the best-subset method to predict a region's crime rate.

The Boston Dataset

Inspecting the dataset

> library(MASS)
> head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

Data description

Typing ?Boston at the console opens the dataset's help page in RStudio, which documents each variable.


Variable  Meaning
crim      per-capita crime rate by town
zn        proportion of residential land zoned for lots over 25,000 sq. ft.
indus     proportion of non-retail business acres per town
chas      Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
nox       nitrogen oxide concentration (parts per 10 million)
rm        average number of rooms per dwelling
age       proportion of owner-occupied units built prior to 1940
dis       weighted mean of distances to five Boston employment centers
rad       index of accessibility to radial highways
tax       full-value property-tax rate per $10,000
ptratio   pupil-teacher ratio by town
black     1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
lstat     percentage of lower-status population
medv      median value of owner-occupied homes (in $1000s)

Building Classification Models

Data visualization

The data description tells us what each variable means. Visualization lets us quickly see how the data are distributed before building models. We examine the crim variable with a boxplot. Because most values fall between 0 and 1, the y-axis is drawn on a log scale to make the data easier to read.

boxplot <- boxplot(Boston$crim, outline = TRUE, log = "y")
boxplot$stats
abline(h = boxplot$stats[1, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[1, ], "minimum=0.00632", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[2, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[2, ], "Q1=0.08199", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[3, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[3, ], "median=0.25651", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[4, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[4, ], "Q3=3.67822", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[5, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[5, ], "maximum=8.98296", col = 2, adj = c(0, -0.4))


Logistic classification model

Constructing the binary response

The logistic model needs a binary response. We code districts whose crime rate (crim) is above the median as "1" and the rest as "0".

dt <- Boston

dt$crim_bi <- ifelse(dt$crim > median(dt$crim), 1, 0)

Three models with different predictor subsets


log.fit <- glm(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
               data = dt, family = "binomial")
summary(log.fit)
log.fit2 <- glm(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
                data = dt, family = "binomial")
summary(log.fit2)
log.fit3 <- glm(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
                data = dt, family = "binomial")
summary(log.fit3)

Cross-validation

We run 10-fold cross-validation, using accuracy as the evaluation metric.

fold_log <- function(log.fit, dt){
  library(caret)
  set.seed(3)
  folds <- createFolds(y = dt[, 10], k = 10)
  accuracy <- as.numeric()
  for (i in 1:10){
    fold_test <- dt[folds[[i]], ]
    fold_pre <- predict(log.fit, fold_test, type = "response")
    log.class <- ifelse(fold_pre > 0.5, 1, 0)
    a <- table(log.class, fold_test$crim_bi)
    accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
  }
  return(mean(accuracy))
}

fold_log(log.fit,dt)
fold_log(log.fit2,dt)
fold_log(log.fit3,dt)

Results

> fold_log(log.fit,dt)
[1] 0.9150087
> fold_log(log.fit2,dt)
[1] 0.9229287
> fold_log(log.fit3,dt)
[1] 0.9090433

The output shows that log.fit2, the second model, is the most accurate, at 0.9229287.
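Beyond CV accuracy, it is worth looking at where the chosen model errs. The sketch below refits the formula of log.fit2 on a random 80% training split and prints the confusion matrix on the remaining 20% (the seed, split ratio, and the names idx, fit, and pred_class are our own choices, not part of the original analysis):

```r
library(MASS)

dt <- Boston
dt$crim_bi <- ifelse(dt$crim > median(dt$crim), 1, 0)

# random 80/20 train/test split
set.seed(5)
idx   <- sort(sample(nrow(dt), 0.8 * nrow(dt)))
train <- dt[idx, ]
test  <- dt[-idx, ]

# refit the second predictor subset on the training rows only
fit <- glm(crim_bi ~ zn + indus + nox + age + dis + rad + tax + ptratio + black + medv,
           data = train, family = "binomial")

pred       <- predict(fit, test, type = "response")
pred_class <- ifelse(pred > 0.5, 1, 0)
table(pred_class, test$crim_bi)  # rows: predicted class, columns: true class
```

Refitting on the training rows avoids evaluating the model on data it was fit to.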

LDA classification model

In the same way:

lda <- lda(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
           data = dt)
lda2 <- lda(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
            data = dt)
lda3 <- lda(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
            data = dt)
fold_lda <- function(lda, dt){
  library(caret)
  set.seed(3)
  folds <- createFolds(y = dt[, 10], k = 10)
  accuracy <- as.numeric()
  for (i in 1:10){
    fold_test <- dt[folds[[i]], ]
    fold_pre <- predict(lda, fold_test)
    a <- table(fold_pre$class, fold_test$crim_bi)
    accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
  }
  return(mean(accuracy))
}
fold_lda(lda,dt)
fold_lda(lda2,dt)
fold_lda(lda3,dt)

Results

> fold_lda(lda,dt)
[1] 0.8556253
> fold_lda(lda2,dt)
[1] 0.8575861
> fold_lda(lda3,dt)
[1] 0.8635469

The output shows that lda3, the third model, is the most accurate, at 0.8635469.
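To see which predictors carry the discrimination, the fitted LDA object can be inspected directly (a sketch, assuming lda3 is fit on the full data exactly as above):

```r
library(MASS)

dt <- Boston
dt$crim_bi <- ifelse(dt$crim > median(dt$crim), 1, 0)

lda3 <- lda(crim_bi ~ zn + nox + age + dis + rad + tax + ptratio + black + medv,
            data = dt)

lda3$means    # per-class means of each predictor
lda3$scaling  # coefficients of the single linear discriminant
```

With two classes there is one discriminant; predictors with large absolute coefficients (on their own scales) drive the separation.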

K-nearest neighbors model


library(kknn)
library(caret)
set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)
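The six loops above differ only in the formula and in k, so they can be collapsed into a single helper (a refactoring sketch: it reproduces the loop logic, including the fixed seed, and the fold_knn name is ours):

```r
library(MASS)
library(caret)
library(kknn)

dt <- Boston
dt$crim_bi <- ifelse(dt$crim > median(dt$crim), 1, 0)

# 10-fold CV accuracy for a KNN classifier, given a formula and a value of k
fold_knn <- function(formula, dt, k){
  set.seed(3)
  folds <- createFolds(y = dt[, 10], k = 10)
  accuracy <- as.numeric()
  for (i in 1:10){
    fold_test  <- dt[folds[[i]], ]
    fold_train <- dt[-folds[[i]], ]
    knn <- kknn(formula, fold_train, fold_test, k = k)
    pre_knn <- ifelse(fitted(knn) > 0.5, 1, 0)
    a <- table(pre_knn, fold_test$crim_bi)
    accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
  }
  mean(accuracy)
}

f1 <- crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv
fold_knn(f1, dt, k = 1)
fold_knn(f1, dt, k = 5)
```

The remaining four results follow by passing the second and third formulas with k = 1 and k = 5.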

Best-Subset Regression Model

Best subset

library(leaps)
leaps <- regsubsets(crim ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
                    data = dt)
plot(leaps, scale = "adjr2")


As the plot shows, even the model with only an intercept and rad reaches an adjusted R² of 0.39. Since models with higher adjusted R² are preferred, the best set of predictors is:
zn+nox+dis+rad+ptratio+black+lstat+medv

So we fit:

lmfit <- lm(crim ~ zn+nox+dis+rad+ptratio+black+lstat+medv,
            data = dt)
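Rather than reading the values off the plot, the adjusted R² for every model size can be pulled from the regsubsets object directly (a sketch; regsubsets considers models of 1 to 8 predictors by default):

```r
library(MASS)
library(leaps)

dt <- Boston
leaps <- regsubsets(crim ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
                    data = dt)

s <- summary(leaps)
s$adjr2                    # adjusted R^2 for each model size (1 to 8 predictors)
best <- which.max(s$adjr2)
coef(leaps, best)          # coefficients of the model with the highest adjusted R^2
```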

Splitting into training and test sets

We randomly assign 70% of the observations to the training set and the remaining 30% to the test set.

dim(dt)
length <- dim(dt)[1]
set.seed(1)
pre <- sample(length, length * 0.7)
pre <- sort(pre)
train <- dt[pre, ]
test <- dt[-pre, ]

Predicting the crime rate

Define a function that computes the root mean squared error (RMSE):

RMSE <- function(t, p){
  return(sqrt(mean((t - p)^2)))
}

Predict the crime rate on the test set and compute the RMSE:

lm_pre <- predict(lmfit, test)
RMSE(test$crim, lm_pre)

The result:

> RMSE(test$crim,lm_pre)
[1] 7.557481

So the root mean squared error is 7.557481.
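For context, this error can be compared against a trivial baseline that always predicts the mean crime rate of the training set (a sketch; note that here the model is refit on the training rows only, so its RMSE may differ slightly from the value reported above):

```r
library(MASS)

dt <- Boston
set.seed(1)
idx   <- sort(sample(nrow(dt), 0.7 * nrow(dt)))
train <- dt[idx, ]
test  <- dt[-idx, ]

RMSE <- function(t, p) sqrt(mean((t - p)^2))

fit <- lm(crim ~ zn + nox + dis + rad + ptratio + black + lstat + medv, data = train)
RMSE(test$crim, predict(fit, test))  # regression model
RMSE(test$crim, mean(train$crim))    # baseline: always predict the training mean
```

If the model's RMSE is not clearly below the baseline's, the predictors add little over the mean.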

Full Code

rm(list = ls())
library(ggplot2)
library(dplyr)
library(MASS)
head(Boston)

boxplot <- boxplot(Boston$crim, outline = TRUE, log = "y")
boxplot$stats
abline(h = boxplot$stats[1, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[1, ], "minimum=0.00632", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[2, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[2, ], "Q1=0.08199", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[3, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[3, ], "median=0.25651", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[4, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[4, ], "Q3=3.67822", col = 2, adj = c(0, -0.4))

abline(h = boxplot$stats[5, ], lwd = 1, col = 2, lty = 2)
text(1.25, boxplot$stats[5, ], "maximum=8.98296", col = 2, adj = c(0, -0.4))

dt <- Boston

dt$crim_bi <- ifelse(dt$crim > median(dt$crim), 1, 0)

log.fit <- glm(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
               data = dt, family = "binomial")
summary(log.fit)
log.fit2 <- glm(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
                data = dt, family = "binomial")
summary(log.fit2)
log.fit3 <- glm(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
                data = dt, family = "binomial")
summary(log.fit3)

fold_log <- function(log.fit, dt){
  library(caret)
  set.seed(3)
  folds <- createFolds(y = dt[, 10], k = 10)
  accuracy <- as.numeric()
  for (i in 1:10){
    fold_test <- dt[folds[[i]], ]
    fold_pre <- predict(log.fit, fold_test, type = "response")
    log.class <- ifelse(fold_pre > 0.5, 1, 0)
    a <- table(log.class, fold_test$crim_bi)
    accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
  }
  return(mean(accuracy))
}

fold_log(log.fit, dt)
fold_log(log.fit2, dt)
fold_log(log.fit3, dt)

dim(dt)
length <- dim(dt)[1]
set.seed(5)
pre <- sample(length, length * 0.8)
pre <- sort(pre)
train <- dt[pre, ]
test <- dt[-pre, ]

log.pred <- predict(log.fit2, test, type = "response")
log.class <- ifelse(log.pred > 0.5, 1, 0)

table(log.class, test$crim_bi)

lda <- lda(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
           data = dt)
lda2 <- lda(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
            data = dt)
lda3 <- lda(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
            data = dt)
fold_lda <- function(lda, dt){
  library(caret)
  set.seed(3)
  folds <- createFolds(y = dt[, 10], k = 10)
  accuracy <- as.numeric()
  for (i in 1:10){
    fold_test <- dt[folds[[i]], ]
    fold_pre <- predict(lda, fold_test)
    a <- table(fold_pre$class, fold_test$crim_bi)
    accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
  }
  return(mean(accuracy))
}
fold_lda(lda, dt)
fold_lda(lda2, dt)
fold_lda(lda3, dt)

library(kknn)
library(caret)
set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+indus+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 1)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

set.seed(3)
folds <- createFolds(y = dt[, 10], k = 10)
accuracy <- as.numeric()
for (i in 1:10){
  fold_test  <- dt[folds[[i]], ]
  fold_train <- dt[-folds[[i]], ]
  knn <- kknn(crim_bi ~ zn+nox+age+dis+rad+tax+ptratio+black+medv,
              fold_train, fold_test, k = 5)
  pre_knn <- fitted(knn)
  pre_knn <- ifelse(pre_knn > 0.5, 1, 0)
  a <- table(pre_knn, fold_test$crim_bi)
  accuracy <- append(accuracy, (a[1] + a[4]) / sum(a))
}
mean(accuracy)

library(leaps)
leaps <- regsubsets(crim ~ zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+medv,
                    data = dt)
plot(leaps, scale = "adjr2")
lmfit <- lm(crim ~ zn+nox+dis+rad+ptratio+black+lstat+medv,
            data = dt)

dim(dt)
length <- dim(dt)[1]
set.seed(1)
pre <- sample(length, length * 0.7)
pre <- sort(pre)
train <- dt[pre, ]
test <- dt[-pre, ]

lm_pre <- predict(lmfit, test)
RMSE <- function(t, p){
  return(sqrt(mean((t - p)^2)))
}
RMSE(test$crim, lm_pre)

Original: https://blog.csdn.net/weixin_45503977/article/details/121065673
Author: 涂零测试
Title: R Example: A Data Analysis Report on the Boston Dataset — Classification Models with Logistic Regression, LDA, and K-Nearest Neighbors (k=1 and k=5) to Predict Whether a Region's Crime Rate Is Above the Median
