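Two common families of methods:
- Regression analysis, based on the principle of correlation
- Time-series methods, based on the principle of inertia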
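Basic steps of regression analysis:
- (1) Single out one variable of interest (the dependent variable), treat the other variables (the independent variables) as factors that influence it, and express the relationship among them with a suitable mathematical model;
- (2) Use sample data to estimate the model equation;
- (3) Test the model for significance;
- (4) Use the values of one or more independent variables to estimate or predict the value of the dependent variable.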
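The four steps above can be sketched in plain Python for the simplest case, a regression with one independent variable. This is only an illustration: the data values are hypothetical, the function name `fit_simple_regression` is made up for this sketch, and the significance check compares the F statistic with the tabulated critical value F(0.05; 1, 4) = 7.71 rather than computing a p-value.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, R^2, F statistic)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total sum of squares
    ssr = sst - sse                                         # regression sum of squares
    r_squared = ssr / sst
    f_stat = ssr / (sse / (n - 2))  # one regressor: df = (1, n - 2)
    return b0, b1, r_squared, f_stat


# Step 1: choose the dependent variable y and the independent variable x
# (hypothetical sample data, not taken from the article)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

# Step 2: estimate the model equation from the sample
b0, b1, r_squared, f_stat = fit_simple_regression(x, y)

# Step 3: significance test of the model at alpha = 0.05
F_CRITICAL = 7.71  # tabulated F(0.05; 1, 4) for n = 6 observations
model_is_significant = f_stat > F_CRITICAL

# Step 4: predict y for a new value of x
prediction = b0 + b1 * 7

print(f"y = {b0:.4f} + {b1:.4f} * x, R^2 = {r_squared:.4f}, F = {f_stat:.1f}")
print("significant:", model_is_significant, "| prediction at x = 7:", round(prediction, 4))
```

With several independent variables the same steps apply, but the estimates come from a matrix least-squares solve, which is what `statsmodels` does in the examples later in this article.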
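Choosing a forecasting method (for reference only):

| Data pattern | Forecasting method | Data requirement | Forecast horizon |
| --- | --- | --- | --- |
| Stationary series | Moving average | As many observations as the moving-average window length | Very short |
| Stationary series | Simple exponential smoothing | More than 5 observations | Short term |
| Linear trend | Holt exponential smoothing | More than 5 observations | Short to medium term |
| Linear trend | Simple linear regression | More than 10 observations | Short to medium term |
| Nonlinear trend | Exponential model | More than 10 observations | Short to medium term |
| Nonlinear trend | Polynomial function | More than 10 observations | Short to medium term |
| Trend and seasonal component | Winters exponential smoothing | Quarterly or monthly data covering at least four cycles | Short to medium term |
| Trend and seasonal component | Seasonal multiple regression | Quarterly or monthly data covering at least four cycles | Short, medium and long term |
| Trend, seasonal and cyclical components | Decomposition forecasting | Quarterly or monthly data covering at least four cycles | Short, medium and long term |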
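Evaluating forecasting methods: how good a forecasting method is depends on the size of its forecast error. The forecast error is the gap between the forecast and the actual value. Common measures are:
- Mean error (ME)
- Mean absolute error (Mean Absolute Deviation, MAD)
- Mean squared error (MSE) (commonly used)
- Mean percentage error (MPE)
- Mean absolute percentage error (MAPE)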
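The five measures listed above can be computed as in the following minimal sketch. It assumes the error is defined as actual minus forecast and that percentage errors are taken relative to the actual value. The function name `forecast_errors` and the sample numbers are hypothetical.

```python
def forecast_errors(actual, predicted):
    """Return ME, MAD, MSE, MPE and MAPE for paired actual/forecast values.

    Assumes error = actual - forecast and non-zero actual values
    (needed for the percentage-based measures).
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    pct_errors = [e / a * 100 for e, a in zip(errors, actual)]
    return {
        "ME": sum(errors) / n,                       # signed errors can cancel out
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,       # penalizes large errors more
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(p) for p in pct_errors) / n,
    }


# Hypothetical actual values and forecasts
metrics = forecast_errors([100, 200, 400], [110, 190, 400])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this sample the mean error is zero even though two of the three forecasts are off, which is why the absolute and squared measures are usually preferred when comparing methods.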
2.1 Regression analysis
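To study how employees' monthly salary relates to years of service and gender, four men and four women were randomly sampled from a company's staff. Their monthly salary, years of service and gender are listed below:

| Monthly salary (yuan) | Years of service | Gender |
| --- | --- | --- |
| 2900 | 2 | Male |
| 3000 | 6 | Female |
| 4800 | 8 | Male |
| 1800 | 3 | Female |
| 2900 | 2 | Male |
| 4900 | 7 | Male |
| 4200 | 9 | Female |
| 4800 | 8 | Female |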
Let $y$ denote monthly salary, $x_1$ years of service, and $x_2$ gender. With gender introduced as a dummy variable, the regression equation is $y=\beta_0+\beta_1 x_1+\beta_2 x_2$, which gives:
- Female ($x_2=0$): $y_{\text{female}}=\beta_0+\beta_1 x_1$
- Male ($x_2=1$): $y_{\text{male}}=(\beta_0+\beta_2)+\beta_1 x_1$
The parameters are interpreted as follows:
- $\beta_0$ is the base monthly salary of female employees
- $(\beta_0+\beta_2)$ is the base monthly salary of male employees
- $\beta_1$ is the average increase in monthly salary for each additional year of service, for men and women alike
- $\beta_2$ is the difference between the monthly salary of male and female employees with the same years of service, i.e. $y_{\text{male}}-y_{\text{female}}=\left[(\beta_0+\beta_2)+\beta_1 x_1\right]-\left(\beta_0+\beta_1 x_1\right)=\beta_2$
The Python implementation is as follows:
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.DataFrame({
'月工资收入':[2900,3000,4800,1800,2900,4900,4200,4800],
'工作年限':[2,6,8,3,2,7,9,8],
'性别':['男','女','男','女','男','男','女','女']
})
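# One-hot encode gender; fix the column order as 女 (female), 男 (male)
dummy_variables = pd.get_dummies(data['性别'])[['女', '男']].astype(int)
X = np.column_stack((data['工作年限'].values, dummy_variables.values))
X = sm.add_constant(X)  # columns: const, 工作年限, 女, 男
y = data['月工资收入'].values
# Keep only the 男 (male) dummy so that x2 = 1 for men and 0 for women;
# using both dummies together with the constant would be perfectly collinear
linear_model = sm.OLS(endog=y, exog=X[:, [0, 1, 3]])
ols_result = linear_model.fit()
print(ols_result.summary())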
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 22.78
Date: Sun, 22 May 2022 Prob (F-statistic): 0.00307
Time: 17:02:55 Log-Likelihood: -58.036
No. Observations: 8 AIC: 122.1
Df Residuals: 5 BIC: 122.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
const 865.7005 447.091 1.936 0.111 -283.583 2014.984
x1 397.5845 60.183 6.606 0.001 242.879 552.290
x2 1120.7729 323.747 3.462 0.018 288.554 1952.992
==============================================================================
Omnibus: 4.593 Durbin-Watson: 1.536
Prob(Omnibus): 0.101 Jarque-Bera (JB): 1.483
Skew: 1.049 Prob(JB): 0.477
Kurtosis: 3.219 Cond. No. 20.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
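As the output shows, the model as a whole is significant (Prob (F-statistic) = 0.00307). At the 0.05 significance level the coefficients of x1 and x2 are significant, but the constant term is not (p = 0.111), so below we drop the constant and refit.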
'''
X2 = X[:,[1,3]]
linear_model2 = sm.OLS(endog=y,exog=X2)
ols_result2 = linear_model2.fit()
print(ols_result2.summary())
'''
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.986
Model: OLS Adj. R-squared (uncentered): 0.981
Method: Least Squares F-statistic: 210.6
Date: Sun, 22 May 2022 Prob (F-statistic): 2.77e-06
Time: 17:22:47 Log-Likelihood: -60.274
No. Observations: 8 AIC: 124.5
Df Residuals: 6 BIC: 124.7
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
const 7.8141 0.080 97.818 0.000 7.650 7.978
x1 2.6652 0.258 10.310 0.000 2.136 3.195
==============================================================================
Omnibus: 5.481 Durbin-Watson: 2.414
Prob(Omnibus): 0.065 Jarque-Bera (JB): 4.092
Skew: 0.883 Prob(JB): 0.129
Kurtosis: 3.391 Cond. No. 4.69
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
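As the output shows, the goodness of fit of the estimated model is 0.792. At the significance level α = 0.05, both the overall model test and the coefficient tests pass, and the fitted regression equation is: y = 7.8141 + 2.6652*x1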
'''
from sklearn.preprocessing import PolynomialFeatures
X2 = PolynomialFeatures(degree=2).fit_transform(X=data[['广告费用(百万元)']])
model2 = sm.OLS(endog=y,exog=X2)
result2 = model2.fit()
print(result2.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: 销售量(百万支) R-squared: 0.838
Model: OLS Adj. R-squared: 0.826
Method: Least Squares F-statistic: 69.81
Date: Fri, 27 May 2022 Prob (F-statistic): 2.14e-11
Time: 22:45:32 Log-Likelihood: -3.2455
No. Observations: 30 AIC: 12.49
Df Residuals: 27 BIC: 16.69
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
const 29.1133 7.483 3.890 0.001 13.701 44.525
x1 11.1342 4.446 2.504 0.019 1.978 20.291
x2 -7.6080 2.469 -3.081 0.005 -12.693 -2.523
x3 0.6712 0.203 3.312 0.003 0.254 1.089
x4 -1.4777 0.667 -2.215 0.036 -2.852 -0.104
==============================================================================
Omnibus: 0.242 Durbin-Watson: 1.512
Prob(Omnibus): 0.886 Jarque-Bera (JB): 0.148
Skew: -0.153 Prob(JB): 0.929
Kurtosis: 2.843 Cond. No. 9.81e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
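As the output shows, the goodness of fit of the estimated model rises to 0.921. At the significance level α = 0.05, both the overall model test and the coefficient tests pass, so the final regression equation is: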
y = 29.1133 + 11.1342*x1 - 7.6080*x2 + 0.6712*(x2)**2 - 1.4777*x1*x2
'''
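To study how the tertiary industry performs within its macroeconomy, a province carried out a statistical analysis of the three main factors affecting the tertiary industry (capital, labor, and technological progress) and built a basic mathematical model using the Cobb-Douglas production function. The raw data are as follows:

| Year | Tertiary-industry GDP | Capital input | Employed persons |
| --- | --- | --- | --- |
| 1992 | 448.96 | 524085 | 470.08 |
| 1993 | 611.23 | 1068889 | 480.77 |
| 1994 | 834.93 | 1632884 | 529.08 |
| … | … | … | … |
| 2002 | 3120 | 10029357 | 903.14 |

The task is to predict tertiary-industry GDP when capital input is 11738245 and labor input is 987.37.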
The Cobb-Douglas production function corresponds to the model $Y=AK^{\alpha}L^{\beta}$, where:
- $Y$ is tertiary-industry GDP
- $K$ is capital input
- $L$ is labor input
- $A$ is the level of technological progress
- $\alpha$ is the output elasticity of capital
- $\beta$ is the output elasticity of labor
Taking logarithms of both sides turns this nonlinear model into a multiple linear model: $\ln Y=\ln A+\alpha\ln K+\beta\ln L$. Letting $\ln Y=y$, $\ln A=c$, $\ln K=x_1$, $\ln L=x_2$ gives $y=c+\alpha x_1+\beta x_2$.
Below we implement this modelling process in Python:
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_excel(r'G:\第三产业国内生产总值数据表.xlsx')
data1 = data[['第三产业国内生产总值', '资本投入', '从业人员']].apply(lambda x:np.log(x))
X = sm.add_constant(data=data1[['资本投入', '从业人员']].values)
y = data1['第三产业国内生产总值'].values
model = sm.OLS(endog=y,exog=X)
result = model.fit()
print(result.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.988
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 338.6
Date: Tue, 31 May 2022 Prob (F-statistic): 1.86e-08
Time: 23:20:30 Log-Likelihood: 15.034
No. Observations: 11 AIC: -24.07
Df Residuals: 8 BIC: -22.87
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
const 30.5780 0.942 32.449 0.000 28.624 32.532
x1 0.5605 0.066 8.498 0.000 0.424 0.697
==============================================================================
Omnibus: 0.733 Durbin-Watson: 1.698
Prob(Omnibus): 0.693 Jarque-Bera (JB): 0.695
Skew: -0.089 Prob(JB): 0.707
Kurtosis: 2.186 Cond. No. 29.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
'''
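The log-linearization of the Cobb-Douglas function and the back-transformation needed for prediction can be sketched as follows. This is an illustration under stated assumptions, not the article's fitted model: the Excel file is not available here, so the data are synthetic and generated without noise from made-up parameters (A = 2.0, α = 0.4, β = 0.6), and the helper names `fit_cobb_douglas` and `predict_output` are hypothetical. Only the prediction inputs (capital 11738245, labor 987.37) come from the text.

```python
import numpy as np


def fit_cobb_douglas(Y, K, L):
    """Estimate A, alpha, beta in Y = A * K**alpha * L**beta by OLS on logs."""
    X = np.column_stack((np.ones(len(K)), np.log(K), np.log(L)))
    coef, *_ = np.linalg.lstsq(X, np.log(Y), rcond=None)
    c, alpha, beta = coef          # c = ln A
    return np.exp(c), alpha, beta


def predict_output(A, alpha, beta, K, L):
    """Back-transform: evaluate the original multiplicative model."""
    return A * K ** alpha * L ** beta


# Synthetic, noise-free data from assumed parameters (NOT the article's data)
K = np.array([5e5, 1e6, 1.6e6, 3e6, 6e6, 1e7])
L = np.array([470.0, 480.0, 530.0, 600.0, 750.0, 900.0])
Y = 2.0 * K ** 0.4 * L ** 0.6

A_hat, alpha_hat, beta_hat = fit_cobb_douglas(Y, K, L)

# Prediction inputs taken from the question posed in the text
y_new = predict_output(A_hat, alpha_hat, beta_hat, 11738245, 987.37)
print(f"A = {A_hat:.4f}, alpha = {alpha_hat:.4f}, beta = {beta_hat:.4f}")
print(f"predicted output under the assumed parameters: {y_new:.2f}")
```

With the real data, the same prediction is obtained by plugging ln(11738245) and ln(987.37) into the fitted linear equation and exponentiating the result.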
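The least-squares results show that the adjusted R-squared is not very high, but at the significance level $\alpha=0.05$ both the model F-test and the coefficient t-tests pass. We therefore take the linear trend equation of the deseasonalized series to be $\hat{Y}_t=30.5780+0.5605t$.
(4) Forecast with the linear trend equation and compute the final forecasts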
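| Year | Quarter | Sales | 4-term moving average | Centered moving average | Seasonal ratio | Seasonal index | Deseasonalized value | Time | Linear trend forecast | Final forecast | Error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2005 | Q1 | 25 | | | | 0.794106902 | 31.48190746 | 1 | 31.1385 | 24.72729776 | 0.272702238 |
| 2005 | Q2 | 32 | 30 | | | 1.044834676 | 30.62685489 | 2 | 31.699 | 33.12021439 | -1.120214385 |
| 2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 | 1.270404066 | 29.12459193 | 3 | 32.2595 | 40.98259996 | -3.982599964 |
| 2005 | Q4 | 26 | 32.75 | 32 | 0.8125 | 0.890654357 | 29.19202024 | 4 | 32.82 | 29.23127598 | -3.231275983 |
| 2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 | 0.794106902 | 37.77828896 | 5 | 33.3805 | 26.50768544 | 3.492314564 |
| 2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 | 1.044834676 | 36.36939019 | 6 | 33.941 | 35.46273373 | 2.537266272 |
| 2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 | 1.270404066 | 33.06034759 | 7 | 34.5015 | 43.83084588 | -1.83084588 |
| 2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 | 0.890654357 | 33.68310027 | 8 | 35.062 | 31.22812305 | -1.22812305 |
| 2007 | Q1 | 29 | 37 | 36 | 0.805555556 | 0.794106902 | 36.51901266 | 9 | 35.6225 | 28.28807311 | 0.71192689 |
| 2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 | 1.044834676 | 37.3264794 | 10 | 36.183 | 37.80525307 | 1.194746929 |
| 2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 | 1.270404066 | 39.35755666 | 11 | 36.7435 | 46.6790918 | 3.320908205 |
| 2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 | 0.890654357 | 39.29695032 | 12 | 37.304 | 33.22497012 | 1.775029883 |
| 2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 | 0.794106902 | 37.77828896 | 13 | 37.8645 | 30.06846078 | -0.068460784 |
| 2008 | Q2 | 39 | 39.25 | 39 | 1 | 1.044834676 | 37.3264794 | 14 | 38.425 | 40.14777241 | -1.147772414 |
| 2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 | 1.270404066 | 40.14470779 | 15 | 38.9855 | 49.52733771 | 1.472662289 |
| 2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 | 0.890654357 | 41.54249034 | 16 | 39.546 | 35.22181718 | 1.778182815 |
| 2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 | 0.794106902 | 36.51901266 | 17 | 40.1065 | 31.84884846 | -2.848848458 |
| 2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 | 1.044834676 | 40.19774705 | 18 | 40.667 | 42.49029176 | -0.490291757 |
| 2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 | 1.270404066 | 43.29331232 | 19 | 41.2275 | 52.37558363 | 2.624416373 |
| 2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 | 0.890654357 | 42.66526034 | 20 | 41.788 | 37.21866425 | 0.781335748 |
| 2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 | 0.794106902 | 39.03756526 | 21 | 42.3485 | 33.62923613 | -2.629236132 |
| 2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 | 1.044834676 | 41.15483626 | 22 | 42.909 | 44.8328111 | -1.8328111 |
| 2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 | 1.270404066 | 42.50616119 | 23 | 43.4695 | 55.22382954 | -1.223829543 |
| 2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 | 0.890654357 | 46.03357037 | 24 | 44.03 | 39.21551132 | 1.78448868 |
| 2011 | Q1 | | | | | 0.794106902 | | 25 | 44.5905 | 35.40962381 | |
| 2011 | Q2 | | | | | 1.044834676 | | 26 | 45.151 | 47.17533044 | |
| 2011 | Q3 | | | | | 1.270404066 | | 27 | 45.7115 | 58.07207546 | |
| 2011 | Q4 | | | | | 0.890654357 | | 28 | 46.272 | 41.21235839 | |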
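How the columns of this table relate to one another can be sketched in a few lines of Python. The trend equation, the seasonal indices and the 2005 to 2006 sales figures are taken from the table above. The function names are hypothetical, and the sketch assumes the multiplicative form implied by the table, in which the final forecast is the trend forecast multiplied by the seasonal index.

```python
# Seasonal indices for Q1..Q4, copied from the table
SEASONAL_INDEX = {1: 0.794106902, 2: 1.044834676, 3: 1.270404066, 4: 0.890654357}


def linear_trend(t):
    """Trend forecast of the deseasonalized series at time index t."""
    return 30.5780 + 0.5605 * t


def final_forecast(t, quarter):
    """Re-apply the seasonal component: trend forecast times seasonal index."""
    return linear_trend(t) * SEASONAL_INDEX[quarter]


def centered_moving_average(series, i):
    """Centered 4-term moving average at 0-based position i.

    Averages the two 4-term means that straddle position i, so it needs
    the observations from series[i - 2] through series[i + 2].
    """
    first = sum(series[i - 2:i + 2]) / 4
    second = sum(series[i - 1:i + 3]) / 4
    return (first + second) / 2


# Sales for 2005 Q1 .. 2006 Q4, copied from the table
sales = [25, 32, 37, 26, 30, 38, 42, 30]

cma_2005q3 = centered_moving_average(sales, 2)  # table value: 30.625
ratio_2005q3 = sales[2] / cma_2005q3            # table value: 1.208163265

# Forecasts for the four quarters of 2011 (t = 25 .. 28)
forecasts_2011 = [final_forecast(t, q) for t, q in zip(range(25, 29), range(1, 5))]
print(cma_2005q3, round(ratio_2005q3, 9))
print([round(f, 4) for f in forecasts_2011])
```

The error column of the table is the actual sales figure minus the final forecast, so it can only be filled in once the 2011 values are observed.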
Original: https://blog.csdn.net/weixin_45498948/article/details/125368865
Author: statistics_man
Title: 定量预测方法总结及案例实践