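We want to know whether racial discrimination exists in bank loan approval. This is a typical "inference" problem, so we can investigate it with a linear regression classification model. This exercise uses the dataset loanapp.dta; the variables used are explained below:
Dependent variable:
· approve: whether the loan was approved (0 = not approved, 1 = approved)
Independent variables:
· white: race dummy variable (0 = Black, 1 = white)
· obrat: debt ratio
Because the dataset contains missing values, we first drop the samples that contain missing values (not part of the exercise).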
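import pandas as pd
loan=pd.read_stata('loanapp.dta')
# Select the variables we need to form a new dataset
loan=loan[["approve","white","hrat","obrat","loanprc","unem","male","married","dep","sch","cosign","chist","pubrec","mortlat1","mortlat2","vr"]]
# Drop rows with missing values. dropna() returns a new DataFrame that is only displayed here; loan itself is not modified.
loan.dropna()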
| approve | white | hrat | obrat | loanprc | unem | male | married | dep | sch | cosign | chist | pubrec | mortlat1 | mortlat2 | vr |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.0 | 1.0 | 22.540001 | 34.099998 | 0.800000 | 3.2 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 1.0 | 1.0 | 19.000000 | 26.000000 | 0.895105 | 3.9 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 1.0 | 1.0 | 24.000000 | 37.000000 | 0.600000 | 3.1 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
4 | 1.0 | 1.0 | 25.100000 | 32.099998 | 0.895522 | 4.3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 | 1.0 | 1.0 | 21.000000 | 33.000000 | 0.804348 | 3.2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1984 | 1.0 | 1.0 | 20.299999 | 29.299999 | 0.897727 | 4.3 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1985 | 1.0 | 1.0 | 8.000000 | 20.000000 | 0.111111 | 3.2 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1986 | 1.0 | 1.0 | 56.099998 | 60.500000 | 1.000000 | 3.2 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1987 | 1.0 | 1.0 | 16.000000 | 17.000000 | 0.455814 | 3.2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1988 | 0.0 | 0.0 | 31.000000 | 47.000000 | 0.898649 | 3.2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1971 rows × 16 columns
Use Python to work through the following questions.
(1): First consider the linear probability model
$$approve = \beta_0 + \beta_1\, white + u$$
If racial discrimination exists, what sign should $\beta_1$ have?
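(2): Estimate the model above by OLS and interpret the parameter estimates. Is the estimate statistically significant? Is it large in practical terms?
(3): Add all the other independent variables in the dataset to the model. What happens to the coefficient on white? Can we still conclude that there is discrimination against Black applicants?
(4): Allow the race effect to interact with the debt ratio (obrat). Is the interaction effect significant? Interpret this interaction effect.
(5): Re-estimate the model from (4) using a logit model and a probit model, and observe how the coefficients and their significance change.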
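Answer to question (1):
If there is racial discrimination against Black applicants, $\beta_1$ should be positive, and significantly greater than 0.
Answer to question (2):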
import statsmodels.api as sm
loan_ols1=sm.formula.ols('approve~white',data=loan).fit()
print(loan_ols1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: approve R-squared: 0.049
Model: OLS Adj. R-squared: 0.048
Method: Least Squares F-statistic: 102.2
Date: Mon, 25 Jul 2022 Prob (F-statistic): 1.81e-23
Time: 18:04:41 Log-Likelihood: -555.54
No. Observations: 1989 AIC: 1115.
Df Residuals: 1987 BIC: 1126.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.7078 0.018 38.806 0.000 0.672 0.744
white 0.2006 0.020 10.111 0.000 0.162 0.240
==============================================================================
Omnibus: 801.085 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2394.523
Skew: -2.161 Prob(JB): 0.00
Kurtosis: 6.197 Cond. No. 4.90
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
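Because white is the only regressor and it is a 0/1 dummy, the fitted linear probability model reduces to two group approval rates: the intercept is the estimated approval probability when white = 0, and the intercept plus the slope is the estimated probability when white = 1. A small sketch of that arithmetic, using only the coefficients printed above (the variable names are illustrative):

```python
# Coefficients copied from the OLS summary above
intercept = 0.7078   # estimated P(approve = 1 | white = 0)
beta_white = 0.2006  # estimated difference in approval probability

p_nonwhite = intercept
p_white = intercept + beta_white

print(f"Estimated approval probability, white = 0: {p_nonwhite:.4f}")
print(f"Estimated approval probability, white = 1: {p_white:.4f}")
print(f"Gap in percentage points: {100 * beta_white:.2f}")
```

A gap of about 20 percentage points with t = 10.111 is statistically significant, and it is hard to call it small in practical terms.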
Answer to question (3):
formula = 'approve~' + '+'.join(loan.columns.values[1:])
loan_ols2=sm.formula.ols(formula=formula, data=loan).fit()
print(loan_ols2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: approve R-squared: 0.166
Model: OLS Adj. R-squared: 0.159
Method: Least Squares F-statistic: 25.86
Date: Mon, 25 Jul 2022 Prob (F-statistic): 1.84e-66
Time: 18:12:44 Log-Likelihood: -429.26
No. Observations: 1971 AIC: 890.5
Df Residuals: 1955 BIC: 979.9
Df Model: 15
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.9367 0.053 17.763 0.000 0.833 1.040
white 0.1288 0.020 6.529 0.000 0.090 0.168
hrat 0.0018 0.001 1.451 0.147 -0.001 0.004
obrat -0.0054 0.001 -4.930 0.000 -0.008 -0.003
loanprc -0.1473 0.038 -3.926 0.000 -0.221 -0.074
unem -0.0073 0.003 -2.282 0.023 -0.014 -0.001
male -0.0041 0.019 -0.220 0.826 -0.041 0.033
married 0.0458 0.016 2.810 0.005 0.014 0.078
dep -0.0068 0.007 -1.019 0.308 -0.020 0.006
sch 0.0018 0.017 0.105 0.916 -0.031 0.034
cosign 0.0098 0.041 0.238 0.812 -0.071 0.090
chist 0.1330 0.019 6.906 0.000 0.095 0.171
pubrec -0.2419 0.028 -8.571 0.000 -0.297 -0.187
mortlat1 -0.0573 0.050 -1.145 0.252 -0.155 0.041
mortlat2 -0.1137 0.067 -1.698 0.090 -0.245 0.018
vr -0.0314 0.014 -2.241 0.025 -0.059 -0.004
==============================================================================
Omnibus: 685.691 Durbin-Watson: 2.001
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1902.139
Skew: -1.855 Prob(JB): 0.00
Kurtosis: 6.066 Cond. No. 417.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
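After adding all the other variables, the estimated coefficient on white falls from 0.2006 to 0.1288, but it is still positive and statistically significant (t = 6.529, p < 0.001), so we can still conclude that there is discrimination against Black applicants.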
Answer to question (4):
from statsmodels.stats.anova import anova_lm
formula1 = formula + '+I(white*obrat)'
loan_ols3=sm.formula.ols(formula=formula1, data=loan).fit()
print(loan_ols3.summary())
# Compare the two models, with and without the interaction term
anova_lm(loan_ols2, loan_ols3)
OLS Regression Results
==============================================================================
Dep. Variable: approve R-squared: 0.171
Model: OLS Adj. R-squared: 0.164
Method: Least Squares F-statistic: 25.17
Date: Mon, 25 Jul 2022 Prob (F-statistic): 2.37e-68
Time: 18:27:17 Log-Likelihood: -422.99
No. Observations: 1971 AIC: 880.0
Df Residuals: 1954 BIC: 974.9
Df Model: 16
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 1.1806 0.087 13.601 0.000 1.010 1.351
white -0.1460 0.080 -1.819 0.069 -0.303 0.011
hrat 0.0018 0.001 1.421 0.156 -0.001 0.004
obrat -0.0122 0.002 -5.518 0.000 -0.017 -0.008
loanprc -0.1525 0.037 -4.075 0.000 -0.226 -0.079
unem -0.0075 0.003 -2.360 0.018 -0.014 -0.001
male -0.0060 0.019 -0.320 0.749 -0.043 0.031
married 0.0455 0.016 2.800 0.005 0.014 0.077
dep -0.0076 0.007 -1.141 0.254 -0.021 0.005
sch 0.0018 0.017 0.107 0.915 -0.031 0.034
cosign 0.0177 0.041 0.431 0.666 -0.063 0.098
chist 0.1299 0.019 6.754 0.000 0.092 0.168
pubrec -0.2403 0.028 -8.538 0.000 -0.296 -0.185
mortlat1 -0.0628 0.050 -1.258 0.208 -0.161 0.035
mortlat2 -0.1268 0.067 -1.896 0.058 -0.258 0.004
vr -0.0305 0.014 -2.183 0.029 -0.058 -0.003
I(white * obrat) 0.0081 0.002 3.531 0.000 0.004 0.013
==============================================================================
Omnibus: 696.766 Durbin-Watson: 2.007
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1990.841
Skew: -1.869 Prob(JB): 0.00
Kurtosis: 6.205 Cond. No. 853.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
| df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
---|---|---|---|---|---|---|
0 | 1955.0 | 178.393534 | 0.0 | NaN | NaN | NaN |
1 | 1954.0 | 177.262206 | 1.0 | 1.131328 | 12.470879 | 0.000423 |
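The interaction is significant: the F test comparing the models with and without the interaction term gives F = 12.47 with a p-value of 0.000423, and the interaction coefficient itself has t = 3.531.
The interaction coefficient is positive (0.0081), so the estimated effect of white is -0.1460 + 0.0081 × obrat: the approval advantage of white applicants grows as the debt ratio rises, which means white applicants with higher debt ratios are more likely to be approved than otherwise similar Black applicants.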
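To make the interpretation concrete, the sketch below uses only the numbers printed in the regression and ANOVA output above. It first rebuilds the F statistic from the two residual sums of squares, then evaluates the implied effect of white, which in this specification is the white coefficient plus the interaction coefficient times obrat. The obrat values in the loop are illustrative choices, not statistics taken from the data.

```python
# Numbers copied from the output above
ssr_restricted = 178.393534    # model without the interaction (loan_ols2)
ssr_unrestricted = 177.262206  # model with the interaction (loan_ols3)
df_resid_unrestricted = 1954
n_restrictions = 1

# F test for the single added term
f_stat = ((ssr_restricted - ssr_unrestricted) / n_restrictions) / (
    ssr_unrestricted / df_resid_unrestricted
)
print(f"F statistic: {f_stat:.3f}")

beta_white = -0.1460
beta_interaction = 0.0081

def white_effect(obrat):
    """Estimated change in approval probability for white = 1 vs 0 at a given obrat."""
    return beta_white + beta_interaction * obrat

# obrat level at which the estimated effect changes sign
break_even = -beta_white / beta_interaction
print(f"Effect of white is zero at obrat = {break_even:.2f}")

for obrat in (10, 20, 30, 40, 50):  # illustrative values only
    print(f"obrat = {obrat}: effect of white = {white_effect(obrat):+.4f}")
```

With these estimates the effect of white turns positive once obrat exceeds roughly 18 and grows by 0.0081 for each additional point of obrat. The standalone white coefficient of -0.1460 is the estimated effect at obrat = 0, so it should not be read on its own.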
Iris classification is a classic multi-class classification problem. We solve it with sklearn's LogisticRegression.
# Load the dataset
from sklearn.datasets import load_iris
iris_dataset=load_iris()
# Extract the feature matrix and the labels from the dataset
iris_data=iris_dataset['data'] # independent variables (features)
iris_target=iris_dataset['target'] # labels
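Use Python to work through the following questions.
(1): Split the original dataset into a training set and a test set with a sample ratio of 3:1.
(2): Train a logistic regression model on the training set, predict on both the training set and the test set, and store the two sets of predictions in two variables of your own.
(3): Use a function interface to compute the model's classification accuracy on the training set and on the test set, compare which is higher, and think about why they differ.
(4): Give the confusion matrix for the test set, together with a summary report of precision, recall and F-score.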
Answer to question (1):
from sklearn.model_selection import train_test_split
# Split the dataset
X_train,X_test,Y_train,Y_test=train_test_split(iris_data, iris_target, test_size=0.25, random_state=0) # test_size is the fraction of the original data used for the test set
print('train_size',len(X_train)/len(iris_data))
print('test_size',len(X_test)/len(iris_data))
train_size 0.7466666666666667
test_size 0.25333333333333335
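The printed fractions are not exactly 0.75 and 0.25 because 150 samples cannot be split 3:1 into whole numbers. Assuming scikit-learn converts a fractional test_size to a count by rounding up and gives the remainder to the training set (an assumption about its internals that matches the output above), the split can be reproduced as follows:

```python
import math

n_samples = 150   # size of the iris dataset
test_size = 0.25

# Assumption: the test count is rounded up and the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("n_train", n_train, "n_test", n_test)
print("train_size", n_train / n_samples)
print("test_size", n_test / n_samples)
```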
Answer to question (2):
from sklearn.linear_model import LogisticRegression
# Train on the training set
model = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(X_train,Y_train)
# Predict on the training set
train_y_pred=model.predict(X_train)
# Predict on the test set
test_y_pred=model.predict(X_test)
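Answer to question (3):
import numpy as np
from sklearn.metrics import confusion_matrix
confusion_matrix_train = confusion_matrix(Y_train, train_y_pred)
print("Training set accuracy:", np.diagonal(confusion_matrix_train).sum() / np.sum(confusion_matrix_train))
confusion_matrix_test = confusion_matrix(Y_test, test_y_pred)
print("Test set accuracy:", np.diagonal(confusion_matrix_test).sum() / np.sum(confusion_matrix_test))
Training set accuracy: 0.9821428571428571
Test set accuracy: 0.9736842105263158
The training accuracy (0.982) is slightly higher than the test accuracy (0.974). This is expected: the model's parameters were fitted to the training data, while the test set contains samples it has not seen.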
Answer to question (4):
from sklearn.metrics import classification_report
# Confusion matrix
print("Confusion matrix:")
display(confusion_matrix(Y_test, test_y_pred))
print("======================================")
# Summary metrics
print("Summary metrics:")
print(classification_report(Y_test, test_y_pred))
Confusion matrix:
array([[13, 0, 0],
[ 0, 15, 1],
[ 0, 0, 9]], dtype=int64)
======================================
Summary metrics:
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.94 0.97 16
2 0.90 1.00 0.95 9
accuracy 0.97 38
macro avg 0.97 0.98 0.97 38
weighted avg 0.98 0.97 0.97 38
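The figures in the report can be checked by hand from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them in plain Python; the helper name per_class_metrics is only illustrative, and it assumes every class is predicted at least once (otherwise precision would divide by zero).

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = [
    [13, 0, 0],
    [0, 15, 1],
    [0, 0, 9],
]

def per_class_metrics(cm, k):
    """Return (precision, recall, f1, support) for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: samples predicted as k
    support = sum(cm[k])                     # row sum: samples whose true class is k
    precision = tp / predicted_k
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

for k in range(len(cm)):
    p, r, f1, s = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
print(f"accuracy={accuracy:.2f} ({total} samples)")
```

The single off-diagonal entry, one class-1 sample predicted as class 2, is what lowers the recall of class 1 and the precision of class 2.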