
Datawhale & Git-Model: Classification Analysis and Model Diagnostics

楚举
2023-12-01

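Assignments

Assignment 1: binary classification (credit risk assessment)

We want to know whether racial discrimination exists in bank loan approvals. This is a typical "inference" problem, so we study it with a linear regression model used as a classifier (a linear probability model). The exercise uses the dataset loanapp.dta, with the following variables:

Dependent variable:


· approve: whether the loan was approved (0 = not approved, 1 = approved)

Independent variables:


· white: race dummy (0 = Black, 1 = white)


· obrat: debt ratio

Because the dataset contains missing values, we first look at the samples that remain once rows with missing values are removed (not part of the exercise).

import pandas as pd

loan=pd.read_stata('loanapp.dta')
# Select the variables to use and build a new dataset
loan=loan[["approve","white","hrat","obrat","loanprc","unem","male","married","dep","sch","cosign","chist","pubrec","mortlat1","mortlat2","vr"]]
# Show the rows without missing values; dropna() returns a new DataFrame
# and the result is not assigned here, so loan itself is left unchanged
loan.dropna()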
      approve  white       hrat      obrat   loanprc  unem  male  married  dep  sch  cosign  chist  pubrec  mortlat1  mortlat2   vr
1         0.0    1.0  22.540001  34.099998  0.800000   3.2   1.0      1.0  1.0  1.0     0.0    1.0     0.0       0.0       0.0  1.0
2         1.0    1.0  19.000000  26.000000  0.895105   3.9   1.0      0.0  0.0  1.0     0.0    1.0     0.0       0.0       0.0  0.0
3         1.0    1.0  24.000000  37.000000  0.600000   3.1   1.0      1.0  0.0  1.0     0.0    0.0     1.0       0.0       0.0  1.0
4         1.0    1.0  25.100000  32.099998  0.895522   4.3   1.0      1.0  0.0  0.0     0.0    1.0     0.0       0.0       0.0  0.0
5         1.0    1.0  21.000000  33.000000  0.804348   3.2   1.0      0.0  0.0  0.0     0.0    0.0     0.0       0.0       0.0  0.0
...       ...    ...        ...        ...       ...   ...   ...      ...  ...  ...     ...    ...     ...       ...       ...  ...
1984      1.0    1.0  20.299999  29.299999  0.897727   4.3   1.0      1.0  0.0  1.0     0.0    1.0     0.0       0.0       0.0  1.0
1985      1.0    1.0   8.000000  20.000000  0.111111   3.2   1.0      1.0  0.0  1.0     0.0    1.0     0.0       0.0       0.0  0.0
1986      1.0    1.0  56.099998  60.500000  1.000000   3.2   1.0      1.0  0.0  0.0     0.0    1.0     0.0       0.0       0.0  0.0
1987      1.0    1.0  16.000000  17.000000  0.455814   3.2   1.0      0.0  0.0  1.0     0.0    1.0     0.0       0.0       0.0  1.0
1988      0.0    0.0  31.000000  47.000000  0.898649   3.2   0.0      0.0  0.0  1.0     0.0    1.0     0.0       0.0       0.0  1.0

1971 rows × 16 columns
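Note that `DataFrame.dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from loanapp.dta):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); two rows contain a missing entry
toy = pd.DataFrame({
    "approve": [1.0, 0.0, 1.0, 1.0],
    "white":   [1.0, 0.0, np.nan, 1.0],
    "obrat":   [34.1, 47.0, 26.0, np.nan],
})

cleaned = toy.dropna()   # new DataFrame without the incomplete rows

print(len(toy))          # the original frame is unchanged
print(len(cleaned))      # only the complete rows remain
```

This is consistent with the regression output further down: the first model reports 1989 observations and the full model 1971, because `loan` still holds the incomplete rows and the statsmodels formula interface drops missing rows separately for each model.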

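Work through the following questions in Python.

(1): First consider the linear probability model
$approve = \beta_0 + \beta_1 white + u$
If there is racial discrimination, what sign should $\beta_1$ have?

(2): Estimate the model above by OLS and interpret the parameter estimate. Is it statistically significant? Is it large in practical terms?

(3): Add all the other independent variables in the dataset to the model. What happens to the coefficient on white? Can we still conclude that there is discrimination against Black applicants?

(4): Allow the race effect to interact with the debt ratio (obrat). Is the interaction effect significant? Interpret it.

(5): Re-estimate the model from (4) using logit and probit models, and observe how the coefficients and their significance change.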

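Answer to question (1):
If there is discrimination against Black applicants, $\beta_1$ should be positive (and statistically significant).

Answer to question (2):
In the OLS fit below, the coefficient on white is 0.2006 (t = 10.11, p < 0.001). The estimated approval probability is 0.7078 for Black applicants and about 20 percentage points higher (0.9084) for white applicants, a difference that is both statistically significant and large in practical terms.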

import statsmodels.api as sm

loan_ols1=sm.formula.ols('approve~white',data=loan).fit()
print(loan_ols1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                approve   R-squared:                       0.049
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     102.2
Date:                Mon, 25 Jul 2022   Prob (F-statistic):           1.81e-23
Time:                        18:04:41   Log-Likelihood:                -555.54
No. Observations:                1989   AIC:                             1115.
Df Residuals:                    1987   BIC:                             1126.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.7078      0.018     38.806      0.000       0.672       0.744
white          0.2006      0.020     10.111      0.000       0.162       0.240
==============================================================================
Omnibus:                      801.085   Durbin-Watson:                   2.002
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2394.523
Skew:                          -2.161   Prob(JB):                         0.00
Kurtosis:                       6.197   Cond. No.                         4.90
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
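In a linear probability model with a single dummy regressor, the OLS intercept equals the approval rate of the white = 0 group and the slope equals the difference in approval rates between the two groups, which is why the sign of $\beta_1$ answers question (1). The sketch below checks this identity on synthetic data; the sample size and the approval probabilities 0.7 and 0.9 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a 0/1 group indicator and a 0/1 outcome
white = rng.integers(0, 2, size=500)
p = np.where(white == 1, 0.9, 0.7)             # assumed approval probabilities
approve = (rng.random(500) < p).astype(float)

# OLS of approve on a constant and the dummy
X = np.column_stack([np.ones_like(approve), white])
beta, *_ = np.linalg.lstsq(X, approve, rcond=None)

mean_0 = approve[white == 0].mean()
mean_1 = approve[white == 1].mean()

print(beta[0], mean_0)             # intercept vs. approval rate when white == 0
print(beta[1], mean_1 - mean_0)    # slope vs. difference in approval rates
```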

Answer to question (3):

formula = 'approve~' + '+'.join(loan.columns.values[1:])

loan_ols2=sm.formula.ols(formula=formula, data=loan).fit()

print(loan_ols2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                approve   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.159
Method:                 Least Squares   F-statistic:                     25.86
Date:                Mon, 25 Jul 2022   Prob (F-statistic):           1.84e-66
Time:                        18:12:44   Log-Likelihood:                -429.26
No. Observations:                1971   AIC:                             890.5
Df Residuals:                    1955   BIC:                             979.9
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9367      0.053     17.763      0.000       0.833       1.040
white          0.1288      0.020      6.529      0.000       0.090       0.168
hrat           0.0018      0.001      1.451      0.147      -0.001       0.004
obrat         -0.0054      0.001     -4.930      0.000      -0.008      -0.003
loanprc       -0.1473      0.038     -3.926      0.000      -0.221      -0.074
unem          -0.0073      0.003     -2.282      0.023      -0.014      -0.001
male          -0.0041      0.019     -0.220      0.826      -0.041       0.033
married        0.0458      0.016      2.810      0.005       0.014       0.078
dep           -0.0068      0.007     -1.019      0.308      -0.020       0.006
sch            0.0018      0.017      0.105      0.916      -0.031       0.034
cosign         0.0098      0.041      0.238      0.812      -0.071       0.090
chist          0.1330      0.019      6.906      0.000       0.095       0.171
pubrec        -0.2419      0.028     -8.571      0.000      -0.297      -0.187
mortlat1      -0.0573      0.050     -1.145      0.252      -0.155       0.041
mortlat2      -0.1137      0.067     -1.698      0.090      -0.245       0.018
vr            -0.0314      0.014     -2.241      0.025      -0.059      -0.004
==============================================================================
Omnibus:                      685.691   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1902.139
Skew:                          -1.855   Prob(JB):                         0.00
Kurtosis:                       6.066   Cond. No.                         417.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

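After adding all the other variables, the estimated coefficient on white falls to 0.1288 but remains positive and significant (t = 6.53, p < 0.001), so we can still conclude that there is evidence of discrimination against Black applicants.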

Answer to question (4):

from statsmodels.stats.anova import anova_lm

formula1 = formula + '+I(white*obrat)'

loan_ols3=sm.formula.ols(formula=formula1, data=loan).fit()

print(loan_ols3.summary())

# Compare the models with and without the interaction term
anova_lm(loan_ols2, loan_ols3)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                approve   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.164
Method:                 Least Squares   F-statistic:                     25.17
Date:                Mon, 25 Jul 2022   Prob (F-statistic):           2.37e-68
Time:                        18:27:17   Log-Likelihood:                -422.99
No. Observations:                1971   AIC:                             880.0
Df Residuals:                    1954   BIC:                             974.9
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            1.1806      0.087     13.601      0.000       1.010       1.351
white               -0.1460      0.080     -1.819      0.069      -0.303       0.011
hrat                 0.0018      0.001      1.421      0.156      -0.001       0.004
obrat               -0.0122      0.002     -5.518      0.000      -0.017      -0.008
loanprc             -0.1525      0.037     -4.075      0.000      -0.226      -0.079
unem                -0.0075      0.003     -2.360      0.018      -0.014      -0.001
male                -0.0060      0.019     -0.320      0.749      -0.043       0.031
married              0.0455      0.016      2.800      0.005       0.014       0.077
dep                 -0.0076      0.007     -1.141      0.254      -0.021       0.005
sch                  0.0018      0.017      0.107      0.915      -0.031       0.034
cosign               0.0177      0.041      0.431      0.666      -0.063       0.098
chist                0.1299      0.019      6.754      0.000       0.092       0.168
pubrec              -0.2403      0.028     -8.538      0.000      -0.296      -0.185
mortlat1            -0.0628      0.050     -1.258      0.208      -0.161       0.035
mortlat2            -0.1268      0.067     -1.896      0.058      -0.258       0.004
vr                  -0.0305      0.014     -2.183      0.029      -0.058      -0.003
I(white * obrat)     0.0081      0.002      3.531      0.000       0.004       0.013
==============================================================================
Omnibus:                      696.766   Durbin-Watson:                   2.007
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1990.841
Skew:                          -1.869   Prob(JB):                         0.00
Kurtosis:                       6.205   Cond. No.                         853.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
   df_resid         ssr  df_diff   ss_diff          F    Pr(>F)
0    1955.0  178.393534      0.0       NaN        NaN       NaN
1    1954.0  177.262206      1.0  1.131328  12.470879  0.000423
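With the interaction term, the estimated race gap is no longer a single number: it equals the white coefficient plus the interaction coefficient times obrat. The short calculation below only re-uses the coefficients printed in the summary above; the obrat values 10 to 40 are illustrative points on the scale used in the data.

```python
# Coefficients copied from the interaction-model summary above
b_white = -0.1460
b_obrat = -0.0122
b_inter = 0.0081

def white_gap(obrat):
    """Estimated approval-probability difference (white minus Black) at a given obrat."""
    return b_white + b_inter * obrat

for obrat in (10, 20, 30, 40):     # illustrative debt ratios
    print(obrat, round(white_gap(obrat), 4))

# Slope of the approval probability in obrat for each group
slope_black = b_obrat
slope_white = b_obrat + b_inter
print(slope_black, round(slope_white, 4))

# With a single added regressor, the ANOVA F statistic is the square of its t statistic
print(round(3.531 ** 2, 2))
```

The white coefficient on its own (-0.1460) is the gap at obrat = 0, which lies outside the range of realistic debt ratios, so its sign should not be read as the overall race effect.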

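The interaction is significant: its coefficient is 0.0081 (t = 3.53), and the F-test comparing the models with and without it gives p = 0.000423.
The positive sign means that a higher debt ratio lowers the approval probability less for white applicants than for Black applicants, so the approval advantage of white applicants widens as obrat rises.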
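Question (5) is not worked through in this write-up. As a rough sketch of how it could be approached, the statsmodels formula interface offers `logit` and `probit` alongside `ols`. The example below runs on a synthetic stand-in for the loan data: the frame `demo`, its two regressors, and the data-generating coefficients are all assumptions made so that the block is self-contained. With the real data one would pass the full formula and `loan` instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the loan data (hypothetical values, only two regressors)
demo = pd.DataFrame({
    "white": rng.integers(0, 2, size=n).astype(float),
    "obrat": rng.uniform(5, 60, size=n),
})
# Assumed data-generating process, chosen only so the models have something to fit
index = (2.5 - 0.5 * demo["white"] - 0.06 * demo["obrat"]
         + 0.03 * demo["white"] * demo["obrat"])
demo["approve"] = (rng.random(n) < 1 / (1 + np.exp(-index))).astype(float)

spec = "approve ~ white + obrat + I(white*obrat)"
logit_fit = smf.logit(spec, data=demo).fit(disp=0)
probit_fit = smf.probit(spec, data=demo).fit(disp=0)

# Coefficients are on the log-odds (logit) or z-score (probit) scale, not probabilities
print(logit_fit.params)
print(probit_fit.params)
print(logit_fit.pvalues)
```

Because the three models use different scales, it is the signs and significance levels that are directly comparable with the linear probability model, not the coefficient magnitudes.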

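Assignment 2: multi-class classification (the iris problem)

Iris classification is a classic multi-class problem. We solve it with sklearn's LogisticRegression.

# Load the dataset
from sklearn.datasets import load_iris
iris_dataset=load_iris()

# Extract the features and the labels from the dataset
iris_data=iris_dataset['data'] # features
iris_target=iris_dataset['target'] # labels

Work through the following questions in Python.

(1): Split the original dataset into a training set and a test set with a 3:1 sample ratio.

(2): Train a logistic regression model on the training set, predict on both the training set and the test set, and store the two sets of predictions in two variables of your own.

(3): Use a function interface to compute the model's classification accuracy on the training set and on the test set, compare which is higher, and think about why they differ.

(4): Give the confusion matrix for the test set, together with a summary report of precision, recall, and F-score.

Answer to question (1):

from sklearn.model_selection import train_test_split

# Split the dataset
X_train,X_test,Y_train,Y_test=train_test_split(iris_data, iris_target, test_size=0.25, random_state=0) # test_size is the share of the original data used for the test set

print('train_size',len(X_train)/len(iris_data))
print('test_size',len(X_test)/len(iris_data))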
train_size 0.7466666666666667
test_size 0.25333333333333335
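The shares printed above are not exactly 0.75 and 0.25 because 150 × 0.25 = 37.5 is not a whole number: `train_test_split` rounds the test count up and gives the remaining samples to the training set. A small check, using the same `test_size` and `random_state` as above (the variable names are local to this sketch):

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
n = len(X)                                    # 150 samples

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(math.ceil(0.25 * n))                    # 37.5 rounded up
print(len(X_train), len(X_test))
```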

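Answer to question (2):

from sklearn.linear_model import LogisticRegression

# Fit on the training set
# (multi_class is deprecated in scikit-learn 1.5 and later, where multinomial
# is already what is used for multi-class targets)
model = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(X_train,Y_train)

# Predict on the training set
train_y_pred=model.predict(X_train)

# Predict on the test set
test_y_pred=model.predict(X_test)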

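Answer to question (3):

import numpy as np
from sklearn.metrics import confusion_matrix

# Accuracy = correctly classified samples (the diagonal) / all samples
confusion_matrix_train = confusion_matrix(Y_train, train_y_pred)
print("Training set:", np.diagonal(confusion_matrix_train).sum() / np.sum(confusion_matrix_train))

confusion_matrix_test = confusion_matrix(Y_test, test_y_pred)
print("Test set:", np.diagonal(confusion_matrix_test).sum() / np.sum(confusion_matrix_test))
Training set: 0.9821428571428571
Test set: 0.9736842105263158

Training accuracy (110 of 112 correct) is slightly higher than test accuracy (37 of 38 correct). This is expected, since the model was fit on the training data, and with only 38 test samples the gap is small relative to the effect of a single misclassified flower.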
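Question (3) asks for a function interface. One option is `accuracy_score` from `sklearn.metrics`, or equivalently the estimator's own `score` method. The sketch below rebuilds the split and the model so that it runs on its own; it uses the default `LogisticRegression` settings apart from `max_iter`, which is an assumption and may not reproduce the numbers above digit for digit on every scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# accuracy_score compares true labels with predictions
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# model.score performs the prediction and the comparison in one call
print(train_acc, model.score(X_train, y_train))
print(test_acc, model.score(X_test, y_test))
```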

Answer to question (4):

from sklearn.metrics import classification_report

# Confusion matrix (display() is provided by Jupyter/IPython; use print() in a plain script)
print("Confusion matrix:")
display(confusion_matrix(Y_test, test_y_pred))

print("======================================")

# Summary metrics
print("Summary metrics:")
print(classification_report(Y_test, test_y_pred))
Confusion matrix:

array([[13,  0,  0],
       [ 0, 15,  1],
       [ 0,  0,  9]], dtype=int64)

======================================
Summary metrics:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.94      0.97        16
           2       0.90      1.00      0.95         9

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38
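The per-class figures in the report follow directly from the confusion matrix: precision divides each diagonal entry by its column sum, recall divides it by its row sum, and the F1 score is their harmonic mean. A short check using the matrix printed above:

```python
import numpy as np

# Test-set confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[13, 0, 0],
               [0, 15, 1],
               [0, 0, 9]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)    # correct predictions / all predictions of that class
recall = tp / cm.sum(axis=1)       # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 2))
```

The single error is a class-1 flower predicted as class 2, which lowers the recall of class 1 (15/16) and the precision of class 2 (9/10) while leaving class 0 unaffected.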