
Python Third-Party Modules, Statistics 1: the statsmodels Module (1): Introduction and Regression

齐昊焱
2023-12-01

Official documentation: https://www.statsmodels.org/stable/user-guide.html and https://www.statsmodels.org/stable/api.html

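I. Overview
1. Introduction
(1) Introduction:

See: https://zhuanlan.zhihu.com/p/91384305

statsmodels is a Python module for statistical analysis. It originated with Jonathan Taylor, a statistics professor at Stanford University, and was formally established as a project in 2010 by Skipper Seabold and Josef Perktold. It implements many classical algorithms from statistics and econometrics, mainly:
① Regression models: linear regression, generalized linear models, robust linear models, linear mixed-effects models, etc.
② Analysis of variance (ANOVA)
③ Time series analysis and state space models: AR, ARMA, ARIMA, VAR, etc.
④ Generalized method of moments (GMM)
⑤ Nonparametric methods: kernel density estimation, kernel regression
⑥ Visualization of statistical model results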

(2) Project structure (excerpt):

statsmodels/
    __init__.py
    api.py
    discrete/
        __init__.py
        discrete_model.py
        tests/
            results/
    tsa/
        __init__.py
        api.py
        tsatools.py
        stattools.py
        arima_process.py
        vector_ar/
            __init__.py
            var_model.py
            tests/
                results/
        tests/
            results/
    stats/
        __init__.py
        api.py
        stattools.py
        tests/
    tools/
        __init__.py
        tools.py
        decorators.py
        tests/

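2. Relationship to other modules:

① patsy: inspired by R's formula system, Nathaniel Smith created the patsy project. It provides the formula/model specification framework used by statsmodels
② scikit-learn: statsmodels focuses more on statistical inference, whereas sklearn focuses more on prediction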

3. Installation and import
(1) Installation:

pip install statsmodels

(2) Import:

① For interactive use, importing the API modules is recommended:
import statsmodels.api as sm
import statsmodels.tsa.api as tsa
② Import methods/models/submodules directly:
from statsmodels.regression.linear_model import OLS,WLS
from statsmodels.datasets import macrodata
import statsmodels.regression.linear_model as lm

(3) List the available functions/classes:

>>> dir(sm)
['BayesGaussMI', 'BinomialBayesMixedGLM'...'webdoc']
>>> dir(sm.tsa)
['AR', 'ARIMA'...'x13_arima_select_order']

(4) Comparison of the import approaches:

See: https://www.statsmodels.org/stable/api-structure.html#import-paths-and-structure

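II. Cross-Sectional Study
1. API

#Conventionally imported as sm:
import statsmodels.api as sm
#Notes:
①This interface is recommended for interactive use
②These classes/functions are actually defined elsewhere; sm only re-exports them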

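(1) Regression:

"Ordinary Least Squares" (OLS): class sm.OLS(<endog>,<exog>[,missing='none',hasconst=None,**kwargs])
  #Actually defined as class statsmodels.regression.linear_model.OLS
  #Parameters:
    endog: the y values of the data points (the dependent variable); a 1-D array-like
    exog: the x values of the data points (the regressors); an n×k array-like, where n=len(<endog>) and k is the number of features
      #An intercept is not included by default; add one with sm.add_constant() if needed
    missing: how missing values are handled; "none" (no NaN check is performed)/"drop" (observations containing NaN are dropped)/"raise" (an error is raised)
    hasconst: whether exog already includes a user-supplied constant column; None/bool
      Indicates whether the RHS includes a user-supplied constant. If True, a constant is not checked for and
      k_constant is set to 1 and all result statistics are calculated as if a constant is present. If False,
      a constant is not checked for and k_constant is set to 0
    kwargs: extra arguments used to set model properties when using the formula interface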

######################################################################################################################

"Generalized Least Squares" (GLS): class sm.GLS(<endog>,<exog>[,sigma=None,missing='none',hasconst=None,**kwargs])
  #Actually defined as class statsmodels.regression.linear_model.GLS
  #Parameters: the remaining parameters are the same as for sm.OLS
    sigma: the weighting matrix of the covariance (the error covariance structure); None/scalar/array
      #The default is None for no scaling
      #If sigma is a scalar, it is assumed that sigma is an n x n diagonal matrix with the given scalar, sigma as the
      #value of each diagonal element
      #If sigma is an n-length vector, then sigma is assumed to be a diagonal matrix with the given sigma on the
      #diagonal
      #This should be the same as WLS

######################################################################################################################

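"Generalized Least Squares with AR covariance structure" (GLSAR): class sm.GLSAR(<endog>,<exog>[,rho=1,missing='none',hasconst=None,**kwargs])
  #Actually defined as class statsmodels.regression.linear_model.GLSAR
  #Parameters: the remaining parameters are the same as for sm.OLS
    rho: the order of the autoregressive covariance; an int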

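######################################################################################################################

"Weighted Least Squares" (WLS): class sm.WLS(<endog>,<exog>[,weights=1.0,missing='none',hasconst=None,**kwargs])
  #Actually defined as class statsmodels.regression.linear_model.WLS
  #Parameters: the remaining parameters are the same as for sm.OLS
    weights: the weights; a scalar/1-D array-like
      #The weights are presumed to be (proportional to) the inverse of the variance of the observations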

######################################################################################################################

"Recursive Least Squares": class sm.RecursiveLS(<endog>,<exog>[,constraints=None,**kwargs])
  #Actually defined as class statsmodels.regression.recursive_ls.RecursiveLS

######################################################################################################################

"Rolling Ordinary Least Squares": class sm.RollingOLS(<endog>,<exog>[,window=None,min_nobs=None,missing='drop',expanding=False])
  #Actually defined as class statsmodels.regression.rolling.RollingOLS

######################################################################################################################

"Rolling Weighted Least Squares": class sm.RollingWLS(<endog>,<exog>[,window=None,weights=None,min_nobs=None,missing='drop',expanding=False])
  #Actually defined as class statsmodels.regression.rolling.RollingWLS

(2) Handling missing values (Imputation):

"Bayesian imputation using a Gaussian model": class sm.BayesGaussMI(<data>[,mean_prior=None,cov_prior=None,cov_prior_df=1])
  #Actually defined as class statsmodels.imputation.bayes_mi.BayesGaussMI
"Generalized linear mixed model with Bayesian estimation" (binomial family): class sm.BinomialBayesMixedGLM(<endog>,<exog>,<exog_vc>,<ident>[,vcp_p=1,fe_p=2,fep_names=None,vcp_names=None,vc_names=None])
  #Actually defined as class statsmodels.genmod.bayes_mixed_glm.BinomialBayesMixedGLM
  #Note: this is a model class rather than an imputation tool; the official API reference groups it with the generalized linear models
"Factor analysis": class sm.Factor([endog=None,n_factor=1,corr=None,method='pa',smc=True,endog_names=None,nobs=None,missing='drop'])
  #Actually defined as class statsmodels.multivariate.factor.Factor
  #Note: this is a multivariate model rather than an imputation tool; the official API reference groups it with the multivariate models
"Multiple imputation" (MI) using a specified imputer: class sm.MI(<imp>,<model>[,model_args_fn=None,model_kwds_fn=None,formula=None,fit_args=None,fit_kwds=None,xfunc=None,burn=100,nrep=20,skip=10])
  #Actually defined as class statsmodels.imputation.bayes_mi.MI
Multiple imputation with "chained equations" (MICE): class sm.MICE(<model_formula>,<model_class>,<data>[,n_skip=3,init_kwds=None,fit_kwds=None])
  #Actually defined as class statsmodels.imputation.mice.MICE
Wrap a data set so that its missing values can be handled by sm.MICE: class sm.MICEData(<data>[,perturbation_method='gaussian',k_pmm=20,history_callback=None])
  #Actually defined as class statsmodels.imputation.mice.MICEData

(3) Generalized Estimating Equations (GEE):

"Marginal regression model using GEE": class sm.GEE(<endog>,<exog>,<groups>[,time=None,family=None,cov_struct=None,missing='none',offset=None,exposure=None,dep_data=None,constraint=None,update_dep=True,weights=None,**kwargs])
  #Actually defined as class statsmodels.genmod.generalized_estimating_equations.GEE
"Nominal response marginal regression model using GEE": class sm.NominalGEE(<endog>,<exog>,<groups>[,time=None,family=None,cov_struct=None,missing='none',offset=None,dep_data=None,constraint=None,**kwargs])
  #Actually defined as class statsmodels.genmod.generalized_estimating_equations.NominalGEE
"Ordinal response marginal regression model using GEE": class sm.OrdinalGEE(<endog>,<exog>,<groups>[,time=None,family=None,cov_struct=None,missing='none',offset=None,dep_data=None,constraint=None,**kwargs])
  #Actually defined as class statsmodels.genmod.generalized_estimating_equations.OrdinalGEE

(4) Generalized Linear Models (GLM):

"Generalized linear models" (GLM): class sm.GLM(<endog>,<exog>[,family=None,offset=None,exposure=None,freq_weights=None,var_weights=None,missing='none',**kwargs])
  #Actually defined as class statsmodels.genmod.generalized_linear_model.GLM
"Generalized additive models" (GAM): class sm.GLMGam(<endog>,<exog>[,smoother=None,alpha=0,family=None,offset=None,exposure=None,missing='none',**kwargs])
  #Actually defined as class statsmodels.gam.generalized_additive_model.GLMGam
"Generalized linear mixed model with Bayesian estimation" (Poisson family): class sm.PoissonBayesMixedGLM(<endog>,<exog>,<exog_vc>,<ident>[,vcp_p=1,fe_p=2,fep_names=None,vcp_names=None,vc_names=None])
  #Actually defined as class statsmodels.genmod.bayes_mixed_glm.PoissonBayesMixedGLM

(5) Discrete and Count Models:

"Generalized Poisson model": class sm.GeneralizedPoisson(<endog>,<exog>[,p=1,offset=None,exposure=None,missing='none',check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.GeneralizedPoisson
"Logit model": class sm.Logit(<endog>,<exog>[,check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.Logit
"Multinomial logit model": class sm.MNLogit(<endog>,<exog>[,check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.MNLogit
"Poisson model": class sm.Poisson(<endog>,<exog>[,offset=None,exposure=None,missing='none',check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.Poisson
"Probit model": class sm.Probit(<endog>,<exog>[,check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.Probit
"Negative binomial model": class sm.NegativeBinomial(<endog>,<exog>[,loglike_method='nb2',offset=None,exposure=None,missing='none',check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.NegativeBinomial
"Generalized negative binomial (NB-P) model": class sm.NegativeBinomialP(<endog>,<exog>[,p=2,offset=None,exposure=None,missing='none',check_rank=True,**kwargs])
  #Actually defined as class statsmodels.discrete.discrete_model.NegativeBinomialP
"Zero-inflated generalized Poisson model": class sm.ZeroInflatedGeneralizedPoisson(<endog>,<exog>[,exog_infl=None,offset=None,exposure=None,inflation='logit',p=2,missing='none',**kwargs])
  #Actually defined as class statsmodels.discrete.count_model.ZeroInflatedGeneralizedPoisson
"Zero-inflated generalized negative binomial model": class sm.ZeroInflatedNegativeBinomialP(<endog>,<exog>[,exog_infl=None,offset=None,exposure=None,inflation='logit',p=2,missing='none',**kwargs])
  #Actually defined as class statsmodels.discrete.count_model.ZeroInflatedNegativeBinomialP
"Zero-inflated Poisson model": class sm.ZeroInflatedPoisson(<endog>,<exog>[,exog_infl=None,offset=None,exposure=None,inflation='logit',missing='none',**kwargs])
  #Actually defined as class statsmodels.discrete.count_model.ZeroInflatedPoisson

(6) Multivariate Models:

"Multivariate analysis of variance" (MANOVA): class sm.MANOVA(<endog>,<exog>[,missing='none',hasconst=None,**kwargs])
  #Actually defined as class statsmodels.multivariate.manova.MANOVA
"Principal component analysis" (PCA): class sm.PCA(<data>[,ncomp=None,standardize=True,demean=True,normalize=True,gls=False,weights=None,method='svd',missing=None,tol=5e-08,max_iter=1000,tol_em=5e-08,max_em_iter=100])
  #Actually defined as class statsmodels.multivariate.pca.PCA

(7) Misc Models:

"Linear mixed effects model": class sm.MixedLM(<endog>,<exog>,<groups>[,exog_re=None,exog_vc=None,use_sqrt=True,missing='none',**kwargs])
  #Actually defined as class statsmodels.regression.mixed_linear_model.MixedLM
"Cox proportional hazards regression model": class sm.PHReg(<endog>,<exog>[,status=None,entry=None,strata=None,offset=None,ties='breslow',missing='drop',**kwargs])
  #Actually defined as class statsmodels.duration.hazard_regression.PHReg
"Quantile regression": class sm.QuantReg(<endog>,<exog>[,**kwargs])
  #Actually defined as class statsmodels.regression.quantile_regression.QuantReg
"Robust linear model": class sm.RLM(<endog>,<exog>[,M=None,missing='none',**kwargs])
  #Actually defined as class statsmodels.robust.robust_linear_model.RLM
"Estimation and inference for a survival function": class sm.SurvfuncRight(<time>,<status>[,entry=None,title=None,freq_weights=None,exog=None,bw_factor=1.0])
  #Actually defined as class statsmodels.duration.survfunc.SurvfuncRight

(8) Graphics:

Q-Q and P-P Probability Plots: class sm.ProbPlot(<data>[,dist=<scipy.stats._continuous_distns.norm_gen object>,fit=False,distargs=(),a=0,loc=0,scale=1])
  #Actually defined as class statsmodels.graphics.gofplots.ProbPlot
Plot a reference line for a qqplot: sm.qqline(<ax>,<line>[,x=None,y=None,dist=None,fmt='r-',**lineoptions])
  #Actually defined as statsmodels.graphics.gofplots.qqline()
Q-Q plot of the quantiles of x versus the quantiles/ppf of a distribution: sm.qqplot(<data>[,dist=<scipy.stats._continuous_distns.norm_gen object>,distargs=(),a=0,loc=0,scale=1,fit=False,line=None,ax=None,**plotkwargs])
  #Actually defined as statsmodels.graphics.gofplots.qqplot()
Q-Q Plot of two samples' quantiles: sm.qqplot_2samples(<data1>,<data2>[,xlabel=None,ylabel=None,line=None,ax=None])
  #Actually defined as statsmodels.graphics.gofplots.qqplot_2samples()

(9) Tools:

Run the test suite: sm.test([extra_args=None,exit=False])
  #Actually defined as statsmodels.__init__.test()
Add a column of ones to an array: sm.add_constant(<data>[,prepend=True,has_constant='skip'])
  #Actually defined as statsmodels.tools.tools.add_constant()
Load a previously saved object: sm.load_pickle(<fname>)
  #Actually defined as statsmodels.iolib.smpickle.load_pickle()
List the versions of statsmodels and any installed dependencies: sm.show_versions([show_dirs=True])
  #Actually defined as statsmodels.tools.print_version.show_versions()
Opens a browser and displays online documentation: sm.webdoc([func=None,stable=None])
  #Actually defined as statsmodels.tools.web.webdoc()

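2. Model methods:

#Using OLS as the example; other models are similar (below, OLS stands for a model instance such as sm.OLS(<endog>,<exog>)):
Fit the model: [<RR>=]OLS.fit([method='pinv',cov_type='nonrobust',cov_kwds=None,use_t=None,**kwargs])
  #Parameters:
    method: how the least-squares problem is solved; "pinv" (Moore-Penrose pseudoinverse)/"qr" (QR factorization)
    RR: the returned fit results; a class statsmodels.regression.linear_model.RegressionResults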

###########################################################################################################

OLS.fit_regularized([method='elastic_net',alpha=0.0,L1_wt=1.0,start_params=None,profile_scale=False,refit=False,**kwargs])
OLS.from_formula(<formula>,<data>[,subset=None,drop_cols=None,*args,**kwargs])
OLS.get_distribution(<params>,<scale>[,exog=None,dist_class=None])
OLS.hessian(<params>[,scale=None])
OLS.hessian_factor(<params>[,scale=None,observed=True])
OLS.information(<params>)
OLS.initialize()
OLS.loglike(<params>[,scale=None])
OLS.predict(<params>[,exog=None])
OLS.score(<params>[,scale=None])
OLS.whiten(<x>)

3. Regression results
(1) Regression results:

The regression results returned by .fit() are a class statsmodels.regression.linear_model.RegressionResults

(2) Methods:

F-test of the restrictions of a nested model: <RR>.compare_f_test(<restricted>)
"Lagrange multiplier test" of the linear restrictions of a nested model: <RR>.compare_lm_test(<restricted>[,demean=True,use_lr=False])
"Likelihood ratio test" of the restrictions of a nested model: <RR>.compare_lr_test(<restricted>[,large_sample=False])
Confidence intervals of the fitted parameters: <RR>.conf_int([alpha=0.05,cols=None])
(Co)variance matrix of the fitted parameters: <RR>.cov_params([r_matrix=None,column=None,scale=None,cov_p=None,other=None])
F-test of a joint linear hypothesis: <RR>.f_test(<r_matrix>[,cov_p=None,invcov=None])
  #Can be used for the F-test of overall model significance
Compute prediction results: <RR>.get_prediction([exog=None,transform=True,weights=None,row_labels=None,**kwargs])
Create new results instance with robust covariance as default: <RR>.get_robustcov_results([cov_type='HC1',use_t=None,**kwargs])
Compute an information criterion for the model: <RR>.info_criteria(<crit>[,dk_params=0])
Initialize (possibly re-initialize) a Results instance: <RR>.initialize(<model>,<params>[,**kwargs])
Load a pickled results instance: <RR>.load(<fname>)
  #Note: this method is not safe against erroneous or maliciously constructed data
Call self.model.predict with self.params as the first argument: <RR>.predict([exog=None,transform=True,*args,**kwargs])
Remove data arrays, all nobs arrays from result and model: <RR>.remove_data()
Save a pickle of the results instance: <RR>.save(<fname>[,remove_data=False])
A scale factor for the covariance matrix: <RR>.scale
  #Note: scale is accessed as an attribute, without parentheses
Summarize the regression results: <RR>.summary([yname=None,xname=None,title=None,alpha=0.05,slim=False])
                                  <RR>.summary2([yname=None,xname=None,title=None,alpha=0.05,float_format='%.4f'])
  #Note: <RR>.summary2() is an experimental method
t-test of a linear hypothesis: <RR>.t_test(<r_matrix>[,cov_p=None,use_t=None])
  #Can be used for the t-test of a parameter's significance
Perform pairwise t_test with multiple testing corrected p-values: <RR>.t_test_pairwise(<term_name>[,method='hs',alpha=0.05,factor_labels=None])
"Wald test" of a linear hypothesis: <RR>.wald_test(<r_matrix>[,cov_p=None,invcov=None,use_f=None,df_constraints=None])
Wald tests of joint linear hypotheses, one per model term: <RR>.wald_test_terms([skip_single=False,extra_constraints=None,combine_terms=None])

(3) Attributes:

<RR>.HC0_se                              <RR>.fvalue
<RR>.HC1_se                              <RR>.llf
<RR>.HC2_se                              <RR>.mse_model
<RR>.HC3_se                              <RR>.mse_resid
<RR>.aic                                 <RR>.mse_total
<RR>.bic                                 <RR>.nobs
<RR>.bse                                 <RR>.pvalues
<RR>.centered_tss                        <RR>.resid
<RR>.condition_number                    <RR>.resid_pearson
<RR>.cov_HC0                             <RR>.rsquared
<RR>.cov_HC1                             <RR>.rsquared_adj
<RR>.cov_HC2                             <RR>.ssr
<RR>.cov_HC3                             <RR>.tvalues
<RR>.eigenvals                           <RR>.uncentered_tss
<RR>.ess                                 <RR>.use_t
<RR>.f_pvalue                            <RR>.wresid
<RR>.fittedvalues