AutoML Frameworks: Auto-sklearn Study Notes 01 - Principles and Basic Usage (Parameters and Functions Explained)

丁阎宝

2023-12-01

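1. Introduction

Official documentation: APIs - AutoSklearn 0.15.0 documentation

Source code: https://github.com/automl/auto-sklearn

Main features:

Automatic learning from the dataset: meta-learning. Auto-sklearn characterizes the dataset and uses that description to recommend suitable models, for example which models tend to work well on text data or on data with many discrete features.
Automatic hyperparameter tuning: a Bayesian optimizer.
Automatic model ensembling: build-ensemble. Ensembling is a common technique in competitions: several models are combined into a single stronger model, which often improves predictive accuracy.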
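To make the ensembling idea concrete, here is a minimal pure-Python sketch of greedy ensemble selection with replacement, the general technique behind this kind of post-hoc ensemble. It is an illustration under assumptions, not auto-sklearn's code: the function names, the squared-error loss, and the validation predictions are all made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def greedy_ensemble_selection(model_preds, y_true, ensemble_size):
    """Pick models (with replacement) that minimise the averaged-prediction loss.

    model_preds: dict mapping a model name to its validation predictions.
    Returns a dict of model name -> weight (selection count / ensemble_size).
    """
    running_sum = [0.0] * len(y_true)  # sum of predictions of chosen models
    counts = {name: 0 for name in model_preds}

    for step in range(1, ensemble_size + 1):
        best_name, best_loss = None, float("inf")
        for name, preds in model_preds.items():
            # Loss of the ensemble if this model were added next.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            loss = mean_squared_error(y_true, candidate)
            if loss < best_loss:
                best_name, best_loss = name, loss
        counts[best_name] += 1
        running_sum = [
            s + p for s, p in zip(running_sum, model_preds[best_name])
        ]

    return {name: c / ensemble_size for name, c in counts.items() if c > 0}


# Hypothetical validation probabilities from three models for five samples.
y_valid = [1, 0, 1, 1, 0]
predictions = {
    "model_a": [0.9, 0.4, 0.6, 0.7, 0.5],
    "model_b": [0.6, 0.1, 0.9, 0.5, 0.2],
    "model_c": [0.5, 0.5, 0.5, 0.5, 0.5],
}
weights = greedy_ensemble_selection(predictions, y_valid, ensemble_size=4)
print(weights)
```

Because a model can be picked more than once, the selection counts double as ensemble weights, which is one way to read the "ensemble_weight" values that appear later in the leaderboard.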

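2. Installation

pip install auto-sklearn

Or, using a PyPI mirror:

pip install --upgrade auto-sklearn -i https://pypi.douban.com/simple

Import the library and print its version number to confirm that the installation succeeded: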

import autosklearn
print('autosklearn:%s'%autosklearn.__version__)

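Depending on whether the prediction task is classification or regression, create and configure an instance of AutoSklearnClassifier or AutoSklearnRegressor and fit it to the dataset. The resulting model can then be used directly for prediction, or saved to a file (using pickle) for later use.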

3. Usage

Load the data

from pprint import pprint
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

Wrap a custom evaluation function with make_scorer

import autosklearn.classification
import numpy as np
import pandas as pd
import sklearn.datasets
import sklearn.metrics
from autosklearn.metrics import balanced_accuracy, precision, recall, f1


def error(solution, prediction):
    # custom function defining error
    return np.mean(solution != prediction)

error_rate = autosklearn.metrics.make_scorer(
    name="custom_error",
    score_func=error,
    optimum=0,
    greater_is_better=False,
    needs_proba=False,
    needs_threshold=False,
)

Build and fit a classifier

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
    scoring_functions=[balanced_accuracy, precision, recall, f1, error_rate],
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

Retrieve the scores of all models evaluated during the search

def get_metric_result(cv_results):
    results = pd.DataFrame.from_dict(cv_results)
    results = results[results["status"] == "Success"]
    cols = ["rank_test_scores", "param_classifier:__choice__", "mean_test_score"]
    cols.extend([key for key in cv_results.keys() if key.startswith("metric_")])
    return results[cols]

print("Metric results")
print(get_metric_result(automl.cv_results_).to_string(index=False))

View and save the models found by auto-sklearn

import os

my_leaderboard = automl.leaderboard(detailed=True)
print(my_leaderboard)

# Write the leaderboard to CSV, creating the directory if it is missing.
# to_csv overwrites an existing file, so no explicit removal is needed.
leaderboard_path = './data/leaderboard/my_leaderboard.csv'
if os.path.isfile(leaderboard_path):
    print('file exists, overwriting')
os.makedirs(os.path.dirname(leaderboard_path), exist_ok=True)
my_leaderboard.to_csv(leaderboard_path, encoding='utf-8-sig')
print('Saved successfully!')

Save the model
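As mentioned in the installation section, a fitted model can be saved with pickle. The sketch below shows one way to do that; save_model, load_model and the file name are hypothetical helpers, and a plain dictionary stands in for the fitted automl object so that the snippet runs without auto-sklearn installed.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialise any picklable object (e.g. a fitted estimator) to `path`."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the fitted `automl` object so the sketch runs on its own.
stand_in_model = {"name": "placeholder", "weights": [0.5, 0.5]}

model_path = os.path.join(tempfile.mkdtemp(), "automl_model.pkl")
save_model(stand_in_model, model_path)
restored = load_model(model_path)
print(restored == stand_in_model)
```

With the objects from this post, the call would be save_model(automl, model_path), and the restored object could then be used for predict(). Only unpickle files that come from a trusted source.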

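Print the final ensemble built by auto-sklearn

import matplotlib.pyplot as plt

automl.cv_results_  # dict of search results (displayed when run in a notebook)
print(automl.sprint_statistics())  # summary statistics of the search
print(automl.show_models())  # the models that make up the final ensemble

automl.performance_over_time_.plot(
    x='Timestamp',
    kind='line',
    legend=True,
    title='Auto-sklearn accuracy over time',
    grid=True,
)
plt.show()

performance_over_time_ returns a DataFrame containing the models' performance over time, which can be used for plotting directly.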

Check performance on the test set

predictions = automl.predict(X_test)
print("Accuracy score:", sklearn.metrics.accuracy_score(y_test, predictions))

4. Parameters

(1) AutoSklearnClassifier()

autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, per_run_time_limit=None, initial_configurations_via_metalearning=25, ensemble_size: int | None = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', ensemble_kwargs: Dict[str, Any] | None = None, ensemble_nbest=50, max_models_on_disc=50, seed=1, memory_limit=3072, include: Optional[Dict[str, List[str]]] = None, exclude: Optional[Dict[str, List[str]]] = None, resampling_strategy='holdout', resampling_strategy_arguments=None, tmp_folder=None, delete_tmp_folder_after_terminate=True, n_jobs: Optional[int] = None, dask_client: Optional[dask.distributed.Client] = None, disable_evaluator_output=False, get_smac_object_callback=None, smac_scenario_args=None, logging_config=None, metadata_directory=None, metric: Scorer | Sequence[Scorer] | None = None, scoring_functions: Optional[List[Scorer]] = None, load_models: bool = True, get_trials_callback: SMACCallback | None = None, dataset_compression: Union[bool, Mapping[str, Any]] = True, allow_string_features: bool = True)

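Data preprocessing is also carried out during this process. Preprocessing in auto-sklearn is divided into data preprocessing and feature preprocessing. Data preprocessing includes one-hot encoding of categorical features, imputation of missing values, and normalization of features or samples; these steps currently cannot be turned off. Feature preprocessing is a single feature transformer that implements, for example, feature selection or a transformation of the features into a different space (such as PCA). Feature preprocessing can be turned off by setting include={'feature_preprocessor': ["no_preprocessing"]}.

autosklearn.classification.AutoSklearnClassifier() parameters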


metric

Scorer, optional (None)

An instance of autosklearn.metrics.Scorer, as created by autosklearn.metrics.make_scorer(). If None is provided, a default metric is selected depending on the task.

The built-in metrics are: {'accuracy': accuracy, 'balanced_accuracy': balanced_accuracy, 'roc_auc': roc_auc, 'average_precision': average_precision, 'log_loss': log_loss, 'precision_macro': precision_macro, 'precision_micro': precision_micro, 'precision_samples': precision_samples, 'precision_weighted': precision_weighted, 'recall_macro': recall_macro, 'recall_micro': recall_micro, 'recall_samples': recall_samples, 'recall_weighted': recall_weighted, 'f1_macro': f1_macro, 'f1_micro': f1_micro, 'f1_samples': f1_samples, 'f1_weighted': f1_weighted}

scoring_functions

List[Scorer], optional (None)

A list of scorers that are computed for every pipeline; the results are available via cv_results_.

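methods (a key of the dataset_compression dictionary)

The following methods are available for reducing the size of the dataset. They can be given in a list and are applied in the order given.

"precision" - Floating-point precision is reduced as follows: np.float128 -> np.float64, np.float96 -> np.float64, np.float64 -> np.float32
"subsample" - The data is subsampled so that it fits directly into the memory allocation memory_allocation * memory_limit. Subsampling takes classification labels into account and stratifies accordingly; every label is guaranteed to occur at least once in the sampled set.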

load_models

bool, optional (True)

Whether to load the models after fitting Auto-sklearn.

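Controlling training time and memory usage

time_left_for_this_task

int, optional (default=3600)

The total time budget, in seconds, for the whole model search. Increasing this value gives auto-sklearn a better chance of finding better models.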

get_trials_callback

callable

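A callable with the following signature:

(smac.SMBO, smac.RunInfo, smac.RunValue, time_left: float) -> bool | None

It is called after SMAC (the optimizer underlying auto-sklearn) finishes training each run.

You can use it to log your own information about the optimization process, or to implement early stopping based on some criterion.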
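A callback that combines logging and early stopping might look like the sketch below. It follows the signature quoted above; the factory function, target_cost and log are hypothetical names, and the convention that returning False asks the search to stop is my understanding of SMAC's callback behaviour, so treat it as an assumption. Stand-in objects replace the real SMAC types so the snippet runs on its own.

```python
from types import SimpleNamespace


def make_early_stopping_callback(target_cost, log):
    """Build a callback that logs each run and stops once a run is good enough.

    target_cost and log are hypothetical names used for this sketch.
    """
    def callback(smbo, run_info, run_value, time_left):
        log.append((run_value.cost, time_left))
        if run_value.cost <= target_cost:
            return False  # assumed convention: False asks the search to stop
        return None       # anything else lets the search continue
    return callback


# Exercise the callback with stand-in objects instead of real SMAC types.
history = []
callback = make_early_stopping_callback(target_cost=0.05, log=history)
first = callback(None, None, SimpleNamespace(cost=0.20), 100.0)
second = callback(None, None, SimpleNamespace(cost=0.03), 80.0)
print(first, second, history)
```

In a real run the function would be passed as get_trials_callback=callback when constructing the classifier.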

per_run_time_limit

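int, optional (default=1/10 of time_left_for_this_task)

The time limit for training a single model. Model fitting is terminated if the machine learning algorithm runs over the limit. Set this value high enough that typical machine learning algorithms can be fit on the training data.

max_models_on_disc

int, optional (default=50)

The maximum number of models kept on disk; any additional models are permanently deleted. By its nature, this sets an upper limit on how many models an ensemble can use. It must be an integer greater than or equal to 1. If set to None, all models are kept on disk.

memory_limit

int, optional (3072)

Memory limit in MB for the machine learning algorithm. auto-sklearn stops fitting the algorithm if it tries to allocate more than memory_limit MB.

initial_configurations_via_metalearning

int, optional (default=25)

Initializes the hyperparameter optimization with this many configurations that worked well on previously seen datasets. Disable it if the hyperparameter optimization should start from scratch.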

ensemble_class

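Type[AbstractEnsemble] | "default", optional (default="default")

The class implementing the post-hoc ensemble algorithm. Set it to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble.

If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.

ensemble_kwargs

Dict, optional

Keyword arguments that are passed to the ensemble class upon initialization.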

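Model selection

include

Optional[Dict[str, List[str]]] = None

The available steps (dictionary keys) are:

"data_preprocessor"
"balancing"
"feature_preprocessor"
"classifier" - only when using AutoSklearnClassifier
"regressor" - only when using AutoSklearnRegressor

If None, all possible algorithms are used.

Otherwise, it specifies the steps and components to include in the search. See /pipeline/components/<step>/* for the available components.

Incompatible with the exclude parameter.

Example: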

include = {
    'classifier': ["random_forest"],
    'feature_preprocessor': ["no_preprocessing"]
}

The supported components for the step 'feature_preprocessor' for this task are ['densifier', 'extra_trees_preproc_for_classification', 'fast_ica', 'feature_agglomeration', 'kernel_pca', 'kitchen_sinks', 'liblinear_svc_preprocessor', 'no_preprocessing', 'nystroem_sampler', 'pca', 'polynomial', 'random_trees_embedding', 'select_percentile_classification', 'select_rates_classification', 'truncatedSVD']

Data splitting

resampling_strategy

str | BaseCrossValidator | _RepeatedSplits | BaseShuffleSplit = "holdout"

The resampling_strategy parameter sets how the data is split into training and test sets. For five-fold cross-validation:

resampling_strategy='cv',

resampling_strategy_arguments={'folds': 5}

To split the data into a training set and a test set, with about 2/3 of the data used for training:

resampling_strategy='holdout',

resampling_strategy_arguments={'train_size': 0.67}

"holdout" - Use a 67:33 (train:test) split
"cv" - Perform cross-validation; requires "folds" in resampling_strategy_arguments
"holdout-iterative-fit" - Same as "holdout" but iterative fit where possible
"cv-iterative-fit" - Same as "cv" but iterative fit where possible
"partial-cv" - Same as "cv" but uses intensification
BaseCrossValidator - any BaseCrossValidator subclass (found in the scikit-learn model_selection module)
_RepeatedSplits - any _RepeatedSplits subclass (found in the scikit-learn model_selection module)
BaseShuffleSplit - any BaseShuffleSplit subclass (found in the scikit-learn model_selection module)

resampling_strategy_arguments

Optional[Dict] = None

Additional arguments for resampling_strategy; required when a cv-based strategy is used. If left as None, the default arguments are:

{
    "train_size": 0.67,     # The size of the training set
    "shuffle": True,        # Whether to shuffle before splitting data
    "folds": 5              # Used in 'cv' based resampling strategies
}
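To illustrate what the "folds" argument means, the following pure-Python sketch generates the index sets for k-fold cross-validation. It is a generic illustration, not auto-sklearn's splitter: k_fold_indices is a hypothetical helper, and shuffling (which the defaults above enable) is left out to keep the output easy to read.

```python
def k_fold_indices(n_samples, folds):
    """Yield (train_indices, validation_indices) for each of `folds` splits."""
    indices = list(range(n_samples))
    # The first n_samples % folds folds get one extra sample.
    base, extra = divmod(n_samples, folds)
    start = 0
    for fold in range(folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


splits = list(k_fold_indices(n_samples=12, folds=5))
for train, validation in splits:
    print(len(train), validation)
```

Every sample lands in exactly one validation fold, so each candidate model is trained "folds" times, which is why cross-validation costs more time per model than a single holdout split.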

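Model storage

tmp_folder

string, optional (None)

Folder for storing configuration output and log files. If None, /tmp/autosklearn_tmp_$pid_$random_number is used automatically.

delete_tmp_folder_after_terminate

bool, optional (True)

Remove tmp_folder when finished. If tmp_folder is None, the temporary directory is always deleted.

n_jobs

int, optional, experimental

The number of jobs to run in parallel for fit(). -1 means using all processors.

logging_config

dict, optional (None)

A dictionary specifying the logger configuration. If None, the default logging YAML file is used, which can be found in the util/logging directory.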

dataset_compression

Union[bool, Mapping[str, Any]] = True

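Datasets are compressed so that they fit into a predefined amount of memory. Currently this does not apply to DataFrames or sparse arrays, only to raw NumPy arrays.

Note: if you use a custom resampling_strategy that depends on a specific size or ordering of the data, this option must be disabled to preserve those properties.

You can disable it entirely by passing False, or leave it at the default True, which corresponds to the following configuration: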

{
    "memory_allocation": 0.1,
    "methods": ["precision", "subsample"]
}
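The arithmetic behind this configuration can be sketched as follows. The memory budget is memory_allocation * memory_limit, as described under "subsample"; the dataset shape and the array_size_mb helper are hypothetical, and auto-sklearn's own size accounting may differ in detail.

```python
def array_size_mb(n_rows, n_cols, bytes_per_value):
    """Size of a dense numeric array in MB (1 MB = 2**20 bytes here)."""
    return n_rows * n_cols * bytes_per_value / 2**20


memory_limit = 3072      # MB, the default quoted above
memory_allocation = 0.1  # fraction, the default quoted above
budget_mb = memory_allocation * memory_limit

# Hypothetical dataset: 2,000,000 rows x 50 features.
size_float64 = array_size_mb(2_000_000, 50, bytes_per_value=8)
size_float32 = array_size_mb(2_000_000, 50, bytes_per_value=4)

# Fraction of rows to keep if the float32 array still exceeds the budget.
keep_fraction = min(1.0, budget_mb / size_float32)

print(round(budget_mb, 1), round(size_float64, 1), round(size_float32, 1))
print(round(keep_fraction, 3))
```

In this example the "precision" step halves the footprint but is not enough on its own, so "subsample" would have to drop roughly a fifth of the rows. This is also why the order of the methods list matters.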

disable_evaluator_output

bool or list, optional (False)

If True, disables model and prediction output. Alternatively, pass a list; the allowed elements are:

'y_optimization': do not save the predictions for the optimization set, which would later on be used to build an ensemble.

'model': do not save any model files.

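Attributes

cv_results_

dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, which can be imported into a pandas DataFrame.

Not all keys returned by scikit-learn are supported yet.

performance_over_time_

pandas.core.frame.DataFrame

A DataFrame containing the models' performance over time. It can be used for plotting directly. See the example "Performance-over-time plot" in the AutoSklearn 0.15.0 documentation.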

(2) fit(X,y, X_test=None, y_test=None, feat_type=None, dataset_name=None)

Parameters
X: array-like or sparse matrix of shape = [n_samples, n_features]	The training input samples.
y: array-like, shape = [n_samples] or [n_samples, n_outputs]	The target classes.
X_test: array-like or sparse matrix of shape = [n_samples, n_features]	Test data input samples. Will be used to save test predictions for all models. This makes it possible to evaluate the performance of Auto-sklearn over time.
y_test: array-like, shape = [n_samples] or [n_samples, n_outputs]	Test data target classes. Will be used to calculate the test error of all models. This makes it possible to evaluate the performance of Auto-sklearn over time.
feat_type: list, optional (default=None)	List of str of length X.shape[1] describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically one-hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.
dataset_name: str, optional (default=None)	Creates nicer output. If None, a string is determined from the md5 hash of the dataset.

(3) fit_ensemble()

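fit_ensemble(y, task: int = None, precision: Literal[16, 32, 64] = 32, dataset_name: Optional[str] = None, ensemble_size: int | None = None, ensemble_kwargs: Optional[Dict[str, Any]] = None, ensemble_nbest: Optional[int] = None, ensemble_class: Type[AbstractEnsemble] | Literal['default'] | None = 'default', metric: Scorer | Sequence[Scorer] | None = None)

Fits an ensemble from the models trained during the optimization process. Parameters left as None fall back to the values that were set in the call to fit().

Parameters
y: array-like	Target values.
task: int	A constant from the autosklearn.constants module. Determines the task type (binary classification, multiclass classification, multilabel classification or regression).
precision: int	Numeric precision used when loading ensemble data. Can be 16, 32 or 64.
dataset_name: str	Name of the current dataset.
ensemble_kwargs: Dict, optional	Keyword arguments that are passed to the ensemble class upon initialization.
ensemble_nbest: int	When building an ensemble, only the ensemble_nbest best models are considered.
ensemble_class: Type[AbstractEnsemble] | "default", optional (default="default")	The class implementing the post-hoc ensemble algorithm. Set it to None to disable ensemble building, or use SingleBest to obtain only the single best model instead of an ensemble. If set to "default", EnsembleSelection is used for single-objective problems and MultiObjectiveDummyEnsemble for multi-objective problems.
metric: Scorer | Sequence[Scorer] | None = None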

(4) fit_pipeline()

fit_pipeline(X: Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], y: Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix], config: Union[ConfigSpace.configuration_space.Configuration, Dict[str, Union[str, float, int]]], dataset_name: Optional[str] = None, X_test: Optional[Union[List, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, y_test: Optional[Union[List, pandas.core.series.Series, pandas.core.frame.DataFrame, numpy.ndarray, scipy.sparse._base.spmatrix]] = None, feat_type: Optional[List[str]] = None, *args, **kwargs: Dict) → Tuple[Optional[autosklearn.pipeline.base.BasePipeline], smac.runhistory.runhistory.RunInfo, smac.runhistory.runhistory.RunValue]

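Fits an individual pipeline configuration and returns the result to the user.

Parameters
X: array-like, shape = (n_samples, n_features)	The features used for training.
y: array-like	The labels used for training.
X_test: Optional array-like, shape = (n_samples, n_features)	If provided, the testing performance will be tracked on these features.
y_test: array-like	If provided, the testing performance will be tracked on these labels.
config: Union[Configuration, Dict[str, Union[str, float, int]]]	A configuration object that defines the pipeline steps. If a dictionary is passed, a configuration is created from it.
dataset_name: Optional[str]	Name that will be used to tag and identify the Auto-Sklearn run.
feat_type: list, optional (default=None)	List of str of length X.shape[1] describing the attribute types. Possible types are Categorical and Numerical. Categorical attributes are automatically one-hot encoded. The values used for a categorical attribute must be integers, obtained for example with sklearn.preprocessing.LabelEncoder.
Returns
pipeline: Optional[BasePipeline]	The fitted pipeline. If fitting the pipeline fails, None is returned.
run_info: RunInfo	A named tuple containing the configuration that was launched.
run_value: RunValue	A named tuple containing the result of the run.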

(5) leaderboard()

leaderboard(detailed: bool = False, ensemble_only: bool = True, top_k: Union[int, Literal['all']] = 'all', sort_by: str = 'cost', sort_order: Literal['auto', 'ascending', 'descending'] = 'auto', include: Optional[Union[str, Iterable[str]]] = None)

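Returns a DataFrame with the results of all evaluated models. It gives an overview of all models trained during the search, along with various statistics about their training. The available statistics are:

Output (simple)	Output (detailed)
`"model_id"` - The id given to the model.	`"config_id"` - The id used by SMAC for optimization.
`"rank"` - The rank of the model based on its `"cost"`.	`"budget"` - How much budget was allocated to this model.
`"ensemble_weight"` - The weight given to the model in the ensemble.	`"status"` - The return status of training the model with SMAC.
`"type"` - The type of classifier/regressor used.	`"train_loss"` - The loss of the model on the training set.
`"cost"` - The loss of the model on the validation set.	`"balancing_strategy"` - The balancing strategy used for data preprocessing.
`"duration"` - Length of time the model was optimized for.	`"start_time"` - Time the model began being optimized.
	`"end_time"` - Time the model ended being optimized.
	`"data_preprocessors"` - The preprocessors used on the data.
	`"feature_preprocessors"` - The preprocessors for feature types.
Parameters
detailed: bool = False	Whether to give detailed information or just a simple overview.
ensemble_only: bool = True	Whether to view only the models included in the ensemble or all models trained.
top_k: int or "all" = "all"	How many models to display.
sort_by: str = 'cost'	Which column to sort by. If that column is not present, sorting falls back to the "model_id" index column. Defaults to the metric optimized; in a multi-objective optimization problem, it sorts by the first objective.
sort_order: "auto" or "ascending" or "descending" = "auto"	Which sort order to apply to the `sort_by` column. If left as `"auto"`, it will sort by a sensible default where "better" is on top, otherwise defaulting to the pandas default for DataFrame.sort_values if there is no obvious "better".
include: Optional[str or Iterable[str]]	Items to include; other items not specified are excluded. The exception is the "model_id" index column, which is always included. If left as None, the detailed parameter decides which columns to include.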

(6) show_models()

Returns a dictionary of dictionaries containing the models in the ensemble. Each model in the ensemble can be accessed by using its model_id as the key.

A model dictionary contains the following:

"model_id" - The id given to the model.
"rank" - The rank of the model based on its "cost".
"cost" - The loss of the model on the validation set.
"ensemble_weight" - The weight given to the model in the ensemble.
"voting_model" - The cv_voting_ensemble model (for 'cv' resampling).
"estimators" - List of models (dicts) in cv_voting_ensemble ('cv' resampling).
"data_preprocessor" - The preprocessor used on the data.
"balancing" - The balancing used on the data (for classification).
"feature_preprocessor" - The preprocessor for features types.
"classifier" / "regressor" - The autosklearn wrapped classifier or regressor.
"sklearn_classifier" or "sklearn_regressor" - The sklearn classifier or regressor.
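Since show_models() returns a dictionary keyed by model_id, it can be walked like any nested dict. The sketch below summarizes the ensemble by rank; summarise_ensemble is a hypothetical helper, and the dictionary is a hand-written stand-in with made-up ids, costs and weights, shaped after the keys listed above.

```python
def summarise_ensemble(models):
    """Return (model_id, rank, ensemble_weight) tuples sorted by rank."""
    rows = [
        (model_id, info["rank"], info["ensemble_weight"])
        for model_id, info in models.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Hand-written stand-in shaped like the dictionary described above;
# the ids, costs and weights are made up for illustration.
example_models = {
    7: {"model_id": 7, "rank": 2, "cost": 0.035, "ensemble_weight": 0.4},
    3: {"model_id": 3, "rank": 1, "cost": 0.028, "ensemble_weight": 0.6},
}

for model_id, rank, weight in summarise_ensemble(example_models):
    print(f"model {model_id}: rank {rank}, weight {weight:.2f}")
```

With a fitted classifier, the same helper would be called as summarise_ensemble(automl.show_models()).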

(7) autosklearn.metrics.make_scorer()

autosklearn.metrics.make_scorer(name: str, score_func: Callable, *, optimum: float = 1.0, worst_possible_result: float = 0.0, greater_is_better: bool = True, needs_proba: bool = False, needs_threshold: bool = False, needs_X: bool = False, **kwargs: Any)

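Parameters
name: str	The name of the scorer.
score_func: callable	Score function (or loss function) with signature score_func(y, y_pred, **kwargs).
optimum: int or float, default=1	The best score achievable by the score function, i.e. the maximum for a score function and the minimum for a loss function.
worst_possible_result: int or float, default=0	The worst score achievable by the score function, i.e. the minimum for a score function and the maximum for a loss function.
greater_is_better: boolean, default=True	Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object sign-flips the outcome of score_func.
needs_proba: boolean, default=False	Whether score_func requires predict_proba to get probability estimates out of a classifier.
needs_threshold: boolean, default=False	Whether score_func takes a continuous decision certainty. This only works for binary classification.
needs_X: boolean, default=False	Whether score_func requires X in __call__ to compute a metric.
**kwargs: additional arguments	Additional parameters to be passed to score_func.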
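The sign flip described for greater_is_better can be illustrated with a toy class. ToyScorer below is a hypothetical stand-in written for this note, not auto-sklearn's Scorer; it only shows why flipping the sign of a loss lets an optimizer treat every scorer as "higher is better".

```python
class ToyScorer:
    """Toy stand-in illustrating the sign flip; not auto-sklearn's Scorer."""

    def __init__(self, name, score_func, greater_is_better=True):
        self.name = name
        self.score_func = score_func
        # +1 keeps scores as they are, -1 turns a loss into "higher is better".
        self.sign = 1 if greater_is_better else -1

    def __call__(self, y_true, y_pred):
        return self.sign * self.score_func(y_true, y_pred)


def accuracy(y_true, y_pred):
    """Fraction of correct predictions (a score: higher is better)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error(y_true, y_pred):
    """Fraction of wrong predictions (a loss: lower is better)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


accuracy_scorer = ToyScorer("accuracy", accuracy, greater_is_better=True)
error_scorer = ToyScorer("error", error, greater_is_better=False)

y_true = [1, 0, 1, 1]
good_pred = [1, 0, 1, 0]  # one mistake
bad_pred = [0, 1, 1, 0]   # three mistakes

for scorer in (accuracy_scorer, error_scorer):
    print(scorer.name, scorer(y_true, good_pred), scorer(y_true, bad_pred))
```

Both scorers rank good_pred above bad_pred, even though one wraps a score and the other a loss. This matches the custom_error scorer defined earlier with greater_is_better=False and optimum=0.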