Help building a model with scikit-learn when using GridSearch

何涵畅

2023-03-14

Question:

As part of the Enron project, I built the attached model. Below is a summary of the steps.

The model below gives very high scores:

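from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# pipe, clf_params, features and labels are defined elsewhere (not shown);
# features and labels are assumed to be NumPy arrays.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # Predict with the estimator that was refit on the full dataset.
    pred = gcv.best_estimator_.predict(x_test)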

The model below gives more reasonable, but lower, scores:

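from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# pipe, clf_params, features and labels are defined elsewhere (not shown);
# features and labels are assumed to be NumPy arrays.
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)

gcv.fit(features, labels)  # fit with the full dataset

for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # Refit the best estimator on the training split only, then predict.
    gcv.best_estimator_.fit(x_train, y_train)
    pred = gcv.best_estimator_.predict(x_test)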

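Used KBest to compute feature scores, ranked the features, and tried combinations of higher- and lower-scoring ones.
Used an SVM with GridSearch and a StratifiedShuffleSplit.
Used best_estimator_ to predict and to calculate precision and recall.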
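
The three steps above might be wired together roughly as follows. This is only a sketch under stated assumptions: the synthetic data from make_classification stands in for the Enron features, and the step names ("kbest", "svm"), the grid values, n_splits=10, and the extra held-out split are illustrative choices, not the asker's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedShuffleSplit,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the Enron features and labels (assumption).
features, labels = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)

# Step 1: score every feature with SelectKBest and rank them, best first.
kbest_all = SelectKBest(score_func=f_classif, k="all").fit(features, labels)
ranking = np.argsort(kbest_all.scores_)[::-1]

# Step 2: an SVM tuned by GridSearchCV over StratifiedShuffleSplit splits.
# The grid below is hypothetical.
pipe = Pipeline([("kbest", SelectKBest(score_func=f_classif)), ("svm", SVC())])
clf_params = {"kbest__k": [3, 5], "svm__C": [0.1, 1, 10]}
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

# Set rows aside so that step 3 scores data the search never saw.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(x_train, y_train)

# Step 3: predict with best_estimator_ and compute precision and recall.
pred = gcv.best_estimator_.predict(x_test)
precision = precision_score(y_test, pred, zero_division=0)
recall = recall_score(y_test, pred, zero_division=0)
print(ranking, gcv.best_params_, precision, recall)
```

Placing SelectKBest inside the pipeline means the feature selection is repeated on each training split during the search, rather than once on all rows.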

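The problem is that the estimator spits out perfect scores, in some cases exactly 1.

However, when I refit the best classifier on the training data and then run the test, it gives reasonable scores.

My doubt/question is what exactly GridSearch does with the test data after splitting with the ShuffleSplit object we pass to it. I assumed it does not fit anything on the test data; if that were true, then predicting on the same test data should not give such high scores. Since I set random_state, the ShuffleSplit should have produced the same splits for the grid fit and for the prediction.

So, is it wrong to use the same ShuffleSplit for both?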
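
The assumption about random_state can be checked directly. The sketch below uses toy arrays (an assumption, not the Enron data) and compares two passes over the same StratifiedShuffleSplit object that was created with an integer random_state.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels and features (assumption: any stratifiable data behaves the same).
labels = np.array([0] * 40 + [1] * 10)
features = np.arange(len(labels) * 2).reshape(len(labels), 2)

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

# Two independent passes over the same splitter.
first = [(tr.copy(), te.copy()) for tr, te in cv.split(features, labels)]
second = [(tr.copy(), te.copy()) for tr, te in cv.split(features, labels)]

# True when every train/test index array matches between the two passes.
same_splits = all(
    np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
    for (tr1, te1), (tr2, te2) in zip(first, second)
)
print(same_splits)
```

Identical splits mean the loop revisits exactly the rows that the search used for validation. All of those rows were also part of the data passed to gcv.fit.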

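Answer:

Basically, the grid search will:

Try every combination in your parameter grid.
For each combination, run a K-fold cross-validation.
Select the best one.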
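
The three steps above can be sketched by hand and compared with GridSearchCV. This is an illustration, not the library's internal implementation: the synthetic data, the SVC grid, and the 5-fold StratifiedKFold are assumptions, and any splitter passed as cv (such as the StratifiedShuffleSplit from the question) plays the same role.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV,
    ParameterGrid,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

# Synthetic data and a small hypothetical grid.
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
cv = StratifiedKFold(n_splits=5)

# Steps 1 and 2: cross-validate every parameter combination.
results = []
for params in ParameterGrid(param_grid):
    scores = cross_val_score(SVC(**params), X, y, cv=cv)
    results.append((scores.mean(), params))

# Step 3: keep the combination with the best mean validation score.
best_score, best_params = max(results, key=lambda r: r[0])

# The same search done by GridSearchCV, for comparison.
gcv = GridSearchCV(SVC(), param_grid, cv=cv).fit(X, y)
print(best_score, best_params)
print(gcv.best_score_, gcv.best_params_)
```

After selecting the parameters, GridSearchCV performs one further step by default (refit=True): it fits a new estimator with the best parameters on all of the data passed to fit and exposes it as best_estimator_.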

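So your second case is the correct one. In the first case you are predicting on data the model was trained on: with the default refit=True, GridSearchCV refits best_estimator_ on the whole dataset passed to fit, which includes every test split. In the second case this does not happen, because only the best parameters from the grid search are kept and the estimator is fit again on the training split alone.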
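
The difference between the two cases can be reproduced on synthetic data. The sketch below is illustrative only: the data, the SVC grid, and n_splits=10 are assumptions, and clone is used in the second case so that best_estimator_ itself is left untouched.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic, slightly noisy data standing in for the Enron features.
features, labels = make_classification(
    n_samples=200, n_features=10, flip_y=0.1, random_state=42
)
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
gcv = GridSearchCV(SVC(), {"C": [1, 10, 100], "gamma": [0.1, 1]}, cv=cv)
gcv.fit(features, labels)  # refit=True: best_estimator_ is retrained on all rows

seen_scores, unseen_scores = [], []
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]

    # Case 1: the refit estimator already saw x_test during gcv.fit.
    seen_scores.append(gcv.best_estimator_.score(x_test, y_test))

    # Case 2: a fresh copy with the best parameters, fit on training rows only.
    fresh = clone(gcv.best_estimator_).fit(x_train, y_train)
    unseen_scores.append(fresh.score(x_test, y_test))

# Mean accuracy on rows seen during fitting vs. rows held out from fitting.
print(np.mean(seen_scores), np.mean(unseen_scores))
```

The first mean is usually the higher of the two, since it scores rows the estimator was fit on. The second is closer to an honest estimate, although the parameters were still chosen using those same splits.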
