Merging the results of model.predict() with the original pandas DataFrame?

长孙阳嘉

2023-03-14

Question:

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

To merge these predictions back into the original df, I tried the following:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

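I know I could split df into train_df and test_df and the problem would be solved, but in practice I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem, in which I normalize the entire feature matrix before splitting it into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and the information about which rows ended up in X_test and y_test appears to be lost? Or am I relegated to splitting the dataframe into train and test first, and only then building the feature matrices? I would like to simply fill the rows that were part of train with np.nan values in the dataframe.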

Answer:

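Your y_hats only has the length of the test data (20%), because you predicted on X_test. Once your model is validated and you are happy with the test predictions (by checking the accuracy of the model's predictions on X_test against the true values in y_test), you should rerun the prediction on the full dataset (X). Add these two lines to the bottom: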

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2
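For reference, here is a minimal end-to-end sketch of that suggestion: check accuracy on the held-out rows first, then predict on the full X so the result has one value per row of df. The `random_state` arguments and the use of `accuracy_score` are assumptions added for reproducibility and illustration; they are not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# random_state is an assumption, used only to make the sketch repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# validate on the held-out 20% before trusting the model
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# predicting on the full X yields one prediction per row of df,
# so the lengths match and the assignment succeeds
df['y_hats'] = model.predict(X)
```

Note that the predictions for the rows used in training are in-sample, so only the accuracy measured on X_test says anything about generalization.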

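*Edit* based on your comment: here is an updated version that returns the full dataset with the predictions attached to the rows that were in the test set.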

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
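If the feature matrix has to be built as a NumPy array before splitting, as described in the question, one possible alternative is to pass the row labels through `train_test_split` as a third array, so the split itself reports which rows went into the test set. The following is a sketch under that assumption; the names `idx_train` and `idx_test` and the `random_state` values are illustrative and not taken from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data=data.data)
df['class'] = data.target

X = df.loc[:, [0, 1, 2, 3]].to_numpy()
y = df['class'].to_numpy()

# train_test_split splits every array it receives with the same shuffle,
# so the row labels stay aligned with X and y
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df.index.to_numpy(), train_size=0.8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# start with NaN everywhere, then fill in only the test rows by label
df['y_hats'] = np.nan
df.loc[idx_test, 'y_hats'] = model.predict(X_test)
```

Rows that were used for training keep np.nan in the y_hats column, and no merge step is needed because the assignment is done by index label.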
