Intro to Machine Learning (course link)
This course is aimed at beginners who want the most basic machine learning knowledge: it only covers decision tree and random forest models, and it barely touches on the underlying theory or on model tuning.
Each chapter contains both theory and hands-on exercises (the code can be run and the answers checked directly on Kaggle). Everything is fairly simple and easy to pick up, so it is beginner friendly, but it is not a good fit for readers who have already mastered the machine learning workflow.
Import pandas
import pandas as pd
Loading Data
import pandas as pd
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)
Review the Data
home_data.describe()
Selecting The Prediction Target:
The prediction target is selected with dot notation ('.') and is conventionally named y; the example below predicts price. To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (see the sketch below).
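A minimal sketch, assuming the course's Melbourne dataset has already been read into melbourne_data with pd.read_csv:
# list all columns in the dataset
melbourne_data.columns
# the course drops rows with missing values before modeling
melbourne_data = melbourne_data.dropna(axis=0)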
There are many ways to select a subset of your data. We will focus on two approaches for now.
1. Dot notation, which we use to select the “prediction target”
2. Selecting with a column list, which we use to select the “features”
y = melbourne_data.Price
Choosing “Features”:
For now, we’ll build a model with only a few features. Later on you’ll see how to iterate and compare models built with different features.
The columns fed into the model are called features, and the feature data is conventionally named X. An example of selecting features:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
After each selection you can inspect the data with the .describe() function (which returns summary statistics) and the .head() function (which returns the first five rows).
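For example, with the X selected above (output omitted):
# summary statistics of the selected features
X.describe()
# first five rows of the selected features
X.head()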
Building Your Model:
The general steps for building a model (a minimal sketch follows this list):
- Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
- Fit: Capture patterns from provided data. This is the heart of modeling.
- Predict: Just what it sounds like.
- Evaluate: Determine how accurate the model’s predictions are.
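A minimal sketch of the four steps with scikit-learn's DecisionTreeRegressor, assuming the Melbourne X and y selected above:
from sklearn.tree import DecisionTreeRegressor
# Define: a decision tree regressor; random_state makes the result reproducible
melbourne_model = DecisionTreeRegressor(random_state=1)
# Fit: learn patterns from the features X and target y
melbourne_model.fit(X, y)
# Predict: here on the first few rows of the training data, just to see the output
print(melbourne_model.predict(X.head()))
# Evaluate: covered below with the mean absolute error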
# Step 1: Specify Prediction Target
y = home_data.SalePrice
# Step 2: Create X
# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# Select data corresponding to features in feature_names
X = home_data[feature_names]
# Review data
# print description or statistics from X
print(X.describe())
# print the top few lines
print(X.head())
# Step 3: Specify and Fit Model
# Specify the model
from sklearn.tree import DecisionTreeRegressor
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state = 2)
# Fit the model
iowa_model.fit(X,y)
# Step 4: Make Predictions
predictions = iowa_model.predict(X)
Calculate the mean absolute error (MAE): the average of the absolute values of the differences between the predicted values and the true values.
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y, predictions)
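As a small illustration of the definition (assuming numpy is available and using the in-sample predictions from Step 4 above):
import numpy as np
# MAE computed directly from its definition; should match the sklearn result above
mae_by_hand = np.mean(np.abs(y - predictions))
print(mae_by_hand)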
We measure performance on data that wasn’t used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model’s accuracy on data it hasn’t seen before. This data is called validation data.
from sklearn.model_selection import train_test_split
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
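An optional sanity check: with the default settings, train_test_split holds out 25% of the rows for validation.
# confirm the split sizes
print(train_X.shape, val_X.shape)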
# Step 1: Split Your Data
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
# Step 2: Specify and Fit the Model
# Specify the model
iowa_model = DecisionTreeRegressor(random_state = 1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)
# Step 3: Make Predictions with Validation data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y,val_predictions)
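To see the number, print the validation MAE (the formatting here is just a choice):
print("Validation MAE: {:,.0f}".format(val_mae))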
Overfitting:
when a model matches the training data almost perfectly but does poorly on validation and other new data.
Underfitting:
when a model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.
We can use a utility function to compare MAE scores for different values of max_leaf_nodes, and a for loop to compare the results:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
Max leaf nodes: 5 Mean Absolute Error: 347380
Max leaf nodes: 50 Mean Absolute Error: 258171
Max leaf nodes: 500 Mean Absolute Error: 243495
Max leaf nodes: 5000 Mean Absolute Error: 254983
# Step 1: Compare Different Tree Sizes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
best_tree_size = candidate_max_leaf_nodes[0]
best_mae = get_mae(best_tree_size, train_X, val_X, train_y, val_y)
for max_node in candidate_max_leaf_nodes:
    mae = get_mae(max_node, train_X, val_X, train_y, val_y)
    if mae < best_mae:
        best_mae = mae
        best_tree_size = max_node
print(best_tree_size)  # 100
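As a side note, the same search can be written more compactly with a dict comprehension and min(); this is equivalent to the loop above:
# map each candidate size to its validation MAE, then take the size with the smallest MAE
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)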
# Step 2: Fit Model Using All Data
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)
The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
A random forest consists of many decision trees; it averages the predictions of the individual trees to get the final result, and it usually performs better than a single decision tree.
Use RandomForestRegressor from scikit-learn to build a random forest; example code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
# (The model also has parameters that can be tuned for better results.)
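As an illustration of what such tuning could look like (the parameter values here are arbitrary examples, not recommendations from the course):
# a forest with more trees and a cap on tree size, evaluated on the validation data
tuned_forest = RandomForestRegressor(n_estimators=200, max_leaf_nodes=500, random_state=1)
tuned_forest.fit(train_X, train_y)
print(mean_absolute_error(val_y, tuned_forest.predict(val_X)))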
# Step 1: Use a Random Forest
from sklearn.ensemble import RandomForestRegressor
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state = 1)
# fit your model
rf_model.fit(train_X, train_y)
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_mae = mean_absolute_error(val_y,rf_model.predict(val_X))
print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))
# Validation MAE for Random Forest Model: 21857.15912981083