当前位置: 首页 > 工具软件 > Path Intro > 使用案例 >

Kaggle Learn: Intro to Machine Learning 记录

燕富
2023-12-01

kaggle上learn板块 intro to machine learning的笔记,连着上了三个ML的课,
有exercise,适合新手自学
https://www.kaggle.com/learn/intro-to-machine-learning

import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
#dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

We’ll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is

y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

要预测的是price
By convention, this data is called X.

X = melbourne_data[melbourne_features]
X.describe()
X.head(n) #默认返回5行

The steps to building and using a model are:

  • Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  • Fit: Capture patterns from provided data. This is the heart of modeling.
  • Predict: Just what it sounds like
  • Evaluate: Determine how accurate the model’s predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

from sklearn.tree import DecisionTreeRegressor
#Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)
#Fit model
melbourne_model.fit(X, y)

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
Rooms Bathroom Landsize Lattitude Longtitude
1 2 1.0 156.0 -37.8079 144.9934
2 3 2.0 134.0 -37.8093 144.9944
4 4 1.0 120.0 -37.8072 144.9941
6 3 2.0 245.0 -37.8024 144.9993
7 2 1.0 256.0 -37.8060 144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
MAE:Mean Absolute Error
看模型是否准确

from sklearn.metrics import mean_absolute_error
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

Sklearn train_test_split()
https://www.cnblogs.com/bonelee/p/8036024.html

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
#Define model
melbourne_model = DecisionTreeRegressor()
#Fit model
melbourne_model.fit(train_X, train_y)
#get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

用training data 来fit 然后用val来predict
When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn’t divide up the houses into very distinct groups.

At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.

Here’s the takeaway: Models can suffer from either:

  • Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or
  • Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

We use validation data, which isn’t used in model training, to measure a candidate model’s accuracy. This lets us try many candidate models and keep the best one.

 类似资料: