Question:

After training an LSTM model on a time series dataset, how do I predict future data or data outside the known range?

乐正涵意
2023-03-14
            StationIndex  Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1    Dhaka  1970      1    1         0          1
1970-01-02             1    Dhaka  1970      1    2         0          2
1970-01-03             1    Dhaka  1970      1    3         0          3
1970-01-04             1    Dhaka  1970      1    4         0          4
1970-01-05             1    Dhaka  1970      1    5         0          5
import numpy as np
from pandas.plotting import register_matplotlib_converters
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from pylab import rcParams
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
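# Import from tensorflow.keras throughout so layers and models come from the same Keras build
from tensorflow.keras.layers import (
    Input,
    Dense,
    LSTM,
    AveragePooling1D,
    TimeDistributed,
    Flatten,
    Bidirectional,
    Dropout
)
from tensorflow.keras.models import Model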
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

tf.keras.backend.clear_session()
register_matplotlib_converters()
sns.set(style='whitegrid', palette='muted', font_scale=1.5)

rcParams['figure.figsize'] = 22, 10

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

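# Read the data from CSV (raw string so the backslash is not treated as an escape)
df = pd.read_csv(r"\customized_daily_rainfall_data_Copy.csv")
# Drop rows where Rainfall is the missing-value marker -999
df = df[df.Rainfall != -999]

# Drop impossible dates (Feb 29+ in non-leap years, Feb 30+ in leap years, day 31 of 30-day months)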
df.drop(df[(df['Day']>28) & (df['Month']==2) & (df['Year']%4!=0)].index,inplace=True)
df.drop(df[(df['Day']>29) & (df['Month']==2) & (df['Year']%4==0)].index,inplace=True)
df.drop(df[(df['Day']>30) & ((df['Month']==4)|(df['Month']==6)|(df['Month']==9)|(df['Month']==11))].index,inplace=True)

# Parse the dates and use them as the index
date = [str(y)+'-'+str(m)+'-'+str(d) for y, m, d in zip(df.Year, df.Month, df.Day)]
df.index = pd.to_datetime(date)

df['Date'] = df.index
df['Dayofyear']=df['Date'].dt.dayofyear
df.drop('Date',axis=1,inplace=True)
df.drop(['Station'],axis=1,inplace=True)
df.head()


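# Limit the DataFrame to the rows where StationIndex is 11
datarange = df.loc[df['StationIndex'] == 11]

# Split into train and test sets in chronological order (90% / 10%)
train_size = int(len(datarange) * 0.9)
test_size = len(datarange) - train_size
# Slice datarange (not df) so only station 11 is used; copy so the later assignments do not hit a view
train, test = datarange.iloc[0:train_size].copy(), datarange.iloc[train_size:].copy()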

# Scale the feature and label columns (scalers are fitted on the training set only)
from sklearn.preprocessing import RobustScaler
f_columns = ['Year', 'Month','Day','Dayofyear']
f_transformer = RobustScaler()
l_transformer = RobustScaler()
f_transformer = f_transformer.fit(train[f_columns].to_numpy())
l_transformer = l_transformer.fit(train[['Rainfall']])


train.loc[:, f_columns] = f_transformer.transform(train[f_columns].to_numpy())
train['Rainfall'] = l_transformer.transform(train[['Rainfall']])
test.loc[:, f_columns] = f_transformer.transform(test[f_columns].to_numpy())
test['Rainfall'] = l_transformer.transform(test[['Rainfall']])

# Build sliding windows: each sample is time_steps consecutive rows, the target is the next row's value
def create_dataset(X, y, time_steps=1):
    Xs, ys = [], []
    for i in range(len(X) - time_steps):
        v = X.iloc[i:(i + time_steps)].to_numpy()
        Xs.append(v)        
        ys.append(y.iloc[i + time_steps])
    return np.array(Xs), np.array(ys)

time_steps = 7

# reshape to [samples, time_steps, n_features]

X_train, y_train = create_dataset(train, train.Rainfall, time_steps)
X_test, y_test = create_dataset(test, test.Rainfall, time_steps)

#testing
X_test[0][0]


#model code

model = keras.Sequential()

# 3 bidirectional LSTM layers
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences = True)))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128,  return_sequences = True)))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128 )))
model.add(keras.layers.Dropout(rate=0.1))
model.add(keras.layers.Dense(units=1))
model.compile(loss="mean_squared_error", optimizer="RMSprop")

#training the model
history = model.fit(
    X_train, y_train, 
    epochs=500, 
    batch_size=1052, 
    validation_split=0.2,
    shuffle=False,
)

# Save the model (raw string so the backslash is not treated as an escape)
from tensorflow.keras.models import load_model
model.save(r"\Timeseries-timestep7-batchsize1052.h5")

# Use the test dataset to make a prediction
y_pred = model.predict(X_test)

# Inverse-transform back to the original rainfall scale
# (the scaler was fitted on one column, so it expects shape (n_samples, 1))
y_train_inv = l_transformer.inverse_transform(y_train.reshape(-1, 1))
y_test_inv = l_transformer.inverse_transform(y_test.reshape(-1, 1))
y_pred_inv = l_transformer.inverse_transform(y_pred)

# Score: RMSE, computed on the scaled values
from sklearn import metrics
score = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(score)

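For example, I want to use my trained model to predict future data, possibly over a random or custom range. Say I want to predict the daily rainfall for 2017, or get the prediction for 25-02-2017, or for the X days after the end of the dataset.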

1 answer

燕璞
2023-03-14
class WindowGenerator():
  def __init__(self, input_width, label_width, shift,
               train_df=train_df, val_df=val_df, test_df=test_df,
               label_columns=None):
    # Store the raw data.
    self.train_df = train_df
    self.val_df = val_df
    self.test_df = test_df

    # Work out the label column indices.
    self.label_columns = label_columns
    if label_columns is not None:
      self.label_columns_indices = {name: i for i, name in
                                    enumerate(label_columns)}
    self.column_indices = {name: i for i, name in
                           enumerate(train_df.columns)}

    # Work out the window parameters.
    self.input_width = input_width
    self.label_width = label_width
    self.shift = shift

    self.total_window_size = input_width + shift

    self.input_slice = slice(0, input_width)
    self.input_indices = np.arange(self.total_window_size)[self.input_slice]

    self.label_start = self.total_window_size - self.label_width
    self.labels_slice = slice(self.label_start, None)
    self.label_indices = np.arange(self.total_window_size)[self.labels_slice]

  def __repr__(self):
    return '\n'.join([
        f'Total window size: {self.total_window_size}',
        f'Input indices: {self.input_indices}',
        f'Label indices: {self.label_indices}',
        f'Label column name(s): {self.label_columns}'])
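input_width: two years of daily history => 365 * 2 = 730
label_width: the whole of February => 28
shift: the forecast does not start on 1 January 2017 but skips January first. Since total_window_size = input_width + shift and the labels are the last label_width steps of the window, shift has to cover January plus February => 31 + 28 = 59 (a shift of 30 would put the labels inside January)
train_df, val_df, test_df: the training, validation and test DataFrames
label_columns: list with the name(s) of the target column, e.g. ['Rainfall']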
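
The window arithmetic from `__init__` can be checked on its own before any model is involved. The sketch below is only an illustration and not part of the class above: it assumes the input window ends on 2016-12-31 (the last day before the year to forecast) and uses a shift of 59 (31 days of January plus 28 days of February) so that the labels land on February 2017. `day_at` is a hypothetical helper that maps a window index to a calendar date.

```python
from datetime import date, timedelta
import numpy as np

# Assumed parameters: two years of input, labels for February, skipping January.
input_width, label_width, shift = 730, 28, 59

# Same arithmetic as WindowGenerator.__init__
total_window_size = input_width + shift
input_indices = np.arange(total_window_size)[slice(0, input_width)]
label_start = total_window_size - label_width
label_indices = np.arange(total_window_size)[slice(label_start, None)]

# Assumption: the last input day is the last day of the dataset.
last_input_day = date(2016, 12, 31)
first_day = last_input_day - timedelta(days=input_width - 1)

def day_at(i):
    """Calendar date of position i inside the window (hypothetical helper)."""
    return first_day + timedelta(days=int(i))

print("inputs:", day_at(input_indices[0]), "to", day_at(input_indices[-1]))
print("labels:", day_at(label_indices[0]), "to", day_at(label_indices[-1]))
```

If the printed label range is not the period you want to forecast, adjust `shift` and `label_width` before building the window rather than after training.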