state_space:

```python
# observation size: cash balance (1) + price and holding for each stock
# (2 * stock_dimension) + one slot per technical indicator per stock
state_space = (
    1
    + 2 * stock_dimension
    + len(config.TECHNICAL_INDICATORS_LIST) * stock_dimension
)
```
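As a quick worked check (a sketch; the numbers assume 2 stocks and the 8 indicators in FinRL's default config.TECHNICAL_INDICATORS_LIST):

```python
stock_dimension = 2
n_indicators = 8  # e.g. macd, boll_ub, boll_lb, rsi_30, cci_30, dx_30, close_30_sma, close_60_sma
print(1 + 2 * stock_dimension + n_indicators * stock_dimension)  # 21
```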
Build the environment.

Build the agent, which depends on the environment:

```python
agent = DRLAgent(env=env_train)
```
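A minimal sketch of those two steps, assuming FinRL's StockTradingEnv and a preprocessed training DataFrame train (import paths and env_kwargs follow the FinRL tutorials and vary between versions):

```python
from finrl.env.env_stocktrading import StockTradingEnv
from finrl.model.models import DRLAgent

e_train_gym = StockTradingEnv(df=train, **env_kwargs)  # plain gym env
env_train, _ = e_train_gym.get_sb_env()                # vectorized wrapper for stable-baselines
agent = DRLAgent(env=env_train)
```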
Define the state and action spaces (the shapes below match the 2-stock example above: one action per stock, 21 state entries):

```python
action = Box(-1.0, 1.0, (2,), float32)
state = Box(-inf, inf, (21,), float32)
```
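How the env would construct these spaces (a sketch using gym.spaces directly):

```python
import numpy as np
from gym import spaces

stock_dimension = 2
state_space = 21  # from the formula above

action_space = spaces.Box(low=-1.0, high=1.0, shape=(stock_dimension,), dtype=np.float32)
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(state_space,), dtype=np.float32)
```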
Construct the data: env.step
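Data generation is the standard gym loop, sketched below (stable-baselines runs this internally in its runner; model stands for any loaded model exposing a predict method):

```python
obs = env_train.reset()
for _ in range(1000):
    action, _states = model.predict(obs)                 # the policy computes the action
    obs, rewards, dones, info = env_train.step(action)   # the env advances one step
```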
Load the sub-policy. There are many options, all wrapped at the lower level; MlpPolicy is the usual choice. Each policy has a step method whose return values are already computed internally; it is used to generate actions, which are then run in the env. The available choices live in a registry:
```python
_policy_registry = {
    ActorCriticPolicy: {
        "CnnPolicy": CnnPolicy,
        "CnnLstmPolicy": CnnLstmPolicy,
        "CnnLnLstmPolicy": CnnLnLstmPolicy,
        "MlpPolicy": MlpPolicy,
        "MlpLstmPolicy": MlpLstmPolicy,
        "MlpLnLstmPolicy": MlpLnLstmPolicy,
    }
}
```
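The name string is resolved to a class via a stable-baselines helper (a sketch; the signature follows the TensorFlow-based stable-baselines):

```python
from stable_baselines.common.policies import ActorCriticPolicy, get_policy_from_name

policy_cls = get_policy_from_name(ActorCriticPolicy, "MlpPolicy")  # -> MlpPolicy
```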
Initialization: MlpPolicy simply forwards to the shared feed-forward base class with feature_extraction="mlp":

```python
super(MlpPolicy, self).__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, reuse,
                                feature_extraction="mlp", **_kwargs)
```
Load the policy model. The agent loads it; the model itself is the stock stable_baselines3 implementation, and custom hyperparameters can be passed in. (Note: the source excerpts in this section, e.g. the policy registry and policy_tf, come from the TensorFlow-based stable-baselines; stable-baselines3 is its PyTorch rewrite.)

```python
model_ppo = agent.get_model("ppo")
```
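For example, with custom PPO hyperparameters (a sketch following the FinRL tutorials; the model_kwargs keyword is an assumption about DRLAgent.get_model's signature):

```python
PPO_PARAMS = {
    "n_steps": 2048,
    "ent_coef": 0.01,
    "learning_rate": 0.00025,
    "batch_size": 128,
}
model_ppo = agent.get_model("ppo", model_kwargs=PPO_PARAMS)
```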
At load time the model is initialized in stages:

input initialization
model initialization
```python
# build the actor function
self.deterministic_action, policy_out, logp_pi = self.policy_tf.make_actor(self.processed_obs_ph)
# build the critic functions -- this uses twin Q networks (plus a value function)
qf1, qf2, value_fn = self.policy_tf.make_critics(self.processed_obs_ph, self.actions_ph,
                                                 create_qf=True, create_vf=True)
# re-evaluate the Q networks on the actor's own action, reusing the weights
qf1_pi, qf2_pi, _ = self.policy_tf.make_critics(self.processed_obs_ph, policy_out,
                                                create_qf=True, create_vf=False, reuse=True)
```
target initialization
loss initialization
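Continuing the excerpt above, a sketch of how the twin-Q outputs feed the target and loss steps (simplified from stable-baselines' SAC; self.value_target is the target network's value output, rewards_ph/terminals_ph are batch placeholders):

```python
# pessimistic estimate: take the minimum of the two Q networks (the point of twin Q)
min_qf_pi = tf.minimum(qf1_pi, qf2_pi)

# value target: V(s) ~ min Q(s, pi(s)) - alpha * log pi(a|s)
v_backup = tf.stop_gradient(min_qf_pi - self.ent_coef * logp_pi)
value_loss = 0.5 * tf.reduce_mean((value_fn - v_backup) ** 2)

# Q target bootstraps from the *target* value network at the next state
q_backup = tf.stop_gradient(
    self.rewards_ph + (1 - self.terminals_ph) * self.gamma * self.value_target
)
qf1_loss = 0.5 * tf.reduce_mean((q_backup - qf1) ** 2)
qf2_loss = 0.5 * tf.reduce_mean((q_backup - qf2) ** 2)
```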
Train the model: the agent runs training, again using the stock stable_baselines3 training loop.
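For instance (keyword names follow the FinRL tutorials and are an assumption about DRLAgent.train_model):

```python
trained_ppo = agent.train_model(model=model_ppo,
                                tb_log_name="ppo",
                                total_timesteps=50000)
```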
Risk control: turbulence_threshold. When the market turbulence index rises above this threshold, the environment overrides the policy's actions and liquidates all positions.
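A conceptual sketch of how the env applies the threshold inside step() (simplified from FinRL's StockTradingEnv; attribute names may differ by version):

```python
# force-sell every position when turbulence spikes above the threshold
if self.turbulence_threshold is not None and self.turbulence >= self.turbulence_threshold:
    actions = np.array([-self.hmax] * self.stock_dim)
```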