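In real business scenarios, a trained model often still needs some fine-tuning, or some hand-written rules have to be added to it. To do that, you need to understand the structure of the PMML file, how to compute the prediction score by hand from it, and how to modify it.
Reference for the PMML 4.1 file structure: http://dmg.org/pmml/v4-1/MultipleModels.html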
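Elements to pay attention to:
(1) MiningBuildTask: description of how the model file was built.
(2) DataDictionary: the data dictionary, which declares the type of each feature and of the target.
(3) multipleModelMethod: when there are multiple trees, how their outputs are combined into the prediction score. The allowed values are: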
<xs:simpleType name="MULTIPLE-MODEL-METHOD">
  <xs:restriction base="xs:string">
    <xs:enumeration value="majorityVote"/>
    <xs:enumeration value="weightedMajorityVote"/>
    <xs:enumeration value="average"/>
    <xs:enumeration value="weightedAverage"/>
    <xs:enumeration value="median"/>
    <xs:enumeration value="max"/>
    <xs:enumeration value="sum"/> <!-- sum the leaf scores of all trees -->
    <xs:enumeration value="selectFirst"/>
    <xs:enumeration value="selectAll"/>
    <xs:enumeration value="modelChain"/>
  </xs:restriction>
</xs:simpleType>
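(4) TreeModel: one tree. The number of TreeModel elements equals the number of trees.
    (1) Each tree outputs exactly one score. When sibling Nodes both match, these notes take the largest score among the matching branches. The PMML specification itself evaluates siblings in document order and selects the first one whose predicate is true; for the trees shown here the two rules give the same result.
    (2) In the PMML file, a Node nested inside another Node (indented further to the right) is its child; Nodes at the same indentation level are siblings.
    (3) Within a branch, work back from the leaf: if the current Node's condition is not met, take the score of its parent, and so on up the tree. If no condition is met, take the root Node's score. This is what noTrueChildStrategy="returnLastPrediction" specifies.
    (4) An example: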
<Segment id="1"> <!-- treeModel1 -->
  <True/>
  <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
    <MiningSchema>
      <MiningField name="q_len"/>
      <MiningField name="q_c_click_num"/>
    </MiningSchema>
    <Node score="-0.0011013647"> <!-- root node -->
      <True/>
      <Node score="0.0019000656"> <!-- Node1 -->
        <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
        <Node score="5.699482E-4">
          <SimplePredicate field="q_len" operator="greaterOrEqual" value="8.5"/>
        </Node>
      </Node>
      <Node score="-0.001793226"> <!-- Node2, sibling of Node1; when both match, the larger score is taken -->
        <SimplePredicate field="q_len" operator="greaterOrEqual" value="1.5"/>
      </Node>
    </Node>
  </TreeModel>
</Segment>
<Segment id="2"> <!-- treeModel2 -->
  <True/>
  <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
    <MiningSchema>
      <MiningField name="c_c_cos"/>
      <MiningField name="q_c_click_num"/>
    </MiningSchema>
    <Node score="-0.0019235479"> <!-- root node -->
      <True/>
      <Node score="5.842927E-4"> <!-- Node1 -->
        <SimplePredicate field="c_c_cos" operator="greaterOrEqual" value="0.55919766"/>
        <Node score="0.0019866328">
          <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
        </Node>
      </Node>
      <Node score="0.0011271919"> <!-- Node2, sibling of Node1 -->
        <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
      </Node>
    </Node>
  </TreeModel>
</Segment>
Feature values of sample 1: q_len=3, q_c_click_num=0.7, c_c_cos=0.99
##= divider ===
treeModel1:
Node1:
  when q_c_click_num>=0.35 and q_len>=8.5 then 5.699482E-4
  when q_c_click_num>=0.35 then 0.0019000656
  else: -0.0011013647
Node2:
  when q_len>=1.5 then -0.001793226
  else: -0.0011013647
From the above:
Sample 1 satisfies q_c_click_num>=0.35 but not q_len>=8.5, so Node1 gives 0.0019000656; it also satisfies q_len>=1.5, so Node2 gives -0.001793226. treeModel1 takes the larger score, 0.0019000656.
That is: treemodel1_score = 0.0019000656
##= divider ===
treeModel2:
Node1:
  when c_c_cos>=0.55919766 and q_c_click_num>=0.35 then 0.0019866328
  when c_c_cos>=0.55919766 then 5.842927E-4
  else: -0.0019235479
Node2:
  when q_c_click_num>=0.35 then 0.0011271919
  else: -0.0019235479
From the above:
Sample 1 satisfies c_c_cos>=0.55919766 and q_c_click_num>=0.35, so Node1 gives 0.0019866328 and Node2 gives 0.0011271919. treeModel2 takes the larger score, 0.0019866328.
That is: treemode2_score = 0.0019866328
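The two walkthroughs above can be reproduced with a small evaluator. The sketch below is not a full PMML engine: it assumes the snippets carry no XML namespace (real PMML files usually declare one), handles only <True/> and greaterOrEqual predicates, and follows the PMML rule of descending into the first child whose predicate is true, falling back to the current Node's score when no child matches (returnLastPrediction). The names matches and score_tree are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The two trees from the example, copied without the Segment wrapper.
TREE_1 = """<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
  <Node score="-0.0011013647"><True/>
    <Node score="0.0019000656">
      <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
      <Node score="5.699482E-4">
        <SimplePredicate field="q_len" operator="greaterOrEqual" value="8.5"/>
      </Node>
    </Node>
    <Node score="-0.001793226">
      <SimplePredicate field="q_len" operator="greaterOrEqual" value="1.5"/>
    </Node>
  </Node>
</TreeModel>"""

TREE_2 = """<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
  <Node score="-0.0019235479"><True/>
    <Node score="5.842927E-4">
      <SimplePredicate field="c_c_cos" operator="greaterOrEqual" value="0.55919766"/>
      <Node score="0.0019866328">
        <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
      </Node>
    </Node>
    <Node score="0.0011271919">
      <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
    </Node>
  </Node>
</TreeModel>"""


def matches(node, sample):
    """Evaluate a Node's predicate; only <True/> and greaterOrEqual are supported here."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "greaterOrEqual":
        raise ValueError("predicate not supported by this sketch")
    return sample[pred.get("field")] >= float(pred.get("value"))


def score_tree(xml_text, sample):
    """Walk one tree: take the first matching child, else keep the current score."""
    node = ET.fromstring(xml_text).find("Node")  # root Node
    if not matches(node, sample):
        raise ValueError("root predicate is false")
    while True:
        child = next((c for c in node.findall("Node") if matches(c, sample)), None)
        if child is None:  # leaf reached, or no child matched (returnLastPrediction)
            return float(node.get("score"))
        node = child


sample_1 = {"q_len": 3, "q_c_click_num": 0.7, "c_c_cos": 0.99}
treemodel1_score = score_tree(TREE_1, sample_1)
treemodel2_score = score_tree(TREE_2, sample_1)
print(treemodel1_score, treemodel2_score)
```

For sample 1 this reaches the same Nodes as the manual calculation. Missing feature values are not handled; a lookup of an absent field simply raises KeyError.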
##= divider ===
a = treemodel1_score + treemode2_score
pre = 1/(1+np.exp(-a)) # sigmoid, because the model was trained with objective='binary:logistic'
Verification showed that this calculation gives the same result as xgb.predict().
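As a worked instance of the formula above, the following sketch plugs in the two tree scores obtained for sample 1. It uses the standard library's math.exp instead of NumPy, and it assumes, as the formula above does, that no extra offset is added to the summed scores before the sigmoid.

```python
import math

# Leaf scores selected for sample 1 in treeModel1 and treeModel2.
tree_scores = [0.0019000656, 0.0019866328]

# multipleModelMethod="sum": add the per-tree scores to get the raw margin.
margin = sum(tree_scores)

# objective='binary:logistic': map the margin to a probability with the sigmoid.
probability = 1.0 / (1.0 + math.exp(-margin))

print(f"margin={margin:.10f} probability={probability:.6f}")
```

With only two small tree scores the margin is close to 0, so the probability stays close to 0.5.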
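To add a new tree, or to modify an existing one, insert a Segment of the following form into the PMML file. After this change was verified, the PMML file was reloaded and its predictions matched expectations.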
"""
<Segment id="3">
  <True/>
  <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
    <MiningSchema>
      <MiningField name="c_c_cos"/>
      <MiningField name="q_c_click_num"/>
    </MiningSchema>
    <Node score="-0.0019259413">
      <True/>
      <Node score="5.4347207E-4">
        <SimplePredicate field="c_c_cos" operator="greaterOrEqual" value="0.5531428"/>
        <Node score="0.001985003">
          <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
        </Node>
      </Node>
      <Node score="0.0011074619">
        <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
      </Node>
    </Node>
  </TreeModel>
</Segment>
"""
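Instead of editing the file by hand, the Segment can also be appended with a script. The sketch below uses Python's xml.etree.ElementTree on a deliberately tiny stand-in document: BASE_PMML and the function name append_segment are hypothetical, and the stand-in has no XML namespace. A real PMML file normally declares one, in which case both the lookup and the new elements would need the same namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an existing PMML file.
BASE_PMML = """<PMML>
  <MiningModel functionName="regression">
    <Segmentation multipleModelMethod="sum">
      <Segment id="1"><True/><TreeModel functionName="regression">
        <Node score="0.0019000656"><True/></Node></TreeModel></Segment>
      <Segment id="2"><True/><TreeModel functionName="regression">
        <Node score="0.0019866328"><True/></Node></TreeModel></Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""

# The new tree from the text above (MiningSchema omitted to keep the sketch short).
NEW_SEGMENT = """<Segment id="3">
  <True/>
  <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction">
    <Node score="-0.0019259413"><True/>
      <Node score="5.4347207E-4">
        <SimplePredicate field="c_c_cos" operator="greaterOrEqual" value="0.5531428"/>
        <Node score="0.001985003">
          <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
        </Node>
      </Node>
      <Node score="0.0011074619">
        <SimplePredicate field="q_c_click_num" operator="greaterOrEqual" value="0.35"/>
      </Node>
    </Node>
  </TreeModel>
</Segment>"""


def append_segment(pmml_text, segment_text):
    """Return the PMML text with one more Segment appended to its Segmentation."""
    root = ET.fromstring(pmml_text)
    segmentation = root.find(".//Segmentation")
    if segmentation is None:
        raise ValueError("no Segmentation element found")
    segmentation.append(ET.fromstring(segment_text))
    return ET.tostring(root, encoding="unicode")


updated_pmml = append_segment(BASE_PMML, NEW_SEGMENT)
segment_ids = [s.get("id") for s in ET.fromstring(updated_pmml).iter("Segment")]
print(segment_ids)
```

As in the text above, the modified file should be reloaded and its predictions checked against a manual calculation before it is used.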