Training data used in this article: https://download.csdn.net/download/qq_31573519/12344779
Download the data from the address above. The format is as follows:
1. Data description
User ID, item ID, category ID, behavior type, timestamp
Field: Explanation
User ID: An integer, the serialized ID that represents a user
Item ID: An integer, the serialized ID that represents an item
Category ID: An integer, the serialized ID that represents the category to which the corresponding item belongs
Behavior type: A string, enum-type from ('pv', 'buy')
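To make the record layout concrete, here is a minimal sketch that parses one CSV line into named fields. The names `Behavior` and `parse_record` and the sample line are made up for illustration. The timestamp is kept as text because the field list above does not describe its type.

```python
from typing import NamedTuple


class Behavior(NamedTuple):
    user_id: int
    item_id: int
    category_id: int
    behavior_type: str  # 'pv' or 'buy' according to the field list
    timestamp: str      # kept as text; its type is not described above


def parse_record(line: str) -> Behavior:
    """Split one comma-separated line into the five documented columns."""
    user_id, item_id, category_id, behavior_type, timestamp = line.strip().split(',')
    return Behavior(int(user_id), int(item_id), int(category_id), behavior_type, timestamp)


# Illustrative line only, not taken from the real file
record = parse_record('1,2268318,2520377,pv,1511544070')
print(record.behavior_type)
```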
vim process.py
import random

lines = []
file_path = '/Users/Documents/data.csv'
with open(file_path, 'r') as infile:
    for line in infile:
        lines.append(line.strip())
# print(len(lines))

# Shuffle so that a later head/tail split yields comparable train and test sets
random.shuffle(lines)

for line in lines:
    # Columns: user ID, item ID, category ID, behavior type, timestamp
    temp = line.split(',')
    # Label is 1 for 'buy' and 0 otherwise; each ID becomes a feature with value 1
    label = '1' if temp[3] == 'buy' else '0'
    print(label + ' ' + temp[0] + ':1 ' + temp[1] + ':1 ' + temp[2] + ':1')
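Note: the data must be shuffled with random.shuffle(lines), because the later head/tail split simply takes the first and last lines of the file. The commands that follow assume the script's output was saved to ./data/train_all_data, for example with python process.py > ./data/train_all_data.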
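One thing to be aware of: the script above uses the raw IDs directly as feature indices, so user 7, item 7 and category 7 would all be written as feature `7`. The following is a sketch of one possible variation, not what the script above does, that gives every (field, ID) pair its own index. The helper names `make_indexer` and `to_libfm_line` are hypothetical.

```python
def make_indexer():
    """Return a function that maps (field, raw_id) to a unique consecutive index."""
    index_of = {}

    def index(field, raw_id):
        key = (field, raw_id)
        if key not in index_of:
            index_of[key] = len(index_of)
        return index_of[key]

    return index


def to_libfm_line(fields, index):
    """Build 'label idx:1 idx:1 idx:1' from the first four CSV columns."""
    user_id, item_id, category_id, behavior_type = fields[:4]
    label = '1' if behavior_type == 'buy' else '0'
    features = [
        index('user', user_id),
        index('item', item_id),
        index('category', category_id),
    ]
    return label + ' ' + ' '.join(f'{i}:1' for i in features)


index = make_indexer()
print(to_libfm_line('7,7,7,buy,0'.split(','), index))  # 1 0:1 1:1 2:1
print(to_libfm_line('7,8,7,pv,0'.split(','), index))   # 0 0:1 3:1 2:1
```

The same indexer has to be used for every line before the file is split, so that training and test data share one index space.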
head -2196681 ./data/train_all_data > ./data/train_shuffle
tail -549170 ./data/train_all_data > ./data/test_shuffle
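The two line counts correspond to roughly an 80/20 split. The sketch below redoes that arithmetic and mimics the head/tail split on an in-memory list. It assumes the file has exactly 2196681 + 549170 lines, so the two pieces neither overlap nor leave a gap. `split_head_tail` is a hypothetical helper.

```python
def split_head_tail(lines, n_train):
    """Mimic `head -n_train` plus `tail` of the remainder on a list of lines."""
    return lines[:n_train], lines[n_train:]


n_train, n_test = 2196681, 549170  # counts used in the head/tail commands
total = n_train + n_test
print(total, round(n_train / total, 4))  # 2745851 0.8

# Tiny illustration with ten made-up lines
train, test = split_head_tail([f'line{i}' for i in range(10)], 8)
print(len(train), len(test))  # 8 2
```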
Check whether the ratio of positive to negative samples in the training and test sets is as expected:
cat ./data/train_shuffle| grep '^1' | wc -l
47756
cat ./data/train_shuffle| grep '^0' | wc -l
2148925
cat ./data/test_shuffle| grep '^0' | wc -l
536951
cat ./data/test_shuffle| grep '^1' | wc -l
12219
>>> 47756.0/2148925
0.022223204625568597
>>> 12219.0/536951
0.022756266400472295
>>>
Both ratios are about 0.02 positives per negative, which is as expected (the shuffle logic is fine).
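The same check can be done in one pass in Python instead of grep and wc. This is a minimal sketch that assumes each line starts with the label followed by a space, as produced by process.py. `label_counts` and `positives_per_negative` are hypothetical helper names, and the sample lines are made up.

```python
from collections import Counter


def label_counts(lines):
    """Count libFM-format lines by their leading label token."""
    return Counter(line.split(' ', 1)[0] for line in lines if line.strip())


def positives_per_negative(counts):
    """Ratio of label-1 lines to label-0 lines."""
    return counts['1'] / counts['0']


# Made-up sample; a real run would pass an open file object instead
sample = ['1 3:1 9:1 4:1', '0 3:1 8:1 4:1', '0 5:1 8:1 4:1', '0 6:1 2:1 1:1']
counts = label_counts(sample)
print(counts['1'], counts['0'], positives_per_negative(counts))
```

For the real files, something like `label_counts(open('./data/train_shuffle'))` should give the two counts in a single read.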
vim train.sh
#./bin/libFM -task c -method sgd -train ./data/train_shuffle -test ./data/test_shuffle -dim '1,1,8' -out result_v1 -save_model model_v1 -iter 20 -learn_rate 0.01
#./bin/libFM -task c -method sgd -train ./data/train_shuffle -test ./data/test_shuffle -dim '1,1,16' -out result_v1 -iter 20 -learn_rate 0.01
./bin/libFM -task c -method sgd -train ./data/train_shuffle -test ./data/test_shuffle -dim '1,1,32' -out result_v1 -iter 100 -learn_rate 0.01
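In libFM's -dim 'k0,k1,k2' option, k0 and k1 switch the global bias and the one-way (linear) terms on or off, and k2 is the number of latent factors used for pairwise interactions. The commands above try k2 = 8, 16 and 32. The sketch below shows what those three parts contribute to a second-order factorization machine score. It is an illustration of the model equation in plain Python, not libFM's own code, and all parameter values are made up.

```python
def fm_score(x, w0, w, V):
    """Second-order factorization machine score for a sparse input.

    x:  dict mapping feature index -> value
    w0: global bias (the k0 part of -dim)
    w:  dict mapping feature index -> linear weight (the k1 part)
    V:  dict mapping feature index -> list of k2 latent factors
    """
    score = w0 + sum(w.get(i, 0.0) * value for i, value in x.items())
    k = len(next(iter(V.values()))) if V else 0
    for f in range(k):
        # Sum over all pairs i<j of V[i][f]*V[j][f]*x_i*x_j equals
        # 0.5 * ((sum_i V[i][f]*x_i)^2 - sum_i (V[i][f]*x_i)^2)
        s = sum(V[i][f] * value for i, value in x.items() if i in V)
        s_sq = sum((V[i][f] * value) ** 2 for i, value in x.items() if i in V)
        score += 0.5 * (s * s - s_sq)
    return score


# One sample with three active features, like a "label a:1 b:1 c:1" line
x = {0: 1.0, 1: 1.0, 2: 1.0}
w0 = 0.1
w = {0: 0.2, 1: -0.1, 2: 0.0}
V = {0: [1.0, 0.0], 1: [0.5, 1.0], 2: [0.0, 2.0]}  # k2 = 2 here for brevity
print(fm_score(x, w0, w, V))
```

With three active features per line, each prediction involves three linear weights and three pairwise dot products, so a larger k2 mainly increases the capacity of the user-item, user-category and item-category interactions.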