
Jester Dataset

姬选
2023-12-01

Original text:

4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.

Freely available for research use when acknowledged with the following reference:

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

(Aside: many papers, including ours, report Normalized Mean Absolute Error (NMAE) rates of approx 20%. How good is this compared with random guessing? In the Appendix to our paper, we show that if user ratings are uniformly distributed, random guessing yields NMAE = 33%.)
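The 33% figure in the aside above can be sanity-checked with a small simulation. This is only a sketch under my own assumptions: NMAE is taken to be the mean absolute error divided by the rating range (20 for this dataset), and "random guessing" is taken to mean drawing each prediction uniformly from the same range as the ratings. The paper's appendix is the authoritative derivation; the function name and parameters here are made up for illustration. For two independent uniform values on [a, b], the expected absolute difference is (b - a) / 3, so the normalized error should come out near 1/3.

```python
import random


def nmae_random_guessing(n_samples=200_000, low=-10.0, high=10.0, seed=0):
    """Estimate NMAE when both ratings and guesses are uniform on [low, high].

    Assumption: NMAE = MAE / (high - low).
    """
    rng = random.Random(seed)
    total_abs_error = 0.0
    for _ in range(n_samples):
        actual = rng.uniform(low, high)  # simulated "true" rating
        guess = rng.uniform(low, high)   # simulated random prediction
        total_abs_error += abs(actual - guess)
    mae = total_abs_error / n_samples
    return mae / (high - low)


if __name__ == "__main__":
    print(f"Estimated NMAE of random guessing: {nmae_random_guessing():.3f}")
```

Under these assumptions the estimate lands close to 0.33, which is the baseline the reported NMAE of roughly 20% should be compared against.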

As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Jester Dataset (save to disk, then unzip to obtain Excel files):

jester-data-1.zip : (3.9MB) Data from 24,983 users who have rated 36 or more jokes, a matrix with dimensions 24,983 X 101.

jester-data-2.zip : (3.6MB) Data from 23,500 users who have rated 36 or more jokes, a matrix with dimensions 23,500 X 101.

jester-data-3.zip : (2.1MB) Data from 24,938 users who have rated between 15 and 35 jokes, a matrix with dimensions 24,938 X 101.

Format:

3 Data files contain anonymous ratings data from 73,421 users.

Data files are in .zip format; when unzipped, they are in Excel (.xls) format.

Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").

One row per user

The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.

The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
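The format rules above can be sketched in a few lines of Python. This is a minimal illustration, not an official loader: it works on one row that has already been read into a list of 101 numbers (reading the .xls files themselves needs a spreadsheet library, for example pandas.read_excel with header=None, which is not shown here). The helper names are hypothetical, and I am assuming that the numbers {5, 7, 8, ...} refer to 1-based joke numbers rather than to spreadsheet column positions that include the count column.

```python
NOT_RATED = 99  # sentinel value the files use for "null" / "not rated"

# Dense jokes listed in the description (assumed to be 1-based joke numbers).
DENSE_JOKES = [5, 7, 8, 13, 15, 16, 17, 18, 19, 20]


def parse_row(row):
    """Split one 101-value row into (num_rated, ratings).

    ratings holds 100 entries for jokes 1..100, with None where the file has 99.
    """
    if len(row) != 101:
        raise ValueError(f"expected 101 values, got {len(row)}")
    num_rated = int(row[0])  # first column: number of jokes rated by this user
    ratings = [None if value == NOT_RATED else float(value) for value in row[1:]]
    return num_rated, ratings


def dense_ratings(ratings):
    """Return the ratings for the dense jokes, in the order listed above."""
    return [ratings[joke - 1] for joke in DENSE_JOKES]


if __name__ == "__main__":
    # A made-up row for illustration only (not real Jester data).
    row = [3] + [NOT_RATED] * 100
    row[5] = 2.5      # joke 5
    row[7] = -1.25    # joke 7
    row[100] = 9.0    # joke 100
    num_rated, ratings = parse_row(row)
    print(num_rated, sum(r is not None for r in ratings))
    print(dense_ratings(ratings))
```

Converting 99 to None (or NaN in a numeric array) before computing any statistics matters, because otherwise the sentinel would be averaged in as if it were a real rating.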


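You can download the dataset from the official site; I have also shared a copy on Baidu Netdisk. To get the download link, follow my WeChat official account and reply "2020092701".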

 
