Opinion Spam Paper Notes (1): Collective Opinion Spam Detection: Bridging Review Networks and Metadata

淳于烈

2023-12-01

Paper: http://shebuti.com/wp-content/uploads/2016/06/15-kdd-collectiveopinionspam.pdf

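Terminology:

Bipartite network: a bipartite network consists of two types of nodes, with edges existing only between nodes of different types. Many collaboration networks in nature and society can be described as bipartite networks made up of collaborating actors and the activities they collaborate on. Bipartite networks are ubiquitous and have become an important object of study in complex network research. In existing work on bipartite networks, the usual approach is to project the network onto a single node type and then analyze the resulting one-mode network. (Baidu Baike)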
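
To make the projection step concrete, here is a minimal sketch in plain Python. The edge list, the node names, and the function name `project_onto_users` are made up for illustration; weighting each projected edge by the number of shared products is one common choice, not the only one.

```python
from collections import defaultdict
from itertools import combinations

def project_onto_users(edges):
    """Project a user-product bipartite graph onto its user nodes.

    Two users are linked if they share at least one product; the weight
    counts how many products they have in common.
    """
    users_by_product = defaultdict(set)
    for user, product in edges:
        users_by_product[product].add(user)

    weights = defaultdict(int)
    for users in users_by_product.values():
        # Every pair of users attached to the same product gets one more shared product.
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)

# Hypothetical (user, product) edges.
edges = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p2"), ("u1", "p2")]
projection = project_onto_users(edges)
print(projection)
```

Note that projection discards information: after it, you can no longer tell which products linked two users.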

Relational data: user-review-product graph

Metadata: behavioral and text data (information that provides information about other data)

Differences between unsupervised, semi-supervised, and supervised learning:
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/
Supervised: the algorithm is trained on a fully labeled dataset, i.e., every example in the training set comes with the output the algorithm should produce. For example, with a dataset of flower species, the labels tell the model whether its predictions are right or wrong. When shown a new image, the model compares it to the training examples to predict the correct label. In general, supervised learning is an option whenever a reference dataset is available. “Supervised learning is, thus, best suited to problems where there is a set of available reference points or a ground truth with which to train the algorithm. But those aren’t always available.”

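Unsupervised: when we do not have enough reference (labeled) data and still need an algorithm to produce answers, we turn to unsupervised learning. There is no explicit instruction, and the training set is a collection of data without ground-truth answers. The neural network* then analyzes features to find structure in the data on its own.
Common forms:

Clustering: grouping similar data points together.
Anomaly Detection: flagging unusual cases, e.g., a credit card used in both California and London on the same day.
Association: when you add a pile of baby products to your shopping cart, the site automatically recommends other baby products. Very common.
Autoencoders: suppose you have a storybook, you write an outline of it, and then you rewrite the story from your outline. Autoencoders can remove noise from visual data like images, video or medical scans to improve picture quality. A while ago a friend recommended a WeChat mini program that "restores photos", and I now realize it relies on this kind of technique: https://news.developer.nvidia.com/ai-can-now-fix-your-grainy-photos-by-only-looking-at-grainy-photos/

Overall, because unsupervised learning has no ground-truth answers, its results may be less accurate.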

Neural Network*: https://developer.nvidia.com/discover/artificial-neural-network
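It is a mathematical model of an algorithm that imitates the behavior of animal neural networks and performs distributed, parallel information processing. Depending on the complexity of the system, such a network processes information by adjusting the connections among its large number of internal nodes.
Characteristics: massively parallel processing, distributed storage, elastic topology, high redundancy, and nonlinear computation.
Artificial neural networks can be used for data compression, image processing, vector coding, error control (error-correcting and error-detecting codes), adaptive signal processing, adaptive equalization, signal detection, pattern recognition, ATM flow control, routing, communication network optimization, intelligent network management, and more. (Baidu Baike)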

Semi-supervised:
A training dataset with both labeled and unlabeled data.

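R-precision: suppose you have 100 documents, 30 of which are relevant (R = 30), and 10 of the top 30 retrieved documents are relevant (r = 10). The R-precision is then 10/30, i.e., one third. Computing it requires knowing the number of relevant documents in the whole dataset.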

Precision@k:
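Suppose you have 10 ranked documents, of which documents 1, 3, 6, 7, and 9 are relevant, and you want to know how well your algorithm is doing at the 5th position. Since only documents 1 and 3 among the top 5 are relevant, Precision@K with K = 5 is 2/5. It is easier to score manually since only the top k results need to be examined to determine if they are relevant or not. (Wikipedia)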
(R-Precision is the same as Precision@X where X is the total number of relevant documents in the collection.)
https://cs.stackexchange.com/questions/67736/what-is-the-difference-between-r-precision-and-precision-at-k
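
The two worked examples above can be reproduced with a few lines of Python. This is only a sketch: the function names are my own, and a ranking is encoded as a list of 0/1 relevance flags in ranked order.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant (1) rather than not (0)."""
    return sum(ranked_relevance[:k]) / k

def r_precision(ranked_relevance, total_relevant):
    """Precision@R, where R is the number of relevant documents in the collection."""
    return precision_at_k(ranked_relevance, total_relevant)

# Precision@k example: documents 1, 3, 6, 7, 9 (1-based ranks) are relevant.
ranking = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
p_at_5 = precision_at_k(ranking, 5)

# R-precision example: 100 documents, R = 30 relevant, 10 of them in the top 30.
big_ranking = [1] * 10 + [0] * 20 + [1] * 20 + [0] * 50
r_prec = r_precision(big_ranking, 30)

print(p_at_5, r_prec)
```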

Joint probability: Joint probability is the probability of event Y occurring at the same time that event X occurs.

Search evaluation metrics:
Cumulative Gain: considers only relevance; position does not matter.
Discounted CG: ranking matters; results near the top contribute more.
Normalized Discounted Cumulative Gain: A measure of ranking quality. https://blog.csdn.net/qq_39521554/article/details/80950936
Python implementation: https://gist.github.com/bwhite/3726239
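
For a quick feel of how the three metrics relate, here is a small self-contained sketch that is independent of the gist linked above. It assumes the linear-gain form of DCG (gain divided by log2 of position + 1); another widely used variant uses 2^rel - 1 as the gain, and the relevance grades below are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

grades = [3, 2, 3, 0, 1, 2]   # hypothetical graded relevance, in ranked order
cg = sum(grades)              # cumulative gain ignores position entirely
print(cg, dcg(grades), ndcg(grades))
```

Because NDCG divides by the best achievable DCG for the same set of grades, it falls in [0, 1] and can be compared across queries with different numbers of relevant results.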

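Math used:

Cumulative distribution function (CDF): also simply called the distribution function, it is the integral of the probability density function and fully describes the probability distribution of a real-valued random variable X. It is usually written in uppercase as CDF, in contrast to the probability density function (lowercase pdf).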
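
When only a sample is available rather than a density, the CDF can be estimated empirically as the fraction of observations at or below x. The sketch below shows that idea; the helper name `empirical_cdf` and the review-length values are made up for illustration.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return a function F with F(x) = fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return cdf

review_lengths = [12, 40, 40, 95, 230]   # hypothetical feature values
F = empirical_cdf(review_lengths)
print(F(5), F(40), F(100), F(230))
```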

Framework:

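The user-review-product relationships are represented as a bipartite graph with N user nodes and M product nodes, where each review is an edge between a user and a product. A + or - sign marks the review as positive or negative. Formally, this graph is represented by a Markov Random Field (MRF).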
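
As a rough picture of the data structure only (not the MRF inference itself), the signed bipartite graph can be held as a list of signed edges. Everything here is hypothetical: the `Review` class, the sample ratings, and in particular the rule that 4 stars or more counts as positive, which is my assumption for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Review:
    user: str
    product: str
    sign: str  # "+" for a positive review, "-" for a negative one

def sign_from_rating(stars, threshold=4):
    """Assumed rule: ratings at or above `threshold` stars count as positive."""
    return "+" if stars >= threshold else "-"

# Hypothetical (user, product, stars) triples.
raw = [("u1", "p1", 5), ("u2", "p1", 1), ("u2", "p2", 4)]
reviews = [Review(u, p, sign_from_rating(s)) for u, p, s in raw]

users = {r.user for r in reviews}        # the N user nodes
products = {r.product for r in reviews}  # the M product nodes
print(len(users), len(products), [r.sign for r in reviews])
```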

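Ideas from the paper:

User and product features:
The number of reviews a user writes per day, number of positive reviews, number of negative reviews, average rating deviation, weighted rating deviation, burstiness, the uncertainty of the user's ratings (entropy*), the time gaps between the user's ratings, average review length, average similarity among the user's reviews, and maximum similarity among the user's reviews.

Review features:
The rank of a review among all reviews of the product (my guess is that this captures extreme cases, or gauges credibility from the review's rough standing), absolute rating deviation, extreme ratings (e.g., very high), thresholded rating deviation (not sure what this is), the time of the review (fraudsters tend to post early to increase their influence), whether the review is the only one the user has written, the proportion of all-capital words, the proportion of capital letters, review length, first-person words, use of exclamation marks, subjective/objective words, review frequency (estimated using locality sensitive hashing*), and the description length of the review based on unigrams/bigrams.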
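
A few of the text features listed above are simple enough to sketch directly. This is my own rough approximation, not the paper's exact definitions: the tokenization, the first-person word list, and the feature names are all assumptions.

```python
import re

# Illustrative word list; the paper's actual lexicon is not given in these notes.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def text_features(review):
    words = re.findall(r"[A-Za-z']+", review)
    letters = [c for c in review if c.isalpha()]
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]
    n_words = max(len(words), 1)
    return {
        "length_in_words": len(words),
        # Single letters such as "I" are not counted as all-caps words.
        "all_caps_word_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        "capital_letter_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
        "first_person_ratio": sum(w.lower() in FIRST_PERSON for w in words) / n_words,
        "exclamation_sentence_ratio": sum("!" in s for s in sentences) / max(len(sentences), 1),
    }

features = text_features("BEST pizza ever! I loved it. We will come back!")
print(features)
```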

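*Entropy: the greater the uncertainty of a variable, the greater its entropy, and the more information is needed to pin it down. Information entropy is a concept from information theory for measuring the amount of information. The more ordered a system is, the lower its entropy; conversely, the more chaotic it is, the higher its entropy. Information entropy can therefore also be seen as a measure of how ordered a system is. (Baidu Baike)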
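
As a small illustration of the definition, the sketch below computes Shannon entropy (in bits) of the empirical distribution of a user's star ratings. The two rating histories are made up: one user spreads ratings evenly, the other always gives 5 stars.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # H = sum over outcomes of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

varied_rater = shannon_entropy([1, 2, 3, 4, 5])    # maximally uncertain over 5 outcomes
one_note_rater = shannon_entropy([5, 5, 5, 5, 5])  # no uncertainty at all
print(varied_rater, one_note_rater)
```

The evenly spread ratings give log2(5) bits, the highest possible for five outcomes, while the all-5-star history gives zero.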

Data used in the paper:

YelpNYC, YelpChi, YelpZip
Yelp has "Recommended" reviews, i.e., reviews that Yelp's own algorithm has filtered as trustworthy, but we should evaluate on all reviews.

Evaluation metrics:

Precision & recall: While precision refers to the percentage of your results which are relevant, recall refers to the percentage of total relevant results correctly classified by your algorithm.
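
A tiny worked example may help keep the two apart. The review ids and the helper name are made up; `flagged` is what a detector marks as spam and `spam` is the ground truth.

```python
def precision_recall(predicted, actual):
    """`predicted` and `actual` are sets of item ids: flagged vs. truly positive."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

flagged = {"r1", "r2", "r3", "r4"}   # hypothetical detector output
spam = {"r2", "r4", "r7"}            # hypothetical ground truth
precision, recall = precision_recall(flagged, spam)
print(precision, recall)
```

Here 2 of the 4 flagged reviews are truly spam (precision 1/2), and 2 of the 3 spam reviews were caught (recall 2/3).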

True Positive Rate vs False Positive Rate Curves: A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. AUC: area under the curve, used to determine which of the used models predicts the classes best.
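
One way to see what AUC measures: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive-negative pairs, which is fine for small inputs; the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(auc)
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect one, which is why it is convenient for comparing models without fixing a threshold.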

Improvement:

Semi-supervised detection improves detection performance: even when only a small fraction of the data is labeled, the results improve considerably.

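Observations:

Linguistic features turned out to be less important than expected; behavioral features are the key part (the AUC using behavioral features alone is very close to the AUC using both behavioral and text features).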

“Light” version of SpEagle: a faster algorithm
SpEagle is slow at feature extraction, so the authors define only a few behavioral features of reviews in order to estimate the prior* more quickly.

*prior: Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected. This is the best rational assessment of the probability of an outcome based on the current knowledge before an experiment is performed. https://www.investopedia.com/terms/p/prior_probability.asp

Conclusion:

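This paper discusses combining the user-review-product graph (relational data) with behavioral and text data (metadata) to detect suspicious users and reviews. The method is unsupervised but can easily leverage labels, and in the semi-supervised setting performance improves considerably. In addition, the authors study a light version of SpEagle that uses a small number of review features as prior information and runs significantly faster.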
