Quora
Overview
How Quora does recommendation; the original post is machine-learning-for-qa-sites-the-quora-example.
I came across that article and then followed up on the references it cites.
First, the overview piece itself.
Quora mainly weighs three factors: Relevance, Quality, and Demand.
Feed Ranking
目标:Present most interesting stories for a user at a given time
Interesting = topical relevance + social relevance + timeliness
Stories = questions + answers
Answer Ranking
A Machine Learning Approach to Ranking Answers on Quora
- The quality of the answer content itself. Quora has explicit guidance on what makes a "good answer" [2]: it should be factually grounded, offer reusable value, provide explanation, be well formatted, and so on. What-does-a-good-answer-on-Quora-look-like-What-does-it-mean-to-be-helpful
- Engagement: upvotes/downvotes, comments, shares, bookmarks, clicks, etc.
- Features of the answerer, such as their expertise in the question's topic area.
Ask2Answers
Ask-To-Answer-as-a-Machine-Learning-Problem
Given a question and a viewer, rank all other users by how "well-suited" they are, where "well-suited" = likelihood of the viewer sending a request + likelihood of the candidate adding a good answer; that is, it considers both how likely the browsing user is to send an invite and how likely the invitee is to answer well if invited.
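The combined score can be sketched as below. The source doesn't say how the two likelihoods are combined, so the product (an independence assumption) and every name here are illustrative, not Quora's actual models:

```python
def well_suited_score(p_send_request: float, p_good_answer: float) -> float:
    """Combine the two likelihoods from the A2A definition:
    - p_send_request: probability the viewer sends an invite to this candidate
    - p_good_answer:  probability the candidate adds a good answer if invited
    A simple product treats the two events as independent (an assumption)."""
    return p_send_request * p_good_answer

def rank_candidates(candidates):
    """candidates: list of (user_id, p_send, p_answer) tuples."""
    return sorted(candidates, key=lambda c: well_suited_score(c[1], c[2]), reverse=True)

ranked = rank_candidates([("alice", 0.9, 0.2), ("bob", 0.5, 0.8), ("carol", 0.3, 0.3)])
print([u for u, _, _ in ranked])  # ['bob', 'alice', 'carol']
```

A candidate who is very likely to be invited but unlikely to answer (alice) loses to one who is moderately likely on both counts (bob).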
Topic Network
User Trust/Expertise Inference
Quora needs to identify experts in a given domain: how many questions a user has answered in that domain, and the upvotes, downvotes, thanks, shares, bookmarks, and views those answers received. Another important signal is the propagation of expertise. For example, if Xavier upvotes an answer in the recommender-systems domain, the author of that answer very likely has high expertise in recommender systems.
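The propagation effect can be illustrated with a toy update rule; the transfer rate, the cap, and the seed values are all made up, and Quora's actual inference is presumably a proper model over the vote graph:

```python
from collections import defaultdict

# expertise[user][topic] -> score in [0, 1]; illustrative seed value
expertise = defaultdict(lambda: defaultdict(float))
expertise["xavier"]["recsys"] = 0.9

def record_upvote(voter, author, topic, transfer=0.1):
    """Propagate a fraction of the voter's topic expertise to the author.
    The transfer rate and the cap at 1.0 are arbitrary illustrative choices."""
    gained = transfer * expertise[voter][topic]
    expertise[author][topic] = min(1.0, expertise[author][topic] + gained)

# An upvote from a high-expertise user raises the author's inferred expertise
record_upvote("xavier", "alice", "recsys")
print(expertise["alice"]["recsys"])  # 0.09
```

An upvote from a user with no standing in the topic would transfer nothing, which matches the intuition that expert endorsements carry the signal.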
The-Product-Engineering-Behind-Most-Viewed-Writers (this one feels only loosely related)
Data Mining Case Studies
Mapping-the-Discussion-on-Quora-Over-Time-through-Question-Text
Trends over time. Below, I go through the points I found most interesting, one by one.
A Machine Learning Approach to Ranking Answers on Quora
Early Attempts and Baselines
Simple scoring from upvotes/downvotes: simple, fast, and highly interpretable, so it makes a good baseline. (They also tried optimizing with view counts, but that didn't work well.)
The drawbacks are listed below, which motivated moving to richer structured signals:
- Time sensitivity: an answer can only be scored after it has received engagement
- Rich get richer: the more upvotes an answer has, the more chances it gets to collect further upvotes
- Joke answers: clickbait-style or gimmicky content gets rewarded
- Discoverability: content from new users struggles to get any engagement at all
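For reference, the vote-based baseline might look like this; the additive smoothing is my own addition (not from the post), used to avoid division by zero:

```python
def vote_score(upvotes: int, downvotes: int, prior: float = 1.0) -> float:
    """Simple, fast, interpretable baseline: smoothed upvote fraction.
    It inherits every issue listed above: new answers with no votes all get
    the same score (time sensitivity / discoverability), and already-visible
    answers keep accumulating votes (rich get richer)."""
    return (upvotes + prior) / (upvotes + downvotes + 2 * prior)

print(vote_score(10, 2))  # 11/14, about 0.786
print(vote_score(0, 0))   # 0.5 -- every brand-new answer starts here
```

The second call shows the cold-start failure mode directly: all unseen answers tie at 0.5 regardless of their actual quality.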
Our Approach to Solving the Ranking Problem
In ranking problems our goal is to predict a list of documents ranked by relevance. In most cases there is also additional context, e.g. an associated user that's viewing the results, which introduces personalization into the problem. The baseline model described below is:
- non-personalized
- supervised
- item-wise regression: each answer gets its own score
Ground Truth Dataset
At Quora we define good answers to have the following five properties:
- Actually answers the question asked
- Provides reusable knowledge
- Backed up by reasoning
- Demonstrably correct
- Clear and easy to read
Examples of Effective Features
The extracted features include:
- User verification status, formatting, upvotes
- Topic expertise and trust inferred from the history of the author and the upvoters
We explored many different ways to know the true quality of answers.
- Once again, consider using a list of answers where the labels are the ratio of upvotes to downvotes. The shortcoming is that this label suffers from the same issues as the vote-based model presented earlier.
- Run a user survey.
- Combine both into one authoritative ground-truth dataset.
Features and Models
Feature engineering is a major contributor to the success of a model and it's often the hardest part of building a good machine learning system. The features we tried can broadly be categorized into three groups:
- text-based features
- expertise-based features
- author/upvoter history-based features
Text-based features came first, but some of them were problematic because of syntactic complexity.
In general, using the output of other models as features works well; a feature estimating a user's expertise in a given topic proved this point for us.
For this regression model, GBDT and some deep-learning models gave convincing results, but DL is harder to interpret, so we also ran experiments with simpler linear models (LR).
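An item-wise GBDT regression can be sketched with scikit-learn; the feature names, data, and label construction below are synthetic stand-ins, not Quora's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Hypothetical per-answer features: [text_length, author_topic_expertise, upvote_ratio]
X = rng.random((200, 3))
# Hypothetical quality label, correlated with expertise and votes plus noise
y = 0.6 * X[:, 1] + 0.4 * X[:, 2] + 0.05 * rng.standard_normal(200)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, y)

# Item-wise scoring: each answer is scored independently; the ranking is
# simply a sort of the scores, so answers from different questions are
# comparable on the same scale.
scores = model.predict(X[:5])
ranking = np.argsort(-scores)
print(ranking)
```

With tree ensembles you also get feature importances for free (`model.feature_importances_`), which helps with the interpretability concern mentioned above.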
Metrics
We used the following metrics:
- Rank-based: NDCG, Mean Reciprocal Rank, Spearman's Rho, Kendall's Tau
- Point-wise: R² and Mean Squared Error, to make sure our scores matched the training data in scale, so that answers can also be compared across questions.
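All of the listed metrics are available off the shelf in scipy/scikit-learn; a sketch on invented toy scores:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import ndcg_score, mean_squared_error, r2_score

y_true = np.array([3.0, 2.0, 1.0, 0.0])   # ideal relevance of four answers
y_pred = np.array([2.8, 1.5, 1.9, 0.1])   # model scores (answers 2 and 3 swapped)

# Rank-based metrics: only the ordering matters
print(ndcg_score([y_true], [y_pred]))
print(spearmanr(y_true, y_pred).correlation)
print(kendalltau(y_true, y_pred).correlation)

# Mean Reciprocal Rank: reciprocal rank of the single best answer
best = int(np.argmax(y_true))
mrr = 1.0 / (np.argsort(-y_pred).tolist().index(best) + 1)
print(mrr)  # 1.0 -- the best answer is still ranked first

# Point-wise metrics: is the score scale itself calibrated to the labels?
print(r2_score(y_true, y_pred), mean_squared_error(y_true, y_pred))
```

The split mirrors the text: rank-based metrics check ordering within a question, point-wise metrics check that raw scores stay comparable between questions.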
Productionalization
- New answers are first ranked with a fast feature-extraction pass; a more accurate score is then recomputed asynchronously.
- Ranking hundreds of answers is slow, so scores are cached and only recomputed when their features change.
- One problem: when an author-level feature changes, every answer by that author has to be updated, which can be expensive, so these updates are specially organized and batched.
- Another optimization works inside the decision trees: a feature update is skipped when the change cannot affect the score.
- Altogether this cut the computation by about 70%.
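The cache-until-features-change idea can be sketched like this (the class, the key scheme, and the toy model are all hypothetical):

```python
class ScoreCache:
    """Cache answer scores keyed by a hash of the feature values, so the
    (expensive) ranking model is only invoked when a feature actually changed."""
    def __init__(self, model):
        self.model = model
        self.cache = {}          # answer_id -> (feature_hash, score)
        self.model_calls = 0

    def score(self, answer_id, features):
        key = hash(tuple(sorted(features.items())))
        cached = self.cache.get(answer_id)
        if cached and cached[0] == key:
            return cached[1]     # features unchanged: reuse the cached score
        self.model_calls += 1
        s = self.model(features)
        self.cache[answer_id] = (key, s)
        return s

cache = ScoreCache(model=lambda f: f["upvotes"] - f["downvotes"])
cache.score("a1", {"upvotes": 5, "downvotes": 1})
cache.score("a1", {"upvotes": 5, "downvotes": 1})   # cache hit, no model call
cache.score("a1", {"upvotes": 6, "downvotes": 1})   # feature changed: recompute
print(cache.model_calls)  # 2
```

The batching and inside-the-tree pruning described above would then sit on top of a cache like this, deciding which invalidations are worth processing at all.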
The Quora Topic Network
Introduction
Topics form an important organizational backbone for Quora's corpus of knowledge. Our goal is to become the Internet's best source for knowledge on as many of these topics as possible.
- More and more topics, with people contributing a lot of high-quality content
- By tagging content with topics, people have created a reasonably hierarchical organization of domain knowledge
- People can follow other people, and can also follow questions and topics; these follow links define the relationships in the network
Quora's Diversification
The number of topics reflects how diversity is changing, so they counted the topics that have at least 100 good questions.
What counts as a good question: at least one person besides the author found it valuable. By the end of 2013 there were roughly 5,000 such topics, and the number was growing quickly, which indicates that diversity is improving.
Defining the Probabilistic Topic Network
The number of topics grows fast, but genuinely new domain knowledge grows more slowly, so this knowledge needs to be organized and consolidated.
A question can carry multiple topic tags, which exposes implicit relationships between those topics. We can therefore build a network over the topics alone.
- First, link topic A and topic B if at least one question tags or references both of them
- Many questions tagged Moon landing also get tagged NASA, but not the other way around, so the edges are directed
- Add weights: if n people follow a question, that link instance counts with importance n, and the final weight of A→B is defined as:
(formula)
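The formula image didn't survive in this copy; a plausible reconstruction from the description above (the normalization over A's outgoing links is my assumption, motivated by the "probabilistic" framing):

```latex
w_{A \to B} \;=\; \frac{\sum_{q \in Q_{AB}} n_q}{\sum_{B'} \sum_{q \in Q_{AB'}} n_q}
```

where $Q_{AB}$ is the set of questions tagged with both $A$ and $B$, and $n_q$ is the number of followers of question $q$.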
Hints of the Topic Hierarchy
The weighted in-degree of each node is a simple measure: just sum the weights of all edges pointing into the node.
If topics really form a hierarchy, a topic can accumulate a large in-degree from at least two different routes, so the mean and median of in-degrees over all nodes should differ a lot. The median is dominated by the many typical, specific topics, so it should be low, and it should keep dropping as more and more specialized topics join. The mean is dominated by the in-degrees of the big hub nodes, so it should be larger and relatively stable. Statistics computed with NetworkX confirm exactly this pattern, so we can indeed read a hierarchy out of the topic network.
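The mean-versus-median argument can be checked on a toy hierarchical network with NetworkX (the graph below is invented):

```python
import statistics
import networkx as nx

# Toy hierarchy: many specific topics all point to a few general hubs
G = nx.DiGraph()
general = ["Science", "Technology"]
for i in range(20):
    G.add_edge(f"specific-{i}", general[i % 2], weight=5.0)
G.add_edge("Science", "Knowledge", weight=50.0)
G.add_edge("Technology", "Knowledge", weight=50.0)

# Weighted in-degree: sum of weights on incoming edges
indeg = [d for _, d in G.in_degree(weight="weight")]
print(statistics.mean(indeg), statistics.median(indeg))
# The median sits at the many specific topics (in-degree 0),
# while the mean is pulled up by the few big hubs.
```

Even on this tiny graph the gap is stark: the median in-degree is 0 while the mean is pulled well above it by the hub nodes, exactly the signature of a hierarchy.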
Diving Deeper into the Topic Hierarchy
- Degree distribution: a power law, proportional to k^(-1.6)
- Connectivity: 99.8% of all topics are connected together in one big "component"
- The joint degree distribution (JDD): assortative if popular nodes hang out with popular nodes and unpopular with unpopular; disassortative if popular nodes hang out with unpopular ones. This network is mildly disassortative: large, well-connected, general topics tend to be linked to smaller, more specific topics.
- The clustering coefficient (CC): the CC measures the probability that any two of my friends are also friends with each other, given that they are my friends. It decreases steeply with the number of links a topic has, meaning small topics are tightly interconnected while the neighbors of big topics are not, which again suggests the network is hierarchical.
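The degree-dependent clustering coefficient can be illustrated with NetworkX on a tiny invented graph: the specific topics in a small clique cluster tightly, while the big hub does not:

```python
import networkx as nx

# Toy network: one hub linked to many specifics, plus a small tight cluster
G = nx.Graph()
G.add_edges_from(("hub", f"t{i}") for i in range(10))   # big, general topic
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c")])  # small, tight cluster
G.add_edge("hub", "a")

cc = nx.clustering(G)
print(cc["a"], cc["hub"])  # the small topic clusters tightly; the hub does not
```

`cc["hub"]` is 0 because none of the hub's many neighbors link to each other, while `cc["a"]` is positive thanks to the triangle a-b-c: CC falling with degree is what the hierarchy argument in the text relies on.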
Topic Clustering
We can represent topics using hierarchical topic clustering. The algorithm:
1. Create a list of empty trees with each topic as the root
2. Find the topic with the largest total outdegree in the topic network
3. Add the topic, and its subtree, to the subtree of each topic it links to with weight
4. Remove the topic from the topic network
5. Go to 2 until only N topics are left
The result includes the list of topics and, for each node as a root, the tree structure beneath it, which tells us how tightly related topics are connected. With this information we can pick any topic and walk up or down the tree to find related topics. The clustering uses a fuzzy approach that allows a topic to have multiple parents, which turns out to be useful. They kept roughly 2,000 topics in the final network.
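The five steps above can be sketched as follows; the tie-breaking rule and the weight-threshold parameter are my assumptions, and the toy graph is invented:

```python
import networkx as nx

def cluster_topics(G: nx.DiGraph, n_remaining: int, min_weight: float = 0.0):
    """Greedy sketch of the hierarchical topic clustering described above.
    Fuzzy by construction: a topic attaches under every topic it links to
    with weight above min_weight, so it can end up with multiple parents.
    Returns {remaining topic: set of topics clustered under it}."""
    G = G.copy()
    subtree = {t: {t} for t in G.nodes}            # step 1: one tree per topic
    while G.number_of_nodes() > n_remaining:
        # step 2: topic with the largest total weighted outdegree
        t = max(G.nodes, key=lambda v: G.out_degree(v, weight="weight"))
        # step 3: add t and its subtree under each strong-enough parent
        for _, parent, w in G.out_edges(t, data="weight"):
            if w > min_weight:
                subtree[parent] |= subtree[t]
        G.remove_node(t)                           # step 4
    return {t: subtree[t] for t in G.nodes}        # step 5: N roots remain

G = nx.DiGraph()
G.add_edge("Moon landing", "NASA", weight=3.0)
G.add_edge("Moon landing", "Science", weight=1.0)
G.add_edge("NASA", "Science", weight=5.0)
G.add_edge("Apollo 11", "Moon landing", weight=2.0)
G.add_edge("Apollo 11", "NASA", weight=1.0)
print(cluster_topics(G, n_remaining=1))
```

Running it down to a single root folds everything under Science, with the Apollo 11 → Moon landing → Science chain preserved inside the subtree, mirroring the walk-up/walk-down usage described in the text.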