Related Pins at Pinterest: The E

2019-04-29  xiiatuuo

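Summary

This post traces how Related Pins (RP) evolved at Pinterest. The share of all saves on Pinterest that come from RP rose from 10% to 40%. Related Pins Save Propensity is the main optimization target. Related Pins leverages human-curated content (pins that users have organized into boards) to provide personalized recommendations of pins based on a given query pin.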

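System overview

The system has three parts: candidate generation, Memboost, and ranking. Note the order in which the three run in the system.

  1. Candidate generation
    Narrows the pool from 1 billion pins to about 1,000.
  2. Memboost
    Memorizes past engagement on specific query and result pairs.
  3. Ranking
    Maximizes the target engagement metric, Save Propensity.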



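Evolution of candidate generation

At first, candidates came mainly from the co-occurrence of pins on boards. Once Memboost and learning to rank (LTR) were introduced, the candidate problem gradually shifted from precision toward recall, and more candidate sources were added.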
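
The board co-occurrence idea can be illustrated with a small sketch. This is a toy, in-memory version and only an assumption about the general shape of the computation; the board names, pin names, and the helper functions `cooccurrence_counts` and `related_pins` are hypothetical, and it leaves out the category and text weighting.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(boards):
    """Count how many boards each unordered pair of pins shares."""
    counts = Counter()
    for pins in boards.values():
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(pins)), 2):
            counts[(a, b)] += 1
    return counts


def related_pins(query_pin, counts, top_k=3):
    """Rank other pins by the number of boards they share with query_pin."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == query_pin:
            scores[b] += n
        elif b == query_pin:
            scores[a] += n
    return [pin for pin, _ in scores.most_common(top_k)]


# Hypothetical toy data: board name -> pins saved to that board.
boards = {
    "drinks": ["whiskey", "cocktail", "gin"],
    "bar_cart": ["whiskey", "cocktail"],
    "garden": ["gin", "herbs"],
}
counts = cooccurrence_counts(boards)
print(related_pins("whiskey", counts))
```

A pin that sits on only one or two boards gets very few pairs here, which is the long-tail weakness of this approach.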

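  1. Board co-occurrence
    1) A MapReduce job counts how often two pins co-occur on boards, and the relevance is weighted by how well the categories and text match. Two problems: a) the long tail; b) heavy reliance on relevance scores derived from basic metadata.
    2) Pixie random walks. a) Rules prune high-degree nodes and low-relevance pins from the graph; b) following Twitter's walk method, the results of more than 100,000 random walks are aggregated, each with a reset probability.
    3) Pros and cons
    Pro: recall is decent. Cons: 1) boards are sometimes too broad, and a board's content tends to drift as the user's interests shift; 2) boards are sometimes too narrow, e.g. whiskey and cocktail pins may sit on different boards.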
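
A minimal sketch of random walks with a reset probability on a pin-board bipartite graph, assuming a tiny in-memory graph. It is not Pixie itself: the function names, the default parameters (`n_walks`, `steps_per_walk`, `reset_prob`), and the toy data are all hypothetical. Candidates are ranked by how often the walks visit them.

```python
import random
from collections import Counter


def build_pin_index(boards):
    """Invert board -> pins into pin -> boards."""
    pin_to_boards = {}
    for board, pins in boards.items():
        for pin in pins:
            pin_to_boards.setdefault(pin, []).append(board)
    return pin_to_boards


def random_walk_candidates(query_pin, boards, n_walks=1000,
                           steps_per_walk=10, reset_prob=0.5, seed=0):
    """Aggregate visit counts over many short walks that start at query_pin."""
    rng = random.Random(seed)
    pin_to_boards = build_pin_index(boards)
    visits = Counter()
    if query_pin not in pin_to_boards:
        return visits  # unknown pin: nothing to walk from
    for _ in range(n_walks):
        pin = query_pin
        for _ in range(steps_per_walk):
            if rng.random() < reset_prob:
                pin = query_pin  # reset: jump back to the query pin
                continue
            board = rng.choice(pin_to_boards[pin])  # pin -> one of its boards
            pin = rng.choice(boards[board])         # board -> one of its pins
            if pin != query_pin:
                visits[pin] += 1
    return visits


# Hypothetical toy graph; "shoes" is disconnected from the drinks boards.
boards = {
    "drinks": ["whiskey", "cocktail", "gin"],
    "bar_cart": ["whiskey", "cocktail"],
    "garden": ["gin", "herbs"],
    "shoes": ["sneaker", "boot"],
}
visits = random_walk_candidates("whiskey", boards)
print(visits.most_common(3))
```

A higher `reset_prob` keeps the walks close to the query pin, which trades recall for tighter relevance.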

  2. Session co-occurrence
    Linking a user's actions in time order addresses the problem of boards being too broad or too narrow.
    Pin2Vec is a learned embedding of the N most popular (head) pins in a d-dimensional space, with the goal of minimizing the distance between pins that are saved in the same session.
    We consider pins that are saved by the same user within a certain time window to be related (saves, not clicks!). It captures a large amount of user behavior in a compact vector representation.
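
The "saved by the same user within a time window" rule can be sketched as a pair generator whose output could feed an embedding trainer. This is only an illustration: the event tuple layout, the 30-minute window, and the function name `session_pairs` are assumptions, not details from the paper.

```python
from collections import defaultdict


def session_pairs(save_events, window_seconds=1800):
    """Return (pin_a, pin_b) pairs saved by the same user within the window.

    save_events: iterable of (user_id, pin_id, unix_timestamp) tuples.
    """
    by_user = defaultdict(list)
    for user, pin, ts in save_events:
        by_user[user].append((ts, pin))

    pairs = []
    for events in by_user.values():
        events.sort()  # order each user's saves by time
        for i, (ts_i, pin_i) in enumerate(events):
            for ts_j, pin_j in events[i + 1:]:
                if ts_j - ts_i > window_seconds:
                    break  # later saves are even further away in time
                if pin_i != pin_j:
                    pairs.append((pin_i, pin_j))
    return pairs


# Hypothetical save log.
events = [
    ("u1", "whiskey", 0),
    ("u1", "cocktail", 600),
    ("u1", "sofa", 10_000),
    ("u2", "whiskey", 0),
    ("u2", "sofa", 5_000),
]
print(session_pairs(events))
```

Only save events go in; click events are deliberately left out, matching the note above.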

  3. Supplemental candidates
    Added to address two problems: 1) the cold start problem: rare pins do not have a lot of candidates because they do not appear on many boards; 2) after we added ranking, we wanted to expand our candidate sets in the cases where diversity of results would lead to more engagement.
    1) Search-based candidates
    We generate candidates by leveraging Pinterest's text-based search, using the query pin's annotations (words from the web link or description) as query tokens. Each popular search query is backed by a precomputed set of pins from Pinterest Search.
    2) Visually similar candidates
    a) If the query image is a near-duplicate, then we add the Related Pins recommendations for the duplicate image to the results.
    b) Use the Visual Search backend to return visually similar images, based on a nearest-neighbor lookup.
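
A nearest-neighbor lookup over embeddings can be sketched with brute-force cosine similarity. This is not the Visual Search backend, which would use an approximate index at scale; the embedding values, pin names, and the function `nearest_pins` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_pins(query_pin, embeddings, top_k=2):
    """Return the top_k pins whose embeddings are closest to the query pin's."""
    query_vec = embeddings[query_pin]
    scored = [
        (cosine(query_vec, vec), pin)
        for pin, vec in embeddings.items()
        if pin != query_pin
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [pin for _, pin in scored[:top_k]]


# Hypothetical 3-dimensional image embeddings.
embeddings = {
    "red_sofa": [0.9, 0.1, 0.0],
    "crimson_couch": [0.8, 0.2, 0.1],
    "blue_lamp": [0.0, 0.1, 0.9],
}
print(nearest_pins("red_sofa", embeddings, top_k=1))
```

The same lookup shape applies to any embedding space, including one learned from save sessions.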

  4. Locale-specific candidates
    The content activation problem: rare pins do not show up as candidates because they do not appear on many boards.
    We generate additional candidate sets segmented by locale for many of the above generation techniques.
    Content activation can also be addressed with gender-specific content or fresh content.

Evolution of Memboost

We built Memboost to memorize the best result pins for each query. Memboost as a whole introduces significant system complexity by adding feedback loops in the system.
1) Clicks over expected clicks (COEC) is used to correct for position and platform bias.

[Image: COEC formula, not preserved]
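
Since the formula image is missing, here is a hedged sketch of how COEC is commonly computed: observed clicks divided by the clicks one would expect given where the result was shown. The slot keys, the global click rates, and the function names are assumptions for illustration and may differ from the exact form in the paper.

```python
def expected_clicks(impressions, click_rate):
    """Sum over (platform, position) slots of impressions times the slot's
    global click rate."""
    return sum(n * click_rate[slot] for slot, n in impressions.items())


def coec(clicks, impressions, click_rate):
    """Clicks over expected clicks; 0.0 when nothing was expected."""
    expected = expected_clicks(impressions, click_rate)
    return clicks / expected if expected > 0 else 0.0


# Hypothetical global click rates per (platform, position) slot.
click_rate = {("web", 1): 0.10, ("web", 5): 0.02}

# A result shown 100 times in the top slot with 10 clicks is merely average...
top_slot = coec(10, {("web", 1): 100}, click_rate)
# ...while 4 clicks from 100 impressions in slot 5 is twice the expectation.
low_slot = coec(4, {("web", 5): 100}, click_rate)
print(top_slot, low_slot)
```

A value above 1 means the result drew more clicks than its placements alone would predict, so raw click counts from prominent slots no longer dominate.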

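2) The actions considered are clicks, long clicks, closeups, and saves. The score is computed as:

[Image: Memboost score formula, not preserved]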
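
The original formula is not available here, so the following is only one plausible way to combine several action types into a single score: a weighted sum of (observed minus expected) counts, divided by the weighted expected counts plus a smoothing term. The weights, the smoothing constant, and the function name `memboost_like_score` are hypothetical and should not be read as Pinterest's actual values or formula.

```python
# Hypothetical per-action weights; rarer, stronger signals weigh more.
ACTION_WEIGHTS = {"click": 1.0, "long_click": 2.0, "closeup": 0.5, "save": 4.0}


def memboost_like_score(observed, expected, weights=ACTION_WEIGHTS, smoothing=1.0):
    """Weighted excess engagement relative to expectation.

    observed / expected: dicts mapping action name -> count for one
    (query pin, result pin) pair. Positive scores mean the pair
    outperformed expectation; negative scores mean it underperformed.
    """
    numerator = sum(
        w * (observed.get(action, 0) - expected.get(action, 0.0))
        for action, w in weights.items()
    )
    denominator = smoothing + sum(
        w * expected.get(action, 0.0) for action, w in weights.items()
    )
    return numerator / denominator


expected = {"click": 10.0, "long_click": 4.0, "closeup": 20.0, "save": 2.0}
as_expected = memboost_like_score(dict(expected), expected)
more_saves = memboost_like_score({**expected, "save": 6.0}, expected)
print(as_expected, more_saves)
```

The smoothing term keeps pairs with very little traffic from receiving extreme scores.
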
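3) Introducing a new ranker, the passage of time, or other system changes all affect a historically accumulated score like Memboost's. The fix is to feed Memboost to the ranker as a feature.
4) Memboost insertion mainly brings back high-quality results that neither candidate generation nor ranking currently surfaces.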

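Evolution of ranking

  1. Overview
    Ranking was assumed to be the component most likely to improve results going forward, and the first version delivered a 30% improvement. The first version used only raw pin data; Memboost and user data, including user behavior data (recent searches), were added later. Features include raw features (topic, category), normalized features (Memboost), one-hot encoded features, relevance features (topic relevance between query and candidate), and so on.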
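
The feature families listed above can be sketched as a small feature builder. Everything concrete here is an assumption: the category vocabulary, min-max scaling for Memboost, and Jaccard overlap as the topic relevance measure are stand-ins chosen for illustration.

```python
def one_hot(value, vocabulary):
    """One-hot encode value against a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def jaccard(a, b):
    """Set overlap in [0, 1]; used here as a stand-in relevance measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_features(query, candidate, memboost_score, categories,
                   memboost_range=(0.0, 4.0)):
    """Assemble one feature vector for a (query pin, candidate pin) pair."""
    lo, hi = memboost_range
    features = []
    # One-hot encoded feature: the candidate's category.
    features += one_hot(candidate["category"], categories)
    # Normalized feature: Memboost score min-max scaled and clipped to [0, 1].
    features.append(min(max((memboost_score - lo) / (hi - lo), 0.0), 1.0))
    # Relevance feature: topic overlap between query and candidate.
    features.append(jaccard(set(query["topics"]), set(candidate["topics"])))
    return features


# Hypothetical pins and vocabulary.
categories = ["food", "home", "fashion"]
query = {"category": "food", "topics": ["drinks", "bar"]}
candidate = {"category": "food", "topics": ["drinks", "party"]}
vector = build_features(query, candidate, memboost_score=1.0, categories=categories)
print(vector)
```
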
    Three major issues need handling in production use:
  2. Evolution

    1) Memboost training data, relevance pair labels, pairwise loss, and a linear RankSVM model
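
A pairwise loss with a linear model can be sketched as follows: for each (better, worse) pair, a hinge loss is applied to the score difference, in the style of RankSVM, and minimized with plain stochastic gradient descent. The training pairs, learning rate, and regularization strength are made up, and a production RankSVM would use a dedicated solver.

```python
def score(weights, features):
    """Linear scoring function: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_linear(pairs, n_features, epochs=50, lr=0.1, reg=0.01):
    """Fit weights so that the better item in each pair scores higher.

    pairs: list of (better_features, worse_features).
    Minimizes hinge(1 - w . (better - worse)) + (reg / 2) * ||w||^2 by SGD.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = score(weights, diff)
            for i in range(n_features):
                grad = reg * weights[i]
                if margin < 1.0:  # pair is misordered or inside the margin
                    grad -= diff[i]
                weights[i] -= lr * grad
    return weights


# Hypothetical relevance pairs: feature 0 tends to be higher for the better pin.
pairs = [
    ([1.0, 0.2], [0.0, 0.3]),
    ([0.8, 0.9], [0.1, 0.9]),
    ([0.9, 0.1], [0.2, 0.8]),
]
weights = train_pairwise_linear(pairs, n_features=2)
print(weights)
```

Only the relative order within each pair matters for this loss, so the labels need not be calibrated scores.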

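  3. Model bias
    The model that is currently deployed dramatically impacts the training examples produced for future models.
  4. Measuring success
    The faster the iteration, the better. Online A/B tests judge a ranker by its save rate, but they take days to run, so offline evaluation matters a great deal.
  5. Serving framework
    Combines offline and online components.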



Relationship between ranking and Memboost

In principle, ranking and Memboost both order results, yet in Pinterest's system the two modules have always coexisted. We can see ...

Challenges

  1. Changing Anything Changes Everything
    Inputs are never really independent. Improving another component may actually result in worse overall performance.
    Our general solution is to jointly train/automate as much of the system as possible for each experiment.
  2. Content cold start

TODO

Pinterest's latest direction appears to be graph convolutional networks (GCN).

References and further reading

Pinterest推荐系统四年进化之路 (The Four-Year Evolution of Pinterest's Recommender System)
WTF: The Who to Follow Service at Twitter
Visual Search at Pinterest
Visual Discovery at Pinterest
