(GeekBand) System Design and Practice: Case Analysis

2017-01-10  Linary_L

Case Studies

News feed (information stream)

Define feed

Organize

Level 1.0

Database Schema:
GetNewsfeed:
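
Neither the schema diagram nor the GetNewsfeed code survives in these notes. Below is a minimal sketch of what the Level 1.0 tables might look like, assuming one friendship table and one news table (all names are illustrative); the corresponding GetNewsfeed query is sketched after the "Why bad?" breakdown below.

```python
import sqlite3

# Illustrative Level 1.0 schema (names are assumptions, not from the notes):
# one friendship table and one news table keyed by author and timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friendship (
    user_id   INTEGER NOT NULL,   -- the follower
    friend_id INTEGER NOT NULL,   -- the followed user
    PRIMARY KEY (user_id, friend_id)
);

CREATE TABLE news (
    news_id    INTEGER PRIMARY KEY,
    source_id  INTEGER NOT NULL,  -- author of the post
    content    TEXT,
    created_at INTEGER NOT NULL   -- unix timestamp
);

CREATE INDEX idx_news_source_time ON news (source_id, created_at);
""")
```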

Why bad?

100+ friends

1 query --> get the friends list

1 query -->
SELECT news
WHERE timestamp > xxx
AND sourceid IN (friend list)
LIMIT 1000

IN is slow: either a sequential scan or 100+ index queries (one per friend).
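
A minimal sketch of the Level 1.0 GetNewsfeed described above: two queries against the illustrative schema sketched earlier, one for the friend list and one IN query over news.

```python
def get_newsfeed_v1(conn, user_id, since_ts, limit=1000):
    # Query 1: fetch the friend list (100+ ids for an active user).
    friends = [row[0] for row in conn.execute(
        "SELECT friend_id FROM friendship WHERE user_id = ?", (user_id,))]
    if not friends:
        return []
    # Query 2: one big IN-query over every friend's news -- the slow part:
    # the database either scans the news table sequentially or does
    # 100+ separate index lookups, one per friend.
    placeholders = ",".join("?" * len(friends))
    sql = ("SELECT news_id, source_id, content, created_at FROM news "
           "WHERE created_at > ? AND source_id IN (" + placeholders + ") "
           "ORDER BY created_at DESC LIMIT ?")
    return conn.execute(sql, (since_ts, *friends, limit)).fetchall()
```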

Level 2.0

Pull vs Push

Pull: get news from each friend and merge them together. (The newsfeed is generated when the user requests it.)
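
A minimal sketch of the pull model, assuming each friend's posts are already available as a newest-first list of (timestamp, post) tuples; the feed is materialised only at request time by a k-way merge.

```python
import heapq
from itertools import islice

def pull_newsfeed(friend_timelines, limit=100):
    # friend_timelines: one list per friend, each already sorted
    # newest-first as (timestamp, post) tuples.  The feed is built only
    # when the user asks for it, by merging the per-friend lists.
    merged = heapq.merge(*friend_timelines, key=lambda item: item[0], reverse=True)
    return list(islice(merged, limit))
```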

Push: the newsfeed is generated when the news is generated. (We keep another table to store the newsfeed, which may store duplicate copies of a news item.)

Push:
1 query to select the latest 1000 newsfeed entries.
100+ insert queries (async) per post.

Disadvantage: news delay.
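
A minimal sketch of the push (fan-out-on-write) model, assuming an extra newsfeed(owner_id, news_id, created_at) table on top of the earlier illustrative schema: writing a post fans out one insert per follower (normally done by an async worker), and reading the feed becomes a single query.

```python
def push_on_post(conn, source_id, news_id, created_at):
    # Fan-out on write: one insert per follower (100+ for a typical user),
    # normally executed by an async worker off the request path.
    followers = [row[0] for row in conn.execute(
        "SELECT user_id FROM friendship WHERE friend_id = ?", (source_id,))]
    conn.executemany(
        "INSERT INTO newsfeed (owner_id, news_id, created_at) VALUES (?, ?, ?)",
        [(f, news_id, created_at) for f in followers])

def get_newsfeed_v2(conn, user_id, limit=1000):
    # Reading the feed is now a single query over the pre-built table.
    return conn.execute(
        "SELECT news_id, created_at FROM newsfeed WHERE owner_id = ? "
        "ORDER BY created_at DESC LIMIT ?", (user_id, limit)).fetchall()
```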

Level 3.0

Popular star (e.g. Justin Bieber)

Followers: 13M+

An async push may take over 30 minutes (13M+ insertions; the delay is too long).

Push+Pull

For popular stars, don't push news to their followers.

For every newsfeed request, merge the non-popular users' newsfeed (push) with the popular users' newsfeed (pull).
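
A minimal sketch of the Level 3.0 hybrid, assuming the caller knows which of the user's friends are popular (popular_friend_ids is illustrative): pushed items come from the newsfeed table, while the stars' posts are pulled and merged in at request time.

```python
import heapq
from itertools import islice

def get_newsfeed_hybrid(conn, user_id, popular_friend_ids, limit=1000):
    # Pushed part: pre-built feed containing only non-popular friends' posts.
    pushed = conn.execute(
        "SELECT created_at, news_id FROM newsfeed WHERE owner_id = ? "
        "ORDER BY created_at DESC LIMIT ?", (user_id, limit)).fetchall()
    # Pulled part: fetch the popular stars' posts at request time -- a few
    # extra reads per request instead of 13M+ inserts per star post.
    pulled = [conn.execute(
        "SELECT created_at, news_id FROM news WHERE source_id = ? "
        "ORDER BY created_at DESC LIMIT ?", (star_id, limit)).fetchall()
        for star_id in popular_friend_ids]
    merged = heapq.merge(pushed, *pulled, key=lambda row: row[0], reverse=True)
    return list(islice(merged, limit))
```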

Level 4.0

Push disadvantage
Go back to PULL:

Click Stats Server

How are click stats stored?

A poor candidate will suggest writing back to a data store on every click.

A good candidate will suggest some form of aggregation tier that accepts clickstream data, aggregates it, and writes back to a persistent data store periodically.

A great candidate will suggest a low-latency messaging system to buffer the click data and transfer it to the aggregation tier.

If daily, storing the clicks in HDFS and running MapReduce jobs to compute the stats is a reasonable approach.

If near real-time, the aggregation logic should compute the stats.

PS: How do we count mouse clicks and the regions they fall in? An average engineer stores each click (log) directly in the database layer. A better engineer adds an intermediate layer between the front end and the database that aggregates the click stream and flushes it to the database at a fixed interval (every 1 or 10 minutes), greatly reducing the load on the back end. A great engineer combines the two cases above: when the data volume is huge and real-time results are not required, a distributed batch-processing approach can extend the aggregation layer's flush interval to a full day; when timeliness matters, the flush interval should be shortened accordingly.
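
A minimal sketch of the aggregation tier described above: clicks go into an in-memory buffer (standing in for the low-latency messaging system), and a background loop periodically writes the aggregated counts to the persistent store. Class, method, and parameter names are illustrative.

```python
import threading
import time
from collections import Counter

class ClickAggregator:
    def __init__(self, store, flush_interval_sec=60):
        # `store` is any persistent sink exposing a save(counts) method.
        self.store = store
        self.flush_interval_sec = flush_interval_sec
        self.counts = Counter()
        self.lock = threading.Lock()

    def record_click(self, region_id):
        # Called on every click; only touches the in-memory buffer,
        # so the database sees no per-click writes.
        with self.lock:
            self.counts[region_id] += 1

    def flush_forever(self):
        # Periodic write-back: one batched write per interval
        # instead of one write per click.
        while True:
            time.sleep(self.flush_interval_sec)
            with self.lock:
                snapshot, self.counts = self.counts, Counter()
            if snapshot:
                self.store.save(dict(snapshot))
```

The flush loop would typically run on a daemon thread, e.g. threading.Thread(target=aggregator.flush_forever, daemon=True).start().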

Cache Requirement

PS: How to design a cache (LRU-related design):
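
The notes end at the colon; below is a minimal LRU sketch using an OrderedDict-backed map: get moves the hit key to the most-recently-used end, and put evicts the least-recently-used key once capacity is exceeded. (An interview version usually pairs a hash map with a doubly linked list to make both operations O(1).)

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        # Mark the key as most-recently-used.
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            # Evict the least-recently-used entry.
            self.data.popitem(last=False)
```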

Web Crawler (爬虫)
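
The notes stop at the heading, so the following is only an assumed sketch of a basic BFS crawler (using requests and a naive href regex, with no robots.txt or politeness handling).

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_pages=100):
    # Breadth-first crawl: fetch a page, extract its links, enqueue unseen ones.
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        pages[url] = html
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```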

Amazon Product Page

The product page includes information such as