2019-01-10  tsogvilin

Distributed systems theory for the distributed systems engineer

Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.

I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?
— Gwen (Chen) Shapira (@gwenshap) August 7, 2014

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to think that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study and, usually, significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap led me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer. Let me know what you think I missed!

First steps

These four readings do a pretty good job of explaining what about building distributed systems is challenging. Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections.

Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.

Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.

A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.

The fallacies of distributed computing - 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.
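
One way to see the point of the third reading is partial failure: a remote call can fail after the remote side has already done the work, which a local call cannot do. The sketch below simulates this in Python. `Counter`, `ReplyLost`, and `remote_increment` are hypothetical names invented for illustration, and the "network" is only a flag that drops the reply, so treat it as a toy model, not a real RPC layer.

```python
class Counter:
    """Stands in for state that lives on a remote server."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value


class ReplyLost(Exception):
    """No reply arrived; the caller cannot tell whether the call ran."""


def remote_increment(counter: Counter, drop_reply: bool = False) -> int:
    """Simulated remote call (hypothetical): the reply may be lost."""
    result = counter.increment()  # the server applies the request
    if drop_reply:
        raise ReplyLost("timed out waiting for reply")
    return result


server = Counter()
try:
    remote_increment(server, drop_reply=True)
except ReplyLost:
    # The caller sees only a timeout, so it retries "just once".
    remote_increment(server)

print(server.value)  # the caller wanted one increment; the server applied two
```

A local method call either returns or raises, and in both cases the caller knows whether the work happened. Here the retry is a reasonable reaction to a timeout, yet it applies the operation twice.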

You should know about safety and liveness properties:

Failure and Time

Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:

  1. Processes may fail

  2. There is no good way to tell that they have done so

There is a very deep relationship between what, if anything, processes share about their knowledge of time, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented. Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.
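
The second cause can be made concrete with a timeout-based failure detector, sketched below in Python. The `Peer` class and the `suspects` function are hypothetical names for this illustration, and reply delays are given as plain numbers instead of being measured over a real network.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Peer:
    """A simulated remote process, described only by when it replies."""
    name: str
    reply_delay: Optional[float]  # seconds until a heartbeat reply; None = crashed


def suspects(peer: Peer, timeout: float) -> bool:
    """Timeout-based detection: suspect the peer if no reply arrives in time.

    The detector observes only the absence of a reply. It cannot observe
    why the reply is missing.
    """
    return peer.reply_delay is None or peer.reply_delay > timeout


peers = [
    Peer("healthy", reply_delay=0.05),
    Peer("slow", reply_delay=2.0),      # alive, but paused or on a slow link
    Peer("crashed", reply_delay=None),  # will never reply
]

for timeout in (1.0, 5.0):
    verdicts = {p.name: suspects(p, timeout) for p in peers}
    print(f"timeout={timeout}s -> {verdicts}")
```

With the 1-second timeout the slow peer and the crashed peer get the same verdict, even though one of them is still running. Raising the timeout to 5 seconds clears the slow peer, but only by waiting longer before the crash is noticed. Without some shared assumption about how quickly time passes, no timeout value separates the two cases reliably.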

You should know:

The basic tension of fault tolerance

A system that tolerates some faults without degrading must be able to act as though those faults had not occurred. This usually means that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption. This is the basic tension of adding fault tolerance to a system.
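
The trade-off can be seen in a toy replicated store, sketched below in Python. The functions `replicated_write` and `read_any` are hypothetical, the replicas are plain dictionaries, and the sketch assumes every replica is up at write time, so it ignores stale replicas, concurrent writers, and consistency entirely.

```python
def replicated_write(replicas: list[dict], key: str, value: int) -> int:
    """Store the value on every replica; return the number of writes done."""
    for store in replicas:
        store[key] = value
    return len(replicas)


def read_any(replicas: list[dict], key: str, down: set[int]) -> int:
    """Read from the first replica that is not in the failed set."""
    for index, store in enumerate(replicas):
        if index not in down:
            return store[key]
    raise RuntimeError("no replica is reachable")


single = [{}]
triple = [{}, {}, {}]

cost_single = replicated_write(single, "x", 42)  # 1 write
cost_triple = replicated_write(triple, "x", 42)  # 3 writes for the same result

# Two of three replicas fail; the read still behaves as if nothing happened.
print(read_any(triple, "x", down={0, 1}), cost_triple)

# The single copy does a third of the work but tolerates no failure.
try:
    read_any(single, "x", down={0})
except RuntimeError as err:
    print("single copy:", err, cost_single)
```

Every write costs three times as much in the replicated case, and that extra work buys nothing until a replica actually fails.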

You should know:

Basic primitives

There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:

Fundamental Results

Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:

Real systems

The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:

Google:

Not Google:

Postscript

If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.
