DryadLINQ Notes
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language: source, pdf
What is Dryad?
Dryad was a Microsoft Research project: a general-purpose runtime for executing data-parallel applications. In October 2011, Microsoft discontinued active development on Dryad, shifting focus to the Apache Hadoop framework.
What is LINQ?
Language Integrated Query (LINQ, pronounced "link") is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages.
What is DryadLINQ?
- A system and a set of language extensions that enable a new programming model for large scale distributed computing.
- It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.
- A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free transformations on datasets, and can be written and debugged using standard .NET development tools.
- Translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform.
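The idea of composing side-effect-free operators into a plan first, and only then running that plan over partitioned data, can be sketched in a few lines. This is a Python analogue rather than DryadLINQ code: the `Query` class and its `where`, `select`, `plan`, and `run` methods are hypothetical names chosen for illustration, and the "cluster" is just a list of in-memory partitions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical stand-in for a LINQ-style query: an immutable list of steps."""
    steps: Tuple[Tuple[str, Callable], ...] = ()

    def where(self, pred):
        # Building the query only records the step; nothing is executed yet.
        return Query(self.steps + (("where", pred),))

    def select(self, fn):
        return Query(self.steps + (("select", fn),))

    def plan(self):
        # The recorded operator names play the role of an execution plan.
        return [name for name, _ in self.steps]

    def run_partition(self, rows):
        # Apply every recorded step to a single partition.
        for name, fn in self.steps:
            if name == "where":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        return rows

    def run(self, partitions):
        # Each partition is processed independently; a real system would
        # ship this work to different machines instead of looping locally.
        return [self.run_partition(p) for p in partitions]


query = Query().where(lambda x: x % 2 == 0).select(lambda x: x * x)
result = query.run([[1, 2, 3], [4, 5, 6]])
```

Because the operators have no side effects, the same plan can be applied to each partition in any order, which is what makes it safe to distribute.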
Pros?
- LINQ's extensibility, allowing the introduction of new execution implementations and custom operators, is the key that allows us to achieve deep integration of Dryad with LINQ-enabled programming languages.
- LINQ's strong static typing is extremely valuable when programming large-scale computations: it is much easier to debug compilation errors in Visual Studio than run-time errors in the cluster.
Cons?
- The Dryad execution engine was engineered for batch applications on large datasets. There is an overhead of at least a few seconds when executing a DryadLINQ execution plan graph (EPG).
- Bad for low-latency distributed database lookups.
- The main item that DryadLINQ lacked was any kind of distributed file system (DFS). Without a good DFS, a distributed processing framework isn't all that functional. You basically had to fake a DFS by **manually** partitioning your data across all machines and generating an INI file that described how the data was partitioned.
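To make the "fake DFS" workaround concrete, here is a rough Python sketch of hash-partitioning records across machines and writing an INI description of the result. The section and key names (`dataset`, `partition.N`, `machine`, `records`) are invented for illustration and are not the actual file layout DryadLINQ expected; the machine names are placeholders too.

```python
import configparser
import io
import zlib


def partition_records(records, machines):
    """Assign each record to a machine using a stable hash of its contents."""
    buckets = {m: [] for m in machines}
    for rec in records:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(rec.encode("utf-8")) % len(machines)
        buckets[machines[idx]].append(rec)
    return buckets


def describe_partitions(buckets, dataset_name):
    """Render a hypothetical INI description of where each partition lives."""
    cfg = configparser.ConfigParser()
    cfg["dataset"] = {"name": dataset_name, "partitions": str(len(buckets))}
    for i, (machine, recs) in enumerate(buckets.items()):
        cfg[f"partition.{i}"] = {"machine": machine, "records": str(len(recs))}
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


buckets = partition_records(["a", "b", "c", "d"], ["m0", "m1"])
ini_text = describe_partitions(buckets, "logs")
```

The point of the sketch is the bookkeeping burden: every step a DFS would normally handle (placement, metadata, lookup) has to be done by hand.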
What use cases is DryadLINQ suited for?
- Streaming computations. For example, for breadth-first traversal of large graphs, DryadLINQ outperforms specialized random-access infrastructures. This is because the performance characteristics of current hard disk drives ensure that sequential streaming is faster than small random-access reads, even when more than 99% of the streamed data is discarded.
- The main requirement is that the program can be written using LINQ constructs. Users generally then find it straightforward to adapt it to distributed execution using DryadLINQ, and in fact frequently no adaptation is necessary. However, a certain change in outlook may be required to identify the data-parallel components of an algorithm and express them using LINQ operators.
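The streaming-versus-random-access claim above can be checked with back-of-the-envelope arithmetic. The disk figures below (100 MB/s sequential bandwidth, 10 ms per random read, 100-byte records, 1% of records useful) are illustrative assumptions and do not come from the paper; transfer time for a single small random read is ignored because the seek dominates.

```python
def streaming_useful_rate(bandwidth_bytes_per_s, record_bytes, useful_fraction):
    """Useful records per second when scanning everything sequentially."""
    return bandwidth_bytes_per_s / record_bytes * useful_fraction


def random_access_useful_rate(seek_seconds):
    """Useful records per second when each record costs one seek."""
    return 1.0 / seek_seconds


# Assumed hard-disk characteristics (illustrative only).
streaming = streaming_useful_rate(100e6, 100, 0.01)
random_reads = random_access_useful_rate(0.010)
speedup = streaming / random_reads
```

Under these assumptions streaming delivers about 10,000 useful records per second against about 100 for random reads, so discarding 99% of the scanned data still leaves streaming roughly 100 times faster.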
What use cases does DryadLINQ fail to satisfy?
- Very inefficient for algorithms that are naturally expressed using random accesses.
Learnings? Takeaways? Influences on future work?
Purity vs ease of use: Many DryadLINQ beginners find it easier to write custom code inside Apply than to determine the equivalent native LINQ expression. Apply is therefore helpful, since it lowers the barrier to entry for using the system. However, the use of Apply "pollutes" the relational nature of LINQ and can reduce the system's ability to make high-level program transformations.
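A small Python sketch can show why opaque custom code limits plan rewriting. The plan representation, the `optimize` and `execute` functions, and the specific rewrite (moving a filter ahead of a sort) are all hypothetical; this rewrite is a generic example and is not claimed to be one of DryadLINQ's actual optimizations.

```python
def optimize(plan):
    """Move each `where` ahead of an immediately preceding `order_by`.

    Filtering first is equivalent and sorts fewer rows. An `apply` step is
    opaque to the optimizer, so nothing is ever moved across it.
    """
    steps = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(steps) - 1):
            if steps[i][0] == "order_by" and steps[i + 1][0] == "where":
                steps[i], steps[i + 1] = steps[i + 1], steps[i]
                changed = True
    return steps


def execute(plan, rows):
    """Run a plan of (operator, function) steps over a list of rows."""
    rows = list(rows)
    for op, fn in plan:
        if op == "where":
            rows = [r for r in rows if fn(r)]
        elif op == "order_by":
            rows = sorted(rows, key=fn)
        elif op == "apply":
            rows = list(fn(rows))  # arbitrary user code over the whole sequence
        else:
            raise ValueError(f"unknown operator: {op}")
    return rows


def is_big(x):
    return x > 2


native = [("order_by", lambda x: x), ("where", is_big)]
opaque = [("apply", lambda rows: sorted(rows)), ("where", is_big)]

native_ops = [op for op, _ in optimize(native)]
opaque_ops = [op for op, _ in optimize(opaque)]
```

Both plans compute the same answer, but only the native one is rewritten to filter before sorting; the version using `apply` stays exactly as the user wrote it.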
Core Innovations / Ideas
- shared-nothing architecture, horizontal data partitioning, dynamic repartitioning, parallel query evaluation, and dataflow scheduling
- DryadLINQ provides a generalization of the concept of query language, but it does not provide a data definition language (DDL) or a data management language (DML), and it does not provide support for in-place table updates or transaction processing.
- The three layers of storage, execution, and application are decoupled.
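Several of the ideas listed above (horizontal partitioning, repartitioning, and parallel evaluation over partitions that share nothing) can be illustrated with a toy group-by count. This is a single-process Python sketch under stated assumptions: the function names are hypothetical, lists stand in for machines, and a real system would run the per-partition work in parallel.

```python
import zlib
from collections import Counter


def repartition_by_key(partitions, n_out):
    """Hash-repartition so that equal keys always land in the same partition."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for key in part:
            out[zlib.crc32(key.encode("utf-8")) % n_out].append(key)
    return out


def count_by_key(partitions, n_out=2):
    """Count occurrences of each key using only partition-local work."""
    shuffled = repartition_by_key(partitions, n_out)
    # Each partition can be counted independently (no shared state).
    local_counts = [Counter(p) for p in shuffled]
    merged = {}
    for counts in local_counts:
        # Key sets are disjoint after repartitioning, so a plain update
        # never overwrites a count from another partition.
        merged.update(counts)
    return merged


counts = count_by_key([["a", "b", "a"], ["b", "c", "a"]])
```

The repartitioning step is what lets the final merge be a simple union: once all copies of a key live in one partition, no cross-partition coordination is needed to produce its count.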