Alluxio Master
Only one master process can be the leading master in an Alluxio cluster.
The leading master is responsible for
managing the global metadata of the system.
This includes:
- file system metadata (e.g. the file system inode tree),
- block metadata (e.g. block locations),
- and worker capacity metadata (free and used space).
The leading master only ever queries the under storage for metadata; application data is never routed through the master.
Alluxio clients interact with the leading master to read or modify this metadata.
All workers periodically send heartbeat information to the leading
master to maintain their participation in the cluster.
The leading master does not initiate communication with other components;
it only responds to requests via RPC services.
The leading master records all file system transactions to a distributed persistent storage location to allow for recovery of master state information;
the set of records is referred to as the journal.
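The journal mechanism can be illustrated with a minimal sketch (illustrative only; Alluxio's actual journal format, record types, and storage backends are more involved): every metadata mutation is appended as a record to persistent storage, and a new leading master recovers its state by replaying the records in order.

```python
# Minimal sketch of journal-based recovery (illustrative; not Alluxio's
# actual journal format). Each metadata mutation is appended as a record,
# and master state is rebuilt by replaying the records in order.

class Journal:
    def __init__(self):
        self.records = []          # stands in for distributed persistent storage

    def append(self, record):
        self.records.append(record)

class MasterState:
    def __init__(self):
        self.inodes = {}           # path -> metadata

    def apply(self, record):
        op, path, meta = record
        if op == "create":
            self.inodes[path] = meta
        elif op == "delete":
            self.inodes.pop(path, None)

def recover(journal):
    """Rebuild master state by replaying all journal records in order."""
    state = MasterState()
    for record in journal.records:
        state.apply(record)
    return state

journal = Journal()
journal.append(("create", "/data/file1", {"blocks": [1, 2]}))
journal.append(("create", "/data/file2", {"blocks": [3]}))
journal.append(("delete", "/data/file2", None))

recovered = recover(journal)
print(sorted(recovered.inodes))  # ['/data/file1']
```

Because the journal is the source of truth, any standby master holding the same records can replay them and take over as the leading master.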
Alluxio Workers
Alluxio workers store data as blocks and serve client requests that read or write data by reading or creating new blocks within their local resources.
Alluxio Job Workers
Alluxio Job Workers are clients of the Alluxio file system. They are responsible for running tasks given to them by the Alluxio Job Master. Job Workers receive instructions to run load, persist, replicate, move, or copy operations on any given file system locations.
Alluxio Client
The Alluxio client provides users a gateway to interact with the Alluxio servers.
It initiates communication with the leading master to carry out metadata operations and with workers to read and write data that is stored in Alluxio.
(Figure: simple Alluxio architecture overview)
Data flow: Read
1. short-circuit
When an application requests data through the Alluxio client, the client first asks the Alluxio master for the worker location of the data. If the data is available locally, the client reads it directly from the worker's local storage using local file system operations, bypassing the worker process entirely; such short-circuit reads require permissive file permissions.
2. domain socket
If short-circuit reads are not possible (for example, because of file permissions), the client can instead read through the local worker over a predefined Unix domain socket, avoiding a network transfer while keeping the worker in the data path.
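Short-circuit and domain-socket reads are controlled by client and worker configuration; a hedged example (the property names and socket path below reflect common Alluxio settings, but should be verified against the configuration reference for your version):

```properties
# Allow the client to read directly from a colocated worker's storage
alluxio.user.short.circuit.enabled=true
# Serve data over a Unix domain socket at this (example) path
alluxio.worker.data.server.domain.socket.address=/opt/domain
```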
Cache Miss
The Alluxio client delegates the read from the UFS to a worker, preferably a local worker. The worker reads the data from the under storage and caches it in Alluxio.
Cache misses generally cause the largest delay because data must be fetched from the under storage.
A cache miss is expected when reading data for the first time.
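The cache-miss path above can be sketched as follows (illustrative pseudologic, not Alluxio's client code): a read first checks the worker's cache; on a miss, the worker takes the slow path of fetching the block from the under storage and caches it for future reads.

```python
# Illustrative sketch of the read path with a cache miss (not Alluxio's
# actual code). A read checks the worker's cache first; on a miss, the
# block is fetched from the under storage and cached for next time.

class Worker:
    def __init__(self, under_storage):
        self.cache = {}                      # block_id -> data
        self.under_storage = under_storage   # stands in for the UFS

    def read_block(self, block_id):
        if block_id in self.cache:           # cache hit: served locally
            return self.cache[block_id], "hit"
        data = self.under_storage[block_id]  # cache miss: slow UFS fetch
        self.cache[block_id] = data          # cache for future reads
        return data, "miss"

ufs = {"b1": b"hello"}
worker = Worker(ufs)
print(worker.read_block("b1")[1])  # first read: miss
print(worker.read_block("b1")[1])  # second read: hit
```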
asynchronous caching
When the client reads only a portion of a block or reads the block non-sequentially, the client will instruct the worker to cache the full block asynchronously.
Asynchronous caching does not block the client, but may still impact performance if the network bandwidth between Alluxio and the under storage system is a bottleneck. The impact of asynchronous caching can be tuned through the worker's asynchronous caching settings.
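The non-blocking behavior can be sketched with a background thread (illustrative only; Alluxio workers implement this internally): the partial read returns immediately, while the full block is cached in the background.

```python
import threading
import time

# Illustrative sketch of asynchronous caching (not Alluxio's internals):
# a partial read returns immediately, while the full block is cached by
# a background thread, so the client is never blocked on the caching.

cache = {}

def cache_full_block(block_id, ufs):
    time.sleep(0.05)                 # simulate a slow under-storage transfer
    cache[block_id] = ufs[block_id]  # the whole block lands in the cache

def read_partial(block_id, ufs, length):
    data = ufs[block_id][:length]    # serve the requested portion right away
    # kick off asynchronous caching of the full block
    t = threading.Thread(target=cache_full_block, args=(block_id, ufs))
    t.start()
    return data, t

ufs = {"b1": b"0123456789"}
data, t = read_partial("b1", ufs, 4)
print(data)          # b'0123' is returned immediately
t.join()             # some time later, the full block has been cached
print(cache["b1"])   # b'0123456789'
```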
Cache Skip
It is possible to turn off caching in Alluxio by setting the property `alluxio.user.file.readtype.default` in the client to NO_CACHE.
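For example, in the client's configuration (e.g. `alluxio-site.properties`):

```properties
# Skip caching: reads will not leave a copy of the data in Alluxio
alluxio.user.file.readtype.default=NO_CACHE
```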
Data flow: Write
MUST_CACHE
With a write type of MUST_CACHE, the Alluxio client only writes to the local Alluxio worker and no data will be written to the under storage.
During the write, if short-circuit write is available, Alluxio client directly writes to the file on the local RAM disk,
bypassing the Alluxio worker to avoid network transfer.
Since the data is not persisted to the under storage, data can be lost
if the machine crashes or data needs to be freed up for newer writes. The MUST_CACHE setting is useful for writing temporary data when data loss can be tolerated.
CACHE_THROUGH
With the write type of CACHE_THROUGH, data is written synchronously to an Alluxio worker and the under storage system. The Alluxio client delegates the write to the local worker and the worker simultaneously writes to both local memory and the under storage. Since the under storage is typically slower to write to than the local storage, the client write speed will match the write speed of the under storage. The CACHE_THROUGH write type is recommended when data persistence is required. A local copy is also written so any future reads of the data can be served from local memory directly.
MUST_CACHE
writes no data to UFS, so Alluxio space is never consistent with UFS.
CACHE_THROUGH
writes data synchronously to Alluxio and UFS before returning success to applications.
If writing to the UFS is also strongly consistent (e.g., HDFS), Alluxio space will always be consistent with the UFS as long as there are no out-of-band updates in the UFS;
if writing to the UFS is eventually consistent (e.g., S3), a file may be written successfully to Alluxio but only show up in the UFS later. In this case, Alluxio clients still see a consistent file system, because they always consult the Alluxio master, which is strongly consistent. There may, however, be a window of inconsistency before the data is finally propagated to the UFS, even though Alluxio clients continue to see a consistent state in Alluxio space.
ASYNC_THROUGH
writes data to Alluxio and returns to the application, leaving Alluxio to propagate the data to the UFS asynchronously. From the user's perspective, the file is written successfully to Alluxio but is persisted to the UFS later.
THROUGH
writes data directly to the UFS without caching it in Alluxio; however, Alluxio still tracks the file and its status, so the metadata remains consistent.
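The default write type can likewise be set on the client; a hedged example (the property name reflects common Alluxio configuration and should be verified against the configuration reference for your version — the valid values are the four write types above):

```properties
# Choose one of MUST_CACHE, CACHE_THROUGH, ASYNC_THROUGH, THROUGH
alluxio.user.file.writetype.default=CACHE_THROUGH
```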