Elasticsearch tutorial (2)
Basic Concepts
There are a few concepts that are core to Elasticsearch. Understanding them from the outset will greatly ease the learning process.
Near Realtime (NRT)
Elasticsearch is a near-real-time search platform. This means there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.
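The delay described above can be pictured with a toy model. This is only a sketch, not Elasticsearch code: the class `ToyIndex` and its methods are hypothetical names, and the explicit `refresh()` call stands in for the periodic refresh that Elasticsearch performs (about once a second by default).

```python
class ToyIndex:
    """Toy model of near-real-time visibility (not Elasticsearch code)."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        # A newly indexed document first lands in the buffer.
        self._buffer.append(doc)

    def refresh(self):
        # Stand-in for the periodic refresh: buffered documents become searchable.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        # Only refreshed documents can be found.
        return [doc for doc in self._searchable if term in doc]


idx = ToyIndex()
idx.index("quick brown fox")
before = idx.search("fox")  # the document is not searchable yet
idx.refresh()
after = idx.search("fox")   # now it is
print(before, after)        # prints: [] ['quick brown fox']
```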
Cluster
A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.
Make sure that you don't reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters, each with its own unique cluster name.
Node
A node is a single server that is part of your cluster, stores your data, and participates in the cluster's indexing and search capabilities. Just like a cluster, a node is identified by a name, which by default is a random Universally Unique Identifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes, where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.
A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start up a number of nodes on your network and they can discover each other, they will automatically form and join a single cluster named elasticsearch.
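As a rough illustration of the naming advice above, the sketch below renders the two name settings a node would carry in its elasticsearch.yml. `cluster.name` and `node.name` are the real setting keys. The helper `render_settings` and the node name `node-1` are made up for this example, and the `logging-<environment>` pattern simply follows the logging-dev/logging-stage/logging-prod suggestion.

```python
def render_settings(environment, node_name):
    """Return elasticsearch.yml lines naming the cluster and the node."""
    # One cluster name per environment, so nodes cannot join the wrong cluster.
    cluster_name = f"logging-{environment}"
    return f"cluster.name: {cluster_name}\nnode.name: {node_name}\n"


for env in ("dev", "stage", "prod"):
    print(render_settings(env, "node-1"))
```

A node started with the `dev` settings would only join a cluster called logging-dev, even if staging or production nodes are reachable on the same network.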
Index
An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase), and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.
In a single cluster, you can define as many indexes as you want.
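The lowercase rule is easy to check before sending a request. The helper below is a hypothetical sketch that tests only the rule stated above. Elasticsearch applies further naming restrictions that are not covered here.

```python
def is_lowercase_index_name(name):
    """Check only the 'all lowercase' rule for an index name."""
    return bool(name) and name == name.lower()


print(is_lowercase_index_name("customer"))  # prints: True
print(is_lowercase_index_name("Customer"))  # prints: False
```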
Type
Warning: Deprecated in 6.0.0.
A type used to be a logical category/partition of your index that allowed you to store different types of documents in the same index, e.g. one type for users and another type for blog posts. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types for more.
Document
A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
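For instance, a customer document could look like the following once serialized. The field names and values are invented for illustration and are not part of any Elasticsearch schema.

```python
import json

# Hypothetical customer document; the fields are illustrative only.
customer = {"name": "John Doe", "age": 35}

# Serialize to the JSON text that would be sent as the document body.
body = json.dumps(customer, sort_keys=True)
print(body)  # prints: {"age": 35, "name": "John Doe"}

# Parsing the JSON text gives back the same structure.
restored = json.loads(body)
```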
Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
Shards & Replicas
An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node, or may be too slow to serve search requests from a single node alone.
To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully functional and independent "index" that can be hosted on any node in the cluster.
Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume.
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput.
The mechanics of how a shard is distributed, and of how its documents are aggregated back into search requests, are completely managed by Elasticsearch and are transparent to you as the user.
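Although Elasticsearch handles this for you, the basic idea of spreading documents over shards can be sketched as a hash of a routing value (the document ID by default) taken modulo the number of primary shards. The code below is a simplified illustration: `pick_shard` and `group_by_shard` are hypothetical helpers, and `zlib.crc32` is only a stand-in for the Murmur3-based hash Elasticsearch uses internally.

```python
import zlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key to a shard number in [0, num_primary_shards)."""
    # Stand-in hash: Elasticsearch uses its own hash function, not CRC32.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def group_by_shard(doc_ids, num_primary_shards):
    """Group document IDs by the shard each one would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[pick_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


shards = group_by_shard([f"doc-{i}" for i in range(20)], 5)
for number, ids in shards.items():
    print(number, ids)
```

Because the shard number depends on the shard count, changing the count would send existing documents to different shards. This is one way to see why the number of primary shards is fixed once the index exists.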
In a network/cloud environment where failures can be expected at any time, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index's shards into what are called replica shards, or replicas for short.
Replication is important for two primary reasons:
It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.
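The rule that a replica never shares a node with its primary can be illustrated with a toy allocator. This is not how Elasticsearch actually decides shard placement. `allocate` is a hypothetical round-robin sketch that only respects that one rule and leaves copies unassigned when there are too few nodes.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place each copy of a shard on a different node, round-robin."""
    assigned, unassigned = [], []
    for shard in range(num_primaries):
        for copy in range(1 + num_replicas):
            label = "primary" if copy == 0 else "replica"
            if copy < len(nodes):
                # Copies of one shard get consecutive, hence distinct, nodes.
                node = nodes[(shard + copy) % len(nodes)]
                assigned.append((shard, label, node))
            else:
                # No node left that does not already hold a copy of this shard.
                unassigned.append((shard, label))
    return assigned, unassigned


two_nodes, none_left = allocate(5, 1, ["node-a", "node-b"])
one_node, waiting = allocate(5, 1, ["node-a"])
print(len(two_nodes), len(none_left))  # prints: 10 0
print(len(one_node), len(waiting))     # prints: 5 5
```

With a single node the five replicas stay unassigned, which matches the requirement of at least two nodes before replicas can be placed.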
To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically at any time, but you cannot change the number of shards after the fact.
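As a sketch of what this looks like in practice, the snippet below only builds the JSON bodies. It assumes they would be sent with the create index request (PUT /customer) and the update settings request (PUT /customer/_settings). The index name customer and the helper function names are made up for illustration, while `number_of_shards` and `number_of_replicas` are the actual setting names.

```python
import json


def create_index_body(shards, replicas):
    """Settings supplied once, when the index is created."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


def update_replicas_body(replicas):
    """Settings that may still be changed after creation (replicas only)."""
    return {"index": {"number_of_replicas": replicas}}


print(json.dumps(create_index_body(3, 2)))
print(json.dumps(update_replicas_body(1)))
```

The update body has no shard count field, reflecting that only the replica count can be changed on an existing index.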
By default, in the 6.x releases described here, each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
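The arithmetic behind that total is simply the number of primaries multiplied by one plus the number of replicas. The function name below is hypothetical.

```python
def total_shards(primaries, replicas):
    """Each primary shard exists once, plus once per replica."""
    return primaries * (1 + replicas)


print(total_shards(5, 1))  # prints: 10
print(total_shards(5, 0))  # prints: 5
```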
Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.
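The limit can be checked with a little arithmetic, and it also gives a lower bound on how many primary shards a very large index needs. The helper below is a hypothetical sketch that considers document count only. Disk size, query load and other sizing factors are outside its scope.

```python
INTEGER_MAX_VALUE = 2**31 - 1                 # Java's Integer.MAX_VALUE
MAX_DOCS_PER_SHARD = INTEGER_MAX_VALUE - 128  # per-Lucene-index document limit

print(MAX_DOCS_PER_SHARD)  # prints: 2147483519


def min_primary_shards(total_docs):
    """Smallest shard count that keeps every shard under the document limit."""
    # Ceiling division with integers, assuming an even spread across shards.
    return max(1, -(-total_docs // MAX_DOCS_PER_SHARD))


print(min_primary_shards(1_000_000_000))   # prints: 1
print(min_primary_shards(10_000_000_000))  # prints: 5
```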