[Elasticsearch Monitor] How to monitor Elas

2017-02-08 king_wang

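Elasticsearch exposes a detailed set of APIs that let you see how the cluster is running in real time. With them you can quickly spot problems such as lost nodes, OOM errors, and long GC pauses, and fix them promptly. Elasticsearch monitoring falls into the following areas:

Search and indexing performance
Memory and GC
Host and network metrics
Cluster health and node availability
Resource saturation and errors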

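(1) Search performance metrics

Search is one of the two main functions of Elasticsearch; the other is indexing. Search and indexing are analogous to read and write in a traditional database. Internally, a search runs in two phases, query and fetch, and the API exposes metrics for each phase (the work metrics fall into two types: throughput and performance):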

Metric description	Name	Metric type
Total number of queries	`indices.search.query_total`	Work: Throughput
Total time spent on queries	`indices.search.query_time_in_millis`	Work: Performance
Number of queries currently in progress	`indices.search.query_current`	Work: Throughput
Total number of fetches	`indices.search.fetch_total`	Work: Throughput
Total time spent on fetches	`indices.search.fetch_time_in_millis`	Work: Performance
Number of fetches currently in progress	`indices.search.fetch_current`	Work: Throughput
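
The metric names in the table are dotted paths. Assuming the statistics come from the node stats API (`GET _nodes/stats`), each node's entry in the response is nested JSON, and a name such as `indices.search.query_total` is a path into it. The helper below is a minimal sketch of that lookup; the function name `metric` and the sample values are made up for illustration.

```python
def metric(node_stats, dotted_name):
    """Resolve a dotted metric name against one node's nested stats JSON."""
    value = node_stats
    for key in dotted_name.split("."):
        value = value[key]  # raises KeyError if the metric is absent
    return value


# Hand-written sample shaped like one node's entry in a node stats response.
# The numbers are invented, not real measurements.
sample_node_stats = {
    "indices": {
        "search": {
            "query_total": 1200,
            "query_time_in_millis": 3600,
            "query_current": 2,
            "fetch_total": 900,
            "fetch_time_in_millis": 450,
            "fetch_current": 0,
        }
    }
}

print(metric(sample_node_stats, "indices.search.query_total"))
```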

Search performance metrics to watch

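Query load: Watching the number of queries currently in progress gives a rough idea of how many requests the cluster is handling at any moment. A sudden spike or drop usually signals a problem and is worth an alert. Monitoring the search thread pool queue size is covered later in this article.
Query latency: The Elasticsearch API does not expose this metric directly, but the average query latency can be computed by periodically sampling the total number of queries and the total time spent on them. If latency exceeds a threshold, look for a resource bottleneck or check whether the queries need optimizing.
Fetch latency: The fetch phase is the second phase of a search and normally takes much less time than the query phase. If this metric keeps increasing, it may point to slow disks, expensive processing of result documents (for example highlighting matched text in the results), or requests that ask for too many results.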
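
The latency calculation described above can be sketched as follows: take two samples of a `*_total` counter and its matching `*_time_in_millis` counter, then divide the time delta by the count delta. The function name and the sample numbers are illustrative assumptions.

```python
def average_latency_ms(prev_total, prev_time_ms, curr_total, curr_time_ms):
    """Average time per operation between two samples of cumulative counters.

    Returns None when no operations completed in the interval, or when the
    counters went backwards (for example after a node restart).
    """
    operations = curr_total - prev_total
    if operations <= 0:
        return None
    return (curr_time_ms - prev_time_ms) / operations


# Two samples of query_total / query_time_in_millis taken some time apart.
print(average_latency_ms(1000, 5000, 1100, 5800))
```

The same function works for fetch, indexing, refresh, and flush, since those metrics follow the same total/time pattern.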

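(2) Indexing performance metrics

Indexing requests are analogous to write requests in a traditional database. If your Elasticsearch workload is write-heavy, monitoring and analyzing indexing performance is very important. It helps to first understand how Elasticsearch updates an index. When new documents are added to an index, or existing documents are updated or deleted, each shard of the index goes through two processes: refresh and flush. The API provides the related metrics: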

Metric description	Name	Metric type
Total number of documents indexed	`indices.indexing.index_total`	Work: Throughput
Total time spent indexing documents	`indices.indexing.index_time_in_millis`	Work: Performance
Number of documents currently being indexed	`indices.indexing.index_current`	Work: Throughput
Total number of index refreshes	`indices.refresh.total`	Work: Throughput
Total time spent refreshing indices	`indices.refresh.total_time_in_millis`	Work: Performance
Total number of index flushes to disk	`indices.flush.total`	Work: Throughput
Total time spent on flushing indices to disk	`indices.flush.total_time_in_millis`	Work: Performance

Indexing performance metrics to watch

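Indexing latency: The Elasticsearch API does not expose this metric directly, but the average indexing latency can be computed from index_total and index_time_in_millis. If latency is rising, you may be indexing too much data in a single request (the Elasticsearch documentation suggests starting bulk indexing at 5 MB to 15 MB per batch and increasing slowly until you find a reasonable value).
Flush latency: Data is not persisted to disk until a flush completes successfully, so this metric is well worth tracking; if performance drops sharply, take action. A steady increase may indicate slow disks, a problem that can escalate until data can no longer be written. You can try lowering index.translog.flush_threshold_size in the index's flush settings. This setting is the threshold that triggers a flush, that is, how large the translog may grow before a flush starts. If your Elasticsearch workload is write-heavy, keep an eye on disk I/O with tools such as iostat, and consider upgrading the disks if necessary.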
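
To experiment with bulk batch sizes in the 5 MB to 15 MB range, documents can be grouped by their serialized size before being sent. The sketch below is one way to do that; `batches_by_size` is a hypothetical helper, and the size estimate counts only the JSON document bodies, not the bulk action lines that accompany them.

```python
import json


def batches_by_size(docs, max_bytes):
    """Yield lists of documents whose JSON bodies total at most max_bytes.

    A single document larger than max_bytes is still yielded, alone.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


# Start small and raise the budget while watching indexing latency.
five_mb = 5 * 1024 * 1024
docs = [{"message": "event %d" % i} for i in range(1000)]
for batch in batches_by_size(docs, five_mb):
    print(len(batch), "documents in this batch")
```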

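(3) Memory usage and garbage collection

Memory is one of the key resources to watch closely while Elasticsearch is running. Elasticsearch and Lucene make use of RAM in two ways: the JVM heap and the file system cache. Elasticsearch runs in the Java Virtual Machine (JVM), so the duration and frequency of JVM garbage collection are another important area to monitor.
JVM heap
Elasticsearch stresses the importance of a JVM heap size that is "just right": it should be neither too large nor too small, for the reasons given below. As a rule of thumb, allocate close to 50% of the machine's memory to the JVM heap, and never exceed 32 GB.
The less heap you give Elasticsearch, the more RAM is left for Lucene, which relies heavily on the file system cache to serve requests quickly. But if the heap is too small, the process will garbage-collect frequently, with constant short pauses, and may even run out of memory.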
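
The rule of thumb above is simple arithmetic. A minimal sketch, assuming the 50% share and the 32 GB ceiling stated in the text (the function name and parameter names are illustrative):

```python
def suggested_heap_gb(total_ram_gb, ram_fraction=0.5, ceiling_gb=32):
    """Heap size from the rule of thumb: about half of RAM, capped at the ceiling."""
    return min(total_ram_gb * ram_fraction, ceiling_gb)


for ram_gb in (16, 64, 128):
    print(ram_gb, "GB RAM ->", suggested_heap_gb(ram_gb), "GB heap")
```

On a 128 GB machine the cap applies, and the remaining memory stays available to the file system cache.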
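Garbage collection
Elasticsearch relies on garbage collection to free heap memory. Because GC can leave the process unable to respond to external requests, keep an eye on its frequency and duration to decide whether the heap size needs adjusting. A heap that is too large can lead to long garbage collections, and long pauses are dangerous because they can make the cluster wrongly conclude that the node has left.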

Metric description	Name	Metric type
Total count of young-generation garbage collections	`jvm.gc.collectors.young.collection_count`	Other
Total time spent on young-generation garbage collections	`jvm.gc.collectors.young.collection_time_in_millis`	Other
Total count of old-generation garbage collections	`jvm.gc.collectors.old.collection_count`	Other
Total time spent on old-generation garbage collections	`jvm.gc.collectors.old.collection_time_in_millis`	Other
Percent of JVM heap currently in use	`jvm.mem.heap_used_percent`	Resource: Utilization
Amount of JVM heap committed	`jvm.mem.heap_committed_in_bytes`	Resource: Utilization

JVM metrics to watch

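JVM heap in use: By default Elasticsearch is configured to start garbage collection when JVM heap usage reaches 75%. If usage stays very high, for example at 85%, garbage collection is not keeping up with reclaiming memory, which is dangerous. At that point you may need to add memory or add nodes.
JVM heap used vs. JVM heap committed: The ratio of used heap to committed heap. If the ratio trends upward over time, garbage collection is not keeping up with object creation, which can lead to slower collections and eventually to OutOfMemoryErrors.
Garbage collection duration and frequency: Both young-generation and old-generation collections include a "stop the world" phase, in which the JVM halts program execution to reclaim unused objects. The node cannot complete any work during that time. Because the master node checks the status of the other nodes every 30 seconds, a garbage collection that lasts longer than 30 seconds can lead the master to believe the node has been lost.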
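
GC frequency and duration can be derived from the cumulative `collection_count` and `collection_time_in_millis` counters by comparing two samples. The sketch below assumes each sample is a dict holding those two fields for one collector (young or old); the function name and the sample values are illustrative.

```python
def gc_activity(prev, curr, interval_seconds):
    """Summarize GC work between two samples taken interval_seconds apart."""
    collections = curr["collection_count"] - prev["collection_count"]
    time_ms = curr["collection_time_in_millis"] - prev["collection_time_in_millis"]
    return {
        "collections": collections,
        # Mean duration of one collection in the interval.
        "avg_collection_ms": time_ms / collections if collections > 0 else 0.0,
        # Share of wall-clock time spent collecting.
        "time_in_gc_fraction": time_ms / (interval_seconds * 1000.0),
    }


prev_old = {"collection_count": 10, "collection_time_in_millis": 500}
curr_old = {"collection_count": 14, "collection_time_in_millis": 1700}
print(gc_activity(prev_old, curr_old, interval_seconds=60))
```

A rising average for the old collector is the signal to look at first, since those are the collections that can approach the long pauses described above.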

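(4) Host-level network and system metrics

Besides application-level performance metrics, you also need to monitor the performance of the hosts the nodes run on.

Disk space: Disk space on data nodes is critical; when it runs out, no new data can be written. When space is low, delete unneeded indices, add disks, or add nodes.
I/O utilization: Elasticsearch reads and writes the disk heavily as segments are created, queried, and merged. Cluster performance depends a great deal on disk I/O; if the budget allows, SSDs can improve it significantly.
CPU utilization: A rise in CPU usage is usually caused by a heavy search or indexing workload. If CPU usage keeps climbing, you may need to add nodes to spread the load.
Network bytes sent/received: Communication between nodes is key to a balanced cluster, so the network needs monitoring to make sure it stays healthy. Elasticsearch provides transport metrics for cluster communication, but you can also look directly at the rate of bytes sent and received on the host to understand your network traffic.
Open file descriptors: File descriptors are used for file operations and network connections. The operating system imposes an upper limit; once it is reached, no new connections or file operations can proceed. Elasticsearch requires this limit to be raised because Lucene keeps a large number of files open at the same time.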
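
As a small illustration of a host-level check, the disk space item can be measured with Python's standard library. This is a sketch only: the 80% warning threshold and the function names are assumptions, and in practice the path would be the node's data directory rather than the current directory.

```python
import shutil


def disk_used_percent(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def is_disk_low(used_percent, warn_percent=80.0):
    """True when usage has reached the (assumed) warning threshold."""
    return used_percent >= warn_percent


used = disk_used_percent(".")
print("disk used: %.1f%%, low on space: %s" % (used, is_disk_low(used)))
```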

HTTP connections

Metric description	Name	Metric type
Number of HTTP connections currently open	`http.current_open`	Resource: Utilization
Total number of HTTP connections opened over time	`http.total_opened`	Resource: Utilization

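Apart from the Java client, clients in every other language talk to Elasticsearch over HTTP. If the total number of opened HTTP connections keeps increasing, some client is probably misconfigured in how it connects to Elasticsearch. Constantly re-establishing connections wastes resources on both the server and the client, so keep this in mind when writing client programs.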

(5) Cluster health and node availability

Metric description	Name	Metric type
Cluster status (green, yellow, red)	`cluster.health.status`	Other
Number of nodes	`cluster.health.number_of_nodes`	Resource: Availability
Number of initializing shards	`cluster.health.initializing_shards`	Resource: Availability
Number of unassigned shards	`cluster.health.unassigned_shards`	Resource: Availability

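**Cluster status:** If the cluster status is yellow, at least one replica shard is missing. Search results are still complete in this state. If the cluster status is red, at least one primary shard is missing, and search results will be missing some data.
**Initializing and unassigned shards:** When a new index is created or a node restarts, the index's shards are first in the "initializing" state while the master node assigns shards to the nodes in the cluster. The shards then move to the "started" or "unassigned" state.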
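
The checks above can be expressed as a small function over a cluster health response. The sketch assumes a dict with the `status`, `number_of_nodes`, `initializing_shards`, and `unassigned_shards` fields listed in the table; `health_alerts`, the `expected_nodes` parameter, and the sample values are illustrative.

```python
def health_alerts(health, expected_nodes):
    """Return human-readable alerts for a cluster health response."""
    alerts = []
    if health["status"] == "red":
        alerts.append("status red: at least one primary shard is missing")
    elif health["status"] == "yellow":
        alerts.append("status yellow: at least one replica shard is missing")
    if health["number_of_nodes"] < expected_nodes:
        missing = expected_nodes - health["number_of_nodes"]
        alerts.append("%d node(s) missing from the cluster" % missing)
    if health["unassigned_shards"] > 0:
        alerts.append("%d unassigned shard(s)" % health["unassigned_shards"])
    return alerts


sample_health = {
    "status": "yellow",
    "number_of_nodes": 2,
    "initializing_shards": 0,
    "unassigned_shards": 5,
}
for alert in health_alerts(sample_health, expected_nodes=3):
    print(alert)
```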

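(6) Resource saturation and errors

Elasticsearch uses thread pools to manage how threads consume memory and CPU. The pools are configured automatically from the number of CPU cores, so in most cases they do not need tuning. It is still a good idea to monitor thread pool queue lengths and rejection counts in real time, so that you notice when the cluster can no longer keep up with demand. In that situation, add nodes to handle the concurrency. Fielddata and filter cache usage is another important area to monitor, because it may reveal inefficient queries or memory pressure.

1.Thread pool queues and rejections
Each node maintains several thread pools. The most important ones to monitor are search, index, merge, and bulk, which correspond to search, index, merge, and bulk operations.
The size of a thread pool's queue shows how many requests are waiting to be served on that node. The queue lets the node eventually serve these requests instead of dropping them. Once the queue is full, further requests are rejected.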

Metric description	Name	Metric type
Number of queued threads in a thread pool	`thread_pool.bulk.queue` `thread_pool.index.queue` `thread_pool.search.queue` `thread_pool.merge.queue`	Resource: Saturation
Number of rejected threads in a thread pool	`thread_pool.bulk.rejected` `thread_pool.index.rejected` `thread_pool.search.rejected` `thread_pool.merge.rejected`	Resource: Error

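**Thread pool queues:** Simply making the queues larger is not a good solution, because it uses up system resources and hurts performance elsewhere. Large queues also increase the risk of losing requests. If you see queued and rejected counts growing steadily, reduce the request rate if you can, add CPU to the nodes, or add nodes.
**Bulk rejections and bulk queues:** A bulk operation executes many actions in one request instead of many single requests. Bulk rejections usually mean that too many documents are being indexed in one batch. When they occur, back off linearly or exponentially before sending more.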
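
An exponential backoff for rejected bulk requests might look like the sketch below. `send` is a hypothetical callable supplied by the caller that returns True when the cluster accepted the batch and False when it was rejected; the retry count and base delay are assumptions to tune for your own workload.

```python
import time


def send_with_backoff(send, payload, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload), doubling the wait after every rejection.

    Returns True once the payload is accepted, False if every attempt
    (the first try plus max_retries retries) was rejected.
    """
    for attempt in range(max_retries + 1):
        if send(payload):
            return True
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False


# Demo with a fake sender that rejects twice and then accepts.
outcomes = iter([False, False, True])
waits = []
accepted = send_with_backoff(lambda batch: next(outcomes), "bulk body", sleep=waits.append)
print(accepted, waits)
```

Passing `sleep` as a parameter keeps the function testable without real delays; a linear backoff would replace `2 ** attempt` with `attempt + 1`.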

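2.Cache usage metrics
Every query request is sent to each shard of the index and then hits each segment of those shards. Elasticsearch caches queries on a per-segment basis to speed up response times. On the other hand, if the caches take up too much memory, they can slow things down instead of speeding them up!
Elasticsearch uses two main types of cache to serve search requests faster: the fielddata cache and the filter cache.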

Metric description	Name	Metric type
Size of the fielddata cache (bytes)	`indices.fielddata.memory_size_in_bytes`	Resource: Utilization
Number of evictions from the fielddata cache	`indices.fielddata.evictions`	Resource: Saturation
Size of the filter cache (bytes)	`indices.filter_cache.memory_size_in_bytes`	Resource: Utilization
Number of evictions from the filter cache	`indices.filter_cache.evictions`	Resource: Saturation

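**Fielddata cache evictions:** The number of entries evicted from the fielddata cache. Elasticsearch evicts less frequently used entries according to its own rules to make better use of memory.
**Filter cache evictions:** The number of entries evicted from the filter cache; it behaves much like fielddata cache evictions.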

3.Pending tasks

Metric description	Name	Metric type
Number of pending tasks	`pending_task_total`	Resource: Saturation
Number of urgent pending tasks	`pending_tasks_priority_urgent`	Resource: Saturation
Number of high-priority pending tasks	`pending_tasks_priority_high`	Resource: Saturation

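Pending tasks can only be handled by the master node. They include creating indices and assigning shards to nodes. If the master is very busy and the number of pending tasks does not go down, the cluster can become unstable.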
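
The per-priority counts in the table can be derived from the pending tasks list. The sketch assumes a response shaped like the output of `GET _cluster/pending_tasks`, that is, a `tasks` array whose entries carry a `priority` field; the sample entries are invented for illustration.

```python
from collections import Counter


def pending_by_priority(response):
    """Count pending cluster tasks per priority level."""
    return Counter(task["priority"] for task in response.get("tasks", []))


sample_response = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT", "source": "create-index", "time_in_queue_millis": 86},
        {"insert_order": 46, "priority": "HIGH", "source": "shard-started", "time_in_queue_millis": 842},
        {"insert_order": 45, "priority": "HIGH", "source": "shard-started", "time_in_queue_millis": 858},
    ]
}
counts = pending_by_priority(sample_response)
print("total:", sum(counts.values()), "urgent:", counts["URGENT"], "high:", counts["HIGH"])
```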

4.Unsuccessful GET requests

Metric description	Name	Metric type
Total number of GET requests where the document was missing	`indices.get.missing_total`	Work: Error
Total time spent on GET requests where the document was missing	`indices.get.missing_time_in_millis`	Work: Error

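A GET request is much more direct than a search request: it retrieves a document by its ID. GET requests normally cause no trouble, but it is worth staying alert to GET requests that fail to find their document.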
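
One way to watch for this is the share of GET requests that found no document in a sampling interval. The sketch assumes two samples of the node's `indices.get` statistics, each with a cumulative `total` alongside the `missing_total` counter from the table; the function name and numbers are illustrative.

```python
def get_missing_ratio(prev_get, curr_get):
    """Fraction of GET requests in the interval whose document was missing."""
    total = curr_get["total"] - prev_get["total"]
    missing = curr_get["missing_total"] - prev_get["missing_total"]
    if total <= 0:
        return 0.0
    return missing / total


prev_sample = {"total": 100, "missing_total": 5}
curr_sample = {"total": 200, "missing_total": 30}
print(get_missing_ratio(prev_sample, curr_sample))
```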