Debugging Elasticsearch Cluster Issues
Things to Monitor in an Elasticsearch Cluster
There may be times when an Elasticsearch cluster misbehaves and you have no idea what went wrong. To debug Elasticsearch issues, you need monitoring in place for a set of key metrics, which you can then use to diagnose problems and tune the cluster.
The following metrics are the most useful when debugging Elasticsearch:
1. Cluster health (Nodes and Shards)
This is the most basic metric. It gives an overview of the number of nodes running and the status of the shards distributed across them. Make sure the expected number of nodes is present in the cluster, and watch for relocating shards when a node joins. Shard relocation should be throttled so that it does not degrade cluster performance.
$ curl 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name": "elasticsearch_zach",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 10,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
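A check like this is easy to script. The following is a minimal Python sketch, assuming the health response has already been fetched and parsed into a dictionary. The function name health_warnings and the expected_nodes parameter are illustrative and not part of Elasticsearch.

```python
def health_warnings(health, expected_nodes):
    """Return a list of warnings for a parsed cluster health response."""
    warnings = []
    if health["status"] != "green":
        warnings.append("cluster status is %s" % health["status"])
    if health["number_of_nodes"] < expected_nodes:
        warnings.append(
            "only %d of %d nodes present"
            % (health["number_of_nodes"], expected_nodes)
        )
    # Any shard that is moving, starting up, or unassigned deserves a look.
    for key in ("relocating_shards", "initializing_shards", "unassigned_shards"):
        if health[key] > 0:
            warnings.append("%s = %d" % (key, health[key]))
    return warnings


# The response shown above, as a Python dictionary.
SAMPLE_HEALTH = {
    "cluster_name": "elasticsearch_zach",
    "status": "green",
    "timed_out": False,
    "number_of_nodes": 1,
    "number_of_data_nodes": 1,
    "active_primary_shards": 10,
    "active_shards": 10,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
}

print(health_warnings(SAMPLE_HEALTH, expected_nodes=1))
```

For the healthy sample response the list is empty; a yellow status or a missing node would each add one entry.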
2. Disk I/O
Disk I/O is a good indicator of indexing and query performance. Monitor whether the workload is dominated by reads or by writes: if writes outnumber reads, focus on optimizing indexing; otherwise, focus on optimizing queries.
Another bottleneck can appear if you are using volumes with a fixed I/O allowance. If Elasticsearch performs enough disk I/O to drain the burst balance to 0, read and write operations slow down, and CPU utilization eventually increases as well. The solution is to use high-IOPS volumes.
3. Memory Usage
Elasticsearch is memory-hungry. Do not panic if nearly all memory is in use; it is actually a good sign that the memory is being utilized. Generally, 50% of the available memory is given to the JVM heap and the rest is left for the file system cache. What you need to check is the buffer and cache memory: if very little memory is left available for buffers and cache, the machine is short on RAM and needs more.
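On Linux, the buffer and cache figures can be read from /proc/meminfo. The sketch below is one possible way to compute the share of RAM currently used for buffers and cache. The sample text and its numbers are made up for illustration, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def cache_fraction(meminfo):
    """Fraction of total RAM currently used for buffers and page cache."""
    return (meminfo["Buffers"] + meminfo["Cached"]) / meminfo["MemTotal"]


# Illustrative sample; on a real host use open("/proc/meminfo").read().
SAMPLE_MEMINFO = """MemTotal:       16000000 kB
MemFree:          200000 kB
Buffers:          300000 kB
Cached:          6100000 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print("buffers + cache: %.0f%% of RAM" % (100 * cache_fraction(info)))
```

A value that stays very low while the heap takes half of the machine suggests the file system cache is being starved.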
4. Node’s performance – CPU
Watch for CPU spikes. If CPU utilization increases, look at the JVM metrics, in particular garbage collection (GC) activity. If CPU spikes coincide with high GC activity, tune the GC by adjusting the total heap size or the sizes of the new and old generations. As a rule of thumb:
- The new-gen GCs should not last longer than 50 ms or happen more often than every 10 seconds.
- The old-gen GCs should not last longer than 1 second or happen more often than every 10 minutes.
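These two rules can be checked from the GC counters that the node stats API reports for each collector (collection_count and collection_time_in_millis). The following is a rough sketch, assuming you sample those counters twice and pass in the differences; the function and constant names are illustrative.

```python
def gc_check(count_delta, time_delta_ms, interval_s, max_avg_ms, min_gap_s):
    """Check one collector against a pause limit and a frequency limit.

    count_delta and time_delta_ms are the growth of the collection count
    and total collection time between two samples taken interval_s apart.
    """
    problems = []
    if count_delta == 0:
        return problems  # no collections in this interval
    avg_ms = time_delta_ms / count_delta
    gap_s = interval_s / count_delta
    if avg_ms > max_avg_ms:
        problems.append("average pause %.0f ms exceeds %d ms" % (avg_ms, max_avg_ms))
    if gap_s < min_gap_s:
        problems.append(
            "one collection every %.1f s, more often than every %d s"
            % (gap_s, min_gap_s)
        )
    return problems


# Limits taken from the two guidelines above.
NEW_GEN = dict(max_avg_ms=50, min_gap_s=10)
OLD_GEN = dict(max_avg_ms=1000, min_gap_s=600)

# Illustrative numbers: 12 new-gen collections taking 900 ms in one minute.
for problem in gc_check(count_delta=12, time_delta_ms=900, interval_s=60, **NEW_GEN):
    print(problem)
```

Averaging over an interval hides individual long pauses, so treat this as a first signal rather than a replacement for GC logs.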
5. HEAP Size and Garbage Collection
Elasticsearch runs in the JVM, so monitoring garbage collection and memory usage is critical. Several JVM and operating system settings need to be tuned:
- The JVM process should never be swapped out to disk; it should stay resident in RAM. To ensure this, set bootstrap.mlockall=true in the Elasticsearch configuration file and set the environment variable MAX_LOCKED_MEMORY=unlimited.
- Set vm.swappiness=1 in /etc/sysctl.conf.
- Set the heap size for Elasticsearch using ES_HEAP_SIZE. It is generally recommended to set ES_HEAP_SIZE to about 50% of the available memory, so that the rest remains for the file system cache. Even if you have a lot of memory, do not cross 32 GB of heap: above that size the JVM can no longer use compressed object pointers and switches to larger 64-bit pointers, which require more memory.
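The heap sizing rule can be written down as a small calculation. This is only a sketch: the 50% default follows the split described under Memory Usage, and the 31 GB cap is an assumed safety margin chosen to stay below the 32 GB limit, not an official value.

```python
def suggested_heap_gb(total_ram_gb, fraction=0.5, cap_gb=31):
    """Suggest a heap size: a fraction of RAM, capped below the 32 GB limit.

    fraction and cap_gb are assumptions for illustration; adjust them to
    your own sizing policy.
    """
    return min(total_ram_gb * fraction, cap_gb)


for ram in (8, 16, 64, 128):
    print("%3d GB RAM -> ES_HEAP_SIZE=%dg" % (ram, suggested_heap_gb(ram)))
```

On a machine with 128 GB of RAM this still suggests a heap of 31 GB, leaving the remainder to the file system cache.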
6. Request Latency & Request Rate
Latency directly affects the end users who rely on Elasticsearch for search. There can be many reasons for slow Elasticsearch queries:
- Unoptimized queries
- An improperly configured Elasticsearch cluster
- Heap size issues
- Garbage collection issues
- Disk I/O bottlenecks, and many more
Graphing the request rate alongside the request latency gives insight into how Elasticsearch responds under load.
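Both graphs can be derived from the cumulative search counters in the stats APIs (query_total and query_time_in_millis). A minimal sketch, assuming two samples of those counters taken a known number of seconds apart; the function name and the sample numbers are illustrative.

```python
def search_rate_and_latency(prev, curr, interval_s):
    """Return (queries per second, average latency in ms) between two samples."""
    queries = curr["query_total"] - prev["query_total"]
    millis = curr["query_time_in_millis"] - prev["query_time_in_millis"]
    rate = queries / interval_s
    latency_ms = millis / queries if queries else 0.0
    return rate, latency_ms


# Illustrative samples taken 60 seconds apart.
before = {"query_total": 1000, "query_time_in_millis": 20000}
after = {"query_total": 1600, "query_time_in_millis": 38000}

rate, latency = search_rate_and_latency(before, after, interval_s=60)
print("%.1f queries/s, %.1f ms average latency" % (rate, latency))
```

Plotting the two series together shows whether latency rises with load or independently of it, which helps separate capacity problems from slow individual queries.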
7. JVM Pool Size
Monitor the old-gen and perm-gen memory pools. If they reach, or are about to reach, 100% utilization, you need to investigate and fix the issue, because it can result in high CPU utilization and increased garbage collection activity. If there is too much garbage collection activity, the cause could be one of the following:
- One of the pools is full; tune the size of that pool.
- The JVM needs more memory, which you can provide by increasing the heap size.
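Pool utilization can be computed from the used_in_bytes and max_in_bytes values that the node stats API reports per memory pool. The sketch below assumes that section has been parsed into a dictionary; the 90% threshold and the sample sizes are arbitrary illustrations.

```python
def pool_usage_percent(pools):
    """Map each pool name to its utilization in percent."""
    usage = {}
    for name, pool in pools.items():
        max_bytes = pool.get("max_in_bytes", 0)
        if max_bytes > 0:  # skip pools without a defined maximum
            usage[name] = 100.0 * pool["used_in_bytes"] / max_bytes
    return usage


def pools_near_full(pools, threshold=90.0):
    """Return the names of pools at or above the threshold, sorted."""
    usage = pool_usage_percent(pools)
    return sorted(name for name, pct in usage.items() if pct >= threshold)


# Illustrative sample values.
MB = 1024 ** 2
SAMPLE_POOLS = {
    "young": {"used_in_bytes": 100 * MB, "max_in_bytes": 400 * MB},
    "old": {"used_in_bytes": 950 * MB, "max_in_bytes": 1000 * MB},
}

print(pools_near_full(SAMPLE_POOLS))
```

A pool that stays above the threshold even right after a collection is the one to resize or relieve.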
8. Indexing Performance
The indexing rate, measured in documents per second, is another important thing to monitor. A newly indexed document only becomes visible to search after a refresh, and a refresh is an expensive operation. The refresh frequency should therefore be controlled with the index.refresh_interval setting instead of forcing a refresh after every indexing operation.
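As a rough illustration, the indexing rate can be derived from two samples of the cumulative index_total counter, and the refresh interval is changed by sending a small JSON body to the index settings endpoint (PUT /<index>/_settings). The helper names and the 30s value below are assumptions for the example, not recommendations.

```python
import json


def indexing_rate(prev_index_total, curr_index_total, interval_s):
    """Documents indexed per second between two samples of index_total."""
    return (curr_index_total - prev_index_total) / interval_s


def refresh_settings_body(interval="30s"):
    """Build the JSON body for an index settings update of refresh_interval."""
    return json.dumps({"index": {"refresh_interval": interval}})


# Illustrative samples taken 60 seconds apart.
print("%.1f docs/s" % indexing_rate(5000, 8000, interval_s=60))
print(refresh_settings_body("30s"))
```

A longer interval trades search freshness for indexing throughput, so pick it based on how quickly new documents really need to be searchable.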
9. Filter-Cache Size
When a query is executed with a filter, Elasticsearch finds the documents matching the filter and builds a structure called a “bitset”, which records, for each document identifier, whether that document matches the filter. Subsequent queries with the same filter reuse the information stored in the bitset, so they execute faster and need fewer I/O operations and CPU cycles.
Individual filters are generally small, but together they can take up a large portion of the heap if you have a lot of data and many filters. To keep this in check, set indices.cache.filter.size to limit the amount of heap used for the filter cache. Watch the filter cache size and filter cache evictions to find an appropriate value for indices.cache.filter.size.
10. Field Data Size
Field data size and evictions matter because they affect search performance when aggregation queries are used. Loading field data is a costly operation, as it requires reading the data from disk into memory.
By default the field data cache size is unbounded, which can blow up the JVM heap. It is therefore wise to set a proper limit with indices.fielddata.cache.size. Watch this metric to find the right size.
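One way to watch this metric is to sample the cache's memory_size_in_bytes and evictions counters periodically and look at how they change. The sketch below assumes two such samples as dictionaries; the function name and the numbers are illustrative.

```python
def cache_pressure(prev, curr, interval_s):
    """Return (evictions per minute, growth in bytes) between two samples."""
    evictions = curr["evictions"] - prev["evictions"]
    growth = curr["memory_size_in_bytes"] - prev["memory_size_in_bytes"]
    return evictions * 60.0 / interval_s, growth


# Illustrative samples taken 120 seconds apart.
earlier = {"memory_size_in_bytes": 500 * 1024 ** 2, "evictions": 10}
later = {"memory_size_in_bytes": 800 * 1024 ** 2, "evictions": 40}

per_minute, growth = cache_pressure(earlier, later, interval_s=120)
print("%.1f evictions/min, cache grew by %d MB" % (per_minute, growth // 1024 ** 2))
```

A steady stream of evictions suggests the configured limit is too small for the workload, while a cache that grows without any evictions suggests no effective limit is in place.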