Today I troubleshot an Elasticsearch cluster outage.
I learned several lessons:
- When many Elasticsearch cluster nodes are restarted at once, it is better to temporarily stop all client connection attempts to avoid a heap spike;
- Avoid setting allow_primary=true when rerouting shards via the API, since forcing a primary allocation can lose the data in that shard;
- Don’t forget to take backups! They could save you some day!
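
To make the second lesson concrete, here is a minimal sketch of a helper that builds a reroute request body and refuses to set allow_primary. It assumes the older cluster reroute API, where the `allocate` command accepts an `allow_primary` field; newer Elasticsearch versions use different command names. The function name `build_reroute_body` and the index and node names are hypothetical, and the sketch only builds the JSON payload rather than sending it to a cluster.

```python
import json


def build_reroute_body(index, shard, node, allow_primary=False):
    """Build the JSON body for a cluster reroute 'allocate' command.

    Assumption: the target cluster uses the older reroute API in which
    the 'allocate' command accepts an 'allow_primary' field.
    """
    if allow_primary:
        # Forcing a primary allocation can discard the shard's data,
        # so this helper refuses to build such a request.
        raise ValueError("allow_primary=true is refused: it risks data loss")
    return {
        "commands": [
            {
                "allocate": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "allow_primary": False,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical index and node names, for illustration only.
    body = build_reroute_body("my-index", 0, "node-1")
    print(json.dumps(body, indent=2))
```

The resulting body would be sent with a POST to the cluster's `_cluster/reroute` endpoint. Keeping the guard in a small helper like this means a forced primary allocation cannot slip through by accident during a stressful incident.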