mooc-notes

Notes from online courses

View on GitHub

Understanding Sharding in Elasticsearch by Brad Quarry, elastic.co

Agenda

  1. Quick - What is ES?
  2. What are shards?
  3. Why do we need shards?
  4. What is in a shard?
  5. Optimal shard size
  6. Optimal shard number
  7. How do you manage shards?

What is Elastic Stack?

Elasticsearch: highly available document store, search and analyse Kibana: visualisation engine Beats and Logstash: to get data into the system

What are shards?

Equal pieces of the overall data in a single index If you start with 100 GB of documents, ES will split them equally into data nodes as shards.

Example

Why do we need shards?

They enable the use of multiple nodes and horizontal scaling to process data and improve performance

Example by contrast

  1. No shards Vertical scaling, one big node
    1. Exponentially more expensive as you get bigger.
      1. Data Node 1: 4 CPU 16G RAM -> 8 CPU 32 G RAM -> 16 CPU 64 G -> …
    2. Eventually you can’t get bigger
  2. Shards Horizontal scaling, many nodes
    1. Cheaper over time as the node type ages
    2. Add up to hundreds of nodes for incremental performance
      1. Data Node 1: 16 CPU 64G RAM
      2. Data Node 2: same

Can I mix and match node sizes in a cluster?

What is in a shard?

A shard is an individual instance of Apache Lucene

What happens when nodes and shards fail?

Using Shard Replicas to avoid data loss and increase query throughput

look for image in Evernote

Shard replicas are never stored in the same node as the primary shard.

Is there an optimal shard size?

Shard size best practices:

Is there an optimal shard count?

How do you manage shards?