Notes on Database Sharding

Note that theses are my personal notes from Obsidian. Some parts may be my own words and some parts may be copied from the sources listed below.

What is Database Sharding?

Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but also entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions. There are many types of sharding strategies.

Key/Hash Based Sharding

Key or hash based sharding is very simple. It involves taking a value from a record and hashing it and using that hash, assign a database shard to put that record in. This is most often done with unique values such as ids.

Advantages

  • Very easy to implement
  • Keys are distributed evenly, assuming your hash function is sound

Downsides

  • Hard to add new/remove shards as you have to remap most if not all keys to the total number of shards.
    • For example if you're hash depends on the last two numbers of an int and that int is 125 and you have 7 shards, so 25 % 7 = 4, would mean you would have to totally remap if you removed a shard so 25 % 6 = 1

Key/Hash Based Sharding With Consistent Hashing

You may of noticed in the Key/Hash Based Sharding section that a lot depends on your hash function being sound. However we can extend that to drastically reduce the amount of keys that need to be remapped. With consistent hashing only `n / m` keys need to be remapped on average where *n* is the number of keys and *m* is the number of slots. It does this by representing the system/server/clients as a virtual ring structure called a hashring seen below.hashring diagramThe hashring is said to have an infinite number of points you can place servers on, really this depends on your the cardinality of the outputs for your hash function but in practice it is true. Servers are often placed after all of the keys are figured out to place servers where there are the most current keys.
When a request comes in it goes to a point in the hashring and goes clockwise on the ring till it finds the closest server. If a server is unresponsive it can hop to the next node to see if it has a copy.hashring request flow diagram

Advantages

  • Very easy to add/remove servers
  • Lessons the burden of serves failing as the keys it(or a new server) needs to take on is less
  • Servers can easily and should store keys from other servers to increase availability
  • Can easily further decrease possible hotspots by putting servers on the hashring n times randomly with a new hash function

Downsides

  • Complex
  • Increase overall server and memory usage

Range Based Sharding

Range based sharding is where you take a range and assign it to a database shard. Typically most useful when you have a fixed range. If there is no range and they key maps one-to-one with the shard it's called **Directory Based Sharding** and performs very similarly

Advantages

  • Very easy to implement
  • If key source is purely random, keys are distributed evenly.
  • Can more easily add/remove database shards

Downsides

  • Keys, even if they appear or are said to be random are very often not random. This will cause some shards to become hotspots and have many more reads and writes than others.

Use Cases

  • When a dataset won't fit on a single server or write's overtake the capacity of your server

Quick Facts and Notes

  • You should only shard if the dataset cannot fit on one machine or your write's are bottlenecking you, else you should handle load through read-replicas as it reduces complexity, and increases performance and durability
  • It also also very hard to go back once you shard your database

Sources and Learning Material

  • https://www.digitalocean.com/community/tutorials/understanding-database-sharding
  • https://www.geeksforgeeks.org/consistent-hashing/
  • https://www.baeldung.com/
  • https://ably.com/blog/implementing-efficient-consistent-hashing