Skip to main content

Command Palette

Search for a command to run...

Hash-Based Sharding: Uniformity with Limitations

Updated
3 min read
R

👋 Hey, I'm Rahul — a full-stack developer who loves turning ideas into clean, functional products. I write about JavaScript, Node.js, React, and real-world dev lessons. Expect dev logs, bugs I broke (and fixed), and things I'm learning along the way. 🛠 Currently building, shipping, and writing one commit at a time.

A Developer’s Guide to Distributed Database Design

📌 Overview

Hash-based sharding is one of the most popular strategies used to evenly distribute data across multiple database nodes. It’s simple, effective — and widely adopted by systems like Twitter, Facebook, and Reddit during their early scaling phases.

But what makes it so powerful — and where does it fall short?

Let’s dive deep.

🧠 What Is Hash-Based Sharding?

At its core, hash-based sharding works like this:

  1. Choose a sharding key (e.g., user_id)

  2. Apply a hash function to the key (e.g., hash(user_id))

  3. Use the result to determine the target shard using something like:

     const shardIndex = hash(user_id) % totalShards
    

Your data is now assigned to a specific shard, and evenly distributed — assuming a good hash function and uniform key distribution.

💡 Real-Life Analogy

Think of assigning students to dorms by using the hash of their student ID:

  • Hash the ID, then mod by the number of dorms.

  • Each student goes to one dorm — seemingly random, but balanced.

That’s the goal of hash-based sharding.

📋 Step-by-Step Example

Let’s say we have 4 shards and we want to store user data by user_id.

const user_id = 12468;
const totalShards = 4;

const hash = require('crypto').createHash('md5');
const hashedValue = parseInt(hash.update(user_id.toString()).digest('hex').substring(0, 8), 16);

const shardIndex = hashedValue % totalShards;

console.log("Store in Shard", shardIndex);

This ensures deterministic and uniform distribution.

🧪 Benefits

✅ 1. Uniform Distribution

A well-designed hash function reduces data skew and keeps shards balanced.

✅ 2. No Hotspots

Unlike range-based sharding (where large ranges may concentrate data), hash-based sharding spreads keys unpredictably — avoiding hotspots.

✅ 3. Easy Lookups

For point queries (SELECT * FROM users WHERE id = 123), the shard can be found instantly using the hash.

🚫 Limitations

❌ 1. Hard to Scale Horizontally

Let’s say you go from 4 to 5 shards. That completely changes hash(key) % totalShards.
All your keys remap to different shardsmassive data movement.

Solution: Consistent Hashing (will be posting soon about this)


❌ 2. No Range Queries

Want to get users with IDs between 1000 and 2000?

You can’t predict which shards hold those users, because the hash function randomizes the distribution.


❌ 3. Rebalancing Is Painful

You can’t simply “add a shard.” You’ll need to rehash all existing keys and redistribute — expensive for large datasets.

🔁 When to Use Hash-Based Sharding

✅ When your queries are mostly point lookups
✅ When your traffic is evenly distributed
✅ When you’re okay with fixed infrastructure size

🚫 Avoid it if:

  • You expect frequent scaling

  • You rely on range queries or time-based aggregations

🔧 Real-World Case: Twitter (Early Architecture)

Twitter initially used hash-based sharding on user IDs. But as their traffic and user base exploded, adding new shards became painful.
Eventually, they switched to consistent hashing with virtual nodes to ease the rebalancing problem.


📊 Summary Table

FeatureHash-Based Sharding
Load Distribution✅ Very uniform
Range Queries Support❌ No
Rebalancing Simplicity❌ Difficult
Scaling Flexibility❌ Requires rehashing
Implementation Effort✅ Easy to start with

📘 Coming Up Next

📌 Range-Based Sharding
We’ll explore how to shard based on ordered key ranges — great for time-series and reporting systems, but with some pitfalls.


✍️ Final Thoughts

Hash-based sharding is perfect for getting started with distributed databases, especially for apps where you want:

  • Fast user lookups

  • Balanced performance

  • Predictable writes

But as your system grows and your needs evolve, you may hit its limits. That’s when strategies like consistent hashing or dynamic sharding become essential.


Got questions or want to share how you’ve used hash-based sharding in production?
Drop them in the comments.

Click here for 👉 Range Based Sharding


Scaling Databases with Sharding

Part 5 of 5

Learn the principles, algorithms, trade-offs, and real-world practices behind Hash-Based Sharding, Range-Based Sharding, Consistent Hashing, and more. Each post is developer-friendly and packed with diagrams, examples, and performance insights.

Start from the beginning

Why We Need Sharding: Scaling Beyond Limits

“How do you split an ocean and still find the right drop in milliseconds?”That’s what modern databases are trying to solve. The Scalability Dilemma Your app is booming. Yesterday it had 10,000 users. Today? 100,000. And suddenly: ⚠️ Queries are slo...