Skip to content

Merkle Trees (Hash Trees)#

This post is an AI-generated summary of the Medium article Advanced Algorithms Every Senior Developer Must Know: Part 1 β€” Merkle Trees. It is intended as a concise reference to the key ideas. The original article includes more detail, explanations, and examples, and is worth reading for full context.

What Is a Merkle Tree?#

A Merkle tree (also called a hash tree) is a binary tree structure invented by Ralph Merkle in 1979. It recursively hashes and combines data blocks to produce a single root hash, which serves as a cryptographic fingerprint of the entire dataset.

Legend: πŸ“¦=Data Blockγ€€πŸŸ’=Leaf Hashγ€€πŸ”·=Branch Hashγ€€πŸŒ=Root Hash

                        🌐 Root Hash
                             β–²
                             β”‚
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                   β”‚
              πŸ”· Branch_AB        πŸ”· Branch_CD
                   β–²                   β–²
              β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
              β”‚         β”‚         β”‚         β”‚
         🟒 Hash_A   🟒 Hash_B  🟒 Hash_C  🟒 Hash_D
              β–²         β–²         β–²         β–²
              β”‚         β”‚         β”‚         β”‚
        πŸ“¦ Data_A   πŸ“¦ Data_B  πŸ“¦ Data_C  πŸ“¦ Data_D

Construction Process (4-Block Example)#

Step 1: Hash each data block#

Legend: πŸ“¦=Data Blockγ€€πŸŸ’=Leaf Hash (newly created)

   🟒 Hash_A    🟒 Hash_B    🟒 Hash_C    🟒 Hash_D
       β–²            β–²            β–²            β–²
       β”‚            β”‚            β”‚            β”‚
    SHA256       SHA256       SHA256       SHA256
       β–²            β–²            β–²            β–²
       β”‚            β”‚            β”‚            β”‚
   πŸ“¦ Data_A    πŸ“¦ Data_B    πŸ“¦ Data_C    πŸ“¦ Data_D

Step 2: Pair leaves and combine into branches#

Legend: 🟒=Leaf Hashγ€€πŸ”·=Branch Hash (newly created)

πŸ’‘ Order matters. SHA256(Hash_A β€– Hash_B) is not the same as SHA256(Hash_B β€– Hash_A), so changing the order of data blocks changes the Merkle root.

              πŸ”· Branch_AB              πŸ”· Branch_CD
                   β–²                         β–²
              β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”               β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
              β”‚         β”‚               β”‚         β”‚
         🟒 Hash_A   🟒 Hash_B     🟒 Hash_C   🟒 Hash_D

  Rule:  Branch_AB = SHA256( Hash_A β€– Hash_B )
         Branch_CD = SHA256( Hash_C β€– Hash_D )

Step 3: Combine branches up to the root#

Legend: 🟒=Leafγ€€πŸ”·=Branchγ€€πŸŒ=Root (newly created)

                        🌐 Root Hash
                             β–²
                             β”‚
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                   β”‚
              πŸ”· Branch_AB        πŸ”· Branch_CD
                   β–²                   β–²
              β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
              β”‚         β”‚         β”‚         β”‚
         🟒 Hash_A   🟒 Hash_B  🟒 Hash_C  🟒 Hash_D

  Root Hash = SHA256( Branch_AB β€– Branch_CD )

Merkle Proof β€” The Killer Feature#

Question: How can we prove Data_C exists in the tree without downloading every block?

Legend: βœ…=Trusted Rootγ€€πŸ”‘=Required Proofγ€€πŸ”΅=Computed Locallyγ€€πŸŽ―=Target

                     βœ… Root Hash
                          β–²
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚                   β”‚
          πŸ”‘ Branch_AB         πŸ”΅ Branch_CD   ← compute locally
                                    β–²
                              β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
                              β”‚           β”‚
                         πŸ”΅ Hash_C    πŸ”‘ Hash_D
                              β–²
                         🎯 Data_C

What the verifier needs:

  • βœ… Trusted Root Hash (already known from a reliable source)
  • 🎯 Target Data Block (Data_C, the one being verified)
  • πŸ”‘ Sibling Hashes along the path (only logβ‚‚(n) of them: Hash_D and Branch_AB)

Verification Walkthrough#

The prover has the full tree. For target Data_C, it builds a Merkle proof by collecting only the sibling hashes along C's path to the root, plus their left/right positions:

Proof for Data_C:
1. right sibling = Hash_D
2. left sibling  = Branch_AB

These are the only hashes the verifier needs for this example:

  • Hash_C pairs with Hash_D to rebuild Branch_CD
  • Branch_CD pairs with Branch_AB to rebuild the root

If the prover sent some unrelated hash such as Hash_E, the recomputed root would not match the trusted root hash.

Legend: 🎯=Targetγ€€πŸ”΅=Local Computationγ€€πŸ”‘=Provided Proofγ€€βœ…=Verified

   🎯 Data_C
      β”‚
      β”‚  Step 1: hash the target yourself
      β–Ό
   πŸ”΅ Hash_C_local              ←── computed locally
      β”‚
      β”‚  Step 2: combine with πŸ”‘ Hash_D (right sibling)
      β–Ό
   πŸ”΅ Branch_CD_local           ←── computed locally
      β”‚
      β”‚  Step 3: combine with πŸ”‘ Branch_AB (left sibling)
      β–Ό
   πŸ”΅ Root_Calculated
      β”‚
      β”‚  Step 4: compare with βœ… Trusted Root Hash
      β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚  match    β†’  βœ… DATA IS VALID  β”‚
  β”‚  no match β†’  ❌ DATA TAMPERED  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Equivalent formulas:

Hash_C_local      = SHA256(Data_C)
Branch_CD_local   = SHA256(Hash_C_local || Hash_D)
Root_Calculated   = SHA256(Branch_AB || Branch_CD_local)

πŸ’‘ 1,000 data blocks need only ~10 πŸ”‘ sibling hashes (β‰ˆ320 bytes) to verify β€” not the full 1,000-block dataset!

Handling an Odd Number of Blocks#

When the leaf count is odd, duplicate the last block to make it even, then continue building.

This is one common convention for binary Merkle trees, including Bitcoin-style examples. Under this convention, if you append a new block:

  • if the previous leaf count was even, the new last block is temporarily duplicated
  • if the previous leaf count was odd, the previously duplicated last block is replaced by the new real block

Example:

3 blocks:  A  B  C      -> duplicate C  -> A  B  C  C
4 blocks:  A  B  C  D   -> already even -> A  B  C  D
5 blocks:  A  B  C  D  E -> duplicate E -> A  B  C  D  E  E

Legend: πŸ“¦=Dataγ€€πŸŸ’=Leaf Hashγ€€πŸ”·=Branchγ€€πŸŒ=Root ♻️=Duplicated Node

  Before (3 blocks, odd) :  πŸ“¦ Data_A   πŸ“¦ Data_B   πŸ“¦ Data_C
                                                          β”‚
                                                     ♻️ duplicate
                                                          β–Ό
  After  (4 blocks, even):  πŸ“¦ Data_A   πŸ“¦ Data_B   πŸ“¦ Data_C   ♻️ Data_C'

                          🌐 Root Hash
                               β–²
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                     β”‚                   β”‚
                πŸ”· Branch_AB        πŸ”· Branch_CC'
                     β–²                   β–²
                β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
                β”‚         β”‚         β”‚         β”‚
           🟒 Hash_A   🟒 Hash_B  🟒 Hash_C  ♻️ Hash_C

Build Algorithm Flow#

flowchart LR
    A["Input: N data blocks"] --> B["Hash blocks to create leaf hashes"]
    B --> C{"One hash left?"}
    C -- Yes --> D["Return root hash"]
    C -- No --> NEXT

    subgraph NEXT["Build next level"]
        direction TD
        E{"Odd number of nodes?"}
        E -- Yes --> F["Duplicate last node"]
        E -- No --> G["Hash pairs to build parent level"]
        F --> G
        G --> H["Use parent level as current level"]
    end

    NEXT --> C

    classDef input fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1px;
    classDef decision fill:#f3e8ff,stroke:#7e22ce,color:#581c87,stroke-width:1px;
    classDef special fill:#fee2e2,stroke:#dc2626,color:#7f1d1d,stroke-width:1px;
    classDef normal fill:#dbeafe,stroke:#2563eb,color:#1e3a8a,stroke-width:1px;
    classDef output fill:#ffedd5,stroke:#ea580c,color:#9a3412,stroke-width:1px;

    class A input;
    class C,E decision;
    class F special;
    class B,G,H normal;
    class D output;

Performance Analysis#

Legend:

  • πŸš€ = Sub-linear (scales WAY better than data size)
  • ⚑ = Linear (scales proportionally to data size β€” unavoidable lower bound)
  • 🐒 = Super-linear (scales WORSE than data size β€” would be a problem)
   Operation              Complexity      Scaling     For n = 1,000,000
   ─────────────────────────────────────────────────────────────────────
   πŸ”¨ Build the tree       O(n)            ⚑        ~1,000,000 hashes
   πŸ” Generate a proof     O(log n)        πŸš€        ~20 hashes
   βœ… Verify a proof       O(log n)        πŸš€        ~20 hashes
   πŸ’Ύ Storage footprint    O(n)            ⚑        ~2,000,000 nodes
   ─────────────────────────────────────────────────────────────────────
   🐒 (No super-linear operations β€” that's the whole point of Merkle Trees!)

πŸ”₯ Real-World Comparison (2,000 Data Blocks)#

   ❌ Traditional approach:
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚  ~2 MB
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        Download ALL 2,000 blocks

   βœ… Merkle proof:
   β”Œβ”€β”
   β”‚β–ˆβ”‚  ~352 bytes
   β””β”€β”˜
        Download only ~11 sibling hashes

   πŸ“‰ Bandwidth savings: 99.98%

Real-World Applications#

Merkle trees aren't just theory β€” they power some of the most-used systems on the planet. Each application uses the tree slightly differently. Let's look at four of the most important ones.

πŸ”— Bitcoin (Blockchain) β€” SPV Wallets#

Problem: A mobile Bitcoin wallet on your phone can't store the full ~600 GB blockchain.

Solution: Simplified Payment Verification (SPV) β€” verify that your transaction is in a block using only its Merkle proof.

Legend:

  • βœ…=Block Header (small, always synced)γ€€

  • πŸ”‘=Sibling Proofγ€€

  • 🎯=Your Transactionγ€€

  • πŸ“¦=Other Transactions (NOT downloaded)

   Block Header (~80 bytes, always synced by SPV wallet):
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  Previous Block Hash                    β”‚
   β”‚  Timestamp β”‚ Difficulty β”‚ Nonce         β”‚
   β”‚  βœ… Merkle Root  ◀── trusted anchor     β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό (your wallet asks: "is Tx_C in this block?")

                     βœ… Merkle Root
                          β–²
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚                   β”‚
           πŸ”‘ Branch_AB        Branch_CD
                                    β–²
                              β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”
                              β”‚           β”‚
                           Hash_C       πŸ”‘ Hash_D
                              β–²
                         🎯 Your Tx_C

   πŸ“¦ Tx_A  πŸ“¦ Tx_B  🎯 Tx_C  πŸ“¦ Tx_D  ... πŸ“¦ Tx_2000
   β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘
   β–² NOT downloaded β€” full node only sends πŸ”‘ sibling hashes

   Total downloaded: 1 block header + 11 sibling hashes β‰ˆ 432 bytes
   Instead of:       full 1 MB block (~99.96% bandwidth saved)

πŸ“š Git (Version Control) β€” Commit Integrity#

Problem: How does Git detect if even a single byte of any file in your repo has been tampered with β€” across millions of commits?

Solution: Every Git commit, tree, and blob is content-addressed by its SHA-1 hash. A commit is essentially a Merkle Root over your entire project state.

Legend: 🟠=Commitγ€€πŸŒ³=Tree (directory)γ€€πŸ“„=Blob (file content)γ€€πŸ”—=hash pointer

                      🟠 Commit "abc123..."
                       β”‚
                       β”‚  (points to root tree hash)
                       β–Ό
                     🌳 Tree (root directory)
                       β–²
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚        β”‚        β”‚
              β–Ό        β–Ό        β–Ό
           🌳 src/  🌳 tests/  πŸ“„ README.md
              β–²        β–²        (hash: 5d4e1c...)
        β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”   β”‚
        β”‚          β”‚   β–Ό
        β–Ό          β–Ό   πŸ“„ test_main.py
   πŸ“„ main.py  πŸ“„ utils.py    (hash: 7448d8...)
   (hash:      (hash:
    a665a4...)  d1d4e1...)

   ──────────────────────────────────────────────
   πŸ’‘ Change ONE byte in main.py
      β†’ main.py's hash changes
      β†’ src/'s tree hash changes
      β†’ root tree hash changes
      β†’ Commit hash changes
   ──────────────────────────────────────────────
   Result: tampering ANY file is instantly detectable.

πŸ—„ Apache Cassandra (Distributed DB) β€” Anti-Entropy Repair#

Problem: Two database replicas in different data centers might drift out of sync. Comparing all records row-by-row would take hours.

Solution: Each replica builds a Merkle tree over its data. Compare root hashes first β€” if they match, done. If not, drill down recursively to find only the differing branches.

Legend: βœ…=matchγ€€βŒ=mismatchγ€€πŸŸ’=in syncγ€€πŸ”΄=needs repair

   Replica A (Data Center 1)        Replica B (Data Center 2)
   ─────────────────────────        ─────────────────────────

       Root Hash                          Root Hash
        ❌ aaa111            ←≠→            ❌ bbb222
            β”‚                                  β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
     β”‚             β”‚                    β”‚             β”‚
   βœ… x1y2       ❌ p3q4               βœ… x1y2       ❌ z5z5
   (match,        β”‚                    (match,         β”‚
    skip!)        β”‚                     skip!)         β”‚
              β”Œβ”€β”€β”€β”΄β”€β”€β”€β”                            β”Œβ”€β”€β”€β”΄β”€β”€β”€β”
              β”‚       β”‚                            β”‚       β”‚
            βœ… aa    ❌ bb                       βœ… aa   πŸ”΄ cc
                      β”‚                                    β”‚
                  πŸ”΄ Diff!                              πŸ”΄ Diff!
                  (only sync                            (only sync
                  this subtree)                         this subtree)

   ──────────────────────────────────────────────────────────
   πŸ’‘ 1 million records, only 2 differ?
      β†’ 20 hash comparisons (logβ‚‚ 1M) instead of 1M comparisons
      β†’ ~50,000Γ— speedup ⚑
   ──────────────────────────────────────────────────────────

🌐 IPFS / BitTorrent (P2P Networks) β€” Chunk Verification#

Problem: You're downloading a 4 GB file from random untrusted peers. How do you detect if any peer sent corrupted or malicious chunks?

Solution: The file's "address" is its Merkle Root. Every downloaded chunk can be independently verified against this root.

Legend: 🌐=Content ID (Root)γ€€πŸ”·=Branchγ€€πŸ“¦=File Chunkγ€€πŸ€=trusted peerγ€€βš οΈ=malicious peer

   File "movie.mp4" (4 GB) split into 4 chunks:

                    🌐 Content ID = SHA256(...) ← "address" of the file
                          β–²
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚                   β”‚
           πŸ”· Branch_AB        πŸ”· Branch_CD
                β–²                   β–²
           β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
           β”‚         β”‚         β”‚         β”‚
        Hash_A    Hash_B    Hash_C    Hash_D
           β–²         β–²         β–²         β–²
           β”‚         β”‚         β”‚         β”‚
       πŸ“¦ Chunk_A πŸ“¦ Chunk_B πŸ“¦ Chunk_C πŸ“¦ Chunk_D
       (1 GB)    (1 GB)    (1 GB)    (1 GB)

   ──────────────── Download phase ────────────────

   🀝 Peer 1 sends Chunk_A  β†’ SHA256(Chunk_A) == Hash_A  βœ… keep
   🀝 Peer 2 sends Chunk_B  β†’ SHA256(Chunk_B) == Hash_B  βœ… keep
   ⚠️ Peer 3 sends Chunk_C' β†’ SHA256(Chunk_C') β‰  Hash_C  ❌ reject!
   🀝 Peer 4 sends Chunk_D  β†’ SHA256(Chunk_D) == Hash_D  βœ… keep

   ──────────────────────────────────────────────────
   πŸ’‘ Each chunk verified independently against the
      Content ID β€” no peer can sneak in malicious data.
   ──────────────────────────────────────────────────

πŸ“Š Application Comparison Summary#

Use CaseWhat's a "Block"?What's the Root?Why Merkle Tree?
πŸ”— Bitcoin SPVOne transactionMerkle Root in block headerVerify Tx without 600 GB chain
πŸ“š GitOne file (blob)Commit hashDetect any tampering across commits
πŸ—„ CassandraOne DB recordPer-replica rootFind diffs without scanning all rows
🌐 IPFS / TorrentOne file chunkContent ID (CID)Verify chunks from untrusted peers

Summary#

🌳 Merkle Tree = πŸ›‘οΈ Data Integrity + πŸš€ Efficient Proof + πŸ”’ Tamper Detection

Compress an entire dataset into a single 🌐 Root Hash, then prove the existence and integrity of any individual record at the cost of just O(log n) sibling hashes.

That's exactly why πŸ”— Bitcoin Β· πŸ“š Git Β· 🌐 IPFS Β· πŸ—„ Cassandra Β· πŸ”— Ethereum all rely on it.

πŸ“– Glossary#

TermMeaning
Data BlockA single unit of raw input data β€” could be a file, a record, a transaction, a chunk of bytes, etc.
Leaf HashThe hash of one data block β€” sits at the bottom of the tree
Branch Hash (a.k.a. Inner / Internal Hash)The hash of two child hashes concatenated together
Root Hash (a.k.a. Merkle Root)The single top-level hash representing the entire dataset
Sibling HashThe "other child" of a parent node β€” needed during proof verification
Merkle ProofThe minimal set of sibling hashes required to verify one data block
SHA256A standard cryptographic hash function producing a 256-bit (32-byte) output
β€– (concatenation)The operator that joins two hashes byte-wise before hashing them together

Tags: πŸ”’ Cryptographically Secure Β· πŸš€ O(log n) Verification Β· 🌐 Distributed Systems Β· πŸ›‘οΈ Tamper-Proof Β· πŸ”— Blockchain Β· πŸ“š Git Β· 🌐 IPFS Β· πŸ—„ Cassandra

Comments