Merkle Trees (Hash Trees)#
This post is an AI-generated summary of the Medium article Advanced Algorithms Every Senior Developer Must Know: Part 1 β Merkle Trees. It is intended as a concise reference to the key ideas. The original article includes more detail, explanations, and examples, and is worth reading for full context.
What Is a Merkle Tree?#
A Merkle tree (also called a hash tree) is a binary tree structure invented by Ralph Merkle in 1979. It recursively hashes and combines data blocks to produce a single root hash, which serves as a cryptographic fingerprint of the entire dataset.
Legend: π¦=Data Blockγπ’=Leaf Hashγπ·=Branch Hashγπ=Root Hash
π Root Hash
β²
β
βββββββββββ΄ββββββββββ
β β
π· Branch_AB π· Branch_CD
β² β²
ββββββ΄βββββ ββββββ΄βββββ
β β β β
π’ Hash_A π’ Hash_B π’ Hash_C π’ Hash_D
β² β² β² β²
β β β β
π¦ Data_A π¦ Data_B π¦ Data_C π¦ Data_D
Construction Process (4-Block Example)#
Step 1: Hash each data block#
Legend: π¦=Data Blockγπ’=Leaf Hash (newly created)
π’ Hash_A π’ Hash_B π’ Hash_C π’ Hash_D
β² β² β² β²
β β β β
SHA256 SHA256 SHA256 SHA256
β² β² β² β²
β β β β
π¦ Data_A π¦ Data_B π¦ Data_C π¦ Data_D
Step 2: Pair leaves and combine into branches#
Legend: π’=Leaf Hashγπ·=Branch Hash (newly created)
π‘ Order matters.
SHA256(Hash_A β Hash_B)is not the same asSHA256(Hash_B β Hash_A), so changing the order of data blocks changes the Merkle root.
π· Branch_AB π· Branch_CD
β² β²
ββββββ΄βββββ ββββββ΄βββββ
β β β β
π’ Hash_A π’ Hash_B π’ Hash_C π’ Hash_D
Rule: Branch_AB = SHA256( Hash_A β Hash_B )
Branch_CD = SHA256( Hash_C β Hash_D )
Step 3: Combine branches up to the root#
Legend: π’=Leafγπ·=Branchγπ=Root (newly created)
π Root Hash
β²
β
βββββββββββ΄ββββββββββ
β β
π· Branch_AB π· Branch_CD
β² β²
ββββββ΄βββββ ββββββ΄βββββ
β β β β
π’ Hash_A π’ Hash_B π’ Hash_C π’ Hash_D
Root Hash = SHA256( Branch_AB β Branch_CD )
Merkle Proof β The Killer Feature#
Question: How can we prove Data_C exists in the tree without downloading every block?
Legend: β =Trusted Rootγπ=Required Proofγπ΅=Computed Locallyγπ―=Target
β
Root Hash
β²
βββββββββββ΄ββββββββββ
β β
π Branch_AB π΅ Branch_CD β compute locally
β²
βββββββ΄ββββββ
β β
π΅ Hash_C π Hash_D
β²
π― Data_C
What the verifier needs:
- β Trusted Root Hash (already known from a reliable source)
- π― Target Data Block (
Data_C, the one being verified) - π Sibling Hashes along the path (only
logβ(n)of them:Hash_DandBranch_AB)
Verification Walkthrough#
The prover has the full tree. For target Data_C, it builds a Merkle proof by collecting only the sibling hashes along C's path to the root, plus their left/right positions:
These are the only hashes the verifier needs for this example:
Hash_Cpairs withHash_Dto rebuildBranch_CDBranch_CDpairs withBranch_ABto rebuild the root
If the prover sent some unrelated hash such as Hash_E, the recomputed root would not match the trusted root hash.
Legend: π―=Targetγπ΅=Local Computationγπ=Provided Proofγβ =Verified
π― Data_C
β
β Step 1: hash the target yourself
βΌ
π΅ Hash_C_local βββ computed locally
β
β Step 2: combine with π Hash_D (right sibling)
βΌ
π΅ Branch_CD_local βββ computed locally
β
β Step 3: combine with π Branch_AB (left sibling)
βΌ
π΅ Root_Calculated
β
β Step 4: compare with β
Trusted Root Hash
βΌ
ββββββββββββββββββββββββββββββββββ
β match β β
DATA IS VALID β
β no match β β DATA TAMPERED β
ββββββββββββββββββββββββββββββββββ
Equivalent formulas:
Hash_C_local = SHA256(Data_C)
Branch_CD_local = SHA256(Hash_C_local || Hash_D)
Root_Calculated = SHA256(Branch_AB || Branch_CD_local)
π‘ 1,000 data blocks need only ~10 π sibling hashes (β320 bytes) to verify β not the full 1,000-block dataset!
Handling an Odd Number of Blocks#
When the leaf count is odd, duplicate the last block to make it even, then continue building.
This is one common convention for binary Merkle trees, including Bitcoin-style examples. Under this convention, if you append a new block:
- if the previous leaf count was even, the new last block is temporarily duplicated
- if the previous leaf count was odd, the previously duplicated last block is replaced by the new real block
Example:
3 blocks: A B C -> duplicate C -> A B C C
4 blocks: A B C D -> already even -> A B C D
5 blocks: A B C D E -> duplicate E -> A B C D E E
Legend: π¦=Dataγπ’=Leaf Hashγπ·=Branchγπ=Rootγβ»οΈ=Duplicated Node
Before (3 blocks, odd) : π¦ Data_A π¦ Data_B π¦ Data_C
β
β»οΈ duplicate
βΌ
After (4 blocks, even): π¦ Data_A π¦ Data_B π¦ Data_C β»οΈ Data_C'
π Root Hash
β²
βββββββββββ΄ββββββββββ
β β
π· Branch_AB π· Branch_CC'
β² β²
ββββββ΄βββββ ββββββ΄βββββ
β β β β
π’ Hash_A π’ Hash_B π’ Hash_C β»οΈ Hash_C
Build Algorithm Flow#
flowchart LR
A["Input: N data blocks"] --> B["Hash blocks to create leaf hashes"]
B --> C{"One hash left?"}
C -- Yes --> D["Return root hash"]
C -- No --> NEXT
subgraph NEXT["Build next level"]
direction TD
E{"Odd number of nodes?"}
E -- Yes --> F["Duplicate last node"]
E -- No --> G["Hash pairs to build parent level"]
F --> G
G --> H["Use parent level as current level"]
end
NEXT --> C
classDef input fill:#dcfce7,stroke:#15803d,color:#14532d,stroke-width:1px;
classDef decision fill:#f3e8ff,stroke:#7e22ce,color:#581c87,stroke-width:1px;
classDef special fill:#fee2e2,stroke:#dc2626,color:#7f1d1d,stroke-width:1px;
classDef normal fill:#dbeafe,stroke:#2563eb,color:#1e3a8a,stroke-width:1px;
classDef output fill:#ffedd5,stroke:#ea580c,color:#9a3412,stroke-width:1px;
class A input;
class C,E decision;
class F special;
class B,G,H normal;
class D output;Performance Analysis#
Legend:
- π = Sub-linear (scales WAY better than data size)
- β‘ = Linear (scales proportionally to data size β unavoidable lower bound)
- π’ = Super-linear (scales WORSE than data size β would be a problem)
Operation Complexity Scaling For n = 1,000,000
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π¨ Build the tree O(n) β‘ ~1,000,000 hashes
π Generate a proof O(log n) π ~20 hashes
β
Verify a proof O(log n) π ~20 hashes
πΎ Storage footprint O(n) β‘ ~2,000,000 nodes
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π’ (No super-linear operations β that's the whole point of Merkle Trees!)
π₯ Real-World Comparison (2,000 Data Blocks)#
β Traditional approach:
ββββββββββββββββββββββββββββββββββββββββ
β ββββββββββββββββββββββββββββββββββββ β ~2 MB
ββββββββββββββββββββββββββββββββββββββββ
Download ALL 2,000 blocks
β
Merkle proof:
βββ
βββ ~352 bytes
βββ
Download only ~11 sibling hashes
π Bandwidth savings: 99.98%
Real-World Applications#
Merkle trees aren't just theory β they power some of the most-used systems on the planet. Each application uses the tree slightly differently. Let's look at four of the most important ones.
π Bitcoin (Blockchain) β SPV Wallets#
Problem: A mobile Bitcoin wallet on your phone can't store the full ~600 GB blockchain.
Solution: Simplified Payment Verification (SPV) β verify that your transaction is in a block using only its Merkle proof.
Legend:
β =Block Header (small, always synced)γ
π=Sibling Proofγ
π―=Your Transactionγ
π¦=Other Transactions (NOT downloaded)
Block Header (~80 bytes, always synced by SPV wallet):
βββββββββββββββββββββββββββββββββββββββββββ
β Previous Block Hash β
β Timestamp β Difficulty β Nonce β
β β
Merkle Root βββ trusted anchor β
βββββββββββββββββββββββββββββββββββββββββββ
β
βΌ (your wallet asks: "is Tx_C in this block?")
β
Merkle Root
β²
βββββββββββ΄ββββββββββ
β β
π Branch_AB Branch_CD
β²
βββββββ΄ββββββ
β β
Hash_C π Hash_D
β²
π― Your Tx_C
π¦ Tx_A π¦ Tx_B π― Tx_C π¦ Tx_D ... π¦ Tx_2000
βββββββββββββββββββββββββββββββββββββββββββββββββ
β² NOT downloaded β full node only sends π sibling hashes
Total downloaded: 1 block header + 11 sibling hashes β 432 bytes
Instead of: full 1 MB block (~99.96% bandwidth saved)
π Git (Version Control) β Commit Integrity#
Problem: How does Git detect if even a single byte of any file in your repo has been tampered with β across millions of commits?
Solution: Every Git commit, tree, and blob is content-addressed by its SHA-1 hash. A commit is essentially a Merkle Root over your entire project state.
Legend: π =Commitγπ³=Tree (directory)γπ=Blob (file content)γπ=hash pointer
π Commit "abc123..."
β
β (points to root tree hash)
βΌ
π³ Tree (root directory)
β²
ββββββββββΌβββββββββ
β β β
βΌ βΌ βΌ
π³ src/ π³ tests/ π README.md
β² β² (hash: 5d4e1c...)
βββββββ΄βββββ β
β β βΌ
βΌ βΌ π test_main.py
π main.py π utils.py (hash: 7448d8...)
(hash: (hash:
a665a4...) d1d4e1...)
ββββββββββββββββββββββββββββββββββββββββββββββ
π‘ Change ONE byte in main.py
β main.py's hash changes
β src/'s tree hash changes
β root tree hash changes
β Commit hash changes
ββββββββββββββββββββββββββββββββββββββββββββββ
Result: tampering ANY file is instantly detectable.
π Apache Cassandra (Distributed DB) β Anti-Entropy Repair#
Problem: Two database replicas in different data centers might drift out of sync. Comparing all records row-by-row would take hours.
Solution: Each replica builds a Merkle tree over its data. Compare root hashes first β if they match, done. If not, drill down recursively to find only the differing branches.
Legend: β =matchγβ=mismatchγπ’=in syncγπ΄=needs repair
Replica A (Data Center 1) Replica B (Data Center 2)
βββββββββββββββββββββββββ βββββββββββββββββββββββββ
Root Hash Root Hash
β aaa111 ββ β β bbb222
β β
ββββββββ΄βββββββ ββββββββ΄βββββββ
β β β β
β
x1y2 β p3q4 β
x1y2 β z5z5
(match, β (match, β
skip!) β skip!) β
βββββ΄ββββ βββββ΄ββββ
β β β β
β
aa β bb β
aa π΄ cc
β β
π΄ Diff! π΄ Diff!
(only sync (only sync
this subtree) this subtree)
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π‘ 1 million records, only 2 differ?
β 20 hash comparisons (logβ 1M) instead of 1M comparisons
β ~50,000Γ speedup β‘
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π IPFS / BitTorrent (P2P Networks) β Chunk Verification#
Problem: You're downloading a 4 GB file from random untrusted peers. How do you detect if any peer sent corrupted or malicious chunks?
Solution: The file's "address" is its Merkle Root. Every downloaded chunk can be independently verified against this root.
Legend: π=Content ID (Root)γπ·=Branchγπ¦=File Chunkγπ€=trusted peerγβ οΈ=malicious peer
File "movie.mp4" (4 GB) split into 4 chunks:
π Content ID = SHA256(...) β "address" of the file
β²
βββββββββββ΄ββββββββββ
β β
π· Branch_AB π· Branch_CD
β² β²
ββββββ΄βββββ ββββββ΄βββββ
β β β β
Hash_A Hash_B Hash_C Hash_D
β² β² β² β²
β β β β
π¦ Chunk_A π¦ Chunk_B π¦ Chunk_C π¦ Chunk_D
(1 GB) (1 GB) (1 GB) (1 GB)
ββββββββββββββββ Download phase ββββββββββββββββ
π€ Peer 1 sends Chunk_A β SHA256(Chunk_A) == Hash_A β
keep
π€ Peer 2 sends Chunk_B β SHA256(Chunk_B) == Hash_B β
keep
β οΈ Peer 3 sends Chunk_C' β SHA256(Chunk_C') β Hash_C β reject!
π€ Peer 4 sends Chunk_D β SHA256(Chunk_D) == Hash_D β
keep
ββββββββββββββββββββββββββββββββββββββββββββββββββ
π‘ Each chunk verified independently against the
Content ID β no peer can sneak in malicious data.
ββββββββββββββββββββββββββββββββββββββββββββββββββ
π Application Comparison Summary#
| Use Case | What's a "Block"? | What's the Root? | Why Merkle Tree? |
|---|---|---|---|
| π Bitcoin SPV | One transaction | Merkle Root in block header | Verify Tx without 600 GB chain |
| π Git | One file (blob) | Commit hash | Detect any tampering across commits |
| π Cassandra | One DB record | Per-replica root | Find diffs without scanning all rows |
| π IPFS / Torrent | One file chunk | Content ID (CID) | Verify chunks from untrusted peers |
Summary#
π³ Merkle Tree = π‘οΈ Data Integrity + π Efficient Proof + π Tamper Detection
Compress an entire dataset into a single π Root Hash, then prove the existence and integrity of any individual record at the cost of just O(log n) sibling hashes.
That's exactly why π Bitcoin Β· π Git Β· π IPFS Β· π Cassandra Β· π Ethereum all rely on it.
π Glossary#
| Term | Meaning |
|---|---|
| Data Block | A single unit of raw input data β could be a file, a record, a transaction, a chunk of bytes, etc. |
| Leaf Hash | The hash of one data block β sits at the bottom of the tree |
| Branch Hash (a.k.a. Inner / Internal Hash) | The hash of two child hashes concatenated together |
| Root Hash (a.k.a. Merkle Root) | The single top-level hash representing the entire dataset |
| Sibling Hash | The "other child" of a parent node β needed during proof verification |
| Merkle Proof | The minimal set of sibling hashes required to verify one data block |
| SHA256 | A standard cryptographic hash function producing a 256-bit (32-byte) output |
| β (concatenation) | The operator that joins two hashes byte-wise before hashing them together |
Tags: π Cryptographically Secure Β· π O(log n) Verification Β· π Distributed Systems Β· π‘οΈ Tamper-Proof Β· π Blockchain Β· π Git Β· π IPFS Β· π Cassandra