Opened at 2012-09-19T16:08:34Z
Last modified at 2012-09-22T10:41:37Z
#1813 new enhancement
Choice of tree-hash
| Reported by: | CodesInChaos | Owned by: | davidsarah | 
|---|---|---|---|
| Priority: | minor | Milestone: | undecided | 
| Component: | unknown | Version: | n/a | 
| Keywords: | newcaps | Cc: | |
| Launchpad Bug: | 
Description
One topic I'm particularly interesting in is designing ans specifying a good tree-hash, especially the tree-hash of the plaintext. Standardizing on such a hash would create a protocol transcending file id. In many such systems, the plaintext hash serves as identifier for the file, similar to what git does with SHA-1. Reusing the same hash-functions across projects such as tahoe-lafs, cryptosphere or my own project could lead to nice synergy effects. One could use this to merge downloads from different sources, or as a key to look up a file in an online database.
Obviously a binary tree of unlimited depth is the best choice here. The hashfunction must be collision and second pre-image resistant. Still there are a number of interesting detail choices:
1. Hash size
What output size should the hashfunction have? To get collision resistance, 256 bit seems to be the minimum acceptable value. Values between 256 and 512 bit seem reasonable.
My suggestion: 256 bits
2. Hash primitive
Of the current generation hash-functions, SHA-2 seems like the best choice. It's security margin seems to be a bit on the low side, so I'd prefer a SHA-3 finalist, since those seem to have a larger margin. The selection of SHA-3 is around the corner, so waiting for that seems reasonable. On the other hand, SHA-3 candidates have probably seen less cryptoanalysis than SHA-2.
My suggestion: SHA-3-256, whatever primitive might be chosen.
3. Leaf size
Should be a power of two. The leaf size should be independent from any higher level parameters, such as segment size. Larger leaves reduce the CPU cost somewhat, but have no space benefit. Smaller leaves increase the CPU cost a bit, but allow verification of smaller pieces. Values around 1KiB seem like a reasonable trade-off, but perhaps we should benchmark this.
My suggestion: 1 KiB
4. Which tree construction?
The simplest merkle tree adds a prefix of 0 for leaves and 1 for inner nodes. Alternatively one could include the depth as a 1 byte integer.
Should we add personalization/purpose strings at the leaf level?
How to treat non power-of-two sizes? Pull up the lower layer hashes? Or pad the lowest layer with all zero hashes?
Should leaves include an offset from the beginning of the file, like skein does?
Should the last leaf be tagged with a Final bit? Else the tree suffers from length-extensions. But that shouldn't be a practical problem.
In my current draft I use a 7 byte purpose string and a one byte depth integer as the prefix to all hashes. I pad the leaf hashes with zeros to the next power of two. But there are probably better choices.
5. Extra info in the root?
One interesting idea I had, was to hash some additional information together with the root of the tree, and use that as the identifying hash. My idea for the extra information were the first 24 bytes of the file, and the file size as an 8 byte integer.
This has some nice properties:
- The size of the complete hash tree, truncated at an arbitrary layer is a power of two.
- If you only know the identifying hash, you can ask somebody else about the first bytes and file size, and they can't lie. The first few bytes tell you the file type for most files(magic bytes), and the known file size allows you to preallocate the necessary space, and prevents bombing attacks.
6. Draft from my project
For completeness I reproduce the hashing part from my current draft specification:
Basic ideas
I want a FileID that's independent from the encrypted transport representation. This allows merging downloads from different transport representations. This is important if one wants to change the sharing protocol in the future. For example by adding splitting and error correction like in Tahoe-LAFS, or using block wise transports like in Freenet.
I decided to use a 256 bit hash, which offers 256 bit resistance against second pre-images, and 128 bit resistance against collisions. This seems like an appropriate level, since second pre-images would totally break the system, and collisions are pretty annoying but not fatal.
When creating my treehash system, I noticed that the total size tree is a power of two, apart from a single unused hash-sized part. I decided to fill those with the size of the file and the first 24 bytes of the file. This allows quick sanity checking of the most important properties of a file without downloading the file itself: its size and the file type.
Hash Tree
########### ToDo?: Is 1024 bytes the optimal leaf size?
########### If SHA3 gets selected before this project gets popular, I'll probably replace all uses of SHA-256 with SHA3, keeping the hash size at 256 bits.
########### Skein-256-256 in tree mode would be an alternative. It also has a built-in personalization feature. Unfortunately Skein-512-256 doesn't support 256 bit intermediate hashes :(
This defines the tree-hash. It's a binary merkle-tree based on SHA-256. It takes the LeafSize and a 7 byte ASCII string as purpose parameter.
Define the function Hash(Depth, Message) as SHA-256(Purpose || Depth || Message) where || denotes concatenation.
To calculate the tree-hash of a file:
- Divide the file into LeafSize byte blocks. The last block may be shorter.
- For each of those blocks calculate Hash(0, LeafBlock). If the number of hashes is not a power-of-two, pad the list of hashes(not the file) to the next power-of-two with 00 bytes.
- For each pair of hashes from the previous step, calculate Hash(Depth, HashPair) where HashPair is the concatenation of the two child hashes. If the right child hashes is all 00 simply keep the left hash. Depth is 1 for the first level above the leaves, and gets incremented with each level.
- Repeat step 3 until there is only one hash left.
- Concat the following to form the 64 byte RootBlock:
- The size of the file as 8 byte big-endian integer
- The first 24 bytes of the file (padded with 00 if the file is shorter than that)
- The root of the original tree (result of step 4)
 
- Calculate Hash(255, RootBlock).
TreeBlock? format:
Concat:
- The size of the file as 8 byte big-endian integer
- The first 24 bytes of the file (padded with 00 if the file is shorter than that)
- The individual hashes in the tree, layer wise, starting with the root.
Note: This implies that the TreeBlock starts with the RootBlock.
This tree may be truncated at the end of any layer. Typically at a fixed size, called PieceSize (e.g. 215 bytes = 32 KiB)
It must at least contain the first 64 bytes (the root block). For files smaller than the PieceSize, it will always consist of 64 bytes. When doing this, the size of the tree in bytes will always be a power-of-two.
FileID
Set Purpose to "FileID\0" and LeafSize to 1024. The hash of the root block (result of step 6) is the FileID.
Change History (2)
comment:1 Changed at 2012-09-19T22:47:24Z by davidsarah
comment:2 Changed at 2012-09-22T10:41:37Z by CodesInChaos
The CPU overhead of tree-hashing depends a lot on the particular hash function, and how you use it. In particular the interaction of the blocksize, hashsize, padding and finalization. But I estimate that it's only about 15% compared to a flat hash, for a leaf size of 1 KiB. So the increase in software complexity using a higher branching factor doesn't seem worthwhile. If those 15% worry you, I'd rather double the leaf size.
The official Skein treehash skips finalization for anything but the root, since finalization is only required for full pseudo-randomness, and not for pre-image/collision resistance. This halves the overhead from 15% to 8%.


There may be a slight performance benefit to non-binary trees, at the expense of an increase in size of the uncle chain for a given leaf block. Here's the expansion in size of the uncle chain relative to a binary tree, for arities k = 2..16:
This is useful to increase the performance slightly if hashing k children in each intermediate hash can be done in the same time as hashing 2 children. (That depends on which hash is used, but most hashes have an input block size larger than their output size.)
I'll look up some references for papers on tree hashes.