#280 assigned enhancement

get_hash method in webapi for extension caching logic.

Reported by: nejucomo Owned by: zooko
Priority: minor Milestone: undecided
Component: code-frontend-web Version: 0.7.0
Keywords: webapi cache extension newcaps Cc: zooko

Description

The webapi could provide a call which returns the content's hash for a given capability:

get_hash(cap, hashtype) -> hash

cap - A string containing a capability.

hashtype - An enumeration type specifying the hash algorithm; example "sha256" (more below).

hash - The result of applying the specified hash to the contents referred to by cap.

Support for different hashtypes allows the backend to implement whichever types are convenient, and extension writers can request specific types in future versions.

As long as the hashtype is convenient for extensions to compute on their own, this allows them to make "smart" caching decisions. For instance, a local file-system synchronization command could choose to download (or upload) a file only if get_hash returns a different hash than one computed from the local file.

The tahoe architecture may be able to support certain algorithms efficiently (because they are innate to its data structures).
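
For concreteness, here is a minimal sketch of the synchronization use case above, assuming the proposed call were exposed as a webapi endpoint. The endpoint form /uri/<cap>?t=hash&algorithm=sha256 and the node URL are invented for illustration; no such endpoint exists today.

  import hashlib
  import urllib.parse
  import urllib.request

  NODE_URL = "http://127.0.0.1:3456"  # a local Tahoe node's webapi

  def remote_hash(cap, hashtype="sha256"):
      # Hypothetical HTTP form of the proposed get_hash(cap, hashtype) call.
      url = "%s/uri/%s?t=hash&algorithm=%s" % (
          NODE_URL, urllib.parse.quote(cap, safe=""), hashtype)
      return urllib.request.urlopen(url).read().decode("ascii").strip()

  def needs_transfer(cap, local_path):
      # "Smart" caching decision: transfer only when the hashes differ.
      h = hashlib.sha256()
      with open(local_path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              h.update(chunk)
      return h.hexdigest() != remote_hash(cap, "sha256")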

Change History (15)

comment:1 Changed at 2008-02-06T09:59:02Z by warner

Yeah, my concern is that I'm not sure where we would store these hashes. We could stash them as metadata on directory edges, but then the API is more like:

 hash = dirnode.get_hash_of_child("foo.txt", "sha1")

and of course you have to have the dirnode around to ask it anything.

To have a function that just takes an arbitrary cap would either mean that these hashes are contained inside the cap (so the cap would have to get bigger), or that there's some magic table somewhere that maps caps to hashes (and where do we keep this table, who gets to add to it, who gets to read from it, etc).
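
To make the shape of the edge-metadata alternative concrete, here is a toy sketch with Tahoe's child-metadata dicts modeled as plain Python dicts; the "hashes" metadata key and the cap strings are invented for illustration.

  class DirNode:
      def __init__(self):
          # name -> (child cap, metadata), as on a directory edge
          self.children = {}

      def set_child(self, name, cap, hashes):
          # Stash the hashes as metadata on the directory edge.
          self.children[name] = (cap, {"hashes": hashes})

      def get_hash_of_child(self, name, hashtype):
          cap, metadata = self.children[name]
          return metadata["hashes"][hashtype]

  d = DirNode()
  d.set_child("foo.txt", "URI:CHK:...", {"sha1": "2fd4e1c67a2d28fc"})
  print(d.get_hash_of_child("foo.txt", "sha1"))

Note that nothing here lets you answer a query given only the cap string: the hash is reachable only through the dirnode.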

I completely agree with the utility of this feature, I just don't yet see how to implement it.

comment:2 Changed at 2008-03-27T16:28:00Z by zooko

Here's something we could do:

Store such hashes (encrypted by the readcap) in the UEB (which will hopefully be renamed CEB), so Tahoe can answer queries like

get_hash(cap, hashtype) -> hash

by making a single request (typically) to a storage server. The supported hashtypes would be limited to the hashtypes that were supported by the uploader when they uploaded the file -- either just one (sha256), or maybe two or three (sha256 and Tiger and RIPEMD-160?). Most code which does file validation nowadays still uses MD5, SHA-1, or Tiger, but the first two really shouldn't be used for secure file validation in the future, so I would be happy not to support them.

By the way, storing an encrypted sha256 hash of the plaintext in the CEB is something that Rob and perhaps Brian and perhaps I want to do anyway, in order to give further assurance that there wasn't a bug or a wrong symmetric key in our decryption of the validated ciphertext.
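
A rough sketch of what "encrypted by the readcap" could mean, assuming a key derived from the read cap with a tagged hash, and AES-CTR from the 'cryptography' package; the tag string and the derivation are invented here, not Tahoe's actual scheme.

  import hashlib
  import os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def derive_hash_key(readcap):
      # Hypothetical tagged derivation of a key only cap holders know.
      return hashlib.sha256(b"toy-ueb-plaintext-hash-v1:" + readcap.encode()).digest()

  def encrypt_hash_for_ueb(readcap, plaintext_hash):
      # What the uploader would store in the UEB/CEB.
      nonce = os.urandom(16)
      enc = Cipher(algorithms.AES(derive_hash_key(readcap)), modes.CTR(nonce)).encryptor()
      return nonce, enc.update(plaintext_hash) + enc.finalize()

  def decrypt_hash_from_ueb(readcap, nonce, ciphertext):
      # What a node holding the readcap would do to answer get_hash().
      dec = Cipher(algorithms.AES(derive_hash_key(readcap)), modes.CTR(nonce)).decryptor()
      return dec.update(ciphertext) + dec.finalize()

Storage servers see only the nonce and ciphertext, so the stored hash is useless for confirming guesses about the plaintext.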

comment:3 Changed at 2008-05-30T20:21:33Z by zooko

A user of allmydata.com's consumer backup service just requested that it display the md5sum of a file on the web site so that he could use that to assure himself that the file had uploaded completely and correctly.

comment:4 Changed at 2008-06-01T20:58:01Z by warner

  • Milestone changed from eventually to undecided

comment:5 Changed at 2009-03-08T22:02:51Z by warner

  • Component changed from unknown to code-frontend-web
  • Owner nobody deleted

comment:6 Changed at 2009-09-29T00:14:10Z by nejucomo

The comments above seem to only consider a well-known hash function, like SHA256, and indeed it seems like including such a hash would add some overhead or complexity to the storage format. This might be worth it.

However, when I originally wrote this, I imagined there was some hashtype which was "innate" to Tahoe storage structures, and therefore this call could extract that information efficiently from a cap.

After a quick skim of the architecture doc, it sounds like there is a merkle tree stored in the capability extension block. If this is a tree over the plaintext, then the root of this tree could be efficiently returned by the proposed call, such as:

get_hash(myCap, "tahoe_content_merkle_tree_root")

Clients would then need to compute a merkle tree, but I expect this would be somewhat simple and efficient, given the right library for computing merkle trees.
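
As a sketch of what that client-side work would look like: segment the file and hash pairwise up to a root. Tahoe's real trees use tagged SHA-256d hashes and particular segment sizes, so this illustrates the shape of the computation, not a value compatible with any actual Tahoe tree.

  import hashlib

  SEGMENT_SIZE = 128 * 1024  # assumed segment size, for illustration only

  def merkle_root_of_file(path):
      # Leaf hashes, one per segment of the file.
      leaves = []
      with open(path, "rb") as f:
          for segment in iter(lambda: f.read(SEGMENT_SIZE), b""):
              leaves.append(hashlib.sha256(segment).digest())
      level = leaves or [hashlib.sha256(b"").digest()]
      # Hash pairwise up to the root, duplicating the odd node out.
      while len(level) > 1:
          if len(level) % 2:
              level.append(level[-1])
          level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                   for i in range(0, len(level), 2)]
      return level[0]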

Because I've noticed a thread on tahoe-dev about caching and have seen some tickets related to caching, I'm going to link all of these related tickets and threads together.

comment:7 Changed at 2009-09-29T00:27:49Z by nejucomo

See ticket #316 for a built-in caching feature proposal.

I personally prefer this minimal code change, which makes it easier for clients to do their own caching, over a built-in caching feature. Fewer features, fewer configuration states, and more test coverage per component.

comment:8 Changed at 2009-09-29T03:24:56Z by zooko

There is currently no hash of the plaintext stored. See http://allmydata.org/~zooko/lafs.pdf diagram 1 for what is stored for an immutable file currently. We used to have one, but we took it out because it was visible to anyone (it was stored on storage servers unencrypted), which enabled anyone to mount guess-and-check attacks (per http://hacktahoe.org/drew_perttula.html ). #453 (safely add plaintext_hash to immutable UEB) is a ticket to add plaintext hashes back but store them encrypted under the read-cap.
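
For illustration, the guess-and-check attack that unencrypted plaintext hashes enabled looks roughly like this; the data and function names are made up:

  import hashlib

  def stored_plaintext_hash(plaintext):
      # What the old format exposed, conceptually: an unencrypted hash
      # of the plaintext, visible to every storage server.
      return hashlib.sha256(plaintext).digest()

  def confirm_guess(public_hash, guess):
      # Anyone who can read the share can test candidate plaintexts --
      # no read cap required.
      return hashlib.sha256(guess).digest() == public_hash

  public = stored_plaintext_hash(b"salary: $85,000\n")
  print(confirm_guess(public, b"salary: $85,000\n"))  # True: guess confirmed
  print(confirm_guess(public, b"salary: $90,000\n"))  # False

Encrypting the stored hash under the read-cap (as #453 proposes) defeats this, since only cap holders can recover the hash to check against.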

If we had #453, we could easily give out the hash-of-plaintext or else the root-of-merkle-tree-of-plaintext to serve this API. But wait a minute, what's the use case of this proposed API again? How come the user can't just use the verify cap instead of this hash-of-the-plaintext?

comment:9 Changed at 2009-10-28T04:09:24Z by davidsarah

  • Keywords newcaps added

Tagging issues relevant to new cap protocol design.

comment:10 Changed at 2010-02-11T02:56:14Z by davidsarah

  • Keywords cache added; caching removed

comment:11 follow-up: Changed at 2010-05-15T04:39:21Z by zooko

I still don't understand why the use case for this isn't satisfied by verify caps.

comment:12 in reply to: ↑ 11 ; follow-up: Changed at 2012-02-21T21:23:09Z by nejucomo

Replying to zooko:

I still don't understand why the use case for this isn't satisfied by verify caps.

Here's a use case I advocate:

  • I have a large file called myblob.bin and a capability $C (of any kind), which I believe is associated with some revision of myblob.bin.
  • I use a commandline tool to calculate a cryptographic-hash-like value. Example alternatives:
    • $ md5sum myblob.bin > local_hash
    • $ pyeval 'hashlib.sha256(ri).hexdigest()' < myblob.bin > local_hash
    • $ tahoe calculate_hashlike_thingy --input-file myblob.bin > local_hash
  • I then ask tahoe for the hash-like value given the capability:
    • $ tahoe calculate_hashlike_thingy --input-uri $C > lafs_hash
    • NOTE: For my use case, I want this command to not do any networking, if possible.
  • Compare the results for equality:
    • $ if ! diff -q local_hash lafs_hash ; then echo 'This revision of myblob.bin is not stored at that capability.' ; fi

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should read only myblob.bin, and should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

comment:13 Changed at 2012-02-21T22:09:08Z by zooko

  • Cc zooko added
  • Owner set to zooko
  • Status changed from new to assigned

comment:14 in reply to: ↑ 12 Changed at 2012-02-22T00:53:10Z by davidsarah

Replying to nejucomo:

So for this use case to be satisfied by verify caps I need this command:

$ tahoe spit_out_verify_cap < myblob.bin

This command should read only myblob.bin, and should not do any networking or use any state other than the cap and myblob.bin (so that any tahoe user on any grid can run it).

Is it feasible to make this command? That would satisfy my goal for this ticket.

Yes, it is feasible to make this command. Depending on the cap protocol, it might have to do all the work of erasure coding the file and computing a Merkle hash of the ciphertext shares before it can compute the verify cap.
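
To illustrate the amount of work involved, here is a rough, runnable sketch of that pipeline: encrypt, erasure-code, then Merkle-hash the shares. The XOR keystream (standing in for AES), the XOR parity shares (standing in for zfec's erasure coding), and the cap layout are all invented for illustration, not Tahoe's real formats.

  import hashlib
  from functools import reduce

  def sha256(data):
      return hashlib.sha256(data).digest()

  def merkle_root(leaves):
      # Pairwise-SHA256 Merkle root, duplicating the odd node out.
      level = [sha256(leaf) for leaf in leaves]
      while len(level) > 1:
          if len(level) % 2:
              level.append(level[-1])
          level = [sha256(level[i] + level[i + 1])
                   for i in range(0, len(level), 2)]
      return level[0]

  def toy_verify_cap(plaintext, key, k=3, m=10):
      # 1. Encrypt. (A real client derives the key from the convergence
      #    secret; a hash-based keystream stands in for AES here.)
      stream = b"".join(sha256(key + i.to_bytes(8, "big"))
                        for i in range(len(plaintext) // 32 + 1))
      ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
      # 2. Split the ciphertext into k equal blocks and expand to m shares.
      #    (Tahoe uses zfec; XOR parity blocks stand in for Reed-Solomon.)
      block_len = -(-len(ciphertext) // k) or 1
      padded = ciphertext.ljust(block_len * k, b"\0")
      blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
      parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))
      shares = blocks + [parity] * (m - k)
      # 3. The verify cap commits to a Merkle root over the share hashes,
      #    so none of the steps above can be skipped.
      return "URI:TOY-Verifier:%s:%d:%d" % (merkle_root(shares).hex(), k, m)

  print(toy_verify_cap(b"hello, tahoe", b"convergence-secret"))

Every step is a deterministic function of the file, the key, and the encoding parameters, which is what makes an offline computation possible at all.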

Your use case could also be met with a Merkle hash of the plaintext and convergence secret, which could be computed without erasure coding. But there's a tradeoff between being able to do that and the cap size: in order to be able to recover the plaintext hash from the read cap without network access, the encryption bits and the integrity bits of the read cap must be separate, which means that the minimum immutable read cap size for a security level of 2^K against 2^T targets is 3K + T bits (2K integrity bits and K + T confidentiality bits). In contrast, the scheme with the shortest read caps so far without this constraint is Rainhill 3, which has an immutable read cap size of only 2K bits, the minimum possible to achieve 2^K security against collision attacks. (For example, with K = 128 and T = 64, that is 3K + T = 448 bits versus 2K = 256 bits.)

(A simplified version of Rainhill 3 without traversal caps is here. It does allow you to compute a plaintext hash P, or an encrypted hash EncP_R, before doing erasure coding, but in order to recover that value from the read cap, you also need EncK_R, which is stored on the server.)

comment:15 Changed at 2012-02-22T01:01:21Z by davidsarah

BTW, if you drop the feature of being able to derive a verify cap from a read cap off-line, then a verify cap could include the information normally stored on the server that allows verifying a plaintext off-line without doing erasure coding, and read caps could still be optimally short. However, in practice I think off-line derivation of verify caps is the more useful feature.
