.. -*- coding: utf-8-with-signature -*-

=============
Mutable Files
=============

1. `Mutable Formats`_
2. `Consistency vs. Availability`_
3. `The Prime Coordination Directive: "Don't Do That"`_
4. `Small Distributed Mutable Files`_

   1. `SDMF slots overview`_
   2. `Server Storage Protocol`_
   3. `Code Details`_
   4. `SDMF Slot Format`_
   5. `Recovery`_

5. `Medium Distributed Mutable Files`_
6. `Large Distributed Mutable Files`_
7. `TODO`_

Mutable files are places with a stable identifier that can hold data that
changes over time. In contrast to immutable slots, for which the
identifier/capability is derived from the contents themselves, the mutable
file identifier remains fixed for the life of the slot, regardless of what
data is placed inside it.

Each mutable file is referenced by two different caps. The "read-write" cap
grants read-write access to its holder, allowing them to put whatever
contents they like into the slot. The "read-only" cap is less powerful, only
granting read access, and not enabling modification of the data. The
read-write cap can be turned into the read-only cap, but not the other way
around.

The data in these files is distributed over a number of servers, using the
same erasure coding that immutable files use, with 3-of-10 being a typical
choice of encoding parameters. The data is encrypted and signed in such a way
that only the holders of the read-write cap will be able to set the contents
of the slot, and only the holders of the read-only cap will be able to read
those contents. Holders of either cap will be able to validate the contents
as being written by someone with the read-write cap. The servers who hold the
shares are not automatically given the ability to read or modify them: the
worst they can do is deny service (by deleting or corrupting the shares), or
attempt a rollback attack (which can only succeed with the cooperation of at
least k servers).


Mutable Formats
===============

History
-------

When mutable files first shipped in Tahoe-0.8.0 (15-Feb-2008), the only
version available was "SDMF", described below. This was a
limited-functionality placeholder, intended to be replaced with
improved-efficiency "MDMF" files shortly afterwards. The development process
took longer than expected, and MDMF didn't ship until Tahoe-1.9.0
(31-Oct-2011), and even then it was opt-in (not used by default).

SDMF was intended for relatively small mutable files, up to a few megabytes.
It uses only one segment, so alacrity (the measure of how quickly the first
byte of plaintext is returned to the client) suffers, as the whole file must
be downloaded even if you only want to get a single byte. The memory used by
both clients and servers also scales with the size of the file, instead of
being limited to the half-a-MB-or-so that immutable file operations use, so
large files cause significant memory usage. To discourage the use of SDMF
outside its design parameters, the early versions of Tahoe enforced a
maximum size on mutable files (maybe 10MB). Since most directories are built
out of mutable files, this imposed a limit of about 30k entries per
directory. In subsequent releases, this limit was removed, but the
performance problems inherent in the SDMF implementation remained.

In the summer of 2010, Google-Summer-of-Code student Kevan Carstensen took on
the project of finally implementing MDMF. Because of my (Brian) design
mistake in SDMF (not including a separate encryption seed in each segment),
the share format for SDMF could not be used for MDMF, resulting in a larger
gap between the two implementations (my original intention had been to make
SDMF a clean subset of MDMF, where any single-segment MDMF file could be
handled by the old SDMF code). In the fall of 2011, Kevan's code was finally
integrated, and first made available in the Tahoe-1.9.0 release.

SDMF vs. MDMF
-------------

The key improvement of MDMF is its use of multiple segments: individual
128-KiB sections of the file can be retrieved or modified independently. The
improvement can be seen when fetching just a portion of the file (using a
Range: header on the webapi), or when modifying a portion (again with a
Range: header). It can also be seen indirectly when fetching the whole file:
the first segment of data should be delivered faster from a large MDMF file
than from an SDMF file, although the overall download will then proceed at
the same rate.

We've decided to make it opt-in for now: mutable files default to
SDMF format unless explicitly configured to use MDMF, either in ``tahoe.cfg``
(see :doc:`../configuration`) or in the WUI or CLI command that created a
new mutable file.

The code can read and modify existing files of either format without user
intervention. We expect to make MDMF the default in a subsequent release,
perhaps 2.0.

Which format should you use? SDMF works well for files up to a few MB, and
can be handled by older versions (Tahoe-1.8.3 and earlier). If you do not
need to support older clients, want to efficiently work with mutable files,
and have code which will use Range: headers to make partial reads and
writes, then MDMF is for you.


Consistency vs. Availability
============================

There is an age-old battle between consistency and availability. Epic papers
have been written, elaborate proofs have been established, and generations of
theorists have learned that you cannot simultaneously achieve guaranteed
consistency with guaranteed availability. In addition, the closer you get to
a guarantee on either axis, the more the cost and complexity of the design
go up.

Tahoe's design goals largely favor design simplicity, then slightly favor
read availability, over the other criteria.

As we develop more sophisticated mutable slots, the API may expose multiple
read versions to the application layer. The Tahoe philosophy is to defer most
consistency recovery logic to the higher layers. Some applications have
effective ways to merge multiple versions, so inconsistency is not
necessarily a problem (i.e. directory nodes can usually merge multiple
"add child" operations).


The Prime Coordination Directive: "Don't Do That"
=================================================

The current rule for applications which run on top of Tahoe is "do not
perform simultaneous uncoordinated writes". That means you need non-Tahoe
means to make sure that two parties are not trying to modify the same mutable
slot at the same time. For example:

* don't give the read-write URI to anyone else. Dirnodes in a private
  directory generally satisfy this case, as long as you don't use two
  clients on the same account at the same time
* if you give a read-write URI to someone else, stop using it yourself. An
  inbox would be a good example of this.
* if you give a read-write URI to someone else, call them on the phone
  before you write into it
* build an automated mechanism to have your agents coordinate writes.
  For example, we expect a future release to include a FURL for a
  "coordination server" in the dirnodes. The rule can be that you must
  contact the coordination server and obtain a lock/lease on the file
  before you're allowed to modify it.

If you do not follow this rule, Bad Things will happen. The worst-case Bad
Thing is that the entire file will be lost. A less-bad Bad Thing is that one
or more of the simultaneous writers will lose their changes. An observer of
the file may not see monotonically-increasing changes to the file, i.e. they
may see version 1, then version 2, then 3, then 2 again.

Tahoe takes some amount of care to reduce the badness of these Bad Things.
One way you can help nudge it from the "lose your file" case into the "lose
some changes" case is to reduce the number of competing versions: multiple
versions of the file that different parties are trying to establish as the
one true current contents. Each simultaneous writer counts as a "competing
version", as does the previous version of the file. If the count "S" of these
competing versions is larger than N/k, then the file runs the risk of being
lost completely. (With 3-of-10 encoding, N/k is about 3.3, so the previous
version plus three simultaneous writers is already enough to put the file at
risk.) [TODO] If at least one of the writers remains running after
the collision is detected, it will attempt to recover, but if S>(N/k) and all
writers crash after writing a few shares, the file will be lost.

Note that Tahoe uses serialization internally to make sure that a single
Tahoe node will not perform simultaneous modifications to a mutable file. It
accomplishes this by using a weakref cache of the MutableFileNode (so that
there will never be two distinct MutableFileNodes for the same file), and by
forcing all mutable file operations to obtain a per-node lock before they
run. The Prime Coordination Directive therefore applies to inter-node
conflicts, not intra-node ones.


Small Distributed Mutable Files
===============================

SDMF slots are suitable for small (<1MB) files that are edited by rewriting
the entire file. The three operations are:

* allocate (with initial contents)
* set (with new contents)
* get (old contents)

The first use of SDMF slots will be to hold directories (dirnodes), which map
encrypted child names to rw-URI/ro-URI pairs.

SDMF slots overview
-------------------

Each SDMF slot is created with a public/private key pair. The public key is
known as the "verification key", while the private key is called the
"signature key". The private key is hashed and truncated to 16 bytes to form
the "write key" (an AES symmetric key). The write key is then hashed and
truncated to form the "read key". The read key is hashed and truncated to
form the 16-byte "storage index" (a unique string used as an index to locate
stored data).

The public key is hashed by itself to form the "verification key hash".

The write key is hashed a different way to form the "write enabler master".
For each storage server on which a share is kept, the write enabler master is
concatenated with the server's nodeid and hashed, and the result is called
the "write enabler" for that particular server. Note that multiple shares of
the same slot stored on the same server will all get the same write enabler,
i.e. the write enabler is associated with the "bucket", rather than the
individual shares.
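
The same derivation chain can be written down as a short sketch. This is
illustrative only: the real implementation uses tagged SHA-256d hashes with
specific tag strings, and the ``tagged_hash`` helper and tag names below are
invented for the example::

  import hashlib

  def tagged_hash(tag, data):
      # stand-in for Tahoe's netstring-tagged SHA-256d hashes
      return hashlib.sha256(tag + b":" + data).digest()

  def derive_keys(privkey_bytes, pubkey_bytes, server_nodeids):
      writekey = tagged_hash(b"writekey", privkey_bytes)[:16]
      readkey = tagged_hash(b"readkey", writekey)[:16]
      storage_index = tagged_hash(b"storage_index", readkey)[:16]
      verification_key_hash = tagged_hash(b"pubkey", pubkey_bytes)
      write_enabler_master = tagged_hash(b"write_enabler_master", writekey)
      # one write enabler per server: derived from the master and the nodeid
      write_enablers = dict(
          (nodeid, tagged_hash(b"write_enabler",
                               write_enabler_master + nodeid))
          for nodeid in server_nodeids)
      return (writekey, readkey, storage_index,
              verification_key_hash, write_enablers)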

The private key is encrypted (using AES in counter mode) by the write key,
and the resulting crypttext is stored on the servers, so it will be
retrievable by anyone who knows the write key. The write key is not used to
encrypt anything else, and the private key never changes, so we do not need
an IV for this purpose.

The actual data is encrypted (using AES in counter mode) with a key derived
by concatenating the readkey with the IV, then hashing the result and
truncating to 16 bytes. The IV is randomly generated each time the slot is
updated, and stored next to the encrypted data.
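
As a concrete sketch of that data path, assuming the third-party
``cryptography`` package, plain SHA-256 in place of Tahoe's tagged hashes,
and an all-zero initial counter block (all of which are simplifications of
the real code)::

  import os
  import hashlib
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def encrypt_slot_data(readkey, plaintext):
      # a fresh IV is chosen for every update and stored beside the crypttext
      iv = os.urandom(16)
      # data key = H(readkey + IV), truncated to 16 bytes
      data_key = hashlib.sha256(readkey + iv).digest()[:16]
      encryptor = Cipher(algorithms.AES(data_key),
                         modes.CTR(b"\x00" * 16)).encryptor()
      return iv, encryptor.update(plaintext) + encryptor.finalize()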

The read-write URI consists of the write key and the verification key hash.
The read-only URI contains the read key and the verification key hash. The
verify-only URI contains the storage index and the verification key hash.

::

  URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
  URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
  URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)

Note that this allows the read-only and verify-only URIs to be derived from
the read-write URI without actually retrieving the public keys. Also note
that it means the read-write agent must validate both the private key and the
public key when they are first fetched. All users validate the public key in
exactly the same way.

The SDMF slot is allocated by sending a request to the storage server with a
desired size, the storage index, and the write enabler for that server's
nodeid. If granted, the write enabler is stashed inside the slot's backing
store file. All further write requests must be accompanied by the write
enabler or they will not be honored. The storage server does not share the
write enabler with anyone else.

The SDMF slot structure will be described in more detail below. The important
pieces are:

* a sequence number
* a root hash "R"
* the encoding parameters (including k, N, file size, segment size)
* a signed copy of [seqnum,R,encoding_params], using the signature key
* the verification key (not encrypted)
* the share hash chain (part of a Merkle tree over the share hashes)
* the block hash tree (Merkle tree over blocks of share data)
* the share data itself (erasure-coding of read-key-encrypted file data)
* the signature key, encrypted with the write key

The access pattern for read is:

* hash read-key to get storage index
* use storage index to locate 'k' shares with identical 'R' values

  * either get one share, read 'k' from it, then read k-1 shares
  * or read, say, 5 shares, discover k, either get more or be finished
  * or copy k into the URIs

* read verification key
* hash verification key, compare against verification key hash
* read seqnum, R, encoding parameters, signature
* verify signature against verification key
* read share data, compute block-hash Merkle tree and root "r"
* read share hash chain (leading from "r" to "R")
* validate share hash chain up to the root "R"
* submit share data to erasure decoding
* decrypt decoded data with read-key
* submit plaintext to application
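
Two of the validation steps above can be sketched as follows. Plain SHA-256
stands in for Tahoe's tagged SHA-256d hashes, and the Merkle pairing
convention (duplicating the last node on odd-sized levels) is an assumption
of this illustration, not a description of the real hash tree::

  import hashlib

  def h(data):
      # stand-in for Tahoe's tagged hashes
      return hashlib.sha256(data).digest()

  def check_verification_key(pubkey_bytes, verification_key_hash):
      # "hash verification key, compare against verification key hash"
      return h(pubkey_bytes) == verification_key_hash

  def block_hash_root(blocks):
      # "read share data, compute block-hash Merkle tree and root 'r'"
      level = [h(b) for b in blocks]
      while len(level) > 1:
          if len(level) % 2:
              level.append(level[-1])
          level = [h(level[i] + level[i + 1])
                   for i in range(0, len(level), 2)]
      return level[0]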

The access pattern for write is:

* hash write-key to get read-key, hash read-key to get storage index
* use the storage index to locate at least one share
* read verification key and encrypted signature key
* decrypt signature key using write-key
* hash signature key, compare against write-key
* hash verification key, compare against verification key hash
* encrypt plaintext from application with read-key

  * application can encrypt some data with the write-key to make it only
    available to writers (use this for transitive read-onlyness of dirnodes)

* erasure-code crypttext to form shares
* split shares into blocks
* compute Merkle tree of blocks, giving root "r" for each share
* compute Merkle tree of shares, find root "R" for the file as a whole
* create share data structures, one per server:

  * use seqnum which is one higher than the old version
  * share hash chain has log(N) hashes, different for each server
  * signed data is the same for each server

* now we have N shares and need homes for them
* walk through peers

  * if share is not already present, allocate-and-set
  * otherwise, try to modify existing share:
  * send testv_and_writev operation to each one
  * testv says to accept share if their(seqnum+R) <= our(seqnum+R)
  * count how many servers wind up with which versions (histogram over R)
  * keep going until N servers have the same version, or we run out of servers

* if any servers wound up with a different version, report error to
  application
* if we ran out of servers, initiate recovery process (described below)

Server Storage Protocol
-----------------------

The storage servers will provide a mutable slot container which is oblivious
to the details of the data being contained inside it. Each storage index
refers to a "bucket", and each bucket has one or more shares inside it. (In a
well-provisioned network, each bucket will have only one share). The bucket
is stored as a directory, using the base32-encoded storage index as the
directory name. Each share is stored in a single file, using the share number
as the filename.

The container holds space for a container magic number (for versioning), the
write enabler, the nodeid which accepted the write enabler (used for share
migration, described below), a small number of lease structures, the embedded
data itself, and expansion space for additional lease structures::

  #   offset    size    name
  1   0         32      magic verstr "Tahoe mutable container v1\n\x75\x09\x44\x03\x8e"
  2   32        20      write enabler's nodeid
  3   52        32      write enabler
  4   84        8       data size (actual share data present) (a)
  5   92        8       offset of (8) count of extra leases (after data)
  6   100       368     four leases, 92 bytes each
                          0   4   ownerid (0 means "no lease here")
                          4   4   expiration timestamp
                          8  32   renewal token
                         40  32   cancel token
                         72  20   nodeid which accepted the tokens
  7   468       (a)     data
  8   ??        4       count of extra leases
  9   ??        n*92    extra leases

The "extra leases" field must be copied and rewritten each time the size of
the enclosed data changes. The hope is that most buckets will have four or
fewer leases and this extra copying will not usually be necessary.

The (4) "data size" field contains the actual number of bytes of data present
in field (7), such that a client request to read beyond the end of the data
(i.e. beyond offset (7) plus (a)) will result in an error. This allows the
client to (one day) read relative to the end of the file. The container size
(that is, (8)-(7)) might be larger, especially if extra size was
pre-allocated in anticipation of filling the container with a lot of data.

The offset in (5) points at the *count* of extra leases, at (8). The actual
leases (at (9)) begin 4 bytes later. If the container size changes, both (8)
and (9) must be relocated by copying.
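
A parsing sketch of this layout, using the field widths from the table above;
the big-endian byte order for the two 8-byte size/offset fields is an
assumption of the example::

  import struct

  HEADER = struct.Struct(">32s 20s 32s Q Q")   # magic, WE nodeid, WE,
                                               # data size, extra-lease offset
  LEASE = struct.Struct(">L L 32s 32s 20s")    # ownerid, expiration,
                                               # renew, cancel, nodeid

  def parse_container(blob):
      (magic, we_nodeid, write_enabler,
       data_size, extra_lease_offset) = HEADER.unpack_from(blob, 0)
      leases = [LEASE.unpack_from(blob, HEADER.size + i * LEASE.size)
                for i in range(4)]
      data_start = HEADER.size + 4 * LEASE.size    # 100 + 368 = 468
      data = blob[data_start:data_start + data_size]
      return magic, we_nodeid, write_enabler, leases, data, extra_lease_offset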

The server will honor any write commands that provide the write enabler and
do not exceed the server-wide storage size limitations. Read and write
commands MUST be restricted to the 'data' portion of the container: the
implementation of those commands MUST perform correct bounds-checking to make
sure other portions of the container are inaccessible to the clients.

The two methods provided by the storage server on these "MutableSlot" share
objects are:

* readv(ListOf(offset=int, length=int))

  * returns a list of bytestrings, of the various requested lengths
  * offset < 0 is interpreted relative to the end of the data
  * spans which hit the end of the data will return truncated data

* testv_and_writev(write_enabler, test_vector, write_vector)

  * this is a test-and-set operation which performs the given tests and only
    applies the desired writes if all tests succeed. This is used to detect
    simultaneous writers, and to reduce the chance that an update will lose
    data recently written by some other party (written after the last time
    this slot was read). A sketch of these semantics appears after this
    list.
  * test_vector=ListOf(TupleOf(offset, length, opcode, specimen))
  * the opcode is a string, from the set [gt, ge, eq, le, lt, ne]
  * each element of the test vector is read from the slot's data and
    compared against the specimen using the desired (in)equality. If all
    tests evaluate True, the write is performed
  * write_vector=ListOf(TupleOf(offset, newdata))

    * offset < 0 is not yet defined, it probably means relative to the
      end of the data, which probably means append, but we haven't nailed
      it down quite yet
    * write vectors are executed in order, which specifies the results of
      overlapping writes

  * return value:

    * error: OutOfSpace
    * error: something else (io error, out of memory, whatever)
    * (True, old_test_data): the write was accepted (test_vector passed)
    * (False, old_test_data): the write was rejected (test_vector failed)

      * both 'accepted' and 'rejected' return the old data that was used
        for the test_vector comparison. This can be used by the client
        to detect write collisions, including collisions for which the
        desired behavior was to overwrite the old version.
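
The test-and-set semantics can be modeled like this. ``data`` is a
``bytearray`` holding the slot's current contents; this is an illustrative
model of the behavior described above, not the storage server's actual
code::

  import operator

  _OPS = {"lt": operator.lt, "le": operator.le, "eq": operator.eq,
          "ge": operator.ge, "gt": operator.gt, "ne": operator.ne}

  def testv_and_writev(data, write_enabler, stored_write_enabler,
                       test_vector, write_vector):
      if write_enabler != stored_write_enabler:
          raise ValueError("bad write enabler")
      # remember the old data used for the comparisons, so the caller can
      # detect collisions whether or not the write is applied
      old_test_data = [bytes(data[offset:offset + length])
                       for (offset, length, _op, _specimen) in test_vector]
      accepted = all(_OPS[op](bytes(data[offset:offset + length]), specimen)
                     for (offset, length, op, specimen) in test_vector)
      if accepted:
          for (offset, newdata) in write_vector:   # applied in order
              data[offset:offset + len(newdata)] = newdata
      return (accepted, old_test_data)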

In addition, the storage server provides several methods to access these
share objects:

* allocate_mutable_slot(storage_index, sharenums=SetOf(int))

  * returns DictOf(int, MutableSlot)

* get_mutable_slot(storage_index)

  * returns DictOf(int, MutableSlot)
  * or raises KeyError

We intend to add an interface which allows small slots to allocate-and-write
in a single call, as well as do update or read in a single call. The goal is
to allow a reasonably-sized dirnode to be created (or updated, or read) in
just one round trip (to all N shareholders in parallel).

migrating shares
````````````````

If a share must be migrated from one server to another, two values become
invalid: the write enabler (since it was computed for the old server), and
the lease renew/cancel tokens.

Suppose that a slot was first created on nodeA, and was thus initialized with
WE(nodeA) (= H(WEM+nodeA)). Later, for provisioning reasons, the share is
moved from nodeA to nodeB.

Readers may still be able to find the share in its new home, depending upon
how many servers are present in the grid, where the new nodeid lands in the
permuted index for this particular storage index, and how many servers the
reading client is willing to contact.

When a client attempts to write to this migrated share, it will get a "bad
write enabler" error, since the WE it computes for nodeB will not match the
WE(nodeA) that was embedded in the share. When this occurs, the "bad write
enabler" message must include the old nodeid (e.g. nodeA) that was in the
share.

The client then computes H(nodeB+H(WEM+nodeA)), which is the same as
H(nodeB+WE(nodeA)). The client sends this along with the new WE(nodeB), which
is H(WEM+nodeB). Note that the client only sends WE(nodeB) to nodeB, never to
anyone else. Also note that the client does not send a value to nodeB that
would allow the node to impersonate the client to a third node: everything
sent to nodeB will include something specific to nodeB in it.

The server locally computes H(nodeB+WE(nodeA)), using its own node id and the
old write enabler from the share. It compares this against the value supplied
by the client. If they match, this serves as proof that the client was able
to compute the old write enabler. The server then accepts the client's new
WE(nodeB) and writes it into the container.
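
The fixup exchange can be sketched as follows (plain SHA-256 again stands in
for the real tagged hashes, so treat this as a model of the message contents
rather than the actual wire format)::

  import hashlib

  def h(data):
      return hashlib.sha256(data).digest()

  # client side: runs after a "bad write enabler" error that names nodeA
  def client_fixup(write_enabler_master, old_nodeid, new_nodeid):
      old_we = h(write_enabler_master + old_nodeid)     # WE(nodeA)
      proof = h(new_nodeid + old_we)                    # H(nodeB+WE(nodeA))
      new_we = h(write_enabler_master + new_nodeid)     # WE(nodeB)
      return proof, new_we                              # sent only to nodeB

  # server side (nodeB): verify the proof before accepting the new WE
  def server_accept_fixup(my_nodeid, stored_old_we, proof, new_we):
      if h(my_nodeid + stored_old_we) != proof:
          return None        # client could not compute the old write enabler
      return new_we          # caller writes this into the container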

This WE-fixup process requires an extra round trip, and requires the error
message to include the old nodeid, but does not require any public key
operations on either client or server.

Migrating the leases will require a similar protocol. This protocol will be
defined concretely at a later date.

Code Details
------------

The MutableFileNode class is used to manipulate mutable files (as opposed to
ImmutableFileNodes). These are initially generated with
client.create_mutable_file(), and later recreated from URIs with
client.create_node_from_uri(). Instances of this class will contain a URI and
a reference to the client (for peer selection and connection).

NOTE: this section is out of date. Please see src/allmydata/interfaces.py
(the section on IMutableFilesystemNode) for more accurate information.

The methods of MutableFileNode are:

* download_to_data() -> [deferred] newdata, NotEnoughSharesError

  * if there are multiple retrievable versions in the grid, get() returns
    the first version it can reconstruct, and silently ignores the others.
    In the future, a more advanced API will signal and provide access to
    the multiple heads.

* update(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError
* overwrite(newdata) -> OK, UncoordinatedWriteError, NotEnoughSharesError

download_to_data() causes a new retrieval to occur, pulling the current
contents from the grid and returning them to the caller. At the same time,
this call caches information about the current version of the file. This
information will be used in a subsequent call to update(), and if another
change has occurred between the two, this information will be out of date,
triggering the UncoordinatedWriteError.

update() is therefore intended to be used just after a download_to_data(), in
the following pattern::

  d = mfn.download_to_data()
  d.addCallback(apply_delta)
  d.addCallback(mfn.update)

If the update() call raises UCW, then the application can simply return an
error to the user ("you violated the Prime Coordination Directive"), and they
can try again later. Alternatively, the application can attempt to retry on
its own. To accomplish this, the app needs to pause, download the new
(post-collision and post-recovery) form of the file, reapply its delta,
then submit the update request again. A randomized pause is necessary to
reduce the chances of colliding a second time with another client that is
doing exactly the same thing::

  d = mfn.download_to_data()
  d.addCallback(apply_delta)
  d.addCallback(mfn.update)
  def _retry(f):
      f.trap(UncoordinatedWriteError)
      d1 = pause(random.uniform(5, 20))
      d1.addCallback(lambda res: mfn.download_to_data())
      d1.addCallback(apply_delta)
      d1.addCallback(mfn.update)
      return d1
  d.addErrback(_retry)

Enthusiastic applications can retry multiple times, using a randomized
exponential backoff between each. A particularly enthusiastic application can
retry forever, but such apps are encouraged to provide the user with a means
of giving up after a while.

UCW does not mean that the update was not applied, so it is also a good idea
to skip the retry-update step if the delta was already applied::

  d = mfn.download_to_data()
  d.addCallback(apply_delta)
  d.addCallback(mfn.update)
  def _retry(f):
      f.trap(UncoordinatedWriteError)
      d1 = pause(random.uniform(5, 20))
      d1.addCallback(lambda res: mfn.download_to_data())
      def _maybe_apply_delta(contents):
          new_contents = apply_delta(contents)
          if new_contents != contents:
              return mfn.update(new_contents)
      d1.addCallback(_maybe_apply_delta)
      return d1
  d.addErrback(_retry)

update() is the right interface to use for delta-application situations, like
directory nodes (in which apply_delta might be adding or removing child
entries from a serialized table).

Note that any uncoordinated write has the potential to lose data. We must do
more analysis to be sure, but it appears that two clients who write to the
same mutable file at the same time (even if both eventually retry) will, with
high probability, result in one client observing UCW and the other silently
losing their changes. It is also possible for both clients to observe UCW.
The moral of the story is that the Prime Coordination Directive is there for
a reason, and that recovery/UCW/retry is not a substitute for write
coordination.

overwrite() tells the client to ignore this cached version information, and
to unconditionally replace the mutable file's contents with the new data.
This should not be used in delta application, but rather in situations where
you want to replace the file's contents with completely unrelated ones. When
raw files are uploaded into a mutable slot through the Tahoe-LAFS web-API
(using POST and the ?mutable=true argument), they are put in place with
overwrite().

The peer-selection and data-structure manipulation (and signing/verification)
steps will be implemented in a separate class in allmydata/mutable.py.

SDMF Slot Format
----------------

This SDMF data lives inside a server-side MutableSlot container. The server
is oblivious to this format.

This data is tightly packed. In particular, the share data is defined to run
all the way to the beginning of the encrypted private key (the encprivkey
offset is used both to terminate the share data and to begin the encprivkey).

::

   #    offset   size    name
   1    0        1       version byte, \x00 for this format
   2    1        8       sequence number. 2^64-1 must be handled specially, TBD
   3    9        32      "R" (root of share hash Merkle tree)
   4    41       16      IV (share data is AES(H(readkey+IV)) )
   5    57       18      encoding parameters:
         57       1        k
         58       1        N
         59       8        segment size
         67       8        data length (of original plaintext)
   6    75       32      offset table:
         75       4        (8) signature
         79       4        (9) share hash chain
         83       4        (10) block hash tree
         87       4        (11) share data
         91       8        (12) encrypted private key
         99       8        (13) EOF
   7    107      436ish  verification key (2048 RSA key)
   8    543ish   256ish  signature=RSAsign(sigkey, H(version+seqnum+r+IV+encparm))
   9    799ish   (a)     share hash chain, encoded as:
                          "".join([pack(">H32s", shnum, hash)
                                   for (shnum,hash) in needed_hashes])
  10    (927ish) (b)     block hash tree, encoded as:
                          "".join([pack(">32s",hash) for hash in block_hash_tree])
  11    (935ish) LEN     share data (no gap between this and encprivkey)
  12    ??       1216ish encrypted private key= AESenc(write-key, RSA-key)
  13    ??       --      EOF

(a) The share hash chain contains ceil(log(N)) hashes, each 32 bytes long.
    This is the set of hashes necessary to validate this share's leaf in the
    share Merkle tree. For N=10, this is 4 hashes, i.e. 128 bytes.
(b) The block hash tree contains ceil(length/segsize) hashes, each 32 bytes
    long. This is the set of hashes necessary to validate any given block of
    share data up to the per-share root "r". Each "r" is a leaf of the share
    hash tree (with root "R"), from which a minimal subset of hashes is put
    in the share hash chain in (8).
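
A decoding sketch of this layout, using the field widths from the table
above; big-endian integers are an assumption of the example::

  import struct

  PREFIX = struct.Struct(">B Q 32s 16s B B Q Q")   # version, seqnum, R, IV,
                                                   # k, N, segsize, datalen
  OFFSETS = struct.Struct(">L L L L Q Q")          # items (8)..(13) above

  def parse_sdmf_share(share):
      (version, seqnum, root_hash, iv,
       k, n, segsize, datalen) = PREFIX.unpack_from(share, 0)
      (o_sig, o_hashchain, o_blockhashes,
       o_sharedata, o_encprivkey, o_eof) = OFFSETS.unpack_from(share,
                                                               PREFIX.size)
      return {
          "version": version, "seqnum": seqnum, "R": root_hash, "IV": iv,
          "k": k, "N": n, "segment_size": segsize, "data_length": datalen,
          "verification_key": share[PREFIX.size + OFFSETS.size:o_sig],
          "signature": share[o_sig:o_hashchain],
          "share_hash_chain": share[o_hashchain:o_blockhashes],
          "block_hash_tree": share[o_blockhashes:o_sharedata],
          # share data runs right up to the encrypted private key
          "share_data": share[o_sharedata:o_encprivkey],
          "enc_privkey": share[o_encprivkey:o_eof],
      }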

Recovery
--------

The first line of defense against damage caused by colliding writes is the
Prime Coordination Directive: "Don't Do That".

The second line of defense is to keep "S" (the number of competing versions)
lower than N/k. If this holds true, at least one competing version will have
k shares and thus be recoverable. Note that server unavailability counts
against us here: the old version stored on the unavailable server must be
included in the value of S.

The third line of defense is our use of testv_and_writev() (described above),
which increases the convergence of simultaneous writes: one of the writers
will be favored (the one with the highest "R"), and that version is more
likely to be accepted than the others. This defense is least effective in the
pathological situation where S simultaneous writers are active, the one with
the lowest "R" writes to N-k+1 of the shares and then dies, then the one with
the next-lowest "R" writes to N-2k+1 of the shares and dies, etc, until the
one with the highest "R" writes to k-1 shares and dies. Any other sequencing
will allow the highest "R" to write to at least k shares and establish a new
revision.

The fourth line of defense is the fact that each client keeps writing until
at least one version has N shares. This uses additional servers, if
necessary, to make sure that either the client's version or some
newer/overriding version is highly available.

The fifth line of defense is the recovery algorithm, which seeks to make sure
that at least *one* version is highly available, even if that version is
somebody else's.

The write-shares-to-peers algorithm is as follows:

* permute peers according to storage index
* walk through peers, trying to assign one share per peer
* for each peer:

  * send testv_and_writev, using "old(seqnum+R) <= our(seqnum+R)" as the test

    * this means that we will overwrite any old versions, and we will
      overwrite simultaneous writers of the same version if our R is higher.
      We will not overwrite writers using a higher seqnum.

  * record the version that each share winds up with. If the write was
    accepted, this is our own version. If it was rejected, read the
    old_test_data to find out what version was retained.
  * if old_test_data indicates the seqnum was equal or greater than our
    own, mark the "Simultaneous Writes Detected" flag, which will eventually
    result in an error being reported to the writer (in their close() call).
  * build a histogram of "R" values
  * repeat until the histogram indicates that some version (possibly ours)
    has N shares. Use new servers if necessary.
  * If we run out of servers:

    * if there are at least shares-of-happiness of any one version, we're
      happy, so return. (the close() might still get an error)
    * not happy, need to reinforce something, goto RECOVERY
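
The outer loop above can be sketched as follows. ``peer.write()`` is a
hypothetical stand-in for the testv_and_writev exchange; it is assumed to
return the (seqnum, R) version that the peer's share ended up holding::

  from collections import Counter

  def push_until_stable(peers, my_version, n):
      tally = Counter()
      for peer in peers:
          landed_version = peer.write(my_version)
          tally[landed_version] += 1
          # stop as soon as some version (possibly ours) has N shares
          if tally[landed_version] >= n:
              return tally
      # ran out of servers: the caller checks shares-of-happiness and may
      # fall into the RECOVERY procedure below
      return tally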

Recovery:

* read all shares, count the versions, identify the recoverable ones,
  discard the unrecoverable ones.
* sort versions: locate max(seqnums), put all versions with that seqnum
  in the list, sort by number of outstanding shares. Then put our own
  version. (TODO: put versions with seqnum <max but >us ahead of us?).
* for each version:

  * attempt to recover that version
  * if not possible, remove it from the list, go to next one
  * if recovered, start at beginning of peer list, push that version,
    continue until N shares are placed
  * if pushing our own version, bump up the seqnum to one higher than
    the max seqnum we saw
  * if we run out of servers:

    * schedule retry and exponential backoff to repeat RECOVERY

  * admit defeat after some period? presumably the client will be shut down
    eventually, maybe keep trying (once per hour?) until then.


Medium Distributed Mutable Files
================================

These are just like the SDMF case, but:

* We actually take advantage of the Merkle hash tree over the blocks, by
  reading a single segment of data at a time (and its necessary hashes), to
  reduce the read-time alacrity.
* We allow arbitrary writes to any range of the file.
* We add more code to first read each segment that a write must modify.
  This looks exactly like the way a normal filesystem uses a block device,
  or how a CPU must perform a cache-line fill before modifying a single word.
* We might implement some sort of copy-based atomic update server call,
  to allow multiple writev() calls to appear atomic to any readers.

MDMF slots provide fairly efficient in-place edits of very large files (a few
GB). Appending data is also fairly efficient.
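
The read-modify-write bookkeeping mentioned above mostly reduces to deciding
which segments a write touches; a minimal sketch, assuming fixed-size
segments::

  def affected_segments(offset, length, segment_size=128 * 1024):
      """Return the segment numbers that must be read (and re-encoded)
      to apply a write of `length` bytes at `offset`."""
      if length <= 0:
          return range(0)
      first = offset // segment_size
      last = (offset + length - 1) // segment_size
      return range(first, last + 1)

For example, with 128-KiB segments a 10-byte write at offset 1,000,000 only
touches segment 7, so only that segment needs to be fetched, patched,
re-encoded, and rewritten.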


Large Distributed Mutable Files
===============================

LDMF slots (not implemented) would use a fundamentally different way to store
the file, inspired by Mercurial's "revlog" format. This would enable very
efficient insert/remove/replace editing of arbitrary spans. Multiple versions
of the file can be retained, in a revision graph that can have multiple heads.
Each revision can be referenced by a cryptographic identifier. There are two
forms of the URI, one that means "most recent version", and a longer one that
points to a specific revision.

Metadata can be attached to the revisions, like timestamps, to enable rolling
back an entire tree to a specific point in history.

LDMF1 provides deltas but tries to avoid dealing with multiple heads. LDMF2
provides explicit support for revision identifiers and branching.


TODO
====

improve allocate-and-write or get-writer-buckets API to allow one-call (or
maybe two-call) updates. The challenge is in figuring out which shares are on
which machines. First cut will have lots of round trips.

(eventually) define behavior when seqnum wraps. At the very least make sure
it can't cause a security problem. "the slot is worn out" is acceptable.

(eventually) define share-migration lease update protocol. Including the
nodeid who accepted the lease is useful, we can use the same protocol as we
do for updating the write enabler. However we need to know which lease to
update.. maybe send back a list of all old nodeids that we find, then try all
of them when we accept the update?

We now do this in a specially-formatted IndexError exception:
 "UNABLE to renew non-existent lease. I have leases accepted by " +
 "nodeids: '12345','abcde','44221' ."

confirm that a repairer can regenerate shares without the private key. Hmm,
without the write-enabler they won't be able to write those shares to the
servers.. although they could add immutable new shares to new servers.
|---|