.. -*- coding: utf-8-with-signature -*-

==========================
Tahoe-LAFS Directory Nodes
==========================

As explained in the architecture docs, Tahoe-LAFS can be roughly viewed as
a collection of three layers. The lowest layer is the key-value store: it
provides operations that accept files and upload them to the grid, creating
a URI in the process which securely references the file's contents.
The middle layer is the file store, creating a structure of directories and
filenames resembling the traditional Unix or Windows filesystems. The top
layer is the application layer, which uses the lower layers to provide useful
services to users, like a backup application, or a way to share files with
friends.

This document examines the middle layer, the "file store".

1.  `Key-value Store Primitives`_
2.  `File Store Goals`_
3.  `Dirnode Goals`_
4.  `Dirnode secret values`_
5.  `Dirnode storage format`_
6.  `Dirnode sizes, mutable-file initial read sizes`_
7.  `Design Goals, redux`_

    1. `Confidentiality leaks in the storage servers`_
    2. `Integrity failures in the storage servers`_
    3. `Improving the efficiency of dirnodes`_
    4. `Dirnode expiration and leases`_

8.  `Starting Points: root dirnodes`_
9.  `Mounting and Sharing Directories`_
10. `Revocation`_

Key-value Store Primitives
==========================

In the lowest layer (key-value store), there are two operations that reference
immutable data (which we refer to as "CHK URIs" or "CHK read-capabilities" or
"CHK read-caps"). One puts data into the grid (but only if it doesn't exist
already), the other retrieves it::

 chk_uri = put(data)
 data = get(chk_uri)

We also have three operations which reference mutable data (which we refer to
as "mutable slots", or "mutable write-caps and read-caps", or sometimes "SSK
slots"). One creates a slot with some initial contents, a second replaces the
contents of a pre-existing slot, and the third retrieves the contents::

 mutable_uri = create(initial_data)
 replace(mutable_uri, new_data)
 data = get(mutable_uri)

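These five primitives can be sketched as a toy in-memory grid. This is an
illustration only: the URI formats and storage here are stand-ins (a content
hash for CHK caps, a random token for mutable slots), not Tahoe-LAFS's real
capability derivation:

.. code-block:: python

    import hashlib
    import os

    class ToyGrid:
        """In-memory stand-in for the key-value store layer (illustration only)."""

        def __init__(self):
            self._immutable = {}   # chk_uri -> data
            self._mutable = {}     # mutable_uri -> data

        # --- immutable (CHK) operations ---
        def put(self, data):
            # CHK caps are derived from the content, so identical data
            # yields the same URI and is stored only once.
            chk_uri = "URI:CHK:" + hashlib.sha256(data).hexdigest()
            self._immutable.setdefault(chk_uri, data)
            return chk_uri

        # --- mutable (SSK) operations ---
        def create(self, initial_data):
            # Real slots are keyed by an RSA keypair; a random token stands in here.
            mutable_uri = "URI:SSK:" + os.urandom(16).hex()
            self._mutable[mutable_uri] = initial_data
            return mutable_uri

        def replace(self, mutable_uri, new_data):
            self._mutable[mutable_uri] = new_data

        def get(self, uri):
            if uri.startswith("URI:CHK:"):
                return self._immutable[uri]
            return self._mutable[uri]

Note how ``put`` is idempotent (uploading the same bytes twice returns the
same URI), while ``create`` always mints a fresh slot.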
File Store Goals
================

The main goal for the middle (file store) layer is to give users a way to
organize the data that they have uploaded into the grid. The traditional way
to do this in computer filesystems is to put this data into files, give those
files names, and collect these names into directories.

Each directory is a set of name-entry pairs, each of which maps a "child name"
to a directory entry pointing to an object of some kind. Those child objects
might be files, or they might be other directories. Each directory entry also
contains metadata.

The directory structure is therefore a directed graph of nodes, in which each
node might be a directory node or a file node. All file nodes are terminal
nodes.

Dirnode Goals
=============

What properties might be desirable for these directory nodes? In no
particular order:

1. functional. Code which does not work doesn't count.
2. easy to document, explain, and understand
3. confidential: it should not be possible for others to see the contents of
   a directory
4. integrity: it should not be possible for others to modify the contents
   of a directory
5. available: directories should survive host failure, just like files do
6. efficient: in storage, communication bandwidth, number of round-trips
7. easy to delegate individual directories in a flexible way
8. updateness: everybody looking at a directory should see the same contents
9. monotonicity: everybody looking at a directory should see the same
   sequence of updates

Some of these goals are mutually exclusive. For example, availability and
consistency are opposing, so it is not possible to achieve #5 and #8 at the
same time. Moreover, it takes a more complex architecture to get close to the
available-and-consistent ideal, so #2/#6 is in opposition to #5/#8.

Tahoe-LAFS v0.7.0 introduced distributed mutable files, which use public-key
cryptography for integrity, and erasure coding for availability. These
achieve roughly the same properties as immutable CHK files, but their
contents can be replaced without changing their identity. Dirnodes are then
just a special way of interpreting the contents of a specific mutable file.
Earlier releases used a "vdrive server": this server was abolished in the
v0.7.0 release.

For details of how mutable files work, please see :doc:`mutable`.

For releases since v0.7.0, we achieve most of our desired properties. The
integrity and availability of dirnodes is equivalent to that of regular
(immutable) files, with the exception that there are more simultaneous-update
failure modes for mutable slots. Delegation is quite strong: you can give
read-write or read-only access to any subtree, and the data format used for
dirnodes is such that read-only access is transitive: i.e. if you grant Bob
read-only access to a parent directory, then Bob will get read-only access
(and *not* read-write access) to its children.

Relative to the previous "vdrive server"-based scheme, the current
distributed dirnode approach gives better availability, but cannot guarantee
updateness quite as well, and requires far more network traffic for each
retrieval and update. Mutable files are somewhat less available than
immutable files, simply because of the increased number of combinations
(shares of an immutable file are either present or not, whereas there are
multiple versions of each mutable file, and you might have some shares of
version 1 and other shares of version 2). In extreme cases of simultaneous
update, mutable files might suffer from non-monotonicity.


Dirnode secret values
=====================

As mentioned before, dirnodes are simply a special way to interpret the
contents of a mutable file, so the secret keys and capability strings
described in :doc:`mutable` are all the same. Each dirnode contains an RSA
public/private keypair, and the holder of the "write capability" will be able
to retrieve the private key (as well as the AES encryption key used for the
data itself). The holder of the "read capability" will be able to obtain the
public key and the AES data key, but not the RSA private key needed to modify
the data.

The "write capability" for a dirnode grants read-write access to its
contents. This is expressed in concrete form as the "dirnode write cap": a
printable string which contains the necessary secrets to grant this access.
Likewise, the "read capability" grants read-only access to a dirnode, and can
be represented by a "dirnode read cap" string.

For example,
URI:DIR2:swdi8ge1s7qko45d3ckkyw1aac:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
is a write-capability URI, while
URI:DIR2-RO:buxjqykt637u61nnmjg7s8zkny:ar8r5j99a4mezdojejmsfp4fj1zeky9gjigyrid4urxdimego68o
is a read-capability URI, both for the same dirnode.


Dirnode storage format
======================

Each dirnode is stored in a single mutable file, distributed in the Tahoe-LAFS
grid. The contents of this file are a serialized list of netstrings, one per
child. Each child is a list of four netstrings: (name, rocap, rwcap,
metadata). (Remember that the contents of the mutable file are encrypted by
the read-cap, so this section describes the plaintext contents of the mutable
file, *after* it has been decrypted by the read-cap.)

The name is simply a UTF-8-encoded child name. The 'rocap' is a read-only
capability URI to that child, either an immutable (CHK) file, a mutable file,
or a directory. It is also possible to store 'unknown' URIs that are not
recognized by the current version of Tahoe-LAFS. The 'rwcap' is a read-write
capability URI for that child, encrypted with the dirnode's write-cap: this
enables the "transitive readonlyness" property, described further below. The
'metadata' is a JSON-encoded dictionary of type,value metadata pairs. Some
metadata keys are pre-defined, the rest are left up to the application.

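The serialization can be sketched in a few lines of Python. The exact
framing below (an outer netstring wrapping the four inner netstrings) is
inferred from the size arithmetic later in this document, so treat it as an
approximation rather than the authoritative wire format:

.. code-block:: python

    import json

    def netstring(data: bytes) -> bytes:
        """Encode bytes in netstring framing: b"<len>:<data>,"."""
        return str(len(data)).encode() + b":" + data + b","

    def encode_child(name, rocap, encrypted_rwcap, metadata):
        """Sketch of one dirnode child entry: an outer netstring wrapping
        four inner netstrings (name, rocap, rwcap, metadata)."""
        body = (netstring(name.encode("utf-8")) +
                netstring(rocap) +
                netstring(encrypted_rwcap) +
                netstring(json.dumps(metadata).encode("utf-8")))
        return netstring(body)

Note that for a two-digit length prefix, ``netstring(x)`` costs
``4 + len(x)`` bytes, matching the size estimates used below.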
Each rwcap is stored as IV + ciphertext + MAC. The IV is a 16-byte random
value. The ciphertext is obtained by using AES in CTR mode on the rwcap URI
string, using a key that is formed from a tagged hash of the IV and the
dirnode's writekey. The MAC is written only for compatibility with older
Tahoe-LAFS versions and is no longer verified.

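A minimal sketch of this layout, assuming the fixed sizes given in this
document (16-byte IV, 32-byte trailing MAC); the hash tag string in the
key-derivation helper is a placeholder, not Tahoe-LAFS's actual tag:

.. code-block:: python

    import hashlib

    IV_LEN, MAC_LEN = 16, 32

    def split_encrypted_rwcap(blob: bytes):
        """Split the stored rwcap field into its three parts: a 16-byte IV,
        the AES-CTR ciphertext, and the trailing 32-byte legacy MAC (which
        modern Tahoe-LAFS no longer verifies)."""
        iv = blob[:IV_LEN]
        ciphertext = blob[IV_LEN:-MAC_LEN]
        mac = blob[-MAC_LEN:]
        return iv, ciphertext, mac

    def rwcap_key(writekey: bytes, iv: bytes) -> bytes:
        # Tagged hash of the IV and the dirnode's writekey; this tag string
        # is illustrative only.
        return hashlib.sha256(b"rwcap-key-tag:" + writekey + iv).digest()

Because the key depends on the writekey, only write-cap holders can decrypt
the rwcap slot; read-cap holders see just the IV, ciphertext, and MAC.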
If Bob has read-only access to the 'bar' directory, and he adds it as a child
to the 'foo' directory, then he will put the read-only cap for 'bar' in both
the rwcap and rocap slots (encrypting the rwcap contents as described above).
If he has full read-write access to 'bar', then he will put the read-write
cap in the 'rwcap' slot, and the read-only cap in the 'rocap' slot. Since
other users who have read-only access to 'foo' will be unable to decrypt its
rwcap slot, this limits those users to read-only access to 'bar' as well,
thus providing the transitive readonlyness that we desire.

Dirnode sizes, mutable-file initial read sizes
==============================================

How big are dirnodes? When reading dirnode data out of mutable files, how
large should our initial read be? If we guess exactly, we can read a dirnode
in a single round-trip, and update one in two RTT. If we guess too high,
we'll waste some amount of bandwidth. If we guess low, we need to make a
second pass to get the data (or the encrypted privkey, for writes), which
will cost us at least another RTT.

Assuming child names are between 10 and 99 characters long, how long are the
various pieces of a dirnode?

::

 netstring(name) ~= 4+len(name)
 chk-cap = 97 (for 4-char filesizes)
 dir-rw-cap = 88
 dir-ro-cap = 91
 netstring(cap) = 4+len(cap)
 encrypted(cap) = 16+cap+32
 JSON({}) = 2
 JSON({ctime=float,mtime=float,'tahoe':{linkcrtime=float,linkmotime=float}}): 137
 netstring(metadata) = 4+137 = 141

so a CHK entry is::

 5+ 4+len(name) + 4+97 + 5+16+97+32 + 4+137

And a 15-byte filename gives a 416-byte entry. When the entry points at a
subdirectory instead of a file, the entry is a little bit smaller. So an
empty directory uses 0 bytes, a directory with one child uses about 416
bytes, a directory with two children uses about 832, etc.

When the dirnode data is encoded using our default 3-of-10 erasure coding,
that means we get 139ish bytes of data in each share per child.

The pubkey, signature, and hashes form the first 935ish bytes of the
container, then comes our data, then about 1216 bytes of encprivkey. So if we
read the first::

 1kB: we get 65 bytes of dirnode data: only empty directories
 2kB: 1065 bytes: about 8 entries
 3kB: 2065 bytes: about 15 entries, or 6 entries plus the encprivkey
 4kB: 3065 bytes: about 22 entries, or about 13 plus the encprivkey

So we've written the code to do an initial read of 4kB from each share when
we read the mutable file, which should give good performance (one RTT) for
small directories.


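The entry-size arithmetic above is easy to reproduce; this small helper
confirms the 416-byte figure for a 15-byte filename and the ~139 bytes per
share per child under the default 3-of-10 encoding:

.. code-block:: python

    def chk_entry_size(name_len: int) -> int:
        """Reproduce the CHK-entry size formula from the text:
        outer netstring overhead + name + rocap + encrypted rwcap + metadata."""
        name = 4 + name_len          # netstring(name)
        rocap = 4 + 97               # netstring(chk read-cap)
        rwcap = 5 + 16 + 97 + 32     # netstring(IV + ciphertext + MAC)
        metadata = 4 + 137           # netstring(JSON metadata)
        return 5 + name + rocap + rwcap + metadata

    def share_bytes_per_child(name_len: int = 15) -> float:
        # With 3-of-10 erasure coding, each share holds ~1/3 of the data.
        return chk_entry_size(name_len) / 3
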
Design Goals, redux
===================

How well does this design meet the goals?

1. functional: YES: the code works and has extensive unit tests
2. documentable: YES: this document is the existence proof
3. confidential: YES: see below
4. integrity: MOSTLY: a coalition of storage servers can roll back individual
   mutable files, but a single server cannot. No server can substitute fake
   data as genuine.
5. availability: YES: as long as 'k' storage servers are present and have
   the same version of the mutable file, the dirnode will be available.
6. efficient: MOSTLY:

   network: a single dirnode lookup is very efficient, since clients can
   fetch specific keys rather than being required to get or set the entire
   dirnode each time. Traversing many directories takes a lot of roundtrips,
   and these can't be collapsed with promise-pipelining because the
   intermediate values must only be visible to the client. Modifying many
   dirnodes at once (e.g. importing a large pre-existing directory tree) is
   pretty slow, since each graph edge must be created independently.

   storage: each child has a separate IV, which makes them larger than if
   all children were aggregated into a single encrypted string.
7. delegation: VERY: each dirnode is a completely independent object, to
   which clients can be granted separate read-write or read-only access
8. updateness: VERY: with only a single point of access, and no caching,
   each client operation starts by fetching the current value, so there are
   no opportunities for staleness
9. monotonicity: VERY: the single point of access also protects against
   retrograde motion


Confidentiality leaks in the storage servers
--------------------------------------------

Dirnodes (and the mutable files upon which they are based) are very private
against other clients: traffic between the client and the storage servers is
protected by the Foolscap SSL connection, so they can observe very little.
Storage index values are hashes of secrets and thus unguessable, and they are
not made public, so other clients cannot snoop through encrypted dirnodes
that they have not been told about.

Storage servers can observe access patterns and see ciphertext, but they
cannot see the plaintext (of child names, metadata, or URIs). If an attacker
operates a significant number of storage servers, they can infer the shape of
the directory structure by assuming that directories are usually accessed
from root to leaf in rapid succession. Since filenames are usually much
shorter than read-caps and write-caps, the attacker can use the length of the
ciphertext to guess the number of children of each node, and might be able to
guess the length of the child names (or at least their sum). From this, the
attacker may be able to build up a graph with the same shape as the plaintext
file store, but with unlabeled edges and unknown file contents.


Integrity failures in the storage servers
-----------------------------------------

The mutable file's integrity mechanism (an RSA signature on the hash of the
file contents) prevents the storage servers from modifying the dirnode's
contents without detection. Therefore the storage servers can make the
dirnode unavailable, but not corrupt it.

A sufficient number of colluding storage servers can perform a rollback
attack: replace all shares of the whole mutable file with an earlier version.
To prevent this, when retrieving the contents of a mutable file, the client
queries more servers than necessary and uses the highest available version
number. This ensures that one or two misbehaving storage servers cannot
cause this rollback on their own.

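The version-selection rule described above amounts to taking the maximum
version number over an over-sampled set of servers; a trivial sketch
(server names are illustrative):

.. code-block:: python

    def choose_version(server_responses):
        """Given (server, version_number) pairs from an over-query of
        storage servers, return the highest version seen -- the rule the
        text describes for resisting rollback by a few misbehaving
        servers."""
        return max(version for _server, version in server_responses)

A lone rolled-back server reporting version 3 is simply outvoted by any
honest server still reporting version 5.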
Improving the efficiency of dirnodes
------------------------------------

The current mutable-file-based dirnode scheme suffers from certain
inefficiencies. A very large directory (with thousands or millions of
children) will take a significant time to extract any single entry, because
the whole file must be downloaded first, then parsed and searched to find the
desired child entry. Likewise, modifying a single child requires the whole
file to be re-uploaded.

The current design assumes (and in some cases, requires) that dirnodes remain
small. The mutable files on which dirnodes are based are currently using
"SDMF" ("Small Distributed Mutable File") design rules, which state that the
size of the data shall remain below one megabyte. More advanced forms of
mutable files (MDMF and LDMF) are in the design phase to allow efficient
manipulation of larger mutable files. This would reduce the work needed to
modify a single entry in a large directory.

Judicious caching may help improve the reading-large-directory case. Some
form of mutable index at the beginning of the dirnode might help as well. The
MDMF design rules allow for efficient random-access reads from the middle of
the file, which would give the index something useful to point at.

The current SDMF design generates a new RSA public/private keypair for each
directory. This takes some time and CPU effort (around 100 milliseconds on a
relatively high-end 2021 laptop) per directory. We have designed (but not yet
built) a DSA-based mutable file scheme which will use shared parameters to
reduce the directory-creation effort to a bare minimum (picking a random
number instead of generating two random primes).

When a backup program is run for the first time, it needs to copy a large
amount of data from a pre-existing local filesystem into reliable storage.
This means that a large and complex directory structure needs to be
duplicated in the dirnode layer. With the one-object-per-dirnode approach
described here, this requires as many operations as there are edges in the
imported filesystem graph.

Another approach would be to aggregate multiple directories into a single
storage object. This object would contain a serialized graph rather than a
single name-to-child dictionary. Most directory operations would fetch the
whole block of data (and presumably cache it for a while to avoid lots of
re-fetches), and modification operations would need to replace the whole
thing at once. This "realm" approach would have the added benefit of
combining more data into a single encrypted bundle (perhaps hiding the shape
of the graph from a determined attacker), and would reduce round-trips when
performing deep directory traversals (assuming the realm was already cached).
It would also prevent fine-grained rollback attacks from working: a coalition
of storage servers could change the entire realm to look like an earlier
state, but it could not independently roll back individual directories.

The drawbacks of this aggregation would be that small accesses (adding a
single child, looking up a single child) would require pulling or pushing a
lot of unrelated data, increasing network overhead (and necessitating
test-and-set semantics for the modification side, which increases the chances
that a user operation will fail, making it more challenging to provide
promises of atomicity to the user).

It would also make it much more difficult to enable the delegation
("sharing") of specific directories. Since each aggregate "realm" provides
all-or-nothing access control, the act of delegating any directory from the
middle of the realm would require the realm first be split into the upper
piece that isn't being shared and the lower piece that is. This splitting
would have to be done in response to what is essentially a read operation,
which is not traditionally supposed to be a high-effort action. On the other
hand, it may be possible to aggregate the ciphertext, but use distinct
encryption keys for each component directory, to get the benefits of both
schemes at once.


Dirnode expiration and leases
-----------------------------

Dirnodes are created any time a client wishes to add a new directory. How
long do they live? What's to keep them from sticking around forever, taking
up space that nobody can reach any longer?

Mutable files are created with limited-time "leases", which keep the shares
alive until the last lease has expired or been cancelled. Clients which know
and care about specific dirnodes can ask to keep them alive for a while, by
renewing a lease on them (with a typical period of one month). Clients are
expected to assist in the deletion of dirnodes by cancelling their leases as
soon as they are done with them. This means that when a client unlinks a
directory, it should also cancel its lease on that directory. When the lease
count on a given share goes to zero, the storage server can delete the
related storage. Multiple clients may all have leases on the same dirnode:
the server may delete the shares only after all of the leases have gone away.

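The lease accounting described above can be modeled as a per-share table of
expiration times; this toy sketch (the names and the one-month default are
illustrative) shows when a server may delete a share:

.. code-block:: python

    class ShareLeases:
        """Toy model of per-share lease accounting: the server may delete a
        share only after every client's lease has expired or been
        cancelled."""

        def __init__(self):
            self._leases = {}   # client_id -> expiration time (seconds)

        def renew(self, client_id, now, period=31 * 24 * 3600):
            # Typical renewal period is about one month.
            self._leases[client_id] = now + period

        def cancel(self, client_id):
            self._leases.pop(client_id, None)

        def deletable(self, now):
            # Drop expired leases, then check whether any remain.
            self._leases = {c: t for c, t in self._leases.items() if t > now}
            return not self._leases

As the text notes, a share with leases from multiple clients survives until
every one of those leases has been cancelled or allowed to expire.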
We expect that clients will periodically create a "manifest": a list of
so-called "refresh capabilities" for all of the dirnodes and files that they
can reach. They will give this manifest to the "repairer", which is a service
that keeps files (and dirnodes) alive on behalf of clients who cannot take on
this responsibility for themselves. These refresh capabilities include the
storage index, but do *not* include the readkeys or writekeys, so the
repairer does not get to read the files or directories that it is helping to
keep alive.

After each change to the user's file store, the client creates a manifest and
looks for differences from their previous version. Anything which was removed
prompts the client to send out lease-cancellation messages, allowing the data
to be deleted.


Starting Points: root dirnodes
==============================

Any client can record the URI of a directory node in some external form (say,
in a local file) and use it as the starting point of later traversal. Each
Tahoe-LAFS user is expected to create a new (unattached) dirnode when they
first start using the grid, and record its URI for later use.

Mounting and Sharing Directories
================================

The biggest benefit of this dirnode approach is that sharing individual
directories is almost trivial. Alice creates a subdirectory that she wants
to use to share files with Bob. This subdirectory is attached to Alice's
file store at "alice:shared-with-bob". She asks her file store for the
read-only directory URI for that new directory, and emails it to Bob. When
Bob receives the URI, he attaches the given URI into one of his own
directories, perhaps at a place named "bob:shared-with-alice". Every time
Alice writes a file into this directory, Bob will be able to read it.
(It is also possible to share read-write URIs between users, but that makes
it difficult to follow the `Prime Coordination Directive`_.) Neither
Alice nor Bob will get access to any files above the mounted directory:
there are no 'parent directory' pointers. If Alice creates a nested set of
directories, "alice:shared-with-bob/subdir2", and gives a read-only URI to
shared-with-bob to Bob, then Bob will be unable to write to either
shared-with-bob/ or subdir2/.

.. _`Prime Coordination Directive`: ../write_coordination.rst

A suitable UI needs to be created to allow users to easily perform this
sharing action: dragging a folder from their file store to an IM or email
user icon, for example. The UI will need to give the sending user an
opportunity to indicate whether they want to grant read-write or read-only
access to the recipient. The recipient then needs an interface to drag the
new folder into their file store and give it a home.

Revocation
==========

When Alice decides that she no longer wants Bob to be able to access the
shared directory, what should she do? Suppose she's shared this folder with
both Bob and Carol, and now she wants Carol to retain access to it but Bob to
be shut out. Ideally Carol should not have to do anything: her access should
continue unabated.

The current plan is to have her client create a deep copy of the folder in
question, delegate access to the new folder to the remaining members of the
group (Carol), and ask the lucky survivors to replace their old reference
with the new one. Bob may still have access to the old folder, but he is now
the only one who cares: everyone else has moved on, and he will no longer be
able to see their new changes. In a strict sense, this is the strongest form
of revocation that can be accomplished: there is no point trying to force Bob
to forget about the files that he read a moment before being kicked out. In
addition it must be noted that anyone who can access the directory can proxy
for Bob, reading files to him and accepting changes whenever he wants.
Preventing delegation between communicating parties is just as pointless as
asking Bob to forget previously accessed files. However, there may be value
in configuring the UI to ask Carol not to share files with Bob, or in
removing all files from Bob's view at the same time his access is revoked.
|---|