 * If the exception occurs during an immutable-share write, that share will
   be broken. The client will detect this and will declare the upload a
   failure if too few shares can be placed (the "shares of happiness"
   threshold, which defaults to 7 out of 10); see the sketch after this
   list. The code does not yet search for new servers to replace the full
   ones. If the upload fails, the server's upload-already-in-progress
   routines may interfere with a subsequent upload.
 * If the exception occurs during a mutable-share write, the old share will
   be left in place (and a new home for the share will be sought). If enough
   old shares are left around, subsequent reads may see the file in its
   earlier state, known as a "rollback" fault. Writing a new version of the
   file should find the newer shares correctly, although it will take longer
   (more roundtrips) than usual.
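
To make the success criterion above concrete, here is a minimal sketch of the
"shares of happiness" decision. The names are hypothetical, not Tahoe's
actual code:

{{{
# Illustrative sketch only: does an immutable upload meet the
# "shares of happiness" threshold described in the list above?
def upload_succeeded(placed_share_numbers, happy=7, total=10):
    """Return True if at least `happy` of the `total` shares found homes."""
    assert happy <= total
    return len(set(placed_share_numbers)) >= happy

# Example: only 6 of 10 shares were placed before servers filled up,
# so the upload is declared a failure.
print(upload_succeeded({0, 1, 2, 3, 4, 5}))   # False
print(upload_succeeded(set(range(8))))        # True
}}}
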
If a v1.0 or v1.1.0 storage server runs out of disk space, its attempts to
write data to the local filesystem will fail. For immutable files, this will
not normally lead to any problem: the attempt to upload that share to that
server will fail, the partially uploaded share will be deleted from the
storage server's "incoming shares" directory, and the client will move on to
using another storage server instead.
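
As a rough illustration of the cleanup just described (a sketch with assumed
names, not Tahoe's real storage-server code), handling of a partially written
incoming share amounts to something like this:

{{{
import os

def write_incoming_share(incoming_path, share_bytes):
    # A full disk makes the write raise OSError (typically ENOSPC).
    try:
        with open(incoming_path, "wb") as f:
            f.write(share_bytes)
    except OSError:
        # Discard the partially written share from the "incoming" area
        # and let the client see the failure so it can try another server.
        if os.path.exists(incoming_path):
            os.remove(incoming_path)
        raise
}}}
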
The out-of-space handling code is not yet complete, and we do not yet have a
space-limiting solution that is suitable for large storage nodes. The
"sizelimit" configuration uses a {{{/usr/bin/du}}}-style query at node
startup, which takes a long time (tens of minutes) on storage nodes that
offer 100GB or more, making it unsuitable for highly-available servers.
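
For a sense of why that startup measurement is slow, the "sizelimit" check
amounts to a recursive walk over the entire storage directory, roughly like
the following sketch (not the actual implementation):

{{{
import os

def measure_used_space(storage_dir):
    """Sum the sizes of every stored file, du-style."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(storage_dir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
}}}
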
In contrast to the immutable case, if the write was an attempt to modify an
existing mutable file, a problem will result: when the attempt to write the
new share fails due to insufficient disk space, it will be aborted and the
old share will be left in place. If enough such old shares are left, then a
subsequent read may get those old shares and see the file in its earlier
state, which is a "rollback" failure. With the default parameters (3-of-10),
six old shares will be enough to potentially lead to a rollback failure.
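
As a worked example of the arithmetic behind that "six old shares" figure
(the share counts below are hypothetical):

{{{
k, n = 3, 10                      # default 3-of-10 encoding
new_shares_written = 4            # a hypothetical partially successful write
old_shares_remaining = n - new_shares_written
# A reader that happens to reach only servers holding old shares can
# reconstruct the previous version as soon as it gathers k of them.
print(old_shares_remaining >= k)  # True: a rollback is possible
}}}
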
=== issue 2: pyOpenSSL and/or Twisted defect resulting in false alarms in the unit tests ===

The combination of Twisted v8.1.0 and pyOpenSSL v0.7 causes the Tahoe v1.1
unit tests to fail, even though the Tahoe behavior being tested is correct.

==== how to manage it ====

If you are using Twisted v8.1.0 and pyOpenSSL v0.7, then please ignore XYZ in
XYZ. Downgrading to an older version of Twisted or pyOpenSSL will stop those
false alarms.
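
To check whether you are running the affected combination, the installed
versions can be inspected directly (this assumes both packages are importable
from the Python that runs your node):

{{{
import twisted
import OpenSSL   # pyOpenSSL

print(twisted.__version__)   # the affected version is 8.1.0
print(OpenSSL.__version__)   # the affected version is 0.7
}}}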


== issues in Tahoe v1.0.0, released 2008-03-25 ==

(Tahoe v1.0 was superseded by v1.1, which was released 2008-06-10.)

=== issue 3: server out of space when writing mutable file ===

In addition to the problems caused by insufficient disk space described
above, v1.0 clients that are writing mutable files when the servers fail to
write to their filesystem are likely to think the write succeeded when in
fact it failed. This can cause data loss.

==== how to manage it ====

Upgrade the client to v1.1, or make sure that the servers are always able to
write to their local filesystem (including having space available), as
described in "issue 1" above.


=== issue 4: server out of space when writing immutable file ===

Tahoe v1.0 clients that are using v1.0 servers which are unable to write to
their filesystem during an immutable upload will correctly detect the first
failure, but if they retry the upload without restarting the client, or if
another client attempts to upload the same file, the second upload may appear
to succeed when it hasn't, which can lead to data loss.

==== how to manage it ====

Upgrading either or both of the client and the server to v1.1 will fix this
issue. It can also be avoided by ensuring that the servers are always able to
write to their local filesystem (including having space available), as
described in "issue 1" above.
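
Because a failed upload can falsely appear to succeed (issues 3 and 4), a
cautious client-side workaround is to read the file back and compare digests.
This is a generic sketch, not a Tahoe API; fetch_from_grid is a placeholder
for however you retrieve the file (for example, through the node's web
gateway):

{{{
import hashlib

def verify_roundtrip(original_bytes, fetch_from_grid):
    """Return True if the stored copy matches what was uploaded."""
    retrieved = fetch_from_grid()
    return hashlib.sha256(retrieved).digest() == hashlib.sha256(original_bytes).digest()
}}}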


=== issue 5: large directories or mutable files in a specific range of sizes ===

If a client attempts to upload a mutable file whose size is greater than
about 3,139,000 bytes and less than or equal to 3,500,000 bytes, the upload
will fail but appear to succeed, which can lead to data loss.

(Mutable files larger than 3,500,000 bytes are refused outright.) The symptom
of the failure is very high memory usage (3 GB of memory) and 100% CPU for
about 5 minutes before the upload appears to succeed, although it hasn't.

Directories are stored in mutable files, and a directory of approximately
9000 entries may fall into this range of mutable file sizes (depending on the
size of the filenames or other metadata associated with the entries).
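
A small helper for spotting files or directories in the affected range (the
bounds are taken from the text above; the function itself is illustrative):

{{{
def in_affected_mutable_size_range(size_in_bytes):
    """True for the v1.0 mutable-file sizes that fail but appear to succeed."""
    return 3139000 < size_in_bytes <= 3500000

print(in_affected_mutable_size_range(3200000))   # True
print(in_affected_mutable_size_range(3600000))   # False: refused outright
}}}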

==== how to manage it ====

This was fixed in v1.1, under ticket #379. If the client is upgraded to v1.1,
it will fail cleanly instead of falsely appearing to succeed when it tries to
write a file whose size is in this range. If the server is also upgraded to
v1.1, then writes of mutable files whose size is in this range will succeed.
(If the server is upgraded to v1.1 but the client is still v1.0, the client
will still suffer this failure.)

=== issue 6: pycryptopp defect resulting in data corruption ===

Versions of pycryptopp earlier than pycryptopp-0.5.0 had a defect which, when
compiled with some compilers, would cause AES-256 encryption and decryption
to be computed incorrectly. This could cause data corruption. Tahoe v1.0
required, and came with a bundled copy of, pycryptopp v0.3.

==== how to manage it ====

You can detect whether pycryptopp-0.3 has this failure when it is compiled by
your compiler. Run the unit tests that come with pycryptopp-0.3: unpack the
"pycryptopp-0.3.tar" file that comes in the Tahoe v1.0 {{{misc/dependencies}}}
directory, cd into the resulting {{{pycryptopp-0.3.0}}} directory, and execute
{{{python ./setup.py test}}}. If the tests pass, then your compiler does not
trigger this failure.

Tahoe v1.1 requires, and comes with a bundled copy of, pycryptopp v0.5.1,
which does not have this defect.
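
To confirm which pycryptopp your node is actually importing (this assumes
your copy of pycryptopp exposes a {{{__version__}}} attribute):

{{{
import pycryptopp
# Anything earlier than 0.5.0 is affected; Tahoe v1.1 bundles 0.5.1.
print(pycryptopp.__version__)
}}}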