 * If the exception occurs during an immutable-share write, that share will
   be broken. The client will detect this and will declare the upload a
   failure if too few shares can be placed (the "shares of happiness"
   threshold, which defaults to 7 out of 10); see the sketch after this
   list. The code does not yet search for new servers to replace the full
   ones. If the upload fails, the server's upload-already-in-progress
   routines may interfere with a subsequent upload.
 * If the exception occurs during a mutable-share write, the old share will
   be left in place (and a new home for the share will be sought). If enough
   old shares are left around, subsequent reads may see the file in its
   earlier state, known as a "rollback" fault. Writing a new version of the
   file should find the newer shares correctly, although it will take longer
   (more roundtrips) than usual.
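
To make the success criterion above concrete, here is a minimal sketch of the
"shares of happiness" decision. The names are hypothetical, not Tahoe's
actual code:

{{{
# Illustrative sketch only: does an immutable upload meet the
# "shares of happiness" threshold described in the list above?
def upload_succeeded(placed_share_numbers, happy=7, total=10):
    """Return True if at least `happy` of the `total` shares found homes."""
    assert happy <= total
    return len(set(placed_share_numbers)) >= happy

# Example: only 6 of 10 shares were placed before servers filled up,
# so the upload is declared a failure.
print(upload_succeeded({0, 1, 2, 3, 4, 5}))   # False
print(upload_succeeded(set(range(8))))        # True
}}}
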
If a v1.0 or v1.1.0 storage server runs out of disk space, its attempts to
write data to the local filesystem will fail. For immutable files, this will
not normally lead to any problem: the attempt to upload that share to that
server will fail, the partially uploaded share will be deleted from the
storage server's "incoming shares" directory, and the client will move on to
using another storage server instead.
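
As a rough illustration of the cleanup just described (a sketch with assumed
names, not Tahoe's real storage-server code), handling of a partially written
incoming share amounts to something like this:

{{{
import os

def write_incoming_share(incoming_path, share_bytes):
    # A full disk makes the write raise OSError (typically ENOSPC).
    try:
        with open(incoming_path, "wb") as f:
            f.write(share_bytes)
    except OSError:
        # Discard the partially written share from the "incoming" area
        # and let the client see the failure so it can try another server.
        if os.path.exists(incoming_path):
            os.remove(incoming_path)
        raise
}}}
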
The out-of-space handling code is not yet complete, and we do not yet have a
space-limiting solution that is suitable for large storage nodes. The
"sizelimit" configuration uses a {{{/usr/bin/du}}}-style query at node
startup, which takes a long time (tens of minutes) on storage nodes that
offer 100GB or more, making it unsuitable for highly-available servers.
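
For a sense of why that startup measurement is slow, the "sizelimit" check
amounts to a recursive walk over the entire storage directory, roughly like
the following sketch (not the actual implementation):

{{{
import os

def measure_used_space(storage_dir):
    """Sum the sizes of every stored file, du-style."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(storage_dir):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
}}}
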
In contrast to the immutable case, if the write was an attempt to modify an
existing mutable file, a problem will result: when the attempt to write the
new share fails due to insufficient disk space, it will be aborted and the
old share will be left in place. If enough such old shares are left, then a
subsequent read may get those old shares and see the file in its earlier
state, which is a "rollback" failure. With the default parameters (3-of-10),
six old shares will be enough to potentially lead to a rollback failure.
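
As a worked example of the arithmetic behind that "six old shares" figure
(the share counts below are hypothetical):

{{{
k, n = 3, 10                      # default 3-of-10 encoding
new_shares_written = 4            # a hypothetical partially successful write
old_shares_remaining = n - new_shares_written
# A reader that happens to reach only servers holding old shares can
# reconstruct the previous version as soon as it gathers k of them.
print(old_shares_remaining >= k)  # True: a rollback is possible
}}}
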
=== issue 2: pyOpenSSL and/or Twisted defect resulting in false alarms in the unit tests ===

The combination of Twisted v8.1.0 and pyOpenSSL v0.7 causes the Tahoe v1.1
unit tests to fail, even though the Tahoe behavior being tested is correct.

==== how to manage it ====

If you are using Twisted v8.1.0 and pyOpenSSL v0.7, then please ignore XYZ in
XYZ. Downgrading to an older version of Twisted or pyOpenSSL will stop those
false alarms.
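
To check whether you are running the affected combination, the installed
versions can be inspected directly (this assumes both packages are importable
from the Python that runs your node):

{{{
import twisted
import OpenSSL   # pyOpenSSL

print(twisted.__version__)   # the affected version is 8.1.0
print(OpenSSL.__version__)   # the affected version is 0.7
}}}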


== issues in Tahoe v1.0.0, released 2008-03-25 ==

(Tahoe v1.0 was superseded by v1.1, which was released 2008-06-10.)

=== issue 3: server out of space when writing mutable file ===

In addition to the problems caused by insufficient disk space described
above, v1.0 clients that are writing mutable files when the servers fail to
write to their filesystem are likely to think the write succeeded when in
fact it failed. This can cause data loss.

==== how to manage it ====

Upgrade the client to v1.1, or make sure that the servers are always able to
write to their local filesystem (including having space available), as
described in "issue 1" above.


=== issue 4: server out of space when writing immutable file ===

Tahoe v1.0 clients that are using v1.0 servers which are unable to write to
their filesystem during an immutable upload will correctly detect the first
failure, but if they retry the upload without restarting the client, or if
another client attempts to upload the same file, the second upload may appear
to succeed when it hasn't, which can lead to data loss.

==== how to manage it ====

Upgrading either or both of the client and the server to v1.1 will fix this
issue. It can also be avoided by ensuring that the servers are always able to
write to their local filesystem (including having space available), as
described in "issue 1" above.
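
Because a failed upload can falsely appear to succeed (issues 3 and 4), a
cautious client-side workaround is to read the file back and compare digests.
This is a generic sketch, not a Tahoe API; fetch_from_grid is a placeholder
for however you retrieve the file (for example, through the node's web
gateway):

{{{
import hashlib

def verify_roundtrip(original_bytes, fetch_from_grid):
    """Return True if the stored copy matches what was uploaded."""
    retrieved = fetch_from_grid()
    return hashlib.sha256(retrieved).digest() == hashlib.sha256(original_bytes).digest()
}}}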


=== issue 5: large directories or mutable files in a specific range of sizes ===

If a client attempts to upload a mutable file whose size is greater than
about 3,139,000 bytes and less than or equal to 3,500,000 bytes, the upload
will fail but appear to succeed, which can lead to data loss.

(Mutable files larger than 3,500,000 bytes are refused outright.) The symptom
of the failure is very high memory usage (3 GB of memory) and 100% CPU for
about 5 minutes before the upload appears to succeed, although it hasn't.

Directories are stored in mutable files, and a directory of approximately
9000 entries may fall into this range of mutable file sizes (depending on the
size of the filenames or other metadata associated with the entries).
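
A small helper for spotting files or directories in the affected range (the
bounds are taken from the text above; the function itself is illustrative):

{{{
def in_affected_mutable_size_range(size_in_bytes):
    """True for the v1.0 mutable-file sizes that fail but appear to succeed."""
    return 3139000 < size_in_bytes <= 3500000

print(in_affected_mutable_size_range(3200000))   # True
print(in_affected_mutable_size_range(3600000))   # False: refused outright
}}}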

==== how to manage it ====

This was fixed in v1.1, under ticket #379. If the client is upgraded to v1.1,
it will fail cleanly instead of falsely appearing to succeed when it tries to
write a file whose size is in this range. If the server is also upgraded to
v1.1, then writes of mutable files whose size is in this range will succeed.
(If the server is upgraded to v1.1 but the client is still v1.0, the client
will still suffer this failure.)

=== issue 6: pycryptopp defect resulting in data corruption ===

Versions of pycryptopp earlier than pycryptopp-0.5.0 had a defect which, when
compiled with some compilers, would cause AES-256 encryption and decryption
to be computed incorrectly. This could cause data corruption. Tahoe v1.0
required, and came with a bundled copy of, pycryptopp v0.3.

==== how to manage it ====

You can detect whether pycryptopp-0.3 has this failure when it is compiled by
your compiler. Run the unit tests that come with pycryptopp-0.3: unpack the
"pycryptopp-0.3.tar" file that comes in the Tahoe v1.0 {{{misc/dependencies}}}
directory, cd into the resulting {{{pycryptopp-0.3.0}}} directory, and execute
{{{python ./setup.py test}}}. If the tests pass, then your compiler does not
trigger this failure.

Tahoe v1.1 requires, and comes with a bundled copy of, pycryptopp v0.5.1,
which does not have this defect.
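
To confirm which pycryptopp your node is actually importing (this assumes
your copy of pycryptopp exposes a {{{__version__}}} attribute):

{{{
import pycryptopp
# Anything earlier than 0.5.0 is affected; Tahoe v1.1 bundles 0.5.1.
print(pycryptopp.__version__)
}}}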