Opened at 2013-11-26T04:32:53Z
Last modified at 2014-04-15T00:49:57Z
#2116 closed defect
xml parse error from S4 service — at Version 18
| Reported by: | zooko | Owned by: | daira |
|---|---|---|---|
| Priority: | major | Milestone: | undecided |
| Component: | code-storage | Version: | 1.9.0-s3branch |
| Keywords: | s3 error txaws retry LeastAuthority | Cc: | |
| Launchpad Bug: | | | |
Description (last modified by daira)
zooko@spark ~/research/papers $ tahoe backup . $WRITABLE_DIRCAP:
Traceback (most recent call last):
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run
rc = runner(sys.argv[1:], install_node_control=install_node_control)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner
rc = cli.dispatch[command](so)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup
rc = tahoe_backup.backup(options)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
return bu.run()
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run
new_backup_dircap = self.process(options.from_dir)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process
childcap = self.process(childpath)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process
childcap, metadata = self.upload(childpath)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose
self.request.fail(f)
File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail
self.deferred.errback(why)
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback
self._startRunCallbacks(fail)
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response
return self._loop()
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop
return self._failed(msg)
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed
raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: server selection failed for
<Tahoe2ServerSelector for upload qd7vq>: shares could be placed or found on
only 0 server(s). We were asked to place shares on at least 1 server(s) such
that any 1 of them have enough shares to recover the file. (placed 0 shares out
of 1 total (1 homeless), want to place shares on at least 1 servers such that
any 1 of them have enough shares to recover the file, sent 1 queries to 1
servers, 0 queries placed some shares, 1 placed none (of which 0 placed none
due to the server being full and 1 placed none due to an error)) (last failure
(from <ServerTracker for server 6dwd2q and SI qd7vq>) was:
[Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>:
<RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost
protocol.connectionLost(reason)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost
self.factory._disconnectedDeferred.callback(None)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback
self._startRunCallbacks(result)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
root = XML(xml_bytes)
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
return parser.close()
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
self._raiseerror(v)
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"
Change History (18)
comment:1 Changed at 2013-11-26T04:34:59Z by zooko
comment:2 Changed at 2013-11-26T17:00:40Z by daira
comment:3 Changed at 2013-11-26T17:01:10Z by daira
Actually the response from S3 seems to have been empty.
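For reference, feeding ElementTree an empty document reproduces exactly the "no element found: line 1, column 0" message seen in the traceback:

```python
# ElementTree raises ParseError("no element found: line 1, column 0")
# when asked to parse an empty string, matching the error in the traceback.
import xml.etree.ElementTree as ET

try:
    ET.fromstring("")
except ET.ParseError as e:
    print(e)  # no element found: line 1, column 0
```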
comment:4 follow-up: ↓ 6 Changed at 2013-11-26T17:08:41Z by daira
- Keywords s3 error txaws retry LeastAuthority added
- Owner set to daira
- Status changed from new to assigned
- Version changed from 1.10.0 to 1.9.0-s3branch
Hmm, maybe an empty S3 response should trigger a retry?
comment:5 Changed at 2013-11-26T17:13:13Z by daira
Sigh, I wish we could reduce that error report to something more concise -- although I suppose we should be grateful that the actual error was in there somewhere.
comment:6 in reply to: ↑ 4 Changed at 2013-11-27T19:59:27Z by zooko
Replying to daira:
Hmm, maybe an empty S3 response should trigger a retry?
I guess. :-/
This is what I think of as "snowshoes": having our upper layer perform redundancy or retrying just to work around what appear to be bugs or unreliability in the lower layer. Ick.
comment:7 follow-up: ↓ 8 Changed at 2013-11-27T20:01:57Z by daira
Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.
comment:8 in reply to: ↑ 7 Changed at 2013-11-27T20:03:38Z by zooko
Replying to daira:
Well, I'm philosophically resistant to that as well, but I have no confidence that we can get the underlying problem with S3 fixed.
Yeah, I agree that we should go ahead and add a limited number of exponential-backoff retries for this.
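A minimal sketch of what such a limited exponential-backoff retry around a Deferred-returning S3 call might look like (names such as `do_s3_request` and `EmptyS3Response` are hypothetical, not the actual cloud-backend code):

```python
# Sketch only: retry a Deferred-returning S3 operation a few times with
# exponential backoff when the response body comes back empty.
from twisted.internet import defer, reactor, task


class EmptyS3Response(Exception):
    """Hypothetical marker for an S3 success response with an empty body."""


@defer.inlineCallbacks
def retry_on_empty_response(do_s3_request, max_tries=4, first_delay=1.0):
    delay = first_delay
    for attempt in range(max_tries):
        body = yield do_s3_request()
        if body:
            defer.returnValue(body)
        if attempt == max_tries - 1:
            raise EmptyS3Response("S3 returned an empty body %d times" % max_tries)
        # sleep without blocking the reactor, then double the delay and retry
        yield task.deferLater(reactor, delay, lambda: None)
        delay *= 2
```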
comment:9 Changed at 2013-11-27T20:04:42Z by daira
Note that this isn't the first time I've seen an empty response from S3. They are rare, but they've come up before - we just haven't recorded that problem until now.
comment:10 Changed at 2013-11-28T12:52:35Z by zooko
I just got a similar failure again:
Traceback (most recent call last):
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 156, in run
rc = runner(sys.argv[1:], install_node_control=install_node_control)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/runner.py", line 141, in runner
rc = cli.dispatch[command](so)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/cli.py", line 574, in backup
rc = tahoe_backup.backup(options)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 325, in backup
return bu.run()
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 118, in run
new_backup_dircap = self.process(options.from_dir)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 188, in process
childcap = self.process(childpath)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 194, in process
childcap, metadata = self.upload(childpath)
File "/home/zooko/playground/LAFS/pristine/src/allmydata/scripts/tahoe_backup.py", line 305, in upload
raise HTTPError("Error during file PUT", resp)
HTTPError: Error during file PUT: 500 Internal Server Error
"Traceback (most recent call last):
File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 753, in receiveClose
self.request.fail(f)
File \"/home/zooko/playground/LAFS/pristine/support/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py\", line 95, in fail
self.deferred.errback(why)
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 421, in errback
self._startRunCallbacks(fail)
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 488, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File \"/home/zooko/.local/lib/python2.7/site-packages/twisted/internet/defer.py\", line 575, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 604, in _got_response
return self._loop()
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 516, in _loop
return self._failed(msg)
File \"/home/zooko/playground/LAFS/pristine/src/allmydata/immutable/upload.py\", line 617, in _failed
raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: server selection failed for <Tahoe2ServerSelector
for upload s7uqw>: shares could be placed or found on only 0 server(s). We were asked to place
shares on at least 1 server(s) such that any 1 of them have enough shares to recover the file.
(placed 0 shares out of 1 total (1 homeless), want to place shares on at least 1 servers such
that any 1 of them have enough shares to recover the file, sent 1 queries to 1 servers, 0
queries placed some shares, 1 placed none (of which 0 placed none due to the server being full
and 1 placed none due to an error)) (last failure (from <ServerTracker for server 6dwd2q and SI
s7uqw>) was:
[Failure instance: Traceback (failure with no frames): <class 'foolscap.tokens.RemoteException'>: <RemoteException around '[CopiedFailure instance: Traceback from remote host -- Traceback (most recent call last):
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/tcp.py\", line 287, in connectionLost
protocol.connectionLost(reason)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/web/client.py\", line 193, in connectionLost
self.factory._disconnectedDeferred.callback(None)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 368, in callback
self._startRunCallbacks(result)
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File \"/home/customer/LAFS_source/support/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-linux-i686.egg/twisted/internet/defer.py\", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
root = XML(xml_bytes)
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
return parser.close()
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
self._raiseerror(v)
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"
comment:11 Changed at 2013-11-29T05:42:21Z by zooko
Hm, I seem to get this every time that I run a full tahoe backup on my data now. ☹ It seems like the LeastAuthority S4 service is suffering badly from this.
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/s3/client.py\", line 153, in _parse_get_bucket
root = XML(xml_bytes)
File \"/usr/local/lib/python2.7/dist-packages/txAWS-0.2.1.post4-py2.7.egg/txaws/util.py\", line 61, in XML
return parser.close()
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1655, in close
self._raiseerror(v)
File \"/usr/lib/python2.7/xml/etree/ElementTree.py\", line 1507, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
]'>
])
"
comment:12 Changed at 2013-11-30T20:55:30Z by daira
- Priority changed from normal to major
comment:13 Changed at 2013-12-01T16:21:22Z by zooko
I just got this same error again, but this time on read instead of on upload. Manually retrying a couple of times (by hitting C-r in my web browser) got it to work.
comment:14 Changed at 2014-03-04T19:29:52Z by daira
CyberAxe just saw the same error: http://sebsauvage.net/paste/?e301277924eb38dc#6W5WPJkf1HgW3+rdFVvQ/2BqzYIZ0ZwQyPUb08+kY0A=
comment:15 Changed at 2014-03-04T19:30:43Z by daira
I'm going to make it retry.
comment:16 follow-up: ↓ 17 Changed at 2014-03-10T21:45:28Z by daira
Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML. The fix might need to be in txAWS as well.
It's a bit difficult to test this, because the tests intentionally do not depend on txAWS (currently they only test the mock cloud backend, not the service container implementations).
comment:17 in reply to: ↑ 16 Changed at 2014-03-11T15:55:09Z by daira
Replying to daira:
Note that the error occurs in txAWS when it tries to parse an empty response body for a 500 error as XML.
No, this is a wrong inference. The error occurs in txaws.s3.client._parse_get_bucket, which should not be reached if the S3 server responded with a 500 error. (The 500 error in the stack trace is from the Tahoe-LAFS gateway, not S3.) So it looks like S3 returned success with an empty body.
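One possible shape for the fix, wherever it lands, is a guard before the XML parse; a rough sketch, with `parse_get_bucket_body` as a hypothetical stand-in for the real `_parse_get_bucket`, not the actual patch:

```python
# Guard against an empty (or malformed) ListBucket body before handing it
# to ElementTree, so callers see a distinct error that can trigger a retry.
from xml.etree.ElementTree import XML, ParseError


def parse_get_bucket_body(xml_bytes):
    if not xml_bytes or not xml_bytes.strip():
        raise ValueError("S3 returned an empty ListBucket response body")
    try:
        return XML(xml_bytes)
    except ParseError as e:
        raise ValueError("S3 returned a malformed ListBucket response: %s" % (e,))
```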
comment:18 Changed at 2014-03-17T22:03:07Z by daira
- Description modified (diff)

I tried the same command-line again, and this time it worked. I had started my tahoe-lafs gateway a few seconds before I got this xml parse error, so I suspect that this is a case of #719.