Opened at 2013-07-06T15:16:55Z
Closed at 2020-01-13T20:36:24Z
#2017 closed defect (wontfix)
non-deterministic test hang on OpenBSD
| Reported by: | zooko | Owned by: | sickness |
|---|---|---|---|
| Priority: | normal | Milestone: | soon |
| Component: | code | Version: | 1.10.0 |
| Keywords: | iputil heisenbug openbsd test hang | Cc: | |
| Launchpad Bug: |
Description (last modified by daira)
sickness's OpenBSD buildslave showed a test timeout:
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs
allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x816eb82c [0.00169348716736s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x7ff29ed4>, 1373030506.664452), **{})()>
allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 1784.336s
FAILED (skips=15, expectedFailures=3, errors=2, successes=1120)
Rerunning the tests with the exact same build (using Buildbot's "force rebuild" feature) resulted in success:
In that run (build number 28), those tests took only a few seconds:
19.917 seconds: allmydata.test.test_runner.RunNode.test_client
13.758 seconds: allmydata.test.test_runner.RunNode.test_client_no_noise
So there is a non-deterministic bug that exhibits on sickness's buildslave which causes those two tests to hang.
Change History (15)
comment:1 Changed at 2013-07-06T15:17:35Z by zooko
- Description modified (diff)
comment:2 in reply to: ↑ description Changed at 2013-07-07T08:45:11Z by sickness
did run the tests as requested (rsyncing the build subdir of the buildbot in /tmp/someotherdir because I don't know how to properly checkout trunk) and here's the results:
[...]
===============================================================================
[TODO]
Reason: 'Fix this after 1.7.1 release.'
Traceback (most recent call last):
File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
return self._loop()
File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any
2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such
servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them hav
e enough shares to recover the file, sent 4 queries to 4 servers, 4 queries placed some shares, 0 plac
ed none (of which 0 placed none due to the server being full and 0 placed none due to an error))
===============================================================================
[SKIPPED]
A non-ASCII argument/output could not be encoded on this platform.
allmydata.test.test_runner.BinTahoe.test_unicode_arguments_and_output
===============================================================================
[TODO]
Reason: "this isn't fixed yet"
Traceback (most recent call last):
File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
return self._loop()
File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 2 server(s) such that any
2 of them have enough shares to recover the file, but we were asked to place shares on at least 3 such
servers. (placed all 3 shares, want to place shares on at least 3 servers such that any 2 of them hav
e enough shares to recover the file, sent 3 queries to 3 servers, 1 queries placed some shares, 2 plac
ed none (of which 2 placed none due to the server being full and 0 placed none due to an error))
allmydata.test.test_upload.EncodingParameters.test_problem_layout_comment_187
allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1124
===============================================================================
[TODO]
Reason: 'Invent a smarter uploader that uploads successfully in this case.'
Traceback (most recent call last):
File "/tmp/build/src/allmydata/immutable/upload.py", line 604, in _got_response
return self._loop()
File "/tmp/build/src/allmydata/immutable/upload.py", line 455, in _loop
return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
File "/tmp/build/src/allmydata/immutable/upload.py", line 617, in _failed
raise UploadUnhappinessError(msg)
allmydata.interfaces.UploadUnhappinessError: shares could be placed on only 3 server(s) such that any 2 of them have enough shares to recover the file, but we were asked to place shares on at least 4 such servers. (placed all 4 shares, want to place shares on at least 4 servers such that any 2 of them have enough shares to recover the file, sent 4 queries to 4 servers, 3 queries placed some shares, 1 placed none (of which 1 placed none due to the server being full and 0 placed none due to an error))
allmydata.test.test_upload.EncodingParameters.test_problem_layout_ticket_1128
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs
allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0xb89705ec [0.00525784492493s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0xb84a09cc>, 1373110342.373227), **{})()>
allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1139 tests in 2020.513s
FAILED (skips=15, expectedFailures=3, errors=2, successes=1120
comment:3 Changed at 2013-07-07T15:06:45Z by sickness
so then I've run:
git checkout d85a75d7f689cb55ecddb319dc2057f38e4db87a
python setup.py build && ./bin/tahoe --version
./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
and it failed way earlier than the previous branch with this:
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs
allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x873c5e2c [0.000361204147339s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x873c648c>, 1373184489.75621), **{})()>
allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.027s
comment:4 Changed at 2013-07-07T15:31:11Z by daira
- Keywords test hang added
comment:5 Changed at 2013-07-07T15:34:34Z by daira
If it were the iputil patches, why does it only affect test_client_no_noise? Many other tests depend on the iputil address-finding code.
comment:6 Changed at 2013-07-07T19:58:06Z by daira
Actually sickness' result in comment:3 proves that it wasn't the iputil patches. (d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before those patches.)
comment:7 Changed at 2013-07-09T15:44:44Z by zooko
- Keywords regression added
- Milestone changed from undecided to 1.11.0
comment:9 Changed at 2013-08-28T15:43:53Z by daira
- Description modified (diff)
comment:10 Changed at 2013-08-28T15:51:57Z by daira
- Description modified (diff)
comment:11 Changed at 2013-08-28T15:53:38Z by zooko
sickness: we don't know how to proceed with this. It seems like there might be an old bug, since, per daira's comment:6, d85a75d7f689cb55ecddb319dc2057f38e4db87a/trunk was before the recent iputil patches. So, could you do some investigation of what version(s) of tahoe-lafs exhibit this failure on your !OpenBSD system and report back? You might just start with the Tahoe-LAFS v1.10 release [f9af0633d8da426cbcaed3ff05ab6d7128148bb0] and see if your system could stably run this test on that version. Thanks!
comment:12 Changed at 2013-08-29T07:16:37Z by sickness
# ./bin/tahoe --version
allmydata-tahoe: 1.10.0
foolscap: 0.6.4
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
zfec: 1.4.24
Twisted: 13.0.0
Nevow: 0.10.0
zope.interface: unknown
python: 2.7.3
platform: OpenBSD-5.2-i386-32bit-ELF
pyOpenSSL: 0.13
simplejson: 3.3.0
pycrypto: 2.6
pyasn1: 0.1.7
mock: 1.0.1
setuptools: 0.6c16dev4
# ./bin/tahoe debug trial --until-failure allmydata.test.test_runner.RunNode.test_client_no_noise
Test Pass 1
allmydata.test.test_runner
RunNode
test_client_no_noise ... [OK]
-------------------------------------------------------------------------------
Ran 1 tests in 4.987s
PASSED (successes=1)
Test Pass 2
allmydata.test.test_runner
RunNode
test_client_no_noise ... [OK]
-------------------------------------------------------------------------------
Ran 1 tests in 5.011s
PASSED (successes=1)
Test Pass 3
allmydata.test.test_runner
RunNode
test_client_no_noise ... [ERROR]
[ERROR]
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.internet.defer.TimeoutError: <allmydata.test.test_runner.RunNode testMethod=test_client_no_noise> (test_client_no_noise) still running at 240.0 secs
allmydata.test.test_runner.RunNode.test_client_no_noise
===============================================================================
[ERROR]
Traceback (most recent call last):
Failure: twisted.trial.util.DirtyReactorAggregateError: Reactor was unclean.
DelayedCalls: (set twisted.internet.base.DelayedCall.debug = True to debug)
<DelayedCall 0x83fa9dac [0.00449967384338s] called=0 cancelled=0 LoopingCall<0.01>(RunNode._poll, *(<function _node_has_started at 0x83fc83e4>, 1377735752.361727), **{})()>
allmydata.test.test_runner.RunNode.test_client_no_noise
-------------------------------------------------------------------------------
Ran 1 tests in 240.022s
comment:13 Changed at 2013-08-29T08:45:11Z by sickness
I have to add that after the previous test exits, a process seems to still hang: tahoe --quiet start test_runner/RunNode/test_client_no_noise/c1
and it seems to sit there forever until I manually kill it :/
comment:14 Changed at 2013-08-29T11:41:04Z by zooko
Hm. Could you try editing the source code, i.e. the code of trunk/src/allmydata/test/test_runner.py, and add these lines somewhere near the top:
from twisted.internet import base base.DelayedCall.debug = True
And then re-run that experiment?
comment:15 Changed at 2020-01-13T20:36:24Z by exarkun
- Resolution set to wontfix
- Status changed from new to closed
The OpenBSD builder is gone. This could be re-opened and fixed by a dedicated OpenBSD maintainer but won't be otherwise.
