View Full Version : Browse Host responses timing out incomplete
RickH
July 5th, 2007, 12:44 AM
Hi again. I don't want to seem like I'm trying to find a new issue to post about every day, but while I was capturing GET request packets to investigate the BearShare problem, I stumbled across something else.
I noticed that many of my users were repeating their browse host "GET /" requests multiple times in a short period, as if they weren't working right. I added a bit of logging to HTTPUploadManager.connectionClosed() to log uploader.getTotalAmountUploaded() for the closed connection, and was quite surprised.
Less than 20% of the browse host response connections uploaded the full number of bytes. The rest all stopped short, returning a fairly even distribution of result lengths from around 13KB to just under the full max length of roughly 200KB. This wasn't related to the client type; it happened equally to all of my users, in the usual mix of mostly LimeWire, a minority of BearShare, and the occasional oddity like Phex or whatever. I was able to reliably get the complete result list when I browsed myself (with LOCAL_IS_PRIVATE turned off).
I added some more logging to more closely track the behavior of uploader.BrowseRequestHandler, and discovered that in these short upload cases, handleWrite() never got far enough to exhaust the file iterator and return false, and finished() was never called at all. I added some more logging to see where HTTPUploadManager.connectionClosed() was getting called from, and found that in almost every early abort case, the connection was being shut down by a timeout in org.limewire.nio.timeout.TimeoutController.process Timeouts().
So, as an experiment, I patched HTTPUploadManager.getOrCreateUploader() so that if the HTTPUploader type was BROWSE_HOST, to add an arbitrary 30 extra seconds to the underlying connection's socket timeout (if getSocketTimeout() wasn't already >= 30 seconds). This turned out to extend the timeout from 8 seconds to 38 seconds instead.
This drastically improved the browse response completions. Now instead of an 80%+ failure rate, 80%+ of all browse host responses go the full distance and end cleanly. And some of the ones that do still fail early are legitimate "closed by remote host" errors; the rest are probably just the usual sorts of random communication problems. I haven't seen another timeout error since. A timeout related problem also makes sense of why I was able to reliably browse myself; those always completed in under a second, and probably bypass a big chunk of the TCP/IP stack anyway.
So, there it is. Currently, browse host responses are badly broken and usually don't complete successfully. Extending the timeout lets them complete at a much more normal rate. I don't really know exactly what the underlying connection's socket timeout actually controls, and I doubt that simply extending it is the correct solution, or that 38 seconds is the best value in any case.
Maybe the timeout is missing getting reset or cleared at some critical point during the upload, so that it expires even though the upload is continuing OK?
zab
July 5th, 2007, 05:24 AM
Oh joy. Does this happen with 4.13.5 or earlier? Btw you can build any 4.13 LW version from the tag prod-4-13-x-java15, any 4.12 from prod-4-12-x-java14, I don't think you care about older than that.
Update: 4.13.5, not .6. That's the last version not using httpcore.
RickH
July 6th, 2007, 03:06 AM
Well, I don't know when I'll have time to build 4.13.5 and figure out how to repatch the extra log info about uploaded bytes in order to test it (I've looked at it, but the upload code is quite different).
But I did have time to update my current build to the latest CVS sources, including the changes checked in for CORE-237 earlier today by spingel, and I'm sorry to say they didn't seem to help. The first browse host response sent by my server timed out in the same way as before, about 1/4 complete, exactly 8.029 seconds after it was started. :(
You guys might have some trouble setting up a test to duplicate this in-house. Browsing yourself won't work; it's too fast. And so are your internal IP connections between systems, I expect. Somehow you need to rig the browse host response to take a fairly long time to complete (well over 8 seconds, in any case). Maybe insert a strategic Sleep() in BrowseResponseEntity to slow it down?
Sam
July 6th, 2007, 03:34 AM
Good to know the fix didn't work -- is it still failing from a timeout?
zab
July 6th, 2007, 04:24 AM
we chose the wrong way of reproducing it - we reduced the timeout to 50ms and saw that the changes fixed that case. Of course, a lot more could be going on in 8 seconds that we have missed....
btw feel free to comment in JIRA. This way we can keep everything in context. And thanks! We're happy to have people like you keeping an eye on our code ;-)
RickH
July 6th, 2007, 04:32 AM
Yep, exactly as before. Like so, if it helps. I created the IOException and set the stack trace in HTTPUploadManager.connectionClosed() to log how it gets shut down. Some of the line numbers will be off by a few due to my adding debug log calls and such.
##### 2007-07-05 18:50:02,898 DEBUG [NIODispatcher] gnutella.ConnectionDispatcher.delegate - Handling dispatched word: GET in same thread
##### 2007-07-05 18:50:02,898 DEBUG [NIODispatcher] gnutella.HTTPAcceptor.requestReceived - Processing request: GET / HTTP/1.1
##### 2007-07-05 18:50:10,927 DEBUG [NIODispatcher] uploader.BrowseRequestHandler.finished - Browse response done; xxx results queued
##### 2007-07-05 18:50:10,927 DEBUG [NIODispatcher] gnutella.HTTPUploadManager.connectionClosed - Closing session for xx.xx.xx.xx; xxxxx/0 of Browse-File
java.io.IOException
at com.limegroup.gnutella.HTTPUploadManager$ResponseL istener.connectionClosed(HTTPUploadManager.java:84 1)
at com.limegroup.gnutella.HTTPAcceptor$ConnectionEven tListener.connectionClosed(HTTPAcceptor.java:325)
at org.limewire.http.HttpServiceHandler.closed(HttpSe rviceHandler.java:221)
at org.apache.http.impl.nio.DefaultServerIOEventDispa tch.disconnected(DefaultServerIOEventDispatch.java :88)
at org.limewire.http.HttpChannel$1.run(HttpChannel.ja va:132)
at org.limewire.nio.NIODispatcher$NIOExecutorService. execute(NIODispatcher.java:1035)
at org.limewire.http.HttpChannel.shutdown(HttpChannel .java:130)
at org.limewire.nio.AbstractNBSocket.shutdown(Abstrac tNBSocket.java:507)
at org.limewire.nio.NIODispatcher.cancel(NIODispatche r.java:381)
at org.limewire.nio.NIODispatcher.access$2(NIODispatc her.java:378)
at org.limewire.nio.NIODispatcher$Attachment.notifyTi meout(NIODispatcher.java:965)
at org.limewire.nio.timeout.TimeoutController.process Timeouts(TimeoutController.java:56)
at org.limewire.nio.NIODispatcher.process(NIODispatch er.java:634)
at org.limewire.nio.NIODispatcher.run(NIODispatcher.j ava:845)
at java.lang.Thread.run(Unknown Source)
##### 2007-07-05 18:50:10,927 DEBUG [NIODispatcher] uploader.UploadSlotManager.cancelRequest - com.limegroup.gnutella.uploader.HTTPUploadSession[host=xx.xx.xx.xx,queueStatus=UNKNOWN] cancelling request
##### 2007-07-05 18:50:10,927 DEBUG [NIODispatcher] gnutella.HTTPUploadManager.cleanupFinishedUploader - Cleaning uploader com.limegroup.gnutella.uploader.HTTPUploader[host=xx.xx.xx.xx,index=0,filename=Browse-File,state=COMPLETE,lastTransferState=BROWSE_HOST]
There is one small difference. BrowseRequestHandler.finished() wasn't being called after these timeout shutdowns before (only from a clean ending or an exception in handleWrite()), and this time it was.
spingel
July 6th, 2007, 04:44 PM
Thanks for helping us debugging the problem. I haven't been able to reproduce it with the latest CVS head, yet.
Could you run "ant clean compile" and double check that your cvs update picked up this change https://www.limewire.org/fisheye/changelog/limecvs/?cs=MAIN:spingel:20070705172852 which reverses the order of the source.interestRead(true) and reader.handleRead() statements in AbstractNBSocket.setReadObserver()?
RickH
July 6th, 2007, 10:55 PM
Loaded up AbstractNBSocket.java, verified that "source.interestRead(true);" was on line 173, and "reader.handleRead();" was on line 180 immediately followed by the catch IOException on the next line.
I'm using Eclipse 3.2.1 on XP SP2, so instead of running the Ant script from the command line, I did a Clean from the Project menu. I then went to the components\nio\build\classes\org\limewire\nio directory and verified that the AbstractNBSocket*.class files had just been rebuilt (along with all the others).
Ran the program using the run.bat file in the gui directory, waited for a user to browse my host, and saw the same incomplete upload and connection shutdown after 8.015 seconds caused by nio.timeout.TimeoutController.processTimeouts(Time outController.java:56) as before. Sorry.
If it matters, I'm running Java 1.5.0_10-b03. I can't upgrade to 1.6 yet, as I sometimes do other Java development on this system and we're not ready to migrate.
If it helps, I was able to recreate this error when browsing myself through the Search/Direct Connect box. I slowed down the BrowseResponseEntity to 20 files/second by adding a sleep(50) just after the "responses.add(iterable.next());". When I browsed myself, it slowly received 170 file descriptions and then stopped, and my extra logging showed the usual timeout after 8.7 seconds.
Just to verify, how are you checking if the error happens? Do you have a couple thousand files on the browsed host and then check if you receive that many file descriptions on the client, and that it took well over 8 seconds to get them? Or are you looking for logged errors? The only reason the errors are visible in my log is because I added some extra logging to report the number of file descriptions queued, number of bytes uploaded, and a stack trace from within HTTPUploadManager.connectionClosed(). You won't see any obvious sign of a problem with the standard log messages.
A good sign of it working would be to see the HTTPUploadManager.connectionClosed log entry be timestamped well over 8 seconds from the HTTPAcceptor.requestReceived entry.
zab
July 7th, 2007, 06:58 PM
@RickH: A quick logging statement in org.limewire.http.HttpChannel .setReadChannel and requestRead before the delegation to readSource.interestRead will let us know if something is starting a timeout countdown.
spingel
July 8th, 2007, 12:08 AM
RickH, thanks for verifying. I have tried to reproduce the bug with Java 1.5.0_09, Java 1.5.0_11 and 1.6.0_01 but haven't had any luck, yet. I have added various sleeps() in BrowseResponseEntity, tried browsing with thousands of files and also set the timeouts to very low values in HttpAcceptor.initializeReactor().
I have commented on the bug report https://www.limewire.org/jira/browse/CORE-237 and added a statement in NIODispatcher.notifyTimeout() to log read timeouts. It is still unclear why a read timeout occurs since LimeWire only writes to the channel when a browse request is being handled. The changes committed in https://www.limewire.org/fisheye/changelog/limecvs/?cs=MAIN:spingel:20070707220945 turn reading off as a workaround in case it was turned on unexpectedly to hopefully fix the timeouts.
It would be great if you could update to the latest cvs head and try again. If the bug is still there please post the logging output (particularly from NIODispatcher, HttpChannel and HttpServiceHandler).
RickH
July 8th, 2007, 03:59 AM
OK, so this is getting even weirder. Now I'm seeing CVS problems. I just tried to check out a completely fresh head in a new empty directory, to eliminate any possibility of corrupted non-updated code for whatever reason, and the checkout bombed out partway through.
Here's the last bit of the output log:
U limewire/components/test-util/src/main/java/org/limewire/util/DuckType.java
U limewire/components/test-util/src/main/java/org/limewire/util/ErrorUtils.java
U limewire/components/test-util/src/main/java/org/limewire/util/LimeTestSuite.java
U limewire/components/test-util/src/main/java/org/limewire/util/PairTuple.java
U limewire/components/test-util/src/main/java/org/limewire/util/PrivilegedAccessor.java
U limewire/components/test-util/src/main/java/org/limewire/util/UnexpectedExceptionError.java
U limewire/components/test-util/src/test/java/org/limewire/util/AssertComparisonsTest.java
cvs checkout: cannot rewrite CVS/Entries.Backup: No such file or directory
cvs checkout: cannot rewrite CVS/Entries.Backup: No such file or directory
cvs checkout: cannot rewrite CVS/Entries.Backup: No such file or directory
cvs [checkout aborted]: cannot open limewire/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/core/com/limegroup/gnutella/store/server/CVS/Root: No such file or directory
Error, CVS operation failed
Tried doing an update on a copy of my current working codebase, and it crashed out very quickly, with lots of "xxx is no longer in the repository" and "no such file or directory" errors, ending with the same "core/core/core" garbage as above.
Looks like something has gotten misconfigured on the server. I'm using TortoiseCVS 1.8.30, but that's always worked fine before, and those errors don't look like client-side issues. The last update I did before this one worked perfectly, on July 5th about at 18:23 PDT (according to AbstractNBSocket.java's mod time), so whatever's happened must have been since then.
Can anyone else try to do a checkout to a new empty directory and verify if it's screwed up?
zab
July 8th, 2007, 04:52 AM
yeah server is messed up. We'll look into it asap.
Update: for now you can check out the "core" module separately - that works.
zab
July 8th, 2007, 05:19 AM
Fixed. Sorry for the inconvenience, the responsible parties will be flailed.
RickH
July 8th, 2007, 09:39 AM
The bug is dead. With a fresh, clean head build, I can no longer reproduce it.
At this point I don't know for sure if it was fixed by the latest changes, or if there was something screwed up in my codebase that was fixed by starting over with a clean new checkout (which I have to suspect since nobody else could recreate it). But frankly, I no longer care; I'm tired of screwing with it, as I'm sure you guys must be. I'm just glad it's fixed and working now. Close this sucker.
Any future bugs will be reproduced with a fresh new checkout and build before reporting, just to make sure.
Sam
July 8th, 2007, 02:07 PM
Thanks very much for helping us with this one, Rick. :)
vBulletin® v3.7.1, Copyright ©2000-2009, Jelsoft Enterprises Ltd.