
View Full Version : Broken NTL router(?) violates TCP standard.


melevittfl
20-10-2004, 15:06
Over the last couple of days, I've been having serious performance problems trying to reach certain sites.

This occurred just after I had a short outage of about 15 minutes.

Here's what I think has happened:
During the outage, NTL installed a new router, proxy, traffic shaper, or some other bit of kit.

This new NTL equipment is not properly implementing RFC 1323 (http://www.faqs.org/rfcs/rfc1323.html)

RFC1323 defines a way for two machines to use a large TCP window size. Originally, the TCP window size was limited to 64 KB, because the window field in the TCP header is only 16 bits wide. This is too small to make efficient use of the bandwidth of more modern networks. So, RFC1323 defines a standard by which connections can be set up with a window "scaling" factor.
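To put a rough number on "too small": a sender can have at most one window of data in flight per round trip, so the window size divided by the round-trip time caps the throughput. The sketch below is only that arithmetic; the function name and the sample round-trip times are made up for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput in bits/s: one full window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    return window_bytes * 8 / rtt_seconds


UNSCALED_MAX = 65535  # largest value a 16-bit window field can hold

# Illustrative round-trip times only, not measurements from this thread.
for rtt_ms in (10, 50, 200):
    mbps = max_throughput_bps(UNSCALED_MAX, rtt_ms / 1000) / 1e6
    print(f"RTT {rtt_ms:>3} ms -> at most {mbps:.2f} Mbit/s")
```

The longer the round trip, the lower the ceiling, which is why an unscaled window hurts most on fast, long-distance links.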

RFC1323 uses one of the TCP option fields to define a "window scaling factor". From lwn.net:
"...a system wanting to use window scaling sets a TCP option containing an eight-bit scale factor. All window values used by that system thereafter should be left-shifted by that scale factor; a window scale of zero, thus, implies no scaling at all, while a scale factor of five implies that window sizes should be shifted five bits, or multiplied by 32. With this scheme, a 128KB window could be expressed by setting the scale factor to five and putting 4096 in the window field."

The problem is that some network devices (and specifically whichever bit of kit NTL just added in the Reading area) are incorrectly leaving the TCP option present in the TCP header, but resetting it to zero. The other end of the connection sees the option present, so it acknowledges the use of the window scaling method. However, the initial machine thinks the window scale being used is, say, five, while the receiving end sees it as zero because of the broken router.
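Here is a small sketch of what that mismatch does to the numbers. It is only the shift arithmetic from RFC 1323, not real packet handling; the function names and the 128 KB receive window are assumptions chosen for illustration.

```python
MAX_FIELD = 0xFFFF  # the window field in the TCP header is 16 bits


def advertised_field(real_window: int, scale: int) -> int:
    """Value the advertising host puts in the 16-bit window field."""
    return min(real_window >> scale, MAX_FIELD)


def window_seen_by_peer(field: int, scale_peer_believes: int) -> int:
    """Window in bytes the peer computes from the field it received."""
    return field << scale_peer_believes


# The lwn.net example: scale factor 5 and a field of 4096 mean 128 KB.
assert window_seen_by_peer(4096, 5) == 128 * 1024

# Hypothetical host with a 128 KB receive window that asked for scale 7.
real_window = 128 * 1024
field = advertised_field(real_window, 7)    # 1024

intended = window_seen_by_peer(field, 7)    # peer saw scale 7: 131072 bytes
mangled = window_seen_by_peer(field, 0)     # peer saw scale 0: 1024 bytes

print(f"field={field} intended={intended} mangled={mangled}")
```

With the scale reset to zero in transit, the far end believes the window is 1024 bytes, less than a single typical 1460-byte segment, so it can only trickle data out. That would fit the "slow but not dead" symptom.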

In the more recent Linux kernels (2.6.8 and above (I think)), the default value for the window scale is 7. So, what's happening is that Linux sets the window scale in the TCP options to 7 and some router in NTL's network is resetting it to zero (which is a violation of the rules of TCP, BTW). You wouldn't see this on a stock Windows system because the Windows TCP stack doesn't ask for window scaling by default. More details here:
http://lwn.net/Articles/92727/

[EDIT]: I'm pretty sure this is the problem because if I change the TCP stack to use a value of "zero" for the window scale, the sites that were slow are suddenly speedy. If I change the window scale back to the default of 7, the sites slow down again.
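For anyone who wants to repeat the test: the exact setting changed above isn't named, so the sketch below uses a different, coarser knob that I know exists on Linux, the `net.ipv4.tcp_window_scaling` on/off switch under `/proc/sys`. Turning it off stops the host from offering window scaling at all, which is not the same as offering a scale of zero, but it should sidestep the mismatch in the same way. The helper names are hypothetical, and writing the setting needs root.

```python
from pathlib import Path

# Standard Linux sysctl file: "1" = window scaling on, "0" = off.
WINDOW_SCALING = Path("/proc/sys/net/ipv4/tcp_window_scaling")


def window_scaling_enabled(path: Path = WINDOW_SCALING) -> bool:
    """Return True if the sysctl file at `path` says scaling is on."""
    return path.read_text().strip() != "0"


def set_window_scaling(enabled: bool, path: Path = WINDOW_SCALING) -> None:
    """Write the on/off flag; on the real /proc file this requires root."""
    path.write_text("1\n" if enabled else "0\n")


if __name__ == "__main__" and WINDOW_SCALING.exists():
    # Read-only by default; call set_window_scaling(False) yourself to test.
    state = "on" if window_scaling_enabled() else "off"
    print(f"TCP window scaling is currently {state}")
```

The change only affects new connections and is lost on reboot, so it is easy to flip back and forth while timing the slow sites.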

Now, having explained all of that, I can't imagine a way to get that across to anyone who'd answer the phone at NTL. So, if anyone on this board knows a way to get this info to someone who can actually fix the problem, that would be great.

Thanks.

danielf
20-10-2004, 15:19
I haven't got a clue what you're on about, but why do you think this is what happened?

I presume you tried changing proxies?

melevittfl
20-10-2004, 15:29
I haven't got a clue what you're on about, but why do you think this is what happened?

I presume you tried changing proxies?

Because if I change the TCP stack to use a value of "zero" for the window scale, the sites that were slow are suddenly speedy. If I change the window scale back to the default of 7, the sites slow down again.

If you read the article at lwn.net, it describes exactly the symptoms. Plus, the loss of connection I described is consistent with some sort of network maintenance.

danielf
20-10-2004, 15:35
Because if I change the TCP stack to use a value of "zero" for the window scale, the sites that were slow are suddenly speedy. If I change the window scale back to the default of 7, the sites slow down again.


Ah, but that wasn't apparent from your post. (Not to me anyway) ;)

Chris
20-10-2004, 15:46
Because if I change the TCP stack to use a value of "zero" for the window scale, the sites that were slow are suddenly speedy. If I change the window scale back to the default of 7, the sites slow down again.

If you read the article at lwn.net, it describes exactly the symptoms. Plus, the loss of connection I described is consistent with some sort of network maintenance.
Most of the people who come to this site are beleaguered consumer users of one NTL product or another ... with the greatest of respect to you, your initial posting will have gone right over the heads of most of us! :dunce:

Still, a number of NTL network techs do frequent this board so I expect it won't be long before someone who understands what you're talking about comes along. With a little luck they might even be able to do something about it.

melevittfl
20-10-2004, 15:47
Ah, but that wasn't apparent from your post. (Not to me anyway) ;)

You're right. Thanks for pointing that out. I've edited my post with the additional information.

melevittfl
20-10-2004, 15:48
Still, a number of NTL network techs do frequent this board so I expect it won't be long before someone who understands what you're talking about comes along. With a little luck they might even be able to do something about it.

That's what I'm hoping for. :)

BBKing
20-10-2004, 15:51
Can't look into this myself but I can certainly draw someone's attention to it.
A few questions:
Is it all sites or just a few?
All protocols or just HTTP?
If just HTTP, does an explicit proxy make any difference?

melevittfl
20-10-2004, 16:14
A few questions:
Is it all sites or just a few?


It's just a few. I suspect it's only going to happen on web servers that implement the RFC, because the problem is that my machine sets the option and the other machine acknowledges it (which means it knows what it is), but because of the router in between, the two machines get out of sync on the size of the TCP window.


All protocols or just HTTP?


In theory, it would affect all protocols. It's a problem at the TCP layer, rather than at the application protocol layer. However, the device that's incorrectly setting the TCP option to zero might be the proxy server itself that's mangling the headers. Hard to say.


If just HTTP, does an explicit proxy make any difference?

Interestingly, I tried this proxy: "swan-cache-2.server.ntli.net:8080" and it does not exhibit the same behavior.

BUT, like I said, it could be the proxy server that's mangling the TCP headers.
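One way to narrow it down would be to capture the SYN as it leaves my machine and again as it arrives at a server I control, then compare the window scale option in each. Below is a rough sketch of pulling that option (kind 3, length 3) out of a raw TCP header. The parser and the hand-built sample header are illustrative only; the port numbers and window value are made up, and getting the bytes out of a real capture is left to whatever capture tool you use.

```python
import struct


def window_scale_option(tcp_header: bytes):
    """Return the window-scale shift count in a TCP header, or None if absent."""
    header_len = (tcp_header[12] >> 4) * 4   # data offset is in 32-bit words
    options = tcp_header[20:header_len]      # options follow the 20 fixed bytes
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                        # end of option list
            break
        if kind == 1:                        # NOP padding, single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                            # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                            # malformed option
        if kind == 3 and length == 3:        # window scale option
            return options[i + 2]
        i += length
    return None


def sample_syn(scale: int) -> bytes:
    """Build a made-up SYN header carrying MSS 1460, a NOP, and window scale."""
    options = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, scale])   # 8 bytes of options
    fixed = struct.pack("!HHIIBBHHH", 40000, 80, 1, 0,
                        7 << 4, 0x02, 5840, 0, 0)         # 7 words = 28 bytes
    return fixed + options


print(window_scale_option(sample_syn(7)))   # as sent by the client
print(window_scale_option(sample_syn(0)))   # as it would look after being reset
```

If the two captures disagree, something in the path rewrote the option; if the option has vanished entirely, that is a different (and harmless) behaviour.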

Thanks for offering to pass this on!

BBKing
20-10-2004, 16:44
Don't get carried away with the idea that because the proxy is layer 4 it couldn't mangle layer 3...

Try and find a site with HTTPS or FTP on the same IP as an iffy HTTP server and see if those protocols fail.

melevittfl
20-10-2004, 16:52
Don't get carried away with the idea that because the proxy is layer 4 it couldn't mangle layer 3...


Hi,
Yes, I'm sure that the proxy could very well be the one that's mangling the TCP options. It has to re-write the TCP header, so it's definitely a likely candidate. I wasn't saying it isn't the proxy server. I'm just saying that it's not the normal "change proxies because the local one is overloaded" problem.

EDIT: Just to clarify, I guess what I was trying to say before was that any TCP traffic that goes through the device doing the mangling is going to be affected. However, if the only traffic that's going through the device is HTTP, because the device is the transparent proxy, then the problem will only show up on HTTP.



Try and find a site with HTTPS or FTP on the same IP as an iffy HTTP server and see if those protocols fail.

I'll try.

BBKing
20-10-2004, 17:26
However, if the only traffic that's going through the device is HTTP, because the device is the transparent proxy, then the problem will only show up on HTTP

Precisely my reasoning :)

Thorny
22-10-2004, 19:34
OK, I'm gonna bring the technical talk right down now with my addition, so be warned. :)

There are some addresses that I get random outages from, including the one that my hosting happens to be on. www.alivewww.com and www.aria.co.uk are the ones I notice it on. It's usually between 5 and 15 minutes and it's annoying the hell out of me. The servers ping fine and I can trace them with no problems. Just on those weird occasions I get timeouts to them. Would that be due to NTL's routing too?

Cheers

melevittfl
23-10-2004, 01:03
would that be due to NTL's routing too?




Unless you're running a recent Linux kernel or you've modified the Windows TCP registry settings, no.

It's probably not so much a routing problem per se. It's that some device in the path is incorrectly messing up the TCP packets.

Ignition
24-10-2004, 11:20
No extra kit has been added in Reading / Winnersh in the timeframe you describe that would be affecting your service.
No operating system upgrades have been done on any of the servers or routers that you would traverse.
NTL do not employ traffic shaping or management on their network.

Incidentally routers wouldn't be doing this to TCP, routers operate at layer 3 and couldn't really care less about layer 4.

Will have a looksie and see if anything has been done to the Winnersh webcaches that I'm not aware of as they are really the natural and only suspect in this - all the network kit has to be considered innocent as it's incapable of this behaviour. Also I should mention that all the caches are running the same level of kernel, etc.

Don't get carried away with the idea that because the proxy is layer 4 it couldn't mangle layer 3... << BBKing, TCP is layer 4, you've been working on DW too much :p: :D

BBKing
24-10-2004, 12:33
TCP ports are layer 4 - the proxy redirect happens on TCP ports, therefore the proxy is a layer 4 device.

At least that seemed logical at the time. And yes, the natural suspicion would be something that has to rewrite TCP headers to do its job, which rules out simple IP routing, as that shouldn't touch the payload carried inside the IP packet.