
is this a VM congested transit issue?


Chrysalis
19-04-2011, 09:22
Here is the return path for a dodgy SSH connection - the cursor is jumping all over the place and there's 4% packet loss from the server to my modem over 100 packets.

Host Loss% Snt Last Avg Best Wrst StDev
1. 3.208.154.95.in-addr.arpa 0.0% 100 0.1 0.1 0.1 1.3 0.1
2. 1.208.154.95.in-addr.arpa 0.0% 100 0.8 31.1 0.3 1115. 145.3
3. 62-233-127-181.as20860.net 0.0% 100 1.1 13.0 1.1 155.9 33.7
4. 202.core1.thn.as20860.net 0.0% 100 12.0 7.9 1.1 190.9 28.9
5. 195.66.224.23 8.0% 100 1.4 1.6 1.3 24.7 2.4
6. glfd-tmr-1-as0-0.network.virginm 0.0% 100 2.7 3.1 2.4 28.5 3.1
7. glfd-bb-1b-ae5-0.network.virginm 0.0% 100 2.5 7.5 2.3 97.2 16.0
8. nrth-bb-1a-as5-0.network.virginm 9.0% 100 6.3 6.8 6.0 39.3 3.7
9. nrth-bb-1b-ae0-0.network.virginm 8.0% 100 6.0 12.1 5.9 171.0 24.0
10. leic-core-1b-as0-0.network.virgi 5.0% 100 7.3 7.9 7.0 28.5 3.2
11. leic-cmts-14-gigaether-151.netwo 0.0% 100 10.4 10.6 9.8 32.0 3.0
leic-cmts-14-gigaether-21.network.virginmedia.net
leic-cmts-14-gigaether-141.network.virginmedia.net
12. cpc14-leic14<blah> 4.0% 100 17.6 17.0 14.1 21.4 1.5

guesses on which hop is the problem?

---------- Post added at 09:22 ---------- Previous post was at 09:19 ----------

I suspect hop 5, which is:

% Information related to '195.66.224.0/19AS5459'
route: 195.66.224.0/19
descr: UK-LINX-1
origin: AS5459
mnt-by: AS5459-MNT
source: RIPE # Filtered
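
If anyone wants to repeat that lookup themselves, here is a rough Python sketch of the same whois query against RIPE. The only thread-specific value in it is the hop 5 address from the trace above; the server and port are just the standard whois ones.

```python
# Minimal whois query over the standard whois protocol (TCP port 43).
import socket

def ripe_whois(query: str, server: str = "whois.ripe.net", port: int = 43) -> str:
    """Send a single whois query and return the raw text response."""
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode())
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# Hop 5 from the mtr output above; expect the UK-LINX-1 route object in the reply.
print(ripe_whois("195.66.224.23"))
```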

weesteev
19-04-2011, 10:07
That IP is registered to the London InterNet eXchange. It's likely a peering issue with the carrier. Issues like that should be resolved pretty quickly.

deed02392
19-04-2011, 10:23
I'm guessing when you say cursor on SSH you mean you're tunnelling a VNC connection over SSH. I get this on fairly rare occasions with my servers in both France and Canada, and it usually turns out to be a peering issue with an intermediary carrier.

This sort of thing happens fairly often, but it's only power users like us, who rely on one specific server much more than others, that actually notice it - most people, browsing a mix of different websites, would just assume the site was 'down'.

It will usually rectify itself after 6 hours.

Chrysalis
19-04-2011, 10:50
No, SSH is a secure command-line variant of telnet, so it has a cursor like on a command prompt.

Ignitionnet
19-04-2011, 12:30
That IP is registered to the London InterNet eXchange. It's likely a peering issue with the carrier. Issues like that should be resolved pretty quickly.

It's a VM router Steve, to be exact it's redb-ic-1. Everyone on the LINX public LANs uses LINX IP addresses but all are responsible for their own capacity to the LANs.

---------- Post added at 12:30 ---------- Previous post was at 12:27 ----------

Here is the return path for a dodgy SSH connection - the cursor is jumping all over the place and there's 4% packet loss from the server to my modem over 100 packets.

guesses on which hop is the problem?

Given there's zero packet loss to hop 11, disproving the possibility of packet loss starting at hop 5 or at least making it extremely unlikely, I have no idea.

Have you checked the route a forward trace takes? I see that there's a couple of paths from the VM network to that location, Global Crossing stands out as a good one.
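
To make that sort of sanity check repeatable, here is a rough sketch (nothing official, just an illustration) that runs mtr in report mode and flags hops whose loss doesn't carry through to the final hop - usually a sign of ICMP de-prioritisation on the router rather than a congested link. On the trace above it would flag hops 5, 8, 9 and 10, since the end-to-end loss is only 4%.

```python
# Sketch: run mtr and flag hop loss that doesn't persist to the destination.
import re
import subprocess

def mtr_report(host: str, cycles: int = 100) -> list[tuple[str, float]]:
    """Return [(hop_name, loss_percent), ...] from an mtr report run."""
    out = subprocess.run(
        ["mtr", "--report", "--report-cycles", str(cycles), host],
        capture_output=True, text=True, check=True,
    ).stdout
    hops = []
    for line in out.splitlines():
        m = re.match(r"\s*\d+\.\|--\s+(\S+)\s+([\d.]+)%", line)
        if m:
            hops.append((m.group(1), float(m.group(2))))
    return hops

def flag_suspect_hops(hops: list[tuple[str, float]]) -> None:
    """Loss at an intermediate hop only matters if the final hop sees it too."""
    if not hops:
        return
    end_loss = hops[-1][1]
    for name, loss in hops[:-1]:
        if loss > end_loss:
            print(f"{name}: {loss:.1f}% loss here but {end_loss:.1f}% end-to-end "
                  "-> probably ICMP rate-limiting, not real forwarding loss")

if __name__ == "__main__":
    flag_suspect_hops(mtr_report("server.example.net"))  # hypothetical target
```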

Chrysalis
19-04-2011, 14:47
Well, the loss has stopped now and the outbound routing has changed from hop 4 onwards.

The route to the server is currently going over gblx; I did not check it earlier.

100 packets on the new route.

Host Loss% Snt Last Avg Best Wrst StDev
1. 95.154.208.3 0.0% 101 0.1 0.4 0.1 12.6 1.5
2. 95.154.208.1 0.0% 100 375.8 24.6 0.4 1185. 129.7
3. 62.233.127.181 0.0% 100 1.2 7.2 1.1 173.1 22.5
4. 77.67.74.225 0.0% 100 1.1 5.1 1.0 65.7 12.0
5. 77.67.65.162 0.0% 100 2.5 3.8 2.4 42.9 5.3
6. 212.43.163.182 0.0% 100 4.6 6.6 4.5 46.6 7.2
7. 62.253.185.118 0.0% 100 4.5 14.1 4.4 145.8 25.1
8. 213.105.172.42 0.0% 100 5.6 7.5 5.6 51.5 7.7
9. 82.3.33.174 0.0% 100 8.2 10.0 8.2 42.8 5.5
82.3.33.226
82.3.33.170
10. 82.30.x.x 0.0% 100 18.4 17.0 12.4 52.1 5.0

Without -n the hostnames are:

Host Loss% Snt Last Avg Best Wrst StDev
1. 3.208.154.95.in-addr.arpa 0.0% 3 0.1 0.1 0.1 0.1 0.0
2. 1.208.154.95.in-addr.arpa 0.0% 3 1.3 0.7 0.4 1.3 0.5
3. 62-233-127-181.as20860.net 0.0% 2 1.2 1.2 1.2 1.2 0.0
4. ae0-135.lon25.ip4.tinet.net 0.0% 2 1.1 1.1 1.1 1.1 0.0
5. telewest-gw.ip4.tinet.net 0.0% 2 2.5 2.5 2.5 2.6 0.1
6. nrth-bb-1a-as5-0.network.virginm 0.0% 2 4.6 4.6 4.6 4.6 0.0
7. nrth-bb-1b-ae0-0.network.virginm 0.0% 2 4.4 4.5 4.4 4.5 0.1
8. leic-core-1b-as0-0.network.virgi 0.0% 2 5.7 5.7 5.7 5.8 0.1
9. leic-cmts-14-gigaether-141.netwo 0.0% 2 8.5 8.4 8.4 8.5 0.0
leic-cmts-14-gigaether-151.network.virginmedia.net
10. cpc14-<blah> 0.0% 2 17.0 16.1 15.3 17.0 1.2

So Tinet instead of LINX now.

Inbound below, with poor-looking jitter :(

1     <1 ms    <1 ms    <1 ms  home.gateway4 [192.168.1.253]
2      5 ms     8 ms     6 ms  cpc14-leic14-2-0-gw.8-1.cable.virginmedia.com [82.30.112.1]
3      8 ms     6 ms     7 ms  leic-core-1a-ae3-2231.network.virginmedia.net [82.3.33.45]
4     30 ms    49 ms    24 ms  leed-bb-1a-as8-0.network.virginmedia.net [213.105.172.17]
5     11 ms    12 ms     9 ms  leed-bb-1b-ae0-0.network.virginmedia.net [62.253.187.186]
6     18 ms    21 ms    19 ms  64.214.128.37
7     14 ms    54 ms    13 ms  IOMART.TenGigabitEthernet7-1.ar4.LON3.gblx.net [64.211.1.190]
8     18 ms    94 ms    16 ms  62-233-127-182.as20860.net [62.233.127.182]
9     15 ms    17 ms    15 ms  3.208.154.95.in-addr.arpa [95.154.208.3]

Incidentally, this issue is affecting the MK pingtest server as well.

http://www.pingtest.net/result/39068613.png
http://www.pingtest.net/result/39068884.png

That packet loss is consistent across many tests, 1-5% loss; I've not had a 0% loss today. Edit - I see this image shows 0%, but of the 250 pings only 249 got returned, so there was some loss.

http://www.pingtest.net/result/39068658.png

Maidenhead seems OK, but that test randomly has very high base latency (over 40ms).

http://www.pingtest.net/result/39068694.png

Other servers I am using over SSH are/were fine, so it doesn't seem to be a generic issue with my connection.

That gblx hop is not the normal route; I have run tracerts to the server in the past and it normally doesn't go over 3rd-party transit. What's the other path you see aside from gblx?

---------- Post added at 14:39 ---------- Previous post was at 13:03 ----------

I bricked my DIR-615 flashing it to an older firmware; luckily I've now recovered it.

When I went online with the Superhub I reran the pingtest.net tests and the packet loss was gone. It is now back again on the modem+DIR-615, so that's an interesting one.

Also still getting the random really high latency and increased jitter on pingtest.net.

http://www.pingtest.net/result/39072457.png

---------- Post added at 14:47 ---------- Previous post was at 14:39 ----------

It's only happening on Maidenhead and MK; Manchester and all the overseas ones are OK. The Superhub was on a different IP subnet, so I am concluding my IP range is affected by some kind of LINX issue and the other subnet isn't (possibly different routing).

ShaneC
19-04-2011, 15:50
Just to add, I have spoken to business support as we are experiencing packet loss on lots of VPNs and links that go via LINX on our VM leased line, and I was told this is currently a known issue they are trying to resolve. No ETA, but it sounded like quite a few people had already complained.

Chris
19-04-2011, 15:52
Is this likely to have an effect on domestic Virgin National (ADSL) service? I have been getting intermittent problems all day today with pages taking an age to load one minute and everything being fine the next.

ShaneC
19-04-2011, 15:59
Chris, probably. If you do a tracert to the website you are having problems with and you see redb-ic-1 at the start of one of the router names, this could be the issue. It seems to be the providers listed here: http://www.red-bus.com

craigj2k12
19-04-2011, 16:16
Is this likely to have an effect on domestic Virgin National (ADSL) service? I have been getting intermittent problems all day today with pages taking an age to load one minute and everything being fine the next.

I thought that the National service was on a completely separate network. Sounds like your problem is unrelated, but they might use some similar things (DNS servers etc.).

ShaneC
19-04-2011, 17:02
According to our smokeping server, the packet loss stopped at 16:05

---------- Post added at 17:02 ---------- Previous post was at 16:24 ----------

.....and at 16:53 it seemed to come back again :/

Chrysalis
19-04-2011, 17:29
Well, LINX has had tons of issues lately; the usual question will be why can't VM simply turn the LINX link off and reroute traffic, which is what I want of course as a customer. But VM, probably trying to keep costs low, keep it turned on as it is cheaper than using other links. For it to last all day is a bit excessive.

Ignitionnet
19-04-2011, 19:31
They don't have enough capacity to go without all their LINX links, they've 81Gbps of them.

theoldbill
19-04-2011, 19:59
They don't have enough capacity to go without all their LINX links, they've 81Gbps of them.

Interesting, but why not route via their AMS (or Frankfurt) interconnect? How much bandwidth is there?

Ignitionnet
19-04-2011, 20:08
Those links are being used for standard traffic and some peers at LINX won't also be at DecIX and AmsIX.

They've 60Gbps to AmsIX, no idea about DecIX.

Chrysalis
19-04-2011, 21:00
They don't have enough capacity to go without all their LINX links, they've 81Gbps of them.

Too many eggs in one basket?

81Gbps to LINX but no LONAP.

How much peering in Manchester?

Ignitionnet
19-04-2011, 21:13
Too many eggs in one basket?

81Gbps to LINX but no LONAP.

How much peering in Manchester?

Exactly what use would LONAP be in the case of issues at LINX, given it has a fraction of the members and currently maxes out at less than 17Gbps? Not really a replacement for the >1.4Tbps LINX shifts on public and private peering. VM alone shift more traffic than all the peers on LONAP put together exchange over it.

EdgeIX likewise: even if VM were to stick 80Gbps in there as well in case of loss of LINX, other networks don't have nearly enough bandwidth there to fulfil their needs.

Chrysalis
19-04-2011, 21:27
Yep, I know; the too-many-eggs-in-one-basket point applies to everyone really, as it's no use one ISP alone planting pipes there if no one else does it as well. Having some capacity at LONAP probably would at least help though.

theoldbill
19-04-2011, 21:59
Yep, I know; the too-many-eggs-in-one-basket point applies to everyone really, as it's no use one ISP alone planting pipes there if no one else does it as well. Having some capacity at LONAP probably would at least help though.

Igni didn't say what ratio of that 81 gigs is over the two LINX LANs. In theory there should never be a total loss of LINX; the worst you'd get is a serious slowdown - whereas the smaller exchanges you mention would slow down to the point of stalling, so like Igni suggests, you can pretty much discount those in all practicality.

Chrysalis
19-04-2011, 22:04
Igni didn't say what ratio of that 81 gigs is over the two LINX LANs. In theory there should never be a total loss of LINX; the worst you'd get is a serious slowdown - whereas the smaller exchanges you mention would slow down to the point of stalling, so like Igni suggests, you can pretty much discount those in all practicality.

They would if all the traffic moved from one to the other.

I meant adding 10 gig or so at LONAP and shifting 'some' traffic to it so it's less noticeable, then doing the same elsewhere, maybe 10 to 20 gig, which then adds up to 40 gig or so that's maybe on the dodgy LINX LAN.

LINX itself has been having problems since late 2010, so it's been going on for months. I think ISPs just sitting tight hoping it goes away is not a very responsible way of dealing with it. The UK needs a LINX mark II so we have some more redundancy, ideally in a completely different location outside of London.

Ignitionnet
20-04-2011, 00:06
Igni didn't say what ratio of that 81 gigs is over the two LINX LANs. In theory there should never be a total loss of LINX; the worst you'd get is a serious slowdown - whereas the smaller exchanges you mention would slow down to the point of stalling, so like Igni suggests, you can pretty much discount those in all practicality.

40Gbps on Extreme LAN, 41Gbps on Brocade.

https://www.linx.net/pubtools/member-techinfo/member_id/100050

---------- Post added at 00:06 ---------- Previous post was at 00:02 ----------

Note that BT and Sky both have 100Gbps to LINX, Opal / TalkTalk 110Gbps. Even Be/O2 have 40Gbps so VM's 81Gbps is actually relatively low given their size.

pip08456
20-04-2011, 00:20
Cheapskates?

Chrysalis
20-04-2011, 09:05
40Gbps on Extreme LAN, 41Gbps on Brocade.

https://www.linx.net/pubtools/member-techinfo/member_id/100050

---------- Post added at 00:06 ---------- Previous post was at 00:02 ----------

Note that BT and Sky both have 100Gbps to LINX, Opal / TalkTalk 110Gbps. Even Be/O2 have 40Gbps so VM's 81Gbps is actually relatively low given their size.

Yeah, now you've posted comparisons VM's does indeed seem poor, also considering that end-user speeds on VM are probably the highest of all those ISPs on average. You remember when ntl had just 2 gigabit to LINX and both pipes were at 100% pretty much 24/7? It seems to be a historical trend with this company to cut short on peering. Curious if you know what the utilisation levels are like now? :)

---------- Post added at 09:05 ---------- Previous post was at 08:21 ----------

Interesting comments from other ISPs.

3 different ISPs:

A user of BE commented that LINX was fine yesterday "except for Virgin" traffic,
and 2 other ISPs told me that they are seeing issues with Virgin Media but are otherwise fine.

Chrysalis
20-04-2011, 16:39
This issue seems to be back, this time I'm seeing it on the Superhub.

Both Maidenhead and MK are dropping 1-5 packets per test, with dodgy latency as well.

No issue on the SSH server, as that ISP still has its LINX link to VM disabled.

craigj2k12
20-04-2011, 17:53
http://www.pingtest.net/result/39139311.png

http://www.pingtest.net/result/39139233.png

http://www.pingtest.net/result/39139355.png

Chrysalis
20-04-2011, 18:34
You have to watch the 250 packets sent before it starts the jitter test; if 248 or 249 return, it is generous and counts that as 0% loss. Manchester isn't affected either :)
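
To spell out the rounding point: the loss figure is just (sent - returned) / sent, and if the displayed number is then truncated to a whole percent (an assumption about how pingtest rounds, but it would match the 'generous' behaviour described above), a couple of drops out of 250 still show as 0%. A quick illustration using the returned counts mentioned in this thread:

```python
# Small loss hides behind whole-percent displays: (sent - returned) / sent,
# truncated to an integer percentage (assumed display behaviour, see note above).
def loss_percent(sent: int, received: int) -> float:
    return 100.0 * (sent - received) / sent

for received in (250, 249, 248, 244):
    pct = loss_percent(250, received)
    print(f"{received}/250 returned -> {pct:.1f}% loss, displayed as {int(pct)}%")
```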

craigj2k12
20-04-2011, 18:58
It returned 248.

http://www.pingtest.net/result/39141901.png

---------- Post added at 18:58 ---------- Previous post was at 18:42 ----------

http://ukinternetreport.co.uk/

foddy
20-04-2011, 18:58
Note that BT and Sky both have 100Gbps to LINX, Opal / TalkTalk 110Gbps. Even Be/O2 have 40Gbps so VM's 81Gbps is actually relatively low given their size.
Don't forget that a) it's up to someone at VM to update PeeringDB, that information isn't direct from LINX, and b) it doesn't include private peering at LINX, which VM may use more than Sky. Only VM know the full picture.

LINX also impose a congestion charge if a port averages more than 80% utilised, which effectively forces an upgrade. 81Gbps is likely what's needed, rather than them being cheapskates! :)

I suspect that the problems are more likely to be router related (e.g. routeing table capacity problems causing some routes to be software switched) than congestion.
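
As a back-of-envelope illustration of that 80% rule (only the threshold comes from the post above; the traffic figures are made up): to keep an average rate under 80% utilisation you need at least that rate divided by 0.8 in port capacity, so roughly 65Gbps of traffic already justifies about 81Gbps of LINX ports.

```python
# Capacity needed to keep a given average rate under the 80% utilisation threshold.
def capacity_needed(avg_traffic_gbps: float, threshold: float = 0.8) -> float:
    return avg_traffic_gbps / threshold

for traffic in (50, 60, 65):  # illustrative figures only
    print(f"{traffic} Gbps average -> at least {capacity_needed(traffic):.1f} Gbps of ports")
```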

Chrysalis
20-04-2011, 18:59
Interesting, Microsoft have the same capacity at LINX as VM do O_o

Level3, who I thought were a big global player in transit, only have 4 gigabit. I guess that shows how much private peering has grown in comparison to using 3rd-party transit.

Akamai have 160 gig (the biggest I've found so far).

craigj2k12
20-04-2011, 19:01
Did everyone catch the link I posted?

http://ukinternetreport.co.uk/

Ignitionnet
20-04-2011, 19:03
Don't forget that a) it's up to someone at VM to update PeeringDB, that information isn't direct from LINX, and b) it doesn't include private peering at LINX, which VM may use more than Sky. Only VM know the full picture.

LINX also impose a congestion charge if a port averages more than 80% utilised, which effectively forces an upgrade. 81Gbps is likely what's needed, rather than them being cheapskates! :)

I suspect that the problems are more likely to be router related (e.g. routeing table capacity problems causing some routes to be software switched) than congestion.

That information is from LINX - only the asterisked fields are from PeeringDB. PeeringDB doesn't, for example, know which ports on which switches VM connect to ;)

VM have triggered congestion charges more than once; they're famous for it. They spend an awful lot of time on traffic management to make maximum use of their various transit and peering connections.

Chrysalis
20-04-2011, 19:04
hmmmz

http://ukinternetreport.co.uk/?p=d&id=114

Ignitionnet
20-04-2011, 19:05
Level3, who I thought were a big global player in transit, only have 4 gigabit. I guess that shows how much private peering has grown in comparison to using 3rd-party transit.

Not really, it shows a misunderstanding on your part. Why would a tier 1 provider of transit have a ton of capacity on the LINX peering LANs? They have what they need for their own hosting and CDN activities.

foddy
20-04-2011, 19:06
Level3, who I thought were a big global player in transit, only have 4 gigabit. I guess that shows how much private peering has grown in comparison to using 3rd-party transit.

The important part of the word "peering" is "peer". Level 3 are so big, it wouldn't be peering so much as giving smaller companies free transit.

A year or so ago, they even considered Comcast too small, and a split Internet was threatened. It's worth reading about, if you're interested. I found an article on it here:

http://arstechnica.com/tech-policy/news/2010/12/comcastlevel3.ars/

That said, Level3 probably have a lot of capacity at LINX that you can't see.

Chrysalis
20-04-2011, 19:09
Not really, it shows a misunderstanding on your part. Why would a tier 1 provider of transit have a ton of capacity on the LINX peering LANs? They have what they need for their own hosting and CDN activities.

Obviously, yeah, I did misunderstand.

I expected the likes of Level3 to have connectivity to the LANs because ISPs without their own peering need to access it somehow; I just didn't expect it to be that small in comparison to direct peering links.

Ignitionnet
20-04-2011, 19:15
Obviously, yeah, I did misunderstand.

I expected the likes of Level3 to have connectivity to the LANs because ISPs without their own peering need to access it somehow; I just didn't expect it to be that small in comparison to direct peering links.

You misunderstand why those LANs exist. They are a place for ASes to peer settlement free, not for Level3 et al to pimp transit over.

When you buy transit you pay for a port to connect to the transit provider and pay a combination of a port charge and a 95th percentile traffic charge.
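
For anyone who hasn't met it, a hedged sketch of how that 95th percentile traffic charge is typically worked out: the provider records throughput samples (commonly one every 5 minutes), throws away the top 5%, and bills on the highest remaining sample, so short bursts cost nothing. The sample values below are invented purely for illustration.

```python
# Typical 95th-percentile billing calculation (sketch; sample data is made up).
def ninety_fifth_percentile(samples_mbps: list[float]) -> float:
    ordered = sorted(samples_mbps)
    # Discard the top 5% of samples and bill on the highest of what's left.
    index = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[index]

# A real month of 5-minute samples is ~8640 values; a short fake set will do here.
samples = [120, 150, 90, 300, 2500, 140, 160, 170, 130, 110,
           100, 95, 105, 115, 125, 135, 145, 155, 165, 175]
print(f"Billable rate: {ninety_fifth_percentile(samples):.0f} Mbps")
# The single 2500 Mbps burst falls in the discarded top 5%, so it doesn't raise the bill.
```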

Chrysalis
20-04-2011, 19:44
You misunderstand why those LANs exist. They are a place for ASes to peer settlement free, not for Level3 et al to pimp transit over.

When you buy transit you pay for a port to connect to the transit provider and pay a combination of a port charge and a 95th percentile traffic charge.

Yeah, but I assumed Level3 do peer settlement-free with various ISPs.

Ignitionnet
21-04-2011, 11:06
Yeah, but I assumed Level3 do peer settlement-free with various ISPs.

Yes they do, that's why they have the LINX connections there. It's a subset of their IP range that they advertise for hosting and CDN; other than this they are a tier 1 ISP and only peer, via private peering, with other tier 1s and maybe the odd tier 2. Everyone else purchases transit from them.

Chrysalis
21-04-2011, 12:01
Interestingly, I read that Comcast page and another page linked from it. I can see now why ISPs are going all out to get customers (even at a loss), as they're using a large customer base to squeeze cash out of CDN providers for delivery to subscribers.

Chrysalis
21-04-2011, 16:55
getting worse, only 244 returned.

http://www.pingtest.net/result/39194668.png

This has been going on for days now; is this going to be the new normal from VM?

http://ukinternetreport.co.uk/?p=d&id=114 is still down. I wonder if they need customers to tell them it's down?

Ignitionnet
21-04-2011, 23:22
They are quite aware the interface is down; when it went down, or was admin'd down, an SNMP trap would have been fired off to the monitoring system.
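
For the curious, the pull-based equivalent of those traps looks roughly like this - polling IF-MIB::ifOperStatus over SNMP with net-snmp's snmpget. The hostname, community string and interface index below are placeholders, not anything of VM's.

```python
# Sketch: ask a router over SNMP whether an interface is operationally up or down.
import subprocess

def if_oper_status(host: str, community: str, if_index: int) -> str:
    """Poll IF-MIB::ifOperStatus via net-snmp's snmpget; returns e.g. 'up' or 'down'."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host,
         f"IF-MIB::ifOperStatus.{if_index}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Placeholder target and credentials, for illustration only.
print(if_oper_status("router.example.net", "public", 1))
```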

Chrysalis
22-04-2011, 09:16
I am seeing lots of congested traffic actually. The post I did last night in the p2p thread seems to be one of those routes (still slow today, but not as slow), and various EU routes going via Manchester are performing poorly; this reminds me of my ntl days when there were tons of peering/transit issues. Feels like VM have simply been sell sell sell and gone in over their heads. If this outage is causing all these issues then it's very bad that VM have left it down so long.

Chrysalis
22-04-2011, 15:07
this seems to be fixed now.

http://www.pingtest.net/result/39245545.png

I see the router in question in traceroutes again as well. Although I am still seeing poor performance from EU servers, it may be that they have yet to change their routing back on the return path.