MAJOR NETWORK ISSUE (17 Jan 2012) - Page 10

Gavin78 · 17-01-2012, 20:33

It's working at about 75% at the moment here in Leeds, not back to 100% yet.

Turkey Machine · 17-01-2012, 20:35

Quote:

Originally Posted by qasdfdsaq

Yes but a simple "If you are calling about the current service outage, we're aware of it already" would massively reduce the number of support staff required.

Easily done, provided you can actually connect to the phone system rather than the NTS service Virgin use for their 150 and 08454541111 numbers.

If you had a few hundred thousand customers bombarding your support number for answers wouldn't you sweat a little? It's the reason most have automated menus on high-capacity servers to deal with that load, evidently this was just a little too much for the poor PBX to handle!

Quote:

Originally Posted by qasdfdsaq

And a lot of sites have a "backup" high volume, low complexity system for reporting major outages - e.g. reverting to a single line of text instead of failing completely.

An alternative for the 503 Server Error if you will, like some sites' cute 404 errors. This'd be fine if their internet service hadn't degraded to the point that the error was worth a pretty page for.

Quote:

Originally Posted by qasdfdsaq

Having a contingency plan to deal with major outages is all part of being a major service provider.

I'm not sure hardware failure comes under that list of first things to check. Evidently something big was up with their internet service when leased lines fail as well - it must have been a very major switch failure within their core network for customers to have still been connected at the UBRs but not routing the traffic, even the VOD failed and that's supposed to be internal.

Quote:

Originally Posted by qasdfdsaq

Actually I was getting a "Site too busy" response from Cable Forum most of the time while VM's own forums were slow, but functional.

God bless Paul M's ability to build a stable forum to take the load of 5000+ angry cable customers wanting answers. This place is non-profit and survives on donations.

By the way - slow is better than "unavailable". At least if it's slow you can get somewhere.

Osem · 17-01-2012, 20:35

I must admit that when I found this site unavailable due to an overloaded server I assumed Alan Fry had posted another of his legendary 'plans'.....

Doz007 · 17-01-2012, 20:37

Quote:

Originally Posted by braysoj1

It is down here in Keighley; this is from my mobile.

All back up and running now though.

rmwebs · 17-01-2012, 20:58

Quote:

Originally Posted by Turkey Machine

God bless Paul M's ability to build a stable forum to take the load of 5000+ angry cable customers wanting answers. This place is non-profit and survives on donations.

Actually the forum did bomb out a few times with 'too many connection' warnings from vBulletin. These usually happen if there is a really high load on the server.

I can understand it on a site like this, but for VM's site to get 503 errors isn't acceptable. They could easily host the site in a cloud failover distribution; the costs would be negligible for VM to do so.

---------- Post added at 21:58 ---------- Previous post was at 21:54 ----------

Quote:

Originally Posted by Traduk

My modem went off at midnight last night for an hour which usually means VM working somewhere and as it came back with a different IP I thought it was re-segmentation. However performance was below par and pings longer than usual.

Early this afternoon as surfing started to fall apart and changes in DNS servers made no difference

Sounds a bit like mine. Woke up this morning with a new IP, service fine for most of the morning, however it was noticeably slower this afternoon, with regular drops and server not found notices. Did a bunch of worldwide ping and speed tests and didn't find any problems. Shortly after this most sites failed to load. I thought it may be OpenDNS at first as I've got our network set to use them, however obviously it wasn't.

VOD wasn't affected in any way however.

(St Albans AL4 via Hemel)

Peter_ · 17-01-2012, 20:59

Quote:

Originally Posted by rmwebs

Actually the forum did bomb out a few times with 'too many connection' warnings from vBulletin. These usually happen if there is a really high load on the server.

I can understand it on a site like this, but for VM's site to get 503 errors isn't acceptable. They could easily host the site in a cloud failover distribution; the costs would be negligible for VM to do so.

Have you read about the fault on the network on the community forum instead of making assumptions?

CLICK ME

Chrysalis · 17-01-2012, 21:04

service back to its usual 60% now.

Synthetic · 17-01-2012, 21:13

Not noticed any issues in Newcastle

qasdfdsaq · 17-01-2012, 21:24

Quote:

Originally Posted by Chrysalis

service back to its usual 60% now.

Lol.

I'm glad I missed all the fun, was at work till it was all fixed.

So anyone know what the fuss was about? Someone mentioned routing hardware failure (which is pretty embarrassing, as a major ISP should have backup routes on just about everything) and someone else mentioned an aircon failure in Poplar?

mikes12345 · 17-01-2012, 21:56

It sounds like this is either fixed or getting fixed - good news from my perspective. I was in work when the failure hit (Virgin Media leased line) and had all sorts of complaints - I worked out when I could not call Virgin it must be a Virgin fault but good to know it should have gone away before I get back into the office tomorrow! At home I haven't seen any disruption (VM cable) - but that might be because I was in work dealing with complaints during the major blackout!

Good work Virgin for getting it fixed, fingers crossed it doesn't happen again any time soon!

Turkey Machine · 17-01-2012, 21:59

Quote:

Originally Posted by rmwebs

I can understand it on a site like this, but for VM's site to get 503 errors isn't acceptable. They could easily host the site in a cloud failover distribution; the costs would be negligible for VM to do so.

Wrong completely. Have you *ANY* idea how much bandwidth actually costs for ISPs? (clue - it's not cheap. Try £10 per MegaBIT-per-second as a rough minimum, with 100 MegaBIT being the minimum size for bandwidth charges, not to mention transit costs (LINX/LoNAP for UK traffic and Cogent et al [expensive] for the worldwide traffic)).

Your domestic 50Mb connection you pay for is a contended service. If you have a 50Mb leased line you pay the high premium to have that service switched on, all the time, and bandwidth reserved exclusively for you.

---------- Post added at 22:59 ---------- Previous post was at 22:57 ----------

Quote:

Originally Posted by qasdfdsaq

Lol.

I'm glad I missed all the fun, was at work till it was all fixed.

So anyone know what the fuss was about? Someone mentioned routing hardware failure (which is pretty embarrassing, as a major ISP should have backup routes on just about everything) and someone else mentioned an aircon failure in Poplar?

Backup routes yes, hardware failure is rare, and when it does happen it requires an engineer to go to the rack(s) in question and manually switch the hardware. That takes time because you have to do it slowly, carefully and properly. Not ham-fistedly so some script-kiddie in his mum's basement can continue his Warcraft campaign.

Sephiroth · 17-01-2012, 22:09

Failover is usually soft in my world - the backup protocols are all there on Cisco kit. A SPOF is not how VM's network is designed otherwise you'd see everyone routing through that point in the traceroutes.

The engineer goes to the racks to replace or reset the kit in question after reading the logs and when the emergency change manager decides would be the best time.

qasdfdsaq · 17-01-2012, 22:11

Quote:

Originally Posted by Turkey Machine

Quote:

Originally Posted by rmwebs

I can understand it on a site like this, but for VM's site to get 503 errors isn't acceptable. They could easily host the site in a cloud failover distribution; the costs would be negligible for VM to do so.

Wrong completely. Have you *ANY* idea how much bandwidth actually costs for ISPs? (clue - it's not cheap. Try £10 per MegaBIT-per-second as a rough minimum, with 100 MegaBIT being the minimum size for bandwidth charges, not to mention transit costs (LINX/LoNAP for UK traffic and Cogent et al [expensive] for the worldwide traffic)).

Your domestic 50Mb connection you pay for is a contended service. If you have a 50Mb leased line you pay the high premium to have that service switched on, all the time, and bandwidth reserved exclusively for you.

How is this at all relevant to 503 errors on VM's website?

Internal access to VM's website has nothing to do with transit or peering, they wouldn't have to pay anyone for anything. In the worst case all they'd have to do is rent a £30/month server in someone else's datacentre to serve up "Yes, the site really is down" messages. £30 a month really is negligible for a company the size of VM.
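For illustration only, the "single line of text" fallback described above could be as small as the sketch below. It uses Python's standard `http.server`; the message wording, the `Retry-After` value, port 8080 and the names `outage_response`, `OutageHandler` and `serve` are all assumptions made for the example, not anything VM actually runs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical wording; a real notice would be written by the provider.
OUTAGE_MESSAGE = b"Yes, the site really is down. We are aware of the outage.\n"


def outage_response() -> Tuple[int, Dict[str, str], bytes]:
    """Build the status code, headers and body for the fallback notice."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(OUTAGE_MESSAGE)),
        "Retry-After": "300",  # assumed value: suggest retrying in 5 minutes
    }
    return 503, headers, OUTAGE_MESSAGE


class OutageHandler(BaseHTTPRequestHandler):
    """Answer every GET with the same one-line 503 notice."""

    def do_GET(self) -> None:
        status, headers, body = outage_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the fallback server until interrupted (port is an assumption)."""
    HTTPServer(("", port), OutageHandler).serve_forever()
```

Calling `serve()` on a cheap box outside the affected network would keep a plain-text answer available even while the main site is returning errors.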

Quote:

Backup routes yes, hardware failure is rare, and when it does happen it requires an engineer to go to the rack(s) in question and manually switch the hardware. That takes time because you have to do it slowly, carefully and properly. Not ham-fistedly so some script-kiddie in his mum's basement can continue his Warcraft campaign.

Dunno about yours or VM's setups, but in our environment backup routes (and pretty much backup everything else) kick in automatically. Last major outage we had on our primary route, nobody even noticed except the net-ops guys, even support hadn't heard a thing from either side till I told them.

Having to manually fail-over faulty hardware in this day and age is pretty backwards.
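As a rough sketch of what "kick in automatically" means: traffic goes to the most preferred route that is still passing health checks, with nobody touching a rack. Real networks do this with routing protocols on the kit itself; the `Route` class, the priority numbers and `select_route` below are hypothetical names used purely to show the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    name: str
    priority: int   # lower number = more preferred
    healthy: bool   # result of the latest health check


def select_route(routes: List[Route]) -> Optional[Route]:
    """Return the most preferred healthy route, or None if all are down."""
    candidates = [route for route in routes if route.healthy]
    return min(candidates, key=lambda route: route.priority, default=None)


# Usage: the primary fails its health check, so the backup is chosen.
routes = [
    Route("primary", priority=1, healthy=False),
    Route("backup", priority=2, healthy=True),
]
chosen = select_route(routes)
```

The point of the sketch is that the decision is a pure function of health state, so it can be re-evaluated every time a check result changes.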

In any case, the failure of a single router or any individual piece of hardware should not be able to cause anything as severe as this. Loss of an entire datacentre due to aircon failure however, could be a justifiable cause, though quite what was going on with the A/C would raise a few questions in itself.

[Edit]
Yeah, what Seph said.

**Paul** · 17-01-2012, 22:44

Quote:

Originally Posted by qasdfdsaq

Actually I was getting a "Site too busy" response from Cable Forum most of the time while VM's own forums were slow, but functional.

We have a number of limits in place to stop anything bringing the site down completely. One of them cuts the forum to that site busy message if the server load exceeds a preset value. That limit is normally set to about 6.00 (the server normally runs at about 1.50 when busy). Today the load hit over 9.00 at one point, so that "safety valve" kicked in until the load fell again. I actually raised it at one point to let more people on.
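A minimal sketch of that kind of safety valve, assuming a Unix host where `os.getloadavg()` is available. The 6.00 threshold comes from the figures above; the function names and the returned strings are made up for the example and are not the forum's actual code.

```python
import os

BUSY_THRESHOLD = 6.00  # limit quoted above; it can be raised to let more people on


def current_load() -> float:
    """One-minute load average of the host (Unix only)."""
    return os.getloadavg()[0]


def should_show_busy_page(load: float, threshold: float = BUSY_THRESHOLD) -> bool:
    """True when the load exceeds the preset limit."""
    return load > threshold


def handle_request(load: float) -> str:
    """Serve the busy message instead of the forum while overloaded."""
    if should_show_busy_page(load):
        return "Site too busy"
    return "forum page"
```

With these numbers, a load of 9.00 gets the busy message and the usual 1.50 gets the forum; in a live handler the `load` argument would come from `current_load()`.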

Quote:

Originally Posted by rmwebs

Actually the forum did bomb out a few times with 'too many connection' warnings from vBulletin. These usually happen if there is a really high load on the server.

Too many connections actually has no connection with server load. It's purely down to the connection limit on MySQL. Again we have this set such that the whole thing can't run away with itself and die.
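To show why this is separate from server load, here is a toy model of a hard connection cap. It only counts open connections and never looks at load. The `ConnectionLimiter` class is hypothetical and stands in for the database's own limit (in MySQL that is the `max_connections` setting).

```python
class ConnectionLimiter:
    """Toy model of a hard cap on concurrent connections."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.active = 0

    def connect(self) -> bool:
        """Accept a connection unless the cap has been reached."""
        if self.active >= self.max_connections:
            return False  # the client would see "Too many connections"
        self.active += 1
        return True

    def disconnect(self) -> None:
        """Release one slot when a client goes away."""
        if self.active > 0:
            self.active -= 1


# Usage: with a cap of 2, the third simultaneous connection is refused
# no matter how idle or busy the machine is.
limiter = ConnectionLimiter(max_connections=2)
results = [limiter.connect(), limiter.connect(), limiter.connect()]
```

Once one client disconnects, the next `connect()` succeeds again, which is why the warning comes and goes during a rush.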

We hit record concurrent guest and member figures, and survived very well.

watzizname · 18-01-2012, 04:40

Not sure if it's related or not but my hub has been resetting itself for the last few hours now, several times while reading this thread..