For the last 4 hours or so my website has been unreachable. The server that I lease is hosted in Atlanta and there were some problems there:
The word we're getting from Atlanta is that a
major carrier is working on a bad fiber connection. This is not the
data center's fault and is totally unrelated to the emergency router
maintenance from a month ago. I'm seeing 1%-3% packet loss from NJ.
We'll continue to monitor and provide updates as we receive them.
Apologies for any inconvenience.
This means that 1 to 3% of information was not making the trip. In fact it was much worse than that, with major sections of the Internet completely unreachable.
Checking some Internet traffic graphs told the story: some of the major carriers had completely lost all data transfer ability through this area. This was a major stuff-up.
This then had a flow-on effect as Internet traffic was routed around the damaged areas, slowing down large portions of the Internet. I am not sure exactly what happened, but it looks as if a major Internet backbone was damaged somewhere in Atlanta.
Having said that, further complications then arose, as reported by some European users:
Actually, connectivity from Amsterdam to the US
is all looking pretty flaky at the moment (11:22 CET) - > 50% packet
loss to Atlanta and Fremont.
This was then followed by a report from Tech Support:
Posted: Wed Jan 09, 2008 7:56 am Post subject: Atlanta: Datacenter Wide Power Failure
We're currently making sure everything is back online. Updates in a few.
A Datacenter wide power failure at the same time as the backbone problems... hmmmm.....
What happened to the UPSs? What about the backup generators?
Then this one:
Posted: Wed Jan 09, 2008 12:05 pm
Atlanta DC employees have *finally* informed us that they can not get atlanta24
to power back up. Apparently their power outage (which we have still
not received a RFO for) damaged the server.
We have instructed them to move atlanta24's drives to a standby host. More updates to come...
The site stayed down until I was able to log into my server and order it to shut down and reboot, which finally brought the website back online.
I really hate this, as Google seems to penalise websites for being unstable and not up all the time. The last several times my server crashed, visitors to my site dropped by about 30% for the following days and then slowly climbed back up to normal levels.
Since something like 90% of my traffic comes from Google, this means I lose about 30% of my Internet income for several weeks each time this happens, and each time I am back at the start, trying to rebuild my web traffic.
If I can get the site traffic up to double current levels, it will be financially viable to have another server in California or Texas and share the site between them, so that if one server or datacenter goes down the site will remain active.
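As a rough sketch of the two-server idea, the Python below picks the first server that passes a health check. The hostnames and the `is_healthy` callback are hypothetical placeholders rather than an actual setup, and a real deployment would more likely handle this at the DNS or load balancer level.

```python
from typing import Callable, Optional, Sequence


def pick_active_server(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first server that passes the health check, or None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical hostnames, with the Atlanta server simulated as down.
status = {"atlanta.example.com": False, "texas.example.com": True}
active = pick_active_server(list(status), lambda host: status[host])
print(active)
```

The health check is passed in as a function so the selection logic can be tried out without touching the network; in practice it could be an HTTP request with a short timeout.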
Anyway, we're back up and running now, let's hope it stays that way :)
UPDATE:
Here is the official word now that everything is repaired. It seems the guys at the Datacenter in Atlanta, GA had a rather unfortunate run of bad luck (see what happens when everything is made as cheap as possible, as is the case these days):
At approximately 4:45 am EST the NAP suffered a power outage lasting approximately 10 seconds from Georgia Power.
The generators fired and came online 15 seconds after the initial outage
and the load was transferred to generators which ran for 30 minutes
while monitoring the incoming power quality from GA Power at which time
the load was transferred back to utility.
One of the UPS's that serves part of the facility suffered a
battery outage on 2 different redundant strings which caused it to drop
the load.
We installed a second redundant string approximately 9 months ago
to minimize the possibility of this type of situation. The batteries in
the 2 strings are set up in parallel, meaning each is capable of carrying
the full load for up to 5 minutes.
All it takes is 1 battery in a string to fail for the entire string
to fail. This is the same in all UPS systems and is the reason we
installed the second string on advice from the manufacturer.
The original string batteries are 1.5 years old and were installed new. The second string is 9 months old and was installed new.
A single battery in the second string failed after 3 batteries in the first string failed.
We turned the generators back on to avoid an interruption during
troubleshooting and maintenance, and MGE sent a tech onsite within an
hour to troubleshoot, at which time we discovered the battery issue. We
replaced the batteries within an hour of diagnosis and brought the
system back online and out of maintenance bypass.
The load is currently protected and all batteries have been tested again.
Both sets of batteries have been maintained and tested by MGE
direct service every 6 months under a pm plan that they recommended for
proper maintenance and operation.
It was extremely rare and unforeseen to have something like this happen.
We are purchasing our own battery tester and will set up a monthly
pm on the batteries that we will conduct ourselves in addition to the 6
month pm that MGE does on the UPS as well as the batteries. We are also
researching a real time battery monitoring system that can predict
battery failure.
Batteries are the weakest link in the system and we feel like we
properly followed recommended engineering and maintenance on these
systems. However, that will not assure 100%, as we found out today in a
very rare incident.
Extemporaneous events that continued to affect service during the outage:
One of the main metro e switches that runs the links of our
backbone went offline during the outage, and during that power-induced
reboot we lost connectivity to half our backbones. We have our
backbones split in half, with half going out the east and half out the
west side of the building, taking diverse paths across redundant
switches to the final interconnect points.
The switch was unstable when it came back online due to a GBIC
that died, and for some odd reason it rebooted itself several times, about
every 10 minutes. We replaced the GBIC with a spare we keep onsite.
This caused half the backbones to go up and down and placed a large
CPU load on the different core routers we have due to the BGP table loads
going on. This is very CPU intensive, and when you have a lot of up and
down it can appear that the network is completely down (it is if you
are on a link that is flapping), but the fact is that the entire network
was not down but was impacted. This settled down when the switch was
stabilized.
We split our backbones up over several different redundant backbone routers.
Once this switch was brought back online and stabilized the network stabilized as well.
An access switch that serves 16 servers also died and we replaced
it with a spare once we found the issue. We keep spares on site for
every piece of network gear we have.
An APC that was only 6 months old and is a dual fed APC from 2
different power sources (including the newer UPS) failed and did not
come back, so we replaced it with an onsite spare. It was bizarre to say
the least, and of course it powered one of our 3 main DNS clusters, so we
lost DNS capacity for an hour.
Most of the issues currently going on are related to server
hardware that did not do well in a power reboot situation or needs a
fsck. We are actively working on them and will not rest until all is
well.
Many customers in the facility do have A and B feeds from our
power. We offer this through different UPS systems, different power
panels and different transformers. Some very early customers that
purchased A and B feeds when we only had one UPS system at the NAP are
on the same UPS and as such lost power. Those customers will be offered
a free move on their B feed to the newer UPS to increase their power
diversity; they simply need to open a ticket.
What are we doing on power in the future?
We have another UPS from MGE on order as of 4 weeks ago that is due
to deliver in mid Feb and will increase the diversity of the power in
the facility. We plan on having 2 battery strings on it as well.
We are in the process of installing another set of 5 Cummins
generators and another 3000 amp transformer, which will further
diversify our generator and transformer plant. This will be completed
in mid February. Construction is going on currently; we took
delivery of the switchgear and generators 2 weeks ago. 4 UPSs will be
moved to the new power feed and generators to diversify the power
source to the UPS. This will give us 100% redundancy on the A / B
feeds at that point.
We installed a redundant B feed to our metro e gear and 2 dual fed
APCs at our TELX cabinet after TELX suffered a complete UPS failure at
56 Marietta 4 months ago. This turned out to be good because there was
another complete failure of the B UPS 4 weeks ago, but we were not
affected since we had a redundant feed from them. The outage affected
all customers on the second floor. We would have lost more than 50% of our
network had we not been on dual fed APCs and dual power feeds at the
building, which would have been bad.
We are increasing the battery pm schedule to monthly from biannual.
We are researching a battery monitoring system for the strings.
We will be taking a fuel delivery this week to restock our main fuel supply.
We are examining in depth the abnormalities seen this morning on one of
our 4 core metro switches, and if we do not find a RFO from the
manufacturer we will be examining replacing it or upgrading to a different,
more robust solution, which has been in our long term plan but may get
moved up.
We will be doing another power examination of our core switching
routers (currently 6 of them, all with dual fed power) and our core
metro e switches (currently 4 of them) to make sure that our power
feeds are truly redundant and no legacy circuits are there to affect
them.
We will be examining our on site spares inventory to make sure we
are still at correct levels since we used some items this morning.
We apologize for the outage caused by the failure of the primary
and backup batteries and will continue to provide the best service at
an excellent price.
The MGE tech that has all the major accounts in Atlanta, including
Coke and several others, told us that this was a very freak occurrence
with negligible odds of happening, and in his opinion we have done
everything right on our maintenance and pm and redundancy of the
batteries. He would have done the same thing, and there was
really nothing he would have recommended different at that point.
Just glad it wasn't me out there diagnosing all those faults.
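The battery explanation in the statement above (one failed battery takes out its whole string, and the load drops only when both parallel strings fail) can be put into rough numbers. This is a minimal sketch that assumes batteries fail independently of each other, which may be optimistic. The per-battery failure probability and the number of batteries per string are made-up illustration values; the statement does not give them.

```python
def string_failure_probability(p_battery: float, batteries_per_string: int) -> float:
    """A series string fails if at least one of its batteries fails."""
    return 1.0 - (1.0 - p_battery) ** batteries_per_string


def all_strings_fail_probability(
    p_battery: float, batteries_per_string: int, strings: int = 2
) -> float:
    """Parallel strings drop the load only if every string fails."""
    return string_failure_probability(p_battery, batteries_per_string) ** strings


# Hypothetical values: 40 batteries per string, 1% failure chance per battery.
one_string = string_failure_probability(0.01, 40)
both_strings = all_strings_fail_probability(0.01, 40, strings=2)
print(f"one string fails:   {one_string:.3f}")
print(f"both strings fail:  {both_strings:.3f}")
```

Under these assumptions the long series string is the weak point: a small per-battery risk adds up quickly across a string, and the second string only helps to the extent that its failures are unrelated to the first.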