Tales from our helpdesk: server failure on Christmas Eve
- Created: Monday, 07 February 2022
- Written by Jonathan Hobson
In the early hours of Christmas Eve our monitoring platform alerted us that a server was offline - Paul diagnosed a faulty RAM module.
He was quickly onsite, verified that the RAM was faulty and liaised with the manufacturer to get a replacement. He removed the faulty RAM and booted the server - adjusting resource limits to allow everyone to work.
As soon as the replacement RAM arrived, we went onsite to install it and restore the resource limits to normal.
This could have been a much worse outage. It wasn't, because for our customers:
- We spec servers to be able to cope with the failure of a component.
- We supply equipment with business class warranties, which means technical help and spare parts are readily available from the manufacturer.
- We deploy our 24/7 monitoring system and configure it to detect failures like this one and alert us.
- We configure and test out-of-band tools that allow us to identify issues even if the server is off.
- Our out-of-hours engineers are experts in identifying issues, applying temporary workarounds and fitting replacements.
- Our engineers speak the language of manufacturers, which allows us to rapidly get parts dispatched.
Without the services we provide, outages would last significantly longer.
Some people might think that they’re fine with just the manufacturer’s on-site response. However:
- The manufacturer will often just ship out the replacement part and expect the customer to replace it - would you know how?
- The manufacturer doesn’t know how the customer’s systems are set up and isn’t responsible for other components, whereas we do know, and we’ll get the system back up and running while we’re waiting for the replacement part.
- Replacing the part doesn’t always get you back up and running - we resolve the knock-on effects, for example by reinstalling software or restoring from backup.
For peace of mind, engage telanova as your IT team.