The fault tolerance needs to be tested so that you:
- know exactly how the RAID behaves when a hard drive fails;
- ensure that the RAID is actually capable of surviving a disk failure.
When deploying a fault-tolerant array (RAID 1, RAID 10, RAID 5, or RAID 6), test the system with a simulated disk failure.
If you have hot-swappable drive bays, pick one at random and pull its hard drive while the system is running.
If hot swap is not available, power the system off, disconnect one of the disks, and then start the system.
You need to verify that the system behaves precisely as expected in all aspects.
This obviously includes verifying that the array is still accessible.
Less obvious, but just as necessary, is verifying that all the notifications
were sent and received properly: emails, SMS, and possibly pre-recorded phone calls.
Also, check your ability to identify the failed drive easily.
Either the appropriate drive bay light should turn red, or the array diagnostic software
should clearly report the number of the failed drive bay.
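
As an illustration of the "still accessible" and "identify the failed drive" checks, here is a minimal sketch that assumes a Linux software RAID (md), which is only one of many possible setups; hardware controllers come with their own diagnostic tools. It parses text in the format of /proc/mdstat and reports, for each array, its state, the members flagged as faulty, and the expected versus working device counts. The names parse_mdstat, degraded_but_online, and SAMPLE_MDSTAT are made up for this example.

```python
import re

# Illustrative /proc/mdstat content: a two-disk RAID 1 with one member flagged (F).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1](F) sda1[0]
      1048512 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_LINE = re.compile(r"^(md\w+)\s*:\s*(\S+)\s*(.*)$")  # "md0 : active raid1 ..."
MEMBER = re.compile(r"^([^\[\s]+)\[\d+\](.*)$")            # "sdb1[1](F)"
COUNTS = re.compile(r"\[(\d+)/(\d+)\]")                    # "[2/1]" = expected/working


def parse_mdstat(text):
    """Return {array_name: {"state", "failed", "expected", "working"}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            name, state, rest = header.groups()
            failed = []
            for token in rest.split():
                member = MEMBER.match(token)
                # Members carrying the (F) flag have been marked faulty.
                if member and "(F)" in member.group(2):
                    failed.append(member.group(1))
            current = {"state": state, "failed": failed,
                       "expected": None, "working": None}
            arrays[name] = current
            continue
        counts = COUNTS.search(line)
        # The first "[n/m]" after an array header gives its device counts.
        if counts and current is not None and current["expected"] is None:
            current["expected"] = int(counts.group(1))
            current["working"] = int(counts.group(2))
    return arrays


def degraded_but_online(info):
    """True if the array is active but has fewer working devices than expected."""
    return (info["state"] == "active"
            and info["expected"] is not None
            and info["working"] < info["expected"])
```

On a real system you would pass in the contents of /proc/mdstat instead of the sample text. A pulled drive may be dropped from the member list rather than flagged, in which case only the device counts reveal the degradation, so the test should check both.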
While you are at it, simulate a power failure by unplugging the UPS from its wall socket.
The system should remain online for some predefined period and then gracefully shut down.
Ensure that the UPS battery is not fully depleted by the time the shutdown completes.
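
The battery check comes down to simple arithmetic, sketched below. All names and numbers are hypothetical: battery_runtime_min is how long the UPS can carry the load, hold_time_min is the predefined period the system stays online before shutting down, and shutdown_duration_min is how long the shutdown itself takes. The reserve_min threshold is an arbitrary assumption, not a recommended value.

```python
def ups_shutdown_margin(battery_runtime_min, hold_time_min, shutdown_duration_min):
    """Minutes of battery left once the shutdown has completed."""
    return battery_runtime_min - (hold_time_min + shutdown_duration_min)


def shutdown_plan_is_safe(battery_runtime_min, hold_time_min,
                          shutdown_duration_min, reserve_min=2.0):
    """True if at least reserve_min minutes of battery remain after shutdown."""
    margin = ups_shutdown_margin(battery_runtime_min, hold_time_min,
                                 shutdown_duration_min)
    return margin >= reserve_min


# Example: 20 minutes of runtime, 10 minutes online, 4 minutes to shut down.
margin = ups_shutdown_margin(20, 10, 4)  # 6 minutes left
```

Measure the real values during the test rather than trusting the data sheet, since the actual runtime depends on the load and on the age of the battery.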
Do this testing before the array is loaded with production data;
otherwise, a failed test may turn into an unplanned RAID recovery incident.