Overview
To understand the failure modes and scenarios of VolunteerGrid2, we need to analyze how each node can fail and, for planning purposes, how failures across multiple grids can overlap.
Keep in mind that "data loss" below means that shares of files are lost. Because Tahoe-LAFS spreads the erasure-coded shares of each file across many nodes, the file itself should remain fully intact as long as enough shares survive elsewhere.
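As a rough illustration of why losing shares does not necessarily mean losing files, the sketch below estimates the chance that a file stays recoverable under k-of-N encoding. It assumes one share per node and independent node losses, which real grids only approximate. The 3-of-10 parameters are the Tahoe-LAFS defaults and may not match VolunteerGrid2's settings, and the function name and loss probabilities are hypothetical.

```python
from math import comb


def file_survival_probability(k: int, n: int, p_node_loss: float) -> float:
    """Probability that at least k of n shares survive.

    Assumes one share per node and that each node loses its share
    independently with probability p_node_loss (a simplification).
    """
    p_ok = 1.0 - p_node_loss
    return sum(
        comb(n, i) * p_ok**i * p_node_loss ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # 3-of-10 is the Tahoe-LAFS default; the loss rates are made-up inputs.
    for p in (0.1, 0.3, 0.5):
        survival = file_survival_probability(3, 10, p)
        print(f"p_node_loss={p:.1f}  P(file recoverable)={survival:.6f}")
```

Even under this simple model, a file is lost only when more than N - k shares disappear before a repair runs, which is why timely repair matters more than any single node failure.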
Summary
Nodes can fail in a number of ways:
- Node failures
- Drive failures
- Other component failures
- Connection failures
- Physical destruction
Failure analysis supports good governance of the grid and makes it possible to plan expansions and direct upgrades to nodes.
Modes of Failure
Node Failures
* Administrative Error
* Storage Drive Failure
** Single Disk Node
- Loss of all node data
- Temporary node outage pending drive replacement
- All capabilities with shares on this node require repair
** RAID
- Data loss due to a single disk failure is unlikely.
- RAID stability compromised pending replacement of failed drive.
** LVM arrays
- This is by far the most fragile node configuration, but it is also an excellent candidate for inclusion in a Tahoe-LAFS grid.
- LVM volumes are commonly built from cast-off and second-hand drives.
- Loss of all node data
- Temporary node outage pending drive replacement
- All capabilities with shares on this node require repair
* Other Hardware Failures
- Power supply failure
- System drive failure
- General system fault
Connection Failures
* NIC Failure
- Down until component replacement. No permanent data loss
* Router Failure
- Down until component replacement. No permanent data loss
* ISP Outage
- Down until service restored. No permanent data loss
* ISP Upstream Outage
- Down until service restored. No permanent data loss
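The difference between a temporary outage and permanent share loss can be sketched as a small classifier. This is a simplified model, not part of Tahoe-LAFS: the `NodeState` enum, the `classify_file` function, and the node names are hypothetical, and `k` stands for the number of shares needed to reconstruct a file.

```python
from enum import Enum


class NodeState(Enum):
    UP = "up"
    OFFLINE = "offline"      # NIC, router, or ISP outage: shares still exist
    DESTROYED = "destroyed"  # e.g. failed single-disk node: shares are gone


def classify_file(share_nodes, node_states, k):
    """Classify a file given the nodes that hold its shares.

    share_nodes: list of node names, one entry per share.
    node_states: dict mapping node name to NodeState.
    k: number of shares needed to reconstruct the file.
    """
    up = sum(1 for n in share_nodes if node_states[n] is NodeState.UP)
    offline = sum(1 for n in share_nodes if node_states[n] is NodeState.OFFLINE)
    if up >= k:
        return "available"
    if up + offline >= k:
        # Enough shares still exist, but some are unreachable for now.
        return "unavailable"
    return "lost"
```

For example, with shares on four nodes of which two are up, one is offline, and one is destroyed, a file that needs three shares is "unavailable" rather than "lost": it becomes readable again once the offline node returns.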
Physical destruction
* Theft
- Probable permanent data loss
- Multiple scenarios on node service restoration
* Structure Fire
- Probable permanent data loss
- Multiple scenarios on node service restoration
* Local Catastrophe
- Possible permanent data loss
- Multiple scenarios on node service restoration
* Large-scale Catastrophe
- Possible permanent data loss
- Multiple scenarios on node service restoration
Node Administrator Incapacitation