Overview

To understand the failure modes and scenarios of VolunteerGrid2, we need to analyze how each node can fail and how failure scenarios across multiple grids can overlap, so that we can plan for them.

Keep in mind that "data loss" below means that shares of files are lost. Because of the way Tahoe-LAFS stores files, the integrity of the files themselves should remain fully intact.
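The distinction between losing shares and losing files can be made concrete with a rough calculation. The sketch below assumes a k-of-N encoding in which any k of the N shares are enough to rebuild a file, that every share sits on a different node, and that nodes fail independently. The function name and the 20% loss rate are illustrative assumptions, not measurements from VolunteerGrid2; 3-of-10 is used because it is the usual Tahoe-LAFS default, and the grid's real settings may differ.

```python
from math import comb


def file_survival_probability(k: int, n: int, p_share_loss: float) -> float:
    """Probability that at least k of n shares survive.

    Assumes one share per node and independent node failures, which is
    optimistic when failures are correlated (shared ISP, same building).
    """
    p_ok = 1.0 - p_share_loss
    return sum(
        comb(n, i) * p_ok**i * p_share_loss ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Illustrative inputs only: 3-of-10 encoding, 20% chance of losing any one share.
    print(file_survival_probability(3, 10, 0.2))
```

Correlated failures break the independence assumption, which is one reason the overlapping scenarios are worth analyzing separately.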

Summary

Nodes can fail in a number of ways:
  • Node failures
    • Drive failures
    • Other component failures
  • Connection failures
  • Physical destruction

Failure analysis supports good governance of the grid and makes it possible to plan expansions and to direct upgrades to nodes.

Modes of Failure

Node Failures

* Administrative Error

* Storage Drive Failure

** Single Disk Node

  • Loss of all node data
  • Temporary node outage pending drive replacement
  • All capabilities with shares on this node require repair

** RAID
  • Data loss due to a single disk failure is unlikely.
  • RAID stability compromised pending replacement of failed drive.
** LVM arrays
  • This is by far the most fragile node configuration; however, it is also an excellent candidate for inclusion in a Tahoe-LAFS grid.
    • LVM volumes are commonly composed of cast-off and second-hand drives.
  • Loss of all node data
  • Temporary node outage pending drive replacement
  • All capabilities with shares on this node require repair
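The statement that all capabilities with shares on a failed node require repair can be sketched as a small lookup. The placement map, capability names, node names, and the function below are hypothetical; the code does not call any Tahoe-LAFS API. It assumes each node holds at most one share per capability and that k shares are needed to rebuild a file.

```python
def affected_capabilities(placement, failed_nodes, k=3):
    """Report capabilities that lost shares on the failed nodes.

    placement: dict mapping a capability name to the set of nodes holding
               one share each (hypothetical bookkeeping, not a Tahoe-LAFS structure).
    failed_nodes: set of node names whose data is gone.
    k: shares needed to rebuild a file (assumed value).
    """
    report = {}
    for cap, nodes in placement.items():
        lost = nodes & failed_nodes
        if lost:
            remaining = len(nodes - failed_nodes)
            report[cap] = {
                "shares_lost": len(lost),
                "shares_remaining": remaining,
                "recoverable": remaining >= k,
            }
    return report


if __name__ == "__main__":
    # Made-up placement for illustration.
    placement = {
        "cap-A": {"node1", "node2", "node3", "node4"},
        "cap-B": {"node2", "node5", "node6"},
        "cap-C": {"node5", "node6", "node7"},
    }
    for cap, info in affected_capabilities(placement, {"node2"}).items():
        print(cap, info)
```

Passing more than one node in `failed_nodes` gives a quick view of overlapping failures: a capability stays recoverable only while the surviving share count is at least k.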

* Other Hardware Failures

  • Power supply failure
  • System drive failure
  • General system fault

Connection Failures

* NIC Failure

  • Down until component replacement. No permanent data loss

* Router Failure

  • Down until component replacement. No permanent data loss

* ISP Outage

  • Down until service restored. No permanent data loss

* ISP Upstream Outage

  • Down until service restored. No permanent data loss

Physical destruction

* Theft

  • Probable permanent data loss
  • Multiple scenarios on node service restoration

* Structure Fire

  • Probable permanent data loss
  • Multiple scenarios on node service restoration

* Local Catastrophe

  • Possible permanent data loss
  • Multiple scenarios on node service restoration

* Large-scale Catastrophe

  • Possible permanent data loss
  • Multiple scenarios on node service restoration

Node Administrator Incapacitation

Topic revision: r7 - 2011-04-27 - JodyHarris
 