Friday, September 18, 2015

Failure Mode and Effects Analysis

It is said that a chain is only as strong as its weakest link. In IT terms, we refer to a Single Point of Failure (SPOF) as one of a system's weaknesses. Technology Architects spend a significant amount of time attempting to identify and mitigate these SPOFs. But what if there are a large number of points of failure? How does an Enterprise IT organization prioritize which SPOFs to mitigate first?

Enter Failure Mode and Effects Analysis (FMEA). The idea is to first identify the Points of Failure, then rate each one on three factors:
  • Severity of the Failure - how would the failure affect the business of the Enterprise?
    • a higher number here reflects greater impact should a failure occur
  • Occurrence of the Failure - how often could the failure happen?
    • a higher number here represents a greater likelihood of the failure occurring
  • Detection of the Failure - how would we know if the failure occurred?
    • a higher number here represents a less-detectable failure, and therefore a greater risk
Each of these three factors is rated on a scale of one to ten, with ten being the highest. Multiplied together, they yield a Risk Priority Number (RPN), which ranges from 1 to 1,000. In effect, the higher the RPN, the higher the risk posed by that failure. From there, we can create a prioritized list of tasks which will mitigate the known risks of a system.
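
As a rough illustration of the arithmetic, here is a minimal Python sketch that computes the RPN for a few points of failure and sorts them from highest to lowest priority. The FailureMode class, the prioritize helper, and every rating below are hypothetical, made up purely for this example; real ratings would come out of the organization's own analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One point of failure with its three FMEA ratings (hypothetical model)."""
    name: str
    severity: int    # 1-10, higher = greater business impact
    occurrence: int  # 1-10, higher = more likely to happen
    detection: int   # 1-10, higher = harder to detect

    def __post_init__(self):
        # Reject ratings outside the one-to-ten scale.
        for rating in ("severity", "occurrence", "detection"):
            value = getattr(self, rating)
            if not 1 <= value <= 10:
                raise ValueError(f"{rating} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        # Risk Priority Number: the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return the failure modes ordered from highest RPN to lowest."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative ratings only; these are assumptions, not measured values.
failure_modes = [
    FailureMode("Single core switch", severity=9, occurrence=2, detection=3),
    FailureMode("Unmonitored backup job", severity=7, occurrence=5, detection=8),
    FailureMode("Expired TLS certificate", severity=6, occurrence=4, detection=2),
]

for mode in prioritize(failure_modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these made-up ratings, the unmonitored backup job lands at the top of the list even though the core switch has the higher severity: a moderately severe failure that happens often and goes unnoticed can outrank a severe one that is rare and easy to spot.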

[Illustration of an FMEA Example]

Generally speaking, as the IT organization works through the top few items on the list (from highest priority to lowest), the overall risk to the system decreases dramatically, along the lines of the 80/20 rule: most of the risk reduction tends to come from mitigating a small share of the failure points. FMEA is normally applied to manufacturing or development processes, but it can easily be adapted to suit IT systems as well.
