Predictive failure analysis

  1. Disks

  2. Processor and Memory

  3. References

  4. See also

Predictive Failure Analysis (PFA) refers to methods intended to predict imminent failure of systems or components (software or hardware), and potentially enable mechanisms to avoid or counteract failure issues, or recommend maintenance of systems prior to failure.

For example, a computer system can analyze trends in corrected errors to predict future failures of hardware or memory components and proactively enable mechanisms to avoid them. Predictive Failure Analysis was originally the name of a proprietary IBM technology for monitoring the likelihood of hard disk drive failure, although the term is now used generically for a variety of technologies that judge the imminent failure of CPUs, memory and I/O devices.[1] See also first failure data capture.

Disks

IBM introduced the term PFA and the associated technology in 1992 with reference to its 0662-S1x drive (a 1052 MB Fast-Wide SCSI-2 disk that operated at 5400 rpm).

The technology relies on measuring several key (mainly mechanical) parameters of the drive unit, for example the flying height of the heads. The drive firmware compares the measured parameters against predefined thresholds and evaluates the health status of the drive. If the drive appears likely to fail soon, the system sends a notification to the disk controller.
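
The threshold comparison described above can be sketched roughly as follows. This is a minimal illustration, not IBM's firmware logic: the parameter names, the limits, and the `failure_predicted` function are all hypothetical, chosen only to show how a set of measurements collapses into a single yes/no notification.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Threshold:
    """Acceptable range for one monitored drive parameter."""
    low: float
    high: float


# Hypothetical parameters and limits, for illustration only.
# Real drives use vendor-specific parameters and thresholds.
THRESHOLDS = {
    "flying_height_nm": Threshold(low=40.0, high=120.0),
    "spinup_time_ms": Threshold(low=0.0, high=9000.0),
}


def failure_predicted(measurements: Mapping[str, float]) -> bool:
    """Return True if any known parameter is outside its threshold range.

    The result is deliberately binary: the caller learns only whether a
    notification would be raised, not which parameter caused it.
    """
    for name, value in measurements.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # unknown parameter: ignored in this sketch
        if not (limit.low <= value <= limit.high):
            return True
    return False
```

For instance, `failure_predicted({"flying_height_nm": 30.0})` returns `True` under these assumed limits, because the measured value is below the lower bound.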

The major drawbacks of the technology included:

  • the binary result: the only status visible to the host was the presence or absence of a notification
  • the unidirectional communication: notifications flowed only from the drive firmware to the host

The technology merged with IntelliSafe to form the Self-Monitoring, Analysis, and Reporting Technology (SMART).

Processor and Memory

High counts of intermittent RAM errors corrected by ECC can be predictive of future DIMM failures,[2] so automatic offlining of memory and CPU caches can be used to avoid future errors.[3] For example, under the Linux operating system the mcelog daemon automatically removes from use memory pages that show excessive corrections, and removes from use processor cores that show excessive correctable cache memory errors.[4]
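
The idea of offlining a page after excessive corrected errors can be illustrated with a small sliding-window counter. This is a sketch under stated assumptions and is not mcelog's actual implementation: the `CorrectedErrorTracker` class, its default threshold, and its default window length are hypothetical, and the "offlining" here is only bookkeeping rather than a call into the operating system.

```python
from collections import defaultdict, deque


class CorrectedErrorTracker:
    """Track corrected errors per memory page over a sliding time window."""

    def __init__(self, threshold: int = 10, window_s: float = 86400.0):
        # Hypothetical defaults: 10 corrected errors within 24 hours.
        self.threshold = threshold
        self.window_s = window_s
        self._events = defaultdict(deque)  # page address -> error timestamps
        self.offlined = set()              # pages marked as retired

    def record(self, page: int, now: float) -> bool:
        """Record one corrected error on `page` at time `now` (seconds).

        Returns True if this error pushed the page over the threshold and
        the page has just been marked for offlining.
        """
        if page in self.offlined:
            return False
        events = self._events[page]
        events.append(now)
        # Discard errors that have fallen out of the time window.
        while events and now - events[0] > self.window_s:
            events.popleft()
        if len(events) >= self.threshold:
            self.offlined.add(page)
            del self._events[page]
            return True
        return False
```

A time window is used here so that a page with occasional, widely spaced corrections is not retired, while a page with a burst of corrections is. Other policies, such as a lifetime count, would be equally valid for a sketch like this.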

References

1. Intel Corp (2011). "Intel Xeon Processor E7 Family: supporting next generation RAS servers. White paper." http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Accessed 9 May 2012.
2. Bianca Schroeder; Eduardo Pinheiro; Wolf-Dietrich Weber (2009). "DRAM Errors in the Wild: A Large-Scale Field Study". Proceedings SIGMETRICS, 2009. http://research.google.com/pubs/pub35162.html
3. Tang, Carruthers, Totari, Shapiro (2006). "Assessment of the Effect of Memory Page Retirement on Systems RAS against Hardware Faults". Proceedings of the 2006 International Conference on Dependable Systems and Networks.
4. "mcelog - memory error handling in user space". Linux Kongress 2010 (2010). http://halobates.de/lk10-mcelog.pdf

See also

  • MCELog: Linux daemon for processing x86 machine checks for predictive failure analysis