How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder
Supercomputer operators have had to struggle with many other quirky faults as well. To take one example: The IBM Blue Gene/L system at Lawrence Livermore National Laboratory, in California, the largest computer in the world from 2004 to 2008, would frequently crash while running a simulation or produce erroneous results. After weeks of searching, the culprit was uncovered: the solder used to make the boards carrying the processors. Radioactive lead in the solder was found to be causing bad data in the L1 cache, a chunk of very fast memory meant to hold frequently accessed data. The workaround to this resilience problem on the Blue Gene/L computers was to reprogram the system to, in essence, bypass the L1 cache. That worked, but it made the computations slower.