Reasons for memory failures

As with many other physical storage media, the contents of a memory chip are not permanent, but last only 10-15 years at best. Many capital goods fail inexplicably due to electronics problems after some years of operation, even though the machinery, systems, and equipment are still in perfect mechanical condition. Besides defects in the circuit board itself, this can be caused by information loss in the memory chips.

A single memory cell changing state from “0” to “1” can turn one machine instruction into a completely different one with entirely different behaviour. As a result, memory errors frequently create incoherent error patterns and inexplicable machine behaviour. It is often difficult to distinguish between memory chip errors, errors caused by a defective sensor, and errors caused by a defective circuit board.
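The effect of a single flipped bit can be illustrated with a short sketch. The source does not name a processor, so the x86 instruction set is assumed here purely as an example: in x86, the one-byte opcodes 0x74 (JE, jump if equal) and 0x75 (JNE, jump if not equal) differ only in their lowest bit. The helper `flip_bit` is a hypothetical name introduced for this illustration.

```python
# Illustrative only: x86 is an assumption, the text does not specify a CPU.
# Two one-byte x86 opcodes for conditional short jumps.
OPCODES = {
    0x74: "JE  (jump if equal)",
    0x75: "JNE (jump if not equal)",
}


def flip_bit(byte: int, bit: int) -> int:
    """Return `byte` with the given bit position inverted (0 = least significant)."""
    return byte ^ (1 << bit)


original = 0x74
corrupted = flip_bit(original, 0)  # bit 0 changes from 0 to 1

print(f"{original:#04x}: {OPCODES[original]}")
print(f"{corrupted:#04x}: {OPCODES[corrupted]}")
```

In this example the branch condition is inverted, so the program takes the opposite path at that point even though every other byte of the code is intact.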

ROMs and PROMs have seldom been used since the 1980s, but they achieve the longest data lifetimes by design, since the information is essentially “hardwired” into the chip. By contrast, the more modern and ubiquitous EPROMs, EEPROMs, and flash EEPROMs based on MOSFET transistor technology are susceptible to various failure mechanisms, which are briefly presented below.

Natural aging eventually causes memory cells to lose the information they store. This occurs because the insulating layer of the floating gates in the MOSFET transistors used for data storage is not perfectly insulating for physical and technological reasons; a small stream of electrons still manages to pass through. Over time, the control voltage of the MOSFET gradually decreases. When the control voltage falls below a critical threshold, the memory cell no longer represents the state “0” and switches to the unprogrammed state “1”. Near this threshold, the cell can behave unstably, sporadically alternating between states depending on environmental conditions. The typical lifetime of a memory cell quoted by manufacturers before 2000 was only 10-15 years.

The quality of the erasure process and the programming algorithms plays an important role in premature data loss. The data lifetime is maximized by ensuring that a sufficiently large charge difference exists between memory cells holding the value “0” and memory cells holding the value “1”, which mitigates the effects of natural loss of charge for as long as possible. If a lower-quality erasure and programming process was chosen to reduce costs, the distinction between ones and zeroes may fade in some memory cells, leading to failure in the long term.

Premature loss of information can also occur due to microscopic manufacturing defects in the insulation of the floating gates, which allow the electrical charge to dissipate faster than expected.

One aspect shared by all of these errors is that they cannot be diagnosed during or after programming. In other words, even if the memory chip is found to have been programmed correctly during testing, the charge differential in certain memory cells may not be sufficiently high to guarantee that the target data lifetime will be achieved.
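A toy model can make this point concrete. The sketch below assumes, purely for illustration, that the stored level of a programmed cell decays exponentially and that the cell reads “0” as long as the level stays at or above a fixed read threshold. The decay law, the time constant, and all numeric values are hypothetical and not taken from any datasheet; real charge loss depends on temperature, cell geometry, and wear.

```python
import math

# All values are hypothetical and chosen only for illustration.
READ_THRESHOLD = 1.0        # level below which a cell reads as "1" (arbitrary units)
LEAK_TIME_CONSTANT = 20.0   # years; assumed exponential charge loss


def level_after(initial_level: float, years: float) -> float:
    """Assumed model: the stored level decays exponentially towards zero."""
    return initial_level * math.exp(-years / LEAK_TIME_CONSTANT)


def read_cell(level: float) -> int:
    """A programmed cell reads 0 while its level is at or above the threshold."""
    return 0 if level >= READ_THRESHOLD else 1


def years_until_flip(initial_level: float) -> float:
    """Time at which the decaying level reaches the read threshold."""
    return LEAK_TIME_CONSTANT * math.log(initial_level / READ_THRESHOLD)


well_programmed = 3.0     # large margin above the threshold
weakly_programmed = 1.3   # small margin, e.g. from a shortened programming step

for name, level in [("well programmed", well_programmed),
                    ("weakly programmed", weakly_programmed)]:
    print(f"{name}: reads {read_cell(level)} right after programming, "
          f"flips after about {years_until_flip(level):.1f} years")
```

Both cells pass a simple read-back check immediately after programming, yet under these assumptions the weakly programmed cell loses its content much earlier. This is why a successful verification alone says little about the data lifetime.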

In addition to the failure mechanisms described above, which can modify the contents of individual memory cells, a further mode of failure can occur in the event of electrical damage to the memory chip. This can arise, for example, in any of the following cases:

  • Overvoltage, especially in sensitive CMOS semiconductor technologies, can cause the failure of individual memory cells and internal control units. If an internal control unit is damaged, data recovery will be impossible in many cases, since the memory cells can no longer be properly addressed and read out.
  • In devices with very small semiconductor structures, the phenomenon of electromigration can also lead to a form of “electrical wear” over time, causing individual cells or functional elements of the memory chip to fail.
  • In some cases, the address lines inside or outside of the memory chip may be damaged, in which case only parts of the memory matrix can be properly addressed and read out. If program code or data are stored in one of these areas, data recovery may occasionally be possible if the address line can be successfully reactivated.
  • Like address lines, data lines may suffer internal or external damage. In this case, if the programming language is known, it may be possible to check and reconstruct the corrupted program code using special algorithms.
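The effect of a damaged address line can be sketched as follows. The example assumes a hypothetical 16-byte memory whose address line A2 is stuck at 0; the function names and the memory contents are made up for illustration. Every access to an address with bit 2 set then lands on the corresponding address with bit 2 cleared, so only half of the memory matrix can be read.

```python
def read_with_stuck_address_line(memory: bytes, address: int, stuck_bit: int) -> int:
    """Read one byte while address line `stuck_bit` is assumed stuck at 0."""
    effective_address = address & ~(1 << stuck_bit)
    return memory[effective_address]


def reachable_addresses(size: int, stuck_bit: int) -> list[int]:
    """Addresses that are still read correctly (their stuck bit is already 0)."""
    return [a for a in range(size) if not a & (1 << stuck_bit)]


# Hypothetical chip: 16 bytes, each byte holds its own address as content.
memory = bytes(range(16))
STUCK_BIT = 2  # address line A2 assumed stuck at 0

dump = bytes(read_with_stuck_address_line(memory, a, STUCK_BIT) for a in range(16))
print(list(dump))
print(reachable_addresses(16, STUCK_BIT))
```

A dump read through such a fault contains repeated blocks in place of the unreachable areas. That repetition is one hint that the address lines, not the memory cells, are at fault.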