International Journal Of Innovative Science And Applied Engineering Research (IJISAER)
ISSN: 2349-9389
Volume 13, Issue 44 Ver. I (March. 2015)
Improving Memory Reliability Against
Multiple Cell Upsets Using Hamming Based
Matrix Code
M.Sivasankaran,
PG Student, VLSI Design
KCG College of Technology
Chennai, India
sivasankaran44@gmail.com
G.Renganayaki,
Assistant Professor, Department of ECE
KCG College of Technology
Chennai, India
renganayaki@kcgcollege.com
Abstract— The soft error rate in storage cells is rapidly
increasing due to the ionizing effects of atmospheric
neutrons, alpha particles and cosmic rays. As a result,
Single Cell Upsets (SCUs) and Multiple Cell Upsets (MCUs)
occur. Error correction codes (ECCs) are widely applied to
protect memories against soft errors. Existing ECCs can
correct SCUs and a limited number of MCUs; hence a more
reliable ECC is required. The Decimal Algorithm based
Matrix Code (DMC) uses a decimal algorithm to obtain its
error detection and correction capability. The Encoder-Reuse
Technique (ERT) is employed in DMC to minimize the area.
Initially, the data bits are split into symbols and arranged
in a 2-D matrix. The Horizontal Redundant Bits (HRB) and
Vertical Redundant Bits (VRB) are computed by decimal
operations in the DMC encoder. After encoding, the obtained
codeword is stored in the memory. If radiation affects the
memory, a Multiple Cell Upset will occur. These errors can
be rectified in the decoder. The ECC based DMC has been
simulated, implemented and compared to the Hamming Code
using Xilinx Design Suite 14.2. From the simulation and
implementation analysis, the ECC based DMC yields better
performance compared to the Hamming Code.
Keywords— Decimal Matrix Code (DMC), Encoder Reuse
Technique, Multiple-Cell Upset, Decimal Algorithm, Error
Correcting Codes, Hamming Code.
I. INTRODUCTION
Information and coding theory have a wide variety of
applications in computer science and communication. Error
control is the technique that enables the reliable delivery
of digital data over unreliable communication media. Error
detection is the detection of errors caused by noise or
other impairments during transmission from the transmitter
to the destination. Error correction is the detection of
errors and reconstruction of the original, error-free data.
The general idea for achieving error detection and
correction is to add redundant bits (i.e., extra data) to
the message. Error detection and correction schemes can be
either systematic or non-systematic [2] [9].
In a systematic scheme, the transmitter sends the
information and attaches a fixed number of redundant bits,
which are derived from the information bits by some
deterministic algorithm. In a system that uses a
non-systematic code, the information is transformed into an
encoded message that has at least as many bits as the
original information. Good error control performance
requires the scheme to be selected based on the
characteristics of the communication channel [6]. Sources
of errors include temperature, humidity, vibration, aging
of components, cosmic radiation and alpha particles (which
induce failures in chips containing RAM), and incomplete
specifications. There are two types of errors: soft errors
and firm errors. A soft error does not damage the hardware
of the system; the only damage is to the data being
processed. An error in a memory element is considered soft
because it corrupts only the data. A radiation-induced
error in an FPGA is a "firm" error, because it is not just
a transient data error. When it occurs, it is the device's
configuration or "personality" that is corrupted. This
error changes the actual function of the device [3].
IJISAER 3SI011140121
www.ijisaer.com
|38
Licensed under a Creative Commons Attribution-Non Commercial 4.0 International
License
Alpha particles from package decay, cosmic rays
creating energetic neutrons and protons, thermal
neutrons, random noise and signal integrity problems
are the sources of soft errors [4]. Usually, only one
cell of a memory is affected. Sometimes multiple memory
cells are affected by high-energy radiation. A
multiple-cell upset leads only to a number of separate
single-bit upsets in multiple correction words. So, an
error correcting code needs only to cope with a single
bit in error in each correction word in order to cope
with all likely soft errors. Each error correction code
provides a different level of protection against soft
errors [5]. Commonly used error detection and
correction codes include the Hamming code, Reed-Solomon
codes and Difference-Set Cyclic codes.
The Hamming code is a linear error correcting code
that can either detect up to two-bit errors or correct
one-bit errors, but it cannot flag errors that it
fails to correct. The Hamming decoder can detect and
correct all single-bit errors or detect all double-bit
errors. Because error-correction software is
permanently stored in the ROM and uses the core resources
whenever memory is used, an ECC solution that uses
little memory and few processor cycles is preferable
[8].
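As a point of reference for the single-error correction described above, the following is a minimal sketch of a Hamming(7,4) code. It is not the Hamming design implemented in this paper, whose word length is not stated here; the function names `hamming74_encode` and `hamming74_correct` and the choice of the (7,4) code are assumptions made only for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8                      # index 0 is unused
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_correct(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    code = [0] + list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position        # XOR of the positions holding a 1
    if syndrome:
        code[syndrome] ^= 1             # the syndrome is the error position
    return [code[3], code[5], code[6], code[7]]
```

With this sketch any single flipped bit is repaired, whereas two flipped bits produce a non-zero syndrome that points at a third, innocent position, so the decoder miscorrects. This is the limitation that motivates a code able to handle multiple cell upsets.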
Reed-Solomon is an ECC system used for correcting
multiple errors, especially burst-type errors, in mass
storage devices, wireless and mobile communication
units, satellite links, digital TV, digital video
broadcasting (DVB), and modem technologies. Reed-Solomon
codes provide significant burst error correcting
capability. The only disadvantage of using a
Reed-Solomon code lies in the lack of an efficient
maximum-likelihood soft-decision decoding algorithm [7].
The difference-set cyclic code is a new class of
random-error-correcting cyclic code. The correction
process includes encoding and decoding of cyclic codes.
Difference-set cyclic codes are able to correct a large
number of bit flips. They take fewer decoding cycles,
use less memory and consume less power.
The Matrix Code overcomes the above-mentioned
disadvantages of these ECCs with lower area and delay
overheads [10].
The paper is organized as follows. Section I gives a
general introduction to error detection and correction.
Section II describes the proposed Decimal Matrix Code
(DMC) with an example. Section III gives the simulation
results of the DMC. Finally, Section V concludes the
report.
II. DECIMAL ALGORITHM BASED
MATRIX CODE
Multiple cell upsets (MCUs) are becoming a major
issue in the reliability of memories exposed to
radiation. To prevent data corruption, more complex
Error Correction Codes (ECCs) are widely used to
protect memory. Their main drawback is that they
require a higher delay overhead. In this paper, the
decimal algorithm based matrix code (DMC) is exploited
to enhance memory reliability with lower delay
overhead. The ECC based DMC utilizes a decimal
algorithm to obtain the maximum error detection
capability. Moreover, the Encoder-Reuse Technique (ERT)
is used to minimize the area overhead of extra circuits
without disturbing the whole encoding and decoding
process. In ERT, the circuit used for encoding is also
used for decoding.
Fig. 1. Fault tolerant memory
The schematic of the fault-tolerant memory is depicted
in Fig. 1. First, during the encoding process, the
information bits are fed to the DMC encoder, which
produces the horizontal redundant bits and the vertical
redundant bits. After encoding, the obtained codeword
is stored in the memory. If radiation affects the
memory, an MCU will occur. This can be corrected in the
decoding process. Owing to the decimal algorithm, the
proposed DMC has a higher fault-tolerant capability
with lower performance overheads. In the fault-tolerant
memory, the ERT is proposed to reduce the area overhead
of extra circuits.
A. DMC Encoder
The encoding process is as follows:
Step 1: Divide the N-bit word into k symbols of m bits
(N = k × m)
Step 2: Arrange the symbols in a k1 × k2 2-D matrix
(k = k1 × k2)
Step 3: Calculate the horizontal redundant bits (H) by
decimal integer addition of the symbols in each row
Step 4: Calculate the vertical redundant bits (V) by a
binary operation among the bits in each column
Step 5: Store the resulting codeword in the memory
(SRAM)
Fig. 2. Block diagram of DMC Encoder
D0 to D31 are the information bits. H0 to H19 are the
horizontal redundant bits. V0 to V15 are the vertical
redundant bits. U0 to U31 are the copies of the
information bits. If the memory is exposed to
radiation, multiple cell upsets can occur. These are
corrected by the decoding process.
B. DMC Decoder
The DMC decoder is made up of the following
sub-modules, each of which executes a specific task in
the decoding process: syndrome calculator, error
locator and error corrector. It can be observed from
Fig. 3 that the redundant bits must be recomputed from
the received information bits and compared to the
stored set of redundant bits in order to obtain the
syndrome bits. The error locator then uses the syndrome
bits to detect and locate the bits in which errors have
occurred. If the horizontal syndrome bits are non-zero,
they identify the symbol in which the error occurs. If
the vertical syndrome bits are non-zero, they identify
which particular bit is affected by the radiation
particle (MCU). Finally, the error corrector corrects
the error by inverting the values of the erroneous
bits.
The decoding process is given below,
Step 1: Receive the affected information bits
Step 2: Calculate the horizontal and vertical redundant
bits for the received information bits
Step 3: Calculate the horizontal syndrome bits by
decimal integer subtraction
Step 4: Calculate the vertical syndrome bits by logical
EXOR
If both sets of syndrome bits are zero, the information
bits are not affected. If the horizontal syndrome bits
are non-zero, they identify the symbol in which the
error occurs. If the vertical syndrome bits are
non-zero, they identify which particular bit is
affected by the radiation particle (MCU). The error is
corrected by simply flipping the affected bit or bits.
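The four decoding steps can be sketched in Python for the 32-bit configuration used later in this paper (eight 4-bit symbols, H formed from the symbol pairs 0+2, 1+3, 4+6 and 5+7, and Vn = Dn xor Dn+16). This is a minimal software sketch, not the hardware design; the names `redundant_bits` and `syndromes` are hypothetical, and the stored redundant bits are assumed to be error-free.

```python
PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7)]   # symbol pairs summed into H


def redundant_bits(word):
    """Recompute the horizontal sums and vertical bits of a 32-bit word."""
    symbols = [(word >> (4 * i)) & 0xF for i in range(8)]
    h = [symbols[a] + symbols[b] for a, b in PAIRS]   # decimal integer addition
    v = (word & 0xFFFF) ^ (word >> 16)                # Vn = Dn xor Dn+16
    return h, v


def syndromes(received_word, stored_h, stored_v):
    """Steps 2 to 4: recompute, subtract (horizontal), XOR (vertical)."""
    h_new, v_new = redundant_bits(received_word)
    delta_h = [new - old for new, old in zip(h_new, stored_h)]
    s_v = v_new ^ stored_v
    return delta_h, s_v
```

Note that `redundant_bits` plays the role of the encoder and is called again inside `syndromes`, which mirrors the idea of reusing the encoder circuit in the decoder.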
Fig. 3. Block diagram of DMC Decoder
In this ECC based DMC scheme, the circuit area of the
DMC is minimized by reusing its encoder. This is called
the ERT. The ERT reduces the area overhead of the DMC
without disturbing the whole encoding and decoding
processes. It can be observed from the block diagram of
the decoder that the DMC encoder is reused to obtain
the syndrome bits in the decoder. Therefore, the area
of the DMC is minimized by using the existing circuits
of the encoder.
To explain the DMC scheme, take a 32-bit word as an
example. The cells D0 to D31 are the information bits.
This 32-bit word is divided into eight symbols of 4
bits, with k1 = 2 and k2 = 4. H0 to H19 are the
horizontal redundant bits; V0 to V15 are the vertical
redundant bits. The maximum correction capability
(i.e., the maximum size of MCU that can be corrected)
and the number of redundant bits differ when different
values of k and m are chosen. Therefore, k and m should
be carefully adjusted to maximize the correction
capability and minimize the number of redundant bits.
For example, when k = 2 × 2 and m = 8, only a one-bit
error can be corrected and the total number of
redundant bits is 40. When k = 4 × 4 and m = 2, 3-bit
errors can be corrected and the number of redundant
bits is reduced to 32. However, when k = 2 × 4 and
m = 4, the maximum correction capability is up to 5
bits and the number of redundant bits is 36. In this
paper, in order to enhance the reliability of the
memory, the error correction capability is considered
first, so k = 2 × 4 and m = 4 are used to construct the
DMC. The encoding steps are as follows:
Step 1: 32-bit word is divided into 8 symbols of 4 bits.
Symbol 0=D3D2D1D0; Symbol 4=D19D18D17D16;
Symbol 1=D7D6D5D4; Symbol 5=D23D22D21D20;
Symbol 2=D11D10D9D8; Symbol 6=D27D26D25D24;
Symbol 3=D15D14D13D12; Symbol 7=D31D30D29D28;
Step 2: Calculate the horizontal redundant bits by
decimal integer addition.
H4H3H2H1H0 = Symbol 0 + Symbol 2;
H9H8H7H6H5 = Symbol 1 + Symbol 3;
H14H13H12H11H10 = Symbol 4 + Symbol 6;
H19H18H17H16H15 = Symbol 5 + Symbol 7;
where "+" represents decimal integer addition.
Step 3: Calculate the vertical redundant bits using the
notation below.
Vn = Dn xor Dn+16;
where n = 0 to 15 if the input data is 32 bits.
Encoding is thus performed by decimal integer addition
and binary XOR: the encoder computes the redundant bits
using multi-bit adders and XOR gates. The information
bits and the redundant bits together are called the
codeword.
Step 4: Finally, the obtained codeword is stored in the
memory. If a radiation particle strikes the memory,
the information bits stored in the memory are changed.
This can be corrected in the decoding process.
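The encoding steps for the 32-bit example can be sketched as follows. This is an illustrative software model under the assumption that the vertical redundant bits are the bitwise XOR of the lower and upper 16 data bits; the function name `dmc_encode` and the packing of H and V into integers are hypothetical choices, not part of the hardware design.

```python
def dmc_encode(word):
    """Return (H, V) for a 32-bit word with k = 2 x 4 and m = 4.

    H packs H0..H19 (four 5-bit sums), V packs V0..V15.
    """
    if not 0 <= word < 1 << 32:
        raise ValueError("word must fit in 32 bits")
    # Step 1: split into eight 4-bit symbols, Symbol 0 = D3D2D1D0, etc.
    symbols = [(word >> (4 * i)) & 0xF for i in range(8)]
    # Step 2: decimal integer addition of the symbol pairs
    pairs = [(0, 2), (1, 3), (4, 6), (5, 7)]
    h = 0
    for j, (a, b) in enumerate(pairs):
        h |= (symbols[a] + symbols[b]) << (5 * j)   # each sum fits in 5 bits
    # Step 3: Vn = Dn xor Dn+16 for n = 0..15
    v = (word & 0xFFFF) ^ (word >> 16)
    return h, v
```

For instance, a word whose Symbol 0 is 0110 and Symbol 2 is 1001 (with all other symbols zero) gives H4H3H2H1H0 = 01111, and the codeword to store is the 32 data bits together with the 20 bits of H and the 16 bits of V, i.e. 36 redundant bits.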
The decoding steps are given below. There are three
cases in the decoding process.
Case 1: The symbols are affected by radiation
particles and the horizontal syndrome bits are
non-zero.
Case 2: The symbols are not affected by radiation
particles and the horizontal syndrome bits are zero.
Case 3: The symbols are affected by radiation
particles, yet the horizontal syndrome bits are zero.
Example for decoding process
Case 1:
Horizontal syndrome bits are non zero - symbols are
affected
Now, H4H3H2H1H0 = 10110.
The decimal value of 10110 is 22.
ΔH4H3H2H1H0 = 10110 - 10010 = 22 - 18;
ΔH4H3H2H1H0 = 00100 = 4 (a non-zero value), where "-"
represents the decimal integer difference.
An error is therefore detected in the symbol pair
formed by symbol 0 and symbol 2. To find the exact bit
position of the error, the vertical syndrome bits are
used. After the exact position of the error is found,
it is corrected by simply inverting the erroneous bit.
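The locate-and-flip procedure of Case 1 can be sketched as below. This is one possible reading of the scheme, not a description of the actual circuit: it assumes the stored redundant bits are intact and that, for a given group of columns, only one of the two rows is hit; otherwise the vertical syndrome cannot tell which row to flip, and the sketch raises an error. The names `redundant_bits`, `dmc_correct` and `COLUMN_MASKS` are hypothetical.

```python
PAIRS = [(0, 2), (1, 3), (4, 6), (5, 7)]   # symbol pairs summed into H
COLUMN_MASKS = [0x0F0F, 0xF0F0]            # columns of pairs (0,2)/(4,6) and (1,3)/(5,7)


def redundant_bits(word):
    symbols = [(word >> (4 * i)) & 0xF for i in range(8)]
    h = [symbols[a] + symbols[b] for a, b in PAIRS]
    v = (word & 0xFFFF) ^ (word >> 16)
    return h, v


def dmc_correct(received, stored_h, stored_v):
    """Flip the bits flagged by the vertical syndrome in the located symbols."""
    h_new, v_new = redundant_bits(received)
    delta_h = [new - old for new, old in zip(h_new, stored_h)]
    s_v = v_new ^ stored_v                 # 1 where a column holds an error
    corrected = received
    for j, delta in enumerate(delta_h):
        if delta == 0:
            continue                       # this symbol pair looks clean
        row, group = divmod(j, 2)          # row 0: D0..D15, row 1: D16..D31
        if delta_h[(1 - row) * 2 + group] != 0:
            raise ValueError("both rows flagged in the same columns")
        corrected ^= (s_v & COLUMN_MASKS[group]) << (16 * row)
    return corrected
```

Under these assumptions a burst of five adjacent flipped bits within one row, such as D16 to D20, is restored in a single pass.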
Case 2:
Horizontal syndrome bits are zero - symbols are not
affected
ΔH4H3H2H1H0 = 00000 (zero)
The symbols are therefore not affected by radiation
particles, and there is no multiple cell upset.
Case 3:
Consider symbol 0 = 0110 and symbol 2 = 1001.
Now, H4H3H2H1H0 = symbol 0 + symbol 2;
So, H4H3H2H1H0 = 0110 + 1001 = 01111.
If symbol 0 and symbol 2 are affected by radiation
particles, an MCU occurs and the symbols are changed
to 1001 and 0110 respectively.
The recomputed sum is then 1001 + 0110 = 01111.
Therefore, ΔH4H3H2H1H0 = 00000 (zero).
Although the symbols are affected by radiation, the
horizontal syndrome bits are zero, whereas they should
be non-zero. Errors of this type are considered
decoding errors. This case is very rare. For example,
when m = 4, the probability of decoding errors is
0.001. If m = 8, the probability is 0.0000011.
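The masking effect of Case 3 is easy to reproduce numerically. The short sketch below uses the symbol values from the example; `horizontal_syndrome` is a hypothetical helper, and the second call uses an arbitrary single-bit upset chosen only for contrast.

```python
def horizontal_syndrome(stored_sum, symbol_a, symbol_b):
    """Decimal difference between the recomputed and the stored sum."""
    return (symbol_a + symbol_b) - stored_sum


stored_h = 0b0110 + 0b1001          # 01111, computed before the upset

# Case 3: both symbols are corrupted, but their sum is unchanged
masked = horizontal_syndrome(stored_h, 0b1001, 0b0110)

# For contrast: one bit of symbol 0 flips (0110 -> 0111)
visible = horizontal_syndrome(stored_h, 0b0111, 0b1001)
```

Here `masked` evaluates to 0 even though all eight bits of the two symbols have changed, while `visible` evaluates to 1. The horizontal syndrome alone therefore cannot flag an upset that leaves the sum of a symbol pair unchanged.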
The binary error detection technique requires few
redundant bits, but its error detection capability is
limited. The main reason is that its error detection
mechanism is based on binary operations.
The limits of this simple binary error detection can be
illustrated with a simple example. Suppose that the
bits B3, B2, B1 and B0 are the original information
bits and the bits C0 and C1 are the redundant bits.
C0 = B0 xor B2 = 1 xor 0 = 1;
C1 = B1 xor B3 = 0 xor 1 = 1;
Now assume that MCUs occur in bits B3, B2 and B0
(i.e., B3' = 0, B2' = 1 and B0' = 0).
The received redundant bits are computed as
C0' = B0' xor B2' = 0 xor 1 = 1;
C1' = B1' xor B3' = 0 xor 0 = 0;
In order to detect these errors, the syndrome bits S0
and S1 are obtained as follows:
S0 = C0' xor C0 = 1 xor 1 = 0;
S1 = C1' xor C1 = 0 xor 1 = 1;
Since S0 is zero, the erroneous bits B2 and B0 are
wrongly regarded as the original bits, so these two
erroneous bits are not corrected. This example
illustrates that, with this simple binary operation, an
even number of bit errors cannot be detected.
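The example can be checked with a few lines of Python. This sketch only restates the arithmetic of the example; the helper name `parity_bits` and the list layout [B0, B1, B2, B3] are assumptions made for illustration.

```python
def parity_bits(bits):
    """bits = [B0, B1, B2, B3]; returns (C0, C1) with C0 = B0 xor B2, C1 = B1 xor B3."""
    return bits[0] ^ bits[2], bits[1] ^ bits[3]


original = [1, 0, 0, 1]          # B0 = 1, B1 = 0, B2 = 0, B3 = 1
received = [0, 0, 1, 0]          # B0, B2 and B3 flipped by the MCU

c0, c1 = parity_bits(original)
c0_r, c1_r = parity_bits(received)
s0, s1 = c0_r ^ c0, c1_r ^ c1    # syndrome bits S0 and S1
```

The result is S0 = 0 and S1 = 1: the single error in B3 is flagged, but the two errors in B0 and B2 cancel inside the XOR and leave no trace in S0.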
The previous discussion shows that error detection
based on the binary algorithm can detect only an odd
number of errors. When the decimal algorithm is used,
both even and odd numbers of errors can be detected.
The reason is that the operation mechanism of the
decimal algorithm is different from that of the binary
algorithm. DMC also uses the ERT, so that the area is
reduced. The advantages are as follows: the ability to
detect both even and odd numbers of errors, reduced
area overhead through the use of the ERT, and reduced
delay through the use of the decimal algorithm.
III. IMPLEMENTATION AND ANALYSIS
The DMC is implemented and analyzed in Xilinx ISE
Design Suite 14.2. From the results it has been
observed that the DMC can correct a maximum of 5 bit
errors, as shown in Fig. 5, whereas the existing
Hamming Code can correct only a 1-bit error, as shown
in Fig. 6.
Fig. 5. Simulation result of DMC Correcting Five Cell
Upset
Fig. 6. Simulation results of Hamming code correcting
only single bit error
TABLE I
COMPARISON OF DMC AND HAMMING CODE
Parameters: Method; No. of errors corrected; Device utilization (LUTs, Slices, Bounded IO); Delay (ns); Power (mW)
Method: DMC; Hamming Code
No. of errors corrected: 5 (DMC); 1 (Hamming Code)
Slices
653
210
2484
708
78
336
Bounded IO
6.953
23.25
597
2236