Optimization of Area and Power in Feed Forward Cut Set Free MAC Unit using EXOR Full Adder and 4:2 Compressor

V. Mohanapriya¹, S. Purushothaman², S. Tamilarasi³, P. Vinitha⁴

¹PG Student, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
²Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
³Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).
⁴Assistant Professor, Dept. of ECE, PGP College of Engineering and Technology, Tamilnadu (India).

Abstract: Multiply-accumulate (MAC) computation plays an important role in digital signal processing (DSP). The MAC operation is a common step that computes the product of two numbers and adds that product to an accumulator. A pipelined architecture is generally used to improve performance by shortening the critical path. However, pipelining requires many additional flip-flops, which reduces the efficiency of the MAC unit and increases its power consumption. Targeting machine learning algorithms, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator, and also proposes a new design, MFCF-PA, that uses a 4:2 compressor in the column addition stage. The proposed design reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining when compared with the existing pipelined architecture for MAC computation. Finally, the proposed FCF pipelined MAC architecture is implemented in VHDL, synthesized in XILINX, and compared in terms of area, power and delay reports.

Keywords: Hardware accelerator, Machine Learning, Multiply–Accumulate (MAC) unit, Pipelining.

1. INTRODUCTION

In a machine learning accelerator, a large number of multiply–accumulate (MAC) units are included for parallel computations, and timing critical paths of the system are often found in the unit. A multiplier typically consists of several computational parts including a partial product generation, a column addition, and a final addition. An accumulator consists of the carry-propagation adder. Long critical paths through these stages lead to the performance degradation of the overall system. To minimize this problem, various methods have been studied. The Wallace [8] and Dadda [9] multipliers are well-known examples for the achievement of a fast column addition, and the carry-lookahead (CLA) adder is often used to reduce the critical path in the accumulator or the final addition stage of the multiplier. Meanwhile, a MAC operation is performed in the machine learning algorithm to compute a partial sum that is the accumulation of the input multiplied by the weight. In a MAC unit, the multiply and accumulate operations are usually merged to reduce the number of carry-propagation steps from two to one [10]. Such a structure, however, still comprises a long critical path delay that is approximately equal to the critical path delay of a multiplier. It is well known that pipelining is one of the most popular approaches for increasing the operation clock frequency. Although pipelining is an efficient way to reduce the critical path delays, it results in an increase in the area and the power consumption due to the insertion of many flip-flops. In particular, the number of flip-flops tends to be large because the flip-flops must be inserted in the feed forward-cutset to ensure functional equality before and after the pipelining. The problem worsens as the number of pipeline stages is increased. 
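The three multiplier stages and the merged accumulation described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: operands are unsigned, the function names `partial_products` and `mac_step` are hypothetical, and the column addition uses a plain 3:2 carry-save reduction rather than a specific Wallace or Dadda tree.

```python
def partial_products(a, b, width=8):
    """Partial product generation: one shifted copy of a per set bit of b."""
    return [(a << i) if (b >> i) & 1 else 0 for i in range(width)]


def mac_step(acc, a, b, width=8):
    """One merged multiply-accumulate step (unsigned, behavioral model).

    The accumulator value joins the partial products as one more row, so
    only a single carry-propagating addition is needed at the end.
    """
    rows = partial_products(a, b, width) + [acc]
    # Column addition: reduce three rows to two (sum row, carry row) until
    # only two rows remain. No carry propagates along a row in this step.
    while len(rows) > 2:
        x, y, z = rows[:3]
        sum_row = x ^ y ^ z
        carry_row = ((x & y) | (y & z) | (x & z)) << 1
        rows = rows[3:] + [sum_row, carry_row]
    # Final addition: the only carry-propagation step.
    return sum(rows)


def dot_product(inputs, weights, width=8):
    """Accumulate input * weight pairs, as in a partial-sum computation."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc = mac_step(acc, x, w, width)
    return acc
```

Because each 3:2 reduction preserves the arithmetic sum of the rows, the result matches an ordinary multiply followed by an add, while the model makes the single carry-propagation step explicit.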
The main idea of this paper is the ability to relax the feed forward-cutset rule in the MAC design for machine learning applications, because only the final value is used out of the large number of multiply–accumulations. In other words, different from the usage of the conventional MAC unit, intermediate accumulation values are not used here, and hence, they do not need to be correct as long as the final value is correct. Under such a condition, the final value can become correct if each binary input of the adders inside the MAC participates in the calculation once and only once, irrespective of the cycle. Therefore, it is not necessary to set an accurate pipeline boundary. Based on the previously explained idea, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture for a high-performance machine learning accelerator.

2. EXISTING SYSTEM

Recently, the deep neural network (DNN) has emerged as a powerful tool for various applications including image classification and speech recognition. Since an enormous number of vector-matrix multiplications is required in a typical DNN application, a variety of dedicated hardware for machine learning has been proposed to accelerate the computations. In a machine learning accelerator, a large number of multiply-accumulate (MAC) units are included for parallel computations, and timing-critical paths of the system are often found in the unit.

2.1 Preliminary: Feed forward-Cutset Rule for Pipelining

It is well known that pipelining is one of the most effective ways to reduce the critical path delay, thereby increasing the clock frequency. This reduction is achieved through the insertion of flip-flops into the data path. In addition to reducing critical path delays through pipelining, it is also important to satisfy functional equality before and after pipelining. The point at which the flip-flops are inserted to ensure functional equality is called the feed forward-cutset.

Cutset: A set of edges of a graph such that, if these edges are removed from the graph, the graph becomes disjoint.

Feed forward-cutset: A cutset where the data move in the forward direction on all of the cutset edges.
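The two definitions above can be checked mechanically on a small data-flow graph. The following is a minimal sketch, not the authors' tooling: the node names (`in`, `lo`, `hi`, `out`) are a hypothetical abstraction of an adder split into a lower and an upper half, and "forward" is taken to mean that every cut edge leaves the part of the graph that still contains the input node.

```python
def reachable(nodes, edges, start):
    """Nodes connected to start when edge direction is ignored."""
    adjacent = {n: set() for n in nodes}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for other in adjacent[node]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def is_cutset(nodes, edges, cut, src):
    """True if removing the cut edges leaves the graph disjoint."""
    remaining = [e for e in edges if e not in cut]
    return reachable(nodes, remaining, src) != set(nodes)


def is_feedforward_cutset(nodes, edges, cut, src):
    """True if cut is a cutset and every cut edge points away from src's side."""
    remaining = [e for e in edges if e not in cut]
    side = reachable(nodes, remaining, src)
    if side == set(nodes):
        return False
    return all(u in side and v not in side for u, v in cut)


# Hypothetical data-flow graph of an adder split into two halves.
NODES = ["in", "lo", "hi", "out"]
EDGES = [("in", "lo"), ("in", "hi"), ("lo", "hi"), ("lo", "out"), ("hi", "out")]

# Registering the upper inputs, the carry, and the lower outputs.
FULL_CUT = [("in", "hi"), ("lo", "hi"), ("lo", "out")]
# Registering only the carry edge.
CARRY_ONLY = [("lo", "hi")]
```

In this toy graph, `FULL_CUT` separates the graph into two parts with all edges pointing forward, whereas `CARRY_ONLY` leaves the graph connected and is therefore not a cutset at all.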

2.2 Disadvantages

- The number of inserted flip-flops increases with the number of pipeline stages.
- The design consumes a larger area and has a long critical path delay.
- Power consumption is high.

3. PROPOSED SYSTEM

Multiply-accumulate (MAC) computation plays an important role in digital signal processing (DSP). The MAC operation is a common step that computes the product of two numbers and adds that product to an accumulator. A pipelined architecture is generally used to improve performance by shortening the critical path. However, pipelining requires many additional flip-flops, which reduces the efficiency of the MAC unit and increases its power consumption. Targeting machine learning algorithms, this paper proposes a feed forward-cutset-free (FCF) pipelined MAC architecture that is specialized for a high-performance machine learning accelerator. The proposed design method reduces the area and the power consumption by decreasing the number of flip-flops inserted for pipelining when compared with the existing pipelined architecture for MAC computation. Finally, the proposed FCF pipelined MAC architecture is implemented in VHDL, synthesized in XILINX, and compared in terms of area, power and delay reports.

3.1 Proposed FCF Pipelining

Fig. 1 shows examples of the two-stage 32-bit pipelined accumulator (PA) that is based on the ripple carry adder (RCA). A[31:0] represents the data that move from the outside to the input buffer register. AReg[31:0] represents the data that are stored in the input buffer. S[31:0] represents the data that are stored in the output buffer register as a result of the accumulation. In the conventional PA structure [Fig. 1(a)], the flip-flops must be inserted along the feed forward-cutset to ensure functional equality. Since the accumulator in Fig. 1(a) comprises two pipeline stages, the number of additional flip-flops for the pipelining is 33 (gray-colored flip-flops). If the accumulator is pipelined to n stages, the number of inserted flip-flops becomes 33(n−1), which confirms that the number of flip-flops for the pipelining increases significantly as the number of pipeline stages is increased.

Fig. 1(b) shows the proposed FCF-PA. For the FCF-PA, only one flip-flop is inserted for the two-stage pipelining. Therefore, the number of additional flip-flops for the n-stage pipeline is n − 1 only.
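The flip-flop counts quoted above can be written as two small functions. This is only an arithmetic sketch: the function names are hypothetical, and the generalization from 33 to `width + 1` flip-flops per pipeline boundary is an assumption (the data bits that cross the boundary plus one carry bit), since the text gives the figure only for the 32-bit case.

```python
def conventional_pa_flipflops(n_stages, width=32):
    """Extra flip-flops for an n-stage conventional pipelined accumulator.

    Assumption: each pipeline boundary needs width data flip-flops plus one
    carry flip-flop, which gives 33 per boundary for the 32-bit case.
    """
    return (width + 1) * (n_stages - 1)


def fcf_pa_flipflops(n_stages):
    """Extra flip-flops for an n-stage FCF-PA: one carry flip-flop per boundary."""
    return n_stages - 1


def flipflops_saved(n_stages, width=32):
    """Flip-flops avoided by the FCF-PA relative to the conventional PA."""
    return conventional_pa_flipflops(n_stages, width) - fcf_pa_flipflops(n_stages)
```

For a two-stage 32-bit accumulator these functions reproduce the counts given in the text: 33 additional flip-flops for the conventional PA and 1 for the FCF-PA.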
Fig. 2 shows examples of the ways that the conventional PA and the proposed method (FCF-PA) work. In the conventional two-stage PA, the accumulation output S is produced two clock cycles after the corresponding input is stored in the input buffer. On the other hand, regarding the proposed structure, the output is generated one clock cycle after the input arrives. Moreover, for the proposed scheme, the generated carry from the lower half of the 32-bit adder is involved in the accumulation one clock cycle later than the case of the conventional pipelining.

For example, in the conventional case, the generated carry from the lower half and the corresponding inputs are fed into the upper half adder in the same clock cycle as shown in the cycles 4 and 5 of Fig. 2 (left). On the other hand, in the proposed FCF-PA, the carry from the lower half is fed into the upper half one cycle later than the corresponding input for the upper half, as depicted in the clock cycles 3-5 of Fig. 2 (right). This characteristic makes the intermediate result that is stored in the output buffer of the proposed accumulator different from the result of the conventional pipelining case.

Meanwhile, the CLA adder has been mostly used to reduce the critical path delay of the accumulator. The carry prediction logic in the CLA, however, causes a significant increase in the area and the power consumption. For the same critical path delay, the FCF-PA can be implemented with less area and lower power consumption compared with the accumulator that is based on the CLA.
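The difference between the two pipelining styles can be explored with a cycle-level behavioral model. The sketch below is an assumption-laden illustration rather than the synthesized design: the register names (`a_hi_d`, `carry_d`, `lo_d`, `carry_ff`) are hypothetical, the adder is split into two equal halves, and two zero inputs are appended to flush the pipeline before the final value is read.

```python
def simulate_conventional_pa(inputs, width=32):
    """Two-stage conventional PA: upper inputs, carry and lower outputs are registered."""
    half = width // 2
    mask = (1 << half) - 1
    areg = lo_acc = hi_acc = 0
    a_hi_d = carry_d = lo_d = 0  # flip-flops on the feed forward-cutset
    trace = []
    for a in inputs:
        lo_full = lo_acc + (areg & mask)
        hi_full = hi_acc + a_hi_d + carry_d
        # All registers update together on the clock edge.
        areg, lo_acc, hi_acc, a_hi_d, carry_d, lo_d = (
            a & ((1 << width) - 1), lo_full & mask, hi_full & mask,
            (areg >> half) & mask, lo_full >> half, lo_acc,
        )
        trace.append((hi_acc << half) | lo_d)
    return trace


def simulate_fcf_pa(inputs, width=32):
    """Two-stage FCF-PA: only the carry between the two halves is registered."""
    half = width // 2
    mask = (1 << half) - 1
    areg = lo_acc = hi_acc = carry_ff = 0
    trace = []
    for a in inputs:
        lo_full = lo_acc + (areg & mask)
        # The upper half sees the current input but the previous cycle's carry.
        hi_full = hi_acc + (areg >> half) + carry_ff
        areg, lo_acc, hi_acc, carry_ff = (
            a & ((1 << width) - 1), lo_full & mask, hi_full & mask, lo_full >> half,
        )
        trace.append((hi_acc << half) | lo_acc)
    return trace


def accumulate(simulator, values, width=32):
    """Feed the values, flush with two zero inputs, and return the full trace of S."""
    return simulator(list(values) + [0, 0], width)
```

In this model the intermediate values of S differ between the two accumulators, but after the flush cycles both hold the same final sum modulo 2^width, which is the property the FCF pipelining relies on.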

3.2 Full adder designs using XNOR and XOR gates for sum logic

The sum logic of a full adder can be implemented with either two successive stages of XNOR gates or two successive stages of XOR gates. The full adder design that uses XOR gates together with a MUX is depicted in Fig. 3.

Fig -3: Full adder using XOR gates and a MUX.
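A bit-level model makes the two sum-logic options easy to compare. The following is a sketch under an assumption that is not stated in the text: the MUX is taken to produce the carry output, selected by the first-stage XOR (or XNOR) result, which is a common way to build such adders. The function names are hypothetical.

```python
def mux(select, d0, d1):
    """2:1 multiplexer: returns d1 when select is 1, otherwise d0."""
    return d1 if select else d0


def full_adder_xor_mux(a, b, cin):
    """Sum from two successive XOR stages; carry from a MUX (assumed role)."""
    p = a ^ b
    total = p ^ cin
    # If a and b differ, the carry-in passes through; otherwise carry equals a.
    cout = mux(p, a, cin)
    return total, cout


def xnor(x, y):
    """Single-bit XNOR."""
    return 1 - (x ^ y)


def full_adder_xnor_mux(a, b, cin):
    """Sum from two successive XNOR stages; carry from a MUX (assumed role)."""
    n = xnor(a, b)
    total = xnor(n, cin)
    # n is 1 when a equals b, so the MUX data inputs are swapped.
    cout = mux(n, cin, a)
    return total, cout
```

Both variants implement the same truth table; they differ only in the polarity of the internal select signal.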

3.3 Modified FCF-PA for Further Power Reductions

Although the proposed FCF-PA can reduce the area and the power consumption by replacing the CLA, there are certain input conditions in which an undesired data transition occurs in the output buffer, thereby reducing the power efficiency when 2's complement numbers are used. Fig. 4 shows an example of the undesired data transition. The inputs are 4-bit 2's complement binary numbers. AReg[7:4] is the sign extension of AReg[3], which is the sign bit of AReg[3:0]. In the conventional pipelining (Fig. 4 (left)), the accumulation result (S) in cycle 3 and the data stored in the input buffer (AReg) in cycle 2 are added and stored in the output buffer (S) in the following cycle. In this case, the “1” in AReg[2] in cycle 2 and the “1” in S[2] in cycle 3 are added, thereby generating a carry. The carry is transmitted to the upper half of S, and hence, S[7:4] remains as “0000” in that cycle.
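The kind of undesired transition mentioned above can be reproduced with a small 8-bit behavioral model of an FCF-style accumulator. This is a sketch under assumptions, not the circuit of the figure: the model registers only the carry between the lower and upper 4-bit halves, the input values (+4 followed by −4) are chosen for illustration, and all names are hypothetical.

```python
def sign_extend(value4):
    """Extend a 4-bit two's complement value to 8 bits."""
    value4 &= 0xF
    return value4 | 0xF0 if value4 & 0x8 else value4


def fcf_upper_trace(inputs4):
    """Upper half S[7:4] per cycle for an 8-bit FCF-style accumulator model."""
    areg = lo = hi = carry_ff = 0
    trace = []
    # Two zero inputs are appended to flush the pipeline.
    for value in list(inputs4) + [0, 0]:
        lo_full = lo + (areg & 0xF)
        # The sign-extension bits arrive one cycle before the matching carry.
        hi_full = hi + (areg >> 4) + carry_ff
        areg, lo, hi, carry_ff = (
            sign_extend(value), lo_full & 0xF, hi_full & 0xF, lo_full >> 4,
        )
        trace.append(hi)
    return trace


def toggles(trace):
    """Number of bit flips between consecutive values in the trace."""
    return sum(bin(x ^ y).count("1") for x, y in zip(trace, trace[1:]))
```

With the inputs +4 and −4, the upper half in this model goes from `0000` to `1111` and back to `0000`, because the sign-extension bits are accumulated one cycle before the carry from the lower half cancels them. The final value is still correct, but the extra transitions cost switching power.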

Fig -4: Pipelined column addition structure with the Dadda multiplier. (a) Conventional pipelining. (b) Proposed FCF pipelining. HA: half-adder. FA: full adder.

Fig -5: Proposed (a) FCF-PA and (b) MFCF-PA for the improvement of the power efficiency.

3.4 4:2 Compressor Design

A 4:2 compressor is used in the column addition stage to reduce the number of computation elements, and thereby the area and power of the MAC unit. The MAC unit using the 4:2 compressor is depicted in Fig. 6.

Fig-6: MAC unit using 4:2 compressor.
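For reference, the arithmetic behavior of a 4:2 compressor can be modeled with two cascaded full adders. This is a generic textbook construction given as a sketch; the gate-level structure used in the proposed MAC unit may differ, and the function names are hypothetical.

```python
def full_adder(a, b, c):
    """Single-bit full adder returning (sum, carry)."""
    total = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return total, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two full adders.

    Returns (sum, carry, cout) such that
    x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout).
    cout depends only on x1..x3, so it does not ripple along the row.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout
```

Because `cout` is independent of `cin`, a row of such compressors reduces four partial-product rows to two without a carry chain across the whole word.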

3.5 Advantages

- The feed forward-cutset-free technique decreases the number of flip-flops inserted for pipelining.
- Less area and a shorter critical path delay are obtained when the Dadda multiplier is used.
- Power consumption is low.

4. RESULT AND DISCUSSION

4.1 Power report

Fig-7 Power report of MAC Unit using 4:2 Compressor.

4.2 Delay Report

Fig-8 Delay report of MAC unit using 4:2 compressor

4.3 Area Report

Fig-9 Area report of MAC unit using 4:2 compressor

4.4 Simulation Output

Fig-10 Simulation output of MAC unit using 4:2 compressor

5. CONCLUSION

We introduced the FCF pipelining method in this paper. In the proposed scheme, the number of flip-flops in a pipeline can be reduced by relaxing the feed forward-cutset constraint, thanks to the unique characteristic of the machine learning algorithm. We applied the FCF pipelining method to the accumulator (FCF-PA) design, and then optimized the power dissipation of the FCF-PA by reducing the chance of undesired data transitions (MFCF-PA). The proposed scheme was also expanded and applied to the MAC unit (FCF-MAC). For the evaluation, the conventional and proposed MAC architectures were synthesized in a 65-nm CMOS technology. The proposed accumulator reduced the area and the power consumption by 17% and 19%, respectively, compared with the conventional CLA adder-based accumulator. In the case of the MAC architecture, the proposed scheme reduced both the area and the power by 20%. In the future, we will design a MAC unit using the MFCF-PA with the 4:2 compressor and the XOR-MUX full adder, and compare it with conventional full adder designs. We believe that the proposed idea to utilize the unique characteristic of 4:2 compressor computation for more efficient MAC design can be adopted in many hardware accelerator designs.

6. REFERENCES


