Chapter 6: Instruction Parallelism

1. Multi-issue

In the out-of-order execution kernel of the previous chapter, at most one instruction can be issued per cycle. Even when many instructions are in flight in parallel, average throughput is still capped at one instruction per cycle. If the issue unit can issue several instructions at a time, more instructions can be processed in parallel.

(1) Superscalar: the hardware discovers instructions that can issue together dynamically, at run time.

(2) VLIW (Very Long Instruction Word)

With VLIW, the parallelism is declared explicitly in the instruction format, and the processor simply executes what it is told: the programmer (or the compiler) writes parallel assembly code and marks which instructions form one parallel instruction. For example:

 

LDH .D1 *A5++, A0
|| LDH .D2 *B6++, B1

        The "||" in front of the instruction means that this instruction is executed in the same Cycle as the previous instruction, and if there is no "||", it means that this instruction is executed in the next Cycle. Each instruction is 32 bits, "||" is represented by the 0th bit, and the processor only needs to execute it according to the instruction rules.

2. Superscalar processor

(1) Take the Pentium 4 (P4) CPU as an example:

It is mainly divided into the memory subsystem, the front end, and the back end.

The front end prepares instructions and handles instruction fetch, decode, and branch prediction; the back end contains the execution units and the out-of-order control logic.

(2) Decoding

A superscalar processor needs to decode multiple instructions per cycle. For a fixed-length encoding, every instruction occupies the same number of bits, so a few extra sets of decoding circuits are enough to decode several instructions concurrently.

For a variable-length encoding, however, the hardware cannot tell in advance where each instruction starts, so the instructions cannot simply be handed to parallel decoders.

Therefore the start position of each instruction must be identified first: pre-decoding begins when instructions are read from memory into the cache, and the pre-decode flags are stored in the instruction cache together with the instructions.
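Purely as an illustration, with a made-up 2-bit length field rather than real x86 encoding, a pre-decode pass that marks instruction start bytes could look like this:

#include <stdio.h>

/* Toy pre-decode pass (hypothetical encoding): mark the start byte of
 * every variable-length instruction as the line is filled into the
 * instruction cache, so that several decoders can later be pointed at
 * instruction boundaries in parallel. Here the low 2 bits of the
 * first byte are assumed to give the instruction length (1..4). */
#define LINE 16

int main(void)
{
    unsigned char line[LINE]  = { 0x02, 0x00, 0x01, 0x00, 0x03 };
    unsigned char start[LINE] = { 0 };  /* flags stored with the line */
    int pos = 0;
    while (pos < LINE) {
        start[pos] = 1;                 /* this byte begins an instruction */
        pos += (line[pos] & 3) + 1;     /* hypothetical length field       */
    }
    for (int i = 0; i < LINE; i++)
        printf("%d", start[i]);
    printf("\n");
    return 0;
}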

Intel's processors implement decoding with a multi-stage decode pipeline: the first stage detects the start and end positions of each instruction, and the second stage decodes the instruction into uops (micro-operations).
An x86 CISC instruction usually corresponds to multiple uops. When a CISC instruction generates more than 4 uops, its uops are stored in the micro-ROM (uROM) and are fetched from it by table lookup during decoding, which simplifies the decoding of complex instructions.
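A toy model of this decode policy might look as follows; the struct, the helper, and the example values are all made up for illustration:

#include <stdio.h>

/* Instructions producing at most 4 uops are expanded directly by the
 * hardware decoder; longer ones are read from the micro-ROM (uROM)
 * via a lookup table. */
typedef struct { const char *mnemonic; int uop_count; int urom_index; } Inst;

static void decode(const Inst *i)
{
    if (i->uop_count <= 4)
        printf("%-10s: %2d uops from the hardware decoder\n",
               i->mnemonic, i->uop_count);
    else
        printf("%-10s: %2d uops fetched from uROM entry %d\n",
               i->mnemonic, i->uop_count, i->urom_index);
}

int main(void)
{
    Inst add = { "add",       1, -1 };  /* simple instruction       */
    Inst rep = { "rep movsb", 12, 7 };  /* illustrative values only */
    decode(&add);
    decode(&rep);
    return 0;
}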

(3) Trace Cache

The decoded uops are stored in the Trace Cache. Unlike an ordinary cache, it holds instructions in execution order rather than in address order, and it holds the decoded micro-operations rather than the original instructions.

(4) Front-end pipeline

1) Instructions are read from the L2 Cache and placed into a queue buffer, which smooths out the speed difference between stages (a toy FIFO sketch follows this list).

2) The decoder takes instructions from the queue and decodes them.

3) The decoded uops are placed into another queue.

4) The uops are written into the Trace Cache in execution order.

5) Uops are then fetched from the Trace Cache into the uop queue.
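A minimal sketch of such a smoothing buffer between two stages (sizes and names are assumptions):

#include <stdio.h>

/* Toy FIFO between two pipeline stages: the producer (fetch) and the
 * consumer (decode) run at different rates, and the queue absorbs the
 * difference as long as it is neither full nor empty. */
#define QSIZE 8

typedef struct { int buf[QSIZE]; int head, tail, count; } Queue;

static int push(Queue *q, int v)          /* returns 0 when full  */
{
    if (q->count == QSIZE) return 0;      /* fetch must stall     */
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    return 1;
}

static int pop(Queue *q, int *v)          /* returns 0 when empty */
{
    if (q->count == 0) return 0;          /* decode must stall    */
    *v = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    return 1;
}

int main(void)
{
    Queue q = { {0}, 0, 0, 0 };
    for (int i = 0; i < 3; i++) push(&q, i);  /* fast fetch burst */
    int v;
    while (pop(&q, &v)) printf("decode %d\n", v);
    return 0;
}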

(5) Back-end pipeline

1) When a uop enters the back end, resources are allocated for it first, with Buffers used for the bookkeeping: the uop may need an entry in the ROB, a physical register for its logical destination register, and, for memory operations, a Load/Store Buffer entry.

2) The uop's registers are renamed, and the mapping is saved in the RAT (Register Alias Table); a toy renaming sketch follows this list.

3) Instruction scheduling is the core of out-of-order execution. The scheduler decides when a uop executes based on whether its operands are ready and whether an execution unit is available. For example, memory accesses and ALU instructions are placed in different queues.

4) The scheduler is connected to 4 Dispatch Ports, and different types of instructions are dispatched through different ports (a toy dispatch routine also follows this list):

Exec Port 0 and Exec Port 1 dispatch ALU uops, the Load Port dispatches load uops, and the Store Port dispatches store uops. "ALU (double speed)" means that an Exec Port can dispatch a simple ALU uop every half cycle, so in the ideal case Exec Port 0 and Exec Port 1 each issue two uops per cycle while the Load Port and the Store Port each issue one uop per cycle, for a peak of 6 uops per cycle. This is only a theoretical figure: because of dependences between instructions, the actual rate falls far short of 6 parallel uops. Moreover, the maximum number of instructions each pipeline stage can handle per cycle differs (the Trace Cache, for example, outputs 3 uops per cycle), which is why Intel processors place Buffers between almost every pair of stages to absorb the speed differences.

5) The subsequent stages (Register Read, Execute, L1 Cache access for memory operations, Register Write) are similar to the classic MIPS 5-stage pipeline.
6) The last step of the out-of-order execution kernel is Retire, which is responsible for updating the ISA (architectural) register state; instructions leave the out-of-order kernel in program order (a toy retirement sketch follows this list). Allocate, Register Rename, Schedule, and Retire together form the out-of-order control.
The actual pipeline of the P4 processor has 20 stages and is more complicated than this overview.
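A toy sketch of the register renaming in step 2 (all names and sizes are assumptions):

#include <stdio.h>

/* Every write to an architectural register is given a fresh physical
 * register, and the mapping is recorded in the RAT, which removes
 * false write-after-write dependences between uops. */
#define NUM_ARCH 8

static int rat[NUM_ARCH];         /* architectural -> physical mapping */
static int next_phys = NUM_ARCH;  /* next free physical register       */

static int rename_dest(int arch_reg)
{
    rat[arch_reg] = next_phys++;  /* allocate a fresh physical register */
    return rat[arch_reg];
}

int main(void)
{
    for (int r = 0; r < NUM_ARCH; r++)
        rat[r] = r;               /* identity mapping at reset */
    /* Two uops that both write r1 get different physical registers,
     * so they no longer conflict and may execute out of order. */
    printf("first  write to r1 -> p%d\n", rename_dest(1));
    printf("second write to r1 -> p%d\n", rename_dest(1));
    return 0;
}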
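A toy sketch of the type-based dispatch in step 4 (the port names come from the text; the routing logic is a simplification):

#include <stdio.h>

/* ALU uops alternate between Exec Port 0 and Exec Port 1, loads go
 * to the Load Port, and stores go to the Store Port. */
typedef enum { UOP_ALU, UOP_LOAD, UOP_STORE } UopType;

static const char *pick_port(UopType t, int alu_toggle)
{
    switch (t) {
    case UOP_ALU:   return alu_toggle ? "Exec Port 1" : "Exec Port 0";
    case UOP_LOAD:  return "Load Port";
    case UOP_STORE: return "Store Port";
    }
    return "?";
}

int main(void)
{
    UopType stream[] = { UOP_ALU, UOP_LOAD, UOP_ALU, UOP_STORE };
    int alu_toggle = 0;
    for (int i = 0; i < 4; i++) {
        printf("uop %d -> %s\n", i, pick_port(stream[i], alu_toggle));
        if (stream[i] == UOP_ALU)
            alu_toggle ^= 1;      /* alternate between the two ALU ports */
    }
    return 0;
}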
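A toy sketch of the in-order retirement in step 6 (sizes are made up):

#include <stdio.h>
#include <stdbool.h>

/* Uops may finish executing out of order, but only completed uops at
 * the head of the ROB may retire, so the ISA register state is always
 * updated in program order. Here uop 1 has not finished, so uops 2
 * and 3, although finished, must wait; only uop 0 retires now. */
#define ROB_SIZE 8

int main(void)
{
    bool done[ROB_SIZE] = { true, false, true, true };
    int head = 0, tail = 4;
    while (head < tail && done[head]) {
        printf("retire uop %d, update ISA register state\n", head);
        head++;
    }
    printf("%d uops still in flight\n", tail - head);
    return 0;
}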

3. VLIW processor example

The most important task of a DSP is to run digital signal processing algorithms, and the classic algorithm in digital signal processing is FIR filtering:
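As a reference point, a plain C version of an FIR filter might look like this (the names and the Q15 scaling are assumptions):

/* Plain FIR filter: y[j] = sum over i of x[j+i] * h[i], for n taps
 * and count output samples. */
void fir(const short *x, const short *h, short *y, int n, int count)
{
    for (int j = 0; j < count; j++) {
        long acc = 0;
        for (int i = 0; i < n; i++)
            acc += (long)x[j + i] * h[i];
        y[j] = (short)(acc >> 15);    /* Q15 fixed-point scaling */
    }
}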

In the optimal case, the DSP's 8 functional units all execute in parallel at full speed, and efficiency is maximal. How can such a high degree of parallelism be achieved?
The compiler optimizes loops with two techniques: loop unrolling and software pipelining.

When the compiler processes this loop, it unrolls it 4 times:
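A C sketch of the same idea, unrolling the tap loop 4 times and mopping up the remainder afterwards (names are assumptions):

/* Inner tap loop unrolled 4 times. The four loads and multiplies in
 * the body are independent, so a VLIW compiler can spread them across
 * several functional units; a real compiler would also split acc into
 * several partial sums to shorten the accumulation chain. */
long fir_taps_unrolled(const short *x, const short *h, int count)
{
    long acc = 0;
    int i = 0;
    for (; i + 3 < count; i += 4) {   /* the multiple-of-4 part  */
        acc += (long)x[i]     * h[i];
        acc += (long)x[i + 1] * h[i + 1];
        acc += (long)x[i + 2] * h[i + 2];
        acc += (long)x[i + 3] * h[i + 3];
    }
    for (; i < count; i++)            /* the remaining 0..3 taps */
        acc += (long)x[i] * h[i];
    return acc;
}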

A DSP has many functional units; after the loop is unrolled, these units can be fully utilized to process more data at a time. If count is not an integer multiple of 4, the loop can be split into two parts: one covering an integer multiple of 4, and one handling the remaining iterations (as the second loop in the sketch above does).
Software pipelining is an instruction scheduling strategy the compiler uses to optimize loop code; it improves instruction parallelism across multiple iterations of a loop. Hardware pipelining was introduced earlier; software pipelining, as the name suggests, applies the same kind of pipelined scheduling to software (specifically, to loops). Software pipelining is also known as loop-level parallelism.
Each pass through the loop is called an iteration (Iteration), and each iteration executes 3 instructions: K1, K2, and K3 (for example: load, compute, and store).
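A toy C program, made up for illustration, can mimic such a schedule; here K2 is simply "multiply by 2":

#include <stdio.h>

/* Toy software pipeline with K1 = load, K2 = compute, K3 = store.
 * The prologue fills the pipeline, each kernel iteration runs K3 of
 * iteration i, K2 of iteration i+1, and K1 of iteration i+2 "in the
 * same cycle", and the epilogue drains the pipeline. */
#define N 6

int main(void)
{
    int in[N] = { 1, 2, 3, 4, 5, 6 }, out[N];
    int a, b, c;                       /* stage registers */

    a = in[0];                         /* prologue: K1(0)        */
    b = a * 2; a = in[1];              /* prologue: K2(0), K1(1) */
    for (int i = 0; i < N - 2; i++) {
        c = b;                         /* K3(i): take result of i */
        b = a * 2;                     /* K2(i+1): compute         */
        a = in[i + 2];                 /* K1(i+2): load            */
        out[i] = c;                    /* K3(i): store             */
    }
    out[N - 2] = b;                    /* epilogue: K3(N-2)           */
    out[N - 1] = a * 2;                /* epilogue: K2, K3 of N-1     */

    for (int i = 0; i < N; i++)
        printf("%d ", out[i]);         /* prints: 2 4 6 8 10 12 */
    printf("\n");
    return 0;
}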

Below we analyze how the 8 instructions of the FIR filter are executed in parallel (in the original annotated listing, the description of each instruction follows an arrow).
