This article analyzes the ARM9 five-stage pipeline and the pipeline interlocks it can suffer, so that more efficient assembly code can be written.
1. ARM9 five-stage pipeline
ARM7 uses a typical three-stage pipeline: fetch, decode, and execute. The execute unit does most of the work: reading and writing the registers and memory locations holding the operands, performing ALU operations, and transferring data between the related units. Each of the three stages normally takes one clock cycle, and because three instructions occupy the three stages simultaneously, a throughput of one instruction per cycle can still be reached. However, the execute stage often takes multiple clock cycles, so it becomes the bottleneck of system performance.
ARM9 uses a more efficient five-stage pipeline: after fetch, decode, and execute (ALU), two further stages, LS1 and LS2, are added. LS1 performs the memory access for load and store instructions, and LS2 extracts and zero- or sign-extends the data returned by a byte or halfword load. LS1 and LS2 do real work only for load and store instructions; all other instructions simply pass through them. The ARM documentation defines the stages as follows:
- Fetch: Fetch from memory the instruction at address pc. The instruction is loaded into the core and then processed down the core pipeline.
- Decode: Decode the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank if they are not available via one of the forwarding paths.
- ALU: Executes the instruction that was decoded in the previous cycle. Note this instruction was originally fetched from address pc − 8 (ARM state) or pc − 4 (Thumb state). Normally this involves calculating the answer for a data processing operation, or the address for a load, store, or branch operation. Some instructions may spend several cycles in this stage. For example, multiply and register-controlled shift operations take several ALU cycles.
- LS1: Load or store the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect.
- LS2: Extract and zero- or sign-extend the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.
2. The problem of pipeline interlock
LDR r1, [r2, #4]
ADD r0, r0, r1
The above code takes three clock cycles. While the LDR instruction is in its ALU stage computing r2 + 4, the ADD instruction is in its decode stage; at that point the data has not yet been read from [r2, #4] and written back to r1. The ADD needs r1 in its ALU stage on the next cycle, so the pipeline must stall for one cycle until the LDR's LS1 stage completes before the ADD can enter its ALU stage. The following figure shows the pipeline interlock in this example:
LDRB r1, [r2, #1]
ADD r0, r0, r2
EOR r0, r0, r1
The above code takes four clock cycles. Because LDRB is a byte load, its write-back to r1 completes only after the LS2 stage, so the EOR instruction must wait one clock cycle. The pipeline proceeds as follows:
Look at the following example again:
MOV r1, #1
B case1
AND r0, r0, r1
EOR r2, r2, r3
...
case1:
SUB r0, r0, r1
The above code takes five clock cycles; the B instruction alone takes three, because when a branch is taken, the instructions already fetched behind it are flushed from the pipeline and fetching restarts at the new address. The pipeline proceeds as follows:
3. Avoid pipeline interlock to improve operating efficiency
void str_tolower(char *out, char *in)
{
    unsigned int c;
    do {
        c = *(in++);
        if (c >= 'A' && c <= 'Z')
        {
            c = c + ('a' - 'A');
        }
        *(out++) = (char)c;
    } while (c);
}
The compiler generates the following assembly code:
str_tolower
LDRB r2,[r1],#1 ; c = *(in++)
SUB r3,r2,#0x41 ; r3 = c-'A'
CMP r3,#0x19 ; if (r3 <= 'Z'-'A')
ADDLS r2,r2,#0x20 ; c += 'a'-'A'
STRB r2,[r0],#1 ; *(out++) = (char)c
CMP r2,#0 ; if (c!=0)
BNE str_tolower ; goto str_tolower
MOV pc,r14 ; return
Note that after compilation, the condition (c >= 'A' && c <= 'Z') has been transformed into the single unsigned comparison c - 'A' <= 'Z' - 'A'.
3.1 Load Scheduling by Preloading
out RN 0 ; pointer to output string
in RN 1 ; pointer to input string
c RN 2 ; character loaded
t RN 3 ; scratch register
; void str_tolower_preload(char *out, char *in)
str_tolower_preload
LDRB c, [in], #1 ; c = *(in++)
loop
SUB t, c, #'A' ; t = c-'A'
CMP t, #'Z'-'A' ; if (t <= 'Z'-'A')
ADDLS c, c, #'a'-'A' ; c += 'a'-'A';
STRB c, [out], #1 ; *(out++) = (char)c;
TEQ c, #0 ; test if c==0
LDRNEB c, [in], #1 ; if (c!=0) { c=*in++;
BNE loop ; goto loop; }
MOV pc, lr ; return
This version has one more instruction than the compiler's output, but it saves two clock cycles per character, reducing the inner-loop cost from 11 cycles to 9, so it is about 1.22 times as fast as the compiled C version.
3.2 Load Scheduling by Unrolling
out RN 0 ; pointer to output string
in RN 1 ; pointer to input string
ca0 RN 2 ; character 0
t RN 3 ; scratch register
ca1 RN 12 ; character 1
ca2 RN 14 ; character 2
; void str_tolower_unrolled(char *out, char *in)
str_tolower_unrolled
STMFD sp!, {lr} ; function entry
loop_next3
LDRB ca0, [in], #1 ; ca0 = *in++;
LDRB ca1, [in], #1 ; ca1 = *in++;
LDRB ca2, [in], #1 ; ca2 = *in++;
SUB t, ca0, #'A' ; convert ca0 to lower case
CMP t, #'Z'-'A'
ADDLS ca0, ca0, #'a'-'A'
SUB t, ca1, #'A' ; convert ca1 to lower case
CMP t, #'Z'-'A'
ADDLS ca1, ca1, #'a'-'A'
SUB t, ca2, #'A' ; convert ca2 to lower case
CMP t, #'Z'-'A'
ADDLS ca2, ca2, #'a'-'A'
STRB ca0, [out], #1 ; *out++ = ca0;
TEQ ca0, #0 ; if (ca0!=0)
STRNEB ca1, [out], #1 ; *out++ = ca1;
TEQNE ca1, #0 ; if (ca0!=0 && ca1!=0)
STRNEB ca2, [out], #1 ; *out++ = ca2;
TEQNE ca2, #0 ; if (ca0!=0 && ca1!=0 && ca2!=0)
BNE loop_next3 ; goto loop_next3;
LDMFD sp!, {pc} ; return;
The above code is the most efficient implementation presented so far: it needs only about 7 clock cycles per character, making it roughly 1.57 times as fast as the compiled C version.