Computer Organization and Design (Patterson & Hennessy) Notes (1): MIPS Instruction Set

The language of computers: the assembly instruction set

The language a computer understands is its instruction set. This book mainly uses the MIPS instruction set.

Assembly instructions

Arithmetic operations:

add a,b,c	# a=b+c
sub a,b,c	# a=b-c

Comments in MIPS assembly start with #.

Registers in MIPS are 32 bits wide. Because 32 bits is the basic unit of access, it is also called a word. MIPS has 32 registers. The register count is a design trade-off: more is not automatically better, since a larger register file can lengthen the clock cycle (the book's "smaller is faster" principle).

Next, replace the above abstract letters with register representations:

# f=(g+h)-(i+j)
# f,g,h,i,j in s0,s1,s2,s3,s4

add $t0,$s1,$s2
add $t1,$s3,$s4
sub $s0,$t0,$t1

Registers in MIPS assembly are written as $ followed by two characters, such as $s0 or $t1.

Then there are instructions to fetch data from memory.

# g=h+A[8]
# g,h in s1,s2
# A address in s3
lw $t0,8($s3)	# load A[8] (offset written as the index for now; corrected to a byte offset below)
add $s1,$s2,$t0

MIPS uses big-endian addressing: the most significant byte of a word is stored at the lowest address.

Least significant bit: the rightmost bit, bit 0 of the 32-bit word [31:0].

Most significant bit: the leftmost bit, bit 31.

Memory is byte-addressed, so the correct offset for element i of a word array is i*4.

For example, to compute A[12]=h+A[8], with h in s2 and the base address of A in s3:

lw $t0,32($s3)	# t0 = A[8], byte offset 8*4
add $t0,$s2,$t0	# t0 = h + A[8]
sw $t0,48($s3)	# write back to A[12], byte offset 12*4
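The byte offsets above can be checked with a small simulation. This is only a sketch: the 64-byte `memory`, the helper names `lw` and `sw`, and the sample values 37 and 5 are assumptions made for illustration, not part of the book's example.

```python
import struct

# Hypothetical 64-byte memory holding a 16-element word array A at address 0.
memory = bytearray(64)

def lw(base, offset):
    """Load the 32-bit big-endian word at byte address base + offset."""
    return struct.unpack_from(">i", memory, base + offset)[0]

def sw(value, base, offset):
    """Store a 32-bit big-endian word at byte address base + offset."""
    struct.pack_into(">i", memory, base + offset, value)

A = 0               # base address of the array (plays the role of $s3)
h = 5               # plays the role of $s2
sw(37, A, 8 * 4)    # set A[8] = 37; element 8 lives at byte offset 32

t0 = lw(A, 32)      # lw  $t0,32($s3)
t0 = h + t0         # add $t0,$s2,$t0
sw(t0, A, 48)       # sw  $t0,48($s3)

a12 = lw(A, 12 * 4) # read A[12] back
```

Because the word is stored big-endian, the most significant byte of A[12] sits at the lowest address (byte 48) and the value 42 ends up in byte 51.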

Constants: if, for example, the constant 4 is stored in memory at s1+AddrConstant4, we can load it with lw $t0,AddrConstant4($s1).

Alternatively, use an immediate value directly.

addi $s3,$s3,4	# adding an immediate requires addi

Immediate operations are fast and use little energy, because they avoid a memory access.

The $zero register always holds the constant 0. For example, a move between registers can be expressed as an add with $zero, so no separate move instruction is needed.

Providing constants according to how often they are used is one way of making the common case fast.

Numbers are stored either as unsigned values or as signed values in two's complement; the detailed arithmetic is not covered here.
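As a quick reminder of how the same 32 bits read as signed or unsigned, here is a minimal sketch. The helper names `to_signed32` and `to_bits32` are made up for this note.

```python
def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    bits &= 0xFFFFFFFF
    return bits - (1 << 32) if bits & 0x80000000 else bits

def to_bits32(value):
    """Return the 32-bit two's complement pattern of a Python integer."""
    return value & 0xFFFFFFFF

all_ones = to_signed32(0xFFFFFFFF)   # every bit set reads as -1 when signed
minus_one_bits = to_bits32(-1)       # and -1 encodes back to 0xFFFFFFFF
```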

Instruction format

t0-t7 are registers 8-15, s0-s7 are registers 16-23.

For example, the machine language representation (in decimal) of add $t0,$s1,$s2 is:

0 | 17 | 18 | 8 | 0 | 32

The 0 in the first field and the 32 in the last field together identify the add instruction.

17 and 18 are the two source operands, s1 and s2.

8 is the destination operand, t0.

The fifth field is unused here and is set to 0.

The underlying representation is of course a 32-bit binary number. All MIPS instructions are 32 bits long.

op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

op: operation code (opcode).

rs, rt: the two source operand registers.

rd: destination operand register.

shamt: shift amount.

funct: function code, which selects the specific variant of the operation in the op field (for example, add versus sub).
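To make the field layout concrete, the following sketch packs the six R-type fields into one 32-bit word. The function name `encode_r` and the register-number constants are illustrative assumptions; the field widths are the 6/5/5/5/5/6 split of the R format.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack the R-type fields (6, 5, 5, 5, 5, 6 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# Register numbers from the notes: t0 is register 8, s1 is 17, s2 is 18.
T0, S1, S2 = 8, 17, 18

# add $t0,$s1,$s2 -> fields 0, 17, 18, 8, 0, 32
word = encode_r(0, S1, S2, T0, 0, 32)
```

Formatting `word` with `hex(word)` gives the familiar encoding of this instruction.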

The drawback of this format is that its fields are sometimes too short. For example, an address or immediate value usually cannot be represented in 5 bits, which can encode only 32 distinct values.

Hence the I-type format, used for immediates and data transfers. (The format above is the R-type, used for register operations.)

op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits)

The large final field holds an address offset or an immediate value.

Both formats are 32 bits long, so the added complexity is small. How does the hardware tell an R-type instruction from an I-type one? The format is determined by the value of the op field.

For example, for lw the rs and rt fields are registers and all remaining bits form the address field.

[Figure: machine code for two example instructions]

The example above shows the machine code for two different instructions. When an instruction has no rd field, the rt field names the destination register instead.
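A sketch of how the op field selects the format, assuming the usual MIPS opcode values (0 for R-type arithmetic, 35 for lw, 2 and 3 for j and jal). The helpers `encode_i` and `decode` are hypothetical names, and every other opcode is simply treated as I-type here.

```python
def encode_i(op, rs, rt, imm):
    """Pack the I-type fields (6, 5, 5, 16 bits) into a 32-bit word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode(word):
    """Split a 32-bit instruction into fields, choosing the format by opcode."""
    op = word >> 26
    if op in (2, 3):                       # j, jal: J-type
        return ("J", op, word & 0x3FFFFFF)
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if op == 0:                            # R-type: rd, shamt, funct follow
        return ("R", rs, rt, (word >> 11) & 0x1F, (word >> 6) & 0x1F, word & 0x3F)
    return ("I", op, rs, rt, word & 0xFFFF)  # simplification: everything else

# lw $t0,32($s3): op=35, rs=19 ($s3), rt=8 ($t0), offset=32
lw_word = encode_i(35, 19, 8, 32)
lw_fields = decode(lw_word)
add_fields = decode(0x02324020)            # add $t0,$s1,$s2
```

Note how in the lw case the destination register $t0 appears in the rt position, since the I format has no rd field.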

Shifts and logical operations:

sll $t2,$s0,4	# t2 = s0 << 4
and $t0,$t1,$t2	# t0 = t1 & t2
or $t0,$t1,$t2	# t0 = t1 | t2
nor $t0,$t1,$t2	# t0 = ~(t1 | t2); with one operand equal to 0 this acts as not
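The same operations can be mimicked on 32-bit values. This is a sketch with hypothetical helper names; the explicit mask is needed because Python integers are unbounded, whereas a MIPS register drops bits shifted past bit 31.

```python
MASK = 0xFFFFFFFF  # keep results within 32 bits, like a MIPS register

def sll(x, shamt):
    """Shift left logical: bits shifted past bit 31 are discarded."""
    return (x << shamt) & MASK

def nor(a, b):
    """Bitwise NOR; with b == 0 this is a plain NOT of a."""
    return ~(a | b) & MASK

shifted = sll(9, 4)        # 9 * 2**4
inverted = nor(0x0F, 0)    # NOT 0x0000000F
```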

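Conditional branches:

beq $s1,$s2,L1	# branch to L1 if the two are equal
bne $s1,$s2,L1	# branch to L1 if the two are not equal

For example: if(i==j) f=g+h; else f=g-h; (with f,g,h,i,j in s0-s4)

bne $s3,$s4,Else	# go to Else if i != j
add $s0,$s1,$s2	# f = g + h
j Exit
Else: sub $s0,$s1,$s2	# f = g - h
Exit:

Loops use the same compare-and-branch approach.

Loop:	# loop body goes here
bne $s0,$s1,Exit	# leave the loop if the two differ
j Loop
Exit:

Set-on-less-than instructions:

slt $t0,$s3,$s4	# t0 = 1 if s3 < s4 (signed), else 0
sltu $t0,$s3,$s4	# same comparison, treating the operands as unsigned
slti $t0,$s3,10	# compare against an immediate

Following the simplicity principle, MIPS has no "branch if less than" instruction. Instead, we use slt first and then beq or bne to test the value of t0; two simple, fast instructions are preferable to one complex instruction that would lengthen the clock cycle.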
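The difference between slt and sltu only shows up when the top bit is set. A minimal sketch, with helper names chosen for this note:

```python
MASK = 0xFFFFFFFF

def to_signed32(bits):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    bits &= MASK
    return bits - (1 << 32) if bits & 0x80000000 else bits

def slt(a, b):
    """Set on less than, comparing the operands as signed values."""
    return 1 if to_signed32(a) < to_signed32(b) else 0

def sltu(a, b):
    """Set on less than, comparing the operands as unsigned values."""
    return 1 if (a & MASK) < (b & MASK) else 0

# 0xFFFFFFFF is -1 when signed but 4294967295 when unsigned.
signed_result = slt(0xFFFFFFFF, 1)
unsigned_result = sltu(0xFFFFFFFF, 1)
```

A "branch if less than" is then just this result followed by a bne against $zero.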

Procedures (functions)

A procedure is an abstraction: a piece of the program that performs a specific task, like a function. A procedure does not need to know everything about its caller, only the part it needs to do its job.

Calling one involves passing in parameters, handing control over to the procedure, and, after it returns, reading its return value from an agreed location.

a0-a3 are the argument registers, v0 and v1 hold the return values, and ra is the return address register, used to get back to the point of origin.

jal (jump and link) jumps to the procedure and stores the return address in ra. The procedure then returns with jr, an unconditional jump to the address held in the given register.

jr $ra

The code that calls the function is the caller; the function being called is the callee.

jal actually stores PC+4 in ra.

The stack

Suppose a function needs more registers than the argument and return-value registers above.

We can first push the original values of some registers onto the stack, and then let the function use those registers.

The stack grows from high addresses to low addresses, with $sp pointing to the top.

For example, a function computes f=(g+h)-(i+j). The four parameters are passed in a0-a3, f is kept in s0, and the two additions need two temporaries, t0 and t1. The values of these three registers therefore need to be saved on the stack.

Push code:

addi $sp,$sp,-12	# make room for three words
sw $t0,8($sp)
sw $t1,4($sp)
sw $s0,0($sp)

After the computation, the value in s0 is copied to a return value register:

add $v0,$s0,$zero

Pop code:

lw $t0,8($sp)
lw $t1,4($sp)
lw $s0,0($sp)
addi $sp,$sp,12	# release the three words

In fact, by the MIPS convention only s0-s7 must be preserved across a call, and it is the callee that saves and restores them. t0-t9 need not be preserved, so saving t0 and t1 above is not required.

Nested procedures

A procedure that does not call other procedures is called a leaf procedure. Few programs consist only of leaf procedures.

A non-leaf procedure must push every register whose value it still needs after a call. The caller saves the argument registers a0-a3 and the temporary registers t0-t9 that it still needs; the callee saves the saved registers s0-s7 and the return address ra. Callee-saved registers are guaranteed to hold the same value on return as at the call, while temporary and argument registers may have changed. So if the caller still needs a t register after the call, it must save it itself rather than expect the callee to do so; if not, nothing needs saving.

For example, a recursive function in C:

int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n - 1);
}

MIPS assembly code:

First, take stock of the registers this non-leaf procedure uses.

  • The computation is simple, so no temporary registers need saving.

  • n is in a0, which needs to be saved.

  • ra needs to be saved.

  • s0-s7 are not used.

So only a0 and ra need saving. Who saves them, the caller or the callee?

Every call to fact makes fact the callee, which must save ra itself. When fact calls itself recursively, it also becomes a caller and must save a0. (Put another way: save ra and a0 on every entry; restore them after a recursive call, and skip the restore in the base case because neither value has changed.)

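fact:
	addi $sp,$sp,-8		# make room for two words
	sw   $ra,4($sp)		# save the return address
	sw   $a0,0($sp)		# save the argument n
	
	slti $t0,$a0,1		# t0 = 1 if n < 1
	beq  $t0,$zero,L1	# if n >= 1, go make the recursive call
	
	addi $v0,$zero,1	# base case: return 1
	addi $sp,$sp,8		# ra and a0 are unchanged, so just pop the stack
	jr   $ra			# return
	
L1:
	addi $a0,$a0,-1		# argument becomes n-1
	jal  fact			# recursive call (non-leaf procedure)
	
	lw   $ra,4($sp)		# restore the return address
	lw   $a0,0($sp)		# restore n
	addi $sp,$sp,8		# pop the stack
	mul  $v0,$a0,$v0    # multiply by this level's n
	jr   $ra			# return to the caller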

Supplement: C variables are either automatic or static. Static variables (globals and variables declared static) live in the static data area and persist across procedure entry and exit, while automatic variables exist only while their procedure is active. MIPS has a global pointer $gp that points into the static data area.

The overhead of recursion is fairly high. Would iteration be better?
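To illustrate the question, here is one way the same function could be written both ways. This is a sketch in Python rather than MIPS, and the function names are chosen for this note. The iterative version needs no saved return address or argument per level, which is exactly the stack traffic the recursive MIPS code spends on every call.

```python
def fact_recursive(n):
    """Mirrors the C code: each call level keeps its own n (and return address)."""
    return 1 if n < 1 else n * fact_recursive(n - 1)

def fact_iterative(n):
    """Same result with a loop: no per-call stack frame is needed."""
    result = 1
    while n >= 1:
        result *= n
        n -= 1
    return result

five_factorial = fact_iterative(5)
```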

Procedure frame

The stack also holds registers saved by a procedure and local variables that do not fit in registers. This segment of the stack is called the procedure frame or activation record.

Some MIPS software uses the frame pointer $fp to point to the first word of a procedure's frame, while $sp marks its other end. $fp gives a stable base register for local memory references, since $sp may change during the procedure.

Parameters beyond the first four are placed on the stack, where they can be located relative to the frame pointer.

An important rule: before the procedure returns, the stack must be restored to the state it had on entry.

Text segment, static data, and heap

Besides the stack, space is needed for static variables and for dynamic data on the heap.

Text: the code segment.

Static data: global and static variables.

Dynamic data: data allocated on the heap.

C allocates and frees heap space explicitly through malloc and free. The drawback is that forgetting to free leads to memory leaks, while freeing too early produces a dangling pointer, so the program points at something it did not intend. Java, by contrast, allocates memory automatically and garbage-collects unused objects.

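The MIPS register convention is listed below. It is another case of making the common case fast: measurements show that 8 saved registers and 10 temporaries are enough most of the time.

$zero (register 0): the constant 0
$v0-$v1 (2-3): return values, not preserved across a call
$a0-$a3 (4-7): arguments, not preserved across a call
$t0-$t7 (8-15): temporaries, not preserved across a call
$s0-$s7 (16-23): saved registers, preserved across a call
$t8-$t9 (24-25): more temporaries, not preserved across a call
$gp (28): global pointer, preserved
$sp (29): stack pointer, preserved
$fp (30): frame pointer, preserved
$ra (31): return address, preserved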

Communicating with people

Most computers today use 8-bit bytes to represent characters, most commonly in the ASCII encoding.

lb and sb read and write a single byte, which occupies the rightmost eight bits of the register.

Characters are usually combined into strings. There are three ways to represent a string's length: 1. the first position of the string holds its length; 2. a separate variable stores the length; 3. a special terminator marks the end of the string. C uses scheme 3, terminating strings with \0.

For example, string copy in C loops over the characters, copying one at a time until \0 has been copied.

Assume the base addresses of the destination and source arrays are in $a0 and $a1, and i is kept in $s0.

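strcpy:
	addi $sp,$sp,-4
	sw   $s0,0($sp)			# save s0
	
	add  $s0,$zero,$zero	# i = 0
    L1:
        add  $t1,$a1,$s0		# t1 = address of the current source byte
        lbu  $t2,0($t1)			# load one byte, zero-extended
        add  $t3,$a0,$s0		# t3 = address of the current destination byte
        sb   $t2,0($t3)			# store the byte

        beq  $t2,$zero,L2		# the terminator was copied: done

        addi $s0,$s0,1			# i++ rather than i+4: we step by bytes here, not words
        j    L1					# next character

    L2:
        lw   $s0,0($sp)			# restore s0
        addi $sp,$sp,4
        jr   $ra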
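The loop structure is easier to see in a high-level sketch. The Python version below follows the same order of operations as the assembly (load a byte, store it, then test for the terminator); the buffers and the returned count are illustrative additions, not part of the C function.

```python
def strcpy(dest, src):
    """Copy bytes from src to dest up to and including the 0 terminator.

    Returns the number of characters copied, not counting the terminator.
    """
    i = 0
    while True:
        byte = src[i]      # lbu $t2,0($t1)
        dest[i] = byte     # sb  $t2,0($t3)
        if byte == 0:      # beq $t2,$zero,L2
            break
        i += 1             # addi $s0,$s0,1 ; j L1
    return i

source = bytearray(b"MIPS\x00junk")   # bytes after the terminator must not be copied
target = bytearray(8)
copied = strcpy(target, source)
```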

Java uses the more general Unicode (the encoding used by most Web pages today) for characters, with 16-bit units. MIPS can read and write a halfword, exactly one such character, with lh, lhu, and sh. Java strings therefore take about twice as much memory as C strings, but string operations are faster.

Java uses a word to store the total length of the string.

Because the MIPS stack must stay word-aligned, a C string of 5 one-byte chars on the stack is allocated 8 bytes, that is, 2 words. Java's halfwords are padded to alignment in the same way.
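The rounding involved is the usual "align up" computation. A minimal sketch, where the function name `align_up` is an assumption and the alignment must be a power of two:

```python
def align_up(size, alignment=4):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

five_chars = align_up(5)   # 5 one-byte chars occupy 2 words on the stack
```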

32-bit immediates

Immediates are normally 16 bits, but sometimes we need a full 32-bit constant.

The lui (load upper immediate) instruction copies a 16-bit immediate into the upper 16 bits of a register and clears the lower 16 bits.

To build a 32-bit constant, first lui the upper 16 bits into a register, then insert the lower 16 bits with ori.

MIPS reserves the $at register for the assembler, which uses it as a temporary when it has to build 32-bit values.

Take care with how 16-bit immediates are extended to 32 bits: addi sign-extends its immediate, so the upper 16 bits take part in the operation, while the logical operations (andi, ori) treat the upper 16 bits of the immediate as all 0.
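The lui/ori pair and the two kinds of extension can be sketched as follows. The helper names are hypothetical and model only the register result, and the constant 4000000 (upper half 61, lower half 2304) is just a convenient sample value.

```python
MASK = 0xFFFFFFFF

def lui(imm16):
    """Load upper immediate: imm16 goes to bits 31..16, low half is cleared."""
    return (imm16 & 0xFFFF) << 16

def ori(reg, imm16):
    """OR immediate: the 16-bit immediate is zero-extended."""
    return (reg | (imm16 & 0xFFFF)) & MASK

def addi(reg, imm16):
    """Add immediate: the 16-bit immediate is sign-extended first."""
    imm = imm16 & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return (reg + imm) & MASK

# 4000000 = 61 * 2**16 + 2304, so: lui $s0,61 then ori $s0,$s0,2304
s0 = ori(lui(61), 2304)

zero_extended = ori(0, 0xFFFF)    # upper half stays 0
sign_extended = addi(0, 0xFFFF)   # 0xFFFF means -1, so all 32 bits are set
```

This is why the low half is inserted with ori rather than addi: a low half with its top bit set would otherwise be sign-extended and disturb the upper 16 bits.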

Addressing

The J-type jump instructions consist of a 6-bit opcode plus a 26-bit address, so the jump range is 2^26 words.

Because conditional branch instructions also need bits for the two registers being compared, they are structured as 6-bit opcode + 5-bit register 1 + 5-bit register 2 + 16-bit address.

If this 16-bit field were the target address itself, the reachable range would be limited to 2^16, and no program could be larger than that, which is far too restrictive.

The solution is to treat the 16-bit field as a signed offset: the target is the PC plus the offset (one bit is the sign, giving a range of ±2^15 words). Since most loops and conditionals branch less than 2^15 words away (another case of making the common case fast), this is sufficient. This method is called PC-relative addressing.

Moreover, MIPS instructions are word-aligned, so these fields hold word addresses rather than byte addresses, which extends the reach by a factor of 4. For example, the 26-bit field of a jump covers 2^28 byte addresses.

But aren't PC addresses 32 bits? Indeed, a jump instruction replaces only the lower 28 bits of the PC. If a program is larger than 2^28 bytes, a far jump must use jr with a full 32-bit address in a register.

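[Figure: machine code for the loop example]

Branches use relative addressing, relative to the following instruction: in the loop example the branch is followed by addi $s3,$s3,1 at 80016, and its offset of 2 words (8 bytes) reaches Exit at 80016+8=80024.

Jumps use pseudo-direct addressing: the address field 20000 is multiplied by 4, giving 80000.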
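Both target calculations fit in a few lines. In this sketch the function names are made up, and the addresses 80012 (the branch) and 80020 (the jump) are assumed from the loop example, whose instructions are 4 bytes apart.

```python
def branch_target(pc, offset16):
    """PC-relative: target = (PC + 4) + sign-extended word offset * 4."""
    if offset16 & 0x8000:
        offset16 -= 0x10000
    return (pc + 4 + (offset16 << 2)) & 0xFFFFFFFF

def jump_target(pc, addr26):
    """Pseudo-direct: upper 4 bits of PC + 4, then the 26-bit field times 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x3FFFFFF) << 2)

exit_addr = branch_target(80012, 2)       # branch at 80012 with offset 2
loop_addr = jump_target(80020, 20000)     # j with address field 20000
back_edge = branch_target(80012, 0xFFFC)  # offset -4 words branches backwards
```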

This is also an interesting idea. Conditions in assembly always feel different from normal if-statement thinking, since the branch usually tests the inverse of the condition in the source code.

MIPS addressing modes include the following:

Base addressing: the address is the value in a given register plus an offset.

Pseudo-direct addressing: the upper bits of the PC are concatenated with the 26-bit address field.

Although the MIPS in this book uses 32-bit addresses, almost all microprocessors, MIPS included, have upward-compatible 64-bit address extensions.

Parallelism and synchronization instructions

When tasks execute in parallel, synchronization is needed to prevent data races.

This feels very similar to what one learns in an operating systems course: a mutex (lock) makes a sequence of reads and writes on shared data atomic.

MIPS provides an instruction pair for this: load linked (ll) and store conditional (sc).

again:	addi $t0,$zero,1		# value to store: 1 = locked
	ll	$t1,0($s1)				# load linked: read the current value at 0($s1)
	sc	$t0,0($s1)				# store conditional: fails and sets t0 to 0 if the location changed since the ll
	beq	$t0,$zero,again			# t0 == 0 means the store failed, so retry
	add $s4,$zero,$t1			# the exchange succeeded: the old value is now in s4
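The retry logic can be modelled with a toy memory that tracks a single reservation. Everything here (the `Memory` class, the `interfere` hook, the address 0x100) is a hypothetical single-threaded model for illustration; real hardware tracks the reservation per processor.

```python
class Memory:
    """Toy memory with one load-linked reservation."""

    def __init__(self):
        self.cells = {}
        self.reserved = None

    def ll(self, addr):
        """Load linked: read the value and remember the address."""
        self.reserved = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store (e.g. by another processor): breaks the reservation."""
        self.cells[addr] = value
        if self.reserved == addr:
            self.reserved = None

    def sc(self, addr, value):
        """Store conditional: returns 1 on success, 0 if the reservation was lost."""
        if self.reserved != addr:
            return 0
        self.cells[addr] = value
        self.reserved = None
        return 1

def atomic_exchange(mem, addr, new_value, interfere=None):
    """Retry ll/sc until the store succeeds; return (old value, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        old = mem.ll(addr)
        if interfere is not None and attempts == 1:
            interfere()                # simulate a competing write between ll and sc
        if mem.sc(addr, new_value):
            return old, attempts

quiet = Memory()
quiet_result = atomic_exchange(quiet, 0x100, 1)

busy = Memory()
busy_result = atomic_exchange(busy, 0x100, 1, interfere=lambda: busy.store(0x100, 7))
```

In the second run the competing store makes the first sc fail, so the loop retries and returns the value written by the competitor.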

Later chapters will expand further.

Translating and starting a program

In the early days, hardware had little memory and compilers were inefficient, so programs were written directly in assembly.

The assembler accepts some variations of machine language, such as the move instruction. The hardware has no move instruction, but the assembler translates move $t0,$t1 into add $t0,$zero,$t1. Such instructions are called pseudo-instructions.

The assembler keeps track of labels and their addresses in a symbol table.

The object file produced from an assembly file contains:

  • Object file header: describes the size and position of the other pieces of the object file.
  • Text segment: the machine language code.
  • Static data segment.
  • Relocation information: identifies the instructions and data words that depend on absolute addresses.
  • Symbol table: the remaining labels that are not defined, such as external references. (Labels used in branch and data transfer instructions are looked up in a table whose entries pair a label with its address.)
  • Debugging information.

The linker combines the separately assembled object files into an executable. The main steps are:

  • Using the relocation information and symbol table in each object file, the old addresses in each file are patched to new ones. Why not generate the executable with final addresses from the start, instead of compiling each file into a separate object file and then patching it? Because patching is much faster than recompiling and reassembling everything after every change.
  • After resolving the external references, the linker determines the memory location of every module and relocates all absolute references to their true locations.

Executable files generally have the same format as object files, except that they contain no unresolved references (apart from partially linked files, such as those that still link to library routines).

Example: given the object files for two procedures A and B, link them and give the updated addresses.

[Figure: object files for procedures A and B]

  1. Handle external references. A references X and B; B references Y and A.
  2. The text segment starts at 0x400000 and the data segment at 0x10000000. A's text therefore occupies 0x400000-0x400100 (A's file header gives the size of its text; the end address 0x400100 itself is not used by A) and its data occupies 0x10000000-0x10000020. B follows: its text occupies 0x400100-0x400300 and its data 0x10000020-0x10000050.
  3. The jal in each procedure jumps to the first instruction of the other. jal uses pseudo-direct addressing, so A's jal targets 0x400100 and B's targets 0x400000. The address field of jal drops the two rightmost bits (the hardware multiplies the field by 4 and concatenates it with the upper bits of the PC), so the two address fields are 0x100040 and 0x100000.
  4. $gp is initialized to 0x10008000, and data is accessed relative to this base register. To reach 0x10000000 the offset field must be 0x8000, which is sign-extended to -0x8000: 0x10008000-0x8000=0x10000000. Likewise, 0x10000020 is reached with offset 0x8020.
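The arithmetic in steps 3 and 4 can be reproduced with two small helpers. This is only a sketch: the names `jal_field` and `gp_offset` are invented here, and the addresses are the ones from the example.

```python
GP = 0x10008000  # initial value of $gp in the example

def jal_field(target):
    """26-bit address field of jal: drop the upper 4 bits and the low 2 bits."""
    return (target & 0x0FFFFFFF) >> 2

def gp_offset(addr):
    """16-bit offset field that reaches addr relative to $gp."""
    return (addr - GP) & 0xFFFF

a_calls_b = jal_field(0x400100)      # jal in A, targeting B's first instruction
b_calls_a = jal_field(0x400000)      # jal in B, targeting A's first instruction
first_word = gp_offset(0x10000000)   # start of A's data
second_area = gp_offset(0x10000020)  # start of B's data
```

The offsets come out above 0x7FFF because they encode negative values: the data sits below the address held in $gp.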

After the executable is created, the loader places its instructions and data into memory.

  1. Read the file header to determine the sizes of the text and data segments;
  2. Create an address space large enough for the text and data;
  3. Copy the instructions and data into memory;
  4. Copy the parameters of the main program onto the stack, and set the stack pointer to the first free location;
  5. Jump to a start-up routine, which copies the parameters into the argument registers and calls main; when main returns, the routine terminates the program with exit.

The linking described above is static linking. Its drawbacks: if a library is updated, a statically linked program keeps using the old version; and the whole library is loaded even if only part of it is used, so programs become large.

Dynamically linked libraries (DLLs) are linked while the program runs. Early DLL implementations still linked in every routine of the libraries; later, lazy versions link each routine only when it is first called.

From the second call to a routine onward, some of the indirect jumps are avoided.

Running a Java program takes two steps.

  1. javac compiles the Java source into binary bytecode (class files).
  2. The JVM interprets the bytecode one instruction at a time.

Pure interpretation is slow, so just-in-time (JIT) compilers were introduced later: they identify frequently executed blocks as "hot" code, compile them into native machine code for the local platform, and optimize them to improve efficiency.

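Worked example: sort

swap:				# a0 = base address of the array v, a1 = index k
sll $t0,$a1,2		# t0 = k * 4
add $t1,$a0,$t0		# t1 = address of v[k]
lw  $t2,0($t1)		# t2 = v[k]
lw  $t0,4($t1)		# t0 = v[k+1]
sw  $t0,0($t1)		# v[k] = old v[k+1]
sw  $t2,4($t1)		# v[k+1] = old v[k]
jr  $ra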

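Outer loop:

move $s0,$zero			# i = 0
for1tst:
	slt  $t0,$s0,$a1		# t0 = 1 if i < n
	beq  $t0,$zero,exit1	# leave the loop when i >= n
	...
	# inner loop (for loop 2) goes here
	...
	addi $s0,$s0,1			# i++
	j    for1tst
exit1:

Inner loop: leave the s registers used by the outer loop alone; the t registers can be changed freely.

addi $s1,$s0,-1			# j = i - 1
for2snd:
	slti $t0,$s1,0			# t0 = 1 if j < 0
	bne  $t0,$zero,exit2	# leave the loop when j < 0
	sll  $t1,$s1,2			# t1 = j * 4
	add  $t2,$t1,$a0		# t2 = address of v[j]
    lw   $t3,0($t2)			# t3 = v[j]
    lw   $t4,4($t2)			# t4 = v[j+1]
    slt  $t0,$t4,$t3		# t0 = 1 if v[j+1] < v[j]
    beq  $t0,$zero,exit2	# already in order: leave the loop
    ...
    # call swap here
    ...
    addi $s1,$s1,-1			# j--
    j for2snd
exit2:

Calling swap: save the original argument registers first, then overwrite them.

# pass the arguments
move $s2,$a0		# save a0 (base address of the array) in s2
move $s3,$a1		# save a1 (n) in s3
move $a0,$s2		# first argument of swap: the array
move $a1,$s1		# second argument of swap: j

jal swap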

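Finally, before putting everything together, we add code at the beginning and end to save and restore the registers. We use s0-s3, so we save s0-s3 and ra.

sort:
	addi $sp,$sp,-20		# make room for five words
	sw   $ra,16($sp)		# save ra and s0-s3
	sw   $s3,12($sp)
	sw   $s2,8($sp)
	sw   $s1,4($sp)
	sw   $s0,0($sp)
	
	move $s2,$a0			# s2 = base address of the array
	move $s3,$a1			# s3 = n; copy the arguments whether or not swap gets called
	
	move $s0,$zero			# i = 0
    for1tst:
        slt  $t0,$s0,$s3		# t0 = 1 if i < n (s3, because a1 is overwritten below)
        beq  $t0,$zero,exit1
        
        addi $s1,$s0,-1			# j = i - 1
        for2snd:
            slti $t0,$s1,0
            bne  $t0,$zero,exit2	# leave the inner loop when j < 0
            sll  $t1,$s1,2
            add  $t2,$t1,$s2		# t2 = address of v[j]
            lw   $t3,0($t2)
            lw   $t4,4($t2)
            slt  $t0,$t4,$t3
            beq  $t0,$zero,exit2	# leave the inner loop when v[j] <= v[j+1]
            
            move $a0,$s2			# arguments of swap: the array and j
            move $a1,$s1
            jal swap
            
            addi $s1,$s1,-1			# j--
            j for2snd
        exit2:
            addi $s0,$s0,1			# i++
            j    for1tst
    exit1:
    
	lw   $ra,16($sp)		# restore ra and s0-s3
	lw   $s3,12($sp)
	lw   $s2,8($sp)
	lw   $s1,4($sp)
	lw   $s0,0($sp)
	addi $sp,$sp,20
	
	jr   $ra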
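For comparison, the control flow of the assembly corresponds to the following high-level sketch. It is written in Python for brevity; the variable names i, j, v, and n are assumed to match the roles of s0, s1, s2, and s3 above.

```python
def swap(v, k):
    """Exchange v[k] and v[k+1], as the swap procedure does."""
    v[k], v[k + 1] = v[k + 1], v[k]

def sort(v, n):
    """Sort the first n elements of v in place, mirroring the two loops."""
    i = 0
    while i < n:                          # for1tst: slt / beq exit1
        j = i - 1
        while j >= 0 and v[j] > v[j + 1]: # for2snd: both exit2 tests
            swap(v, j)
            j -= 1
        i += 1

data = [5, 2, 9, 1, 5, 6]
sort(data, len(data))
```

The two conditions of the inner while loop are exactly the two branches to exit2: one for j < 0 and one for elements that are already in order.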

The call to swap could be optimized by inlining, that is, expanding the swap body in place instead of calling the function, which removes the call overhead. But inlining increases code size, and if the cache miss rate rises as a result, the loss outweighs the gain.

Also, in practice MIPS compilers always reserve room on the stack for the four argument registers, decrementing $sp by 16. One reason is that C has a vararg option, which allows a pointer to pick out an individual argument.

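Arrays and pointers

With array indexing, the address of an element is the index multiplied by 4, added to the base address of the array; with a pointer, the pointer starts at the base address and is advanced by 4 for each element.

The following program examples are two implementations of array clearing:

[Figure: array-clearing code, one version using array indices and one using pointers]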

Comparison with other instruction sets

  1. ARM architecture:
    • ARM (Advanced RISC Machines) is a reduced instruction set computer (RISC) architecture widely used in mobile devices, embedded systems, and microcontrollers.
    • ARM designs focus on energy efficiency and low power consumption, so they excel in mobile devices.
    • The ARM architecture has several versions and variants, including ARMv7, ARMv8, etc., where ARMv8 introduced a 64-bit execution mode (AArch64).
  2. x86 architecture:
    • The x86 architecture is a Complex Instruction Set Computer (CISC) architecture primarily used in personal computers, servers, and desktop systems.
    • Representative processors of the x86 architecture include Intel's Pentium, Core i series, and AMD's Ryzen series.
    • The x86 architecture provides high flexibility in performance and functionality for a wide range of computing tasks.
  3. MIPS architecture:
    • MIPS (Microprocessor without Interlocked Pipeline Stages) is a reduced instruction set computer (RISC) architecture that was widely used in workstations and embedded systems in the early days.
    • MIPS design focuses on simplifying the instruction set to improve execution efficiency.
    • Although its use has gradually declined in some areas, the MIPS architecture is still used in some embedded systems, networking equipment, and controllers.

by gpt

For example, comparison and conditional branching on ARM: MIPS stores the result of a comparison in a register, while ARM records it in condition codes: negative, zero, carry, and overflow. A CMP instruction, for instance, subtracts one register from another and sets the condition codes from the result.

In this way ARM gets by with half as many registers (16) as MIPS. In addition, ARM instructions can be executed conditionally on the condition codes, so instructions that should not run are simply skipped, saving code space and running time.

ARM immediates: a 12-bit field split 4+8. The 8-bit value is rotated right by twice the 4-bit amount.
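A sketch of how such a 4+8 immediate expands to a 32-bit value, assuming the classic ARM data-processing encoding (an 8-bit value rotated right by twice the 4-bit rotate field). The function name is invented for this note.

```python
def arm_immediate(imm8, rot4):
    """Expand an ARM 12-bit immediate: imm8 rotated right by 2 * rot4 bits."""
    imm8 &= 0xFF
    shift = 2 * (rot4 & 0xF)
    # Rotate right within 32 bits: low bits that fall off reappear at the top.
    return ((imm8 >> shift) | (imm8 << (32 - shift))) & 0xFFFFFFFF

plain = arm_immediate(0xFF, 0)      # no rotation
high_byte = arm_immediate(0xFF, 4)  # rotate right by 8: the byte lands in bits 31..24
wrapped = arm_immediate(0xFF, 1)    # rotate right by 2: two bits wrap to the top
```

With only 12 bits, this scheme can express constants such as 0xFF000000 that a plain 12-bit immediate could not.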

ARM also supports shifting the second register operand as part of an instruction, so the same number of bits can express more values.

ARM also supports operations on groups of registers: a 16-bit mask in the instruction selects which of the 16 registers to load or store.

The main drawbacks of x86 are its limited addressing range and the efficiency cost of its CISC design.

In general, MIPS instructions are regular and uniform in length. Just 32 registers are used to keep the hardware fast, and the instruction formats are shaped by ideas such as making the common case fast.

Origin blog.csdn.net/jtwqwq/article/details/132369334