2020-10-13

Computer Architecture

Introduction Take notes for the computer structure and write down the knowledge points in each chapter

Chapter 1

Main formula

CPU_TIME

CPU_Time
CPU TIME = execution time
CPU TIME = instructions * CPI * cycle time
CPU TIME = (instructions * CPI) / clock rate
CPU TIME = clock cycles / clock rate
CPU TIME = (Σ(instructions * CPI)) / clock rate

Instruction Count & CPI

instructions = clock cycles / cycles per instruction
instructions = CPU TIME / CPI / cycle time
instructions = CPU TIME * clock rate / CPI

clock cycles = Σ(instructions * CPI)
clock cycles = instructions * CPI

CPI = clock cycles / instructions
CPI = Σ ( CPI_i * (instructions_i / instructions))

BIG Picture

Instruction

Pitfall: Amdahl’s Law

Instruction

MIPS & IPS

IPS = (instructions / CPU_TIME) = clock_rate / CPI

MIPS = instructions / (execution_time * 10 ^ 6)
MIPS = (instructions) / ((instructions * CPI / clock_rate) * 10 ^ 6)
MIPS = clock_rate / (CPI * 10 ^ 6)

Example

Instruction

Chapter 2

RISC-V Instruction

R-Type(regular)

add x9, x20, x21 => x9 = x20 + 21

f7	rs2	rs1	f3	rd	op
0(add)	21	20	0	9	51(R)

sub x1, x2, x3 => x1 = x2 - x3

f7	rs2	rs1	f3	rd	op
32(sub)	3	2	0	1	51(R)

I-type(Immediate)

Immediate field = funct7 + rs2 : 12 bits, holds a constant

ld x9 240(x22) => x9 = mem[x22 + 240]

imm	rs1	f3	rd	op
240	22	3	9	3(ld)

addi X8, X9, 100 => x8 = x9 + 100

imm	rs1	f3	rd	op
100	9	3	8	addi

S-type(sd)

sd x1, 1000(x2) => mem(1000 + x2) = x1

f7	rs2	rs1	f3	rd	op
31	1	2	3	8	35(sd)

1000 = 512+256+128+64+32+8 = 0011111(f7) 01000(rd)

SB-type(branch)

bne x10, x11, 2000. if x 10!= x11 take the branch

f7	rs2	rs1	f3	rd	op
0(1bit) 62(6bit)	11	10	001	8(4bit) 0(1bit)	103(bne)

2000 = 0 111110 1000 0

UJ-type(jal)

jal x0 2000 => (1) x0 = PC + 4 (2) PC = PC + 2000
f7 + rs2 + rs1 + f3 = imm[20] + imm[10 : 1] + imm[11] + imm[19:12]

f7	rs2	rs1	f3	rd	op
0	1111101000	0	0	0	1101111(jal)

2000 = 0 00000000 0 1111101000 0

Summary

Example

Instruction	Format	f7	rs2	rs1	f3	rd	op
add x15, x16, x17	R	add	17	16	0	15	R
sub x4, x5, x6	R	sub	6	5	0	4	R
addi x8, x3, 20	I	20		3	xxx	8	addi
sd x17, 64(x21)	S	2	17	21	xxx	0	sd
ld x9, 8(x12)	I	8		12	xxx	9	ld
beq x2, x3, 2000	SB	xxx	3	2	xxx	xxx	beq
jal x16, 1000	UJ	1000				16	jal
addi x21, x23, 8	I	8		23	xxx	21	addi

Addressing Mode

Branch address
beq x1, x2, offset => if (x1 = x2) goto brach address
Format of brach address:
1. Offset in binary, 13 bits:imm[12:0]
2. Replace the least significant bit by 0(least significant bit is the most right bit)
3. Branch address = PC + imm[12:0]0(add one more zero in the right, the total length is 14)

Example

Instruction

Chapter 4

functional units

DataPath

R-type

Basic Steps

fetch instructions
select registers(rs1, rs2)
ALU operations on two data, need ALU
write back registers

Example
add x3, x4, x5 => x3 = x4 + x5

f7	rs2	rs1	f3	rd	op
add	5	4	0	3	R

I-type(Immediate)

Basic Steps

fetch instructions
select registers(rs1)
calculate address, need ALU
access memory(read memory)
write register file(rd)

Example
ld x3 64(x4) => x3 = mem[x4 + 64]

imm	rs1	f3	rd	op
64	4	xxx	3	addi

S-type(sd)

Basic Steps

fetch instructions
select two register(rs1, rs2)
calculate address, need ALU
access memory(write memory)

Example
sd x3 64(x4) => mem[x4 + 64] = x3

f7	rs2	rs1	f3	rd	op
2	3	4	xxx	0	sd

SB-type(branch)

Basic Steps

fetch instructions
select registers
test condition, calculate branch address(need additional ALU)

Example
beq x1, x2, offset => if x1 = x2, pc = offset + pc + 4

f7	rs2	rs1	f3	rd	op
xxx	2	1	xxx	xxx	beq

UJ-type(jal)

Example
jal x0 2000 => (1) x0 = PC + 4 (2) PC = PC + 2000

Single Datapath for Single Cycle

Summary

Data Memory is only used during load and store.

Function code of ALU Control

ALU used for

Load/Store: F = add
Branch: F = subtract
R-type: F depends on f7,f3

ALU control	Function
0000	AND
0001	OR
0010	add
0110	subtract

truth table

Summary of execution Time Analysis

optimization(replace & eliminated control line)

cycle time example

PipeLine

Basic idea of pipeline

5 steps: IF(instruction fetch), ID(Instruction decoding), EX(execution), MEM(memory access), WB(write back).
For the current step, hardware of other steps are idle.
Use the idle hardware to work on other instructions

Overlap the instructions executions
Add buffers to hold partial results of instruction executions

RISC-V Pipelined Datapath

Pipeline Control

Example add x4, x2, x3

Execution time of n instructions

5 + n - 1 = n + 4, the first instruction is 5 cycle time, the remaining instruciton is 1 cycle time.

Speedup

(5 * n) / (n + 4) ≈ 5

CPI = cycle time / instruction = (n + 4) / n ≈ 1

Buffer size

IF/ID: 32(inst) + 64(PC) = 96
ID/EX: 2+3+3+64*4+4+5 = 273
EX/MEM: 2+3+1+64*3+5 = 203
MEM/WB: 2+5+64*2 = 135

Example: How the five instructions go through the pipe, and what are the buffer values

Data Hazards

WAW,WAR won’t cause Data Hazards, because write is the last process of instruction, write always after the read.

Handle Data Hazards

Compiler approach

insert dummy operations or “nops”
reschedule instructions
Hardware approach

forwarding

Example

How to identify all dependence

Find all instructions that write registers
For each such instruction writing register x_j, identify all other instructions that either read or write register x_j
Data dependence graph(DDG)

Forwarding

basic forwarding path

Detecting the need to Forward

But only if forwarding instructions will write to a register!

EX/MEM.RegWrite, MEM/WB.RegWrite
Only if Rd for that instruction is not 0

EX/MEM.RegisterRd!=0
MEM/WB.RegisterRd!=0

summary-forwarding conditions

A Bug: Double Data Hazard

Consider the sequence:

add x1, x1, x2
add x1, x1, x3
add x1, x1, x4
Both hazards occur

Want to use the most recent
Revise MEM hazard condition

Only forward if EX hazard condition isn’t true

Load-use Hazard, can’t forward

Harzard Detection

check if instruction x1 in EX and instruction in ID have Load Harzard.
ALU operand register numbers in ID stage are given by

IF/ID.RegisterRs1, IF/ID.RegisterRs2
Load-use harzard when

ID/EX.MemRead and
((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
(ID/EX.RegisterRd = IF/ID.RegisterRs2))
if detected, stall and insert bubble

How to Stall the Pipeline

Force control values in ID/EX register to 0

EX,MEM and WB do nop(no-operation)
Prevent update of PC and IF/ID register
Using instruction is decoded again
Following instruction is fetched again
1-cycle stall allows MEM to read data for ld
```
  - Can subsequently forward to EX stage
```

Branch Hazards

Handle Branch Hazards

just lose 1 cycle when branch is taken

Flushing instrucion summary

Performance Analysis – Individual CPI

Basic pipeline

Advanced pipeline

Example

Static Dual Issue

Example

Powerful project

Loop unrolling

Replicate loop body to expose more parallelism

Reduces loop-control overhead
Example enroll by 2,and 3

Concluding remarks

ISA influences design of datapath and control
Pipelining improves instruction throughput using parallelism

More instructions completed persecond
Latency for each instruction not reduced
Hazards: structural, data, control

Chapter 5

Memory hierarchy

Goal of Memory hierarchy

speed of highest level (cache)
size of lowest level (disk)

Principle of locality

Program access a relative small portion of their address space — if an item is referenced.

it will tend to be referenced again soon – temporal locality
nearby items will tend to be referenced soon – spatial locality

Big picture

Address mapping

Direct address mapping

Lower-order 3 bits of M.address = C.index

Memory locations with the same lower-order 3 bits share the same cache
location-due to spatial locality
The block in cache to be replaced is fixed
How do I know if the data is what I want?

Tags: consist of higher-order bits of memory addresses

Example

Blocksize = 2

Example

Summary

Write through

Write back

Handle cache miss

Performance

Performance Example

Cache Misses: 3 C’s

Complusory miss(cold start misses): A cache miss caused by the 1^st access to a block that has never been in the cache.
Capacity miss: A cache miss that occurs because of the cache size.
Conflict miss(collision miss): A cache miss that occurs when multiple blocks compete for the same set while other sets are available.

Average Memory Access Time

Set Associative Cache–Motivation

Fully associative

Allow a given block to go in any cache entry
Requires all entries to be searched at once
Comparator per entry(expensive)
n-way set associative

Each set contains n entries
Block number determines which set: (Block number) modulo(#Sets in cache)
Search all entries in a given set at once
n comparators(less expensive)

Comparison of 2 caches with 8 words

Miss ratio with associativity

Increase associativity, decrease miss ratio, increase the return time(because it need to search all tag)

Multi-cache-level

Virtual Memory

Addressing Mapping & Address Translation

Address mapping–fully associative:

a page can go any page frame in physical memory
Address translation

by page table: tells if a page is in physical memory; if it is in, provides the physical address
each program has a page table
page table stored in main memory
Page table register: pointing to the page table