Computer Architecture

Introduction: notes on computer organization, recording the key points of each chapter.

Chapter 1

Main formula

CPU_TIME

CPU_TIME = execution time
CPU_TIME = instructions * CPI * cycle_time
CPU_TIME = (instructions * CPI) / clock_rate
CPU_TIME = clock_cycles / clock_rate
CPU_TIME = Σ(instructions_i * CPI_i) / clock_rate
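
To make the relationships concrete, here is a minimal Python sketch (the workload numbers are made up for illustration) that computes CPU time from instruction count, CPI, and clock rate, and checks that the cycle-time and clock-rate forms agree:

```python
# Hypothetical workload: 2 billion instructions, average CPI of 1.5, 2 GHz clock.
instruction_count = 2_000_000_000
cpi = 1.5
clock_rate = 2e9             # cycles per second
cycle_time = 1 / clock_rate  # seconds per cycle

clock_cycles = instruction_count * cpi
cpu_time_v1 = instruction_count * cpi * cycle_time   # instructions * CPI * cycle time
cpu_time_v2 = clock_cycles / clock_rate              # clock cycles / clock rate

assert abs(cpu_time_v1 - cpu_time_v2) < 1e-9
print(f"CPU time = {cpu_time_v1:.2f} s")             # 1.50 s
```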

Instruction Count & CPI

instructions = clock_cycles / CPI
instructions = CPU_TIME / (CPI * cycle_time)
instructions = CPU_TIME * clock_rate / CPI

clock_cycles = Σ(instructions_i * CPI_i)
clock_cycles = instructions * CPI

CPI = clock_cycles / instructions
CPI = Σ(CPI_i * (instructions_i / instructions))
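
A minimal sketch of the weighted-CPI formula; the instruction classes and counts below are made up for illustration:

```python
# Hypothetical instruction mix: (class, instruction count, CPI of that class).
mix = [("ALU", 5_000_000, 1.0),
       ("load/store", 2_000_000, 2.0),
       ("branch", 1_000_000, 1.5)]

total_instructions = sum(count for _, count, _ in mix)
clock_cycles = sum(count * cpi for _, count, cpi in mix)   # Σ(instructions_i * CPI_i)

# Average CPI as a weighted sum of per-class CPIs.
average_cpi = sum(cpi * (count / total_instructions) for _, count, cpi in mix)
assert abs(average_cpi - clock_cycles / total_instructions) < 1e-12
print(f"average CPI = {average_cpi:.4f}")   # 1.3125
```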

BIG Picture


Pitfall: Amdahl’s Law

Pitfall: expecting the improvement of one aspect of a computer to increase overall performance by an amount proportional to the size of the improvement.

T_improved = T_affected / improvement_factor + T_unaffected
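
A minimal sketch of Amdahl's Law with made-up numbers, showing why the overall speedup is limited by the part that is not improved:

```python
# Hypothetical program: 80 s of its 100 s run time is affected by the improvement.
t_total = 100.0
t_affected = 80.0
t_unaffected = t_total - t_affected

for factor in (2, 10, 1_000_000):          # improvement factor of the affected part
    t_improved = t_affected / factor + t_unaffected
    print(f"factor {factor:>7}: new time {t_improved:6.2f} s, "
          f"overall speedup {t_total / t_improved:.2f}x")
# Even an enormous factor cannot push the overall speedup past t_total / t_unaffected = 5x.
```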

MIPS & IPS

IPS = instructions / CPU_TIME = clock_rate / CPI

MIPS = instructions / (execution_time * 10^6)
MIPS = instructions / ((instructions * CPI / clock_rate) * 10^6)
MIPS = clock_rate / (CPI * 10^6)
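
A minimal sketch with made-up machine parameters, checking that the two MIPS forms agree:

```python
# Hypothetical machine: 2 GHz clock, average CPI of 2.
clock_rate = 2e9
cpi = 2.0
instructions = 1_000_000_000

execution_time = instructions * cpi / clock_rate          # seconds
mips_from_time = instructions / (execution_time * 10**6)  # instructions / (time * 10^6)
mips_from_rate = clock_rate / (cpi * 10**6)               # clock_rate / (CPI * 10^6)

assert abs(mips_from_time - mips_from_rate) < 1e-6
print(f"MIPS = {mips_from_rate:.0f}")   # 1000
```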

Example


Chapter 2

RISC-V Instruction

R-Type(regular)

add x9, x20, x21 => x9 = x20 + x21

| f7 | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|
| 0(add) | 21 | 20 | 0 | 9 | 51(R) |

sub x1, x2, x3 => x1 = x2 - x3

| f7 | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|
| 32(sub) | 3 | 2 | 0 | 1 | 51(R) |

I-type(Immediate)

Immediate field replaces funct7 + rs2: 12 bits that hold a constant

ld x9, 240(x22) => x9 = mem[x22 + 240]

| imm | rs1 | f3 | rd | op |
|---|---|---|---|---|
| 240 | 22 | 3 | 9 | 3(ld) |

addi x8, x9, 100 => x8 = x9 + 100

| imm | rs1 | f3 | rd | op |
|---|---|---|---|---|
| 100 | 9 | 0 | 8 | addi |

S-type(sd)

sd x1, 1000(x2) => mem[x2 + 1000] = x1

| f7 slot (imm[11:5]) | rs2 | rs1 | f3 | rd slot (imm[4:0]) | op |
|---|---|---|---|---|---|
| 31 | 1 | 2 | 3 | 8 | 35(sd) |

1000 = 512+256+128+64+32+8 = 0011111 (imm[11:5] = 31, in the f7 slot) 01000 (imm[4:0] = 8, in the rd slot)
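
A minimal sketch of the S-type immediate split, reproducing the 31 / 8 values above (the helper name is just for illustration):

```python
def split_s_immediate(imm: int) -> tuple[int, int]:
    """Split a 12-bit S-type immediate into the f7 slot (imm[11:5]) and rd slot (imm[4:0])."""
    assert 0 <= imm < 4096, "sketch only handles non-negative 12-bit immediates"
    return imm >> 5, imm & 0b11111

hi, lo = split_s_immediate(1000)
print(hi, lo)   # 31 8 -> matches the sd x1, 1000(x2) encoding above
```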

SB-type(branch)

bne x10, x11, 2000 => if x10 != x11, take the branch

| f7 slot (imm[12], imm[10:5]) | rs2 | rs1 | f3 | rd slot (imm[4:1], imm[11]) | op |
|---|---|---|---|---|---|
| 0(1bit), 62(6bit) | 11 | 10 | 001 | 8(4bit), 0(1bit) | 99(bne) |

2000 in the instruction fields: 0(imm[12]) 111110(imm[10:5]) 1000(imm[4:1]) 0(imm[11]); imm[0] = 0 is not stored

UJ-type(jal)

jal x0, 2000 => (1) x0 = PC + 4 (2) PC = PC + 2000
The imm field spans the f7 + rs2 + rs1 + f3 slots, ordered imm[20] | imm[10:1] | imm[11] | imm[19:12]

| imm[20] | imm[10:1] | imm[11] | imm[19:12] | rd | op |
|---|---|---|---|---|---|
| 0 | 1111101000 | 0 | 0 | 0 | 1101111(jal) |

2000 = 0 00000000 0 1111101000 0 (grouped as imm[20] | imm[19:12] | imm[11] | imm[10:1] | imm[0])
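
A minimal sketch of the SB and UJ immediate scrambling, reproducing the field values above for an offset of 2000 (the bits helper is just for illustration):

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

offset = 2000   # byte offset; bit 0 is always 0 and is not stored

# SB-type: f7 slot = imm[12] | imm[10:5], rd slot = imm[4:1] | imm[11]
print(bits(offset, 12, 12), bits(offset, 10, 5))   # 0 62
print(bits(offset, 4, 1), bits(offset, 11, 11))    # 8 0

# UJ-type: imm field = imm[20] | imm[10:1] | imm[11] | imm[19:12]
print(bits(offset, 20, 20), format(bits(offset, 10, 1), '010b'),
      bits(offset, 11, 11), bits(offset, 19, 12))  # 0 1111101000 0 0
```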

Summary

Example

| Instruction | Format | f7/imm | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|---|---|
| add x15, x16, x17 | R | add | 17 | 16 | 0 | 15 | R |
| sub x4, x5, x6 | R | sub | 6 | 5 | 0 | 4 | R |
| addi x8, x3, 20 | I | imm = 20 | - | 3 | xxx | 8 | addi |
| sd x17, 64(x21) | S | 2 (imm[11:5]) | 17 | 21 | xxx | 0 (imm[4:0]) | sd |
| ld x9, 8(x12) | I | imm = 8 | - | 12 | xxx | 9 | ld |
| beq x2, x3, 2000 | SB | xxx | 3 | 2 | xxx | xxx | beq |
| jal x16, 1000 | UJ | imm = 1000 | - | - | - | 16 | jal |
| addi x21, x23, 8 | I | imm = 8 | - | 23 | xxx | 21 | addi |

Addressing Mode

  • Branch address

    beq x1, x2, offset => if (x1 == x2) go to the branch address
    Format of the branch address:

    1. The offset is a 13-bit signed value in binary: imm[12:0]
    2. The least significant bit (the rightmost bit) is always 0, so only imm[12:1] is stored in the instruction
    3. Branch address = PC + sign-extended imm[12:0] (see the sketch below)
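
A minimal sketch of the branch-target calculation; the PC value is made up, and sign_extend_13 is just an illustrative helper:

```python
def sign_extend_13(imm: int) -> int:
    """Sign-extend a 13-bit immediate (imm[12:0]) to a Python int."""
    return imm - (1 << 13) if imm & (1 << 12) else imm

def branch_target(pc: int, imm13: int) -> int:
    """beq/bne target: PC of the branch plus the sign-extended 13-bit offset (bit 0 is 0)."""
    assert imm13 & 1 == 0, "branch offsets are always even"
    return pc + sign_extend_13(imm13)

print(hex(branch_target(0x1000, 2000)))              # 0x17d0 = 0x1000 + 2000
print(hex(branch_target(0x1000, 0b1111111111110)))   # 0xffe = 0x1000 - 2 (backward branch)
```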

Example


Chapter 4

functional units

DataPath

R-type

Basic Steps

  • fetch the instruction
  • select the source registers(rs1, rs2)
  • ALU operation on the two operands, needs the ALU
  • write the result back to the destination register(rd)

Example
add x3, x4, x5 => x3 = x4 + x5

| f7 | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|
| add | 5 | 4 | 0 | 3 | R |

I-type(Immediate)

Basic Steps

  • fetch the instruction
  • select the source register(rs1)
  • calculate the address, needs the ALU
  • access memory(read memory)
  • write the register file(rd)

Example
ld x3, 64(x4) => x3 = mem[x4 + 64]

| imm | rs1 | f3 | rd | op |
|---|---|---|---|---|
| 64 | 4 | xxx | 3 | ld |

S-type(sd)

Basic Steps

  • fetch the instruction
  • select two source registers(rs1, rs2)
  • calculate the address, needs the ALU
  • access memory(write memory)

Example
sd x3, 64(x4) => mem[x4 + 64] = x3

| f7 | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|
| 2 | 3 | 4 | xxx | 0 | sd |

SB-type(branch)

Basic Steps

  • fetch the instruction
  • select the source registers
  • test the condition and calculate the branch address, needs an additional ALU

Example
beq x1, x2, offset => if x1 == x2, PC = PC + offset

| f7 | rs2 | rs1 | f3 | rd | op |
|---|---|---|---|---|---|
| xxx | 2 | 1 | xxx | xxx | beq |

UJ-type(jal)

Example
jal x0, 2000 => (1) x0 = PC + 4 (2) PC = PC + 2000

Single Datapath for Single Cycle



Summary

Data Memory is only used during load and store.

Function code of ALU Control

ALU used for

  • Load/Store: F = add
  • Branch: F = subtract
  • R-type: F depends on f7, f3

| ALU control | Function |
|---|---|
| 0000 | AND |
| 0001 | OR |
| 0010 | add |
| 0110 | subtract |

truth table

Summary of execution Time Analysis

optimization(replace & eliminate control lines)

cycle time example

PipeLine

Basic idea of pipeline

  • 5 steps: IF(instruction fetch), ID(instruction decode), EX(execute), MEM(memory access), WB(write back).

  • While one instruction occupies a given step, the hardware of the other steps is idle.

  • Use the idle hardware to work on other instructions

    Overlap the instructions executions
    Add buffers to hold partial results of instruction executions

RISC-V Pipelined Datapath

Pipeline Control

Example add x4, x2, x3

  • Execution time of n instructions

5 + (n - 1) = n + 4: the first instruction takes 5 cycles to complete; each remaining instruction finishes 1 cycle later.

  • Speedup

(5 * n) / (n + 4) ≈ 5

  • CPI = clock cycles / instructions = (n + 4) / n ≈ 1 (a short sketch follows)
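
A minimal sketch of these three formulas for a 5-stage pipeline; the instruction counts are only samples:

```python
def pipeline_metrics(n: int, stages: int = 5) -> tuple[int, float, float]:
    """Cycles, speedup over a non-pipelined machine, and CPI for n instructions."""
    cycles = stages + (n - 1)          # first instruction fills the pipe, then 1 per cycle
    speedup = (stages * n) / cycles    # non-pipelined would take stages * n cycles
    cpi = cycles / n
    return cycles, speedup, cpi

for n in (5, 100, 1_000_000):
    cycles, speedup, cpi = pipeline_metrics(n)
    print(f"n={n:>9}: cycles={cycles}, speedup={speedup:.3f}, CPI={cpi:.3f}")
# As n grows, the speedup approaches 5 and the CPI approaches 1.
```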

Buffer size

  • IF/ID: 32(instruction) + 64(PC) = 96 bits
  • ID/EX: 2 + 3 + 3 + 64*4 + 4 + 5 = 273 bits
  • EX/MEM: 2 + 3 + 1 + 64*3 + 5 = 203 bits
  • MEM/WB: 2 + 5 + 64*2 = 135 bits

Example: How the five instructions go through the pipe, and what are the buffer values


Data Hazards



WAW and WAR do not cause data hazards in this pipeline: the register write is the last step of an instruction and writes happen in program order, so a later instruction's write always comes after the earlier instructions' reads.

Handle Data Hazards

  • Compiler approach

    insert dummy operations or “nops”
    reschedule instructions

  • Hardware approach

    forwarding

Example

How to identify all dependence

  • Find all instructions that write registers
  • For each such instruction writing register xj, identify all other instructions that either read or write register xj
  • Data dependence graph(DDG)

Forwarding



basic forwarding path

Detecting the need to Forward

  • But only forward if the forwarding instruction will actually write to a register!

    EX/MEM.RegWrite, MEM/WB.RegWrite

  • Only if Rd for that instruction is not x0 (x0 is hard-wired to zero and should never be forwarded)

    EX/MEM.RegisterRd!=0
    MEM/WB.RegisterRd!=0

summary-forwarding conditions

A Bug: Double Data Hazard

  • Consider the sequence:

    add x1, x1, x2
    add x1, x1, x3
    add x1, x1, x4

  • Both hazards occur

    Want to use the most recent

  • Revise MEM hazard condition

    Only forward from MEM/WB if the EX hazard condition isn't true (a short sketch of both forwarding conditions follows)
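
A minimal sketch of the forwarding conditions above for ALU operand A (operand B is symmetric); the pipeline-register fields are modeled as plain dictionaries just for illustration:

```python
def forward_a(id_ex_rs1: int, ex_mem: dict, mem_wb: dict) -> int:
    """Forwarding select for operand A: 0b00 = register file, 0b10 = EX/MEM, 0b01 = MEM/WB."""
    ex_hazard = (ex_mem["RegWrite"] and ex_mem["Rd"] != 0
                 and ex_mem["Rd"] == id_ex_rs1)
    if ex_hazard:
        return 0b10                              # forward the ALU result from EX/MEM
    if (mem_wb["RegWrite"] and mem_wb["Rd"] != 0
            and mem_wb["Rd"] == id_ex_rs1):      # only when the EX hazard is not true
        return 0b01                              # forward from MEM/WB
    return 0b00

# add x1,x1,x2 ; add x1,x1,x3 ; add x1,x1,x4 -> the third add must take x1 from EX/MEM
print(forward_a(1, {"RegWrite": True, "Rd": 1}, {"RegWrite": True, "Rd": 1}))   # 2 (0b10)
```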

Load-use Hazard, can’t forward


Hazard Detection

  • Check whether the load instruction in the EX stage and the instruction in the ID stage form a load-use hazard.

  • ALU operand register numbers in ID stage are given by

    IF/ID.RegisterRs1, IF/ID.RegisterRs2

  • Load-use hazard when

    ID/EX.MemRead and
    ((ID/EX.RegisterRd = IF/ID.RegisterRs1) or
    (ID/EX.RegisterRd = IF/ID.RegisterRs2))

  • if detected, stall and insert a bubble (a short sketch of the detection condition follows)
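
A minimal sketch of the load-use detection condition above; the ID/EX fields are modeled as a dictionary, and Rs1/Rs2 come from the instruction sitting in IF/ID:

```python
def load_use_stall(id_ex: dict, if_id_rs1: int, if_id_rs2: int) -> bool:
    """True when the load in EX will write a register that the instruction in ID reads next."""
    return id_ex["MemRead"] and id_ex["Rd"] in (if_id_rs1, if_id_rs2)

# ld x2, 0(x5) in EX followed by add x4, x2, x5 in ID -> a one-cycle stall is needed
print(load_use_stall({"MemRead": True, "Rd": 2}, 2, 5))   # True
```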

How to Stall the Pipeline

  • Force control values in ID/EX register to 0

    EX, MEM and WB do a nop(no-operation)

  • Prevent update of PC and IF/ID register

    The using instruction is decoded again
    The following instruction is fetched again
    The 1-cycle stall allows MEM to read the data for the ld

      - Can subsequently forward to EX stage

Branch Hazards

Handle Branch Hazards

Only 1 cycle is lost when the branch is taken.

Flushing instruction summary

Performance Analysis – Individual CPI

Basic pipeline

Advanced pipeline

Example

Static Dual Issue

Example

Powerful project

Loop unrolling

  • Replicate loop body to expose more parallelism

    Reduces loop-control overhead

  • Example: unroll by 2 and by 3

Concluding remarks

  • ISA influences design of datapath and control

  • Pipelining improves instruction throughput using parallelism

    More instructions completed per second
    Latency for each instruction not reduced

  • Hazards: structural, data, control

Chapter 5

Memory hierarchy


Goal of Memory hierarchy

  • speed of highest level (cache)
  • size of lowest level (disk)

Principle of locality

Programs access a relatively small portion of their address space at any moment. If an item is referenced:

  • it will tend to be referenced again soon – temporal locality
  • nearby items will tend to be referenced soon – spatial locality

Big picture

Address mapping

Direct address mapping

  • Lower-order 3 bits of M.address = C.index

    Memory locations with the same lower-order 3 bits share the same cache location; indexing by the low-order bits works together with spatial locality, since neighboring addresses map to different cache entries

  • The block in cache to be replaced is fixed

  • How do I know if the data is what I want?

    Tags: consist of the higher-order bits of memory addresses (a short sketch follows)
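
A minimal sketch of splitting a block address into tag and index for the 8-entry, one-word-block cache described above (the addresses are only examples):

```python
NUM_ENTRIES = 8          # direct-mapped cache with 8 one-word blocks
INDEX_BITS = 3           # log2(8): the lower-order 3 bits of the block address

def tag_and_index(block_address: int) -> tuple[int, int]:
    """Split a memory block address into (tag, cache index)."""
    index = block_address & (NUM_ENTRIES - 1)   # lower-order 3 bits
    tag = block_address >> INDEX_BITS           # remaining higher-order bits
    return tag, index

for addr in (0b10110, 0b11110):   # both end in 110 -> both compete for index 6
    tag, index = tag_and_index(addr)
    print(f"address {addr:05b}: index {index}, tag {tag:02b}")
```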

Example

Blocksize = 2

Example

Summary

Write through

Write back

Handle cache miss

Performance

Performance Example

Cache Misses: 3 C’s

  • Compulsory miss(cold-start miss): a cache miss caused by the 1st access to a block that has never been in the cache.
  • Capacity miss: a cache miss that occurs because the cache cannot contain all the blocks the program needs, even if it were fully associative.
  • Conflict miss(collision miss): a cache miss that occurs when multiple blocks compete for the same set while other sets are available.

Average Memory Access Time
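
Average memory access time combines the hit time with the miss penalty weighted by the miss rate: AMAT = hit_time + miss_rate * miss_penalty. A minimal sketch with made-up cache parameters:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty (all in cycles here)."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical L1 cache: 1-cycle hit, 5% miss rate, 100-cycle penalty to main memory.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))   # 6.0 cycles
```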

Set Associative Cache–Motivation

  • Fully associative

    Allow a given block to go in any cache entry
    Requires all entries to be searched at once
    Comparator per entry(expensive)

  • n-way set associative

    Each set contains n entries
    Block number determines which set: (Block number) modulo(#Sets in cache)
    Search all entries in a given set at once
    n comparators(less expensive)

Comparison of 2 caches with 8 words

Miss ratio with associativity


Increasing associativity decreases the miss ratio but increases the access (hit) time, because all the tags in a set must be searched.

Multi-cache-level

Virtual Memory

Addressing Mapping & Address Translation

  • Address mapping–fully associative:

    a page can go into any page frame in physical memory

  • Address translation

    by page table: tells whether a page is in physical memory; if it is, provides the physical address
    each program has a page table
    page table stored in main memory

  • Page table register: points to the page table (a short sketch of the translation follows)
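
A minimal sketch of page-table translation; 4 KiB pages are assumed and the tiny page table is made up:

```python
PAGE_SIZE = 4096   # assume 4 KiB pages

# Hypothetical page table: virtual page number -> (valid bit, physical page frame number).
page_table = {0: (True, 7), 1: (True, 3), 2: (False, None)}

def translate(vaddr: int) -> int:
    """Translate a virtual address to a physical address, or raise on a page fault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    valid, pfn = page_table.get(vpn, (False, None))
    if not valid:
        raise RuntimeError(f"page fault on virtual page {vpn}")  # the OS brings the page in
    return pfn * PAGE_SIZE + offset      # same page offset, new page frame

print(hex(translate(0x1234)))   # VPN 1 -> frame 3: 0x3234
```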

Page Table

Replacement Policy

Write

TLB–cache of page table

process of TLB

Example TLB translation

Chapter 3

Normalized Number in Binary

Represent Normalized Number in Computer IEEE


Why a bias

  • If we used 2's complement for the exponent, it would not be good for sorting and comparison; with a bias, exponents can be compared like unsigned numbers (a short sketch follows)

    0000 0000 is the most negative exponent
    1111 1111 is the most positive exponent
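
A minimal sketch of the single-precision bias (127), showing that biased exponents order the same way as the true exponents when compared as unsigned bit patterns:

```python
BIAS = 127   # IEEE 754 single precision uses an 8-bit exponent with a bias of 127

def biased_exponent(true_exp: int) -> int:
    """Stored exponent field = true exponent + bias."""
    assert -126 <= true_exp <= 127, "normalized single-precision exponent range"
    return true_exp + BIAS

for e in (-126, -1, 0, 1, 127):
    print(f"true exponent {e:>4} -> stored field {biased_exponent(e):3d} "
          f"({biased_exponent(e):08b})")
# Larger true exponents give larger unsigned bit patterns, so comparison is straightforward.
```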

Example: floating-point conversion (fraction)

