東京大学情報理工学系研究科コンピュータ科学専攻 2021年8月実施専門科目問題4

Author

Answer the following questions on computer architecture.

(1) When the instruction \(j\) may use the result generated by the instruction \(i\), we say there is a data dependency from the instruction \(j\) to the instruction \(i\), and write \(j \rightarrow i\). Give all data dependencies in the program below. The behavior of each instruction is described as a comment (the description after #) in the program. The processor which executes the program has 32 registers from \(x0\) to \(x31\), and \(x0\) is the zero register that always keeps the value 0. In the comment, we represent a memory access to the address addr on the memory as memory[addr].

instruction 0)        addi x3, x0, 64     # x3 <- x0 + 64
instruction 1)        addi x4, x0, 0      # x4 <- x0 + 0
instruction 2)        addi x5, x0, 0      # x5 <- x0 + 0
instruction 3) Loop:  lw x6, 0 (x4)       # x6 <- memory[x4 + 0]
instruction 4)        add x5, x5, x6      # x5 <- x5 + x6
instruction 5)        addi x4, x4, 4      # x4 <- x4 + 4
instruction 6)        blt x4, x3, Loop    # if x4 < x3, goto Loop
instruction 7)        sw x5, 4096(x0)     # memory[x0 + 4096] <- x5

(2) Consider a 5-stage pipeline processor that issues up to one instruction per clock cycle. The processor consists of 5 stages: instruction fetch (IF) stage, instruction decode and register fetch (ID) stage, execution (EX) stage, memory access (MA) stage, and register write back (WB) stage. The bit width of each register is 32. The processor has the instruction and data memories that can be accessed in one clock cycle, and load-word lw and store-word sw instructions do not stall on the MA stage. If there is a load-use data hazard for lw instruction, the IF, ID, and EX stages are stalled for one clock cycle. The branch instruction blt (branch if less than) stalls the IF and ID stages until the branch result is determined in the EX stage. Thus, the processor does not fetch subsequent instructions for two clock cycles after a branch instruction is fetched. Execution results in the EX stage and load results in the MA stage are properly forwarded to the EX stage.

Explain what the load-use data hazard is. Explain also how load-use data hazards occur when the program in question (1) is executed on the processor.

(3) Calculate the number of clock cycles required for the execution of the program in question (1) on the processor in question (2). Calculate also the average IPC (instructions per cycle) up to two places of decimals.

(4) Using the program in question (1) and the processor in question (2) as an example, explain the mechanism and role of dynamic branch prediction.

回答以下关于计算机体系结构的问题。

(1) 当指令 \(j\) 可能使用由指令 \(i\) 生成的结果时，我们说指令 \(j\) 对指令 \(i\) 存在数据依赖关系，并记作 \(j \rightarrow i\)。列出下面程序中的所有数据依赖关系。每条指令的行为在程序中的注释（# 后的描述）中描述。执行该程序的处理器有 32 个寄存器，从 \(x0\) 到 \(x31\)，其中 \(x0\) 是始终保持值为 0 的零寄存器。在注释中，我们将对内存地址 addr 的访问表示为 memory[addr]。

instruction 0) addi x3, x0, 64  # x3 <- x0 + 64
instruction 1) addi x4, x0, 0   # x4 <- x0 + 0
instruction 2) addi x5, x0, 0   # x5 <- x0 + 0
instruction 3) Loop: lw x6, 0(x4)  # x6 <- memory[x4 + 0]
instruction 4) add x5, x5, x6   # x5 <- x5 + x6
instruction 5) addi x4, x4, 4   # x4 <- x4 + 4
instruction 6) blt x4, x3, Loop  # if x4 < x3, goto Loop
instruction 7) sw x5, 4096(x0)  # memory[x0 + 4096] <- x5

(2) 考虑一个 5 级流水线处理器，该处理器每个时钟周期最多发出一条指令。处理器由 5 个阶段组成：指令获取（IF）阶段、指令解码和寄存器获取（ID）阶段、执行（EX）阶段、存储器访问（MA）阶段和寄存器写回（WB）阶段。每个寄存器的位宽为 32 位。处理器具有可以在一个时钟周期内访问的指令和数据存储器，并且加载字（lw）和存储字（sw）指令不会在 MA 阶段停止。如果存在 lw 指令的加载-使用数据冒险，则 IF、ID 和 EX 阶段将停止一个时钟周期。分支指令 blt（如果小于则分支）会使 IF 和 ID 阶段停止，直到在 EX 阶段确定分支结果。因此，处理器在提取分支指令后的两个时钟周期内不会获取后续指令。执行结果在 EX 阶段产生，加载结果在 MA 阶段正确转发到 EX 阶段。

解释什么是加载-使用数据冒险。还解释在处理器上执行问题（1）中的程序时，加载-使用数据冒险是如何发生的。

(3) 计算在问题（2）中的处理器上执行问题（1）中的程序所需的时钟周期数。同时计算平均 IPC（每周期指令数），保留到小数点后两位。

(4) 以问题（1）中的程序和问题（2）中的处理器为例，解释动态分支预测的机制和作用。

Kai

(1)

Data dependencies in the program:

Instruction 3 uses x4 from Instruction 1: 3 → 1
Instruction 4 uses x5 from Instruction 2: 4 → 2
Instruction 4 uses x6 from Instruction 3: 4 → 3
Instruction 6 uses x4 from Instruction 5: 6 → 5
Instruction 6 uses x3 from Instruction 0: 6 → 0
Instruction 7 uses x5 from Instruction 4: 7 → 4
Instruction 3 in iteration \(n\) uses x4 from Instruction 5 in iteration \(n-1\): 3 → 5 (loop dependency)

(2)

A load-use data hazard occurs when a load instruction (lw) is followed immediately by an instruction that uses the loaded value, causing a stall because the subsequent instruction needs the value that is still being fetched.

In the program, this hazard occurs between:

Instruction 3: lw x6, 0(x4)
Instruction 4: add x5, x5, x6

The processor stalls for one cycle to ensure the value in x6 is available for Instruction 4.

(3)

We consider the following:

The loop executes \((x3 - x4) / 4\) times, i.e., \((64 - 0) / 4 = 16\) iterations.
Each iteration includes Instructions 3 to 6.

Instructions 0, 1, 2: - Each executed once. - IF: 1 cycle - ID: 1 cycle - EX: 1 cycle - MA: 1 cycle - WB: 1 cycle

Instructions 3 to 6 per iteration: - IF: 1 cycle - ID: 1 cycle - EX: 1 cycle - MA: 1 cycle - WB: 1 cycle

Stalls per iteration: - Load-use hazard after Instruction 3: 1 cycle - Branch hazard after Instruction 6: 2 cycles

Execution Cycles: - Instructions 0, 1, 2: 3 cycles - Each iteration: 7 cycles (5 cycles + 1 cycle load-use hazard + 2 cycles branch hazard) - 16 iterations: \(16 \times 7 = 112\) cycles - Final instruction 7: 5 cycles

Total clock cycles: \(3 + 112 + 5 = 120\) cycles

Average IPC Calculation: - Total instructions: \(3 + 16 \times 4 + 1 = 68\) - IPC: \(\frac{68}{120} \approx 0.57\)

Pipeline Execution Example

Below is the Gantt chart for the first few cycles of the program execution:

Cycle	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Inst0	IF	ID	EX	MA	WB
Inst1		IF	ID	EX	MA	WB
Inst2			IF	ID	EX	MA	WB
Inst3				IF	ID	EX	MA	WB	ST	ST	IF (After `blt` EX)	ID	EX	MA	WB
Inst4					ST (L-U Haz.)	IF	ID	EX	MA	WB
Inst5							IF	ID	EX	MA	WB
Inst6								IF	ID	EX	MA	WB

Key:

IF: Instruction Fetch
ID: Instruction Decode and Register Fetch
EX: Execution
MA: Memory Access
WB: Write Back
ST: Stall

(4)

Dynamic Branch Prediction is a technique used in pipelined processors to predict the outcome of branch instructions to minimize control hazards and reduce pipeline stalls. It helps maintain a steady flow of instructions by making educated guesses about whether a branch will be taken.

Mechanism of Dynamic Branch Prediction

Branch History Table (BHT):
Stores the outcomes of recent branch instructions (taken or not taken).
Speculative Execution:
The processor speculatively executes instructions based on the prediction.
If the prediction is correct, execution continues without interruption.
If incorrect, the speculative instructions are discarded, causing a penalty.

Role of Dynamic Branch Prediction in the Given Program

In the given program, we have a branch instruction:

Instruction 6: blt x4, x3, Loop
This instruction checks if x4 is less than x3 and branches back to the loop start if true.

Example with Branch Prediction:

Initial Iteration:
The processor encounters blt x4, x3, Loop.
It uses the BHT to predict whether the branch will be taken.
Prediction:
Assume the prediction is that the branch will be taken (likely due to the loop).
Speculative Execution:
Instructions from Loop: are speculatively executed.
If the prediction is correct, there is no penalty, and the pipeline stays full.
If the prediction is incorrect, the speculatively executed instructions are discarded, causing a 2-cycle penalty.

Specific Numbers in the Context of the Program With Branch Prediction: - Assume a 90% accuracy in branch prediction. - Mispredictions = 16 iterations * 10% = 1.6 ≈ 2 mispredictions. - Total penalty due to mispredictions = 2 mispredictions * 2 cycles = 4 cycles. - Total clock cycles with prediction: \(120 - 32 + 4 = 92\) cycles (adjusting for reduced penalties).

Impact on IPC With Prediction: - Total cycles = 92. - Total instructions = 68. - IPC = \(\frac{68}{92} \approx 0.74\).

Conclusion:

Dynamic branch prediction significantly improves the performance of the program by reducing the number of stall cycles caused by branch instructions. With a 90% accurate branch predictor, the total execution time is reduced to 92 cycles, resulting in a higher IPC and more efficient pipeline utilization.

知识点

数据冒险动态分支预测流水线

[[流水线#数据冒险 Data Hazards]]

[[流水线#分支预测 Branch Prediction]]

解题技巧和信息

Data Dependency: Identify dependencies by analyzing which instructions use the results of previous instructions.
Load-use Hazard: Recognize hazards caused by consecutive load and use instructions.

重点词汇

Data dependency 数据依赖
Load-use data hazard 加载使用数据冒险
Pipeline 流水线
Clock cycle 时钟周期
Instructions per cycle (IPC) 每周期指令数
Dynamic branch prediction 动态分支预测

参考资料

Bryant, R. E., & O'Hallaron, D. R. (2015). Computer Systems: A Programmer's Perspective (3^rd ed.). Pearson.
Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6^th ed.). Morgan Kaufmann.

東京大学 情報理工学系研究科 コンピュータ科学専攻 2021年8月実施 専門科目 問題4