東京大学 情報理工学系研究科 コンピュータ科学専攻 2023年8月実施 専門科目 問題2
Author
Description
Consider a processor lw
and writes data to the memory by the store-word instruction sw
. The address and data bit-widths of the load-word/store-word instructions are 32. When the bit representation of a memory address is x0
to x31
, and x0
is the zero register that always keeps the value 0.
The following program 0x1000
and size 402 and stores the result on the one-dimensional array 0x2000
and size 400. Each element of the arrays memory[addr]
represents a memory access to the address addr
. The initial values of the registers x5
, x6
, and x7
are 0x1640
, 0x1000
, and 0x2000
, respectively.
Instruction 0) addi x2, x0, 3 # x2 <- x0 + 3
Instruction 1) Loop: lw x3, 0(x6) # x3 <- memory[x6 + 0]
Instruction 2) lw x9, 4(x6) # x9 <- memory[x6 + 4]
Instruction 3) add x8, x8, x9 # x8 <- x8 + x9
Instruction 4) lw x9, 8(x6) # x9 <- memory[x6 + 8]
Instruction 5) add x8, x8, x9 # x8 <- x8 + x9
Instruction 6) div x8, x8, x2 # x8 <- x8 / x2
Instruction 7) sw x8, 0(x7) # memory[x7 + 0] <- x8
Instruction 8) addi x6, x6, 4 # x6 <- x6 + 4
Instruction 9) addi x7, x7, 4 # x7 <- x7 + 4
Instruction 10) blt x6, x5, Loop # if x6 < x5, goto Loop
Answer the following questions:
(1) Calculate the cache hit rate up to three places of decimals for the execution of the program
(2) Calculate the IPC (instructions per cycle) up to three places of decimals for the execution of the program add
, addi
, and blt
instructions in one clock cycle, and executes div
instruction in four clock cycles. The processor executes each of lw
and sw
instructions in one clock cycle in case of cache hits and four clock cycles in case of cache misses.
(3) Consider a modification to the data cache to improve the cache hit rate for the execution of the program
(4) Explain a software-level optimization of the program
考虑一个处理器 lw
从内存读取数据,并通过存储字指令 sw
将数据写入内存。加载字/存储字指令的地址和数据位宽为 32。当一个内存地址的位表示为
下面的程序
指令 0) addi x2, x0, 3 # x2 <- x0 + 3
指令 1) Loop: lw x3, 0(x6) # x3 <- memory[x6 + 0]
指令 2) lw x9, 4(x6) # x9 <- memory[x6 + 4]
指令 3) add x8, x8, x9 # x8 <- x8 + x9
指令 4) lw x9, 8(x6) # x9 <- memory[x6 + 8]
指令 5) add x8, x8, x9 # x8 <- x8 + x9
指令 6) div x8, x8, x2 # x8 <- x8 / x2
指令 7) sw x8, 0(x7) # memory[x7 + 0] <- x8
指令 8) addi x6, x6, 4 # x6 <- x6 + 4
指令 9) addi x7, x7, 4 # x7 <- x7 + 4
指令 10) blt x6, x5, Loop # if x6 < x5, goto Loop
回答以下问题:
(1) 计算程序
(2) 计算程序 add
、addi
和 blt
指令,并在四个时钟周期内执行 div
指令。处理器在缓存命中情况下在一个时钟周期内执行每条 lw
和 sw
指令,而在缓存未命中情况下在四个时钟周期内执行。
(3) 考虑对数据缓存进行修改以提高程序
(4) 解释程序
Kai
(1)
Cache Parameters:
- Cache size: 256 bytes
- Block size: 16 bytes
- Number of cache lines:
lines
Memory Address Breakdown:
- Address size: 32 bits
- Offset bits (block size of 16 bytes): 4 bits (i.e.,
) - Index bits (16 lines): 4 bits (i.e.,
) - Tag bits: Remaining bits (i.e.,
)
Cache Access Analysis
Arrays A and B:
- Array
starts at address 0x1000. - Array
starts at address 0x2000. - Both
and map to the same index in the cache due to the identical lower address bits.
Detailed Analysis of the First Four Loops
-
Loop 1 (Iteration 0):
- Access
(address 0x1000), Cache Line 0: Miss - Access
(address 0x1004), Cache Line 0: Hit - Access
(address 0x1008), Cache Line 0: Hit - Write
(address 0x2000), Cache Line 0: Miss
- Access
-
Loop 2 (Iteration 1):
- Access
(address 0x1004), Cache Line 0: Miss (replaced by B ) - Access
(address 0x1008), Cache Line 0: Hit - Access
(address 0x100C), Cache Line 0: Hit - Write
(address 0x2004), Cache Line 0: Miss
- Access
-
Loop 3 (Iteration 2):
- Access
(address 0x1008), Cache Line 0: Miss (replaced by B ) - Access
(address 0x100C), Cache Line 0: Hit - Access
(address 0x1010), Cache Line 1: Miss (crosses to a new cache line) - Write
(address 0x2008), Cache Line 0: Miss
- Access
-
Loop 4 (Iteration 3):
- Access
(address 0x100C), Cache Line 0: Miss (replaced by B
- Access
)
- Access
(address 0x1010), Cache Line 1: Hit - Access
(address 0x1014), Cache Line 1: Hit - Write
(address 0x200C), Cache Line 0: Miss
Summary of the First Four Loops
- Total Accesses: 16 accesses (4 per loop)
- Total Cache Misses: 9 misses
- Total Cache Hits: 7 hits
- Hit Rate for the First Four Loops:
Analysis of Subsequent Loops
Assuming the same pattern continues, the subsequent 96 loops (24 groups of 4 loops) will follow the same pattern:
- Loop 1 in each group: 3 hits, 1 miss
- Loop 2 in each group: 2 hits, 2 misses
- Loop 3 in each group: 1 hit, 3 misses
- Loop 4 in each group: 2 hits, 2 misses
Total Cache Misses and Hits for All Loops
Total Cache Misses for 96 Loops:
Total Cache Hits for 96 Loops:
Final Cache Hit Rate Calculation
- Total Accesses:
accesses - Total Cache Misses:
misses - Total Cache Hits:
hits - Final Hit Rate:
Conclusion
- The final cache hit rate for the entire program is approximately 49.9%.
(2)
Given the cache hit rate of approximately 49.9 %:
-
Instruction Breakdown:
addi
: 1 cyclelw/sw
: 1 cycle (hit) or 4 cycles (miss)div
: 4 cyclesblt
: 1 cycle
-
Cycle Calculation per Iteration:
- Total Basic Cycles for all instructions :
cycles - Total Additional Cycles for
lw
andsw
instruction:cycles - Other Additional Cycles: 3 cycles (for the
div
instruction)
- Total Basic Cycles for all instructions :
-
Total Cycles per Iteration:
- Total Cycles for 400 Iterations:
- Total Instructions:
- IPC:
(3)
Modification: Use a 2-way set associative cache instead of a direct-mapped cache.
Reason:
A 2-way set associative cache allows each index to map to two cache lines, which would enable both arrays
(4)
Optimization: Array padding.
Example: Introduce padding in array
Explanation:
By adding a small padding (e.g., 16 bytes) to the start of array
Knowledge
Cache IPC 指令集
难点思路
本题的难点在于正确分析缓存访问的冲突问题,以及提出有效的硬件和软件优化方案来减少这些冲突。直接映射缓存的冲突失效是此题的核心问题。
解题技巧和信息
- Direct-Mapped Cache: 在处理大量连续数据时容易产生冲突失效,集合关联缓存可以有效缓解这一问题。
- IPC Calculation: 注意由于缓存失效引起的指令执行延迟对 IPC 的影响。
- Array Padding: 是解决数组访问冲突问题的有效软件优化方法。
重点词汇
- Cache Line 缓存行
- Direct-Mapped Cache 直接映射缓存
- Hit Rate 命中率
- Instruction Per Cycle (IPC) 每周期指令数
- Array Padding 数组填充
参考资料
- Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach. Chapter 2.
- Patterson, D. A., & Hennessy, J. L. (2013). Computer Organization and Design: The Hardware/Software Interface. Chapter 5.