# **Cache Memories**

COMP402127: Introduction to Computer Systems <u>https://xjtu-ics.github.io/</u>

Danfeng Shan Xi'an Jiaotong University

# Today

## Cache memory organization and operation

## Performance impact of caches

- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality

## **Cache Memories**

- Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:



# What it Really Looks Like






#### Core i7-3960X



# General Cache Organization (S, E, B)





# Example: Direct Mapped Cache (E = 1)

Direct mapped: one line per set

Assume: cache block size B = 8 bytes





## If tag doesn't match (= miss): old line is evicted and replaced
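The lookup described above starts by splitting the address into tag, set index, and block offset. A minimal sketch of that split, assuming an address laid out as t tag bits, s set index bits, and b block offset bits; the names `addr_parts` and `split_address` are hypothetical and not part of the course code:

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of an address. */
typedef struct {
    uint64_t tag;     /* high-order t bits: compared against the line's tag */
    uint64_t set;     /* middle s bits: select the set */
    uint64_t offset;  /* low-order b bits: byte position within the block */
} addr_parts;

/* Split addr for a cache with 2^s sets and 2^b-byte blocks (assumes s + b < 64). */
addr_parts split_address(uint64_t addr, unsigned s, unsigned b) {
    addr_parts p;
    p.offset = addr & ((1ULL << b) - 1);
    p.set    = (addr >> b) & ((1ULL << s) - 1);
    p.tag    = addr >> (s + b);
    return p;
}
```

With B = 8 bytes (b = 3) and, say, s = 2, address 100 splits into tag 3, set 0, offset 4.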

# **Direct-Mapped Cache Simulation**

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| X   | XX  | X   |

4-bit addresses (address space size M = 16 bytes); S = 4 sets, E = 1 block/set, B = 2 bytes/block

### Address trace (reads, one byte per read):

| Address | Binary (tag, set, offset) | Result |
|---------|---------------------------|--------|
| 0       | 0 00 0                    | miss   |
| 1       | 0 00 1                    | hit    |
| 7       | 0 11 1                    | miss   |
| 8       | 1 00 0                    | miss   |
| 0       | 0 00 0                    | miss   |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 0   | M[0-1] |
| Set 1 | 0 |     |        |
| Set 2 | 0 |     |        |
| Set 3 | 1 | 0   | M[6-7] |
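The trace above can be replayed with a small simulator. This is a sketch under the slide's parameters (S = 4, E = 1, B = 2); the names `dm_line`, `dm_access`, and `dm_count_hits` are hypothetical, and only tags are tracked, not the data bytes.

```c
#include <stdbool.h>

#define DM_SETS 4        /* S = 4 sets, so s = 2 */
#define DM_SET_BITS 2
#define DM_BLOCK_BITS 1  /* b = 1, so B = 2 bytes per block */

typedef struct {
    bool valid;
    unsigned tag;
} dm_line;

/* Access one byte at addr. Returns true on a hit; on a miss the
 * line in the selected set is replaced by the new block. */
bool dm_access(dm_line cache[DM_SETS], unsigned addr) {
    unsigned set = (addr >> DM_BLOCK_BITS) & (DM_SETS - 1);
    unsigned tag = addr >> (DM_BLOCK_BITS + DM_SET_BITS);
    if (cache[set].valid && cache[set].tag == tag)
        return true;
    cache[set].valid = true;   /* evict the old line, load the new one */
    cache[set].tag = tag;
    return false;
}

/* Replay a trace on an initially empty cache and count the hits. */
int dm_count_hits(const unsigned *trace, int n) {
    dm_line cache[DM_SETS] = {0};
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (dm_access(cache, trace[i]))
            hits++;
    return hits;
}
```

For the trace 0, 1, 7, 8, 0 this counts a single hit: addresses 0 and 8 map to the same set with different tags, so they keep evicting each other.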


# E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set

Assume: cache block size B = 8 bytes




## No match or not valid (= miss):

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

# **2-Way Set Associative Cache Simulation**



4-bit addresses (M=16 bytes) S=2 sets, E=2 blocks/set, B=2 bytes/block

Address trace (reads, one byte per read):

| Address | Binary (tag, set, offset) | Result |
|---------|---------------------------|--------|
| 0       | 00 0 0                    | miss   |
| 1       | 00 0 1                    | hit    |
| 7       | 01 1 1                    | miss   |
| 8       | 10 0 0                    | miss   |
| 0       | 00 0 0                    | hit    |

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
| Set 1 | 1 | 01  | M[6-7] |
|       | 0 |     |        |
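The same trace can be replayed against a 2-way set associative cache. The sketch below assumes the slide's parameters (S = 2, E = 2, B = 2) and picks LRU as the replacement policy, which is one of the options listed earlier; the names `sa_line`, `sa_access`, and `sa_count_hits` are hypothetical.

```c
#include <stdbool.h>

#define SA_SETS 2   /* S = 2 sets, so s = 1 */
#define SA_WAYS 2   /* E = 2 lines per set */

typedef struct {
    bool valid;
    unsigned tag;
    unsigned last_used;  /* time of most recent access, used for LRU */
} sa_line;

/* Access one byte at addr (b = 1, s = 1). Returns true on a hit. */
bool sa_access(sa_line cache[SA_SETS][SA_WAYS], unsigned addr, unsigned now) {
    unsigned set = (addr >> 1) & (SA_SETS - 1);
    unsigned tag = addr >> 2;
    sa_line *lines = cache[set];

    /* Compare the tag against every valid line in the set. */
    for (int w = 0; w < SA_WAYS; w++) {
        if (lines[w].valid && lines[w].tag == tag) {
            lines[w].last_used = now;
            return true;
        }
    }

    /* Miss: use an invalid line if there is one, else the LRU line. */
    int victim = 0;
    for (int w = 0; w < SA_WAYS; w++) {
        if (!lines[w].valid) { victim = w; break; }
        if (lines[w].last_used < lines[victim].last_used) victim = w;
    }
    lines[victim].valid = true;
    lines[victim].tag = tag;
    lines[victim].last_used = now;
    return false;
}

/* Replay a trace on an initially empty cache and count the hits. */
int sa_count_hits(const unsigned *trace, int n) {
    sa_line cache[SA_SETS][SA_WAYS] = {{{0}}};
    int hits = 0;
    for (int i = 0; i < n; i++)
        if (sa_access(cache, trace[i], (unsigned)i + 1))
            hits++;
    return hits;
}
```

For the trace 0, 1, 7, 8, 0 this counts two hits: blocks M[0-1] and M[8-9] now share set 0 without evicting each other.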



Cache line format: valid bit, dirty bit, tag, and a data block of B = 2<sup>b</sup> bytes (byte offsets 0 to B-1).

# What about writes?

Multiple copies of data exist:

- L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Needs a dirty bit (set if data has been written to)

## What to do on a write-miss?

- Write-allocate (load into cache, update line in cache)
  - Good if more writes to the location will follow
- No-write-allocate (writes straight to memory, does not load into cache)
- Typical
  - Write-through + No-write-allocate
  - Write-back + Write-allocate




# **Practical Write-back Write-allocate**

- A write to address X is issued
- If it is a hit
  - Update the contents of block
  - Set dirty bit to 1 (bit is sticky and only cleared on eviction)

## If it is a miss

- Fetch block from memory (per a read miss)
- Then perform the write operation (per a write hit)

## If a line is evicted and dirty bit is set to 1

- The entire block of 2<sup>b</sup> bytes is written back to memory
- Dirty bit is cleared (set to 0)
- Line is replaced by new contents
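The write-back, write-allocate steps above can be sketched for a cache that holds a single line. This is an illustration only: the one-line cache, the 4-byte block size, and the names `wb_line`, `wb_evict`, and `wb_write` are assumptions, and the block number doubles as the tag.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WB_BLOCK 4  /* B = 4 bytes per block (assumed for illustration) */

typedef struct {
    bool valid;
    bool dirty;
    unsigned tag;            /* here simply the block number */
    uint8_t data[WB_BLOCK];
} wb_line;

/* Evict the line: write the whole block back only if it is dirty. */
static void wb_evict(wb_line *line, uint8_t *mem) {
    if (line->valid && line->dirty)
        memcpy(mem + line->tag * WB_BLOCK, line->data, WB_BLOCK);
    line->dirty = false;
    line->valid = false;
}

/* Write one byte to address addr through the one-line cache. */
void wb_write(wb_line *line, uint8_t *mem, unsigned addr, uint8_t value) {
    unsigned tag = addr / WB_BLOCK;
    if (!(line->valid && line->tag == tag)) {
        /* Write miss: evict, then fetch the block as on a read miss. */
        wb_evict(line, mem);
        memcpy(line->data, mem + tag * WB_BLOCK, WB_BLOCK);
        line->tag = tag;
        line->valid = true;
    }
    /* Now proceed as on a write hit: update the block, set the dirty bit. */
    line->data[addr % WB_BLOCK] = value;
    line->dirty = true;
}
```

Memory only sees the new value once the dirty line is evicted, for example by a later write to a different block.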

# Why Index Using Middle Bits?

Direct mapped: one line per set

Assume: cache block size 8 bytes



# Illustration of Indexing Approaches

- 64-byte memory
  - 6-bit addresses
- 16-byte, direct-mapped cache
- Block size = 4 bytes (thus 4 sets; why?)
- 2 bits tag, 2 bits index, 2 bits offset



[Figure: the 16 memory blocks, addresses 0000xx through 1111xx, alongside the 4 cache sets]

## Middle Bit Indexing

### Addresses of form TTSSBB

- **TT** Tag bits
- **SS** Set index bits
- **BB** Offset bits

## Makes good use of spatial locality



[Figure: memory blocks 0000xx through 1111xx, each mapped to a cache set by its middle bits SS]

## **High Bit Indexing**

### Addresses of form SSTTBB

- **SS** Set index bits
- **TT** Tag bits
- **BB** Offset bits
- A program with high spatial locality would generate lots of conflicts
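The difference between the two schemes can be checked numerically for the 6-bit example. The helpers below are a sketch with hypothetical names (`middle_bit_set`, `high_bit_set`, `sets_touched`); they count how many distinct sets a contiguous run of bytes falls into.

```c
/* Set index under middle-bit indexing: address bits are TTSSBB. */
unsigned middle_bit_set(unsigned addr) {
    return (addr >> 2) & 0x3;
}

/* Set index under high-bit indexing: address bits are SSTTBB. */
unsigned high_bit_set(unsigned addr) {
    return (addr >> 4) & 0x3;
}

/* Number of distinct sets touched by the bytes [start, start + len).
 * use_high selects high-bit indexing; otherwise middle-bit indexing. */
int sets_touched(unsigned start, unsigned len, int use_high) {
    int seen[4] = {0};
    int count = 0;
    for (unsigned a = start; a < start + len; a++) {
        unsigned s = use_high ? high_bit_set(a) : middle_bit_set(a);
        if (!seen[s]) {
            seen[s] = 1;
            count++;
        }
    }
    return count;
}
```

A 16-byte array at address 0 spreads over all 4 sets with middle-bit indexing, so it fits in the cache. With high-bit indexing all four of its blocks compete for set 0.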



[Figure: memory blocks 0000xx through 1111xx, each mapped to a cache set by its high bits SS]

# **Intel Core i7 Cache Hierarchy**

#### **Processor package**



- L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
- L2 unified cache: 256 KB, 8-way, access: 10 cycles
- L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
- **Block size**: 64 bytes for all caches.
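The number of sets in each of these caches follows from the standard relation C = S × E × B (capacity = sets × lines per set × block size). A small sketch, with hypothetical helper names `num_sets` and `log2_exact`:

```c
/* Number of sets: S = C / (E * B). */
unsigned long num_sets(unsigned long capacity_bytes,
                       unsigned long ways,
                       unsigned long block_bytes) {
    return capacity_bytes / (ways * block_bytes);
}

/* log2 of a power of two, e.g. to get the number of set index bits s. */
unsigned log2_exact(unsigned long x) {
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}
```

For the figures above this gives 64 sets for L1, 512 for L2, and 8192 for L3. An L1 address therefore uses 6 block offset bits and 6 set index bits.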

# **Kunpeng 920 Cache Hierarchy**



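- Integrates up to 64 in-house-designed cores
  - Instruction set compatible with ARMv8.2; maximum clock frequency up to 3.0 GHz
  - 64 KB L1 I/D cache per core
  - 512 KB private L2 cache per core; 48-64 MB L3 cache shared across the chip
- 8 DDR4 controllers at 2933 MT/s
- Integrated PCIe/SAS interfaces
  - Supports PCIe 4.0, backward compatible with PCIe 3.0/2.0/1.0
  - Supports x16, x8, x4, x2, and x1 PCIe 4.0 links; integrates 20 PCIe controllers
  - Supports 16 SAS/SATA 3.0 controllers
- Supports the CCIX interface, providing cache coherence for accelerators
- Supports 2×100G RoCE v2 and standard 25GE/50GE/100GE NICs
- Supports 2P/4P scaling
- Package size: 60 mm × 75 mm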


# **Cache Performance Metrics**

### Miss Rate

- Fraction of memory references not found in cache (misses / accesses)
  = 1 - hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2

### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory (Trend: increasing!)

## Let's think about those numbers

## Huge difference between a hit and a miss

Could be 100x, if just L1 and main memory

## Would you believe 99% hits is twice as good as 97%?

- Consider this simplified example: cache hit time of 1 cycle, miss penalty of 100 cycles
- Average access time:
  - 97% hits: 1 cycle + 0.03 x 100 cycles = 4 cycles
  - 99% hits: 1 cycle + 0.01 x 100 cycles = 2 cycles

## This is why "miss rate" is used instead of "hit rate"
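The arithmetic in this example is the usual average access time formula, hit time + miss rate × miss penalty. A one-function sketch (the name `avg_access_time` is hypothetical):

```c
/* Average access time in cycles: every access pays the hit time,
 * and a fraction miss_rate of accesses also pays the miss penalty. */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Dropping the miss rate from 3% to 1% halves the average access time in this example, which is easy to overlook when the same change is stated as hit rate rising from 97% to 99%.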

# Writing Cache Friendly Code

### Make the common case go fast

Focus on the inner loops of the core functions

### Minimize the misses in the inner loops

- Repeated references to variables are good (temporal locality)
- Stride-1 reference patterns are good (spatial locality)
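As an illustration of the two guidelines, compare two ways of summing a 2D array. This is a sketch with assumed dimensions and hypothetical function names; both functions compute the same sum and differ only in traversal order.

```c
#define ROWS 4
#define COLS 8

/* Row-wise traversal: the inner loop walks contiguous memory (stride 1),
 * and sum is reused on every iteration (temporal locality). */
int sum_array_rows(int a[ROWS][COLS]) {
    int sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-wise traversal: the inner loop jumps COLS elements at a time,
 * so consecutive accesses fall in different rows (stride COLS). */
int sum_array_cols(int a[ROWS][COLS]) {
    int sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

For arrays much larger than the cache, the row-wise version is expected to miss far less often, because each fetched block is fully used before moving on.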

# Today

## Cache organization and operation

## Performance impact of caches

- The memory mountain
- Rearranging loops to improve spatial locality
- Using blocking to improve temporal locality

## **The Memory Mountain**

Read throughput (read bandwidth)

- Number of bytes read from memory per second (MB/s)
- Memory mountain: Measured read throughput as a function of spatial and temporal locality.
  - Compact way to characterize memory system performance.

# **Memory Mountain Test Function**

```
long data[MAXELEMS]; /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 *        array "data" with stride of "stride",
 *        using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2=stride*2, sx3=stride*3, sx4=stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements, still stepping by stride */
    for (; i < length; i += stride) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
```

Call test() with many combinations of elems and stride. For each elems and stride:

1. Call test() once to warm up the caches.
2. Call test() again and measure the read throughput (MB/s).
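The warm-up-then-measure procedure can be sketched as follows. This is not the course's measurement harness: it assumes the standard `clock()` timer, uses a simplified, non-unrolled read loop, and all names (`mm_data`, `read_stride`, `measure_throughput`) are hypothetical.

```c
#include <time.h>

#define MM_MAXELEMS (1 << 20)  /* assumed array size for this sketch */

/* Global with external linkage so the compiler cannot assume its contents. */
long mm_data[MM_MAXELEMS];

/* Simplified stand-in for test(): read every stride-th element. */
long read_stride(int elems, int stride) {
    long acc = 0;
    for (long i = 0; i < elems; i += stride)
        acc += mm_data[i];
    return acc;
}

/* Approximate read throughput in MB/s for a working set of size_bytes
 * traversed with the given stride (in elements). */
double measure_throughput(int size_bytes, int stride) {
    int elems = size_bytes / (int)sizeof(long);
    volatile long sink;
    long reps = 0;

    sink = read_stride(elems, stride);          /* 1. warm up the caches */

    clock_t start = clock();                    /* 2. timed runs */
    do {
        sink = read_stride(elems, stride);
        reps++;
    } while (clock() - start < CLOCKS_PER_SEC / 100);
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    (void)sink;

    /* Bytes read is approximated as (elems / stride) elements per run. */
    double bytes_read = (double)reps * (elems / stride) * sizeof(long);
    return bytes_read / secs / 1e6;
}
```

Repeating the timed run until about 10 ms have elapsed is a design choice to work around the coarse resolution of `clock()`. Calling `measure_throughput` over a grid of sizes and strides yields the data for a memory mountain plot.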









# Today

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality





$\text{Out}[i, j] = A[i, :] \cdot B[:, j] = \sum_{k=0}^{N-1} a[i, k] \, b[k, j]$

# **Matrix Multiplication Example**

## Description:

- Multiply N x N matrices
- Matrix elements are doubles (8 bytes)
- O(N<sup>3</sup>) total operations
- N reads per source element
- N values summed per destination
  - but may be able to hold in register

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

matmult/mm.c

# Miss Rate Analysis for Matrix Multiply

#### Assume:

- Block size = 32B (big enough for four doubles)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

### Analysis Method:

Look at access pattern of inner loop



# Layout of C Arrays in Memory (review)

### C arrays allocated in row-major order

- each row in contiguous memory locations
- a[i][j] = a[i\*N + j] where N is the number of columns

### Stepping through columns in one row:

```
for (i = 0; i < N; i++)
    sum += a[0][i];
```

- accesses successive elements
- if block size (B) > sizeof(a<sub>ij</sub>) bytes, exploit spatial locality
  - miss rate = sizeof(a<sub>ij</sub>) / B

### Stepping through rows in one column:

```
for (i = 0; i < N; i++)
    sum += a[i][0];
```

- accesses distant elements
- no spatial locality!
  - miss rate = 1 (i.e. 100%)

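## Matrix Multiplication (ijk)

Miss rate for inner loop iterations (A accessed row-wise, B column-wise, C fixed):

| A    | B   | C   |
|------|-----|-----|
| 0.25 | 1.0 | 0.0 |

Block size = 32B (four doubles)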

## Matrix Multiplication (kij)

Miss rate for inner loop iterations:

| A   | B    | C    |
|-----|------|------|
| 0.0 | 0.25 | 0.25 |

Block size = 32B (four doubles)

## Matrix Multiplication (jki)

Miss rate for inner loop iterations:

| A   | B   | C   |
|-----|-----|-----|
| 1.0 | 0.0 | 1.0 |

Block size = 32B (four doubles)

### **Summary of Matrix Multiplication**

ijk (& jik):

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

- 2 loads, 0 stores
- avg misses/iter = **1.25**

kij (& ikj):

```
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
        c[i][j] += r * b[k][j];
  }
}
```

- 2 loads, 1 store
- avg misses/iter = **0.5**

jki (& kji):

```
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
        c[i][j] += a[i][k] * r;
  }
}
```

- 2 loads, 1 store
- avg misses/iter = **2.0**
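Because the loop orderings only change the order of the additions, they can be swapped freely. A minimal sketch that wraps two of them as functions for a fixed, assumed size `MM_N` (function names are hypothetical); unlike the fragments above, the kij version zeroes `c` itself so it can be called on an uninitialized result matrix:

```c
#define MM_N 3  /* assumed matrix dimension for this sketch */

/* ijk: accumulate each c[i][j] in a local variable. */
void matmul_ijk(double a[MM_N][MM_N], double b[MM_N][MM_N], double c[MM_N][MM_N]) {
    for (int i = 0; i < MM_N; i++) {
        for (int j = 0; j < MM_N; j++) {
            double sum = 0.0;
            for (int k = 0; k < MM_N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}

/* kij: the inner loop walks rows of b and c with stride 1. */
void matmul_kij(double a[MM_N][MM_N], double b[MM_N][MM_N], double c[MM_N][MM_N]) {
    for (int i = 0; i < MM_N; i++)
        for (int j = 0; j < MM_N; j++)
            c[i][j] = 0.0;              /* kij accumulates into c */
    for (int k = 0; k < MM_N; k++) {
        for (int i = 0; i < MM_N; i++) {
            double r = a[i][k];
            for (int j = 0; j < MM_N; j++)
                c[i][j] += r * b[k][j];
        }
    }
}
```

With floating-point inputs the two versions can differ in the last bits for general data, since the additions happen in a different order; for small integer-valued inputs the results match exactly.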

# **Core i7 Matrix Multiply Performance**



# Today

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### **Example: Matrix Multiplication**





### Assume:

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)

### First iteration:

n/8 + n = 9n/8 misses

 Afterwards in cache: (schematic)



### Second iteration:

Again:
 n/8 + n = 9n/8 misses



### Total misses:

- $\frac{9n}{8} \cdot n^2 = \frac{9}{8} n^3$

# **Blocked Matrix Multiplication**





#### Assume:

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: 3B<sup>2</sup> < C
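A blocked multiplication under these assumptions can be sketched as below. The function name `mmm_blocked` and the parameter `bs` (the block size B, in elements) are hypothetical; the sketch assumes row-major `n x n` matrices stored in flat arrays, that `bs` divides `n`, and that the caller has zeroed `c`.

```c
/* Blocked matrix multiply: c += a * b, working on bs x bs sub-blocks
 * so that one block each of a, b, and c can stay in the cache. */
void mmm_blocked(const double *a, const double *b, double *c, int n, int bs) {
    for (int i = 0; i < n; i += bs)
        for (int j = 0; j < n; j += bs)
            for (int k = 0; k < n; k += bs)
                /* bs x bs mini matrix multiplication */
                for (int i1 = i; i1 < i + bs; i1++)
                    for (int j1 = j; j1 < j + bs; j1++) {
                        double sum = c[i1 * n + j1];
                        for (int k1 = k; k1 < k + bs; k1++)
                            sum += a[i1 * n + k1] * b[k1 * n + j1];
                        c[i1 * n + j1] = sum;
                    }
}
```

The three outer loops select which blocks to combine; the three inner loops reuse those blocks while they are still cached.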






### **Blocking Summary**

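- No blocking: (9/8) n<sup>3</sup> misses
- Blocking: (1/(4B)) n<sup>3</sup> misses
  - Each block iteration touches 2n/B blocks at B<sup>2</sup>/8 misses each, i.e. nB/4 misses; with (n/B)<sup>2</sup> block iterations the total is nB/4 × (n/B)<sup>2</sup> = n<sup>3</sup>/(4B)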

#### Use largest block size B, such that B satisfies 3B<sup>2</sup> < C

Fit three blocks in cache! Two input, one output.

#### Reason for dramatic difference:

- Matrix multiplication has inherent temporal locality:
  - Input data: 3n<sup>2</sup>, computation 2n<sup>3</sup>
  - Every array element is used O(n) times!
- But program has to be written properly

### **Cache Summary**

Cache memories can have significant performance impact

#### You can write your programs to exploit this!

- Focus on the inner loops, where bulk of computations and memory accesses occur.
- Try to maximize spatial locality by reading data objects sequentially with stride 1.
- Try to maximize temporal locality by using a data object as often as possible once it's read from memory.