### HeteroTime: Accelerating Static Timing Analysis using GPUs

Dr. Tsung-Wei Huang Department of Electrical and Computer Engineering University of Utah, UT https://tsung-wei-huang.github.io/



### Agenda

#### □ Introduce the scalability problem of STA

- □ What is STA and its challenges?
- □ Why do we need new parallel paradigm for STA?
- □ Accelerate graph-based analysis using GPU
- □ Accelerate path-based analysis using GPU

# **Static Timing Analysis**

#### Static timing analysis (STA)

- Key step in the VLSI design
- Verify the circuit timing

#### □ Analyze worst-case timing

- Minimum timing values
- Maximum timing values

#### **Challenges**

- Compute giant graphs
- Analyze millions of paths
- Balance the loads



# **Timing Checks (Required Arrival Time)**

#### Modern circuits are sequential

- Drive data signal via clocks
- Capture data via flip-flops (FF)s

#### **Timing constraints**

- Min required arrival time
  - After clock: hold
- □ Max required arrival time
  - Before clock: setup





### "Traffic Light" Analogy

Can I pass the block before the next red light with 40 mph?

### **Building a Good Traffic System is Hard**

Trillions of sections and traffic lights to analyze ...



# STA is Computationally Challenging!

#### STA graphs is extremely large and irregular

- Millions to billions of nodes and edges
- Propagate timing information along giant graphs



ISPD circuit design (10M gates)



STA graphs are extremely large and irregular

## **Our STA Solution: OpenTimer**

#### Open-source incremental timing analysis software

https://github.com/OpenTimer/OpenTimer



T.-W. Huang et al., "OpenTimer v2: A New Parallel Incremental Timing Analysis Engine," IEEE TCAD21

### How Can We Make STA Run Faster?

#### □ Leverage many-core CPUs to speed up the runtime

- Dramatic speed-up using 8 cores
- □ Yet, scalability *saturates* at about 10−16 cores



# **Observed Scalability Bottleneck**

#### **CPU-only parallelism stagnates at about 10 cores**

"Amdahl's Law" limits the strong scalability

• The maximum speed-up you can achieve from a problem is proportional to the part that can be parallelized

#### □ Circuit graph structures limits the maximum parallelism

- If the graph has only 10 parallel nodes at a level, we won't achieve 40x speed-up
- Irregular computations limits the memory bandwidth
  - STA is graph-oriented, not cache-friendly

#### Need to incorporate new parallel paradigms

- GPU opens opportunities for new scalability milestones
  - e.g., 20—80x speed-up reported in placement



How can we leverage new GPU parallel paradigms to accelerate static timing analysis algorithms and achieve transformational performance milestones?

### Leverage GPU to Accelerate STA

□ We target two important STA steps:

- Graph-based analysis (GBA) shortest path finding
- □ Path-based analysis (PBA) *k-shortest path finding*
- □ We design new GPU-accelerated STA algorithms:
  - CPU-GPU task decomposition
  - □ GPU kernels for timing update

<u>PBA</u> analyzes critical paths one by one on a updated graph



### Agenda

#### □ Introduce the scalability problem of STA

- □ What is STA and its challenges?
- □ Why do we need new parallel paradigm for STA?
- **Accelerate graph-based analysis using GPU**
- □ Accelerate path-based analysis using GPU

Z Guo, <u>T-W Huang</u>, and Y Lin, "GPU-Accelerated Static Timing Analysis," *IEEE/ACM International Conference on Computer-aided Design (ICCAD)*, 2020

#### **Research Question:**

- How can we use GPU to speed up graphbased analysis (GBA)?
- How do we overcome the challenges of running irregular graph computations on GPU?

### **Runtime Breakdown of GBA**

#### **GBA** has three time-consuming steps

- 1. Prepare tasks through levelization  $\rightarrow$  42% runtime
  - All pins at the same level can run in parallel => data parallelism
- 2. Compute RC delay  $\rightarrow$  48% runtime
  - RC network (e.g., SPEF) can take GBs of data to compute
- 3. Propagate timing  $\rightarrow$  10% runtime



Leon2 GBA runtime breakdown

Leon2 circuit diagram and layout

### **GPU-Accelerated GBA Algorithm Flow**



### **Step #1: Levelization**

#### □ Levelize the circuit graph to a 2D levellist

- Nodes at the same level can run in parallel (red circle)
- Nodes at the same level can be modeled as a batch



GPU-accelerated levelization using parallel frontiers

### Step #1: Levelization (cont'd)

#### Levelization kernels



### Step #2: RC Update

#### The Elmore delay model

**D** Phase 1: 
$$load_u = \sum_{v \text{ is child of } u} cap_v$$

□ For example,  $load_A = cap_A + cap_B + cap_C + cap_D = cap_A + load_B + load_D$ 

**D** Phase 2:  $delay_u = \sum_{v \text{ is any node}} cap_v \times R_{Z \to LCA(u,v)}$ 

□ For example,  $delay_B = cap_A R_{Z \to A} + cap_D R_{Z \to A} + cap_B R_{Z \to B} + cap_C R_{Z \to B} = delay_A + R_{A \to B} load_B$ 



### Step #2: RC Update Upward Phase

Store the parent index of each node on GPU
 Perform dynamic programming on trees





GPU\_load: For u in [**C, D, B, E, A**]: load[u] += cap[u] load[u.parent] += load[u]



### Step #2: RC Update Downward Phase

Store the parent index of each node on GPU

#### Perform dynamic programming on trees





GPU\_delay:

For u in [**A, E, B, D, C**]:

temp := R[u.parent,u]\*load[u]
delay[u]=delay[u.parent] + temp



### Step #2: RC Update Memory Coalesce

Consecutive threads access consecutive memory
 RC update has four cases: {Rise, Fall} x {Early, Late}



(b)

### Step #2: RC Update Kernels (flatten + update)

| Input: N as #nets, (M, E) as (#nodes, #edges) in all nets         Input: roots[0N - 1], the index of root of each net         Input: edges[0E - 1], the undirected edges {(a, b)}         Input: nodestart[0N], the offsets of each net in arrays of nodes, with nodestart[N] = M         Input: edgestart[0N], the offsets of each net in arrays of edges, with edgestart[N] = E         Input: distances[0M] = ∞, counts[0M] = 0         Output: order[0M - 1], nodes in BFS order for each net         /* Process one net w/ blockDim.x threads         1 netID = blockIdx.x;         > gridDim.x = #nets         2 threadID = threadIdx.x;         > node offset start         4 nend = nodestart[netID];         > edge offset start         * edgestart[netID]; | Flatten the<br>data to array<br>for "data<br>parallelism" | Input: N as #nets, M as #nodes in all netsInput: start[0N], the offsets of each net in arrays of nodesInput: parent[0M - 1], the index of parent of every nodesInput: pres[0M - 1], the resistance between nodes and their<br>parentInput: cap[04M - 1], the capacitance of nodes, each in 4<br>different combinationsOutput: load[04M - 1], the capacitance of nodes, each in 4<br>different combinationsOutput: load[04M - 1], delay[04M - 1], impulse[04M - 1]:<br>arrays of results of load, delay and impulse,<br>respectively1netID = blockIdx.x × blockDim.x + threadIdx.x;2condID = threadIdx.y;3if netID ≥ N then return;40ffsetL = start[netID];> node offset start5offsetR = start[netID + 1];> node offset end6Initialize load, delay, ldelay to zero;77Initialize $\beta = 0$ as an auxiliary array; |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 6       eend = edgestart[netID + 1]; $\triangleright$ edge offset end         7       distances[nst + roots[netID]] = 0;         8       for $d = 0, 1, 2,, (nend - nst)$ do         9       for $i = est + threadID$ to eend step blockDim.x do         10 $(a, b) = edgelist[i];$ 11       if distances[a] == d and distances[b] >d + 1 then         12 $distances[b] = d + 1;$ 13 $atomicAdd(counts[d], 1);$                                                                                                                                                                                                                                                                                                                                                       | Each thread<br>computes<br>one RC tree                    | * Initialize $p = 0$ as an auxiliary array;<br>* for $i = offsetR - 1$ down to offsetL do<br>  $load[4i + condID] += cap[4i + condID];$<br>10   $load[4parent[i] + condID] += load[4i + condID];$<br>11 end<br>12 for $i = offsetL + 1$ to $offsetR - 1$ do<br>13   $t = load[4i + condID] \times pres[i];$<br>14   $delay[4i + condID] = delay[4parent[i] + condID] + t;$<br>15 end                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 14       end         15       else if distances[b] == d and distances[a] >d + 1 then         16       distances[a] = d + 1;         17       atomicAdd(counts[d], 1);         18       end         19       end         20      syncthreads();       > Sync threads within a block         21       break when counts[d] == 0;         22       end         23       countingSort(distances, counts, order, threadID);                                                                                                                                                                                                                                                                                                                                                | _                                                         | 16 for $i = offsetR - 1$ down to $offsetL$ do<br>17 $  ldelay[4i + condID] +=$<br>$cap[4i + condID] \times delay[4i + condID];$<br>18 $  ldelay[4parent[i] + condID] += load[4i + condID];$<br>19 end<br>20 for $i = offsetL + 1$ to $offsetR - 1$ do<br>21 $  t' = ldelay[4i + condID] \times pres[i];$<br>22 $\beta[4i + condID] = \beta[4parent[i] + condID] + t';$<br>23 $  inpulse[4i + condID] =$<br>$2\beta[4i + condID] - delay[4i + condID]^2;$<br>24 end                                                                                                                                                                                                                                                                                                                                                |

### Step #3: Cell Delay Update

#### **Perform linear inter- and extra-polation in batches**

□ x-axis (input slew) and then y-axis (output load)



### Step #3: Cell Delay Update Kernels

```
/* Input: line (x_1, y_1) - (x_2, y_2)
                                                                      */
  /* Input: the x value queried
                                                                      */
1 Function interpolate (x_1, x_2, y_1, y_2, x):
       if x_1 = x_2 then return y_1;
2
       else return d_1 + (d_2 - d_1) \frac{x - x_1}{x_2 - x_1};
3
4 end
  /* Input: n \times m look-up table
                                                                      */
  /* Input: the point queried (x, y)
                                                                      */
5 Function lut_lookup(n, m, X, Y, mat, x, y):
       i' \leftarrow 0;
                                                                                          Each thread performs binary
6
      i \leftarrow \min(1, n-1);
7
                                                                                      search on x-axis and y-axis to find
      while i + 1 < n and X[i] \leq x do
8
                                                                                             the slew and capacitance
           i' \leftarrow i:
 9
          i \leftarrow i + 1;
10
       end
11
      i' \leftarrow 0;
12
      j \leftarrow \min(1, m-1);
13
      while j + 1 < m and Y[j] \leq y do
14
          i' \leftarrow i
15
          j \leftarrow j + 1;
16
       end
17
       r_{i'} \leftarrow \text{interpolate}(Y[j'], Y[j], mat[i', j'], mat[i', j]);
18
       r_i \leftarrow interpolate(Y[j'], Y[j], mat[i, j'], mat[i, j]);
19
      r \leftarrow interpolate(X[i'], X[i], r_{i'}, r_i);
20
      return r;
21
22 end
```

## **Experimental Setting**

#### □ Machine configuration

- Nvidia CUDA, RTX 2080
- 40 Intel Xeon Gold 6138 CPU cores

#### Execution parameters for GPU kernels

- □ RC Tree Flattening
  - 64 threads per block with one block for each net
- Levelization
  - 128 threads per block
- RC delay computation
  - 4 threads for each net (one for each Early/Late and Rise/Fall condition) with a block of 64 nets
- Cell delay computation
  - 4 threads for each arc, with a block of 32 arcs

### **Overall Performance**

#### **Comparison with OpenTimer of 40 CPUs**

□ Run on large TAU15 Benchmarks (>20K gates)

|                 |       |       |         |         |         |          |          | OpenTimer | Our Runtime     |          |
|-----------------|-------|-------|---------|---------|---------|----------|----------|-----------|-----------------|----------|
| Benchmark       | # PIs | # POs | # Gates | # Nets  | # Pins  | # Nodes  | # Edges  | Runtime   | (40 CPUs 1 GPU) |          |
|                 |       |       |         |         |         |          |          | (40 CPUs) | Runtime         | Speed-up |
| aes_core        | 260   | 129   | 22938   | 23199   | 66751   | 413588   | 453508   | 156 ms    | 138 ms          | 1.13×    |
| vga_lcd         | 85    | 99    | 139529  | 139635  | 397809  | 1966411  | 2185601  | 829 ms    | 311 ms          | 2.67×    |
| vga_lcd_iccad   | 85    | 99    | 259067  | 259152  | 679258  | 3556285  | 3860916  | 1480 ms   | 496 ms          | 2.98×    |
| b19             | 22    | 25    | 255278  | 255300  | 782914  | 4423074  | 4961058  | 1831 ms   | 585 ms          | 3.13×    |
| cordic          | 34    | 64    | 45359   | 45393   | 127993  | 7464477  | 820763   | 274 ms    | 167 ms          | 1.64×    |
| des_perf        | 234   | 140   | 138878  | 139112  | 371587  | 2128130  | 2314576  | 832 ms    | 325 ms          | 2.56×    |
| edit_dist       | 2562  | 12    | 147650  | 150212  | 416609  | 2638639  | 2870985  | 1059 ms   | 376 ms          | 2.86×    |
| fft             | 1026  | 1984  | 38158   | 39184   | 116139  | 646992   | 718566   | 241 ms    | 148 ms          | 1.63×    |
| leon2           | 615   | 85    | 1616369 | 1616984 | 4328255 | 22600317 | 24639340 | 10200 ms  | 2762 ms         | 3.69×    |
| leon3mp         | 254   | 79    | 1247725 | 1247979 | 3376832 | 17755954 | 19408705 | 7810 ms   | 2585 ms         | 3.02×    |
| netcard         | 1836  | 10    | 1496719 | 1498555 | 3999174 | 21121256 | 23027533 | 9225 ms   | 2571 ms         | 3.60×    |
| mgc_edit_dist   | 2562  | 12    | 161692  | 164254  | 450354  | 2436927  | 2674934  | 1021 ms   | 368 ms          | 2.77×    |
| mgc_matrix_mult | 3202  | 1600  | 171282  | 174484  | 492568  | 2713241  | 2994343  | 1138 ms   | 377 ms          | 3.02×    |
| tip_master      | 778   | 857   | 37715   | 38493   | 95524   | 533690   | 570154   | 163 ms    | 143 ms          | 1.14×    |

# PIs: number of primary inputs # POs: number of primary outputs # Gates: number of gates # Nets: number of nets **# Pins**: number of pins **# Nodes**: number of nodes in the STA graph **# Edges**: number of edges in the STA graph

TAU15 Contest on Incremental Timing Analysis: https://sites.google.com/site/taucontest2015

T.-W. Huang et al., "OpenTimer v2: A New Parallel Incremental Timing Analysis Engine," IEEE TCAD21

### **Runtime Breakdown**

#### **Circuit leon2 (21 M nodes)**



### **Runtime vs CPUs**

#### □ Significant performance gap between CPU and GPU



Our runtime of 1 CPU and 1 GPU is very close to OpenTimer of 40 CPUs

### **Runtime vs Problem Sizes**

- Problem size matters for GPU acceleration
- □ When to enable GPU acceleration?
  - Net count > 20K
  - □ Gate count > 50K
  - Propagation candidate count > 15K



### Agenda

#### □ Introduce the scalability problem of STA

- □ What is STA and its challenges?
- □ Why do we need new parallel paradigm for STA?
- □ Accelerate graph-based analysis using GPU
- Accelerate path-based analysis using GPU

G Guo, <u>T-W Huang</u>, Y Lin, and M Wong, "GPU-Accelerated Path-based Timing Analysis," *IEEE/ACM Design Automation Conference* (DAC), 2021

#### **Research Question:**

- How can we use GPU to speed up pathbased analysis (PBA)?
- How do we overcome the challenges of generating large numbers of critical paths to analyze with GPU?

### Path-based Analysis (PBA)

#### □ Identify a set of critical paths from a updated graph

Exponential number of paths in the circuit graph

#### **Re-analyze each path with path-specific update**

- Re-propagate the slew and remove pessimism
- Advanced on-chip variation (AOCV)
- □ Common path pessimism removal (CPPR)



### **PBA is Extremely Time-Consuming**

Speed vs Accuracy (pessimism removal) tradeoff



### A Key Step: Generate Critical Paths

#### □ We introduce implicit path representation

- □ Each path is represented using <u>O(1) space</u> and time
- Each path is ranked through a prefix tree & a suffix tree



T.-W. Huang et al., "OpenTimer v2: A New Parallel Incremental Timing Analysis Engine," IEEE TCAD21

### **GPU-Accelerated PBA Algorithm Flow**



### **Step #1: Generate Suffix Tree on GPU**



### Step #1: Generate Suffix Tree Kernel



- **Input** : Shortest distance cache, distanceCache[N]
- **Input** : Array indicating vertices with updated distances, distanceUpdated[N]

**Result:** Shortest distances array, distances[N]

- 1 tid  $\leftarrow$  blockIdx.x \* blockDim.x + threadIdx.x;
- 2 if  $tid \ge N$  then
- 3 return;
- 4 end
- 5 if distanceUpdated[tid] is false then
- 6 return;

#### 7 end

- 8 distanceUpdated[tid]  $\leftarrow$  false;
- 9 edgeStart  $\leftarrow$  vertices[tid];
- 10 edgeEnd  $\leftarrow$  (tid == N-1) ? M : vertices[tid+1];
- 11 for eid  $\leftarrow$  edgeStart to edgeEnd do
- 12 neighbor  $\leftarrow$  edges[eid];
- 13 weight  $\leftarrow$  weights[eid];
- 14 newDis  $\leftarrow$  distances[tid] + weight;
- 15 **atomicMin** (&distanceCache[neighbor], newDis);

#### 16 **end**

17 return;



Bellman loop to find the shortest path tree (each thread per edge)

### **Step #2: Expand Prefix Tree on GPU**



### Step #2: Expand Prefix Tree on GPU (cont'd)

□ Iteratively grow GPU memory at each expansion

- Each iteration uses GPU to decide path candidates
- Each iteration uses CPU to prune path candidates
- □ Each path candidate takes O(1) space "deviation edge"



#### **Step #2: Expand Prefix Tree Kernel**

| <ul> <li>Input : G<sup>+</sup> in CSR format, N as #vertices, M as #edges, vertices[N], edges[M], weights[M]</li> <li>Input : Shortest path forest, forest[N] as edge array, distances[N] as distance array</li> <li>Input : currLevel as current level</li> <li>Input : levelSize as the number of entries in current level</li> <li>Result: Explore critical path candidates in next level</li> <li>1 tid ← blockIdx.x * blockDim.x + threadIdx.x;</li> <li>2 if tid ≥ levelSize then</li> <li>3   return;</li> <li>4 end</li> <li>5 offset ← (tid == 0) ? 0 : currLevel[tid-1].childOffset;</li> <li>6 level ← currLevel[tid].level;</li> </ul> | expand<br>Level 1, 2, 3,<br>cudaMalloc |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------|
| 7 slack $\leftarrow$ currLevel[tid].slack;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                        |
| s $v \leftarrow currLevel[tid].to;$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                        |
| 9 while v is not endpoint do<br>10 edgeStart $\leftarrow$ vertices[v];<br>11 edgeEnd $\leftarrow$ (v == N-1) ? M : vertices[v+1];<br>12 for eid $\leftarrow$ edgeStart to edgeEnd do<br>13   neighbor $\leftarrow$ edges[eid];                                                                                                                                                                                                                                                                                                                                                                                                                     | Key idea: levelized prefix tree        |
| 13Integration $\leftarrow$ edges[end],14weight $\leftarrow$ weights[eid];15if eid is deviation edge then16/* Fill out child path data */17newPath $\leftarrow$ nextLevel[offset];                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | <b>Deviation:</b> $e_{11}$             |
| 18newPath.level $\leftarrow$ level+1;19newPath.from $\leftarrow$ v;20newPath.to $\leftarrow$ neighbor;21newPath.parent $\leftarrow$ tid;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | $e_1 e_2 e_{11} e_4$ Level 1           |
| 22 newPath.childOffset $\leftarrow 0$ ;<br>23 newPath.slack $\leftarrow$ slack + distances[neighbor] +<br>weight - distances[v];                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | e <sub>7</sub> e <sub>9</sub> Level 2  |
| 24 offset $\leftarrow$ offset + 1;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                        |
| 25 end                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Path node                              |
| 26 end<br>27 /* Traverse along the shortest path                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | ( <b>e</b> <sub>11</sub> )             |
| forest */                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                                        |
| v = forest[v];                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                        |
| 29 end                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                                        |
| 30 return;                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                        |

## **Experimental Setting**

#### □ Machine configuration

- Nvidia CUDA, RTX 2080
- 40 Intel Xeon Gold 6138 CPU cores
- □ Measure the accuracy-runtime tradeoff
  - "MDL" stands for maximum deviation level

#### □ Execution parameters for GPU kernels

- Suffix tree kernel
  - 1024 threads per block
- Prefix tree kernel
  - 1024 threads per block

### **Overall Performance**

# Compare with OpenTimer's CPU-based PBA Report speed-up at different MDLs

| Benchmark #Pins #Gate | #Pins   | #Pins #Gates | #Arcs   | OpenTimer<br>Runtime | Our Algorithm<br>#MDL=10 |                | Our Algorithm<br>#MDL=15 |         | Our Algorithm<br>#MDL=20 |       |
|-----------------------|---------|--------------|---------|----------------------|--------------------------|----------------|--------------------------|---------|--------------------------|-------|
|                       |         |              | Kunume  | Runtime              | Speed-up                 | Runtime        | Speed-up                 | Runtime | Speed-up                 |       |
| leon2                 | 4328255 | 1616399      | 7984262 | 2875783              | 4708.36                  | 611×           | 5295.49ms                | 543×    | 5413.84                  | 531×  |
| leon3mp               | 3376821 | 1247725      | 6277562 | 1217886              | 5520.85                  | 221×           | 7091.79ms                | 172×    | 8182.84                  | 149×  |
| netcard               | 3999174 | 1496719      | 7404006 | 752188               | 2050.60                  | 367×           | 2475.90ms                | 304×    | 2484.08                  | 303×  |
| vga_lcd               | 397809  | 139529       | 756631  | 53204                | 682.94                   | 77 <b>.9</b> × | 683.04ms                 | 77.9×   | 706.16                   | 75.3× |
| vga_lcd_iccad         | 679258  | 259067       | 1243041 | 66582                | 720.40                   | 92.4×          | 754.35ms                 | 88.3×   | 766.29                   | 86.9× |
| b19_iccad             | 782914  | 255278       | 1576198 | 402645               | 2144.67                  | 188×           | 2948.94ms                | 137×    | 3483.05                  | 116×  |
| des_perf_ispd         | 371587  | 138878       | 697145  | 24120                | 763.79                   | 31.6×          | 766.31ms                 | 31.5×   | 780.56                   | 30.9× |
| edit_dist_ispd        | 416609  | 147650       | 799167  | 614043               | 1818.49                  | 338×           | 2475.12ms                | 248×    | 2900.14                  | 212×  |
| mgc_edit_dist         | 450354  | 161692       | 852615  | 694014               | 1463.61                  | 474×           | 1485.65ms                | 467×    | 1493.90                  | 465×  |
| mgc_matric_mult       | 492568  | 171282       | 948154  | 214980               | 994.67                   | 216×           | 1075.90ms                | 200×    | 1113.26                  | 193×  |

#### □ Achieve significant speed-up at large designs

- □ 611x speed-up in leon2 (1.3M gates)
- 221x speed-up in leon3mp (1.2M gates)

T.-W. Huang et al., "OpenTimer v2: A New Parallel Incremental Timing Analysis Engine," IEEE TCAD21

### Path Accuracy vs MDL

#### □ Achieve decent accuracy at 10—12 GPU iterations



### **GPU Speed-up vs CPUs**

#### **one GPU is even faster than OpenTimer with 40 CPUs**



### Conclusion

#### Introduced the runtime challenges of STA

- Introduced graph-based analysis
- Introduced path-based analysis
- □ Accelerated the graph-based analysis using GPU
  - □ Achieved 4x speed-up over 40 CPUs on large designs

#### Accelerated the path-based analysis using GPU

□ Achieved 600x speed-up over 40 CPUs on large designs

#### Future work

- □ Consider other important STA tasks (CPPR, CCS, etc.)
- Leverage cudaMallocAsync to boost memory efficiency
- Leverage cudaGraph to reduce kernel launch overheads
- □ Collaborate with experts in Nvidia!



#### **Use right algorithms for GPU!**

tsung-wei.huang@utah.edu