Files
kernbench2/docs/adr/ADR-0003-target-system-hierarchy.md
T
ywkang 32b29a1e5c ADR-0003/0014: generalize "router mesh" to "NOC"
NOC topology is an implementation choice (mesh, ring, crossbar, etc.).
ADR-0017 covers the current 2D mesh choice; ADRs at the system-level
shouldn't bind to that specific implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 23:23:46 -07:00

69 lines
2.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0003: Target System Hierarchy & Modeling Scope
## Status
Accepted
## Context
We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
through switching fabrics, with a host CPU issuing commands/kernels.
## Decision
We model the system hierarchy explicitly:
### D1. Tray-level
- A compute tray contains:
- Host CPU (issues requests / coordinates runtime & data placement)
- Multiple identical SIPs (accelerators)
- Interconnect fabric between SIPs (PCIe and/or UAL via switches)
### D2. SIP-level
- A SIP is a multi-die package composed of:
- Multiple CUBEs (HBM die + compute PEs + UCIe)
- One or more IO chiplets (host/SIP interfaces)
- IO chiplets:
- provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
- can be multiple per SIP
- placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 12 IO chiplets
### D3. CUBE-level
- A CUBE contains:
- HBM + memory controller (HBM_CTRL)
- NOC (on-die fabric): carries all intra-cube traffic including HBM data,
inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
PE↔UCIe connectivity, M_CPU↔PE command path.
NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
current implementation uses a 2D mesh with XY routing (see ADR-0017).
HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
- multiple PEs
- up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
### D4. PE-level
- A PE can execute one kernel instance
- PE contains internal control + accelerators (modeled at PE view granularity):
- PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
## Consequences
- The simulator supports abstraction by “views”:
- SIP view hides PE internals
- CUBE view treats each PE as a single block
- PE view expands PE internals
- Topology remains parameterized; sizes/counts/links come from configuration.
## Links
- SPEC R3/R5
- ADR-0005 (diagram views)
- ADR-0017 (cube NOC 2D mesh architecture)