commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,64 @@
+# ADR-0003: Target System Hierarchy & Modeling Scope
+
+## Status
+
+Accepted
+
+## Context
+
+We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
+The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe and/or UAL
+through switching fabrics, with a host CPU issuing commands/kernels.
+
+## Decision
+
+We model the system hierarchy explicitly:
+
+### D1. Tray-level
+
+- A compute tray contains:
+  - Host CPU (issues requests / coordinates runtime & data placement)
+  - Multiple identical SIPs (accelerators)
+  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)
+
+### D2. SIP-level
+
+- A SIP is a multi-die package composed of:
+  - Multiple CUBEs (HBM die + compute PEs + UCIe)
+  - One or more IO chiplets (host/SIP interfaces)
+- IO chiplets:
+  - provide interfaces: PCIe-EP, IO_CPU, and optionally UAL-EP
+  - may be instantiated multiple times per SIP
+  - are placed only on the SIP shoreline (top/bottom/left/right); each shoreline side may host 1–2 IO chiplets
+
+### D3. CUBE-level
+
+- A CUBE contains:
+  - HBM + memory controller (HBM_CTRL)
+  - XBAR (top/bottom): HBM pseudo-channel crossbar; the PEs' dedicated path to HBM
+  - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
+  - NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
+    carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
+  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
+  - Management/control CPU (M_CPU): coordinates PE command distribution and completion aggregation
+  - Multiple PEs
+  - Up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
+
+### D4. PE-level
+
+- A PE can execute one kernel instance
+- A PE contains internal control logic and accelerators (modeled at PE-view granularity):
+  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
+
+## Consequences
+
+- The simulator supports abstraction by “views”:
+  - SIP view hides PE internals
+  - CUBE view treats each PE as a single block
+  - PE view expands PE internals
+- Topology remains parameterized; sizes/counts/links come from configuration.
+
+## Links
+
+- SPEC R3/R5
+- ADR-0005 (diagram views)
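
The hierarchy and views described in this ADR could be sketched roughly as follows. This is a minimal illustration and not the simulator's actual data model: the names `View`, `PE`, `Cube`, `IOChiplet`, `SIP`, `build_sip`, `visible_blocks`, and the `PE_INTERNALS` labels are all hypothetical. It also assumes that the SIP view stops at cube granularity, which the ADR does not state explicitly (it only says the SIP view hides PE internals).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

SHORELINES = ("top", "bottom", "left", "right")
# Hypothetical labels for the PE internals listed in D4.
PE_INTERNALS = ("PE_CPU", "CMD_HANDLER", "PE_TCM", "DMA", "GEMM", "MATH", "QUEUES")


class View(Enum):
    SIP = 1
    CUBE = 2
    PE = 3


@dataclass
class PE:
    name: str


@dataclass
class Cube:
    name: str
    pes: List[PE]
    ucie: List[str] = field(default_factory=list)

    def __post_init__(self):
        # D3: up to 4 UCIe endpoints, one per direction.
        if len(self.ucie) > 4 or not set(self.ucie) <= {"N", "E", "W", "S"}:
            raise ValueError("a CUBE has up to 4 UCIe endpoints (N/E/W/S)")


@dataclass
class IOChiplet:
    name: str
    shoreline: str
    ual_ep: bool = False  # UAL-EP is optional per D2

    def __post_init__(self):
        if self.shoreline not in SHORELINES:
            raise ValueError(f"unknown shoreline: {self.shoreline}")


@dataclass
class SIP:
    name: str
    cubes: List[Cube]
    io_chiplets: List[IOChiplet]

    def __post_init__(self):
        # D2: one or more IO chiplets, at most 2 per shoreline side.
        if not self.io_chiplets:
            raise ValueError("a SIP needs at least one IO chiplet")
        for side in SHORELINES:
            if sum(io.shoreline == side for io in self.io_chiplets) > 2:
                raise ValueError(f"more than 2 IO chiplets on {side} shoreline")


def build_sip(name: str, n_cubes: int, pes_per_cube: int) -> SIP:
    """Build a SIP from configuration-style counts (topology stays parameterized)."""
    cubes = [
        Cube(f"cube{c}", [PE(f"pe{p}") for p in range(pes_per_cube)], ["N", "S"])
        for c in range(n_cubes)
    ]
    return SIP(name, cubes, [IOChiplet("io0", "top")])


def visible_blocks(sip: SIP, view: View) -> List[str]:
    """Return the block paths that a given view exposes."""
    blocks = [sip.name] + [f"{sip.name}/{io.name}" for io in sip.io_chiplets]
    for cube in sip.cubes:
        cube_path = f"{sip.name}/{cube.name}"
        blocks.append(cube_path)
        if view is View.SIP:
            continue  # assumption: SIP view stops at cube granularity
        for pe in cube.pes:
            pe_path = f"{cube_path}/{pe.name}"
            blocks.append(pe_path)  # CUBE view: each PE is a single block
            if view is View.PE:
                # PE view: expand the PE internals
                blocks.extend(f"{pe_path}/{part}" for part in PE_INTERNALS)
    return blocks
```

For example, `visible_blocks(build_sip("sip0", n_cubes=2, pes_per_cube=3), View.CUBE)` lists the SIP, its IO chiplet, both cubes, and each PE as an opaque block, while `View.PE` additionally lists the internals under every PE. Keeping the counts as arguments mirrors the ADR's point that sizes, counts, and links come from configuration.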