# KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench
system-level discrete-event simulator for our AI Accelerator SIP-based systems.
All implementations, tests, and changes MUST conform to this SPEC.

---

## 0. Goal

Build a **system-level, discrete-event simulator** to evaluate the performance of
**LLM kernels running on our AI Accelerator SIP-based systems**, under varying
**SIP architectures, topologies, and interconnect configurations**.

The simulator models **data-movement and control paths across the full hardware
hierarchy** and computes **end-to-end execution latency** for kernel executions
dispatched to Processing Elements (PEs).

Primary objectives:

- compare LLM kernel execution latency under different system configurations
- model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
- guarantee deterministic, verifiable behavior with strong debuggability
- support visual inspection of the modeled system at multiple abstraction levels

---

## 0.1 Golden Invariants (Must NOT be violated)

- End-to-end latency is computed **strictly by explicit traversal** over modeled
  components and links.
- Every routed request MUST incur **latency > 0**.
- Routing decisions MUST be **deterministic** given
  (topology + routing policy + request).
- All valid request flows MUST have explicit connectivity in the model.
- No hidden shortcuts, implicit bypasses, or magic paths are allowed.
- Architectural decisions documented in ADRs override local optimizations.

---

## 0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.

- ADR-0001: PhysAddr layout & address decoding contract
- ADR-0002: Routing distance, ordering, and bypass rules
- ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
- ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
- ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
- ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
- ADR-0007: runtime_api vs sim_engine responsibility boundaries
- ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
- ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: CLI device selection and multi-device execution semantics
- ADR-0011: Memory addressing simplification (PA-first)
- ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
- ADR-0013: Verification strategy and Phase 1 test plan
- ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
- ADR-0015: Component port/wire model, BW occupancy, and fabric routing
- ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
- ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)

SPEC MUST remain consistent with accepted ADRs.

---

## 1. Core Requirements

### R1. Correct Routing and Control Path

- A request MUST traverse the correct sequence of components based on:
  - source location,
  - destination address or placement tags,
  - routing policy and available topology connectivity.
- Local vs remote traffic MUST be distinguishable:
  - same SIP vs different SIP,
  - same CUBE vs different CUBE,
  - (optional) same PE-group vs cross PE-group.
- Routing behavior MUST be reproducible and deterministic.

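
The local-vs-remote distinction in R1 can be sketched as a pure function over placement tags. This is a minimal illustration, not the simulator's API: the `Placement` type, the `classify` function, and the returned labels are hypothetical names chosen for the example, and the optional PE-group level is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement tag: (sip, cube, pe) indices."""
    sip: int
    cube: int
    pe: int


def classify(src: Placement, dst: Placement) -> str:
    """Return the widest domain boundary a request must cross.

    The result depends only on the two tags, so the same pair
    always yields the same class (deterministic by construction).
    """
    if src.sip != dst.sip:
        return "cross_sip"
    if src.cube != dst.cube:
        return "cross_cube"
    if src.pe != dst.pe:
        return "same_cube"
    return "local"
```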
---

### R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

- per-node fixed latency (processing / router delay),
- per-link latency (fixed and/or size-aware serialization: bytes / BW),
- per-service latency (e.g., memory controller service time).

The simulator MUST:

- support both fixed and size-aware latency,
- emit hop-by-hop traces with timestamps and component identifiers.

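
The latency sum above can be illustrated with a small worked example. It is a sketch under stated assumptions: the `Hop` type, its field names, and the numeric values are invented for illustration, and bandwidth is taken as decimal GB/s so that 1 GB/s equals 1 byte/ns and `bytes / BW` comes out directly in nanoseconds. For a 4096-byte request over two 256 GB/s links, each link adds 16 ns of serialization, giving 19 + 21 + 30 = 70 ns end to end.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hop:
    """One traversed component plus its outgoing link (hypothetical shape)."""
    component: str
    node_ns: float = 0.0                   # fixed processing / router delay
    link_ns: float = 0.0                   # fixed link latency
    link_bw_gbps: Optional[float] = None   # decimal GB/s; None = no serialization
    service_ns: float = 0.0                # e.g. memory controller service time


def hop_latency_ns(hop: Hop, size_bytes: int) -> float:
    # 1 GB/s == 1 byte/ns, so bytes / (GB/s) is already in ns.
    serialization = size_bytes / hop.link_bw_gbps if hop.link_bw_gbps else 0.0
    return hop.node_ns + hop.link_ns + serialization + hop.service_ns


def traverse(path: List[Hop], size_bytes: int,
             start_ns: float = 0.0) -> Tuple[float, List[Tuple[float, str]]]:
    """Sum per-hop latency and record a (timestamp, component) trace."""
    now = start_ns
    trace = []
    for hop in path:
        now += hop_latency_ns(hop, size_bytes)
        trace.append((now, hop.component))
    total = now - start_ns
    if total <= 0:
        raise ValueError("routed request must incur latency > 0")
    return total, trace


example_path = [
    Hop("sip0.cube0.pe0", node_ns=2.0, link_ns=1.0, link_bw_gbps=256.0),
    Hop("sip0.cube0.fabric", node_ns=4.0, link_ns=1.0, link_bw_gbps=256.0),
    Hop("sip0.cube0.hbm_ctrl", service_ns=30.0),
]
total_ns, trace = traverse(example_path, size_bytes=4096)
```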
---

### R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

- SIP count,
- CUBE count per SIP,
- PE count per CUBE,
- on-chip fabric structure (e.g., mesh / NoC / XBAR),
- IO chiplets and interconnects,
- link bandwidth, latency, and capacity parameters.

Given a topology:

- all required request flows MUST have valid connectivity,
- missing links are a topology construction error, not a routing error.

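
One way to enforce the "construction error, not routing error" rule is to check reachability for every required flow when the topology is built. The sketch below assumes a dict-shaped topology with `nodes` and `links` keys and a hypothetical `TopologyError`; the actual schema and the set of required flows are defined elsewhere and are not taken from this SPEC.

```python
from collections import deque


class TopologyError(Exception):
    """Raised at topology construction time (hypothetical name)."""


def check_connectivity(topology, required_flows):
    """Fail fast if any required (src, dst) flow has no directed path."""
    adjacency = {node: [] for node in topology["nodes"]}
    for link in topology["links"]:
        src, dst = link["src"], link["dst"]
        if src not in adjacency or dst not in adjacency:
            raise TopologyError(f"link references unknown node: {src} → {dst}")
        adjacency[src].append(dst)

    for src, dst in required_flows:
        seen = {src}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                break
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        else:
            # Queue drained without reaching dst.
            raise TopologyError(f"no path for {src} → {dst}")


example_topology = {
    "nodes": ["sip0.cube0.pe0", "sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"],
    "links": [
        {"src": "sip0.cube0.pe0", "dst": "sip0.cube0.fabric"},
        {"src": "sip0.cube0.fabric", "dst": "sip0.cube0.hbm_ctrl"},
    ],
}
check_connectivity(example_topology, [("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")])
```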
---

### R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

- routers and fabrics (NoC, bridges, switches),
- XBAR-like selectors,
- DMA engines and queues,
- memory controllers and services (HBM, TCM, queues),
- management and control processors (modeled components).

The simulator MUST:

- use dependency injection (DI) to bind node specifications to implementation classes,
- allow component swapping without changing test logic,
- avoid leaking routing or policy logic into unrelated components.

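
The DI binding described above could look roughly like a registry that maps a node spec's `kind` to a factory. All names here (`ComponentRegistry`, `FixedLatencyRouter`, `SimpleHbmController`, the `kind` / `params` spec keys) are hypothetical and only illustrate the pattern; swapping an implementation means re-registering a factory while the calling code stays unchanged.

```python
from typing import Any, Callable, Dict


class FixedLatencyRouter:
    """Example component: constant delay regardless of request size."""

    def __init__(self, latency_ns: float):
        self._latency_ns = latency_ns

    def latency_ns(self, size_bytes: int) -> float:
        return self._latency_ns


class SimpleHbmController:
    """Example component: service time plus size-aware transfer time."""

    def __init__(self, service_ns: float, bw_gbps: float):
        self._service_ns = service_ns
        self._bw_gbps = bw_gbps

    def latency_ns(self, size_bytes: int) -> float:
        return self._service_ns + size_bytes / self._bw_gbps


class ComponentRegistry:
    """Binds a spec's 'kind' string to an implementation factory."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Any]] = {}

    def register(self, kind: str, factory: Callable[..., Any]) -> None:
        self._factories[kind] = factory

    def build(self, spec: Dict[str, Any]) -> Any:
        kind = spec["kind"]
        if kind not in self._factories:
            raise KeyError(f"no implementation bound for kind {kind!r}")
        return self._factories[kind](**spec.get("params", {}))


registry = ComponentRegistry()
registry.register("router", FixedLatencyRouter)
registry.register("hbm_ctrl", SimpleHbmController)

router = registry.build({"kind": "router", "params": {"latency_ns": 4.0}})
hbm = registry.build({"kind": "hbm_ctrl",
                      "params": {"service_ns": 30.0, "bw_gbps": 256.0}})
```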
---

### R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

- PE ↔ local HBM
- PE ↔ remote HBM in the same CUBE
- PE ↔ remote HBM in other CUBEs within the same SIP
- PE ↔ remote HBM in other SIPs
- PE ↔ PE messaging (e.g., IPCQ)
- PE ↔ IO chiplets
- CUBE ↔ CUBE (e.g., via UCIe)
- SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

- the bypass path is explicitly represented in the model,
- the bypass incurs non-zero latency,
- the bypass is visible in traces and diagrams.

---

### R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

- behavior is validated by tests with meaningful input cases,
- tests encode SPEC-defined invariants, not incidental implementation details,
- changes without clear verification coverage are not allowed.

---

### R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

- exposes tensor deployment and kernel execution operations,
- submits requests to endpoint components: PCIE_EP for memory operations
  (MemoryWrite/Read), IO_CPU for kernel launch,
- owns host-side tensor handles and allocation metadata as PA shard maps,
- remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces **device physical-address (PA) shard mappings**.
Each shard explicitly identifies its target `(sip, cube, pe)` and PA range.
No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.

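
A PA shard map can be pictured as a host-side handle holding a tuple of PE-tagged address ranges. The sketch below is an assumption-laden illustration: `PAShard`, `TensorHandle`, and `make_handle` are hypothetical names, the even-split policy is invented for the example, and the base addresses are arbitrary placeholders rather than values from the ADR-0001 layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PAShard:
    """One shard: explicit target PE plus a device physical-address range."""
    sip: int
    cube: int
    pe: int
    pa_start: int
    size_bytes: int

    @property
    def pa_end(self) -> int:
        # Exclusive end of the range.
        return self.pa_start + self.size_bytes


@dataclass(frozen=True)
class TensorHandle:
    """Host-side handle; the shard map is the only allocation metadata."""
    name: str
    shards: Tuple[PAShard, ...]

    @property
    def size_bytes(self) -> int:
        return sum(shard.size_bytes for shard in self.shards)


def make_handle(name: str, size_bytes: int,
                placements: List[Tuple[Tuple[int, int, int], int]]) -> TensorHandle:
    """Split a tensor evenly over ((sip, cube, pe), pa_start) placements."""
    chunk = -(-size_bytes // len(placements))  # ceiling division
    remaining = size_bytes
    shards = []
    for (sip, cube, pe), pa_start in placements:
        length = min(chunk, remaining)
        shards.append(PAShard(sip, cube, pe, pa_start, length))
        remaining -= length
    return TensorHandle(name, tuple(shards))


# Placeholder base addresses, chosen only for the example.
handle = make_handle("weights", 1024, [((0, 0, 0), 0x1000), ((0, 0, 1), 0x8000)])
```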
---

### R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

- injects requests into the system graph,
- schedules events deterministically,
- tracks completion via correlation identifiers,
- decomposes runtime API operations into explicit graph requests
  (e.g., MemoryWrite, MemoryRead, KernelLaunch).

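
Deterministic scheduling is commonly achieved with a priority queue keyed on `(timestamp, sequence number)`, so that events with equal timestamps fire in insertion order. The following is a minimal sketch of that idea using the standard `heapq` module; the `EventQueue` class and its method names are hypothetical and do not describe the actual engine.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class EventQueue:
    """Minimal discrete-event loop with a deterministic tie-break."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker
        self.now_ns = 0.0

    def schedule(self, delay_ns: float, correlation_id: str,
                 action: Callable[["EventQueue"], None]) -> None:
        if delay_ns < 0:
            raise ValueError("events cannot be scheduled in the past")
        # The unique sequence number guarantees the tuple comparison never
        # reaches correlation_id or action.
        heapq.heappush(
            self._heap,
            (self.now_ns + delay_ns, next(self._seq), correlation_id, action),
        )

    def run(self) -> List[Tuple[float, str]]:
        """Drain the queue; return (timestamp, correlation_id) in firing order."""
        fired = []
        while self._heap:
            self.now_ns, _, correlation_id, action = heapq.heappop(self._heap)
            action(self)
            fired.append((self.now_ns, correlation_id))
        return fired


def noop(queue: EventQueue) -> None:
    pass


queue = EventQueue()
queue.schedule(10.0, "req-b", noop)
queue.schedule(5.0, "req-a", noop)
queue.schedule(10.0, "req-c", noop)  # same time as req-b, fires after it
fired = queue.run()
```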
---

### R9. CLI Execution Semantics

The CLI MUST support executing benchmarks:

- on a specified device.

Benchmarks are executed once per invocation within a single simulation instance.
If multiple devices are present in the topology, a benchmark MAY interact with
multiple devices internally, but the CLI does not launch multiple independent
benchmark instances by default.

---

### R10. Memory Addressing (Phase 0)

In Phase 0, the simulator uses a **PA-first memory model**:

- All memory operations use device physical addresses (PA) only.
- Virtual addressing, MMU/IOMMU, and address translation latency are out of scope.
- Tensor placement is represented as a list of PA shards, each explicitly tagged
  with `(sip, cube, pe)`.

All memory access latency MUST be modeled explicitly via graph traversal.
No implicit translation or hidden latency is allowed.

---

## 2. Model Concepts

### 2.1 Graph Execution Model

- Nodes represent modeled components (PE blocks, XBAR, NoC, bridges,
  HBM controllers, IO components, etc.).
- Directed edges represent interconnect links with latency and bandwidth attributes.
- Execution model:
  - a node receives a request,
  - incurs node or service latency,
  - emits the request to the next hop via a link,
  - repeats until the destination service completes.

---

### 2.2 Routing

Routing MAY be implemented as:

- policy-based routing (code-driven),
- routing tables (config-driven),
- topology-driven routing (e.g., mesh XY),
- or a hybrid approach.

Routing MUST:

- consume decoded address domains or explicit placement tags,
- operate only on explicit topology connectivity,
- remain deterministic.

Kernel execution requests reference tensors via PA shard mappings.
Each shard explicitly identifies its target PE, allowing IO_CPU to
deterministically fan out execution without relying on PA decoding.

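
As an example of topology-driven routing, dimension-ordered XY routing on a 2D mesh moves along X until the column matches, then along Y. It is deterministic because the path depends only on the source and destination coordinates. The sketch assumes routers are addressed by integer `(x, y)` coordinates, which is an illustrative choice and not taken from ADR-0017; a real implementation would also verify that each step corresponds to an explicit link in the topology.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the mesh coordinates visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]

    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))

    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))

    return path


example_route = xy_route((0, 0), (2, 1))
```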
---

## 3. Inputs and Identity

### 3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference
and trace-based debugging.

Example patterns:

- `tray.host_cpu`
- `sip{S}.io{I}.pcie_ep`
- `sip{S}.cube{C}.fabric`
- `sip{S}.cube{C}.pe{P}`
- `sip{S}.cube{C}.hbm_ctrl`

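
Identifiers following the patterns above can be parsed into their domain indices with a short helper. This is a sketch: `parse_node_id` and the shape of the returned dict (including the `kind` key) are hypothetical, and it assumes segments are lowercase names optionally followed by a decimal index.

```python
import re
from typing import Dict, Union

# A segment such as "sip0" or "cube12": lowercase name followed by an index.
_INDEXED = re.compile(r"([a-z_]+)(\d+)")


def parse_node_id(node_id: str) -> Dict[str, Union[int, str]]:
    """Extract domain indices and the component kind from a dotted node ID."""
    info: Dict[str, Union[int, str]] = {}
    segments = node_id.split(".")
    for segment in segments:
        match = _INDEXED.fullmatch(segment)
        if match:
            info[match.group(1)] = int(match.group(2))
    # The final segment names the component; drop its index if it has one.
    last = _INDEXED.fullmatch(segments[-1])
    info["kind"] = last.group(1) if last else segments[-1]
    return info


pe_info = parse_node_id("sip0.cube1.pe3")
ep_info = parse_node_id("sip2.io0.pcie_ep")
host_info = parse_node_id("tray.host_cpu")
```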
---

### 3.2 Link Specifications

A link MAY include:

- fixed latency (ns),
- bandwidth (GB/s) for serialization latency,
- optional capacity for contention modeling.

Topology builders MUST ensure:

- required links exist,
- link parameters are consistent with topology intent.

---

## 4. Output, Debuggability, and Diagrams

The simulator MUST provide:

- per-request hop-by-hop traces with timestamps,
- clear error messages for missing connectivity
  (e.g., "no link for A → B"),
- reproducible, inspectable representations of the modeled system.

Diagrams are **derived artifacts** of the simulator model:

- They MUST be generatable from the **compiled topology** and **distance metadata**
  used by execution and routing.
- Generation MAY be performed lazily or cached by the implementation,
  as long as outputs remain consistent with the compiled topology.

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005.
Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

- `docs/diagrams/`

---

## 5. Non-Goals (for now)

The following are explicitly out of scope:

- cycle-accurate microarchitecture modeling,
- detailed cache coherence protocols,
- full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.

---

## 6. Decision Boundaries

- SPEC.md defines architectural intent and invariants.
- Code implements SPEC and MUST NOT introduce hidden invariants.
- Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
- ADRs record non-trivial architectural decisions and MUST be referenced when relevant.