KernBench System-Level Simulator — SPEC
This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems. All implementations, tests, and changes MUST conform to this SPEC.
0. Goal
Build a system-level, discrete-event simulator to evaluate the performance of LLM kernels running on our AI Accelerator SIP-based systems, under varying SIP architectures, topologies, and interconnect configurations.
The simulator models data-movement and control paths across the full hardware hierarchy and computes end-to-end execution latency for kernel executions dispatched to Processing Elements (PEs).
Primary objectives:
- compare LLM kernel execution latency under different system configurations
- model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
- guarantee deterministic, verifiable behavior with strong debuggability
- support visual inspection of the modeled system at multiple abstraction levels
0.1 Golden Invariants (Must NOT be violated)
- End-to-end latency is computed strictly by explicit traversal over modeled components and links.
- Every routed request MUST incur latency > 0.
- Routing decisions MUST be deterministic given (topology + routing policy + request).
- All valid request flows MUST have explicit connectivity in the model.
- No hidden shortcuts, implicit bypasses, or magic paths are allowed.
- Architectural decisions documented in ADRs override local optimizations.
0.2 Architectural References (ADRs)
Major architectural decisions are documented in ADRs and referenced by number.
- ADR-0001: PhysAddr layout & address decoding contract
- ADR-0002: Routing distance, ordering, and bypass rules
- ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
- ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
- ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
- ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
- ADR-0007: runtime_api vs sim_engine responsibility boundaries
- ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
- ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: CLI device selection and multi-device execution semantics
- ADR-0011: Memory addressing simplification (PA-first)
- ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
- ADR-0013: Verification strategy and Phase 1 test plan
This SPEC MUST remain consistent with accepted ADRs.
1. Core Requirements
R1. Correct Routing and Control Path
- A request MUST traverse the correct sequence of components based on:
  - source location,
  - destination address or placement tags,
  - routing policy and available topology connectivity.
- Local vs remote traffic MUST be distinguishable:
  - same SIP vs different SIP,
  - same CUBE vs different CUBE,
  - (optional) same PE-group vs cross PE-group.
- Routing behavior MUST be reproducible and deterministic.
R2. Latency is Computed by Traversal
End-to-end latency is the sum of:
- per-node fixed latency (processing / router delay),
- per-link latency (fixed and/or size-aware serialization: bytes / BW),
- per-service latency (e.g., memory controller service time).
The simulator MUST:
- support both fixed and size-aware latency,
- emit hop-by-hop traces with timestamps and component identifiers.
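The latency sum above can be sketched as a walk over an explicit path. This is a minimal illustration, not the simulator's implementation: the `Hop` type, its field names, and the example path are hypothetical, and the sketch assumes decimal units so that 1 GB/s equals 1 byte/ns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One traversed element: a node, a link, or a service (hypothetical type)."""
    component_id: str
    fixed_ns: float = 0.0        # fixed node / link / service latency
    bandwidth_gbps: float = 0.0  # GB/s; 0 means no size-aware term


def hop_latency_ns(hop: Hop, size_bytes: int) -> float:
    """Fixed latency plus optional serialization latency (bytes / BW)."""
    latency = hop.fixed_ns
    if hop.bandwidth_gbps > 0:
        # Assumes decimal GB: 1 GB/s == 1 byte/ns, so the quotient is in ns.
        latency += size_bytes / hop.bandwidth_gbps
    return latency


def traverse(path: list[Hop], size_bytes: int, start_ns: float = 0.0):
    """Return (end-to-end latency, hop-by-hop trace) for an explicit path."""
    now = start_ns
    trace = []
    for hop in path:
        arrive = now
        now += hop_latency_ns(hop, size_bytes)
        # Each trace entry: (component identifier, arrival time, departure time).
        trace.append((hop.component_id, arrive, now))
    return now - start_ns, trace


# Made-up path and parameters, for illustration only.
path = [
    Hop("sip0.cube0.pe0", fixed_ns=2.0),
    Hop("sip0.cube0.pe0->sip0.cube0.fabric", fixed_ns=1.0, bandwidth_gbps=64.0),
    Hop("sip0.cube0.fabric", fixed_ns=3.0),
    Hop("sip0.cube0.hbm_ctrl", fixed_ns=20.0),
]
total_ns, trace = traverse(path, size_bytes=4096)
```

Keeping the trace as a by-product of the same loop that accumulates latency means the reported total cannot diverge from the hops that were actually traversed.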
R3. Topology is Configurable and Variable
Topology MUST NOT be hardcoded.
The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:
- SIP count,
- CUBE count per SIP,
- PE count per CUBE,
- on-chip fabric structure (e.g., mesh / NoC / XBAR),
- IO chiplets and interconnects,
- link bandwidth, latency, and capacity parameters.
Given a topology:
- all required request flows MUST have valid connectivity,
- missing links are a topology construction error, not a routing error.
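As an illustration of treating missing links as a construction error, the sketch below validates a dict-form topology before any request is routed. The schema (`nodes`, `links`, `required_flows`), the `TopologyError` name, and all parameter values are assumptions made for this example; the SPEC does not define them.

```python
from collections import deque


class TopologyError(ValueError):
    """Raised at construction time when required connectivity is missing."""


def reachable(adjacency: dict, src: str, dst: str) -> bool:
    """Breadth-first search over directed links."""
    seen = {src}
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return True
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


def check_required_flows(topology: dict) -> None:
    """Fail topology construction if any required flow has no path."""
    adjacency = {node: [] for node in topology["nodes"]}
    for link in topology["links"]:
        src, dst = link["src"], link["dst"]
        if src not in adjacency or dst not in adjacency:
            raise TopologyError(f"link {src} -> {dst} references an unknown node")
        adjacency[src].append(dst)
    for src, dst in topology["required_flows"]:
        if src not in adjacency or dst not in adjacency:
            raise TopologyError(f"flow {src} -> {dst} references an unknown node")
        if not reachable(adjacency, src, dst):
            raise TopologyError(f"no path for {src} -> {dst}")


# Hypothetical single-PE topology; all values are made up.
PE, FABRIC, HBM = "sip0.cube0.pe0", "sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"
topology = {
    "nodes": [PE, FABRIC, HBM],
    "links": [
        {"src": PE, "dst": FABRIC, "latency_ns": 1.0, "bandwidth_gbps": 64.0},
        {"src": FABRIC, "dst": PE, "latency_ns": 1.0, "bandwidth_gbps": 64.0},
        {"src": FABRIC, "dst": HBM, "latency_ns": 2.0, "bandwidth_gbps": 128.0},
        {"src": HBM, "dst": FABRIC, "latency_ns": 2.0, "bandwidth_gbps": 128.0},
    ],
    "required_flows": [(PE, HBM), (HBM, PE)],
}
check_required_flows(topology)  # passes silently when connectivity is complete
```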
R4. DI-First Component Design (Swappable Implementations)
All components MUST be replaceable behind stable interfaces, including:
- routers and fabrics (NoC, bridges, switches),
- XBAR-like selectors,
- DMA engines and queues,
- memory controllers and services (HBM, TCM, queues),
- management and control processors (modeled components).
The simulator MUST:
- use dependency injection (DI) to bind node specifications to implementation classes,
- allow component swapping without changing test logic,
- avoid leaking routing or policy logic into unrelated components.
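One possible shape for the DI binding is a registry that maps an implementation key in the node specification to a class. The registry, the `impl` and `params` keys, and both router classes are hypothetical names used only to show how an implementation could be swapped without touching the calling code.

```python
from typing import Protocol


class Component(Protocol):
    """Stable interface every swappable component exposes (assumed shape)."""

    def latency_ns(self, size_bytes: int) -> float: ...


REGISTRY: dict[str, type] = {}


def register(kind: str):
    """Class decorator that binds an implementation key to a class."""
    def decorator(cls):
        REGISTRY[kind] = cls
        return cls
    return decorator


@register("fixed_router")
class FixedLatencyRouter:
    def __init__(self, node_id: str, fixed_ns: float):
        self.node_id = node_id
        self.fixed_ns = fixed_ns

    def latency_ns(self, size_bytes: int) -> float:
        return self.fixed_ns


@register("serializing_router")
class SerializingRouter:
    def __init__(self, node_id: str, fixed_ns: float, bandwidth_gbps: float):
        self.node_id = node_id
        self.fixed_ns = fixed_ns
        self.bandwidth_gbps = bandwidth_gbps

    def latency_ns(self, size_bytes: int) -> float:
        # Assumes decimal GB: 1 GB/s == 1 byte/ns.
        return self.fixed_ns + size_bytes / self.bandwidth_gbps


def build_component(spec: dict) -> Component:
    """Resolve a node spec to an implementation instance via the registry."""
    cls = REGISTRY[spec["impl"]]
    return cls(spec["id"], **spec.get("params", {}))


# Swapping the implementation is a spec change only; callers stay the same.
fixed = build_component(
    {"id": "sip0.cube0.fabric", "impl": "fixed_router", "params": {"fixed_ns": 3.0}}
)
sized = build_component(
    {
        "id": "sip0.cube0.fabric",
        "impl": "serializing_router",
        "params": {"fixed_ns": 3.0, "bandwidth_gbps": 64.0},
    }
)
```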
R5. Multi-Domain Communication Modeling
The simulator MUST model communication across hierarchical domains, including:
- PE ↔ local HBM
- PE ↔ remote HBM in the same CUBE
- PE ↔ remote HBM in other CUBEs within the same SIP
- PE ↔ remote HBM in other SIPs
- PE ↔ PE messaging (e.g., IPCQ)
- PE ↔ IO chiplets
- CUBE ↔ CUBE (e.g., via UCIe)
- SIP ↔ SIP (e.g., via PCIe or UAL)
Policy-based bypass is allowed ONLY if:
- the bypass path is explicitly represented in the model,
- the bypass incurs non-zero latency,
- the bypass is visible in traces and diagrams.
R6. Verification-Driven Development
Development MUST follow a verification-driven workflow:
- behavior is validated by tests with meaningful input cases,
- tests encode SPEC-defined invariants, not incidental implementation details,
- changes without clear verification coverage are not allowed.
R7. Runtime API
The simulator MUST provide a host-facing runtime API that:
- exposes tensor deployment and kernel execution operations,
- submits requests only to endpoint components (e.g., IO_CPU),
- owns host-side tensor handles and allocation metadata as PA shard maps,
- remains topology-agnostic and does not perform routing or fan-out.
Tensor deployment in Phase 0 produces device physical-address (PA) shard mappings.
Each shard explicitly identifies its target (sip, cube, pe) and PA range.
No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.
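A PA shard map of this kind could be represented as follows. The class and field names (`PAShard`, `TensorHandle`, `pa_start`, `size_bytes`) and the addresses are illustrative assumptions; the actual layout is governed by the ADRs referenced above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAShard:
    """One contiguous PA range tagged with its target (sip, cube, pe)."""
    sip: int
    cube: int
    pe: int
    pa_start: int
    size_bytes: int

    def __post_init__(self):
        if self.size_bytes <= 0:
            raise ValueError("a shard must cover at least one byte")

    @property
    def pa_end(self) -> int:
        """Exclusive end of the PA range."""
        return self.pa_start + self.size_bytes

    @property
    def target(self) -> tuple[int, int, int]:
        return (self.sip, self.cube, self.pe)


@dataclass(frozen=True)
class TensorHandle:
    """Host-side handle: a name plus the PA shard map it owns."""
    name: str
    shards: tuple[PAShard, ...]

    @property
    def size_bytes(self) -> int:
        return sum(shard.size_bytes for shard in self.shards)


# Made-up addresses and sizes, for illustration only.
handle = TensorHandle(
    name="weights",
    shards=(
        PAShard(sip=0, cube=0, pe=0, pa_start=0x0000_1000, size_bytes=4096),
        PAShard(sip=0, cube=0, pe=1, pa_start=0x0001_0000, size_bytes=4096),
    ),
)
```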
R8. Simulation Engine
The simulator MUST include a discrete-event simulation engine that:
- injects requests into the system graph,
- schedules events deterministically,
- tracks completion via correlation identifiers,
- decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).
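Deterministic scheduling is commonly achieved by ordering events on (time, insertion sequence), so that ties never depend on object identity or hash order. The sketch below shows that technique with Python's `heapq`; the `EventQueue` class and its method names are hypothetical and do not describe the actual sim_engine.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete-event queue with a deterministic tie-break."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker
        self.now_ns = 0.0

    def schedule(self, delay_ns: float, correlation_id: str, action) -> None:
        if delay_ns < 0:
            raise ValueError("events cannot be scheduled in the past")
        # The unique sequence number guarantees the tuple comparison never
        # reaches correlation_id or the (non-comparable) action callable.
        entry = (self.now_ns + delay_ns, next(self._seq), correlation_id, action)
        heapq.heappush(self._heap, entry)

    def run(self) -> list:
        """Run to completion; return (time, correlation_id) in firing order."""
        fired = []
        while self._heap:
            time_ns, _, correlation_id, action = heapq.heappop(self._heap)
            self.now_ns = time_ns
            action()
            fired.append((time_ns, correlation_id))
        return fired


log = []
queue = EventQueue()
queue.schedule(5.0, "req-b", lambda: log.append("b"))
queue.schedule(5.0, "req-a", lambda: log.append("a"))  # same time as req-b
queue.schedule(1.0, "req-c", lambda: log.append("c"))
fired = queue.run()
```

With this ordering, two events scheduled for the same timestamp always fire in the order they were scheduled, so repeated runs of the same input produce identical traces.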
R9. CLI Execution Semantics
The CLI MUST support executing benchmarks on a specified device.
Benchmarks are executed once per invocation within a single simulation instance. If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.
R10. Memory Addressing (Phase 0)
In Phase 0, the simulator uses a PA-first memory model:
- All memory operations use device physical addresses (PA) only.
- Virtual addressing, MMU/IOMMU, and address translation latency are out of scope.
- Tensor placement is represented as a list of PA shards, each explicitly tagged with (sip, cube, pe).
All memory access latency MUST be modeled explicitly via graph traversal. No implicit translation or hidden latency is allowed.
2. Model Concepts
2.1 Graph Execution Model
- Nodes represent modeled components (PE blocks, XBAR, NoC, bridges, HBM controllers, IO components, etc.).
- Directed edges represent interconnect links with latency and bandwidth attributes.
- Execution model:
  - a node receives a request,
  - incurs node or service latency,
  - emits the request to the next hop via a link,
  - repeats until the destination service completes.
2.2 Routing
Routing MAY be implemented as:
- policy-based routing (code-driven),
- routing tables (config-driven),
- topology-driven routing (e.g., mesh XY),
- or a hybrid approach.
Routing MUST:
- consume decoded address domains or explicit placement tags,
- operate only on explicit topology connectivity,
- remain deterministic.
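As one example of topology-driven routing that satisfies these constraints, the sketch below implements dimension-ordered (XY) routing on a 2D mesh and refuses to step across a link that is not present in the explicit link set. The coordinate representation and the function names are assumptions for illustration.

```python
Coord = tuple[int, int]


def mesh_links(width: int, height: int) -> set[tuple[Coord, Coord]]:
    """Directed links between orthogonal neighbours of a width x height mesh."""
    links = set()
    for x in range(width):
        for y in range(height):
            if x + 1 < width:
                links.add(((x, y), (x + 1, y)))
                links.add(((x + 1, y), (x, y)))
            if y + 1 < height:
                links.add(((x, y), (x, y + 1)))
                links.add(((x, y + 1), (x, y)))
    return links


def xy_route(src: Coord, dst: Coord, links: set) -> list[Coord]:
    """Route along X first, then Y, using only links that exist."""
    path = [src]
    x, y = src
    while (x, y) != dst:
        if x != dst[0]:
            nxt = (x + (1 if dst[0] > x else -1), y)
        else:
            nxt = (x, y + (1 if dst[1] > y else -1))
        if ((x, y), nxt) not in links:
            # No implicit shortcut: a missing link is reported, not bypassed.
            raise LookupError(f"no link for {(x, y)} -> {nxt}")
        path.append(nxt)
        x, y = nxt
    return path


links = mesh_links(width=3, height=2)
route = xy_route((0, 0), (2, 1), links)
```

The result depends only on the source, the destination, and the link set, so the same inputs always yield the same path.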
Kernel execution requests reference tensors via PA shard mappings. Each shard explicitly identifies its target PE, which allows IO_CPU to fan out execution deterministically without relying on PA decoding.
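A tag-driven fan-out of this kind might look like the sketch below, which groups shards by their (sip, cube, pe) tag and emits the groups in sorted order. The dict keys and the `fan_out` function are hypothetical; completion semantics are defined in ADR-0009 and are not modeled here.

```python
from collections import defaultdict


def fan_out(shards: list[dict]) -> list[tuple[tuple[int, int, int], list[dict]]]:
    """Group shards by target PE tag; no PA decoding is involved."""
    by_pe = defaultdict(list)
    for shard in shards:
        by_pe[(shard["sip"], shard["cube"], shard["pe"])].append(shard)
    # Sorting the tags makes the dispatch order independent of input order.
    return [(target, by_pe[target]) for target in sorted(by_pe)]


# Made-up shard list, for illustration only.
shards = [
    {"sip": 0, "cube": 1, "pe": 0, "pa_start": 0x3000, "size_bytes": 1024},
    {"sip": 0, "cube": 0, "pe": 1, "pa_start": 0x2000, "size_bytes": 1024},
    {"sip": 0, "cube": 0, "pe": 1, "pa_start": 0x2400, "size_bytes": 1024},
]
launches = fan_out(shards)
targets = [target for target, _ in launches]
```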
3. Inputs and Identity
3.1 Node Identity Scheme
Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.
Example patterns:
tray.host_cpu
sip{S}.io{I}.pcie_ep
sip{S}.cube{C}.fabric
sip{S}.cube{C}.pe{P}
sip{S}.cube{C}.hbm_ctrl
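Because identifiers follow a dotted `sip{S}.cube{C}.pe{P}` prefix scheme, domain inference can be done with a small parser. The regular expression and the `parse_domain` helper below are an illustrative sketch that assumes the indices are decimal integers; they are not part of the SPEC.

```python
import re

# Matches the leading sip / cube / pe segments; later segments are ignored.
_DOMAIN = re.compile(
    r"^sip(?P<sip>\d+)(?:\.cube(?P<cube>\d+))?(?:\.pe(?P<pe>\d+))?"
)


def parse_domain(node_id: str) -> dict[str, int]:
    """Return the sip / cube / pe indices that can be inferred from a node ID."""
    match = _DOMAIN.match(node_id)
    if match is None:
        return {}  # e.g. tray-level nodes that belong to no SIP
    return {
        key: int(value)
        for key, value in match.groupdict().items()
        if value is not None
    }


pe_domain = parse_domain("sip1.cube2.pe3")
hbm_domain = parse_domain("sip0.cube1.hbm_ctrl")
io_domain = parse_domain("sip0.io1.pcie_ep")
host_domain = parse_domain("tray.host_cpu")
```

Comparing two parsed domains field by field is then enough to tell same-SIP from cross-SIP traffic, or same-CUBE from cross-CUBE traffic, directly from trace output.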
3.2 Link Specifications
A link MAY include:
- fixed latency (ns),
- bandwidth (GB/s) for serialization latency,
- optional capacity for contention modeling.
Topology builders MUST ensure:
- required links exist,
- link parameters are consistent with topology intent.
4. Output, Debuggability, and Diagrams
The simulator MUST provide:
- per-request hop-by-hop traces with timestamps,
- clear error messages for missing connectivity (e.g., "no link for A → B"),
- reproducible, inspectable representations of the modeled system.
Diagrams are derived artifacts of the simulator model:
- They MUST be generatable from the compiled topology and distance metadata used by execution and routing.
- Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.
Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005. Automatic diagram generation and output conventions are defined in ADR-0006.
By default, generated diagrams are written under:
docs/diagrams/
5. Non-Goals (for now)
The following are explicitly out of scope:
- cycle-accurate microarchitecture modeling,
- detailed cache coherence protocols,
- full PCIe / CXL protocol correctness.
These MAY be layered later via additional components and policies.
6. Decision Boundaries
- SPEC.md defines architectural intent and invariants.
- Code implements SPEC and MUST NOT introduce hidden invariants.
- Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
- ADRs record non-trivial architectural decisions and MUST be referenced when relevant.