# KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems.
All implementations, tests, and changes MUST conform to this SPEC.

---

## 0. Goal

Build a **system-level, discrete-event simulator** to evaluate the performance of **LLM kernels running on our AI Accelerator SIP-based systems**, under varying **SIP architectures, topologies, and interconnect configurations**.

The simulator models **data-movement and control paths across the full hardware hierarchy** and computes **end-to-end execution latency** for kernel executions dispatched to Processing Elements (PEs).

Primary objectives:

- compare LLM kernel execution latency under different system configurations
- model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
- guarantee deterministic, verifiable behavior with strong debuggability
- support visual inspection of the modeled system at multiple abstraction levels

---

## 0.1 Golden Invariants (Must NOT be violated)

- End-to-end latency is computed **strictly by explicit traversal** over modeled components and links.
- Every routed request MUST incur **latency > 0**.
- Routing decisions MUST be **deterministic** given (topology + routing policy + request).
- All valid request flows MUST have explicit connectivity in the model.
- No hidden shortcuts, implicit bypasses, or magic paths are allowed.
- Architectural decisions documented in ADRs override local optimizations.

---

## 0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.
- ADR-0001: PhysAddr layout & address decoding contract
- ADR-0002: Routing distance, ordering, and bypass rules
- ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
- ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
- ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
- ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
- ADR-0007: runtime_api vs sim_engine responsibility boundaries
- ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
- ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: CLI device selection and multi-device execution semantics
- ADR-0011: Memory addressing simplification (PA-first)
- ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
- ADR-0013: Verification strategy and Phase 1 test plan

SPEC MUST remain consistent with accepted ADRs.

---

## 1. Core Requirements

### R1. Correct Routing and Control Path

- A request MUST traverse the correct sequence of components based on:
  - source location,
  - destination address or placement tags,
  - routing policy and available topology connectivity.
- Local vs remote traffic MUST be distinguishable:
  - same SIP vs different SIP,
  - same CUBE vs different CUBE,
  - (optional) same PE-group vs cross PE-group.
- Routing behavior MUST be reproducible and deterministic.

---

### R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

- per-node fixed latency (processing / router delay),
- per-link latency (fixed and/or size-aware serialization: bytes / BW),
- per-service latency (e.g., memory controller service time).

The simulator MUST:

- support both fixed and size-aware latency,
- emit hop-by-hop traces with timestamps and component identifiers.

---

### R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

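
As a minimal sketch of what a non-hardcoded topology can look like, the snippet below expands a plain dict into nodes and links and treats a missing required link as a construction-time error. The dict schema (`sips`, `cubes_per_sip`, `pes_per_cube`, `links`), the `Link` record, `TopologyError`, `build_topology`, and the choice of "required" flows are all hypothetical and not defined by this SPEC; node identifiers follow the example patterns in Section 3.1.

```python
from dataclasses import dataclass


class TopologyError(ValueError):
    """Raised at construction time when required connectivity is missing."""


@dataclass(frozen=True)
class Link:
    src: str
    dst: str
    latency_ns: float
    bandwidth_gbps: float  # GB/s, used for size-aware serialization


def build_topology(cfg: dict) -> dict:
    """Expand a config dict into nodes and links, then validate connectivity."""
    nodes = set()
    for s in range(cfg["sips"]):
        for c in range(cfg["cubes_per_sip"]):
            nodes.add(f"sip{s}.cube{c}.fabric")
            nodes.add(f"sip{s}.cube{c}.hbm_ctrl")
            for p in range(cfg["pes_per_cube"]):
                nodes.add(f"sip{s}.cube{c}.pe{p}")

    links = {}
    for raw in cfg["links"]:
        link = Link(**raw)
        if link.src not in nodes or link.dst not in nodes:
            raise TopologyError(f"link references unknown node: {link.src} -> {link.dst}")
        links[(link.src, link.dst)] = link

    # Flows this sketch assumes are required: PE -> local fabric -> local HBM controller.
    required = []
    for node in sorted(nodes):
        prefix, leaf = node.rsplit(".", 1)
        if leaf.startswith("pe"):
            required.append((node, f"{prefix}.fabric"))
        elif leaf == "fabric":
            required.append((node, f"{prefix}.hbm_ctrl"))
    for src, dst in required:
        if (src, dst) not in links:
            raise TopologyError(f"no link for {src} -> {dst}")

    return {"nodes": sorted(nodes), "links": links}


# Example configuration: one SIP, one CUBE, two PEs (all values are illustrative).
cfg = {
    "sips": 1,
    "cubes_per_sip": 1,
    "pes_per_cube": 2,
    "links": [
        {"src": "sip0.cube0.pe0", "dst": "sip0.cube0.fabric", "latency_ns": 2.0, "bandwidth_gbps": 256.0},
        {"src": "sip0.cube0.pe1", "dst": "sip0.cube0.fabric", "latency_ns": 2.0, "bandwidth_gbps": 256.0},
        {"src": "sip0.cube0.fabric", "dst": "sip0.cube0.hbm_ctrl", "latency_ns": 5.0, "bandwidth_gbps": 512.0},
    ],
}
topo = build_topology(cfg)
```

Validating at build time keeps the distinction stated in this requirement: a missing link surfaces before any request is routed, so it cannot be mistaken for a routing failure.
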
The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

- SIP count,
- CUBE count per SIP,
- PE count per CUBE,
- on-chip fabric structure (e.g., mesh / NoC / XBAR),
- IO chiplets and interconnects,
- link bandwidth, latency, and capacity parameters.

Given a topology:

- all required request flows MUST have valid connectivity,
- missing links are a topology construction error, not a routing error.

---

### R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

- routers and fabrics (NoC, bridges, switches),
- XBAR-like selectors,
- DMA engines and queues,
- memory controllers and services (HBM, TCM, queues),
- management and control processors (modeled components).

The simulator MUST:

- use dependency injection (DI) to bind node specifications to implementation classes,
- allow component swapping without changing test logic,
- avoid leaking routing or policy logic into unrelated components.

---

### R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

- PE ↔ local HBM
- PE ↔ remote HBM in the same CUBE
- PE ↔ remote HBM in other CUBEs within the same SIP
- PE ↔ remote HBM in other SIPs
- PE ↔ PE messaging (e.g., IPCQ)
- PE ↔ IO chiplets
- CUBE ↔ CUBE (e.g., via UCIe)
- SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

- the bypass path is explicitly represented in the model,
- the bypass incurs non-zero latency,
- the bypass is visible in traces and diagrams.

---

### R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

- behavior is validated by tests with meaningful input cases,
- tests encode SPEC-defined invariants, not incidental implementation details,
- changes without clear verification coverage are not allowed.

---

## R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

- exposes tensor deployment and kernel execution operations,
- submits requests only to endpoint components (e.g., IO_CPU),
- owns host-side tensor handles and allocation metadata as PA shard maps,
- remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces **device physical-address (PA) shard mappings**.
Each shard explicitly identifies its target `(sip, cube, pe)` and PA range.
No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.

---

## R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

- injects requests into the system graph,
- schedules events deterministically,
- tracks completion via correlation identifiers,
- decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).

---

## R9. CLI Execution Semantics

The CLI MUST support executing benchmarks:

- on a specified device.

Benchmarks are executed once per invocation within a single simulation instance.
If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.

---

## R10. Memory Addressing (Phase 0)

In Phase 0, the simulator uses a **PA-first memory model**:

- All memory operations use device physical addresses (PA) only.
- Virtual addressing, MMU/IOMMU, and address translation latency are out of scope.
- Tensor placement is represented as a list of PA shards, each explicitly tagged with `(sip, cube, pe)`.

All memory access latency MUST be modeled explicitly via graph traversal.
No implicit translation or hidden latency is allowed.

---

## 2. Model Concepts

### 2.1 Graph Execution Model

- Nodes represent modeled components (PE blocks, XBAR, NoC, bridges, HBM controllers, IO components, etc.).
- Directed edges represent interconnect links with latency and bandwidth attributes.
- Execution model:
  - a node receives a request,
  - incurs node or service latency,
  - emits the request to the next hop via a link,
  - repeats until the destination service completes.

---

### 2.2 Routing

Routing MAY be implemented as:

- policy-based routing (code-driven),
- routing tables (config-driven),
- topology-driven routing (e.g., mesh XY),
- or a hybrid approach.

Routing MUST:

- consume decoded address domains or explicit placement tags,
- operate only on explicit topology connectivity,
- remain deterministic.

Kernel execution requests reference tensors via PA shard mappings.
Each shard explicitly identifies its target PE, allowing IO_CPU to fan out execution deterministically without relying on PA decoding.

---

## 3. Inputs and Identity

### 3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.

Example patterns:

- `tray.host_cpu`
- `sip{S}.io{I}.pcie_ep`
- `sip{S}.cube{C}.fabric`
- `sip{S}.cube{C}.pe{P}`
- `sip{S}.cube{C}.hbm_ctrl`

---

### 3.2 Link Specifications

A link MAY include:

- fixed latency (ns),
- bandwidth (GB/s) for serialization latency,
- optional capacity for contention modeling.

Topology builders MUST ensure:

- required links exist,
- link parameters are consistent with topology intent.

---

## 4. Output, Debuggability, and Diagrams

The simulator MUST provide:

- per-request hop-by-hop traces with timestamps,
- clear error messages for missing connectivity (e.g., "no link for A → B"),
- reproducible, inspectable representations of the modeled system.

Diagrams are **derived artifacts** of the simulator model:

- They MUST be generatable from the **compiled topology** and **distance metadata** used by execution and routing.
- Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.

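
To illustrate what "derived artifact" can mean, here is a minimal sketch that renders a compiled topology as Graphviz DOT text. The choice of DOT as the output format, the `to_dot` function, and the `(src, dst) -> latency_ns` link map are assumptions made for illustration only; distance metadata and layout rules are omitted.

```python
def to_dot(nodes: list[str], links: dict[tuple[str, str], float]) -> str:
    """Render nodes and directed links as DOT text.

    Nodes and links are emitted in sorted order so that the same compiled
    topology always yields identical text, regardless of input ordering.
    """
    lines = ["digraph topology {"]
    for name in sorted(nodes):
        lines.append(f'  "{name}";')
    for (src, dst), latency_ns in sorted(links.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{latency_ns} ns"];')
    lines.append("}")
    return "\n".join(lines)


# Illustrative compiled topology: one PE, its fabric, and the HBM controller.
nodes = ["sip0.cube0.pe0", "sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"]
links = {
    ("sip0.cube0.pe0", "sip0.cube0.fabric"): 2.0,
    ("sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"): 5.0,
}
dot_text = to_dot(nodes, links)
```

Because the text depends only on the node and link sets, a cached diagram can be checked against a freshly generated one with a plain string comparison.
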

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005.
Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

- `docs/diagrams/`

---

## 5. Non-Goals (for now)

The following are explicitly out of scope:

- cycle-accurate microarchitecture modeling,
- detailed cache coherence protocols,
- full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.

---

## 6. Decision Boundaries

- SPEC.md defines architectural intent and invariants.
- Code implements SPEC and MUST NOT introduce hidden invariants.
- Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
- ADRs record non-trivial architectural decisions and MUST be referenced when relevant.
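
As a sketch of a test that encodes SPEC invariants rather than a fixed topology, the snippet below computes latency by explicit traversal (R2) over a caller-supplied path and checks the same invariants against two different configurations. The `Hop` record, the `traverse` function, and all parameter values are hypothetical; the sketch assumes a router has already chosen the path, and it omits contention and queuing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    node: str               # component identifier
    node_latency_ns: float  # fixed processing / service delay at this node
    link_latency_ns: float  # fixed latency of the outgoing link (0 on the last hop)
    link_bw_gbps: float     # outgoing link bandwidth in GB/s (0 disables serialization)


def traverse(path: list[Hop], size_bytes: int) -> tuple[float, list[tuple[float, str]]]:
    """Sum node, link, and serialization latency hop by hop.

    Returns (total_ns, trace), where trace holds (arrival_timestamp_ns, node_id).
    """
    now = 0.0
    trace = []
    for hop in path:
        trace.append((now, hop.node))
        now += hop.node_latency_ns + hop.link_latency_ns
        if hop.link_bw_gbps > 0:
            # 1 GB/s equals 1 byte/ns, so bytes / (GB/s) yields nanoseconds.
            now += size_bytes / hop.link_bw_gbps
    return now, trace


# Two illustrative paths: local HBM access and HBM access in another CUBE.
local_path = [
    Hop("sip0.cube0.pe0", 1.0, 2.0, 256.0),
    Hop("sip0.cube0.fabric", 3.0, 2.0, 512.0),
    Hop("sip0.cube0.hbm_ctrl", 20.0, 0.0, 0.0),
]
remote_path = [
    Hop("sip0.cube0.pe0", 1.0, 2.0, 256.0),
    Hop("sip0.cube0.fabric", 3.0, 10.0, 64.0),
    Hop("sip0.cube1.fabric", 3.0, 2.0, 512.0),
    Hop("sip0.cube1.hbm_ctrl", 20.0, 0.0, 0.0),
]

for path in (local_path, remote_path):
    total_ns, trace = traverse(path, size_bytes=4096)
    assert total_ns > 0                                          # every request incurs latency > 0
    assert [t for t, _ in trace] == sorted(t for t, _ in trace)  # timestamps never go backwards
    assert traverse(path, size_bytes=4096) == (total_ns, trace)  # same inputs, same result
```

The loop body never names a specific node or hop count, so adding a third configuration requires no change to the assertions.
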