kernbench2/SPEC.md
ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)
- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 17:51:28 -07:00


KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems. All implementations, tests, and changes MUST conform to this SPEC.


0. Goal

Build a system-level, discrete-event simulator to evaluate the performance of LLM kernels running on our AI Accelerator SIP-based systems, under varying SIP architectures, topologies, and interconnect configurations.

The simulator models data-movement and control paths across the full hardware hierarchy and computes end-to-end execution latency for kernel executions dispatched to Processing Elements (PEs).

Primary objectives:

  • compare LLM kernel execution latency under different system configurations
  • model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
  • guarantee deterministic, verifiable behavior with strong debuggability
  • support visual inspection of the modeled system at multiple abstraction levels

0.1 Golden Invariants (Must NOT be violated)

  • End-to-end latency is computed strictly by explicit traversal over modeled components and links.
  • Every routed request MUST incur latency > 0.
  • Routing decisions MUST be deterministic given (topology + routing policy + request).
  • All valid request flows MUST have explicit connectivity in the model.
  • No hidden shortcuts, implicit bypasses, or magic paths are allowed.
  • Architectural decisions documented in ADRs override local optimizations.

0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.

  • ADR-0001: PhysAddr layout & address decoding contract
  • ADR-0002: Routing distance, ordering, and bypass rules
  • ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
  • ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
  • ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
  • ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
  • ADR-0007: runtime_api vs sim_engine responsibility boundaries
  • ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
  • ADR-0009: Kernel execution fan-out and completion semantics
  • ADR-0010: CLI device selection and multi-device execution semantics
  • ADR-0011: Memory addressing simplification (PA-first)
  • ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
  • ADR-0013: Verification strategy and Phase 1 test plan
  • ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
  • ADR-0015: Component port/wire model, BW occupancy, and fabric routing
  • ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
  • ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)

SPEC MUST remain consistent with accepted ADRs.


1. Core Requirements

R1. Correct Routing and Control Path

  • A request MUST traverse the correct sequence of components based on:
    • source location,
    • destination address or placement tags,
    • routing policy and available topology connectivity.
  • Local vs remote traffic MUST be distinguishable:
    • same SIP vs different SIP,
    • same CUBE vs different CUBE,
    • (optional) same PE-group vs cross PE-group.
  • Routing behavior MUST be reproducible and deterministic.

R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

  • per-node fixed latency (processing / router delay),
  • per-link latency (fixed and/or size-aware serialization: bytes / BW),
  • per-service latency (e.g., memory controller service time).

The simulator MUST:

  • support both fixed and size-aware latency,
  • emit hop-by-hop traces with timestamps and component identifiers.
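
The sum above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator's implementation: `Hop`, `traverse`, the node names, and all numbers are hypothetical. It assumes bandwidth is given in GB/s, so that 1 GB/s equals 1 byte/ns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One traversal step: the link into a node plus that node's own delay (hypothetical)."""
    node_id: str
    node_latency_ns: float           # fixed processing / router delay
    link_latency_ns: float           # fixed latency of the link into this node
    link_bw_gbps: float              # link bandwidth in GB/s (1 GB/s == 1 byte/ns)
    service_latency_ns: float = 0.0  # e.g. memory controller service time


def traverse(hops, size_bytes, start_ns=0.0):
    """Return (end-to-end latency in ns, hop-by-hop trace) for one request."""
    now = start_ns
    trace = []
    for hop in hops:
        serialization_ns = size_bytes / hop.link_bw_gbps  # bytes / (bytes per ns)
        now += hop.link_latency_ns + serialization_ns
        now += hop.node_latency_ns + hop.service_latency_ns
        trace.append((now, hop.node_id))  # timestamp at which the hop completes
    return now - start_ns, trace


# Hypothetical two-hop path: one router, then the HBM controller.
path = [
    Hop("sip0.cube0.r0c0", node_latency_ns=2.0, link_latency_ns=1.0, link_bw_gbps=64.0),
    Hop("sip0.cube0.hbm_ctrl", node_latency_ns=0.0, link_latency_ns=1.0,
        link_bw_gbps=64.0, service_latency_ns=40.0),
]
latency_ns, trace = traverse(path, size_bytes=256)
```

With these numbers each link adds 256 / 64 = 4 ns of serialization, so the first hop completes at 7 ns and the request completes at 52 ns.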

R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

  • SIP count,
  • CUBE count per SIP,
  • PE count per CUBE,
  • on-chip fabric structure (e.g., mesh / NoC router grid),
  • IO chiplets and interconnects,
  • link bandwidth, latency, and capacity parameters.

Given a topology:

  • all required request flows MUST have valid connectivity,
  • missing links are a topology construction error, not a routing error.
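
One way to enforce the rule that missing links fail at construction time is to check reachability for every required flow while the topology is compiled. The sketch below assumes a dict-based topology with `nodes` and `links` keys; `TopologyError`, `compile_topology`, and the short node names are hypothetical.

```python
from collections import deque


class TopologyError(ValueError):
    """Raised while building the topology, before any request is routed."""


def _reachable(adjacency, src, dst):
    """Breadth-first search over directed links."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def compile_topology(topology, required_flows):
    """Build a directed adjacency map and verify that every required flow is connected."""
    adjacency = {node: [] for node in topology["nodes"]}
    for link in topology["links"]:
        src, dst = link["src"], link["dst"]
        if src not in adjacency or dst not in adjacency:
            raise TopologyError(f"link {src} -> {dst} references an unknown node")
        adjacency[src].append(dst)
    for src, dst in required_flows:
        if src not in adjacency or dst not in adjacency:
            raise TopologyError(f"flow {src} -> {dst} references an unknown node")
        if not _reachable(adjacency, src, dst):
            raise TopologyError(f"no connectivity for {src} -> {dst}")
    return adjacency


topology = {
    "nodes": ["pe0", "r0c0", "hbm_ctrl"],
    "links": [
        {"src": "pe0", "dst": "r0c0", "latency_ns": 1.0, "bw_gbps": 64.0},
        {"src": "r0c0", "dst": "hbm_ctrl", "latency_ns": 1.0, "bw_gbps": 64.0},
    ],
}
adjacency = compile_topology(topology, required_flows=[("pe0", "hbm_ctrl")])
```

Because links are directed, requesting the reverse flow `("hbm_ctrl", "pe0")` on this topology raises `TopologyError` at build time rather than surfacing later as a routing failure.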

R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

  • routers and fabrics (NoC router mesh, switches),
  • DMA engines and queues,
  • memory controllers and services (HBM, TCM, queues),
  • management and control processors (modeled components).

The simulator MUST:

  • use dependency injection (DI) to bind node specifications to implementation classes,
  • allow component swapping without changing test logic,
  • avoid leaking routing or policy logic into unrelated components.
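
A minimal sketch of DI-style binding, assuming each node specification names its implementation through an `impl` key and a registry maps that key to a class. The `Component` protocol, both router classes, and the spec fields are hypothetical and only illustrate how an implementation can be swapped without touching the calling code.

```python
from typing import Protocol


class Component(Protocol):
    """Stable interface every modeled node implements (hypothetical)."""
    def latency_ns(self, size_bytes: int) -> float: ...


class FixedLatencyRouter:
    """Router with a constant delay."""
    def __init__(self, spec):
        self._delay_ns = spec["delay_ns"]

    def latency_ns(self, size_bytes: int) -> float:
        return self._delay_ns


class SizeAwareRouter:
    """Alternative router whose delay grows with request size."""
    def __init__(self, spec):
        self._delay_ns = spec["delay_ns"]
        self._ns_per_byte = spec["ns_per_byte"]

    def latency_ns(self, size_bytes: int) -> float:
        return self._delay_ns + self._ns_per_byte * size_bytes


def build_components(node_specs, registry):
    """Bind each node spec to the class registered under its 'impl' key."""
    return {node_id: registry[spec["impl"]](spec) for node_id, spec in node_specs.items()}


node_specs = {"r0c0": {"impl": "router", "delay_ns": 2.0, "ns_per_byte": 0.5}}
baseline = build_components(node_specs, {"router": FixedLatencyRouter})
swapped = build_components(node_specs, {"router": SizeAwareRouter})
```

The same `node_specs` produce two different models; only the registry changes, so a test that calls `latency_ns` needs no edits when the implementation is swapped.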

R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

  • PE ↔ local HBM
  • PE ↔ remote HBM in the same CUBE
  • PE ↔ remote HBM in other CUBEs within the same SIP
  • PE ↔ remote HBM in other SIPs
  • PE ↔ PE messaging (e.g., IPCQ)
  • PE ↔ IO chiplets
  • CUBE ↔ CUBE (e.g., via UCIe)
  • SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

  • the bypass path is explicitly represented in the model,
  • the bypass incurs non-zero latency,
  • the bypass is visible in traces and diagrams.

R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

  • behavior is validated by tests with meaningful input cases,
  • tests encode SPEC-defined invariants, not incidental implementation details,
  • changes without clear verification coverage are not allowed.

R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

  • exposes tensor deployment and kernel execution operations,
  • submits requests to endpoint components: PCIE_EP for memory operations (MemoryWrite, MemoryRead) and IO_CPU for kernel launch,
  • owns host-side tensor handles and allocation metadata as PA shard maps,
  • remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces device physical-address (PA) shard mappings. Each shard explicitly identifies its target (sip, cube, pe) and PA range. No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.
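
One possible shape for a host-side tensor handle, shown only as an illustration. `PAShard`, `TensorHandle`, `deploy_tensor`, and the addresses are hypothetical; the actual PhysAddr layout is defined by ADR-0001 and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAShard:
    """One shard: its target (sip, cube, pe) and its PA range."""
    sip: int
    cube: int
    pe: int
    pa_base: int
    size_bytes: int

    @property
    def pa_end(self) -> int:
        return self.pa_base + self.size_bytes  # exclusive upper bound


@dataclass(frozen=True)
class TensorHandle:
    """Host-owned handle; holds allocation metadata only, no routing information."""
    name: str
    shards: tuple


def deploy_tensor(name, shard_bytes, placements):
    """Build a handle from caller-chosen placements: [((sip, cube, pe), pa_base), ...]."""
    shards = tuple(
        PAShard(sip, cube, pe, pa_base, shard_bytes)
        for (sip, cube, pe), pa_base in placements
    )
    return TensorHandle(name, shards)


handle = deploy_tensor(
    "weights",
    shard_bytes=1024,
    placements=[((0, 0, 0), 0x0000), ((0, 0, 1), 0x8000)],
)
```

The handle carries no topology knowledge: it records where each shard lives, and leaves routing and fan-out to the components behind the endpoint.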


R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

  • injects requests into the system graph,
  • schedules events deterministically,
  • tracks completion via correlation identifiers,
  • decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).
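
Deterministic scheduling and completion tracking can be sketched with a priority queue that breaks timestamp ties by insertion order. The example uses only the Python standard library and says nothing about the engine the project actually uses; `EventQueue`, `run_demo`, and the correlation IDs are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Min-heap of (time, sequence, callback); the sequence number breaks ties."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()
        self.now = 0.0

    def schedule(self, delay_ns, callback):
        heapq.heappush(self._heap, (self.now + delay_ns, next(self._seq), callback))

    def run(self):
        while self._heap:
            self.now, _, callback = heapq.heappop(self._heap)
            callback()


def run_demo():
    queue = EventQueue()
    completed = {}  # correlation id -> completion time (ns)

    def inject(corr_id, latency_ns):
        # Record the completion time of this request under its correlation id.
        queue.schedule(latency_ns, lambda: completed.setdefault(corr_id, queue.now))

    inject("memwrite-0", 30.0)
    inject("launch-0", 10.0)
    inject("memread-0", 10.0)  # same timestamp as launch-0; scheduled later, so it fires later
    queue.run()
    return list(completed.items())


result = run_demo()
```

Events with equal timestamps always fire in the order they were scheduled, so repeated runs yield the same completion order.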

R9. CLI Execution Semantics

The CLI MUST support executing benchmarks on a specified device.

Benchmarks are executed once per invocation within a single simulation instance. If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.


R10. Memory Addressing (Phase 0)

The simulator uses a VA/PA memory model (ADR-0011):

  • Tensors are assigned a contiguous virtual address (VA) range at deployment.
  • PE_MMU translates VA→PA per access; TLB overhead is configurable.
  • Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
  • Replicated tensors use per-cube local PA mapping; sharded tensors broadcast.
  • PA-only fallback is retained for backward compatibility.
  • Tensor placement is represented as a list of PA shards, each explicitly tagged with (sip, cube, pe), plus a tensor-wide va_base.

All memory access latency MUST be modeled explicitly via graph traversal. No implicit translation or hidden latency is allowed.
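
The VA→PA step with a configurable TLB cost could look roughly like the sketch below. `ToyMmu` is a hypothetical stand-in for PE_MMU: it assumes linear mappings, a page-granular TLB without eviction, and made-up hit and miss costs. Every translation returns a non-zero latency, in line with the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A contiguous VA range mapped linearly onto a PA range."""
    va_base: int
    pa_base: int
    size_bytes: int


class ToyMmu:
    """Hypothetical MMU model: linear mappings plus a set-based TLB."""
    def __init__(self, page_bytes, tlb_hit_ns, tlb_miss_ns):
        self.page_bytes = page_bytes
        self.tlb_hit_ns = tlb_hit_ns
        self.tlb_miss_ns = tlb_miss_ns
        self._mappings = []
        self._tlb = set()  # virtual page numbers already translated (no eviction here)

    def install(self, mapping):
        self._mappings.append(mapping)

    def translate(self, va):
        """Return (pa, translation_latency_ns); raise LookupError if va is unmapped."""
        for m in self._mappings:
            if m.va_base <= va < m.va_base + m.size_bytes:
                vpn = va // self.page_bytes
                latency_ns = self.tlb_hit_ns if vpn in self._tlb else self.tlb_miss_ns
                self._tlb.add(vpn)
                return m.pa_base + (va - m.va_base), latency_ns
        raise LookupError(f"no mapping for VA {va:#x}")


mmu = ToyMmu(page_bytes=4096, tlb_hit_ns=1.0, tlb_miss_ns=20.0)
mmu.install(Mapping(va_base=0x1000_0000, pa_base=0x4000, size_bytes=8192))
first = mmu.translate(0x1000_0010)   # first touch of the page: TLB miss
second = mmu.translate(0x1000_0020)  # same page again: TLB hit
```

In a full model the returned latency would be charged as part of the traversal through the MMU node rather than added out of band.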


2. Model Concepts

2.1 Graph Execution Model

  • Nodes represent modeled components (PE blocks, NoC routers, HBM controllers, IO components, etc.).
  • Directed edges represent interconnect links with latency and bandwidth attributes.
  • Execution model:
    • a node receives a request,
    • incurs node or service latency,
    • emits the request to the next hop via a link,
    • repeats until the destination service completes.

2.2 Routing

Routing MAY be implemented as:

  • policy-based routing (code-driven),
  • routing tables (config-driven),
  • topology-driven routing (e.g., mesh XY),
  • or a hybrid approach.

Routing MUST:

  • consume decoded address domains or explicit placement tags,
  • operate only on explicit topology connectivity,
  • remain deterministic.
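
As an illustration of topology-driven routing, the sketch below implements dimension-ordered XY routing on a 2D router grid. It assumes a fully connected mesh and routers named `r{row}c{col}`; `xy_route` and `router_id` are hypothetical helpers, and a real router would additionally verify each hop against the explicit links in the topology.

```python
def xy_route(src, dst):
    """Return the (row, col) routers visited from src to dst: X (column) first, then Y (row)."""
    row, col = src
    dst_row, dst_col = dst
    path = [(row, col)]
    while col != dst_col:          # move along the X dimension first
        col += 1 if dst_col > col else -1
        path.append((row, col))
    while row != dst_row:          # then along the Y dimension
        row += 1 if dst_row > row else -1
        path.append((row, col))
    return path


def router_id(coord):
    """Format a (row, col) coordinate as a router name."""
    row, col = coord
    return f"r{row}c{col}"


hops = [router_id(c) for c in xy_route((0, 0), (2, 1))]
```

The path depends only on the source and destination coordinates, so the same request always takes the same route.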

Kernel execution requests reference tensors via PA shard mappings. Each shard explicitly identifies its target PE, allowing IO_CPU to fan out execution deterministically without relying on PA decoding.
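
Fan-out driven purely by shard tags can be sketched as a grouping step. `fan_out` and the dict-based shard layout are hypothetical; the point is that the target PE is read from the tag and the PA is never decoded.

```python
from collections import defaultdict


def fan_out(shards):
    """Group shard PA ranges by target (sip, cube, pe), returned in sorted target order."""
    by_pe = defaultdict(list)
    for shard in shards:
        target = (shard["sip"], shard["cube"], shard["pe"])
        by_pe[target].append((shard["pa_base"], shard["size_bytes"]))
    # Sorting the targets makes the dispatch order independent of input order.
    return [(target, by_pe[target]) for target in sorted(by_pe)]


shards = [
    {"sip": 0, "cube": 1, "pe": 0, "pa_base": 0x2000, "size_bytes": 512},
    {"sip": 0, "cube": 0, "pe": 1, "pa_base": 0x1000, "size_bytes": 512},
    {"sip": 0, "cube": 0, "pe": 1, "pa_base": 0x1200, "size_bytes": 512},
]
plan = fan_out(shards)
```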


3. Inputs and Identity

3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.

Example patterns:

  • tray.host_cpu
  • sip{S}.io{I}.pcie_ep
  • sip{S}.cube{C}.fabric
  • sip{S}.cube{C}.pe{P}
  • sip{S}.cube{C}.hbm_ctrl
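
Identifiers in this form can be parsed mechanically for domain inference. The regular expression below covers only the example patterns listed above; `parse_node_id` and the returned dict layout are hypothetical.

```python
import re

# Matches "tray.<component>" or "sip<S>.io<I>.<component>" / "sip<S>.cube<C>.<component>".
_NODE_ID = re.compile(
    r"^(?:tray\.(?P<tray_comp>\w+)"
    r"|sip(?P<sip>\d+)\.(?:io(?P<io>\d+)|cube(?P<cube>\d+))\.(?P<comp>\w+))$"
)


def parse_node_id(node_id):
    """Split a node ID into its domain fields; raise ValueError if it does not parse."""
    m = _NODE_ID.match(node_id)
    if m is None:
        raise ValueError(f"unparsable node id: {node_id!r}")
    if m["tray_comp"] is not None:
        return {"domain": "tray", "component": m["tray_comp"]}
    info = {"sip": int(m["sip"]), "component": m["comp"]}
    if m["io"] is not None:
        info["io"] = int(m["io"])
    else:
        info["cube"] = int(m["cube"])
    pe = re.fullmatch(r"pe(\d+)", m["comp"])
    if pe:
        info["pe"] = int(pe.group(1))  # PE index, when the component is a PE
    return info


a = parse_node_id("sip1.cube2.pe3")
b = parse_node_id("sip1.cube2.hbm_ctrl")
same_cube = (a["sip"], a["cube"]) == (b["sip"], b["cube"])
```

Comparing the parsed fields is enough to classify traffic as same-SIP or same-CUBE without consulting the topology.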

3.2 Link Attributes

A link MAY include:

  • fixed latency (ns),
  • bandwidth (GB/s) for serialization latency,
  • optional capacity for contention modeling.

Topology builders MUST ensure:

  • required links exist,
  • link parameters are consistent with topology intent.

4. Output, Debuggability, and Diagrams

The simulator MUST provide:

  • per-request hop-by-hop traces with timestamps,
  • clear error messages for missing connectivity (e.g., "no link for A → B"),
  • reproducible, inspectable representations of the modeled system.

Diagrams are derived artifacts of the simulator model:

  • They MUST be generatable from the compiled topology and distance metadata used by execution and routing.
  • Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005. Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

  • docs/diagrams/

5. Non-Goals (for now)

The following are explicitly out of scope:

  • cycle-accurate microarchitecture modeling,
  • detailed cache coherence protocols,
  • full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.


6. Decision Boundaries

  • SPEC.md defines architectural intent and invariants.
  • Code implements SPEC and MUST NOT introduce hidden invariants.
  • Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
  • ADRs record non-trivial architectural decisions and MUST be referenced when relevant.