kernbench2/SPEC.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00


KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems. All implementations, tests, and changes MUST conform to this SPEC.


0. Goal

Build a system-level, discrete-event simulator to evaluate the performance of LLM kernels running on our AI Accelerator SIP-based systems, under varying SIP architectures, topologies, and interconnect configurations.

The simulator models data-movement and control paths across the full hardware hierarchy and computes end-to-end execution latency for kernel executions dispatched to Processing Elements (PEs).

Primary objectives:

  • compare LLM kernel execution latency under different system configurations
  • model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
  • guarantee deterministic, verifiable behavior with strong debuggability
  • support visual inspection of the modeled system at multiple abstraction levels

0.1 Golden Invariants (Must NOT be violated)

  • End-to-end latency is computed strictly by explicit traversal over modeled components and links.
  • Every routed request MUST incur latency > 0.
  • Routing decisions MUST be deterministic given (topology + routing policy + request).
  • All valid request flows MUST have explicit connectivity in the model.
  • No hidden shortcuts, implicit bypasses, or magic paths are allowed.
  • Architectural decisions documented in ADRs override local optimizations.

0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.

  • ADR-0001: PhysAddr layout & address decoding contract
  • ADR-0002: Routing distance, ordering, and bypass rules
  • ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
  • ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
  • ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
  • ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
  • ADR-0007: runtime_api vs sim_engine responsibility boundaries
  • ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
  • ADR-0009: Kernel execution fan-out and completion semantics
  • ADR-0010: Command line interface and execution semantics
  • ADR-0011: Memory Addressing — PA / VA / LA Address Models
  • ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
  • ADR-0013: Verification strategy and Phase 1 test plan
  • ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
  • ADR-0015: Component port/wire model, BW occupancy, and fabric routing
  • ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
  • ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)

SPEC MUST remain consistent with accepted ADRs.


1. Core Requirements

R1. Correct Routing and Control Path

  • A request MUST traverse the correct sequence of components based on:
    • source location,
    • destination address or placement tags,
    • routing policy and available topology connectivity.
  • Local vs remote traffic MUST be distinguishable:
    • same SIP vs different SIP,
    • same CUBE vs different CUBE,
    • (optional) same PE-group vs cross PE-group.
  • Routing behavior MUST be reproducible and deterministic.

R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

  • per-node fixed latency (processing / router delay),
  • per-link latency (fixed and/or size-aware serialization: bytes / BW),
  • per-service latency (e.g., memory controller service time).

The simulator MUST:

  • support both fixed and size-aware latency,
  • emit hop-by-hop traces with timestamps and component identifiers.
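The summation above can be sketched in a few lines. This is a minimal illustration, not the simulator's implementation: the `Hop` record, its field names, and `traverse` are hypothetical, and the numbers in the usage are made up. It assumes bandwidth is given in decimal GB/s, so that 1 GB/s equals 1 byte/ns.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Hop:
    """Hypothetical record: the link into a node plus the node's own costs."""
    node_id: str
    link_latency_ns: float = 0.0           # fixed latency of the incoming link
    link_bw_gbps: Optional[float] = None   # GB/s; None disables serialization
    node_latency_ns: float = 0.0           # processing / router delay
    service_latency_ns: float = 0.0        # e.g. memory controller service time


def traverse(hops: List[Hop], size_bytes: int,
             start_ns: float = 0.0) -> Tuple[float, List[Tuple[float, str]]]:
    """Return (total latency, trace of (timestamp_ns, node_id) per hop)."""
    now = start_ns
    trace = []
    for hop in hops:
        # 1 GB/s == 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
        serialization = size_bytes / hop.link_bw_gbps if hop.link_bw_gbps else 0.0
        now += hop.link_latency_ns + serialization
        now += hop.node_latency_ns + hop.service_latency_ns
        trace.append((now, hop.node_id))
    total = now - start_ns
    if hops and total <= 0:
        # Mirrors the golden invariant: every routed request incurs latency > 0.
        raise ValueError("routed request must incur latency > 0")
    return total, trace


path = [
    Hop("sip0.cube0.fabric", link_latency_ns=1.0, link_bw_gbps=64.0,
        node_latency_ns=2.0),
    Hop("sip0.cube0.hbm_ctrl", link_latency_ns=1.0, link_bw_gbps=64.0,
        service_latency_ns=30.0),
]
total_ns, trace = traverse(path, size_bytes=256)
```

With these illustrative values each link adds 4 ns of serialization for 256 bytes, so the first hop completes at 7 ns and the request finishes at 42 ns. The trace carries one timestamped entry per component, which is the shape R2 asks for.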

R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

  • SIP count,
  • CUBE count per SIP,
  • PE count per CUBE,
  • on-chip fabric structure (e.g., mesh / NoC router grid),
  • IO chiplets and interconnects,
  • link bandwidth, latency, and capacity parameters.

Given a topology:

  • all required request flows MUST have valid connectivity,
  • missing links are a topology construction error, not a routing error.
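One way to enforce the last rule is to check required flows when the topology is built, before any request is injected. The sketch below is an assumption about how such a check could look; `TopologyError`, `validate_flows`, and the tuple-based link format are hypothetical names, not the project's API.

```python
class TopologyError(Exception):
    """Raised while building a topology, before any request is routed."""


def validate_flows(links, required_flows):
    """Verify each required (src, dst) flow is reachable over directed links.

    links:          iterable of (src_id, dst_id) pairs, one per directed link.
    required_flows: iterable of (src_id, dst_id) pairs that must be connected.
    """
    adjacency = {}
    for src, dst in links:
        adjacency.setdefault(src, set()).add(dst)

    for src, dst in required_flows:
        # Depth-first search over explicit links only; no implicit paths.
        seen, frontier = {src}, [src]
        while frontier and dst not in seen:
            node = frontier.pop()
            for nxt in adjacency.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        if dst not in seen:
            raise TopologyError(f"no connectivity for {src} -> {dst}")


links = [
    ("sip0.cube0.pe0", "sip0.cube0.fabric"),
    ("sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"),
]
validate_flows(links, [("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")])
```

Because links are directed, the reverse flow (`hbm_ctrl` to `pe0`) would raise `TopologyError` here until return links are added, which surfaces the gap at construction time rather than during routing.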

R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

  • routers and fabrics (NoC router mesh, switches),
  • DMA engines and queues,
  • memory controllers and services (HBM, TCM, queues),
  • management and control processors (modeled components).

The simulator MUST:

  • use dependency injection (DI) to bind node specifications to implementation classes,
  • allow component swapping without changing test logic,
  • avoid leaking routing or policy logic into unrelated components.
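As an illustration of the binding step, the sketch below maps a node spec's `kind` to an implementation class through a small registry. All names (`Component`, `Registry`, the `kind` / `id` spec keys, the two router classes) are hypothetical and chosen for the example; the actual interfaces are defined by the code and the relevant ADRs.

```python
class Component:
    """Hypothetical stable interface shared by all implementations."""

    def __init__(self, node_id, **params):
        self.node_id = node_id
        self.params = params

    def latency_ns(self, size_bytes):
        raise NotImplementedError


class FixedLatencyRouter(Component):
    def latency_ns(self, size_bytes):
        return self.params["fixed_ns"]


class SizeAwareRouter(Component):
    def latency_ns(self, size_bytes):
        # Assumes decimal GB/s, i.e. 1 GB/s == 1 byte/ns.
        return self.params["fixed_ns"] + size_bytes / self.params["bw_gbps"]


class Registry:
    """Binds a node spec's 'kind' to the class that implements it."""

    def __init__(self):
        self._impls = {}

    def bind(self, kind, impl):
        self._impls[kind] = impl

    def build(self, spec):
        spec = dict(spec)                  # do not mutate the caller's spec
        impl = self._impls[spec.pop("kind")]
        return impl(spec.pop("id"), **spec)


spec = {"kind": "router", "id": "sip0.cube0.fabric",
        "fixed_ns": 2.0, "bw_gbps": 64.0}

registry = Registry()
registry.bind("router", FixedLatencyRouter)
fixed = registry.build(spec)

registry.bind("router", SizeAwareRouter)   # swap the implementation only
sized = registry.build(spec)
```

The node spec and any test that calls `latency_ns` stay unchanged across the swap; only the binding differs.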

R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

  • PE ↔ local HBM
  • PE ↔ remote HBM in the same CUBE
  • PE ↔ remote HBM in other CUBEs within the same SIP
  • PE ↔ remote HBM in other SIPs
  • PE ↔ PE messaging (e.g., IPCQ)
  • PE ↔ IO chiplets
  • CUBE ↔ CUBE (e.g., via UCIe)
  • SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

  • the bypass path is explicitly represented in the model,
  • the bypass incurs non-zero latency,
  • the bypass is visible in traces and diagrams.

R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

  • behavior is validated by tests with meaningful input cases,
  • tests encode SPEC-defined invariants, not incidental implementation details,
  • changes without clear verification coverage are not allowed.

R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

  • exposes tensor deployment and kernel execution operations,
  • submits requests to endpoint components: PCIE_EP for memory operations (MemoryWrite / MemoryRead), IO_CPU for kernel launch,
  • owns host-side tensor handles and allocation metadata as PA shard maps,
  • remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces device physical-address (PA) shard mappings. Each shard explicitly identifies its target (sip, cube, pe) and PA range. No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.


R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

  • injects requests into the system graph,
  • schedules events deterministically,
  • tracks completion via correlation identifiers,
  • decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).
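A common way to get deterministic scheduling is a priority queue keyed by (time, insertion sequence), so that events with equal timestamps always run in the order they were scheduled. The sketch below shows that technique with Python's `heapq`; the `Engine` class and its method names are hypothetical and are not taken from the simulator's code.

```python
import heapq
import itertools


class Engine:
    """Minimal discrete-event loop with deterministic tie-breaking."""

    def __init__(self):
        self.now_ns = 0.0
        self._queue = []
        self._seq = itertools.count()   # unique, increasing tie-breaker
        self.completed = {}             # correlation id -> completion time (ns)

    def schedule(self, delay_ns, callback, *args):
        if delay_ns < 0:
            raise ValueError("cannot schedule an event in the past")
        # The unique sequence number means heap comparison never reaches
        # the callback, and equal-time events keep insertion order.
        entry = (self.now_ns + delay_ns, next(self._seq), callback, args)
        heapq.heappush(self._queue, entry)

    def complete(self, corr_id):
        self.completed[corr_id] = self.now_ns

    def run(self):
        while self._queue:
            self.now_ns, _, callback, args = heapq.heappop(self._queue)
            callback(*args)


engine = Engine()
order = []

def finish(name, corr_id):
    order.append(name)
    engine.complete(corr_id)

engine.schedule(5.0, finish, "write-a", "req-1")
engine.schedule(5.0, finish, "write-b", "req-2")   # same time as write-a
engine.schedule(1.0, finish, "launch", "req-0")
engine.run()
```

After `run()`, `order` is `["launch", "write-a", "write-b"]` on every execution, and `engine.completed` maps each correlation identifier to its completion time.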

R9. CLI Execution Semantics

The CLI MUST support executing benchmarks on a specified device.

Benchmarks are executed once per invocation within a single simulation instance. If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.


R10. Memory Addressing

The simulator defines three address models in ADR-0011; one is selected per simulation configuration:

  • PA (Physical Address) — direct PA, retained as PageFault fallback.
  • VA (Virtual Address with MMU) — currently implemented default.
  • LA (Logical Address with BAAW) — proposed, supports per-channel HBM modeling (1:1 / n:1 mapping modes).

VA model details (current default):

  • Tensors are assigned a contiguous virtual address (VA) range at deployment.
  • PE_MMU translates VA→PA per access; TLB overhead is configurable.
  • Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
  • Replicated tensors use per-cube local PA mapping; sharded tensors broadcast.
  • PA fallback is retained for backward compatibility.
  • Tensor placement is represented as a list of PA shards, each explicitly tagged with (sip, cube, pe), plus a tensor-wide va_base.

All memory access latency MUST be modeled explicitly via graph traversal. No implicit translation or hidden latency is allowed.


2. Model Concepts

2.1 Graph Execution Model

  • Nodes represent modeled components (PE blocks, NoC routers, HBM controllers, IO components, etc.).
  • Directed edges represent interconnect links with latency and bandwidth attributes.
  • Execution model:
    • a node receives a request,
    • incurs node or service latency,
    • emits the request to the next hop via a link,
    • repeats until the destination service completes.

2.2 Routing

Routing MAY be implemented as:

  • policy-based routing (code-driven),
  • routing tables (config-driven),
  • topology-driven routing (e.g., mesh XY),
  • or a hybrid approach.

Routing MUST:

  • consume decoded address domains or explicit placement tags,
  • operate only on explicit topology connectivity,
  • remain deterministic.

Kernel execution requests reference tensors via PA shard mappings. Each shard explicitly identifies its target PE, allowing IO_CPU to deterministically fan out execution without relying on PA decoding.
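Tag-driven fan-out can be sketched as a grouping step over the shard list. The `PaShard` record and `fan_out` function below are hypothetical; the field names and the exclusive `pa_end` are assumptions made for the example, and the authoritative message schema is in ADR-0012.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PaShard:
    """Hypothetical shard record: explicit target tags plus a PA range."""
    sip: int
    cube: int
    pe: int
    pa_start: int
    pa_end: int   # exclusive (assumption)


def fan_out(shards):
    """Group shards by (sip, cube, pe) using tags only, never PA decoding.

    Targets and the shards within each target are sorted, so the result
    does not depend on the order in which shards were supplied.
    """
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    return [
        (target, sorted(groups[target], key=lambda s: s.pa_start))
        for target in sorted(groups)
    ]


shards = [
    PaShard(sip=0, cube=1, pe=0, pa_start=0x2000, pa_end=0x3000),
    PaShard(sip=0, cube=0, pe=1, pa_start=0x1000, pa_end=0x2000),
    PaShard(sip=0, cube=0, pe=1, pa_start=0x0000, pa_end=0x1000),
]
plan = fan_out(shards)
```

Here `plan` lists target `(0, 0, 1)` first with its two shards in ascending PA order, followed by `(0, 1, 0)`, regardless of how the input list was ordered.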


3. Inputs and Identity

3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.

Example patterns:

  • tray.host_cpu
  • sip{S}.io{I}.pcie_ep
  • sip{S}.cube{C}.fabric
  • sip{S}.cube{C}.pe{P}
  • sip{S}.cube{C}.hbm_ctrl
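To show what "parsable" could mean in practice, the sketch below splits an identifier of the shapes listed above into indexed domains and a trailing component name. `parse_node_id` and `same_cube` are hypothetical helpers, and the set of indexed prefixes (`sip`, `cube`, `io`, `pe`) is inferred from the example patterns only.

```python
import re

# Indexed domain segments inferred from the example patterns above.
_SEGMENT = re.compile(r"^(sip|cube|io|pe)(\d+)$")


def parse_node_id(node_id):
    """Return (domains, component) for a dotted node identifier.

    domains:   dict such as {"sip": 0, "cube": 1, "pe": 3}
    component: remaining dotted suffix, "" if the id ends on a domain.
    """
    domains = {}
    rest = []
    for part in node_id.split("."):
        match = _SEGMENT.match(part)
        if match and not rest:
            domains[match.group(1)] = int(match.group(2))
        else:
            rest.append(part)
    return domains, ".".join(rest)


def same_cube(a, b):
    """True when both ids name the same CUBE within the same SIP."""
    da, _ = parse_node_id(a)
    db, _ = parse_node_id(b)
    if "cube" not in da or "cube" not in db:
        return False
    return (da.get("sip"), da["cube"]) == (db.get("sip"), db["cube"])


pe_domains, pe_component = parse_node_id("sip0.cube1.pe3")
ep_domains, ep_component = parse_node_id("sip2.io0.pcie_ep")
```

A helper like `same_cube` is the kind of domain inference that R1's local-versus-remote distinction relies on, derived from identifiers alone.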

A link MAY include:

  • fixed latency (ns),
  • bandwidth (GB/s) for serialization latency,
  • optional capacity for contention modeling.

Topology builders MUST ensure:

  • required links exist,
  • link parameters are consistent with topology intent.

4. Output, Debuggability, and Diagrams

The simulator MUST provide:

  • per-request hop-by-hop traces with timestamps,
  • clear error messages for missing connectivity (e.g., "no link for A → B"),
  • reproducible, inspectable representations of the modeled system.

Diagrams are derived artifacts of the simulator model:

  • They MUST be generatable from the compiled topology and distance metadata used by execution and routing.
  • Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005. Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

  • docs/diagrams/

5. Non-Goals (for now)

The following are explicitly out of scope:

  • cycle-accurate microarchitecture modeling,
  • detailed cache coherence protocols,
  • full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.


6. Decision Boundaries

  • SPEC.md defines architectural intent and invariants.
  • Code implements SPEC and MUST NOT introduce hidden invariants.
  • Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
  • ADRs record non-trivial architectural decisions and MUST be referenced when relevant.