kernbench2/SPEC.md
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00


# KernBench System-Level Simulator — SPEC
This document defines the architectural contract for the KernBench
system-level discrete-event simulator for our AI Accelerator SIP-based systems.
All implementations, tests, and changes MUST conform to this SPEC.
---
## 0. Goal
Build a **system-level, discrete-event simulator** to evaluate the performance of
**LLM kernels running on our AI Accelerator SIP-based systems**, under varying
**SIP architectures, topologies, and interconnect configurations**.

The simulator models **data-movement and control paths across the full hardware
hierarchy** and computes **end-to-end execution latency** for kernel executions
dispatched to Processing Elements (PEs).

Primary objectives:
- compare LLM kernel execution latency under different system configurations
- model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
- guarantee deterministic, verifiable behavior with strong debuggability
- support visual inspection of the modeled system at multiple abstraction levels
---
## 0.1 Golden Invariants (Must NOT be violated)
- End-to-end latency is computed **strictly by explicit traversal** over modeled
components and links.
- Every routed request MUST incur **latency > 0**.
- Routing decisions MUST be **deterministic** given
(topology + routing policy + request).
- All valid request flows MUST have explicit connectivity in the model.
- No hidden shortcuts, implicit bypasses, or magic paths are allowed.
- Architectural decisions documented in ADRs override local optimizations.
---
## 0.2 Architectural References (ADRs)
Major architectural decisions are documented in ADRs and referenced by number.
- ADR-0001: PhysAddr layout & address decoding contract
- ADR-0002: Routing distance, ordering, and bypass rules
- ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
- ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
- ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
- ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
- ADR-0007: runtime_api vs sim_engine responsibility boundaries
- ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
- ADR-0009: Kernel execution fan-out and completion semantics
- ADR-0010: Command line interface and execution semantics
- ADR-0011: Memory Addressing — PA / VA / LA Address Models
- ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
- ADR-0013: Verification strategy and Phase 1 test plan
- ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
- ADR-0015: Component port/wire model, BW occupancy, and fabric routing
- ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
- ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)
SPEC MUST remain consistent with accepted ADRs.
---
## 1. Core Requirements
### R1. Correct Routing and Control Path
- A request MUST traverse the correct sequence of components based on:
  - source location,
  - destination address or placement tags,
  - routing policy and available topology connectivity.
- Local vs remote traffic MUST be distinguishable:
  - same SIP vs different SIP,
  - same CUBE vs different CUBE,
  - (optional) same PE-group vs cross PE-group.
- Routing behavior MUST be reproducible and deterministic.
---
### R2. Latency is Computed by Traversal
End-to-end latency is the sum of:
- per-node fixed latency (processing / router delay),
- per-link latency (fixed and/or size-aware serialization: bytes / BW),
- per-service latency (e.g., memory controller service time).
The simulator MUST:
- support both fixed and size-aware latency,
- emit hop-by-hop traces with timestamps and component identifiers.
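
The sum above can be sketched as a single pass over an explicit hop list. This is a minimal illustration under stated assumptions, not the simulator's implementation: `Hop`, `traverse`, and every field name are hypothetical, and bandwidth is taken in GB/s so that bytes / BW is already in nanoseconds (1 GB/s equals 1 byte/ns). The node identifiers reuse the example patterns from section 3.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One traversal step: a link into a node (all names are hypothetical)."""
    node_id: str
    node_latency_ns: float            # per-node fixed latency
    link_latency_ns: float            # fixed latency of the link into this node
    link_bw_gbps: float = 0.0         # GB/s; 0 disables the size-aware term
    service_latency_ns: float = 0.0   # e.g. memory controller service time


def traverse(hops, size_bytes, start_ns=0.0):
    """Return (end_to_end_latency_ns, trace) by summing every hop explicitly."""
    now_ns = start_ns
    trace = []
    for hop in hops:
        # 1 GB/s == 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
        serialization_ns = size_bytes / hop.link_bw_gbps if hop.link_bw_gbps > 0 else 0.0
        now_ns += hop.link_latency_ns + serialization_ns
        now_ns += hop.node_latency_ns + hop.service_latency_ns
        trace.append((now_ns, hop.node_id))  # timestamp at which the hop completes
    return now_ns - start_ns, trace


path = [
    Hop("sip0.cube0.fabric", node_latency_ns=2.0, link_latency_ns=1.0, link_bw_gbps=64.0),
    Hop("sip0.cube0.hbm_ctrl", node_latency_ns=1.0, link_latency_ns=1.0,
        link_bw_gbps=64.0, service_latency_ns=20.0),
]
latency_ns, trace = traverse(path, size_bytes=128)
print(latency_ns)  # 29.0
```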
---
### R3. Topology is Configurable and Variable
Topology MUST NOT be hardcoded.
The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:
- SIP count,
- CUBE count per SIP,
- PE count per CUBE,
- on-chip fabric structure (e.g., mesh / NoC router grid),
- IO chiplets and interconnects,
- link bandwidth, latency, and capacity parameters.
Given a topology:
- all required request flows MUST have valid connectivity,
- missing links are a topology construction error, not a routing error.
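
One way to make missing connectivity fail at construction time is to check every required flow while the topology is compiled. The sketch below assumes a dict-based topology with `nodes` and `links` keys; that schema, `compile_topology`, and `TopologyError` are hypothetical names chosen for illustration, not the project's actual format.

```python
from collections import deque


class TopologyError(ValueError):
    """Raised while the topology is built, before any request is routed."""


def _reachable(adjacency, src, dst):
    """Breadth-first search over explicit links only."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def compile_topology(spec, required_flows):
    """Build an adjacency map and verify that each required flow is connected."""
    nodes = set(spec["nodes"])
    adjacency = {node: [] for node in sorted(nodes)}
    for link in spec["links"]:
        src, dst = link["src"], link["dst"]
        if src not in nodes or dst not in nodes:
            raise TopologyError(f"link references unknown node: {src} -> {dst}")
        adjacency[src].append(dst)
    for src, dst in required_flows:
        if src not in nodes or dst not in nodes or not _reachable(adjacency, src, dst):
            raise TopologyError(f"no connectivity for {src} -> {dst}")
    return adjacency


spec = {
    "nodes": ["sip0.cube0.pe0", "sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"],
    "links": [
        {"src": "sip0.cube0.pe0", "dst": "sip0.cube0.fabric"},
        {"src": "sip0.cube0.fabric", "dst": "sip0.cube0.hbm_ctrl"},
    ],
}
adjacency = compile_topology(spec, [("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")])
```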
---
### R4. DI-First Component Design (Swappable Implementations)
All components MUST be replaceable behind stable interfaces, including:
- routers and fabrics (NoC router mesh, switches),
- DMA engines and queues,
- memory controllers and services (HBM, TCM, queues),
- management and control processors (modeled components).
The simulator MUST:
- use dependency injection (DI) to bind node specifications to implementation classes,
- allow component swapping without changing test logic,
- avoid leaking routing or policy logic into unrelated components.
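
The binding of node specifications to implementation classes could look roughly like the registry below. It is a sketch under assumptions: the `impl` and `params` keys, the `register` decorator, and both router classes are invented for illustration and do not describe the project's actual DI mechanism. Because callers go through `build` only, swapping an implementation is a change to the node spec rather than to test logic.

```python
REGISTRY = {}


def register(kind):
    """Class decorator that binds an implementation to a spec key (hypothetical)."""
    def decorator(cls):
        REGISTRY[kind] = cls
        return cls
    return decorator


@register("router.fixed")
class FixedRouter:
    """Router with a constant per-request delay."""

    def __init__(self, delay_ns):
        self.delay_ns = delay_ns

    def service_ns(self, size_bytes):
        return self.delay_ns


@register("router.size_aware")
class SizeAwareRouter:
    """Router whose delay grows with request size (bw_gbps in GB/s == bytes/ns)."""

    def __init__(self, delay_ns, bw_gbps):
        self.delay_ns = delay_ns
        self.bw_gbps = bw_gbps

    def service_ns(self, size_bytes):
        return self.delay_ns + size_bytes / self.bw_gbps


def build(node_spec, registry=REGISTRY):
    """Instantiate the implementation named by the node spec."""
    cls = registry[node_spec["impl"]]
    return cls(**node_spec.get("params", {}))


fixed = build({"impl": "router.fixed", "params": {"delay_ns": 2.0}})
sized = build({"impl": "router.size_aware", "params": {"delay_ns": 2.0, "bw_gbps": 32.0}})
```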
---
### R5. Multi-Domain Communication Modeling
The simulator MUST model communication across hierarchical domains, including:
- PE ↔ local HBM
- PE ↔ remote HBM in the same CUBE
- PE ↔ remote HBM in other CUBEs within the same SIP
- PE ↔ remote HBM in other SIPs
- PE ↔ PE messaging (e.g., IPCQ)
- PE ↔ IO chiplets
- CUBE ↔ CUBE (e.g., via UCIe)
- SIP ↔ SIP (e.g., via PCIe or UAL)
Policy-based bypass is allowed ONLY if:
- the bypass path is explicitly represented in the model,
- the bypass incurs non-zero latency,
- the bypass is visible in traces and diagrams.
---
### R6. Verification-Driven Development
Development MUST follow a verification-driven workflow:
- behavior is validated by tests with meaningful input cases,
- tests encode SPEC-defined invariants, not incidental implementation details,
- changes without clear verification coverage are not allowed.
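
A test that encodes SPEC invariants rather than implementation details might check only what section 0.1 promises: non-zero latency, determinism, and an explicit hop trace. The helper below is a sketch; `check_golden_invariants`, `toy_route`, and `broken_route` are hypothetical names, and the routing functions are stand-ins for whatever the simulator exposes.

```python
def check_golden_invariants(route_fn, requests):
    """Return a list of violation messages; an empty list means all checks passed."""
    violations = []
    for request in requests:
        first = route_fn(request)
        second = route_fn(request)
        if first != second:
            violations.append(f"non-deterministic result for {request!r}")
        latency_ns, hops = first
        if latency_ns <= 0:
            violations.append(f"latency must be > 0 for {request!r}")
        if not hops:
            violations.append(f"empty hop trace for {request!r}")
    return violations


def toy_route(request):
    """Stand-in router: two hops at 2 ns each through a fabric node."""
    src, dst = request
    hops = [src, "sip0.cube0.fabric", dst]
    return 2.0 * (len(hops) - 1), hops


def broken_route(request):
    """Stand-in router that violates the latency invariant on purpose."""
    src, dst = request
    return 0.0, [src, dst]


requests = [("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")]
ok = check_golden_invariants(toy_route, requests)
bad = check_golden_invariants(broken_route, requests)
```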
---
### R7. Runtime API
The simulator MUST provide a host-facing runtime API that:
- exposes tensor deployment and kernel execution operations,
- submits requests to endpoint components: PCIE_EP for memory operations
(MemoryWrite / MemoryRead) and IO_CPU for kernel launch,
- owns host-side tensor handles and allocation metadata as PA shard maps,
- remains topology-agnostic and does not perform routing or fan-out.
Tensor deployment in Phase 0 produces **device physical-address (PA) shard mappings**.
Each shard explicitly identifies its target `(sip, cube, pe)` and PA range.
No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.
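
A PA shard map of the kind described above could be represented as follows. This is an illustrative sketch: `PaShard`, `TensorHandle`, `deploy_evenly`, the even-split policy, and the shared `pa_base` are all assumptions, not the runtime API's actual types or allocation strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaShard:
    """One PA shard with its explicit placement tag (field names are assumptions)."""
    sip: int
    cube: int
    pe: int
    pa_base: int
    size_bytes: int


@dataclass(frozen=True)
class TensorHandle:
    """Host-side handle: a name plus the PA shard map it owns."""
    name: str
    shards: tuple


def deploy_evenly(name, total_bytes, targets, pa_base):
    """Hypothetical placement: split total_bytes across targets in order.

    Every shard starts at the same device-local pa_base, an assumption
    made only to keep the example short.
    """
    if not targets:
        raise ValueError("at least one (sip, cube, pe) target is required")
    shard_bytes = -(-total_bytes // len(targets))  # ceiling division
    shards, remaining = [], total_bytes
    for sip, cube, pe in targets:
        size = min(shard_bytes, remaining)
        shards.append(PaShard(sip, cube, pe, pa_base, size))
        remaining -= size
    return TensorHandle(name, tuple(shards))


handle = deploy_evenly("weights", 1000, [(0, 0, 0), (0, 0, 1), (0, 1, 0)], pa_base=0x1000)
```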
---
### R8. Simulation Engine
The simulator MUST include a discrete-event simulation engine that:
- injects requests into the system graph,
- schedules events deterministically,
- tracks completion via correlation identifiers,
- decomposes runtime API operations into explicit graph requests
(e.g., MemoryWrite, MemoryRead, KernelLaunch).
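
Deterministic scheduling and correlation-based completion tracking can be sketched with a priority queue keyed on `(time, sequence)`. The monotonically increasing sequence number breaks ties between events at the same timestamp in submission order, so repeated runs pop events identically. `EngineSketch` and its method names are hypothetical and are not the simulator's engine API.

```python
import heapq
import itertools


class EngineSketch:
    """Minimal discrete-event loop (illustrative only)."""

    def __init__(self):
        self.now_ns = 0.0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for equal timestamps
        self.completed = {}            # correlation id -> completion time

    def schedule(self, delay_ns, callback, *args):
        if delay_ns < 0:
            raise ValueError("events cannot be scheduled in the past")
        # The unique sequence number guarantees callbacks are never compared.
        heapq.heappush(self._queue, (self.now_ns + delay_ns, next(self._seq), callback, args))

    def complete(self, correlation_id):
        self.completed[correlation_id] = self.now_ns

    def run(self):
        while self._queue:
            self.now_ns, _, callback, args = heapq.heappop(self._queue)
            callback(*args)


engine = EngineSketch()
engine.schedule(10.0, engine.complete, "req-b")
engine.schedule(10.0, engine.complete, "req-a")  # same time, submitted later
engine.schedule(5.0, engine.complete, "req-c")
engine.run()
print(list(engine.completed))  # ['req-c', 'req-b', 'req-a']
```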
---
### R9. CLI Execution Semantics
The CLI MUST support executing benchmarks on a specified device.
Benchmarks are executed once per invocation within a single simulation instance.
If multiple devices are present in the topology, a benchmark MAY interact with
multiple devices internally, but the CLI does not launch multiple independent
benchmark instances by default.
---
### R10. Memory Addressing
The simulator defines three address models in ADR-0011; one is selected
per simulation configuration:
- **PA (Physical Address)** — direct PA, retained as PageFault fallback.
- **VA (Virtual Address with MMU)** — currently implemented default.
- **LA (Logical Address with BAAW)** — proposed, supports per-channel
HBM modeling (1:1 / n:1 mapping modes).
VA model details (current default):
- Tensors are assigned a contiguous virtual address (VA) range at deployment.
- PE_MMU translates VA→PA per access; TLB overhead is configurable.
- Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
- Replicate tensors use per-cube local PA mapping; sharded tensors broadcast.
- PA fallback is retained for backward compatibility.
- Tensor placement is represented as a list of PA shards, each explicitly tagged
with `(sip, cube, pe)`, plus a tensor-wide `va_base`.
All memory access latency MUST be modeled explicitly via graph traversal.
No implicit translation or hidden latency is allowed.
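
The per-access VA→PA translation with a configurable TLB overhead might be modeled along these lines. Everything here is an assumption made for illustration: the class name `PeMmuSketch`, the 4 KiB page size, the hit and miss latencies, and the unbounded TLB. The sketch models neither the PA fallback nor the fabric traversal of mapping installation; it only shows that each translation returns an explicit, non-zero latency.

```python
class PeMmuSketch:
    """Toy page-granular MMU with an unbounded TLB (illustrative only)."""

    def __init__(self, page_size=4096, tlb_hit_ns=1.0, tlb_miss_ns=10.0):
        self.page_size = page_size
        self.tlb_hit_ns = tlb_hit_ns
        self.tlb_miss_ns = tlb_miss_ns
        self._page_table = {}  # virtual page number -> PA of the page base
        self._tlb = set()      # virtual page numbers already cached

    def map_range(self, va_base, pa_base, size_bytes):
        """Install mappings for a contiguous, page-aligned VA range."""
        if va_base % self.page_size or pa_base % self.page_size:
            raise ValueError("va_base and pa_base must be page-aligned")
        pages = -(-size_bytes // self.page_size)  # ceiling division
        first_vpn = va_base // self.page_size
        for i in range(pages):
            self._page_table[first_vpn + i] = pa_base + i * self.page_size

    def translate(self, va):
        """Return (pa, latency_ns); the latency is always explicit and > 0."""
        vpn, offset = divmod(va, self.page_size)
        if vpn not in self._page_table:
            raise LookupError(f"page fault at VA {va:#x}")
        latency_ns = self.tlb_hit_ns if vpn in self._tlb else self.tlb_miss_ns
        self._tlb.add(vpn)
        return self._page_table[vpn] + offset, latency_ns


mmu = PeMmuSketch()
mmu.map_range(va_base=0x10000, pa_base=0x80000, size_bytes=8192)
first = mmu.translate(0x10010)   # TLB miss on the first touch of the page
second = mmu.translate(0x10010)  # TLB hit on the second touch
```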
---
## 2. Model Concepts
### 2.1 Graph Execution Model
- Nodes represent modeled components (PE blocks, NoC routers,
HBM controllers, IO components, etc.).
- Directed edges represent interconnect links with latency and bandwidth attributes.
- Execution model:
  - a node receives a request,
  - incurs node or service latency,
  - emits the request to the next hop via a link,
  - repeats until the destination service completes.
---
### 2.2 Routing
Routing MAY be implemented as:
- policy-based routing (code-driven),
- routing tables (config-driven),
- topology-driven routing (e.g., mesh XY),
- or a hybrid approach.
Routing MUST:
- consume decoded address domains or explicit placement tags,
- operate only on explicit topology connectivity,
- remain deterministic.
Kernel execution requests reference tensors via PA shard mappings.
Each shard explicitly identifies its target PE, allowing IO_CPU to
deterministically fan out execution without relying on PA decoding.
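
As an example of the topology-driven option listed above, dimension-ordered (XY) routing on a 2D mesh is deterministic by construction: the path depends only on the source and destination coordinates. The function below is a generic sketch of that technique; `xy_route` and the `(x, y)` coordinate convention are assumptions, not the simulator's routing code.

```python
def xy_route(src, dst):
    """Dimension-ordered routing: move along X until aligned, then along Y.

    src and dst are (x, y) router coordinates; the returned path includes both.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


forward = xy_route((0, 0), (2, 1))
backward = xy_route((2, 1), (0, 0))
print(forward)   # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(backward)  # [(2, 1), (1, 1), (0, 1), (0, 0)]
```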
---
## 3. Inputs and Identity
### 3.1 Node Identity Scheme
Nodes MUST have stable, parsable identifiers sufficient for domain inference
and trace-based debugging.
Example patterns:
- `tray.host_cpu`
- `sip{S}.io{I}.pcie_ep`
- `sip{S}.cube{C}.fabric`
- `sip{S}.cube{C}.pe{P}`
- `sip{S}.cube{C}.hbm_ctrl`
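
Identifiers in this style can be parsed back into domain fields, which is what makes same-CUBE versus cross-CUBE inference possible from a trace alone. The parser below is a sketch that assumes the indexed levels are exactly `sip`, `cube`, `io`, and `pe`; `parse_node_id` and `same_cube` are hypothetical helper names.

```python
import re

_INDEXED = re.compile(r"(sip|cube|io|pe)(\d+)")
_NAMED = re.compile(r"[a-z][a-z0-9_]*")


def parse_node_id(node_id):
    """Split a dotted node id into indexed levels and plain component names."""
    info = {"id": node_id}
    for segment in node_id.split("."):
        match = _INDEXED.fullmatch(segment)
        if match:
            info[match.group(1)] = int(match.group(2))
        elif _NAMED.fullmatch(segment):
            info.setdefault("names", []).append(segment)
        else:
            raise ValueError(f"unparsable segment {segment!r} in {node_id!r}")
    return info


def same_cube(a, b):
    """True when both ids name a CUBE and agree on (sip, cube)."""
    pa, pb = parse_node_id(a), parse_node_id(b)
    if "cube" not in pa or "cube" not in pb:
        return False
    return (pa.get("sip"), pa["cube"]) == (pb.get("sip"), pb["cube"])


pe = parse_node_id("sip0.cube1.pe3")
host = parse_node_id("tray.host_cpu")
```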
---
### 3.2 Link Specifications
A link MAY include:
- fixed latency (ns),
- bandwidth (GB/s) for serialization latency,
- optional capacity for contention modeling.
Topology builders MUST ensure:
- required links exist,
- link parameters are consistent with topology intent.
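
The three link attributes can be combined in a small model where the link stays occupied for the serialization time of each transfer, so a second transfer that arrives early has to wait. This is one possible reading of "capacity", assumed here as a single in-flight transfer; `LinkSketch` and its fields are hypothetical. Requiring a strictly positive latency is also an assumption, chosen to line up with the non-zero latency invariant in section 0.1.

```python
from dataclasses import dataclass


@dataclass
class LinkSketch:
    """Link with fixed latency, bandwidth, and single-transfer occupancy."""
    latency_ns: float
    bw_gbps: float               # GB/s, numerically equal to bytes per nanosecond
    busy_until_ns: float = 0.0   # time at which the link is next free

    def __post_init__(self):
        if self.latency_ns <= 0 or self.bw_gbps <= 0:
            raise ValueError("link latency and bandwidth must be positive")

    def transmit(self, now_ns, size_bytes):
        """Return the arrival time of a transfer injected at now_ns."""
        start_ns = max(now_ns, self.busy_until_ns)    # wait while the link is busy
        serialization_ns = size_bytes / self.bw_gbps  # bytes / (bytes per ns)
        self.busy_until_ns = start_ns + serialization_ns
        return self.busy_until_ns + self.latency_ns


link = LinkSketch(latency_ns=5.0, bw_gbps=128.0)
first_arrival = link.transmit(now_ns=0.0, size_bytes=1024)
second_arrival = link.transmit(now_ns=2.0, size_bytes=1024)  # queues behind the first
print(first_arrival, second_arrival)  # 13.0 21.0
```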
---
## 4. Output, Debuggability, and Diagrams
The simulator MUST provide:
- per-request hop-by-hop traces with timestamps,
- clear error messages for missing connectivity
(e.g., "no link for A → B"),
- reproducible, inspectable representations of the modeled system.
Diagrams are **derived artifacts** of the simulator model:
- They MUST be generatable from the **compiled topology** and **distance metadata**
used by execution and routing.
- Generation MAY be performed lazily or cached by the implementation,
as long as outputs remain consistent with the compiled topology.
Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005.
Automatic diagram generation and output conventions are defined in ADR-0006.
By default, generated diagrams are written under:
- `docs/diagrams/`
---
## 5. Non-Goals (for now)
The following are explicitly out of scope:
- cycle-accurate microarchitecture modeling,
- detailed cache coherence protocols,
- full PCIe / CXL protocol correctness.
These MAY be layered later via additional components and policies.
---
## 6. Decision Boundaries
- SPEC.md defines architectural intent and invariants.
- Code implements SPEC and MUST NOT introduce hidden invariants.
- Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
- ADRs record non-trivial architectural decisions and MUST be referenced when relevant.