ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 01:15:55 -07:00

11 KiB


KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems. All implementations, tests, and changes MUST conform to this SPEC.

0. Goal

Build a system-level, discrete-event simulator to evaluate the performance of LLM kernels running on our AI Accelerator SIP-based systems, under varying SIP architectures, topologies, and interconnect configurations.

The simulator models data-movement and control paths across the full hardware hierarchy and computes end-to-end execution latency for kernel executions dispatched to Processing Elements (PEs).

Primary objectives:

compare LLM kernel execution latency under different system configurations
model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
guarantee deterministic, verifiable behavior with strong debuggability
support visual inspection of the modeled system at multiple abstraction levels

0.1 Golden Invariants (Must NOT be violated)

End-to-end latency is computed strictly by explicit traversal over modeled components and links.
Every routed request MUST incur latency > 0.
Routing decisions MUST be deterministic given (topology + routing policy + request).
All valid request flows MUST have explicit connectivity in the model.
No hidden shortcuts, implicit bypasses, or magic paths are allowed.
Architectural decisions documented in ADRs override local optimizations.

0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.

ADR-0001: PhysAddr layout & address decoding contract
ADR-0002: Routing distance, ordering, and bypass rules
ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
ADR-0007: runtime_api vs sim_engine responsibility boundaries
ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
ADR-0009: Kernel execution fan-out and completion semantics
ADR-0010: Command line interface and execution semantics
ADR-0011: Memory Addressing — PA / VA / LA Address Models
ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
ADR-0013: Verification strategy and Phase 1 test plan
ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
ADR-0015: Component port/wire model, BW occupancy, and fabric routing
ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)

SPEC MUST remain consistent with accepted ADRs.

1. Core Requirements

R1. Correct Routing and Control Path

- A request MUST traverse the correct sequence of components based on:
  - source location,
  - destination address or placement tags,
  - routing policy and available topology connectivity.
- Local vs remote traffic MUST be distinguishable:
  - same SIP vs different SIP,
  - same CUBE vs different CUBE,
  - (optional) same PE-group vs cross PE-group.
- Routing behavior MUST be reproducible and deterministic.

R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

per-node fixed latency (processing / router delay),
per-link latency (fixed and/or size-aware serialization: bytes / BW),
per-service latency (e.g., memory controller service time).

The simulator MUST:

support both fixed and size-aware latency,
emit hop-by-hop traces with timestamps and component identifiers.
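
The sum above can be illustrated with a minimal sketch. The `Hop` record and `traverse` function are hypothetical names used for illustration only, not part of the KernBench codebase. The sketch assumes latencies in ns and bandwidth in GB/s; since 1 GB/s equals 1 byte/ns, `bytes / BW` yields nanoseconds directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One traversal step: a node plus the link used to reach it (hypothetical)."""
    node_id: str
    node_latency_ns: float            # fixed processing / router delay
    link_latency_ns: float = 0.0      # fixed latency of the inbound link
    link_bw_gbps: float = 0.0         # GB/s; 0 disables size-aware serialization
    service_latency_ns: float = 0.0   # e.g. memory controller service time


def traverse(path, nbytes, start_ns=0.0):
    """Walk the path hop by hop; return (end_to_end_latency_ns, trace)."""
    now = start_ns
    trace = []
    for hop in path:
        # 1 GB/s == 1 byte/ns, so bytes / (GB/s) is already in ns.
        serialization = nbytes / hop.link_bw_gbps if hop.link_bw_gbps > 0 else 0.0
        now += hop.link_latency_ns + serialization
        now += hop.node_latency_ns + hop.service_latency_ns
        trace.append((now, hop.node_id))  # timestamp when this hop completes
    return now - start_ns, trace


path = [
    Hop("sip0.cube0.pe0", node_latency_ns=2.0),
    Hop("sip0.cube0.fabric", node_latency_ns=4.0,
        link_latency_ns=1.0, link_bw_gbps=64.0),
    Hop("sip0.cube0.hbm_ctrl", node_latency_ns=1.0,
        link_latency_ns=1.0, link_bw_gbps=128.0, service_latency_ns=30.0),
]
total_ns, trace = traverse(path, nbytes=256)
# total_ns == 45.0; trace holds one (timestamp_ns, node_id) pair per hop
```

Every hop contributes a strictly positive amount here, which is the property the Golden Invariants require of any routed request.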

R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

SIP count,
CUBE count per SIP,
PE count per CUBE,
on-chip fabric structure (e.g., mesh / NoC router grid),
IO chiplets and interconnects,
link bandwidth, latency, and capacity parameters.

Given a topology:

all required request flows MUST have valid connectivity,
missing links are a topology construction error, not a routing error.
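
A minimal sketch of treating missing connectivity as a construction error, assuming a dict-based topology with hypothetical keys `nodes`, `links`, and `required_flows`. The `TopologyError` name and the spec layout are illustrative assumptions, not the actual KernBench schema.

```python
class TopologyError(ValueError):
    """Raised while building a topology (hypothetical name)."""


def _reachable(adjacency, src, dst):
    """Depth-first search that follows explicit links only."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in adjacency.get(node, set()) - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def build_topology(spec):
    """Compile a dict spec into an adjacency map, validating it up front."""
    nodes = set(spec["nodes"])
    adjacency = {n: set() for n in nodes}
    for src, dst in spec["links"]:
        if src not in nodes or dst not in nodes:
            raise TopologyError(f"link references unknown node: {src} -> {dst}")
        adjacency[src].add(dst)
    # Fail at construction time, before any request is routed.
    for src, dst in spec.get("required_flows", []):
        if not _reachable(adjacency, src, dst):
            raise TopologyError(f"no path for required flow {src} -> {dst}")
    return adjacency


spec = {
    "nodes": ["sip0.cube0.pe0", "sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"],
    "links": [
        ("sip0.cube0.pe0", "sip0.cube0.fabric"),
        ("sip0.cube0.fabric", "sip0.cube0.hbm_ctrl"),
    ],
    "required_flows": [("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")],
}
adjacency = build_topology(spec)
```

Dropping the second link from `spec` makes `build_topology` raise `TopologyError`, so the routing layer never sees an incomplete graph.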

R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

routers and fabrics (NoC router mesh, switches),
DMA engines and queues,
memory controllers and services (HBM, TCM, queues),
management and control processors (modeled components).

The simulator MUST:

use dependency injection (DI) to bind node specifications to implementation classes,
allow component swapping without changing test logic,
avoid leaking routing or policy logic into unrelated components.
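
One way to bind node specifications to implementation classes is a small registry, sketched below. The `register` decorator, the `impl` / `params` keys, and both router classes are hypothetical; the real binding mechanism may differ.

```python
from typing import Callable, Protocol


class Component(Protocol):
    """Stable interface every modeled component implements (hypothetical)."""

    def latency_ns(self, nbytes: int) -> float: ...


_REGISTRY: dict[str, Callable[..., Component]] = {}


def register(kind: str):
    """Class decorator that binds an implementation name to its class."""
    def wrap(cls):
        _REGISTRY[kind] = cls
        return cls
    return wrap


@register("router.fixed")
class FixedRouter:
    def __init__(self, delay_ns: float):
        self.delay_ns = delay_ns

    def latency_ns(self, nbytes: int) -> float:
        return self.delay_ns


@register("router.size_aware")
class SizeAwareRouter:
    def __init__(self, delay_ns: float, bw_gbps: float):
        self.delay_ns = delay_ns
        self.bw_gbps = bw_gbps

    def latency_ns(self, nbytes: int) -> float:
        # 1 GB/s == 1 byte/ns, so the quotient is in ns.
        return self.delay_ns + nbytes / self.bw_gbps


def instantiate(node_spec: dict) -> Component:
    """Resolve a node spec to a component instance via the registry."""
    impl = node_spec["impl"]
    if impl not in _REGISTRY:
        raise KeyError(f"no implementation registered for {impl!r}")
    return _REGISTRY[impl](**node_spec.get("params", {}))


fixed = instantiate({"impl": "router.fixed", "params": {"delay_ns": 3.0}})
sized = instantiate({"impl": "router.size_aware",
                     "params": {"delay_ns": 3.0, "bw_gbps": 32.0}})
```

Test logic that only calls `latency_ns` stays unchanged when the `impl` string in the node spec is swapped.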

R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

PE ↔ local HBM
PE ↔ remote HBM in the same CUBE
PE ↔ remote HBM in other CUBEs within the same SIP
PE ↔ remote HBM in other SIPs
PE ↔ PE messaging (e.g., IPCQ)
PE ↔ IO chiplets
CUBE ↔ CUBE (e.g., via UCIe)
SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

the bypass path is explicitly represented in the model,
the bypass incurs non-zero latency,
the bypass is visible in traces and diagrams.

R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

behavior is validated by tests with meaningful input cases,
tests encode SPEC-defined invariants, not incidental implementation details,
changes without clear verification coverage are not allowed.

R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

exposes tensor deployment and kernel execution operations,
submits requests to endpoint components: PCIE_EP for memory operations (MemoryWrite / MemoryRead), IO_CPU for kernel launch,
owns host-side tensor handles and allocation metadata as PA shard maps,
remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces device physical-address (PA) shard mappings. Each shard explicitly identifies its target (sip, cube, pe) and PA range. No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.
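
A PA shard map of this kind could be represented as sketched below. The class and field names (`PaShard`, `TensorHandle`, `pa_base`, `nbytes`) are assumptions for illustration; the authoritative message schema is the one defined in ADR-0012.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaShard:
    """One shard: explicit target tags plus a PA range (hypothetical fields)."""
    sip: int
    cube: int
    pe: int
    pa_base: int
    nbytes: int

    @property
    def pa_end(self) -> int:
        """Exclusive end of the shard's PA range."""
        return self.pa_base + self.nbytes


@dataclass(frozen=True)
class TensorHandle:
    """Host-side handle owning the shard map; no routing logic lives here."""
    name: str
    shards: tuple

    @property
    def nbytes(self) -> int:
        return sum(s.nbytes for s in self.shards)

    def targets(self):
        """Distinct (sip, cube, pe) targets, in a deterministic order."""
        return sorted({(s.sip, s.cube, s.pe) for s in self.shards})


handle = TensorHandle(
    name="weights",
    shards=(
        PaShard(sip=0, cube=0, pe=0, pa_base=0x1000, nbytes=4096),
        PaShard(sip=0, cube=0, pe=1, pa_base=0x1000, nbytes=4096),
    ),
)
```

Because each shard carries its own target tags, a consumer can read the placement without decoding the PA.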

R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

injects requests into the system graph,
schedules events deterministically,
tracks completion via correlation identifiers,
decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).
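
Deterministic scheduling is commonly obtained by ordering events on a `(time, sequence)` key, so that events with equal timestamps fire in insertion order. The sketch below uses Python's standard `heapq`. The `EventQueue` class and its method names are hypothetical, not the actual sim_engine interface.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete-event queue with a deterministic tie-break."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker
        self.now_ns = 0.0

    def schedule(self, delay_ns, corr_id, action):
        """Schedule `action(queue)` to run `delay_ns` after the current time."""
        if delay_ns < 0:
            raise ValueError("delay_ns must be non-negative")
        entry = (self.now_ns + delay_ns, next(self._seq), corr_id, action)
        heapq.heappush(self._heap, entry)

    def run(self):
        """Drain the queue; return (time_ns, corr_id) in firing order."""
        fired = []
        while self._heap:
            time_ns, _, corr_id, action = heapq.heappop(self._heap)
            self.now_ns = time_ns
            action(self)  # an action may schedule follow-up events
            fired.append((time_ns, corr_id))
        return fired


def noop(queue):
    pass


queue = EventQueue()
queue.schedule(5.0, "req-b", noop)
queue.schedule(5.0, "req-a", noop)  # same timestamp, inserted later
queue.schedule(1.0, "req-c", noop)
order = queue.run()
# order == [(1.0, "req-c"), (5.0, "req-b"), (5.0, "req-a")]
```

The sequence number is unique per event, so the heap never falls back to comparing correlation IDs or callables.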

R9. CLI Execution Semantics

The CLI MUST support executing benchmarks on a specified device.

Benchmarks are executed once per invocation within a single simulation instance. If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.

R10. Memory Addressing

The simulator defines three address models in ADR-0011; one is selected per simulation configuration:

PA (Physical Address): direct PA, retained as PageFault fallback.
VA (Virtual Address with MMU): currently implemented default.
LA (Logical Address with BAAW): proposed, supports per-channel HBM modeling (1:1 / n:1 mapping modes).

VA model details (current default):

Tensors are assigned a contiguous virtual address (VA) range at deployment.
PE_MMU translates VA→PA per access; TLB overhead is configurable.
Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
Replicated tensors use per-cube local PA mapping; sharded tensors broadcast.
PA fallback is retained for backward compatibility.
Tensor placement is represented as a list of PA shards, each explicitly tagged with (sip, cube, pe), plus a tensor-wide va_base.

All memory access latency MUST be modeled explicitly via graph traversal. No implicit translation or hidden latency is allowed.
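
As a rough illustration of explicit VA→PA translation with a configurable overhead, consider the page-table sketch below. It is not the PE_MMU design from ADR-0011: the page size, the unbounded TLB, the miss-only penalty, and all names (`SimpleMmu`, `PageFault`, `tlb_miss_ns`) are assumptions made for this example.

```python
class PageFault(LookupError):
    """Raised when a VA has no installed mapping (hypothetical name)."""


class SimpleMmu:
    def __init__(self, page_bytes=4096, tlb_miss_ns=0.0):
        self.page_bytes = page_bytes
        self.tlb_miss_ns = tlb_miss_ns
        self._pages = {}   # virtual page number -> PA of the page start
        self._tlb = set()  # virtual page numbers already translated once

    def map(self, va_base, pa_base, nbytes):
        """Install a contiguous VA range onto a contiguous PA range."""
        if va_base % self.page_bytes or pa_base % self.page_bytes:
            raise ValueError("va_base and pa_base must be page-aligned")
        for offset in range(0, nbytes, self.page_bytes):
            self._pages[(va_base + offset) // self.page_bytes] = pa_base + offset

    def translate(self, va):
        """Return (pa, overhead_ns); the overhead is returned, never hidden."""
        vpn, offset = divmod(va, self.page_bytes)
        if vpn not in self._pages:
            raise PageFault(f"no mapping for VA {va:#x}")
        overhead_ns = 0.0 if vpn in self._tlb else self.tlb_miss_ns
        self._tlb.add(vpn)
        return self._pages[vpn] + offset, overhead_ns


mmu = SimpleMmu(page_bytes=4096, tlb_miss_ns=20.0)
mmu.map(va_base=0x10000, pa_base=0x80000, nbytes=8192)
first = mmu.translate(0x10010)   # TLB miss on first touch of the page
second = mmu.translate(0x10010)  # same page again: no extra overhead
```

Returning the overhead alongside the PA lets the caller add it to the traversal latency explicitly rather than absorbing it silently.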

2. Model Concepts

2.1 Graph Execution Model

- Nodes represent modeled components (PE blocks, NoC routers, HBM controllers, IO components, etc.).
- Directed edges represent interconnect links with latency and bandwidth attributes.
- Execution model:
  - a node receives a request,
  - incurs node or service latency,
  - emits the request to the next hop via a link,
  - repeats until the destination service completes.

2.2 Routing

Routing MAY be implemented as:

policy-based routing (code-driven),
routing tables (config-driven),
topology-driven routing (e.g., mesh XY),
or a hybrid approach.

Routing MUST:

consume decoded address domains or explicit placement tags,
operate only on explicit topology connectivity,
remain deterministic.

Kernel execution requests reference tensors via PA shard mappings. Each shard explicitly identifies its target PE, allowing IO_CPU to fan out execution deterministically without relying on PA decoding.
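
For the topology-driven option listed above, dimension-ordered XY routing on a 2D mesh is a simple deterministic example: a request first moves along X until the column matches, then along Y. The sketch assumes routers are addressed by `(x, y)` grid coordinates; the `xy_route` name and the coordinate convention are illustrative only.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst, inclusive."""
    x, y = src
    path = [src]
    # Resolve the X dimension completely before touching Y.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


forward = xy_route((0, 0), (2, 1))
backward = xy_route((2, 1), (0, 0))
# forward  == [(0, 0), (1, 0), (2, 0), (2, 1)]
# backward == [(2, 1), (1, 1), (0, 1), (0, 0)]
```

The same `(src, dst)` pair always yields the same path. The forward and reverse paths need not visit the same routers.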

3. Inputs and Identity

3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.

Example patterns:

tray.host_cpu
sip{S}.io{I}.pcie_ep
sip{S}.cube{C}.fabric
sip{S}.cube{C}.pe{P}
sip{S}.cube{C}.hbm_ctrl
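
Identifiers following these patterns can be parsed mechanically for domain inference. The sketch below assumes dot-separated segments made of a lowercase name with an optional numeric suffix. The helper names and the locality labels are hypothetical.

```python
import re

# A segment is a lowercase name with an optional trailing index, e.g. "cube3".
_SEGMENT = re.compile(r"^([a-z_]+?)(\d+)?$")


def _split(segment):
    match = _SEGMENT.match(segment)
    if match is None:
        raise ValueError(f"unparsable node id segment: {segment!r}")
    name, index = match.groups()
    return name, (int(index) if index is not None else None)


def parse_node_id(node_id):
    """Split an id into enclosing domains and the leaf unit."""
    *domain_parts, leaf = node_id.split(".")
    if not domain_parts:
        raise ValueError(f"node id needs at least one domain: {node_id!r}")
    domains = dict(_split(part) for part in domain_parts)
    unit, index = _split(leaf)
    return {"domains": domains, "unit": unit, "index": index}


def locality(a, b):
    """Classify two nodes as same_cube, same_sip, or different_sip."""
    da = parse_node_id(a)["domains"]
    db = parse_node_id(b)["domains"]
    # Nodes outside any SIP (e.g. tray.host_cpu) count as different_sip here.
    if da.get("sip") is None or da.get("sip") != db.get("sip"):
        return "different_sip"
    if da.get("cube") is None or da.get("cube") != db.get("cube"):
        return "same_sip"
    return "same_cube"


parsed = parse_node_id("sip0.cube1.pe3")
# parsed == {"domains": {"sip": 0, "cube": 1}, "unit": "pe", "index": 3}
```

A helper like `locality` is one way to make the local vs remote distinction required by R1 directly from trace identifiers.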

3.2 Link Specifications

A link MAY include:

fixed latency (ns),
bandwidth (GB/s) for serialization latency,
optional capacity for contention modeling.

Topology builders MUST ensure:

required links exist,
link parameters are consistent with topology intent.

4. Output, Debuggability, and Diagrams

The simulator MUST provide:

per-request hop-by-hop traces with timestamps,
clear error messages for missing connectivity (e.g., "no link for A → B"),
reproducible, inspectable representations of the modeled system.

Diagrams are derived artifacts of the simulator model:

They MUST be generatable from the compiled topology and distance metadata used by execution and routing.
Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005. Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

docs/diagrams/

5. Non-Goals (for now)

The following are explicitly out of scope:

cycle-accurate microarchitecture modeling,
detailed cache coherence protocols,
full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.

6. Decision Boundaries

SPEC.md defines architectural intent and invariants.
Code implements SPEC and MUST NOT introduce hidden invariants.
Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
ADRs record non-trivial architectural decisions and MUST be referenced when relevant.
