ywkang 5917b3497c Replace xbar/bridge/single-NOC with explicit router mesh (ADR-0019)

- Remove xbar_top/bot, bridge, single noc node from topology
- Each cube_mesh.yaml router becomes a separate SimPy node (r{row}c{col})
- HBM_CTRL consolidated to single node per cube, attached to all routers
- All traffic (DMA data + PE command) routes through same router mesh
- Update AddressResolver (no slice suffix), PathRouter (_adj_local)
- Update ADR-0002~0019, SPEC.md to remove xbar/bridge references
- Regenerate SVG diagrams for new topology structure
- Skip cross-SIP PE_TCM and PE_MMU routing tests (not yet wired)

326 passed, 13 skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-04 17:51:28 -07:00

KernBench System-Level Simulator — SPEC

This document defines the architectural contract for the KernBench system-level discrete-event simulator for our AI Accelerator SIP-based systems. All implementations, tests, and changes MUST conform to this SPEC.

0. Goal

Build a system-level, discrete-event simulator to evaluate the performance of LLM kernels running on our AI Accelerator SIP-based systems, under varying SIP architectures, topologies, and interconnect configurations.

The simulator models data-movement and control paths across the full hardware hierarchy and computes the end-to-end latency of kernels dispatched to Processing Elements (PEs).

Primary objectives:

compare LLM kernel execution latency under different system configurations
model PE↔HBM, PE↔PE, CUBE↔CUBE, and SIP↔SIP communication and control paths
guarantee deterministic, verifiable behavior with strong debuggability
support visual inspection of the modeled system at multiple abstraction levels

0.1 Golden Invariants (Must NOT be violated)

End-to-end latency is computed strictly by explicit traversal over modeled components and links.
Every routed request MUST incur latency > 0.
Routing decisions MUST be deterministic given (topology + routing policy + request).
All valid request flows MUST have explicit connectivity in the model.
No hidden shortcuts, implicit bypasses, or magic paths are allowed.
Architectural decisions documented in ADRs override local optimizations.

0.2 Architectural References (ADRs)

Major architectural decisions are documented in ADRs and referenced by number.

ADR-0001: PhysAddr layout & address decoding contract
ADR-0002: Routing distance, ordering, and bypass rules
ADR-0003: Target system hierarchy & modeling scope (Tray / SIP / CUBE / PE / IO chiplet)
ADR-0004: Memory semantics & local-HBM bandwidth guarantee contract
ADR-0005: Diagram views (SIP / CUBE / PE) and distance-aware layout rules
ADR-0006: Topology compilation, distance extraction, and automatic diagram generation
ADR-0007: runtime_api vs sim_engine responsibility boundaries
ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
ADR-0009: Kernel execution fan-out and completion semantics
ADR-0010: CLI device selection and multi-device execution semantics
ADR-0011: Memory addressing simplification (PA-first)
ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
ADR-0013: Verification strategy and Phase 1 test plan
ADR-0014: PE internal execution model (PE_CPU, PE_SCHEDULER, composite commands)
ADR-0015: Component port/wire model, BW occupancy, and fabric routing
ADR-0016: IOChiplet NOC and memory data path (M_CPU bypass)
ADR-0017: Cube NOC 2D mesh architecture (XY routing, contention, attachments)

SPEC MUST remain consistent with accepted ADRs.

1. Core Requirements

R1. Correct Routing and Control Path

A request MUST traverse the correct sequence of components based on:
- source location,
- destination address or placement tags,
- routing policy and available topology connectivity.
Local vs remote traffic MUST be distinguishable:
- same SIP vs different SIP,
- same CUBE vs different CUBE,
- (optional) same PE-group vs cross PE-group.
Routing behavior MUST be reproducible and deterministic.
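
A minimal sketch of how the local/remote distinction above might be made, assuming endpoints are identified by (sip, cube, pe) tuples. The names `Endpoint`, `Locality`, and `classify` are hypothetical and do not come from the simulator code.

```python
from enum import Enum
from typing import NamedTuple


class Endpoint(NamedTuple):
    """Hypothetical endpoint coordinates within the hierarchy."""
    sip: int
    cube: int
    pe: int


class Locality(Enum):
    SAME_PE = "same_pe"
    SAME_CUBE = "same_cube"    # different PE, same CUBE
    SAME_SIP = "same_sip"      # different CUBE, same SIP
    CROSS_SIP = "cross_sip"    # different SIP


def classify(src: Endpoint, dst: Endpoint) -> Locality:
    """Return the narrowest domain that contains both endpoints."""
    if src.sip != dst.sip:
        return Locality.CROSS_SIP
    if src.cube != dst.cube:
        return Locality.SAME_SIP
    if src.pe != dst.pe:
        return Locality.SAME_CUBE
    return Locality.SAME_PE
```

Because the result depends only on the two endpoints, the classification is reproducible for a given request.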

R2. Latency is Computed by Traversal

End-to-end latency is the sum of:

per-node fixed latency (processing / router delay),
per-link latency (fixed and/or size-aware serialization: bytes / BW),
per-service latency (e.g., memory controller service time).

The simulator MUST:

support both fixed and size-aware latency,
emit hop-by-hop traces with timestamps and component identifiers.
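
The latency sum above can be sketched as follows. This is an illustration under stated assumptions, not the simulator's implementation: the `Hop` record, its field names, and `traverse` are hypothetical. Bandwidth is taken in decimal GB/s, so 1 GB/s equals 1 byte/ns and `bytes / BW` yields nanoseconds directly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hop:
    """One visited component plus its outgoing link (hypothetical layout)."""
    node_id: str
    node_latency_ns: float = 0.0     # fixed processing / router delay
    link_latency_ns: float = 0.0     # fixed latency of the outgoing link
    link_bw_gb_per_s: float = 0.0    # 0 means no size-aware term (e.g. final hop)
    service_latency_ns: float = 0.0  # e.g. memory controller service time


def hop_latency_ns(hop: Hop, nbytes: int) -> float:
    """Node + link (fixed and serialization) + service latency for one hop."""
    serialization = nbytes / hop.link_bw_gb_per_s if hop.link_bw_gb_per_s > 0 else 0.0
    return (hop.node_latency_ns + hop.link_latency_ns
            + serialization + hop.service_latency_ns)


def traverse(path: List[Hop], nbytes: int,
             start_ns: float = 0.0) -> Tuple[float, List[Tuple[float, str]]]:
    """Return (end-to-end latency, trace of (timestamp_ns, node_id))."""
    now = start_ns
    trace = []
    for hop in path:
        now += hop_latency_ns(hop, nbytes)
        trace.append((now, hop.node_id))
    return now - start_ns, trace


path = [
    Hop("pe0", node_latency_ns=2.0, link_latency_ns=1.0, link_bw_gb_per_s=64.0),
    Hop("r0c0", node_latency_ns=5.0, link_latency_ns=1.0, link_bw_gb_per_s=64.0),
    Hop("hbm_ctrl", service_latency_ns=50.0),
]
total_ns, trace = traverse(path, nbytes=128)
# total_ns == 63.0: (2 + 1 + 2) + (5 + 1 + 2) + 50
```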

R3. Topology is Configurable and Variable

Topology MUST NOT be hardcoded.

The simulator MUST accept multiple topologies (YAML / JSON / dict), varying:

SIP count,
CUBE count per SIP,
PE count per CUBE,
on-chip fabric structure (e.g., mesh / NoC router grid),
IO chiplets and interconnects,
link bandwidth, latency, and capacity parameters.

Given a topology:

all required request flows MUST have valid connectivity,
missing links are a topology construction error, not a routing error.
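
One way to surface missing links at construction time, sketched under the assumption that links are stored as a set of directed (src, dst) pairs and that required flows are given as node-id sequences. `TopologyError` and `validate_flows` are hypothetical names, and the node ids are illustrative.

```python
from typing import Iterable, Sequence, Set, Tuple


class TopologyError(ValueError):
    """Raised while building a topology, before any request is routed."""


def validate_flows(links: Set[Tuple[str, str]],
                   required_flows: Iterable[Sequence[str]]) -> None:
    """Check that every consecutive pair in every required flow has a link."""
    for flow in required_flows:
        for src, dst in zip(flow, flow[1:]):
            if (src, dst) not in links:
                raise TopologyError(f"no link for {src} → {dst}")


links = {
    ("sip0.cube0.pe0", "sip0.cube0.r0c0"),
    ("sip0.cube0.r0c0", "sip0.cube0.hbm_ctrl"),
}
# Passes silently: both hops are backed by explicit links.
validate_flows(links, [["sip0.cube0.pe0", "sip0.cube0.r0c0", "sip0.cube0.hbm_ctrl"]])
```

Running the check in the topology builder keeps the router free of "link not found" handling, which matches the construction-error rule above.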

R4. DI-First Component Design (Swappable Implementations)

All components MUST be replaceable behind stable interfaces, including:

routers and fabrics (NoC router mesh, switches),
DMA engines and queues,
memory controllers and services (HBM, TCM, queues),
management and control processors (modeled components).

The simulator MUST:

use dependency injection (DI) to bind node specifications to implementation classes,
allow component swapping without changing test logic,
avoid leaking routing or policy logic into unrelated components.
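
A minimal sketch of DI-style binding, assuming node specifications are dicts with a `kind` key. The `Component` protocol, the two router classes, and `Registry` are hypothetical and only illustrate how an implementation could be swapped without touching the calling code.

```python
from typing import Callable, Dict, Protocol


class Component(Protocol):
    """Stable interface that every swappable implementation satisfies."""
    def latency_ns(self, nbytes: int) -> float: ...


class FixedLatencyRouter:
    def __init__(self, delay_ns: float):
        self.delay_ns = delay_ns

    def latency_ns(self, nbytes: int) -> float:
        return self.delay_ns


class SizeAwareRouter:
    def __init__(self, delay_ns: float, bw_gb_per_s: float):
        self.delay_ns = delay_ns
        self.bw_gb_per_s = bw_gb_per_s

    def latency_ns(self, nbytes: int) -> float:
        # 1 GB/s == 1 byte/ns, so the quotient is already in nanoseconds.
        return self.delay_ns + nbytes / self.bw_gb_per_s


class Registry:
    """Maps a node-spec 'kind' to the factory that builds its implementation."""
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[..., Component]] = {}

    def bind(self, kind: str, factory: Callable[..., Component]) -> None:
        self._factories[kind] = factory

    def build(self, spec: dict) -> Component:
        params = {k: v for k, v in spec.items() if k != "kind"}
        return self._factories[spec["kind"]](**params)


registry = Registry()
registry.bind("router", FixedLatencyRouter)
router = registry.build({"kind": "router", "delay_ns": 5.0})
```

Rebinding `"router"` to `SizeAwareRouter` changes the behavior of every router node built afterwards, while tests that only call `latency_ns` stay unchanged.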

R5. Multi-Domain Communication Modeling

The simulator MUST model communication across hierarchical domains, including:

PE ↔ local HBM
PE ↔ remote HBM in the same CUBE
PE ↔ remote HBM in other CUBEs within the same SIP
PE ↔ remote HBM in other SIPs
PE ↔ PE messaging (e.g., IPCQ)
PE ↔ IO chiplets
CUBE ↔ CUBE (e.g., via UCIe)
SIP ↔ SIP (e.g., via PCIe or UAL)

Policy-based bypass is allowed ONLY if:

the bypass path is explicitly represented in the model,
the bypass incurs non-zero latency,
the bypass is visible in traces and diagrams.

R6. Verification-Driven Development

Development MUST follow a verification-driven workflow:

behavior is validated by tests with meaningful input cases,
tests encode SPEC-defined invariants, not incidental implementation details,
changes without clear verification coverage are not allowed.
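
As an illustration of a test that encodes SPEC invariants rather than implementation details, the sketch below checks a hop-by-hop trace against the latency and connectivity rules from section 0.1. The trace layout (a list of `(timestamp_ns, node_id)` pairs) and the function name are assumptions made for the example.

```python
from typing import List, Set, Tuple


def check_trace_invariants(trace: List[Tuple[float, str]],
                           links: Set[Tuple[str, str]],
                           start_ns: float = 0.0) -> None:
    """Assert SPEC-level invariants on one request's trace."""
    assert trace, "a routed request must visit at least one component"
    # Every routed request MUST incur latency > 0.
    assert trace[-1][0] - start_ns > 0, "end-to-end latency must be > 0"
    # Timestamps must never move backwards along the path.
    prev_t = start_ns
    for t, _ in trace:
        assert t >= prev_t, "timestamps must not go backwards"
        prev_t = t
    # No hidden shortcuts: each consecutive pair needs an explicit link.
    for (_, a), (_, b) in zip(trace, trace[1:]):
        assert (a, b) in links, f"no link for {a} → {b}"


links = {("pe0", "r0c0"), ("r0c0", "hbm_ctrl")}
check_trace_invariants([(5.0, "pe0"), (13.0, "r0c0"), (63.0, "hbm_ctrl")], links)
```

The check makes no assumption about which topology produced the trace, so it can be reused across configurations.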

R7. Runtime API

The simulator MUST provide a host-facing runtime API that:

exposes tensor deployment and kernel execution operations,
submits requests to endpoint components: PCIE_EP for memory operations (MemoryWrite/Read), IO_CPU for kernel launch,
owns host-side tensor handles and allocation metadata as PA shard maps,
remains topology-agnostic and does not perform routing or fan-out.

Tensor deployment in Phase 0 produces device physical-address (PA) shard mappings. Each shard explicitly identifies its target (sip, cube, pe) and PA range. No separate host-visible allocation RPC (e.g., AllocateTensorMeta) exists.

R8. Simulation Engine

The simulator MUST include a discrete-event simulation engine that:

injects requests into the system graph,
schedules events deterministically,
tracks completion via correlation identifiers,
decomposes runtime API operations into explicit graph requests (e.g., MemoryWrite, MemoryRead, KernelLaunch).
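
Deterministic scheduling with correlation identifiers can be sketched with a priority queue whose ties are broken by insertion order. This example uses only the Python standard library to stay self-contained; `EventQueue` and its methods are hypothetical names, not the engine's API.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class EventQueue:
    """Time-ordered event queue; equal timestamps fire in insertion order."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()  # unique tie-breaker per event
        self.now = 0.0

    def schedule(self, delay_ns: float, correlation_id: str,
                 action: Callable[[], None]) -> None:
        entry = (self.now + delay_ns, next(self._seq), correlation_id, action)
        heapq.heappush(self._heap, entry)

    def run(self) -> List[Tuple[float, str]]:
        """Run to completion; return (time_ns, correlation_id) in firing order."""
        completed = []
        while self._heap:
            self.now, _, cid, action = heapq.heappop(self._heap)
            action()
            completed.append((self.now, cid))
        return completed


q = EventQueue()
q.schedule(5.0, "mem-write-1", lambda: None)
q.schedule(5.0, "kernel-launch-1", lambda: None)
q.schedule(1.0, "mem-read-1", lambda: None)
order = q.run()
```

The sequence number guarantees that two events scheduled for the same timestamp are never compared by their payload, so repeated runs yield the same order.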

R9. CLI Execution Semantics

The CLI MUST support executing benchmarks on a specified device.

Benchmarks are executed once per invocation within a single simulation instance. If multiple devices are present in the topology, a benchmark MAY interact with multiple devices internally, but the CLI does not launch multiple independent benchmark instances by default.

R10. Memory Addressing (Phase 0)

The simulator uses a VA/PA memory model (ADR-0011):

Tensors are assigned a contiguous virtual address (VA) range at deployment.
PE_MMU translates VA→PA per access; TLB overhead is configurable.
Mapping installation (MmuMapMsg) traverses the fabric with measured latency.
Replicated tensors use per-cube local PA mapping; sharded tensors broadcast.
PA-only fallback is retained for backward compatibility.
Tensor placement is represented as a list of PA shards, each explicitly tagged with (sip, cube, pe), plus a tensor-wide va_base.

All memory access latency MUST be modeled explicitly via graph traversal. No implicit translation or hidden latency is allowed.
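
A sketch of the address arithmetic behind a VA→PA lookup over tagged shards. The record layout is an assumption: `va_offset` (each shard's offset from the tensor-wide `va_base`) and the `tlb_overhead_ns` parameter are hypothetical, and the overhead is returned to the caller so that the translation cost stays explicit rather than hidden.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    va_offset: int  # offset from the tensor-wide va_base (assumed field)
    pa_base: int
    nbytes: int


@dataclass(frozen=True)
class Placement:
    va_base: int
    shards: Tuple[Shard, ...]


def translate(p: Placement, va: int,
              tlb_overhead_ns: float) -> Tuple[Tuple[int, int, int], int, float]:
    """Return ((sip, cube, pe), pa, latency_ns) for a virtual address."""
    off = va - p.va_base
    for s in p.shards:
        if s.va_offset <= off < s.va_offset + s.nbytes:
            return (s.sip, s.cube, s.pe), s.pa_base + (off - s.va_offset), tlb_overhead_ns
    raise LookupError(f"VA {va:#x} is not mapped")


placement = Placement(
    va_base=0x1000,
    shards=(
        Shard(sip=0, cube=0, pe=0, va_offset=0, pa_base=0x8000, nbytes=256),
        Shard(sip=0, cube=0, pe=1, va_offset=256, pa_base=0x8000, nbytes=256),
    ),
)
target, pa, cost_ns = translate(placement, 0x1000 + 300, tlb_overhead_ns=2.0)
```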

2. Model Concepts

2.1 Graph Execution Model

Nodes represent modeled components (PE blocks, NoC routers, HBM controllers, IO components, etc.).
Directed edges represent interconnect links with latency and bandwidth attributes.
Execution model:
- a node receives a request,
- incurs node or service latency,
- emits the request to the next hop via a link,
- repeats until the destination service completes.

2.2 Routing

Routing MAY be implemented as:

policy-based routing (code-driven),
routing tables (config-driven),
topology-driven routing (e.g., mesh XY),
or a hybrid approach.

Routing MUST:

consume decoded address domains or explicit placement tags,
operate only on explicit topology connectivity,
remain deterministic.
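
As an example of topology-driven routing, the sketch below computes a dimension-ordered XY path on a router grid. It assumes routers are named `r{row}c{col}` and that X corresponds to the column index; both the naming and the axis choice are assumptions for illustration.

```python
from typing import List, Tuple


def xy_route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """Return router ids from src to dst, moving along X (columns) then Y (rows).

    src and dst are (row, col) grid coordinates.
    """
    row, col = src
    path = [f"r{row}c{col}"]
    while col != dst[1]:              # X dimension first
        col += 1 if dst[1] > col else -1
        path.append(f"r{row}c{col}")
    while row != dst[0]:              # then Y dimension
        row += 1 if dst[0] > row else -1
        path.append(f"r{row}c{col}")
    return path


route = xy_route((0, 0), (1, 2))
# route == ["r0c0", "r0c1", "r0c2", "r1c2"]
```

The path is a pure function of the two coordinates, which satisfies the determinism requirement. A real router would additionally confirm that each step is backed by an explicit link.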

Kernel execution requests reference tensors via PA shard mappings. Each shard explicitly identifies its target PE, allowing IO_CPU to fan out execution deterministically without relying on PA decoding.
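
The tag-based fan-out can be sketched as a grouping step over the shard list. Shards are represented here as plain dicts with hypothetical keys (`sip`, `cube`, `pe`, `pa_base`, `nbytes`), and sorting is used so the dispatch order does not depend on input order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

PeKey = Tuple[int, int, int]


def fan_out(shards: Iterable[Dict[str, int]]) -> List[Tuple[PeKey, List[Tuple[int, int]]]]:
    """Group PA ranges by their (sip, cube, pe) tag, in a stable order."""
    by_pe = defaultdict(list)
    for s in shards:
        key = (s["sip"], s["cube"], s["pe"])
        by_pe[key].append((s["pa_base"], s["nbytes"]))
    # Sort both targets and ranges so the result is independent of input order.
    return [(pe, sorted(ranges)) for pe, ranges in sorted(by_pe.items())]


shards = [
    {"sip": 0, "cube": 1, "pe": 0, "pa_base": 0x2000, "nbytes": 256},
    {"sip": 0, "cube": 0, "pe": 3, "pa_base": 0x1000, "nbytes": 256},
    {"sip": 0, "cube": 1, "pe": 0, "pa_base": 0x1000, "nbytes": 256},
]
plan = fan_out(shards)
```

No address decoding takes place: the target PE comes only from the explicit tag on each shard.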

3. Inputs and Identity

3.1 Node Identity Scheme

Nodes MUST have stable, parsable identifiers sufficient for domain inference and trace-based debugging.

Example patterns:

tray.host_cpu
sip{S}.io{I}.pcie_ep
sip{S}.cube{C}.fabric
sip{S}.cube{C}.pe{P}
sip{S}.cube{C}.hbm_ctrl
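
A sketch of parsing identifiers that follow the example patterns above, so that the domain (SIP, CUBE, IO chiplet) can be inferred from the string. The regular expression and the `NodeId` record are assumptions that cover only the patterns listed here.

```python
import re
from typing import NamedTuple, Optional


class NodeId(NamedTuple):
    sip: Optional[int]
    cube: Optional[int]
    io: Optional[int]
    leaf: str  # component name within its domain, e.g. "pe3" or "hbm_ctrl"


_SIP_PATTERN = re.compile(r"^sip(\d+)\.(?:cube(\d+)|io(\d+))\.(\w+)$")


def parse_node_id(text: str) -> NodeId:
    """Parse a node identifier into its domain indices and leaf name."""
    if text.startswith("tray."):
        return NodeId(None, None, None, text[len("tray."):])
    m = _SIP_PATTERN.match(text)
    if m is None:
        raise ValueError(f"unrecognized node id: {text!r}")
    sip, cube, io, leaf = m.groups()
    return NodeId(
        int(sip),
        int(cube) if cube is not None else None,
        int(io) if io is not None else None,
        leaf,
    )


pe = parse_node_id("sip1.cube2.pe3")
ep = parse_node_id("sip0.io1.pcie_ep")
```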

3.2 Link Specifications

A link MAY include:

fixed latency (ns),
bandwidth (GB/s) for serialization latency,
optional capacity for contention modeling.

Topology builders MUST ensure:

required links exist,
link parameters are consistent with topology intent.

4. Output, Debuggability, and Diagrams

The simulator MUST provide:

per-request hop-by-hop traces with timestamps,
clear error messages for missing connectivity (e.g., "no link for A → B"),
reproducible, inspectable representations of the modeled system.

Diagrams are derived artifacts of the simulator model:

They MUST be generatable from the compiled topology and distance metadata used by execution and routing.
Generation MAY be performed lazily or cached by the implementation, as long as outputs remain consistent with the compiled topology.

Diagram abstraction levels and distance-aware layout rules are defined in ADR-0005. Automatic diagram generation and output conventions are defined in ADR-0006.

By default, generated diagrams are written under:

docs/diagrams/

5. Non-Goals (for now)

The following are explicitly out of scope:

cycle-accurate microarchitecture modeling,
detailed cache coherence protocols,
full PCIe / CXL protocol correctness.

These MAY be layered later via additional components and policies.

6. Decision Boundaries

SPEC.md defines architectural intent and invariants.
Code implements SPEC and MUST NOT introduce hidden invariants.
Tests validate SPEC-defined behavior and MUST NOT encode fixed topology assumptions.
ADRs record non-trivial architectural decisions and MUST be referenced when relevant.
