commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,108 @@
# ADR-0001: PhysAddr Layout & Address Decoding Contract
## Status
Accepted
## Date
2026-02-27
## Context
KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
## Decision
We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
### D1. PhysAddr is an immutable value object
- PhysAddr is immutable and comparable as a pure value.
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.
### D2. PhysAddr fields (logical contract)
PhysAddr must be able to represent at least:
- `rack_id` (optional but reserved for scale-out)
- `sip_id` (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)
Decoded/derived fields may include (optional):
- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
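A minimal sketch of D1 and D2 as a frozen dataclass, assuming Python (the language implied by the recommended module path). The field types and the `None` defaults for optional fields are illustrative assumptions, not part of the contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysAddr:
    """Immutable value object; equality and hashing are by field value."""
    sip_id: int
    sip_seg: int
    local_offset: int
    rack_id: Optional[int] = None  # optional, reserved for scale-out
    # Decoded/derived fields (optional). Names follow D2; types are assumed.
    cube_id: Optional[int] = None
    kind: Optional[str] = None     # e.g. "hbm", "pe_resource", "raw"
    pe_id: Optional[int] = None


a = PhysAddr(sip_id=0, sip_seg=2, local_offset=0x100)
b = PhysAddr(sip_id=0, sip_seg=2, local_offset=0x100)
```

Because the dataclass is frozen, two instances with equal fields compare equal and can be used as dictionary keys, and attribute assignment raises an error.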
### D3. Decoding is deterministic and policy-compatible
- Decoding must deterministically map an integer address to:
  - destination SIP domain (`sip_id`)
  - destination sub-domain (`cube_id` if applicable)
  - destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes. It may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer, not scattered across individual components.
### D4. Topology-derived constants live in the topology layer
Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
They must not be hardcoded in node implementations.
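The following sketch illustrates D3 and D4 together: a pure decode function whose only inputs are the address and explicit parameters injected from configuration. The linear layout (`sip_id`, then `sip_seg`, then `local_offset`), the names `DecodeParams` and `decode`, and the parameter `segs_per_sip` are hypothetical; the actual bit allocation is not specified in this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeParams:
    """Explicit topology parameters, supplied by the topology/config layer."""
    seg_size: int      # bytes per SIP-level segment/window (assumed meaning)
    segs_per_sip: int  # segments in one SIP's address span (assumed parameter)


def decode(addr: int, p: DecodeParams) -> dict:
    """Pure function: the same (addr, params) always yields the same domains."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    sip_span = p.seg_size * p.segs_per_sip
    sip_id, within_sip = divmod(addr, sip_span)
    sip_seg, local_offset = divmod(within_sip, p.seg_size)
    return {"sip_id": sip_id, "sip_seg": sip_seg, "local_offset": local_offset}


# Parameters would normally come from topology YAML/JSON; a dict stands in here.
cfg = {"seg_size": 1 << 20, "segs_per_sip": 4}
params = DecodeParams(**cfg)
decoded = decode(5 * (1 << 20) + 0x40, params)
```

Swapping the layout only requires a different `decode` implementation behind the same signature; callers never see raw bits.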
### D5. Routing consumes decoded domains, not raw bits
Routing policy uses decoded domains:
- `src` location (sip/cube/pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency
Routing must not inspect raw bit-fields directly except inside the decoding module.
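As a rough illustration of D5, local vs remote classification can be written purely in terms of decoded domains. The `Location` type and the three category names below are assumptions made for this sketch.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    sip_id: int
    cube_id: Optional[int] = None


def classify(src: Location, dst: Location) -> str:
    """Classify traffic using decoded domains only; no raw bit-fields."""
    if src.sip_id != dst.sip_id:
        return "remote_sip"
    if src.cube_id != dst.cube_id:
        return "remote_cube"
    return "local"


same = classify(Location(0, 1), Location(0, 1))
other_cube = classify(Location(0, 1), Location(0, 2))
other_sip = classify(Location(0, 1), Location(3, 1))
```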
## Alternatives Considered
1) **Use raw integers everywhere, decode ad-hoc in routing**
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
2) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
3) **Put decoding inside memory controllers or routers**
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
## Consequences
### Positive
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
### Tradeoffs / Costs
- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
## Implementation Notes (Non-normative)
- Recommended module boundary:
- `src/kernbench/policy/address/phyaddr.py`
- Tests should cover:
- deterministic decoding
- local vs remote classification from decoded fields
- invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
## Links
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
@@ -0,0 +1,103 @@
# ADR-0002: Routing Distance, Ordering & Bypass Rules
## Status
Accepted
## Date
2026-02-27
## Context
The KernBench Graph Latency Simulator must compare kernel execution time
across different architectures and topologies by computing end-to-end
latency from graph traversal.
To support meaningful comparison:
- routing must be deterministic
- latency must reflect actual interconnect structure
- local vs remote traffic must be distinguishable
- “bypass” optimizations must not undermine debuggability or correctness
The simulator also aims to avoid software-managed metadata and hidden
shortcuts that obscure control paths.
## Decision
### D1. Distance is accumulated latency, not hop count
- Routing “distance” is defined as the **sum of per-node and per-link latency**.
- Hop count alone must not be used for ordering or path selection.
- Size-aware serialization latency (bytes / BW) contributes to distance.
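A small sketch of the distance definition in D1. It assumes store-and-forward behavior, so the serialization term (bytes / BW) is charged on every link; the `Hop` type, the units, and the example numbers are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hop:
    node_latency_ns: float       # fixed latency of the node being entered
    link_latency_ns: float       # latency of the link into that node
    link_bw_bytes_per_ns: float  # link bandwidth (1 byte/ns equals 1 GB/s)


def path_distance_ns(hops: Sequence[Hop], size_bytes: int) -> float:
    """Sum per-node latency, per-link latency, and serialization (bytes / BW)."""
    total = 0.0
    for h in hops:
        total += h.node_latency_ns + h.link_latency_ns
        total += size_bytes / h.link_bw_bytes_per_ns
    return total


one_slow_hop = [Hop(5.0, 1.0, 1.0)]                        # 1 hop at 1 B/ns
two_fast_hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 64.0)]  # 2 hops at 64 B/ns
d_slow = path_distance_ns(one_slow_hop, 256)
d_fast = path_distance_ns(two_fast_hops, 256)
```

With these made-up numbers the two-hop path has the smaller distance for a 256-byte request, which is why hop count alone is not a usable ordering.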
### D2. Routing order is derived from graph traversal
- The chosen route is the path with minimum accumulated latency
given the constructed graph and routing policy.
- Deterministic ordering must be guaranteed for identical inputs
(topology + policy + request).
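One way to obtain a minimum-latency route with deterministic tie-breaking is Dijkstra's algorithm with the candidate path stored next to the distance in the heap, so equal distances fall back to comparing paths. This is a sketch under that assumption, not the simulator's routing policy; the graph encoding and function name are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, edge latency)]


def min_latency_route(graph: Graph, src: str, dst: str) -> Tuple[float, List[str]]:
    """Dijkstra over accumulated latency; equal distances compare by path."""
    heap: List[Tuple[float, List[str]]] = [(0.0, [src])]
    settled = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return dist, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, latency in sorted(graph.get(node, [])):
            if nxt not in settled:
                heapq.heappush(heap, (dist + latency, path + [nxt]))
    raise ValueError(f"no route from {src} to {dst}")


# Two equal-latency routes exist (via "b" and via "c").
g = {"a": [("c", 1.0), ("b", 1.0)], "b": [("d", 1.0)], "c": [("d", 1.0)]}
route = min_latency_route(g, "a", "d")
```

The result does not depend on the order in which edges were listed in the input, which is the property D2 asks for.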
### D3. Bypass is explicit and graph-represented
- Any bypass (e.g., local cube HBM access via XBAR instead of NOC) must be:
  - explicitly represented as a graph path, and
  - subject to latency accumulation like any other path.
- Example: PE_DMA has dual egress — one to XBAR (HBM path) and one to NOC (non-HBM path).
Both are explicit graph edges; neither is a “bypass” — they are distinct data paths
serving different memory domains.
- Implicit or “magic” bypass paths are disallowed.
### D4. No zero-latency end-to-end paths
- Every routed request must incur **end-to-end** latency > 0.
- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
when the fabric is distributed and distance is not meaningful at that granularity.
This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
- Fully zero-latency end-to-end paths are disallowed, except for explicit
test-only stubs clearly marked as such.
### D5. Policy vs topology responsibility split
- Topology builder:
  - defines nodes and links and their latency/BW parameters
- Routing policy:
  - selects among available graph paths based on decoded domains
- Routing policy must not assume missing links; missing connectivity
is a topology construction error.
### D6. No software-managed routing metadata
- Routing decisions must not rely on per-request software-managed metadata
that tracks distance, hop count, or ordering outside the graph model.
- All distance/order computation is derived from traversal itself.
## Alternatives Considered
1) **Hop-count based routing**
- Rejected: ignores heterogeneous latency/BW and misrepresents
architectural differences.
2) **Implicit local shortcuts**
- Rejected: breaks debuggability and violates traversal-based latency.
3) **Software-managed distance metadata**
- Rejected: increases control overhead and obscures routing semantics.
## Consequences
### Positive
- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
- Architecture comparisons reflect real interconnect structure.
- Routing behavior is reproducible and deterministic.
### Tradeoffs / Costs
- Graph construction must be correct and complete.
- Bypass modeling requires explicit graph representation,
which slightly increases topology description complexity.
## Implementation Notes (Non-normative)
- Recommended responsibilities:
- Graph builder: ensure all required paths exist.
- Router: select next hop based on decoded domains and policy.
- Tests should assert:
- non-zero end-to-end latency
- deterministic routing for identical inputs
- bypass paths appear explicitly in emitted traces
## Links
- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
- ADR-0001: PhysAddr layout & decoding contract
@@ -0,0 +1,64 @@
# ADR-0003: Target System Hierarchy & Modeling Scope
## Status
Accepted
## Context
We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
through switching fabrics, with a host CPU issuing commands/kernels.
## Decision
We model the system hierarchy explicitly:
### D1. Tray-level
- A compute tray contains:
- Host CPU (issues requests / coordinates runtime & data placement)
- Multiple identical SIPs (accelerators)
- Interconnect fabric between SIPs (PCIe and/or UAL via switches)
### D2. SIP-level
- A SIP is a multi-die package composed of:
- Multiple CUBEs (HBM die + compute PEs + UCIe)
- One or more IO chiplets (host/SIP interfaces)
- IO chiplets:
- provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
- can be multiple per SIP
- placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
### D3. CUBE-level
- A CUBE contains:
- HBM + memory controller (HBM_CTRL)
- XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
- Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
- NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
- multiple PEs
- up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
### D4. PE-level
- A PE can execute one kernel instance
- PE contains internal control + accelerators (modeled at PE view granularity):
- PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
## Consequences
- The simulator supports abstraction by “views”:
- SIP view hides PE internals
- CUBE view treats each PE as a single block
- PE view expands PE internals
- Topology remains parameterized; sizes/counts/links come from configuration.
## Links
- SPEC R3/R5
- ADR-0005 (diagram views)
@@ -0,0 +1,64 @@
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
## Status
Accepted
## Context
Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
## Decision
### D1. Local HBM definition
- Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PE's DMA path
via the XBAR (top or bottom, depending on PE corner placement).
- The path is: PE_DMA → XBAR.top/bottom → HBM_CTRL.
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
### D2. Local HBM bandwidth guarantee contract
- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth
independent of intervening fabric bandwidth limits.
- This guarantee is modeled by:
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
- while still incurring non-zero latency along explicitly modeled components.
### D3. Cross-half HBM semantics
- A PE connected to XBAR.bottom that accesses HBM pseudo-channels on the XBAR.top half
(or vice versa) traverses a bridge:
- PE_DMA → XBAR.bottom → bridge → XBAR.top → HBM_CTRL
- Bridge bandwidth may limit cross-half HBM access relative to local-half access.
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
- NOC bandwidth within the cube,
- inter-cube UCIe links,
- inter-SIP fabric (PCIe/UAL).
- These paths MUST be explicit and traceable.
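The bandwidth rules in D2 through D4 can be summarized with a simple bottleneck model. This is only a sketch: the path-kind names, the single `fabric_bw` parameter standing in for NOC/UCIe/PCIe/UAL limits, and the use of `min` as the service model are all assumptions.

```python
def effective_bw_gbps(path_kind: str, hbm_bw: float, bridge_bw: float,
                      fabric_bw: float) -> float:
    """Bottleneck bandwidth for one access, by path kind (names are assumed)."""
    if path_kind == "local_hbm":
        return hbm_bw                  # D2: independent of fabric limits
    if path_kind == "cross_half":
        return min(hbm_bw, bridge_bw)  # D3: the bridge may limit
    if path_kind == "non_local":
        return min(hbm_bw, fabric_bw)  # D4: NOC/UCIe/PCIe/UAL may limit
    raise ValueError(f"unknown path kind: {path_kind}")


local = effective_bw_gbps("local_hbm", hbm_bw=100.0, bridge_bw=50.0, fabric_bw=10.0)
cross = effective_bw_gbps("cross_half", hbm_bw=100.0, bridge_bw=50.0, fabric_bw=10.0)
remote = effective_bw_gbps("non_local", hbm_bw=100.0, bridge_bw=50.0, fabric_bw=10.0)
```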
### D5. Shared SRAM semantics
- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
- Access path: PE_DMA → NOC → shared SRAM.
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
## Verification Notes
Tests should cover:
- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- cross-half HBM case: latency includes bridge traversal
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
- shared SRAM case: access via NOC with correct BW
## Links
- SPEC R2/R5
- ADR-0002 (distance/order & explicit bypass)
@@ -0,0 +1,186 @@
# ADR-0005: Diagram Views & Distance-Aware Layout Rules
## Status
Accepted
## Context
We require verifiable and inspectable system modeling for a large-scale,
parameterized AI Accelerator system.
Humans must be able to:
- visually inspect the modeled topology,
- reason about communication structure and relative distance,
- do so at multiple abstraction levels without being overwhelmed by detail.
The simulator models distance (accumulated latency) as a first-class concept.
Diagrams must reflect this distance by default.
---
## Global Defaults
- All diagrams MUST be **distance-aware by default**.
- All diagrams MUST render **representative views** of the architecture.
- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
- Instance indices MAY be used ONLY:
- to define a distance anchor in asymmetric or debugging scenarios, or
- when explicitly requested.
---
## Representative Rendering Rule
- All CUBEs share the same internal structure.
- All PEs share the same internal structure.
Therefore:
- SIP-level diagrams render representative CUBEs and IO chiplets.
- CUBE-level diagrams render representative PEs as opaque blocks.
- PE-level diagrams render a representative PE with fully expanded internals.
Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
unless explicitly requested.
---
## Diagram Views
### View A — SIP-Level Diagram
**Purpose**
Explain system-scale structure and connectivity.
**Visible elements**
- SIP boundaries (optional)
- CUBEs (opaque blocks)
- IO chiplets (opaque blocks)
- Optional UCIe stubs only if needed to clarify connectivity
**Hidden elements**
- PE internals
- CUBE internal fabric
- IO chiplet internals
**Visible links**
- Host ↔ IO chiplets (PCIe)
- SIP ↔ SIP (PCIe / UAL via switches)
- IO ↔ CUBE (on-package links)
---
### View B — CUBE-Level Diagram
**Purpose**
Explain cube-internal structure and data/control flow.
**Visible elements**
- XBAR (top/bottom): HBM pseudo-channel crossbar
- Bridge (left/right): cross-half HBM connectors between XBAR.top and XBAR.bottom
- NOC: distributed on-die fabric for non-HBM traffic
- HBM subsystem (HBM_CTRL)
- Shared SRAM: cube-level shared memory
- Management CPU (M_CPU)
- PEs as opaque blocks (PE[0..N-1])
- UCIe endpoints (N/E/W/S) as ports
**Hidden elements**
- PE internals
**Visible links**
- PE → XBAR (HBM data path, top or bottom by corner placement)
- PE → NOC (non-HBM data path)
- XBAR ↔ bridge ↔ XBAR (cross-half HBM access)
- XBAR → HBM_CTRL
- NOC ↔ UCIe endpoints
- NOC ↔ shared SRAM
- M_CPU ↔ NOC (command path)
- NOC → PE_CPU (command delivery, collapsed into PE block)
---
### View C — PE-Level Diagram
**Purpose**
Explain internal PE behavior and execution structure.
**Visible elements**
- PE_CPU
- Command handler / scheduler
- PE_TCM (local SRAM)
- HW accelerators (DMA, GEMM, MATH, etc.)
- Local HBM interface
- Optional IPCQ / messaging endpoints
**Visible links**
- Control paths (CPU → scheduler → engines)
- Data paths (engines ↔ TCM, DMA ↔ local HBM)
- External fabric ports as abstract ports only
---
## Distance-Aware Layout (Default)
### Distance definition
- Distance is defined as **accumulated latency**, consistent with ADR-0002.
- Distance is computed from a single anchor node.
### Default anchor selection
- SIP view: IO chiplet (or Host CPU if present)
- CUBE view: a representative PE
- PE view: PE_CPU or Command Handler
Anchors are **implicit defaults** and MUST NOT be required to be specified.
### Layout rules
- Diagrams MUST be laid out in layers based on distance buckets.
- Layout direction MUST be consistent within a view type
(preferred: left-to-right).
- Nodes with equal distance MUST have stable ordering
(by role or identifier, deterministically).
Cycles MAY be rendered using dashed or curved edges for readability,
without affecting distance semantics.
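The layering rule can be sketched as follows, assuming a fixed bucket width and ordering by node identifier within a layer. The function name, the bucket width, and the example distances are made up for illustration.

```python
from typing import Dict, List


def layer_by_distance(dist_ns: Dict[str, float], bucket_ns: float) -> List[List[str]]:
    """Group nodes into layers by distance bucket; sort within a layer by id."""
    if bucket_ns <= 0:
        raise ValueError("bucket_ns must be positive")
    layers: Dict[int, List[str]] = {}
    for node, d in dist_ns.items():
        layers.setdefault(int(d // bucket_ns), []).append(node)
    return [sorted(layers[k]) for k in sorted(layers)]


# Distances from the anchor (a representative PE in the CUBE view).
example = {"pe": 0.0, "xbar": 4.0, "noc": 4.0, "hbm_ctrl": 12.0}
layout = layer_by_distance(example, bucket_ns=4.0)
```

Empty buckets are skipped, so layers stay adjacent even when distances are sparse.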
---
## Generation Contract (for Tools / Claude Code)
When generating diagrams:
- Assume distance-aware layout by default.
- Assume representative rendering by default.
- Do NOT ask for SIP/CUBE/PE indices unless required.
- Do NOT expand hidden abstraction levels.
- Prefer architectural clarity over micro-hop fidelity.
---
## Consequences
- Diagrams are stable across topology scaling.
- Changes in distance or routing policy are reflected visually.
- Diagrams serve as verifiable artifacts derived from the simulator model,
not as hand-maintained documentation.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Routing distance semantics)
- ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
## Status
Accepted
## Context
The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
and computes routing and accumulated latency (distance).
Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
hand-maintained topology drawings.
Additionally, for usability, diagrams should be emitted automatically into a stable location
so that developers can preview them immediately in the repository.
---
## Decision
### D1. Topology compilation is the single source of truth
- topology.yaml (or equivalent config) is compiled into:
- an explicit system graph,
- node/link attributes,
- routing policies.
This compiled graph is the authoritative representation of the system.
### D2. Distance extraction during compilation
- During or immediately after topology compilation, the simulator MUST compute distance metadata
(accumulated latency) consistent with ADR-0002.
- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
layout placement for such nodes uses explicit position metadata rather than distance buckets.
### D3. Diagram generation is a derived artifact
- Diagrams MUST be generated from:
- the compiled topology graph,
- extracted distance metadata,
- view/layout rules defined in ADR-0005.
- Diagram generation MUST NOT require additional hand-written topology descriptions.
### D4. Automatic diagram emission to the repository
- As part of topology compilation, the implementation MUST produce the following diagrams by default:
- SIP-level diagram (representative, distance-aware)
- CUBE-level diagram (representative, distance-aware)
- PE-level diagram (representative, distance-aware)
- The default output directory is:
- `docs/diagrams/`
- The generator MUST overwrite/update the diagram files only when the compiled topology (or the diagram rules) change.
### D5. View-specific projection and layout
For each view (SIP / CUBE / PE):
- The generator MUST project the compiled graph into a reduced view graph:
- hide/collapse nodes according to ADR-0005,
- preserve connectivity semantics relevant to that view,
- compute distance buckets and assign layout layers deterministically.
- CUBE-level projection MUST include:
- XBAR (top/bottom), bridge (left/right), NOC, HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
and PEs as opaque blocks.
- Distinct edge kinds for HBM path (PE→XBAR) vs non-HBM path (PE→NOC).
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
### D6. Output formats and determinism
- The generator MUST output at least one of:
- Mermaid (Markdown-native)
- Graphviz DOT (rank-based control)
- SVG (mm-accurate layout, no external dependencies)
- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
- Output MUST be deterministic:
- same topology + same rules → identical diagram text
- File naming MUST be deterministic and stable (see "Output Conventions").
### D7. Performance and caching
- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
remain consistent with the compiled topology.
- The implementation SHOULD use a cache key based on:
- topology content hash,
- routing policy version,
- diagram rules version,
- view type (SIP/CUBE/PE).
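A possible cache key built from the four inputs listed above. The choice of SHA-256, canonical JSON for the topology content, and the function name are assumptions of this sketch.

```python
import hashlib
import json


def diagram_cache_key(topology: dict, routing_policy_version: str,
                      diagram_rules_version: str, view: str) -> str:
    """Hash canonicalized topology content plus versions and view type."""
    canonical = json.dumps(topology, sort_keys=True, separators=(",", ":"))
    h = hashlib.sha256()
    for part in (canonical, routing_policy_version, diagram_rules_version, view):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent parts cannot run together
    return h.hexdigest()


key_cube = diagram_cache_key({"sips": 2, "cubes": 4}, "1", "1", "CUBE")
key_pe = diagram_cache_key({"sips": 2, "cubes": 4}, "1", "1", "PE")
```

Sorting keys before hashing means that reordering entries in the topology file does not invalidate the cache.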
---
## Output Conventions
### Directory
- `docs/diagrams/` is the canonical output directory for generated diagrams.
### File names (recommended, deterministic)
- `system_view.svg` / `system_view.mmd` / `system_view.dot`
- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
Optionally, for multi-topology workflows:
- `sip_view__{topology_id}.svg`
- `cube_view__{topology_id}.svg`
- `pe_view__{topology_id}.svg`
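The naming convention above could be produced by a helper like this one. The function name and the format keys (`"svg"`, `"mermaid"`, `"dot"`) are hypothetical; the resulting file names follow the recommended list.

```python
from typing import Optional

_EXTENSIONS = {"svg": "svg", "mermaid": "mmd", "dot": "dot"}


def diagram_filename(view: str, fmt: str, topology_id: Optional[str] = None) -> str:
    """Build a stable file name such as 'cube_view.mmd' or 'pe_view__t1.svg'."""
    if view not in ("system", "sip", "cube", "pe"):
        raise ValueError(f"unknown view: {view}")
    suffix = f"__{topology_id}" if topology_id else ""
    return f"{view}_view{suffix}.{_EXTENSIONS[fmt]}"


plain = diagram_filename("cube", "mermaid")
tagged = diagram_filename("pe", "svg", topology_id="t1")
```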
### Repository policy
- Generated diagram files MAY be committed to the repository to enable diff-based review.
- If committed, they MUST be reproducible from topology compilation.
---
## Consequences
- Diagrams are always consistent with simulator behavior.
- Architectural changes automatically propagate to visualizations.
- Diagram diffs become meaningful indicators of architectural change.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Distance semantics)
- ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,89 @@
# ADR-0007: Runtime API and Simulation Engine Boundaries
## Status
Accepted
## Context
The simulator consists of multiple layers with distinct responsibilities:
- a host-facing API layer used by benchmarks and user code,
- a discrete-event simulation engine that executes requests,
- device components that model hardware behavior.
Without strict boundaries, orchestration logic can leak into components,
or simulation internals can become entangled with user-facing APIs.
This ADR defines clear responsibility boundaries between:
- runtime API,
- simulation engine (sim_engine),
- hardware components.
---
## Decision
### D1. Runtime API is host-facing orchestration only
The runtime API represents host/driver-level behavior and MUST:
- expose high-level operations (tensor deployment, kernel launch),
- submit requests only to endpoint components (e.g., IO_CPU),
- await completion via futures/handles,
- own and persist host-side metadata (tensor allocation maps, kernel bindings).
The runtime API MUST NOT:
- hardcode hop-by-hop routing or fan-out,
- directly invoke internal components (M_CPU, PE_CPU, engines),
- embed topology- or routing-specific assumptions.
---
### D2. Simulation engine executes and schedules requests
The simulation engine (sim_engine) MUST:
- inject requests into the compiled topology graph,
- schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking,
- decompose operations into low-level requests when required
(e.g., MemoryWrite events).
The simulation engine MUST NOT:
- define tensor semantics,
- define kernel execution policies,
- expose internal graph details to the runtime API.
---
### D3. Components own fan-out and aggregation
Device-side components MUST:
- fan-out requests to downstream domains
(IO_CPU → M_CPU → PE_CPU → schedulers/engines),
- aggregate completion and failure signals,
- propagate results deterministically upstream.
Neither the runtime API nor the simulation engine may orchestrate
component-level fan-out explicitly.
---
## Consequences
- Runtime APIs remain stable as topology and routing evolve.
- Simulation internals can change without affecting user-facing code.
- Component implementations remain swappable via DI.
---
## Links
- SPEC R4, R7, R8
- ADR-0008 (Tensor deployment)
- ADR-0009 (Kernel execution)
@@ -0,0 +1,100 @@
# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
## Status
Accepted
## Context
Benchmarks require PyTorch-like tensor semantics:
- tensor creation (empty, fill),
- deployment to accelerator devices (tensor.to()).
In a realistic system, host software manages allocation/mapping and installs
mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
- device memory operations use PA only,
- VA/MMU/IOMMU is not modeled.
To keep the host↔device interface minimal, we avoid a separate
AllocateTensorMeta message. Instead, host allocation produces a PA shard map
that is used directly by MemoryWrite/Read and KernelLaunch.
---
## Decision
### D1. Tensor is a host-owned handle with PA shard mapping
A Tensor object is a host-owned handle that encapsulates:
- shape and dtype,
- initialization intent,
- device placement and allocation metadata as a PA shard map.
After deployment, the Tensor handle MUST contain:
- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
This PA shard mapping is the single source of truth for kernel argument binding.
---
### D2. Deployment uses a host allocator (Phase 0)
In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
- placement (split/replicate/hybrid) is decided by a DP policy,
- allocation assigns PA ranges at the PE level and returns shard mappings,
- the Tensor handle stores the resulting shard list deterministically.
No separate host-visible device allocation RPC is required in Phase 0.
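A sketch of D1 and D2: a shard record with the six fields listed in D1, and a hypothetical per-PE bump allocator that splits a tensor evenly. The even split stands in for the DP policy, and the PA values here are simple per-PE cursors from an assumed base address; a real allocator would derive them from the PhysAddr layout (ADR-0001).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int            # device physical address of the shard
    nbytes: int
    offset_bytes: int  # offset of this shard within the logical tensor


class BumpAllocator:
    """Hypothetical host allocator; keeps one cursor per (sip, cube, pe)."""

    def __init__(self, base_pa: int = 0) -> None:
        self._base = base_pa
        self._next: Dict[Tuple[int, int, int], int] = {}

    def split(self, total_bytes: int, pes: List[Tuple[int, int, int]]) -> List[Shard]:
        """Split evenly across PEs; the last shard takes the remainder."""
        per_pe = total_bytes // len(pes)
        shards, offset = [], 0
        for i, key in enumerate(pes):
            n = total_bytes - offset if i == len(pes) - 1 else per_pe
            pa = self._next.get(key, self._base)
            self._next[key] = pa + n
            shards.append(Shard(*key, pa=pa, nbytes=n, offset_bytes=offset))
            offset += n
        return shards


alloc = BumpAllocator(base_pa=0x1000)
shards = alloc.split(10, [(0, 0, 0), (0, 0, 1), (0, 0, 2)])
```

Given the same sequence of calls, the allocator returns the same shard list, which keeps the Tensor handle deterministic.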
---
### D3. Data initialization and transfer uses MemoryWrite/Read only
Any data initialization or transfer implied by a tensor (e.g., fill, copy)
MUST be represented using Host ↔ IO_CPU messages only:
- MemoryWrite
- MemoryRead
Rules:
- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
- Allocation metadata MUST NOT be embedded as a separate allocation message.
- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
The simulation engine schedules MemoryWrite/Read through the graph so that
latency is computed by explicit traversal.
---
### D4. Extension path (non-breaking)
Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
- virtual addressing in tensor handles,
- mapping install steps,
- translation latency/page granularity.
The Phase 0 PA shard map remains a valid fast-path configuration.
---
## Consequences
- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
- KernelLaunch can pass per-PE data placement explicitly via shard tags.
- Early implementation stays simple and testable.
---
## Links
- ADR-0011 (PA-first)
- ADR-0012 (Host↔IO_CPU schema)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (Kernel execution)
@@ -0,0 +1,74 @@
# ADR-0009: Kernel Execution Messaging and Completion Semantics
## Status
Accepted
## Context
Kernel execution is initiated by the host and proceeds through
device control components:
Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
Completion propagates in reverse order.
To keep benchmarks simple and topology-agnostic,
kernel execution must be endpoint-driven with deterministic aggregation.
---
## Decision
### D1. Kernel launch is an endpoint request
A kernel launch is initiated by submitting a single KernelLaunch request
to the IO_CPU endpoint.
The runtime API MUST:
- construct the kernel launch request,
- submit it to IO_CPU,
- await a single completion result.
The runtime API MUST NOT orchestrate internal fan-out.
---
### D2. Tensor arguments are passed by metadata
KernelLaunch requests MUST reference tensor arguments via:
- host-owned tensor handles, or
- resolved device address maps derived from those handles.
Bulk tensor data MUST NOT be embedded in kernel launch messages.
---
### D3. Fan-out and aggregation are component responsibilities
- IO_CPU fans out work to M_CPUs.
- M_CPU fans out work to PE_CPUs.
- PE_CPU manages kernel execution and engine dispatch.
Completion semantics:
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
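The two-level aggregation can be sketched with one fold function applied at each level. The failure policy is not defined in this ADR; the sketch assumes a "wait for all children, then report every failure" policy, and the function name and child labels are illustrative.

```python
from typing import Iterable, Tuple


def aggregate(children: Iterable[Tuple[str, bool]]) -> Tuple[bool, Tuple[str, ...]]:
    """Fold child completions into one result; failures listed in sorted order."""
    failed = tuple(sorted(name for name, ok in children if not ok))
    return (not failed, failed)


# M_CPU level: one result per cube, aggregated over its PEs.
cube0 = aggregate([("pe1", True), ("pe0", True)])
cube1 = aggregate([("pe1", False), ("pe0", True)])
# IO_CPU level: aggregated over the targeted cubes.
io_result = aggregate([("cube0", cube0[0]), ("cube1", cube1[0])])
```

Sorting the failure list makes the reported result independent of the order in which child completions arrive.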
---
### D4. Completion and failure propagation
- All messages MUST carry correlation identifiers.
- Completion and failure MUST propagate deterministically to the host.
- The simulation engine provides futures/handles to observe completion.
---
## Links
- SPEC R1, R2, R7, R8
- ADR-0007 (Runtime API boundaries)
- ADR-0008 (Tensor deployment)
@@ -0,0 +1,62 @@
# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
## Status
Accepted
## Context
Benchmarks represent device-agnostic workloads that operate on a single device.
Users may want to run a benchmark:
- on a specific device, or
- across all devices in the system.
Device enumeration must not leak into benchmarks or runtime APIs.
---
## Decision
### D1. Benchmarks are single-device by design
- A benchmark MUST define behavior for a single device only.
- A benchmark MUST accept a device identifier as input.
- Benchmarks MUST NOT enumerate or loop over multiple devices.
---
### D2. CLI controls device selection
The `kernbench run` command supports an optional `--device` argument:
- If `--device <id>` is specified:
- the benchmark executes once for the specified device.
- If `--device` is omitted:
- the benchmark executes once using all the SIPs discovered in the topology.
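A sketch of how the `--device` option might be parsed and resolved with the standard `argparse` module. The positional `benchmark` argument and the helper names are assumptions; only `kernbench run` and `--device` come from this ADR.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kernbench")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("benchmark")  # assumed positional argument
    run.add_argument("--device", default=None)
    return parser


def select_devices(device: Optional[str], discovered: List[str]) -> List[str]:
    """Resolve the CLI option against the SIPs discovered in the topology."""
    if device is None:
        return list(discovered)
    if device not in discovered:
        raise ValueError(f"unknown device: {device}")
    return [device]


args = build_parser().parse_args(["run", "gemm", "--device", "sip:0"])
chosen = select_devices(args.device, ["sip:0", "sip:1"])
everything = select_devices(None, ["sip:0", "sip:1"])
```

Device resolution lives in the CLI layer only, so the benchmark itself still receives a device identifier and never enumerates devices.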
---
### D3. Multi-device execution is logically parallel
When running on multiple devices:
- benchmark executions are submitted to a single simulation engine instance,
- executions are logically parallel in simulation time,
- inter-device contention is naturally modeled.
---
### D4. Runtime API and simulation engine remain device-scoped
- Runtime API calls operate on one device per invocation.
- The simulation engine schedules all requests deterministically.
- Neither layer enumerates devices.
---
## Links
- SPEC R7, R8
- ADR-0007 (Runtime API boundaries)
@@ -0,0 +1,65 @@
# ADR-0011: Memory Addressing Simplification (PA-first)
## Status
Accepted
## Context
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
translation path for DMA: host allocates physical memory at PE level, maps it
into a virtual address space, installs mappings, and DMA requests use virtual
addresses that are translated to physical addresses.
For early development, we want a minimal, deterministic model that enables:
- correct routing and latency accounting through the graph,
- stable tensor deployment and kernel execution semantics,
- future extension toward VA/MMU without rewriting workflows.
---
## Decision
### D1. Phase 0 model is PA-only
The simulator uses a PA-first model:
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
addresses (PA) plus size.
- Tensor handles store PA-based shard mappings after deployment.
- KernelLaunch passes tensor arguments as PA-based mappings (or references to them).
- MMU/IOMMU concepts (virtual address spaces, page tables, translation latency)
are NOT modeled in Phase 0.
### D2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA mappings
sufficient to execute kernels and issue DMA requests.
### D3. Extension path (non-breaking)
A future ADR MAY introduce an optional VA/MMU layer by:
- introducing virtual addresses in tensor handles,
- adding a mapping-install step,
- modeling translation latency and page granularity.
The Phase 0 PA model remains a valid fast-path configuration.
---
## Consequences
- Early implementation stays simple and testable.
- All latency remains explicit via graph traversal, not hidden translation.
- Future VA/MMU modeling can be added without breaking existing benchmarks.
---
## Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- SPEC R2 (latency by traversal)
@@ -0,0 +1,232 @@
# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
## Status
Accepted
## Context
Phase 0 uses a PA-first memory model (ADR-0011):
- memory operations use device physical addresses (PA) only,
- VA/MMU/IOMMU is not modeled.
The host-facing runtime API interacts with the device via the IO_CPU endpoint.
We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- benchmarks remain stable,
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.
We also require PE-tagging (Option A): each shard explicitly carries (sip, cube, pe)
so that IO_CPU can route and fan out deterministically without relying on PA decoding.
---
## Decision
### D1. Contract scope
This schema is the stable contract ONLY for Host ↔ IO_CPU.
Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
and are NOT part of this host contract in Phase 0.
---
### D2. Required message set
The runtime API MUST use only these message types for Host ↔ IO_CPU:
- MemoryWrite
- MemoryRead
- KernelLaunch
All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
with these messages.
---
### D3. Common envelope (mandatory for all requests)
All Host ↔ IO_CPU requests MUST include:
- `msg_type: str`
- `correlation_id: str`
  - generated by the host
  - used to match responses deterministically
- `request_id: str`
  - unique within a correlation_id
- `target_device: str`
  - device identifier (e.g., "sip:0")
- `timestamp_tag: str | None` (optional)
  - debug tag only; MUST NOT affect determinism
All Host ↔ IO_CPU responses MUST include:
- `correlation_id: str`
- `request_id: str`
- `completion: Completion`
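
The envelope above can be sketched as a small immutable record. This is an illustration only: the class name `RequestEnvelope`, the `match_key` method, and the `find_duplicate_request_ids` helper are assumptions, not part of the contract. The sketch shows that matching uses only `correlation_id` and `request_id`, so the debug-only `timestamp_tag` cannot affect determinism.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestEnvelope:
    """Common request fields from D3 (class name is illustrative)."""
    msg_type: str
    correlation_id: str                  # generated by the host
    request_id: str                      # unique within a correlation_id
    target_device: str                   # e.g. "sip:0"
    timestamp_tag: Optional[str] = None  # debug only; excluded from matching

    def match_key(self):
        # Responses pair with requests on these two fields only.
        return (self.correlation_id, self.request_id)


def find_duplicate_request_ids(requests):
    """Return match keys that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for request in requests:
        key = request.match_key()
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates
```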
---
### D4. Completion schema (mandatory)
`Completion` MUST have:
- `ok: bool`
- `error_code: str | None`
- `error_message: str | None`
Rules:
- If `ok == true` then `error_code` and `error_message` MUST be null.
- If `ok == false` then `error_code` MUST be non-null.
- Completion semantics MUST be deterministic.
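
The two field rules above translate directly into a constructor check. The following is a minimal sketch, assuming a Python dataclass representation in which `None` stands for null; the ADR does not prescribe how `Completion` is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self):
        # Rule 1: a successful completion carries no error details.
        if self.ok and (self.error_code is not None or self.error_message is not None):
            raise ValueError("ok=True requires error_code and error_message to be None")
        # Rule 2: a failed completion must name an error code.
        if not self.ok and self.error_code is None:
            raise ValueError("ok=False requires a non-null error_code")
```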
---
### D5. MemoryWrite schema (PA-first, PE-tagged)
`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
Mandatory fields:
- common envelope fields (D3)
- destination placement tags (Option A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
- `dst_pa: int`
  - destination physical address in the destination PE's address space
- `nbytes: int`
- `src_kind: "pattern" | "host_buffer_ref"`
  - Phase 0 MUST support "pattern"
- `pattern: Pattern | None`
  - required if `src_kind == "pattern"`
`Pattern` (Phase 0 mandatory support):
- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
- `value: number | None`
  - required for fill_*; ignored for zero
Optional fields:
- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
- `debug_label: str | None`
Notes:
- This message MUST NOT embed bulk tensor data in Phase 0.
- All latency MUST come from explicit graph traversal and modeled components.
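
A validator for the MemoryWrite payload rules might look like the sketch below. It assumes the message arrives as a plain `dict` and checks only the D5-specific fields (the common envelope is left out); the function name `validate_memory_write` and the error strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union

PATTERN_KINDS = {"zero", "fill_u8", "fill_u16", "fill_u32", "fill_fp16", "fill_fp32"}
REQUIRED_FIELDS = ("dst_sip", "dst_cube", "dst_pe", "dst_pa", "nbytes", "src_kind")


@dataclass(frozen=True)
class Pattern:
    pattern_kind: str
    value: Optional[Union[int, float]] = None


def validate_memory_write(msg):
    """Return a list of violations; an empty list means the payload is valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in msg]
    if msg.get("src_kind") == "pattern":
        pattern = msg.get("pattern")
        if pattern is None:
            errors.append("pattern is required when src_kind == 'pattern'")
        elif pattern.pattern_kind not in PATTERN_KINDS:
            errors.append(f"unknown pattern_kind: {pattern.pattern_kind}")
        elif pattern.pattern_kind != "zero" and pattern.value is None:
            errors.append("value is required for fill_* patterns")
    return errors
```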
---
### D6. MemoryRead schema (PA-first, PE-tagged)
`MemoryRead` represents a host-initiated read from device memory.
Mandatory fields:
- common envelope fields (D3)
- source placement tags (Option A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
- `src_pa: int`
- `nbytes: int`
Optional fields:
- `dst_kind: "host_sink" | "discard"` (default "host_sink")
- `debug_label: str | None`
Response payload:
- actual bytes are NOT required in Phase 0 (the focus is on latency and traces)
- implementations MAY return lightweight stats or hashes later, subject to a new ADR
---
### D7. KernelLaunch schema (PA-first, PE-tagged shards)
`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
Mandatory fields:
- common envelope fields (D3)
- `kernel_ref: KernelRef`
- `args: list[KernelArg]`
`KernelRef` MUST have:
- `name: str`
- `kind: "deployed" | "builtin"`
- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
- `deploy_sip: int` — SIP where binary resides
- `deploy_cube: int` — cube where binary resides
- `deploy_pe: int` — PE where binary resides
- `nbytes_code: int` — kernel binary size (for BW modeling)
Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
`KernelArg` supports tensor args by PA mapping and scalars by value.
Tensor arg (mandatory):
- `arg_kind: "tensor"`
- `tensor_pa_map: TensorPAMap`
`TensorPAMap` MUST have:
- `shards: list[TensorShard]`
`TensorShard` MUST have (Option A, enforced):
- `sip: int`
- `cube: int`
- `pe: int`
- `pa: int`
- `nbytes: int`
- `offset_bytes: int`
Scalar arg (mandatory):
- `arg_kind: "scalar"`
- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
- `value: number | bool`
Optional KernelLaunch fields:
- `grid: dict | None`
- `meta: dict | None`
- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
- `debug_label: str | None`
Notes:
- KernelLaunch MUST NOT embed bulk tensor data.
- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
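
Because every shard carries its own (sip, cube, pe) tag, fan-out targets can be derived without decoding any PA. The sketch below is one possible grouping step, not the IO_CPU implementation; `group_shards_by_pe` is a hypothetical helper, and sorting the keys is an assumption made here to keep the fan-out order deterministic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def group_shards_by_pe(shards):
    """Group shards by (sip, cube, pe) and return groups in sorted key order."""
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    # Sorted keys make the fan-out order independent of shard arrival order.
    return [(key, groups[key]) for key in sorted(groups)]
```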
---
## Verification Notes
Tests SHOULD validate:
- schema validation rejects missing mandatory fields,
- deterministic correlation/response matching,
- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
- all routed requests incur latency > 0.
---
## Links
- ADR-0011 (PA-first memory addressing)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (kernel execution fan-out/aggregation)
- SPEC R2, R7, R8
@@ -0,0 +1,139 @@
# ADR-0013: Verification Strategy and Phase 1 Test Plan
## Status
Accepted
## Context
KernBench is a system-level simulator whose correctness is defined by:
- adherence to SPEC-defined invariants,
- determinism and debuggability,
- explicit modeling of routing and latency.
Given the evolving implementation, we need a stable verification strategy
that prevents architectural drift while allowing incremental development.
This ADR defines the Phase 1 verification plan and what constitutes
"correct behavior" for early implementations.
---
## Decision
### D1. Verification is contract-based
Verification MUST be derived from:
- SPEC requirements,
- accepted ADRs.
Tests MUST validate architectural contracts, not incidental implementation details.
---
### D2. Phase 1 verification scope
Phase 1 verification focuses on:
- message contract validity (ADR-0012),
- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
- PA-first memory addressing and shard tagging (ADR-0011),
- core latency and trace invariants (SPEC 0.1, R2).
Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
are explicitly out of scope in Phase 1.
---
### D3. Required Phase 1 verification cases
The following verification cases MUST be supported by the implementation:
#### V1. Message schema validation
- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
- Completion results MUST follow the `ok / error_code / error_message` contract.
#### V2. IO_CPU fan-out and aggregation
Given:
- a topology with one SIP, one CUBE, and two PEs,
- a KernelLaunch request containing two tensor shards targeting different PEs,
The system MUST:
- submit a single KernelLaunch to IO_CPU,
- fan-out work internally to both PEs,
- aggregate completion and return a single deterministic completion to the host.
#### V3. Latency and trace invariants
For any valid request:
- the hop-by-hop trace MUST be non-empty,
- total latency MUST be greater than zero,
- repeated runs with identical inputs MUST produce identical traces.
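
The three V3 invariants can be checked by one reusable helper. In this sketch `run_request` is a hypothetical callable that returns `(trace, total_latency_ns)` for a request; the real engine entry point is not defined in this ADR.

```python
def check_trace_invariants(run_request, request, repeats=2):
    """Assert the V3 invariants for one request and return the first result."""
    first_trace, first_latency = run_request(request)
    assert len(first_trace) > 0, "hop-by-hop trace must be non-empty"
    assert first_latency > 0, "total latency must be greater than zero"
    for _ in range(repeats):
        trace, latency = run_request(request)
        # Identical inputs must reproduce the trace and latency exactly.
        assert trace == first_trace, "trace differs between identical runs"
        assert latency == first_latency, "latency differs between identical runs"
    return first_trace, first_latency
```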
#### V4. Topology independence and cross-domain coverage
Verification cases MUST pass for multiple topology shapes, including:
- minimal: (1 SIP, 1 CUBE, 1 PE)
- multi-PE: (1 SIP, 1 CUBE, N PEs)
- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
- explicit connectivity (required links exist),
- deterministic routing and control-path traversal,
- non-empty traces and latency > 0 for representative cross-domain requests
(inter-CUBE and inter-SIP paths).
Tests MUST NOT hardcode topology sizes, node ids, or link counts.
Instead, tests MUST derive expectations from the compiled topology metadata.
---
### D4. Phase 1 artifacts
Phase 1 MAY include:
- verification-only test code,
- topology fixtures,
- trace inspection utilities.
Phase 1 MUST NOT require:
- production code changes solely to satisfy tests,
- weakening or removing tests to allow progress.
---
### D5. Phase 2 enforcement
Phase 2 (Apply) MUST:
- run the Phase 1 verification cases,
- rollback all changes if any verification fails,
- preserve tests as authoritative contracts.
---
## Consequences
- Architectural correctness is enforced early.
- Tests serve as executable documentation of system behavior.
- Implementation remains flexible without losing rigor.
---
## Links
- SPEC 0.1, R2, R6
- ADR-0011 (PA-first memory addressing)
- ADR-0012 (Host ↔ IO_CPU message schema)
- ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,364 @@
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
## Status
Proposed
## Context
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
We need a deterministic and debuggable PE-internal execution contract that supports:
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
## Decision
### D1. PE internal component roles
Each PE contains the following logical components.
**PE_CPU**
- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
**PE_DMA**
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
  - HBM → PE_TCM (via XBAR)
  - PE_TCM → HBM (via XBAR)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)
**PE_GEMM**
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
**PE_MATH**
- Element-wise computation engine.
- Reads and writes PE_TCM.
**PE_TCM**
- Local SRAM used as the staging memory for accelerator operations.
---
### D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures.
**SubmissionQueue**
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
**InflightTable**
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status
**CompletionQueue**
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
**Single-writer rule**
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
**Command completion**
A command becomes DONE when:
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
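
The completion rule and the single-writer rule can be sketched together. This is an assumption-laden illustration: the class layout, method names, and record shape are hypothetical, and engines are assumed to call `on_engine_complete` only through PE_SCHEDULER's event handling rather than touching the table directly.

```python
class InflightTable:
    """Tracks unfinished sub-commands; mutated only by the scheduler."""

    def __init__(self):
        self._pending = {}          # command_id -> set of unfinished sub-command ids
        self.completion_queue = []  # completion records, in publish order

    def register(self, command_id, sub_ids):
        self._pending[command_id] = set(sub_ids)

    def on_engine_complete(self, command_id, sub_id):
        """Record one completion event; publish DONE after the last sub-command."""
        remaining = self._pending[command_id]
        remaining.discard(sub_id)
        if remaining:
            return False
        del self._pending[command_id]
        self.completion_queue.append({"command_id": command_id, "state": "DONE"})
        return True
```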
---
### D3. Dispatch modes
PE commands are divided into two categories.
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command.
Examples include:
- DMA transfer
- GEMM compute
- MATH compute
Execution flow:
```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (tiled pipeline)
Composite commands implement tiled pipelined execution across engines.
Each tile executes the following pipeline:
```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```
**Tiling rule**
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
**Tile dependency rules**
For tile `t`:
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
**Overlap policy (Phase 0 default)**
Operations for different tiles may overlap when engine resources permit.
Allowed overlaps:
```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
DMA_READ(t) ∥ DMA_WRITE(t)
```
Disallowed overlaps:
```
GEMM(t) ∥ GEMM(t)
MATH(t) ∥ MATH(t)
GEMM(t) ∥ MATH(t)
```
---
### D4. Engine execution model (Phase 0 default)
Each engine behaves as a deterministic service resource.
**DMA engine**
PE_DMA contains two independent channels.
```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```
Rules:
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
Example allowed:
```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```
Example not allowed:
```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```
**Compute engine**
Compute operations share a single compute resource.
```
PE_ACCEL capacity = 1
```
Both GEMM and MATH require this shared compute slot.
Consequences:
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
Only one compute operation can run in a PE at a time.
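
The capacity rules above, combined with the tile dependencies in D3.2, are enough to compute a deterministic schedule by hand. The sketch below does so under simplifying assumptions that are not part of this ADR: fixed per-stage durations, tiles issued strictly in `tile_id` order, and enough tile buffers that reads never stall on buffer reuse.

```python
def schedule_tiles(num_tiles, read_ns, compute_ns, write_ns):
    """Return per-tile (start, end) times for the read, compute, and write stages."""
    read_free = compute_free = write_free = 0  # time at which each resource is idle
    schedule = []
    for tile_id in range(num_tiles):
        r_start = read_free                 # DMA_READ capacity = 1
        r_end = r_start + read_ns
        c_start = max(r_end, compute_free)  # DMA_READ(t) -> COMPUTE(t), PE_ACCEL capacity = 1
        c_end = c_start + compute_ns
        w_start = max(c_end, write_free)    # COMPUTE(t) -> DMA_WRITE(t), DMA_WRITE capacity = 1
        w_end = w_start + write_ns
        read_free, compute_free, write_free = r_end, c_end, w_end
        schedule.append({"tile_id": tile_id, "read": (r_start, r_end),
                         "compute": (c_start, c_end), "write": (w_start, w_end)})
    return schedule
```

With three tiles and stage durations of 10, 20, and 10 ns, the last write ends at 80 ns instead of the 120 ns a fully serial execution would take, because the read of tile `t+1` overlaps the compute of tile `t`.
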
**Compute opcode restriction**
Composite commands contain one compute opcode only.
Examples:
```
COMPOSITE_GEMM
COMPOSITE_MATH
```
Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
**Engine completion signaling**
Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.
---
### D5. Dataflow model
Compute operations use a TCM-centric dataflow model.
**Input path (HBM)**
```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```
**Input path (shared SRAM)**
```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```
**Compute stage**
Compute engines read input tensors from PE_TCM.
```
PE_TCM → GEMM / MATH
```
Weights for GEMM may optionally stream directly from HBM (via XBAR).
**Output path (HBM)**
Compute results are written to PE_TCM, then DMA writes to HBM.
```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```
**Output path (shared SRAM)**
```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```
#### D5.1 PE_TCM partitioning and ownership boundary
The PE_TCM address space is partitioned into two logical regions.
**SchedulerReservedTCM**
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime
**AllocatableTCM**
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
**Visibility rule (hard isolation)**
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
**Tile buffer rules**
Within SchedulerReservedTCM:
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
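
One way to satisfy the partitioning and isolation rules is sketched below. The layout is an assumption: SchedulerReservedTCM is placed at the top of the PE_TCM address space and split into fixed-size input/output buffer pairs. Both function names are hypothetical; the ADR only requires that the two regions never overlap and that the allocator never sees the reserved range.

```python
def partition_reserved_tcm(reserved_base, reserved_size, tile_bytes, num_slots):
    """Split SchedulerReservedTCM into num_slots non-overlapping (input, output) pairs."""
    needed = 2 * tile_bytes * num_slots
    if needed > reserved_size:
        raise ValueError(f"need {needed} bytes, reserved region has {reserved_size}")
    slots = []
    for i in range(num_slots):
        in_base = reserved_base + 2 * i * tile_bytes
        out_base = in_base + tile_bytes
        # Ranges are half-open: [start, end).
        slots.append({"input": (in_base, in_base + tile_bytes),
                      "output": (out_base, out_base + tile_bytes)})
    return slots


def allocatable_range(tcm_size, reserved_base, reserved_size):
    """Range handed to PEMemAllocator; excludes the reserved region by construction."""
    if reserved_base + reserved_size != tcm_size:
        raise ValueError("sketch assumes the reserved region ends at the top of TCM")
    return (0, reserved_base)
```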
---
### D6. Observability and trace contract
The simulator must emit deterministic trace events.
Required events include:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
Trace ordering must be deterministic for identical inputs.
---
### D7. Topology representation
PE internal components are declared in `cube.pe_template`.
The template is instantiated once per PE.
PE instances are derived from `cube.pe_layout`.
External connectivity such as:
- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3).
---
## Links
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -0,0 +1,178 @@
# ADR-0015: Component Port/Wire Model and Fabric Routing
## Status
Proposed
## Context
ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
In practice, the engine iterates the topology path and calls `run()` on each component
sequentially — conflating routing policy with component behavior and preventing realistic
hardware modeling (queues, contention, fan-out).
ADR-0007 D3 already states that components own fan-out and aggregation, but the current
implementation does not enforce this for fabric traversal.
This ADR defines:
- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes),
- the fabric path for Memory R/W through M_CPU.DMA,
- the reduced role of the simulation engine,
- M_CPU.DMA as an internal subcomponent of M_CPU.
---
## Decision
### D1. Component port model
Each component has typed input/output ports modeled as SimPy Stores:
```
in_ports: dict[str, simpy.Store] # keyed by source node_id
out_ports: dict[str, simpy.Store] # keyed by destination node_id
```
Ports are created at engine initialization based on graph edges.
Each directed edge (src → dst) results in:
- `src.out_ports[dst]` — the sending end
- `dst.in_ports[src]` — the receiving end
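
Port creation from graph edges can be sketched as follows. The store type is injected through a `make_store` factory so the example runs without SimPy; in the engine this would presumably be `lambda: simpy.Store(env)`. The function name and the returned dictionary layout (node id, then peer id) are assumptions for illustration.

```python
def build_ports(edges, make_store):
    """Create the sending and receiving store for every directed edge.

    Returns (out_ports, in_ports), where out_ports[src][dst] is the sending end
    and in_ports[dst][src] is the receiving end of the edge src -> dst.
    """
    out_ports, in_ports = {}, {}
    for src, dst in edges:
        # The two ends are distinct stores; a wire process moves items between them.
        out_ports.setdefault(src, {})[dst] = make_store()
        in_ports.setdefault(dst, {})[src] = make_store()
    return out_ports, in_ports
```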
---
### D2. Wire process (propagation delay)
For each directed edge (src, dst) in the topology graph, a SimPy wire process
models propagation delay:
```python
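def wire_process(env, out_port, in_port, delay_ns):
    """Move commands from the sender's out_port to the receiver's in_port."""
    while True:
        cmd = yield out_port.get()   # block until the sender emits a command
        yield env.timeout(delay_ns)  # model the edge's propagation delay
        yield in_port.put(cmd)       # deliver to the receiving component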
```
Wire processes are started at engine initialization.
BW constraints are enforced by the sending component's out_port capacity or token model,
not by the wire process itself.
---
### D3. Engine role (reduced)
The simulation engine MUST:
- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.
The simulation engine MUST NOT:
- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.
This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
ADR-0007 D2 must be amended accordingly.
---
### D4. Unified fabric path for Memory R/W and Kernel Launch
Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
The difference is what M_CPU does upon receiving the request.
**Forward path (IO_CPU → target M_CPU):**
```
IO_CPU
→ [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → noc → M_CPU
```
**At M_CPU (diverges by operation type):**
```
Memory R/W: M_CPU → M_CPU.DMA → noc → hbm_ctrl
Kernel Launch: M_CPU → PE[0..n] (parallel fan-out)
```
**Completion path (reverse, same fabric):**
```
Memory R/W: hbm_ctrl → noc → M_CPU.DMA → M_CPU
Kernel Launch: PE[0..n] all complete → M_CPU (aggregation)
M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
```
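
The divergence at M_CPU can be expressed as a small dispatch function. This is a sketch only: the function name, the returned dictionary shape, and the `PE[n]` labels are hypothetical, and the message type strings are taken from ADR-0012 D2.

```python
def m_cpu_next_hops(msg_type, pe_ids=()):
    """Describe what M_CPU does after receiving a request of the given type."""
    if msg_type in ("MemoryRead", "MemoryWrite"):
        # Memory R/W follows a single sequential chain through M_CPU.DMA.
        return {"mode": "chain", "hops": ["M_CPU.DMA", "noc", "hbm_ctrl"]}
    if msg_type == "KernelLaunch":
        # Kernel Launch fans out to all target PEs in parallel.
        return {"mode": "fan_out", "targets": [f"PE[{pe}]" for pe in sorted(pe_ids)]}
    raise ValueError(f"unsupported msg_type: {msg_type}")
```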
---
### D5. M_CPU.DMA is an internal subcomponent of M_CPU
M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.
M_CPU.DMA:
- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.
M_CPU.DMA does not appear as a node in the compiled topology graph.
---
### D6. Transit cube forwarding
A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:
```
ucie_in (from upstream) → noc → ucie_out (to downstream)
```
The transit forwarding decision is made entirely within the ucie_in component.
The noc and ucie_out components in a transit cube then pass the packet through without modification.
---
### D7. _formula_latency is preserved as a lower-bound cross-check
The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.
Invariant:
- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.
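
The invariant can be written as a one-line regression guard per phase. The helper name `check_formula_latency` is an assumption; only the comparison rules come from this section.

```python
def check_formula_latency(phase, formula_ns, component_total_ns):
    """Return True when the D7 invariant holds for the given phase."""
    if phase == 0:
        # No contention is modeled, so the two values must agree exactly.
        return formula_ns == component_total_ns
    # Phase 1+: queueing may add latency, so the formula is only a lower bound.
    return formula_ns <= component_total_ns
```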
---
## Consequences
- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
---
## Links
- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)