commit - release 1

2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
@@ -0,0 +1,108 @@
+# ADR-0001: PhysAddr Layout & Address Decoding Contract
+
+## Status
+
+Accepted
+
+## Date
+
+2026-02-27
+
+## Context
+
+The KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
+To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
+
+- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
+- remains topology-agnostic (no hardcoded counts)
+- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
+
+## Decision
+
+We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
+
+### D1. PhysAddr is an immutable value object
+
+- PhysAddr is immutable and comparable as a pure value.
+- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
+- No global state may be required to interpret a PhysAddr.
+
+### D2. PhysAddr fields (logical contract)
+
+PhysAddr must be able to represent at least:
+
+- `rack_id` (optional but reserved for scale-out)
+- `sip_id`  (device / SIP domain)
+- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
+- `local_offset` (offset within the chosen segment/window)
+
+Decoded/derived fields may include (optional):
+
+- `cube_id`
+- `kind` (e.g., HBM vs PE-resource vs raw)
+- `unit_type` / `pe_id` (if PE-level addressing is modeled)
+
+**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
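
A minimal sketch of the D1/D2 contract as a frozen Python dataclass. The field names follow the lists above; the types, the `None` defaults for optional fields, and the class layout are assumptions for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional


@dataclass(frozen=True)
class PhysAddr:
    """Immutable value object; field names mirror D2, types are assumed."""
    sip_id: int
    sip_seg: int
    local_offset: int
    rack_id: Optional[int] = None   # optional, reserved for scale-out
    cube_id: Optional[int] = None   # derived field, optional
    kind: Optional[str] = None      # e.g. "hbm", "pe_resource", "raw" (assumed labels)


# Two addresses built from the same values compare equal as pure values.
a = PhysAddr(sip_id=0, sip_seg=2, local_offset=0x1000)
b = PhysAddr(sip_id=0, sip_seg=2, local_offset=0x1000)
```

Because the dataclass is frozen, instances are hashable and any attempt to assign to a field raises `FrozenInstanceError`, which is one way to enforce "comparable as a pure value".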
+
+### D3. Decoding is deterministic and policy-compatible
+
+- Decoding must deterministically map an integer address to:
+  - destination SIP domain (`sip_id`)
+  - destination sub-domain (`cube_id` if applicable)
+  - destination target kind (HBM/PE-resource/other)
+- Decoding must not depend on runtime topology sizes. It may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size); those parameters must live in the topology/config layer, not in arbitrary components.
+
+### D4. Topology-derived constants live in the topology layer
+
+Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
+They must not be hardcoded in node implementations.
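
The D3/D4 split can be sketched as a pure decode function that receives its sizes explicitly. The arithmetic layout below (SIP window, then segment, then offset) is an assumption, as are the names `AddressLayout`, `decode`, `sip_span_bytes`, and `seg_size_bytes`; the ADR deliberately leaves the exact bit allocation open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressLayout:
    """Hypothetical explicit parameters, supplied by the topology/config layer."""
    sip_span_bytes: int   # address window owned by one SIP (assumed)
    seg_size_bytes: int   # size of one SIP-level segment/window (assumed)


def decode(addr: int, layout: AddressLayout) -> dict:
    """Pure function: the same (addr, layout) always yields the same domains."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    sip_id, within_sip = divmod(addr, layout.sip_span_bytes)
    sip_seg, local_offset = divmod(within_sip, layout.seg_size_bytes)
    return {"sip_id": sip_id, "sip_seg": sip_seg, "local_offset": local_offset}


# Layout values would come from YAML/JSON/dict config; these numbers are made up.
layout = AddressLayout(sip_span_bytes=1 << 36, seg_size_bytes=1 << 32)
decoded = decode((3 << 36) + (5 << 32) + 0x40, layout)
```

No global state is consulted: swapping the topology means passing a different `AddressLayout`, not editing the decoder.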
+
+### D5. Routing consumes decoded domains, not raw bits
+
+Routing policy uses decoded domains:
+
+- `src` location (sip/cube/pe or node_id)
+- `dst` domains derived from PhysAddr decoding
+- `size_bytes` for size-aware link latency
+
+Routing must not inspect raw bit-fields directly except inside the decoding module.
+
+## Alternatives Considered
+
+1) **Use raw integers everywhere, decode ad-hoc in routing**
+
+- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
+
+2) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
+
+- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
+
+3) **Put decoding inside memory controllers or routers**
+
+- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
+
+## Consequences
+
+### Positive
+
+- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
+- Keeps topology variability (SPEC R3) while preserving consistent semantics.
+- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
+
+### Tradeoffs / Costs
+
+- Requires explicit configuration for any topology-derived sizes.
+- Introduces a single “blessed” decoding module that must remain stable and well-tested.
+
+## Implementation Notes (Non-normative)
+
+- Recommended module boundary:
+  - `src/kernbench/policy/address/phyaddr.py`
+
+- Tests should cover:
+  - deterministic decoding
+  - local vs remote classification from decoded fields
+  - invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
+
+## Links
+
+- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
@@ -0,0 +1,103 @@
+# ADR-0002: Routing Distance, Ordering & Bypass Rules
+
+## Status
+Accepted
+
+## Date
+2026-02-27
+
+## Context
+The KernBench Graph Latency Simulator must compare kernel execution time
+across different architectures and topologies by computing end-to-end
+latency from graph traversal.
+
+To support meaningful comparison:
+- routing must be deterministic
+- latency must reflect actual interconnect structure
+- local vs remote traffic must be distinguishable
+- “bypass” optimizations must not undermine debuggability or correctness
+
+The simulator also aims to avoid software-managed metadata and hidden
+shortcuts that obscure control paths.
+
+## Decision
+
+### D1. Distance is accumulated latency, not hop count
+- Routing “distance” is defined as the **sum of per-node and per-link latency**.
+- Hop count alone must not be used for ordering or path selection.
+- Size-aware serialization latency (bytes / BW) contributes to distance.
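
As an illustration of D1, the sketch below accumulates distance along one path. The units (ns, bytes per ns), the function names, and the example numbers are assumptions; only the structure (node latency + link latency + bytes / BW) comes from the rules above.

```python
def link_latency_ns(base_ns, size_bytes, bw_bytes_per_ns):
    """Fixed link latency plus size-aware serialization time (bytes / BW)."""
    return base_ns + size_bytes / bw_bytes_per_ns


def path_distance_ns(node_ns, links, size_bytes):
    """Sum per-node latencies and per-link latencies along one path.

    node_ns: list of per-node latencies in ns.
    links:   list of (base_ns, bw_bytes_per_ns) tuples, one per link.
    """
    total = sum(node_ns)
    total += sum(link_latency_ns(base, size_bytes, bw) for base, bw in links)
    return total


# Two paths with the same hop count but different distance (values are made up).
fast = path_distance_ns([2.0, 2.0], [(1.0, 64.0)], size_bytes=256)
slow = path_distance_ns([2.0, 2.0], [(1.0, 8.0)], size_bytes=256)
```

Both paths have one link and two nodes, yet `slow` is several times longer than `fast`, which is why hop count alone cannot order them.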
+
+### D2. Routing order is derived from graph traversal
+- The chosen route is the path with minimum accumulated latency
+  given the constructed graph and routing policy.
+- Deterministic ordering must be guaranteed for identical inputs
+  (topology + policy + request).
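
One possible way to satisfy D2 is Dijkstra's algorithm with an explicit tie-break. This is a sketch, not the simulator's router: the adjacency-dict graph representation and the choice to break latency ties by lexicographic path order are assumptions.

```python
import heapq


def min_latency_route(graph, src, dst):
    """Dijkstra over accumulated latency with a deterministic tie-break.

    graph: {node: [(neighbor, latency_ns), ...]} (assumed representation).
    Heap entries are (distance, path); equal distances are ordered by the
    path tuple, so identical inputs always produce the identical route.
    """
    heap = [(0.0, (src,))]
    settled = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return dist, list(path)
        if node in settled:
            continue
        settled.add(node)
        for nxt, latency in sorted(graph.get(node, [])):
            if nxt not in settled:
                heapq.heappush(heap, (dist + latency, path + (nxt,)))
    raise LookupError(f"no path from {src} to {dst}")


# Two equal-latency routes from "a" to "d"; node names are placeholders.
graph = {
    "a": [("c", 2.0), ("b", 2.0)],
    "b": [("d", 3.0)],
    "c": [("d", 3.0)],
}
route = min_latency_route(graph, "a", "d")
```

Both candidate routes cost the same, so the result depends only on the tie-break rule and not on the insertion order of edges in the graph.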
+
+### D3. Bypass is explicit and graph-represented
+- Any bypass (e.g., local cube HBM access via XBAR instead of NOC) must be:
+  - explicitly represented as a graph path, and
+  - subject to latency accumulation like any other path.
+- Example: PE_DMA has dual egress — one to XBAR (HBM path) and one to NOC (non-HBM path).
+  Both are explicit graph edges; neither is a “bypass” — they are distinct data paths
+  serving different memory domains.
+- Implicit or “magic” bypass paths are disallowed.
+
+### D4. No zero-latency end-to-end paths
+
+- Every routed request must incur **end-to-end** latency > 0.
+- Individual fabric segments (e.g., NOC hops) MAY have `distance_mm = 0`
+  when the fabric is distributed and distance is not meaningful at that granularity.
+  This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
+  UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
+- Fully zero-latency end-to-end paths are disallowed, except for explicit
+  test-only stubs clearly marked as such.
+
+### D5. Policy vs topology responsibility split
+- Topology builder:
+  - defines nodes and links and their latency/BW parameters
+- Routing policy:
+  - selects among available graph paths based on decoded domains
+- Routing policy must not assume missing links; missing connectivity
+  is a topology construction error.
+
+### D6. No software-managed routing metadata
+- Routing decisions must not rely on per-request software-managed metadata
+  that tracks distance, hop count, or ordering outside the graph model.
+- All distance/order computation is derived from traversal itself.
+
+## Alternatives Considered
+
+1) **Hop-count based routing**
+- Rejected: ignores heterogeneous latency/BW and misrepresents
+  architectural differences.
+
+2) **Implicit local shortcuts**
+- Rejected: breaks debuggability and violates traversal-based latency.
+
+3) **Software-managed distance metadata**
+- Rejected: increases control overhead and obscures routing semantics.
+
+## Consequences
+
+### Positive
+- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
+- Architecture comparisons reflect real interconnect structure.
+- Routing behavior is reproducible and deterministic.
+
+### Tradeoffs / Costs
+- Graph construction must be correct and complete.
+- Bypass modeling requires explicit graph representation,
+  which slightly increases topology description complexity.
+
+## Implementation Notes (Non-normative)
+- Recommended responsibilities:
+  - Graph builder: ensure all required paths exist.
+  - Router: select next hop based on decoded domains and policy.
+- Tests should assert:
+  - non-zero end-to-end latency
+  - deterministic routing for identical inputs
+  - bypass paths appear explicitly in emitted traces
+
+## Links
+- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
+- ADR-0001: PhysAddr layout & decoding contract
@@ -0,0 +1,64 @@
+# ADR-0003: Target System Hierarchy & Modeling Scope
+
+## Status
+
+Accepted
+
+## Context
+
+We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
+The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
+through switching fabrics, with a host CPU issuing commands/kernels.
+
+## Decision
+
+We model the system hierarchy explicitly:
+
+### D1. Tray-level
+
+- A compute tray contains:
+  - Host CPU (issues requests / coordinates runtime & data placement)
+  - Multiple identical SIPs (accelerators)
+  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)
+
+### D2. SIP-level
+
+- A SIP is a multi-die package composed of:
+  - Multiple CUBEs (HBM die + compute PEs + UCIe)
+  - One or more IO chiplets (host/SIP interfaces)
+- IO chiplets:
+  - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
+  - can be multiple per SIP
+  - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
+
+### D3. CUBE-level
+
+- A CUBE contains:
+  - HBM + memory controller (HBM_CTRL)
+  - XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
+  - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
+  - NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
+    carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
+  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
+  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
+  - multiple PEs
+  - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
+
+### D4. PE-level
+
+- A PE can execute one kernel instance
+- PE contains internal control + accelerators (modeled at PE view granularity):
+  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
+
+## Consequences
+
+- The simulator supports abstraction by “views”:
+  - SIP view hides PE internals
+  - CUBE view treats each PE as a single block
+  - PE view expands PE internals
+- Topology remains parameterized; sizes/counts/links come from configuration.
+
+## Links
+
+- SPEC R3/R5
+- ADR-0005 (diagram views)
@@ -0,0 +1,64 @@
+# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
+
+## Status
+
+Accepted
+
+## Context
+
+Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
+Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
+
+## Decision
+
+### D1. Local HBM definition
+
+- Each PE is assigned a logically defined “local HBM” region.
+- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s DMA path
+  via the XBAR (top or bottom, depending on PE corner placement).
+- The path is: PE_DMA → XBAR.top/bottom → HBM_CTRL.
+- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
+
+### D2. Local HBM bandwidth guarantee contract
+
+- Accesses from a PE to its local HBM MUST be guaranteed full HBM read/write bandwidth,
+  independent of intervening fabric bandwidth limits.
+- This guarantee is modeled by:
+  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
+  - while still incurring non-zero latency along explicitly modeled components.
+
+### D3. Cross-half HBM semantics
+
+- A PE connected to XBAR.bottom that accesses HBM pseudo-channels on the XBAR.top half
+  (or vice versa) traverses a bridge:
+  - PE_DMA → XBAR.bottom → bridge → XBAR.top → HBM_CTRL
+- Bridge bandwidth may limit cross-half HBM access relative to local-half access.
+
+### D4. Non-local HBM semantics (inter-cube / inter-SIP)
+
+- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
+  - NOC bandwidth within the cube,
+  - inter-cube UCIe links,
+  - inter-SIP fabric (PCIe/UAL).
+- These paths MUST be explicit and traceable.
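
The bandwidth rules in D2 through D4 can be sketched with a simple bottleneck model. Treating effective bandwidth as the minimum over the links on the path is an assumption made for illustration, as are the function name, the GB/s unit, and every number below.

```python
def effective_bw(path_link_bw, hbm_bw, local=False):
    """Return a modeled bandwidth (GB/s, assumed unit) for one PE-to-HBM access.

    path_link_bw: bandwidths of fabric links on the path (bridge, NOC, UCIe, ...).
    local=True models the D2 guarantee: fabric limits are ignored.
    Otherwise the slowest element on the path bounds the access.
    """
    if local:
        return hbm_bw
    return min([hbm_bw, *path_link_bw])


HBM_BW = 512.0  # made-up value
local_bw = effective_bw([64.0], HBM_BW, local=True)   # fabric BW has no effect
cross_half_bw = effective_bw([256.0], HBM_BW)         # bridge-limited
inter_cube_bw = effective_bw([128.0, 32.0], HBM_BW)   # NOC and UCIe on the path
```

A test for the local-HBM case would vary the fabric value and check that `local_bw` does not move, while the non-local results track the link parameters.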
+
+### D5. Shared SRAM semantics
+
+- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
+- Access path: PE_DMA → NOC → shared SRAM.
+- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
+- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
+
+## Verification Notes
+
+Tests should cover:
+
+- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
+- cross-half HBM case: latency includes bridge traversal
+- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
+- shared SRAM case: access via NOC with correct BW
+
+## Links
+
+- SPEC R2/R5
+- ADR-0002 (distance/order & explicit bypass)
@@ -0,0 +1,186 @@
+# ADR-0005: Diagram Views & Distance-Aware Layout Rules
+
+## Status
+
+Accepted
+
+## Context
+
+We require verifiable and inspectable system modeling for a large-scale,
+parameterized AI Accelerator system.
+
+Humans must be able to:
+
+- visually inspect the modeled topology,
+- reason about communication structure and relative distance,
+- do so at multiple abstraction levels without being overwhelmed by detail.
+
+The simulator models distance (accumulated latency) as a first-class concept.
+Diagrams must reflect this distance by default.
+
+---
+
+## Global Defaults
+
+- All diagrams MUST be **distance-aware by default**.
+- All diagrams MUST render **representative views** of the architecture.
+- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
+- Instance indices MAY be used ONLY:
+  - to define a distance anchor in asymmetric or debugging scenarios, or
+  - when explicitly requested.
+
+---
+
+## Representative Rendering Rule
+
+- All CUBEs share the same internal structure.
+- All PEs share the same internal structure.
+
+Therefore:
+
+- SIP-level diagrams render representative CUBEs and IO chiplets.
+- CUBE-level diagrams render representative PEs as opaque blocks.
+- PE-level diagrams render a representative PE with fully expanded internals.
+
+Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
+unless explicitly requested.
+
+---
+
+## Diagram Views
+
+### View A — SIP-Level Diagram
+
+**Purpose**
+Explain system-scale structure and connectivity.
+
+**Visible elements**
+
+- SIP boundaries (optional)
+- CUBEs (opaque blocks)
+- IO chiplets (opaque blocks)
+- Optional UCIe stubs only if needed to clarify connectivity
+
+**Hidden elements**
+
+- PE internals
+- CUBE internal fabric
+- IO chiplet internals
+
+**Visible links**
+
+- Host ↔ IO chiplets (PCIe)
+- SIP ↔ SIP (PCIe / UAL via switches)
+- IO ↔ CUBE (on-package links)
+
+---
+
+### View B — CUBE-Level Diagram
+
+**Purpose**
+Explain cube-internal structure and data/control flow.
+
+**Visible elements**
+
+- XBAR (top/bottom): HBM pseudo-channel crossbar
+- Bridge (left/right): cross-half HBM connectors between XBAR.top and XBAR.bottom
+- NOC: distributed on-die fabric for non-HBM traffic
+- HBM subsystem (HBM_CTRL)
+- Shared SRAM: cube-level shared memory
+- Management CPU (M_CPU)
+- PEs as opaque blocks (PE[0..N−1])
+- UCIe endpoints (N/E/W/S) as ports
+
+**Hidden elements**
+
+- PE internals
+
+**Visible links**
+
+- PE → XBAR (HBM data path, top or bottom by corner placement)
+- PE → NOC (non-HBM data path)
+- XBAR ↔ bridge ↔ XBAR (cross-half HBM access)
+- XBAR → HBM_CTRL
+- NOC ↔ UCIe endpoints
+- NOC ↔ shared SRAM
+- M_CPU ↔ NOC (command path)
+- NOC → PE_CPU (command delivery, collapsed into PE block)
+
+---
+
+### View C — PE-Level Diagram
+
+**Purpose**
+Explain internal PE behavior and execution structure.
+
+**Visible elements**
+
+- PE_CPU
+- Command handler / scheduler
+- PE_TCM (local SRAM)
+- HW accelerators (DMA, GEMM, MATH, etc.)
+- Local HBM interface
+- Optional IPCQ / messaging endpoints
+
+**Visible links**
+
+- Control paths (CPU → scheduler → engines)
+- Data paths (engines ↔ TCM, DMA ↔ local HBM)
+- External fabric ports as abstract ports only
+
+---
+
+## Distance-Aware Layout (Default)
+
+### Distance definition
+
+- Distance is defined as **accumulated latency**, consistent with ADR-0002.
+- Distance is computed from a single anchor node.
+
+### Default anchor selection
+
+- SIP view: IO chiplet (or Host CPU if present)
+- CUBE view: a representative PE
+- PE view: PE_CPU or Command Handler
+
+Anchors are **implicit defaults** and MUST NOT be required to be specified.
+
+### Layout rules
+
+- Diagrams MUST be laid out in layers based on distance buckets.
+- Layout direction MUST be consistent within a view type
+  (preferred: left-to-right).
+- Nodes with equal distance MUST have stable ordering
+  (by role or identifier, deterministically).
+
+Cycles MAY be rendered using dashed or curved edges for readability,
+without affecting distance semantics.
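
The layering rules above can be sketched as a bucketing step followed by a stable sort. The bucket width parameter, the function name, and the example distances are hypothetical; the ADR does not specify how buckets are sized.

```python
from collections import defaultdict


def layout_layers(distance_ns, bucket_ns):
    """Group nodes into layers by distance bucket, with stable ordering.

    distance_ns: {node_id: accumulated latency from the anchor}.
    bucket_ns:   bucket width (an assumed parameter).
    """
    layers = defaultdict(list)
    for node, dist in distance_ns.items():
        layers[int(dist // bucket_ns)].append(node)
    # Order buckets by distance, and nodes inside a bucket by identifier.
    return [sorted(layers[bucket]) for bucket in sorted(layers)]


# Distances from a representative PE anchor; all values are made up.
layers = layout_layers(
    {"pe": 0.0, "xbar": 4.0, "noc": 3.0, "hbm_ctrl": 12.0, "sram": 11.0},
    bucket_ns=5.0,
)
```

Sorting by identifier inside each bucket keeps the output independent of dictionary insertion order, which is what makes regenerated diagrams diff cleanly.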
+
+---
+
+## Generation Contract (for Tools / Claude Code)
+
+When generating diagrams:
+
+- Assume distance-aware layout by default.
+- Assume representative rendering by default.
+- Do NOT ask for SIP/CUBE/PE indices unless required.
+- Do NOT expand hidden abstraction levels.
+- Prefer architectural clarity over micro-hop fidelity.
+
+---
+
+## Consequences
+
+- Diagrams are stable across topology scaling.
+- Changes in distance or routing policy are reflected visually.
+- Diagrams serve as verifiable artifacts derived from the simulator model,
+  not as hand-maintained documentation.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Routing distance semantics)
+- ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
+# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
+and computes routing and accumulated latency (distance).
+Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
+hand-maintained topology drawings.
+
+Additionally, for usability, diagrams should be emitted automatically into a stable location
+so that developers can preview them immediately in the repository.
+
+---
+
+## Decision
+
+### D1. Topology compilation is the single source of truth
+
+- topology.yaml (or equivalent config) is compiled into:
+  - an explicit system graph,
+  - node/link attributes,
+  - routing policies.
+
+This compiled graph is the authoritative representation of the system.
+
+### D2. Distance extraction during compilation
+
+- During or immediately after topology compilation, the simulator MUST compute distance metadata
+  (accumulated latency) consistent with ADR-0002.
+- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
+- Distributed fabric segments (e.g., NOC) MAY have `distance_mm = 0` per ADR-0002 D4;
+  layout placement for such nodes uses explicit position metadata rather than distance buckets.
+
+### D3. Diagram generation is a derived artifact
+
+- Diagrams MUST be generated from:
+  - the compiled topology graph,
+  - extracted distance metadata,
+  - view/layout rules defined in ADR-0005.
+- Diagram generation MUST NOT require additional hand-written topology descriptions.
+
+### D4. Automatic diagram emission to the repository
+
+- As part of topology compilation, the implementation MUST produce the following diagrams by default:
+  - SIP-level diagram (representative, distance-aware)
+  - CUBE-level diagram (representative, distance-aware)
+  - PE-level diagram (representative, distance-aware)
+- The default output directory is:
+  - `docs/diagrams/`
+- The generator MUST overwrite or update diagram files only when the compiled topology (or the diagram rules) changes.
+
+### D5. View-specific projection and layout
+
+For each view (SIP / CUBE / PE):
+
+- The generator MUST project the compiled graph into a reduced view graph:
+  - hide/collapse nodes according to ADR-0005,
+  - preserve connectivity semantics relevant to that view,
+  - compute distance buckets and assign layout layers deterministically.
+- CUBE-level projection MUST include:
+  - XBAR (top/bottom), bridge (left/right), NOC, HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
+    and PEs as opaque blocks.
+  - Distinct edge kinds for HBM path (PE→XBAR) vs non-HBM path (PE→NOC).
+- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
+
+### D6. Output formats and determinism
+
+- The generator MUST output at least one of:
+  - Mermaid (Markdown-native)
+  - Graphviz DOT (rank-based control)
+  - SVG (mm-accurate layout, no external dependencies)
+- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
+- Output MUST be deterministic:
+  - same topology + same rules → identical diagram text
+- File naming MUST be deterministic and stable (see "Output Conventions").
+
+### D7. Performance and caching
+
+- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
+  remain consistent with the compiled topology.
+- The implementation SHOULD use a cache key based on:
+  - topology content hash,
+  - routing policy version,
+  - diagram rules version,
+  - view type (SIP/CUBE/PE).
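
A cache key covering the four inputs listed in D7 might be built as follows. This is only a sketch: the use of SHA-256, the JSON envelope, and the function and parameter names are assumptions rather than a prescribed format.

```python
import hashlib
import json


def diagram_cache_key(topology_text, routing_policy_version, rules_version, view):
    """Combine the four D7 inputs into one stable hex digest."""
    payload = json.dumps(
        {
            "topology_sha256": hashlib.sha256(topology_text.encode("utf-8")).hexdigest(),
            "routing_policy_version": routing_policy_version,
            "diagram_rules_version": rules_version,
            "view": view,
        },
        sort_keys=True,  # fixed key order keeps the digest reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same topology and versions, different views; the inputs are placeholders.
key_cube = diagram_cache_key("sips: 2\n", "v1", "v1", "CUBE")
key_pe = diagram_cache_key("sips: 2\n", "v1", "v1", "PE")
```

Changing any one of the four inputs changes the key, so a stale diagram is never served for a modified topology or rule set.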
+
+---
+
+## Output Conventions
+
+### Directory
+
+- `docs/diagrams/` is the canonical output directory for generated diagrams.
+
+### File names (recommended, deterministic)
+
+- `system_view.svg` / `system_view.mmd` / `system_view.dot`
+- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
+- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
+- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
+
+Optionally, for multi-topology workflows:
+
+- `sip_view__{topology_id}.svg`
+- `cube_view__{topology_id}.svg`
+- `pe_view__{topology_id}.svg`
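
The naming scheme above could be captured in one helper so that every emitter produces the same paths. The helper name, its validation behavior, and the example topology id are hypothetical; the directory, stems, and extensions are taken from the conventions listed here.

```python
from pathlib import PurePosixPath
from typing import Optional

OUTPUT_DIR = PurePosixPath("docs/diagrams")
VIEWS = ("system", "sip", "cube", "pe")
FORMATS = ("svg", "mmd", "dot")


def diagram_path(view: str, fmt: str, topology_id: Optional[str] = None) -> PurePosixPath:
    """Build a stable output path from the view, format, and optional topology id."""
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    stem = f"{view}_view"
    if topology_id is not None:
        stem = f"{stem}__{topology_id}"
    return OUTPUT_DIR / f"{stem}.{fmt}"


default_path = diagram_path("cube", "svg")
per_topology_path = diagram_path("pe", "svg", topology_id="trayA")
```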
+
+### Repository policy
+
+- Generated diagram files MAY be committed to the repository to enable diff-based review.
+- If committed, they MUST be reproducible from topology compilation.
+
+---
+
+## Consequences
+
+- Diagrams are always consistent with simulator behavior.
+- Architectural changes automatically propagate to visualizations.
+- Diagram diffs become meaningful indicators of architectural change.
+
+---
+
+## Links
+
+- SPEC Section 4 (Output, Debuggability, and Diagrams)
+- ADR-0002 (Distance semantics)
+- ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,89 @@
+# ADR-0007: Runtime API and Simulation Engine Boundaries
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator consists of multiple layers with distinct responsibilities:
+
+- a host-facing API layer used by benchmarks and user code,
+- a discrete-event simulation engine that executes requests,
+- device components that model hardware behavior.
+
+Without strict boundaries, orchestration logic can leak into components,
+or simulation internals can become entangled with user-facing APIs.
+
+This ADR defines clear responsibility boundaries between:
+
+- runtime API,
+- simulation engine (sim_engine),
+- hardware components.
+
+---
+
+## Decision
+
+### D1. Runtime API is host-facing orchestration only
+
+The runtime API represents host/driver-level behavior and MUST:
+
+- expose high-level operations (tensor deployment, kernel launch),
+- submit requests only to endpoint components (e.g., IO_CPU),
+- await completion via futures/handles,
+- own and persist host-side metadata (tensor allocation maps, kernel bindings).
+
+The runtime API MUST NOT:
+
+- hardcode hop-by-hop routing or fan-out,
+- directly invoke internal components (M_CPU, PE_CPU, engines),
+- embed topology- or routing-specific assumptions.
+
+---
+
+### D2. Simulation engine executes and schedules requests
+
+The simulation engine (sim_engine) MUST:
+
+- inject requests into the compiled topology graph,
+- schedule and execute events using a discrete-event model,
+- manage correlation ids and completion tracking,
+- decompose operations into low-level requests when required
+  (e.g., MemoryWrite events).
+
+The simulation engine MUST NOT:
+
+- define tensor semantics,
+- define kernel execution policies,
+- expose internal graph details to the runtime API.
+
+---
+
+### D3. Components own fan-out and aggregation
+
+Device-side components MUST:
+
+- fan-out requests to downstream domains
+  (IO_CPU → M_CPU → PE_CPU → schedulers/engines),
+- aggregate completion and failure signals,
+- propagate results deterministically upstream.
+
+Neither the runtime API nor the simulation engine may orchestrate
+component-level fan-out explicitly.
+
+---
+
+## Consequences
+
+- Runtime APIs remain stable as topology and routing evolve.
+- Simulation internals can change without affecting user-facing code.
+- Component implementations remain swappable via DI.
+
+---
+
+## Links
+
+- SPEC R4, R7, R8
+- ADR-0008 (Tensor deployment)
+- ADR-0009 (Kernel execution)
@@ -0,0 +1,100 @@
+# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+## Status
+
+Accepted
+
+## Context
+
+Benchmarks require PyTorch-like tensor semantics:
+
+- tensor creation (empty, fill),
+- deployment to accelerator devices (tensor.to()).
+
+In a realistic system, host software manages allocation/mapping and installs
+mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
+
+- device memory operations use PA only,
+- VA/MMU/IOMMU is not modeled.
+
+To keep the host↔device interface minimal, we avoid a separate
+AllocateTensorMeta message. Instead, host allocation produces a PA shard map
+that is used directly by MemoryWrite/Read and KernelLaunch.
+
+---
+
+## Decision
+
+### D1. Tensor is a host-owned handle with PA shard mapping
+
+A Tensor object is a host-owned handle that encapsulates:
+
+- shape and dtype,
+- initialization intent,
+- device placement and allocation metadata as a PA shard map.
+
+After deployment, the Tensor handle MUST contain:
+
+- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
+
+This PA shard mapping is the single source of truth for kernel argument binding.
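
A sketch of the D1 handle, assuming Python dataclasses. The shard field names come from the tuple above; the class names, the types, and the reading of `offset_bytes` as the shard's byte offset within the tensor are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Shard:
    """One placed piece of a tensor: location tags plus a PA range."""
    sip: int
    cube: int
    pe: int
    pa: int            # device physical address of the shard
    nbytes: int        # shard size in bytes
    offset_bytes: int  # assumed: byte offset of this shard within the tensor


@dataclass(frozen=True)
class TensorHandle:
    """Host-owned handle; shards stay empty until deployment (assumed)."""
    shape: Tuple[int, ...]
    dtype: str
    shards: Tuple[Shard, ...] = ()

    def total_bytes(self) -> int:
        return sum(s.nbytes for s in self.shards)


# A 4x256 float16 tensor split across two PEs; addresses are made up.
t = TensorHandle(
    shape=(4, 256),
    dtype="float16",
    shards=(
        Shard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=1024, offset_bytes=0),
        Shard(sip=0, cube=0, pe=1, pa=0x1000, nbytes=1024, offset_bytes=1024),
    ),
)
```

Kernel argument binding can then read `t.shards` directly, without a separate allocation message.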
+
+---
+
+### D2. Deployment uses a host allocator (Phase 0)
+
+In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
+
+- placement (split/replicate/hybrid) is decided by a DP policy,
+- allocation assigns PA ranges at the PE level and returns shard mappings,
+- the Tensor handle stores the resulting shard list deterministically.
+
+No separate host-visible device allocation RPC is required in Phase 0.
+
+---
+
+### D3. Data initialization and transfer use MemoryWrite/Read only
+
+Any data initialization or transfer implied by a tensor (e.g., fill, copy)
+MUST be represented using Host ↔ IO_CPU messages only:
+
+- MemoryWrite
+- MemoryRead
+
+Rules:
+
+- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
+- Allocation metadata MUST NOT be embedded as a separate allocation message.
+- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
+
+The simulation engine schedules MemoryWrite/Read through the graph so that
+latency is computed by explicit traversal.
+
+---
+
+### D4. Extension path (non-breaking)
+
+Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
+
+- virtual addressing in tensor handles,
+- mapping install steps,
+- translation latency/page granularity.
+
+The Phase 0 PA shard map remains a valid fast-path configuration.
+
+---
+
+## Consequences
+
+- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
+- KernelLaunch can pass per-PE data placement explicitly via shard tags.
+- Early implementation stays simple and testable.
+
+---
+
+## Links
+
+- ADR-0011 (PA-first)
+- ADR-0012 (Host↔IO_CPU schema)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (Kernel execution)
@@ -0,0 +1,74 @@
+# ADR-0009: Kernel Execution Messaging and Completion Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Kernel execution is initiated by the host and proceeds through
+device control components:
+
+Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
+
+Completion propagates in reverse order.
+
+To keep benchmarks simple and topology-agnostic,
+kernel execution must be endpoint-driven with deterministic aggregation.
+
+---
+
+## Decision
+
+### D1. Kernel launch is an endpoint request
+
+A kernel launch is initiated by submitting a single KernelLaunch request
+to the IO_CPU endpoint.
+
+The runtime API MUST:
+
+- construct the kernel launch request,
+- submit it to IO_CPU,
+- await a single completion result.
+
+The runtime API MUST NOT orchestrate internal fan-out.
+
+---
+
+### D2. Tensor arguments are passed by metadata
+
+KernelLaunch requests MUST reference tensor arguments via:
+
+- host-owned tensor handles, or
+- resolved device address maps derived from those handles.
+
+Bulk tensor data MUST NOT be embedded in kernel launch messages.
+
+---
+
+### D3. Fan-out and aggregation are component responsibilities
+
+- IO_CPU fans out work to M_CPUs.
+- M_CPU fans out work to PE_CPUs.
+- PE_CPU manages kernel execution and engine dispatch.
+
+Completion semantics:
+
+- M_CPU completes when all targeted PEs complete or a failure policy triggers.
+- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
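
The completion rule can be sketched as a fold over child results. The ADR does not define the failure policy, so the `fail_fast` flag below is a hypothetical example of one; the function name and result shape are likewise assumptions.

```python
def aggregate(child_results, fail_fast=True):
    """Fold child completion results into one parent result.

    child_results: iterable of (child_id, ok) in arrival order.
    fail_fast is a hypothetical failure policy: the parent reports failure
    as soon as the first failing child is seen.
    """
    failed = []
    for child_id, ok in child_results:
        if not ok:
            failed.append(child_id)
            if fail_fast:
                break
    return {"ok": not failed, "failed": sorted(failed)}


# M_CPU aggregates PEs; IO_CPU aggregates CUBEs. Ids are placeholders.
m_cpu_result = aggregate([("pe0", True), ("pe1", True)])
io_cpu_result = aggregate([("cube0", True), ("cube1", False), ("cube2", False)])
```

The same function serves both levels of the hierarchy, which mirrors the symmetric wording of the two completion rules.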
+
+---
+
+### D4. Completion and failure propagation
+
+- All messages MUST carry correlation identifiers.
+- Completion and failure MUST propagate deterministically to the host.
+- The simulation engine provides futures/handles to observe completion.
+
+---
+
+## Links
+
+- SPEC R1, R2, R7, R8
+- ADR-0007 (Runtime API boundaries)
+- ADR-0008 (Tensor deployment)
@@ -0,0 +1,62 @@
+# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+Benchmarks represent device-agnostic workloads that operate on a single device.
+Users may want to run a benchmark:
+
+- on a specific device, or
+- across all devices in the system.
+
+Device enumeration must not leak into benchmarks or runtime APIs.
+
+---
+
+## Decision
+
+### D1. Benchmarks are single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+---
+
+### D2. CLI controls device selection
+
+The `kernbench run` command supports an optional `--device` argument:
+
+- If `--device <id>` is specified:
+  - the benchmark executes once for the specified device.
+
+- If `--device` is omitted:
+  - the benchmark executes once using all the SIPs discovered in the topology.
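
The selection rule might be resolved in the CLI layer roughly as follows. This is a sketch with hypothetical names (`select_devices`, `discovered_sips`); rejecting an unknown device id with an error is an assumption, since the ADR does not say how that case is handled.

```python
def select_devices(requested, discovered_sips):
    """Resolve the CLI --device option into the list of target devices.

    requested:       the value of --device, or None when the flag is omitted.
    discovered_sips: device ids found in the compiled topology.
    """
    if requested is None:
        return list(discovered_sips)
    if requested not in discovered_sips:
        raise ValueError(f"unknown device: {requested!r}")
    return [requested]


# Device ids are placeholders.
one = select_devices("sip1", ["sip0", "sip1", "sip2"])
everything = select_devices(None, ["sip0", "sip1", "sip2"])
```

Keeping this logic in the CLI means neither the benchmark nor the runtime API ever sees the list of devices.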
+
+---
+
+### D3. Multi-device execution is logically parallel
+
+When running on multiple devices:
+
+- benchmark executions are submitted to a single simulation engine instance,
+- executions are logically parallel in simulation time,
+- inter-device contention is naturally modeled.
+
+---
+
+### D4. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+---
+
+## Links
+
+- SPEC R7, R8
+- ADR-0007 (Runtime API boundaries)
@@ -0,0 +1,65 @@
+# ADR-0011: Memory Addressing Simplification (PA-first)
+
+## Status
+
+Accepted
+
+## Context
+
+A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
+translation path for DMA: the host allocates physical memory at PE level, maps it
+into a virtual address space, installs mappings, and DMA requests use virtual
+addresses that are translated to physical addresses.
+
+For early development, we want a minimal, deterministic model that enables:
+
+- correct routing and latency accounting through the graph,
+- stable tensor deployment and kernel execution semantics,
+- future extension toward VA/MMU without rewriting workflows.
+
+---
+
+## Decision
+
+### D1. Phase 0 model is PA-only
+
+The simulator uses a PA-first model:
+
+- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
+  addresses (PA) plus size.
+- Tensor handles store PA-based shard mappings after deployment.
+- KernelLaunch passes tensor arguments as PA-based mappings (or references to them).
+- MMU/IOMMU concepts (virtual address spaces, page tables, translation latency)
+  are NOT modeled in Phase 0.
+
+### D2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA mappings
+sufficient to execute kernels and issue DMA requests.
+
+### D3. Extension path (non-breaking)
+
+A future ADR MAY introduce an optional VA/MMU layer by:
+
+- introducing virtual addresses in tensor handles,
+- adding a mapping-install step,
+- modeling translation latency and page granularity.
+
+The Phase 0 PA model remains a valid fast-path configuration.
+
+---
+
+## Consequences
+
+- Early implementation stays simple and testable.
+- All latency remains explicit via graph traversal, not hidden translation.
+- Future VA/MMU modeling can be added without breaking existing benchmarks.
+
+---
+
+## Links
+
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0008 (tensor deployment)
+- ADR-0009 (kernel execution)
+- SPEC R2 (latency by traversal)
@@ -0,0 +1,232 @@
+# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+## Status
+
+Accepted
+
+## Context
+
+Phase 0 uses a PA-first memory model (ADR-0011):
+
+- memory operations use device physical addresses (PA) only,
+- VA/MMU/IOMMU is not modeled.
+
+The host-facing runtime API interacts with the device via the IO_CPU endpoint.
+We define stable, minimal message schemas for Host ↔ IO_CPU so that:
+
+- benchmarks remain stable,
+- IO_CPU-internal fan-out/aggregation can evolve independently,
+- completion and failure propagation is deterministic.
+
+We also require PE-tagging (approach A): each shard explicitly carries (sip, cube, pe)
+so IO_CPU can deterministically route/fan-out without relying on PA decoding.
+
+---
+
+## Decision
+
+### D1. Contract scope
+
+This schema is the stable contract ONLY for Host ↔ IO_CPU.
+
+Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
+and are NOT part of this host contract in Phase 0.
+
+---
+
+### D2. Required message set
+
+The runtime API MUST use only these message types for Host ↔ IO_CPU:
+
+- MemoryWrite
+- MemoryRead
+- KernelLaunch
+
+All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
+with these messages.
+
+---
+
+### D3. Common envelope (mandatory for all requests)
+
+All Host ↔ IO_CPU requests MUST include:
+
+- `msg_type: str`
+- `correlation_id: str`
+  - generated by the host
+  - used to match responses deterministically
+- `request_id: str`
+  - unique within a correlation_id
+- `target_device: str`
+  - device identifier (e.g., "sip:0")
+- `timestamp_tag: str | None` (optional)
+  - debug tag only; MUST NOT affect determinism
+
+All Host ↔ IO_CPU responses MUST include:
+
+- `correlation_id: str`
+- `request_id: str`
+- `completion: Completion`
+
+---
+
+### D4. Completion schema (mandatory)
+
+`Completion` MUST have:
+
+- `ok: bool`
+- `error_code: str | None`
+- `error_message: str | None`
+
+Rules:
+
+- If `ok == true` then `error_code` and `error_message` MUST be null.
+- If `ok == false` then `error_code` MUST be non-null.
+- Completion semantics MUST be deterministic.
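The D4 rules can be enforced at construction time. The sketch below is one possible encoding as a frozen dataclass; the class layout and the choice of `ValueError` are assumptions, and only the three field names and the two rules come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule 1: a successful completion carries no error information.
        if self.ok and (self.error_code is not None or self.error_message is not None):
            raise ValueError("ok=True requires error_code and error_message to be None")
        # Rule 2: a failed completion must name an error code.
        if not self.ok and self.error_code is None:
            raise ValueError("ok=False requires a non-null error_code")
```

Validating in `__post_init__` means an invalid `Completion` can never be observed by the host, which keeps response handling free of defensive checks.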
+
+---
+
+### D5. MemoryWrite schema (PA-first, PE-tagged)
+
+`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- destination placement tags (approach A):
+  - `dst_sip: int`
+  - `dst_cube: int`
+  - `dst_pe: int`
+- `dst_pa: int`
+  - destination physical address in the destination PE's address space
+- `nbytes: int`
+- `src_kind: "pattern" | "host_buffer_ref"`
+  - Phase 0 MUST support "pattern"
+- `pattern: Pattern | None`
+  - required if `src_kind == "pattern"`
+
+`Pattern` (Phase 0 mandatory support):
+
+- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
+- `value: number | None`
+  - required for fill_*; ignored for zero
+
+Optional fields:
+
+- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
+- `debug_label: str | None`
+
+Notes:
+
+- This message MUST NOT embed bulk tensor data in Phase 0.
+- All latency MUST come from explicit graph traversal and modeled components.
+
+---
+
+### D6. MemoryRead schema (PA-first, PE-tagged)
+
+`MemoryRead` represents a host-initiated read from device memory.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- source placement tags (approach A):
+  - `src_sip: int`
+  - `src_cube: int`
+  - `src_pe: int`
+- `src_pa: int`
+- `nbytes: int`
+
+Optional fields:
+
+- `dst_kind: "host_sink" | "discard"` (default "host_sink")
+- `debug_label: str | None`
+
+Response payload:
+
+- actual bytes are NOT required in Phase 0 (latency/traces focus)
+- implementations MAY return lightweight stats or hashes later via a new ADR
+
+---
+
+### D7. KernelLaunch schema (PA-first, PE-tagged shards)
+
+`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
+
+Mandatory fields:
+
+- common envelope fields (D3)
+- `kernel_ref: KernelRef`
+- `args: list[KernelArg]`
+
+`KernelRef` MUST have:
+
+- `name: str`
+- `kind: "deployed" | "builtin"`
+- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
+- `deploy_sip: int` — SIP where binary resides
+- `deploy_cube: int` — cube where binary resides
+- `deploy_pe: int` — PE where binary resides
+- `nbytes_code: int` — kernel binary size (for BW modeling)
+
+Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
+KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
+
+`KernelArg` supports tensor args by PA mapping and scalars by value.
+
+Tensor arg (mandatory):
+
+- `arg_kind: "tensor"`
+- `tensor_pa_map: TensorPAMap`
+
+`TensorPAMap` MUST have:
+
+- `shards: list[TensorShard]`
+
+`TensorShard` MUST have (approach A, mandatory):
+
+- `sip: int`
+- `cube: int`
+- `pe: int`
+- `pa: int`
+- `nbytes: int`
+- `offset_bytes: int`
+
+Scalar arg (mandatory):
+
+- `arg_kind: "scalar"`
+- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
+- `value: number | bool`
+
+Optional KernelLaunch fields:
+
+- `grid: dict | None`
+- `meta: dict | None`
+- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
+- `debug_label: str | None`
+
+Notes:
+
+- KernelLaunch MUST NOT embed bulk tensor data.
+- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
+- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
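For orientation, the sketch below assembles one KernelLaunch payload as plain Python data. Field names follow D3 and D7; every concrete value (ids, addresses, sizes, the kernel name) and the helper `tensor_arg` are made up for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def tensor_arg(shards: list[TensorShard]) -> dict:
    """Build a tensor KernelArg as a plain dict (hypothetical helper)."""
    return {"arg_kind": "tensor", "tensor_pa_map": {"shards": [asdict(s) for s in shards]}}


# Illustrative values only: one tensor split across two PEs of the same cube.
launch = {
    "msg_type": "KernelLaunch",
    "correlation_id": "c-0",
    "request_id": "r-0",
    "target_device": "sip:0",
    "kernel_ref": {
        "name": "example_kernel",
        "kind": "deployed",
        "deploy_pa": 0x1000,
        "deploy_sip": 0,
        "deploy_cube": 0,
        "deploy_pe": 0,
        "nbytes_code": 4096,
    },
    "args": [
        tensor_arg([
            TensorShard(sip=0, cube=0, pe=0, pa=0x2000, nbytes=1024, offset_bytes=0),
            TensorShard(sip=0, cube=0, pe=1, pa=0x2000, nbytes=1024, offset_bytes=1024),
        ]),
        {"arg_kind": "scalar", "dtype": "i32", "value": 7},
    ],
    "failure_policy": "fail_fast",
}
```

Note that both shards carry the same `pa` but different `pe` tags: the PA is interpreted per PE, which is why the explicit tags are needed for fan-out.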
+
+---
+
+## Verification Notes
+
+Tests SHOULD validate:
+
+- schema validation rejects missing mandatory fields,
+- deterministic correlation/response matching,
+- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
+- all routed requests incur latency > 0.
+
+---
+
+## Links
+
+- ADR-0011 (PA-first memory addressing)
+- ADR-0007 (runtime_api vs sim_engine boundaries)
+- ADR-0009 (kernel execution fan-out/aggregation)
+- SPEC R2, R7, R8
@@ -0,0 +1,139 @@
+# ADR-0013: Verification Strategy and Phase 1 Test Plan
+
+## Status
+
+Accepted
+
+## Context
+
+KernBench is a system-level simulator whose correctness is defined by:
+
+- adherence to SPEC-defined invariants,
+- determinism and debuggability,
+- explicit modeling of routing and latency.
+
+Given the evolving implementation, we need a stable verification strategy
+that prevents architectural drift while allowing incremental development.
+
+This ADR defines the Phase 1 verification plan and what constitutes
+"correct behavior" for early implementations.
+
+---
+
+## Decision
+
+### D1. Verification is contract-based
+
+Verification MUST be derived from:
+
+- SPEC requirements,
+- accepted ADRs.
+
+Tests MUST validate architectural contracts, not incidental implementation details.
+
+---
+
+### D2. Phase 1 verification scope
+
+Phase 1 verification focuses on:
+
+- message contract validity (ADR-0012),
+- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
+- PA-first memory addressing and shard tagging (ADR-0011),
+- core latency and trace invariants (SPEC 0.1, R2).
+
+Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
+are explicitly out of scope in Phase 1.
+
+---
+
+### D3. Required Phase 1 verification cases
+
+The following verification cases MUST be supported by the implementation:
+
+#### V1. Message schema validation
+
+- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
+- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
+- Completion results MUST follow the `ok / error_code / error_message` contract.
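A V1-style check for tensor shard tags might look like the sketch below. It assumes messages arrive as nested dicts shaped like the ADR-0012 schema; the function name and the problem-string format are hypothetical.

```python
REQUIRED_SHARD_KEYS = ("sip", "cube", "pe", "pa", "nbytes", "offset_bytes")


def validate_kernel_launch(msg: dict) -> list[str]:
    """Return a list of problems; an empty list means every shard is fully tagged."""
    problems = []
    for arg_idx, arg in enumerate(msg.get("args", [])):
        if arg.get("arg_kind") != "tensor":
            continue  # scalar args carry no shards
        shards = arg.get("tensor_pa_map", {}).get("shards", [])
        for shard_idx, shard in enumerate(shards):
            for key in REQUIRED_SHARD_KEYS:
                if key not in shard:
                    problems.append(f"args[{arg_idx}].shards[{shard_idx}] missing '{key}'")
    return problems
```

A test would then assert that a request with a non-empty problem list is rejected with a failed `Completion`, rather than asserting on any internal validator state.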
+
+#### V2. IO_CPU fan-out and aggregation
+
+Given:
+
+- a topology with one SIP, one CUBE, and two PEs,
+- a KernelLaunch request containing two tensor shards targeting different PEs,
+
+The system MUST:
+
+- submit a single KernelLaunch to IO_CPU,
+- fan-out work internally to both PEs,
+- aggregate completion and return a single deterministic completion to the host.
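One way to make the aggregated completion deterministic is to fold per-PE results in a fixed order. The sketch below reports the failure of the lowest-numbered failing PE; that tie-break, the input shape, and the function name are assumptions, since this ADR only requires that the result be single and deterministic.

```python
from typing import Optional


def aggregate_completions(per_pe: dict[int, Optional[str]]) -> dict:
    """Fold per-PE results into one host-visible completion.

    per_pe maps a PE id to its error_code; None means that PE succeeded.
    """
    # Visit PEs in sorted id order so the result does not depend on arrival order.
    for pe in sorted(per_pe):
        code = per_pe[pe]
        if code is not None:
            return {"ok": False, "error_code": code, "error_message": f"pe {pe} failed"}
    return {"ok": True, "error_code": None, "error_message": None}
```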
+
+#### V3. Latency and trace invariants
+
+For any valid request:
+
+- the hop-by-hop trace MUST be non-empty,
+- total latency MUST be greater than zero,
+- repeated runs with identical inputs MUST produce identical traces.
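The three V3 invariants translate directly into a reusable test helper. In this sketch `run_once` is a hypothetical callable that executes one request and returns `(trace, total_ns)`; the real harness API may differ.

```python
def check_trace_invariants(run_once) -> None:
    """Run the same request twice and assert the V3 invariants."""
    trace_a, total_a = run_once()
    trace_b, total_b = run_once()
    assert len(trace_a) > 0, "hop-by-hop trace must be non-empty"
    assert total_a > 0, "total latency must be greater than zero"
    assert trace_a == trace_b and total_a == total_b, "identical inputs must give identical traces"
```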
+
+#### V4. Topology independence and cross-domain coverage
+
+Verification cases MUST pass for multiple topology shapes, including:
+
+- minimal: (1 SIP, 1 CUBE, 1 PE)
+- multi-PE: (1 SIP, 1 CUBE, N PEs)
+- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
+- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
+
+For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
+
+- explicit connectivity (required links exist),
+- deterministic routing and control-path traversal,
+- non-empty traces and latency > 0 for representative cross-domain requests
+  (inter-CUBE and inter-SIP paths).
+
+Tests MUST NOT hardcode topology sizes, node ids, or link counts.
+Instead, tests MUST derive expectations from the compiled topology metadata.
+
+---
+
+### D4. Phase 1 artifacts
+
+Phase 1 MAY include:
+
+- verification-only test code,
+- topology fixtures,
+- trace inspection utilities.
+
+Phase 1 MUST NOT require:
+
+- production code changes solely to satisfy tests,
+- weakening or removing tests to allow progress.
+
+---
+
+### D5. Phase 2 enforcement
+
+Phase 2 (Apply) MUST:
+
+- run the Phase 1 verification cases,
+- roll back all changes if any verification fails,
+- preserve tests as authoritative contracts.
+
+---
+
+## Consequences
+
+- Architectural correctness is enforced early.
+- Tests serve as executable documentation of system behavior.
+- Implementation remains flexible without losing rigor.
+
+---
+
+## Links
+
+- SPEC 0.1, R2, R6
+- ADR-0011 (PA-first memory addressing)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,364 @@
+# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
+
+- the dispatch model inside a PE,
+- the responsibilities of PE_SCHEDULER,
+- the PE_TCM-centric dataflow contract used by accelerator engines.
+
+We need a deterministic and debuggable PE-internal execution contract that supports:
+
+- simple single-engine commands
+- composite commands that build a tiled pipeline across DMA and accelerator engines
+
+The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
+
+## Decision
+
+### D1. PE internal component roles
+
+Each PE contains the following logical components.
+
+**PE_CPU**
+
+- Executes kernel instruction stream or kernel control logic.
+- Generates PE commands.
+- Submits commands to PE_SCHEDULER.
+- PE_CPU does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- The sole dispatcher inside a PE.
+- Receives commands from PE_CPU.
+- Expands composite commands into sub-commands.
+- Tracks dependencies and command state.
+- Dispatches work to engine queues.
+- Manages tile scheduling for composite commands.
+
+**PE_DMA**
+
+- Handles memory transfers between PE_TCM and external memory domains.
+- PE_DMA has **dual egress** at the CUBE level:
+  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
+  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
+- Supported directions include:
+  - HBM → PE_TCM (via XBAR)
+  - PE_TCM → HBM (via XBAR)
+  - PE_TCM → shared SRAM (via NOC)
+  - PE_TCM → other memory domains (via NOC, if supported by topology)
+
+**PE_GEMM**
+
+- Matrix multiplication engine.
+- Reads activations from PE_TCM.
+- May stream weights directly from HBM.
+
+**PE_MATH**
+
+- Element-wise computation engine.
+- Reads and writes PE_TCM.
+
+**PE_TCM**
+
+- Local SRAM used as the staging memory for accelerator operations.
+
+---
+
+### D2. Command lifecycle and queues
+
+PE_SCHEDULER maintains three logical structures.
+
+**SubmissionQueue**
+
+- Written by PE_CPU.
+- Contains incoming PE commands waiting to be processed.
+
+**InflightTable**
+
+- Owned and mutated only by PE_SCHEDULER.
+- Tracks:
+  - expanded sub-commands
+  - dependency state
+  - engine assignment
+  - completion status
+
+**CompletionQueue**
+
+- Written by PE_SCHEDULER.
+- Contains final completion records for commands.
+
+**Single-writer rule**
+
+- Only PE_SCHEDULER is allowed to mutate command completion state.
+- Engine components must report completion via explicit completion events/messages.
+
+**Command completion**
+
+A command becomes DONE when:
+
+- all of its sub-commands have completed, and
+- PE_SCHEDULER has published a completion record to CompletionQueue.
+
+---
+
+### D3. Dispatch modes
+
+PE commands are divided into two categories.
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command.
+
+Examples include:
+
+- DMA transfer
+- GEMM compute
+- MATH compute
+
+Execution flow:
+
+```
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (tiled pipeline)
+
+Composite commands implement tiled pipelined execution across engines.
+
+Each tile executes the following pipeline:
+
+```
+Input DMA (READ)
+→ Compute (GEMM or MATH)
+→ Output DMA (WRITE)
+```
+
+**Tiling rule**
+
+If the DMA payload exceeds the hardware tile size, PE_SCHEDULER splits the transfer into tiles.
+Each tile is assigned a monotonically increasing `tile_id`.
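The tiling rule can be sketched as a simple splitter. The function name and the `(tile_id, offset, size)` tuple layout are illustrative assumptions; the ADR only fixes that tile ids increase monotonically.

```python
def split_into_tiles(nbytes: int, tile_bytes: int) -> list[tuple[int, int, int]]:
    """Return (tile_id, offset, size) triples that cover nbytes in order."""
    if nbytes < 0 or tile_bytes <= 0:
        raise ValueError("nbytes must be >= 0 and tile_bytes must be > 0")
    tiles = []
    offset = 0
    tile_id = 0
    while offset < nbytes:
        size = min(tile_bytes, nbytes - offset)  # the last tile may be short
        tiles.append((tile_id, offset, size))
        offset += size
        tile_id += 1
    return tiles
```

For example, a 10-byte payload with a 4-byte tile size yields tiles of 4, 4, and 2 bytes with ids 0, 1, and 2.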
+
+**Tile dependency rules**
+
+For tile `t`:
+
+- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
+- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
+- All dependencies are enforced by PE_SCHEDULER.
+
+**Overlap policy (Phase 0 default)**
+
+Operations for different tiles may overlap when engine resources permit.
+
+Allowed overlaps:
+
+```
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t−1) ∥ COMPUTE(t)
+DMA_READ(t+1) ∥ DMA_WRITE(t)
+```
+
+Disallowed overlaps:
+
+```
+GEMM(t) ∥ GEMM(t′)
+MATH(t) ∥ MATH(t′)
+GEMM(t) ∥ MATH(t′)
+```
+
+---
+
+### D4. Engine execution model (Phase 0 default)
+
+Each engine behaves as a deterministic service resource.
+
+**DMA engine**
+
+PE_DMA contains two independent channels.
+
+```
+DMA_READ capacity  = 1
+DMA_WRITE capacity = 1
+```
+
+Rules:
+
+- DMA_READ and DMA_WRITE may execute concurrently.
+- Multiple READs cannot overlap.
+- Multiple WRITEs cannot overlap.
+
+Example allowed:
+
+```
+DMA_READ(t+1) ∥ DMA_WRITE(t)
+```
+
+Example not allowed:
+
+```
+DMA_READ(t) ∥ DMA_READ(t+1)
+DMA_WRITE(t) ∥ DMA_WRITE(t+1)
+```
+
+**Compute engine**
+
+Compute operations share a single compute resource.
+
+```
+PE_ACCEL capacity = 1
+```
+
+Both GEMM and MATH require this shared compute slot.
+
+Consequences:
+
+- GEMM ∥ GEMM not allowed
+- MATH ∥ MATH not allowed
+- GEMM ∥ MATH not allowed
+
+Only one compute operation can run in a PE at a time.
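To see how the tile dependencies and the capacity-1 resources interact, the sketch below computes a timeline for a composite command with fixed per-stage durations. It is a simplification under stated assumptions: durations are constant per tile, tile buffers are unlimited (so reads may run ahead), and the function name is hypothetical.

```python
def schedule_tiles(n_tiles: int, read_ns: int, compute_ns: int, write_ns: int):
    """Return per-tile ((read), (compute), (write)) intervals as (start, end) pairs."""
    # Next free time of each capacity-1 resource: DMA_READ, shared compute slot, DMA_WRITE.
    read_free = compute_free = write_free = 0
    timeline = []
    for _ in range(n_tiles):
        r_start = read_free
        r_end = r_start + read_ns
        c_start = max(r_end, compute_free)   # DMA_READ(t) -> COMPUTE(t)
        c_end = c_start + compute_ns
        w_start = max(c_end, write_free)     # COMPUTE(t) -> DMA_WRITE(t)
        w_end = w_start + write_ns
        read_free, compute_free, write_free = r_end, c_end, w_end
        timeline.append(((r_start, r_end), (c_start, c_end), (w_start, w_end)))
    return timeline
```

With `schedule_tiles(2, 10, 20, 5)`, the read of tile 1 occupies 10 to 20 ns while tile 0 computes from 10 to 30 ns, which is the `DMA_READ(t+1) ∥ COMPUTE(t)` overlap; the two compute intervals never overlap.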
+
+**Compute opcode restriction**
+
+Composite commands contain one compute opcode only.
+
+Examples:
+
+```
+COMPOSITE_GEMM
+COMPOSITE_MATH
+```
+
+Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
+
+**Engine completion signaling**
+
+Every engine emits a completion event when a sub-command finishes.
+Completion events are delivered to PE_SCHEDULER.
+
+---
+
+### D5. Dataflow model
+
+Compute operations use a TCM-centric dataflow model.
+
+**Input path (HBM)**
+
+```
+HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
+```
+
+**Input path (shared SRAM)**
+
+```
+Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
+```
+
+**Compute stage**
+
+Compute engines read input tensors from PE_TCM.
+
+```
+PE_TCM → GEMM / MATH
+```
+
+Weights for GEMM may optionally stream directly from HBM (via XBAR).
+
+**Output path (HBM)**
+
+Compute results are written to PE_TCM, then DMA writes to HBM.
+
+```
+PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
+```
+
+**Output path (shared SRAM)**
+
+```
+PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
+```
+
+#### D5.1 PE_TCM partitioning and ownership boundary
+
+The PE_TCM address space is partitioned into two logical regions.
+
+**SchedulerReservedTCM**
+
+- A staging region owned exclusively by PE_SCHEDULER.
+- This region is used for composite command tile buffers.
+- PE_SCHEDULER:
+  - partitions this region into tile buffers
+  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
+  - guarantees input/output buffer separation
+  - manages tile buffer lifetime
+
+**AllocatableTCM**
+
+- General-purpose region managed by PEMemAllocator.
+- Used by host or DP-visible allocations.
+
+**Visibility rule (hard isolation)**
+
+- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
+- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
+- This prevents DP or host allocations from interfering with scheduler staging buffers.
+
+**Tile buffer rules**
+
+Within SchedulerReservedTCM:
+
+- Input buffers and output buffers must not overlap.
+- PE_SCHEDULER assigns tile buffers for DMA and compute stages.
+- Tile buffers remain valid until the corresponding DMA_WRITE completes.
+- Buffer reuse is allowed only after the tile lifetime finishes.
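The hard-isolation rule is easiest to guarantee if the allocator is only ever handed the allocatable range. The sketch below shows that construction; placing the reserved region at the low end of the address space, and the names `Range` and `partition_tcm`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int
    end: int  # exclusive

    def overlaps(self, other: "Range") -> bool:
        return self.start < other.end and other.start < self.end


def partition_tcm(tcm_bytes: int, reserved_bytes: int) -> tuple[Range, Range]:
    """Split [0, tcm_bytes) into (SchedulerReservedTCM, AllocatableTCM)."""
    if not 0 <= reserved_bytes <= tcm_bytes:
        raise ValueError("reserved_bytes must fit inside the TCM")
    return Range(0, reserved_bytes), Range(reserved_bytes, tcm_bytes)
```

Only the second range would be passed to PEMemAllocator, so the allocator cannot return an address inside the scheduler's staging region.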
+
+---
+
+### D6. Observability and trace contract
+
+The simulator must emit deterministic trace events.
+
+Required events include:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+Trace ordering must be deterministic for identical inputs.
+
+---
+
+### D7. Topology representation
+
+PE internal components are declared in `cube.pe_template`.
+
+The template is instantiated once per PE.
+
+PE instances are derived from `cube.pe_layout`.
+
+External connectivity such as:
+
+- PE_DMA → XBAR (HBM data path)
+- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
+- NOC → PE_CPU (command path from M_CPU)
+
+is modeled at the CUBE level (see ADR-0003 D3).
+
+---
+
+## Links
+
+- SPEC R3, R4
+- ADR-0003 D4 (PE-level system hierarchy)
+- ADR-0005 View C (PE-level diagram)
+- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
+- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -0,0 +1,178 @@
+# ADR-0015: Component Port/Wire Model and Fabric Routing
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
+In practice, the engine iterates the topology path and calls `run()` on each component
+sequentially — conflating routing policy with component behavior and preventing realistic
+hardware modeling (queues, contention, fan-out).
+
+ADR-0007 D3 already states that components own fan-out and aggregation, but the current
+implementation does not enforce this for fabric traversal.
+
+This ADR defines:
+
+- how components communicate via typed port queues,
+- how propagation delay is modeled (wire processes),
+- the fabric path for Memory R/W through M_CPU.DMA,
+- the reduced role of the simulation engine,
+- M_CPU.DMA as an internal subcomponent of M_CPU.
+
+---
+
+## Decision
+
+### D1. Component port model
+
+Each component has typed input/output ports modeled as SimPy Stores:
+
+```
+in_ports:  dict[str, simpy.Store]   # keyed by source node_id
+out_ports: dict[str, simpy.Store]   # keyed by destination node_id
+```
+
+Ports are created at engine initialization based on graph edges.
+Each directed edge (src → dst) results in:
+
+- `src.out_ports[dst]`  — the sending end
+- `dst.in_ports[src]`   — the receiving end
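The port construction can be sketched without a simulation environment by injecting the store factory. Here `collections.deque` stands in for a SimPy Store so the example runs on the standard library alone; in the engine the factory would presumably create `simpy.Store` instances bound to the environment. The function name and return shape are assumptions.

```python
from collections import deque


def build_ports(edges, make_store=deque):
    """Create the sending and receiving store for every directed edge (src, dst)."""
    in_ports, out_ports = {}, {}
    for src, dst in edges:
        # Sending end, owned by src and keyed by destination node id.
        out_ports.setdefault(src, {})[dst] = make_store()
        # Receiving end, owned by dst and keyed by source node id.
        in_ports.setdefault(dst, {})[src] = make_store()
    return in_ports, out_ports
```

The two ends of an edge are deliberately distinct stores: a separate process is expected to move items from the sending end to the receiving end, which is where propagation delay is applied.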
+
+---
+
+### D2. Wire process (propagation delay)
+
+For each directed edge (src, dst) in the topology graph, a SimPy wire process
+models propagation delay:
+
+```python
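+def wire_process(env, out_port, in_port, delay_ns):
+    """Forward each command from out_port to in_port after delay_ns."""
+    while True:
+        cmd = yield out_port.get()       # wait for the sender to emit a command
+        yield env.timeout(delay_ns)      # model the propagation delay of this edge
+        yield in_port.put(cmd)           # deliver to the receiver's input port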
+```
+
+Wire processes are started at engine initialization.
+BW constraints are enforced by the sending component's out_port capacity or token model,
+not by the wire process itself.
+
+---
+
+### D3. Engine role (reduced)
+
+The simulation engine MUST:
+
+- wire components at initialization (create port Stores, start wire processes),
+- identify the entry component for each request type (PCIE_EP),
+- put the request into the entry component's in_port,
+- wait for a completion event.
+
+The simulation engine MUST NOT:
+
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out.
+
+This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
+ADR-0007 D2 must be amended accordingly.
+
+---
+
+### D4. Unified fabric path for Memory R/W and Kernel Launch
+
+Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
+The difference is what M_CPU does upon receiving the request.
+
+**Forward path (IO_CPU → target M_CPU):**
+
+```
+IO_CPU
+  → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU
+```
+
+**At M_CPU (diverges by operation type):**
+
+```
+Memory R/W:     M_CPU → M_CPU.DMA → noc → hbm_ctrl
+Kernel Launch:  M_CPU → PE[0..n] (parallel fan-out)
+```
+
+**Completion path (reverse, same fabric):**
+
+```
+Memory R/W:     hbm_ctrl → noc → M_CPU.DMA → M_CPU
+Kernel Launch:  PE[0..n] all complete → M_CPU (aggregation)
+
+M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
+```
+
+---
+
+### D5. M_CPU.DMA is an internal subcomponent of M_CPU
+
+M_CPU.DMA is NOT a separate topology node.
+It is an internal subcomponent owned by the M_CPU component implementation.
+
+M_CPU.DMA:
+
+- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
+- issues memory requests over the NOC to hbm_ctrl,
+- receives completion from hbm_ctrl via the NOC,
+- reports completion to M_CPU,
+- is created and managed inside M_CPU's `__init__` and `run()`.
+
+M_CPU.DMA does not appear as a node in the compiled topology graph.
+
+---
+
+### D6. Transit cube forwarding
+
+A cube that is not the target of a memory or kernel request acts as a transit node.
+Transit cubes forward requests without consuming them:
+
+```
+ucie_in (from upstream) → noc → ucie_out (to downstream)
+```
+
+The transit forwarding decision is made entirely within the ucie_in component.
+The noc and ucie_out components in a transit cube then forward the packet without modification.
+
+---
+
+### D7. _formula_latency is preserved as a lower-bound cross-check
+
+The path-based formula latency function (`_formula_latency`) is preserved in the engine
+as a lower bound for correctness verification.
+
+Invariant:
+
+- Phase 0: `_formula_latency == component model total_ns`
+- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
+
+This function is independent of the port/wire model and requires only the topology graph.
+It is used for shard comparison in `_route_kernel` and as a regression guard.
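The invariant can be expressed as a small check. The sketch below assumes the formula is a plain sum of per-edge delays along the path; the real `_formula_latency` may also include per-node terms, and both function names here are hypothetical.

```python
def formula_latency(path: list[str], edge_delay_ns: dict[tuple[str, str], int]) -> int:
    """Sum the per-edge delays over consecutive hops of the path."""
    return sum(edge_delay_ns[(a, b)] for a, b in zip(path, path[1:]))


def check_lower_bound(formula_ns: int, model_ns: int, phase: int) -> bool:
    """Phase 0 requires equality; later phases allow the model to be slower."""
    if phase == 0:
        return formula_ns == model_ns
    return formula_ns <= model_ns
```

Used as a regression guard, a failing check in Phase 1+ means the component model reported less latency than the contention-free path allows, which points to a missing hop or delay.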
+
+---
+
+## Consequences
+
+- Components model realistic hardware behavior (queues, contention, fan-out).
+- Propagation delay is modeled accurately per edge.
+- Engine is decoupled from routing policy.
+- Component implementations remain swappable via DI (ADR-0007 D3).
+- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
+- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
+
+---
+
+## Links
+
+- ADR-0007 D2 (to be amended: engine path-walking clause)
+- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0014 D4 (DMA engine capacity=1)
+- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)