ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes
  (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block
  replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned
  of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml:
ADR-NNNN cross-references in docstrings and YAML comments updated after the
merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -35,7 +35,7 @@ shortcuts that obscure control paths.

 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
-- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
  (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
  Local HBM access has minimal hops (switching overhead only); remote access
  traverses additional routers.
@@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,

 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
-  router in the NOC mesh (ADR-0019).
+  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.

@@ -20,7 +20,9 @@ Diagrams must reflect this distance by default.

 ---

-## Global Defaults
+## Decision
+
+### D1. Global Defaults

 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
@@ -31,7 +33,7 @@ Diagrams must reflect this distance by default.

 ---

-## Representative Rendering Rule
+### D2. Representative Rendering Rule

 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
@@ -47,9 +49,9 @@ unless explicitly requested.

 ---

-## Diagram Views
+### D3. Diagram Views

-### View A — SIP-Level Diagram
+#### View A — SIP-Level Diagram

 **Purpose**
 Explain system-scale structure and connectivity.
@@ -75,7 +77,7 @@ Explain system-scale structure and connectivity.

 ---

-### View B — CUBE-Level Diagram
+#### View B — CUBE-Level Diagram

 **Purpose**
 Explain cube-internal structure and data/control flow.
@@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow.

 ---

-### View C — PE-Level Diagram
+#### View C — PE-Level Diagram

 **Purpose**
 Explain internal PE behavior and execution structure.
@@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure.

 ---

-## Distance-Aware Layout (Default)
+### D4. Distance-Aware Layout (Default)

-### Distance definition
+#### Distance definition

 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.

-### Default anchor selection
+#### Default anchor selection

 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
@@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure.

 Anchors are **implicit defaults** and MUST NOT be required to be specified.

-### Layout rules
+#### Layout rules

 - Diagrams MUST be laid out in layers based on distance buckets.
 - Layout direction MUST be consistent within a view type
@@ -156,7 +158,7 @@ without affecting distance semantics.

 ---

-## Generation Contract (for Tools / Claude Code)
+### D5. Generation Contract (for Tools / Claude Code)

 When generating diagrams:

@@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE):
 - CUBE-level projection MUST include:
  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
-  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
+  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
 - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.

 ### D6. Output formats and determinism
@@ -42,21 +42,25 @@ The runtime API MUST NOT:

 ---

-### D2. Simulation engine executes and schedules requests
+### D2. Simulation engine wires components and tracks completion

 The simulation engine (sim_engine) MUST:

-- inject requests into the compiled topology graph,
+- wire components at initialization (create port stores + start wire
+  processes per the component port/wire framework — ADR-0015),
+- inject requests into the compiled topology graph at entry components
+  (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
 - schedule and execute events using a discrete-event model,
-- manage correlation ids and completion tracking,
-- decompose operations into low-level requests when required
-  (e.g., MemoryWrite events).
+- manage correlation ids and completion tracking.

 The simulation engine MUST NOT:

 - define tensor semantics,
 - define kernel execution policies,
-- expose internal graph details to the runtime API.
+- expose internal graph details to the runtime API,
+- walk the topology path during request execution,
+- call component `run()` methods directly,
+- track per-hop latency or decompose fan-out (components own this).

 ---

@@ -87,3 +91,5 @@ component-level fan-out explicitly.
 - SPEC R4, R7, R8
 - ADR-0008 (Tensor deployment)
 - ADR-0009 (Kernel execution)
+- ADR-0015 (Component port/wire model and engine role)
+- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -142,3 +142,5 @@ control plane — runtime API and application kernels are unchanged.
 - SPEC R1, R2, R7, R8
 - ADR-0007 (Runtime API boundaries)
 - ADR-0008 (Tensor deployment)
+- ADR-0013 (Verification strategy — V2 fan-out tests)
+- ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -0,0 +1,131 @@
+# ADR-0010: Command Line Interface and Execution Semantics
+
+## Status
+
+Accepted
+
+## Context
+
+The `kernbench` CLI is the user-facing entry point of the simulator. It
+exposes three subcommands:
+
+- `run` — execute a benchmark against a topology.
+- `probe` — diagnostic utility for latency / BW measurement.
+- `web` — interactive topology viewer.
+
+Device enumeration is centralized in the CLI; neither the runtime API
+nor the simulation engine enumerates devices. Benchmarks remain
+single-device by design and accept a device identifier as input.
+
+## Decision
+
+### D1. Benchmark contract — single-device by design
+
+- A benchmark MUST define behavior for a single device only.
+- A benchmark MUST accept a device identifier as input.
+- Benchmarks MUST NOT enumerate or loop over multiple devices.
+
+Multi-device execution is the CLI's concern (D3), not the benchmark's.
+
+### D2. `kernbench run` — benchmark execution
+
+Required arguments:
+
+- `--topology <path>`: topology YAML file path. Loaded via
+  `resolve_topology()`.
+- `--bench <name>`: benchmark name. Resolved via
+  `benches.loader.resolve_bench()`.
+
+Optional arguments:
+
+- `--device <selector>` (default: `all`):
+  - `all` — run once per discovered SIP (see D3).
+  - `sip:<N>` — run only on SIP N.
+  - Parsed via `resolve_device()`.
+- `--verify-data` (default: off) — enable Phase 2 data verification
+  (see ADR-0020). When set, `engine_factory` constructs the engine
+  with `enable_data=True`. After the benchmark runs, a diagnostic
+  summary of recorded ops is printed.
+
+Each invocation runs the benchmark once within a single simulation
+instance.
+
+### D3. Multi-device execution is logically parallel
+
+When `--device all` (or omitted) and the topology has multiple SIPs:
+
+- Benchmark executions are submitted to a single simulation engine
+  instance.
+- Executions are logically parallel in simulation time.
+- Inter-device contention is naturally modeled (shared fabric
+  bandwidth, cross-SIP traffic, etc.).
+
+The CLI does NOT spawn multiple OS processes or independent
+simulation runs — parallelism is internal to one simulation instance.
+
+### D4. `kernbench probe` — latency / BW diagnostic utility
+
+Required argument:
+
+- `--topology <path>`: topology YAML file path.
+
+Optional argument:
+
+- `--case <name>` (default: `all`) — run a predefined traffic
+  pattern, or `all` to run every defined case.
+
+Probe runs each pattern through the simulation engine and reports
+per case:
+
+- End-to-end latency (ns).
+- Effective bandwidth (nbytes / total_ns).
+- Bottleneck bandwidth (min edge BW along the chosen path).
+- Utilization (effective / bottleneck).
+
+Probe additionally validates monotonicity invariants — for example
+that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
+cross-SIP — and reports violations. Probe is a developer tool for
+verifying the latency / BW model; it is not a benchmark.
+
+### D5. `kernbench web` — topology viewer
+
+Optional arguments:
+
+- `--port <N>` (default: `8765`) — HTTP port.
+- `--no-open` — do not auto-open the browser.
+
+Launches a local HTTP server that renders the compiled topology in
+the browser. Distinct from the static `docs/diagrams/` artifacts:
+
+- `docs/diagrams/` files are derived at topology-compile time
+  (ADR-0006).
+- `kernbench web` is interactive — pan/zoom, hover for component
+  attributes, switch between SIP / CUBE / PE views.
+
+### D6. Runtime API and simulation engine remain device-scoped
+
+- Runtime API calls operate on one device per invocation.
+- The simulation engine schedules all requests deterministically.
+- Neither layer enumerates devices.
+
+This invariant keeps each layer testable in isolation; device
+enumeration and multi-device fan-out live only in the CLI's `run`
+command (D3).
+
+## Consequences
+
+- Benchmark authors write single-device logic; multi-device behavior
+  emerges from the CLI dispatching across SIPs.
+- Adding a new subcommand (e.g., trace export, replay) does not
+  require benchmark or runtime-API changes — the CLI is the
+  extension point.
+- `probe` and `web` are diagnostic / visualization tools, not
+  benchmarks; they bypass the benchmark loader path.
+
+## Links
+
+- SPEC R7, R8, R9
+- ADR-0007 (Runtime API and Simulation Engine Boundaries)
+- ADR-0020 (Two-pass data execution — `--verify-data`)
+- ADR-0006 (Topology compilation and diagram generation —
+  background for `kernbench web`)
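
The per-case figures that ADR-0010 D4 lists for `kernbench probe` can be worked through in a few lines. The following is a minimal sketch under stated assumptions, not the kernbench implementation: `ProbeResult`, `monotonicity_violations`, the case names, and every number are hypothetical and chosen only to make the arithmetic visible. It assumes decimal units, so bytes per nanosecond equals GB/s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One probe case (hypothetical container, not a kernbench type)."""
    name: str
    nbytes: int
    total_ns: float
    edge_bw_gbs: tuple  # bandwidth of each edge along the chosen path

    @property
    def effective_bw_gbs(self) -> float:
        # nbytes / total_ns; 1 byte/ns == 1 GB/s in decimal units.
        return self.nbytes / self.total_ns

    @property
    def bottleneck_bw_gbs(self) -> float:
        # Minimum edge bandwidth along the path.
        return min(self.edge_bw_gbs)

    @property
    def utilization(self) -> float:
        # Effective bandwidth relative to the bottleneck edge.
        return self.effective_bw_gbs / self.bottleneck_bw_gbs


def monotonicity_violations(results):
    """Return adjacent (nearer, farther) name pairs whose latency is inverted.

    `results` must be ordered from the nearest to the farthest case,
    e.g. local HBM, cross-PE, cross-cube, cross-SIP.
    """
    return [
        (near.name, far.name)
        for near, far in zip(results, results[1:])
        if near.total_ns > far.total_ns
    ]


# Hypothetical measurements; cross_cube is deliberately too fast.
cases = [
    ProbeResult("local_hbm", 4096, 32.0, (256.0, 128.0)),
    ProbeResult("cross_pe", 4096, 64.0, (256.0, 128.0)),
    ProbeResult("cross_cube", 4096, 50.0, (256.0, 64.0)),
]

for case in cases:
    print(case.name, case.effective_bw_gbs, case.utilization)
print("violations:", monotonicity_violations(cases))
```

Keeping the expected ordering as list order makes the invariant check a single pass over adjacent pairs, so adding a new distance class only means inserting one case at the right position.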
@@ -1,62 +0,0 @@
-# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
-
-## Status
-
-Accepted
-
-## Context
-
-Benchmarks represent device-agnostic workloads that operate on a single device.
-Users may want to run a benchmark:
-
-- on a specific device, or
-- across all devices in the system.
-
-Device enumeration must not leak into benchmarks or runtime APIs.
-
----
-
-## Decision
-
-### D1. Benchmarks are single-device by design
-
-- A benchmark MUST define behavior for a single device only.
-- A benchmark MUST accept a device identifier as input.
-- Benchmarks MUST NOT enumerate or loop over multiple devices.
-
----
-
-### D2. CLI controls device selection
-
-The `kernbench run` command supports an optional `--device` argument:
-
-- If `--device <id>` is specified:
-  - the benchmark executes once for the specified device.
-
-- If `--device` is omitted:
-  - the benchmark executes once using all the SIPs discovered in the topology.
-
----
-
-### D3. Multi-device execution is logically parallel
-
-When running on multiple devices:
-
-- benchmark executions are submitted to a single simulation engine instance,
-- executions are logically parallel in simulation time,
-- inter-device contention is naturally modeled.
-
----
-
-### D4. Runtime API and simulation engine remain device-scoped
-
-- Runtime API calls operate on one device per invocation.
-- The simulation engine schedules all requests deterministically.
-- Neither layer enumerates devices.
-
----
-
-## Links
-
-- SPEC R7, R8
-- ADR-0007 (Runtime API boundaries)
@@ -396,7 +396,7 @@ Other N values:
 #### D-LA7. n:1 mode detail

 - One logical access → one aggregated request.
-- Target: aggregated router → hbm_ctrl (see ADR-0019).
+- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
 - Aggregated link BW = `channels_per_pe × channel_bw_gbs`
  (e.g. 8 × 32 = 256 GB/s).
 - Single queue / resource for modelling.
@@ -516,6 +516,6 @@ Negative:
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
-- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
-  consumer)
+- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
+- ADR-0013 (Verification strategy — V1 PA tagging)
 - SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -229,4 +229,5 @@ Tests SHOULD validate:
 - ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
+- ADR-0013 (Verification strategy — V1 message schema validation)
 - SPEC R2, R7, R8
@@ -0,0 +1,451 @@
+# ADR-0014: PE Pipeline Execution Model
+
+## Status
+
+Accepted
+
+## Context
+
+This ADR defines the PE-internal kernel execution model:
+
+- Role decomposition of PE-internal components
+- Command dispatch paths (simple / composite / multi-op composite with epilogue)
+- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
+- TCM-centric dataflow with a register-file intermediary
+- Engine resource model
+- Observability and trace contract
+- Topology representation
+
+PE-internal structure (7 components in scope; 2 cross-referenced):
+
+- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
+  `pe_tcm` — defined here
+- `pe_mmu` — VA model, defined in ADR-0011 D-VA
+- `pe_ipcq` — collective communication, defined in ADR-0023
+
+The goal is a deterministic, trace-friendly execution contract that keeps
+each block independently swappable.
+
+## Decision
+
+### D1. PE-internal component roles
+
+**PE_CPU**
+
+- Executes kernel instruction stream / control logic.
+- Generates PE commands and submits them to `PE_SCHEDULER` (via
+  `PeInternalTxn`).
+- Does NOT enqueue work directly into engine queues.
+
+**PE_SCHEDULER**
+
+- Sole dispatcher inside a PE.
+- Receives commands from `PE_CPU`. Dispatch by command type:
+  - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
+    → forward directly to the target engine.
+  - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
+    via a single `_feed_loop` (D6).
+- Does not participate in stage-to-stage chaining within a composite;
+  that is handled by token self-routing (D6).
+
+**PE_DMA**
+
+- Handles memory transfers between TCM and external memory domains
+  (HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
+- Two execution channels:
+  - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
+- Additional virtual channels:
+  - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
+  - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
+
+**PE_FETCH_STORE**
+
+- TCM ↔ Register File transfer unit.
+- Isolates register-file access semantics from compute engines so that
+  GEMM/MATH stay pure compute components.
+- BW-based latency model; TCM access contention naturally serializes
+  through `PE_TCM`'s BW resource.
+
+**PE_GEMM**
+
+- MAC array. Reads operands from the register file; writes results to
+  the register file. Does not touch `PE_TCM` directly.
+
+**PE_MATH**
+
+- Element-wise / reduction / SIMD unit. Reads / writes the register file.
+
+**PE_TCM**
+
+- Tightly-coupled scratchpad with BW-serialized access. Two logical
+  regions partitioned by ownership (see D5).
+
+**Cross-referenced components** (defined elsewhere):
+
+- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
+- `pe_ipcq` — collective ring buffers and peer endpoint metadata
+  (ADR-0023).
+
+### D2. Command lifecycle and queues
+
+`PE_SCHEDULER` maintains three logical structures:
+
+**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
+
+**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
+expanded sub-commands, dependency state, engine assignment, and
+completion status.
+
+**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
+records.
+
+**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
+state. Engines report completion via explicit events / messages
+consumed by the scheduler.
+
+**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
+publishes a completion record.
+
+### D3. Dispatch modes
+
+#### D3.1 Simple command
+
+A simple command expands to exactly one engine sub-command:
+
+- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
+- `GemmCmd` → `PE_GEMM`
+- `MathCmd` → `PE_MATH`
+
+Flow:
+
+```text
+PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
+       → completion → PE_SCHEDULER → CompletionQueue
+```
+
+#### D3.2 Composite command (single-op tiled pipeline)
+
+The default `CompositeCmd` runs a single compute op as a tile-pipelined
+sequence:
+
+```text
+DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
+```
+
+`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
+`TileToken` per tile with a monotonically increasing `tile_id`.
+
+Tile dependency (within one tile `t`):
+
+```text
+DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
+```
+
+Inter-tile overlap is allowed wherever engine resources permit
+(D4 governs the constraints):
+
+```text
+DMA_READ(t+1) ∥ COMPUTE(t)
+DMA_WRITE(t-1) ∥ COMPUTE(t)
+```
+
+#### D3.3 Multi-op composite (head + epilogue with scope)
+
+A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
+multi-op pipeline:
+
+```python
+@dataclass(frozen=True)
+class OpSpec:
+    kind: str         # "gemm" | "math.exp" | "math.bias_add" | ...
+    scope: Scope      # "per_k_tile" | "per_output_tile" | "once"
+    ...
+```
+
+- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
+  M/K/N partition).
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
+  often they fire:
+  - `per_k_tile` — every K-reduction step.
+  - `per_output_tile` — once per output tile.
+  - `once` — once per kernel.
+
+Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
+each stage is dispatched via token self-routing (D6), so GEMM and MATH
+participate serially within the same composite even though they share
+the compute slot (D4).
+
+The empty-`ops` form is the legacy single-op path.
+
+### D4. Engine resource model
+
+**DMA engine**:
+
+- `DMA_READ`: `simpy.Resource(capacity=1)`.
+- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
+- Both channels run concurrently (READ ∥ WRITE allowed).
+- Within a channel, requests serialize (READ ∥ READ disallowed; same
+  for WRITE).
+- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
+  ADR-0023 D8 — out of scope for this ADR.
+
+**Compute engine**:
+
+- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
+  `PE_MATH`.
+- At most one compute op runs at a time within a PE.
+- Multi-op composite chains (D3.3) execute their compute stages serially
+  through this slot; token self-routing (D6) ensures the next stage
+  starts only after the previous compute releases the slot.
+
+**Engine completion**: each engine emits a completion event consumed by
+the scheduler / `PipelineContext` (D6).
+
+### D5. Dataflow
+
+**Input path (HBM source)**:
+
+```text
+HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+Register File → PE_GEMM | PE_MATH
+```
+
+**Input path (shared SRAM source)**:
+
+```text
+Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
+PE_TCM → PE_FETCH_STORE → Register File
+```
+
+**Output path (HBM destination)**:
+
+```text
+Register File → PE_FETCH_STORE → PE_TCM
+PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
+```
+
+GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
+single TCM↔register-file gateway. This makes TCM BW contention
+explicit and lets fetch unit policies (e.g., prefetch) be replaced
+independently of compute engines.
+
+#### D5.1 PE_TCM partitioning
+
+`PE_TCM` is split into two logical regions:
+
+**SchedulerReservedTCM**
+
+- Owned exclusively by `PE_SCHEDULER`.
+- Holds composite-command tile buffers.
+- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
+  COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
+  manages tile-buffer lifetimes.
+
+**AllocatableTCM**
+
+- General-purpose region managed by `PEMemAllocator`.
+- Used for host / DP-visible allocations.
+
+**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
+allocate inside `SchedulerReservedTCM`. The reserved region is excluded
+from allocator-managed ranges by construction.
+
+**Tile buffer rules**:
+
+- Input and output buffers within `SchedulerReservedTCM` MUST NOT
+  overlap during a tile's active lifetime.
+- A tile buffer remains valid until the corresponding `DMA_WRITE`
+  completes.
+- Buffer reuse is permitted only after the consuming tile's lifetime
+  ends.
+
+### D6. TileToken self-routing pipeline
+
+A composite's stage-to-stage progression happens **without** routing
+through the scheduler. Each component forwards the token directly to
+the next stage's component using the token's `plan`:
+
+```text
+Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
+              ↑ chaining: no scheduler hop                          ↑
+                                                  PipelineContext.complete_tile()
+```
+
+This mirrors real-HW done-wire chains. The scheduler handles only
+**initial dispatch + completion aggregation**.
+
+#### TilePlan / Stage
+
+```python
+class StageType(Enum):
+    DMA_READ = 0
+    FETCH = 1
+    GEMM = 2
+    MATH = 3
+    STORE = 4
+    DMA_WRITE = 5
+
+@dataclass(frozen=True)
+class Stage:
+    stage_type: StageType
+    component: str         # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
+    params: dict           # stage-specific parameters
+
+@dataclass(frozen=True)
+class TilePlan:
+    tile_id: int
+    stages: tuple[Stage, ...]
+```
+
+#### TileToken
+
+```python
+@dataclass
+class TileToken:
+    tile_id: int
+    pipeline_ctx: PipelineContext
+    plan: TilePlan
+    stage_idx: int
+    params: dict             # cached current stage params
+    data_op: bool = True     # op_log opt-in (ADR-0020 D4)
+```
+
+Single-owner invariant: a token is owned by exactly one component at a
+time. Lifecycle: scheduler creates with `stage_idx=0` → component
+`_process()` → increment `stage_idx` → put to next stage's `in_port` →
+last stage calls `pipeline_ctx.complete_tile()`.
+
+#### PipelineContext (exactly-once completion)
+
+```python
+@dataclass
+class PipelineContext:
+    id: str
+    total_tiles: int
+    completed_tiles: int = 0
+    done_event: simpy.Event = None
+
+    def complete_tile(self) -> None:
+        self.completed_tiles += 1
+        if self.completed_tiles == self.total_tiles:
+            self.done_event.succeed()
+```
+
+Each tile's last stage MUST call `complete_tile()` exactly once.
+Duplicate calls are bugs (SimPy `Event` can succeed at most once).
+
+#### Feed ordering
+
+`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
+`_pending_feeds` FIFO. Composite commands are enqueued in submission
+order; tile feed for a command runs to completion before the next
+command's feed begins. **Tile-feed interleaving between commands is
+disallowed.**
+
+Within a single command's tiles, downstream pipeline overlap arises
+naturally — earlier tiles progress through later stages while the feeder
+keeps pushing remaining tiles into the first stage queue (SimPy Store
+backpressure governs flow control). If the first-stage queue is full,
+only the feeder blocks; the scheduler worker's inbox processing
+continues.
+
+#### Token routing pattern (base class)
+
+```python
+def _pipeline_worker(self, env):
+    while True:
+        token = yield self._inbox.get()
+        yield from self._process(env, token)       # stage-specific logic
+        next_idx = token.stage_idx + 1
+        if next_idx < len(token.plan.stages):
+            next_stage = token.plan.stages[next_idx]
+            token.stage_idx = next_idx
+            token.params = next_stage.params
+            yield self.out_ports[next_stage.component].put(token)
+        else:
+            token.pipeline_ctx.complete_tile()
+```
+
+Each component implements only `_process()`; chaining lives in the
+base class.
+
+### D7. Observability and trace contract
+
+The simulator emits deterministic trace events:
+
+- `command_submitted`
+- `sub_command_dispatched`
+- `engine_start`
+- `engine_complete`
+- `tile_ready`
+- `command_complete`
+
+For identical inputs, trace ordering MUST be deterministic.
+
+### D8. Topology representation
+
+PE-internal components are declared in `cube.pe_template`:
+
+```yaml
+pe_template:
+  components:
+    pe_cpu:         { kind: pe_cpu,         impl: builtin.pe_cpu,         attrs: { overhead_ns: ... } }
+    pe_scheduler:   { kind: pe_scheduler,   impl: builtin.pe_scheduler,   attrs: { overhead_ns: ... } }
+    pe_dma:         { kind: pe_dma,         impl: builtin.pe_dma,         attrs: { rd_engines: 1, wr_engines: 1 } }
+    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
+    pe_gemm:        { kind: pe_gemm,        impl: builtin.pe_gemm,        attrs: { shared_resource: accel_slot, ... } }
+    pe_math:        { kind: pe_math,        impl: builtin.pe_math,        attrs: { shared_resource: accel_slot, ... } }
+    pe_tcm:         { kind: pe_tcm,         impl: builtin.pe_tcm,         attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
+    pe_mmu:         { kind: pe_mmu,         impl: builtin.pe_mmu,         attrs: { ... } }   # ADR-0011 D-VA
+    pe_ipcq:        { kind: pe_ipcq,        impl: builtin.pe_ipcq,        attrs: { ... } }   # ADR-0023
+  links:
+    # Scheduler dispatch edges (initial)
+    scheduler_to_dma_mm:         0.0
+    scheduler_to_fetch_store_mm: 0.0
+    scheduler_to_gemm_mm:        0.0
+    scheduler_to_math_mm:        0.0
+    # Pipeline chaining edges (token self-routing per D6)
+    dma_to_fetch_store_mm:       0.0
+    fetch_store_to_gemm_mm:      0.0
+    fetch_store_to_math_mm:      0.0
+    gemm_to_fetch_store_mm:      0.0
+    gemm_to_math_mm:             0.0
+    math_to_fetch_store_mm:      0.0
+    fetch_store_to_dma_mm:       0.0
+    fetch_store_to_tcm_bw_gbs:   ...
+```
+
+Template is instantiated once per PE. PE instances are derived from
+`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
+cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
+
+## Consequences
+
+### Positive
+
+- Each block is an independent topology node — individually swappable
+  via DI (ADR-0015).
+- PE-internal structure is visible in the topology graph.
+- Components do not know their downstream — plan-based routing gives
+  flexibility (e.g., epilogue chains require no scheduler change).
+- DMA and compute overlap naturally via SimPy Store backpressure.
+- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
+  without engine-level coupling.
+- TCM access contention is realistic — `PE_FETCH_STORE` is the single
+  TCM↔RF gateway.
+
+### Negative
+
+- Intra-PE component count is higher than a coarser model (7 base + 2
+  cross-referenced) — more topology nodes/edges.
+- Intra-PE token forwarding is explicit in traces (acceptable trade for
+  HW fidelity).
+
+## Links
+
+- ADR-0011 D-VA (PE_MMU component, VA translation)
+- ADR-0015 D4 (component port/wire model)
+- ADR-0020 (greenlet kernel execution / two-pass)
+- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
+- SPEC R3, R4
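
The token self-routing and exactly-once completion rules in ADR-0014 D6 can be exercised without SimPy. The following is a simplified sketch under stated assumptions, not the simulator's code: it replaces SimPy stores and events with plain deques and a boolean, omits `params` and timing, and `run_pipeline` is a hypothetical driver. The class names mirror the ADR only for readability.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    component: str  # name of the component that runs this stage


@dataclass
class PipelineContext:
    total_tiles: int
    completed_tiles: int = 0
    done: bool = False  # stand-in for the SimPy done_event

    def complete_tile(self) -> None:
        # Mimics "an Event can succeed at most once": a call after the
        # pipeline has already finished is treated as a bug.
        if self.done:
            raise RuntimeError("complete_tile() called after completion")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done = True


@dataclass
class TileToken:
    tile_id: int
    ctx: PipelineContext
    stages: tuple
    stage_idx: int = 0


def run_pipeline(tokens):
    """Route tokens stage to stage; return the (tile_id, component) trace."""
    inboxes = {}
    trace = []
    # Scheduler role: initial dispatch into the first stage only.
    for tok in tokens:
        inboxes.setdefault(tok.stages[0].component, deque()).append(tok)
    while any(inboxes.values()):
        for component in list(inboxes):
            box = inboxes[component]
            if not box:
                continue
            tok = box.popleft()
            trace.append((tok.tile_id, component))
            nxt = tok.stage_idx + 1
            if nxt < len(tok.stages):
                # Component forwards the token itself; no scheduler hop.
                tok.stage_idx = nxt
                target = tok.stages[nxt].component
                inboxes.setdefault(target, deque()).append(tok)
            else:
                tok.ctx.complete_tile()
    return trace


plan = tuple(Stage(c) for c in (
    "pe_dma", "pe_fetch_store", "pe_gemm", "pe_fetch_store", "pe_dma"))
ctx = PipelineContext(total_tiles=2)
trace = run_pipeline([TileToken(i, ctx, plan) for i in range(2)])
print(ctx.done, ctx.completed_tiles)
```

Because each token carries its own plan, the driver never consults a central routing table after the initial dispatch, which is the property D6 relies on when epilogue stages are added to a plan.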
@@ -1,365 +0,0 @@
-# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
-
-- the dispatch model inside a PE,
-- the responsibilities of PE_SCHEDULER,
-- the PE_TCM-centric dataflow contract used by accelerator engines.
-
-We need a deterministic and debuggable PE-internal execution contract that supports:
-
-- simple single-engine commands
-- composite commands that build a tiled pipeline across DMA and accelerator engines
-
-The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
-
-## Decision
-
-### D1. PE internal component roles
-
-Each PE contains the following logical components.
-
-**PE_CPU**
-
-- Executes kernel instruction stream or kernel control logic.
-- Generates PE commands.
-- Submits commands to PE_SCHEDULER.
-- PE_CPU does NOT enqueue work directly into engine queues.
-
-**PE_SCHEDULER**
-
-- The sole dispatcher inside a PE.
-- Receives commands from PE_CPU.
-- Expands composite commands into sub-commands.
-- Tracks dependencies and command state.
-- Dispatches work to engine queues.
-- Manages tile scheduling for composite commands.
-
-**PE_DMA**
-
-- Handles memory transfers between PE_TCM and external memory domains.
-- PE_DMA connects to the cube-level NOC (on-die fabric):
-  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
-  - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
-  - Remote/shared: PE_DMA → NOC → (fabric hops) → destination
- Supported directions include:
-  - HBM → PE_TCM (via NOC)
-  - PE_TCM → HBM (via NOC)
-  - PE_TCM → shared SRAM (via NOC)
-  - PE_TCM → other memory domains (via NOC, if supported by topology)
-
-**PE_GEMM**
-
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
-
-**PE_MATH**
-
- Element-wise computation engine.
- Reads and writes PE_TCM.
-
-**PE_TCM**
-
- Local SRAM used as the staging memory for accelerator operations.
-
---
-
-### D2. Command lifecycle and queues
-
-PE_SCHEDULER maintains three logical structures.
-
-**SubmissionQueue**
-
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
-
-**InflightTable**
-
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
-  - expanded sub-commands
-  - dependency state
-  - engine assignment
-  - completion status
-
-**CompletionQueue**
-
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
-
-**Single-writer rule**
-
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
-
-**Command completion**
-
-A command becomes DONE when:
-
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
-
---
-
-### D3. Dispatch modes
-
-PE commands are divided into two categories.
-
-#### D3.1 Simple command
-
-A simple command expands to exactly one engine sub-command.
-
-Examples include:
-
- DMA transfer
- GEMM compute
- MATH compute
-
-Execution flow:
-
-```text
-PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
-```
-
-#### D3.2 Composite command (tiled pipeline)
-
-Composite commands implement tiled pipelined execution across engines.
-
-Each tile executes the following pipeline:
-
-```text
-Input DMA (READ)
-→ Compute (GEMM or MATH)
-→ Output DMA (WRITE)
-```
-
-**Tiling rule**
-
-If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
-Each tile is assigned a monotonically increasing `tile_id`.
-
-**Tile dependency rules**
-
-For tile `t`:
-
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
-
-**Overlap policy (Phase 0 default)**
-
-Operations for different tiles may overlap when engine resources permit.
-
-Allowed overlaps:
-
-```text
-DMA_READ(t+1) ∥ COMPUTE(t)
-DMA_WRITE(t−1) ∥ COMPUTE(t)
-DMA_READ(t) ∥ DMA_WRITE(t)
-```
-
-Disallowed overlaps:
-
-```text
-GEMM(t) ∥ GEMM(t′)
-MATH(t) ∥ MATH(t′)
-GEMM(t) ∥ MATH(t′)
-```
-
---
-
-### D4. Engine execution model (Phase 0 default)
-
-Each engine behaves as a deterministic service resource.
-
-**DMA engine**
-
-PE_DMA contains two independent channels.
-
-```text
-DMA_READ capacity  = 1
-DMA_WRITE capacity = 1
-```
-
-Rules:
-
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
-
-Example allowed:
-
-```text
-DMA_READ(t+1) ∥ DMA_WRITE(t)
-```
-
-Example not allowed:
-
-```text
-DMA_READ(t) ∥ DMA_READ(t+1)
-DMA_WRITE(t) ∥ DMA_WRITE(t+1)
-```
-
-**Compute engine**
-
-Compute operations share a single compute resource.
-
-```text
-PE_ACCEL capacity = 1
-```
-
-Both GEMM and MATH require this shared compute slot.
-
-Consequences:
-
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
-
-Only one compute operation can run in a PE at a time.
-
-**Compute opcode restriction**
-
-Composite commands contain one compute opcode only.
-
-Examples:
-
-```text
-COMPOSITE_GEMM
-COMPOSITE_MATH
-```
-
-Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
-
-**Engine completion signaling**
-
-Every engine emits a completion event when a sub-command finishes.
-Completion events are delivered to PE_SCHEDULER.
-
---
-
-### D5. Dataflow model
-
-Compute operations use a TCM-centric dataflow model.
-
-**Input path (HBM)**
-
-```text
-HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Input path (shared SRAM)**
-
-```text
-Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
-```
-
-**Compute stage**
-
-Compute engines read input tensors from PE_TCM.
-
-```text
-PE_TCM → GEMM / MATH
-```
-
-Weights for GEMM may optionally stream directly from HBM (via NOC).
-
-**Output path (HBM)**
-
-Compute results are written to PE_TCM, then DMA writes to HBM.
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
-```
-
-**Output path (shared SRAM)**
-
-```text
-PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
-```
-
-#### D5.1 PE_TCM partitioning and ownership boundary
-
-The PE_TCM address space is partitioned into two logical regions.
-
-**SchedulerReservedTCM**
-
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
-  - partitions this region into tile buffers
-  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
-  - guarantees input/output buffer separation
-  - manages tile buffer lifetime
-
-**AllocatableTCM**
-
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
-
-**Visibility rule (hard isolation)**
-
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
-
-**Tile buffer rules**
-
-Within SchedulerReservedTCM:
-
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
-
---
-
-### D6. Observability and trace contract
-
-The simulator must emit deterministic trace events.
-
-Required events include:
-
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
-
-Trace ordering must be deterministic for identical inputs.
-
---
-
-### D7. Topology representation
-
-PE internal components are declared in `cube.pe_template`.
-
-The template is instantiated once per PE.
-
-PE instances are derived from `cube.pe_layout`.
-
-External connectivity such as:
-
- PE_DMA → NOC → HBM (data path)
- PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
- NOC → PE_CPU (command path from M_CPU)
-
-is modeled at the CUBE level (see ADR-0003 D3).
-
---
-
-## Links
-
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -6,20 +6,19 @@ Accepted

 ## Context

-ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
-In practice, the engine iterates the topology path and calls `run()` on each component
-sequentially — conflating routing policy with component behavior and preventing realistic
-hardware modeling (queues, contention, fan-out).
-
-ADR-0007 D3 already states that components own fan-out and aggregation, but the current
-implementation does not enforce this for fabric traversal.
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.

 This ADR defines:

 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
- the reduced role of the simulation engine,
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.

 ---
@@ -88,9 +87,6 @@ The simulation engine MUST NOT:
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.

-This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
-ADR-0007 D2 must be amended accordingly.
-
 ---

 ### D4. Fabric paths for Memory R/W and Kernel Launch
@@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).

 ---

 ## Links

- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -1,189 +0,0 @@
-# ADR-0017: Cube NOC 2D Mesh Architecture
-
-## Status
-
-Accepted
-
-## Context
-
-ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
-does not specify the internal routing model, contention semantics, or
-attachment topology. The implementation uses a 2D mesh router grid with
-XY routing and per-segment contention modeling. This ADR formalizes that
-architecture.
-
-## Decision
-
-### D1. NOC node and router grid
-
-Each cube contains a 2D router mesh generated by `mesh_gen.py`.
-Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
-implemented as `forwarding_v1`. (Supersedes the original single-node
-`noc_2d_mesh_v1` design — see ADR-0019.)
-
-Grid properties:
-
- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
- HBM exclusion zone: center rows/columns are excluded where HBM physically
-  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
- Router positions are derived from physical PE corner placement and cube
-  geometry
-
-The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
-traversal within the mesh (distance_mm x ns_per_mm).
-
-### D2. XY routing algorithm
-
-The NOC uses deterministic XY routing:
-
-1. Horizontal segment: route from source X to destination X at source Y
-2. Vertical segment: route from destination X at source Y to destination Y
-
-Each directed segment is identified by a unique link key:
-
- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`
-
-Grid positions are snapped to the router grid, excluding the HBM zone.
-
-### D3. Contention model
-
-Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
-sharing a segment (same row or column band, same direction) contend for the
-resource. This models link-level serialization in a wormhole-routed mesh.
-
-With no contention, NOC traversal latency equals the Manhattan distance
-multiplied by `ns_per_mm`. Under contention, additional queueing delay
-is added by SimPy's resource scheduling.
-
-### D4. NOC attachment points
-
-The NOC connects to all major cube-level components:
-
-```text
-                    UCIe-N (conn x4)
-                         |
-           +---------+---+---+---------+
-           |         |       |         |
-PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
-PE0.cpu <--+         |       |         +--< PE2.cpu
-           |         |       |         |
-UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
-(conn x4)  |         | zone  |         |  (conn x4)
-           |  r2c0   |       |         |
-M_CPU <--->+         |       |         |
-           |  r3c0   |       |         |
-SRAM <---->+         |       |         |
-           |         |       |         |
-PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
-PE4.cpu <--+         |       |         +--< PE6.cpu
-           |         |       |         |
-           +---------+---+---+---------+
-                         |
-                    UCIe-S (conn x4)
-
-HBM attach: hbm_ctrl is also attached to each router that hosts a PE (ADR-0019 D1)
-(xbar_top/xbar_bot were removed by ADR-0019)
-```
-
-### D5. NOC edge bandwidths and distances
-
-| Connection | BW (GB/s) | Distance | Notes |
-| --- | --- | --- | --- |
-| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
-| NOC -> PE_CPU | - | 0.0 mm | Command path only |
-| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
-| NOC <-> M_CPU | - | 0.0 mm | Command path |
-| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
-| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
-
-Distance 0.0 mm for most connections reflects the distributed nature of
-the NOC; the actual traversal distance is computed internally via Manhattan
-distance within the router grid.
-
-### D6. UCIe decomposition and inter-cube traffic
-
-Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
-
- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
-
-This decomposition enables N=4 independent NOC-to-UCIe connections per port,
-each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
-
-Inter-cube traffic path:
-
-```text
-Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
-                    [UCIe link: 512 GB/s, 1.0mm seam distance]
-Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
-```
-
-UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
-full crossing incurs 16 ns (TX port + RX port).
-
-### D7. Data paths through the NOC
-
-**PE DMA to local HBM (same half):**
-
-```text
-PE_DMA -> r{x}c{y} -> hbm_ctrl  (local: 0 mesh hops, switching overhead only)
-```
-
-**PE DMA to remote PE's HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**PE DMA to remote cube HBM:**
-
-```text
-PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
-```
-
-**Kernel Launch command to PE:**
-
-```text
-[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
-```
-
-**Shared SRAM access:**
-
-```text
-PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
-```
-
-### D8. Mesh generation
-
-The router grid is generated by `mesh_gen.py` based on:
-
- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment
-
-The generator produces a `mesh_data` dictionary containing:
-
- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- HBM attachment per PE router (ADR-0019)
-
-## Consequences
-
- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`
-
-## Links
-
- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
@@ -0,0 +1,291 @@
+# ADR-0017: Cube NOC and HBM Connectivity
+
+## Status
+
+Accepted
+
+## Context
+
+The CUBE-level NOC is a 2D router mesh that carries every intra-cube
+request: PE-to-HBM data, PE-to-PE traffic, command paths
+(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
+
+The CUBE's HBM is exposed through per-PE controller endpoints attached
+to PE routers. This per-PE partitioning makes local-vs-remote HBM
+distinguishable by mesh distance: a PE's own HBM partition sits at its
+own router (switching overhead only); another PE's HBM partition is
+reachable by mesh hops to that PE's router.
+
+Two channel-mapping modes are supported in the design space:
+
+- **n:1 (default, implemented)**: each PE's HBM partition aggregates
+  N = `hbm_channels_per_pe` pseudo-channels into one endpoint.
+  Effective per-PE BW = N × per-channel BW.
+- **1:1 (future)**: each PE router decomposes into N per-channel
+  mini-routers; per-channel BW contention is modeled directly.
+
+In both modes the per-PE effective BW is identical; only the connectivity
+granularity differs.
+
+## Decision
+
+### D1. 2D router mesh
+
+Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
+
+- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
+- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment
+  count); larger PE counts scale the grid up.
+- HBM exclusion zone: center rows/columns are excluded where HBM die
+  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
+- Latency = Manhattan distance × `ns_per_mm`.
+
+### D2. XY routing algorithm
+
+Deterministic XY routing:
+
+1. Horizontal segment: route from source X to destination X at source Y.
+2. Vertical segment: at destination X, route from source Y to destination Y.
+
+Each directed segment carries a unique key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical:   `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
+
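A minimal sketch of how the D2 segment keys could be derived, not taken from the repository. The function name `xy_segment_keys`, the `(x, y)` coordinate convention, and the `"+"`/`"-"` direction encoding are assumptions for illustration; snapping around the HBM zone is omitted.

```python
from typing import List, Tuple

Coord = Tuple[int, int]                   # (x, y) router position; assumed convention
LinkKey = Tuple[str, int, int, int, str]  # (axis, band, lo, hi, direction)


def xy_segment_keys(src: Coord, dst: Coord) -> List[LinkKey]:
    """Return directed segment keys for an X-then-Y route from src to dst."""
    (sx, sy), (dx, dy) = src, dst
    keys: List[LinkKey] = []
    if sx != dx:
        # Horizontal leg runs along the source row, so y_band is the source Y.
        direction = "+" if dx > sx else "-"  # hypothetical direction encoding
        keys.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs along the destination column, so x_band is the destination X.
        direction = "+" if dy > sy else "-"
        keys.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return keys
```

For a route from `(0, 0)` to `(4, 1)` this yields one horizontal key on row 0 followed by one vertical key on column 4, which is the X-then-Y order listed above. The reverse route produces different keys because the direction field differs.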
+### D3. Per-segment contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for
+the resource, which models link-level serialization in a wormhole-routed
+mesh.
+
+With no contention, NOC traversal latency equals Manhattan distance ×
+`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
+delay.
+
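The queueing effect in D3 can be illustrated without SimPy. This sketch assumes first-come-first-served service on a single capacity-1 segment; `serialize_on_segment` and the `(arrival_ns, hold_ns)` request shape are hypothetical, and the hold times in the usage note are made-up numbers rather than values from the model.

```python
from typing import List, Tuple


def serialize_on_segment(
    requests: List[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Serialize (arrival_ns, hold_ns) requests on one capacity-1 segment.

    Returns (start_ns, finish_ns) per request in arrival order. A request
    starts at its arrival time, or when the previous holder releases the
    segment, whichever is later.
    """
    free_at = 0.0
    schedule: List[Tuple[float, float]] = []
    # sorted() is stable, so requests with equal arrival keep their input order.
    for arrival, hold in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, free_at)
        finish = start + hold
        schedule.append((start, finish))
        free_at = finish
    return schedule
```

Two requests arriving at 0 ns and 2 ns, each holding the segment for 5 ns, finish at 5 ns and 10 ns: the second one picks up 3 ns of queueing delay on top of its uncontended traversal time.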
+### D4. NOC attachment points (per-PE HBM partition)
+
+Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
+and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
+HBM (one pseudo-channel group; see D8).
+
+Other attachments:
+
+- M_CPU and shared SRAM each occupy a dedicated edge router.
+- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
+  along that edge (see D6).
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+```
+
+Per-PE HBM partitioning is the key invariant that makes local vs
+cross-PE HBM distinguishable by mesh distance (see D7).
+
+### D5. NOC edge bandwidths and distances
+
+| Connection                    | BW (GB/s)  | Distance      | Notes                                       |
+| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
+| PE_DMA → NOC                  | 256.0      | Physical (PE) | Matches local-HBM aggregate BW              |
+| NOC → PE_CPU                  | —          | 0.0 mm        | Command path only                           |
+| Router ↔ hbm_ctrl.pe{idx}     | 256.0      | 0.0 mm        | Per PE router; N × per-channel BW (see D8)  |
+| NOC ↔ M_CPU                   | —          | 0.0 mm        | Command path                                |
+| NOC ↔ SRAM                    | 128.0 × 4  | 0.0 mm        | 512 GB/s aggregate                          |
+| NOC ↔ UCIe conn               | 128.0      | 0.0 mm        | Per connection; 4 conn per port             |
+
+`0.0 mm` distances reflect the distributed nature of the NOC; actual
+traversal distance is computed via Manhattan distance within the router
+grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each of the 4 UCIe ports (N, S, E, W) decomposes into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
+
+This decomposition gives 4 independent NOC↔UCIe connections per port,
+each with 128 GB/s bandwidth (512 GB/s aggregate per port).
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
+                  [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
+```
+
+UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
+crossing incurs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NOC
+
+All intra-cube traffic uses the same router mesh — no separate fast
+paths.
+
+**Local HBM** (same PE's own partition; 0 mesh hops):
+
+```text
+PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}   (switching overhead only)
+```
+
+**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
+
+```text
+PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
+
+```text
+PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
+```
+
+Dijkstra computes the shortest path within the mesh.
+
+**Cross-cube HBM** (UCIe traversal):
+
+```text
+PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
+       → r{x'}c{y'} → hbm_ctrl.pe{idx'}
+```
+
+**Kernel launch command to PE**:
+
+```text
+[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
+```
+
+**Shared SRAM access**:
+
+```text
+PE_DMA → r{x}c{y} → (mesh) → SRAM
+```
+
+### D8. HBM channel mapping mode
+
+Channel mapping is configured at cube scope:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one       # one_to_one | n_to_one
+    hbm_pseudo_channels: 64          # total pseudo-channel count
+    hbm_channels_per_pe: 8           # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0         # per-channel bandwidth (GB/s)
+    hbm_slices_per_cube: 8           # number of per-PE partitions
+    hbm_total_gb_per_cube: 48
+```
+
+**n:1 mode (default, implemented).** Each PE's HBM partition is a single
+endpoint `hbm_ctrl.pe{idx}` that aggregates `hbm_channels_per_pe`
+pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
+`hbm_channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
+interleave; only aggregate per-PE BW is modeled. No separate aggregated
+router node exists; the per-PE router itself serves that role.
+
+**1:1 mode (future).** Each PE router decomposes into N channel
+mini-routers; per-channel routing carries a fully resolved PA plus a
+channel ID. A `ChannelSplitter` resolves one logical access into N
+per-channel physical requests, and each per-channel link models its own
+BW contention. Cross-PE channel access semantics are deferred to the
+implementation ADR.
+
+**BW math (defaults).**
+
+| Parameter                          | Value                      |
+| ---------------------------------- | -------------------------- |
+| pseudo channels per cube           | 64 (parameter)             |
+| PEs per cube                       | 8 (parameter)              |
+| channels per PE (N)                | 64 / 8 = 8                 |
+| per-channel BW                     | 32 GB/s (parameter)        |
+| per-PE local BW                    | N × 32 = 256 GB/s          |
+| cube total HBM BW                  | 64 × 32 = 2048 GB/s        |
+
+Both modes give the same per-PE effective BW; only the request shape and
+contention model differ.
+
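The BW table in D8 follows directly from the `memory_map` keys. The sketch below reproduces that arithmetic; the `HbmMemoryMap` class and its `validate` check are hypothetical helpers, and the check assumes `hbm_slices_per_cube` equals the number of PEs per cube, as the defaults imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmMemoryMap:
    """Mirror of the cube.memory_map keys shown in D8 (defaults from the ADR)."""
    hbm_pseudo_channels: int = 64
    hbm_channels_per_pe: int = 8
    hbm_channel_bw_gbs: float = 32.0
    hbm_slices_per_cube: int = 8
    hbm_total_gb_per_cube: int = 48

    def validate(self) -> None:
        # Assumed invariant: every pseudo-channel belongs to exactly one PE slice.
        expected = self.hbm_channels_per_pe * self.hbm_slices_per_cube
        if expected != self.hbm_pseudo_channels:
            raise ValueError(
                f"channels_per_pe x slices = {expected}, "
                f"but hbm_pseudo_channels = {self.hbm_pseudo_channels}"
            )

    @property
    def per_pe_bw_gbs(self) -> float:
        """Router <-> hbm_ctrl.pe{idx} link BW in n:1 mode."""
        return self.hbm_channels_per_pe * self.hbm_channel_bw_gbs

    @property
    def cube_bw_gbs(self) -> float:
        """Total HBM BW across all pseudo-channels in the cube."""
        return self.hbm_pseudo_channels * self.hbm_channel_bw_gbs
```

With the defaults, `per_pe_bw_gbs` is 256.0 and `cube_bw_gbs` is 2048.0, matching the table. A config such as 64 pseudo-channels with 4 channels per PE and 8 slices would fail `validate()` under the assumed invariant.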
+### D9. AddressResolver — per-PE HBM endpoint
+
+The address resolver decodes a PA's HBM offset to the owning PE's
+partition:
+
+```python
+# policy/routing/router.py
+hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
+
+if addr.kind == "hbm":
+    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
+    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+```
+
+The `pe_id` computation is intrinsic to the routing layer (not a
+topology-time concern). Any HBM PA falls within exactly one partition,
+yielding deterministic routing.
+
+External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
+same resolver path — there is no separate fast path.
+
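A self-contained version of the D9 decode, using the D8 defaults (48 GB per cube, 8 slices, so 6 GiB per partition). This is a sketch under those assumptions, not the code in `policy/routing/router.py`: the function name `resolve_hbm_endpoint`, its parameters, and the range check are hypothetical.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(
    sip: int,
    cube: int,
    hbm_offset: int,
    hbm_total_gb_per_cube: int = 48,
    hbm_slices_per_cube: int = 8,
) -> str:
    """Map an HBM offset within a cube to the owning per-PE controller node."""
    slice_bytes = hbm_total_gb_per_cube * GIB // hbm_slices_per_cube
    # Range check is an addition for the sketch; the ADR does not specify
    # how out-of-range offsets are handled.
    if not 0 <= hbm_offset < slice_bytes * hbm_slices_per_cube:
        raise ValueError(f"hbm_offset {hbm_offset:#x} is outside the cube's HBM")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
```

Offsets `0` through `6 * GIB - 1` resolve to `hbm_ctrl.pe0`, and offset `6 * GIB` is the first byte owned by `hbm_ctrl.pe1`, so every valid offset lands in exactly one partition.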
+### D10. Mesh generation parameters
+
+`mesh_gen.py` produces `cube_mesh.yaml` from:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
+- `cube.geometry`: cube physical dimensions and HBM zone.
+- `cube.ucie.n_connections`: determines router count for UCIe attachment.
+
+The output `mesh_data` dictionary contains:
+
+- Router grid with positions and HBM exclusion zones.
+- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
+  per PE).
+- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
+- M_CPU and SRAM router attachments.
+
+## Consequences
+
+- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
+  (mesh hops) are naturally distinguishable, satisfying SPEC R5
+  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
+  paths).
+- All cube-internal traffic routes through one mesh — single contention
+  model, single layout, single set of edge BWs.
+- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
+  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
+- 1:1 mode extension is structurally natural — split each PE router into
+  N channel routers.
+- Mesh generation is fully parameterized by `topology.yaml`; PE/cube
+  geometry changes propagate without code edits.
+
+## Links
+
+- ADR-0002 (Routing distance, ordering, no zero-latency paths)
+- ADR-0003 D3 (cube-level NOC definition — extended here)
+- ADR-0004 (Memory semantics, local HBM)
+- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
+- ADR-0014 D1 (PE_DMA egress via router mesh)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
+- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -1,305 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
-
-## Status
-
-Accepted
-
-## Context
-
-The CUBE-internal NOC must connect each PE to HBM. KernBench needs
-to evaluate two connectivity models:
-
- **1:1 mode** — PE_DMA connects to N separate per-channel routers,
-  each with its own link to hbm_ctrl. Models per-channel BW
-  contention precisely.
-  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode** — PE_DMA connects to a single aggregated router with
-  one link to hbm_ctrl. Channels are treated as interleaved; only
-  aggregate BW is modeled.
-
-Effective PE-local BW is identical under both modes
-(= N × per-channel BW); only the connectivity granularity differs.
-
---
-
-## Decision
-
-### D1. HBM Attaches to PE Routers
-
-Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
-and attach the HBM access point to the same router where the PE is attached.
-
- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved
-
-Node naming changes:
-
-| Current | After Change |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
-the builder generates an edge between that router and hbm_ctrl.
-
---
-
-### D2. Complete Removal of xbar, bridge, and Single NOC Node
-
-Remove all of the following nodes and related edges:
-
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node
-
-Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py`
-is created as a separate SimPy node in the topology graph,
-and adjacent routers are connected via XY mesh edges.
-
---
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
-is created as a **separate SimPy node** in the topology graph.
-
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
-
-Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
-Null routers (HBM exclusion zones) are skipped.
-
-#### 1:1 Mode Extension (To Be Implemented Later)
-
-In 1:1 mode, each router differentiates into N channel mini-routers.
-Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
-N GEMM engines per PE are also added at this point.
-
---
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
-
-Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router finds the shortest path in the mesh.
-
-Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.
-
---
-
-### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The existing router grid from cube_mesh.yaml serves that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected via XY mesh edges. PE's local HBM access goes
-directly from its own router (switching overhead only).
-
-#### n:1 Mode Full Data Paths
-
-**Local HBM (0 hops):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**Remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
---
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-UCIe connections maintain the existing structure,
-but both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
-
---
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice calculation is removed.
-In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
-
---
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total pseudo channel count
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-
---
-
-### D9. Bandwidth Numerical Consistency
-
-| Configuration | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-The effective BW per PE is identical in both modes:
-
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
-
---
-
-## Consequences
-
-### Positive
-
- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally by differentiating each router into per-channel mini-routers
-
-### Negative
-
- The number of SimPy nodes increases due to explicit router nodes (6×6 grid, up to 32 routers per cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model
-
---
-
-## Alternatives
-
-### A1. Retain Existing xbar + HBM Slices
-
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value
-
---
-
-## Test Requirements
-
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
---
-
-## Links
-
- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,305 +0,0 @@
-# ADR-0019: Per-Channel and Aggregated HBM Connectivity Model in the CUBE NOC
-
-## Status
-
-Accepted
-
-## Context
-
-The NOC inside a CUBE must connect each PE to HBM. KernBench must be able to
-compare and evaluate two connectivity models.
-
- **1:1 mode**: PE_DMA connects to each of the N per-channel routers over a separate link,
-  and each router has its own channel link to hbm_ctrl.
-  Per-channel BW contention is modeled accurately.
-  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode**: PE_DMA connects to hbm_ctrl over a single link through one aggregated router.
-  The channels are assumed to be interleaved, and
-  only the aggregate BW is modeled.
-
-The effective BW per PE is identical in both modes (= N × per-channel BW);
-only the connectivity granularity differs.
-
---
-
-## Decision
-
-### D1. HBM Is Attached to the PE Router
-
-The current `hbm_ctrl.slice{0-7}` (8 nodes) is consolidated into a **single `hbm_ctrl` node**,
-and an HBM access point is attached to the same router as each PE.
-
- n:1 mode: a PE's local HBM access goes directly through its own router (switching overhead only, 0 hops)
- Remote PE HBM access: reaches the target PE's router via mesh hops
- The read/write resource model inside the HBM controller is retained
-
-Node naming change:
-
-| Current | After Change |
-| ---- | ------- |
-| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |
-
-`mesh_gen.py` adds `pe{idx}.hbm` to the PE attachments, so that
-the builder creates an edge between that router and hbm_ctrl.
-
---
-
-### D2. Complete Removal of xbar, bridge, and the Single NOC Node
-
-The following existing nodes and their associated edges are all removed:
-
- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (the single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, and other edges that reference the single noc node
-
-Their role is taken over by an **explicit router mesh based on cube_mesh.yaml**.
-Each router (r0c0, r0c1, ...) of the 6×6 router grid generated by the existing `mesh_gen.py`
-is created as a separate SimPy node in the topology graph,
-and adjacent routers are connected by XY mesh edges.
-
---
-
-### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)
-
-#### Router Nodes Based on cube_mesh.yaml
-
-Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py`
-is created as a **separate SimPy node** in the topology graph.
-
- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml
-
-Based on the attach information in cube_mesh.yaml, components are connected to each router:
- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge
-
-Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
-Null routers (HBM exclusion zones) are skipped.
-
-#### 1:1 Mode Extension (To Be Implemented Later)
-
-In 1:1 mode, each router differentiates into N channel mini-routers.
-Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
-N GEMM engines per PE are also added at this point.
-
---
-
-### D4. Cross-PE HBM Access (n:1 Mode)
-
-In n:1 mode, when a PE accesses another PE's local HBM,
-it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.
-
-Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:
-
-```text
-PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
-```
-
-The Dijkstra router finds the shortest path in the mesh.
-
-Cross-PE channel access in 1:1 mode will be defined as part of the 1:1 extension described in D3.
-
---
-
-### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh
-
-In n:1 mode, no separate "aggregated router" is created.
-The existing router grid from cube_mesh.yaml serves that role.
-
-#### Connection Structure
-
-PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:
-
-```text
-sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
-```
-
-Routers are connected via XY mesh edges. A PE's local HBM access goes
-directly through its own router (switching overhead only).
-
-#### n:1 Mode Full Data Paths
-
-**local HBM (0 hop):**
-```text
-PE0.pe_dma → r0c0 → hbm_ctrl  (switching overhead only)
-```
-
-**remote HBM (mesh hops):**
-```text
-PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
-```
-
-**M_CPU DMA:**
-```text
-M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
-```
-
---
-
-### D6. All Traffic Is Unified onto the Same Router Mesh
-
- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:
-
-```text
-PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
-  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
-```
-
-UCIe connections maintain the existing structure,
-but both endpoints become mesh routers instead of xbars.
-
-The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.
-
---
-
-### D7. AddressResolver Changes
-
-Current `AddressResolver.resolve()`:
-
-```python
-# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
-return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
-```
-
-After change:
-
-```python
-# Changed: HBM → single endpoint
-return f"sip{s}.cube{c}.hbm_ctrl"
-```
-
-The pe_slice calculation is removed.
-In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.
-
-resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.
-
---
-
-### D8. topology.yaml Configuration Changes
-
-#### Added Settings
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one          # one_to_one | n_to_one
-    hbm_pseudo_channels: 64             # total pseudo channel count
-    hbm_channels_per_pe: 8              # local channels per PE (= pseudo_channels / pes_per_cube)
-    hbm_channel_bw_gbs: 32.0            # per-channel bandwidth (GB/s)
-    hbm_total_gb_per_cube: 48           # retained
-```
-
-#### Removed Settings
-
-```yaml
-# To be removed
-links:
-  xbar_to_hbm_bw_gbs: 256.0            # → replaced by channel_bw_gbs × channels_per_pe
-  xbar_to_hbm_mm: 2.5                  # → replaced by ch_router_to_hbm_mm
-  xbar_to_bridge_bw_gbs: 128.0         # → removed (no bridge)
-  xbar_to_bridge_mm: 3.0               # → removed
-  noc_to_xbar_bw_gbs: ...              # → removed
-  noc_to_xbar_mm: ...                  # → removed
-```
-
-#### Added Link Settings
-
-```yaml
-links:
-  router_link_bw_gbs: 256.0            # XY mesh link BW between routers
-  router_overhead_ns: 2.0              # router switching overhead
-  pe_to_router_bw_gbs: 256.0           # PE_DMA ↔ router
-  hbm_to_router_bw_gbs: 256.0          # HBM ↔ router (= N × channel_bw)
-```
-```
-
---
-
-### D9. Bandwidth Numerical Consistency
-
-| Configuration | Value |
-| ---- | --- |
-| pseudo channels per cube | 64 (parameter) |
-| PEs per cube | 8 (parameter) |
-| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
-| per-channel BW | 32 GB/s (parameter) |
-| per-PE local BW | N × 32 = 256 GB/s |
-| cube total HBM BW | 64 × 32 = 2048 GB/s |
-
-The effective BW per PE is identical in both modes:
-
- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s
-
---
-
-## Consequences
-
-### Positive
-
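- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which simplifies the model
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- The 1:1 mode extension follows naturally by differentiating each router into per-channel mini-routers
-
-### Negative
-
- The number of SimPy nodes increases due to explicit router nodes (6×6 grid, up to 32 routers per cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model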
-
---
-
-## Alternatives
-
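-### A1. Retain Existing xbar + HBM Slices
-
- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes
-
-### A2. Always Generate Per-Channel Links and Aggregate Only in n:1
-
- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count
-
-### A3. Gradual Transition (Retain xbar + Add NOC Path)
-
- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value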
-
---
-
-## Test Requirements
-
- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
-  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
-  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)
-
---
-
-## Links
-
- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,432 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
-
-## Status
-
-Accepted
-
-## Context
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
---
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for
-initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** for each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
-```
-
-The stage sequence varies depending on the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward it directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage
-queue is full, only the feeder process blocks — the scheduler worker's inbox processing
-itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|-------------------|--------|-------------------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to
-PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
-is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous Pipeline — Natural Overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command; instead,
-the scheduler's internal **single feeder process** performs the feed in FIFO order.
-Therefore, the scheduler can continue to receive the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
- When the queue is full, put() naturally blocks (backpressure)
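- While DMA is processing a later tile, downstream stages (FETCH, GEMM) can already work on tiles whose DMA has completed
- When a second CompositeCmd arrives, it is enqueued into `_pending_feeds` immediately; its tiles enter the DMA queue once the feeder has finished injecting the previous command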
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ pipeline overlap within the same command
-```
-
-Here, the overlap does not come from tile feed interleaving across different commands,
-but occurs naturally as tiles from earlier commands progress to downstream stages
-while the feeder continues injecting subsequent tiles.
-
-For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
-injected into the first stage queue. However, while cmd1.tile0 has already progressed
-to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
-**pipeline overlap within the same command occurs naturally**.
-
-#### Component Chaining Pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
-This is separated into a **PE_FETCH_STORE component**.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)
-```
-
-Advantages:
- GEMM/MATH perform **pure compute only** — no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone
-
-### D6. Simplification of Each Compute Component
-
-GEMM/MATH perform compute only with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-By replacing only the timing model, one can freely switch between cycle-accurate
-and analytical models. Since the chaining logic resides in the base class,
-each component only implements its pure stage logic.
-
-### D7. Topology Changes
-
-Add PE_FETCH_STORE to the PE template:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER → PE_DMA (initial dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (initial dispatch)
-PE_SCHEDULER → PE_GEMM (initial dispatch)
-PE_SCHEDULER → PE_MATH (initial dispatch)
-PE_DMA → PE_FETCH_STORE (chaining)
-PE_FETCH_STORE → PE_GEMM (chaining)
-PE_FETCH_STORE → PE_MATH (chaining)
-PE_GEMM → PE_FETCH_STORE (store chaining)
-PE_MATH → PE_FETCH_STORE (store chaining)
-PE_FETCH_STORE → PE_DMA (writeback chaining)
-PE_FETCH_STORE → PE_TCM (BW request)
-```
-
-Topology edges cover both **control/dispatch visibility and runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, while
-inter-component edges are runtime chaining paths driven by token self-routing.
-
-### D9. TileToken Message Definition
-
-A message used for passing tile work between components.
-The token carries the plan and stage index, enabling self-routing.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion tracking
-    plan: TilePlan                   # full stage sequence for this tile (immutable)
-    stage_idx: int                   # current stage index in plan.stages
-    params: dict                     # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log recording target (ADR-0020)
-```
-
-A TileToken is **owned by exactly one component at a time** and
-is never referenced by multiple components simultaneously (single-owner).
-
-Token lifecycle:
-1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
-2. The component executes _process(), increments stage_idx, and puts it to the next component
-3. The last stage component calls pipeline_ctx.complete_tile()
-4. When all tiles are complete, PipelineContext calls done_event.succeed()
-
-Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
-
---
-
-## Non-goals
-
- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
-  (PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
-  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
-  are future work.
-
-## Open Questions
-
- **Register File capacity model**: whether to model capacity limits when the fetch unit
-  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
-  the number of tiles that can be held simultaneously is determined by tile size.
-  When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
-  commands. Therefore, overlap arises not from pre-injection across commands, but
-  naturally from pipeline progression of tiles within the same command.
-  If additional prefetch is needed, it should be considered at the level of tile ordering
-  within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
-  The intended direction is to merge/coalesce requests inside PE_DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
-  default/sole execution model. If a sync mode is needed for debug or validation
-  purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
-  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
-
---
-
-## Consequences
-
-### Positive
-
- Each block is an independent component — individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component — plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling
-
-### Negative
-
- Increased number of PE internal components (5 → 6) — more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before
-
@@ -1,426 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring - Component Separation + Scheduler-Based Routing
-
-## Status
-
-Accepted
-
-## Context
-
-### Actual Hardware Structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
---
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing - Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-              ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan 구조
-
-Scheduler가 CompositeCmd를 받으면 **tile 단위 실행 plan**을 생성한다.
-Plan은 각 tile의 **stage sequence**를 정의한다:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str       # topology 노드 ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict         # stage별 파라미터 (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # 순서대로 실행할 stage 목록 (immutable)
-```
-
-Plan에 따라 stage sequence가 달라진다:
-
-```python
-# 일반 GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# TCM 데이터로 바로 GEMM (DMA read 생략):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (중간 K-tile, writeback 생략):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**컴포넌트는 다음 컴포넌트를 하드코딩하지 않는다.**
-Token의 plan에서 다음 stage를 읽고, out_port로 직접 전달한다.
-네트워크 패킷이 라우팅 헤더를 가지고 있는 것과 같은 패턴이다.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # 모든 tile 완료 시 succeed
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion은 exactly-once contract**: 각 tile의 마지막 stage는 정확히 한 번만
-`complete_tile()`을 호출해야 한다. 중복 호출은 버그이며, `done_event`는
-단 한 번만 succeed되어야 한다 (SimPy Event 제약).
-
-#### Scheduler 역할 (축소됨)
-
-Scheduler는 CompositeCmd를 받으면 plan과 PipelineContext를 생성한 뒤,
-이를 scheduler 내부의 `_pending_feeds` FIFO에 enqueue하고 즉시 리턴한다.
-
-실제 tile 투입은 **단일 feeder process** (`_feed_loop`)가 담당한다.
-이 feeder는 `_pending_feeds`를 FIFO 순서로 소비하며,
-**composite command 간 tile feed interleaving은 허용하지 않는다.**
-즉, 한 command의 모든 tile이 첫 stage queue에 투입된 후에만
-다음 command의 feed가 시작된다.
-
-Scheduler당 `_feed_loop`는 **정확히 하나만** 존재하며,
-composite command의 tile feed는 이 단일 process를 통해서만 수행된다.
-Command issue order는 **PE_SCHEDULER가 PeInternalTxn을 수신한 순서**를 의미한다.
-
-이 구조는 command issue order를 유지하면서도, 첫 stage queue full 시
-feeder process만 block되고 scheduler worker의 inbox 처리 자체는 멈추지 않도록 한다.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store   # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # feeder queue에 등록만 하고 즉시 리턴
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """단일 feeder process: composite command를 FIFO 순서로 feed.
-
-        Composite command 간 tile feed interleaving은 허용하지 않는다.
-        한 command의 모든 tile이 첫 stage queue에 투입된 후에만
-        다음 command의 feed가 시작된다.
-
-        첫 stage queue full 시 이 feeder만 block되며,
-        scheduler worker의 inbox 처리는 멈추지 않는다.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → full이면 feeder만 block
-```
-
-본 ADR에서 scheduler는 여러 composite command를 수용할 수 있으나,
-tile submission order는 command 단위 FIFO를 따른다.
-Command 내부에서는 tile-level pipeline overlap을 허용하지만,
-command 간 tile feed interleaving은 허용하지 않는다.
-
-### D3. 데이터 전달 vs 완료 신호 — HW 모델링 기준
-
-| 통신 유형 | 방식 | HW 대응 |
-|----------|------|---------|
-| tile token (작업 지시) | message via out_port | command queue에 enqueue |
-| stage 완료 → 다음 stage | 컴포넌트가 직접 out_port.put | done-triggered local enqueue |
-| pipeline 완료 → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: out_port.put() 사용. SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: 본 ADR 범위에서는 intra-PE stage trigger에
-explicit latency model을 두지 않는다. 컴포넌트 간 체이닝은 PE 내부 wire에 해당하며,
-scheduler 왕복이 없으므로 artificial hop cost가 발생하지 않는다.
-
-**Pipeline 완료**: 마지막 stage의 컴포넌트가 `pipeline_ctx.complete_tile()` 호출.
-모든 tile 완료 시 PipelineContext가 done_event.succeed().
-
-### D4. 비동기 파이프라인 — 자연스러운 overlap
-
-Scheduler는 CompositeCmd를 **비동기로** 처리한다.
-다만 tile feed는 command마다 독립 process를 만들지 않고,
-scheduler 내부의 **단일 feeder process**가 FIFO 순서로 수행한다.
-따라서 scheduler는 다음 command를 계속 받을 수 있지만,
-첫-stage tile 투입 순서는 command 단위로 보장된다.
-
-**SimPy Store capacity = HW queue depth**이므로:
- queue가 차면 put()이 자연스럽게 block (backpressure)
- DMA가 tile 0을 처리하는 동안 GEMM은 이미 완료된 tile의 fetch를 시작
- 두 번째 CompositeCmd가 들어오면 DMA queue에 바로 이어서 투입
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd1 feed 완료 후 cmd2 시작
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:    [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                   [cmd1:t0][cmd1:t1]...
-                              ↑ 같은 cmd 내부에서 pipeline overlap
-```
-
-이때 overlap은 서로 다른 command의 tile feed interleaving에서 오는 것이 아니라,
-먼저 투입된 command의 tile들이 downstream stage로 진행되는 동안 feeder가
-다음 tile들을 계속 투입하면서 자연스럽게 발생한다.
-
-예를 들어 cmd1의 모든 tile이 첫 stage queue에 투입되기 전에는
-cmd2의 tile feed는 시작되지 않는다. 그러나 cmd1.tile0이 이미 GEMM으로
-진행한 상태에서 cmd1.tile1, cmd1.tile2가 DMA/FETCH에 남아 있을 수 있으므로,
-**같은 command 내부에서는 pipeline overlap이 자연스럽게 발생**한다.
-
-#### 컴포넌트 체이닝 패턴
-
-모든 컴포넌트가 동일한 패턴을 따른다:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # 자기 stage 처리
-        yield from self._process(env, token)
-
-        # 다음 stage로 체이닝 (plan에서 읽음)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # 마지막 stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — TCM ↔ Register File 전담
-
-기존에 GemmBlock과 MathBlock이 각각 TCM read/write를 구현했으나,
-이를 **PE_FETCH_STORE 컴포넌트**로 분리한다.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # 체이닝은 base class가 처리 (D4 패턴)
-```
-
-장점:
- GEMM/MATH는 **순수 compute만** — TCM 접근 로직 없음
- fetch/store BW 경합이 자연스럽게 모델링됨 (PE_TCM의 resource로 serialization)
- prefetch 전략 등 fetch unit 단독 교체로 실험 가능
-
-### D6. 각 Compute 컴포넌트의 단순화
-
-GEMM/MATH는 register 데이터가 이미 준비된 상태에서 compute만 수행.
-**체이닝은 공통 패턴(D4)을 따르므로, _process()만 구현하면 된다:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-타이밍 모델만 교체하면 cycle-accurate든 analytical든 자유롭게 변경 가능.
-체이닝 로직은 base class에 있으므로 각 컴포넌트는 순수 stage 로직만 구현.
-
-### D7. Topology 변경
-
-PE template에 PE_FETCH_STORE 추가:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
-  links:
-    # 기존 links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE 내부 edge 연결:
-```
-PE_SCHEDULER → PE_DMA (초기 dispatch)
-PE_SCHEDULER → PE_FETCH_STORE (초기 dispatch)
-PE_SCHEDULER → PE_GEMM (초기 dispatch)
-PE_SCHEDULER → PE_MATH (초기 dispatch)
-PE_DMA → PE_FETCH_STORE (체이닝)
-PE_FETCH_STORE → PE_GEMM (체이닝)
-PE_FETCH_STORE → PE_MATH (체이닝)
-PE_GEMM → PE_FETCH_STORE (store 체이닝)
-PE_MATH → PE_FETCH_STORE (store 체이닝)
-PE_FETCH_STORE → PE_DMA (writeback 체이닝)
-PE_FETCH_STORE → PE_TCM (BW 요청)
-```
-
-Topology edge는 **control/dispatch visibility + runtime chaining** 양쪽을 포함한다.
-Scheduler → 하위 컴포넌트 edge는 초기 dispatch 경로이며,
-컴포넌트 간 edge는 token self-routing에 의한 runtime chaining 경로이다.
-
-### D9. TileToken 메시지 정의
-
-컴포넌트 간 tile 작업 전달에 사용하는 메시지.
-Token이 plan과 stage index를 가지고 있어 self-routing이 가능하다.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext    # completion 추적
-    plan: TilePlan                   # 이 tile의 전체 stage sequence (immutable)
-    stage_idx: int                   # 현재 stage index in plan.stages
-    params: dict                     # current stage 파라미터 캐시 (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True             # op_log 기록 대상 (ADR-0020)
-```
-
-TileToken은 한 시점에 **하나의 컴포넌트에 의해서만 소유**되며,
-동시에 여러 컴포넌트에 의해 참조되지 않는다 (single-owner).
-
-Token lifecycle:
-1. Scheduler가 stage_idx=0으로 생성, 첫 stage 컴포넌트에 put
-2. 컴포넌트가 _process() 실행 후 stage_idx 증가, 다음 컴포넌트에 put
-3. 마지막 stage 컴포넌트가 pipeline_ctx.complete_tile() 호출
-4. 모든 tile 완료 시 PipelineContext가 done_event.succeed()
-
-기존 PeInternalTxn과의 관계:
- PeInternalTxn: PE_CPU → PE_SCHEDULER 간 command 전달 (기존 유지)
- TileToken: PE_SCHEDULER → 하위 컴포넌트 간 tile 단위 작업 전달 (신규, self-routing)
-
---
-
-## Non-goals
-
- **PE_CPU 변경**: PE_CPU → PE_SCHEDULER 인터페이스는 변경하지 않음
-  (PeInternalTxn 기반, ADR-0014 유지)
- **다중 pipeline 간 자원 경합 모델**: 현재 범위에서는 단일 pipeline의
-  정확한 모델링에 집중. 다중 pipeline 간 TCM bank conflict 등은 future work.
-
-## Open Questions
-
- **Register File 용량 모델**: fetch unit이 register에 로드할 때 용량 제한을
-  모델링할지. 용량은 바이트 단위(register_file_bytes)로 표현하며,
-  동시에 보유 가능한 tile 수는 tile 크기에 따라 결정된다.
-  용량 초과 시 fetch가 stall되어 자연스러운 backpressure가 발생한다.
- **Prefetch 전략**: 본 ADR에서는 composite command 간 tile feed interleaving을
-  허용하지 않는다. 따라서 overlap은 command 간 선행 투입이 아니라,
-  같은 command 내부 tile들의 pipeline progression에서 자연스럽게 발생한다.
-  추가적인 prefetch가 필요하면 command 간 투입이 아니라, 같은 command 내부에서의
-  tile ordering 또는 fetch/store unit policy 차원에서 검토한다.
- **PE_DMA coalescing**: tile 단위 DMA는 fragmentation 발생 가능.
-  DMA 내부에서 merge/coalesce하되 scheduler는 관여하지 않는 방향.
- **동기 실행 모드**: 본 ADR에서는 비동기 pipeline을 기본/유일 execution model로
-  채택한다. 디버그 또는 validation 목적의 sync mode가 필요하면 future ADR에서 검토.
- **다중 pipeline 간 TCM bank conflict**: 현재 단일 pipeline 기준.
-  다중 pipeline이 동시에 TCM에 접근할 때의 bank conflict 모델은 future work.
-
---
-
-## Consequences
-
-### 긍정적
-
- 각 블록이 독립 컴포넌트 — 개별 교체 가능 (ADR-0015 준수)
- topology에서 PE 내부 구조 가시화
- 컴포넌트가 다음 컴포넌트를 모름 — plan 기반 라우팅으로 유연성 확보
- DMA와 compute의 자연스러운 파이프라인 overlap (SimPy Store backpressure)
- HW 모델링 정확도 향상 (done signal = Event, data transfer = message)
- fetch/store 분리로 TCM BW 경합 정확히 모델링
-
-### 부정적
-
- PE 내부 컴포넌트 수 증가 (5 → 6) — topology 노드/edge 증가
- 컴포넌트 분리로 인해 intra-PE token forwarding이 이전 대비 더 명시적으로 드러남
-
@@ -1,10 +1,10 @@
 # ADR-0022: 2D Grid program_id Semantics

- **Status**: Accepted
- **Date**: 2026-04-09
- **Context**: Triton-style kernel addressing for multi-cube PE topology
+## Status

-## Problem
+Accepted
+
+## Context

 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
@@ -709,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel.

 ### D13. Test strategy

-Following the ADR-0021 D8 pattern.
+Test plan:

 #### T1. Unit tests (component-level)

@@ -801,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs.
 ### D15. Algorithm-author cheat sheet

 Full step-by-step lives in
-[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:

 | Things you touch | Things you don't |
@@ -969,7 +969,7 @@ tail 갱신은 D9 fast path SimPy Store 채널로 처리된다.

 ### D13. 테스트 전략

-ADR-0021의 D8 패턴을 따라 단위/통합/regression 테스트를 명시한다.
+단위/통합/regression 테스트를 명시한다.

 #### T1. 단위 테스트 (component-level)

@@ -1102,7 +1102,7 @@ F5. **Slot full + 무한 backpressure**:
 ### D15. 알고리즘 작성자 가이드 (요약)

 본 섹션은 알고리즘 작성자가 한 화면으로 시작점을 잡을 수 있도록 한다.
-자세한 step-by-step 가이드는 [docs/ccl-author-guide.md](../ccl-author-guide.md) 참조.
+자세한 step-by-step 가이드는 [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md) 참조.

 #### 만지는 것 / 만지지 않는 것

@@ -1175,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 2. **send/recv 짝 맞지 않음** — peer 측 recv 없으면 hang (slot full backpressure)
 3. **dtype/shape 불일치** — 첫 구현은 검증 안 함, 작성자 책임

-자세한 step-by-step과 hello-world 예제는 `docs/ccl-author-guide.md` 참조.
+자세한 step-by-step과 hello-world 예제는 `docs/onboarding/ccl-author-guide.md` 참조.
+
+---
+
+## HW Realization Notes (Informative)
+
+**Status of this section**: Forward-looking. Describes how the simulator
+contract (D1–D15) would map to silicon. Not currently implemented;
+subject to revision before tapeout. The simulator implements the
+contract via Python/SimPy equivalents in
+[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
+[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).
+
+### D16. Proposed HW Block Diagram and End-to-End Dataflow
+
+![PE Baseline Architecture](../diagrams/pe_baseline.png)
+
+> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.
+
+![PE Proposed Architecture](../diagrams/pe_proposed.png)
+
+> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.
+
+**Key changes, Baseline → Proposed**:
+
+- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
+- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
+- A **reserved IPCQ Slot Region** is made explicit inside the TCM
+- The Credit Injector / Receiver connect directly to the NoC through the Fabric Port
+
+#### End-to-End Sequence (HW view)
+
+```mermaid
+sequenceDiagram
+    participant CPU_A as PE_A: PE_CPU
+    participant IPCQ_A as PE_A: IPCQ Ctrl
+    participant DMA_A as PE_A: DMA
+    participant NOC as NoC Fabric
+    participant DMA_B as PE_B: DMA
+    participant IPCQ_B as PE_B: IPCQ Ctrl
+    participant TCM_B as PE_B: TCM
+    participant CPU_B as PE_B: PE_CPU
+
+    Note over CPU_A: tl.send(dir="E", src=0x1000)
+
+    CPU_A->>IPCQ_A: MMIO: send request
+    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
+    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
+    Note over IPCQ_A: my_head++
+    IPCQ_A-->>CPU_A: send returns (fire-and-forget)
+
+    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
+    DMA_A->>NOC: IPCQ data flit(s)
+
+    Note over NOC: hop latency + BW drain
+
+    NOC->>DMA_B: IPCQ data flit(s)
+    Note over DMA_B: Terminal BW drain<br/>Slot write latency
+
+    rect rgb(255, 240, 220)
+        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
+        DMA_B->>TCM_B: write data → slot address
+        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
+    end
+
+    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
+    IPCQ_B-->>CPU_B: recv_wake signal
+
+    Note over CPU_B: tl.recv(dir="W") wakes up
+    CPU_B->>IPCQ_B: recv request
+    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
+    IPCQ_B-->>CPU_B: return slot_addr
+    CPU_B->>TCM_B: read data from slot
+    Note over IPCQ_B: my_tail++
+
+    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
+    Note over NOC: credit traversal (NoC latency)
+    NOC->>IPCQ_A: Credit arrival
+
+    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
+```
+
+### D17. IPCQ Controller HW Module (new)
+
+A hardware control block that sits between the PE_CPU and the DMA Engine. It corresponds to
+the simulator's `PeIpcqComponent`.
+
+#### QPair Register File
+
+Holds per-direction queue-pair state in flip-flops. The PE_CPU can read and write it via
+MMIO (CSR), and software populates it at init time.
+
+```
+Per-direction registers (each 64-bit):
+  my_head          — sender write position (monotonic)
+  my_tail          — receiver read position (monotonic)
+  peer_head_cache  — last known peer head (updated by Meta Extractor)
+  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
+  rx_base_pa       — this PE's rx buffer base physical address
+  peer_rx_base_pa  — peer's rx buffer base physical address
+  n_slots          — ring depth (must be a power of 2; see D21)
+  slot_size        — bytes per slot
+  peer_credit_tgt  — credit receive address on the peer PE
+
+Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
+Total: 8 dirs × 9 regs × 8B = 576B flip-flops
+```
+
+#### Slot Address Generator (combinational)
+
+```
+Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
+Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
+
+Implementation:
+  n_slots power-of-2 → pointer & (n_slots - 1)   (AND mask, 1 gate)
+  slot_size power-of-2 → barrel shift             (1 cycle)
+  64-bit add → ripple/kogge-stone adder           (1 cycle)
+
+Latency: 1-2 cycles combinational
+```
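The power-of-2 fast path above can be checked against the plain formula with a small model. This is a minimal sketch, not the RTL: the function names are hypothetical, and it assumes both `n_slots` and `slot_size` are powers of 2 (the spec only requires this for `n_slots` and recommends it for `slot_size`).

```python
def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Slot address via AND mask + shift; valid only for power-of-2 sizes."""
    for name, value in (("n_slots", n_slots), ("slot_size", slot_size)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of 2, got {value}")
    index = pointer & (n_slots - 1)        # equals pointer % n_slots
    shift = slot_size.bit_length() - 1     # log2(slot_size)
    return base_pa + (index << shift)      # equals base_pa + index * slot_size


def slot_addr_reference(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """The formula exactly as written in the spec, using % and *."""
    return base_pa + (pointer % n_slots) * slot_size
```

Because the pointers are monotonic, the mask is what wraps them back into the ring; the two functions agree for every pointer value whenever the power-of-2 constraint holds.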
+
+#### Backpressure Comparator (combinational)
+
+```
+full = (my_head - peer_tail_cache) >= n_slots
+
+Implementation: 64-bit subtract + unsigned compare
+Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
+Latency: 1 cycle
+```
+
+#### Meta Extractor (inbound datapath sideband)
+
+Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the header of
+each arriving IPCQ flit and updates the queue-pair state.
+
+```
+Trigger: DMA inbound write completion (same cycle)
+Extract: {sender_seq, dst_addr} from flit header
+
+Direction matching (ADR-0025 D2):
+  for each dir:
+    match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
+  8× parallel range comparators + priority encoder
+
+Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
+Output: recv_wake signal → PE_CPU interrupt/flag
+Latency: 1 cycle (pipelined with the DMA write, so I6 atomicity holds by construction)
+```
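The range match and monotonic update described above can be modeled in a few lines. This is a sketch under assumptions: `QPair` and `on_ipcq_flit` are hypothetical names, only the fields needed for matching are modeled, and dict iteration order stands in for the priority encoder.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QPair:
    rx_base_pa: int            # base of this PE's rx ring for one direction
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0   # last known peer head (monotonic)


def on_ipcq_flit(qpairs: dict[str, QPair], sender_seq: int, dst_addr: int) -> str | None:
    """Return the matched direction, or None if dst_addr hits no rx ring."""
    for direction, qp in qpairs.items():
        end = qp.rx_base_pa + qp.n_slots * qp.slot_size
        if qp.rx_base_pa <= dst_addr < end:
            # max() keeps the cache monotonic if flits arrive out of order.
            qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
            return direction
    return None
```

A flit whose `sender_seq` is older than the cached head still matches its direction but leaves `peer_head_cache` untouched, which is the reason for the `max`.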
+
+#### Credit Injector (outbound)
+
+```
+Trigger: recv completion (after my_tail is incremented)
+Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
+
+Packet: {consumer_seq = my_tail, dst_rx_base_pa = rx_base_pa[dir]}
+Latency: 1 cycle to generate, then NoC traversal
+```
+
+#### Credit Receiver (inbound sideband)
+
+```
+Trigger: 16B credit packet arrival (from NoC)
+Extract: {consumer_seq, dst_rx_base_pa}
+
+Direction matching (ADR-0025 D3):
+  for each dir:
+    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
+
+Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
+Output: send_wake signal → deassert backpressure stall
+Latency: 1 cycle
+```
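The interaction between the Backpressure Comparator and the Credit Receiver can be illustrated with a sender-side model of one direction. This is a sketch only: `SenderQueue` and its methods are hypothetical names, and NoC latency is not modeled.

```python
class SenderQueue:
    """Sender-side state for one direction (hypothetical illustration)."""

    def __init__(self, n_slots: int, peer_rx_base_pa: int) -> None:
        self.n_slots = n_slots
        self.peer_rx_base_pa = peer_rx_base_pa
        self.my_head = 0            # monotonic write position
        self.peer_tail_cache = 0    # last consumer_seq seen in a credit

    def is_full(self) -> bool:
        # Backpressure Comparator: full = (my_head - peer_tail_cache) >= n_slots
        return (self.my_head - self.peer_tail_cache) >= self.n_slots

    def try_send(self) -> bool:
        """Claim one slot; returns False when the send would have to stall."""
        if self.is_full():
            return False
        self.my_head += 1
        return True

    def on_credit(self, consumer_seq: int, dst_rx_base_pa: int) -> bool:
        """Credit Receiver: equality match, then monotonic tail update."""
        if dst_rx_base_pa != self.peer_rx_base_pa:
            return False
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)
        return True
```

With `n_slots = 2`, two sends succeed, the third stalls, and a credit carrying `consumer_seq = 1` frees exactly one slot.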
+
+### D18. DMA Engine vc_comm IPCQ-aware Mode
+
+An IPCQ flit handling mode is added to the existing vc_comm channel (D8).
+
+**Outbound**:
+
+1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
+2. Read src_addr from the TCM → snapshot into the DMA read buffer (standard DMA behavior)
+3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
+4. Inject into the NoC fabric port
+5. Fire-and-forget (does not wait for completion)
+
+**Inbound**:
+
+1. Receive the IPCQ flit from the NoC
+2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
+3. Slot write latency charge (backing memory tier)
+4. **ATOMIC** (same pipeline stage, no stall insertion):
+   - TCM write: data → slot address
+   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
+5. Done
+
+**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger
+occur in the same pipeline stage, so no separate synchronization is needed. The simulator's
+"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" rule (D9, I6) is therefore
+satisfied by construction.
+
+#### Data Snapshot Semantics
+
+Data latched in the DMA read buffer is unaffected by later modifications to the src memory.
+This is standard DMA read-then-write behavior, so no additional HW is needed.
+
+#### Credit Virtual Channel (optional)
+
+- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
+- **Option B**: add a third, dedicated credit VC (strict priority > data).
+
+Option B is better for deadlock prevention, but the BW impact of a 16B credit is negligible,
+so Option A is also sufficient.
+
+### D19. Fabric Flit Format Extension
+
+```
+Normal data flit (e.g. 512-bit):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│ [479:0]   payload (480b = 60B)           │
+└──────────────────────────────────────────┘
+
+IPCQ data flit (metadata in the first flit only):
+┌──────────────────────────────────────────┐
+│ [511:480] routing header (32b)           │
+│   [511]    ipcq_flag (1b)                │  ← distinguishes IPCQ from normal DMA
+│   [510:509] vc_id (2b)                   │
+│   [508:480] route + hop count            │
+│ [479:416] ipcq_metadata (64b)            │  ← piggyback
+│   [479:448] sender_seq (32b)             │
+│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction matching
+│ [415:0]   payload (416b = 52B)           │
+└──────────────────────────────────────────┘
+Subsequent flits: full 60B payload (no metadata)
+
+Credit-only flit (128-bit, header-only):
+┌──────────────────────────────────────────┐
+│ [127:96]  routing header (32b)           │
+│   [127]   credit_flag (1b)               │
+│ [95:64]   consumer_seq (32b)             │
+│ [63:0]    dst_rx_base_pa (64b)           │
+└──────────────────────────────────────────┘
+```
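The first-flit layout can be exercised with Python integers as a bit-accurate scratch model. This is a sketch under assumptions: the function names are hypothetical, the payload byte order (big-endian, zero-padded) is my choice rather than something the format specifies, and `sender_seq` and `dst_addr` are simply truncated to the 32 bits the header carries.

```python
FIRST_FLIT_PAYLOAD_BYTES = 52   # [415:0] = 416 bits


def pack_first_flit(vc_id: int, route: int, sender_seq: int,
                    dst_addr: int, payload: bytes) -> int:
    """Pack the first IPCQ data flit into a 512-bit integer."""
    if len(payload) > FIRST_FLIT_PAYLOAD_BYTES:
        raise ValueError("first flit carries at most 52 payload bytes")
    flit = 1 << 511                                # [511]     ipcq_flag
    flit |= (vc_id & 0x3) << 509                   # [510:509] vc_id
    flit |= (route & ((1 << 29) - 1)) << 480       # [508:480] route + hop count
    flit |= (sender_seq & 0xFFFFFFFF) << 448       # [479:448] sender_seq
    flit |= (dst_addr & 0xFFFFFFFF) << 416         # [447:416] dst_addr[31:0]
    padded = payload.ljust(FIRST_FLIT_PAYLOAD_BYTES, b"\x00")
    return flit | int.from_bytes(padded, "big")    # [415:0]   payload


def unpack_first_flit(flit: int) -> tuple[int, int, int, bytes]:
    """Return (ipcq_flag, sender_seq, dst_addr_lo32, payload)."""
    ipcq_flag = (flit >> 511) & 0x1
    sender_seq = (flit >> 448) & 0xFFFFFFFF
    dst_addr = (flit >> 416) & 0xFFFFFFFF
    payload = (flit & ((1 << 416) - 1)).to_bytes(FIRST_FLIT_PAYLOAD_BYTES, "big")
    return ipcq_flag, sender_seq, dst_addr, payload
```

The Meta Extractor only needs the second and third fields of the tuple; the payload goes to the TCM slot write.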
+
+The first flit's payload shrinks from 60B to 52B (13% overhead). In a multi-flit transfer the
+subsequent flits carry full payloads, so the overhead is < 1% for large transfers.
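To see where the overhead drops below 1%, the flit count and metadata share can be computed directly. This is a sketch under one assumption of mine: overhead is defined as the 8 metadata bytes divided by the total payload capacity of the flits used, which reproduces the 13% figure for a single flit.

```python
import math

FLIT_PAYLOAD_BYTES = 60      # [479:0] of a normal data flit
IPCQ_METADATA_BYTES = 8      # sender_seq (32b) + dst_addr[31:0] (32b)
FIRST_FLIT_PAYLOAD_BYTES = FLIT_PAYLOAD_BYTES - IPCQ_METADATA_BYTES  # 52


def ipcq_flit_count(nbytes: int) -> int:
    """Flits needed when only the first flit carries metadata."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    remaining = max(0, nbytes - FIRST_FLIT_PAYLOAD_BYTES)
    return 1 + math.ceil(remaining / FLIT_PAYLOAD_BYTES)


def metadata_overhead(nbytes: int) -> float:
    """Share of total payload capacity consumed by the piggyback metadata."""
    capacity = ipcq_flit_count(nbytes) * FLIT_PAYLOAD_BYTES
    return IPCQ_METADATA_BYTES / capacity
```

Under this definition the overhead falls below 1% from 14 flits onward (8 / 840), that is, for transfers larger than 772 bytes.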
+
+### D20. TCM IPCQ Slot Region Layout
+
+```
+TCM Memory Map (16MB):
+┌─────────────────────────────┐ 0x000000
+│  Kernel Working Memory      │
+│  (compute tensors)          │
+│  ~14MB                      │
+├─────────────────────────────┤ 0xE00000
+│  IPCQ RX Buffers            │
+│  Dir N: slots × slot_size   │
+│  Dir S: slots × slot_size   │
+│  Dir E: slots × slot_size   │
+│  Dir W: slots × slot_size   │
+│  ~1MB                       │
+├─────────────────────────────┤ 0xF00000
+│  IPCQ Metadata / Scratch    │
+│  ~1MB                       │
+└─────────────────────────────┘ 0xFFFFFF
+```
+
+The IPCQ region is placed in the upper banks of the TCM to minimize bank conflicts with
+compute accesses (see the TCM Bank Conflict risk in D22).
+
+### D21. 2nm Implementation Analysis
+
+#### Area Estimate
+
+| Module | Gate Count | Area (2nm est.) | Notes |
+|---|---|---|---|
+| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
+| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
+| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
+| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
+| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
+| **Total delta** | **~14.6K** | **~0.006 mm²** | |
+
+#### Timing
+
+| Path | Delay (2nm est.) | Target Clock | Margin |
+|---|---|---|---|
+| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
+| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
+| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
+| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
+
+All critical paths fit within 1 cycle at these estimates, so timing closure is not a concern.
+
+#### Power
+
+- Active: ~1 mW (register R/W + comparators, during send/recv activity)
+- Idle: leakage only
+- Negligible relative to total PE power
+
+#### Constraints
+
+| Item | Constraint | Rationale |
+|---|---|---|
+| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
+| `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
+| TCM IPCQ region | **placed in dedicated banks** | Prevents bank conflicts with compute accesses |
+
+### D22. Risk Assessment
+
+#### TCM Bank Conflict
+
+- **Risk**: stalls when an IPCQ slot write and a compute read hit the same bank
+- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
+- **Cost**: slightly reduced TCM banking flexibility
+- **Severity**: Medium (performance impact), Low (not a correctness issue)
+
+#### Credit Return Latency under Congestion
+
+- **Risk**: under NoC congestion, delayed credit returns → sender backpressure stalls
+- **Mitigation**:
+  - Move credits to a separate VC with strict priority (at 16B the BW impact is minimal)
+  - Or size n_slots generously (8+) so the ring buffer absorbs the credit delay
+- **Severity**: Low (16B credits contribute almost nothing to congestion)
+
+#### Inter-Direction Ordering
+
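+- **Risk**: ordering when one PE sends in several directions at the same time
+- **Mitigation**: per-direction monotonic seq is sufficient. Inter-direction ordering is the
+  kernel's (software's) responsibility, same as the current simulator model (D2 + D4)
+- **Severity**: Low (resolved by the architecture design)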
+
+### D23. HW Alternatives Considered
+
+#### Doorbell + Polling (traditional approach)
+
+```
+Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
+Recv: Polling loop on doorbell, or interrupt-driven
+```
+
+| Pros | Cons |
+|---|---|
+| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
+| Reuses the existing DMA | Ordering between data and doorbell must be enforced (fence) |
+| | Polling wastes power; interrupts add latency overhead |
+
+**Assessment**: 2-3× higher latency than piggyback. **Rejected.**
+
+#### Hardware Message Queue (NVIDIA NVLink style)
+
+```
+Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
+Recv: pop a descriptor from the HMQ → look up the data pointer
+```
+
+| Pros | Cons |
+|---|---|
+| CPU only writes a descriptor | Needs a separate HMQ engine (~0.05 mm²) |
+| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
+| | Large tensors still need DMA in the end |
+
+**Assessment**: CCL's large-tensor patterns require DMA anyway, so a dual HMQ + DMA structure
+wastes area. **Rejected.**
+
+#### RDMA-style Completion Queue (CQ)
+
+```
+Send: DMA write → CQE is generated automatically at the peer
+Recv: CQ poll/interrupt → look up the data location
+```
+
+| Pros | Cons |
+|---|---|
+| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
+| Easy multi-tenancy/isolation | CQE/data ordering must additionally be enforced |
+| | Over-engineered for PE-to-PE CCL |
+
+**Assessment**: an RDMA CQ suits multi-tenant isolation on a host-facing NIC.
+In the single-owner PE-to-PE setting it is unnecessary complexity. **Rejected.**
+
+#### Credit-in-Data Piggyback (v2 optimization candidate)
+
+In the current design a credit return is a separate 16B packet. In bidirectional communication
+patterns, **the credit can be merged into a data flit traveling in the reverse direction.**
+
+```
+PE_A →E→ PE_B: data + sender_seq=3
+PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
+```
+
+| Pros | Cons |
+|---|---|
+| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
+| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minimal overhead) |
+| | Slightly higher logic complexity |
+
+**Assessment**: a strong optimization of the current design. It can eliminate credit packets
+entirely in bidirectional allreduce while keeping the standalone credit fallback. **Recommended for adoption in v2.**
+
+### Open HW Questions
+
+- What fraction of the TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
+- Should credits get a separate VC, or be multiplexed onto vc_comm? (see D18)
+- Flit-format compatibility on inter-SIP links needs verification
+- Should n_slots have an upper bound? (8 directions × 8 slots × 64KB = 4MB → 25% of the TCM)
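The sizing questions above come down to simple budget arithmetic, sketched here. The helper names are hypothetical, and the 16MB TCM size is taken from the D20 memory map.

```python
TCM_BYTES = 16 * 1024 * 1024   # TCM size from the D20 memory map


def ipcq_region_bytes(n_dirs: int, n_slots: int, slot_size: int) -> int:
    """Total rx-buffer footprint across all directions."""
    return n_dirs * n_slots * slot_size


def tcm_fraction(region_bytes: int, tcm_bytes: int = TCM_BYTES) -> float:
    """Fraction of the TCM taken by the IPCQ slot region."""
    return region_bytes / tcm_bytes


def max_pow2_slot_size(budget_bytes: int, n_dirs: int, n_slots: int) -> int:
    """Largest power-of-2 slot_size that keeps the region within budget."""
    per_slot = budget_bytes // (n_dirs * n_slots)
    if per_slot < 1:
        raise ValueError("budget too small for the requested ring shape")
    return 1 << (per_slot.bit_length() - 1)
```

For example, keeping 8 directions × 8 slots inside the ~1MB region caps `slot_size` at 16KB, while 4 directions × 8 slots allows 32KB.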

 ---

@@ -0,0 +1,206 @@
+# ADR-0024: SIP-level Launcher — rank = SIP
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP**
+(device) boundary. The aim is bench code that reads **indistinguishably, at the host level,**
+from a real PyTorch DDP/TP script.
+
+Comparison with real PyTorch:
+
+| Dimension | real PyTorch | KernBench |
+| --- | --- | --- |
+| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
+| `get_rank()` | `RANK` env var | greenlet-local registry |
+| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
+| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
+| `mp.spawn` | launches OS processes | greenlet fan-out |
+
+### Problems to solve
+
+1. **rank = SIP in the public API**, so that bench workers never need to know about PEs.
+2. **Greenlet-local rank/device tracking**: within the one-process model, each
+   worker greenlet must identify its own rank and its own SIP correctly.
+3. **Tensor placement = structural (sip, cube, pe)**: if rank is a SIP, default
+   tensor placement must also be expressed in structural coordinates.
+
+### Non-problems (outside this ADR)
+
+- IPCQ direction addressing → ADR-0025
+- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
+- Megatron-style TP → ADR-0027
+- DTensor → ADR-0028 (future)
+- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
+  → ADR-0027 D0/D1
+- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
+
+## Decision
+
+### D1. rank = SIP (resolving world_size)
+
+```python
+def _resolve_world_size(self) -> int:
+    if "world_size" in self._merged:
+        return int(self._merged["world_size"])
+    defaults = self._cfg_all.get("defaults", {})
+    if "world_size" in defaults:
+        return int(defaults["world_size"])
+    spec = self.ctx.spec or {}
+    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
+```
+
+Precedence: algorithm override > defaults override > SIP count. The `ccl.yaml`
+override is kept as the legacy "rank = PE" test path.
+
+### D2. Greenlet-local rank registry (+ debug warning)
+
+```python
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet — returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)`: SIP binding
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but we are not CUDA, so we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm — per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: recent PyTorch is moving toward the
+device-agnostic `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). KernBench supports this surface as well, for
+users who want to write code that is not tied to a device vendor.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
+
+    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
+        torch.accelerator.set_device_index(rank)
+        torch.accelerator.current_device_index()
+    """
+
+    def __init__(self, ahbm: _AhbmNamespace):
+        self._ahbm = ahbm
+
+    def set_device_index(self, device: int) -> None:
+        self._ahbm.set_device(device)
+
+    def current_device_index(self) -> int | None:
+        return self._ahbm.current_device()
+
+# RuntimeContext attribute setup (fragment; `self` is the RuntimeContext instance)
+self.ahbm = _AhbmNamespace()
+self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias over the same registry
+```
+
+Bench authors pick either form; both are backed by the same internal registry:
+
+```python
+torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
+torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
+```
+
+### D4. Tensor placement = structural (sip, cube, pe) coordinates
+
+`resolve_dp_policy` takes `target_sip` directly and builds the placement in
+structural coordinates. See ADR-0026 for details.
+
+```python
+# RuntimeContext._create_tensor
+current_sip = self.ahbm.current_device()          # (D3 naming)
+if current_sip is None:
+    current_sip = 0  # single-driver fallback (consistent with D2)
+placement = resolve_dp_policy(
+    dp, shape=shape_2d, itemsize=itemsize,
+    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
+    target_sip=current_sip,
+)
+```
+
+There is no post-hoc `pe_index` shifting: ShardSpec carries the structural
+`(sip, cube, pe)` coordinates directly. See ADR-0026 for ShardSpec details.
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
+- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used
+  in D4 and ShardSpec's structural coordinate representation.
+- **ADR-0027** (Megatron TP + scheduler): the implementation reference for
+  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
+
+---
+
+## Non-goals
+
+- **IPCQ protocol changes**: ADR-0023 stays as is.
+- **DPPolicy field cleanup**: ADR-0026.
+- **Megatron-style TP**: ADR-0027.
+- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
+- **Collective algorithm implementation**: ADR-0032.
+- **Multi-node (cross-process)**: KernBench stays single-process.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Bench = real PyTorch DDP** (from the public API point of view).
+- **Greenlet-local rank**: enables cross-rank correctness in the one-process model.
+- **Structural placement coordinates**: ADR-0026, ADR-0027, and ADR-0032
+  operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (it remains a transit component, as before).
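The per-greenlet device binding in D3 and the single-driver fallback in D4 of the ADR above can be illustrated with a minimal sketch. This is not KernBench code: `DeviceRegistry`, `resolve_target_sip`, and `run_workers` are hypothetical names, and `threading.current_thread` stands in for `greenlet.getcurrent` so the example runs on the standard library alone.

```python
import threading
from typing import Callable, Hashable, Optional


class DeviceRegistry:
    """Hypothetical stand-in for _AhbmNamespace, keyed by execution context."""

    def __init__(self, current: Callable[[], Hashable] = threading.current_thread):
        # Assumption: the ADR keys on greenlet.getcurrent(); threads are used
        # here only to keep the sketch dependency-free.
        self._current = current
        self._device_by_ctx: dict = {}

    def set_device(self, device: int) -> None:
        self._device_by_ctx[self._current()] = int(device)

    def current_device(self) -> Optional[int]:
        return self._device_by_ctx.get(self._current())


def resolve_target_sip(registry: DeviceRegistry) -> int:
    """Single-driver fallback: an unbound context places tensors on SIP 0."""
    sip = registry.current_device()
    return 0 if sip is None else sip


def run_workers(n: int) -> dict:
    """Bind one device per worker, then record what each context resolves to."""
    registry = DeviceRegistry()
    seen: dict = {}

    def worker(rank: int) -> None:
        registry.set_device(rank)
        seen[rank] = resolve_target_sip(registry)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The driver never called set_device, so it falls back to SIP 0.
    seen["driver"] = resolve_target_sip(registry)
    return seen
```

Each worker sees only its own binding, while the unbound driver context resolves to SIP 0. That silent fallback is the case the `KERNBENCH_DEBUG` warning in the ADR is meant to surface.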
@@ -1,868 +0,0 @@
-# ADR-0024: SIP-level TP Launcher — rank = SIP (host-driven dispatch)
-
-## Status
-
-Accepted. rank = SIP process-group model stands. The allreduce algorithm
-path (mapper / validator / per-PE install machinery originally targeted at
-ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
-`configure_sfr_intercube_multisip` at `init_process_group` time and the
-intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
-sip_topo_h)` appended after the module's `kernel_args()`. The
-`leader_only` / `all_pes` mapper concepts in this document are no longer
-used by the default allreduce path.
-
-## Context
-
-### 목표
-
-`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
-경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
-읽히는 bench 코드를 목표로 한다.
-
-real PyTorch와 비교:
-
-| 차원 | real PyTorch | KernBench (이 ADR 이후) |
-|---|---|---|
-| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
-| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
-| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
-| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
-| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
-
-### 설계 원칙 — 공개 API의 추상화, 내부는 기존 path 활용
-
-**공개 API (bench worker) 수준의 추상화**:
-```
-rank = SIP
-DPPolicy = intra-device (cube × PE) 분산만
-dist.all_reduce, torch.ahbm.set_device, mp.spawn 등 PyTorch-style 표면
-```
-
-**Framework 내부 구현**:
-```
-build_install_plans (host): topology + mapper + algorithm → SipInstallPlan
-  ↓
-backend (host): plan의 per-PE spec을 engine.submit으로 IpcqInitMsg 디스패치
-  ↓
-engine: 기존 PE-scoped routing (MmuMapMsg 등과 동일 경로)
-  ↓
-PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability)
-```
-
-**핵심**: 새 message 타입이나 IO_CPU 확장 없음. 기존 engine routing과 기존
-`IpcqInitMsg` 타입을 그대로 사용. 기존의 "sideband direct call" 우회만
-제거하여 convention 일원화.
-
-### 풀어야 할 문제
-
-1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
-2. **Multi-worker 실행** — N개 rank가 독립 worker 코드 실행. 1 프로세스 제약
-   하에서 greenlet + barrier 동기화.
-3. **Cross-rank collective submit 동기화** — 첫 rank가 혼자 wait하면 peer 부재로
-   SimPy deadlock. 모든 rank submit 후 drain 보장.
-4. **기존 sideband install 제거** — IpcqInitMsg를 engine.submit으로 일원화.
-   MmuMapMsg 등 다른 control-plane 메시지와 동일 패턴.
-5. **Algorithm / mapper / validator 분리** — 알고리즘 모듈은 kernel 코드만
-   담고, topology / mapping / validation은 registry + 선언.
-
-### Non-problem (이 ADR 밖)
-
- IPCQ direction addressing fix → **ADR-0025**
- `DPPolicy.sip`/`num_sips` 제거 → **ADR-0026**
- Megatron-style TP → **ADR-0027**
- DTensor → **ADR-0028 (future)**
- **IO_CPU를 SIP-level control-plane 단일 endpoint로 승격**: 이 ADR에서는
-  invariant으로 채택하지 않음. 현재 KernBench에 해당 원칙이 없고, 단독으로
-  도입하기엔 정당화가 약함. 미래에 control-plane latency 모델링 정밀도 요구가
-  생기면 별도 ADR.
-
-## Decision
-
-### D1. rank = SIP (world_size 해석)
-
-```python
-def _resolve_world_size(self) -> int:
-    if "world_size" in self._merged:
-        return int(self._merged["world_size"])
-    defaults = self._cfg_all.get("defaults", {})
-    if "world_size" in defaults:
-        return int(defaults["world_size"])
-    spec = self.ctx.spec or {}
-    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
-```
-
-우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
-override는 legacy "rank = PE" 테스트 경로로 유지.
-
-### D2. Install 경로 — engine.submit 일원화
-
-`ccl/install.py`의 sideband direct call을 제거하고, `IpcqInitMsg`를
-`engine.submit`으로 보낸다. MmuMapMsg / MemoryWriteMsg 등이 이미 동일 패턴.
-
-```python
-# Backend (AhbmCCLBackend.__init__ 또는 init_process_group 시점)
-from kernbench.ccl.install_plan import build_install_plans
-
-plans = build_install_plans(
-    world_size=self._world_size,
-    algorithm=self._merged["algorithm"],
-    algorithm_config=self._merged,
-    spec=self.ctx.spec,
-)
-self._plans = plans
-
-# Each PE_IPCQ가 자기 neighbor table을 받도록 engine 경유 submit
-handles = []
-for plan in plans:
-    for pe_install in plan.pe_installs:
-        h = self.ctx.submit(IpcqInitMsg(
-            correlation_id=self.ctx.correlation_id,
-            request_id=f"ipcq_init_s{plan.sip}c{pe_install.cube}p{pe_install.pe}",
-            target_sips=(plan.sip,),
-            target_cubes=(pe_install.cube,),
-            target_pe=pe_install.pe,
-            entries=pe_install.neighbors,
-            buffer_kind=plan.buffer_kind,
-            n_slots=plan.n_slots,
-            slot_size=plan.slot_size,
-            # ... (기존 IpcqInitMsg 필드)
-        ))
-        handles.append(h)
-
-# Eager install — init_process_group이 반환하기 전에 완료 보장
-for h in handles:
-    self.ctx.wait(h)
-```
-
-**PE_IPCQ 컴포넌트**는 이미 `IpcqInitMsg`를 main loop에서 처리 (`pe_ipcq.py`
-라인 145-147). 변경 불필요. 유일한 차이는 "message가 sideband Python call이
-아니라 engine queue를 거쳐 도착한다"는 점.
-
-**Correctness invariant (equivalence)**: `init_process_group()`은 모든
-install handle을 `wait()`한 후 반환하므로 launch-before-install 문제는
-구조적으로 없다. 남는 correctness 질문은 단 하나:
-
-> Engine-routed `IpcqInitMsg` 처리가 기존 sideband
-> `pe_ipcq._install_neighbors(msg)` 호출과 **동일한 최종 PE_IPCQ 상태**를
-> 생성하는가.
-
-검증 포인트 (T3 참고):
-
-1. **State equivalence**: `_install_neighbors()` 내부 상태 전이가 engine
-   dispatch path에서도 동일하게 일어나 최종 PE_IPCQ state
-   (`_queue_pairs`, `_installed`, `_credit_inbox` 등)가 일치.
-
-2. **Sideband-only side effect 부재**: Sideband path에서만 있던 부수 효과가
-   없음 (예: engine.submit이 설정하는 request_id / correlation tracking 등이
-   install semantics를 왜곡하지 않음).
-
-3. **Ordering independence**: 서로 다른 PE들의 install message가 engine
-   큐에서 임의 순서로 처리되어도 최종 상태가 동일. 즉 install은 **PE별
-   독립 연산**이어야 하고, cross-PE 순서 의존성이 있으면 안 됨.
-
-4. **Idempotency**: 동일 PE에 대해 `IpcqInitMsg`가 두 번 도착하면? 현재
-   설계 전제는 "per-PE 단 한 번 install". 중복 install 시 동작은 정의되지
-   않음. 보수적 정책:
-   - 최초 install 시 `_installed = True`로 전이
-   - 이후 중복 install msg는 **에러** (raise) 또는 **silent idempotent**
-     (no-op) 둘 중 하나로 명시
-   - Recommend: **raise** (명시적 에러 → 버그 조기 검출). T3에 duplicate
-     install 케이스 추가.
-
-5. **Partial install visibility**: 일부 PE만 install 완료된 중간 상태가
-   외부에 observable한가? 현재 구조에서는 `init_process_group()`의 eager
-   wait-all이 barrier 역할을 하므로 partial state는 bench 코드에 노출되지
-   않음. 단, debugging / introspection API는 중간 상태를 볼 수 있음 (문제
-   아님, 문서화만).
-
-**Timing 영향**: Engine-routed install은 `init_process_group()`이 SimPy 시간을
-소비하게 만든다. 기존 sideband install은 사실상 zero-cost. ADR 계약:
-
-> Benchmarks must not rely on zero-cost initialization.
-> `init_process_group()` consumes simulated time proportional to the number
-> of participating PEs × per-PE install latency. First collective call
-> starts at a well-defined but non-zero sim time.
-
-### D3. Launch 경로 — non-CCL 커널과 동일 primitive
-
-**CCL 커널은 non-CCL 커널과 동일한 `KernelLaunchMsg` submission path를 쓴다.**
-Engine 내부의 IO_CPU/M_CPU transit 같은 것은 **기존 구현 세부이지 CCL-specific
-장치가 아님**. Backend는 plan의 `participating_pes` 목록을 돌면서 `KernelLaunchMsg`를
-submit할 뿐이다. 새 메시지 타입 없음, 새 라우팅 경로 없음.
-
-```python
-# AhbmCCLBackend.all_reduce
-def all_reduce(self, tensor, op="sum"):
-    if op != "sum":
-        raise NotImplementedError(...)
-    if tensor._handle is None or not tensor._handle.shards:
-        raise RuntimeError(...)
-
-    # Validator — global handle 기준 (D8)
-    validator_name = self._merged.get("validator")
-    if validator_name:
-        resolve_validator(validator_name)(tensor._handle, self._world_size, self.ctx.spec)
-
-    rank = self.ctx.distributed.get_rank()
-    plan = self._plans[rank]
-    tensor_view = _tensor_slice_for_sip(tensor._handle, plan.sip)
-
-    # Plan에서 kernel args 계산 (host-side)
-    import importlib
-    mod = importlib.import_module(plan.kernel_module)
-    n_elem = tensor_view.shards[0].nbytes // tensor.itemsize
-    kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
-                             **plan.kernel_config)
-
-    def _submit():
-        out = []
-        for (cube, pe) in plan.participating_pes:
-            h = self.ctx.submit(KernelLaunchMsg(
-                correlation_id=self.ctx.correlation_id,
-                request_id=f"allreduce_r{rank}_c{cube}p{pe}",
-                kernel_ref=KernelRef(name=plan.algorithm_name, kind="builtin"),
-                args=(_tensor_arg_for_pe(tensor_view, cube, pe), *kargs),
-                target_sips=(plan.sip,),
-                target_cubes=(cube,),
-                target_pe=pe,
-            ))
-            out.append(h)
-        return out
-
-    self._barrier.submit_and_drain(self.ctx, rank, _submit)
-```
-
-### D4. Algorithm ABI — 얇게 + 명시적 arg 계약
-
-각 알고리즘 모듈은 **kernel + kernel_args만 필수**.
-
-```python
-# src/kernbench/ccl/algorithms/ring_allreduce.py
-def kernel(t_ptr, n_elem, world_size, tl):
-    """PE-side kernel code.
-
-    Signature convention: first positional arg is the tensor pointer
-    (per-PE slice), subsequent positional args are whatever
-    kernel_args() returns. `tl` is injected by the TLContext runtime.
-    """
-
-def kernel_args(*, n_elem: int, world_size: int, **kw) -> tuple:
-    """Return the tuple of non-tensor positional args.
-
-    Signature contract:
-    - Called keyword-only with n_elem and world_size plus kernel_config.
-    - Returns a tuple (possibly empty) of scalar / metadata args.
-    - The backend constructs the final KernelLaunchMsg.args as:
-          (per_pe_tensor_arg, *kernel_args(...))
-      where per_pe_tensor_arg is a TensorArg containing only the shards
-      local to the receiving PE (derived from tensor_view).
-    """
-    return (n_elem, world_size)
-```
-
-**Arg assembly in backend (reference)**:
-
-```python
-# AhbmCCLBackend.all_reduce (D3에서 발췌)
-kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
-                         **plan.kernel_config)
-for (cube, pe) in plan.participating_pes:
-    pe_tensor_arg = _tensor_arg_for_pe(tensor_view, cube, pe)
-    self.ctx.submit(KernelLaunchMsg(
-        args=(pe_tensor_arg, *kargs),       # tensor first, then kernel_args return
-        target_sips=(plan.sip,),
-        target_cubes=(cube,),
-        target_pe=pe,
-        ...
-    ))
-```
-
-**ccl.yaml**에서 선언적 metadata:
-
-```yaml
-algorithms:
-  ring_allreduce_tcm:
-    module: kernbench.ccl.algorithms.ring_allreduce
-    topology: ring_1d             # kernbench/ccl/topologies.py
-    mapper: leader_only           # kernbench/ccl/mappers.py (신규)
-    validator: single_shard_per_rank   # kernbench/ccl/validators.py (신규)
-    buffer_kind: tcm
-    n_elem: 8
-```
-
- `topology` (필수)
- `mapper` (선택, default `"leader_only"`)
- `validator` (선택)
-
-알고리즘 모듈 자체에는 mapper/validator/participating_pes/neighbor
-생성기가 **들어가지 않음**.
-
-### D5. Mapper + validator — registry key **또는** import path
-
-Host-side framework가 built-in registry 제공. 커스텀 확장은 dot-import path.
-
-```python
-# src/kernbench/ccl/mappers.py (new)
-Mapper = Callable[[dict, int], list[tuple[int, int]]]
-
-def leader_only(spec, rank):
-    """Single leader PE per SIP. Ring/tree/mesh용."""
-    return [(0, 0)]
-
-def all_pes(spec, rank):
-    """Every PE in the SIP. 알고리즘이 intra-SIP 전체 PE를 참여시킬 때 사용
-    (e.g. intra-SIP reduction, intra-SIP broadcast, hierarchical collective
-    의 낮은 레벨 등)."""
-    cm = spec["sip"]["cube_mesh"]
-    pl = spec["cube"]["pe_layout"]
-    n_cubes = cm["w"] * cm["h"]
-    n_pes = pl["pe_per_corner"] * len(pl["corners"])
-    return [(c, p) for c in range(n_cubes) for p in range(n_pes)]
-
-MAPPER_REGISTRY = {"leader_only": leader_only, "all_pes": all_pes}
-
-def resolve_mapper(key_or_path: str) -> Mapper:
-    if key_or_path in MAPPER_REGISTRY:
-        return MAPPER_REGISTRY[key_or_path]
-    if "." in key_or_path:
-        import importlib
-        mod_path, fn_name = key_or_path.rsplit(".", 1)
-        return getattr(importlib.import_module(mod_path), fn_name)
-    raise ValueError(f"unknown mapper: {key_or_path!r}")
-```
-
-Validator도 동일 패턴 (`src/kernbench/ccl/validators.py`). 입력은 **global
-TensorHandle** (D8 참고).
-
-### D6. Host-side install plan builder
-
-```python
-# src/kernbench/ccl/install_plan.py (new; 기존 install.py의 재구성)
-from dataclasses import dataclass
-from typing import Any, Mapping
-
-@dataclass(frozen=True)
-class NeighborTableEntry:
-    direction: str
-    peer_direction: str       # ADR-0025
-    peer_sip: int
-    peer_cube: int
-    peer_pe: int
-    rx_base_pa: int
-    # ... 기타 IPCQ 설정 ...
-
-@dataclass(frozen=True)
-class PeInstallSpec:
-    cube: int
-    pe: int
-    neighbors: tuple[NeighborTableEntry, ...]
-
-@dataclass(frozen=True)
-class SipInstallPlan:
-    algorithm_name: str                  # human-readable ("ring_allreduce_tcm")
-    sip: int
-    rank: int
-    world_size: int
-    pe_installs: tuple[PeInstallSpec, ...]     # per-PE neighbor tables
-    buffer_kind: str
-    n_slots: int
-    slot_size: int
-    kernel_module: str
-    participating_pes: tuple[tuple[int, int], ...]
-    kernel_config: Mapping[str, Any]
-
-
-def build_install_plans(
-    world_size: int,
-    algorithm: str,
-    algorithm_config: dict,
-    spec: dict,
-) -> list[SipInstallPlan]:
-    """Compose topology + mapper + algorithm into per-SIP plan list."""
-    topo_fn = _resolve_topology(algorithm_config["topology"])
-    mapper = resolve_mapper(algorithm_config.get("mapper", "leader_only"))
-
-    # kernel_config: launch 시 kernel_args에 전달할 algorithm-specific params
-    kernel_config = {
-        k: v for k, v in algorithm_config.items()
-        if k in {"n_elem", "reduce_op", "chunk_size"} or k.startswith("kernel_")
-    }
-
-    plans = []
-    for rank in range(world_size):
-        sip = rank  # identity mapping (non-identity는 open question)
-        pes = mapper(spec, rank)
-        pe_installs = _build_pe_installs(
-            rank=rank, world_size=world_size, sip=sip,
-            pes=pes, topo_fn=topo_fn, algorithm_config=algorithm_config, spec=spec,
-        )
-        plans.append(SipInstallPlan(
-            algorithm_name=algorithm,
-            sip=sip, rank=rank, world_size=world_size,
-            pe_installs=pe_installs,
-            buffer_kind=algorithm_config["buffer_kind"],
-            n_slots=algorithm_config["n_slots"],
-            slot_size=algorithm_config["slot_size"],
-            kernel_module=algorithm_config["module"],
-            participating_pes=tuple(pes),
-            kernel_config=kernel_config,
-        ))
-    return plans
-```
-
-`_build_pe_installs`는 기존 `ccl/install.py`의 neighbor 계산 로직을 재활용
-(ADR-0025의 `reverse_direction` 개선 반영).
-
-**Multi-PE 매퍼와 neighbor 생성 책임**: mapper가 SIP 내 여러 PE를 반환하는
-경우 (`all_pes` 등), PE-level neighbor 그래프는 `_build_pe_installs` 내부에
-형성된다. 즉 topology 모듈은 rank-level 관계만 제공하고, PE-level 연결은
-builder에서 풀어낸다. 복잡한 multi-level 패턴을 쓰는 알고리즘은 이 책임
-분산이 관리 부담이 될 수 있음 — 관련 논의는 ADR-0029 참고.
-
-### D7. Epoch-based collective barrier
-
-Cross-rank submit 동기화. 각 collective 호출은 독립 epoch. 같은 rank의
-중복 join은 즉시 에러.
-
-```python
-# src/kernbench/runtime_api/distributed.py
-@dataclass
-class _EpochState:
-    participants: set[int] = field(default_factory=set)
-    pending: list = field(default_factory=list)
-    drained: bool = False
-    returned: int = 0
-
-
-class _CollectiveBarrier:
-    """Epoch-based barrier.
-
-    Contract:
-    - Each call joins the earliest non-drained epoch.
-    - Each rank may join a given epoch at most once. Duplicate join raises.
-    - Last arriver (participants == world_size) performs drain and advances
-      _next_epoch. Earlier arrivers yield and re-check drained on resume.
-    - Epoch state is GC'd when returned == world_size (success path).
-    - On failure paths, residual state is acceptable; reset() clears it.
-    """
-
-    def __init__(self, world_size: int):
-        self._world_size = world_size
-        self._next_epoch = 0
-        self._state: dict[int, _EpochState] = {}
-
-    def submit_and_drain(self, ctx, rank: int, submit_fn) -> None:
-        epoch = self._next_epoch
-        state = self._state.setdefault(epoch, _EpochState())
-
-        if rank in state.participants:
-            raise RuntimeError(
-                f"rank {rank} attempted duplicate join to epoch {epoch}"
-            )
-        state.participants.add(rank)
-
-        handles = submit_fn()
-        state.pending.extend(handles)
-
-        is_last = len(state.participants) >= self._world_size
-
-        if is_last:
-            for h in state.pending:
-                ctx.wait(h)
-            state.drained = True
-            self._next_epoch = epoch + 1
-        else:
-            from greenlet import getcurrent
-            g = getcurrent()
-            if g.parent is None:
-                raise RuntimeError("barrier requires a bound worker greenlet")
-            while not state.drained:
-                g.parent.switch()
-
-        state.returned += 1
-        if state.returned >= self._world_size:
-            self._state.pop(epoch, None)
-
-    def reset(self) -> None:
-        """Explicit cleanup on spawn exception unwinding."""
-        self._state.clear()
-        self._next_epoch = 0
-```
-
-### D8. Per-rank tensor view + validator contract
-
-**Validator** (host-side, pre-slice, global handle 기준):
-
-```python
-# src/kernbench/ccl/validators.py
-Validator = Callable[[TensorHandle, int, dict], None]
-
-def single_shard_per_rank(handle, world_size, spec):
-    """Ring 계열: 정확히 world_size개 shard, SIP당 1개."""
-    if len(handle.shards) != world_size:
-        raise ValueError(...)
-    per_sip = {}
-    for s in handle.shards:
-        per_sip[s.sip] = per_sip.get(s.sip, 0) + 1
-    if any(c != 1 for c in per_sip.values()):
-        raise ValueError(...)
-
-def multi_pe_sip_local(handle, world_size, spec):
-    """Multi-PE per SIP layout: 각 SIP에 intra-SIP PE 수만큼 shard 존재.
-    Intra-SIP 전체 PE를 참여시키는 알고리즘이 사용."""
-    cm = spec["sip"]["cube_mesh"]
-    pl = spec["cube"]["pe_layout"]
-    per_sip = cm["w"] * cm["h"] * pl["pe_per_corner"] * len(pl["corners"])
-    if len(handle.shards) != world_size * per_sip:
-        raise ValueError(...)
-
-VALIDATOR_REGISTRY = {...}
-def resolve_validator(key_or_path): ...
-```
-
-Validator는 world 전체의 shard layout 불변량을 본다. Per-rank view는
-backend가 validator 호출 **후** `_tensor_slice_for_sip`로 생성.
-
-**Per-rank tensor view** — SIP-local slice:
-
-```python
-def _tensor_slice_for_sip(handle, sip) -> TensorArg:
-    sip_shards = [s for s in handle.shards if s.sip == sip]
-    if not sip_shards:
-        raise RuntimeError(f"tensor has no shards on SIP {sip}")
-    # Deterministic ordering contract: (cube, pe, offset_bytes) ascending.
-    # Multi-PE mappers (hierarchical 등) rely on this ordering to align
-    # per-PE tensor arg construction with participating_pes enumeration.
-    sip_shards.sort(key=lambda s: (s.cube, s.pe, s.offset_bytes))
-    min_offset = min(s.offset_bytes for s in sip_shards)
-    local_va_base = handle.va_base + min_offset if handle.va_base else 0
-    return TensorArg(
-        shards=tuple(TensorArgShard(...) for s in sip_shards),
-        va_base=local_va_base,
-    )
-```
-
-**Ordering invariant**: slice의 shard는 `(cube, pe, offset_bytes)` 오름차순.
-Backend가 `participating_pes`를 iterate하며 `_tensor_arg_for_pe(view, cube, pe)`를
-구성할 때, 결정론적 ordering을 전제할 수 있다. 특히 `all_pes` mapper +
-hierarchical 알고리즘이 per-PE slice 조합을 순서 의존적으로 해석하는 경우에
-중요.
-
-### D9. Greenlet-local rank registry (+ debug warning)
-
-```python
-class DistributedContext:
-    def __init__(self):
-        self._backend = None
-        self._rank_by_greenlet: dict = {}
-
-    def _bind_rank(self, g, rank: int) -> None:
-        self._rank_by_greenlet[g] = int(rank)
-
-    def get_rank(self) -> int:
-        self._ensure_initialized()
-        from greenlet import getcurrent
-        g = getcurrent()
-        if g not in self._rank_by_greenlet:
-            if os.environ.get("KERNBENCH_DEBUG"):
-                warnings.warn(
-                    "get_rank() called outside a bound greenlet — returning 0. "
-                    "Likely a bug unless running single-driver."
-                )
-            return 0
-        return int(self._rank_by_greenlet[g])
-```
-
-### D10. `torch.ahbm.set_device(rank)` — SIP 바인딩
-
-KernBench 백엔드 이름은 `ahbm` (ADR-0023 D10). Real PyTorch는
-`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
-namespace를 사용한다.
-
-```python
-class _AhbmNamespace:
-    """torch.ahbm — per-greenlet SIP device binding.
-
-    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
-    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
-    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
-    """
-
-    def __init__(self):
-        self._device_by_greenlet: dict = {}
-
-    def set_device(self, device: int) -> None:
-        from greenlet import getcurrent
-        self._device_by_greenlet[getcurrent()] = int(device)
-
-    def current_device(self) -> int | None:
-        from greenlet import getcurrent
-        return self._device_by_greenlet.get(getcurrent())
-
-# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
-# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
-```
-
-**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
-`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
-`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
-코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
-
-```python
-class _AcceleratorNamespace:
-    """torch.accelerator — device-agnostic API (PyTorch 2.x style).
-
-    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
-        torch.accelerator.set_device_index(rank)
-        torch.accelerator.current_device_index()
-    """
-
-    def __init__(self, ahbm: _AhbmNamespace):
-        self._ahbm = ahbm
-
-    def set_device_index(self, device: int) -> None:
-        self._ahbm.set_device(device)
-
-    def current_device_index(self) -> int | None:
-        return self._ahbm.current_device()
-
-# RuntimeContext
-self.ahbm = _AhbmNamespace()
-self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
-```
-
-Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
-
-```python
-torch.ahbm.set_device(rank)                   # KernBench-native, explicit backend
-torch.accelerator.set_device_index(rank)      # PyTorch 2.x device-agnostic
-```
-
-### D11. Tensor placement = structural (sip, cube, pe) 좌표
-
-`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
-세부는 ADR-0026.
-
-```python
-# RuntimeContext._create_tensor
-current_sip = self.ahbm.current_device()          # (D10 naming)
-if current_sip is None:
-    current_sip = 0  # single-driver fallback (D9와 일관)
-placement = resolve_dp_policy(
-    dp, shape=shape_2d, itemsize=itemsize,
-    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
-    target_sip=current_sip,
-)
-```
-
-Post-hoc `pe_index` shifting 제거 — ShardSpec이 `(sip, cube, pe)` 구조적
-좌표 보유.
-
-### D12. `torch.multiprocessing.spawn`-compat surface
-
-Bench 작성자 표면은 real PyTorch `mp.spawn`과 동일:
-
-```python
-# src/kernbench/runtime_api/multiprocessing.py (new)
-def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"):
-    """Drop-in for torch.multiprocessing.spawn.
-    Internal: greenlet fan-out + epoch-barrier sync + exception propagation.
-    """
-    ...
-
-# torch namespace에 부착
-torch.multiprocessing = SimpleNamespace(spawn=spawn)
-```
-
-Bench:
-
-```python
-import torch.multiprocessing as mp
-mp.spawn(worker, nprocs=world_size, args=(world_size, torch))
-```
-
-### D13. Scheduler + exception handling
-
-```python
-def spawn(fn, args, nprocs, ...):
-    dist = torch.distributed
-    gs: list[greenlet] = []
-    errors: dict[int, Exception] = {}
-
-    for rank in range(nprocs):
-        def _entry(r=rank):
-            try:
-                fn(r, *args)
-            except Exception as e:
-                errors[r] = e
-                raise
-        g = greenlet(_entry)
-        dist._bind_rank(g, rank)
-        gs.append(g)
-
-    try:
-        while True:
-            alive = [g for g in gs if not g.dead]
-            if not alive:
-                break
-            for g in alive:
-                if not g.dead:
-                    g.switch()
-    except Exception as outer:
-        for other in gs:
-            if not other.dead:
-                try:
-                    other.throw(SystemExit)
-                except Exception:
-                    pass
-        # Epoch barrier state 명시적 cleanup
-        backend = getattr(dist, "_backend", None)
-        if backend is not None and hasattr(backend, "_barrier"):
-            backend._barrier.reset()
-        raise SpawnException(errors) from outer
-```
-
-**Scheduler contract**:
- Deterministic round-robin over insertion order (rank 0, 1, ..., N-1).
- 동기화 지점은 epoch barrier (D7)만. Scheduler 순서에 의존하는 correctness 없음.
- 예외 발생 시 다른 greenlet 강제 종료 + `SpawnException` 전파.
-
-**Starvation guideline**:
- 일반적으로 collective barrier가 workers를 동기화. 큰 편차 없음.
- 극단적 non-collective 루프 대비 cooperative yield 제공:
-  `torch.distributed.cooperative_yield()`.
-
-### D14. Backward compatibility
-
-1. **Single-driver 호출**: `get_rank()` 0 반환 (D9).
-2. **`ccl.yaml` world_size override**: D1 fallback 우회 — legacy "rank = PE"
-   테스트 경로로 사용 가능.
-3. **`DPPolicy.sip="column_wise"` 명시**: ADR-0026 scope.
-4. **`install_ipcq()` compatibility wrapper**:
-
-기존 `ccl/install.py`의 `install_ipcq()` API는 곧바로 제거하지 않는다.
-Thin compatibility wrapper로 남겨 기존 직접 호출자가 점진적으로 migration할
-수 있게 한다.
-
-```python
-# src/kernbench/ccl/install.py (after this ADR)
-def install_ipcq(engine, spec, merged, *, algo_module=None, rank_to_pe=None):
-    """DEPRECATED: legacy host-side PE installer.
-
-    Internally delegates to build_install_plans + engine-routed IpcqInitMsg.
-    Use dist.init_process_group() instead.
-    """
-    from kernbench.ccl.install_plan import build_install_plans
-    import warnings
-    warnings.warn(
-        "install_ipcq() is deprecated; use dist.init_process_group()",
-        DeprecationWarning, stacklevel=2,
-    )
-    plans = build_install_plans(
-        world_size=merged.get("world_size", 1),
-        algorithm=merged["algorithm"],
-        algorithm_config=merged,
-        spec=spec,
-    )
-    handles = []
-    for plan in plans:
-        for pe_install in plan.pe_installs:
-            h = engine.submit(IpcqInitMsg(
-                target_sips=(plan.sip,),
-                target_cubes=(pe_install.cube,),
-                target_pe=pe_install.pe,
-                entries=pe_install.neighbors,
-                buffer_kind=plan.buffer_kind,
-                n_slots=plan.n_slots,
-                slot_size=plan.slot_size,
-            ))
-            handles.append(h)
-    for h in handles:
-        engine.wait(h)
-    return {"world_size": merged.get("world_size", 1), "plans": plans}
-```
-
-Migration 스케줄:
- Phase 1: wrapper로 유지 + DeprecationWarning
- Phase 2: 직접 호출자 grep-audit → 각각 `dist.init_process_group()` 또는
-  `build_install_plans()` 직접 사용으로 이관
- Phase 3: wrapper 제거 (별도 cleanup ADR 또는 PR)
-
---
-
-## Dependencies
-
- **ADR-0023** (IPCQ): `IpcqInitMsg` 메시지 타입과 PE_IPCQ 핸들링을 그대로
-  활용. Engine-routed submit으로 전환하는 것이 유일한 변경.
- **ADR-0025** (IPCQ direction fix): `_build_pe_installs`의 neighbor 계산이
-  2-rank ring 등에서 정확히 동작하려면 필요.
- **ADR-0003 / 0016** (IO_CPU): IO_CPU는 기존 transit 역할 그대로. 본 ADR에서
-  IO_CPU 역할 변경 없음.
-
---
-
-## Non-goals
-
- **IPCQ protocol 수정**: ADR-0023 유지.
- **DPPolicy 필드 정리**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Multi-node (프로세스 간)**: 단일 프로세스.
- **IO_CPU SIP control-plane 단일 endpoint 원칙 채택**: 본 ADR 범위 밖. 현재
-  KernBench에 이 원칙이 없고, 도입은 별도 ADR.
- **Hierarchical all-reduce 알고리즘 설계**: ADR-0029. 본 ADR은 그 알고리즘이
-  쓸 framework 인프라 (`all_pes` mapper, `multi_pe_sip_local` validator,
-  registry 확장점)만 제공.
-
---
-
-## Open questions
-
-### 🟡 Nice-to-have — scope 경계 관련
-
- **Install timing 허용치**: SimPy 시간 상 install이 몇 ns~us 소모. 기존
-  sideband는 0ns. 기존 테스트가 t=0 시작을 전제로 하는지 확인 (audit 결과에
-  따라 테스트 교정 필요).
-
- **`IpcqInitMsg` 배치 가능성**: MmuMapMsg처럼 `target_pe="all"` 브로드캐스트
-  는 IPCQ에서는 부적합 (PE마다 neighbor가 다름). 현재는 per-PE 개별 submit.
-  Per-PE payload를 담는 batched IpcqInitMsg 타입은 future optimization.
-
- **`_rank_to_sip` 매핑**: 현재 identity. Non-trivial mapping 요구 시 별도.
-
- **Cooperative yield API 위치**: `torch.distributed.cooperative_yield()`로
-  노출 예정. 실제 필요성은 Phase 2 이후 벤치 추가 시 판단.
-
-(PE-level topology 일원화 관련 중장기 방향은 **ADR-0029** 참고 — 복잡한
-multi-level 알고리즘이 driving force가 되는 framework 진화 방향.)
-
---
-
-## Consequences
-
-### Positive
-
- **새 message 타입 0개**: 기존 `IpcqInitMsg` + `KernelLaunchMsg`만으로 구현.
- **IO_CPU / engine 변경 없음**: 기존 routing 그대로.
- **Sideband install convention 제거**: MmuMapMsg 등과 동일 패턴으로 일원화.
- **Plan state stale 문제 소멸**: Plan은 host 단일 소유.
- **Bench = real PyTorch DDP** (공개 API 관점).
- **Algorithm ABI 경량**: `kernel` + `kernel_args`만 필수.
- **Epoch-based barrier**: interleaved collective 안전.
- **Control/data plane 분리**: data plane(PE_IPCQ)은 ADR-0023 유지, control
-  plane은 host-driven.
- 장기 확장성: Megatron TP, DTensor 기반.
-
-### Negative
-
- 신규 모듈: `install_plan.py`, `mappers.py`, `validators.py`,
-  `multiprocessing.py`.
- Engine이 `IpcqInitMsg`를 엔진-path로 라우팅할 수 있는지 구현 시 확인 필요
-  (minor hook 가능성).
- Install이 SimPy 시간을 소모 (positive로도 볼 수 있으나, 기존 sideband 시점
-  0ns 전제인 테스트가 있으면 교정 필요).
-
-### Neutral
-
- IPCQ PE-level protocol (ADR-0023) 불변.
- `DPPolicy` 필드 변경은 ADR-0026.
- IO_CPU 역할 불변 (기존 transit 그대로).
@@ -23,7 +23,7 @@ class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.

    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
-    (ADR-0024 D10) and, for model-level TP, by Megatron-style parallel
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
@@ -37,7 +37,7 @@ class DPPolicy:
 ### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거

 현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). 이는 ADR-0024 D11이 "abstraction leakage"로 지적한 형태.
+pes + pe`). 이는 ADR-0024 D4가 "abstraction leakage"로 지적한 형태.

 본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
 property로도 **남기지 않는다**:
@@ -73,7 +73,7 @@ class ShardSpec:

 ### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

-Implements the ADR-0024 D11 contract. No post-hoc shifting.
+Implements the ADR-0024 D4 contract. No post-hoc shifting.

 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -135,14 +135,14 @@ def resolve_dp_policy(

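 ### D4. `_create_tensor`: direct placement in structural coordinates

-Continuation of ADR-0024 D11. Post-hoc shifting is removed; structural coordinates are passed directly at the `resolve_dp_policy`
+Continuation of ADR-0024 D4. Post-hoc shifting is removed; structural coordinates are passed directly at the `resolve_dp_policy`
 call site.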

 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (consistent with ADR-0024 D9).
+    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), tensors silently land on
    # SIP 0, so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
@@ -267,7 +267,7 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
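 - **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33%.
 - **Consistent structural coordinates**: ShardSpec is expressed as a `(sip, cube, pe)` tuple, which
-  resolves the abstraction leakage (satisfies the ADR-0024 D11 contract).
+  resolves the abstraction leakage (satisfies the ADR-0024 D4 contract).
 - **Unambiguous `pe_index` meaning**: SIP-local is the only interpretation. State it explicitly if a global flat index is needed.
 - **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is the only mechanism
  that controls SIP boundaries.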
@@ -2,9 +2,7 @@

 ## Status

-Accepted (Revision 7: makes explicit the resume invariant, the non-recursive main-context wait
-invariant, the global barrier over-serialization tradeoff, and TP forward yield-safety;
-2026-04-14)
+Accepted

 ## Context

@@ -166,9 +164,9 @@ while alive:
  - The implementation does not need to **detect** this (timeouts, steps-since-yield counters,
     etc.). This is a user contract; the symptom of a violation is a "simulation hang".
  - **Future extension**: if long non-collective compute paths become common,
-     the `torch.distributed.cooperative_yield()` primitive of ADR-0024 D13 (an explicit
-     no-op yield) can be introduced. Out of scope for this ADR. Not a breaking change;
-     add it when needed.
+     an explicit `torch.distributed.cooperative_yield()` primitive (a no-op yield) can be
+     introduced. Out of scope for this ADR. Not a breaking change; add it when
+     needed.
  - Within a round, every alive worker receives `switch` once. Even if one worker calls wait
     several times within a single round, the calls are enqueued in order during that turn
     and then processed together in a single scheduler drain (FIFO).
@@ -183,7 +181,7 @@ while alive:
  - **The two queues have different dependency sources**: a worker wait is a handle the worker
     itself created with a `submit + wait` pair (tensor deploy, MmuMap, etc.). The collective
     queue holds kernel launch handles enqueued internally by `dist.all_reduce`, and
-     the worker does not wait on them directly (ADR-0024 D7).
+     the worker does not wait on them directly (see the two-queue drain model in D0.5).
  - **Independent for correctness**: from the worker's point of view a collective is "already
     submitted, then yielded". It only has to complete before the worker's next action.
     There is no ordering dependency on the worker wait queue.
@@ -206,7 +204,7 @@ while alive:
     as an index, or check `h not in pending_set` before append) is possible. It is classified
     as an optimization that does not change correctness.

-4. **Exception propagation + sibling cleanup (adopts the ADR-0024 D13 approach)**.
+4. **Exception propagation + sibling cleanup**.
   When a worker greenlet raises, `g.switch()` propagates the exception to main.
   The scheduler loop stops immediately and performs the following cleanup **explicitly**:

@@ -581,7 +579,7 @@ TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다:

 | Concept | Decided by | Scope |
 |---|---|---|
-| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D9/D10) | **cross-rank, cross-SIP** |
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
 | **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **within one rank (inside the SIP boundary)** |

 따라서 `ColumnParallelLinear`가 `(in_features, out_features // ws)` shape로
@@ -825,40 +823,11 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을

 ## Dependencies

- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `dist.all_reduce`,
-  `torch.ahbm.set_device(rank)`. D0/D1 of this ADR extend this infrastructure.
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
 - **ADR-0026** (DPPolicy intra-device): per-rank slice representation of weight tensors.
 - **ADR-0023 / ADR-0025** (IPCQ): the basis of the `dist.all_reduce` implementation.

-### Supersedes (partial)
-
-The following sections of ADR-0024 are **designs that were never implemented**, and this ADR
-replaces them with a simpler model:
-
- **ADR-0024 D7 (`_CollectiveBarrier.submit_and_drain`)**: an epoch-based
-  last-arriver-drains pattern. Problem: the last arriver calls `ctx.wait` **from the worker
-  context** and drives env.run, which reproduces the orphan cause that D0.2 is meant to prevent.
-  This ADR's **D0.4 two-queue drain** (main drains after all workers have yielded) provides the
-  same invariant, "no rank's collective makes progress until every rank has finished
-  submitting", in a **worker-safe** way. The `_CollectiveBarrier` class is
-  not implemented.
- **ADR-0024 D12/D13 (`spawn_workers` skeleton)**: the signature / scheduler
-  loop / exception handling design. **D1** of this ADR redefines it with a signature that matches
-  the real PyTorch API (`spawn(fn, args, nprocs)`) and performs the D0 scheduler drain in a
-  single place. The exception cleanup of ADR-0024 D13 (siblings
-  `throw(SystemExit)` + `SpawnException` wrapping) is absorbed into this ADR as-is
-  (see D0.4-(4)).
-
-The current implementation landed none of ADR-0024 D7/D12/D13, so the supersession carries
-no migration cost. Adding a "superseded by ADR-0027 D0/D1" note to
-`docs/adr/ADR-0024` later is enough for consistency.
-
-**Source of truth (normative, for implementers)**: the implementation reference for worker
-scheduling / collective drain / spawn / exception cleanup is **ADR-0027 D0/D1**. When
-implementing, do not refer to the pseudocode / contract / signature of ADR-0024 D7/D12/D13.
-Whenever the two ADRs reach different conclusions, ADR-0027 takes precedence. Reviewers apply
-the same principle when reviewing PRs.
-
 ---

 ## Non-goals
@@ -1,171 +0,0 @@
-# ADR-0028: DTensor Support: Declarative Distributed Tensors (Stub / Future)
-
-## Status
-
-Stub (Future Work)
-
-## Context
-
-### Goal
-
-A **preliminary exploration of the design space** for bringing a **declarative distributed tensor
-abstraction** (PyTorch 2.x `DTensor` style) to KernBench. This ADR is **not an implementation
-plan but a file placeholder for future work plus an initial list of questions**.
-
-### Differences from Megatron-style TP (Why DTensor)
-
-| Aspect | Megatron (ADR-0027) | DTensor (this ADR) |
-|---|---|---|
-| Representation | explicit parallel layers | tensor + placement spec |
-| Call form | `ColumnParallelLinear(...)` | `distribute_tensor(x, mesh, [Shard(1)])` |
-| Collective insertion | explicit inside the layer | automatic via operator dispatch |
-| Learning curve | low (explicit) | medium to high (requires understanding declarative semantics) |
-| Flexibility | fixed per layer | independent of layer boundaries, anywhere |
-| Prerequisites in KernBench | launcher (ADR-0024) + TP (0027) | those + operator dispatch overhaul |
-
-DTensor works at the operator level: it "inspects tensor placement and inserts collectives
-automatically". For KernBench to support this, the **operator dispatch layer needs
-placement-aware rewriting**. This is non-trivial.
-
-### Current state
-
- KernBench has no operator dispatch layer (there is no `torch.matmul`; kernel
-  launches are used instead).
- DPPolicy holds static placement metadata (after ADR-0026: intra-device only).
- The ADR-0024 launcher provides the rank / device concepts.
- Megatron-style TP (ADR-0027) will serve as the explicit alternative.
-
---
-
-## Preliminary decision space
-
-### DQ1. Scope of PyTorch DTensor API adoption
-
- `DeviceMesh`: a logical grid of ranks.
- `Placements`: `Shard(dim)`, `Replicate()`, `Partial(reduce_op)`.
- `distribute_tensor(tensor, device_mesh, placements)`: local tensor → DTensor.
- Redistribute: `dt.redistribute(new_placements)` inserts collectives automatically.
- Operator forward: `dt @ dt`, `dt + dt`, etc. → the appropriate collective is dispatched automatically.
-
-KernBench must decide how far to go. Minimum: `distribute_tensor` +
-`redistribute`. Maximum: overloading every operator.
-
-### DQ2. Operator dispatch layer
-
-To define `dt @ dt` in KernBench, the Tensor's `__matmul__` has to inspect the placement
-and take the appropriate action:
-
- both replicated → local matmul
- A column-sharded, B row-sharded → local matmul + all-reduce (RowParallel)
- A replicated, B column-sharded → local matmul (ColumnParallel)
- etc.
-
-This is an **automated version** of the Megatron-style approach. The kernel is the existing matmul kernel.
-
-### DQ3. DeviceMesh and the existing topology
-
-The KernBench topology is already a SIP/cube/PE hierarchy. DTensor's DeviceMesh is an abstract
-`(tp_size, dp_size, ...)` grid. Mapping:
-
- 1D mesh of size = SIP count → rank = SIP
- 2D mesh (tp × dp) → partition SIPs into groups (mixed parallelism instead of pure TP)
-
-Initially 1D mesh only; 2D DP × TP is future work.
-
-### DQ4. Integrating placement with intra-device (DP) distribution
-
-KernBench specific: within a rank, DPPolicy distributes data across cubes/PEs. DTensor does
-not look inside a device. Integration:
-
- DTensor placement = distribution across ranks (SIPs)
- Each rank's local tensor is still placed on cubes/PEs by DPPolicy
- → the DTensor wrapper also stores the local tensor's DPPolicy
-
-### DQ5. Where collectives are inserted automatically
-
-At `redistribute` or operator forward time. The submit+yield+wait pattern of ADR-0024 would be
-invoked automatically. `_launch_submit` becomes internal.
-
-### DQ6. Autograd
-
-DTensor interacts with autograd (reverse collectives in backward). Until KernBench
-supports backward, this is a **forward-only DTensor**.
-
---
-
-## Open questions (to resolve before real design)
-
-1. **Priority**: is DTensor layered on top once Megatron-style (ADR-0027) has settled,
-   or is a common lower layer designed first?
-2. **Compatibility target**: how closely should the API match PyTorch DTensor? A separate API
-   or a nearly identical one?
-3. **Operator dispatch**: should the KernBench `Tensor` class gain operator
-   overloading such as `__matmul__`? (Currently kernel launches only.)
-4. **Redistribute policy**: which collective handles a `Shard(0) → Replicate()` conversion?
-   Without `all_gather`, this is a constraint until it is implemented.
-5. **Mesh × DPPolicy interaction**: how to represent metadata when one DTensor carries two
-   layers of distribution.
-6. **When to reduce a Partial placement**: automatically or on an explicit `redistribute` call.
-7. **Bench authoring impact**: how easily can existing Megatron-style benches be ported
-   to DTensor?
-
---
-
-## Non-goals (for future real ADR)
-
- Finalizing the API in this stub. It will be fleshed out in a future ADR.
- Implementation timeline. This round is **design-space mapping only**.
-
---
-
-## Dependencies (potential)
-
- **ADR-0024** (launcher): rank / device foundation
- **ADR-0026** (DPPolicy cleanup): clarifies the separation from DTensor placement
- **ADR-0027** (Megatron TP): feeds practical TP experience back into the DTensor design
- **Future ADR** (operator dispatch layer): introduces operator
-  overloading on the KernBench Tensor
-
---
-
-## Expected consequences (hypothetical)
-
-### Positive
-
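- Porting PyTorch training code becomes **much easier** (DTensor code as-is).
- TP + DP + more complex parallelism expressed in **a single abstraction**.
- Collective insertion is automatic → less burden on bench authors.
-
-### Negative
-
- A new operator dispatch layer must be built → substantial engineering.
- More implicit behavior → harder debugging / performance analysis.
- Tension with KernBench's "explicit kernel launch" philosophy.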
-
---
-
-## Action
-
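- **Phase 1 (current)**: keep this stub. Implement Megatron-style (ADR-0027) first and
-  accumulate usage experience.
- **Phase 2 (future)**: promote this ADR to a real design based on that experience.
-  Provide answers to the open questions above.
- **Phase 3 (future)**: Implementation.
-
-There is currently **no** implementation work. Design-space mapping only.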
-
---
-
-## Affected files
-
-This ADR is a **stub**, so there are no production changes. Candidate files to be updated in a
-future real ADR:
-
-| File | Expected change (future) |
-|------|---|
-| `src/kernbench/dtensor/__init__.py` | new package |
-| `src/kernbench/dtensor/device_mesh.py` | DeviceMesh |
-| `src/kernbench/dtensor/placements.py` | Shard/Replicate/Partial |
-| `src/kernbench/dtensor/api.py` | distribute_tensor, redistribute |
-| `src/kernbench/dtensor/ops/*.py` | Operator dispatch (matmul, etc.) |
-| `src/kernbench/runtime_api/tensor.py` | add `__matmul__` etc. to Tensor |
@@ -1,347 +0,0 @@
-# ADR-0030: IPCQ Physical Addressing — PhysAddr integration
-
-## Status
-
-Proposed
-
-## Context
-
-### Goal
-
-Migrate the IPCQ ring buffer addressing scheme from ADR-0023's **synthetic parallel namespace**
-(`_IPCQ_BASE = 1<<60`) to the **PhysAddr of ADR-0001**. This restores consistency across
-routing / allocator / MemoryStore and expresses the physical backing for each buffer_kind
-(tcm/hbm/sram) in structural coordinates.
-
-### Current state (ADR-0023 D2.5)
-
-`src/kernbench/ccl/install.py:52-56`:
-
-```python
-_IPCQ_BASE = 1 << 60
-def _ipcq_base_for_pe(sip, cube, pe):
-    return _IPCQ_BASE | (sip << 40) | (cube << 32) | (pe << 24)
-
-def rx_base(s, c, p, d):
-    return _ipcq_base_for_pe(s, c, p) + direction_idx[d] * bytes_per_direction
-```
-
- Uses **bit 60** → outside ADR-0001's 51-bit PhysAddr space (`MAX_51 = (1 << 51) - 1`)
- `PhysAddr.decode(addr)` → `PhysAddrError("addr must be a 51-bit value")`
- `IpcqEndpoint.rx_base_pa: int`: the type is a raw int with no structure
- `buffer_kind` (tcm/hbm/sram) is not coupled to the synthetic address
- Bypasses the allocator (`PEMemAllocator`): a synthetic unique id per (sip, cube, pe,
-  direction), not a real physical allocation
-
-ADR-0023 D2.5, original text:
-
-> This bypasses the topology's address resolver / PhysAddr encoding and
-> treats IPCQ buffers as a separate, parallel address namespace. Real PA
-> encoding can be plugged in later without changing the rest of the design.
-
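-This ADR is that "later".
-
-### Why address this now
-
- ADR-0025 (direction addressing) moves to address-based matching. Addresses now contribute
-  directly to correctness → the addressing scheme matters more from a design standpoint
- Continued violation of ADR-0001's "Routing consumes decoded domains, not raw bit-fields"
-  contract → technical debt
- The routing fabric (cube_noc / UCIe) determines the destination with PhysAddr.decode().
-  How IPCQ's synthetic addresses are actually handled by fabric routing is **not
-  verified** (presumed to be delivered via a separate path)
- The actual TCM / HBM / SRAM memory layout and the IPCQ ring buffer locations are **disjoint**
-  → the allocator is unaware of the IPCQ region, so accidental overlap is possible (today bit 60
-  keeps them fully separate so there is no problem, but it is unhealthy as a design principle)
-
-### Problems to solve
-
-1. **PhysAddr representation of IPCQ ring buffers**: which PhysAddr factory to use
-   for each buffer_kind.
-2. **Possible PhysAddr space shortage**: whether the 51-bit space has room for IPCQ buffers.
-3. **Allocator integration**: add an IPCQ buffer reservation feature to `PEMemAllocator`, or
-   allocate normally from the existing pool.
-4. **MemoryStore space naming cleanup**: spaces are currently distinguished by the strings `{"tcm", "hbm", "sram"}`.
-   If IPCQ buffers also belong to these spaces, address overlap with regular data must be prevented.
-5. **Routing fabric integration**: PhysAddr-based routing delivers IPCQ tokens to the correct
-   memory on the correct SIP.
-6. **Consistency with ADR-0025**: address-based matching works the same with PhysAddr.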
-
---
-
-## Decision
-
-### D1. IPCQ ring buffer = use a PhysAddr factory
-
-Each `buffer_kind` calls its corresponding PhysAddr factory:
-
-| buffer_kind | PhysAddr factory | Required arguments |
-|---|---|---|
-| `tcm` | `PhysAddr.pe_tcm_addr(rack_id, sip_id, cube_id, pe_id, tcm_offset)` | PE-local TCM |
-| `hbm` | `PhysAddr.pe_hbm_addr(rack_id, sip_id, cube_id, pe_id, pe_local_hbm_offset, slice_size_bytes)` | PE-local HBM slice |
-| `sram` | `PhysAddr.cube_sram_addr(rack_id, sip_id, cube_id, sram_offset)` | Cube-shared SRAM |
-
-The install plan builder (`build_install_plans` in ADR-0024) computes each PE's rx_base
-as follows:
-
-```python
-# install_plan.py after ADR-0030 (pseudocode)
-def _compute_rx_base(sip, cube, pe, direction_idx, buffer_kind, n_slots, slot_size,
-                     allocator_pool, rack_id=0) -> PhysAddr:
-    bytes_per_direction = n_slots * slot_size
-    offset = direction_idx * bytes_per_direction
-
-    if buffer_kind == "tcm":
-        # TCM base (per-PE) + direction offset
-        tcm_base = allocator_pool.reserve_pe_tcm_for_ipcq(sip, cube, pe,
-                                                          total_bytes=N_DIR * bytes_per_direction)
-        return PhysAddr.pe_tcm_addr(rack_id=rack_id, sip_id=sip, cube_id=cube,
-                                      pe_id=pe, tcm_offset=tcm_base + offset)
-    elif buffer_kind == "hbm":
-        hbm_base = allocator_pool.reserve_pe_hbm_for_ipcq(sip, cube, pe,
-                                                          total_bytes=...)
-        return PhysAddr.pe_hbm_addr(rack_id=rack_id, sip_id=sip, cube_id=cube,
-                                      pe_id=pe, pe_local_hbm_offset=hbm_base + offset,
-                                      slice_size_bytes=slice_size)
-    elif buffer_kind == "sram":
-        sram_base = allocator_pool.reserve_cube_sram_for_ipcq(sip, cube,
-                                                               total_bytes=...)
-        return PhysAddr.cube_sram_addr(rack_id=rack_id, sip_id=sip, cube_id=cube,
-                                         sram_offset=sram_base + offset)
-```
-
-The type of `IpcqEndpoint.rx_base_pa` changes to `PhysAddr` (or an encoded `int`):
-
-```python
-@dataclass(frozen=True)
-class IpcqEndpoint:
-    sip: int
-    cube: int
-    pe: int
-    buffer_kind: str
-    rx_base_pa: int            # result of PhysAddr.encode() (51-bit)
-    rx_base_va: int
-    n_slots: int
-    slot_size: int
-```
-
-The type stays int (encoded form), with the invariant that the value **must be recoverable
-with PhysAddr.decode()**. Callers obtain the structural coordinates with
-`PhysAddr.decode(rx_base_pa)`.
-
-### D2. Allocator extension: IPCQ reservation API
-
-Add an IPCQ-specific reservation feature to `PEMemAllocator`:
-
-```python
-class PEMemAllocator:
-    def reserve_ipcq_tcm(self, total_bytes: int) -> int:
-        """Reserve TCM region for IPCQ ring buffers at this PE.
-        Returns tcm_offset (to be used in PhysAddr.pe_tcm_addr)."""
-        # Reserve a contiguous `total_bytes` region in TCM.
-        # It must not overlap tensor allocations.
-
-    def reserve_ipcq_hbm(self, total_bytes: int) -> int: ...
-    # the cube-level allocator is similar
-```
-
-The install plan builder reserves from each PE allocator and passes the resulting offset to
-the PhysAddr factory.
-
-**Remove the existing `_ipcq_base_for_pe` / `_IPCQ_BASE`**.
-
-### D3. MemoryStore space integration
-
-`MemoryStore` currently has the structure `{space_name: {addr: ndarray}}`. IPCQ buffers will
-share the same spaces (tcm/hbm/sram) as regular tensor data. Address uniqueness is guaranteed
-by the PhysAddr hierarchy of ADR-0001.
-
-Backward compatibility: code paths that use the old (synthetic) IPCQ addresses are
-**removed**, and only PhysAddr.encode() results are used. This is a change of values,
-not an API change.
-
-### D4. Routing fabric integration
-
-Because the IPCQ DMA write (`src_addr → dst_addr` of `IpcqDmaToken`) uses PhysAddr encoding,
-the **routing fabric can locate the destination SIP/cube/PE exactly with
-`PhysAddr.decode(dst_addr)`**. No change to fabric routing logic (it is presumed to
-already use PhysAddr.decode).
-
-**Verification needed**: check how the fabric currently routes bit-60 synthetic addresses.
-If a separate path exists, remove it and unify on the PhysAddr path.
-
-### D5. Consistency with ADR-0025
-
-ADR-0025's address-based matching (identifying direction by dst_addr) is naturally
-compatible, since it compares PhysAddr.encode() results. No change.
-
-Debugging / diagnostics can improve, however:
-
-```python
-# e.g. in pointer_dump
-print(f"E: rx_base_pa={PhysAddr.decode(qp.peer.rx_base_pa)}")
-# example output: PhysAddr(sip=1, cube=0, pe=0, kind="pe_resource", unit_type=PE, ...)
-```
-
-The old synthetic addresses cannot be decoded, which degrades diagnostics. The PhysAddr migration fixes this.
-
-### D6. ADR-0023 D2.5 amendment
-
-Change the "bypasses PhysAddr encoding" wording in ADR-0023 to **Accepted fallback → now
-replaced by ADR-0030**. Once this ADR is applied, the ADR-0023 D2.5 promise that "Real PA
-encoding can be plugged in later" is fulfilled.
-
---
-
-## Migration strategy
-
-Staged migration (not done in a single PR):
-
-### Phase 1: Re-examine the PhysAddr space
- Confirm that IPCQ ring buffers actually fit in the 51-bit PhysAddr space.
- Check that the `local_offset` range provided by each buffer_kind (tcm/hbm/sram) factory
-  can accommodate the IPCQ requirement (4 directions × n_slots × slot_size).
- If not, extend the PhysAddr layout itself (requires a separate ADR-0001 amendment).
-
-### Phase 2: Extend the allocator API
- Add `PEMemAllocator.reserve_ipcq_*` methods.
- Prevent region conflicts with existing tensor allocations.
-
-### Phase 3: Switch the install plan builder
- Remove `_ipcq_base_for_pe`; replace it with PhysAddr factory calls.
- `IpcqEndpoint.rx_base_pa` is a PhysAddr.encode() result (51-bit).
-
-### Phase 4: Verify the routing fabric
- Confirm that IPCQ DMA tokens are delivered over the normal fabric path.
- If a separate fast-path exists, remove it and unify.
-
-### Phase 5: Verify MemoryStore spaces
- Check that IPCQ buffer addresses do not overlap existing tensor addresses.
- They are already reserved at the allocator level, so they should be cleanly separated.
-
-### Phase 6: Update ADR-0023 D2.5 + remove the existing sideband path (completion)
-
---
-
-## Dependencies
-
- **ADR-0031** (PhysAddr PE-resource extension), **Blocker**: the schema must first be extended
-  so that PhysAddr can adequately represent PE resources (IPCQ ring buffers in particular).
-  This ADR can only be executed after ADR-0031 is complete.
- **ADR-0001** (PhysAddr layout): the foundation of this ADR. Uses the ADR-0031 extension of
-  the 51-bit space / factory API.
- **ADR-0023** (IPCQ protocol): this ADR fulfils the "later" promise of ADR-0023 D2.5.
-  The D9 piggyback / credit return protocol itself is unchanged.
- **ADR-0024** (launcher + install_plan.py): `build_install_plans` will call the PhysAddr
-  factories.
- **ADR-0025** (direction addressing): address-based matching works the same with
-  PhysAddr. No change.
-
---
-
-## Non-goals
-
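- **Changing the ADR-0001 PhysAddr layout itself**: the 51-bit space and segment structure are kept.
-  A separate ADR if that proves insufficient.
- **Changing IPCQ protocol semantics**: protocol logic such as the ADR-0023 D9 piggyback is kept.
- **Redesigning the allocator as a whole**: only the IPCQ reservation API is added.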
-
---
-
-## Open questions
-
-### 🔴 Critical: must be verified before migration
-
- **Do IPCQ buffers actually fit in the 51-bit PhysAddr space**: in each PE's TCM
-  region, `4 directions × n_slots (default 4) × slot_size (default 4KB)` =
-  64KB fits in the PE TCM space. Ample compared with the TCM size (e.g., 16MB). HBM also has
-  plenty of headroom. SRAM is shared per cube, so there is a direction × PE multiplier; needs separate verification.
- **Current IPCQ address handling in the routing fabric**: need to trace how synthetic addresses
-  are routed in the fabric today. If values that `PhysAddr.decode()` cannot read are
-  delivered correctly by the fabric, investigate which path they take.
-
-### 🟡 Nice-to-have
-
- **IPCQ-specific kind / sub_offset encoding**: the sub_offset space of `UnitType.PE` is
-  shared with IPCQ. Whether to define an IPCQ-specific sub-space to avoid collisions.
- **Debug tool**: improve `pointer_dump` with PhysAddr formatting.
-
---
-
-## Test strategy
-
-### T1. PhysAddr round-trip
-
-`tests/test_ipcq_physaddr.py` (new):
- `PhysAddr.pe_tcm_addr(...)` → encode → decode → same fields recovered
- For each of the TCM / HBM / SRAM factories
-
-### T2. Allocator reservation
-
-`tests/test_ipcq_alloc.py` (new):
- `PEMemAllocator.reserve_ipcq_tcm` → the returned offset is in a valid TCM region
- Duplicate reservation → error or non-overlapping offset
- No conflict with tensor allocation
-
-### T3. Install plan PhysAddr integration
-
-`tests/test_ccl_install_plan.py` (extended):
- `rx_base_pa` in the `build_install_plans` result can be decoded with PhysAddr.decode()
- The decoded coordinates match the plan's (sip, cube, pe)
- I3.1 invariant (ADR-0025 D6): rx_base range disjointness also holds with PhysAddr
-
-### T4. Routing: IPCQ DMA fabric traversal
-
-`tests/test_ipcq_routing.py` (new):
- Cross-SIP IPCQ send → the fabric determines the destination SIP exactly with
-  `PhysAddr.decode(dst_addr)` → writes to the correct MemoryStore
- Verify both the UCIe path and the cube_noc path
-
-### T5. Regression
-
- All existing IPCQ E2E tests (ring, mesh, tree) pass
- ADR-0024 and ADR-0025 integration tests pass
-
---
-
-## Consequences
-
-### Positive
-
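- **ADR-0001 consistency restored**: routing and addressing use a single scheme.
- **Clear buffer_kind**: TCM/HBM/SRAM are distinguished by structural coordinates.
- **Better debugging**: PhysAddr.decode() yields human-readable coordinates.
- **Allocator integration**: IPCQ regions are reserved properly → conflict risk with tensors is blocked up front.
- **Unified fabric routing**: reuses the existing PhysAddr-based routing with no separate path.
-
-### Negative
-
- **Migration complexity**: a staged six-phase migration is required, with regression risk at each phase.
- **PhysAddr space verification burden**: Phase 1 must measure whether the TCM/HBM/SRAM spaces
-  can accommodate the IPCQ requirement.
- **Routing fabric verification**: how the fabric currently handles synthetic addresses
-  needs investigation.
-
-### Neutral
-
- IPCQ protocol semantics (ADR-0023 D9, etc.) unchanged.
- ADR-0025 direction addressing logic unchanged.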
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/ccl/install.py` | remove `_IPCQ_BASE`, `_ipcq_base_for_pe` |
-| `src/kernbench/ccl/install_plan.py` (ADR-0024) | D1: compute rx_base via PhysAddr factory calls |
-| `src/kernbench/policy/address/allocator.py` (or similar) | D2: IPCQ reservation API (`reserve_ipcq_tcm`, etc.) |
-| `src/kernbench/common/ipcq_types.py` | D1: document `IpcqEndpoint.rx_base_pa` as a PhysAddr.encode result |
-| `src/kernbench/sim_engine/memory_store.py` | D3: verify that IPCQ buffers share the existing spaces |
-| `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing uses the PhysAddr-based fabric path |
-| `src/kernbench/ccl/diagnostics.py` | D5: improve pointer_dump with PhysAddr formatting |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note |
-| `tests/test_ipcq_physaddr.py` (new) | T1 |
-| `tests/test_ipcq_alloc.py` (new) | T2 |
-| `tests/test_ccl_install_plan.py` | T3 extension |
-| `tests/test_ipcq_routing.py` (new) | T4 |
@@ -146,7 +146,7 @@ At each `dist.all_reduce(tensor)` call:
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
-   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
+   handles after all workers submit (per ADR-0027 D0.4).

 ### D6. Config schema

@@ -10,7 +10,7 @@ The simulator is an analytical, event-driven performance model — not a
 cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
 or omitted by design. To keep the model auditable and reviewable as a whole,
 this ADR consolidates the assumptions in one place. Individual component ADRs
-(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
 the *limits of fidelity*.

 ## Decisions
@@ -21,7 +21,7 @@ the *limits of fidelity*.
  ADR-0015 D2.
 - **Per-component switching/overhead latency** (`overhead_ns` attr).
 - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
-  with global round-robin chunking. Burst granularity tunable
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
  (`burst_bytes`, default 256B). Read and write share each PC's
  `available_at` (real HW command bus is per-PC shared).
 - **HBM direction switching penalty mechanism**: per-PC last-direction
@@ -66,8 +66,8 @@ the *limits of fidelity*.
 ### D3. Ignored (out of scope)

 - Bank-level row buffer conflict penalty (assume no conflicts — best case;
-  round-robin chunk assignment is address-blind so we cannot detect same-bank
-  reuse).
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
 - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
  `burst_time = burst_bytes / pc_bw_gbs`).
 - Refresh, ECC, thermal throttling, power gating.
@@ -110,29 +110,6 @@ below are different concerns, ordered by expected workload impact.

 **Higher impact (workload accuracy gap)**:

- [ ] **Address-based PC selection at HBM CTRL** (replace the
-  address-blind global round-robin). Compute the PC index from
-  the HBM byte offset using parameters already in topology config:
-
-      pc_shift = log2(burst_bytes)        # default 8 (burst=256B)
-      pc_mask  = num_pcs - 1              # default 7 (8 PCs)
-      pc       = (hbm_offset >> pc_shift) & pc_mask
-
-  For the default `burst_bytes=256, num_pcs=8` this places the PC
-  select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
-  the within-burst offset (same PC), bits [10:8] are the 3-bit PC
-  index, and bits [36:11] are row/bank/column within the PC slice.
-  Shift/mask are derived from topology config rather than hardcoded
-  so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
-  See `src/kernbench/policy/address/phyaddr.py` for the canonical
-  comment.
-
-  Real-HW workloads where this matters most: (a) strided multi-
-  transaction streams that under global-RR collide on the same PCs
-  but under address-striping land on disjoint sets; (b) offset-
-  disjoint parallel transfers where address-striping preserves
-  parallelism while global-RR re-serializes them. Directly affects
-  multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
  `track_banks: true`). Currently we assume no same-bank reuse;
  random scatter/gather workloads are optimistic here.
@@ -169,7 +146,7 @@ below are different concerns, ordered by expected workload impact.
  touching latency must update the relevant section here.
 - Workload-specific magnitude error envelopes are explicit.
 - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-  enforces the ADR-0019 D9 invariant in code rather than relying on yaml
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
  manual consistency.
 - Wire transfer time is charged once per bottleneck-link transit (Phase 2c
  per-flit timing) rather than via terminal `drain_ns` injection. Single
@@ -180,5 +157,6 @@ below are different concerns, ordered by expected workload impact.
 ## Cross-references

 - ADR-0015 — component / port / wire model.
- ADR-0019 — NoC and local HBM topology.
+- ADR-0017 — Cube NOC architecture and HBM connectivity.
 - ADR-0004 — memory semantics, local HBM.
+- ADR-0034 — HBM controller internal design.
@@ -0,0 +1,271 @@
+# ADR-0034: HBM Controller Internal Design
+
+## Status
+
+Accepted
+
+## Context
+
+`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
+the cube NOC. One instance is created per PE under the topology node
+`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
+(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
+burst-granular commit timing, address-based PC selection, and response
+routing back to the requester.
+
+This ADR documents the component as currently implemented. ADR-0017 D4/D8
+defines *where* HBM CTRL attaches and *what* aggregate BW it must
+deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
+scope. This ADR fills the gap between those two — the per-instance
+internal scheduling model.
+
+## Decision
+
+### D1. Role
+
+`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
+PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
+attaches to that PE's router via the `peX.hbm` attachment list in
+`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
+(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
+into one endpoint.
+
+The component models:
+
+- Per-PC scheduling (D2) with R/W command-bus sharing.
+- Address-based PC selection (D3).
+- Burst-granular commit timing (D4).
+- Flit-aware per-flit PC commit and async finalize (D5, D6).
+- Command-only Transaction handling for read-data drain (D7).
+- Response routing back to the requester (D8).
+
+It does not model:
+
+- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
+  (ADR-0033 D3).
+- Cross-PE HBM contention beyond its own router edge (handled by the
+  router mesh — ADR-0017 D3).
+- 1:1 channel mode (ADR-0017 D8 future work).
+
+### D2. Per-PC scheduling model
+
+Per-instance state initialised in `start()`:
+
+- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
+  `num_pcs`, initial 0.0.
+- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
+  each PC, used for switch-penalty detection (D4); initial `None`.
+
+`num_pcs` and `burst_bytes` must each be a positive power of two so
+that address-based PC selection (D3) reduces to a shift-and-mask.
+
+Read and write requests share the same `_pc_avail` slot per PC — the
+real HW per-PC command bus is shared between read and write traffic, so
+issuing a write to PC k blocks a subsequent read to PC k by exactly the
+burst time.
+
+Direction `dir` for a request is inferred from the request type:
+
+- `MemoryWriteMsg` → `"W"`.
+- `PeDmaMsg` with `is_write=True` → `"W"`.
+- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
+
+### D3. Address-based PC selection
+
+PC index for an access is derived from the access address by shift and
+mask:
+
+```text
+pc_shift = log2(burst_bytes)         # default 8  (burst=256B)
+pc_mask  = num_pcs - 1               # default 7  (8 PCs)
+pc       = (address >> pc_shift) & pc_mask
+```
+
+The shift and mask are computed once in `start()` from topology config so
+alternative `(burst_bytes, num_pcs)` pairs stay consistent. For the
+canonical default `(256, 8)` this places the PC select field at bits
+`[10:8]` of the HBM byte offset: bits `[7:0]` are within-burst (same PC),
+bits `[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
+within the PC slice (see `phyaddr.py` comment).
+
+Address-based striping — as opposed to address-blind global
+round-robin — preserves PC parallelism for offset-disjoint concurrent
+transfers: each transfer's bursts land deterministically on the PC set
+implied by its byte addresses, so multi-PE workloads accessing disjoint
+regions do not collide on a single PC.
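
The shift-and-mask in D3 can be sketched in a few lines of Python. This is an illustrative sketch only, assuming the default `burst_bytes=256` and `num_pcs=8`; the helper name `make_pc_selector` is hypothetical and is not taken from the codebase (the component's own method is referred to as `_pc_for_address` in D5).

```python
def make_pc_selector(burst_bytes: int = 256, num_pcs: int = 8):
    """Return a function mapping a byte address to a PC index (hypothetical helper)."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        # Both must be positive powers of two so shift-and-mask is exact (D2).
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1   # log2(burst_bytes)
    pc_mask = num_pcs - 1

    def pc_for_address(address: int) -> int:
        return (address >> pc_shift) & pc_mask

    return pc_for_address


pc_for_address = make_pc_selector()
# Nine consecutive 256-byte bursts: the first eight hit PCs 0..7, the ninth wraps to PC 0.
stripe = [pc_for_address(base) for base in range(0, 9 * 256, 256)]
```

With the defaults, every byte inside one 256-byte burst maps to the same PC, and a contiguous transfer walks the PCs in order before wrapping.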
+
+### D4. Burst granularity and PC commit timing
+
+A single PC commit takes:
+
+```text
+chunk_time = burst_bytes / pc_bw_gbs    # ns
+```
+
+- `burst_bytes` (default 256) is the burst granularity matching the
+  flit size (ADR-0033 D1).
+- `pc_bw_gbs` is **builder-derived** from
+  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
+  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
+  router-to-HBM link BW.
+
+Per-PC commit scheduling for an arriving access on PC `pc` with
+direction `dir`:
+
+```text
+switch_cost = switch_penalty_ns
+              if pc_last_dir[pc] not in (None, dir) else 0
+start  = max(env.now, pc_avail[pc]) + switch_cost
+finish = start + chunk_time
+pc_avail[pc]    = finish
+pc_last_dir[pc] = dir
+```
+
+Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
+scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
+model pessimistic per-alternation cost.
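
A minimal Python sketch of the per-PC commit schedule above. The class name `PcSchedule` and the value `pc_bw_gbs=32.0` are assumptions chosen for illustration (1 GB/s is treated as 1 byte/ns, so a 256-byte burst takes 8 ns); the real component keeps this state in `_pc_avail` and `_pc_last_dir`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PcSchedule:
    """Hypothetical stand-in for the per-PC state described in D2 and D4."""
    num_pcs: int = 8
    burst_bytes: int = 256
    pc_bw_gbs: float = 32.0          # assumed value; 1 GB/s == 1 byte/ns
    switch_penalty_ns: float = 0.0
    pc_avail: List[float] = field(default_factory=list)
    pc_last_dir: List[Optional[str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.pc_avail = [0.0] * self.num_pcs
        self.pc_last_dir = [None] * self.num_pcs

    def commit(self, now: float, pc: int, direction: str) -> float:
        """Schedule one burst on `pc` and return its finish time in ns."""
        chunk_time = self.burst_bytes / self.pc_bw_gbs
        last = self.pc_last_dir[pc]
        # The penalty applies only when the PC's previous commit went the other way.
        switch_cost = self.switch_penalty_ns if last not in (None, direction) else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish


sched = PcSchedule(switch_penalty_ns=2.0)
first_write = sched.commit(now=0.0, pc=3, direction="W")    # idle PC
read_same_pc = sched.commit(now=0.0, pc=3, direction="R")   # queues behind the write, pays the switch cost
read_other_pc = sched.commit(now=0.0, pc=4, direction="R")  # independent PC, no wait
```

The read on the same PC is delayed by both the earlier write and the switch penalty, while the read on a different PC finishes as early as the first write does.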
+
+### D5. Flit-aware per-flit PC commit (primary path)
+
+`_handle_flit` is the primary worker path. For each arriving `Flit`:
+
+1. On the **first** flit of a transaction (`tid = id(txn)` not in
+   `_txn_state`):
+   - Apply `overhead_ns` once via `run(env, nbytes)` — header decode
+     model, first-flit overhead pattern (ADR-0033 D1).
+   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
+2. Compute `pc = _pc_for_address(flit.address)` (D3).
+3. Apply the per-PC schedule (D4) using the request direction (D2).
+4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
+5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
+   (D6).
+
+Per-flit address-aware commit is the mechanism that lets concurrent
+multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
+in parallel.
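A simplified sketch of steps 1-5, assuming the default geometry and omitting `overhead_ns` and the switch penalty for brevity. The function `handle_flit`, the `finalized` dict, and the integer transaction ids are illustrative stand-ins, not the component's real interfaces.

```python
BURST_BYTES, NUM_PCS, CHUNK_NS = 256, 8, 8.0   # defaults from D4 / D9

pc_avail = [0.0] * NUM_PCS
txn_state = {}            # tid -> {"last_finish": ns}
finalized = {}            # tid -> last_finish handed to the finalize step


def handle_flit(now, tid, address, is_last):
    """Commit one flit on the PC chosen by its address."""
    state = txn_state.setdefault(tid, {"last_finish": now})  # first flit creates state
    pc = (address // BURST_BYTES) % NUM_PCS
    start = max(now, pc_avail[pc])
    finish = start + CHUNK_NS
    pc_avail[pc] = finish
    state["last_finish"] = max(state["last_finish"], finish)
    if is_last:
        finalized[tid] = txn_state.pop(tid)["last_finish"]


# Transaction 1: four flits striped over PCs 0..3, all arriving at t=0.
for i in range(4):
    handle_flit(0.0, tid=1, address=i * 256, is_last=(i == 3))

# Transaction 2: four flits that all map to PC 4 (stride = BURST_BYTES * NUM_PCS).
for i in range(4):
    handle_flit(0.0, tid=2, address=0x400 + i * 2048, is_last=(i == 3))
```

The striped transaction finishes after one `chunk_time`, while the transaction whose flits all hit one PC takes four, which is the contrast the per-flit commit is meant to capture.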
+
+### D6. Async finalize per transaction
+
+When a transaction's last flit has been scheduled, finalisation runs in
+a separately-spawned process:
+
+```python
+def _finalize_txn(env, txn, last_finish):
+    wait = last_finish - env.now
+    if wait > 0:
+        yield env.timeout(wait)
+    yield from _send_response(env, txn)
+```
+
+`_handle_flit` spawns this via `env.process(...)` and returns
+immediately, so the worker can pick up the next inbox message while the
+last PC commit drains.
+
+Without this split — i.e. if the worker itself did
+`yield env.timeout(wait)` — concurrent single-flit transactions whose
+addresses hit distinct PCs would still serialise at `chunk_time` each
+inside the worker, hiding the PC parallelism that D3 and D5 are
+designed to expose.
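The effect of the split can be illustrated with a small timing model. This is not the SimPy code: `completion_times` is a hypothetical helper that assumes every transaction is a single flit, is already queued at t=0, and targets a distinct idle PC.

```python
CHUNK_NS = 8.0  # burst_bytes / pc_bw_gbs with the defaults (256 / 32.0)


def completion_times(num_txns, async_finalize):
    """Completion time of each single-flit txn, all queued at t=0 on distinct PCs."""
    worker_now = 0.0
    done = []
    for _ in range(num_txns):
        finish = worker_now + CHUNK_NS      # each PC is idle, so the commit starts at once
        done.append(finish)
        if not async_finalize:
            worker_now = finish             # worker blocks until the commit drains
        # with async finalize the worker moves on without advancing time
    return done


blocking = completion_times(4, async_finalize=False)
spawned = completion_times(4, async_finalize=True)
```

In the blocking variant the four completions are spread 8 ns apart even though no PC is shared; with the spawned finalize they all complete at 8 ns.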
+
+### D7. Non-flit fallback for command-only transactions
+
+`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
+`Flit`. This is the path for command-only requests that the wire does
+not chunk into flits — most notably `MemoryReadMsg` whose command txn
+carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
+not as inbound flits).
+
+Procedure:
+
+1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
+   — for read commands, work is sized by the request.
+2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
+   0.
+3. `chunk_interval = drain_ns / n_chunks` (when both are > 0): chunks
+   are scheduled `drain_ns / n_chunks` ns apart to model the data
+   arrival rate of the bottleneck link (ADR-0033 D1 chunk-loop drain).
+4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
+5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
+   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
+6. After scheduling all chunks, wait `last_finish - env.now` then call
+   `_send_response`.
+
+`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
+`_handle_flit` — there is exactly one source of PC scheduling truth
+across both paths.
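A sketch of steps 2, 3, 5 and 6 of the procedure, under the default geometry. `schedule_command` is a hypothetical helper; it leaves out `overhead_ns` and the switch penalty, and it returns the computed finish time instead of waiting and sending a response.

```python
import math

BURST_BYTES, NUM_PCS, CHUNK_NS = 256, 8, 8.0


def schedule_command(now, work_bytes, drain_ns, base_address, pc_avail):
    """Return (n_chunks, last_finish) for a command-only transaction."""
    n_chunks = math.ceil(work_bytes / BURST_BYTES) if work_bytes > 0 else 0
    interval = drain_ns / n_chunks if n_chunks > 0 and drain_ns > 0 else 0.0
    last_finish = now
    for i in range(n_chunks):
        now += interval                                   # next chunk arrives from the link
        pc = ((base_address + i * BURST_BYTES) // BURST_BYTES) % NUM_PCS
        start = max(now, pc_avail[pc])
        pc_avail[pc] = start + CHUNK_NS
        last_finish = max(last_finish, pc_avail[pc])
    return n_chunks, last_finish


pc_avail = [0.0] * NUM_PCS
result = schedule_command(0.0, work_bytes=1000, drain_ns=32.0,
                          base_address=0, pc_avail=pc_avail)
empty = schedule_command(0.0, work_bytes=0, drain_ns=32.0,
                         base_address=0, pc_avail=[0.0] * NUM_PCS)
```

A 1000-byte read rounds up to four bursts; with a 32 ns drain they arrive 8 ns apart and the last commit finishes at 40 ns.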
+
+### D8. Response routing
+
+`_send_response` dispatches on request type and path geometry:
+
+| Case | Trigger | Response |
+| --- | --- | --- |
+| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
+| Bypass (Memory Read) | no hop in `txn.path` contains `m_cpu` AND request is `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
+| Bypass (Memory Write) | no hop in `txn.path` contains `m_cpu` AND request is not a Memory Read | `txn.done.succeed()` (write completes locally) |
+| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
+
+The "bypass" classification matches the Memory R/W fabric path defined
+in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
+without M_CPU). The PE_DMA case is its own dedicated reverse-path to
+keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
+ResponseMsg envelope).
+
+In all reverse-path cases, the response Transaction is put onto
+`out_ports[reverse_path[1]]` — the first hop back along the recorded
+forward path. If `reverse_path` has fewer than 2 entries (degenerate
+path), the original `txn.done` is signalled directly.
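The dispatch table can be read as a small classifier. The sketch below uses stand-in dataclasses in place of the real message and Transaction types, and the node names in the example paths are hypothetical; it only shows the order in which the cases are tested.

```python
from dataclasses import dataclass, field


@dataclass
class PeDmaMsg:            # stand-in, fields omitted
    pass


@dataclass
class MemoryReadMsg:       # stand-in
    nbytes: int = 0


@dataclass
class MemoryWriteMsg:      # stand-in
    nbytes: int = 0


@dataclass
class Txn:                 # stand-in for Transaction
    request: object
    path: list = field(default_factory=list)


def classify_response(txn):
    """Return the name of the table row that applies to `txn`."""
    if isinstance(txn.request, PeDmaMsg):
        return "pe_dma"
    bypass = not any("m_cpu" in hop for hop in txn.path)
    if bypass and isinstance(txn.request, MemoryReadMsg):
        return "bypass_read"
    if bypass:
        return "bypass_write"
    return "default"


bypass_path = ["pcie_ep", "io_noc", "ucie", "router", "hbm_ctrl"]
mcpu_path = ["io_cpu", "m_cpu", "router", "hbm_ctrl"]
```

The PE_DMA check comes first, so a PE_DMA request is never classified as bypass even when its path avoids M_CPU.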
+
+### D9. Configurable attributes
+
+| Attribute | Default | Source | Notes |
+| --- | --- | --- | --- |
+| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
+| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
+| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
+| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
+| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
+| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
+
+`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
+directly so the aggregate per-PE BW matches the router-to-HBM link BW
+without yaml-side duplication.
+
+## Consequences
+
+### Positive
+
+- Address-based PC selection preserves multi-stream HBM parallelism
+  that an address-blind round-robin would collapse — important for
+  multi-PE workloads with disjoint HBM regions.
+- Flit-aware path (D5) + async finalize (D6) preserves wormhole
+  pipelining and exposes PC parallelism for back-to-back single-flit
+  transactions.
+- Single source of PC scheduling truth (D4 mechanism, used by both D5
+  flit path and D7 chunk-loop path).
+- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
+  discipline.
+
+### Negative
+
+- No bank-level conflict modelling within a PC; address-blind to
+  bank/row-buffer reuse (ADR-0033 D3).
+- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
+  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
+  (ADR-0033 D2).
+- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
+  accumulates per concurrent transaction and is removed only on
+  `is_last`. Adequate for current workloads.
+
+## Links
+
+- ADR-0001 (Physical address layout — PC bit field comment)
+- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
+- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
+- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
+  implements)
+- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
+  resolution)
+- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
+  flit-aware PC commit, first-flit overhead, chunk-loop drain)
+- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
@@ -0,0 +1,286 @@
+# ADR-0035: M_CPU and M_CPU.DMA Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+M_CPU is the cube-level command processor. It receives commands from
+IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
+M_CPU as a fallback), fans them out to the PEs in its cube, and
+aggregates per-PE responses into a single ResponseMsg sent back to
+IO_CPU on the reverse path.
+
+M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
+fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
+it lives as internal state of `MCpuComponent`.
+
+This ADR documents the M_CPU component implementation that realizes
+those responsibilities, including the three distinct fan-out paths
+(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
+model, and the response aggregation contract.
+
+## Decision
+
+### D1. Role
+
+M_CPU has three responsibilities:
+
+1. **Transit forwarding** — when not the terminal hop (e.g., on the
+   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
+   to `next_hop` in their pre-computed path.
+2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
+   fan-out paths based on request type (D2).
+3. **Response aggregation** — collects per-PE responses, sends a
+   single aggregate ResponseMsg back to IO_CPU on the reverse path.
+
+Per invocation (`run()`): applies `overhead_ns` once per incoming
+Transaction.
+
+M_CPU does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
+  (ADR-0014).
+- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
+  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
+- Interpret tensor or kernel semantics — fan-out dispatch by Python
+  isinstance check only.
+
+### D2. Three fan-out paths dispatched by request type
+
+At the terminal hop the worker dispatches by request type:
+
+```python
+elif self.ctx is not None and txn.request is not None:
+    if isinstance(txn.request, KernelLaunchMsg):
+        env.process(self._kernel_launch_fanout(env, txn))
+    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
+        env.process(self._mmu_msg_fanout(env, txn))
+    else:
+        env.process(self._dma_fanout(env, txn))
+```
+
+Each path uses a different router method:
+
+- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
+  M_CPU-specific DMA path that avoids PE pipeline nodes.
+- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
+  generic NOC command path to PE_CPU.
+- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
+  path to PE_MMU.
+
+### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
+
+`MCpuComponent.start()` initializes two SimPy resources:
+
+```python
+self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
+self._dma_read  = simpy.Resource(env, capacity=1)  # MemoryReadMsg
+```
+
+Properties:
+
+- **Not a topology node** — managed entirely inside `MCpuComponent`;
+  does not appear in `topology.yaml` or in the compiled graph.
+- **Independent read and write channels** — concurrent in-flight
+  Memory R/W is allowed.
+- **Capacity=1 per channel** serializes the **dispatch step**
+  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
+  R/W requests at this M_CPU. Actual fabric transfer time is modeled
+  by wire processes between components (ADR-0015 D2) and by
+  `drain_ns` at terminal hops; the DMA resource does not gate
+  transfer duration.
+
+Resource selection is request-type-based:
+
+```python
+dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
+```
+
+### D4. Transit forwarding at non-terminal hops
+
+When `txn.next_hop` is not None — typical for the reverse response
+path (PE → M_CPU → IO_CPU) — the worker forwards normally:
+
+```python
+if next_hop:
+    yield self.out_ports[next_hop].put(txn.advance())
+```
+
+The fan-out branches fire only at the terminal hop. The same component
+therefore serves both forward command dispatch and reverse response
+relay roles.
+
+### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
+
+For each Memory R/W request at terminal hop:
+
+1. `_resolve_dma_destinations(request)` returns a per-PE
+   `hbm_ctrl.pe{X}` derived from the request's PA via
+   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
+2. For each destination:
+   - Acquire the appropriate DMA resource (`_dma_write` or
+     `_dma_read`) via `with dma_res.request() as req`.
+   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
+   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
+   - Create sub-Transaction carrying `drain_ns` and dispatch to
+     `path[1]`.
+3. Track `max_drain_ns` across destinations and record it as
+   `txn.result_data["xfer_ns"]` after all responses arrive.
+4. After all per-PE responses are collected (D8), send an aggregate
+   ResponseMsg on the reverse command path back to IO_CPU.
+
+PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
+no such node exists after ADR-0017 D4's per-PE partitioning. Kept
+defensively but does not route to a real destination.
+
+### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
+
+For `KernelLaunchMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
+   `ctx.router.find_node_path()`.
+3. **`target_start_ns` handling** (ADR-0009 D5):
+   - If the request already carries `target_start_ns` (stamped by
+     IO_CPU per ADR-0036 D3): **pass through unchanged**.
+   - If absent (direct-to-M_CPU launch in unit tests): compute a
+     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
+     via `dataclasses.replace`.
+4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
+   control message; preserving nbytes=0 keeps fan-out off the shared
+   first-hop fabric BW, mirroring ADR-0036 D4).
+5. After all per-PE responses arrive (D8), aggregate per-PE metrics
+   from each sub-Transaction's `result_data` into the parent
+   transaction:
+
+   ```python
+   txn.result_data["pe_exec_ns"]  = max(existing, max(pe_exec_values))
+   txn.result_data["dma_ns"]      = max(existing, max(dma_values))
+   txn.result_data["compute_ns"]  = max(existing, max(compute_values))
+   ```
+
+   The max-merge with the existing value matters because cross-cube
+   IO_CPU fan-out shares the same parent `result_data`; merging
+   prevents one cube from clobbering another's metric.
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
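The max-merge in step 5 can be sketched as a small helper. `merge_max` is a hypothetical name, and treating a missing key as 0.0 is an assumption of this sketch.

```python
def merge_max(result_data, key, values):
    """Fold per-PE values into result_data[key] without lowering an existing value."""
    if values:
        result_data[key] = max(result_data.get(key, 0.0), max(values))


parent = {"pe_exec_ns": 120.0}                  # already written by another cube
merge_max(parent, "pe_exec_ns", [90.0, 110.0])  # this cube is faster: value is kept
merge_max(parent, "dma_ns", [30.0, 45.0])       # first writer for this key
merge_max(parent, "compute_ns", [])             # no samples: key is left untouched
```

A plain assignment in the first call would have overwritten 120.0 with 110.0, which is the clobbering the merge is there to prevent.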
+
+### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
+
+For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
+
+1. `_resolve_pe_ids(target_pe)` → PE ids.
+2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
+   `find_node_path()`.
+3. Dispatch sub-Transactions with `nbytes=0`.
+4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
+   back. Instead, the sub-Transaction's own `sub_done` event is the
+   completion signal.
+5. Wait for all `sub_done` events in-line (does **not** use
+   `_pending` counter — D8 is for response-bearing fan-out only).
+6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
+
+### D8. Response aggregation (`_pending` + `_parent_txns`)
+
+For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
+arriving on the reverse path):
+
+```python
+self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
+self._parent_txns: dict[str, Any] = {}
+```
+
+- On dispatch: register `(expected, received=0, all_done)` and
+  remember the parent transaction.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`, which increments `received` and
+  signals `all_done` when `received >= expected`.
+- After `yield all_done`, the fan-out path constructs the aggregate
+  ResponseMsg:
+
+  ```python
+  resp_msg = ResponseMsg(
+      correlation_id=request.correlation_id,
+      request_id=request.request_id,
+      src_cube=cube_id,
+      src_pe=-1,             # -1 = M_CPU aggregate, not a single PE
+      success=True,          # no failure semantics implemented
+  )
+  ```
+
+- The response Transaction travels on `list(reversed(txn.path))`
+  back to IO_CPU.
+
+MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
+because PE_MMU is terminal — there is no ResponseMsg path to
+intercept.
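The counter contract can be sketched without SimPy by replacing the `all_done` event with a callback. `ResponseCollector` and its methods are hypothetical names, and removing the entry once the count is reached is an assumption of this sketch.

```python
class ResponseCollector:
    """Counts responses per request id and fires a callback when all have arrived."""

    def __init__(self):
        self._pending = {}   # request_id -> [expected, received, on_all_done]

    def register(self, request_id, expected, on_all_done):
        self._pending[request_id] = [expected, 0, on_all_done]

    def collect(self, request_id):
        entry = self._pending[request_id]
        entry[1] += 1
        if entry[1] >= entry[0]:
            del self._pending[request_id]
            entry[2]()


completed = []
collector = ResponseCollector()
collector.register("req-7", expected=3, on_all_done=lambda: completed.append("req-7"))
collector.collect("req-7")
collector.collect("req-7")
before_last = list(completed)   # still empty: only 2 of 3 responses seen
collector.collect("req-7")
```

If the third response never arrives, nothing fires the callback, which mirrors the stall described under Consequences.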
+
+### D9. Helpers and configurable attribute
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`
+- `tuple[int, ...]` → `list(target_pe)`
+- `"all"` → `range(n_slices)` where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used by kernel-launch and MMU fan-out paths.
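The three cases can be sketched as follows. `resolve_pe_ids` is written as a free function with `n_slices` passed in as a parameter; how the real helper reads `hbm_slices_per_cube` is not shown here.

```python
def resolve_pe_ids(target_pe, n_slices=8):
    """Expand a target_pe selector into a list of PE ids."""
    if target_pe == "all":
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    return list(target_pe)      # tuple of explicit PE ids
```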
+
+Single configurable attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
+
+Applied once in `run()` per Transaction — models command
+interpretation and dispatch-decision time at M_CPU.
+
+## Consequences
+
+### Positive
+
+- Three fan-out paths are clearly separated by request type — adding
+  a new request kind is an isinstance branch + one fan-out method.
+- M_CPU.DMA channels are independent (read and write run concurrently)
+  and serialize only the dispatch step at capacity=1.
+- Transit-vs-terminal behavior is a single `if next_hop` check, so
+  the same component handles forward dispatch and reverse response
+  relay without role duplication.
+- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
+  established by IO_CPU (ADR-0036 D3), while the fallback computation
+  keeps direct-to-M_CPU unit tests working.
+- Per-PE metric `max`-merge against existing parent `result_data`
+  values is robust to cross-cube IO_CPU fan-out sharing the same
+  parent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-PE response stalls the
+  parent `all_done` indefinitely. Acceptable for simulation; not
+  suitable as a production-style endpoint.
+- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
+  code (no such node exists post-ADR-0017 D4). Kept defensively;
+  invites confusion and merits a follow-up cleanup.
+- DMA resource serialization applies only at dispatch (the `put` call
+  is instantaneous in unbounded stores). The capacity=1 channel
+  therefore models "one dispatch at a time per channel at this M_CPU",
+  not "transfer duration serialization"; readers must consult wire
+  processes (ADR-0015 D2) and `drain_ns` for actual transfer
+  parallelism.
+
+## Links
+
+- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
+- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
+  present; computed as per-cube barrier when absent)
+- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
+  point)
+- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
+  contract at cube level)
+- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
+  topology node)
+- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
+- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
+  through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
+# ADR-0036: IO_CPU Component Model
+
+## Status
+
+Accepted
+
+## Context
+
+IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
+graph. PCIE_EP receives host messages from the runtime API and routes
+them via the io_noc; for command-bearing requests (KernelLaunch,
+MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
+
+- Fans out the request to per-cube M_CPUs.
+- Aggregates per-cube responses into a single host-visible completion.
+- For kernel launches, stamps a global `target_start_ns` barrier so
+  every PE across every targeted cube begins kernel body execution at
+  the same simulated time (ADR-0009 D5).
+
+Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
+this component therefore handles only command-plane traffic in normal
+operation.
+
+This ADR documents the IO_CPU component implementation that realizes
+those responsibilities.
+
+## Decision
+
+### D1. Role
+
+IO_CPU is the host-facing endpoint of the IO chiplet. It has two
+primary responsibilities:
+
+1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
+   MmuUnmapMsg to per-cube M_CPUs.
+2. **Response aggregation** — collect per-cube ResponseMsg, signal
+   parent `txn.done` when all targeted cubes have responded.
+
+A third, narrower responsibility applies only to KernelLaunchMsg:
+**`target_start_ns` global barrier stamping** (D3).
+
+The component does **not**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002).
+- Decode tensor or kernel internals — those concerns belong to
+  M_CPU / PE_CPU / engines.
+- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
+- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
+  and ADR-0016 D3 (Memory R/W resolution code in
+  `_resolve_cube_targets` exists as a defensive fallback only).
+
+Per invocation (`run()`): applies the configured `overhead_ns` once
+per incoming Transaction (D8).
+
+### D2. Forward path — multi-cube fan-out
+
+When a non-response Transaction arrives, the worker:
+
+1. Pays `overhead_ns` via `run()`.
+2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
+   targets from the request (D5).
+3. For each target:
+   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
+   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
+   - Creates a per-cube sub-Transaction with `path` populated and
+     forwards it to `path[1]` (the first hop on the io_noc).
+4. Registers aggregation state: `_pending[request_id] = (expected,
+   received=0, parent_done)`.
+
+### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
+
+IO_CPU is the canonical stamper for `target_start_ns`. When the
+request is a `KernelLaunchMsg`, IO_CPU computes a single global
+barrier covering every targeted PE across every targeted cube:
+
+```text
+global_max = 0
+for (sip, cube) in cube_targets:
+    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
+    for pe_id in target_pe_ids:
+        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
+                                       nbytes=0)
+        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
+        global_max = max(global_max, latency)
+
+target_start_ns = env.now + global_max
+```
+
+The request is then replaced (via `dataclasses.replace`) so the
+stamped value propagates through the fan-out.
+
+Two overhead corrections:
+
+- `io_overhead_ns` is subtracted because IO_CPU has already paid it
+  in `run()` before this method runs.
+- `m_overhead_ns` is subtracted once because it appears as the
+  endpoint of leg1 *and* the start of leg2 in path latency, but
+  M_CPU pays it only once at run time.
+
+Every downstream PE_CPU yields until `target_start_ns` before
+beginning kernel body execution; all PEs therefore start at the same
+simulated time regardless of how long their individual dispatch path
+took.
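A worked example of the barrier arithmetic, using made-up path latencies. `global_barrier_ns` and the two lookup dicts are hypothetical; in the component the leg latencies come from `compute_path_latency_ns`.

```python
def global_barrier_ns(now, cube_targets, pe_ids, leg1_ns, leg2_ns,
                      io_overhead_ns, m_overhead_ns):
    """Return target_start_ns for the slowest (cube, PE) dispatch leg."""
    global_max = 0.0
    for target in cube_targets:
        for pe_id in pe_ids:
            latency = (leg1_ns[target] + leg2_ns[(target, pe_id)]
                       - io_overhead_ns - m_overhead_ns)
            global_max = max(global_max, latency)
    return now + global_max


# Hypothetical path latencies in ns, keyed by (sip, cube) and ((sip, cube), pe_id).
leg1 = {(0, 0): 40.0, (0, 1): 65.0}
leg2 = {((0, 0), 0): 20.0, ((0, 0), 1): 24.0,
        ((0, 1), 0): 20.0, ((0, 1), 1): 24.0}
barrier = global_barrier_ns(100.0, [(0, 0), (0, 1)], [0, 1], leg1, leg2,
                            io_overhead_ns=10.0, m_overhead_ns=5.0)
```

The slowest leg is cube 1, PE 1: 65 + 24 - 10 - 5 = 74 ns, so every PE is told to start at 174 ns.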
+
+### D4. KernelLaunch sub-Transactions carry `nbytes=0`
+
+Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
+overriding the parent `txn.nbytes`:
+
+- Kernel launch is a control message; payload size is irrelevant at
+  the data-fabric level.
+- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
+  io_noc's shared first hop. With 16 cubes this serializes fan-out,
+  pushing far M_CPUs past `target_start_ns` and breaking the D3
+  invariant.
+
+Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
+for the defensive Memory R/W fallback path, which carries actual
+payload sizes).
+
+### D5. Per-request-type cube target resolution
+
+`_resolve_cube_targets` dispatches by request type:
+
+| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
+| --- | --- | --- |
+| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
+| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
+| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
+| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
+
+Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
+parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
+
+The Memory R/W rows exist for defensive completeness; the engine's
+normal path routes Memory R/W via `_process_memory_direct()` /
+`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
+ADR-0016 D3).
+
+### D6. Response aggregation
+
+`_pending: dict[request_id → (expected, received, parent_done)]`:
+
+- On dispatch: register `(len(cube_targets), 0, txn.done)`.
+- `_worker` recognises responses by `is_response=True` and routes
+  them to `_collect_response`.
+- `_collect_response` increments `received`; when `received >=
+  expected`, `parent_done.succeed()` is invoked and the entry is
+  removed from `_pending`.
+
+This is a simple per-request counter. There is no per-cube identity
+tracking and no partial-failure handling: a missing response stalls
+the parent `done` event indefinitely. Production-style failure paths
+are out of scope for the current simulator model.
+
+### D7. `target_pe` resolution helper
+
+`_resolve_pe_ids(target_pe)`:
+
+- `int` → `[target_pe]`.
+- `tuple[int, ...]` → `list(target_pe)`.
+- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
+  `memory_map.hbm_slices_per_cube` (default 8).
+
+Used in D3's barrier computation to enumerate every PE target per
+cube.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Site | impl name | overhead_ns |
+| --- | --- | --- |
+| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
+
+Applied once in `run()` per Transaction. Models command
+interpretation + dispatch-decision time at IO_CPU.
+
+## Consequences
+
+### Positive
+
+- Cross-cube and cross-SIP kernel launches share a single global
+  barrier (D3 + D4) — no per-cube divergence in start time.
+- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
+  BW, preserving the barrier's accuracy at scale (16 cubes).
+- Response aggregation uses a single counter, which keeps state
+  minimal and completion ordering deterministic.
+- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
+  cleanly independent.
+
+### Negative
+
+- No partial-failure semantics — a missing per-cube response
+  indefinitely stalls the parent. Adequate for simulation but not
+  suitable as a production-style endpoint.
+- `_pending` is a regular dict; in-flight requests accumulate state.
+  Acceptable for current benchmark workloads (few concurrent
+  outstanding launches); unbounded in principle.
+- The Memory R/W resolution branches in `_resolve_cube_targets` are
+  dead code in the normal engine path. Kept defensively but invite
+  drift if the bypass path ever changes.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
+- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
+  cubes)
+- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
+- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
+- ADR-0012 (Host ↔ IO_CPU message schema)
+- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
+- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
+- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
+- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
+  interpretation)
@@ -0,0 +1,200 @@
+# ADR-0037: Forwarding Component (forwarding_v1)
+
+## Status
+
+Accepted
+
+## Context
+
+The simulation graph has many node positions that exist purely to model
+fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
+IO chiplet io_noc, transit cubes. These share a common pattern: receive
+a message, apply per-component overhead (modeling header decode +
+routing decision time), forward to the next hop along the pre-computed
+path.
+
+This ADR defines the contract for these transit nodes: a single
+component type (`TransitComponent`) that handles flit-aware forwarding
+with wormhole cut-through semantics, used under multiple impl names
+according to the conceptual role each instance plays.
+
+## Decision
+
+### D1. Role
+
+The Forwarding component (`TransitComponent` class) is a **stateless
+transit node** in the simulation graph. It models any fabric position
+where a message physically traverses but no semantic processing
+happens.
+
+Per traversal, the component:
+
+1. Reads an incoming Transaction or Flit from an `in_port`.
+2. Applies the configured per-component overhead (`overhead_ns`),
+   applied **once per Transaction** even across multi-flit payloads
+   (see D2).
+3. Looks up the next hop along the Transaction's pre-computed `path`.
+4. Forwards to the corresponding `out_port`; at the terminal node
+   (no next hop), signals `txn.done` once the `is_last` flit arrives.
+
+The component **does NOT**:
+
+- Decide routing — paths are pre-computed by the router (ADR-0002 /
+  ADR-0017 D2). Forwarding only executes the per-hop step.
+- Model wire propagation or bandwidth occupancy — separate wire
+  processes between components handle that (ADR-0015 D2).
+- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
+- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
+  handle that.
+
+### D2. First-flit overhead model (header decode)
+
+Per-Transaction `overhead_ns` is applied **exactly once**, at first
+flit arrival:
+
+- `_txn_decoded: set[int]` tracks which Transactions have already
+  paid the overhead at this node.
+- On first-flit arrival for a Transaction: `yield self.run(env,
+  msg.txn.nbytes)` — pays the overhead.
+- Subsequent flits of the same Transaction skip the overhead — they
+  pipeline through with no extra delay.
+- On `is_last` flit: remove the Transaction from `_txn_decoded`.
+
+This models the real-HW behavior where header decode and routing
+decision happen once on first flit; payload flits then stream through
+the same path (wormhole cut-through). Multi-hop pipelining emerges
+naturally: each hop charges its own overhead to the first flit only,
+and the flits that follow pass through that hop without paying it
+again.
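The bookkeeping can be sketched with a plain set. `forward_delays` is a hypothetical helper that returns the overhead charged to each flit at one node; string transaction ids stand in for the integer keys held in `_txn_decoded`.

```python
def forward_delays(flits, overhead_ns):
    """Return the overhead each flit pays; `flits` is a list of (txn_id, is_last)."""
    decoded = set()          # transactions that already paid at this node
    delays = []
    for txn_id, is_last in flits:
        if txn_id in decoded:
            delays.append(0.0)
        else:
            decoded.add(txn_id)
            delays.append(overhead_ns)
        if is_last:
            decoded.discard(txn_id)
    return delays


# Two interleaved transactions: A with three flits and B with a single flit.
delays = forward_delays([("A", False), ("B", True), ("A", False), ("A", True)], 2.0)
```

Each transaction pays 2 ns exactly once, on its first flit, regardless of how its flits interleave with other traffic.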
+
+### D3. Serial worker forwarding (preserves order)
+
+The component's worker is a single SimPy process that consumes flits
+from `_inbox` and forwards them serially in arrival order. The
+component does NOT spawn `env.process(...)` per flit.
+
+Rationale: if the first flit yields on `overhead_ns` while subsequent
+flits run in parallel processes, the later flits can overtake the
+first. This produces out-of-order delivery and lets the `is_last`
+flit arrive at the destination before the first flit — corrupting
+both the transaction's completion semantics and any flit-index-based
+processing downstream.
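The overtaking hazard can be shown with a small timing model. `departures` is a hypothetical helper for one transaction at one node; it ignores wire delay and models only the first-flit overhead.

```python
def departures(arrivals, overhead_ns, serial):
    """Departure time of each flit of one transaction; only flit 0 pays overhead."""
    out, worker_free = [], 0.0
    for i, arrival in enumerate(arrivals):
        cost = overhead_ns if i == 0 else 0.0
        start = max(arrival, worker_free) if serial else arrival
        depart = start + cost
        if serial:
            worker_free = depart      # next flit cannot start before this one leaves
        out.append(depart)
    return out


arrivals = [0.0, 1.0, 2.0]                      # three flits, 1 ns apart
per_flit = departures(arrivals, 8.0, serial=False)
in_order = departures(arrivals, 8.0, serial=True)
```

With one process per flit, the second and third flits leave at 1 ns and 2 ns, ahead of the first flit at 8 ns. The serial worker holds them until the first flit has gone, so departure order matches arrival order.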
+
+### D4. Path-based next-hop routing
+
+Routing is **not** a Forwarding-component concern. The Transaction
+arrives with a pre-computed `path` (built by the router; ADR-0002 /
+ADR-0017 D2). The component just looks up its own position in the
+path and forwards to `path[index + 1]`:
+
+```python
+def _next_hop_in_path(self, txn):
+    my_id = self.node.id
+    path = txn.path
+    for i, n in enumerate(path):
+        if n == my_id and i + 1 < len(path):
+            return path[i + 1]
+    return None
+```
+
+If `next_hop` is found and present in `out_ports`, the flit is
+forwarded. Otherwise (terminal node), `txn.done.succeed()` is
+invoked when the `is_last` flit arrives.
+
+### D5. Flit-aware mode with Non-Flit fallback
+
+`_FLIT_AWARE = True` opts this component out of the base class's
+flit-reassembly logic in `_fan_in`. Flits are placed directly on
+`_inbox` (no reassembly), enabling per-flit handling in the worker
+loop (D2, D3).
+
+Non-Flit messages — zero-byte control Transactions and other
+non-chunkified payloads — fall through to the base class's legacy
+`_forward_txn` path via `env.process`. This preserves backward
+compatibility for control-plane traffic that does not benefit from
+flit-level processing.
+
+### D6. Multi-stream merging at the base class
+
+Multi-stream FIFO merging at routers is the base class's
+responsibility, not Forwarding's. The base class's `_fan_in` spawns
+one process per `in_port`; all push to a single shared `_inbox`.
+Flits from different upstream streams therefore interleave at
+flit granularity in `_inbox`'s FIFO order.
+
+The Forwarding worker simply consumes `_inbox` in arrival order —
+correctly modeling per-router multi-flow arbitration as
+fair-FIFO over the shared inbox.
+
+### D7. Single implementation under multiple impl names
+
+A single `TransitComponent` class is registered under four impl names
+in `components.yaml`:
+
+- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
+  `noc_router`, UCIe conn bridges)
+- `builtin.switch` — tray-level switch
+- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
+  NOC routers use `builtin.forwarding`)
+- `builtin.ucie` — UCIe protocol endpoint
+
+All four aliases instantiate the same class with the same behavior.
+Per-instance differentiation lives only in `attrs.overhead_ns`.
+Separate impl names exist as intent tags for readability and to
+allow future divergence without backward-incompatible config
+changes.
+
+### D8. Configurable `overhead_ns`
+
+A single attribute drives per-instance latency:
+
+| Usage site | impl name | overhead_ns |
+| --- | --- | --- |
+| Tray-level switch | `builtin.switch` | 5.0 |
+| Cube NOC router | `builtin.forwarding` | 2.0 |
+| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
+| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
+| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
+
+Default is 0.0. The attribute is read at each `run()` invocation, so
+dynamic reconfiguration is possible but not currently used.
+
+## Consequences
+
+### Positive
+
+- A single class handles all transit-node roles in the simulation
+  graph — minimal code surface for a high-population component type.
+- Flit-aware processing + serial worker preserves wormhole semantics
+  across multi-hop paths without per-flit process overhead.
+- `overhead_ns` is the only per-instance tunable; routing, BW, and
+  address resolution stay cleanly separated in their own components /
+  modules.
+- Multi-stream merging emerges from the base-class structure; no
+  router-specific logic duplicates fair-FIFO arbitration.
+- Non-Flit fallback path keeps control-plane traffic working without
+  forcing every message into the flit framework.
+
+### Negative
+
+- Because every impl name maps to the same class, usage-site intent is
+  visible only through the impl name and `attrs.overhead_ns`; readers
+  must consult `topology.yaml` and `components.yaml` to see which role
+  each instance plays.
+- Per-flit serial worker is a bottleneck if `overhead_ns` is large
+  and many concurrent transactions arrive at the same router; current
+  values (0–8 ns) make this negligible.
+
+## Links
+
+- ADR-0002 (Routing distance — path computation)
+- ADR-0015 D1 (Component port model)
+- ADR-0015 D2 (Wire process — BW + propagation, separate from this
+  component)
+- ADR-0015 D6 (Transit cube forwarding pattern)
+- ADR-0016 D1 (IO chiplet io_noc — uses this component)
+- ADR-0017 D1 (Cube NOC routers — use this component)
+- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
+  component)
+- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
+  multi-stream merge semantics)