ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/, immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted (code verified) and prune Implementation Notes / Affected files / Test strategy / "Current state" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6 Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
@@ -94,7 +94,7 @@ The Phase 0 PA shard map remains a valid fast-path configuration.

 ## Links

-- ADR-0011 (PA-first)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host↔IO_CPU schema)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (Kernel execution)
@@ -1,95 +1,514 @@
-# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension
+# ADR-0011: Memory Addressing — PA / VA / LA Address Models

 ## Status

-Accepted (Phase 1 VA/MMU implemented)
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.

 ## Context

-A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
-translation path for DMA: host allocates physical memory at PE level, maps it
-into a virtual address space, installs mappings, and DMA requests use virtual
-addresses that are translated to physical addresses.
+KernBench's address model evolved through three design points, each
+addressing a limitation of the previous. This ADR documents all three
+in one place because future implementation work selects among them.

-The PA-only model (Phase 0) was insufficient for running standard Triton kernels
-that use `base_addr + offset` patterns on sharded tensors — each PE's shard has
-a different PA, but the kernel needs a single contiguous address space.
+### PA-only baseline
+
+Phase 0 of KernBench treated all device memory operations
+(MemoryRead/MemoryWrite) as raw physical-address transfers. No
+host-side virtual addressing, no MMU/IOMMU translation. Allocators
+returned PA mappings; DMA requests carried PA directly.
+
+This was sufficient for early correctness/latency work but
+insufficient for running standard Triton kernels that use
+`base_addr + offset` patterns on sharded tensors: each PE's shard
+has a different PA, but the kernel needs a single contiguous address
+space to compute offsets.
+
+### Why VA/MMU (current default)
+
+A realistic system uses host-side virtual addressing and an
+MMU/IOMMU-style translation path for DMA: the host allocates physical
+memory at PE level, maps it into a virtual address space, installs
+mappings, and DMA requests use virtual addresses that are translated
+to physical addresses.
+
+Adopting this model lets kernels use `base_addr + offset` over a
+contiguous VA range while the device-side MMU translates each access
+to the appropriate PA.
+
+### Why LA/BAAW (proposed)
+
+VA/MMU treats HBM as a single backing space. KernBench needs to
+explore architectures where HBM is composed of multiple pseudo
+channels in parallel:
+
+- CUBE's HBM has 32 or 64 pseudo channels.
+- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
+  (N × per-channel).
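The channel arithmetic above can be sketched as follows. This is a minimal illustration, not KernBench code: the function names are hypothetical, and it assumes `hbm_pseudo_channels` divides evenly by `pes_per_cube`.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N = hbm_pseudo_channels / pes_per_cube (assumed to divide evenly)."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        raise ValueError("pseudo channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def aggregate_pe_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Aggregate PE bandwidth = N x per-channel bandwidth."""
    return n_channels * channel_bw_gbs


# Example values taken from the text: 64 channels, 8 PEs, 32 GB/s per channel.
n = channels_per_pe(64, 8)
print(n, aggregate_pe_bw_gbs(n, 32.0))
```

With these inputs each PE owns 8 channels and sees 256 GB/s in aggregate, which is the same figure the n:1 mode uses for its aggregated link.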
+
+Two channel-mapping modes need to be modelable:
+
+- **1:1 mode** — one logical access → N per-channel requests.
+  Precise per-channel BW contention modelling.
+- **n:1 mode (default)** — one logical access → one aggregated
+  request. Channels are assumed to interleave; aggregated BW model.
+
+VA's `tl.load(va_ptr)` produces a single DMA request to a single
+target. Decomposing that into per-channel requests inside PE_DMA
+requires the address layer to be aware of channels. This is the
+role of the LA (Logical Address) abstraction with BAAW
+(Logical-to-Physical Mapping Unit).
+
+Core requirements driving the LA design:
+
+- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
+  in both modes (only request shape and resource model differ).
+- Kernel programming model is unchanged — physical channel
+  information is never exposed to kernel code.
+- Mode switch is a topology-level configuration.
+
+### Design space summary
+
+| Model | Status | Key idea |
+|-------|--------|----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |

 ---

 ## Decision

-### D1. Phase 0 model is PA-only (original, retained as fallback)
+This ADR defines three address models. At any given time the system
+operates in exactly one model. Selection is topology- / configuration-
+driven; coexistence within one simulation run is not required.

-- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
-  addresses (PA) plus size.
-- PA-only mode remains functional via PageFault fallback in PE_DMA.
+---

-### D2. Allocation produces PA mappings
+### Address Model: PA (Physical Address) — fallback

-Device allocation selects PE-local memory regions and returns PA mappings
-sufficient to execute kernels and issue DMA requests.
+#### D-PA1. PA-only semantics

-### D3. Phase 1: VA/MMU layer (implemented)
+- All device memory accesses (MemoryRead/MemoryWrite) operate on
+  device physical addresses (PA) plus size.
+- PA-only mode remains functional via the PageFault fallback path in
+  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
+  the value as a PA directly.

-#### D3.1 Virtual Address Model
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA
+mappings sufficient to execute kernels and issue DMA requests.
+
+PA model is retained primarily for backward compatibility with PA-only
+tests and as the underlying physical layer that VA / LA models resolve
+into.
+
+---
+
+### Address Model: VA (Virtual Address with MMU) — current default
+
+#### D-VA1. Virtual Address Model

 - Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
 - `TensorShard` does NOT carry a `va` field — shard VA is derived as
  `va_base + offset_bytes`.
-- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
+- Kernels receive `va_base` as their pointer argument (via
+  `TensorArg.va_base`).
 - `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).

-#### D3.2 PE_MMU Component
+#### D-VA2. PE_MMU Component

-- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
-  `translate()` called by PE_DMA).
-- Page-aligned dict lookup for O(1) VA→PA translation.
+- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
+  (synchronous `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA → PA translation.
 - `tlb_overhead_ns` configurable per-access latency.
-- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
-  (backward compatibility with PA-only tests).
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
+  directly (preserves PA model for backward compatibility).
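A minimal sketch of the page-aligned lookup and the PageFault fallback described in D-VA2 follows. The class `SketchMmu` and the helper `resolve_dma_addr` are hypothetical names for illustration; the real `PE_MMU` component also has a SimPy inbox and a `tlb_overhead_ns` latency that are omitted here. Mappings are assumed to be page-aligned.

```python
PAGE_SIZE = 4096  # default page size per D-VA5


class SketchMmu:
    """Hypothetical stand-in for the translate() side of PE_MMU."""

    def __init__(self, page_size: int = PAGE_SIZE) -> None:
        self.page_size = page_size
        self._pages: dict[int, int] = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, size: int) -> None:
        # Assumes va and pa are page-aligned; installs one entry per page.
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int):
        # O(1) dict lookup on the page base, then re-apply the in-page offset.
        page = va - (va % self.page_size)
        pa_page = self._pages.get(page)
        if pa_page is None:
            return None  # no mapping: caller decides how to handle it
        return pa_page + (va - page)


def resolve_dma_addr(mmu: SketchMmu, addr: int) -> int:
    """PageFault fallback: an unmapped address is treated as a PA directly."""
    pa = mmu.translate(addr)
    return addr if pa is None else pa
```

The fallback is what keeps PA-only tests working: an address that was never mapped passes through unchanged.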

-#### D3.3 Mapping Installation
+#### D-VA3. Mapping Installation

-- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
-  M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
-- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
-  mapping contamination for replicated tensors.
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
+  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
+  end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
+  cross-SIP mapping contamination for replicated tensors.
 - Mapping strategy based on `DPPolicy.cube`:
-  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
-    Each cube's PEs see only their local PA. No cross-cube mapping installed.
-  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
-    target cubes. Enables cross-PE and cross-cube DMA.
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
+    only. Each cube's PEs see only their local PA. No cross-cube
+    mapping installed.
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
+    mappings to all target cubes. Enables cross-PE and cross-cube
+    DMA.

-#### D3.4 Tensor Lifecycle
+#### D-VA4. Tensor Lifecycle

-- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
-  RuntimeContext. Sends `MmuUnmapMsg` through fabric, returns VA and PA space.
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
+  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
+  returns VA and PA space.
 - `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
 - `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
 - `PEMemAllocator` uses free-list with coalescing (not bump allocator).
 - `VirtualAllocator` uses free-list with coalescing for VA space.

-#### D3.5 Allocators
+#### D-VA5. Allocators

-- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
+  with coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
  coalescing.
-- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
-- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).
+- Page size configurable via `topology.yaml` `pe_mmu` attrs
+  (default 4096).
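The "free-list with coalescing" approach named for both allocators can be sketched as below. This is an illustrative first-fit allocator under assumed semantics, not the actual `VirtualAllocator` or `PEMemAllocator`; the class name and method signatures are hypothetical.

```python
class FreeListAllocator:
    """Hypothetical first-fit free-list allocator with coalescing on free."""

    def __init__(self, base: int, size: int, align: int = 4096) -> None:
        self.align = align
        self._free = [(base, size)]  # sorted list of (start, size) free ranges

    def _round_up(self, nbytes: int) -> int:
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes: int) -> int:
        need = self._round_up(nbytes)
        for i, (start, size) in enumerate(self._free):
            if size >= need:
                if size == need:
                    del self._free[i]  # range fully consumed
                else:
                    self._free[i] = (start + need, size - need)
                return start
        raise MemoryError("no free range large enough")

    def free(self, addr: int, nbytes: int) -> None:
        self._free.append((addr, self._round_up(nbytes)))
        self._free.sort()
        merged = []
        for start, size in self._free:
            # Coalesce with the previous range when the two are adjacent.
            if merged and merged[-1][0] + merged[-1][1] == start:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((start, size))
        self._free = merged
```

Unlike a bump allocator, freed ranges become reusable, and adjacent free ranges merge back into one so that later large allocations can still succeed.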

----
+#### Consequences (VA model)

-## Consequences
-
-- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
-- All latency remains explicit via graph traversal, including MMU mapping
-  installation and per-access TLB overhead.
+- Triton kernels use `base_addr + offset` patterns naturally on
+  sharded tensors.
+- All latency remains explicit via graph traversal, including MMU
+  mapping installation and per-access TLB overhead.
 - PA-only mode retained as fallback (PageFault → treat as PA).
-- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
 - IPCQ and other fixed-address resources bypass MMU (use PA directly).

 ---

+### Address Model: LA (Logical Address with BAAW) — proposed
+
+LA replaces VA when channel-level HBM modelling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
+removed artifacts). Coexistence with VA in the same run is not a goal.
+
+#### D-LA1. LA introduction — replaces VA infrastructure
+
+LA is the sole address space used by kernel code (`tl.load`,
+`tl.store`, `tl.composite`). Properties:
+
+- Can map a Tensor to a contiguous logical space (like VA).
+- Expresses `(logical buffer + offset)`.
+- Does NOT contain physical channel information directly.
+- Stays as an intermediate abstraction until physical resolution.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value; BAAW segment
+tables differ → they resolve to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode setting
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+Consumed by the graph compiler (topology builder) and BAAW
+initialisation.
+
+#### D-LA3. Segment and BAAW
+
+Segment partitions the LA space; each segment maps to a specific HBM
+channel or channel group. Created at tensor deploy time by the runtime
+allocator. BAAW resolves LA → physical request(s) using the segment
+table.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
+    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA base
+    agg_node_id: str      # aggregated router node_id
+```
+
+Segment lifecycle:
+
+1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
+   allocator. PEMemAllocator allocates per-channel PA (1:1) or
+   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
+   with PE_DMA.
+2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
+   converts to PA(s).
+3. **Free** (tensor free): segment removed from table; LA and PA
+   returned.
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy
+component. Synchronous address-resolution logic executed at the start
+of PE_DMA's `handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
+- **n:1 mode**: single `PhysicalRequest`.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one
+    requests = []
+    per_ch_size = seg.channel_size
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        ch_offset = offset % per_ch_size
+        ch_nbytes = nbytes // seg.channel_count
+        pa = pa_base + ch_offset
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+    return requests
+```
+
+BAAW responsibilities:
+
+- Convert logical access → physical request units.
+- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
+- Compute PA and target node.
+
+BAAW non-responsibilities:
+
+- Performing actual data movement.
+- Executing NOC routing.
+- Simulating bandwidth occupation (downstream components' job).
+
+BAAW output is directly usable by the simulator's routing and resource
+model without additional address decoding.
+
+#### D-LA5. PE_DMA `handle_command()` change
+
+Current (VA-based) flow:
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric inject
+```
+
+LA-based flow:
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric inject
+  → await all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU reference removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
+  directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode detail
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request: fully-resolved 51-bit PA, targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel link models BW contention.
+- PE_DMA injects N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: 8 sub-transactions parallel inject
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  Total effective BW = 8 × channel_bw_gbs
+```
+
+Other N values:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
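The 1:1 fan-out in the worked example can be reproduced with a small standalone sketch. It assumes the access covers the segment from offset 0 and splits evenly across channels; `fan_out` is a hypothetical helper, and the PA values used in the usage example are placeholders rather than real addresses.

```python
from dataclasses import dataclass


@dataclass
class PhysicalRequest:
    pa: int        # physical address for this channel
    nbytes: int    # bytes carried by this request
    dst_node: str  # target channel router node_id


def fan_out(pa_bases, channel_ids, nbytes, pe_prefix):
    """Split one logical access into one request per channel (even split)."""
    n = len(pa_bases)
    if n == 0 or len(channel_ids) != n:
        raise ValueError("pa_bases and channel_ids must be non-empty and equal length")
    if nbytes % n != 0:
        raise ValueError("sketch assumes nbytes divides evenly across channels")
    per_channel = nbytes // n
    return [
        PhysicalRequest(pa=pa, nbytes=per_channel, dst_node=f"{pe_prefix}.ch_r{ch}")
        for pa, ch in zip(pa_bases, channel_ids)
    ]


# Placeholder per-channel PA bases, 8 channels owned by PE0.
reqs = fan_out([0x1000 * i for i in range(8)], list(range(8)), 4096, "sip0.cube0.pe0")
for r in reqs:
    print(r)
```

Changing the number of channels passed in gives the 4-request and 16-request cases listed above without touching the caller.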
+
+#### D-LA7. n:1 mode detail
+
+- One logical access → one aggregated request.
+- Target: aggregated router → hbm_ctrl (see ADR-0019).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g. 8 × 32 = 256 GB/s).
+- Single queue / resource for modelling.
+- No per-channel PA decomposition.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregated router → hbm_ctrl link (256 GB/s)
+```
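The requirement that both modes yield the same effective bandwidth can be checked with simple arithmetic. The sketch below is an idealised model under stated assumptions: 1 GB/s is taken as 1 byte/ns, only serialization time is counted, and there is no contention or hop latency. The function names are hypothetical and are not the simulator's `compute_drain_ns`.

```python
def drain_ns(nbytes: float, bw_gbs: float) -> float:
    """Serialization time in ns, taking 1 GB/s as 1 byte/ns."""
    return nbytes / bw_gbs


def one_to_one_drain_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # N parallel requests of nbytes / N each; the slowest one sets the total.
    per_channel = nbytes / channels
    return max(drain_ns(per_channel, channel_bw_gbs) for _ in range(channels))


def n_to_one_drain_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # One request over a link of N x per-channel bandwidth.
    return drain_ns(nbytes, channels * channel_bw_gbs)


# 4 KB tensor, 8 channels, 32 GB/s per channel (values from the examples above).
print(one_to_one_drain_ns(4096, 8, 32.0), n_to_one_drain_ns(4096, 8, 32.0))
```

Under these assumptions both modes drain the 4 KB access in the same time; the modes diverge only once per-channel contention is modelled.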
+
+#### D-LA8. Kernel model preserved
+
+- Kernel still issues single memory ops (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address scheme exposed to kernel code.
+- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- Kernel abstraction preserved — no kernel code changes.
+- Topology-based policy control (mode switch via yaml).
+- Improved simulation-model consistency and debuggability.
+- Segment-based mapping is simpler than page tables; lower overhead.
+
+Negative:
+
+- Full VA/MMU code refactor required.
+- Request-generation path more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests need rewriting.
+
+---
+
+## Migration Path
+
+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. Switching does not require removing PA
+  code.
+- **VA → LA**, if adopted, is a replacement, not coexistence. See
+  D-LA1 for the VA infrastructure removal list. PA fallback inside
+  PE_DMA may be retained orthogonally for tests.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
+   Rejected: MMU's role would grow beyond translation to request
+   decomposition; aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store
+   directly. Rejected: abstraction leakage, portability loss, all
+   benchmarks need rewriting.
+3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
+   directly. Rejected: incompatible with aggregation; conversion
+   timing unclear; channel info leaks to kernel.
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths over installed mappings.
+- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
+- TLB-overhead-per-access timing.
+- PageFault fallback path preserves PA-only behaviour.
+
+### LA model (when implemented)
+
+- 1:1 mode: same logical access → N per-channel requests.
+- n:1 mode: same logical access → 1 aggregated request.
+- Bandwidth equivalence between modes for identical workload.
+- 1:1 mode: per-channel contention modelled correctly.
+- n:1 mode: aggregated bandwidth correctly reflected.
+- Kernel code unchanged across mode switch.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
+
+## Implementation Order (LA, when scheduled)
+
+1. LA type (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
 ## Links

 - ADR-0007 (runtime_api vs sim_engine boundaries)
@@ -97,4 +516,6 @@ sufficient to execute kernels and issue DMA requests.
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
-- SPEC R2 (latency by traversal)
+- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
+  consumer)
+- SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -226,7 +226,7 @@ Tests SHOULD validate:

 ## Links

-- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
 - SPEC R2, R7, R8
@@ -134,6 +134,6 @@ Phase 2 (Apply) MUST:
 ## Links

 - SPEC 0.1, R2, R6
-- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0009 (Kernel execution semantics)
@@ -2,35 +2,23 @@

 ## Status

-Proposed
+Accepted

 ## Context

-ADR-0018 introduced LA-based address abstraction and BAAW,
-defining how a logical memory access is translated into the following two forms of requests:
+The CUBE-internal NOC must connect each PE to HBM. KernBench needs
+to evaluate two connectivity models:

-- 1:1 mode: one logical access → N per-channel requests
-- n:1 mode: one logical access → one aggregated request
+- **1:1 mode** — PE_DMA connects to N separate per-channel routers,
+  each with its own link to hbm_ctrl. Models per-channel BW
+  contention precisely.
+  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
+- **n:1 mode** — PE_DMA connects to a single aggregated router with
+  one link to hbm_ctrl. Channels are treated as interleaved; only
+  aggregate BW is modeled.

-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
-- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
-- HBM is modeled as 8 slice (= per-PE) nodes
-- Local/remote access use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
-- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
-- xbar/bridge bifurcate local/remote paths
-- Cannot express 1:1 / n:1 modes consistently
+Effective PE-local BW is identical under both modes
+(= N × per-channel BW); only the connectivity granularity differs.

 ---

@@ -270,7 +258,6 @@ The effective BW per PE is identical in both modes:
 ### Negative

 - The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
-- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

 ---
@@ -296,119 +283,6 @@ The effective BW per PE is identical in both modes:

 ---

-## Implementation Notes
-
-### topology/builder.py Change Details
-
-#### Code to Remove (within current `_instantiate_cube()`)
-
-- xbar_top, xbar_bot node creation (~line 495-508)
-- bridge.left, bridge.right node creation
-- noc ↔ xbar edge creation (~line 540-555)
-- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
-- xbar ↔ bridge edge creation (~line 557-572)
-
-#### Code to Add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# Create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# Channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# Horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# Create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# Aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# Horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test File | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
----
-
 ## Test Requirements

 - Verify that requests are delivered via per-channel links in 1:1 mode
@@ -425,7 +299,7 @@ for p in range(pes_per_cube - 1):

 ## Links

-- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → BW model redefinition
 - ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -2,35 +2,23 @@

 ## Status

-Proposed
+Accepted

 ## Context

-ADR-0018 introduced LA-based address abstraction and BAAW,
-defining how a logical memory access is translated into the following two forms of requests:
+The CUBE-internal NOC must connect each PE to HBM. KernBench needs
+to evaluate two connectivity models:

-- 1:1 mode: one logical access → N per-channel requests
-- n:1 mode: one logical access → one aggregated request
+- **1:1 mode**: PE_DMA connects to N separate per-channel routers,
+  each with its own link to hbm_ctrl. Models per-channel BW
+  contention precisely.
+  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
+- **n:1 mode**: PE_DMA connects to a single aggregated router with
+  one link to hbm_ctrl. Channels are treated as interleaved; only
+  aggregate BW is modeled.

-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
-- PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} path is used
-- HBM is modeled as 8 slice (= per-PE) nodes
-- Local/remote access use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
-- Cannot model at the pseudo-channel granularity (slice = per-PE granularity, not per-channel)
-- xbar/bridge bifurcate local/remote paths
-- Cannot express 1:1 / n:1 modes consistently
+Effective PE-local BW is identical under both modes
+(= N × per-channel BW); only the connectivity granularity differs.

 ---

@@ -270,7 +258,6 @@ links:
 ### Negative

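 - The number of SimPy nodes increases due to explicit router nodes (6×6 = up to 32 routers/cube)
-- Requires complete rewrite of existing xbar/bridge/single NOC-based tests
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model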

 ---
@@ -296,119 +283,6 @@ links:

 ---

-## Implementation Notes
-
-### Details of topology/builder.py changes
-
-#### Code to remove (currently inside `_instantiate_cube()`)
-
- Creation of xbar_top and xbar_bot nodes (~line 495-508)
- Creation of bridge.left and bridge.right nodes
- Creation of noc ↔ xbar edges (~line 540-555)
- Creation of xbar ↔ hbm_ctrl.slice edges (~line 510-538)
- Creation of xbar ↔ bridge edges (~line 557-572)
-
-#### Code to add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test file | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
---
-
 ## Test Requirements

 - Verify that requests are delivered via per-channel links in 1:1 mode
@@ -425,7 +299,7 @@ for p in range(pes_per_cube - 1):

 ## Links

- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → BW model redefinition
 - ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ but do not actually read tensor data or perform computations.
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation

-### Limitations of the Existing Kernel Execution Structure
-
-The current kernel execution is separated into 3 stages:
-
-```
-Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
-Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
-```
-
-Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
-`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
-Therefore, branching based on data values (dynamic control flow) is impossible.
-
-This ADR resolves this limitation **for memory operations only** (see D1, D3).
-
 ### Constraints

 - SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
@@ -532,22 +517,3 @@ Per-dtype tolerance policy:
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
-
---
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
-Benchmark kernels (benches/*.py): **no user API changes**
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ Proposed
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation

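-### Limitations of the Existing Kernel Execution Structure
-
-The current kernel execution is separated into 3 stages:
-
-```
-Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
-Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
-```
-
-Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
-`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
-Therefore, branching based on data values (dynamic control flow) is impossible.
-
-This ADR resolves this limitation **for memory operations only** (see D1, D3).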
-
 ### Constraints

 - SimPy is a single-thread event loop: running numpy matmul inside it blocks everything
@@ -529,22 +514,3 @@ dtype별 tolerance 정책:
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
-
---
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
-Benchmark kernels (benches/*.py): **no user API changes**
@@ -2,30 +2,10 @@

 ## Status

-Proposed
+Accepted

 ## Context

-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← directly connected via internal SimPy Store
-├── DmaWbBlock     ← not visible in topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
- Blocks directly reference the next block via `desc.next_block` — hardcoded routing
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic
-
 ### Actual Hardware Structure

 ```
@@ -374,66 +354,6 @@ Topology edges encompass both **control/dispatch visibility + runtime chaining**
 Scheduler → sub-component edges are initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.

-### D8. Existing Code Migration — Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml — new builtin
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
-with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
-continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
-implementation**. The tile schedule generation logic from tiling.py is also carried over.
-Only the architecture (component separation, self-routing) changes; timing accuracy
-is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing test pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces identical latency for the same inputs as pe_accel.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
-(Phase 2) actual numpy computation → result correctness verification.
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
- Verify that tiles chain according to the plan's stage sequence
- Verify PipelineContext.complete_tile() exactly-once at the last stage
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
-
 ### D9. TileToken Message Definition

 A message used for passing tile work between components.
@@ -472,8 +392,6 @@ Relationship with existing PeInternalTxn:
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
-  bug fixes or feature additions.

 ## Open Questions

@@ -511,27 +429,4 @@ Relationship with existing PeInternalTxn:

 - Increased number of PE internal components (5 → 6) — more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before
- Breaking change from existing builtin/pe_accel — migration required

---
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
-| `components.yaml` | Register new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
-| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
-
-Backup:
-| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
@@ -2,30 +2,10 @@

 ## Status

-Proposed
+Accepted

 ## Context

-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← directly connected via internal SimPy Store
-├── DmaWbBlock     ← not visible in topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
- Blocks directly reference the next block via `desc.next_block` (hardcoded routing)
- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
- PE internal structure is not visible in the topology
- GemmBlock and MathBlock each duplicate TCM load/store logic
-
 ### Actual Hardware Structure

 ```
@@ -370,64 +350,6 @@ Topology edge는 **control/dispatch visibility + runtime chaining** 양쪽을
 Scheduler → sub-component edges are initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.

-### D8. Existing Code Migration: Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml (new builtin)
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
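-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced with the ADR-0021 architecture**.
-Existing benchmarks and tests referencing topology.yaml continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel implementation**.
-The tile schedule generation logic from tiling.py is also carried over.
-Only the architecture (component separation, self-routing) changes; timing accuracy is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing test pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces identical latency for the same inputs as pe_accel.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor (Phase 2)
-actual numpy computation → result correctness verification.
- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
- Verify that tiles chain according to the plan's stage sequence
- Verify PipelineContext.complete_tile() exactly-once at the last stage
- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)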
-
 ### D9. TileToken 메시지 정의

 컴포넌트 간 tile 작업 전달에 사용하는 메시지.
@@ -465,7 +387,6 @@ Token lifecycle:
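  (based on PeInternalTxn; ADR-0014 is retained)
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines are future work.
- **builtin_legacy maintenance**: kept for backup purposes only; not a target for bug fixes or feature additions.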

 ## Open Questions

@@ -502,27 +423,4 @@ Token lifecycle:

 - Increased number of PE internal components (5 → 6), which means more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before
- Breaking change from existing builtin/pe_accel; migration required

---
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
-| `components.yaml` | Register new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
-| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
-
-Backup:
-| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -19,17 +19,6 @@ queues. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel-side collective
 infrastructure.

-### Current state
-
- ADR-0021 PE pipeline refactor: each PE is decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
-  PE_TCM, PE_MMU).
- No direct PE-to-PE channel exists today. All data movement goes
-  through PE_DMA → cube_noc / UCIe / PCIE → HBM.
- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` running per-rank greenlets concurrently). The
-  collective itself is a stub.
-
 ### Problems to solve

 1. PE-to-PE direct data movement (writing into a peer's memory).
@@ -891,30 +880,3 @@ fairness from `tl.recv()` round-robin, confusing
 - VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
 - Chunk-level interleave makes PE_DMA implementation more complex.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
-| `components.yaml` | Register `pe_ipcq_v1`. |
-| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
-| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
-| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
-| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
-| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
-| `src/kernbench/ccl/__init__.py` | New CCL package. |
-| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
-| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
-| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
-| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
-| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
-| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
-| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
-| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
-| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
-| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -17,14 +17,6 @@ Queue)를 통해** 일어난다.
 is similar to core-local communication queues. Host-level collectives (`dist.all_reduce`) are
 deferred to **future work**; this ADR focuses solely on the kernel collective infrastructure.

-### Current state
-
- ADR-0021 PE pipeline refactor: the PE internals are decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH, PE_TCM, PE_MMU)
- No direct PE-to-PE communication channel. All data movement takes the PE_DMA → cube_noc/UCIe/PCIE → HBM path
- Host CCL skeleton (no ADR, ad-hoc implementation): `dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` runs per-rank greenlets concurrently. The collective is a stub.
-
 ### Problems to solve

 1. PE-to-PE direct data movement (writing into a peer's memory)
@@ -1245,29 +1237,3 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 - VC arbitration is a first-order approximation, so heavy contention
  scenarios may report slightly optimistic latency vs real HW (limitation of D8)
 - VC chunk-level interleave makes the PE_DMA implementation more complex
-
---
-
-## Affected files
-
-| File | Change |
-|------|------|
-| `topology.yaml` | Add pe_ipcq to pe_template, add ipcq↔dma/cpu/tcm edges |
-| `components.yaml` | Register pe_ipcq_v1 |
-| `src/kernbench/topology/builder.py` | Add the ipcq chain to PE-internal edges |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle IpcqDmaToken |
-| `src/kernbench/common/pe_commands.py` | Define IpcqSendCmd, IpcqRecvCmd, IpcqDmaToken |
-| `src/kernbench/triton_emu/tl_context.py` | tl.send / tl.recv API |
-| `src/kernbench/runtime_api/distributed.py` | Load ccl.yaml, install IPCQ at init (eager) |
-| `src/kernbench/runtime_api/kernel.py` | Define IpcqInitMsg (sideband) |
-| `src/kernbench/ccl/__init__.py` | New: CCL package |
-| `src/kernbench/ccl/topologies.py` | New: builtin topology generators (ring_1d, mesh_2d, tree_binary, etc.), `resolve_topology()` |
-| `src/kernbench/ccl/helpers.py` | New: algorithm-authoring helpers (chunked, ring_step, etc.) |
-| `src/kernbench/ccl/testing.py` | New: mock CCL runtime (`run_kernel_in_mock`) |
-| `ccl.yaml` | New: algorithm metadata + IPCQ default settings |
-| `src/kernbench/ccl/algorithms/ring_allreduce.py` | New: first algorithm example |
-| `tests/test_pe_ipcq.py` | New: PE_IPCQ unit tests |
-| `tests/test_pe_dma_vc.py` | New: PE_DMA virtual channel tests |
-| `tests/test_ipcq_e2e.py` | New: send/recv end-to-end tests |
-| `tests/test_ccl_topologies.py` | New: builtin topology generator unit tests |
@@ -53,16 +53,6 @@ PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability)
 The `IpcqInitMsg` type is used as-is. Only the existing "sideband direct call" bypass
 is removed, unifying the convention.

-### Current state
-
- A `DistributedContext` facade exists
- `init_process_group("ahbm")` → `AhbmCCLBackend` calls `ctx.install_ipcq`
-  → `ccl/install.py` installs the neighbor table into PE_IPCQ via a
-  **sideband direct call** (`pe_ipcq._install_neighbors`)
- `get_rank()` always returns `0` (single-driver)
- `get_world_size()` fallback: total PE count (rank = PE)
- `benches/ccl_allreduce.py`: calls `worker(rank=0, world_size=total_PEs)` once
-
 ### Problems to solve

 1. **rank = SIP in the public API**: so that the bench worker does not need to know about PEs.
@@ -86,14 +76,6 @@ PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability)
  the justification for introducing it is weak. If a need for more precise control-plane
  latency modeling arises in the future, that will be a separate ADR.

-### TODO (after this ADR is implemented)
-
- Tensor Parallelism (ADR-0027)
- Hierarchical all-reduce algorithm design (ADR-0029), the first use case
-  for this ADR's mapper / validator registry infrastructure
-
---
-
 ## Decision

 ### D1. rank = SIP (interpretation of world_size)
@@ -835,34 +817,6 @@ Migration 스케줄:

 ## Open questions

-### 🔴 Critical: potential implementation blockers (must be verified before integration)
-
- **Engine routing of `IpcqInitMsg` (primary implementation risk)**: so far only
-  the sideband has been used, so the engine routing path has not been validated in
-  real use. **This entire ADR rests on the assumption that "engine routing works"**.
-  If it does not, D2, D14, T3 and others are all affected. A **spike verification
-  before starting the ADR implementation** is mandatory:
-  - Whether `engine.submit(IpcqInitMsg(target_sips=..., target_cubes=..., target_pe=...))`
-    is delivered exactly to PE_IPCQ (compare with the existing `MmuMapMsg` /
-    `MemoryWriteMsg` routing pattern)
-  - If unsupported, a minor hook: register `IpcqInitMsg → "pe_ipcq"` in the engine's
-    message-type → component-kind mapping table (localized change, no impact on the
-    topology builder / message schema)
-  - Depending on the result, whether D2 is adopted may change. If routing is not
-    possible, fall back to keeping the sideband path and re-scope this ADR
-
- **Engine-routed install vs sideband equivalence** (D2 verification points 1-5): spike
-  whether T3's equivalence test actually works. In particular, ordering independence and
-  idempotency are properties not covered by existing tests, so new verification is needed.
-
- **Audit of direct `install_ipcq()` callers** (required before implementation): the
-  deprecated wrapper strategy is sound, but the actual migration risk depends on the
-  list of callers. Grep audit before starting:
-  - Pattern: `install_ipcq(` (entire cwd)
-  - Scope: `src/`, `tests/`, `benches/`, `scripts/`, `src/kernbench/cli/`
-  - Introduce the wrapper after listing each caller's expected migration path
-    (→ `dist.init_process_group` vs direct `build_install_plans`)
-
 ### 🟡 Nice-to-have: scope boundaries

 - **Install timing tolerance**: install consumes a few ns to µs of SimPy time. The existing
@@ -883,64 +837,6 @@ multi-level 알고리즘이 driving force가 되는 framework 진화 방향.)

 ---

-## Test strategy
-
-### T1. Launcher infrastructure
-
-`tests/test_ccl_ddp_launcher.py`:
- `test_world_size_equals_sip_count` — D1
- `test_ahbm_set_device_binds_tensor_to_single_sip` — D10/D11
- `test_get_rank_is_greenlet_local` — D9
- `test_run_spawns_one_worker_per_rank` — D12/D13
- `test_get_rank_debug_warning` — D9 warning path
-
-### T2. Install plan builder
-
-`tests/test_ccl_install_plan.py` (new):
- `build_install_plans`: ring_1d × leader_only combination (single PE per rank)
- `build_install_plans`: ring_1d × all_pes combination (multi-PE per rank; confirms
-  the mapper framework works, algorithm-agnostic)
- Mapper / validator registry resolution (built-in key vs import path vs
-  unknown)
- Verify the import path fallback (`"pkg.mod.fn"` form)
-
-### T3. Engine-routed IpcqInitMsg (equivalence: the core verification)
-
-`tests/test_ipcq_init_routing.py` (new):
- **Routing**: `engine.submit(IpcqInitMsg)` → the designated PE_IPCQ actually performs the install
- **Equivalence**: send the same IpcqInitMsg via (a) a direct sideband
-  `_install_neighbors` call and (b) engine.submit, then compare the final PE_IPCQ
-  state (`_queue_pairs`, `_installed`, etc.) for equality
- **Ordering independence**: install msgs for different PEs placed into the engine
-  queue in arbitrary order yield the same final state
- **Idempotency (duplicate install)**: two install msgs to the same PE → the
-  second raises an error (policy: explicit error; see D2 verification point 4)
- **Multi-PE parallel install**: per-PE submits complete without interference
- **Send succeeds after install**: run `IpcqSendCmd` right after install to confirm
-  the neighbor table state is actually valid
-
-### T4. Barrier correctness
-
-`tests/test_collective_barrier.py` (new):
- Single collective works
- Multiple collectives called back to back (epoch isolation)
- Duplicate join from the same rank → RuntimeError
- Rank 1 exits before all_reduce → SpawnException + barrier.reset()
- With a conditional branch, works as long as every rank reaches the collective
-
-### T5. E2E
-
-`tests/test_ccl_allreduce_matrix.py`:
- `ring_tcm` / `ring_hbm` / `ring_sram` @ ws=SIP_count
-
-### T6. Regression
-
-The existing `test_ccl_framework`, `test_ccl_install`, `test_ccl_topologies`,
-`test_ccl_mock_runtime`, `test_pe_ipcq`, `test_ipcq_e2e`, and all other non-CCL
-tests pass.
-
---
-
 ## Consequences

 ### Positive
@@ -970,28 +866,3 @@ multi-level 알고리즘이 driving force가 되는 framework 진화 방향.)
 - The IPCQ PE-level protocol (ADR-0023) is unchanged.
 - `DPPolicy` field changes belong to ADR-0026.
 - The IO_CPU role is unchanged (existing transit kept as-is).
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/runtime_api/distributed.py` | D1/D2/D7/D9: world_size fallback, rank_to_sip, plan ownership, engine-routed install/launch, epoch barrier |
-| `src/kernbench/runtime_api/context.py` | D10/D11: `_AhbmNamespace`, `ctx.ahbm`, `_create_tensor` passes `target_sip` |
-| `src/kernbench/runtime_api/multiprocessing.py` (new) | D12/D13: `spawn` + scheduler + exception |
-| `src/kernbench/ccl/install_plan.py` (new) | D6: `build_install_plans`, `SipInstallPlan`, `PeInstallSpec`, `NeighborTableEntry` |
-| `src/kernbench/ccl/mappers.py` (new) | D5: `leader_only`, `all_pes`, registry + resolver |
-| `src/kernbench/ccl/validators.py` (new) | D5: validator registry + resolver |
-| `src/kernbench/ccl/install.py` | Thin deprecated compat wrapper (D14) |
-| `src/kernbench/ccl/algorithms/ring_allreduce.py` | D4: `kernel` + `kernel_args` kept (no major change) |
-| `src/kernbench/ccl/algorithms/mesh_allreduce.py` | D4: same as above |
-| `src/kernbench/ccl/algorithms/tree_allreduce.py` | D4: same as above |
-| `ccl.yaml` | Add `mapper` / `validator` declarations to each algorithm |
-| `src/kernbench/sim_engine/engine.py` | (If needed) hook to verify `IpcqInitMsg` → PE_IPCQ routing |
-| `benches/ccl_allreduce.py` | Rewrite on top of the new launcher |
-| `tests/test_ccl_ddp_launcher.py` (new) | T1 |
-| `tests/test_ccl_install_plan.py` (new) | T2 |
-| `tests/test_ipcq_init_routing.py` (new) | T3 |
-| `tests/test_collective_barrier.py` (new) | T4 |
-| `tests/test_ccl_allreduce_matrix.py` | T5: simplified to ws=SIP_count |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 2 — Address-based matching; peer_direction field dropped)
+Accepted (Revision 2 — Address-based matching; peer_direction field dropped)

 ## Context

@@ -13,34 +13,6 @@ topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되
 so that it works correctly on a 2-rank bidirectional ring (or, more generally, any
 topology where several directions point to the same peer).

-### Current state (ADR-0023 D9 implementation)
-
-`src/kernbench/components/builtin/pe_ipcq.py` — `_handle_meta_arrival`:
-
-```python
-def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
-    token = msg.token
-    sender_key = (token.src_sip, token.src_cube, token.src_pe)
-    for d, qp in self._queue_pairs.items():
-        p = qp["peer"]
-        if (p.sip, p.cube, p.pe) == sender_key:
-            qp["peer_head_cache"] = max(qp["peer_head_cache"], token.sender_seq + 1)
-            # ... wake recv waiters ...
-            return
-```
-
-`_credit_worker` uses the same "sender-coord-first-match" pattern.
-
-`src/kernbench/ccl/install.py` — `reverse_direction`:
-
-```python
-def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
-    for d, target in neighbor_table[peer_rank].items():
-        if target == my_rank:
-            return d
-    return None
-```
-
 ### Exposed bug: 2-rank bidirectional ring

 `ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.
@@ -289,51 +261,6 @@ for plan in plans:

 ---

-## Test strategy
-
-### T1. Unit — `reverse_direction` opposite-preference
-
-`tests/test_ccl_install.py` (확장):
- Ring ws=2: `reverse_direction(0, 1, "E")` → "W", `reverse_direction(0, 1, "W")` → "E"
- Ring ws=4: `reverse_direction(0, 1, "E")` → "W" (자연스러운 opposite)
- Mesh 2×2: `reverse_direction(r, peer, "N")` → "S", "E" ↔ "W"
- Tree binary: opposite 없는 direction (parent) → fallback 경로
- Non-symmetric topology: opposite가 peer에 없고 다른 direction만 있는 경우
-
-### T2. Runtime: `_handle_meta_arrival` dst_addr matching
-
-`tests/test_pe_ipcq.py` (extended):
- After a 2-rank pair install, a meta arrival with the E direction dst_addr → E's `peer_head_cache`
-  increases (W unchanged)
- A meta arrival with the W direction dst_addr → W's `peer_head_cache` increases
- Invalid dst_addr (not in any rx range) → error or silent drop
-  (to be specified once decided)
-
-### T3. Credit: `dst_rx_base_pa` matching
-
-`tests/test_pe_ipcq.py` (extended):
- After an E direction send, the peer consumes → it sends a credit carrying its own W `my_rx_base_pa`
-  → the sender's E direction `peer_tail_cache` increases
- Same for the W direction
-
-### T4. E2E — 2-rank bidirectional ring
-
-`tests/test_ipcq_e2e.py`:
- On a 2-rank ring_1d, the tl.send(E) + tl.recv(W) pattern works in both directions
- ring at ws=2 passes in ADR-0024's `test_ccl_allreduce_matrix.py`
-
-### T5. Install invariant — rx_base range disjointness
-
-`tests/test_ccl_install_plan.py` (extended):
- I3.1 check: in the `build_install_plans` result, the rx_base ranges of all qps are disjoint
-
-### T6. Regression
-
- Existing ws≥3 ring / mesh / tree tests pass unchanged
- Regression run of the existing `test_pe_ipcq`, `test_ipcq_e2e` cases
-
---
-
 ## Consequences

 ### Positive
@@ -354,19 +281,3 @@ for plan in plans:

 - The semantic layer of the IPCQ protocol (sender computes dst_addr, receiver receives) is
  unchanged.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/ccl/install.py` | D1: add `my_dir` argument to `reverse_direction`, opposite-preference |
-| `src/kernbench/components/builtin/pe_ipcq.py` | D2: `_handle_meta_arrival` dst_addr matching / D3: `_credit_worker` dst_rx_base_pa matching / `_delayed_credit_send` fills the `dst_rx_base_pa` field |
-| `src/kernbench/common/ipcq_types.py` | D3: add `dst_rx_base_pa` field to `IpcqCreditMetadata` |
-| `src/kernbench/ccl/install_plan.py` (new in ADR-0024) | D6: I3.1 invariant check (optional) |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | Reference note: the runtime matching scheme changed in ADR-0025 |
-| `tests/test_ccl_install.py` | T1 |
-| `tests/test_pe_ipcq.py` | T2, T3 |
-| `tests/test_ipcq_e2e.py` | T4 |
-| `tests/test_ccl_install_plan.py` | T5 |
@@ -13,53 +13,6 @@ intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이
 (handled by ADR-0024's `torch.ahbm.set_device(rank)` or by ADR-0027's Megatron parallel
 layers).

-### Current state
-
-`src/kernbench/policy/placement/dp.py`:
-
-```python
-@dataclass(frozen=True)
-class DPPolicy:
-    sip: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    num_pes: int | None = None
-    num_cubes: int | None = None
-    num_sips: int | None = None    # ← to be removed
-```
-
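-The `sip` / `num_sips` fields provide a path for distributing a tensor **across** SIP boundaries.
-This:
-
- **Conflicts with the ADR-0024 launcher model**: ADR-0024 uses a "rank = SIP = 1 worker per SIP"
-  model. Each worker creates tensors on its own SIP. Tensors spanning several SIPs
-  must be handled by Megatron-style TP as separate primitives.
- **Does not match user intent**: "DPPolicy is how a tensor is distributed over the PEs within one device"
-  (user statement).
- **Conflates concepts**: `DPPolicy.sip="column_wise"` is in fact **TP**. The name says DP but
-  the behavior is TP → confusing for new users.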
-
-### Affected call sites (grep results at rollback time)
-
-**Construction sites** (`DPPolicy(sip=...` or `num_sips=...`):
- `tests/test_runtime_api_tensor.py`
- `benches/ccl_allreduce.py` (already reworked within ADR-0024 scope)
- `tests/test_va_offset.py`
- `benches/va_offset_verify.py`
- `tests/test_sip_parallel.py`
-
-**Reference sites** (`dp.sip`, `policy.sip`, `num_sips`, etc.):
- `src/kernbench/runtime_api/context.py` (`_create_tensor`, `launch`)
- `src/kernbench/components/builtin/pe_cpu.py`
- `src/kernbench/components/legacy/builtin/pe_cpu.py`
- `src/kernbench/policy/placement/dp.py` (the implementation itself)
- `tests/test_tensor.py`, `test_ipcq_types.py`
-
-**Key test**: as its name says, `test_sip_parallel.py` tests "expressing SIP parallelism
-through DPPolicy". After this ADR it must be **rewritten for the new launcher model**.
-
---
-
 ## Decision

 ### D1. Remove the `sip` + `num_sips` fields from `DPPolicy`
@@ -258,66 +211,6 @@ for sip_id in sip_range:
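 Recommended. A `PEIdentity` value object has the benefit of an explicit type, but it adds a lot of boilerplate and is currently
 used only as the allocator dict key, so it would be over-engineering. Keep the tuple.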

-### D6. Migration: existing call sites
-
-**(A) Code that used `DPPolicy(sip=..., num_sips=..., ...)`**:
-
- `DPPolicy(sip="column_wise", cube=..., pe=...)` pattern → **rewrite that bench on the ADR-0024
-  launcher**. The worker selects its SIP with `set_device(rank)`; DPPolicy covers
-  cube/PE only.
- `DPPolicy(sip="replicate", num_sips=1, ...)` pattern → reduce to `DPPolicy(cube=..., pe=...)`
-  (follows naturally once the fields are gone).
-
-**(B) Code that read `dp.sip`, `dp.num_sips`**:
-
- Remove. Delete the `dp.sip` branch in `_compute_local_shape` of `launch()`.
- Also clean up the places where `pe_cpu.py` referenced `dp.sip`.
-
-**(C) Code that used `ShardSpec.pe_index`: all of it must be changed**:
-
- Accessing `.pe_index` now raises `AttributeError` → every call site must be fixed.
- Allocator lookup: `allocators[spec.pe_index]` →
-  `allocators[(spec.sip, spec.cube, spec.pe)]`
- Local contexts that truly need a flat integer: compute `spec.sip * N_CUBES * N_PE + spec.cube *
-  N_PE + spec.pe` explicitly. **Use it only as a local variable and do not expose it in the
-  public API**.
-
-**Grep audit checklist before starting implementation**:
-
-1. **Property references**:
-   - `\.pe_index\b`: both field and property access (regex)
-   - `pe_index=`: keyword argument at construction
-   - `pe_index:`: dataclass field declaration
-2. **Allocator / dict indexing**:
-   - `allocators\[`: dict lookup pattern. Check whether things like `allocators[spec.pe_index]`
-     are caught
-   - `_allocators\[`: same pattern (prefix _)
-3. **Manual flat-index computation blocks**:
-   - `flat_idx =`
-   - `pe_index =` (left-hand side)
-   - `* pes_per_cube +` (typical flat computation pattern)
-   - `* self._num_cubes \* self._pes_per_cube` (global flat computation)
-4. **Serialization / logging**:
-   - `asdict(.*shard`: whether `pe_index` is included automatically on dataclass serialization
-   - `repr(.*ShardSpec`: whether a log format depends on it
-   - Whether JSON/YAML storage formats use a `pe_index` key
-5. **Tests asserting integer PE identity**:
-   - `assert .*pe_index`: asserts integer identity
-   - `spec.pe_index ==`: comparison (tests may break if the meaning becomes SIP-local)
-
-For each match, decide "did this caller expect global flat, SIP-local, or an internal lookup?"
-and then replace it with structural coordinates.
-
-**(D) `test_sip_parallel.py`**:
-
- Keep the name; rewrite the body on ADR-0024's multi-greenlet launcher.
- Verify as "SIP parallelism = per-rank worker × each worker's own DPPolicy".
-
-**(E) `test_va_offset.py`, `benches/va_offset_verify.py`**:
-
- Most cases only use `num_sips=1`. Simply remove the field.
- If the SIP offset test is the point, port it to `set_device(rank)` + observing structural coordinates.
-
 ### D7. Backward compatibility: none (cleanup ADR)

 This ADR is a **breaking change**.
@@ -331,17 +224,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
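 **Blocking silent drift** is the main benefit of removing the property entirely: it eliminates the chance that code
 expecting a global flat index receives a SIP-local result and silently indexes incorrectly.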

-### D8. Documentation updates
-
- `ADR-0008` (tensor deploy): note updating the DPPolicy semantics, and state the ShardSpec switch
-  to structural coordinates
- State "intra-device only" in the DPPolicy docstring (the docstring in the D1 code snippet)
- State in the ShardSpec docstring that **structural coordinates `(sip, cube, pe)` are used
-  directly and `pe_index` is no longer provided** (D2)
- Remove `sip=...` examples from tutorials such as `docs/ccl-author-guide`
-
---
-
 ## Dependencies

 - **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
@@ -378,56 +260,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에

 ---

-## Test strategy
-
-### T1. Unit test updates
-
- `tests/test_tensor.py`, `tests/test_ipcq_types.py`, `tests/test_runtime_api_tensor.py`:
-  clean up DPPolicy constructor arguments, verify ShardSpec structural coordinates
- `tests/test_va_offset.py`: behavior preserved after removing `num_sips=1`
-
-### T2. `resolve_dp_policy` returns structural coordinates
-
-`tests/test_dp_policy.py` (new or extended):
- Every ShardSpec in the `resolve_dp_policy(dp, ..., target_sip=1)` result has `sip=1`
- Each spec's `(cube, pe)` is local (0..num_cubes-1, 0..num_pe-1)
- On the same topology, the `target_sip=0` and `target_sip=1` results differ only in the sip field
-
-### T3. Rewrite `test_sip_parallel.py`
-
-Verify SIP parallelism on the launcher:
-
-```python
-def test_sip_parallel_via_launcher(topology):
-    ...
-    def worker(rank, ws, torch):
-        torch.ahbm.set_device(rank)
-        t = torch.zeros((1, 128), dtype="f16",
-                         dp=DPPolicy(cube="column_wise", pe="column_wise"))
-        # verify shard.sip == rank (structural coord)
-
-    spawn(worker, nprocs=n_sips, ...)
-```
-
-### T4. Allocator key migration
-
-`tests/test_allocator_structural_key.py` (new or extending an existing file):
- The `PEMemAllocator` dict works with `(sip, cube, pe)` tuple keys
- `deploy_tensor` looks up the allocator by structural coordinates
- Same for `_free_tensor`
-
-### T5. E2E regression
-
-ADR-0024's `test_ccl_allreduce_matrix.py` passes unchanged.
-
-### T6. Error checks
-
- Calling `DPPolicy(sip="column_wise")` → `TypeError`. Covered by an explicit test.
- Calling `DPPolicy(num_sips=2)` → `TypeError`.
- Accessing `spec.pe_index` → `AttributeError` (verifies the property is fully removed).
-
---
-
 ## Consequences

 ### Positive
@@ -454,23 +286,3 @@ ADR-0024의 `test_ccl_allreduce_matrix.py` 그대로 통과.
 ### Neutral

 - The meaning of the existing `cube` / `pe` fields is unchanged.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/policy/placement/dp.py` | D1: remove `sip`/`num_sips` / D2: add `sip`/`cube`/`pe` structural fields to `ShardSpec`, **remove the `pe_index` property** / D3: add `target_sip` to `resolve_dp_policy`, remove the SIP-level loop / the shard type returned by the internal resolver is also clarified under the name `local_pe` (avoids a name clash) |
-| `src/kernbench/runtime_api/context.py` | D4: `_create_tensor` passes `target_sip` / D5: `_ensure_allocators` dict key → `(sip, cube, pe)` tuple / remove the `dp.sip` branch in `launch` |
-| `src/kernbench/runtime_api/tensor.py` | D5: `deploy_tensor` looks up the allocator by structural coordinates |
-| `src/kernbench/components/builtin/pe_cpu.py` | D6: remove `dp.sip` references |
-| `src/kernbench/components/legacy/builtin/pe_cpu.py` | D6: same |
-| `benches/ccl_allreduce.py` | Already handled in ADR-0024 scope |
-| `benches/va_offset_verify.py` | D6: remove `num_sips=1` |
-| `tests/test_runtime_api_tensor.py` | D6 |
-| `tests/test_va_offset.py` | D6 |
-| `tests/test_tensor.py`, `test_ipcq_types.py` | D6 |
-| `tests/test_sip_parallel.py` | T3: rewrite on the launcher |
-| `tests/test_dp_policy.py` (new or extended) | T2 |
-| `tests/test_allocator_structural_key.py` (new) | T4 |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 7: resume invariant / main-context wait non-recursion invariant /
+Accepted (Revision 7: resume invariant / main-context wait non-recursion invariant /
 global barrier over-serialization tradeoff / TP forward yield-safety made explicit,
 2026-04-14)

@@ -19,20 +19,6 @@ Megatron-style을 선택한 이유:
 - An industry standard established by NVIDIA Megatron / DeepSpeed.
 - DTensor is declarative, so its design space is larger → adopt it in stages.

-### Current state
-
- KernBench has no TP. The old `DPPolicy.sip="column_wise"` path was removed in
-  ADR-0026. TP primitives are layered on top of the rank = SIP launcher (ADR-0024).
- ADR-0024 Phase B exposed the **worker-greenlet env.run re-entrancy bug**:
-  when a worker calls `ctx.wait(h)` (e.g. MmuMapMsg during tensor creation), `env.run`
-  runs in the worker context, and the `_parent` of any kernel greenlet spawned then
-  becomes the worker, producing an orphan. This is the root cause of the `ring_default_ws` strict xfail.
- `dist.all_reduce` already avoids this problem with the `_defer_wait=True` + worker yield pattern
-  ([distributed.py:119-134](src/kernbench/runtime_api/distributed.py#L119-L134)).
- A TP layer's forward calls `torch.launch("gemm", ...)` every time, followed by
-  `dist.all_reduce`, and this pattern repeats. The worker-wait problem **must** be
-  resolved, or the TP sample fails on its first run.
-
 ### TP primitive spec (Megatron-LM reference)

 - **ColumnParallelLinear**: the weight's **column (out_features)** axis across TP ranks
@@ -907,155 +893,6 @@ PR을 심사.

 ---

-## Test strategy
-
-### T1. Unit: `tests/test_tp_parallel_state.py` (new)
-
- `initialize_model_parallel(ws)` succeeds only when `ws` matches world_size.
- `get_tensor_model_parallel_rank()` returns the greenlet-local rank (ADR-0024 D9
-  regression).
- In the uninitialized state, `get_tensor_model_parallel_world_size()` fails appropriately.
-
-### T2. Unit: `tests/test_tp_layers.py` (new)
-
-**Shape / structural checks**:
-
- Per-rank `ColumnParallelLinear(in=256, out=512).weight.shape` is `(256, 512/ws)`.
- Per-rank `RowParallelLinear(in=512, out=256).weight.shape` is `(512/ws, 256)`.
- The output tensor shape of `ColumnParallelLinear.forward(x)` is `(M, K/ws)`.
-
-**Numerical correctness (weight ≠ zero)**: plain shape asserts miss algebraic errors,
-so verify the actual computed results with deterministic non-zero inputs/weights:
-
- **T2.a (ColumnParallel, deterministic)**: initialize the weight to a per-rank identity
-  (or a deterministic pattern such as `(i, j) → i + rank * k_local + j`)
-  via `tensor.copy_`. Run forward with input `x` set to a constant vector. Each rank's output must
-  **match the expected `x @ W_rank_local` within rtol/atol 1e-2** (allowing for the gemm kernel's
-  fp16 round-off).
- **T2.b (RowParallel, reduced output equality, primary)**: verify that every rank's
-  forward result matches the same global matrix product `concat([x_0..x_{ws-1}]) @ concat([W_0..
-  W_{ws-1}])`. By comparing per-rank `y.numpy()`, assert **both** (i) elementwise equality
-  after the all-reduce and (ii) agreement with the expected value (computed with host-side numpy).
-  Observable-only verification; no internal hook needed.
-
-  *Optional implementation note*: to observe the partial-sum stage in finer detail,
-  an intercept hook just before the `_pending_collective_handles` enqueue can be used, but
-  that couples to internal implementation detail, so the ADR-level test contract requires only
-  T2.b's observable equality.
- **T2.c (rank-identity after all_reduce)**: every rank's `y.numpy()` is elementwise
-  identical (full array equality, not just the mean; rtol 1e-2).
-
-**No legacy weak assertions**: aggregate-only checks such as `output mean is identical`
-break silently too easily, so **do not use them as the main assertion**; use them only
-as an auxiliary sanity check.
-
-### T3. Worker-wait generalization + orphan regression: `tests/test_worker_wait_drain.py` (new)
-
-The core purpose of this test is not queue behavior but **directly preventing the ADR-0024 Phase B orphan
-regression**. Assert the following:
-
- **T3.a**: when a worker calls `ctx.wait(h)`, the handle is enqueued in
-  `_pending_worker_waits` and the worker is not resumed until main drains it.
- **T3.b**: right after `_drain_pending`, the worker resumes and the handle is in the
-  `_completed` state.
- **T3.c**: with multiple workers, all workers resume at the same drain point.
- **T3.d (orphan invariant, key)**: after the worker function calls `torch.launch(...)`,
-  at the point where the SimPy engine actually starts running, **the kernel greenlet's
-  `_parent` is the main greenlet**. The test either monkey-patches `kernel_runner.run`
-  or installs an assertion hook at the point where `KernelRunner._parent` is captured, and
-  verifies this invariant directly.
- **T3.e (symptom regression)**: without D0, the GreenletExit failure equivalent to T3.d
-  must reproduce (documents the historical failure mode; the actual test is marked
-  skip or xfail once D0 lands).
- **T3.f (idempotency)**: calling `ctx.wait(h)` twice on the same handle
-  invokes `engine.wait` only once (D0.4-(3)).
- **T3.g (exception propagation)**: if a worker raises after calling `wait`, the main
-  scheduler loop stops immediately and the exception propagates upward. The remaining `_pending_worker_waits`
-  are not drained (D0.4-(4)).
-
-### T4. `torch.multiprocessing.spawn`: `tests/test_mp_spawn.py` (new)
-
- `spawn(fn, args, nprocs)` creates nprocs greenlets and binds each to a rank.
- Returns after all workers complete.
- Replacing the hand-rolled loop in the existing bench `ccl_allreduce.py` with `mp.spawn`
-  still passes the matrix regression.
-
-### T5. Host-read barrier: `tests/test_host_read_barrier.py` (new)
-
-Directly verifies the D0.5 contract:
-
- **T5.a**: when a worker calls `launch → tensor.numpy()` back to back, the barrier fires and
-  the numpy result holds the values after kernel completion (post-drain).
- **T5.b**: `launch → tensor.shape` (metadata) does not trigger the barrier (the pending
-  queue is left as is).
- **T5.c**: a `numpy()` call with an empty pending queue reads immediately
-  without yielding (avoids an unnecessary context switch).
- **T5.d**: `__getitem__` and `data` trigger the same barrier as T5.a.
- **T5.e**: a `numpy()` call while a collective (all_reduce) is pending
-  waits for the collective drain before reading.
- **T5.f (copy_ write barrier)**: when `target.copy_(source)` is called while the target tensor has an
-  incomplete pending handle, a drain fires before the write.
-  Check that the injected host source is written over the post-drain state (no
-  stale overwrite).
- **T5.g (closed-set via registry)**: the closed set of barrier entry points is kept as an
-  **explicit registry** (e.g., at the top of `tensor.py`, `_HOST_READ_BARRIERS = frozenset
-  ({"numpy", "data", "__getitem__", "__repr__", "copy_"})`).
-  The test:
-  1. Observes, for each entry point listed in the registry, that **a barrier is actually injected**
-     (that on invocation it checks the pending queue and goes through the yield path).
-  2. Adding a new API with host-read semantics requires a registry update at code review
-     (enforced through CODEOWNERS / a review checklist).
-
-  **Non-goal**: automatically detecting barrier-less APIs through Python introspection (method signatures,
-  docstring analysis, etc.) is outside the ADR scope because of precision problems. The registry
-  + review approach is sufficient.
-
-### T6. E2E: `tests/test_tp_mlp.py` (new)
-
-2-layer MLP (ColumnParallel → RowParallel) forward:
-
-**Structural / liveness**:
-
- Runs to completion with the `ws = SIP count` model (currently 2 per topology.yaml).
- **No deadlock**: the scheduler loop terminates in finite time (pytest-timeout or similar).
- **Completion trace**: each `launch` and `all_reduce` leaves an entry in `ctx._traces`
-  (count = expected number of layers).
-
-**Numerical correctness (required)**:
-
- **T6.a (zero-weight sanity)**: all weights 0 → all outputs 0. A smoke test that the pipeline
-  runs at all. **Not sufficient on its own; adopt it together with
-  T6.b/T6.c**.
- **T6.b (deterministic pattern)**: `copy_` a deterministic non-zero pattern into every weight
-  (e.g., all 0.01, or values derived from a per-rank identity). The input is also
-  constant. Compute the expected output with host-side numpy and compare against each rank's `y.numpy()` at rtol
-  1e-2.
- **T6.c (rank-consistency post all-reduce)**: after RowParallel's all-reduce,
-  **every rank's output is elementwise identical** (same criterion as T2.c).
-  Full array equality, not just matching means.
- **T6.d (shape contract)**: ColumnParallel output is `(B, D_hidden / ws)`,
-  RowParallel output is `(B, D_out)`.
-
-### T7. Regression: lift the `ring_default_ws` xfail
-
- In `tests/test_ccl_allreduce_matrix.py::test_ccl_allreduce_matrix[ring_default_ws]`,
-  remove `@pytest.mark.xfail(strict=True)` → must **PASS**.
- Acceptance criteria (observable):
-  - **No deadlock**: the bench terminates in finite time.
-  - **No GreenletExit**: no GreenletExit trace in stderr/log.
-  - **Rank 0 output**: the string `ring_allreduce_tcm (ws=2): 2 OK` is printed.
-  - **Completion trace**: an `all_reduce` trace entry exists.
-  - **Numerical**: for each rank's input `r+1`, the result sum(1..ws)=3 is reached within a tolerance
-    of 1e-1.
-
-### T8. Regression: the full existing test suite
-
- Every test that passed up to ADR-0026 still passes (523 passed + 1 xfail).
- Phase 2 completion criterion: 524 passed (including the lifted xfail) + 0 xfail + all the new T1~T7
-  tests above pass.
-
---
-
 ## Consequences

 ### Positive
@@ -1080,29 +917,3 @@ D0.5 contract를 직접 검증:

 - Adds a pure upper layer on top of the ADR-0024/0026 base. No impact on the hardware simulation
  stack (except D0).
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/runtime_api/context.py` | D0.1/D0.2: `_pending_worker_waits` + worker fork in `ctx.wait`, D1.3: `self.multiprocessing` namespace attach |
-| `src/kernbench/runtime_api/multiprocessing.py` | New (D1): `_MultiprocessingNamespace.spawn` + `_drain_pending` + `SpawnException` |
-| `src/kernbench/runtime_api/distributed.py` | Strengthen the `_pending_collective_handles` type annotation (`list[tuple[RequestHandle, int, dict]]`); expose the clear call point for spawn exception cleanup |
-| `src/kernbench/runtime_api/tensor.py` | D0.5 barrier injection: `numpy`, `__getitem__`, `data`, `__repr__`, `copy_` (source read + target write) |
-| `src/kernbench/tp/__init__.py` | New: public API re-export |
-| `src/kernbench/tp/parallel_state.py` | New: D3 |
-| `src/kernbench/tp/layers.py` | New: D4/D5 |
-| `src/kernbench/tp/primitives.py` | New: D6 |
-| `src/kernbench/tp/kernels.py` | New: `_gemm_kernel` for TP layers (copied from the bench) |
-| `src/kernbench/tp/mappings.py` | New stub (backward TODO) |
-| `benches/tp_mlp.py` | New sample (D7) |
-| `benches/ccl_allreduce.py` | Replace hand-rolled loop with `torch.multiprocessing.spawn` (D1.4) |
-| `tests/test_tp_parallel_state.py` | New (T1) |
-| `tests/test_tp_layers.py` | New (T2) |
-| `tests/test_worker_wait_drain.py` | New (T3): includes direct verification of the orphan invariant |
-| `tests/test_mp_spawn.py` | New (T4) |
-| `tests/test_host_read_barrier.py` | New (T5): D0.5 host-read barrier contract |
-| `tests/test_tp_mlp.py` | New (T6) |
-| `tests/test_ccl_allreduce_matrix.py` | Remove `ring_default_ws` xfail (T7) |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Blocked on ADR-0031 — PhysAddr PE-resource extension)
+Proposed

 ## Context

@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Merged into ADR-0011 (Address Model: LA section).

 ## Context

@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Merged into ADR-0011 (Address Model: LA section).

 ## Context