ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/, immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted (code verified) and prune Implementation Notes / Affected files / Test strategy / "현재 상태" ("Current state") sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6 Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00
parent ecc57d050d
commit 22fd0d2b9d
23 changed files with 553 additions and 1290 deletions
@@ -94,7 +94,7 @@ The Phase 0 PA shard map remains a valid fast-path configuration.

 ## Links

- ADR-0011 (PA-first)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host↔IO_CPU schema)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (Kernel execution)
@@ -1,95 +1,514 @@
-# ADR-0011: Memory Addressing — PA-first with VA/MMU Extension
+# ADR-0011: Memory Addressing — PA / VA / LA Address Models

 ## Status

-Accepted (Phase 1 VA/MMU implemented)
+Accepted.
+
+- **VA model: currently implemented (default).**
+- PA model: implemented as PageFault fallback in PE_DMA.
+- LA model: proposed, not implemented.

 ## Context

-A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
-translation path for DMA: host allocates physical memory at PE level, maps it
-into a virtual address space, installs mappings, and DMA requests use virtual
-addresses that are translated to physical addresses.
+KernBench's address model evolved through three design points, each
+addressing a limitation of the previous. This ADR documents all three
+in one place because future implementation work selects among them.

-The PA-only model (Phase 0) was insufficient for running standard Triton kernels
-that use `base_addr + offset` patterns on sharded tensors — each PE's shard has
-a different PA, but the kernel needs a single contiguous address space.
+### PA-only baseline
+
+Phase 0 of KernBench treated all device memory operations
+(MemoryRead/MemoryWrite) as raw physical-address transfers. No
+host-side virtual addressing, no MMU/IOMMU translation. Allocators
+returned PA mappings; DMA requests carried PA directly.
+
+This was sufficient for early correctness/latency work but
+insufficient for running standard Triton kernels that use
+`base_addr + offset` patterns on sharded tensors: each PE's shard
+has a different PA, but the kernel needs a single contiguous address
+space to compute offsets.
+
+### Why VA/MMU (current default)
+
+A realistic system uses host-side virtual addressing and an
+MMU/IOMMU-style translation path for DMA: the host allocates physical
+memory at PE level, maps it into a virtual address space, installs
+mappings, and DMA requests use virtual addresses that are translated
+to physical addresses.
+
+Adopting this model lets kernels use `base_addr + offset` over a
+contiguous VA range while the device-side MMU translates each access
+to the appropriate PA.
+
+### Why LA/BAAW (proposed)
+
+VA/MMU treats HBM as a single backing space. KernBench needs to
+explore architectures where HBM is composed of multiple pseudo
+channels in parallel:
+
+- CUBE's HBM has 32 or 64 pseudo channels.
+- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+  (N = `hbm_pseudo_channels / pes_per_cube`).
+- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
+  (N × per-channel).
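The channel arithmetic in this list can be checked with a short sketch. The function name `pe_channel_plan` is hypothetical, and the sketch assumes the pseudo-channel count divides evenly among the PEs of a cube, which the ADR implies but does not state as a rule.

```python
def pe_channel_plan(hbm_pseudo_channels: int, pes_per_cube: int,
                    channel_bw_gbs: float) -> tuple[int, float]:
    """Return (channels per PE, aggregate PE bandwidth in GB/s)."""
    if pes_per_cube <= 0 or hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: uneven splits are rejected rather than rounded.
        raise ValueError("pseudo channels must divide evenly among PEs")
    n = hbm_pseudo_channels // pes_per_cube
    return n, n * channel_bw_gbs


print(pe_channel_plan(64, 8, 32.0))  # (8, 256.0)
```

With 32 pseudo channels and 8 PEs the same call yields 4 channels per PE, matching the "Other N values" listed under D-LA6.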
+
+Two channel-mapping modes need to be modelable:
+
+- **1:1 mode** — one logical access → N per-channel requests.
+  Precise per-channel BW contention modelling.
+- **n:1 mode (default)** — one logical access → one aggregated
+  request. Channels are assumed to interleave; aggregated BW model.
+
+VA's `tl.load(va_ptr)` produces a single DMA request to a single
+target. Decomposing that into per-channel requests inside PE_DMA
+requires the address layer to be aware of channels. This is the
+role of the LA (Logical Address) abstraction with BAAW
+(Logical-to-Physical Mapping Unit).
+
+Core requirements driving the LA design:
+
+- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
+  in both modes (only request shape and resource model differ).
+- Kernel programming model is unchanged — physical channel
+  information is never exposed to kernel code.
+- Mode switch is a topology-level configuration.
+
+### Design space summary
+
+| Model | Status | Key idea |
+|-------|--------|----------|
+| PA | fallback (implemented) | Direct physical addressing, no translation |
+| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
+| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |

 ---

 ## Decision

-### D1. Phase 0 model is PA-only (original, retained as fallback)
+This ADR defines three address models. At any given time the system
+operates in exactly one model. Selection is topology- / configuration-
+driven; coexistence within one simulation run is not required.

- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
-  addresses (PA) plus size.
- PA-only mode remains functional via PageFault fallback in PE_DMA.
+---

-### D2. Allocation produces PA mappings
+### Address Model: PA (Physical Address) — fallback

-Device allocation selects PE-local memory regions and returns PA mappings
-sufficient to execute kernels and issue DMA requests.
+#### D-PA1. PA-only semantics

-### D3. Phase 1: VA/MMU layer (implemented)
+- All device memory accesses (MemoryRead/MemoryWrite) operate on
+  device physical addresses (PA) plus size.
+- PA-only mode remains functional via the PageFault fallback path in
+  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
+  the value as a PA directly.

-#### D3.1 Virtual Address Model
+#### D-PA2. Allocation produces PA mappings
+
+Device allocation selects PE-local memory regions and returns PA
+mappings sufficient to execute kernels and issue DMA requests.
+
+The PA model is retained primarily for backward compatibility with
+PA-only tests and as the underlying physical layer that the VA / LA
+models resolve into.
+
+---
+
+### Address Model: VA (Virtual Address with MMU) — current default
+
+#### D-VA1. Virtual Address Model

 - Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
 - `TensorShard` does NOT carry a `va` field — shard VA is derived as
  `va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via `TensorArg.va_base`).
+- Kernels receive `va_base` as their pointer argument (via
+  `TensorArg.va_base`).
 - `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).

-#### D3.2 PE_MMU Component
+#### D-VA2. PE_MMU Component

- Hybrid design: SimPy component (inbox for MmuMapMsg) + utility (synchronous
-  `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA→PA translation.
+- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
+  (synchronous `translate()` called by PE_DMA).
+- Page-aligned dict lookup for O(1) VA → PA translation.
 - `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA directly
-  (backward compatibility with PA-only tests).
+- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
+  directly (preserves PA model for backward compatibility).
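A minimal sketch of the page-aligned dict lookup and the PageFault fallback described in D-VA2. The names `SketchMmu`, `PageFault`, and `dma_resolve` are hypothetical, and signalling the fault with an exception is an assumption; the ADR only says PE_DMA treats an unmapped address as a PA.

```python
PAGE_SIZE = 4096  # default page size named in D-VA5


class PageFault(Exception):
    """Raised when a VA has no installed mapping (hypothetical)."""


class SketchMmu:
    def __init__(self, page_size: int = PAGE_SIZE) -> None:
        self.page_size = page_size
        self._pages: dict[int, int] = {}  # VA page base -> PA page base

    def map(self, va: int, pa: int, size: int) -> None:
        """Install page-aligned mappings covering [va, va + size)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for off in range(0, size, self.page_size):
            self._pages[va + off] = pa + off

    def translate(self, va: int) -> int:
        """O(1) lookup: split the VA into page base and in-page offset."""
        page_base = va - (va % self.page_size)
        try:
            return self._pages[page_base] + (va - page_base)
        except KeyError:
            raise PageFault(hex(va)) from None


def dma_resolve(mmu: SketchMmu, addr: int) -> int:
    """PE_DMA-side fallback: an unmapped address is used as a PA."""
    try:
        return mmu.translate(addr)
    except PageFault:
        return addr


mmu = SketchMmu()
mmu.map(0x1_0000_0000, 0x2000, 2 * PAGE_SIZE)
mapped = dma_resolve(mmu, 0x1_0000_1010)   # second page, offset 0x10
fallback = dma_resolve(mmu, 0x5000)        # no mapping: passed through as PA
```

The fallback keeps PA-only tests working without any mapping installed, at the cost of silently accepting a stale or mistyped VA as a physical address.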

-#### D3.3 Mapping Installation
+#### D-VA3. Mapping Installation

- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube fan-out) →
-  M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent cross-SIP
-  mapping contamination for replicated tensors.
+- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
+  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
+  end-to-end.
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
+  cross-SIP mapping contamination for replicated tensors.
 - Mapping strategy based on `DPPolicy.cube`:
-  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
-    Each cube's PEs see only their local PA. No cross-cube mapping installed.
-  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
-    target cubes. Enables cross-PE and cross-cube DMA.
+  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
+    only. Each cube's PEs see only their local PA. No cross-cube
+    mapping installed.
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
+    mappings to all target cubes. Enables cross-PE and cross-cube
+    DMA.

-#### D3.4 Tensor Lifecycle
+#### D-VA4. Tensor Lifecycle

- `del tensor` triggers automatic cleanup via `Tensor.__del__` + `weakref` to
-  RuntimeContext. Sends `MmuUnmapMsg` through fabric, returns VA and PA space.
+- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
+  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
+  returns VA and PA space.
 - `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
 - `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
 - `PEMemAllocator` uses free-list with coalescing (not bump allocator).
 - `VirtualAllocator` uses free-list with coalescing for VA space.
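The lifecycle rules in D-VA4 can be illustrated with a small sketch. `SketchContext` and `SketchTensor` are hypothetical stand-ins for `RuntimeContext` and `Tensor`; the `released` list stands in for sending `MmuUnmapMsg` and returning VA/PA space. Immediate cleanup on `del` relies on CPython reference counting, which is an assumption about the interpreter rather than something the ADR states.

```python
import weakref


class SketchContext:
    def __init__(self) -> None:
        self._tensors: list[weakref.ref] = []  # weak refs do not block GC
        self.released: list[str] = []          # stand-in for unmap + free

    def register(self, tensor: "SketchTensor") -> None:
        self._tensors.append(weakref.ref(tensor))

    def release(self, name: str) -> None:
        self.released.append(name)

    def __enter__(self) -> "SketchContext":
        return self

    def __exit__(self, *exc) -> bool:
        # Scope-based bulk cleanup of tensors that are still alive.
        for ref in self._tensors:
            tensor = ref()
            if tensor is not None:
                tensor.free()
        self._tensors.clear()
        return False


class SketchTensor:
    def __init__(self, ctx: SketchContext, name: str) -> None:
        self.name = name
        self._ctx = weakref.ref(ctx)  # tensor does not keep the context alive
        self._freed = False
        ctx.register(self)

    def free(self) -> None:
        if self._freed:               # idempotent: safe to call twice
            return
        self._freed = True
        ctx = self._ctx()
        if ctx is not None:
            ctx.release(self.name)

    def __del__(self) -> None:
        self.free()


with SketchContext() as ctx:
    a = SketchTensor(ctx, "a")
    b = SketchTensor(ctx, "b")
    del a   # last reference dropped: "a" is released right away
# leaving the block releases "b", which was still alive
```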

-#### D3.5 Allocators
+#### D-VA5. Allocators

- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free with
+- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
+  with coalescing.
+- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
  coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with coalescing.
- Page size configurable via `topology.yaml` pe_mmu attrs (default 4096).
+- Page size configurable via `topology.yaml` `pe_mmu` attrs
+  (default 4096).
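A minimal sketch of a free-list allocator with coalescing, in the spirit of `VirtualAllocator` and `PEMemAllocator`. The class name `FreeListAllocator`, the first-fit policy, and the requirement that the caller passes the size back to `free` are all assumptions; the ADR only specifies "free-list with coalescing" and page-aligned allocation.

```python
class FreeListAllocator:
    """First-fit free list over [base, base + size) with coalescing."""

    def __init__(self, base: int, size: int, align: int = 4096) -> None:
        self.align = align
        self._free: list[tuple[int, int]] = [(base, size)]  # sorted (start, length)

    def _round_up(self, nbytes: int) -> int:
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        return -(-nbytes // self.align) * self.align

    def alloc(self, nbytes: int) -> int:
        need = self._round_up(nbytes)
        for i, (start, length) in enumerate(self._free):
            if length >= need:
                if length == need:
                    del self._free[i]              # block consumed entirely
                else:
                    self._free[i] = (start + need, length - need)
                return start
        raise MemoryError("out of address space")

    def free(self, addr: int, nbytes: int) -> None:
        self._free.append((addr, self._round_up(nbytes)))
        self._free.sort()
        merged: list[tuple[int, int]] = []
        for start, length in self._free:
            if merged and merged[-1][0] + merged[-1][1] == start:
                prev_start, prev_len = merged[-1]
                merged[-1] = (prev_start, prev_len + length)  # coalesce neighbours
            else:
                merged.append((start, length))
        self._free = merged


BASE = 0x1_0000_0000
va = FreeListAllocator(BASE, 16 * 4096)
x = va.alloc(100)    # rounded up to one page
y = va.alloc(4097)   # rounded up to two pages
va.free(x, 100)
va.free(y, 4097)     # adjacent blocks merge back into one region
```

Unlike a bump allocator, freed ranges become reusable, and coalescing keeps repeated alloc/free cycles from fragmenting the space into unusably small blocks.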

---
+#### Consequences (VA model)

-## Consequences
-
- Triton kernels use `base_addr + offset` patterns naturally on sharded tensors.
- All latency remains explicit via graph traversal, including MMU mapping
-  installation and per-access TLB overhead.
+- Triton kernels use `base_addr + offset` patterns naturally on
+  sharded tensors.
+- All latency remains explicit via graph traversal, including MMU
+  mapping installation and per-access TLB overhead.
 - PA-only mode retained as fallback (PageFault → treat as PA).
- Benchmark parameter renamed `ctx` → `torch` for PyTorch code compatibility.
 - IPCQ and other fixed-address resources bypass MMU (use PA directly).

 ---

+### Address Model: LA (Logical Address with BAAW) — proposed
+
+LA replaces VA when channel-level HBM modelling is required.
+Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
+removed artifacts). Coexistence with VA in the same run is not a goal.
+
+#### D-LA1. LA introduction — replaces VA infrastructure
+
+LA is the sole address space used by kernel code (`tl.load`,
+`tl.store`, `tl.composite`). Properties:
+
+- Can map a Tensor to a contiguous logical space (like VA).
+- Expresses `(logical buffer + offset)`.
+- Does NOT contain physical channel information directly.
+- Stays as an intermediate abstraction until physical resolution.
+
+LA address space:
+
+| Item | Value |
+|------|-------|
+| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
+| LA space size | 64 GB per PE |
+| Alignment unit | segment (see D-LA3) |
+
+LA is PE-local: different PEs may use the same LA value; BAAW segment
+tables differ → they resolve to different PAs.
+
+VA infrastructure removed when LA is adopted:
+
+| Removed | Replacement |
+|---------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
+| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` | `la_base` |
+| `topology.yaml`: `pe_mmu` component entry | Removed |
+
+#### D-LA2. Mapping mode setting
+
+Topology-level (cube) configuration:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # per-PE local channel count
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+Consumed by the graph compiler (topology builder) and BAAW
+initialisation.
+
+#### D-LA3. Segment and BAAW
+
+A segment partitions the LA space; each segment maps to a specific HBM
+channel or channel group. Segments are created at tensor deploy time by
+the runtime allocator. BAAW resolves LA → physical request(s) using the
+segment table.
+
+```python
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # channels assigned to this segment (e.g. 8)
+    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
+    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
+    channel_size: int     # per-channel size (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA base
+    agg_node_id: str      # aggregated router node_id
+```
+
+Segment lifecycle:
+
+1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
+   allocator. PEMemAllocator allocates per-channel PA (1:1) or
+   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
+   with PE_DMA.
+2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
+   converts to PA(s).
+3. **Free** (tensor free): segment removed from table; LA and PA
+   returned.
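The install / lookup / free steps above can be sketched as a sorted segment table. `Seg` and `SegmentTable` are hypothetical names, and the use of `bisect` plus overlap rejection at install time is an assumption; the ADR only fixes the lookup condition `la_base <= la < la_base + la_size`.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Seg:
    la_base: int
    la_size: int


class SegmentTable:
    def __init__(self) -> None:
        self._bases: list[int] = []  # sorted la_base values
        self._segs: list[Seg] = []   # parallel to _bases

    def install(self, seg: Seg) -> None:
        i = bisect.bisect_left(self._bases, seg.la_base)
        prev_overlaps = (
            i > 0
            and self._segs[i - 1].la_base + self._segs[i - 1].la_size > seg.la_base
        )
        next_overlaps = (
            i < len(self._segs)
            and seg.la_base + seg.la_size > self._segs[i].la_base
        )
        if prev_overlaps or next_overlaps:
            raise ValueError("segment overlaps an installed segment")
        self._bases.insert(i, seg.la_base)
        self._segs.insert(i, seg)

    def find(self, la: int) -> Seg:
        # Rightmost segment whose base is <= la, then a bounds check.
        i = bisect.bisect_right(self._bases, la) - 1
        if i >= 0 and la < self._segs[i].la_base + self._segs[i].la_size:
            return self._segs[i]
        raise KeyError(hex(la))

    def remove(self, la_base: int) -> None:
        i = bisect.bisect_left(self._bases, la_base)
        if i == len(self._bases) or self._bases[i] != la_base:
            raise KeyError(hex(la_base))
        del self._bases[i]
        del self._segs[i]


table = SegmentTable()
table.install(Seg(0x1_0000_0000, 4096))
table.install(Seg(0x1_0000_1000, 4096))
hit = table.find(0x1_0000_0FFF)  # last byte of the first segment
```

Rejecting overlaps at install time is one way to satisfy the requirement that tensors in distinct segments do not collide.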
+
+#### D-LA4. BAAW resolution logic
+
+BAAW is a front-end stage inside PE_DMA, not a separate SimPy
+component. It is synchronous address-resolution logic executed at the
+start of PE_DMA's `handle_command()`.
+
+Input: `(LA, nbytes)`. Output:
+
+- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
+- **n:1 mode**: single `PhysicalRequest`.
+
+```python
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    # one_to_one
+    requests = []
+    per_ch_size = seg.channel_size
+    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+        ch_offset = offset % per_ch_size
+        ch_nbytes = nbytes // seg.channel_count
+        pa = pa_base + ch_offset
+        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+    return requests
+```
+
+BAAW responsibilities:
+
+- Convert logical access → physical request units.
+- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
+- Compute PA and target node.
+
+BAAW non-responsibilities:
+
+- Performing actual data movement.
+- Executing NOC routing.
+- Simulating bandwidth occupation (downstream components' job).
+
+BAAW output is directly usable by the simulator's routing and resource
+model without additional address decoding.
+
+#### D-LA5. PE_DMA `handle_command()` change
+
+Current (VA-based) flow:
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1 sub-Transaction → fabric inject
+```
+
+LA-based flow:
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → for each PhysicalRequest:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction → fabric inject
+  → await all sub-Transactions
+  → pe_txn.done.succeed()
+```
+
+Key changes:
+
+- MMU reference removed → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
+  directly.
+- 1 request → N parallel requests in 1:1 mode.
+
+#### D-LA6. 1:1 mode detail
+
+- One logical access → N physical requests (N = `channels_per_pe`).
+- N = `hbm_pseudo_channels / pes_per_cube`.
+- Each request: fully-resolved 51-bit PA, targets a specific channel
+  router (`{pe_prefix}.ch_r{channel_id}`).
+- Per-channel link models BW contention.
+- PE_DMA injects N sub-transactions concurrently.
+
+Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0 owns ch0-7.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,
+}
+
+BAAW resolve result (8 requests):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: 8 sub-transactions parallel inject
+  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
+  Total effective BW = 8 × channel_bw_gbs
+```
+
+Other N values:
+
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
+  4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
+  16 requests
+
+#### D-LA7. n:1 mode detail
+
+- One logical access → one aggregated request.
+- Target: aggregated router → hbm_ctrl (see ADR-0019).
+- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
+  (e.g. 8 × 32 = 256 GB/s).
+- Single queue / resource for modelling.
+- No per-channel PA decomposition.
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction
+  aggregated router → hbm_ctrl link (256 GB/s)
+```
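The two worked examples (D-LA6 and D-LA7) can be reproduced with a runnable sketch that follows the resolve logic from D-LA4. It is a simplified stand-in, not the proposed implementation: the segment type drops `channel_count` and `channel_size` and derives them instead, `resolve` is a free function, and the numeric PA values are placeholders for the symbolic `PA_ch0..PA_ch7` and `PA_agg`.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


@dataclass
class BaawSegment:
    la_base: int
    la_size: int
    mode: str  # "one_to_one" | "n_to_one"
    pa_bases: list[int] = field(default_factory=list)
    channel_ids: list[int] = field(default_factory=list)
    agg_pa_base: int = 0
    agg_node_id: str = ""


def resolve(seg: BaawSegment, pe_prefix: str, la: int,
            nbytes: int) -> list[PhysicalRequest]:
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request.
        return [PhysicalRequest(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
    # Fan-out: one request per channel, equal share of the bytes.
    n = len(seg.pa_bases)
    channel_size = seg.la_size // n
    return [
        PhysicalRequest(pa_base + offset % channel_size, nbytes // n,
                        f"{pe_prefix}.ch_r{ch}")
        for pa_base, ch in zip(seg.pa_bases, seg.channel_ids)
    ]


LA = 0x1_0000_0000
PE = "sip0.cube0.pe0"
one_to_one = BaawSegment(LA, 4096, "one_to_one",
                         pa_bases=[0x10000 * ch for ch in range(8)],  # placeholder PAs
                         channel_ids=list(range(8)))
n_to_one = BaawSegment(LA, 4096, "n_to_one",
                       agg_pa_base=0x80000,                           # placeholder PA
                       agg_node_id=f"{PE}.agg_router")

reqs_1to1 = resolve(one_to_one, PE, LA, 4096)
reqs_nto1 = resolve(n_to_one, PE, LA, 4096)
```

For the 4 KB tensor this produces eight 512-byte requests in 1:1 mode and a single 4096-byte request in n:1 mode, as in the listings above.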
+
+#### D-LA8. Kernel model preserved
+
+- Kernel still issues single memory ops (`tl.load`, `tl.store`,
+  `tl.composite`).
+- LA is the address scheme exposed to kernel code.
+- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
+- Kernel code never sees physical channel information.
+
+#### Consequences (LA model, proposed)
+
+Positive:
+
+- 1:1 vs n:1 semantics live in one place (BAAW).
+- Kernel abstraction preserved — no kernel code changes.
+- Topology-based policy control (mode switch via yaml).
+- Improved simulation-model consistency and debuggability.
+- Segment-based mapping is simpler than page tables; lower overhead.
+
+Negative:
+
+- Full VA/MMU code refactor required.
+- Request-generation path more complex (N requests in 1:1 mode).
+- Reduced per-channel visibility in n:1 mode.
+- VA-related tests need rewriting.
+
+---
+
+## Migration Path
+
+- **PA → VA** was an extension. PA mode is retained as the PageFault
+  fallback inside PE_DMA. Switching does not require removing PA
+  code.
+- **VA → LA**, if adopted, is a replacement, not coexistence. See
+  D-LA1 for the VA infrastructure removal list. PA fallback inside
+  PE_DMA may be retained orthogonally for tests.
+
+## Alternatives Considered (LA model)
+
+1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
+   Rejected: MMU's role would grow beyond translation to request
+   decomposition; aggregation (n:1) becomes awkward to express.
+2. **Channel-aware kernel API**: kernels call per-channel load/store
+   directly. Rejected: abstraction leakage, portability loss, all
+   benchmarks need rewriting.
+3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
+   directly. Rejected: incompatible with aggregation; conversion
+   timing unclear; channel info leaks to kernel.
+
+## Test Requirements
+
+### VA model (current, regression)
+
+- Cross-PE / cross-cube DMA paths over installed mappings.
+- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
+- TLB-overhead-per-access timing.
+- PageFault fallback path preserves PA-only behaviour.
+
+### LA model (when implemented)
+
+- 1:1 mode: same logical access → N per-channel requests.
+- n:1 mode: same logical access → 1 aggregated request.
+- Bandwidth equivalence between modes for identical workload.
+- 1:1 mode: per-channel contention modelled correctly.
+- n:1 mode: aggregated bandwidth correctly reflected.
+- Kernel code unchanged across mode switch.
+- BAAW segment install / uninstall correctness.
+- Multiple tensors in distinct segments do not collide.
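The bandwidth-equivalence requirement can be sanity-checked with back-of-the-envelope arithmetic. This sketch models only serialisation time; it assumes decimal GB/s (1 GB/s moves 1 byte per ns), a transfer size divisible by the channel count, and no contention, queueing, or link latency. All function names are hypothetical.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    """Serialisation time in ns at the given bandwidth."""
    return nbytes / bw_gbs


def one_to_one_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # N equal requests drain in parallel, so the slowest one sets the time.
    return max(drain_ns(nbytes // channels, channel_bw_gbs)
               for _ in range(channels))


def n_to_one_ns(nbytes: int, channels: int, channel_bw_gbs: float) -> float:
    # One request over a link of aggregated bandwidth.
    return drain_ns(nbytes, channels * channel_bw_gbs)


t_1to1 = one_to_one_ns(4096, 8, 32.0)  # 512 B per channel at 32 GB/s
t_nto1 = n_to_one_ns(4096, 8, 32.0)    # 4096 B at 256 GB/s
print(t_1to1, t_nto1)                  # 16.0 16.0
```

A real regression test would compare simulated end-to-end latencies instead, where the two modes may differ once per-channel contention is present.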
+
+## Implementation Order (LA, when scheduled)
+
+1. LA type (`policy/address/la_allocator.py`).
+2. BAAW segment table (`policy/address/baaw.py`).
+3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
+4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+   `handle_command()`).
+5. RuntimeContext: LA alloc + segment install
+   (`runtime_api/context.py`).
+6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
+7. Remove VA/MMU code.
+8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
+9. Test migration:
+
+| Test file | Action |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
+| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator tests |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
+| `tests/test_va_offset.py` | Replace with LA offset tests |
+
 ## Links

 - ADR-0007 (runtime_api vs sim_engine boundaries)
@@ -97,4 +516,6 @@ sufficient to execute kernels and issue DMA requests.
 - ADR-0009 (kernel execution)
 - ADR-0014 (PE-internal execution model)
 - ADR-0015 (component port/wire model)
- SPEC R2 (latency by traversal)
+- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
+  consumer)
+- SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -226,7 +226,7 @@ Tests SHOULD validate:

 ## Links

- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0007 (runtime_api vs sim_engine boundaries)
 - ADR-0009 (kernel execution fan-out/aggregation)
 - SPEC R2, R7, R8
@@ -134,6 +134,6 @@ Phase 2 (Apply) MUST:
 ## Links

 - SPEC 0.1, R2, R6
- ADR-0011 (PA-first memory addressing)
+- ADR-0011 (Memory Addressing — PA / VA / LA)
 - ADR-0012 (Host ↔ IO_CPU message schema)
 - ADR-0009 (Kernel execution semantics)
@@ -1,441 +0,0 @@
-# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
-
-## Status
-
-Proposed
-
-## Context
-
-Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
-Currently, a VA-based access path is used; however, the following two channel mapping models
-are difficult to represent consistently.
-
-### Background: Local-HBM Pseudo Channel Structure
-
-The HBM in a CUBE consists of 32 or 64 pseudo channels.
-In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
-
-Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
-
-Both the number of pseudo channels and the number of PEs are topology parameters.
-`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
-the number of local channels per PE.
-
-The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
-(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
-maximum memory BW.
-
-### Limitations of the Current VA Model
-
-When channels are divided into 8, requests must also be generated per channel and sent to DMA.
-However, in the current architecture, the kernel generates requests with VA (`tl.load`)
-and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.
-
-Therefore, instead of VA, we propose using **Logical Address (LA)**,
-where the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
-converts LA to PA or a list of PAs based on segment-based mapping.
-
-### Two Channel Mapping Modes
-
- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
-
-By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
-
-### Core Requirements
-
- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
- The difference must only be in the request representation and resource modeling approach
- The kernel programming model must not be changed
- Physical channel information must not be exposed to the kernel
-
-### Existing Physical Address
-
-The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
-
-```
-[50:47] rack_id (4 bit)
-[46:43] sip_id  (4 bit)
-[42:38] cube_id (5 bit, sip_seg)
-[37]    hbm_selector (1=HBM window)
-[36:0]  hbm_offset   (37 bit, 128GB per cube)
-```
-
-PA is used to represent the final routable canonical physical destination,
-and this role is preserved.
-However, the timing and policy of logical access -> physical request conversion are not clearly separated.
-
---
-
-## Decision
-
-### D1. Introduction of LA (Logical Address) — Replacing VA
-
-The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
-
-#### Characteristics of LA
-
- Like VA, tensors can be mapped to a contiguous memory space
- Represents logical buffer + offset
- Does not directly contain physical channel information
- An intermediate abstraction maintained until physical resolution
- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
-
-#### LA Space Definition
-
-| Item | Value |
-|------|-------|
-| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
-| LA space size | 64 GB per PE |
-| Alignment unit | Segment-based (see D3 below) |
-
-LA is a PE-local address space.
-Even if different PEs use the same LA value, they resolve to different PAs
-because each PE has a different BAAW segment table.
-
-#### VA Infrastructure Removal Scope
-
-With the introduction of LA, the following existing code will be replaced/removed:
-
-| Removal Target | Replacement |
-|----------------|-------------|
-| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
-| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
-| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
-| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
-| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
-| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
-| `topology.yaml`: pe_mmu component entry | Removed |
-
---
-
-### D2. Mapping Mode Configuration
-
-The mapping mode is configured at the cube level in topology.yaml:
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
-    hbm_pseudo_channels: 64       # total pseudo channel count
-    hbm_channels_per_pe: 8        # local channel count per PE
-    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
-```
-
-This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
-
---
-
-### D3. Segments and BAAW
-
-#### Segment Definition
-
-A segment is a logical allocation unit that partitions the LA space so that each segment
-maps to a specific HBM channel or channel group.
-
-Segments are created by the runtime allocator during tensor deployment,
-and BAAW uses them to convert LA into physical requests.
-
-#### BAAW Segment Table Entry
-
-```python
-@dataclass
-class BaawSegment:
-    la_base: int          # segment start LA
-    la_size: int          # segment size (bytes)
-    mode: str             # "one_to_one" | "n_to_one"
-    # 1:1 mode fields
-    channel_count: int    # number of channels assigned to this segment (e.g., 8)
-    pa_bases: list[int]   # per-channel PA start address list (len = channel_count)
-    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
-    channel_size: int     # per-channel size (la_size // channel_count)
-    # n:1 mode fields
-    agg_pa_base: int      # aggregated PA start address
-    agg_node_id: str      # aggregated router node_id (for routing)
-```
-
-#### Segment Lifecycle
-
-1. **Allocation time** (tensor deploy):
-   - RuntimeContext allocates LA space from the LA allocator
-   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
-   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
-
-2. **Usage time** (kernel execution):
-   - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
-   - PE_DMA looks up the segment corresponding to the LA in BAAW
-   - Converts to PA(s) according to the mode
-
-3. **Deallocation time** (tensor free):
-   - Removed from the segment table
-   - LA space returned, PA deallocated
-
---
-
-### D4. BAAW (Logical-to-Physical Mapping Unit)
-
-#### Location
-
-BAAW is placed as a front-end stage inside PE_DMA.
-It is not a separate SimPy component; it is synchronous address resolution logic
-executed at the beginning of PE_DMA's `handle_command()`.
-
-#### Input
-
- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
- access size (bytes)
-
-#### Output
-
- 1:1 mode: `list[PhysicalRequest]` — each request is (PA, nbytes, channel_node_id)
- n:1 mode: 1 `PhysicalRequest` — (agg_PA, nbytes, agg_node_id)
-
-```python
-@dataclass
-class PhysicalRequest:
-    pa: int           # 51-bit Physical Address
-    nbytes: int       # transfer size for this request
-    dst_node: str     # target node_id (channel router or aggregated router)
-```
-
-#### BAAW Resolve Logic
-
-```python
-def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
-    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
-    offset = la - seg.la_base
-
-    if seg.mode == "n_to_one":
-        pa = seg.agg_pa_base + offset
-        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
-
-    if seg.mode != "one_to_one":
-        raise ValueError(f"unknown mapping mode: {seg.mode!r}")
-    requests = []
-    ch_offset = offset % seg.channel_size  # interleaved or striped
-    ch_nbytes = nbytes // seg.channel_count  # assumes an even split across channels
-    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
-        pa = pa_base + ch_offset
-        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
-        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
-    return requests
-```
-
-#### Scope of Responsibility
-
-BAAW is responsible for:
- Converting logical accesses into physical request units
- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
- Generating Physical Addresses and determining target nodes
-
-BAAW is NOT responsible for:
- Performing actual data movement
- Executing NOC routing
- Simulating bandwidth consumption (this is the role of downstream components)
-
-#### Output Contract
-
-The output of BAAW must be request units that can be directly used by the simulator's
-routing and resource model without any additional address decoding.
-
---
-
-### D5. PE_DMA handle_command() Changes
-
-#### Current Flow (VA-based)
-
-```
-DmaReadCmd.src_addr (VA)
-  -> MMU.translate(VA) -> PA
-  -> PhysAddr.decode(PA) -> PhysAddr object
-  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
-  -> router.find_path(pe_prefix, dst_node_id) -> path
-  -> 1 sub-Transaction created -> fabric inject
-```
-
-#### New Flow (LA-based)
-
-```
-DmaReadCmd.src_addr (LA)
-  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
-  -> For each PhysicalRequest:
-      -> router.find_path(pe_prefix, req.dst_node) -> path
-      -> compute_drain_ns(path, req.nbytes) -> drain
-      -> sub-Transaction created -> fabric inject
-  -> Wait for all sub-Transactions to complete
-  -> pe_txn.done.succeed()
-```
-
-Key changes:
- MMU reference removed -> replaced with BAAW resolve
- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
- 1 request -> N requests injected in parallel (1:1 mode)
-
---
-
-### D6. 1:1 Mode Details
-
- One logical access -> N (= `channels_per_pe`) physical requests
- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
- Each request:
-  - Fully resolved 51-bit PA
-  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
- BW contention modeling via per-channel links
- PE_DMA injects N sub-transactions simultaneously
-
-#### 1:1 Mode Example
-
-Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
-> `channels_per_pe=8`, PE0 owns ch0-7
-
-```text
-Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
-BAAW segment: {
-    la_base: 0x1_0000_0000, la_size: 4096,
-    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
-    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
-    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
-    channel_size: 512,  # = la_size / channel_count
-}
-
-BAAW resolve result (N=8 requests):
-  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
-  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
-  -> ...
-  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
-
-PE_DMA: N sub-transactions injected in parallel
-  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
-  Total effective BW = N x channel_bw_gbs
-```
-
-Examples with different N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
-
---
-
-### D7. n:1 Mode Details
-
- One logical access -> one aggregated request
- Target: aggregated router -> hbm_ctrl (see ADR-0019)
- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
- Modeled as a single queue / resource
- No per-channel PA decomposition
-
-#### n:1 Mode Example
-
-```
-Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
-BAAW segment: {
-    la_base: 0x1_0000_0000, la_size: 4096,
-    mode: "n_to_one",
-    agg_pa_base: PA_agg,
-    agg_node_id: "sip0.cube0.pe0.agg_router",
-}
-
-BAAW resolve result:
-  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
-
-PE_DMA: 1 sub-transaction injected
-  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
-```
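The claim that both modes deliver the same effective bandwidth can be checked with the numbers from the two examples. The sketch below assumes a purely bandwidth-limited transfer (no latency, no contention) and independent per-channel links; `drain_ns` and `completion_ns` are hypothetical helpers, not the simulator's `compute_drain_ns`.

```python
def drain_ns(nbytes: int, bw_gbs: float) -> float:
    # 1 GB/s equals 1 byte/ns (decimal GB), so bytes / (GB/s) gives nanoseconds.
    return nbytes / bw_gbs


def completion_ns(requests: list[tuple[int, float]]) -> float:
    # Sub-transactions run in parallel on separate links,
    # so the access completes when the slowest one drains.
    return max(drain_ns(nbytes, bw) for nbytes, bw in requests)


# 1:1 mode: 8 requests of 512 B, each on a 32 GB/s channel link.
one_to_one = [(512, 32.0)] * 8
# n:1 mode: 1 request of 4096 B on a 256 GB/s aggregated link.
n_to_one = [(4096, 256.0)]

t_one_to_one = completion_ns(one_to_one)
t_n_to_one = completion_ns(n_to_one)
effective_bw_gbs = 4096 / t_one_to_one
```

Under these assumptions both modes finish the 4 KB access in the same time; the modes differ only once several PEs contend for individual channel links.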
-
---
-
-### D8. Kernel Model Preservation
-
- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
- LA is the address scheme passed to the kernel
- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
- Physical channel information is not exposed to kernel code
-
---
-
-## Consequences
-
-### Positive
-
- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
- Kernel abstraction is preserved — no kernel code changes required
- Topology-based policy control is possible (mode switching via yaml)
- Improved simulation model consistency and debuggability
- Segment-based mapping is simpler and has lower overhead compared to page tables
-
-### Negative
-
- Full refactoring of VA/MMU-based code is required
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
- Reduced per-channel visibility in n:1 mode
- Existing VA-related tests must be rewritten
-
---
-
-## Alternatives
-
-### A1. Keep VA + Fan-out at MMU
-
- Extend MMU to return per-channel PAs
- Problem: MMU's role expands beyond address translation to include request decomposition
- Problem: Aggregation representation is difficult in n:1 mode
-
-### A2. Kernel Generates Channel-Aware Requests
-
- Kernel directly calls per-channel load/store
- Problem: Abstraction leakage, reduced portability
- Problem: All benchmark code must be modified
-
-### A3. Always Use PA (Without LA)
-
- Runtime directly passes per-channel PA to the kernel
- Problem: Conflicts with the aggregation model
- Problem: Conversion timing is unclear, channel information exposed to kernel
-
---
-
-## Implementation Notes
-
-### Implementation Order
-
-1. Introduce LA type (`policy/address/la_allocator.py`)
-2. Implement BAAW segment table (`policy/address/baaw.py`)
-3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
-4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
-5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
-6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
-7. Remove VA/MMU code
-8. Remove pe_mmu from topology.yaml, add mapping mode configuration
-9. Test migration
-
-### Affected Existing Tests
-
-| Test File | Impact |
-|-----------|--------|
-| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
-| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
-| `tests/test_pe_mmu.py` | Remove |
-| `tests/test_va_allocator.py` | Replace with LA allocator test |
-| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
-| `tests/test_va_offset.py` | Replace with LA offset test |
-
---
-
-## Test Requirements
-
- For the same logical access:
-  - 1:1 -> verify N requests are generated
-  - n:1 -> verify 1 aggregated request is generated
- Verify effective bandwidth consistency across both modes
- 1:1 -> verify per-channel contention modeling
- n:1 -> verify aggregated bandwidth is reflected
- Verify operation without kernel code changes
- Verify correct BAAW segment install/uninstall operation
- Verify no conflicts when multiple tensors are assigned to different segments
-
---
-
-## Links
-
- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
@@ -1,440 +0,0 @@
-# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Modes
-
-## Status
-
-Proposed
-
-## Context
-
-Kernbench simulates memory accesses between PE_DMA and Local-HBM inside a CUBE.
-The current access path is VA-based, which makes it hard to express the following
-two channel mapping models consistently.
-
-### Background: Local-HBM Pseudo-Channel Structure
-
-The HBM of a CUBE consists of 32 or 64 pseudo channels.
-In the PE-Local-HBM model, each PE is responsible for the same number of pseudo channels.
-
-Example: 64 pseudo channels, 8 PEs per cube → each PE accesses 8 pseudo channels as its local HBM
-
-The pseudo channel count and the PE count are both topology parameters.
-`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe)
-determines the number of local channels per PE.
-
-The routing path between the DMA and each pseudo channel is sized to match the pseudo channel BW (e.g., 32 GB/s),
-so a PE that issues requests to N channels simultaneously can use the maximum memory BW.
-
-### Limitations of the Current VA Model
-
-When the channels are split into 8, requests must also be generated per channel and sent to the DMA.
-In the current structure, however, the kernel builds a request from a VA (`tl.load`)
-and passes it straight to the DMA, so it is difficult for PE_CPU to generate per-channel DMA requests.
-
-This ADR therefore proposes using a **Logical Address (LA)** instead of a VA,
-with **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
-converting LA → PA or a list of PAs based on segment-based mapping.
-
-### Two Channel Mapping Modes
-
-- **1:1 mode**: generates and executes per-channel requests. Precise per-channel modeling
-- **n:1 mode (default)**: assumes interleaving across local HBM channels. Aggregated BW modeling
-
-Supporting both modes makes it possible to measure and review the overhead of n:1 mode.
-
-### Key Requirements
-
-- The effective bandwidth semantics of PE_DMA → HBM_CTRL must be identical in both modes
-- The differences must be limited to request representation and resource modeling
-- The kernel programming model does not change
-- Physical channel information must not be exposed to the kernel
-
-### Existing Physical Address
-
-The system's current 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
-
-```
-[50:47] rack_id (4 bit)
-[46:43] sip_id  (4 bit)
-[42:38] cube_id (5 bit, sip_seg)
-[37]    hbm_selector (1=HBM window)
-[36:0]  hbm_offset   (37 bit, 128GB per cube)
-```
-
-The PA represents the final, routable canonical physical destination,
-and it keeps that role.
-However, the timing and policy of the logical access → physical request conversion are not clearly separated.
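The 51-bit layout above can be exercised with a small encode/decode pair. This is an illustrative sketch only: `PaFields`, `decode_pa`, and `encode_pa` are hypothetical names and do not reproduce the actual contents of `policy/address/phyaddr.py`.

```python
from typing import NamedTuple


class PaFields(NamedTuple):
    rack_id: int       # bits [50:47]
    sip_id: int        # bits [46:43]
    cube_id: int       # bits [42:38]
    hbm_selector: int  # bit  [37], 1 = HBM window
    hbm_offset: int    # bits [36:0], 2**37 bytes = 128 GiB per cube


def _bits(value: int, hi: int, lo: int) -> int:
    # Extract the inclusive bit range [hi:lo] from value.
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pa(pa: int) -> PaFields:
    if not 0 <= pa < (1 << 51):
        raise ValueError("PA must fit in 51 bits")
    return PaFields(
        rack_id=_bits(pa, 50, 47),
        sip_id=_bits(pa, 46, 43),
        cube_id=_bits(pa, 42, 38),
        hbm_selector=_bits(pa, 37, 37),
        hbm_offset=_bits(pa, 36, 0),
    )


def encode_pa(f: PaFields) -> int:
    # Inverse of decode_pa; assumes every field already fits its bit width.
    return (
        (f.rack_id << 47) | (f.sip_id << 43) | (f.cube_id << 38)
        | (f.hbm_selector << 37) | f.hbm_offset
    )
```

The field widths sum to 4 + 4 + 5 + 1 + 37 = 51 bits, which is why the range check uses `1 << 51`.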
-
---
-
-## Decision
-
-### D1. Introduce LA (Logical Address) as a Replacement for VA
-
-The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
-
-#### Characteristics of LA
-
-- Like VA, it can map a Tensor onto a contiguous memory space
-- Represents logical buffer + offset
-- Does not directly contain physical channel information
-- An intermediate abstraction maintained until physical resolution
-- The only address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
-
-#### LA Space Definition
-
-| Item | Value |
-|------|-----|
-| LA start address | `0x1_0000_0000` (4 GB, keeps the existing VA start point) |
-| LA space size | 64 GB per PE |
-| Alignment unit | per segment (see D3 below) |
-
-LA is a PE-local address space.
-Even if different PEs use the same LA value, their BAAW segment tables differ,
-so the value resolves to different PAs.
-
-#### Scope of VA Infrastructure Removal
-
-With the introduction of LA, the following existing code is replaced or removed:
-
-| Removed | Replacement |
-|-----------|------|
-| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed and repurposed) |
-| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
-| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed: BAAW is logic inside PE_DMA, not a separate component |
-| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced by BaawSegmentInstallMsg |
-| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
-| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
-| `topology.yaml`: pe_mmu component entry | Removed |
-
---
-
-### D2. Mapping Mode Configuration
-
-The mapping mode is configured at the cube level of topology.yaml:
-
-```yaml
-cube:
-  memory_map:
-    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
-    hbm_pseudo_channels: 64       # total number of pseudo channels
-    hbm_channels_per_pe: 8        # number of local channels per PE
-    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
-```
-
-This configuration is referenced by the graph compiler (topology builder) and during BAAW initialization.
-
---
-
-### D3. Segments and BAAW
-
-#### Segment Definition
-
-A segment is a logical allocation unit that partitions the LA space so that each segment
-maps to a specific HBM channel or channel group.
-
-Segments are created by the runtime allocator at tensor deploy time,
-and BAAW uses them to convert an LA into physical requests.
-
-#### BAAW Segment Table Entry
-
-```python
-from dataclasses import dataclass
-
-
-@dataclass
-class BaawSegment:
-    la_base: int          # segment start LA
-    la_size: int          # segment size (bytes)
-    mode: str             # "one_to_one" | "n_to_one"
-    # 1:1 mode fields
-    channel_count: int    # number of channels assigned to this segment (e.g., 8)
-    pa_bases: list[int]   # per-channel PA start addresses (len = channel_count)
-    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
-    channel_size: int     # per-channel size (la_size // channel_count)
-    # n:1 mode fields
-    agg_pa_base: int      # aggregated PA start address
-    agg_node_id: str      # aggregated router node_id (for routing)
-```
-
-#### Segment Lifecycle
-
-1. **Allocation time** (tensor deploy):
-   - RuntimeContext allocates LA space from the LA allocator
-   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
-   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
-
-2. **Usage time** (kernel execution):
-   - Kernel issues `tl.load(la_ptr)` → DmaReadCmd(src_addr=LA)
-   - PE_DMA looks up the segment corresponding to the LA in BAAW
-   - Converts to PA(s) according to the mode
-
-3. **Deallocation time** (tensor free):
-   - Removed from the segment table
-   - LA space returned, PA deallocated
-
---
-
-### D4. BAAW (Logical-to-Physical Mapping Unit)
-
-#### Location
-
-BAAW is placed as a front-end stage inside PE_DMA.
-It is not a separate SimPy component; it is synchronous address resolution logic
-executed at the beginning of PE_DMA's `handle_command()`.
-
-#### Input
-
-- LA (Logical Address): DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
-- access size (bytes)
-
-#### Output
-
-- 1:1 mode: `list[PhysicalRequest]`, where each request is (PA, nbytes, channel_node_id)
-- n:1 mode: one `PhysicalRequest`, (agg_PA, nbytes, agg_node_id)
-
-```python
-from dataclasses import dataclass
-
-
-@dataclass
-class PhysicalRequest:
-    pa: int           # 51-bit Physical Address
-    nbytes: int       # transfer size for this request
-    dst_node: str     # target node_id (channel router or aggregated router)
-```
-
-#### BAAW Resolve Logic
-
-```python
-def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
-    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
-    offset = la - seg.la_base
-
-    if seg.mode == "n_to_one":
-        pa = seg.agg_pa_base + offset
-        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
-
-    if seg.mode != "one_to_one":
-        raise ValueError(f"unknown mapping mode: {seg.mode!r}")
-    requests = []
-    ch_offset = offset % seg.channel_size  # interleaved or striped
-    ch_nbytes = nbytes // seg.channel_count  # assumes an even split across channels
-    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
-        pa = pa_base + ch_offset
-        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
-        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
-    return requests
-```
-
-#### Scope of Responsibility
-
-BAAW is responsible for:
-- Converting logical accesses into physical request units
-- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
-- Generating Physical Addresses and determining target nodes
-
-BAAW is NOT responsible for:
-- Performing actual data movement
-- Executing NOC routing
-- Simulating bandwidth consumption (this is the role of downstream components)
-
-#### Output Contract
-
-The output of BAAW must be request units that the simulator's routing and resource model
-can use directly, without any additional address decoding.
-
---
-
-### D5. PE_DMA handle_command() Changes
-
-#### Current Flow (VA-based)
-
-```
-DmaReadCmd.src_addr (VA)
-  → MMU.translate(VA) → PA
-  → PhysAddr.decode(PA) → PhysAddr object
-  → resolver.resolve(PhysAddr) → dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
-  → router.find_path(pe_prefix, dst_node_id) → path
-  → 1 sub-Transaction created → fabric inject
-```
-
-#### New Flow (LA-based)
-
-```
-DmaReadCmd.src_addr (LA)
-  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
-  → For each PhysicalRequest:
-      → router.find_path(pe_prefix, req.dst_node) → path
-      → compute_drain_ns(path, req.nbytes) → drain
-      → sub-Transaction created → fabric inject
-  → Wait for all sub-Transactions to complete
-  → pe_txn.done.succeed()
-```
-
-Key changes:
-- MMU reference removed → replaced with BAAW resolve
-- PhysAddr.decode() + resolver.resolve() → BAAW directly returns dst_node
-- 1 request → N requests injected in parallel (1:1 mode)
-
---
-
-### D6. 1:1 Mode Details
-
-- One logical access → N (= `channels_per_pe`) physical requests
-- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
-- Each request:
-  - Fully resolved 51-bit PA
-  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
-- BW contention modeling via per-channel links
-- PE_DMA injects N sub-transactions simultaneously
-
-#### 1:1 Mode Example
-
-Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
-→ `channels_per_pe=8`, PE0 owns ch0-7
-
-```text
-Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
-BAAW segment: {
-    la_base: 0x1_0000_0000, la_size: 4096,
-    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
-    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
-    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
-    channel_size: 512,  # = la_size / channel_count
-}
-
-BAAW resolve result (N=8 requests):
-  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
-  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
-  → ...
-  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
-
-PE_DMA: N sub-transactions injected in parallel
-  Each accesses HBM via channel router → hbm_ctrl link (channel_bw_gbs)
-  Total effective BW = N × channel_bw_gbs
-```
-
-Examples with different N values:
-- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`, 4 requests
-- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`, 16 requests
-
---
-
-### D7. n:1 Mode Details
-
-- One logical access → one aggregated request
-- Target: aggregated router → hbm_ctrl (see ADR-0019)
-- Aggregated link BW = `channels_per_pe` × `channel_bw_gbs` (e.g., 8 × 32 = 256 GB/s)
-- Modeled as a single queue / resource
-- No per-channel PA decomposition
-
-#### n:1 Mode Example
-
-```
-Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
-BAAW segment: {
-    la_base: 0x1_0000_0000, la_size: 4096,
-    mode: "n_to_one",
-    agg_pa_base: PA_agg,
-    agg_node_id: "sip0.cube0.pe0.agg_router",
-}
-
-BAAW resolve result:
-  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
-
-PE_DMA: 1 sub-transaction injected
-  Accesses HBM via aggregated router → hbm_ctrl link (256 GB/s)
-
---
-
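-### D8. Kernel Model Preservation
-
-- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
-- LA is the address scheme passed to the kernel
-- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
-- Physical channel information is not exposed to kernel code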
-
---
-
-## Consequences
-
-### Positive
-
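- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
- Kernel abstraction is preserved, so no kernel code changes are required
- Topology-based policy control is possible (mode switching via yaml)
- Improved simulation model consistency and debuggability
- Segment-based mapping is simpler and has lower overhead than page tables
-
-### Negative
-
- Full refactoring of VA/MMU-based code is required
- Increased complexity in the request generation path (managing N requests in 1:1 mode)
- Reduced per-channel visibility in n:1 mode
- Existing VA-related tests must be rewritten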
-
---
-
-## Alternatives
-
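-### A1. Keep VA + Fan-out at MMU
-
- Extend MMU to return per-channel PAs
- Problem: MMU's role expands beyond address translation to include request decomposition
- Problem: Aggregation is difficult to represent in n:1 mode
-
-### A2. Kernel Generates Channel-Aware Requests
-
- Kernel directly calls per-channel load/store
- Problem: Abstraction leakage, reduced portability
- Problem: All benchmark code must be modified
-
-### A3. Always Use PA (Without LA)
-
- Runtime directly passes per-channel PA to the kernel
- Problem: Conflicts with the aggregation model
- Problem: Conversion timing is unclear, channel information exposed to kernel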
-
---
-
-## Implementation Notes
-
-### Implementation Order
-
-1. Introduce LA type (`policy/address/la_allocator.py`)
-2. Implement BAAW segment table (`policy/address/baaw.py`)
-3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
-4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
-5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
-6. Change Tensor.va_base → la_base (`runtime_api/tensor.py`)
-7. Remove VA/MMU code
-8. Remove pe_mmu from topology.yaml, add mapping mode configuration
-9. Test migration
-
-### Affected Existing Tests
-
-| Test File | Impact |
-|------------|------|
-| `tests/test_mmu_component.py` | Remove → replace with BAAW segment install test |
-| `tests/test_mmu_fabric.py` | Remove → replace with BAAW + fabric integration test |
-| `tests/test_pe_mmu.py` | Remove |
-| `tests/test_va_allocator.py` | Replace with LA allocator test |
-| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
-| `tests/test_va_offset.py` | Replace with LA offset test |
-
---
-
-## Test Requirements
-
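- For the same logical access:
-  - 1:1 → verify N requests are generated
-  - n:1 → verify 1 aggregated request is generated
- Verify effective bandwidth consistency across both modes
- 1:1 → verify per-channel contention modeling
- n:1 → verify aggregated bandwidth is reflected
- Verify operation without kernel code changes
- Verify correct BAAW segment install/uninstall operation
- Verify no conflicts when multiple tensors are assigned to different segments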
-
---
-
-## Links
-
- ADR-0011 (Memory Addressing Simplification: PA-first, VA/MMU introduction) → superseded by this ADR
- ADR-0019 (NOC Per-Channel HBM Connection Model) → topology-side integration
- ADR-0014 (PE Internal Execution Model) → PE_DMA change impact
@@ -2,35 +2,23 @@

 ## Status

-Proposed
+Accepted

 ## Context

-ADR-0018 introduced LA-based address abstraction and BAAW,
-defining how a logical memory access is translated into the following two forms of requests:
+The CUBE-internal NOC must connect each PE to HBM. KernBench needs
+to evaluate two connectivity models:

- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
+- **1:1 mode** — PE_DMA connects to N separate per-channel routers,
+  each with its own link to hbm_ctrl. Models per-channel BW
+  contention precisely.
+  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
+- **n:1 mode** — PE_DMA connects to a single aggregated router with
+  one link to hbm_ctrl. Channels are treated as interleaved; only
+  aggregate BW is modeled.

-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
- The path PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} is used
- HBM is modeled as 8 slice nodes (= number of PEs)
- Local and remote accesses use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
- Cannot model at pseudo-channel granularity (a slice is per-PE, not per-channel)
- xbar/bridge split local and remote accesses onto separate paths
- Cannot express 1:1 / n:1 modes consistently
+Effective PE-local BW is identical under both modes
+(= N × per-channel BW); only the connectivity granularity differs.

 ---

@@ -270,7 +258,6 @@ The effective BW per PE is identical in both modes:
 ### Negative

 - The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube)
- Requires a complete rewrite of existing xbar/bridge/single NOC-based tests
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

 ---
@@ -296,119 +283,6 @@ The effective BW per PE is identical in both modes:

 ---

-## Implementation Notes
-
-### topology/builder.py Change Details
-
-#### Code to Remove (within current `_instantiate_cube()`)
-
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
-
-#### Code to Add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# Create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# Channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# Horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# Create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# Aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# Horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test File | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
---
-
 ## Test Requirements

 - Verify that requests are delivered via per-channel links in 1:1 mode
@@ -425,7 +299,7 @@ for p in range(pes_per_cube - 1):

 ## Links

- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → BW model redefinition
 - ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -2,35 +2,23 @@

 ## Status

-Proposed
+Accepted

 ## Context

-ADR-0018 introduced LA-based address abstraction and BAAW,
-defining how a logical memory access is translated into the following two forms of requests:
+The CUBE-internal NOC must connect each PE to HBM. KernBench needs
+to evaluate two connectivity models:

- 1:1 mode: one logical access → N per-channel requests
- n:1 mode: one logical access → one aggregated request
+- **1:1 mode**: PE_DMA connects to N separate per-channel routers,
+  each with its own link to hbm_ctrl. Models per-channel BW
+  contention precisely.
+  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
+- **n:1 mode**: PE_DMA connects to a single aggregated router with
+  one link to hbm_ctrl. Channels are treated as interleaved; only
+  aggregate BW is modeled.

-Here N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`),
-determined by topology parameters.
-
-### Problems with the Existing Structure
-
-In the current implementation (`topology/builder.py`):
-
- The path PE_DMA → NOC → xbar_top/xbar_bot → HBM_CTRL.slice{0-7} is used
- HBM is modeled as 8 slice nodes (= number of PEs)
- Local and remote accesses use different paths:
-  - local: NOC → xbar → HBM slice
-  - cross-half: NOC → xbar_top → bridge → xbar_bot → HBM slice
-  - remote cube: NOC → UCIe → remote NOC → remote xbar → remote HBM slice
-
-Limitations of this structure:
-
- Cannot model at pseudo-channel granularity (a slice is per-PE, not per-channel)
- xbar/bridge split local and remote accesses onto separate paths
- Cannot express 1:1 / n:1 modes consistently
+Effective PE-local BW is identical under both modes
+(= N × per-channel BW); only the connectivity granularity differs.

 ---

@@ -270,7 +258,6 @@ links:
 ### Negative

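 - The number of SimPy nodes increases due to explicit router nodes (6×6 = up to 32 routers/cube)
- Requires a complete rewrite of existing xbar/bridge/single NOC-based tests
 - The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model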

 ---
@@ -296,119 +283,6 @@ links:

 ---

-## Implementation Notes
-
-### topology/builder.py Change Details
-
-#### Code to Remove (within current `_instantiate_cube()`)
-
- xbar_top, xbar_bot node creation (~line 495-508)
- bridge.left, bridge.right node creation
- noc ↔ xbar edge creation (~line 540-555)
- xbar ↔ hbm_ctrl.slice edge creation (~line 510-538)
- xbar ↔ bridge edge creation (~line 557-572)
-
-#### Code to Add
-
-1:1 mode:
-
-```python
-N = hbm_channels_per_pe  # from topology config
-total_ch = hbm_pseudo_channels
-
-# Create channel router nodes
-for ch_id in range(total_ch):
-    pe_id = ch_id // N
-    nodes[f"{cp}.ch_r{ch_id}"] = Node(
-        id=f"{cp}.ch_r{ch_id}", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),  # horizontal row = ch_id % N
-    )
-
-# PE_DMA ↔ local channel router edges
-for pe_id in range(pes_per_cube):
-    for local_ch in range(N):
-        ch_id = pe_id * N + local_ch
-        edges.append(Edge(
-            src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.ch_r{ch_id}",
-            bw_gbs=channel_bw, kind="pe_to_ch_router", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.pe{pe_id}.pe_dma",
-            bw_gbs=channel_bw, kind="ch_router_to_pe", ...))
-
-# channel router ↔ hbm_ctrl edges
-for ch_id in range(total_ch):
-    edges.append(Edge(
-        src=f"{cp}.ch_r{ch_id}", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=channel_bw, kind="ch_router_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.ch_r{ch_id}",
-        bw_gbs=channel_bw, kind="hbm_to_ch_router", ...))
-
-# horizontal line edges (same logical index)
-for row in range(N):
-    for p in range(pes_per_cube - 1):
-        ch_a = p * N + row
-        ch_b = (p + 1) * N + row
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_a}", dst=f"{cp}.ch_r{ch_b}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-        edges.append(Edge(
-            src=f"{cp}.ch_r{ch_b}", dst=f"{cp}.ch_r{ch_a}",
-            bw_gbs=ch_horizontal_bw, kind="ch_horizontal", ...))
-```
-
-n:1 mode:
-
-```python
-# Create aggregated router nodes
-for pe_id in range(pes_per_cube):
-    nodes[f"{cp}.pe{pe_id}.agg_router"] = Node(
-        id=f"{cp}.pe{pe_id}.agg_router", kind="noc_router", impl="noc_v1",
-        attrs={}, pos_mm=(...),
-    )
-
-agg_bw = N * channel_bw  # aggregated BW
-
-# PE_DMA ↔ aggregated router
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.pe_dma", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="pe_to_agg_router", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.pe{pe_id}.pe_dma",
-        bw_gbs=agg_bw, kind="agg_router_to_pe", ...))
-
-# aggregated router ↔ hbm_ctrl
-for pe_id in range(pes_per_cube):
-    edges.append(Edge(
-        src=f"{cp}.pe{pe_id}.agg_router", dst=f"{cp}.hbm_ctrl",
-        bw_gbs=agg_bw, kind="agg_to_hbm", ...))
-    edges.append(Edge(
-        src=f"{cp}.hbm_ctrl", dst=f"{cp}.pe{pe_id}.agg_router",
-        bw_gbs=agg_bw, kind="hbm_to_agg", ...))
-
-# Horizontal links between aggregated routers
-for p in range(pes_per_cube - 1):
-    edges.append(Edge(
-        src=f"{cp}.pe{p}.agg_router", dst=f"{cp}.pe{p+1}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-    edges.append(Edge(
-        src=f"{cp}.pe{p+1}.agg_router", dst=f"{cp}.pe{p}.agg_router",
-        bw_gbs=agg_horizontal_bw, kind="agg_horizontal", ...))
-```
-
-### Affected Existing Tests
-
-| Test File | Impact |
-| ---------- | ---- |
-| `tests/test_topology_compile.py` | Remove xbar/bridge node references, add channel router verification |
-| `tests/test_topology_load.py` | Reflect topology.yaml configuration changes |
-| `tests/test_pe_components.py` | PE_DMA routing path changes |
-| `tests/test_sip_parallel.py` | Cross-PE access path changes |
-| Cases that directly test xbar/bridge | Remove |
-
---
-
 ## Test Requirements

 - Verify that requests are delivered via per-channel links in 1:1 mode
@@ -425,7 +299,7 @@ for p in range(pes_per_cube - 1):

 ## Links

-- ADR-0018 (LA + BAAW) → addressing-side integration
+- ADR-0011 (LA model) → addressing-side integration
 - ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
 - ADR-0004 (Memory Semantics) → redefines the BW model
 - ADR-0014 (PE Internal Execution Model) → affected by the PE_DMA path change
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ but do not actually read tensor data or perform computations.
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation

-### Limitations of the Existing Kernel Execution Structure
-
-The current kernel execution is separated into 3 stages:
-
-```
-Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
-Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
-```
-
-Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
-`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
-Therefore, branching based on data values (dynamic control flow) is impossible.
-
-This ADR resolves this limitation **for memory operations only** (see D1, D3).
-
 ### Constraints

 - SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
@@ -532,22 +517,3 @@ Per-dtype tolerance policy:
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
-
----
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
-Benchmark kernels (benches/*.py): **no user API changes**
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -16,21 +16,6 @@ Proposed
 2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
 3. Must minimize simulation performance degradation

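-### Limitations of the Existing Kernel Execution Structure
-
-The current kernel execution is separated into 3 stages:
-
-```
-Phase 0: Kernel function execution in TLContext → PeCommand list generation (outside SimPy, no data)
-Phase 1: PE_CPU replays PeCommand list via SimPy (timing only)
-```
-
-Phase 0 requires the kernel to **complete execution entirely** before SimPy begins.
-`tl.load()` returns a TensorHandle (placeholder), so actual data cannot be accessed.
-Therefore, branching based on data values (dynamic control flow) is impossible.
-
-This ADR resolves this limitation **for memory operations only** (see D1, D3).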
-
 ### Constraints

 - SimPy is a single-thread event loop, so running numpy matmul inside it blocks everything
@@ -529,22 +514,3 @@ dtype별 tolerance 정책:
  (computations execute in Phase 2, result values are undetermined in Phase 1).
  Memory-data-based branching is supported via greenlet.
 - greenlet C extension dependency added (pip install greenlet)
-
----
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/components/base.py` | Add `_on_process_start/end` hooks |
-| `src/kernbench/common/pe_commands.py` | Add `data_op = True`, extend metadata fields |
-| `src/kernbench/sim_engine/op_log.py` | New: OpRecord, OpLogger |
-| `src/kernbench/sim_engine/data_executor.py` | New: DataExecutor, MemoryStore |
-| `src/kernbench/sim_engine/engine.py` | op_logger injection (optional) |
-| `src/kernbench/triton_emu/tl_context.py` | greenlet switch calls inside `tl.load()` etc. |
-| `src/kernbench/triton_emu/kernel_runner.py` | New: KernelRunner (greenlet ↔ SimPy bridge) |
-| `src/kernbench/components/builtin/pe_cpu.py` | Remove Phase 0, change to KernelRunner invocation |
-| `pyproject.toml` | Add greenlet dependency |
-
-Component implementation files (pe_gemm.py, pe_dma.py, hbm_ctrl.py, etc.): **no changes**
-Benchmark kernels (benches/*.py): **no user API changes**
@@ -2,30 +2,10 @@

 ## Status

-Proposed
+Accepted

 ## Context

-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← directly connected via internal SimPy Store
-├── DmaWbBlock     ← not visible in topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
-- Blocks directly reference the next block via `desc.next_block` (hardcoded routing)
-- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
-- PE internal structure is not visible in the topology
-- GemmBlock and MathBlock each duplicate TCM load/store logic
-
 ### Actual Hardware Structure

 ```
@@ -374,66 +354,6 @@ Topology edges encompass both **control/dispatch visibility + runtime chaining**
 Scheduler → sub-component edges are initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.

-### D8. Existing Code Migration — Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with new builtin components**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml — new builtin
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced
-with the ADR-0021 architecture**. Existing benchmarks and tests referencing topology.yaml
-continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel
-implementation**. The tile schedule generation logic from tiling.py is also carried over.
-Only the architecture (component separation, self-routing) changes; timing accuracy
-is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing test pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces identical latency for the same inputs as pe_accel.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from SimPy simulation (Phase 1) op_log generation → DataExecutor
-(Phase 2) actual numpy computation → result correctness verification.
-- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
-- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
-- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
-- Verify that tiles chain according to the plan's stage sequence
-- Verify PipelineContext.complete_tile() exactly-once at the last stage
-- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
-- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
-- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)
-
 ### D9. TileToken Message Definition

 A message used for passing tile work between components.
@@ -472,8 +392,6 @@ Relationship with existing PeInternalTxn:
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.
-- **builtin_legacy maintenance**: kept for backup purposes only; not a target for
-  bug fixes or feature additions.

 ## Open Questions

@@ -511,27 +429,4 @@ Relationship with existing PeInternalTxn:

 - Increased number of PE internal components (5 → 6) — more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before
-- Breaking change from existing builtin/pe_accel; migration required

----
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
-| `components.yaml` | Register new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
-| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
-
-Backup:
-| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
@@ -2,30 +2,10 @@

 ## Status

-Proposed
+Accepted

 ## Context

-### Problems with the Current Structure
-
-pe_accel (SchedulerV2Component) hides 5 hardware blocks (DmaIn, DmaWb, Gemm, Math, Tcm)
-**inside a single component**.
-
-```
-SchedulerV2Component (single topology node)
-├── DmaInBlock     ← directly connected via internal SimPy Store
-├── DmaWbBlock     ← not visible in topology
-├── GemmBlock      ← not replaceable
-├── MathBlock      ← not replaceable
-└── TcmBlock       ← not replaceable
-```
-
-Problems:
-- Blocks directly reference the next block via `desc.next_block` (hardcoded routing)
-- Individual blocks cannot be replaced (violates ADR-0015 component replacement principle)
-- PE internal structure is not visible in the topology
-- GemmBlock and MathBlock each duplicate TCM load/store logic
-
 ### Actual Hardware Structure

 ```
@@ -370,64 +350,6 @@ Topology edge는 **control/dispatch visibility + runtime chaining** 양쪽을
 Scheduler → sub-component edges are the initial dispatch paths, while
 inter-component edges are runtime chaining paths driven by token self-routing.

-### D8. Existing Code Migration: Builtin Integration
-
-The existing builtin v1 components and pe_accel are **replaced with the new builtin**.
-
-#### Migration Strategy
-
-1. Back up existing `components/builtin/` → `components/builtin_legacy/` (preserved without modification)
-2. Back up existing `components/custom/pe_accel/` → likewise
-3. Re-implement in the new `components/builtin/` with the ADR-0021 architecture
-4. Maintain **only one** topology.yaml (including pe_fetch_store)
-5. components.yaml points to the new builtin
-
-```yaml
-# components.yaml: new builtin
-pe_scheduler_v1: kernbench.components.builtin.pe_scheduler:PeSchedulerComponent
-pe_gemm_v1:      kernbench.components.builtin.pe_gemm:PeGemmComponent
-pe_math_v1:      kernbench.components.builtin.pe_math:PeMathComponent
-pe_dma_v1:       kernbench.components.builtin.pe_dma:PeDmaComponent
-pe_fetch_store_v1: kernbench.components.builtin.pe_fetch_store:PeFetchStoreComponent
-pe_tcm_v1:       kernbench.components.builtin.pe_tcm:PeTcmComponent
-```
-
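-The impl names (pe_gemm_v1, etc.) are preserved, but **the implementations are replaced with the ADR-0021 architecture**.
-Existing benchmarks and tests referencing topology.yaml continue to work without changes.
-
-#### Latency Model Inheritance
-
-The latency modeling of the new builtin components (MAC cycle calculation, SIMD latency,
-TCM BW serialization, DMA fabric latency, etc.) is **based on the current pe_accel implementation**.
-The tile schedule generation logic from tiling.py is also carried over as-is.
-Only the architecture (component separation, self-routing) changes; timing accuracy is preserved.
-
-#### Test Strategy
-
-#### Test Plan
-
-**1. Existing tests pass** (regression):
-After migration is complete, all existing tests (366) must pass.
-
-**2. Latency regression**:
-Verify that the new builtin produces the same latency as pe_accel for the same inputs.
-
-**3. Phase 1 → Phase 2 end-to-end**:
-Integration test from op_log generation in the SimPy simulation (Phase 1) → actual numpy computation
-in DataExecutor (Phase 2) → result correctness verification.
-- GEMM: tl.composite(gemm) → op_log → Phase 2 matmul → allclose verification
-- MATH: tl.exp / tl.add, etc. → op_log → Phase 2 numpy op → allclose verification
-- Chaining: GEMM output → MATH input → final result end-to-end verification
-
-**4. TileToken self-routing**:
-- Verify that tiles chain according to the plan's stage sequence
-- Verify PipelineContext.complete_tile() exactly-once at the last stage
-- Queue backpressure: verify that only the feeder blocks when DMA queue capacity is exceeded
-
-**5. Asynchronous pipeline overlap**:
-- Verify that inter-tile stage overlap occurs within the same command (tile0 in GEMM while tile1 in DMA)
-- Multiple commands: verify that cmd2 feed starts after cmd1 feed completes (FIFO order)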
-
 ### D9. TileToken Message Definition

 A message used for passing tile work between components.
@@ -465,7 +387,6 @@ Token lifecycle:
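  (based on PeInternalTxn; ADR-0014 is retained)
 - **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines are future work.
-- **builtin_legacy maintenance**: kept for backup purposes only; not a target for bug fixes or feature additions.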

 ## Open Questions

@@ -502,27 +423,4 @@ Token lifecycle:

 - Increased number of PE internal components (5 → 6), which means more topology nodes/edges
 - Component separation makes intra-PE token forwarding more explicit than before
-- Breaking change from existing builtin/pe_accel; migration required

----
-
-## Affected Files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_fetch_store component, add chaining edges |
-| `components.yaml` | Register new builtin components |
-| `src/kernbench/topology/builder.py` | Add fetch_store + chaining edges to PE internal edges |
-| `src/kernbench/common/pe_commands.py` | Add TileToken definition |
-| `src/kernbench/components/builtin/pe_scheduler.py` | Re-implement (feeder + plan-based dispatch) |
-| `src/kernbench/components/builtin/pe_gemm.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_math.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_dma.py` | Re-implement (TileToken, _process pattern) |
-| `src/kernbench/components/builtin/pe_fetch_store.py` | New |
-| `src/kernbench/components/builtin/pe_tcm.py` | Re-implement (TcmRequest service) |
-| `src/kernbench/components/builtin/types.py` | New: TilePlan, Stage, StageType, PipelineContext, TileToken |
-| `src/kernbench/components/builtin/tiling.py` | Ported from pe_accel: plan generation logic |
-
-Backup:
-| `src/kernbench/components/builtin_legacy/` | Full backup of existing builtin (preserved without modification) |
-| `src/kernbench/components/custom/pe_accel/` | Backup of existing pe_accel (preserved without modification) |
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -19,17 +19,6 @@ queues. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel-side collective
 infrastructure.

-### Current state
-
-- ADR-0021 PE pipeline refactor: each PE is decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH,
-  PE_TCM, PE_MMU).
-- No direct PE-to-PE channel exists today. All data movement goes
-  through PE_DMA → cube_noc / UCIe / PCIE → HBM.
-- A pre-ADR host CCL skeleton exists (`dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` running per-rank greenlets concurrently). The
-  collective itself is a stub.
-
 ### Problems to solve

 1. PE-to-PE direct data movement (writing into a peer's memory).
@@ -891,30 +880,3 @@ fairness from `tl.recv()` round-robin, confusing
 - VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
 - Chunk-level interleave makes PE_DMA implementation more complex.
-
----
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add `pe_ipcq` to `pe_template`, plus the IPCQ ↔ DMA / CPU / TCM edges. |
-| `components.yaml` | Register `pe_ipcq_v1`. |
-| `src/kernbench/topology/builder.py` | Wire the IPCQ chain into PE-internal edges. |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New. |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle `IpcqDmaToken`. |
-| `src/kernbench/common/pe_commands.py` | `IpcqSendCmd`, `IpcqRecvCmd`, `IpcqDmaToken`. |
-| `src/kernbench/triton_emu/tl_context.py` | `tl.send` / `tl.recv` API. |
-| `src/kernbench/runtime_api/distributed.py` | Eager IPCQ install in `AhbmCCLBackend.__init__`. |
-| `src/kernbench/runtime_api/kernel.py` | `IpcqInitMsg` definition. |
-| `src/kernbench/ccl/__init__.py` | New CCL package. |
-| `src/kernbench/ccl/topologies.py` | Builtin topology generators + `resolve_topology()`. |
-| `src/kernbench/ccl/helpers.py` | Algorithm-author helpers (`chunked`, `ring_step`, `tree_step`). |
-| `src/kernbench/ccl/testing.py` | Mock CCL runtime (`run_kernel_in_mock`). |
-| `src/kernbench/ccl/algorithms/*.py` | Algorithm modules (kernel + `kernel_args` + optional `neighbors`). |
-| `ccl.yaml` | Algorithm metadata + IPCQ defaults. |
-| `tests/test_pe_ipcq.py` | PE_IPCQ unit tests. |
-| `tests/test_pe_dma_vc.py` | PE_DMA VC tests. |
-| `tests/test_ipcq_e2e.py` | end-to-end send/recv tests. |
-| `tests/test_ccl_topologies.py` | Builtin topology generator tests. |
-| `tests/test_ccl_allreduce_matrix.py` | Unified bench × algorithm matrix. |
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -17,14 +17,6 @@ Queue)를 통해** 일어난다.
 resembles a core-local communication queue. Host-level collectives (`dist.all_reduce`) are deferred to
 **future work**; this ADR focuses solely on the kernel collective infrastructure.

-### Current state
-
-- ADR-0021 PE pipeline refactor: the PE interior is decomposed into components
-  (PE_CPU, PE_SCHEDULER, PE_DMA, PE_FETCH_STORE, PE_GEMM, PE_MATH, PE_TCM, PE_MMU)
-- No direct PE-to-PE communication channel. All data movement follows the PE_DMA → cube_noc/UCIe/PCIE → HBM path
-- Host CCL skeleton (no ADR, ad-hoc implementation): `dist.init_process_group(backend="ahbm")`,
-  `_run_ccl_bench` runs per-rank greenlets concurrently. The collective is a stub.
-
 ### Problems to solve

 1. Direct PE-to-PE data movement (writing into a peer's memory)
@@ -1245,29 +1237,3 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 - The VC arbitration model is a first-order approximation, so heavy contention
  scenarios may report slightly more optimistic latency than real HW (D8 limitation)
 - VC chunk-level interleaving makes the PE_DMA implementation more complex
-
----
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `topology.yaml` | Add pe_ipcq to pe_template, add ipcq↔dma/cpu/tcm edges |
-| `components.yaml` | Register pe_ipcq_v1 |
-| `src/kernbench/topology/builder.py` | Add the ipcq chain to PE-internal edges |
-| `src/kernbench/components/builtin/pe_ipcq.py` | New |
-| `src/kernbench/components/builtin/pe_dma.py` | Add VCs, handle IpcqDmaToken |
-| `src/kernbench/common/pe_commands.py` | Define IpcqSendCmd, IpcqRecvCmd, IpcqDmaToken |
-| `src/kernbench/triton_emu/tl_context.py` | tl.send / tl.recv API |
-| `src/kernbench/runtime_api/distributed.py` | Load ccl.yaml, install IPCQ at init (eager) |
-| `src/kernbench/runtime_api/kernel.py` | Define IpcqInitMsg (sideband) |
-| `src/kernbench/ccl/__init__.py` | New: CCL package |
-| `src/kernbench/ccl/topologies.py` | New: builtin topology generators (ring_1d, mesh_2d, tree_binary, etc.), `resolve_topology()` |
-| `src/kernbench/ccl/helpers.py` | New: algorithm authoring helpers (chunked, ring_step, etc.) |
-| `src/kernbench/ccl/testing.py` | New: mock CCL runtime (`run_kernel_in_mock`) |
-| `ccl.yaml` | New: algorithm metadata + IPCQ default settings |
-| `src/kernbench/ccl/algorithms/ring_allreduce.py` | New: first algorithm example |
-| `tests/test_pe_ipcq.py` | New: PE_IPCQ unit tests |
-| `tests/test_pe_dma_vc.py` | New: PE_DMA virtual channel tests |
-| `tests/test_ipcq_e2e.py` | New: send/recv end-to-end tests |
-| `tests/test_ccl_topologies.py` | New: builtin topology generator unit tests |
@@ -53,16 +53,6 @@ PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability)
 The `IpcqInitMsg` type is used as-is. Only the existing "sideband direct call" bypass
 is removed, which unifies the convention.

-### Current state
-
-- `DistributedContext` facade exists
-- `init_process_group("ahbm")` → `AhbmCCLBackend` calls `ctx.install_ipcq`
-  → `ccl/install.py` installs the neighbor table into PE_IPCQ via a
-  **sideband direct call** (`pe_ipcq._install_neighbors`)
-- `get_rank()` always returns `0` (single-driver)
-- `get_world_size()` fallback: total PE count (rank = PE)
-- `benches/ccl_allreduce.py`: calls `worker(rank=0, world_size=total_PEs)` once
-
 ### Problems to solve

 1. **rank = SIP in the public API**: bench workers should not need to know about the PE concept.
@@ -86,14 +76,6 @@ PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability)
  is too weakly justified to introduce. If a need for more precise control-plane latency modeling
  arises in the future, it gets a separate ADR.

-### TODO (after this ADR is implemented)
-
-- Tensor Parallelism (ADR-0027)
-- Hierarchical all-reduce algorithm design (ADR-0029): the first use case for
-  this ADR's mapper / validator registry infrastructure
-
----
-
 ## Decision

 ### D1. rank = SIP (world_size interpretation)
@@ -835,34 +817,6 @@ Migration 스케줄:

 ## Open questions

-### 🔴 Critical: potential implementation blockers (must be verified before integration)
-
-- **Engine routing of `IpcqInitMsg` is the primary implementation risk**: so far
-  only the sideband has been used, so the engine routing path has not been validated in real use. **This
-  entire ADR rests on the assumption that "engine routing works"**. If it
-  does not actually work, D2, D14, T3, and others are all affected. A **spike validation before
-  starting the ADR implementation** is mandatory:
-  - Whether `engine.submit(IpcqInitMsg(target_sips=..., target_cubes=..., target_pe=...))`
-    is delivered exactly to PE_IPCQ (compare with the existing `MmuMapMsg` / `MemoryWriteMsg` routing
-    patterns)
-  - If unsupported, a minor hook: register `IpcqInitMsg → "pe_ipcq"` in the engine's
-    message-type → component-kind mapping table (localized change, no impact on the topology builder /
-    message schema)
-  - Whether D2 is adopted may change depending on the result: if routing is not possible, fall back to
-    keeping the sideband path and re-scope this ADR
-
-- **Engine-routed install vs sideband equivalence** (D2 checkpoints 1-5): spike whether T3's
-  equivalence test actually works. In particular, ordering independence and
-  idempotency are properties not covered by existing tests, so new verification is needed.
-
-- **Audit of direct `install_ipcq()` callers** (required before implementation): the deprecated wrapper
-  strategy is appropriate, but the actual migration risk depends on the caller list. Before starting, run a
-  grep audit:
-  - Pattern: `install_ipcq(` (entire cwd)
-  - Scope: `src/`, `tests/`, `benches/`, `scripts/`, `src/kernbench/cli/`
-  - Introduce the wrapper after summarizing each caller's expected migration path (→ `dist.init_process_group` vs
-    direct `build_install_plans`)
-
 ### 🟡 Nice-to-have: scope boundary

 - **Install timing tolerance**: install consumes a few ns to us of SimPy time. Existing
@@ -883,64 +837,6 @@ multi-level 알고리즘이 driving force가 되는 framework 진화 방향.)

 ---

-## Test strategy
-
-### T1. Launcher infrastructure
-
-`tests/test_ccl_ddp_launcher.py`:
-- `test_world_size_equals_sip_count`: D1
-- `test_ahbm_set_device_binds_tensor_to_single_sip`: D10/D11
-- `test_get_rank_is_greenlet_local`: D9
-- `test_run_spawns_one_worker_per_rank`: D12/D13
-- `test_get_rank_debug_warning`: D9 warning path
-
-### T2. Install plan builder
-
-`tests/test_ccl_install_plan.py` (new):
-- `build_install_plans`: ring_1d × leader_only combination (single PE per rank)
-- `build_install_plans`: ring_1d × all_pes combination (multi-PE per rank; confirms the mapper
-  framework works, algorithm-agnostic)
-- Mapper / validator registry resolution (built-in key vs import path vs
-  unknown)
-- Verify the import path fallback (`"pkg.mod.fn"` form)
-
-### T3. Engine-routed IpcqInitMsg (equivalence: the key verification)
-
-`tests/test_ipcq_init_routing.py` (new):
-- **Routing**: `engine.submit(IpcqInitMsg)` → the targeted PE_IPCQ actually performs the install
-- **Equivalence**: send the same IpcqInitMsg via (a) a direct sideband `_install_neighbors`
-  call and (b) engine.submit, then compare the final PE_IPCQ state
-  (`_queue_pairs`, `_installed`, etc.) for equality
-- **Ordering independence**: install msgs for different PEs enqueued to the engine in arbitrary
-  order still yield the same final state
-- **Idempotency (duplicate install)**: two install msgs to the same PE → the
-  second raises an error (policy: explicit error; see D2 checkpoint 4)
-- **Multi-PE parallel install**: per-PE submits complete without interference
-- **Send succeeds after install**: run `IpcqSendCmd` right after install to confirm the neighbor table
-  state is actually valid
-
-### T4. Barrier correctness
-
-`tests/test_collective_barrier.py` (new):
-- Single collective works
-- Multiple consecutive collective calls (epoch isolation)
-- Duplicate join by the same rank → RuntimeError
-- Rank 1 exits before all_reduce → SpawnException + barrier.reset()
-- With a conditional branch, works as long as all ranks arrive
-
-### T5. E2E
-
-`tests/test_ccl_allreduce_matrix.py`:
-- `ring_tcm` / `ring_hbm` / `ring_sram` @ ws=SIP_count
-
-### T6. Regression
-
-The existing `test_ccl_framework`, `test_ccl_install`, `test_ccl_topologies`,
-`test_ccl_mock_runtime`, `test_pe_ipcq`, `test_ipcq_e2e`, and the other non-CCL tests
-all pass.
-
----
-
 ## Consequences

 ### Positive
@@ -970,28 +866,3 @@ multi-level 알고리즘이 driving force가 되는 framework 진화 방향.)
 - The IPCQ PE-level protocol (ADR-0023) is unchanged.
 - `DPPolicy` field changes belong to ADR-0026.
 - The IO_CPU role is unchanged (existing transit as-is).
-
----
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/runtime_api/distributed.py` | D1/D2/D7/D9: world_size fallback, rank_to_sip, plan ownership, engine-routed install/launch, epoch barrier |
-| `src/kernbench/runtime_api/context.py` | D10/D11: `_AhbmNamespace`, `ctx.ahbm`, `_create_tensor` passes `target_sip` |
-| `src/kernbench/runtime_api/multiprocessing.py` (new) | D12/D13: `spawn` + scheduler + exception |
-| `src/kernbench/ccl/install_plan.py` (new) | D6: `build_install_plans`, `SipInstallPlan`, `PeInstallSpec`, `NeighborTableEntry` |
-| `src/kernbench/ccl/mappers.py` (new) | D5: `leader_only`, `all_pes`, registry + resolver |
-| `src/kernbench/ccl/validators.py` (new) | D5: validator registry + resolver |
-| `src/kernbench/ccl/install.py` | Thin deprecated compat wrapper (D14) |
-| `src/kernbench/ccl/algorithms/ring_allreduce.py` | D4: keeps `kernel` + `kernel_args` (no major changes) |
-| `src/kernbench/ccl/algorithms/mesh_allreduce.py` | D4: same as above |
-| `src/kernbench/ccl/algorithms/tree_allreduce.py` | D4: same as above |
-| `ccl.yaml` | Add `mapper` / `validator` declarations to each algorithm |
-| `src/kernbench/sim_engine/engine.py` | (If needed) hook to confirm `IpcqInitMsg` → PE_IPCQ routing |
-| `benches/ccl_allreduce.py` | Rewrite on top of the new launcher |
-| `tests/test_ccl_ddp_launcher.py` (new) | T1 |
-| `tests/test_ccl_install_plan.py` (new) | T2 |
-| `tests/test_ipcq_init_routing.py` (new) | T3 |
-| `tests/test_collective_barrier.py` (new) | T4 |
-| `tests/test_ccl_allreduce_matrix.py` | T5: simplified to ws=SIP_count |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 2 — Address-based matching; peer_direction field dropped)
+Accepted (Revision 2 — Address-based matching; peer_direction field dropped)

 ## Context

@@ -13,34 +13,6 @@ topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되
 so that it works correctly on a 2-rank bidirectional ring (or, more generally, any topology in which
 several directions point to the same peer).

-### Current state (ADR-0023 D9 implementation)
-
-`src/kernbench/components/builtin/pe_ipcq.py` — `_handle_meta_arrival`:
-
-```python
-def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
-    token = msg.token
-    sender_key = (token.src_sip, token.src_cube, token.src_pe)
-    for d, qp in self._queue_pairs.items():
-        p = qp["peer"]
-        if (p.sip, p.cube, p.pe) == sender_key:
-            qp["peer_head_cache"] = max(qp["peer_head_cache"], token.sender_seq + 1)
-            # ... wake recv waiters ...
-            return
-```
-
-`_credit_worker` follows the same "sender-coord-first-match" pattern.
-
-`src/kernbench/ccl/install.py` — `reverse_direction`:
-
-```python
-def reverse_direction(my_rank: int, peer_rank: int) -> str | None:
-    for d, target in neighbor_table[peer_rank].items():
-        if target == my_rank:
-            return d
-    return None
-```
-
 ### Exposed bug: 2-rank bidirectional ring

 `ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.
@@ -289,51 +261,6 @@ for plan in plans:

 ---

-## Test strategy
-
-### T1. Unit: `reverse_direction` opposite-preference
-
-`tests/test_ccl_install.py` (extended):
-- Ring ws=2: `reverse_direction(0, 1, "E")` → "W", `reverse_direction(0, 1, "W")` → "E"
-- Ring ws=4: `reverse_direction(0, 1, "E")` → "W" (the natural opposite)
-- Mesh 2×2: `reverse_direction(r, peer, "N")` → "S", "E" ↔ "W"
-- Tree binary: direction with no opposite (parent) → fallback path
-- Non-symmetric topology: the opposite is absent on the peer and only other directions exist
-
-### T2. Runtime: `_handle_meta_arrival` dst_addr matching
-
-`tests/test_pe_ipcq.py` (extended):
-- After a 2-rank pair install, a meta arrival with the E direction dst_addr increases E's `peer_head_cache`
-  (W is unchanged)
-- A meta arrival with the W direction dst_addr increases W's `peer_head_cache`
-- Invalid dst_addr (not within any rx range) → error or silent drop
-  (to be specified once decided)
-
-### T3. Credit: `dst_rx_base_pa` matching
-
-`tests/test_pe_ipcq.py` (extended):
-- After an E direction send, the peer consumes → it sends a credit carrying its own W
-  `my_rx_base_pa` → the sender's E direction `peer_tail_cache` increases
-- Same for the W direction
-
-### T4. E2E: 2-rank bidirectional ring
-
-`tests/test_ipcq_e2e.py`:
-- On a 2-rank ring_1d, the tl.send(E) + tl.recv(W) pattern works in both directions
-- ring at ws=2 passes in ADR-0024's `test_ccl_allreduce_matrix.py`
-
-### T5. Install invariant: rx_base range disjointness
-
-`tests/test_ccl_install_plan.py` (extended):
-- I3.1 check: in the `build_install_plans` result, the rx_base ranges of all qps are disjoint
-
-### T6. Regression
-
-- Existing ws≥3 ring / mesh / tree tests pass unchanged
-- Existing `test_pe_ipcq`, `test_ipcq_e2e` cases are regression-checked
-
----
-
 ## Consequences

 ### Positive
@@ -354,19 +281,3 @@ for plan in plans:

 - The semantic layer of the IPCQ protocol (sender computes dst_addr, receiver receives)
   is unchanged.
-
----
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/ccl/install.py` | D1: add `my_dir` argument to `reverse_direction`, opposite-preference |
-| `src/kernbench/components/builtin/pe_ipcq.py` | D2: `_handle_meta_arrival` dst_addr matching / D3: `_credit_worker` dst_rx_base_pa matching / `_delayed_credit_send` fills the `dst_rx_base_pa` field |
-| `src/kernbench/common/ipcq_types.py` | D3: add `dst_rx_base_pa` field to `IpcqCreditMetadata` |
-| `src/kernbench/ccl/install_plan.py` (new in ADR-0024) | D6: I3.1 invariant check (optional) |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | Reference note: the runtime matching scheme changes in ADR-0025 |
-| `tests/test_ccl_install.py` | T1 |
-| `tests/test_pe_ipcq.py` | T2, T3 |
-| `tests/test_ipcq_e2e.py` | T4 |
-| `tests/test_ccl_install_plan.py` | T5 |
@@ -13,53 +13,6 @@ intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이
 (handled by ADR-0024's `torch.ahbm.set_device(rank)` or ADR-0027's Megatron parallel
 layers).

-### Current state
-
-`src/kernbench/policy/placement/dp.py`:
-
-```python
-@dataclass(frozen=True)
-class DPPolicy:
-    sip: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
-    num_pes: int | None = None
-    num_cubes: int | None = None
-    num_sips: int | None = None    # ← slated for removal
-```
-
-The `sip` / `num_sips` fields provide a path for distributing a tensor **across** SIP boundaries.
-This:
-
-- **Conflicts with ADR-0024's launcher model**: ADR-0024 uses the "rank = SIP = 1 worker per SIP"
-  model. Each worker creates tensors on its own SIP. A tensor that spans multiple SIPs
-  must be handled by Megatron-style TP as a separate primitive.
-- **Does not match user intent**: "DPPolicy is how to distribute across PEs within a single device"
-  (user statement).
-- **Conflates concepts**: `DPPolicy.sip="column_wise"` is actually **TP**. The name says DP but
-  the behavior is TP → confusing for new users.
-
-### Affected call sites (grep results at rollback time)
-
-**Construction sites** (`DPPolicy(sip=...` or `num_sips=...`):
-- `tests/test_runtime_api_tensor.py`
-- `benches/ccl_allreduce.py` (already reworked within ADR-0024 scope)
-- `tests/test_va_offset.py`
-- `benches/va_offset_verify.py`
-- `tests/test_sip_parallel.py`
-
-**Reference sites** (`dp.sip`, `policy.sip`, `num_sips`, etc.):
-- `src/kernbench/runtime_api/context.py` (`_create_tensor`, `launch`)
-- `src/kernbench/components/builtin/pe_cpu.py`
-- `src/kernbench/components/legacy/builtin/pe_cpu.py`
-- `src/kernbench/policy/placement/dp.py` (the implementation itself)
-- `tests/test_tensor.py`, `test_ipcq_types.py`
-
-**Key test**: `test_sip_parallel.py` is, as the name says, a test that "expresses SIP parallelism
-with DPPolicy". After this ADR it must be **rewritten for the new launcher model**.
-
----
-
 ## Decision

 ### D1. Remove the `sip` + `num_sips` fields from `DPPolicy`
@@ -258,66 +211,6 @@ for sip_id in sip_range:
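 is recommended. A `PEIdentity` value object has the benefit of explicit typing, but the boilerplate is heavy and it is currently
 only the key of the allocator dict, so it would be over-engineering. Keep the tuple.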

-### D6. Migration of existing call sites
-
-**(A) Code that used `DPPolicy(sip=..., num_sips=..., ...)`**:
-
- `DPPolicy(sip="column_wise", cube=..., pe=...)` pattern → **rewrite the bench on the ADR-0024
-  launcher**. The worker selects its SIP with `set_device(rank)`; DPPolicy covers
-  cube/PE only.
- `DPPolicy(sip="replicate", num_sips=1, ...)` pattern → reduce to `DPPolicy(cube=..., pe=...)`
-  (this follows naturally once the fields are gone).
-
-**(B) Code that read `dp.sip` or `dp.num_sips`**:
-
- Remove it. Delete the `dp.sip` branch in `_compute_local_shape` of `launch()`.
- Also clean up the places where `pe_cpu.py` referenced `dp.sip`.
-
-**(C) Code that used `ShardSpec.pe_index`: every site must be changed**:
-
- Accessing `.pe_index` now raises `AttributeError` → every call site must be fixed.
- Allocator lookup: `allocators[spec.pe_index]` →
-  `allocators[(spec.sip, spec.cube, spec.pe)]`
- Local contexts that truly need a flat integer: compute `spec.sip * N_CUBES * N_PE + spec.cube *
-  N_PE + spec.pe` explicitly. **Use it only as a local variable and do not expose it in the
-  public API**.
-
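The `pe_index` replacement described in (C) can be sketched as follows. This is a minimal illustration under stated assumptions, not the KernBench implementation: `ShardSpec` is a stand-in frozen dataclass, the allocator values are placeholder strings, and `N_CUBES`, `N_PE`, and `flat_index` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical topology constants, for illustration only.
N_CUBES = 4
N_PE = 8


@dataclass(frozen=True)
class ShardSpec:
    """Stand-in for the real ShardSpec: structural coordinates, no pe_index."""
    sip: int
    cube: int
    pe: int


def flat_index(spec: ShardSpec) -> int:
    """Explicit flat index for local use; not meant to appear in a public API."""
    return spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe


# Allocators keyed by the structural tuple instead of a flat integer.
# The string values are placeholders for real allocator objects.
allocators = {
    (sip, cube, pe): f"alloc[{sip},{cube},{pe}]"
    for sip in range(2)
    for cube in range(N_CUBES)
    for pe in range(N_PE)
}

spec = ShardSpec(sip=1, cube=2, pe=3)
allocator = allocators[(spec.sip, spec.cube, spec.pe)]
local_flat = flat_index(spec)  # kept as a local variable only
```

Keying by tuple means a caller that still expects a global flat integer fails loudly with a `KeyError` rather than silently indexing the wrong allocator.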
-**Grep audit checklist before starting the implementation**:
-
-1. **Property references**:
-   - `\.pe_index\b`: both field and property access (regex)
-   - `pe_index=`: keyword argument at construction
-   - `pe_index:`: dataclass field declaration
-2. **Allocator / dict indexing**:
-   - `allocators\[`: dict lookup pattern. Check whether things like `allocators[spec.pe_index]`
-     are caught
-   - `_allocators\[`: same pattern (with the `_` prefix)
-3. **Hand-written flat index computations**:
-   - `flat_idx =`
-   - `pe_index =` (left-hand side)
-   - `* pes_per_cube +` (typical flat computation pattern)
-   - `* self._num_cubes \* self._pes_per_cube` (global flat computation)
-4. **Serialization / logging**:
-   - `asdict(.*shard`: whether dataclass serialization automatically includes `pe_index`
-   - `repr(.*ShardSpec`: whether any log format depends on it
-   - Whether JSON/YAML storage formats use a `pe_index` key
-5. **Tests asserting integer PE identity**:
-   - `assert .*pe_index`: assertions of integer equality
-   - `spec.pe_index ==`: comparisons (tests may break if the meaning becomes SIP-local)
-
-For each match, decide which of "global flat / SIP-local / internal lookup" the caller
-expected, then replace it with structural coordinates.
-
-**(D) `test_sip_parallel.py`**:
-
- Keep the name; rewrite the contents on the ADR-0024 multi-greenlet launcher.
- Verify it as "SIP parallelism = one worker per rank × each worker's own DPPolicy".
-
-**(E) `test_va_offset.py`, `benches/va_offset_verify.py`**:
-
- Most uses only pass `num_sips=1`. Simply remove the field.
- If the SIP offset test is the point, port it to `set_device(rank)` + observing structural coordinates.
-
 ### D7. Backward compatibility: none (cleanup ADR)
 
 This ADR is a **breaking change**.
@@ -331,17 +224,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
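 **Blocking silent drift** is the main benefit of removing the property entirely: it eliminates the case where
 code expecting a global flat index receives a SIP-local result and silently indexes incorrectly.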

-### D8. Documentation updates
-
- `ADR-0008` (tensor deploy): add a note updating the meaning of DPPolicy and state the
-  switch of ShardSpec to structural coordinates
- State "intra-device only" in the DPPolicy docstring (the docstring in the D1 code snippet)
- State in the ShardSpec docstring that **the structural coordinates `(sip, cube, pe)` are used
-  directly and `pe_index` is no longer provided** (D2)
- Remove `sip=...` examples from tutorials such as `docs/ccl-author-guide`
-
---
-
 ## Dependencies

 - **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
@@ -378,56 +260,6 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에

 ---

-## Test strategy
-
-### T1. Unit test updates
-
- `tests/test_tensor.py`, `tests/test_ipcq_types.py`, `tests/test_runtime_api_tensor.py`:
-  clean up DPPolicy constructor arguments, verify ShardSpec structural coordinates
- `tests/test_va_offset.py`: behavior is preserved after removing `num_sips=1`
-
-### T2. `resolve_dp_policy` returns structural coordinates
-
-`tests/test_dp_policy.py` (new or extended):
- Every ShardSpec in the result of `resolve_dp_policy(dp, ..., target_sip=1)` has `sip=1`
- Each spec's `(cube, pe)` is local (0..num_cubes-1, 0..num_pe-1)
- On the same topology, the results for `target_sip=0` and `target_sip=1` differ only in the sip field
-
-### T3. Rewrite `test_sip_parallel.py`
-
-Verify SIP parallelism on top of the launcher:
-
-```python
-def test_sip_parallel_via_launcher(topology):
-    ...
-    def worker(rank, ws, torch):
-        torch.ahbm.set_device(rank)
-        t = torch.zeros((1, 128), dtype="f16",
-                         dp=DPPolicy(cube="column_wise", pe="column_wise"))
-        # verify shard.sip == rank (structural coord)
-
-    spawn(worker, nprocs=n_sips, ...)
-```
-
-### T4. Allocator key migration
-
-`tests/test_allocator_structural_key.py` (new or an extension of an existing file):
- The `PEMemAllocator` dict works with `(sip, cube, pe)` tuple keys
- `deploy_tensor` looks up the allocator by structural coordinates
- `_free_tensor` does the same
-
-### T5. E2E regression
-
-`test_ccl_allreduce_matrix.py` from ADR-0024 passes unchanged.
-
-### T6. Error checks
-
- Calling `DPPolicy(sip="column_wise")` → `TypeError`. Assert this explicitly in a test.
- Calling `DPPolicy(num_sips=2)` → `TypeError`.
- Accessing `spec.pe_index` → `AttributeError` (verifies that the property is fully removed).
-
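The T6 checks rely on ordinary dataclass behavior: a keyword that is no longer a field is rejected by the generated `__init__`. The sketch below shows that mechanism with a stand-in `DPPolicy`; the field set is an assumption based on the snippet quoted in this ADR, and `raises` is a hypothetical helper used in place of a test framework.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Mode = Literal["replicate", "column_wise", "row_wise"]


@dataclass(frozen=True)
class DPPolicy:
    """Stand-in policy: intra-device only, so no sip / num_sips fields."""
    cube: Mode = "replicate"
    pe: Mode = "replicate"
    num_pes: Optional[int] = None
    num_cubes: Optional[int] = None


def raises(exc_type, fn) -> bool:
    """Return True if calling fn() raises exc_type (hypothetical test helper)."""
    try:
        fn()
    except exc_type:
        return True
    return False


# Removed keywords are rejected at construction time.
sip_rejected = raises(TypeError, lambda: DPPolicy(sip="column_wise"))
num_sips_rejected = raises(TypeError, lambda: DPPolicy(num_sips=2))

# The remaining fields keep working as before.
policy = DPPolicy(cube="column_wise", pe="column_wise")
```

In a real suite the same checks would more likely be written with `pytest.raises`; the helper here only keeps the example free of third-party imports.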
---
-
 ## Consequences

 ### Positive
@@ -454,23 +286,3 @@ ADR-0024의 `test_ccl_allreduce_matrix.py` 그대로 통과.
 ### Neutral

 - The meaning of the existing `cube` / `pe` fields is unchanged.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/policy/placement/dp.py` | D1: remove `sip`/`num_sips` / D2: add `sip`/`cube`/`pe` structural fields to `ShardSpec`, **remove the `pe_index` property** / D3: add `target_sip` to `resolve_dp_policy`, remove the SIP-level loop / also rename the shard type returned by the internal resolver to `local_pe` for clarity (avoids a name clash) |
-| `src/kernbench/runtime_api/context.py` | D4: `_create_tensor` passes `target_sip` / D5: `_ensure_allocators` dict key → `(sip, cube, pe)` tuple / remove the `dp.sip` branch in `launch` |
-| `src/kernbench/runtime_api/tensor.py` | D5: `deploy_tensor` looks up the allocator by structural coordinates |
-| `src/kernbench/components/builtin/pe_cpu.py` | D6: remove `dp.sip` references |
-| `src/kernbench/components/legacy/builtin/pe_cpu.py` | D6: same |
-| `benches/ccl_allreduce.py` | Already handled within the ADR-0024 scope |
-| `benches/va_offset_verify.py` | D6: remove `num_sips=1` |
-| `tests/test_runtime_api_tensor.py` | D6 |
-| `tests/test_va_offset.py` | D6 |
-| `tests/test_tensor.py`, `test_ipcq_types.py` | D6 |
-| `tests/test_sip_parallel.py` | T3: rewrite on the launcher |
-| `tests/test_dp_policy.py` (new or extended) | T2 |
-| `tests/test_allocator_structural_key.py` (new) | T4 |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Revision 7: resume invariant / main-context wait non-recursion invariant /
+Accepted (Revision 7: resume invariant / main-context wait non-recursion invariant /
 global barrier over-serialization tradeoff / TP forward yield-safety made explicit,
 2026-04-14)

@@ -19,20 +19,6 @@ Megatron-style을 선택한 이유:
 - It is the industry standard established by NVIDIA Megatron / DeepSpeed.
 - DTensor is declarative, so its design space is larger → approach it in stages.

-### Current state
-
- KernBench has no TP. The former `DPPolicy.sip="column_wise"` path was removed in
-  ADR-0026. The TP primitives are layered on top of the rank = SIP launcher (ADR-0024).
- ADR-0024 Phase B exposed a **worker-greenlet env.run re-entrancy bug**:
-  when a worker calls `ctx.wait(h)` (for example for MmuMapMsg during tensor creation), `env.run`
-  runs in the worker context, and the `_parent` of any kernel greenlet spawned at that point
-  becomes the worker, which produces an orphan. This is the root cause of the `ring_default_ws` strict xfail.
- `dist.all_reduce` already avoids this problem with the `_defer_wait=True` + worker yield pattern
-  ([distributed.py:119-134](src/kernbench/runtime_api/distributed.py#L119-L134)).
- The forward pass of a TP layer calls `torch.launch("gemm", ...)` every time, followed by
-  `dist.all_reduce`, and this pattern repeats. Unless the worker-wait problem is
-  resolved, the TP sample fails on its first run.
-
 ### TP primitive spec (following Megatron-LM)

 - **ColumnParallelLinear**: weight의 **column(out_features)** 축을 TP ranks에
@@ -907,155 +893,6 @@ PR을 심사.

 ---

-## Test strategy
-
-### T1. Unit — `tests/test_tp_parallel_state.py` (신규)
-
- `initialize_model_parallel(ws)`가 world_size와 일치하는 경우만 통과.
- `get_tensor_model_parallel_rank()`가 greenlet-local rank 반환 (ADR-0024 D9
-  회귀).
- 미초기화 상태에서 `get_tensor_model_parallel_world_size()`가 적절히 실패.
-
-### T2. Unit — `tests/test_tp_layers.py` (신규)
-
-**Shape / structural checks**:
-
- `ColumnParallelLinear(in=256, out=512).weight.shape` per-rank가 `(256, 512/ws)`.
- `RowParallelLinear(in=512, out=256).weight.shape` per-rank가 `(512/ws, 256)`.
- `ColumnParallelLinear.forward(x)`의 출력 텐서 shape이 `(M, K/ws)`.
-
-**Numerical correctness (weight ≠ zero)**: 단순 shape assert는 대수적 오류를
-놓치므로, 결정론적 non-zero 입력/weight으로 실제 연산 결과 검증:
-
- **T2.a (ColumnParallel, deterministic)**: weight를 per-rank identity
-  (또는 `(i, j) → i + rank * k_local + j` 같은 결정론적 패턴)으로 초기화
-  (`tensor.copy_`). 입력 `x`를 상수 벡터로 둔 뒤 forward. 각 rank의 출력이
-  **기대치 `x @ W_rank_local`와 rtol/atol 1e-2 이내로 일치** (gemm kernel의
-  fp16 round-off 고려).
- **T2.b (RowParallel, reduced output equality — primary)**: 모든 rank의
-  forward 결과가 동일 전역 행렬 곱 `concat([x_0..x_{ws-1}]) @ concat([W_0..
-  W_{ws-1}])`과 일치하는지 검증. rank-별 `y.numpy()` 비교로 (i) all-reduce 후
-  elementwise equality와 (ii) 기대치(host-side numpy로 계산) 일치 **둘 다**
-  assert. observable-only 검증 — internal hook 불필요.
-
-  *Optional implementation note*: partial-sum 단계를 더 세밀히 관찰하고 싶으면
-  `_pending_collective_handles` enqueue 직전 intercept hook을 쓸 수 있으나,
-  이는 내부 구현 detail에 결합되므로 ADR 수준의 test contract는 T2.b의
-  observable equality만 요구한다.
- **T2.c (rank-identity after all_reduce)**: 모든 rank의 `y.numpy()`이 elementwise
-  identical (mean뿐 아니라 full array equality, rtol 1e-2).
-
-**기존 weak assertion 금지**: `output mean이 identical` 같은 aggregate-only
-검증은 silently 깨지기 쉽기에 **main assertion으로 쓰지 말 것** — 보조
-sanity로만 사용.
-
-### T3. Generalized worker-wait + orphan regression: `tests/test_worker_wait_drain.py` (new)
-
-The core purpose of this test is not queue behavior but **directly preventing the ADR-0024 Phase B orphan
-regression**. It asserts the following:
-
- **T3.a**: When a worker calls `ctx.wait(h)`, the handle is enqueued in `_pending_worker_waits`
-  and the worker is not resumed until main drains it.
- **T3.b**: Immediately after `_drain_pending`, the worker is resumed and the handle is in the `_completed`
-  state.
- **T3.c**: With multiple workers, every worker resumes at the same drain point.
- **T3.d (orphan invariant, key)**: After the worker function calls `torch.launch(...)`,
-  at the moment the SimPy engine actually starts running, **the `_parent` of the kernel greenlet
-  is the main greenlet**. The test verifies this invariant directly, either by monkey-patching
-  `kernel_runner.run` or by attaching an assertion hook where `KernelRunner._parent` is
-  captured.
- **T3.e (symptom regression)**: Without D0, a GreenletExit failure equivalent to T3.d must
-  be reproducible (this documents the historical failure mode; the actual test is
-  skipped or marked xfail once D0 is in place).
- **T3.f (idempotency)**: Calling `ctx.wait(h)` twice on the same handle invokes
-  `engine.wait` only once (D0.4-(3)).
- **T3.g (exception propagation)**: If a worker raises after calling `wait`, the main
-  scheduler loop stops immediately and the exception propagates upward. The remaining `_pending_worker_waits` are
-  not drained (D0.4-(4)).
-
-### T4. `torch.multiprocessing.spawn` — `tests/test_mp_spawn.py` (신규)
-
- `spawn(fn, args, nprocs)`이 nprocs 개의 greenlet을 생성하고 각각 rank로 bind.
- 모든 worker 완료 후 return.
- 기존 bench `ccl_allreduce.py`의 hand-rolled loop을 `mp.spawn`으로 교체해도
-  matrix 회귀 통과.
-
-### T5. Host-read barrier — `tests/test_host_read_barrier.py` (신규)
-
-D0.5 contract를 직접 검증:
-
- **T5.a**: Worker가 `launch → tensor.numpy()`를 연속 호출하면 barrier가 동작,
-  numpy 결과는 kernel 완료 후 값 (post-drain).
- **T5.b**: `launch → tensor.shape` (metadata)는 barrier 발동 안 함 (pending
-  queue 그대로 유지).
- **T5.c**: Pending 큐가 비어 있는 상태의 `numpy()` 호출은 yield 없이 즉시
-  read (불필요한 context switch 방지).
- **T5.d**: `__getitem__`, `data` 역시 T5.a와 동일한 barrier 발동.
- **T5.e**: Collective pending (all_reduce) 진행 중 상태에서 `numpy()` 호출 시
-  collective drain까지 기다린 뒤 read.
- **T5.f (copy_ write barrier)**: target tensor에 미완료 pending handle이
-  있는 상태에서 `target.copy_(source)` 호출 시, write 전에 drain 발동.
-  주입한 host source가 drain-이후 상태에 덮어써지는지 확인 (stale-overwrite
-  없음).
- **T5.g (closed-set via registry)**: barrier entry-point의 closed-set은
-  **명시적 registry** (예: `tensor.py` 상단의 `_HOST_READ_BARRIERS = frozenset
-  ({"numpy", "data", "__getitem__", "__repr__", "copy_"})`)로 유지한다.
-  테스트는:
-  1. registry에 나열된 각 entry-point에 **실제 barrier 주입이 되어 있는지**
-     (invocation 시 pending queue를 확인하고 yield 경로를 거치는지) 관찰.
-  2. 새 host-read semantic API 추가는 code review에서 registry 업데이트를
-     의무화 (CODEOWNERS / review checklist로 운영).
-
-  **Non-goal**: Python introspection (method 시그니처, docstring 분석 등)으로
-  barrier-부재 API를 자동 탐지하는 것은 정밀도 문제로 ADR scope 밖. registry
-  + review 접근으로 충분.
-
-### T6. E2E — `tests/test_tp_mlp.py` (신규)
-
-2-layer MLP (ColumnParallel → RowParallel) forward:
-
-**Structural / liveness**:
-
- `ws = SIP count` (topology.yaml 기준 current 2) 모델로 실행 완료.
- **Deadlock 없음**: scheduler loop이 유한 시간 내 종료 (pytest-timeout 등).
- **Completion trace**: 각 `launch` 및 `all_reduce`가 `ctx._traces`에 entry
-  남김 (count = 예상 layer 수).
-
-**Numerical correctness (필수)**:
-
- **T6.a (zero-weight sanity)**: weight 전부 0 → 출력 전부 0. 파이프라인이
-  돌긴 하는지 확인용 smoke test. **이것만으로는 불충분 — T6.b/T6.c와 함께
-  채택**.
- **T6.b (deterministic pattern)**: 모든 weight를 결정론적 non-zero pattern
-  (예: all 0.01, 또는 per-rank identity에서 파생된 값)으로 `copy_`. 입력도
-  상수. 기대 출력을 host-side numpy로 계산한 뒤 각 rank의 `y.numpy()`와 rtol
-  1e-2로 비교.
- **T6.c (rank-consistency post all-reduce)**: RowParallel의 all-reduce
-  이후 **모든 rank의 output이 elementwise identical** (T2.c와 동일 기준).
-  단순 mean 일치가 아니라 full array equality.
- **T6.d (shape contract)**: ColumnParallel 출력이 `(B, D_hidden / ws)`,
-  RowParallel 출력이 `(B, D_out)`.
-
-### T7. 회귀 — `ring_default_ws` xfail 해제
-
- `tests/test_ccl_allreduce_matrix.py::test_ccl_allreduce_matrix[ring_default_ws]`의
-  `@pytest.mark.xfail(strict=True)` 제거 → **PASS**여야 함.
- Acceptance criteria (observable):
-  - **Deadlock 없음**: bench가 유한 시간 내 종료.
-  - **GreenletExit 없음**: stderr/log에 GreenletExit trace 없음.
-  - **Rank 0 산출**: `ring_allreduce_tcm (ws=2): 2 OK` 문자열이 출력.
-  - **Completion trace**: `all_reduce` trace entry 존재.
-  - **Numerical**: 각 rank의 입력 `r+1`에 대한 sum(1..ws)=3 결과를 tolerance
-    1e-1 이내로 달성.
-
-### T8. 회귀 — 기존 전체 test suite
-
- ADR-0026까지 통과하던 모든 test가 그대로 통과 (523 passed + 1 xfail).
- Phase 2 완료 기준: 524 passed (xfail 해제 포함) + 0 xfail + 위 T1~T7 신규
-  테스트 전부 통과.
-
---
-
 ## Consequences

 ### Positive
@@ -1080,29 +917,3 @@ D0.5 contract를 직접 검증:

 - ADR-0024/0026 기반 위에 순수한 상위 레이어 추가. Hardware simulation
  stack에 영향 없음 (D0 제외).
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/runtime_api/context.py` | D0.1/D0.2: `_pending_worker_waits` + `ctx.wait`의 worker fork, D1.3: `self.multiprocessing` namespace attach |
-| `src/kernbench/runtime_api/multiprocessing.py` | 신규 (D1): `_MultiprocessingNamespace.spawn` + `_drain_pending` + `SpawnException` |
-| `src/kernbench/runtime_api/distributed.py` | `_pending_collective_handles` 타입 annotation 보강 (`list[tuple[RequestHandle, int, dict]]`); spawn exception cleanup에서 clear 호출 지점 노출 |
-| `src/kernbench/runtime_api/tensor.py` | D0.5 barrier 주입: `numpy`, `__getitem__`, `data`, `__repr__`, `copy_` (source read + target write) |
-| `src/kernbench/tp/__init__.py` | 신규: public API re-export |
-| `src/kernbench/tp/parallel_state.py` | 신규: D3 |
-| `src/kernbench/tp/layers.py` | 신규: D4/D5 |
-| `src/kernbench/tp/primitives.py` | 신규: D6 |
-| `src/kernbench/tp/kernels.py` | 신규: TP layer용 `_gemm_kernel` (bench 복제) |
-| `src/kernbench/tp/mappings.py` | 신규 stub (backward TODO) |
-| `benches/tp_mlp.py` | 신규 샘플 (D7) |
-| `benches/ccl_allreduce.py` | hand-rolled loop → `torch.multiprocessing.spawn`으로 교체 (D1.4) |
-| `tests/test_tp_parallel_state.py` | 신규 (T1) |
-| `tests/test_tp_layers.py` | 신규 (T2) |
-| `tests/test_worker_wait_drain.py` | 신규 (T3): orphan invariant 직접 검증 포함 |
-| `tests/test_mp_spawn.py` | 신규 (T4) |
-| `tests/test_host_read_barrier.py` | 신규 (T5): D0.5 host-read barrier contract |
-| `tests/test_tp_mlp.py` | 신규 (T6) |
-| `tests/test_ccl_allreduce_matrix.py` | `ring_default_ws` xfail 제거 (T7) |
@@ -1,421 +0,0 @@
-# ADR-0029: Hierarchical All-Reduce — 3-level intra/inter-SIP 알고리즘
-
-## Status
-
-Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
-`hierarchical_allreduce.py` module have been removed. The cube-mesh
-intercube + inter-SIP path is now the single all-reduce algorithm.
-
-## Context
-
-### Goal
-
-On top of the "Rank = SIP" model (ADR-0024), define a **3-level hierarchical all-reduce**
-algorithm in which every PE inside each SIP participates. Each level uses a different physical
-interconnect (intra-cube ring, inter-cube NoC, inter-SIP UCIe) to maximize
-bandwidth.
-
-### Why hierarchical
-
-A plain ring/mesh/tree all-reduce has only 1 PE per SIP participating (the `leader_only`
-mapper of ADR-0024). That models the inter-SIP stage well, but:
-
- **Intra-SIP PEs sit idle**. While the leader PE is busy with inter-SIP communication,
-  the remaining 7 PEs / 16 cubes are idle.
- **Intra-cube/inter-cube link bandwidth goes unused**. The cube NoC is very fast, but
-  with a single leader this resource is never exercised.
- **Real libraries such as NCCL are hierarchical**: they exploit the bandwidth gap between
-  NVLink (intra-node) and InfiniBand (inter-node). The KernBench topology has the same structure
-  (bandwidth and latency differences across intra-cube / inter-cube / inter-SIP).
-
-### 현재 상태
-
- `src/kernbench/ccl/algorithms/hierarchical_allreduce.py` 이미 존재
-  (git log `10b33b4` — "Tensor indexing + hierarchical 3-level all-reduce
-  kernel"). PE-level로 world_size = total PE를 가정하는 옛 모델 기반 구현.
- ADR-0024에 의해 launcher는 rank = SIP로 바뀜.
- Hierarchical 커널은 **재해석 필요**: 이제 각 worker(1 per SIP)가 자기 SIP의
-  모든 PE를 참여시키고, kernel은 intra-cube → inter-cube → inter-SIP 순으로
-  3-level reduce + broadcast.
-
-### Problems to solve
-
-1. **Fit the hierarchical algorithm onto the ADR-0024 framework**
-   - Mapper: `all_pes` (provided by ADR-0024 D5)
-   - Validator: `multi_pe_sip_local` (provided by ADR-0024 D8)
-   - Kernel: modify the existing `hierarchical_allreduce.py` so that rank is computed from
-     the SIP-local (cube, pe) coordinates
-2. **Generate the PE-level neighbor graph**
-   - Intra-cube: `(sip, cube, pe) ↔ (sip, cube, pe±1 mod N_PE)` (within the ring)
-   - Inter-cube: `(sip, cube, 0) ↔ (sip, cube±1 mod N_CUBE, 0)` (cube leaders only)
-   - Inter-SIP: `(sip, 0, 0) ↔ (sip±1 mod N_SIP, 0, 0)` (SIP leaders only)
-3. **Tensor layout**: each PE starts out owning one tile (the `multi_pe_sip_local`
-   validator enforces this layout). Achievable with DPPolicy(cube="column_wise",
-   pe="column_wise").
-4. **Insufficient PE-level topology representation** (a concrete instance of the "distributed responsibility" issue in ADR-0024 D6)
-   - Simple patterns such as ring/mesh/tree are covered by combining a rank-level topology_fn
-     with a mapper.
-   - Hierarchical uses a different peer mapping per level, so `_build_pe_installs` has to
-     perform multi-level interpretation.
-   - In the long run it is more explicit for the topology module to express the PE level directly.
-
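The three neighbor rules in item 2 can be written out directly. The function below is a sketch under assumptions: `pe_neighbors` and its return shape are hypothetical and are not the schema of `hierarchical_3level` or `_build_pe_installs`; it only applies the `±1 mod N` rules and the leader-only restrictions stated above.

```python
def pe_neighbors(sip, cube, pe, n_sip, n_cube, n_pe):
    """Return the peers of one PE at each level, as sorted coordinate lists."""
    me = (sip, cube, pe)
    out = {"intra_cube": [], "inter_cube": [], "inter_sip": []}

    # Intra-cube ring: every PE links to pe±1 (mod N_PE) inside its own cube.
    out["intra_cube"] = sorted(
        {(sip, cube, (pe + d) % n_pe) for d in (-1, 1)} - {me}
    )

    # Inter-cube ring: only cube leaders (pe == 0) link to cube±1 (mod N_CUBE).
    if pe == 0:
        out["inter_cube"] = sorted(
            {(sip, (cube + d) % n_cube, 0) for d in (-1, 1)} - {me}
        )

    # Inter-SIP ring: only SIP leaders (cube == 0, pe == 0) link to sip±1 (mod N_SIP).
    if cube == 0 and pe == 0:
        out["inter_sip"] = sorted(
            {((sip + d) % n_sip, 0, 0) for d in (-1, 1)} - {me}
        )
    return out


# Example topology: 2 SIPs x 4 cubes x 4 PEs.
leader = pe_neighbors(0, 0, 0, n_sip=2, n_cube=4, n_pe=4)
worker = pe_neighbors(1, 2, 1, n_sip=2, n_cube=4, n_pe=4)
```

Using sets removes the duplicate peer that appears when a ring has only two members, which is why the SIP leader in the 2-SIP example ends up with a single inter-SIP neighbor.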
-### Non-problem (이 ADR 밖)
-
- Launcher / barrier / rank-to-SIP / mapper-validator registry → ADR-0024
- IPCQ direction addressing → ADR-0025
- DPPolicy 필드 정리 → ADR-0026
- Megatron TP → ADR-0027
-
---
-
-## Decision
-
-### D1. Algorithm structure: 3-level reduce + broadcast in reverse order
-
-```
-Level 1 (intra-cube, E/W ring):
-  The N_PE PEs of each cube run a bidirectional ring reduce → the partial sum gathers at PE 0 of the cube
-Level 2 (inter-cube within SIP, N/S ring, PE 0 only):
-  The N_CUBE cube leaders run a bidirectional ring reduce → the SIP-wide partial sum
-  gathers at (cube 0, PE 0) of the SIP
-Level 3 (inter-SIP, N_SIP peers, (cube 0, PE 0) only):
-  Ring or pair exchange completes the global sum
-Broadcast:
-  Reverse order: the Level 3 result propagates from (cube 0, PE 0) to every cube leader in the SIP,
-  then to every PE in each cube
-```
-
-The details match the kernel implementation in the existing `hierarchical_allreduce.py`. After ADR-0024
-the only changes are the **rank computation** and the **interpretation of n_elem**:
-
- Before (rank=PE model): `rank = cube_id * pes_per_cube + local_pe`, `pe_addr =
-  t_ptr + rank * nbytes`
- After (rank=SIP model): the kernel operates only on SIP-local coordinates `(cube_id, local_pe)`.
-  The backend passes each PE's tensor slice as a per-PE `TensorArg` (ADR-0024 D3).
-  Rank computation inside the kernel becomes unnecessary; `tl.program_id(0/1)` suffices.
-
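The data flow of the three levels can be checked with a small host-side model. This is a sketch only: it collapses each ring reduce into a plain `sum` and ignores rounds, directions, and IPCQ traffic, so it illustrates where partial sums gather rather than how the kernel moves them. The function name and the nested-list layout are assumptions made for the example.

```python
def hierarchical_allreduce(values):
    """values[sip][cube][pe] -> same shape, every entry replaced by the global sum."""
    # Level 1: each cube reduces onto its PE 0.
    cube_sums = [[sum(cube) for cube in sip] for sip in values]
    # Level 2: the cube leaders of a SIP reduce onto (cube 0, PE 0).
    sip_sums = [sum(per_cube) for per_cube in cube_sums]
    # Level 3: the SIP leaders combine their partial sums.
    total = sum(sip_sums)
    # Broadcast: the result flows back down the same chain to every PE.
    return [[[total for _ in cube] for cube in sip] for sip in values]


# 2 SIPs x 2 cubes x 4 PEs, each PE's tile initialised to its global index + 1.
n_sip, n_cube, n_pe = 2, 2, 4
values = [
    [[(s * n_cube + c) * n_pe + p + 1 for p in range(n_pe)] for c in range(n_cube)]
    for s in range(n_sip)
]
result = hierarchical_allreduce(values)
```

With 16 PEs holding 1 through 16, every entry of `result` equals `sum(range(1, 17))`, the same expectation stated for the mock-runtime test in this ADR.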
-### D2. Framework integration — ADR-0024 infrastructure 재활용
-
-`ccl.yaml`:
-
-```yaml
-algorithms:
-  hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.hierarchical_allreduce
-    topology: hierarchical_3level        # NEW — D3 참고
-    mapper: all_pes                      # ADR-0024 D5 built-in
-    validator: multi_pe_sip_local        # ADR-0024 D8 built-in
-    buffer_kind: tcm
-    n_elem: 128
-```
-
-Framework 관점에서 hierarchical은 **특별한 알고리즘이 아니라, 특정
-topology / mapper / validator 조합**. 본 ADR은 그 조합과 topology 패턴을
-정의.
-
-### D3. `hierarchical_3level` topology (신규)
-
-`kernbench/ccl/topologies.py`에 신규 추가:
-
-```python
-def hierarchical_3level(rank: int, world_size: int, spec: dict) -> dict:
-    """3-level hierarchical neighbor pattern.
-
-    Returns a nested structure describing intra-cube + inter-cube + inter-SIP
-    neighbors. Unlike ring_1d / mesh_2d which are rank → {dir: peer_rank},
-    hierarchical is PE-level and requires spec for cube_mesh / pe_layout.
-    """
-```
-
-반환 스키마 (초안):
-
-```python
-{
-    "intra_cube": {
-        # 각 cube 내 ring neighbors: (cube, pe) → {"E": (cube, pe_e), "W": (cube, pe_w)}
-        ...
-    },
-    "inter_cube": {
-        # cube-leader 간 ring: (cube, 0) → {"N": (cube_n, 0), "S": (cube_s, 0)}
-        ...
-    },
-    "inter_sip": {
-        # SIP-leader 간: rank → {"parent": peer_rank} (또는 ring 방식)
-        ...
-    },
-}
-```
-
-이 구조는 `_build_pe_installs`가 해석하여 각 PE의 neighbor table 엔트리
-(4-direction)에 대응시킨다.
-
-**Rank-level `topologies.py` 현 API와의 관계**: 기존 단순 패턴은
-`(rank → {dir: peer_rank})` 단일 레벨. Hierarchical은 multi-level이므로
-기존 API와 schema가 다름. `_resolve_topology`는 **알고리즘이 어떤 schema를
-쓰는지 선언**하고, builder가 그에 맞춰 해석하도록 확장 필요 (open question).
-
-### D4. PE-level neighbor graph — `_build_pe_installs` 확장
-
-기존 (ring/mesh/tree): topology_fn이 반환한 `(rank → {dir: peer_rank})`를
-각 참여 PE에 그대로 매핑 (leader_only일 경우 peer PE도 leader).
-
-신규 (hierarchical): `hierarchical_3level`의 3단 구조를 per-PE neighbor
-table로 펼침:
-
-```python
-def _build_pe_installs_hierarchical(rank, world_size, sip, pes, topo, spec):
-    """Hierarchical 전용 PE neighbor table 빌더."""
-    result = []
-    for (cube, pe) in pes:
-        entries = []
-        # Level 1: intra-cube ring (E/W)
-        for d, peer in topo["intra_cube"][(cube, pe)].items():
-            entries.append(NeighborTableEntry(direction=d, ...))
-        # Level 2: inter-cube ring (N/S) — cube leader (pe == 0)만
-        if pe == 0:
-            for d, peer in topo["inter_cube"][(cube, 0)].items():
-                entries.append(NeighborTableEntry(direction=d, ...))
-        # Level 3: inter-SIP — SIP leader (cube == 0 and pe == 0)만
-        if cube == 0 and pe == 0:
-            for d, peer_rank in topo["inter_sip"][rank].items():
-                # peer_rank → peer SIP의 (0, 0)
-                entries.append(NeighborTableEntry(
-                    direction=d, peer_sip=peer_rank, peer_cube=0, peer_pe=0, ...))
-        result.append(PeInstallSpec(cube=cube, pe=pe, neighbors=tuple(entries)))
-    return tuple(result)
-```
-
-`build_install_plans`에서 algorithm_config의 `topology`에 따라 적절한 builder
-선택 (기존 simple builder vs hierarchical builder).
-
-### D5. Kernel 재해석 — SIP-local 좌표로
-
-`src/kernbench/ccl/algorithms/hierarchical_allreduce.py`를 ADR-0024 D3에
-맞춰 수정:
-
-```python
-def kernel_args(*, n_elem: int, world_size: int, pes_per_cube: int,
-                cubes_per_sip: int, num_sips: int, **kw) -> tuple:
-    """world_size (= num_sips), pes_per_cube, cubes_per_sip를 스칼라로."""
-    return (n_elem, pes_per_cube, cubes_per_sip, num_sips)
-
-def kernel(t_ptr, n_elem, pes_per_cube, cubes_per_sip, num_sips, tl):
-    """SIP-local 좌표 기반.
-
-    이전 (rank=PE 모델):
-        rank = cube_id * pes_per_cube + local_pe
-        pe_addr = t_ptr + rank * nbytes
-    현재 (rank=SIP 모델):
-        per-PE tensor slice는 backend가 TensorArg로 전달 → t_ptr은 이미 local.
-        intra-cube ring은 tl.program_id(0) 사용.
-        inter-cube ring은 pe_id == 0 조건으로 제한.
-        inter-SIP reduce는 cube_id == 0 and pe_id == 0 조건으로 제한.
-    """
-    local_pe = tl.program_id(axis=0)
-    cube_id = tl.program_id(axis=1)
-
-    # Level 1: intra-cube ring
-    for _ in range(intra_rounds(pes_per_cube)):
-        tl.send(dir="E", src=acc)
-        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
-        acc = acc + recv
-
-    # Level 2: inter-cube (cube leader only)
-    if local_pe == 0:
-        for _ in range(inter_cube_rounds(cubes_per_sip)):
-            tl.send(dir="N", src=acc)
-            recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
-            acc = acc + recv
-
-    # Level 3: inter-SIP (SIP leader only)
-    if local_pe == 0 and cube_id == 0:
-        for _ in range(inter_sip_rounds(num_sips)):
-            tl.send(dir="parent", src=acc)
-            recv = tl.recv(dir="parent", shape=(n_elem,), dtype="f16")
-            acc = acc + recv
-
-    # Broadcast (reverse chain)
-    # ...
-    tl.store(t_ptr, acc)
-```
-
-`kernel_args`는 ADR-0024 D4의 keyword-only signature 계약을 따른다.
-
-### D6. Validator — `multi_pe_sip_local`
-
-ADR-0024 D8의 built-in 그대로 활용. `ccl.yaml`에서 `validator:
-multi_pe_sip_local` 지정 시 backend가 각 SIP에 `cubes × pes_per_cube`개
-shard가 있는지 검증.
-
-### D7. Bench — 기본 all-reduce bench 확장
-
-`benches/ccl_allreduce.py`의 worker는 `ccl.yaml`이 `hierarchical_allreduce`를
-선택하면 자동으로:
-
-```python
-# Worker 예
-dp = DPPolicy(cube="column_wise", pe="column_wise")
-tensor = torch.zeros((1, intra_sip_pes * n_elem), dp=dp, name="in")
-# tensor는 각 SIP의 모든 PE에 1 tile씩 분산 (multi_pe_sip_local validator 통과)
-dist.all_reduce(tensor, op="sum")
-```
-
-Worker 코드 자체는 알고리즘 종류를 모름 (`ccl.yaml` 선택에 의존). 단,
-**DPPolicy가 hierarchical 요구와 일치해야** 함 — `cube/pe="column_wise"`
-같은 SIP-내 분산을 하는 DPPolicy여야 `multi_pe_sip_local` 검증 통과. 이
-DPPolicy 선택은 bench 설정 또는 sample bench에서 결정.
-
---
-
-## Dependencies
-
- **ADR-0024**: Launcher, `all_pes` mapper, `multi_pe_sip_local` validator,
-  registry + import path. 본 ADR 구현의 전제.
- **ADR-0025**: IPCQ direction addressing — cube/pe/SIP 간 다중 direction을
-  동시 사용하므로 정확한 direction 매칭 필수.
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **기존 `hierarchical_allreduce.py`**: 본 ADR은 그 커널의 재해석 + 주변
-  framework integration.
-
---
-
-## Non-goals
-
- **ADR-0024 framework 변경**: 재활용만.
- **Alternative reduce topology (tree-in-tree 등)**: 3-level ring이 첫 구현.
- **Dynamic level count**: 현재 SIP/cube/PE 3단 고정. 2단 (SIP + PE, cube
-  skip) 또는 4단 이상은 future.
- **Bandwidth-optimal schedule tuning**: reduce round 수 / chunk size 조정
-  같은 tuning은 별도.
- **Pipelined hierarchical**: 여러 chunk를 파이프라인으로 겹쳐서 돌리는
-  NCCL-style 최적화는 future.
-
---
-
-## Open questions
-
-### 🟠 중간 영향 — 구현 시 결정 필요
-
- **`topologies.py` 스키마 확장**: 기존 `ring_1d` 등은 단일 레벨 `(rank →
-  {dir: peer})`. `hierarchical_3level`은 multi-level. `_resolve_topology`가
-  둘을 모두 반환할 수 있도록 schema를 일반화할지, 아니면 hierarchical 전용
-  return type을 두고 builder가 분기할지.
-  - Option A: 모든 topology를 neighbor-list 형태로 단일화
-    (`[{direction, peer_sip, peer_cube, peer_pe}, ...]`)
-  - Option B: topology 모듈이 `kind` 필드 제공, builder가 분기
-  - 권장: Option A (single source of truth, ADR-0024 Open Q의
-    "PE-level topology 일원화" 방향과 일치)
-
- **`hierarchical_3level` vs algorithm별 topology 모듈**: 향후 mesh-based
-  hierarchical 등 variant이 생기면? `hierarchical_3level` 같은 이름이 이미
-  topology-specific. 변형은 새 key 추가 (`hierarchical_mesh_3level` 등) 또는
-  알고리즘 모듈에서 topology 생성 override.
-
-### 🟡 Nice-to-have
-
- **Optimizing the number of reduce rounds**: a bidirectional ring takes `ceil((N-1)/2)` rounds.
-  Idle PEs can occur with non-power-of-2 group sizes.
- **Handling non-uniform topologies**: inter-cube ring balance when the cube_mesh has
-  w != h.
- **Single-SIP case**: skip Level 3 when world_size = 1 (a single SIP). Verify this degenerate
-  case.
-
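The round count quoted above is easy to tabulate. The helper below is a hypothetical illustration of the `ceil((N-1)/2)` formula only; it is not the `intra_rounds` / `inter_cube_rounds` helper referenced by the kernel, whose definitions are not shown in this ADR.

```python
import math


def bidirectional_ring_rounds(n: int) -> int:
    """Rounds for a bidirectional ring reduce over n members: ceil((n - 1) / 2)."""
    if n < 1:
        raise ValueError("ring needs at least one member")
    return math.ceil((n - 1) / 2)


# Round counts for a few group sizes, including the degenerate single-member case.
rounds = {n: bidirectional_ring_rounds(n) for n in (1, 2, 4, 5, 8, 16)}
```

A group of one needs zero rounds, which matches the single-SIP case where Level 3 is skipped.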
-### 🟢 Framework evolution 시사점 (ADR-0024로부터 이관)
-
- **PE-level topology 일원화 (중장기)**: 현 설계는
-  - topology (rank graph 또는 level-separated)
-  - mapper (per-SIP PE set)
-  - `_build_pe_installs` (actual edges)
-
-  의 3단 분산. Hierarchical이 이 분산을 가장 스트레스 받는 케이스. 중장기로는
-  `topologies.py`가 PE-level neighbor list를 직접 반환하고 mapper는 단순히
-  "어느 PE가 참여하느냐"만 결정, `_build_pe_installs`는 flat
-  mapping으로 단순화되는 방향이 자연스러움. **본 ADR에서 Option A를 채택**하면
-  이 방향으로 이미 정합.
-
---
-
-## Test strategy
-
-### T1. Topology generator
-
-`tests/test_hierarchical_topology.py` (new):
- `hierarchical_3level(rank, world_size, spec)` → 각 level의 neighbor set이
-  예상 구조인지 (intra-cube는 ring, inter-cube는 cube-leader만 참여, inter-SIP은
-  SIP-leader만 참여)
- 2 SIP × 4 cubes × 4 PEs 같은 작은 토폴로지로 수작업 검증 가능
- Symmetry: rank r의 E neighbor가 peer에서 W로 역포인팅
-
-### T2. Install plan — hierarchical × all_pes
-
-`tests/test_ccl_install_plan.py` (확장):
- `build_install_plans(algorithm="hierarchical_allreduce", mapper="all_pes",
-  validator="multi_pe_sip_local")` 호출 시
-  - 각 SIP의 모든 PE가 `participating_pes`에 포함
-  - PE 0 (cube leader)만 inter-cube neighbor를 가짐
-  - (cube 0, pe 0) (SIP leader)만 inter-SIP neighbor를 가짐
-  - Non-leader PE는 intra-cube neighbor만
-
-### T3. Kernel unit — mock runtime
-
-`tests/test_hierarchical_mock_runtime.py` (new):
- `run_kernel_in_mock` (kernbench.ccl.testing)을 확장해 multi-level 지원
- 2 SIP × 2 cubes × 4 PEs (총 16 PE) 토폴로지에서 초기 tile을 rank+1로 채우고
-  hierarchical all-reduce 실행
- 모든 PE의 최종 결과가 `sum(1..16)`인지
-
-### T4. E2E — 실제 SimPy backend
-
-`tests/test_ccl_allreduce_matrix.py` (확장):
- `hierarchical @ ws=SIP_count`: multi_pe_sip_local layout + 3-level 알고리즘
-  전체 stack 통과 검증
-
-### T5. Validator enforcement
-
- `multi_pe_sip_local` validator가 wrong layout (예: leader_only 스타일 1
-  shard per rank) 입력에 raise
-
-### T6. 회귀
-
-기존 ring/mesh/tree 알고리즘 모두 그대로 통과. 본 ADR은 그들을 건드리지 않음.
-
---
-
-## Consequences
-
-### Positive
-
- **Intra-SIP PE 활용도 증가**: Inter-SIP 통신 중에도 intra-cube / inter-cube
-  reduce가 진행되어 전체 PE 가동률 향상.
- **Multi-level bandwidth 활용**: cube NoC, UCIe 모두 작동 → 더 정확한 HW 모델.
- **ADR-0024 framework 검증**: `all_pes` mapper + `multi_pe_sip_local`
-  validator의 첫 non-trivial use case. Framework 설계 타당성 확인.
- **기존 커널 재활용**: `hierarchical_allreduce.py` 큰 구조 유지, SIP-local
-  좌표만 재해석.
-
-### Negative
-
- **`topologies.py` schema 확장 필요**: Single-level vs multi-level 표현.
-  해결안(Option A)은 기존 ring/mesh/tree의 마이그레이션 비용 유발.
- **Validator / mapper 조합 요구**: 사용자가 DPPolicy를
-  `multi_pe_sip_local`에 맞춰 선택해야 함 (bench 설정 복잡도 증가).
-
-### Neutral
-
- 본 ADR 구현 전까지 `hierarchical_allreduce.py`는 deprecated 상태 유지 또는
-  ADR-0024 matrix test에서 제외. 현재 파일을 곧바로 삭제하지는 않음.
-
---
-
-## Affected files
-
-| File | Change |
-|------|--------|
-| `src/kernbench/ccl/topologies.py` | D3: `hierarchical_3level` topology 함수 추가. (Option A 채택 시) 기존 topology 출력 format 통일 |
-| `src/kernbench/ccl/install_plan.py` | D4: hierarchical builder 분기 (또는 단일 builder가 level 개수로 dispatch) |
-| `src/kernbench/ccl/algorithms/hierarchical_allreduce.py` | D5: SIP-local 좌표로 kernel 재작성, `kernel_args` keyword-only signature |
-| `ccl.yaml` | D2: `hierarchical_allreduce` 엔트리 추가 (`mapper: all_pes`, `validator: multi_pe_sip_local`, `topology: hierarchical_3level`) |
-| `tests/test_hierarchical_topology.py` (new) | T1 |
-| `tests/test_ccl_install_plan.py` | T2 확장 |
-| `tests/test_hierarchical_mock_runtime.py` (new) | T3 |
-| `tests/test_ccl_allreduce_matrix.py` | T4: hierarchical row 추가 |
@@ -2,7 +2,7 @@

 ## Status

-Proposed (Blocked on ADR-0031 — PhysAddr PE-resource extension)
+Proposed

 ## Context

@@ -1,261 +0,0 @@
-# ADR-0031: PhysAddr PE-Resource Extension
-
-## Status
-
-Superseded by ADR-0001 (Revision 2, 2026-04-27).
-PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
-ADR-0001 D2.3.3-D2.3.5.
-
-Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)
-
-## Context
-
-### Goal
-
-Extend the `PhysAddr` schema of ADR-0001 so that it can systematically represent the
-**various resources inside a PE**. This provides the basis for ADR-0030 (IPCQ PhysAddr integration) and for future
-PE-local resource additions (scratchpad, register file, status register, etc.).
-
-### Current state (ADR-0001)
-
-51-bit PhysAddr layout:
-
-```
-[50:47] rack_id  (4)
-[46:43] sip_id   (4)
-[42:38] sip_seg  (5)   # cube_id
-[37:0]  local_offset (38)
-```
-
-Inside `local_offset` (38 bits):
-
- `[37]` selector: 1 = HBM window (128GB), 0 = PE resource window
- The PE resource window consists of `unit_type` (3 bits: PE | MCPU | SRAM) +
-
-Factory API:
- `PhysAddr.hbm_addr(...)` — HBM generic
- `PhysAddr.pe_hbm_addr(...)` — PE-local HBM slice
- `PhysAddr.pe_tcm_addr(...)` — PE TCM (via `UnitType.PE` + `sub_offset`)
- `PhysAddr.cube_sram_addr(...)` — Cube-shared SRAM
-
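The 51-bit layout above can be exercised with plain shifts and masks. The sketch below is not the `PhysAddr` class or its factory API: `pack_phys_addr` and `unpack_phys_addr` are hypothetical helpers that assume only the field positions listed in this section (`rack_id` at [50:47], `sip_id` at [46:43], `sip_seg` at [42:38], `local_offset` at [37:0], with bit 37 as the HBM selector).

```python
# Field widths taken from the layout listed above.
RACK_BITS, SIP_BITS, SEG_BITS, LOCAL_BITS = 4, 4, 5, 38


def pack_phys_addr(rack_id: int, sip_id: int, sip_seg: int, local_offset: int) -> int:
    """Pack the four fields into one 51-bit integer, checking each field's range."""
    fields = (
        ("rack_id", rack_id, RACK_BITS),
        ("sip_id", sip_id, SIP_BITS),
        ("sip_seg", sip_seg, SEG_BITS),
        ("local_offset", local_offset, LOCAL_BITS),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (rack_id << 47) | (sip_id << 43) | (sip_seg << 38) | local_offset


def unpack_phys_addr(addr: int) -> dict:
    """Split a packed address back into its fields plus the HBM selector bit."""
    return {
        "rack_id": (addr >> 47) & 0xF,
        "sip_id": (addr >> 43) & 0xF,
        "sip_seg": (addr >> 38) & 0x1F,
        "local_offset": addr & ((1 << LOCAL_BITS) - 1),
        "is_hbm": bool((addr >> 37) & 1),  # bit 37: 1 = HBM window
    }


# An address in the HBM window of rack 1, SIP 2, cube segment 3.
hbm_addr = pack_phys_addr(1, 2, 3, (1 << 37) | 0x100)
decoded = unpack_phys_addr(hbm_addr)
```

The range check in `pack_phys_addr` is a design choice for the example: it fails early on an out-of-range field instead of letting it bleed into the neighboring bits.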
-### 풀어야 할 문제
-
-1. **PE 내부 resource 구분의 명시적 체계 부재**: 현재 `local_offset` (38 bits)
-   이 평면 공간으로 취급되고, PE TCM / IPCQ ring / scratchpad / 향후 register
-   file 등이 관습적 offset 범위로만 구분됨. Schema 레벨에서 명확하지 않음.
-2. **IPCQ 주소의 PhysAddr 표현 부재**: ADR-0030이 IPCQ ring buffer를 PhysAddr로
-   표현하려면 "이 주소가 IPCQ 영역"을 decode 가능해야 함. 현재는 불가.
-3. **향후 PE resource 확장 경로**: register file, performance counter 등
-   추가 시 일관된 위치 할당 규칙 필요.
-
-### 설계 방향 — local_offset을 PE 컴포넌트별 range로 분할
-
-`local_offset` (38 bits = 256GB per PE segment)을 **PE 컴포넌트마다 고정
-range**로 나누어 할당한다. 각 range는 해당 컴포넌트 전용 주소 공간이며,
-`PhysAddr.decode()`가 주소가 어느 range에 속하는지 판별해 해당하는 `kind` /
-`unit_type` / `sub_type` 필드를 채운다.
-
-개념적 구조 (구체적 bit 할당은 **TBD**):
-
-```
-local_offset [37:0]  (38 bits total)
-├── HBM window           [37] = 1    (기존 128GB)
-├── PE component ranges  [37] = 0
-│   ├── TCM              [range_1]
-│   ├── IPCQ rings       [range_2]
-│   ├── Scratchpad       [range_3]
-│   ├── Register file    [range_4]
-│   ├── (reserved)       ...
-│   └── Sideband / status [range_N]
-```
-
-### 왜 range-based partition인가
-
- **Schema-level 명시성**: 주소 하나 보고 어느 컴포넌트의 자원인지 decode 가능.
-  "Routing consumes decoded domains" (ADR-0001 D5) 계약 충족.
- **Unit type enum 확장보다 유연**: 3-bit `UnitType` 공간을 고갈시키지 않고
-  세분화 가능. 미래 추가 컴포넌트도 빈 range 할당.
- **Allocator 통합 자연**: 각 PE-level allocator가 관리하는 하위 pool을
-  address range와 1:1 매칭 (e.g., `reserve_ipcq_tcm()` → IPCQ range 안에서만
-  할당).
- **Decode routing 단순**: `PhysAddr.decode(addr)`가 range table을 참조해
-  `kind` + sub-field를 채움. 기존 HBM selector bit 패턴의 일반화.
-
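-### Why address this now
-
- ADR-0030 (IPCQ PhysAddr integration) **depends** on this extension. If ADR-0030 proceeds
-  alone, it ends up reusing the `sub_offset` space opaquely, which violates the ADR-0001 contract.
- More PE-internal resources are likely to be added; settling the structure now secures a consistent extension path.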
-
---
-
-## Decision (pending specific range allocation)
-
-### D1. Range-based local_offset partition — approach
-
-`local_offset` is partitioned into fixed byte ranges, and each range is assigned to a PE component.
-The range an address falls in determines its `kind` / component type.
-
-```python
-# src/kernbench/policy/address/phyaddr.py (conceptual, post-extension)
-from dataclasses import dataclass
-
-
-@dataclass(frozen=True)
-class PeResourceRange:
-    name: str                # e.g. "tcm", "ipcq", "scratchpad", "regfile"
-    start_offset: int        # start within local_offset (inclusive)
-    end_offset: int          # exclusive
-    byte_size: int           # end_offset - start_offset
-
-PE_RESOURCE_MAP: tuple[PeResourceRange, ...] = (
-    # TBD: the concrete range allocation will be supplied separately by the user
-)
-```
-
-The PE resource path of `PhysAddr.decode(addr)` is:
-
-```python
-def decode_pe_resource(local_offset: int) -> dict:
-    for r in PE_RESOURCE_MAP:
-        if r.start_offset <= local_offset < r.end_offset:
-            return {
-                "kind": "pe_resource",
-                "component": r.name,                 # NEW: "tcm"/"ipcq"/...
-                "component_offset": local_offset - r.start_offset,  # within range
-            }
-    raise PhysAddrError(f"local_offset {local_offset} not in any PE range")
-```
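To make the lookup concrete, here is a self-contained sketch of the same range-table decode together with its inverse. Because D2 is still TBD, the two ranges in `EXAMPLE_MAP` are placeholder values invented purely for illustration, and `PhysAddrError`, `check_map`, and `encode_pe_resource` are hypothetical stand-ins rather than the project's actual API.

```python
from dataclasses import dataclass


class PhysAddrError(ValueError):
    """Stand-in for the project's error type (assumed to exist)."""


@dataclass(frozen=True)
class PeResourceRange:
    name: str
    start_offset: int  # inclusive
    end_offset: int    # exclusive


# HYPOTHETICAL placeholder ranges; the real allocation is TBD in D2.
EXAMPLE_MAP = (
    PeResourceRange("tcm", 0x0000_0000, 0x0010_0000),
    PeResourceRange("ipcq", 0x0010_0000, 0x0011_0000),
)


def check_map(ranges) -> None:
    """Reject empty or overlapping ranges so decode is unambiguous."""
    ordered = sorted(ranges, key=lambda r: r.start_offset)
    for r in ordered:
        if r.start_offset >= r.end_offset:
            raise PhysAddrError(f"range {r.name!r} is empty or inverted")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start_offset < prev.end_offset:
            raise PhysAddrError(f"{prev.name!r} overlaps {cur.name!r}")


def decode_pe_resource(local_offset: int, ranges=EXAMPLE_MAP) -> dict:
    """Map a local_offset to its component and offset within that range."""
    for r in ranges:
        if r.start_offset <= local_offset < r.end_offset:
            return {
                "kind": "pe_resource",
                "component": r.name,
                "component_offset": local_offset - r.start_offset,
            }
    raise PhysAddrError(f"local_offset {local_offset:#x} not in any PE range")


def encode_pe_resource(component: str, component_offset: int,
                       ranges=EXAMPLE_MAP) -> int:
    """Inverse of decode: build a local_offset from component + offset."""
    for r in ranges:
        if r.name == component:
            if not 0 <= component_offset < r.end_offset - r.start_offset:
                raise PhysAddrError(f"offset out of range for {component!r}")
            return r.start_offset + component_offset
    raise PhysAddrError(f"unknown component {component!r}")
```

Validating the table once at load time (as `check_map` does) keeps the linear scan in `decode_pe_resource` unambiguous: at most one range can match any offset.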
-
-### D2. Specific range allocations — **TBD**
-
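-> The user will define the concrete byte allocation separately and then update this ADR.
->
-> Required information:
-> - Name / byte size of each component (TCM, IPCQ, scratchpad, regfile, ...)
-> - Start offset within `local_offset` (taking alignment into account)
-> - Alignment with the current hardware spec / simulation requirements
-
-Once this section is filled in, the ADR status is promoted: **Stub → Proposed → Accepted**.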
-
-### D3. Factory API: per-component functions
-
-Generalize the existing `PhysAddr.pe_tcm_addr(...)` pattern:
-
-```python
-# Existing (already present)
-PhysAddr.pe_tcm_addr(rack_id, sip_id, cube_id, pe_id, tcm_offset)
-
-# New (to be added after ADR-0031)
-PhysAddr.pe_ipcq_addr(rack_id, sip_id, cube_id, pe_id, ipcq_offset)
-PhysAddr.pe_scratchpad_addr(...)
-PhysAddr.pe_regfile_addr(...)
-# ...
-```
-
-Each factory takes only a `component_offset` within its component's range and
-produces the final PhysAddr encoding. Callers do not need to know which range is used.
-
-### D4. Backward compatibility
-
- The existing `pe_tcm_addr()` signature / semantics are kept.
- Only the internal encoding changes to consult the new range table.
- The existing `UnitType.PE` decoding path is mapped to the "tcm" range in
-  `PE_RESOURCE_MAP`, so existing code is unaffected.
- Existing code that checks `PhysAddr.decode(addr).unit_type == UnitType.PE`
-  remains valid (TCM addresses keep the PE unit_type).
-
---
-
-## Open questions
-
-### 🔴 Pending user input (blocker for ADR promotion)
-
- **Specific range allocation for D2**: the user must provide a concrete byte allocation
-  table before Stub → Proposed promotion is possible. Required information:
-  - List of components (TCM, IPCQ, scratchpad, regfile, etc.)
-  - Byte size / start offset of each component
-  - Alignment requirements (4KB / page-aligned, etc.)
-
-### 🟡 Design details to be decided together with the range allocation
-
- **Division of the total local_offset space**: whether to keep the HBM window
-  (bit 37 = 1, 128GB) or shrink it to enlarge the PE resource space.
- **Range padding / reserved space**: how many "reserved" ranges to set aside
-  in advance for future components.
- **Address alignment**: whether each range's start offset must satisfy a specific
-  alignment (page / cache line).
- **Diagnostic / debug format**: have `PhysAddr.decode()` output show the component
-  name + component_offset in a human-readable form (e.g., "IPCQ ring sip=0 cube=0 pe=3
-  offset=0x1234").
- **Role of the existing `UnitType` enum**: whether to keep the `unit_type` field after
-  moving to the range-based approach (adding `component` to the decode result) or to replace the enum.
-
-### 🟢 Questions tied to ADR-0030
-
- **Expressing direction/slot inside the IPCQ range**: PhysAddr expresses addresses
-  only down to `component_offset`. "direction=E, slot=2" is derived by an offset
-  calculation inside the IPCQ range (`direction_idx * slot_region_size + slot_idx * slot_size`);
-  the formula is to be specified within the scope of ADR-0030.
- **Allocator pool structure**: whether `PEMemAllocator` manages the ranges (TCM, IPCQ,
-  scratchpad) as separate pools, or manages a single pool with per-kind reservations
-  only. With a range-based schema, separate pools are the natural fit.
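The offset formula quoted above, `direction_idx * slot_region_size + slot_idx * slot_size`, can be illustrated as follows. This is only a sketch: the direction ordering, the slot count, the slot size, and the assumption that `slot_region_size` equals `slots_per_direction * slot_size` are all hypothetical, since the ADR defers these details to ADR-0030.

```python
# HYPOTHETICAL direction order; the real index assignment is not specified.
DIRECTIONS = ("N", "E", "S", "W")


def ipcq_component_offset(direction: str, slot_idx: int,
                          slots_per_direction: int, slot_size: int) -> int:
    """Compute an offset inside the IPCQ range for (direction, slot)."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction {direction!r}")
    if not 0 <= slot_idx < slots_per_direction:
        raise ValueError(f"slot_idx {slot_idx} out of range")
    # Assumption: each direction owns one contiguous region of slots.
    slot_region_size = slots_per_direction * slot_size
    direction_idx = DIRECTIONS.index(direction)
    return direction_idx * slot_region_size + slot_idx * slot_size


# "direction=E, slot=2" with 4 slots of 64 bytes per direction (made-up sizes):
# 1 * 256 + 2 * 64 = 384
example_offset = ipcq_component_offset("E", 2, slots_per_direction=4, slot_size=64)
```

The result is a `component_offset`, so it would be passed to a per-component factory such as the proposed `pe_ipcq_addr(...)` rather than used as a full address.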
-
---
-
-## Non-goals (this ADR)
-
- **Rewriting the full 51-bit layout**: this ADR covers only the subdivision inside
-  `local_offset` (38 bits). The upper-bit structure (rack / SIP / cube segments)
-  is unchanged.
- **Redesigning the `UnitType` enum**: the range-based approach could replace it, but the existing enum
-  (PE / MCPU / SRAM) is kept for backward compatibility.
- **Dynamic range allocation**: changing range sizes at runtime is not needed. All
-  ranges are fixed at compile / configuration time.
- **Multi-process / multi-rack partitioning**: only PE-internal resources are covered.
-
---
-
-## Action
-
-### Phase 1: User input (specific range allocation, **Blocker**)
- Enter the user-defined per-component PE byte ranges into D2:
-  - Contents of the `PE_RESOURCE_MAP` table (name, start_offset, byte_size per component)
-  - A note giving the hardware spec basis for each component
-
-### Phase 2: Promote ADR from Stub to Proposed
- Change the status once D2 is filled in.
- Remove the "🔴 Pending user input" block from Open questions.
- Draft an amendment note for ADR-0001.
-
-### Phase 3: Implementation
- Implement range-based decode in `PhysAddr`.
- Add the new per-component factory functions (`pe_ipcq_addr`, `pe_scratchpad_addr`,
-  etc.).
- Change only the internal encoding of the existing `pe_tcm_addr` to consult the new range table
-  (signature unchanged).
- Check existing code paths for regressions.
-
-### Phase 4 — ADR-0030 unblock
- Lift the "Blocked" status of ADR-0030.
- Update the install_plan builder to call the extended factories such as `pe_ipcq_addr(...)`.
-
---
-
-## Dependencies
-
- **ADR-0001** (PhysAddr layout): this ADR extends ADR-0001.
- **ADR-0023** (IPCQ protocol): this ADR is the basis that lets the IPCQ ring buffer's
-  addressing scheme be unified into PhysAddr.
- **ADR-0030** (IPCQ PhysAddr integration): blocked on this ADR.
-
---
-
-## Affected files (future, after promotion to Proposed)
-
-| File | Change |
-|------|--------|
-| `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, new component-specific factories (`pe_ipcq_addr`, etc.), updated internal encoding of the existing `pe_tcm_addr` |
-| `src/kernbench/policy/address/allocator.py` | Range-aware pool separation (per-PE TCM pool / IPCQ pool / scratchpad pool, etc.) |
-| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition |
-| `tests/test_phyaddr.py` | Range table validation, encode/decode round-trip for each factory, regression test for the existing `pe_tcm_addr` |