ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 short (3-4 letter) category prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2: docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft), docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl. TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead, serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping, per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources, target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based selection, flit-aware per-flit commit, async finalize, command-only fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md, ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,441 @@
+# ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction
+
+## Status
+
+Merged into ADR-0011 (Address Model: LA section).
+
+## Context
+
+Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE.
+Currently, a VA-based access path is used; however, it cannot consistently represent
+the two channel mapping models described below (1:1 and n:1).
+
+### Background: Local-HBM Pseudo Channel Structure
+
+The HBM in a CUBE consists of 32 or 64 pseudo channels.
+In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.
+
+Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM
+
+Both the number of pseudo channels and the number of PEs are topology parameters.
+`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe) determines
+the number of local channels per PE.
+
+The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel
+(e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the
+maximum memory BW.
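As a rough illustration of the arithmetic above, the sketch below derives the per-PE channel count and the peak local bandwidth from the topology parameters. The function name `local_hbm_budget` is hypothetical, and the even-division check is an assumption; only the parameter names come from this ADR.

```python
def local_hbm_budget(hbm_pseudo_channels: int, pes_per_cube: int,
                     channel_bw_gbs: float) -> tuple[int, float]:
    """Return (channels_per_pe, peak local BW in GB/s) for one PE."""
    if hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: channels divide evenly so every PE owns the same count.
        raise ValueError("hbm_pseudo_channels must be a multiple of pes_per_cube")
    channels_per_pe = hbm_pseudo_channels // pes_per_cube
    # Peak BW is reached only when all local channels are busy at once.
    return channels_per_pe, channels_per_pe * channel_bw_gbs


print(local_hbm_budget(64, 8, 32.0))  # (8, 256.0)
```

With 32 pseudo channels and 8 PEs the same call yields 4 channels and 128 GB/s per PE.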
+
+### Limitations of the Current VA Model
+
+When a PE's local HBM is split into N channels (8 in the example above), requests must also be
+generated per channel before they are sent to DMA. In the current architecture, however, the kernel
+generates a single VA-based request (`tl.load`) and passes it directly to DMA, which makes it
+difficult for PE_CPU to generate per-channel DMA requests.
+
+Therefore, instead of VA, we propose using **Logical Address (LA)**:
+the **BAAW (Logical-to-Physical Mapping Unit)** inside PE_DMA
+converts each LA to a PA, or to a list of PAs, using a segment-based mapping.
+
+### Two Channel Mapping Modes
+
+- **1:1 mode**: Creates and executes per-channel requests. Precise per-channel modeling.
+- **n:1 mode (default)**: Assumes interleaving across local HBM channels. Aggregated BW modeling.
+
+By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.
+
+### Core Requirements
+
+- The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
+- The difference must only be in the request representation and resource modeling approach
+- The kernel programming model must not be changed
+- Physical channel information must not be exposed to the kernel
+
+### Existing Physical Address
+
+The current system's 51-bit Physical Address is defined in `policy/address/phyaddr.py`:
+
+```
+[50:47] rack_id (4 bit)
+[46:43] sip_id  (4 bit)
+[42:38] cube_id (5 bit, sip_seg)
+[37]    hbm_selector (1=HBM window)
+[36:0]  hbm_offset   (37 bit, 128GB per cube)
+```
+
+PA is used to represent the final routable canonical physical destination,
+and this role is preserved.
+However, the timing and policy of logical access -> physical request conversion are not clearly separated.
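To make the 51-bit layout concrete, here is a minimal pack/unpack sketch built only from the bit ranges listed above. `PA_FIELDS`, `pack_pa`, and `unpack_pa` are hypothetical helpers for illustration; they are not the API of `policy/address/phyaddr.py`.

```python
# Field layout taken from the bit ranges above: (name, low bit, width).
PA_FIELDS = [
    ("rack_id", 47, 4),
    ("sip_id", 43, 4),
    ("cube_id", 38, 5),
    ("hbm_selector", 37, 1),
    ("hbm_offset", 0, 37),
]


def pack_pa(**fields: int) -> int:
    """Pack named fields into a 51-bit PA; omitted fields default to 0."""
    pa = 0
    for name, low, width in PA_FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        pa |= value << low
    return pa


def unpack_pa(pa: int) -> dict[str, int]:
    """Split a PA back into its named fields."""
    return {name: (pa >> low) & ((1 << width) - 1)
            for name, low, width in PA_FIELDS}


pa = pack_pa(sip_id=1, cube_id=3, hbm_selector=1, hbm_offset=0x1000)
assert pa < (1 << 51)
assert unpack_pa(pa)["cube_id"] == 3
```

The field widths sum to 4 + 4 + 5 + 1 + 37 = 51 bits, and the 37-bit offset covers the 128 GB HBM window of one cube.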
+
+---
+
+## Decision
+
+### D1. Introduction of LA (Logical Address) — Replacing VA
+
+The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).
+
+#### Characteristics of LA
+
+- Like VA, tensors can be mapped to a contiguous memory space
+- Represents logical buffer + offset
+- Does not directly contain physical channel information
+- An intermediate abstraction maintained until physical resolution
+- The sole address scheme used by kernel code (`tl.load`, `tl.store`, `tl.composite`)
+
+#### LA Space Definition
+
+| Item | Value |
+|------|-------|
+| LA start address | `0x1_0000_0000` (4 GB, preserving the existing VA start point) |
+| LA space size | 64 GB per PE |
+| Alignment unit | Segment-based (see D3 below) |
+
+LA is a PE-local address space.
+Even if different PEs use the same LA value, they resolve to different PAs
+because each PE has a different BAAW segment table.
+
+#### VA Infrastructure Removal Scope
+
+With the introduction of LA, the following existing code will be replaced/removed:
+
+| Removal Target | Replacement |
+|----------------|-------------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
+| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` field | `la_base` field |
+| `topology.yaml`: pe_mmu component entry | Removed |
+
+---
+
+### D2. Mapping Mode Configuration
+
+The mapping mode is configured at the cube level in topology.yaml:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # total pseudo channel count
+    hbm_channels_per_pe: 8        # local channel count per PE
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
+
+---
+
+### D3. Segments and BAAW
+
+#### Segment Definition
+
+A segment is a logical allocation unit that partitions the LA space so that each segment
+maps to a specific HBM channel or channel group.
+
+Segments are created by the runtime allocator during tensor deployment,
+and BAAW uses them to convert LA into physical requests.
+
+#### BAAW Segment Table Entry
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class BaawSegment:
+    la_base: int          # segment start LA
+    la_size: int          # segment size (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields (left at their defaults in n:1 mode)
+    channel_count: int = 0    # number of channels assigned to this segment (e.g., 8)
+    pa_bases: list[int] = field(default_factory=list)     # per-channel PA start addresses (len = channel_count)
+    channel_ids: list[int] = field(default_factory=list)  # per-channel logical IDs (e.g., [0,1,2,...,7])
+    channel_size: int = 0     # per-channel size (la_size // channel_count)
+    # n:1 mode fields (left at their defaults in 1:1 mode)
+    agg_pa_base: int = 0      # aggregated PA start address
+    agg_node_id: str = ""     # aggregated router node_id (for routing)
+
+#### Segment Lifecycle
+
+1. **Allocation time** (tensor deploy):
+   - RuntimeContext allocates LA space from the LA allocator
+   - PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
+   - Sends `BaawSegmentInstallMsg` to PE_DMA to register in the segment table
+
+2. **Usage time** (kernel execution):
+   - Kernel issues `tl.load(la_ptr)` -> DmaReadCmd(src_addr=LA)
+   - PE_DMA looks up the segment corresponding to the LA in BAAW
+   - Converts to PA(s) according to the mode
+
+3. **Deallocation time** (tensor free):
+   - Removed from the segment table
+   - LA space returned, PA deallocated
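The install / lookup / remove steps of the lifecycle can be sketched with a sorted segment table. This is an illustration under assumptions, not the BAAW implementation: `Seg` and `SegmentTable` are hypothetical names, `Seg` keeps only the fields needed for lookup, and the overlap check is an added safeguard that this ADR does not specify.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Seg:  # hypothetical: reduced to the lookup-relevant fields
    la_base: int
    la_size: int


class SegmentTable:
    def __init__(self) -> None:
        self._bases: list[int] = []  # sorted la_base values
        self._segs: list[Seg] = []   # segments, same order as _bases

    def install(self, seg: Seg) -> None:
        i = bisect.bisect_left(self._bases, seg.la_base)
        # Reject overlap with the neighbouring segment on either side.
        if i > 0 and self._segs[i - 1].la_base + self._segs[i - 1].la_size > seg.la_base:
            raise ValueError("overlaps previous segment")
        if i < len(self._segs) and seg.la_base + seg.la_size > self._segs[i].la_base:
            raise ValueError("overlaps next segment")
        self._bases.insert(i, seg.la_base)
        self._segs.insert(i, seg)

    def find(self, la: int) -> Seg:
        # Last segment whose base is <= la, then a bounds check on its size.
        i = bisect.bisect_right(self._bases, la) - 1
        if i >= 0 and la < self._segs[i].la_base + self._segs[i].la_size:
            return self._segs[i]
        raise KeyError(f"no segment covers LA {la:#x}")

    def remove(self, la_base: int) -> None:
        i = bisect.bisect_left(self._bases, la_base)
        if i == len(self._bases) or self._bases[i] != la_base:
            raise KeyError(f"no segment at LA {la_base:#x}")
        del self._bases[i], self._segs[i]
```

Keeping the bases sorted makes each lookup a binary search, which fits the claim that segment-based mapping is cheaper than a page-table walk.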
+
+---
+
+### D4. BAAW (Logical-to-Physical Mapping Unit)
+
+#### Location
+
+BAAW is placed as a front-end stage inside PE_DMA.
+It is not a separate SimPy component; it is synchronous address resolution logic
+executed at the beginning of PE_DMA's `handle_command()`.
+
+#### Input
+
+- LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
+- access size (bytes)
+
+#### Output
+
+- 1:1 mode: `list[PhysicalRequest]` with N entries, each (PA, nbytes, channel_node_id)
+- n:1 mode: `list[PhysicalRequest]` with a single entry, (agg_PA, nbytes, agg_node_id)
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # transfer size for this request
+    dst_node: str     # target node_id (channel router or aggregated router)
+
+#### BAAW Resolve Logic
+
+```python
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        # Pass-through: one aggregated request covering the whole access.
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    if seg.mode == "one_to_one":
+        # Fan-out: one request per channel, each carrying an equal share.
+        requests = []
+        ch_nbytes = nbytes // seg.channel_count
+        ch_offset = offset % seg.channel_size  # interleaved or striped
+        for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
+            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+            requests.append(PhysicalRequest(pa=pa_base + ch_offset, nbytes=ch_nbytes, dst_node=dst_node))
+        return requests
+
+    raise ValueError(f"unknown BAAW mapping mode: {seg.mode!r}")
+```
+
+#### Scope of Responsibility
+
+BAAW is responsible for:
+- Converting logical accesses into physical request units
+- Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
+- Generating Physical Addresses and determining target nodes
+
+BAAW is NOT responsible for:
+- Performing actual data movement
+- Executing NOC routing
+- Simulating bandwidth consumption (this is the role of downstream components)
+
+#### Output Contract
+
+The output of BAAW must be request units that can be directly used by the simulator's
+routing and resource model without any additional address decoding.
+
+---
+
+### D5. PE_DMA handle_command() Changes
+
+#### Current Flow (VA-based)
+
+```
+DmaReadCmd.src_addr (VA)
+  -> MMU.translate(VA) -> PA
+  -> PhysAddr.decode(PA) -> PhysAddr object
+  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
+  -> router.find_path(pe_prefix, dst_node_id) -> path
+  -> 1 sub-Transaction created -> fabric inject
+```
+
+#### New Flow (LA-based)
+
+```
+DmaReadCmd.src_addr (LA)
+  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
+  -> For each PhysicalRequest:
+      -> router.find_path(pe_prefix, req.dst_node) -> path
+      -> compute_drain_ns(path, req.nbytes) -> drain
+      -> sub-Transaction created -> fabric inject
+  -> Wait for all sub-Transactions to complete
+  -> pe_txn.done.succeed()
+```
+
+Key changes:
+- MMU reference removed -> replaced with BAAW resolve
+- PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
+- 1 request -> N requests injected in parallel (1:1 mode)
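The fan-out-then-join shape of the new flow can be sketched as follows. This is only an illustration: it uses `asyncio` as a stand-in for the simulator's SimPy processes, and `inject` and `handle_read` are hypothetical substitutes for fabric injection and `handle_command()`. Path lookup is omitted, and the drain time is a plain size/bandwidth estimate.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


async def inject(req: PhysicalRequest, bw_gbs: float) -> float:
    """Pretend to inject one sub-transaction; return its drain time in ns."""
    await asyncio.sleep(0)  # yield control; a real model would advance sim time
    # 1 GB/s equals 1 byte/ns, so nbytes / bw_gbs is already in ns.
    return req.nbytes / bw_gbs


async def handle_read(requests: list[PhysicalRequest], bw_gbs: float) -> float:
    """Inject all sub-transactions in parallel and wait for every one."""
    drains = await asyncio.gather(*(inject(r, bw_gbs) for r in requests))
    # The parent transaction completes when the slowest sub-transaction does.
    return max(drains)


reqs = [PhysicalRequest(pa=i * 512, nbytes=512, dst_node=f"pe0.ch_r{i}")
        for i in range(8)]
print(asyncio.run(handle_read(reqs, 32.0)))  # 16.0
```

In n:1 mode the same loop runs over a single-element list, so the PE_DMA code path does not need a mode-specific branch.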
+
+---
+
+### D6. 1:1 Mode Details
+
+- One logical access -> N (= `channels_per_pe`) physical requests
+- N is a parameter determined by `hbm_pseudo_channels / pes_per_cube`
+- Each request:
+  - Fully resolved 51-bit PA
+  - Targets a specific channel router (`{pe_prefix}.ch_r{channel_id}`)
+- BW contention modeling via per-channel links
+- PE_DMA injects N sub-transactions simultaneously
+
+#### 1:1 Mode Example
+
+Configuration: `hbm_pseudo_channels=64`, `pes_per_cube=8`
+-> `channels_per_pe=8`, PE0 owns ch0-7
+
+```text
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,  # = la_size / channel_count
+}
+
+BAAW resolve result (N=8 requests):
+  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  -> ...
+  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: N sub-transactions injected in parallel
+  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
+  Total effective BW = N x channel_bw_gbs
+```
+
+Examples with different N values:
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` -> `channels_per_pe=4`, 4 requests
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` -> `channels_per_pe=16`, 16 requests
+
+---
+
+### D7. n:1 Mode Details
+
+- One logical access -> one aggregated request
+- Target: aggregated router -> hbm_ctrl (see ADR-0019, since merged into ADR-0017)
+- Aggregated link BW = `channels_per_pe` x `channel_bw_gbs` (e.g., 8 x 32 = 256 GB/s)
+- Modeled as a single queue / resource
+- No per-channel PA decomposition
+
+#### n:1 Mode Example
+
+```text
+Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve result:
+  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1 sub-transaction injected
+  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
+```
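The requirement that both modes share the same effective bandwidth semantics can be checked with simple arithmetic on the two 4 KB examples. The sketch below is a back-of-the-envelope model only: `transfer_ns` is a hypothetical helper, and latency, queuing, and contention are ignored.

```python
def transfer_ns(nbytes: int, bw_gbs: float) -> float:
    """Ideal transfer time in ns (1 GB/s equals 1 byte/ns)."""
    return nbytes / bw_gbs


N, CHANNEL_BW_GBS, SIZE = 8, 32.0, 4096

# 1:1 mode: N parallel requests of SIZE/N bytes, one per 32 GB/s channel link.
one_to_one_ns = max(transfer_ns(SIZE // N, CHANNEL_BW_GBS) for _ in range(N))

# n:1 mode: one request of SIZE bytes on the aggregated N x 32 GB/s link.
n_to_one_ns = transfer_ns(SIZE, N * CHANNEL_BW_GBS)

assert one_to_one_ns == n_to_one_ns == 16.0
```

Under these idealized assumptions both modes drain the 4 KB access in 16 ns; any difference observed in the simulator would then come from per-channel contention or request overhead rather than from raw bandwidth.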
+
+---
+
+### D8. Kernel Model Preservation
+
+- The kernel still issues only single memory ops (`tl.load`, `tl.store`, `tl.composite`)
+- LA is the address scheme passed to the kernel
+- Channel decomposition/aggregation is performed by BAAW inside PE_DMA
+- Physical channel information is not exposed to kernel code
+
+---
+
+## Consequences
+
+### Positive
+
+- 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
+- Kernel abstraction is preserved — no kernel code changes required
+- Topology-based policy control is possible (mode switching via yaml)
+- Improved simulation model consistency and debuggability
+- Segment-based mapping is simpler and has lower overhead compared to page tables
+
+### Negative
+
+- Full refactoring of VA/MMU-based code is required
+- Increased complexity in the request generation path (managing N requests in 1:1 mode)
+- Reduced per-channel visibility in n:1 mode
+- Existing VA-related tests must be rewritten
+
+---
+
+## Alternatives
+
+### A1. Keep VA + Fan-out at MMU
+
+- Extend MMU to return per-channel PAs
+- Problem: MMU's role expands beyond address translation to include request decomposition
+- Problem: Aggregation representation is difficult in n:1 mode
+
+### A2. Kernel Generates Channel-Aware Requests
+
+- Kernel directly calls per-channel load/store
+- Problem: Abstraction leakage, reduced portability
+- Problem: All benchmark code must be modified
+
+### A3. Always Use PA (Without LA)
+
+- Runtime directly passes per-channel PA to the kernel
+- Problem: Conflicts with the aggregation model
+- Problem: Conversion timing is unclear, channel information exposed to kernel
+
+---
+
+## Implementation Notes
+
+### Implementation Order
+
+1. Introduce LA type (`policy/address/la_allocator.py`)
+2. Implement BAAW segment table (`policy/address/baaw.py`)
+3. Add `BaawSegmentInstallMsg` message type (`runtime_api/kernel.py`)
+4. Integrate BAAW into PE_DMA (`components/builtin/pe_dma.py` handle_command changes)
+5. Modify RuntimeContext: LA alloc + segment install (`runtime_api/context.py`)
+6. Change Tensor.va_base -> la_base (`runtime_api/tensor.py`)
+7. Remove VA/MMU code
+8. Remove pe_mmu from topology.yaml, add mapping mode configuration
+9. Test migration
+
+### Affected Existing Tests
+
+| Test File | Impact |
+|-----------|--------|
+| `tests/test_mmu_component.py` | Remove -> replace with BAAW segment install test |
+| `tests/test_mmu_fabric.py` | Remove -> replace with BAAW + fabric integration test |
+| `tests/test_pe_mmu.py` | Remove |
+| `tests/test_va_allocator.py` | Replace with LA allocator test |
+| `tests/test_va_integration.py` | Replace with LA + BAAW integration test |
+| `tests/test_va_offset.py` | Replace with LA offset test |
+
+---
+
+## Test Requirements
+
+- For the same logical access:
+  - 1:1 -> verify N requests are generated
+  - n:1 -> verify 1 aggregated request is generated
+- Verify effective bandwidth consistency across both modes
+- 1:1 -> verify per-channel contention modeling
+- n:1 -> verify aggregated bandwidth is reflected
+- Verify operation without kernel code changes
+- Verify correct BAAW segment install/uninstall operation
+- Verify no conflicts when multiple tensors are assigned to different segments
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
+- ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
+- ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact
@@ -0,0 +1,440 @@
+# ADR-0018: LA 기반 메모리 주소 추상화 및 HBM Channel Mapping Mode 도입
+
+## Status
+
+Merged into ADR-0011 (Address Model: LA section).
+
+## Context
+
+Kernbench는 CUBE 내부에서 PE_DMA와 Local-HBM 간의 메모리 접근을 시뮬레이션한다.
+현재는 VA 기반 접근 경로를 사용하고 있으나, 다음 두 가지 channel mapping 모델을
+일관되게 표현하기 어렵다.
+
+### 배경: Local-HBM pseudo channel 구조
+
+CUBE의 HBM은 32개 또는 64개의 pseudo channel로 구성된다.
+PE-Local-HBM 모델에서는 각 PE가 동일한 수의 pseudo channel을 담당한다.
+
+예: 64 pseudo channel, 8 PE per cube → 각 PE가 8개 pseudo channel을 local HBM으로 접근
+
+pseudo channel 수와 PE 수는 모두 topology 파라미터이다.
+`N = hbm_pseudo_channels / pes_per_cube` (= channels_per_pe)가
+PE당 local channel 수를 결정한다.
+
+각 pseudo channel의 BW(예: 32 GB/s)만큼 DMA와 pseudo channel 사이의 라우팅 경로 BW도
+맞춰지므로, PE가 N개 채널에 동시 request를 보내면 최대 메모리 BW를 활용할 수 있다.
+
+### 현재 VA 모델의 한계
+
+채널을 8개로 나누면 request도 채널별로 생성되어 DMA에 보내져야 한다.
+그러나 현재 구조에서는 커널이 VA를 가지고 request를 생성한 뒤(`tl.load`)
+DMA에 바로 전달하므로, PE_CPU가 채널별 DMA request를 생성하기 어렵다.
+
+따라서 VA 대신 **Logical Address(LA)** 를 사용하고,
+PE_DMA 내부의 **BAAW(Logical-to-Physical Mapping Unit)** 가
+segment-based mapping을 기반으로 LA → PA 또는 PA 리스트로 변환하는 구조를 제안한다.
+
+### 두 가지 channel mapping mode
+
+- **1:1 mode**: 채널별 request를 만들어 실행. 정밀한 per-channel 모델링
+- **n:1 mode (default)**: local HBM 채널 간 인터리빙 가정. aggregated BW 모델링
+
+두 모드를 지원하여 n:1 모드의 오버헤드를 측정/검토할 수 있게 한다.
+
+### 핵심 요구사항
+
+- PE_DMA → HBM_CTRL의 effective bandwidth semantics는 두 모드에서 동일해야 한다
+- 차이는 request 표현 방식과 resource 모델링 방식에만 있어야 한다
+- kernel programming model은 변경하지 않는다
+- physical channel 정보는 kernel에 노출되지 않아야 한다
+
+### 기존 Physical Address
+
+현재 시스템의 51-bit Physical Address는 `policy/address/phyaddr.py`에 정의되어 있다:
+
+```
+[50:47] rack_id (4 bit)
+[46:43] sip_id  (4 bit)
+[42:38] cube_id (5 bit, sip_seg)
+[37]    hbm_selector (1=HBM window)
+[36:0]  hbm_offset   (37 bit, 128GB per cube)
+```
+
+PA는 최종 라우팅 가능한 canonical physical destination을 표현하는 데 사용되며,
+이 역할은 유지된다.
+하지만 logical access → physical request 변환 시점과 정책이 명확히 분리되어 있지 않다.
+
+---
+
+## Decision
+
+### D1. LA (Logical Address) 도입 — VA를 대체
+
+기존 VA(Virtual Address) 인프라를 LA(Logical Address)로 대체한다.
+
+#### LA의 특징
+
+- VA처럼 Tensor를 연속적인 메모리 공간에 매핑할 수 있다
+- logical buffer + offset을 표현
+- physical channel 정보를 직접 포함하지 않음
+- physical resolution 이전까지 유지되는 중간 추상화
+- 커널 코드(`tl.load`, `tl.store`, `tl.composite`)가 사용하는 유일한 주소 체계
+
+#### LA 공간 정의
+
+| 항목 | 값 |
+|------|-----|
+| LA 시작 주소 | `0x1_0000_0000` (4 GB, 기존 VA 시작점 유지) |
+| LA 공간 크기 | PE당 64 GB |
+| 정렬 단위 | segment 단위 (아래 D3 참조) |
+
+LA는 PE-local 주소 공간이다.
+서로 다른 PE가 동일한 LA 값을 사용해도 BAAW의 segment table이 다르므로
+서로 다른 PA로 resolve된다.
+
+#### VA 인프라 제거 범위
+
+LA 도입에 따라 다음 기존 코드를 대체/제거한다:
+
+| 제거 대상 | 대체 |
+|-----------|------|
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (동일 free-list 방식, 이름/역할 변경) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (PE_DMA 내부) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | 제거 — BAAW는 별도 컴포넌트가 아닌 PE_DMA 내부 로직 |
+| `runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg | BaawSegmentInstallMsg로 대체 |
+| `runtime_api/context.py`: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
+| `runtime_api/tensor.py`: `va_base` 필드 | `la_base` 필드 |
+| `topology.yaml`: pe_mmu 컴포넌트 항목 | 제거 |
+
+---
+
+### D2. Mapping Mode 설정
+
+topology.yaml의 cube 레벨에서 mapping mode를 설정한다:
+
+```yaml
+cube:
+  memory_map:
+    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
+    hbm_pseudo_channels: 64       # 전체 pseudo channel 수
+    hbm_channels_per_pe: 8        # PE당 local channel 수
+    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+```
+
+이 설정은 graph compiler(topology builder)와 BAAW 초기화 시 참조된다.
+
+---
+
+### D3. Segment 및 BAAW
+
+#### Segment 정의
+
+Segment는 LA space를 partition하여, 각 segment가 특정 HBM channel 또는
+channel group에 매핑되도록 하는 logical allocation 단위이다.
+
+Segment는 runtime allocator가 tensor deploy 시 생성하며,
+BAAW는 이를 기반으로 LA를 physical request로 변환한다.
+
+#### BAAW Segment Table Entry
+
+```python
+@dataclass
+class BaawSegment:
+    la_base: int          # segment 시작 LA
+    la_size: int          # segment 크기 (bytes)
+    mode: str             # "one_to_one" | "n_to_one"
+    # 1:1 mode fields
+    channel_count: int    # 이 segment에 할당된 channel 수 (e.g., 8)
+    pa_bases: list[int]   # per-channel PA 시작 주소 리스트 (len = channel_count)
+    channel_ids: list[int]  # per-channel 논리적 ID (e.g., [0,1,2,...,7])
+    channel_size: int     # per-channel 크기 (la_size // channel_count)
+    # n:1 mode fields
+    agg_pa_base: int      # aggregated PA 시작 주소
+    agg_node_id: str      # aggregated router node_id (for routing)
+```
+
+#### Segment 라이프사이클
+
+1. **할당 시점** (tensor deploy):
+   - RuntimeContext가 LA allocator에서 LA 공간 할당
+   - PEMemAllocator가 per-channel PA 할당 (1:1) 또는 aggregated PA 할당 (n:1)
+   - `BaawSegmentInstallMsg`를 PE_DMA로 전송하여 segment table에 등록
+
+2. **사용 시점** (kernel 실행):
+   - 커널이 `tl.load(la_ptr)` → DmaReadCmd(src_addr=LA)
+   - PE_DMA가 BAAW에서 LA에 해당하는 segment를 lookup
+   - mode에 따라 PA(들)로 변환
+
+3. **해제 시점** (tensor free):
+   - segment table에서 제거
+   - LA 공간 반환, PA 해제
+
+---
+
+### D4. BAAW (Logical-to-Physical Mapping Unit)
+
+#### 위치
+
+BAAW는 PE_DMA 내부의 front-end stage로 배치된다.
+별도의 SimPy 컴포넌트가 아니며, PE_DMA의 `handle_command()` 시작 부분에서 실행되는
+동기적 address resolution 로직이다.
+
+#### 입력
+
+- LA (Logical Address) — DmaReadCmd.src_addr 또는 DmaWriteCmd.dst_addr
+- access size (bytes)
+
+#### 출력
+
+- 1:1 mode: `list[PhysicalRequest]` — 각 request는 (PA, nbytes, channel_node_id)
+- n:1 mode: `PhysicalRequest` 1개 — (agg_PA, nbytes, agg_node_id)
+
+```python
+@dataclass
+class PhysicalRequest:
+    pa: int           # 51-bit Physical Address
+    nbytes: int       # 이 request의 transfer size
+    dst_node: str     # target node_id (channel router or aggregated router)
+```
+
+#### BAAW Resolve 로직
+
+```python
+def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
+    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
+    offset = la - seg.la_base
+
+    if seg.mode == "n_to_one":
+        pa = seg.agg_pa_base + offset
+        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
+
+    elif seg.mode == "one_to_one":
+        requests = []
+        per_ch_size = seg.channel_size
+        for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
+            ch_offset = offset % per_ch_size  # interleaved or striped
+            ch_nbytes = nbytes // seg.channel_count
+            pa = pa_base + ch_offset
+            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
+            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
+        return requests
+```
+
+#### 역할 범위
+
+BAAW의 책임:
+- logical access를 physical request 단위로 변환
+- mapping mode에 따른 fan-out (1:1) 또는 pass-through (n:1) 수행
+- Physical Address 생성 및 target node 결정
+
+BAAW의 책임이 아닌 것:
+- 실제 data movement 수행
+- NOC routing 실행
+- bandwidth 소비 시뮬레이션 (downstream component의 역할)
+
+#### Output Contract
+
+BAAW의 출력은 추가적인 address decoding 없이
+simulator의 routing 및 resource 모델에서 직접 사용 가능한 request 단위여야 한다.
+
+---
+
+### D5. PE_DMA handle_command() 변경
+
+#### 현재 흐름 (VA 기반)
+
+```
+DmaReadCmd.src_addr (VA)
+  → MMU.translate(VA) → PA
+  → PhysAddr.decode(PA) → PhysAddr object
+  → resolver.resolve(PhysAddr) → dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
+  → router.find_path(pe_prefix, dst_node_id) → path
+  → 1개 sub-Transaction 생성 → fabric inject
+```
+
+#### 새 흐름 (LA 기반)
+
+```
+DmaReadCmd.src_addr (LA)
+  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
+  → 각 PhysicalRequest에 대해:
+      → router.find_path(pe_prefix, req.dst_node) → path
+      → compute_drain_ns(path, req.nbytes) → drain
+      → sub-Transaction 생성 → fabric inject
+  → 모든 sub-Transaction 완료 대기
+  → pe_txn.done.succeed()
+```
+
+핵심 변경:
+- MMU 참조 제거 → BAAW resolve로 대체
+- PhysAddr.decode() + resolver.resolve() → BAAW가 직접 dst_node 반환
+- 1개 request → N개 request 병렬 inject (1:1 mode)
+
+---
+
+### D6. 1:1 Mode 상세
+
+- 하나의 logical access → N개(= `channels_per_pe`)의 physical request
+- N은 `hbm_pseudo_channels / pes_per_cube`로 결정되는 파라미터
+- 각 request:
+  - fully resolved 51-bit PA
+  - 특정 channel router를 target (`{pe_prefix}.ch_r{channel_id}`)
+- per-channel link에 의한 BW contention 모델링
+- PE_DMA는 N개 sub-transaction을 동시에 inject
+
+#### 1:1 Mode 예시
+
+구성: `hbm_pseudo_channels=64`, `pes_per_cube=8`
+→ `channels_per_pe=8`, PE0이 ch0-7 소유
+
+```text
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
+    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
+    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
+    channel_size: 512,  # = la_size / channel_count
+}
+
+BAAW resolve 결과 (N=8개 request):
+  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
+  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
+  → ...
+  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
+
+PE_DMA: N개 sub-transaction 병렬 inject
+  각각 channel router → hbm_ctrl link (channel_bw_gbs)를 통해 HBM 접근
+  총 effective BW = N × channel_bw_gbs
+```
+
+N이 다른 구성의 예:
+- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`, 4개 request
+- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`, 16개 request
+
+---
+
+### D7. n:1 Mode 상세
+
+- 하나의 logical access → 하나의 aggregated request
+- target: aggregated router → hbm_ctrl (ADR-0019 참조)
+- aggregated link BW = `channels_per_pe` × `channel_bw_gbs` (e.g., 8 × 32 = 256 GB/s)
+- single queue / resource로 모델링
+- per-channel PA 분해 없음
+
+#### n:1 Mode 예시
+
+```
+Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
+BAAW segment: {
+    la_base: 0x1_0000_0000, la_size: 4096,
+    mode: "n_to_one",
+    agg_pa_base: PA_agg,
+    agg_node_id: "sip0.cube0.pe0.agg_router",
+}
+
+BAAW resolve 결과:
+  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
+
+PE_DMA: 1개 sub-transaction inject
+  aggregated router → hbm_ctrl link (256 GB/s)를 통해 HBM 접근
+```
+
+---
+
+### D8. Kernel Model 유지
+
+- kernel은 여전히 단일 memory op만 발행 (`tl.load`, `tl.store`, `tl.composite`)
+- LA가 커널에 전달되는 주소 체계
+- channel 분해/집계는 PE_DMA 내부 BAAW에서 수행
+- kernel 코드에 physical channel 정보가 노출되지 않음
+
+---
+
+## Consequences
+
+### Positive
+
+- 1:1 vs n:1 semantics가 BAAW라는 단일 지점에서 명확히 분리됨
+- kernel abstraction 유지 — 커널 코드 변경 불필요
+- topology 기반 정책 제어 가능 (yaml에서 mode 전환)
+- simulation 모델 일관성 및 디버깅 용이성 향상
+- segment-based mapping은 page table 대비 단순하고 overhead가 낮음
+
+### Negative
+
+- VA/MMU 기반 코드 전체 리팩토링 필요
+- request 생성 경로 복잡도 증가 (1:1 mode에서 N개 request 관리)
+- n:1 mode에서 per-channel visibility 감소
+- 기존 VA 관련 테스트 재작성 필요
+
+---
+
+## Alternatives
+
+### A1. VA 유지 + MMU에서 fan-out
+
+- MMU가 per-channel PA를 반환하도록 확장
+- 문제: MMU의 역할이 address translation을 넘어 request 분해까지 확장됨
+- 문제: n:1 mode에서 aggregation 표현이 어려움
+
+### A2. Kernel이 channel-aware request 생성
+
+- 커널이 직접 채널별 load/store를 호출
+- 문제: abstraction leakage, portability 저하
+- 문제: 모든 벤치마크 코드 수정 필요
+
+### A3. 항상 PA 사용 (LA 없이)
+
+- runtime이 직접 per-channel PA를 커널에 전달
+- 문제: aggregation 모델과 충돌
+- 문제: 변환 시점이 불명확, 커널에 channel 정보 노출
+
+---
+
+## Implementation Notes
+
+### 구현 순서
+
+1. LA 타입 도입 (`policy/address/la_allocator.py`)
+2. BAAW segment table 구현 (`policy/address/baaw.py`)
+3. `BaawSegmentInstallMsg` 메시지 타입 추가 (`runtime_api/kernel.py`)
+4. PE_DMA에 BAAW 통합 (`components/builtin/pe_dma.py` handle_command 변경)
+5. RuntimeContext 변경: LA alloc + segment install (`runtime_api/context.py`)
+6. Tensor.va_base → la_base 변경 (`runtime_api/tensor.py`)
+7. VA/MMU 코드 제거
+8. topology.yaml에서 pe_mmu 제거, mapping mode 설정 추가
+9. 테스트 마이그레이션
+
+### 영향받는 기존 테스트
+
+| 테스트 파일 | 영향 |
+|------------|------|
+| `tests/test_mmu_component.py` | 제거 → BAAW segment install 테스트로 대체 |
+| `tests/test_mmu_fabric.py` | 제거 → BAAW + fabric 통합 테스트로 대체 |
+| `tests/test_pe_mmu.py` | 제거 |
+| `tests/test_va_allocator.py` | LA allocator 테스트로 대체 |
+| `tests/test_va_integration.py` | LA + BAAW 통합 테스트로 대체 |
+| `tests/test_va_offset.py` | LA offset 테스트로 대체 |
+
+---
+
+## Test Requirements
+
+- 동일 logical access에 대해:
+  - 1:1 → N개 request 생성 확인
+  - n:1 → 1개 aggregated request 생성 확인
+- 두 모드에서 effective bandwidth 일관성 검증
+- 1:1 → per-channel contention 모델링 확인
+- n:1 → aggregated bandwidth 반영 확인
+- kernel 코드 변경 없이 동작 확인
+- BAAW segment install/uninstall 정상 동작
+- 여러 tensor가 서로 다른 segment에 할당될 때 충돌 없음
+
+---
+
+## Links
+
+- ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU 도입) → 본 ADR이 대체
+- ADR-0019 (NOC Per-Channel HBM 연결 모델) → topology 측 연동
+- ADR-0014 (PE Internal Execution Model) → PE_DMA 변경 영향
@@ -0,0 +1,5 @@
+# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
+# ADR-0019: CUBE NOC 내 Per-Channel 및 Aggregated HBM 연결 모델
+
+## Status
+
+Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
+# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -0,0 +1,5 @@
+# ADR-0021: PE 파이프라인 리팩토링 — 컴포넌트 분리 + Scheduler 기반 라우팅
+
+## Status
+
+Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -0,0 +1,421 @@
+# ADR-0029: Hierarchical All-Reduce (3-Level Intra/Inter-SIP Algorithm)
+
+## Status
+
+Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
+`hierarchical_allreduce.py` module have been removed. The cube-mesh
+intercube + inter-SIP path is now the single all-reduce algorithm.
+
+## Context
+
+### Goal
+
+Define a **3-level hierarchical all-reduce** algorithm on top of the
+"Rank = SIP" model (ADR-0024) in which every PE inside each SIP participates.
+Each level uses a different physical interconnect (intra-cube ring, inter-cube
+NoC, inter-SIP UCIe) to maximize bandwidth.
+
+### Why hierarchical
+
+With a simple ring/mesh/tree all-reduce, only 1 PE per SIP participates (the
+`leader_only` mapper of ADR-0024). This models the inter-SIP stage well, but:
+
+- **Intra-SIP PEs sit idle**. While the leader PE is busy with inter-SIP
+  communication, the remaining 7 PEs / 16 cubes are idle.
+- **Intra-cube/inter-cube link bandwidth goes unused**. The cube NoC is very
+  fast, but with a single leader this resource is never exercised.
+- **Real libraries such as NCCL are hierarchical**: they exploit the bandwidth
+  gap between NVLink (intra-node) and InfiniBand (inter-node). The KernBench
+  topology has the same structure (bandwidth and latency differences across
+  intra-cube / inter-cube / inter-SIP).
+
+### Current state
+
+- `src/kernbench/ccl/algorithms/hierarchical_allreduce.py` already exists
+  (git log `10b33b4`: "Tensor indexing + hierarchical 3-level all-reduce
+  kernel"). It is a PE-level implementation built on the old model, which
+  assumes world_size = total PE count.
+- Under ADR-0024 the launcher changed to rank = SIP.
+- The hierarchical kernel **needs reinterpretation**: each worker (1 per SIP)
+  now involves every PE of its own SIP, and the kernel runs a 3-level
+  reduce + broadcast in the order intra-cube → inter-cube → inter-SIP.
+
+### Problems to solve
+
+1. **Fit the hierarchical algorithm onto the ADR-0024 framework**
+   - Mapper: `all_pes` (provided by ADR-0024 D5)
+   - Validator: `multi_pe_sip_local` (provided by ADR-0024 D8)
+   - Kernel: modify the existing `hierarchical_allreduce.py` so that rank
+     computation uses the SIP-local (cube, pe) coordinates
+2. **Generate the PE-level neighbor graph**
+   - Intra-cube: `(sip, cube, pe) ↔ (sip, cube, pe±1 mod N_PE)` (within the ring)
+   - Inter-cube: `(sip, cube, 0) ↔ (sip, cube±1 mod N_CUBE, 0)` (cube leaders only)
+   - Inter-SIP: `(sip, 0, 0) ↔ (sip±1 mod N_SIP, 0, 0)` (SIP leaders only)
+3. **Tensor layout**: each PE starts out owning 1 tile (the `multi_pe_sip_local`
+   validator enforces this layout). Achievable with DPPolicy(cube="column_wise",
+   pe="column_wise").
+4. **PE-level topology is under-expressed** (a concrete case of the
+   "distributed responsibility" issue in ADR-0024 D6)
+   - For simple patterns such as ring/mesh/tree, a rank-level topology_fn +
+     mapper combination is sufficient.
+   - Hierarchical uses a different peer mapping at each level, so
+     `_build_pe_installs` has to do multi-level interpretation.
+   - In the long run it is more explicit for the topology module to express
+     the PE level directly.
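
The three edge rules in item 2 can be sketched as a small pure-Python helper. This is a minimal sketch under stated assumptions, not the KernBench implementation: the function name `pe_neighbors`, the inter-SIP direction labels `next` / `prev`, and the choice that E and N mean "index + 1" are all hypothetical and used only for illustration.

```python
def pe_neighbors(sip, cube, pe, n_sip, n_cube, n_pe):
    """Return {direction: (sip, cube, pe)} for one PE (hypothetical helper)."""
    # Level 1: intra-cube ring, every PE participates.
    nbrs = {
        "E": (sip, cube, (pe + 1) % n_pe),
        "W": (sip, cube, (pe - 1) % n_pe),
    }
    # Level 2: inter-cube ring, cube leaders (pe == 0) only.
    if pe == 0:
        nbrs["N"] = (sip, (cube + 1) % n_cube, 0)
        nbrs["S"] = (sip, (cube - 1) % n_cube, 0)
    # Level 3: inter-SIP ring, SIP leaders (cube == 0 and pe == 0) only.
    if cube == 0 and pe == 0:
        nbrs["next"] = ((sip + 1) % n_sip, 0, 0)
        nbrs["prev"] = ((sip - 1) % n_sip, 0, 0)
    return nbrs


# 2 SIPs x 4 cubes x 4 PEs: a non-leader PE only sees its intra-cube ring.
assert set(pe_neighbors(0, 1, 2, 2, 4, 4)) == {"E", "W"}
```

With this shape, the symmetry property (the E neighbor of a PE points back to it as W) follows directly from the modular arithmetic.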
+
+### Non-problems (outside this ADR)
+
+- Launcher / barrier / rank-to-SIP / mapper-validator registry → ADR-0024
+- IPCQ direction addressing → ADR-0025
+- DPPolicy field cleanup → ADR-0026
+- Megatron TP → ADR-0027
+
+---
+
+## Decision
+
+### D1. Algorithm structure: 3-level reduce + reverse-order broadcast
+
+```
+Level 1 (intra-cube, E/W ring):
+  The N_PE PEs of each cube run a bidirectional ring reduce → the partial sum
+  gathers at PE 0 of the cube
+Level 2 (inter-cube within SIP, N/S ring, PE 0 only):
+  The N_CUBE cube leaders run a bidirectional ring reduce → the SIP-wide
+  partial sum gathers at (cube 0, PE 0) of the SIP
+Level 3 (inter-SIP, N_SIP peers, (cube 0, PE 0) only):
+  Complete the global sum via ring or pair exchange
+Broadcast:
+  Reverse order: propagate the Level 3 result from (cube 0, PE 0) to every
+  cube leader in the SIP, then to every PE in each cube
+```
+
+The details match the kernel implementation in the existing
+`hierarchical_allreduce.py`. After ADR-0024 the only changes are the **rank
+computation** and the **interpretation of n_elem**:
+
+- Before (rank=PE model): `rank = cube_id * pes_per_cube + local_pe`, `pe_addr =
+  t_ptr + rank * nbytes`
+- After (rank=SIP model): the kernel works only in SIP-local coordinates
+  `(cube_id, local_pe)`. The backend passes each PE's tensor slice as a per-PE
+  `TensorArg` (ADR-0024 D3). Rank computation inside the kernel becomes
+  unnecessary; `tl.program_id(0/1)` is enough.
+
+### D2. Framework integration: reuse the ADR-0024 infrastructure
+
+`ccl.yaml`:
+
+```yaml
+algorithms:
+  hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.hierarchical_allreduce
+    topology: hierarchical_3level        # NEW, see D3
+    mapper: all_pes                      # ADR-0024 D5 built-in
+    validator: multi_pe_sip_local        # ADR-0024 D8 built-in
+    buffer_kind: tcm
+    n_elem: 128
+```
+
+From the framework's point of view, hierarchical is **not a special
+algorithm but a particular topology / mapper / validator combination**. This
+ADR defines that combination and the topology pattern.
+
+### D3. `hierarchical_3level` topology (new)
+
+Newly added to `kernbench/ccl/topologies.py`:
+
+```python
+def hierarchical_3level(rank: int, world_size: int, spec: dict) -> dict:
+    """3-level hierarchical neighbor pattern.
+
+    Returns a nested structure describing intra-cube + inter-cube + inter-SIP
+    neighbors. Unlike ring_1d / mesh_2d which are rank → {dir: peer_rank},
+    hierarchical is PE-level and requires spec for cube_mesh / pe_layout.
+    """
+```
+
+Return schema (draft):
+
+```python
+{
+    "intra_cube": {
+        # ring neighbors within each cube: (cube, pe) → {"E": (cube, pe_e), "W": (cube, pe_w)}
+        ...
+    },
+    "inter_cube": {
+        # ring among cube leaders: (cube, 0) → {"N": (cube_n, 0), "S": (cube_s, 0)}
+        ...
+    },
+    "inter_sip": {
+        # among SIP leaders: rank → {"parent": peer_rank} (or a ring scheme)
+        ...
+    },
+}
+```
+
+`_build_pe_installs` interprets this structure and maps it onto the
+(4-direction) neighbor table entries of each PE.
+
+**Relationship to the current rank-level `topologies.py` API**: the existing
+simple patterns are single-level `(rank → {dir: peer_rank})`. Hierarchical is
+multi-level, so its schema differs from the existing API. `_resolve_topology`
+needs to be extended so that **the algorithm declares which schema it uses**
+and the builder interprets it accordingly (open question).
+
+### D4. PE-level neighbor graph: extending `_build_pe_installs`
+
+Existing (ring/mesh/tree): the `(rank → {dir: peer_rank})` returned by
+topology_fn is mapped as-is onto each participating PE (with leader_only, the
+peer PE is also a leader).
+
+New (hierarchical): the 3-tier structure of `hierarchical_3level` is unrolled
+into per-PE neighbor tables:
+
+```python
+def _build_pe_installs_hierarchical(rank, world_size, sip, pes, topo, spec):
+    """PE neighbor table builder dedicated to the hierarchical topology."""
+    result = []
+    for (cube, pe) in pes:
+        entries = []
+        # Level 1: intra-cube ring (E/W)
+        for d, peer in topo["intra_cube"][(cube, pe)].items():
+            entries.append(NeighborTableEntry(direction=d, ...))
+        # Level 2: inter-cube ring (N/S), cube leader (pe == 0) only
+        if pe == 0:
+            for d, peer in topo["inter_cube"][(cube, 0)].items():
+                entries.append(NeighborTableEntry(direction=d, ...))
+        # Level 3: inter-SIP, SIP leader (cube == 0 and pe == 0) only
+        if cube == 0 and pe == 0:
+            for d, peer_rank in topo["inter_sip"][rank].items():
+                # peer_rank → (0, 0) of the peer SIP
+                entries.append(NeighborTableEntry(
+                    direction=d, peer_sip=peer_rank, peer_cube=0, peer_pe=0, ...))
+        result.append(PeInstallSpec(cube=cube, pe=pe, neighbors=tuple(entries)))
+    return tuple(result)
+```
+
+`build_install_plans` selects the appropriate builder according to the
+`topology` in algorithm_config (existing simple builder vs hierarchical builder).
+
+### D5. Kernel reinterpretation: SIP-local coordinates
+
+Modify `src/kernbench/ccl/algorithms/hierarchical_allreduce.py` to conform
+to ADR-0024 D3:
+
+```python
+def kernel_args(*, n_elem: int, world_size: int, pes_per_cube: int,
+                cubes_per_sip: int, num_sips: int, **kw) -> tuple:
+    """Pass world_size (= num_sips), pes_per_cube, cubes_per_sip as scalars."""
+    return (n_elem, pes_per_cube, cubes_per_sip, num_sips)
+
+def kernel(t_ptr, n_elem, pes_per_cube, cubes_per_sip, num_sips, tl):
+    """Based on SIP-local coordinates.
+
+    Before (rank=PE model):
+        rank = cube_id * pes_per_cube + local_pe
+        pe_addr = t_ptr + rank * nbytes
+    Now (rank=SIP model):
+        The backend passes the per-PE tensor slice as a TensorArg → t_ptr is already local.
+        The intra-cube ring uses tl.program_id(0).
+        The inter-cube ring is restricted by the local_pe == 0 condition.
+        The inter-SIP reduce is restricted by the cube_id == 0 and local_pe == 0 condition.
+    """
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    # acc holds this PE's local tile read from t_ptr; the load is elided in this sketch.
+
+    # Level 1: intra-cube ring
+    for _ in range(intra_rounds(pes_per_cube)):
+        tl.send(dir="E", src=acc)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        acc = acc + recv
+
+    # Level 2: inter-cube (cube leader only)
+    if local_pe == 0:
+        for _ in range(inter_cube_rounds(cubes_per_sip)):
+            tl.send(dir="N", src=acc)
+            recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
+            acc = acc + recv
+
+    # Level 3: inter-SIP (SIP leader only)
+    if local_pe == 0 and cube_id == 0:
+        for _ in range(inter_sip_rounds(num_sips)):
+            tl.send(dir="parent", src=acc)
+            recv = tl.recv(dir="parent", shape=(n_elem,), dtype="f16")
+            acc = acc + recv
+
+    # Broadcast (reverse chain)
+    # ...
+    tl.store(t_ptr, acc)
+```
+
+`kernel_args` follows the keyword-only signature contract of ADR-0024 D4.
+
+### D6. Validator: `multi_pe_sip_local`
+
+Uses the ADR-0024 D8 built-in as-is. When `ccl.yaml` specifies `validator:
+multi_pe_sip_local`, the backend verifies that each SIP holds
+`cubes × pes_per_cube` shards.
+
+### D7. Bench: extend the default all-reduce bench
+
+When `ccl.yaml` selects `hierarchical_allreduce`, the worker in
+`benches/ccl_allreduce.py` automatically runs:
+
+```python
+# Worker example
+dp = DPPolicy(cube="column_wise", pe="column_wise")
+tensor = torch.zeros((1, intra_sip_pes * n_elem), dp=dp, name="in")
+# The tensor is spread 1 tile per PE over every PE of each SIP (passes the multi_pe_sip_local validator)
+dist.all_reduce(tensor, op="sum")
+```
+
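+The worker code itself does not know which algorithm is used (it depends on
+the `ccl.yaml` selection). However, **the DPPolicy must match the hierarchical
+requirements**: only a DPPolicy that distributes within the SIP, such as
+`cube/pe="column_wise"`, passes `multi_pe_sip_local` validation. This
+DPPolicy choice is made in the bench configuration or in the sample bench.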
+
+---
+
+## Dependencies
+
+- **ADR-0024**: Launcher, `all_pes` mapper, `multi_pe_sip_local` validator,
+  registry + import path. Prerequisite for implementing this ADR.
+- **ADR-0025**: IPCQ direction addressing. Multiple directions across
+  cube/pe/SIP are used simultaneously, so exact direction matching is required.
+- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
+- **Existing `hierarchical_allreduce.py`**: this ADR is a reinterpretation of
+  that kernel plus the surrounding framework integration.
+
+---
+
+## Non-goals
+
+- **Changing the ADR-0024 framework**: reuse only.
+- **Alternative reduce topology (tree-in-tree, etc.)**: the 3-level ring is the
+  first implementation.
+- **Dynamic level count**: currently fixed at 3 tiers (SIP/cube/PE). 2 tiers
+  (SIP + PE, cube skipped) or 4+ tiers are future work.
+- **Bandwidth-optimal schedule tuning**: tuning such as the number of reduce
+  rounds or the chunk size is handled separately.
+- **Pipelined hierarchical**: NCCL-style optimization that overlaps multiple
+  chunks in a pipeline is future work.
+
+---
+
+## Open questions
+
+### 🟠 Medium impact: to be decided at implementation time
+
+- **`topologies.py` schema extension**: the existing `ring_1d` etc. are
+  single-level `(rank → {dir: peer})`. `hierarchical_3level` is multi-level.
+  Either generalize the schema so that `_resolve_topology` can return both, or
+  keep a hierarchical-specific return type and let the builder branch.
+  - Option A: unify every topology into a neighbor-list form
+    (`[{direction, peer_sip, peer_cube, peer_pe}, ...]`)
+  - Option B: the topology module provides a `kind` field and the builder branches
+  - Recommended: Option A (single source of truth, consistent with the
+    "PE-level topology unification" direction in the ADR-0024 Open Q)
+
+- **`hierarchical_3level` vs per-algorithm topology modules**: what happens if
+  variants such as a mesh-based hierarchical appear later? A name like
+  `hierarchical_3level` is already topology-specific. Variants either add a
+  new key (`hierarchical_mesh_3level`, etc.) or override topology generation
+  in the algorithm module.
+
+### 🟡 Nice-to-have
+
+- **Optimizing the number of reduce rounds**: a bidirectional ring takes
+  `ceil((N-1)/2)` rounds. Idle PEs can occur for non-power-of-2 group sizes.
+- **Handling non-uniform topology**: inter-cube ring balance when cube_mesh
+  has w != h.
+- **Single-SIP case**: skip Level 3 when world_size = 1 (1 SIP). Verify the
+  degenerate case.
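
The `ceil((N-1)/2)` round count quoted above can be tabulated with a few lines of Python. This is only an illustrative sketch; the helper name `bidirectional_ring_rounds` is hypothetical and is not one of the `intra_rounds` / `inter_cube_rounds` / `inter_sip_rounds` helpers referenced by the kernel.

```python
import math


def bidirectional_ring_rounds(n: int) -> int:
    """Rounds for a bidirectional ring reduce over n members: ceil((n-1)/2)."""
    if n < 1:
        raise ValueError("group size must be >= 1")
    return math.ceil((n - 1) / 2)


# Group size -> rounds for sizes 1..8.
table = {n: bidirectional_ring_rounds(n) for n in range(1, 9)}
assert table == {1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4}
```

A group of size 1 needs 0 rounds, which is the degenerate single-SIP case: a loop driven by this count simply does not execute for Level 3.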
+
+### 🟢 Framework evolution implications (carried over from ADR-0024)
+
+- **PE-level topology unification (mid to long term)**: the current design is
+  split three ways across
+  - topology (rank graph or level-separated)
+  - mapper (per-SIP PE set)
+  - `_build_pe_installs` (actual edges)
+
+  Hierarchical is the case that stresses this split the most. In the mid to
+  long term, the natural direction is for `topologies.py` to return the
+  PE-level neighbor list directly, for the mapper to decide only "which PEs
+  participate", and for `_build_pe_installs` to simplify into a flat mapping.
+  **Adopting Option A in this ADR** already aligns with this direction.
+
+---
+
+## Test strategy
+
+### T1. Topology generator
+
+`tests/test_hierarchical_topology.py` (new):
+- `hierarchical_3level(rank, world_size, spec)` → the neighbor set of each
+  level has the expected structure (intra-cube is a ring, only cube leaders
+  take part in inter-cube, only SIP leaders take part in inter-SIP)
+- Can be verified by hand on a small topology such as 2 SIP × 4 cubes × 4 PEs
+- Symmetry: the E neighbor of rank r points back to r as W on the peer
+
+### T2. Install plan — hierarchical × all_pes
+
+`tests/test_ccl_install_plan.py` (extended):
+- When calling `build_install_plans(algorithm="hierarchical_allreduce", mapper="all_pes",
+  validator="multi_pe_sip_local")`:
+  - every PE of each SIP is included in `participating_pes`
+  - only PE 0 (cube leader) has inter-cube neighbors
+  - only (cube 0, pe 0) (SIP leader) has inter-SIP neighbors
+  - non-leader PEs have intra-cube neighbors only
+
+### T3. Kernel unit — mock runtime
+
+`tests/test_hierarchical_mock_runtime.py` (new):
+- Extend `run_kernel_in_mock` (kernbench.ccl.testing) to support multi-level
+- On a 2 SIP × 2 cubes × 4 PEs (16 PEs total) topology, fill each initial tile
+  with rank+1 and run the hierarchical all-reduce
+- Check that the final result on every PE is `sum(1..16)`
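
The expected value in T3 can be cross-checked with a plain-Python data-flow model of the three reduce levels and the reverse broadcast. This is a sketch under stated assumptions: it is not `run_kernel_in_mock`, it models only where partial sums land (not the ring schedule or IPCQ traffic), and `simulate_hierarchical_allreduce` is a hypothetical name.

```python
def simulate_hierarchical_allreduce(n_sip: int, n_cube: int, n_pe: int) -> dict:
    """Data-flow model of the 3-level reduce + broadcast (no ring schedule)."""
    coords = [(s, c, p) for s in range(n_sip)
              for c in range(n_cube) for p in range(n_pe)]
    # Each PE's initial tile is its flat index + 1, i.e. 1..total.
    tile = {coord: i + 1 for i, coord in enumerate(coords)}

    # Level 1: every cube reduces into its PE 0.
    for s in range(n_sip):
        for c in range(n_cube):
            tile[(s, c, 0)] = sum(tile[(s, c, p)] for p in range(n_pe))
    # Level 2: cube leaders reduce into (cube 0, PE 0) of their SIP.
    for s in range(n_sip):
        tile[(s, 0, 0)] = sum(tile[(s, c, 0)] for c in range(n_cube))
    # Level 3: SIP leaders exchange; each ends up with the global sum.
    total = sum(tile[(s, 0, 0)] for s in range(n_sip))
    for s in range(n_sip):
        tile[(s, 0, 0)] = total
    # Broadcast in reverse order: SIP leader -> cube leaders -> all PEs.
    for s in range(n_sip):
        for c in range(n_cube):
            tile[(s, c, 0)] = tile[(s, 0, 0)]
            for p in range(n_pe):
                tile[(s, c, p)] = tile[(s, c, 0)]
    return tile


result = simulate_hierarchical_allreduce(n_sip=2, n_cube=2, n_pe=4)
assert set(result.values()) == {sum(range(1, 17))}  # 136 on all 16 PEs
```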
+
+### T4. E2E: real SimPy backend
+
+`tests/test_ccl_allreduce_matrix.py` (extended):
+- `hierarchical @ ws=SIP_count`: verify that the multi_pe_sip_local layout +
+  3-level algorithm pass through the full stack
+
+### T5. Validator enforcement
+
+- The `multi_pe_sip_local` validator raises on a wrong layout (e.g. a
+  leader_only-style 1 shard per rank)
+
+### T6. Regression
+
+All existing ring/mesh/tree algorithms still pass unchanged. This ADR does not touch them.
+
+---
+
+## Consequences
+
+### Positive
+
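+- **Higher intra-SIP PE utilization**: intra-cube / inter-cube reduce
+  proceeds even during inter-SIP communication, raising overall PE utilization.
+- **Multi-level bandwidth use**: both the cube NoC and UCIe are exercised → a more accurate HW model.
+- **ADR-0024 framework validation**: the first non-trivial use case of the
+  `all_pes` mapper + `multi_pe_sip_local` validator. Confirms the soundness of
+  the framework design.
+- **Reuse of the existing kernel**: the overall structure of
+  `hierarchical_allreduce.py` is kept; only the SIP-local coordinates are
+  reinterpreted.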
+
+### Negative
+
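+- **`topologies.py` schema extension required**: single-level vs multi-level
+  representation. The proposed fix (Option A) incurs a migration cost for the
+  existing ring/mesh/tree.
+- **Validator / mapper combination requirement**: users must choose a DPPolicy
+  that fits `multi_pe_sip_local` (more complex bench configuration).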
+
+### Neutral
+
+- Until this ADR is implemented, `hierarchical_allreduce.py` stays deprecated
+  or is excluded from the ADR-0024 matrix test. The current file is not
+  deleted right away.
+
+---
+
+## Affected files
+
+| File | Change |
+|------|--------|
+| `src/kernbench/ccl/topologies.py` | D3: add the `hierarchical_3level` topology function. (If Option A is adopted) unify the output format of the existing topologies |
+| `src/kernbench/ccl/install_plan.py` | D4: hierarchical builder branch (or a single builder that dispatches on level count) |
+| `src/kernbench/ccl/algorithms/hierarchical_allreduce.py` | D5: rewrite the kernel in SIP-local coordinates, `kernel_args` keyword-only signature |
+| `ccl.yaml` | D2: add the `hierarchical_allreduce` entry (`mapper: all_pes`, `validator: multi_pe_sip_local`, `topology: hierarchical_3level`) |
+| `tests/test_hierarchical_topology.py` (new) | T1 |
+| `tests/test_ccl_install_plan.py` | T2 extension |
+| `tests/test_hierarchical_mock_runtime.py` (new) | T3 |
+| `tests/test_ccl_allreduce_matrix.py` | T4: add the hierarchical row |
@@ -0,0 +1,261 @@
+# ADR-0031: PhysAddr PE-Resource Extension
+
+## Status
+
+Superseded by ADR-0001 (Revision 2, 2026-04-27).
+PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
+ADR-0001 D2.3.3-D2.3.5.
+
+Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)
+
+## Context
+
+### Goal
+
+Extend the `PhysAddr` schema of ADR-0001 so that it can systematically
+represent the **various resources inside a PE**. This provides the foundation
+for ADR-0030 (IPCQ PhysAddr integration) and for future PE-local resource
+additions (scratchpad, register file, status register, etc.).
+
+### Current state (ADR-0001)
+
+51-bit PhysAddr layout:
+
+```
+[50:47] rack_id  (4)
+[46:43] sip_id   (4)
+[42:38] sip_seg  (5)   # cube_id
+[37:0]  local_offset (38)
+```
+
+Inside `local_offset` (38 bits):
+
+- `[37]` selector: 1 = HBM window (128GB), 0 = PE resource window
+- The PE resource window consists of `unit_type` (3 bits: PE | MCPU | SRAM) +
+  `pe_id` (4 bits) + `ext` (1 bit) + `sub_offset` (29 bits)
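
The top-level 51-bit layout and the bit-37 selector can be illustrated with a standalone pack/unpack sketch. It is not the `PhysAddr` class: `pack_physaddr` and `unpack_physaddr` are hypothetical free functions, and the sub-field positions inside the PE resource window are deliberately left out because only their widths are stated here.

```python
LOCAL_BITS, SEG_BITS, SIP_BITS, RACK_BITS = 38, 5, 4, 4
SEG_SHIFT = LOCAL_BITS                # 38
SIP_SHIFT = SEG_SHIFT + SEG_BITS      # 43
RACK_SHIFT = SIP_SHIFT + SIP_BITS     # 47


def pack_physaddr(rack_id: int, sip_id: int, sip_seg: int, local_offset: int) -> int:
    """Pack the four top-level fields into a 51-bit integer."""
    for name, value, bits in (("rack_id", rack_id, RACK_BITS),
                              ("sip_id", sip_id, SIP_BITS),
                              ("sip_seg", sip_seg, SEG_BITS),
                              ("local_offset", local_offset, LOCAL_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((rack_id << RACK_SHIFT) | (sip_id << SIP_SHIFT)
            | (sip_seg << SEG_SHIFT) | local_offset)


def unpack_physaddr(addr: int) -> dict:
    """Split a 51-bit address and report the bit-37 HBM selector."""
    local_offset = addr & ((1 << LOCAL_BITS) - 1)
    return {
        "rack_id": (addr >> RACK_SHIFT) & ((1 << RACK_BITS) - 1),
        "sip_id": (addr >> SIP_SHIFT) & ((1 << SIP_BITS) - 1),
        "sip_seg": (addr >> SEG_SHIFT) & ((1 << SEG_BITS) - 1),
        "local_offset": local_offset,
        "is_hbm": bool(local_offset >> (LOCAL_BITS - 1)),  # bit 37
    }


addr = pack_physaddr(rack_id=1, sip_id=2, sip_seg=3, local_offset=(1 << 37) | 0x1000)
assert unpack_physaddr(addr)["is_hbm"] is True
```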
+
+Factory API:
+- `PhysAddr.hbm_addr(...)` — HBM generic
+- `PhysAddr.pe_hbm_addr(...)` — PE-local HBM slice
+- `PhysAddr.pe_tcm_addr(...)` — PE TCM (via `UnitType.PE` + `sub_offset`)
+- `PhysAddr.cube_sram_addr(...)` — Cube-shared SRAM
+
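+### Problems to solve
+
+1. **No explicit scheme for distinguishing PE-internal resources**: today
+   `local_offset` (38 bits) is treated as a flat space, and PE TCM / IPCQ ring /
+   scratchpad / a future register file are told apart only by conventional
+   offset ranges. This is not explicit at the schema level.
+2. **No PhysAddr representation for IPCQ addresses**: for ADR-0030 to express
+   the IPCQ ring buffer as a PhysAddr, "this address is in the IPCQ region"
+   must be decodable. Today it is not.
+3. **Extension path for future PE resources**: adding a register file,
+   performance counters, etc. requires a consistent placement rule.
+
+### Design direction: partition local_offset into per-PE-component ranges
+
+`local_offset` (38 bits = 256GB per PE segment) is divided into **a fixed
+range per PE component**. Each range is an address space dedicated to that
+component, and `PhysAddr.decode()` determines which range an address falls
+into and fills in the corresponding `kind` / `unit_type` / `sub_type` fields.
+
+Conceptual structure (the concrete bit allocation is **TBD**):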
+
+```
+local_offset [37:0]  (38 bits total)
+├── HBM window           [37] = 1    (existing 128GB)
+├── PE component ranges  [37] = 0
+│   ├── TCM              [range_1]
+│   ├── IPCQ rings       [range_2]
+│   ├── Scratchpad       [range_3]
+│   ├── Register file    [range_4]
+│   ├── (reserved)       ...
+│   └── Sideband / status [range_N]
+```
+
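+### Why a range-based partition
+
+- **Schema-level explicitness**: a single address can be decoded to the
+  component that owns the resource. Satisfies the "Routing consumes decoded
+  domains" contract (ADR-0001 D5).
+- **More flexible than extending the unit type enum**: allows subdivision
+  without exhausting the 3-bit `UnitType` space. Future components are
+  assigned a free range.
+- **Natural allocator integration**: the sub-pools managed by each PE-level
+  allocator match address ranges 1:1 (e.g., `reserve_ipcq_tcm()` → allocates
+  only within the IPCQ range).
+- **Simple decode routing**: `PhysAddr.decode(addr)` consults the range table
+  and fills in `kind` + sub-fields. A generalization of the existing HBM
+  selector bit pattern.
+
+### Why address this now
+
+- ADR-0030 (IPCQ PhysAddr integration) **depends** on this extension. If
+  ADR-0030 proceeds alone, it ends up reusing the `sub_offset` space opaquely
+  and the ADR-0001 contract is not met.
+- More PE-internal resources are likely to be added; settling the structure
+  now secures a consistent extension path.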
+
+---
+
+## Decision (pending specific range allocation)
+
+### D1. Range-based local_offset partition — approach
+
+`local_offset` is partitioned into fixed byte ranges, and each range is
+assigned to a PE component. The range an address falls into determines its
+`kind` / component type.
+
+```python
+# src/kernbench/policy/address/phyaddr.py (conceptual, post-extension)
+from dataclasses import dataclass
+
+@dataclass(frozen=True)
+class PeResourceRange:
+    name: str                # e.g. "tcm", "ipcq", "scratchpad", "regfile"
+    start_offset: int        # start within local_offset
+    end_offset: int          # exclusive
+    byte_size: int           # end - start
+
+PE_RESOURCE_MAP: tuple[PeResourceRange, ...] = (
+    # TBD: the concrete range allocation will be filled in separately by the user
+)
+```
+
+The PE resource path of `PhysAddr.decode(addr)` is:
+
+```python
+def decode_pe_resource(local_offset: int) -> dict:
+    for r in PE_RESOURCE_MAP:
+        if r.start_offset <= local_offset < r.end_offset:
+            return {
+                "kind": "pe_resource",
+                "component": r.name,                 # NEW: "tcm"/"ipcq"/...
+                "component_offset": local_offset - r.start_offset,  # within range
+            }
+    raise PhysAddrError(f"local_offset {local_offset} not in any PE range")
+```
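
To make the range lookup concrete, here is a self-contained, runnable variant with a placeholder table. The offsets and sizes in `DEMO_MAP` are invented for illustration only (D2 is still TBD), and `DEMO_MAP`, `check_no_overlap`, and the local `PhysAddrError` class are hypothetical stand-ins rather than the project's definitions.

```python
from dataclasses import dataclass


class PhysAddrError(ValueError):
    """Stand-in for the project's error type (assumed to exist)."""


@dataclass(frozen=True)
class PeResourceRange:
    name: str
    start_offset: int
    end_offset: int  # exclusive


# Placeholder ranges: NOT the real allocation, which is TBD in D2.
DEMO_MAP = (
    PeResourceRange("tcm", 0x0000_0000, 0x0010_0000),
    PeResourceRange("ipcq", 0x0010_0000, 0x0011_0000),
)


def check_no_overlap(ranges) -> None:
    """Reject tables whose ranges are empty or overlap each other."""
    ordered = sorted(ranges, key=lambda r: r.start_offset)
    for r in ordered:
        if r.start_offset >= r.end_offset:
            raise PhysAddrError(f"empty range: {r.name}")
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start_offset < prev.end_offset:
            raise PhysAddrError(f"{prev.name} overlaps {cur.name}")


def decode_pe_resource(local_offset: int, ranges=DEMO_MAP) -> dict:
    """Map a local_offset to its component and offset within that range."""
    for r in ranges:
        if r.start_offset <= local_offset < r.end_offset:
            return {
                "kind": "pe_resource",
                "component": r.name,
                "component_offset": local_offset - r.start_offset,
            }
    raise PhysAddrError(f"local_offset {local_offset:#x} not in any PE range")


check_no_overlap(DEMO_MAP)
assert decode_pe_resource(0x0010_0040)["component"] == "ipcq"
```

Validating the table once at import time keeps the decode loop itself trivial, which matches the "fixed at compile / configuration time" stance of this ADR.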
+
+### D2. Specific range allocations — **TBD**
+
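+> The user defines the concrete byte allocation separately and then updates this ADR.
+>
+> Required information:
+> - Name / byte size of each component (TCM, IPCQ, scratchpad, regfile, ...)
+> - Start offset within `local_offset` (taking alignment into account)
+> - Reflection of the current hardware spec / simulation requirements
+
+Once this section is filled in, the ADR status is promoted: **Stub → Proposed → Accepted**.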
+
+### D3. Factory API: per-component functions
+
+Generalizes the existing `PhysAddr.pe_tcm_addr(...)` pattern:
+
+```python
+# existing (already present)
+PhysAddr.pe_tcm_addr(rack_id, sip_id, cube_id, pe_id, tcm_offset)
+
+# new (added after ADR-0031)
+PhysAddr.pe_ipcq_addr(rack_id, sip_id, cube_id, pe_id, ipcq_offset)
+PhysAddr.pe_scratchpad_addr(...)
+PhysAddr.pe_regfile_addr(...)
+# ...
+```
+
+Each factory takes only the `component_offset` within its component's range
+and produces the final PhysAddr encoding. The caller does not need to know
+which range is involved.
+
+### D4. Backward compatibility
+
+- The existing `pe_tcm_addr()` signature / semantics are kept.
+- Only the internal encoding changes to consult the new range table.
+- The existing `UnitType.PE` decoding path is mapped to the "tcm" range in
+  `PE_RESOURCE_MAP` → transparent to existing code.
+- Existing code that checks `PhysAddr.decode(addr).unit_type == UnitType.PE`
+  remains valid (TCM addresses keep the PE unit_type).
+
+---
+
+## Open questions
+
+### 🔴 Pending user input (ADR promotion blocker)
+
+- **Specific range allocation for D2**: the user must provide the concrete
+  byte allocation table before Stub → Proposed promotion is possible. Required
+  information:
+  - Component list (TCM, IPCQ, scratchpad, regfile, etc.)
+  - Byte size / start offset of each component
+  - Alignment requirements (4KB / page-aligned, etc.)
+
+### 🟡 Design details: decided together with the range allocation
+
+- **Overall local_offset space split**: whether to keep the HBM window
+  (bit 37 = 1, 128GB) or shrink it to enlarge the PE resource space.
+- **Range padding / reserved space**: how many "reserved" ranges to set aside
+  in advance for future components.
+- **Address alignment**: whether the start offset of each range must satisfy
+  a particular alignment (page / cache line).
+- **Diagnostic / debug format**: show the component name + component_offset in
+  `PhysAddr.decode()` output in a human-readable way (e.g., "IPCQ ring sip=0
+  cube=0 pe=3 offset=0x1234").
+- **Role of the existing `UnitType` enum**: whether to keep the `unit_type`
+  field after moving to the range-based approach (adding `component` to the
+  decode result) or to replace the enum.
+
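+### 🟢 ADR-0030 integration questions
+
+- **Expressing direction/slot within the IPCQ range**: PhysAddr expresses
+  addresses only down to `component_offset` granularity. "direction=E, slot=2"
+  is derived by an offset computation within the IPCQ range
+  (`direction_idx * slot_region_size + slot_idx * slot_size`); this formula is
+  to be made concrete in the ADR-0030 scope.
+- **Allocator pool structure**: whether `PEMemAllocator` manages multiple
+  ranges (TCM, IPCQ, scratchpad) as separate pools, or manages only per-kind
+  reservations inside a single pool. With a range-based schema, separate
+  pools are the natural fit.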
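
The slot offset formula quoted in the first bullet can be written out as a short sketch. Everything concrete here is an assumption for illustration: the direction order `E, W, N, S`, the 4 slots per direction, and the 256-byte slot size are placeholders, and `ipcq_component_offset` is a hypothetical name; the real values belong to ADR-0030.

```python
DIRECTIONS = ("E", "W", "N", "S")  # assumed index order, not specified here


def ipcq_component_offset(direction: str, slot_idx: int,
                          slots_per_direction: int = 4,
                          slot_size: int = 256) -> int:
    """direction_idx * slot_region_size + slot_idx * slot_size."""
    if not 0 <= slot_idx < slots_per_direction:
        raise ValueError(f"slot_idx {slot_idx} out of range")
    direction_idx = DIRECTIONS.index(direction)  # ValueError if unknown
    slot_region_size = slots_per_direction * slot_size
    return direction_idx * slot_region_size + slot_idx * slot_size


# "direction=E, slot=2" with the placeholder sizes above.
assert ipcq_component_offset("E", 2) == 512
```

The result is a `component_offset` relative to the start of the IPCQ range, so it composes with the range-table decode without either side knowing the other's layout.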
+
+---
+
+## Non-goals (this ADR)
+
+- **Rewriting the full 51-bit layout**: this ADR covers only the subdivision
+  inside `local_offset` (38 bits). The upper-bit structure (rack / SIP / cube
+  segment) is unchanged.
+- **Redesigning the `UnitType` enum**: the range-based approach could replace
+  it, but the existing enum (PE / MCPU / SRAM) is kept for backward compat.
+- **Dynamic range allocation**: changing range sizes at runtime is not
+  needed. All ranges are fixed at compile / configuration time.
+- **Multi-process / multi-rack partitioning**: only PE-internal resources are covered.
+
+---
+
+## Action
+
+### Phase 1: user input, specific range allocation (**Blocker**)
+- Enter the user-defined byte ranges per PE component into D2:
+  - Contents of the `PE_RESOURCE_MAP` table (name, start_offset, byte_size per component)
+  - A note on the hardware spec rationale for each component
+
+### Phase 2: promote the ADR from Stub → Proposed
+- Change the status once D2 is filled in.
+- Remove the "🔴 Pending user input" block from Open questions.
+- Draft an amendment note for ADR-0001.
+
+### Phase 3: implementation
+- Implement range-based decode in `PhysAddr`.
+- Add the new per-component factory functions (`pe_ipcq_addr`,
+  `pe_scratchpad_addr`, etc.).
+- Change only the internal encoding of the existing `pe_tcm_addr` to consult
+  the new range table (signature unchanged).
+- Check for regressions in existing code paths.
+
+### Phase 4: unblock ADR-0030
+- Lift the "Blocked" status of ADR-0030.
+- Change the install_plan builder to call the extended factories such as
+  `pe_ipcq_addr(...)`.
+
+---
+
+## Dependencies
+
+- **ADR-0001** (PhysAddr layout): this ADR is an extension of ADR-0001.
+- **ADR-0023** (IPCQ protocol): provides the basis for folding the IPCQ ring
+  buffer's addressing into PhysAddr.
+- **ADR-0030** (IPCQ PhysAddr integration): blocked on this ADR.
+
+---
+
+## Affected files (future, after promotion to Proposed)
+
+| File | Change |
+|------|--------|
+| `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, new component-specific factories (`pe_ipcq_addr`, etc.), update of the internal encoding of the existing `pe_tcm_addr` |
+| `src/kernbench/policy/address/allocator.py` | Range-aware pool separation (TCM pool / IPCQ pool / scratchpad pool, etc., per PE) |
+| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition |
+| `tests/test_phyaddr.py` | Range table verification, encode/decode round-trip for each factory, regression of the existing `pe_tcm_addr` |