ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check

Follow-up to the bilingual-structure commit: docs/adr-ko/ now holds only Korean versions (24 files translated from English placeholders), ADR-0013 slug uses kebab-case in both folders, and the verify tool allows translated parenthetical commentary in the Status block.

- Translate 24 English files in docs/adr-ko/ to Korean. The previous bilingual-structure commit had left these as English copies because their source content was already English; this commit fulfills the policy that docs/adr-ko/ contains only Korean.
- Rename ADR-0013 in both adr/ and adr-ko/ from ver-verification_strategy.md to ver-verification-strategy.md (kebab-case consistency with other ADRs).
- CLAUDE.md (ADR Translation Discipline): clarify that only the Status lifecycle keyword (Accepted / Proposed / Stub / Draft / Superseded by ADR-NNNN / Merged into ADR-NNNN) must match across EN and KO; parenthetical commentary and trailing list items may be translated.
- tools/verify_adr_lang_pairs.py: replace byte-equal Status check with normalize_status_keyword() which strips parenthetical commentary and takes only the first non-empty line.
- tests/test_verify_adr_lang_pairs.py: update existing test names, add coverage for translated parenthetical, translated trailing list, and Superseded-by-NNNN keyword equality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
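The relaxed Status check described in the message can be pictured with a minimal sketch. It assumes `normalize_status_keyword()` receives the raw text of a Status block as a string and returns the bare lifecycle keyword. The signature, the regular expression, and the trailing-period handling are assumptions for illustration, not the code in tools/verify_adr_lang_pairs.py.

```python
import re

# Matches one non-nested parenthetical such as "(default)" or "(기본값)".
_PARENTHETICAL = re.compile(r"\([^()]*\)")


def normalize_status_keyword(status_block: str) -> str:
    """Reduce a Status block to its lifecycle keyword (illustrative sketch)."""
    for line in status_block.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines until the first non-empty one
        # Drop parenthetical commentary, collapse whitespace,
        # then drop a trailing period (an assumption of this sketch).
        line = _PARENTHETICAL.sub("", line)
        return " ".join(line.split()).rstrip(".").strip()
    return ""


en_status = "Accepted.\n\n- **VA model: currently implemented (default).**"
ko_status = "Accepted.\n\n- **VA 모델: 현재 구현됨 (기본값).**"
print(normalize_status_keyword(en_status) == normalize_status_keyword(ko_status))
```

Under these assumptions, only the first non-empty line takes part in the comparison, so the translated trailing list in the Korean file does not affect the result.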
2026-05-20 08:17:56 -07:00
parent a796c1d2f7
commit 168b0c89f0
29 changed files with 2631 additions and 2651 deletions
@@ -1,288 +1,273 @@
-# ADR-0011: Memory Addressing — PA / VA / LA Address Models
+# ADR-0011: 메모리 주소 지정 — PA / VA / LA 주소 모델

 ## Status

 Accepted.

-- **VA model: currently implemented (default).**
-- PA model: implemented as PageFault fallback in PE_DMA.
-- LA model: proposed, not implemented.
+- **VA 모델: 현재 구현됨 (기본값).**
+- PA 모델: PE_DMA의 PageFault fallback으로 구현됨.
+- LA 모델: 제안됨, 미구현.

 ## Context

-KernBench's address model evolved through three design points, each
-addressing a limitation of the previous. This ADR documents all three
-in one place because future implementation work selects among them.
+KernBench의 주소 모델은 각 단계마다 이전 단계의 한계를 해결하면서
+세 단계의 설계 지점을 거쳐 발전해 왔다. 본 ADR은 미래의 구현 작업이
+이 셋 중 하나를 선택해야 하므로 셋 모두를 한 곳에 기록한다.

-### PA-only baseline
+### PA 단독 베이스라인

-Phase 0 of KernBench treated all device memory operations
-(MemoryRead/MemoryWrite) as raw physical-address transfers. No
-host-side virtual addressing, no MMU/IOMMU translation. Allocators
-returned PA mappings; DMA requests carried PA directly.
+KernBench Phase 0는 모든 디바이스 메모리 동작(MemoryRead/MemoryWrite)을
+순수 물리 주소 전송으로 다뤘다. 호스트측 가상 주소 지정 없음, MMU/IOMMU
+변환 없음. 할당기는 PA 매핑을 반환하고, DMA 요청은 PA를 직접 운반했다.

-This was sufficient for early correctness/latency work but
-insufficient for running standard Triton kernels that use
-`base_addr + offset` patterns on sharded tensors: each PE's shard
-has a different PA, but the kernel needs a single contiguous address
-space to compute offsets.
+이는 초기 정확성·레이턴시 작업에는 충분했지만, 샤딩된 텐서에 대해
+`base_addr + offset` 패턴을 사용하는 표준 Triton 커널을 실행하기에는
+부족했다. 각 PE의 샤드는 서로 다른 PA를 갖지만, 커널은 offset을 계산하기
+위해 연속된 단일 주소 공간이 필요하기 때문이다.

-### Why VA/MMU (current default)
+### VA/MMU를 채택한 이유 (현재 기본값)

-A realistic system uses host-side virtual addressing and an
-MMU/IOMMU-style translation path for DMA: the host allocates physical
-memory at PE level, maps it into a virtual address space, installs
-mappings, and DMA requests use virtual addresses that are translated
-to physical addresses.
+현실적인 시스템은 호스트측 가상 주소 지정과 DMA를 위한 MMU/IOMMU 스타일
+변환 경로를 사용한다. 호스트는 PE 수준에서 물리 메모리를 할당하고,
+그것을 가상 주소 공간에 매핑하여 매핑을 설치한 뒤, DMA 요청은 가상
+주소를 사용하며 그것이 물리 주소로 변환된다.

-Adopting this model lets kernels use `base_addr + offset` over a
-contiguous VA range while the device-side MMU translates each access
-to the appropriate PA.
+이 모델을 채택하면 커널이 연속된 VA 범위에 대해 `base_addr + offset`을
+사용할 수 있고, 디바이스측 MMU가 각 접근을 적절한 PA로 변환한다.

-### Why LA/BAAW (proposed)
+### LA/BAAW를 제안한 이유

-VA/MMU treats HBM as a single backing space. KernBench needs to
-explore architectures where HBM is composed of multiple pseudo
-channels in parallel:
+VA/MMU는 HBM을 단일 backing 공간으로 다룬다. KernBench는 HBM이 병렬로
+여러 pseudo channel로 구성된 아키텍처를 탐색해야 한다:

-- CUBE's HBM has 32 or 64 pseudo channels.
-- In a PE-Local-HBM model, each PE is assigned N pseudo channels
+- CUBE의 HBM은 32 또는 64개의 pseudo channel을 갖는다.
+- PE-Local-HBM 모델에서 각 PE에는 N개의 pseudo channel이 할당된다
   (N = `hbm_pseudo_channels / pes_per_cube`).
-- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
-  (N × per-channel).
+- 채널당 대역폭(예: 32 GB/s)이 PE의 총 대역폭을 결정한다
+  (N × 채널당).

-Two channel-mapping modes need to be modelable:
+두 가지 채널 매핑 모드를 모델링할 수 있어야 한다:

-- **1:1 mode** — one logical access → N per-channel requests.
-  Precise per-channel BW contention modelling.
-- **n:1 mode (default)** — one logical access → one aggregated
-  request. Channels are assumed to interleave; aggregated BW model.
+- **1:1 모드** — 하나의 논리 접근 → N개의 채널별 요청.
+  채널별 대역폭 경쟁을 정밀하게 모델링.
+- **n:1 모드 (기본값)** — 하나의 논리 접근 → 하나의 집계 요청.
+  채널들이 interleave된다고 가정; 집계된 대역폭 모델.

-VA's `tl.load(va_ptr)` produces a single DMA request to a single
-target. Decomposing that into per-channel requests inside PE_DMA
-requires the address layer to be aware of channels. This is the
-role of the LA (Logical Address) abstraction with BAAW
-(Logical-to-Physical Mapping Unit).
+VA의 `tl.load(va_ptr)`은 하나의 목표에 대한 하나의 DMA 요청을 생성한다.
+이를 PE_DMA 내부에서 채널별 요청으로 분해하려면 주소 계층이 채널을
+인지해야 한다. 이것이 BAAW(Logical-to-Physical Mapping Unit)를 가진
+LA(Logical Address) 추상화의 역할이다.

-Core requirements driving the LA design:
+LA 설계를 이끄는 핵심 요구사항:

-- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
-  in both modes (only request shape and resource model differ).
-- Kernel programming model is unchanged — physical channel
-  information is never exposed to kernel code.
-- Mode switch is a topology-level configuration.
+- PE_DMA → HBM_CTRL 유효 대역폭 시맨틱이 두 모드에서 동일해야 한다
+  (요청 형태와 자원 모델만 다름).
+- 커널 프로그래밍 모델은 변경되지 않는다 — 물리 채널 정보는 커널 코드에
+  절대 노출되지 않는다.
+- 모드 전환은 토폴로지 수준의 설정이다.

-### Design space summary
+### 설계 공간 요약

-| Model | Status | Key idea |
-|-------|--------|----------|
-| PA | fallback (implemented) | Direct physical addressing, no translation |
-| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
-| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
+| 모델 | 상태 | 핵심 아이디어 |
+|------|------|--------------|
+| PA | fallback (구현됨) | 직접 물리 주소 지정, 변환 없음 |
+| VA | 현재 기본값 (구현됨) | 텐서별 연속 VA 범위; MMU가 접근별로 변환 |
+| LA | 제안됨 | LA + BAAW가 (PA, 채널)로 해석; 1:1 및 n:1 채널 매핑 모드 지원 |

 ---

 ## Decision

-This ADR defines three address models. At any given time the system
-operates in exactly one model. Selection is topology- / configuration-
-driven; coexistence within one simulation run is not required.
+본 ADR은 세 개의 주소 모델을 정의한다. 어느 시점에도 시스템은 정확히
+한 모델로 동작한다. 선택은 토폴로지·설정 주도이며, 단일 시뮬레이션 실행
+내에서의 공존은 요구되지 않는다.

 ---

-### Address Model: PA (Physical Address) — fallback
+### 주소 모델: PA (물리 주소) — fallback

-#### D-PA1. PA-only semantics
+#### D-PA1. PA 단독 시맨틱

-- All device memory accesses (MemoryRead/MemoryWrite) operate on
-  device physical addresses (PA) plus size.
-- PA-only mode remains functional via the PageFault fallback path in
-  PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
-  the value as a PA directly.
+- 모든 디바이스 메모리 접근(MemoryRead/MemoryWrite)은 디바이스 물리 주소(PA)와
+  크기에 대해 동작한다.
+- PA 단독 모드는 PE_DMA의 PageFault fallback 경로를 통해 여전히 동작한다.
+  DMA src/dst 주소에 MMU 매핑이 없으면 PE_DMA는 그 값을 PA로 직접 다룬다.

-#### D-PA2. Allocation produces PA mappings
+#### D-PA2. 할당은 PA 매핑을 생성한다

-Device allocation selects PE-local memory regions and returns PA
-mappings sufficient to execute kernels and issue DMA requests.
+디바이스 할당은 PE 로컬 메모리 영역을 선택하고 커널 실행 및 DMA 요청
+발행에 충분한 PA 매핑을 반환한다.

-PA model is retained primarily for backward compatibility with PA-only
-tests and as the underlying physical layer that VA / LA models resolve
-into.
+PA 모델은 주로 PA 단독 테스트와의 하위 호환성을 위해, 그리고 VA / LA
+모델이 해석되어 들어가는 기저 물리 계층으로 유지된다.

 ---

-### Address Model: VA (Virtual Address with MMU) — current default
+### 주소 모델: VA (MMU를 동반한 가상 주소) — 현재 기본값

-#### D-VA1. Virtual Address Model
+#### D-VA1. 가상 주소 모델

-- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
-- `TensorShard` does NOT carry a `va` field — shard VA is derived as
-  `va_base + offset_bytes`.
-- Kernels receive `va_base` as their pointer argument (via
-  `TensorArg.va_base`).
-- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
+- 각 텐서는 하나의 연속된 VA 범위(`TensorHandle.va_base`)를 가진다.
+- `TensorShard`는 `va` 필드를 가지지 **않는다** — 샤드 VA는
+  `va_base + offset_bytes`로 유도된다.
+- 커널은 포인터 인수로 `va_base`를 받는다(`TensorArg.va_base` 경유).
+- `DmaReadCmd.src_addr`와 `DmaWriteCmd.dst_addr`는 VA(PA가 아님)를 운반한다.

-#### D-VA2. PE_MMU Component
+#### D-VA2. PE_MMU 컴포넌트

-- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
-  (synchronous `translate()` called by PE_DMA).
-- Page-aligned dict lookup for O(1) VA → PA translation.
-- `tlb_overhead_ns` configurable per-access latency.
-- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
-  directly (preserves PA model for backward compatibility).
+- 하이브리드 설계: SimPy 컴포넌트(`MmuMapMsg`용 inbox) + 유틸리티
+  (PE_DMA가 호출하는 동기식 `translate()`).
+- 페이지 정렬 dict 조회로 O(1) VA → PA 변환.
+- `tlb_overhead_ns`로 접근당 레이턴시 설정 가능.
+- PageFault fallback: VA에 매핑이 없으면 PE_DMA가 그것을 PA로 직접
+  다룬다 (PA 모델과의 하위 호환성 유지).

-#### D-VA3. Mapping Installation
+#### D-VA3. 매핑 설치

-- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
-  fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
-  end-to-end.
-- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
-  cross-SIP mapping contamination for replicated tensors.
-- Mapping strategy based on `DPPolicy.cube`:
-  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
-    only. Each cube's PEs see only their local PA. No cross-cube
-    mapping installed.
-  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
-    mappings to all target cubes. Enables cross-PE and cross-cube
-    DMA.
+- `MmuMapMsg`는 패브릭을 순회한다: Host → PCIE_EP → IO_CPU (큐브 fan-out)
+  → M_CPU (PE fan-out) → NOC → PE_MMU. 레이턴시는 end-to-end로 측정된다.
+- `MmuMapMsg.target_sips`는 SIP 수준 라우팅을 제어하여 복제 텐서의
+  cross-SIP 매핑 오염을 방지한다.
+- `DPPolicy.cube`에 기반한 매핑 전략:
+  - **Replicate** (`cube="replicate"`): (sip, cube)별 로컬 매핑만.
+    각 큐브의 PE들은 자신의 로컬 PA만 본다. cross-cube 매핑은 설치되지
+    않는다.
+  - **Sharded** (`cube="column_wise"` 등): 모든 샤드 매핑을 모든 대상
+    큐브로 브로드캐스트. cross-PE 및 cross-cube DMA를 가능하게 한다.

-#### D-VA4. Tensor Lifecycle
+#### D-VA4. 텐서 라이프사이클

-- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
-  `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
-  returns VA and PA space.
-- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
-- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
-- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
-- `VirtualAllocator` uses free-list with coalescing for VA space.
+- `del tensor`는 `Tensor.__del__` + `RuntimeContext`에 대한 `weakref`를
+  통해 자동 정리를 트리거한다. 패브릭을 통해 `MmuUnmapMsg`를 보내고
+  VA와 PA 공간을 반환한다.
+- `with RuntimeContext(...) as ctx:`는 스코프 기반 일괄 정리를 제공한다.
+- `RuntimeContext._tensors`는 GC 방지를 피하기 위해 `weakref.ref`를 사용.
+- `PEMemAllocator`는 coalescing이 있는 free-list를 사용한다(bump allocator 아님).
+- `VirtualAllocator`는 VA 공간에 대해 coalescing이 있는 free-list를 사용한다.

-#### D-VA5. Allocators
+#### D-VA5. 할당기

-- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
-  with coalescing.
-- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
-  coalescing.
-- Page size configurable via `topology.yaml` `pe_mmu` attrs
-  (default 4096).
+- `VirtualAllocator`: 디바이스 전체의 VA 공간, coalescing을 동반한
+  페이지 정렬 alloc/free.
+- `PEMemAllocator`: PE별 HBM/TCM, coalescing을 동반한 free-list 기반
+  alloc/free.
+- 페이지 크기는 `topology.yaml`의 `pe_mmu` attrs로 설정 가능
+  (기본 4096).

-#### Consequences (VA model)
+#### Consequences (VA 모델)

-- Triton kernels use `base_addr + offset` patterns naturally on
-  sharded tensors.
-- All latency remains explicit via graph traversal, including MMU
-  mapping installation and per-access TLB overhead.
-- PA-only mode retained as fallback (PageFault → treat as PA).
-- IPCQ and other fixed-address resources bypass MMU (use PA directly).
+- Triton 커널은 샤딩된 텐서에 대해 `base_addr + offset` 패턴을 자연스럽게
+  사용한다.
+- 모든 레이턴시는 MMU 매핑 설치와 접근당 TLB 오버헤드를 포함하여
+  그래프 순회를 통해 명시적이다.
+- PA 단독 모드는 fallback으로 유지된다 (PageFault → PA로 처리).
+- IPCQ와 그 외 고정 주소 자원은 MMU를 우회한다 (PA 직접 사용).

 ---

-### Address Model: LA (Logical Address with BAAW) — proposed
+### 주소 모델: LA (BAAW를 동반한 논리 주소) — 제안됨

-LA replaces VA when channel-level HBM modelling is required.
-Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
-removed artifacts). Coexistence with VA in the same run is not a goal.
+LA는 채널 수준 HBM 모델링이 필요할 때 VA를 대체한다.
+이 모델을 채택하면 VA/MMU 인프라가 제거된다 (D-LA1이 제거되는 산출물을
+나열한다). 동일 실행 내에서 VA와의 공존은 목표가 아니다.

-#### D-LA1. LA introduction — replaces VA infrastructure
+#### D-LA1. LA 도입 — VA 인프라 대체

-LA is the sole address space used by kernel code (`tl.load`,
-`tl.store`, `tl.composite`). Properties:
+LA는 커널 코드(`tl.load`, `tl.store`, `tl.composite`)가 사용하는
+유일한 주소 공간이다. 속성:

-- Can map a Tensor to a contiguous logical space (like VA).
-- Expresses `(logical buffer + offset)`.
-- Does NOT contain physical channel information directly.
-- Stays as an intermediate abstraction until physical resolution.
+- Tensor를 연속된 논리 공간에 매핑할 수 있다 (VA처럼).
+- `(논리 버퍼 + offset)`을 표현한다.
+- 물리 채널 정보를 직접 포함하지 **않는다**.
+- 물리적 해석이 일어나기 전까지는 중간 추상화로 유지된다.

-LA address space:
+LA 주소 공간:

-| Item | Value |
+| 항목 | 값 |
 |------|-------|
-| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
-| LA space size | 64 GB per PE |
-| Alignment unit | segment (see D-LA3) |
+| LA 시작 | `0x1_0000_0000` (4 GB, 이전 VA 시작과 동일) |
+| LA 공간 크기 | PE당 64 GB |
+| 정렬 단위 | segment (D-LA3 참조) |

-LA is PE-local: different PEs may use the same LA value; BAAW segment
-tables differ → they resolve to different PAs.
+LA는 PE 로컬이다: 서로 다른 PE가 동일한 LA 값을 사용할 수 있지만,
+BAAW segment 테이블이 다르므로 서로 다른 PA로 해석된다.

-VA infrastructure removed when LA is adopted:
+LA가 채택되면 제거되는 VA 인프라:

-| Removed | Replacement |
+| 제거 | 대체 |
 |---------|-------------|
-| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
-| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
-| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
+| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (동일한 free-list 접근, 이름 변경) |
+| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment 테이블 (PE_DMA 내부) |
+| `components/builtin/pe_mmu.py` (PeMmuComponent) | 제거 — BAAW는 별도 컴포넌트가 아니라 PE_DMA 내부 로직 |
 | `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
 | `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
 | `runtime_api/tensor.py`: `va_base` | `la_base` |
-| `topology.yaml`: `pe_mmu` component entry | Removed |
+| `topology.yaml`: `pe_mmu` 컴포넌트 entry | 제거 |

-#### D-LA2. Mapping mode setting
+#### D-LA2. 매핑 모드 설정

-Topology-level (cube) configuration:
+토폴로지 수준(큐브) 설정:

 ```yaml
 cube:
   memory_map:
     hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
-    hbm_pseudo_channels: 64       # total pseudo channel count
-    hbm_channels_per_pe: 8        # per-PE local channel count
-    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
+    hbm_pseudo_channels: 64       # 전체 pseudo channel 수
+    hbm_channels_per_pe: 8        # PE당 로컬 채널 수
+    hbm_channel_bw_gbs: 32.0      # 채널당 대역폭
 ```

-Consumed by the graph compiler (topology builder) and BAAW
-initialisation.
+그래프 컴파일러(토폴로지 빌더)와 BAAW 초기화가 이 값을 소비한다.

-#### D-LA3. Segment and BAAW
+#### D-LA3. Segment와 BAAW

-Segment partitions the LA space; each segment maps to a specific HBM
-channel or channel group. Created at tensor deploy time by the runtime
-allocator. BAAW resolves LA → physical request(s) using the segment
-table.
+Segment는 LA 공간을 분할한다. 각 segment는 특정 HBM 채널 또는 채널
+그룹에 매핑된다. 텐서 deploy 시점에 런타임 할당기가 생성한다. BAAW는
+segment 테이블을 사용하여 LA → 물리 요청(들)로 해석한다.

 ```python
 @dataclass
 class BaawSegment:
-    la_base: int          # segment start LA
-    la_size: int          # segment size (bytes)
+    la_base: int          # segment 시작 LA
+    la_size: int          # segment 크기 (bytes)
     mode: str             # "one_to_one" | "n_to_one"
-    # 1:1 mode fields
-    channel_count: int    # channels assigned to this segment (e.g. 8)
-    pa_bases: list[int]   # per-channel PA bases (len = channel_count)
-    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
-    channel_size: int     # per-channel size (la_size // channel_count)
-    # n:1 mode fields
-    agg_pa_base: int      # aggregated PA base
-    agg_node_id: str      # aggregated router node_id
+    # 1:1 모드 필드
+    channel_count: int    # 이 segment에 할당된 채널 수 (예: 8)
+    pa_bases: list[int]   # 채널별 PA base (len = channel_count)
+    channel_ids: list[int]   # 채널별 논리 ID (예: [0..7])
+    channel_size: int     # 채널당 크기 (la_size // channel_count)
+    # n:1 모드 필드
+    agg_pa_base: int      # 집계 PA base
+    agg_node_id: str      # 집계 라우터 node_id
 ```

-Segment lifecycle:
+Segment 라이프사이클:

-1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
-   allocator. PEMemAllocator allocates per-channel PA (1:1) or
-   aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
-   with PE_DMA.
-2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
-   (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
-   converts to PA(s).
-3. **Free** (tensor free): segment removed from table; LA and PA
-   returned.
+1. **할당** (텐서 deploy): RuntimeContext가 LA allocator에서 LA를
+   할당한다. PEMemAllocator가 채널별 PA(1:1) 또는 집계 PA(n:1)를
+   할당한다. `BaawSegmentInstallMsg`가 segment를 PE_DMA에 등록한다.
+2. **사용** (커널 실행): 커널 `tl.load(la_ptr)` → `DmaReadCmd
+   (src_addr=LA)`. PE_DMA의 BAAW 프론트엔드가 segment를 조회하여
+   PA(들)로 변환한다.
+3. **해제** (텐서 free): segment가 테이블에서 제거되고 LA와 PA가
+   반환된다.

-#### D-LA4. BAAW resolution logic
+#### D-LA4. BAAW 해석 로직

-BAAW is a front-end stage inside PE_DMA, not a separate SimPy
-component. Synchronous address-resolution logic executed at the start
-of PE_DMA's `handle_command()`.
+BAAW는 PE_DMA 내부의 프론트엔드 단계이며, 별도의 SimPy 컴포넌트가 아니다.
+PE_DMA의 `handle_command()` 시작 시점에 실행되는 동기식 주소 해석 로직.

-Input: `(LA, nbytes)`. Output:
+입력: `(LA, nbytes)`. 출력:

-- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
-- **n:1 mode**: single `PhysicalRequest`.
+- **1:1 모드**: `list[PhysicalRequest]` — 채널당 하나.
+- **n:1 모드**: 단일 `PhysicalRequest`.

 ```python
 @dataclass
 class PhysicalRequest:
-    pa: int           # 51-bit Physical Address
-    nbytes: int       # transfer size for this request
-    dst_node: str     # target node_id (channel router or aggregated router)
+    pa: int           # 51-bit 물리 주소
+    nbytes: int       # 이 요청의 전송 크기
+    dst_node: str     # 대상 node_id (채널 라우터 또는 집계 라우터)


 def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
@@ -305,65 +290,65 @@ def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
     return requests
 ```

-BAAW responsibilities:
+BAAW의 책임:

-- Convert logical access → physical request units.
-- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
-- Compute PA and target node.
+- 논리 접근 → 물리 요청 단위로 변환.
+- 모드에 따라 fan-out(1:1) 또는 pass-through(n:1) 적용.
+- PA와 대상 노드 계산.

-BAAW non-responsibilities:
+BAAW가 하지 않는 것:

-- Performing actual data movement.
-- Executing NOC routing.
-- Simulating bandwidth occupation (downstream components' job).
+- 실제 데이터 이동 수행.
+- NOC 라우팅 실행.
+- 대역폭 점유 시뮬레이션 (하위 컴포넌트의 역할).

-BAAW output is directly usable by the simulator's routing and resource
-model without additional address decoding.
+BAAW의 출력은 추가적인 주소 디코딩 없이 시뮬레이터의 라우팅·자원
+모델에서 바로 사용 가능하다.

-#### D-LA5. PE_DMA `handle_command()` change
+#### D-LA5. PE_DMA `handle_command()` 변경

-Current (VA-based) flow:
+현재(VA 기반) 흐름:

 ```
 DmaReadCmd.src_addr (VA)
   → MMU.translate(VA) → PA
-  → PhysAddr.decode(PA) → PhysAddr object
+  → PhysAddr.decode(PA) → PhysAddr 객체
   → resolver.resolve(PhysAddr) → dst_node_id
   → router.find_path(pe_prefix, dst_node_id) → path
-  → 1 sub-Transaction → fabric inject
+  → 1 sub-Transaction → 패브릭 주입
 ```

-LA-based flow:
+LA 기반 흐름:

 ```
 DmaReadCmd.src_addr (LA)
   → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
-  → for each PhysicalRequest:
+  → 각 PhysicalRequest에 대해:
       → router.find_path(pe_prefix, req.dst_node) → path
       → compute_drain_ns(path, req.nbytes) → drain
-      → sub-Transaction → fabric inject
-  → await all sub-Transactions
+      → sub-Transaction → 패브릭 주입
+  → 모든 sub-Transaction 대기
   → pe_txn.done.succeed()
 ```

-Key changes:
+주요 변경:

-- MMU reference removed → BAAW resolve.
-- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
-  directly.
-- 1 request → N parallel requests in 1:1 mode.
+- MMU 참조 제거 → BAAW resolve.
+- `PhysAddr.decode()` + `resolver.resolve()` → BAAW가 `dst_node`를
+  직접 반환.
+- 1 요청 → 1:1 모드에서 N개의 병렬 요청.

-#### D-LA6. 1:1 mode detail
+#### D-LA6. 1:1 모드 상세

-- One logical access → N physical requests (N = `channels_per_pe`).
+- 하나의 논리 접근 → N개의 물리 요청 (N = `channels_per_pe`).
 - N = `hbm_pseudo_channels / pes_per_cube`.
-- Each request: fully-resolved 51-bit PA, targets a specific channel
-  router (`{pe_prefix}.ch_r{channel_id}`).
-- Per-channel link models BW contention.
-- PE_DMA injects N sub-transactions concurrently.
+- 각 요청: 완전히 해석된 51-bit PA, 특정 채널 라우터를 대상으로 함
+  (`{pe_prefix}.ch_r{channel_id}`).
+- 채널별 링크가 대역폭 경쟁을 모델링.
+- PE_DMA가 N개의 sub-transaction을 동시에 주입.

-Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
-PE0 owns ch0-7.
+예: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
+PE0은 ch0-7을 소유.

 ```text
 Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
@@ -375,32 +360,32 @@ BAAW segment: {
     channel_size: 512,
 }

-BAAW resolve result (8 requests):
+BAAW resolve 결과 (8 요청):
   → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
   → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
   → ...
   → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

-PE_DMA: 8 sub-transactions parallel inject
-  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
-  Total effective BW = 8 × channel_bw_gbs
+PE_DMA: 8개 sub-transaction 병렬 주입
+  채널별 라우터 → hbm_ctrl 링크 (channel_bw_gbs) per channel
+  전체 유효 BW = 8 × channel_bw_gbs
 ```

-Other N values:
+다른 N 값:

 - `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
-  4 requests
+  4 요청
 - `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
-  16 requests
+  16 요청

-#### D-LA7. n:1 mode detail
+#### D-LA7. n:1 모드 상세

-- One logical access → one aggregated request.
-- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
-- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
-  (e.g. 8 × 32 = 256 GB/s).
-- Single queue / resource for modelling.
-- No per-channel PA decomposition.
+- 하나의 논리 접근 → 하나의 집계 요청.
+- 대상: 집계 라우터 → hbm_ctrl (ADR-0017 D8 참조).
+- 집계 링크 BW = `channels_per_pe × channel_bw_gbs`
+  (예: 8 × 32 = 256 GB/s).
+- 모델링을 위한 단일 큐 / 자원.
+- 채널별 PA 분해 없음.

 ```text
 Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
@@ -411,111 +396,108 @@ BAAW segment: {
     agg_node_id: "sip0.cube0.pe0.agg_router",
 }

-BAAW resolve result:
+BAAW resolve 결과:
   → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

 PE_DMA: 1 sub-transaction
-  aggregated router → hbm_ctrl link (256 GB/s)
+  집계 라우터 → hbm_ctrl 링크 (256 GB/s)
 ```

-#### D-LA8. Kernel model preserved
+#### D-LA8. 커널 모델 보존

-- Kernel still issues single memory ops (`tl.load`, `tl.store`,
-  `tl.composite`).
-- LA is the address scheme exposed to kernel code.
-- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
-- Kernel code never sees physical channel information.
+- 커널은 여전히 단일 메모리 op(`tl.load`, `tl.store`,
+  `tl.composite`)을 발행한다.
+- LA가 커널 코드에 노출되는 주소 체계이다.
+- 채널 분해·집계는 PE_DMA의 BAAW 내부에서 일어난다.
+- 커널 코드는 물리 채널 정보를 절대 보지 않는다.

-#### Consequences (LA model, proposed)
+#### Consequences (LA 모델, 제안됨)

-Positive:
+긍정적:

-- 1:1 vs n:1 semantics live in one place (BAAW).
-- Kernel abstraction preserved — no kernel code changes.
-- Topology-based policy control (mode switch via yaml).
-- Improved simulation-model consistency and debuggability.
-- Segment-based mapping is simpler than page tables; lower overhead.
+- 1:1 vs n:1 시맨틱이 한 곳(BAAW)에 모인다.
+- 커널 추상화 보존 — 커널 코드 변경 없음.
+- 토폴로지 기반 정책 제어 (yaml로 모드 전환).
+- 시뮬레이션 모델의 정합성·디버깅 가능성 향상.
+- Segment 기반 매핑이 페이지 테이블보다 단순하며 오버헤드도 적다.

-Negative:
+부정적:

-- Full VA/MMU code refactor required.
-- Request-generation path more complex (N requests in 1:1 mode).
-- Reduced per-channel visibility in n:1 mode.
-- VA-related tests need rewriting.
+- 전체 VA/MMU 코드 리팩터가 필요하다.
+- 요청 생성 경로가 더 복잡 (1:1 모드에서 N 요청).
+- n:1 모드에서 채널별 가시성 감소.
+- VA 관련 테스트 재작성 필요.

 ---

 ## Migration Path

-- **PA → VA** was an extension. PA mode is retained as the PageFault
-  fallback inside PE_DMA. Switching does not require removing PA
-  code.
-- **VA → LA**, if adopted, is a replacement, not coexistence. See
-  D-LA1 for the VA infrastructure removal list. PA fallback inside
-  PE_DMA may be retained orthogonally for tests.
+- **PA → VA**는 확장이었다. PA 모드는 PE_DMA 내부의 PageFault fallback으로
+  유지된다. 전환은 PA 코드 제거를 요구하지 않는다.
+- **VA → LA**는, 채택될 경우, 공존이 아닌 대체이다. VA 인프라 제거
+  목록은 D-LA1 참조. PA fallback은 테스트를 위해 PE_DMA 내부에 직교적으로
+  유지될 수 있다.

-## Alternatives Considered (LA model)
+## Alternatives Considered (LA 모델)

-1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
-   Rejected: MMU's role would grow beyond translation to request
-   decomposition; aggregation (n:1) becomes awkward to express.
-2. **Channel-aware kernel API**: kernels call per-channel load/store
-   directly. Rejected: abstraction leakage, portability loss, all
-   benchmarks need rewriting.
-3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
-   directly. Rejected: incompatible with aggregation; conversion
-   timing unclear; channel info leaks to kernel.
+1. **VA 유지 + MMU에서 fan-out**: MMU가 채널별 PA를 반환한다.
+   기각: MMU의 역할이 변환을 넘어 요청 분해까지 확장되며, 집계(n:1)를
+   표현하기 어색해진다.
+2. **채널 인지 커널 API**: 커널이 채널별 load/store를 직접 호출한다.
+   기각: 추상화 누출, 이식성 손실, 모든 벤치마크 재작성 필요.
+3. **항상 PA (LA 없음)**: 런타임이 커널에 채널별 PA를 직접 전달한다.
+   기각: 집계와 양립 불가; 변환 시점이 불명확; 채널 정보가 커널로 누출.

 ## Test Requirements

-### VA model (current, regression)
+### VA 모델 (현재, regression)

-- Cross-PE / cross-cube DMA paths over installed mappings.
-- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
-- TLB-overhead-per-access timing.
-- PageFault fallback path preserves PA-only behaviour.
+- 설치된 매핑을 따라 cross-PE / cross-cube DMA 경로.
+- 측정된 레이턴시를 동반한 `MmuMapMsg` / `MmuUnmapMsg`의 패브릭 순회.
+- 접근당 TLB 오버헤드 타이밍.
+- PageFault fallback 경로가 PA 단독 동작을 보존하는지.

-### LA model (when implemented)
+### LA 모델 (구현 시)

-- 1:1 mode: same logical access → N per-channel requests.
-- n:1 mode: same logical access → 1 aggregated request.
-- Bandwidth equivalence between modes for identical workload.
-- 1:1 mode: per-channel contention modelled correctly.
-- n:1 mode: aggregated bandwidth correctly reflected.
-- Kernel code unchanged across mode switch.
-- BAAW segment install / uninstall correctness.
-- Multiple tensors in distinct segments do not collide.
+- 1:1 모드: 동일 논리 접근 → N개의 채널별 요청.
+- n:1 모드: 동일 논리 접근 → 1개의 집계 요청.
+- 동일 워크로드에 대해 두 모드 사이의 대역폭 동치.
+- 1:1 모드: 채널별 경쟁이 올바르게 모델링됨.
+- n:1 모드: 집계된 대역폭이 올바르게 반영됨.
+- 모드 전환에 걸쳐 커널 코드가 변경되지 않음.
+- BAAW segment install / uninstall 정확성.
+- 별개 segment 안의 여러 텐서가 충돌하지 않음.

-## Implementation Order (LA, when scheduled)
+## Implementation Order (LA, 일정 잡힐 때)

-1. LA type (`policy/address/la_allocator.py`).
-2. BAAW segment table (`policy/address/baaw.py`).
+1. LA 타입 (`policy/address/la_allocator.py`).
+2. BAAW segment 테이블 (`policy/address/baaw.py`).
 3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
-4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
+4. PE_DMA BAAW 통합 (`components/builtin/pe_dma.py`
    `handle_command()`).
 5. RuntimeContext: LA alloc + segment install
    (`runtime_api/context.py`).
 6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
-7. Remove VA/MMU code.
-8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
-9. Test migration:
+7. VA/MMU 코드 제거.
+8. `topology.yaml`에서 `pe_mmu` 제거; 매핑 모드 설정 추가.
+9. 테스트 이전:

-| Test file | Action |
+| 테스트 파일 | 조치 |
 |-----------|--------|
-| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
-| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
-| `tests/test_pe_mmu.py` | Remove |
-| `tests/test_va_allocator.py` | Replace with LA allocator tests |
-| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
-| `tests/test_va_offset.py` | Replace with LA offset tests |
+| `tests/test_mmu_component.py` | 제거 → BAAW segment install 테스트 |
+| `tests/test_mmu_fabric.py` | 제거 → BAAW + 패브릭 통합 테스트 |
+| `tests/test_pe_mmu.py` | 제거 |
+| `tests/test_va_allocator.py` | LA allocator 테스트로 교체 |
+| `tests/test_va_integration.py` | LA + BAAW 통합 테스트로 교체 |
+| `tests/test_va_offset.py` | LA offset 테스트로 교체 |

 ## Links

-- ADR-0007 (runtime_api vs sim_engine boundaries)
-- ADR-0008 (tensor deployment)
-- ADR-0009 (kernel execution)
-- ADR-0014 (PE-internal execution model)
-- ADR-0015 (component port/wire model)
-- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
-- ADR-0013 (Verification strategy — V1 PA tagging)
-- SPEC R2 (latency by traversal), R10 (memory addressing)
+- ADR-0007 (runtime_api vs sim_engine 경계)
+- ADR-0008 (텐서 배포)
+- ADR-0009 (커널 실행)
+- ADR-0014 (PE 내부 실행 모델)
+- ADR-0015 (컴포넌트 포트/와이어 모델)
+- ADR-0017 (큐브 NOC와 HBM 연결성 — LA 모델 토폴로지 소비자)
+- ADR-0013 (검증 전략 — V1 PA 태깅)
+- SPEC R2 (순회 기반 레이턴시), R10 (메모리 주소 지정)
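
The 1:1 and n:1 examples in D-LA6 and D-LA7 can be reproduced with a small sketch of the resolution step. The body of `resolve()` is not shown in this diff, so everything below is an assumption: `Segment`, `Request`, `resolve_sketch`, and the `pe_prefix` parameter are hypothetical stand-ins for the ADR's `BaawSegment`, `PhysicalRequest`, and `resolve()`. The sketch also assumes that a 1:1 segment is cut into equal contiguous per-channel slices and that `la_size` divides evenly by the channel count.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:  # hypothetical, trimmed-down stand-in for BaawSegment
    la_base: int
    la_size: int
    mode: str  # "one_to_one" | "n_to_one"
    pa_bases: list[int] = field(default_factory=list)
    channel_ids: list[int] = field(default_factory=list)
    agg_pa_base: int = 0
    agg_node_id: str = ""


@dataclass
class Request:  # mirrors the PhysicalRequest fields
    pa: int
    nbytes: int
    dst_node: str


def resolve_sketch(seg: Segment, pe_prefix: str, la: int, nbytes: int) -> list[Request]:
    offset = la - seg.la_base
    if offset < 0 or offset + nbytes > seg.la_size:
        raise ValueError("access falls outside the segment")
    if seg.mode == "n_to_one":
        # n:1: a single aggregated request, no per-channel decomposition.
        return [Request(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
    # 1:1: split the access at per-channel slice boundaries (assumed layout).
    channel_size = seg.la_size // len(seg.channel_ids)
    requests = []
    while nbytes > 0:
        idx, within = divmod(offset, channel_size)
        chunk = min(nbytes, channel_size - within)
        node = f"{pe_prefix}.ch_r{seg.channel_ids[idx]}"
        requests.append(Request(seg.pa_bases[idx] + within, chunk, node))
        offset += chunk
        nbytes -= chunk
    return requests


# The 4 KB tensor from D-LA6 on a PE that owns 8 channels (PA bases are made up).
seg = Segment(
    la_base=0x1_0000_0000,
    la_size=4096,
    mode="one_to_one",
    pa_bases=[0x1000 * i for i in range(8)],
    channel_ids=list(range(8)),
)
requests = resolve_sketch(seg, "sip0.cube0.pe0", 0x1_0000_0000, 4096)
print(len(requests), requests[0].nbytes)  # 8 512
```

With the D-LA2 values (`hbm_channels_per_pe: 8`, `hbm_channel_bw_gbs: 32.0`), both modes target the same aggregate of 8 × 32 = 256 GB/s. Only the number of requests differs.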