ADR: translate adr-ko/ to Korean, fix ADR-0013 slug, refine Status check

Follow-up to the bilingual-structure commit: docs/adr-ko/ now holds only Korean versions (24 files translated from English placeholders), ADR-0013 slug uses kebab-case in both folders, and the verify tool allows translated parenthetical commentary in the Status block.

- Translate 24 English files in docs/adr-ko/ to Korean. The previous bilingual-structure commit had left these as English copies because their source content was already English; this commit fulfills the policy that docs/adr-ko/ contains only Korean.
- Rename ADR-0013 in both adr/ and adr-ko/ from ver-verification_strategy.md to ver-verification-strategy.md (kebab-case consistency with other ADRs).
- CLAUDE.md (ADR Translation Discipline): clarify that only the Status lifecycle keyword (Accepted / Proposed / Stub / Draft / Superseded by ADR-NNNN / Merged into ADR-NNNN) must match across EN and KO; parenthetical commentary and trailing list items may be translated.
- tools/verify_adr_lang_pairs.py: replace byte-equal Status check with normalize_status_keyword() which strips parenthetical commentary and takes only the first non-empty line.
- tests/test_verify_adr_lang_pairs.py: update existing test names, add coverage for translated parenthetical, translated trailing list, and Superseded-by-NNNN keyword equality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 08:17:56 -07:00
parent a796c1d2f7
commit 168b0c89f0
29 changed files with 2631 additions and 2651 deletions
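
The Status rule in the commit message (compare only the lifecycle keyword, ignoring parenthetical commentary and trailing list items) can be illustrated with a small sketch. The body of tools/verify_adr_lang_pairs.py is not shown in this view, so everything below except the function name `normalize_status_keyword` is an assumption: the regex, the order of operations, and the handling of ASCII parentheses only are illustrative choices, not the committed implementation.

```python
import re

# Hypothetical sketch: matches one "( ... )" group plus the whitespace before it.
# Only ASCII parentheses are handled here; that is an assumption.
_PARENTHETICAL = re.compile(r"\s*\([^()]*\)")


def normalize_status_keyword(status_block: str) -> str:
    """Return the lifecycle keyword of a Status block.

    Takes the first non-empty line and strips parenthetical commentary,
    so that EN and KO files can differ in translated remarks.
    """
    for line in status_block.splitlines():
        if line.strip():
            return _PARENTHETICAL.sub("", line).strip()
    return ""  # empty Status block


# An English and a Korean Status block that differ only in commentary.
en = "Accepted (after design review)\n- follow-up item\n"
ko = "Accepted (설계 리뷰 이후)\n- 후속 항목\n"
same_keyword = normalize_status_keyword(en) == normalize_status_keyword(ko)
```

Under these assumptions `same_keyword` is true, while a block such as `Superseded by ADR-0012` would still differ from `Accepted`, which is the equality the updated tests are described as covering.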
@@ -1,4 +1,4 @@
-# ADR-0033 — Latency Model: Assumptions and Known Simplifications
+# ADR-0033 — 레이턴시 모델: 가정 및 알려진 단순화

 ## Status

@@ -6,157 +6,147 @@ Accepted

 ## Context

-The simulator is an analytical, event-driven performance model — not a
-cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
-or omitted by design. To keep the model auditable and reviewable as a whole,
-this ADR consolidates the assumptions in one place. Individual component ADRs
-(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
-the *limits of fidelity*.
+이 시뮬레이터는 분석적·이벤트 기반 성능 모델이지, 사이클 정확(cycle-accurate)
+시뮬레이터나 RTL 수준 시뮬레이터가 아니다. 실제 HW의 많은 효과들이 설계상
+근사되거나 생략되었다. 모델 전체를 감사·리뷰할 수 있도록 유지하기 위해,
+본 ADR은 그런 가정들을 한 곳에 통합한다. 개별 컴포넌트 ADR(ADR-0015,
+ADR-0017, ADR-0004)들이 *메커니즘*을 정의하고, 본 문서는 *충실도의 한계*를
+정의한다.

 ## Decisions

-### D1. Modeled precisely
+### D1. 정밀하게 모델링되는 것

- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
+- **방향 에지별 BW 점유** (`available_at`을 통한 FIFO 직렬화) —
  ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
-  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
-  (`burst_bytes`, default 256B). Read and write share each PC's
-  `available_at` (real HW command bus is per-PC shared).
- **HBM direction switching penalty mechanism**: per-PC last-direction
-  tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
-  with payload into `Flit` objects of `flit_bytes` (default = HBM
-  `burst_bytes` = 256B). The wire emits each flit individually after
-  `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
-  flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
-  is the *only* conduit between `src.out_ports[dst]` and
-  `dst.in_ports[src]`. Earlier the two were aliased to the same
-  `simpy.Store`; when the wire put a chunkified flit back, the
-  destination's `fan_in` could pull it before the wire applied
-  bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
-  forward each flit serially with per-transaction overhead applied
-  ONCE on the first-flit arrival (header decode model). Subsequent
-  flits pipeline through with no extra delay. Wormhole emerges
-  naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
-  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
-  with the `is_last` flit waiting for the last PC commit before
-  signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
-  ``_fan_in``** before the legacy `_forward_txn` path runs. This
-  preserves backward compatibility for components that have not yet
-  been migrated to flit-aware processing (e.g., `MCpuComponent`,
-  `IoCpuComponent` sub-txn generators). Such components reassemble
-  *once per leg boundary*, NOT per hop — multi-hop wormhole timing
-  through a chain of flit-aware routers is preserved.
+- **컴포넌트별 스위칭/오버헤드 레이턴시** (`overhead_ns` attr).
+- **HBM pseudo-channel별 병렬성**: 주소 기반 PC 선택을 동반한
+  stateless `pc_avail[N]` 배열로 (ADR-0034 D3). 버스트 granularity는 조정 가능
+  (`burst_bytes`, 기본 256B). 각 PC의 `available_at`은 read와 write가 공유한다
+  (실제 HW의 명령 버스가 PC별로 공유되기 때문).
+- **HBM 방향 전환 페널티 메커니즘**: PC별 last-direction 추적 +
+  설정 가능한 `switch_penalty_ns`. 기본값 0 — D2 참조.
+- **와이어 청크 스트리밍 (Phase 2c)**: 각 와이어는 payload가 있는
+  Transaction을 `flit_bytes` 단위의 `Flit` 객체로 분해한다(기본 = HBM
+  `burst_bytes` = 256B). 와이어는 각 flit을 `prop_ns + flit_nbytes/bw_gbs`
+  이후에 개별적으로 방출하므로 링크의 대역폭이 실제 HW의 wormhole 시맨틱대로
+  flit 도착률을 조절한다.
+- **방향 에지별로 분리된 Store** (Phase 2c 핵심 수정): 와이어는
+  `src.out_ports[dst]`와 `dst.in_ports[src]` 사이의 *유일한* 통로이다.
+  이전에는 둘이 동일한 `simpy.Store`로 별칭되어 있었다. 와이어가 청크화된
+  flit을 되돌려 넣을 때 목적지의 `fan_in`이 와이어가 대역폭 지연을 적용하기
+  전에 그것을 끌어가, flit의 절반이 병목을 우회할 수 있었다.
+- **Flit 인지 pass-through** (`TransitComponent`, `HbmCtrlComponent`):
+  각 flit을 직렬로 전달하며 트랜잭션 오버헤드는 첫 flit 도착 시점에 한 번만
+  적용된다(헤더 디코드 모델). 이후의 flit들은 추가 지연 없이 파이프라인을
+  통과한다. 다중 hop 경로 전반에서 wormhole이 자연스럽게 발현된다.
+- **HBM CTRL의 flit별 PC commit**: HBM CTRL에 도착하는 각 flit은
+  `max(env.now, pc_avail[pc]) + chunk_time`에 PC commit을 스케줄하며,
+  `is_last` flit이 마지막 PC commit을 기다린 후 `txn.done`을 신호한다.
+- **Flit 비인지 컴포넌트(기본)는 ``_fan_in``에서 flit을 재조립**하여
+  레거시 `_forward_txn` 경로가 실행되도록 한다. 이는 아직 flit 인지
+  처리로 마이그레이션되지 않은 컴포넌트(예: `MCpuComponent`,
+  `IoCpuComponent`의 sub-txn 생성기)에 대한 하위 호환성을 보존한다. 그런
+  컴포넌트들은 *leg 경계마다 한 번* 재조립하며, hop마다는 아니다 —
+  flit 인지 라우터 체인을 통한 다중 hop wormhole 타이밍이 보존된다.

-### D2. Approximated (with known directional error)
+### D2. 근사됨 (알려진 방향성 오차와 함께)

-| Effect | Real HW | Our model | Error direction |
+| 효과 | 실제 HW | 본 모델 | 오차 방향 |
 |--------|---------|-----------|----------------|
-| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
-| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense — default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
-| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
-| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
+| 라우터 출력 포트 중재 | Round-robin / weighted | 와이어 에지 FIFO + 직렬 워커 | 사이클당 한 txn일 때 공정; multi-stream 공유는 flit 수준에서 모델링 안 됨 |
+| HBM 스케줄러 / 쓰기 버퍼 | FR-FCFS + watermark drain | FIFO, 재정렬 없음 | 교번이 조밀한 혼합 R/W에 대해 비관적 — 기본 `switch_penalty_ns = 0`은 이상적 스케줄러가 amortize한다고 가정 |
+| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | sub-flit 미세 타이밍 노이즈; 매우 작은 와이어 중재 윈도우에서만 영향 |
+| 와이어 수준 RR 공정성 | 공유 링크에서 사이클별 multi-flow 중재 | 에지마다 단일 직렬 와이어 프로세스 | 주어진 에지에 한 트랜잭션만 in-flight일 때만 공정. 동일 에지에서 동시 멀티 스트림 트래픽은 FIFO 순서로 직렬화됨 |

-### D3. Ignored (out of scope)
+### D3. 무시됨 (범위 외)

- Bank-level row buffer conflict penalty (assume no conflicts — best case;
-  the model has no per-bank state within a PC, so same-bank reuse cannot be
-  detected).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
-  `burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use
-  unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
-  smallest unit).
+- 뱅크 수준의 row buffer 충돌 페널티 (충돌 없음 가정 — 최적 케이스;
+  모델은 PC 내부에 뱅크별 상태를 갖지 않으므로 동일 뱅크 재사용을 감지할 수 없다).
+- HBM tRP / tRCD / tFAW / tRC 타이밍 제약 (정상 상태의
+  `burst_time = burst_bytes / pc_bw_gbs`에 흡수).
+- 리프레시, ECC, 열 throttling, 전력 게이팅.
+- 클럭 도메인 교차, PLL lock 시간.
+- 하위 버퍼 점유로 인한 상위 backpressure (입력 포트는 unbounded
+  `simpy.Store`를 사용).
+- 라우터에서의 sub-flit 사이클 수준 중재 (flit granularity가 본 모델의
+  최소 단위).

-### D4. Workload sensitivity
+### D4. 워크로드 민감도

-Workloads where the above simplifications meaningfully affect results:
+위 단순화들이 결과에 의미 있게 영향을 미치는 워크로드:

- **Random scatter/gather**: bank conflict ignored → model optimistic.
- **Heavy mixed R/W intensive** (e.g., GEMM bias accumulation): HBM scheduler
-  absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
-  setting it non-zero models pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
-  limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
-  flit level, so per-flow fairness within a single edge is not modeled.
-  Pre-edge merging (multiple sources arriving at a router and being
-  forwarded to the same downstream wire) is correctly modeled via the
-  flit-aware router's serial worker.
+- **무작위 scatter/gather**: 뱅크 충돌 무시 → 모델이 낙관적.
+- **혼합 R/W가 강한 워크로드** (예: GEMM 바이어스 누적): HBM 스케줄러
+  부재. 기본 `switch_penalty_ns = 0`은 이상적 amortization을 가정;
+  0이 아닌 값은 교번당 비관적 비용을 모델링.
+- **고동시성 (한 링크에 활성 흐름 >10개)**: HoL blocking과 VC 제한이
+  모델링되지 않음 → 모델이 낙관적.
+- **매우 작은(sub-flit) 트랜잭션**: flit 양자화 노이즈.
+- **단일 와이어상의 동시 multi-flow**: 와이어는 flit 수준에서 직렬
+  FIFO이므로 단일 에지 내에서의 흐름별 공정성은 모델링되지 않는다.
+  Pre-edge 병합(여러 source가 라우터에 도착하여 동일한 downstream
+  와이어로 전달되는 경우)은 flit 인지 라우터의 직렬 워커를 통해 올바르게
+  모델링된다.

-### D5. Verification policy
+### D5. 검증 정책

-For workloads in D4, cross-check against real HW or a cycle-accurate
-simulator before drawing absolute-magnitude conclusions. The model remains
-accurate for **relative comparisons** within the modeled regime.
+D4의 워크로드에 대해 절대값 결론을 내리기 전에 실제 HW나 사이클 정확
+시뮬레이터와 cross-check 할 것. 모델은 모델링된 영역 내에서의 **상대적
+비교**에 대해서는 여전히 정확하다.

-### D6. Future work
+### D6. 향후 작업

-Note: multi-stream merging at routers IS modeled correctly — each
-in_port has its own fan_in process, all push to a shared inbox, and
-the router worker forwards in inbox FIFO order. Flits from different
-upstream streams naturally interleave at flit granularity. The items
-below are different concerns, ordered by expected workload impact.
+참고: 라우터에서의 multi-stream 병합은 올바르게 모델링되고 있다 — 각
+in_port가 자신의 fan_in 프로세스를 가지며 모두 공유 인박스로 push하고,
+라우터 워커가 인박스 FIFO 순서로 전달한다. 서로 다른 상위 스트림의 flit들이
+flit granularity에서 자연스럽게 인터리브된다. 아래 항목들은 별개의 관심사이며,
+예상되는 워크로드 영향 순으로 정렬되어 있다.

-**Higher impact (workload accuracy gap)**:
+**영향이 큼 (워크로드 정확도 격차)**:

- [ ] **Bank-level conflict modeling** within a PC (opt-in via
-  `track_banks: true`). Currently we assume no same-bank reuse;
-  random scatter/gather workloads are optimistic here.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
-  from the design discussion). Default `switch_penalty_ns=0` is the
-  ideal-amortization stand-in; bursty mixed R/W workloads benefit
-  from explicit modeling.
- [ ] **Backpressure** modeling for finite component buffers. Matters
-  at high concurrency / sustained saturation where buffer occupancy
-  causes upstream stalls.
- [ ] **Op_log integration with chunk-streaming**: currently op_log
-  fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
-  GemmCmd, MathCmd) which are not chunkified. Integration would
-  require flit-aware components to also emit op_log start/end hooks
-  per transaction (start on first flit, end on is_last).
+- [ ] PC 내의 **뱅크 수준 충돌 모델링** (`track_banks: true`로 opt-in).
+  현재는 동일 뱅크 재사용이 없다고 가정; 무작위 scatter/gather 워크로드는
+  이 부분에서 낙관적이다.
+- [ ] write buffer + watermark drain을 동반한 **HBM 스케줄러** (설계
+  논의에서의 Tier 2). 기본 `switch_penalty_ns=0`은 이상적 amortization의
+  stand-in; 버스티한 혼합 R/W 워크로드는 명시적 모델링으로부터 이득을 본다.
+- [ ] 유한한 컴포넌트 버퍼에 대한 **Backpressure** 모델링. 버퍼 점유가
+  상위 stall을 유발하는 고동시성/지속적 포화 상황에서 중요.
+- [ ] **청크 스트리밍과 op_log 통합**: 현재 op_log는 청크화되지 않는
+  PE 내부 명령 메시지(DmaReadCmd, DmaWriteCmd, GemmCmd, MathCmd)에 대해
+  발화한다. 통합은 flit 인지 컴포넌트들이 트랜잭션당 op_log start/end
+  hook(첫 flit에 start, is_last에 end)을 함께 방출하도록 요구한다.

-**Lower impact (academic / specific use cases)**:
+**영향이 작음 (학술적 / 특정 use case)**:

- [ ] **Cycle-accurate router arbitration policies** (RR with
-  priorities, age, iSLIP). The FIFO inbox is already approximately
-  fair when flit arrival times differ slightly between streams (the
-  common case for similar-rate workloads). True impact appears only
-  for: (a) priority/QoS modeling, (b) per-stream tail latency
-  analysis under sustained saturation. Not critical for makespan or
-  average-latency studies.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
-  cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
-  per 32B flit. Effect is small for most workloads (sub-flit timing
-  noise on small messages).
+- [ ] **사이클 정확 라우터 중재 정책** (우선순위·age를 동반한 RR, iSLIP).
+  FIFO 인박스는 스트림 간 flit 도착 시간이 약간씩 다를 때 이미 근사적으로
+  공정하다(유사한 비율의 워크로드에서 흔한 경우). 실질적 영향은 (a)
+  우선순위/QoS 모델링, (b) 지속적 포화에서의 스트림별 tail latency 분석에서만
+  나타난다. makespan이나 평균 레이턴시 연구에는 결정적이지 않음.
+- [ ] 더 미세한 와이어 중재 사이클을 위한 **Sub-flit (32B) granularity**.
+  본 모델의 `flit_bytes`는 burst(256B)와 같지만, 실제 HW는 32B flit마다
+  중재한다. 대부분 워크로드에서는 영향이 작다(작은 메시지에 대한 sub-flit
+  타이밍 노이즈).

 ## Consequences

- Single review point for all model fidelity questions. Each future PR
-  touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
-  manual consistency.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
-  per-flit timing) rather than via terminal `drain_ns` injection. Single
-  transactions land at `drain + commit_time + small_overheads`; multi-hop
-  preserves wormhole pipelining; multi-stream merge correctly serializes
-  at the shared wire's FIFO.
+- 모든 모델 충실도 질문에 대한 단일 리뷰 지점. 레이턴시를 건드리는 향후
+  모든 PR은 본 문서의 해당 절을 갱신해야 한다.
+- 워크로드별 규모 오차 envelope이 명시적이다.
+- 빌더측 `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` 유도가
+  yaml의 수동 일관성에 의존하지 않고 코드 내에서 ADR-0017 D8의 불변성을
+  강제한다.
+- 와이어 전송 시간은 터미널의 `drain_ns` 주입을 통해서가 아니라
+  병목 링크 통과당 한 번 부과된다(Phase 2c flit별 타이밍). 단일 트랜잭션은
+  `drain + commit_time + small_overheads`에 도달; 다중 hop은 wormhole
+  파이프라이닝을 보존; multi-stream 병합은 공유 와이어의 FIFO에서 올바르게
+  직렬화된다.

 ## Cross-references

- ADR-0015 — component / port / wire model.
- ADR-0017 — Cube NOC architecture and HBM connectivity.
- ADR-0004 — memory semantics, local HBM.
- ADR-0034 — HBM controller internal design.
+- ADR-0015 — 컴포넌트 / 포트 / 와이어 모델.
+- ADR-0017 — 큐브 NOC 아키텍처 및 HBM 연결성.
+- ADR-0004 — 메모리 시맨틱, 로컬 HBM.
+- ADR-0034 — HBM 컨트롤러 내부 설계.
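
The timing expressions quoted in the ADR (`prop_ns + flit_nbytes/bw_gbs` for a wire, `max(env.now, pc_avail[pc]) + chunk_time` for an HBM CTRL commit, and `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`) can be worked through with plain arithmetic. The sketch below is not the simulator's code: the function names are hypothetical, the pseudo-channel index is passed in directly because the address-to-PC mapping is defined elsewhere (ADR-0034 D3), `chunk_time` is assumed to be `flit_nbytes / pc_bw_gbs`, and writing the commit time back into `pc_avail` is an assumption about how the array is advanced. Bytes divided by GB/s gives nanoseconds when 1 GB is taken as 1e9 bytes.

```python
def split_into_flits(payload_bytes: int, flit_bytes: int = 256) -> list[int]:
    """Sizes of the flits a payload decomposes into; the last one may be short."""
    full, rest = divmod(payload_bytes, flit_bytes)
    return [flit_bytes] * full + ([rest] if rest else [])


def flit_wire_delay_ns(prop_ns: float, flit_nbytes: int, bw_gbs: float) -> float:
    """Delay the wire applies to one flit: propagation plus serialization."""
    return prop_ns + flit_nbytes / bw_gbs


def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-pseudo-channel bandwidth derived from the aggregate HBM link."""
    return hbm_to_router_bw_gbs / num_pcs


def commit_flit(now_ns: float, pc_avail: list[float], pc: int,
                flit_nbytes: int, pc_bw_gbs: float) -> float:
    """Schedule one flit on pseudo-channel `pc` and return its commit time."""
    chunk_time = flit_nbytes / pc_bw_gbs          # assumed definition of chunk_time
    done = max(now_ns, pc_avail[pc]) + chunk_time
    pc_avail[pc] = done                           # assumed: the PC is busy until `done`
    return done


# Worked example with made-up numbers (512 GB/s aggregate, 8 PCs).
pc_bw = derive_pc_bw_gbs(512.0, 8)                # 64 GB/s per PC
pc_avail = [0.0] * 8
first = commit_flit(0.0, pc_avail, 0, 256, pc_bw)   # idle PC: 0 + 4 ns
second = commit_flit(1.0, pc_avail, 0, 256, pc_bw)  # same PC: waits until 4 ns
other = commit_flit(1.0, pc_avail, 1, 256, pc_bw)   # different PC: no wait
```

With these numbers the two flits on PC 0 serialize (4 ns, then 8 ns) while the flit on PC 1 commits at 5 ns, which is the per-pseudo-channel parallelism D1 describes. Bank conflicts, refresh, and the other effects listed in D3 are deliberately absent from the sketch, as they are from the model.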