ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization Notes
  (Informative)" section (D16-D23 + Open HW Questions).
  codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with one-line
  stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata block
  replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027 cleaned
  of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en},
  src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files
  ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,592 @@
+# CCL Algorithm Author Guide (English)
+
+This document is a step-by-step guide for engineers writing CCL
+(Collective Communication Library) algorithms in kernbench. The
+internal system design and component structure live in
+[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
+
+The goal here is to clearly separate **what an algorithm author has to
+touch** from **what they can leave alone**, and to get a first
+algorithm running through the shortest possible path.
+
+---
+
+## 0. Five-minute tour
+
+| Things you touch | Location |
+|------------------|----------|
+| Algorithm module (kernel + optional `neighbors()`) | `src/kernbench/ccl/algorithms/<algo>.py` |
+| Algorithm registration | `ccl.yaml` |
+| Host bench (rank count, init, launch, verify) | `benches/<your_bench>.py` |
+| (Optional) unit test | `tests/test_<algo>.py` |
+
+| Things you do NOT touch | Location |
+|--------------------------|----------|
+| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
+| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
+| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
+| Backend implementation (`install_ipcq`) | `src/kernbench/runtime_api/distributed.py` and `src/kernbench/ccl/install.py` |
+
+Workflow:
+1. Write a `kernel` function in the algorithm module.
+2. Register an entry in `ccl.yaml`.
+3. Write a host bench using `torch.distributed.init_process_group` /
+   `torch.distributed.all_reduce` (the unified `benches/ccl_allreduce.py`
+   handles the common case).
+4. (Optional) Run the mock runtime for fast unit tests (a few ms).
+5. `kernbench run --bench <name> --verify-data` for full SimPy verification.
+
+---
+
+## 1. Hello World — the simplest send/recv
+
+Each PE sends its tile to its E neighbor once and receives a tile from
+its W neighbor once. The reference code lives in
+[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
+
+### Step 1: write the kernel
+
+New file `src/kernbench/ccl/algorithms/hello_send.py`:
+
+```python
+"""Hello world: send your tile to the next rank, receive from the previous one."""
+
+
+def kernel(t_ptr, n_elem, tl):
+    # Global rank is computed from program_id(0/1) (ADR-0022).
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+
+    nbytes = n_elem * 2  # f16
+    pe_addr = t_ptr + rank * nbytes
+
+    # Load our slice and send it east.
+    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    tl.send(dir="E", src=src)
+
+    # Receive from west and store directly back into our slice.
+    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+    tl.store(pe_addr, recv)
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    """Positional kernel args used by the ahbm backend (after t_ptr)."""
+    return (n_elem,)
+```
+
+Key points:
+
+- **Global rank is computed from `program_id(axis=0)` and `program_id(axis=1)`.**
+  The two are combined as `cube_id * pes_per_cube + local_pe`. TL has no
+  contractually supported `tl.rank` / `tl.world_size`. If the host needs
+  to pass `world_size` or any other algorithm parameter, it goes through
+  ordinary `torch.launch` arguments.
+- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads
+  `addr`/`space`/`shape`/`dtype`/`nbytes` from the handle to issue an
+  `IpcqDmaToken` to PE_DMA.
+- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle
+  points at the IPCQ ring slot and can be passed straight to `tl.store`
+  (e.g. `tl.store(pe_addr, recv)`). Phase 2's `dma_write` replay
+  handles the (slot → hbm) copy, so user code never has to touch
+  `recv.data`.
+
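The rank and address arithmetic from the kernel above can be checked in isolation with a small standalone sketch. The helper names `global_rank` and `slice_addr` are illustrative only and are not part of kernbench; the 2-byte element size assumes f16, as in the example.

```python
def global_rank(local_pe: int, cube_id: int, pes_per_cube: int) -> int:
    """Flatten (cube_id, local_pe) into a single rank, cube-major."""
    return cube_id * pes_per_cube + local_pe


def slice_addr(t_ptr: int, rank: int, n_elem: int, elem_bytes: int = 2) -> int:
    """Byte address of the slice owned by `rank` (f16 means 2 bytes per element)."""
    return t_ptr + rank * n_elem * elem_bytes


if __name__ == "__main__":
    rank = global_rank(local_pe=3, cube_id=1, pes_per_cube=8)
    print(rank, hex(slice_addr(0x1000, rank, n_elem=8)))
```

With one cube, `cube_id` is 0 and the rank is simply the local PE id.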
+### Step 2: register in `ccl.yaml`
+
+```yaml
+algorithms:
+  hello_send:
+    module: kernbench.ccl.algorithms.hello_send
+    topology: ring_1d
+    buffer_kind: tcm
+    world_size: 8
+```
+
+`world_size` here is optional. If absent, `AhbmCCLBackend` derives it
+from the topology spec (`sips × cubes_per_sip × pes_per_cube`).
+
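As a rough illustration of that fallback, the derived value is just the product of the three topology counts. The function below is a hypothetical stand-in; how `AhbmCCLBackend` actually reads these counts from the topology spec is not shown in this guide.

```python
def derive_world_size(sips: int, cubes_per_sip: int, pes_per_cube: int) -> int:
    """Hypothetical helper: world size when ccl.yaml omits `world_size`."""
    return sips * cubes_per_sip * pes_per_cube


if __name__ == "__main__":
    # One SIP with one cube of eight PEs matches the `world_size: 8` example.
    print(derive_world_size(sips=1, cubes_per_sip=1, pes_per_cube=8))
```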
+### Step 3: write a host bench (optional — the unified bench may suffice)
+
+For most CCL benchmarks the existing `benches/ccl_allreduce.py` is
+sufficient: it reads `ccl.yaml`, picks the algorithm, sets up the
+process group, and runs the collective. If your algorithm needs custom
+host logic, write a new bench file along the same lines.
+
+The host code looks like a real PyTorch DDP worker:
+
+```python
+"""benches/ccl_hello.py"""
+from __future__ import annotations
+
+import numpy as np
+
+from kernbench.policy.placement.dp import DPPolicy
+
+
+N_ELEM = 8
+
+
+def worker(rank: int, world_size: int, torch) -> None:
+    """Per-rank business logic — mirrors a real PyTorch DDP worker."""
+    dp = DPPolicy(
+        cube="replicate", pe="column_wise",
+        num_cubes=1, num_pes=world_size,
+    )
+    tensor = torch.zeros(
+        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
+    )
+
+    # Per-rank initialization via the real PyTorch idiom.
+    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
+    for r in range(world_size):
+        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
+    tensor.copy_(torch.from_numpy(init))
+
+    # The collective itself.
+    torch.distributed.all_reduce(tensor, op="sum")
+
+    # Verify on rank 0 (real PyTorch DDP idiom).
+    if rank == 0:
+        result = tensor.numpy()
+        for r in range(world_size):
+            expected = float(((r - 1) % world_size) + 1)
+            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
+            print(
+                f"  rank {r}: got {float(slice_r.mean()):.1f}, "
+                f"expected {expected:.1f}"
+            )
+
+
+def run(torch) -> None:
+    """CLI entry point. Initializes dist, dispatches to worker."""
+    dist = torch.distributed
+    dist.init_process_group(backend="ahbm")
+    worker(
+        rank=dist.get_rank(),
+        world_size=dist.get_world_size(),
+        torch=torch,
+    )
+```
+
+### Step 4: unit test (optional but strongly recommended)
+
+`tests/test_hello_send.py`:
+
+```python
+import numpy as np
+
+from kernbench.ccl.algorithms.hello_send import kernel
+from kernbench.ccl.testing import run_kernel_in_mock
+
+
+def test_hello_send_4_ranks():
+    n_elem = 8
+    inputs = [
+        np.full((n_elem,), float(r + 1), dtype=np.float16)
+        for r in range(4)
+    ]
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem,),
+    )
+    # rank r should now hold rank (r-1) % 4's data.
+    for r in range(4):
+        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
+```
+
+`run_kernel_in_mock` runs every rank concurrently in pure Python (no
+SimPy), so a unit test like this finishes in **milliseconds**. It only
+verifies algorithmic correctness — no latency, no DMA, no fabric.
+
+### Step 5: SimPy validation
+
+```bash
+kernbench run --topology topology.yaml --bench ccl_hello --verify-data
+```
+
+Phase 1 runs the SimPy simulation + MemoryStore data movement; Phase 2
+replays the op_log for correctness. The bench's `print` lines should
+show `got` equal to `expected` for every rank.
+
+---
+
+## 2. Ring all-reduce — the second algorithm
+
+Slightly more complex. Each PE runs `world_size - 1` rounds, sending
+its current tile east and accumulating the tile received from the west.
+After all rounds, every PE holds the global sum.
+
+The reference implementation lives in
+[`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
+The core flow:
+
+```python
+"""Ring all-reduce."""
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+    pe_addr = t_ptr + rank * nbytes
+
+    # The handle points at HBM[pe_addr]. In greenlet mode .data is
+    # populated, but the kernel never has to touch .data directly.
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    current = acc  # source for the first send
+
+    for _step in range(world_size - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
+        # Phase 1 only models timing; Phase 2 DataExecutor replays the
+        # actual numpy accumulation.
+        acc = acc + recv
+        current = recv  # forward the received slot to the next round
+
+    # Store the final accumulator back to HBM. Source is acc (a PE-local
+    # scratch addr); dst is HBM. The op_log dma_write entry records both
+    # ends so Phase 2 copies the math result into HBM at verify time.
+    tl.store(pe_addr, acc)
+
+
+def kernel_args(world_size: int, n_elem: int) -> tuple:
+    return (n_elem, world_size)
+```
+
+Four key points:
+
+1. **Accumulation goes through TensorHandle operators.** `acc + recv`
+   emits a `MathCmd` and dispatches it through PE_MATH — i.e. the
+   real hardware path, so the latency model stays accurate. Per
+   ADR-0020 D3, Phase 1 only simulates timing; Phase 2's `DataExecutor`
+   replays the op_log and runs the actual numpy accumulation.
+2. **Use `current = recv` to forward.** Each round must update the send
+   source to the just-received slot handle so the same data circulates
+   exactly once around the ring. Setting `current = acc` would resend
+   the cumulative sum, inflating the result.
+3. **`tl.store(pe_addr, acc)` exactly once at the end.** Do not use a
+   store→reload pattern in the middle. `acc` lives in PE-local scratch;
+   the op_log records `(src=scratch, dst=hbm)` and Phase 2 first runs
+   math (filling scratch) then copies via the dma_write snapshot.
+4. **`world_size` is passed by the host explicitly.** TL only knows the
+   topology slot count (e.g. `num_programs(axis=0)` is "PEs per cube"),
+   not the participating CCL group size. The host bench knows
+   `world_size` and forwards it as an explicit kernel argument.
+
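Point 2 is easy to confirm outside the simulator. The sketch below is a plain NumPy model of the ring, not kernbench code: all ranks advance in lock step, rank `r` receives whatever rank `r - 1` sent, and the `forward_acc` flag (an illustrative name) switches between the correct `current = recv` and the incorrect `current = acc`.

```python
import numpy as np


def simulate_ring_allreduce(inputs, forward_acc=False):
    """Run world_size - 1 lock-step rounds; rank r receives from rank r - 1."""
    ws = len(inputs)
    acc = [np.asarray(x, dtype=np.float32) for x in inputs]
    current = list(acc)  # source for the first send
    for _ in range(ws - 1):
        recv = [current[(r - 1) % ws] for r in range(ws)]
        acc = [acc[r] + recv[r] for r in range(ws)]
        # Correct: forward the tile just received. Wrong: forward the running sum.
        current = list(acc) if forward_acc else recv
    return acc


if __name__ == "__main__":
    tiles = [np.full(4, float(r + 1)) for r in range(4)]
    print([float(a[0]) for a in simulate_ring_allreduce(tiles)])
    print([float(a[0]) for a in simulate_ring_allreduce(tiles, forward_acc=True)])
```

With four ranks holding 1, 2, 3 and 4, the correct variant leaves 10 on every rank, while forwarding the accumulator produces larger, rank-dependent values.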
+For registration in `ccl.yaml` and wiring through the unified bench,
+look at the existing `ring_allreduce_tcm/_hbm/_sram` entries plus
+[`benches/ccl_allreduce.py`](../benches/ccl_allreduce.py). Mock unit
+tests live in
+[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
+and follow the `kernel_args=(n_elem, world_size)` convention.
+
+---
+
+## 3. `neighbors()` override — custom topology
+
+Most algorithms are happy with the builtin topologies (`ring_1d`,
+`mesh_2d`, `tree_binary`, `ring_1d_unidir`, `none`). If you want to
+modify a builtin or define a brand-new connectivity pattern, define a
+`neighbors()` function in your algorithm module.
+
+### Signature
+
+```python
+def neighbors(
+    rank: int, world_size: int, neighbor_map: dict[str, int],
+) -> dict[str, int] | None:
+    """Override the neighbor map produced by the builtin topology.
+
+    Args:
+        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
+                      For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
+                      The dict is mutable — modify in place if you want.
+
+    Returns:
+        dict: the new neighbor map (or the modified-in-place dict).
+        None: do not override; use neighbor_map as-is.
+    """
+    return None
+```
+
+### Pattern A: tweak a builtin
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # Only even ranks use W; remove W from odd ranks.
+    if rank % 2 == 1:
+        neighbor_map.pop("W", None)
+    return neighbor_map
+```
+
+### Pattern B: replace entirely (skip-connection ring)
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return {"E": (rank + 2) % world_size}
+```
+
+### Pattern C: keep builtin
+
+Either omit `neighbors` entirely or return None:
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return None  # explicit "use the builtin"
+```
+
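To make the three patterns concrete, here is one way the override contract could be applied. This is a sketch under assumptions: `ring_1d_map` and `resolve_neighbors` are hypothetical names, and only the `ring_1d` mapping and the `None` convention are taken from the docstring above.

```python
def ring_1d_map(rank: int, world_size: int) -> dict:
    """Builtin ring_1d mapping as described in the signature docstring."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def resolve_neighbors(rank: int, world_size: int, override=None) -> dict:
    """Apply an optional neighbors() override on top of the builtin map."""
    base = ring_1d_map(rank, world_size)
    if override is None:  # Pattern C: module defines no neighbors()
        return base
    result = override(rank, world_size, dict(base))  # pass a mutable copy
    return base if result is None else result


def drop_w_on_odd(rank, world_size, neighbor_map):
    """Pattern A: remove W on odd ranks."""
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map


if __name__ == "__main__":
    for r in range(4):
        print(r, resolve_neighbors(r, 4, drop_w_on_odd))
```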
+---
+
+## 4. PE kernel API reference (ADR-0023 D4)
+
+### IPCQ API
+
+| API | Description | Blocking? |
+|-----|-------------|-----------|
+| `tl.send(dir, src=TensorHandle)` | Send to a peer in the given direction. | Yes (waits if peer slots are full) |
+| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form. | Yes |
+| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from one direction. | Yes |
+| `tl.recv(shape=..., dtype=...)` | Round-robin recv across all four directions. | Yes |
+| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv. | No |
+| `tl.wait(future)` | Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
+
+### Existing TL API (ADR-0020/0022, unchanged)
+
+| API | Description |
+|-----|-------------|
+| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` carries the ndarray. |
+| `tl.store(addr, handle)` | DMA write — when `handle.data` is set the runner propagates it to MemoryStore. |
+| `tl.composite(op, ...)` | Submit a GEMM/Math composite (non-blocking). |
+| `tl.program_id(axis=0)` | Local PE id within the cube. |
+| `tl.program_id(axis=1)` | Cube id (ADR-0022). |
+| `tl.num_programs(axis=0/1)` | Topology slot counts (NOT the participating-rank count). |
+
+### Two recv modes
+
+The default is `return_slot` (zero-copy): the IPCQ slot address is
+returned in `handle.addr`. To force a copy into a custom destination,
+pass `dst_addr` + `dst_space`:
+
+```python
+recv = tl.recv(
+    dir="W", shape=(8,), dtype="f16",
+    dst_addr=my_scratch_addr,
+    dst_space="hbm",
+)
+# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
+```
+
+---
+
+## 5. Helpers (`kernbench.ccl.helpers`)
+
+Convenience helpers to keep algorithm code short:
+
+```python
+from kernbench.ccl.helpers import chunked, ring_step, tree_step
+```
+
+### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
+
+Split a tile of `n_elem` elements into `n_chunks` equal-size views.
+Each `Chunk` has `addr`, `n_elem`, `nbytes` fields.
+
+```python
+chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
+# chunks[0..3] are 16-element views with consecutive addresses.
+```
+
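For intuition, a helper with this shape can be written in a few lines. The version below is a sketch and not the kernbench implementation: `chunked_sketch`, the dtype size table, and the choice to reject uneven splits are all assumptions.

```python
from dataclasses import dataclass

DTYPE_BYTES = {"f16": 2, "f32": 4}  # assumed size table, illustration only


@dataclass(frozen=True)
class Chunk:
    addr: int
    n_elem: int
    nbytes: int


def chunked_sketch(base_addr: int, n_chunks: int, n_elem: int, dtype: str = "f16"):
    """Split n_elem elements into n_chunks equal, address-consecutive views."""
    if n_elem % n_chunks != 0:
        raise ValueError("n_elem must be divisible by n_chunks")
    per_chunk = n_elem // n_chunks
    nbytes = per_chunk * DTYPE_BYTES[dtype]
    return [Chunk(base_addr + i * nbytes, per_chunk, nbytes) for i in range(n_chunks)]


if __name__ == "__main__":
    for c in chunked_sketch(0x1000, n_chunks=4, n_elem=64):
        print(hex(c.addr), c.n_elem, c.nbytes)
```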
+### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
+
+Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
+
+```python
+for step in range(world_size - 1):
+    send_idx, recv_idx = ring_step(rank, step, world_size)
+    tl.send(
+        dir="E", src_addr=chunks[send_idx].addr,
+        nbytes=chunks[send_idx].nbytes,
+        shape=(chunks[send_idx].n_elem,), dtype="f16",
+    )
+    recv = tl.recv(
+        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
+    )
+    # accumulate ...
+```
+
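The index schedule can be illustrated with the conventional ring reduce-scatter rule: at step `s`, rank `r` sends chunk `(r - s) % world_size` and receives chunk `(r - s - 1) % world_size`. Whether `ring_step` uses exactly this convention is an assumption; `ring_step_sketch` and `reduce_scatter_sketch` are hypothetical names used only here.

```python
def ring_step_sketch(rank: int, step: int, world_size: int):
    """Conventional reduce-scatter schedule (assumed, not the real helper)."""
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


def reduce_scatter_sketch(data):
    """data[r][c] is rank r's value for chunk c; returns the buffers after ws - 1 steps."""
    ws = len(data)
    buf = [list(row) for row in data]
    for step in range(ws - 1):
        # Snapshot what every rank sends before anyone accumulates.
        sends = [buf[r][ring_step_sketch(r, step, ws)[0]] for r in range(ws)]
        for r in range(ws):
            _, recv_idx = ring_step_sketch(r, step, ws)
            buf[r][recv_idx] += sends[(r - 1) % ws]  # tile arriving from the west
    return buf


if __name__ == "__main__":
    out = reduce_scatter_sketch([[r + 1] * 4 for r in range(4)])
    print([out[r][(r + 1) % 4] for r in range(4)])
```

Under this schedule rank `r` ends up owning the fully reduced chunk `(r + 1) % world_size`; an all-gather phase would then circulate those chunks.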
+### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
+
+Parent / children rank ids for a binary tree:
+
+```python
+info = tree_step(rank, world_size)
+if info["parent"] is None:
+    print(f"rank {rank} is the root")
+for child in info["children"]:
+    ...
+```
+
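A binary tree over ranks is commonly laid out like a heap, with rank 0 as the root and the children of rank `r` at `2r + 1` and `2r + 2`. The sketch below assumes that layout; the actual `tree_step` helper may number the tree differently, and `tree_step_sketch` is a hypothetical name.

```python
def tree_step_sketch(rank: int, world_size: int) -> dict:
    """Heap-style binary tree (assumed layout): parent and in-range children."""
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}


if __name__ == "__main__":
    for r in range(7):
        print(r, tree_step_sketch(r, 7))
```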
+---
+
+## 6. Unit testing — Mock runtime
+
+`kernbench.ccl.testing.run_kernel_in_mock` runs an algorithm without
+SimPy for fast feedback.
+
+### Basic usage
+
+```python
+import numpy as np
+
+from kernbench.ccl.testing import run_kernel_in_mock
+from kernbench.ccl.algorithms.my_algo import kernel
+
+
+def test_my_algo():
+    n_elem = 16
+    # NumPy's "f16" type string means a 16-byte float, so use np.float16 here.
+    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
+    expected = sum(inputs)
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem, 4),  # positional args after t_ptr
+    )
+    for r in range(4):
+        assert np.allclose(outputs[r], expected, rtol=1e-3)
+```
+
+### Behavior
+
+- All ranks run their kernels concurrently as cooperative greenlets.
+- `tl.send` / `tl.recv` are serviced by in-memory FIFOs (no DMA, no
+  latency).
+- Each rank's last `store` is what the helper returns as a numpy array.
+
+### Limitations
+
+- No latency or performance numbers (it is not a simulation).
+- No PE_DMA, fabric, or BW model.
+- Correctness only.
+- One cube assumed: `program_id(axis=1)` is always 0.
+
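The behavior listed above can be imitated with nothing but the standard library. The toy below is not `run_kernel_in_mock`: it uses threads instead of greenlets, passes `send`/`recv` callables instead of a `tl` object, and supports only an eastward ring. All names in it are hypothetical.

```python
import queue
import threading


def run_ring_mock(kernel, world_size, inputs):
    """Run `kernel` once per rank; sends go east through in-memory FIFOs."""
    inbox = [queue.Queue() for _ in range(world_size)]  # inbox[r]: tiles from rank r - 1
    outputs = [None] * world_size

    def worker(rank):
        def send(tile):
            inbox[(rank + 1) % world_size].put(tile)

        def recv():
            return inbox[rank].get(timeout=5)  # time out instead of hanging forever

        outputs[rank] = kernel(inputs[rank], send, recv)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


def hello_kernel(tile, send, recv):
    """Same shape as hello_send: one send east, one recv from the west."""
    send(tile)
    return recv()


if __name__ == "__main__":
    print(run_ring_mock(hello_kernel, 4, [10, 20, 30, 40]))
```

As in the real mock runtime, there is no notion of time here, so the result says nothing about latency.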
+---
+
+## 7. Debugging
+
+### CCL trace
+
+```bash
+KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
+    --bench ccl_allreduce --verify-data
+```
+
+Per-rank send/recv events appear on stdout:
+
+```
+[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
+[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
+```
+
+### Pointer dump
+
+`kernbench.ccl.diagnostics.pointer_dump(engine)` returns a multi-line
+dump of every PE_IPCQ ring buffer's `my_head`, `my_tail`,
+`peer_head_cache`, `peer_tail_cache`. When something hangs, this shows
+which rank is stuck and on what.
+
+### Deadlock detection
+
+When the SimPy schedule empties because of unmatched send/recv pairs,
+the engine raises `IpcqDeadlock` and embeds the pointer dump in the
+message (ADR-0023 D14 F3). Wait-for-graph visualization is future
+work.
+
+---
+
+## 8. Common mistakes
+
+### 1. Using a direction that wasn't installed
+
+`topology: ring_1d` only installs E and W. Trying:
+
+```python
+tl.send(dir="N", ...)   # → IpcqInvalidDirection
+```
+
+Fix: switch to `topology: mesh_2d`, or add N/S in a `neighbors()` override.
+
+### 2. `send` without a matching `recv`
+
+```python
+def kernel(..., tl):
+    for _ in range(100):
+        tl.send(dir="E", ...)
+    # The peer never recvs → ring buffer fills → backpressure → deadlock.
+```
+
+Fix: every `send` needs a matching `recv` on the receiver side.
+Otherwise `IpcqDeadlock` is raised.
+
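The backpressure step of that failure can be pictured with a bounded queue. This is only an analogy: the slot count of 4 is a made-up value, and `queue.Queue` stands in for the IPCQ ring buffer.

```python
import queue

SLOTS = 4  # hypothetical ring depth; the real slot count comes from the IPCQ setup


def sends_before_full(n_sends: int, slots: int = SLOTS) -> int:
    """Count how many sends fit when the peer never calls recv."""
    ring = queue.Queue(maxsize=slots)
    sent = 0
    for _ in range(n_sends):
        try:
            ring.put_nowait("tile")
        except queue.Full:
            break  # a blocking tl.send would wait here instead of raising
        sent += 1
    return sent


if __name__ == "__main__":
    print(sends_before_full(100))
```

Once every rank is waiting on a full ring and nobody receives, no further progress is possible, which is the condition the engine reports as `IpcqDeadlock`.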
+### 3. dtype/shape mismatch
+
+By default, dtype/shape mismatches between a `send` and its matching
+`recv` are not validated; the author is responsible for consistency.
+Set `strict_validation: true` in a PE_IPCQ node's attrs to enable strict
+mode (ADR-0023 D14 F2), which catches mismatches immediately.
+
+### 4. Assuming round-robin recv fairness
+
+`tl.recv()` (no direction) returns the first slot to arrive in
+round-robin order, but **arrival order is not predictable**. If your
+algorithm depends on a particular direction, name it explicitly:
+`tl.recv(dir="N", ...)`.
+
+### 5. Confusing `num_programs` with the CCL group size
+
+`tl.num_programs(axis=0/1)` reports topology slot counts, not the
+number of ranks participating in the collective. The host bench knows
+`world_size` and must pass it through as a kernel argument.
+
+### 6. Overwriting the send source before it's actually sent
+
+PE_DMA snapshots the source data into the IpcqDmaToken at send time,
+which preserves in-flight semantics. The snapshot only protects the data
+from that point on: if you overwrite the source buffer before `tl.send`
+has made it into the PE_DMA queue, the snapshot picks up the overwritten
+data. The safe pattern is to call `tl.send` first and modify the source
+buffer only afterwards.
+
+---
+
+## 9. Next steps
+
+- Try other topologies (`mesh_2d`, `tree_binary`).
+- Faster algorithms (recursive halving / doubling).
+- Compare `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep)
+  modes for latency.
+- Larger-scale validation through the unified `ccl_allreduce` bench
+  with different `ccl.yaml` overlays.
+
+If you add a new algorithm or pattern, please send a PR.
+
+---
+
+## References
+
+- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
+- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
+- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model.
+
+Existing algorithm examples:
+
+- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
+- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
+- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
+- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
@@ -0,0 +1,537 @@
+# CCL Algorithm Author Guide
+
+This document is a step-by-step guide for anyone writing CCL (Collective
+Communication Library) algorithms in kernbench. The internal system design
+and component structure are described in [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
+
+The guide aims to clearly separate **what an algorithm author has to touch**
+from **what they can leave alone**, and to get a first algorithm running by
+the shortest possible path.
+
+---
+
+## 0. Five-minute summary
+
+| Things you touch | Location |
+|------------------|----------|
+| Algorithm module (kernel + optional neighbors) | `src/kernbench/ccl/algorithms/<algo>.py` |
+| Algorithm registration | `ccl.yaml` |
+| Host bench (PE count, memory init, launch, verification) | `benches/<your_bench>.py` |
+| (Optional) unit test | `tests/test_<algo>.py` |
+
+| Things you do not touch | Location |
+|-------------------------|----------|
+| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
+| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
+| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
+| Backend implementation (install_ipcq) | `src/kernbench/runtime_api/distributed.py` and `src/kernbench/ccl/install.py` |
+
+Workflow:
+1. Write `kernel` in the algorithm module
+2. Register an entry in `ccl.yaml`
+3. Call `install_ipcq` + `launch` from the host bench
+4. (Optional) Unit-test with the mock runtime (a few ms)
+5. Verify under SimPy with `kernbench run --bench <name> --verify-data`
+
+---
+
+## 1. Hello World: the simplest send/recv
+
+The simplest possible algorithm: each PE sends its data once to its E
+neighbor and receives once from the W direction. The working code is in
+[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
+
+### Step 1: write the kernel
+
+New file `src/kernbench/ccl/algorithms/hello_send.py`:
+
+```python
+"""Hello world: send your data to the next rank and receive from the previous rank."""
+def kernel(t_ptr, n_elem, tl):
+    # The global rank is computed from program_id(0/1) (ADR-0022).
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+
+    nbytes = n_elem * 2  # f16
+    pe_addr = t_ptr + rank * nbytes
+
+    # Load our slice and send it east.
+    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    tl.send(dir="E", src=src)
+
+    # Receive from the W direction and store it straight back into our slice.
+    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+    tl.store(pe_addr, recv)
+```
+
+Key points:
+
+- **The global rank is computed from `program_id(axis=0)` and `program_id(axis=1)`.**
+  TL has no non-contractual extensions such as `tl.rank` / `tl.world_size`. If
+  the host needs to pass an algorithm parameter such as `world_size`, it does
+  so through ordinary `torch.launch` arguments.
+- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads the handle's
+  `addr`/`space`/`shape`/`dtype`/`nbytes` and issues an IpcqDmaToken to PE_DMA.
+- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle points
+  at the IPCQ ring slot and can be passed straight to `tl.store`, as in
+  `tl.store(pe_addr, recv)`. The Phase 2 dma_write replay performs the
+  (slot, hbm) copy, so there is no need to touch the numpy `.data` directly.
+
+### Step 2: register in ccl.yaml
+
+Add an entry to the `algorithms` section of `ccl.yaml`. (You do not have to
+change defaults.algorithm, because the host bench can pass the algorithm
+explicitly via `install_ipcq(algorithm=...)`.)
+
+```yaml
+algorithms:
+  hello_send:
+    module: kernbench.ccl.algorithms.hello_send
+    topology: ring_1d
+    buffer_kind: tcm
+```
+
+### Step 3: write the host bench
+
+New file `benches/ccl_hello.py`:
+
+```python
+"""Hello-world ring rotation bench (each PE receives its W neighbor's data once)."""
+import numpy as np
+
+from kernbench.ccl.algorithms import hello_send
+from kernbench.policy.placement.dp import DPPolicy
+
+ALGORITHM = "hello_send"
+N_ELEM = 8
+WORLD_SIZE = 8
+
+
+def run(torch):
+    plan = torch.install_ipcq(algorithm=ALGORITHM)
+
+    a = torch.zeros(
+        (1, WORLD_SIZE * N_ELEM), dtype="f16",
+        dp=DPPolicy(
+            cube="replicate", pe="column_wise",
+            num_cubes=1,
+        ),
+        name="hello_in",
+    )
+
+    store = torch.engine.memory_store
+    base = a._handle.va_base or a._handle.shards[0].pa
+    nbytes = N_ELEM * 2
+    for r in range(WORLD_SIZE):
+        store.write("hbm", base + r * nbytes,
+                    np.full((N_ELEM,), float(r + 1), dtype=np.float16))
+
+    torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)
+
+    # rank r should hold the data of rank (r-1)%ws.
+    for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
+        result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
+        prev = float(((r - 1) % WORLD_SIZE) + 1)
+        ok = np.allclose(result, prev)
+        print(f"  [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
+              f"expected {prev:.1f}")
+```
+
+### Step 4: unit test (optional, strongly recommended)
+
+`tests/test_hello_send.py`:
+
+```python
+import numpy as np
+from kernbench.ccl.algorithms.hello_send import kernel
+from kernbench.ccl.testing import run_kernel_in_mock
+
+
+def test_hello_send_4_ranks():
+    n_elem = 8
+    inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
+
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem,),
+    )
+
+    # rank r should receive the data of rank (r-1) % 4
+    for r in range(4):
+        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
+```
+
+`run_kernel_in_mock` runs every rank concurrently in pure Python without
+SimPy, so it **finishes in milliseconds**. It verifies only the algorithm's
+logical correctness.
+
+### Step 5: simulation verification
+
+```bash
+kernbench run --topology topology.yaml --bench ccl_hello --verify-data
+```
+
+Phase 1 runs the SimPy simulation plus MemoryStore data movement; Phase 2
+replays the op_log for correctness. The host bench's `print` check must
+report OK for every rank.
+
+---
+
+## 2. Ring All-Reduce: the second algorithm
+
+A slightly more complex example. In ring all-reduce, each PE spends N-1
+rounds sending its data east and accumulating what it receives from the
+west. At the end every PE holds the global sum.
+
+See [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py)
+for the working code. The core flow:
+
+```python
+"""Ring all-reduce."""
+
+
+def kernel(t_ptr, n_elem, world_size, tl):
+    # rank
+    local_pe = tl.program_id(axis=0)
+    cube_id = tl.program_id(axis=1)
+    pes_per_cube = tl.num_programs(axis=0)
+    rank = cube_id * pes_per_cube + local_pe
+    nbytes = n_elem * 2
+    pe_addr = t_ptr + rank * nbytes
+
+    # TensorHandle pointing at our slice in HBM. In greenlet mode .data is
+    # populated, but the kernel never needs to touch .data directly.
+    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
+    current = acc  # send source for the first round
+
+    for _step in range(world_size - 1):
+        tl.send(dir="E", src=current)
+        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
+        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
+        # Phase 1 models timing only; the Phase 2 DataExecutor performs the
+        # actual numpy accumulation.
+        acc = acc + recv
+        current = recv  # the next round forwards the slot just received
+
+    # Store the final accumulated value into our slice. The source is acc
+    # (a PE-local scratch addr) and dst is HBM. The op_log dma_write records
+    # the (scratch, hbm) copy, so Phase 2 fills HBM[pe_addr] with the answer
+    # at verify time.
+    tl.store(pe_addr, acc)
+```
+
+Four points:
+
+1. **Accumulation uses TensorHandle operators**: `acc + recv` emits a
+   `MathCmd` that is dispatched to PE_MATH. It takes the real hardware path,
+   so the latency model stays accurate. Per ADR-0020 D3, Phase 1 simulates
+   timing only, and the Phase 2 `DataExecutor` performs the numpy
+   accumulation while replaying the op_log.
+2. **Forward with `current = recv`**: each round must update the send source
+   to the slot handle just received, so that the same data travels around the
+   ring and is accumulated exactly once. Leaving it as `current = acc`
+   resends the accumulated value and inflates the result.
+3. **A single `tl.store(pe_addr, acc)` is enough**: a store→reload pattern in
+   the middle is forbidden. acc lives in PE-local scratch, and the op_log
+   records the (src=scratch, dst=hbm) metadata. Phase 2 runs the math first
+   to fill scratch, then copies to HBM via the dma_write snapshot.
+4. **The host passes `world_size` explicitly**: TL knows only the topology
+   slot counts (for example, `num_programs(axis=0)` is the number of PEs per
+   cube). The bench knows the size of the CCL group that actually
+   participates and passes it as a host→kernel argument.
+
+For the `ccl.yaml` registration and host bench, see
+[`benches/ccl_allreduce_tcm.py`](../benches/ccl_allreduce_tcm.py). For the
+mock unit test, follow [`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
+as-is (argument form `kernel_args=(n_elem, world_size)`).
+
+---
+
+## 3. neighbors() override: custom topology
+
+Most algorithms are fine with the builtin topologies (`ring_1d`, `mesh_2d`,
+`tree_binary`, `ring_1d_unidir`, `none`). To modify a builtin or create a new
+one, define `neighbors()` in the algorithm module.
+
+### Signature
+
+```python
+def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
+    """Override the neighbor_map built by the builtin topology.
+
+    Args:
+        neighbor_map: the builtin mapping built from the ccl.yaml topology field.
+                      e.g. ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
+                      mutable dict; it can be modified in place.
+
+    Returns:
+        dict: the overridden neighbor_map (or the same dict, modified)
+        None: no override; neighbor_map is used as-is
+    """
+    return None
+```
+
+### Pattern A: modify part of a builtin
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # Only even ranks use the W direction (W is removed on odd ranks)
+    if rank % 2 == 1:
+        neighbor_map.pop("W", None)
+    return neighbor_map
+```
+
+### Pattern B: write from scratch (skip-connection ring)
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    # Ignore neighbor_map and build a new one
+    return {"E": (rank + 2) % world_size}
+```
+
+### Pattern C: use the builtin, no override
+
+Either do not define `neighbors()` or return None:
+
+```python
+def neighbors(rank, world_size, neighbor_map):
+    return None  # explicitly use the builtin
+```
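+
+How the three patterns resolve can be sketched in a few lines. This is only an illustration
+of the contract described above, not the kernbench loader: `ring_1d` and `resolve_neighbors`
+are hypothetical helpers, and the builtin map is assumed to be the E/W ring from the docstring.
+
+```python
+def ring_1d(rank, world_size):
+    """Assumed builtin mapping for a bidirectional 1D ring."""
+    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
+
+
+def resolve_neighbors(rank, world_size, override=None):
+    """Apply an optional neighbors() override on top of the builtin map."""
+    base = ring_1d(rank, world_size)
+    if override is None:          # Pattern C: no neighbors() defined
+        return base
+    result = override(rank, world_size, base)
+    # A None return keeps the builtin map; a dict return replaces it.
+    return base if result is None else result
+
+
+def skip_ring(rank, world_size, neighbor_map):
+    """Pattern B: ignore the builtin map and link to rank + 2."""
+    return {"E": (rank + 2) % world_size}
+
+
+builtin = resolve_neighbors(0, 4)
+skipped = resolve_neighbors(3, 4, skip_ring)
+```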
+
+---
+
+## 4. PE Kernel API Reference (ADR-0023 D4)
+
+### IPCQ API
+
+| API | Description | Blocking? |
+|-----|-------------|-----------|
+| `tl.send(dir, src=TensorHandle)` | Send data in the given direction | Yes (waits while the peer slot is full) |
+| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form | Yes |
+| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from a specific direction | Yes |
+| `tl.recv(shape=..., dtype=...)` | Round-robin recv over the four directions (direction unspecified) | Yes |
+| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv | No |
+| `tl.wait(future)` | Wait for a non-blocking future to complete → TensorHandle | Yes |
+
+### Existing TL API (ADR-0020/0022, usable as is)
+
+| API | Description |
+|-----|-------------|
+| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` holds an ndarray |
+| `tl.store(addr, handle)` | DMA write; if handle.data is present it is propagated to MemoryStore |
+| `tl.composite(op, ...)` | Asynchronously submit GEMM/Math compute |
+| `tl.program_id(axis=0)` | Local PE id within the cube |
+| `tl.program_id(axis=1)` | Cube id (ADR-0022) |
+| `tl.num_programs(axis=0/1)` | Number of topology slots (not the number of participating ranks) |
+
+### The two `recv` modes
+
+The default is `return_slot` (zero-copy): the IPCQ slot address is placed directly in handle.addr.
+To copy the slot data to a separate location, specify `dst_addr` + `dst_space`:
+
+```python
+recv = tl.recv(
+    dir="W", shape=(8,), dtype="f16",
+    dst_addr=my_scratch_addr,
+    dst_space="hbm",
+)
+# Now recv.addr == my_scratch_addr (copy_to_dst mode)
+```
+
+---
+
+## 5. Helpers (`kernbench.ccl.helpers`)
+
+Helpers for keeping algorithm code short:
+
+```python
+from kernbench.ccl.helpers import chunked, ring_step, tree_step
+```
+
+### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
+
+Returns a list of views that split a total of `n_elem` elements into `n_chunks` equal parts.
+Each `Chunk` has `addr`, `n_elem`, and `nbytes` fields.
+
+```python
+chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
+# chunks[0..3] are 16-element views each, with contiguous addrs
+```
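+
+To make the address arithmetic concrete, here is a minimal stand-in for the helper. It is
+not the kernbench implementation: `chunked_sketch`, the local `Chunk` dataclass, and the
+`DTYPE_BYTES` table are assumptions, as is the requirement that `n_elem` divide evenly.
+
+```python
+from dataclasses import dataclass
+
+DTYPE_BYTES = {"f16": 2, "f32": 4}  # assumed element sizes in bytes
+
+
+@dataclass(frozen=True)
+class Chunk:
+    addr: int
+    n_elem: int
+    nbytes: int
+
+
+def chunked_sketch(base_addr, n_chunks, n_elem, dtype="f16"):
+    """Split n_elem elements into n_chunks equal, contiguous views."""
+    if n_elem % n_chunks != 0:
+        raise ValueError("n_elem must be divisible by n_chunks in this sketch")
+    per_chunk = n_elem // n_chunks
+    nbytes = per_chunk * DTYPE_BYTES[dtype]
+    # Chunk i starts right after chunk i-1, so addresses are contiguous.
+    return [Chunk(base_addr + i * nbytes, per_chunk, nbytes) for i in range(n_chunks)]
+
+
+chunks = chunked_sketch(0x1000, n_chunks=4, n_elem=64, dtype="f16")
+```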
+
+### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
+
+Per-step chunk indices for ring algorithms (reduce-scatter / all-gather):
+
+```python
+for step in range(world_size - 1):
+    send_idx, recv_idx = ring_step(rank, step, world_size)
+    tl.send(dir="E", src_addr=chunks[send_idx].addr,
+            nbytes=chunks[send_idx].nbytes,
+            shape=(chunks[send_idx].n_elem,), dtype="f16")
+    recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
+    # accumulate ...
+```
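+
+One common indexing convention for ring reduce-scatter is shown below, together with a
+list-based check that it reduces every chunk. The convention is an assumption: the real
+`ring_step` may number chunks differently, and `ring_step_sketch` / `reduce_scatter` are
+hypothetical names.
+
+```python
+def ring_step_sketch(rank, step, world_size):
+    """Assumed convention: send chunk (rank - step), receive chunk (rank - step - 1)."""
+    send_idx = (rank - step) % world_size
+    recv_idx = (rank - step - 1) % world_size
+    return send_idx, recv_idx
+
+
+def reduce_scatter(inputs):
+    """inputs[r][c] is chunk c on rank r; returns the per-rank chunks after the ring."""
+    ws = len(inputs)
+    data = [list(row) for row in inputs]
+    for step in range(ws - 1):
+        # Snapshot what every rank sends before anyone accumulates.
+        sent = [data[r][ring_step_sketch(r, step, ws)[0]] for r in range(ws)]
+        for r in range(ws):
+            _, recv_idx = ring_step_sketch(r, step, ws)
+            data[r][recv_idx] += sent[(r - 1) % ws]  # receive from the west neighbour
+    return data
+
+
+result = reduce_scatter([[10 * r + c for c in range(3)] for r in range(3)])
+```
+
+Under this convention rank `r` ends up holding the fully reduced chunk `(r + 1) % world_size`.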
+
+### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
+
+Parent/children ranks in the binary tree:
+
+```python
+info = tree_step(rank, world_size)
+if info["parent"] is None:
+    print(f"rank {rank} is the root")
+for child in info["children"]:
+    ...
+```
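+
+A minimal sketch of such a helper, assuming the usual heap layout (rank 0 is the root and
+the children of rank `r` are `2r + 1` and `2r + 2`). The layout used by the real `tree_step`
+is not stated here, so treat `tree_step_sketch` as hypothetical.
+
+```python
+def tree_step_sketch(rank, world_size):
+    """Return parent and children of `rank` in a heap-ordered binary tree."""
+    parent = None if rank == 0 else (rank - 1) // 2
+    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
+    return {"parent": parent, "children": children}
+
+
+root = tree_step_sketch(0, 5)
+leaf = tree_step_sketch(4, 5)
+```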
+
+---
+
+## 6. Unit Tests: Mock Runtime
+
+`kernbench.ccl.testing.run_kernel_in_mock` lets you validate an algorithm quickly
+without going through SimPy.
+
+### Basic usage
+
+```python
+from kernbench.ccl.testing import run_kernel_in_mock
+from kernbench.ccl.algorithms.my_algo import kernel
+import numpy as np
+
+
+def test_my_algo():
+    n_elem = 16
+    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
+    expected = sum(inputs)
+
+    outputs = run_kernel_in_mock(
+        kernel_fn=kernel,
+        world_size=4,
+        topology="ring_1d",
+        inputs=inputs,
+        kernel_args=(n_elem, 4),  # extra positional kernel arguments (after t_ptr)
+    )
+
+    for r in range(4):
+        assert np.allclose(outputs[r], expected, rtol=1e-3)
+```
+
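+### Behavior
+
+- Runs the kernels of the 4 ranks concurrently as greenlets
+- Handles `tl.send/recv` immediately through an in-memory FIFO (DMA and latency are ignored)
+- Returns the data each rank stored last as an ndarray
+
+### Limitations
+
+- Cannot measure latency or performance (it is not a simulation)
+- Does not model PE_DMA, fabric, or BW
+- Verifies correctness only
+- Assumes execution within a single cube: `program_id(axis=1)` is always 0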
+
+---
+
+## 7. Debugging
+
+### CCL trace
+
+```bash
+KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
+    --bench ccl_allreduce_tcm --verify-data
+```
+
+The send/recv timestamps of each rank are printed to stdout:
+
+```
+[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
+[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
+...
+```
+
+### Pointer dump
+
+`kernbench.ccl.diagnostics.pointer_dump(engine)` returns the ring buffer state of every
+PE_IPCQ (`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) as a multi-line
+string. When a hang occurs, it shows at a glance which rank is stuck and in which
+state.
+
+### Deadlock detection
+
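+When the SimPy schedule runs empty, for example because of unmatched send/recv calls, the
+engine raises `IpcqDeadlock` and includes the pointer dump in the message (ADR-0023 D14 F3).
+A separate wait-for graph visualization is future work.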
+
+---
+
+## 8. Common Mistakes
+
+### 1. Using a direction that is not installed
+
+`topology: ring_1d` in ccl.yaml installs only E/W. Using N/S:
+
+```python
+tl.send(dir="N", ...)   # → raises IpcqInvalidDirection
+```
+
+Fix: switch to `topology: mesh_2d`, or add N/S through a `neighbors()` override.
+
+### 2. Calling send with no matching recv
+
+```python
+def kernel(..., tl):
+    for _ in range(100):
+        tl.send(dir="E", ...)
+    # no recv on the peer side → ring buffer fills up → backpressure → deadlock
+```
+
+Fix: every send must have a matching recv. Otherwise `IpcqDeadlock` is
+raised.
+
+### 3. dtype/shape mismatch
+
+In the default mode, dtype/shape mismatches are not validated. Either the author guarantees
+them, or set `strict_validation: true` in the PE_IPCQ node attrs to enable D14 F2 strict
+mode, which catches mismatches immediately.
+
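+### 4. Assuming fairness in round-robin recv
+
+`tl.recv()` (direction unspecified) polls the directions round-robin but returns the first
+slot that has arrived. **Because the arrival order is unknown**, an algorithm must not
+depend on the arrival direction. If it matters, specify the direction, as in `tl.recv(dir="N", ...)`.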
+
+### 5. Assuming the CCL group size
+
+`tl.num_programs(axis=0/1)` is the number of topology slots, not the CCL group size.
+The number of participating ranks (`world_size`) is known to the host bench and must be
+passed explicitly as a kernel argument.
+
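+### 6. The host overwrites send-source memory before arrival
+
+PE_DMA snapshots the src data into the token at send time, so the meaning of in-flight
+data is preserved. Consequently, when a single PE updates the same address across several
+steps, it is safe to store to that address in a later step after a direct send (thanks to
+the token snapshot). However, if the address is overwritten before `tl.send` has been
+enqueued on the PE_DMA queue, the wrong data is snapshotted. The recommended order is
+`tl.send` first, memory modification afterwards.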
+
+---
+
+## 9. Next Steps
+
+- Try other topologies such as `mesh_2d` / `tree_binary`
+- Faster algorithms such as recursive halving/doubling
+- Compare latency across `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep) modes
+- Validate rings at larger scale, as in `ccl_ring_allreduce_multicube.py` and
+  `ccl_ring_allreduce_multisip.py`
+
+If you add a new algorithm or pattern, please contribute it as a PR.
+
+---
+
+## References
+
+- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design
+- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
+- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
+- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
+
+Existing algorithm examples:
+
+- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py): the simplest send/recv
+- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py): ring all-reduce
+- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py): 2D mesh all-reduce
+- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py): binary tree all-reduce
@@ -0,0 +1,363 @@
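+# Practical DI Patterns: Learning Dependency Injection from the kernbench Implementation
+
+---
+
+## Slide 1: What We Cover Today
+
+**Question:** How should code be designed so that it is easy to test and easy to swap out?
+
+**Answer:** Dependency Injection (DI)
+
+Today we learn not from theory but by reading **simulator code that actually runs**.
+
+```
+kernbench
+└── A framework that simulates AI accelerator hardware in Python
+    - Dozens of hardware components (NOC, HBM, PE, CPU...)
+    - Each component can be swapped at runtime
+    - In tests, components can be replaced with mocks on the spot
+```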
+
+---
+
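+## Slide 2: What Happens Without DI
+
+```python
+# ❌ Code without DI
+class IoCpuComponent:
+    def run(self, env, nbytes):
+        router = PathRouter()        # created directly, cannot be swapped
+        hbm = HbmCtrlComponent()    # created directly, cannot be swapped
+        yield env.timeout(10.0)
+```
+
+**Problems:**
+- In tests, the real `PathRouter` and `HbmCtrl` always come along
+- Replacing a component with a mock requires **modifying the source code**
+- Using a different topology (a different routing strategy) requires **yet another modification**
+
+> When a class creates its own dependencies, it becomes coupled to them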
+
+---
+
+## Slide 3: The Core Principle of DI
+
+**Dependencies are created outside and passed in**
+
+```
+┌────────────────────────────┐
+│  Assembler                 │  ← decides who uses what
+│  GraphEngine.__init__      │
+└────────────┬───────────────┘
+             │ injects ctx
+             ▼
+┌────────────────────────────┐
+│  Component                 │  ← only needs to know how it behaves
+│  IoCpuComponent            │
+│    self.ctx.router.find_path(...)  ← just uses it
+└────────────────────────────┘
+```
+
+**Three separated roles:**
+1. **Interface**: what it can do (`ComponentBase`)
+2. **Implementation**: how it does it (`IoCpuComponent`, `HbmCtrlComponent`, ...)
+3. **Assembler**: what gets wired together (`GraphEngine`)
+
+---
+
+## Slide 4: Pattern 1 (Constructor Injection)
+
+> Receive dependencies through the constructor
+
+```python
+# kernbench/components/base.py
+
+class ComponentBase(ABC):
+    def __init__(self, node: Node, ctx: ComponentContext | None = None):
+        self.node = node
+        self.ctx = ctx          # dependency injected from outside
+        self.in_ports: dict[str, simpy.Store] = {}
+        self.out_ports: dict[str, simpy.Store] = {}
+```
+
+```python
+# Usage side: the component does not create ctx itself
+class IoCpuComponent(ComponentBase):
+    def _dispatch(self, env, txn):
+        path = self.ctx.router.find_node_path(...)   # ctx is already in place
+        yield self.out_ports[next_hop].put(...)
+```
+
+**When to use:**
+- When the dependency does not change during the component's lifetime
+- When the component cannot work without the dependency (required dependency)
+
+---
+
+## Slide 5: The Context Object Pattern
+
+> When dependencies pile up, bundle them into one
+
+```python
+# kernbench/components/context.py
+
+@dataclass
+class ComponentContext:
+    router: PathRouter              # routing policy
+    resolver: AddressResolver       # address resolution
+    positions: dict[str, ...]       # physical position info
+    ns_per_mm: float                # propagation delay constant
+    edge_map: dict[...]             # edge info
+    spec: dict                      # topology spec
+```
+
+**Why bundle into a Context?**
+- Six constructor arguments → the signature changes every time a dependency is added
+- A single Context → adding a new field leaves existing components unaffected
+- Each component **takes only what it needs**
+
+```python
+class TwoDMeshNocComponent(ComponentBase):
+    def _route(self, env, txn):
+        src_pos = self.ctx.positions.get(prev_hop)   # uses positions only
+        ns_per_mm = self.ctx.ns_per_mm               # uses the constant only
+        # router, resolver, etc. are untouched
+```
+
+---
+
+## Slide 6: Pattern 2 (Registry + Factory)
+
+> A string key → class mapping enables runtime replacement
+
+```python
+# kernbench/components/base.py
+
+class ComponentRegistry:
+    _registry: dict[str, type[ComponentBase]] = {}
+
+    @classmethod
+    def register(cls, impl: str, component_cls: type[ComponentBase]):
+        cls._registry[impl] = component_cls
+
+    @classmethod
+    def create(cls, node, overrides=None, ctx=None) -> ComponentBase:
+        if overrides and node.impl in overrides:
+            return overrides[node.impl](node, ctx)   # priority 1: caller override
+        if node.impl in cls._registry:
+            return cls._registry[node.impl](node, ctx)  # priority 2: registered implementation
+        return DefaultComponent(node, ctx)           # priority 3: default fallback
+```
+
+**Resolution priority:**
+```
+overrides[impl]        ← injected for tests/experiments
+  ↓ (if absent)
+_registry[impl]        ← production implementation
+  ↓ (if absent)
+DefaultComponent       ← safe fallback
+```
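+
+The three-level resolution can be exercised in isolation. The sketch below is a
+self-contained stand-in, not the kernbench classes: `Node`, `RealXbar`, `SpyXbar`, the
+module-level `REGISTRY`, and `create` are simplified, hypothetical substitutes.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Node:
+    id: str
+    impl: str
+
+
+class ComponentBase:
+    def __init__(self, node, ctx=None):
+        self.node = node
+        self.ctx = ctx
+
+
+class DefaultComponent(ComponentBase):
+    """Fallback used when no implementation is known for node.impl."""
+
+
+class RealXbar(ComponentBase):
+    """Stand-in for a registered production implementation."""
+
+
+class SpyXbar(ComponentBase):
+    """Stand-in for a test double supplied by the caller."""
+
+
+REGISTRY = {"xbar_v1": RealXbar}
+
+
+def create(node, overrides=None, ctx=None):
+    if overrides and node.impl in overrides:
+        return overrides[node.impl](node, ctx)   # 1: caller override
+    if node.impl in REGISTRY:
+        return REGISTRY[node.impl](node, ctx)    # 2: registered implementation
+    return DefaultComponent(node, ctx)           # 3: fallback
+
+
+xbar = Node("sip0.cube0.xbar", "xbar_v1")
+production = create(xbar)
+spied = create(xbar, overrides={"xbar_v1": SpyXbar})
+unknown = create(Node("n1", "not_registered"))
+```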
+
+---
+
+## Slide 7: How Registration Works
+
+```python
+# kernbench/components/builtin/__init__.py
+
+from kernbench.components.base import ComponentRegistry
+from kernbench.components.builtin.noc import TwoDMeshNocComponent
+from kernbench.components.builtin.io_cpu import IoCpuComponent
+# ...
+
+ComponentRegistry.register("noc_2d_mesh_v1", TwoDMeshNocComponent)
+ComponentRegistry.register("io_cpu_v1",       IoCpuComponent)
+ComponentRegistry.register("hbm_ctrl_v1",     HbmCtrlComponent)
+# ...
+```
+
+**topology.yaml (config file)**
+```yaml
+nodes:
+  - id: sip0.cube0.noc
+    impl: noc_2d_mesh_v1    # ← this string is the Registry key
+```
+
+**Flow:**
+```
+YAML → impl string → Registry.create() → actual component instance
+```
+
+Changing only the impl string changes the behavior. No code modification is needed.
+
+---
+
+## Slide 8: Pattern 3 (Override Injection, for tests)
+
+> The caller swaps out only a specific impl
+
+```python
+# tests/test_component_registry.py
+
+class SpyXbar(ComponentBase):
+    calls = 0
+
+    def run(self, env, nbytes):
+        SpyXbar.calls += 1
+        yield env.timeout(0)
+
+
+# In the test, replace only xbar_v1 with SpyXbar
+engine = GraphEngine(
+    graph,
+    component_overrides={"xbar_v1": SpyXbar}   # ← the only addition
+)
+
+result = engine.run(msg)
+assert SpyXbar.calls > 0    # verify that Xbar was actually called
+```
+
+**Key point:** test code **does not modify** production code
+
+---
+
+## Slide 9: The Assembler (GraphEngine)
+
+> The only place where components are created and wired
+
+```python
+# kernbench/sim_engine/engine.py
+
+class GraphEngine:
+    def __init__(self, graph, component_overrides=None):
+
+        # 1. Create shared dependencies
+        ctx = ComponentContext(
+            router=PathRouter(graph),
+            resolver=AddressResolver(graph),
+            positions={nid: n.pos_mm for nid, n in graph.nodes.items()},
+            ns_per_mm=...,
+        )
+
+        # 2. Create components (DI: inject ctx)
+        self._components = {
+            node_id: ComponentRegistry.create(node, component_overrides, ctx)
+            for node_id, node in graph.nodes.items()
+        }
+
+        # 3. Connect ports (wiring)
+        for e in graph.edges:
+            store = simpy.Store(self._env)
+            self._components[e.src].out_ports[e.dst] = store
+            self._components[e.dst].in_ports[e.src] = store
+```
+
+**Create → inject → wire**: these three steps happen in exactly one place
+
+---
+
+## Slide 10: The Whole Structure at a Glance
+
+```
+topology.yaml
+    │ impl: "noc_2d_mesh_v1"
+    ▼
+GraphEngine.__init__()                     ← assembler
+    │
+    ├── create ComponentContext            ← bundle of shared dependencies
+    │     ├── PathRouter
+    │     ├── AddressResolver
+    │     └── positions, ns_per_mm, ...
+    │
+    ├── ComponentRegistry.create(node, overrides, ctx)
+    │     ├── overrides["noc_2d_mesh_v1"]? → SpyNoc (test)
+    │     ├── registry["noc_2d_mesh_v1"]?  → TwoDMeshNocComponent (production)
+    │     └── fallback                     → DefaultComponent
+    │
+    └── port wiring: connect out_ports / in_ports
+
+Component (TwoDMeshNocComponent)
+    └── uses self.ctx.positions, self.ctx.ns_per_mm
+        (router and resolver are untouched; only what is needed)
+```
+
+---
+
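+## Slide 11: What We Gained
+
+| Situation | Without DI | With DI |
+|-----------|------------|---------|
+| Swap the NOC algorithm | Modify source code | Change the impl string in YAML |
+| Verify Xbar behavior | Run all the real HW | `component_overrides={"xbar_v1": SpyXbar}` |
+| Add a new component | Modify existing code | `register("new_v1", NewComp)` |
+| Add a context field | Modify every constructor | Add a field to `ComponentContext` |
+| Test isolation | Impossible | Override only what is needed |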
+
+---
+
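+## Slide 12: Checklist for Real-World Use
+
+**Questions to ask while designing:**
+
+1. **What does this class `new` (create) directly?**
+   → What it creates is what cannot be swapped. Consider whether it can be received through the constructor instead.
+
+2. **Three or more dependencies?**
+   → Bundle them into a Context Object.
+
+3. **Can this class run on its own in a test?**
+   → If not, that is a signal that DI is needed.
+
+4. **Do you want to change behavior through configuration (YAML/config)?**
+   → Registry + string key pattern.
+
+5. **Who does the assembly?**
+   → There should be exactly one assembler. Components must not contain assembly logic.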
+
+---
+
+## Slide 13: Anti-patterns to Avoid
+
+```python
+# ❌ Service locator (calling the registry from inside a component)
+class BadComponent(ComponentBase):
+    def run(self, env, nbytes):
+        router = ComponentRegistry.get("router")  # the component looks it up itself
+        ...
+
+# ❌ Direct reference to a global singleton
+class BadComponent(ComponentBase):
+    def run(self, env, nbytes):
+        router = GlobalRouter.instance()          # cannot be swapped
+        ...
+
+# ❌ Creating a dependency inside the constructor
+class BadComponent(ComponentBase):
+    def __init__(self, node):
+        self.router = PathRouter(node.graph)      # cannot be isolated in tests
+```
+
+**Common problem:** the component resolves its own dependencies → higher coupling
+
+---
+
+## Slide 14: Summary
+
+> **DI = separating the creation of dependencies from their use**
+
+```
+Creation  →  Registry / Assembler (GraphEngine)
+Use       →  Component (IoCpuComponent, TwoDMeshNocComponent, ...)
+```
+
+**Three patterns learned from kernbench:**
+
+1. **Constructor Injection**: required dependencies go through the constructor
+2. **Context Object**: bundle dependencies into a single dataclass
+3. **Registry + Override**: select the implementation by string key, swap it in tests
+
+**Result:** 141 tests, component swap with a single YAML line, mock injection without modifying production code
+
+---
+
+*Reference code: kernbench/src/kernbench/components/*
@@ -0,0 +1,237 @@
+# Hardware Architecture Overview
+
+This document summarizes the hardware architecture of the AI Accelerator platform.
+It can be used as background when analyzing papers and reviewing designs.
+
+> Source ADRs: ADR-0003, ADR-0004, ADR-0014, ADR-0017, ADR-0022
+
+---
+
+## 1. System Hierarchy
+
+The system is organized as a four-level hierarchy.
+
+```
+Tray
+ ├── Host CPU (runtime, data placement)
+ ├── SIP 0 (accelerator)
+ │    ├── IO Chiplet (PCIe-EP, IO_CPU)
+ │    ├── CUBE 0
+ │    │    ├── PE 0 ─ PE 7
+ │    │    ├── HBM + HBM_CTRL
+ │    │    ├── Shared SRAM
+ │    │    ├── M_CPU (management)
+ │    │    ├── NOC 2D Mesh (router grid)
+ │    │    └── UCIe × 4 (N/S/E/W)
+ │    ├── CUBE 1 ... CUBE N
+ │    └── IO Chiplet(s)
+ ├── SIP 1 ... SIP M
+ └── Interconnect (PCIe / UAL)
+```
+
+| Level | Composition | Connectivity |
+|-------|-------------|--------------|
+| **Tray** | Host CPU + multiple SIPs | PCIe / UAL fabric |
+| **SIP** | Multiple CUBEs + IO chiplet(s) | UCIe (between cubes), PCIe-EP (host) |
+| **CUBE** | Multiple PEs + HBM + SRAM + M_CPU + NOC mesh | UCIe × 4 ports (N/S/E/W) |
+| **PE** | PE_CPU + DMA + GEMM + MATH + TCM | Directly attached to a NOC router |
+
+---
+
+## 2. CUBE Architecture
+
+Each CUBE is an independent compute + memory unit.
+
+### 2.1 Components
+
+- **PEs**: multiple Processing Elements, each able to run a kernel independently
+- **HBM + HBM_CTRL**: High Bandwidth Memory. Each PE is assigned a local HBM region that it accesses with minimal latency
+- **Shared SRAM**: shared memory that every PE in the cube can access through the NOC
+- **M_CPU**: Management CPU. Distributes kernel commands and aggregates completions
+- **NOC (On-die Fabric)**: the interconnect linking all components in the cube
+- **UCIe × 4**: multiple connections in each direction (N/S/E/W), providing inter-cube links
+
+### 2.2 NOC (On-die Fabric)
+
+The NOC is the on-die interconnect that links the PEs, HBM, SRAM, M_CPU, and UCIe within a cube.
+
+**Architectural requirements** (topology-independent):
+- Every PE can access its local HBM at full bandwidth
+- Every PE can access the shared SRAM
+- Every PE can reach other cubes through UCIe
+- M_CPU can deliver commands to every PE
+- Per-link contention modeling is supported
+
+**Current simulator implementation** (subject to change):
+- 2D mesh router grid (6×6 by default, XY deterministic routing)
+- HBM_CTRL attaches directly to each PE's local router (0 mesh hops)
+- No routers are placed in the central HBM zone
+- Contention: one capacity=1 resource per directed segment
+
+Besides a 2D mesh, the NOC topology could also be implemented as a ring, crossbar, or
+hierarchical network; it is replaceable as long as the architectural requirements are met.
+
+### 2.3 Main Data Paths
+
+| Path | Route | Characteristics |
+|------|-------|-----------------|
+| PE → Local HBM | PE_DMA → NOC → HBM_CTRL | Minimum hops, 256 GB/s (×0.8 eff) |
+| PE → Remote PE's HBM | PE_DMA → NOC hops → HBM_CTRL | Limited by NOC BW per hop |
+| PE → Shared SRAM | PE_DMA → NOC → SRAM | Limited by SRAM link BW |
+| PE → Other CUBE's HBM | PE_DMA → NOC → UCIe → NOC → HBM_CTRL | UCIe overhead 16ns (TX+RX) |
+| Kernel Launch | IO → UCIe → M_CPU → NOC → PE_CPU | Command path |
+
+### 2.4 Key Bandwidths
+
+| Connection | Bandwidth | Notes |
+|------------|-----------|-------|
+| PE_DMA ↔ NOC | 256 GB/s | Matches HBM slice BW |
+| NOC ↔ HBM_CTRL | 256 GB/s | Per PE, local access |
+| NOC ↔ SRAM | 128 GB/s × 4 | 512 GB/s aggregate |
+| NOC ↔ UCIe conn | 128 GB/s × 4 | 512 GB/s per port |
+| UCIe link (inter-cube) | 512 GB/s | 1.0mm seam distance |
+
+---
+
+## 3. PE Architecture
+
+Each PE is an independent processor that runs one kernel instance.
+
+### 3.1 Internal Components
+
+```
+PE_CPU (control)
+  │
+  ├──→ PE_SCHED (dispatch)
+  │       │
+  │       ├──→ PE_DMA ←→ NOC Router ←→ HBM / SRAM / UCIe
+  │       │      ↕
+  │       ├──→ PE_FETCH_STORE ←→ PE_TCM (16MB SRAM)
+  │       │
+  │       ├──→ PE_GEMM (matrix multiply)
+  │       └──→ PE_MATH (elementwise)
+  │
+  └──→ PE_IPCQ (collective communication)
+           │
+           └──→ PE_DMA (IPCQ port)
+```
+
+| Component | Role |
+|-----------|------|
+| **PE_CPU** | Executes the kernel instruction stream and generates commands |
+| **PE_SCHED** | Command dispatcher. Decomposes composite commands into a tile pipeline |
+| **PE_DMA** | HBM ↔ TCM data transfer (through the NOC router mesh). One channel each for read and write |
+| **PE_GEMM** | Matrix multiply engine. Reads activations from TCM; can stream weights from HBM |
+| **PE_MATH** | Element-wise engine. Reads and writes TCM |
+| **PE_TCM** | 16MB on-PE SRAM. Staging memory for compute |
+| **PE_IPCQ** | Controls collective communication between PEs (manages ring buffer pointers) |
+
+### 3.2 Compute Pipeline (Tiled Execution)
+
+Composite commands are executed as a pipeline, tile by tile:
+
+```
+DMA_READ(t) → COMPUTE(t) → DMA_WRITE(t)
+```
+
+**Overlap rules**:
+- Allowed: `DMA_READ(t+1) ∥ COMPUTE(t)`, `DMA_WRITE(t-1) ∥ COMPUTE(t)`
+- Forbidden: `GEMM(t) ∥ GEMM(t')`, `GEMM(t) ∥ MATH(t')`
+
+**DMA Engine**: capacity=1 each for read and write. Concurrent Read+Write is possible; concurrent Read+Read is not.
+
+**Compute Engine**: GEMM and MATH share a single compute slot. Only one runs at a time.
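+
+The effect of these rules on total time can be estimated with a small back-of-the-envelope
+model. This is an illustrative sketch, not the simulator: `pipeline_finish_ns` is a
+hypothetical helper that assumes fixed per-tile stage times and enough TCM buffering for
+reads to run ahead of compute.
+
+```python
+def pipeline_finish_ns(n_tiles, read_ns, compute_ns, write_ns):
+    """Finish time of a READ → COMPUTE → WRITE tile pipeline.
+
+    Each stage is a capacity-1 engine: one DMA read channel, one shared
+    compute slot (GEMM or MATH), and one DMA write channel.
+    """
+    read_free = compute_free = write_free = 0.0
+    write_done = 0.0
+    for _ in range(n_tiles):
+        read_done = read_free + read_ns                 # reads are serialized
+        read_free = read_done
+        compute_done = max(read_done, compute_free) + compute_ns
+        compute_free = compute_done                     # one compute slot
+        write_done = max(compute_done, write_free) + write_ns
+        write_free = write_done                         # writes are serialized
+    return write_done
+
+
+single_tile = pipeline_finish_ns(1, read_ns=4, compute_ns=10, write_ns=4)
+three_tiles = pipeline_finish_ns(3, read_ns=4, compute_ns=10, write_ns=4)
+```
+
+When compute is the slowest stage, the total approaches `read + n_tiles × compute + write`,
+because the DMA transfers of neighbouring tiles hide behind compute.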
+
+### 3.3 TCM-centric Dataflow
+
+All compute is centered on TCM:
+
+```
+Input:   HBM → (NOC) → PE_DMA → PE_TCM
+Compute: PE_TCM → GEMM / MATH → PE_TCM
+Output:  PE_TCM → PE_DMA → (NOC) → HBM
+```
+
+PE_TCM is split into two regions:
+- **SchedulerReservedTCM**: tile buffer region reserved for PE_SCHED (DMA/compute staging)
+- **AllocatableTCM**: general-purpose allocation region (host/DP-visible)
+
+The two regions are separated by hard isolation.
+
+---
+
+## 4. Memory Hierarchy
+
+### 4.1 Memory Tiers
+
+| Memory | Scope | Capacity | Bandwidth | Latency | Access path |
+|--------|-------|----------|-----------|---------|-------------|
+| **PE_TCM** | PE-private | 16 MB | 512 GB/s | Lowest | Direct (bypasses the NOC) |
+| **Shared SRAM** | Cube-shared | 32 MB | 128 GB/s (NOC link) | Medium | PE → NOC → SRAM |
+| **Local HBM** | Allocated per PE | Large | 256 GB/s (×0.8 eff) | High | PE → local router → HBM_CTRL |
+| **Remote HBM** | Another PE/Cube | Large | Limited by mesh/UCIe BW | Highest | PE → NOC mesh → (UCIe) → HBM_CTRL |
+
+### 4.2 Local HBM Bandwidth Guarantee
+
+- Each PE has an HBM pseudo-channel attached directly to its own local router
+- Local HBM access takes **0 mesh hops** (switching overhead only)
+- Effective bandwidth = spec BW × efficiency factor (default 0.8)
+- Example: 256 GB/s × 0.8 = 204.8 GB/s effective
+- This guarantee holds regardless of fabric bandwidth
+
+### 4.3 Memory-Centric Design Principle
+
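+- **Compute runs near the data**: each PE attaches directly to its local HBM, minimizing data movement
+- **TCM is the compute scratchpad**: all compute inputs and outputs pass through TCM
+- **HBM is primary storage**: holds large tensors; DMA loads and stores them to TCM tile by tile
+- **Shared SRAM is shared at cube level**: sharing of intermediate results, reduction buffers, and so on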
+
+---
+
+## 5. SPMD Execution Model
+
+### 5.1 Program ID Mapping
+
+Kernels run SPMD-style on a 2D hardware grid:
+
+| API | Return value | Description |
+|-----|--------------|-------------|
+| `tl.program_id(axis=0)` | `local_pe_id` | PE index within the cube |
+| `tl.program_id(axis=1)` | `cube_id` | Cube index |
+| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Number of PEs per cube |
+| `tl.num_programs(axis=1)` | `num_cubes` | Total number of cubes |
+
+```python
+global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
+```
+
+### 5.2 Axis Mapping Rationale
+
+- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate over the local NOC. Fast and tightly coupled. Corresponds to thread-in-block on a GPU.
+- **axis=1 = Cube (outer)**: inter-cube communication goes through UCIe and has higher latency. The coarse scheduling unit. Corresponds to block-in-grid on a GPU.
+
+### 5.3 Kernel Execution Flow
+
+```
+Host CPU
+  → IO_CPU (PCIe-EP)
+    → M_CPU (management, per cube)
+      → PE_CPU × N (broadcast)
+        → Each PE executes same kernel with unique (pe_id, cube_id)
+```
+
+Every PE runs the same kernel but uses `program_id` to identify its own data partition
+and processes it independently (SPMD).
+
+---
+
+## 6. Inter-PE Communication (IPCQ)
+
+Collective communication between PEs is carried out through IPCQ (Inter-PE Communication Queue).
+
+- Each PE maintains a ring-buffer-based queue pair per direction (N/S/E/W, etc.)
+- **DMA-IPCQ co-design**: the head pointer is piggybacked on DMA data flits, synchronizing pointers without separate control messages
+- **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit
+- The IPCQ slot buffer can be placed in **TCM, Shared SRAM, or Local HBM**
+
+For details, see ADR-0023, including its "HW Realization Notes (Informative)" section.
@@ -0,0 +1,381 @@
+# Latency Model
+
+## Overview
+
+kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
+Every request flows through a graph of **components** connected by **wires**.
+The total latency reported is the **simulated time elapsed on the SimPy clock** (`env.now` delta),
+not a static formula, so contention and queueing are captured automatically.
+
+```text
+total_ns (actual) = wire_prop + component_overhead + drain + queueing
+                    ├── deterministic ──────────────────┘       │
+                    └── contention-dependent ────────────────────┘
+```
+
+## Three Deterministic Cost Components
+
+### 1. Wire Propagation
+
+```text
+wire_ns = distance_mm × ns_per_mm       (global: 0.01 = 10 ps/mm)
+```
+
+Every edge in the topology graph has a `distance_mm`. A SimPy wire process
+delays each message by `wire_ns` before delivering it to the next component.
+For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
+since all links are on-die or interposer. Wire propagation is typically <1 ns
+and negligible compared to other costs.
+
+### 2. Component Overhead (`overhead_ns`)
+
+```text
+component_ns = node.attrs["overhead_ns"]
+```
+
+Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`.
+This models arbitration, protocol processing, pipeline stages, etc.
+
+| Component | overhead_ns | Meaning |
+|-----------|-------------|---------|
+| pcie_ep | 5.0 | PCIe protocol processing |
+| io_cpu | 10.0 | Command decode / dispatch |
+| m_cpu | 5.0 | DMA scheduling |
+| fabric switch | 5.0 | Packet arbitration |
+| xbar | 2.0 | Crossbar arbitration |
+| xbar bridge | 1.0 | Bridge traversal between xbar halves |
+| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16ns per crossing) |
+| noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
+| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
+| pe_cpu | 2.0 | Command dispatch |
+| pe_scheduler | 1.0 | PE-internal scheduling |
+| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
+
+### 3. Drain (Serialization Delay)
+
+```text
+drain_ns = nbytes / bottleneck_bw_gbs
+```
+
+**Wormhole (cut-through) model**: data flows through intermediate nodes as a
+pipeline. Serialization cost is paid **once** at the terminal node, not at
+every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.
+
+Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.
+
+### Formula (Theoretical Lower Bound)
+
+```text
+formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
+```
+
+This is the latency with **zero contention**—no other request competing for
+any resource. The engine provides `_formula_latency()` for verification.
+With no contention: `actual == formula`. With contention: `actual > formula`.
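+
+The lower bound is simple enough to compute by hand. The sketch below is an independent
+illustration, not the engine's `_formula_latency()`: `formula_latency_ns` and its argument
+layout (a list of `(distance_mm, bw_gbs)` edges plus a list of per-component overheads) are
+assumptions. It relies on 1 GB/s being 1 byte/ns, so `nbytes / bw_gbs` is already in ns.
+
+```python
+def formula_latency_ns(edges, overheads_ns, nbytes, ns_per_mm=0.01):
+    """Zero-contention latency: wire propagation + overheads + one drain.
+
+    edges        : list of (distance_mm, bw_gbs) along the path
+    overheads_ns : overhead_ns of every component on the path
+    """
+    wire_ns = sum(distance_mm * ns_per_mm for distance_mm, _ in edges)
+    bottleneck_bw_gbs = min(bw_gbs for _, bw_gbs in edges)
+    drain_ns = nbytes / bottleneck_bw_gbs  # paid once, at the terminal node
+    return wire_ns + sum(overheads_ns) + drain_ns
+
+
+# Mirrors the PE DMA read example: one 2.5 mm edge at 256 GB/s,
+# xbar overhead 2.0 ns, hbm_ctrl overhead 0 ns, 4096 bytes.
+pe_dma_read_ns = formula_latency_ns([(2.5, 256.0)], [2.0, 0.0], 4096)
+```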
+
+### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
+
+```mermaid
+sequenceDiagram
+    participant D as pe_dma
+    participant X as xbar.pe0
+    participant H as hbm_ctrl.slice0
+
+    D->>X: txn (4096B)
+    Note over X: overhead 2.0 ns
+    X->>H: txn (wire 0.025 ns)
+    Note over H: acquire Resource
+    Note over H: overhead 0 ns
+    Note over H: drain 4096/256 = 16.0 ns
+    Note over H: release Resource
+    H-->>D: done.succeed()
+
+    Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
+```
+
+### Diagram: Two Requests — No Contention vs HOL Blocking
+
+#### Case 1: Different slices (parallel, no contention)
+
+```mermaid
+sequenceDiagram
+    participant A as Request A
+    participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
+    participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
+
+    Note over A,S1: t=2 ns — both requests arrive at their own slice
+    A->>S0: A (4KB)
+    A->>S1: B (4KB)
+    Note over S0: acquire (immediate)
+    Note over S1: acquire (immediate)
+    Note over S0: drain 16.0 ns
+    Note over S1: drain 16.0 ns
+    Note over S0: t=18 release
+    Note over S1: t=18 release
+
+    Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
+```
+
+#### Case 2: Same slice (HOL blocking)
+
+```mermaid
+sequenceDiagram
+    participant A as Request A (4KB)
+    participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
+    participant B as Request B (64B)
+
+    Note over A,B: t=0 — A arrives first
+    A->>Q: acquire (immediate)
+    Note over Q: drain A = 16.0 ns
+
+    Note over B,Q: t=5 — B arrives, yield req → BLOCKED
+    B--xQ: waiting...
+
+    Note over Q: t=16 — A drain done, release
+    Q->>B: B acquires resource
+    Note over Q: drain B = 0.25 ns
+    Note over Q: t=16.25 — B done, release
+
+    Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
+```
+
+---
+
+## How SimPy Tracks Latency
+
+### Measurement
+
+```python
+start_ns = env.now
+yield txn_done          # wait for the transaction to complete
+total_ns = env.now - start_ns     # ← this is what probe reports
+```
+
+`env.now` is SimPy's simulation clock. It only advances when a process `yield`s
+a timeout or waits on a resource/store. The delta between start and done captures
+**everything**: wire delays, component overheads, drain, and any queueing.
+
+### Component Pipeline
+
+Each component is a SimPy process:
+
+```text
+_fan_in (per in_port)  →  _inbox (Store)  →  _worker  →  out_ports
+```
+
+1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
+2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
+3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.
+
+The worker uses `env.process()` (pipeline model), so multiple messages can be
+in-flight through the same component concurrently. Contention happens when
+they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).
+
+### Wire Process
+
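+```python
+def wire(env, out_port, in_port, prop_ns):
+    """SimPy process for one directed edge (function name is illustrative)."""
+    while True:
+        msg = yield out_port.get()      # wait for sender
+        yield env.timeout(prop_ns)      # propagation delay
+        yield in_port.put(msg)          # deliver to receiver
+```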
+
+Each directed edge has its own wire process. Messages are delayed by exactly
+`distance_mm × ns_per_mm`.
+
+---
+
+## Contention and Queueing
+
+Queueing delay is **not a separate formula term**—it emerges from SimPy's
+event scheduling when multiple requests compete for the same resource.
+
+### Where Contention Occurs
+
+| Resource | SimPy Type | Capacity | Effect |
+|----------|-----------|----------|--------|
+| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
+| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
+| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
+| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
+| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |
+
+### How Queueing Works
+
+```python
+# hbm_ctrl._worker
+with self._resource.request() as req:
+    yield req                     # ← BLOCKS if resource is occupied
+    yield from self.run(env, txn.nbytes)
+    yield env.timeout(drain_ns)
+```
+
+If request A holds the resource and request B arrives:
+- B's `yield req` blocks until A releases the resource
+- B resumes only after the shared simulation clock (`env.now`) has advanced past A's remaining service time
+- This "extra" time shows up in B's `total_ns` automatically
+
+```text
+No contention:  actual_ns == formula_ns
+Contention:     actual_ns  > formula_ns
+                queueing_delay = actual_ns - formula_ns
+```
+
+### Head-of-Line (HOL) Blocking at hbm_ctrl
+
+The `simpy.Resource` is held for the **entire** `with` block—both overhead and
+drain. The resource is NOT released between overhead and drain:
+
+```python
+with self._resource.request() as req:
+    yield req                              # acquire (or wait)
+    yield from self.run(env, txn.nbytes)   # overhead_ns  ─┐
+    yield env.timeout(drain_ns)            # drain_ns      │ resource held
+# ← resource released here ───────────────────────────────┘
+```
+
+This means a short request arriving during a long request's drain must wait
+for the full remaining drain time—classic head-of-line blocking:
+
+```text
+Request A: 4 KB,  drain = 16.0 ns   (arrives at t=0)
+Request B: 64 B,  drain = 0.25 ns   (arrives at t=5)
+
+Timeline:
+  t=0.00   A acquires resource
+  t=0.00   A: overhead (0 ns)
+  t=0.00   A: drain starts (16.0 ns)
+  t=5.00   B arrives → yield req → BLOCKED (A holds resource)
+  t=16.00  A: drain done → resource released
+  t=16.00  B acquires resource
+  t=16.00  B: overhead (0 ns)
+  t=16.25  B: drain done → resource released
+
+  B actual  = 11.25 ns (waited 11.0 + own 0.25)
+  B formula = 0.25 ns
+  B queueing = 11.0 ns  ← HOL blocking penalty
+```
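+
+The timeline above can be reproduced without SimPy. The sketch below is a
+plain-Python model of a capacity-1 FIFO resource, not the simulator's code;
+`fifo_hold_times` and its tuple layout are hypothetical names chosen for
+illustration. It assumes each request holds the resource for its full service
+time (overhead + drain) and that requests are served in arrival order.
+
+```python
+def fifo_hold_times(requests):
+    """Model a capacity-1 FIFO resource.
+
+    requests: list of (name, arrival_ns, service_ns), sorted by arrival.
+    Returns {name: (actual_ns, formula_ns, queueing_ns)}.
+    """
+    free_at = 0.0  # time at which the resource is next released
+    results = {}
+    for name, arrival_ns, service_ns in requests:
+        start = max(arrival_ns, free_at)   # wait if the resource is held
+        free_at = start + service_ns       # held for the full service time
+        actual_ns = free_at - arrival_ns   # latency the requester observes
+        results[name] = (actual_ns, service_ns, actual_ns - service_ns)
+    return results
+
+
+# Request A: 4 KB, drain 16.0 ns, arrives at t=0
+# Request B: 64 B, drain 0.25 ns, arrives at t=5 (overhead is 0 for both)
+timings = fifo_hold_times([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
+for name, (actual, formula, queueing) in timings.items():
+    print(f"{name}: actual={actual} formula={formula} queueing={queueing}")
+```
+
+For B this gives an actual time of 11.25 ns against a formula time of 0.25 ns,
+which matches the 11.0 ns HOL penalty in the timeline.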
+
+**Why this is physically realistic**: An HBM channel processes one burst at a
+time. While data is being serialized onto the channel (drain), no other request
+can use that channel. The FIFO ordering (`simpy.Resource` default) reflects
+the simplest controller scheduling policy.
+
+**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can
+serve shorter requests first (Shortest Job First) by passing a size-derived
+`priority` to `request()` (lower values are served first), but this is not
+currently used since FIFO matches typical HBM controller behavior.
+
+---
+
+## Worked Example: Two Concurrent PE DMA Reads
+
+Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
+(slice0 and slice1), submitted to the **same engine** at the same time.
+
+### Paths
+
+```text
+DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
+DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
+```
+
+### No Contention (different HBM slices)
+
+Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
+`simpy.Resource(capacity=1)`, there is no resource competition.
+
+```text
+DMA A timeline:
+  t=0.00   pe_dma dequeues txn
+  t=0.00   xbar.pe0: overhead_ns=2.0 → t=2.00
+  t=2.00   wire prop (2.5 mm × 0.01 ns/mm = 0.025) → t=2.025
+  t=2.025  hbm_ctrl.slice0: yield req → immediate (no contention)
+  t=2.025  hbm_ctrl.slice0: overhead_ns=0 → t=2.025
+  t=2.025  drain_ns = 4096/256 = 16.0 → t=18.025
+  t=18.025 done
+
+DMA B timeline: (identical, on its own slice)
+  t=0.00   → ... → t=18.025  done
+```
+
+Both complete at 18.025 ns. `actual == formula` for both.
+
+### With Contention (same HBM slice)
+
+Now suppose both PE0 and PE1 read from **slice0**:
+
+```text
+DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
+DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
+                                (chain traversal to reach slice0)
+```
+
+```text
+DMA A timeline:
+  t=0.00   xbar.pe0(2.0) → wire → hbm_ctrl.slice0
+  t=2.025  yield req → immediate (first to arrive)
+  t=18.025 drain 16.0 → release resource → done
+  actual_A = 18.025 ns (== formula)
+
+DMA B timeline:
+  t=0.00   xbar.pe1(2.0) → xbar.pe0(2.0) → wire(0.035) → hbm_ctrl.slice0
+  t=4.035  yield req → BLOCKED (A holds resource until t=18.025)
+  t=18.025 acquire resource
+  t=50.025 drain 32.0 → release → done
+  actual_B = 50.025 ns
+
+  B's drain uses the bottleneck BW of B's own path, which is lower than A's:
+    A's path: 256 GB/s              → drain = 4096/256 = 16.0 ns
+    B's path: xbar_x_bw = 128 GB/s  → drain = 4096/128 = 32.0 ns
+
+  formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
+  queueing  = actual_B - formula_B = 50.025 - 36.035 = 13.99 ns
+              (time spent waiting for A to release hbm_ctrl: 18.025 - 4.035)
+```
+
+The key insight: **queueing delay is not in the formula**. It only appears in
+the actual SimPy simulation when resources are contested. The probe reports
+`actual_ns`, which includes all queueing. To see pure queueing overhead,
+compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
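+
+The comparison can be sketched in a few lines. This is an illustrative model
+rather than the probe's code: `Hop` and `formula_ns` are hypothetical names,
+the wire delay is lumped into a single 0.035 ns term, and placing the 128 GB/s
+bottleneck on the first hop is an assumption. The arrival and release times
+are taken from the contention example above.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Hop:
+    name: str
+    overhead_ns: float
+    bw_gbs: float  # 1 GB/s == 1 byte/ns
+
+
+def formula_ns(hops, wire_ns, nbytes):
+    """Contention-free latency: wire + total overhead + drain at bottleneck BW."""
+    overhead = sum(h.overhead_ns for h in hops)
+    bottleneck = min(h.bw_gbs for h in hops)
+    return wire_ns + overhead + nbytes / bottleneck
+
+
+# DMA B from the contention example (bottleneck placement is assumed).
+path_b = [
+    Hop("xbar.pe1", 2.0, 128.0),
+    Hop("xbar.pe0", 2.0, 256.0),
+    Hop("hbm_ctrl.slice0", 0.0, 256.0),
+]
+formula_b = formula_ns(path_b, wire_ns=0.035, nbytes=4096)
+
+# B reaches hbm_ctrl at t=4.035, but A holds the resource until t=18.025.
+arrival_b, a_release = 4.035, 18.025
+queueing_b = max(0.0, a_release - arrival_b)
+actual_b = formula_b + queueing_b
+
+print(f"formula_B={formula_b:.3f} queueing_B={queueing_b:.3f} actual_B={actual_b:.3f}")
+```
+
+The formula term depends only on the path, while the queueing term depends on
+what else is in flight, which is why only the simulation can report it.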
+
+---
+
+## Probe Output Explained
+
+```text
+=== PE DMA Latency ===
+Case                Target              Actual  Ovhd  Drain  Wire  Ovhd% Drain%  Eff.BW   BN.BW   Util%
+pe-local-hbm        c0.pe0->c0.slice0    18.09   2.0  16.0  0.08  11.1% 88.5%   226.49   256.0   88.5%
+pe-cross-half-hbm   c0.pe0->c0.slice4    37.14   5.0  32.0  0.14  13.5% 86.1%   110.27   128.0   86.1%
+```
+
+| Column | Meaning |
+|--------|---------|
+| **Actual** | Measured SimPy `env.now` delta in ns (includes contention, if any) |
+| **Ovhd** | Sum of `overhead_ns` over all components on the forward path (ns) |
+| **Drain** | `nbytes / bottleneck_bw`: serialization time at the terminal (ns) |
+| **Wire** | Sum of `distance_mm × ns_per_mm` over all edges (ns) |
+| **Ovhd%** | `Ovhd / Actual × 100`: share of time spent in component processing |
+| **Drain%** | `Drain / Actual × 100`: share of time spent in data transfer |
+| **Eff.BW** | `nbytes / Actual`: achieved bandwidth (bytes/ns = GB/s) |
+| **BN.BW** | Bottleneck bandwidth: minimum `bw_gbs` on the path (GB/s) |
+| **Util%** | `Eff.BW / BN.BW × 100`: how close the transfer gets to the bottleneck bandwidth |
+
+### Why Util% < 100%
+
+`Util% = Drain%`: since `Eff.BW = nbytes / actual_ns` and
+`BN.BW = nbytes / drain_ns`, their ratio is `drain_ns / actual_ns`. The gap
+from 100% is the overhead (plus wire) fraction. For small transfers (4 KB),
+overhead is significant relative to drain. For large transfers, drain
+dominates and utilization approaches 100%.
+
+```text
+  4 KB:  Ovhd=2.0, Drain=16.0  → Util=88.5%   (overhead is 11% of time)
+ 64 KB:  Ovhd=2.0, Drain=256.0 → Util=99.2%   (overhead is <1% of time)
+```
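+
+The trend can be checked with a short calculation. This is a sketch under
+stated assumptions: `util_pct` is a hypothetical helper, there is no
+contention (so the actual time is approximated as overhead + drain + wire),
+and the default values are taken from the `pe-local-hbm` row above.
+
+```python
+def util_pct(nbytes, overhead_ns=2.0, wire_ns=0.08, bn_bw_gbs=256.0):
+    """Utilization = drain / actual, with actual approximated as the
+    contention-free sum overhead + drain + wire."""
+    drain_ns = nbytes / bn_bw_gbs          # 1 GB/s == 1 byte/ns
+    actual_ns = overhead_ns + drain_ns + wire_ns
+    return 100.0 * drain_ns / actual_ns
+
+
+for kb in (4, 64, 1024):
+    print(f"{kb:>5} KB: Util = {util_pct(kb * 1024):.1f}%")
+```
+
+Because the overhead and wire terms are fixed per transfer while drain grows
+linearly with size, utilization rises monotonically toward 100%.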
+
+### H2D Path: Why Ovhd% is ~40%
+
+H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc →
+xbar → hbm_ctrl, plus the response path). Total forward overhead is ~23 ns
+against a drain of 32 ns for 4 KB, so overhead is comparable to the data
+transfer time: it takes roughly 40% of the measured time and leaves ~55%
+utilization. This is expected for small command-path transfers.