ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 short (3-4 letter) category
  prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay
  immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
@@ -0,0 +1,592 @@
# CCL Algorithm Author Guide (English)

This document is a step-by-step guide for engineers writing CCL
(Collective Communication Library) algorithms in kernbench. The
internal system design and component structure live in
[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).

The goal here is to clearly separate **what an algorithm author has to
touch** from **what they can leave alone**, and to get a first
algorithm running through the shortest possible path.

---

## 0. Five-minute tour

| Things you touch | Location |
|------------------|----------|
| Algorithm module (kernel + optional `neighbors()`) | `src/kernbench/ccl/algorithms/<algo>.py` |
| Algorithm registration | `ccl.yaml` |
| Host bench (rank count, init, launch, verify) | `benches/<your_bench>.py` |
| (Optional) unit test | `tests/test_<algo>.py` |

| Things you do NOT touch | Location |
|--------------------------|----------|
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
| Backend implementation (`install_ipcq`) | `src/kernbench/runtime_api/distributed.py` and `kernbench/ccl/install.py` |

Workflow:

1. Write a `kernel` function in the algorithm module.
2. Register an entry in `ccl.yaml`.
3. Write a host bench using `torch.distributed.init_process_group` /
   `torch.distributed.all_reduce` (the unified `benches/ccl_allreduce.py`
   handles the common case).
4. (Optional) Run the mock runtime for fast unit tests (a few ms).
5. `kernbench run --bench <name> --verify-data` for full SimPy verification.

---
## 1. Hello World — the simplest send/recv
Each PE sends its tile to its E neighbor once and receives a tile from
its W neighbor once. The reference code lives in
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).
### Step 1: write the kernel
New file `src/kernbench/ccl/algorithms/hello_send.py`:
```python
"""Hello world: send your tile to the next rank, receive from the previous one."""


def kernel(t_ptr, n_elem, tl):
    # Global rank is computed from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # Load our slice and send it east.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive from west and store directly back into our slice.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    """Positional kernel args used by the ahbm backend (after t_ptr)."""
    return (n_elem,)
```
Key points:
- **Global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.**
TL has no contractually-supported `tl.rank` / `tl.world_size`. If the
host needs to pass `world_size` or anything else as an algorithm
parameter, it goes through ordinary `torch.launch` arguments.
- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads
`addr`/`space`/`shape`/`dtype`/`nbytes` from the handle to issue an
`IpcqDmaToken` to PE_DMA.
- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle
  points at the IPCQ ring slot and can be passed directly as the source
  handle of a store (e.g. `tl.store(pe_addr, recv)`). Phase 2's `dma_write`
  replay performs the (slot → hbm) copy, so user code never has to touch
  `recv.data`.
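
The rank arithmetic used by the kernel can be checked on the host without the simulator. The sketch below is plain Python and assumes the row-major layout shown in the kernel (`rank = cube_id * pes_per_cube + local_pe`) and 2-byte `f16` elements. The helper names `global_rank`, `split_rank`, and `slice_addr` are illustrative and not part of kernbench.

```python
F16_BYTES = 2  # bytes per f16 element, matching `nbytes = n_elem * 2` in the kernel


def global_rank(local_pe: int, cube_id: int, pes_per_cube: int) -> int:
    """Row-major rank, as the kernel computes it from program_id(0/1)."""
    return cube_id * pes_per_cube + local_pe


def split_rank(rank: int, pes_per_cube: int) -> tuple[int, int]:
    """Inverse mapping: rank -> (cube_id, local_pe)."""
    return divmod(rank, pes_per_cube)


def slice_addr(t_ptr: int, rank: int, n_elem: int) -> int:
    """Byte address of a rank's slice inside the flat f16 tensor."""
    return t_ptr + rank * n_elem * F16_BYTES


if __name__ == "__main__":
    # Hypothetical topology: 2 cubes of 4 PEs each, 8 elements per rank.
    for cube_id in range(2):
        for local_pe in range(4):
            r = global_rank(local_pe, cube_id, pes_per_cube=4)
            print(r, split_rank(r, 4), hex(slice_addr(0x1000, r, n_elem=8)))
```

Because the mapping is a plain `divmod` pair, a host bench can recover `(cube_id, local_pe)` from a rank when it needs to label results.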
### Step 2: register in `ccl.yaml`
```yaml
algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
    world_size: 8
```
`world_size` here is optional. If absent, `AhbmCCLBackend` derives it
from the topology spec (`sips × cubes_per_sip × pes_per_cube`).
### Step 3: write a host bench (optional — the unified bench may suffice)
For most CCL benchmarks the existing `benches/ccl_allreduce.py` is
sufficient: it reads `ccl.yaml`, picks the algorithm, sets up the
process group, and runs the collective. If your algorithm needs custom
host logic, write a new bench file along the same lines.
The host code looks like a real PyTorch DDP worker:
```python
"""benches/ccl_hello.py"""
from __future__ import annotations

import numpy as np

from kernbench.policy.placement.dp import DPPolicy

N_ELEM = 8


def worker(rank: int, world_size: int, torch) -> None:
    """Per-rank business logic; mirrors a real PyTorch DDP worker."""
    dp = DPPolicy(
        cube="replicate", pe="column_wise",
        num_cubes=1, num_pes=world_size,
    )
    tensor = torch.zeros(
        (1, world_size * N_ELEM), dtype="f16", dp=dp, name="hello_in",
    )

    # Per-rank initialization via the real PyTorch idiom.
    init = np.zeros((1, world_size * N_ELEM), dtype=np.float16)
    for r in range(world_size):
        init[0, r * N_ELEM : (r + 1) * N_ELEM] = float(r + 1)
    tensor.copy_(torch.from_numpy(init))

    # The collective itself.
    torch.distributed.all_reduce(tensor, op="sum")

    # Verify on rank 0 (real PyTorch DDP idiom).
    if rank == 0:
        result = tensor.numpy()
        for r in range(world_size):
            expected = float(((r - 1) % world_size) + 1)
            slice_r = result[0, r * N_ELEM : (r + 1) * N_ELEM]
            print(
                f" rank {r}: got {float(slice_r.mean()):.1f}, "
                f"expected {expected:.1f}"
            )


def run(torch) -> None:
    """CLI entry point. Initializes dist, dispatches to worker."""
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")
    worker(
        rank=dist.get_rank(),
        world_size=dist.get_world_size(),
        torch=torch,
    )
```
### Step 4: unit test (optional but strongly recommended)
`tests/test_hello_send.py`:
```python
import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [
        np.full((n_elem,), float(r + 1), dtype=np.float16)
        for r in range(4)
    ]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # rank r should now hold rank (r-1) % 4's data.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
```
`run_kernel_in_mock` runs every rank concurrently in pure Python (no
SimPy), so a unit test like this finishes in **milliseconds**. It only
verifies algorithmic correctness — no latency, no DMA, no fabric.
### Step 5: SimPy validation
```bash
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
```
Phase 1 runs the SimPy simulation plus MemoryStore data movement; Phase 2
replays the op_log for correctness. The bench's `print` lines should
show matching `got` and `expected` values for every rank.

---
## 2. Ring all-reduce — the second algorithm
Slightly more complex. Each PE runs `world_size - 1` rounds, sending
its current tile east and accumulating the tile received from the west.
After all rounds, every PE holds the global sum.
The reference implementation lives in
[`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py).
The core flow:
```python
"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # The handle points at HBM[pe_addr]. In greenlet mode .data is
    # populated, but the kernel never has to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # source for the first send

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
        # Phase 1 only models timing; Phase 2 DataExecutor replays the
        # actual numpy accumulation.
        acc = acc + recv
        current = recv  # forward the received slot to the next round

    # Store the final accumulator back to HBM. Source is acc (a PE-local
    # scratch addr); dst is HBM. The op_log dma_write entry records both
    # ends so Phase 2 copies the math result into HBM at verify time.
    tl.store(pe_addr, acc)


def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)
```
Four key points:
1. **Accumulation goes through TensorHandle operators.** `acc + recv`
emits a `MathCmd` and dispatches it through PE_MATH — i.e. the
real hardware path, so the latency model stays accurate. Per
ADR-0020 D3, Phase 1 only simulates timing; Phase 2's `DataExecutor`
replays the op_log and runs the actual numpy accumulation.
2. **Use `current = recv` to forward.** Each round must update the send
source to the just-received slot handle so the same data circulates
exactly once around the ring. Setting `current = acc` would resend
the cumulative sum, inflating the result.
3. **`tl.store(pe_addr, acc)` exactly once at the end.** Do not use a
store→reload pattern in the middle. `acc` lives in PE-local scratch;
the op_log records `(src=scratch, dst=hbm)` and Phase 2 first runs
math (filling scratch) then copies via the dma_write snapshot.
4. **`world_size` is passed by the host explicitly.** TL only knows the
topology slot count (e.g. `num_programs(axis=0)` is "PEs per cube"),
not the participating CCL group size. The host bench knows
`world_size` and forwards it as an explicit kernel argument.
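
To see why point 2 matters, the ring can be simulated on the host in a few lines of NumPy. This is a sketch, not kernbench code: `simulate_ring` is a hypothetical helper that models each round as "every rank receives what its west neighbor sent", and its `forward` flag switches between `current = recv` and the buggy `current = acc`.

```python
import numpy as np


def simulate_ring(inputs, forward="recv"):
    """Simulate world_size - 1 ring rounds (send E, recv W, accumulate).

    forward="recv" models `current = recv`; forward="acc" models the
    buggy `current = acc` variant that re-sends the running sum.
    """
    world_size = len(inputs)
    acc = [x.astype(np.float32) for x in inputs]
    current = [a.copy() for a in acc]
    for _ in range(world_size - 1):
        # Rank r receives what rank (r - 1) % world_size sent east.
        recv = [current[(r - 1) % world_size] for r in range(world_size)]
        acc = [acc[r] + recv[r] for r in range(world_size)]
        current = recv if forward == "recv" else [a.copy() for a in acc]
    return acc


if __name__ == "__main__":
    inputs = [np.full(4, float(r + 1)) for r in range(4)]  # tiles 1, 2, 3, 4
    print(simulate_ring(inputs, "recv")[0])  # global sum 1+2+3+4 on every rank
    print(simulate_ring(inputs, "acc")[0])   # inflated: partial sums were re-sent
```

With `forward="recv"` each original tile visits every rank exactly once, which is the invariant the kernel relies on.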
For registration in `ccl.yaml` and wiring through the unified bench,
look at the existing `ring_allreduce_tcm/_hbm/_sram` entries plus
[`benches/ccl_allreduce.py`](../benches/ccl_allreduce.py). Mock unit
tests live in
[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
and follow the `kernel_args=(n_elem, world_size)` convention.
---
## 3. `neighbors()` override — custom topology
Most algorithms are happy with the builtin topologies (`ring_1d`,
`mesh_2d`, `tree_binary`, `ring_1d_unidir`, `none`). If you want to
modify a builtin or define a brand-new connectivity pattern, define a
`neighbors()` function in your algorithm module.
### Signature
```python
def neighbors(
    rank: int, world_size: int, neighbor_map: dict[str, int],
) -> dict[str, int] | None:
    """Override the neighbor map produced by the builtin topology.

    Args:
        neighbor_map: the mapping the ccl.yaml ``topology`` field built.
            For ring_1d this is {"E": (rank+1)%ws, "W": (rank-1)%ws}.
            The dict is mutable; modify in place if you want.

    Returns:
        dict: the new neighbor map (or the modified-in-place dict).
        None: do not override; use neighbor_map as-is.
    """
    return None
```
### Pattern A: tweak a builtin
```python
def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use W; remove W from odd ranks.
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map
```

### Pattern B: replace entirely (skip-connection ring)

```python
def neighbors(rank, world_size, neighbor_map):
    return {"E": (rank + 2) % world_size}
```

### Pattern C: keep builtin

Either omit `neighbors` entirely or return None:

```python
def neighbors(rank, world_size, neighbor_map):
    return None  # explicit "use the builtin"
```
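
How an override combines with the builtin map can be sketched as follows. This illustrates the contract described in the signature docstring and is not the framework's own code. `ring_1d_neighbors` and `resolve_neighbors` are hypothetical names, and handing the hook a copy of the builtin map is a design choice of this sketch.

```python
from typing import Callable, Optional

NeighborMap = dict[str, int]
Hook = Callable[[int, int, NeighborMap], Optional[NeighborMap]]


def ring_1d_neighbors(rank: int, world_size: int) -> NeighborMap:
    """Builtin ring_1d map, as given in the signature docstring."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def resolve_neighbors(rank: int, world_size: int, hook: Optional[Hook]) -> NeighborMap:
    """Apply an optional neighbors() hook on top of the builtin map."""
    builtin = ring_1d_neighbors(rank, world_size)
    if hook is None:
        return builtin
    # Sketch-only choice: pass a copy so that an in-place edit followed by
    # `return None` cannot silently change the builtin map.
    result = hook(rank, world_size, dict(builtin))
    return builtin if result is None else result


def drop_w_on_odd(rank, world_size, neighbor_map):  # Pattern A
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map


def skip_ring(rank, world_size, neighbor_map):  # Pattern B
    return {"E": (rank + 2) % world_size}


if __name__ == "__main__":
    for hook in (None, drop_w_on_odd, skip_ring, lambda r, ws, m: None):
        print([resolve_neighbors(r, 4, hook) for r in range(4)])
```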
---
## 4. PE kernel API reference (ADR-0023 D4)
### IPCQ API
| API | Description | Blocking? |
|-----|-------------|-----------|
| `tl.send(dir, src=TensorHandle)` | Send to a peer in the given direction. | Yes (waits if peer slots are full) |
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form. | Yes |
| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from one direction. | Yes |
| `tl.recv(shape=..., dtype=...)` | Round-robin recv across all four directions. | Yes |
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv. | No |
| `tl.wait(future)` | Wait for a non-blocking recv future → returns the resolved TensorHandle. | Yes |
### Existing TL API (ADR-0020/0022, unchanged)
| API | Description |
|-----|-------------|
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` carries the ndarray. |
| `tl.store(addr, handle)` | DMA write — when `handle.data` is set the runner propagates it to MemoryStore. |
| `tl.composite(op, ...)` | Submit a GEMM/Math composite (non-blocking). |
| `tl.program_id(axis=0)` | Local PE id within the cube. |
| `tl.program_id(axis=1)` | Cube id (ADR-0022). |
| `tl.num_programs(axis=0/1)` | Topology slot counts (NOT the participating-rank count). |
### Two recv modes
The default is `return_slot` (zero-copy): the IPCQ slot address is
returned in `handle.addr`. To force a copy into a custom destination,
pass `dst_addr` + `dst_space`:
```python
recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# After this call recv.addr == my_scratch_addr (copy_to_dst mode).
```
---
## 5. Helpers (`kernbench.ccl.helpers`)
Convenience helpers to keep algorithm code short:
```python
from kernbench.ccl.helpers import chunked, ring_step, tree_step
```
### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
Split a tile of `n_elem` elements into `n_chunks` equal-size views.
Each `Chunk` has `addr`, `n_elem`, `nbytes` fields.
```python
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] are 16-element views with consecutive addresses.
```
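
For intuition, a helper with this contract could look like the sketch below. It is a stand-in written for this guide, not the `kernbench.ccl.helpers` source. The `Chunk` fields follow the description above, while the dtype-size table and the requirement that `n_elem` divide evenly are assumptions.

```python
from dataclasses import dataclass

# Assumed element sizes in bytes; only "f16" appears in this guide.
DTYPE_BYTES = {"f16": 2, "f32": 4}


@dataclass(frozen=True)
class Chunk:
    addr: int    # byte address of the first element in the chunk
    n_elem: int  # number of elements in the chunk
    nbytes: int  # chunk size in bytes


def chunked_sketch(base_addr: int, n_chunks: int, n_elem: int, dtype: str = "f16") -> list[Chunk]:
    """Split n_elem elements starting at base_addr into n_chunks equal views."""
    if n_elem % n_chunks != 0:
        raise ValueError("n_elem must be divisible by n_chunks in this sketch")
    per_chunk = n_elem // n_chunks
    nbytes = per_chunk * DTYPE_BYTES[dtype]
    return [Chunk(base_addr + i * nbytes, per_chunk, nbytes) for i in range(n_chunks)]


if __name__ == "__main__":
    for c in chunked_sketch(0x2000, n_chunks=4, n_elem=64):
        print(hex(c.addr), c.n_elem, c.nbytes)
```

Keeping `addr` in bytes and `n_elem` in elements mirrors how the kernels above mix `pe_addr` arithmetic with `shape=(n_elem,)` arguments.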
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
Per-step chunk indices for a ring algorithm (reduce-scatter / all-gather):
```python
for step in range(world_size - 1):
    send_idx, recv_idx = ring_step(rank, step, world_size)
    tl.send(
        dir="E", src_addr=chunks[send_idx].addr,
        nbytes=chunks[send_idx].nbytes,
        shape=(chunks[send_idx].n_elem,), dtype="f16",
    )
    recv = tl.recv(
        dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16",
    )
    # accumulate ...
```
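
The guide does not spell out which indices `ring_step` returns. One common convention for ring reduce-scatter is sketched below under that caveat: `ring_step_sketch` is a hypothetical stand-in, and the real helper may use a different offset. The property that matters is that the chunk a rank sends at a given step is the chunk its east neighbor expects to receive.

```python
def ring_step_sketch(rank: int, step: int, world_size: int) -> tuple[int, int]:
    """Chunk indices for one reduce-scatter step on a ring (send E, recv W).

    Assumed convention: at step s, rank r sends chunk (r - s) and
    receives chunk (r - s - 1), both modulo world_size.
    """
    send_idx = (rank - step) % world_size
    recv_idx = (rank - step - 1) % world_size
    return send_idx, recv_idx


if __name__ == "__main__":
    ws = 4
    for step in range(ws - 1):
        print(step, [ring_step_sketch(r, step, ws) for r in range(ws)])
```

Under this convention a rank never sends and receives the same chunk in one step, so the accumulate target and the outgoing buffer do not alias.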
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
Parent / children rank ids for a binary tree:
```python
info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...
```
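
A binary-tree layout consistent with that return shape is the implicit heap numbering, sketched below. Treat it as an assumption: the guide does not state how `tree_step` numbers the tree, and `tree_step_sketch` is a hypothetical stand-in.

```python
from typing import Optional, TypedDict


class TreeInfo(TypedDict):
    parent: Optional[int]
    children: list[int]


def tree_step_sketch(rank: int, world_size: int) -> TreeInfo:
    """Heap-style binary tree: rank 0 is the root, children of r are 2r+1 and 2r+2."""
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}


if __name__ == "__main__":
    for r in range(6):
        print(r, tree_step_sketch(r, 6))
```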
---
## 6. Unit testing — Mock runtime
`kernbench.ccl.testing.run_kernel_in_mock` runs an algorithm without
SimPy for fast feedback.
### Basic usage
```python
import numpy as np

from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel


def test_my_algo():
    n_elem = 16
    # Use np.float16 here: NumPy reads the string "f16" as a 16-byte float,
    # not as half precision.
    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # positional args after t_ptr
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)
```
### Behavior
- All ranks run their kernels concurrently as cooperative greenlets.
- `tl.send` / `tl.recv` are serviced by in-memory FIFOs (no DMA, no
latency).
- Each rank's last `store` is what the helper returns as a numpy array.
### Limitations
- No latency or performance numbers (it is not a simulation).
- No PE_DMA, fabric, or BW model.
- Correctness only.
- One cube assumed: `program_id(axis=1)` is always 0.
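
The idea behind the mock runtime (one task per rank, in-memory FIFOs instead of DMA) can be illustrated with the standard library alone. The sketch below is not `run_kernel_in_mock`: it uses threads and `queue.Queue` where the real helper uses greenlets, and `MockTL`, `run_ring_mock`, and the simplified kernel signature are hypothetical.

```python
import queue
import threading

import numpy as np


class MockTL:
    """Tiny stand-in for the send/recv part of the TL API on a 1-D ring."""

    def __init__(self, rank, world_size, fifos):
        self.rank, self.world_size, self.fifos = rank, world_size, fifos

    def send(self, dir, src):
        # Only E is modeled: data sent east lands in the east neighbor's FIFO.
        assert dir == "E"
        self.fifos[(self.rank + 1) % self.world_size].put(np.copy(src))

    def recv(self, dir):
        # Only W is modeled: block until the west neighbor's data arrives.
        assert dir == "W"
        return self.fifos[self.rank].get(timeout=5)


def hello_kernel(tile, tl):
    """Same shape as hello_send: send east once, receive from west once."""
    tl.send(dir="E", src=tile)
    return tl.recv(dir="W")


def run_ring_mock(kernel, inputs):
    world_size = len(inputs)
    fifos = [queue.Queue() for _ in range(world_size)]  # unbounded: no backpressure
    outputs = [None] * world_size

    def worker(rank):
        outputs[rank] = kernel(inputs[rank], MockTL(rank, world_size, fifos))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


if __name__ == "__main__":
    ins = [np.full(8, float(r + 1), dtype=np.float16) for r in range(4)]
    outs = run_ring_mock(hello_kernel, ins)
    print([float(o[0]) for o in outs])
```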
---
## 7. Debugging
### CCL trace
```bash
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
--bench ccl_allreduce --verify-data
```
Per-rank send/recv events appear on stdout:
```
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
```
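
Trace lines in this shape are easy to post-process. The sketch below parses the two sample lines with a regular expression. The pattern is inferred from those samples only, so treat the field set (and the optional `seq`) as an assumption rather than a stable format; `parse_trace_line` is a hypothetical helper.

```python
import re

TRACE_RE = re.compile(
    r"\[ccl t=(?P<t>[\d.]+) (?P<kind>send|recv)\] "
    r"(?P<pe>\S+) dir=(?P<dir>\w+) nbytes=(?P<nbytes>\d+)"
    r"(?: seq=(?P<seq>\d+))?"
)


def parse_trace_line(line: str):
    """Return a dict for a CCL trace line, or None if the line does not match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "t": float(d["t"]),
        "kind": d["kind"],
        "pe": d["pe"],
        "dir": d["dir"],
        "nbytes": int(d["nbytes"]),
        "seq": int(d["seq"]) if d["seq"] is not None else None,
    }


if __name__ == "__main__":
    sample = [
        "[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0",
        "[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64",
    ]
    for line in sample:
        print(parse_trace_line(line))
```

Grouping the parsed events by `pe` and sorting by `t` gives a per-rank timeline, which is usually enough to spot a rank that sends but never receives.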
### Pointer dump
`kernbench.ccl.diagnostics.pointer_dump(engine)` returns a multi-line
dump of every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`. When something hangs, this shows
which rank is stuck and on what.
### Deadlock detection
When the SimPy schedule empties because of unmatched send/recv pairs,
the engine raises `IpcqDeadlock` and embeds the pointer dump in the
message (ADR-0023 D14 F3). Wait-for-graph visualization is future
work.
---
## 8. Common mistakes
### 1. Using a direction that wasn't installed
`topology: ring_1d` only installs E and W. Trying:
```python
tl.send(dir="N", ...) # → IpcqInvalidDirection
```
Fix: switch to `topology: mesh_2d`, or add N/S in a `neighbors()` override.
### 2. `send` without a matching `recv`
```python
def kernel(t_ptr, n_elem, tl):
    src = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
    for _ in range(100):
        tl.send(dir="E", src=src)
    # The peer never recvs → ring buffer fills → backpressure → deadlock.
```
Fix: every `send` needs a matching `recv` on the receiver side.
Otherwise `IpcqDeadlock` is raised.
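
The failure mode can be reproduced in miniature with a bounded queue. This is only an analogy for the IPCQ ring buffer: the slot count of 4 is arbitrary, and a non-blocking `put_nowait` stands in for the point at which a real `tl.send` would start waiting on the peer.

```python
import queue


def sends_until_full(n_slots: int, n_sends: int) -> int:
    """Count how many sends succeed before the peer's ring is full.

    The peer never calls recv, so nothing is ever drained.
    """
    ring = queue.Queue(maxsize=n_slots)
    for i in range(n_sends):
        try:
            ring.put_nowait(i)  # a blocking send would stall here instead
        except queue.Full:
            return i
    return n_sends


if __name__ == "__main__":
    # With 4 slots and no receiver, the 5th send is the first one to stall.
    print(sends_until_full(n_slots=4, n_sends=100))
```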
### 3. dtype/shape mismatch
By default mismatches are not validated. The author is responsible for
consistency. Set `strict_validation: true` on a PE_IPCQ node's attrs to
enable D14 F2 strict mode and catch them immediately.
### 4. Assuming round-robin recv fairness
`tl.recv()` (no direction) returns the first slot to arrive in
round-robin order, but **arrival order is not predictable**. If your
algorithm depends on a particular direction, name it explicitly:
`tl.recv(dir="N", ...)`.
### 5. Confusing `num_programs` with the CCL group size
`tl.num_programs(axis=0/1)` reports topology slot counts, not the
number of ranks participating in the collective. The host bench knows
`world_size` and must pass it through as a kernel argument.
### 6. Overwriting the send source before it's actually sent
PE_DMA snapshots the source data into the `IpcqDmaToken` when the send is
enqueued, which preserves in-flight semantics: later writes to the source
do not alter data already in flight. The snapshot only protects you from
that point on, though. If you mutate the source address before `tl.send`
has reached the PE_DMA queue, the snapshot captures the modified data.
The safe pattern is to call `tl.send` first and mutate the source only
afterwards.
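
The difference between a snapshot and a live reference is the same one NumPy users meet with copies. The sketch below is an analogy only: `enqueue_send` is a hypothetical stand-in for the moment the token is created, not the PE_DMA implementation.

```python
import numpy as np


def enqueue_send(src: np.ndarray) -> np.ndarray:
    """Model the snapshot taken when a send is enqueued: copy the payload."""
    return src.copy()


if __name__ == "__main__":
    # Safe order: send first, mutate afterwards. The in-flight data is unaffected.
    tile = np.array([1.0, 2.0, 3.0], dtype=np.float16)
    in_flight = enqueue_send(tile)
    tile[:] = 0
    print(in_flight)

    # Unsafe order: the mutation happens before the snapshot exists,
    # so the snapshot captures the modified data.
    tile = np.array([1.0, 2.0, 3.0], dtype=np.float16)
    tile[:] = 0
    print(enqueue_send(tile))
```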
---
## 9. Next steps
- Try other topologies (`mesh_2d`, `tree_binary`).
- Faster algorithms (recursive halving / doubling).
- Compare `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep)
modes for latency.
- Larger-scale validation through the unified `ccl_allreduce` bench
with different `ccl.yaml` overlays.
If you add a new algorithm or pattern, please send a PR.
---
## References
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model.
Existing algorithm examples:
- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py) — simplest send/recv
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py) — ring all-reduce
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py) — 2D mesh all-reduce
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py) — binary tree all-reduce
@@ -0,0 +1,537 @@
# CCL Algorithm Author Guide

This document is a step-by-step guide for people writing CCL (Collective
Communication Library) algorithms in kernbench. The system's internal design
and component structure are described in
[ADR-0023](adr/ADR-0023-ipcq-pe-collective.md).
This guide aims to clearly separate **what an algorithm author has to touch**
from **what they can leave alone**, and to get a first algorithm running
through the shortest possible path.

---

## 0. Five-minute summary

| Things you touch | Location |
|----------|------|
| Algorithm module (kernel + optional neighbors) | `src/kernbench/ccl/algorithms/<algo>.py` |
| Algorithm registration | `ccl.yaml` |
| Host bench (PE count, memory init, launch, verification) | `benches/<your_bench>.py` |
| (Optional) unit test | `tests/test_<algo>.py` |

| Things you do not touch | Location |
|---------------|------|
| TLContext API | `src/kernbench/triton_emu/tl_context.py` (ADR-0022 spec) |
| Framework (topology generators, helpers, mock testing) | `src/kernbench/ccl/` |
| PE_IPCQ / PE_DMA components | `src/kernbench/components/builtin/` |
| Backend implementation (install_ipcq) | `src/kernbench/runtime_api/distributed.py` and `kernbench/ccl/install.py` |

Workflow:

1. Write `kernel` in the algorithm module.
2. Register an entry in `ccl.yaml`.
3. Call `install_ipcq` + `launch` from the host bench.
4. (Optional) Unit-test with the mock runtime (a few ms).
5. Verify under SimPy with `kernbench run --bench <name> --verify-data`.

---
## 1. Hello World: the simplest send/recv

This is the simplest possible algorithm: each PE sends its data once to its
E neighbor and receives once from the W direction. The working code is in
[`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py).

### Step 1: write the kernel

New file `src/kernbench/ccl/algorithms/hello_send.py`:

```python
"""Hello world: send your data to the next rank, receive from the previous rank."""


def kernel(t_ptr, n_elem, tl):
    # Global rank is computed from program_id(0/1) (ADR-0022).
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2  # f16
    pe_addr = t_ptr + rank * nbytes

    # Load our slice and send it to E.
    src = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.send(dir="E", src=src)

    # Receive from W and store it straight into our slice.
    recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, recv)
```

Key points:

- **Global rank is computed from `program_id(axis=0)` + `program_id(axis=1)`.**
  TL has no extensions outside the contract such as `tl.rank` /
  `tl.world_size`. If the host needs to supply an algorithm parameter such as
  `world_size`, it passes it as an ordinary `torch.launch` argument.
- **`tl.send` takes a `TensorHandle`.** PE_IPCQ reads the handle's
  `addr`/`space`/`shape`/`dtype`/`nbytes` and issues an IpcqDmaToken to PE_DMA.
- **`tl.recv` requires `shape` and `dtype`.** The returned TensorHandle points
  at the IPCQ ring slot and can be passed straight to a store, as in
  `tl.store(pe_addr, recv)`. Phase 2's dma_write replay performs the
  (slot, hbm) copy, so there is no need to touch the numpy `.data` directly.

### Step 2: register in `ccl.yaml`

Add an entry to the `algorithms` section of `ccl.yaml`. (`defaults.algorithm`
does not have to change, because the host bench can pass the algorithm
explicitly via `install_ipcq(algorithm=...)`.)

```yaml
algorithms:
  hello_send:
    module: kernbench.ccl.algorithms.hello_send
    topology: ring_1d
    buffer_kind: tcm
```

### Step 3: write the host bench

New file `benches/ccl_hello.py`:

```python
"""Hello-world ring rotation bench (each PE receives its W neighbor's data once)."""
import numpy as np

from kernbench.ccl.algorithms import hello_send
from kernbench.policy.placement.dp import DPPolicy

ALGORITHM = "hello_send"
N_ELEM = 8
WORLD_SIZE = 8


def run(torch):
    plan = torch.install_ipcq(algorithm=ALGORITHM)

    a = torch.zeros(
        (1, WORLD_SIZE * N_ELEM), dtype="f16",
        dp=DPPolicy(
            cube="replicate", pe="column_wise",
            num_cubes=1,
        ),
        name="hello_in",
    )

    store = torch.engine.memory_store
    base = a._handle.va_base or a._handle.shards[0].pa
    nbytes = N_ELEM * 2
    for r in range(WORLD_SIZE):
        store.write("hbm", base + r * nbytes,
                    np.full((N_ELEM,), float(r + 1), dtype=np.float16))

    torch.launch(ALGORITHM, hello_send.kernel, a, N_ELEM)

    # Rank r should hold the data of rank (r-1) % ws.
    for r, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
        result = store.read("hbm", base + r * nbytes, shape=(N_ELEM,), dtype="f16")
        prev = float(((r - 1) % WORLD_SIZE) + 1)
        ok = np.allclose(result, prev)
        print(f" [{'OK ' if ok else 'FAIL'}] rank {r} got {float(result.mean()):.1f}, "
              f"expected {prev:.1f}")
```

### Step 4: unit test (optional, strongly recommended)

`tests/test_hello_send.py`:

```python
import numpy as np

from kernbench.ccl.algorithms.hello_send import kernel
from kernbench.ccl.testing import run_kernel_in_mock


def test_hello_send_4_ranks():
    n_elem = 8
    inputs = [np.full((n_elem,), float(r + 1), dtype=np.float16) for r in range(4)]
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem,),
    )
    # Rank r should have received the data of rank (r-1) % 4.
    for r in range(4):
        assert np.array_equal(outputs[r], inputs[(r - 1) % 4])
```

`run_kernel_in_mock` runs every rank concurrently in pure Python without
SimPy, so it **finishes in milliseconds**. It verifies only the correctness
of the algorithm logic.

### Step 5: simulation validation

```bash
kernbench run --topology topology.yaml --bench ccl_hello --verify-data
```

Phase 1 runs the SimPy simulation plus MemoryStore data movement; Phase 2
replays the op_log for correctness. The host bench's `print` check must
report OK for every rank.

---
## 2. Ring All-Reduce: the second algorithm

A slightly more complex example. In ring all-reduce, each PE spends N-1
rounds sending its data to E and accumulating what it receives from W. In the
end every PE holds the global sum.
See [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py)
for the working code. The core flow:

```python
"""Ring all-reduce."""


def kernel(t_ptr, n_elem, world_size, tl):
    # rank
    local_pe = tl.program_id(axis=0)
    cube_id = tl.program_id(axis=1)
    pes_per_cube = tl.num_programs(axis=0)
    rank = cube_id * pes_per_cube + local_pe

    nbytes = n_elem * 2
    pe_addr = t_ptr + rank * nbytes

    # TensorHandle pointing at our slice in HBM. In greenlet mode .data is
    # populated, but the kernel never has to touch .data directly.
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    current = acc  # send source for the first round

    for _step in range(world_size - 1):
        tl.send(dir="E", src=current)
        recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
        # TensorHandle operator overload → MathCmd → PE_MATH dispatch.
        # Phase 1 models timing only; Phase 2 DataExecutor performs the
        # actual numpy accumulation.
        acc = acc + recv
        current = recv  # the next round forwards the slot just received

    # Store the final accumulator into our slice. The source is acc (a
    # PE-local scratch addr) and dst is HBM. The op_log dma_write entry
    # records the (scratch, hbm) copy, so Phase 2 fills HBM[pe_addr] with
    # the correct result at verify time.
    tl.store(pe_addr, acc)
```

Four key points:

1. **Accumulation uses TensorHandle operators.** `acc + recv` emits a
   `MathCmd` that is dispatched to PE_MATH. Because it goes through the real
   hardware path, the latency model stays accurate. Per ADR-0020 D3, Phase 1
   simulates timing only, and Phase 2's `DataExecutor` re-executes the op_log
   and performs the numpy accumulation.
2. **Forward with `current = recv`.** Each round must update the send source
   to the slot handle just received, so that the same data travels around the
   ring and is accumulated exactly once per PE. Leaving it as `current = acc`
   re-sends the accumulated value and inflates the result.
3. **A single `tl.store(pe_addr, acc)` is enough.** A store→reload pattern in
   the middle is forbidden. `acc` lives in PE-local scratch, and the op_log
   records the (src=scratch, dst=hbm) metadata. Phase 2 runs the math first to
   fill scratch, then copies to HBM via the dma_write snapshot.
4. **The host passes `world_size` explicitly.** TL only knows topology slot
   counts (e.g. `num_programs(axis=0)` is the number of PEs per cube). The
   bench knows the size of the CCL group that actually participates and passes
   it as a host→kernel argument.

For `ccl.yaml` registration and the host bench, see
[`benches/ccl_allreduce_tcm.py`](../benches/ccl_allreduce_tcm.py).
For mock unit tests, follow
[`tests/test_ccl_mock_runtime.py`](../tests/test_ccl_mock_runtime.py)
as-is (argument form `kernel_args=(n_elem, world_size)`).

---
## 3. `neighbors()` override: custom topology

The builtin topologies (`ring_1d`, `mesh_2d`, `tree_binary`,
`ring_1d_unidir`, `none`) are enough for most algorithms. To modify a builtin
or create a new one, define `neighbors()` in the algorithm module.

### Signature

```python
def neighbors(rank: int, world_size: int, neighbor_map: dict[str, int]) -> dict[str, int] | None:
    """Override the neighbor_map built by the builtin topology.

    Args:
        neighbor_map: the builtin mapping built from the ccl.yaml topology field.
            Example: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws}
            Mutable dict; can be modified directly.

    Returns:
        dict: the overriding neighbor_map (or the same dict, modified)
        None: no override; neighbor_map is used as-is
    """
    return None
```

### Pattern A: start from the builtin and modify part of it

```python
def neighbors(rank, world_size, neighbor_map):
    # Only even ranks use the W direction (W is removed from odd ranks).
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map
```

### Pattern B: write from scratch (skip-connection ring)

```python
def neighbors(rank, world_size, neighbor_map):
    # Ignore neighbor_map and build a new map.
    return {"E": (rank + 2) % world_size}
```

### Pattern C: use the builtin, no override

Either do not define `neighbors()` or return None:

```python
def neighbors(rank, world_size, neighbor_map):
    return None  # explicitly use the builtin
```
---
## 4. PE kernel API reference (ADR-0023 D4)

### IPCQ API

| API | Description | Blocking? |
|-----|------|-----------|
| `tl.send(dir, src=TensorHandle)` | Send data in the given direction | Yes (waits when peer slots are full) |
| `tl.send(dir, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)` | Same, keyword form | Yes |
| `tl.recv(dir, shape=..., dtype=...)` | Blocking recv from a specific direction | Yes |
| `tl.recv(shape=..., dtype=...)` | Round-robin recv across the four directions (no direction given) | Yes |
| `tl.recv_async(dir, shape=..., dtype=...) → RecvFuture` | Non-blocking recv | No |
| `tl.wait(future)` | Wait for a non-blocking future to complete → TensorHandle | Yes |

### Existing TL API (ADR-0020/0022, usable as-is)

| API | Description |
|-----|------|
| `tl.load(addr, shape, dtype) → TensorHandle` | DMA read; in greenlet mode `.data` holds the ndarray |
| `tl.store(addr, handle)` | DMA write; if handle.data is set it is propagated to MemoryStore |
| `tl.composite(op, ...)` | Asynchronous submit of GEMM/Math compute |
| `tl.program_id(axis=0)` | Local PE id within the cube |
| `tl.program_id(axis=1)` | Cube id (ADR-0022) |
| `tl.num_programs(axis=0/1)` | Topology slot counts (not the number of participating ranks) |

### Two `recv` modes

The default is `return_slot` (zero-copy): the IPCQ slot address is returned
as-is in handle.addr. To copy the slot data to a separate location, specify
`dst_addr` + `dst_space`:

```python
recv = tl.recv(
    dir="W", shape=(8,), dtype="f16",
    dst_addr=my_scratch_addr,
    dst_space="hbm",
)
# Now recv.addr == my_scratch_addr (copy_to_dst mode).
```
---
## 5. Helpers (`kernbench.ccl.helpers`)
알고리즘 코드를 짧게 유지하기 위한 헬퍼들:
```python
from kernbench.ccl.helpers import chunked, ring_step, tree_step
```
### `chunked(base_addr, n_chunks, n_elem, dtype="f16") → list[Chunk]`
`n_elem` 개의 element를 `n_chunks` 등분한 view 리스트를 반환. 각 `Chunk`
`addr`, `n_elem`, `nbytes` 필드를 가진다.
```python
chunks = chunked(t_ptr, n_chunks=4, n_elem=64, dtype="f16")
# chunks[0..3] 각각 16 element view, addr이 연속
```
### `ring_step(rank, step, world_size) → (send_idx, recv_idx)`
Ring algorithm의 step별 chunk 인덱스 (reduce-scatter / all-gather):
```python
for step in range(world_size - 1):
send_idx, recv_idx = ring_step(rank, step, world_size)
tl.send(dir="E", src_addr=chunks[send_idx].addr,
nbytes=chunks[send_idx].nbytes,
shape=(chunks[send_idx].n_elem,), dtype="f16")
recv = tl.recv(dir="W", shape=(chunks[recv_idx].n_elem,), dtype="f16")
# accumulate ...
```
### `tree_step(rank, world_size) → {"parent": int|None, "children": list[int]}`
Parent and children ranks in a binary tree:
```python
info = tree_step(rank, world_size)
if info["parent"] is None:
    print(f"rank {rank} is the root")
for child in info["children"]:
    ...
```
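As an illustration of the returned structure, the sketch below lays ranks out as a heap-ordered binary tree (rank 0 is the root, children of `r` are `2r+1` and `2r+2`). The layout is an assumption; the actual `tree_step` may number ranks differently. `tree_step_sketch` is a hypothetical name.

```python
def tree_step_sketch(rank, world_size):
    """Heap-ordered binary tree: an assumed layout, not kernbench's code."""
    parent = None if rank == 0 else (rank - 1) // 2
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < world_size]
    return {"parent": parent, "children": children}


layout = [tree_step_sketch(r, 6) for r in range(6)]
```

For `world_size=6`, rank 2 ends up with a single child (rank 5), which shows why kernels should iterate over `children` rather than assume two.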
---
## 6. Unit tests: Mock Runtime
`kernbench.ccl.testing.run_kernel_in_mock` lets you verify an algorithm quickly
without going through SimPy.
### Basic usage
```python
from kernbench.ccl.testing import run_kernel_in_mock
from kernbench.ccl.algorithms.my_algo import kernel
import numpy as np


def test_my_algo():
    n_elem = 16
    # np.float16 is half precision; NumPy does not read the string "f16" that way
    inputs = [np.arange(n_elem, dtype=np.float16) + r for r in range(4)]
    expected = sum(inputs)
    outputs = run_kernel_in_mock(
        kernel_fn=kernel,
        world_size=4,
        topology="ring_1d",
        inputs=inputs,
        kernel_args=(n_elem, 4),  # extra positional args of the kernel (after t_ptr)
    )
    for r in range(4):
        assert np.allclose(outputs[r], expected, rtol=1e-3)
```
### Behavior
- Runs the kernels of all ranks (4 in this example) concurrently as greenlets
- Serves `tl.send/recv` immediately from in-memory FIFOs (DMA and latency are ignored)
- Returns, as an ndarray, the data each rank stored last
### Limitations
- No latency or performance measurement (it is not a simulation)
- PE_DMA, fabric, and BW are not modeled
- Only functional correctness can be verified
- Assumes everything runs inside one cube: `program_id(axis=1)` is always 0
---
## 7. Debugging
### CCL trace
```bash
KERNBENCH_CCL_TRACE=1 kernbench run --topology topology.yaml \
    --bench ccl_allreduce_tcm --verify-data
```
The send/recv timestamps of each rank are printed to stdout:
```
[ccl t=346.4 send] sip0.cube0.pe1 dir=E nbytes=64 seq=0
[ccl t=360.4 recv] sip0.cube0.pe2 dir=W nbytes=64
...
```
### Pointer dump
`kernbench.ccl.diagnostics.pointer_dump(engine)` returns the ring buffer state of
every PE_IPCQ (`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) as a
multi-line string. When a hang occurs, it shows at a glance which rank is stuck
in which state.
### Deadlock detection
When the SimPy schedule runs empty, for example because of an unmatched send/recv,
the engine raises `IpcqDeadlock` and includes the pointer dump in the message
(ADR-0023 D14 F3). A separate wait-for graph visualization is future work.
---
## 8. Common mistakes
### 1. Using a direction that is not installed
`topology: ring_1d` in ccl.yaml installs only E/W. Using N/S:
```python
tl.send(dir="N", ...)  # → raises IpcqInvalidDirection
```
Fix: switch to `topology: mesh_2d`, or add N/S by overriding `neighbors()`.
### 2. Calling send with no matching recv
```python
def kernel(..., tl):
    for _ in range(100):
        tl.send(dir="E", ...)
    # no recv on the peer side → ring buffer fills up → backpressure → deadlock
```
Fix: every send needs a matching recv. Otherwise `IpcqDeadlock` is raised.
### 3. dtype/shape mismatch
In the default mode, dtype/shape mismatches are not validated. Either the author
guarantees consistency, or set `strict_validation: true` in the PE_IPCQ node attrs
to catch mismatches immediately in D14 F2 strict mode.
### 4. Assuming fairness of round-robin recv
`tl.recv()` (no direction) polls round-robin but returns the first slot that has arrived.
**The arrival order is unknown**, so the algorithm must not depend on the arrival direction.
When needed, specify the direction explicitly, as in `tl.recv(dir="N", ...)`.
### 5. Assuming the CCL group size
`tl.num_programs(axis=0/1)` is the number of topology slots, not the CCL group size.
The number of participating ranks (`world_size`) is known to the host bench and must be
passed explicitly as a kernel argument.
### 6. Overwriting the send-source memory before the data arrives
PE_DMA snapshots the source data into the token at send time, so the meaning of
in-flight data is preserved. Within one PE it is therefore safe to send directly
and then store to the same address in a later step.
However, if the address is overwritten before `tl.send` has been enqueued on the
PE_DMA queue, the wrong data is snapshotted. Issue `tl.send` first and modify the
memory afterwards.
---
## 9. Next steps
- Try other topologies such as `mesh_2d` / `tree_binary`
- Faster algorithms such as recursive halving/doubling
- Compare latency across `buffer_kind` (tcm/hbm/sram) and `backpressure` (poll/sleep) modes
- Larger-scale ring verification, as in `ccl_ring_allreduce_multicube.py` and
`ccl_ring_allreduce_multisip.py`
If you add a new algorithm or pattern, please contribute it as a PR.
---
## References
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
Existing algorithm examples:
- [`src/kernbench/ccl/algorithms/hello_send.py`](../src/kernbench/ccl/algorithms/hello_send.py): the simplest send/recv
- [`src/kernbench/ccl/algorithms/ring_allreduce.py`](../src/kernbench/ccl/algorithms/ring_allreduce.py): ring all-reduce
- [`src/kernbench/ccl/algorithms/mesh_allreduce.py`](../src/kernbench/ccl/algorithms/mesh_allreduce.py): 2D mesh all-reduce
- [`src/kernbench/ccl/algorithms/tree_allreduce.py`](../src/kernbench/ccl/algorithms/tree_allreduce.py): binary tree all-reduce
+363
@@ -0,0 +1,363 @@
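# Practical DI Patterns: Learning Dependency Injection from the kernbench Implementation
---
## Slide 1: What we'll cover today
**Question:** How should code be designed so that it is easy to test and easy to swap out?
**Answer:** Dependency Injection (DI)
Today we learn not from theory but by looking at **simulator code that actually runs**.
```
kernbench
└── A framework that simulates AI accelerator hardware in Python
    - Dozens of hardware components (NOC, HBM, PE, CPU...)
    - Each component can be replaced at runtime
    - Components can be swapped for mocks instantly in tests
```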
---
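## Slide 2: What happens without DI
```python
# ❌ Code without DI
class IoCpuComponent:
    def run(self, env, nbytes):
        router = PathRouter()        # created directly, cannot be replaced
        hbm = HbmCtrlComponent()     # created directly, cannot be replaced
        yield env.timeout(10.0)
```
**Problems:**
- In tests, the real `PathRouter` and `HbmCtrl` always come along
- Replacing a component with a mock requires **modifying the source code**
- Using a different topology (a different routing strategy) means **modifying it again**
> When a class creates its own dependencies, it becomes coupled to them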
---
## Slide 3: The core principle of DI
**Dependencies are created outside and handed in**
```
┌────────────────────────────┐
│ Assembler                  │ ← decides who uses what
│ GraphEngine.__init__       │
└────────────┬───────────────┘
             │ injects ctx
┌────────────────────────────┐
│ Component                  │ ← only needs to know how it behaves
│ IoCpuComponent             │
│ self.ctx.router.find_path(...) ← just uses it
└────────────────────────────┘
```
**Three separate roles:**
1. **Interface**: what can be done (`ComponentBase`)
2. **Implementation**: how it is done (`IoCpuComponent`, `HbmCtrlComponent`, ...)
3. **Assembler**: what gets connected to what (`GraphEngine`)
---
## Slide 4: Pattern 1: Constructor Injection
> Dependencies are received through the constructor
```python
# kernbench/components/base.py
class ComponentBase(ABC):
    def __init__(self, node: Node, ctx: ComponentContext | None = None):
        self.node = node
        self.ctx = ctx  # dependency injected from outside
        self.in_ports: dict[str, simpy.Store] = {}
        self.out_ports: dict[str, simpy.Store] = {}
```
```python
# Consumer side: never builds ctx itself
class IoCpuComponent(ComponentBase):
    def _dispatch(self, env, txn):
        path = self.ctx.router.find_node_path(...)  # ctx has already been injected
        yield self.out_ports[next_hop].put(...)
```
**When to use it:**
- When the dependency does not change during the component's lifetime
- When the component cannot work without the dependency (a required dependency)
---
## Slide 5: The Context Object pattern
> When dependencies multiply, bundle them into one
```python
# kernbench/components/context.py
@dataclass
class ComponentContext:
    router: PathRouter          # routing policy
    resolver: AddressResolver   # address resolution
    positions: dict[str, ...]   # physical position info
    ns_per_mm: float            # propagation delay constant
    edge_map: dict[...]         # edge info
    spec: dict                  # topology spec
```
**Why bundle into a Context?**
- With 6 constructor arguments → the signature changes every time a component is added
- With a single Context → adding a new field does not affect existing components
- Each component **takes only what it needs**
```python
class TwoDMeshNocComponent(ComponentBase):
    def _route(self, env, txn):
        src_pos = self.ctx.positions.get(prev_hop)  # uses only the positions
        ns_per_mm = self.ctx.ns_per_mm              # uses only the constant
        # router, resolver, etc. are left untouched
```
---
## Slide 6: Pattern 2: Registry + Factory
> A string key → class mapping enables runtime replacement
```python
# kernbench/components/base.py
class ComponentRegistry:
    _registry: dict[str, type[ComponentBase]] = {}

    @classmethod
    def register(cls, impl: str, component_cls: type[ComponentBase]):
        cls._registry[impl] = component_cls

    @classmethod
    def create(cls, node, overrides=None, ctx=None) -> ComponentBase:
        if overrides and node.impl in overrides:
            return overrides[node.impl](node, ctx)      # priority 1: caller override
        if node.impl in cls._registry:
            return cls._registry[node.impl](node, ctx)  # priority 2: registered implementation
        return DefaultComponent(node, ctx)              # priority 3: default fallback
```
**Resolution priority:**
```
overrides[impl] ← injection for tests/experiments
↓ (if absent)
_registry[impl] ← production implementation
↓ (if absent)
DefaultComponent ← safe fallback
```
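The three-level resolution order can be tried out in isolation. The following is a minimal, self-contained sketch that mirrors the slide's logic with stand-in classes; `Node`, `NocComponent`, `SpyNoc`, `REGISTRY`, and `create` here are hypothetical simplifications, not the kernbench classes.

```python
class Node:
    """Stand-in for a topology node; only the impl key matters here."""
    def __init__(self, impl):
        self.impl = impl


class DefaultComponent:
    def __init__(self, node, ctx=None):
        self.node, self.ctx = node, ctx


class NocComponent(DefaultComponent):
    pass


class SpyNoc(DefaultComponent):
    pass


REGISTRY = {"noc_2d_mesh_v1": NocComponent}


def create(node, overrides=None, ctx=None):
    if overrides and node.impl in overrides:
        return overrides[node.impl](node, ctx)   # 1: caller override wins
    if node.impl in REGISTRY:
        return REGISTRY[node.impl](node, ctx)    # 2: registered implementation
    return DefaultComponent(node, ctx)           # 3: fallback


prod = create(Node("noc_2d_mesh_v1"))
spy = create(Node("noc_2d_mesh_v1"), overrides={"noc_2d_mesh_v1": SpyNoc})
fallback = create(Node("unknown_v9"))
```

The same impl string resolves to a different class depending only on what the caller passes in, which is the property the override pattern relies on.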
---
## Slide 7: How registration works
```python
# kernbench/components/builtin/__init__.py
from kernbench.components.base import ComponentRegistry
from kernbench.components.builtin.noc import TwoDMeshNocComponent
from kernbench.components.builtin.io_cpu import IoCpuComponent
# ...
ComponentRegistry.register("noc_2d_mesh_v1", TwoDMeshNocComponent)
ComponentRegistry.register("io_cpu_v1", IoCpuComponent)
ComponentRegistry.register("hbm_ctrl_v1", HbmCtrlComponent)
# ...
```
**topology.yaml (configuration file)**
```yaml
nodes:
  - id: sip0.cube0.noc
    impl: noc_2d_mesh_v1  # ← this string is the Registry key
```
**Flow:**
```
YAML → impl string → Registry.create() → actual component instance
```
Changing only the impl string changes the behavior. No code changes.
---
## Slide 8: Pattern 3: Override Injection (for tests)
> The caller swaps out only a specific impl
```python
# tests/test_component_registry.py
class SpyXbar(ComponentBase):
    calls = 0

    def run(self, env, nbytes):
        SpyXbar.calls += 1
        yield env.timeout(0)


# In the test, only xbar_v1 is replaced with SpyXbar
engine = GraphEngine(
    graph,
    component_overrides={"xbar_v1": SpyXbar}  # ← this is the only addition
)
result = engine.run(msg)
assert SpyXbar.calls > 0  # verify that the Xbar was actually invoked
```
**Key point:** test code **does not modify** production code
---
## Slide 9: The assembler: GraphEngine
> The only place where components are created and connected
```python
# kernbench/sim_engine/engine.py
class GraphEngine:
    def __init__(self, graph, component_overrides=None):
        # 1. Create the shared dependencies
        ctx = ComponentContext(
            router=PathRouter(graph),
            resolver=AddressResolver(graph),
            positions={nid: n.pos_mm for nid, n in graph.nodes.items()},
            ns_per_mm=...,
        )
        # 2. Create the components (DI: inject ctx)
        self._components = {
            node_id: ComponentRegistry.create(node, component_overrides, ctx)
            for node_id, node in graph.nodes.items()
        }
        # 3. Connect the ports (wiring)
        for e in graph.edges:
            store = simpy.Store(self._env)
            self._components[e.src].out_ports[e.dst] = store
            self._components[e.dst].in_ports[e.src] = store
```
**Create → inject → connect**: these three steps happen in exactly one place
---
## Slide 10: The whole structure at a glance
```
topology.yaml
│ impl: "noc_2d_mesh_v1"
GraphEngine.__init__() ← assembler
├── create ComponentContext ← bundle of shared dependencies
│   ├── PathRouter
│   ├── AddressResolver
│   └── positions, ns_per_mm, ...
├── ComponentRegistry.create(node, overrides, ctx)
│   ├── overrides["noc_2d_mesh_v1"]? → SpyNoc (test)
│   ├── registry["noc_2d_mesh_v1"]? → TwoDMeshNocComponent (production)
│   └── fallback → DefaultComponent
└── port wiring: connect out_ports / in_ports
Component (TwoDMeshNocComponent)
└── uses self.ctx.positions, self.ctx.ns_per_mm
    (router and resolver are left untouched; only what is needed)
```
---
## Slide 11: What we gained
| Situation | Without DI | With DI |
|------|---------|---------|
| Replace the NOC algorithm | Modify source code | Change the impl string in YAML |
| Verify Xbar behavior | Run every real HW component | `overrides={"xbar_v1": SpyXbar}` |
| Add a new component | Modify existing code | `register("new_v1", NewComp)` |
| Add a context field | Modify every constructor | Add a field to `ComponentContext` |
| Test isolation | Impossible | Override only what is needed |
---
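## Slide 12: Checklist for real projects
**Questions to ask while designing:**
1. **What does this class `new` (create) directly?**
→ Whatever it creates cannot be replaced. Check whether it could be received through the constructor instead.
2. **Are there 3 or more dependencies?**
→ Bundle them into a Context Object.
3. **Can this class be run on its own in a test?**
→ If not, that is a sign DI is needed.
4. **Do you want to change behavior through configuration (YAML/config)?**
→ Use the Registry + string key pattern.
5. **Who assembles?**
→ There should be a single assembler. Components must not contain assembly logic.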
---
## Slide 13: Anti-patterns: don't do these
```python
# ❌ Service locator (calling the registry from inside a component)
class BadComponent(ComponentBase):
    def run(self, env, nbytes):
        router = ComponentRegistry.get("router")  # the component looks it up itself
        ...


# ❌ Direct reference to a global singleton
class BadComponent(ComponentBase):
    def run(self, env, nbytes):
        router = GlobalRouter.instance()  # cannot be replaced
        ...


# ❌ Creating a dependency inside the constructor
class BadComponent(ComponentBase):
    def __init__(self, node):
        self.router = PathRouter(node.graph)  # cannot be isolated in tests
```
**Common problem:** the component resolves its own dependencies → coupling increases
---
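## Slide 14: Summary
> **DI = separating the creation of dependencies from their use**
```
Creation → Registry / Assembler (GraphEngine)
Use      → Component (IoCpuComponent, TwoDMeshNocComponent, ...)
```
**Three patterns learned from kernbench:**
1. **Constructor Injection**: required dependencies go through the constructor
2. **Context Object**: a bundle of dependencies as a single dataclass
3. **Registry + Override**: select an implementation by string key, swap it in tests
**Result:** 141 tests, component replacement with a single line of YAML, and mock injection without modifying production code
---
*Reference code: kernbench/src/kernbench/components/*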
+237
@@ -0,0 +1,237 @@
# Hardware Architecture Overview
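This document summarizes the hardware architecture of the AI Accelerator platform.
It can be used as background when analyzing papers and reviewing designs.
> Source ADRs: ADR-0003, ADR-0004, ADR-0014, ADR-0017, ADR-0022
---
## 1. System Hierarchy
The system is organized as a 4-level hierarchy.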
```
Tray
├── Host CPU (runtime, data placement)
├── SIP 0 (accelerator)
│ ├── IO Chiplet (PCIe-EP, IO_CPU)
│ ├── CUBE 0
│ │ ├── PE 0 ─ PE 7
│ │ ├── HBM + HBM_CTRL
│ │ ├── Shared SRAM
│ │ ├── M_CPU (management)
│ │ ├── NOC 2D Mesh (router grid)
│ │ └── UCIe × 4 (N/S/E/W)
│ ├── CUBE 1 ... CUBE N
│ └── IO Chiplet(s)
├── SIP 1 ... SIP M
└── Interconnect (PCIe / UAL)
```
| Level | Composition | Connectivity |
|-------|------|------|
| **Tray** | Host CPU + multiple SIPs | PCIe / UAL fabric |
| **SIP** | Multiple CUBEs + IO chiplet(s) | UCIe (between cubes), PCIe-EP (host) |
| **CUBE** | Multiple PEs + HBM + SRAM + M_CPU + NOC mesh | UCIe × 4 ports (N/S/E/W) |
| **PE** | PE_CPU + DMA + GEMM + MATH + TCM | Directly attached to a NOC router |
---
## 2. CUBE Architecture
Each CUBE is an independent compute + memory unit.
### 2.1 Components
- **PEs**: Multiple Processing Elements, each able to run a kernel independently
- **HBM + HBM_CTRL**: High Bandwidth Memory. Each PE is assigned a local HBM region that it accesses with minimal latency
- **Shared SRAM**: Shared memory that every PE in the cube can access through the NOC
- **M_CPU**: Management CPU. Distributes kernel commands and aggregates completions
- **NOC (On-die Fabric)**: Interconnect linking all components within the cube
- **UCIe × 4**: Multiple connections per direction (N/S/E/W) for inter-cube links
### 2.2 NOC (On-die Fabric)
The NOC is the on-die interconnect that links the PEs, HBM, SRAM, M_CPU, and UCIe within a cube.
**Architectural requirements** (topology-independent):
- Every PE can access its local HBM at full bandwidth
- Every PE can access the shared SRAM
- Every PE can access other cubes through UCIe
- The M_CPU can deliver commands to every PE
- Per-link contention modeling is supported
**Current simulator implementation** (subject to change):
- 2D mesh router grid (6×6 by default, XY deterministic routing)
- HBM_CTRL attaches directly to each PE's local router (0 mesh hops)
- No routers are placed in the central HBM zone
- Contention: one capacity=1 resource per directed segment
The NOC topology is not limited to a 2D mesh. Ring, crossbar, hierarchical, and other
implementations are possible and interchangeable as long as they satisfy the architectural requirements.
### 2.3 Main Data Paths
| Path | Route | Characteristics |
|------|-------|------|
| PE → Local HBM | PE_DMA → NOC → HBM_CTRL | Fewest hops, 256 GB/s (×0.8 eff) |
| PE → Remote PE's HBM | PE_DMA → NOC hops → HBM_CTRL | Limited by NOC BW per hop |
| PE → Shared SRAM | PE_DMA → NOC → SRAM | Limited by SRAM link BW |
| PE → Other CUBE's HBM | PE_DMA → NOC → UCIe → NOC → HBM_CTRL | UCIe overhead 16 ns (TX+RX) |
| Kernel Launch | IO → UCIe → M_CPU → NOC → PE_CPU | Command path |
### 2.4 Key Bandwidths
| Connection | Bandwidth | Notes |
|------------|-----------|-------|
| PE_DMA ↔ NOC | 256 GB/s | Matches HBM slice BW |
| NOC ↔ HBM_CTRL | 256 GB/s | Per PE, local access |
| NOC ↔ SRAM | 128 GB/s × 4 | 512 GB/s aggregate |
| NOC ↔ UCIe conn | 128 GB/s × 4 | 512 GB/s per port |
| UCIe link (inter-cube) | 512 GB/s | 1.0 mm seam distance |
---
## 3. PE Architecture
Each PE is an independent processor that runs one kernel instance.
### 3.1 Internal Components
```
PE_CPU (control)
├──→ PE_SCHED (dispatch)
│ │
│ ├──→ PE_DMA ←→ NOC Router ←→ HBM / SRAM / UCIe
│ │ ↕
│ ├──→ PE_FETCH_STORE ←→ PE_TCM (16MB SRAM)
│ │
│ ├──→ PE_GEMM (matrix multiply)
│ └──→ PE_MATH (elementwise)
└──→ PE_IPCQ (collective communication)
└──→ PE_DMA (IPCQ port)
```
| Component | Role |
|-----------|------|
| **PE_CPU** | Executes the kernel instruction stream and generates commands |
| **PE_SCHED** | Command dispatcher. Decomposes composite commands into a tile pipeline |
| **PE_DMA** | HBM ↔ TCM data transfer (via the NOC router mesh). One channel each for read and write |
| **PE_GEMM** | Matrix multiply engine. Reads activations from TCM; can stream weights from HBM |
| **PE_MATH** | Element-wise engine. Reads and writes TCM |
| **PE_TCM** | 16MB on-PE SRAM. Staging memory for compute |
| **PE_IPCQ** | Controls collective communication between PEs (manages ring buffer pointers) |
### 3.2 Compute Pipeline (Tiled Execution)
A composite command executes as a pipeline, one tile at a time:
```
DMA_READ(t) → COMPUTE(t) → DMA_WRITE(t)
```
**Overlap rules**:
- Allowed: `DMA_READ(t+1) ∥ COMPUTE(t)`, `DMA_WRITE(t-1) ∥ COMPUTE(t)`
- Forbidden: `GEMM(t) ∥ GEMM(t')`, `GEMM(t) ∥ MATH(t')`
**DMA Engine**: Read and write each have capacity=1. Concurrent Read+Write is possible; concurrent Read+Read is not.
**Compute Engine**: GEMM and MATH share a single compute slot. Only one runs at a time.
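The effect of these overlap rules can be estimated with a small recurrence. The sketch below is an illustration under stated assumptions, not the PE_SCHED implementation: stage times are constant per tile, there is always a free tile buffer so reads may run ahead of compute, and `tile_pipeline_makespan` is a hypothetical name.

```python
def tile_pipeline_makespan(n_tiles, read_ns, compute_ns, write_ns):
    """Finish time of a 3-stage tile pipeline with one unit per stage."""
    read_end = compute_end = write_end = 0.0
    for _ in range(n_tiles):
        # One read channel: reads are issued back to back.
        read_end = read_end + read_ns
        # One compute slot: wait for this tile's data and for the previous compute.
        compute_end = max(read_end, compute_end) + compute_ns
        # One write channel: wait for this tile's result and for the previous write.
        write_end = max(compute_end, write_end) + write_ns
    return write_end


pipelined_ns = tile_pipeline_makespan(3, read_ns=2.0, compute_ns=3.0, write_ns=1.0)
serial_ns = 3 * (2.0 + 3.0 + 1.0)
```

For three tiles with these illustrative stage times, overlapping reads and writes with compute brings the total from 18 ns down to 12 ns; the compute stage becomes the bottleneck.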
### 3.3 TCM-centric Dataflow
All compute revolves around the TCM:
```
Input: HBM → (NOC) → PE_DMA → PE_TCM
Compute: PE_TCM → GEMM / MATH → PE_TCM
Output: PE_TCM → PE_DMA → (NOC) → HBM
```
PE_TCM is split into two regions:
- **SchedulerReservedTCM**: tile buffer region reserved for PE_SCHED (DMA/compute staging)
- **AllocatableTCM**: general-purpose allocation region (host/DP-visible)
The two regions are separated by hard isolation.
---
## 4. Memory Hierarchy
### 4.1 Memory Tiers
| Memory | Scope | Capacity | Bandwidth | Latency | Access path |
|--------|-------|----------|-----------|---------|-----------|
| **PE_TCM** | PE-private | 16 MB | 512 GB/s | Lowest | Direct (bypasses NOC) |
| **Shared SRAM** | Cube-shared | 32 MB | 128 GB/s (NoC link) | Medium | PE → NOC → SRAM |
| **Local HBM** | Assigned per PE | Large | 256 GB/s (×0.8 eff) | High | PE → local router → HBM_CTRL |
| **Remote HBM** | Other PE/Cube | Large | Limited by mesh/UCIe BW | Highest | PE → NOC mesh → (UCIe) → HBM_CTRL |
### 4.2 Local HBM Bandwidth Guarantee
- Each PE has an HBM pseudo-channel attached directly to its local router
- Local HBM access takes **0 mesh hops** (switching overhead only)
- Effective bandwidth = spec BW × efficiency factor (default 0.8)
- Example: 256 GB/s × 0.8 = 204.8 GB/s effective
- This guarantee holds regardless of fabric bandwidth
### 4.3 Memory-Centric Design Principle
- **Compute runs near the data**: PEs attach directly to local HBM, minimizing data movement
- **TCM is the compute scratchpad**: all compute inputs and outputs pass through TCM
- **HBM is primary storage**: holds large tensors; DMA loads and stores tiles to and from TCM
- **Shared SRAM is cube-level sharing**: intermediate result sharing, reduction buffers, etc.
---
## 5. SPMD Execution Model
### 5.1 Program ID Mapping
Kernels run SPMD-style on a 2D hardware grid:
| API | Return value | Description |
|-----|---------|------|
| `tl.program_id(axis=0)` | `local_pe_id` | PE index within the cube |
| `tl.program_id(axis=1)` | `cube_id` | Cube index |
| `tl.num_programs(axis=0)` | `num_pes_per_cube` | Number of PEs per cube |
| `tl.num_programs(axis=1)` | `num_cubes` | Total number of cubes |
```python
global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
```
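The mapping above and its inverse can be checked outside a kernel with plain integers. This is a standalone sketch: `global_pid`, `split_pid`, and the sizes of 8 PEs and 4 cubes are illustrative assumptions, not values taken from a topology file.

```python
def global_pid(pe_id, cube_id, num_pes_per_cube):
    """Flatten (axis=0, axis=1) into one id, with the PE index innermost."""
    return cube_id * num_pes_per_cube + pe_id


def split_pid(gpid, num_pes_per_cube):
    """Inverse mapping: recover (pe_id, cube_id) from the flat id."""
    cube_id, pe_id = divmod(gpid, num_pes_per_cube)
    return pe_id, cube_id


num_pes, num_cubes = 8, 4  # illustrative sizes only
all_ids = [global_pid(pe, cube, num_pes)
           for cube in range(num_cubes) for pe in range(num_pes)]
```

Every (PE, cube) pair gets a unique id in `0 .. num_pes * num_cubes - 1`, so the flat id can index a data partition directly.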
### 5.2 Axis Mapping Rationale
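- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate over the local NOC. Fast and tightly coupled. Corresponds to a GPU's thread-in-block.
- **axis=1 = Cube (outer)**: Inter-cube communication goes through UCIe and has higher latency. This is the coarse scheduling unit. Corresponds to a GPU's block-in-grid.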
### 5.3 Kernel Execution Flow
```
Host CPU
→ IO_CPU (PCIe-EP)
→ M_CPU (management, per cube)
→ PE_CPU × N (broadcast)
→ Each PE executes same kernel with unique (pe_id, cube_id)
```
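Every PE runs the same kernel but uses `program_id` to identify its own data partition
and processes it independently (SPMD).
---
## 6. Inter-PE Communication (IPCQ)
Collective communication between PEs goes through the IPCQ (Inter-PE Communication Queue).
- Each PE maintains ring-buffer-based queue pairs per direction (N/S/E/W, etc.)
- **DMA-IPCQ co-design**: the head pointer is piggybacked on DMA data flits, so pointers are synchronized without separate control messages
- **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit
- The IPCQ slot buffer can be placed in **TCM, Shared SRAM, or Local HBM**
For details, see ADR-0023, including its "HW Realization Notes (Informative)" section, which absorbed the former `docs/ipcq-dma-codesign-hw.md`.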
+381
@@ -0,0 +1,381 @@
# Latency Model
## Overview
kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **simulated elapsed time** (the `env.now` delta on
SimPy's clock, not host wall-clock time) rather than a static formula, so contention
and queueing are captured automatically.
```text
total_ns (actual) = wire_prop + component_overhead + drain + queueing
├── deterministic ──────────────────┘ │
└── contention-dependent ────────────────────┘
```
## Three Deterministic Cost Components
### 1. Wire Propagation
```text
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
Every edge in the topology graph has a `distance_mm`. A SimPy wire process
delays each message by `wire_ns` before delivering it to the next component.
For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
since all links are on-die or interposer. Wire propagation is typically <1 ns
and negligible compared to other costs.
### 2. Component Overhead (`overhead_ns`)
```text
component_ns = node.attrs["overhead_ns"]
```
Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`.
This models arbitration, protocol processing, pipeline stages, etc.
| Component | overhead_ns | Meaning |
|-----------|-------------|---------|
| pcie_ep | 5.0 | PCIe protocol processing |
| io_cpu | 10.0 | Command decode / dispatch |
| m_cpu | 5.0 | DMA scheduling |
| fabric switch | 5.0 | Packet arbitration |
| xbar | 2.0 | Crossbar arbitration |
| xbar bridge | 1.0 | Bridge traversal between xbar halves |
| ucie | 8.0 | UCIe protocol overhead per port (TX or RX; 16ns per crossing) |
| noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
| hbm_ctrl | 0.0 | Access time via drain_ns; efficiency=0.8 reduces edge BW (256→204.8) |
| pe_cpu | 2.0 | Command dispatch |
| pe_scheduler | 1.0 | PE-internal scheduling |
| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
### 3. Drain (Serialization Delay)
```text
drain_ns = nbytes / bottleneck_bw_gbs
```
**Wormhole (cut-through) model**: data flows through intermediate nodes as a
pipeline. Serialization cost is paid **once** at the terminal node, not at
every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.
Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.
### Formula (Theoretical Lower Bound)
```text
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
This is the latency with **zero contention**—no other request competing for
any resource. The engine provides `_formula_latency()` for verification.
With no contention: `actual == formula`. With contention: `actual > formula`.
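The zero-contention formula can be evaluated by hand for a single path. The sketch below is a standalone illustration, not the engine's `_formula_latency()`: `formula_latency_ns` and its argument layout are hypothetical, and the 2.5 mm edge length is the one used in the worked example later in this document.

```python
def formula_latency_ns(edges, overheads_ns, nbytes, ns_per_mm=0.01):
    """edges: list of (distance_mm, bw_gbs); overheads_ns: one entry per component."""
    wire_ns = sum(distance_mm * ns_per_mm for distance_mm, _ in edges)
    bottleneck_bw_gbs = min(bw_gbs for _, bw_gbs in edges)
    drain_ns = nbytes / bottleneck_bw_gbs  # 1 GB/s equals 1 byte/ns
    return wire_ns + sum(overheads_ns) + drain_ns


# pe0 -> local slice0: one 2.5 mm edge at 256 GB/s, xbar 2.0 ns, hbm_ctrl 0 ns
local_read_ns = formula_latency_ns([(2.5, 256.0)], [2.0, 0.0], nbytes=4096)
```

For the 4096-byte local read this gives 0.025 + 2.0 + 16.0 = 18.025 ns; the drain term uses only the slowest edge, which reflects the wormhole assumption that serialization is paid once.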
### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
```mermaid
sequenceDiagram
participant D as pe_dma
participant X as xbar.pe0
participant H as hbm_ctrl.slice0
D->>X: txn (4096B)
Note over X: overhead 2.0 ns
X->>H: txn (wire 0.025 ns)
Note over H: acquire Resource
Note over H: overhead 0 ns
Note over H: drain 4096/256 = 16.0 ns
Note over H: release Resource
H-->>D: done.succeed()
Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
```
### Diagram: Two Requests — No Contention vs HOL Blocking
#### Case 1: Different slices (parallel, no contention)
```mermaid
sequenceDiagram
participant A as Request A
participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
Note over A,S1: t=2 ns — both requests arrive at their own slice
A->>S0: A (4KB)
A->>S1: B (4KB)
Note over S0: acquire (immediate)
Note over S1: acquire (immediate)
Note over S0: drain 16.0 ns
Note over S1: drain 16.0 ns
Note over S0: t=18 release
Note over S1: t=18 release
Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
```
#### Case 2: Same slice (HOL blocking)
```mermaid
sequenceDiagram
participant A as Request A (4KB)
participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
participant B as Request B (64B)
Note over A,B: t=0 — A arrives first
A->>Q: acquire (immediate)
Note over Q: drain A = 16.0 ns
Note over B,Q: t=5 — B arrives, yield req → BLOCKED
B--xQ: waiting...
Note over Q: t=16 — A drain done, release
Q->>B: B acquires resource
Note over Q: drain B = 0.25 ns
Note over Q: t=16.25 — B done, release
Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
```
---
## How SimPy Tracks Latency
### Measurement
```python
start_ns = env.now
yield txn_done # wait for the transaction to complete
total_ns = env.now - start_ns # ← this is what probe reports
```
`env.now` is SimPy's simulation clock. It only advances when a process `yield`s
a timeout or waits on a resource/store. The delta between start and done captures
**everything**: wire delays, component overheads, drain, and any queueing.
### Component Pipeline
Each component is a SimPy process:
```text
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```
1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.
The worker uses `env.process()` (pipeline model), so multiple messages can be
in-flight through the same component concurrently. Contention happens when
they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).
### Wire Process
```python
while True:
msg = yield out_port.get() # wait for sender
yield env.timeout(prop_ns) # propagation delay
yield in_port.put(msg) # deliver to receiver
```
Each directed edge has its own wire process. Messages are delayed by exactly
`distance_mm × ns_per_mm`.
---
## Contention and Queueing
Queueing delay is **not a separate formula term**—it emerges from SimPy's
event scheduling when multiple requests compete for the same resource.
### Where Contention Occurs
| Resource | SimPy Type | Capacity | Effect |
|----------|-----------|----------|--------|
| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |
### How Queueing Works
```python
# hbm_ctrl._worker
with self._resource.request() as req:
yield req # ← BLOCKS if resource is occupied
yield from self.run(env, txn.nbytes)
yield env.timeout(drain_ns)
```
If request A holds the resource and request B arrives:
- B's `yield req` blocks until A releases the resource
- The simulation clock (`env.now`) advances by A's remaining service time before B resumes
- This "extra" time shows up in B's `total_ns` automatically
```text
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
```
### Head-of-Line (HOL) Blocking at hbm_ctrl
The `simpy.Resource` is held for the **entire** `with` block—both overhead and
drain. The resource is NOT released between overhead and drain:
```python
with self._resource.request() as req:
yield req # acquire (or wait)
yield from self.run(env, txn.nbytes) # overhead_ns ─┐
yield env.timeout(drain_ns) # drain_ns │ resource held
# ← resource released here ───────────────────────────────┘
```
This means a short request arriving during a long request's drain must wait
for the full remaining drain time—classic head-of-line blocking:
```text
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)
Timeline:
t=0.00 A acquires resource
t=0.00 A: overhead (0 ns)
t=0.00 A: drain starts (16.0 ns)
t=5.00 B arrives → yield req → BLOCKED (A holds resource)
t=16.00 A: drain done → resource released
t=16.00 B acquires resource
t=16.00 B: overhead (0 ns)
t=16.25 B: drain done → resource released
B actual = 11.25 ns (waited 11.0 + own 0.25)
B formula = 0.25 ns
B queueing = 11.0 ns ← HOL blocking penalty
```
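The timeline above can be reproduced without SimPy by modeling the capacity-1 resource as a FIFO single server. This is a simplified sketch with hypothetical names (`fifo_single_server`); it ignores wires and overhead and only tracks when the resource becomes free.

```python
def fifo_single_server(requests):
    """requests: list of (name, arrival_ns, service_ns), served in arrival order."""
    free_at = 0.0
    result = {}
    for name, arrival_ns, service_ns in sorted(requests, key=lambda r: r[1]):
        start_ns = max(arrival_ns, free_at)   # wait while the resource is held
        free_at = start_ns + service_ns       # resource is held for the whole drain
        result[name] = {
            "queueing_ns": start_ns - arrival_ns,
            "actual_ns": free_at - arrival_ns,
        }
    return result


# A: 4 KB at 256 GB/s arriving at t=0; B: 64 B at 256 GB/s arriving at t=5
hol = fifo_single_server([("A", 0.0, 4096 / 256), ("B", 5.0, 64 / 256)])
```

Request B needs only 0.25 ns of service but sees 11.25 ns in total, because it queues for 11.0 ns behind A's drain.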
**Why this is physically realistic**: An HBM channel processes one burst at a
time. While data is being serialized onto the channel (drain), no other request
can use that channel. The FIFO ordering (`simpy.Resource` default) reflects
the simplest controller scheduling policy.
**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can
prioritize shorter requests (Shortest Job First), but this is not currently
used since FIFO matches typical HBM controller behavior.
---
## Worked Example: Two Concurrent PE DMA Reads
Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
(slice0 and slice1), submitted to the **same engine** at the same time.
### Paths
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
### No Contention (different HBM slices)
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.
```text
DMA A timeline:
t=0.00    pe_dma dequeues txn
t=0.00    xbar.pe0: overhead_ns=2.0        → t=2.00
t=2.00    wire prop (2.5mm × 0.01)         → t=2.025
t=2.025   hbm_ctrl.slice0: yield req → immediate (no contention)
t=2.025   hbm_ctrl.slice0: overhead_ns=0   → t=2.025
t=2.025   drain_ns = 4096/256 = 16.0       → t=18.025
t=18.025  done
DMA B timeline: (identical, on its own slice)
t=0.00 → ... → t=18.025 done
```
Both complete at 18.025 ns in this simplified single-wire timeline; the probe reports
18.09 ns because it sums the wire delay over every edge on the path (0.08 ns).
`actual == formula` for both.
### With Contention (same HBM slice)
Now suppose both PE0 and PE1 read from **slice0**:
```text
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```
```text
DMA A timeline:
t=0.00    xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025   yield req → immediate (first to arrive)
t=18.025  drain 16.0 done → release resource → done
actual_A  = 18.025 ns (== formula)
DMA B timeline:
t=0.00    xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=4.035   yield req → BLOCKED (A holds resource until t=18.025)
t=18.025  acquire resource
t=50.025  drain 32.0 done → release → done
B's drain uses the bottleneck BW of B's own path:
  xbar_x_bw = 128 GB/s → drain = 4096/128 = 32.0 ns
  (A's path bottleneck is 256 GB/s, so A drains in 16.0 ns)
formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
actual_B  = 50.025 ns
queueing  = actual_B - formula_B = 13.99 ns (time waiting for A to release hbm_ctrl)
```
The key insight: **queueing delay is not in the formula**. It only appears in
the actual SimPy simulation when resources are contested. The probe reports
`actual_ns`, which includes all queueing. To see pure queueing overhead,
compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
---
## Probe Output Explained
```text
=== PE DMA Latency ===
Case Target Actual Ovhd Drain Wire Ovhd% Drain% Eff.BW BN.BW Util%
pe-local-hbm c0.pe0->c0.slice0 18.09 2.0 16.0 0.08 11.1% 88.5% 226.49 256.0 88.5%
pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1% 110.27 128.0 86.1%
```
| Column | Meaning |
|--------|---------|
| **Actual** | SimPy measured `env.now` delta (includes contention if any) |
| **Ovhd** | Sum of `overhead_ns` for all components on the forward path |
| **Drain** | `nbytes / bottleneck_bw` — serialization at terminal |
| **Wire** | Sum of `distance_mm × ns_per_mm` for all edges |
| **Ovhd%** | `Ovhd / Actual × 100` — fraction of time spent in component processing |
| **Drain%** | `Drain / Actual × 100` — fraction of time spent in data transfer |
| **Eff.BW** | `nbytes / Actual` — achieved bandwidth |
| **BN.BW** | Bottleneck bandwidth (min `bw_gbs` on path) |
| **Util%** | `Eff.BW / BN.BW × 100` — how close to theoretical max BW |
### Why Util% < 100%
`Util% = Drain% = drain_ns / actual_ns`. The gap from 100% is the overhead
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.
```text
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
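The two utilization figures can be reproduced with a few lines of arithmetic. This is a sketch under the zero-contention assumption; `utilization_pct` is a hypothetical helper, and the 0.08 ns wire term is taken from the `pe-local-hbm` row of the probe output above.

```python
def utilization_pct(nbytes, bottleneck_bw_gbs, overhead_ns, wire_ns):
    """Util% = drain / actual, where actual = drain + overhead + wire (no contention)."""
    drain_ns = nbytes / bottleneck_bw_gbs
    actual_ns = drain_ns + overhead_ns + wire_ns
    return 100.0 * drain_ns / actual_ns


util_4k = utilization_pct(4 * 1024, 256.0, overhead_ns=2.0, wire_ns=0.08)
util_64k = utilization_pct(64 * 1024, 256.0, overhead_ns=2.0, wire_ns=0.08)
```

Because overhead and wire delay are fixed per transfer while drain grows with size, utilization rises from about 88.5% at 4 KB to about 99.2% at 64 KB.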
### H2D Path: Why Ovhd% is ~40%
H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc →
xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain
of 32 ns for 4KB, so overhead is comparable to data transfer time—resulting
in ~55% utilization. This is expected for small command-path transfers.