ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 short (3-4 letter)
  category prefixes (dev / mem / lat / prog / algo / par / api / ver).
  Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity,
  10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items
  incl. TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions).
codesign-hw.md deleted; ADR-0019/0021 moved to adr-history with
one-line stub status.

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes
  pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4); ADR-0027
  cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of
  future-work (now documented in ADR-0034 D3); related D1/D3 wording
  realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md,
  di-presentation.md, ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en},
  src/kernbench/ccl/__init__.py

Source / test / yaml:
ADR-NNNN cross-references in docstrings and YAML comments updated after
the merges (ADR-0021->0014 D6, ADR-0019->0017 D8). No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files
  ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00
parent 22fd0d2b9d
commit 687c98086d
97 changed files with 3286 additions and 3766 deletions
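
The filename convention named in the commit message (ADR-NNNN-<cat>-title.md with the eight listed category prefixes) could be checked with something like the sketch below. It is not part of this commit. The helper name `parse_adr_filename`, the four-digit number width, and the kebab-case title shape are assumptions made for illustration, and language-suffixed files such as the `.en` variants are deliberately not handled.

```python
from __future__ import annotations

import re

# Category prefixes exactly as listed in the commit message.
CATEGORIES = ("dev", "mem", "lat", "prog", "algo", "par", "api", "ver")

# Assumed shape: ADR-<4 digits>-<category>-<kebab-case-title>.md
_ADR_NAME = re.compile(
    r"^ADR-(?P<num>\d{4})-(?P<cat>" + "|".join(CATEGORIES) + r")"
    r"-(?P<title>[a-z0-9]+(?:-[a-z0-9]+)*)\.md$"
)


def parse_adr_filename(name: str) -> tuple[int, str, str] | None:
    """Return (number, category, title), or None if the name does not conform."""
    m = _ADR_NAME.match(name)
    if m is None:
        return None
    return int(m["num"]), m["cat"], m["title"]


if __name__ == "__main__":
    # New-style name taken from the diff below.
    print(parse_adr_filename("ADR-0023-dev-ipcq-pe-collective.md"))
    # Old-style name (no category prefix) is rejected.
    print(parse_adr_filename("ADR-0023-ipcq-pe-collective.md"))
```

Because the number is parsed separately from the category and title, a check like this leaves the "numbers stay immutable" rule to be enforced elsewhere, for example by comparing parsed numbers before and after a rename.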
@@ -5,5 +5,5 @@ This package provides:
    - helpers:    utilities for algorithm authors (chunked, ring_step, ...)
    - testing:    mock CCL runtime for fast unit tests of algorithm kernels

-See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
+See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md.
 """
@@ -24,7 +24,7 @@ class Scope(Enum):

@dataclass(frozen=True)
 class OpSpec:
-    """One operation in a multi-op composite (head + epilogue, ADR-0021).
+    """One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3).

    The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
    ops are epilogue stages whose ``scope`` controls how often they fire
@@ -156,7 +156,7 @@ class CompositeCmd:
    out_nbytes: int
    math_op: str | None = None       # for op="math": which math operation
    data_op: bool = True
-    # Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the
+    # Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the
    # head and ops[1:] are epilogue stages with explicit scope. When empty,
    # the legacy single-op semantics (op/a/b/math_op) apply.
    ops: tuple[OpSpec, ...] = ()
@@ -15,7 +15,7 @@ if TYPE_CHECKING:


 class HbmCtrlComponent(ComponentBase):
-    """HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033).
+    """HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033).

    Stateless per-PC ``available_at`` array; each incoming transaction is
    split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
@@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase):
    def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
        """Return list of HBM destination node_ids for DMA fan-out.

-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
        """
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"

@@ -17,9 +17,11 @@ if TYPE_CHECKING:
 class PeDmaComponent(PeEngineBase):
    """PE_DMA: dual-channel DMA engine with READ and WRITE resources.

-    Each channel has capacity=1 (ADR-0014 D4):
+    Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4):
      - DMA_READ and DMA_WRITE may execute concurrently.
      - Multiple READs cannot overlap; multiple WRITEs cannot overlap.
+    The orthogonal vc_comm channel for IPCQ traffic is defined in
+    ADR-0023 D8.

    Handles two message types:
      - Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
@@ -1,4 +1,4 @@
-"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5).
+"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1).

 Handles both fetch (TCM → register) and store (register → TCM).
 BW serialization is delegated to PE_TCM via port communication.
@@ -18,7 +18,7 @@ if TYPE_CHECKING:


 class PeFetchStoreComponent(PeEngineBase):
-    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5).
+    """PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1).

    Receives TileTokens via pipeline self-routing.
    Sends TcmRequest to PE_TCM for BW-based latency.
@@ -1,4 +1,4 @@
-"""PE_GEMM: matrix multiplication engine (ADR-0021 D6).
+"""PE_GEMM: matrix multiplication engine (ADR-0014 D1).

 Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
@@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = {


 class PeGemmComponent(PeEngineBase):
-    """PE_GEMM: MAC array (ADR-0021 D6).
+    """PE_GEMM: MAC array (ADR-0014 D1).

    In pipeline mode: pure compute — register data already fetched.
    In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
@@ -1,4 +1,4 @@
-"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6).
+"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1).

 Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
@@ -24,7 +24,7 @@ if TYPE_CHECKING:


 class PeMathComponent(PeEngineBase):
-    """PE_MATH: SIMD/Vector unit (ADR-0021 D6).
+    """PE_MATH: SIMD/Vector unit (ADR-0014 D1).

    In pipeline mode: pure compute — register data already fetched.
    In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
@@ -1,10 +1,10 @@
-"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2).
+"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6).

 Receives PeInternalTxn from PE_CPU, routes to engines:
  - Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
  - CompositeCmd → generate TilePlan, feed tiles via _feed_loop

-Composite pipeline uses token self-routing (ADR-0021 D4):
+Composite pipeline uses token self-routing (ADR-0014 D6):
  Scheduler only does initial dispatch + completion tracking.
  Tiles chain through components based on their plan's stage sequence.
 """
@@ -24,7 +24,7 @@ if TYPE_CHECKING:


 class PeSchedulerComponent(ComponentBase):
-    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2).
+    """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6).

    Simple commands are forwarded to the appropriate engine.
    CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
@@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase):
    def _dispatch_composite(
        self, env: simpy.Environment, pe_txn: Any, cmd: Any,
    ) -> Generator:
-        """Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4)."""
+        """Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6)."""
        from kernbench.components.builtin.pe_types import PipelineContext

        plan = self._generate_plan(cmd)
@@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase):
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env: simpy.Environment) -> Generator:
-        """Single feeder process: FIFO command ordering (ADR-0021 D2).
+        """Single feeder process: FIFO command ordering (ADR-0014 D6).

        No tile feed interleaving between commands.
        Queue full → only this process blocks.
@@ -1,4 +1,4 @@
-"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021).
+"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1).

 Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
 and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
@@ -32,7 +32,7 @@ class TcmRequest:


 class PeTcmComponent(ComponentBase):
-    """PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1).
+    """PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1).

    Dual-channel: read and write can proceed in parallel,
    but concurrent reads serialize, concurrent writes serialize.
@@ -1,4 +1,4 @@
-"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext.
+"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext.

 These types are used by the PE_SCHEDULER and all PE engine components
 for tile-based pipeline execution with self-routing.
@@ -84,7 +84,7 @@ class PipelineContext:

@dataclass
 class TileToken:
-    """Self-routing tile token passed between PE components (ADR-0021 D9).
+    """Self-routing tile token passed between PE components (ADR-0014 D6).

    Single-owner: only one component holds this token at any time.
    params is a cache of plan.stages[stage_idx].params (canonical source).
@@ -1,4 +1,4 @@
-"""Tile plan generators for PE pipeline (ADR-0021).
+"""Tile plan generators for PE pipeline (ADR-0014 D6).

 Generates TilePlan with stage sequences for GEMM and Math operations.
 Ported from pe_accel tiling.py with stage-based plan structure.
@@ -1,2 +1,2 @@
 # Legacy component backups — not actively used.
-# Kept for reference during ADR-0021 migration.
+# Kept for reference during the PE pipeline refactor (ADR-0014).
@@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase):
    def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
        """Return list of HBM destination node_ids for DMA fan-out.

-        With single hbm_ctrl per cube (ADR-0019), always returns one node.
-        PA-based resolution still used for cross-cube routing.
+        The PA-based resolver maps each address to one per-PE
+        ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+        one node. Cross-cube routing uses the same resolution.
        """
        cube_prefix = self.node.id.rsplit(".", 1)[0]  # e.g. "sip0.cube0"

@@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37
 _AHBM_LOCAL_USED = 38  # bits actually meaningful for AHBM

 # HBM-offset bit layout for PC (pseudo-channel) striping
-# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P
+# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P
 # configured at hbm_ctrl, the PC index is derived from hbm_offset as
 #   pc_shift = B; pc_mask = (1 << P) - 1
 #   pc = (hbm_offset >> pc_shift) & pc_mask
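
The bit-layout comment in the hunk above can be turned into a small worked example. This is a standalone sketch of the arithmetic the comment states (pc_shift = B, pc_mask = (1 << P) - 1), not the repository's code. The function name `pc_index` and the power-of-two validation are assumptions added here.

```python
def pc_index(hbm_offset: int, burst_bytes: int, num_pcs: int) -> int:
    """Derive the pseudo-channel index from an HBM offset.

    Assumes burst_bytes = 2**B and num_pcs = 2**P, as the comment requires.
    """
    for label, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{label} must be a power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1  # B
    pc_mask = num_pcs - 1                    # (1 << P) - 1
    return (hbm_offset >> pc_shift) & pc_mask


if __name__ == "__main__":
    # With 64-byte bursts and 8 PCs, consecutive bursts walk PCs 0..7 and wrap.
    for offset in (0, 63, 64, 128, 448, 512, 576):
        print(offset, pc_index(offset, burst_bytes=64, num_pcs=8))
```

With these example values the low 6 bits select the byte within a burst and the next 3 bits select the PC, so offsets 0 and 512 land on the same pseudo-channel.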
@@ -35,7 +35,7 @@ class AddressResolver:
    def __init__(self, graph: TopologyGraph) -> None:
        self._node_ids = set(graph.nodes)
        # HBM slice size (bytes) — used to decode pe_id from hbm_offset
-        # so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4).
+        # so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9).
        mm = graph.spec.get("cube", {}).get("memory_map", {})
        hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
        slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
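
One plausible reading of "decode pe_id from hbm_offset" is integer division by the slice size. The sketch below assumes equal-size contiguous slices and that a "GB" in the memory map means 2**30 bytes. Neither assumption is confirmed by the diff, and the helper names are hypothetical. The `hbm_ctrl.pe{X}` node-id form and the defaults (48 GB, 8 slices) are taken from the surrounding code.

```python
GIB = 1 << 30  # assumption: memory-map "GB" values are binary gigabytes


def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
    """Size of one per-PE HBM slice in bytes, assuming equal-size slices."""
    return hbm_total_gb * GIB // slices_per_cube


def hbm_ctrl_for_offset(cube_prefix: str, hbm_offset: int,
                        hbm_total_gb: int = 48, slices_per_cube: int = 8) -> str:
    """Map an offset within the cube's HBM to the owning hbm_ctrl.pe{X} node id."""
    slice_size = hbm_slice_bytes(hbm_total_gb, slices_per_cube)
    pe_id = hbm_offset // slice_size
    if not 0 <= pe_id < slices_per_cube:
        raise ValueError(f"hbm_offset {hbm_offset:#x} is outside the cube's HBM")
    return f"{cube_prefix}.hbm_ctrl.pe{pe_id}"


if __name__ == "__main__":
    print(hbm_ctrl_for_offset("sip0.cube0", 0))
    print(hbm_ctrl_for_offset("sip0.cube0", 6 * GIB))
```

Under these assumptions each slice is 6 GiB, so the first offset belonging to pe1 is exactly 6 GiB.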
@@ -129,7 +129,7 @@ class PathRouter:
        Otherwise the cube's own UCIe port appears as a zero-distance
        bus that Dijkstra prefers over the mesh — that is intended only
        for cross-cube routing. Local PE_DMA must traverse the mesh so
-        cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4).
+        cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7).
        """
        start = f"{src_pe}.pe_dma"
        adj = self._adj_local if _same_cube(start, dst_node) else self._adj
@@ -137,13 +137,13 @@ class PathRouter:

    def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
        """Match find_path's cube-local routing so reported distance reflects
-        the actual chosen path (ADR-0019 D4)."""
+        the actual chosen path (ADR-0017 D7)."""
        start = f"{src_pe}.pe_dma"
        adj = self._adj_local if _same_cube(start, dst_node) else self._adj
        return self._run_dijkstra_with_dist(adj, start, dst_node)

    def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
-        """M_CPU DMA path: routes through router mesh (ADR-0019).
+        """M_CPU DMA path: routes through router mesh (ADR-0017).

        Same-cube: uses _adj_local (no UCIe) to stay within mesh.
        Cross-cube: uses _adj_all to route via UCIe.
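
The routing rule in the docstrings above (same-cube traffic uses a UCIe-free adjacency so it pays the mesh distance, cross-cube traffic uses the full graph) can be illustrated with a toy graph. Everything below is a hypothetical sketch: the `same_cube` heuristic (comparing the first two dotted id components), the node names, and the edge weights are invented for the example and are not the PathRouter implementation.

```python
import heapq
from math import inf


def same_cube(a: str, b: str) -> bool:
    """Assumed heuristic: ids look like 'sipN.cubeM....'; compare that prefix."""
    return a.split(".")[:2] == b.split(".")[:2]


def dijkstra(adj: dict, start: str, goal: str) -> tuple[list[str], float]:
    """Plain Dijkstra over a {node: {neighbor: distance}} adjacency map."""
    dist = {start: 0.0}
    prev: dict[str, str] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        raise ValueError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


def find_path(adj_all: dict, adj_local: dict, start: str, goal: str):
    """Pick the UCIe-free graph for same-cube routes, the full graph otherwise."""
    adj = adj_local if same_cube(start, goal) else adj_all
    return dijkstra(adj, start, goal)


C = "sip0.cube0"
ADJ_LOCAL = {
    f"{C}.pe0.pe_dma": {f"{C}.r0": 1.0},
    f"{C}.r0": {f"{C}.r1": 5.0},
    f"{C}.r1": {f"{C}.hbm_ctrl.pe1": 1.0},
}
# The full graph adds a zero-distance UCIe shortcut between the two routers.
ADJ_ALL = {k: dict(v) for k, v in ADJ_LOCAL.items()}
ADJ_ALL[f"{C}.r0"][f"{C}.ucie"] = 0.0
ADJ_ALL[f"{C}.ucie"] = {f"{C}.r1": 0.0}

if __name__ == "__main__":
    print(find_path(ADJ_ALL, ADJ_LOCAL, f"{C}.pe0.pe_dma", f"{C}.hbm_ctrl.pe1"))
    print(dijkstra(ADJ_ALL, f"{C}.pe0.pe_dma", f"{C}.hbm_ctrl.pe1"))
```

In this toy graph the local route costs 7.0 through the mesh, while running Dijkstra on the full graph would take the zero-distance UCIe hop and report 2.0, which is the shortcut the docstring says must be excluded for same-cube access.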
@@ -58,7 +58,7 @@ def _get_active_context():


 class _AhbmNamespace:
-    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
+    """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3).

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
    backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
@@ -124,7 +124,7 @@ class RuntimeContext:
        dc = DistributedContext()
        dc._ctx_ref = self  # back-reference for AhbmCCLBackend to reach ctx.launch etc.
        self.distributed = dc
-        # ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator
+        # ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator
        # (PyTorch 2.x portable) namespaces for per-greenlet device binding.
        self.ahbm = _AhbmNamespace()
        self.accelerator = _AcceleratorNamespace(self.ahbm)
@@ -472,7 +472,7 @@ class RuntimeContext:
        eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
        eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
        # ADR-0026 D4: resolve structural coords directly at resolve time.
-        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target
+        # ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target
        # SIP; if unset, fall back to SIP 0 for single-driver compatibility.
        current_sip = (
            self.ahbm.current_device() if hasattr(self, "ahbm") else None
@@ -619,7 +619,7 @@ class RuntimeContext:
        Creates per-SIP KernelLaunchMsg with local va_base per tensor
        (like host driver sending per-rank launch commands).

-        When ``_defer_wait=True`` (ADR-0024 D7), returns the list of
+        When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of
        ``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
        responsible for waiting — used by collective ops to yield between
        submit and wait so all sibling ranks can submit first.
@@ -786,7 +786,7 @@ class RuntimeContext:
            last_handle = h

        if _defer_wait:
-            # ADR-0024 D7: return the pending-list so the caller can yield
+            # ADR-0027 D0.4: return the pending-list so the caller can yield
            # between submit and drain. Used by collective ops that need
            # all sibling ranks to submit before any rank waits.
            return [
@@ -178,7 +178,7 @@ class DistributedContext:

    def __init__(self) -> None:
        self._backend: AhbmCCLBackend | None = None
-        # ADR-0024 D9: greenlet-local rank registry. Bench launcher calls
+        # ADR-0024 D2: greenlet-local rank registry. Bench launcher calls
        # _bind_rank(g, rank) when spawning workers; get_rank() resolves the
        # current greenlet to its rank. Unbound greenlets fall back to 0 for
        # single-driver test compat.
@@ -220,7 +220,7 @@ class DistributedContext:
    def get_rank(self) -> int:
        """Return the rank bound to the current greenlet (default 0).

-        ADR-0024 D9: workers spawned by the bench launcher each get a rank
+        ADR-0024 D2: workers spawned by the bench launcher each get a rank
        registered via ``_bind_rank``. Callers outside any bound greenlet
        fall back to rank 0 for single-driver test compat.
        """
@@ -230,7 +230,7 @@ class DistributedContext:
        return int(self._rank_by_greenlet.get(g, 0))

    def _bind_rank(self, g: Any, rank: int) -> None:
-        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9)."""
+        """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2)."""
        self._rank_by_greenlet[g] = int(rank)

    def get_backend(self) -> str:
@@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None:
                # Populate _completed so fast-path in ctx.wait short-circuits
                # on the return leg.
                ctx._completed.add(h)
-        # (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
+        # (b) Collective backend queue (ADR-0027 D0.4-(2)).
        if backend is not None:
            pending_list = getattr(backend, "_pending_collective_handles", None)
            if pending_list is not None:
@@ -51,7 +51,7 @@ class OpLogger:
        record_end fires.
        """
        snap: dict[str, Any] = {}
-        # TileToken (ADR-0021 pipeline) — capture which stage this is and its
+        # TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its
        # per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
        # we can recover them at record_end even after the token advances.
        try:
@@ -356,7 +356,7 @@ def _instantiate_cube(
 ) -> None:
    """Add all cube-internal nodes and edges, including PE instances.

-    Topology: explicit router mesh from cube_mesh.yaml (ADR-0019).
+    Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1).
    Each router is a separate SimPy node. Components attach to routers
    based on cube_mesh.yaml attachment lists.
    """
@@ -367,10 +367,10 @@ def _instantiate_cube(
    clinks = cube["links"]
    mm = cube["memory_map"]

-    # ── Mode branch (ADR-0019) ──
+    # ── Mode branch (ADR-0017 D8) ──
    mode = mm.get("hbm_mapping_mode", "n_to_one")
    if mode == "one_to_one":
-        raise NotImplementedError("1:1 mode: ADR-0019 D3")
+        raise NotImplementedError("1:1 mode: ADR-0017 D8")

    # ── UCIe ports + connection nodes ──
    ucie_cfg = cube["ucie"]
@@ -404,11 +404,10 @@ def _instantiate_cube(
            label=name.upper().replace("_", " "),
        )

-    # ── Per-PE HBM controller (ADR-0019 D1/D4) ──
+    # ── Per-PE HBM controller (ADR-0017 D4) ──
    # Each PE owns one slice of the cube's HBM. The slice has its own
    # set of pseudo-channels and is reachable ONLY through that PE's
    # attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
-    # Restored after the ADR-0019 over-consolidation in commit 5917b34.
    hbm_spec = cube["components"]["hbm_ctrl"]
    hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
    _hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
@@ -425,7 +424,7 @@ def _instantiate_cube(
            label=f"HBM CTRL pe{pe_idx}",
        )

-    # ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ──
+    # ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ──
    routers = mesh_data["routers"]
    router_spec = cube["components"]["noc_router"]
    router_bw = clinks.get("router_link_bw_gbs", 256.0)
@@ -573,7 +572,7 @@ def _instantiate_cube(
                    ))
            elif item.endswith(".hbm"):
                # peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
-                # (ADR-0019 D1/D4 — per-PE HBM partitioning.)
+                # (ADR-0017 D4 — per-PE HBM partitioning.)
                pe_prefix = item.rsplit(".", 1)[0]
                pe_idx = int(pe_prefix.replace("pe", ""))
                pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
@@ -645,13 +644,12 @@ def _instantiate_cube(
                    ))

    # NOTE: HBM↔router edges are created in the per-router attach loop
-    # above (peX.hbm items map router → hbm_ctrl.peX). Removed the
-    # legacy "all routers → single hbm_ctrl" loop that bypassed the
-    # ADR-0019 D4 per-PE partition.
+    # above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4
+    # for the per-PE partition contract.


 def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
-    """Add PE-internal edges for a single PE instance (ADR-0021)."""
+    """Add PE-internal edges for a single PE instance (ADR-0014 D8)."""
    edges.append(Edge(
        src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
        distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
@@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
            kind="pe_internal",
        ))

-    # Fetch/Store → TCM (ADR-0021 D5)
+    # Fetch/Store → TCM (ADR-0014 D5)
    if "fetch_store_to_tcm_mm" in pe_links:
        edges.append(Edge(
            src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
@@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
            kind="pe_internal",
        ))

-    # Chaining edges (ADR-0021 D4 — token self-routing)
+    # Chaining edges (ADR-0014 D6 — token self-routing)
    chaining = [
        ("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
        ("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
@@ -6,7 +6,7 @@
  forward(x) ends with ``dist.all_reduce`` to sum partial products.

 Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
-ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
+ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3).

 Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
 least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
@@ -53,7 +53,7 @@ class ColumnParallelLinear:
        self.k_local = out_features // ws
        self.dtype = dtype
        self._torch = torch
-        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
+        # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places
        # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
        self.weight = torch.zeros(
            (in_features, self.k_local),
@@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int:
    """Return this worker's rank within the TP group.

    Delegates to the greenlet-local rank registered by the spawn launcher
-    (ADR-0024 D9 via ``torch.distributed.get_rank``).
+    (ADR-0024 D2 via ``torch.distributed.get_rank``).
    """
    # Resolve via the global torch.distributed facade on the active ctx.
    return _current_rank()