ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- Rename ADRs to ADR-NNNN-<cat>-title.md with 8 short category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- Split the ADR lifecycle into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
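
The filename convention in the commit message can be checked mechanically. The following is a minimal sketch, not the repository's tooling: `parse_adr_filename` and `ADR_NAME` are hypothetical names, and both the lowercase hyphenated title slug and the optional two-letter language suffix (as in `.en.md`) are assumptions inferred from the filenames that appear in this commit.

```python
import re

# Category prefixes listed in the commit message.
CATEGORIES = ("dev", "mem", "lat", "prog", "algo", "par", "api", "ver")

# Assumed shape: ADR-NNNN-<cat>-<title>[.<lang>].md
# The title charset and the language suffix are assumptions, not a documented spec.
ADR_NAME = re.compile(
    r"^ADR-(?P<num>\d{4})-(?P<cat>" + "|".join(CATEGORIES) + r")-"
    r"(?P<title>[a-z0-9]+(?:-[a-z0-9]+)*)(?P<lang>\.[a-z]{2})?\.md$"
)


def parse_adr_filename(name):
    """Return (number, category, title) or None if the name does not conform."""
    m = ADR_NAME.match(name)
    if m is None:
        return None
    return int(m["num"]), m["cat"], m["title"]


# New-style name from this commit parses; the pre-rename name does not.
print(parse_adr_filename("ADR-0023-dev-ipcq-pe-collective.md"))
print(parse_adr_filename("ADR-0023-ipcq-pe-collective.md"))
```

Because the ADR number is parsed separately from the category, a check like this would still recognise the same ADR if its category were ever changed, which fits the "numbers stay immutable" rule.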
@@ -5,5 +5,5 @@ This package provides:
 - helpers: utilities for algorithm authors (chunked, ring_step, ...)
 - testing: mock CCL runtime for fast unit tests of algorithm kernels

-See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
+See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md.
 """
@@ -24,7 +24,7 @@ class Scope(Enum):

 @dataclass(frozen=True)
 class OpSpec:
-"""One operation in a multi-op composite (head + epilogue, ADR-0021).
+"""One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3).

 The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
 ops are epilogue stages whose ``scope`` controls how often they fire
@@ -156,7 +156,7 @@ class CompositeCmd:
 out_nbytes: int
 math_op: str | None = None # for op="math": which math operation
 data_op: bool = True
-# Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the
+# Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the
 # head and ops[1:] are epilogue stages with explicit scope. When empty,
 # the legacy single-op semantics (op/a/b/math_op) apply.
 ops: tuple[OpSpec, ...] = ()
@@ -15,7 +15,7 @@ if TYPE_CHECKING:


 class HbmCtrlComponent(ComponentBase):
-"""HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033).
+"""HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033).

 Stateless per-PC ``available_at`` array; each incoming transaction is
 split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
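
The chunking rule in the docstring above can be illustrated as follows. This is only a sketch of the arithmetic, not the component's code: `split_round_robin` and its `start_pc` parameter are hypothetical, and the docstring is cut off by the hunk boundary, so the starting pseudo-channel is an assumption.

```python
def split_round_robin(nbytes, burst_bytes, num_pcs, start_pc=0):
    """Return the pseudo-channel index assigned to each burst-sized chunk.

    The chunk count is ceil(nbytes / burst_bytes), as stated in the docstring.
    start_pc is an assumption; the source does not show where the rotation begins.
    """
    if nbytes < 0 or burst_bytes <= 0 or num_pcs <= 0:
        raise ValueError("sizes must be positive")
    n_chunks = -(-nbytes // burst_bytes)  # integer ceiling division
    return [(start_pc + i) % num_pcs for i in range(n_chunks)]


# A 320-byte transaction with 64-byte bursts over 4 PCs wraps back to PC 0.
print(split_round_robin(320, 64, 4))
```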
@@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase):
 def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
 """Return list of HBM destination node_ids for DMA fan-out.

-With single hbm_ctrl per cube (ADR-0019), always returns one node.
-PA-based resolution still used for cross-cube routing.
+The PA-based resolver maps each address to one per-PE
+``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+one node. Cross-cube routing uses the same resolution.
 """
 cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"

@@ -17,9 +17,11 @@ if TYPE_CHECKING:
 class PeDmaComponent(PeEngineBase):
 """PE_DMA: dual-channel DMA engine with READ and WRITE resources.

-Each channel has capacity=1 (ADR-0014 D4):
+Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4):
 - DMA_READ and DMA_WRITE may execute concurrently.
 - Multiple READs cannot overlap; multiple WRITEs cannot overlap.
+The orthogonal vc_comm channel for IPCQ traffic is defined in
+ADR-0023 D8.

 Handles two message types:
 - Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
@@ -1,4 +1,4 @@
-"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5).
+"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1).

 Handles both fetch (TCM → register) and store (register → TCM).
 BW serialization is delegated to PE_TCM via port communication.
@@ -18,7 +18,7 @@ if TYPE_CHECKING:


 class PeFetchStoreComponent(PeEngineBase):
-"""PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5).
+"""PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1).

 Receives TileTokens via pipeline self-routing.
 Sends TcmRequest to PE_TCM for BW-based latency.
@@ -1,4 +1,4 @@
-"""PE_GEMM: matrix multiplication engine (ADR-0021 D6).
+"""PE_GEMM: matrix multiplication engine (ADR-0014 D1).

 Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
@@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = {


 class PeGemmComponent(PeEngineBase):
-"""PE_GEMM: MAC array (ADR-0021 D6).
+"""PE_GEMM: MAC array (ADR-0014 D1).

 In pipeline mode: pure compute — register data already fetched.
 In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
@@ -1,4 +1,4 @@
-"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6).
+"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1).

 Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
 In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
@@ -24,7 +24,7 @@ if TYPE_CHECKING:


 class PeMathComponent(PeEngineBase):
-"""PE_MATH: SIMD/Vector unit (ADR-0021 D6).
+"""PE_MATH: SIMD/Vector unit (ADR-0014 D1).

 In pipeline mode: pure compute — register data already fetched.
 In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
@@ -1,10 +1,10 @@
-"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2).
+"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6).

 Receives PeInternalTxn from PE_CPU, routes to engines:
 - Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
 - CompositeCmd → generate TilePlan, feed tiles via _feed_loop

-Composite pipeline uses token self-routing (ADR-0021 D4):
+Composite pipeline uses token self-routing (ADR-0014 D6):
 Scheduler only does initial dispatch + completion tracking.
 Tiles chain through components based on their plan's stage sequence.
 """
@@ -24,7 +24,7 @@ if TYPE_CHECKING:


 class PeSchedulerComponent(ComponentBase):
-"""PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2).
+"""PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6).

 Simple commands are forwarded to the appropriate engine.
 CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
@@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase):
 def _dispatch_composite(
 self, env: simpy.Environment, pe_txn: Any, cmd: Any,
 ) -> Generator:
-"""Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4)."""
+"""Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6)."""
 from kernbench.components.builtin.pe_types import PipelineContext

 plan = self._generate_plan(cmd)
@@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase):
 yield self._pending_feeds.put((plan, ctx))

 def _feed_loop(self, env: simpy.Environment) -> Generator:
-"""Single feeder process: FIFO command ordering (ADR-0021 D2).
+"""Single feeder process: FIFO command ordering (ADR-0014 D6).

 No tile feed interleaving between commands.
 Queue full → only this process blocks.
@@ -1,4 +1,4 @@
-"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021).
+"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1).

 Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
 and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
@@ -32,7 +32,7 @@ class TcmRequest:


 class PeTcmComponent(ComponentBase):
-"""PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1).
+"""PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1).

 Dual-channel: read and write can proceed in parallel,
 but concurrent reads serialize, concurrent writes serialize.
@@ -1,4 +1,4 @@
-"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext.
+"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext.

 These types are used by the PE_SCHEDULER and all PE engine components
 for tile-based pipeline execution with self-routing.
@@ -84,7 +84,7 @@ class PipelineContext:

 @dataclass
 class TileToken:
-"""Self-routing tile token passed between PE components (ADR-0021 D9).
+"""Self-routing tile token passed between PE components (ADR-0014 D6).

 Single-owner: only one component holds this token at any time.
 params is a cache of plan.stages[stage_idx].params (canonical source).
@@ -1,4 +1,4 @@
-"""Tile plan generators for PE pipeline (ADR-0021).
+"""Tile plan generators for PE pipeline (ADR-0014 D6).

 Generates TilePlan with stage sequences for GEMM and Math operations.
 Ported from pe_accel tiling.py with stage-based plan structure.
@@ -1,2 +1,2 @@
 # Legacy component backups — not actively used.
-# Kept for reference during ADR-0021 migration.
+# Kept for reference during the PE pipeline refactor (ADR-0014).
@@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase):
 def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
 """Return list of HBM destination node_ids for DMA fan-out.

-With single hbm_ctrl per cube (ADR-0019), always returns one node.
-PA-based resolution still used for cross-cube routing.
+The PA-based resolver maps each address to one per-PE
+``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
+one node. Cross-cube routing uses the same resolution.
 """
 cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"

@@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37
 _AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM

 # HBM-offset bit layout for PC (pseudo-channel) striping
-# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P
+# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P
 # configured at hbm_ctrl, the PC index is derived from hbm_offset as
 # pc_shift = B; pc_mask = (1 << P) - 1
 # pc = (hbm_offset >> pc_shift) & pc_mask
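
The bit-layout comment above translates directly into a few lines of integer arithmetic. A minimal sketch, assuming a standalone helper: `pc_index` is a hypothetical name, and the power-of-two validation is added for illustration rather than taken from the source.

```python
def pc_index(hbm_offset, burst_bytes, num_pcs):
    """Derive the pseudo-channel index from an HBM offset.

    Follows the comment: with burst_bytes = 2^B and num_pcs = 2^P,
    pc_shift = B, pc_mask = (1 << P) - 1, pc = (hbm_offset >> pc_shift) & pc_mask.
    """
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two")
    pc_shift = burst_bytes.bit_length() - 1  # B, since burst_bytes == 1 << B
    pc_mask = num_pcs - 1                    # equals (1 << P) - 1
    return (hbm_offset >> pc_shift) & pc_mask


# Consecutive 64-byte bursts walk through the 8 PCs and then wrap.
print([pc_index(off, 64, 8) for off in range(0, 64 * 10, 64)])
```

With this layout, offsets inside one burst share a pseudo-channel, and the index wraps every `burst_bytes * num_pcs` bytes.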
@@ -35,7 +35,7 @@ class AddressResolver:
 def __init__(self, graph: TopologyGraph) -> None:
 self._node_ids = set(graph.nodes)
 # HBM slice size (bytes) — used to decode pe_id from hbm_offset
-# so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4).
+# so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9).
 mm = graph.spec.get("cube", {}).get("memory_map", {})
 hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
 slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
@@ -129,7 +129,7 @@ class PathRouter:
 Otherwise the cube's own UCIe port appears as a zero-distance
 bus that Dijkstra prefers over the mesh — that is intended only
 for cross-cube routing. Local PE_DMA must traverse the mesh so
-cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4).
+cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7).
 """
 start = f"{src_pe}.pe_dma"
 adj = self._adj_local if _same_cube(start, dst_node) else self._adj
@@ -137,13 +137,13 @@ class PathRouter:

 def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
 """Match find_path's cube-local routing so reported distance reflects
-the actual chosen path (ADR-0019 D4)."""
+the actual chosen path (ADR-0017 D7)."""
 start = f"{src_pe}.pe_dma"
 adj = self._adj_local if _same_cube(start, dst_node) else self._adj
 return self._run_dijkstra_with_dist(adj, start, dst_node)

 def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
-"""M_CPU DMA path: routes through router mesh (ADR-0019).
+"""M_CPU DMA path: routes through router mesh (ADR-0017).

 Same-cube: uses _adj_local (no UCIe) to stay within mesh.
 Cross-cube: uses _adj_all to route via UCIe.
@@ -58,7 +58,7 @@ def _get_active_context():


 class _AhbmNamespace:
-"""torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
+"""torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3).

 Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
 backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
@@ -124,7 +124,7 @@ class RuntimeContext:
 dc = DistributedContext()
 dc._ctx_ref = self # back-reference for AhbmCCLBackend to reach ctx.launch etc.
 self.distributed = dc
-# ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator
+# ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator
 # (PyTorch 2.x portable) namespaces for per-greenlet device binding.
 self.ahbm = _AhbmNamespace()
 self.accelerator = _AcceleratorNamespace(self.ahbm)
@@ -472,7 +472,7 @@ class RuntimeContext:
 eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
 eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
 # ADR-0026 D4: resolve structural coords directly at resolve time.
-# ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target
+# ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target
 # SIP; if unset, fall back to SIP 0 for single-driver compatibility.
 current_sip = (
 self.ahbm.current_device() if hasattr(self, "ahbm") else None
@@ -619,7 +619,7 @@ class RuntimeContext:
 Creates per-SIP KernelLaunchMsg with local va_base per tensor
 (like host driver sending per-rank launch commands).

-When ``_defer_wait=True`` (ADR-0024 D7), returns the list of
+When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of
 ``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
 responsible for waiting — used by collective ops to yield between
 submit and wait so all sibling ranks can submit first.
@@ -786,7 +786,7 @@ class RuntimeContext:
 last_handle = h

 if _defer_wait:
-# ADR-0024 D7: return the pending-list so the caller can yield
+# ADR-0027 D0.4: return the pending-list so the caller can yield
 # between submit and drain. Used by collective ops that need
 # all sibling ranks to submit before any rank waits.
 return [
@@ -178,7 +178,7 @@ class DistributedContext:

 def __init__(self) -> None:
 self._backend: AhbmCCLBackend | None = None
-# ADR-0024 D9: greenlet-local rank registry. Bench launcher calls
+# ADR-0024 D2: greenlet-local rank registry. Bench launcher calls
 # _bind_rank(g, rank) when spawning workers; get_rank() resolves the
 # current greenlet to its rank. Unbound greenlets fall back to 0 for
 # single-driver test compat.
@@ -220,7 +220,7 @@ class DistributedContext:
 def get_rank(self) -> int:
 """Return the rank bound to the current greenlet (default 0).

-ADR-0024 D9: workers spawned by the bench launcher each get a rank
+ADR-0024 D2: workers spawned by the bench launcher each get a rank
 registered via ``_bind_rank``. Callers outside any bound greenlet
 fall back to rank 0 for single-driver test compat.
 """
@@ -230,7 +230,7 @@ class DistributedContext:
 return int(self._rank_by_greenlet.get(g, 0))

 def _bind_rank(self, g: Any, rank: int) -> None:
-"""Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9)."""
+"""Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2)."""
 self._rank_by_greenlet[g] = int(rank)

 def get_backend(self) -> str:
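
The rank registry touched by the hunks above follows a simple bind-then-look-up pattern with a default of rank 0. The sketch below illustrates that pattern only: `RankRegistry` is a hypothetical stand-in for `DistributedContext`, and plain objects stand in for greenlets so the example runs without third-party packages. In the source the lookup key is the current greenlet, obtained by code not shown in the diff, whereas here it is passed explicitly.

```python
class RankRegistry:
    """Hypothetical stand-in for the greenlet-to-rank mapping."""

    def __init__(self):
        # Maps a worker handle (a greenlet in the source) to its rank.
        self._rank_by_worker = {}

    def bind_rank(self, worker, rank):
        """Record the rank for a worker; called once per spawned worker."""
        self._rank_by_worker[worker] = int(rank)

    def get_rank(self, worker):
        """Return the bound rank, or 0 for unbound workers (single-driver case)."""
        return int(self._rank_by_worker.get(worker, 0))


registry = RankRegistry()
worker_a, worker_b = object(), object()
registry.bind_rank(worker_a, 3)
print(registry.get_rank(worker_a), registry.get_rank(worker_b))
```

The default of 0 mirrors the fallback described in the docstring: code running outside any bound worker behaves as a single-driver run.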
@@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None:
 # Populate _completed so fast-path in ctx.wait short-circuits
 # on the return leg.
 ctx._completed.add(h)
-# (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
+# (b) Collective backend queue (ADR-0027 D0.4-(2)).
 if backend is not None:
 pending_list = getattr(backend, "_pending_collective_handles", None)
 if pending_list is not None:
@@ -51,7 +51,7 @@ class OpLogger:
 record_end fires.
 """
 snap: dict[str, Any] = {}
-# TileToken (ADR-0021 pipeline) — capture which stage this is and its
+# TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its
 # per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
 # we can recover them at record_end even after the token advances.
 try:
@@ -356,7 +356,7 @@ def _instantiate_cube(
 ) -> None:
 """Add all cube-internal nodes and edges, including PE instances.

-Topology: explicit router mesh from cube_mesh.yaml (ADR-0019).
+Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1).
 Each router is a separate SimPy node. Components attach to routers
 based on cube_mesh.yaml attachment lists.
 """
@@ -367,10 +367,10 @@ def _instantiate_cube(
 clinks = cube["links"]
 mm = cube["memory_map"]

-# ── Mode branch (ADR-0019) ──
+# ── Mode branch (ADR-0017 D8) ──
 mode = mm.get("hbm_mapping_mode", "n_to_one")
 if mode == "one_to_one":
-raise NotImplementedError("1:1 mode: ADR-0019 D3")
+raise NotImplementedError("1:1 mode: ADR-0017 D8")

 # ── UCIe ports + connection nodes ──
 ucie_cfg = cube["ucie"]
@@ -404,11 +404,10 @@ def _instantiate_cube(
 label=name.upper().replace("_", " "),
 )

-# ── Per-PE HBM controller (ADR-0019 D1/D4) ──
+# ── Per-PE HBM controller (ADR-0017 D4) ──
 # Each PE owns one slice of the cube's HBM. The slice has its own
 # set of pseudo-channels and is reachable ONLY through that PE's
 # attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
-# Restored after the ADR-0019 over-consolidation in commit 5917b34.
 hbm_spec = cube["components"]["hbm_ctrl"]
 hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
 _hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
@@ -425,7 +424,7 @@ def _instantiate_cube(
 label=f"HBM CTRL pe{pe_idx}",
 )

-# ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ──
+# ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ──
 routers = mesh_data["routers"]
 router_spec = cube["components"]["noc_router"]
 router_bw = clinks.get("router_link_bw_gbs", 256.0)
@@ -573,7 +572,7 @@ def _instantiate_cube(
 ))
 elif item.endswith(".hbm"):
 # peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
-# (ADR-0019 D1/D4 — per-PE HBM partitioning.)
+# (ADR-0017 D4 — per-PE HBM partitioning.)
 pe_prefix = item.rsplit(".", 1)[0]
 pe_idx = int(pe_prefix.replace("pe", ""))
 pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
@@ -645,13 +644,12 @@ def _instantiate_cube(
 ))

 # NOTE: HBM↔router edges are created in the per-router attach loop
-# above (peX.hbm items map router → hbm_ctrl.peX). Removed the
-# legacy "all routers → single hbm_ctrl" loop that bypassed the
-# ADR-0019 D4 per-PE partition.
+# above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4
+# for the per-PE partition contract.


 def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
-"""Add PE-internal edges for a single PE instance (ADR-0021)."""
+"""Add PE-internal edges for a single PE instance (ADR-0014 D8)."""
 edges.append(Edge(
 src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
 distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
@@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
 kind="pe_internal",
 ))

-# Fetch/Store → TCM (ADR-0021 D5)
+# Fetch/Store → TCM (ADR-0014 D5)
 if "fetch_store_to_tcm_mm" in pe_links:
 edges.append(Edge(
 src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
@@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
 kind="pe_internal",
 ))

-# Chaining edges (ADR-0021 D4 — token self-routing)
+# Chaining edges (ADR-0014 D6 — token self-routing)
 chaining = [
 ("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
 ("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
@@ -6,7 +6,7 @@
 forward(x) ends with ``dist.all_reduce`` to sum partial products.

 Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
-ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
+ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3).

 Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
 least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
@@ -53,7 +53,7 @@ class ColumnParallelLinear:
 self.k_local = out_features // ws
 self.dtype = dtype
 self._torch = torch
-# Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
+# Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places
 # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
 self.weight = torch.zeros(
 (in_features, self.k_local),
@@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int:
 """Return this worker's rank within the TP group.

 Delegates to the greenlet-local rank registered by the spawn launcher
-(ADR-0024 D9 via ``torch.distributed.get_rank``).
+(ADR-0024 D2 via ``torch.distributed.get_rank``).
 """
 # Resolve via the global torch.distributed facade on the active ctx.
 return _current_rank()