diff --git a/CLAUDE.md b/CLAUDE.md index afc2ab4..f48675c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above. > > Contains **foundations** (Authority & Scope → Terminology → Terminology > Discipline → Mental Model → Common Failure Modes) followed by **rules** -> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API / -> sim_engine Boundaries). +> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation +> Discipline, runtime API / sim_engine Boundaries). ## Authority & Scope @@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above. ### ADR Lifecycle -ADRs live in one of three folders based on lifecycle state: +ADRs live in one of four folders. Three carry **canonical English** +content based on lifecycle state; the fourth holds Korean translations: -- `docs/adr/` — **Accepted** (current implementation reflected). +- `docs/adr/` — **Accepted** (canonical English; current + implementation reflected). - `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design only / future-work exploration / retroactive documentation pending - verification). + verification). **Authoring language is free** (any language); the + promotion step (below) translates to English. - `docs/adr-history/` — **Superseded** or **Merged** (no longer the - authoritative source; kept as historical record). + authoritative source; kept as historical record). Frozen — language + policy not applied retroactively. +- `docs/adr-ko/` — Korean translations of accepted ADRs (derived + artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the + canonical source of truth; when KO and EN disagree, EN wins. See + *ADR Translation Discipline* below. Status field values: @@ -240,17 +248,23 @@ Status field values: Transitions: - **Proposed/Stub → Accepted**: when the ADR's decisions are - reflected in production code AND covered by tests. 
`git mv` from - `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`. + reflected in production code AND covered by tests. If the proposed + ADR is in Korean, translate to English and place the English in + `docs/adr/`; move the Korean original to `docs/adr-ko/`. If the + proposed ADR is in English, `git mv` it to `docs/adr/` and create + the Korean translation in `docs/adr-ko/`. Change Status to + `Accepted` in both files. - **Draft → Accepted**: when the ADR's text has been verified to - accurately describe the existing implementation. `git mv` from - `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`. + accurately describe the existing implementation. Same English / + Korean placement rule as above. - **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM` - and `git mv` to `docs/adr-history/`. The superseding ADR includes - a "Supersedes ADR-NNNN" reference (or, for partial supersession of - clauses, documents this in its own body). + in both the EN and KO files and `git mv` both to their respective + history locations (`docs/adr-history/` for English; the KO copy + stays in `docs/adr-ko/` only if it was already mirrored — see *ADR + Translation Discipline* for the frozen-history exception). - **Accepted → Merged**: set Status to `Merged into ADR-MMMM` - (single-line stub) and `git mv` to `docs/adr-history/`. + (single-line stub) in both files and apply the same `git mv` rule + as the Superseded transition. Cross-references between ADRs use the `ADR-NNNN` ID and remain valid regardless of folder location. ADR numbers are **immutable**; never @@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo: ## Derived Artifacts (Clarification) - Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code. 
-- Creating or updating files in `docs/diagrams/`: +- Korean ADR translations under `docs/adr-ko/` are **derived artifacts** + (mirror of the canonical English in `docs/adr/`); see *ADR Translation + Discipline*. +- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`: - does NOT count as a production code change, - does NOT require Phase 2 approval, - MUST be consistent with SPEC.md and ADRs. +## ADR Translation Discipline + +English in `docs/adr/` is the canonical source of truth. Korean in +`docs/adr-ko/` mirrors it 1:1 as a derived artifact. + +**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/` +must be accompanied, in the same change, by a mirroring edit to +`docs/adr-ko/.md`. The reverse also applies: edits to +`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must +always describe the same architectural content. + +Mechanics: + +- When editing an EN ADR, propagate the change to its KO counterpart + by translating just the diff (preserve unaffected KO prose); do not + regenerate the whole KO file from scratch. +- When editing a KO ADR, propagate to EN the same way. +- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language + suffix in either path). +- The `## Status` block content must remain byte-identical between + the EN and KO files (e.g., both say `Accepted`). +- Conflict policy: if the two diverge despite the rule, treat EN as + authoritative and overwrite KO. Surface the divergence to the user + before reconciling. +- `docs/adr-proposed/` is exempt — single language only, no mirror + required until promotion. +- `docs/adr-history/` is frozen — pre-existing mixed-language state + there is not migrated. + +Verification: `python tools/verify_adr_lang_pairs.py` checks that +every EN ADR has a matching KO file, the title's ADR-NNNN matches the +filename, and Status blocks are byte-equal. Run it on demand or wire +it into CI. Exit code: 0 = OK, 1 = mismatch. 
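The three checks named in the Verification paragraph above (a KO mirror exists, the title's ADR-NNNN matches the filename, and the Status blocks are byte-equal) can be sketched roughly as follows. This is an illustrative sketch only, not the contents of `tools/verify_adr_lang_pairs.py`. The helper names (`status_block`, `title_matches_filename`, `find_mismatches`, `exit_code`), the `ADR-NNNN-` filename prefix convention, and the rule that a Status block ends at the next `## ` heading are all assumptions.

```python
import re
from pathlib import Path
from typing import List, Optional

# Hypothetical helpers; NOT the real tools/verify_adr_lang_pairs.py.
# Bytes patterns are used so the Status comparison is truly byte-level.
TITLE_RE = re.compile(rb"^# (ADR-\d{4})\b", re.MULTILINE)
STATUS_RE = re.compile(
    rb"^## Status[ \t]*\r?\n(.*?)(?=^## |\Z)", re.MULTILINE | re.DOTALL
)


def status_block(data: bytes) -> Optional[bytes]:
    """Return the raw bytes between '## Status' and the next '## ' heading."""
    match = STATUS_RE.search(data)
    return match.group(1) if match else None


def title_matches_filename(path: Path, data: bytes) -> bool:
    """True when the '# ADR-NNNN' title id is the filename prefix (assumed convention)."""
    match = TITLE_RE.search(data)
    if match is None:
        return False
    return path.name.startswith(match.group(1).decode("ascii") + "-")


def find_mismatches(en_dir: Path, ko_dir: Path) -> List[str]:
    """Collect human-readable problems for every EN ADR and its KO mirror."""
    problems: List[str] = []
    for en_path in sorted(en_dir.glob("ADR-*.md")):
        ko_path = ko_dir / en_path.name  # same filename, no language suffix
        if not ko_path.is_file():
            problems.append(f"{en_path.name}: no KO mirror")
            continue
        en_data, ko_data = en_path.read_bytes(), ko_path.read_bytes()
        for path, data in ((en_path, en_data), (ko_path, ko_data)):
            if not title_matches_filename(path, data):
                problems.append(f"{path}: title id does not match filename")
        if status_block(en_data) != status_block(ko_data):
            problems.append(f"{en_path.name}: Status blocks differ")
    return problems


def exit_code(en_dir: Path, ko_dir: Path) -> int:
    """0 = OK, 1 = mismatch, following the exit codes described above."""
    return 1 if find_mismatches(en_dir, ko_dir) else 0
```

A caller could pass `Path("docs/adr")` and `Path("docs/adr-ko")` and hand the result to `sys.exit`. Comparing raw bytes rather than decoded text keeps line-ending differences visible, which seems closest to the "byte-identical" wording.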
+ ## runtime API / sim_engine Boundaries - runtime API MUST NOT hardcode topology/routing or internal hop sequences. diff --git a/docs/adr-ko/ADR-0001-mem-physaddr-layout.md b/docs/adr-ko/ADR-0001-mem-physaddr-layout.md new file mode 100644 index 0000000..5f33108 --- /dev/null +++ b/docs/adr-ko/ADR-0001-mem-physaddr-layout.md @@ -0,0 +1,362 @@ +# ADR-0001: 51-bit Physical Address Layout & Decoding Contract + +## Status + +Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal, +Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables. +Supersedes ADR-0031.) + +## Date + +2026-04-27 (original: 2026-02-27) + +## Context + +KernBench requires a stable, parsable physical address scheme that: + +- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU) +- remains topology-agnostic (no hardcoded counts) +- supports swappable policy and DI-first components +- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space + +### History + +- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) + + sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice. +- ADR-0031 (stub) requested PE-resource range partition but was never + implemented. + +Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides +concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources. +ADR-0031 is superseded. + +## Decision + +We define a **PhysAddr value object** and an **address decoding contract** +that converts an integer address into routing domains. + +### D1. PhysAddr is an immutable value object + +- PhysAddr is immutable and comparable as a pure value. +- Any allocator returns a **fully specified PhysAddr** (not partial metadata). +- No global state may be required to interpret a PhysAddr. + +### D2. 51-bit Physical Address Layout + +A 51-bit physical address is adopted. 
+ +#### 2.1 Top-Level Address Map + +```text +[50:47] sip_id (4) -- 16 SIPs +[46:42] die_id (5) -- 32 dies per SIP +[41: 0] local_offset (42) -- 4 TB per die +``` + +```text +50 47 46 42 41 0 ++---------+----------+-------------------------+ +| sip_id | die_id | local_offset | ++---------+----------+-------------------------+ +``` + +#### 2.2 die_id Allocation + +| die_id | Meaning | +|--------|---------| +| 0..15 | AHBM dies | +| 16..20 | IOCHIPLET dies | +| 21..31 | Reserved | + +#### 2.3 AHBM Die Layout + +Only lower 256 GB of the 4 TB die-local window is assigned. + +```text +[41:38] MBZ (4) +[37] addr_space (1) -- 0 = local resource, 1 = HBM memory +[36: 0] sub-address (37) +``` + +| addr_space | Meaning | +|------------|---------| +| 0 | Local resource | +| 1 | HBM memory | + +##### 2.3.1 HBM Window (addr_space = 1) + +```text +[36:0] hbm_offset (37) -- 128 GB decode window +``` + +The architectural decode window is fixed at 128 GB. Implemented capacity +may be smaller depending on SKU/topology (see D4). + +##### 2.3.2 Resource Window (addr_space = 0) + +```text +[36:34] resource_kind (3) +[33: 0] kind_local (34) -- 16 GB per kind +``` + +| resource_kind | Meaning | +|---------------|---------| +| 000 | PE_LOCAL | +| 001 | MCPU_LOCAL | +| 010 | CUBE_SRAM | +| 011..111 | Reserved | + +Each kind gets a 16 GB decode region. + +##### 2.3.3 PE_LOCAL (resource_kind = 000) + +```text +[33] MBZ (1) +[32:29] pe_id (4) -- 0..15 +[28:25] pe_sub_unit (4) +[24: 0] sub_offset (25) -- 32 MB per slot +``` + +16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode. 
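As a worked illustration of the positional decode described so far (top-level fields, the AHBM `addr_space` / `resource_kind` split, and the PE_LOCAL bit ranges), here is a minimal Python sketch. The names `PeLocalAddr`, `_bits`, and `decode_pe_local` are hypothetical and are not the factory APIs of the recommended module. Raising `ValueError` on non-zero MBZ bits is an assumption, since the ADR leaves the fault behavior unprescribed.

```python
from typing import NamedTuple


class PeLocalAddr(NamedTuple):
    """Decoded PE_LOCAL address (hypothetical illustration type)."""
    sip_id: int
    die_id: int
    pe_id: int
    pe_sub_unit: int
    sub_offset: int


def _bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit range [hi:lo] from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_pe_local(addr: int) -> PeLocalAddr:
    """Purely positional decode; no topology or runtime state is consulted."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    die_id = _bits(addr, 46, 42)
    if die_id > 15:
        raise ValueError("die_id is not an AHBM die (0..15)")
    if _bits(addr, 41, 38) != 0:
        raise ValueError("AHBM MBZ bits [41:38] are non-zero")
    if _bits(addr, 37, 37) != 0:
        raise ValueError("addr_space = 1 selects HBM, not a local resource")
    if _bits(addr, 36, 34) != 0b000:
        raise ValueError("resource_kind is not PE_LOCAL")
    if _bits(addr, 33, 33) != 0:
        raise ValueError("PE_LOCAL MBZ bit [33] is non-zero")
    return PeLocalAddr(
        sip_id=_bits(addr, 50, 47),
        die_id=die_id,
        pe_id=_bits(addr, 32, 29),
        pe_sub_unit=_bits(addr, 28, 25),
        sub_offset=_bits(addr, 24, 0),
    )


# 16 PEs x 16 sub-unit slots x 32 MB slots = 8 GB of active decode.
assert 16 * 16 * (1 << 25) == 8 << 30

# Example values chosen for illustration: sip 2, die 5, PE 3, sub-unit 6.
example = (2 << 47) | (5 << 42) | (3 << 29) | (6 << 25) | 0x400
print(decode_pe_local(example))
```

The decode only checks field positions and MBZ bits. Whether `sub_offset` stays within a sub-unit's implemented budget is a separate validation step.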
+ +| pe_sub_unit | Name | Budget | +|-------------|------|--------| +| 0 | PE_CPU_DTCM | 8 KB | +| 1 | MATH_ENGINE_DTCM | 8 KB | +| 2 | IPCQ | 256 KB | +| 3 | PE_CPU_SFR | 16 KB | +| 4 | MATH_ENGINE_SFR | 16 KB | +| 5 | DMA_ENGINE_SFR | 192 KB | +| 6 | PE_TCM | 2 MB | +| 7..15 | Reserved | -- | + +##### 2.3.4 MCPU_LOCAL (resource_kind = 001) + +```text +[33:30] MBZ (4) +[29:25] mcpu_sub_unit (5) +[24: 0] sub_offset (25) -- 32 MB per slot +``` + +1 GB active decode. + +| mcpu_sub_unit | Name | Budget | +|---------------|------|--------| +| 0 | MCPU_ITCM | 512 KB | +| 1 | MCPU_DTCM | 512 KB | +| 2 | IPCQ | 256 KB | +| 3 | MCPU_SFR | 8 KB | +| 4 | MCPU_DMA_SFR | 16 KB | +| 5 | MCPU_SRAM | 10 MB | +| 6..31 | Reserved | -- | + +##### 2.3.5 CUBE_SRAM (resource_kind = 010) + +```text +[33:25] MBZ (9) +[24: 0] sram_offset (25) -- flat 32 MB +``` + +#### 2.4 IOCHIPLET Die Layout + +Only lower 1 TB of the 4 TB die-local window is assigned. + +```text +[41:40] MBZ (2) +[39: 0] chiplet_offset (40) -- 1 TB +``` + +Region split by address range: + +| Range | Meaning | Decode condition | +|-------|---------|------------------| +| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 | +| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 | + +##### 2.4.1 IOCPU Region + +```text +[30:27] iocpu_sub_unit (4) +[26: 0] sub_offset (27) -- 128 MB per slot +``` + +16 x 128 MB slots. 2 GB active decode. + +| iocpu_sub_unit | Name | Budget | +|----------------|------|--------| +| 0 | IOCPU_ITCM | 512 KB | +| 1 | IOCPU_DTCM | 512 KB | +| 2 | IPCQ | 2 MB | +| 3 | IOCPU_SFR | 8 KB | +| 4 | IO_DMA_SFR | 16 KB | +| 5 | IO_SRAM | 64 MB | +| 6..15 | Reserved | -- | + +##### 2.4.2 UAL Region + +Sub-layout TBD (separate ADR). + +#### 2.5 Addressing Rules + +1. MBZ bits must be zero. An address with non-zero MBZ bits is + **architecturally invalid**. Implementation may raise a decode fault + or return an error -- behavior is not prescribed by this ADR. +2. 
Fixed slot sizes are chosen for simple hardware decode; actual + implemented capacity may be smaller than the slot. +3. Access beyond a sub-unit's implemented budget within a slot is + **architecturally invalid** (same policy as MBZ). + +### D3. Bitfield decoding is deterministic + +Given an integer address, field extraction (`sip_id`, `die_id`, `kind`, +`sub_unit`, `offset`) is purely positional. No runtime state is required. +Decoding deterministically maps an integer address to destination domains: +`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM / +IOCPU / UAL). + +### D4. Capacity validation may depend on topology config + +Whether a decoded address falls within **implemented capacity** (e.g., +HBM 96 GB on a specific SKU) is checked against topology parameters +provided via DI/config. Decode itself (D3) never consults topology -- +only validation does. These parameters must live in the topology/config +layer, not in node implementations. + +### D5. Routing consumes decoded domains, not raw bits + +Routing policy uses decoded domains: + +- `src` location (sip / die / pe or node_id) +- `dst` domains derived from PhysAddr decoding +- `size_bytes` for size-aware link latency + +Routing must not inspect raw bit-fields directly except inside the +decoding module. + +## Alternatives Considered + +1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice, + consumes 4 bits that enable die-local expansion to 42 bits + (IOCHIPLET 1 TB). + +2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB. + Freed rack_id bits enable 42-bit local_offset. + +3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg + spanning)**: Rejected -- complicates D3 (deterministic decoding). + Uniform 4 TB window with MBZ padding is simpler. + +4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected -- + leads to duplicated logic, inconsistent routing, and hidden + assumptions. + +5. 
**Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**: + Rejected -- violates SPEC R3 and breaks swappability. + +6. **Put decoding inside memory controllers or routers**: Rejected -- + leaks policy into components, violates SPEC R4 / D5. + +## Consequences + +### Positive + +- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit. +- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU). +- Deterministic routing domains enable clear test invariants (SPEC R1, R5). +- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit + slots, reserved MBZ bits. +- DI-first: decoder can be swapped without changing components (SPEC R4). + +### Tradeoffs + +- Sparse address holes due to power-of-2 slot alignment. +- Large reserved/MBZ regions (intentional for future extension). +- Requires explicit configuration for topology-derived sizes (D4). +- Introduces a single "blessed" decoding module that must remain stable + and well-tested. + +## Supersedes + +- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The + PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5 + fulfill ADR-0031's stated goals. + +## Implementation Notes (Non-normative) + +- Recommended module: `src/kernbench/policy/address/phyaddr.py` +- Tests should cover: encode/decode round-trip per kind, MBZ enforcement, + die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary + values, backward compatibility of factory APIs. +- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`, + `cube_sram_addr` retain signatures (minus `rack_id`); `cube_id` + parameter renamed to `die_id`. +- New factories: `pe_resource_addr`, `mcpu_resource_addr`, + `iocpu_resource_addr`, `ual_addr`. + +## Appendix A. 
Address Examples + +### A.1 AHBM HBM access + +sip=2, die=5, HBM offset=0x1000 + +```text +sip_id = 2 -> [50:47] = 0b0010 +die_id = 5 -> [46:42] = 0b00101 +addr_space = 1 -> [37] = 1 (HBM) +hbm_offset = 0x1000 -> [36:0] + +51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000 +``` + +### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400 + +```text +sip_id = 0 -> [50:47] = 0 +die_id = 0 -> [46:42] = 0 +addr_space = 0 -> [37] = 0 +resource_kind = 0 -> [36:34] = 000 (PE_LOCAL) +pe_id = 3 -> [32:29] = 0011 +pe_sub_unit = 6 -> [28:25] = 0110 (PE_TCM) +sub_offset = 0x400 -> [24:0] + +local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400 +``` + +### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0 + +```text +sip_id = 1 -> [50:47] = 0001 +die_id = 3 -> [46:42] = 00011 +addr_space = 0 -> [37] = 0 +resource_kind = 1 -> [36:34] = 001 (MCPU_LOCAL) +mcpu_sub_unit = 5 -> [29:25] = 00101 (MCPU_SRAM) +sub_offset = 0 -> [24:0] = 0 + +local_offset = (1 << 34) | (5 << 25) +``` + +### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000 + +```text +sip_id = 1 -> [50:47] = 0001 +die_id = 17 -> [46:42] = 10001 (IOCHIPLET[1]) +iocpu_sub_unit = 2 -> [30:27] = 0010 (IPCQ) +sub_offset = 0x20000 -> [26:0] + +chiplet_offset = (2 << 27) | 0x20000 + (< 0x8000_0000 -> IOCPU region) +``` + +### A.5 IOCHIPLET -- UAL region, offset=4 GB + +```text +sip_id = 0 -> [50:47] = 0 +die_id = 16 -> [46:42] = 10000 (IOCHIPLET[0]) +chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region) +``` + +## Links + +- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), + R5 (multi-domain comm) +- ADR-0031: Superseded diff --git a/docs/adr-ko/ADR-0002-lat-routing-distance.md b/docs/adr-ko/ADR-0002-lat-routing-distance.md new file mode 100644 index 0000000..19849f9 --- /dev/null +++ b/docs/adr-ko/ADR-0002-lat-routing-distance.md @@ -0,0 +1,102 @@ +# ADR-0002: Routing Distance, Ordering & Bypass Rules + +## Status +Accepted + +## Date +2026-02-27 + +## Context +The KernBench Graph Latency Simulator must compare 
kernel execution time +across different architectures and topologies by computing end-to-end +latency from graph traversal. + +To support meaningful comparison: +- routing must be deterministic +- latency must reflect actual interconnect structure +- local vs remote traffic must be distinguishable +- “bypass” optimizations must not undermine debuggability or correctness + +The simulator also aims to avoid software-managed metadata and hidden +shortcuts that obscure control paths. + +## Decision + +### D1. Distance is accumulated latency, not hop count +- Routing “distance” is defined as the **sum of per-node and per-link latency**. +- Hop count alone must not be used for ordering or path selection. +- Size-aware serialization latency (bytes / BW) contributes to distance. + +### D2. Routing order is derived from graph traversal +- The chosen route is the path with minimum accumulated latency + given the constructed graph and routing policy. +- Deterministic ordering must be guaranteed for identical inputs + (topology + policy + request). + +### D3. Bypass is explicit and graph-represented +- All paths must be explicitly represented in the graph and subject to latency accumulation. +- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations + (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops. + Local HBM access has minimal hops (switching overhead only); remote access + traverses additional routers. +- Implicit or “magic” bypass paths are disallowed. + +### D4. No zero-latency end-to-end paths + +- Every routed request must incur **end-to-end** latency > 0. +- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0 + when the fabric is distributed and distance is not meaningful at that granularity. + This is allowed because other components on the same path (e.g., PE_DMA, SRAM, + UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds. 
+- Fully zero-latency end-to-end paths are disallowed, except for explicit + test-only stubs clearly marked as such. + +### D5. Policy vs topology responsibility split +- Topology builder: + - defines nodes and links and their latency/BW parameters +- Routing policy: + - selects among available graph paths based on decoded domains +- Routing policy must not assume missing links; missing connectivity + is a topology construction error. + +### D6. No software-managed routing metadata +- Routing decisions must not rely on per-request software-managed metadata + that tracks distance, hop count, or ordering outside the graph model. +- All distance/order computation is derived from traversal itself. + +## Alternatives Considered + +1) **Hop-count based routing** +- Rejected: ignores heterogeneous latency/BW and misrepresents + architectural differences. + +2) **Implicit local shortcuts** +- Rejected: breaks debuggability and violates traversal-based latency. + +3) **Software-managed distance metadata** +- Rejected: increases control overhead and obscures routing semantics. + +## Consequences + +### Positive +- Clear, debuggable hop-by-hop traces (SPEC R2, R4). +- Architecture comparisons reflect real interconnect structure. +- Routing behavior is reproducible and deterministic. + +### Tradeoffs / Costs +- Graph construction must be correct and complete. +- Bypass modeling requires explicit graph representation, + which slightly increases topology description complexity. + +## Implementation Notes (Non-normative) +- Recommended responsibilities: + - Graph builder: ensure all required paths exist. + - Router: select next hop based on decoded domains and policy. 
+- Tests should assert: + - non-zero end-to-end latency + - deterministic routing for identical inputs + - bypass paths appear explicitly in emitted traces + +## Links +- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm) +- ADR-0001: PhysAddr layout & decoding contract diff --git a/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md b/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md new file mode 100644 index 0000000..e5acc7d --- /dev/null +++ b/docs/adr-ko/ADR-0003-dev-target-system-hierarchy.md @@ -0,0 +1,68 @@ +# ADR-0003: Target System Hierarchy & Modeling Scope + +## Status + +Accepted + +## Context + +We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform. +The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL +through switching fabrics, with a host CPU issuing commands/kernels. + +## Decision + +We model the system hierarchy explicitly: + +### D1. Tray-level + +- A compute tray contains: + - Host CPU (issues requests / coordinates runtime & data placement) + - Multiple identical SIPs (accelerators) + - Interconnect fabric between SIPs (PCIe and/or UAL via switches) + +### D2. SIP-level + +- A SIP is a multi-die package composed of: + - Multiple CUBEs (HBM die + compute PEs + UCIe) + - One or more IO chiplets (host/SIP interfaces) +- IO chiplets: + - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP + - can be multiple per SIP + - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets + +### D3. CUBE-level + +- A CUBE contains: + - HBM + memory controller (HBM_CTRL) + - NOC (on-die fabric): carries all intra-cube traffic including HBM data, + inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access. + Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity, + PE↔UCIe connectivity, M_CPU↔PE command path. 
+ NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar); + current implementation uses a 2D mesh with XY routing (see ADR-0017). + HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop). + - Shared SRAM: cube-level shared memory accessible by all PEs via NOC + - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation + - multiple PEs + - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity + +### D4. PE-level + +- A PE can execute one kernel instance +- PE contains internal control + accelerators (modeled at PE view granularity): + - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues + +## Consequences + +- The simulator supports abstraction by “views”: + - SIP view hides PE internals + - CUBE view treats each PE as a single block + - PE view expands PE internals +- Topology remains parameterized; sizes/counts/links come from configuration. + +## Links + +- SPEC R3/R5 +- ADR-0005 (diagram views) +- ADR-0017 (cube NOC 2D mesh architecture) diff --git a/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md b/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md new file mode 100644 index 0000000..d9144b0 --- /dev/null +++ b/docs/adr-ko/ADR-0004-mem-memory-semantics-local-hbm.md @@ -0,0 +1,76 @@ +# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee + +## Status + +Accepted + +## Context + +Accurately modeling PE↔HBM behavior is essential for kernel latency estimation. +Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth. + +## Decision + +### D1. Local HBM definition + +- Each PE is assigned a logically defined “local HBM” region. +- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s + router in the NOC mesh (ADR-0017 D4). +- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops). 
+- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration. + +### D2. Local HBM bandwidth guarantee contract + +- Accesses from a PE to its local HBM MUST guarantee full effective HBM + read/write bandwidth independent of intervening fabric bandwidth limits. +- Effective HBM bandwidth = spec bandwidth x efficiency factor. + The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8) + models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page + misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective. +- The topology builder applies the efficiency factor to router-to-hbm edge + bandwidth at graph construction time, so all downstream routing and latency + computation uses the effective value. +- This guarantee is modeled by: + - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point, + - while still incurring non-zero latency along explicitly modeled components. +- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity) + is consolidated in ADR-0033 (Latency Model: Assumptions and Known + Simplifications). The aggregate BW guarantee here remains the contract; + ADR-0033 documents how the per-PC model realizes it and which scheduler + effects are intentionally simplified. + +### D3. Remote PE HBM semantics (intra-cube) + +- A PE that accesses another PE's local HBM traverses the NOC: + - PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL +- NOC bandwidth and hop count may limit remote HBM access relative to local access. + +### D4. Non-local HBM semantics (inter-cube / inter-SIP) + +- Accesses from a PE to HBM in a different cube or SIP MAY be limited by: + - NOC bandwidth within the cube, + - inter-cube UCIe links, + - inter-SIP fabric (PCIe/UAL). +- These paths MUST be explicit and traceable. + +### D5. Shared SRAM semantics + +- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE. 
+- Access path: PE_DMA → NOC → shared SRAM. +- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth. +- Shared SRAM is not part of the HBM address space; it is a separate memory domain. + +## Verification Notes + +Tests should cover: + +- local-HBM case: BW matches HBM BW regardless of fabric BW parameter +- remote PE HBM case: latency includes mesh hop traversal +- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters +- shared SRAM case: access via NOC with correct BW + +## Links + +- SPEC R2/R5 +- ADR-0002 (distance/order & explicit bypass) +- ADR-0017 D7 (PE DMA data paths through NOC to HBM) diff --git a/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md b/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md new file mode 100644 index 0000000..6391f19 --- /dev/null +++ b/docs/adr-ko/ADR-0005-dev-diagram-views-distance-layout.md @@ -0,0 +1,186 @@ +# ADR-0005: Diagram Views & Distance-Aware Layout Rules + +## Status + +Accepted + +## Context + +We require verifiable and inspectable system modeling for a large-scale, +parameterized AI Accelerator system. + +Humans must be able to: + +- visually inspect the modeled topology, +- reason about communication structure and relative distance, +- do so at multiple abstraction levels without being overwhelmed by detail. + +The simulator models distance (accumulated latency) as a first-class concept. +Diagrams must reflect this distance by default. + +--- + +## Decision + +### D1. Global Defaults + +- All diagrams MUST be **distance-aware by default**. +- All diagrams MUST render **representative views** of the architecture. +- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation. +- Instance indices MAY be used ONLY: + - to define a distance anchor in asymmetric or debugging scenarios, or + - when explicitly requested. + +--- + +### D2. Representative Rendering Rule + +- All CUBEs share the same internal structure. 
+- All PEs share the same internal structure. + +Therefore: + +- SIP-level diagrams render representative CUBEs and IO chiplets. +- CUBE-level diagrams render representative PEs as opaque blocks. +- PE-level diagrams render a representative PE with fully expanded internals. + +Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices +unless explicitly requested. + +--- + +### D3. Diagram Views + +#### View A — SIP-Level Diagram + +**Purpose** +Explain system-scale structure and connectivity. + +**Visible elements** + +- SIP boundaries (optional) +- CUBEs (opaque blocks) +- IO chiplets (opaque blocks) +- Optional UCIe stubs only if needed to clarify connectivity + +**Hidden elements** + +- PE internals +- CUBE internal fabric +- IO chiplet internals + +**Visible links** + +- Host ↔ IO chiplets (PCIe) +- SIP ↔ SIP (PCIe / UAL via switches) +- IO ↔ CUBE (on-package links) + +--- + +#### View B — CUBE-Level Diagram + +**Purpose** +Explain cube-internal structure and data/control flow. + +**Visible elements** + +- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml), all traffic routes through mesh +- HBM_CTRL attached to PE routers (local HBM = 0 hop) +- HBM subsystem (HBM_CTRL) +- Shared SRAM: cube-level shared memory +- Management CPU (M_CPU) +- PEs as opaque blocks (PE[0..N−1]) +- UCIe endpoints (N/E/W/S) as ports + +**Hidden elements** + +- PE internals + +**Visible links** + +- PE → router (HBM + non-HBM data path via mesh) +- Router ↔ HBM_CTRL (local HBM access) +- Router ↔ Router (mesh hops for remote access) +- Router ↔ UCIe endpoints +- Router ↔ shared SRAM +- M_CPU ↔ router (command path) +- Router → PE_CPU (command delivery, collapsed into PE block) + +--- + +#### View C — PE-Level Diagram + +**Purpose** +Explain internal PE behavior and execution structure. + +**Visible elements** + +- PE_CPU +- Command handler / scheduler +- PE_TCM (local SRAM) +- HW accelerators (DMA, GEMM, MATH, etc.) 
+- Local HBM interface +- Optional IPCQ / messaging endpoints + +**Visible links** + +- Control paths (CPU → scheduler → engines) +- Data paths (engines ↔ TCM, DMA ↔ local HBM) +- External fabric ports as abstract ports only + +--- + +### D4. Distance-Aware Layout (Default) + +#### Distance definition + +- Distance is defined as **accumulated latency**, consistent with ADR-0002. +- Distance is computed from a single anchor node. + +#### Default anchor selection + +- SIP view: IO chiplet (or Host CPU if present) +- CUBE view: a representative PE +- PE view: PE_CPU or Command Handler + +Anchors are **implicit defaults** and MUST NOT be required to be specified. + +#### Layout rules + +- Diagrams MUST be laid out in layers based on distance buckets. +- Layout direction MUST be consistent within a view type + (preferred: left-to-right). +- Nodes with equal distance MUST have stable ordering + (by role or identifier, deterministically). + +Cycles MAY be rendered using dashed or curved edges for readability, +without affecting distance semantics. + +--- + +### D5. Generation Contract (for Tools / Claude Code) + +When generating diagrams: + +- Assume distance-aware layout by default. +- Assume representative rendering by default. +- Do NOT ask for SIP/CUBE/PE indices unless required. +- Do NOT expand hidden abstraction levels. +- Prefer architectural clarity over micro-hop fidelity. + +--- + +## Consequences + +- Diagrams are stable across topology scaling. +- Changes in distance or routing policy are reflected visually. +- Diagrams serve as verifiable artifacts derived from the simulator model, + not as hand-maintained documentation. 
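The D4 layout rules above (distance as accumulated latency from a single anchor, layers from distance buckets, stable ordering within a layer) could be sketched as below. This is a hedged illustration, not the generator's actual algorithm. The bucket width `bucket_ns`, the node names, and the latency values are made up for the example, and ordering ties by identifier is only one of the options the ADR allows.

```python
import heapq
from typing import Dict, List, Tuple

# node -> [(neighbor, latency_ns)]; a hypothetical adjacency representation.
Graph = Dict[str, List[Tuple[str, float]]]


def accumulated_latency(graph: Graph, anchor: str) -> Dict[str, float]:
    """Dijkstra from the anchor: distance is summed latency, not hop count."""
    dist: Dict[str, float] = {anchor: 0.0}
    heap: List[Tuple[float, str]] = [(0.0, anchor)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, latency in graph.get(node, []):
            candidate = d + latency
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def layout_layers(graph: Graph, anchor: str, bucket_ns: float) -> List[List[str]]:
    """Group reachable nodes into layers by distance bucket, sorted by id."""
    buckets: Dict[int, List[str]] = {}
    for node, d in accumulated_latency(graph, anchor).items():
        buckets.setdefault(int(d // bucket_ns), []).append(node)
    # Sorting buckets and node ids makes the output deterministic.
    return [sorted(buckets[key]) for key in sorted(buckets)]


# Illustrative CUBE-view fragment anchored at a representative PE.
cube_view: Graph = {
    "pe": [("router", 2.0)],
    "router": [("hbm_ctrl", 1.0), ("sram", 3.0), ("ucie_n", 3.0)],
}
print(layout_layers(cube_view, "pe", bucket_ns=2.0))
```

Each inner list is one left-to-right layer. Nodes whose distance is not meaningful at that granularity would need explicit position metadata instead of buckets.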
+ +--- + +## Links + +- SPEC Section 4 (Output, Debuggability, and Diagrams) +- ADR-0002 (Routing distance semantics) +- ADR-0006 (Topology compilation & automatic diagram generation) diff --git a/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md b/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md new file mode 100644 index 0000000..4b3767c --- /dev/null +++ b/docs/adr-ko/ADR-0006-dev-topology-compilation-distance-diagram.md @@ -0,0 +1,130 @@ +# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation + +## Status + +Accepted + +## Context + +The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph, +and computes routing and accumulated latency (distance). +Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid +hand-maintained topology drawings. + +Additionally, for usability, diagrams should be emitted automatically into a stable location +so that developers can preview them immediately in the repository. + +--- + +## Decision + +### D1. Topology compilation is the single source of truth + +- topology.yaml (or equivalent config) is compiled into: + - an explicit system graph, + - node/link attributes, + - routing policies. +This compiled graph is the authoritative representation of the system. + +### D2. Distance extraction during compilation + +- During or immediately after topology compilation, the simulator MUST compute distance metadata + (accumulated latency) consistent with ADR-0002. +- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005. +- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4; + layout placement for such nodes uses explicit position metadata rather than distance buckets. + +### D3. 
Diagram generation is a derived artifact + +- Diagrams MUST be generated from: + - the compiled topology graph, + - extracted distance metadata, + - view/layout rules defined in ADR-0005. +- Diagram generation MUST NOT require additional hand-written topology descriptions. + +### D4. Automatic diagram emission to the repository + +- As part of topology compilation, the implementation MUST produce the following diagrams by default: + - SIP-level diagram (representative, distance-aware) + - CUBE-level diagram (representative, distance-aware) + - PE-level diagram (representative, distance-aware) +- The default output directory is: + - `docs/diagrams/` +- The generator MUST overwrite/update only when the compiled topology (or diagram rules) changes. + +### D5. View-specific projection and layout + +For each view (SIP / CUBE / PE): + +- The generator MUST project the compiled graph into a reduced view graph: + - hide/collapse nodes according to ADR-0005, + - preserve connectivity semantics relevant to that view, + - compute distance buckets and assign layout layers deterministically. +- CUBE-level projection MUST include: + - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports, + and PEs as opaque blocks. + - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017). +- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices. + +### D6. Output formats and determinism + +- The generator MUST output at least one of: + - Mermaid (Markdown-native) + - Graphviz DOT (rank-based control) + - SVG (mm-accurate layout, no external dependencies) +- SVG is preferred when mm-accurate position metadata is available from the compiled topology. +- Output MUST be deterministic: + - same topology + same rules → identical diagram text +- File naming MUST be deterministic and stable (see "Output Conventions"). + +### D7. 
Performance and caching + +- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/` + remain consistent with the compiled topology. +- The implementation SHOULD use a cache key based on: + - topology content hash, + - routing policy version, + - diagram rules version, + - view type (SIP/CUBE/PE). + +--- + +## Output Conventions + +### Directory + +- `docs/diagrams/` is the canonical output directory for generated diagrams. + +### File names (recommended, deterministic) + +- `system_view.svg` / `system_view.mmd` / `system_view.dot` +- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot` +- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot` +- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot` + +Optionally, for multi-topology workflows: + +- `sip_view__{topology_id}.svg` +- `cube_view__{topology_id}.svg` +- `pe_view__{topology_id}.svg` + +### Repository policy + +- Generated diagram files MAY be committed to the repository to enable diff-based review. +- If committed, they MUST be reproducible from topology compilation. + +--- + +## Consequences + +- Diagrams are always consistent with simulator behavior. +- Architectural changes automatically propagate to visualizations. +- Diagram diffs become meaningful indicators of architectural change. 
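One possible shape for the D7 cache key is sketched below, assuming the topology is available as text and the two versions as strings. `diagram_cache_key` and its parameter names are hypothetical and are not part of the generator described by this ADR; the sketch hashes the topology content directly instead of taking a precomputed content hash.

```python
import hashlib


def diagram_cache_key(topology_text, routing_policy_version, diagram_rules_version, view):
    """Combine the four inputs listed in D7 into one hex digest."""
    if view not in ("SIP", "CUBE", "PE"):
        raise ValueError(f"unknown view type: {view}")
    digest = hashlib.sha256()
    for part in (topology_text, routing_policy_version, diagram_rules_version, view):
        data = part.encode("utf-8")
        # Length-prefix each field so adjacent fields cannot blur together.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


key_pe = diagram_cache_key("sips: 1\n", "v1", "v3", "PE")
key_cube = diagram_cache_key("sips: 1\n", "v1", "v3", "CUBE")
```

A key built this way changes whenever any of the four inputs changes, so a cached diagram can be reused only when the topology, routing policy, diagram rules, and view all match.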
+ +--- + +## Links + +- SPEC Section 4 (Output, Debuggability, and Diagrams) +- ADR-0002 (Distance semantics) +- ADR-0005 (Diagram views and layout rules) diff --git a/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md b/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md new file mode 100644 index 0000000..9522b9d --- /dev/null +++ b/docs/adr-ko/ADR-0007-api-runtime-api-boundaries.md @@ -0,0 +1,95 @@ +# ADR-0007: Runtime API and Simulation Engine Boundaries + +## Status + +Accepted + +## Context + +The simulator consists of multiple layers with distinct responsibilities: + +- a host-facing API layer used by benchmarks and user code, +- a discrete-event simulation engine that executes requests, +- device components that model hardware behavior. + +Without strict boundaries, orchestration logic can leak into components, +or simulation internals can become entangled with user-facing APIs. + +This ADR defines clear responsibility boundaries between: + +- runtime API, +- simulation engine (sim_engine), +- hardware components. + +--- + +## Decision + +### D1. Runtime API is host-facing orchestration only + +The runtime API represents host/driver-level behavior and MUST: + +- expose high-level operations (tensor deployment, kernel launch), +- submit requests only to endpoint components (e.g., IO_CPU), +- await completion via futures/handles, +- own and persist host-side metadata (tensor allocation maps, kernel bindings). + +The runtime API MUST NOT: + +- hardcode hop-by-hop routing or fan-out, +- directly invoke internal components (M_CPU, PE_CPU, engines), +- embed topology- or routing-specific assumptions. + +--- + +### D2. 
Simulation engine wires components and tracks completion + +The simulation engine (sim_engine) MUST: + +- wire components at initialization (create port stores + start wire + processes per the component port/wire framework — ADR-0015), +- inject requests into the compiled topology graph at entry components + (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch), +- schedule and execute events using a discrete-event model, +- manage correlation ids and completion tracking. + +The simulation engine MUST NOT: + +- define tensor semantics, +- define kernel execution policies, +- expose internal graph details to the runtime API, +- walk the topology path during request execution, +- call component `run()` methods directly, +- track per-hop latency or decompose fan-out (components own this). + +--- + +### D3. Components own fan-out and aggregation + +Device-side components MUST: + +- fan-out requests to downstream domains + (IO_CPU → M_CPU → PE_CPU → schedulers/engines), +- aggregate completion and failure signals, +- propagate results deterministically upstream. + +Neither the runtime API nor the simulation engine may orchestrate +component-level fan-out explicitly. + +--- + +## Consequences + +- Runtime APIs remain stable as topology and routing evolve. +- Simulation internals can change without affecting user-facing code. +- Component implementations remain swappable via DI. 
+ +--- + +## Links + +- SPEC R4, R7, R8 +- ADR-0008 (Tensor deployment) +- ADR-0009 (Kernel execution) +- ADR-0015 (Component port/wire model and engine role) +- ADR-0010 (CLI surface and execution semantics — runtime API consumer) diff --git a/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md b/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md new file mode 100644 index 0000000..36ca4da --- /dev/null +++ b/docs/adr-ko/ADR-0008-api-tensor-deploy-and-allocation.md @@ -0,0 +1,100 @@ +# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first) + +## Status + +Accepted + +## Context + +Benchmarks require PyTorch-like tensor semantics: + +- tensor creation (empty, fill), +- deployment to accelerator devices (tensor.to()). + +In the realistic system, host software manages allocation/mapping and installs +mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011): + +- device memory operations use PA only, +- VA/MMU/IOMMU is not modeled. + +To keep the host↔device interface minimal, we avoid a separate +AllocateTensorMeta message. Instead, host allocation produces a PA shard map +that is used directly by MemoryWrite/Read and KernelLaunch. + +--- + +## Decision + +### D1. Tensor is a host-owned handle with PA shard mapping + +A Tensor object is a host-owned handle that encapsulates: + +- shape and dtype, +- initialization intent, +- device placement and allocation metadata as a PA shard map. + +After deployment, the Tensor handle MUST contain: + +- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes). + +This PA shard mapping is the single source of truth for kernel argument binding. + +--- + +### D2. Deployment uses a host allocator (Phase 0) + +In Phase 0, tensor deployment produces PA shard mappings via a host allocator: + +- placement (split/replicate/hybrid) is decided by a DP policy, +- allocation assigns PA ranges at the PE level and returns shard mappings, +- the Tensor handle stores the resulting shard list deterministically. 
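The shard tuple from D1 and the deterministic shard list from D2 could look roughly like the sketch below. `Shard` and `split_evenly` are hypothetical names, the even split stands in for whatever the DP policy decides, and the single `pa_base` argument stands in for the per-PE allocator; none of these are confirmed by the ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """One entry of the PA shard map: (sip, cube, pe, pa, nbytes, offset_bytes)."""
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def split_evenly(total_bytes, targets, pa_base):
    """Split a tensor evenly across (sip, cube, pe) targets.

    pa_base is a placeholder: a real allocator would return a PA per PE.
    """
    if not targets or total_bytes % len(targets) != 0:
        raise ValueError("total_bytes must divide evenly across a non-empty target list")
    per_shard = total_bytes // len(targets)
    # Targets are walked in the order given, so the shard list is deterministic.
    return [
        Shard(sip, cube, pe, pa_base, per_shard, index * per_shard)
        for index, (sip, cube, pe) in enumerate(targets)
    ]


shards = split_evenly(4096, [(0, 0, 0), (0, 0, 1)], pa_base=0x1000)
```

Each shard records where its bytes sit in the logical tensor (`offset_bytes`) and where they live on the device (`pa` plus the tags), which is the information kernel argument binding needs.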
+ +No separate host-visible device allocation RPC is required in Phase 0. + +--- + +### D3. Data initialization and transfer uses MemoryWrite/Read only + +Any data initialization or transfer implied by a tensor (e.g., fill, copy) +MUST be represented using Host ↔ IO_CPU messages only: + +- MemoryWrite +- MemoryRead + +Rules: + +- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012). +- Allocation metadata MUST NOT be embedded as a separate allocation message. +- Bulk tensor data MUST NOT be embedded in Phase 0 messages. + +The simulation engine schedules MemoryWrite/Read through the graph so that +latency is computed by explicit traversal. + +--- + +### D4. Extension path (non-breaking) + +Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding: + +- virtual addressing in tensor handles, +- mapping install steps, +- translation latency/page granularity. + +The Phase 0 PA shard map remains a valid fast-path configuration. + +--- + +## Consequences + +- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch). +- KernelLaunch can pass per-PE data placement explicitly via shard tags. +- Early implementation stays simple and testable. + +--- + +## Links + +- ADR-0011 (Memory Addressing — PA / VA / LA) +- ADR-0012 (Host↔IO_CPU schema) +- ADR-0007 (runtime_api vs sim_engine boundaries) +- ADR-0009 (Kernel execution) diff --git a/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md b/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md new file mode 100644 index 0000000..a94be07 --- /dev/null +++ b/docs/adr-ko/ADR-0009-api-kernel-execution-messaging.md @@ -0,0 +1,146 @@ +# ADR-0009: Kernel Execution Messaging and Completion Semantics + +## Status + +Accepted + +## Context + +Kernel execution is initiated by the host and proceeds through +device control components: + +Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines + +Completion propagates in reverse order. 
+ +To keep benchmarks simple and topology-agnostic, +kernel execution must be endpoint-driven with deterministic aggregation. + +--- + +## Decision + +### D1. Kernel launch is an endpoint request + +A kernel launch is initiated by submitting a single KernelLaunch request +to the IO_CPU endpoint. + +The runtime API MUST: + +- construct the kernel launch request, +- submit it to IO_CPU, +- await a single completion result. + +The runtime API MUST NOT orchestrate internal fan-out. + +--- + +### D2. Tensor arguments are passed by metadata + +KernelLaunch requests MUST reference tensor arguments via: + +- host-owned tensor handles, or +- resolved device address maps derived from those handles. + +Bulk tensor data MUST NOT be embedded in kernel launch messages. + +--- + +### D3. Fan-out and aggregation are component responsibilities + +- IO_CPU fans out work to M_CPUs. +- M_CPU fans out work to PE_CPUs. +- PE_CPU manages kernel execution and engine dispatch. + +Completion semantics: + +- M_CPU completes when all targeted PEs complete or a failure policy triggers. +- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers. + +--- + +### D4. Completion and failure propagation + +- All messages MUST carry correlation identifiers. +- Completion and failure MUST propagate deterministically to the host. +- The simulation engine provides futures/handles to observe completion. + +--- + +### D5. Launch timing is endpoint-synchronized + +All PEs targeted by a single kernel launch MUST begin executing the kernel +body at the same simulated time, regardless of their dispatch path length +from the launch entry point. + +Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable +latency at every level. PEs near their M_CPU receive the launch earlier +than PEs farther away; cubes near an IO_CPU receive it earlier than cubes +farther away. 
Without synchronization, each PE's kernel begins at a +different `env.now`, making per-PE metrics such as `pe_exec_ns` a function +of dispatch-path geometry rather than of the kernel's behavior — +producing measurement artifacts in benchmarks that time kernel-internal +waits (for example `tl.recv` on cross-cube or cross-SIP hops). + +Mechanism. + +- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`. +- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it + computes `target_start_ns = env.now + max_latency` where + `max_latency` is the maximum, over every target (sip, cube, pe) + tuple, of the **two-leg dispatch chain**: + + ``` + max_latency(sip, cube, pe) = + compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube))) + + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu)) + - io_cpu.overhead_ns + - m_cpu.overhead_ns + ``` + + This models the actual dispatch as **two sequential Transactions** + (IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's + `compute_path_latency_ns` adds its endpoints' `overhead_ns`; + `io_cpu.overhead_ns` is subtracted because IO_CPU has already + paid it before this method runs, and `m_cpu.overhead_ns` is + subtracted once because it appears as endpoint of leg1 *and* + start of leg2 but is paid only once at run time. A single + `find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent — + it can pick a graph path that bypasses M_CPU and silently + under-shoots the prediction for far cubes, breaking the D5 + invariant. + + The fanned-out sub-Transactions carry **`nbytes = 0`** for + `KernelLaunchMsg` (control message only). Without this, + large kernel-launch payloads would occupy fabric BW on the + shared first hop and serialize the per-cube dispatch, pushing + far M_CPUs past `target_start_ns` and re-introducing the + late-arrival violation. +- **M_CPU** passes an already-stamped `target_start_ns` through + unchanged. Only when the value is absent (e.g. 
a direct + launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier + `env.now + max(local command-path latency)`. +- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top + of `_execute_kernel`, before recording `pe_exec_start` and invoking + the kernel body. +- When `target_start_ns is None`, PE_CPU falls through to the legacy + unsynchronized behavior — preserving backward compatibility. + +IO_CPU-level stamping guarantees every PE across every targeted cube +uses the same barrier sim-time, eliminating both the within-cube +dispatch-offset artifact *and* the cross-cube offset artifact in +multi-cube launches. Models a real-hardware timed-broadcast launch +(latency-equalized dispatch tree). + +The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU +control plane — runtime API and application kernels are unchanged. + +--- + +## Links + +- SPEC R1, R2, R7, R8 +- ADR-0007 (Runtime API boundaries) +- ADR-0008 (Tensor deployment) +- ADR-0013 (Verification strategy — V2 fan-out tests) +- ADR-0015 D4 (concrete fabric path for kernel launch) diff --git a/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md b/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md new file mode 100644 index 0000000..4925a64 --- /dev/null +++ b/docs/adr-ko/ADR-0010-api-cli-surface-and-semantics.md @@ -0,0 +1,131 @@ +# ADR-0010: Command Line Interface and Execution Semantics + +## Status + +Accepted + +## Context + +The `kernbench` CLI is the user-facing entry point of the simulator. It +exposes three subcommands: + +- `run` — execute a benchmark against a topology. +- `probe` — diagnostic utility for latency / BW measurement. +- `web` — interactive topology viewer. + +Device enumeration is centralized in the CLI; neither the runtime API +nor the simulation engine enumerates devices. Benchmarks remain +single-device by design and accept a device identifier as input. + +## Decision + +### D1. 
Benchmark contract — single-device by design + +- A benchmark MUST define behavior for a single device only. +- A benchmark MUST accept a device identifier as input. +- Benchmarks MUST NOT enumerate or loop over multiple devices. + +Multi-device execution is the CLI's concern (D3), not the benchmark's. + +### D2. `kernbench run` — benchmark execution + +Required arguments: + +- `--topology `: topology YAML file path. Loaded via + `resolve_topology()`. +- `--bench `: benchmark name. Resolved via + `benches.loader.resolve_bench()`. + +Optional arguments: + +- `--device ` (default: `all`): + - `all` — run once per discovered SIP (see D3). + - `sip:` — run only on SIP N. + - Parsed via `resolve_device()`. +- `--verify-data` (default: off) — enable Phase 2 data verification + (see ADR-0020). When set, `engine_factory` constructs the engine + with `enable_data=True`. After the benchmark runs, a diagnostic + summary of recorded ops is printed. + +Each invocation runs the benchmark once within a single simulation +instance. + +### D3. Multi-device execution is logically parallel + +When `--device all` (or omitted) and the topology has multiple SIPs: + +- Benchmark executions are submitted to a single simulation engine + instance. +- Executions are logically parallel in simulation time. +- Inter-device contention is naturally modeled (shared fabric + bandwidth, cross-SIP traffic, etc.). + +The CLI does NOT spawn multiple OS processes or independent +simulation runs — parallelism is internal to one simulation instance. + +### D4. `kernbench probe` — latency / BW diagnostic utility + +Required argument: + +- `--topology `: topology YAML file path. + +Optional argument: + +- `--case ` (default: `all`) — run a predefined traffic + pattern, or `all` to run every defined case. + +Probe runs each pattern through the simulation engine and reports +per case: + +- End-to-end latency (ns). +- Effective bandwidth (nbytes / total_ns). 
+- Bottleneck bandwidth (min edge BW along the chosen path). +- Utilization (effective / bottleneck). + +Probe additionally validates monotonicity invariants — for example +that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤ +cross-SIP — and reports violations. Probe is a developer tool for +verifying the latency / BW model; it is not a benchmark. + +### D5. `kernbench web` — topology viewer + +Optional arguments: + +- `--port ` (default: `8765`) — HTTP port. +- `--no-open` — do not auto-open the browser. + +Launches a local HTTP server that renders the compiled topology in +the browser. Distinct from the static `docs/diagrams/` artifacts: + +- `docs/diagrams/` files are derived at topology-compile time + (ADR-0006). +- `kernbench web` is interactive — pan/zoom, hover for component + attributes, switch between SIP / CUBE / PE views. + +### D6. Runtime API and simulation engine remain device-scoped + +- Runtime API calls operate on one device per invocation. +- The simulation engine schedules all requests deterministically. +- Neither layer enumerates devices. + +This invariant keeps each layer testable in isolation; device +enumeration and multi-device fan-out live only in the CLI's `run` +command (D3). + +## Consequences + +- Benchmark authors write single-device logic; multi-device behavior + emerges from the CLI dispatching across SIPs. +- Adding a new subcommand (e.g., trace export, replay) does not + require benchmark or runtime-API changes — the CLI is the + extension point. +- `probe` and `web` are diagnostic / visualization tools, not + benchmarks; they bypass the benchmark loader path. 
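The per-case figures reported by `kernbench probe` (D4) follow directly from their definitions, as the sketch below shows. The function names and the case labels (`local_hbm`, `cross_pe`, `cross_cube`, `cross_sip`) are hypothetical and do not reflect the actual probe implementation. Bandwidth is expressed in bytes per nanosecond, which is numerically equal to GB/s.

```python
def probe_metrics(nbytes, total_ns, edge_bw_bytes_per_ns):
    """Compute effective BW, bottleneck BW, and utilization for one probe case."""
    effective = nbytes / total_ns
    bottleneck = min(edge_bw_bytes_per_ns)  # slowest edge along the chosen path
    return {
        "effective_bw": effective,
        "bottleneck_bw": bottleneck,
        "utilization": effective / bottleneck,
    }


def monotonicity_violations(latency_ns):
    """Return adjacent case pairs where a nearer access is slower than a farther one."""
    order = ["local_hbm", "cross_pe", "cross_cube", "cross_sip"]
    return [(a, b) for a, b in zip(order, order[1:]) if latency_ns[a] > latency_ns[b]]


metrics = probe_metrics(nbytes=4096, total_ns=256.0, edge_bw_bytes_per_ns=[32.0, 64.0])
violations = monotonicity_violations(
    {"local_hbm": 50.0, "cross_pe": 80.0, "cross_cube": 70.0, "cross_sip": 300.0}
)
```

In this made-up data the cross-PE case is slower than the cross-cube case, so the check reports that pair as a violation of the expected ordering.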
+ +## Links + +- SPEC R7, R8, R9 +- ADR-0007 (Runtime API and Simulation Engine Boundaries) +- ADR-0020 (Two-pass data execution — `--verify-data`) +- ADR-0006 (Topology compilation and diagram generation — + background for `kernbench web`) diff --git a/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md b/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md new file mode 100644 index 0000000..064c365 --- /dev/null +++ b/docs/adr-ko/ADR-0011-mem-memory-addressing-simplification.md @@ -0,0 +1,521 @@ +# ADR-0011: Memory Addressing — PA / VA / LA Address Models + +## Status + +Accepted. + +- **VA model: currently implemented (default).** +- PA model: implemented as PageFault fallback in PE_DMA. +- LA model: proposed, not implemented. + +## Context + +KernBench's address model evolved through three design points, each +addressing a limitation of the previous. This ADR documents all three +in one place because future implementation work selects among them. + +### PA-only baseline + +Phase 0 of KernBench treated all device memory operations +(MemoryRead/MemoryWrite) as raw physical-address transfers. No +host-side virtual addressing, no MMU/IOMMU translation. Allocators +returned PA mappings; DMA requests carried PA directly. + +This was sufficient for early correctness/latency work but +insufficient for running standard Triton kernels that use +`base_addr + offset` patterns on sharded tensors: each PE's shard +has a different PA, but the kernel needs a single contiguous address +space to compute offsets. + +### Why VA/MMU (current default) + +A realistic system uses host-side virtual addressing and an +MMU/IOMMU-style translation path for DMA: the host allocates physical +memory at PE level, maps it into a virtual address space, installs +mappings, and DMA requests use virtual addresses that are translated +to physical addresses. 
+ +Adopting this model lets kernels use `base_addr + offset` over a +contiguous VA range while the device-side MMU translates each access +to the appropriate PA. + +### Why LA/BAAW (proposed) + +VA/MMU treats HBM as a single backing space. KernBench needs to +explore architectures where HBM is composed of multiple pseudo +channels in parallel: + +- CUBE's HBM has 32 or 64 pseudo channels. +- In a PE-Local-HBM model, each PE is assigned N pseudo channels + (N = `hbm_pseudo_channels / pes_per_cube`). +- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW + (N × per-channel). + +Two channel-mapping modes need to be modelable: + +- **1:1 mode** — one logical access → N per-channel requests. + Precise per-channel BW contention modelling. +- **n:1 mode (default)** — one logical access → one aggregated + request. Channels are assumed to interleave; aggregated BW model. + +VA's `tl.load(va_ptr)` produces a single DMA request to a single +target. Decomposing that into per-channel requests inside PE_DMA +requires the address layer to be aware of channels. This is the +role of the LA (Logical Address) abstraction with BAAW +(Logical-to-Physical Mapping Unit). + +Core requirements driving the LA design: + +- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical + in both modes (only request shape and resource model differ). +- Kernel programming model is unchanged — physical channel + information is never exposed to kernel code. +- Mode switch is a topology-level configuration. + +### Design space summary + +| Model | Status | Key idea | +|-------|--------|----------| +| PA | fallback (implemented) | Direct physical addressing, no translation | +| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access | +| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes | + +--- + +## Decision + +This ADR defines three address models. 
At any given time the system +operates in exactly one model. Selection is topology- / configuration- +driven; coexistence within one simulation run is not required. + +--- + +### Address Model: PA (Physical Address) — fallback + +#### D-PA1. PA-only semantics + +- All device memory accesses (MemoryRead/MemoryWrite) operate on + device physical addresses (PA) plus size. +- PA-only mode remains functional via the PageFault fallback path in + PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats + the value as a PA directly. + +#### D-PA2. Allocation produces PA mappings + +Device allocation selects PE-local memory regions and returns PA +mappings sufficient to execute kernels and issue DMA requests. + +PA model is retained primarily for backward compatibility with PA-only +tests and as the underlying physical layer that VA / LA models resolve +into. + +--- + +### Address Model: VA (Virtual Address with MMU) — current default + +#### D-VA1. Virtual Address Model + +- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`). +- `TensorShard` does NOT carry a `va` field — shard VA is derived as + `va_base + offset_bytes`. +- Kernels receive `va_base` as their pointer argument (via + `TensorArg.va_base`). +- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA). + +#### D-VA2. PE_MMU Component + +- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility + (synchronous `translate()` called by PE_DMA). +- Page-aligned dict lookup for O(1) VA → PA translation. +- `tlb_overhead_ns` configurable per-access latency. +- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA + directly (preserves PA model for backward compatibility). + +#### D-VA3. Mapping Installation + +- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube + fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured + end-to-end. 
+- `MmuMapMsg.target_sips` controls SIP-level routing to prevent + cross-SIP mapping contamination for replicated tensors. +- Mapping strategy based on `DPPolicy.cube`: + - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping + only. Each cube's PEs see only their local PA. No cross-cube + mapping installed. + - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard + mappings to all target cubes. Enables cross-PE and cross-cube + DMA. + +#### D-VA4. Tensor Lifecycle + +- `del tensor` triggers automatic cleanup via `Tensor.__del__` + + `weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric, + returns VA and PA space. +- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup. +- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC. +- `PEMemAllocator` uses free-list with coalescing (not bump allocator). +- `VirtualAllocator` uses free-list with coalescing for VA space. + +#### D-VA5. Allocators + +- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free + with coalescing. +- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with + coalescing. +- Page size configurable via `topology.yaml` `pe_mmu` attrs + (default 4096). + +#### Consequences (VA model) + +- Triton kernels use `base_addr + offset` patterns naturally on + sharded tensors. +- All latency remains explicit via graph traversal, including MMU + mapping installation and per-access TLB overhead. +- PA-only mode retained as fallback (PageFault → treat as PA). +- IPCQ and other fixed-address resources bypass MMU (use PA directly). + +--- + +### Address Model: LA (Logical Address with BAAW) — proposed + +LA replaces VA when channel-level HBM modelling is required. +Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the +removed artifacts). Coexistence with VA in the same run is not a goal. + +#### D-LA1. 
LA introduction — replaces VA infrastructure + +LA is the sole address space used by kernel code (`tl.load`, +`tl.store`, `tl.composite`). Properties: + +- Can map a Tensor to a contiguous logical space (like VA). +- Expresses `(logical buffer + offset)`. +- Does NOT contain physical channel information directly. +- Stays as an intermediate abstraction until physical resolution. + +LA address space: + +| Item | Value | +|------|-------| +| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) | +| LA space size | 64 GB per PE | +| Alignment unit | segment (see D-LA3) | + +LA is PE-local: different PEs may use the same LA value; BAAW segment +tables differ → they resolve to different PAs. + +VA infrastructure removed when LA is adopted: + +| Removed | Replacement | +|---------|-------------| +| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) | +| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) | +| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component | +| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` | +| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install | +| `runtime_api/tensor.py`: `va_base` | `la_base` | +| `topology.yaml`: `pe_mmu` component entry | Removed | + +#### D-LA2. Mapping mode setting + +Topology-level (cube) configuration: + +```yaml +cube: + memory_map: + hbm_mapping_mode: n_to_one # one_to_one | n_to_one + hbm_pseudo_channels: 64 # total pseudo channel count + hbm_channels_per_pe: 8 # per-PE local channel count + hbm_channel_bw_gbs: 32.0 # per-channel bandwidth +``` + +Consumed by the graph compiler (topology builder) and BAAW +initialisation. + +#### D-LA3. Segment and BAAW + +Segment partitions the LA space; each segment maps to a specific HBM +channel or channel group. Created at tensor deploy time by the runtime +allocator. 
BAAW resolves LA → physical request(s) using the segment +table. + +```python +@dataclass +class BaawSegment: + la_base: int # segment start LA + la_size: int # segment size (bytes) + mode: str # "one_to_one" | "n_to_one" + # 1:1 mode fields + channel_count: int # channels assigned to this segment (e.g. 8) + pa_bases: list[int] # per-channel PA bases (len = channel_count) + channel_ids: list[int] # per-channel logical IDs (e.g. [0..7]) + channel_size: int # per-channel size (la_size // channel_count) + # n:1 mode fields + agg_pa_base: int # aggregated PA base + agg_node_id: str # aggregated router node_id +``` + +Segment lifecycle: + +1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA + allocator. PEMemAllocator allocates per-channel PA (1:1) or + aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment + with PE_DMA. +2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd + (src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and + converts to PA(s). +3. **Free** (tensor free): segment removed from table; LA and PA + returned. + +#### D-LA4. BAAW resolution logic + +BAAW is a front-end stage inside PE_DMA, not a separate SimPy +component. Synchronous address-resolution logic executed at the start +of PE_DMA's `handle_command()`. + +Input: `(LA, nbytes)`. Output: + +- **1:1 mode**: `list[PhysicalRequest]` — one per channel. +- **n:1 mode**: single `PhysicalRequest`. 
+ +```python +@dataclass +class PhysicalRequest: + pa: int # 51-bit Physical Address + nbytes: int # transfer size for this request + dst_node: str # target node_id (channel router or aggregated router) + + +def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]: + seg = self._find_segment(la) # la_base <= la < la_base + la_size + offset = la - seg.la_base + + if seg.mode == "n_to_one": + pa = seg.agg_pa_base + offset + return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)] + + # one_to_one + requests = [] + per_ch_size = seg.channel_size + for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)): + ch_offset = offset % per_ch_size + ch_nbytes = nbytes // seg.channel_count + pa = pa_base + ch_offset + dst_node = f"{self._pe_prefix}.ch_r{ch_id}" + requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node)) + return requests +``` + +BAAW responsibilities: + +- Convert logical access → physical request units. +- Apply mode-dependent fan-out (1:1) or pass-through (n:1). +- Compute PA and target node. + +BAAW non-responsibilities: + +- Performing actual data movement. +- Executing NOC routing. +- Simulating bandwidth occupation (downstream components' job). + +BAAW output is directly usable by the simulator's routing and resource +model without additional address decoding. + +#### D-LA5. 
PE_DMA `handle_command()` change + +Current (VA-based) flow: + +``` +DmaReadCmd.src_addr (VA) + → MMU.translate(VA) → PA + → PhysAddr.decode(PA) → PhysAddr object + → resolver.resolve(PhysAddr) → dst_node_id + → router.find_path(pe_prefix, dst_node_id) → path + → 1 sub-Transaction → fabric inject +``` + +LA-based flow: + +``` +DmaReadCmd.src_addr (LA) + → BAAW.resolve(LA, nbytes) → list[PhysicalRequest] + → for each PhysicalRequest: + → router.find_path(pe_prefix, req.dst_node) → path + → compute_drain_ns(path, req.nbytes) → drain + → sub-Transaction → fabric inject + → await all sub-Transactions + → pe_txn.done.succeed() +``` + +Key changes: + +- MMU reference removed → BAAW resolve. +- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node` + directly. +- 1 request → N parallel requests in 1:1 mode. + +#### D-LA6. 1:1 mode detail + +- One logical access → N physical requests (N = `channels_per_pe`). +- N = `hbm_pseudo_channels / pes_per_cube`. +- Each request: fully-resolved 51-bit PA, targets a specific channel + router (`{pe_prefix}.ch_r{channel_id}`). +- Per-channel link models BW contention. +- PE_DMA injects N sub-transactions concurrently. + +Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`. +PE0 owns ch0-7. + +```text +Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes +BAAW segment: { + la_base: 0x1_0000_0000, la_size: 4096, + mode: "one_to_one", channel_count: 8, + pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7], + channel_ids: [0, 1, 2, 3, 4, 5, 6, 7], + channel_size: 512, +} + +BAAW resolve result (8 requests): + → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0") + → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1") + → ... 
+ → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7") + +PE_DMA: 8 sub-transactions parallel inject + per-channel router → hbm_ctrl link (channel_bw_gbs) per channel + Total effective BW = 8 × channel_bw_gbs +``` + +Other N values: + +- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`, + 4 requests +- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`, + 16 requests + +#### D-LA7. n:1 mode detail + +- One logical access → one aggregated request. +- Target: aggregated router → hbm_ctrl (see ADR-0017 D8). +- Aggregated link BW = `channels_per_pe × channel_bw_gbs` + (e.g. 8 × 32 = 256 GB/s). +- Single queue / resource for modelling. +- No per-channel PA decomposition. + +```text +Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes +BAAW segment: { + la_base: 0x1_0000_0000, la_size: 4096, + mode: "n_to_one", + agg_pa_base: PA_agg, + agg_node_id: "sip0.cube0.pe0.agg_router", +} + +BAAW resolve result: + → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router") + +PE_DMA: 1 sub-transaction + aggregated router → hbm_ctrl link (256 GB/s) +``` + +#### D-LA8. Kernel model preserved + +- Kernel still issues single memory ops (`tl.load`, `tl.store`, + `tl.composite`). +- LA is the address scheme exposed to kernel code. +- Channel decomposition / aggregation happens inside PE_DMA's BAAW. +- Kernel code never sees physical channel information. + +#### Consequences (LA model, proposed) + +Positive: + +- 1:1 vs n:1 semantics live in one place (BAAW). +- Kernel abstraction preserved — no kernel code changes. +- Topology-based policy control (mode switch via yaml). +- Improved simulation-model consistency and debuggability. +- Segment-based mapping is simpler than page tables; lower overhead. + +Negative: + +- Full VA/MMU code refactor required. +- Request-generation path more complex (N requests in 1:1 mode). +- Reduced per-channel visibility in n:1 mode. +- VA-related tests need rewriting. 
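The channel arithmetic used in D-LA6 and D-LA7 can be checked with a few lines of Python. This is a minimal sketch and not part of the proposed implementation: the helper names `channels_per_pe` and `aggregated_bw_gbs` are hypothetical, and the even-division check is an assumption that the ADR does not state explicitly.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N in D-LA6: pseudo channels owned by each PE (hypothetical helper)."""
    if hbm_pseudo_channels <= 0 or pes_per_cube <= 0:
        raise ValueError("channel and PE counts must be positive")
    if hbm_pseudo_channels % pes_per_cube != 0:
        # Assumption: channels are expected to divide evenly across PEs.
        raise ValueError("hbm_pseudo_channels must be a multiple of pes_per_cube")
    return hbm_pseudo_channels // pes_per_cube


def aggregated_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Aggregated link bandwidth in n:1 mode (D-LA7), in GB/s."""
    return n_channels * channel_bw_gbs


# Values taken from the D-LA6 / D-LA7 examples.
n = channels_per_pe(hbm_pseudo_channels=64, pes_per_cube=8)
print(n, aggregated_bw_gbs(n, channel_bw_gbs=32.0))
```

With the example values this reproduces the 8 channels per PE and the 256 GB/s aggregated link quoted above; the other N values listed in D-LA6 (4 and 16) follow from the same helper.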
+ +--- + +## Migration Path + +- **PA → VA** was an extension. PA mode is retained as the PageFault + fallback inside PE_DMA. Switching does not require removing PA + code. +- **VA → LA**, if adopted, is a replacement, not coexistence. See + D-LA1 for the VA infrastructure removal list. PA fallback inside + PE_DMA may be retained orthogonally for tests. + +## Alternatives Considered (LA model) + +1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs. + Rejected: MMU's role would grow beyond translation to request + decomposition; aggregation (n:1) becomes awkward to express. +2. **Channel-aware kernel API**: kernels call per-channel load/store + directly. Rejected: abstraction leakage, portability loss, all + benchmarks need rewriting. +3. **Always PA (no LA)**: runtime passes per-channel PA to kernel + directly. Rejected: incompatible with aggregation; conversion + timing unclear; channel info leaks to kernel. + +## Test Requirements + +### VA model (current, regression) + +- Cross-PE / cross-cube DMA paths over installed mappings. +- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency. +- TLB-overhead-per-access timing. +- PageFault fallback path preserves PA-only behaviour. + +### LA model (when implemented) + +- 1:1 mode: same logical access → N per-channel requests. +- n:1 mode: same logical access → 1 aggregated request. +- Bandwidth equivalence between modes for identical workload. +- 1:1 mode: per-channel contention modelled correctly. +- n:1 mode: aggregated bandwidth correctly reflected. +- Kernel code unchanged across mode switch. +- BAAW segment install / uninstall correctness. +- Multiple tensors in distinct segments do not collide. + +## Implementation Order (LA, when scheduled) + +1. LA type (`policy/address/la_allocator.py`). +2. BAAW segment table (`policy/address/baaw.py`). +3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`). +4. PE_DMA BAAW integration (`components/builtin/pe_dma.py` + `handle_command()`). +5. 
RuntimeContext: LA alloc + segment install + (`runtime_api/context.py`). +6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`). +7. Remove VA/MMU code. +8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings. +9. Test migration: + +| Test file | Action | +|-----------|--------| +| `tests/test_mmu_component.py` | Remove → BAAW segment install tests | +| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests | +| `tests/test_pe_mmu.py` | Remove | +| `tests/test_va_allocator.py` | Replace with LA allocator tests | +| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests | +| `tests/test_va_offset.py` | Replace with LA offset tests | + +## Links + +- ADR-0007 (runtime_api vs sim_engine boundaries) +- ADR-0008 (tensor deployment) +- ADR-0009 (kernel execution) +- ADR-0014 (PE-internal execution model) +- ADR-0015 (component port/wire model) +- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer) +- ADR-0013 (Verification strategy — V1 PA tagging) +- SPEC R2 (latency by traversal), R10 (memory addressing) diff --git a/docs/adr-ko/ADR-0012-api-host-io-message-schema.md b/docs/adr-ko/ADR-0012-api-host-io-message-schema.md new file mode 100644 index 0000000..07d95c5 --- /dev/null +++ b/docs/adr-ko/ADR-0012-api-host-io-message-schema.md @@ -0,0 +1,233 @@ +# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged) + +## Status + +Accepted + +## Context + +Phase 0 uses a PA-first memory model (ADR-0011): + +- memory operations use device physical addresses (PA) only, +- VA/MMU/IOMMU is not modeled. + +The host-facing runtime API interacts with the device via the IO_CPU endpoint. +We define stable, minimal message schemas for Host ↔ IO_CPU so that: + +- benchmarks remain stable, +- IO_CPU-internal fan-out/aggregation can evolve independently, +- completion and failure propagation is deterministic. 
+ +We also require PE-tagging (Option A): each shard explicitly carries (sip,cube,pe) +so IO_CPU can deterministically route/fan-out without relying on PA decoding. + +--- + +## Decision + +### D1. Contract scope + +This schema is the stable contract ONLY for Host ↔ IO_CPU. + +Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal +and are NOT part of this host contract in Phase 0. + +--- + +### D2. Required message set + +The runtime API MUST use only these message types for Host ↔ IO_CPU: + +- MemoryWrite +- MemoryRead +- KernelLaunch + +All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible +with these messages. + +--- + +### D3. Common envelope (mandatory for all requests) + +All Host ↔ IO_CPU requests MUST include: + +- `msg_type: str` +- `correlation_id: str` + - generated by the host + - used to match responses deterministically +- `request_id: str` + - unique within a correlation_id +- `target_device: str` + - device identifier (e.g., "sip:0") +- `timestamp_tag: str | None` (optional) + - debug tag only; MUST NOT affect determinism + +All Host ↔ IO_CPU responses MUST include: + +- `correlation_id: str` +- `request_id: str` +- `completion: Completion` + +--- + +### D4. Completion schema (mandatory) + +`Completion` MUST have: + +- `ok: bool` +- `error_code: str | None` +- `error_message: str | None` + +Rules: + +- If `ok == true` then `error_code` and `error_message` MUST be null. +- If `ok == false` then `error_code` MUST be non-null. +- Completion semantics MUST be deterministic. + +--- + +### D5. MemoryWrite schema (PA-first, PE-tagged) + +`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
+ +Mandatory fields: + +- common envelope fields (D3) +- destination placement tags (Option A): + - `dst_sip: int` + - `dst_cube: int` + - `dst_pe: int` +- `dst_pa: int` + - destination physical address in the destination PE's address space +- `nbytes: int` +- `src_kind: "pattern" | "host_buffer_ref"` + - Phase 0 MUST support "pattern" +- `pattern: Pattern | None` + - required if `src_kind == "pattern"` + +`Pattern` (Phase 0 mandatory support): + +- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"` +- `value: number | None` + - required for fill_*; ignored for zero + +Optional fields: + +- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO") +- `debug_label: str | None` + +Notes: + +- This message MUST NOT embed bulk tensor data in Phase 0. +- All latency MUST come from explicit graph traversal and modeled components. + +--- + +### D6. MemoryRead schema (PA-first, PE-tagged) + +`MemoryRead` represents a host-initiated read from device memory. + +Mandatory fields: + +- common envelope fields (D3) +- source placement tags (Option A): + - `src_sip: int` + - `src_cube: int` + - `src_pe: int` +- `src_pa: int` +- `nbytes: int` + +Optional fields: + +- `dst_kind: "host_sink" | "discard"` (default "host_sink") +- `debug_label: str | None` + +Response payload: + +- actual bytes are NOT required in Phase 0 (latency/traces focus) +- implementations MAY return lightweight stats or hashes later via a new ADR + +--- + +### D7. KernelLaunch schema (PA-first, PE-tagged shards) + +`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
+ +Mandatory fields: + +- common envelope fields (D3) +- `kernel_ref: KernelRef` +- `args: list[KernelArg]` + +`KernelRef` MUST have: + +- `name: str` +- `kind: "deployed" | "builtin"` +- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed") +- `deploy_sip: int` — SIP where binary resides +- `deploy_cube: int` — cube where binary resides +- `deploy_pe: int` — PE where binary resides +- `nbytes_code: int` — kernel binary size (for BW modeling) + +Kernel binaries MUST be pre-deployed to device memory via MemoryWrite. +KernelLaunch MUST NOT embed kernel source code or IR in the launch message. + +`KernelArg` supports tensor args by PA mapping and scalars by value. + +Tensor arg (mandatory): + +- `arg_kind: "tensor"` +- `tensor_pa_map: TensorPAMap` + +`TensorPAMap` MUST have: + +- `shards: list[TensorShard]` + +`TensorShard` MUST have (Option A, enforced): + +- `sip: int` +- `cube: int` +- `pe: int` +- `pa: int` +- `nbytes: int` +- `offset_bytes: int` + +Scalar arg (mandatory): + +- `arg_kind: "scalar"` +- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"` +- `value: number | bool` + +Optional KernelLaunch fields: + +- `grid: dict | None` +- `meta: dict | None` +- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast") +- `debug_label: str | None` + +Notes: + +- KernelLaunch MUST NOT embed bulk tensor data. +- KernelLaunch MUST be submitted only to the IO_CPU endpoint. +- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags. + +--- + +## Verification Notes + +Tests SHOULD validate: + +- schema validation rejects missing mandatory fields, +- deterministic correlation/response matching, +- MemoryWrite/Read/KernelLaunch produce explicit hop traces, +- all routed requests incur latency > 0.
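The first verification note, together with the D4 completion rules and the D7 shard-tag requirement, lends itself to a small validation sketch. The code below is illustrative only: the class layout, the `validate_shard` helper, and the use of `ValueError` are assumptions, not the repository's actual schema code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    """Completion record following the D4 rules (hypothetical layout)."""

    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # Rule 1: a successful completion carries no error details.
        if self.ok and (self.error_code is not None or self.error_message is not None):
            raise ValueError("ok == true requires error_code and error_message to be null")
        # Rule 2: a failed completion must name an error code.
        if not self.ok and self.error_code is None:
            raise ValueError("ok == false requires a non-null error_code")


# Field names taken from the TensorShard definition in D7.
REQUIRED_SHARD_FIELDS = ("sip", "cube", "pe", "pa", "nbytes", "offset_bytes")


def validate_shard(shard: dict) -> None:
    """Reject a shard that lacks any mandatory field (hypothetical helper)."""
    missing = [name for name in REQUIRED_SHARD_FIELDS if name not in shard]
    if missing:
        raise ValueError(f"TensorShard missing mandatory fields: {missing}")
```

Checking the rules at construction time is one possible design choice; it keeps an invalid `Completion` from ever reaching the host-side matching logic.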
+ +--- + +## Links + +- ADR-0011 (Memory Addressing — PA / VA / LA) +- ADR-0007 (runtime_api vs sim_engine boundaries) +- ADR-0009 (kernel execution fan-out/aggregation) +- ADR-0013 (Verification strategy — V1 message schema validation) +- SPEC R2, R7, R8 diff --git a/docs/adr-ko/ADR-0013-ver-verification_strategy.md b/docs/adr-ko/ADR-0013-ver-verification_strategy.md new file mode 100644 index 0000000..dd7e958 --- /dev/null +++ b/docs/adr-ko/ADR-0013-ver-verification_strategy.md @@ -0,0 +1,139 @@ +# ADR-0013: Verification Strategy and Phase 1 Test Plan + +## Status + +Accepted + +## Context + +KernBench is a system-level simulator whose correctness is defined by: + +- adherence to SPEC-defined invariants, +- determinism and debuggability, +- explicit modeling of routing and latency. + +Given the evolving implementation, we need a stable verification strategy +that prevents architectural drift while allowing incremental development. + +This ADR defines the Phase 1 verification plan and what constitutes +"correct behavior" for early implementations. + +--- + +## Decision + +### D1. Verification is contract-based + +Verification MUST be derived from: + +- SPEC requirements, +- accepted ADRs. + +Tests MUST validate architectural contracts, not incidental implementation details. + +--- + +### D2. Phase 1 verification scope + +Phase 1 verification focuses on: + +- message contract validity (ADR-0012), +- routing and fan-out semantics at the IO_CPU boundary (ADR-0009), +- PA-first memory addressing and shard tagging (ADR-0011), +- core latency and trace invariants (SPEC 0.1, R2). + +Microarchitectural accuracy, bandwidth contention, and cycle-level behavior +are explicitly out of scope in Phase 1. + +--- + +### D3. Required Phase 1 verification cases + +The following verification cases MUST be supported by the implementation: + +#### V1. Message schema validation + +- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected. 
+- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected. +- Completion results MUST follow the `ok / error_code / error_message` contract. + +#### V2. IO_CPU fan-out and aggregation + +Given: + +- a topology with one SIP, one CUBE, and two PEs, +- a KernelLaunch request containing two tensor shards targeting different PEs, + +The system MUST: + +- submit a single KernelLaunch to IO_CPU, +- fan-out work internally to both PEs, +- aggregate completion and return a single deterministic completion to the host. + +#### V3. Latency and trace invariants + +For any valid request: + +- the hop-by-hop trace MUST be non-empty, +- total latency MUST be greater than zero, +- repeated runs with identical inputs MUST produce identical traces. + +#### V4. Topology independence and cross-domain coverage + +Verification cases MUST pass for multiple topology shapes, including: + +- minimal: (1 SIP, 1 CUBE, 1 PE) +- multi-PE: (1 SIP, 1 CUBE, N PEs) +- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE) +- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE) + +For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on: + +- explicit connectivity (required links exist), +- deterministic routing and control-path traversal, +- non-empty traces and latency > 0 for representative cross-domain requests + (inter-CUBE and inter-SIP paths). + +Tests MUST NOT hardcode topology sizes, node ids, or link counts. +Instead, tests MUST derive expectations from the compiled topology metadata. + +--- + +### D4. Phase 1 artifacts + +Phase 1 MAY include: + +- verification-only test code, +- topology fixtures, +- trace inspection utilities. + +Phase 1 MUST NOT require: + +- production code changes solely to satisfy tests, +- weakening or removing tests to allow progress. + +--- + +### D5.
Phase 2 enforcement + +Phase 2 (Apply) MUST: + +- run the Phase 1 verification cases, +- rollback all changes if any verification fails, +- preserve tests as authoritative contracts. + +--- + +## Consequences + +- Architectural correctness is enforced early. +- Tests serve as executable documentation of system behavior. +- Implementation remains flexible without losing rigor. + +--- + +## Links + +- SPEC 0.1, R2, R6 +- ADR-0011 (Memory Addressing — PA / VA / LA) +- ADR-0012 (Host ↔ IO_CPU message schema) +- ADR-0009 (Kernel execution semantics) diff --git a/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md b/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md new file mode 100644 index 0000000..ccb63f3 --- /dev/null +++ b/docs/adr-ko/ADR-0014-dev-pe-pipeline-execution-model.md @@ -0,0 +1,451 @@ +# ADR-0014: PE Pipeline Execution Model + +## Status + +Accepted + +## Context + +This ADR defines the PE-internal kernel execution model: + +- Role decomposition of PE-internal components +- Command dispatch paths (simple / composite / multi-op composite with epilogue) +- TileToken-based self-routing pipeline (scheduler does dispatch + completion only) +- TCM-centric dataflow with a register-file intermediary +- Engine resource model +- Observability and trace contract +- Topology representation + +PE-internal structure (7 components in scope; 2 cross-referenced): + +- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`, + `pe_tcm` — defined here +- `pe_mmu` — VA model, defined in ADR-0011 D-VA +- `pe_ipcq` — collective communication, defined in ADR-0023 + +The goal is a deterministic, trace-friendly execution contract that keeps +each block independently swappable. + +## Decision + +### D1. PE-internal component roles + +**PE_CPU** + +- Executes kernel instruction stream / control logic. +- Generates PE commands and submits them to `PE_SCHEDULER` (via + `PeInternalTxn`). +- Does NOT enqueue work directly into engine queues. 
+ +**PE_SCHEDULER** + +- Sole dispatcher inside a PE. +- Receives commands from `PE_CPU`. Dispatch by command type: + - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`) + → forward directly to the target engine. + - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline + via a single `_feed_loop` (D6). +- Does not participate in stage-to-stage chaining within a composite; + that is handled by token self-routing (D6). + +**PE_DMA** + +- Handles memory transfers between TCM and external memory domains + (HBM, shared SRAM, cross-cube UCIe) through the cube NOC. +- Two execution channels: + - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4. +- Additional virtual channels: + - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles. + - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8). + +**PE_FETCH_STORE** + +- TCM ↔ Register File transfer unit. +- Isolates register-file access semantics from compute engines so that + GEMM/MATH stay pure compute components. +- BW-based latency model; TCM access contention naturally serializes + through `PE_TCM`'s BW resource. + +**PE_GEMM** + +- MAC array. Reads operands from the register file; writes results to + the register file. Does not touch `PE_TCM` directly. + +**PE_MATH** + +- Element-wise / reduction / SIMD unit. Reads / writes the register file. + +**PE_TCM** + +- Tightly-coupled scratchpad with BW-serialized access. Two logical + regions partitioned by ownership (see D5). + +**Cross-referenced components** (defined elsewhere): + +- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA). +- `pe_ipcq` — collective ring buffers and peer endpoint metadata + (ADR-0023). + +### D2. Command lifecycle and queues + +`PE_SCHEDULER` maintains three logical structures: + +**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler. 
+ +**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks +expanded sub-commands, dependency state, engine assignment, and +completion status. + +**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion +records. + +**Single-writer rule**: only `PE_SCHEDULER` mutates command completion +state. Engines report completion via explicit events / messages +consumed by the scheduler. + +**Command completion**: when all sub-commands complete, `PE_SCHEDULER` +publishes a completion record. + +### D3. Dispatch modes + +#### D3.1 Simple command + +A simple command expands to exactly one engine sub-command: + +- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA` +- `GemmCmd` → `PE_GEMM` +- `MathCmd` → `PE_MATH` + +Flow: + +```text +PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution + → completion → PE_SCHEDULER → CompletionQueue +``` + +#### D3.2 Composite command (single-op tiled pipeline) + +The default `CompositeCmd` runs a single compute op as a tile-pipelined +sequence: + +```text +DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE +``` + +`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one +`TileToken` per tile with a monotonically increasing `tile_id`. + +Tile dependency (within one tile `t`): + +```text +DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t) +``` + +Inter-tile overlap is allowed wherever engine resources permit +(D4 governs the constraints): + +```text +DMA_READ(t+1) ∥ COMPUTE(t) +DMA_WRITE(t-1) ∥ COMPUTE(t) +``` + +#### D3.3 Multi-op composite (head + epilogue with scope) + +A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a +multi-op pipeline: + +```python +@dataclass(frozen=True) +class OpSpec: + kind: str # "gemm" | "math.exp" | "math.bias_add" | ... + scope: Scope # "per_k_tile" | "per_output_tile" | "once" + ... +``` + +- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines + M/K/N partition). 
+- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how + often they fire: + - `per_k_tile` — every K-reduction step. + - `per_output_tile` — once per output tile. + - `once` — once per kernel. + +Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural — +each stage is dispatched via token self-routing (D6), so GEMM and MATH +participate serially within the same composite even though they share +the compute slot (D4). + +The empty-`ops` form is the legacy single-op path. + +### D4. Engine resource model + +**DMA engine**: + +- `DMA_READ`: `simpy.Resource(capacity=1)`. +- `DMA_WRITE`: `simpy.Resource(capacity=1)`. +- Both channels run concurrently (READ ∥ WRITE allowed). +- Within a channel, requests serialize (READ ∥ READ disallowed; same + for WRITE). +- `vc_comm` is an orthogonal channel for IPCQ traffic defined in + ADR-0023 D8 — out of scope for this ADR. + +**Compute engine**: + +- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and + `PE_MATH`. +- At most one compute op runs at a time within a PE. +- Multi-op composite chains (D3.3) execute their compute stages serially + through this slot; token self-routing (D6) ensures the next stage + starts only after the previous compute releases the slot. + +**Engine completion**: each engine emits a completion event consumed by +the scheduler / `PipelineContext` (D6). + +### D5. Dataflow + +**Input path (HBM source)**: + +```text +HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM +PE_TCM → PE_FETCH_STORE → Register File +Register File → PE_GEMM | PE_MATH +``` + +**Input path (shared SRAM source)**: + +```text +Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM +PE_TCM → PE_FETCH_STORE → Register File +``` + +**Output path (HBM destination)**: + +```text +Register File → PE_FETCH_STORE → PE_TCM +PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM +``` + +GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the +single TCM↔register-file gateway. 
This makes TCM BW contention +explicit and lets fetch unit policies (e.g., prefetch) be replaced +independently of compute engines. + +#### D5.1 PE_TCM partitioning + +`PE_TCM` is split into two logical regions: + +**SchedulerReservedTCM** + +- Owned exclusively by `PE_SCHEDULER`. +- Holds composite-command tile buffers. +- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ / + COMPUTE / DMA_WRITE stage, guarantees input/output separation, and + manages tile-buffer lifetimes. + +**AllocatableTCM** + +- General-purpose region managed by `PEMemAllocator`. +- Used for host / DP-visible allocations. + +**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or +allocate inside `SchedulerReservedTCM`. The reserved region is excluded +from allocator-managed ranges by construction. + +**Tile buffer rules**: + +- Input and output buffers within `SchedulerReservedTCM` MUST NOT + overlap during a tile's active lifetime. +- A tile buffer remains valid until the corresponding `DMA_WRITE` + completes. +- Buffer reuse is permitted only after the consuming tile's lifetime + ends. + +### D6. TileToken self-routing pipeline + +A composite's stage-to-stage progression happens **without** routing +through the scheduler. Each component forwards the token directly to +the next stage's component using the token's `plan`: + +```text +Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete) + ↑ chaining: no scheduler hop ↑ + PipelineContext.complete_tile() +``` + +This mirrors real-HW done-wire chains. The scheduler handles only +**initial dispatch + completion aggregation**. 
+ +#### TilePlan / Stage + +```python +class StageType(Enum): + DMA_READ = 0 + FETCH = 1 + GEMM = 2 + MATH = 3 + STORE = 4 + DMA_WRITE = 5 + +@dataclass(frozen=True) +class Stage: + stage_type: StageType + component: str # topology node id (e.g., "sip0.cube0.pe0.pe_dma") + params: dict # stage-specific parameters + +@dataclass(frozen=True) +class TilePlan: + tile_id: int + stages: tuple[Stage, ...] +``` + +#### TileToken + +```python +@dataclass +class TileToken: + tile_id: int + pipeline_ctx: PipelineContext + plan: TilePlan + stage_idx: int + params: dict # cached current stage params + data_op: bool = True # op_log opt-in (ADR-0020 D4) +``` + +Single-owner invariant: a token is owned by exactly one component at a +time. Lifecycle: scheduler creates with `stage_idx=0` → component +`_process()` → increment `stage_idx` → put to next stage's `in_port` → +last stage calls `pipeline_ctx.complete_tile()`. + +#### PipelineContext (exactly-once completion) + +```python +@dataclass +class PipelineContext: + id: str + total_tiles: int + completed_tiles: int = 0 + done_event: simpy.Event = None + + def complete_tile(self) -> None: + self.completed_tiles += 1 + if self.completed_tiles == self.total_tiles: + self.done_event.succeed() +``` + +Each tile's last stage MUST call `complete_tile()` exactly once. +Duplicate calls are bugs (SimPy `Event` can succeed at most once). + +#### Feed ordering + +`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a +`_pending_feeds` FIFO. Composite commands are enqueued in submission +order; tile feed for a command runs to completion before the next +command's feed begins. **Tile-feed interleaving between commands is +disallowed.** + +Within a single command's tiles, downstream pipeline overlap arises +naturally — earlier tiles progress through later stages while the feeder +keeps pushing remaining tiles into the first stage queue (SimPy Store +backpressure governs flow control). 
If the first-stage queue is full, +only the feeder blocks; the scheduler worker's inbox processing +continues. + +#### Token routing pattern (base class) + +```python +def _pipeline_worker(self, env): + while True: + token = yield self._inbox.get() + yield from self._process(env, token) # stage-specific logic + next_idx = token.stage_idx + 1 + if next_idx < len(token.plan.stages): + next_stage = token.plan.stages[next_idx] + token.stage_idx = next_idx + token.params = next_stage.params + yield self.out_ports[next_stage.component].put(token) + else: + token.pipeline_ctx.complete_tile() +``` + +Each component implements only `_process()`; chaining lives in the +base class. + +### D7. Observability and trace contract + +The simulator emits deterministic trace events: + +- `command_submitted` +- `sub_command_dispatched` +- `engine_start` +- `engine_complete` +- `tile_ready` +- `command_complete` + +For identical inputs, trace ordering MUST be deterministic. + +### D8. Topology representation + +PE-internal components are declared in `cube.pe_template`: + +```yaml +pe_template: + components: + pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } } + pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } } + pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } } + pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } } + pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } } + pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } } + pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } } + pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } } # ADR-0011 D-VA + pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... 
} } # ADR-0023 + links: + # Scheduler dispatch edges (initial) + scheduler_to_dma_mm: 0.0 + scheduler_to_fetch_store_mm: 0.0 + scheduler_to_gemm_mm: 0.0 + scheduler_to_math_mm: 0.0 + # Pipeline chaining edges (token self-routing per D6) + dma_to_fetch_store_mm: 0.0 + fetch_store_to_gemm_mm: 0.0 + fetch_store_to_math_mm: 0.0 + gemm_to_fetch_store_mm: 0.0 + gemm_to_math_mm: 0.0 + math_to_fetch_store_mm: 0.0 + fetch_store_to_dma_mm: 0.0 + fetch_store_to_tcm_bw_gbs: ... +``` + +Template is instantiated once per PE. PE instances are derived from +`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔ +cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4). + +## Consequences + +### Positive + +- Each block is an independent topology node — individually swappable + via DI (ADR-0015). +- PE-internal structure is visible in the topology graph. +- Components do not know their downstream — plan-based routing gives + flexibility (e.g., epilogue chains require no scheduler change). +- DMA and compute overlap naturally via SimPy Store backpressure. +- Multi-op composite expresses fused operations (e.g., GEMM + bias_add) + without engine-level coupling. +- TCM access contention is realistic — `PE_FETCH_STORE` is the single + TCM↔RF gateway. + +### Negative + +- Intra-PE component count is higher than a coarser model (7 base + 2 + cross-referenced) — more topology nodes/edges. +- Intra-PE token forwarding is explicit in traces (acceptable trade for + HW fidelity). 
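The plan-based routing and exactly-once completion described in D6 can be exercised without SimPy. The following is a simplified, synchronous sketch under stated assumptions: `run_token` is a hypothetical stand-in for the per-component `_pipeline_worker` loop, component names are shortened, the `visited` list exists only for inspection, and a plain `done` flag replaces `done_event`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stage:
    component: str  # shortened node id, e.g. "pe_dma"


@dataclass
class PipelineContext:
    total_tiles: int
    completed_tiles: int = 0
    done: bool = False  # stands in for done_event in this sketch

    def complete_tile(self) -> None:
        if self.done:
            # Mirrors the rule that duplicate completion calls are bugs.
            raise RuntimeError("complete_tile() called after pipeline finished")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done = True


@dataclass
class TileToken:
    tile_id: int
    ctx: PipelineContext
    stages: tuple
    stage_idx: int = 0
    visited: list = field(default_factory=list)


def run_token(token: TileToken) -> None:
    """Walk one token through its plan; no scheduler hop between stages."""
    while True:
        token.visited.append(token.stages[token.stage_idx].component)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.stages):
            token.stage_idx = next_idx  # forward to the next stage
        else:
            token.ctx.complete_tile()  # last stage reports completion
            return


plan = (Stage("pe_dma"), Stage("pe_fetch_store"), Stage("pe_gemm"),
        Stage("pe_fetch_store"), Stage("pe_dma"))
ctx = PipelineContext(total_tiles=2)
for tile_id in range(2):
    run_token(TileToken(tile_id, ctx, plan))
print(ctx.done)
```

Because each token carries its own plan, adding an epilogue stage only changes the `plan` tuple; the routing loop itself stays the same, which is the flexibility the Positive list refers to.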
+ +## Links + +- ADR-0011 D-VA (PE_MMU component, VA translation) +- ADR-0015 D4 (component port/wire model) +- ADR-0020 (greenlet kernel execution / two-pass) +- ADR-0023 (PE_IPCQ + PE_DMA virtual channels) +- SPEC R3, R4 diff --git a/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md b/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md new file mode 100644 index 0000000..5f999af --- /dev/null +++ b/docs/adr-ko/ADR-0015-dev-component-port-wire-model.md @@ -0,0 +1,202 @@ +# ADR-0015: Component Port/Wire Model and Fabric Routing + +## Status + +Accepted + +## Context + +Realistic hardware modeling — queues, contention, fan-out — requires +that components own fabric traversal while the simulation engine +handles only initialization and completion observation. Direct method +calls between components, or path-walking inside the engine, defeat +queueing and contention semantics. + +This ADR defines: + +- how components communicate via typed port queues, +- how propagation delay is modeled (wire processes with BW occupancy), +- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch + (via M_CPU), +- the engine's reduced role (wire init + completion observation only), +- M_CPU.DMA as an internal subcomponent of M_CPU. + +--- + +## Decision + +### D1. Component port model + +Each component has typed input/output ports modeled as SimPy Stores: + +```text +in_ports: dict[str, simpy.Store] # keyed by source node_id +out_ports: dict[str, simpy.Store] # keyed by destination node_id +``` + +Ports are created at engine initialization based on graph edges. +Each directed edge (src → dst) results in: + +- `src.out_ports[dst]` — the sending end +- `dst.in_ports[src]` — the receiving end + +--- + +### D2. 
Wire process (propagation delay + BW occupancy) + +For each directed edge (src, dst) in the topology graph, a SimPy wire process +models propagation delay and BW occupancy: + +```python +def wire_process(env, out_port, in_port, delay_ns, bw_gbs): + available_at = 0.0 + while True: + cmd = yield out_port.get() + if bw_gbs > 0: + nbytes = getattr(cmd, "nbytes", 0) + if nbytes > 0: + wait = available_at - env.now + if wait > 0: + yield env.timeout(wait) + available_at = env.now + (nbytes / bw_gbs) + yield env.timeout(delay_ns) + yield in_port.put(cmd) +``` + +Wire processes are started at engine initialization. +Each directed edge maintains an `available_at` timestamp tracking when the link +becomes free for the next transaction. When a transaction occupies a link, the +next transaction on the same directed link must wait until occupancy clears +(back-to-back serialization). TX and RX directions are independent (separate +wire processes with separate `available_at` state). + +--- + +### D3. Engine role (reduced) + +The simulation engine MUST: + +- wire components at initialization (create port Stores, start wire processes), +- identify the entry component for each request type (PCIE_EP), +- put the request into the entry component's in_port, +- wait for a completion event. + +The simulation engine MUST NOT: + +- walk the topology path during request execution, +- call component `run()` methods directly, +- track per-hop latency or decompose fan-out. + +--- + +### D4. Fabric paths for Memory R/W and Kernel Launch + +Memory R/W and Kernel Launch use **different** fabric paths. +Memory operations bypass M_CPU and route directly to HBM via the crossbar. +Kernel Launch routes through M_CPU for PE fan-out. 
+ +**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):** + +```text +pcie_ep → io_noc → io_ucie + → [transit cubes: ucie_in → noc → ucie_out] (zero or more) + → target cube: ucie_in → router mesh → hbm_ctrl +``` + +**Memory R/W completion path:** + +```text +hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie] + → io_ucie → io_noc → pcie_ep +``` + +**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):** + +```text +pcie_ep → io_noc → io_cpu → io_noc → io_ucie + → [transit cubes: ucie_in → noc → ucie_out] (zero or more) + → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out) +``` + +**Kernel Launch completion path:** + +```text +PE[0..n] all complete → M_CPU (aggregation) + → noc → [transit cubes: ucie → noc → ucie] + → io_ucie → io_noc → io_cpu → io_noc → pcie_ep +``` + +**Rationale for M_CPU bypass on Memory R/W:** + +Memory write/read operations do not require command interpretation or PE +dispatch — they are direct data transfers to/from HBM. Routing through M_CPU +would add unnecessary overhead (5ns) without functional benefit. The io_noc +inside the IO chiplet handles the routing decision: memory operations go +directly to cube fabric, while kernel launches are forwarded to io_cpu first. + +--- + +### D5. M_CPU.DMA is an internal subcomponent of M_CPU + +M_CPU.DMA is NOT a separate topology node. +It is an internal subcomponent owned by the M_CPU component implementation. + +M_CPU.DMA: + +- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4), +- issues memory requests over the NOC to hbm_ctrl, +- receives completion from hbm_ctrl via the NOC, +- reports completion to M_CPU, +- is created and managed inside M_CPU's `__init__` and `run()`. + +M_CPU.DMA does not appear as a node in the compiled topology graph. + +--- + +### D6. Transit cube forwarding + +A cube that is not the target of a memory or kernel request acts as a transit node. 
+Transit cubes forward requests without consuming them: + +```text +ucie_in (from upstream) → noc → ucie_out (to downstream) +``` + +Transit forwarding is implemented entirely within the ucie_in component. +The noc and ucie_out components in a transit cube forward the packet without modification. + +--- + +### D7. _formula_latency is preserved as a lower-bound cross-check + +The path-based formula latency function (`_formula_latency`) is preserved in the engine +as a lower bound for correctness verification. + +Invariant: + +- Phase 0: `_formula_latency == component model total_ns` +- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing) + +This function is independent of the port/wire model and requires only the topology graph. +It is used for shard comparison in `_route_kernel` and as a regression guard. + +--- + +## Consequences + +- Components model realistic hardware behavior (queues, contention, fan-out). +- Propagation delay is modeled accurately per edge. +- Engine is decoupled from routing policy. +- Component implementations remain swappable via DI (ADR-0007 D3). 
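The back-to-back serialization described in D2 can be checked by hand with a SimPy-free replay of the same loop. This is a sketch under stated assumptions: `wire_delivery_times` is a hypothetical helper, 1 GB/s is taken as 1 byte/ns (so `nbytes / bw_gbs` is in ns), and the receiving `in_port` is assumed never to block on `put()`.

```python
def wire_delivery_times(msgs, delay_ns, bw_gbs):
    """Replay D2's wire loop for (ready_ns, nbytes) messages; return delivery times.

    Assumes 1 GB/s == 1 byte/ns and an in_port that never blocks on put().
    """
    now = 0.0
    available_at = 0.0
    delivered = []
    for ready_ns, nbytes in msgs:
        now = max(now, ready_ns)            # out_port.get() returns once a message exists
        if bw_gbs > 0 and nbytes > 0:
            if available_at > now:          # link still occupied by the previous transfer
                now = available_at
            available_at = now + nbytes / bw_gbs
        now += delay_ns                     # propagation delay
        delivered.append(now)
    return delivered


# Two 256-byte messages ready at t=0 on a 128 GB/s link with 1 ns propagation delay.
print(wire_delivery_times([(0.0, 256), (0.0, 256)], delay_ns=1.0, bw_gbs=128.0))  # [1.0, 3.0]
```

The second message is delivered at 3 ns rather than 1 ns because the first transfer occupies the link for 256 / 128 = 2 ns.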
+ +--- + +## Links + +- ADR-0007 D2 (engine role boundary) +- ADR-0009 D3 (kernel execution fan-out hierarchy) +- ADR-0014 D4 (DMA engine capacity=1) +- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal) +- ADR-0016 (IOChiplet NOC and memory data path) +- ADR-0017 (cube NOC 2D mesh architecture) +- ADR-0033 (Latency model assumptions built on these mechanisms) diff --git a/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md b/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md new file mode 100644 index 0000000..cb1e281 --- /dev/null +++ b/docs/adr-ko/ADR-0016-dev-iochiplet-noc-and-memory-path.md @@ -0,0 +1,98 @@ +# ADR-0016: IOChiplet NOC and Memory Data Path + +## Status + +Accepted + +## Context + +ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and +IO_CPU interfaces, but does not specify internal routing within the IO chiplet. +ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the +IO chiplet's internal NOC architecture that enables this routing was not +formally documented. + +The IO chiplet needs an internal routing fabric (io_noc) to: + +- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports +- route memory operations (MemoryWrite/Read) directly to cube fabric without + passing through io_cpu +- route kernel launch commands through io_cpu for command interpretation + +## Decision + +### D1. IOChiplet internal NOC (io_noc) + +Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects: + +- `pcie_ep` — host-facing PCIe endpoint +- `io_cpu` — command processor for kernel launch interpretation +- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports + +The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with +zero overhead. All routing decisions are made by the simulation engine based +on message type, not by io_noc itself. + +### D2. 
IOChiplet UCIe decomposition + +Each IO chiplet PHY port is decomposed into: + +- `io_ucie-{PHY}`: the UCIe protocol endpoint (overhead = 8ns) +- `io_ucie-{PHY}.conn{N}`: N connection nodes between io_noc and io_ucie + +This mirrors the cube-side UCIe decomposition (ADR-0017 D6) and allows +multiple independent NOC-to-UCIe connections per PHY. + +### D3. Memory R/W path (M_CPU bypass) + +Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep +through io_noc to the target cube, bypassing io_cpu entirely: + +```text +pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl +``` + +This avoids the 10ns io_cpu overhead for pure data transfers. The simulation +engine's `_process_memory_direct()` method uses `find_memory_path()`, which +resolves the shortest path from pcie_ep to the target HBM node. + +### D4. Kernel Launch path (via io_cpu) + +Kernel launch commands require io_cpu for command interpretation and PE +fan-out setup: + +```text +pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe] + → noc → m_cpu → PE +``` + +The engine's `_entry_points()` method routes KernelLaunchMsg through both +pcie_ep (entry) and io_cpu (command processing). + +### D5. IOChiplet-to-cube port mapping + +Each IO chiplet instance declares which cube ports it connects to: + +```yaml +cube_ports: + - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 } + - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 } +``` + +The topology builder creates edges from io_ucie PHY nodes to the +corresponding cube UCIe port nodes, with the specified distance and +the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
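As an illustration of the D5 port mapping, the sketch below turns a `cube_ports` declaration into edge records. It is not the topology builder itself: `build_io_to_cube_edges`, the `io0` prefix, the `cube{x}_{y}` node-ID format, and the 128.0 GB/s bandwidth value are all assumptions made for the example.

```python
def build_io_to_cube_edges(io_id, cube_ports, per_connection_bw_gbs):
    """Turn a cube_ports declaration into edge records (hypothetical node IDs)."""
    edges = []
    for port in cube_ports:
        x, y = port["cube"]["xy"]
        edges.append({
            "src": f"{io_id}.io_ucie-{port['phy']}",
            "dst": f"cube{x}_{y}.ucie-{port['cube_side']}",   # illustrative ID format
            "distance_mm": port["distance_mm"],
            "bw_gbs": per_connection_bw_gbs,
        })
    return edges


cube_ports = [
    {"cube": {"xy": [0, 0]}, "cube_side": "N", "phy": "P0", "distance_mm": 2.0},
    {"cube": {"xy": [1, 0]}, "cube_side": "N", "phy": "P1", "distance_mm": 2.0},
]
# 128.0 is a placeholder; the real value comes from the IO chiplet's configuration.
for edge in build_io_to_cube_edges("io0", cube_ports, per_connection_bw_gbs=128.0):
    print(edge)
```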
+ +## Consequences + +- IO chiplet has a well-defined internal routing fabric +- Memory operations avoid unnecessary io_cpu overhead +- Kernel launch commands still get proper command interpretation +- The io_noc pattern is consistent with cube-level NOC design +- ADR-0003 D2 is extended (not contradicted) by this ADR + +## Links + +- ADR-0003 D2 (IO chiplet definition) +- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) +- ADR-0012 D1 (host-to-IO_CPU message schema) diff --git a/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md b/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md new file mode 100644 index 0000000..c442dde --- /dev/null +++ b/docs/adr-ko/ADR-0017-dev-cube-noc-and-hbm-connectivity.md @@ -0,0 +1,291 @@ +# ADR-0017: Cube NOC and HBM Connectivity + +## Status + +Accepted + +## Context + +The CUBE-level NOC is a 2D router mesh that carries every intra-cube +request: PE-to-HBM data, PE-to-PE traffic, command paths +(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic. + +The CUBE's HBM is exposed through per-PE controller endpoints attached +to PE routers. This per-PE partitioning makes local-vs-remote HBM +distinguishable by mesh distance: a PE's own HBM partition sits at its +own router (switching overhead only); another PE's HBM partition is +reachable by mesh hops to that PE's router. + +Two channel-mapping modes are supported in the design space: + +- **n:1 (default, implemented)** — each PE's HBM partition aggregates + `channels_per_pe` pseudo-channels into one endpoint. Effective + per-PE BW = N × per-channel BW. +- **1:1 (future)** — each PE router decomposes into per-channel + mini-routers; per-channel BW contention is modeled directly. + +In both modes the per-PE effective BW is identical; only the connectivity +granularity differs. + +## Decision + +### D1. 2D router mesh + +Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`. 
+ +- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`). +- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`. +- Default 6×6 grid (sized from PE corner placement + UCIe attachment + count); larger PE counts scale the grid up. +- HBM exclusion zone: center rows/columns are excluded where HBM die + physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6). +- Latency = Manhattan distance × `ns_per_mm`. + +### D2. XY routing algorithm + +Deterministic XY routing: + +1. Horizontal segment: route from source X to destination X at source Y. +2. Vertical segment: route from destination X at source Y to destination Y. + +Each directed segment carries a unique key: + +- Horizontal: `("H", y_band, x_min, x_max, direction)` +- Vertical: `("V", x_band, y_min, y_max, direction)` + +Grid positions are snapped to the router grid, excluding the HBM zone. + +### D3. Per-segment contention model + +Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions +sharing a segment (same row or column band, same direction) contend for +the resource — modelling link-level serialization in a wormhole-routed +mesh. + +With no contention, NOC traversal latency equals Manhattan distance × +`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing +delay. + +### D4. NOC attachment points (per-PE HBM partition) + +Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`, +and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint — +`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's +HBM (one pseudo-channel group; see D8). + +Other attachments: + +- M_CPU and shared SRAM each occupy a dedicated edge router. +- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed + along that edge (see D6). + +```text + UCIe-N (conn x4) + | + +---------+---+---+---------+ + | | | | +PE0.dma ---+ r0c0 | ... 
| r0c5 +--- PE2.dma +PE0.cpu <--+ +hbm.pe0| | +hbm.pe2+--< PE2.cpu + | | | | +UCIe-W ----+ ... | [HBM] | ... +---- UCIe-E +(conn x4) | | zone | | (conn x4) + | r2c0 | | | +M_CPU <--->+ | | | + | r3c0 | | | +SRAM <---->+ | | | + | | | | +PE4.dma ---+ r4c0 | ... | r4c5 +--- PE6.dma +PE4.cpu <--+ +hbm.pe4| | +hbm.pe6+--< PE6.cpu + | | | | + +---------+---+---+---------+ + | + UCIe-S (conn x4) +``` + +Per-PE HBM partitioning is the key invariant that makes local vs +cross-PE HBM distinguishable by mesh distance (see D7). + +### D5. NOC edge bandwidths and distances + +| Connection | BW (GB/s) | Distance | Notes | +| ----------------------------- | ---------- | ------------- | ------------------------------------------- | +| PE_DMA → NOC | 256.0 | Physical (PE) | Matches local-HBM aggregate BW | +| NOC → PE_CPU | — | 0.0 mm | Command path only | +| Router ↔ hbm_ctrl.pe{idx} | 256.0 | 0.0 mm | Per PE router; N × per-channel BW (see D8) | +| NOC ↔ M_CPU | — | 0.0 mm | Command path | +| NOC ↔ SRAM | 128.0 × 4 | 0.0 mm | 512 GB/s aggregate | +| NOC ↔ UCIe conn | 128.0 | 0.0 mm | Per connection; 4 conn per port | + +`0.0 mm` distances reflect the distributed nature of the NOC; actual +traversal distance is computed via Manhattan distance within the router +grid. + +### D6. UCIe decomposition and inter-cube traffic + +Each of the 4 UCIe ports (N, S, E, W) decomposes into: + +- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`). +- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe. + +This decomposition gives 4 independent NOC↔UCIe connections per port, +each with 128 GB/s bandwidth (512 GB/s aggregate per port). + +Inter-cube traffic path: + +```text +Source: PE_DMA → NOC → conn{i} → ucie-{PORT} + [UCIe link: 512 GB/s, 1.0mm seam distance] +Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx} +``` + +UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full +crossing incurs 16 ns (TX port + RX port). 
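The D6 numbers can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and extending the 16 ns figure to several seams assumes that every seam crossing passes exactly one TX and one RX `ucie-{PORT}` node.

```python
CONNS_PER_PORT = 4          # conn{0-3} per UCIe port (D6)
CONN_BW_GBS = 128.0         # per-connection bandwidth (D5)
UCIE_OVERHEAD_NS = 8.0      # applied at each ucie-{PORT} node (D6)


def port_aggregate_bw_gbs(conns=CONNS_PER_PORT, conn_bw=CONN_BW_GBS):
    """Aggregate NOC<->UCIe bandwidth of one port."""
    return conns * conn_bw


def crossing_overhead_ns(n_seams, overhead_ns=UCIE_OVERHEAD_NS):
    """UCIe protocol overhead for n seam crossings (one TX and one RX node each)."""
    return n_seams * 2 * overhead_ns


print(port_aggregate_bw_gbs())      # 512.0
print(crossing_overhead_ns(1))      # 16.0
print(crossing_overhead_ns(2))      # 32.0
```

Link traversal time and mesh hops are not included here; only the fixed per-node UCIe overhead is counted.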
+ +### D7. Data paths through the NOC + +All intra-cube traffic uses the same router mesh — no separate fast +paths. + +**Local HBM** (same PE's own partition; 0 mesh hops): + +```text +PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx} (switching overhead only) +``` + +**Cross-PE HBM within cube** (target PE's partition, reached by mesh): + +```text +PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'} +``` + +Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`): + +```text +PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2 +``` + +Dijkstra computes the shortest path within the mesh. + +**Cross-cube HBM** (UCIe traversal): + +```text +PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn + → r{x'}c{y'} → hbm_ctrl.pe{idx'} +``` + +**Kernel launch command to PE**: + +```text +[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU +``` + +**Shared SRAM access**: + +```text +PE_DMA → r{x}c{y} → (mesh) → SRAM +``` + +### D8. HBM channel mapping mode + +Channel mapping is configured at cube scope: + +```yaml +cube: + memory_map: + hbm_mapping_mode: n_to_one # one_to_one | n_to_one + hbm_pseudo_channels: 64 # total pseudo-channel count + hbm_channels_per_pe: 8 # per-PE local channel count + hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s) + hbm_slices_per_cube: 8 # number of per-PE partitions + hbm_total_gb_per_cube: 48 +``` + +**n:1 mode (default, implemented).** Each PE's HBM partition is a single +endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo- +channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals +`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to +interleave; only aggregate per-PE BW is modeled. No separate aggregated +router node exists — the per-PE router itself serves that role. + +**1:1 mode (future).** Each PE router decomposes into N channel +mini-routers; per-channel routing carries fully-resolved PA + channel ID. 
+A `ChannelSplitter` resolves a logical access to N per-channel physical +requests. Per-channel link models BW contention. Cross-PE channel +access semantics are deferred to the implementation ADR. + +**BW math (defaults).** + +| Parameter | Value | +| ---------------------------------- | -------------------------- | +| pseudo channels per cube | 64 (parameter) | +| PEs per cube | 8 (parameter) | +| channels per PE (N) | 64 / 8 = 8 | +| per-channel BW | 32 GB/s (parameter) | +| per-PE local BW | N × 32 = 256 GB/s | +| cube total HBM BW | 64 × 32 = 2048 GB/s | + +Both modes give the same per-PE effective BW; only the request shape and +contention model differ. + +### D9. AddressResolver — per-PE HBM endpoint + +The address resolver decodes a PA's HBM offset to the owning PE's +partition: + +```python +# policy/routing/router.py +hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube + +if addr.kind == "hbm": + pe_id = int(addr.hbm_offset) // hbm_slice_bytes + return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}" +``` + +The pe_id computation is intrinsic to the routing layer (not a +topology-time concern). Any HBM PA falls within exactly one partition, +yielding deterministic routing. + +External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the +same resolver path — there is no separate fast path. + +### D10. Mesh generation parameters + +`mesh_gen.py` produces `cube_mesh.yaml` from: + +- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner. +- `cube.geometry`: cube physical dimensions and HBM zone. +- `cube.ucie.n_connections`: determines router count for UCIe attachment. + +Output `mesh_data` dictionary contains: + +- Router grid with positions and HBM exclusion zones. +- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm` + per PE). +- UCIe-to-router attachments (N/S/E/W distributed across edge routers). +- M_CPU and SRAM router attachments. 
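The D9 partition arithmetic can be tried out in isolation. The sketch below is a standalone approximation, not the code in `policy/routing/router.py`: `resolve_hbm_endpoint` is a hypothetical function, the defaults are the D8 values (48 GB per cube, 8 slices), and the range check is an addition for the example.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(hbm_offset, sip=0, cube=0,
                         hbm_total_gb_per_cube=48, hbm_slices_per_cube=8):
    """Map an HBM byte offset to its per-PE controller endpoint (D9 arithmetic)."""
    slice_bytes = hbm_total_gb_per_cube * GIB // hbm_slices_per_cube
    # Range check added for this sketch; the source does not show one.
    if not 0 <= hbm_offset < slice_bytes * hbm_slices_per_cube:
        raise ValueError(f"HBM offset out of range: {hbm_offset}")
    pe_id = hbm_offset // slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(0))                 # sip0.cube0.hbm_ctrl.pe0
print(resolve_hbm_endpoint(6 * GIB))           # sip0.cube0.hbm_ctrl.pe1
print(resolve_hbm_endpoint(48 * GIB - 1))      # sip0.cube0.hbm_ctrl.pe7
```

With the defaults each slice is 6 GiB, so every valid offset lands in exactly one of the eight partitions.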
+ +## Consequences + +- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM + (mesh hops) are naturally distinguishable, satisfying SPEC R5 + (multi-domain communication) and ADR-0002 (no zero-latency end-to-end + paths). +- All cube-internal traffic routes through one mesh: single contention + model, single layout, single set of edge BWs. +- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each + PE's partition is the n:1 aggregate of its assigned pseudo-channels. +- 1:1 mode extension is structurally natural: split each PE router into + N channel routers. +- Mesh generation is fully parameterised by `topology.yaml`; PE/cube + geometry changes propagate without code edits. + +## Links + +- ADR-0002 (Routing distance, ordering, no zero-latency paths) +- ADR-0003 D3 (cube-level NOC definition, extended here) +- ADR-0004 (Memory semantics, local HBM) +- ADR-0011 (Memory addressing; LA model consumes per-PE partition) +- ADR-0014 D1 (PE_DMA egress via router mesh) +- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) +- ADR-0016 (IOChiplet io_noc; analogous pattern at IO chiplet level) +- ADR-0033 (Latency model: per-PC parallelism, switch penalty) diff --git a/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md b/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md new file mode 100644 index 0000000..a8d277d --- /dev/null +++ b/docs/adr-ko/ADR-0020-prog-data-execution-two-pass.md @@ -0,0 +1,516 @@ +# ADR-0020: Two-Pass Data Execution Model (Timing / Data Separation) + +## Status + +Accepted + +## Context + +The current simulation models **timing only**. +Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency, +but they do not read or compute actual tensor data. + +### Required capabilities + +1. Actual data must be storable in and readable from HBM/TCM/SRAM. +2. PE_GEMM and PE_MATH must be able to perform real matrix operations whose results can be verified. +3.
Simulation performance degradation must be minimized. + +### Constraints + +- SimPy is a single-threaded event loop: running numpy matmul inside it blocks the entire simulation. +- Components must be swappable (ADR-0015): framework requirements must not leak into component implementations. +- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait): the same code must be reused. +- Kernel functions must remain plain Python functions (no conversion to generator/async). + +### Design exploration results + +| Option | Approach | Verdict | +|--------|----------|---------| +| Direct execution inside SimPy | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread | +| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: result() blocks on back-to-back requests | +| Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle | +| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact | + +--- + +## Decision + +### D1. Two-pass execution model: Phase 0 removed + +The previous three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**. + +Before: +``` +Phase 0: kernel → PeCommand list (no data, no branching) +Phase 1: SimPy replay of the PeCommand list (timing only) +``` + +After: +``` +Phase 1 (timing): kernel + SimPy integrated execution, greenlet-based + - Memory read/write: SimPy timing + real data in MemoryStore + - Compute (GEMM/Math): SimPy timing + op_log record (actual compute in Phase 2) + - Dynamic control flow is possible (tl.load returns real data) + +Phase 2 (data): actual compute driven by op_log, outside SimPy, parallelizable +``` + +This ADR **extends Phase 1 to be data-aware for memory operations only**. +Phase 1 covers latency/BW bottleneck analysis plus memory data tracking; +Phase 2 verifies the correctness of GEMM/Math computation. +Phase 2 is optional: when only timing is needed, run Phase 1 alone. + +### D2. Op log recording: ComponentBase hook + +op_log recording is performed by **hooks in the component base class**. +Individual component implementations are not modified. + +```python +class ComponentBase: + def _on_process_start(self, env, msg): + if self._op_logger and getattr(msg, 'data_op', False): + self._op_logger.record_start(env.now, self.node.id, msg) + + def _on_process_end(self, env, msg): + if self._op_logger and getattr(msg, 'data_op', False): + self._op_logger.record_end(env.now, self.node.id, msg) +``` + +`_forward_txn()` calls the hooks before and after `run()`. +`_op_logger` is optional; when it is absent, there is no logging overhead.
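To make the hook contract concrete, here is a minimal sketch of an object that could sit behind `_op_logger`. `SketchOpLogger` and its tuple-shaped records are assumptions for illustration; the ADR's own record type is `OpRecord` (D5), and the stand-in `GemmCmd` only mimics the `data_op` self-declaration from D4.

```python
from dataclasses import dataclass, field


@dataclass
class SketchOpLogger:
    """Pairs record_start/record_end calls into (t_start, t_end, component, msg type)."""
    records: list = field(default_factory=list)
    _open: dict = field(default_factory=dict)

    def record_start(self, now, component_id, msg):
        # Remember when this component started serving this message.
        self._open[(component_id, id(msg))] = now

    def record_end(self, now, component_id, msg):
        # Close the matching start and emit one completed record.
        t_start = self._open.pop((component_id, id(msg)))
        self.records.append((t_start, now, component_id, type(msg).__name__))


class GemmCmd:
    data_op = True      # stand-in for a self-declared loggable message


logger = SketchOpLogger()
msg = GemmCmd()
logger.record_start(100.0, "sip0.cube0.pe0.pe_gemm", msg)
logger.record_end(140.0, "sip0.cube0.pe0.pe_gemm", msg)
print(logger.records)
```

Keying open entries by component and message identity lets the same message be logged at several components along its path without the entries colliding.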
+ +**Hook timing definition**: + +| Timestamp | Meaning | +|-----------|---------| +| `t_start` | The time at which the component **starts service** for the msg (just before entering `run()`) | +| `t_end` | The time at which the component's **internal service completes** (just after `run()` returns) | + +Link traversal latency is not included in t_start/t_end. +Link latency is observed as the difference between the sender's t_end and the receiver's t_start. + +### D3. Greenlet-based kernel execution: Phase 0 removed + +The previous Phase 0 (kernel → PeCommand list) is removed, and +**greenlet** is used to run the kernel and SimPy cooperatively, interleaved. + +#### How it works + +greenlet is a C extension that provides cooperative context switching. +When the kernel (child greenlet) calls `tl.load()` or similar, control switches to the SimPy loop (parent greenlet), +which runs the timing simulation and then returns to the kernel with the actual data. + +``` +SimPy loop (parent greenlet) Kernel (child greenlet) +───────────────────────── ────────────────────── +g.switch() ─────────────────────────→ kernel starts + a = tl.load(ptr, ...) + internally: parent.switch(DmaReadCmd) +cmd = DmaReadCmd ←────────────────── (kernel paused) + yield DmaReadMsg(...) + yield env.timeout(dma_latency) + data = memory_store.read(...) +g.switch(data) ─────────────────────→ (kernel resumed) + a = data ← real numpy array + if a[0][0] > 0.5: ← branching possible + ... +``` + +The kernel remains a **plain Python function**. +The greenlet switch exists **only inside the implementation** of `tl.load()`, `tl.store()`, and similar calls. + +#### KernelRunner: framework layer + +The greenlet loop lives in **KernelRunner**, a framework layer, +not in the PE_CPU component.
+ +```python +# KernelRunner (framework: bridges greenlet and SimPy) +class KernelRunner: + def run(self, env, kernel_fn, args, store): + g = greenlet(self._run_kernel) + cmd = g.switch(kernel_fn, args) + + while cmd is not None: + if isinstance(cmd, DmaReadCmd): + yield from self._dispatch_dma(env, cmd) + data = store.read(cmd.src_addr, cmd.shape, cmd.dtype) + cmd = g.switch(data) # resume with the real data + elif isinstance(cmd, GemmCmd): + yield from self._dispatch_gemm(env, cmd) + cmd = g.switch() # resume (no data) + elif isinstance(cmd, DmaWriteCmd): + store.write(cmd.dst_addr, cmd.data) # visibility = issue time + yield from self._dispatch_dma(env, cmd) # timing only + cmd = g.switch() + +# PE_CPU (component: kept simple, unaware of greenlet) +def _execute_kernel(self, env): + runner = KernelRunner(self.ctx) + yield from runner.run(env, kernel_fn, args, store) +``` + +**Op logging single source of truth**: KernelRunner does not write to op_log directly. +All op logging is handled **only by the ComponentBase hooks (_on_process_start/end)**. +When KernelRunner delivers a message to a component via `_dispatch_gemm()` or similar, +the hook in the component base class records it automatically. + +**Layer separation**: +- **Kernel code**: plain function, unaware that greenlet exists +- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()` +- **KernelRunner**: bridges greenlet and SimPy, handles MemoryStore reads/writes. **Does no logging**. +- **ComponentBase hook**: the only path that records op_log +- **PE_CPU**: only invokes KernelRunner; swappable as a component + +#### Memory read/write vs compute handling + +| Operation | In Phase 1 | In Phase 2 | +|-----------|-----------|-----------| +| `tl.load()` | SimPy timing + MemoryStore read → **returns real data** | n/a | +| `tl.store()` | SimPy timing + MemoryStore write → **real write** | n/a | +| `tl.composite(gemm)` | SimPy timing + **op_log record only** | real numpy compute | +| `tl.dot()` / math ops | SimPy timing + **op_log record only** | real numpy compute | + +Memory reads/writes are handled immediately in Phase 1 (numpy slice, fast). +GEMM/Math compute runs as a batch in Phase 2 (performance isolation). + +#### Store Visibility Rule + +`tl.store()` is **applied to MemoryStore immediately at issue time** (visibility = issue). +The SimPy DMA timing is simulated separately afterward.
+ +This is a deliberate separation of timing and visibility: +- **visibility**: when the write is reflected in MemoryStore = at the `store.write()` call +- **timing**: when the DMA latency completes in SimPy + +This separation lets a load issued right after a store see the latest data under dynamic control flow. + +#### Result Handle Semantics + +`tl.composite()` (sync/async) returns a **handle** that refers to the result tensor. + +Core contract in Phase 1: + +1. **Every compute handle is always treated as pending in Phase 1.** +2. `tl.wait(handle)` expresses **timing synchronization only**; + it does not make the handle ready. +3. Access to the handle's actual result data (`handle.data`, element access, + numpy conversion, and so on) is **possible only in Phase 2**. +4. Therefore **control flow based on compute results is not supported** in Phase 1. +5. In contrast, `tl.load()` returns real data in Phase 1, so + **control flow based on memory reads is supported**. + +| Handle state | Phase | Allowed behavior | +|--------------|-------|------------------| +| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only | +| pending | Phase 1 | Pass the handle to `tl.store()` (links the logical destination only; the payload is resolved in Phase 2) | +| pending | Phase 1 | **No data access**: value-based branching is not possible | +| ready | Phase 2 | Access to the real numpy data, verification | + +This restriction is intentional. Running compute in Phase 1 would block +SimPy's single thread and remove the reason for the two-pass separation. + +#### Phase 1 Materialization: Future Extension + +If Phase 1 eager execution becomes necessary for small operations (scalars, small reductions), +selective materialization can be supported by adding a `materialized_in_phase1: bool` flag +to the op record. It is not implemented in the current scope. + +### D4. data_op flag: message self-declaration + +Whether a message is logged is decided by the `data_op` attribute of the message instance, not by its type. +The framework does not hard-code message types. + +```python +class MsgBase: + data_op: bool = False # default: not logged + +class DmaReadCmd(MsgBase): + data_op = True # memory movement → logged + +class GemmCmd(MsgBase): + data_op = True # compute → logged + +class MathCmd(MsgBase): + data_op = True # compute → logged +``` + +When a new message type (e.g., IpcqMsg) is added, setting `data_op = True` is enough +for it to be logged automatically, with no framework code changes. + +### D5.
Op log structure + +#### Op classification scheme + +Ops are classified at two levels: + +| Level | Field | Role | +|-------|-------|------| +| `op_kind` | `memory` \| `gemm` \| `math` | Basis for executor dispatch | +| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | Identifies the concrete operation | + +#### OpRecord definition + +```python +@dataclass +class OpRecord: + t_start: float # SimPy time (ns): service start + t_end: float # SimPy time (ns): service completion + component_id: str # e.g. "sip0.cube0.pe0.pe_gemm" + op_kind: str # "memory" | "gemm" | "math" + op_name: str # concrete operation name + params: dict # per-operation parameters (see below) + dependency_ids: list[int] # currently in-memory record indices; may be replaced by a stable op_id later +``` + +#### dependency_ids generation rules + +`dependency_ids` is **optional**; by default the executor performs +address-based dependency inference (see D6). + +It is set explicitly only when an exact execution order is required: +- **Default (address-based inference)**: the executor analyzes read/write sets and + infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases. +- **Explicit setting**: set at the TLContext or command-creation stage when a logical dependency + cannot be expressed through addresses. + Example: completion-handle-based synchronization. A handle dependency relies on + logical completion order rather than memory addresses, so address inference does not catch it. + +#### op_log ordering + +op_log maintains a **stable ordering** by `t_start`. +Records with the same `t_start` preserve insertion order. + +#### params details + +**memory (dma_read / dma_write)**: +```python +{ + "src_addr": int, # source address (byte) + "dst_addr": int, # destination address (byte) + "nbytes": int, # transfer size + "src_space": str, # "hbm" | "tcm" | "sram" + "dst_space": str, # "hbm" | "tcm" | "sram" +} +``` + +**gemm**: +```python +{ + "src_a_addr": int, # operand A address + "src_b_addr": int, # operand B address + "dst_addr": int, # output address + "shape_a": tuple, # e.g. (128, 256) + "shape_b": tuple, # e.g. (256, 128) + "shape_out": tuple, # e.g. (128, 128) + "dtype_in": str, # e.g. "f16" + "dtype_acc": str, # accumulation dtype, e.g. "f32" + "dtype_out": str, # output dtype, e.g.
"f16" + "transpose_a": bool, + "transpose_b": bool, + "layout_a": str, # "row_major" | "col_major" + "layout_b": str, + "layout_out": str, + "addr_space": str, # "tcm" (GEMM operands are always in TCM) +} +``` + +**math**: +```python +{ + "op": str, # "exp" | "add" | "sum" | "where" | ... + "input_addrs": list[int], # list of operand addresses + "input_shapes": list[tuple], + "dst_addr": int, + "shape_out": tuple, + "dtype": str, + "axis": int | None, # reduction axis + "addr_space": str, # "tcm" +} +``` + +### D6. Phase 2 Executor + +Phase 2 executes op_log outside SimPy. + +```python +from itertools import groupby + +class DataExecutor: + def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore): + self.op_log = op_log # kept so run() can iterate over it + self.store = initial_store # takes the Phase 1 MemoryStore snapshot as input + + def run(self): + for t, ops in groupby(self.op_log, key=lambda o: o.t_start): + batch = list(ops) + independent, sequential = self._classify(batch) + self._execute_parallel(independent) + self._execute_sequential(sequential) +``` + +**Parallel execution decision**: + +Ops with the same `t_start` are treated as **parallel candidates**. +Whether they actually run in parallel is decided by the executor using these criteria: +- whether read/write address ranges overlap (WAW, RAW, WAR conflict checks) +- whether the predecessor ops listed in `dependency_ids` have completed + +Only ops whose address ranges do not overlap and that have no explicit dependency run in parallel. + +**Batch optimization**: only independent ops with the same op_name and **identical shape, dtype, layout, +and transpose flags** are candidates for batching. +Example: same-shape GEMMs from several PEs → grouped into a single `np.matmul(a_batch, b_batch)` call. +This improves BLAS efficiency on CPU and reduces launch overhead on GPU. + +**Phase 2 execution order guarantee**: + +Phase 2 does not consider data arrival time; +execution order is guaranteed only through dependencies +(address-based inference + explicit dependency_ids). + +### D7. Memory Store + +`MemoryStore` logically follows byte-addressable semantics, while +the current implementation uses **tensor-granular storage** (an addr → numpy ndarray mapping). + +```python +class MemoryStore: + def write(self, space: str, addr: int, data: np.ndarray) -> None: ... + def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ... +``` + +**Internal storage format: numpy ndarray** + +MemoryStore stores tensors as **numpy ndarray**.
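+
+| Candidate | store/load speed | Phase 2 compute | Verdict |
+|-----------|------------------|-----------------|---------|
+| **numpy ndarray** | immediate (passed by reference, no copy) | `np.matmul` usable directly | **adopted** |
+| bytearray | needs memcpy | needs `np.frombuffer` conversion | rejected |
+| torch tensor | immediate | torch ops available | used only for GPU optimization |
+
+- write: the numpy array is **stored by reference** (no copy) → Phase 1 overhead = one dict lookup
+- read: the numpy array is **returned by reference** (no copy)
+- a second write to the same addr **overwrites the existing array as a whole tensor** (partial overwrite is not supported)
+- dtypes are NumPy dtypes (`np.float16`, `np.float32`, etc.); NumPy itself ships no `bfloat16`, so bf16 tensors need an extension dtype (for example the one provided by the `ml_dtypes` package)
+- when byte-level access is needed, convert with `.view(np.uint8)`
+- for GPU batch optimization in Phase 2, the executor is responsible for the numpy → torch tensor conversion
+
+**read/write contract**:
+
+- read/write operate on **contiguous tensors**.
+  When a non-contiguous strided view is needed, express it as a separate copy op.
+- On the regular benchmark path, producer and consumer dtypes are expected to match.
+  Reinterpret cast is a permissive behavior intended for low-level memory validation
+  or special test cases.
+- addr is a byte address; the minimum alignment is the dtype size.
+- A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
+  On a shape mismatch, the access is validated by nbytes, and a mismatch in nbytes is an error.
+- Consistency follows address-range-based read/write semantics.
+- An implementation may keep a tensor object cache as an optimization,
+  but the canonical state is byte-addressable storage.
+- The host injects the initial tensor data at deploy time.
+
+### D8. Benchmark Kernel Code
+
+The **user-facing API of benchmarks does not change**.
+Call interfaces such as `tl.load()`, `tl.composite()`, and `tl.store()` are kept.
+
+The internal command/message schema, however, may be extended to carry the metadata
+needed for Phase 2 execution (e.g. additional fields such as dtype_acc and transpose).
+
+### D9. No Component Changes
+
+Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
+Recording the op_log is the responsibility of the ComponentBase hook.
+Swapping in a custom component replaces only the timing model;
+Phase 2 data execution is unaffected.
+
+### D10. Phase 2 Is Optional
+
+```python
+engine = GraphEngine(graph)
+engine.run(benchmark)  # Phase 1: timing only
+result = engine.get_timing_result()
+
+if verify_data:
+    # initial_store: the Phase 1 MemoryStore snapshot that the D6 constructor
+    # requires; how it is obtained from the engine is not specified here.
+    executor = DataExecutor(engine.op_log, initial_store)  # Phase 2: data
+    executor.run()
+    executor.verify(expected_output)
+```
+
+Skip Phase 2 when only timing analysis is needed.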
+op_logger를 비활성화하면 Phase 1 성능도 기존과 동일. + +### D11. Verification Contract + +기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다. + +dtype별 tolerance 정책: + +| dtype | 비교 방식 | tolerance | +|-------|----------|-----------| +| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 | +| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 | +| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 | +| int 계열 | `np.array_equal` | exact | + +- 기본 모드: 최종 output만 비교 (end-to-end correctness) +- 디버그 모드: intermediate tensor도 op 단위로 비교 가능 + (MemoryStore snapshot at each op boundary) + +--- + +## Non-goals + +- **Compute-result-based control flow**: 지원하지 않는다. + 모든 compute handle은 Phase 1에서 pending 상태이며, + `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다. + Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은 + **error로 처리**한다. + 메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다. + Phase 1 materialization은 future extension (D3 참조). +- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간 + overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다. +- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며, + 실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다. + +## Open Questions + +- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를 + MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics) +- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히 + 일반화할지, 별도 op_kind를 둘지 +- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리 + (in-memory list vs disk-backed streaming) +- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을 + 하나의 fused op record로 기록할지, 개별 op으로 분리할지 +- **Math op schema 일반화**: 현재 math params는 단순 구조이나, + broadcasting rule, input별 dtype, keepdims, scalar/immediate operand, + where/mask 표현 등 일반화가 필요할 수 있음 +- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며, + streaming/disk-backed mode 도입 시 stable op_id로 대체 필요 +- **Phase 1 materialization policy**: D3의 Future Extension 참조. 
+ 허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요 + +--- + +## Consequences + +### 긍정적 + +- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가) +- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능 +- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존) +- 벤치마크 사용자 코드 API 변경 불필요 +- 새 메시지 타입 추가 시 data_op 플래그만 설정 +- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원 +- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이 + +### 부정적 + +- op_log 메모리 사용량 (대규모 시뮬레이션 시) +- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM) +- pending handle (연산 미완료) 기반 동적 분기 불가 + (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정). + 메모리 데이터 기반 분기는 greenlet으로 지원된다. +- greenlet C 확장 의존성 추가 (pip install greenlet) diff --git a/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md b/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md new file mode 100644 index 0000000..371bb49 --- /dev/null +++ b/docs/adr-ko/ADR-0022-prog-program-id-2d-grid.md @@ -0,0 +1,90 @@ +# ADR-0022: 2D Grid program_id Semantics + +## Status + +Accepted + +## Context + +Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid. +Our hardware has a 2-level hierarchy: **cubes** contain **PEs**. +The previous implementation ignored the `axis` parameter and always returned a flat PE index, +making it impossible for kernels to distinguish their cube-local position from their cube identity. + +## Decision + +Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid: + +| Call | Returns | Description | +|------|---------|-------------| +| `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube | +| `tl.program_id(axis=1)` | `cube_id` | Cube index | +| `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube | +| `tl.num_programs(axis=1)` | `num_cubes` | Total cubes | + +Global PID is derived as: + +```python +global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0) +``` + +### Axis mapping rationale + +- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. 
This is the fast, tightly-coupled dimension, analogous to threads within a block.
+- **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension, analogous to blocks in a grid.
+
+## Implementation
+
+### TLContext (`triton_emu/tl_context.py`)
+
+Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
+
+```python
+def program_id(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._cube_id
+    return self._pe_id
+
+def num_programs(self, axis: int = 0) -> int:
+    if axis == 1:
+        return self._num_cubes
+    return self._num_programs
+```
+
+### PE_CPU (`components/builtin/pe_cpu.py`)
+
+- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
+- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
+
+### KernelRunner (`triton_emu/kernel_runner.py`)
+
+- Receives `num_cubes` from PE_CPU
+- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
+
+## Backward Compatibility
+
+- Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged: it returns the same PE index as before.
+- `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
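As a rough illustration of the axis dispatch and the default values described above, the following standalone sketch mimics the behaviour of `program_id` / `num_programs` and the global PID formula. `TLContextSketch` and its constructor argument names are hypothetical stand-ins chosen for this example; they are not the actual class in `triton_emu/tl_context.py`.

```python
class TLContextSketch:
    """Hypothetical stand-in for TLContext; only the axis dispatch is modelled."""

    def __init__(self, pe_id: int, num_programs: int,
                 cube_id: int = 0, num_cubes: int = 1):
        self._pe_id = pe_id
        self._num_programs = num_programs  # PEs per cube
        self._cube_id = cube_id            # defaults keep legacy callers working
        self._num_cubes = num_cubes

    def program_id(self, axis: int = 0) -> int:
        return self._cube_id if axis == 1 else self._pe_id

    def num_programs(self, axis: int = 0) -> int:
        return self._num_cubes if axis == 1 else self._num_programs


def global_pid(tl) -> int:
    # Same derivation as in the Decision section: cube-major, PE-minor.
    return tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)


# A caller that never passes cube information (e.g. an older unit test).
legacy = TLContextSketch(pe_id=3, num_programs=8)
# A cube-aware caller: PE 3 inside cube 2, with 8 PEs per cube and 4 cubes.
cube_aware = TLContextSketch(pe_id=3, num_programs=8, cube_id=2, num_cubes=4)
```

With the defaults, the global PID of a legacy caller collapses to its flat PE index, which is why existing kernels keep working.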
+ +## Usage Example + +```python +def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl): + local_pid = tl.program_id(axis=0) # PE within cube + cube_id = tl.program_id(axis=1) # which cube + global_pid = cube_id * tl.num_programs(axis=0) + local_pid + + # Column-wise sharding across global PID + n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0)) + col_start = global_pid * n_per_pid + + a = tl.load(a_ptr, shape=(M, K), dtype="f16") + b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16") + h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2) + tl.wait(h) +``` + +## Consequences + +- Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions. +- Future axis=2 (SIP-level) can be added following the same pattern if needed. diff --git a/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md b/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md new file mode 100644 index 0000000..5fd174d --- /dev/null +++ b/docs/adr-ko/ADR-0023-dev-ipcq-pe-collective.md @@ -0,0 +1,1648 @@ +# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication + +## Status + +Accepted + +## Context + +### 목표 + +CCL (Collective Communication Library) 커널을 PE 안에서 실행할 수 있도록 +PE 간 데이터 교환 인프라를 추가한다. 호스트는 그저 각 SIP에 커널을 launch만 하고, +실제 동기화와 데이터 이동은 **PE 커널 안에서 IPCQ(Inter-Process Communication +Queue)를 통해** 일어난다. + +이는 NCCL이 GPU 커널 안에서 NVLink 통신을 수행하는 모델, 또는 Cerebras/Tenstorrent의 +core-local 통신 큐와 유사하다. 호스트 레벨 collective(`dist.all_reduce`)는 +**미래 작업**으로 미루고, 본 ADR은 커널 collective 인프라에만 집중한다. + +### 풀어야 할 문제 + +1. PE 간 직접 데이터 이동 (peer's memory에 write) +2. 동기화 — 송신 측이 수신 측 buffer 공간을 확인해야 함 (backpressure) +3. compute traffic과 communication traffic의 자원 경쟁 (Head-of-Line blocking) +4. 호스트가 알고리즘에 따라 (ring/mesh/tree) 논리적 neighbor 토폴로지를 구성할 수 있어야 함 + +--- + +## Decision + +### D1. PE_IPCQ 컴포넌트 신규 추가 + +PE 안에 새 컴포넌트 `PE_IPCQ`를 추가한다. PE_GEMM/PE_MATH가 PE_CPU의 +sub-block을 별도 컴포넌트로 모델링하는 것과 동일한 패턴이다. 
+ +``` +PE +├── PE_CPU +├── PE_SCHEDULER +├── PE_DMA +├── PE_IPCQ ← 신규 +├── PE_FETCH_STORE +├── PE_GEMM +├── PE_MATH +├── PE_TCM +├── PE_MMU +``` + +**역할 분리** (control plane vs data plane): +- **PE_IPCQ (control plane)**: ring buffer 주소 계산, head/tail pointer 관리, + peer pointer 캐시, backpressure 결정, 4-방향 neighbor 매핑 +- **PE_DMA (data plane)**: 실제 데이터를 cube_noc/UCIe/PCIE 경유로 peer 메모리에 전송 + +PE_IPCQ는 데이터 이동을 직접 수행하지 않고 PE_DMA에 위임한다. + +### D2. Ring Buffer 모델 + +각 PE는 4-방향(N/S/E/W) × {tx, rx} = 총 8개의 ring buffer를 가진다. + +```python +@dataclass +class IpcqQueuePair: + direction: Direction # N/S/E/W + peer: IpcqEndpoint # init 시 호스트가 설정 (D2.5) + tx_buffer_base: int # 내가 보낼 데이터의 base addr (자기 메모리) + rx_buffer_base: int # 내가 받을 데이터의 base addr (자기 메모리) + slot_size: int # tile 단위 + n_slots: int # ring depth + my_head: int # 내 send 위치 (다음에 쓸 tx/peer slot) + my_tail: int # 내 recv 위치 (다음에 읽을 rx slot) + peer_head_cache: int # 캐시: peer가 마지막으로 보낸 head 위치 (D9 piggyback으로 갱신) + peer_tail_cache: int # 캐시: peer가 마지막으로 소비한 tail 위치 (D9 fast path credit으로 갱신) +``` + +**필드명 규약 (canonical)**: 본 ADR 전체에서 다음 4개 이름을 일관되게 사용한다. + +| 필드 | 소유자 | 갱신 시점 | +|------|--------|----------| +| `my_head` | 자기 PE_IPCQ | tl.send 호출 후 즉시 (송신 추적용) | +| `my_tail` | 자기 PE_IPCQ | tl.recv 호출 후 즉시 (수신 추적용) | +| `peer_head_cache` | 자기 PE_IPCQ | IpcqMetaArrival 도착 시 (D9 piggyback) | +| `peer_tail_cache` | 자기 PE_IPCQ | IpcqCreditMetadata 도착 시 (D9 fast path) | + +다른 표현(`peer_head_local`, `peer_head`, `peer_tail` 등)은 사용하지 않는다. + +**Slot 단위**: fixed-size, 한 slot이 한 tile 데이터를 통째로 담는다. +descriptor 모델이 아니라 **full data embedding** 모델 (D5에서 상세). + +### D2.5. PeAddress / IpcqEndpoint 스키마 + +`IpcqQueuePair.peer`가 가져야 할 정보를 명시한다. 송신 측 PE_IPCQ가 +peer rx slot에 직접 DMA write하려면 다음을 모두 알아야 한다. 
+ +```python +@dataclass(frozen=True) +class IpcqEndpoint: + """송신 측이 peer's rx_buffer 주소를 계산하기 위해 필요한 모든 정보.""" + sip: int # 목적지 SIP + cube: int # 목적지 cube + pe: int # 목적지 PE (cube 내 local index) + buffer_kind: str # "tcm" | "hbm" | "sram" — 어느 메모리 공간 + rx_base_pa: int # peer rx_buffer base의 PA (PhysAddr.encode()) + rx_base_va: int # peer rx_buffer base의 VA (선택, MMU 사용 시) + n_slots: int # peer ring depth (경계 wrap-around 계산용) + slot_size: int # peer slot 크기 (offset 계산용) +``` + +`IpcqQueuePair`의 `peer` 필드는 이 `IpcqEndpoint` 객체를 들고 있다. +주소 계산은: + +```python +slot_idx = self.my_head % peer.n_slots +dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size +``` + +PE_IPCQ는 이 dst_pa를 `IpcqDmaToken`의 dst_addr로 PE_DMA에 전달한다. +PE_DMA(vc_comm)는 fabric 라우팅(cube_noc/UCIe/PCIE)을 통해 dst_pa로 데이터를 전송한다. + +**Endpoint 생성 시점**: backend init (D10)에서 모든 PE의 IPCQ buffer를 +allocator로 할당받고, 각 rank의 neighbor table을 만들 때 peer rank의 +endpoint 정보를 install한다. 즉 install 순서는: + +1. **모든 rank의 IPCQ buffer 할당** (각 PE의 buffer_kind 메모리 공간에서) +2. **rank별 endpoint table 구성** (자신의 4-방향 peer가 어느 sip/cube/pe/pa를 갖는지) +3. **PE_IPCQ에 install** (`IpcqInitMsg` via fabric or sideband) + +이 순서는 모든 rank가 서로의 PA를 알아야 하므로, 단계 1을 모든 rank에 대해 +먼저 끝낸 후 단계 2-3을 진행한다. + +### D3. 4-방향 매핑 = 논리적 ProcessGroup + +PE는 4방향(N/S/E/W)을 logical port로 본다. 실제 peer 주소는 호스트 CCL init이 +알고리즘에 따라 설정한다. PE 커널은 토폴로지를 알지 못하고 방향만 사용한다. + +```python +# 호스트 init 예시 — 1D ring +for rank in range(world_size): + ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size]) + ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size]) + +# 호스트 init 예시 — 2D mesh +for r in range(R): + for c in range(C): + ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c)) + ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c)) + ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C)) + ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C)) +``` + +PE 코드 입장에서 `tl.send(dir="E", ...)`가 어디로 가는지는 알 필요가 없다. + +### D4. 
PE 커널 API + +```python +# Send (blocking, backpressure 발생 가능) +tl.send(dir: str, src_addr: int, nbytes: int) -> None + +# Recv (blocking) +data = tl.recv(dir: str) # 특정 방향에서 수신 +data = tl.recv() # 4방향 round-robin, 도착한 첫 tile 반환 + +# Recv (non-blocking) +handle = tl.recv_async(dir: str) +data = tl.wait(handle) +``` + +`tl.recv()` (방향 미지정)는 IPCQ가 last_polled_dir 인덱스를 들고 있다가 +다음 호출 시 그 다음 방향부터 검사하면서 데이터 있는 첫 슬롯을 반환한다. +4방향 모두 비어있으면 wait. + +**Fairness는 weak fairness**: polling 시작 방향을 회전시켜 단순 편향을 +완화하지만, 한 방향에 데이터가 항상 먼저 도착하면 다른 방향이 starvation될 +수 있다. strict fairness가 필요한 알고리즘은 `tl.recv(dir=...)`로 방향을 +명시해야 한다. (Open Questions 참조) + +### D5. Single-hop DMA Write + Full-data Slot 모델 + +데이터는 송신 측 메모리에서 수신 측 ring slot으로 **단일 DMA 전송**으로 +이동한다. 핵심 속성: + +- **Single-hop**: 송신 측 IPCQ가 peer rx slot 주소를 직접 알고 있어 한 번의 + fabric DMA로 데이터가 도착한다. +- **No CPU memcpy**: CPU가 데이터를 복사하지 않는다. +- **No intermediate staging**: 송신/수신 어느 쪽에도 별도 staging buffer가 + 없다 (송신은 자기 source 주소에서 직접, 수신은 자기 ring slot으로 직접). + +(엄밀히 말하면 fabric DMA write 자체는 발생하므로 "data movement가 전혀 없다"는 +의미는 아니다. NCCL의 "zero-copy"가 가리키는 것 — CPU memcpy / staging copy +부재 — 과 동일한 속성이다.) + +데이터 이동 모델: + +``` +PE A: tl.send(E, src_addr, nbytes) + 1. IPCQ가 peer rx slot 주소 계산 + dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size + 2. backpressure: my_head - peer_tail_cache < peer.n_slots ? + (꽉 찼으면 sleep/poll) + 3. PE_DMA(vc_comm)에 DMA 요청 → src_addr에서 peer의 dst_addr로 nbytes 전송 + 4. my_head += 1 + +PE B: data = tl.recv(W) + 1. 내 rx_buffer[my_tail % n_slots] 위치 확인 + 2. 데이터 도착 대기 (D7 backpressure 모드) + 3. 그 주소를 PE 커널에 반환 (또는 fetch unit으로 register file에 로드) + 4. my_tail += 1 + 5. credit return fast path 발행 (D9) — bottleneck-BW latency 후 + peer A의 peer_tail_cache 갱신 +``` + +**핵심**: Slot에 데이터가 통째로 들어간다. PE B의 recv는 자기 rx_buffer만 +읽으면 되고, A의 메모리를 read하지 않는다. 송신 측 IPCQ가 peer rx slot +주소를 알고 있으므로 직접 그 주소로 DMA write한다 (single-hop). + +본인의 PE_TCM read/write는 DMA를 거치지 않는다 (PE에 직접 붙어있음). 
+slot이 본인 TCM에 있으면 직접 접근, 아니면 PE_DMA 경유. + +### D6. Buffer 위치 — 3-way benchmark + +호스트 CCL init이 IPCQ ring buffer의 메모리 위치를 결정한다: + +```python +ipcq_init( + backend="ahbm", + buffer_kind="tcm" | "hbm" | "sram", + n_slots=8, + slot_size=4096, +) +``` + +| 위치 | 특징 | trade-off | +|------|------|-----------| +| **PE_TCM** | PE에 직접 붙음, 빠름 | 작음, PE 내부 자원과 경쟁 | +| **PE-local HBM** | 큼, DMA 경유 | latency 큼 | +| **Cube SRAM** | 중간 크기, cube-shared | cube 내 PE 간 contention | + +세 위치 모두 동일 코드로 동작하며 init만 다르다. 벤치마크로 비교 가능. + +**규칙**: peer가 read/write할 때는 DMA 경유. 본인이 자기 PE_TCM 읽기/쓰기는 +DMA 없음. + +### D7. Backpressure — 2-mode benchmark + +송신 측이 peer slot full을 감지했을 때, 또는 수신 측이 데이터 미도착을 +감지했을 때 어떻게 대기하는가: + +| 모드 | 동작 | 모델 | +|------|------|------| +| **poll** | 캐시된 peer pointer를 주기적으로 재확인. cache update event를 폴링 | spin loop | +| **sleep** | SimPy event를 yield하고 sleep, peer가 update event를 trigger하면 wake | interrupt-like | + +```python +ipcq_init(backpressure="poll" | "sleep", ...) +``` + +두 모드 모두 구현하여 latency/throughput trade-off를 벤치마크할 수 있다. + +### D8. PE_DMA Virtual Channel + +PE_DMA를 단일 큐에서 **2-channel virtual channel** 모델로 확장한다. + +``` +PE_DMA +├── vc_compute: GEMM/MATH의 tile load/store/writeback +└── vc_comm: IPCQ의 send 데이터 +``` + +각 VC는 독립적인 state machine을 가진다: +- 한 채널이 stall되어도 다른 채널은 진행 +- 동일 link(cube_noc, UCIe 등)는 공유하지만, link BW는 두 채널이 분할 사용 + +**Chunk 단위 인터리브**: +- 큰 GEMM tile DMA가 한 번에 link를 점유하지 않음 +- chunk_size 단위로 진행 (예: 256B), 매 chunk마다 다른 VC와 link BW 공유 +- chunk_size는 init 파라미터 (작을수록 fair, 클수록 효율) + +이로써: +- HoL blocking 해소 (compute DMA 진행 중에도 IPCQ send 끼어들 수 있음) +- compute/comm overlap 자연스러움 (NVIDIA copy engine + compute SM 패턴) +- HW 모델 정합 (NoC virtual channel은 실제 HW 기법) + +**첫 구현의 정확도 한계 (intentional)**: + +본 ADR의 첫 구현은 **deterministic chunk-level interleave + weighted +round-robin arbitration** (default 50/50, ccl.yaml에 노출)을 채택한다. +이는 first-order approximation이며, 실제 HW의 dynamic contention/credit-based +arbitration보다는 단순화된 모델이다. 
+ +| 모델링 항목 | 첫 구현 | 향후 확장 가능 | +|------------|---------|---------------| +| VC 간 BW 분할 | 정적 weight | dynamic contention 기반 | +| Chunk 단위 인터리브 | deterministic round-robin | priority/QoS 기반 | +| Cross-VC dependency | 없음 (독립) | NoC arbiter component 추가 | + +이 단순화는 functional correctness에는 영향이 없으며, latency 측정에서 +heavy contention 시나리오는 실제보다 약간 optimistic한 결과를 낼 수 있다. +정밀화가 필요하면 별도 ADR로 NoC arbiter를 도입한다. + +#### Token routing + +- compute용 token (TileToken): 기존 PE_FETCH_STORE → PE_DMA 체이닝 그대로 +- comm용 token (IpcqDmaToken, 신규): PE_IPCQ → PE_DMA로 self-routing +- PE_DMA가 token 종류로 채널 결정 + +```python +class PeDmaComponent: + def _process(self, env, token): + if isinstance(token, IpcqDmaToken): + yield from self._vc_comm_process(env, token) + else: + yield from self._vc_compute_process(env, token) +``` + +### D9. Pointer 동기화 — DMA payload piggyback + +실제 하드웨어(NVLink, UCIe 등)는 DMA 메시지의 payload에 메타데이터를 +piggyback하여 송수신과 함께 pointer를 갱신한다. 본 시뮬레이션도 같은 모델을 +채택하여 **별도의 control 채널 없이** 메타데이터가 data와 함께 도착하도록 한다. + +이 모델의 핵심 이점: + +- **자동 ordering**: 메타데이터가 data와 동일 token으로 이동하므로 data가 + 먼저 visible해진 다음에야 head_cache가 갱신된다. 별도 ordering invariant + 없이 race condition이 원천 차단된다. +- **HW 정합**: 실제 NVLink/UCIe의 piggybacked header 모델과 일치 +- **컴포넌트 단순화**: 별도 IpcqPtrUpdate event 종류가 필요 없음 + +#### Send 흐름 (head 측 piggyback) + +``` +PE A: tl.send(E, src_addr, nbytes) + 1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준) + 2. PE_IPCQ가 IpcqDmaToken 생성: + - data 본체 (src_addr → peer dst_addr) + - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction) + 3. PE_DMA(vc_comm)에 token put + 4. PE A는 자기 my_head++ (송신 추적용) + +[fabric DMA: latency 만큼 진행] + +PE B의 PE_DMA가 token 수신 + 5. data를 dst_addr (B의 rx slot)에 MemoryStore.write + 6. token의 metadata를 PE B의 PE_IPCQ로 forward (PE 내부 wire, ~1 cycle) + +PE B의 PE_IPCQ가 metadata 수신 + 7. peer_head_cache 갱신 (= A의 head 위치) + 8. 대기 중인 recv (해당 direction)가 있으면 wake +``` + +여기서 핵심은 **5와 6은 같은 SimPy step**이라는 것이다 — DMA 완료와 동시에 +data와 metadata가 atomic하게 visible해진다. 
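The send-side bookkeeping from D2.5, D5 and the send flow above (backpressure check, peer slot address, head increment) can be sketched in plain Python. This is a simplified model under stated assumptions: `SendSideSketch` is a hypothetical name, pointers are treated as monotonically increasing integers, and SimPy timing, the DMA token, and the credit fast path are all left out.

```python
from dataclasses import dataclass


@dataclass
class SendSideSketch:
    """Hypothetical model of one tx direction; field names follow IpcqQueuePair."""
    peer_rx_base_pa: int
    n_slots: int
    slot_size: int
    my_head: int = 0
    peer_tail_cache: int = 0

    def can_send(self) -> bool:
        # Backpressure: the number of in-flight slots must stay below the ring depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def next_dst_pa(self) -> int:
        # Address of the next slot in the peer rx ring, with wrap-around.
        return self.peer_rx_base_pa + (self.my_head % self.n_slots) * self.slot_size

    def send(self) -> int:
        if not self.can_send():
            # A real PE_IPCQ would poll or sleep here (D7); the sketch just raises.
            raise BlockingIOError("peer ring full")
        dst = self.next_dst_pa()
        self.my_head += 1
        return dst

    def on_credit(self, peer_tail: int) -> None:
        # Credit return: the peer reports its tail. Keeping the maximum is an
        # assumption of this sketch so that a stale credit cannot move it backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, peer_tail)


q = SendSideSketch(peer_rx_base_pa=0x1000, n_slots=2, slot_size=4096)
first = q.send()             # slot 0
second = q.send()            # slot 1
blocked = not q.can_send()   # ring depth 2 is exhausted
q.on_credit(1)               # peer consumed one slot
third = q.send()             # wraps around to slot 0
```

The wrap-around in `third` shows why the sender needs `peer.n_slots` and `peer.slot_size` in the endpoint, not just the base address.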
+
+#### Recv flow (credit return: fast path with bottleneck-BW latency)
+
+When the receiver frees a slot, the sender has to learn about it (backpressure release).
+Unlike the piggyback model on the data path, credit return does not traverse the regular
+vc_comm fabric; it is handled on a **separate fast path**. This abstracts the credit
+return fast path of real HW such as NVLink/UCIe.
+
+**Latency calculation**: not a magic constant; it is derived from the **full path
+latency of the routed path** (per-node overhead + edge propagation + drain).
+
+```
+credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
+path = router.find_path(self_pe, peer_pe.pe_dma)
+latency = compute_path_latency_ns(path, credit_size_bytes)
+        = sum(edge.distance_mm * ns_per_mm)
+        + sum(node_overhead_ns[n] for n in path)
+        + credit_size_bytes / bottleneck_bw_on_path
+```
+
+The router appends `.pe_dma` automatically only to the source, so the destination must
+carry an explicit `.pe_dma` suffix. Without it, `find_path` raises and the credit
+silently teleports at zero cost, which is a latent bug (fixed in this update).
+
+`tl.recv` waits for credit-emit completion with yield-from (previously it forked via
+`env.process`). As a result the credit-return cost is reflected in the receiver's
+`pe_exec_ns`, and recv returns to the kernel only after the IPCQ control plane has
+completed the consume-acknowledgement. This is the protocol-level equivalent of a
+non-posted `tl.store` in raw DMA waiting for the HBM ack-trip.
+
+Consequences:
+- **Topology-proportional approximation**: intra-cube credit return and cross-SIP credit
+  automatically get different latencies
+- **No magic constant**: every ns value comes from `compute_path_latency_ns`, which is
+  computed from the same edge_map and `node_overhead_ns` as data traffic
+- **No deadlock risk**: `peer_credit_store.put` is unbounded, and the credit is issued
+  automatically even when B has no data to send to A
+- **`IPCQ ≥ raw DMA`** guarantee: for a matched physical move, credit-emit balances
+  the ack-trip cost of raw DMA
+
+```
+PE B: tl.recv(W) → take the data → my_tail++
+
+PE B's PE_IPCQ:
+  1. compute the path to PE A via the router
+  2. latency_ns = compute_path_latency_ns(path, credit_size_bytes)
+  3. yield from self._delayed_credit_send(env, peer_credit_store, my_tail, latency_ns)
+
+[fast path: timeout for latency_ns, fabric vc not used]
+
+PE A's PE_IPCQ receives IpcqCreditMetadata from its own credit_store:
+  4. update peer_tail_cache
+  5. 
대기 중인 send (해당 direction)가 있으면 wake +``` + +#### Component 결합도 — SimPy Store 채널 + +PE B의 PE_IPCQ가 PE A의 PE_IPCQ를 직접 호출하지 않는다. 대신 **init 시점에 +양쪽 PE_IPCQ 사이에 SimPy Store를 한 번 wire**해두고 (양방향 fast path 채널), +credit metadata는 그 store로 put한다. + +```python +class PeIpcqComponent: + def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns): + yield env.timeout(latency_ns) + yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...)) +``` + +backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께 +설치한다 (D12 IpcqInitMsg에 명시). + +#### Credit return fast path의 한계 + +- `credit_size_bytes`는 estimate. 보통 16-64 bytes로 충분하며, 실제 HW의 + credit return wire 크기를 모방한 값. +- fast path는 일반 vc_comm BW contention 모델에서 **제외**된다 (별도 채널). + 실제 HW의 credit return wire는 매우 lightweight이므로 1차 근사로 합리적. +- 정밀화가 필요하면 후속 ADR에서: + - credit fast path를 별도 link로 모델링 (BW limit + contention) + - 또는 piggyback 모드로 변경 가능 (`credit_return_mode: piggyback`) + +#### PE_DMA의 책임 추가 + +PE_DMA(vc_comm)는 token 수신 시 다음 시퀀스로 처리한다: Transaction +terminal의 BW drain을 먼저 지불하고, 이어서 atomic하게 data write + +metadata forward 수행. **data write와 metadata forward 사이에는 SimPy +yield를 두어서는 안 된다** (I6 MUST 규칙 참조). drain yield는 atomic +구간 안이 아니라 그 앞에 위치해야 한다: + +```python +def _on_vc_comm_recv(self, env, txn): + # Sender PE_DMA가 찍어 둔 drain_ns (= nbytes / bottleneck_bw) 를 + # 여기서 지불. atomic 구간보다 앞이어야 한다 — recv는 bytes가 + # "도착"한 이후에만 깨어나야 하므로. + drain = getattr(txn, "drain_ns", 0.0) + if drain > 0: + yield env.timeout(drain) + + token = txn.request + # ── ATOMIC: 두 동작 사이에 yield 금지 ── + # 1. data를 dst_addr에 write (dst의 메모리 공간은 token.dst_endpoint.buffer_kind) + data = self._memory_store.read(token.src_space, token.src_addr, + shape=..., dtype=...) + self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data) + # 2. 
token의 metadata를 자기 PE의 IPCQ로 forward + yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token)) + # ───────────────────────────────────── +``` + +`out_ports[ipcq_id].put`은 SimPy Store의 yield-able 호출이지만, PE 내부 +wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (실질적으로 +single-step). 이 최종 put이 atomic 구간의 끝이며, 그 이전에 다른 yield가 +삽입되면 안 된다. + +#### Drain-at-inbound semantics (D9 timing model) + +Transaction은 sender PE_DMA가 `drain_ns = nbytes / bottleneck_bw_on_path` +를 찍어 둔 상태로 fabric에 들어간다. 이 simulator에서 per-hop `overhead_ns` +는 각 forwarding component의 `run()` 에서 지불되고, 남은 BW drain은 +Transaction의 terminal node에서 한 번 지불된다. IPCQ가 아닌 모든 +Transaction (raw DMA, kernel-launch fanout 등) 은 +`ComponentBase._forward_txn` 이 terminal에서 이 drain을 지불한다. IPCQ의 +경우 목적지 PE_DMA가 `_handle_ipcq_inbound` 핸들러로 Transaction을 +가로채서 (IPCQ 전용 data write + metadata forward를 해야 하므로) +**이 핸들러 최상단에서 drain을 명시적으로 지불해야 한다** — 그래야 IPCQ의 +timing model이 다른 모든 fabric Transaction과 동일선상에 놓인다. + +여기서 drain을 지불할 때의 side-effect: + +- **SRC `tl.send`**: 동작 불변. sender PE_DMA가 `sub_done` 을 `yield` + 하지 않으므로 fire-and-forget 의미가 보존된다. metadata forward 이후 + 호출되는 `sub_done.succeed()` 는 sender 입장에서 listener가 없는 이벤트. +- **DST `tl.recv`**: `drain_ns` 만큼 늦게 깨어난다. recv는 local PE_IPCQ + 의 `IpcqMetaArrival` 수신 시에만 wake되며, metadata forward가 drain + 이후로 이동했으므로 recv는 bandwidth까지 포함한 전체 fabric transfer + 시간을 관측하게 된다. + +물리적 그림과 일치: send는 dispatch하고 바로 반환; recv는 bytes가 실제로 +자신의 inbox로 drain될 때까지 대기. 
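The two timing quantities used in D9, the payload drain (`nbytes / bottleneck_bw_on_path`) and the full path latency used for credit return, can be worked through with a small sketch. Everything here is an assumption made for illustration: `HopSketch` folds one edge and its downstream node into a single record, bandwidth is expressed in bytes per ns (1 GB/s equals 1 byte/ns), and the numbers are invented rather than taken from any topology file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopSketch:
    """Hypothetical hop: one edge plus the node it leads to."""
    distance_mm: float       # edge length
    overhead_ns: float       # per-node forwarding overhead
    bw_bytes_per_ns: float   # link bandwidth (1 GB/s == 1 byte/ns)


def drain_ns(path, nbytes):
    # BW drain is paid once, at the terminal node, against the slowest link.
    bottleneck = min(hop.bw_bytes_per_ns for hop in path)
    return nbytes / bottleneck


def path_latency_ns(path, nbytes, ns_per_mm):
    # Mirrors the D9 formula: edge propagation + node overhead + drain.
    propagation = sum(hop.distance_mm * ns_per_mm for hop in path)
    overhead = sum(hop.overhead_ns for hop in path)
    return propagation + overhead + drain_ns(path, nbytes)


# Illustrative two-hop path: a fast local link followed by a slower link.
path = [
    HopSketch(distance_mm=2.0, overhead_ns=1.0, bw_bytes_per_ns=64.0),
    HopSketch(distance_mm=10.0, overhead_ns=4.0, bw_bytes_per_ns=16.0),
]

credit_latency = path_latency_ns(path, nbytes=16, ns_per_mm=0.5)  # 16-byte credit
payload_drain = drain_ns(path, nbytes=4096)                       # one 4 KiB slot
```

In this example the 16-byte credit is dominated by propagation and node overhead, while the 4 KiB payload is dominated by drain on the slower link, which matches the intuition that recv observes the bandwidth cost and credit return mostly observes distance.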
+ +#### Backpressure latency 정확도 + +backpressure 해제까지 걸리는 시간: + +- **데이터 send 측 latency** = full fabric DMA (data + piggyback metadata 함께) +- **Credit return 측 latency** = fast path with bottleneck-BW + (`credit_size_bytes / bottleneck_bw_on_path`) + +| 시나리오 | 모델링된 latency | 실제 HW와의 관계 | +|---------|----------------|----------------| +| Cube 내 (fast link) | 작음 (bottleneck = cube_noc BW) | topology-aware approximation | +| Cross-cube (UCIe) | 중간 (bottleneck = UCIe BW) | topology-aware approximation | +| Cross-SIP (PCIE) | 큼 (bottleneck = PCIE BW) | topology-aware approximation | + +별도 magic latency 파라미터 없이 토폴로지에 비례한 first-order +approximation이 자동으로 반영된다. 실제 HW와 정확히 일치하지는 않지만 +(credit fast path는 contention 모델에서 제외, credit_size_bytes는 estimate), +magic constant 모델보다 훨씬 의미 있는 비교 가능. 정밀화는 후속 ADR로 +넘긴다. + +### D9.5. ADR-0020 (2-Pass) 통합 + +`tl.send/recv`는 ADR-0020의 2-pass 모델과 통합되어야 한다. Phase 1은 +타이밍과 실제 데이터 이동(MemoryStore) 모두 모델링하고, Phase 2는 op_log +기반 정합성 검증을 가능케 한다. + +#### Phase 1 (타이밍 + 데이터 이동) + +D9는 head 갱신과 tail 갱신을 다른 메커니즘으로 모델링한다: + +- **Send-side (head update)** — DMA payload piggyback. data write와 metadata + forward가 동일 SimPy step에 일어나므로 자동으로 atomic visibility 보장. +- **Recv-side (tail credit return)** — fast path SimPy Store 채널. + bottleneck-BW 기반 latency 후 peer_tail_cache 갱신. + +두 메커니즘을 합쳐서 전체 ring buffer pointer 일관성을 유지한다. + +**send 시**: + +1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준) +2. PE_IPCQ가 IpcqDmaToken 생성 (data + piggyback metadata) → PE_DMA(vc_comm)에 put +3. PE_DMA가 fabric DMA 시뮬레이션 (latency 진행) +4. **DMA 완료와 동일한 SimPy step에 atomic 시퀀스**: + - **MemoryStore.write(buffer_kind, dst_pa, data)** — single-hop DMA write + - 수신측 PE_IPCQ에 metadata forward → peer_head_cache 갱신 → 대기 recv wake +5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="send", params={src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, sender_seq})` + - `dst_space`는 `token.dst_endpoint.buffer_kind`에서 derive된 값이다 + (별도 token 필드가 아니다). dst_addr은 `token.dst_addr`. 
+ +**recv 시**: + +1. PE_IPCQ가 (peer_head_cache > my_tail) AND (MemoryStore.has(slot_addr)) 조건 대기 + (D9 piggyback 모델에서는 두 조건이 같은 step에 truthy가 되지만, defensive check) +2. 조건 만족 시: `slot_addr = my_rx_base + slot_idx * slot_size` +3. **두 가지 모드** (`recv_mode`로 op_log에 기록): + - **`return_slot`** (default): slot_addr을 그대로 PE 커널에 반환. + 데이터 복사 없음. 커널이 slot 메모리를 직접 사용한다. + - **`copy_to_dst`**: 호출 시 dst_addr이 지정된 경우. slot 데이터를 읽어서 + dst_addr에 write. `data = memory_store.read(...)`; `memory_store.write(dst_space, dst_addr, data)` +4. PE_IPCQ가 my_tail++, fast path credit return을 발행 (D9 — vc_comm + fabric을 거치지 않고 별도 SimPy Store 채널로 bottleneck-BW latency 후 + peer 측 peer_tail_cache 갱신) +5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="recv", params={recv_mode, src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, consumer_seq})` + - `recv_mode="return_slot"`: src_space/src_addr가 slot 위치, dst_addr=None + - `recv_mode="copy_to_dst"`: src_space/src_addr가 slot 위치, dst_space/dst_addr가 사용자 지정 위치 + +#### Phase 2 (op_log replay) + +DataExecutor가 `op_kind="ipcq"` 레코드를 만나면: + +- **send**: src → dst (peer rx slot)로 ndarray를 idempotent하게 write +- **recv (`recv_mode="return_slot"`)**: no-op. slot 데이터는 Phase 1에서 + 이미 적절한 위치에 있으며, 커널이 해당 slot 메모리를 직접 사용함. +- **recv (`recv_mode="copy_to_dst"`)**: slot → dst_addr로 ndarray를 idempotent + 하게 copy + +본질적으로 IPCQ는 **데이터 이동**만 하므로 Phase 2가 추가로 계산할 것은 없다. +DataExecutor의 GEMM/Math가 그 데이터를 사용하면 자동으로 정합성이 검증된다. + +```python +class DataExecutor: + def _execute_op(self, op): + if op.op_kind == "ipcq": + self._execute_ipcq(op) + elif op.op_kind == "memory": + ... + elif op.op_kind == "gemm": + ... 
+ + def _execute_ipcq(self, op): + """IPCQ ops are data movement; Phase 1 already wrote to MemoryStore.""" + p = op.params + if op.op_name == "send": + data = self.store.read(p["src_space"], p["src_addr"], + shape=p["shape"], dtype=p["dtype"]) + self.store.write(p["dst_space"], p["dst_addr"], data) + elif op.op_name == "recv": + if p.get("recv_mode") == "copy_to_dst": + data = self.store.read(p["src_space"], p["src_addr"], + shape=p["shape"], dtype=p["dtype"]) + self.store.write(p["dst_space"], p["dst_addr"], data) + # recv_mode == "return_slot": no-op (data already in slot) +``` + +#### `--verify-data` 흐름 (CCL 커널) + +``` +1. kernbench run --bench ccl_allreduce --verify-data +2. backend init → IPCQ buffers 할당, neighbor table install +3. 모든 rank greenlet 동시 실행 +4. 각 PE 커널이 tl.send/recv → MemoryStore에 데이터 누적 +5. 시뮬레이션 완료 후 DataExecutor.run() → ipcq op 멱등 replay (no-op) +6. 벤치마크가 print(out) 또는 out.data 비교 → 정합성 확인 +``` + +벤치 작성자는 `out.data`로 결과를 읽고 expected와 비교하면 된다 (ADR-0020 D7 +Tensor.data 패턴). + +### D10. 호스트 CCL Init은 PyTorch 패턴 그대로 + +호스트 코드는 실제 PyTorch distributed 코드와 동일하게 유지한다. +`init_process_group`은 backend 객체만 만들고, IPCQ 설정 (neighbor topology, +buffer_kind, backpressure 등)은 받지 않는다. + +```python +# benches/ccl_allreduce.py — 실제 PyTorch와 동일한 호스트 코드 +def run_rank(rank, world_size, torch): + dist = torch.distributed + dist.init_process_group(backend="ahbm", world_size=world_size, rank=rank) + + tensor = torch.zeros((M, K), dtype="f16", dp=...) + + from kernbench.ccl.algorithms import ring_allreduce + torch.launch("ring_allreduce", ring_allreduce.kernel, tensor, rank, world_size) +``` + +IPCQ 설정은 backend가 **init_process_group 시점에** `ccl.yaml`을 읽고 즉시 +PE_IPCQ neighbor table을 install한다. 호스트 코드는 IPCQ를 인지할 필요가 없다. + +벤치마크 하나는 하나의 알고리즘을 사용하는 것을 가정하며, 사용할 알고리즘은 +`ccl.yaml`의 `defaults.algorithm` 으로 지정한다 (D11). 호스트 코드 변경 없이 +ccl.yaml만 수정하여 다른 알고리즘으로 교체할 수 있다. + +#### Init 흐름 (eager) + +1. `init_process_group(backend="ahbm")` 호출 +2. 
backend가 `ccl.yaml` 로드 → `defaults.algorithm` 결정 +3. `algorithms[]`에서 topology + buffer_kind + backpressure + slot/size 결정 +4. **즉시** 모든 PE의 PE_IPCQ에 neighbor table을 install (sideband 또는 fabric `IpcqInitMsg`) +5. 이후 `torch.launch(kernel_name, ...)`는 일반 launch와 동일하게 처리 + (CCL kernel이든 아니든 PE_IPCQ는 이미 준비됨) + +### D11. CCL 설정 파일 (`ccl.yaml`) + +IPCQ 설정과 알고리즘 metadata는 별도 YAML 파일에 둔다. +`components.yaml`/`topology.yaml`과 같은 패턴을 유지하며, 변경 이력이 코드처럼 +추적 가능하다. + +벤치마크 한 번 실행은 한 알고리즘만 사용한다 (`defaults.algorithm`). +다른 알고리즘으로 교체하려면 `ccl.yaml`의 `defaults.algorithm` 만 바꾸면 된다. + +```yaml +# ccl.yaml — CCL backend (ahbm) configuration +# +# 이 파일은 init_process_group(backend="ahbm") 시점에 로드되며, +# defaults.algorithm 으로 지정된 알고리즘에 따라 PE_IPCQ neighbor table을 +# install한다. 호스트 코드는 IPCQ 설정을 인지하지 않는다. + +defaults: + # 이번 벤치 실행에서 사용할 알고리즘. algorithms 섹션에 정의된 것 중 하나. + algorithm: ring_allreduce + + # IPCQ ring buffer가 위치할 메모리. + # tcm — PE-local TCM (작지만 빠름, PE 내부 자원과 경쟁) + # hbm — PE-local HBM (큼, DMA latency 큼) + # sram — Cube-shared SRAM (중간 크기, cube 내 PE 간 contention) + buffer_kind: tcm + + # send/recv가 peer slot full / data 미도착을 만났을 때의 대기 방식. + # poll — peer pointer 캐시를 spin loop로 재확인 + # sleep — SimPy event yield 후 wakeup 대기 (interrupt-like) + backpressure: sleep + + # Ring buffer depth (한 방향당 slot 개수). 클수록 in-flight 가능, 메모리 ↑ + n_slots: 8 + + # Slot 하나의 크기 (bytes). 한 tile을 통째로 담을 수 있는 크기여야 함. + slot_size: 4096 + + # PE_DMA virtual channel chunk 크기 (bytes). 작을수록 fair, 클수록 효율. + # IPCQ traffic과 compute traffic 사이의 인터리브 granularity (D8 참조). + vc_chunk_size: 256 + + # Credit return fast path 메시지 크기 (bytes). 실제 HW의 credit return wire + # 크기를 모방. backend가 라우팅 경로의 bottleneck BW를 보고 latency를 + # 계산한다 (D9 참조). 보통 16-64로 충분. + ipcq_credit_size_bytes: 16 + +algorithms: + # ── 알고리즘 정의 ───────────────────────────────────────────────── + # 각 entry는 알고리즘 모듈과 그 알고리즘이 요구하는 topology를 명시한다. + # 알고리즘별 default override 가능 (buffer_kind, backpressure 등). + + ring_allreduce: + # PE 커널이 정의된 모듈. 
`kernel(t_ptr, rank, world_size, tl)` 함수를 export. + module: kernbench.ccl.algorithms.ring_allreduce + + # 이 알고리즘이 요구하는 neighbor topology. builtin 이름 또는 "custom". + # ring_1d — 1D 양방향 ring (E/W) + # ring_1d_unidir — 1D 단방향 ring (E only) + # mesh_2d — 2D mesh (N/S/E/W) + # tree_binary — binary tree (parent/children direction) + # custom — 모듈의 neighbors(rank, world_size) 함수 사용 + topology: ring_1d + + tree_allreduce: + module: kernbench.ccl.algorithms.tree_allreduce + topology: tree_binary + # 알고리즘별 override (이 알고리즘만 hbm 사용) + buffer_kind: hbm + + custom_mesh: + module: kernbench.ccl.algorithms.custom_mesh + topology: custom # 모듈이 직접 neighbors() 함수 제공 +``` + +#### 알고리즘 모듈 구조 + +알고리즘 모듈은 두 개의 hook을 export한다 — `kernel`은 필수, `neighbors`는 선택. + +```python +# src/kernbench/ccl/algorithms/ring_allreduce.py + +def kernel(t_ptr, rank, world_size, tl): + """필수 — PE 커널. + + IPCQ 설정은 backend가 ccl.yaml + neighbors() 결과로 install한 상태이다. + 커널은 그저 4-방향 send/recv API만 사용하면 된다. + """ + for step in range(world_size - 1): + ... + tl.send(dir="E", ...) + data = tl.recv(dir="W") + + +def neighbors(rank, world_size, neighbor_map): + """선택 — neighbor table override hook. + + backend는 ccl.yaml의 topology 필드에 따라 builtin neighbor_map을 생성한 뒤, + 이 함수가 정의되어 있으면 호출하여 결과를 override 한다. + + Args: + rank: 이 rank의 인덱스 + world_size: 전체 rank 수 + neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑 + 예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws} + mutable dict — 직접 수정 가능 + + Returns: + dict | None: + dict — neighbor_map을 override한 결과 + None — override 안 함, neighbor_map 그대로 사용 + """ + return None # 또는 수정 후 반환 +``` + +#### `neighbors` override 패턴 + +대부분의 알고리즘은 builtin topology만으로 충분하므로 `neighbors` 정의가 필요 없다. 
+Patterns for the cases where a definition is needed: + +**Pattern A: start from the builtin and modify part of it**: +```python +def neighbors(rank, world_size, neighbor_map): + # Only even ranks use W + if rank % 2 == 1: + neighbor_map.pop("W", None) + return neighbor_map +``` + +**Pattern B: build a completely new mapping (skip-connection ring, etc.)**: +```python +def neighbors(rank, world_size, neighbor_map): + # Ignore neighbor_map and write a new mapping + return {"E": (rank + 2) % world_size} +``` + +#### Builtin topology generators + +If the `topology` field in `ccl.yaml` is one of the following builtin names, the backend handles it automatically: + +| topology | Description | direction set | +|----------|------|---------------| +| `ring_1d` | 1D bidirectional ring | E, W | +| `ring_1d_unidir` | 1D unidirectional ring | E only | +| `mesh_2d` | 2D mesh | N, S, E, W | +| `tree_binary` | binary tree (root = rank 0) | parent, child_left, child_right | +| `none` | empty mapping; the algorithm builds it from scratch in `neighbors()` | (none) | + +With `topology: none` the builtin returns an empty dict, so the algorithm's `neighbors()` +must build the mapping from scratch. + +#### Procedure for adding an algorithm + +1. Write a `kernel` function in `src/kernbench/ccl/algorithms/.py` +2. Add an entry (`module`, `topology`) to the `algorithms` section of `ccl.yaml` +3. (Optional) Add a `neighbors()` function to the same module to override the builtin +4. Set `defaults.algorithm` to the new algorithm to apply it + +Host code is not touched. + +### D12. Message / token schema + +This section specifies the fields of every message/token this ADR adds. At implementation time these definitions are added +verbatim to `src/kernbench/common/pe_commands.py` and `src/kernbench/runtime_api/kernel.py`. + +#### `IpcqInitMsg` (sideband, fan-out at init) + +Used by the backend to install the neighbor table into the PE_IPCQ of every PE. +Its structure is similar to `MmuMapMsg` (target_sips, target_cubes, target_pe + entries). + +```python +@dataclass(frozen=True) +class IpcqInitEntry: + direction: str # "N" | "S" | "E" | "W" + peer: IpcqEndpoint # see D2.5 + my_rx_base_pa: int # base of this PE's own rx_buffer + my_rx_base_va: int # optional + n_slots: int + slot_size: int + # Credit fast path channel (D9). + # Contract: this field must be a simpy.Store instance and is a receive + # endpoint that accepts only IpcqCreditMetadata objects (the peer's PE_IPCQ + # uses it as its own input queue). The sending PE_IPCQ puts IpcqCreditMetadata + # directly into this store from _delayed_credit_send. 
Objects of any other type must not be put. + # The bidirectional SimPy Stores are wired once at backend init and cannot be changed afterwards. + peer_credit_store: "simpy.Store[IpcqCreditMetadata]" + +@dataclass(frozen=True) +class IpcqInitMsg: + correlation_id: str + request_id: str + target_sips: tuple[int, ...] + target_cubes: tuple[int, ...] + target_pe: int | tuple[int, ...] | str + entries: tuple[IpcqInitEntry, ...] # 4-direction entries for this PE + backpressure_mode: str # "poll" | "sleep" + buffer_kind: str # "tcm" | "hbm" | "sram" + credit_size_bytes: int # for the D9 fast path latency calculation (default 16) +``` + +**Credit fast path channel wiring**: backend init installs the bidirectional fast path channels +into the PE_IPCQ of every PE once. PE A's IpcqInitEntry(direction=E) holds +a reference to PE B's credit-receive Store so that the sender can put into it directly +(no separate fabric routing). + +#### `IpcqSendCmd` (PE_CPU → PE_IPCQ) + +```python +@dataclass(frozen=True) +class IpcqSendCmd: + direction: str # direction to send in + src_addr: int # source address of the data to send (TCM/HBM) + src_space: str # "tcm" | "hbm" | "sram" + nbytes: int + shape: tuple[int, ...] # data shape (for op_log/MemoryStore) + dtype: str + handle_id: str # for completion tracking + data_op: bool = True # recorded in the ADR-0020 op_log +``` + +#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ) + +```python +@dataclass(frozen=True) +class IpcqRecvCmd: + direction: str | None # None means round-robin (weak fairness, D4) + # recv_mode: two modes of operation + # "return_slot" — return the slot address to the PE kernel as is (default, zero-copy) + # "copy_to_dst" — copy the slot data to dst_addr, then return + recv_mode: str = "return_slot" + # dst_addr / dst_space are used only when recv_mode="copy_to_dst" + dst_addr: int = 0 + dst_space: str = "" + shape: tuple[int, ...] = () # data shape (for op_log/MemoryStore) + dtype: str = "" + handle_id: str = "" + blocking: bool = True # blocking vs non-blocking + data_op: bool = True +``` + +#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel) + +Following the D9 piggyback model, the token carries the data and the head metadata together +across the fabric. On arrival, the receiving PE_DMA writes the data to dst_addr +and forwards the metadata to PE_IPCQ (atomically). 
+ +```python +@dataclass +class IpcqDmaToken: + # ── Data movement (single-hop DMA write) ── + src_addr: int # address in the sender's own memory + src_space: str + dst_addr: int # peer rx slot address (already computed) + dst_endpoint: IpcqEndpoint # for routing (sip/cube/pe) + nbytes: int # data size + handle_id: str # used to notify the sending PE_IPCQ on completion + + # ── Piggyback metadata (updated automatically by the receiving PE_IPCQ) ── + sender_seq: int # monotonically increasing sequence number + # the peer uses it as its head_cache + src_sip: int # sender (lets the receiver identify which peer it is) + src_cube: int + src_pe: int + src_direction: str # direction from the sender's view (the receiver derives its own direction via the reverse mapping) + + data_op: bool = True # recorded in the ADR-0020 op_log +``` + +PE_DMA picks the channel by token type (D8): TileToken → vc_compute, IpcqDmaToken → vc_comm. + +**Handling in the receiving PE_DMA** (on vc_comm arrival): + +```python +def _vc_comm_arrival(self, env, token: IpcqDmaToken): + # 1. Write the data to dst_addr (atomic visibility of data and metadata) + if self._memory_store is not None: + data = self._memory_store.read(token.src_space, token.src_addr, + shape=..., dtype=...) + self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data) + # 2. Forward the metadata to this PE's IPCQ (PE-internal wire, same step) + yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token)) +``` + +PE_IPCQ receives `IpcqMetaArrival`, reads sender_seq, and updates peer_head_cache. + +#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path channel) + +Following the D9 fast path model, credit return does not go through the vc_comm fabric and is +delivered over a **separate SimPy Store channel**. The bidirectional channel is wired +in advance at backend init, and the latency is computed from the bottleneck BW. 
+ +```python +@dataclass(frozen=True) +class IpcqCreditMetadata: + """Credit return: fast path from the recv side to the send side.""" + consumer_seq: int # my_tail (the recv side's new tail) + src_sip: int # who sent it (lets the receiver identify which peer's credit this is) + src_cube: int + src_pe: int + src_direction: str # direction from the sender's view (the receiver uses the reverse mapping) +``` + +**Transmission flow**: + +```python +class PeIpcqComponent: + def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns): + yield env.timeout(latency_ns) + yield peer_credit_store.put(IpcqCreditMetadata( + consumer_seq=my_tail, src_sip=..., src_cube=..., src_pe=..., + src_direction=..., + )) +``` + +`latency_ns` is as defined in D9: + +```python +path = self.ctx.router.find_path(self_pe_prefix, peer_pe_prefix) +latency_ns = self.ctx.compute_drain_ns(path, credit_size_bytes) +``` + +**There is no separate IpcqPtrUpdate event**: head updates are handled by the D9 piggyback model, +and tail updates by the D9 fast path SimPy Store channel. + +### D13. Test strategy + +This section specifies the unit, integration, and regression tests. + +#### T1. Unit tests (component-level) + +- **PE_IPCQ unit** (`tests/test_pe_ipcq.py`): + - send: with no backpressure, the token is forwarded to PE_DMA immediately + - send: peer slot full → backpressure (per poll/sleep mode) + - send: backpressure is released once the peer credit return (IpcqCreditMetadata) arrives + - recv: returns immediately when data has arrived + - recv: data not yet arrived → wait → wake on receiving IpcqMetaArrival (D9 piggyback) + - recv (round-robin): returns the first data to arrive among the 4 directions (weak fairness) + - invalid direction → IpcqInvalidDirection exception + +- **PE_DMA virtual channel** (`tests/test_pe_dma_vc.py`): + - vc_compute / vc_comm progress independently (one channel stalling does not block the other) + - chunk-level interleave verification + - link BW split (50/50 or weighted) + +- **builtin topology** (`tests/test_ccl_topologies.py`): + - consistency of (rank, world_size) → neighbor dict for each of ring_1d/mesh_2d/tree_binary + - mesh_2d non-square → ValueError + - resolve_topology(custom, module) → returns module.neighbors + +#### T2. 
Integration tests (E2E send/recv) + +- **`tests/test_ipcq_e2e.py`**: + - 2-rank ring: rank 0 send(E) → rank 1 recv(W) → data matches + - 4-rank ring: bidirectional send/recv in flight concurrently, no deadlock + - mesh_2d 4×4: concurrent send/recv in all four directions N/S/E/W + +- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): + - greenlet mode + IPCQ → verify that ipcq records are created in op_log + - results are correct after DataExecutor processes the ipcq ops (check `out.data`) + +#### T3. Backend init tests (`tests/test_ccl_backend_ipcq.py`) + +- load ccl.yaml → extract `defaults.algorithm` +- builtin topology → IpcqInitMsg fan-out +- the PAs in IpcqEndpoint are consistent across all PEs (rx_base_pa of rank A's peer E = rank A+1's own rx_base_pa) +- memory allocation per buffer_kind (tcm/hbm/sram) + +#### T4. Regression + +- all 401 existing tests PASS +- the ADR-0020 integration has no effect on op_log/DataExecutor (benches that do not use CCL) + +#### T5. Performance / overhead + +- latency of a single send/recv pair = (DMA latency) + (IPCQ overhead) +- comparison: should be nearly identical to a plain PE_DMA write of the same nbytes (IPCQ overhead < 100 ns) + +### D14. Invariants & Failure Modes + +This section lists the hang/error situations that commonly occur in CCL infrastructure and defines how each is handled. + +#### Invariants (what the simulation must guarantee) + +I1. **Slot lifecycle exactly-once**: one send → exactly one recv. A duplicate send or + a duplicate recv is treated as a sequence error. + +I2. **Pointer monotonicity**: my_head and my_tail increase monotonically (never decrease). + sender_seq increases monotonically on the send side, and the cache updates on the recv side are monotonic as well. + +I3. **Endpoint consistency**: if the peer of rank A's IpcqEndpoint(direction=E) is + rank B, then the peer of rank B's IpcqEndpoint(reverse(E)) must be rank A. + Verified at backend init. + +I4. **buffer_kind consistency**: every PE in a ProcessGroup uses the same + buffer_kind (mixed kinds are not supported in the first implementation). A failed check is an init error. + +I5. **op_log ordering**: send → DMA complete → recv possible. The t_start + order in op_log does not violate this causality. + +I6. **Atomic data + metadata visibility (MUST)**: this is the core correctness + condition of this ADR. On the receive side, the data write (MemoryStore.write) and the metadata forward + (peer_head_cache update) must happen in the same SimPy step. Control can + never overtake data. + + **Implementation rules (MUST)**: + - The vc_comm token arrival handler in PE_DMA (`_vc_comm_arrival`) must place + **no SimPy yield whatsoever** between the following two actions: + 1. 
`MemoryStore.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)` + 2. forwarding `IpcqMetaArrival` to PE_IPCQ + - Both actions must run back to back inside the same SimPy event callback. + - In code review, adding a `yield` (or `yield from`) between them is + rejected as a correctness violation. + + If this rule is violated, another SimPy process can cut in, producing a race condition in which + head_cache becomes visible earlier or later than the data. + +I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, + the step at which `peer_head_cache > my_tail` becomes truthy and the step at which `MemoryStore.has(slot_addr)` + becomes truthy are the same. recv checks both conditions (defensively), + but checking either one alone is also correct. + +#### Failure Modes (runtime errors) + +F1. **Invalid direction**: + - the PE kernel calls `tl.send(dir="X")` → a direction that was not installed + - PE_IPCQ immediately raises an `IpcqInvalidDirection` exception + - the SimPy simulation aborts immediately with a clear error for the user + +F2. **Type mismatch**: + - the dtype/shape/nbytes of send and recv do not match + - the first implementation does not check this (dtype/shape are hints); a strict mode will be added later + +F3. **Deadlock detection (timeout-based)**: + - send: peer_tail_cache is never updated and the wait never ends + - recv: peer_head_cache is never updated and the wait never ends + - abort once the simulation timeout (default 10ms of simulated time) is exceeded + - for debugging, dump each PE's last send/recv position and blocking state + +F4. **Backend init failure**: + - `defaults.algorithm` missing from ccl.yaml + - `algorithms[name]` definition missing + - algorithm module import fails + - topology validation fails (I3, I4) + → all of these raise an error immediately at `init_process_group` time + +F5. **Slot full + unbounded backpressure**: + - the peer never receives + - handled by timeout, as in F3 + - debugging: print my_head and peer_tail_cache of the blocked PE + +#### Diagnostic tools (added during implementation) + +- **CCL trace**: logs each send/recv as (rank, t, dir, nbytes) +- **Pointer dump**: prints the IPCQ pointer state of every PE at simulation end or on a hang +- **Deadlock graph**: prints the wait-for graph on a hang (which PE is waiting on which PE) + +### D15. Algorithm author guide (summary) + +This section gives algorithm authors a one-screen starting point. +For the detailed step-by-step guide, see [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md). 
+ +#### What you touch / what you do not touch + +| What you touch | What you do not touch | +|----------|---------------| +| `src/kernbench/ccl/algorithms/.py` (kernel + optional neighbors) | `benches/ccl_allreduce.py` host code | +| one new entry in `ccl.yaml` + `defaults.algorithm` | the `src/kernbench/ccl/` framework | +| (optional) `tests/test_.py` unit test | the `src/kernbench/components/builtin/pe_ipcq.py` component | +| | the `src/kernbench/runtime_api/distributed.py` backend | + +#### Algorithm module interface contract + +```python +# src/kernbench/ccl/algorithms/.py + +def kernel(*args, tl) -> None: + """Required. The PE kernel. + + Args (positional): tensor pointers, rank, world_size, algorithm parameters + Args (keyword): tl — TLContext (injected automatically) + + Available IPCQ API: + tl.send(dir, src_addr, nbytes) # blocking, waits under backpressure + tl.recv(dir) # blocking recv from a specific direction + tl.recv() # round-robin over the 4 directions + tl.recv_async(dir) → handle # non-blocking + tl.wait(handle) # wait for a non-blocking op to complete + + The existing API remains available: + tl.load / tl.store / tl.composite / tl.program_id, etc. + """ + ... + +def neighbors(rank, world_size, neighbor_map) -> dict | None: + """Optional. Overrides the neighbor_map built by the builtin topology in ccl.yaml. + + Return None → use the builtin as is + Return a dict → override with that dict (it may start from the builtin) + """ + return None +``` + +#### 5-step flow + +1. **Write the kernel function**: a new file `src/kernbench/ccl/algorithms/.py` +2. **Register it in ccl.yaml**: an `algorithms.` entry + change `defaults.algorithm` +3. **(Optional) neighbors override**: when the builtin topology needs to be modified +4. **Unit test**: `kernbench.ccl.testing.run_kernel_in_mock` (fast, no SimPy) +5. **Simulation check**: `kernbench run --bench ccl_allreduce --verify-data` + +The host code (`benches/ccl_allreduce.py`) is not touched. 
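
The kernel/neighbors contract and the first three steps above can be sketched as a tiny stand-alone module. This is only an illustration under stated assumptions: `FakeTL` and `ring_1d_neighbors` are hypothetical stand-ins written for this example (they are not the real `TLContext`, not `kernbench.ccl.testing`, and not the backend's builtin generator), and the 4096-byte transfer size is an arbitrary choice.

```python
# Hypothetical sketch of an algorithm module; nothing here is the real KernBench API.

class FakeTL:
    """Stand-in for the injected TLContext: records calls instead of moving data."""

    def __init__(self):
        self.log = []

    def send(self, dir, src_addr, nbytes):
        self.log.append(("send", dir, src_addr, nbytes))

    def recv(self, dir=None):
        self.log.append(("recv", dir))
        return 0  # placeholder slot address


def ring_1d_neighbors(rank, world_size):
    """Mapping in the shape documented for ring_1d: E = next rank, W = previous rank."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def neighbors(rank, world_size, neighbor_map):
    """Optional hook (Pattern A style): odd ranks drop their W link."""
    if rank % 2 == 1:
        neighbor_map.pop("W", None)
    return neighbor_map


def kernel(t_ptr, rank, world_size, tl):
    """Required hook: one send east and one recv west per ring step."""
    for _ in range(world_size - 1):
        tl.send("E", t_ptr, 4096)  # 4096 bytes is an assumed tile size
        tl.recv("W")


if __name__ == "__main__":
    tl = FakeTL()
    kernel(0x1000, rank=0, world_size=4, tl=tl)
    print(len(tl.log), "calls recorded")
    print(neighbors(1, 4, ring_1d_neighbors(1, 4)))
```

Recording calls on a fake context keeps the check independent of SimPy, which is the same motivation the flow gives for the mock-based unit test in step 4.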
+ +#### Available helpers (`kernbench.ccl.helpers`) + +| Helper | Description | +|--------|------| +| `chunked(addr, n_chunks, ...)` | slices a tensor into n chunk views | +| `ring_step(rank, step, ws)` | per-step (send_idx, recv_idx) of a ring algorithm | +| `tree_step(rank, level)` | per-level parent/child indices of a binary tree | + +#### Debugging tools + +- `KERNBENCH_CCL_TRACE=1`: prints the send/recv trace +- automatic IPCQ pointer dump at simulation end +- wait-for graph dump on deadlock (10ms of simulated time exceeded) + +#### Common mistakes + +1. **Using a direction that was not installed**: if the topology in ccl.yaml is ring_1d, N/S cannot be used +2. **Unpaired send/recv**: with no recv on the peer side the sender hangs (slot full backpressure) +3. **dtype/shape mismatch**: not checked in the first implementation, the author's responsibility + +For the detailed step-by-step and a hello-world example, see `docs/onboarding/ccl-author-guide.md`. + +--- + +## HW Realization Notes (Informative) + +**Status of this section**: Forward-looking. Describes how the simulator +contract (D1–D15) would map to silicon. Not currently implemented; +subject to revision before tapeout. The simulator implements the +contract via Python/SimPy equivalents in +[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and +[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py). + +### D16. Proposed HW Block Diagram and End-to-End Dataflow + +![PE Baseline Architecture](../diagrams/pe_baseline.png) + +> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`. + +![PE Proposed Architecture](../diagrams/pe_proposed.png) + +> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`. 
+ +**Baseline → Proposed key changes**: + +- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW) +- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic) +- An explicitly **reserved IPCQ Slot Region** inside TCM +- Credit Injector / Receiver connect directly to the NoC through the Fabric Port + +#### End-to-End Sequence (HW view) + +```mermaid +sequenceDiagram + participant CPU_A as PE_A: PE_CPU + participant IPCQ_A as PE_A: IPCQ Ctrl + participant DMA_A as PE_A: DMA + participant NOC as NoC Fabric + participant DMA_B as PE_B: DMA + participant IPCQ_B as PE_B: IPCQ Ctrl + participant TCM_B as PE_B: TCM + participant CPU_B as PE_B: PE_CPU + + Note over CPU_A: tl.send(dir="E", src=0x1000) + + CPU_A->>IPCQ_A: MMIO: send request + Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size + IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head} + Note over IPCQ_A: my_head++ + IPCQ_A-->>CPU_A: send returns (fire-and-forget) + + Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr} + DMA_A->>NOC: IPCQ data flit(s) + + Note over NOC: hop latency + BW drain + + NOC->>DMA_B: IPCQ data flit(s) + Note over DMA_B: Terminal BW drain<br/>Slot write latency + + rect rgb(255, 240, 220) + Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall + DMA_B->>TCM_B: write data → slot address + DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr} + end + + Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1 + IPCQ_B-->>CPU_B: recv_wake signal + + Note over CPU_B: tl.recv(dir="W") wakes up + CPU_B->>IPCQ_B: recv request + Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size + IPCQ_B-->>CPU_B: return slot_addr + CPU_B->>TCM_B: read data from slot + Note over IPCQ_B: my_tail++ + + IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa} + Note over NOC: credit traversal (NoC latency) + NOC->>IPCQ_A: Credit arrival + + Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled) +``` + +### D17. IPCQ Controller HW Module (new) + +A hardware control block that sits between PE_CPU and the DMA Engine. It corresponds to +`PeIpcqComponent` in the simulator. + +#### QPair Register File + +Holds the per-direction queue pair state in flip-flops. PE_CPU can read and write it through MMIO (CSR), +and software fills it in at init time. + +``` +Per-direction registers (each 64-bit): + my_head — sender write position (monotonic) + my_tail — receiver read position (monotonic) + peer_head_cache — last known peer head (updated by Meta Extractor) + peer_tail_cache — last known peer tail (updated by Credit Receiver) + rx_base_pa — this PE's rx buffer base physical address + peer_rx_base_pa — peer's rx buffer base physical address + n_slots — ring depth (power-of-2 constraint, see D21) + slot_size — bytes per slot + peer_credit_tgt — credit receive address of the peer PE + +Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare) +Total: 8 dirs × 9 regs × 8B = 576B flip-flops +``` + +#### Slot Address Generator (combinational) + +``` +Input: pointer (my_head or my_tail), n_slots, slot_size, base_pa +Output: slot_addr = base_pa + (pointer % n_slots) * slot_size + +Implementation: + n_slots power-of-2 → pointer & (n_slots - 1) (AND mask, 1 gate) + slot_size power-of-2 → barrel shift (1 cycle) + 64-bit add → ripple/kogge-stone adder (1 cycle) + +Latency: 1-2 cycles combinational +``` + +#### Backpressure Comparator (combinational) + +``` +full = (my_head - peer_tail_cache) >= n_slots + +Implementation: 64-bit subtract + unsigned compare +Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold +Latency: 1 cycle +``` + +#### Meta Extractor (inbound datapath sideband) + +Wired into the inbound vc_comm path of the DMA Engine. It extracts the metadata from the header of each arriving IPCQ flit +and updates the queue pair state. 
+ +``` +Trigger: DMA inbound write completion (same cycle) +Extract: {sender_seq, dst_addr} from flit header + +Direction matching (ADR-0025 D2): + for each dir: + match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir]) + 8× parallel range comparators + priority encoder + +Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1) +Output: recv_wake signal → PE_CPU interrupt/flag +Latency: 1 cycle (pipelined with DMA write, so I6 atomicity holds by construction) +``` + +#### Credit Injector (outbound) + +``` +Trigger: recv completion (after my_tail is incremented) +Action: pack 16B credit packet → DMA vc_comm (or a dedicated credit VC) + +Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa} +Latency: 1 cycle to generate, then NoC traversal +``` + +#### Credit Receiver (inbound sideband) + +``` +Trigger: 16B credit packet arrival (from NoC) +Extract: {consumer_seq, dst_rx_base_pa} + +Direction matching (ADR-0025 D3): + for each dir: + match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa) + +Update: peer_tail_cache[matched_dir] = max(peer_tail_cache, consumer_seq) +Output: send_wake signal → deassert backpressure stall +Latency: 1 cycle +``` + +### D18. DMA Engine vc_comm IPCQ-aware Mode + +Adds an IPCQ flit handling mode to the existing vc_comm channel (D8). + +**Outbound**: + +1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}` +2. Read src_addr from TCM → snapshot into the DMA read buffer (standard DMA behavior) +3. Flit pack: data + piggyback metadata (sender_seq, dst_addr) +4. Inject into the NoC fabric port +5. Fire-and-forget (does not wait for completion) + +**Inbound**: + +1. Receive the IPCQ flit from the NoC +2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`) +3. Slot write latency charge (backing memory tier) +4. **ATOMIC** (same pipeline stage, no stall insertion): + - TCM write: data → slot address + - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller +5. 
Done + +**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger +occur in the same pipeline stage, so no separate synchronization is needed. The simulator rule +"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" (D9, I6) is +satisfied by construction. + +#### Data Snapshot Semantics + +Data latched in the DMA read buffer is unaffected by later modifications of the source memory. +This is standard DMA read-then-write behavior, so no additional HW is needed. + +#### Credit Virtual Channel (optional) + +- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits). +- **Option B**: add a third, dedicated credit VC (strict priority > data). + +Option B is better for deadlock prevention, but the BW impact of a 16B credit is negligible, so +Option A is also sufficient. + +### D19. Fabric Flit Format Extension + +``` +Normal data flit (e.g. 512-bit): +┌──────────────────────────────────────────┐ +│ [511:480] routing header (32b) │ +│ [479:0] payload (480b = 60B) │ +└──────────────────────────────────────────┘ + +IPCQ data flit (metadata in the first flit only): +┌──────────────────────────────────────────┐ +│ [511:480] routing header (32b) │ +│ [511] ipcq_flag (1b) │ ← distinguishes IPCQ from normal DMA +│ [510:509] vc_id (2b) │ +│ [508:480] route + hop count │ +│ [479:416] ipcq_metadata (64b) │ ← piggyback +│ [479:448] sender_seq (32b) │ +│ [447:416] dst_addr[31:0] (32b) │ ← for direction matching +│ [415:0] payload (416b = 52B) │ +└──────────────────────────────────────────┘ +Subsequent flits: full 60B payload (no metadata) + +Credit-only flit (128-bit, header-only): +┌──────────────────────────────────────────┐ +│ [127:96] routing header (32b) │ +│ [127] credit_flag (1b) │ +│ [95:64] consumer_seq (32b) │ +│ [63:0] dst_rx_base_pa (64b) │ +└──────────────────────────────────────────┘ +``` + +The payload of the first flit shrinks from 60B to 52B (13% overhead). In a multi-flit transfer +the subsequent flits carry a full payload, so the overhead for large transfers is < 1%. + +### D20. 
TCM IPCQ Slot Region Layout + +``` +TCM Memory Map (16MB): +┌─────────────────────────────┐ 0x000000 +│ Kernel Working Memory │ +│ (compute tensors) │ +│ ~14MB │ +├─────────────────────────────┤ 0xE00000 +│ IPCQ RX Buffers │ +│ Dir N: slots × slot_size │ +│ Dir S: slots × slot_size │ +│ Dir E: slots × slot_size │ +│ Dir W: slots × slot_size │ +│ ~1MB │ +├─────────────────────────────┤ 0xF00000 +│ IPCQ Metadata / Scratch │ +│ ~1MB │ +└─────────────────────────────┘ 0xFFFFFF +``` + +The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts with +compute accesses (see Risk D22). + +### D21. 2nm Implementation Analysis + +#### Area Estimate + +| Module | Gate Count | Area (2nm est.) | Notes | +|---|---|---|---| +| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops | +| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational | +| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators | +| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** | +| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack | +| **Total delta** | **~14.6K** | **~0.006 mm²** | | + +#### Timing + +| Path | Delay (2nm est.) | Target Clock | Margin | +|---|---|---|---| +| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× | +| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× | +| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× | +| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× | + +Every critical path fits within 1 cycle. No timing closure problem is expected. + +#### Power + +- Active: ~1 mW (register R/W + comparators, during send/recv) +- Idle: leakage only +- Negligible relative to total PE power + +#### Constraints + +| Item | Constraint | Rationale | +|---|---|---| +| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). An arbitrary value needs a divider (~10 cycles) | +| `slot_size` | **power of 2 recommended** | mul → barrel shift. An arbitrary value needs a multiplier | +| TCM IPCQ region | **placed in dedicated banks** | Avoids bank conflicts with compute accesses | + +### D22. 
Risk Assessment + +#### TCM Bank Conflict + +- **Risk**: a stall when an IPCQ slot write and a compute read hit the same bank +- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20) +- **Cost**: slightly reduced TCM banking flexibility +- **Severity**: Medium (performance impact), Low (not a correctness problem) + +#### Credit Return Latency under Congestion + +- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall +- **Mitigation**: + - Separate credits onto their own VC + strict priority (at 16B the BW impact is minimal) + - Or set n_slots generously (8+) so the buffer absorbs the credit delay +- **Severity**: Low (a 16B credit contributes almost nothing to congestion) + +#### Inter-Direction Ordering + +- **Risk**: ordering when the same PE sends in several directions at once +- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is the + responsibility of the kernel (software), the same as in the current simulator model (D2 + D4) +- **Severity**: Low (resolved by the architecture design) + +### D23. HW Alternatives Considered + +#### Doorbell + Polling (traditional approach) + +``` +Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell +Recv: Polling loop on doorbell, or interrupt-driven +``` + +| Pros | Cons | +|---|---| +| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) | +| Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) | +| | Polling wastes power, interrupts add latency overhead | + +**Assessment**: 2-3× higher latency than piggyback. **Rejected.** + +#### Hardware Message Queue (NVIDIA NVLink style) + +``` +Send: CPU → push a descriptor into the HMQ → HW delivers it to the peer HMQ automatically +Recv: pop a descriptor from the HMQ → check the data pointer +``` + +| Pros | Cons | +|---|---| +| The CPU only writes a descriptor | Needs a separate HMQ engine (~0.05 mm²) | +| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power | +| | Large tensors still need DMA in the end | + +**Assessment**: DMA is mandatory for the large-tensor pattern of CCL, so a dual HMQ + DMA structure +wastes area. 
**Rejected.** + +#### RDMA-style Completion Queue (CQ) + +``` +Send: DMA write → a CQE is generated automatically at the peer +Recv: CQ poll/interrupt → locate the data +``` + +| Pros | Cons | +|---|---| +| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead | +| Easy multi-tenancy/isolation | CQE/data ordering needs an additional guarantee | +| | Over-engineered for PE-to-PE CCL | + +**Assessment**: an RDMA CQ suits the multi-tenant isolation of a host-facing NIC. +In a single-owner environment between PEs it is unnecessary complexity. **Rejected.** + +#### Credit-in-Data Piggyback (v2 optimization candidate) + +In the current design, credit return is a separate 16B packet. In bidirectional communication +patterns, **the credit can be merged into the data flit going in the reverse direction.** + +``` +PE_A →E→ PE_B: data + sender_seq=3 +PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit merged into the data +``` + +| Pros | Cons | +|---|---| +| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback | +| Credit latency → 0 in bidirectional allreduce | 8B added to the flit header (minimal overhead) | +| | Slightly higher logic complexity | + +**Assessment**: a strong optimization of the current design. In bidirectional allreduce the credit packets +can be removed entirely. The standalone credit fallback is kept. **Recommended for adoption in v2.** + +### Open HW Questions + +- What fraction of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%) +- Should credits get a separate VC, or be multiplexed onto vc_comm? (see D18) +- Flit format compatibility on inter-SIP links needs to be verified +- Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM) + +--- + +## Non-goals + +- **Host collectives**: a model in which `dist.all_reduce` performs the data movement itself is + outside the scope of this ADR. This ADR covers only communication that happens inside PE kernels. +- **All-reduce algorithms**: the algorithms themselves (ring, tree, etc.) are covered in a separate ADR (or in kernel + code). This ADR defines only the infrastructure (IPCQ + VC). +- **Reliability/error handling**: send/recv failures, link faults, and the like are not covered. +- **Precise NoC arbiter model**: dynamic contention between VCs is outside the scope of the first implementation (D8). + +--- + +## Open Questions + +- **VC arbitration accuracy**: the first implementation uses deterministic chunk interleave + + weighted round-robin. In heavy contention scenarios the results may be more optimistic + than reality. If more precision is needed, consider introducing a separate NoC arbiter component. +- **Credit return fast path BW model**: in the first implementation the fast path is excluded from the fabric BW + contention model (a separate lightweight wire is assumed). 
If more precision is needed, + model the credit fast path as a separate link, or add a `credit_return_mode: piggyback` + option. +- **Memory allocation of the ring buffer slots**: wherever the slots live (TCM/HBM/SRAM), IPCQ needs to know + the metadata (base addr, slot_size, n_slots). Whether the host injects it over the sideband at init + or through a message similar to the fabric MmuMapMsg still has to be decided. +- **Default BW split between VCs**: an even split (50/50) or weighted (e.g. 80% compute, + 20% comm). It will be exposed in ccl.yaml, but the default value still has to be decided. +- **Number of directions**: fixed at 4 (N/S/E/W), 6 (+ Up/Down for 3D), + or extended to a variable N. The first implementation is fixed at 4 directions. +- **Multi-channel data structure (multi-tile aggregation)**: when one collective + receives a fan-out of several tiles, is the existing round-robin recv enough, or is a separate + primitive (`tl.recv_all`) needed? +- **Round-robin recv fairness**: the first implementation provides weak fairness based on a + last_polled_dir index. If data always arrives first on one direction, starvation is possible. If strict + fairness is needed, add a separate fairness counter. +- **Refining deadlock detection**: the first implementation is timeout-based. Later, real-time tracking of the + wait-for graph would allow deterministic deadlock detection. + +--- + +## Consequences + +### Positive + +- Direct PE-to-PE communication becomes possible → CCL kernels can be written +- The host only launches, synchronization happens inside the PEs → simple host code, strong + compute/comm overlap +- VCs remove HoL blocking → collective latency is not blocked by + compute traffic +- Buffer location and backpressure mode are selectable as init parameters → they can be benchmarked +- 4-direction logical neighbors → the host can freely map algorithms such as ring/mesh/tree + + +### Negative + +- One new component is added (PE_IPCQ), and PE_DMA is redesigned (VCs added) +- IPCQ consumes PE-local memory on the order of (8 ring × slot_size × n_slots) +- The VC arbitration model is a first-order approximation, so in heavy contention + scenarios the latency results may be slightly more optimistic than real HW (a limitation of D8) +- Chunk-level VC interleaving makes the PE_DMA implementation more complex diff --git a/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md b/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md new file mode 100644 index 0000000..b321e84 --- /dev/null +++ b/docs/adr-ko/ADR-0024-par-sip-tp-launcher.md @@ -0,0 +1,206 @@ +# ADR-0024: SIP-level Launcher — rank = SIP + +## Status + +Accepted + +## Context + +### Goal + +Align the participation unit (rank) of `torch.distributed` collective calls with the **SIP** (device) +boundary. The aim is bench code that reads **indistinguishably at the host level** from +a real PyTorch DDP/TP script. 
+ +Comparison with real PyTorch: + +| Dimension | real PyTorch | KernBench | +| --- | --- | --- | +| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each | +| `get_rank()` | `RANK` env var | greenlet-local registry | +| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology | +| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP | +| `mp.spawn` | OS process fork | greenlet fan-out | + +### Problems to solve + +1. **rank = SIP in the public API**, so that bench workers need not know about the PE concept. +2. **Greenlet-local rank/device tracking**: within the 1-process model, each + worker greenlet correctly identifies its own rank / its own SIP. +3. **Tensor placement = structural (sip, cube, pe)**: if a rank is a SIP, + the default tensor placement must also be expressed in structural coordinates. + +### Non-problems (outside this ADR) + +- IPCQ direction addressing → ADR-0025 +- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026 +- Megatron-style TP → ADR-0027 +- DTensor → ADR-0028 (future) +- Worker scheduling / `mp.spawn` / collective drain / exception cleanup + → ADR-0027 D0/D1 +- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032 + +## Decision + +### D1. rank = SIP (world_size resolution) + +```python +def _resolve_world_size(self) -> int: + if "world_size" in self._merged: + return int(self._merged["world_size"]) + defaults = self._cfg_all.get("defaults", {}) + if "world_size" in defaults: + return int(defaults["world_size"]) + spec = self.ctx.spec or {} + return int(spec.get("system", {}).get("sips", {}).get("count", 1)) +``` + +Priority: algorithm override > defaults override > SIP count. The `ccl.yaml` +override is kept as the legacy "rank = PE" test path. + +### D2. 
Greenlet-local rank registry (+ debug warning) + +```python +import os +import warnings + +class DistributedContext: + def __init__(self): + self._backend = None + self._rank_by_greenlet: dict = {} + + def _bind_rank(self, g, rank: int) -> None: + self._rank_by_greenlet[g] = int(rank) + + def get_rank(self) -> int: + self._ensure_initialized() + from greenlet import getcurrent + g = getcurrent() + if g not in self._rank_by_greenlet: + if os.environ.get("KERNBENCH_DEBUG"): + warnings.warn( + "get_rank() called outside a bound greenlet — returning 0. " + "Likely a bug unless running single-driver." + ) + return 0 + return int(self._rank_by_greenlet[g]) +``` + +### D3. `torch.ahbm.set_device(rank)`: SIP binding + +The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses +`torch.cuda.set_device(r)`, but since this is not CUDA we use an honestly-named +namespace. + +```python +class _AhbmNamespace: + """torch.ahbm — per-greenlet SIP device binding. + + Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since + KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent + API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime. + """ + + def __init__(self): + self._device_by_greenlet: dict = {} + + def set_device(self, device: int) -> None: + from greenlet import getcurrent + self._device_by_greenlet[getcurrent()] = int(device) + + def current_device(self) -> int | None: + from greenlet import getcurrent + return self._device_by_greenlet.get(getcurrent()) + +# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`. +# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`. +``` + +**PyTorch 2.x style supported in parallel**: recent PyTorch is moving toward the device-agnostic +`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`, +`torch.accelerator.current_device_index()`). For users who want to write code that is not tied +to a device vendor, KernBench supports this surface as well. + +```python +class _AcceleratorNamespace: + """torch.accelerator — device-agnostic API (PyTorch 2.x style). 
+ + Aliases torch.ahbm for bench code that prefers device-neutral idiom: + torch.accelerator.set_device_index(rank) + torch.accelerator.current_device_index() + """ + + def __init__(self, ahbm: _AhbmNamespace): + self._ahbm = ahbm + + def set_device_index(self, device: int) -> None: + self._ahbm.set_device(device) + + def current_device_index(self) -> int | None: + return self._ahbm.current_device() + +# RuntimeContext +self.ahbm = _AhbmNamespace() +self.accelerator = _AcceleratorNamespace(self.ahbm) # alias +``` + +Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유: + +```python +torch.ahbm.set_device(rank) # KernBench-native, explicit backend +torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic +``` + +### D4. Tensor placement = structural (sip, cube, pe) 좌표 + +`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성. +세부는 ADR-0026. + +```python +# RuntimeContext._create_tensor +current_sip = self.ahbm.current_device() # (D3 naming) +if current_sip is None: + current_sip = 0 # single-driver fallback (D2와 일관) +placement = resolve_dp_policy( + dp, shape=shape_2d, itemsize=itemsize, + num_pe=eff_num_pe, num_cubes=eff_num_cubes, + target_sip=current_sip, +) +``` + +Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적 +좌표를 직접 보유. ShardSpec 상세는 ADR-0026. + +--- + +## Dependencies + +- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원. +- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와 + ShardSpec의 구조적 좌표 표현. +- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`, + collective drain, exception cleanup의 구현 기준. + +--- + +## Non-goals + +- **IPCQ protocol 수정**: ADR-0023 유지. +- **DPPolicy 필드 정리**: ADR-0026. +- **Megatron-style TP**: ADR-0027. +- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1. +- **Collective algorithm 구현**: ADR-0032. +- **Multi-node (프로세스 간)**: 단일 프로세스. + +--- + +## Consequences + +### Positive + +- **Bench = real PyTorch DDP** (공개 API 관점). 
+- **Greenlet-local rank**: makes cross-rank correctness possible in the
+  1-process model.
+- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 /
+  ADR-0032) operate consistently on the `(sip, cube, pe)` 3-tuple.
+
+### Neutral
+
+- The IPCQ PE-level protocol (ADR-0023) is unchanged.
+- The IO_CPU role is unchanged (existing transit behavior as is).
diff --git a/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md b/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md
new file mode 100644
index 0000000..8a6afa0
--- /dev/null
+++ b/docs/adr-ko/ADR-0025-algo-ipcq-direction-addressing.md
@@ -0,0 +1,283 @@
+# ADR-0025: IPCQ Direction Addressing (address-based matching)
+
+## Status
+
+Accepted (Revision 2: address-based matching; peer_direction field dropped)
+
+## Context
+
+### Goal
+
+In the IPCQ protocol of ADR-0023, make the identification of **"which
+direction pair a transfer goes through"** consistently **address-based**,
+without depending on topology or dict order. It must work correctly on a
+2-rank bidirectional ring (and, in general, on any topology where several
+directions point to the same peer).
+
+### The bug that surfaced: 2-rank bidirectional ring
+
+`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
+point to the same peer.
+
+**Bug 1 (install)**:
+- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is correct
+  under the opposite-direction convention)
+- rank 0's E entry is set to
+  `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
+- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
+
+**Bug 2 (runtime)**:
+- Even if install sets the correct address, the receiver's
+  `_handle_meta_arrival` matches the direction by sender coordinates alone →
+  the first direction (E) wins
+- peer_head_cache[E] increases, peer_head_cache[W] does not change
+- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
+  IpcqDeadlock
+
+### Root cause
+
+The same problem on two axes:
+1. **Install-time pairing**: deciding "which of the peer's directions pairs
+   with my direction" depends on dict iteration order → fragile when several
+   directions point to the same peer
+2. **Runtime identification**: deciding "which qp must be updated" is based on
+   sender coordinates alone → ambiguous when directions are duplicated
+
+### Solution direction: address-based matching
+
+Each PE's rx buffer sits in an **address range that is unique per direction**
+(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
+
+- **Runtime**: match on the **dst_addr range** instead of the sender coord →
+  unambiguous
+- **Install**: a heuristic that prefers the opposite direction (the natural
+  symmetry of ring / mesh)
+- Duplicate metadata such as `peer_direction` is unnecessary; **the address is
+  the single source of truth**
+
+This design works **independently of the PhysAddr transition (ADR-0030)**.
+Whether addresses are today's synthetic ones or PhysAddr, it applies equally
+as long as the per-direction ranges stay unique.
+
+---
+
+## Decision
+
+### D1. Install: `reverse_direction` opposite-preference
+
+`src/kernbench/ccl/install.py`:
+
+```python
+# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
+# which were introduced by configure_sfr_intercube_multisip to keep
+# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
+_OPPOSITE_DIR = {
+    "E": "W", "W": "E", "N": "S", "S": "N",
+    "global_E": "global_W", "global_W": "global_E",
+    "global_N": "global_S", "global_S": "global_N",
+}
+
+def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
+    """Find peer's direction that reciprocates my_dir→peer_rank.
+
+    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
+    pointing back to us. This matters in 2-rank bidirectional rings
+    where both E and W on one side point to the same peer: without
+    the preference, the first-match-wins iteration would route data
+    into the wrong rx slot. Falls back to any direction pointing back
+    for topologies without an opposite convention (tree_binary's
+    parent/child).
+    """
+    nt = neighbor_table[peer_rank]
+    opp = _OPPOSITE_DIR.get(my_dir)
+    if opp is not None and nt.get(opp) == my_rank:
+        return opp
+    for d, target in nt.items():
+        if target == my_rank:
+            return d
+    return None
+```
+
+Call site:
+
+```python
+for d, peer_rank in nbrs.items():
+    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
+    if peer_dir is None:
+        continue
+    ...
+```
+
+### D2. Runtime: `_handle_meta_arrival` matches on dst_addr
+
+`src/kernbench/components/builtin/pe_ipcq.py`:
+
+```python
+def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
+    """Match incoming token to the receiver-side direction by dst_addr range.
+
+    Each direction has a unique rx buffer address range
+    [my_rx_base_pa, my_rx_base_pa + n_slots * slot_size). The token's
+    dst_addr (set by the sender's IPCQ when computing peer's slot address)
+    falls within exactly one such range. This address-based matching is
+    unambiguous even when multiple directions have the same peer
+    (2-rank ring).
+    """
+    token = msg.token
+    dst_addr = token.dst_addr
+    for d, qp in self._queue_pairs.items():
+        base = qp["my_rx_base_pa"]
+        size = qp["n_slots"] * qp["slot_size"]
+        if base <= dst_addr < base + size:
+            qp["peer_head_cache"] = max(qp["peer_head_cache"],
+                                        token.sender_seq + 1)
+            self._arrived_tokens.setdefault(d, []).append(token)
+            # Wake receivers blocked on this direction.
+            waiters = self._recv_waiters.get(d, [])
+            self._recv_waiters[d] = []
+            for ev in waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            # Wake receivers blocked on "any direction".
+            any_waiters = self._any_recv_waiters
+            self._any_recv_waiters = []
+            for ev in any_waiters:
+                if not ev.triggered:
+                    ev.succeed()
+            return
+    # Unknown dst_addr: diagnostic log (should not happen under correct install)
+```
+
+The sender-coordinate check is **removed**. `dst_addr` already determines the
+direction.
+
+### D3. Credit: add the `dst_rx_base_pa` field
+
+`src/kernbench/common/ipcq_types.py`:
+
+```python
+@dataclass(frozen=True)
+class IpcqCreditMetadata:
+    consumer_seq: int
+    dst_rx_base_pa: int  # NEW: matched against the original sender's peer.rx_base_pa
+    # Existing fields (kept for diagnostics / logging)
+    src_sip: int
+    src_cube: int
+    src_pe: int
+    src_direction: str
+```
+
+When a credit is generated (`_delayed_credit_send`): the receiver puts its own
+direction's `my_rx_base_pa` into `dst_rx_base_pa` (this is the
+`peer.rx_base_pa` the other side used when it sent).
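
The difference between sender-coordinate matching and the address-range matching of D2 can be shown on the 2-rank ring in isolation. This is a minimal sketch, not the KernBench implementation: `QueuePair`, `match_by_sender`, `match_by_dst_addr`, and the base addresses are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuePair:
    peer_rank: int       # rank this direction points at (hypothetical field)
    my_rx_base_pa: int   # start of this direction's rx buffer
    n_slots: int
    slot_size: int


def match_by_sender(qps, sender_rank):
    """Old behaviour: the first direction whose peer is the sender wins."""
    for d, qp in qps.items():
        if qp.peer_rank == sender_rank:
            return d
    return None


def match_by_dst_addr(qps, dst_addr):
    """D2 behaviour: the direction whose rx range contains dst_addr."""
    for d, qp in qps.items():
        base = qp.my_rx_base_pa
        if base <= dst_addr < base + qp.n_slots * qp.slot_size:
            return d
    return None


# Rank 0 in a 2-rank bidirectional ring: E and W both point at rank 1.
qps = {
    "E": QueuePair(peer_rank=1, my_rx_base_pa=0x1000, n_slots=4, slot_size=64),
    "W": QueuePair(peer_rank=1, my_rx_base_pa=0x1100, n_slots=4, slot_size=64),
}

# A token written into slot 2 of rank 0's W rx buffer.
dst_addr = 0x1100 + 2 * 64

sender_match = match_by_sender(qps, sender_rank=1)   # ambiguous: picks "E"
addr_match = match_by_dst_addr(qps, dst_addr)        # unambiguous: picks "W"
```

With sender coordinates alone both directions qualify, so dict order decides; the address lookup has exactly one candidate as long as the per-direction ranges are disjoint.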
+
+Receiving side (`_credit_worker`):
+
+```python
+def _credit_worker(self, env):
+    while True:
+        credit = yield self._credit_inbox.get()
+        for d, qp in self._queue_pairs.items():
+            # Find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa.
+            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
+                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
+                                            credit.consumer_seq)
+                waiters = self._send_waiters.get(d, [])
+                self._send_waiters[d] = []
+                for ev in waiters:
+                    if not ev.triggered:
+                        ev.succeed()
+                break
+```
+
+The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is
+unambiguous.
+
+### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
+
+`IpcqInitEntry.peer_direction`, proposed in ADR-0025 rev 1, is
+**unnecessary**. Reasons:
+- Meta arrival matches on dst_addr (D2)
+- Credit matches on dst_rx_base_pa (D3)
+- The qp does not need to store peer_direction
+- Install uses peer_dir only internally, when computing rx_base_pa
+  (`reverse_direction`)
+
+The IpcqInitEntry schema does not change. This is a **simplification**
+relative to rev 1.
+
+### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
+
+The existing `src_direction` field is not removed. It is kept for:
+- Logging / trace: the `(rank, t, dir, nbytes)` tuple in
+  `KERNBENCH_CCL_TRACE=1` output
+- Diagnostics: showing the direction in pointer_dump and similar tools
+- Room for future extension
+
+Runtime matching uses `dst_addr` only.
+
+### D6. Invariants (strengthening ADR-0023 I3)
+
+**I3 (strict)**: for each direction pair `(my_direction, peer_direction)`, my
+rx_base and the peer rx_base must point to **distinct direction slots**.
+Install must guarantee this (reverse_direction opposite-preference).
+
+**I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and
+`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers of
+different directions never overlap). This is the premise of the address-based
+matching in D2/D3.
+
+This can be verified at install time:
+```python
+# ccl/install_plan.py: assertion at the end of build_install_plans
+all_rx_ranges = set()
+for plan in plans:
+    for pe_install in plan.pe_installs:
+        for entry in pe_install.neighbors:
+            # Half-open range [start, end) of this direction's rx buffer.
+            r = (entry.my_rx_base_pa,
+                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
+            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
+            assert not overlap
+            all_rx_ranges.add(r)
+```
+
+---
+
+## Dependencies
+
+- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching
+  logic (D2, D3) and improves the install heuristic (D1). The semantic layer
+  of the IPCQ protocol does not change.
+- **ADR-0024** (launcher): the 2-rank bidirectional ring is actually used
+  under ADR-0024's ws=SIP_count model. This ADR makes that case work.
+- **ADR-0030** (PhysAddr transition, stub): **independent**; ADR-0025's
+  address-based matching works the same whether addresses are today's
+  synthetic ones or PhysAddr.
+
+---
+
+## Non-goals
+
+- **Moving the IPCQ address scheme to PhysAddr**: ADR-0030 scope. This ADR is
+  independent of how addresses are encoded.
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023 D5
+  is kept.
+- **Specializing the unidirectional ring**: `ring_1d_unidir` has only one
+  direction, so this bug does not apply.
+
+---
+
+## Open questions
+
+- **Address-matching performance**: `_handle_meta_arrival` and
+  `_credit_worker` scan the qps linearly (max 4 directions). The performance
+  impact is negligible. If it becomes a problem, switch to a dict lookup
+  (`_qp_by_rx_base`).
+- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep
+  a field that remains only for diagnostics, or to move it outside the token
+  into logging. Kept for now.
+- **Cost of install-time invariant verification**: the I3.1 check in D6 is
+  O((N_PE × N_direction)^2). It may become slow on large topologies → can be
+  improved with a data structure such as an interval tree. Start with the
+  simple implementation.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Simplicity**: the duplicate `peer_direction` metadata is removed. The
+  address is the single source of truth.
+- **Unambiguous matching**: works on every topology (including duplicated
+  directions).
+- **Minimal schema change**: `IpcqInitEntry` is unchanged;
+  `IpcqCreditMetadata` gains one field.
+- **Independent of the PhysAddr transition (ADR-0030)**: address-based
+  matching does not depend on the address encoding.
+- **Diagnostics preserved**: `IpcqDmaToken.src_direction` stays for logging.
+
+### Negative
+
+- Because runtime matching is now an address comparison, a debugging question
+  such as "why was W updated instead of peer_head_cache[E]?" requires tracing
+  address ranges (previously the direction name was enough). Mitigation:
+  include the "direction ↔ rx_base_pa" mapping in pointer_dump.
+
+### Neutral
+
+- The semantic layer of the IPCQ protocol (the sender computes dst_addr, the
+  receiver receives) is unchanged.
diff --git a/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md b/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md
new file mode 100644
index 0000000..d043f59
--- /dev/null
+++ b/docs/adr-ko/ADR-0026-par-dppolicy-intra-device.md
@@ -0,0 +1,288 @@
+# ADR-0026: DPPolicy = Intra-Device Only (remove the sip/num_sips fields)
+
+## Status
+
+Accepted (Revision 5: Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
+
+## Context
+
+### Goal
+
+Clarify `DPPolicy` as a pure intra-device abstraction that expresses only
+**cube × PE distribution inside one device (SIP)**. Distribution across SIPs
+(TP) is split into a separate layer (handled by
+`torch.ahbm.set_device(rank)` from ADR-0024 or by the Megatron parallel
+layers from ADR-0027).
+
+## Decision
+
+### D1. Remove the `sip` and `num_sips` fields from `DPPolicy`
+
+```python
+@dataclass(frozen=True)
+class DPPolicy:
+    """Intra-device (cube × PE) data-parallel policy.
+
+    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
+    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
+    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
+    """
+    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
+    num_pes: int | None = None
+    num_cubes: int | None = None
+```
+
+Removed fields: `sip`, `num_sips`.
+
+### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` fully removed
+
+`ShardSpec.pe_index` is currently a **global flat index**
+(`sip × cubes × pes + cube × pes + pe`). This is the form that ADR-0024 D4
+flagged as "abstraction leakage".
+
+This ADR redefines ShardSpec in **structural coordinates** and does **not
+keep** `pe_index`, not even as a property:
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+@dataclass(frozen=True)
+class ShardSpec:
+    """Structural shard placement: intra-SIP (cube × PE) coord.
+
+    Global-flat `pe_index` was removed in ADR-0026. Callers must use
+    structural coords (sip, cube, pe) directly. If a flat integer key is
+    needed (e.g. dict lookup), compute it explicitly at the call site.
+    """
+    sip: int    # structural: which SIP this shard lives on
+    cube: int   # local within SIP
+    pe: int     # local within cube
+    offset_bytes: int
+    nbytes: int
+```
+
+**Key principles**:
+- The identity of a ShardSpec is the `(sip, cube, pe)` 3-tuple.
+- **There is no `pe_index` property either**, which blocks silent semantics
+  drift.
+- Existing callers that expected the global flat index get an **immediate
+  `AttributeError`** on `.pe_index` access → they must migrate to structural
+  coordinates.
+- Local contexts that need a flat integer key (e.g. an internal dict lookup)
+  compute `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` explicitly
+  at the call site.
+
+**Justification for removing the property**: KernBench is an in-house project
+with a limited set of call sites. Compared with the risk of silent drift (the
+meaning changes while the type stays int), explicit breakage
+(AttributeError) is far safer.
+
+### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
+
+Implements the contract of ADR-0024 D4. No post-hoc shifting.
+
+```python
+# src/kernbench/policy/placement/dp.py (after)
+
+@dataclass(frozen=True)
+class _LocalPeShard:
+    """Internal: return value of the PE resolvers. Cube-local PE id + payload."""
+    local_pe: int  # cube-local PE index (0..num_pe-1)
+    offset_bytes: int
+    nbytes: int
+
+
+def resolve_dp_policy(
+    policy: DPPolicy,
+    *,
+    shape: tuple[int, int],
+    itemsize: int,
+    num_pe: int,
+    num_cubes: int = 1,
+    target_sip: int,  # NEW: states explicitly which SIP to place on
+) -> list[ShardSpec]:
+    """2-level resolution (cube × PE) on a specified SIP.
+
+    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
+    No SIP-level split: DPPolicy is intra-device only.
+    """
+    resolver = _PE_RESOLVERS[policy.pe]
+    all_shards: list[ShardSpec] = []
+
+    # Level 1: cube within SIP
+    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
+
+    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
+        # Level 2: PE within cube; resolver returns _LocalPeShard (local_pe)
+        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
+                                num_pe=num_pe)
+
+        for ls in local_shards:
+            all_shards.append(ShardSpec(
+                sip=target_sip,   # from caller (current_device)
+                cube=cube_id,     # local within SIP
+                pe=ls.local_pe,   # local within cube (explicit name)
+                offset_bytes=cube_offset + ls.offset_bytes,
+                nbytes=ls.nbytes,
+            ))
+
+    return all_shards
+```
+
+The **internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
+list of `_LocalPeShard`; the field name `local_pe` makes it **explicit that
+this is a cube-local PE identifier**. This resolves the earlier naming
+confusion with `ShardSpec.pe_index`.
+
+**Naming conventions** (whole ADR):
+- `ShardSpec.pe`: the final external API; cube-local PE (structural coord)
+- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
+- `pe_index`: **removed**. Kept nowhere, external or internal (a side effect
+  of blocking silent drift: the name never reappears).
+
+### D4. `_create_tensor`: placement directly in structural coordinates
+
+A continuation of ADR-0024 D4. Post-hoc shifting is removed, and the
+structural coordinates are specified directly at the `resolve_dp_policy` call.
+
+```python
+# context.py _create_tensor (after)
+current_sip = self.ahbm.current_device()
+if current_sip is None:
+    # Single-driver fallback (consistent with ADR-0024 D2).
+    # If launcher-based code forgets set_device(), tensors silently land on
+    # SIP 0, so warn in debug mode.
+    if os.environ.get("KERNBENCH_DEBUG"):
+        import warnings
+        warnings.warn(
+            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
+            "If this is a multi-rank launcher context, you likely forgot "
+            "torch.ahbm.set_device(rank) inside the worker.",
+            stacklevel=2,
+        )
+    current_sip = 0
+
+placement = resolve_dp_policy(
+    dp,
+    shape=shape_2d,
+    itemsize=itemsize,
+    num_pe=eff_num_pe,
+    num_cubes=eff_num_cubes,
+    target_sip=current_sip,  # ← structural coordinate set up front
+)
+
+# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
+# The former post-hoc shifting block is removed entirely.
+```
+
+**Every** tensor is placed on the current-device SIP. To create a multi-SIP
+tensor, use the TP primitives of ADR-0027.
+
+**Trade-off of the single-driver fallback**: defaulting to SIP 0 when
+set_device was never called is kept for compatibility with existing
+single-driver tests. With `KERNBENCH_DEBUG=1`, a warning makes it possible to
+detect the case where a launcher context forgot set_device and tensors are
+silently placed on the wrong SIP.
+
+### D5. Downstream: allocator lookup uses a structural tuple key
+
+Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
+
+```python
+for spec in placement:
+    alloc = allocators[spec.pe_index]  # ← AttributeError (property removed)
+```
+
+Since `pe_index` is gone, migration to structural coordinates is **forced**:
+
+```python
+for spec in placement:
+    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
+```
+
+The dict population in `_ensure_allocators` also uses tuple keys:
+
+```python
+# context.py _ensure_allocators (after)
+for sip_id in sip_range:
+    for cube_id in range(cubes_per_sip):
+        for pe_id in range(pes_per_cube):
+            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
+                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
+            )
+```
+
+`_free_tensor` follows the same rule: the existing
+`flat_idx = sip * ... + cube * ... + pe` block is removed and
+`(shard.sip, shard.cube, shard.pe)` is used directly.
+
+**Tuple vs a `PEIdentity` dataclass**: the tuple is recommended because it is
+simple and directly hashable. A `PEIdentity` value object has the advantage of
+an explicit type, but it adds boilerplate and would serve only as the key of
+the current allocator dict, which makes it over-engineering. Keep the tuple.
+
+### D7. Backward compatibility: none (cleanup ADR)
+
+This ADR is a **breaking change**.
+
+1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
+2. Accessing `ShardSpec.pe_index` → `AttributeError`
+
+Both are **immediate, explicit breakage**.
+There is no deprecation warning and no fallback path.
+KernBench is an in-house project with a limited set of call sites, so the
+migration happens in one step.
+
+**Blocking silent drift** is the main benefit of removing the property
+entirely: it eliminates the possibility that code expecting the global flat
+index receives a SIP-local result and silently indexes incorrectly.
+
+## Dependencies
+
+- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
+  provide the SIP placement mechanism. This ADR builds on that and narrows
+  DPPolicy to pure intra-device.
+- **ADR-0027** (Megatron TP): the alternative path when a tensor must span
+  several SIPs. After this ADR is applied, multi-SIP use cases move to
+  ADR-0027.
+
+---
+
+## Non-goals
+
+- **Redesigning `DPPolicy.cube` / `pe`**: the existing
+  replicate/column_wise/row_wise meanings are kept.
+- **Unifying tiling policies**: `tiled_column_major` / `tiled_row_major` stay
+  as they are.
+- **A new multi-device tensor abstraction**: DTensor-like work is ADR-0028.
+
+---
+
+## Open questions
+
+- **Default current_sip in `_create_tensor`**: whether a call without
+  set_device should fall back to rank=0 (SIP 0) or raise an error. The
+  recommendation is the fallback (compatibility with existing single-driver
+  tests).
+- **Scope of the `test_sip_parallel.py` rewrite**: moving the existing unit
+  tests to the launcher while keeping their intent needs additional
+  fixtures. Scoped as separate work.
+- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, the
+  concept of `num_sips` disappears altogether. The explicit answer is that
+  anyone who wants to express multi-SIP should use the TP primitives of
+  ADR-0027.
+
+**Resolved (open in earlier revisions)**:
+- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully removed** (D2)
+- ~~Dict key format of `_ensure_allocators`~~ → **tuple `(sip, cube, pe)`** (D5)
+
+---
+
+## Consequences
+
+### Positive
+
+- **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
+- **Simpler API**: the DPPolicy constructor shrinks by ~33% of its fields.
+- **Consistent structural coordinates**: ShardSpec is expressed as the
+  `(sip, cube, pe)` tuple → the abstraction leakage is resolved (satisfies the
+  ADR-0024 D4 contract).
+- **Unambiguous `pe_index` meaning**: SIP-local is the only interpretation. If
+  a global flat index is needed, state it explicitly.
+- **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is
+  the only mechanism that controls SIP boundaries.
+
+### Negative
+
+- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
+  `spec.pe_index` → `AttributeError`. Every caller must be fixed at once.
+- **ShardSpec schema change**: the single `pe_index` field becomes the three
+  fields `sip`/`cube`/`pe`.
+  Downstream code (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
+  the `allocators` dict key, etc.) needs cascading changes.
+- **No silent drift**: with the property fully removed, failures are immediate
+  at runtime → migration leakage is blocked at the source. (Not a negative
+  but an explicit tradeoff.)
+- Cost of rewriting `test_sip_parallel.py`.
+
+### Neutral
+
+- The meanings of the existing `cube` / `pe` fields are unchanged.
diff --git a/docs/adr-ko/ADR-0027-par-megatron-tp.md b/docs/adr-ko/ADR-0027-par-megatron-tp.md
new file mode 100644
index 0000000..7b04254
--- /dev/null
+++ b/docs/adr-ko/ADR-0027-par-megatron-tp.md
@@ -0,0 +1,888 @@
+# ADR-0027: Megatron-style Tensor Parallelism API
+
+## Status
+
+Accepted
+
+## Context
+
+### Goal
+
+Support tensor parallelism (TP) across SIPs through an API of **explicit
+parallel layers in the Megatron-LM style**. Declarative abstractions such as
+DTensor are future work in a separate ADR (0028).
+
+Why the Megatron style was chosen:
+- TP arises at specific layer boundaries of the model. Explicit primitives
+  fit the mental model naturally.
+- It is the industry standard established by NVIDIA Megatron / DeepSpeed.
+- DTensor is declarative, so its design space is larger → take it in stages.
+
+### TP primitive spec (following Megatron-LM)
+
+- **ColumnParallelLinear**: shards the **column (out_features)** axis of the
+  weight across the TP ranks. Input is fully replicated, output is
+  column-sharded. When a RowParallelLinear follows, there is no forward
+  all-reduce.
+- **RowParallelLinear**: shards the **row (in_features)** axis of the weight
+  across the TP ranks. The input is already column-sharded (the output of
+  ColumnParallel). An **all-reduce** is required at the end of forward.
+- **VocabParallelEmbedding**: shards the embedding along the vocab axis. An
+  all-reduce at the end of forward. (A stub in the initial scope; the real
+  implementation needs an all-gather kernel first.)
+- **`copy_to_tp_region`**, **`reduce_from_tp_region`**,
+  **`scatter_to_tp_region`**, **`gather_from_tp_region`**: the basic
+  primitives.
+
+### Problems to solve
+
+1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of
+   `dist.all_reduce` to every `ctx.wait` path. **This is the largest
+   architectural decision in this ADR**.
+
+2. **Normalizing the launcher API (D1)**: current benches use hand-rolled
+   greenlet loops. Absorb them into
+   `torch.multiprocessing.spawn(fn, args, nprocs)` to keep the real-PyTorch
+   API surface and to concentrate D0's scheduler drain in a single
+   implementation site.
+
+3. **Representing per-rank weight distribution**: each worker owns its own
+   slice of the weight tensor.
+   This is expressed naturally with ADR-0024's `set_device(rank)` plus
+   ADR-0026's intra-device DPPolicy.
+
+4. **Forward-only scope**: KernBench currently has no backward pass (its
+   purpose is simulation). This ADR supports **forward only** first.
+   Training simulation is a separate ADR.
+
+5. **Collective call sites**: RowParallelLinear calls `all_reduce` at the end
+   of forward. This works naturally given ADR-0024's multi-greenlet structure
+   plus the D0 generalization.
+
+6. **TP group concept**: Megatron combines DP × TP × PP groups. The initial
+   scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
+
+---
+
+## Decision
+
+### D0. Generalizing worker-wait: `ctx.wait` defers to main in a worker context
+
+**Problem restated**. `kernel_runner.run` captures `greenlet.getcurrent()` at
+spawn time as the kernel greenlet's `_parent`
+([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
+If `env.run` runs in the main context, parent=main and this is safe. If
+`env.run` runs in a worker context, parent=worker, and the moment the worker
+yields or finishes, the kernel greenlet is orphaned → `GreenletExit` → the
+`ring_default_ws` failure of ADR-0024 Phase B.
+
+**Solution**. When a worker greenlet calls `ctx.wait(h)`, it **yields to the
+main scheduler** instead of driving `env.run` directly. Main drives env.run,
+and once the handle completes, control returns to the worker.
+
+#### D0.1 `RuntimeContext` extension
+
+```python
+# context.py
+@dataclass
+class RuntimeContext:
+    ...
+    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
+```
+
+#### D0.2 The worker fork in `ctx.wait`
+
+```python
+def wait(self, handle, *, _meta=None):
+    # Fast-path: already completed, so skip enqueue + switch (consistent with
+    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
+    # and prevents redundant _pending_worker_waits growth.
+    if handle in self._completed:
+        completion, _trace = self.engine.get_completion(handle)
+        return completion
+
+    from greenlet import getcurrent
+    g = getcurrent()
+    if g.parent is not None and not g.parent.dead:
+        # Worker greenlet: defer to main. Push handle, yield to parent.
+        # Parent (scheduler loop) drains env.run, then switches back.
+        self._pending_worker_waits.append(handle)
+        g.parent.switch()
+        # On resume: handle must have completed (main drained the list).
+        # Fall through to the status-quo completion/trace assembly.
+
+    # Main context (or single-driver): drive engine directly.
+    wait_fn = getattr(self.engine, "wait", None)
+    if wait_fn is not None:
+        wait_fn(handle)
+    completion, trace = self.engine.get_completion(handle)
+    self._completed.add(handle)
+    if _meta is not None and trace is not None:
+        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
+        entry.update(_meta)
+        self._traces.append(entry)
+    return completion
+```
+
+#### D0.3 Worker-context semantic contract of `ctx.wait` (normative)
+
+This ADR **explicitly changes** the semantics of `ctx.wait` in a worker
+context.
+
+- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)`
+  does not guarantee "immediate completion" but "completion **after the next
+  scheduler drain**". The point at which the worker returns from `wait()` is
+  the point at which main has finished `engine.wait` for that handle. Calls
+  from the main context remain immediate and synchronous (status quo).
+- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, at the
+  point where `g.parent.switch()` returns and the worker resumes,
+  **`h in ctx._completed` must be True**. If this invariant breaks, the
+  worker proceeds from a stale state, so any change to `_drain_pending`, the
+  scheduler loop, or `ctx.wait` must preserve it. T3.b asserts this
+  invariant directly.
+- **Observable change**: inside a worker, the pattern
+  `h = ctx.submit(msg); ctx.wait(h); read(handle_result)` still holds. The
+  semantic specification now includes the fact that a main-drain is
+  automatically inserted between `wait()` and `read`.
+- **For direct reads of host objects, see D0.5**: the contract for calling
+  `tensor.numpy()` without `ctx.wait` is defined separately in D0.5.
+
+#### D0.4 Main scheduler drain: rules (normative)
+
+(This is the internal implementation of D1's `multiprocessing.spawn`. What
+follows is the semantic definition.)
+
+```python
+while alive:
+    for g in alive:          # (1) round-based worker switch
+        g.switch()
+    _drain_pending(ctx)      # (2) drain in main context
+```
+
+(For the actual definition of `_drain_pending`, see D0.5: an outer while-loop
+drains until both queues are empty.)
+
+**Rules**:
+
+1. **Round-based cooperative scheduling & yield obligation (worker contract)**.
+   `g.switch()` does not return until that worker **voluntarily yields**
+   (cooperative greenlet semantics). Therefore:
+   - If a worker runs a pure-compute loop such as
+     `while True: do_compute()` without yielding, `g.switch()` never returns
+     and **the scheduler loop itself hard-blocks** (no other worker gets a
+     switch, and no drain happens). This is not starvation but **scheduler
+     non-progress (equivalent to deadlock)**, and this ADR defines it as
+     **unsupported**.
+   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
+     host-read barrier (D0.5) within a finite number of steps. The `forward`
+     of a TP layer contains a launch→wait pair at the end of every layer, so
+     it satisfies this condition naturally. CCL kernels also yield inside
+     `dist.all_reduce`.
+   - The implementation need not **detect** violations (timeouts,
+     steps-since-yield counters, etc.). This is a user contract; the symptom
+     of a violation is "simulation hang".
+   - **Future extension**: if long non-collective compute paths become
+     common, an explicit `torch.distributed.cooperative_yield()` primitive (a
+     no-op yield) could be introduced. Out of scope for this ADR. It is not a
+     breaking change and can be added when needed.
+   - Within a round, every alive worker receives `switch` once. Even if one
+     worker calls wait several times within a single round, the handles are
+     enqueued sequentially within that turn and then processed together in
+     one scheduler drain (FIFO).
+
+2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a
+   strict FIFO via list append/pop(0). Handles drain in submission order, not
+   completion order; this is safe because the SimPy scheduler itself
+   guarantees a causally correct completion order. Do not confuse
+   `completion order` with `drain order`.
+
+   **Two-queue ordering (worker waits → collectives)**: `_drain_pending`
+   drains the worker-wait queue first and the collective queue second. The
+   rationale for this order:
+   - **The two queues have different dependency sources**: a worker wait is a
+     handle the worker itself created with a `submit + wait` pair (tensor
+     deploy, MmuMap, etc.). The collective queue holds kernel-launch handles
+     that `dist.all_reduce` enqueued internally; the worker does not wait on
+     them directly (see the two-queue drain model in D0.5).
+   - **Independent from a correctness standpoint**: from the worker's point
+     of view a collective is "already submitted, then yielded". Its
+     completion only has to precede the worker's next action. There is no
+     ordering dependency with the worker-wait queue.
+   - **Both complete within a single drain barrier**: per the
+     loop-until-empty rule of D0.5, one barrier invocation empties everything
+     in the order worker → collective → (repeat if anything new appeared).
+     When the worker resumes, both are drained.
+   - **The alternative (collectives first) is also possible**: this ADR fixes
+     worker-first only for simplicity of the current implementation;
+     semantically the two are equivalent. Revisit if a difference in the
+     performance profile is observed.
+
+3. **Duplicate enqueue: correctness relies on idempotent drain; dedup is not
+   guaranteed**.
+   `ctx.wait(h)` returns immediately if `h in ctx._completed`.
+   `_drain_pending` applies the same guard. Even if the same handle is
+   appended to `_pending_worker_waits` several times, the actual
+   `engine.wait` is called only once (idempotent).
+   - **Correctness**: relies on idempotent drain → safe.
+   - **Memory/performance**: this ADR does **not guarantee dedup** of
+     `_pending_worker_waits`. If the same handle is enqueued N times, the
+     queue holds N elements and drain performs N pops plus in-set guards.
+     Unless a single worker abnormally waits on the same handle repeatedly,
+     N is between 1 and a few.
+   - **Implementation freedom**: an implementation may optionally dedup
+     (e.g. keep a `set` as a side index, or check `h not in pending_set`
+     before appending). This is classified as an optimization that does not
+     change correctness.
+
+4. **Exception propagation + sibling cleanup**.
+   If a worker greenlet raises, `g.switch()` delivers the exception to main.
+   The scheduler loop stops immediately and performs the following cleanup
+   **explicitly**:
+
+   ```python
+   try:
+       while True:
+           alive = [g for g in gs if not g.dead]
+           if not alive:
+               break
+           for g in alive:
+               if not g.dead:
+                   g.switch()
+           _drain_pending(ctx)
+   except Exception as outer:
+       # (a) Force-terminate the surviving sibling worker greenlets.
+       for other in gs:
+           if not other.dead:
+               try:
+                   other.throw(SystemExit)
+               except Exception:
+                   pass  # silent: we are already handling an exception
+       # (b) Reset backend barrier / pending state (prepares for a future epoch barrier).
+       backend = getattr(ctx.distributed, "_backend", None)
+       if backend is not None and hasattr(backend, "_barrier"):
+           backend._barrier.reset()
+       backend_pending = getattr(backend, "_pending_collective_handles", None)
+       if backend_pending is not None:
+           backend_pending.clear()
+       ctx._pending_worker_waits.clear()
+       # (c) Wrap the root-cause exception in SpawnException.
+       raise SpawnException(errors) from outer
+   ```
+
+   Rules:
+   - **Sibling abort guarantee**: if one worker raises, `SystemExit` is
+     thrown into every sibling greenlet, and the greenlets terminate
+     immediately. No greenlet leak.
+   - **Explicit clear of the pending queues**: both the worker-wait and the
+     collective-pending queue are emptied. This prevents contamination on
+     reuse.
+   - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]`
+     holds each rank's original exception. Compatible with the failure
+     pattern of real-PyTorch `torch.multiprocessing.spawn`.
+   - **Scope limit**: `errors` contains **only the ranks that raised from
+     their own code (root causes)**. Ranks terminated by
+     `throw(SystemExit)` during sibling cleanup do not appear in `errors`
+     (SystemExit is not caught by the `try/except Exception` of the D1.2
+     entry wrapper; this is by design, since sibling termination is a cleanup
+     signal and not a failure). Stated explicitly so that readers do not
+     expect "every failed rank to be included".
+   - **`ctx._traces` is in a partial state, up to the point before the
+     exception**. Trace completeness is not guaranteed (some
+     launch/all_reduce calls may end without leaving an entry).
+   - **Allocator / MemoryStore** keep their pre-exception state; reuse is a
+     non-goal, and creating a new `RuntimeContext` is recommended.
+   - **`join=False` / retry / partial recovery** are non-goals of this ADR.
+
+   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
+
+   ```python
+   class SpawnException(RuntimeError):
+       def __init__(self, errors: dict[int, Exception]):
+           self.errors = errors
+           first = next(iter(errors.items()), None)
+           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
+                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
+           super().__init__(msg)
+   ```
+
+5. **Single-driver compatibility**. In main-only execution where
+   `g.parent is None` (legacy single-driver tests), the worker-fork condition
+   of D0.2 is false → the existing immediate-synchronous path is kept.
+   `_drain_pending` is not called.
+
+#### D0.5 Host-read barrier: decision (normative)
+
+Inside a worker, **host-observable reads** such as `tensor.numpy()`,
+`tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain
+barrier**. Immediately before the call:
+
+1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is
+   non-empty, yield to main with `g.parent.switch()` → main runs
+   `_drain_pending` → the worker resumes after completion.
+2. If both queues are empty, read immediately.
+
+**Barrier repetition rule (normative; re-entrance)**: `_drain_pending` drains
+in a while-loop **until both queues are completely empty**.
+It is not a single pass:
+
+```python
+def _drain_pending(ctx):
+    while ctx._pending_worker_waits or (
+        ctx.distributed._backend
+        and ctx.distributed._backend._pending_collective_handles
+    ):
+        # Queue 1: worker waits, in submission (FIFO) order.
+        while ctx._pending_worker_waits:
+            h = ctx._pending_worker_waits.pop(0)
+            if h not in ctx._completed:
+                ctx.engine.wait(h)
+        # Queue 2: collective kernel-launch handles.
+        backend = ctx.distributed._backend
+        if backend is not None:
+            while backend._pending_collective_handles:
+                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
+                ctx.wait(h, _meta=meta)   # main context: safe; ctx.wait does
+                                          # not push onto pending again
+```
+
+**Main-context ctx.wait non-recursion invariant (normative)**: the
+`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
+greenlet context. Because the worker-fork condition of D0.2
+(`g.parent is not None and not g.parent.dead`) is False, it takes the
+immediate-synchronous path → **it never enqueues into
+`_pending_worker_waits`**. Thanks to this invariant, the drain loop ends
+without recursion or queue regrowth. In the implementation it is important to
+keep `g.parent is None` as the guarantee of a single main greenlet.
+
+**Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so
+per D0.2 it **drives the engine directly** (no additional enqueue; see the
+invariant above). In theory a single pass therefore suffices, but the rule is
+fixed as **loop-until-empty**. Reasons:
+
+1. **Safety under future extension**: an implementation in which new pending
+   handles are enqueued during drain (e.g. a tree-reduce whose collectives
+   have sub-handles) may appear later. With the loop rule, correctness holds
+   in that case as well.
+2. **Readability**: the meaning closes in the single sentence "the barrier
+   drains until pending is empty". It does not depend on the non-trivial
+   invariant that `ctx.wait` calls enqueue nothing new.
+3. **The barrier's semantics are "all dependencies needed for this read are
+   complete"**: in the current model all pending handles are all
+   dependencies, so the two are identical. The user's mental model is the
+   former.
+
+**Termination guarantee**: described separately for two regimes.
+
+- **Current implementation**: when called in the main context, `ctx.wait`
+  drives the engine directly (D0.2) → it enqueues no new pending handle. Each
+  iteration strictly shrinks pending through `pop(0)` + `engine.wait`. The
+  number of iterations is **bounded by the initial pending size** → finite
+  termination.
+- **Future extension (the bound that justifies the loop rule)**: if an
+  implementation that enqueues new pending handles during drain (e.g.
+  tree-reduce sub-handles) is introduced, the initial-size bound no longer
+  holds.
However, SimPy causality guarantees that a handle's dependencies form a finite DAG, so
+  the **nested depth is finite**. The loop convention covers this case automatically.
+
+Both regimes guarantee that an infinite loop is impossible. The single-pass bound of the
+current implementation is only a reference value for aggressive optimization; the
+convention stays fixed at loop-until-empty.
+
+**Why implicit drain at read is the right choice**:
+
+- The earlier open question was a choice between (a) implicit drain and (b) an explicit
+  barrier. (b) is clear, but it burdens TP layer users with writing the three-step
+  `out = fc1.forward(x); ctx.drain(); result = out.numpy()` every time. (a) is a
+  single rule, "a read observes the committed value", and matches CUDA's
+  `cudaDeviceSynchronize before host copy` pattern. It is not a hidden rule but the
+  **contract of named entry-points**.
+- This ADR adopts (a) and **explicitly closes** the list of entry-points:
+  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
+  `Tensor.__repr__` (when it includes data). Any other official host-read API is to be
+  confirmed by a codebase search when this ADR is implemented. Every host-read API
+  added later must follow this contract (tests guard against regression).
+- The drain barrier also fires when a caller only does `ctx.submit` and then calls
+  `numpy` directly without `wait` (the handle is in the pending queue). Even if the
+  user omits the explicit wait, the invariant is restored at read time.
+
+**`Tensor.copy_(source)`: write barrier rule**:
+
+`copy_` is semantically a "write to target", but internally it calls `source.numpy()`
+to fetch the source data on the host and then writes each shard via
+`target._memory_store.write(...)`. Both directions are treated as barriers:
+
+1. **Source-side (read barrier)**: `source.numpy()` triggers the D0.5 read barrier
+   (when the source itself is a deployed tensor and there is pending work).
+2. **Target-side (write barrier, based on global pending)**: on entry to `copy_`, if
+   `ctx._pending_worker_waits` or `backend._pending_collective_handles` is non-empty,
+   drain via `g.parent.switch()` before the write. **The criterion is the global
+   pending queue, not per-tensor / per-shard dependency tracking**.
+   - Why global: KernBench's handle representation carries no back-reference of the
+     form "this handle writes shard X of the target". The safe, conservative rule is
+     "drain if anything is pending globally". As a result, **pending work on an
+     unrelated tensor can also block `copy_`**; the drop-in invariant takes priority.
+   - **Explicit tradeoff**: this rule can introduce unnecessary serialization even
+     between mutually independent tensors.
그러나 현 single-queue execution model + 하에서는 이 비용이 허용 가능 — cross-rank correctness와 "읽을 때 최신" + invariant를 단순한 규칙으로 보장하는 편이 우선. + - 실질적 영향: 단일 worker는 대부분 한 layer step 안에서 pending이 주로 + 자기 작업 — over-barrier로 인한 추가 context switch는 round 끝 scheduler + drain 시점과 일치하는 경우가 많아 큰 문제 안 됨. + - Future refinement: per-tensor pending tracking을 도입하면 이 규약을 + 좁힐 수 있으나 본 ADR scope 밖. + +**Non-barrier**: + +- `tensor.shape`, `tensor.dtype`, `tensor.name` 등 **metadata-only** 접근은 + drain하지 않음. 데이터 의존성이 없음. +- `tensor.pa`, `tensor.va` 등 raw address accessor도 drain하지 않음 (주소만, + 내용 아님). + +**공식 barrier entry-point (closed set)**: + +| API | Kind | Rationale | +|---|---|---| +| `Tensor.numpy()` | read | host-observable copy | +| `Tensor.data` | read | `numpy()` alias | +| `Tensor.__getitem__` | read | shard-aligned read | +| `Tensor.__repr__` (data 포함 시) | read | debugging/log | +| `Tensor.copy_(source)` | read + write | source read + target write | + +이 contract를 T5/T6에서 직접 검증. + +#### D0.6 왜 worker 함수 API는 불변인가 (informative) + +- `torch.zeros(...)` 내부는 `self.submit(msg)` + `self.wait(h)` 쌍. `wait`가 + D0.2/D0.3에 따라 자동으로 main-defer → 겉보기 동기적으로 보이지만 한 번 + yield. +- `tensor.numpy()`는 D0.5에 따라 host-read barrier → pending이 있으면 + drain→read, 없으면 즉시 read. +- `dist.all_reduce`는 기존 `_defer_wait=True` + `_pending_collective_handles` + 경로를 그대로 사용. D0.4의 drain이 두 큐를 함께 처리. + +#### D0.7 불변 조건 (invariants) + +- **kernel greenlet의 `_parent`는 항상 main**: env.run이 worker 컨텍스트에서 + 절대 돌지 않기 때문. (T3의 핵심 assertion.) +- **cross-rank 동기 지점**: 모든 worker가 yield한 뒤에만 drain → 모든 rank의 + kernel이 한 라운드에 함께 진행 (cross-rank IPCQ 교환의 필수 조건). +- **Single-driver 호환**: D0.4-(5). + +### D1. `torch.multiprocessing.spawn(fn, args, nprocs)` + +Real-PyTorch API 파리티 + D0의 scheduler loop의 단일 구현 위치. + +#### D1.0 API parity only — execution parity 아님 (normative) + +`torch.multiprocessing.spawn` 이름은 **API signature parity**에 한정된다. +실제 실행 모델은 **cooperative greenlet scheduler** (단일 Python 프로세스, +단일 OS 스레드, D0.4의 round-robin drive)이다. 
다음은 **본 ADR이 제공하지 +않는 속성** — real-PyTorch `torch.multiprocessing.spawn`이 보장하는 것 중 +명시적으로 **non-goal**: + +- 프로세스 격리 (independent OS process per rank). +- 독립 address space (각 rank가 자기 Python heap 보유). +- Failure isolation (한 rank의 hard crash가 다른 rank 영향 없음). +- OS-level scheduler fairness (rank 간 preemptive time slicing). +- `mp.Queue`, `mp.Lock` 등 inter-process primitive. + +이 구현의 실제 성질: + +- 모든 rank는 같은 Python 프로세스 안의 greenlet. shared global state가 + 그대로 보임 (의도된 simulation convenience). +- GIL 하의 단일 스레드 → parallel execution 아님. SimPy 이벤트 순서로 + "논리적 동시성"만 재현. +- 한 worker에서 unhandled exception → 전체 simulation 중단 (D0.4-(4)). + +**호출자 의무**: real-PyTorch multi-process 샘플을 KernBench로 이식할 때 +프로세스 격리에 의존하는 로직 (예: `os.getpid`, 독립 임시 파일, 신호 처리 +등)은 지워야 한다. Namespace 이름은 코드 이식성을 위해 유지 — 세만틱은 +다르다. + +#### D1.1 Public surface + +```python +# runtime_api/multiprocessing.py (new) +class _MultiprocessingNamespace: + def __init__(self, ctx): + self._ctx = ctx + + def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None: + """Spawn `nprocs` worker greenlets, each calling fn(rank, *args). + + Mirrors torch.multiprocessing.spawn signature (minus `daemon`). + Drives the D0 scheduler loop until all workers finish. + """ + ... 
+```
+
+#### D1.2 Implementation
+
+```python
+def spawn(self, fn, args, nprocs, join=True):
+    from greenlet import greenlet
+    ctx = self._ctx
+    dist = ctx.distributed
+    gs: list[greenlet] = []
+    errors: dict[int, Exception] = {}
+    for rank in range(nprocs):
+        def _entry(r=rank):
+            try:
+                fn(r, *args)
+            except Exception as e:
+                errors[r] = e
+                raise
+        g = greenlet(_entry)
+        dist._bind_rank(g, rank)
+        gs.append(g)
+
+    try:
+        while True:
+            alive = [g for g in gs if not g.dead]
+            if not alive:
+                break
+            for g in alive:
+                if not g.dead:
+                    g.switch()
+            _drain_pending(ctx)  # D0.5
+    except Exception as outer:
+        # Sibling cleanup per D0.4-(4)
+        for other in gs:
+            if not other.dead:
+                try:
+                    other.throw(SystemExit)
+                except (Exception, SystemExit):
+                    # SystemExit is not an Exception subclass: when the sibling
+                    # does not catch it, it propagates back to this (parent)
+                    # greenlet and must be swallowed here, or SpawnException
+                    # below is never raised.
+                    pass
+        backend = getattr(dist, "_backend", None)
+        if backend is not None:
+            if hasattr(backend, "_barrier"):
+                backend._barrier.reset()
+            if getattr(backend, "_pending_collective_handles", None) is not None:
+                backend._pending_collective_handles.clear()
+        ctx._pending_worker_waits.clear()
+        raise SpawnException(errors) from outer
+    # `join=True` semantics: we already wait for all workers.
+```
+
+#### D1.3 `torch` namespace attach
+
+In `__post_init__` of `runtime_api/context.py`:
+```python
+self.multiprocessing = _MultiprocessingNamespace(self)
+```
+
+→ bench code then calls `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
+
+#### D1.4 Migrating the existing bench
+
+The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
+`torch.multiprocessing.spawn` call. The existing matrix regression stays as is. The
+currently xfail `ring_default_ws` is expected to flip to PASS thanks to D0 (workers no
+longer orphan kernel greenlets).
+
+### D2.
New package `kernbench.tp`
+
+```
+src/kernbench/tp/
+  __init__.py        # public API re-exports
+  parallel_state.py  # TP group management (currently a single global group)
+  layers.py          # ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
+  primitives.py      # copy/reduce/scatter/gather_to/from_tp_region
+  kernels.py         # gemm kernel launched by the TP layers (reusable)
+  mappings.py        # forward identity/all_reduce, backward stub
+```
+
+### D3. `parallel_state`: TP group
+
+```python
+# parallel_state.py
+_TP_WORLD_SIZE = None
+
+def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
+    """Initialize TP group. Must be called after dist.init_process_group."""
+    global _TP_WORLD_SIZE
+    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
+    dist = get_dist()
+    total = dist.get_world_size()
+    if tensor_model_parallel_size != total:
+        raise NotImplementedError(
+            "Only TP == world_size supported in initial scope"
+        )
+    _TP_WORLD_SIZE = tensor_model_parallel_size
+
+def get_tensor_model_parallel_world_size() -> int:
+    if _TP_WORLD_SIZE is None:
+        # Fail loudly instead of returning None from an `-> int` function.
+        raise RuntimeError("initialize_model_parallel() has not been called")
+    return _TP_WORLD_SIZE
+
+def get_tensor_model_parallel_rank() -> int:
+    from kernbench.runtime_api.distributed import get_dist
+    return get_dist().get_rank()  # ADR-0024 greenlet-local rank
+```
+
+Initial scope: TP size = world_size = topology SIP count. Pure TP model.
+
+### D4-pre.
TP shard ownership vs DPPolicy — 역할 분리 (normative) + +TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다: + +| 개념 | 결정 주체 | 범위 | +|---|---|---| +| **TP shard ownership** (어느 rank가 weight의 어떤 slice를 소유하는가) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** | +| **Intra-rank placement** (소유된 slice를 rank 내부에서 cube × PE로 어떻게 분산하는가) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **한 rank 내부 (SIP 경계 안)** | + +따라서 `ColumnParallelLinear`가 `(in_features, out_features // ws)` shape로 +weight를 생성하고 `DPPolicy(cube="column_wise", pe="column_wise")`를 부여 +하면: + +- **Rank r**이 소유하는 slice = weight의 column 축 [r * k_local, (r+1) * + k_local) — **set_device(r)**가 이걸 결정 (해당 rank가 SIP r에 존재). +- **그 slice 내부**에서 cube × PE column-wise 분산 — **DPPolicy**가 이걸 + 결정. + +두 축은 **독립적**이다. 같은 DPPolicy로 두 rank가 자기 slice를 만들면 +slice 자체는 다른 SIP에 있지만 intra-SIP placement 패턴은 동일. 반대로 +DPPolicy를 `cube="replicate", pe="replicate"`로 바꿔도 TP shard ownership은 +유지되고 intra-rank placement만 달라짐. + +**이 경계가 흐려지는 실수** (본 ADR이 금지): + +- DPPolicy에 "SIP 축"이 다시 등장 (ADR-0026에서 제거됨). +- TP layer가 `set_device` 없이 `DPPolicy`만으로 cross-rank sharding을 + 표현 → 단일 rank 안에서 세로로 자른 것과 구분 안 됨. + +본 ADR의 TP layer는 항상 "rank = SIP = one slice 소유 + DPPolicy intra-SIP +분산" 관점에서만 weight/output을 다룬다. + +### D4. `ColumnParallelLinear` + +**중요**: host-side `torch.matmul` 추상화를 신규 도입하지 않는다. layer의 +forward는 `torch.launch("gemm", gemm_kernel, ...)`로 기존 gemm kernel을 +호출 — KernBench bench들이 이미 쓰는 패턴 +([benches/gemm_single_pe.py](benches/gemm_single_pe.py), +[benches/gpt3_qkv.py](benches/gpt3_qkv.py)). + +```python +# layers.py +from kernbench.policy.placement.dp import DPPolicy +from kernbench.tp.kernels import _gemm_kernel +from kernbench.tp.parallel_state import ( + get_tensor_model_parallel_rank, + get_tensor_model_parallel_world_size, +) + +class ColumnParallelLinear: + """Weight의 K(out_features) 축을 TP rank에 분산. 
+ + forward(x): + x: (M, N) — full-replicated across ranks + W_k: (N, K / world_size) — rank-local slice (set_device로 SIP r에 거주) + y_k = x @ W_k → (M, K / world_size) — rank-local output + + 출력은 column-sharded. RowParallelLinear가 기대하는 입력 형태. + """ + + def __init__(self, in_features: int, out_features: int, bias: bool = False, + dtype: str = "f16", torch=None): + ws = get_tensor_model_parallel_world_size() + assert out_features % ws == 0 + self.in_features = in_features + self.k_local = out_features // ws + self._torch = torch + # 각 rank가 자기 slice 소유 — set_device(rank)에 의해 SIP r에 배치. + self.weight = torch.zeros( + (in_features, self.k_local), dtype=dtype, + dp=DPPolicy(cube="column_wise", pe="column_wise"), + name="col_parallel_w", + ) + self.bias = None + if bias: + self.bias = torch.zeros( + (self.k_local,), dtype=dtype, + dp=DPPolicy(cube="replicate", pe="replicate"), + name="col_parallel_b", + ) + + def forward(self, x): + # x는 full-replicated (caller 보장). 단순 local gemm. + M = x.shape[0] + out = self._torch.empty( + (M, self.k_local), dtype=x.dtype, + dp=DPPolicy(cube="column_wise", pe="column_wise"), + name="col_parallel_out", + ) + self._torch.launch( + "col_parallel_gemm", _gemm_kernel, + x, self.weight, out, M, self.in_features, self.k_local, + ) + # bias add는 별도 kernel 혹은 composite gemm의 fused bias. + # 초기 scope에서는 bias=False만 충분히 검증. + return out +``` + +**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`는 한 번의 +`torch.launch` 호출로 kernel launch → 내부 `ctx.wait` 쌍을 포함한다. 이는 +D0.4-(1)의 "worker는 유한 step 내 yield" 조건을 자동으로 만족 — TP layer +사용자가 yield 패턴을 수동으로 삽입할 필요 없음. + +### D5. `RowParallelLinear` + +```python +class RowParallelLinear: + """Weight의 N(in_features) 축을 TP rank에 분산. 
+ + forward(x): + x: (M, N / world_size) — rank-local slice (ColumnParallel의 출력) + W_k: (N / world_size, K) — rank-local slice + y_k = x @ W_k → (M, K) — partial sum on each rank + y = all_reduce(y_k, op="sum") → (M, K) on every rank + """ + + def __init__(self, in_features: int, out_features: int, bias: bool = False, + dtype: str = "f16", torch=None): + ws = get_tensor_model_parallel_world_size() + assert in_features % ws == 0 + self.n_local = in_features // ws + self.out_features = out_features + self._torch = torch + self.weight = torch.zeros( + (self.n_local, out_features), dtype=dtype, + dp=DPPolicy(cube="column_wise", pe="column_wise"), + name="row_parallel_w", + ) + # bias는 rank 0에만 (Megatron convention). 초기 scope에서는 생략. + self.bias = None + + def forward(self, x): + M = x.shape[0] + y_partial = self._torch.empty( + (M, self.out_features), dtype=x.dtype, + dp=DPPolicy(cube="column_wise", pe="column_wise"), + name="row_parallel_partial", + ) + self._torch.launch( + "row_parallel_gemm", _gemm_kernel, + x, self.weight, y_partial, M, self.n_local, self.out_features, + ) + # Cross-rank reduce. ADR-0024의 dist.all_reduce는 D0 + mp.spawn 하에서 + # 정상 동작 (kernel parent = main 유지). + self._torch.distributed.all_reduce(y_partial, op="sum") + return y_partial +``` + +**Yield-safety contract (normative)**: `RowParallelLinear.forward`는 launch → +내부 wait에 이어 `all_reduce` (defer + worker yield 패턴)까지 포함하므로 forward +한 번당 **최소 2회 yield**가 보장됨. D0.4-(1)의 scheduler progress 조건 자동 +만족. 모든 본 ADR의 TP layer forward는 "최소 하나의 wait 또는 collective를 +포함해 yield-safe하다"를 invariant로 유지한다 — 이후 추가되는 TP primitive +(VocabParallelEmbedding 등)도 동일 계약 필수. + +### D6. Primitive 함수 + +```python +# primitives.py +def copy_to_tp_region(x): + """Forward: identity. Backward: all-reduce. (Training 추가 시 구현).""" + return x + +def reduce_from_tp_region(x, torch): + """Forward: all-reduce. 
Backward: identity.""" + torch.distributed.all_reduce(x, op="sum") + return x + +def scatter_to_tp_region(x): + raise NotImplementedError( + "Phase 2: 사용자가 이미 sharded tensor를 생성하는 것으로 대체" + ) + +def gather_from_tp_region(x): + raise NotImplementedError( + "Phase 2: all-gather kernel 선행 필요 (future)" + ) +``` + +### D7. 샘플 bench — 2-layer MLP with TP + +```python +# benches/tp_mlp.py (신규) +from kernbench.policy.placement.dp import DPPolicy +import kernbench.tp as tp +import numpy as np + + +def worker(rank: int, world_size: int, torch): + torch.ahbm.set_device(rank) + tp.initialize_model_parallel(world_size) + + B, D_in, D_hidden, D_out = 1, 512, 2048, 512 + fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch) + fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch) + + x = torch.zeros( + (B, D_in), dtype="f16", + dp=DPPolicy(cube="replicate", pe="replicate"), + name="x", + ) + # init x with some pattern (e.g., constant) + x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16))) + + h = fc1.forward(x) # column-sharded (B, D_hidden / ws) + y = fc2.forward(h) # all-reduced (B, D_out) on every rank + + # rank 0만 결과 출력 / 검증 + if rank == 0: + result = y.numpy() + # 실제 검증 값은 zero-init weight이면 전부 0 — scope에서는 "완료 자체" 검증 + print(f" tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}") + + +def run(torch): + torch.distributed.init_process_group(backend="ahbm") + ws = torch.distributed.get_world_size() + torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws) +``` + +### D8. Non-functional — training 미지원 + +본 ADR은 **inference/forward only**. Backward / gradient / optimizer는 future. +기존 KernBench가 training이 아니므로 자연스러움. + +### D9. 초기 scope 제약 + +- TP size = world_size (mixed DP+TP 없음). +- `scatter_to_tp_region`, `gather_from_tp_region`은 unimplemented. +- **Weight 기본값은 zero**. 적절한 init scheme (Xavier, Kaiming 등)은 future. + 단 테스트는 `tensor.copy_`로 결정론적 non-zero pattern을 주입해 numerical + correctness를 검증 (T2/T6). 
In other words, operation is split into "production default = zero, verification =
+  deterministic non-zero".
+- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is future work).
+- Pipeline parallelism is out of scope.
+- VocabParallelEmbedding needs all-gather first → stub only.
+
+### D10. Regression: lifting the `ring_default_ws` xfail (mandatory acceptance)
+
+Thanks to D0 (worker-wait generalization) + D0.5 (host-read barrier), every worker-driven
+`ctx.wait` and host-read is routed through the main-drain path → the cause of the
+kernel-greenlet orphan in ADR-0024 Phase B disappears. Flipping the existing matrix test's
+`ring_default_ws` strict-xfail case to **PASS** after this ADR is implemented is included
+as a **mandatory regression criterion**. Observable acceptance criteria are specified in
+**T7** (no deadlock, no GreenletExit, numerical tolerance, etc.).
+
+---
+
+## Dependencies
+
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
+- **ADR-0026** (DPPolicy intra-device): representation of the per-rank slice of a weight tensor.
+- **ADR-0023 / ADR-0025** (IPCQ): the basis of the `dist.all_reduce` implementation.
+
+---
+
+## Non-goals
+
+- **Backward pass / training**: inference only. Training simulation belongs in a separate ADR.
+- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
+- **Weight init schemes**: plain zero / debug pattern.
+- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
+- **DTensor integration**: ADR-0028, future.
+- **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm kernel via
+  `torch.launch(gemm_kernel, ...)`. No new matmul host-op is introduced.
+
+---
+
+## Open questions
+
+- **Location of `initialize_model_parallel`**: `kernbench.tp.initialize_model_parallel`
+  (current decision) vs real-PyTorch's `torch.distributed.init_device_mesh`. It stays
+  in the TP-specific module.
+- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be needed for
+  meaningful verification; add it in the Phase 1 tests if required.
+- **Bias placement policy**: Megatron places the RowParallelLinear bias on rank 0 only.
+  The initial scope sidesteps this with bias=False.
+- **Location of the GEMM kernel**: `kernbench.tp.kernels._gemm_kernel` vs importing from
+  the existing `benches/gemm_single_pe.py`. TP must not depend on a bench, so the kernel
+  is duplicated inside tp. It can later move to a shared `kernbench.kernels` package.
+
+**Resolved (items that were open in earlier revisions)**:
+- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the official
+  host-read entry-points (`numpy`, `data`, `__getitem__`, and `__repr__` when it
+  includes data) are automatic drain barriers. Metadata-only accessors are not barriers.
+
+---
+
+## Consequences
+
+### Positive
+
+- **Easier porting of Megatron code**: the API matches real training code.
+- **TP benchmarks become possible**: studying HW characteristics such as scaling and
+  communication-compute overlap.
+- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024 Phase B blocker
+  is resolved.
+- **Single scheduler loop**: introducing D1 (`mp.spawn`) removes the hand-rolled loop.
+  Follow-up collective/TP benches reuse the same pattern.
+- **Clearer DPPolicy semantics** (synergy with ADR-0026): a reference example of a TP
+  layer that uses only intra-device DPPolicy.
+
+### Negative
+
+- Maintenance cost of a new module (`kernbench.tp`).
+- The initial scope is limited (pure TP only, forward only).
+- The D0 generalization changes the semantics of `ctx.wait`; compatibility with
+  single-driver tests must be verified explicitly (T7).
+
+### Neutral
+
+- Adds a pure upper layer on top of ADR-0024/0026. No impact on the hardware simulation
+  stack (except D0).
diff --git a/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md b/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md
new file mode 100644
index 0000000..bb6ba3c
--- /dev/null
+++ b/docs/adr-ko/ADR-0032-algo-intercube-allreduce.md
@@ -0,0 +1,256 @@
+# ADR-0032: Intercube All-Reduce - pe0 cube-mesh reduce + multi-SIP exchange
+
+## Status
+
+Accepted (supersedes ADR-0029).
+
+## Context
+
+### Goal
+
+Define a single all-reduce algorithm that exploits the topology hierarchy:
+cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
+one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
+
+### Why replace ADR-0029 (hierarchical 3-level)
+
+ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
+where every PE in the system participates. In practice this adds the
+intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
+without matching the common workload pattern where the tensor is sharded
+**per cube** (not per PE within a cube).
+ +Moreover, the hierarchical design required: +- per-PE neighbor graph installation (`_build_pe_installs` multi-level) +- multi-level topology schema (`hierarchical_3level`) +- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure + +The intercube algorithm below removes all of that: **pe0-only same-lane +intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the +root cube, then broadcast back. Simpler kernel, simpler wiring, same +bandwidth characteristics for the common per-cube DP workload. + +### Current state + +- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel +- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip` +- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this + automatically at `init_process_group` time. +- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`, + `hierarchical_allreduce` modules and their tests are **removed**. + +--- + +## Decision + +### D1. Algorithm structure — 5 phases + +For each SIP (launched concurrently by `mp.spawn`): + +``` +Phase 1 — Row reduce W → E (cube mesh, pe0 only): + col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum. + +Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1): + row=0 sends S → row=1 accumulates, sends S → ... → root cube (15) + holds the full SIP sum. + +Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only): + Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast — + selected by sip_topo_kind (from topology.yaml sips.topology). + +Phase 4 — Col broadcast S → N on rightmost column. + +Phase 5 — Row broadcast E → W across the cube mesh. +``` + +After all phases every cube's pe0 holds the global sum. + +The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}` +(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical +across topologies; only phase 3 branches. 
Helper functions +`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the +three exchange patterns. + +### D2. Tensor layout (rank = SIP, per-worker) + +Per ADR-0024 rank = SIP at the process-group level. Each worker allocates +its own cube-mesh-spanning tensor: + +```python +dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1) +tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp) +``` + +Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses +each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`. + +### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip` + +Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor +tables for **every cube's pe0 across every SIP** — regardless of which +cube is the root or which SIP topology is selected. This lets the kernel +elect the root cube at runtime and supports topology switches without +re-wiring. + +| Level | Direction labels | Scope | +|---|---|---| +| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) | +| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` | + +Inter-SIP directions use the `global_*` prefix to keep the namespace +disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended +with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse- +direction resolver handles 2-SIP bidirectional rings correctly. + +Internally the function calls `install_ipcq` with: +- `world_size = n_sips × n_cubes` +- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]` +- A closure-captured `neighbors()` function that builds the map above. + +This `world_size` is internal to IPCQ wiring and does not leak to the +process-group rank. + +### D4. 
SIP topology — from `topology.yaml` + +```yaml +system: + sips: + count: 2 + topology: ring_1d # or torus_2d, mesh_2d_no_wrap +``` + +- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`. +- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on + `global_E/W` then col ring on `global_S/N`. +- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce + + broadcast per dimension. + +2D variants require `n_sips` to be a perfect square. + +### D5. Process-group integration — `AhbmCCLBackend` + +At `init_process_group` time the backend: + +1. Loads `ccl.yaml` + `topology.yaml`. +2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from + `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`. +3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time + SFR wiring, mirrors NCCL communicator creation. + +At each `dist.all_reduce(tensor)` call: + +1. Resolves `kernel_fn` from `cfg["module"]`. +2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from + `kernel_args(world_size, n_elem)`. +3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where + `sip_rank` is the current greenlet's bound rank. +4. Launches with `_defer_wait=True`; the main scheduler drains pending + handles after all workers submit (per ADR-0027 D0.4). + +### D6. Config schema + +`ccl.yaml`: + +```yaml +defaults: + algorithm: intercube_allreduce + buffer_kind: tcm + ... + +algorithms: + intercube_allreduce: + module: kernbench.ccl.algorithms.intercube_allreduce + topology: none + buffer_kind: tcm + n_elem: 8 + root_cube: 15 +``` + +`topology.yaml`: + +```yaml +system: + sips: + count: 2 + topology: ring_1d +sip: + cube_mesh: { w: 4, h: 4 } +``` + +### D7. 
Algorithm module contract + +Modules loaded via `cfg["module"]` must export: + +| Name | Purpose | +|---|---| +| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` | +| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) | +| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code | +| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) | + +--- + +## Dependencies + +- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return). +- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank. +- **ADR-0025**: Address-based IPCQ direction matching; extended + `_OPPOSITE_DIR` with `global_*` pairs. +- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler. + +## Non-goals + +- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the + workload for this algorithm is per-cube DP. +- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and + `mesh_2d_no_wrap` require `n_sips = k²`. +- **Pipelined chunks**: single-tile per cube, no pipelining yet. +- **Root cube runtime election**: the kernel currently uses + `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE + corner. SFR wiring covers all cubes, so runtime election is a pure kernel + change when needed. + +--- + +## Consequences + +### Positive + +- **Single kernel, single install path** for all-reduce — replaces four + removed modules (`ring`, `mesh`, `tree`, `hierarchical`). +- **Topology-agnostic kernel**: ring / torus / mesh selected via one + integer param, no kernel duplication. +- **Automatic via `dist.all_reduce`**: no bench-level or user-level + algorithm selection needed; config-driven end-to-end. +- **Full SFR wiring**: every cube on every SIP has inter-SIP links + available — supports future dynamic root-cube election. 
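As a sanity check on the five phases listed in D1, the data movement can be modeled on the host with plain Python lists. This is only a sketch under stated assumptions: `intercube_allreduce_model` is a hypothetical helper (not part of `kernbench`), phase 3 is collapsed into a plain sum across SIPs instead of the ring / torus / mesh exchange, and neither IPCQ traffic nor timing is modeled.

```python
def intercube_allreduce_model(shards):
    """Hypothetical host-side model of the D1 phases.

    shards[sip][row][col] is the pe0 shard (a list of numbers) of one cube.
    Returns a new nested list in which every cube holds the global sum.
    """
    # Deep copy so the caller's input is left untouched.
    x = [[[list(v) for v in row] for row in sip] for sip in shards]
    h, w = len(x[0]), len(x[0][0])
    n_elem = len(x[0][0][0])

    def add(dst, src):
        for i, s in enumerate(src):
            dst[i] += s

    for sip in x:
        # Phase 1: row reduce W -> E; column w-1 ends up with each row sum.
        for r in range(h):
            for c in range(1, w):
                add(sip[r][c], sip[r][c - 1])
        # Phase 2: column reduce N -> S on the rightmost column;
        # the SE-corner cube (root) ends up with the full SIP sum.
        for r in range(1, h):
            add(sip[r][w - 1], sip[r - 1][w - 1])

    # Phase 3 (simplified): inter-SIP exchange on the root cube as a plain sum.
    total = [sum(sip[h - 1][w - 1][i] for sip in x) for i in range(n_elem)]

    for sip in x:
        sip[h - 1][w - 1] = list(total)
        # Phase 4: column broadcast S -> N on the rightmost column.
        for r in range(h - 2, -1, -1):
            sip[r][w - 1] = list(sip[r + 1][w - 1])
        # Phase 5: row broadcast E -> W across the mesh.
        for r in range(h):
            for c in range(w - 2, -1, -1):
                sip[r][c] = list(sip[r][c + 1])
    return x


# 2 SIPs, 4x4 cube mesh, 2 elements per shard (illustrative values).
shards = [[[[s * 16 + r * 4 + c, 1] for c in range(4)] for r in range(4)]
          for s in range(2)]
out = intercube_allreduce_model(shards)
```

With a 4×4 mesh the root cube index is `(4 - 1) * 4 + (4 - 1) = 15`, which matches the `root_cube: 15` entry in `ccl.yaml`.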
+
+### Negative
+
+- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
+  shard within one cube across 8 PEs are not addressable by this kernel.
+  Such workloads would need a separate intra-cube all-reduce path (not
+  yet implemented).
+- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
+  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
+  small but not zero.
+
+---
+
+## Affected files
+
+| File | Change |
+|---|---|
+| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
+| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
+| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
+| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
+| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `topology.yaml` | Added `system.sips.topology` |
+| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
+| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
+| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
diff --git a/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md b/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md
new file mode 100644
index 0000000..13ca1f9
--- /dev/null
+++ b/docs/adr-ko/ADR-0033-lat-latency-model-assumptions.md
@@ -0,0 +1,162 @@
+# ADR-0033 - Latency Model: Assumptions and Known Simplifications
+
+## Status
+
+Accepted
+
+## Context
+
+The simulator is an analytical, event-driven performance model, not a
+cycle-accurate or RTL-level simulator. Many real-HW effects are approximated +or omitted by design. To keep the model auditable and reviewable as a whole, +this ADR consolidates the assumptions in one place. Individual component ADRs +(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines +the *limits of fidelity*. + +## Decisions + +### D1. Modeled precisely + +- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) — + ADR-0015 D2. +- **Per-component switching/overhead latency** (`overhead_ns` attr). +- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array + with address-based PC selection (ADR-0034 D3). Burst granularity tunable + (`burst_bytes`, default 256B). Read and write share each PC's + `available_at` (real HW command bus is per-PC shared). +- **HBM direction switching penalty mechanism**: per-PC last-direction + tracking + configurable `switch_penalty_ns`. Default 0 — see D2. +- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions + with payload into `Flit` objects of `flit_bytes` (default = HBM + `burst_bytes` = 256B). The wire emits each flit individually after + `prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles + flit arrival rate per real-HW wormhole semantics. +- **Separate Stores per directed edge** (Phase 2c key fix): the wire + is the *only* conduit between `src.out_ports[dst]` and + `dst.in_ports[src]`. Earlier the two were aliased to the same + `simpy.Store`; when the wire put a chunkified flit back, the + destination's `fan_in` could pull it before the wire applied + bandwidth delay, leaving half the flits bypassing the bottleneck. +- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`): + forward each flit serially with per-transaction overhead applied + ONCE on the first-flit arrival (header decode model). Subsequent + flits pipeline through with no extra delay. Wormhole emerges + naturally across multi-hop paths. 
+- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
+  schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
+  with the `is_last` flit waiting for the last PC commit before
+  signaling `txn.done`.
+- **Non-flit-aware components (default) reassemble flits at
+  `_fan_in`** before the legacy `_forward_txn` path runs. This
+  preserves backward compatibility for components that have not yet
+  been migrated to flit-aware processing (e.g., `MCpuComponent`,
+  `IoCpuComponent` sub-txn generators). Such components reassemble
+  *once per leg boundary*, NOT per hop, so multi-hop wormhole timing
+  through a chain of flit-aware routers is preserved.
+
+### D2. Approximated (with known directional error)
+
+| Effect | Real HW | Our model | Error direction |
+|--------|---------|-----------|----------------|
+| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
+| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense; default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
+| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
+| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
+
+### D3. Ignored (out of scope)
+
+- Bank-level row buffer conflict penalty (assume no conflicts, i.e. best case;
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
+- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
+  `burst_time = burst_bytes / pc_bw_gbs`).
+- Refresh, ECC, thermal throttling, power gating. +- Clock domain crossings, PLL lock time. +- Upstream backpressure due to downstream buffer occupancy (input ports use + unbounded `simpy.Store`). +- Sub-flit cycle-level arbitration at routers (flit granularity is our + smallest unit). + +### D4. Workload sensitivity + +Workloads where the above simplifications meaningfully affect results: + +- **Random scatter/gather**: bank conflict ignored → model optimistic. +- **Heavy mixed R/W intensive** (e.g., GEMM bias accumulation): HBM scheduler + absent. With default `switch_penalty_ns = 0` we assume ideal amortization; + setting it non-zero models pessimistic per-alternation cost. +- **High concurrency (>10 active flows on one link)**: HoL blocking and VC + limits not modeled → model optimistic. +- **Very small (sub-flit) transactions**: flit quantization noise. +- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the + flit level, so per-flow fairness within a single edge is not modeled. + Pre-edge merging (multiple sources arriving at a router and being + forwarded to the same downstream wire) is correctly modeled via the + flit-aware router's serial worker. + +### D5. Verification policy + +For workloads in D4, cross-check against real HW or a cycle-accurate +simulator before drawing absolute-magnitude conclusions. The model remains +accurate for **relative comparisons** within the modeled regime. + +### D6. Future work + +Note: multi-stream merging at routers IS modeled correctly — each +in_port has its own fan_in process, all push to a shared inbox, and +the router worker forwards in inbox FIFO order. Flits from different +upstream streams naturally interleave at flit granularity. The items +below are different concerns, ordered by expected workload impact. + +**Higher impact (workload accuracy gap)**: + +- [ ] **Bank-level conflict modeling** within a PC (opt-in via + `track_banks: true`). 
Currently we assume no same-bank reuse; + random scatter/gather workloads are optimistic here. +- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2 + from the design discussion). Default `switch_penalty_ns=0` is the + ideal-amortization stand-in; bursty mixed R/W workloads benefit + from explicit modeling. +- [ ] **Backpressure** modeling for finite component buffers. Matters + at high concurrency / sustained saturation where buffer occupancy + causes upstream stalls. +- [ ] **Op_log integration with chunk-streaming**: currently op_log + fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd, + GemmCmd, MathCmd) which are not chunkified. Integration would + require flit-aware components to also emit op_log start/end hooks + per transaction (start on first flit, end on is_last). + +**Lower impact (academic / specific use cases)**: + +- [ ] **Cycle-accurate router arbitration policies** (RR with + priorities, age, iSLIP). The FIFO inbox is already approximately + fair when flit arrival times differ slightly between streams (the + common case for similar-rate workloads). True impact appears only + for: (a) priority/QoS modeling, (b) per-stream tail latency + analysis under sustained saturation. Not critical for makespan or + average-latency studies. +- [ ] **Sub-flit (32B) granularity** for finer wire arbitration + cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates + per 32B flit. Effect is small for most workloads (sub-flit timing + noise on small messages). + +## Consequences + +- Single review point for all model fidelity questions. Each future PR + touching latency must update the relevant section here. +- Workload-specific magnitude error envelopes are explicit. +- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` + enforces the ADR-0017 D8 invariant in code rather than relying on yaml + manual consistency. 
+- Wire transfer time is charged once per bottleneck-link transit (Phase 2c + per-flit timing) rather than via terminal `drain_ns` injection. Single + transactions land at `drain + commit_time + small_overheads`; multi-hop + preserves wormhole pipelining; multi-stream merge correctly serializes + at the shared wire's FIFO. + +## Cross-references + +- ADR-0015 — component / port / wire model. +- ADR-0017 — Cube NOC architecture and HBM connectivity. +- ADR-0004 — memory semantics, local HBM. +- ADR-0034 — HBM controller internal design. diff --git a/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md b/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md new file mode 100644 index 0000000..b7d3e8f --- /dev/null +++ b/docs/adr-ko/ADR-0034-dev-hbm-controller-internal-design.md @@ -0,0 +1,271 @@ +# ADR-0034: HBM Controller Internal Design + +## Status + +Accepted + +## Context + +`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of +the cube NOC. One instance is created per PE under the topology node +`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router +(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling, +burst-granular commit timing, address-based PC selection, and response +routing back to the requester. + +This ADR documents the component as currently implemented. ADR-0017 D4/D8 +defines *where* HBM CTRL attaches and *what* aggregate BW it must +deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in +scope. This ADR fills the gap between those two — the per-instance +internal scheduling model. + +## Decision + +### D1. Role + +`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per +PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`) +attaches to that PE's router via the `peX.hbm` attachment list in +`cube_mesh.yaml` (ADR-0017 D4). 
In the default n:1 channel mapping +(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels +into one endpoint. + +The component models: + +- Per-PC scheduling (D2) with R/W command-bus sharing. +- Address-based PC selection (D3). +- Burst-granular commit timing (D4). +- Flit-aware per-flit PC commit and async finalize (D5, D6). +- Command-only Transaction handling for read-data drain (D7). +- Response routing back to the requester (D8). + +It does not model: + +- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling + (ADR-0033 D3). +- Cross-PE HBM contention beyond its own router edge (handled by the + router mesh — ADR-0017 D3). +- 1:1 channel mode (ADR-0017 D8 future work). + +### D2. Per-PC scheduling model + +Per-instance state initialised in `start()`: + +- `_pc_avail: list[float]` — earliest sim-time each PC is free; length + `num_pcs`, initial 0.0. +- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on + each PC, used for switch-penalty detection (D4); initial `None`. + +`num_pcs` and `burst_bytes` must each be a positive power of two so +that address-based PC selection (D3) reduces to a shift-and-mask. + +Read and write requests share the same `_pc_avail` slot per PC — the +real HW per-PC command bus is shared between read and write traffic, so +issuing a write to PC k blocks a subsequent read to PC k by exactly the +burst time. + +Direction `dir` for a request is inferred from the request type: + +- `MemoryWriteMsg` → `"W"`. +- `PeDmaMsg` with `is_write=True` → `"W"`. +- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`. + +### D3. Address-based PC selection + +PC index for an access is derived from the access address by shift and +mask: + +```text +pc_shift = log2(burst_bytes) # default 8 (burst=256B) +pc_mask = num_pcs - 1 # default 7 (8 PCs) +pc = (address >> pc_shift) & pc_mask +``` + +Computed once in `start()` from topology config so alternative +`(burst_bytes, num_pcs)` pairs stay consistent. 
For the canonical +default `(256, 8)` this places the PC select field at bits `[10:8]` of +the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits +`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column +within the PC slice (see `phyaddr.py` comment). + +Address-based striping — as opposed to address-blind global +round-robin — preserves PC parallelism for offset-disjoint concurrent +transfers: each transfer's bursts land deterministically on the PC set +implied by its byte addresses, so multi-PE workloads accessing disjoint +regions do not collide on a single PC. + +### D4. Burst granularity and PC commit timing + +A single PC commit takes: + +```text +chunk_time = burst_bytes / pc_bw_gbs # ns +``` + +- `burst_bytes` (default 256) is the burst granularity matching the + flit size (ADR-0033 D1). +- `pc_bw_gbs` is **builder-derived** from + `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing + the ADR-0017 D8 invariant that aggregate per-PE BW equals the + router-to-HBM link BW. + +Per-PC commit scheduling for an arriving access on PC `pc` with +direction `dir`: + +```text +switch_cost = switch_penalty_ns + if pc_last_dir[pc] not in (None, dir) else 0 +start = max(env.now, pc_avail[pc]) + switch_cost +finish = start + chunk_time +pc_avail[pc] = finish +pc_last_dir[pc] = dir +``` + +Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM +scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values +model pessimistic per-alternation cost. + +### D5. Flit-aware per-flit PC commit (primary path) + +`_handle_flit` is the primary worker path. For each arriving `Flit`: + +1. On the **first** flit of a transaction (`tid = id(txn)` not in + `_txn_state`): + - Apply `overhead_ns` once via `run(env, nbytes)` — header decode + model, first-flit overhead pattern (ADR-0033 D1). + - Initialise `_txn_state[tid] = {"last_finish": env.now}`. +2. Compute `pc = _pc_for_address(flit.address)` (D3). +3. 
Apply the per-PC schedule (D4) using the request direction (D2). +4. Update `state["last_finish"] = max(state["last_finish"], finish)`. +5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn` + (D6). + +Per-flit address-aware commit is the mechanism that lets concurrent +multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs +in parallel. + +### D6. Async finalize per transaction + +When a transaction's last flit has been scheduled, finalisation runs in +a separately-spawned process: + +```python +def _finalize_txn(env, txn, last_finish): + wait = last_finish - env.now + if wait > 0: + yield env.timeout(wait) + yield from _send_response(env, txn) +``` + +`_handle_flit` spawns this via `env.process(...)` and returns +immediately, so the worker can pick up the next inbox message while the +last PC commit drains. + +Without this split — i.e. if the worker itself did +`yield env.timeout(wait)` — concurrent single-flit transactions whose +addresses hit distinct PCs would still serialise at `chunk_time` each +inside the worker, hiding the PC parallelism that D3 and D5 are +designed to expose. + +### D7. Non-flit fallback for command-only transactions + +`_handle_txn` runs when the inbox delivers a `Transaction` rather than a +`Flit`. This is the path for command-only requests that the wire does +not chunk into flits — most notably `MemoryReadMsg` whose command txn +carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing, +not as inbound flits). + +Procedure: + +1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)` + — for read commands, work is sized by the request. +2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else + 0. +3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are + scheduled over time at `drain/n_chunks` ns intervals to model the + bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain). +4. 
Apply `run(env, txn.nbytes)` once for `overhead_ns`. +5. For each chunk `i`, advance `chunk_interval` ns then apply the D4 + schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`. +6. After scheduling all chunks, wait `last_finish - env.now` then call + `_send_response`. + +`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with +`_handle_flit` — there is exactly one source of PC scheduling truth +across both paths. + +### D8. Response routing + +`_send_response` dispatches on request type and path geometry: + +| Case | Trigger | Response | +| --- | --- | --- | +| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` | +| Bypass — Memory Read | `"m_cpu" not in any(txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) | +| Bypass — Memory Write | `"m_cpu" not in any(txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) | +| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path | + +The "bypass" classification matches the Memory R/W fabric path defined +in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl, +without M_CPU). The PE_DMA case is its own dedicated reverse-path to +keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a +ResponseMsg envelope). + +In all reverse-path cases, the response Transaction is put onto +`out_ports[reverse_path[1]]` — the first hop back along the recorded +forward path. If `reverse_path` has fewer than 2 entries (degenerate +path), the original `txn.done` is signalled directly. + +### D9. 
Configurable attributes + +| Attribute | Default | Source | Notes | +| --- | --- | --- | --- | +| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 | +| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant | +| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) | +| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching | +| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) | +| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) | + +`pc_bw_gbs` is derived by `topology/builder.py` rather than configured +directly so the aggregate per-PE BW matches the router-to-HBM link BW +without yaml-side duplication. + +## Consequences + +### Positive + +- Address-based PC selection preserves multi-stream HBM parallelism + that an address-blind round-robin would collapse — important for + multi-PE workloads with disjoint HBM regions. +- Flit-aware path (D5) + async finalize (D6) preserves wormhole + pipelining and exposes PC parallelism for back-to-back single-flit + transactions. +- Single source of PC scheduling truth (D4 mechanism, used by both D5 + flit path and D7 chunk-loop path). +- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml + discipline. + +### Negative + +- No bank-level conflict modelling within a PC; address-blind to + bank/row-buffer reuse (ADR-0033 D3). +- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed + FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns` + (ADR-0033 D2). +- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state + accumulates per concurrent transaction and is removed only on + `is_last`. Adequate for current workloads. 
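
As an illustration of the D3 address-to-PC mapping and the D4 commit schedule described in this ADR, the following is a minimal sketch rather than the actual `HbmCtrlComponent` code. The class name `PcScheduler` and its method names are hypothetical, and the sketch assumes GB/s can be treated as bytes per nanosecond, so that `burst_bytes / pc_bw_gbs` yields nanoseconds.

```python
class PcScheduler:
    """Hypothetical stand-in for the per-PC state in D2-D4 (not the real component)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        # D2: both values must be positive powers of two so that PC
        # selection reduces to a shift and a mask.
        for name, value in (("num_pcs", num_pcs), ("burst_bytes", burst_bytes)):
            if value <= 0 or value & (value - 1):
                raise ValueError(f"{name} must be a positive power of two")
        self.pc_shift = burst_bytes.bit_length() - 1   # log2(burst_bytes)
        self.pc_mask = num_pcs - 1
        self.chunk_time = burst_bytes / pc_bw_gbs       # ns per burst (assumed units)
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs                 # earliest free time per PC
        self.pc_last_dir = [None] * num_pcs             # "R", "W", or None

    def pc_for_address(self, address):
        # D3: pc = (address >> pc_shift) & pc_mask
        return (address >> self.pc_shift) & self.pc_mask

    def commit(self, now, address, direction):
        # D4: schedule one burst on the selected PC and return its finish time.
        pc = self.pc_for_address(address)
        last = self.pc_last_dir[pc]
        switch_cost = self.switch_penalty_ns if last not in (None, direction) else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish
```

With the default `(256, 8)` pair, two bursts issued at the same time to addresses `0x000` and `0x100` land on different PCs and finish together, while a second burst to `0x000` queues behind the first. A non-zero `switch_penalty_ns` adds its cost only when the direction on that PC changes.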
+ +## Links + +- ADR-0001 (Physical address layout — PC bit field comment) +- ADR-0015 D4 (Memory R/W fabric path — bypass response case) +- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers) +- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR + implements) +- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint + resolution) +- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty, + flit-aware PC commit, first-flit overhead, chunk-loop drain) +- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation) diff --git a/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md b/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md new file mode 100644 index 0000000..8d22a7b --- /dev/null +++ b/docs/adr-ko/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md @@ -0,0 +1,286 @@ +# ADR-0035: M_CPU and M_CPU.DMA Component Model + +## Status + +Accepted + +## Context + +M_CPU is the cube-level command processor. It receives commands from +IO_CPU (or from PCIE_EP when the engine routes Memory R/W through +M_CPU as a fallback), fans them out to the PEs in its cube, and +aggregates per-PE responses into a single ResponseMsg sent back to +IO_CPU on the reverse path. + +M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W +fan-out. Per ADR-0015 D5 it is **not** a separate topology node — +it lives as internal state of `MCpuComponent`. + +This ADR documents the M_CPU component implementation that realizes +those responsibilities, including the three distinct fan-out paths +(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource +model, and the response aggregation contract. + +## Decision + +### D1. Role + +M_CPU has three responsibilities: + +1. **Transit forwarding** — when not the terminal hop (e.g., on the + reverse response path PE → M_CPU → IO_CPU), forwards Transactions + to `next_hop` in their pre-computed path. +2. 
**Multi-PE fan-out at terminal hop** — dispatches to one of three + fan-out paths based on request type (D2). +3. **Response aggregation** — collects per-PE responses, sends a + single aggregate ResponseMsg back to IO_CPU on the reverse path. + +Per invocation (`run()`): applies `overhead_ns` once per incoming +Transaction. + +M_CPU does **not**: + +- Decide routing — paths are pre-computed by the router (ADR-0002). +- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines + (ADR-0014). +- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE + `hbm_ctrl.pe{X}` directly (ADR-0017 D9). +- Interpret tensor or kernel semantics — fan-out dispatch by Python + isinstance check only. + +### D2. Three fan-out paths dispatched by request type + +At the terminal hop the worker dispatches by request type: + +```python +elif self.ctx is not None and txn.request is not None: + if isinstance(txn.request, KernelLaunchMsg): + env.process(self._kernel_launch_fanout(env, txn)) + elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)): + env.process(self._mmu_msg_fanout(env, txn)) + else: + env.process(self._dma_fanout(env, txn)) +``` + +Each path uses a different router method: + +- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the + M_CPU-specific DMA path that avoids PE pipeline nodes. +- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the + generic NOC command path to PE_CPU. +- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command + path to PE_MMU. + +### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5) + +`MCpuComponent.start()` initializes two SimPy resources: + +```python +self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg +self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg +``` + +Properties: + +- **Not a topology node** — managed entirely inside `MCpuComponent`; + does not appear in `topology.yaml` or in the compiled graph. 
+- **Independent read and write channels** — concurrent in-flight + Memory R/W is allowed. +- **Capacity=1 per channel** serializes the **dispatch step** + (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory + R/W requests at this M_CPU. Actual fabric transfer time is modeled + by wire processes between components (ADR-0015 D2) and by + `drain_ns` at terminal hops; the DMA resource does not gate + transfer duration. + +Resource selection is request-type-based: + +```python +dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read +``` + +### D4. Transit forwarding at non-terminal hops + +When `txn.next_hop` is not None — typical for the reverse response +path (PE → M_CPU → IO_CPU) — the worker forwards normally: + +```python +if next_hop: + yield self.out_ports[next_hop].put(txn.advance()) +``` + +The fan-out branches fire only at the terminal hop. The same component +therefore serves both forward command dispatch and reverse response +relay roles. + +### D5. DMA fan-out (`_dma_fanout` — Memory R/W) + +For each Memory R/W request at terminal hop: + +1. `_resolve_dma_destinations(request)` returns a per-PE + `hbm_ctrl.pe{X}` derived from the request's PA via + `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9). +2. For each destination: + - Acquire the appropriate DMA resource (`_dma_write` or + `_dma_read`) via `with dma_res.request() as req`. + - Resolve path via `ctx.router.find_mcpu_dma_path()`. + - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`. + - Create sub-Transaction carrying `drain_ns` and dispatch to + `path[1]`. +3. Track `max_drain_ns` across destinations and record it as + `txn.result_data["xfer_ns"]` after all responses arrive. +4. After all per-PE responses are collected (D8), send an aggregate + ResponseMsg on the reverse command path back to IO_CPU. + +PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code — +no such node exists after ADR-0017 D4's per-PE partitioning. 
Kept +defensively but does not route to a real destination. + +### D6. Kernel launch fan-out (`_kernel_launch_fanout`) + +For `KernelLaunchMsg` at terminal hop: + +1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube. +2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via + `ctx.router.find_node_path()`. +3. **`target_start_ns` handling** (ADR-0009 D5): + - If the request already carries `target_start_ns` (stamped by + IO_CPU per ADR-0036 D3): **pass through unchanged**. + - If absent (direct-to-M_CPU launch in unit tests): compute a + per-cube barrier `env.now + max(per-PE leg latency)` and stamp + via `dataclasses.replace`. +4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a + control message; preserving nbytes=0 keeps fan-out off the shared + first-hop fabric BW, mirroring ADR-0036 D4). +5. After all per-PE responses arrive (D8), aggregate per-PE metrics + from each sub-Transaction's `result_data` into the parent + transaction: + + ```python + txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values)) + txn.result_data["dma_ns"] = max(existing, max(dma_values)) + txn.result_data["compute_ns"] = max(existing, max(compute_values)) + ``` + + The max-merge with the existing value matters because cross-cube + IO_CPU fan-out shares the same parent `result_data`; merging + prevents one cube from clobbering another's metric. +6. Send aggregate ResponseMsg on reverse path back to IO_CPU. + +### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`) + +For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop: + +1. `_resolve_pe_ids(target_pe)` → PE ids. +2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via + `find_node_path()`. +3. Dispatch sub-Transactions with `nbytes=0`. +4. PE_MMU is a terminal node — it does **not** send a ResponseMsg + back. Instead, the sub-Transaction's own `sub_done` event is the + completion signal. +5. 
Wait for all `sub_done` events in-line (does **not** use + `_pending` counter — D8 is for response-bearing fan-out only). +6. Send aggregate ResponseMsg on reverse path back to IO_CPU. + +### D8. Response aggregation (`_pending` + `_parent_txns`) + +For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg +arriving on the reverse path): + +```python +self._pending: dict[str, tuple[int, int, simpy.Event]] = {} +self._parent_txns: dict[str, Any] = {} +``` + +- On dispatch: register `(expected, received=0, all_done)` and + remember the parent transaction. +- `_worker` recognises responses by `is_response=True` and routes + them to `_collect_response`, which increments `received` and + signals `all_done` when `received >= expected`. +- After `yield all_done`, the fan-out path constructs the aggregate + ResponseMsg: + + ```python + resp_msg = ResponseMsg( + correlation_id=request.correlation_id, + request_id=request.request_id, + src_cube=cube_id, + src_pe=-1, # -1 = M_CPU aggregate, not a single PE + success=True, # no failure semantics implemented + ) + ``` + +- The response Transaction travels on `list(reversed(txn.path))` + back to IO_CPU. + +MMU fan-out (D7) uses a simpler in-line list of `sub_done` events +because PE_MMU is terminal — there is no ResponseMsg path to +intercept. + +### D9. Helpers and configurable attribute + +`_resolve_pe_ids(target_pe)`: + +- `int` → `[target_pe]` +- `tuple[int, ...]` → `list(target_pe)` +- `"all"` → `range(n_slices)` where `n_slices` comes from cube + `memory_map.hbm_slices_per_cube` (default 8). + +Used by kernel-launch and MMU fan-out paths. + +Single configurable attribute drives per-instance latency: + +| Site | impl name | overhead_ns | +| --- | --- | --- | +| Cube `m_cpu` | `builtin.m_cpu` | 5.0 | + +Applied once in `run()` per Transaction — models command +interpretation and dispatch-decision time at M_CPU. 
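
The `target_pe` mapping listed for `_resolve_pe_ids` in D9 can be sketched as a small standalone function. This is an illustrative reading of the three cases, not the component's code: the free-function form, the explicit `n_slices` parameter, and the `ValueError` for unrecognised input are assumptions of the sketch.

```python
def resolve_pe_ids(target_pe, n_slices=8):
    """Hypothetical standalone version of the D9 helper.

    n_slices stands in for cube memory_map.hbm_slices_per_cube (default 8).
    """
    if isinstance(target_pe, int):
        return [target_pe]                 # single PE
    if isinstance(target_pe, tuple):
        return list(target_pe)             # explicit PE set
    if target_pe == "all":
        return list(range(n_slices))       # every PE in the cube
    # Assumption: the ADR does not say how other inputs are handled.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")
```

Returning a concrete list in every branch keeps the fan-out loops in D6 and D7 uniform regardless of how the caller specified the target.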
+ +## Consequences + +### Positive + +- Three fan-out paths are clearly separated by request type — adding + a new request kind is an isinstance branch + one fan-out method. +- M_CPU.DMA channels are independent (read and write run concurrently) + and serialize only the dispatch step at capacity=1. +- Transit-vs-terminal behavior is a single `if next_hop` check, so + the same component handles forward dispatch and reverse response + relay without role duplication. +- `target_start_ns` passthrough (D6) preserves the cross-cube barrier + established by IO_CPU (ADR-0036 D3), while the fallback computation + keeps direct-to-M_CPU unit tests working. +- Per-PE metric `max`-merge against existing parent `result_data` + values is robust to cross-cube IO_CPU fan-out sharing the same + parent. + +### Negative + +- No partial-failure semantics — a missing per-PE response stalls the + parent `all_done` indefinitely. Acceptable for simulation; not + suitable as a production-style endpoint. +- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead + code (no such node exists post-ADR-0017 D4). Kept defensively; + invites confusion and merits a follow-up cleanup. +- DMA resource serialization applies only at dispatch (the `put` call + is instantaneous in unbounded stores). The capacity=1 channel + models "one request in flight at a time at this M_CPU", not + "transfer duration serialization" — readers must consult wire + processes (ADR-0015 D2) and `drain_ns` for actual transfer + parallelism. 
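
The counter-based aggregation in D8, and the stall noted in the first negative consequence, can be illustrated without SimPy. The sketch below is a simplification under stated assumptions: the class name `PendingResponses` is hypothetical, a plain callback stands in for the `simpy.Event`, and removing the entry on completion is a choice of this sketch rather than something this ADR specifies.

```python
class PendingResponses:
    """Hypothetical, SimPy-free model of the D8 (expected, received, done) counter."""

    def __init__(self):
        # request_id -> [expected, received, on_done callback]
        self._pending = {}

    def register(self, request_id, expected, on_done):
        # Called at dispatch time, before any response can arrive.
        self._pending[request_id] = [expected, 0, on_done]

    def collect(self, request_id):
        # Called once per response seen on the reverse path.
        entry = self._pending[request_id]
        entry[1] += 1
        if entry[1] >= entry[0]:
            del self._pending[request_id]   # assumption: entry dropped on completion
            entry[2]()                      # stands in for all_done.succeed()
            return True
        return False

    def outstanding(self):
        # Requests still waiting; a lost response leaves its id here forever.
        return sorted(self._pending)
```

Because completion depends only on the count reaching `expected`, a single lost per-PE response leaves the request in `outstanding()` indefinitely, which is the no-partial-failure behaviour described above.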
+ +## Links + +- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics) +- ADR-0009 D5 (`target_start_ns` — passed through unchanged when + present; computed as per-cube barrier when absent) +- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out + point) +- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same + contract at cube level) +- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a + topology node) +- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`) +- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes + through unchanged; nbytes=0 invariant preserved through fan-out) diff --git a/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md b/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md new file mode 100644 index 0000000..b79e9ad --- /dev/null +++ b/docs/adr-ko/ADR-0036-dev-io-cpu-component-model.md @@ -0,0 +1,216 @@ +# ADR-0036: IO_CPU Component Model + +## Status + +Accepted + +## Context + +IO_CPU is the IO chiplet's host-facing endpoint inside the simulation +graph. PCIE_EP receives host messages from the runtime API and routes +them via the io_noc; for command-bearing requests (KernelLaunch, +MmuMap/Unmap) the io_noc forwards to IO_CPU, which: + +- Fans out the request to per-cube M_CPUs. +- Aggregates per-cube responses into a single host-visible completion. +- For kernel launches, stamps a global `target_start_ns` barrier so + every PE across every targeted cube begins kernel body execution at + the same simulated time (ADR-0009 D5). + +Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3; +this component therefore handles only command-plane traffic in normal +operation. + +This ADR documents the IO_CPU component implementation that realizes +those responsibilities. + +## Decision + +### D1. Role + +IO_CPU is the host-facing endpoint of the IO chiplet. It has two +primary responsibilities: + +1. 
**Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg / + MmuUnmapMsg to per-cube M_CPUs. +2. **Response aggregation** — collect per-cube ResponseMsg, signal + parent `txn.done` when all targeted cubes have responded. + +A third, narrower responsibility applies only to KernelLaunchMsg: +**`target_start_ns` global barrier stamping** (D3). + +The component does **not**: + +- Decide routing — paths are pre-computed by the router (ADR-0002). +- Decode tensor or kernel internals — those concerns belong to + M_CPU / PE_CPU / engines. +- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3). +- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4 + and ADR-0016 D3 (Memory R/W resolution code in + `_resolve_cube_targets` exists as a defensive fallback only). + +Per invocation (`run()`): applies the configured `overhead_ns` once +per incoming Transaction (D8). + +### D2. Forward path — multi-cube fan-out + +When a non-response Transaction arrives, the worker: + +1. Pays `overhead_ns` via `run()`. +2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)` + targets from the request (D5). +3. For each target: + - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`. + - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`. + - Creates a per-cube sub-Transaction with `path` populated and + forwards it to `path[1]` (the first hop on the io_noc). +4. Registers aggregation state: `_pending[request_id] = (expected, + received=0, parent_done)`. + +### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5) + +IO_CPU is the canonical stamper for `target_start_ns`. 
When the +request is a `KernelLaunchMsg`, IO_CPU computes a single global +barrier covering every targeted PE across every targeted cube: + +```text +for (sip, cube) in cube_targets: + leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0) + for pe_id in target_pe_ids: + leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id), + nbytes=0) + latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns + global_max = max(global_max, latency) + +target_start_ns = env.now + global_max +``` + +The request is then replaced (via `dataclasses.replace`) so the +stamped value propagates through the fan-out. + +Two overhead corrections: + +- `io_overhead_ns` is subtracted because IO_CPU has already paid it + in `run()` before this method runs. +- `m_overhead_ns` is subtracted once because it appears as the + endpoint of leg1 *and* the start of leg2 in path latency, but + M_CPU pays it only once at run time. + +Every downstream PE_CPU yields until `target_start_ns` before +beginning kernel body execution; all PEs therefore start at the same +simulated time regardless of how long their individual dispatch path +took. + +### D4. KernelLaunch sub-Transactions carry `nbytes=0` + +Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`, +overriding the parent `txn.nbytes`: + +- Kernel launch is a control message; payload size is irrelevant at + the data-fabric level. +- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the + io_noc's shared first hop. With 16 cubes this serializes fan-out, + pushing far M_CPUs past `target_start_ns` and breaking the D3 + invariant. + +Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant +for the defensive Memory R/W fallback path, which carries actual +payload sizes). + +### D5. 
Per-request-type cube target resolution + +`_resolve_cube_targets` dispatches by request type: + +| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics | +| --- | --- | --- | +| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode | +| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode | +| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP | +| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec | + +Each IO_CPU instance fans out only within its own SIP — `_my_sip()` +parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0). + +The Memory R/W rows exist for defensive completeness; the engine's +normal path routes Memory R/W via `_process_memory_direct()` / +`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 / +ADR-0016 D3). + +### D6. Response aggregation + +`_pending: dict[request_id → (expected, received, parent_done)]`: + +- On dispatch: register `(len(cube_targets), 0, txn.done)`. +- `_worker` recognises responses by `is_response=True` and routes + them to `_collect_response`. +- `_collect_response` increments `received`; when `received >= + expected`, `parent_done.succeed()` is invoked and the entry is + removed from `_pending`. + +This is a simple per-request counter. There is no per-cube identity +tracking and no partial-failure handling — a missing response +indefinitely stalls the parent done. Production-style failure paths +are out of scope for the current simulator model. + +### D7. `target_pe` resolution helper + +`_resolve_pe_ids(target_pe)`: + +- `int` → `[target_pe]`. +- `tuple[int, ...]` → `list(target_pe)`. +- `"all"` → `range(n_slices)`, where `n_slices` comes from cube + `memory_map.hbm_slices_per_cube` (default 8). 
+ +Used in D3's barrier computation to enumerate every PE target per +cube. + +### D8. Configurable `overhead_ns` + +A single attribute drives per-instance latency: + +| Site | impl name | overhead_ns | +| --- | --- | --- | +| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 | + +Applied once in `run()` per Transaction. Models command +interpretation + dispatch-decision time at IO_CPU. + +## Consequences + +### Positive + +- Cross-cube and cross-SIP kernel launches share a single global + barrier (D3 + D4) — no per-cube divergence in start time. +- nbytes=0 invariant keeps fan-out off the shared first-hop fabric + BW, preserving the barrier's accuracy at scale (16 cubes). +- Response aggregation via a single counter → minimal state, + deterministic ordering of completion. +- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs + cleanly independent. + +### Negative + +- No partial-failure semantics — a missing per-cube response + indefinitely stalls the parent. Adequate for simulation but not + suitable as a production-style endpoint. +- `_pending` is a regular dict; in-flight requests accumulate state. + Acceptable for current benchmark workloads (few concurrent + outstanding launches); unbounded in principle. +- The Memory R/W resolution branches in `_resolve_cube_targets` are + dead code in the normal engine path. Kept defensively but invite + drift if the bypass path ever changes. 
+ +## Links + +- ADR-0002 (Routing distance — path computation) +- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU) +- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across + cubes) +- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU) +- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out) +- ADR-0012 (Host ↔ IO_CPU message schema) +- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU) +- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here) +- ADR-0016 D3 (Memory R/W path bypasses IO_CPU) +- ADR-0016 D4 (Kernel Launch path through IO_CPU for command + interpretation) diff --git a/docs/adr-ko/ADR-0037-dev-forwarding-component.md b/docs/adr-ko/ADR-0037-dev-forwarding-component.md new file mode 100644 index 0000000..193dbe0 --- /dev/null +++ b/docs/adr-ko/ADR-0037-dev-forwarding-component.md @@ -0,0 +1,200 @@ +# ADR-0037: Forwarding Component (forwarding_v1) + +## Status + +Accepted + +## Context + +The simulation graph has many node positions that exist purely to model +fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints, +IO chiplet io_noc, transit cubes. These share a common pattern: receive +a message, apply per-component overhead (modeling header decode + +routing decision time), forward to the next hop along the pre-computed +path. + +This ADR defines the contract for these transit nodes: a single +component type (`TransitComponent`) that handles flit-aware forwarding +with wormhole cut-through semantics, used under multiple impl names +according to the conceptual role each instance plays. + +## Decision + +### D1. Role + +The Forwarding component (`TransitComponent` class) is a **stateless +transit node** in the simulation graph. It models any fabric position +where a message physically traverses but no semantic processing +happens. + +Per traversal, the component: + +1. Reads an incoming Transaction or Flit from an `in_port`. +2. 
Applies the configured per-component overhead (`overhead_ns`), + applied **once per Transaction** even across multi-flit payloads + (see D2). +3. Looks up the next hop along the Transaction's pre-computed `path`. +4. Forwards to the corresponding `out_port`; at the terminal node + (no next hop), signals `txn.done` once the `is_last` flit arrives. + +The component **does NOT**: + +- Decide routing — paths are pre-computed by the router (ADR-0002 / + ADR-0017 D2). Forwarding only executes the per-hop step. +- Model wire propagation or bandwidth occupancy — separate wire + processes between components handle that (ADR-0015 D2). +- Resolve addresses — the AddressResolver does that (ADR-0017 D9). +- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL) + handle that. + +### D2. First-flit overhead model (header decode) + +Per-Transaction `overhead_ns` is applied **exactly once**, at first +flit arrival: + +- `_txn_decoded: set[int]` tracks which Transactions have already + paid the overhead at this node. +- On first-flit arrival for a Transaction: `yield self.run(env, + msg.txn.nbytes)` — pays the overhead. +- Subsequent flits of the same Transaction skip the overhead — they + pipeline through with no extra delay. +- On `is_last` flit: remove the Transaction from `_txn_decoded`. + +This models the real-HW behavior where header decode and routing +decision happen once on first flit; payload flits then stream through +the same path (wormhole cut-through). Multi-hop pipelining emerges +naturally — each hop adds its own first-flit overhead, but flits +after the first do not re-pay overhead at any hop they have already +passed first. + +### D3. Serial worker forwarding (preserves order) + +The component's worker is a single SimPy process that consumes flits +from `_inbox` and forwards them serially in arrival order. The +component does NOT spawn `env.process(...)` per flit. 
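A toy timing model, independent of SimPy, illustrates the serial-worker rule together with the first-flit overhead from D2. Everything in it is hypothetical (the function names and the `(arrival_ns, is_first)` tuple shape), and it assumes a flit costs nothing at the node apart from the first-flit `overhead_ns`.

```python
def serial_departures(arrivals, overhead_ns):
    """One worker handles flits in arrival order; only the first flit pays overhead."""
    free_at = 0.0  # time at which the single worker becomes idle
    departures = []
    for arrival_ns, is_first in arrivals:
        start = max(arrival_ns, free_at)
        done = start + (overhead_ns if is_first else 0.0)
        departures.append(done)
        free_at = done
    return departures


def per_flit_process_departures(arrivals, overhead_ns):
    """Counter-example: every flit runs in its own process, so nothing waits."""
    return [
        arrival_ns + (overhead_ns if is_first else 0.0)
        for arrival_ns, is_first in arrivals
    ]


# One header flit followed by two payload flits, 1 ns apart.
flits = [(0.0, True), (1.0, False), (2.0, False)]
serial = serial_departures(flits, overhead_ns=5.0)
parallel = per_flit_process_departures(flits, overhead_ns=5.0)
```

In the serial model the departure times never decrease, so arrival order is preserved. In the per-flit model the payload flits leave before the header flit, which is the overtaking that the single-worker design avoids.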
+ +Rationale: if the first flit yields on `overhead_ns` while subsequent +flits run in parallel processes, the later flits can overtake the +first. This produces out-of-order delivery and lets the `is_last` +flit arrive at the destination before the first flit — corrupting +both the transaction's completion semantics and any flit-index-based +processing downstream. + +### D4. Path-based next-hop routing + +Routing is **not** a Forwarding-component concern. The Transaction +arrives with a pre-computed `path` (built by the router; ADR-0002 / +ADR-0017 D2). The component just looks up its own position in the +path and forwards to `path[index + 1]`: + +```python +def _next_hop_in_path(self, txn): + my_id = self.node.id + path = txn.path + for i, n in enumerate(path): + if n == my_id and i + 1 < len(path): + return path[i + 1] + return None +``` + +If `next_hop` is found and present in `out_ports`, the flit is +forwarded. Otherwise (terminal node), `txn.done.succeed()` is +invoked when the `is_last` flit arrives. + +### D5. Flit-aware mode with Non-Flit fallback + +`_FLIT_AWARE = True` opts this component out of the base class's +flit-reassembly logic in `_fan_in`. Flits are placed directly on +`_inbox` (no reassembly), enabling per-flit handling in the worker +loop (D2, D3). + +Non-Flit messages — zero-byte control Transactions and other +non-chunkified payloads — fall through to the base class's legacy +`_forward_txn` path via `env.process`. This preserves backward +compatibility for control-plane traffic that does not benefit from +flit-level processing. + +### D6. Multi-stream merging at the base class + +Multi-stream FIFO merging at routers is the base class's +responsibility, not Forwarding's. The base class's `_fan_in` spawns +one process per `in_port`; all push to a single shared `_inbox`. +Flits from different upstream streams therefore interleave at +flit granularity in `_inbox`'s FIFO order. 
+ +The Forwarding worker simply consumes `_inbox` in arrival order — +correctly modeling per-router multi-flow arbitration as +fair-FIFO over the shared inbox. + +### D7. Single implementation under multiple impl names + +A single `TransitComponent` class is registered under four impl names +in `components.yaml`: + +- `builtin.forwarding` — generic forwarding (e.g., `io_noc`, + `noc_router`, UCIe conn bridges) +- `builtin.switch` — tray-level switch +- `builtin.noc` — cube-level NOC fabric (legacy singleton; current + NOC routers use `builtin.forwarding`) +- `builtin.ucie` — UCIe protocol endpoint + +All four aliases instantiate the same class with the same behavior. +Per-instance differentiation lives only in `attrs.overhead_ns`. +Separate impl names exist as intent tags for readability and to +allow future divergence without backward-incompatible config +changes. + +### D8. Configurable `overhead_ns` + +A single attribute drives per-instance latency: + +| Usage site | impl name | overhead_ns | +| --- | --- | --- | +| Tray-level switch | `builtin.switch` | 5.0 | +| Cube NOC router | `builtin.forwarding` | 2.0 | +| IO chiplet io_noc | `builtin.forwarding` | 0.0 | +| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 | +| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 | + +Default is 0.0. The attribute is read at each `run()` invocation, so +dynamic reconfiguration is possible but not currently used. + +## Consequences + +### Positive + +- A single class handles all transit-node roles in the simulation + graph — minimal code surface for a high-population component type. +- Flit-aware processing + serial worker preserves wormhole semantics + across multi-hop paths without per-flit process overhead. +- `overhead_ns` is the only per-instance tunable; routing, BW, and + address resolution stay cleanly separated in their own components / + modules. 
+- Multi-stream merging emerges from the base-class structure; no + router-specific logic duplicates fair-FIFO arbitration. +- Non-Flit fallback path keeps control-plane traffic working without + forcing every message into the flit framework. + +### Negative + +- The single class hides usage-site intent inside `attrs.overhead_ns` + configuration; readers must consult `topology.yaml` + + `components.yaml` to see which impl name maps to which behavior + class. +- Per-flit serial worker is a bottleneck if `overhead_ns` is large + and many concurrent transactions arrive at the same router; current + values (0–8 ns) make this negligible. + +## Links + +- ADR-0002 (Routing distance — path computation) +- ADR-0015 D1 (Component port model) +- ADR-0015 D2 (Wire process — BW + propagation, separate from this + component) +- ADR-0015 D6 (Transit cube forwarding pattern) +- ADR-0016 D1 (IO chiplet io_noc — uses this component) +- ADR-0017 D1 (Cube NOC routers — use this component) +- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this + component) +- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead, + multi-stream merge semantics) diff --git a/docs/adr/ADR-0012-api-host-io-message-schema.md b/docs/adr/ADR-0012-api-host-io-message-schema.md index 07d95c5..0979788 100644 --- a/docs/adr/ADR-0012-api-host-io-message-schema.md +++ b/docs/adr/ADR-0012-api-host-io-message-schema.md @@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that: - IO_CPU-internal fan-out/aggregation can evolve independently, - completion and failure propagation is deterministic. -We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe) +We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe) so IO_CPU can deterministically route/fan-out without relying on PA decoding. 
--- @@ -93,7 +93,7 @@ Rules: Mandatory fields: - common envelope fields (D3) -- destination placement tags (A 방식): +- destination placement tags (Scheme A): - `dst_sip: int` - `dst_cube: int` - `dst_pe: int` @@ -130,7 +130,7 @@ Notes: Mandatory fields: - common envelope fields (D3) -- source placement tags (A 방식): +- source placement tags (Scheme A): - `src_sip: int` - `src_cube: int` - `src_pe: int` @@ -183,7 +183,7 @@ Tensor arg (mandatory): - `shards: list[TensorShard]` -`TensorShard` MUST have (A 방식 강제): +`TensorShard` MUST have (Scheme A enforced): - `sip: int` - `cube: int` diff --git a/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md deleted file mode 100644 index 04ac253..0000000 --- a/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md +++ /dev/null @@ -1,519 +0,0 @@ -# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation) - -## Status - -Accepted - -## Context - -The current simulation models **timing only**. -`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies, -but do not actually read tensor data or perform computations. - -### Required Capabilities - -1. Must be able to store and read actual data in HBM/TCM/SRAM -2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results -3. 
Must minimize simulation performance degradation - -### Constraints - -- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything -- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations -- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused -- Kernel functions must remain plain Python functions (no generator/async transformation) - -### Design Exploration Results - -| Option | Approach | Verdict | -|--------|----------|---------| -| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block | -| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests | -| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads | -| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact | - ---- - -## Decision - -### D1. 2-Pass Execution Model — Phase 0 Elimination - -The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**. - -Before: -``` -Phase 0: Kernel → PeCommand list (no data, no branching) -Phase 1: Replay PeCommand list via SimPy (timing only) -``` - -After: -``` -Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based - - Memory read/write: SimPy timing + MemoryStore actual data - - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2) - - Dynamic control flow possible (tl.load returns actual data) - -Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable -``` - -This ADR **extends Phase 1 to be data-aware for memory operations only**. -Phase 1 handles latency/BW bottleneck analysis + memory data tracking, -Phase 2 handles GEMM/Math computation correctness verification. -Phase 2 is optional — if only timing is needed, run Phase 1 alone. - -### D2. 
Op Log Recording — ComponentBase Hook - -Op log recording is performed as a **hook in the component base class**. -Individual component implementations are not modified. - -```python -class ComponentBase: - def _on_process_start(self, env, msg): - if self._op_logger and getattr(msg, 'data_op', False): - self._op_logger.record_start(env.now, self.node.id, msg) - - def _on_process_end(self, env, msg): - if self._op_logger and getattr(msg, 'data_op', False): - self._op_logger.record_end(env.now, self.node.id, msg) -``` - -Hooks are called before and after `run()` within `_forward_txn()`. -`_op_logger` is optional — zero overhead when absent. - -**Hook timing definitions**: - -| Timing | Meaning | -|--------|---------| -| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) | -| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) | - -Link traversal latency is not included in t_start/t_end. -Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start. - -### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination - -The existing Phase 0 (kernel → PeCommand list) is eliminated, -and **greenlet** is used to cooperatively interleave kernel and SimPy execution. - -#### Operating Principle - -greenlet is a C extension that provides cooperative context switching. -When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet) -to perform timing simulation, and after completion, returns to the kernel with actual data. - -``` -SimPy loop (parent greenlet) Kernel (child greenlet) -───────────────────────── ────────────────────── -g.switch() ─────────────────────────→ Kernel starts - a = tl.load(ptr, ...) - internal: parent.switch(DmaReadCmd) -cmd = DmaReadCmd ←────────────────── (kernel paused) - yield DmaReadMsg(...) 
- yield env.timeout(dma_latency) - data = memory_store.read(...) -g.switch(data) ─────────────────────→ (kernel resumed) - a = data ← actual numpy array - if a[0][0] > 0.5: ← branching possible - ... -``` - -The kernel is maintained as a **plain Python function**. -greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc. - -#### KernelRunner — Framework Layer - -The greenlet loop resides not in the PE_CPU component but in the framework layer, -**KernelRunner**. - -```python -# KernelRunner (framework — greenlet ↔ SimPy bridge) -class KernelRunner: - def run(self, env, kernel_fn, args, store): - g = greenlet(self._run_kernel) - cmd = g.switch(kernel_fn, args) - - while cmd is not None: - if isinstance(cmd, DmaReadCmd): - yield from self._dispatch_dma(env, cmd) - data = store.read(cmd.src_addr, cmd.shape, cmd.dtype) - cmd = g.switch(data) # resume with actual data - elif isinstance(cmd, GemmCmd): - yield from self._dispatch_gemm(env, cmd) - cmd = g.switch() # resume (no data) - elif isinstance(cmd, DmaWriteCmd): - store.write(cmd.dst_addr, cmd.data) # visibility = issue time - yield from self._dispatch_dma(env, cmd) # timing only - cmd = g.switch() - -# PE_CPU (component — kept simple, unaware of greenlet) -def _execute_kernel(self, env): - runner = KernelRunner(self.ctx) - yield from runner.run(env, kernel_fn, args, store) -``` - -**Op logging single source of truth**: KernelRunner does not record directly to op_log. -All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**. -When KernelRunner delivers messages to components via `_dispatch_gemm()` etc., -the component base class hooks automatically record them. - -**Layer separation**: -- **Kernel code**: plain function, unaware of greenlet -- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()` -- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**. 
-- **ComponentBase hook**: the sole path for op_log recording -- **PE_CPU**: only calls KernelRunner, replaceable as a component - -#### Handling Differences Between Memory Read/Write and Compute - -| Operation | In Phase 1 | In Phase 2 | -|-----------|-----------|-----------| -| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — | -| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — | -| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation | -| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation | - -Memory read/write is processed immediately in Phase 1 (numpy slice, fast). -GEMM/Math operations are batch-executed in Phase 2 (performance separation). - -#### Store Visibility Rule - -`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue). -SimPy DMA timing is simulated separately afterward. - -This is an intentional separation of timing and visibility: -- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called -- **timing**: the point at which DMA latency completes in SimPy - -This separation allows a load immediately after a store to see the latest data in dynamic control flow. - -#### Result Handle Semantics - -`tl.composite()` (sync/async) returns a **handle** referencing the result tensor. - -The key contract in Phase 1: - -1. **All compute handles are always considered pending in Phase 1.** -2. `tl.wait(handle)` **expresses timing synchronization only** - and does not make the handle ready. -3. Accessing the handle's actual result data (`handle.data`, element access, - numpy conversion, etc.) is **only possible in Phase 2**. -4. Therefore, **compute-result-based control flow is not supported in Phase 1.** -5. In contrast, `tl.load()` returns actual data in Phase 1, so - **memory-read-based control flow is supported**. 
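The handle contract above can be sketched with a small class. This is an illustration under assumptions, not the project's handle type: `ComputeHandle`, `PendingHandleError`, `wait`, and `materialize` are hypothetical names, and a plain nested list stands in for the numpy result.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code tries to read a compute result."""


class ComputeHandle:
    def __init__(self, op_index):
        self.op_index = op_index  # position of the producing op in the op_log
        self._ready = False
        self._data = None

    def wait(self):
        # Timing synchronization only: waiting never makes the handle ready.
        return None

    @property
    def data(self):
        if not self._ready:
            raise PendingHandleError("compute results are only available in Phase 2")
        return self._data

    def __bool__(self):
        # Truth-value evaluation would be value-based branching; reject it while pending.
        if not self._ready:
            raise PendingHandleError("cannot branch on a pending compute handle")
        return True

    def materialize(self, value):
        # Called by the Phase 2 executor once the real computation has run.
        self._data = value
        self._ready = True


handle = ComputeHandle(op_index=0)
handle.wait()  # still pending after wait()
try:
    handle.data
    phase1_access_failed = False
except PendingHandleError:
    phase1_access_failed = True

handle.materialize([[1.0, 2.0]])  # Phase 2 fills in the result
phase2_value = handle.data
```

Raising on access, instead of returning a placeholder, turns an accidental compute-result branch in Phase 1 into an immediate error.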
- -| Handle state | Phase | Allowed operations | -|------------|-------|----------| -| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only | -| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) | -| pending | Phase 1 | **Data access not allowed** — value-based branching not possible | -| ready | Phase 2 | Actual numpy data access, verification | - -This restriction is intentional. If computations were executed in Phase 1, -the SimPy single-thread would block, defeating the purpose of 2-pass separation. - -#### Phase 1 Materialization — Future Extension - -If Phase 1 eager execution becomes necessary for small operations -(scalar, small reduction) in the future, selective materialization can be supported -by adding a `materialized_in_phase1: bool` flag to the op record. -This is not implemented in the current scope. - -### D4. data_op Flag — Message Self-Declaration - -The logging target is determined by the `data_op` attribute on the message instance, -not by message type. The framework does not hardcode message types. - -```python -class MsgBase: - data_op: bool = False # default: no logging - -class DmaReadCmd(MsgBase): - data_op = True # memory transfer → logging - -class GemmCmd(MsgBase): - data_op = True # compute → logging - -class MathCmd(MsgBase): - data_op = True # compute → logging -``` - -When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True` -enables automatic logging without modifying framework code. - -### D5. Op Log Structure - -#### Op Classification Scheme - -A two-level classification is used: - -| Level | Field | Role | -|-------|-------|------| -| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion | -| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. 
| specific operation identification | - -#### OpRecord Definition - -```python -@dataclass -class OpRecord: - t_start: float # SimPy time (ns) — service start - t_end: float # SimPy time (ns) — service completion - component_id: str # e.g. "sip0.cube0.pe0.pe_gemm" - op_kind: str # "memory" | "gemm" | "math" - op_name: str # specific operation name - params: dict # per-operation parameters (see below) - dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future -``` - -#### dependency_ids Generation Rules - -`dependency_ids` is **optional**, and by default the executor performs -address-based dependency inference (see D6). - -Explicit setting is only needed when precise execution ordering is required: -- **Default (address-based inference)**: the executor analyzes read/write sets to - automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases. -- **Explicit setting**: set when logical dependencies cannot be expressed via addresses - at the TLContext or command generation stage. - Example: completion handle-based synchronization — handle dependencies depend on - logical completion order rather than memory addresses, so they cannot be captured - by address inference. - -#### op_log Ordering - -The op_log maintains **stable ordering** based on `t_start`. -Records with the same `t_start` preserve insertion order. - -#### params Details - -**memory (dma_read / dma_write)**: -```python -{ - "src_addr": int, # source address (byte) - "dst_addr": int, # destination address (byte) - "nbytes": int, # transfer size - "src_space": str, # "hbm" | "tcm" | "sram" - "dst_space": str, # "hbm" | "tcm" | "sram" -} -``` - -**gemm**: -```python -{ - "src_a_addr": int, # operand A address - "src_b_addr": int, # operand B address - "dst_addr": int, # output address - "shape_a": tuple, # e.g. (128, 256) - "shape_b": tuple, # e.g. (256, 128) - "shape_out": tuple, # e.g. (128, 128) - "dtype_in": str, # e.g. 
"f16" - "dtype_acc": str, # accumulation dtype, e.g. "f32" - "dtype_out": str, # output dtype, e.g. "f16" - "transpose_a": bool, - "transpose_b": bool, - "layout_a": str, # "row_major" | "col_major" - "layout_b": str, - "layout_out": str, - "addr_space": str, # "tcm" (GEMM operands are always in TCM) -} -``` - -**math**: -```python -{ - "op": str, # "exp" | "add" | "sum" | "where" | ... - "input_addrs": list[int], # list of operand addresses - "input_shapes": list[tuple], - "dst_addr": int, - "shape_out": tuple, - "dtype": str, - "axis": int | None, # reduction axis - "addr_space": str, # "tcm" -} -``` - -### D6. Phase 2 Executor - -Phase 2 executes the op_log outside of SimPy. - -```python -class DataExecutor: - def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore): - self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input - - def run(self): - for t, ops in groupby(op_log, key=lambda o: o.t_start): - batch = list(ops) - independent, sequential = self._classify(batch) - self._execute_parallel(independent) - self._execute_sequential(sequential) -``` - -**Parallel execution determination**: - -Ops with the same `t_start` are considered **parallel candidates**. -The executor determines actual parallel execution based on the following criteria: -- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks) -- Whether predecessor ops specified in `dependency_ids` have completed - -Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel. - -**Batch optimization**: Only independent ops with the same op_name **and identical -shape, dtype, layout, and transpose flags** are eligible for batching. -Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call. -Improves BLAS efficiency on CPU, reduces launch overhead on GPU. 
- -**Phase 2 execution order guarantee**: - -Phase 2 does not consider data arrival timing, -and guarantees execution order solely through -dependencies (address-based inference + explicit dependency_ids). - -### D7. Memory Store - -`MemoryStore` logically follows byte-addressable semantics, -and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping). - -```python -class MemoryStore: - def write(self, space: str, addr: int, data: np.ndarray) -> None: ... - def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ... -``` - -**Internal storage format: numpy ndarray** - -MemoryStore stores tensors as **numpy ndarrays**. - -| Candidate | store/load speed | Phase 2 compute | Verdict | -|-----------|-----------------|-----------------|---------| -| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** | -| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected | -| torch tensor | Immediate | torch operations available | Use only for GPU optimization | - -- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup -- read: **returns numpy array by reference** (no copy) -- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported) -- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.) -- For byte-level access, convert via `.view(np.uint8)` -- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility - -**read/write contract**: - -- read/write operates on a **contiguous tensor** basis. - If non-contiguous stride views are needed, express them as separate copy ops. -- In the normal benchmark path, producer/consumer dtype match is expected. - Reinterpret cast is a permissive behavior for low-level memory validation - or special test cases. -- addr is byte-aligned, with minimum alignment = dtype size. 
-- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. - Shape mismatch is verified based on nbytes, and raises an error on mismatch. -- Correctness criteria follow address-range-based read/write semantics. -- A tensor object cache may be used as an implementation optimization, - but the canonical state is byte-addressable storage. -- At deploy time, the host injects initial tensor data. - -### D8. Benchmark Kernel Code - -The benchmark's **user code API is not changed**. -The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained. - -However, internal command/message schemas may be extended to include metadata -required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose). - -### D9. No Component Changes - -Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. -Op log recording is the responsibility of the ComponentBase hook. -When custom components are replaced, only the timing model changes, -and Phase 2 data execution is unaffected. - -### D10. Phase 2 is Optional - -```python -engine = GraphEngine(graph) -engine.run(benchmark) # Phase 1: timing only -result = engine.get_timing_result() - -if verify_data: - executor = DataExecutor(engine.op_log) # Phase 2: data - executor.run() - executor.verify(expected_output) -``` - -If only timing analysis is needed, Phase 2 is skipped. -If the op_logger is deactivated, Phase 1 performance is identical to the original. - -### D11. Verification Contract - -Basic verification **compares the final output tensor** against a reference backend (numpy). 
- -Per-dtype tolerance policy: - -| dtype | Comparison method | Tolerance | -|-------|----------|-----------| -| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 | -| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 | -| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 | -| int types | `np.array_equal` | exact | - -- Default mode: compare final output only (end-to-end correctness) -- Debug mode: can compare intermediate tensors on a per-op basis - (MemoryStore snapshot at each op boundary) - ---- - -## Non-goals - -- **Compute-result-based control flow**: not supported. - All compute handles are in pending state during Phase 1, - `wait()` expresses timing synchronization only and does not imply data readiness. - Accessing `handle.data`, element access, or truth-value evaluation in Phase 1 - is **treated as an error**. - Memory-data-based branching (results of `tl.load()`) is supported via greenlet. - Phase 1 materialization is a future extension (see D3). -- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce - the execution time overlap from Phase 1. Phase 2 only verifies data correctness. -- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls - and do not reproduce the actual hardware PE microarchitecture. 
- -## Open Questions - -- **Aliasing / slice view**: How to represent slice/views referencing the same - backing storage in MemoryStore (stride-based view vs copy semantics) -- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE - communication as memory ops or introduce a separate op_kind -- **Op log streaming**: Managing op_log memory usage in large-scale simulations - (in-memory list vs disk-backed streaming) -- **Fused operation**: Whether to record tl.composite's tiled pipeline - (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops -- **Math op schema generalization**: The current math params have a simple structure, - but generalization may be needed for broadcasting rules, per-input dtype, keepdims, - scalar/immediate operands, where/mask expressions, etc. -- **Op record identifier**: Currently dependency_ids are based on in-memory list indices; - replacement with stable op_id is needed when introducing streaming/disk-backed mode -- **Phase 1 materialization policy**: See Future Extension in D3. - If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops - needs to be defined - ---- - -## Consequences - -### Positive - -- Minimal impact on SimPy simulation performance (only op_log append added) -- Free to use multi-threading/GPU in Phase 2 -- Component replaceability preserved (ADR-0015 design philosophy maintained) -- No changes needed to benchmark user code API -- When adding new message types, only set the data_op flag -- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported -- `tl.load()` returns actual data, making kernel debugging easier - -### Negative - -- op_log memory usage (for large-scale simulations) -- Phase 2 execution time is proportional to tensor size (large GEMM) -- Dynamic branching based on pending handles (incomplete computations) not possible - (computations execute in Phase 2, result values are undetermined in Phase 1). 
- Memory-data-based branching is supported via greenlet. -- greenlet C extension dependency added (pip install greenlet) diff --git a/docs/adr/ADR-0020-prog-data-execution-two-pass.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.md index a8d277d..04ac253 100644 --- a/docs/adr/ADR-0020-prog-data-execution-two-pass.md +++ b/docs/adr/ADR-0020-prog-data-execution-two-pass.md @@ -1,4 +1,4 @@ -# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리) +# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation) ## Status @@ -6,65 +6,65 @@ Accepted ## Context -현재 시뮬레이션은 **타이밍만** 모델링한다. -`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만, -실제 텐서 데이터를 읽거나 연산하지 않는다. +The current simulation models **timing only**. +`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies, +but do not actually read tensor data or perform computations. -### 필요한 기능 +### Required Capabilities -1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다 -2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다 -3. 시뮬레이션 성능 저하를 최소화해야 한다 +1. Must be able to store and read actual data in HBM/TCM/SRAM +2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results +3. 
Must minimize simulation performance degradation -### 제약 조건 +### Constraints -- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block -- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨 -- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함 -- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가) +- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything +- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations +- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused +- Kernel functions must remain plain Python functions (no generator/async transformation) -### 설계 탐색 결과 +### Design Exploration Results -| Option | 방식 | 판정 | -|--------|------|------| -| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block | -| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block | -| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 | -| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 | +| Option | Approach | Verdict | +|--------|----------|---------| +| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block | +| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests | +| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads | +| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact | --- ## Decision -### D1. 2-Pass 실행 모델 — Phase 0 제거 +### D1. 2-Pass Execution Model — Phase 0 Elimination -기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다. +The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**. 
-기존: +Before: ``` -Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가) -Phase 1: PeCommand 리스트를 SimPy replay (타이밍만) +Phase 0: Kernel → PeCommand list (no data, no branching) +Phase 1: Replay PeCommand list via SimPy (timing only) ``` -변경: +After: ``` -Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반 - - 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터 - - 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2) - - dynamic control flow 가능 (tl.load가 실제 데이터 반환) +Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based + - Memory read/write: SimPy timing + MemoryStore actual data + - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2) + - Dynamic control flow possible (tl.load returns actual data) -Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능 +Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable ``` -본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다. -Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적, -Phase 2는 GEMM/Math 연산 정합성 검증. -Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행. +This ADR **extends Phase 1 to be data-aware for memory operations only**. +Phase 1 handles latency/BW bottleneck analysis + memory data tracking, +Phase 2 handles GEMM/Math computation correctness verification. +Phase 2 is optional — if only timing is needed, run Phase 1 alone. -### D2. Op Log 기록 — ComponentBase hook +### D2. Op Log Recording — ComponentBase Hook -op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다. -개별 컴포넌트 구현을 수정하지 않는다. +Op log recording is performed as a **hook in the component base class**. +Individual component implementations are not modified. ```python class ComponentBase: @@ -77,56 +77,56 @@ class ComponentBase: self._op_logger.record_end(env.now, self.node.id, msg) ``` -`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다. -`_op_logger`는 optional — 없으면 오버헤드 제로. +Hooks are called before and after `run()` within `_forward_txn()`. +`_op_logger` is optional — zero overhead when absent. 
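A minimal sketch of the optional-hook pattern described above, written without SimPy so it runs standalone. Only `_op_logger`, `record_end(now, node_id, msg)` and the `data_op` flag come from the ADR; `record_start`, the `Msg` class, the `forward()` method (a stand-in for `_forward_txn()`), and the explicit `service_time` argument are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Msg:
    """Hypothetical message; only `data_op` is taken from the ADR (D4)."""
    name: str
    data_op: bool = False


@dataclass
class OpLogger:
    """Hypothetical logger that pairs start/end times per message."""
    records: list = field(default_factory=list)
    _open: dict = field(default_factory=dict)

    def record_start(self, now, node_id, msg):
        if msg.data_op:                      # non-data messages are ignored
            self._open[id(msg)] = now

    def record_end(self, now, node_id, msg):
        t_start = self._open.pop(id(msg), None)
        if t_start is not None:
            self.records.append((t_start, now, node_id, msg.name))


class ComponentBase:
    def __init__(self, node_id, op_logger=None):
        self.node_id = node_id
        self._op_logger = op_logger          # None means logging is off

    def run(self, msg):
        """Component-specific service; subclasses override this."""

    def forward(self, now, service_time, msg):
        # Hooks wrap run(); with no logger attached both branches are skipped.
        if self._op_logger is not None:
            self._op_logger.record_start(now, self.node_id, msg)
        self.run(msg)
        end = now + service_time
        if self._op_logger is not None:
            self._op_logger.record_end(end, self.node_id, msg)
        return end
```

Because the hooks live in the base class, a replacement component only overrides `run()` and never touches logging, which matches the replaceability constraint from ADR-0015.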
-**hook 시점 정의**: +**Hook timing definitions**: -| 시점 | 의미 | -|------|------| -| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) | -| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) | +| Timing | Meaning | +|--------|---------| +| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) | +| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) | -link traversal latency는 t_start/t_end에 포함되지 않는다. -link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다. +Link traversal latency is not included in t_start/t_end. +Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start. -### D3. Greenlet 기반 커널 실행 — Phase 0 제거 +### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination -기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고, -**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다. +The existing Phase 0 (kernel → PeCommand list) is eliminated, +and **greenlet** is used to cooperatively interleave kernel and SimPy execution. -#### 동작 원리 +#### Operating Principle -greenlet은 협력적 context switch를 제공하는 C 확장이다. -커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로 -switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다. +greenlet is a C extension that provides cooperative context switching. +When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet) +to perform timing simulation, and after completion, returns to the kernel with actual data. ``` -SimPy 루프 (parent greenlet) 커널 (child greenlet) +SimPy loop (parent greenlet) Kernel (child greenlet) ───────────────────────── ────────────────────── -g.switch() ─────────────────────────→ 커널 시작 +g.switch() ─────────────────────────→ Kernel starts a = tl.load(ptr, ...) 
- 내부: parent.switch(DmaReadCmd) -cmd = DmaReadCmd ←────────────────── (커널 일시정지) + internal: parent.switch(DmaReadCmd) +cmd = DmaReadCmd ←────────────────── (kernel paused) yield DmaReadMsg(...) yield env.timeout(dma_latency) data = memory_store.read(...) -g.switch(data) ─────────────────────→ (커널 재개) - a = data ← 실제 numpy array - if a[0][0] > 0.5: ← 분기 가능 +g.switch(data) ─────────────────────→ (kernel resumed) + a = data ← actual numpy array + if a[0][0] > 0.5: ← branching possible ... ``` -커널은 **plain Python function**으로 유지된다. -greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다. +The kernel is maintained as a **plain Python function**. +greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc. -#### KernelRunner — 프레임워크 레이어 +#### KernelRunner — Framework Layer -greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인 -**KernelRunner**에 위치한다. +The greenlet loop resides not in the PE_CPU component but in the framework layer, +**KernelRunner**. ```python -# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결) +# KernelRunner (framework — greenlet ↔ SimPy bridge) class KernelRunner: def run(self, env, kernel_fn, args, store): g = greenlet(self._run_kernel) @@ -136,160 +136,162 @@ class KernelRunner: if isinstance(cmd, DmaReadCmd): yield from self._dispatch_dma(env, cmd) data = store.read(cmd.src_addr, cmd.shape, cmd.dtype) - cmd = g.switch(data) # 실제 데이터와 함께 재개 + cmd = g.switch(data) # resume with actual data elif isinstance(cmd, GemmCmd): yield from self._dispatch_gemm(env, cmd) - cmd = g.switch() # 재개 (데이터 없음) + cmd = g.switch() # resume (no data) elif isinstance(cmd, DmaWriteCmd): - store.write(cmd.dst_addr, cmd.data) # visibility = issue 시점 - yield from self._dispatch_dma(env, cmd) # timing만 반영 + store.write(cmd.dst_addr, cmd.data) # visibility = issue time + yield from self._dispatch_dma(env, cmd) # timing only cmd = g.switch() -# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름) +# PE_CPU (component — kept simple, unaware of greenlet) def 
_execute_kernel(self, env): runner = KernelRunner(self.ctx) yield from runner.run(env, kernel_fn, args, store) ``` -**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다. -모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다. -KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면, -컴포넌트 베이스 클래스의 hook이 자동으로 기록한다. +**Op logging single source of truth**: KernelRunner does not record directly to op_log. +All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**. +When KernelRunner delivers messages to components via `_dispatch_gemm()` etc., +the component base class hooks automatically record them. -**레이어 분리**: -- **커널 코드**: plain function, greenlet 존재를 모름 -- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출 -- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**. -- **ComponentBase hook**: op_log 기록의 유일한 경로 -- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능 +**Layer separation**: +- **Kernel code**: plain function, unaware of greenlet +- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()` +- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**. 
+- **ComponentBase hook**: the sole path for op_log recording +- **PE_CPU**: only calls KernelRunner, replaceable as a component -#### 메모리 읽기/쓰기 vs 연산의 처리 차이 +#### Handling Differences Between Memory Read/Write and Compute -| 연산 | Phase 1에서 | Phase 2에서 | -|------|------------|------------| -| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — | -| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — | -| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 | -| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 | +| Operation | In Phase 1 | In Phase 2 | +|-----------|-----------|-----------| +| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — | +| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — | +| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation | +| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation | -메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름). -GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리). +Memory read/write is processed immediately in Phase 1 (numpy slice, fast). +GEMM/Math operations are batch-executed in Phase 2 (performance separation). #### Store Visibility Rule -`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue). -SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다. +`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue). +SimPy DMA timing is simulated separately afterward. -이는 timing과 visibility를 의도적으로 분리한 것이다: -- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시 -- **timing**: SimPy에서 DMA latency가 완료되는 시점 +This is an intentional separation of timing and visibility: +- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called +- **timing**: the point at which DMA latency completes in SimPy -이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다. 
+This separation allows a load immediately after a store to see the latest data in dynamic control flow. #### Result Handle Semantics -`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다. +`tl.composite()` (sync/async) returns a **handle** referencing the result tensor. -Phase 1에서의 핵심 계약: +The key contract in Phase 1: -1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.** -2. `tl.wait(handle)`은 **timing synchronization만 표현**하며, - handle을 ready로 만들지 않는다. -3. handle의 실제 결과 데이터 접근(`handle.data`, element access, - numpy conversion 등)은 **Phase 2에서만 가능**하다. -4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.** -5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로, - **memory-read 기반 control flow는 지원 가능**하다. +1. **All compute handles are always considered pending in Phase 1.** +2. `tl.wait(handle)` **expresses timing synchronization only** + and does not make the handle ready. +3. Accessing the handle's actual result data (`handle.data`, element access, + numpy conversion, etc.) is **only possible in Phase 2**. +4. Therefore, **compute-result-based control flow is not supported in Phase 1.** +5. In contrast, `tl.load()` returns actual data in Phase 1, so + **memory-read-based control flow is supported**. -| handle 상태 | Phase | 허용 동작 | +| Handle state | Phase | Allowed operations | |------------|-------|----------| -| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 | -| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) | -| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 | -| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 | +| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only | +| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) | +| pending | Phase 1 | **Data access not allowed** — value-based branching not possible | +| ready | Phase 2 | Actual numpy data access, verification | -이 제약은 의도적이다. 
Phase 1에서 연산을 실행하면 SimPy single-thread가 -block되어 2-pass 분리의 존재 이유가 사라진다. +This restriction is intentional. If computations were executed in Phase 1, +the SimPy single-thread would block, defeating the purpose of 2-pass separation. #### Phase 1 Materialization — Future Extension -향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이 -필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여 -선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다. +If Phase 1 eager execution becomes necessary for small operations +(scalar, small reduction) in the future, selective materialization can be supported +by adding a `materialized_in_phase1: bool` flag to the op record. +This is not implemented in the current scope. -### D4. data_op 플래그 — 메시지 자기 선언 +### D4. data_op Flag — Message Self-Declaration -로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다. -프레임워크가 메시지 타입을 하드코딩하지 않는다. +The logging target is determined by the `data_op` attribute on the message instance, +not by message type. The framework does not hardcode message types. ```python class MsgBase: - data_op: bool = False # 기본: 로깅 안 함 + data_op: bool = False # default: no logging class DmaReadCmd(MsgBase): - data_op = True # 메모리 이동 → 로깅 + data_op = True # memory transfer → logging class GemmCmd(MsgBase): - data_op = True # 연산 → 로깅 + data_op = True # compute → logging class MathCmd(MsgBase): - data_op = True # 연산 → 로깅 + data_op = True # compute → logging ``` -새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면 -프레임워크 코드 수정 없이 자동 로깅된다. +When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True` +enables automatic logging without modifying framework code. -### D5. Op Log 구조 +### D5. 
Op Log Structure -#### op 분류 체계 +#### Op Classification Scheme -2단계로 분류한다: +A two-level classification is used: -| 레벨 | 필드 | 역할 | -|------|------|------| -| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 | -| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 | +| Level | Field | Role | +|-------|-------|------| +| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion | +| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification | -#### OpRecord 정의 +#### OpRecord Definition ```python @dataclass class OpRecord: - t_start: float # SimPy 시각 (ns) — service 시작 - t_end: float # SimPy 시각 (ns) — service 완료 + t_start: float # SimPy time (ns) — service start + t_end: float # SimPy time (ns) — service completion component_id: str # e.g. "sip0.cube0.pe0.pe_gemm" op_kind: str # "memory" | "gemm" | "math" - op_name: str # 구체 연산명 - params: dict # 연산별 파라미터 (아래 참조) - dependency_ids: list[int] # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능 + op_name: str # specific operation name + params: dict # per-operation parameters (see below) + dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future ``` -#### dependency_ids 생성 규칙 +#### dependency_ids Generation Rules -`dependency_ids`는 **optional**이며, 기본적으로 executor는 -주소 기반 dependency 추론을 수행한다 (D6 참조). +`dependency_ids` is **optional**, and by default the executor performs +address-based dependency inference (see D6). -정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다: -- **기본 (address-based inference)**: executor가 read/write set을 분석하여 - RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분. -- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가 - 주소로 표현되지 않는 경우에 설정. - 예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라 - 논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다. 
+Explicit setting is only needed when precise execution ordering is required: +- **Default (address-based inference)**: the executor analyzes read/write sets to + automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases. +- **Explicit setting**: set when logical dependencies cannot be expressed via addresses + at the TLContext or command generation stage. + Example: completion handle-based synchronization — handle dependencies depend on + logical completion order rather than memory addresses, so they cannot be captured + by address inference. -#### op_log ordering +#### op_log Ordering -op_log는 `t_start` 기준으로 **stable ordering**을 유지한다. -동일 `t_start`의 record들은 insertion order를 보존한다. +The op_log maintains **stable ordering** based on `t_start`. +Records with the same `t_start` preserve insertion order. -#### params 상세 +#### params Details **memory (dma_read / dma_write)**: ```python { - "src_addr": int, # source 주소 (byte) - "dst_addr": int, # destination 주소 (byte) - "nbytes": int, # 전송 크기 + "src_addr": int, # source address (byte) + "dst_addr": int, # destination address (byte) + "nbytes": int, # transfer size "src_space": str, # "hbm" | "tcm" | "sram" "dst_space": str, # "hbm" | "tcm" | "sram" } @@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다. **gemm**: ```python { - "src_a_addr": int, # operand A 주소 - "src_b_addr": int, # operand B 주소 - "dst_addr": int, # output 주소 + "src_a_addr": int, # operand A address + "src_b_addr": int, # operand B address + "dst_addr": int, # output address "shape_a": tuple, # e.g. (128, 256) "shape_b": tuple, # e.g. (256, 128) "shape_out": tuple, # e.g. (128, 128) @@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다. 
"layout_a": str, # "row_major" | "col_major" "layout_b": str, "layout_out": str, - "addr_space": str, # "tcm" (GEMM operand는 항상 TCM) + "addr_space": str, # "tcm" (GEMM operands are always in TCM) } ``` @@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다. ```python { "op": str, # "exp" | "add" | "sum" | "where" | ... - "input_addrs": list[int], # operand 주소 목록 + "input_addrs": list[int], # list of operand addresses "input_shapes": list[tuple], "dst_addr": int, "shape_out": tuple, @@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다. ### D6. Phase 2 Executor -Phase 2는 SimPy 밖에서 op_log를 실행한다. +Phase 2 executes the op_log outside of SimPy. ```python class DataExecutor: def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore): - self.store = initial_store # Phase 1의 MemoryStore snapshot을 입력으로 받는다 + self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input def run(self): for t, ops in groupby(op_log, key=lambda o: o.t_start): @@ -347,30 +349,30 @@ class DataExecutor: self._execute_sequential(sequential) ``` -**병렬 실행 판정**: +**Parallel execution determination**: -같은 `t_start`의 op들은 **병렬 후보**로 간주한다. -실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다: -- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사) -- `dependency_ids`에 명시된 선행 op 완료 여부 +Ops with the same `t_start` are considered **parallel candidates**. +The executor determines actual parallel execution based on the following criteria: +- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks) +- Whether predecessor ops specified in `dependency_ids` have completed -주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다. +Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel. -**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가 -모두 동일한** 독립 op들만 batching 대상이 된다. -예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음. -CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감. 
+**Batch optimization**: Only independent ops with the same op_name **and identical +shape, dtype, layout, and transpose flags** are eligible for batching. +Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call. +Improves BLAS efficiency on CPU, reduces launch overhead on GPU. -**Phase 2 실행 순서 보장**: +**Phase 2 execution order guarantee**: -Phase 2는 데이터 도착 시점을 고려하지 않으며, -dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만 -실행 순서를 보장한다. +Phase 2 does not consider data arrival timing, +and guarantees execution order solely through +dependencies (address-based inference + explicit dependency_ids). ### D7. Memory Store -`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며, -현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다. +`MemoryStore` logically follows byte-addressable semantics, +and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping). ```python class MemoryStore: @@ -378,139 +380,140 @@ class MemoryStore: def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ... ``` -**내부 저장 포맷: numpy ndarray** +**Internal storage format: numpy ndarray** -MemoryStore는 텐서를 **numpy ndarray**로 저장한다. +MemoryStore stores tensors as **numpy ndarrays**. 
-| 후보 | store/load 속도 | Phase 2 연산 | 판정 | -|------|----------------|-------------|------| -| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** | -| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 | -| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 | +| Candidate | store/load speed | Phase 2 compute | Verdict | +|-----------|-----------------|-----------------|---------| +| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** | +| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected | +| torch tensor | Immediate | torch operations available | Use only for GPU optimization | -- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회 -- read: numpy array를 **참조 반환** (복사 없음) -- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원) -- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등) -- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환 -- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당 +- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup +- read: **returns numpy array by reference** (no copy) +- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported) +- dtype uses numpy native (`np.float16`, `np.float32`, `np.bfloat16`, etc.) +- For byte-level access, convert via `.view(np.uint8)` +- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility **read/write contract**: -- read/write는 **contiguous tensor** 기준이다. - non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다. -- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다. - reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한 - permissive behavior이다. -- addr은 byte-aligned이며, 최소 alignment = dtype 크기. -- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다. - shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error. 
-- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다. -- 구현 최적화로 tensor object cache를 둘 수 있지만, - canonical state는 byte-addressable storage이다. -- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다. +- read/write operates on a **contiguous tensor** basis. + If non-contiguous stride views are needed, express them as separate copy ops. +- In the normal benchmark path, producer/consumer dtype match is expected. + Reinterpret cast is a permissive behavior for low-level memory validation + or special test cases. +- addr is byte-aligned, with minimum alignment = dtype size. +- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast. + Shape mismatch is verified based on nbytes, and raises an error on mismatch. +- Correctness criteria follow address-range-based read/write semantics. +- A tensor object cache may be used as an implementation optimization, + but the canonical state is byte-addressable storage. +- At deploy time, the host injects initial tensor data. -### D8. 벤치마크 커널 코드 +### D8. Benchmark Kernel Code -벤치마크의 **사용자 코드 API는 변경하지 않는다**. -`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지. +The benchmark's **user code API is not changed**. +The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained. -단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를 -포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드). +However, internal command/message schemas may be extended to include metadata +required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose). -### D9. 컴포넌트 변경 없음 +### D9. No Component Changes -개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다. -op_log 기록은 ComponentBase hook의 책임이다. -커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며, -Phase 2 데이터 실행은 영향받지 않는다. +Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified. +Op log recording is the responsibility of the ComponentBase hook. 
+When custom components are replaced, only the timing model changes, +and Phase 2 data execution is unaffected. -### D10. Phase 2는 Optional +### D10. Phase 2 is Optional ```python engine = GraphEngine(graph) -engine.run(benchmark) # Phase 1: 타이밍만 +engine.run(benchmark) # Phase 1: timing only result = engine.get_timing_result() if verify_data: - executor = DataExecutor(engine.op_log) # Phase 2: 데이터 + executor = DataExecutor(engine.op_log) # Phase 2: data executor.run() executor.verify(expected_output) ``` -타이밍 분석만 필요하면 Phase 2를 건너뛴다. -op_logger를 비활성화하면 Phase 1 성능도 기존과 동일. +If only timing analysis is needed, Phase 2 is skipped. +If the op_logger is deactivated, Phase 1 performance is identical to the original. ### D11. Verification Contract -기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다. +Basic verification **compares the final output tensor** against a reference backend (numpy). -dtype별 tolerance 정책: +Per-dtype tolerance policy: -| dtype | 비교 방식 | tolerance | +| dtype | Comparison method | Tolerance | |-------|----------|-----------| | f32 | `np.allclose` | rtol=1e-5, atol=1e-5 | | f16 | `np.allclose` | rtol=1e-3, atol=1e-3 | | bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 | -| int 계열 | `np.array_equal` | exact | +| int types | `np.array_equal` | exact | -- 기본 모드: 최종 output만 비교 (end-to-end correctness) -- 디버그 모드: intermediate tensor도 op 단위로 비교 가능 +- Default mode: compare final output only (end-to-end correctness) +- Debug mode: can compare intermediate tensors on a per-op basis (MemoryStore snapshot at each op boundary) --- ## Non-goals -- **Compute-result-based control flow**: 지원하지 않는다. - 모든 compute handle은 Phase 1에서 pending 상태이며, - `wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다. - Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은 - **error로 처리**한다. - 메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다. - Phase 1 materialization은 future extension (D3 참조). 
-- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간 - overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다. -- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며, - 실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다. +- **Compute-result-based control flow**: not supported. + All compute handles are in pending state during Phase 1, + `wait()` expresses timing synchronization only and does not imply data readiness. + Accessing `handle.data`, element access, or truth-value evaluation in Phase 1 + is **treated as an error**. + Memory-data-based branching (results of `tl.load()`) is supported via greenlet. + Phase 1 materialization is a future extension (see D3). +- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce + the execution time overlap from Phase 1. Phase 2 only verifies data correctness. +- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls + and do not reproduce the actual hardware PE microarchitecture. ## Open Questions -- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를 - MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics) -- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히 - 일반화할지, 별도 op_kind를 둘지 -- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리 +- **Aliasing / slice view**: How to represent slice/views referencing the same + backing storage in MemoryStore (stride-based view vs copy semantics) +- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE + communication as memory ops or introduce a separate op_kind +- **Op log streaming**: Managing op_log memory usage in large-scale simulations (in-memory list vs disk-backed streaming) -- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을 - 하나의 fused op record로 기록할지, 개별 op으로 분리할지 -- **Math op schema 일반화**: 현재 math params는 단순 구조이나, - broadcasting rule, input별 dtype, keepdims, scalar/immediate operand, - where/mask 표현 등 일반화가 필요할 수 있음 -- **Op record 식별자**: 현재 dependency_ids는 
in-memory list index 기반이며, - streaming/disk-backed mode 도입 시 stable op_id로 대체 필요 -- **Phase 1 materialization policy**: D3의 Future Extension 참조. - 허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요 +- **Fused operation**: Whether to record tl.composite's tiled pipeline + (READ→COMPUTE→WRITE) as a single fused op record or separate individual ops +- **Math op schema generalization**: The current math params have a simple structure, + but generalization may be needed for broadcasting rules, per-input dtype, keepdims, + scalar/immediate operands, where/mask expressions, etc. +- **Op record identifier**: Currently dependency_ids are based on in-memory list indices; + replacement with stable op_id is needed when introducing streaming/disk-backed mode +- **Phase 1 materialization policy**: See Future Extension in D3. + If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops + needs to be defined --- ## Consequences -### 긍정적 +### Positive -- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가) -- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능 -- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존) -- 벤치마크 사용자 코드 API 변경 불필요 -- 새 메시지 타입 추가 시 data_op 플래그만 설정 -- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원 -- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이 +- Minimal impact on SimPy simulation performance (only op_log append added) +- Free to use multi-threading/GPU in Phase 2 +- Component replaceability preserved (ADR-0015 design philosophy maintained) +- No changes needed to benchmark user code API +- When adding new message types, only set the data_op flag +- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported +- `tl.load()` returns actual data, making kernel debugging easier -### 부정적 +### Negative -- op_log 메모리 사용량 (대규모 시뮬레이션 시) -- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM) -- pending handle (연산 미완료) 기반 동적 분기 불가 - (연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정). - 메모리 데이터 기반 분기는 greenlet으로 지원된다. 
-- greenlet C 확장 의존성 추가 (pip install greenlet) +- op_log memory usage (for large-scale simulations) +- Phase 2 execution time is proportional to tensor size (large GEMM) +- Dynamic branching based on pending handles (incomplete computations) not possible + (computations execute in Phase 2, result values are undetermined in Phase 1). + Memory-data-based branching is supported via greenlet. +- greenlet C extension dependency added (pip install greenlet) diff --git a/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md deleted file mode 100644 index e6b6334..0000000 --- a/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md +++ /dev/null @@ -1,882 +0,0 @@ -# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication - -## Status - -Accepted - -## Context - -### Goal - -Add the infrastructure that lets CCL (Collective Communication Library) -kernels run **inside** a PE. The host just launches a kernel on each -SIP; the actual synchronization and data movement happen **inside the -PE kernel via an IPCQ (Inter-Process Communication Queue)**. - -This mirrors how NCCL performs NVLink communication inside a GPU -kernel, or how Cerebras / Tenstorrent expose core-local communication -queues. Host-level collectives (`dist.all_reduce`) are deferred to -**future work**; this ADR focuses solely on the kernel-side collective -infrastructure. - -### Problems to solve - -1. PE-to-PE direct data movement (writing into a peer's memory). -2. Synchronization — the sender must check that the receiver has space - in its buffer (backpressure). -3. Resource contention between compute traffic and communication - traffic (Head-of-Line blocking). -4. The host must be able to construct logical neighbor topologies - (ring / mesh / tree) per algorithm. - ---- - -## Decision - -### D1. Add a new `PE_IPCQ` component - -A new component `PE_IPCQ` is added inside each PE. 
It follows the same -pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a -distinct component. - -``` -PE -├── PE_CPU -├── PE_SCHEDULER -├── PE_DMA -├── PE_IPCQ ← new -├── PE_FETCH_STORE -├── PE_GEMM -├── PE_MATH -├── PE_TCM -├── PE_MMU -``` - -**Role separation** (control plane vs. data plane): - -- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head / - tail pointer management, peer pointer caches, backpressure, 4-direction - neighbor mapping. -- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe - / PCIE into the peer's memory. - -PE_IPCQ does **not** move data itself — it delegates to PE_DMA. - -### D2. Ring buffer model - -Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers. - -```python -@dataclass -class IpcqQueuePair: - direction: Direction # N/S/E/W - peer: IpcqEndpoint # set by host at init time (D2.5) - tx_buffer_base: int # outgoing data base addr (in our memory) - rx_buffer_base: int # incoming data base addr (in our memory) - slot_size: int # 1 tile per slot - n_slots: int # ring depth - my_head: int # next slot we will write/send into - my_tail: int # next slot we will read/recv from - peer_head_cache: int # peer's last-seen head (updated via D9 piggyback) - peer_tail_cache: int # peer's last-seen tail (updated via D9 fast-path credit) -``` - -**Canonical field names**: throughout this ADR the four names above -(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used -consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`, -etc.) are not used. 
- -| Field | Owner | Updated when | -|-------|-------|--------------| -| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) | -| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) | -| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) | -| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) | - -**Slot unit**: fixed-size, one slot holds one full tile (no descriptor -indirection). Full data embedded in the slot. See D5. - -### D2.5. `IpcqEndpoint` schema - -`IpcqQueuePair.peer` carries everything the sender needs to compute the -peer's rx slot address: - -```python -@dataclass(frozen=True) -class IpcqEndpoint: - sip: int - cube: int - pe: int - buffer_kind: str # "tcm" | "hbm" | "sram" - rx_base_pa: int # peer rx_buffer base PA (PhysAddr.encode()) - rx_base_va: int # peer rx_buffer base VA (optional, MMU mode) - n_slots: int # peer ring depth (for wrap-around) - slot_size: int # peer slot size (for offset) -``` - -Address computation: - -```python -slot_idx = self.my_head % peer.n_slots -dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size -``` - -PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA -(vc_comm) routes the data to `dst_pa` through the fabric. - -**Endpoint construction order**: at backend init (D10), the IPCQ -buffers for **every PE** are allocated first (so each rank knows the -others' PA), then the per-rank neighbor tables are built and pushed to -PE_IPCQ via `IpcqInitMsg`. - -### D3. Four-direction mapping ≡ logical ProcessGroup - -The PE views four directions (N/S/E/W) as logical ports. Real peer -addresses are configured by the host CCL init, per the chosen -algorithm. The PE kernel never knows the topology, only directions. 
- -```python -# 1D ring -for rank in range(world_size): - ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size]) - ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size]) - -# 2D mesh -for r in range(R): - for c in range(C): - ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c)) - ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c)) - ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C)) - ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C)) -``` - -The PE code does not need to know where `tl.send(dir="E", ...)` actually -ends up. - -### D4. PE kernel API - -```python -# Send (blocking; may stall on backpressure) -tl.send(dir: str, src=TensorHandle) -tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...) - -# Recv (blocking) -recv = tl.recv(dir: str, shape=..., dtype=...) -recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions - -# Recv (non-blocking) -fut = tl.recv_async(dir: str, shape=..., dtype=...) -recv = tl.wait(fut) -``` - -`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each -call rotates through directions, returning the first available slot. -Empty in all 4 directions → wait. - -**Fairness is weak**: the rotating start mitigates simple bias, but if -one direction always wins the race the others can starve. Algorithms -that need strict fairness must call `tl.recv(dir=...)` explicitly. - -### D5. Single-hop DMA write + full-data slot model - -Data moves from sender memory into the receiver's ring slot in **one -DMA transfer**. Key properties: - -- **Single-hop**: the sender already knows the peer rx slot address and - fires one fabric DMA into it. -- **No CPU memcpy**: the CPU never copies data. -- **No intermediate staging**: neither side keeps a separate staging - buffer (sender uses the source addr directly; receiver gets the data - in its ring slot directly). 
- -(Strictly speaking the fabric DMA write does happen, so this is not -literally "no data movement" — it's the same property NCCL labels -"zero-copy", meaning no CPU memcpy and no staging copy.) - -``` -PE A: tl.send(E, src_addr, nbytes) - 1. IPCQ computes the peer rx slot address: - dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size - 2. Backpressure: my_head - peer_tail_cache < peer.n_slots ? - (full → sleep / poll) - 3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes - 4. my_head += 1 - -PE B: data = tl.recv(W) - 1. Look at rx_buffer[my_tail % n_slots] - 2. Wait for the data to arrive (D7 backpressure mode) - 3. Return the slot address to the kernel (or fetch into register file) - 4. my_tail += 1 - 5. Issue a credit-return fast path (D9): after the bottleneck-BW - latency the peer A's peer_tail_cache is updated. -``` - -The slot holds the full tile. The receiver only reads its own -rx_buffer; it never reads back into A's memory. The sender knows the -peer rx slot address and DMAs directly into it (single-hop). - -The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local -to the PE). - -### D6. Buffer placement — three-way benchmark - -The host CCL init picks the IPCQ ring-buffer location: - -```python -ipcq_init( - backend="ahbm", - buffer_kind="tcm" | "hbm" | "sram", - n_slots=8, - slot_size=4096, -) -``` - -| Location | Trait | Trade-off | -|----------|-------|-----------| -| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources | -| **PE-local HBM** | Large; via DMA | Higher latency | -| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention | - -All three locations run the same kernel code; only the init differs. - -### D7. 
Backpressure — two-mode benchmark - -How the sender or receiver waits when peer slots are full / data not -yet arrived: - -| Mode | Behavior | Model | -|------|----------|-------| -| **poll** | Periodically re-check the cached peer pointer | Spin loop | -| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like | - -```python -ipcq_init(backpressure="poll" | "sleep", ...) -``` - -Both modes are implemented so latency / throughput trade-offs can be -benchmarked. - -### D8. PE_DMA virtual channels - -Extend PE_DMA from a single queue into a **two-channel virtual-channel** -model. - -``` -PE_DMA -├── vc_compute: tile load / store / writeback for GEMM and Math -└── vc_comm: IPCQ send data -``` - -Each VC has an independent state machine: - -- One channel stalling does not block the other. -- The same physical link (cube_noc, UCIe, …) is shared, but link BW is - split between channels. - -**Chunk-level interleave**: - -- Large GEMM tile DMAs do not lock the link end-to-end. -- Progress happens in chunks (e.g. 256 B); each chunk shares link BW - with the other VC's pending chunks. -- Chunk size is an init parameter (smaller = fairer, larger = more - efficient). - -Net effect: - -- HoL blocking is eliminated (an IPCQ send can interleave with a long - compute DMA). -- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM - pattern). -- Matches the NoC-virtual-channel pattern used in real HW. - -**First-implementation accuracy limit (intentional)**: this ADR's -first cut uses **deterministic chunk-level interleave + weighted -round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`). -This is a first-order approximation and is simpler than real HW -dynamic-contention / credit-based arbiters. Functional correctness is -unaffected, but heavy-contention scenarios may report slightly -optimistic latency vs. real HW. A separate ADR can add a NoC arbiter -component later if more precision is needed. 
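The chunk-level interleave and weighted round-robin arbitration described in D8 can be sketched as follows. This is a minimal illustration, not the PE_DMA implementation: `interleave_chunks` and its parameters are hypothetical names, and the model ignores link latency and only shows the order in which chunks are granted the link.

```python
from collections import deque


def interleave_chunks(compute_bytes, comm_bytes, chunk_size=256, weights=(1, 1)):
    """Return the order in which chunks win the shared link.

    Hypothetical helper, not part of kernbench: each VC drains its pending
    bytes in chunk_size pieces, and the arbiter grants up to weights[i]
    chunks to a VC per round (deterministic weighted round-robin).
    """
    if chunk_size < 1 or min(weights) < 1:
        raise ValueError("chunk_size and weights must be positive")

    def split(nbytes):
        # Break one transfer into full chunks plus an optional tail chunk.
        full, rest = divmod(nbytes, chunk_size)
        return deque([chunk_size] * full + ([rest] if rest else []))

    pending = {"vc_compute": split(compute_bytes), "vc_comm": split(comm_bytes)}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}

    schedule = []
    while any(pending.values()):
        for vc in ("vc_compute", "vc_comm"):
            for _ in range(quota[vc]):
                if not pending[vc]:
                    break  # this VC is idle; the link goes to the other one
                schedule.append((vc, pending[vc].popleft()))
    return schedule
```

With a 1024-byte compute DMA and a 256-byte IPCQ send queued together at the default 1:1 weights, the comm chunk is granted second instead of waiting behind all four compute chunks, which is the Head-of-Line behavior D8 aims to remove.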
- -#### Token routing - -- Compute tokens (`TileToken`) — go through the existing - PE_FETCH_STORE → PE_DMA chain. -- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA - self-routing. -- PE_DMA picks the channel by token type. - -```python -class PeDmaComponent: - def _process(self, env, token): - if isinstance(token, IpcqDmaToken): - yield from self._vc_comm_process(env, token) - else: - yield from self._vc_compute_process(env, token) -``` - -### D9. Pointer synchronization — DMA payload piggyback - -Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so -pointers update along with the data. This simulation adopts the same -model: **no separate control channel** — metadata travels with the -data. - -The big benefits: - -- **Automatic ordering**: data and metadata move on the same token, so - data is visible **before** the head_cache update. No race. -- **HW fidelity**: matches NVLink / UCIe piggybacked headers. -- **Component simplification**: no separate `IpcqPtrUpdate` event type. - -#### Send flow (head update via piggyback) - -``` -PE A: tl.send(E, src_addr, nbytes) - 1. PE_IPCQ checks backpressure (using peer_tail_cache) - 2. PE_IPCQ creates an IpcqDmaToken: - - data body (src_addr → peer dst_addr) - - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction) - 3. Hand the token to PE_DMA(vc_comm) - 4. PE A increments my_head (send tracking) - -[fabric DMA: latency elapses] - -PE B's PE_DMA receives the token - 5. Writes data into dst_addr (B's rx slot) via MemoryStore.write - 6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle) - -PE B's PE_IPCQ receives the metadata - 7. Updates peer_head_cache (= A's head) - 8. Wakes any pending recv on that direction -``` - -**Steps 5 and 6 must execute in the same SimPy step** — DMA completion -makes data and metadata atomically visible. 
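The sender-side bookkeeping in the send flow above (backpressure check against `peer_tail_cache`, slot address derived from `my_head`) can be sketched in plain Python. The field names follow D2 and D2.5; the class `SenderRing` and the methods `try_send` and `on_credit` are hypothetical and exist only for this illustration. The sketch leaves out the DMA transfer itself and the SimPy timing.

```python
from dataclasses import dataclass


@dataclass
class SenderRing:
    """Sender view of one direction's ring (illustrative, not PE_IPCQ itself)."""
    n_slots: int              # peer ring depth
    slot_size: int            # peer slot size in bytes
    rx_base_pa: int           # peer rx_buffer base physical address
    my_head: int = 0          # next slot we will send into
    peer_tail_cache: int = 0  # last tail value learned via credit return

    def can_send(self) -> bool:
        # Ring is full when all n_slots are in flight or unconsumed.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def try_send(self):
        """Return the destination PA for the next send, or None if full.

        A None result is where the real component would poll or sleep (D7).
        """
        if not self.can_send():
            return None
        dst_pa = self.rx_base_pa + (self.my_head % self.n_slots) * self.slot_size
        self.my_head += 1
        return dst_pa

    def on_credit(self, consumer_seq: int) -> None:
        # Credit carries the receiver's my_tail; never move the cache backwards.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)
```

In this model a two-slot ring accepts two sends, refuses the third, and accepts it again once a credit with `consumer_seq=1` arrives, wrapping back to the first slot address.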
- -#### Recv flow (credit return — fast path with bottleneck-BW latency) - -When the receiver frees a slot, the sender must learn about it -(backpressure release). Unlike data, the credit return does **not** -travel through general vc_comm fabric — it uses a **separate fast -path**, an abstraction of the NVLink / UCIe credit-return wire. - -**Latency** is computed from the **full path latency** (per-node -overhead + edge propagation + drain), not a magic constant: - -``` -credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes) -path = router.find_path(self_pe, peer_pe.pe_dma) -latency = compute_path_latency_ns(path, credit_size_bytes) - = sum(edge.distance_mm * ns_per_mm) - + sum(node_overhead_ns[n] for n in path) - + credit_size_bytes / bottleneck_bw_on_path -``` - -The router auto-appends `.pe_dma` to the source only, so the -destination MUST be spelled with the explicit `.pe_dma` suffix or -`find_path` raises and the credit silently teleports at zero cost -(latent bug fixed alongside this update). - -`tl.recv` blocks on the credit-emit completion (recv yields-from -`_delayed_credit_send` rather than spawning it as a fork). This puts -the credit-return cost on the receiver's `pe_exec_ns`, modeling the -IPCQ control-plane completing the consume-acknowledgement before -recv returns to the kernel — the protocol equivalent of a non-posted -`tl.store` waiting for an HBM ack on the raw DMA path. - -That gives us: - -- **Topology-proportional approximation**: an in-cube credit return is - automatically faster than a cross-SIP credit return. -- **No magic constants**: every nanosecond comes from - `compute_path_latency_ns` on the same edge_map and `node_overhead_ns` - as data traffic. -- **No deadlock risk**: unlike piggyback, B can issue credit even when - it has no data to send back. `peer_credit_store.put` is unbounded. -- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit - cost on recv balances the HBM ack-trip cost RAW pays on the sender. 
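The latency formula above can be worked through with a small standalone function. This is a sketch under stated assumptions, not the repository's `compute_path_latency_ns`: the function name `credit_latency_ns`, the argument layout, and every number in the usage note (distances, bandwidths, overheads, `ns_per_mm`) are made up for illustration. Bandwidth is taken in bytes per nanosecond.

```python
def credit_latency_ns(edges, node_overheads_ns, credit_size_bytes=16, ns_per_mm=0.01):
    """Full-path latency of one credit message (illustrative only).

    edges: list of (distance_mm, bw_bytes_per_ns) tuples along the path.
    node_overheads_ns: per-node overhead for every node on the path.
    """
    if not edges:
        raise ValueError("path must contain at least one edge")
    # Wire propagation: summed over every edge on the path.
    propagation = sum(dist * ns_per_mm for dist, _ in edges)
    # Per-node forwarding overhead.
    overhead = sum(node_overheads_ns)
    # Drain: the credit bytes pass through the slowest link once.
    bottleneck_bw = min(bw for _, bw in edges)
    drain = credit_size_bytes / bottleneck_bw
    return propagation + overhead + drain
```

For example, a single-edge in-cube path `[(2.0, 64.0)]` with overheads `[1.0, 1.0]` costs far less than a three-edge cross-SIP path `[(2.0, 64.0), (30.0, 16.0), (2.0, 64.0)]` with overheads `[1.0, 5.0, 5.0, 1.0]`, which is the topology-proportional behavior listed above.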
- -#### Component coupling — SimPy Store channel - -PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init -time, **a SimPy Store is wired between the two** (a per-direction -fast-path channel) and credit metadata is `put` into that store. - -```python -class PeIpcqComponent: - def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns): - yield env.timeout(latency_ns) - yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...)) -``` - -Backend init wires both directions of the fast-path channel as part of -fan-out (see `IpcqInitMsg` in D12). - -#### Credit-return fast path limitations - -- `credit_size_bytes` is an estimate (typically 16–64 bytes). -- The fast path is **excluded from vc_comm BW contention** (separate - wire). Real HW credit-return wires are very lightweight, so this is a - reasonable first approximation. -- A follow-up ADR can: model the credit fast path as a separate link - (BW limit + contention), or switch to piggyback (`credit_return_mode: - piggyback`). - -#### PE_DMA's added responsibility - -When `vc_comm` receives a token, PE_DMA processes it as the following -sequence: pay the Transaction's terminal BW drain, then atomically -write data and forward metadata. **No SimPy yield is allowed between -the data write and the metadata forward** (invariant I6). The drain -yield must sit before the atomic block, not inside it: - -```python -def _on_vc_comm_recv(self, env, txn): - # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the - # sender PE_DMA). MUST happen before the atomic block so recv only - # wakes after the bytes have "landed". - drain = getattr(txn, "drain_ns", 0.0) - if drain > 0: - yield env.timeout(drain) - - token = txn.request - # ── ATOMIC: no yield between these two operations ── - data = self._memory_store.read(token.src_space, token.src_addr, - shape=..., dtype=...) - self._memory_store.write(token.dst_endpoint.buffer_kind, - token.dst_addr, data) - # 2. 
Forward metadata to the local PE_IPCQ - yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token)) - # ─────────────────────────────────────────────────── -``` - -The final `put` is yieldable but uses an unbounded internal store, so -it completes in a single step. That `put` is the closing call of the -atomic block; nothing may be inserted before it. - -#### Drain-at-inbound semantics (D9 timing model) - -The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path` -stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns` -is paid at each forwarding component via `run()`, and the remaining -BW drain is paid once at the Transaction's terminal. Every non-IPCQ -Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via -`ComponentBase._forward_txn` at the terminal node. For IPCQ the -destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound` -(so IPCQ-specific data write + metadata forward can happen), so **the -drain MUST be paid explicitly at the top of that handler** to keep -IPCQ's timing model on par with every other fabric Transaction. - -Side-effects of paying drain here: - -- **SRC `tl.send`** is unchanged — fire-and-forget semantics are - preserved because the sender PE_DMA does not `yield sub_done`. The - `sub_done.succeed()` call (made after metadata forward below) is an - event with no listener on the sender side. -- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only - when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata - forward now happens after the drain, recv observes the full fabric - transfer time including bandwidth cost. - -Matches the physical picture: send dispatches and leaves; recv waits -until the bytes have actually been drained into its inbox. - -### D9.5. ADR-0020 (2-pass) integration - -`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. 
Phase -1 simulates timing **and** moves data via MemoryStore; Phase 2 enables -op-log-based correctness verification. - -#### Phase 1 (timing + data) - -D9 models head and tail updates with two different mechanisms: - -- **Send-side (head update)** — DMA payload piggyback. Data write and - metadata forward happen in the same SimPy step → automatic atomic - visibility. -- **Recv-side (tail credit return)** — fast-path SimPy Store channel - with bottleneck-BW latency, then `peer_tail_cache` update. - -Together they preserve ring-buffer pointer consistency. - -The op-log records `op_kind="ipcq"` entries for sends (with -`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with -`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`). -Two recv modes: - -- **`return_slot`** (default): the slot address is returned to the - kernel. Zero-copy. -- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`, - PE_IPCQ copies the slot data into the user dst. - -#### Phase 2 (op_log replay) - -When `DataExecutor` encounters an `op_kind="ipcq"` record: - -- **send**: idempotent `src → dst` ndarray write. -- **recv (`return_slot`)**: no-op (the slot already holds the data). -- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy. - -IPCQ ops are pure data movement — Phase 2 has nothing extra to compute. -The downstream GEMM / Math ops in `DataExecutor` will consume the data -and naturally validate correctness. - -### D10. Host CCL init keeps the PyTorch shape - -The host code looks just like real PyTorch DDP. `init_process_group` -creates the backend object; it does **not** receive IPCQ knobs -(neighbor topology, buffer_kind, backpressure …). - -```python -# benches/ccl_allreduce.py — same shape as real PyTorch -def worker(rank, world_size, torch): - dist = torch.distributed - dist.init_process_group(backend="ahbm") # reads ccl.yaml + topology - tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...) 
- tensor.copy_(torch.from_numpy(init)) - dist.all_reduce(tensor, op="sum") -``` - -The IPCQ configuration is decided by the backend at -`init_process_group` time: it loads `ccl.yaml`, picks the algorithm, -and pushes IPCQ neighbor tables to every participating PE_IPCQ. The -host code never has to know about IPCQ. - -A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`. -Switching algorithms is purely a `ccl.yaml` change — no host edits -required. - -#### Init flow (eager) - -1. `init_process_group(backend="ahbm")` is called. -2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`. -3. Pulls topology + buffer_kind + backpressure + slot config from - `algorithms[]`. -4. **Immediately** installs neighbor tables on every PE_IPCQ - (sideband or fabric `IpcqInitMsg`). -5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally — - PE_IPCQ is already prepared whether the kernel is a CCL kernel or - not. - -### D11. CCL config file (`ccl.yaml`) - -IPCQ config and algorithm metadata live in a separate YAML file, -following the same pattern as `components.yaml` and `topology.yaml`. - -A single benchmark execution runs one algorithm -(`defaults.algorithm`). Switching algorithms means editing -`defaults.algorithm` only. 
- -```yaml -defaults: - algorithm: ring_allreduce_tcm - buffer_kind: tcm # tcm | hbm | sram - backpressure: sleep # poll | sleep - n_slots: 8 - slot_size: 4096 - vc_chunk_size: 256 - ipcq_credit_size_bytes: 16 - -algorithms: - ring_allreduce_tcm: - module: kernbench.ccl.algorithms.ring_allreduce - topology: ring_1d # builtin name or "custom" - buffer_kind: tcm - n_elem: 8 # optional, per-algorithm tile width - - tree_allreduce_7: - module: kernbench.ccl.algorithms.tree_allreduce - topology: tree_binary - buffer_kind: tcm - world_size: 7 # algorithm-level override - n_elem: 16 - - custom_mesh: - module: kernbench.ccl.algorithms.custom_mesh - topology: custom # the module supplies its own neighbors() -``` - -`world_size` is **not set in `defaults`**. The backend resolves it via: -`algorithm-level override > defaults override > topology spec`. The -last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP -where `WORLD_SIZE` comes from env vars rather than config files. - -#### Algorithm module structure - -Each algorithm module exports two hooks — `kernel` (required) and -`neighbors` (optional) — plus a `kernel_args` helper that the -backend uses to populate positional kernel arguments at `all_reduce` -time: - -```python -# src/kernbench/ccl/algorithms/ring_allreduce.py - -def kernel_args(world_size: int, n_elem: int) -> tuple: - return (n_elem, world_size) - - -def kernel(t_ptr, n_elem, world_size, tl): - """Required — the PE kernel. - - IPCQ is already installed by the backend before this is called. - The kernel only uses the four-direction send / recv API. - """ - ... - - -def neighbors(rank, world_size, neighbor_map): - """Optional — override the builtin topology's neighbor map. - - Returns a new dict, the modified-in-place dict, or None to keep the - builtin map. - """ - return None -``` - -#### `neighbors` override patterns - -- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc. 
-- **Pattern B — replace entirely**: ignore `neighbor_map` and return a - brand-new dict. -- **Pattern C — keep builtin**: omit `neighbors` or return None. - -#### Builtin topologies - -| topology | direction set | -|----------|---------------| -| `ring_1d` | E, W | -| `ring_1d_unidir` | E only | -| `mesh_2d` | N, S, E, W | -| `tree_binary` | parent, child_left, child_right | -| `none` | (empty) — algorithm must supply `neighbors()` | - -#### Adding a new algorithm - -1. Write `kernel` and `kernel_args` in - `src/kernbench/ccl/algorithms/.py`. -2. Add an entry in `ccl.yaml`'s `algorithms` section. -3. (Optional) provide `neighbors()` for custom topology. -4. Set `defaults.algorithm` to the new algorithm. - -The host bench (`benches/ccl_allreduce.py`) does not change. - -### D12. Message / token schema - -The new message types added by this ADR. They live in -`src/kernbench/common/pe_commands.py` and -`src/kernbench/runtime_api/kernel.py`. - -#### `IpcqInitMsg` (sideband, fan-out at init) - -The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors -`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`). -Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`, -`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store` -field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can -push `IpcqCreditMetadata` directly into the receiver's input queue. - -#### `IpcqSendCmd` (PE_CPU → PE_IPCQ) - -Carries `direction`, source addr/space, nbytes, shape, dtype, and a -handle id. `data_op=True` so it lands in the op_log. - -#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ) - -Carries `direction` (or None for round-robin), `recv_mode` -(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape, -dtype, blocking flag. - -#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel) - -Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`) -plus the head metadata (`sender_seq`, `src_sip/cube/pe`, -`src_direction`). 
PE_DMA picks the channel by token type -(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`). - -The receiver's PE_DMA, on token arrival, performs the I6 atomic -sequence: write data into MemoryStore, then forward `IpcqMetaArrival` -to the local PE_IPCQ. - -#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path) - -Carries `consumer_seq` (= my_tail), source PE coords, and source -direction. Travels through the dedicated SimPy Store channel rather -than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`. - -There is **no `IpcqPtrUpdate` event** — head updates flow via D9 -piggyback, tail updates via the D9 fast-path channel. - -### D13. Test strategy - -Test plan: - -#### T1. Unit tests (component-level) - -- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure - immediately forwards a token; full peer slot triggers backpressure - (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`; - round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`. -- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute` - / `vc_comm` independent progress, chunk interleave, BW split. -- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d / - mesh_2d / tree_binary correctness, mesh_2d non-square → - `ValueError`, custom resolver returns the module's `neighbors`. - -#### T2. Integration tests (E2E send/recv) - -- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional - no-deadlock), 4×4 mesh. -- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode - records `ipcq` ops in op_log; DataExecutor produces correct - `out.data`. - -#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`) - -`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA -consistency, per-`buffer_kind` allocation. - -#### T4. Regression - -All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for -non-CCL benches. - -#### T5. 
Performance / overhead - -Single send/recv pair latency = (DMA latency) + (IPCQ overhead). -Should be close to a regular PE_DMA write of the same nbytes (IPCQ -overhead < 100 ns). - -### D14. Invariants and failure modes - -#### Invariants - -I1. **Slot lifecycle exactly-once**: one send → exactly one recv. -I2. **Pointer monotonicity**: `my_head` / `my_tail` strictly - non-decreasing; `sender_seq` strictly increasing. -I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank - B, then rank B's reverse-direction peer must be rank A. Verified at - init. -I4. **`buffer_kind` consistency**: all PEs in a process group share - the same `buffer_kind` (no mixed mode in the first cut). -I5. **op_log ordering**: send → DMA complete → recv possible. The - t_start order in op_log respects this causality. -I6. **Atomic data + metadata visibility (MUST)**: at the receiver - side, data write (`MemoryStore.write`) and metadata forward - (`peer_head_cache` update) **must execute in the same SimPy step**. - No yield is allowed between the two operations in PE_DMA's vc_comm - handler. Code review must reject any inserted `yield` (or `yield - from`) — it would create a race where head_cache becomes visible - before or after the data. -I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, - the step in which `peer_head_cache > my_tail` becomes truthy is the - same step in which the slot data is observable. - -#### Failure modes (runtime errors) - -F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction - → `IpcqInvalidDirection`, simulation aborts. -F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched - send and recv. Not validated by default; opt-in strict mode catches - it (`strict_validation: true` on a PE_IPCQ node attrs). -F3. **Deadlock detection (timeout-based)**: the simulator empties its - schedule while a send/recv is still pending → engine raises - `IpcqDeadlock` and embeds a pointer dump. -F4. 
**Backend init failure**: missing `defaults.algorithm`, missing - `algorithms[name]`, module import failure, topology validation - failure (I3, I4) — all raised at `init_process_group` time. -F5. **Slot full + infinite backpressure**: the peer never recvs. - Surfaces as F3 timeout. - -#### Diagnostics - -- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as - `(rank, t, dir, nbytes)`. -- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)` - prints every PE_IPCQ ring buffer's `my_head`, `my_tail`, - `peer_head_cache`, `peer_tail_cache`. -- **Deadlock dump**: on hang the engine includes the pointer dump in - the `IpcqDeadlock` exception message. - -### D15. Algorithm-author cheat sheet - -Full step-by-step lives in -[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The -shortest version: - -| Things you touch | Things you don't | -|------------------|-------------------| -| `src/kernbench/ccl/algorithms/.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code | -| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework | -| (Optional) `tests/test_.py` mock test | PE_IPCQ component, AhbmCCLBackend | - -5-step flow: write the kernel → register in `ccl.yaml` → optional -`neighbors` override → optional mock unit test → SimPy validation via -`kernbench run --bench ccl_allreduce --verify-data`. - -Common mistakes: using a direction that wasn't installed, sends -without matching recvs (deadlock), dtype/shape disagreement, assuming -fairness from `tl.recv()` round-robin, confusing -`tl.num_programs(axis)` with the CCL group size. - ---- - -## Non-goals - -- **Host collective**: a model where `dist.all_reduce` itself moves - data on the host side is out of scope. This ADR only covers - communication that happens inside the PE kernel. -- **All-reduce algorithms**: ring / tree / etc. 
live in algorithm - modules and can be added without amending this ADR. -- **Reliability / error handling**: link faults, send/recv failure - recovery, etc. are out of scope. -- **NoC arbiter precision**: dynamic VC contention is left for a future - ADR (see D8). - ---- - -## Open questions - -- **VC arbitration accuracy** — the first cut uses deterministic - chunk interleave + weighted round-robin; heavy contention may report - optimistic latency. A NoC arbiter component can be added later. -- **Credit return BW model** — the fast path is currently outside the - fabric BW contention model. Can be modeled as a separate link or - switched to piggyback (`credit_return_mode: piggyback`). -- **Ring buffer slot allocation metadata** — whether the host pushes - IPCQ buffer metadata via sideband or via a fabric message similar to - `MmuMapMsg` is open. -- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in - `ccl.yaml`; default value TBD. -- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6 - (with Up/Down for 3D) or N (variable) is future work. -- **Multi-tile aggregation primitives** — whether - `tl.recv_all` or similar is needed for fan-in. -- **Round-robin recv fairness** — current weak fairness can starve; - strict fairness counter is future work. -- **Deadlock detection precision** — currently timeout-based; a - realtime wait-for graph would enable deterministic detection. - ---- - -## Consequences - -### Positive - -- PE-to-PE direct communication enables CCL kernels to be written. -- Host stays minimal (just `launch`), synchronization happens inside - the PE → strong compute / comm overlap. -- VCs eliminate HoL blocking → collective latency is not blocked by - compute traffic. -- Buffer placement and backpressure mode are init-time parameters → - easy to benchmark. -- Four-direction logical neighbors → host is free to map - ring/mesh/tree algorithms. - -### Negative - -- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs). 
-- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE. -- VC arbitration is a first-order approximation; heavy contention - scenarios may report slightly optimistic latency vs real HW (D8). -- Chunk-level interleave makes PE_DMA implementation more complex. diff --git a/docs/adr/ADR-0023-dev-ipcq-pe-collective.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md index 5fd174d..2db86c5 100644 --- a/docs/adr/ADR-0023-dev-ipcq-pe-collective.md +++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md @@ -6,39 +6,45 @@ Accepted ## Context -### 목표 +### Goal -CCL (Collective Communication Library) 커널을 PE 안에서 실행할 수 있도록 -PE 간 데이터 교환 인프라를 추가한다. 호스트는 그저 각 SIP에 커널을 launch만 하고, -실제 동기화와 데이터 이동은 **PE 커널 안에서 IPCQ(Inter-Process Communication -Queue)를 통해** 일어난다. +Add the infrastructure that lets CCL (Collective Communication Library) +kernels run **inside** a PE. The host just launches a kernel on each +SIP; the actual synchronization and data movement happen **inside the +PE kernel via an IPCQ (Inter-Process Communication Queue)**. -이는 NCCL이 GPU 커널 안에서 NVLink 통신을 수행하는 모델, 또는 Cerebras/Tenstorrent의 -core-local 통신 큐와 유사하다. 호스트 레벨 collective(`dist.all_reduce`)는 -**미래 작업**으로 미루고, 본 ADR은 커널 collective 인프라에만 집중한다. +This mirrors how NCCL performs NVLink communication inside a GPU +kernel, or how Cerebras / Tenstorrent expose core-local communication +queues. Host-level collectives (`dist.all_reduce`) are deferred to +**future work**; this ADR focuses solely on the kernel-side collective +infrastructure. -### 풀어야 할 문제 +### Problems to solve -1. PE 간 직접 데이터 이동 (peer's memory에 write) -2. 동기화 — 송신 측이 수신 측 buffer 공간을 확인해야 함 (backpressure) -3. compute traffic과 communication traffic의 자원 경쟁 (Head-of-Line blocking) -4. 호스트가 알고리즘에 따라 (ring/mesh/tree) 논리적 neighbor 토폴로지를 구성할 수 있어야 함 +1. PE-to-PE direct data movement (writing into a peer's memory). +2. Synchronization — the sender must check that the receiver has space + in its buffer (backpressure). +3. 
Resource contention between compute traffic and communication + traffic (Head-of-Line blocking). +4. The host must be able to construct logical neighbor topologies + (ring / mesh / tree) per algorithm. --- ## Decision -### D1. PE_IPCQ 컴포넌트 신규 추가 +### D1. Add a new `PE_IPCQ` component -PE 안에 새 컴포넌트 `PE_IPCQ`를 추가한다. PE_GEMM/PE_MATH가 PE_CPU의 -sub-block을 별도 컴포넌트로 모델링하는 것과 동일한 패턴이다. +A new component `PE_IPCQ` is added inside each PE. It follows the same +pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a +distinct component. ``` PE ├── PE_CPU ├── PE_SCHEDULER ├── PE_DMA -├── PE_IPCQ ← 신규 +├── PE_IPCQ ← new ├── PE_FETCH_STORE ├── PE_GEMM ├── PE_MATH @@ -46,99 +52,96 @@ PE ├── PE_MMU ``` -**역할 분리** (control plane vs data plane): -- **PE_IPCQ (control plane)**: ring buffer 주소 계산, head/tail pointer 관리, - peer pointer 캐시, backpressure 결정, 4-방향 neighbor 매핑 -- **PE_DMA (data plane)**: 실제 데이터를 cube_noc/UCIe/PCIE 경유로 peer 메모리에 전송 +**Role separation** (control plane vs. data plane): -PE_IPCQ는 데이터 이동을 직접 수행하지 않고 PE_DMA에 위임한다. +- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head / + tail pointer management, peer pointer caches, backpressure, 4-direction + neighbor mapping. +- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe + / PCIE into the peer's memory. -### D2. Ring Buffer 모델 +PE_IPCQ does **not** move data itself — it delegates to PE_DMA. -각 PE는 4-방향(N/S/E/W) × {tx, rx} = 총 8개의 ring buffer를 가진다. +### D2. Ring buffer model + +Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers. 
```python @dataclass class IpcqQueuePair: direction: Direction # N/S/E/W - peer: IpcqEndpoint # init 시 호스트가 설정 (D2.5) - tx_buffer_base: int # 내가 보낼 데이터의 base addr (자기 메모리) - rx_buffer_base: int # 내가 받을 데이터의 base addr (자기 메모리) - slot_size: int # tile 단위 + peer: IpcqEndpoint # set by host at init time (D2.5) + tx_buffer_base: int # outgoing data base addr (in our memory) + rx_buffer_base: int # incoming data base addr (in our memory) + slot_size: int # 1 tile per slot n_slots: int # ring depth - my_head: int # 내 send 위치 (다음에 쓸 tx/peer slot) - my_tail: int # 내 recv 위치 (다음에 읽을 rx slot) - peer_head_cache: int # 캐시: peer가 마지막으로 보낸 head 위치 (D9 piggyback으로 갱신) - peer_tail_cache: int # 캐시: peer가 마지막으로 소비한 tail 위치 (D9 fast path credit으로 갱신) + my_head: int # next slot we will write/send into + my_tail: int # next slot we will read/recv from + peer_head_cache: int # peer's last-seen head (updated via D9 piggyback) + peer_tail_cache: int # peer's last-seen tail (updated via D9 fast-path credit) ``` -**필드명 규약 (canonical)**: 본 ADR 전체에서 다음 4개 이름을 일관되게 사용한다. +**Canonical field names**: throughout this ADR the four names above +(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used +consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`, +etc.) are not used. 
-| 필드 | 소유자 | 갱신 시점 | -|------|--------|----------| -| `my_head` | 자기 PE_IPCQ | tl.send 호출 후 즉시 (송신 추적용) | -| `my_tail` | 자기 PE_IPCQ | tl.recv 호출 후 즉시 (수신 추적용) | -| `peer_head_cache` | 자기 PE_IPCQ | IpcqMetaArrival 도착 시 (D9 piggyback) | -| `peer_tail_cache` | 자기 PE_IPCQ | IpcqCreditMetadata 도착 시 (D9 fast path) | +| Field | Owner | Updated when | +|-------|-------|--------------| +| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) | +| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) | +| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) | +| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) | -다른 표현(`peer_head_local`, `peer_head`, `peer_tail` 등)은 사용하지 않는다. +**Slot unit**: fixed-size, one slot holds one full tile (no descriptor +indirection). Full data embedded in the slot. See D5. -**Slot 단위**: fixed-size, 한 slot이 한 tile 데이터를 통째로 담는다. -descriptor 모델이 아니라 **full data embedding** 모델 (D5에서 상세). +### D2.5. `IpcqEndpoint` schema -### D2.5. PeAddress / IpcqEndpoint 스키마 - -`IpcqQueuePair.peer`가 가져야 할 정보를 명시한다. 송신 측 PE_IPCQ가 -peer rx slot에 직접 DMA write하려면 다음을 모두 알아야 한다. 
+`IpcqQueuePair.peer` carries everything the sender needs to compute the +peer's rx slot address: ```python @dataclass(frozen=True) class IpcqEndpoint: - """송신 측이 peer's rx_buffer 주소를 계산하기 위해 필요한 모든 정보.""" - sip: int # 목적지 SIP - cube: int # 목적지 cube - pe: int # 목적지 PE (cube 내 local index) - buffer_kind: str # "tcm" | "hbm" | "sram" — 어느 메모리 공간 - rx_base_pa: int # peer rx_buffer base의 PA (PhysAddr.encode()) - rx_base_va: int # peer rx_buffer base의 VA (선택, MMU 사용 시) - n_slots: int # peer ring depth (경계 wrap-around 계산용) - slot_size: int # peer slot 크기 (offset 계산용) + sip: int + cube: int + pe: int + buffer_kind: str # "tcm" | "hbm" | "sram" + rx_base_pa: int # peer rx_buffer base PA (PhysAddr.encode()) + rx_base_va: int # peer rx_buffer base VA (optional, MMU mode) + n_slots: int # peer ring depth (for wrap-around) + slot_size: int # peer slot size (for offset) ``` -`IpcqQueuePair`의 `peer` 필드는 이 `IpcqEndpoint` 객체를 들고 있다. -주소 계산은: +Address computation: ```python slot_idx = self.my_head % peer.n_slots dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size ``` -PE_IPCQ는 이 dst_pa를 `IpcqDmaToken`의 dst_addr로 PE_DMA에 전달한다. -PE_DMA(vc_comm)는 fabric 라우팅(cube_noc/UCIe/PCIE)을 통해 dst_pa로 데이터를 전송한다. +PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA +(vc_comm) routes the data to `dst_pa` through the fabric. -**Endpoint 생성 시점**: backend init (D10)에서 모든 PE의 IPCQ buffer를 -allocator로 할당받고, 각 rank의 neighbor table을 만들 때 peer rank의 -endpoint 정보를 install한다. 즉 install 순서는: +**Endpoint construction order**: at backend init (D10), the IPCQ +buffers for **every PE** are allocated first (so each rank knows the +others' PA), then the per-rank neighbor tables are built and pushed to +PE_IPCQ via `IpcqInitMsg`. -1. **모든 rank의 IPCQ buffer 할당** (각 PE의 buffer_kind 메모리 공간에서) -2. **rank별 endpoint table 구성** (자신의 4-방향 peer가 어느 sip/cube/pe/pa를 갖는지) -3. **PE_IPCQ에 install** (`IpcqInitMsg` via fabric or sideband) +### D3. 
Four-direction mapping ≡ logical ProcessGroup -이 순서는 모든 rank가 서로의 PA를 알아야 하므로, 단계 1을 모든 rank에 대해 -먼저 끝낸 후 단계 2-3을 진행한다. - -### D3. 4-방향 매핑 = 논리적 ProcessGroup - -PE는 4방향(N/S/E/W)을 logical port로 본다. 실제 peer 주소는 호스트 CCL init이 -알고리즘에 따라 설정한다. PE 커널은 토폴로지를 알지 못하고 방향만 사용한다. +The PE views four directions (N/S/E/W) as logical ports. Real peer +addresses are configured by the host CCL init, per the chosen +algorithm. The PE kernel never knows the topology, only directions. ```python -# 호스트 init 예시 — 1D ring +# 1D ring for rank in range(world_size): ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size]) ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size]) -# 호스트 init 예시 — 2D mesh +# 2D mesh for r in range(R): for c in range(C): ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c)) @@ -147,77 +150,77 @@ for r in range(R): ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C)) ``` -PE 코드 입장에서 `tl.send(dir="E", ...)`가 어디로 가는지는 알 필요가 없다. +The PE code does not need to know where `tl.send(dir="E", ...)` actually +ends up. -### D4. PE 커널 API +### D4. PE kernel API ```python -# Send (blocking, backpressure 발생 가능) -tl.send(dir: str, src_addr: int, nbytes: int) -> None +# Send (blocking; may stall on backpressure) +tl.send(dir: str, src=TensorHandle) +tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...) # Recv (blocking) -data = tl.recv(dir: str) # 특정 방향에서 수신 -data = tl.recv() # 4방향 round-robin, 도착한 첫 tile 반환 +recv = tl.recv(dir: str, shape=..., dtype=...) +recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions # Recv (non-blocking) -handle = tl.recv_async(dir: str) -data = tl.wait(handle) +fut = tl.recv_async(dir: str, shape=..., dtype=...) +recv = tl.wait(fut) ``` -`tl.recv()` (방향 미지정)는 IPCQ가 last_polled_dir 인덱스를 들고 있다가 -다음 호출 시 그 다음 방향부터 검사하면서 데이터 있는 첫 슬롯을 반환한다. -4방향 모두 비어있으면 wait. 
+`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each +call rotates through directions, returning the first available slot. +Empty in all 4 directions → wait. -**Fairness는 weak fairness**: polling 시작 방향을 회전시켜 단순 편향을 -완화하지만, 한 방향에 데이터가 항상 먼저 도착하면 다른 방향이 starvation될 -수 있다. strict fairness가 필요한 알고리즘은 `tl.recv(dir=...)`로 방향을 -명시해야 한다. (Open Questions 참조) +**Fairness is weak**: the rotating start mitigates simple bias, but if +one direction always wins the race the others can starve. Algorithms +that need strict fairness must call `tl.recv(dir=...)` explicitly. -### D5. Single-hop DMA Write + Full-data Slot 모델 +### D5. Single-hop DMA write + full-data slot model -데이터는 송신 측 메모리에서 수신 측 ring slot으로 **단일 DMA 전송**으로 -이동한다. 핵심 속성: +Data moves from sender memory into the receiver's ring slot in **one +DMA transfer**. Key properties: -- **Single-hop**: 송신 측 IPCQ가 peer rx slot 주소를 직접 알고 있어 한 번의 - fabric DMA로 데이터가 도착한다. -- **No CPU memcpy**: CPU가 데이터를 복사하지 않는다. -- **No intermediate staging**: 송신/수신 어느 쪽에도 별도 staging buffer가 - 없다 (송신은 자기 source 주소에서 직접, 수신은 자기 ring slot으로 직접). +- **Single-hop**: the sender already knows the peer rx slot address and + fires one fabric DMA into it. +- **No CPU memcpy**: the CPU never copies data. +- **No intermediate staging**: neither side keeps a separate staging + buffer (sender uses the source addr directly; receiver gets the data + in its ring slot directly). -(엄밀히 말하면 fabric DMA write 자체는 발생하므로 "data movement가 전혀 없다"는 -의미는 아니다. NCCL의 "zero-copy"가 가리키는 것 — CPU memcpy / staging copy -부재 — 과 동일한 속성이다.) - -데이터 이동 모델: +(Strictly speaking the fabric DMA write does happen, so this is not +literally "no data movement" — it's the same property NCCL labels +"zero-copy", meaning no CPU memcpy and no staging copy.) ``` PE A: tl.send(E, src_addr, nbytes) - 1. IPCQ가 peer rx slot 주소 계산 + 1. IPCQ computes the peer rx slot address: dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size - 2. 
backpressure: my_head - peer_tail_cache < peer.n_slots ? - (꽉 찼으면 sleep/poll) - 3. PE_DMA(vc_comm)에 DMA 요청 → src_addr에서 peer의 dst_addr로 nbytes 전송 + 2. Backpressure: my_head - peer_tail_cache < peer.n_slots ? + (full → sleep / poll) + 3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes 4. my_head += 1 PE B: data = tl.recv(W) - 1. 내 rx_buffer[my_tail % n_slots] 위치 확인 - 2. 데이터 도착 대기 (D7 backpressure 모드) - 3. 그 주소를 PE 커널에 반환 (또는 fetch unit으로 register file에 로드) + 1. Look at rx_buffer[my_tail % n_slots] + 2. Wait for the data to arrive (D7 backpressure mode) + 3. Return the slot address to the kernel (or fetch into register file) 4. my_tail += 1 - 5. credit return fast path 발행 (D9) — bottleneck-BW latency 후 - peer A의 peer_tail_cache 갱신 + 5. Issue a credit-return fast path (D9): after the bottleneck-BW + latency the peer A's peer_tail_cache is updated. ``` -**핵심**: Slot에 데이터가 통째로 들어간다. PE B의 recv는 자기 rx_buffer만 -읽으면 되고, A의 메모리를 read하지 않는다. 송신 측 IPCQ가 peer rx slot -주소를 알고 있으므로 직접 그 주소로 DMA write한다 (single-hop). +The slot holds the full tile. The receiver only reads its own +rx_buffer; it never reads back into A's memory. The sender knows the +peer rx slot address and DMAs directly into it (single-hop). -본인의 PE_TCM read/write는 DMA를 거치지 않는다 (PE에 직접 붙어있음). -slot이 본인 TCM에 있으면 직접 접근, 아니면 PE_DMA 경유. +The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local +to the PE). -### D6. Buffer 위치 — 3-way benchmark +### D6. 
Buffer placement — three-way benchmark -호스트 CCL init이 IPCQ ring buffer의 메모리 위치를 결정한다: +The host CCL init picks the IPCQ ring-buffer location: ```python ipcq_init( @@ -228,79 +231,80 @@ ipcq_init( ) ``` -| 위치 | 특징 | trade-off | -|------|------|-----------| -| **PE_TCM** | PE에 직접 붙음, 빠름 | 작음, PE 내부 자원과 경쟁 | -| **PE-local HBM** | 큼, DMA 경유 | latency 큼 | -| **Cube SRAM** | 중간 크기, cube-shared | cube 내 PE 간 contention | +| Location | Trait | Trade-off | +|----------|-------|-----------| +| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources | +| **PE-local HBM** | Large; via DMA | Higher latency | +| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention | -세 위치 모두 동일 코드로 동작하며 init만 다르다. 벤치마크로 비교 가능. +All three locations run the same kernel code; only the init differs. -**규칙**: peer가 read/write할 때는 DMA 경유. 본인이 자기 PE_TCM 읽기/쓰기는 -DMA 없음. +### D7. Backpressure — two-mode benchmark -### D7. Backpressure — 2-mode benchmark +How the sender waits when the peer's slots are full, and how the +receiver waits when data has not yet arrived: -송신 측이 peer slot full을 감지했을 때, 또는 수신 측이 데이터 미도착을 -감지했을 때 어떻게 대기하는가: -| 모드 | 동작 | 모델 | -|------|------|------| -| **poll** | 캐시된 peer pointer를 주기적으로 재확인. cache update event를 폴링 | spin loop | -| **sleep** | SimPy event를 yield하고 sleep, peer가 update event를 trigger하면 wake | interrupt-like | +| Mode | Behavior | Model | +|------|----------|-------| +| **poll** | Periodically re-check the cached peer pointer | Spin loop | +| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like | ```python ipcq_init(backpressure="poll" | "sleep", ...) ``` -두 모드 모두 구현하여 latency/throughput trade-off를 벤치마크할 수 있다. +Both modes are implemented so latency / throughput trade-offs can be +benchmarked. -### D8. PE_DMA Virtual Channel +### D8. PE_DMA virtual channels -PE_DMA를 단일 큐에서 **2-channel virtual channel** 모델로 확장한다. +Extend PE_DMA from a single queue into a **two-channel virtual-channel** +model. 
``` PE_DMA -├── vc_compute: GEMM/MATH의 tile load/store/writeback -└── vc_comm: IPCQ의 send 데이터 +├── vc_compute: tile load / store / writeback for GEMM and Math +└── vc_comm: IPCQ send data ``` -각 VC는 독립적인 state machine을 가진다: -- 한 채널이 stall되어도 다른 채널은 진행 -- 동일 link(cube_noc, UCIe 등)는 공유하지만, link BW는 두 채널이 분할 사용 +Each VC has an independent state machine: -**Chunk 단위 인터리브**: -- 큰 GEMM tile DMA가 한 번에 link를 점유하지 않음 -- chunk_size 단위로 진행 (예: 256B), 매 chunk마다 다른 VC와 link BW 공유 -- chunk_size는 init 파라미터 (작을수록 fair, 클수록 효율) +- One channel stalling does not block the other. +- The same physical link (cube_noc, UCIe, …) is shared, but link BW is + split between channels. -이로써: -- HoL blocking 해소 (compute DMA 진행 중에도 IPCQ send 끼어들 수 있음) -- compute/comm overlap 자연스러움 (NVIDIA copy engine + compute SM 패턴) -- HW 모델 정합 (NoC virtual channel은 실제 HW 기법) +**Chunk-level interleave**: -**첫 구현의 정확도 한계 (intentional)**: +- Large GEMM tile DMAs do not lock the link end-to-end. +- Progress happens in chunks (e.g. 256 B); each chunk shares link BW + with the other VC's pending chunks. +- Chunk size is an init parameter (smaller = fairer, larger = more + efficient). -본 ADR의 첫 구현은 **deterministic chunk-level interleave + weighted -round-robin arbitration** (default 50/50, ccl.yaml에 노출)을 채택한다. -이는 first-order approximation이며, 실제 HW의 dynamic contention/credit-based -arbitration보다는 단순화된 모델이다. +Net effect: -| 모델링 항목 | 첫 구현 | 향후 확장 가능 | -|------------|---------|---------------| -| VC 간 BW 분할 | 정적 weight | dynamic contention 기반 | -| Chunk 단위 인터리브 | deterministic round-robin | priority/QoS 기반 | -| Cross-VC dependency | 없음 (독립) | NoC arbiter component 추가 | +- HoL blocking is eliminated (an IPCQ send can interleave with a long + compute DMA). +- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM + pattern). +- Matches the NoC-virtual-channel pattern used in real HW. -이 단순화는 functional correctness에는 영향이 없으며, latency 측정에서 -heavy contention 시나리오는 실제보다 약간 optimistic한 결과를 낼 수 있다. 
-정밀화가 필요하면 별도 ADR로 NoC arbiter를 도입한다. +**First-implementation accuracy limit (intentional)**: this ADR's +first cut uses **deterministic chunk-level interleave + weighted +round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`). +This is a first-order approximation and is simpler than real HW +dynamic-contention / credit-based arbiters. Functional correctness is +unaffected, but heavy-contention scenarios may report slightly +optimistic latency vs. real HW. A separate ADR can add a NoC arbiter +component later if more precision is needed. #### Token routing -- compute용 token (TileToken): 기존 PE_FETCH_STORE → PE_DMA 체이닝 그대로 -- comm용 token (IpcqDmaToken, 신규): PE_IPCQ → PE_DMA로 self-routing -- PE_DMA가 token 종류로 채널 결정 +- Compute tokens (`TileToken`) — go through the existing + PE_FETCH_STORE → PE_DMA chain. +- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA + self-routing. +- PE_DMA picks the channel by token type. ```python class PeDmaComponent: @@ -311,55 +315,54 @@ class PeDmaComponent: yield from self._vc_compute_process(env, token) ``` -### D9. Pointer 동기화 — DMA payload piggyback +### D9. Pointer synchronization — DMA payload piggyback -실제 하드웨어(NVLink, UCIe 등)는 DMA 메시지의 payload에 메타데이터를 -piggyback하여 송수신과 함께 pointer를 갱신한다. 본 시뮬레이션도 같은 모델을 -채택하여 **별도의 control 채널 없이** 메타데이터가 data와 함께 도착하도록 한다. +Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so +pointers update along with the data. This simulation adopts the same +model: **no separate control channel** — metadata travels with the +data. -이 모델의 핵심 이점: +The big benefits: -- **자동 ordering**: 메타데이터가 data와 동일 token으로 이동하므로 data가 - 먼저 visible해진 다음에야 head_cache가 갱신된다. 별도 ordering invariant - 없이 race condition이 원천 차단된다. -- **HW 정합**: 실제 NVLink/UCIe의 piggybacked header 모델과 일치 -- **컴포넌트 단순화**: 별도 IpcqPtrUpdate event 종류가 필요 없음 +- **Automatic ordering**: data and metadata move on the same token, so + data is visible **before** the head_cache update. No race. 
+- **HW fidelity**: matches NVLink / UCIe piggybacked headers. +- **Component simplification**: no separate `IpcqPtrUpdate` event type. -#### Send 흐름 (head 측 piggyback) +#### Send flow (head update via piggyback) ``` PE A: tl.send(E, src_addr, nbytes) - 1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준) - 2. PE_IPCQ가 IpcqDmaToken 생성: - - data 본체 (src_addr → peer dst_addr) - - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction) - 3. PE_DMA(vc_comm)에 token put - 4. PE A는 자기 my_head++ (송신 추적용) + 1. PE_IPCQ checks backpressure (using peer_tail_cache) + 2. PE_IPCQ creates an IpcqDmaToken: + - data body (src_addr → peer dst_addr) + - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction) + 3. Hand the token to PE_DMA(vc_comm) + 4. PE A increments my_head (send tracking) -[fabric DMA: latency 만큼 진행] +[fabric DMA: latency elapses] -PE B의 PE_DMA가 token 수신 - 5. data를 dst_addr (B의 rx slot)에 MemoryStore.write - 6. token의 metadata를 PE B의 PE_IPCQ로 forward (PE 내부 wire, ~1 cycle) +PE B's PE_DMA receives the token + 5. Writes data into dst_addr (B's rx slot) via MemoryStore.write + 6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle) -PE B의 PE_IPCQ가 metadata 수신 - 7. peer_head_cache 갱신 (= A의 head 위치) - 8. 대기 중인 recv (해당 direction)가 있으면 wake +PE B's PE_IPCQ receives the metadata + 7. Updates peer_head_cache (= A's head) + 8. Wakes any pending recv on that direction ``` -여기서 핵심은 **5와 6은 같은 SimPy step**이라는 것이다 — DMA 완료와 동시에 -data와 metadata가 atomic하게 visible해진다. +**Steps 5 and 6 must execute in the same SimPy step** — DMA completion +makes data and metadata atomically visible. -#### Recv 흐름 (credit return — fast path with bottleneck-BW latency) +#### Recv flow (credit return — fast path with bottleneck-BW latency) -수신측이 slot을 비우면 송신측은 그 사실을 알아야 한다 (backpressure 해제). -data 경로의 piggyback 모델과 달리, credit return은 일반 vc_comm fabric을 -거치지 않고 **별도 fast path**로 처리한다. 이는 실제 HW의 NVLink/UCIe -credit return fast path를 추상화한 것이다. 
+When the receiver frees a slot, the sender must learn about it +(backpressure release). Unlike data, the credit return does **not** +travel through general vc_comm fabric — it uses a **separate fast +path**, an abstraction of the NVLink / UCIe credit-return wire. -**Latency 계산**: magic constant가 아니라 **라우팅 경로의 full path -latency** (per-node overhead + edge propagation + drain) 기준으로 -산출한다. +**Latency** is computed from the **full path latency** (per-node +overhead + edge propagation + drain), not a magic constant: ``` credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes) @@ -370,47 +373,35 @@ latency = compute_path_latency_ns(path, credit_size_bytes) + credit_size_bytes / bottleneck_bw_on_path ``` -router는 source에만 `.pe_dma`를 자동 부여하므로 destination에는 반드시 -`.pe_dma` suffix를 명시해야 한다. 그렇지 않으면 `find_path`가 raise하고 -credit이 0 cost로 silently teleport되는 latent bug가 발생한다 (이번 -업데이트에서 수정됨). +The router auto-appends `.pe_dma` to the source only, so the +destination MUST be spelled with the explicit `.pe_dma` suffix or +`find_path` raises and the credit silently teleports at zero cost +(latent bug fixed alongside this update). -`tl.recv`는 credit-emit 완료를 yield-from으로 기다린다 (이전에는 -`env.process`로 fork). 이로써 credit-return cost가 receiver의 -`pe_exec_ns`에 반영되어, IPCQ control-plane이 consume-acknowledgement를 -완료한 뒤에야 recv가 kernel에 반환된다 — RAW DMA의 non-posted `tl.store`가 -HBM ack-trip을 기다리는 것의 protocol-level 등가물이다. +`tl.recv` blocks on the credit-emit completion (recv yields-from +`_delayed_credit_send` rather than spawning it as a fork). This puts +the credit-return cost on the receiver's `pe_exec_ns`, modeling the +IPCQ control-plane completing the consume-acknowledgement before +recv returns to the kernel — the protocol equivalent of a non-posted +`tl.store` waiting for an HBM ack on the raw DMA path. 
-이로써: -- **토폴로지 비례 approximation**: cube 내 credit return과 cross-SIP credit이 - 자동으로 다른 latency를 가짐 -- **Magic constant 없음**: 모든 ns 값이 데이터 트래픽과 동일한 edge_map - 및 `node_overhead_ns`에서 산출되는 `compute_path_latency_ns`로부터 옴 -- **Deadlock 위험 없음**: `peer_credit_store.put`은 unbounded, B가 A에게 - 보낼 데이터가 없어도 credit이 자동 발행됨 -- **`IPCQ ≥ raw DMA`** 보장: matched physical move에 대해 credit-emit이 - RAW의 ack-trip cost와 균형을 이룸 +That gives us: -``` -PE B: tl.recv(W) → 데이터 가져감 → my_tail++ +- **Topology-proportional approximation**: an in-cube credit return is + automatically faster than a cross-SIP credit return. +- **No magic constants**: every nanosecond comes from + `compute_path_latency_ns` on the same edge_map and `node_overhead_ns` + as data traffic. +- **No deadlock risk**: unlike piggyback, B can issue credit even when + it has no data to send back. `peer_credit_store.put` is unbounded. +- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit + cost on recv balances the HBM ack-trip cost RAW pays on the sender. -PE B의 PE_IPCQ: - 1. router로 PE A까지 path 계산 - 2. compute_drain_ns(path, credit_size_bytes) = latency_ns - 3. env.process(self._delayed_credit_send(latency_ns, peer_credit_store, my_tail)) +#### Component coupling — SimPy Store channel -[fast path: latency_ns 만큼 timeout, fabric vc 미사용] - -PE A의 PE_IPCQ가 자기 credit_store에서 IpcqCreditMetadata 수신: - 4. peer_tail_cache 갱신 - 5. 대기 중인 send (해당 direction)가 있으면 wake -``` - -#### Component 결합도 — SimPy Store 채널 - -PE B의 PE_IPCQ가 PE A의 PE_IPCQ를 직접 호출하지 않는다. 대신 **init 시점에 -양쪽 PE_IPCQ 사이에 SimPy Store를 한 번 wire**해두고 (양방향 fast path 채널), -credit metadata는 그 store로 put한다. +PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init +time, **a SimPy Store is wired between the two** (a per-direction +fast-path channel) and credit metadata is `put` into that store. 
```python class PeIpcqComponent: @@ -419,763 +410,414 @@ class PeIpcqComponent: yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...)) ``` -backend init에서 IpcqInitMsg fan-out 시 양방향 fast path channel을 함께 -설치한다 (D12 IpcqInitMsg에 명시). +Backend init wires both directions of the fast-path channel as part of +fan-out (see `IpcqInitMsg` in D12). -#### Credit return fast path의 한계 +#### Credit-return fast path limitations -- `credit_size_bytes`는 estimate. 보통 16-64 bytes로 충분하며, 실제 HW의 - credit return wire 크기를 모방한 값. -- fast path는 일반 vc_comm BW contention 모델에서 **제외**된다 (별도 채널). - 실제 HW의 credit return wire는 매우 lightweight이므로 1차 근사로 합리적. -- 정밀화가 필요하면 후속 ADR에서: - - credit fast path를 별도 link로 모델링 (BW limit + contention) - - 또는 piggyback 모드로 변경 가능 (`credit_return_mode: piggyback`) +- `credit_size_bytes` is an estimate (typically 16–64 bytes). +- The fast path is **excluded from vc_comm BW contention** (separate + wire). Real HW credit-return wires are very lightweight, so this is a + reasonable first approximation. +- A follow-up ADR can: model the credit fast path as a separate link + (BW limit + contention), or switch to piggyback (`credit_return_mode: + piggyback`). -#### PE_DMA의 책임 추가 +#### PE_DMA's added responsibility -PE_DMA(vc_comm)는 token 수신 시 다음 시퀀스로 처리한다: Transaction -terminal의 BW drain을 먼저 지불하고, 이어서 atomic하게 data write + -metadata forward 수행. **data write와 metadata forward 사이에는 SimPy -yield를 두어서는 안 된다** (I6 MUST 규칙 참조). drain yield는 atomic -구간 안이 아니라 그 앞에 위치해야 한다: +When `vc_comm` receives a token, PE_DMA processes it as the following +sequence: pay the Transaction's terminal BW drain, then atomically +write data and forward metadata. **No SimPy yield is allowed between +the data write and the metadata forward** (invariant I6). The drain +yield must sit before the atomic block, not inside it: ```python def _on_vc_comm_recv(self, env, txn): - # Sender PE_DMA가 찍어 둔 drain_ns (= nbytes / bottleneck_bw) 를 - # 여기서 지불. 
atomic 구간보다 앞이어야 한다 — recv는 bytes가 - # "도착"한 이후에만 깨어나야 하므로. + # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the + # sender PE_DMA). MUST happen before the atomic block so recv only + # wakes after the bytes have "landed". drain = getattr(txn, "drain_ns", 0.0) if drain > 0: yield env.timeout(drain) token = txn.request - # ── ATOMIC: 두 동작 사이에 yield 금지 ── - # 1. data를 dst_addr에 write (dst의 메모리 공간은 token.dst_endpoint.buffer_kind) + # ── ATOMIC: no yield between these two operations ── data = self._memory_store.read(token.src_space, token.src_addr, - shape=..., dtype=...) - self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data) - # 2. token의 metadata를 자기 PE의 IPCQ로 forward + shape=..., dtype=...) + self._memory_store.write(token.dst_endpoint.buffer_kind, + token.dst_addr, data) + # Forward metadata to the local PE_IPCQ yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token)) - # ───────────────────────────────────── + # ─────────────────────────────────────────────────── ``` -`out_ports[ipcq_id].put`은 SimPy Store의 yield-able 호출이지만, PE 내부 -wire로 capacity가 unbounded인 store를 사용하므로 즉시 완료된다 (실질적으로 -single-step). 이 최종 put이 atomic 구간의 끝이며, 그 이전에 다른 yield가 -삽입되면 안 된다. +The final `put` is yieldable but uses an unbounded internal store, so +it completes in a single step. That `put` is the closing call of the +atomic block; no other yield may be inserted before it. #### Drain-at-inbound semantics (D9 timing model) -Transaction은 sender PE_DMA가 `drain_ns = nbytes / bottleneck_bw_on_path` -를 찍어 둔 상태로 fabric에 들어간다. 이 simulator에서 per-hop `overhead_ns` -는 각 forwarding component의 `run()` 에서 지불되고, 남은 BW drain은 -Transaction의 terminal node에서 한 번 지불된다. IPCQ가 아닌 모든 -Transaction (raw DMA, kernel-launch fanout 등) 은 -`ComponentBase._forward_txn` 이 terminal에서 이 drain을 지불한다. 
IPCQ의 -경우 목적지 PE_DMA가 `_handle_ipcq_inbound` 핸들러로 Transaction을 -가로채서 (IPCQ 전용 data write + metadata forward를 해야 하므로) -**이 핸들러 최상단에서 drain을 명시적으로 지불해야 한다** — 그래야 IPCQ의 -timing model이 다른 모든 fabric Transaction과 동일선상에 놓인다. +The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path` +stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns` +is paid at each forwarding component via `run()`, and the remaining +BW drain is paid once at the Transaction's terminal. Every non-IPCQ +Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via +`ComponentBase._forward_txn` at the terminal node. For IPCQ the +destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound` +(so IPCQ-specific data write + metadata forward can happen), so **the +drain MUST be paid explicitly at the top of that handler** to keep +IPCQ's timing model on par with every other fabric Transaction. -여기서 drain을 지불할 때의 side-effect: +Side-effects of paying drain here: -- **SRC `tl.send`**: 동작 불변. sender PE_DMA가 `sub_done` 을 `yield` - 하지 않으므로 fire-and-forget 의미가 보존된다. metadata forward 이후 - 호출되는 `sub_done.succeed()` 는 sender 입장에서 listener가 없는 이벤트. -- **DST `tl.recv`**: `drain_ns` 만큼 늦게 깨어난다. recv는 local PE_IPCQ - 의 `IpcqMetaArrival` 수신 시에만 wake되며, metadata forward가 drain - 이후로 이동했으므로 recv는 bandwidth까지 포함한 전체 fabric transfer - 시간을 관측하게 된다. +- **SRC `tl.send`** is unchanged — fire-and-forget semantics are + preserved because the sender PE_DMA does not `yield sub_done`. The + `sub_done.succeed()` call (made after metadata forward below) is an + event with no listener on the sender side. +- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only + when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata + forward now happens after the drain, recv observes the full fabric + transfer time including bandwidth cost. -물리적 그림과 일치: send는 dispatch하고 바로 반환; recv는 bytes가 실제로 -자신의 inbox로 drain될 때까지 대기. 
+This matches the physical picture: send dispatches and leaves; recv waits +until the bytes have actually been drained into its inbox. -#### Backpressure latency 정확도 +### D9.5. ADR-0020 (2-pass) integration -backpressure 해제까지 걸리는 시간: +`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase +1 simulates timing **and** moves data via MemoryStore; Phase 2 enables +op-log-based correctness verification. -- **데이터 send 측 latency** = full fabric DMA (data + piggyback metadata 함께) -- **Credit return 측 latency** = fast path with bottleneck-BW - (`credit_size_bytes / bottleneck_bw_on_path`) +#### Phase 1 (timing + data) -| 시나리오 | 모델링된 latency | 실제 HW와의 관계 | -|---------|----------------|----------------| -| Cube 내 (fast link) | 작음 (bottleneck = cube_noc BW) | topology-aware approximation | -| Cross-cube (UCIe) | 중간 (bottleneck = UCIe BW) | topology-aware approximation | -| Cross-SIP (PCIE) | 큼 (bottleneck = PCIE BW) | topology-aware approximation | +D9 models head and tail updates with two different mechanisms: -별도 magic latency 파라미터 없이 토폴로지에 비례한 first-order -approximation이 자동으로 반영된다. 실제 HW와 정확히 일치하지는 않지만 -(credit fast path는 contention 모델에서 제외, credit_size_bytes는 estimate), -magic constant 모델보다 훨씬 의미 있는 비교 가능. 정밀화는 후속 ADR로 -넘긴다. +- **Send-side (head update)** — DMA payload piggyback. Data write and + metadata forward happen in the same SimPy step → automatic atomic + visibility. +- **Recv-side (tail credit return)** — fast-path SimPy Store channel + with bottleneck-BW latency, then `peer_tail_cache` update. -### D9.5. ADR-0020 (2-Pass) 통합 +Together they preserve ring-buffer pointer consistency. -`tl.send/recv`는 ADR-0020의 2-pass 모델과 통합되어야 한다. Phase 1은 -타이밍과 실제 데이터 이동(MemoryStore) 모두 모델링하고, Phase 2는 op_log -기반 정합성 검증을 가능케 한다. +The op-log records `op_kind="ipcq"` entries for sends (with +`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with +`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`). 
+Two recv modes: -#### Phase 1 (타이밍 + 데이터 이동) - -D9는 head 갱신과 tail 갱신을 다른 메커니즘으로 모델링한다: - -- **Send-side (head update)** — DMA payload piggyback. data write와 metadata - forward가 동일 SimPy step에 일어나므로 자동으로 atomic visibility 보장. -- **Recv-side (tail credit return)** — fast path SimPy Store 채널. - bottleneck-BW 기반 latency 후 peer_tail_cache 갱신. - -두 메커니즘을 합쳐서 전체 ring buffer pointer 일관성을 유지한다. - -**send 시**: - -1. PE_IPCQ가 backpressure 체크 (peer_tail_cache 기준) -2. PE_IPCQ가 IpcqDmaToken 생성 (data + piggyback metadata) → PE_DMA(vc_comm)에 put -3. PE_DMA가 fabric DMA 시뮬레이션 (latency 진행) -4. **DMA 완료와 동일한 SimPy step에 atomic 시퀀스**: - - **MemoryStore.write(buffer_kind, dst_pa, data)** — single-hop DMA write - - 수신측 PE_IPCQ에 metadata forward → peer_head_cache 갱신 → 대기 recv wake -5. **op_log 기록**: `OpRecord(op_kind="ipcq", op_name="send", params={src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, sender_seq})` - - `dst_space`는 `token.dst_endpoint.buffer_kind`에서 derive된 값이다 - (별도 token 필드가 아니다). dst_addr은 `token.dst_addr`. - -**recv 시**: - -1. PE_IPCQ가 (peer_head_cache > my_tail) AND (MemoryStore.has(slot_addr)) 조건 대기 - (D9 piggyback 모델에서는 두 조건이 같은 step에 truthy가 되지만, defensive check) -2. 조건 만족 시: `slot_addr = my_rx_base + slot_idx * slot_size` -3. **두 가지 모드** (`recv_mode`로 op_log에 기록): - - **`return_slot`** (default): slot_addr을 그대로 PE 커널에 반환. - 데이터 복사 없음. 커널이 slot 메모리를 직접 사용한다. - - **`copy_to_dst`**: 호출 시 dst_addr이 지정된 경우. slot 데이터를 읽어서 - dst_addr에 write. `data = memory_store.read(...)`; `memory_store.write(dst_space, dst_addr, data)` -4. PE_IPCQ가 my_tail++, fast path credit return을 발행 (D9 — vc_comm - fabric을 거치지 않고 별도 SimPy Store 채널로 bottleneck-BW latency 후 - peer 측 peer_tail_cache 갱신) -5. 
**op_log 기록**: `OpRecord(op_kind="ipcq", op_name="recv", params={recv_mode, src_space, src_addr, dst_space, dst_addr, nbytes, dir, dtype, shape, consumer_seq})` - - `recv_mode="return_slot"`: src_space/src_addr가 slot 위치, dst_addr=None - - `recv_mode="copy_to_dst"`: src_space/src_addr가 slot 위치, dst_space/dst_addr가 사용자 지정 위치 +- **`return_slot`** (default): the slot address is returned to the + kernel. Zero-copy. +- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`, + PE_IPCQ copies the slot data into the user dst. #### Phase 2 (op_log replay) -DataExecutor가 `op_kind="ipcq"` 레코드를 만나면: +When `DataExecutor` encounters an `op_kind="ipcq"` record: -- **send**: src → dst (peer rx slot)로 ndarray를 idempotent하게 write -- **recv (`recv_mode="return_slot"`)**: no-op. slot 데이터는 Phase 1에서 - 이미 적절한 위치에 있으며, 커널이 해당 slot 메모리를 직접 사용함. -- **recv (`recv_mode="copy_to_dst"`)**: slot → dst_addr로 ndarray를 idempotent - 하게 copy +- **send**: idempotent `src → dst` ndarray write. +- **recv (`return_slot`)**: no-op (the slot already holds the data). +- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy. -본질적으로 IPCQ는 **데이터 이동**만 하므로 Phase 2가 추가로 계산할 것은 없다. -DataExecutor의 GEMM/Math가 그 데이터를 사용하면 자동으로 정합성이 검증된다. +IPCQ ops are pure data movement — Phase 2 has nothing extra to compute. +The downstream GEMM / Math ops in `DataExecutor` will consume the data +and naturally validate correctness. + +### D10. Host CCL init keeps the PyTorch shape + +The host code looks just like real PyTorch DDP. `init_process_group` +creates the backend object; it does **not** receive IPCQ knobs +(neighbor topology, buffer_kind, backpressure …). ```python -class DataExecutor: - def _execute_op(self, op): - if op.op_kind == "ipcq": - self._execute_ipcq(op) - elif op.op_kind == "memory": - ... - elif op.op_kind == "gemm": - ... 
- - def _execute_ipcq(self, op): - """IPCQ ops are data movement; Phase 1 already wrote to MemoryStore.""" - p = op.params - if op.op_name == "send": - data = self.store.read(p["src_space"], p["src_addr"], - shape=p["shape"], dtype=p["dtype"]) - self.store.write(p["dst_space"], p["dst_addr"], data) - elif op.op_name == "recv": - if p.get("recv_mode") == "copy_to_dst": - data = self.store.read(p["src_space"], p["src_addr"], - shape=p["shape"], dtype=p["dtype"]) - self.store.write(p["dst_space"], p["dst_addr"], data) - # recv_mode == "return_slot": no-op (data already in slot) -``` - -#### `--verify-data` 흐름 (CCL 커널) - -``` -1. kernbench run --bench ccl_allreduce --verify-data -2. backend init → IPCQ buffers 할당, neighbor table install -3. 모든 rank greenlet 동시 실행 -4. 각 PE 커널이 tl.send/recv → MemoryStore에 데이터 누적 -5. 시뮬레이션 완료 후 DataExecutor.run() → ipcq op 멱등 replay (no-op) -6. 벤치마크가 print(out) 또는 out.data 비교 → 정합성 확인 -``` - -벤치 작성자는 `out.data`로 결과를 읽고 expected와 비교하면 된다 (ADR-0020 D7 -Tensor.data 패턴). - -### D10. 호스트 CCL Init은 PyTorch 패턴 그대로 - -호스트 코드는 실제 PyTorch distributed 코드와 동일하게 유지한다. -`init_process_group`은 backend 객체만 만들고, IPCQ 설정 (neighbor topology, -buffer_kind, backpressure 등)은 받지 않는다. - -```python -# benches/ccl_allreduce.py — 실제 PyTorch와 동일한 호스트 코드 -def run_rank(rank, world_size, torch): +# benches/ccl_allreduce.py — same shape as real PyTorch +def worker(rank, world_size, torch): dist = torch.distributed - dist.init_process_group(backend="ahbm", world_size=world_size, rank=rank) - - tensor = torch.zeros((M, K), dtype="f16", dp=...) - - from kernbench.ccl.algorithms import ring_allreduce - torch.launch("ring_allreduce", ring_allreduce.kernel, tensor, rank, world_size) + dist.init_process_group(backend="ahbm") # reads ccl.yaml + topology + tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...) 
+ tensor.copy_(torch.from_numpy(init)) + dist.all_reduce(tensor, op="sum") ``` -IPCQ 설정은 backend가 **init_process_group 시점에** `ccl.yaml`을 읽고 즉시 -PE_IPCQ neighbor table을 install한다. 호스트 코드는 IPCQ를 인지할 필요가 없다. +The IPCQ configuration is decided by the backend at +`init_process_group` time: it loads `ccl.yaml`, picks the algorithm, +and pushes IPCQ neighbor tables to every participating PE_IPCQ. The +host code never has to know about IPCQ. -벤치마크 하나는 하나의 알고리즘을 사용하는 것을 가정하며, 사용할 알고리즘은 -`ccl.yaml`의 `defaults.algorithm` 으로 지정한다 (D11). 호스트 코드 변경 없이 -ccl.yaml만 수정하여 다른 알고리즘으로 교체할 수 있다. +A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`. +Switching algorithms is purely a `ccl.yaml` change — no host edits +required. -#### Init 흐름 (eager) +#### Init flow (eager) -1. `init_process_group(backend="ahbm")` 호출 -2. backend가 `ccl.yaml` 로드 → `defaults.algorithm` 결정 -3. `algorithms[]`에서 topology + buffer_kind + backpressure + slot/size 결정 -4. **즉시** 모든 PE의 PE_IPCQ에 neighbor table을 install (sideband 또는 fabric `IpcqInitMsg`) -5. 이후 `torch.launch(kernel_name, ...)`는 일반 launch와 동일하게 처리 - (CCL kernel이든 아니든 PE_IPCQ는 이미 준비됨) +1. `init_process_group(backend="ahbm")` is called. +2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`. +3. Pulls topology + buffer_kind + backpressure + slot config from + `algorithms[]`. +4. **Immediately** installs neighbor tables on every PE_IPCQ + (sideband or fabric `IpcqInitMsg`). +5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally — + PE_IPCQ is already prepared whether the kernel is a CCL kernel or + not. -### D11. CCL 설정 파일 (`ccl.yaml`) +### D11. CCL config file (`ccl.yaml`) -IPCQ 설정과 알고리즘 metadata는 별도 YAML 파일에 둔다. -`components.yaml`/`topology.yaml`과 같은 패턴을 유지하며, 변경 이력이 코드처럼 -추적 가능하다. +IPCQ config and algorithm metadata live in a separate YAML file, +following the same pattern as `components.yaml` and `topology.yaml`. -벤치마크 한 번 실행은 한 알고리즘만 사용한다 (`defaults.algorithm`). 
-다른 알고리즘으로 교체하려면 `ccl.yaml`의 `defaults.algorithm` 만 바꾸면 된다. +A single benchmark execution runs one algorithm +(`defaults.algorithm`). Switching algorithms means editing +`defaults.algorithm` only. ```yaml -# ccl.yaml — CCL backend (ahbm) configuration -# -# 이 파일은 init_process_group(backend="ahbm") 시점에 로드되며, -# defaults.algorithm 으로 지정된 알고리즘에 따라 PE_IPCQ neighbor table을 -# install한다. 호스트 코드는 IPCQ 설정을 인지하지 않는다. - defaults: - # 이번 벤치 실행에서 사용할 알고리즘. algorithms 섹션에 정의된 것 중 하나. - algorithm: ring_allreduce - - # IPCQ ring buffer가 위치할 메모리. - # tcm — PE-local TCM (작지만 빠름, PE 내부 자원과 경쟁) - # hbm — PE-local HBM (큼, DMA latency 큼) - # sram — Cube-shared SRAM (중간 크기, cube 내 PE 간 contention) - buffer_kind: tcm - - # send/recv가 peer slot full / data 미도착을 만났을 때의 대기 방식. - # poll — peer pointer 캐시를 spin loop로 재확인 - # sleep — SimPy event yield 후 wakeup 대기 (interrupt-like) - backpressure: sleep - - # Ring buffer depth (한 방향당 slot 개수). 클수록 in-flight 가능, 메모리 ↑ + algorithm: ring_allreduce_tcm + buffer_kind: tcm # tcm | hbm | sram + backpressure: sleep # poll | sleep n_slots: 8 - - # Slot 하나의 크기 (bytes). 한 tile을 통째로 담을 수 있는 크기여야 함. slot_size: 4096 - - # PE_DMA virtual channel chunk 크기 (bytes). 작을수록 fair, 클수록 효율. - # IPCQ traffic과 compute traffic 사이의 인터리브 granularity (D8 참조). vc_chunk_size: 256 - - # Credit return fast path 메시지 크기 (bytes). 실제 HW의 credit return wire - # 크기를 모방. backend가 라우팅 경로의 bottleneck BW를 보고 latency를 - # 계산한다 (D9 참조). 보통 16-64로 충분. ipcq_credit_size_bytes: 16 algorithms: - # ── 알고리즘 정의 ───────────────────────────────────────────────── - # 각 entry는 알고리즘 모듈과 그 알고리즘이 요구하는 topology를 명시한다. - # 알고리즘별 default override 가능 (buffer_kind, backpressure 등). - - ring_allreduce: - # PE 커널이 정의된 모듈. `kernel(t_ptr, rank, world_size, tl)` 함수를 export. + ring_allreduce_tcm: module: kernbench.ccl.algorithms.ring_allreduce + topology: ring_1d # builtin name or "custom" + buffer_kind: tcm + n_elem: 8 # optional, per-algorithm tile width - # 이 알고리즘이 요구하는 neighbor topology. builtin 이름 또는 "custom". 
- # ring_1d — 1D 양방향 ring (E/W) - # ring_1d_unidir — 1D 단방향 ring (E only) - # mesh_2d — 2D mesh (N/S/E/W) - # tree_binary — binary tree (parent/children direction) - # custom — 모듈의 neighbors(rank, world_size) 함수 사용 - topology: ring_1d - - tree_allreduce: + tree_allreduce_7: module: kernbench.ccl.algorithms.tree_allreduce topology: tree_binary - # 알고리즘별 override (이 알고리즘만 hbm 사용) - buffer_kind: hbm + buffer_kind: tcm + world_size: 7 # algorithm-level override + n_elem: 16 custom_mesh: module: kernbench.ccl.algorithms.custom_mesh - topology: custom # 모듈이 직접 neighbors() 함수 제공 + topology: custom # the module supplies its own neighbors() ``` -#### 알고리즘 모듈 구조 +`world_size` is **not set in `defaults`**. The backend resolves it via: +`algorithm-level override > defaults override > topology spec`. The +last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP +where `WORLD_SIZE` comes from env vars rather than config files. -알고리즘 모듈은 두 개의 hook을 export한다 — `kernel`은 필수, `neighbors`는 선택. +#### Algorithm module structure + +Each algorithm module exports two hooks — `kernel` (required) and +`neighbors` (optional) — plus a `kernel_args` helper that the +backend uses to populate positional kernel arguments at `all_reduce` +time: ```python # src/kernbench/ccl/algorithms/ring_allreduce.py -def kernel(t_ptr, rank, world_size, tl): - """필수 — PE 커널. +def kernel_args(world_size: int, n_elem: int) -> tuple: + return (n_elem, world_size) - IPCQ 설정은 backend가 ccl.yaml + neighbors() 결과로 install한 상태이다. - 커널은 그저 4-방향 send/recv API만 사용하면 된다. - """ - for step in range(world_size - 1): - ... - tl.send(dir="E", ...) - data = tl.recv(dir="W") +def kernel(t_ptr, n_elem, world_size, tl): + """Required — the PE kernel. -def neighbors(rank, world_size, neighbor_map): - """선택 — neighbor table override hook. - - backend는 ccl.yaml의 topology 필드에 따라 builtin neighbor_map을 생성한 뒤, - 이 함수가 정의되어 있으면 호출하여 결과를 override 한다. 
- - Args: - rank: 이 rank의 인덱스 - world_size: 전체 rank 수 - neighbor_map: ccl.yaml의 topology 필드가 만든 builtin 매핑 - 예: ring_1d → {"E": (rank+1)%ws, "W": (rank-1)%ws} - mutable dict — 직접 수정 가능 - - Returns: - dict | None: - dict — neighbor_map을 override한 결과 - None — override 안 함, neighbor_map 그대로 사용 - """ - return None # 또는 수정 후 반환 -``` - -#### `neighbors` override 패턴 - -대부분의 알고리즘은 builtin topology만으로 충분하므로 `neighbors` 정의가 필요 없다. -정의가 필요한 경우의 패턴: - -**Pattern A — builtin을 base로 일부만 수정**: -```python -def neighbors(rank, world_size, neighbor_map): - # 짝수 rank만 W 사용 - if rank % 2 == 1: - neighbor_map.pop("W", None) - return neighbor_map -``` - -**Pattern B — 완전히 새로 만들기 (skip-connection ring 등)**: -```python -def neighbors(rank, world_size, neighbor_map): - # neighbor_map은 무시하고 새로 작성 - return {"E": (rank + 2) % world_size} -``` - -#### Builtin topology generators - -`ccl.yaml`의 `topology` 필드가 다음 builtin 이름이면 backend가 알아서 처리: - -| topology | 설명 | direction set | -|----------|------|---------------| -| `ring_1d` | 1D 양방향 ring | E, W | -| `ring_1d_unidir` | 1D 단방향 ring | E only | -| `mesh_2d` | 2D mesh | N, S, E, W | -| `tree_binary` | binary tree (root = rank 0) | parent, child_left, child_right | -| `none` | 빈 매핑 — 알고리즘이 `neighbors()`로 처음부터 작성 | (없음) | - -`topology: none`은 builtin이 빈 dict를 반환하므로 알고리즘의 `neighbors()`가 -처음부터 매핑을 만들어야 한다. - -#### 알고리즘 추가 절차 - -1. `src/kernbench/ccl/algorithms/.py`에 `kernel` 함수 작성 -2. `ccl.yaml`의 `algorithms` 섹션에 entry 추가 (`module`, `topology`) -3. (선택) 같은 모듈에 `neighbors()` 함수 추가하여 builtin override -4. `defaults.algorithm`을 새 알고리즘으로 설정하면 적용 - -호스트 코드는 손대지 않는다. - -### D12. 메시지 / 토큰 스키마 - -본 ADR이 추가하는 모든 메시지/토큰의 필드를 명시한다. 구현 시 이 정의를 -`src/kernbench/common/pe_commands.py`와 `src/kernbench/runtime_api/kernel.py`에 -그대로 추가한다. - -#### `IpcqInitMsg` (sideband, init 시 fan-out) - -backend가 모든 PE의 PE_IPCQ에 neighbor table을 install하기 위해 사용한다. -구조는 `MmuMapMsg`와 유사 (target_sips, target_cubes, target_pe + entries). 
- -```python -@dataclass(frozen=True) -class IpcqInitEntry: - direction: str # "N" | "S" | "E" | "W" - peer: IpcqEndpoint # D2.5 참조 - my_rx_base_pa: int # 자신의 rx_buffer base - my_rx_base_va: int # 선택 - n_slots: int - slot_size: int - # Credit fast path 채널 (D9). - # 계약: 이 필드는 반드시 simpy.Store 인스턴스이며, IpcqCreditMetadata - # 객체만을 받는 receive endpoint이다 (peer's PE_IPCQ가 자기 입력 큐로 - # 사용). 송신 측 PE_IPCQ는 _delayed_credit_send에서 이 store에 직접 - # IpcqCreditMetadata를 put한다. 다른 객체 type을 put해서는 안 된다. - # backend init 시 양방향 SimPy Store가 한 번 wire되며 이후 변경 불가. - peer_credit_store: "simpy.Store[IpcqCreditMetadata]" - -@dataclass(frozen=True) -class IpcqInitMsg: - correlation_id: str - request_id: str - target_sips: tuple[int, ...] - target_cubes: tuple[int, ...] - target_pe: int | tuple[int, ...] | str - entries: tuple[IpcqInitEntry, ...] # 이 PE의 4-방향 entry - backpressure_mode: str # "poll" | "sleep" - buffer_kind: str # "tcm" | "hbm" | "sram" - credit_size_bytes: int # D9 fast path latency 계산용 (default 16) -``` - -**Credit fast path channel wiring**: backend init이 모든 PE의 PE_IPCQ에 -양방향 fast path 채널을 한 번 설치한다. PE A의 IpcqInitEntry(direction=E)에 -PE B의 credit-receive Store reference를 넣어 송신 측이 직접 put할 수 있게 -한다 (별도 fabric routing 없음). - -#### `IpcqSendCmd` (PE_CPU → PE_IPCQ) - -```python -@dataclass(frozen=True) -class IpcqSendCmd: - direction: str # 어느 방향으로 보낼지 - src_addr: int # 보낼 데이터의 원본 주소 (TCM/HBM) - src_space: str # "tcm" | "hbm" | "sram" - nbytes: int - shape: tuple[int, ...] 
# data shape (op_log/MemoryStore용) - dtype: str - handle_id: str # completion 추적용 - data_op: bool = True # ADR-0020 op_log 기록 대상 -``` - -#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ) - -```python -@dataclass(frozen=True) -class IpcqRecvCmd: - direction: str | None # None이면 round-robin (weak fairness, D4) - # recv_mode: 두 가지 동작 모드 - # "return_slot" — slot 주소를 그대로 PE 커널에 반환 (default, zero-copy) - # "copy_to_dst" — slot 데이터를 dst_addr에 copy 후 반환 - recv_mode: str = "return_slot" - # dst_addr / dst_space는 recv_mode="copy_to_dst"일 때만 사용됨 - dst_addr: int = 0 - dst_space: str = "" - shape: tuple[int, ...] = () # data shape (op_log/MemoryStore용) - dtype: str = "" - handle_id: str = "" - blocking: bool = True # blocking vs non-blocking - data_op: bool = True -``` - -#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm 채널) - -D9의 piggyback 모델에 따라 token이 data + head metadata를 함께 담아 -fabric을 따라 이동한다. 수신 측 PE_DMA가 도착 시점에 data를 dst_addr에 -write하고 metadata를 PE_IPCQ로 forward한다 (atomic). - -```python -@dataclass -class IpcqDmaToken: - # ── Data movement (single-hop DMA write) ── - src_addr: int # 자기 메모리 주소 - src_space: str - dst_addr: int # peer rx slot 주소 (이미 계산됨) - dst_endpoint: IpcqEndpoint # 라우팅용 (sip/cube/pe) - nbytes: int # data 크기 - handle_id: str # 완료 시 송신 측 PE_IPCQ로 알림 - - # ── Piggyback metadata (수신측 PE_IPCQ가 자동 갱신할 정보) ── - sender_seq: int # 단조 증가 sequence number - # peer가 자기 head_cache로 사용 - src_sip: int # 송신 측 (수신측이 어느 peer인지 식별) - src_cube: int - src_pe: int - src_direction: str # 송신측 기준 방향 (수신측은 reverse 매핑으로 자기 direction 결정) - - data_op: bool = True # ADR-0020 op_log 기록 대상 -``` - -PE_DMA는 token type으로 채널 결정 (D8): TileToken → vc_compute, IpcqDmaToken → vc_comm. - -**수신 측 PE_DMA의 처리** (vc_comm 도착 시): - -```python -def _vc_comm_arrival(self, env, token: IpcqDmaToken): - # 1. data를 dst_addr에 write (data와 metadata atomic visibility) - if self._memory_store is not None: - data = self._memory_store.read(token.src_space, token.src_addr, - shape=..., dtype=...) 
- self._memory_store.write(token.dst_endpoint.buffer_kind, token.dst_addr, data) - # 2. metadata를 자기 PE의 IPCQ로 forward (PE 내부 wire, 같은 step) - yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token)) -``` - -PE_IPCQ는 `IpcqMetaArrival`을 받아 sender_seq를 보고 peer_head_cache를 갱신한다. - -#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path 채널) - -Credit return은 D9의 fast path 모델에 따라 vc_comm fabric을 거치지 않고 -**별도의 SimPy Store 채널**로 전달된다. backend init 시 양방향 channel이 -미리 wire되며, latency는 bottleneck-BW 기반으로 계산된다. - -```python -@dataclass(frozen=True) -class IpcqCreditMetadata: - """Credit return — recv 측 → send 측 fast path.""" - consumer_seq: int # my_tail (recv 측의 새 tail) - src_sip: int # 누가 보냈는지 (수신 측이 어느 peer credit인지 식별) - src_cube: int - src_pe: int - src_direction: str # 송신 측 기준 방향 (수신 측은 reverse 매핑) -``` - -**전송 흐름**: - -```python -class PeIpcqComponent: - def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns): - yield env.timeout(latency_ns) - yield peer_credit_store.put(IpcqCreditMetadata( - consumer_seq=my_tail, src_sip=..., src_cube=..., src_pe=..., - src_direction=..., - )) -``` - -`latency_ns`는 D9에 정의된 대로: - -```python -path = self.ctx.router.find_path(self_pe_prefix, peer_pe_prefix) -latency_ns = self.ctx.compute_drain_ns(path, credit_size_bytes) -``` - -**별도의 IpcqPtrUpdate 이벤트는 없다** — head 갱신은 D9 piggyback 모델로, -tail 갱신은 D9 fast path SimPy Store 채널로 처리된다. - -### D13. 테스트 전략 - -단위/통합/regression 테스트를 명시한다. - -#### T1. 
단위 테스트 (component-level) - -- **PE_IPCQ 단위** (`tests/test_pe_ipcq.py`): - - send: backpressure 미발생 시 즉시 PE_DMA로 token forward - - send: peer slot full → backpressure (poll/sleep 모드별) - - send: peer credit return (IpcqCreditMetadata) 도착 후 backpressure 해제 - - recv: 데이터 도착 시 즉시 반환 - - recv: 데이터 미도착 → wait → IpcqMetaArrival (D9 piggyback) 수신 시 wake - - recv (round-robin): 4-방향 중 도착한 첫 데이터 반환 (weak fairness) - - 잘못된 방향 → IpcqInvalidDirection 예외 - -- **PE_DMA virtual channel** (`tests/test_pe_dma_vc.py`): - - vc_compute / vc_comm 독립 진행 (한 채널 stall 시 다른 채널 진행) - - chunk-level 인터리브 verification - - link BW 분할 (50/50 또는 weighted) - -- **builtin topology** (`tests/test_ccl_topologies.py`): - - ring_1d/mesh_2d/tree_binary 각각 (rank, world_size) → neighbor dict 정합성 - - mesh_2d non-square → ValueError - - resolve_topology(custom, module) → module.neighbors 반환 - -#### T2. 통합 테스트 (E2E send/recv) - -- **`tests/test_ipcq_e2e.py`**: - - 2-rank ring: rank 0 send(E) → rank 1 recv(W) → 데이터 정합성 - - 4-rank ring: 양방향 send/recv 동시 진행, deadlock 없음 - - mesh_2d 4×4: N/S/E/W 4방향 동시 send/recv - -- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): - - greenlet 모드 + IPCQ → op_log에 ipcq 레코드 생성 검증 - - DataExecutor가 ipcq op 처리 후 결과 정합성 (`out.data` 확인) - -#### T3. Backend init 테스트 (`tests/test_ccl_backend_ipcq.py`) - -- ccl.yaml 로드 → `defaults.algorithm` 추출 -- builtin topology → IpcqInitMsg fan-out -- IpcqEndpoint의 PA가 모든 PE에서 일관 (rank A의 peer E의 rx_base_pa = rank A+1의 자기 rx_base_pa) -- buffer_kind 별 메모리 할당 (tcm/hbm/sram) - -#### T4. Regression - -- 기존 401 tests 전부 PASS -- ADR-0020 통합으로 인한 op_log/DataExecutor 영향 없음 (CCL 미사용 벤치) - -#### T5. 성능 / overhead - -- 단일 send/recv pair latency = (DMA latency) + (IPCQ overhead) -- 비교: 같은 nbytes의 일반 PE_DMA write와 거의 동일해야 함 (IPCQ overhead < 100 ns) - -### D14. Invariants & Failure Modes - -CCL 인프라에서 흔히 발생하는 hang/오류 상황을 명시하고, 대응 방식을 정의한다. - -#### Invariants (시뮬레이션이 보장해야 하는 것) - -I1. **Slot lifecycle exactly-once**: 한 send → 정확히 한 recv. 
중복 send나 - 중복 recv는 sequence 오류로 간주. - -I2. **Pointer monotonicity**: my_head, my_tail은 단조 증가 (감소 없음). - sender_seq는 송신 측에서 단조 증가, 수신 측 cache 갱신도 단조 증가. - -I3. **Endpoint consistency**: rank A의 IpcqEndpoint(direction=E)의 peer가 - rank B라면, rank B의 IpcqEndpoint(reverse(E))의 peer는 rank A여야 함. - backend init 시 검증. - -I4. **buffer_kind consistency**: 한 ProcessGroup 내 모든 PE의 buffer_kind는 - 동일 (mixed kind는 supported 안 함, 첫 구현). 검증 실패 시 init 에러. - -I5. **op_log ordering**: send → DMA 완료 → recv 가능. op_log의 t_start - 순서가 이 인과관계를 위배하지 않음. - -I6. **Atomic data + metadata visibility (MUST)**: 본 ADR의 correctness 핵심 - 조건이다. 수신 측에서 data write (MemoryStore.write)와 metadata forward - (peer_head_cache 갱신)는 동일한 SimPy step에 일어나야 한다. control이 - data를 앞지를 수 없다. - - **구현 규칙 (MUST)**: - - PE_DMA의 vc_comm token 도착 처리(`_vc_comm_arrival`)는 다음 두 동작 - 사이에 **어떤 SimPy yield도 두어서는 안 된다**: - 1. `MemoryStore.write(token.dst_endpoint.buffer_kind, token.dst_addr, data)` - 2. PE_IPCQ에 `IpcqMetaArrival` forward - - 두 동작은 동일 SimPy event callback 내에서 연속 실행되어야 한다. - - 코드 리뷰에서 이 사이에 `yield` (또는 `yield from`)을 추가하는 것은 - correctness 위반으로 reject한다. - - 이 규칙을 위반하면 다른 SimPy process가 끼어들어 head_cache가 data - visibility보다 먼저 또는 늦게 보이는 race condition이 발생한다. - -I7. **MemoryStore slot existence ↔ pointer**: I6의 결과로, - `peer_head_cache > my_tail`이 truthy가 되는 step과 `MemoryStore.has(slot_addr)` - 이 truthy가 되는 step이 동일하다. recv는 두 조건을 모두 체크하지만 (defensive), - 단일 조건만 체크해도 정확하다. - -#### Failure Modes (런타임 에러) - -F1. **잘못된 direction**: - - PE 커널이 `tl.send(dir="X")` 호출 → install 안 된 direction - - PE_IPCQ가 즉시 `IpcqInvalidDirection` 예외 raise - - SimPy 시뮬레이션 즉시 abort, 사용자에게 명확한 에러 - -F2. **타입 mismatch**: - - send와 recv의 dtype/shape/nbytes가 일치하지 않음 - - 첫 구현은 검증 안 함 (dtype/shape는 hint), 향후 strict mode로 추가 - -F3. 
**Deadlock detection (timeout 기반)**: - - send: peer_tail_cache가 갱신 안 되고 영원히 wait - - recv: peer_head_cache 갱신 안 되고 영원히 wait - - 시뮬레이션 timeout (default 10ms simulated time) 초과 시 abort - - 디버그를 위해 각 PE의 last send/recv 위치, blocking 상태 dump - -F4. **Backend init 실패**: - - ccl.yaml에 `defaults.algorithm` 누락 - - `algorithms[name]` 정의 누락 - - 알고리즘 모듈 import 실패 - - topology 검증 실패 (I3, I4) - → 모두 `init_process_group` 시점에 즉시 에러 - -F5. **Slot full + 무한 backpressure**: - - peer가 영원히 안 받음 - - F3과 같이 timeout으로 처리 - - 디버그: 막힌 PE의 my_head, peer_tail_cache 출력 - -#### 진단 도구 (구현 단계에서 추가) - -- **CCL trace**: 각 send/recv를 (rank, t, dir, nbytes) 형태로 로깅 -- **Pointer dump**: 시뮬레이션 종료 시 또는 hang 시 모든 PE의 IPCQ pointer 상태 출력 -- **Deadlock graph**: hang 발생 시 wait-for 그래프 출력 (어느 PE가 어떤 PE를 기다리는지) - -### D15. 알고리즘 작성자 가이드 (요약) - -본 섹션은 알고리즘 작성자가 한 화면으로 시작점을 잡을 수 있도록 한다. -자세한 step-by-step 가이드는 [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md) 참조. - -#### 만지는 것 / 만지지 않는 것 - -| 만지는 것 | 만지지 않는 것 | -|----------|---------------| -| `src/kernbench/ccl/algorithms/.py` (kernel + 선택적 neighbors) | `benches/ccl_allreduce.py` 호스트 코드 | -| `ccl.yaml` 의 한 entry 추가 + `defaults.algorithm` | `src/kernbench/ccl/` 프레임워크 | -| (선택) `tests/test_.py` 단위 테스트 | `src/kernbench/components/builtin/pe_ipcq.py` 컴포넌트 | -| | `src/kernbench/runtime_api/distributed.py` backend | - -#### 알고리즘 모듈 인터페이스 contract - -```python -# src/kernbench/ccl/algorithms/.py - -def kernel(*args, tl) -> None: - """필수. PE 커널. - - Args (positional): tensor pointers, rank, world_size, 알고리즘 파라미터 - Args (keyword): tl — TLContext (자동 주입) - - 사용 가능한 IPCQ API: - tl.send(dir, src_addr, nbytes) # blocking, backpressure 시 wait - tl.recv(dir) # 특정 방향에서 blocking recv - tl.recv() # 4방향 round-robin - tl.recv_async(dir) → handle # non-blocking - tl.wait(handle) # non-blocking 완료 대기 - - 기존 API도 그대로 사용: - tl.load / tl.store / tl.composite / tl.program_id 등 + IPCQ is already installed by the backend before this is called. 
+ The kernel only uses the four-direction send / recv API. """ ... -def neighbors(rank, world_size, neighbor_map) -> dict | None: - """선택. ccl.yaml의 builtin topology가 만든 neighbor_map을 override. - None 반환 → builtin 그대로 사용 - dict 반환 → 그 dict로 override (builtin을 base로 수정 가능) +def neighbors(rank, world_size, neighbor_map): + """Optional — override the builtin topology's neighbor map. + + Returns a new dict, the modified-in-place dict, or None to keep the + builtin map. """ return None ``` -#### 5-step 흐름 +#### `neighbors` override patterns -1. **kernel 함수 작성** — `src/kernbench/ccl/algorithms/.py` 신규 파일 -2. **ccl.yaml 등록** — `algorithms.` entry + `defaults.algorithm` 변경 -3. **(선택) neighbors override** — builtin topology를 base로 수정이 필요할 때 -4. **단위 테스트** — `kernbench.ccl.testing.run_kernel_in_mock` (SimPy 없이 빠름) -5. **시뮬 검증** — `kernbench run --bench ccl_allreduce --verify-data` +- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc. +- **Pattern B — replace entirely**: ignore `neighbor_map` and return a + brand-new dict. +- **Pattern C — keep builtin**: omit `neighbors` or return None. -호스트 코드 (`benches/ccl_allreduce.py`)는 손대지 않는다. +#### Builtin topologies -#### 사용 가능한 헬퍼 (`kernbench.ccl.helpers`) +| topology | direction set | +|----------|---------------| +| `ring_1d` | E, W | +| `ring_1d_unidir` | E only | +| `mesh_2d` | N, S, E, W | +| `tree_binary` | parent, child_left, child_right | +| `none` | (empty) — algorithm must supply `neighbors()` | -| Helper | 설명 | -|--------|------| -| `chunked(addr, n_chunks, ...)` | 텐서를 n개 chunk view로 슬라이싱 | -| `ring_step(rank, step, ws)` | ring algorithm의 step별 (send_idx, recv_idx) | -| `tree_step(rank, level)` | binary tree의 level별 parent/child 인덱스 | +#### Adding a new algorithm -#### 디버깅 도구 +1. Write `kernel` and `kernel_args` in + `src/kernbench/ccl/algorithms/.py`. +2. Add an entry in `ccl.yaml`'s `algorithms` section. +3. (Optional) provide `neighbors()` for custom topology. +4. 
Set `defaults.algorithm` to the new algorithm. -- `KERNBENCH_CCL_TRACE=1` — send/recv trace 출력 -- 시뮬 종료 시 자동 IPCQ pointer dump -- Deadlock 시 (10ms 시뮬 시간 초과) wait-for graph dump +The host bench (`benches/ccl_allreduce.py`) does not change. -#### 흔한 실수 +### D12. Message / token schema -1. **install 안 된 direction 사용** — ccl.yaml의 topology가 ring_1d면 N/S 사용 불가 -2. **send/recv 짝 맞지 않음** — peer 측 recv 없으면 hang (slot full backpressure) -3. **dtype/shape 불일치** — 첫 구현은 검증 안 함, 작성자 책임 +The new message types added by this ADR. They live in +`src/kernbench/common/pe_commands.py` and +`src/kernbench/runtime_api/kernel.py`. -자세한 step-by-step과 hello-world 예제는 `docs/onboarding/ccl-author-guide.md` 참조. +#### `IpcqInitMsg` (sideband, fan-out at init) + +The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors +`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`). +Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`, +`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store` +field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can +push `IpcqCreditMetadata` directly into the receiver's input queue. + +#### `IpcqSendCmd` (PE_CPU → PE_IPCQ) + +Carries `direction`, source addr/space, nbytes, shape, dtype, and a +handle id. `data_op=True` so it lands in the op_log. + +#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ) + +Carries `direction` (or None for round-robin), `recv_mode` +(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape, +dtype, blocking flag. + +#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel) + +Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`) +plus the head metadata (`sender_seq`, `src_sip/cube/pe`, +`src_direction`). PE_DMA picks the channel by token type +(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`). + +The receiver's PE_DMA, on token arrival, performs the I6 atomic +sequence: write data into MemoryStore, then forward `IpcqMetaArrival` +to the local PE_IPCQ. 
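
The arrival sequence described above can be sketched as follows. This is an illustration under stated assumptions, not the simulator's handler: `ToyToken`, `ToyReceiver`, the dict standing in for MemoryStore, and the deque standing in for the PE_DMA to PE_IPCQ wire are all hypothetical names. What it shows is the shape I6 asks for: the data write and the metadata forward sit back to back in one plain function, with no `yield` between them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyToken:
    """Hypothetical stand-in for IpcqDmaToken (only the fields used here)."""
    dst_space: str
    dst_addr: int
    sender_seq: int
    payload: bytes


class ToyReceiver:
    """Illustrative receiver-side PE_DMA; not the simulator's class."""

    def __init__(self) -> None:
        # (space, addr) -> bytes; stands in for MemoryStore.
        self.memory: dict[tuple[str, int], bytes] = {}
        # Stands in for the local PE_DMA -> PE_IPCQ wire.
        self.ipcq_inbox: deque[int] = deque()

    def on_token_arrival(self, token: ToyToken) -> None:
        # Step 1: make the payload visible in the destination slot.
        self.memory[(token.dst_space, token.dst_addr)] = token.payload
        # Step 2: forward the head metadata. Nothing can interleave between
        # step 1 and step 2 because this function never yields.
        self.ipcq_inbox.append(token.sender_seq)


rx = ToyReceiver()
rx.on_token_arrival(ToyToken("tcm", 0x1000, sender_seq=1, payload=b"\x01\x02"))
```

In a generator-based SimPy process the same rule would mean keeping both statements ahead of the next `yield`, so that no other process can observe the metadata without the data or the data without the metadata.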
+ +#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path) + +Carries `consumer_seq` (= my_tail), source PE coords, and source +direction. Travels through the dedicated SimPy Store channel rather +than `vc_comm`. Latency = `credit_size_bytes / bottleneck_bw_on_path`. + +There is **no `IpcqPtrUpdate` event** — head updates flow via D9 +piggyback, tail updates via the D9 fast-path channel. + +### D13. Test strategy + +Test plan: + +#### T1. Unit tests (component-level) + +- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure + immediately forwards a token; full peer slot triggers backpressure + (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`; + round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`. +- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute` + / `vc_comm` independent progress, chunk interleave, BW split. +- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d / + mesh_2d / tree_binary correctness, mesh_2d non-square → + `ValueError`, custom resolver returns the module's `neighbors`. + +#### T2. Integration tests (E2E send/recv) + +- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional + no-deadlock), 4×4 mesh. +- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode + records `ipcq` ops in op_log; DataExecutor produces correct + `out.data`. + +#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`) + +`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA +consistency, per-`buffer_kind` allocation. + +#### T4. Regression + +All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for +non-CCL benches. + +#### T5. Performance / overhead + +Single send/recv pair latency = (DMA latency) + (IPCQ overhead). +Should be close to a regular PE_DMA write of the same nbytes (IPCQ +overhead < 100 ns). + +### D14. Invariants and failure modes + +#### Invariants + +I1. **Slot lifecycle exactly-once**: one send → exactly one recv. +I2. 
**Pointer monotonicity**: `my_head` / `my_tail` monotonically + non-decreasing; `sender_seq` strictly increasing. +I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank + B, then rank B's reverse-direction peer must be rank A. Verified at + init. +I4. **`buffer_kind` consistency**: all PEs in a process group share + the same `buffer_kind` (no mixed mode in the first cut). +I5. **op_log ordering**: send → DMA complete → recv possible. The + t_start order in op_log respects this causality. +I6. **Atomic data + metadata visibility (MUST)**: at the receiver + side, data write (`MemoryStore.write`) and metadata forward + (`peer_head_cache` update) **must execute in the same SimPy step**. + No yield is allowed between the two operations in PE_DMA's vc_comm + handler. Code review must reject any inserted `yield` (or `yield + from`) — it would create a race where head_cache becomes visible + before or after the data. +I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6, + the step in which `peer_head_cache > my_tail` becomes truthy is the + same step in which the slot data is observable. + +#### Failure modes (runtime errors) + +F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction + → `IpcqInvalidDirection`, simulation aborts. +F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched + send and recv. Not validated by default; opt-in strict mode catches + it (`strict_validation: true` in a PE_IPCQ node's attrs). +F3. **Deadlock detection (timeout-based)**: the simulator empties its + schedule while a send/recv is still pending → engine raises + `IpcqDeadlock` and embeds a pointer dump. +F4. **Backend init failure**: missing `defaults.algorithm`, missing + `algorithms[name]`, module import failure, topology validation + failure (I3, I4) — all raised at `init_process_group` time. +F5. **Slot full + infinite backpressure**: the peer never recvs. + Surfaces as F3 timeout.
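
Invariant I3 (endpoint consistency) lends itself to a small standalone check. The sketch below is hypothetical: `REVERSE`, `ring_1d_neighbors`, and `check_endpoint_consistency` are illustrative names rather than the backend's API, and the ring mapping (`E = rank + 1`, `W = rank - 1`, modulo the world size) is assumed here for illustration. It shows the kind of validation that could run at `init_process_group` time and surface as an F4 init failure.

```python
# Assumed reverse-direction table for the four compass directions.
REVERSE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def ring_1d_neighbors(rank: int, world_size: int) -> dict[str, int]:
    """Bidirectional 1D ring: east is the next rank, west the previous one."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def check_endpoint_consistency(tables: dict[int, dict[str, int]]) -> list[str]:
    """Return one message per (rank, direction) pair that violates I3."""
    errors = []
    for rank, table in tables.items():
        for direction, peer in table.items():
            # The peer's reverse-direction entry must point back at this rank.
            back = tables[peer].get(REVERSE[direction])
            if back != rank:
                errors.append(
                    f"rank {rank} {direction} -> {peer}, "
                    f"but rank {peer} {REVERSE[direction]} -> {back}"
                )
    return errors


world_size = 4
tables = {r: ring_1d_neighbors(r, world_size) for r in range(world_size)}
ok_errors = check_endpoint_consistency(tables)

# Corrupt one entry: rank 0 now claims rank 2 as its east neighbor.
broken = {r: dict(t) for r, t in tables.items()}
broken[0]["E"] = 2
bad_errors = check_endpoint_consistency(broken)
```

A single corrupted entry is reported from both sides (rank 0's east link and rank 1's west link), which is usually enough to locate the offending `neighbors()` override.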
+ +#### Diagnostics + +- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as + `(rank, t, dir, nbytes)`. +- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)` + prints every PE_IPCQ ring buffer's `my_head`, `my_tail`, + `peer_head_cache`, `peer_tail_cache`. +- **Deadlock dump**: on hang the engine includes the pointer dump in + the `IpcqDeadlock` exception message. + +### D15. Algorithm-author cheat sheet + +Full step-by-step lives in +[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The +shortest version: + +| Things you touch | Things you don't | +|------------------|-------------------| +| `src/kernbench/ccl/algorithms/.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code | +| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework | +| (Optional) `tests/test_.py` mock test | PE_IPCQ component, AhbmCCLBackend | + +5-step flow: write the kernel → register in `ccl.yaml` → optional +`neighbors` override → optional mock unit test → SimPy validation via +`kernbench run --bench ccl_allreduce --verify-data`. + +Common mistakes: using a direction that wasn't installed, sends +without matching recvs (deadlock), dtype/shape disagreement, assuming +fairness from `tl.recv()` round-robin, confusing +`tl.num_programs(axis)` with the CCL group size. --- @@ -1188,7 +830,7 @@ contract via Python/SimPy equivalents in [pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and [pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py). -### D16. Proposed HW Block Diagram and End-to-End Dataflow +### D16. Proposed HW block diagram and end-to-end dataflow ![PE Baseline Architecture](../diagrams/pe_baseline.png) @@ -1198,14 +840,14 @@ contract via Python/SimPy equivalents in > Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`. 
-**Baseline → Proposed 핵심 변경**: +**Baseline → Proposed key changes**: -- 단일 FIFO inbox → **compute port / IPCQ port 분리 + WRR Arbiter** (NEW) +- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW) - PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic) -- TCM 내 **IPCQ Slot Region 예약 영역** 명시 -- Credit Injector / Receiver가 Fabric Port를 통해 NoC에 직접 연결 +- **IPCQ Slot Region reserved area** within TCM +- Credit Injector / Receiver connect directly to the NoC via the Fabric Port -#### End-to-End Sequence (HW view) +#### End-to-end sequence (HW view) ```mermaid sequenceDiagram @@ -1257,15 +899,15 @@ sequenceDiagram Note over IPCQ_A: Match dst_rx_base_pa → direction "E"
peer_tail_cache["E"] = consumer_seq
Backpressure deassert (if stalled) ``` -### D17. IPCQ Controller HW Module (신규) +### D17. IPCQ Controller HW Module (NEW) -PE_CPU와 DMA Engine 사이에 위치하는 하드웨어 제어 블록. 시뮬레이터의 -`PeIpcqComponent`에 대응한다. +The hardware control block sitting between PE_CPU and the DMA Engine. +Corresponds to the simulator's `PeIpcqComponent`. #### QPair Register File -방향별 queue pair 상태를 flip-flop으로 유지. PE_CPU가 MMIO(CSR)로 읽기/쓰기 -가능하며, init 시점에 소프트웨어가 채워넣는다. +Per-direction queue-pair state held in flip-flops. The PE_CPU reads / +writes them via MMIO (CSRs); software populates them at init time. ``` Per-direction registers (each 64-bit): @@ -1275,12 +917,12 @@ Per-direction registers (each 64-bit): peer_tail_cache — last known peer tail (updated by Credit Receiver) rx_base_pa — this PE's rx buffer base physical address peer_rx_base_pa — peer's rx buffer base physical address - n_slots — ring depth (power-of-2 제약, D21 참조) + n_slots — ring depth (power-of-2 constraint, see D21) slot_size — bytes per slot - peer_credit_tgt — peer PE의 credit receive 주소 + peer_credit_tgt — peer PE's credit-receive address -Directions: 최대 8 (N/S/E/W/parent/child_left/child_right + spare) -Total: 8 dirs × 9 regs × 8B = 576B flip-flops +Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare) +Total: 8 dirs × 9 regs × 8 B = 576 B of flip-flops ``` #### Slot Address Generator (combinational) @@ -1292,9 +934,9 @@ Output: slot_addr = base_pa + (pointer % n_slots) * slot_size Implementation: n_slots power-of-2 → pointer & (n_slots - 1) (AND mask, 1 gate) slot_size power-of-2 → barrel shift (1 cycle) - 64-bit add → ripple/kogge-stone adder (1 cycle) + 64-bit add → ripple / Kogge-Stone adder (1 cycle) -Latency: 1-2 cycles combinational +Latency: 1–2 combinational cycles ``` #### Backpressure Comparator (combinational) @@ -1309,8 +951,8 @@ Latency: 1 cycle #### Meta Extractor (inbound datapath sideband) -DMA Engine의 inbound vc_comm path에 wired. 도착하는 IPCQ flit의 header에서 -metadata를 추출하여 queue pair 상태를 갱신한다. 
+Wired into the DMA Engine's inbound vc_comm path. Extracts metadata +from arriving IPCQ flit headers and updates queue-pair state. ``` Trigger: DMA inbound write completion (same cycle) @@ -1322,24 +964,24 @@ Direction matching (ADR-0025 D2): 8× parallel range comparators + priority encoder Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1) -Output: recv_wake signal → PE_CPU interrupt/flag -Latency: 1 cycle (pipelined with DMA write — I6 atomicity 자연 보장) +Output: recv_wake signal → PE_CPU interrupt / flag +Latency: 1 cycle (pipelined with the DMA write — I6 atomicity is intrinsic) ``` #### Credit Injector (outbound) ``` -Trigger: recv completion (my_tail 증가 후) -Action: pack 16B credit packet → DMA vc_comm (또는 dedicated credit VC) +Trigger: recv completion (after my_tail increments) +Action: pack a 16 B credit packet → DMA vc_comm (or a dedicated credit VC) Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa} -Latency: 1 cycle to generate, then NoC traversal +Latency: 1 cycle to generate; then NoC traversal ``` #### Credit Receiver (inbound sideband) ``` -Trigger: 16B credit packet arrival (from NoC) +Trigger: 16 B credit packet arrival (from NoC) Extract: {consumer_seq, dst_rx_base_pa} Direction matching (ADR-0025 D3): @@ -1351,67 +993,70 @@ Output: send_wake signal → deassert backpressure stall Latency: 1 cycle ``` -### D18. DMA Engine vc_comm IPCQ-aware Mode +### D18. DMA Engine vc_comm IPCQ-aware mode -기존 vc_comm 채널(D8)에 IPCQ flit 처리 모드를 추가한다. +Add IPCQ-flit handling to the existing vc_comm channel (D8). **Outbound**: -1. IPCQ Controller로부터 command 수신: `{src_addr, dst_addr, nbytes, sender_seq}` -2. TCM에서 src_addr read → DMA read buffer에 snapshot (standard DMA behavior) -3. Flit pack: data + piggyback metadata (sender_seq, dst_addr) -4. NoC fabric port에 inject -5. Fire-and-forget (completion 미대기) +1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`. +2. 
Read `src_addr` from TCM → snapshot into the DMA read buffer (standard DMA behavior). +3. Pack flit: data + piggyback metadata (`sender_seq`, `dst_addr`). +4. Inject into the NoC fabric port. +5. Fire-and-forget (no completion wait). **Inbound**: -1. NoC로부터 IPCQ flit 수신 -2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`) -3. Slot write latency charge (backing memory tier) +1. Receive an IPCQ flit from the NoC. +2. Charge terminal BW drain (`drain_ns = nbytes / bottleneck_bw`). +3. Charge slot write latency (per backing memory tier). 4. **ATOMIC** (same pipeline stage, no stall insertion): - - TCM write: data → slot address - - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller -5. Done + - TCM write: data → slot address. + - Meta Extractor trigger: `sender_seq` + `dst_addr` → IPCQ Controller. +5. Done. -**I6 atomicity 하드웨어 보장**: TCM write completion과 Meta Extractor trigger가 -동일 pipeline stage에서 발생하므로 별도 synchronization이 불필요. 시뮬레이터의 -"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" (D9, I6)이 -자연스럽게 보장된다. +**I6 atomicity guaranteed in hardware**: TCM write completion and Meta +Extractor trigger occur in the same pipeline stage, so no separate +synchronization is needed. The simulator's "no SimPy yield between +`MemoryStore.write` and `IpcqMetaArrival` put" (D9, I6) is preserved +naturally. -#### Data Snapshot Semantics +#### Data snapshot semantics -DMA read buffer에 latch된 데이터는 src memory의 이후 수정에 영향받지 않는다. -이는 DMA standard read-then-write behavior이므로 추가 HW 불필요. +Data latched into the DMA read buffer is unaffected by subsequent +writes to `src` memory. This is standard DMA read-then-write +behavior; no extra HW is required. -#### Credit Virtual Channel (선택적) +#### Credit virtual channel (optional) -- **옵션 A**: vc_comm에 credit을 multiplexing (16B header-only flit으로 구분). -- **옵션 B**: 3rd dedicated credit VC 추가 (strict priority > data). +- **Option A**: multiplex credits onto vc_comm (distinguish via 16 B + header-only flits). 
+- **Option B**: add a third dedicated credit VC (strict priority > data). -옵션 B가 deadlock prevention에 유리하나, 16B credit의 BW 영향이 무시 가능하므로 -옵션 A로도 충분. +Option B is friendlier to deadlock prevention, but a 16 B credit's BW +impact is negligible, so Option A suffices. -### D19. Fabric Flit Format Extension +### D19. Fabric flit format extension ``` -일반 data flit (예: 512-bit): +Generic data flit (e.g. 512-bit): ┌──────────────────────────────────────────┐ │ [511:480] routing header (32b) │ -│ [479:0] payload (480b = 60B) │ +│ [479:0] payload (480b = 60 B) │ └──────────────────────────────────────────┘ -IPCQ data flit (첫 flit에만 metadata 포함): +IPCQ data flit (only the first flit carries metadata): ┌──────────────────────────────────────────┐ │ [511:480] routing header (32b) │ -│ [511] ipcq_flag (1b) │ ← IPCQ vs normal DMA 식별 +│ [511] ipcq_flag (1b) │ ← IPCQ vs. normal DMA │ [510:509] vc_id (2b) │ │ [508:480] route + hop count │ │ [479:416] ipcq_metadata (64b) │ ← piggyback │ [479:448] sender_seq (32b) │ -│ [447:416] dst_addr[31:0] (32b) │ ← direction matching용 -│ [415:0] payload (416b = 52B) │ +│ [447:416] dst_addr[31:0] (32b) │ ← used for direction match +│ [415:0] payload (416b = 52 B) │ └──────────────────────────────────────────┘ -후속 flits: full 60B payload (metadata 없음) +Subsequent flits: full 60 B payload (no metadata). Credit-only flit (128-bit, header-only): ┌──────────────────────────────────────────┐ @@ -1422,227 +1067,242 @@ Credit-only flit (128-bit, header-only): └──────────────────────────────────────────┘ ``` -첫 flit의 payload가 60B → 52B로 감소 (13% overhead). Multi-flit transfer에서는 -후속 flit이 full payload이므로 대형 전송에서 overhead < 1%. +First-flit payload shrinks from 60 B to 52 B (13 % overhead). For +multi-flit transfers the subsequent flits carry full payloads, so +overhead < 1 % on large transfers. -### D20. TCM IPCQ Slot Region Layout +### D20. 
TCM IPCQ slot region layout ``` -TCM Memory Map (16MB): +TCM Memory Map (16 MB): ┌─────────────────────────────┐ 0x000000 │ Kernel Working Memory │ │ (compute tensors) │ -│ ~14MB │ +│ ~14 MB │ ├─────────────────────────────┤ 0xE00000 │ IPCQ RX Buffers │ │ Dir N: slots × slot_size │ │ Dir S: slots × slot_size │ │ Dir E: slots × slot_size │ │ Dir W: slots × slot_size │ -│ ~1MB │ +│ ~1 MB │ ├─────────────────────────────┤ 0xF00000 │ IPCQ Metadata / Scratch │ -│ ~1MB │ +│ ~1 MB │ └─────────────────────────────┘ 0xFFFFFF ``` -IPCQ region을 TCM의 상위 bank에 배치하여 compute access와의 bank conflict를 -최소화한다 (Risk D22 참조). +Place the IPCQ region in the upper TCM bank to minimize bank conflict +with compute accesses (see Risk D22). -### D21. 2nm Implementation Analysis +### D21. 2 nm implementation analysis -#### Area Estimate +#### Area estimate -| Module | Gate Count | Area (2nm est.) | Notes | +| Module | Gate count | Area (2 nm est.) | Notes | |---|---|---|---| -| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops | -| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational | -| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators | -| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **PE 전체 대비 < 0.1%** | -| DMA vc_comm 확장 | ~2K gates | 0.002 mm² | Flit pack/unpack | -| **Total 변경분** | **~14.6K** | **~0.006 mm²** | | +| QPair Register File | ~4.6 K FF | 0.002 mm² | 576 B of flip-flops | +| Slot Addr Gen + Backpressure | ~5 K gates | 0.001 mm² | Combinational | +| Meta Extractor + Credit Logic | ~3 K gates | 0.001 mm² | 8× parallel comparators | +| **IPCQ Controller subtotal** | **~12.6 K** | **~0.004 mm²** | **< 0.1 % of the PE area** | +| DMA vc_comm extension | ~2 K gates | 0.002 mm² | Flit pack / unpack | +| **Total delta** | **~14.6 K** | **~0.006 mm²** | | #### Timing -| Path | Delay (2nm est.) | Target Clock | Margin | +| Path | Delay (2 nm est.) 
| Target clock | Margin | |---|---|---|---| | Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× | | Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× | | Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× | | Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× | -모든 critical path가 1 cycle 이내. Timing closure 문제 없음. +All critical paths fit within one cycle. Timing closure is not a +concern. #### Power -- Active: ~1 mW (register R/W + comparators, send/recv 동작 시) -- Idle: leakage only -- PE 전체 전력 대비 무시 가능 +- Active: ~1 mW (register R/W + comparators while sending / receiving). +- Idle: leakage only. +- Negligible vs. total PE power. #### Constraints -| 항목 | 제약 | 근거 | +| Item | Constraint | Rationale | |---|---|---| -| `n_slots` | **반드시 power-of-2** | mod → AND mask (1 gate). 임의 값은 divider 필요 (~10 cycles) | -| `slot_size` | **power-of-2 권장** | mul → barrel shift. 임의 값은 multiplier 필요 | -| TCM IPCQ region | **전용 bank 배치** | Compute access와 bank conflict 방지 | +| `n_slots` | **must be power-of-2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles). | +| `slot_size` | **power-of-2 recommended** | mul → barrel shift. Arbitrary values need a multiplier. | +| TCM IPCQ region | **dedicated bank** | Prevents bank conflict with compute accesses. | -### D22. Risk Assessment +### D22. Risk assessment -#### TCM Bank Conflict +#### TCM bank conflict -- **Risk**: IPCQ slot write와 compute read가 동일 bank 접근 시 stall -- **Mitigation**: IPCQ region을 TCM 상위 address의 전용 bank에 배치 (D20) -- **Cost**: TCM banking flexibility 소폭 감소 -- **Severity**: Medium (성능 영향), Low (correctness 문제 아님) +- **Risk**: IPCQ slot write and compute read both target the same TCM + bank → stall. +- **Mitigation**: place the IPCQ region in a dedicated upper-address + bank (D20). +- **Cost**: a small loss of TCM banking flexibility. +- **Severity**: Medium (performance), Low (no correctness issue). 
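
The mitigation above relies on every IPCQ slot address staying inside the dedicated upper region from D20. The following is a minimal sketch of that check, using the mask-and-shift form of the Slot Address Generator from D17. The region bounds come from the D20 memory map; the 32 KB slot size, the four-direction layout, and the helper names are assumptions made for illustration.

```python
IPCQ_RX_BASE = 0xE00000  # start of the IPCQ RX region (D20 memory map)
IPCQ_RX_END = 0xF00000   # start of the metadata / scratch region


def is_pow2(x: int) -> bool:
    """True when x is a positive power of two."""
    return x > 0 and (x & (x - 1)) == 0


def slot_addr(base_pa: int, pointer: int, n_slots: int, slot_size: int) -> int:
    """base_pa + (pointer % n_slots) * slot_size, without divide or multiply."""
    if not (is_pow2(n_slots) and is_pow2(slot_size)):
        raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)       # modulo as an AND mask
    shift = slot_size.bit_length() - 1    # log2(slot_size)
    return base_pa + (index << shift)     # multiply as a shift


# Assumed configuration: 4 directions x 8 slots x 32 KB = 1 MB in total.
N_SLOTS = 8
SLOT_SIZE = 32 * 1024
bases = {
    d: IPCQ_RX_BASE + i * N_SLOTS * SLOT_SIZE
    for i, d in enumerate("NSEW")
}

# Walk each ring past the wrap-around point and collect every slot address.
addrs = [
    slot_addr(bases[d], p, N_SLOTS, SLOT_SIZE)
    for d in "NSEW"
    for p in range(20)
]
in_region = all(IPCQ_RX_BASE <= a and a + SLOT_SIZE <= IPCQ_RX_END for a in addrs)
```

With these assumed sizes the four rings fill the 1 MB region exactly, so a larger `slot_size` or `n_slots` would spill into the metadata region.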
-#### Credit Return Latency under Congestion +#### Credit return latency under congestion -- **Risk**: NoC 혼잡 시 credit return 지연 → sender backpressure stall +- **Risk**: NoC congestion → credit-return delay → sender backpressure + stall. - **Mitigation**: - - Credit을 별도 VC로 분리 + strict priority (16B로 BW impact 미미) - - 또는 n_slots를 넉넉히(8+) 설정하여 credit 지연을 buffer로 흡수 -- **Severity**: Low (credit 16B는 congestion에 거의 기여하지 않음) + - Put credits on a separate VC with strict priority (16 B → + negligible BW impact). + - Or pick `n_slots` generously (8+) so credit delay is absorbed by + buffer depth. +- **Severity**: Low (16 B credits contribute almost nothing to + congestion). -#### Inter-Direction Ordering +#### Inter-direction ordering -- **Risk**: 같은 PE에서 여러 방향으로 동시 send 시 순서 -- **Mitigation**: Per-direction monotonic seq으로 충분. Inter-direction ordering은 - kernel(소프트웨어) 책임 — 현재 시뮬레이터 모델과 동일 (D2 + D4) -- **Severity**: Low (아키텍처 설계에 의해 해소) +- **Risk**: simultaneous sends from one PE on multiple directions. +- **Mitigation**: per-direction monotonic `sender_seq` suffices. + Inter-direction ordering is the kernel's (software's) + responsibility — same as the simulator model (D2 + D4). +- **Severity**: Low (resolved by design). -### D23. HW Alternatives Considered +### D23. 
HW alternatives considered -#### Doorbell + Polling (전통적 방식) +#### Doorbell + polling (traditional) ``` -Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell -Recv: Polling loop on doorbell, or interrupt-driven +Send: DMA write data → DMA write a doorbell register at the peer → peer polls doorbell +Recv: polling loop on the doorbell, or interrupt-driven ``` -| 장점 | 단점 | +| Pros | Cons | |---|---| -| 단순한 HW (IPCQ controller 불필요) | 2번의 DMA transaction (data + doorbell) | -| 기존 DMA 재사용 | Data/doorbell 사이 ordering 보장 필요 (fence) | -| | Polling은 전력 낭비, interrupt는 latency overhead | +| Simple HW (no IPCQ controller) | Two DMA transactions (data + doorbell) | +| Reuses existing DMA | Needs explicit fence between data and doorbell | +| | Polling burns power; interrupt adds latency | -**평가**: Piggyback 대비 latency 2-3× 증가. **불채택.** +**Verdict**: 2–3× latency vs. piggyback. **Rejected.** -#### Hardware Message Queue (NVIDIA NVLink 스타일) +#### Hardware message queue (NVIDIA NVLink style) ``` -Send: CPU → HMQ에 descriptor push → HW가 peer HMQ로 자동 전달 -Recv: HMQ에서 descriptor pop → data pointer 확인 +Send: CPU → push a descriptor onto HMQ → HW relays it to the peer HMQ +Recv: pop a descriptor from HMQ → use the data pointer ``` -| 장점 | 단점 | +| Pros | Cons | |---|---| -| CPU는 descriptor만 작성 | 별도 HMQ engine 필요 (~0.05 mm²) | -| Descriptor/data 분리 → 유연 | DMA와 별개 datapath → area/power 중복 | -| | Large tensor에는 결국 DMA 필요 | +| CPU only writes descriptors | Needs a separate HMQ engine (~0.05 mm²) | +| Descriptor / data separation is flexible | Separate datapath from DMA → area / power overlap | +| | Large tensors still need DMA | -**평가**: CCL의 large tensor 패턴에서 DMA 필수이므로 HMQ + DMA 이중 구조는 -면적 낭비. **불채택.** +**Verdict**: With CCL's large-tensor pattern, DMA is still required, +so HMQ + DMA is a duplicated datapath. 
**Rejected.** -#### RDMA-style Completion Queue (CQ) +#### RDMA-style completion queue (CQ) ``` -Send: DMA write → peer에 CQE 자동 생성 -Recv: CQ poll/interrupt → data 위치 확인 +Send: DMA write → CQE auto-posted at the peer +Recv: CQ poll / interrupt → read data location ``` -| 장점 | 단점 | +| Pros | Cons | |---|---| -| InfiniBand/RoCE 성숙 모델 | CQ 관리 logic + CQE memory overhead | -| Multi-tenant/isolation 용이 | CQE/data ordering 보장 추가 필요 | -| | PE-to-PE CCL에는 over-engineered | +| Mature InfiniBand / RoCE model | CQ management logic + CQE memory overhead | +| Good multi-tenant isolation | CQE / data ordering needs extra plumbing | +| | Over-engineered for PE-to-PE CCL | -**평가**: RDMA CQ는 host-facing NIC의 multi-tenant 격리에 적합. -PE 간 단일 owner 환경에서는 불필요한 복잡성. **불채택.** +**Verdict**: RDMA CQ is suited to host-facing NICs with multi-tenant +isolation. For single-owner PE-to-PE this is needless complexity. +**Rejected.** -#### Credit-in-Data Piggyback (v2 최적화 후보) +#### Credit-in-data piggyback (v2 optimization candidate) -현재 설계에서 credit return은 별도 16B packet이다. Bidirectional 통신 -패턴에서는 **reverse 방향 data flit에 credit을 합칠 수 있다.** +In the current design the credit return is a separate 16 B packet. +For bidirectional traffic patterns, **the credit can be folded into a +reverse-direction data flit**. ``` PE_A →E→ PE_B: data + sender_seq=3 -PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit이 data에 합쳐짐 +PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit folded into data ``` -| 장점 | 단점 | +| Pros | Cons | |---|---| -| Credit 전용 packet 제거 → NoC BW 절약 | Unidirectional 패턴에서는 fallback 필요 | -| Bidirectional allreduce에서 credit latency → 0 | Flit header에 8B 추가 (overhead 미미) | -| | Logic 복잡도 소폭 증가 | +| Removes the dedicated credit packet → NoC BW savings | Needs fallback for unidirectional patterns | +| Bidirectional allreduce: credit latency → 0 | +8 B in the flit header (negligible) | +| | Slightly more logic complexity | -**평가**: 현재 설계의 우수한 최적화. 
Bidirectional allreduce에서 credit packet을 -완전 제거 가능. Standalone credit fallback도 유지. **v2로 채택 권고.** +**Verdict**: A strong optimization. Eliminates the credit packet for +bidirectional allreduce; the standalone credit fallback is retained. +**Recommended for v2.** -### Open HW Questions +### Open HW questions -- IPCQ slot region size를 TCM의 몇 %까지 허용할 것인가? (현재 가정: ~1MB / 16MB = 6.25%) -- Credit VC를 별도로 둘 것인가, vc_comm에 multiplexing할 것인가? (D18 참조) -- Inter-SIP link에서의 flit format 호환성 검증 필요 -- n_slots 최대값 제한? (8 directions × 8 slots × 64KB = 4MB → TCM의 25%) +- What fraction of TCM may the IPCQ slot region occupy? (Current + assumption: ~1 MB / 16 MB = 6.25 %.) +- Dedicated credit VC vs. vc_comm multiplexing? (See D18.) +- Inter-SIP link flit-format compatibility verification. +- Maximum `n_slots`? (8 directions × 8 slots × 64 KB = 4 MB → 25 % of + TCM.) --- ## Non-goals -- **호스트 collective**: `dist.all_reduce`가 데이터 이동을 직접 수행하는 모델은 - 본 ADR 범위 외. 본 ADR은 PE 커널 안에서 일어나는 통신만 다룬다. -- **All-reduce 알고리즘**: ring/tree 등 알고리즘 자체는 별도 ADR (또는 커널 - 코드)에서 다룬다. 본 ADR은 인프라(IPCQ + VC)만 정의. -- **Reliability/error handling**: send/recv 실패, link 장애 등은 다루지 않음. -- **NoC arbiter 정밀 모델**: VC 간 dynamic contention은 첫 구현 범위 외 (D8). +- **Host collective**: a model where `dist.all_reduce` itself moves + data on the host side is out of scope. This ADR only covers + communication that happens inside the PE kernel. +- **All-reduce algorithms**: ring / tree / etc. live in algorithm + modules and can be added without amending this ADR. +- **Reliability / error handling**: link faults, send/recv failure + recovery, etc. are out of scope. +- **NoC arbiter precision**: dynamic VC contention is left for a future + ADR (see D8). --- -## Open Questions +## Open questions -- **VC arbitration 정확도**: 첫 구현은 deterministic chunk interleave + - weighted round-robin. heavy contention 시나리오에서 실제보다 optimistic한 - 결과가 나올 수 있음. 정밀화 필요 시 별도 NoC arbiter component 도입을 검토. 
-- **Credit return fast path BW 모델**: 첫 구현은 fast path가 fabric BW - contention 모델에서 제외 (별도 lightweight wire 가정). 정밀화 필요 시 - credit fast path를 별도 link로 모델링하거나, `credit_return_mode: piggyback` - 옵션 추가. -- **Ring buffer slot의 메모리 할당**: TCM/HBM/SRAM 어디에 두든 IPCQ가 알아야 - 할 metadata (base addr, slot_size, n_slots). init 시 호스트가 사이드밴드로 - 넣을지, fabric MmuMapMsg와 유사한 메시지로 넣을지 결정 필요. -- **VC 간 BW 분할 default**: 균등 분할(50/50)인지, weighted(예: 80% compute, - 20% comm)인지. ccl.yaml에 노출하되 default 값 결정 필요. -- **Direction 개수**: 4방향(N/S/E/W) 고정인지, 6방향(+ Up/Down for 3D), - 또는 가변 N개로 확장할지. 첫 구현은 4방향 고정. -- **다중 channel 데이터 구조 (multi-tile aggregation)**: 한 collective에서 - 여러 tile을 fan-out 받는 경우 기존 round-robin recv로 충분한지, 별도 - primitive(`tl.recv_all`)가 필요한지. -- **Round-robin recv fairness**: 첫 구현은 last_polled_dir 인덱스 기반 weak - fairness. 한 방향에 데이터가 항상 먼저 도착하면 starvation 가능. strict - fairness가 필요하면 별도 fairness counter 추가. -- **Deadlock detection 정밀화**: 첫 구현은 timeout 기반. 향후 wait-for graph - 실시간 추적으로 deterministic deadlock detection 가능. +- **VC arbitration accuracy** — the first cut uses deterministic + chunk interleave + weighted round-robin; heavy contention may report + optimistic latency. A NoC arbiter component can be added later. +- **Credit return BW model** — the fast path is currently outside the + fabric BW contention model. Can be modeled as a separate link or + switched to piggyback (`credit_return_mode: piggyback`). +- **Ring buffer slot allocation metadata** — whether the host pushes + IPCQ buffer metadata via sideband or via a fabric message similar to + `MmuMapMsg` is open. +- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in + `ccl.yaml`; default value TBD. +- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6 + (with Up/Down for 3D) or N (variable) is future work. +- **Multi-tile aggregation primitives** — whether + `tl.recv_all` or similar is needed for fan-in. 
+- **Round-robin recv fairness** — current weak fairness can starve; + strict fairness counter is future work. +- **Deadlock detection precision** — currently timeout-based; a + realtime wait-for graph would enable deterministic detection. --- ## Consequences -### 긍정적 +### Positive -- PE 간 직접 통신 가능 → CCL 커널 작성 가능 -- 호스트는 launch만, 동기화는 PE 안에서 → 단순한 호스트 코드, 강한 - compute/comm overlap -- VC를 통해 HoL blocking 제거 → collective latency가 compute traffic에 - block되지 않음 -- Buffer 위치/backpressure 모드를 init 파라미터로 선택 가능 → 벤치마크 가능 -- 4-방향 logical neighbor → 호스트가 ring/mesh/tree 등 알고리즘 자유롭게 - 매핑 +- PE-to-PE direct communication enables CCL kernels to be written. +- Host stays minimal (just `launch`), synchronization happens inside + the PE → strong compute / comm overlap. +- VCs eliminate HoL blocking → collective latency is not blocked by + compute traffic. +- Buffer placement and backpressure mode are init-time parameters → + easy to benchmark. +- Four-direction logical neighbors → host is free to map + ring/mesh/tree algorithms. -### 부정적 +### Negative -- 컴포넌트 1개 신규 추가 (PE_IPCQ), PE_DMA 재설계 (VC 추가) -- IPCQ 메모리 (8 ring × slot_size × n_slots) 만큼 PE-local 메모리 사용 -- VC arbitration 모델이 first-order approximation이므로 heavy contention - 시나리오에서 실제 HW보다 약간 optimistic한 latency 결과 가능 (D8 한계) -- VC chunk-level 인터리브로 PE_DMA 구현이 더 복잡해짐 +- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs). +- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE. +- VC arbitration is a first-order approximation; heavy contention + scenarios may report slightly optimistic latency vs real HW (D8). +- Chunk-level interleave makes PE_DMA implementation more complex. 
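
The memory cost listed under the negative consequences can be checked with a short calculation. This sketch reproduces the two figures quoted in the open HW questions (4 MB, or 25 %, for the worst case, and 6.25 % for the ~1 MB assumption); `ipcq_bytes` is a hypothetical helper, and the 16 MB TCM size is taken from the D20 memory map.

```python
TCM_BYTES = 16 * 1024 * 1024  # 16 MB TCM, per the D20 memory map


def ipcq_bytes(n_rings: int, n_slots: int, slot_size: int) -> int:
    """IPCQ memory cost = rings x n_slots x slot_size (bytes per PE)."""
    return n_rings * n_slots * slot_size


# Worst case from the open HW questions: 8 directions x 8 slots x 64 KB.
worst_case = ipcq_bytes(8, 8, 64 * 1024)
worst_fraction = worst_case / TCM_BYTES

# Current assumption from the same list: roughly 1 MB reserved.
assumed_fraction = (1 * 1024 * 1024) / TCM_BYTES

print(f"worst case: {worst_case // (1024 * 1024)} MB ({worst_fraction:.2%})")
print(f"current assumption: {assumed_fraction:.2%}")
```

The gap between the two fractions is what the open question on a maximum `n_slots` is about: the worst-case configuration uses four times the currently reserved region.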
diff --git a/docs/adr/ADR-0024-par-sip-tp-launcher.md b/docs/adr/ADR-0024-par-sip-tp-launcher.md index b321e84..52bd9b4 100644 --- a/docs/adr/ADR-0024-par-sip-tp-launcher.md +++ b/docs/adr/ADR-0024-par-sip-tp-launcher.md @@ -6,43 +6,46 @@ Accepted ## Context -### 목표 +### Goal -`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device) -경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이** -읽히는 bench 코드를 목표로 한다. +Align the participation unit (rank) of `torch.distributed` collective calls +to the **SIP** (device) boundary. The aim is bench code that, at the host +level, reads **indistinguishably** from real PyTorch DDP/TP scripts. -real PyTorch와 비교: +Comparison with real PyTorch: -| 차원 | real PyTorch | KernBench | +| Dimension | real PyTorch | KernBench | | --- | --- | --- | -| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP | -| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 | -| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 | +| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each | +| `get_rank()` | `RANK` env var | greenlet-local registry | +| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology | | `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP | -| `mp.spawn` | OS 프로세스 fork | greenlet fan-out | +| `mp.spawn` | OS process fork | greenlet fan-out | -### 풀어야 할 문제 +### Problems to solve -1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록. -2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각 - worker greenlet이 자기 rank / 자기 SIP를 정확히 식별. -3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면 - 기본 텐서 배치도 구조적 좌표로 표현되어야 함. +1. **Public API where rank = SIP** — so bench workers do not have to know + about the PE concept. +2. **Greenlet-local rank/device tracking** — within the 1-process model, + each worker greenlet must correctly identify its own rank / its own SIP. +3. 
**Tensor placement = structural (sip, cube, pe)** — if rank is SIP, + the default tensor placement should also be expressed in structural + coordinates. -### Non-problem (이 ADR 밖) +### Non-problem (outside this ADR) - IPCQ direction addressing → ADR-0025 -- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026 +- Removing `DPPolicy.sip`/`num_sips` → ADR-0026 - Megatron-style TP → ADR-0027 - DTensor → ADR-0028 (future) - Worker scheduling / `mp.spawn` / collective drain / exception cleanup → ADR-0027 D0/D1 -- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032 +- Collective algorithm implementation (intercube_allreduce, SFR config) + → ADR-0032 ## Decision -### D1. rank = SIP (world_size 해석) +### D1. rank = SIP (world_size resolution) ```python def _resolve_world_size(self) -> int: @@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int: return int(spec.get("system", {}).get("sips", {}).get("count", 1)) ``` -우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml` -override는 legacy "rank = PE" 테스트 경로로 유지. +Priority order: algorithm override > defaults override > SIP count. The +`ccl.yaml` override is retained as the legacy "rank = PE" test path. ### D2. Greenlet-local rank registry (+ debug warning) @@ -83,11 +86,11 @@ class DistributedContext: return int(self._rank_by_greenlet[g]) ``` -### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩 +### D3. `torch.ahbm.set_device(rank)` — SIP binding -KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는 -`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named -namespace를 사용한다. +The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses +`torch.cuda.set_device(r)`, but since we are not CUDA we use an +honestly-named namespace. ```python class _AhbmNamespace: @@ -113,10 +116,12 @@ class _AhbmNamespace: # Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`. 
``` -**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한 -`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`, -`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는 -코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다. +**PyTorch 2.x style parallel support**: Recent PyTorch is moving toward a +device-agnostic `torch.accelerator` namespace +(`torch.accelerator.set_device_index(r)`, +`torch.accelerator.current_device_index()`). To support users who want to +write code that is not tied to a specific device vendor, KernBench also +exposes this surface in parallel. ```python class _AcceleratorNamespace: @@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace() self.accelerator = _AcceleratorNamespace(self.ahbm) # alias ``` -Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유: +Bench authors may choose either — both share the same registry internally: ```python torch.ahbm.set_device(rank) # KernBench-native, explicit backend torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic ``` -### D4. Tensor placement = structural (sip, cube, pe) 좌표 +### D4. Tensor placement = structural (sip, cube, pe) coordinates -`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성. -세부는 ADR-0026. +`resolve_dp_policy` takes `target_sip` directly and produces placement in +structural coordinates. Details in ADR-0026. ```python # RuntimeContext._create_tensor current_sip = self.ahbm.current_device() # (D3 naming) if current_sip is None: - current_sip = 0 # single-driver fallback (D2와 일관) + current_sip = 0 # single-driver fallback (consistent with D2) placement = resolve_dp_policy( dp, shape=shape_2d, itemsize=itemsize, num_pe=eff_num_pe, num_cubes=eff_num_cubes, @@ -165,29 +170,29 @@ placement = resolve_dp_policy( ) ``` -Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적 -좌표를 직접 보유. ShardSpec 상세는 ADR-0026. +No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)` +structural coordinates directly. 
ShardSpec details in ADR-0026. --- ## Dependencies -- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원. -- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와 - ShardSpec의 구조적 좌표 표현. -- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`, - collective drain, exception cleanup의 구현 기준. +- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace. +- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature + used by D4 and the structural-coordinate representation of ShardSpec. +- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for + worker scheduling, `mp.spawn`, collective drain, and exception cleanup. --- ## Non-goals -- **IPCQ protocol 수정**: ADR-0023 유지. -- **DPPolicy 필드 정리**: ADR-0026. +- **Modifying the IPCQ protocol**: ADR-0023 remains as-is. +- **Cleaning up DPPolicy fields**: ADR-0026. - **Megatron-style TP**: ADR-0027. - **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1. -- **Collective algorithm 구현**: ADR-0032. -- **Multi-node (프로세스 간)**: 단일 프로세스. +- **Collective algorithm implementation**: ADR-0032. +- **Multi-node (cross-process)**: single process only. --- @@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적 ### Positive -- **Bench = real PyTorch DDP** (공개 API 관점). -- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능. -- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이 - `(sip, cube, pe)` 3튜플 위에서 일관되게 동작. +- **Bench = real PyTorch DDP** (from the public-API point of view). +- **Greenlet-local rank**: enables cross-rank correctness within the + 1-process model. +- **Structural placement coordinates**: lets the other ADRs (ADR-0026 / + ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)` + 3-tuple. ### Neutral -- IPCQ PE-level protocol (ADR-0023) 불변. -- IO_CPU 역할 불변 (기존 transit 그대로). +- IPCQ PE-level protocol (ADR-0023) is unchanged. 
+- IO_CPU role is unchanged (existing transit behavior preserved). diff --git a/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md index 8a6afa0..ca2974d 100644 --- a/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md +++ b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md @@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped) ## Context -### 목표 +### Goal -ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을 -topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다. -2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는 -topology 일반)에서 정확히 동작하도록 한다. +In the IPCQ protocol of ADR-0023, make the **identification of "which +direction pair this transfer belongs to"** consistent and **address-based**, +without depending on topology / dict-order. It must work correctly in a +2-rank bidirectional ring (and more generally in any topology where +multiple directions point to the same peer). -### 드러난 버그 — 2-rank bidirectional ring +### The bug surfaced — 2-rank bidirectional ring -`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer. +`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions +point to the same peer. 
-**버그 1 (install)**: -- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite - direction convention) -- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정 -- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx) +**Bug 1 (install)**: +- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the + correct answer — opposite-direction convention) +- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")` +- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx) -**버그 2 (runtime)**: -- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이 - sender 좌표만으로 direction 매칭 → 첫 direction (E) 승 -- peer_head_cache[E] 증가, peer_head_cache[W]는 불변 -- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock +**Bug 2 (runtime)**: +- Even if install set up the correct address, the receiver's + `_handle_meta_arrival` matches direction by sender coordinates only → the + first direction (E) wins +- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged +- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → + IpcqDeadlock -### 근본 원인 +### Root cause -두 축에서 동일 문제: -1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가" - 결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때 - fragile -2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender - 좌표만으로 이루어짐 → direction 중복 시 ambiguous +The same issue along two axes: +1. **Install-time pairing**: deciding "which of my directions pairs with + which direction of the peer" depends on dict-iteration-order → fragile + when multiple directions point to the same peer +2. 
**Runtime identification**: deciding "which qp should be updated" is + based on sender coordinates alone → ambiguous when directions are + duplicated -### 해결 방향 — address-based matching +### Solution direction — address-based matching -각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa + -direction_idx × bytes_per_direction). 따라서: +Each PE's rx buffer sits at a **unique address range per direction** +(rx_base_pa + direction_idx × bytes_per_direction). Therefore: -- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous -- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운 - 대칭성) -- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of - truth** +- **Runtime**: match by **dst_addr range** instead of sender coord → + unambiguous +- **Install**: prefer the opposite direction as a heuristic (the natural + symmetry of ring / mesh) +- No need for redundant metadata like `peer_direction` — **address is the + single source of truth** -이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic -주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능. +This design works **independently of the PhysAddr transition (ADR-0030)**. +Whether the current addresses are synthetic or PhysAddr, the same approach +applies as long as the per-direction range uniqueness is preserved. --- @@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None: return None ``` -호출부: +Call site: ```python for d, peer_rank in nbrs.items(): - peer_dir = reverse_direction(r, peer_rank, d) # my_dir 전달 + peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir if peer_dir is None: continue ... ``` -### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭 +### D2. 
Runtime — `_handle_meta_arrival` dst_addr matching `src/kernbench/components/builtin/pe_ipcq.py`: @@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None: # Unknown dst_addr — diagnostic log (should not happen under correct install) ``` -Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정. +The sender-coordinate check is **removed**. `dst_addr` already determines +the direction. -### D3. Credit — `dst_rx_base_pa` 필드 추가 +### D3. Credit — add `dst_rx_base_pa` field `src/kernbench/common/ipcq_types.py`: @@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정. @dataclass(frozen=True) class IpcqCreditMetadata: consumer_seq: int - dst_rx_base_pa: int # NEW: 원 sender의 peer.rx_base_pa와 매칭용 - # 기존 필드 (diagnostic / log 용도로 유지) + dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa + # Existing fields (kept for diagnostic / logging purposes) src_sip: int src_cube: int src_pe: int src_direction: str ``` -Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를 -`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`). +When the credit is generated (`_delayed_credit_send`): it carries this +direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the +`peer.rx_base_pa` the other side used when it was the sender). -수신 측 (`_credit_worker`): +Receiver side (`_credit_worker`): ```python def _credit_worker(self, env): while True: credit = yield self._credit_inbox.get() for d, qp in self._queue_pairs.items(): - # peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기 + # Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa if qp["peer"].rx_base_pa == credit.dst_rx_base_pa: qp["peer_tail_cache"] = max(qp["peer_tail_cache"], credit.consumer_seq) @@ -178,41 +187,45 @@ def _credit_worker(self, env): break ``` -Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous. +Sender-coordinate check removed. Matching by `dst_rx_base_pa` is +unambiguous. -### D4. 
`IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음** +### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry` -ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**. -이유: -- Meta arrival은 dst_addr로 매칭 (D2) -- Credit은 dst_rx_base_pa로 매칭 (D3) -- qp에 peer_direction 저장 필요 없음 -- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`) +The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is +**unnecessary**. Reasons: +- Meta arrivals are matched by dst_addr (D2) +- Credits are matched by dst_rx_base_pa (D3) +- No need to store peer_direction on qp +- Install only uses peer_dir internally when computing rx_base_pa + (`reverse_direction`) -IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**. +No change to the IpcqInitEntry schema. **Simpler** than rev 1. -### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only) +### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only) -기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지: -- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)` -- Diagnostics: pointer_dump 등에서 direction 표시 -- 미래 확장 여지 +The existing `src_direction` field is not removed. It is retained for: +- Logging / trace: the `(rank, t, dir, nbytes)` output of + `KERNBENCH_CCL_TRACE=1` +- Diagnostics: showing direction in pointer_dump, etc. +- Room for future extension -Runtime matching은 `dst_addr`만 사용. +Runtime matching uses only `dst_addr`. -### D6. Invariants (ADR-0023 I3 강화) +### D6. Invariants (strengthens ADR-0023 I3) -**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my -rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은 -이를 보장해야 한다 (reverse_direction opposite-preference). +**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`, +my rx_base and peer rx_base must point to **distinct direction slots**. +Install must guarantee this (reverse_direction opposite-preference). 
-**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는 -서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지 -않음). 이것이 D2/D3의 주소-기반 매칭의 전제. +**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and +`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers +of different directions never overlap). This is the prerequisite for the +address-based matching of D2/D3. -Install time에 검증 가능: +Verifiable at install time: ```python -# ccl/install_plan.py: build_install_plans 끝에 assertion +# ccl/install_plan.py: assertion at the end of build_install_plans all_rx_ranges = set() for plan in plans: for pe_install in plan.pe_installs: @@ -228,36 +241,42 @@ for plan in plans: ## Dependencies -- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정 - (D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer - 변경은 없음. -- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가 - ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴. -- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의 - 주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동. +- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime + matching logic (D2, D3) and improves the install heuristic (D1). No + change to the IPCQ protocol's semantic layer. +- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is + actually used is the ws=SIP_count model of ADR-0024. This ADR makes that + case work. +- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's + address-based matching works identically whether the current addresses + are synthetic or PhysAddr. --- ## Non-goals -- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게 - 인코딩되는가와 무관. -- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지. -- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그 - 무관. +- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is + agnostic to how addresses are encoded. 
+- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023 + D5 still holds. +- **Unidir ring specialization**: `ring_1d_unidir` only has a single + direction, so the bug does not apply. --- ## Open questions -- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형 - 순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로 - 전환 가능 (`_qp_by_rx_base`). -- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴 - 필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지. -- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2. - 대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능. - 단순 구현 먼저. +- **Address-matching performance**: `_handle_meta_arrival` and + `_credit_worker` iterate qp linearly (max 4 directions). The performance + impact is negligible. If it becomes an issue, this can be switched to a + dict lookup (`_qp_by_rx_base`). +- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to + keep this field, which is only kept for diagnostics, or to split it out + of logging. Currently retained. +- **Cost of install-time invariant verification**: the I3.1 verification + of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies + → improvable via data structures such as interval trees. Simple + implementation first. --- @@ -265,19 +284,26 @@ for plan in plans: ### Positive -- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth. -- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작. -- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가. -- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관. -- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치. +- **Simplicity**: redundant `peer_direction` metadata removed. Address is + the single source of truth. +- **Unambiguous matching**: works on every topology (including duplicate + directions). +- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added + to `IpcqCreditMetadata`.
+- **Independent of PhysAddr transition (ADR-0030)**: address-based matching + is agnostic to the address encoding. +- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for + logging. ### Negative -- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌 - W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction - 이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함. +- Runtime matching is now by address comparison, so when debugging + questions like "why did peer_head_cache[W] update rather than [E]" one + has to follow the address range (previously the direction name was + enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in + pointer_dump. ### Neutral -- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는 - 불변. +- The semantic layer of the IPCQ protocol (sender computes dst_addr, + receiver receives) is unchanged. diff --git a/docs/adr/ADR-0026-par-dppolicy-intra-device.md b/docs/adr/ADR-0026-par-dppolicy-intra-device.md index d043f59..16c8170 100644 --- a/docs/adr/ADR-0026-par-dppolicy-intra-device.md +++ b/docs/adr/ADR-0026-par-dppolicy-intra-device.md @@ -1,4 +1,4 @@ -# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거 +# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields ## Status @@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail) ## Context -### 목표 +### Goal -`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한 -intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리 -(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel -layers가 담당). +Clarify `DPPolicy` as a pure intra-device abstraction that only expresses +**cube × PE distribution within a single device (SIP)**. Inter-SIP +distribution (TP) is split into a separate layer (handled by ADR-0024's +`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel +layers). ## Decision -### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거 +### D1. 
Remove `sip` + `num_sips` fields from `DPPolicy` ```python @dataclass(frozen=True) @@ -32,15 +33,16 @@ class DPPolicy: num_cubes: int | None = None ``` -제거되는 필드: `sip`, `num_sips`. +Removed fields: `sip`, `num_sips`. -### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거 +### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed -현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube × -pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태. +The current `ShardSpec.pe_index` is a **global flat index** +(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4 +flagged as "abstraction leakage". -본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는 -property로도 **남기지 않는다**: +This ADR **redefines ShardSpec in structural coordinates** and **does +not even leave `pe_index` as a property**: ```python # src/kernbench/policy/placement/dp.py (after) @@ -59,28 +61,32 @@ class ShardSpec: nbytes: int ``` -**핵심 원칙**: -- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플. -- **`pe_index` property도 없음** — silent semantics drift 차단. -- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시 - `AttributeError`** → 반드시 구조적 좌표로 migration. -- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가 - 명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산. +**Core principle**: +- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple. +- **No `pe_index` property either** — blocks silent semantics drift. +- Existing callers expecting global-flat get an **immediate + `AttributeError`** on `.pe_index` access → forced migration to + structural coordinates. +- Local contexts that genuinely need a flat integer key (e.g. internal + dict lookup) explicitly compute + `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call + site. -**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어 -있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage -(AttributeError)가 훨씬 안전. 
+**Justification for removing the property**: KernBench is an internal +project with a limited number of call sites. Explicit breakage +(AttributeError) is much safer than the risk of silent drift (semantics +change while the type stays int). -### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성 +### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates -ADR-0024 D4의 계약 구현. Post-hoc shifting 없음. +Implements the contract of ADR-0024 D4. No post-hoc shifting. ```python # src/kernbench/policy/placement/dp.py (after) @dataclass(frozen=True) class _LocalPeShard: - """Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload.""" + """Internal — return value of the PE resolver. Cube-local PE id + payload.""" local_pe: int # cube-local PE index (0..num_pe-1) offset_bytes: int nbytes: int @@ -93,7 +99,7 @@ def resolve_dp_policy( itemsize: int, num_pe: int, num_cubes: int = 1, - target_sip: int, # NEW — 어느 SIP에 배치할지 명시 + target_sip: int, # NEW — explicitly state which SIP to place on ) -> list[ShardSpec]: """2-level resolution (cube × PE) on a specified SIP. @@ -123,28 +129,30 @@ def resolve_dp_policy( return all_shards ``` -**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard` -리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**. -과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소. +**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a +list of `_LocalPeShard` — the `local_pe` field name makes it **explicit +that this is a "cube-local PE identifier"**. This resolves the previous +confusion with the name `ShardSpec.pe_index`. -**이름 규약 정리** (전체 ADR): -- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord) -- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미 -- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의 - 부가 효과: 이름 재등장 없음). 
+**Naming convention summary** (whole ADR): +- `ShardSpec.pe`: the final external API — cube-local PE (structural coord) +- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage +- `pe_index`: **removed**. Not retained anywhere, internal or external + (additional benefit of preventing silent drift: the name does not + reappear). -### D4. `_create_tensor` — 구조적 좌표로 직접 placement +### D4. `_create_tensor` — placement directly in structural coordinates -ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy` -호출 시점에 직접 지정. +Continuation of ADR-0024 D4. Post-hoc shifting removed; structural +coordinates are specified directly at the `resolve_dp_policy` call site. ```python # context.py _create_tensor (after) current_sip = self.ahbm.current_device() if current_sip is None: - # Single-driver fallback (ADR-0024 D2와 일관). - # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는 - # 문제가 있음 → debug mode에서 경고. + # Single-driver fallback (consistent with ADR-0024 D2). + # In launcher-based code, forgetting set_device() silently sticks the + # tensor on SIP 0 — emit a warning in debug mode. if os.environ.get("KERNBENCH_DEBUG"): import warnings warnings.warn( @@ -161,38 +169,39 @@ placement = resolve_dp_policy( itemsize=itemsize, num_pe=eff_num_pe, num_cubes=eff_num_cubes, - target_sip=current_sip, # ← 구조적 좌표 일차 지정 + target_sip=current_sip, # ← structural coord specified up front ) -# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함. -# 과거의 post-hoc shifting 블록은 완전히 제거. +# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local). +# The old post-hoc shifting block is removed entirely. ``` -**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면 -ADR-0027의 TP primitive 사용. +**Every** tensor is placed on the current device's SIP. If you need a +multi-SIP tensor, use the TP primitive of ADR-0027. -**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로 -default는 기존 single-driver 테스트 호환을 위해 유지. 
`KERNBENCH_DEBUG=1` -환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에 -배치되는 것을 감지할 수 있도록 warning. +**Trade-off of the single-driver fallback**: When set_device is not +called, defaulting to SIP 0 is kept for compatibility with existing +single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so +that accidentally omitting set_device in a launcher context — which would +silently place the tensor on the wrong SIP — can be detected. -### D5. Downstream — allocator lookup은 구조적 tuple key로 +### D5. Downstream — allocator lookup by structural tuple key -기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`): +Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`): ```python for spec in placement: - alloc = allocators[spec.pe_index] # ← AttributeError (property 제거됨) + alloc = allocators[spec.pe_index] # ← AttributeError (property removed) ``` -`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration: +With `pe_index` gone, migration to structural coordinates is **forced**: ```python for spec in placement: alloc = allocators[(spec.sip, spec.cube, spec.pe)] ``` -`_ensure_allocators`의 dict population도 tuple key로: +The dict population in `_ensure_allocators` is also tuple-keyed: ```python # context.py _ensure_allocators (after) @@ -204,59 +213,71 @@ for sip_id in sip_range: ) ``` -`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산 -블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용. +`_free_tensor` is the same: the old +`flat_idx = sip * ... + cube * ... + pe` computation block is removed, +and `(shard.sip, shard.cube, shard.pe)` is used directly. -**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서 -권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재 -allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지. +**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple +and hashable out of the box. 
A `PEIdentity` value object has the upside +of an explicit type, but the boilerplate is large and it is currently +the only key of the allocator dict, so it would be over-engineering. +Keep the tuple. -### D7. 하위 호환 — 불가 (cleanup ADR) +### D7. Backward compatibility — none (cleanup ADR) -이 ADR은 **breaking change**. +This ADR is a **breaking change**. -1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError` -2. `ShardSpec.pe_index` 접근 → `AttributeError` +1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError` +2. `ShardSpec.pe_index` access → `AttributeError` -모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음. -KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration. +Both are **immediate, explicit breakage**. No deprecation warning / +fallback path. KernBench is an internal project with a bounded set of +call sites, so migration happens in one pass. -**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한 -코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거. +**Blocking silent drift** is the main upside of fully removing the +property: code that expected a global flat could otherwise silently +receive a SIP-local result and index incorrectly — that possibility is +eliminated. ## Dependencies -- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이 - SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로 - 좁힘. -- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로. - 이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관. +- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping + provide the SIP placement mechanism. This ADR sits on top and narrows + DPPolicy to pure intra-device. +- **ADR-0027** (Megatron TP): the alternative path when a tensor spans + multiple SIPs. After this ADR is applied, multi-SIP use cases move to + ADR-0027. --- ## Non-goals -- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미 - 유지. -- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로. 
-- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028. +- **Redesign of `DPPolicy.cube` / `pe`**: existing + replicate/column_wise/row_wise semantics are kept. +- **Tiling policy consolidation**: `tiled_column_major` / + `tiled_row_major` stay as they are. +- **New multi-device tensor abstraction**: a DTensor-like is ADR-0028. --- ## Open questions -- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0 - (SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver - 테스트와의 호환). -- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며 - launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope. -- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가 - 사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이 - 명시적 답. +- **Default value of current_sip in `_create_tensor`**: for calls without + set_device, whether to fall back to rank=0 (SIP 0) or to raise an + error. The recommendation is fallback (compatibility with existing + single-driver tests). +- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit + tests to the launcher base while preserving their intent requires + additional fixtures. Scoped as separate work. +- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, + the concept of `num_sips` disappears entirely. The explicit answer for + expressing multi-SIP is to use the TP primitive of ADR-0027. -**Resolved (이전 rev에서 open이었던 것들)**: -- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2) -- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5) +**Resolved (items that were open in earlier revs)**: +- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully + removed** (D2) +- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`** + (D5) --- @@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 ### Positive -- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device. -- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소. 
-- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 → - abstraction leakage 해소 (ADR-0024 D4 계약 충족). -- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시. -- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP - 경계 제어 메커니즘. +- **Clean conceptual separation**: DPPolicy = intra-device, TP = + inter-device. +- **API simplification**: about a 33% reduction in DPPolicy constructor + fields. +- **Structural-coordinate consistency**: ShardSpec is expressed as a + `(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024 + D4 contract is satisfied). +- **Clear meaning of `pe_index`**: the single interpretation is + SIP-local. If global-flat is needed, it must be made explicit. +- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is + the sole SIP-boundary control mechanism. ### Negative - **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`, - `spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요. -- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드. - Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`, - `allocators` dict key 등) 연쇄 수정. -- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 → - migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff) -- `test_sip_parallel.py` 재작성 비용. + `spec.pe_index` → `AttributeError`. All callers need to be fixed at + once. +- **ShardSpec schema change**: a single `pe_index` field becomes three + fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`, + `_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.). +- **No silent drift**: with the property fully removed, runtime failure + is immediate → migration leakage is blocked at the source. (Not a + negative but an explicit tradeoff.) +- The cost of rewriting `test_sip_parallel.py`. ### Neutral -- 기존 `cube` / `pe` 필드 의미 불변. +- The meaning of the existing `cube` / `pe` fields is unchanged. 
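The tuple-keyed allocator lookup that ADR-0026 D2 and D5 describe can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the `ShardSpec` field set beyond `(sip, cube, pe)`, the `build_allocators` and `flat_key` helpers, and the string stand-ins for allocator objects are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    # Assumed field set for illustration; the ADR only fixes that the
    # identity of a shard is the (sip, cube, pe) 3-tuple.
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def build_allocators(num_sips: int, num_cubes: int, num_pe: int) -> dict:
    """Hypothetical helper: allocator table keyed by (sip, cube, pe).

    Strings stand in for real allocator objects.
    """
    return {
        (s, c, p): f"alloc[sip={s},cube={c},pe={p}]"
        for s in range(num_sips)
        for c in range(num_cubes)
        for p in range(num_pe)
    }


def flat_key(spec: ShardSpec, num_cubes: int, num_pe: int) -> int:
    """Hypothetical helper: explicit global-flat index, computed only
    where a flat integer key is genuinely needed."""
    return spec.sip * num_cubes * num_pe + spec.cube * num_pe + spec.pe


allocators = build_allocators(num_sips=2, num_cubes=2, num_pe=4)
spec = ShardSpec(sip=1, cube=0, pe=3, offset_bytes=0, nbytes=256)

# Structural lookup: no flat index is involved at the call site.
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```

Because this sketch defines no `pe_index` attribute, any leftover `spec.pe_index` access raises `AttributeError` immediately, which is the explicit-breakage behavior the ADR argues for.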
diff --git a/docs/adr/ADR-0027-par-megatron-tp.md b/docs/adr/ADR-0027-par-megatron-tp.md index 7b04254..e8fb267 100644 --- a/docs/adr/ADR-0027-par-megatron-tp.md +++ b/docs/adr/ADR-0027-par-megatron-tp.md @@ -6,70 +6,77 @@ Accepted ## Context -### 목표 +### Goal -SIP 간 tensor parallelism(TP)을 **Megatron-LM 스타일의 명시적 parallel layer** -API로 지원한다. DTensor 같은 선언적 추상화는 별도 ADR(0028) future work. +Support inter-SIP tensor parallelism (TP) via a **Megatron-LM style explicit +parallel layer** API. Declarative abstractions like DTensor are future work +in a separate ADR (0028). -Megatron-style을 선택한 이유: -- TP는 model의 특정 layer 경계에서 발생. 명시적 primitive가 mental model에 - 자연스러움. -- NVIDIA Megatron / DeepSpeed가 확립한 인더스트리 표준. -- DTensor는 선언적이라 디자인 공간이 더 크다 → 단계적. +Why Megatron-style was chosen: +- TP arises at specific layer boundaries of a model. Explicit primitives are + natural to the mental model. +- The de-facto industry standard established by NVIDIA Megatron / DeepSpeed. +- DTensor is declarative, so its design space is larger → phased approach. -### TP primitive 스펙 (Megatron-LM 참조) +### TP primitive spec (Megatron-LM reference) -- **ColumnParallelLinear**: weight의 **column(out_features)** 축을 TP ranks에 - 분산. 입력 full-replicated, 출력 column-sharded. 후속 RowParallelLinear가 - 올 때 forward all-reduce 없음. -- **RowParallelLinear**: weight의 **row(in_features)** 축을 TP ranks에 분산. - 입력이 이미 column-sharded (ColumnParallel의 출력). forward 끝에 - **all-reduce** 필요. -- **VocabParallelEmbedding**: embedding을 vocab 축에 분산. forward 끝에 - all-reduce. (초기 scope에서는 stub, 실제 구현은 all-gather kernel 선행 필요.) +- **ColumnParallelLinear**: shards the weight's **column (out_features)** axis + across TP ranks. Input is full-replicated, output is column-sharded. When a + RowParallelLinear follows, no forward all-reduce is required. +- **RowParallelLinear**: shards the weight's **row (in_features)** axis across + TP ranks. Input is already column-sharded (the output of ColumnParallel). 
+ Requires an **all-reduce** at the end of forward. +- **VocabParallelEmbedding**: shards the embedding along the vocab axis. + all-reduce at the end of forward. (A stub in the initial scope; full + implementation requires an all-gather kernel as a prerequisite.) - **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**, - **`gather_from_tp_region`** — 기본 primitive. + **`gather_from_tp_region`** — basic primitives. -### 풀어야 할 문제 +### Problems to solve -1. **Worker-wait 일반화 (D0)**: `dist.all_reduce`의 defer/yield/drain 패턴을 - 모든 `ctx.wait` 경로로 확장. **이 ADR의 가장 큰 아키텍처 결정**. +1. **Worker-wait generalization (D0)**: extend the defer/yield/drain pattern of + `dist.all_reduce` to every `ctx.wait` path. **The biggest architectural + decision of this ADR.** -2. **런처 API 정규화 (D1)**: 현 bench들이 hand-rolled greenlet loop을 사용. - `torch.multiprocessing.spawn(fn, args, nprocs)`로 흡수해 real-PyTorch API 면 - 유지 + D0의 scheduler drain을 단일 구현 위치에 집중. +2. **Launcher API normalization (D1)**: current benches use a hand-rolled + greenlet loop. Absorb it into `torch.multiprocessing.spawn(fn, args, nprocs)` + to preserve the real-PyTorch API surface + concentrate D0's scheduler drain + in a single implementation site. -3. **Per-rank weight 분산 표현**: 각 worker가 weight tensor의 자기 slice를 - 소유. ADR-0024의 `set_device(rank)` + ADR-0026의 intra-device DPPolicy로 - 자연스럽게 표현. +3. **Per-rank weight shard representation**: each worker owns its own slice of + the weight tensor. Naturally expressed via ADR-0024's `set_device(rank)` + + ADR-0026's intra-device DPPolicy. -4. **Forward-only scope**: 현재 KernBench는 backward가 없음 (simulation 목적). - 본 ADR은 **forward만** 우선 지원. Training simulation은 별도 ADR. +4. **Forward-only scope**: KernBench currently has no backward (simulation + purposes). This ADR prioritizes **forward only**. Training simulation is a + separate ADR. -5. **Collective 호출 지점**: RowParallelLinear가 forward 끝에 `all_reduce` 호출. 
- ADR-0024의 multi-greenlet 구조 + D0 generalization에서 자연스럽게 동작. +5. **Collective call site**: RowParallelLinear calls `all_reduce` at the end of + forward. Naturally works with ADR-0024's multi-greenlet structure + D0 + generalization. -6. **TP group 개념**: Megatron은 DP × TP × PP group을 교차 사용. 초기 scope는 - **TP group = 전체 SIP** 단순화. Mixed DP+TP는 future. +6. **TP group concept**: Megatron crosses DP × TP × PP groups. The initial + scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work. --- ## Decision -### D0. Worker-wait 일반화 — `ctx.wait`가 worker 컨텍스트면 main으로 defer +### D0. Worker-wait generalization — `ctx.wait` defers to main when in worker context -**문제 재확인**. `kernel_runner.run`은 spawn 시점의 `greenlet.getcurrent()`를 -kernel greenlet의 `_parent`로 캡처한다 +**Restating the problem.** `kernel_runner.run` captures the `greenlet.getcurrent()` +at spawn time as the kernel greenlet's `_parent` ([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)). -main 컨텍스트에서 `env.run`이 돌면 parent=main이라 safe. worker 컨텍스트에서 -`env.run`이 돌면 parent=worker가 되고, worker가 yield/finish하는 순간 kernel -greenlet은 orphan → `GreenletExit` → ADR-0024 Phase B의 `ring_default_ws` 실패. +If `env.run` runs in the main context, parent=main is safe. If `env.run` runs +in a worker context, parent=worker, and the moment the worker yields/finishes +the kernel greenlet becomes an orphan → `GreenletExit` → failure of ADR-0024 +Phase B's `ring_default_ws`. -**해결**. worker greenlet이 `ctx.wait(h)`를 호출하면 직접 `env.run`을 driving -하는 대신 **main scheduler로 yield**. main이 env.run을 drive해 handle이 완료 -되면 worker로 control return. +**Resolution.** When a worker greenlet calls `ctx.wait(h)`, instead of driving +`env.run` directly, **yield to the main scheduler**. main drives env.run and, +once the handle completes, control returns to the worker. 
-#### D0.1 `RuntimeContext` 확장 +#### D0.1 `RuntimeContext` extension ```python # context.py @@ -79,7 +86,7 @@ class RuntimeContext: _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False) ``` -#### D0.2 `ctx.wait`의 worker fork +#### D0.2 `ctx.wait` worker fork ```python def wait(self, handle, *, _meta=None): @@ -113,29 +120,33 @@ def wait(self, handle, *, _meta=None): return completion ``` -#### D0.3 `ctx.wait`의 worker-context 세만틱 contract (normative) +#### D0.3 `ctx.wait` worker-context semantic contract (normative) -본 ADR은 `ctx.wait`의 세만틱을 worker 컨텍스트에서 **명시적으로 변경**한다. +This ADR **explicitly changes** the semantics of `ctx.wait` in worker context. -- **Submit-vs-complete 분리**: `ctx.wait(h)`는 worker에서 호출될 때 "즉시 완료 - 보장"이 아니라 "**다음 scheduler drain 이후** 완료 보장"이다. worker가 - `wait()`에서 return하는 시점 = main이 해당 handle에 대해 `engine.wait`을 - 마친 시점. Main context 호출은 기존대로 즉시-동기 (status quo). -- **Resume invariant (normative)**: worker-deferred `ctx.wait(h)`에서 - `g.parent.switch()`가 return해 worker가 resume되는 시점에는 **반드시 - `h in ctx._completed`가 True여야 한다**. 이 invariant가 깨지면 worker가 - stale 상태에서 이후 단계를 진행하므로 `_drain_pending` / scheduler loop / - `ctx.wait` 어느 부분을 수정하든 이 불변식을 지켜야 한다. T3.b가 이 - invariant를 직접 assert한다. -- **관찰 가능 변화**: worker 안에서 `h = ctx.submit(msg); ctx.wait(h); - read(handle_result)` 패턴은 여전히 성립 — 단 `wait()`와 `read` 사이에는 - 자동으로 main-drain이 삽입되었다는 사실을 세만틱 명세로 포함한다. -- **Host 객체 직접 read는 D0.5 참조**: `ctx.wait` 없이 `tensor.numpy()`를 - 부르는 경우의 계약은 D0.5에서 별도로 규정. +- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)` + no longer guarantees "immediate completion" but instead guarantees + "completion **after the next scheduler drain**". The point at which the + worker returns from `wait()` = the point at which main has finished + `engine.wait` for that handle. Main-context calls remain immediate-synchronous + as before (status quo). 
+- **Resume invariant (normative)**: at the point a worker resumes from a + worker-deferred `ctx.wait(h)` (when `g.parent.switch()` returns), **`h in + ctx._completed` must be True**. If this invariant breaks, the worker + proceeds in a stale state, so whichever of `_drain_pending` / the scheduler + loop / `ctx.wait` is modified, this invariant must be preserved. T3.b + directly asserts this invariant. +- **Observable change**: the pattern `h = ctx.submit(msg); ctx.wait(h); + read(handle_result)` inside a worker still holds — but the semantic spec + now includes the fact that a main-drain is automatically inserted between + `wait()` and `read`. +- **Direct host-object reads: see D0.5**. The contract for calling + `tensor.numpy()` without `ctx.wait` is specified separately in D0.5. -#### D0.4 Main scheduler drain — 규약 (normative) +#### D0.4 Main scheduler drain — protocol (normative) -(D1의 `multiprocessing.spawn` 내부 구현. 아래는 세만틱 정의.) +(The internal implementation of D1's `multiprocessing.spawn`. Below is the +semantic definition.) ```python while alive: @@ -144,69 +155,82 @@ while alive: _drain_pending(ctx) # (2) drain in main context ``` -(`_drain_pending`의 실제 정의는 D0.5 참조 — outer while-loop으로 두 큐가 -모두 빌 때까지 drain.) +(The actual definition of `_drain_pending` is in D0.5 — an outer while-loop +that drains until both queues are empty.) -**규약**: +**Rules**: -1. **Round-based cooperative scheduling & yield 의무 (worker contract)**. - `g.switch()`는 해당 worker가 **자발적으로 yield**할 때까지 return하지 않는다 - (cooperative greenlet 세만틱). 따라서: - - Worker가 yield 없이 `while True: do_compute()` 같은 pure-compute loop를 - 돌면 `g.switch()`는 영원히 return하지 않고 **scheduler loop 자체가 hard - block**된다 (다른 worker는 switch 기회를 못 얻음, drain도 안 일어남). 이는 - starvation이 아니라 **scheduler non-progress (deadlock 등가)**이며 본 - ADR이 **unsupported**로 규정한다. - - Worker는 **반드시** `ctx.wait(h)`, `dist.all_reduce`, host-read barrier - (D0.5) 중 하나를 유한 step 내에 호출해야 한다. 
TP layer의 `forward`는 - 매 layer 끝에서 launch→wait 쌍을 포함하므로 자연스럽게 이 조건을 만족. - CCL kernel도 `dist.all_reduce` 내부에서 yield한다. - - 구현이 이를 **감지**할 필요는 없다 (타임아웃/steps-since-yield 카운터 - 등). 이는 user contract이며 위반 시 증상은 "simulation hang"이다. - - **Future extension**: non-collective 긴 계산 경로가 자주 나오면 - 명시적 `torch.distributed.cooperative_yield()` primitive (no-op yield)를 - 도입할 수 있다. 현 ADR 범위 밖. Breaking change 아님 — 필요 시 추가하면 - 됨. - - Round 내에서는 alive worker 전체가 한 번씩 `switch`를 받는다. 단일 round - 안에서 한 worker가 여러 번 wait를 호출해도 그 turn 안에서 순차적으로 - enqueue된 뒤 scheduler drain 한 번에 일괄 처리 (FIFO). +1. **Round-based cooperative scheduling & yield obligation (worker contract)**. + `g.switch()` does not return until the worker **voluntarily yields** + (cooperative greenlet semantics). Therefore: + - If a worker runs a pure-compute loop like `while True: do_compute()` + without yielding, `g.switch()` never returns and **the scheduler loop + itself hard-blocks** (other workers cannot get a switch turn, no drain + occurs). This is not starvation but **scheduler non-progress (deadlock + equivalent)**, and this ADR classifies it as **unsupported**. + - Workers **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a + host-read barrier (D0.5) within a finite number of steps. The `forward` + of a TP layer includes a launch→wait pair at the end of every layer, so + this condition is naturally met. CCL kernels also yield inside + `dist.all_reduce`. + - Implementations need not **detect** this (timeouts/steps-since-yield + counters, etc.). It is a user contract; the symptom on violation is + "simulation hang". + - **Future extension**: if non-collective long compute paths become + common, an explicit `torch.distributed.cooperative_yield()` primitive + (no-op yield) could be introduced. Out of scope for this ADR. Not a + breaking change — can be added if needed. + - Within a round, every alive worker receives one `switch` turn. 
Even if + a single worker calls wait multiple times within one round, the calls + are enqueued sequentially within that turn and processed in a single + scheduler drain batch (FIFO). -2. **Drain 순서 = submission 순서 (FIFO)**. `_pending_worker_waits`는 list - append/pop(0)로 엄격한 FIFO. 완료 순서가 아니라 submission 순서로 drain되며, - SimPy scheduler 자체가 인과적으로 올바른 완료 순서를 보장하므로 submission - 순서 drain이 안전하다. `completion order`와 `drain order`는 혼동하지 말 것. +2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is + strict FIFO via list append/pop(0). Drain occurs in submission order, not + completion order, and SimPy's scheduler itself guarantees a causally + correct completion order, so submission-order drain is safe. Do not + confuse `completion order` with `drain order`. - **Two-queue ordering (worker waits → collectives)**: `_drain_pending`은 - worker wait 큐를 먼저, collective 큐를 나중에 drain한다. 이 순서의 근거: - - **두 큐는 서로 다른 dependency source**: worker wait은 worker가 직접 - `submit + wait` 쌍으로 만들어낸 handle (tensor deploy, MmuMap 등). collective - 큐는 `dist.all_reduce`가 내부적으로 enqueue한 kernel launch handle이며 - worker는 이걸 직접 wait하지 않는다 (D0.5의 두 큐 drain 모델 참조). - - **Correctness 관점 독립**: collective는 worker 관점에선 "이미 submit된 - 후 yield한" 상태. 그 완료 타이밍은 worker의 다음 action 시점 이전이기만 - 하면 됨. worker wait 큐와의 순서 dependency 없음. - - **단일 drain barrier 안에서 둘 다 완료**: D0.5의 loop-until-empty 규약에 - 따라 한 barrier invocation에서 worker → collective → (새로 생긴 것이 - 있으면 반복) 순으로 모두 빠짐. worker가 resume될 땐 양쪽 모두 drained. - - **대안 (collective 먼저)도 가능**: 본 ADR은 현 구현 단순성을 위해 worker - 먼저를 고정했을 뿐 의미상 동치. 성능 프로파일 차이가 관찰되면 재조정. + **Two-queue ordering (worker waits → collectives)**: `_drain_pending` + drains the worker wait queue first, then the collective queue. Rationale + for this ordering: + - **The two queues are different dependency sources**: worker waits are + handles produced by a worker's own `submit + wait` pair (tensor deploy, + MmuMap, etc.). 
The collective queue holds kernel-launch handles that + `dist.all_reduce` enqueues internally, which the worker never directly + waits on (see the two-queue drain model in D0.5). + - **Independent in correctness terms**: from the worker's perspective, a + collective is "already submitted, then yielded". Its completion timing + only needs to precede the worker's next action. There is no ordering + dependency with the worker wait queue. + - **Both finish within a single drain barrier**: per D0.5's + loop-until-empty rule, a single barrier invocation drains worker → + collective → (repeat if new ones appeared) in that order. By the time + the worker resumes, both sides are drained. + - **The alternative (collective first) is also valid**: this ADR fixes + worker-first only for current implementation simplicity; semantically + they are equivalent. Revisit if a performance-profile difference is + observed. -3. **중복 enqueue — correctness는 idempotent drain, dedup은 non-guaranteed**. - `ctx.wait(h)`는 `h in ctx._completed`면 즉시 return. `_drain_pending`도 - 동일 guard. 같은 handle이 `_pending_worker_waits`에 여러 번 appended - 되더라도 실제 `engine.wait`는 한 번만 호출된다 (idempotent). - - **Correctness**: idempotent drain에 의존 → safe. - - **Memory/성능**: 본 ADR은 `_pending_worker_waits`의 **dedup을 보장하지 - 않는다**. 같은 handle이 N번 enqueue되면 큐에 N개 element가 보관되고 - drain 시 N번 pop + in-set guard가 돈다. 단일 worker가 같은 handle을 - 반복 wait하는 비정상 패턴이 아니면 N은 1~수 수준. - - **Implementation freedom**: 구현은 선택적으로 dedup (예: `set`을 side - index로 두거나 append 전 `h not in pending_set` 검사) 가능. correctness - 를 바꾸지 않는 최적화로 분류. +3. **Duplicate enqueue — correctness via idempotent drain; dedup not + guaranteed**. `ctx.wait(h)` returns immediately if `h in ctx._completed`. + `_drain_pending` uses the same guard. Even if the same handle is appended + to `_pending_worker_waits` multiple times, `engine.wait` is invoked only + once (idempotent). + - **Correctness**: relies on idempotent drain → safe. 
+ - **Memory/performance**: this ADR **does not guarantee dedup** of + `_pending_worker_waits`. If the same handle is enqueued N times, the + queue retains N elements and drain performs N pops + in-set guards. + Unless a single worker abnormally repeats waits on the same handle, N + stays at the order of 1 to a few. + - **Implementation freedom**: implementations may optionally dedup (e.g., + hold a `set` as a side index, or check `h not in pending_set` before + append). Classified as an optimization that does not change correctness. 4. **Exception propagation + sibling cleanup**. - worker greenlet이 raise하면 `g.switch()`가 main으로 예외를 전달한다. - scheduler loop은 즉시 중단되고 다음 cleanup을 **명시적으로** 수행: + When a worker greenlet raises, `g.switch()` propagates the exception to + main. The scheduler loop stops immediately and performs the following + cleanup **explicitly**: ```python try: @@ -219,14 +243,14 @@ while alive: g.switch() _drain_pending(ctx) except Exception as outer: - # (a) 살아남은 sibling worker greenlet 강제 종료. + # (a) Force-terminate surviving sibling worker greenlets. for other in gs: if not other.dead: try: other.throw(SystemExit) except Exception: - pass # 사일런트 — 이미 예외 상황 - # (b) Backend barrier / pending 상태 초기화 (장래 epoch barrier 도입 대비). + pass # silent — already in exceptional state + # (b) Reset backend barrier / pending state (in preparation for future epoch barrier). backend = getattr(ctx.distributed, "_backend", None) if backend is not None and hasattr(backend, "_barrier"): backend._barrier.reset() @@ -234,32 +258,34 @@ while alive: if backend_pending is not None: backend_pending.clear() ctx._pending_worker_waits.clear() - # (c) 원인 예외는 SpawnException으로 래핑. + # (c) Wrap the originating exception in SpawnException. raise SpawnException(errors) from outer ``` - 규약: - - **Sibling abort 보장**: worker 하나가 raise하면 모든 sibling greenlet에 - `SystemExit`을 throw — greenlet은 즉시 terminate된다. greenlet leak 없음. 
- - **Pending queue 명시적 clear**: worker-wait + collective-pending 두 큐를 - 비움. 재사용 시 오염 방지. - - **`SpawnException(errors)` 래핑**: `errors: dict[int, Exception]`에 각 - rank의 원래 예외를 담는다. real-PyTorch `torch.multiprocessing.spawn`의 - failure 패턴과 호환. - - **Scope 제한**: `errors`에는 **자기 코드로 raise한 rank (root cause)만** - 포함된다. Sibling cleanup 과정에서 `throw(SystemExit)`으로 종료된 rank는 - `errors`에 나타나지 않는다 (SystemExit은 D1.2의 entry 래퍼 `try/except - Exception`에 걸리지 않음 — 의도된 설계: sibling 종료는 실패가 아니라 - cleanup signal). 독자가 "모든 failed rank가 다 들어올 것"으로 기대하지 - 않도록 명시. - - **`ctx._traces`는 예외 이전 시점까지의 partial 상태**. trace completeness - 는 보장되지 않음 (일부 launch/all_reduce가 entry를 남기지 못한 채 종료 - 가능). - - **Allocator / MemoryStore**는 예외 이전 상태 유지 — 재사용은 non-goal, - 새 `RuntimeContext` 생성 권장. - - **`join=False` / retry / partial recovery**는 본 ADR의 non-goal. + Protocol: + - **Sibling abort guarantee**: when one worker raises, `SystemExit` is + thrown into all sibling greenlets — greenlets terminate immediately. No + greenlet leaks. + - **Explicit pending-queue clear**: both queues (worker-wait + + collective-pending) are cleared. Prevents contamination on reuse. + - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]` + contains the original exception per rank. Compatible with the failure + pattern of real-PyTorch `torch.multiprocessing.spawn`. + - **Scope restriction**: `errors` includes **only ranks that raised + from their own code (root cause)**. Ranks terminated via + `throw(SystemExit)` during sibling cleanup do not appear in `errors` + (SystemExit is not caught by D1.2's entry wrapper `try/except + Exception` — intentional design: sibling termination is a cleanup + signal, not a failure). Made explicit so readers do not expect "all + failed ranks" to appear. + - **`ctx._traces` is the partial state up to the moment of exception**. + Trace completeness is not guaranteed (some launches/all_reduces may + terminate without leaving an entry). 
+ - **Allocator / MemoryStore** remain in their pre-exception state — reuse + is a non-goal; creating a fresh `RuntimeContext` is recommended. + - **`join=False` / retry / partial recovery** are non-goals for this ADR. - `SpawnException`은 `runtime_api/multiprocessing.py`에 정의: + `SpawnException` is defined in `runtime_api/multiprocessing.py`: ```python class SpawnException(RuntimeError): @@ -271,22 +297,25 @@ while alive: super().__init__(msg) ``` -5. **Single-driver 호환**. `g.parent is None`인 main-only 실행 (legacy 단일 - 드라이버 테스트)에서는 D0.2의 worker-fork 조건이 거짓 → 기존 즉시-동기 - 경로 유지. `_drain_pending`은 호출되지 않는다. +5. **Single-driver compatibility**. In main-only execution where `g.parent is + None` (legacy single-driver tests), D0.2's worker-fork condition is false + → the existing immediate-synchronous path is preserved. `_drain_pending` + is not invoked. -#### D0.5 Host-read barrier — 결정 (normative) +#### D0.5 Host-read barrier — decision (normative) -Worker 안에서 `tensor.numpy()`, `tensor.__getitem__`, `tensor.data` 등 -**host-observable read**는 **자동 drain barrier**로 정의한다. 호출 직전: +Inside a worker, **host-observable reads** such as `tensor.numpy()`, +`tensor.__getitem__`, and `tensor.data` are defined as **automatic drain +barriers**. Immediately before the call: -1. `ctx._pending_worker_waits`와 `backend._pending_collective_handles`가 비어 - 있지 않으면 `g.parent.switch()`로 main에 yield → main은 `_drain_pending` - 실행 → 완료 후 worker resume. -2. 두 큐가 모두 비어 있으면 즉시 read. +1. If either `ctx._pending_worker_waits` or `backend._pending_collective_handles` + is non-empty, yield to main via `g.parent.switch()` → main runs + `_drain_pending` → worker resumes after completion. +2. If both queues are empty, read immediately. -**Barrier 반복 규약 (normative — re-entrance)**: `_drain_pending`은 while-loop -로 **두 큐가 모두 완전히 비어질 때까지** drain한다. 
단일 pass가 아님: +**Barrier iteration protocol (normative — re-entrance)**: `_drain_pending` +drains via a while-loop **until both queues are completely empty**, not in a +single pass: ```python def _drain_pending(ctx): @@ -302,156 +331,175 @@ def _drain_pending(ctx): if backend is not None: while backend._pending_collective_handles: h, _sip_id, meta = backend._pending_collective_handles.pop(0) - ctx.wait(h, _meta=meta) # main context: safe; ctx.wait가 - # 다시 pending에 push하지 않음 + ctx.wait(h, _meta=meta) # main context: safe; ctx.wait will + # not push back to pending ``` -**Main-context ctx.wait 비재귀 invariant (normative)**: `_drain_pending` 내부의 -`ctx.wait(h, _meta=meta)` 호출은 main greenlet 컨텍스트에서 실행된다. D0.2의 -worker-fork 조건(`g.parent is not None and not g.parent.dead`)이 False이므로 -즉시-동기 경로로 진입 → **`_pending_worker_waits`에 절대 enqueue하지 않는다**. -이 invariant 덕분에 drain loop은 재귀/큐 재증가 없이 끝난다. 구현 시 -`g.parent is None`을 단일 main greenlet 보장으로 유지하는 것이 중요. +**Main-context ctx.wait non-recursion invariant (normative)**: the +`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main +greenlet context. Because D0.2's worker-fork condition (`g.parent is not +None and not g.parent.dead`) is False, it enters the immediate-synchronous +path → **never enqueues to `_pending_worker_waits`**. Thanks to this +invariant, the drain loop terminates without recursion / queue re-growth. +When implementing, it is important to maintain `g.parent is None` as the +single-main-greenlet guarantee. -**왜 loop인가**: `ctx.wait(h, _meta=meta)`는 main 컨텍스트에서 호출되므로 D0.2 -경로에 따라 engine을 **직접 drive**한다 (추가 enqueue 없음 — 위 invariant). -따라서 이론적으로는 single pass로 충분하지만 — 규약은 **loop-until-empty**로 -고정한다. 이유: +**Why a loop**: `ctx.wait(h, _meta=meta)` is called in main context, so per +the D0.2 path it **drives the engine directly** (no additional enqueue — the +invariant above). In theory a single pass would suffice — but the protocol +is fixed at **loop-until-empty**. Reasons: -1. 
**미래 확장 안전성**: 향후 drain 중 새 pending이 enqueue되는 구현 (예: - collective가 sub-handle을 가진 tree-reduce)이 생길 수 있다. loop 규약이면 - 이때도 correctness 유지. -2. **가독성**: "barrier는 pending이 빌 때까지 drain"이라는 단일 문장으로 - 의미가 닫힘. `ctx.wait` 호출이 새 enqueue를 안 한다는 non-trivial invariant - 에 의존하지 않음. -3. **Barrier의 세만틱은 "해당 read에 필요한 모든 dependency 완료"**: 현 모델 - 에선 모든 pending이 곧 모든 dependency이므로 둘은 동일. 사용자 mental model - 은 전자. +1. **Future-extension safety**: a future implementation might enqueue new + pending items mid-drain (e.g., tree-reduce collectives with sub-handles). + The loop protocol preserves correctness in that case. +2. **Readability**: the single sentence "the barrier drains until pending + is empty" closes the semantics. No dependence on the non-trivial + invariant that `ctx.wait` calls do not produce new enqueues. +3. **Barrier semantics are "all dependencies needed for this read are + complete"**: in the current model all pending = all dependencies, so the + two are identical. The user mental model is the former. -**Termination 보증**: 두 체제로 분리해 서술한다. +**Termination guarantee**: described under two regimes. -- **현재 구현**: `ctx.wait`는 main context에서 호출 시 engine을 직접 drive - (D0.2) → 새 pending을 enqueue하지 않는다. 한 iteration마다 pending의 크기가 - `pop(0)` + `engine.wait`로 엄격히 감소. iteration 수는 **초기 pending 크기 - 자체가 상한** → 유한 종료. -- **Future extension (loop 규약을 정당화하는 상한)**: 향후 drain 중 새 pending이 - enqueue되는 구현 (예: tree-reduce sub-handle)이 도입되면 초기 크기 상한은 - 깨진다. 그러나 SimPy causality는 handle의 dependency가 유한 DAG임을 보장하므로 - **nested depth가 finite**. loop 규약이 이 경우까지 자동 수용한다. +- **Current implementation**: when called in main context, `ctx.wait` + drives the engine directly (D0.2) → does not enqueue new pending. Each + iteration strictly shrinks pending size by `pop(0)` + `engine.wait`. The + iteration count is bounded by **the initial pending size itself** → + finite termination. 
+- **Future extension (the bound that justifies the loop protocol)**: if an + implementation that enqueues new pending mid-drain (e.g., tree-reduce + sub-handles) is introduced, the initial-size bound breaks. However, + SimPy causality guarantees that the dependency DAG of handles is finite, + so **nested depth is finite**. The loop protocol automatically + accommodates this case. -두 체제 모두 무한 루프가 불가능함을 보장. 현 구현의 단일-pass 상한은 공격적 -최적화 시 참고 값일 뿐 규약은 loop-until-empty로 고정. +Both regimes guarantee that infinite loops are impossible. The +single-pass bound of the current implementation is only a reference value for +aggressive optimization; the protocol is fixed at loop-until-empty. -**왜 implicit drain at read가 맞는가**: +**Why implicit drain at read is correct**: -- 기존 open question에서 (a) implicit drain, (b) explicit barrier 둘 중 선택 - 문제였다. (b)는 명확하지만 TP layer 사용자가 `out = fc1.forward(x); - ctx.drain(); result = out.numpy()` 3-step을 매번 써야 하는 부담. (a)는 - "읽을 때 반영된 값을 보장"하는 단일 규약으로 CUDA의 `cudaDeviceSynchronize - before host copy` 패턴과 동일 — 숨은 규칙이 아닌 **명명된 entry-point의 - contract**이다. -- 본 ADR은 (a)를 채택하되 그 entry-point 목록을 **명시적으로 닫는다**: +- In the original open question, the choice was between (a) implicit drain + and (b) explicit barrier. (b) is clear but burdens TP layer users with + the 3-step pattern `out = fc1.forward(x); ctx.drain(); result = + out.numpy()` on every read. (a) is a single rule that "guarantees the + read sees the reflected value" — identical to CUDA's `cudaDeviceSynchronize + before host copy` pattern, which is not a hidden rule but the **contract + of a named entry point**. +- This ADR adopts (a) but **closes the entry-point list explicitly**: `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`, - `Tensor.__repr__` (data가 포함되는 경우), 그 외 공식 host-read API는 본 - ADR 구현 시점에 코드베이스 검색으로 확정. 추가되는 host-read API는 반드시 - 이 contract를 따라야 한다 (테스트로 회귀 방지). -- `ctx.submit`만 하고 `wait` 없이 `numpy`를 직접 호출하는 경우도 drain - barrier가 동작 (pending queue에 handle이 있기 때문). 
사용자가 explicit - wait을 생략해도 read 시점에 invariant가 복원된다. + `Tensor.__repr__` (when data is included), and any other official + host-read APIs are finalized via codebase search at the time of + implementing this ADR. Any newly added host-read API must follow this + contract (regression-guarded by tests). +- Even when calling `numpy` directly after only `ctx.submit` without + `wait`, the drain barrier still operates (because the handle is in the + pending queue). The invariant is restored at read time even if the user + omits an explicit wait. -**`Tensor.copy_(source)` — write barrier 규정**: +**`Tensor.copy_(source)` — write barrier specification**: -`copy_`는 semantically "target에 write"이지만 내부적으로 `source.numpy()`를 -호출하여 host에서 source 데이터를 가져온 뒤 `target._memory_store.write(...)` -로 각 shard에 쓴다. 두 방향 모두 barrier 처리: +`copy_` is semantically "write to target", but internally it calls +`source.numpy()` to fetch source data on the host then writes to each +shard via `target._memory_store.write(...)`. Both directions are +barrier-handled: -1. **Source-side (read barrier)**: `source.numpy()`가 D0.5 read barrier를 - 트리거 (source 자체가 deployed tensor이고 pending이 있을 때). -2. **Target-side (write barrier — global pending 기준)**: `copy_` 진입 시 - `ctx._pending_worker_waits` 또는 `backend._pending_collective_handles`가 - 비어 있지 않으면 write 전에 `g.parent.switch()`로 drain. **Per-tensor / - per-shard dependency tracking이 아니라 global pending queue 기준**. - - 왜 global인가: KernBench의 handle 표현에는 "이 handle이 target의 어느 - shard를 write한다"는 역추적 정보가 없다. 안전한 보수적 규약으로 "전역 - pending이 있으면 drain". 이 결과로 **unrelated tensor의 pending도 copy_를 - 막을 수 있다** — drop-in invariant 우선. - - **명시적 tradeoff**: 이 규약은 서로 독립적인 tensor 사이에도 불필요한 - serialization을 도입할 수 있다. 그러나 현 single-queue execution model - 하에서는 이 비용이 허용 가능 — cross-rank correctness와 "읽을 때 최신" - invariant를 단순한 규칙으로 보장하는 편이 우선. 
- - 실질적 영향: 단일 worker는 대부분 한 layer step 안에서 pending이 주로 - 자기 작업 — over-barrier로 인한 추가 context switch는 round 끝 scheduler - drain 시점과 일치하는 경우가 많아 큰 문제 안 됨. - - Future refinement: per-tensor pending tracking을 도입하면 이 규약을 - 좁힐 수 있으나 본 ADR scope 밖. +1. **Source-side (read barrier)**: `source.numpy()` triggers the D0.5 read + barrier (when source itself is a deployed tensor with pending). +2. **Target-side (write barrier — based on global pending)**: on `copy_` + entry, if `ctx._pending_worker_waits` or + `backend._pending_collective_handles` are non-empty, drain via + `g.parent.switch()` before writing. **Not per-tensor / per-shard + dependency tracking, but based on the global pending queue**. + - Why global: KernBench's handle representation does not retain the + reverse-mapping information "this handle writes to which shard of which + target". A safe conservative rule: "drain if any global pending + exists". As a result, **pending of an unrelated tensor can also block + copy_** — drop-in invariant takes priority. + - **Explicit tradeoff**: this rule can introduce unnecessary + serialization between independent tensors. However, under the current + single-queue execution model this cost is acceptable — guaranteeing + cross-rank correctness and the "read sees latest" invariant via a + simple rule takes precedence. + - Practical impact: most pending of a single worker within a layer step + is its own work — extra context switches from over-barrier often + coincide with the end-of-round scheduler drain point, so no major + issue. + - Future refinement: per-tensor pending tracking could narrow this + rule, but it is out of scope for this ADR. **Non-barrier**: -- `tensor.shape`, `tensor.dtype`, `tensor.name` 등 **metadata-only** 접근은 - drain하지 않음. 데이터 의존성이 없음. -- `tensor.pa`, `tensor.va` 등 raw address accessor도 drain하지 않음 (주소만, - 내용 아님). +- `tensor.shape`, `tensor.dtype`, `tensor.name`, and other + **metadata-only** access does not drain. No data dependency. 
+- `tensor.pa`, `tensor.va`, and other raw address accessors also do not + drain (address only, not content). -**공식 barrier entry-point (closed set)**: +**Official barrier entry-points (closed set)**: | API | Kind | Rationale | |---|---|---| | `Tensor.numpy()` | read | host-observable copy | | `Tensor.data` | read | `numpy()` alias | | `Tensor.__getitem__` | read | shard-aligned read | -| `Tensor.__repr__` (data 포함 시) | read | debugging/log | +| `Tensor.__repr__` (when data is included) | read | debugging/log | | `Tensor.copy_(source)` | read + write | source read + target write | -이 contract를 T5/T6에서 직접 검증. +This contract is verified directly in T5/T6. -#### D0.6 왜 worker 함수 API는 불변인가 (informative) +#### D0.6 Why the worker function API is unchanged (informative) -- `torch.zeros(...)` 내부는 `self.submit(msg)` + `self.wait(h)` 쌍. `wait`가 - D0.2/D0.3에 따라 자동으로 main-defer → 겉보기 동기적으로 보이지만 한 번 - yield. -- `tensor.numpy()`는 D0.5에 따라 host-read barrier → pending이 있으면 - drain→read, 없으면 즉시 read. -- `dist.all_reduce`는 기존 `_defer_wait=True` + `_pending_collective_handles` - 경로를 그대로 사용. D0.4의 drain이 두 큐를 함께 처리. +- The inside of `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)` + pair. `wait` auto-defers to main per D0.2/D0.3 — appears synchronous from + the outside but yields once. +- `tensor.numpy()` follows D0.5's host-read barrier → drain→read when + pending exists, immediate read otherwise. +- `dist.all_reduce` continues to use the existing `_defer_wait=True` + + `_pending_collective_handles` path. D0.4's drain processes both queues + together. -#### D0.7 불변 조건 (invariants) +#### D0.7 Invariants -- **kernel greenlet의 `_parent`는 항상 main**: env.run이 worker 컨텍스트에서 - 절대 돌지 않기 때문. (T3의 핵심 assertion.) -- **cross-rank 동기 지점**: 모든 worker가 yield한 뒤에만 drain → 모든 rank의 - kernel이 한 라운드에 함께 진행 (cross-rank IPCQ 교환의 필수 조건). -- **Single-driver 호환**: D0.4-(5). +- **The kernel greenlet's `_parent` is always main**: because env.run never + runs in worker context. 
(Core assertion of T3.) +- **Cross-rank synchronization point**: drain occurs only after every + worker has yielded → kernels of all ranks advance together within one + round (a prerequisite for cross-rank IPCQ exchange). +- **Single-driver compatibility**: D0.4-(5). ### D1. `torch.multiprocessing.spawn(fn, args, nprocs)` -Real-PyTorch API 파리티 + D0의 scheduler loop의 단일 구현 위치. +Real-PyTorch API parity + a single implementation site for D0's scheduler +loop. -#### D1.0 API parity only — execution parity 아님 (normative) +#### D1.0 API parity only — not execution parity (normative) -`torch.multiprocessing.spawn` 이름은 **API signature parity**에 한정된다. -실제 실행 모델은 **cooperative greenlet scheduler** (단일 Python 프로세스, -단일 OS 스레드, D0.4의 round-robin drive)이다. 다음은 **본 ADR이 제공하지 -않는 속성** — real-PyTorch `torch.multiprocessing.spawn`이 보장하는 것 중 -명시적으로 **non-goal**: +The name `torch.multiprocessing.spawn` is restricted to **API signature +parity**. The actual execution model is a **cooperative greenlet scheduler** +(single Python process, single OS thread, round-robin drive per D0.4). The +following are **properties this ADR does NOT provide** — among the +guarantees of real-PyTorch `torch.multiprocessing.spawn`, explicitly +**non-goals**: -- 프로세스 격리 (independent OS process per rank). -- 독립 address space (각 rank가 자기 Python heap 보유). -- Failure isolation (한 rank의 hard crash가 다른 rank 영향 없음). -- OS-level scheduler fairness (rank 간 preemptive time slicing). -- `mp.Queue`, `mp.Lock` 등 inter-process primitive. +- Process isolation (independent OS process per rank). +- Independent address space (each rank with its own Python heap). +- Failure isolation (a hard crash in one rank not affecting others). +- OS-level scheduler fairness (preemptive time slicing between ranks). +- Inter-process primitives such as `mp.Queue`, `mp.Lock`. -이 구현의 실제 성질: +Actual properties of this implementation: -- 모든 rank는 같은 Python 프로세스 안의 greenlet. shared global state가 - 그대로 보임 (의도된 simulation convenience). 
-- GIL 하의 단일 스레드 → parallel execution 아님. SimPy 이벤트 순서로 - "논리적 동시성"만 재현. -- 한 worker에서 unhandled exception → 전체 simulation 중단 (D0.4-(4)). +- All ranks are greenlets inside the same Python process. Shared global + state is visible as-is (intentional simulation convenience). +- Single-threaded under the GIL → not parallel execution. Only "logical + concurrency" via SimPy event ordering is reproduced. +- Unhandled exception in any one worker → entire simulation aborts + (D0.4-(4)). -**호출자 의무**: real-PyTorch multi-process 샘플을 KernBench로 이식할 때 -프로세스 격리에 의존하는 로직 (예: `os.getpid`, 독립 임시 파일, 신호 처리 -등)은 지워야 한다. Namespace 이름은 코드 이식성을 위해 유지 — 세만틱은 -다르다. +**Caller's obligation**: when porting real-PyTorch multi-process samples to +KernBench, logic that relies on process isolation (e.g., `os.getpid`, +independent temp files, signal handling) must be removed. The namespace name +is preserved for code portability — semantics differ. #### D1.1 Public surface @@ -470,7 +518,7 @@ class _MultiprocessingNamespace: ... ``` -#### D1.2 구현 +#### D1.2 Implementation ```python def spawn(self, fn, args, nprocs, join=True): @@ -520,28 +568,29 @@ def spawn(self, fn, args, nprocs, join=True): #### D1.3 `torch` namespace attach -`runtime_api/context.py` `__post_init__`에서: +In `runtime_api/context.py` `__post_init__`: ```python self.multiprocessing = _MultiprocessingNamespace(self) ``` -→ bench 코드에서 `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`. +→ in bench code: `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`. -#### D1.4 기존 bench 마이그레이션 +#### D1.4 Migration of existing benches -`benches/ccl_allreduce.py`의 hand-rolled loop은 `torch.multiprocessing.spawn` -한 줄로 축소. 기존 matrix 회귀는 그대로 유지. 현재 xfail인 `ring_default_ws`는 -D0 덕분에 PASS로 전환 예상 (worker가 kernel greenlet orphan을 발생시키지 않음). +The hand-rolled loop in `benches/ccl_allreduce.py` collapses into a single +`torch.multiprocessing.spawn` line. Existing matrix regressions are +preserved. 
The currently xfail `ring_default_ws` is expected to flip to +PASS thanks to D0 (workers no longer orphan the kernel greenlet). -### D2. 새 패키지 `kernbench.tp` +### D2. New package `kernbench.tp` ``` src/kernbench/tp/ __init__.py — public API re-exports - parallel_state.py — TP group 관리 (현재 single global group) + parallel_state.py — TP group management (currently a single global group) layers.py — ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding primitives.py — copy/reduce/scatter/gather_to/from_tp_region - kernels.py — TP layer가 launch하는 gemm kernel (재사용 가능) + kernels.py — gemm kernel launched by TP layers (reusable) mappings.py — forward identity/all_reduce, backward stub ``` @@ -571,45 +620,49 @@ def get_tensor_model_parallel_rank() -> int: return get_dist().get_rank() # ADR-0024 greenlet-local rank ``` -초기 scope: TP size = world_size = topology SIP count. Pure TP 모델. +Initial scope: TP size = world_size = topology SIP count. Pure TP model. -### D4-pre. TP shard ownership vs DPPolicy — 역할 분리 (normative) +### D4-pre. 
TP shard ownership vs DPPolicy — role separation (normative) -TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다: +In the weight/output representation of TP layers, two concepts are clearly +separated: -| 개념 | 결정 주체 | 범위 | +| Concept | Decided by | Scope | |---|---|---| -| **TP shard ownership** (어느 rank가 weight의 어떤 slice를 소유하는가) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** | -| **Intra-rank placement** (소유된 slice를 rank 내부에서 cube × PE로 어떻게 분산하는가) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **한 rank 내부 (SIP 경계 안)** | +| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** | +| **Intra-rank placement** (how the owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within SIP boundary)** | -따라서 `ColumnParallelLinear`가 `(in_features, out_features // ws)` shape로 -weight를 생성하고 `DPPolicy(cube="column_wise", pe="column_wise")`를 부여 -하면: +Thus when `ColumnParallelLinear` creates a weight of shape `(in_features, +out_features // ws)` and assigns `DPPolicy(cube="column_wise", +pe="column_wise")`: -- **Rank r**이 소유하는 slice = weight의 column 축 [r * k_local, (r+1) * - k_local) — **set_device(r)**가 이걸 결정 (해당 rank가 SIP r에 존재). -- **그 slice 내부**에서 cube × PE column-wise 분산 — **DPPolicy**가 이걸 - 결정. +- The slice owned by **rank r** = column-axis [r * k_local, (r+1) * + k_local) of the weight — **set_device(r)** determines this (that rank + resides on SIP r). +- **Inside that slice**, the cube × PE column-wise distribution — **DPPolicy** + determines this. -두 축은 **독립적**이다. 같은 DPPolicy로 두 rank가 자기 slice를 만들면 -slice 자체는 다른 SIP에 있지만 intra-SIP placement 패턴은 동일. 반대로 -DPPolicy를 `cube="replicate", pe="replicate"`로 바꿔도 TP shard ownership은 -유지되고 intra-rank placement만 달라짐. +The two axes are **independent**. 
If two ranks build their own slice with +the same DPPolicy, the slices themselves live on different SIPs but the +intra-SIP placement pattern is the same. Conversely, changing DPPolicy to +`cube="replicate", pe="replicate"` preserves TP shard ownership and only +changes intra-rank placement. -**이 경계가 흐려지는 실수** (본 ADR이 금지): +**Mistakes that blur this boundary** (forbidden by this ADR): -- DPPolicy에 "SIP 축"이 다시 등장 (ADR-0026에서 제거됨). -- TP layer가 `set_device` 없이 `DPPolicy`만으로 cross-rank sharding을 - 표현 → 단일 rank 안에서 세로로 자른 것과 구분 안 됨. +- The "SIP axis" reappearing in DPPolicy (removed in ADR-0026). +- TP layers expressing cross-rank sharding via `DPPolicy` alone without + `set_device` → indistinguishable from a vertical split within a single + rank. -본 ADR의 TP layer는 항상 "rank = SIP = one slice 소유 + DPPolicy intra-SIP -분산" 관점에서만 weight/output을 다룬다. +The TP layers of this ADR always treat weight/output from the perspective +of "rank = SIP = owns one slice + DPPolicy intra-SIP distribution" only. ### D4. `ColumnParallelLinear` -**중요**: host-side `torch.matmul` 추상화를 신규 도입하지 않는다. layer의 -forward는 `torch.launch("gemm", gemm_kernel, ...)`로 기존 gemm kernel을 -호출 — KernBench bench들이 이미 쓰는 패턴 +**Important**: no new host-side `torch.matmul` abstraction is introduced. +The layer's forward calls the existing gemm kernel via `torch.launch("gemm", +gemm_kernel, ...)` — the pattern already used by KernBench benches ([benches/gemm_single_pe.py](benches/gemm_single_pe.py), [benches/gpt3_qkv.py](benches/gpt3_qkv.py)). @@ -623,14 +676,14 @@ from kernbench.tp.parallel_state import ( ) class ColumnParallelLinear: - """Weight의 K(out_features) 축을 TP rank에 분산. + """Shards the K(out_features) axis of the weight across TP ranks. 
forward(x): x: (M, N) — full-replicated across ranks - W_k: (N, K / world_size) — rank-local slice (set_device로 SIP r에 거주) + W_k: (N, K / world_size) — rank-local slice (placed on SIP r via set_device) y_k = x @ W_k → (M, K / world_size) — rank-local output - 출력은 column-sharded. RowParallelLinear가 기대하는 입력 형태. + Output is column-sharded. The input form expected by RowParallelLinear. """ def __init__(self, in_features: int, out_features: int, bias: bool = False, @@ -640,7 +693,7 @@ class ColumnParallelLinear: self.in_features = in_features self.k_local = out_features // ws self._torch = torch - # 각 rank가 자기 slice 소유 — set_device(rank)에 의해 SIP r에 배치. + # Each rank owns its own slice — placed on SIP r by set_device(rank). self.weight = torch.zeros( (in_features, self.k_local), dtype=dtype, dp=DPPolicy(cube="column_wise", pe="column_wise"), @@ -655,7 +708,7 @@ class ColumnParallelLinear: ) def forward(self, x): - # x는 full-replicated (caller 보장). 단순 local gemm. + # x is full-replicated (caller-guaranteed). Plain local gemm. M = x.shape[0] out = self._torch.empty( (M, self.k_local), dtype=x.dtype, @@ -666,24 +719,25 @@ class ColumnParallelLinear: "col_parallel_gemm", _gemm_kernel, x, self.weight, out, M, self.in_features, self.k_local, ) - # bias add는 별도 kernel 혹은 composite gemm의 fused bias. - # 초기 scope에서는 bias=False만 충분히 검증. + # bias add as a separate kernel or as fused bias of a composite gemm. + # Initial scope verifies bias=False sufficiently. return out ``` -**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`는 한 번의 -`torch.launch` 호출로 kernel launch → 내부 `ctx.wait` 쌍을 포함한다. 이는 -D0.4-(1)의 "worker는 유한 step 내 yield" 조건을 자동으로 만족 — TP layer -사용자가 yield 패턴을 수동으로 삽입할 필요 없음. +**Yield-safety contract (normative)**: `ColumnParallelLinear.forward` +includes one `torch.launch` call containing a kernel launch → internal +`ctx.wait` pair. 
This automatically satisfies the "worker yields within a +finite number of steps" condition of D0.4-(1) — TP layer users do not need +to insert yield patterns manually. ### D5. `RowParallelLinear` ```python class RowParallelLinear: - """Weight의 N(in_features) 축을 TP rank에 분산. + """Shards the N(in_features) axis of the weight across TP ranks. forward(x): - x: (M, N / world_size) — rank-local slice (ColumnParallel의 출력) + x: (M, N / world_size) — rank-local slice (output of ColumnParallel) W_k: (N / world_size, K) — rank-local slice y_k = x @ W_k → (M, K) — partial sum on each rank y = all_reduce(y_k, op="sum") → (M, K) on every rank @@ -701,7 +755,7 @@ class RowParallelLinear: dp=DPPolicy(cube="column_wise", pe="column_wise"), name="row_parallel_w", ) - # bias는 rank 0에만 (Megatron convention). 초기 scope에서는 생략. + # bias only on rank 0 (Megatron convention). Omitted in initial scope. self.bias = None def forward(self, x): @@ -715,25 +769,26 @@ class RowParallelLinear: "row_parallel_gemm", _gemm_kernel, x, self.weight, y_partial, M, self.n_local, self.out_features, ) - # Cross-rank reduce. ADR-0024의 dist.all_reduce는 D0 + mp.spawn 하에서 - # 정상 동작 (kernel parent = main 유지). + # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly + # under D0 + mp.spawn (kernel parent = main is preserved). self._torch.distributed.all_reduce(y_partial, op="sum") return y_partial ``` -**Yield-safety contract (normative)**: `RowParallelLinear.forward`는 launch → -내부 wait에 이어 `all_reduce` (defer + worker yield 패턴)까지 포함하므로 forward -한 번당 **최소 2회 yield**가 보장됨. D0.4-(1)의 scheduler progress 조건 자동 -만족. 모든 본 ADR의 TP layer forward는 "최소 하나의 wait 또는 collective를 -포함해 yield-safe하다"를 invariant로 유지한다 — 이후 추가되는 TP primitive -(VocabParallelEmbedding 등)도 동일 계약 필수. +**Yield-safety contract (normative)**: `RowParallelLinear.forward` includes +launch → internal wait followed by `all_reduce` (defer + worker-yield +pattern), so **at least 2 yields per forward** are guaranteed. 
The +scheduler-progress condition of D0.4-(1) is automatically satisfied. All TP +layer forwards in this ADR maintain the invariant "yield-safe by containing +at least one wait or collective" — any future TP primitives (e.g., +VocabParallelEmbedding) must keep the same contract. -### D6. Primitive 함수 +### D6. Primitive functions ```python # primitives.py def copy_to_tp_region(x): - """Forward: identity. Backward: all-reduce. (Training 추가 시 구현).""" + """Forward: identity. Backward: all-reduce. (Implemented when training is added).""" return x def reduce_from_tp_region(x, torch): @@ -743,19 +798,19 @@ def reduce_from_tp_region(x, torch): def scatter_to_tp_region(x): raise NotImplementedError( - "Phase 2: 사용자가 이미 sharded tensor를 생성하는 것으로 대체" + "Phase 2: replaced by users creating already-sharded tensors" ) def gather_from_tp_region(x): raise NotImplementedError( - "Phase 2: all-gather kernel 선행 필요 (future)" + "Phase 2: requires all-gather kernel as a prerequisite (future)" ) ``` -### D7. 샘플 bench — 2-layer MLP with TP +### D7. Sample bench — 2-layer MLP with TP ```python -# benches/tp_mlp.py (신규) +# benches/tp_mlp.py (new) from kernbench.policy.placement.dp import DPPolicy import kernbench.tp as tp import numpy as np @@ -780,10 +835,10 @@ def worker(rank: int, world_size: int, torch): h = fc1.forward(x) # column-sharded (B, D_hidden / ws) y = fc2.forward(h) # all-reduced (B, D_out) on every rank - # rank 0만 결과 출력 / 검증 + # Only rank 0 prints / verifies the result if rank == 0: result = y.numpy() - # 실제 검증 값은 zero-init weight이면 전부 0 — scope에서는 "완료 자체" 검증 + # With zero-init weights, all values are 0 — within scope "completion itself" is the check print(f" tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}") @@ -793,31 +848,33 @@ def run(torch): torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws) ``` -### D8. Non-functional — training 미지원 +### D8. Non-functional — training not supported -본 ADR은 **inference/forward only**. 
Backward / gradient / optimizer는 future. -기존 KernBench가 training이 아니므로 자연스러움. +This ADR is **inference/forward only**. Backward / gradient / optimizer is +future work. Natural because KernBench is not a training system. -### D9. 초기 scope 제약 +### D9. Initial-scope constraints -- TP size = world_size (mixed DP+TP 없음). -- `scatter_to_tp_region`, `gather_from_tp_region`은 unimplemented. -- **Weight 기본값은 zero**. 적절한 init scheme (Xavier, Kaiming 등)은 future. - 단 테스트는 `tensor.copy_`로 결정론적 non-zero pattern을 주입해 numerical - correctness를 검증 (T2/T6). 즉 "production default = zero, 검증 = 결정론적 - non-zero"로 운영 분리. -- Bias 초기 scope에서 생략 (Megatron의 rank 0-only bias 정책은 future). -- Pipeline parallelism은 scope 밖. -- VocabParallelEmbedding은 all-gather 선행 필요 → stub only. +- TP size = world_size (no mixed DP+TP). +- `scatter_to_tp_region`, `gather_from_tp_region` are unimplemented. +- **Default weight value is zero**. Proper init schemes (Xavier, Kaiming, + etc.) are future. Tests inject deterministic non-zero patterns via + `tensor.copy_` to verify numerical correctness (T2/T6). I.e., operate as + "production default = zero, verification = deterministic non-zero". +- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy + is future). +- Pipeline parallelism is out of scope. +- VocabParallelEmbedding requires a prerequisite all-gather → stub only. -### D10. 회귀: `ring_default_ws` xfail 해제 — 필수 acceptance +### D10. Regression: `ring_default_ws` xfail removal — mandatory acceptance -D0 (worker-wait 일반화) + D0.5 (host-read barrier) 덕분에 모든 worker-driven -`ctx.wait` 및 host-read가 main-drain 경로로 routing됨 → ADR-0024 Phase B의 -kernel-greenlet orphan 원인이 소멸. 기존 matrix test의 `ring_default_ws` -strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을 **필수 회귀 -기준**으로 포함. Observable acceptance criteria는 **T7**에 명시 (deadlock -부재, GreenletExit 부재, numerical tolerance 등). 
+Thanks to D0 (worker-wait generalization) + D0.5 (host-read barrier), every +worker-driven `ctx.wait` and host-read is routed through the main-drain path +→ the cause of the kernel-greenlet orphan in ADR-0024 Phase B disappears. +Flipping the existing matrix test's `ring_default_ws` strict-xfail case to +**PASS** after this ADR's implementation is included as a **mandatory +regression criterion**. Observable acceptance criteria are specified in +**T7** (no deadlock, no GreenletExit, numerical tolerance, etc.). --- @@ -825,40 +882,48 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을 - **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `torch.ahbm.set_device(rank)`. -- **ADR-0026** (DPPolicy intra-device): weight tensor의 per-rank slice 표현. -- **ADR-0023 / ADR-0025** (IPCQ): `dist.all_reduce` 구현의 기반. +- **ADR-0026** (DPPolicy intra-device): per-rank slice representation of + weight tensors. +- **ADR-0023 / ADR-0025** (IPCQ): foundation of `dist.all_reduce` + implementation. --- ## Non-goals -- **Backward pass / training**: inference only. Training simulation은 별도 ADR. -- **Mixed parallelism (DP + TP + PP)**: 초기엔 pure TP only. -- **Weight init schemes**: 단순 zero / debug pattern. -- **Fused ops**: Megatron의 fused matmul+bias+gelu는 kernel 레벨 문제. -- **DTensor 통합**: ADR-0028 future. -- **Host-side `torch.matmul` 추상화**: TP layer는 `torch.launch(gemm_kernel, ...)` - 로 기존 gemm kernel을 호출. 신규 matmul host-op 도입 안 함. +- **Backward pass / training**: inference only. Training simulation is a + separate ADR. +- **Mixed parallelism (DP + TP + PP)**: pure TP only at the start. +- **Weight init schemes**: simple zero / debug pattern. +- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level + concern. +- **DTensor integration**: ADR-0028 future. +- **Host-side `torch.matmul` abstraction**: TP layers call the existing + gemm kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op + is introduced. 
--- ## Open questions -- **`initialize_model_parallel` 위치**: `kernbench.tp.initialize_model_parallel` - (현 결정) vs real-PyTorch의 `torch.distributed.init_device_mesh`. TP 전용 - 모듈에 유지. -- **Weight init**: ADR은 zero. Debug pattern (e.g., identity)이 유효 검증에 - 필요할 수 있음 — Phase 1 test에서 필요 시 추가. -- **bias 배치 정책**: Megatron은 RowParallelLinear bias를 rank 0에만. 초기 - scope에서는 bias=False로 회피. -- **GEMM kernel 위치**: `kernbench.tp.kernels._gemm_kernel` vs 기존 - `benches/gemm_single_pe.py`에서 import. TP가 bench 의존을 가지면 안 되므로 - tp 내부에 복제. 향후 `kernbench.kernels` 공용 패키지로 이관 가능. +- **Location of `initialize_model_parallel`**: + `kernbench.tp.initialize_model_parallel` (current decision) vs + real-PyTorch's `torch.distributed.init_device_mesh`. Kept in the TP-only + module. +- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may + be needed for valid verification — add at Phase 1 test time if needed. +- **Bias placement policy**: Megatron places RowParallelLinear bias only on + rank 0. Avoided in the initial scope via bias=False. +- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs + importing from existing `benches/gemm_single_pe.py`. TP must not depend + on benches, so duplicated inside tp. Migration to a shared + `kernbench.kernels` package is possible later. -**Resolved (이전 rev에서 open이었던 것들)**: -- ~~`tensor.numpy()` 호출 시 drain 타이밍~~ → **D0.5에서 결정**: 공식 host-read - entry-point(`numpy`, `data`, `__getitem__`, data-포함 `__repr__`)는 자동 - drain barrier. metadata-only accessor는 barrier 아님. +**Resolved (previously open in earlier revisions)**: +- ~~Drain timing on `tensor.numpy()` call~~ → **decided in D0.5**: the + official host-read entry points (`numpy`, `data`, `__getitem__`, + data-containing `__repr__`) are automatic drain barriers. Metadata-only + accessors are not barriers. --- @@ -866,23 +931,25 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을 ### Positive -- **Megatron 코드 이식 용이**: real training code와 API 일치. 
-- **TP 벤치마크 가능**: scaling, communication-compute overlap 등 HW 특성 - 연구. -- **`ring_default_ws` xfail 해제**: D0의 부산물로 ADR-0024 Phase B 블로커 해소. -- **Scheduler loop 단일화**: D1 (`mp.spawn`) 도입으로 hand-rolled loop 제거. - 후속 collective/TP 벤치가 동일 패턴 재사용. -- **DPPolicy 의미 명확화** (ADR-0026 시너지): TP layer가 intra-device DPPolicy - 만 사용하는 모범 사례. +- **Easy porting of Megatron code**: API matches real training code. +- **TP benchmarking enabled**: research on scaling, + communication-compute overlap, and other HW characteristics. +- **`ring_default_ws` xfail removal**: as a byproduct of D0, the ADR-0024 + Phase B blocker is resolved. +- **Scheduler-loop unification**: introducing D1 (`mp.spawn`) removes the + hand-rolled loop. Subsequent collective/TP benches reuse the same + pattern. +- **DPPolicy semantics clarified** (synergy with ADR-0026): TP layers as a + best-practice example of using intra-device DPPolicy only. ### Negative -- 새 모듈 (`kernbench.tp`) 유지보수 비용. -- 초기 scope가 제한적 (pure TP only, forward only). -- D0 generalization이 `ctx.wait`의 세만틱을 바꿈 — 단일 드라이버 테스트와의 - 호환성을 명시적으로 검증 필요 (T7). +- Maintenance cost of a new module (`kernbench.tp`). +- Initial scope is limited (pure TP only, forward only). +- D0 generalization changes the semantics of `ctx.wait` — compatibility + with single-driver tests must be explicitly verified (T7). ### Neutral -- ADR-0024/0026 기반 위에 순수한 상위 레이어 추가. Hardware simulation - stack에 영향 없음 (D0 제외). +- A pure upper layer added on top of ADR-0024/0026. No impact on the + hardware-simulation stack (apart from D0). 
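Note on D4/D5 of the ADR diff above: the claim that a `ColumnParallelLinear` followed by a `RowParallelLinear` plus `all_reduce(op="sum")` reproduces the unsharded two-layer product can be checked numerically. The following is a minimal sketch under stated assumptions: plain nested Python lists stand in for KernBench tensors, each loop iteration stands in for one rank, and every helper name (`matmul`, `column_slice`, `row_slice`, `all_reduce_sum`, `tp_mlp`) is hypothetical and not part of `kernbench.tp`.

```python
# Illustrative sketch only: nested lists stand in for tensors, and all helper
# names here are hypothetical (they are not part of kernbench.tp).

def matmul(a, b):
    """Naive (rows x inner) @ (inner x cols) product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_slice(w, rank, world_size):
    """Columns [rank * k_local, (rank + 1) * k_local) of w (column-parallel shard)."""
    k_local = len(w[0]) // world_size
    return [row[rank * k_local:(rank + 1) * k_local] for row in w]

def row_slice(w, rank, world_size):
    """Rows [rank * n_local, (rank + 1) * n_local) of w (row-parallel shard)."""
    n_local = len(w) // world_size
    return w[rank * n_local:(rank + 1) * n_local]

def all_reduce_sum(partials):
    """Element-wise sum of per-rank partial outputs (stand-in for all_reduce)."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(cols)]
            for i in range(rows)]

def tp_mlp(x, w1, w2, world_size):
    """Column-parallel fc1, row-parallel fc2, then a sum across ranks."""
    partials = []
    for rank in range(world_size):
        h_local = matmul(x, column_slice(w1, rank, world_size))   # (M, H / ws)
        y_partial = matmul(h_local, row_slice(w2, rank, world_size))  # (M, K)
        partials.append(y_partial)
    return all_reduce_sum(partials)

# Small integer inputs keep the comparison exact.
x = [[1, 2], [3, 4]]                      # (M=2, N=2), replicated on every rank
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]         # (N=2, H=4)
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]     # (H=4, K=2)

reference = matmul(matmul(x, w1), w2)     # unsharded result
sharded = tp_mlp(x, w1, w2, world_size=2)
```

The sketch omits bias and any nonlinearity, matching the ADR's initial scope; it says nothing about intra-rank `DPPolicy` placement, which the ADR treats as an independent axis.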
diff --git a/tests/test_verify_adr_lang_pairs.py b/tests/test_verify_adr_lang_pairs.py index 190197a..0ad2881 100644 --- a/tests/test_verify_adr_lang_pairs.py +++ b/tests/test_verify_adr_lang_pairs.py @@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None: assert v.verify(tmp_path) == [] +def test_em_dash_title_separator_recognized(tmp_path: Path) -> None: + """ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title.""" + en = tmp_path / "docs/adr/ADR-0033-foo-bar.md" + ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md" + en.parent.mkdir(parents=True, exist_ok=True) + ko.parent.mkdir(parents=True, exist_ok=True) + body = "## Status\n\nAccepted\n\n## Context\n\nbody\n" + en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8") + ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8") + assert v.verify(tmp_path) == [] + + def test_underscore_in_slug_recognized(tmp_path: Path) -> None: """ADR-0013 uses an underscore in its slug; the regex must accept it.""" _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013") diff --git a/tools/verify_adr_lang_pairs.py b/tools/verify_adr_lang_pairs.py index d0147d5..5661cf3 100644 --- a/tools/verify_adr_lang_pairs.py +++ b/tools/verify_adr_lang_pairs.py @@ -24,7 +24,7 @@ import sys from pathlib import Path ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$") -TITLE_RE = re.compile(r"^# ADR-(\d{4}):") +TITLE_RE = re.compile(r"^# ADR-(\d{4})\b") def _normalize(text: str) -> str:
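Note on the `TITLE_RE` change above: the effect of replacing the trailing `:` with `\b` can be checked in isolation with the standard `re` module. This is a small sketch, not the tool's own code; the two patterns are copied from the hunk, while the sample titles (other than the ADR-0033 form used in the new test) and the variable names are made up for illustration.

```python
import re

# Patterns copied from the hunk; the names OLD_/NEW_ are illustrative only.
OLD_TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
NEW_TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")

titles = [
    "# ADR-0027: Some Title",             # hypothetical colon-separated title
    "# ADR-0033 \u2014 Latency Model",    # dash-separated form from the new test
    "# ADR-00331: Too Many Digits",       # hypothetical malformed ID
]

# Which titles each pattern accepts.
old_hits = [bool(OLD_TITLE_RE.match(t)) for t in titles]
new_hits = [bool(NEW_TITLE_RE.match(t)) for t in titles]

# The captured ADR number for the dash-separated title under the new pattern.
dash_number = NEW_TITLE_RE.match(titles[1]).group(1)
```

One consequence worth keeping in mind: the `\b` form accepts any non-word separator after the four digits, so it is more permissive than the colon-only pattern, while a five-digit ID is still rejected because there is no word boundary between two digits.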