ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/

Establish English as the canonical ADR language with Korean translations
held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror).
Promotion from adr-proposed/ to adr/ now writes English to adr/ and the
Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English,
  2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix
  dropped). ADR-0023 EN regenerated against KO source which had newer
  HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark
  docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline
  section covering bidirectional sync, conflict resolution (EN wins),
  and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair
  completeness, filename mirroring, ADR-ID match, Status byte-equality.
  Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF
  normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:38:44 -07:00
parent 687c98086d
commit a796c1d2f7
42 changed files with 10515 additions and 3422 deletions
+66 -15
@@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above.
>
> Contains **foundations** (Authority & Scope → Terminology → Terminology
> Discipline → Mental Model → Common Failure Modes) followed by **rules**
> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API /
> sim_engine Boundaries).
> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
> Discipline, runtime API / sim_engine Boundaries).
## Authority & Scope
@@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above.
### ADR Lifecycle
ADRs live in one of three folders based on lifecycle state:
ADRs live in one of four folders. Three carry **canonical English**
content based on lifecycle state; the fourth holds Korean translations:
- `docs/adr/`: **Accepted** (current implementation reflected).
- `docs/adr/`: **Accepted** (canonical English; current
implementation reflected).
- `docs/adr-proposed/`: **Proposed**, **Stub**, or **Draft** (design
only / future-work exploration / retroactive documentation pending
verification).
verification). **Authoring language is free** (any language); the
promotion step (below) translates to English.
- `docs/adr-history/`: **Superseded** or **Merged** (no longer the
authoritative source; kept as historical record).
authoritative source; kept as historical record). Frozen — language
policy not applied retroactively.
- `docs/adr-ko/` — Korean translations of accepted ADRs (derived
artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
canonical source of truth; when KO and EN disagree, EN wins. See
*ADR Translation Discipline* below.
Status field values:
@@ -240,17 +248,23 @@ Status field values:
Transitions:
- **Proposed/Stub → Accepted**: when the ADR's decisions are
reflected in production code AND covered by tests. `git mv` from
`docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
reflected in production code AND covered by tests. If the proposed
ADR is in Korean, translate to English and place the English in
`docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
proposed ADR is in English, `git mv` it to `docs/adr/` and create
the Korean translation in `docs/adr-ko/`. Change Status to
`Accepted` in both files.
- **Draft → Accepted**: when the ADR's text has been verified to
accurately describe the existing implementation. `git mv` from
`docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
accurately describe the existing implementation. Same English /
Korean placement rule as above.
- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
and `git mv` to `docs/adr-history/`. The superseding ADR includes
a "Supersedes ADR-NNNN" reference (or, for partial supersession of
clauses, documents this in its own body).
in both the EN and KO files and `git mv` both to their respective
history locations (`docs/adr-history/` for English; the KO copy
stays in `docs/adr-ko/` only if it was already mirrored — see *ADR
Translation Discipline* for the frozen-history exception).
- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
(single-line stub) and `git mv` to `docs/adr-history/`.
(single-line stub) in both files and apply the same `git mv` rule
as the Superseded transition.
Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
regardless of folder location. ADR numbers are **immutable**; never
@@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
## Derived Artifacts (Clarification)
- Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
- Creating or updating files in `docs/diagrams/`:
- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
(mirror of the canonical English in `docs/adr/`); see *ADR Translation
Discipline*.
- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
- does NOT count as a production code change,
- does NOT require Phase 2 approval,
- MUST be consistent with SPEC.md and ADRs.
## ADR Translation Discipline
English in `docs/adr/` is the canonical source of truth. Korean in
`docs/adr-ko/` mirrors it 1:1 as a derived artifact.
**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
must be accompanied, in the same change, by a mirroring edit to
`docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
always describe the same architectural content.
Mechanics:
- When editing an EN ADR, propagate the change to its KO counterpart
by translating just the diff (preserve unaffected KO prose); do not
regenerate the whole KO file from scratch.
- When editing a KO ADR, propagate to EN the same way.
- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
suffix in either path).
- The `## Status` block content must remain byte-identical between
the EN and KO files (e.g., both say `Accepted`).
- Conflict policy: if the two diverge despite the rule, treat EN as
authoritative and overwrite KO. Surface the divergence to the user
before reconciling.
- `docs/adr-proposed/` is exempt — single language only, no mirror
required until promotion.
- `docs/adr-history/` is frozen — pre-existing mixed-language state
there is not migrated.
Verification: `python tools/verify_adr_lang_pairs.py` checks that
every EN ADR has a matching KO file, the title's ADR-NNNN matches the
filename, and Status blocks are byte-equal. Run it on demand or wire
it into CI. Exit code: 0 = OK, 1 = mismatch.
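
The kind of check described above could look roughly like the sketch below. It is an illustration only, not the contents of `tools/verify_adr_lang_pairs.py`; the helper names `status_block` and `check_pairs` are hypothetical, and the ADR-ID and title checks are left out.

```python
import re
from pathlib import Path


def status_block(text: str) -> str:
    """Return the body of the '## Status' section, with CRLF normalized to LF."""
    text = text.replace("\r\n", "\n")
    match = re.search(r"^## Status\n(.*?)(?=^## |\Z)", text, re.S | re.M)
    return match.group(1).strip() if match else ""


def check_pairs(en_dir: Path, ko_dir: Path) -> list[str]:
    """Collect human-readable mismatches between EN ADRs and their KO mirrors."""
    problems = []
    for en_file in sorted(en_dir.glob("*.md")):
        ko_file = ko_dir / en_file.name  # same filename, no language suffix
        if not ko_file.exists():
            problems.append(f"missing KO mirror: {en_file.name}")
            continue
        en_status = status_block(en_file.read_text(encoding="utf-8"))
        ko_status = status_block(ko_file.read_text(encoding="utf-8"))
        if en_status != ko_status:
            problems.append(f"Status differs: {en_file.name}")
    return problems
```

A caller could map an empty result to exit code 0 and a non-empty one to exit code 1, matching the convention stated above.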
## runtime API / sim_engine Boundaries
- runtime API MUST NOT hardcode topology/routing or internal hop sequences.
+362
@@ -0,0 +1,362 @@
# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
## Status
Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
Supersedes ADR-0031.)
## Date
2026-04-27 (original: 2026-02-27)
## Context
KernBench requires a stable, parsable physical address scheme that:
- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components
- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
### History
- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
- ADR-0031 (stub) requested PE-resource range partition but was never
implemented.
Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
ADR-0031 is superseded.
## Decision
We define a **PhysAddr value object** and an **address decoding contract**
that converts an integer address into routing domains.
### D1. PhysAddr is an immutable value object
- PhysAddr is immutable and comparable as a pure value.
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.
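
One way to picture D1 is a frozen dataclass. This is a minimal sketch under assumptions: the class below is hypothetical and only borrows the top-level field names from D2; the real `PhysAddr` may carry more fields or a different constructor.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class PhysAddr:
    """Hypothetical immutable value object; field names follow the D2 layout."""
    sip_id: int
    die_id: int
    local_offset: int


a = PhysAddr(sip_id=2, die_id=5, local_offset=0x1000)
b = PhysAddr(sip_id=2, die_id=5, local_offset=0x1000)

same_value = (a == b)             # equality is by value, not identity
usable_as_key = {a: "route"}[b]   # frozen dataclasses are hashable
```

Because every field is part of the value, nothing outside the object is needed to interpret it, which is the property D1 asks for.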
### D2. 51-bit Physical Address Layout
A 51-bit physical address is adopted.
#### 2.1 Top-Level Address Map
```text
[50:47] sip_id (4) -- 16 SIPs
[46:42] die_id (5) -- 32 dies per SIP
[41: 0] local_offset (42) -- 4 TB per die
```
```text
50 47 46 42 41 0
+---------+----------+-------------------------+
| sip_id | die_id | local_offset |
+---------+----------+-------------------------+
```
#### 2.2 die_id Allocation
| die_id | Meaning |
|--------|---------|
| 0..15 | AHBM dies |
| 16..20 | IOCHIPLET dies |
| 21..31 | Reserved |
#### 2.3 AHBM Die Layout
Only the lower 256 GB of the 4 TB die-local window is assigned.
```text
[41:38] MBZ (4)
[37] addr_space (1) -- 0 = local resource, 1 = HBM memory
[36: 0] sub-address (37)
```
| addr_space | Meaning |
|------------|---------|
| 0 | Local resource |
| 1 | HBM memory |
##### 2.3.1 HBM Window (addr_space = 1)
```text
[36:0] hbm_offset (37) -- 128 GB decode window
```
The architectural decode window is fixed at 128 GB. Implemented capacity
may be smaller depending on SKU/topology (see D4).
##### 2.3.2 Resource Window (addr_space = 0)
```text
[36:34] resource_kind (3)
[33: 0] kind_local (34) -- 16 GB per kind
```
| resource_kind | Meaning |
|---------------|---------|
| 000 | PE_LOCAL |
| 001 | MCPU_LOCAL |
| 010 | CUBE_SRAM |
| 011..111 | Reserved |
Each kind gets a 16 GB decode region.
##### 2.3.3 PE_LOCAL (resource_kind = 000)
```text
[33] MBZ (1)
[32:29] pe_id (4) -- 0..15
[28:25] pe_sub_unit (4)
[24: 0] sub_offset (25) -- 32 MB per slot
```
16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
| pe_sub_unit | Name | Budget |
|-------------|------|--------|
| 0 | PE_CPU_DTCM | 8 KB |
| 1 | MATH_ENGINE_DTCM | 8 KB |
| 2 | IPCQ | 256 KB |
| 3 | PE_CPU_SFR | 16 KB |
| 4 | MATH_ENGINE_SFR | 16 KB |
| 5 | DMA_ENGINE_SFR | 192 KB |
| 6 | PE_TCM | 2 MB |
| 7..15 | Reserved | -- |
##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
```text
[33:30] MBZ (4)
[29:25] mcpu_sub_unit (5)
[24: 0] sub_offset (25) -- 32 MB per slot
```
1 GB active decode.
| mcpu_sub_unit | Name | Budget |
|---------------|------|--------|
| 0 | MCPU_ITCM | 512 KB |
| 1 | MCPU_DTCM | 512 KB |
| 2 | IPCQ | 256 KB |
| 3 | MCPU_SFR | 8 KB |
| 4 | MCPU_DMA_SFR | 16 KB |
| 5 | MCPU_SRAM | 10 MB |
| 6..31 | Reserved | -- |
##### 2.3.5 CUBE_SRAM (resource_kind = 010)
```text
[33:25] MBZ (9)
[24: 0] sram_offset (25) -- flat 32 MB
```
#### 2.4 IOCHIPLET Die Layout
Only the lower 1 TB of the 4 TB die-local window is assigned.
```text
[41:40] MBZ (2)
[39: 0] chiplet_offset (40) -- 1 TB
```
Region split by address range:
| Range | Meaning | Decode condition |
|-------|---------|------------------|
| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
##### 2.4.1 IOCPU Region
```text
[30:27] iocpu_sub_unit (4)
[26: 0] sub_offset (27) -- 128 MB per slot
```
16 x 128 MB slots. 2 GB active decode.
| iocpu_sub_unit | Name | Budget |
|----------------|------|--------|
| 0 | IOCPU_ITCM | 512 KB |
| 1 | IOCPU_DTCM | 512 KB |
| 2 | IPCQ | 2 MB |
| 3 | IOCPU_SFR | 8 KB |
| 4 | IO_DMA_SFR | 16 KB |
| 5 | IO_SRAM | 64 MB |
| 6..15 | Reserved | -- |
##### 2.4.2 UAL Region
Sub-layout TBD (separate ADR).
#### 2.5 Addressing Rules
1. MBZ bits must be zero. An address with non-zero MBZ bits is
**architecturally invalid**. Implementation may raise a decode fault
or return an error -- behavior is not prescribed by this ADR.
2. Fixed slot sizes are chosen for simple hardware decode; actual
implemented capacity may be smaller than the slot.
3. Access beyond a sub-unit's implemented budget within a slot is
**architecturally invalid** (same policy as MBZ).
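
Rule 1 can be sketched for the AHBM die layout as follows. The function names are hypothetical, and treating a reserved `resource_kind` as invalid is an assumption made for this illustration. Budget checks (rule 3) and reserved sub-unit slots are not covered here.

```python
def mbz_ok(value: int, hi: int, lo: int) -> bool:
    """True when bits [hi:lo] of value are all zero."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return (value & mask) == 0


def ahbm_local_offset_valid(local_offset: int) -> bool:
    """Check MBZ bits of a 42-bit AHBM die-local offset (sections 2.3.x)."""
    if not mbz_ok(local_offset, 41, 38):   # [41:38] MBZ for every AHBM address
        return False
    addr_space = (local_offset >> 37) & 1
    if addr_space == 1:                    # HBM window: no further MBZ bits
        return True
    kind = (local_offset >> 34) & 0b111
    if kind == 0b000:                      # PE_LOCAL: bit [33] is MBZ
        return mbz_ok(local_offset, 33, 33)
    if kind == 0b001:                      # MCPU_LOCAL: [33:30] are MBZ
        return mbz_ok(local_offset, 33, 30)
    if kind == 0b010:                      # CUBE_SRAM: [33:25] are MBZ
        return mbz_ok(local_offset, 33, 25)
    return False                           # reserved kind (assumed invalid)
```

Whether a failed check raises a decode fault or returns an error is left open, as the rule states.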
### D3. Bitfield decoding is deterministic
Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
`sub_unit`, `offset`) is purely positional. No runtime state is required.
Decoding deterministically maps an integer address to destination domains:
`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
IOCPU / UAL).
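
The positional decode can be sketched with shifts and masks taken from the D2 tables. Function names and the returned dictionary shape are assumptions for illustration; the actual decoding module may expose a different API.

```python
SIP_SHIFT, SIP_BITS = 47, 4
DIE_SHIFT, DIE_BITS = 42, 5
LOCAL_BITS = 42
AHBM_KINDS = {0b000: "PE_LOCAL", 0b001: "MCPU_LOCAL", 0b010: "CUBE_SRAM"}


def bits(value: int, shift: int, width: int) -> int:
    """Extract `width` bits of `value` starting at bit `shift`."""
    return (value >> shift) & ((1 << width) - 1)


def die_kind(die_id: int) -> str:
    """Classify die_id per the section 2.2 allocation table."""
    if die_id <= 15:
        return "AHBM"
    if die_id <= 20:
        return "IOCHIPLET"
    return "RESERVED"


def target_kind(die: str, local_offset: int) -> str:
    """Map a die-local offset to its destination kind."""
    if die == "AHBM":
        if bits(local_offset, 37, 1) == 1:
            return "HBM"
        return AHBM_KINDS.get(bits(local_offset, 34, 3), "RESERVED")
    if die == "IOCHIPLET":
        return "IOCPU" if local_offset < 0x8000_0000 else "UAL"
    return "RESERVED"


def decode(addr: int) -> dict:
    """Purely positional decode; no topology or runtime state is consulted."""
    if not 0 <= addr < (1 << 51):
        raise ValueError("address does not fit in 51 bits")
    die_id = bits(addr, DIE_SHIFT, DIE_BITS)
    local_offset = bits(addr, 0, LOCAL_BITS)
    kind = die_kind(die_id)
    return {
        "sip_id": bits(addr, SIP_SHIFT, SIP_BITS),
        "die_id": die_id,
        "die_kind": kind,
        "target": target_kind(kind, local_offset),
        "local_offset": local_offset,
    }
```

The sketch does not enforce MBZ bits; that validation is a separate step.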
### D4. Capacity validation may depend on topology config
Whether a decoded address falls within **implemented capacity** (e.g.,
HBM 96 GB on a specific SKU) is checked against topology parameters
provided via DI/config. Decode itself (D3) never consults topology --
only validation does. These parameters must live in the topology/config
layer, not in node implementations.
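
The split between decode and validation might be sketched as below. `TopologyLimits` and `hbm_access_in_capacity` are hypothetical names, and reading "96 GB" as 96 GiB is an assumption of this example.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class TopologyLimits:
    """Hypothetical config object injected from the topology layer."""
    hbm_bytes_per_die: int


def hbm_access_in_capacity(hbm_offset: int, size_bytes: int,
                           limits: TopologyLimits) -> bool:
    """True when [hbm_offset, hbm_offset + size_bytes) fits implemented HBM."""
    if hbm_offset < 0 or size_bytes <= 0:
        return False
    return hbm_offset + size_bytes <= limits.hbm_bytes_per_die


sku_limits = TopologyLimits(hbm_bytes_per_die=96 * GIB)
```

An offset of 100 GiB still decodes, since it lies inside the 128 GB window, but fails validation against `sku_limits`.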
### D5. Routing consumes decoded domains, not raw bits
Routing policy uses decoded domains:
- `src` location (sip / die / pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency
Routing must not inspect raw bit-fields directly except inside the
decoding module.
## Alternatives Considered
1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice,
consumes 4 bits that enable die-local expansion to 42 bits
(IOCHIPLET 1 TB).
2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
Freed rack_id bits enable 42-bit local_offset.
3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
spanning)**: Rejected -- complicates D3 (deterministic decoding).
Uniform 4 TB window with MBZ padding is simpler.
4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
leads to duplicated logic, inconsistent routing, and hidden
assumptions.
5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
Rejected -- violates SPEC R3 and breaks swappability.
6. **Put decoding inside memory controllers or routers**: Rejected --
leaks policy into components, violates SPEC R4 / D5.
## Consequences
### Positive
- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
slots, reserved MBZ bits.
- DI-first: decoder can be swapped without changing components (SPEC R4).
### Tradeoffs
- Sparse address holes due to power-of-2 slot alignment.
- Large reserved/MBZ regions (intentional for future extension).
- Requires explicit configuration for topology-derived sizes (D4).
- Introduces a single "blessed" decoding module that must remain stable
and well-tested.
## Supersedes
- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
fulfill ADR-0031's stated goals.
## Implementation Notes (Non-normative)
- Recommended module: `src/kernbench/policy/address/phyaddr.py`
- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
values, backward compatibility of factory APIs.
- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
`cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
parameter renamed to `die_id`.
- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
`iocpu_resource_addr`, `ual_addr`.
## Appendix A. Address Examples
### A.1 AHBM HBM access
sip=2, die=5, HBM offset=0x1000
```text
sip_id = 2 -> [50:47] = 0b0010
die_id = 5 -> [46:42] = 0b00101
addr_space = 1 -> [37] = 1 (HBM)
hbm_offset = 0x1000 -> [36:0]
51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
```
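
As a quick check of the A.1 arithmetic, the expression can be evaluated and the fields read back. This snippet is an illustration and not part of the decoding module.

```python
addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
print(hex(addr))  # 0x1142000001000

# Read the fields back with the D2 bit positions.
sip_id = (addr >> 47) & 0xF
die_id = (addr >> 42) & 0x1F
addr_space = (addr >> 37) & 0x1
hbm_offset = addr & ((1 << 37) - 1)
```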
### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
```text
sip_id = 0 -> [50:47] = 0
die_id = 0 -> [46:42] = 0
addr_space = 0 -> [37] = 0
resource_kind = 0 -> [36:34] = 000 (PE_LOCAL)
pe_id = 3 -> [32:29] = 0011
pe_sub_unit = 6 -> [28:25] = 0110 (PE_TCM)
sub_offset = 0x400 -> [24:0]
local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
```
### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
```text
sip_id = 1 -> [50:47] = 0001
die_id = 3 -> [46:42] = 00011
addr_space = 0 -> [37] = 0
resource_kind = 1 -> [36:34] = 001 (MCPU_LOCAL)
mcpu_sub_unit = 5 -> [29:25] = 00101 (MCPU_SRAM)
sub_offset = 0 -> [24:0] = 0
local_offset = (1 << 34) | (5 << 25)
```
### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
```text
sip_id = 1 -> [50:47] = 0001
die_id = 17 -> [46:42] = 10001 (IOCHIPLET[1])
iocpu_sub_unit = 2 -> [30:27] = 0010 (IPCQ)
sub_offset = 0x20000 -> [26:0]
chiplet_offset = (2 << 27) | 0x20000
(< 0x8000_0000 -> IOCPU region)
```
### A.5 IOCHIPLET -- UAL region, offset=4 GB
```text
sip_id = 0 -> [50:47] = 0
die_id = 16 -> [46:42] = 10000 (IOCHIPLET[0])
chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
```
## Links
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
R5 (multi-domain comm)
- ADR-0031: Superseded
@@ -0,0 +1,102 @@
# ADR-0002: Routing Distance, Ordering & Bypass Rules
## Status
Accepted
## Date
2026-02-27
## Context
The KernBench Graph Latency Simulator must compare kernel execution time
across different architectures and topologies by computing end-to-end
latency from graph traversal.
To support meaningful comparison:
- routing must be deterministic
- latency must reflect actual interconnect structure
- local vs remote traffic must be distinguishable
- “bypass” optimizations must not undermine debuggability or correctness
The simulator also aims to avoid software-managed metadata and hidden
shortcuts that obscure control paths.
## Decision
### D1. Distance is accumulated latency, not hop count
- Routing “distance” is defined as the **sum of per-node and per-link latency**.
- Hop count alone must not be used for ordering or path selection.
- Size-aware serialization latency (bytes / BW) contributes to distance.
### D2. Routing order is derived from graph traversal
- The chosen route is the path with minimum accumulated latency
given the constructed graph and routing policy.
- Deterministic ordering must be guaranteed for identical inputs
(topology + policy + request).
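
D1 and D2 together suggest a shortest-path search over accumulated latency. The sketch below uses Dijkstra's algorithm as a stand-in; the ADR does not name the algorithm. It assumes per-node latency has been folded into edge weights, and `min_latency_path` is a hypothetical name.

```python
import heapq


def min_latency_path(graph, src, dst):
    """graph: {node: [(neighbor, latency_ns), ...]}; returns (latency, path).

    Heap entries are (distance, path) tuples, so equal-latency candidates
    are ordered by their path tuples and identical inputs always yield
    the same route.
    """
    heap = [(0.0, (src,))]
    settled = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return dist, list(path)
        if node in settled:
            continue
        settled.add(node)
        for neighbor, latency in sorted(graph.get(node, [])):
            if neighbor not in settled:
                heapq.heappush(heap, (dist + latency, path + (neighbor,)))
    raise ValueError(f"no path from {src} to {dst}")


mesh = {
    "pe_dma": [("r0", 1.0)],
    "r0": [("hbm", 2.0), ("r1", 1.0)],
    "r1": [("hbm", 0.5)],
}
latency, route = min_latency_path(mesh, "pe_dma", "hbm")
```

In this toy graph the three-hop route via `r1` totals 2.5, which beats the two-hop route at 3.0. That is why hop count alone is not used for path selection.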
### D3. Bypass is explicit and graph-represented
- All paths must be explicitly represented in the graph and subject to latency accumulation.
- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
(HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
Local HBM access has minimal hops (switching overhead only); remote access
traverses additional routers.
- Implicit or “magic” bypass paths are disallowed.
### D4. No zero-latency end-to-end paths
- Every routed request must incur **end-to-end** latency > 0.
- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
when the fabric is distributed and distance is not meaningful at that granularity.
This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
- Fully zero-latency end-to-end paths are disallowed, except for explicit
test-only stubs clearly marked as such.
### D5. Policy vs topology responsibility split
- Topology builder:
- defines nodes and links and their latency/BW parameters
- Routing policy:
- selects among available graph paths based on decoded domains
- Routing policy must not assume missing links; missing connectivity
is a topology construction error.
### D6. No software-managed routing metadata
- Routing decisions must not rely on per-request software-managed metadata
that tracks distance, hop count, or ordering outside the graph model.
- All distance/order computation is derived from traversal itself.
## Alternatives Considered
1) **Hop-count based routing**
- Rejected: ignores heterogeneous latency/BW and misrepresents
architectural differences.
2) **Implicit local shortcuts**
- Rejected: breaks debuggability and violates traversal-based latency.
3) **Software-managed distance metadata**
- Rejected: increases control overhead and obscures routing semantics.
## Consequences
### Positive
- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
- Architecture comparisons reflect real interconnect structure.
- Routing behavior is reproducible and deterministic.
### Tradeoffs / Costs
- Graph construction must be correct and complete.
- Bypass modeling requires explicit graph representation,
which slightly increases topology description complexity.
## Implementation Notes (Non-normative)
- Recommended responsibilities:
- Graph builder: ensure all required paths exist.
- Router: select next hop based on decoded domains and policy.
- Tests should assert:
- non-zero end-to-end latency
- deterministic routing for identical inputs
- bypass paths appear explicitly in emitted traces
## Links
- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
- ADR-0001: PhysAddr layout & decoding contract
@@ -0,0 +1,68 @@
# ADR-0003: Target System Hierarchy & Modeling Scope
## Status
Accepted
## Context
We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
through switching fabrics, with a host CPU issuing commands/kernels.
## Decision
We model the system hierarchy explicitly:
### D1. Tray-level
- A compute tray contains:
- Host CPU (issues requests / coordinates runtime & data placement)
- Multiple identical SIPs (accelerators)
- Interconnect fabric between SIPs (PCIe and/or UAL via switches)
### D2. SIP-level
- A SIP is a multi-die package composed of:
- Multiple CUBEs (HBM die + compute PEs + UCIe)
- One or more IO chiplets (host/SIP interfaces)
- IO chiplets:
- provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
- can be multiple per SIP
- placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1-2 IO chiplets
### D3. CUBE-level
- A CUBE contains:
- HBM + memory controller (HBM_CTRL)
- NOC (on-die fabric): carries all intra-cube traffic including HBM data,
inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
PE↔UCIe connectivity, M_CPU↔PE command path.
NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
current implementation uses a 2D mesh with XY routing (see ADR-0017).
HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
- multiple PEs
- up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
### D4. PE-level
- A PE can execute one kernel instance
- PE contains internal control + accelerators (modeled at PE view granularity):
- PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
## Consequences
- The simulator supports abstraction by “views”:
- SIP view hides PE internals
- CUBE view treats each PE as a single block
- PE view expands PE internals
- Topology remains parameterized; sizes/counts/links come from configuration.
## Links
- SPEC R3/R5
- ADR-0005 (diagram views)
- ADR-0017 (cube NOC 2D mesh architecture)
@@ -0,0 +1,76 @@
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
## Status
Accepted
## Context
Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
## Decision
### D1. Local HBM definition
- Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PE's
router in the NOC mesh (ADR-0017 D4).
- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
### D2. Local HBM bandwidth guarantee contract
- Accesses from a PE to its local HBM MUST guarantee full effective HBM
read/write bandwidth independent of intervening fabric bandwidth limits.
- Effective HBM bandwidth = spec bandwidth x efficiency factor.
The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
- The topology builder applies the efficiency factor to router-to-hbm edge
bandwidth at graph construction time, so all downstream routing and latency
computation uses the effective value.
- This guarantee is modeled by:
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
- while still incurring non-zero latency along explicitly modeled components.
- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
is consolidated in ADR-0033 (Latency Model: Assumptions and Known
Simplifications). The aggregate BW guarantee here remains the contract;
ADR-0033 documents how the per-PC model realizes it and which scheduler
effects are intentionally simplified.
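
The effective-bandwidth arithmetic in D2 can be sketched as follows. The function names are hypothetical, and the unit convention (1 GB/s equals 1 byte per nanosecond, with GB as 10^9 bytes) is an assumption of this example.

```python
def effective_bw_gbps(spec_gbps: float, efficiency: float = 0.8) -> float:
    """Effective HBM bandwidth = spec bandwidth x efficiency factor."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_gbps * efficiency


def serialization_ns(size_bytes: int, bw_gbps: float) -> float:
    """Size-aware serialization latency; 1 GB/s is 1 byte/ns."""
    return size_bytes / bw_gbps


edge_bw = effective_bw_gbps(256.0)            # 256 GB/s spec x 0.8
transfer_ns = serialization_ns(2048, edge_bw)
```

With the default factor, a 2048-byte transfer over the router-to-HBM edge takes about 10 ns of serialization time in this model.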
### D3. Remote PE HBM semantics (intra-cube)
- A PE that accesses another PE's local HBM traverses the NOC:
- PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
- NOC bandwidth and hop count may limit remote HBM access relative to local access.
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
- NOC bandwidth within the cube,
- inter-cube UCIe links,
- inter-SIP fabric (PCIe/UAL).
- These paths MUST be explicit and traceable.
### D5. Shared SRAM semantics
- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
- Access path: PE_DMA → NOC → shared SRAM.
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
## Verification Notes
Tests should cover:
- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- remote PE HBM case: latency includes mesh hop traversal
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
- shared SRAM case: access via NOC with correct BW
## Links
- SPEC R2/R5
- ADR-0002 (distance/order & explicit bypass)
- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
@@ -0,0 +1,186 @@
# ADR-0005: Diagram Views & Distance-Aware Layout Rules
## Status
Accepted
## Context
We require verifiable and inspectable system modeling for a large-scale,
parameterized AI Accelerator system.
Humans must be able to:
- visually inspect the modeled topology,
- reason about communication structure and relative distance,
- do so at multiple abstraction levels without being overwhelmed by detail.
The simulator models distance (accumulated latency) as a first-class concept.
Diagrams must reflect this distance by default.
---
## Decision
### D1. Global Defaults
- All diagrams MUST be **distance-aware by default**.
- All diagrams MUST render **representative views** of the architecture.
- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
- Instance indices MAY be used ONLY:
- to define a distance anchor in asymmetric or debugging scenarios, or
- when explicitly requested.
---
### D2. Representative Rendering Rule
- All CUBEs share the same internal structure.
- All PEs share the same internal structure.
Therefore:
- SIP-level diagrams render representative CUBEs and IO chiplets.
- CUBE-level diagrams render representative PEs as opaque blocks.
- PE-level diagrams render a representative PE with fully expanded internals.
Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
unless explicitly requested.
---
### D3. Diagram Views
#### View A — SIP-Level Diagram
**Purpose**
Explain system-scale structure and connectivity.
**Visible elements**
- SIP boundaries (optional)
- CUBEs (opaque blocks)
- IO chiplets (opaque blocks)
- Optional UCIe stubs only if needed to clarify connectivity
**Hidden elements**
- PE internals
- CUBE internal fabric
- IO chiplet internals
**Visible links**
- Host ↔ IO chiplets (PCIe)
- SIP ↔ SIP (PCIe / UAL via switches)
- IO ↔ CUBE (on-package links)
---
#### View B — CUBE-Level Diagram
**Purpose**
Explain cube-internal structure and data/control flow.
**Visible elements**
- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml), all traffic routes through mesh
- HBM_CTRL attached to PE routers (local HBM = 0 hop)
- HBM subsystem (HBM_CTRL)
- Shared SRAM: cube-level shared memory
- Management CPU (M_CPU)
- PEs as opaque blocks (PE[0..N-1])
- UCIe endpoints (N/E/W/S) as ports
**Hidden elements**
- PE internals
**Visible links**
- PE → router (HBM + non-HBM data path via mesh)
- Router ↔ HBM_CTRL (local HBM access)
- Router ↔ Router (mesh hops for remote access)
- Router ↔ UCIe endpoints
- Router ↔ shared SRAM
- M_CPU ↔ router (command path)
- Router → PE_CPU (command delivery, collapsed into PE block)
---
#### View C — PE-Level Diagram
**Purpose**
Explain internal PE behavior and execution structure.
**Visible elements**
- PE_CPU
- Command handler / scheduler
- PE_TCM (local SRAM)
- HW accelerators (DMA, GEMM, MATH, etc.)
- Local HBM interface
- Optional IPCQ / messaging endpoints
**Visible links**
- Control paths (CPU → scheduler → engines)
- Data paths (engines ↔ TCM, DMA ↔ local HBM)
- External fabric ports as abstract ports only
---
### D4. Distance-Aware Layout (Default)
#### Distance definition
- Distance is defined as **accumulated latency**, consistent with ADR-0002.
- Distance is computed from a single anchor node.
#### Default anchor selection
- SIP view: IO chiplet (or Host CPU if present)
- CUBE view: a representative PE
- PE view: PE_CPU or Command Handler
Anchors are **implicit defaults** and MUST NOT be required to be specified.
#### Layout rules
- Diagrams MUST be laid out in layers based on distance buckets.
- Layout direction MUST be consistent within a view type
(preferred: left-to-right).
- Nodes with equal distance MUST have stable ordering
(by role or identifier, deterministically).
Cycles MAY be rendered using dashed or curved edges for readability,
without affecting distance semantics.
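
The layering rules above might be sketched as a bucketing step. The bucket width, the function name `layout_layers`, and sorting by node name within a layer are assumptions; the ADR only requires that ordering be stable and deterministic.

```python
from collections import defaultdict


def layout_layers(distances: dict, bucket_ns: float) -> list:
    """Group nodes into layers by distance bucket; sort names within a layer."""
    layers = defaultdict(list)
    for node, dist in distances.items():
        layers[int(dist // bucket_ns)].append(node)
    # Layers run from the anchor outward (left to right in the diagram).
    return [sorted(layers[key]) for key in sorted(layers)]


cube_view = layout_layers(
    {"pe": 0.0, "router": 1.0, "hbm_ctrl": 3.0, "sram": 3.5, "ucie_n": 7.0},
    bucket_ns=2.0,
)
```

Because the result depends only on the distance values and node names, reordering the input dictionary does not change the layout.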
---
### D5. Generation Contract (for Tools / Claude Code)
When generating diagrams:
- Assume distance-aware layout by default.
- Assume representative rendering by default.
- Do NOT ask for SIP/CUBE/PE indices unless required.
- Do NOT expand hidden abstraction levels.
- Prefer architectural clarity over micro-hop fidelity.
---
## Consequences
- Diagrams are stable across topology scaling.
- Changes in distance or routing policy are reflected visually.
- Diagrams serve as verifiable artifacts derived from the simulator model,
not as hand-maintained documentation.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Routing distance semantics)
- ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
## Status
Accepted
## Context
The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
and computes routing and accumulated latency (distance).
Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
hand-maintained topology drawings.
Additionally, for usability, diagrams should be emitted automatically into a stable location
so that developers can preview them immediately in the repository.
---
## Decision
### D1. Topology compilation is the single source of truth
- topology.yaml (or equivalent config) is compiled into:
- an explicit system graph,
- node/link attributes,
- routing policies.
This compiled graph is the authoritative representation of the system.
### D2. Distance extraction during compilation
- During or immediately after topology compilation, the simulator MUST compute distance metadata
(accumulated latency) consistent with ADR-0002.
- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
layout placement for such nodes uses explicit position metadata rather than distance buckets.
### D3. Diagram generation is a derived artifact
- Diagrams MUST be generated from:
  - the compiled topology graph,
  - extracted distance metadata,
  - view/layout rules defined in ADR-0005.
- Diagram generation MUST NOT require additional hand-written topology descriptions.
### D4. Automatic diagram emission to the repository
- As part of topology compilation, the implementation MUST produce the following diagrams by default:
  - SIP-level diagram (representative, distance-aware)
  - CUBE-level diagram (representative, distance-aware)
  - PE-level diagram (representative, distance-aware)
- The default output directory is:
  - `docs/diagrams/`
- The generator MUST overwrite or update existing files only when the compiled topology or the diagram rules change.
### D5. View-specific projection and layout
For each view (SIP / CUBE / PE):
- The generator MUST project the compiled graph into a reduced view graph:
  - hide/collapse nodes according to ADR-0005,
  - preserve connectivity semantics relevant to that view,
  - compute distance buckets and assign layout layers deterministically.
- CUBE-level projection MUST include:
  - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
  - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
### D6. Output formats and determinism
- The generator MUST output at least one of:
  - Mermaid (Markdown-native)
  - Graphviz DOT (rank-based control)
  - SVG (mm-accurate layout, no external dependencies)
- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
- Output MUST be deterministic:
  - same topology + same rules → identical diagram text
- File naming MUST be deterministic and stable (see "Output Conventions").
### D7. Performance and caching
- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
remain consistent with the compiled topology.
- The implementation SHOULD use a cache key based on:
  - topology content hash,
  - routing policy version,
  - diagram rules version,
  - view type (SIP/CUBE/PE).
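One way to combine the four cache-key inputs is shown below. This is a sketch under stated assumptions: SHA-256 is chosen here only for illustration, and `diagram_cache_key` with its parameter names is hypothetical rather than an API of the simulator. Each field is length-prefixed so that adjacent fields cannot run together and collide.

```python
import hashlib


def diagram_cache_key(topology_bytes, routing_policy_version,
                      diagram_rules_version, view):
    """Return a hex digest covering all four cache-key inputs."""
    digest = hashlib.sha256()
    parts = (
        hashlib.sha256(topology_bytes).hexdigest(),  # topology content hash
        routing_policy_version,
        diagram_rules_version,
        view,  # "SIP", "CUBE", or "PE"
    )
    for part in parts:
        encoded = part.encode("utf-8")
        # Length prefix: ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


key = diagram_cache_key(b"topology contents", "routing-v1", "rules-v1", "SIP")
```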
---
## Output Conventions
### Directory
- `docs/diagrams/` is the canonical output directory for generated diagrams.
### File names (recommended, deterministic)
- `system_view.svg` / `system_view.mmd` / `system_view.dot`
- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
Optionally, for multi-topology workflows:
- `sip_view__{topology_id}.svg`
- `cube_view__{topology_id}.svg`
- `pe_view__{topology_id}.svg`
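The naming convention above can be expressed as a small helper. This is an illustrative sketch: `diagram_filename` is a hypothetical name, and it allows the `__{topology_id}` suffix for every view and format, whereas the list above only shows it for the SVG files of the SIP, CUBE, and PE views.

```python
VIEWS = ("system", "sip", "cube", "pe")
FORMATS = ("svg", "mmd", "dot")


def diagram_filename(view, fmt, topology_id=None):
    """Build a deterministic diagram file name for docs/diagrams/."""
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    stem = f"{view}_view"
    if topology_id is not None:
        stem = f"{stem}__{topology_id}"
    return f"{stem}.{fmt}"


name = diagram_filename("cube", "svg", topology_id="t1")
```

The same inputs always produce the same name, which keeps diffs of committed diagrams stable.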
### Repository policy
- Generated diagram files MAY be committed to the repository to enable diff-based review.
- If committed, they MUST be reproducible from topology compilation.
---
## Consequences
- Diagrams are always consistent with simulator behavior.
- Architectural changes automatically propagate to visualizations.
- Diagram diffs become meaningful indicators of architectural change.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Distance semantics)
- ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,95 @@
# ADR-0007: Runtime API and Simulation Engine Boundaries
## Status
Accepted
## Context
The simulator consists of multiple layers with distinct responsibilities:
- a host-facing API layer used by benchmarks and user code,
- a discrete-event simulation engine that executes requests,
- device components that model hardware behavior.
Without strict boundaries, orchestration logic can leak into components,
or simulation internals can become entangled with user-facing APIs.
This ADR defines clear responsibility boundaries between:
- runtime API,
- simulation engine (sim_engine),
- hardware components.
---
## Decision
### D1. Runtime API is host-facing orchestration only
The runtime API represents host/driver-level behavior and MUST:
- expose high-level operations (tensor deployment, kernel launch),
- submit requests only to endpoint components (e.g., IO_CPU),
- await completion via futures/handles,
- own and persist host-side metadata (tensor allocation maps, kernel bindings).
The runtime API MUST NOT:
- hardcode hop-by-hop routing or fan-out,
- directly invoke internal components (M_CPU, PE_CPU, engines),
- embed topology- or routing-specific assumptions.
---
### D2. Simulation engine wires components and tracks completion
The simulation engine (sim_engine) MUST:
- wire components at initialization (create port stores + start wire
processes per the component port/wire framework — ADR-0015),
- inject requests into the compiled topology graph at entry components
(e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
- schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking.
The simulation engine MUST NOT:
- define tensor semantics,
- define kernel execution policies,
- expose internal graph details to the runtime API,
- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out (components own this).
---
### D3. Components own fan-out and aggregation
Device-side components MUST:
- fan-out requests to downstream domains
(IO_CPU → M_CPU → PE_CPU → schedulers/engines),
- aggregate completion and failure signals,
- propagate results deterministically upstream.
Neither the runtime API nor the simulation engine may orchestrate
component-level fan-out explicitly.
---
## Consequences
- Runtime APIs remain stable as topology and routing evolve.
- Simulation internals can change without affecting user-facing code.
- Component implementations remain swappable via dependency injection (DI).
---
## Links
- SPEC R4, R7, R8
- ADR-0008 (Tensor deployment)
- ADR-0009 (Kernel execution)
- ADR-0015 (Component port/wire model and engine role)
- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
@@ -0,0 +1,100 @@
# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
## Status
Accepted
## Context
Benchmarks require PyTorch-like tensor semantics:
- tensor creation (empty, fill),
- deployment to accelerator devices (tensor.to()).
In a realistic system, host software manages allocation/mapping and installs
mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
- device memory operations use PA only,
- VA/MMU/IOMMU is not modeled.
To keep the host↔device interface minimal, we avoid a separate
AllocateTensorMeta message. Instead, host allocation produces a PA shard map
that is used directly by MemoryWrite/Read and KernelLaunch.
---
## Decision
### D1. Tensor is a host-owned handle with PA shard mapping
A Tensor object is a host-owned handle that encapsulates:
- shape and dtype,
- initialization intent,
- device placement and allocation metadata as a PA shard map.
After deployment, the Tensor handle MUST contain:
- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
This PA shard mapping is the single source of truth for kernel argument binding.
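A shard entry with the six fields listed above could look like the sketch below. The class name `ShardEntry` and the helper `tiles_exactly` are hypothetical, and the tiling check assumes a pure split placement (replicated shards would legitimately overlap in `offset_bytes`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardEntry:
    sip: int
    cube: int
    pe: int
    pa: int            # physical address of the shard on that PE
    nbytes: int        # shard size in bytes
    offset_bytes: int  # offset of the shard within the logical tensor


def tiles_exactly(shards, total_nbytes):
    """True if split shards cover [0, total_nbytes) with no gap or overlap."""
    cursor = 0
    for shard in sorted(shards, key=lambda s: s.offset_bytes):
        if shard.offset_bytes != cursor:
            return False
        cursor += shard.nbytes
    return cursor == total_nbytes


shard_map = [
    ShardEntry(sip=0, cube=0, pe=0, pa=0x1000, nbytes=2048, offset_bytes=0),
    ShardEntry(sip=0, cube=0, pe=1, pa=0x1000, nbytes=2048, offset_bytes=2048),
]
```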
---
### D2. Deployment uses a host allocator (Phase 0)
In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
- placement (split/replicate/hybrid) is decided by a DP policy,
- allocation assigns PA ranges at the PE level and returns shard mappings,
- the Tensor handle stores the resulting shard list deterministically.
No separate host-visible device allocation RPC is required in Phase 0.
---
### D3. Data initialization and transfer uses MemoryWrite/Read only
Any data initialization or transfer implied by a tensor (e.g., fill, copy)
MUST be represented using Host ↔ IO_CPU messages only:
- MemoryWrite
- MemoryRead
Rules:
- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
- Allocation metadata MUST NOT be embedded as a separate allocation message.
- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
The simulation engine schedules MemoryWrite/Read through the graph so that
latency is computed by explicit traversal.
---
### D4. Extension path (non-breaking)
Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
- virtual addressing in tensor handles,
- mapping install steps,
- translation latency/page granularity.
The Phase 0 PA shard map remains a valid fast-path configuration.
---
## Consequences
- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
- KernelLaunch can pass per-PE data placement explicitly via shard tags.
- Early implementation stays simple and testable.
---
## Links
- ADR-0011 (Memory Addressing — PA / VA / LA)
- ADR-0012 (Host↔IO_CPU schema)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (Kernel execution)
@@ -0,0 +1,146 @@
# ADR-0009: Kernel Execution Messaging and Completion Semantics
## Status
Accepted
## Context
Kernel execution is initiated by the host and proceeds through
device control components:
Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
Completion propagates in reverse order.
To keep benchmarks simple and topology-agnostic,
kernel execution must be endpoint-driven with deterministic aggregation.
---
## Decision
### D1. Kernel launch is an endpoint request
A kernel launch is initiated by submitting a single KernelLaunch request
to the IO_CPU endpoint.
The runtime API MUST:
- construct the kernel launch request,
- submit it to IO_CPU,
- await a single completion result.
The runtime API MUST NOT orchestrate internal fan-out.
---
### D2. Tensor arguments are passed by metadata
KernelLaunch requests MUST reference tensor arguments via:
- host-owned tensor handles, or
- resolved device address maps derived from those handles.
Bulk tensor data MUST NOT be embedded in kernel launch messages.
---
### D3. Fan-out and aggregation are component responsibilities
- IO_CPU fans out work to M_CPUs.
- M_CPU fans out work to PE_CPUs.
- PE_CPU manages kernel execution and engine dispatch.
Completion semantics:
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
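The completion rule can be sketched as a small aggregator that an M_CPU (over PEs) or an IO_CPU (over CUBEs) might hold. This is a minimal sketch: `CompletionAggregator` is a hypothetical name, and the fail-fast behavior stands in for "a failure policy", which this ADR does not define in detail.

```python
class CompletionAggregator:
    """Track completion of a fixed set of downstream targets."""

    def __init__(self, targets, fail_fast=True):
        self._pending = set(targets)
        self._failed = []
        self._fail_fast = fail_fast
        self.status = "pending"  # "pending" | "done" | "failed"

    def report(self, target, ok):
        """Record one child's result and return the aggregate status."""
        if self.status != "pending":
            return self.status  # already resolved; late reports are ignored
        if target not in self._pending:
            raise KeyError(f"unexpected or duplicate target: {target!r}")
        self._pending.discard(target)
        if not ok:
            self._failed.append(target)
            if self._fail_fast:
                self.status = "failed"
                return self.status
        if not self._pending:
            self.status = "failed" if self._failed else "done"
        return self.status


m_cpu = CompletionAggregator(["pe0", "pe1", "pe2"])
```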
---
### D4. Completion and failure propagation
- All messages MUST carry correlation identifiers.
- Completion and failure MUST propagate deterministically to the host.
- The simulation engine provides futures/handles to observe completion.
---
### D5. Launch timing is endpoint-synchronized
All PEs targeted by a single kernel launch MUST begin executing the kernel
body at the same simulated time, regardless of their dispatch path length
from the launch entry point.
Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
latency at every level. PEs near their M_CPU receive the launch earlier
than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
farther away. Without synchronization, each PE's kernel begins at a
different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
of dispatch-path geometry rather than of the kernel's behavior —
producing measurement artifacts in benchmarks that time kernel-internal
waits (for example `tl.recv` on cross-cube or cross-SIP hops).
Mechanism.
- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
computes `target_start_ns = env.now + max_latency` where
`max_latency` is the maximum, over every target (sip, cube, pe)
tuple, of the **two-leg dispatch chain**:
```
max_latency(sip, cube, pe) =
    compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
  + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
  - io_cpu.overhead_ns
  - m_cpu.overhead_ns
```
This models the actual dispatch as **two sequential Transactions**
(IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
`compute_path_latency_ns` adds its endpoints' `overhead_ns`;
`io_cpu.overhead_ns` is subtracted because IO_CPU has already
paid it before this method runs, and `m_cpu.overhead_ns` is
subtracted once because it appears as the endpoint of leg 1 *and* the
start of leg 2 but is paid only once at run time. A single
`find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
it can pick a graph path that bypasses M_CPU and silently
under-shoots the prediction for far cubes, breaking the D5
invariant.
The fanned-out sub-Transactions carry **`nbytes = 0`** for
`KernelLaunchMsg` (control message only). Without this,
large kernel-launch payloads would occupy fabric BW on the
shared first hop and serialize the per-cube dispatch, pushing
far M_CPUs past `target_start_ns` and re-introducing the
late-arrival violation.
- **M_CPU** passes an already-stamped `target_start_ns` through
unchanged. Only when the value is absent (e.g. a direct
launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
`env.now + max(local command-path latency)`.
- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
of `_execute_kernel`, before recording `pe_exec_start` and invoking
the kernel body.
- When `target_start_ns is None`, PE_CPU falls through to the legacy
unsynchronized behavior — preserving backward compatibility.
IO_CPU-level stamping guarantees every PE across every targeted cube
uses the same barrier sim-time, eliminating both the within-cube
dispatch-offset artifact *and* the cross-cube offset artifact in
multi-cube launches. This models a real-hardware timed-broadcast launch
(latency-equalized dispatch tree).
The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
control plane — runtime API and application kernels are unchanged.
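The stamping arithmetic of D5 can be checked with plain numbers. This is a sketch under stated assumptions: the per-leg latencies are passed in as precomputed values standing in for `compute_path_latency_ns(find_node_path(...))`, the function names are hypothetical, and clamping a late arrival to a zero wait is an assumption of this sketch rather than behavior stated in this ADR.

```python
def chain_latency_ns(leg1_ns, leg2_ns, io_cpu_overhead_ns, m_cpu_overhead_ns):
    """Two-leg dispatch latency for one (sip, cube, pe) target.

    leg1_ns covers IO_CPU -> M_CPU and leg2_ns covers M_CPU -> PE_CPU.
    Each leg includes its endpoints' overhead, so the IO_CPU overhead
    (already paid) and the doubly counted M_CPU overhead are subtracted.
    """
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def stamp_target_start_ns(now_ns, legs_by_target,
                          io_cpu_overhead_ns, m_cpu_overhead_ns):
    """Barrier time: now plus the slowest target's chain latency."""
    max_latency = max(
        chain_latency_ns(leg1, leg2, io_cpu_overhead_ns, m_cpu_overhead_ns)
        for leg1, leg2 in legs_by_target.values()
    )
    return now_ns + max_latency


def pe_wait_ns(now_ns, target_start_ns):
    """Delay a PE_CPU would wait before starting the kernel body."""
    if target_start_ns is None:
        return 0.0  # legacy unsynchronized behavior
    # Assumption of this sketch: a late arrival starts immediately.
    return max(0.0, target_start_ns - now_ns)


# Hypothetical leg latencies keyed by (sip, cube, pe).
legs = {(0, 0, 0): (40.0, 25.0), (0, 1, 3): (90.0, 35.0)}
target = stamp_target_start_ns(100.0, legs,
                               io_cpu_overhead_ns=10.0, m_cpu_overhead_ns=5.0)
```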
---
## Links
- SPEC R1, R2, R7, R8
- ADR-0007 (Runtime API boundaries)
- ADR-0008 (Tensor deployment)
- ADR-0013 (Verification strategy — V2 fan-out tests)
- ADR-0015 D4 (concrete fabric path for kernel launch)
@@ -0,0 +1,131 @@
# ADR-0010: Command Line Interface and Execution Semantics
## Status
Accepted
## Context
The `kernbench` CLI is the user-facing entry point of the simulator. It
exposes three subcommands:
- `run` — execute a benchmark against a topology.
- `probe` — diagnostic utility for latency / BW measurement.
- `web` — interactive topology viewer.
Device enumeration is centralized in the CLI; neither the runtime API
nor the simulation engine enumerates devices. Benchmarks remain
single-device by design and accept a device identifier as input.
## Decision
### D1. Benchmark contract — single-device by design
- A benchmark MUST define behavior for a single device only.
- A benchmark MUST accept a device identifier as input.
- Benchmarks MUST NOT enumerate or loop over multiple devices.
Multi-device execution is the CLI's concern (D3), not the benchmark's.
### D2. `kernbench run` — benchmark execution
Required arguments:
- `--topology <path>`: topology YAML file path. Loaded via
`resolve_topology()`.
- `--bench <name>`: benchmark name. Resolved via
`benches.loader.resolve_bench()`.
Optional arguments:
- `--device <selector>` (default: `all`):
  - `all` — run once per discovered SIP (see D3).
  - `sip:<N>` — run only on SIP N.
  - Parsed via `resolve_device()`.
- `--verify-data` (default: off) — enable Phase 2 data verification
(see ADR-0020). When set, `engine_factory` constructs the engine
with `enable_data=True`. After the benchmark runs, a diagnostic
summary of recorded ops is printed.
Each invocation runs within a single simulation instance; with
`--device all`, the benchmark runs once per SIP inside that instance (D3).
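The argument surface of `kernbench run` can be sketched with the standard `argparse` module. This is an illustration only: the actual CLI may be built differently, and `parse_device_selector` is a hypothetical stand-in for `resolve_device()`, whose signature is not shown in this ADR.

```python
import argparse


def build_run_parser():
    """Parser mirroring the documented `kernbench run` arguments."""
    parser = argparse.ArgumentParser(prog="kernbench run")
    parser.add_argument("--topology", required=True, help="topology YAML path")
    parser.add_argument("--bench", required=True, help="benchmark name")
    parser.add_argument("--device", default="all", help="'all' or 'sip:<N>'")
    parser.add_argument("--verify-data", action="store_true",
                        help="enable data verification")
    return parser


def parse_device_selector(selector, sip_count):
    """Return the list of SIP indices selected by `all` or `sip:<N>`."""
    if selector == "all":
        return list(range(sip_count))
    kind, sep, index = selector.partition(":")
    if kind != "sip" or not sep or not index.isdigit():
        raise ValueError(f"invalid device selector: {selector!r}")
    sip = int(index)
    if sip >= sip_count:
        raise ValueError(f"SIP {sip} not present (topology has {sip_count})")
    return [sip]


args = build_run_parser().parse_args(
    ["--topology", "topology.yaml", "--bench", "my_bench", "--device", "sip:1"]
)
selected = parse_device_selector(args.device, sip_count=2)
```

Keeping selector parsing in the CLI layer matches D6: the benchmark only ever receives one device identifier.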
### D3. Multi-device execution is logically parallel
When `--device` is `all` (or omitted) and the topology has multiple SIPs:
- Benchmark executions are submitted to a single simulation engine
instance.
- Executions are logically parallel in simulation time.
- Inter-device contention is naturally modeled (shared fabric
bandwidth, cross-SIP traffic, etc.).
The CLI does NOT spawn multiple OS processes or independent
simulation runs — parallelism is internal to one simulation instance.
### D4. `kernbench probe` — latency / BW diagnostic utility
Required argument:
- `--topology <path>`: topology YAML file path.
Optional argument:
- `--case <name>` (default: `all`) — run a predefined traffic
pattern, or `all` to run every defined case.
Probe runs each pattern through the simulation engine and reports
per case:
- End-to-end latency (ns).
- Effective bandwidth (nbytes / total_ns).
- Bottleneck bandwidth (min edge BW along the chosen path).
- Utilization (effective / bottleneck).
Probe additionally validates monotonicity invariants — for example
that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
cross-SIP — and reports violations. Probe is a developer tool for
verifying the latency / BW model; it is not a benchmark.
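The per-case metrics and the monotonicity check can be sketched as below. The function names and the case labels are hypothetical; the formulas follow the list above. Bytes per nanosecond equals GB/s (decimal), so no unit conversion is needed when bandwidths are given in GB/s.

```python
def probe_metrics(nbytes, total_ns, edge_bw_gbs):
    """Compute the four reported metrics for one probe case."""
    effective_gbs = nbytes / total_ns   # bytes/ns == GB/s
    bottleneck_gbs = min(edge_bw_gbs)   # slowest edge on the chosen path
    return {
        "latency_ns": total_ns,
        "effective_gbs": effective_gbs,
        "bottleneck_gbs": bottleneck_gbs,
        "utilization": effective_gbs / bottleneck_gbs,
    }


def monotonicity_violations(latency_ns_by_case, order):
    """Return adjacent (nearer, farther) pairs where nearer is slower."""
    return [
        (nearer, farther)
        for nearer, farther in zip(order, order[1:])
        if latency_ns_by_case[nearer] > latency_ns_by_case[farther]
    ]


metrics = probe_metrics(nbytes=4096, total_ns=64.0, edge_bw_gbs=[256.0, 128.0])
order = ["local_hbm", "cross_pe", "cross_cube", "cross_sip"]
violations = monotonicity_violations(
    {"local_hbm": 50.0, "cross_pe": 40.0, "cross_cube": 90.0, "cross_sip": 200.0},
    order,
)
```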
### D5. `kernbench web` — topology viewer
Optional arguments:
- `--port <N>` (default: `8765`) — HTTP port.
- `--no-open` — do not auto-open the browser.
Launches a local HTTP server that renders the compiled topology in
the browser. Distinct from the static `docs/diagrams/` artifacts:
- `docs/diagrams/` files are derived at topology-compile time
(ADR-0006).
- `kernbench web` is interactive — pan/zoom, hover for component
attributes, switch between SIP / CUBE / PE views.
### D6. Runtime API and simulation engine remain device-scoped
- Runtime API calls operate on one device per invocation.
- The simulation engine schedules all requests deterministically.
- Neither layer enumerates devices.
This invariant keeps each layer testable in isolation; device
enumeration and multi-device fan-out live only in the CLI's `run`
command (D3).
## Consequences
- Benchmark authors write single-device logic; multi-device behavior
emerges from the CLI dispatching across SIPs.
- Adding a new subcommand (e.g., trace export, replay) does not
require benchmark or runtime-API changes — the CLI is the
extension point.
- `probe` and `web` are diagnostic / visualization tools, not
benchmarks; they bypass the benchmark loader path.
## Links
- SPEC R7, R8, R9
- ADR-0007 (Runtime API and Simulation Engine Boundaries)
- ADR-0020 (Two-pass data execution — `--verify-data`)
- ADR-0006 (Topology compilation and diagram generation —
background for `kernbench web`)
@@ -0,0 +1,521 @@
# ADR-0011: Memory Addressing — PA / VA / LA Address Models
## Status
Accepted.
- **VA model: currently implemented (default).**
- PA model: implemented as PageFault fallback in PE_DMA.
- LA model: proposed, not implemented.
## Context
KernBench's address model evolved through three design points, each
addressing a limitation of the previous. This ADR documents all three
in one place because future implementation work selects among them.
### PA-only baseline
Phase 0 of KernBench treated all device memory operations
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
host-side virtual addressing, no MMU/IOMMU translation. Allocators
returned PA mappings; DMA requests carried PA directly.
This was sufficient for early correctness/latency work but
insufficient for running standard Triton kernels that use
`base_addr + offset` patterns on sharded tensors: each PE's shard
has a different PA, but the kernel needs a single contiguous address
space to compute offsets.
### Why VA/MMU (current default)
A realistic system uses host-side virtual addressing and an
MMU/IOMMU-style translation path for DMA: the host allocates physical
memory at PE level, maps it into a virtual address space, installs
mappings, and DMA requests use virtual addresses that are translated
to physical addresses.
Adopting this model lets kernels use `base_addr + offset` over a
contiguous VA range while the device-side MMU translates each access
to the appropriate PA.
### Why LA/BAAW (proposed)
VA/MMU treats HBM as a single backing space. KernBench needs to
explore architectures where HBM is composed of multiple pseudo
channels in parallel:
- A CUBE's HBM has 32 or 64 pseudo channels.
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
(N = `hbm_pseudo_channels / pes_per_cube`).
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
(N × per-channel).
Two channel-mapping modes need to be modelable:
- **1:1 mode** — one logical access → N per-channel requests.
Precise per-channel BW contention modelling.
- **n:1 mode (default)** — one logical access → one aggregated
request. Channels are assumed to interleave; aggregated BW model.
VA's `tl.load(va_ptr)` produces a single DMA request to a single
target. Decomposing that into per-channel requests inside PE_DMA
requires the address layer to be aware of channels. This is the
role of the LA (Logical Address) abstraction with BAAW
(Logical-to-Physical Mapping Unit).
Core requirements driving the LA design:
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
in both modes (only request shape and resource model differ).
- Kernel programming model is unchanged — physical channel
information is never exposed to kernel code.
- Mode switch is a topology-level configuration.
### Design space summary
| Model | Status | Key idea |
|-------|--------|----------|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
---
## Decision
This ADR defines three address models. At any given time the system
operates in exactly one model. Selection is driven by topology and
configuration; coexistence within one simulation run is not required.
---
### Address Model: PA (Physical Address) — fallback
#### D-PA1. PA-only semantics
- All device memory accesses (MemoryRead/MemoryWrite) operate on
device physical addresses (PA) plus size.
- PA-only mode remains functional via the PageFault fallback path in
PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
the value as a PA directly.
#### D-PA2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA
mappings sufficient to execute kernels and issue DMA requests.
The PA model is retained primarily for backward compatibility with PA-only
tests and as the underlying physical layer that VA / LA models resolve
into.
---
### Address Model: VA (Virtual Address with MMU) — current default
#### D-VA1. Virtual Address Model
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
`va_base + offset_bytes`.
- Kernels receive `va_base` as their pointer argument (via
`TensorArg.va_base`).
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
#### D-VA2. PE_MMU Component
- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
(synchronous `translate()` called by PE_DMA).
- Page-aligned dict lookup for O(1) VA → PA translation.
- `tlb_overhead_ns` configurable per-access latency.
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
directly (preserves PA model for backward compatibility).
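The page-aligned lookup and the PageFault fallback can be sketched as follows. `PageTable` and its methods are hypothetical names, not the simulator's `PE_MMU` API, and the sketch omits `tlb_overhead_ns` because it models only the address arithmetic.

```python
class PageTable:
    """Page-aligned dict lookup for VA -> PA translation."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self._pages = {}  # VA page base -> PA page base

    def map(self, va, pa, nbytes):
        """Install mappings for every page touched by [va, va + nbytes)."""
        if va % self.page_size or pa % self.page_size:
            raise ValueError("va and pa must be page-aligned")
        for offset in range(0, nbytes, self.page_size):
            self._pages[va + offset] = pa + offset

    def translate(self, va):
        """Return the PA for va; unmapped addresses are treated as PA."""
        page_base = va - (va % self.page_size)
        pa_base = self._pages.get(page_base)
        if pa_base is None:
            return va  # PageFault fallback: interpret the value as a PA
        return pa_base + (va - page_base)


table = PageTable()
table.map(va=0x1_0000_0000, pa=0x2000, nbytes=8192)
pa = table.translate(0x1_0000_0000 + 4100)
```

Because each lookup is a single dictionary access on the page base, translation cost does not depend on the number of installed mappings.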
#### D-VA3. Mapping Installation
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
end-to-end.
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
cross-SIP mapping contamination for replicated tensors.
- Mapping strategy based on `DPPolicy.cube`:
  - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
    only. Each cube's PEs see only their local PA. No cross-cube
    mapping installed.
  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
    mappings to all target cubes. Enables cross-PE and cross-cube
    DMA.
#### D-VA4. Tensor Lifecycle
- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
`weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
returns VA and PA space.
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
- `VirtualAllocator` uses free-list with coalescing for VA space.
#### D-VA5. Allocators
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
with coalescing.
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
coalescing.
- Page size configurable via `topology.yaml` `pe_mmu` attrs
(default 4096).
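A free-list allocator with coalescing, as described for both allocators, can be sketched as below. This is a generic first-fit illustration, not the code of `VirtualAllocator` or `PEMemAllocator`; the class name is hypothetical and the sketch assumes the base address is already aligned.

```python
import bisect


class FreeListAllocator:
    """First-fit allocator that merges adjacent free blocks on free()."""

    def __init__(self, base, size, align=4096):
        self.align = align
        self._free = [(base, size)]  # sorted (start, size), never adjacent
        self._live = {}              # start -> allocated size

    def alloc(self, nbytes):
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        need = -(-nbytes // self.align) * self.align  # round up to alignment
        for i, (start, size) in enumerate(self._free):
            if size >= need:
                if size == need:
                    del self._free[i]
                else:
                    self._free[i] = (start + need, size - need)
                self._live[start] = need
                return start
        raise MemoryError("out of address space")

    def free(self, start):
        size = self._live.pop(start)
        i = bisect.bisect_left(self._free, (start, size))
        # Merge with the following block if it starts where this one ends.
        if i < len(self._free) and start + size == self._free[i][0]:
            size += self._free[i][1]
            del self._free[i]
        # Merge with the preceding block if it ends where this one starts.
        if i > 0 and sum(self._free[i - 1]) == start:
            prev_start, prev_size = self._free[i - 1]
            start, size = prev_start, prev_size + size
            del self._free[i - 1]
            i -= 1
        self._free.insert(i, (start, size))

    def free_blocks(self):
        return list(self._free)


allocator = FreeListAllocator(base=0x1000, size=0x4000)
first = allocator.alloc(100)    # rounded up to one 4096-byte page
second = allocator.alloc(5000)  # rounded up to two pages
```

Unlike a bump allocator, freed ranges become reusable, and coalescing prevents the free list from fragmenting into unusably small blocks.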
#### Consequences (VA model)
- Triton kernels use `base_addr + offset` patterns naturally on
sharded tensors.
- All latency remains explicit via graph traversal, including MMU
mapping installation and per-access TLB overhead.
- PA-only mode retained as fallback (PageFault → treat as PA).
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
---
### Address Model: LA (Logical Address with BAAW) — proposed
LA replaces VA when channel-level HBM modelling is required.
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
removed artifacts). Coexistence with VA in the same run is not a goal.
#### D-LA1. LA introduction — replaces VA infrastructure
LA is the sole address space used by kernel code (`tl.load`,
`tl.store`, `tl.composite`). Properties:
- Can map a Tensor to a contiguous logical space (like VA).
- Expresses `(logical buffer + offset)`.
- Does NOT contain physical channel information directly.
- Stays as an intermediate abstraction until physical resolution.
LA address space:
| Item | Value |
|------|-------|
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |
LA is PE-local: different PEs may use the same LA value. Because their
BAAW segment tables differ, the same LA resolves to different PAs.
VA infrastructure removed when LA is adopted:
| Removed | Replacement |
|---------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` | `la_base` |
| `topology.yaml`: `pe_mmu` component entry | Removed |
#### D-LA2. Mapping mode setting
Topology-level (cube) configuration:
```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one     # one_to_one | n_to_one
    hbm_pseudo_channels: 64        # total pseudo channel count
    hbm_channels_per_pe: 8         # per-PE local channel count
    hbm_channel_bw_gbs: 32.0       # per-channel bandwidth
```
Consumed by the graph compiler (topology builder) and BAAW
initialisation.
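A consumer of this configuration might validate it and derive the per-PE bandwidth as sketched below. The function name is hypothetical, and `pes_per_cube` is passed in as a parameter because this ADR does not show where that value lives in the topology file. The consistency check follows the relation `N = hbm_pseudo_channels / pes_per_cube` used elsewhere in this ADR.

```python
def derive_channel_config(memory_map, pes_per_cube):
    """Validate the memory_map block and derive per-PE channel figures."""
    mode = memory_map["hbm_mapping_mode"]
    if mode not in ("one_to_one", "n_to_one"):
        raise ValueError(f"unknown hbm_mapping_mode: {mode!r}")
    total = memory_map["hbm_pseudo_channels"]
    per_pe = memory_map["hbm_channels_per_pe"]
    if per_pe * pes_per_cube != total:
        raise ValueError("hbm_channels_per_pe * pes_per_cube must equal "
                         "hbm_pseudo_channels")
    return {
        "mode": mode,
        "channels_per_pe": per_pe,
        # Aggregate PE bandwidth = N x per-channel bandwidth.
        "pe_bw_gbs": per_pe * memory_map["hbm_channel_bw_gbs"],
    }


config = derive_channel_config(
    {
        "hbm_mapping_mode": "n_to_one",
        "hbm_pseudo_channels": 64,
        "hbm_channels_per_pe": 8,
        "hbm_channel_bw_gbs": 32.0,
    },
    pes_per_cube=8,
)
```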
#### D-LA3. Segment and BAAW
Segments partition the LA space; each segment maps to a specific HBM
channel or channel group. Segments are created at tensor deploy time by
the runtime allocator. BAAW resolves LA → physical request(s) using the
segment table.
```python
from dataclasses import dataclass

@dataclass
class BaawSegment:
    la_base: int             # segment start LA
    la_size: int             # segment size (bytes)
    mode: str                # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int       # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]      # per-channel PA bases (len = channel_count)
    channel_ids: list[int]   # per-channel logical IDs (e.g. [0..7])
    channel_size: int        # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int         # aggregated PA base
    agg_node_id: str         # aggregated router node_id
```
Segment lifecycle:
1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
allocator. PEMemAllocator allocates per-channel PA (1:1) or
aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
with PE_DMA.
2. **Use** (kernel run): kernel `tl.load(la_ptr)` → `DmaReadCmd
(src_addr=LA)`. PE_DMA's BAAW front-end looks up the segment and
converts to PA(s).
3. **Free** (tensor free): segment removed from table; LA and PA
returned.
#### D-LA4. BAAW resolution logic
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. It is synchronous address-resolution logic executed at the
start of PE_DMA's `handle_command()`.
Input: `(LA, nbytes)`. Output:
- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
- **n:1 mode**: single `PhysicalRequest`.
```python
from dataclasses import dataclass

@dataclass
class PhysicalRequest:
    pa: int         # 51-bit Physical Address
    nbytes: int     # transfer size for this request
    dst_node: str   # target node_id (channel router or aggregated router)

# Method of the BAAW front-end inside PE_DMA (enclosing class omitted).
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]
    # one_to_one: fan out one request per channel
    requests = []
    per_ch_size = seg.channel_size
    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count  # floor division
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests
```
BAAW responsibilities:
- Convert logical access → physical request units.
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
- Compute PA and target node.
BAAW non-responsibilities:
- Performing actual data movement.
- Executing NOC routing.
- Simulating bandwidth occupation (downstream components' job).
BAAW output is directly usable by the simulator's routing and resource
model without additional address decoding.
#### D-LA5. PE_DMA `handle_command()` change
Current (VA-based) flow:
```
DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject
```
LA-based flow:
```
DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()
```
Key changes:
- MMU reference removed → BAAW resolve.
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
directly.
- 1 request → N parallel requests in 1:1 mode.
#### D-LA6. 1:1 mode detail
- One logical access → N physical requests (N = `channels_per_pe`).
- N = `hbm_pseudo_channels / pes_per_cube`.
- Each request: fully-resolved 51-bit PA, targets a specific channel
router (`{pe_prefix}.ch_r{channel_id}`).
- Per-channel link models BW contention.
- PE_DMA injects N sub-transactions concurrently.
Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
PE0 owns ch0-7.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
  la_base: 0x1_0000_0000, la_size: 4096,
  mode: "one_to_one", channel_count: 8,
  pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
  channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
  channel_size: 512,
}
BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")
PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs
```
Other N values:
- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
4 requests
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
16 requests
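The 1:1 and n:1 outcomes can be reproduced with a self-contained sketch of the resolution logic. This is illustrative only: `Segment` is a trimmed-down stand-in for `BaawSegment`, `resolve` is written as a free function rather than a PE_DMA method, and the PA values are placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


@dataclass
class Segment:
    la_base: int
    la_size: int
    mode: str  # "one_to_one" | "n_to_one"
    pa_bases: list = field(default_factory=list)
    channel_ids: list = field(default_factory=list)
    agg_pa_base: int = 0
    agg_node_id: str = ""


def resolve(seg, pe_prefix, la, nbytes):
    """Resolve one logical access into physical request(s)."""
    if not (seg.la_base <= la and la + nbytes <= seg.la_base + seg.la_size):
        raise ValueError("access falls outside the segment")
    offset = la - seg.la_base
    if seg.mode == "n_to_one":
        return [PhysicalRequest(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
    channel_count = len(seg.channel_ids)
    channel_size = seg.la_size // channel_count
    return [
        PhysicalRequest(pa_base + offset % channel_size,
                        nbytes // channel_count,
                        f"{pe_prefix}.ch_r{ch_id}")
        for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids)
    ]


LA_BASE = 0x1_0000_0000
one_to_one = Segment(
    la_base=LA_BASE, la_size=4096, mode="one_to_one",
    pa_bases=[0x10_0000 * ch for ch in range(8)],  # placeholder PAs
    channel_ids=list(range(8)),
)
n_to_one = Segment(
    la_base=LA_BASE, la_size=4096, mode="n_to_one",
    agg_pa_base=0x80_0000,                         # placeholder PA
    agg_node_id="sip0.cube0.pe0.agg_router",
)
fanned_out = resolve(one_to_one, "sip0.cube0.pe0", LA_BASE, 4096)
aggregated = resolve(n_to_one, "sip0.cube0.pe0", LA_BASE, 4096)
```

Both modes move the same 4096 bytes; only the number and target of the requests differ, which is the property the LA design requires.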
#### D-LA7. n:1 mode detail
- One logical access → one aggregated request.
- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
(e.g. 8 × 32 = 256 GB/s).
- Single queue / resource for modelling.
- No per-channel PA decomposition.
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
la_base: 0x1_0000_0000, la_size: 4096,
mode: "n_to_one",
agg_pa_base: PA_agg,
agg_node_id: "sip0.cube0.pe0.agg_router",
}
BAAW resolve result:
→ PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")
PE_DMA: 1 sub-transaction
aggregated router → hbm_ctrl link (256 GB/s)
```
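As a rough check on the two modes, the raw transfer time for the 4 KB example comes out the same either way. The sketch below assumes 1 GB/s equals 1 byte/ns, takes `channel_bw_gbs = 32` from the aggregated-link example, and ignores router latency and contention. `transfer_ns` is a hypothetical helper, not a simulator function.

```python
def transfer_ns(nbytes: int, bw_gbs: float) -> float:
    # 1 GB/s is treated as 1 byte/ns, so bytes / (GB/s) yields nanoseconds.
    return nbytes / bw_gbs


channel_bw_gbs = 32.0
channels = 8
tensor_bytes = 4096

# 1:1 mode: 8 requests of 512 bytes move in parallel, one per channel link,
# so the slowest channel determines the total.
one_to_one_ns = max(
    transfer_ns(tensor_bytes // channels, channel_bw_gbs)
    for _ in range(channels)
)

# n:1 mode: a single 4096-byte request on the aggregated link.
n_to_one_ns = transfer_ns(tensor_bytes, channels * channel_bw_gbs)
```

Both values are 16 ns under these assumptions, which is the uncontended case of the bandwidth-equivalence property listed under Test Requirements.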
#### D-LA8. Kernel model preserved
- Kernel still issues single memory ops (`tl.load`, `tl.store`,
`tl.composite`).
- LA is the address scheme exposed to kernel code.
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
- Kernel code never sees physical channel information.
#### Consequences (LA model, proposed)
Positive:
- 1:1 vs n:1 semantics live in one place (BAAW).
- Kernel abstraction preserved — no kernel code changes.
- Topology-based policy control (mode switch via yaml).
- Improved simulation-model consistency and debuggability.
- Segment-based mapping is simpler than page tables; lower overhead.
Negative:
- Full VA/MMU code refactor required.
- Request-generation path more complex (N requests in 1:1 mode).
- Reduced per-channel visibility in n:1 mode.
- VA-related tests need rewriting.
---
## Migration Path
- **PA → VA** was an extension. PA mode is retained as the PageFault
fallback inside PE_DMA. Switching does not require removing PA
code.
- **VA → LA**, if adopted, is a replacement, not coexistence. See
D-LA1 for the VA infrastructure removal list. PA fallback inside
PE_DMA may be retained orthogonally for tests.
## Alternatives Considered (LA model)
1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
Rejected: MMU's role would grow beyond translation to request
decomposition; aggregation (n:1) becomes awkward to express.
2. **Channel-aware kernel API**: kernels call per-channel load/store
directly. Rejected: abstraction leakage, portability loss, all
benchmarks need rewriting.
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
directly. Rejected: incompatible with aggregation; conversion
timing unclear; channel info leaks to kernel.
## Test Requirements
### VA model (current, regression)
- Cross-PE / cross-cube DMA paths over installed mappings.
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
- TLB-overhead-per-access timing.
- PageFault fallback path preserves PA-only behaviour.
### LA model (when implemented)
- 1:1 mode: same logical access → N per-channel requests.
- n:1 mode: same logical access → 1 aggregated request.
- Bandwidth equivalence between modes for identical workload.
- 1:1 mode: per-channel contention modelled correctly.
- n:1 mode: aggregated bandwidth correctly reflected.
- Kernel code unchanged across mode switch.
- BAAW segment install / uninstall correctness.
- Multiple tensors in distinct segments do not collide.
## Implementation Order (LA, when scheduled)
1. LA type (`policy/address/la_allocator.py`).
2. BAAW segment table (`policy/address/baaw.py`).
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
`handle_command()`).
5. RuntimeContext: LA alloc + segment install
(`runtime_api/context.py`).
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
7. Remove VA/MMU code.
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
9. Test migration:
| Test file | Action |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
| `tests/test_va_offset.py` | Replace with LA offset tests |
## Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- ADR-0014 (PE-internal execution model)
- ADR-0015 (component port/wire model)
- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
- ADR-0013 (Verification strategy — V1 PA tagging)
- SPEC R2 (latency by traversal), R10 (memory addressing)
@@ -0,0 +1,233 @@
# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
## Status
Accepted
## Context
Phase 0 uses a PA-first memory model (ADR-0011):
- memory operations use device physical addresses (PA) only,
- VA/MMU/IOMMU is not modeled.
The host-facing runtime API interacts with the device via the IO_CPU endpoint.
We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- benchmarks remain stable,
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.
We also require PE-tagging (Option A): each shard explicitly carries `(sip, cube, pe)`
so that IO_CPU can route and fan out deterministically without relying on PA decoding.
---
## Decision
### D1. Contract scope
This schema is the stable contract ONLY for Host ↔ IO_CPU.
Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
and are NOT part of this host contract in Phase 0.
---
### D2. Required message set
The runtime API MUST use only these message types for Host ↔ IO_CPU:
- MemoryWrite
- MemoryRead
- KernelLaunch
All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
with these messages.
---
### D3. Common envelope (mandatory for all requests)
All Host ↔ IO_CPU requests MUST include:
- `msg_type: str`
- `correlation_id: str`
- generated by the host
- used to match responses deterministically
- `request_id: str`
- unique within a correlation_id
- `target_device: str`
- device identifier (e.g., "sip:0")
- `timestamp_tag: str | None` (optional)
- debug tag only; MUST NOT affect determinism
All Host ↔ IO_CPU responses MUST include:
- `correlation_id: str`
- `request_id: str`
- `completion: Completion`
---
### D4. Completion schema (mandatory)
`Completion` MUST have:
- `ok: bool`
- `error_code: str | None`
- `error_message: str | None`
Rules:
- If `ok == true` then `error_code` and `error_message` MUST be null.
- If `ok == false` then `error_code` MUST be non-null.
- Completion semantics MUST be deterministic.
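The two field rules above can be enforced at construction time. The following is a minimal sketch, assuming a Python dataclass representation with `None` standing in for null. The class layout and the `is_rejected` helper are illustrative, not the runtime API's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # ok == true: both error fields must be null.
        if self.ok and (self.error_code is not None
                        or self.error_message is not None):
            raise ValueError("ok completion must not carry error fields")
        # ok == false: error_code must be non-null.
        if not self.ok and self.error_code is None:
            raise ValueError("failed completion requires error_code")


def is_rejected(**fields) -> bool:
    # Hypothetical helper: True if the field combination violates D4.
    try:
        Completion(**fields)
    except ValueError:
        return True
    return False
```

Rejecting invalid combinations in the constructor keeps malformed completions from ever reaching the host-side matching logic.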
---
### D5. MemoryWrite schema (PA-first, PE-tagged)
`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
Mandatory fields:
- common envelope fields (D3)
- destination placement tags (Option A):
- `dst_sip: int`
- `dst_cube: int`
- `dst_pe: int`
- `dst_pa: int`
- destination physical address in the destination PE's address space
- `nbytes: int`
- `src_kind: "pattern" | "host_buffer_ref"`
- Phase 0 MUST support "pattern"
- `pattern: Pattern | None`
- required if `src_kind == "pattern"`
`Pattern` (Phase 0 mandatory support):
- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
- `value: number | None`
- required for fill_*; ignored for zero
Optional fields:
- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
- `debug_label: str | None`
Notes:
- This message MUST NOT embed bulk tensor data in Phase 0.
- All latency MUST come from explicit graph traversal and modeled components.
---
### D6. MemoryRead schema (PA-first, PE-tagged)
`MemoryRead` represents a host-initiated read from device memory.
Mandatory fields:
- common envelope fields (D3)
- source placement tags (Option A):
- `src_sip: int`
- `src_cube: int`
- `src_pe: int`
- `src_pa: int`
- `nbytes: int`
Optional fields:
- `dst_kind: "host_sink" | "discard"` (default "host_sink")
- `debug_label: str | None`
Response payload:
- actual bytes are NOT required in Phase 0 (latency/traces focus)
- implementations MAY return lightweight stats or hashes later via a new ADR
---
### D7. KernelLaunch schema (PA-first, PE-tagged shards)
`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
Mandatory fields:
- common envelope fields (D3)
- `kernel_ref: KernelRef`
- `args: list[KernelArg]`
`KernelRef` MUST have:
- `name: str`
- `kind: "deployed" | "builtin"`
- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
- `deploy_sip: int` — SIP where binary resides
- `deploy_cube: int` — cube where binary resides
- `deploy_pe: int` — PE where binary resides
- `nbytes_code: int` — kernel binary size (for BW modeling)
Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
`KernelArg` supports tensor args by PA mapping and scalars by value.
Tensor arg (mandatory):
- `arg_kind: "tensor"`
- `tensor_pa_map: TensorPAMap`
`TensorPAMap` MUST have:
- `shards: list[TensorShard]`
`TensorShard` MUST have (Option A, mandatory):
- `sip: int`
- `cube: int`
- `pe: int`
- `pa: int`
- `nbytes: int`
- `offset_bytes: int`
Scalar arg (mandatory):
- `arg_kind: "scalar"`
- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
- `value: number | bool`
Optional KernelLaunch fields:
- `grid: dict | None`
- `meta: dict | None`
- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
- `debug_label: str | None`
Notes:
- KernelLaunch MUST NOT embed bulk tensor data.
- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
- IO_CPU MUST fan out work internally using the shard `(sip, cube, pe)` tags.
---
## Verification Notes
Tests SHOULD validate:
- schema validation rejects missing mandatory fields,
- deterministic correlation/response matching,
- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
- all routed requests incur latency > 0.
---
## Links
- ADR-0011 (Memory Addressing — PA / VA / LA)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (kernel execution fan-out/aggregation)
- ADR-0013 (Verification strategy — V1 message schema validation)
- SPEC R2, R7, R8
@@ -0,0 +1,139 @@
# ADR-0013: Verification Strategy and Phase 1 Test Plan
## Status
Accepted
## Context
KernBench is a system-level simulator whose correctness is defined by:
- adherence to SPEC-defined invariants,
- determinism and debuggability,
- explicit modeling of routing and latency.
Given the evolving implementation, we need a stable verification strategy
that prevents architectural drift while allowing incremental development.
This ADR defines the Phase 1 verification plan and what constitutes
"correct behavior" for early implementations.
---
## Decision
### D1. Verification is contract-based
Verification MUST be derived from:
- SPEC requirements,
- accepted ADRs.
Tests MUST validate architectural contracts, not incidental implementation details.
---
### D2. Phase 1 verification scope
Phase 1 verification focuses on:
- message contract validity (ADR-0012),
- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
- PA-first memory addressing and shard tagging (ADR-0011),
- core latency and trace invariants (SPEC 0.1, R2).
Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
are explicitly out of scope in Phase 1.
---
### D3. Required Phase 1 verification cases
The following verification cases MUST be supported by the implementation:
#### V1. Message schema validation
- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
- Completion results MUST follow the `ok / error_code / error_message` contract.
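A V1-style check for tensor shards might look like the sketch below. It assumes messages are plain dicts that mirror the ADR-0012 field names. `REQUIRED_SHARD_FIELDS`, `missing_shard_fields`, and `validate_kernel_launch` are hypothetical names introduced for illustration.

```python
REQUIRED_SHARD_FIELDS = ("sip", "cube", "pe", "pa", "nbytes", "offset_bytes")


def missing_shard_fields(shard: dict) -> list[str]:
    # Return the mandatory TensorShard fields absent from a shard dict.
    return [name for name in REQUIRED_SHARD_FIELDS if name not in shard]


def validate_kernel_launch(msg: dict) -> list[str]:
    # Collect one error string per offending shard; empty list means valid.
    errors = []
    for arg_idx, arg in enumerate(msg.get("args", [])):
        if arg.get("arg_kind") != "tensor":
            continue  # scalar args carry no shards
        shards = arg.get("tensor_pa_map", {}).get("shards", [])
        for shard_idx, shard in enumerate(shards):
            missing = missing_shard_fields(shard)
            if missing:
                errors.append(
                    f"args[{arg_idx}].shards[{shard_idx}] missing {missing}"
                )
    return errors


good_shard = {"sip": 0, "cube": 0, "pe": 1, "pa": 0x1000,
              "nbytes": 64, "offset_bytes": 0}
bad_shard = {k: v for k, v in good_shard.items() if k != "pe"}

good_msg = {"args": [{"arg_kind": "tensor",
                      "tensor_pa_map": {"shards": [good_shard]}}]}
bad_msg = {"args": [{"arg_kind": "tensor",
                     "tensor_pa_map": {"shards": [good_shard, bad_shard]}}]}
```

Returning a list of errors rather than raising on the first one makes it straightforward to assert on exactly which shard was rejected.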
#### V2. IO_CPU fan-out and aggregation
Given:
- a topology with one SIP, one CUBE, and two PEs,
- a KernelLaunch request containing two tensor shards targeting different PEs,
The system MUST:
- submit a single KernelLaunch to IO_CPU,
- fan out work internally to both PEs,
- aggregate completion and return a single deterministic completion to the host.
#### V3. Latency and trace invariants
For any valid request:
- the hop-by-hop trace MUST be non-empty,
- total latency MUST be greater than zero,
- repeated runs with identical inputs MUST produce identical traces.
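The three V3 invariants can be bundled into one reusable check. This sketch assumes each run result is a dict with a `trace` list and a `total_ns` number. That shape and the name `check_trace_invariants` are assumptions made for illustration, not the engine's actual result type.

```python
def check_trace_invariants(run_a: dict, run_b: dict) -> None:
    # run_a and run_b are the results of two runs with identical inputs.
    if len(run_a["trace"]) == 0:
        raise AssertionError("hop-by-hop trace must be non-empty")
    if not run_a["total_ns"] > 0:
        raise AssertionError("total latency must be greater than zero")
    if run_a["trace"] != run_b["trace"]:
        raise AssertionError("identical inputs must produce identical traces")


def violates(run_a: dict, run_b: dict) -> bool:
    # Convenience wrapper: True if any invariant fails.
    try:
        check_trace_invariants(run_a, run_b)
    except AssertionError:
        return True
    return False
```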
#### V4. Topology independence and cross-domain coverage
Verification cases MUST pass for multiple topology shapes, including:
- minimal: (1 SIP, 1 CUBE, 1 PE)
- multi-PE: (1 SIP, 1 CUBE, N PEs)
- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
- explicit connectivity (required links exist),
- deterministic routing and control-path traversal,
- non-empty traces and latency > 0 for representative cross-domain requests
(inter-CUBE and inter-SIP paths).
Tests MUST NOT hardcode topology sizes, node ids, or link counts.
Instead, tests MUST derive expectations from the compiled topology metadata.
---
### D4. Phase 1 artifacts
Phase 1 MAY include:
- verification-only test code,
- topology fixtures,
- trace inspection utilities.
Phase 1 MUST NOT require:
- production code changes solely to satisfy tests,
- weakening or removing tests to allow progress.
---
### D5. Phase 2 enforcement
Phase 2 (Apply) MUST:
- run the Phase 1 verification cases,
- roll back all changes if any verification fails,
- preserve tests as authoritative contracts.
---
## Consequences
- Architectural correctness is enforced early.
- Tests serve as executable documentation of system behavior.
- Implementation remains flexible without losing rigor.
---
## Links
- SPEC 0.1, R2, R6
- ADR-0011 (Memory Addressing — PA / VA / LA)
- ADR-0012 (Host ↔ IO_CPU message schema)
- ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,451 @@
# ADR-0014: PE Pipeline Execution Model
## Status
Accepted
## Context
This ADR defines the PE-internal kernel execution model:
- Role decomposition of PE-internal components
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
- TCM-centric dataflow with a register-file intermediary
- Engine resource model
- Observability and trace contract
- Topology representation
PE-internal structure (7 components in scope; 2 cross-referenced):
- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
`pe_tcm` — defined here
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
- `pe_ipcq` — collective communication, defined in ADR-0023
The goal is a deterministic, trace-friendly execution contract that keeps
each block independently swappable.
## Decision
### D1. PE-internal component roles
**PE_CPU**
- Executes kernel instruction stream / control logic.
- Generates PE commands and submits them to `PE_SCHEDULER` (via
`PeInternalTxn`).
- Does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- Sole dispatcher inside a PE.
- Receives commands from `PE_CPU`. Dispatch by command type:
- Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
→ forward directly to the target engine.
- `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
via a single `_feed_loop` (D6).
- Does not participate in stage-to-stage chaining within a composite;
that is handled by token self-routing (D6).
**PE_DMA**
- Handles memory transfers between TCM and external memory domains
(HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
- Two execution channels:
- `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
- Additional virtual channels:
- `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
- `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
**PE_FETCH_STORE**
- TCM ↔ Register File transfer unit.
- Isolates register-file access semantics from compute engines so that
GEMM/MATH stay pure compute components.
- BW-based latency model; TCM access contention naturally serializes
through `PE_TCM`'s BW resource.
**PE_GEMM**
- MAC array. Reads operands from the register file; writes results to
the register file. Does not touch `PE_TCM` directly.
**PE_MATH**
- Element-wise / reduction / SIMD unit. Reads / writes the register file.
**PE_TCM**
- Tightly-coupled scratchpad with BW-serialized access. Two logical
regions partitioned by ownership (see D5).
**Cross-referenced components** (defined elsewhere):
- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
(ADR-0023).
### D2. Command lifecycle and queues
`PE_SCHEDULER` maintains three logical structures:
**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
expanded sub-commands, dependency state, engine assignment, and
completion status.
**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
records.
**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
state. Engines report completion via explicit events / messages
consumed by the scheduler.
**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
publishes a completion record.
### D3. Dispatch modes
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command:
- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
- `GemmCmd` → `PE_GEMM`
- `MathCmd` → `PE_MATH`
Flow:
```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
→ completion → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (single-op tiled pipeline)
The default `CompositeCmd` runs a single compute op as a tile-pipelined
sequence:
```text
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
```
`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
`TileToken` per tile with a monotonically increasing `tile_id`.
Tile dependency (within one tile `t`):
```text
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
```
Inter-tile overlap is allowed wherever engine resources permit
(D4 governs the constraints):
```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t-1) ∥ COMPUTE(t)
```
#### D3.3 Multi-op composite (head + epilogue with scope)
A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
multi-op pipeline:
```python
@dataclass(frozen=True)
class OpSpec:
    kind: str     # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope  # "per_k_tile" | "per_output_tile" | "once"
    ...
```
- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
M/K/N partition).
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
often they fire:
- `per_k_tile` — every K-reduction step.
- `per_output_tile` — once per output tile.
- `once` — once per kernel.
Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
each stage is dispatched via token self-routing (D6), so GEMM and MATH
participate serially within the same composite even though they share
the compute slot (D4).
The empty-`ops` form is the legacy single-op path.
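To make the three scopes concrete, the sketch below counts how often an epilogue stage would fire. It rests on an assumption this ADR does not state explicitly: K-reduction steps are counted per output tile, so `per_k_tile` fires `output_tiles × k_tiles_per_output` times. `fire_count` and its parameters are hypothetical names.

```python
def fire_count(scope: str, output_tiles: int, k_tiles_per_output: int) -> int:
    # Assumption: every output tile runs the same number of K-reduction steps.
    if scope == "per_k_tile":
        return output_tiles * k_tiles_per_output
    if scope == "per_output_tile":
        return output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")


# Example: 4 output tiles, each reduced over 8 K tiles.
counts = {
    scope: fire_count(scope, output_tiles=4, k_tiles_per_output=8)
    for scope in ("per_k_tile", "per_output_tile", "once")
}
```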
### D4. Engine resource model
**DMA engine**:
- `DMA_READ`: `simpy.Resource(capacity=1)`.
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
- Both channels run concurrently (READ ∥ WRITE allowed).
- Within a channel, requests serialize (READ ∥ READ disallowed; same
for WRITE).
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
ADR-0023 D8 — out of scope for this ADR.
**Compute engine**:
- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
`PE_MATH`.
- At most one compute op runs at a time within a PE.
- Multi-op composite chains (D3.3) execute their compute stages serially
through this slot; token self-routing (D6) ensures the next stage
starts only after the previous compute releases the slot.
**Engine completion**: each engine emits a completion event consumed by
the scheduler / `PipelineContext` (D6).
### D5. Dataflow
**Input path (HBM source)**:
```text
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
Register File → PE_GEMM | PE_MATH
```
**Input path (shared SRAM source)**:
```text
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
PE_TCM → PE_FETCH_STORE → Register File
```
**Output path (HBM destination)**:
```text
Register File → PE_FETCH_STORE → PE_TCM
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
```
GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
single TCM↔register-file gateway. This makes TCM BW contention
explicit and lets fetch unit policies (e.g., prefetch) be replaced
independently of compute engines.
#### D5.1 PE_TCM partitioning
`PE_TCM` is split into two logical regions:
**SchedulerReservedTCM**
- Owned exclusively by `PE_SCHEDULER`.
- Holds composite-command tile buffers.
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
manages tile-buffer lifetimes.
**AllocatableTCM**
- General-purpose region managed by `PEMemAllocator`.
- Used for host / DP-visible allocations.
**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
from allocator-managed ranges by construction.
**Tile buffer rules**:
- Input and output buffers within `SchedulerReservedTCM` MUST NOT
overlap during a tile's active lifetime.
- A tile buffer remains valid until the corresponding `DMA_WRITE`
completes.
- Buffer reuse is permitted only after the consuming tile's lifetime
ends.
### D6. TileToken self-routing pipeline
A composite's stage-to-stage progression happens **without** routing
through the scheduler. Each component forwards the token directly to
the next stage's component using the token's `plan`:
```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
↑ chaining: no scheduler hop ↑
PipelineContext.complete_tile()
```
This mirrors real-HW done-wire chains. The scheduler handles only
**initial dispatch + completion aggregation**.
#### TilePlan / Stage
```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict    # stage-specific parameters


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
```
#### TileToken
```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict          # cached current stage params
    data_op: bool = True  # op_log opt-in (ADR-0020 D4)
```
Single-owner invariant: a token is owned by exactly one component at a
time. Lifecycle: scheduler creates with `stage_idx=0` → component
`_process()` → increment `stage_idx` → put to next stage's `in_port`
last stage calls `pipeline_ctx.complete_tile()`.
#### PipelineContext (exactly-once completion)
```python
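from dataclasses import dataclass
from typing import Optional

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    # Must be set before the first complete_tile() call.
    done_event: Optional[simpy.Event] = None

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()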
```
Each tile's last stage MUST call `complete_tile()` exactly once.
Duplicate calls are bugs (SimPy `Event` can succeed at most once).
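One way to make surplus calls fail loudly is to guard the counter. The sketch below is a hypothetical variant, not the simulator's class: `FakeEvent` stands in for `simpy.Event` so the example runs without SimPy, and `GuardedPipelineContext` is an illustrative name.

```python
class FakeEvent:
    # Stand-in for simpy.Event: succeed() may be called at most once.
    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        if self.triggered:
            raise RuntimeError("event already triggered")
        self.triggered = True


class GuardedPipelineContext:
    # Hypothetical variant that rejects calls beyond total_tiles.
    def __init__(self, total_tiles: int, done_event: FakeEvent) -> None:
        self.total_tiles = total_tiles
        self.completed_tiles = 0
        self.done_event = done_event

    def complete_tile(self) -> None:
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError("complete_tile() called more than total_tiles times")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()


ctx = GuardedPipelineContext(total_tiles=2, done_event=FakeEvent())
ctx.complete_tile()
fired_after_first = ctx.done_event.triggered
ctx.complete_tile()
fired_after_second = ctx.done_event.triggered
```

This guard only catches calls beyond `total_tiles`. A duplicate call for one tile while another tile is still in flight would still fire `done_event` early, and detecting that would require tracking completed tile ids.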
#### Feed ordering
`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
`_pending_feeds` FIFO. Composite commands are enqueued in submission
order; tile feed for a command runs to completion before the next
command's feed begins. **Tile-feed interleaving between commands is
disallowed.**
Within a single command's tiles, downstream pipeline overlap arises
naturally — earlier tiles progress through later stages while the feeder
keeps pushing remaining tiles into the first stage queue (SimPy Store
backpressure governs flow control). If the first-stage queue is full,
only the feeder blocks; the scheduler worker's inbox processing
continues.
#### Token routing pattern (base class)
```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)  # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
```
Each component implements only `_process()`; chaining lives in the
base class.
### D7. Observability and trace contract
The simulator emits deterministic trace events:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
For identical inputs, trace ordering MUST be deterministic.
### D8. Topology representation
PE-internal components are declared in `cube.pe_template`:
```yaml
pe_template:
  components:
    pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
    pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
    pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
    pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
    pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } } # ADR-0011 D-VA
    pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } } # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm: 0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm: 0.0
    scheduler_to_math_mm: 0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm: 0.0
    fetch_store_to_gemm_mm: 0.0
    fetch_store_to_math_mm: 0.0
    gemm_to_fetch_store_mm: 0.0
    gemm_to_math_mm: 0.0
    math_to_fetch_store_mm: 0.0
    fetch_store_to_dma_mm: 0.0
    fetch_store_to_tcm_bw_gbs: ...
```
Template is instantiated once per PE. PE instances are derived from
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
## Consequences
### Positive
- Each block is an independent topology node — individually swappable
via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream — plan-based routing gives
flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
without engine-level coupling.
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
TCM↔RF gateway.
### Negative
- Intra-PE component count is higher than a coarser model (7 base + 2
cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for
HW fidelity).
## Links
- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4
@@ -0,0 +1,202 @@
# ADR-0015: Component Port/Wire Model and Fabric Routing
## Status
Accepted
## Context
Realistic hardware modeling — queues, contention, fan-out — requires
that components own fabric traversal while the simulation engine
handles only initialization and completion observation. Direct method
calls between components, or path-walking inside the engine, defeat
queueing and contention semantics.
This ADR defines:
- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes with BW occupancy),
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
(via M_CPU),
- the engine's reduced role (wire init + completion observation only),
- M_CPU.DMA as an internal subcomponent of M_CPU.
---
## Decision
### D1. Component port model
Each component has typed input/output ports modeled as SimPy Stores:
```text
in_ports: dict[str, simpy.Store] # keyed by source node_id
out_ports: dict[str, simpy.Store] # keyed by destination node_id
```
Ports are created at engine initialization based on graph edges.
Each directed edge (src → dst) results in:
- `src.out_ports[dst]` — the sending end
- `dst.in_ports[src]` — the receiving end
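Port creation from graph edges can be sketched as follows. To keep the example free of dependencies, `collections.deque` stands in for `simpy.Store`. `build_ports` is a hypothetical helper, and the edge list is an arbitrary excerpt chosen for illustration.

```python
from collections import deque


def build_ports(edges):
    # edges: iterable of (src, dst) node ids from the compiled topology graph.
    in_ports, out_ports = {}, {}
    for src, dst in edges:
        out_ports.setdefault(src, {})[dst] = deque()  # sending end at src
        in_ports.setdefault(dst, {})[src] = deque()   # receiving end at dst
    return in_ports, out_ports


edges = [("pcie_ep", "io_noc"), ("io_noc", "pcie_ep"), ("io_noc", "io_cpu")]
in_ports, out_ports = build_ports(edges)
```

The two ends of an edge are deliberately separate queues: the wire process described in D2 sits between them and moves commands from the sending end to the receiving end.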
---
### D2. Wire process (propagation delay + BW occupancy)
For each directed edge (src, dst) in the topology graph, a SimPy wire process
models propagation delay and BW occupancy:
```python
def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
    available_at = 0.0
    while True:
        cmd = yield out_port.get()
        if bw_gbs > 0:
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                available_at = env.now + (nbytes / bw_gbs)
        yield env.timeout(delay_ns)
        yield in_port.put(cmd)
```
Wire processes are started at engine initialization.
Each directed edge maintains an `available_at` timestamp tracking when the link
becomes free for the next transaction. When a transaction occupies a link, the
next transaction on the same directed link must wait until occupancy clears
(back-to-back serialization). TX and RX directions are independent (separate
wire processes with separate `available_at` state).
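The timing that this loop produces can be replayed without SimPy. The sketch below is a pure-Python model of one directed wire, assuming the receiving Store is unbounded so that `in_port.put()` never blocks. `delivery_times` is a hypothetical helper, and the link parameters are illustrative values.

```python
def delivery_times(arrivals, delay_ns, bw_gbs):
    # arrivals: list of (arrival_ns, nbytes) in the order they reach out_port.
    now = 0.0
    available_at = 0.0
    delivered = []
    for arrival_ns, nbytes in arrivals:
        now = max(now, arrival_ns)            # wire picks up the next command
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)      # wait until the link is free
            available_at = now + nbytes / bw_gbs
        now += delay_ns                       # propagation delay
        delivered.append(now)
    return delivered


# Two 512-byte transactions arrive together on a 32 GB/s link with 2 ns delay.
times = delivery_times([(0.0, 512), (0.0, 512)], delay_ns=2.0, bw_gbs=32.0)
```

Here the first transaction is delivered at 2 ns and the second at 18 ns: it waits until the link frees at 16 ns, then pays the 2 ns propagation delay. As the loop is written, a transaction is charged the propagation delay but not its own serialization time, so occupancy only delays the transaction that follows it.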
---
### D3. Engine role (reduced)
The simulation engine MUST:
- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.
The simulation engine MUST NOT:
- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.
---
### D4. Fabric paths for Memory R/W and Kernel Launch
Memory R/W and Kernel Launch use **different** fabric paths.
Memory operations bypass M_CPU and route directly to HBM via the crossbar.
Kernel Launch routes through M_CPU for PE fan-out.
**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**
```text
pcie_ep → io_noc → io_ucie
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → router mesh → hbm_ctrl
```
**Memory R/W completion path:**
```text
hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
→ io_ucie → io_noc → pcie_ep
```
**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**
```text
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
→ [transit cubes: ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
```
**Kernel Launch completion path:**
```text
PE[0..n] all complete → M_CPU (aggregation)
→ noc → [transit cubes: ucie → noc → ucie]
→ io_ucie → io_noc → io_cpu → io_noc → pcie_ep
```
**Rationale for M_CPU bypass on Memory R/W:**
Memory write/read operations do not require command interpretation or PE
dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
would add unnecessary overhead (5 ns) without functional benefit. The io_noc
inside the IO chiplet handles the routing decision: memory operations go
directly to cube fabric, while kernel launches are forwarded to io_cpu first.
---
### D5. M_CPU.DMA is an internal subcomponent of M_CPU
M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.
M_CPU.DMA:
- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.
M_CPU.DMA does not appear as a node in the compiled topology graph.
---
### D6. Transit cube forwarding
A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:
```text
ucie_in (from upstream) → noc → ucie_out (to downstream)
```
The transit decision (forward rather than consume) is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube then pass the packet along without modification.
---
### D7. _formula_latency is preserved as a lower-bound cross-check
The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.
Invariant:
- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.
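The invariant above can be written as a small regression check. This is a minimal sketch, assuming both latencies are already available as floats in nanoseconds; `check_formula_lower_bound` and its `phase` and `tol_ns` parameters are hypothetical names, not part of the engine.

```python
def check_formula_lower_bound(formula_ns, component_ns, phase, tol_ns=1e-9):
    """Return True when the D7 invariant holds for one transaction.

    Hypothetical helper: the real engine may compare these values elsewhere.
    """
    if phase == 0:
        # Phase 0: no contention, so both models must agree.
        return abs(formula_ns - component_ns) <= tol_ns
    # Phase 1+: queueing can only add latency on top of the formula value.
    return formula_ns <= component_ns + tol_ns
```

A failure in Phase 1+ would mean the component model reports less latency than the contention-free formula, which points at a modeling bug rather than at contention.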
---
## Consequences
- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
---
## Links
- ADR-0007 D2 (engine role boundary)
- ADR-0009 D3 (kernel execution fan-out hierarchy)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
- ADR-0016 (IOChiplet NOC and memory data path)
- ADR-0017 (cube NOC 2D mesh architecture)
- ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -0,0 +1,98 @@
# ADR-0016: IOChiplet NOC and Memory Data Path
## Status
Accepted
## Context
ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
IO chiplet's internal NOC architecture that enables this routing was not
formally documented.
The IO chiplet needs an internal routing fabric (io_noc) to:
- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
- route memory operations (MemoryWrite/Read) directly to cube fabric without
passing through io_cpu
- route kernel launch commands through io_cpu for command interpretation
## Decision
### D1. IOChiplet internal NOC (io_noc)
Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
- `pcie_ep` — host-facing PCIe endpoint
- `io_cpu` — command processor for kernel launch interpretation
- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
zero overhead. All routing decisions are made by the simulation engine based
on message type, not by io_noc itself.
### D2. IOChiplet UCIe decomposition
Each IO chiplet PHY port is decomposed into:
- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows
multiple independent NOC-to-UCIe connections per PHY.
### D3. Memory R/W path (M_CPU bypass)
Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
through io_noc to the target cube, bypassing io_cpu entirely:
```text
pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
```
This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
engine's `_process_memory_direct()` method uses `find_memory_path()` which
resolves the shortest path from pcie_ep to the target HBM node.
### D4. Kernel Launch path (via io_cpu)
Kernel launch commands require io_cpu for command interpretation and PE
fan-out setup:
```text
pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
→ noc → m_cpu → PE
```
The engine's `_entry_points()` method routes KernelLaunchMsg through both
pcie_ep (entry) and io_cpu (command processing).
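The D3/D4 split amounts to choosing an IO-chiplet path by message type. The sketch below only illustrates that choice, assuming message types are identified by name; `io_chiplet_path` is a hypothetical helper, and the real decision lives in the engine methods named above, whose internals this ADR does not show.

```python
MEMORY_MSGS = {"MemoryWrite", "MemoryRead"}

def io_chiplet_path(msg_type):
    """Node sequence inside the IO chiplet, up to the UCIe endpoint (illustrative)."""
    if msg_type in MEMORY_MSGS:
        # D3: pure data transfers skip io_cpu.
        return ["pcie_ep", "io_noc", "conn", "io_ucie"]
    if msg_type == "KernelLaunchMsg":
        # D4: kernel launches pass through io_cpu, then return to io_noc.
        return ["pcie_ep", "io_noc", "io_cpu", "io_noc", "conn", "io_ucie"]
    raise ValueError(f"unknown message type: {msg_type}")
```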
### D5. IOChiplet-to-cube port mapping
Each IO chiplet instance declares which cube ports it connects to:
```yaml
cube_ports:
- { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
- { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
```
The topology builder creates edges from io_ucie PHY nodes to the
corresponding cube UCIe port nodes, with the specified distance and
the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
## Consequences
- IO chiplet has a well-defined internal routing fabric
- Memory operations avoid unnecessary io_cpu overhead
- Kernel launch commands still get proper command interpretation
- The io_noc pattern is consistent with cube-level NOC design
- ADR-0003 D2 is extended (not contradicted) by this ADR
## Links
- ADR-0003 D2 (IO chiplet definition)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0012 D1 (host-to-IO_CPU message schema)
@@ -0,0 +1,291 @@
# ADR-0017: Cube NOC and HBM Connectivity
## Status
Accepted
## Context
The CUBE-level NOC is a 2D router mesh that carries every intra-cube
request: PE-to-HBM data, PE-to-PE traffic, command paths
(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.
The CUBE's HBM is exposed through per-PE controller endpoints attached
to PE routers. This per-PE partitioning makes local-vs-remote HBM
distinguishable by mesh distance: a PE's own HBM partition sits at its
own router (switching overhead only); another PE's HBM partition is
reachable by mesh hops to that PE's router.
Two channel-mapping modes are supported in the design space:
- **n:1 (default, implemented)** — each PE's HBM partition aggregates
`channels_per_pe` pseudo-channels into one endpoint. Effective
per-PE BW = N × per-channel BW.
- **1:1 (future)** — each PE router decomposes into per-channel
mini-routers; per-channel BW contention is modeled directly.
In both modes the per-PE effective BW is identical; only the connectivity
granularity differs.
## Decision
### D1. 2D router mesh
Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.
- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment
count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where HBM die
physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Latency = Manhattan distance × `ns_per_mm`.
### D2. XY routing algorithm
Deterministic XY routing:
1. Horizontal segment: route from source X to destination X, staying at source Y.
2. Vertical segment: route from source Y to destination Y, staying at destination X.
Each directed segment carries a unique key:
- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`
Grid positions are snapped to the router grid, excluding the HBM zone.
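The two-step rule and the segment keys above can be sketched as follows. This is a minimal illustration, assuming coordinates are already snapped to the router grid and ignoring the HBM exclusion zone; the function name `xy_segments` and the `"+"`/`"-"` direction encoding are hypothetical, not taken from `mesh_gen.py` or the engine.

```python
def xy_segments(src, dst):
    """Return the directed segment keys for an XY route.

    src and dst are (x, y) router-grid coordinates (assumed already snapped).
    """
    (sx, sy), (dx, dy) = src, dst
    segments = []
    if sx != dx:
        # Horizontal leg runs along the source row (y_band = source Y).
        direction = "+" if dx > sx else "-"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg runs along the destination column (x_band = destination X).
        direction = "+" if dy > sy else "-"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments
```

Two transactions whose key lists share an element would contend for the same directed segment.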
### D3. Per-segment contention model
Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for
the resource — modelling link-level serialization in a wormhole-routed
mesh.
With no contention, NOC traversal latency equals Manhattan distance ×
`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
delay.
### D4. NOC attachment points (per-PE HBM partition)
Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
HBM (one pseudo-channel group; see D8).
Other attachments:
- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
along that edge (see D6).
```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)
```
Per-PE HBM partitioning is the key invariant that makes local vs
cross-PE HBM distinguishable by mesh distance (see D7).
### D5. NOC edge bandwidths and distances
| Connection | BW (GB/s) | Distance | Notes |
| ----------------------------- | ---------- | ------------- | ------------------------------------------- |
| PE_DMA → NOC | 256.0 | Physical (PE) | Matches local-HBM aggregate BW |
| NOC → PE_CPU | — | 0.0 mm | Command path only |
| Router ↔ hbm_ctrl.pe{idx} | 256.0 | 0.0 mm | Per PE router; N × per-channel BW (see D8) |
| NOC ↔ M_CPU | — | 0.0 mm | Command path |
| NOC ↔ SRAM | 128.0 × 4 | 0.0 mm | 512 GB/s aggregate |
| NOC ↔ UCIe conn | 128.0 | 0.0 mm | Per connection; 4 conn per port |
`0.0 mm` distances reflect the distributed nature of the NOC; actual
traversal distance is computed via Manhattan distance within the router
grid.
### D6. UCIe decomposition and inter-cube traffic
Each of the 4 UCIe ports (N, S, E, W) decomposes into:
- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.
This decomposition gives 4 independent NOC↔UCIe connections per port,
each with 128 GB/s bandwidth (512 GB/s aggregate per port).
Inter-cube traffic path:
```text
Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
[UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```
UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
crossing incurs 16 ns (TX port + RX port).
### D7. Data paths through the NOC
All intra-cube traffic uses the same router mesh — no separate fast
paths.
**Local HBM** (same PE's own partition; 0 mesh hops):
```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx} (switching overhead only)
```
**Cross-PE HBM within cube** (target PE's partition, reached by mesh):
```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```
Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):
```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```
Dijkstra computes the shortest path within the mesh.
**Cross-cube HBM** (UCIe traversal):
```text
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
→ r{x'}c{y'} → hbm_ctrl.pe{idx'}
```
**Kernel launch command to PE**:
```text
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
```
**Shared SRAM access**:
```text
PE_DMA → r{x}c{y} → (mesh) → SRAM
```
### D8. HBM channel mapping mode
Channel mapping is configured at cube scope:
```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one   # one_to_one | n_to_one
    hbm_pseudo_channels: 64      # total pseudo-channel count
    hbm_channels_per_pe: 8       # per-PE local channel count
    hbm_channel_bw_gbs: 32.0     # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8       # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```
**n:1 mode (default, implemented).** Each PE's HBM partition is a single
endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo-
channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
interleave; only aggregate per-PE BW is modeled. No separate aggregated
router node exists — the per-PE router itself serves that role.
**1:1 mode (future).** Each PE router decomposes into N channel
mini-routers; per-channel routing carries fully-resolved PA + channel ID.
A `ChannelSplitter` resolves a logical access to N per-channel physical
requests. Per-channel link models BW contention. Cross-PE channel
access semantics are deferred to the implementation ADR.
**BW math (defaults).**
| Parameter | Value |
| ---------------------------------- | -------------------------- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | 64 / 8 = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |
Both modes give the same per-PE effective BW; only the request shape and
contention model differ.
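The BW table can be reproduced from the three parameters. A minimal sketch, assuming the default values listed above; `hbm_bandwidth` is a hypothetical helper and not part of the topology builder.

```python
def hbm_bandwidth(pseudo_channels=64, pes_per_cube=8, channel_bw_gbs=32.0):
    """Derive per-PE and per-cube HBM bandwidth from the D8 parameters."""
    if pseudo_channels % pes_per_cube != 0:
        raise ValueError("pseudo channels must divide evenly across PEs")
    channels_per_pe = pseudo_channels // pes_per_cube
    return {
        "channels_per_pe": channels_per_pe,
        "per_pe_bw_gbs": channels_per_pe * channel_bw_gbs,
        "cube_bw_gbs": pseudo_channels * channel_bw_gbs,
    }
```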
### D9. AddressResolver — per-PE HBM endpoint
The address resolver decodes a PA's HBM offset to the owning PE's
partition:
```python
# policy/routing/router.py
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube
if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
```
The pe_id computation is intrinsic to the routing layer (not a
topology-time concern). Any HBM PA falls within exactly one partition,
yielding deterministic routing.
External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
same resolver path — there is no separate fast path.
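As a worked example of the decode, the D8 defaults (48 GB per cube, 8 slices) give 6 GiB per partition. The sketch below is a self-contained restatement of the same arithmetic; `resolve_hbm_endpoint`, its parameters, and the range check are assumptions for illustration, not the actual code in `policy/routing/router.py`.

```python
GIB = 1 << 30

def resolve_hbm_endpoint(hbm_offset, sip, cube,
                         hbm_total_gb_per_cube=48, hbm_slices_per_cube=8):
    """Map an HBM byte offset to the owning per-PE controller node name."""
    slice_bytes = hbm_total_gb_per_cube * GIB // hbm_slices_per_cube
    pe_id = hbm_offset // slice_bytes
    if not 0 <= pe_id < hbm_slices_per_cube:
        # Added guard (assumption): reject offsets outside the cube's HBM.
        raise ValueError("offset outside the cube's HBM range")
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
```

With these defaults, an offset of exactly 6 GiB is the first byte of PE1's partition.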
### D10. Mesh generation parameters
`mesh_gen.py` produces `cube_mesh.yaml` from:
- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.
Output `mesh_data` dictionary contains:
- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.
## Consequences
- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
(mesh hops) are naturally distinguishable, satisfying SPEC R5
(multi-domain communication) and ADR-0002 (no zero-latency end-to-end
paths).
- All cube-internal traffic routes through one mesh — single contention
model, single layout, single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural — split each PE router into
N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
geometry changes propagate without code edits.
## Links
- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition — extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing — LA model consumes per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -0,0 +1,516 @@
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
Accepted
## Context
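The current simulation models **timing only**.
Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency,
but they neither read nor compute on actual tensor data.
### Required capabilities
1. Real data must be storable in and readable from HBM/TCM/SRAM.
2. PE_GEMM and PE_MATH must perform real matrix operations whose results can be verified.
3. The impact on simulation performance must be minimal.
### Constraints
- SimPy is a single-threaded event loop: running numpy matmul inside it blocks the whole simulation.
- Components must remain swappable (ADR-0015): framework requirements must not leak into component implementations.
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait): the same code must be reused.
- Kernel functions must remain plain Python functions (no conversion to generator/async).
### Design exploration results
| Option | Approach | Verdict |
|--------|------|------|
| Direct execution inside SimPy | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: back-to-back requests block in result() |
| Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |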
---
## Decision
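### D1. 2-pass execution model: Phase 0 removed
The previous three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**.
Before:
```
Phase 0: kernel → PeCommand list (no data, no branching)
Phase 1: replay the PeCommand list in SimPy (timing only)
```
After:
```
Phase 1 (timing): kernel + SimPy integrated execution, greenlet-based
  - memory reads/writes: SimPy timing + real data in MemoryStore
  - compute (GEMM/Math): SimPy timing + op_log record (actual compute happens in Phase 2)
  - dynamic control flow is possible (tl.load returns real data)
Phase 2 (data): execute the actual compute from op_log, outside SimPy, parallelizable
```
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 covers latency/BW bottleneck analysis plus memory data tracking;
Phase 2 covers correctness verification of GEMM/Math compute.
Phase 2 is optional: if only timing is needed, run Phase 1 alone.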
### D2. Op log recording: ComponentBase hook
op_log recording is performed by a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)
    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```
`_forward_txn()` calls the hooks before and after `run()`.
`_op_logger` is optional: when it is absent, the overhead is zero.
**Hook timing definitions**:
| Timestamp | Meaning |
|------|------|
| `t_start` | When the component **starts servicing** the msg (just before entering `run()`) |
| `t_end` | When the component's **internal service completes** (just after `run()` returns) |
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet-based kernel execution: Phase 0 removed
The previous Phase 0 (kernel → PeCommand list) is removed, and
**greenlet** is used to interleave the kernel and SimPy cooperatively.
#### How it works
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` or a similar primitive, it switches to the SimPy loop (parent greenlet),
which runs the timing simulation and then switches back to the kernel with the real data.
```
SimPy loop (parent greenlet)             Kernel (child greenlet)
─────────────────────────                ──────────────────────
g.switch() ─────────────────────────→    kernel starts
                                         a = tl.load(ptr, ...)
                                           internally: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────     (kernel suspended)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→    (kernel resumed)
                                         a = data  ← real numpy array
                                         if a[0][0] > 0.5:  ← branching possible
                                           ...
```
The kernel remains a **plain Python function**.
The greenlet switch exists **only inside the implementation** of `tl.load()`, `tl.store()`, and similar calls.
#### KernelRunner: framework layer
The greenlet loop lives in the framework-layer **KernelRunner**,
not in the PE_CPU component.
```python
# KernelRunner (framework: bridges greenlet ↔ SimPy)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)
        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)  # resume with the real data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()  # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
# PE_CPU (component: kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner does not write to op_log directly.
All op logging is handled **only by the ComponentBase hooks (_on_process_start/end)**.
When KernelRunner delivers a message to a component via `_dispatch_gemm()` and the like,
the base-class hook records it automatically.
**Layer separation**:
- **Kernel code**: plain function, unaware that greenlet exists
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: bridges greenlet ↔ SimPy and handles MemoryStore reads/writes. **No logging.**
- **ComponentBase hook**: the only path for op_log recording
- **PE_CPU**: only invokes KernelRunner; remains swappable as a component
#### Memory reads/writes vs. compute
| Operation | In Phase 1 | In Phase 2 |
|------|------------|------------|
| `tl.load()` | SimPy timing + MemoryStore read → **returns real data** | n/a |
| `tl.store()` | SimPy timing + MemoryStore write → **real write** | n/a |
| `tl.composite(gemm)` | SimPy timing + **op_log record only** | real numpy compute |
| `tl.dot()` / math ops | SimPy timing + **op_log record only** | real numpy compute |
Memory reads/writes are handled immediately in Phase 1 (numpy slice, fast).
GEMM/Math compute is executed as a batch in Phase 2 (performance isolation).
#### Store Visibility Rule
`tl.store()` is **applied to MemoryStore immediately at issue time** (visibility = issue).
The SimPy DMA timing is simulated separately afterwards.
This deliberately separates timing from visibility:
- **visibility**: when the data lands in MemoryStore = when `store.write()` is called
- **timing**: when the DMA latency completes in SimPy
Because of this separation, a load issued right after a store sees the latest data under dynamic control flow.
#### Result Handle Semantics
`tl.composite()` (sync/async) returns a **handle** that refers to the result tensor.
Core contract in Phase 1:
1. **Every compute handle is always treated as pending in Phase 1.**
2. `tl.wait(handle)` expresses **timing synchronization only**;
   it does not make the handle ready.
3. Access to a handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **possible only in Phase 2**.
4. Consequently, **control flow based on compute results is not supported** in Phase 1.
5. `tl.load()`, by contrast, returns real data in Phase 1, so
   **control flow based on memory reads is supported**.
| Handle state | Phase | Allowed behavior |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
| pending | Phase 1 | pass the handle as the target of `tl.store()` (links the logical destination only; the payload is resolved in Phase 2) |
| pending | Phase 1 | **no data access**: value-based branching is not possible |
| ready | Phase 2 | real numpy data access, verification |
This restriction is intentional. Executing compute in Phase 1 would block the single SimPy thread
and defeat the purpose of the 2-pass separation.
#### Phase 1 Materialization: Future Extension
If eager Phase 1 execution becomes necessary for small operations (scalars, small reductions),
a `materialized_in_phase1: bool` flag can be added to the op record to
support selective materialization. It is not implemented in the current scope.
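### D4. data_op flag: message self-declaration
Whether a message is logged is decided by the `data_op` attribute of the message instance, not by its type.
The framework does not hardcode message types.
```python
class MsgBase:
    data_op: bool = False  # default: not logged
class DmaReadCmd(MsgBase):
    data_op = True  # memory movement → logged
class GemmCmd(MsgBase):
    data_op = True  # compute → logged
class MathCmd(MsgBase):
    data_op = True  # compute → logged
```
When a new message type (e.g., IpcqMsg) is added, setting `data_op = True` is enough:
it is logged automatically with no framework code change.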
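The self-declaration rule in D4 can be exercised end to end with a toy logger. This is a minimal sketch: `IpcqMsg` follows the example named in the text, while `CreditMsg`, `ListOpLogger`, and the free function `on_process_start` are hypothetical stand-ins for the real message classes, logger, and ComponentBase hook.

```python
class MsgBase:
    data_op = False  # default: not logged

class IpcqMsg(MsgBase):  # new message type opting in to logging
    data_op = True

class CreditMsg(MsgBase):  # hypothetical control message, keeps the default
    pass

class ListOpLogger:
    """Toy logger that keeps records in a list (illustrative only)."""
    def __init__(self):
        self.records = []
    def record_start(self, now, node_id, msg):
        self.records.append((now, node_id, type(msg).__name__))

def on_process_start(logger, now, node_id, msg):
    # Same guard as the hook in D2: no logger or no data_op means no record.
    if logger is not None and getattr(msg, "data_op", False):
        logger.record_start(now, node_id, msg)

log = ListOpLogger()
on_process_start(log, 10.0, "sip0.cube0.pe0", IpcqMsg())
on_process_start(log, 11.0, "sip0.cube0.pe0", CreditMsg())
```

Only the `IpcqMsg` call leaves a record; nothing in the hook mentions either class by name.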
### D5. Op log structure
#### Op classification scheme
Ops are classified at two levels:
| Field | Values | Role |
|------|------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | basis for executor dispatch |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the concrete operation |
#### OpRecord definition
```python
@dataclass
class OpRecord:
    t_start: float             # SimPy time (ns): service start
    t_end: float               # SimPy time (ns): service end
    component_id: str          # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str               # "memory" | "gemm" | "math"
    op_name: str               # concrete operation name
    params: dict               # per-operation parameters (see below)
    dependency_ids: list[int]  # currently in-memory record indices; may be replaced by a stable op_id later
```
#### dependency_ids generation rules
`dependency_ids` is **optional**; by default the executor
infers dependencies from addresses (see D6).
It is set explicitly only when an exact execution order is required:
- **Default (address-based inference)**: the executor analyzes read/write sets and
  infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases.
- **Explicit**: set in TLContext or at command creation when a logical dependency
  cannot be expressed through addresses.
  Example: completion-handle-based synchronization. A handle dependency relies on
  logical completion order rather than memory addresses, so address inference does not capture it.
#### op_log ordering
op_log maintains a **stable ordering** by `t_start`.
Records with the same `t_start` preserve insertion order.
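The ordering rule maps directly onto a stable sort. A minimal sketch, assuming records are plain dicts with a `t_start` key (the real `OpRecord` is a dataclass); `order_op_log` is a hypothetical helper. Python's built-in `sorted()` is guaranteed stable, so records with equal `t_start` keep their insertion order.

```python
def order_op_log(records):
    """Sort by t_start; ties keep insertion order because sorted() is stable."""
    return sorted(records, key=lambda rec: rec["t_start"])

log = [
    {"t_start": 20.0, "op_name": "gemm_f16"},
    {"t_start": 10.0, "op_name": "dma_read"},
    {"t_start": 20.0, "op_name": "exp"},
]
ordered_names = [rec["op_name"] for rec in order_op_log(log)]
```

A grouping step keyed on `t_start`, such as `itertools.groupby`, relies on this ordering because it only merges adjacent records.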
#### params details
**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,   # source address (byte)
    "dst_addr": int,   # destination address (byte)
    "nbytes": int,     # transfer size
    "src_space": str,  # "hbm" | "tcm" | "sram"
    "dst_space": str,  # "hbm" | "tcm" | "sram"
}
```
**gemm**:
```python
{
    "src_a_addr": int,    # operand A address
    "src_b_addr": int,    # operand B address
    "dst_addr": int,      # output address
    "shape_a": tuple,     # e.g. (128, 256)
    "shape_b": tuple,     # e.g. (256, 128)
    "shape_out": tuple,   # e.g. (128, 128)
    "dtype_in": str,      # e.g. "f16"
    "dtype_acc": str,     # accumulation dtype, e.g. "f32"
    "dtype_out": str,     # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,      # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,    # "tcm" (GEMM operands are always in TCM)
}
```
**math**:
```python
{
    "op": str,                   # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],    # operand address list
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,          # reduction axis
    "addr_space": str,           # "tcm"
}
```
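### D6. Phase 2 Executor
Phase 2 executes the op_log outside SimPy.
```python
class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # kept so that run() can iterate over it
        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input
    def run(self):
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```
**Parallel execution decision**:
Ops with the same `t_start` are treated as **parallel candidates**.
Whether they actually run in parallel is decided by the executor using the following criteria:
- whether read/write address ranges overlap (WAW, RAW, WAR conflict check)
- whether the predecessor ops listed in `dependency_ids` have completed
Only ops whose address ranges do not overlap and that have no explicit dependency run in parallel.
**Batching optimization**: only independent ops with the same op_name and **identical shape, dtype, layout, and transpose flags** are eligible for batching.
Example: same-shape GEMMs from several PEs → bundled into a single `np.matmul(a_batch, b_batch)`.
This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
**Phase 2 execution order guarantee**:
Phase 2 does not consider data arrival times;
execution order is guaranteed only through dependencies
(address-based inference + explicit dependency_ids).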
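The address-based conflict check in D6 reduces to interval overlap on read/write sets. The following is a minimal sketch, assuming each op exposes `reads` and `writes` as lists of `(start_addr, nbytes)` ranges within one address space; `ranges_overlap` and `has_hazard` are hypothetical helpers, not the executor's actual `_classify` logic.

```python
def ranges_overlap(a, b):
    """a and b are (start_addr, nbytes) half-open byte ranges."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

def has_hazard(first, second):
    """True if `second` must be ordered after `first` (RAW, WAW, or WAR)."""
    raw = any(ranges_overlap(w, r) for w in first["writes"] for r in second["reads"])
    waw = any(ranges_overlap(w, v) for w in first["writes"] for v in second["writes"])
    war = any(ranges_overlap(r, w) for r in first["reads"] for w in second["writes"])
    return raw or waw or war

gemm_pe0 = {"reads": [(0, 64), (64, 64)], "writes": [(128, 64)]}
gemm_pe1 = {"reads": [(256, 64)], "writes": [(320, 64)]}
consumer = {"reads": [(128, 32)], "writes": [(512, 32)]}
```

Here `gemm_pe0` and `gemm_pe1` touch disjoint ranges and could run in parallel, while `consumer` reads what `gemm_pe0` writes (RAW) and must run after it. Explicit `dependency_ids` would be checked in addition to this.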
### D7. Memory Store
`MemoryStore` logically follows byte-addressable semantics;
the current implementation uses **tensor-granular storage** (an addr → numpy ndarray mapping).
```python
class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**Internal storage format: numpy ndarray**
MemoryStore stores tensors as **numpy ndarrays**.
| Candidate | store/load speed | Phase 2 compute | Verdict |
|------|----------------|-------------|------|
| **numpy ndarray** | immediate (passed by reference, no copy) | `np.matmul` usable directly | **Adopted** |
| bytearray | requires memcpy | requires `np.frombuffer` conversion | Rejected |
| torch tensor | immediate | torch ops available | Used only for GPU optimization |
- write: stores the numpy array **by reference** (no copy) → Phase 1 overhead = one dict lookup
- read: returns the numpy array **by reference** (no copy)
- a repeated write to the same addr **overwrites the existing array as a whole tensor** (partial overwrite is not supported)
- dtypes are numpy dtypes (`np.float16`, `np.float32`, and so on; note that bfloat16 is not a stock NumPy dtype and would need an extension package such as `ml_dtypes`)
- when byte-level access is needed, convert with `.view(np.uint8)`
- for GPU batch optimization in Phase 2, the executor handles the numpy → torch tensor conversion
**read/write contract**:
- read/write operate on **contiguous tensors**.
  A non-contiguous strided view must be expressed as a separate copy op.
- On the normal benchmark path, producer and consumer dtypes are expected to match.
  Reinterpret cast is a permissive behavior intended for low-level memory validation or special test cases.
- addr is byte-aligned, with a minimum alignment equal to the dtype size.
- A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
  On a shape mismatch, validation is by nbytes; if nbytes also differs, it is an error.
- Correctness follows address-range-based read/write semantics.
- A tensor object cache is allowed as an implementation optimization,
  but the canonical state is byte-addressable storage.
- At deploy time, the host injects the initial tensor data.
### D8. Benchmark kernel code
The benchmark **user-code API does not change**.
The call interfaces of `tl.load()`, `tl.composite()`, `tl.store()`, and so on are kept.
However, the internal command/message schema may be extended to carry
the metadata Phase 2 needs (e.g., extra fields such as dtype_acc and transpose).
### D9. No component changes
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
op_log recording is the responsibility of the ComponentBase hook.
Swapping in a custom component replaces only the timing model;
Phase 2 data execution is unaffected.
### D10. Phase 2 is optional
```python
engine = GraphEngine(graph)
engine.run(benchmark)  # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```
If only timing analysis is needed, skip Phase 2.
With op_logger disabled, Phase 1 performance is also the same as before.
### D11. Verification Contract
By default, verification compares the **final output tensor** against the reference backend (numpy).
Per-dtype tolerance policy:
| dtype | Comparison | tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| integer types | `np.array_equal` | exact |
- Default mode: compare the final output only (end-to-end correctness)
- Debug mode: intermediate tensors can also be compared per op
  (MemoryStore snapshot at each op boundary)
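The tolerance table can be read as the following comparison rule. This is a dependency-free sketch over flat Python sequences, not the verifier itself; `TOLERANCE` and `values_match` are hypothetical names. The floating-point branch applies the element-wise criterion documented for `np.allclose`, `|actual - expected| <= atol + rtol * |expected|`.

```python
# (rtol, atol) per floating dtype, copied from the D11 table.
TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}

def values_match(actual, expected, dtype):
    """Compare two flat sequences under the D11 policy (illustrative helper)."""
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCE:
        # Integer types: exact comparison.
        return list(actual) == list(expected)
    rtol, atol = TOLERANCE[dtype]
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))
```

For instance, an error of 5e-4 on a value near 1.0 passes under the f16 policy but fails under f32.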
---
## Non-goals
- **Compute-result-based control flow**: not supported.
  Every compute handle is pending in Phase 1, and
  `wait()` expresses timing synchronization only, not data readiness.
  In Phase 1, `handle.data` access, element access, and truth-value evaluation are
  **treated as errors**.
  Branching on memory data (the result of `tl.load()`) is supported through greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not reproduce the execution-time
  overlap of Phase 1 exactly. Phase 2 verifies data correctness only.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and
  do not reproduce the microarchitecture of the real hardware PE.
## Open Questions
- **Aliasing / slice view**: how MemoryStore should represent slices/views that reference
  the same backing storage (stride-based view vs copy semantics)
- **Generalizing IPCQ/descriptor reads**: whether to fully generalize PE-to-PE communication
  as memory ops or to introduce a separate op_kind
- **Op log streaming**: managing op_log memory usage in large simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: whether to record the tiled pipeline of tl.composite (READ→COMPUTE→WRITE)
  as a single fused op record or as separate ops
- **Generalizing the math op schema**: the current math params are a simple structure, but
  broadcasting rules, per-input dtype, keepdims, scalar/immediate operands,
  and where/mask representation may need to be generalized
- **Op record identifier**: dependency_ids is currently based on in-memory list indices and
  must be replaced by a stable op_id once a streaming/disk-backed mode is introduced
- **Phase 1 materialization policy**: see the Future Extension in D3.
  If allowed, the Phase 2 handling of such ops (skip / verify / recompute) must be defined
---
## Consequences
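### Positive
- Minimal impact on SimPy simulation performance (only an op_log append is added)
- Phase 2 is free to use multithreading or GPU
- Component swappability is preserved (keeps the ADR-0015 design philosophy)
- No change to the benchmark user-code API
- A new message type only needs the data_op flag
- greenlet removes Phase 0 and enables dynamic control flow based on memory data
- `tl.load()` returns real data, which makes kernel debugging easier
### Negative
- op_log memory usage (for large simulations)
- Phase 2 execution time scales with tensor size (large GEMMs)
- No dynamic branching on pending handles (compute not yet done):
  compute runs in Phase 2, so result values are undetermined in Phase 1.
  Branching on memory data is supported through greenlet.
- Adds a dependency on the greenlet C extension (pip install greenlet)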
@@ -0,0 +1,90 @@
# ADR-0022: 2D Grid program_id Semantics
## Status
Accepted
## Context
Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
The previous implementation ignored the `axis` parameter and always returned a flat PE index,
making it impossible for kernels to distinguish their cube-local position from their cube identity.
## Decision
Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid:
| Call | Returns | Description |
|------|---------|-------------|
| `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube |
| `tl.program_id(axis=1)` | `cube_id` | Cube index |
| `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube |
| `tl.num_programs(axis=1)` | `num_cubes` | Total cubes |
Global PID is derived as:
```python
global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
```
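The global PID formula is invertible, which can be handy when a kernel receives a flat index and needs its grid position back. A minimal sketch in plain Python; `split_global_pid` is a hypothetical helper and not part of `TLContext`.

```python
def global_pid(local_pe_id, cube_id, num_pes_per_cube):
    """Same formula as above, with the tl.* calls replaced by plain arguments."""
    return cube_id * num_pes_per_cube + local_pe_id

def split_global_pid(pid, num_pes_per_cube):
    """Inverse mapping: return (local_pe_id, cube_id), i.e. (axis 0, axis 1)."""
    cube_id, local_pe_id = divmod(pid, num_pes_per_cube)
    return local_pe_id, cube_id
```

With 8 PEs per cube, PE 3 of cube 2 has global PID 19, and `split_global_pid(19, 8)` returns `(3, 2)`.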
### Axis mapping rationale
- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. This is the fast, tightly-coupled dimension — analogous to threads within a block.
- **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension — analogous to blocks in a grid.
## Implementation
### TLContext (`triton_emu/tl_context.py`)
Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
```python
def program_id(self, axis: int = 0) -> int:
    if axis == 1:
        return self._cube_id
    return self._pe_id
def num_programs(self, axis: int = 0) -> int:
    if axis == 1:
        return self._num_cubes
    return self._num_programs
```
### PE_CPU (`components/builtin/pe_cpu.py`)
- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
### KernelRunner (`triton_emu/kernel_runner.py`)
- Receives `num_cubes` from PE_CPU
- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
## Backward Compatibility
- Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged — returns the same PE index as before.
- `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
## Usage Example
```python
def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
    local_pid = tl.program_id(axis=0)  # PE within cube
    cube_id = tl.program_id(axis=1)    # which cube
    global_pid = cube_id * tl.num_programs(axis=0) + local_pid
    # Column-wise sharding across global PID
    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
    col_start = global_pid * n_per_pid
    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
    tl.wait(h)
```
## Consequences
- Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions.
- Future axis=2 (SIP-level) can be added following the same pattern if needed.
File diff suppressed because it is too large Load Diff
+206
View File
@@ -0,0 +1,206 @@
# ADR-0024: SIP-level Launcher — rank = SIP
## Status
Accepted
## Context
### Goal

Align the participation unit (rank) of `torch.distributed` collective calls
with the **SIP** (device) boundary. The target is bench code that reads
**indistinguishably at the host level** from a real PyTorch DDP/TP script.

Comparison with real PyTorch:

| Aspect | real PyTorch | KernBench |
| --- | --- | --- |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS process fork | greenlet fan-out |

### Problems to solve

1. **rank = SIP in the public API**: bench workers should not need to know
   about the PE concept.
2. **Greenlet-local rank/device tracking**: within the single-process model,
   each worker greenlet must correctly identify its own rank and its own SIP.
3. **Tensor placement = structural (sip, cube, pe)**: if rank is SIP, the
   default tensor placement must also be expressed in structural coordinates.

### Non-problems (outside this ADR)

- IPCQ direction addressing → ADR-0025
- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
## Decision
### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
override is kept as the legacy "rank = PE" test path.
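The precedence above can be exercised in isolation. The following is a minimal sketch, assuming the three configuration sources are plain dictionaries; the free function `resolve_world_size` and its parameter names are hypothetical stand-ins for the method and attributes shown in D1.

```python
from typing import Optional


def resolve_world_size(merged: dict, cfg_all: dict, spec: Optional[dict]) -> int:
    """Algorithm override > defaults override > SIP count (fallback: 1)."""
    if "world_size" in merged:
        return int(merged["world_size"])
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


# Hypothetical topology spec with 4 SIPs.
example_spec = {"system": {"sips": {"count": 4}}}
```

With no override present, the world size equals the SIP count, which is what makes rank = SIP the default.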
### D2. Greenlet-local rank registry (+ debug warning)
```python
import os
import warnings

class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            # Unbound greenlet: warn only in debug mode, then fall back to rank 0.
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```
### D3. `torch.ahbm.set_device(rank)` — SIP binding

The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`, but since this backend is not CUDA, an honestly
named namespace is used.

```python
class _AhbmNamespace:
    """torch.ahbm: per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```
**Parallel support for the PyTorch 2.x style**: recent PyTorch is moving toward
the device-agnostic `torch.accelerator` namespace
(`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). KernBench supports this surface in
parallel for users who want to write code that is not tied to a device vendor.

```python
class _AcceleratorNamespace:
    """torch.accelerator: device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
```
Bench authors choose one of the following; both are backed by the same registry internally:
```python
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
```
### D4. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and generates the placement in
structural coordinates. Details are in ADR-0026.

```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()  # (D3 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```

There is no post-hoc `pe_index` shifting: ShardSpec carries the structural
`(sip, cube, pe)` coordinates directly. ShardSpec details are in ADR-0026.
---
## Dependencies
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used
  in D4 and ShardSpec's structural coordinate representation.
- **ADR-0027** (Megatron TP + scheduler): the implementation reference for
  worker scheduling, `mp.spawn`, collective drain, and exception cleanup.

---

## Non-goals

- **Changing the IPCQ protocol**: ADR-0023 stays as is.
- **Cleaning up DPPolicy fields**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.

---

## Consequences

### Positive

- **Bench = real PyTorch DDP** (from the public-API point of view).
- **Greenlet-local rank**: enables cross-rank correctness in the
  single-process model.
- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 /
  ADR-0032) operate consistently on the `(sip, cube, pe)` 3-tuple.

### Neutral

- IPCQ PE-level protocol (ADR-0023) unchanged.
- IO_CPU role unchanged (existing transit kept as is).
@@ -0,0 +1,283 @@
# ADR-0025: IPCQ Direction Addressing — address-based matching
## Status
Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
## Context
### Goal

In the IPCQ protocol of ADR-0023, make the **identification of which direction
pair a transfer went through** consistent by basing it on **addresses** rather
than on topology or dict order. It must work correctly for a 2-rank
bidirectional ring (and, in general, for any topology in which several
directions point to the same peer).

### Exposed bug: 2-rank bidirectional ring

`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
point to the same peer.

**Bug 1 (install)**:

- `reverse_direction(0, 1)` returns "E" by dict order (wrong; "W" is correct
  under the opposite-direction convention)
- rank 0's E entry is set to `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)

**Bug 2 (runtime)**:

- Even if install sets the correct address, the receiver's
  `_handle_meta_arrival` matches the direction using only the sender
  coordinates → the first direction (E) wins
- peer_head_cache[E] increases, peer_head_cache[W] is unchanged
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
  IpcqDeadlock

### Root cause

The same problem appears on two axes:

1. **Install-time pairing**: the decision "which of the peer's directions pairs
   with my direction" depends on dict iteration order → fragile when several
   directions point to the same peer
2. **Runtime identification**: the decision "which qp must be updated" is made
   from the sender coordinates alone → ambiguous when directions are duplicated

### Resolution: address-based matching

Each PE's rx buffer sits in an **address range that is unique per direction**
(rx_base_pa + direction_idx × bytes_per_direction). Therefore:

- **Runtime**: match on the **dst_addr range** instead of the sender coord →
  unambiguous
- **Install**: a heuristic that prefers the opposite direction (the natural
  symmetry of ring / mesh)
- Duplicate metadata such as `peer_direction` is unnecessary: **the address is
  the single source of truth**

This design works **independently of the PhysAddr transition (ADR-0030)**.
Whether the addresses are the current synthetic ones or PhysAddr, it applies
unchanged as long as the per-direction ranges stay unique.
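The per-direction range idea can be illustrated with a small sketch. It assumes a fixed direction order `("N", "S", "E", "W")` and a hypothetical helper `match_direction`; neither the order nor the helper names come from the KernBench source.

```python
from typing import Optional, Tuple

DIRECTIONS = ("N", "S", "E", "W")  # assumed ordering, for illustration only


def rx_range(rx_base_pa: int, direction: str, bytes_per_direction: int) -> Tuple[int, int]:
    """Half-open [start, end) rx range owned by one direction."""
    idx = DIRECTIONS.index(direction)
    start = rx_base_pa + idx * bytes_per_direction
    return start, start + bytes_per_direction


def match_direction(dst_addr: int, rx_base_pa: int, bytes_per_direction: int) -> Optional[str]:
    """Return the direction whose rx range contains dst_addr, or None."""
    for d in DIRECTIONS:
        lo, hi = rx_range(rx_base_pa, d, bytes_per_direction)
        if lo <= dst_addr < hi:
            return d
    return None
```

Since the ranges are disjoint, at most one direction can match a given `dst_addr`, regardless of which peer sent the data.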
---
## Decision
### D1. Install — `reverse_direction` opposite-preference
`src/kernbench/ccl/install.py`:
```python
# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
# which were introduced by configure_sfr_intercube_multisip to keep
# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
_OPPOSITE_DIR = {
    "E": "W", "W": "E", "N": "S", "S": "N",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
}

def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.

    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
    pointing back to us. This matters in 2-rank bidirectional rings
    where both E and W on one side point to the same peer; without
    the preference, the first-match-wins iteration would route data
    into the wrong rx slot. Falls back to any direction pointing back
    for topologies without an opposite convention (tree_binary's
    parent/child).
    """
    nt = neighbor_table[peer_rank]
    opp = _OPPOSITE_DIR.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None
```
Call site:

```python
for d, peer_rank in nbrs.items():
    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
```
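To see the opposite-preference rule act on the 2-rank ring, here is a self-contained sketch. Unlike the excerpt above, it takes `neighbor_table` as an explicit argument (an assumption made so the example runs standalone), and the `tree` table is a made-up parent/child topology used only to show the fallback.

```python
from typing import Dict, Optional

OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def reverse_direction(
    neighbor_table: Dict[int, Dict[str, int]],
    my_rank: int,
    peer_rank: int,
    my_dir: str,
) -> Optional[str]:
    """Pick the peer's direction that points back at my_rank, preferring the opposite."""
    nt = neighbor_table[peer_rank]
    opp = OPPOSITE.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None


# 2-rank bidirectional ring: both directions on each rank point at the other rank.
ring2 = {0: {"E": 1, "W": 1}, 1: {"E": 0, "W": 0}}
pairs = {d: reverse_direction(ring2, 0, 1, d) for d in ring2[0]}

# Topology without an opposite convention: falls back to any direction pointing back.
tree = {0: {"child": 1}, 1: {"parent": 0}}
tree_pair = reverse_direction(tree, 0, 1, "child")
```

Each of rank 0's directions is paired with a different direction on rank 1, so the two transfers land in different rx slots.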
### D2. Runtime — `_handle_meta_arrival` dst_addr matching

`src/kernbench/components/builtin/pe_ipcq.py`:

```python
def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    """Match incoming token to the receiver-side direction by dst_addr range.

    Each direction has a unique rx buffer address range
    (my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by
    the sender's IPCQ when computing peer's slot address) falls within
    exactly one such range. This address-based matching is unambiguous
    even when multiple directions have the same peer (2-rank ring).
    """
    token = msg.token
    dst_addr = token.dst_addr
    for d, qp in self._queue_pairs.items():
        base = qp["my_rx_base_pa"]
        size = qp["n_slots"] * qp["slot_size"]
        if base <= dst_addr < base + size:
            qp["peer_head_cache"] = max(qp["peer_head_cache"],
                                        token.sender_seq + 1)
            self._arrived_tokens.setdefault(d, []).append(token)
            # Wake receivers waiting on this specific direction.
            waiters = self._recv_waiters.get(d, [])
            self._recv_waiters[d] = []
            for ev in waiters:
                if not ev.triggered:
                    ev.succeed()
            # Wake receivers waiting on any direction.
            any_waiters = self._any_recv_waiters
            self._any_recv_waiters = []
            for ev in any_waiters:
                if not ev.triggered:
                    ev.succeed()
            return
    # Unknown dst_addr: diagnostic log (should not happen under correct install)
```

The sender-coordinate check is **removed**. `dst_addr` already determines the direction.
### D3. Credit — add the `dst_rx_base_pa` field

`src/kernbench/common/ipcq_types.py`:

```python
@dataclass(frozen=True)
class IpcqCreditMetadata:
    consumer_seq: int
    dst_rx_base_pa: int  # NEW: matched against the original sender's peer.rx_base_pa
    # Existing fields (kept for diagnostic / log use)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
```

When a credit is created (`_delayed_credit_send`): the `my_rx_base_pa` of the
local direction is sent in `dst_rx_base_pa` (this is the `peer.rx_base_pa` that
the other side used when it sent).

Receiving side (`_credit_worker`):

```python
def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
            # Find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                            credit.consumer_seq)
                waiters = self._send_waiters.get(d, [])
                self._send_waiters[d] = []
                for ev in waiters:
                    if not ev.triggered:
                        ev.succeed()
                break
```

The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is unambiguous.
### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`

The `IpcqInitEntry.peer_direction` field proposed in ADR-0025 rev 1 is **unnecessary**.

Reasons:

- Meta arrival is matched by dst_addr (D2)
- Credit is matched by dst_rx_base_pa (D3)
- The qp does not need to store peer_direction
- Install uses peer_dir only internally when computing rx_base_pa (`reverse_direction`)

The IpcqInitEntry schema is unchanged. This is a **simplification** relative to rev 1.

### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)

The existing `src_direction` field is not removed. It is kept for:

- Logging / trace: the `(rank, t, dir, nbytes)` tuple in `KERNBENCH_CCL_TRACE=1` output
- Diagnostics: showing the direction in pointer_dump and similar tools
- Room for future extension

Runtime matching uses `dst_addr` only.
### D6. Invariants (strengthening ADR-0023 I3)

**I3 (strict)**: for each direction pair `(my_direction, peer_direction)`, my
rx_base and the peer's rx_base must point to **distinct direction slots**.
Install must guarantee this (reverse_direction opposite-preference).

**I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa`
occupy mutually disjoint address ranges (buffers of different directions never
overlap). This is the precondition for the address-based matching in D2/D3.

This can be verified at install time:

```python
# ccl/install_plan.py: assertion at the end of build_install_plans
all_rx_ranges = set()
for plan in plans:
    for pe_install in plan.pe_installs:
        for entry in pe_install.neighbors:
            r = (entry.my_rx_base_pa,
                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
            assert not overlap
            all_rx_ranges.add(r)
```
---
## Dependencies
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching
  logic (D2, D3) and improves the install heuristic (D1). The semantic layer of
  the IPCQ protocol is unchanged.
- **ADR-0024** (launcher): the 2-rank bidirectional ring occurs in practice
  under ADR-0024's ws=SIP_count model. This ADR makes that case work.
- **ADR-0030** (PhysAddr transition, stub): **independent**. ADR-0025's
  address-based matching works the same with the current synthetic addresses
  or with PhysAddr.

---

## Non-goals

- **Switching the IPCQ address scheme to PhysAddr**: ADR-0030 scope. This ADR
  is agnostic to how addresses are encoded.
- **Multi-hop routing**: keeps the single-hop DMA write assumption of ADR-0023 D5.
- **Unidir ring specialization**: `ring_1d_unidir` has only one direction, so
  this bug does not apply.
---
## Open questions
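- **Address-matching performance**: `_handle_meta_arrival` and `_credit_worker`
  iterate the qps linearly (at most 4 directions). The performance impact is
  negligible. If it becomes a problem, this can be switched to a dict lookup
  (`_qp_by_rx_base`).
- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to keep
  a field that remains only for diagnostics, or to move it outside of logging.
  Kept for now.
- **Cost of install-time invariant verification**: the I3.1 check in D6 is
  O((N_PE × N_direction)^2). It may become slow on large topologies → it can be
  improved with a data structure such as an interval tree. Simple
  implementation first.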
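As one possible lighter-weight alternative to an interval tree for the I3.1 check, sorting the ranges by start address lets a single pass over adjacent pairs detect any overlap in O(n log n). This is a sketch under the assumption that ranges are non-empty half-open `(start, end)` tuples; `find_overlap` is a hypothetical name.

```python
from typing import Iterable, Optional, Tuple

Range = Tuple[int, int]  # half-open [start, end), assumed non-empty


def find_overlap(ranges: Iterable[Range]) -> Optional[Tuple[Range, Range]]:
    """Return one overlapping pair of ranges, or None if all are disjoint."""
    ordered = sorted(ranges)
    # After sorting by start, any overlap implies an overlap between neighbours.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] < prev[1]:
            return prev, cur
    return None
```

The check only needs to run once at the end of install, so it could replace the incremental pairwise comparison if that ever becomes a bottleneck.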
---
## Consequences
### Positive
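- **Simplicity**: the duplicate `peer_direction` metadata is removed. The
  address is the single source of truth.
- **Unambiguous matching**: works on every topology (including duplicated
  directions).
- **Minimal schema change**: `IpcqInitEntry` is unchanged; one field is added
  to `IpcqCreditMetadata`.
- **Independent of the PhysAddr transition (ADR-0030)**: address-based matching
  does not depend on how addresses are encoded.
- **Diagnostics preserved**: `IpcqDmaToken.src_direction` stays for logging.

### Negative

- Because runtime matching is now an address comparison, debugging questions
  such as "why was W updated instead of peer_head_cache[E]" require tracing
  address ranges (previously the direction name was enough). Mitigation:
  include the "direction ↔ rx_base_pa" mapping in pointer_dump.

### Neutral

- The semantic layer of the IPCQ protocol (the sender computes dst_addr, the
  receiver receives) is unchanged.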
@@ -0,0 +1,288 @@
# ADR-0026: DPPolicy = Intra-Device Only — Remove the sip/num_sips Fields
## Status
Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
## Context
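### Goal

Clarify `DPPolicy` as a pure intra-device abstraction that expresses only the
**cube × PE distribution inside a single device (SIP)**. Distribution across
SIPs (TP) is separated into a different layer (handled by
`torch.ahbm.set_device(rank)` from ADR-0024 or by the Megatron parallel layers
of ADR-0027).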
## Decision
### D1. Remove the `sip` + `num_sips` fields from `DPPolicy`

```python
@dataclass(frozen=True)
class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.

    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    num_pes: int | None = None
    num_cubes: int | None = None
```

Removed fields: `sip`, `num_sips`.
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` removed entirely

Currently `ShardSpec.pe_index` is a **global flat index**
(`sip × cubes × pes + cube × pes + pe`). This is the form that ADR-0024 D4
flagged as "abstraction leakage".

This ADR **redefines ShardSpec in structural coordinates** and does **not keep**
`pe_index`, not even as a property:

```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class ShardSpec:
    """Structural shard placement: intra-SIP (cube × PE) coord.

    Global-flat `pe_index` was removed in ADR-0026. Callers must use
    structural coords (sip, cube, pe) directly. If a flat integer key is
    needed (e.g. dict lookup), compute it explicitly at the call site.
    """
    sip: int   # structural: which SIP this shard lives on
    cube: int  # local within SIP
    pe: int    # local within cube
    offset_bytes: int
    nbytes: int
```

**Core principles**:

- The identity of a ShardSpec is the `(sip, cube, pe)` 3-tuple.
- **There is no `pe_index` property either**, which blocks silent semantics drift.
- Existing callers that expected a global flat index get an **immediate
  `AttributeError`** on `.pe_index` access → they must migrate to structural
  coordinates.
- In local contexts that need a flat integer key (e.g. an internal dict
  lookup), the caller explicitly computes
  `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`.

**Justification for removing the property**: KernBench is an in-house project,
so the call sites are limited. Compared with the risk of silent drift (the
meaning changes while the type remains int), explicit breakage
(AttributeError) is far safer.
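For call sites that still need a flat integer key, the explicit computation mentioned above might be wrapped as follows. This is a sketch: the `ShardSpec` class is a local copy of the definition in D2, while `flat_key`, `cubes_per_sip`, and `pes_per_cube` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def flat_key(spec: ShardSpec, cubes_per_sip: int, pes_per_cube: int) -> int:
    """Explicit global-flat key, computed at the call site (not stored on the spec)."""
    return (spec.sip * cubes_per_sip + spec.cube) * pes_per_cube + spec.pe


example = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=64)
```

Keeping the topology sizes as explicit arguments makes the dependency on the global layout visible at every place a flat key is used.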
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

Implements the contract of ADR-0024 D4. There is no post-hoc shifting.

```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class _LocalPeShard:
    """Internal: return value of the PE resolvers. Cube-local PE identifier + payload."""
    local_pe: int  # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int

def resolve_dp_policy(
    policy: DPPolicy,
    *,
    shape: tuple[int, int],
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
    target_sip: int,  # NEW: states explicitly which SIP to place on
) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.

    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
    No SIP-level split; DPPolicy is intra-device only.
    """
    resolver = _PE_RESOLVERS[policy.pe]
    all_shards: list[ShardSpec] = []
    # Level 1: cube within SIP
    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)
    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
        # Level 2: PE within cube; resolver returns _LocalPeShard (local_pe)
        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
                                num_pe=num_pe)
        for ls in local_shards:
            all_shards.append(ShardSpec(
                sip=target_sip,   # from caller (current_device)
                cube=cube_id,     # local within SIP
                pe=ls.local_pe,   # local within cube (explicit name)
                offset_bytes=cube_offset + ls.offset_bytes,
                nbytes=ls.nbytes,
            ))
    return all_shards
```

The **internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
list of `_LocalPeShard`; the field name `local_pe` makes it **explicit that
this is a cube-local PE identifier**. This resolves the earlier naming
confusion with `ShardSpec.pe_index`.

**Naming conventions** (across the whole ADR):

- `ShardSpec.pe`: the final external API, a cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **removed**. It is kept nowhere, externally or internally (a
  side effect of blocking silent drift: the name never reappears).
### D4. `_create_tensor` — direct placement in structural coordinates

A continuation of ADR-0024 D4. Post-hoc shifting is removed, and the structural
coordinate is specified directly at the `resolve_dp_policy` call.

```python
# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), the tensor silently
    # lands on SIP 0, so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
            "If this is a multi-rank launcher context, you likely forgot "
            "torch.ahbm.set_device(rank) inside the worker.",
            stacklevel=2,
        )
    current_sip = 0
placement = resolve_dp_policy(
    dp,
    shape=shape_2d,
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
    target_sip=current_sip,  # ← structural coordinate set up front
)
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The former post-hoc shifting block is removed entirely.
```

**Every** tensor is placed on the current-device SIP. To create a multi-SIP
tensor, use the TP primitives of ADR-0027.

**Trade-off of the single-driver fallback**: defaulting to SIP 0 for calls
without set_device is kept for compatibility with the existing single-driver
tests. With `KERNBENCH_DEBUG=1`, a warning makes it possible to detect the case
where set_device was accidentally omitted in a launcher context and the tensor
is silently placed on the wrong SIP.
### D5. Downstream — allocator lookup uses a structural tuple key

The existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):

```python
for spec in placement:
    alloc = allocators[spec.pe_index]  # ← AttributeError (property removed)
```

Since `pe_index` is gone, migration to structural coordinates is **forced**:

```python
for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```

The dict population in `_ensure_allocators` also uses tuple keys:

```python
# context.py _ensure_allocators (after)
for sip_id in sip_range:
    for cube_id in range(cubes_per_sip):
        for pe_id in range(pes_per_cube):
            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
            )
```

`_free_tensor` follows the same pattern: the existing
`flat_idx = sip * ... + cube * ... + pe` computation block is removed and
`(shard.sip, shard.cube, shard.pe)` is used directly.

**Tuple vs dataclass `PEIdentity`**: a tuple is recommended because it is simple
and directly usable as a hashable key. A `PEIdentity` value object would have
the advantage of an explicit type, but it adds considerable boilerplate and
would currently serve only as the key of the allocator dict, which makes it
over-engineering. Keep the tuple.
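### D7. Backward compatibility — none (cleanup ADR)

This ADR is a **breaking change**.

1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
2. Accessing `ShardSpec.pe_index` → `AttributeError`

Both are **immediate, explicit breakage**. There is no deprecation warning or
fallback path. KernBench is an in-house project with a limited set of call
sites, so migration happens in one pass.

**Blocking silent drift** is the main benefit of removing the property
entirely: it removes the possibility that code expecting a global flat index
receives a SIP-local result and silently indexes incorrectly.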
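The two failure modes listed above follow from ordinary dataclass behaviour and can be reproduced with simplified stand-ins. In this sketch the `Literal` field types are reduced to `str`, and `error_name` is a hypothetical helper used only to capture the exception class.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DPPolicy:  # simplified stand-in for the post-ADR definition
    cube: str = "replicate"
    pe: str = "replicate"
    num_pes: Optional[int] = None
    num_cubes: Optional[int] = None


@dataclass(frozen=True)
class ShardSpec:  # simplified stand-in for the post-ADR definition
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def error_name(fn: Callable[[], object]) -> str:
    """Return the exception class name raised by fn(), or 'ok' if none."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__
    return "ok"


removed_kwarg = error_name(lambda: DPPolicy(sip="replicate"))
removed_attr = error_name(lambda: ShardSpec(0, 0, 0, 0, 64).pe_index)
```

A migration test along these lines could guard against the removed names being reintroduced.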
## Dependencies
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
  provide the SIP placement mechanism. This ADR builds on that and narrows
  DPPolicy to pure intra-device use.
- **ADR-0027** (Megatron TP): the alternative path when a tensor spanning
  multiple SIPs is needed. After this ADR is applied, multi-SIP use cases move
  to ADR-0027.

---

## Non-goals

- **Redesigning `DPPolicy.cube` / `pe`**: the existing
  replicate/column_wise/row_wise semantics are kept.
- **Unifying tiling policies**: `tiled_column_major` / `tiled_row_major` stay
  as they are.
- **A new multi-device tensor abstraction**: DTensor-like work belongs to ADR-0028.

---

## Open questions

- **Default value of current_sip in `_create_tensor`**: for calls without
  set_device, whether to fall back to rank=0 (SIP 0) or raise an error. The
  recommendation is to fall back (compatibility with the existing single-driver
  tests).
- **Scope of the `test_sip_parallel.py` rewrite**: moving the existing unit
  tests to a launcher-based form while preserving their intent requires
  additional fixtures. Scoped as separate work.
- **Meaning of `num_sips=None` in `DPPolicy`**: once the field is gone, the
  `num_sips` concept itself disappears. The explicit answer is to use the TP
  primitives of ADR-0027 to express multi-SIP placement.

**Resolved (items that were open in earlier revisions)**:

- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **removed entirely** (D2)
- ~~Dict key format of `_ensure_allocators`~~ → **tuple `(sip, cube, pe)`** (D5)
---
## Consequences
### Positive
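- **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
- **Simpler API**: the DPPolicy constructor loses about 33% of its fields.
- **Consistent structural coordinates**: ShardSpec is expressed as a
  `(sip, cube, pe)` tuple → the abstraction leakage is resolved (satisfies the
  ADR-0024 D4 contract).
- **Clear `pe_index` semantics**: SIP-local is the only interpretation. If a
  global flat index is needed, it is stated explicitly.
- **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is
  the only mechanism that controls the SIP boundary.

### Negative

- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
  `spec.pe_index` → `AttributeError`. All callers must be fixed in one pass.
- **ShardSpec schema change**: the single `pe_index` field becomes the three
  fields `sip`/`cube`/`pe`. Downstream code (`deploy_tensor`, `_free_tensor`,
  `_ensure_allocators`, the `allocators` dict key, and so on) needs cascading
  changes.
- **No silent drift**: removing the property entirely makes failures immediate
  at runtime → migration leakage is blocked at the source. (Not a negative but
  an explicit tradeoff.)
- Cost of rewriting `test_sip_parallel.py`.

### Neutral

- The meaning of the existing `cube` / `pe` fields is unchanged.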
+888
View File
@@ -0,0 +1,888 @@
# ADR-0027: Megatron-style Tensor Parallelism API
## Status
Accepted
## Context
### Goal

Support tensor parallelism (TP) across SIPs through an API of **explicit
parallel layers in the Megatron-LM style**. A declarative abstraction such as
DTensor is future work in a separate ADR (0028).

Reasons for choosing the Megatron style:

- TP occurs at specific layer boundaries of the model. Explicit primitives fit
  the mental model naturally.
- It is the industry standard established by NVIDIA Megatron / DeepSpeed.
- DTensor is declarative, so its design space is larger → staged approach.

### TP primitive spec (after Megatron-LM)

- **ColumnParallelLinear**: shards the **column (out_features)** axis of the
  weight across TP ranks. The input is fully replicated and the output is
  column-sharded. No forward all-reduce is needed when a RowParallelLinear
  follows.
- **RowParallelLinear**: shards the **row (in_features)** axis of the weight
  across TP ranks. The input is already column-sharded (the output of
  ColumnParallel). An **all-reduce** is required at the end of forward.
- **VocabParallelEmbedding**: shards the embedding along the vocab axis, with
  an all-reduce at the end of forward. (A stub in the initial scope; a real
  implementation first requires an all-gather kernel.)
- **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
  **`gather_from_tp_region`**: the basic primitives.
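The reason a ColumnParallelLinear followed by a RowParallelLinear needs only one all-reduce can be checked numerically. The sketch below uses plain Python lists, a weight layout of `(in_features, out_features)`, no bias or activation, and an element-wise sum as a stand-in for the collective; all function names are illustrative and not part of the KernBench API.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum across ranks (stand-in for the all-reduce collective)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def tp_forward(x, w1, w2, tp_size):
    w1_shards = split_columns(w1, tp_size)  # column-parallel weight shards
    w2_shards = split_rows(w2, tp_size)     # row-parallel weight shards
    partials = []
    for rank in range(tp_size):
        hidden = matmul(x, w1_shards[rank])               # column-sharded activation
        partials.append(matmul(hidden, w2_shards[rank]))  # per-rank partial sum
    return all_reduce_sum(partials)
```

Each rank consumes its own column shard of the hidden activation directly, so no communication is needed between the two layers; only the final partial sums are combined.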
### Problems to solve

1. **Worker-wait generalization (D0)**: extend the defer/yield/drain pattern of
   `dist.all_reduce` to every `ctx.wait` path. **This is the largest
   architectural decision in this ADR**.
2. **Launcher API normalization (D1)**: current benches use a hand-rolled
   greenlet loop. Absorb it into
   `torch.multiprocessing.spawn(fn, args, nprocs)` to keep the real-PyTorch API
   surface and to concentrate the D0 scheduler drain in a single
   implementation site.
3. **Per-rank weight distribution**: each worker owns its own slice of the
   weight tensor. This is expressed naturally with `set_device(rank)` from
   ADR-0024 plus the intra-device DPPolicy from ADR-0026.
4. **Forward-only scope**: KernBench currently has no backward pass (it is for
   simulation). This ADR supports **forward only** first. Training simulation
   belongs to a separate ADR.
5. **Collective call site**: RowParallelLinear calls `all_reduce` at the end of
   forward. This works naturally on top of the multi-greenlet structure of
   ADR-0024 and the D0 generalization.
6. **TP group concept**: Megatron combines DP × TP × PP groups. The initial
   scope simplifies to **TP group = all SIPs**. Mixed DP+TP is future work.
---
## Decision
### D0. Worker-wait generalization — `ctx.wait` defers to main when called in a worker context

**Restating the problem**. `kernel_runner.run` captures the
`greenlet.getcurrent()` at spawn time as the kernel greenlet's `_parent`
([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
When `env.run` runs in the main context, parent=main, which is safe. When
`env.run` runs in a worker context, parent=worker, and the moment the worker
yields or finishes, the kernel greenlet is orphaned → `GreenletExit` → the
`ring_default_ws` failure of ADR-0024 Phase B.

**Solution**. When a worker greenlet calls `ctx.wait(h)`, it **yields to the
main scheduler** instead of driving `env.run` directly. Main drives env.run,
and once the handle completes, control returns to the worker.

#### D0.1 `RuntimeContext` extension

```python
# context.py
@dataclass
class RuntimeContext:
    ...
    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
```

#### D0.2 Worker fork in `ctx.wait`
```python
def wait(self, handle, *, _meta=None):
    # Fast-path: already completed, so skip enqueue + switch (consistent with
    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
    # and prevents redundant _pending_worker_waits growth.
    if handle in self._completed:
        completion, _trace = self.engine.get_completion(handle)
        return completion
    from greenlet import getcurrent
    g = getcurrent()
    if g.parent is not None and not g.parent.dead:
        # Worker greenlet: defer to main. Push handle, yield to parent.
        # Parent (scheduler loop) drains env.run, then switches back.
        self._pending_worker_waits.append(handle)
        g.parent.switch()
        # On resume: handle must have completed (main drained the list).
        # Fall through to the status-quo completion/trace assembly.
    # Main context (or single-driver): drive engine directly.
    wait_fn = getattr(self.engine, "wait", None)
    if wait_fn is not None:
        wait_fn(handle)
    completion, trace = self.engine.get_completion(handle)
    self._completed.add(handle)
    if _meta is not None and trace is not None:
        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
        entry.update(_meta)
        self._traces.append(entry)
    return completion
```
#### D0.3 Worker-context semantic contract of `ctx.wait` (normative)

This ADR **explicitly changes** the semantics of `ctx.wait` in the worker context.

- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)`
  guarantees completion not "immediately" but "**after the next scheduler
  drain**". The point at which the worker returns from `wait()` is the point at
  which main has finished `engine.wait` for that handle. Calls from the main
  context remain immediately synchronous, as before (status quo).
- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, at the
  point where `g.parent.switch()` returns and the worker resumes,
  **`h in ctx._completed` must be True**. If this invariant breaks, the worker
  proceeds to later steps in a stale state, so any change to `_drain_pending`,
  the scheduler loop, or `ctx.wait` must preserve it. T3.b asserts this
  invariant directly.
- **Observable change**: inside a worker, the pattern
  `h = ctx.submit(msg); ctx.wait(h); read(handle_result)` still holds. The
  semantic specification, however, includes the fact that a main-drain is
  automatically inserted between `wait()` and `read`.
- **Direct reads of host objects (see D0.5)**: the contract for calling
  `tensor.numpy()` without `ctx.wait` is defined separately in D0.5.
#### D0.4 Main scheduler drain — rules (normative)

(This is the internal implementation of D1's `multiprocessing.spawn`. What
follows is the semantic definition.)

```python
while alive:
    for g in alive:        # (1) round-based worker switch
        g.switch()
    _drain_pending(ctx)    # (2) drain in main context
```

(See D0.5 for the actual definition of `_drain_pending`: an outer while-loop
drains until both queues are empty.)

**Rules**:
1. **Round-based cooperative scheduling & the obligation to yield (worker contract)**.
   `g.switch()` does not return until the worker **voluntarily yields**
   (cooperative greenlet semantics). Therefore:
   - If a worker runs a pure-compute loop such as `while True: do_compute()`
     without yielding, `g.switch()` never returns and **the scheduler loop
     itself hard-blocks** (other workers get no chance to be switched to, and
     no drain happens). This is not starvation but **scheduler non-progress
     (equivalent to deadlock)**, and this ADR defines it as **unsupported**.
   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a
     host-read barrier (D0.5) within a finite number of steps. The `forward` of
     a TP layer contains a launch→wait pair at the end of every layer, so it
     satisfies this condition naturally. CCL kernels also yield inside
     `dist.all_reduce`.
   - The implementation does not need to **detect** this (timeouts,
     steps-since-yield counters, and so on). It is a user contract, and the
     symptom of a violation is "simulation hang".
   - **Future extension**: if long non-collective compute paths turn out to be
     common, an explicit `torch.distributed.cooperative_yield()` primitive (a
     no-op yield) can be introduced. This is outside the scope of the current
     ADR and is not a breaking change; it can be added when needed.
   - Within a round, every alive worker receives `switch` once. Even if one
     worker calls wait several times within a single round, the handles are
     enqueued sequentially within its turn and then processed in one batch by
     a single scheduler drain (FIFO).
2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a
   strict FIFO via list append/pop(0). Draining follows submission order, not
   completion order; because the SimPy scheduler itself guarantees a causally
   correct completion order, draining in submission order is safe. Do not
   confuse `completion order` with `drain order`.
   **Two-queue ordering (worker waits → collectives)**: `_drain_pending` drains
   the worker-wait queue first and the collective queue second. The rationale
   for this order:
   - **The two queues have different dependency sources**: worker waits are
     handles that a worker created directly with a `submit + wait` pair (tensor
     deploy, MmuMap, and so on). The collective queue holds kernel-launch
     handles enqueued internally by `dist.all_reduce`; the worker does not wait
     on these directly (see the two-queue drain model in D0.5).
   - **Independent from a correctness standpoint**: from the worker's point of
     view, a collective is in the state "already submitted, then yielded". Its
     completion only needs to happen before the worker's next action. There is
     no ordering dependency with the worker-wait queue.
   - **Both complete within a single drain barrier**: under the
     loop-until-empty rule of D0.5, a single barrier invocation drains
     everything in the order worker → collective → (repeat if new entries
     appeared). When the worker resumes, both queues are drained.
   - **The alternative (collectives first) is also possible**: this ADR fixes
     worker-first only for the simplicity of the current implementation; the
     two are semantically equivalent. Readjust if a difference in performance
     profile is observed.
3. **Duplicate enqueue: correctness comes from the idempotent drain; dedup is not guaranteed**.
   `ctx.wait(h)` returns immediately if `h in ctx._completed`. `_drain_pending`
   applies the same guard. Even if the same handle is appended to
   `_pending_worker_waits` several times, the actual `engine.wait` is called
   only once (idempotent).
   - **Correctness**: relies on the idempotent drain → safe.
   - **Memory/performance**: this ADR does **not guarantee dedup** of
     `_pending_worker_waits`. If the same handle is enqueued N times, the queue
     holds N elements and the drain performs N pops plus the in-set guard.
     Unless a single worker repeatedly waits on the same handle (an abnormal
     pattern), N stays between 1 and a handful.
   - **Implementation freedom**: an implementation may optionally dedup (e.g.
     keep a `set` as a side index, or check `h not in pending_set` before
     appending). This is classified as an optimization that does not change
     correctness.
4. **Exception propagation + sibling cleanup**.
   When a worker greenlet raises, `g.switch()` propagates the exception to
   main. The scheduler loop stops immediately and performs the following
   cleanup **explicitly**:
   ```python
   try:
       while True:
           alive = [g for g in gs if not g.dead]
           if not alive:
               break
           for g in alive:
               if not g.dead:
                   g.switch()
           _drain_pending(ctx)
   except Exception as outer:
       # (a) Force-terminate the surviving sibling worker greenlets.
       for other in gs:
           if not other.dead:
               try:
                   other.throw(SystemExit)
               except Exception:
                   pass  # silent: already on an exception path
       # (b) Reset backend barrier / pending state (in preparation for a future epoch barrier).
       backend = getattr(ctx.distributed, "_backend", None)
       if backend is not None and hasattr(backend, "_barrier"):
           backend._barrier.reset()
       backend_pending = getattr(backend, "_pending_collective_handles", None)
       if backend_pending is not None:
           backend_pending.clear()
       ctx._pending_worker_waits.clear()
       # (c) Wrap the root-cause exception in SpawnException.
       raise SpawnException(errors) from outer
   ```
   Rules:
   - **Sibling abort guarantee**: when one worker raises, `SystemExit` is
     thrown into every sibling greenlet, and the greenlet terminates
     immediately. No greenlet leak.
   - **Explicit clearing of pending queues**: both the worker-wait and the
     collective-pending queues are emptied, which prevents contamination on
     reuse.
   - **`SpawnException(errors)` wrapping**: `errors: dict[int, Exception]`
     holds each rank's original exception. This is compatible with the failure
     pattern of real-PyTorch `torch.multiprocessing.spawn`.
   - **Scope restriction**: `errors` contains **only ranks that raised from
     their own code (root cause)**. Ranks terminated by `throw(SystemExit)`
     during sibling cleanup do not appear in `errors` (SystemExit is not caught
     by the `try/except Exception` of the entry wrapper in D1.2; this is by
     design, since sibling termination is a cleanup signal and not a failure).
     This is stated explicitly so that readers do not expect "every failed rank
     to be listed".
   - **`ctx._traces` is a partial state up to the point of the exception**.
     Trace completeness is not guaranteed (some launch/all_reduce calls may
     terminate without leaving an entry).
   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a
     non-goal; creating a new `RuntimeContext` is recommended.
   - **`join=False` / retry / partial recovery** are non-goals of this ADR.

   `SpawnException` is defined in `runtime_api/multiprocessing.py`:
   ```python
   class SpawnException(RuntimeError):
       def __init__(self, errors: dict[int, Exception]):
           self.errors = errors
           first = next(iter(errors.items()), None)
           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
           super().__init__(msg)
   ```
5. **Single-driver compatibility**. In main-only execution where
   `g.parent is None` (legacy single-driver tests), the D0.2 worker-fork
   condition is false, so the existing immediate-synchronous path is kept.
   `_drain_pending` is not called.
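The optional dedup from item 3 can be sketched as follows. This is a minimal illustration, not the ADR's implementation: `PendingWaits` and `drain` are hypothetical names, and `engine_wait` is assumed to be what marks a handle as completed.

```python
class PendingWaits:
    """Hypothetical FIFO of wait handles with an optional side index for dedup."""

    def __init__(self, dedup: bool = False):
        self._queue: list = []
        self._index: set = set()   # side index, only consulted when dedup is on
        self._dedup = dedup

    def append(self, handle) -> None:
        if self._dedup:
            if handle in self._index:
                return             # already queued: skipping changes no result
            self._index.add(handle)
        self._queue.append(handle)

    def pop_front(self):
        handle = self._queue.pop(0)
        self._index.discard(handle)
        return handle

    def __len__(self) -> int:
        return len(self._queue)


def drain(pending: PendingWaits, completed: set, engine_wait) -> int:
    """Drain the queue and return how many times engine_wait actually ran."""
    calls = 0
    while len(pending):
        h = pending.pop_front()
        if h not in completed:     # same in-set guard the ADR describes
            engine_wait(h)
            completed.add(h)       # assumption: waiting marks the handle done
            calls += 1
    return calls
```

With or without dedup, `engine_wait` runs once per distinct handle; dedup only shortens the queue.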
#### D0.5 Host-read barrier: decision (normative)
Inside a worker, any **host-observable read** such as `tensor.numpy()`,
`tensor.__getitem__`, or `tensor.data` is defined as an **automatic drain
barrier**. Immediately before the call:
1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is
   non-empty, yield to main via `g.parent.switch()` → main runs
   `_drain_pending` → the worker resumes once it completes.
2. If both queues are empty, read immediately.
**Barrier repetition rule (normative, re-entrance)**: `_drain_pending` uses a
while-loop to drain **until both queues are completely empty**. It is not a
single pass:
```python
def _drain_pending(ctx):
    while ctx._pending_worker_waits or (
        ctx.distributed._backend
        and ctx.distributed._backend._pending_collective_handles
    ):
        while ctx._pending_worker_waits:
            h = ctx._pending_worker_waits.pop(0)
            if h not in ctx._completed:
                ctx.engine.wait(h)
        backend = ctx.distributed._backend
        if backend is not None:
            while backend._pending_collective_handles:
                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
                # Main context: safe; ctx.wait does not push back onto pending.
                ctx.wait(h, _meta=meta)
```
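**Main-context ctx.wait non-recursion invariant (normative)**: the
`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main
greenlet context. Because the D0.2 worker-fork condition
(`g.parent is not None and not g.parent.dead`) is False, it takes the
immediate-synchronous path and **never enqueues to `_pending_worker_waits`**.
Thanks to this invariant the drain loop finishes without recursion or queue
regrowth. Implementations must keep `g.parent is None` as the guarantee that
identifies the single main greenlet.
**Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so
per the D0.2 path it **drives the engine directly** (no additional enqueue,
per the invariant above). In theory a single pass is therefore sufficient, but
the rule is fixed as **loop-until-empty**. Reasons:
1. **Safety under future extension**: an implementation in which new pending
   entries are enqueued during drain (e.g. a tree-reduce whose collective has
   sub-handles) may appear later. With the loop rule, correctness still holds.
2. **Readability**: the meaning is closed by a single sentence, "the barrier
   drains until pending is empty". It does not depend on the non-trivial
   invariant that a `ctx.wait` call never enqueues.
3. **Barrier semantics are "every dependency required by this read has
   completed"**: in the current model all pending entries are exactly all
   dependencies, so the two coincide. The user's mental model is the former.
**Termination guarantee**: described separately for two regimes.
- **Current implementation**: when called from the main context, `ctx.wait`
  drives the engine directly (D0.2) and enqueues no new pending entries. Each
  iteration strictly decreases the pending size through `pop(0)` +
  `engine.wait`. The iteration count is **bounded by the initial pending
  size**, so the loop terminates.
- **Future extension (the bound that justifies the loop rule)**: if an
  implementation that enqueues new pending entries during drain (e.g.
  tree-reduce sub-handles) is introduced, the initial-size bound no longer
  holds. However, SimPy causality guarantees that handle dependencies form a
  finite DAG, so the **nested depth is finite**. The loop rule covers this
  case automatically.
Both regimes rule out an infinite loop. The single-pass bound of the current
implementation is only a reference value for aggressive optimization; the rule
stays fixed at loop-until-empty.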
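The two termination regimes can be illustrated with a toy model. This is a sketch under stated assumptions, not the real `_drain_pending`: `drain_until_empty` and the `children` mapping are hypothetical, handles are plain strings, and "waiting" is reduced to recording the handle.

```python
from collections import deque


def drain_until_empty(worker_q: deque, collective_q: deque, children: dict) -> list:
    """Loop-until-empty drain over two queues; returns the completion order.

    `children` maps a collective handle to the sub-handles that waiting on it
    enqueues (the hypothetical tree-reduce case). An empty mapping models the
    current implementation, where nothing is enqueued during drain.
    """
    order = []
    done = set()
    while worker_q or collective_q:
        while worker_q:
            h = worker_q.popleft()
            if h not in done:              # idempotent guard
                done.add(h)
                order.append(h)
        while collective_q:
            h = collective_q.popleft()
            if h in done:
                continue
            done.add(h)
            order.append(h)
            # Future-extension regime: waiting on h enqueues its sub-handles,
            # which the outer loop picks up on its next iteration.
            worker_q.extend(children.get(h, ()))
    return order
```

With an empty `children` mapping the outer loop runs once; with sub-handles it runs again until the finite dependency set is exhausted.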
**Why implicit drain at read is the right choice**:
- The earlier open question was a choice between (a) implicit drain and (b)
  explicit barrier. (b) is explicit, but TP-layer users would have to write
  the 3-step `out = fc1.forward(x); ctx.drain(); result = out.numpy()` every
  time. (a) is a single rule, "a read is guaranteed to observe applied
  values", identical to CUDA's `cudaDeviceSynchronize before host copy`
  pattern: not a hidden rule but the **contract of named entry-points**.
- This ADR adopts (a) and **explicitly closes** the list of entry-points:
  `Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
  `Tensor.__repr__` (when data is included). Any other official host-read API
  is settled by a codebase search at the time this ADR is implemented. Any
  host-read API added later must follow this contract (regression-guarded by
  tests).
- If the caller only does `ctx.submit` and calls `numpy` directly without
  `wait`, the drain barrier still fires (because the handle is in the pending
  queue). Even when the user omits an explicit wait, the invariant is restored
  at read time.
**`Tensor.copy_(source)`: write-barrier rule**:
`copy_` is semantically a "write to target", but internally it calls
`source.numpy()` to fetch the source data on the host and then writes each
shard via `target._memory_store.write(...)`. Both directions are covered by a
barrier:
1. **Source-side (read barrier)**: `source.numpy()` triggers the D0.5 read
   barrier (when source itself is a deployed tensor and pending entries exist).
2. **Target-side (write barrier, based on global pending)**: on entry to
   `copy_`, if `ctx._pending_worker_waits` or
   `backend._pending_collective_handles` is non-empty, drain via
   `g.parent.switch()` before writing. **The criterion is the global pending
   queue, not per-tensor / per-shard dependency tracking**.
   - Why global: KernBench's handle representation carries no back-reference
     saying "this handle writes a given shard of the target". The safe,
     conservative rule is "drain if anything is globally pending". As a
     result, **pending work on an unrelated tensor can also block copy_**; the
     drop-in invariant takes priority.
   - **Explicit tradeoff**: this rule can introduce unnecessary serialization
     between mutually independent tensors. Under the current single-queue
     execution model the cost is acceptable: guaranteeing cross-rank
     correctness and the "fresh at read" invariant with a simple rule comes
     first.
   - Practical impact: for a single worker, the pending entries within one
     layer step are mostly its own work, and the extra context switch caused
     by over-barrier often coincides with the end-of-round scheduler drain, so
     it is rarely a problem.
   - Future refinement: per-tensor pending tracking would narrow this rule,
     but that is outside the scope of this ADR.
**Non-barrier**:
- **Metadata-only** accesses such as `tensor.shape`, `tensor.dtype`, and
  `tensor.name` do not drain. They have no data dependency.
- Raw address accessors such as `tensor.pa` and `tensor.va` do not drain
  either (address only, not contents).
**Official barrier entry-points (closed set)**:
| API | Kind | Rationale |
|---|---|---|
| `Tensor.numpy()` | read | host-observable copy |
| `Tensor.data` | read | `numpy()` alias |
| `Tensor.__getitem__` | read | shard-aligned read |
| `Tensor.__repr__` (when data is included) | read | debugging/log |
| `Tensor.copy_(source)` | read + write | source read + target write |
This contract is verified directly in T5/T6.
#### D0.6 Why the worker-function API is unchanged (informative)
- Internally `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)` pair.
  Per D0.2/D0.3, `wait` automatically defers to main, so it looks synchronous
  from the outside but yields once.
- Per D0.5, `tensor.numpy()` is a host-read barrier: drain then read if
  anything is pending, otherwise read immediately.
- `dist.all_reduce` keeps using the existing `_defer_wait=True` +
  `_pending_collective_handles` path. The D0.4 drain handles both queues
  together.
#### D0.7 Invariants
- **The `_parent` of a kernel greenlet is always main**, because env.run never
  runs in a worker context. (The core assertion of T3.)
- **Cross-rank sync point**: drain happens only after every worker has
  yielded, so the kernels of all ranks progress together in one round (a
  required condition for cross-rank IPCQ exchange).
- **Single-driver compatibility**: D0.4-(5).
### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`
Real-PyTorch API parity, plus the single implementation site of the D0
scheduler loop.
#### D1.0 API parity only, not execution parity (normative)
The name `torch.multiprocessing.spawn` is limited to **API signature parity**.
The actual execution model is a **cooperative greenlet scheduler** (single
Python process, single OS thread, the round-robin drive of D0.4). The
following are **properties this ADR does not provide**; among the guarantees
of real-PyTorch `torch.multiprocessing.spawn`, they are explicit
**non-goals**:
- Process isolation (independent OS process per rank).
- Independent address space (each rank owns its Python heap).
- Failure isolation (a hard crash in one rank does not affect other ranks).
- OS-level scheduler fairness (preemptive time slicing across ranks).
- Inter-process primitives such as `mp.Queue` and `mp.Lock`.
Actual properties of this implementation:
- Every rank is a greenlet inside the same Python process. Shared global state
  is visible as is (an intended simulation convenience).
- Single thread under the GIL, so there is no parallel execution. Only
  "logical concurrency" is reproduced, through SimPy event ordering.
- An unhandled exception in one worker aborts the whole simulation
  (D0.4-(4)).
**Caller obligations**: when porting a real-PyTorch multi-process sample to
KernBench, logic that depends on process isolation (e.g. `os.getpid`,
independent temporary files, signal handling) must be removed. The namespace
name is kept for code portability; the semantics differ.
#### D1.1 Public surface
```python
# runtime_api/multiprocessing.py (new)
class _MultiprocessingNamespace:
    def __init__(self, ctx):
        self._ctx = ctx
    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).
        Mirrors torch.multiprocessing.spawn signature (minus `daemon`).
        Drives the D0 scheduler loop until all workers finish.
        """
        ...
```
#### D1.2 Implementation
```python
def spawn(self, fn, args, nprocs, join=True):
    from greenlet import greenlet
    ctx = self._ctx
    dist = ctx.distributed
    gs: list[greenlet] = []
    errors: dict[int, Exception] = {}
    for rank in range(nprocs):
        def _entry(r=rank):
            try:
                fn(r, *args)
            except Exception as e:
                errors[r] = e
                raise
        g = greenlet(_entry)
        dist._bind_rank(g, rank)
        gs.append(g)
    try:
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
            _drain_pending(ctx)  # D0.5
    except Exception as outer:
        # Sibling cleanup per D0.4-(4)
        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except (Exception, SystemExit):
                    # SystemExit is listed because a greenlet dying with it
                    # re-raises in its parent (this greenlet).
                    pass
        backend = getattr(dist, "_backend", None)
        if backend is not None:
            if hasattr(backend, "_barrier"):
                backend._barrier.reset()
            if getattr(backend, "_pending_collective_handles", None) is not None:
                backend._pending_collective_handles.clear()
        ctx._pending_worker_waits.clear()
        raise SpawnException(errors) from outer
    # `join=True` semantics: we already wait for all workers.
```
#### D1.3 `torch` namespace attach
In `__post_init__` of `runtime_api/context.py`:
```python
self.multiprocessing = _MultiprocessingNamespace(self)
```
→ bench code then calls `torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.
#### D1.4 Migrating the existing bench
The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
`torch.multiprocessing.spawn` line. The existing matrix regression is kept as
is. `ring_default_ws`, currently xfail, is expected to turn PASS thanks to D0
(workers no longer orphan kernel greenlets).
### D2. New package `kernbench.tp`
```
src/kernbench/tp/
  __init__.py        - public API re-exports
  parallel_state.py  - TP group management (currently a single global group)
  layers.py          - ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
  primitives.py      - copy/reduce/scatter/gather_to/from_tp_region
  kernels.py         - gemm kernel launched by TP layers (reusable)
  mappings.py        - forward identity/all_reduce, backward stub
```
### D3. `parallel_state` — TP group
```python
# parallel_state.py
_TP_WORLD_SIZE = None
def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
    """Initialize TP group. Must be called after dist.init_process_group."""
    global _TP_WORLD_SIZE
    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
    dist = get_dist()
    total = dist.get_world_size()
    if tensor_model_parallel_size != total:
        raise NotImplementedError(
            "Only TP == world_size supported in initial scope"
        )
    _TP_WORLD_SIZE = tensor_model_parallel_size
def get_tensor_model_parallel_world_size() -> int:
    return _TP_WORLD_SIZE
def get_tensor_model_parallel_rank() -> int:
    from kernbench.runtime_api.distributed import get_dist
    return get_dist().get_rank()  # ADR-0024 greenlet-local rank
```
Initial scope: TP size = world_size = topology SIP count. Pure TP model.
### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)
The weight/output representation of a TP layer clearly separates two concepts:
| Concept | Decided by | Scope |
|---|---|---|
| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
| **Intra-rank placement** (how the owned slice is distributed over cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |
So when `ColumnParallelLinear` creates a weight of shape
`(in_features, out_features // ws)` and assigns
`DPPolicy(cube="column_wise", pe="column_wise")`:
- The slice owned by **rank r** = columns [r * k_local, (r+1) * k_local) of
  the weight. **set_device(r)** decides this (the rank lives on SIP r).
- The cube × PE column-wise distribution **inside that slice** is decided by
  **DPPolicy**.
The two axes are **independent**. If two ranks build their slices with the
same DPPolicy, the slices live on different SIPs but the intra-SIP placement
pattern is identical. Conversely, changing the DPPolicy to
`cube="replicate", pe="replicate"` keeps TP shard ownership intact and changes
only the intra-rank placement.
**Mistakes that blur this boundary** (forbidden by this ADR):
- A "SIP axis" reappearing in DPPolicy (removed in ADR-0026).
- A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
  `set_device`, which is indistinguishable from a vertical split inside a
  single rank.
The TP layers of this ADR always treat weight/output from the viewpoint
"rank = SIP = owner of one slice + DPPolicy intra-SIP distribution".
### D4. `ColumnParallelLinear`
**Important**: no new host-side `torch.matmul` abstraction is introduced. The
layer's forward calls the existing gemm kernel via
`torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches
already use
([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
[benches/gpt3_qkv.py](benches/gpt3_qkv.py)).
```python
# layers.py
from kernbench.policy.placement.dp import DPPolicy
from kernbench.tp.kernels import _gemm_kernel
from kernbench.tp.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
)
class ColumnParallelLinear:
    """Shard the K (out_features) axis of the weight across TP ranks.
    forward(x):
        x: (M, N), fully replicated across ranks
        W_k: (N, K / world_size), rank-local slice (resides on SIP r via set_device)
        y_k = x @ W_k → (M, K / world_size), rank-local output
    The output is column-sharded, the input form RowParallelLinear expects.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert out_features % ws == 0
        self.in_features = in_features
        self.k_local = out_features // ws
        self._torch = torch
        # Each rank owns its slice, placed on SIP r by set_device(rank).
        self.weight = torch.zeros(
            (in_features, self.k_local), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_w",
        )
        self.bias = None
        if bias:
            self.bias = torch.zeros(
                (self.k_local,), dtype=dtype,
                dp=DPPolicy(cube="replicate", pe="replicate"),
                name="col_parallel_b",
            )
    def forward(self, x):
        # x is fully replicated (guaranteed by the caller). Plain local gemm.
        M = x.shape[0]
        out = self._torch.empty(
            (M, self.k_local), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_out",
        )
        self._torch.launch(
            "col_parallel_gemm", _gemm_kernel,
            x, self.weight, out, M, self.in_features, self.k_local,
        )
        # Bias add is a separate kernel or the fused bias of a composite gemm.
        # The initial scope only verifies bias=False thoroughly.
        return out
```
**Yield-safety contract (normative)**: `ColumnParallelLinear.forward`
contains, in its single `torch.launch` call, a kernel launch → internal
`ctx.wait` pair. This automatically satisfies the D0.4-(1) condition "a worker
yields within a finite number of steps"; TP-layer users do not need to insert
yield patterns by hand.
### D5. `RowParallelLinear`
```python
class RowParallelLinear:
    """Shard the N (in_features) axis of the weight across TP ranks.
    forward(x):
        x: (M, N / world_size), rank-local slice (output of ColumnParallel)
        W_k: (N / world_size, K), rank-local slice
        y_k = x @ W_k → (M, K), partial sum on each rank
        y = all_reduce(y_k, op="sum") → (M, K) on every rank
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert in_features % ws == 0
        self.n_local = in_features // ws
        self.out_features = out_features
        self._torch = torch
        self.weight = torch.zeros(
            (self.n_local, out_features), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_w",
        )
        # Bias lives on rank 0 only (Megatron convention). Omitted in the initial scope.
        self.bias = None
    def forward(self, x):
        M = x.shape[0]
        y_partial = self._torch.empty(
            (M, self.out_features), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_partial",
        )
        self._torch.launch(
            "row_parallel_gemm", _gemm_kernel,
            x, self.weight, y_partial, M, self.n_local, self.out_features,
        )
        # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly under
        # D0 + mp.spawn (kernel parent stays main).
        self._torch.distributed.all_reduce(y_partial, op="sum")
        return y_partial
```
**Yield-safety contract (normative)**: `RowParallelLinear.forward` contains
launch → internal wait followed by `all_reduce` (defer + worker-yield
pattern), so **at least 2 yields** per forward are guaranteed. The D0.4-(1)
scheduler-progress condition is satisfied automatically. Every TP layer
forward in this ADR maintains the invariant "contains at least one wait or
collective and is therefore yield-safe"; TP primitives added later
(VocabParallelEmbedding, etc.) must honor the same contract.
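Why a column-parallel layer followed by a row-parallel layer needs only one all-reduce can be checked numerically. The sketch below is a host-side reference in plain Python, not KernBench code: `matmul` and `tp_mlp_reference` are hypothetical helpers, ranks are emulated by a loop, and the all-reduce is an element-wise sum of the per-rank partial results.

```python
def matmul(a, b):
    """Plain-Python matrix product on lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def tp_mlp_reference(x, w1, w2, ws):
    """Emulate ColumnParallelLinear -> RowParallelLinear over `ws` ranks."""
    k_local = len(w1[0]) // ws                 # hidden columns owned per rank
    total = None
    for r in range(ws):
        lo, hi = r * k_local, (r + 1) * k_local
        w1_r = [row[lo:hi] for row in w1]      # column slice of W1 (rank r)
        w2_r = w2[lo:hi]                       # matching row slice of W2
        h_r = matmul(x, w1_r)                  # column-sharded hidden activation
        y_r = matmul(h_r, w2_r)                # partial sum held by rank r
        if total is None:
            total = y_r
        else:                                  # all_reduce(op="sum") across ranks
            total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, y_r)]
    return total
```

Because each rank multiplies a column slice of the first weight by the matching row slice of the second, the per-rank results are exactly the terms of the full product, and their sum equals `x @ W1 @ W2`.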
### D6. Primitive functions
```python
# primitives.py
def copy_to_tp_region(x):
    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
    return x
def reduce_from_tp_region(x, torch):
    """Forward: all-reduce. Backward: identity."""
    torch.distributed.all_reduce(x, op="sum")
    return x
def scatter_to_tp_region(x):
    raise NotImplementedError(
        "Phase 2: superseded by the user creating already-sharded tensors"
    )
def gather_from_tp_region(x):
    raise NotImplementedError(
        "Phase 2: requires an all-gather kernel first (future)"
    )
```
### D7. Sample bench: 2-layer MLP with TP
```python
# benches/tp_mlp.py (new)
from kernbench.policy.placement.dp import DPPolicy
import kernbench.tp as tp
import numpy as np
def worker(rank: int, world_size: int, torch):
    torch.ahbm.set_device(rank)
    tp.initialize_model_parallel(world_size)
    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)
    x = torch.zeros(
        (B, D_in), dtype="f16",
        dp=DPPolicy(cube="replicate", pe="replicate"),
        name="x",
    )
    # init x with some pattern (e.g., constant)
    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))
    h = fc1.forward(x)  # column-sharded (B, D_hidden / ws)
    y = fc2.forward(h)  # all-reduced (B, D_out) on every rank
    # Only rank 0 prints / verifies the result.
    if rank == 0:
        result = y.numpy()
        # With zero-init weights every value is 0; in this scope only
        # completion itself is verified.
        print(f" tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")
def run(torch):
    torch.distributed.init_process_group(backend="ahbm")
    ws = torch.distributed.get_world_size()
    torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)
```
### D8. Non-functional: training not supported
This ADR is **inference/forward only**. Backward / gradient / optimizer are
future work. This is natural, since KernBench does not target training today.
### D9. Initial scope constraints
- TP size = world_size (no mixed DP+TP).
- `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
- **Weights default to zero**. A proper init scheme (Xavier, Kaiming, etc.) is
  future work. Tests, however, inject a deterministic non-zero pattern via
  `tensor.copy_` to verify numerical correctness (T2/T6). In other words,
  operation is split into "production default = zero, verification =
  deterministic non-zero".
- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is
  future work).
- Pipeline parallelism is out of scope.
- VocabParallelEmbedding requires all-gather first → stub only.
### D10. Regression: lifting the `ring_default_ws` xfail (mandatory acceptance)
Thanks to D0 (worker-wait generalization) + D0.5 (host-read barrier), every
worker-driven `ctx.wait` and host-read is routed through the main-drain path,
which eliminates the cause of the kernel-greenlet orphan in ADR-0024 Phase B.
Turning the `ring_default_ws` strict-xfail case of the existing matrix test
into **PASS** after this ADR is implemented is included as a **mandatory
regression criterion**. The observable acceptance criteria are specified in
**T7** (no deadlock, no GreenletExit, numerical tolerance, etc.).
---
## Dependencies
- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
`torch.ahbm.set_device(rank)`.
- **ADR-0026** (DPPolicy intra-device): representation of the per-rank slice
  of a weight tensor.
- **ADR-0023 / ADR-0025** (IPCQ): foundation of the `dist.all_reduce`
  implementation.
---
## Non-goals
- **Backward pass / training**: inference only. Training simulation belongs in
  a separate ADR.
- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
- **Weight init schemes**: plain zero / debug pattern.
- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
- **DTensor integration**: ADR-0028, future.
- **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm
  kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op is
  introduced.
---
## Open questions
- **Location of `initialize_model_parallel`**:
  `kernbench.tp.initialize_model_parallel` (current decision) vs real-PyTorch's
  `torch.distributed.init_device_mesh`. Kept in the TP-specific module.
- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be
  needed for meaningful verification; add it in the Phase 1 tests if needed.
- **Bias placement policy**: Megatron places the RowParallelLinear bias on
  rank 0 only. The initial scope sidesteps this with bias=False.
- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs importing
  from the existing `benches/gemm_single_pe.py`. TP must not depend on a
  bench, so the kernel is duplicated inside tp. It can later move to a shared
  `kernbench.kernels` package.
**Resolved (open in previous revisions)**:
- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
  official host-read entry-points (`numpy`, `data`, `__getitem__`,
  data-including `__repr__`) are automatic drain barriers. Metadata-only
  accessors are not barriers.
---
## Consequences
### Positive
- **Easy porting of Megatron code**: the API matches real training code.
- **TP benchmarks become possible**: studying HW characteristics such as
  scaling and communication-compute overlap.
- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024
  Phase B blocker is resolved.
- **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the
  hand-rolled loop. Later collective/TP benches reuse the same pattern.
- **Clearer DPPolicy semantics** (synergy with ADR-0026): a reference example
  of a TP layer using only intra-device DPPolicy.
### Negative
- Maintenance cost of a new module (`kernbench.tp`).
- The initial scope is limited (pure TP only, forward only).
- The D0 generalization changes the semantics of `ctx.wait`; compatibility
  with single-driver tests must be verified explicitly (T7).
### Neutral
- A pure upper layer added on top of ADR-0024/0026. No impact on the hardware
  simulation stack (except D0).
@@ -0,0 +1,256 @@
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
## Status
Accepted (supersedes ADR-0029).
## Context
### Goal
Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
### Why replace ADR-0029 (hierarchical 3-level)
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).
Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.
### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
- `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
`hierarchical_allreduce` modules and their tests are **removed**.
---
## Decision
### D1. Algorithm structure — 5 phases
For each SIP (launched concurrently by `mp.spawn`):
```
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
holds the full SIP sum.
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
selected by sip_topo_kind (from topology.yaml sips.topology).
Phase 4 — Col broadcast S → N on rightmost column.
Phase 5 — Row broadcast E → W across the cube mesh.
```
After all phases every cube's pe0 holds the global sum.
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.
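The data movement of the five phases can be traced with a small host-side model. This is a sketch under stated assumptions, not the kernel: `intercube_allreduce_model` is a hypothetical name, each cube holds a single scalar, cube ids are assumed row-major on the mesh, and phase 3 is collapsed into a plain sum because every supported SIP topology produces the same reduced value.

```python
def intercube_allreduce_model(values, mesh_w=4, mesh_h=4):
    """values[sip][cube] -> scalar per cube; returns the post-all-reduce state.

    Models data movement only (no timing, no IPCQ credits).
    """
    n_sips = len(values)
    root = (mesh_h - 1) * mesh_w + (mesh_w - 1)    # SE-corner cube
    acc = [list(v) for v in values]
    for a in acc:
        # Phase 1: row reduce W -> E; the rightmost column ends with row sums.
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                a[row * mesh_w + col] += a[row * mesh_w + col - 1]
        # Phase 2: column reduce N -> S on the rightmost column.
        for row in range(1, mesh_h):
            a[row * mesh_w + mesh_w - 1] += a[(row - 1) * mesh_w + mesh_w - 1]
    # Phase 3: inter-SIP exchange between root cubes (topology-independent sum).
    total = sum(a[root] for a in acc)
    for a in acc:
        a[root] = total
        # Phase 4: column broadcast S -> N on the rightmost column.
        for row in range(mesh_h - 2, -1, -1):
            a[row * mesh_w + mesh_w - 1] = a[(row + 1) * mesh_w + mesh_w - 1]
        # Phase 5: row broadcast E -> W across the mesh.
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                a[row * mesh_w + col] = a[row * mesh_w + col + 1]
    return acc
```

After the call every cube on every SIP holds the same global sum, which matches the post-condition stated above.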
### D2. Tensor layout (rank = SIP, per-worker)
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:
```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
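The shard address arithmetic can be written out as a small helper. This is an illustrative sketch: `shard_addr` is a hypothetical function, and the factor 2 in the formula is read here as the byte width of one f16 element.

```python
F16_BYTES = 2  # assumption: the "* 2" in the formula is the f16 element size


def shard_addr(t_ptr: int, cube_id: int, n_elem: int, elem_bytes: int = F16_BYTES) -> int:
    """Byte address of the shard that cube `cube_id` owns on pe0."""
    return t_ptr + cube_id * n_elem * elem_bytes
```

For example, with `n_elem = 8` consecutive cubes are 16 bytes apart, so cube 15 starts 240 bytes after `t_ptr`.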
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP** — regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.
| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
direction resolver handles 2-SIP bidirectional rings correctly.
Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.
This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.
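The wiring inputs described above can be sketched as follows. This is not the body of `configure_sfr_intercube_multisip`: `build_rank_to_pe` and `mesh_neighbors` are hypothetical helpers, cube ids are assumed row-major, and N is assumed to point toward lower row indices.

```python
def build_rank_to_pe(n_sips: int, n_cubes: int) -> list:
    """Internal IPCQ rank -> (sip, cube, pe) with pe fixed to 0."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def mesh_neighbors(cube: int, mesh_w: int = 4, mesh_h: int = 4) -> dict:
    """N/S/E/W neighbor cube ids on a mesh without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out
```

With 2 SIPs and 16 cubes the internal world size is 32, and corner cubes such as 0 and 15 end up with only two intercube neighbors each.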
### D4. SIP topology — from `topology.yaml`
```yaml
system:
  sips:
    count: 2
    topology: ring_1d  # or torus_2d, mesh_2d_no_wrap
```
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
`global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
broadcast per dimension.
2D variants require `n_sips` to be a perfect square.
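The perfect-square requirement and the derivation of the topology triple can be sketched as below. `derive_sip_topo` is a hypothetical helper, and the `(n_sips, 1)` shape returned for `ring_1d` is an assumption; only the name-to-kind codes follow the constants listed in this ADR.

```python
import math

# Name -> kernel branch code, following the 0/1/2 constants in this ADR.
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def derive_sip_topo(name: str, n_sips: int) -> tuple:
    """Return (sip_topo_kind, sip_topo_w, sip_topo_h) for a topology name."""
    kind = TOPO_NAME_TO_KIND[name]
    if kind == 0:
        return kind, n_sips, 1          # assumed shape for the 1D ring
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires n_sips to be a perfect square, got {n_sips}")
    return kind, side, side
```

Rejecting non-square counts at configuration time keeps the failure out of the kernel, where it would surface as a hang or a wrong neighbor lookup.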
### D5. Process-group integration — `AhbmCCLBackend`
At `init_process_group` time the backend:
1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
SFR wiring, mirrors NCCL communicator creation.
At each `dist.all_reduce(tensor)` call:
1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
`kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
`sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
handles after all workers submit (per ADR-0027 D0.4).
### D6. Config schema
`ccl.yaml`:
```yaml
defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  ...
algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
```
`topology.yaml`:
```yaml
system:
sips:
count: 2
topology: ring_1d
sip:
cube_mesh: { w: 4, h: 4 }
```
### D7. Algorithm module contract
Modules loaded via `cfg["module"]` must export:
| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
---
## Dependencies
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
`_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
## Non-goals
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
`mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
change when needed.
---
## Consequences
### Positive
- **Single kernel, single install path** for all-reduce — replaces four
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
available — supports future dynamic root-cube election.
### Negative
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
shard within one cube across 8 PEs are not addressable by this kernel.
Such workloads would need a separate intra-cube all-reduce path (not
yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
small but not zero.
---
## Affected files
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,162 @@
# ADR-0033 — Latency Model: Assumptions and Known Simplifications
## Status
Accepted
## Context
The simulator is an analytical, event-driven performance model — not a
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
or omitted by design. To keep the model auditable and reviewable as a whole,
this ADR consolidates the assumptions in one place. Individual component ADRs
(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
the *limits of fidelity*.
## Decisions
### D1. Modeled precisely
- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
ADR-0015 D2.
- **Per-component switching/overhead latency** (`overhead_ns` attr).
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
with address-based PC selection (ADR-0034 D3). Burst granularity tunable
(`burst_bytes`, default 256B). Read and write share each PC's
`available_at` (real HW command bus is per-PC shared).
- **HBM direction switching penalty mechanism**: per-PC last-direction
tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
with payload into `Flit` objects of `flit_bytes` (default = HBM
`burst_bytes` = 256B). The wire emits each flit individually after
`prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
flit arrival rate per real-HW wormhole semantics.
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
is the *only* conduit between `src.out_ports[dst]` and
`dst.in_ports[src]`. Earlier the two were aliased to the same
`simpy.Store`; when the wire put a chunkified flit back, the
destination's `fan_in` could pull it before the wire applied
bandwidth delay, leaving half the flits bypassing the bottleneck.
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
forward each flit serially with per-transaction overhead applied
ONCE on the first-flit arrival (header decode model). Subsequent
flits pipeline through with no extra delay. Wormhole emerges
naturally across multi-hop paths.
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
with the `is_last` flit waiting for the last PC commit before
signaling `txn.done`.
- **Non-flit-aware components (default) reassemble flits at
`_fan_in`** before the legacy `_forward_txn` path runs. This
preserves backward compatibility for components that have not yet
been migrated to flit-aware processing (e.g., `MCpuComponent`,
`IoCpuComponent` sub-txn generators). Such components reassemble
*once per leg boundary*, NOT per hop — multi-hop wormhole timing
through a chain of flit-aware routers is preserved.
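The wire chunk-streaming timing in D1 can be sketched as follows. This is a minimal illustration, not the simulator's wire process: `flit_arrival_times` and its default values are hypothetical, it covers a single transaction on an idle edge, and it assumes propagation delay is pipelined (each flit pays `prop_ns` once, but only serialization time accumulates between flits). Bandwidth is in GB/s, which equals bytes per ns.

```python
def flit_arrival_times(nbytes, flit_bytes=256, bw_gbs=32.0, prop_ns=1.0, start_ns=0.0):
    """Arrival time (ns) of each flit of one transaction on an idle wire.

    Hypothetical helper. Flits serialize on the link at bw_gbs (bytes/ns);
    propagation delay is assumed to be pipelined rather than cumulative.
    """
    times = []
    link_free = start_ns  # time at which the link finishes serializing the previous flit
    remaining = nbytes
    while remaining > 0:
        flit_nbytes = min(flit_bytes, remaining)
        link_free += flit_nbytes / bw_gbs   # serialization delay of this flit
        times.append(link_free + prop_ns)   # arrival = end of serialization + propagation
        remaining -= flit_nbytes
    return times
```

Under these assumptions a 512B payload on a 32 GB/s link produces two flits spaced one serialization time (8 ns) apart, which is the bandwidth throttling the D1 bullet describes.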
### D2. Approximated (with known directional error)
| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Optimistic at the default `switch_penalty_ns = 0` (assumes an ideal scheduler amortizes R/W switching); pessimistic for dense R/W alternation when set non-zero |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
### D3. Ignored (out of scope)
- Bank-level row buffer conflict penalty (assume no conflicts — best case;
the model has no per-bank state within a PC, so same-bank reuse cannot be
detected).
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
`burst_time = burst_bytes / pc_bw_gbs`).
- Refresh, ECC, thermal throttling, power gating.
- Clock domain crossings, PLL lock time.
- Upstream backpressure due to downstream buffer occupancy (input ports use
unbounded `simpy.Store`).
- Sub-flit cycle-level arbitration at routers (flit granularity is our
smallest unit).
### D4. Workload sensitivity
Workloads where the above simplifications meaningfully affect results:
- **Random scatter/gather**: bank conflict ignored → model optimistic.
- **Heavy mixed R/W traffic** (e.g., GEMM bias accumulation): HBM scheduler
absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
setting it non-zero models pessimistic per-alternation cost.
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
limits not modeled → model optimistic.
- **Very small (sub-flit) transactions**: flit quantization noise.
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
flit level, so per-flow fairness within a single edge is not modeled.
Pre-edge merging (multiple sources arriving at a router and being
forwarded to the same downstream wire) is correctly modeled via the
flit-aware router's serial worker.
### D5. Verification policy
For workloads in D4, cross-check against real HW or a cycle-accurate
simulator before drawing absolute-magnitude conclusions. The model remains
accurate for **relative comparisons** within the modeled regime.
### D6. Future work
Note: multi-stream merging at routers IS modeled correctly — each
in_port has its own fan_in process, all push to a shared inbox, and
the router worker forwards in inbox FIFO order. Flits from different
upstream streams naturally interleave at flit granularity. The items
below are different concerns, ordered by expected workload impact.
**Higher impact (workload accuracy gap)**:
- [ ] **Bank-level conflict modeling** within a PC (opt-in via
`track_banks: true`). Currently we assume no same-bank reuse;
random scatter/gather workloads are optimistic here.
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
from the design discussion). Default `switch_penalty_ns=0` is the
ideal-amortization stand-in; bursty mixed R/W workloads benefit
from explicit modeling.
- [ ] **Backpressure** modeling for finite component buffers. Matters
at high concurrency / sustained saturation where buffer occupancy
causes upstream stalls.
- [ ] **Op_log integration with chunk-streaming**: currently op_log
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
GemmCmd, MathCmd) which are not chunkified. Integration would
require flit-aware components to also emit op_log start/end hooks
per transaction (start on first flit, end on is_last).
**Lower impact (academic / specific use cases)**:
- [ ] **Cycle-accurate router arbitration policies** (RR with
priorities, age, iSLIP). The FIFO inbox is already approximately
fair when flit arrival times differ slightly between streams (the
common case for similar-rate workloads). True impact appears only
for: (a) priority/QoS modeling, (b) per-stream tail latency
analysis under sustained saturation. Not critical for makespan or
average-latency studies.
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
per 32B flit. Effect is small for most workloads (sub-flit timing
noise on small messages).
## Consequences
- Single review point for all model fidelity questions. Each future PR
touching latency must update the relevant section here.
- Workload-specific magnitude error envelopes are explicit.
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
enforces the ADR-0017 D8 invariant in code rather than relying on manual
YAML consistency.
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
per-flit timing) rather than via terminal `drain_ns` injection. Single
transactions land at `drain + commit_time + small_overheads`; multi-hop
preserves wormhole pipelining; multi-stream merge correctly serializes
at the shared wire's FIFO.
## Cross-references
- ADR-0015 — component / port / wire model.
- ADR-0017 — Cube NOC architecture and HBM connectivity.
- ADR-0004 — memory semantics, local HBM.
- ADR-0034 — HBM controller internal design.
@@ -0,0 +1,271 @@
# ADR-0034: HBM Controller Internal Design
## Status
Accepted
## Context
`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
the cube NOC. One instance is created per PE under the topology node
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
burst-granular commit timing, address-based PC selection, and response
routing back to the requester.
This ADR documents the component as currently implemented. ADR-0017 D4/D8
defines *where* HBM CTRL attaches and *what* aggregate BW it must
deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
scope. This ADR fills the gap between those two — the per-instance
internal scheduling model.
## Decision
### D1. Role
`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
attaches to that PE's router via the `peX.hbm` attachment list in
`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
into one endpoint.
The component models:
- Per-PC scheduling (D2) with R/W command-bus sharing.
- Address-based PC selection (D3).
- Burst-granular commit timing (D4).
- Flit-aware per-flit PC commit and async finalize (D5, D6).
- Command-only Transaction handling for read-data drain (D7).
- Response routing back to the requester (D8).
It does not model:
- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
(ADR-0033 D3).
- Cross-PE HBM contention beyond its own router edge (handled by the
router mesh — ADR-0017 D3).
- 1:1 channel mode (ADR-0017 D8 future work).
### D2. Per-PC scheduling model
Per-instance state initialised in `start()`:
- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
`num_pcs`, initial 0.0.
- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
each PC, used for switch-penalty detection (D4); initial `None`.
`num_pcs` and `burst_bytes` must each be a positive power of two so
that address-based PC selection (D3) reduces to a shift-and-mask.
Read and write requests share the same `_pc_avail` slot per PC — the
real HW per-PC command bus is shared between read and write traffic, so
issuing a write to PC k blocks a subsequent read to PC k by exactly the
burst time.
Direction `dir` for a request is inferred from the request type:
- `MemoryWriteMsg` → `"W"`.
- `PeDmaMsg` with `is_write=True` → `"W"`.
- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
### D3. Address-based PC selection
PC index for an access is derived from the access address by shift and
mask:
```text
pc_shift = log2(burst_bytes) # default 8 (burst=256B)
pc_mask = num_pcs - 1 # default 7 (8 PCs)
pc = (address >> pc_shift) & pc_mask
```
Computed once in `start()` from topology config so alternative
`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
default `(256, 8)` this places the PC select field at bits `[10:8]` of
the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
within the PC slice (see `phyaddr.py` comment).
Address-based striping — as opposed to address-blind global
round-robin — preserves PC parallelism for offset-disjoint concurrent
transfers: each transfer's bursts land deterministically on the PC set
implied by its byte addresses, so multi-PE workloads accessing disjoint
regions do not collide on a single PC.
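The shift-and-mask selection can be sketched in Python. This is an illustrative stand-in rather than the component's `_pc_for_address`; `make_pc_selector` is a hypothetical name, and the power-of-two check mirrors the constraint stated in D2.

```python
def make_pc_selector(burst_bytes=256, num_pcs=8):
    """Build an address -> PC index function using shift-and-mask.

    Hypothetical helper; both arguments must be positive powers of two.
    """
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1  # log2(burst_bytes) for a power of two
    pc_mask = num_pcs - 1

    def pc_for_address(address):
        return (address >> pc_shift) & pc_mask

    return pc_for_address


pc_for_address = make_pc_selector()
# Eight consecutive 256B bursts starting at offset 0 visit each PC exactly once.
pcs = [pc_for_address(i * 256) for i in range(8)]
```

Computing the shift and mask once, as the closure does here, matches the "computed once in `start()`" behaviour described above.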
### D4. Burst granularity and PC commit timing
A single PC commit takes:
```text
chunk_time = burst_bytes / pc_bw_gbs # ns
```
- `burst_bytes` (default 256) is the burst granularity matching the
flit size (ADR-0033 D1).
- `pc_bw_gbs` is **builder-derived** from
`hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
the ADR-0017 D8 invariant that aggregate per-PE BW equals the
router-to-HBM link BW.
Per-PC commit scheduling for an arriving access on PC `pc` with
direction `dir`:
```text
switch_cost = switch_penalty_ns
if pc_last_dir[pc] not in (None, dir) else 0
start = max(env.now, pc_avail[pc]) + switch_cost
finish = start + chunk_time
pc_avail[pc] = finish
pc_last_dir[pc] = dir
```
Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
model pessimistic per-alternation cost.
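A minimal Python sketch of the schedule above, assuming times are plain floats in ns. `PcSchedule` and `commit` are hypothetical names introduced for illustration; the real component keeps this state on `HbmCtrlComponent` itself.

```python
class PcSchedule:
    """Per-PC commit bookkeeping (illustrative stand-in, not the real component)."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0, switch_penalty_ns=0.0):
        self.chunk_time = burst_bytes / pc_bw_gbs  # ns per burst, since GB/s == bytes/ns
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs      # earliest time each PC is free
        self.pc_last_dir = [None] * num_pcs  # "R", "W", or None before the first commit

    def commit(self, now, pc, direction):
        """Schedule one burst on `pc` and return its finish time."""
        switched = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switched else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish
```

With `switch_penalty_ns=2.0`, a write followed immediately by a read on the same PC finishes at 8 ns and 18 ns respectively, while a read on a different PC still finishes at 8 ns.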
### D5. Flit-aware per-flit PC commit (primary path)
`_handle_flit` is the primary worker path. For each arriving `Flit`:
1. On the **first** flit of a transaction (`tid = id(txn)` not in
`_txn_state`):
- Apply `overhead_ns` once via `run(env, nbytes)` — header decode
model, first-flit overhead pattern (ADR-0033 D1).
- Initialise `_txn_state[tid] = {"last_finish": env.now}`.
2. Compute `pc = _pc_for_address(flit.address)` (D3).
3. Apply the per-PC schedule (D4) using the request direction (D2).
4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
(D6).
Per-flit address-aware commit is the mechanism that lets concurrent
multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
in parallel.
### D6. Async finalize per transaction
When a transaction's last flit has been scheduled, finalisation runs in
a separately-spawned process:
```python
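def _finalize_txn(env, txn, last_finish):
    # Wait until the transaction's last PC commit has drained.
    wait = last_finish - env.now
    if wait > 0:
        yield env.timeout(wait)
    # Then route the response back to the requester (D8).
    yield from _send_response(env, txn)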
```
`_handle_flit` spawns this via `env.process(...)` and returns
immediately, so the worker can pick up the next inbox message while the
last PC commit drains.
Without this split — i.e. if the worker itself did
`yield env.timeout(wait)` — concurrent single-flit transactions whose
addresses hit distinct PCs would still serialise at `chunk_time` each
inside the worker, hiding the PC parallelism that D3 and D5 are
designed to expose.
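The difference can be illustrated without SimPy. The sketch below is a simplified model of the worker loop only, with hypothetical names: all transactions arrive at t=0, each is a single flit, and each targets a distinct idle PC.

```python
def completion_times(num_txns, chunk_time, blocking_worker):
    """Completion time (ns) of single-flit transactions that each hit a distinct PC.

    Hypothetical model: `blocking_worker=True` makes the worker wait for each
    commit before reading the next message; False models the async finalize split.
    """
    now = 0.0
    done = []
    for _ in range(num_txns):
        finish = now + chunk_time  # the target PC is idle, so the commit starts at `now`
        done.append(finish)
        if blocking_worker:
            now = finish           # worker is stalled until this commit drains
    return done
```

With four transactions and `chunk_time = 8.0`, the blocking variant completes them at 8, 16, 24 and 32 ns, whereas the async variant completes all four at 8 ns.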
### D7. Non-flit fallback for command-only transactions
`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
`Flit`. This is the path for command-only requests that the wire does
not chunk into flits — most notably `MemoryReadMsg` whose command txn
carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
not as inbound flits).
Procedure:
1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`
— for read commands, work is sized by the request.
2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
0.
3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are
scheduled over time at `drain/n_chunks` ns intervals to model the
bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
6. After scheduling all chunks, wait `last_finish - env.now` then call
`_send_response`.
`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
`_handle_flit` — there is exactly one source of PC scheduling truth
across both paths.
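Steps 2, 3 and 5 of the procedure can be sketched as a pure function that returns when each chunk is presented to the PC schedule and at which address. `chunk_plan` is a hypothetical helper; it omits `overhead_ns` and the D4 commit itself.

```python
import math


def chunk_plan(work_bytes, drain_ns, base_address, burst_bytes=256, start_ns=0.0):
    """Return [(arrival_ns, address), ...] for the chunk loop of a command-only txn.

    Illustrative only: each entry would be passed to the D4 schedule together
    with the PC selected from its address (D3).
    """
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    chunk_interval = drain_ns / n_chunks if n_chunks > 0 and drain_ns > 0 else 0.0
    plan = []
    now = start_ns
    for i in range(n_chunks):
        now += chunk_interval  # advance by one chunk interval before scheduling
        plan.append((now, base_address + i * burst_bytes))
    return plan
```

For a 1024B read with `drain_ns = 32.0`, this yields four chunks arriving 8 ns apart at consecutive 256B-aligned addresses, so they land on four different PCs under the default striping.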
### D8. Response routing
`_send_response` dispatches on request type and path geometry:
| Case | Trigger | Response |
| --- | --- | --- |
| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
| Bypass (Memory Read) | `not any("m_cpu" in hop for hop in txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
| Bypass (Memory Write) | `not any("m_cpu" in hop for hop in txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
The "bypass" classification matches the Memory R/W fabric path defined
in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
without M_CPU). The PE_DMA case uses its own dedicated reverse path to
keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
ResponseMsg envelope).
In all reverse-path cases, the response Transaction is put onto
`out_ports[reverse_path[1]]` — the first hop back along the recorded
forward path. If `reverse_path` has fewer than 2 entries (degenerate
path), the original `txn.done` is signalled directly.
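The dispatch order in the table can be sketched as a small classifier. The message classes below are empty stand-ins for the real types, the returned labels and the node ids in the usage are hypothetical, and the bypass trigger is interpreted as "no hop in the path contains `m_cpu`".

```python
class PeDmaMsg: ...
class MemoryReadMsg: ...
class MemoryWriteMsg: ...
class KernelLaunchMsg: ...


def classify_response(request, path):
    """Return which D8 response case applies (labels are illustrative)."""
    if isinstance(request, PeDmaMsg):
        return "pe_dma"
    # Assumed reading of the bypass trigger: M_CPU does not appear on the path.
    bypass = not any("m_cpu" in hop for hop in path)
    if bypass and isinstance(request, MemoryReadMsg):
        return "bypass_read"
    if bypass:
        return "bypass_write"
    return "default"


direct_path = ["sip0.io0.pcie_ep", "sip0.cube0.hbm_ctrl.pe0"]  # hypothetical node ids
via_mcpu = ["sip0.cube0.m_cpu", "sip0.cube0.hbm_ctrl.pe0"]     # hypothetical node ids
```

The PE_DMA check comes first so that DMA traffic never falls into the bypass or default branches, matching the row order of the table.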
### D9. Configurable attributes
| Attribute | Default | Source | Notes |
| --- | --- | --- | --- |
| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
directly so the aggregate per-PE BW matches the router-to-HBM link BW
without yaml-side duplication.
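The derivation can be sketched as below. `derive_pc_bw_gbs` is a hypothetical function, not the actual code in `topology/builder.py`, and the 256 GB/s link value is only an illustrative input chosen to reproduce the 32.0 default with 8 PCs.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs, num_pcs):
    """Split the router-to-HBM link bandwidth evenly across pseudo-channels.

    Hypothetical sketch: guarantees num_pcs * pc_bw_gbs == hbm_to_router_bw_gbs.
    """
    if num_pcs <= 0 or num_pcs & (num_pcs - 1):
        raise ValueError(f"num_pcs must be a positive power of two, got {num_pcs}")
    return hbm_to_router_bw_gbs / num_pcs


pc_bw_gbs = derive_pc_bw_gbs(256.0, 8)
```

Deriving the value in one place means the aggregate per-PE bandwidth cannot drift from the link bandwidth when either attribute is edited.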
## Consequences
### Positive
- Address-based PC selection preserves multi-stream HBM parallelism
that an address-blind round-robin would collapse — important for
multi-PE workloads with disjoint HBM regions.
- Flit-aware path (D5) + async finalize (D6) preserves wormhole
pipelining and exposes PC parallelism for back-to-back single-flit
transactions.
- Single source of PC scheduling truth (D4 mechanism, used by both D5
flit path and D7 chunk-loop path).
- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
discipline.
### Negative
- No bank-level conflict modelling within a PC; address-blind to
bank/row-buffer reuse (ADR-0033 D3).
- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
(ADR-0033 D2).
- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
accumulates per concurrent transaction and is removed only on
`is_last`. Adequate for current workloads.
## Links
- ADR-0001 (Physical address layout — PC bit field comment)
- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
implements)
- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
resolution)
- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
flit-aware PC commit, first-flit overhead, chunk-loop drain)
- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
@@ -0,0 +1,286 @@
# ADR-0035: M_CPU and M_CPU.DMA Component Model
## Status
Accepted
## Context
M_CPU is the cube-level command processor. It receives commands from
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
M_CPU as a fallback), fans them out to the PEs in its cube, and
aggregates per-PE responses into a single ResponseMsg sent back to
IO_CPU on the reverse path.
M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
it lives as internal state of `MCpuComponent`.
This ADR documents the M_CPU component implementation that realizes
those responsibilities, including the three distinct fan-out paths
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
model, and the response aggregation contract.
## Decision
### D1. Role
M_CPU has three responsibilities:
1. **Transit forwarding** — when not the terminal hop (e.g., on the
reverse response path PE → M_CPU → IO_CPU), forwards Transactions
to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
fan-out paths based on request type (D2).
3. **Response aggregation** — collects per-PE responses, sends a
single aggregate ResponseMsg back to IO_CPU on the reverse path.
Per invocation (`run()`): applies `overhead_ns` once per incoming
Transaction.
M_CPU does **not**:
- Decide routing — paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
(ADR-0014).
- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
`hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics — fan-out dispatch by Python
isinstance check only.
### D2. Three fan-out paths dispatched by request type
At the terminal hop the worker dispatches by request type:
```python
elif self.ctx is not None and txn.request is not None:
if isinstance(txn.request, KernelLaunchMsg):
env.process(self._kernel_launch_fanout(env, txn))
elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
env.process(self._mmu_msg_fanout(env, txn))
else:
env.process(self._dma_fanout(env, txn))
```
Each path uses a different router method:
- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
path to PE_MMU.
### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
`MCpuComponent.start()` initializes two SimPy resources:
```python
self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg
```
Properties:
- **Not a topology node** — managed entirely inside `MCpuComponent`;
does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels** — concurrent in-flight
Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step**
(`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
R/W requests at this M_CPU. Actual fabric transfer time is modeled
by wire processes between components (ADR-0015 D2) and by
`drain_ns` at terminal hops; the DMA resource does not gate
transfer duration.
Resource selection is request-type-based:
```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```
### D4. Transit forwarding at non-terminal hops
When `txn.next_hop` is not None — typical for the reverse response
path (PE → M_CPU → IO_CPU) — the worker forwards normally:
```python
if next_hop:
yield self.out_ports[next_hop].put(txn.advance())
```
The fan-out branches fire only at the terminal hop. The same component
therefore serves both forward command dispatch and reverse response
relay roles.
### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
For each Memory R/W request at terminal hop:
1. `_resolve_dma_destinations(request)` returns a per-PE
`hbm_ctrl.pe{X}` derived from the request's PA via
`ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
- Acquire the appropriate DMA resource (`_dma_write` or
`_dma_read`) via `with dma_res.request() as req`.
- Resolve path via `ctx.router.find_mcpu_dma_path()`.
- Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
- Create sub-Transaction carrying `drain_ns` and dispatch to
`path[1]`.
3. Track `max_drain_ns` across destinations and record it as
`txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
ResponseMsg on the reverse command path back to IO_CPU.
PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
no such node exists after ADR-0017 D4's per-PE partitioning. Kept
defensively but does not route to a real destination.
### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
For `KernelLaunchMsg` at terminal hop:
1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
`ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
- If the request already carries `target_start_ns` (stamped by
IO_CPU per ADR-0036 D3): **pass through unchanged**.
- If absent (direct-to-M_CPU launch in unit tests): compute a
per-cube barrier `env.now + max(per-PE leg latency)` and stamp
via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
control message; preserving nbytes=0 keeps fan-out off the shared
first-hop fabric BW, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
from each sub-Transaction's `result_data` into the parent
transaction:
```python
txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
txn.result_data["dma_ns"] = max(existing, max(dma_values))
txn.result_data["compute_ns"] = max(existing, max(compute_values))
```
The max-merge with the existing value matters because cross-cube
IO_CPU fan-out shares the same parent `result_data`; merging
prevents one cube from clobbering another's metric.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
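The max-merge in step 5 can be sketched as a small helper. `merge_max` is a hypothetical name, and treating a missing key as 0.0 is an assumption about how the existing value is initialised.

```python
def merge_max(result_data, key, sub_values):
    """Fold per-PE metric values into the parent's result_data using max.

    Illustrative helper: never lowers a value already written by another cube.
    """
    if not sub_values:
        return
    existing = result_data.get(key, 0.0)  # assumption: absent metric counts as 0.0
    result_data[key] = max(existing, max(sub_values))


parent = {}
merge_max(parent, "pe_exec_ns", [10.0, 30.0])  # first cube reports
merge_max(parent, "pe_exec_ns", [20.0])        # second cube reports a smaller value
```

After both calls `parent["pe_exec_ns"]` is still 30.0, which is the clobber-avoidance property the step describes.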
### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
`find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
back. Instead, the sub-Transaction's own `sub_done` event is the
completion signal.
5. Wait for all `sub_done` events in-line (does **not** use
`_pending` counter — D8 is for response-bearing fan-out only).
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
### D8. Response aggregation (`_pending` + `_parent_txns`)
For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
arriving on the reverse path):
```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```
- On dispatch: register `(expected, received=0, all_done)` and
remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes
them to `_collect_response`, which increments `received` and
signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate
ResponseMsg:
```python
resp_msg = ResponseMsg(
correlation_id=request.correlation_id,
request_id=request.request_id,
src_cube=cube_id,
src_pe=-1, # -1 = M_CPU aggregate, not a single PE
success=True, # no failure semantics implemented
)
```
- The response Transaction travels on `list(reversed(txn.path))`
back to IO_CPU.
MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
because PE_MMU is terminal — there is no ResponseMsg path to
intercept.
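The counter-based aggregation can be sketched without SimPy by replacing the `all_done` event with a callback. `PendingTable` and its methods are hypothetical names; removing the entry on completion is an assumption borrowed from the IO_CPU description.

```python
class PendingTable:
    """Per-request response counter (illustrative stand-in for `_pending`)."""

    def __init__(self):
        self._pending = {}

    def register(self, request_id, expected, on_all_done):
        # [expected, received, completion callback]
        self._pending[request_id] = [expected, 0, on_all_done]

    def collect(self, request_id):
        """Count one response; fire the callback once all have arrived."""
        entry = self._pending[request_id]
        entry[1] += 1
        if entry[1] >= entry[0]:
            del self._pending[request_id]
            entry[2]()
            return True
        return False


fired = []
table = PendingTable()
table.register("req-1", expected=3, on_all_done=lambda: fired.append("req-1"))
results = [table.collect("req-1") for _ in range(3)]
```

As in the real component, nothing here times out: if one of the three responses never arrives, the callback is never invoked.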
### D9. Helpers and configurable attribute
`_resolve_pe_ids(target_pe)`:
- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
`memory_map.hbm_slices_per_cube` (default 8).
Used by kernel-launch and MMU fan-out paths.
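A sketch of the three cases, assuming `n_slices` is passed in rather than read from the cube's `memory_map`. The free function `resolve_pe_ids` and the `TypeError` for unsupported inputs are illustrative choices, not confirmed behaviour of `_resolve_pe_ids`.

```python
def resolve_pe_ids(target_pe, n_slices=8):
    """Normalise a target_pe value into a list of PE ids (illustrative sketch)."""
    if target_pe == "all":
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    raise TypeError(f"unsupported target_pe: {target_pe!r}")
```

Returning a list in every case lets the fan-out loops iterate uniformly regardless of how the caller specified the target.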
A single configurable attribute drives per-instance latency:
| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
Applied once in `run()` per Transaction — models command
interpretation and dispatch-decision time at M_CPU.
## Consequences
### Positive
- Three fan-out paths are clearly separated by request type — adding
a new request kind is an isinstance branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently)
and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so
the same component handles forward dispatch and reverse response
relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
established by IO_CPU (ADR-0036 D3), while the fallback computation
keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data`
values is robust to cross-cube IO_CPU fan-out sharing the same
parent.
### Negative
- No partial-failure semantics — a missing per-PE response stalls the
parent `all_done` indefinitely. Acceptable for simulation; not
suitable as a production-style endpoint.
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
code (no such node exists post-ADR-0017 D4). Kept defensively;
invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call
is instantaneous in unbounded stores). The capacity=1 channel
models "one request being dispatched at a time per channel at this
M_CPU", not "transfer duration serialization"; readers must consult wire
processes (ADR-0015 D2) and `drain_ns` for actual transfer
parallelism.
## Links
- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
# ADR-0036: IO_CPU Component Model
## Status
Accepted
## Context
IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
graph. PCIE_EP receives host messages from the runtime API and routes
them via the io_noc; for command-bearing requests (KernelLaunch,
MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
- Fans out the request to per-cube M_CPUs.
- Aggregates per-cube responses into a single host-visible completion.
- For kernel launches, stamps a global `target_start_ns` barrier so
every PE across every targeted cube begins kernel body execution at
the same simulated time (ADR-0009 D5).
Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
this component therefore handles only command-plane traffic in normal
operation.
This ADR documents the IO_CPU component implementation that realizes
those responsibilities.
## Decision
### D1. Role
IO_CPU is the host-facing endpoint of the IO chiplet. It has two
primary responsibilities:
1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
MmuUnmapMsg to per-cube M_CPUs.
2. **Response aggregation** — collect per-cube ResponseMsg, signal
parent `txn.done` when all targeted cubes have responded.
A third, narrower responsibility applies only to KernelLaunchMsg:
**`target_start_ns` global barrier stamping** (D3).
The component does **not**:
- Decide routing — paths are pre-computed by the router (ADR-0002).
- Decode tensor or kernel internals — those concerns belong to
M_CPU / PE_CPU / engines.
- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
and ADR-0016 D3 (Memory R/W resolution code in
`_resolve_cube_targets` exists as a defensive fallback only).
Per invocation (`run()`): applies the configured `overhead_ns` once
per incoming Transaction (D8).
### D2. Forward path — multi-cube fan-out
When a non-response Transaction arrives, the worker:
1. Pays `overhead_ns` via `run()`.
2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
targets from the request (D5).
3. For each target:
- Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
- Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
- Creates a per-cube sub-Transaction with `path` populated and
forwards it to `path[1]` (the first hop on the io_noc).
4. Registers aggregation state: `_pending[request_id] = (expected,
received=0, parent_done)`.
### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
IO_CPU is the canonical stamper for `target_start_ns`. When the
request is a `KernelLaunchMsg`, IO_CPU computes a single global
barrier covering every targeted PE across every targeted cube:
```text
for (sip, cube) in cube_targets:
leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
for pe_id in target_pe_ids:
leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
nbytes=0)
latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
global_max = max(global_max, latency)
target_start_ns = env.now + global_max
```
The request is then replaced (via `dataclasses.replace`) so the
stamped value propagates through the fan-out.
Two overhead corrections:
- `io_overhead_ns` is subtracted because IO_CPU has already paid it
in `run()` before this method runs.
- `m_overhead_ns` is subtracted once because it appears as the
endpoint of leg1 *and* the start of leg2 in path latency, but
M_CPU pays it only once at run time.
Every downstream PE_CPU yields until `target_start_ns` before
beginning kernel body execution; all PEs therefore start at the same
simulated time regardless of how long their individual dispatch path
took.
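The barrier computation can be sketched with precomputed leg latencies standing in for `compute_path_latency_ns`. The function name, the dict-based inputs and the numeric values in the usage are hypothetical; only the formula follows D3.

```python
def global_target_start_ns(now, leg1_ns, leg2_ns, io_overhead_ns, m_overhead_ns):
    """Compute the global kernel-start barrier.

    leg1_ns: {(sip, cube): IO_CPU -> M_CPU latency}
    leg2_ns: {(sip, cube, pe_id): M_CPU -> PE_CPU latency}
    Both inputs are illustrative stand-ins for per-path latency queries.
    """
    global_max = 0.0
    for (sip, cube, pe_id), leg2 in leg2_ns.items():
        # IO_CPU overhead is already paid; M_CPU overhead is counted in both legs.
        latency = leg1_ns[(sip, cube)] + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)
    return now + global_max


leg1 = {(0, 0): 20.0, (0, 1): 35.0}
leg2 = {(0, 0, 0): 12.0, (0, 0, 1): 15.0, (0, 1, 0): 12.0}
barrier = global_target_start_ns(100.0, leg1, leg2, io_overhead_ns=4.0, m_overhead_ns=5.0)
```

Here the slowest dispatch path is cube 1, PE 0 (35 + 12 - 9 = 38 ns), so the barrier lands at 138 ns and every PE waits until then.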
### D4. KernelLaunch sub-Transactions carry `nbytes=0`
Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
overriding the parent `txn.nbytes`:
- Kernel launch is a control message; payload size is irrelevant at
the data-fabric level.
- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
io_noc's shared first hop. With 16 cubes this serializes fan-out,
pushing far M_CPUs past `target_start_ns` and breaking the D3
invariant.
Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
for the defensive Memory R/W fallback path, which carries actual
payload sizes).
### D5. Per-request-type cube target resolution
`_resolve_cube_targets` dispatches by request type:
| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
| --- | --- | --- |
| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
The Memory R/W rows exist for defensive completeness; the engine's
normal path routes Memory R/W via `_process_memory_direct()` /
`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
ADR-0016 D3).
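The per-SIP scoping described above can be sketched as follows. The ADR only states that `_my_sip()` parses the SIP id from the node id; the regular expression and the helper names `parse_sip_id` and `filter_to_sip` are assumptions made for illustration.

```python
import re

# Assumed node-id shape: "sip<N>.<rest>", e.g. "sip0.io0.io_cpu".
_SIP_PREFIX = re.compile(r"^sip(\d+)\.")


def parse_sip_id(node_id: str) -> int:
    """Extract the SIP index from a node id such as 'sip0.io0.io_cpu'."""
    match = _SIP_PREFIX.match(node_id)
    if match is None:
        raise ValueError(f"node id has no 'sip<N>.' prefix: {node_id!r}")
    return int(match.group(1))


def filter_to_sip(targets, my_sip):
    """Keep only the (sip, cube) targets owned by this IO_CPU's SIP."""
    return [(sip, cube) for sip, cube in targets if sip == my_sip]
```

An IO_CPU at `sip0.io0.io_cpu` would then drop any `(sip, cube)` pair whose `sip` is not 0 before fanning out.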
### D6. Response aggregation
`_pending: dict[request_id → (expected, received, parent_done)]`:
- On dispatch: register `(len(cube_targets), 0, txn.done)`.
- `_worker` recognises responses by `is_response=True` and routes
them to `_collect_response`.
- `_collect_response` increments `received`; when `received >=
expected`, `parent_done.succeed()` is invoked and the entry is
removed from `_pending`.
This is a simple per-request counter. There is no per-cube identity
tracking and no partial-failure handling: a missing response stalls
the parent's `done` event indefinitely. Production-style failure paths
are out of scope for the current simulator model.
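The counter-based aggregation in D6 can be sketched without SimPy. `FakeEvent`, `Pending`, and `ResponseAggregator` are hypothetical names; `FakeEvent` only mimics the `succeed()` call that a real SimPy event would provide.

```python
from dataclasses import dataclass


class FakeEvent:
    """Stand-in for a SimPy event; only records that succeed() was called."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Pending:
    expected: int
    received: int
    parent_done: FakeEvent


class ResponseAggregator:
    def __init__(self) -> None:
        self._pending: dict[int, Pending] = {}

    def register(self, request_id: int, n_targets: int, parent_done: FakeEvent) -> None:
        """Called on dispatch: one expected response per cube target."""
        self._pending[request_id] = Pending(n_targets, 0, parent_done)

    def collect_response(self, request_id: int) -> None:
        """Count one response; complete the parent once all have arrived."""
        entry = self._pending[request_id]
        entry.received += 1
        if entry.received >= entry.expected:
            entry.parent_done.succeed()
            del self._pending[request_id]
```

Because only a count is kept, the sketch shares the limitation noted above: if one response never arrives, the entry stays in `_pending` and the parent never completes.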
### D7. `target_pe` resolution helper
`_resolve_pe_ids(target_pe)`:
- `int` → `[target_pe]`.
- `tuple[int, ...]` → `list(target_pe)`.
- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
`memory_map.hbm_slices_per_cube` (default 8).
Used in D3's barrier computation to enumerate every PE target per
cube.
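A minimal sketch of the three `target_pe` cases listed in D7. The free-function form and the explicit `n_slices` parameter are assumptions; in the ADR the value comes from `memory_map.hbm_slices_per_cube`.

```python
def resolve_pe_ids(target_pe, n_slices: int = 8) -> list[int]:
    """Expand a target_pe selector into an explicit list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    raise ValueError(f"unsupported target_pe: {target_pe!r}")
```

Rejecting unknown selectors with an exception is a choice made for this sketch; the ADR does not say how the real helper treats them.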
### D8. Configurable `overhead_ns`
A single attribute drives per-instance latency:
| Site | impl name | overhead_ns |
| --- | --- | --- |
| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
Applied once in `run()` per Transaction. Models command
interpretation + dispatch-decision time at IO_CPU.
## Consequences
### Positive
- Cross-cube and cross-SIP kernel launches share a single global
barrier (D3 + D4) — no per-cube divergence in start time.
- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
BW, preserving the barrier's accuracy at scale (16 cubes).
- Response aggregation via a single counter → minimal state,
deterministic ordering of completion.
- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
cleanly independent.
### Negative
- No partial-failure semantics — a missing per-cube response
indefinitely stalls the parent. Adequate for simulation but not
suitable as a production-style endpoint.
- `_pending` is a regular dict; in-flight requests accumulate state.
Acceptable for current benchmark workloads (few concurrent
outstanding launches); unbounded in principle.
- The Memory R/W resolution branches in `_resolve_cube_targets` are
dead code in the normal engine path. Kept defensively but invite
drift if the bypass path ever changes.
## Links
- ADR-0002 (Routing distance — path computation)
- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
cubes)
- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
- ADR-0012 (Host ↔ IO_CPU message schema)
- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
interpretation)
@@ -0,0 +1,200 @@
# ADR-0037: Forwarding Component (forwarding_v1)
## Status
Accepted
## Context
The simulation graph has many node positions that exist purely to model
fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
IO chiplet io_noc, transit cubes. These share a common pattern: receive
a message, apply per-component overhead (modeling header decode +
routing decision time), forward to the next hop along the pre-computed
path.
This ADR defines the contract for these transit nodes: a single
component type (`TransitComponent`) that handles flit-aware forwarding
with wormhole cut-through semantics, used under multiple impl names
according to the conceptual role each instance plays.
## Decision
### D1. Role
The Forwarding component (`TransitComponent` class) is a **stateless
transit node** in the simulation graph. It models any fabric position
where a message physically traverses but no semantic processing
happens.
Per traversal, the component:
1. Reads an incoming Transaction or Flit from an `in_port`.
2. Applies the configured per-component overhead (`overhead_ns`),
applied **once per Transaction** even across multi-flit payloads
(see D2).
3. Looks up the next hop along the Transaction's pre-computed `path`.
4. Forwards to the corresponding `out_port`; at the terminal node
(no next hop), signals `txn.done` once the `is_last` flit arrives.
The component **does NOT**:
- Decide routing — paths are pre-computed by the router (ADR-0002 /
ADR-0017 D2). Forwarding only executes the per-hop step.
- Model wire propagation or bandwidth occupancy — separate wire
processes between components handle that (ADR-0015 D2).
- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
handle that.
### D2. First-flit overhead model (header decode)
Per-Transaction `overhead_ns` is applied **exactly once**, at first
flit arrival:
- `_txn_decoded: set[int]` tracks which Transactions have already
paid the overhead at this node.
- On first-flit arrival for a Transaction: `yield self.run(env,
msg.txn.nbytes)` — pays the overhead.
- Subsequent flits of the same Transaction skip the overhead — they
pipeline through with no extra delay.
- On `is_last` flit: remove the Transaction from `_txn_decoded`.
This models the real-HW behavior where header decode and routing
decision happen once on first flit; payload flits then stream through
the same path (wormhole cut-through). Multi-hop pipelining emerges
naturally — each hop adds its own first-flit overhead, but flits
after the first do not re-pay overhead at any hop they have already
passed first.
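The first-flit bookkeeping in D2 can be sketched as a small state machine. `FirstFlitOverhead` and `delay_for` are hypothetical names; the real component yields on `self.run(...)` inside a SimPy process rather than returning a number.

```python
class FirstFlitOverhead:
    """Tracks which Transactions have already paid overhead at one node."""

    def __init__(self, overhead_ns: float) -> None:
        self.overhead_ns = overhead_ns
        self._txn_decoded: set[int] = set()

    def delay_for(self, txn_id: int, is_last: bool) -> float:
        """Return the overhead (ns) this flit pays at this node."""
        delay = 0.0
        if txn_id not in self._txn_decoded:
            # First flit of this Transaction: header decode + routing decision.
            self._txn_decoded.add(txn_id)
            delay = self.overhead_ns
        if is_last:
            # Last flit: forget the Transaction so the set does not grow.
            self._txn_decoded.discard(txn_id)
        return delay


node = FirstFlitOverhead(overhead_ns=2.0)
delays = [node.delay_for(txn_id=1, is_last=(i == 3)) for i in range(4)]
```

For a four-flit Transaction only the first flit is charged; a single-flit Transaction (first and last at once) is charged once and immediately forgotten.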
### D3. Serial worker forwarding (preserves order)
The component's worker is a single SimPy process that consumes flits
from `_inbox` and forwards them serially in arrival order. The
component does NOT spawn `env.process(...)` per flit.
Rationale: if the first flit yields on `overhead_ns` while subsequent
flits run in parallel processes, the later flits can overtake the
first. This produces out-of-order delivery and lets the `is_last`
flit arrive at the destination before the first flit — corrupting
both the transaction's completion semantics and any flit-index-based
processing downstream.
### D4. Path-based next-hop routing
Routing is **not** a Forwarding-component concern. The Transaction
arrives with a pre-computed `path` (built by the router; ADR-0002 /
ADR-0017 D2). The component just looks up its own position in the
path and forwards to `path[index + 1]`:
```python
def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
```
If `next_hop` is found and present in `out_ports`, the flit is
forwarded. Otherwise (terminal node), `txn.done.succeed()` is
invoked when the `is_last` flit arrives.
### D5. Flit-aware mode with Non-Flit fallback
`_FLIT_AWARE = True` opts this component out of the base class's
flit-reassembly logic in `_fan_in`. Flits are placed directly on
`_inbox` (no reassembly), enabling per-flit handling in the worker
loop (D2, D3).
Non-Flit messages — zero-byte control Transactions and other
non-chunkified payloads — fall through to the base class's legacy
`_forward_txn` path via `env.process`. This preserves backward
compatibility for control-plane traffic that does not benefit from
flit-level processing.
### D6. Multi-stream merging at the base class
Multi-stream FIFO merging at routers is the base class's
responsibility, not Forwarding's. The base class's `_fan_in` spawns
one process per `in_port`; all push to a single shared `_inbox`.
Flits from different upstream streams therefore interleave at
flit granularity in `_inbox`'s FIFO order.
The Forwarding worker simply consumes `_inbox` in arrival order —
correctly modeling per-router multi-flow arbitration as
fair-FIFO over the shared inbox.
### D7. Single implementation under multiple impl names
A single `TransitComponent` class is registered under four impl names
in `components.yaml`:
- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
`noc_router`, UCIe conn bridges)
- `builtin.switch` — tray-level switch
- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
NOC routers use `builtin.forwarding`)
- `builtin.ucie` — UCIe protocol endpoint
All four aliases instantiate the same class with the same behavior.
Per-instance differentiation lives only in `attrs.overhead_ns`.
Separate impl names exist as intent tags for readability and to
allow future divergence without backward-incompatible config
changes.
### D8. Configurable `overhead_ns`
A single attribute drives per-instance latency:
| Usage site | impl name | overhead_ns |
| --- | --- | --- |
| Tray-level switch | `builtin.switch` | 5.0 |
| Cube NOC router | `builtin.forwarding` | 2.0 |
| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
Default is 0.0. The attribute is read at each `run()` invocation, so
dynamic reconfiguration is possible but not currently used.
## Consequences
### Positive
- A single class handles all transit-node roles in the simulation
graph — minimal code surface for a high-population component type.
- Flit-aware processing + serial worker preserves wormhole semantics
across multi-hop paths without per-flit process overhead.
- `overhead_ns` is the only per-instance tunable; routing, BW, and
address resolution stay cleanly separated in their own components /
modules.
- Multi-stream merging emerges from the base-class structure; no
router-specific logic duplicates fair-FIFO arbitration.
- Non-Flit fallback path keeps control-plane traffic working without
forcing every message into the flit framework.
### Negative
- The single class hides usage-site intent inside `attrs.overhead_ns`
configuration; readers must consult `topology.yaml` +
`components.yaml` to see which impl name maps to which behavior
class.
- Per-flit serial worker is a bottleneck if `overhead_ns` is large
and many concurrent transactions arrive at the same router; current
values (0 to 8 ns) make this negligible.
## Links
- ADR-0002 (Routing distance — path computation)
- ADR-0015 D1 (Component port model)
- ADR-0015 D2 (Wire process — BW + propagation, separate from this
component)
- ADR-0015 D6 (Transit cube forwarding pattern)
- ADR-0016 D1 (IO chiplet io_noc — uses this component)
- ADR-0017 D1 (Cube NOC routers — use this component)
- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
component)
- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
multi-stream merge semantics)
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.
We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
---
@@ -93,7 +93,7 @@ Rules:
Mandatory fields:
- common envelope fields (D3)
- destination placement tags (A 방식):
- destination placement tags (Scheme A):
- `dst_sip: int`
- `dst_cube: int`
- `dst_pe: int`
@@ -130,7 +130,7 @@ Notes:
Mandatory fields:
- common envelope fields (D3)
- source placement tags (A 방식):
- source placement tags (Scheme A):
- `src_sip: int`
- `src_cube: int`
- `src_pe: int`
@@ -183,7 +183,7 @@ Tensor arg (mandatory):
- `shards: list[TensorShard]`
`TensorShard` MUST have (A 방식 강제):
`TensorShard` MUST have (Scheme A enforced):
- `sip: int`
- `cube: int`
@@ -1,519 +0,0 @@
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
Accepted
## Context
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
### Required Capabilities
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation
### Constraints
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
### Design Exploration Results
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
---
## Decision
### D1. 2-Pass Execution Model — Phase 0 Elimination
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
Before:
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
After:
```
Phase 1 (timing): Kernel + SimPy integrated execution (greenlet-based)
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)
Phase 2 (data): Actual computation execution based on op_log, outside SimPy, parallelizable
```
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
### D2. Op Log Recording — ComponentBase Hook
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```
Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.
**Hook timing definitions**:
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
#### Operating Principle
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
```
SimPy loop (parent greenlet)               Kernel (child greenlet)
─────────────────────────                  ──────────────────────
g.switch() ─────────────────────────→      Kernel starts
                                           a = tl.load(ptr, ...)
                                             internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←──────────────────       (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→      (kernel resumed)
                                           a = data             ← actual numpy array
                                           if a[0][0] > 0.5:    ← branching possible
                                               ...
```
The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
#### KernelRunner — Framework Layer
The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.
```python
from greenlet import greenlet


# KernelRunner (framework layer: greenlet ↔ SimPy bridge)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)
        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)  # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()  # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)  # timing only
                cmd = g.switch()
            else:
                # Without this branch an unrecognised command would loop forever.
                raise TypeError(f"unsupported kernel command: {cmd!r}")


# PE_CPU (component: kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
#### Handling Differences Between Memory Read/Write and Compute
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
#### Store Visibility Rule
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.
This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
#### Result Handle Semantics
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
The key contract in Phase 1:
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
**memory-read-based control flow is supported**.
| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
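The pending/ready contract above could be enforced by the handle object itself. The sketch below is one possible shape, not the project's implementation: `ComputeHandle`, `HandleNotReadyError`, and `mark_ready` are hypothetical names.

```python
class HandleNotReadyError(RuntimeError):
    """Raised when result data is touched while the handle is still pending."""


class ComputeHandle:
    def __init__(self, op_index: int) -> None:
        self.op_index = op_index
        self._ready = False
        self._data = None

    def mark_ready(self, data) -> None:
        """Intended to be called by the Phase 2 executor after computing the result."""
        self._data = data
        self._ready = True

    @property
    def data(self):
        if not self._ready:
            raise HandleNotReadyError("compute results are only available in Phase 2")
        return self._data

    def __bool__(self) -> bool:
        # Truth-value evaluation would be value-based branching, so reject it
        # while pending instead of silently returning True.
        if not self._ready:
            raise HandleNotReadyError("cannot branch on a pending compute handle")
        return True
```

Failing loudly on `if handle:` or `handle.data` in Phase 1 turns an unsupported pattern into an immediate error rather than a silent wrong branch.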
#### Phase 1 Materialization — Future Extension
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.
### D4. data_op Flag — Message Self-Declaration
The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.
```python
class MsgBase:
    data_op: bool = False  # default: no logging


class DmaReadCmd(MsgBase):
    data_op = True  # memory transfer → logging


class GemmCmd(MsgBase):
    data_op = True  # compute → logging


class MathCmd(MsgBase):
    data_op = True  # compute → logging
```
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
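A small sketch of how the flag keeps the framework free of type checks. `IpcqMsg` is the example named above; `CreditMsg` and `should_log` are hypothetical and exist only to show the opt-in behaviour.

```python
class MsgBase:
    data_op: bool = False  # default: no logging


class IpcqMsg(MsgBase):
    data_op = True  # new message type opts in to op logging


class CreditMsg(MsgBase):
    """Hypothetical control message; inherits data_op = False."""


def should_log(msg) -> bool:
    """Same test the ComponentBase hooks apply: attribute lookup, no isinstance."""
    return bool(getattr(msg, "data_op", False))
```

Because the check uses `getattr` with a default, objects that do not derive from `MsgBase` at all are simply treated as non-logged.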
### D5. Op Log Structure
#### Op Classification Scheme
A two-level classification is used:
| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
#### OpRecord Definition
```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    t_start: float              # SimPy time (ns), service start
    t_end: float                # SimPy time (ns), service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # specific operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently based on in-memory record index, may be replaced with stable op_id in the future
```
#### dependency_ids Generation Rules
`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).
Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
at the TLContext or command generation stage.
Example: completion handle-based synchronization — handle dependencies depend on
logical completion order rather than memory addresses, so they cannot be captured
by address inference.
#### op_log Ordering
The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
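One way to obtain this ordering is a stable sort keyed on `t_start` only. The sketch below relies on the documented stability of Python's `sorted`; `Rec` and `ordered` are hypothetical names, and whether the real logger sorts or inserts in order is not stated in the ADR.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    t_start: float
    name: str


def ordered(op_log):
    """Sort by t_start; ties keep insertion order because sorted() is stable."""
    return sorted(op_log, key=lambda r: r.t_start)


log = [Rec(5.0, "b"), Rec(1.0, "a"), Rec(5.0, "c"), Rec(1.0, "d")]
names = [r.name for r in ordered(log)]
```

Keying on `t_start` alone matters: adding a secondary key such as `name` would reorder ties and break the insertion-order guarantee.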
#### params Details
**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,    # source address (byte)
    "dst_addr": int,    # destination address (byte)
    "nbytes": int,      # transfer size
    "src_space": str,   # "hbm" | "tcm" | "sram"
    "dst_space": str,   # "hbm" | "tcm" | "sram"
}
```
**gemm**:
```python
{
    "src_a_addr": int,     # operand A address
    "src_b_addr": int,     # operand B address
    "dst_addr": int,       # output address
    "shape_a": tuple,      # e.g. (128, 256)
    "shape_b": tuple,      # e.g. (256, 128)
    "shape_out": tuple,    # e.g. (128, 128)
    "dtype_in": str,       # e.g. "f16"
    "dtype_acc": str,      # accumulation dtype, e.g. "f32"
    "dtype_out": str,      # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,       # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,     # "tcm" (GEMM operands are always in TCM)
}
```
**math**:
```python
{
    "op": str,                    # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],     # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,           # reduction axis
    "addr_space": str,            # "tcm"
}
```
### D6. Phase 2 Executor
Phase 2 executes the op_log outside of SimPy.
```python
from itertools import groupby


class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log
        self.store = initial_store  # takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        # groupby only merges adjacent items, so it relies on op_log being
        # ordered by t_start (see D5).
        for _t_start, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```
**Parallel execution determination**:
Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
**Phase 2 execution order guarantee**:
Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).
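The address-based conflict check can be sketched with plain byte ranges. This is an illustrative assumption about how `_classify` might decide, not its actual code: `Access`, `overlaps`, and `hazards` are hypothetical names, and all ranges are assumed to live in the same address space.

```python
from dataclasses import dataclass, field

Range = tuple[int, int]  # (start_byte, nbytes)


def overlaps(a: Range, b: Range) -> bool:
    """True if two half-open byte ranges share at least one byte."""
    a_start, a_len = a
    b_start, b_len = b
    return a_start < b_start + b_len and b_start < a_start + a_len


@dataclass
class Access:
    reads: list[Range] = field(default_factory=list)
    writes: list[Range] = field(default_factory=list)


def hazards(earlier: Access, later: Access) -> set[str]:
    """Return which of RAW / WAW / WAR order 'later' after 'earlier'."""
    found: set[str] = set()
    if any(overlaps(w, r) for w in earlier.writes for r in later.reads):
        found.add("RAW")
    if any(overlaps(w1, w2) for w1 in earlier.writes for w2 in later.writes):
        found.add("WAW")
    if any(overlaps(r, w) for r in earlier.reads for w in later.writes):
        found.add("WAR")
    return found


gemm0 = Access(reads=[(0, 64), (64, 64)], writes=[(128, 64)])
gemm1 = Access(reads=[(128, 64)], writes=[(256, 64)])  # consumes gemm0's output
gemm2 = Access(reads=[(512, 64)], writes=[(640, 64)])  # touches unrelated memory
```

Two ops with an empty hazard set and no explicit `dependency_ids` link would be candidates for parallel execution; here `gemm0` and `gemm2` qualify, while `gemm1` must follow `gemm0`.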
### D7. Memory Store
`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
```python
class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**Internal storage format: numpy ndarray**
MemoryStore stores tensors as **numpy ndarrays**.
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
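- dtype uses NumPy dtypes (`np.float16`, `np.float32`, etc.). NumPy has no built-in `bfloat16`, so bf16 requires an extension dtype (for example, the one provided by the `ml_dtypes` package)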
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
**read/write contract**:
- read/write operates on a **contiguous tensor** basis.
If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
Reinterpret cast is a permissive behavior for low-level memory validation
or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
### D8. Benchmark Kernel Code
The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
### D9. No Component Changes
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.
### D10. Phase 2 is Optional
```python
engine = GraphEngine(graph)
engine.run(benchmark)  # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```
If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is deactivated, Phase 1 performance is identical to the original.
### D11. Verification Contract
Basic verification **compares the final output tensor** against a reference backend (numpy).
Per-dtype tolerance policy:
| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
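The tolerance table can be sketched as a lookup plus an element-wise check. To stay dependency-free this sketch reimplements the comparison that `np.allclose` documents, `|a - b| <= atol + rtol * |b|`, on flat Python sequences; `TOLERANCES` and `outputs_match` are hypothetical names, and a real verifier would presumably call NumPy directly.

```python
# dtype -> (rtol, atol), copied from the table above.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype: str) -> bool:
    """Compare flat sequences using the per-dtype policy."""
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCES:
        # Anything without a tolerance entry is treated as an integer type
        # in this sketch: exact equality.
        return list(actual) == list(expected)
    rtol, atol = TOLERANCES[dtype]
    # 'expected' plays the role of the reference (b) in the allclose formula.
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))
```

The same 5e-4 deviation on a value near 1.0 passes under the f16 policy but fails under f32, which is why the dtype has to travel with the tensor being verified.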
---
## Non-goals
- **Compute-result-based control flow**: not supported.
All compute handles are in the pending state during Phase 1;
`tl.wait()` expresses timing synchronization only and does not imply data readiness.
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
is **treated as an error**.
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
and do not reproduce the actual hardware PE microarchitecture.
## Open Questions
- **Aliasing / slice view**: How to represent slice/views referencing the same
backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
(in-memory list vs disk-backed streaming)
- **Fused operation**: Whether to record tl.composite's tiled pipeline
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
needs to be defined
---
## Consequences
### Positive
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier
### Negative
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
(computations execute in Phase 2, result values are undetermined in Phase 1).
Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
+268 -265
@@ -1,4 +1,4 @@
# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
## Status
@@ -6,65 +6,65 @@ Accepted
## Context
현재 시뮬레이션은 **타이밍만** 모델링한다.
`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
실제 텐서 데이터를 읽거나 연산하지 않는다.
The current simulation models **timing only**.
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
but do not actually read tensor data or perform computations.
### 필요한 기능
### Required Capabilities
1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
3. 시뮬레이션 성능 저하를 최소화해야 한다
1. Must be able to store and read actual data in HBM/TCM/SRAM
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
3. Must minimize simulation performance degradation
### 제약 조건
### Constraints
- SimPy single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
- Kernel functions must remain plain Python functions (no generator/async transformation)
### 설계 탐색 결과
### Design Exploration Results
| Option | 방식 | 판정 |
|--------|------|------|
| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
---
## Decision
### D1. 2-Pass 실행 모델 — Phase 0 제거
### D1. 2-Pass Execution Model — Phase 0 Elimination
기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
기존:
Before:
```
Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
변경:
After:
```
Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
- 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
- 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
- dynamic control flow 가능 (tl.load가 실제 데이터 반환)
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
- Memory read/write: SimPy timing + MemoryStore actual data
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
- Dynamic control flow possible (tl.load returns actual data)
Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```
본 ADR은 **메모리 연산에 한해 Phase 1 data-aware로 확장**한다.
Phase 1 latency/BW 병목 분석 + 메모리 데이터 추적,
Phase 2 GEMM/Math 연산 정합성 검증.
Phase 2 optional — 타이밍만 필요하면 Phase 1만 실행.
This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 handles latency/BW bottleneck analysis plus memory data tracking;
Phase 2 handles GEMM/Math computation correctness verification.
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
### D2. Op Log 기록 — ComponentBase hook
### D2. Op Log Recording — ComponentBase Hook
op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
개별 컴포넌트 구현을 수정하지 않는다.
Op log recording is performed as a **hook in the component base class**.
Individual component implementations are not modified.
```python
class ComponentBase:
@@ -77,56 +77,56 @@ class ComponentBase:
self._op_logger.record_end(env.now, self.node.id, msg)
```
`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
`_op_logger` optional — 없으면 오버헤드 제로.
Hooks are called before and after `run()` within `_forward_txn()`.
`_op_logger` is optional — zero overhead when absent.
**hook 시점 정의**:
**Hook timing definitions**:
| 시점 | 의미 |
|------|------|
| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
link traversal latency t_start/t_end에 포함되지 않는다.
link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
### D3. Greenlet 기반 커널 실행 — Phase 0 제거
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
The existing Phase 0 (kernel → PeCommand list) is eliminated,
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
#### 동작 원리
#### Operating Principle
greenlet은 협력적 context switch를 제공하는 C 확장이다.
커널(child greenlet) `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)
switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
to perform timing simulation, and after completion, returns to the kernel with actual data.
```
SimPy 루프 (parent greenlet) 커널 (child greenlet)
SimPy loop (parent greenlet) Kernel (child greenlet)
───────────────────────── ──────────────────────
g.switch() ─────────────────────────→ 커널 시작
g.switch() ─────────────────────────→ Kernel starts
a = tl.load(ptr, ...)
내부: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (커널 일시정지)
internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────→ (커널 재개)
a = data ← 실제 numpy array
if a[0][0] > 0.5: ← 분기 가능
g.switch(data) ─────────────────────→ (kernel resumed)
a = data ← actual numpy array
if a[0][0] > 0.5: ← branching possible
...
```
커널은 **plain Python function**으로 유지된다.
greenlet switch `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
The kernel is maintained as a **plain Python function**.
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
#### KernelRunner — 프레임워크 레이어
#### KernelRunner — Framework Layer
greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
**KernelRunner**에 위치한다.
The greenlet loop resides not in the PE_CPU component but in the framework layer,
**KernelRunner**.
```python
# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
# KernelRunner (framework — greenlet ↔ SimPy bridge)
class KernelRunner:
def run(self, env, kernel_fn, args, store):
g = greenlet(self._run_kernel)
@@ -136,160 +136,162 @@ class KernelRunner:
if isinstance(cmd, DmaReadCmd):
yield from self._dispatch_dma(env, cmd)
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
cmd = g.switch(data) # 실제 데이터와 함께 재개
cmd = g.switch(data) # resume with actual data
elif isinstance(cmd, GemmCmd):
yield from self._dispatch_gemm(env, cmd)
cmd = g.switch() # 재개 (데이터 없음)
cmd = g.switch() # resume (no data)
elif isinstance(cmd, DmaWriteCmd):
store.write(cmd.dst_addr, cmd.data) # visibility = issue 시점
yield from self._dispatch_dma(env, cmd) # timing만 반영
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
yield from self._dispatch_dma(env, cmd) # timing only
cmd = g.switch()
# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
# PE_CPU (component — kept simple, unaware of greenlet)
def _execute_kernel(self, env):
runner = KernelRunner(self.ctx)
yield from runner.run(env, kernel_fn, args, store)
```
**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
모든 op logging**ComponentBase hook (_on_process_start/end)** 담당한다.
KernelRunner `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
the component base class hooks automatically record them.
**레이어 분리**:
- **커널 코드**: plain function, greenlet 존재를 모름
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
- **ComponentBase hook**: op_log 기록의 유일한 경로
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
- **ComponentBase hook**: the sole path for op_log recording
- **PE_CPU**: only calls KernelRunner, replaceable as a component
#### 메모리 읽기/쓰기 vs 연산의 처리 차이
#### Handling Differences Between Memory Read/Write and Compute
| 연산 | Phase 1에서 | Phase 2에서 |
|------|------------|------------|
| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
#### Store Visibility Rule
`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
SimPy DMA timing is simulated separately afterward.
이는 timing visibility를 의도적으로 분리한 것이다:
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
- **timing**: SimPy에서 DMA latency가 완료되는 시점
This is an intentional separation of timing and visibility:
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
- **timing**: the point at which DMA latency completes in SimPy
이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
#### Result Handle Semantics
`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
Phase 1에서의 핵심 계약:
The key contract in Phase 1:
1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
handle ready로 만들지 않는다.
3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
numpy conversion 등)은 **Phase 2에서만 가능**하다.
4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
**memory-read 기반 control flow는 지원 가능**하다.
1. **All compute handles are always considered pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**
and does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
numpy conversion, etc.) is **only possible in Phase 2**.
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
5. In contrast, `tl.load()` returns actual data in Phase 1, so
**memory-read-based control flow is supported**.
| handle 상태 | Phase | 허용 동작 |
| Handle state | Phase | Allowed operations |
|------------|-------|----------|
| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
| pending | Phase 1 | handle `tl.store()`의 대상으로 전달 (logical destination 연결만, payload Phase 2) |
| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
block되어 2-pass 분리의 존재 이유가 사라진다.
This restriction is intentional. If computations were executed in Phase 1,
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
#### Phase 1 Materialization — Future Extension
향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
If Phase 1 eager execution becomes necessary for small operations
(scalar, small reduction) in the future, selective materialization can be supported
by adding a `materialized_in_phase1: bool` flag to the op record.
This is not implemented in the current scope.
### D4. data_op 플래그 — 메시지 자기 선언
### D4. data_op Flag — Message Self-Declaration
로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
프레임워크가 메시지 타입을 하드코딩하지 않는다.
The logging target is determined by the `data_op` attribute on the message instance,
not by message type. The framework does not hardcode message types.
```python
class MsgBase:
data_op: bool = False # 기본: 로깅 안 함
data_op: bool = False # default: no logging
class DmaReadCmd(MsgBase):
data_op = True # 메모리 이동 → 로깅
data_op = True # memory transfer → logging
class GemmCmd(MsgBase):
data_op = True # 연산 → 로깅
data_op = True # compute → logging
class MathCmd(MsgBase):
data_op = True # 연산 → 로깅
data_op = True # compute → logging
```
새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
프레임워크 코드 수정 없이 자동 로깅된다.
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
enables automatic logging without modifying framework code.
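One way to picture the self-declaration rule is a logger that inspects only the instance attribute and never names a concrete message class. This is an illustrative sketch: `OpLogger`, its `record_start` signature, and `CreditMsg` are assumptions made for the example and are not taken from the codebase.

```python
class MsgBase:
    data_op: bool = False          # default: not logged


class DmaReadCmd(MsgBase):
    data_op = True                 # memory transfer: logged


class IpcqMsg(MsgBase):
    data_op = True                 # a new type opts in with one attribute


class CreditMsg(MsgBase):
    pass                           # hypothetical control message: inherits False


class OpLogger:
    """Hypothetical logger that filters on the instance attribute only."""

    def __init__(self):
        self.records = []

    def record_start(self, now, component_id, msg):
        # No isinstance checks against concrete message types.
        if getattr(msg, "data_op", False):
            self.records.append((now, component_id, type(msg).__name__))
```

Because the filter reads `data_op` from the message itself, adding `IpcqMsg` required no change to `OpLogger`.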
### D5. Op Log 구조
### D5. Op Log Structure
#### op 분류 체계
#### Op Classification Scheme
2단계로 분류한다:
A two-level classification is used:
| 레벨 | 필드 | 역할 |
|------|------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
#### OpRecord 정의
#### OpRecord Definition
```python
@dataclass
class OpRecord:
t_start: float # SimPy 시각 (ns) — service 시작
t_end: float # SimPy 시각 (ns) — service 완료
t_start: float # SimPy time (ns) — service start
t_end: float # SimPy time (ns) — service completion
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
op_kind: str # "memory" | "gemm" | "math"
op_name: str # 구체 연산명
params: dict # 연산별 파라미터 (아래 참조)
dependency_ids: list[int] # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
op_name: str # specific operation name
params: dict # per-operation parameters (see below)
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
```
#### dependency_ids 생성 규칙
#### dependency_ids Generation Rules
`dependency_ids` **optional**이며, 기본적으로 executor는
주소 기반 dependency 추론을 수행한다 (D6 참조).
`dependency_ids` is **optional**, and by default the executor performs
address-based dependency inference (see D6).
정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
- **기본 (address-based inference)**: executor read/write set을 분석하여
RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
주소로 표현되지 않는 경우에 설정.
: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
Explicit setting is only needed when precise execution ordering is required:
- **Default (address-based inference)**: the executor analyzes read/write sets to
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
- **Explicit setting**: set at the TLContext or command generation stage when a
logical dependency cannot be expressed via addresses.
Example: completion handle-based synchronization — handle dependencies depend on
logical completion order rather than memory addresses, so they cannot be captured
by address inference.
#### op_log ordering
#### op_log Ordering
op_log`t_start` 기준으로 **stable ordering**을 유지한다.
동일 `t_start`의 record들은 insertion order를 보존한다.
The op_log maintains **stable ordering** based on `t_start`.
Records with the same `t_start` preserve insertion order.
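The ordering rule maps directly onto a stable sort. A minimal sketch, assuming a stand-in record type (`Rec` and `order_op_log` are hypothetical names used only here):

```python
from dataclasses import dataclass


@dataclass
class Rec:                       # hypothetical stand-in for OpRecord
    t_start: float
    name: str


def order_op_log(records):
    # sorted() is guaranteed stable, so records with equal t_start
    # keep their original insertion order.
    return sorted(records, key=lambda r: r.t_start)
```

This ordering also matters to any consumer that groups records with `itertools.groupby` on `t_start`, since `groupby` only merges adjacent items with equal keys.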
#### params 상세
#### params Details
**memory (dma_read / dma_write)**:
```python
{
"src_addr": int, # source 주소 (byte)
"dst_addr": int, # destination 주소 (byte)
"nbytes": int, # 전송 크기
"src_addr": int, # source address (byte)
"dst_addr": int, # destination address (byte)
"nbytes": int, # transfer size
"src_space": str, # "hbm" | "tcm" | "sram"
"dst_space": str, # "hbm" | "tcm" | "sram"
}
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
**gemm**:
```python
{
"src_a_addr": int, # operand A 주소
"src_b_addr": int, # operand B 주소
"dst_addr": int, # output 주소
"src_a_addr": int, # operand A address
"src_b_addr": int, # operand B address
"dst_addr": int, # output address
"shape_a": tuple, # e.g. (128, 256)
"shape_b": tuple, # e.g. (256, 128)
"shape_out": tuple, # e.g. (128, 128)
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
"layout_a": str, # "row_major" | "col_major"
"layout_b": str,
"layout_out": str,
"addr_space": str, # "tcm" (GEMM operand는 항상 TCM)
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
}
```
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
```python
{
"op": str, # "exp" | "add" | "sum" | "where" | ...
"input_addrs": list[int], # operand 주소 목록
"input_addrs": list[int], # list of operand addresses
"input_shapes": list[tuple],
"dst_addr": int,
"shape_out": tuple,
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
### D6. Phase 2 Executor
Phase 2는 SimPy 밖에서 op_log를 실행한다.
Phase 2 executes the op_log outside of SimPy.
```python
class DataExecutor:
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
self.store = initial_store # Phase 1 MemoryStore snapshot을 입력으로 받는다
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
def run(self):
for t, ops in groupby(op_log, key=lambda o: o.t_start):
@@ -347,30 +349,30 @@ class DataExecutor:
self._execute_sequential(sequential)
```
**병렬 실행 판정**:
**Parallel execution determination**:
같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
- `dependency_ids`에 명시된 선행 op 완료 여부
Ops with the same `t_start` are considered **parallel candidates**.
The executor determines actual parallel execution based on the following criteria:
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
- Whether predecessor ops specified in `dependency_ids` have completed
주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
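The address-overlap part of this check could look roughly like the sketch below. It is an assumption-laden illustration: each op is reduced to plain `reads`/`writes` lists of `(addr, nbytes)` in a single address space, explicit `dependency_ids` are ignored, and the helper names are hypothetical.

```python
def ranges_overlap(a_start, a_len, b_start, b_len):
    """True when half-open byte ranges [start, start + len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def conflicts(op_a, op_b):
    """Detect a RAW, WAR or WAW hazard between two ops."""
    def any_overlap(xs, ys):
        return any(ranges_overlap(xa, xn, ya, yn)
                   for xa, xn in xs for ya, yn in ys)
    return (any_overlap(op_a["writes"], op_b["reads"])       # RAW
            or any_overlap(op_a["reads"], op_b["writes"])    # WAR
            or any_overlap(op_a["writes"], op_b["writes"]))  # WAW


def split_parallel(ops):
    """Greedy split of same-t_start ops into (parallel, sequential).

    An op joins the parallel set only if it conflicts with no earlier op;
    otherwise it is deferred to the sequential list, which runs afterwards
    in the original order.
    """
    parallel, sequential = [], []
    for op in ops:
        if any(conflicts(op, other) for other in parallel + sequential):
            sequential.append(op)
        else:
            parallel.append(op)
    return parallel, sequential
```

A real executor would additionally key ranges by memory space and honor explicit dependencies before admitting an op to the parallel set.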
**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
모두 동일한** 독립 op들만 batching 대상이 된다.
예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
**Batch optimization**: Only independent ops with the same op_name **and identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
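As a concrete picture of the batching idea, independent GEMMs with identical shapes can be stacked along a leading axis and issued as one `np.matmul` call. The helper name `batched_gemm` is hypothetical, and the sketch ignores layout and transpose flags.

```python
import numpy as np


def batched_gemm(pairs):
    """Run same-shape GEMMs as one stacked matmul; pairs is [(a, b), ...]."""
    # np.stack raises ValueError if shapes differ, which enforces the
    # "identical shape" precondition for batching.
    a_batch = np.stack([a for a, _ in pairs])   # (N, M, K)
    b_batch = np.stack([b for _, b in pairs])   # (N, K, P)
    out = np.matmul(a_batch, b_batch)           # (N, M, P) in one call
    return [out[i] for i in range(len(pairs))]
```

Each returned slice corresponds to one original op, so results can be written back to the individual `dst_addr` of each record.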
**Phase 2 실행 순서 보장**:
**Phase 2 execution order guarantee**:
Phase 2는 데이터 도착 시점을 고려하지 않으며,
dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
실행 순서를 보장한다.
Phase 2 does not consider data arrival timing,
and guarantees execution order solely through
dependencies (address-based inference + explicit dependency_ids).
### D7. Memory Store
`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
`MemoryStore` logically follows byte-addressable semantics,
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
```python
class MemoryStore:
@@ -378,139 +380,140 @@ class MemoryStore:
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
**내부 저장 포맷: numpy ndarray**
**Internal storage format: numpy ndarray**
MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
MemoryStore stores tensors as **numpy ndarrays**.
| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
|------|----------------|-------------|------|
| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
- write: numpy array**참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
- read: numpy array를 **참조 반환** (복사 없음)
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
- dtype numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16`)
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
- read: **returns numpy array by reference** (no copy)
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
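- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); stock NumPy has no `np.bfloat16`, so bf16 requires an extension dtype (for example, one provided by the `ml_dtypes` package)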
- For byte-level access, convert via `.view(np.uint8)`
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
**read/write contract**:
- read/write **contiguous tensor** 기준이다.
non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
reinterpret cast low-level memory validation 또는 특수 테스트 케이스를 위한
permissive behavior이다.
- addr byte-aligned이며, 최소 alignment = dtype 크기.
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
- 구현 최적화로 tensor object cache를 둘 수 있지만,
canonical state byte-addressable storage이다.
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
- read/write operates on a **contiguous tensor** basis.
If non-contiguous stride views are needed, express them as separate copy ops.
- In the normal benchmark path, producer/consumer dtype match is expected.
Reinterpret cast is a permissive behavior for low-level memory validation
or special test cases.
- addr is byte-aligned, with minimum alignment = dtype size.
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
When shapes differ, the read is validated by total nbytes, and an nbytes mismatch raises an error.
- Correctness criteria follow address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
but the canonical state is byte-addressable storage.
- At deploy time, the host injects initial tensor data.
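The contract above (reference storage, whole-tensor overwrite, reinterpret cast guarded by an nbytes check) can be sketched as follows. `TinyMemoryStore` and its `write(space, addr, data)` signature are assumptions made for illustration; only the `read` parameter order follows the signature shown in this ADR, and stored tensors are assumed contiguous.

```python
import numpy as np


class TinyMemoryStore:
    """Hypothetical tensor-granular store: (space, addr) -> ndarray."""

    def __init__(self):
        self._tensors = {}

    def write(self, space, addr, data):
        # Stored by reference; a later write to the same addr replaces
        # the whole tensor (no partial overwrite).
        self._tensors[(space, addr)] = data

    def read(self, space, addr, shape, dtype):
        stored = self._tensors[(space, addr)]
        want = np.dtype(dtype)
        if stored.dtype == want and stored.shape == tuple(shape):
            return stored                      # returned by reference, no copy
        nbytes = int(np.prod(shape)) * want.itemsize
        if nbytes != stored.nbytes:
            raise ValueError("read size does not match stored tensor")
        # Reinterpret cast: same bytes, viewed with a new dtype and shape.
        return stored.view(np.uint8).reshape(-1).view(want).reshape(shape)
```

The fast path returns the stored object itself, which is what keeps the Phase 1 overhead at a single dict lookup.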
### D8. 벤치마크 커널 코드
### D8. Benchmark Kernel Code
벤치마크의 **사용자 코드 API는 변경하지 않는다**.
`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
The benchmark's **user code API is not changed**.
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata
포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
However, internal command/message schemas may be extended to include metadata
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
### D9. 컴포넌트 변경 없음
### D9. No Component Changes
개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
op_log 기록은 ComponentBase hook의 책임이다.
커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
Phase 2 데이터 실행은 영향받지 않는다.
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Op log recording is the responsibility of the ComponentBase hook.
When custom components are replaced, only the timing model changes,
and Phase 2 data execution is unaffected.
### D10. Phase 2 Optional
### D10. Phase 2 is Optional
```python
engine = GraphEngine(graph)
engine.run(benchmark) # Phase 1: 타이밍만
engine.run(benchmark) # Phase 1: timing only
result = engine.get_timing_result()
if verify_data:
executor = DataExecutor(engine.op_log) # Phase 2: 데이터
executor = DataExecutor(engine.op_log) # Phase 2: data
executor.run()
executor.verify(expected_output)
```
타이밍 분석만 필요하면 Phase 2를 건너뛴다.
op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
If only timing analysis is needed, Phase 2 is skipped.
If the op_logger is disabled, Phase 1 performance matches the existing timing-only simulation.
### D11. Verification Contract
기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
Basic verification **compares the final output tensor** against a reference backend (numpy).
dtype tolerance 정책:
Per-dtype tolerance policy:
| dtype | 비교 방식 | tolerance |
| dtype | Comparison method | Tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int 계열 | `np.array_equal` | exact |
| int types | `np.array_equal` | exact |
- 기본 모드: 최종 output만 비교 (end-to-end correctness)
- 디버그 모드: intermediate tensor op 단위로 비교 가능
- Default mode: compare final output only (end-to-end correctness)
- Debug mode: can compare intermediate tensors on a per-op basis
(MemoryStore snapshot at each op boundary)
---
## Non-goals
- **Compute-result-based control flow**: 지원하지 않는다.
모든 compute handle은 Phase 1에서 pending 상태이며,
`wait()` timing synchronization만 표현하고 data readiness를 의미하지 않는다.
Phase 1에서 `handle.data` 접근, element access, truth-value evaluation
**error로 처리**한다.
메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
Phase 1 materialization future extension (D3 참조).
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
- **Compute-result-based control flow**: not supported.
All compute handles are in the pending state during Phase 1;
`wait()` expresses timing synchronization only and does not imply data readiness.
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
is **treated as an error**.
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
and do not reproduce the actual hardware PE microarchitecture.
## Open Questions
- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
일반화할지, 별도 op_kind를 둘지
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
- **Aliasing / slice view**: How to represent slice/views referencing the same
backing storage in MemoryStore (stride-based view vs copy semantics)
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
communication as memory ops or introduce a separate op_kind
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
(in-memory list vs disk-backed streaming)
- **Fused operation**: tl.composite tiled pipeline (READ→COMPUTE→WRITE)을
하나의 fused op record로 기록할지, 개별 op으로 분리할지
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
broadcasting rule, input dtype, keepdims, scalar/immediate operand,
where/mask 표현 등 일반화가 필요할 수 있음
- **Op record 식별자**: 현재 dependency_ids in-memory list index 기반이며,
streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
- **Fused operation**: Whether to record tl.composite's tiled pipeline
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
- **Math op schema generalization**: The current math params have a simple structure,
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
scalar/immediate operands, where/mask expressions, etc.
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
replacement with stable op_id is needed when introducing streaming/disk-backed mode
- **Phase 1 materialization policy**: See Future Extension in D3.
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
needs to be defined
---
## Consequences
### Positive
- Minimal impact on SimPy simulation performance (only op_log append added)
- Free to use multi-threading/GPU in Phase 2
- Component replaceability preserved (ADR-0015 design philosophy maintained)
- No changes needed to benchmark user code API
- When adding new message types, only set the data_op flag
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
- `tl.load()` returns actual data, making kernel debugging easier
### Negative
- op_log memory usage (for large-scale simulations)
- Phase 2 execution time is proportional to tensor size (large GEMM)
- Dynamic branching based on pending handles (incomplete computations) not possible
(computations execute in Phase 2, result values are undetermined in Phase 1).
Memory-data-based branching is supported via greenlet.
- greenlet C extension dependency added (pip install greenlet)
@@ -1,882 +0,0 @@
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
## Status
Accepted
## Context
### Goal
Add the infrastructure that lets CCL (Collective Communication Library)
kernels run **inside** a PE. The host just launches a kernel on each
SIP; the actual synchronization and data movement happen **inside the
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
This mirrors how NCCL performs NVLink communication inside a GPU
kernel, or how Cerebras / Tenstorrent expose core-local communication
queues. Host-level collectives (`dist.all_reduce`) are deferred to
**future work**; this ADR focuses solely on the kernel-side collective
infrastructure.
### Problems to solve
1. PE-to-PE direct data movement (writing into a peer's memory).
2. Synchronization — the sender must check that the receiver has space
in its buffer (backpressure).
3. Resource contention between compute traffic and communication
traffic (Head-of-Line blocking).
4. The host must be able to construct logical neighbor topologies
(ring / mesh / tree) per algorithm.
---
## Decision
### D1. Add a new `PE_IPCQ` component
A new component `PE_IPCQ` is added inside each PE. It follows the same
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
distinct component.
```
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
├── PE_MMU
```
**Role separation** (control plane vs. data plane):
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
tail pointer management, peer pointer caches, backpressure, 4-direction
neighbor mapping.
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
/ PCIE into the peer's memory.
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
### D2. Ring buffer model
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
```python
@dataclass
class IpcqQueuePair:
    direction: Direction     # N/S/E/W
    peer: IpcqEndpoint       # set by host at init time (D2.5)
    tx_buffer_base: int      # outgoing data base addr (in our memory)
    rx_buffer_base: int      # incoming data base addr (in our memory)
    slot_size: int           # 1 tile per slot
    n_slots: int             # ring depth
    my_head: int             # next slot we will write/send into
    my_tail: int             # next slot we will read/recv from
    peer_head_cache: int     # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int     # peer's last-seen tail (updated via D9 fast-path credit)
```
**Canonical field names**: throughout this ADR the four names above
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
etc.) are not used.
| Field | Owner | Updated when |
|-------|-------|--------------|
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
indirection). Full data embedded in the slot. See D5.
### D2.5. `IpcqEndpoint` schema
`IpcqQueuePair.peer` carries everything the sender needs to compute the
peer's rx slot address:
```python
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str   # "tcm" | "hbm" | "sram"
    rx_base_pa: int    # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int    # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int       # peer ring depth (for wrap-around)
    slot_size: int     # peer slot size (for offset)
```
Address computation:
```python
slot_idx = self.my_head % peer.n_slots
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
```
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
(vc_comm) routes the data to `dst_pa` through the fabric.
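The wrap-around behavior of the address computation above can be exercised in isolation. The following is a minimal sketch, assuming a trimmed-down `IpcqEndpoint` that carries only the three fields the arithmetic needs; `rx_slot_pa` is a hypothetical helper name, and the base address is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IpcqEndpoint:
    """Trimmed-down endpoint: only the fields used for address math."""
    rx_base_pa: int  # peer rx_buffer base physical address
    n_slots: int     # peer ring depth (for wrap-around)
    slot_size: int   # peer slot size in bytes


def rx_slot_pa(peer: IpcqEndpoint, my_head: int) -> int:
    """Return the peer rx slot address for the sender's current head."""
    slot_idx = my_head % peer.n_slots  # monotonic head wraps onto the ring
    return peer.rx_base_pa + slot_idx * peer.slot_size


peer = IpcqEndpoint(rx_base_pa=0x1000, n_slots=8, slot_size=4096)
first = rx_slot_pa(peer, my_head=0)    # slot 0
last = rx_slot_pa(peer, my_head=7)     # slot 7, last slot before the wrap
wrapped = rx_slot_pa(peer, my_head=8)  # wraps back to slot 0
```

Because `my_head` only ever grows, the modulo is what keeps the write inside the peer's ring; the head itself is never reset.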
**Endpoint construction order**: at backend init (D10), the IPCQ
buffers for **every PE** are allocated first (so each rank knows the
others' PA), then the per-rank neighbor tables are built and pushed to
PE_IPCQ via `IpcqInitMsg`.
### D3. Four-direction mapping ≡ logical ProcessGroup
The PE views four directions (N/S/E/W) as logical ports. Real peer
addresses are configured by the host CCL init, per the chosen
algorithm. The PE kernel never knows the topology, only directions.
```python
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])
# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
```
The PE code does not need to know where `tl.send(dir="E", ...)` actually
ends up.
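As a self-contained illustration of the 1D ring mapping, the sketch below builds the neighbor table as a plain dict and checks that every link has a matching reverse link (the endpoint-consistency property listed as I3 in D14). It assumes ranks are plain integers; `ring_1d_neighbors` and `check_reverse_consistency` are hypothetical names, not functions from the codebase.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def ring_1d_neighbors(world_size: int) -> dict:
    """Map rank -> {direction: peer rank} for a bidirectional 1D ring."""
    return {
        rank: {
            "E": (rank + 1) % world_size,
            "W": (rank - 1) % world_size,
        }
        for rank in range(world_size)
    }


def check_reverse_consistency(table: dict) -> bool:
    """True if every A --dir--> B link has a matching B --opposite--> A link."""
    return all(
        table[peer].get(OPPOSITE[direction]) == rank
        for rank, dirs in table.items()
        for direction, peer in dirs.items()
    )


table = ring_1d_neighbors(4)
ok = check_reverse_consistency(table)

# A deliberately broken table: rank 0 points east at rank 2,
# but rank 2's west neighbor is still rank 1.
broken = ring_1d_neighbors(4)
broken[0]["E"] = 2
broken_ok = check_reverse_consistency(broken)
```

A check of this kind is cheap to run once at init, before any neighbor table is pushed to a PE.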
### D4. PE kernel API
```python
# Send (blocking; may stall on backpressure)
tl.send(dir: str, src=TensorHandle)
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
# Recv (blocking)
recv = tl.recv(dir: str, shape=..., dtype=...)
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
# Recv (non-blocking)
fut = tl.recv_async(dir: str, shape=..., dtype=...)
recv = tl.wait(fut)
```
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
call rotates through directions, returning the first available slot.
If all four directions are empty, the call waits.
**Fairness is weak**: the rotating start mitigates simple bias, but if
one direction always wins the race the others can starve. Algorithms
that need strict fairness must call `tl.recv(dir=...)` explicitly.
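One possible reading of the rotating cursor is sketched below. This is an assumption-laden toy, not the PE_IPCQ implementation: `RoundRobinPicker` is a hypothetical class, the scan order N, S, E, W is assumed, and "ready" directions are passed in as a set instead of being read from ring pointers.

```python
from typing import Optional

DIRECTIONS = ("N", "S", "E", "W")


class RoundRobinPicker:
    """Rotating-start scan over the four directions (weak fairness only)."""

    def __init__(self) -> None:
        # Start the cursor at the last direction so the first scan begins at "N".
        self.last_polled_dir = len(DIRECTIONS) - 1

    def pick(self, ready: set) -> Optional[str]:
        """Return the first ready direction after the cursor, or None if all are empty."""
        for step in range(1, len(DIRECTIONS) + 1):
            idx = (self.last_polled_dir + step) % len(DIRECTIONS)
            if DIRECTIONS[idx] in ready:
                self.last_polled_dir = idx
                return DIRECTIONS[idx]
        return None  # all four directions empty: the caller would wait here


picker = RoundRobinPicker()
picks = [picker.pick({"N", "E"}) for _ in range(4)]
empty = picker.pick(set())
```

The scan only rotates its starting point; it keeps no per-direction service count, which is why it cannot guarantee strict fairness.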
### D5. Single-hop DMA write + full-data slot model
Data moves from sender memory into the receiver's ring slot in **one
DMA transfer**. Key properties:
- **Single-hop**: the sender already knows the peer rx slot address and
fires one fabric DMA into it.
- **No CPU memcpy**: the CPU never copies data.
- **No intermediate staging**: neither side keeps a separate staging
buffer (sender uses the source addr directly; receiver gets the data
in its ring slot directly).
(Strictly speaking the fabric DMA write does happen, so this is not
literally "no data movement" — it's the same property NCCL labels
"zero-copy", meaning no CPU memcpy and no staging copy.)
```
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
       (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1
PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit return on the fast path (D9): after the path latency
     computed in D9, peer A's peer_tail_cache is updated.
```
The slot holds the full tile. The receiver only reads its own
rx_buffer; it never reads back into A's memory. The sender knows the
peer rx slot address and DMAs directly into it (single-hop).
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
to the PE).
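The sender-side backpressure test in step 2 reduces to pointer arithmetic on two monotonic counters. The sketch below models only those pointers, with no data movement and no SimPy; `RingPointers` is a hypothetical class, and taking the `max` on credit arrival is an assumption made here so that a stale credit cannot move the cached tail backwards.

```python
class RingPointers:
    """Sender-side view of one tx direction (pointers only, no data)."""

    def __init__(self, n_slots: int) -> None:
        self.n_slots = n_slots
        self.my_head = 0          # slots sent so far (monotonic)
        self.peer_tail_cache = 0  # slots the peer is known to have consumed

    def can_send(self) -> bool:
        # In-flight slots = my_head - peer_tail_cache; must stay below ring depth.
        return self.my_head - self.peer_tail_cache < self.n_slots

    def send(self) -> bool:
        """Advance head if the peer ring has room; return False when it is full."""
        if not self.can_send():
            return False  # real code would sleep or poll here (D7)
        self.my_head += 1
        return True

    def on_credit(self, consumer_seq: int) -> None:
        """Apply a credit return carrying the peer's tail (assumed monotonic)."""
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


ring = RingPointers(n_slots=2)
results = [ring.send(), ring.send(), ring.send()]  # third send hits a full ring
ring.on_credit(consumer_seq=1)                     # peer consumed one slot
after_credit = ring.send()
```

Note that the sender decides purely from its cached copy of the peer's tail, so a late credit can only make it wait longer than necessary, never overwrite an unread slot.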
### D6. Buffer placement — three-way benchmark
The host CCL init picks the IPCQ ring-buffer location:
```python
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
)
```
| Location | Trait | Trade-off |
|----------|-------|-----------|
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
| **PE-local HBM** | Large; via DMA | Higher latency |
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
All three locations run the same kernel code; only the init differs.
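When choosing a location, it helps to know how much memory the rings occupy. D2 gives eight rings per PE (four directions, tx and rx each), so the footprint can be estimated as below. This is a back-of-the-envelope sketch: `ipcq_bytes_per_pe` is a hypothetical helper, and it assumes tx and rx rings use the same `n_slots` and `slot_size`.

```python
def ipcq_bytes_per_pe(n_slots: int, slot_size: int, directions: int = 4) -> int:
    """Ring-buffer bytes per PE: directions x {tx, rx} x n_slots x slot_size."""
    rings = directions * 2  # one tx ring and one rx ring per direction
    return rings * n_slots * slot_size


# Values taken from the ipcq_init example: n_slots=8, slot_size=4096.
footprint = ipcq_bytes_per_pe(n_slots=8, slot_size=4096)
footprint_kib = footprint // 1024
```

With the example values this comes to 256 KiB per PE, which is the number to compare against the capacity of the chosen `buffer_kind`.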
### D7. Backpressure — two-mode benchmark
How the sender or receiver waits when peer slots are full / data not
yet arrived:
| Mode | Behavior | Model |
|------|----------|-------|
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
```python
ipcq_init(backpressure="poll" | "sleep", ...)
```
Both modes are implemented so latency / throughput trade-offs can be
benchmarked.
### D8. PE_DMA virtual channels
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
model.
```
PE_DMA
├── vc_compute: tile load / store / writeback for GEMM and Math
└── vc_comm: IPCQ send data
```
Each VC has an independent state machine:
- One channel stalling does not block the other.
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
split between channels.
**Chunk-level interleave**:
- Large GEMM tile DMAs do not lock the link end-to-end.
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
with the other VC's pending chunks.
- Chunk size is an init parameter (smaller = fairer, larger = more
efficient).
Net effect:
- HoL blocking is eliminated (an IPCQ send can interleave with a long
compute DMA).
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
pattern).
- Matches the NoC-virtual-channel pattern used in real HW.
**First-implementation accuracy limit (intentional)**: this ADR's
first cut uses **deterministic chunk-level interleave + weighted
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
This is a first-order approximation and is simpler than real HW
dynamic-contention / credit-based arbiters. Functional correctness is
unaffected, but heavy-contention scenarios may report slightly
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
component later if more precision is needed.
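The deterministic chunk-level interleave can be pictured with a toy scheduler. This is an illustrative sketch only, not the PE_DMA implementation: `interleave_chunks` and its parameters are hypothetical, and each weight is assumed to be a positive integer giving the number of chunks a VC may send per round.

```python
def interleave_chunks(compute_bytes, comm_bytes, chunk_size=256, weights=(1, 1)):
    """Return the order in which chunks cross the link as (vc, nbytes) pairs."""
    if min(weights) < 1:
        raise ValueError("weights must be positive integers")
    remaining = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}
    order = []
    while any(remaining.values()):
        # One round: each VC may send up to `quota` chunks, then yields the link.
        for vc in ("vc_compute", "vc_comm"):
            for _ in range(quota[vc]):
                if remaining[vc] == 0:
                    break
                nbytes = min(chunk_size, remaining[vc])
                remaining[vc] -= nbytes
                order.append((vc, nbytes))
    return order


# A 1 KiB compute DMA and a 256 B IPCQ send contend for the same link.
schedule = interleave_chunks(compute_bytes=1024, comm_bytes=256)
comm_position = schedule.index(("vc_comm", 256))
```

In this toy run the IPCQ chunk crosses the link second instead of waiting behind all four compute chunks, which is the Head-of-Line relief described above.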
#### Token routing
- Compute tokens (`TileToken`) — go through the existing
PE_FETCH_STORE → PE_DMA chain.
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
self-routing.
- PE_DMA picks the channel by token type.
```python
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
```
### D9. Pointer synchronization — DMA payload piggyback
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
pointers update along with the data. This simulation adopts the same
model: **no separate control channel** — metadata travels with the
data.
The big benefits:
- **Automatic ordering**: data and metadata move on the same token, so
data is visible **before** the head_cache update. No race.
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
#### Send flow (head update via piggyback)
```
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)
[fabric DMA: latency elapses]
PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)
PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
```
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
makes data and metadata atomically visible.
#### Recv flow (credit return — fast path with bottleneck-BW latency)
When the receiver frees a slot, the sender must learn about it
(backpressure release). Unlike data, the credit return does **not**
travel through general vc_comm fabric — it uses a **separate fast
path**, an abstraction of the NVLink / UCIe credit-return wire.
**Latency** is computed from the **full path latency** (per-node
overhead + edge propagation + drain), not a magic constant:
```
credit_size_bytes = 16    (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
```
The router auto-appends `.pe_dma` to the source only, so the
destination MUST be spelled with the explicit `.pe_dma` suffix or
`find_path` raises and the credit silently teleports at zero cost
(latent bug fixed alongside this update).
`tl.recv` blocks on the credit-emit completion (recv yields-from
`_delayed_credit_send` rather than spawning it as a fork). This puts
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
IPCQ control-plane completing the consume-acknowledgement before
recv returns to the kernel — the protocol equivalent of a non-posted
`tl.store` waiting for an HBM ack on the raw DMA path.
That gives us:
- **Topology-proportional approximation**: an in-cube credit return is
automatically faster than a cross-SIP credit return.
- **No magic constants**: every nanosecond comes from
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
as data traffic.
- **No deadlock risk**: unlike piggyback, B can issue credit even when
it has no data to send back. `peer_credit_store.put` is unbounded.
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
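To make the three latency terms concrete, here is a worked sketch of the same sum. It is a stand-in, not the real `compute_path_latency_ns`: `credit_latency_ns` and `Edge` are hypothetical, and every distance, bandwidth, and overhead value below is made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    distance_mm: float
    bw_bytes_per_ns: float


def credit_latency_ns(edges, node_overhead_ns, nbytes, ns_per_mm):
    """Edge propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns)
    bottleneck_bw = min(e.bw_bytes_per_ns for e in edges)
    drain = nbytes / bottleneck_bw
    return propagation + overhead + drain


CREDIT_SIZE_BYTES = 16  # matches the ipcq_credit_size_bytes default
NS_PER_MM = 0.5         # illustrative value only

# In-cube path: one short, fast edge and two nodes.
in_cube = credit_latency_ns(
    [Edge(2.0, 32.0)], [1.0, 1.0], CREDIT_SIZE_BYTES, NS_PER_MM)

# Cross-SIP path: an extra long, slower edge and one more node.
cross_sip = credit_latency_ns(
    [Edge(2.0, 32.0), Edge(10.0, 16.0)], [1.0, 2.0, 1.0],
    CREDIT_SIZE_BYTES, NS_PER_MM)
```

With these invented numbers the in-cube credit costs 3.5 ns and the cross-SIP credit 11.0 ns, showing how the result scales with the path instead of being a fixed constant.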
#### Component coupling — SimPy Store channel
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
time, **a SimPy Store is wired between the two** (a per-direction
fast-path channel) and credit metadata is `put` into that store.
```python
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
```
Backend init wires both directions of the fast-path channel as part of
fan-out (see `IpcqInitMsg` in D12).
#### Credit-return fast path limitations
- `credit_size_bytes` is an estimate (typically 16 to 64 bytes).
- The fast path is **excluded from vc_comm BW contention** (separate
wire). Real HW credit-return wires are very lightweight, so this is a
reasonable first approximation.
- A follow-up ADR can: model the credit fast path as a separate link
(BW limit + contention), or switch to piggyback (`credit_return_mode:
piggyback`).
#### PE_DMA's added responsibility
When `vc_comm` receives a token, PE_DMA processes it as the following
sequence: pay the Transaction's terminal BW drain, then atomically
write data and forward metadata. **No SimPy yield is allowed between
the data write and the metadata forward** (invariant I6). The drain
yield must sit before the atomic block, not inside it:
```python
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)
    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload from the sender's source into the local rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
```
The final `put` is yieldable but uses an unbounded internal store, so
it completes in a single step. That `put` closes the atomic block; no
other yield may sit between the data write and it.
#### Drain-at-inbound semantics (D9 timing model)
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
is paid at each forwarding component via `run()`, and the remaining
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
(so IPCQ-specific data write + metadata forward can happen), so **the
drain MUST be paid explicitly at the top of that handler** to keep
IPCQ's timing model on par with every other fabric Transaction.
Side-effects of paying drain here:
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
preserved because the sender PE_DMA does not `yield sub_done`. The
`sub_done.succeed()` call (made after metadata forward below) is an
event with no listener on the sender side.
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
forward now happens after the drain, recv observes the full fabric
transfer time including bandwidth cost.
Matches the physical picture: send dispatches and leaves; recv waits
until the bytes have actually been drained into its inbox.
### D9.5. ADR-0020 (2-pass) integration
`tl.send` / `tl.recv` integrate with ADR-0020's two-pass model. Phase
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
op-log-based correctness verification.
#### Phase 1 (timing + data)
D9 models head and tail updates with two different mechanisms:
- **Send-side (head update)** — DMA payload piggyback. Data write and
metadata forward happen in the same SimPy step → automatic atomic
visibility.
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
with bottleneck-BW latency, then `peer_tail_cache` update.
Together they preserve ring-buffer pointer consistency.
The op-log records `op_kind="ipcq"` entries for sends (with
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
Two recv modes:
- **`return_slot`** (default): the slot address is returned to the
kernel. Zero-copy.
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
PE_IPCQ copies the slot data into the user dst.
#### Phase 2 (op_log replay)
When `DataExecutor` encounters an `op_kind="ipcq"` record:
- **send**: idempotent `src → dst` ndarray write.
- **recv (`return_slot`)**: no-op (the slot already holds the data).
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
The downstream GEMM / Math ops in `DataExecutor` will consume the data
and naturally validate correctness.
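The three replay cases can be sketched against a dictionary that stands in for MemoryStore. Everything here is a simplifying assumption: the record keys (`kind`, `src`, `dst`, `slot`, `recv_mode`) are hypothetical and flatter than the op_log fields listed above, and payloads are plain lists instead of ndarrays.

```python
def replay_ipcq(record: dict, memory: dict) -> None:
    """Apply one op_kind="ipcq" record to a {(space, addr): payload} store."""
    if record["kind"] == "send":
        memory[record["dst"]] = list(memory[record["src"]])
    elif record["recv_mode"] == "copy_to_dst":
        memory[record["dst"]] = list(memory[record["slot"]])
    # recv with return_slot: no-op, the slot already holds the data


memory = {("tcm", 0x100): [1.0, 2.0], ("tcm", 0x200): [3.0, 4.0]}
log = [
    {"kind": "send", "src": ("tcm", 0x100), "dst": ("tcm", 0x800)},
    {"kind": "recv", "recv_mode": "return_slot", "slot": ("tcm", 0x800)},
    {"kind": "send", "src": ("tcm", 0x200), "dst": ("tcm", 0x900)},
    {"kind": "recv", "recv_mode": "copy_to_dst",
     "slot": ("tcm", 0x900), "dst": ("hbm", 0x4000)},
]

for _ in range(2):  # replaying the log twice leaves the same state
    for record in log:
        replay_ipcq(record, memory)
```

Because each case is a plain overwrite (or nothing at all), running the log a second time changes nothing, which is what idempotent means here.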
### D10. Host CCL init keeps the PyTorch shape
The host code looks just like real PyTorch DDP. `init_process_group`
creates the backend object; it does **not** receive IPCQ knobs
(neighbor topology, buffer_kind, backpressure …).
```python
# benches/ccl_allreduce.py — same shape as real PyTorch
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
```
The IPCQ configuration is decided by the backend at
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
host code never has to know about IPCQ.
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
Switching algorithms is purely a `ccl.yaml` change — no host edits
required.
#### Init flow (eager)
1. `init_process_group(backend="ahbm")` is called.
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
3. Pulls topology + buffer_kind + backpressure + slot config from
`algorithms[<algo>]`.
4. **Immediately** installs neighbor tables on every PE_IPCQ
(sideband or fabric `IpcqInitMsg`).
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
not.
### D11. CCL config file (`ccl.yaml`)
IPCQ config and algorithm metadata live in a separate YAML file,
following the same pattern as `components.yaml` and `topology.yaml`.
A single benchmark execution runs one algorithm
(`defaults.algorithm`). Switching algorithms means editing
`defaults.algorithm` only.
```yaml
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm            # tcm | hbm | sram
  backpressure: sleep         # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16
algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d         # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                 # optional, per-algorithm tile width
  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7             # algorithm-level override
    n_elem: 16
  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom          # the module supplies its own neighbors()
```
`world_size` is **not set in `defaults`**. The backend resolves it via:
`algorithm-level override > defaults override > topology spec`. The
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
where `WORLD_SIZE` comes from env vars rather than config files.
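The precedence chain can be written as a short lookup. This is a sketch of the stated rule, not the backend's code: `resolve_world_size` is a hypothetical function, the configs are assumed to be plain dicts as loaded from `ccl.yaml`, and the topology dimensions in the example are invented.

```python
def resolve_world_size(algo_cfg, defaults, sips, cubes_per_sip, pes_per_cube):
    """Algorithm-level override > defaults override > topology-derived size."""
    if "world_size" in algo_cfg:
        return algo_cfg["world_size"]
    if "world_size" in defaults:
        return defaults["world_size"]
    return sips * cubes_per_sip * pes_per_cube


defaults = {"algorithm": "ring_allreduce_tcm", "n_slots": 8}
topo = dict(sips=2, cubes_per_sip=2, pes_per_cube=4)  # made-up topology

# tree_allreduce_7 carries its own override; ring_allreduce_tcm does not.
tree_ws = resolve_world_size({"world_size": 7}, defaults, **topo)
ring_ws = resolve_world_size({"n_elem": 8}, defaults, **topo)
```

With no override anywhere, the ring algorithm falls through to the topology product, 16 in this example.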
#### Algorithm module structure
Each algorithm module exports two hooks — `kernel` (required) and
`neighbors` (optional) — plus a `kernel_args` helper that the
backend uses to populate positional kernel arguments at `all_reduce`
time:
```python
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)
def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.
    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...
def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.
    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
```
#### `neighbors` override patterns
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
brand-new dict.
- **Pattern C — keep builtin**: omit `neighbors` or return None.
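Pattern A might look like the sketch below, which opens a builtin ring into a chain by removing the wrap-around link. It assumes `neighbor_map` is a `{direction: peer_rank}` dict for the calling rank, which the text above does not spell out, so treat the shape as hypothetical.

```python
def neighbors(rank, world_size, neighbor_map):
    """Pattern A: open the ring into a chain by cutting the wrap-around link."""
    trimmed = dict(neighbor_map)  # copy so the builtin map is left untouched
    if rank == world_size - 1:
        trimmed.pop("E", None)    # last rank has no eastern neighbor
    if rank == 0:
        trimmed.pop("W", None)    # first rank has no western neighbor
    return trimmed


first = neighbors(0, 4, {"E": 1, "W": 3})
middle = neighbors(1, 4, {"E": 2, "W": 0})
last = neighbors(3, 4, {"E": 0, "W": 2})
```

Dropping both ends of the same link keeps the table symmetric: rank 0 no longer points west at rank 3, and rank 3 no longer points east at rank 0.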
#### Builtin topologies
| topology | direction set |
|----------|---------------|
| `ring_1d` | E, W |
| `ring_1d_unidir` | E only |
| `mesh_2d` | N, S, E, W |
| `tree_binary` | parent, child_left, child_right |
| `none` | (empty) — algorithm must supply `neighbors()` |
#### Adding a new algorithm
1. Write `kernel` and `kernel_args` in
`src/kernbench/ccl/algorithms/<algo>.py`.
2. Add an entry in `ccl.yaml`'s `algorithms` section.
3. (Optional) provide `neighbors()` for custom topology.
4. Set `defaults.algorithm` to the new algorithm.
The host bench (`benches/ccl_allreduce.py`) does not change.
### D12. Message / token schema
The new message types added by this ADR. They live in
`src/kernbench/common/pe_commands.py` and
`src/kernbench/runtime_api/kernel.py`.
#### `IpcqInitMsg` (sideband, fan-out at init)
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
field: a `simpy.Store` instance pre-wired so the PE_IPCQ that frees a
slot (the credit sender) can push `IpcqCreditMetadata` directly into
its peer's input queue.
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
handle id. `data_op=True` so it lands in the op_log.
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
Carries `direction` (or None for round-robin), `recv_mode`
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
dtype, blocking flag.
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
`src_direction`). PE_DMA picks the channel by token type
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
The receiver's PE_DMA, on token arrival, performs the I6 atomic
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
to the local PE_IPCQ.
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
Carries `consumer_seq` (= my_tail), source PE coords, and source
direction. Travels through the dedicated SimPy Store channel rather
than `vc_comm`. Latency is the full path latency from D9 (edge
propagation + per-node overhead + `credit_size_bytes / bottleneck_bw_on_path`).
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
piggyback, tail updates via the D9 fast-path channel.
### D13. Test strategy
Test plan:
#### T1. Unit tests (component-level)
- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
immediately forwards a token; full peer slot triggers backpressure
(poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
/ `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
mesh_2d / tree_binary correctness, mesh_2d non-square →
`ValueError`, custom resolver returns the module's `neighbors`.
#### T2. Integration tests (E2E send/recv)
- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
records `ipcq` ops in op_log; DataExecutor produces correct
`out.data`.
#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)
`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.
#### T4. Regression
All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.
#### T5. Performance / overhead
Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).
### D14. Invariants and failure modes
#### Invariants
I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
t_start order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or `yield
from`) — it would create a race where head_cache becomes visible
before or after the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` becomes true is the
same step in which the slot data is observable.
#### Failure modes (runtime errors)
F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
raises `IpcqInvalidDirection`, and the simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` on a PE_IPCQ node attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4) — all raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never recvs.
Surfaces as F3 timeout.
#### Diagnostics
- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
`(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
`peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
the `IpcqDeadlock` exception message.
### D15. Algorithm-author cheat sheet
Full step-by-step lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:
| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |
5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.
Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.
---
## Non-goals
- **Host collective**: a model where `dist.all_reduce` itself moves
data on the host side is out of scope. This ADR only covers
communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
ADR (see D8).
---
## Open questions
- **VC arbitration accuracy** — the first cut uses deterministic
chunk interleave + weighted round-robin; heavy contention may report
optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model** — the fast path is currently outside the
fabric BW contention model. Can be modeled as a separate link or
switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata** — whether the host pushes
IPCQ buffer metadata via sideband or via a fabric message similar to
`MmuMapMsg` is open.
- **VC BW split default** — 50/50 vs. weighted (e.g. 80/20). Exposed in
`ccl.yaml`; default value TBD.
- **Direction count** — 4 (N/S/E/W) is fixed in the first cut; 6
(with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives** — whether
`tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness** — current weak fairness can starve;
strict fairness counter is future work.
- **Deadlock detection precision** — currently timeout-based; a
realtime wait-for graph would enable deterministic detection.
---
## Consequences
### Positive
- PE-to-PE direct communication enables CCL kernels to be written.
- Host stays minimal (just `launch`), synchronization happens inside
the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
easy to benchmark.
- Four-direction logical neighbors → host is free to map
ring/mesh/tree algorithms.
### Negative
- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
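
As a rough worked example of the IPCQ memory-cost item above, the per-PE footprint is the product of the three stated factors. The `slot_size` and `n_slots` values used here are illustrative assumptions, not defaults taken from this ADR.

```python
# Illustrative arithmetic only: 8 rings x slot_size x n_slots per PE.
RINGS_PER_PE = 8  # ring count as stated in the ADR text


def ipcq_bytes_per_pe(slot_size: int, n_slots: int) -> int:
    """Return the IPCQ ring-buffer footprint of one PE, in bytes."""
    if slot_size <= 0 or n_slots <= 0:
        raise ValueError("slot_size and n_slots must be positive")
    return RINGS_PER_PE * slot_size * n_slots


# Hypothetical sizing: 4 KiB slots, 16 slots per ring.
example_bytes = ipcq_bytes_per_pe(slot_size=4096, n_slots=16)
print(example_bytes)  # 524288 bytes, i.e. 512 KiB per PE
```

Because both parameters are init-time values, the same helper can be used to compare candidate configurations before benchmarking them.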
File diff suppressed because it is too large
+59 -52
@@ -6,43 +6,46 @@ Accepted
## Context
### 목표
### Goal
`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
읽히는 bench 코드를 목표로 한다.
Align the participation unit (rank) of `torch.distributed` collective calls
to the **SIP** (device) boundary. The aim is bench code that, at the host
level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
real PyTorch와 비교:
Comparison with real PyTorch:
| 차원 | real PyTorch | KernBench |
| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 1 SIP |
| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
| `mp.spawn` | OS process fork | greenlet fan-out |
### 풀어야 할 문제
### Problems to solve
1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
기본 텐서 배치도 구조적 좌표로 표현되어야 함.
1. **Public API where rank = SIP** so bench workers do not have to know
about the PE concept.
2. **Greenlet-local rank/device tracking** — within the 1-process model,
each worker greenlet must correctly identify its own rank / its own SIP.
3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
the default tensor placement should also be expressed in structural
coordinates.
### Non-problem ( ADR)
### Non-problem (outside this ADR)
- IPCQ direction addressing → ADR-0025
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
→ ADR-0027 D0/D1
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
- Collective algorithm implementation (intercube_allreduce, SFR config)
→ ADR-0032
## Decision
### D1. rank = SIP (world_size 해석)
### D1. rank = SIP (world_size resolution)
```python
def _resolve_world_size(self) -> int:
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```
우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
override는 legacy "rank = PE" 테스트 경로로 유지.
Priority order: algorithm override > defaults override > SIP count. The
`ccl.yaml` override is retained as the legacy "rank = PE" test path.
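
The priority order above can be sketched as a small standalone function. This is a minimal sketch, not the actual `_resolve_world_size` body: the parameter names `algorithm_override` and `defaults_override` are hypothetical stand-ins for however `ccl.yaml` carries the two overrides, and only the final SIP-count lookup is taken from the snippet shown in this ADR.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    spec: Mapping[str, Any],
    algorithm_override: Optional[int] = None,  # hypothetical parameter
    defaults_override: Optional[int] = None,   # hypothetical parameter
) -> int:
    """Pick world_size: algorithm override > defaults override > SIP count."""
    if algorithm_override is not None:
        return int(algorithm_override)
    if defaults_override is not None:
        return int(defaults_override)
    # Fallback mirrors the topology lookup shown in the ADR snippet.
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


topology = {"system": {"sips": {"count": 4}}}
print(resolve_world_size(topology))                       # SIP count wins
print(resolve_world_size(topology, defaults_override=8))  # defaults override wins
```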
### D2. Greenlet-local rank registry (+ debug warning)
@@ -83,11 +86,11 @@ class DistributedContext:
return int(self._rank_by_greenlet[g])
```
### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
### D3. `torch.ahbm.set_device(rank)` — SIP binding
KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch
`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
namespace를 사용한다.
The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`, but since we are not CUDA we use an
honestly-named namespace.
```python
class _AhbmNamespace:
@@ -113,10 +116,12 @@ class _AhbmNamespace:
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```
**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
**PyTorch 2.x style, supported alongside**: Recent PyTorch is moving toward a
device-agnostic `torch.accelerator` namespace
(`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). For users who want to write
code that is not tied to a specific device vendor, KernBench exposes this
surface alongside `torch.ahbm`.
```python
class _AcceleratorNamespace:
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm) # alias
```
Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
Bench authors may choose either — both share the same registry internally:
```python
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
```
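
To illustrate how two namespaces can share one registry, here is a minimal sketch with stand-in classes. It assumes a plain attribute for the current device, whereas the ADR keeps this state greenlet-local; the class names are deliberately different from `_AhbmNamespace` and `_AcceleratorNamespace` to mark them as illustrative.

```python
from typing import Optional


class AhbmNamespace:
    """Stand-in for the backend namespace; holds the current SIP index."""

    def __init__(self) -> None:
        # Assumption: a plain attribute. The ADR stores this per greenlet.
        self._current: Optional[int] = None

    def set_device(self, rank: int) -> None:
        self._current = int(rank)

    def current_device(self) -> Optional[int]:
        return self._current


class AcceleratorNamespace:
    """Device-agnostic alias that delegates every call to the backend."""

    def __init__(self, backend: AhbmNamespace) -> None:
        self._backend = backend

    def set_device_index(self, index: int) -> None:
        self._backend.set_device(index)

    def current_device_index(self) -> Optional[int]:
        return self._backend.current_device()


ahbm = AhbmNamespace()
accelerator = AcceleratorNamespace(ahbm)
accelerator.set_device_index(2)
print(ahbm.current_device())  # the write through the alias is visible here
```

Delegation keeps a single source of state, so the two spellings cannot disagree about the current SIP.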
### D4. Tensor placement = structural (sip, cube, pe) 좌표
### D4. Tensor placement = structural (sip, cube, pe) coordinates
`resolve_dp_policy` `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
세부는 ADR-0026.
`resolve_dp_policy` takes `target_sip` directly and produces placement in
structural coordinates. Details in ADR-0026.
```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device() # (D3 naming)
if current_sip is None:
current_sip = 0 # single-driver fallback (D2와 일관)
current_sip = 0 # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
dp, shape=shape_2d, itemsize=itemsize,
num_pe=eff_num_pe, num_cubes=eff_num_cubes,
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
)
```
Post-hoc `pe_index` shifting 없음 — ShardSpec `(sip, cube, pe)` 구조적
좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
structural coordinates directly. ShardSpec details in ADR-0026.
---
## Dependencies
- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
ShardSpec의 구조적 좌표 표현.
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
collective drain, exception cleanup의 구현 기준.
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
used by D4 and the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
---
## Non-goals
- **IPCQ protocol 수정**: ADR-0023 유지.
- **DPPolicy 필드 정리**: ADR-0026.
- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
- **Cleaning up DPPolicy fields**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm 구현**: ADR-0032.
- **Multi-node (프로세스 간)**: 단일 프로세스.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.
---
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
### Positive
- **Bench = real PyTorch DDP** (공개 API 관점).
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
`(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
- **Bench = real PyTorch DDP** (from the public-API point of view).
- **Greenlet-local rank**: enables cross-rank correctness within the
1-process model.
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
3-tuple.
### Neutral
- IPCQ PE-level protocol (ADR-0023) 불변.
- IO_CPU 역할 불변 (기존 transit 그대로).
- IPCQ PE-level protocol (ADR-0023) is unchanged.
- IO_CPU role is unchanged (existing transit behavior preserved).
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
## Context
### 목표
### Goal
ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
topology 일반)에서 정확히 동작하도록 한다.
In the IPCQ protocol of ADR-0023, make the **identification of "which
direction pair this transfer belongs to"** consistent and **address-based**,
without depending on topology / dict-order. It must work correctly in a
2-rank bidirectional ring (and more generally in any topology where
multiple directions point to the same peer).
### 드러난 버그 — 2-rank bidirectional ring
### The bug surfaced — 2-rank bidirectional ring
`ring_1d(rank, world_size=2)``{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
point to the same peer.
**버그 1 (install)**:
- `reverse_direction(0, 1)`dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
direction convention)
- rank 0 E entry `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
- tl.send(E) → data sip1 E-rx buffer로 landing (should be W-rx)
**Bug 1 (install)**:
- `reverse_direction(0, 1)` returns "E" by dict order (wrong; "W" is the
correct answer — opposite-direction convention)
- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
**버그 2 (runtime)**:
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`
sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
**Bug 2 (runtime)**:
- Even if install set up the correct address, the receiver's
`_handle_meta_arrival` matches direction by sender coordinates only → the
first direction (E) wins
- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
IpcqDeadlock
### 근본 원인
### Root cause
두 축에서 동일 문제:
1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
fragile
2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
좌표만으로 이루어짐 → direction 중복 시 ambiguous
The same issue along two axes:
1. **Install-time pairing**: deciding "which of my directions pairs with
which direction of the peer" depends on dict-iteration-order → fragile
when multiple directions point to the same peer
2. **Runtime identification**: deciding "which qp should be updated" is
based on sender coordinates alone → ambiguous when directions are
duplicated
### 해결 방향 — address-based matching
### Solution direction — address-based matching
PE rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
direction_idx × bytes_per_direction). 따라서:
Each PE's rx buffer sits at a **unique address range per direction**
(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
대칭성)
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
truth**
- **Runtime**: match by **dst_addr range** instead of sender coord →
unambiguous
- **Install**: prefer the opposite direction as a heuristic (the natural
symmetry of ring / mesh)
- No need for redundant metadata like `peer_direction` — **address is the
single source of truth**
이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
This design works **independently of the PhysAddr transition (ADR-0030)**.
Whether the current addresses are synthetic or PhysAddr, the same approach
applies as long as the per-direction range uniqueness is preserved.
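
The per-direction layout described above (`rx_base_pa + direction_idx × bytes_per_direction`) can be sketched as a pair of helpers. The ordering of `DIRECTIONS` and the concrete addresses are assumptions made for illustration; the ADR only requires that the per-direction ranges be unique.

```python
from typing import Optional

DIRECTIONS = ("N", "S", "E", "W")  # assumed mapping of direction to direction_idx


def rx_slot_base(rx_base_pa: int, direction: str, bytes_per_direction: int) -> int:
    """Base address of the rx buffer for one direction."""
    return rx_base_pa + DIRECTIONS.index(direction) * bytes_per_direction


def direction_for_addr(
    rx_base_pa: int, dst_addr: int, bytes_per_direction: int
) -> Optional[str]:
    """Recover the direction from a destination address, or None if out of range."""
    offset = dst_addr - rx_base_pa
    if offset < 0:
        return None
    idx = offset // bytes_per_direction
    if idx >= len(DIRECTIONS):
        return None
    return DIRECTIONS[idx]


BASE = 0x1000_0000          # hypothetical rx_base_pa
PER_DIRECTION = 0x4000      # hypothetical bytes_per_direction
w_base = rx_slot_base(BASE, "W", PER_DIRECTION)
print(hex(w_base), direction_for_addr(BASE, w_base + 0x100, PER_DIRECTION))
```

Since each direction owns a disjoint range, the address alone identifies the direction even when two directions point at the same peer.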
---
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
return None
```
호출부:
Call site:
```python
for d, peer_rank in nbrs.items():
peer_dir = reverse_direction(r, peer_rank, d) # my_dir 전달
peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir
if peer_dir is None:
continue
...
```
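
Because most of the `reverse_direction` body is elided in this diff, the following is only a sketch of the opposite-direction preference, under stated assumptions: `ring_1d` is reimplemented here with E as the next rank and W as the previous one, and an extra `world_size` parameter is added so the example is self-contained (the ADR's signature takes only `my_rank`, `peer_rank`, `my_dir`).

```python
from typing import Dict, Optional

OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}


def ring_1d(rank: int, world_size: int) -> Dict[str, int]:
    """Bidirectional 1-D ring (assumed convention: E = next, W = previous)."""
    return {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}


def reverse_direction(
    my_rank: int, peer_rank: int, my_dir: str, world_size: int
) -> Optional[str]:
    """Pick the peer-side direction that pairs with my_dir."""
    peer_nbrs = ring_1d(peer_rank, world_size)
    preferred = OPPOSITE[my_dir]
    # Prefer the opposite direction when it really points back at us.
    if peer_nbrs.get(preferred) == my_rank:
        return preferred
    # Otherwise fall back to any direction that points back at us.
    for direction, rank in peer_nbrs.items():
        if rank == my_rank:
            return direction
    return None


# 2-rank ring: both of rank 0's directions point at rank 1.
print(reverse_direction(0, 1, "E", world_size=2))
print(reverse_direction(0, 1, "W", world_size=2))
```

With the preference in place, the result no longer depends on dict iteration order in the 2-rank case.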
### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
### D2. Runtime — `_handle_meta_arrival` dst_addr matching
`src/kernbench/components/builtin/pe_ipcq.py`:
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
# Unknown dst_addr — diagnostic log (should not happen under correct install)
```
Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
The sender-coordinate check is **removed**. `dst_addr` already determines
the direction.
### D3. Credit — `dst_rx_base_pa` 필드 추가
### D3. Credit — add `dst_rx_base_pa` field
`src/kernbench/common/ipcq_types.py`:
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
@dataclass(frozen=True)
class IpcqCreditMetadata:
consumer_seq: int
dst_rx_base_pa: int # NEW: sender peer.rx_base_pa와 매칭용
# 기존 필드 (diagnostic / log 용도로 유지)
dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa
# Existing fields (kept for diagnostic / logging purposes)
src_sip: int
src_cube: int
src_pe: int
src_direction: str
```
Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`
`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
When the credit is generated (`_delayed_credit_send`): it carries this
direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
`peer.rx_base_pa` the other side used when it was the sender).
수신 측 (`_credit_worker`):
Receiver side (`_credit_worker`):
```python
def _credit_worker(self, env):
while True:
credit = yield self._credit_inbox.get()
for d, qp in self._queue_pairs.items():
# peer rx_base_pa credit dst_rx_base_pa가 일치하는 qp 찾기
# Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
credit.consumer_seq)
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
break
```
Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
unambiguous.
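
The matching step can be exercised in isolation, without the simulation loop. The sketch below uses simplified stand-in types (`Credit`, `QueuePair`) rather than the real `IpcqCreditMetadata` and queue-pair dicts, and the addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Credit:
    """Simplified stand-in for IpcqCreditMetadata."""
    consumer_seq: int
    dst_rx_base_pa: int


@dataclass
class QueuePair:
    """Simplified stand-in for one per-direction queue pair."""
    peer_rx_base_pa: int
    peer_tail_cache: int = 0


def apply_credit(queue_pairs: Dict[str, QueuePair], credit: Credit) -> Optional[str]:
    """Update the one queue pair whose peer rx base matches the credit."""
    for direction, qp in queue_pairs.items():
        if qp.peer_rx_base_pa == credit.dst_rx_base_pa:
            qp.peer_tail_cache = max(qp.peer_tail_cache, credit.consumer_seq)
            return direction
    return None  # unknown address: nothing is updated


# 2-rank ring: E and W reach the same peer but target different rx slots.
qps = {"E": QueuePair(peer_rx_base_pa=0x2000), "W": QueuePair(peer_rx_base_pa=0x1000)}
matched = apply_credit(qps, Credit(consumer_seq=5, dst_rx_base_pa=0x1000))
print(matched, qps["E"].peer_tail_cache, qps["W"].peer_tail_cache)
```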
### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`**불필요**.
이유:
- Meta arrival dst_addr로 매칭 (D2)
- Credit dst_rx_base_pa로 매칭 (D3)
- qp에 peer_direction 저장 필요 없음
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
**unnecessary**. Reasons:
- Meta arrivals are matched by dst_addr (D2)
- Credits are matched by dst_rx_base_pa (D3)
- No need to store peer_direction on qp
- Install only uses peer_dir internally when computing rx_base_pa
(`reverse_direction`)
IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
No change to the IpcqInitEntry schema. **Simpler** than rev 1.
### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
- Diagnostics: pointer_dump 등에서 direction 표시
- 미래 확장 여지
The existing `src_direction` field is not removed. It is retained for:
- Logging / trace: the `(rank, t, dir, nbytes)` output of
`KERNBENCH_CCL_TRACE=1`
- Diagnostics: showing direction in pointer_dump, etc.
- Room for future extension
Runtime matching `dst_addr`만 사용.
Runtime matching uses only `dst_addr`.
### D6. Invariants (ADR-0023 I3 강화)
### D6. Invariants (strengthens ADR-0023 I3)
**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
rx_base peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
이를 보장해야 한다 (reverse_direction opposite-preference).
**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
my rx_base and peer rx_base must point to **distinct direction slots**.
Install must guarantee this (reverse_direction opposite-preference).
**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]``qp["peer"].rx_base_pa`
서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
of different directions never overlap). This is the prerequisite for the
address-based matching of D2/D3.
Install time에 검증 가능:
Verifiable at install time:
```python
# ccl/install_plan.py: build_install_plans 끝에 assertion
# ccl/install_plan.py: assertion at the end of build_install_plans
all_rx_ranges = set()
for plan in plans:
for pe_install in plan.pe_installs:
@@ -228,36 +241,42 @@ for plan in plans:
## Dependencies
- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023 runtime 매칭 로직 수정
(D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
변경은 없음.
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
matching logic (D2, D3) and improves the install heuristic (D1). No
change to the IPCQ protocol's semantic layer.
- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
case work.
- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
address-based matching works identically whether the current addresses
are synthetic or PhysAddr.
---
## Non-goals
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
인코딩되는가와 무관.
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
무관.
- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
agnostic to how addresses are encoded.
- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
D5 still holds.
- **Unidir ring specialization**: `ring_1d_unidir` only has a single
direction, so the bug does not apply.
---
## Open questions
- **주소 매칭 성능**: `_handle_meta_arrival``_credit_worker`가 qp를 선형
순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
전환 가능 (`_qp_by_rx_base`).
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
단순 구현 먼저.
- **Address-matching performance**: `_handle_meta_arrival` and
`_credit_worker` iterate qp linearly (max 4 directions). The performance
impact is negligible. If it becomes an issue, this can be switched to a
dict lookup (`_qp_by_rx_base`).
- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
keep this field, which is only kept for diagnostics, or to split it out
of logging. Currently retained.
- **Cost of install-time invariant verification**: the I3.1 verification
of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies
→ improvable via data structures such as interval trees. Simple
implementation first.
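
If the quadratic check ever becomes a bottleneck, one lighter alternative to an interval tree is a sort-and-scan pass, which is O(n log n). This is a swapped-in technique offered as a sketch, not what the ADR implements; it assumes each rx buffer is a non-empty half-open `(start, end)` range.

```python
from typing import Iterable, Optional, Tuple

Range = Tuple[int, int]  # half-open [start, end)


def find_overlap(ranges: Iterable[Range]) -> Optional[Tuple[Range, Range]]:
    """Return one overlapping pair of ranges, or None if all are disjoint."""
    ordered = sorted(ranges)
    # After sorting by start, any overlap shows up between neighbours.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur[0] < prev[1]:
            return prev, cur
    return None


disjoint = [(0x0000, 0x4000), (0x4000, 0x8000), (0x8000, 0xC000)]
clashing = [(0x0000, 0x4000), (0x2000, 0x6000)]
print(find_overlap(disjoint))
print(find_overlap(clashing))
```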
---
@@ -265,19 +284,26 @@ for plan in plans:
### Positive
- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
- **Simplicity**: redundant `peer_direction` metadata removed. Address is
the single source of truth.
- **Unambiguous matching**: works on every topology (including duplicate
directions).
- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
to `IpcqCreditMetadata`.
- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
is agnostic to the address encoding.
- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
logging.
### Negative
- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
- Runtime matching is now by address comparison, so when debugging
questions like "why did peer_head_cache[W] update rather than [E]" one
has to follow the address range (previously the direction name was
enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
pointer_dump.
### Neutral
- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
불변.
- The semantic layer of the IPCQ protocol (sender computes dst_addr,
receiver receives) is unchanged.
+130 -103
@@ -1,4 +1,4 @@
# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
## Status
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
## Context
### 목표
### Goal
`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
layers가 담당).
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
**cube × PE distribution within a single device (SIP)**. Inter-SIP
distribution (TP) is split into a separate layer (handled by ADR-0024's
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
layers).
## Decision
### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
```python
@dataclass(frozen=True)
@@ -32,15 +33,16 @@ class DPPolicy:
num_cubes: int | None = None
```
제거되는 필드: `sip`, `num_sips`.
Removed fields: `sip`, `num_sips`.
### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
현재 `ShardSpec.pe_index` **global flat index** (`sip × cubes × pes + cube ×
pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
The current `ShardSpec.pe_index` is a **global flat index**
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
flagged as "abstraction leakage".
본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`
property로도 **남기지 않는다**:
This ADR **redefines ShardSpec in structural coordinates** and **does
not even leave `pe_index` as a property**:
```python
# src/kernbench/policy/placement/dp.py (after)
@@ -59,28 +61,32 @@ class ShardSpec:
nbytes: int
```
**핵심 원칙**:
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
- **`pe_index` property도 없음** — silent semantics drift 차단.
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
`AttributeError`** → 반드시 구조적 좌표로 migration.
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
**Core principles**:
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
- **No `pe_index` property either** — blocks silent semantics drift.
- Existing callers expecting global-flat get an **immediate
`AttributeError`** on `.pe_index` access → forced migration to
structural coordinates.
- Local contexts that genuinely need a flat integer key (e.g. internal
dict lookup) explicitly compute
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
site.
**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
(AttributeError)가 훨씬 안전.
**Justification for removing the property**: KernBench is an internal
project with a limited number of call sites. Explicit breakage
(AttributeError) is much safer than the risk of silent drift (semantics
change while the type stays int).
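
A minimal sketch of the behaviour described in D2 follows. The field list of `ShardSpec` is partly elided in this diff, so the fields other than `sip`, `cube`, `pe`, and `nbytes` are assumptions, and `flat_key` is a hypothetical call-site helper rather than part of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    """Illustrative shape only; the real field list is partly elided."""
    sip: int
    cube: int
    pe: int
    offset_bytes: int  # assumed field
    nbytes: int


def flat_key(spec: ShardSpec, n_cubes: int, n_pe: int) -> int:
    """Explicit flat index, computed at the call site that needs it."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=256)
print(flat_key(spec, n_cubes=4, n_pe=8))  # 1*4*8 + 2*8 + 3 = 51

try:
    spec.pe_index  # no such attribute: legacy callers fail loudly
    legacy_access_failed = False
except AttributeError:
    legacy_access_failed = True
print(legacy_access_failed)
```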
### D3. `resolve_dp_policy` `target_sip`을 받아 structural 좌표 생성
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
Implements the contract of ADR-0024 D4. No post-hoc shifting.
```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class _LocalPeShard:
"""Internal — PE resolver의 반환. Cubelocal PE 식별자 + payload."""
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
local_pe: int # cube-local PE index (0..num_pe-1)
offset_bytes: int
nbytes: int
@@ -93,7 +99,7 @@ def resolve_dp_policy(
itemsize: int,
num_pe: int,
num_cubes: int = 1,
target_sip: int, # NEW — 어느 SIP에 배치할지 명시
target_sip: int, # NEW — explicitly state which SIP to place on
) -> list[ShardSpec]:
"""2-level resolution (cube × PE) on a specified SIP.
@@ -123,28 +129,30 @@ def resolve_dp_policy(
return all_shards
```
**내부 resolver** (`column_wise`, `row_wise`, `replicate`)`_LocalPeShard`
리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
that this is a "cube-local PE identifier"**. This resolves the previous
confusion with the name `ShardSpec.pe_index`.
**이름 규약 정리** (전체 ADR):
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
부가 효과: 이름 재등장 없음).
**Naming convention summary** (whole ADR):
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **removed**. Not retained anywhere, internal or external
(additional benefit of preventing silent drift: the name does not
reappear).
### D4. `_create_tensor` — 구조적 좌표로 직접 placement
### D4. `_create_tensor` — placement directly in structural coordinates
ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
호출 시점에 직접 지정.
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
coordinates are specified directly at the `resolve_dp_policy` call site.
```python
# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
# Single-driver fallback (ADR-0024 D2와 일관).
# Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
# 문제가 있음 → debug mode에서 경고.
# Single-driver fallback (consistent with ADR-0024 D2).
# In launcher-based code, forgetting set_device() silently sticks the
# tensor on SIP 0 — emit a warning in debug mode.
if os.environ.get("KERNBENCH_DEBUG"):
import warnings
warnings.warn(
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
itemsize=itemsize,
num_pe=eff_num_pe,
num_cubes=eff_num_cubes,
target_sip=current_sip, # ← 구조적 좌표 일차 지정
target_sip=current_sip, # ← structural coord specified up front
)
# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
# 과거의 post-hoc shifting 블록은 완전히 제거.
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The old post-hoc shifting block is removed entirely.
```
**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
ADR-0027의 TP primitive 사용.
**Every** tensor is placed on the current device's SIP. If you need a
multi-SIP tensor, use the TP primitive of ADR-0027.
**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
배치되는 것을 감지할 수 있도록 warning.
**Trade-off of the single-driver fallback**: When set_device is not
called, defaulting to SIP 0 is kept for compatibility with existing
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
that accidentally omitting set_device in a launcher context — which would
silently place the tensor on the wrong SIP — can be detected.
### D5. Downstream — allocator lookup은 구조적 tuple key
### D5. Downstream — allocator lookup by structural tuple key
기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
```python
for spec in placement:
alloc = allocators[spec.pe_index] # ← AttributeError (property 제거됨)
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
```
`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
With `pe_index` gone, migration to structural coordinates is **forced**:
```python
for spec in placement:
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```
`_ensure_allocators`의 dict population도 tuple key:
The dict population in `_ensure_allocators` is also tuple-keyed:
```python
# context.py _ensure_allocators (after)
@@ -204,59 +213,71 @@ for sip_id in sip_range:
)
```
`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
`_free_tensor` is the same: the old
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
and `(shard.sip, shard.cube, shard.pe)` is used directly.
**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
and hashable out of the box. A `PEIdentity` value object has the upside
of an explicit type, but the boilerplate is large and it is currently
the only key of the allocator dict, so it would be over-engineering.
Keep the tuple.
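
The tuple-keyed lookup can be sketched end to end with a toy allocator. `BumpAllocator` and `build_allocators` are hypothetical names used only for this example; the real `_ensure_allocators` body is elided in this diff.

```python
from typing import Dict, Tuple

PEKey = Tuple[int, int, int]  # (sip, cube, pe)


class BumpAllocator:
    """Toy allocator standing in for the real per-PE allocator."""

    def __init__(self) -> None:
        self._next = 0

    def alloc(self, nbytes: int) -> int:
        offset = self._next
        self._next += nbytes
        return offset


def build_allocators(n_sips: int, n_cubes: int, n_pe: int) -> Dict[PEKey, BumpAllocator]:
    """One allocator per PE, keyed by its structural coordinates."""
    return {
        (sip, cube, pe): BumpAllocator()
        for sip in range(n_sips)
        for cube in range(n_cubes)
        for pe in range(n_pe)
    }


allocators = build_allocators(n_sips=2, n_cubes=2, n_pe=4)
first = allocators[(1, 0, 3)].alloc(64)
second = allocators[(1, 0, 3)].alloc(64)
print(len(allocators), first, second)
```

A tuple key needs no topology constants at the lookup site, which is the practical difference from the old flat index.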
### D7. 하위 호환 — 불가 (cleanup ADR)
### D7. Backward compatibility — none (cleanup ADR)
이 ADR은 **breaking change**.
This ADR is a **breaking change**.
1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출`TypeError`
2. `ShardSpec.pe_index` 접근`AttributeError`
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
2. `ShardSpec.pe_index` access → `AttributeError`
모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
Both are **immediate, explicit breakage**. No deprecation warning /
fallback path. KernBench is an internal project with a bounded set of
call sites, so migration happens in one pass.
**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
**Blocking silent drift** is the main upside of fully removing the
property: code that expected a global flat could otherwise silently
receive a SIP-local result and index incorrectly — that possibility is
eliminated.
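
The first breakage mode follows directly from how dataclass constructors reject unknown keyword arguments. The sketch below uses a cut-down `DPPolicy` whose defaults are assumptions; only the `cube`, `pe`, and `num_cubes` field names come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DPPolicy:
    """Cut-down illustration; defaults are assumed, not the ADR's."""
    cube: str = "replicate"
    pe: str = "replicate"
    num_cubes: Optional[int] = None


def rejects(**kwargs) -> bool:
    """True if constructing DPPolicy with these kwargs raises TypeError."""
    try:
        DPPolicy(**kwargs)
    except TypeError:
        return True
    return False


print(rejects(sip=0))               # removed field
print(rejects(num_sips=2))          # removed field
print(rejects(cube="column_wise"))  # still-valid field
```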
## Dependencies
- **ADR-0024** (launcher): `set_device(rank)` current-device scoping
SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
좁힘.
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
provide the SIP placement mechanism. This ADR sits on top and narrows
DPPolicy to pure intra-device.
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
ADR-0027.
---
## Non-goals
- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
유지.
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
- **Redesign of `DPPolicy.cube` / `pe`**: existing
replicate/column_wise/row_wise semantics are kept.
- **Tiling policy consolidation**: `tiled_column_major` /
`tiled_row_major` stay as they are.
- **New multi-device tensor abstraction**: a DTensor-like is ADR-0028.
---
## Open questions
- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
(SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
테스트와의 호환).
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
- **`DPPolicy``num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
명시적 답.
- **Default value of current_sip in `_create_tensor`**: for calls without
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
error. The recommendation is fallback (compatibility with existing
single-driver tests).
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
tests to a launcher-based setup while preserving their intent requires
additional fixtures. This is scoped as separate work.
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
the concept of `num_sips` disappears entirely. The explicit answer for
expressing multi-SIP is to use the TP primitive of ADR-0027.
**Resolved (items that were open in earlier revs)**:
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
removed** (D2)
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
(D5)
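
The tuple key from D5 can be illustrated with a minimal sketch. Only the `(sip, cube, pe)` key shape comes from this ADR; the `Allocator` class, its bump-pointer behaviour, and the `ensure_allocator` helper are hypothetical stand-ins and do not reflect the real `_ensure_allocators` internals.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a per-PE allocator (assumption, not the real class).
@dataclass
class Allocator:
    next_offset: int = 0

    def alloc(self, nbytes: int) -> int:
        """Return the current offset, then advance it by nbytes."""
        offset = self.next_offset
        self.next_offset += nbytes
        return offset


# Structural coordinate used as the dict key, per D5: (sip, cube, pe).
Coord = tuple[int, int, int]


def ensure_allocator(
    allocators: dict[Coord, Allocator], sip: int, cube: int, pe: int
) -> Allocator:
    """Look up the allocator for (sip, cube, pe), creating it on first use."""
    key = (sip, cube, pe)
    if key not in allocators:
        allocators[key] = Allocator()
    return allocators[key]


allocators: dict[Coord, Allocator] = {}
a = ensure_allocator(allocators, sip=0, cube=1, pe=2)
b = ensure_allocator(allocators, sip=1, cube=1, pe=2)  # same cube/pe, other SIP
first = a.alloc(64)
second = a.alloc(32)
```

Because the SIP is part of the key, two PEs with the same `cube`/`pe` on different SIPs get separate allocators, which a flat SIP-local index alone could not distinguish.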
---
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
### Positive
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
inter-device.
- **API simplification**: about a 33% reduction in DPPolicy constructor
fields.
- **Structural-coordinate consistency**: ShardSpec is expressed as a
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
D4 contract is satisfied).
- **Clear meaning of `pe_index`**: the single interpretation is
SIP-local. If global-flat is needed, it must be made explicit.
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
the sole SIP-boundary control mechanism.
### Negative
- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
`spec.pe_index` → `AttributeError`. All callers need to be fixed at
once.
- **ShardSpec schema change**: a single `pe_index` field becomes three
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
- **No silent drift**: with the property fully removed, runtime failure
is immediate → migration leakage is blocked at the source. (Not a
negative but an explicit tradeoff.)
- The cost of rewriting `test_sip_parallel.py`.
### Neutral
- The meaning of the existing `cube` / `pe` fields is unchanged.
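
The explicit breakage listed under Negative can be reproduced with a small sketch. The class shapes here are reduced, hypothetical versions: only the `sip`/`cube`/`pe` fields of ShardSpec, the `cube`/`pe` fields of DPPolicy, and the two exception types come from this ADR. The string field types and defaults are assumptions.

```python
from dataclasses import dataclass


# Hypothetical, reduced shapes; the real classes carry more fields.
@dataclass(frozen=True)
class ShardSpec:
    sip: int
    cube: int
    pe: int  # note: no pe_index attribute or property remains


@dataclass(frozen=True)
class DPPolicy:
    cube: str = "replicate"  # assumed type and default
    pe: str = "replicate"    # assumed type and default


def probe() -> dict[str, str]:
    """Record which exception each removed API surface raises."""
    outcomes: dict[str, str] = {}
    try:
        DPPolicy(sip="replicate")  # removed constructor keyword
    except TypeError as exc:
        outcomes["DPPolicy(sip=...)"] = type(exc).__name__
    spec = ShardSpec(sip=0, cube=1, pe=2)
    try:
        spec.pe_index  # removed property
    except AttributeError as exc:
        outcomes["spec.pe_index"] = type(exc).__name__
    return outcomes


results = probe()
```

Both failures surface at the call site rather than producing a wrong index later, which is the tradeoff the ADR accepts in exchange for the one-pass migration.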
File diff suppressed because it is too large
+12
@@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None:
    assert v.verify(tmp_path) == []


def test_em_dash_title_separator_recognized(tmp_path: Path) -> None:
    """ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title."""
    en = tmp_path / "docs/adr/ADR-0033-foo-bar.md"
    ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md"
    en.parent.mkdir(parents=True, exist_ok=True)
    ko.parent.mkdir(parents=True, exist_ok=True)
    body = "## Status\n\nAccepted\n\n## Context\n\nbody\n"
    en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
    ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
    assert v.verify(tmp_path) == []


def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
    """ADR-0013 uses an underscore in its slug; the regex must accept it."""
    _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
+1 -1
@@ -24,7 +24,7 @@ import sys
from pathlib import Path
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
-TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")
def _normalize(text: str) -> str: