ADR: bilingual structure — EN canonical in adr/, KO mirror in adr-ko/
Establish English as the canonical ADR language with Korean translations held in a parallel docs/adr-ko/ tree as derived artifacts (1:1 mirror). Promotion from adr-proposed/ to adr/ now writes English to adr/ and the Korean to adr-ko/; bidirectional sync rule documented in CLAUDE.md.

- Migrate 30 ADRs in docs/adr/: 28 Korean-only translated to English, 2 bilingual pairs (ADR-0020, ADR-0023) consolidated (.en.md suffix dropped). ADR-0023 EN regenerated against KO source which had newer HW Realization Notes (D16-D23) section.
- docs/adr-history/ left frozen by design (transitional state).
- CLAUDE.md (Part 2): update ADR Lifecycle for 4-folder layout, mark docs/adr-ko/ as a Derived Artifact, add ADR Translation Discipline section covering bidirectional sync, conflict resolution (EN wins), and proposed-language freedom.
- tools/verify_adr_lang_pairs.py: new verification tool checking pair completeness, filename mirroring, ADR-ID match, Status byte-equality. Pre-commit hook intentionally not added; run on demand or in CI.
- tests/test_verify_adr_lang_pairs.py: 11 cases including CRLF/LF normalization, em-dash title separator, underscore-slug edge case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -202,8 +202,8 @@ General fallbacks. Apply to anything not explicitly covered above.
|
|||||||
>
|
>
|
||||||
> Contains **foundations** (Authority & Scope → Terminology → Terminology
|
> Contains **foundations** (Authority & Scope → Terminology → Terminology
|
||||||
> Discipline → Mental Model → Common Failure Modes) followed by **rules**
|
> Discipline → Mental Model → Common Failure Modes) followed by **rules**
|
||||||
> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, runtime API /
|
> (Non-Trivial, Verification Plan, CLI, Derived Artifacts, ADR Translation
|
||||||
> sim_engine Boundaries).
|
> Discipline, runtime API / sim_engine Boundaries).
|
||||||
|
|
||||||
## Authority & Scope
|
## Authority & Scope
|
||||||
|
|
||||||
@@ -218,14 +218,22 @@ General fallbacks. Apply to anything not explicitly covered above.
|
|||||||
|
|
||||||
### ADR Lifecycle
|
### ADR Lifecycle
|
||||||
|
|
||||||
ADRs live in one of three folders based on lifecycle state:
|
ADRs live in one of four folders. Three carry **canonical English**
|
||||||
|
content based on lifecycle state; the fourth holds Korean translations:
|
||||||
|
|
||||||
- `docs/adr/` — **Accepted** (current implementation reflected).
|
- `docs/adr/` — **Accepted** (canonical English; current
|
||||||
|
implementation reflected).
|
||||||
- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
|
- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
|
||||||
only / future-work exploration / retroactive documentation pending
|
only / future-work exploration / retroactive documentation pending
|
||||||
verification).
|
verification). **Authoring language is free** (any language); the
|
||||||
|
promotion step (below) translates to English.
|
||||||
- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
|
- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
|
||||||
authoritative source; kept as historical record).
|
authoritative source; kept as historical record). Frozen — language
|
||||||
|
policy not applied retroactively.
|
||||||
|
- `docs/adr-ko/` — Korean translations of accepted ADRs (derived
|
||||||
|
artifact, 1:1 mirror of `docs/adr/`). English in `docs/adr/` is the
|
||||||
|
canonical source of truth; when KO and EN disagree, EN wins. See
|
||||||
|
*ADR Translation Discipline* below.
|
||||||
|
|
||||||
Status field values:
|
Status field values:
|
||||||
|
|
||||||
@@ -240,17 +248,23 @@ Status field values:
|
|||||||
Transitions:
|
Transitions:
|
||||||
|
|
||||||
- **Proposed/Stub → Accepted**: when the ADR's decisions are
|
- **Proposed/Stub → Accepted**: when the ADR's decisions are
|
||||||
reflected in production code AND covered by tests. `git mv` from
|
reflected in production code AND covered by tests. If the proposed
|
||||||
`docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
|
ADR is in Korean, translate to English and place the English in
|
||||||
|
`docs/adr/`; move the Korean original to `docs/adr-ko/`. If the
|
||||||
|
proposed ADR is in English, `git mv` it to `docs/adr/` and create
|
||||||
|
the Korean translation in `docs/adr-ko/`. Change Status to
|
||||||
|
`Accepted` in both files.
|
||||||
- **Draft → Accepted**: when the ADR's text has been verified to
|
- **Draft → Accepted**: when the ADR's text has been verified to
|
||||||
accurately describe the existing implementation. `git mv` from
|
accurately describe the existing implementation. Same English /
|
||||||
`docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
|
Korean placement rule as above.
|
||||||
- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
|
- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
|
||||||
and `git mv` to `docs/adr-history/`. The superseding ADR includes
|
in both the EN and KO files and `git mv` both to their respective
|
||||||
a "Supersedes ADR-NNNN" reference (or, for partial supersession of
|
history locations (`docs/adr-history/` for English; the KO copy
|
||||||
clauses, documents this in its own body).
|
stays in `docs/adr-ko/` only if it was already mirrored — see *ADR
|
||||||
|
Translation Discipline* for the frozen-history exception).
|
||||||
- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
|
- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
|
||||||
(single-line stub) and `git mv` to `docs/adr-history/`.
|
(single-line stub) in both files and apply the same `git mv` rule
|
||||||
|
as the Superseded transition.
|
||||||
|
|
||||||
Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
|
Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
|
||||||
regardless of folder location. ADR numbers are **immutable**; never
|
regardless of folder location. ADR numbers are **immutable**; never
|
||||||
@@ -361,11 +375,48 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
|
|||||||
## Derived Artifacts (Clarification)
|
## Derived Artifacts (Clarification)
|
||||||
|
|
||||||
- Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
|
- Generated diagrams under `docs/diagrams/` are **derived artifacts**, not production code.
|
||||||
- Creating or updating files in `docs/diagrams/`:
|
- Korean ADR translations under `docs/adr-ko/` are **derived artifacts**
|
||||||
|
(mirror of the canonical English in `docs/adr/`); see *ADR Translation
|
||||||
|
Discipline*.
|
||||||
|
- Creating or updating files in `docs/diagrams/` or `docs/adr-ko/`:
|
||||||
- does NOT count as a production code change,
|
- does NOT count as a production code change,
|
||||||
- does NOT require Phase 2 approval,
|
- does NOT require Phase 2 approval,
|
||||||
- MUST be consistent with SPEC.md and ADRs.
|
- MUST be consistent with SPEC.md and ADRs.
|
||||||
|
|
||||||
|
## ADR Translation Discipline
|
||||||
|
|
||||||
|
English in `docs/adr/` is the canonical source of truth. Korean in
|
||||||
|
`docs/adr-ko/` mirrors it 1:1 as a derived artifact.
|
||||||
|
|
||||||
|
**Bidirectional sync rule (MUST)**: any edit to a file in `docs/adr/`
|
||||||
|
must be accompanied, in the same change, by a mirroring edit to
|
||||||
|
`docs/adr-ko/<same-filename>.md`. The reverse also applies: edits to
|
||||||
|
`docs/adr-ko/` must mirror back into `docs/adr/`. The two files must
|
||||||
|
always describe the same architectural content.
|
||||||
|
|
||||||
|
Mechanics:
|
||||||
|
|
||||||
|
- When editing an EN ADR, propagate the change to its KO counterpart
|
||||||
|
by translating just the diff (preserve unaffected KO prose); do not
|
||||||
|
regenerate the whole KO file from scratch.
|
||||||
|
- When editing a KO ADR, propagate to EN the same way.
|
||||||
|
- Filename mirror: `docs/adr/X.md` ↔ `docs/adr-ko/X.md` (no language
|
||||||
|
suffix in either path).
|
||||||
|
- The `## Status` block content must remain byte-identical between
|
||||||
|
the EN and KO files (e.g., both say `Accepted`).
|
||||||
|
- Conflict policy: if the two diverge despite the rule, treat EN as
|
||||||
|
authoritative and overwrite KO. Surface the divergence to the user
|
||||||
|
before reconciling.
|
||||||
|
- `docs/adr-proposed/` is exempt — single language only, no mirror
|
||||||
|
required until promotion.
|
||||||
|
- `docs/adr-history/` is frozen — pre-existing mixed-language state
|
||||||
|
there is not migrated.
|
||||||
|
|
||||||
|
Verification: `python tools/verify_adr_lang_pairs.py` checks that
|
||||||
|
every EN ADR has a matching KO file, the title's ADR-NNNN matches the
|
||||||
|
filename, and Status blocks are byte-equal. Run it on demand or wire
|
||||||
|
it into CI. Exit code: 0 = OK, 1 = mismatch.
|
||||||
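The checks named in the Verification paragraph can be illustrated with a minimal sketch. This is not the code in `tools/verify_adr_lang_pairs.py`: the function names `status_block` and `check_pair`, the title regex, the in-memory string interface, and the assumption that filenames begin with the `ADR-NNNN` ID are all hypothetical, chosen only to show the shape of the comparison.

```python
import re
from typing import List

# Assumption: the title line looks like "# ADR-0001: ..." (ID right after the '#').
TITLE_RE = re.compile(r"^#\s+(ADR-\d{4})\b", re.MULTILINE)


def status_block(text: str) -> str:
    """Return the body of the '## Status' section, with CRLF normalized to LF."""
    body, in_status = [], False
    for line in text.replace("\r\n", "\n").split("\n"):
        if line.startswith("## "):
            if in_status:
                break  # next level-2 heading ends the Status section
            in_status = line.strip() == "## Status"
            continue
        if in_status:
            body.append(line)
    return "\n".join(body).strip()


def check_pair(filename: str, en_text: str, ko_text: str) -> List[str]:
    """Return human-readable problems for one EN/KO pair (hypothetical API)."""
    problems = []
    for label, text in (("EN", en_text), ("KO", ko_text)):
        match = TITLE_RE.search(text)
        if match is None:
            problems.append(f"{label}: no ADR-NNNN in title")
        elif not filename.startswith(match.group(1)):
            problems.append(
                f"{label}: title ID {match.group(1)} does not match {filename}"
            )
    if status_block(en_text) != status_block(ko_text):
        problems.append("Status blocks differ")
    return problems
```

A caller would map an empty list to exit code 0 and anything else to exit code 1, matching the exit codes stated above.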
|
|
||||||
## runtime API / sim_engine Boundaries
|
## runtime API / sim_engine Boundaries
|
||||||
|
|
||||||
- runtime API MUST NOT hardcode topology/routing or internal hop sequences.
|
- runtime API MUST NOT hardcode topology/routing or internal hop sequences.
|
||||||
|
|||||||
@@ -0,0 +1,362 @@
|
|||||||
|
# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
|
||||||
|
Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
|
||||||
|
Supersedes ADR-0031.)
|
||||||
|
|
||||||
|
## Date
|
||||||
|
|
||||||
|
2026-04-27 (original: 2026-02-27)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
KernBench requires a stable, parsable physical address scheme that:
|
||||||
|
|
||||||
|
- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
|
||||||
|
- remains topology-agnostic (no hardcoded counts)
|
||||||
|
- supports swappable policy and DI-first components
|
||||||
|
- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
|
||||||
|
|
||||||
|
### History
|
||||||
|
|
||||||
|
- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
|
||||||
|
sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
|
||||||
|
- ADR-0031 (stub) requested PE-resource range partition but was never
|
||||||
|
implemented.
|
||||||
|
|
||||||
|
Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
|
||||||
|
concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
|
||||||
|
ADR-0031 is superseded.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
We define a **PhysAddr value object** and an **address decoding contract**
|
||||||
|
that converts an integer address into routing domains.
|
||||||
|
|
||||||
|
### D1. PhysAddr is an immutable value object
|
||||||
|
|
||||||
|
- PhysAddr is immutable and comparable as a pure value.
|
||||||
|
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
|
||||||
|
- No global state may be required to interpret a PhysAddr.
|
||||||
|
|
||||||
|
### D2. 51-bit Physical Address Layout
|
||||||
|
|
||||||
|
A 51-bit physical address is adopted.
|
||||||
|
|
||||||
|
#### 2.1 Top-Level Address Map
|
||||||
|
|
||||||
|
```text
|
||||||
|
[50:47] sip_id (4) -- 16 SIPs
|
||||||
|
[46:42] die_id (5) -- 32 dies per SIP
|
||||||
|
[41: 0] local_offset (42) -- 4 TB per die
|
||||||
|
```
|
||||||
|
|
||||||
|
```text
|
||||||
|
50 47 46 42 41 0
|
||||||
|
+---------+----------+-------------------------+
|
||||||
|
| sip_id | die_id | local_offset |
|
||||||
|
+---------+----------+-------------------------+
|
||||||
|
```
|
||||||
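The top-level map in 2.1 can be sketched as a pair of positional helpers. The names `TopLevelFields`, `split_top_level`, and `join_top_level` are illustrative assumptions, not the API of the recommended module; only the bit positions come from the layout above.

```python
from typing import NamedTuple

ADDR_BITS = 51
SIP_SHIFT, SIP_MASK = 47, (1 << 4) - 1   # [50:47]
DIE_SHIFT, DIE_MASK = 42, (1 << 5) - 1   # [46:42]
LOCAL_MASK = (1 << 42) - 1               # [41:0]


class TopLevelFields(NamedTuple):
    sip_id: int
    die_id: int
    local_offset: int


def split_top_level(addr: int) -> TopLevelFields:
    """Purely positional split of a 51-bit address (section 2.1)."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 51 bits")
    return TopLevelFields(
        sip_id=(addr >> SIP_SHIFT) & SIP_MASK,
        die_id=(addr >> DIE_SHIFT) & DIE_MASK,
        local_offset=addr & LOCAL_MASK,
    )


def join_top_level(fields: TopLevelFields) -> int:
    """Inverse of split_top_level; rejects out-of-range field values."""
    if not (0 <= fields.sip_id <= SIP_MASK
            and 0 <= fields.die_id <= DIE_MASK
            and 0 <= fields.local_offset <= LOCAL_MASK):
        raise ValueError("field out of range")
    return ((fields.sip_id << SIP_SHIFT)
            | (fields.die_id << DIE_SHIFT)
            | fields.local_offset)
```

Neither helper reads any state beyond its argument, which is the property D1 and D3 ask for.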
|
|
||||||
|
#### 2.2 die_id Allocation
|
||||||
|
|
||||||
|
| die_id | Meaning |
|
||||||
|
|--------|---------|
|
||||||
|
| 0..15 | AHBM dies |
|
||||||
|
| 16..20 | IOCHIPLET dies |
|
||||||
|
| 21..31 | Reserved |
|
||||||
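The die_id ranges in the table translate into a small dispatch function. The sketch below is an illustration only: `classify_die` is a hypothetical name, and returning the per-kind index (for example `die_id - 16` for IOCHIPLET dies, as Appendix A.4 suggests) is an assumption about what callers would want.

```python
from typing import Optional, Tuple


def classify_die(die_id: int) -> Tuple[str, Optional[int]]:
    """Map a 5-bit die_id to (kind, index within that kind)."""
    if not 0 <= die_id <= 31:
        raise ValueError("die_id must fit in 5 bits")
    if die_id <= 15:
        return "AHBM", die_id            # 0..15
    if die_id <= 20:
        return "IOCHIPLET", die_id - 16  # 16..20
    return "RESERVED", None              # 21..31
```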
|
|
||||||
|
#### 2.3 AHBM Die Layout
|
||||||
|
|
||||||
|
Only the lower 256 GB of the 4 TB die-local window is assigned.
|
||||||
|
|
||||||
|
```text
|
||||||
|
[41:38] MBZ (4)
|
||||||
|
[37] addr_space (1) -- 0 = local resource, 1 = HBM memory
|
||||||
|
[36: 0] sub-address (37)
|
||||||
|
```
|
||||||
|
|
||||||
|
| addr_space | Meaning |
|
||||||
|
|------------|---------|
|
||||||
|
| 0 | Local resource |
|
||||||
|
| 1 | HBM memory |
|
||||||
|
|
||||||
|
##### 2.3.1 HBM Window (addr_space = 1)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[36:0] hbm_offset (37) -- 128 GB decode window
|
||||||
|
```
|
||||||
|
|
||||||
|
The architectural decode window is fixed at 128 GB. Implemented capacity
|
||||||
|
may be smaller depending on SKU/topology (see D4).
|
||||||
|
|
||||||
|
##### 2.3.2 Resource Window (addr_space = 0)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[36:34] resource_kind (3)
|
||||||
|
[33: 0] kind_local (34) -- 16 GB per kind
|
||||||
|
```
|
||||||
|
|
||||||
|
| resource_kind | Meaning |
|
||||||
|
|---------------|---------|
|
||||||
|
| 000 | PE_LOCAL |
|
||||||
|
| 001 | MCPU_LOCAL |
|
||||||
|
| 010 | CUBE_SRAM |
|
||||||
|
| 011..111 | Reserved |
|
||||||
|
|
||||||
|
Each kind gets a 16 GB decode region.
|
||||||
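The AHBM die-local split (MBZ, `addr_space`, `resource_kind`) can be sketched as follows. `decode_ahbm_window` is a hypothetical helper, and raising `ValueError` for invalid encodings is an assumption, since the ADR leaves fault behavior to the implementation.

```python
from typing import Tuple

RESOURCE_KINDS = {0b000: "PE_LOCAL", 0b001: "MCPU_LOCAL", 0b010: "CUBE_SRAM"}


def decode_ahbm_window(local_offset: int) -> Tuple[str, int]:
    """Return (target, offset within target) for an AHBM die-local offset."""
    if local_offset < 0 or local_offset >> 38:
        raise ValueError("MBZ bits [41:38] must be zero")
    if (local_offset >> 37) & 1:
        # addr_space = 1: the remaining 37 bits are the HBM offset.
        return "HBM", local_offset & ((1 << 37) - 1)
    kind = (local_offset >> 34) & 0b111
    if kind not in RESOURCE_KINDS:
        raise ValueError("reserved resource_kind")
    # addr_space = 0: 34-bit kind_local offset inside a 16 GB region.
    return RESOURCE_KINDS[kind], local_offset & ((1 << 34) - 1)
```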
|
|
||||||
|
##### 2.3.3 PE_LOCAL (resource_kind = 000)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33] MBZ (1)
|
||||||
|
[32:29] pe_id (4) -- 0..15
|
||||||
|
[28:25] pe_sub_unit (4)
|
||||||
|
[24: 0] sub_offset (25) -- 32 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
|
||||||
|
|
||||||
|
| pe_sub_unit | Name | Budget |
|
||||||
|
|-------------|------|--------|
|
||||||
|
| 0 | PE_CPU_DTCM | 8 KB |
|
||||||
|
| 1 | MATH_ENGINE_DTCM | 8 KB |
|
||||||
|
| 2 | IPCQ | 256 KB |
|
||||||
|
| 3 | PE_CPU_SFR | 16 KB |
|
||||||
|
| 4 | MATH_ENGINE_SFR | 16 KB |
|
||||||
|
| 5 | DMA_ENGINE_SFR | 192 KB |
|
||||||
|
| 6 | PE_TCM | 2 MB |
|
||||||
|
| 7..15 | Reserved | -- |
|
||||||
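As a sketch of how the PE_LOCAL fields pack together, the helper below builds the 34-bit `kind_local` value from the table's sub-unit numbers and checks the 8 GB figure. `pe_local_offset` and the dictionary name are illustrative assumptions; the slot numbers and bit positions are taken from the layout above.

```python
PE_SUB_UNITS = {
    "PE_CPU_DTCM": 0, "MATH_ENGINE_DTCM": 1, "IPCQ": 2, "PE_CPU_SFR": 3,
    "MATH_ENGINE_SFR": 4, "DMA_ENGINE_SFR": 5, "PE_TCM": 6,
}
SLOT_BYTES = 1 << 25  # 32 MB per sub-unit slot


def pe_local_offset(pe_id: int, sub_unit: str, sub_offset: int) -> int:
    """Pack pe_id [32:29], pe_sub_unit [28:25], sub_offset [24:0]; bit 33 stays 0."""
    if not 0 <= pe_id <= 15:
        raise ValueError("pe_id must be 0..15")
    if not 0 <= sub_offset < SLOT_BYTES:
        raise ValueError("sub_offset must fit in the 32 MB slot")
    return (pe_id << 29) | (PE_SUB_UNITS[sub_unit] << 25) | sub_offset


# 16 PEs x 16 slots x 32 MB, as stated in the ADR.
ACTIVE_DECODE_BYTES = 16 * 16 * SLOT_BYTES
```

Fixed 32 MB slots make each field a plain shift and mask, at the cost of the sparse holes listed under Tradeoffs.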
|
|
||||||
|
##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33:30] MBZ (4)
|
||||||
|
[29:25] mcpu_sub_unit (5)
|
||||||
|
[24: 0] sub_offset (25) -- 32 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
32 sub-unit slots x 32 MB = 1 GB active decode.
|
||||||
|
|
||||||
|
| mcpu_sub_unit | Name | Budget |
|
||||||
|
|---------------|------|--------|
|
||||||
|
| 0 | MCPU_ITCM | 512 KB |
|
||||||
|
| 1 | MCPU_DTCM | 512 KB |
|
||||||
|
| 2 | IPCQ | 256 KB |
|
||||||
|
| 3 | MCPU_SFR | 8 KB |
|
||||||
|
| 4 | MCPU_DMA_SFR | 16 KB |
|
||||||
|
| 5 | MCPU_SRAM | 10 MB |
|
||||||
|
| 6..31 | Reserved | -- |
|
||||||
|
|
||||||
|
##### 2.3.5 CUBE_SRAM (resource_kind = 010)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33:25] MBZ (9)
|
||||||
|
[24: 0] sram_offset (25) -- flat 32 MB
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 2.4 IOCHIPLET Die Layout
|
||||||
|
|
||||||
|
Only the lower 1 TB of the 4 TB die-local window is assigned.
|
||||||
|
|
||||||
|
```text
|
||||||
|
[41:40] MBZ (2)
|
||||||
|
[39: 0] chiplet_offset (40) -- 1 TB
|
||||||
|
```
|
||||||
|
|
||||||
|
Region split by address range:
|
||||||
|
|
||||||
|
| Range | Meaning | Decode condition |
|
||||||
|
|-------|---------|------------------|
|
||||||
|
| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
|
||||||
|
| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
|
||||||
|
|
||||||
|
##### 2.4.1 IOCPU Region
|
||||||
|
|
||||||
|
```text
|
||||||
|
[30:27] iocpu_sub_unit (4)
|
||||||
|
[26: 0] sub_offset (27) -- 128 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
16 x 128 MB slots. 2 GB active decode.
|
||||||
|
|
||||||
|
| iocpu_sub_unit | Name | Budget |
|
||||||
|
|----------------|------|--------|
|
||||||
|
| 0 | IOCPU_ITCM | 512 KB |
|
||||||
|
| 1 | IOCPU_DTCM | 512 KB |
|
||||||
|
| 2 | IPCQ | 2 MB |
|
||||||
|
| 3 | IOCPU_SFR | 8 KB |
|
||||||
|
| 4 | IO_DMA_SFR | 16 KB |
|
||||||
|
| 5 | IO_SRAM | 64 MB |
|
||||||
|
| 6..15 | Reserved | -- |
|
||||||
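The IOCHIPLET range split and the IOCPU sub-unit fields can be sketched together. `decode_chiplet_offset` is a hypothetical name, and raising `ValueError` on non-zero MBZ bits is an assumption; the 2 GB boundary and bit positions come from 2.4 and 2.4.1.

```python
from typing import Optional, Tuple

IOCPU_LIMIT = 0x8000_0000  # 2 GB: offsets below this decode as IOCPU resource


def decode_chiplet_offset(local_offset: int) -> Tuple[str, Optional[int], int]:
    """Return (region, iocpu_sub_unit or None, offset) for an IOCHIPLET die."""
    if local_offset < 0 or local_offset >> 40:
        raise ValueError("MBZ bits [41:40] must be zero")
    if local_offset < IOCPU_LIMIT:
        sub_unit = (local_offset >> 27) & 0xF          # [30:27]
        sub_offset = local_offset & ((1 << 27) - 1)    # [26:0]
        return "IOCPU", sub_unit, sub_offset
    # UAL sub-layout is TBD in the ADR, so the raw chiplet offset is passed on.
    return "UAL", None, local_offset
```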
|
|
||||||
|
##### 2.4.2 UAL Region
|
||||||
|
|
||||||
|
Sub-layout TBD (separate ADR).
|
||||||
|
|
||||||
|
#### 2.5 Addressing Rules
|
||||||
|
|
||||||
|
1. MBZ bits must be zero. An address with non-zero MBZ bits is
|
||||||
|
**architecturally invalid**. An implementation may raise a decode fault
|
||||||
|
or return an error -- behavior is not prescribed by this ADR.
|
||||||
|
2. Fixed slot sizes are chosen for simple hardware decode; actual
|
||||||
|
implemented capacity may be smaller than the slot.
|
||||||
|
3. Access beyond a sub-unit's implemented budget within a slot is
|
||||||
|
**architecturally invalid** (same policy as MBZ).
|
||||||
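Rule 3 can be illustrated with a budget check for PE_LOCAL sub-units. This is a sketch under stated assumptions: `is_valid_pe_access` is a hypothetical helper, the Budget column of the PE_LOCAL table is taken as the implemented budget, and reserved sub-units are treated as invalid.

```python
KB, MB = 1 << 10, 1 << 20

# Budgets copied from the PE_LOCAL sub-unit table (sub-units 7..15 are reserved).
PE_BUDGET_BYTES = {
    0: 8 * KB, 1: 8 * KB, 2: 256 * KB, 3: 16 * KB,
    4: 16 * KB, 5: 192 * KB, 6: 2 * MB,
}


def is_valid_pe_access(pe_sub_unit: int, sub_offset: int, size_bytes: int) -> bool:
    """True if [sub_offset, sub_offset + size_bytes) stays inside the budget."""
    budget = PE_BUDGET_BYTES.get(pe_sub_unit)
    if budget is None:
        return False  # assumption: reserved sub-units have no valid range
    return sub_offset >= 0 and size_bytes > 0 and sub_offset + size_bytes <= budget
```

Note that an offset can be well inside the 32 MB slot and still be invalid, because the budget is much smaller than the slot.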
|
|
||||||
|
### D3. Bitfield decoding is deterministic
|
||||||
|
|
||||||
|
Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
|
||||||
|
`sub_unit`, `offset`) is purely positional. No runtime state is required.
|
||||||
|
Decoding deterministically maps an integer address to destination domains:
|
||||||
|
`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
|
||||||
|
IOCPU / UAL).
|
||||||
|
|
||||||
|
### D4. Capacity validation may depend on topology config
|
||||||
|
|
||||||
|
Whether a decoded address falls within **implemented capacity** (e.g.,
|
||||||
|
HBM 96 GB on a specific SKU) is checked against topology parameters
|
||||||
|
provided via DI/config. Decode itself (D3) never consults topology --
|
||||||
|
only validation does. These parameters must live in the topology/config
|
||||||
|
layer, not in node implementations.
|
||||||
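The split between decode (D3) and validation (D4) can be sketched with a validator that receives its capacity from configuration. `HbmCapacityConfig` and `hbm_access_is_implemented` are hypothetical names used for illustration; the 128 GB window and the 96 GB SKU example come from the text above.

```python
from dataclasses import dataclass

GB = 1 << 30
HBM_WINDOW_BYTES = 128 * GB  # architectural decode window, fixed by the layout


@dataclass(frozen=True)
class HbmCapacityConfig:
    """Hypothetical topology/config value injected into the validator."""
    implemented_bytes: int


def hbm_access_is_implemented(hbm_offset: int, size_bytes: int,
                              config: HbmCapacityConfig) -> bool:
    """Validation step (D4): decode has already produced hbm_offset."""
    if not 0 <= hbm_offset < HBM_WINDOW_BYTES or size_bytes <= 0:
        return False
    return hbm_offset + size_bytes <= config.implemented_bytes
```

The same `hbm_offset` is accepted or rejected depending only on the injected config, so the decoder never needs to know the SKU.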
|
|
||||||
|
### D5. Routing consumes decoded domains, not raw bits
|
||||||
|
|
||||||
|
Routing policy uses decoded domains:
|
||||||
|
|
||||||
|
- `src` location (sip / die / pe or node_id)
|
||||||
|
- `dst` domains derived from PhysAddr decoding
|
||||||
|
- `size_bytes` for size-aware link latency
|
||||||
|
|
||||||
|
Routing must not inspect raw bit-fields directly except inside the
|
||||||
|
decoding module.
|
||||||
|
|
||||||
|
## Alternatives Considered
|
||||||
|
|
||||||
|
1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice, and
|
||||||
|
removing it frees 4 bits that extend the die-local offset from 38 to
|
||||||
|
42 bits (IOCHIPLET 1 TB).
|
||||||
|
|
||||||
|
2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
|
||||||
|
Freed rack_id bits enable 42-bit local_offset.
|
||||||
|
|
||||||
|
3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
|
||||||
|
spanning)**: Rejected -- complicates D3 (deterministic decoding).
|
||||||
|
Uniform 4 TB window with MBZ padding is simpler.
|
||||||
|
|
||||||
|
4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
|
||||||
|
leads to duplicated logic, inconsistent routing, and hidden
|
||||||
|
assumptions.
|
||||||
|
|
||||||
|
5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
|
||||||
|
Rejected -- violates SPEC R3 and breaks swappability.
|
||||||
|
|
||||||
|
6. **Put decoding inside memory controllers or routers**: Rejected --
|
||||||
|
leaks policy into components, violates SPEC R4 / D5.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
|
||||||
|
- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
|
||||||
|
- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
|
||||||
|
- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
|
||||||
|
slots, reserved MBZ bits.
|
||||||
|
- DI-first: decoder can be swapped without changing components (SPEC R4).
|
||||||
|
|
||||||
|
### Tradeoffs
|
||||||
|
|
||||||
|
- Sparse address holes due to power-of-2 slot alignment.
|
||||||
|
- Large reserved/MBZ regions (intentional for future extension).
|
||||||
|
- Requires explicit configuration for topology-derived sizes (D4).
|
||||||
|
- Introduces a single "blessed" decoding module that must remain stable
|
||||||
|
and well-tested.
|
||||||
|
|
||||||
|
## Supersedes
|
||||||
|
|
||||||
|
- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
|
||||||
|
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
|
||||||
|
fulfill ADR-0031's stated goals.
|
||||||
|
|
||||||
|
## Implementation Notes (Non-normative)
|
||||||
|
|
||||||
|
- Recommended module: `src/kernbench/policy/address/phyaddr.py`
|
||||||
|
- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
|
||||||
|
die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
|
||||||
|
values, backward compatibility of factory APIs.
|
||||||
|
- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
|
||||||
|
`cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
|
||||||
|
parameter renamed to `die_id`.
|
||||||
|
- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
|
||||||
|
`iocpu_resource_addr`, `ual_addr`.
|
||||||
|
|
||||||
|
## Appendix A. Address Examples
|
||||||
|
|
||||||
|
### A.1 AHBM HBM access
|
||||||
|
|
||||||
|
sip=2, die=5, HBM offset=0x1000
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 2 -> [50:47] = 0b0010
|
||||||
|
die_id = 5 -> [46:42] = 0b00101
|
||||||
|
addr_space = 1 -> [37] = 1 (HBM)
|
||||||
|
hbm_offset = 0x1000 -> [36:0]
|
||||||
|
|
||||||
|
51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
|
||||||
|
```
|
||||||
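Example A.1 can be checked by assembling the address and decoding it back by position. The variable names below are illustrative; the shifts and masks follow 2.1 and 2.3.

```python
sip_id, die_id, addr_space, hbm_offset = 2, 5, 1, 0x1000

# Assemble the 51-bit address exactly as written in A.1.
addr = (sip_id << 47) | (die_id << 42) | (addr_space << 37) | hbm_offset

# Decode the same fields back out, purely by position.
decoded = {
    "sip_id": (addr >> 47) & 0xF,
    "die_id": (addr >> 42) & 0x1F,
    "mbz": (addr >> 38) & 0xF,
    "addr_space": (addr >> 37) & 0x1,
    "hbm_offset": addr & ((1 << 37) - 1),
}
print(hex(addr), decoded)
```

The assembled value is `0x1_1420_0000_1000`, and the MBZ field decodes to zero as rule 1 of 2.5 requires.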
|
|
||||||
|
### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 0 -> [50:47] = 0
|
||||||
|
die_id = 0 -> [46:42] = 0
|
||||||
|
addr_space = 0 -> [37] = 0
|
||||||
|
resource_kind = 0 -> [36:34] = 000 (PE_LOCAL)
|
||||||
|
pe_id = 3 -> [32:29] = 0011
|
||||||
|
pe_sub_unit = 6 -> [28:25] = 0110 (PE_TCM)
|
||||||
|
sub_offset = 0x400 -> [24:0]
|
||||||
|
|
||||||
|
local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 1 -> [50:47] = 0001
|
||||||
|
die_id = 3 -> [46:42] = 00011
|
||||||
|
addr_space = 0 -> [37] = 0
|
||||||
|
resource_kind = 1 -> [36:34] = 001 (MCPU_LOCAL)
|
||||||
|
mcpu_sub_unit = 5 -> [29:25] = 00101 (MCPU_SRAM)
|
||||||
|
sub_offset = 0 -> [24:0] = 0
|
||||||
|
|
||||||
|
local_offset = (1 << 34) | (5 << 25)
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 1 -> [50:47] = 0001
|
||||||
|
die_id = 17 -> [46:42] = 10001 (IOCHIPLET[1])
|
||||||
|
iocpu_sub_unit = 2 -> [30:27] = 0010 (IPCQ)
|
||||||
|
sub_offset = 0x20000 -> [26:0]
|
||||||
|
|
||||||
|
chiplet_offset = (2 << 27) | 0x20000
|
||||||
|
(< 0x8000_0000 -> IOCPU region)
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.5 IOCHIPLET -- UAL region, offset=4 GB
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 0 -> [50:47] = 0
|
||||||
|
die_id = 16 -> [46:42] = 10000 (IOCHIPLET[0])
|
||||||
|
chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
|
||||||
|
R5 (multi-domain comm)
|
||||||
|
- ADR-0031: Superseded
|
||||||
@@ -0,0 +1,102 @@
|
|||||||
|
# ADR-0002: Routing Distance, Ordering & Bypass Rules
|
||||||
|
|
||||||
|
## Status
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Date
|
||||||
|
2026-02-27
|
||||||
|
|
||||||
|
## Context
|
||||||
|
The KernBench Graph Latency Simulator must compare kernel execution time
|
||||||
|
across different architectures and topologies by computing end-to-end
|
||||||
|
latency from graph traversal.
|
||||||
|
|
||||||
|
To support meaningful comparison:
|
||||||
|
- routing must be deterministic
|
||||||
|
- latency must reflect actual interconnect structure
|
||||||
|
- local vs remote traffic must be distinguishable
|
||||||
|
- “bypass” optimizations must not undermine debuggability or correctness
|
||||||
|
|
||||||
|
The simulator also aims to avoid software-managed metadata and hidden
|
||||||
|
shortcuts that obscure control paths.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Distance is accumulated latency, not hop count
|
||||||
|
- Routing “distance” is defined as the **sum of per-node and per-link latency**.
|
||||||
|
- Hop count alone must not be used for ordering or path selection.
|
||||||
|
- Size-aware serialization latency (bytes / BW) contributes to distance.
|
||||||
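D1 can be made concrete with a small accumulation sketch. The `Hop` type, its field names, the nanosecond and bytes-per-nanosecond units, and charging serialization on every hop (store-and-forward) are assumptions for illustration; the simulator's actual latency model may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Hop:
    node_latency_ns: float
    link_latency_ns: float
    link_bw_bytes_per_ns: float  # 1 GB/s (decimal) equals 1 byte/ns


def path_distance_ns(hops: Iterable[Hop], size_bytes: int) -> float:
    """Sum per-node latency, per-link latency, and bytes / BW over a path."""
    total = 0.0
    for hop in hops:
        serialization_ns = size_bytes / hop.link_bw_bytes_per_ns
        total += hop.node_latency_ns + hop.link_latency_ns + serialization_ns
    return total
```

With this definition a two-hop path over fast links can be shorter than a one-hop path through a slow node, which is why hop count alone is ruled out.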
|
|
||||||
|
### D2. Routing order is derived from graph traversal
|
||||||
|
- The chosen route is the path with minimum accumulated latency
|
||||||
|
given the constructed graph and routing policy.
|
||||||
|
- Deterministic ordering must be guaranteed for identical inputs
|
||||||
|
(topology + policy + request).
|
||||||
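One way to obtain both minimum accumulated latency and deterministic ordering is Dijkstra's algorithm with an explicit tie-break. The sketch below is not the simulator's router: the adjacency-dict graph format, the function name `min_latency_path`, and breaking ties by comparing node-name paths are assumptions. It expects strictly positive edge latencies.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: latency_ns}


def min_latency_path(graph: Graph, src: str, dst: str) -> Tuple[float, List[str]]:
    """Return (latency, path) with the smallest accumulated latency.

    Heap entries are (latency, path); equal latencies are ordered by the
    path tuple, so identical inputs always produce the same route.
    """
    queue = [(0.0, (src,))]
    settled = set()
    while queue:
        latency, path = heapq.heappop(queue)
        node = path[-1]
        if node == dst:
            return latency, list(path)
        if node in settled:
            continue
        settled.add(node)
        for neighbor in sorted(graph.get(node, {})):
            if neighbor not in settled:
                heapq.heappush(
                    queue, (latency + graph[node][neighbor], path + (neighbor,))
                )
    # Missing connectivity is treated as a topology construction error (see D5).
    raise LookupError(f"no path from {src} to {dst}")
```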
|
|
||||||
|
### D3. Bypass is explicit and graph-represented
|
||||||
|
- All paths must be explicitly represented in the graph and subject to latency accumulation.
|
||||||
|
- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
|
||||||
|
(HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
|
||||||
|
Local HBM access has minimal hops (switching overhead only); remote access
|
||||||
|
traverses additional routers.
|
||||||
|
- Implicit or “magic” bypass paths are disallowed.
|
||||||
|
|
||||||
|
### D4. No zero-latency end-to-end paths
|
||||||
|
|
||||||
|
- Every routed request must incur **end-to-end** latency > 0.
|
||||||
|
- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
|
||||||
|
when the fabric is distributed and distance is not meaningful at that granularity.
|
||||||
|
This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
|
||||||
|
UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
|
||||||
|
- Fully zero-latency end-to-end paths are disallowed, except for explicit
|
||||||
|
test-only stubs clearly marked as such.
|
||||||
|
|
||||||
|
### D5. Policy vs topology responsibility split
|
||||||
|
- Topology builder:
|
||||||
|
- defines nodes and links and their latency/BW parameters
|
||||||
|
- Routing policy:
|
||||||
|
- selects among available graph paths based on decoded domains
|
||||||
|
- Routing policy must not assume missing links; missing connectivity
|
||||||
|
is a topology construction error.
|
||||||
|
|
||||||
|
### D6. No software-managed routing metadata
|
||||||
|
- Routing decisions must not rely on per-request software-managed metadata
|
||||||
|
that tracks distance, hop count, or ordering outside the graph model.
|
||||||
|
- All distance/order computation is derived from traversal itself.
|
||||||
|
|
||||||
|
## Alternatives Considered
|
||||||
|
|
||||||
|
1) **Hop-count based routing**
|
||||||
|
- Rejected: ignores heterogeneous latency/BW and misrepresents
|
||||||
|
architectural differences.
|
||||||
|
|
||||||
|
2) **Implicit local shortcuts**
|
||||||
|
- Rejected: breaks debuggability and violates traversal-based latency.
|
||||||
|
|
||||||
|
3) **Software-managed distance metadata**
|
||||||
|
- Rejected: increases control overhead and obscures routing semantics.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
|
||||||
|
- Architecture comparisons reflect real interconnect structure.
|
||||||
|
- Routing behavior is reproducible and deterministic.
|
||||||
|
|
||||||
|
### Tradeoffs / Costs
|
||||||
|
- Graph construction must be correct and complete.
|
||||||
|
- Bypass modeling requires explicit graph representation,
|
||||||
|
which slightly increases topology description complexity.
|
||||||
|
|
||||||
|
## Implementation Notes (Non-normative)
|
||||||
|
- Recommended responsibilities:
|
||||||
|
- Graph builder: ensure all required paths exist.
|
||||||
|
- Router: select next hop based on decoded domains and policy.
|
||||||
|
- Tests should assert:
|
||||||
|
- non-zero end-to-end latency
|
||||||
|
- deterministic routing for identical inputs
|
||||||
|
- bypass paths appear explicitly in emitted traces
|
||||||
|
|
||||||
|
## Links
|
||||||
|
- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
|
||||||
|
- ADR-0001: PhysAddr layout & decoding contract
|
||||||
@@ -0,0 +1,68 @@
|
|||||||
|
# ADR-0003: Target System Hierarchy & Modeling Scope
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
|
||||||
|
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
|
||||||
|
through switching fabrics, with a host CPU issuing commands/kernels.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
We model the system hierarchy explicitly:
|
||||||
|
|
||||||
|
### D1. Tray-level
|
||||||
|
|
||||||
|
- A compute tray contains:
|
||||||
|
- Host CPU (issues requests / coordinates runtime & data placement)
|
||||||
|
- Multiple identical SIPs (accelerators)
|
||||||
|
- Interconnect fabric between SIPs (PCIe and/or UAL via switches)
|
||||||
|
|
||||||
|
### D2. SIP-level
|
||||||
|
|
||||||
|
- A SIP is a multi-die package composed of:
|
||||||
|
- Multiple CUBEs (HBM die + compute PEs + UCIe)
|
||||||
|
- One or more IO chiplets (host/SIP interfaces)
|
||||||
|
- IO chiplets:
|
||||||
|
- provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
|
||||||
|
- can be multiple per SIP
|
||||||
|
- placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets
|
||||||
|
|
||||||
|
### D3. CUBE-level
|
||||||
|
|
||||||
|
- A CUBE contains:
|
||||||
|
- HBM + memory controller (HBM_CTRL)
|
||||||
|
- NOC (on-die fabric): carries all intra-cube traffic including HBM data,
|
||||||
|
inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access.
|
||||||
|
Must provide: full-BW PE↔local HBM path, PE↔SRAM connectivity,
|
||||||
|
PE↔UCIe connectivity, M_CPU↔PE command path.
|
||||||
|
NOC topology is an implementation choice (e.g., 2D mesh, ring, crossbar);
|
||||||
|
current implementation uses a 2D mesh with XY routing (see ADR-0017).
|
||||||
|
HBM_CTRL is attached to each PE's local NOC port (local HBM = minimal hop).
|
||||||
|
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
|
||||||
|
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
|
||||||
|
- multiple PEs
|
||||||
|
- up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
|
||||||
|
|
||||||
|
### D4. PE-level
|
||||||
|
|
||||||
|
- A PE can execute one kernel instance
|
||||||
|
- PE contains internal control + accelerators (modeled at PE view granularity):
|
||||||
|
- PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- The simulator supports abstraction by “views”:
|
||||||
|
- SIP view hides PE internals
|
||||||
|
- CUBE view treats each PE as a single block
|
||||||
|
- PE view expands PE internals
|
||||||
|
- Topology remains parameterized; sizes/counts/links come from configuration.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R3/R5
|
||||||
|
- ADR-0005 (diagram views)
|
||||||
|
- ADR-0017 (cube NOC 2D mesh architecture)
|
||||||
@@ -0,0 +1,76 @@
|
|||||||
|
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
|
||||||
|
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Local HBM definition
|
||||||
|
|
||||||
|
- Each PE is assigned a logically defined “local HBM” region.
|
||||||
|
- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
|
||||||
|
router in the NOC mesh (ADR-0017 D4).
|
||||||
|
- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
|
||||||
|
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
|
||||||
|
|
||||||
|
### D2. Local HBM bandwidth guarantee contract
|
||||||
|
|
||||||
|
- Accesses from a PE to its local HBM MUST guarantee full effective HBM
|
||||||
|
read/write bandwidth independent of intervening fabric bandwidth limits.
|
||||||
|
- Effective HBM bandwidth = spec bandwidth × efficiency factor.
|
||||||
|
The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
|
||||||
|
models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
|
||||||
|
misses). For example: 256 GB/s spec × 0.8 = 204.8 GB/s effective.
|
||||||
|
- The topology builder applies the efficiency factor to router-to-hbm edge
|
||||||
|
bandwidth at graph construction time, so all downstream routing and latency
|
||||||
|
computation uses the effective value.
|
||||||
|
- This guarantee is modeled by:
|
||||||
|
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
|
||||||
|
- while still incurring non-zero latency along explicitly modeled components.
|
||||||
|
- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
|
||||||
|
is consolidated in ADR-0033 (Latency Model: Assumptions and Known
|
||||||
|
Simplifications). The aggregate BW guarantee here remains the contract;
|
||||||
|
ADR-0033 documents how the per-PC model realizes it and which scheduler
|
||||||
|
effects are intentionally simplified.
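
The effective-bandwidth arithmetic in D2 can be sketched as follows. This is a minimal illustration, not the topology builder's code: `HbmEdge` and `transfer_time_ns` are hypothetical names, and only the 0.8 default and the 256 GB/s example come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HbmEdge:
    """Hypothetical router-to-HBM edge (not the simulator's actual type)."""

    spec_bw_gbps: float        # datasheet bandwidth in GB/s
    efficiency: float = 0.8    # mirrors the documented default of hbm_ctrl.attrs.efficiency

    def effective_bw_gbps(self) -> float:
        # Effective bandwidth = spec bandwidth x efficiency factor (D2).
        if not 0.0 < self.efficiency <= 1.0:
            raise ValueError("efficiency must be in (0, 1]")
        return self.spec_bw_gbps * self.efficiency


def transfer_time_ns(nbytes: int, edge: HbmEdge) -> float:
    """Serialization time on the edge; 1 GB/s equals 1 byte/ns."""
    return nbytes / edge.effective_bw_gbps()


if __name__ == "__main__":
    edge = HbmEdge(spec_bw_gbps=256.0)
    print(edge.effective_bw_gbps(), transfer_time_ns(2048, edge))
```

Applying the factor once, when the edge is built, means every later latency computation sees the same effective value and cannot accidentally use the raw spec number.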
|
||||||
|
|
||||||
|
### D3. Remote PE HBM semantics (intra-cube)
|
||||||
|
|
||||||
|
- A PE that accesses another PE's local HBM traverses the NOC:
|
||||||
|
- PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
|
||||||
|
- NOC bandwidth and hop count may limit remote HBM access relative to local access.
|
||||||
|
|
||||||
|
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
|
||||||
|
|
||||||
|
- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
|
||||||
|
- NOC bandwidth within the cube,
|
||||||
|
- inter-cube UCIe links,
|
||||||
|
- inter-SIP fabric (PCIe/UAL).
|
||||||
|
- These paths MUST be explicit and traceable.
|
||||||
|
|
||||||
|
### D5. Shared SRAM semantics
|
||||||
|
|
||||||
|
- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
|
||||||
|
- Access path: PE_DMA → NOC → shared SRAM.
|
||||||
|
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
|
||||||
|
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
|
||||||
|
|
||||||
|
## Verification Notes
|
||||||
|
|
||||||
|
Tests should cover:
|
||||||
|
|
||||||
|
- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
|
||||||
|
- remote PE HBM case: latency includes mesh hop traversal
|
||||||
|
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
|
||||||
|
- shared SRAM case: access via NOC with correct BW
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R2/R5
|
||||||
|
- ADR-0002 (distance/order & explicit bypass)
|
||||||
|
- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
|
||||||
@@ -0,0 +1,186 @@
|
|||||||
|
# ADR-0005: Diagram Views & Distance-Aware Layout Rules
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
We require verifiable and inspectable system modeling for a large-scale,
|
||||||
|
parameterized AI Accelerator system.
|
||||||
|
|
||||||
|
Humans must be able to:
|
||||||
|
|
||||||
|
- visually inspect the modeled topology,
|
||||||
|
- reason about communication structure and relative distance,
|
||||||
|
- do so at multiple abstraction levels without being overwhelmed by detail.
|
||||||
|
|
||||||
|
The simulator models distance (accumulated latency) as a first-class concept.
|
||||||
|
Diagrams must reflect this distance by default.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Global Defaults
|
||||||
|
|
||||||
|
- All diagrams MUST be **distance-aware by default**.
|
||||||
|
- All diagrams MUST render **representative views** of the architecture.
|
||||||
|
- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
|
||||||
|
- Instance indices MAY be used ONLY:
|
||||||
|
- to define a distance anchor in asymmetric or debugging scenarios, or
|
||||||
|
- when explicitly requested.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Representative Rendering Rule
|
||||||
|
|
||||||
|
- All CUBEs share the same internal structure.
|
||||||
|
- All PEs share the same internal structure.
|
||||||
|
|
||||||
|
Therefore:
|
||||||
|
|
||||||
|
- SIP-level diagrams render representative CUBEs and IO chiplets.
|
||||||
|
- CUBE-level diagrams render representative PEs as opaque blocks.
|
||||||
|
- PE-level diagrams render a representative PE with fully expanded internals.
|
||||||
|
|
||||||
|
Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
|
||||||
|
unless explicitly requested.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Diagram Views
|
||||||
|
|
||||||
|
#### View A — SIP-Level Diagram
|
||||||
|
|
||||||
|
**Purpose**
|
||||||
|
Explain system-scale structure and connectivity.
|
||||||
|
|
||||||
|
**Visible elements**
|
||||||
|
|
||||||
|
- SIP boundaries (optional)
|
||||||
|
- CUBEs (opaque blocks)
|
||||||
|
- IO chiplets (opaque blocks)
|
||||||
|
- Optional UCIe stubs only if needed to clarify connectivity
|
||||||
|
|
||||||
|
**Hidden elements**
|
||||||
|
|
||||||
|
- PE internals
|
||||||
|
- CUBE internal fabric
|
||||||
|
- IO chiplet internals
|
||||||
|
|
||||||
|
**Visible links**
|
||||||
|
|
||||||
|
- Host ↔ IO chiplets (PCIe)
|
||||||
|
- SIP ↔ SIP (PCIe / UAL via switches)
|
||||||
|
- IO ↔ CUBE (on-package links)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### View B — CUBE-Level Diagram
|
||||||
|
|
||||||
|
**Purpose**
|
||||||
|
Explain cube-internal structure and data/control flow.
|
||||||
|
|
||||||
|
**Visible elements**
|
||||||
|
|
||||||
|
- Router mesh: 2D grid of NOC routers (from cube_mesh.yaml); all traffic routes through the mesh
|
||||||
|
- HBM_CTRL attached to PE routers (local HBM = 0 hop)
|
||||||
|
- HBM subsystem (HBM_CTRL)
|
||||||
|
- Shared SRAM: cube-level shared memory
|
||||||
|
- Management CPU (M_CPU)
|
||||||
|
- PEs as opaque blocks (PE[0..N−1])
|
||||||
|
- UCIe endpoints (N/E/W/S) as ports
|
||||||
|
|
||||||
|
**Hidden elements**
|
||||||
|
|
||||||
|
- PE internals
|
||||||
|
|
||||||
|
**Visible links**
|
||||||
|
|
||||||
|
- PE → router (HBM + non-HBM data path via mesh)
|
||||||
|
- Router ↔ HBM_CTRL (local HBM access)
|
||||||
|
- Router ↔ Router (mesh hops for remote access)
|
||||||
|
- Router ↔ UCIe endpoints
|
||||||
|
- Router ↔ shared SRAM
|
||||||
|
- M_CPU ↔ router (command path)
|
||||||
|
- Router → PE_CPU (command delivery, collapsed into PE block)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### View C — PE-Level Diagram
|
||||||
|
|
||||||
|
**Purpose**
|
||||||
|
Explain internal PE behavior and execution structure.
|
||||||
|
|
||||||
|
**Visible elements**
|
||||||
|
|
||||||
|
- PE_CPU
|
||||||
|
- Command handler / scheduler
|
||||||
|
- PE_TCM (local SRAM)
|
||||||
|
- HW accelerators (DMA, GEMM, MATH, etc.)
|
||||||
|
- Local HBM interface
|
||||||
|
- Optional IPCQ / messaging endpoints
|
||||||
|
|
||||||
|
**Visible links**
|
||||||
|
|
||||||
|
- Control paths (CPU → scheduler → engines)
|
||||||
|
- Data paths (engines ↔ TCM, DMA ↔ local HBM)
|
||||||
|
- External fabric ports as abstract ports only
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Distance-Aware Layout (Default)
|
||||||
|
|
||||||
|
#### Distance definition
|
||||||
|
|
||||||
|
- Distance is defined as **accumulated latency**, consistent with ADR-0002.
|
||||||
|
- Distance is computed from a single anchor node.
|
||||||
|
|
||||||
|
#### Default anchor selection
|
||||||
|
|
||||||
|
- SIP view: IO chiplet (or Host CPU if present)
|
||||||
|
- CUBE view: a representative PE
|
||||||
|
- PE view: PE_CPU or Command Handler
|
||||||
|
|
||||||
|
Anchors are **implicit defaults** and MUST NOT be required to be specified.
|
||||||
|
|
||||||
|
#### Layout rules
|
||||||
|
|
||||||
|
- Diagrams MUST be laid out in layers based on distance buckets.
|
||||||
|
- Layout direction MUST be consistent within a view type
|
||||||
|
(preferred: left-to-right).
|
||||||
|
- Nodes with equal distance MUST have stable ordering
|
||||||
|
(by role or identifier, deterministically).
|
||||||
|
|
||||||
|
Cycles MAY be rendered using dashed or curved edges for readability,
|
||||||
|
without affecting distance semantics.
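
The layering rules above can be sketched as a small bucketing function. This is an illustration under assumptions: the ADR does not say how bucket boundaries are chosen, so the fixed `bucket_ns` width and the function name `layer_by_distance` are hypothetical, and ties are broken by node identifier only.

```python
from collections import defaultdict


def layer_by_distance(distance_ns: dict[str, float], bucket_ns: float) -> list[list[str]]:
    """Group nodes into layout layers by accumulated latency from the anchor.

    Layers are returned nearest-first. Nodes within a layer are sorted by
    identifier, so the result does not depend on input ordering.
    """
    if bucket_ns <= 0:
        raise ValueError("bucket_ns must be positive")
    buckets: dict[int, list[str]] = defaultdict(list)
    for node, dist in distance_ns.items():
        buckets[int(dist // bucket_ns)].append(node)
    return [sorted(buckets[index]) for index in sorted(buckets)]


if __name__ == "__main__":
    # Hypothetical distances (ns) from a representative PE in a CUBE view.
    distances = {"pe": 0.0, "router": 2.0, "hbm_ctrl": 4.0, "sram": 4.5, "m_cpu": 12.0}
    for layer in layer_by_distance(distances, bucket_ns=5.0):
        print(layer)
```

Sorting both the bucket indices and the node names is what gives the stable ordering the rules require; a real generator might sort by role before identifier.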
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D5. Generation Contract (for Tools / Claude Code)
|
||||||
|
|
||||||
|
When generating diagrams:
|
||||||
|
|
||||||
|
- Assume distance-aware layout by default.
|
||||||
|
- Assume representative rendering by default.
|
||||||
|
- Do NOT ask for SIP/CUBE/PE indices unless required.
|
||||||
|
- Do NOT expand hidden abstraction levels.
|
||||||
|
- Prefer architectural clarity over micro-hop fidelity.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Diagrams are stable across topology scaling.
|
||||||
|
- Changes in distance or routing policy are reflected visually.
|
||||||
|
- Diagrams serve as verifiable artifacts derived from the simulator model,
|
||||||
|
not as hand-maintained documentation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC Section 4 (Output, Debuggability, and Diagrams)
|
||||||
|
- ADR-0002 (Routing distance semantics)
|
||||||
|
- ADR-0006 (Topology compilation & automatic diagram generation)
|
||||||
@@ -0,0 +1,130 @@
|
|||||||
|
# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
|
||||||
|
and computes routing and accumulated latency (distance).
|
||||||
|
Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
|
||||||
|
hand-maintained topology drawings.
|
||||||
|
|
||||||
|
Additionally, for usability, diagrams should be emitted automatically into a stable location
|
||||||
|
so that developers can preview them immediately in the repository.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Topology compilation is the single source of truth
|
||||||
|
|
||||||
|
- topology.yaml (or equivalent config) is compiled into:
|
||||||
|
- an explicit system graph,
|
||||||
|
- node/link attributes,
|
||||||
|
- routing policies.
|
||||||
|
This compiled graph is the authoritative representation of the system.
|
||||||
|
|
||||||
|
### D2. Distance extraction during compilation
|
||||||
|
|
||||||
|
- During or immediately after topology compilation, the simulator MUST compute distance metadata
|
||||||
|
(accumulated latency) consistent with ADR-0002.
|
||||||
|
- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
|
||||||
|
- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
|
||||||
|
layout placement for such nodes uses explicit position metadata rather than distance buckets.
|
||||||
|
|
||||||
|
### D3. Diagram generation is a derived artifact
|
||||||
|
|
||||||
|
- Diagrams MUST be generated from:
|
||||||
|
- the compiled topology graph,
|
||||||
|
- extracted distance metadata,
|
||||||
|
- view/layout rules defined in ADR-0005.
|
||||||
|
- Diagram generation MUST NOT require additional hand-written topology descriptions.
|
||||||
|
|
||||||
|
### D4. Automatic diagram emission to the repository
|
||||||
|
|
||||||
|
- As part of topology compilation, the implementation MUST produce the following diagrams by default:
|
||||||
|
- SIP-level diagram (representative, distance-aware)
|
||||||
|
- CUBE-level diagram (representative, distance-aware)
|
||||||
|
- PE-level diagram (representative, distance-aware)
|
||||||
|
- The default output directory is:
|
||||||
|
- `docs/diagrams/`
|
||||||
|
- The generator MUST overwrite or update existing diagram files only when the compiled topology (or the diagram rules) changes.
|
||||||
|
|
||||||
|
### D5. View-specific projection and layout
|
||||||
|
|
||||||
|
For each view (SIP / CUBE / PE):
|
||||||
|
|
||||||
|
- The generator MUST project the compiled graph into a reduced view graph:
|
||||||
|
- hide/collapse nodes according to ADR-0005,
|
||||||
|
- preserve connectivity semantics relevant to that view,
|
||||||
|
- compute distance buckets and assign layout layers deterministically.
|
||||||
|
- CUBE-level projection MUST include:
|
||||||
|
- Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
|
||||||
|
and PEs as opaque blocks.
|
||||||
|
- All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
|
||||||
|
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
|
||||||
|
|
||||||
|
### D6. Output formats and determinism
|
||||||
|
|
||||||
|
- The generator MUST output at least one of:
|
||||||
|
- Mermaid (Markdown-native)
|
||||||
|
- Graphviz DOT (rank-based control)
|
||||||
|
- SVG (mm-accurate layout, no external dependencies)
|
||||||
|
- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
|
||||||
|
- Output MUST be deterministic:
|
||||||
|
- same topology + same rules → identical diagram text
|
||||||
|
- File naming MUST be deterministic and stable (see "Output Conventions").
|
||||||
|
|
||||||
|
### D7. Performance and caching
|
||||||
|
|
||||||
|
- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
|
||||||
|
remain consistent with the compiled topology.
|
||||||
|
- The implementation SHOULD use a cache key based on:
|
||||||
|
- topology content hash,
|
||||||
|
- routing policy version,
|
||||||
|
- diagram rules version,
|
||||||
|
- view type (SIP/CUBE/PE).
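
One way to combine the four cache-key inputs listed in D7 is sketched below. The function name, the use of SHA-256, and the string-valued version arguments are assumptions made for illustration; the ADR only names the inputs.

```python
import hashlib

VIEWS = ("sip", "cube", "pe")


def diagram_cache_key(
    topology_text: str,
    routing_policy_version: str,
    diagram_rules_version: str,
    view: str,
) -> str:
    """Derive a deterministic cache key from the inputs named in D7."""
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    # Hash the topology content first so the key length is independent of file size.
    topology_hash = hashlib.sha256(topology_text.encode("utf-8")).hexdigest()
    material = "\n".join(
        [topology_hash, routing_policy_version, diagram_rules_version, view]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(diagram_cache_key("sips: 2\n", "routing-v1", "rules-v1", "cube"))
```

Because the key depends only on content and versions, regenerating a diagram with an unchanged key can be skipped without risking a stale file in `docs/diagrams/`.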
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Output Conventions
|
||||||
|
|
||||||
|
### Directory
|
||||||
|
|
||||||
|
- `docs/diagrams/` is the canonical output directory for generated diagrams.
|
||||||
|
|
||||||
|
### File names (recommended, deterministic)
|
||||||
|
|
||||||
|
- `system_view.svg` / `system_view.mmd` / `system_view.dot`
|
||||||
|
- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
|
||||||
|
- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
|
||||||
|
- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
|
||||||
|
|
||||||
|
Optionally, for multi-topology workflows:
|
||||||
|
|
||||||
|
- `sip_view__{topology_id}.svg`
|
||||||
|
- `cube_view__{topology_id}.svg`
|
||||||
|
- `pe_view__{topology_id}.svg`
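
The naming convention above can be captured in a single helper. This is a sketch, not the generator's code: `diagram_filename` is a hypothetical name, and applying the `__{topology_id}` suffix to the `.mmd` and `.dot` formats is an assumption by analogy with the `.svg` examples.

```python
from typing import Optional

VIEWS = ("system", "sip", "cube", "pe")
FORMATS = ("svg", "mmd", "dot")


def diagram_filename(view: str, fmt: str, topology_id: Optional[str] = None) -> str:
    """Build a deterministic file name such as 'cube_view.svg'."""
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    stem = f"{view}_view"
    if topology_id is not None:
        # Double underscore separates the view stem from the topology id.
        stem = f"{stem}__{topology_id}"
    return f"{stem}.{fmt}"


if __name__ == "__main__":
    print(diagram_filename("cube", "svg"))
    print(diagram_filename("pe", "svg", topology_id="demo"))
```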
|
||||||
|
|
||||||
|
### Repository policy
|
||||||
|
|
||||||
|
- Generated diagram files MAY be committed to the repository to enable diff-based review.
|
||||||
|
- If committed, they MUST be reproducible from topology compilation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Diagrams are always consistent with simulator behavior.
|
||||||
|
- Architectural changes automatically propagate to visualizations.
|
||||||
|
- Diagram diffs become meaningful indicators of architectural change.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC Section 4 (Output, Debuggability, and Diagrams)
|
||||||
|
- ADR-0002 (Distance semantics)
|
||||||
|
- ADR-0005 (Diagram views and layout rules)
|
||||||
@@ -0,0 +1,95 @@
|
|||||||
|
# ADR-0007: Runtime API and Simulation Engine Boundaries
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The simulator consists of multiple layers with distinct responsibilities:
|
||||||
|
|
||||||
|
- a host-facing API layer used by benchmarks and user code,
|
||||||
|
- a discrete-event simulation engine that executes requests,
|
||||||
|
- device components that model hardware behavior.
|
||||||
|
|
||||||
|
Without strict boundaries, orchestration logic can leak into components,
|
||||||
|
or simulation internals can become entangled with user-facing APIs.
|
||||||
|
|
||||||
|
This ADR defines clear responsibility boundaries between:
|
||||||
|
|
||||||
|
- runtime API,
|
||||||
|
- simulation engine (sim_engine),
|
||||||
|
- hardware components.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Runtime API is host-facing orchestration only
|
||||||
|
|
||||||
|
The runtime API represents host/driver-level behavior and MUST:
|
||||||
|
|
||||||
|
- expose high-level operations (tensor deployment, kernel launch),
|
||||||
|
- submit requests only to endpoint components (e.g., IO_CPU),
|
||||||
|
- await completion via futures/handles,
|
||||||
|
- own and persist host-side metadata (tensor allocation maps, kernel bindings).
|
||||||
|
|
||||||
|
The runtime API MUST NOT:
|
||||||
|
|
||||||
|
- hardcode hop-by-hop routing or fan-out,
|
||||||
|
- directly invoke internal components (M_CPU, PE_CPU, engines),
|
||||||
|
- embed topology- or routing-specific assumptions.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Simulation engine wires components and tracks completion
|
||||||
|
|
||||||
|
The simulation engine (sim_engine) MUST:
|
||||||
|
|
||||||
|
- wire components at initialization (create port stores + start wire
|
||||||
|
processes per the component port/wire framework — ADR-0015),
|
||||||
|
- inject requests into the compiled topology graph at entry components
|
||||||
|
(e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
|
||||||
|
- schedule and execute events using a discrete-event model,
|
||||||
|
- manage correlation ids and completion tracking.
|
||||||
|
|
||||||
|
The simulation engine MUST NOT:
|
||||||
|
|
||||||
|
- define tensor semantics,
|
||||||
|
- define kernel execution policies,
|
||||||
|
- expose internal graph details to the runtime API,
|
||||||
|
- walk the topology path during request execution,
|
||||||
|
- call component `run()` methods directly,
|
||||||
|
- track per-hop latency or decompose fan-out (components own this).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Components own fan-out and aggregation
|
||||||
|
|
||||||
|
Device-side components MUST:
|
||||||
|
|
||||||
|
- fan-out requests to downstream domains
|
||||||
|
(IO_CPU → M_CPU → PE_CPU → schedulers/engines),
|
||||||
|
- aggregate completion and failure signals,
|
||||||
|
- propagate results deterministically upstream.
|
||||||
|
|
||||||
|
Neither the runtime API nor the simulation engine may orchestrate
|
||||||
|
component-level fan-out explicitly.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Runtime APIs remain stable as topology and routing evolve.
|
||||||
|
- Simulation internals can change without affecting user-facing code.
|
||||||
|
- Component implementations remain swappable via DI.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R4, R7, R8
|
||||||
|
- ADR-0008 (Tensor deployment)
|
||||||
|
- ADR-0009 (Kernel execution)
|
||||||
|
- ADR-0015 (Component port/wire model and engine role)
|
||||||
|
- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
|
||||||
@@ -0,0 +1,100 @@
|
|||||||
|
# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Benchmarks require PyTorch-like tensor semantics:
|
||||||
|
|
||||||
|
- tensor creation (empty, fill),
|
||||||
|
- deployment to accelerator devices (tensor.to()).
|
||||||
|
|
||||||
|
In a real system, host software manages allocation/mapping and installs
|
||||||
|
mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
|
||||||
|
|
||||||
|
- device memory operations use PA only,
|
||||||
|
- VA/MMU/IOMMU is not modeled.
|
||||||
|
|
||||||
|
To keep the host↔device interface minimal, we avoid a separate
|
||||||
|
AllocateTensorMeta message. Instead, host allocation produces a PA shard map
|
||||||
|
that is used directly by MemoryWrite/Read and KernelLaunch.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Tensor is a host-owned handle with PA shard mapping
|
||||||
|
|
||||||
|
A Tensor object is a host-owned handle that encapsulates:
|
||||||
|
|
||||||
|
- shape and dtype,
|
||||||
|
- initialization intent,
|
||||||
|
- device placement and allocation metadata as a PA shard map.
|
||||||
|
|
||||||
|
After deployment, the Tensor handle MUST contain:
|
||||||
|
|
||||||
|
- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
|
||||||
|
|
||||||
|
This PA shard mapping is the single source of truth for kernel argument binding.
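
A minimal sketch of such a handle is shown below. The shard field names follow the tuple given above; the `Shard` and `TensorHandle` classes, the `shards_for` lookup, and the example addresses are hypothetical and not taken from the simulator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int            # physical address of the shard on its PE
    nbytes: int        # shard size in bytes
    offset_bytes: int  # byte offset of the shard within the logical tensor


@dataclass(frozen=True)
class TensorHandle:
    """Hypothetical host-owned handle; shards are filled in at deployment."""

    shape: tuple[int, ...]
    dtype: str
    shards: tuple[Shard, ...] = ()

    def shards_for(self, sip: int, cube: int, pe: int) -> list[Shard]:
        """Return the shards a given PE should bind as kernel arguments."""
        return [s for s in self.shards if (s.sip, s.cube, s.pe) == (sip, cube, pe)]


if __name__ == "__main__":
    # Eight float32 values (32 bytes) split evenly across two PEs of one cube.
    handle = TensorHandle(
        shape=(8,),
        dtype="float32",
        shards=(
            Shard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=16, offset_bytes=0),
            Shard(sip=0, cube=0, pe=1, pa=0x1000, nbytes=16, offset_bytes=16),
        ),
    )
    print(handle.shards_for(0, 0, 1))
```

Keeping the shard list immutable after deployment makes it safe to reuse the same handle for MemoryWrite/Read and for KernelLaunch argument binding.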
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Deployment uses a host allocator (Phase 0)
|
||||||
|
|
||||||
|
In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
|
||||||
|
|
||||||
|
- placement (split/replicate/hybrid) is decided by a DP policy,
|
||||||
|
- allocation assigns PA ranges at the PE level and returns shard mappings,
|
||||||
|
- the Tensor handle stores the resulting shard list deterministically.
|
||||||
|
|
||||||
|
No separate host-visible device allocation RPC is required in Phase 0.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Data initialization and transfer use MemoryWrite/Read only
|
||||||
|
|
||||||
|
Any data initialization or transfer implied by a tensor (e.g., fill, copy)
|
||||||
|
MUST be represented using Host ↔ IO_CPU messages only:
|
||||||
|
|
||||||
|
- MemoryWrite
|
||||||
|
- MemoryRead
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
|
||||||
|
- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
|
||||||
|
- Allocation metadata MUST NOT be embedded as a separate allocation message.
|
||||||
|
- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
|
||||||
|
|
||||||
|
The simulation engine schedules MemoryWrite/Read through the graph so that
|
||||||
|
latency is computed by explicit traversal.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Extension path (non-breaking)
|
||||||
|
|
||||||
|
Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
|
||||||
|
|
||||||
|
- virtual addressing in tensor handles,
|
||||||
|
- mapping install steps,
|
||||||
|
- translation latency/page granularity.
|
||||||
|
|
||||||
|
The Phase 0 PA shard map remains a valid fast-path configuration.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
|
||||||
|
- KernelLaunch can pass per-PE data placement explicitly via shard tags.
|
||||||
|
- Early implementation stays simple and testable.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0011 (Memory Addressing — PA / VA / LA)
|
||||||
|
- ADR-0012 (Host↔IO_CPU schema)
|
||||||
|
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||||
|
- ADR-0009 (Kernel execution)
|
||||||
@@ -0,0 +1,146 @@
|
|||||||
|
# ADR-0009: Kernel Execution Messaging and Completion Semantics
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Kernel execution is initiated by the host and proceeds through
|
||||||
|
device control components:
|
||||||
|
|
||||||
|
Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
|
||||||
|
|
||||||
|
Completion propagates in reverse order.
|
||||||
|
|
||||||
|
To keep benchmarks simple and topology-agnostic,
|
||||||
|
kernel execution must be endpoint-driven with deterministic aggregation.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Kernel launch is an endpoint request
|
||||||
|
|
||||||
|
A kernel launch is initiated by submitting a single KernelLaunch request
|
||||||
|
to the IO_CPU endpoint.
|
||||||
|
|
||||||
|
The runtime API MUST:
|
||||||
|
|
||||||
|
- construct the kernel launch request,
|
||||||
|
- submit it to IO_CPU,
|
||||||
|
- await a single completion result.
|
||||||
|
|
||||||
|
The runtime API MUST NOT orchestrate internal fan-out.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Tensor arguments are passed by metadata
|
||||||
|
|
||||||
|
KernelLaunch requests MUST reference tensor arguments via:
|
||||||
|
|
||||||
|
- host-owned tensor handles, or
|
||||||
|
- resolved device address maps derived from those handles.
|
||||||
|
|
||||||
|
Bulk tensor data MUST NOT be embedded in kernel launch messages.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Fan-out and aggregation are component responsibilities
|
||||||
|
|
||||||
|
- IO_CPU fans out work to M_CPUs.
|
||||||
|
- M_CPU fans out work to PE_CPUs.
|
||||||
|
- PE_CPU manages kernel execution and engine dispatch.
|
||||||
|
|
||||||
|
Completion semantics:
|
||||||
|
|
||||||
|
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
|
||||||
|
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
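
The completion rule can be sketched as a small aggregation function that applies at either level (M_CPU over PEs, IO_CPU over CUBEs). The ADR does not define the failure policy, so the fail-fast behavior here is an assumption, and `Status` and `aggregate` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


def aggregate(children: dict[str, Status]) -> Status:
    """Combine child statuses into the parent's status.

    Assumed policy: any failed child fails the parent immediately;
    otherwise the parent completes only when every child is done.
    """
    if not children:
        raise ValueError("a launch must target at least one child")
    statuses = list(children.values())
    if any(s is Status.FAILED for s in statuses):
        return Status.FAILED
    if all(s is Status.DONE for s in statuses):
        return Status.DONE
    return Status.PENDING


if __name__ == "__main__":
    print(aggregate({"pe0": Status.DONE, "pe1": Status.PENDING}))
```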
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Completion and failure propagation
|
||||||
|
|
||||||
|
- All messages MUST carry correlation identifiers.
|
||||||
|
- Completion and failure MUST propagate deterministically to the host.
|
||||||
|
- The simulation engine provides futures/handles to observe completion.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D5. Launch timing is endpoint-synchronized
|
||||||
|
|
||||||
|
All PEs targeted by a single kernel launch MUST begin executing the kernel
|
||||||
|
body at the same simulated time, regardless of their dispatch path length
|
||||||
|
from the launch entry point.
|
||||||
|
|
||||||
|
Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
|
||||||
|
latency at every level. PEs near their M_CPU receive the launch earlier
|
||||||
|
than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
|
||||||
|
farther away. Without synchronization, each PE's kernel begins at a
|
||||||
|
different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
|
||||||
|
of dispatch-path geometry rather than of the kernel's behavior —
|
||||||
|
producing measurement artifacts in benchmarks that time kernel-internal
|
||||||
|
waits (for example `tl.recv` on cross-cube or cross-SIP hops).
|
||||||
|
|
||||||
|
Mechanism.
|
||||||
|
|
||||||
|
- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
|
||||||
|
- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
|
||||||
|
computes `target_start_ns = env.now + max_latency` where
|
||||||
|
`max_latency` is the maximum, over every target (sip, cube, pe)
|
||||||
|
tuple, of the **two-leg dispatch chain**:
|
||||||
|
|
||||||
|
```
max_latency(sip, cube, pe) =
    compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
  + compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
  - io_cpu.overhead_ns
  - m_cpu.overhead_ns
```
|
||||||
|
|
||||||
|
This models the actual dispatch as **two sequential Transactions**
|
||||||
|
(IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
|
||||||
|
`compute_path_latency_ns` adds its endpoints' `overhead_ns`;
|
||||||
|
`io_cpu.overhead_ns` is subtracted because IO_CPU has already
|
||||||
|
paid it before this method runs, and `m_cpu.overhead_ns` is
|
||||||
|
subtracted once because it appears as endpoint of leg1 *and*
|
||||||
|
start of leg2 but is paid only once at run time. A single
|
||||||
|
`find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
|
||||||
|
it can pick a graph path that bypasses M_CPU and silently
|
||||||
|
under-shoots the prediction for far cubes, breaking the D5
|
||||||
|
invariant.
|
||||||
|
|
||||||
|
The fanned-out sub-Transactions carry **`nbytes = 0`** for
|
||||||
|
`KernelLaunchMsg` (control message only). Without this,
|
||||||
|
large kernel-launch payloads would occupy fabric BW on the
|
||||||
|
shared first hop and serialize the per-cube dispatch, pushing
|
||||||
|
far M_CPUs past `target_start_ns` and re-introducing the
|
||||||
|
late-arrival violation.
|
||||||
|
- **M_CPU** passes an already-stamped `target_start_ns` through
|
||||||
|
unchanged. Only when the value is absent (e.g. a direct
|
||||||
|
launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
|
||||||
|
`env.now + max(local command-path latency)`.
|
||||||
|
- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
|
||||||
|
of `_execute_kernel`, before recording `pe_exec_start` and invoking
|
||||||
|
the kernel body.
|
||||||
|
- When `target_start_ns is None`, PE_CPU falls through to the legacy
|
||||||
|
unsynchronized behavior — preserving backward compatibility.
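
The stamping arithmetic in the mechanism above can be sketched without a simulation engine. Everything below is illustrative: the per-leg latencies are made-up inputs standing in for `compute_path_latency_ns(...)` results, the function names are hypothetical, and clamping the PE wait at zero is an added assumption rather than documented behavior.

```python
from typing import Optional

Target = tuple[int, int, int]  # (sip, cube, pe)


def two_leg_latency_ns(
    leg1_ns: float, leg2_ns: float, io_cpu_overhead_ns: float, m_cpu_overhead_ns: float
) -> float:
    """Latency of IO_CPU -> M_CPU -> PE_CPU, removing double-counted overheads."""
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def stamp_target_start_ns(
    now_ns: float,
    legs: dict[Target, tuple[float, float]],
    io_cpu_overhead_ns: float,
    m_cpu_overhead_ns: float,
) -> float:
    """Compute target_start_ns as now + the slowest target's dispatch latency."""
    max_latency = max(
        two_leg_latency_ns(leg1, leg2, io_cpu_overhead_ns, m_cpu_overhead_ns)
        for leg1, leg2 in legs.values()
    )
    return now_ns + max_latency


def pe_wait_ns(target_start_ns: Optional[float], now_ns: float) -> float:
    """Delay a PE_CPU would wait before starting the kernel body."""
    if target_start_ns is None:
        return 0.0  # legacy unsynchronized path
    return max(0.0, target_start_ns - now_ns)  # clamp is an assumption


if __name__ == "__main__":
    legs = {(0, 0, 0): (30.0, 12.0), (0, 1, 0): (55.0, 12.0)}
    start = stamp_target_start_ns(100.0, legs, io_cpu_overhead_ns=5.0, m_cpu_overhead_ns=4.0)
    print(start, pe_wait_ns(start, 133.0), pe_wait_ns(start, 158.0))
```

In this example the near PE receives the launch at 133 ns and waits until the barrier, while the far PE arrives exactly at the barrier and starts immediately, so both begin the kernel body at the same simulated time.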
|
||||||
|
|
||||||
|
IO_CPU-level stamping guarantees every PE across every targeted cube
|
||||||
|
uses the same barrier sim-time, eliminating both the within-cube
|
||||||
|
dispatch-offset artifact *and* the cross-cube offset artifact in
|
||||||
|
multi-cube launches. This models a real-hardware timed-broadcast launch
|
||||||
|
(latency-equalized dispatch tree).
|
||||||
|
|
||||||
|
The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
|
||||||
|
control plane — runtime API and application kernels are unchanged.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R1, R2, R7, R8
|
||||||
|
- ADR-0007 (Runtime API boundaries)
|
||||||
|
- ADR-0008 (Tensor deployment)
|
||||||
|
- ADR-0013 (Verification strategy — V2 fan-out tests)
|
||||||
|
- ADR-0015 D4 (concrete fabric path for kernel launch)
|
||||||
@@ -0,0 +1,131 @@
|
|||||||
|
# ADR-0010: Command Line Interface and Execution Semantics
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The `kernbench` CLI is the user-facing entry point of the simulator. It
|
||||||
|
exposes three subcommands:
|
||||||
|
|
||||||
|
- `run` — execute a benchmark against a topology.
|
||||||
|
- `probe` — diagnostic utility for latency / BW measurement.
|
||||||
|
- `web` — interactive topology viewer.
|
||||||
|
|
||||||
|
Device enumeration is centralized in the CLI; neither the runtime API
|
||||||
|
nor the simulation engine enumerates devices. Benchmarks remain
|
||||||
|
single-device by design and accept a device identifier as input.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Benchmark contract — single-device by design
|
||||||
|
|
||||||
|
- A benchmark MUST define behavior for a single device only.
|
||||||
|
- A benchmark MUST accept a device identifier as input.
|
||||||
|
- Benchmarks MUST NOT enumerate or loop over multiple devices.
|
||||||
|
|
||||||
|
Multi-device execution is the CLI's concern (D3), not the benchmark's.
|
||||||
|
|
||||||
|
### D2. `kernbench run` — benchmark execution
|
||||||
|
|
||||||
|
Required arguments:
|
||||||
|
|
||||||
|
- `--topology <path>`: topology YAML file path. Loaded via
|
||||||
|
`resolve_topology()`.
|
||||||
|
- `--bench <name>`: benchmark name. Resolved via
|
||||||
|
`benches.loader.resolve_bench()`.
|
||||||
|
|
||||||
|
Optional arguments:
|
||||||
|
|
||||||
|
- `--device <selector>` (default: `all`):
|
||||||
|
- `all` — run once per discovered SIP (see D3).
|
||||||
|
- `sip:<N>` — run only on SIP N.
|
||||||
|
- Parsed via `resolve_device()`.
|
||||||
|
- `--verify-data` (default: off) — enable Phase 2 data verification
|
||||||
|
(see ADR-0020). When set, `engine_factory` constructs the engine
|
||||||
|
with `enable_data=True`. After the benchmark runs, a diagnostic
|
||||||
|
summary of recorded ops is printed.
|
||||||
|
|
||||||
|
Each invocation runs the benchmark once within a single simulation
|
||||||
|
instance.
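
The `--device` selector grammar described above (`all` or `sip:<N>`) can be sketched as a small parser. This is an illustrative sketch only: `parse_device_selector` and its `discovered_sips` parameter are hypothetical names and do not claim to match the actual `resolve_device()` signature.

```python
def parse_device_selector(selector: str, discovered_sips: list[int]) -> list[int]:
    """Return the SIP indices a run should target (hypothetical helper)."""
    if selector == "all":
        # One benchmark execution per discovered SIP (see D3).
        return list(discovered_sips)
    kind, sep, index = selector.partition(":")
    if kind != "sip" or not sep or not index.isdigit():
        raise ValueError(f"unsupported device selector: {selector!r}")
    sip = int(index)
    if sip not in discovered_sips:
        raise ValueError(f"SIP {sip} is not present in the topology")
    return [sip]
```

Returning a list in both cases lets the caller submit every selected SIP to the same simulation instance, which is what D3 requires.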
|
||||||
|
|
||||||
|
### D3. Multi-device execution is logically parallel
|
||||||
|
|
||||||
|
When `--device all` (or omitted) and the topology has multiple SIPs:
|
||||||
|
|
||||||
|
- Benchmark executions are submitted to a single simulation engine
|
||||||
|
instance.
|
||||||
|
- Executions are logically parallel in simulation time.
|
||||||
|
- Inter-device contention is naturally modeled (shared fabric
|
||||||
|
bandwidth, cross-SIP traffic, etc.).
|
||||||
|
|
||||||
|
The CLI does NOT spawn multiple OS processes or independent
|
||||||
|
simulation runs — parallelism is internal to one simulation instance.
|
||||||
|
|
||||||
|
### D4. `kernbench probe` — latency / BW diagnostic utility
|
||||||
|
|
||||||
|
Required argument:
|
||||||
|
|
||||||
|
- `--topology <path>`: topology YAML file path.
|
||||||
|
|
||||||
|
Optional argument:
|
||||||
|
|
||||||
|
- `--case <name>` (default: `all`) — run a predefined traffic
|
||||||
|
pattern, or `all` to run every defined case.
|
||||||
|
|
||||||
|
Probe runs each pattern through the simulation engine and reports
|
||||||
|
per case:
|
||||||
|
|
||||||
|
- End-to-end latency (ns).
|
||||||
|
- Effective bandwidth (nbytes / total_ns).
|
||||||
|
- Bottleneck bandwidth (min edge BW along the chosen path).
|
||||||
|
- Utilization (effective / bottleneck).
|
||||||
|
|
||||||
|
Probe additionally validates monotonicity invariants — for example
|
||||||
|
that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
|
||||||
|
cross-SIP — and reports violations. Probe is a developer tool for
|
||||||
|
verifying the latency / BW model; it is not a benchmark.
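
The per-case metrics and the monotonicity check can be sketched as follows. `ProbeResult`, `monotonicity_violations`, and the GB/s unit are assumptions made for illustration; the ADR only fixes the formulas (effective = nbytes / total_ns, bottleneck = min edge BW, utilization = effective / bottleneck).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    case: str
    nbytes: int
    total_ns: float
    edge_bw_gbs: tuple[float, ...]  # BW of each edge on the chosen path (assumed GB/s)

    @property
    def effective_bw_gbs(self) -> float:
        # Bytes per nanosecond is numerically equal to GB/s (1e9 bytes/s).
        return self.nbytes / self.total_ns

    @property
    def bottleneck_bw_gbs(self) -> float:
        return min(self.edge_bw_gbs)

    @property
    def utilization(self) -> float:
        return self.effective_bw_gbs / self.bottleneck_bw_gbs


def monotonicity_violations(
    latency_ns: dict[str, float], order: list[str]
) -> list[tuple[str, str]]:
    """Return adjacent (nearer, farther) pairs whose latencies are out of order."""
    return [(a, b) for a, b in zip(order, order[1:]) if latency_ns[a] > latency_ns[b]]
```

`order` would list the cases from nearest to farthest, for example local-HBM, cross-PE, cross-cube, cross-SIP.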
|
||||||
|
|
||||||
|
### D5. `kernbench web` — topology viewer
|
||||||
|
|
||||||
|
Optional arguments:
|
||||||
|
|
||||||
|
- `--port <N>` (default: `8765`) — HTTP port.
|
||||||
|
- `--no-open` — do not auto-open the browser.
|
||||||
|
|
||||||
|
Launches a local HTTP server that renders the compiled topology in
|
||||||
|
the browser. Distinct from the static `docs/diagrams/` artifacts:
|
||||||
|
|
||||||
|
- `docs/diagrams/` files are derived at topology-compile time
|
||||||
|
(ADR-0006).
|
||||||
|
- `kernbench web` is interactive — pan/zoom, hover for component
|
||||||
|
attributes, switch between SIP / CUBE / PE views.
|
||||||
|
|
||||||
|
### D6. Runtime API and simulation engine remain device-scoped
|
||||||
|
|
||||||
|
- Runtime API calls operate on one device per invocation.
|
||||||
|
- The simulation engine schedules all requests deterministically.
|
||||||
|
- Neither layer enumerates devices.
|
||||||
|
|
||||||
|
This invariant keeps each layer testable in isolation; device
|
||||||
|
enumeration and multi-device fan-out live only in the CLI's `run`
|
||||||
|
command (D3).
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Benchmark authors write single-device logic; multi-device behavior
|
||||||
|
emerges from the CLI dispatching across SIPs.
|
||||||
|
- Adding a new subcommand (e.g., trace export, replay) does not
|
||||||
|
require benchmark or runtime-API changes — the CLI is the
|
||||||
|
extension point.
|
||||||
|
- `probe` and `web` are diagnostic / visualization tools, not
|
||||||
|
benchmarks; they bypass the benchmark loader path.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R7, R8, R9
|
||||||
|
- ADR-0007 (Runtime API and Simulation Engine Boundaries)
|
||||||
|
- ADR-0020 (Two-pass data execution — `--verify-data`)
|
||||||
|
- ADR-0006 (Topology compilation and diagram generation —
|
||||||
|
background for `kernbench web`)
|
||||||
@@ -0,0 +1,521 @@
|
|||||||
|
# ADR-0011: Memory Addressing — PA / VA / LA Address Models
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted.
|
||||||
|
|
||||||
|
- **VA model: currently implemented (default).**
|
||||||
|
- PA model: implemented as PageFault fallback in PE_DMA.
|
||||||
|
- LA model: proposed, not implemented.
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
KernBench's address model evolved through three design points, each
|
||||||
|
addressing a limitation of the previous. This ADR documents all three
|
||||||
|
in one place because future implementation work selects among them.
|
||||||
|
|
||||||
|
### PA-only baseline
|
||||||
|
|
||||||
|
Phase 0 of KernBench treated all device memory operations
|
||||||
|
(MemoryRead/MemoryWrite) as raw physical-address transfers. No
|
||||||
|
host-side virtual addressing, no MMU/IOMMU translation. Allocators
|
||||||
|
returned PA mappings; DMA requests carried PA directly.
|
||||||
|
|
||||||
|
This was sufficient for early correctness/latency work but
|
||||||
|
insufficient for running standard Triton kernels that use
|
||||||
|
`base_addr + offset` patterns on sharded tensors: each PE's shard
|
||||||
|
has a different PA, but the kernel needs a single contiguous address
|
||||||
|
space to compute offsets.
|
||||||
|
|
||||||
|
### Why VA/MMU (current default)
|
||||||
|
|
||||||
|
A realistic system uses host-side virtual addressing and an
|
||||||
|
MMU/IOMMU-style translation path for DMA: the host allocates physical
|
||||||
|
memory at PE level, maps it into a virtual address space, installs
|
||||||
|
mappings, and DMA requests use virtual addresses that are translated
|
||||||
|
to physical addresses.
|
||||||
|
|
||||||
|
Adopting this model lets kernels use `base_addr + offset` over a
|
||||||
|
contiguous VA range while the device-side MMU translates each access
|
||||||
|
to the appropriate PA.
|
||||||
|
|
||||||
|
### Why LA/BAAW (proposed)
|
||||||
|
|
||||||
|
VA/MMU treats HBM as a single backing space. KernBench needs to
|
||||||
|
explore architectures where HBM is composed of multiple pseudo
|
||||||
|
channels in parallel:
|
||||||
|
|
||||||
|
- CUBE's HBM has 32 or 64 pseudo channels.
|
||||||
|
- In a PE-Local-HBM model, each PE is assigned N pseudo channels
|
||||||
|
(N = `hbm_pseudo_channels / pes_per_cube`).
|
||||||
|
- Per-channel BW (e.g. 32 GB/s) determines aggregate PE BW
|
||||||
|
(N × per-channel).
|
||||||
|
|
||||||
|
Two channel-mapping modes need to be modelable:
|
||||||
|
|
||||||
|
- **1:1 mode** — one logical access → N per-channel requests.
|
||||||
|
Precise per-channel BW contention modelling.
|
||||||
|
- **n:1 mode (default)** — one logical access → one aggregated
|
||||||
|
request. Channels are assumed to interleave; aggregated BW model.
|
||||||
|
|
||||||
|
VA's `tl.load(va_ptr)` produces a single DMA request to a single
|
||||||
|
target. Decomposing that into per-channel requests inside PE_DMA
|
||||||
|
requires the address layer to be aware of channels. This is the
|
||||||
|
role of the LA (Logical Address) abstraction with BAAW
|
||||||
|
(Logical-to-Physical Mapping Unit).
|
||||||
|
|
||||||
|
Core requirements driving the LA design:
|
||||||
|
|
||||||
|
- PE_DMA → HBM_CTRL effective bandwidth semantics must be identical
|
||||||
|
in both modes (only request shape and resource model differ).
|
||||||
|
- Kernel programming model is unchanged — physical channel
|
||||||
|
information is never exposed to kernel code.
|
||||||
|
- Mode switch is a topology-level configuration.
|
||||||
|
|
||||||
|
### Design space summary
|
||||||
|
|
||||||
|
| Model | Status | Key idea |
|-------|--------|----------|
| PA | fallback (implemented) | Direct physical addressing, no translation |
| VA | current default (implemented) | Per-tensor contiguous VA range; MMU translates per access |
| LA | proposed | LA + BAAW resolves to (PA, channel); supports 1:1 and n:1 channel mapping modes |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
This ADR defines three address models. At any given time the system
|
||||||
|
operates in exactly one model. Selection is topology- or
configuration-driven; coexistence within one simulation run is not required.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Address Model: PA (Physical Address) — fallback
|
||||||
|
|
||||||
|
#### D-PA1. PA-only semantics
|
||||||
|
|
||||||
|
- All device memory accesses (MemoryRead/MemoryWrite) operate on
|
||||||
|
device physical addresses (PA) plus size.
|
||||||
|
- PA-only mode remains functional via the PageFault fallback path in
|
||||||
|
PE_DMA: if a DMA src/dst address has no MMU mapping, PE_DMA treats
|
||||||
|
the value as a PA directly.
|
||||||
|
|
||||||
|
#### D-PA2. Allocation produces PA mappings
|
||||||
|
|
||||||
|
Device allocation selects PE-local memory regions and returns PA
|
||||||
|
mappings sufficient to execute kernels and issue DMA requests.
|
||||||
|
|
||||||
|
PA model is retained primarily for backward compatibility with PA-only
|
||||||
|
tests and as the underlying physical layer that VA / LA models resolve
|
||||||
|
into.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Address Model: VA (Virtual Address with MMU) — current default
|
||||||
|
|
||||||
|
#### D-VA1. Virtual Address Model
|
||||||
|
|
||||||
|
- Each tensor gets a single contiguous VA range (`TensorHandle.va_base`).
|
||||||
|
- `TensorShard` does NOT carry a `va` field — shard VA is derived as
|
||||||
|
`va_base + offset_bytes`.
|
||||||
|
- Kernels receive `va_base` as their pointer argument (via
|
||||||
|
`TensorArg.va_base`).
|
||||||
|
- `DmaReadCmd.src_addr` and `DmaWriteCmd.dst_addr` carry VA (not PA).
|
||||||
|
|
||||||
|
#### D-VA2. PE_MMU Component
|
||||||
|
|
||||||
|
- Hybrid design: SimPy component (inbox for `MmuMapMsg`) + utility
|
||||||
|
(synchronous `translate()` called by PE_DMA).
|
||||||
|
- Page-aligned dict lookup for O(1) VA → PA translation.
|
||||||
|
- `tlb_overhead_ns` configurable per-access latency.
|
||||||
|
- PageFault fallback: if VA has no mapping, PE_DMA treats it as PA
|
||||||
|
directly (preserves PA model for backward compatibility).
|
||||||
|
|
||||||
|
#### D-VA3. Mapping Installation
|
||||||
|
|
||||||
|
- `MmuMapMsg` traverses the fabric: Host → PCIE_EP → IO_CPU (cube
|
||||||
|
fan-out) → M_CPU (PE fan-out) → NOC → PE_MMU. Latency is measured
|
||||||
|
end-to-end.
|
||||||
|
- `MmuMapMsg.target_sips` controls SIP-level routing to prevent
|
||||||
|
cross-SIP mapping contamination for replicated tensors.
|
||||||
|
- Mapping strategy based on `DPPolicy.cube`:
|
||||||
|
- **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping
|
||||||
|
only. Each cube's PEs see only their local PA. No cross-cube
|
||||||
|
mapping installed.
|
||||||
|
- **Sharded** (`cube="column_wise"`, etc.): broadcast all shard
|
||||||
|
mappings to all target cubes. Enables cross-PE and cross-cube
|
||||||
|
DMA.
|
||||||
|
|
||||||
|
#### D-VA4. Tensor Lifecycle
|
||||||
|
|
||||||
|
- `del tensor` triggers automatic cleanup via `Tensor.__del__` +
|
||||||
|
`weakref` to `RuntimeContext`. Sends `MmuUnmapMsg` through fabric,
|
||||||
|
returns VA and PA space.
|
||||||
|
- `with RuntimeContext(...) as ctx:` provides scope-based bulk cleanup.
|
||||||
|
- `RuntimeContext._tensors` uses `weakref.ref` to avoid preventing GC.
|
||||||
|
- `PEMemAllocator` uses free-list with coalescing (not bump allocator).
|
||||||
|
- `VirtualAllocator` uses free-list with coalescing for VA space.
|
||||||
|
|
||||||
|
#### D-VA5. Allocators
|
||||||
|
|
||||||
|
- `VirtualAllocator`: device-wide VA space, page-aligned alloc/free
|
||||||
|
with coalescing.
|
||||||
|
- `PEMemAllocator`: per-PE HBM/TCM, free-list based alloc/free with
|
||||||
|
coalescing.
|
||||||
|
- Page size configurable via `topology.yaml` `pe_mmu` attrs
|
||||||
|
(default 4096).
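
A free-list allocator with coalescing, as both allocators are described, can be sketched like this. `FreeListAllocator` is a hypothetical class for illustration; first-fit placement and the assumption that `base` is already aligned are choices made for the sketch, not statements about `VirtualAllocator` or `PEMemAllocator`.

```python
class FreeListAllocator:
    """First-fit free-list allocator with coalescing (illustrative only)."""

    def __init__(self, base: int, size: int, align: int = 4096) -> None:
        self.align = align
        self._free: list[tuple[int, int]] = [(base, size)]  # sorted (start, size)
        self._live: dict[int, int] = {}  # start -> size of live allocations

    def alloc(self, nbytes: int) -> int:
        if nbytes <= 0:
            raise ValueError("nbytes must be positive")
        size = -(-nbytes // self.align) * self.align  # round up to alignment
        for i, (start, free_size) in enumerate(self._free):
            if free_size >= size:
                if free_size == size:
                    del self._free[i]
                else:
                    self._free[i] = (start + size, free_size - size)
                self._live[start] = size
                return start
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def free(self, start: int) -> None:
        size = self._live.pop(start)
        self._free.append((start, size))
        self._free.sort()
        # Coalesce: merge every block that begins where the previous one ends.
        merged: list[tuple[int, int]] = []
        for s, n in self._free:
            if merged and merged[-1][0] + merged[-1][1] == s:
                prev_start, prev_size = merged[-1]
                merged[-1] = (prev_start, prev_size + n)
            else:
                merged.append((s, n))
        self._free = merged
```

Unlike a bump allocator, freed ranges become reusable, and coalescing keeps repeated alloc/free cycles from fragmenting the space into unusably small blocks.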
|
||||||
|
|
||||||
|
#### Consequences (VA model)
|
||||||
|
|
||||||
|
- Triton kernels use `base_addr + offset` patterns naturally on
|
||||||
|
sharded tensors.
|
||||||
|
- All latency remains explicit via graph traversal, including MMU
|
||||||
|
mapping installation and per-access TLB overhead.
|
||||||
|
- PA-only mode retained as fallback (PageFault → treat as PA).
|
||||||
|
- IPCQ and other fixed-address resources bypass MMU (use PA directly).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Address Model: LA (Logical Address with BAAW) — proposed
|
||||||
|
|
||||||
|
LA replaces VA when channel-level HBM modelling is required.
|
||||||
|
Adopting this model removes the VA/MMU infrastructure (D-LA1 lists the
|
||||||
|
removed artifacts). Coexistence with VA in the same run is not a goal.
|
||||||
|
|
||||||
|
#### D-LA1. LA introduction — replaces VA infrastructure
|
||||||
|
|
||||||
|
LA is the sole address space used by kernel code (`tl.load`,
|
||||||
|
`tl.store`, `tl.composite`). Properties:
|
||||||
|
|
||||||
|
- Can map a Tensor to a contiguous logical space (like VA).
|
||||||
|
- Expresses `(logical buffer + offset)`.
|
||||||
|
- Does NOT contain physical channel information directly.
|
||||||
|
- Stays as an intermediate abstraction until physical resolution.
|
||||||
|
|
||||||
|
LA address space:
|
||||||
|
|
||||||
|
| Item | Value |
|------|-------|
| LA start | `0x1_0000_0000` (4 GB, preserves former VA start) |
| LA space size | 64 GB per PE |
| Alignment unit | segment (see D-LA3) |
|
||||||
|
|
||||||
|
LA is PE-local: different PEs may use the same LA value; BAAW segment
|
||||||
|
tables differ → they resolve to different PAs.
|
||||||
|
|
||||||
|
VA infrastructure removed when LA is adopted:
|
||||||
|
|
||||||
|
| Removed | Replacement |
|---------|-------------|
| `policy/address/va_allocator.py` (VirtualAllocator) | LA allocator (same free-list approach, renamed) |
| `policy/address/pe_mmu.py` (PeMMU) | BAAW segment table (inside PE_DMA) |
| `components/builtin/pe_mmu.py` (PeMmuComponent) | Removed — BAAW is internal PE_DMA logic, not a separate component |
| `runtime_api/kernel.py`: `MmuMapMsg`, `MmuUnmapMsg` | `BaawSegmentInstallMsg` |
| `runtime_api/context.py`: VA alloc + MMU install | LA alloc + BAAW segment install |
| `runtime_api/tensor.py`: `va_base` | `la_base` |
| `topology.yaml`: `pe_mmu` component entry | Removed |
|
||||||
|
|
||||||
|
#### D-LA2. Mapping mode setting
|
||||||
|
|
||||||
|
Topology-level (cube) configuration:
|
||||||
|
|
||||||
|
```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth
```
|
||||||
|
|
||||||
|
Consumed by the graph compiler (topology builder) and BAAW
|
||||||
|
initialisation.
|
||||||
|
|
||||||
|
#### D-LA3. Segment and BAAW
|
||||||
|
|
||||||
|
A segment partitions the LA space; each segment maps to a specific HBM
channel or channel group. Segments are created at tensor deploy time by
the runtime allocator. BAAW resolves LA → physical request(s) using the
segment table.
|
||||||
|
|
||||||
|
```python
from dataclasses import dataclass


@dataclass
class BaawSegment:
    la_base: int              # segment start LA
    la_size: int              # segment size (bytes)
    mode: str                 # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int        # channels assigned to this segment (e.g. 8)
    pa_bases: list[int]       # per-channel PA bases (len = channel_count)
    channel_ids: list[int]    # per-channel logical IDs (e.g. [0..7])
    channel_size: int         # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int          # aggregated PA base
    agg_node_id: str          # aggregated router node_id
```
|
||||||
|
|
||||||
|
Segment lifecycle:
|
||||||
|
|
||||||
|
1. **Allocate** (tensor deploy): RuntimeContext allocates LA from LA
|
||||||
|
allocator. PEMemAllocator allocates per-channel PA (1:1) or
|
||||||
|
aggregated PA (n:1). `BaawSegmentInstallMsg` registers the segment
|
||||||
|
with PE_DMA.
|
||||||
|
2. **Use** (kernel run): kernel `tl.load(la_ptr)` →
`DmaReadCmd(src_addr=LA)`. PE_DMA's BAAW front-end looks up the
segment and converts to PA(s).
|
||||||
|
3. **Free** (tensor free): segment removed from table; LA and PA
|
||||||
|
returned.
|
||||||
|
|
||||||
|
#### D-LA4. BAAW resolution logic
|
||||||
|
|
||||||
|
BAAW is a front-end stage inside PE_DMA, not a separate SimPy
component. It is synchronous address-resolution logic executed at the
start of PE_DMA's `handle_command()`.
|
||||||
|
|
||||||
|
Input: `(LA, nbytes)`. Output:
|
||||||
|
|
||||||
|
- **1:1 mode**: `list[PhysicalRequest]` — one per channel.
|
||||||
|
- **n:1 mode**: single `PhysicalRequest`.
|
||||||
|
|
||||||
|
```python
from dataclasses import dataclass


@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)


# Method of the BAAW front-end inside PE_DMA (enclosing class omitted).
def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    # one_to_one: fan out into one request per channel
    requests = []
    per_ch_size = seg.channel_size
    for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
        ch_offset = offset % per_ch_size
        ch_nbytes = nbytes // seg.channel_count
        pa = pa_base + ch_offset
        dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
        requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
    return requests
```
|
||||||
|
|
||||||
|
BAAW responsibilities:
|
||||||
|
|
||||||
|
- Convert logical access → physical request units.
|
||||||
|
- Apply mode-dependent fan-out (1:1) or pass-through (n:1).
|
||||||
|
- Compute PA and target node.
|
||||||
|
|
||||||
|
BAAW non-responsibilities:
|
||||||
|
|
||||||
|
- Performing actual data movement.
|
||||||
|
- Executing NOC routing.
|
||||||
|
- Simulating bandwidth occupation (downstream components' job).
|
||||||
|
|
||||||
|
BAAW output is directly usable by the simulator's routing and resource
|
||||||
|
model without additional address decoding.
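
The resolution logic above can be exercised end to end with a self-contained sketch. The `Baaw` class, its `install()` method, and the default field values on `BaawSegment` are hypothetical conveniences added for this example; the per-channel arithmetic is copied from the ADR pseudocode rather than independently derived.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


@dataclass
class BaawSegment:
    la_base: int
    la_size: int
    mode: str  # "one_to_one" | "n_to_one"
    channel_count: int = 0
    pa_bases: list[int] = field(default_factory=list)
    channel_ids: list[int] = field(default_factory=list)
    channel_size: int = 0
    agg_pa_base: int = 0
    agg_node_id: str = ""


class Baaw:
    """Illustrative segment table; class and method names are hypothetical."""

    def __init__(self, pe_prefix: str) -> None:
        self._pe_prefix = pe_prefix
        self._segments: list[BaawSegment] = []

    def install(self, seg: BaawSegment) -> None:
        self._segments.append(seg)

    def _find_segment(self, la: int) -> BaawSegment:
        for seg in self._segments:
            if seg.la_base <= la < seg.la_base + seg.la_size:
                return seg
        raise LookupError(f"no BAAW segment covers LA {la:#x}")

    def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
        seg = self._find_segment(la)
        offset = la - seg.la_base
        if seg.mode == "n_to_one":
            return [PhysicalRequest(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
        # one_to_one: same per-channel arithmetic as the ADR pseudocode
        return [
            PhysicalRequest(
                pa=pa_base + offset % seg.channel_size,
                nbytes=nbytes // seg.channel_count,
                dst_node=f"{self._pe_prefix}.ch_r{ch_id}",
            )
            for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids)
        ]
```

With a 4096-byte segment spread over 8 channels, resolving the full range from the segment base yields eight 512-byte requests in 1:1 mode and a single 4096-byte request in n:1 mode, matching the request shapes listed above.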
|
||||||
|
|
||||||
|
#### D-LA5. PE_DMA `handle_command()` change
|
||||||
|
|
||||||
|
Current (VA-based) flow:
|
||||||
|
|
||||||
|
```
DmaReadCmd.src_addr (VA)
  → MMU.translate(VA) → PA
  → PhysAddr.decode(PA) → PhysAddr object
  → resolver.resolve(PhysAddr) → dst_node_id
  → router.find_path(pe_prefix, dst_node_id) → path
  → 1 sub-Transaction → fabric inject
```
|
||||||
|
|
||||||
|
LA-based flow:
|
||||||
|
|
||||||
|
```
DmaReadCmd.src_addr (LA)
  → BAAW.resolve(LA, nbytes) → list[PhysicalRequest]
  → for each PhysicalRequest:
      → router.find_path(pe_prefix, req.dst_node) → path
      → compute_drain_ns(path, req.nbytes) → drain
      → sub-Transaction → fabric inject
  → await all sub-Transactions
  → pe_txn.done.succeed()
```
|
||||||
|
|
||||||
|
Key changes:
|
||||||
|
|
||||||
|
- MMU reference removed → BAAW resolve.
|
||||||
|
- `PhysAddr.decode()` + `resolver.resolve()` → BAAW returns `dst_node`
|
||||||
|
directly.
|
||||||
|
- 1 request → N parallel requests in 1:1 mode.
|
||||||
|
|
||||||
|
#### D-LA6. 1:1 mode detail
|
||||||
|
|
||||||
|
- One logical access → N physical requests (N = `channels_per_pe`).
|
||||||
|
- N = `hbm_pseudo_channels / pes_per_cube`.
|
||||||
|
- Each request: fully-resolved 51-bit PA, targets a specific channel
|
||||||
|
router (`{pe_prefix}.ch_r{channel_id}`).
|
||||||
|
- Per-channel link models BW contention.
|
||||||
|
- PE_DMA injects N sub-transactions concurrently.
|
||||||
|
|
||||||
|
Example: `hbm_pseudo_channels=64`, `pes_per_cube=8` → `channels_per_pe=8`.
|
||||||
|
PE0 owns ch0-7.
|
||||||
|
|
||||||
|
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
  la_base: 0x1_0000_0000, la_size: 4096,
  mode: "one_to_one", channel_count: 8,
  pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
  channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
  channel_size: 512,
}

BAAW resolve result (8 requests):
  → PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  → PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  → ...
  → PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: 8 sub-transactions parallel inject
  per-channel router → hbm_ctrl link (channel_bw_gbs) per channel
  Total effective BW = 8 × channel_bw_gbs
```
|
||||||
|
|
||||||
|
Other N values:
|
||||||
|
|
||||||
|
- `hbm_pseudo_channels=32`, `pes_per_cube=8` → `channels_per_pe=4`,
|
||||||
|
4 requests
|
||||||
|
- `hbm_pseudo_channels=64`, `pes_per_cube=4` → `channels_per_pe=16`,
|
||||||
|
16 requests
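
The channel arithmetic used in these examples is small enough to write down directly. The function names are hypothetical, and the requirement that channels divide evenly across PEs is an assumption of this sketch rather than a rule stated in the ADR.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N = hbm_pseudo_channels / pes_per_cube (assumes an even split)."""
    if hbm_pseudo_channels % pes_per_cube:
        raise ValueError("pseudo channels must divide evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def aggregate_pe_bw_gbs(
    hbm_pseudo_channels: int, pes_per_cube: int, channel_bw_gbs: float
) -> float:
    """Aggregate per-PE bandwidth = N x per-channel bandwidth."""
    return channels_per_pe(hbm_pseudo_channels, pes_per_cube) * channel_bw_gbs
```

For 64 pseudo channels, 8 PEs per cube, and 32 GB/s per channel this gives N = 8 and 256 GB/s per PE, the same total in both mapping modes.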
|
||||||
|
|
||||||
|
#### D-LA7. n:1 mode detail
|
||||||
|
|
||||||
|
- One logical access → one aggregated request.
|
||||||
|
- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
|
||||||
|
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
|
||||||
|
(e.g. 8 × 32 = 256 GB/s).
|
||||||
|
- Single queue / resource for modelling.
|
||||||
|
- No per-channel PA decomposition.
|
||||||
|
|
||||||
|
```text
Tensor A (4 KB) → LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
  la_base: 0x1_0000_0000, la_size: 4096,
  mode: "n_to_one",
  agg_pa_base: PA_agg,
  agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  → PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction
  aggregated router → hbm_ctrl link (256 GB/s)
```
|
||||||
|
|
||||||
|
#### D-LA8. Kernel model preserved
|
||||||
|
|
||||||
|
- Kernel still issues single memory ops (`tl.load`, `tl.store`,
|
||||||
|
`tl.composite`).
|
||||||
|
- LA is the address scheme exposed to kernel code.
|
||||||
|
- Channel decomposition / aggregation happens inside PE_DMA's BAAW.
|
||||||
|
- Kernel code never sees physical channel information.
|
||||||
|
|
||||||
|
#### Consequences (LA model, proposed)
|
||||||
|
|
||||||
|
Positive:
|
||||||
|
|
||||||
|
- 1:1 vs n:1 semantics live in one place (BAAW).
|
||||||
|
- Kernel abstraction preserved — no kernel code changes.
|
||||||
|
- Topology-based policy control (mode switch via yaml).
|
||||||
|
- Improved simulation-model consistency and debuggability.
|
||||||
|
- Segment-based mapping is simpler than page tables; lower overhead.
|
||||||
|
|
||||||
|
Negative:
|
||||||
|
|
||||||
|
- Full VA/MMU code refactor required.
|
||||||
|
- Request-generation path more complex (N requests in 1:1 mode).
|
||||||
|
- Reduced per-channel visibility in n:1 mode.
|
||||||
|
- VA-related tests need rewriting.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Migration Path
|
||||||
|
|
||||||
|
- **PA → VA** was an extension. PA mode is retained as the PageFault
|
||||||
|
fallback inside PE_DMA. Switching does not require removing PA
|
||||||
|
code.
|
||||||
|
- **VA → LA**, if adopted, is a replacement, not coexistence. See
|
||||||
|
D-LA1 for the VA infrastructure removal list. PA fallback inside
|
||||||
|
PE_DMA may be retained orthogonally for tests.
|
||||||
|
|
||||||
|
## Alternatives Considered (LA model)
|
||||||
|
|
||||||
|
1. **Keep VA + fan-out in MMU**: MMU returns per-channel PAs.
|
||||||
|
Rejected: MMU's role would grow beyond translation to request
|
||||||
|
decomposition; aggregation (n:1) becomes awkward to express.
|
||||||
|
2. **Channel-aware kernel API**: kernels call per-channel load/store
|
||||||
|
directly. Rejected: abstraction leakage, portability loss, all
|
||||||
|
benchmarks need rewriting.
|
||||||
|
3. **Always PA (no LA)**: runtime passes per-channel PA to kernel
|
||||||
|
directly. Rejected: incompatible with aggregation; conversion
|
||||||
|
timing unclear; channel info leaks to kernel.
|
||||||
|
|
||||||
|
## Test Requirements
|
||||||
|
|
||||||
|
### VA model (current, regression)
|
||||||
|
|
||||||
|
- Cross-PE / cross-cube DMA paths over installed mappings.
|
||||||
|
- `MmuMapMsg` / `MmuUnmapMsg` fabric traversal with measured latency.
|
||||||
|
- TLB-overhead-per-access timing.
|
||||||
|
- PageFault fallback path preserves PA-only behaviour.
|
||||||
|
|
||||||
|
### LA model (when implemented)
|
||||||
|
|
||||||
|
- 1:1 mode: same logical access → N per-channel requests.
|
||||||
|
- n:1 mode: same logical access → 1 aggregated request.
|
||||||
|
- Bandwidth equivalence between modes for identical workload.
|
||||||
|
- 1:1 mode: per-channel contention modelled correctly.
|
||||||
|
- n:1 mode: aggregated bandwidth correctly reflected.
|
||||||
|
- Kernel code unchanged across mode switch.
|
||||||
|
- BAAW segment install / uninstall correctness.
|
||||||
|
- Multiple tensors in distinct segments do not collide.
|
||||||
|
|
||||||
|
## Implementation Order (LA, when scheduled)
|
||||||
|
|
||||||
|
1. LA type (`policy/address/la_allocator.py`).
|
||||||
|
2. BAAW segment table (`policy/address/baaw.py`).
|
||||||
|
3. `BaawSegmentInstallMsg` (`runtime_api/kernel.py`).
|
||||||
|
4. PE_DMA BAAW integration (`components/builtin/pe_dma.py`
|
||||||
|
`handle_command()`).
|
||||||
|
5. RuntimeContext: LA alloc + segment install
|
||||||
|
(`runtime_api/context.py`).
|
||||||
|
6. `Tensor.va_base` → `Tensor.la_base` (`runtime_api/tensor.py`).
|
||||||
|
7. Remove VA/MMU code.
|
||||||
|
8. Remove `pe_mmu` from `topology.yaml`; add mapping mode settings.
|
||||||
|
9. Test migration:
|
||||||
|
|
||||||
|
| Test file | Action |
|-----------|--------|
| `tests/test_mmu_component.py` | Remove → BAAW segment install tests |
| `tests/test_mmu_fabric.py` | Remove → BAAW + fabric integration tests |
| `tests/test_pe_mmu.py` | Remove |
| `tests/test_va_allocator.py` | Replace with LA allocator tests |
| `tests/test_va_integration.py` | Replace with LA + BAAW integration tests |
| `tests/test_va_offset.py` | Replace with LA offset tests |
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||||
|
- ADR-0008 (tensor deployment)
|
||||||
|
- ADR-0009 (kernel execution)
|
||||||
|
- ADR-0014 (PE-internal execution model)
|
||||||
|
- ADR-0015 (component port/wire model)
|
||||||
|
- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
|
||||||
|
- ADR-0013 (Verification strategy — V1 PA tagging)
|
||||||
|
- SPEC R2 (latency by traversal), R10 (memory addressing)
|
||||||
@@ -0,0 +1,233 @@
|
|||||||
|
# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Phase 0 uses a PA-first memory model (ADR-0011):
|
||||||
|
|
||||||
|
- memory operations use device physical addresses (PA) only,
|
||||||
|
- VA/MMU/IOMMU is not modeled.
|
||||||
|
|
||||||
|
The host-facing runtime API interacts with the device via the IO_CPU endpoint.
|
||||||
|
We define stable, minimal message schemas for Host ↔ IO_CPU so that:
|
||||||
|
|
||||||
|
- benchmarks remain stable,
|
||||||
|
- IO_CPU-internal fan-out/aggregation can evolve independently,
|
||||||
|
- completion and failure propagation is deterministic.
|
||||||
|
|
||||||
|
We also require PE-tagging (Approach A): each shard explicitly carries (sip, cube, pe)
|
||||||
|
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Contract scope
|
||||||
|
|
||||||
|
This schema is the stable contract ONLY for Host ↔ IO_CPU.
|
||||||
|
|
||||||
|
Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
|
||||||
|
and are NOT part of this host contract in Phase 0.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Required message set
|
||||||
|
|
||||||
|
The runtime API MUST use only these message types for Host ↔ IO_CPU:
|
||||||
|
|
||||||
|
- MemoryWrite
|
||||||
|
- MemoryRead
|
||||||
|
- KernelLaunch
|
||||||
|
|
||||||
|
All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
|
||||||
|
with these messages.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Common envelope (mandatory for all requests)
|
||||||
|
|
||||||
|
All Host ↔ IO_CPU requests MUST include:
|
||||||
|
|
||||||
|
- `msg_type: str`
|
||||||
|
- `correlation_id: str`
|
||||||
|
- generated by the host
|
||||||
|
- used to match responses deterministically
|
||||||
|
- `request_id: str`
|
||||||
|
- unique within a correlation_id
|
||||||
|
- `target_device: str`
|
||||||
|
- device identifier (e.g., "sip:0")
|
||||||
|
- `timestamp_tag: str | None` (optional)
|
||||||
|
- debug tag only; MUST NOT affect determinism
|
||||||
|
|
||||||
|
All Host ↔ IO_CPU responses MUST include:
|
||||||
|
|
||||||
|
- `correlation_id: str`
|
||||||
|
- `request_id: str`
|
||||||
|
- `completion: Completion`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Completion schema (mandatory)
|
||||||
|
|
||||||
|
`Completion` MUST have:
|
||||||
|
|
||||||
|
- `ok: bool`
|
||||||
|
- `error_code: str | None`
|
||||||
|
- `error_message: str | None`
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
|
||||||
|
- If `ok == true` then `error_code` and `error_message` MUST be null.
|
||||||
|
- If `ok == false` then `error_code` MUST be non-null.
|
||||||
|
- Completion semantics MUST be deterministic.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D5. MemoryWrite schema (PA-first, PE-tagged)
|
||||||
|
|
||||||
|
`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
|
||||||
|
|
||||||
|
Mandatory fields:
|
||||||
|
|
||||||
|
- common envelope fields (D3)
|
||||||
|
- destination placement tags (Approach A):
|
||||||
|
- `dst_sip: int`
|
||||||
|
- `dst_cube: int`
|
||||||
|
- `dst_pe: int`
|
||||||
|
- `dst_pa: int`
|
||||||
|
- destination physical address in the destination PE's address space
|
||||||
|
- `nbytes: int`
|
||||||
|
- `src_kind: "pattern" | "host_buffer_ref"`
|
||||||
|
- Phase 0 MUST support "pattern"
|
||||||
|
- `pattern: Pattern | None`
|
||||||
|
- required if `src_kind == "pattern"`
|
||||||
|
|
||||||
|
`Pattern` (Phase 0 mandatory support):
|
||||||
|
|
||||||
|
- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
|
||||||
|
- `value: number | None`
|
||||||
|
- required for fill_*; ignored for zero
|
||||||
|
|
||||||
|
Optional fields:
|
||||||
|
|
||||||
|
- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
|
||||||
|
- `debug_label: str | None`
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- This message MUST NOT embed bulk tensor data in Phase 0.
|
||||||
|
- All latency MUST come from explicit graph traversal and modeled components.
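
As an illustration of how the D5-specific fields might be validated, here is a minimal sketch that checks a message given as a plain dictionary. The function name `validate_memory_write` and the error strings are hypothetical; the common envelope fields (D3) and the optional fields are deliberately not checked here.

```python
from typing import Any

PLACEMENT_FIELDS = ("dst_sip", "dst_cube", "dst_pe", "dst_pa", "nbytes")
PATTERN_KINDS = {"zero", "fill_u8", "fill_u16", "fill_u32", "fill_fp16", "fill_fp32"}


def validate_memory_write(msg: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the D5 fields look valid."""
    errors = []
    for name in PLACEMENT_FIELDS:
        value = msg.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"missing or non-integer field: {name}")
    src_kind = msg.get("src_kind")
    if src_kind not in ("pattern", "host_buffer_ref"):
        errors.append("src_kind must be 'pattern' or 'host_buffer_ref'")
    elif src_kind == "pattern":
        pattern = msg.get("pattern")
        if not isinstance(pattern, dict):
            errors.append("pattern is required when src_kind == 'pattern'")
        elif pattern.get("pattern_kind") not in PATTERN_KINDS:
            errors.append("unknown pattern_kind")
        elif pattern["pattern_kind"] != "zero" and pattern.get("value") is None:
            errors.append("value is required for fill_* patterns")
    return errors
```

Returning a list of violations instead of raising on the first one is a design choice made for this sketch; it makes schema tests easier to write against.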
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D6. MemoryRead schema (PA-first, PE-tagged)
|
||||||
|
|
||||||
|
`MemoryRead` represents a host-initiated read from device memory.
|
||||||
|
|
||||||
|
Mandatory fields:
|
||||||
|
|
||||||
|
- common envelope fields (D3)
|
||||||
|
- source placement tags (Option A):
|
||||||
|
- `src_sip: int`
|
||||||
|
- `src_cube: int`
|
||||||
|
- `src_pe: int`
|
||||||
|
- `src_pa: int`
|
||||||
|
- `nbytes: int`
|
||||||
|
|
||||||
|
Optional fields:
|
||||||
|
|
||||||
|
- `dst_kind: "host_sink" | "discard"` (default "host_sink")
|
||||||
|
- `debug_label: str | None`
|
||||||
|
|
||||||
|
Response payload:
|
||||||
|
|
||||||
|
- actual bytes are NOT required in Phase 0 (latency/traces focus)
|
||||||
|
- implementations MAY return lightweight stats or hashes later via a new ADR
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D7. KernelLaunch schema (PA-first, PE-tagged shards)
|
||||||
|
|
||||||
|
`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
|
||||||
|
|
||||||
|
Mandatory fields:
|
||||||
|
|
||||||
|
- common envelope fields (D3)
|
||||||
|
- `kernel_ref: KernelRef`
|
||||||
|
- `args: list[KernelArg]`
|
||||||
|
|
||||||
|
`KernelRef` MUST have:
|
||||||
|
|
||||||
|
- `name: str`
|
||||||
|
- `kind: "deployed" | "builtin"`
|
||||||
|
- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
|
||||||
|
- `deploy_sip: int` — SIP where binary resides
|
||||||
|
- `deploy_cube: int` — cube where binary resides
|
||||||
|
- `deploy_pe: int` — PE where binary resides
|
||||||
|
- `nbytes_code: int` — kernel binary size (for BW modeling)
|
||||||
|
|
||||||
|
Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
|
||||||
|
KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
|
||||||
|
|
||||||
|
`KernelArg` supports tensor args by PA mapping and scalars by value.
|
||||||
|
|
||||||
|
Tensor arg (mandatory):
|
||||||
|
|
||||||
|
- `arg_kind: "tensor"`
|
||||||
|
- `tensor_pa_map: TensorPAMap`
|
||||||
|
|
||||||
|
`TensorPAMap` MUST have:
|
||||||
|
|
||||||
|
- `shards: list[TensorShard]`
|
||||||
|
|
||||||
|
`TensorShard` MUST have (Option A, enforced):
|
||||||
|
|
||||||
|
- `sip: int`
|
||||||
|
- `cube: int`
|
||||||
|
- `pe: int`
|
||||||
|
- `pa: int`
|
||||||
|
- `nbytes: int`
|
||||||
|
- `offset_bytes: int`
|
||||||
|
|
||||||
|
Scalar arg (mandatory):
|
||||||
|
|
||||||
|
- `arg_kind: "scalar"`
|
||||||
|
- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
|
||||||
|
- `value: number | bool`
|
||||||
|
|
||||||
|
Optional KernelLaunch fields:
|
||||||
|
|
||||||
|
- `grid: dict | None`
|
||||||
|
- `meta: dict | None`
|
||||||
|
- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
|
||||||
|
- `debug_label: str | None`
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- KernelLaunch MUST NOT embed bulk tensor data.
|
||||||
|
- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
|
||||||
|
- IO_CPU MUST fan out work internally using the shard (sip, cube, pe) tags.
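
To make the role of the shard tags concrete, the sketch below groups shards by their `(sip, cube, pe)` tag, which is one plausible first step of an internal fan-out. The `TensorShard` fields follow this ADR; the helper `group_shards_by_pe` is hypothetical, and the actual fan-out semantics are defined by ADR-0009, not by this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShard:
    sip: int
    cube: int
    pe: int
    pa: int
    nbytes: int
    offset_bytes: int


def group_shards_by_pe(shards):
    """Map each (sip, cube, pe) tag to the shards placed on that PE."""
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard.sip, shard.cube, shard.pe)].append(shard)
    # Sort the keys so iteration order does not depend on input order.
    return {key: groups[key] for key in sorted(groups)}
```

Sorting the keys is a small step toward the determinism this ADR requires: the same set of shards yields the same iteration order regardless of how the host listed them.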
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Verification Notes
|
||||||
|
|
||||||
|
Tests SHOULD validate:
|
||||||
|
|
||||||
|
- schema validation rejects missing mandatory fields,
|
||||||
|
- deterministic correlation/response matching,
|
||||||
|
- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
|
||||||
|
- all routed requests incur latency > 0.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0011 (Memory Addressing — PA / VA / LA)
|
||||||
|
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||||
|
- ADR-0009 (kernel execution fan-out/aggregation)
|
||||||
|
- ADR-0013 (Verification strategy — V1 message schema validation)
|
||||||
|
- SPEC R2, R7, R8
|
||||||
@@ -0,0 +1,139 @@
|
|||||||
|
# ADR-0013: Verification Strategy and Phase 1 Test Plan
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
KernBench is a system-level simulator whose correctness is defined by:
|
||||||
|
|
||||||
|
- adherence to SPEC-defined invariants,
|
||||||
|
- determinism and debuggability,
|
||||||
|
- explicit modeling of routing and latency.
|
||||||
|
|
||||||
|
Given the evolving implementation, we need a stable verification strategy
|
||||||
|
that prevents architectural drift while allowing incremental development.
|
||||||
|
|
||||||
|
This ADR defines the Phase 1 verification plan and what constitutes
|
||||||
|
"correct behavior" for early implementations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Verification is contract-based
|
||||||
|
|
||||||
|
Verification MUST be derived from:
|
||||||
|
|
||||||
|
- SPEC requirements,
|
||||||
|
- accepted ADRs.
|
||||||
|
|
||||||
|
Tests MUST validate architectural contracts, not incidental implementation details.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Phase 1 verification scope
|
||||||
|
|
||||||
|
Phase 1 verification focuses on:
|
||||||
|
|
||||||
|
- message contract validity (ADR-0012),
|
||||||
|
- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
|
||||||
|
- PA-first memory addressing and shard tagging (ADR-0011),
|
||||||
|
- core latency and trace invariants (SPEC 0.1, R2).
|
||||||
|
|
||||||
|
Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
|
||||||
|
are explicitly out of scope in Phase 1.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Required Phase 1 verification cases
|
||||||
|
|
||||||
|
The following verification cases MUST be supported by the implementation:
|
||||||
|
|
||||||
|
#### V1. Message schema validation
|
||||||
|
|
||||||
|
- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
|
||||||
|
- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
|
||||||
|
- Completion results MUST follow the `ok / error_code / error_message` contract.
|
||||||
|
|
||||||
|
#### V2. IO_CPU fan-out and aggregation
|
||||||
|
|
||||||
|
Given:
|
||||||
|
|
||||||
|
- a topology with one SIP, one CUBE, and two PEs,
|
||||||
|
- a KernelLaunch request containing two tensor shards targeting different PEs,
|
||||||
|
|
||||||
|
The system MUST:
|
||||||
|
|
||||||
|
- submit a single KernelLaunch to IO_CPU,
|
||||||
|
- fan out work internally to both PEs,
|
||||||
|
- aggregate completion and return a single deterministic completion to the host.
|
||||||
|
|
||||||
|
#### V3. Latency and trace invariants
|
||||||
|
|
||||||
|
For any valid request:
|
||||||
|
|
||||||
|
- the hop-by-hop trace MUST be non-empty,
|
||||||
|
- total latency MUST be greater than zero,
|
||||||
|
- repeated runs with identical inputs MUST produce identical traces.
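
A test helper for these three invariants could look like the sketch below. It assumes a hypothetical callable `run` that executes one fixed request and returns a `(trace, total_latency)` pair, and it assumes each trace entry is a `(node_id, latency_ns)` tuple. Neither shape is defined by this ADR.

```python
from typing import Callable, Sequence, Tuple

Trace = Sequence[Tuple[str, float]]  # assumed shape: (node_id, latency_ns) per hop


def check_v3_invariants(run: Callable[[], Tuple[Trace, float]]) -> None:
    """Run the same request twice and assert the three V3 invariants."""
    trace_a, latency_a = run()
    trace_b, _ = run()
    assert len(trace_a) > 0, "hop-by-hop trace must be non-empty"
    assert latency_a > 0, "total latency must be greater than zero"
    assert list(trace_a) == list(trace_b), "identical inputs must give identical traces"
```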
|
||||||
|
|
||||||
|
#### V4. Topology independence and cross-domain coverage
|
||||||
|
|
||||||
|
Verification cases MUST pass for multiple topology shapes, including:
|
||||||
|
|
||||||
|
- minimal: (1 SIP, 1 CUBE, 1 PE)
|
||||||
|
- multi-PE: (1 SIP, 1 CUBE, N PEs)
|
||||||
|
- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
|
||||||
|
- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
|
||||||
|
|
||||||
|
For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
|
||||||
|
|
||||||
|
- explicit connectivity (required links exist),
|
||||||
|
- deterministic routing and control-path traversal,
|
||||||
|
- non-empty traces and latency > 0 for representative cross-domain requests
|
||||||
|
(inter-CUBE and inter-SIP paths).
|
||||||
|
|
||||||
|
Tests MUST NOT hardcode topology sizes, node ids, or link counts.
|
||||||
|
Instead, tests MUST derive expectations from the compiled topology metadata.
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Phase 1 artifacts
|
||||||
|
|
||||||
|
Phase 1 MAY include:
|
||||||
|
|
||||||
|
- verification-only test code,
|
||||||
|
- topology fixtures,
|
||||||
|
- trace inspection utilities.
|
||||||
|
|
||||||
|
Phase 1 MUST NOT require:
|
||||||
|
|
||||||
|
- production code changes solely to satisfy tests,
|
||||||
|
- weakening or removing tests to allow progress.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D5. Phase 2 enforcement
|
||||||
|
|
||||||
|
Phase 2 (Apply) MUST:
|
||||||
|
|
||||||
|
- run the Phase 1 verification cases,
|
||||||
|
- roll back all changes if any verification fails,
|
||||||
|
- preserve tests as authoritative contracts.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Architectural correctness is enforced early.
|
||||||
|
- Tests serve as executable documentation of system behavior.
|
||||||
|
- Implementation remains flexible without losing rigor.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC 0.1, R2, R6
|
||||||
|
- ADR-0011 (Memory Addressing — PA / VA / LA)
|
||||||
|
- ADR-0012 (Host ↔ IO_CPU message schema)
|
||||||
|
- ADR-0009 (Kernel execution semantics)
|
||||||
@@ -0,0 +1,451 @@
|
|||||||
|
# ADR-0014: PE Pipeline Execution Model
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
This ADR defines the PE-internal kernel execution model:
|
||||||
|
|
||||||
|
- Role decomposition of PE-internal components
|
||||||
|
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
|
||||||
|
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
|
||||||
|
- TCM-centric dataflow with a register-file intermediary
|
||||||
|
- Engine resource model
|
||||||
|
- Observability and trace contract
|
||||||
|
- Topology representation
|
||||||
|
|
||||||
|
PE-internal structure (7 components in scope; 2 cross-referenced):
|
||||||
|
|
||||||
|
- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
|
||||||
|
`pe_tcm` — defined here
|
||||||
|
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
|
||||||
|
- `pe_ipcq` — collective communication, defined in ADR-0023
|
||||||
|
|
||||||
|
The goal is a deterministic, trace-friendly execution contract that keeps
|
||||||
|
each block independently swappable.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. PE-internal component roles
|
||||||
|
|
||||||
|
**PE_CPU**
|
||||||
|
|
||||||
|
- Executes kernel instruction stream / control logic.
|
||||||
|
- Generates PE commands and submits them to `PE_SCHEDULER` (via
|
||||||
|
`PeInternalTxn`).
|
||||||
|
- Does NOT enqueue work directly into engine queues.
|
||||||
|
|
||||||
|
**PE_SCHEDULER**
|
||||||
|
|
||||||
|
- Sole dispatcher inside a PE.
|
||||||
|
- Receives commands from `PE_CPU`. Dispatch by command type:
|
||||||
|
- Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
|
||||||
|
→ forward directly to the target engine.
|
||||||
|
- `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
|
||||||
|
via a single `_feed_loop` (D6).
|
||||||
|
- Does not participate in stage-to-stage chaining within a composite;
|
||||||
|
that is handled by token self-routing (D6).
|
||||||
|
|
||||||
|
**PE_DMA**
|
||||||
|
|
||||||
|
- Handles memory transfers between TCM and external memory domains
|
||||||
|
(HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
|
||||||
|
- Two execution channels:
|
||||||
|
- `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
|
||||||
|
- Additional virtual channels:
|
||||||
|
- `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
|
||||||
|
- `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
|
||||||
|
|
||||||
|
**PE_FETCH_STORE**
|
||||||
|
|
||||||
|
- TCM ↔ Register File transfer unit.
|
||||||
|
- Isolates register-file access semantics from compute engines so that
|
||||||
|
GEMM/MATH stay pure compute components.
|
||||||
|
- BW-based latency model; TCM access contention naturally serializes
|
||||||
|
through `PE_TCM`'s BW resource.
|
||||||
|
|
||||||
|
**PE_GEMM**
|
||||||
|
|
||||||
|
- MAC array. Reads operands from the register file; writes results to
|
||||||
|
the register file. Does not touch `PE_TCM` directly.
|
||||||
|
|
||||||
|
**PE_MATH**
|
||||||
|
|
||||||
|
- Element-wise / reduction / SIMD unit. Reads / writes the register file.
|
||||||
|
|
||||||
|
**PE_TCM**
|
||||||
|
|
||||||
|
- Tightly-coupled scratchpad with BW-serialized access. Two logical
|
||||||
|
regions partitioned by ownership (see D5).
|
||||||
|
|
||||||
|
**Cross-referenced components** (defined elsewhere):
|
||||||
|
|
||||||
|
- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
|
||||||
|
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
|
||||||
|
(ADR-0023).
|
||||||
|
|
||||||
|
### D2. Command lifecycle and queues
|
||||||
|
|
||||||
|
`PE_SCHEDULER` maintains three logical structures:
|
||||||
|
|
||||||
|
**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
|
||||||
|
|
||||||
|
**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
|
||||||
|
expanded sub-commands, dependency state, engine assignment, and
|
||||||
|
completion status.
|
||||||
|
|
||||||
|
**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
|
||||||
|
records.
|
||||||
|
|
||||||
|
**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
|
||||||
|
state. Engines report completion via explicit events / messages
|
||||||
|
consumed by the scheduler.
|
||||||
|
|
||||||
|
**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
|
||||||
|
publishes a completion record.
|
||||||
|
|
||||||
|
### D3. Dispatch modes
|
||||||
|
|
||||||
|
#### D3.1 Simple command
|
||||||
|
|
||||||
|
A simple command expands to exactly one engine sub-command:
|
||||||
|
|
||||||
|
- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
|
||||||
|
- `GemmCmd` → `PE_GEMM`
|
||||||
|
- `MathCmd` → `PE_MATH`
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
```text
|
||||||
|
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
|
||||||
|
→ completion → PE_SCHEDULER → CompletionQueue
|
||||||
|
```
|
||||||
|
|
||||||
|
#### D3.2 Composite command (single-op tiled pipeline)
|
||||||
|
|
||||||
|
The default `CompositeCmd` runs a single compute op as a tile-pipelined
|
||||||
|
sequence:
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
|
||||||
|
```
|
||||||
|
|
||||||
|
`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
|
||||||
|
`TileToken` per tile with a monotonically increasing `tile_id`.
|
||||||
|
|
||||||
|
Tile dependency (within one tile `t`):
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
|
||||||
|
```
|
||||||
|
|
||||||
|
Inter-tile overlap is allowed wherever engine resources permit
|
||||||
|
(D4 governs the constraints):
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ(t+1) ∥ COMPUTE(t)
|
||||||
|
DMA_WRITE(t-1) ∥ COMPUTE(t)
|
||||||
|
```
|
||||||
|
|
||||||
|
#### D3.3 Multi-op composite (head + epilogue with scope)
|
||||||
|
|
||||||
|
A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
|
||||||
|
multi-op pipeline:
|
||||||
|
|
||||||
|
```python
@dataclass(frozen=True)
class OpSpec:
    kind: str     # "gemm" | "math.exp" | "math.bias_add" | ...
    scope: Scope  # "per_k_tile" | "per_output_tile" | "once"
    ...
```
|
||||||
|
|
||||||
|
- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
|
||||||
|
M/K/N partition).
|
||||||
|
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
|
||||||
|
often they fire:
|
||||||
|
- `per_k_tile` — every K-reduction step.
|
||||||
|
- `per_output_tile` — once per output tile.
|
||||||
|
- `once` — once per kernel.
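
One way to read the three scopes is as firing counts over the head GEMM's tile grid. The sketch below assumes that the output tile count is `m_tiles * n_tiles` and that `per_k_tile` fires once per K step of every output tile. Both are interpretations made for illustration, and `epilogue_fire_count` is a hypothetical helper, not part of the scheduler.

```python
def epilogue_fire_count(scope: str, m_tiles: int, n_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op fires under the assumed tiling."""
    output_tiles = m_tiles * n_tiles
    if scope == "per_k_tile":
        return output_tiles * k_tiles  # every K-reduction step of every output tile
    if scope == "per_output_tile":
        return output_tiles            # once per output tile
    if scope == "once":
        return 1                       # once per kernel
    raise ValueError(f"unknown scope: {scope}")
```

For a 2 x 3 output grid with 4 K steps, this gives 24, 6, and 1 firings respectively.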
|
||||||
|
|
||||||
|
Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
|
||||||
|
each stage is dispatched via token self-routing (D6), so GEMM and MATH
|
||||||
|
participate serially within the same composite even though they share
|
||||||
|
the compute slot (D4).
|
||||||
|
|
||||||
|
The empty-`ops` form is the legacy single-op path.
|
||||||
|
|
||||||
|
### D4. Engine resource model
|
||||||
|
|
||||||
|
**DMA engine**:
|
||||||
|
|
||||||
|
- `DMA_READ`: `simpy.Resource(capacity=1)`.
|
||||||
|
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
|
||||||
|
- Both channels run concurrently (READ ∥ WRITE allowed).
|
||||||
|
- Within a channel, requests serialize (READ ∥ READ disallowed; same
|
||||||
|
for WRITE).
|
||||||
|
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
|
||||||
|
ADR-0023 D8 — out of scope for this ADR.
|
||||||
|
|
||||||
|
**Compute engine**:
|
||||||
|
|
||||||
|
- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
|
||||||
|
`PE_MATH`.
|
||||||
|
- At most one compute op runs at a time within a PE.
|
||||||
|
- Multi-op composite chains (D3.3) execute their compute stages serially
|
||||||
|
through this slot; token self-routing (D6) ensures the next stage
|
||||||
|
starts only after the previous compute releases the slot.
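
The effect of these capacity-1 resources on tile overlap can be worked through without SimPy. The sketch below is a simplified model under stated assumptions: each tile passes through three stages (DMA_READ, COMPUTE, DMA_WRITE), each stage is a capacity-1 resource, tiles are fed in order, and queue backpressure and the FETCH / STORE stages are ignored. `pipeline_finish_times` is a hypothetical helper for illustration only.

```python
def pipeline_finish_times(num_tiles, stage_ns):
    """Finish time of each tile; stage_ns holds one duration per capacity-1 stage."""
    free_at = [0.0] * len(stage_ns)  # when each stage's resource next becomes free
    finish = []
    for _ in range(num_tiles):
        t = 0.0  # time at which this tile's previous stage finished
        for s, dur in enumerate(stage_ns):
            start = max(t, free_at[s])  # wait for the tile and for the resource
            t = start + dur
            free_at[s] = t
        finish.append(t)
    return finish
```

With durations of 10, 20, and 10 ns, three tiles finish at 40, 60, and 80 ns. Fully serial execution would take 120 ns, so the difference is the overlap of `DMA_READ(t+1)` and `DMA_WRITE(t-1)` with `COMPUTE(t)`, bounded by the single compute slot.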
|
||||||
|
|
||||||
|
**Engine completion**: each engine emits a completion event consumed by
|
||||||
|
the scheduler / `PipelineContext` (D6).
|
||||||
|
|
||||||
|
### D5. Dataflow
|
||||||
|
|
||||||
|
**Input path (HBM source)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||||
|
PE_TCM → PE_FETCH_STORE → Register File
|
||||||
|
Register File → PE_GEMM | PE_MATH
|
||||||
|
```
|
||||||
|
|
||||||
|
**Input path (shared SRAM source)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||||
|
PE_TCM → PE_FETCH_STORE → Register File
|
||||||
|
```
|
||||||
|
|
||||||
|
**Output path (HBM destination)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Register File → PE_FETCH_STORE → PE_TCM
|
||||||
|
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
|
||||||
|
```
|
||||||
|
|
||||||
|
GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
|
||||||
|
single TCM↔register-file gateway. This makes TCM BW contention
|
||||||
|
explicit and lets fetch unit policies (e.g., prefetch) be replaced
|
||||||
|
independently of compute engines.
|
||||||
|
|
||||||
|
#### D5.1 PE_TCM partitioning
|
||||||
|
|
||||||
|
`PE_TCM` is split into two logical regions:
|
||||||
|
|
||||||
|
**SchedulerReservedTCM**
|
||||||
|
|
||||||
|
- Owned exclusively by `PE_SCHEDULER`.
|
||||||
|
- Holds composite-command tile buffers.
|
||||||
|
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
|
||||||
|
COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
|
||||||
|
manages tile-buffer lifetimes.
|
||||||
|
|
||||||
|
**AllocatableTCM**
|
||||||
|
|
||||||
|
- General-purpose region managed by `PEMemAllocator`.
|
||||||
|
- Used for host / DP-visible allocations.
|
||||||
|
|
||||||
|
**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
|
||||||
|
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
|
||||||
|
from allocator-managed ranges by construction.
|
||||||
|
|
||||||
|
**Tile buffer rules**:
|
||||||
|
|
||||||
|
- Input and output buffers within `SchedulerReservedTCM` MUST NOT
|
||||||
|
overlap during a tile's active lifetime.
|
||||||
|
- A tile buffer remains valid until the corresponding `DMA_WRITE`
|
||||||
|
completes.
|
||||||
|
- Buffer reuse is permitted only after the consuming tile's lifetime
|
||||||
|
ends.
|
||||||
|
|
||||||
|
### D6. TileToken self-routing pipeline
|
||||||
|
|
||||||
|
A composite's stage-to-stage progression happens **without** routing
|
||||||
|
through the scheduler. Each component forwards the token directly to
|
||||||
|
the next stage's component using the token's `plan`:
|
||||||
|
|
||||||
|
```text
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
            ↑          chaining: no scheduler hop          ↑
                                          PipelineContext.complete_tile()
```
|
||||||
|
|
||||||
|
This mirrors real-HW done-wire chains. The scheduler handles only
|
||||||
|
**initial dispatch + completion aggregation**.
|
||||||
|
|
||||||
|
#### TilePlan / Stage
|
||||||
|
|
||||||
|
```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
    params: dict    # stage-specific parameters


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]
```
|
||||||
|
|
||||||
|
#### TileToken
|
||||||
|
|
||||||
|
```python
@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int
    params: dict          # cached current stage params
    data_op: bool = True  # op_log opt-in (ADR-0020 D4)
```
|
||||||
|
|
||||||
|
Single-owner invariant: a token is owned by exactly one component at a
|
||||||
|
time. Lifecycle: scheduler creates with `stage_idx=0` → component
|
||||||
|
`_process()` → increment `stage_idx` → put to next stage's `in_port` →
|
||||||
|
last stage calls `pipeline_ctx.complete_tile()`.
|
||||||
|
|
||||||
|
#### PipelineContext (exactly-once completion)
|
||||||
|
|
||||||
|
```python
@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event = None

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```
|
||||||
|
|
||||||
|
Each tile's last stage MUST call `complete_tile()` exactly once.
|
||||||
|
Duplicate calls are bugs (SimPy `Event` can succeed at most once).
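
If duplicate calls need to be caught at the point where they happen, one option is to track completed tile ids explicitly. The following is a sketch of such a variant, not the project's `PipelineContext`: the class name `TrackingPipelineContext` and the `tile_id` argument to `complete_tile()` are assumptions introduced for illustration. The `done_event` can be any object with a `succeed()` method, such as a `simpy.Event`.

```python
class TrackingPipelineContext:
    """Variant that records which tiles completed, to flag duplicates early."""

    def __init__(self, total_tiles, done_event):
        self.total_tiles = total_tiles
        self.done_event = done_event  # any object with succeed(), e.g. simpy.Event
        self._completed = set()

    def complete_tile(self, tile_id):
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice")
        self._completed.add(tile_id)
        # Fire the done event exactly once, when the last distinct tile completes.
        if len(self._completed) == self.total_tiles:
            self.done_event.succeed()
```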
|
||||||
|
|
||||||
|
#### Feed ordering
|
||||||
|
|
||||||
|
`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
|
||||||
|
`_pending_feeds` FIFO. Composite commands are enqueued in submission
|
||||||
|
order; tile feed for a command runs to completion before the next
|
||||||
|
command's feed begins. **Tile-feed interleaving between commands is
|
||||||
|
disallowed.**
|
||||||
|
|
||||||
|
Within a single command's tiles, downstream pipeline overlap arises
|
||||||
|
naturally — earlier tiles progress through later stages while the feeder
|
||||||
|
keeps pushing remaining tiles into the first stage queue (SimPy Store
|
||||||
|
backpressure governs flow control). If the first-stage queue is full,
|
||||||
|
only the feeder blocks; the scheduler worker's inbox processing
|
||||||
|
continues.
|
||||||
|
|
||||||
|
#### Token routing pattern (base class)
|
||||||
|
|
||||||
|
```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()
        yield from self._process(env, token)  # stage-specific logic
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            token.pipeline_ctx.complete_tile()
```
|
||||||
|
|
||||||
|
Each component implements only `_process()`; chaining lives in the
|
||||||
|
base class.
|
||||||
|
|
||||||
|
### D7. Observability and trace contract
|
||||||
|
|
||||||
|
The simulator emits deterministic trace events:
|
||||||
|
|
||||||
|
- `command_submitted`
|
||||||
|
- `sub_command_dispatched`
|
||||||
|
- `engine_start`
|
||||||
|
- `engine_complete`
|
||||||
|
- `tile_ready`
|
||||||
|
- `command_complete`
|
||||||
|
|
||||||
|
For identical inputs, trace ordering MUST be deterministic.
|
||||||
|
|
||||||
|
### D8. Topology representation
|
||||||
|
|
||||||
|
PE-internal components are declared in `cube.pe_template`:
|
||||||
|
|
||||||
|
```yaml
pe_template:
  components:
    pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
    pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
    pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
    pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
    pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
    pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
    pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
    pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } }  # ADR-0011 D-VA
    pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } }  # ADR-0023
  links:
    # Scheduler dispatch edges (initial)
    scheduler_to_dma_mm: 0.0
    scheduler_to_fetch_store_mm: 0.0
    scheduler_to_gemm_mm: 0.0
    scheduler_to_math_mm: 0.0
    # Pipeline chaining edges (token self-routing per D6)
    dma_to_fetch_store_mm: 0.0
    fetch_store_to_gemm_mm: 0.0
    fetch_store_to_math_mm: 0.0
    gemm_to_fetch_store_mm: 0.0
    gemm_to_math_mm: 0.0
    math_to_fetch_store_mm: 0.0
    fetch_store_to_dma_mm: 0.0
    fetch_store_to_tcm_bw_gbs: ...
```
|
||||||
|
|
||||||
|
Template is instantiated once per PE. PE instances are derived from
|
||||||
|
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
|
||||||
|
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- Each block is an independent topology node — individually swappable
|
||||||
|
via DI (ADR-0015).
|
||||||
|
- PE-internal structure is visible in the topology graph.
|
||||||
|
- Components do not know their downstream — plan-based routing gives
|
||||||
|
flexibility (e.g., epilogue chains require no scheduler change).
|
||||||
|
- DMA and compute overlap naturally via SimPy Store backpressure.
|
||||||
|
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
|
||||||
|
without engine-level coupling.
|
||||||
|
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
|
||||||
|
TCM↔RF gateway.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- Intra-PE component count is higher than a coarser model (7 base + 2
|
||||||
|
cross-referenced) — more topology nodes/edges.
|
||||||
|
- Intra-PE token forwarding is explicit in traces (acceptable trade-off for
|
||||||
|
HW fidelity).
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0011 D-VA (PE_MMU component, VA translation)
|
||||||
|
- ADR-0015 D4 (component port/wire model)
|
||||||
|
- ADR-0020 (greenlet kernel execution / two-pass)
|
||||||
|
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
|
||||||
|
- SPEC R3, R4
|
||||||
@@ -0,0 +1,202 @@
|
|||||||
|
# ADR-0015: Component Port/Wire Model and Fabric Routing
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Realistic hardware modeling — queues, contention, fan-out — requires
|
||||||
|
that components own fabric traversal while the simulation engine
|
||||||
|
handles only initialization and completion observation. Direct method
|
||||||
|
calls between components, or path-walking inside the engine, defeat
|
||||||
|
queueing and contention semantics.
|
||||||
|
|
||||||
|
This ADR defines:
|
||||||
|
|
||||||
|
- how components communicate via typed port queues,
|
||||||
|
- how propagation delay is modeled (wire processes with BW occupancy),
|
||||||
|
- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
|
||||||
|
(via M_CPU),
|
||||||
|
- the engine's reduced role (wire init + completion observation only),
|
||||||
|
- M_CPU.DMA as an internal subcomponent of M_CPU.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Component port model
|
||||||
|
|
||||||
|
Each component has typed input/output ports modeled as SimPy Stores:
|
||||||
|
|
||||||
|
```text
|
||||||
|
in_ports: dict[str, simpy.Store] # keyed by source node_id
|
||||||
|
out_ports: dict[str, simpy.Store] # keyed by destination node_id
|
||||||
|
```
|
||||||
|
|
||||||
|
Ports are created at engine initialization based on graph edges.
|
||||||
|
Each directed edge (src → dst) results in:
|
||||||
|
|
||||||
|
- `src.out_ports[dst]` — the sending end
|
||||||
|
- `dst.in_ports[src]` — the receiving end
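
The edge-to-port mapping above can be sketched in a few lines. This example uses `collections.deque` as a stand-in for `simpy.Store` so that it runs without SimPy; `build_ports` and its `make_store` parameter are hypothetical names, and the real engine initialization may differ.

```python
from collections import deque


def build_ports(edges, make_store=deque):
    """Create one sending and one receiving store per directed edge (src, dst)."""
    in_ports, out_ports = {}, {}
    for src, dst in edges:
        out_ports.setdefault(src, {})[dst] = make_store()  # src.out_ports[dst]
        in_ports.setdefault(dst, {})[src] = make_store()   # dst.in_ports[src]
    return in_ports, out_ports
```

The two ends of an edge are separate stores in this sketch, which leaves room for a per-edge process to move items from the sending end to the receiving end.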
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D2. Wire process (propagation delay + BW occupancy)
|
||||||
|
|
||||||
|
For each directed edge (src, dst) in the topology graph, a SimPy wire process
|
||||||
|
models propagation delay and BW occupancy:
|
||||||
|
|
||||||
|
```python
def wire_process(env, out_port, in_port, delay_ns, bw_gbs):
    available_at = 0.0  # time at which this directed link is next free
    while True:
        cmd = yield out_port.get()
        if bw_gbs > 0:
            nbytes = getattr(cmd, "nbytes", 0)
            if nbytes > 0:
                wait = available_at - env.now
                if wait > 0:
                    yield env.timeout(wait)
                # bytes / (GB/s) == ns
                available_at = env.now + (nbytes / bw_gbs)
        yield env.timeout(delay_ns)
        yield in_port.put(cmd)
```
|
||||||
|
|
||||||
|
Wire processes are started at engine initialization.
|
||||||
|
Each directed edge maintains an `available_at` timestamp tracking when the link
|
||||||
|
becomes free for the next transaction. When a transaction occupies a link, the
|
||||||
|
next transaction on the same directed link must wait until occupancy clears
|
||||||
|
(back-to-back serialization). TX and RX directions are independent (separate
|
||||||
|
wire processes with separate `available_at` state).
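
The back-to-back serialization can be traced by hand with a SimPy-free replay of the same timing logic. The sketch below is an illustration only: `wire_delivery_times` is a hypothetical helper, it assumes `arrivals` is sorted by arrival time, and it models a single directed link handled by one sequential wire process.

```python
def wire_delivery_times(arrivals, delay_ns, bw_gbs):
    """Replay wire timing for (arrival_ns, nbytes) pairs on one directed link."""
    now = 0.0
    available_at = 0.0
    delivered = []
    for arrival_ns, nbytes in arrivals:
        now = max(now, arrival_ns)                # the wire picks up the command
        if bw_gbs > 0 and nbytes > 0:
            now = max(now, available_at)          # wait until link occupancy clears
            available_at = now + nbytes / bw_gbs  # bytes / (GB/s) == ns
        now += delay_ns                           # propagation delay
        delivered.append(now)
    return delivered
```

For two 1000-byte commands arriving at time 0 on a 100 GB/s link with a 2 ns delay, each occupies the link for 10 ns. The first is delivered at 2 ns; the second waits for the link to clear at 10 ns and is delivered at 12 ns.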
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D3. Engine role (reduced)
|
||||||
|
|
||||||
|
The simulation engine MUST:
|
||||||
|
|
||||||
|
- wire components at initialization (create port Stores, start wire processes),
|
||||||
|
- identify the entry component for each request type (PCIE_EP),
|
||||||
|
- put the request into the entry component's in_port,
|
||||||
|
- wait for a completion event.
|
||||||
|
|
||||||
|
The simulation engine MUST NOT:
|
||||||
|
|
||||||
|
- walk the topology path during request execution,
|
||||||
|
- call component `run()` methods directly,
|
||||||
|
- track per-hop latency or decompose fan-out.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D4. Fabric paths for Memory R/W and Kernel Launch

Memory R/W and Kernel Launch use **different** fabric paths.
Memory operations bypass M_CPU and route directly to HBM via the crossbar.
Kernel Launch routes through M_CPU for PE fan-out.

**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**

```text
pcie_ep → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out] (zero or more)
  → target cube: ucie_in → router mesh → hbm_ctrl
```

**Memory R/W completion path:**

```text
hbm_ctrl → router mesh → [transit cubes: ucie → router mesh → ucie]
  → io_ucie → io_noc → pcie_ep
```

**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**

```text
pcie_ep → io_noc → io_cpu → io_noc → io_ucie
  → [transit cubes: ucie_in → noc → ucie_out] (zero or more)
  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
```

**Kernel Launch completion path:**

```text
PE[0..n] all complete → M_CPU (aggregation)
  → noc → [transit cubes: ucie → noc → ucie]
  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
```

**Rationale for M_CPU bypass on Memory R/W:**

Memory write/read operations do not require command interpretation or PE
dispatch; they are direct data transfers to/from HBM. Routing through M_CPU
would add unnecessary overhead (5 ns) without functional benefit. The io_noc
inside the IO chiplet handles the routing decision: memory operations go
directly to cube fabric, while kernel launches are forwarded to io_cpu first.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.

M_CPU.DMA does not appear as a node in the compiled topology graph.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:

```text
ucie_in (from upstream) → noc → ucie_out (to downstream)
```

Transit forwarding is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube forward the packet without modification.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### D7. _formula_latency is preserved as a lower-bound cross-check

The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.

Invariant:

- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)

This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.
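
The invariant can be expressed as a small regression check. The sketch below is illustrative only: `check_lower_bound`, its `tol` parameter, and the sample latencies are assumptions, not names from the engine.

```python
def check_lower_bound(formula_ns: float, measured_ns: float,
                      phase: int, tol: float = 1e-9) -> bool:
    """Return True when the D7 invariant holds for one request.

    Phase 0 requires equality; later phases only require that the formula
    latency does not exceed the component-model latency.
    """
    if phase == 0:
        return abs(formula_ns - measured_ns) <= tol
    return formula_ns <= measured_ns + tol


no_contention = check_lower_bound(40.0, 40.0, phase=0)
with_queueing = check_lower_bound(40.0, 55.0, phase=1)
regression = check_lower_bound(40.0, 39.0, phase=1)  # model faster than its lower bound
```

A `False` result in any phase signals that the component model and the path formula have diverged.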
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences

- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Links

- ADR-0007 D2 (engine role boundary)
- ADR-0009 D3 (kernel execution fan-out hierarchy)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
- ADR-0016 (IOChiplet NOC and memory data path)
- ADR-0017 (cube NOC 2D mesh architecture)
- ADR-0033 (Latency model assumptions built on these mechanisms)
|
||||||
@@ -0,0 +1,98 @@
|
|||||||
|
# ADR-0016: IOChiplet NOC and Memory Data Path

## Status

Accepted

## Context

ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
IO chiplet's internal NOC architecture that enables this routing was not
formally documented.

The IO chiplet needs an internal routing fabric (io_noc) to:

- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
- route memory operations (MemoryWrite/Read) directly to cube fabric without
  passing through io_cpu
- route kernel launch commands through io_cpu for command interpretation
|
||||||
|
|
||||||
|
## Decision

### D1. IOChiplet internal NOC (io_noc)

Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:

- `pcie_ep`: host-facing PCIe endpoint
- `io_cpu`: command processor for kernel launch interpretation
- `io_ucie-{PHY}.conn{N}`: per-PHY connection nodes to cube UCIe ports

The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
zero overhead. All routing decisions are made by the simulation engine based
on message type, not by io_noc itself.
|
||||||
|
|
||||||
|
### D2. IOChiplet UCIe decomposition

Each IO chiplet PHY port is decomposed into:

- `io_ucie-{PHY}`: the UCIe protocol endpoint (overhead = 8 ns)
- `io_ucie-{PHY}.conn{N}`: N connection nodes between io_noc and io_ucie

This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows
multiple independent NOC-to-UCIe connections per PHY.
|
||||||
|
|
||||||
|
### D3. Memory R/W path (M_CPU bypass)

Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
through io_noc to the target cube, bypassing io_cpu entirely:

```text
pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → router mesh → hbm_ctrl
```

This avoids the 10 ns io_cpu overhead for pure data transfers. The simulation
engine's `_process_memory_direct()` method uses `find_memory_path()`, which
resolves the shortest path from pcie_ep to the target HBM node.
|
||||||
|
|
||||||
|
### D4. Kernel Launch path (via io_cpu)

Kernel launch commands require io_cpu for command interpretation and PE
fan-out setup:

```text
pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
  → noc → m_cpu → PE
```

The engine's `_entry_points()` method routes KernelLaunchMsg through both
pcie_ep (entry) and io_cpu (command processing).
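
The difference between the D3 and D4 paths inside the IO chiplet can be summarized as a lookup keyed by message type. This is a hedged sketch only: `io_chiplet_hops` and `MEMORY_OPS` are hypothetical names, and the hop labels are copied from the path diagrams rather than from real node IDs.

```python
# Message type names follow the text; the function itself is illustrative.
MEMORY_OPS = {"MemoryWrite", "MemoryRead"}


def io_chiplet_hops(msg_type: str) -> list:
    """Return the IO-chiplet portion of the path for a message type."""
    if msg_type in MEMORY_OPS:
        # D3: memory traffic skips io_cpu entirely.
        return ["pcie_ep", "io_noc", "conn", "io_ucie"]
    if msg_type == "KernelLaunchMsg":
        # D4: kernel launches visit io_cpu, then re-enter io_noc.
        return ["pcie_ep", "io_noc", "io_cpu", "io_noc", "conn", "io_ucie"]
    raise ValueError(f"unsupported message type: {msg_type}")


memory_hops = io_chiplet_hops("MemoryRead")
kernel_hops = io_chiplet_hops("KernelLaunchMsg")
```

Only the kernel-launch path contains `io_cpu`, which is where the saved overhead for memory operations comes from.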
|
||||||
|
|
||||||
|
### D5. IOChiplet-to-cube port mapping

Each IO chiplet instance declares which cube ports it connects to:

```yaml
cube_ports:
  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
```

The topology builder creates edges from io_ucie PHY nodes to the
corresponding cube UCIe port nodes, with the specified distance and
the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
|
||||||
|
|
||||||
|
## Consequences

- IO chiplet has a well-defined internal routing fabric
- Memory operations avoid unnecessary io_cpu overhead
- Kernel launch commands still get proper command interpretation
- The io_noc pattern is consistent with cube-level NOC design
- ADR-0003 D2 is extended (not contradicted) by this ADR

## Links

- ADR-0003 D2 (IO chiplet definition)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0012 D1 (host-to-IO_CPU message schema)
|
||||||
@@ -0,0 +1,291 @@
|
|||||||
|
# ADR-0017: Cube NOC and HBM Connectivity

## Status

Accepted

## Context

The CUBE-level NOC is a 2D router mesh that carries every intra-cube
request: PE-to-HBM data, PE-to-PE traffic, command paths
(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached
to PE routers. This per-PE partitioning makes local-vs-remote HBM
distinguishable by mesh distance: a PE's own HBM partition sits at its
own router (switching overhead only); another PE's HBM partition is
reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

- **n:1 (default, implemented)**: each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW.
- **1:1 (future)**: each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity
granularity differs.
|
||||||
|
|
||||||
|
## Decision

### D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.

- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where the HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Latency = Manhattan distance × `ns_per_mm`.
|
||||||
|
|
||||||
|
### D2. XY routing algorithm

Deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y.
2. Vertical segment: route from source Y to destination Y at destination X.

Each directed segment carries a unique key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.
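
The two-step XY rule and its segment keys can be sketched as follows. The code is an illustration under stated assumptions: `xy_route` and `segment_keys` are hypothetical helpers, positions are `(row, col)` tuples with X as the column and Y as the row, the direction labels `"E"`/`"W"`/`"N"`/`"S"` are made up for the example, and HBM-zone snapping is ignored.

```python
from __future__ import annotations


def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Router names visited from src to dst, horizontal leg first."""
    row, col = src
    hops = [f"r{row}c{col}"]
    step = 1 if dst[1] >= col else -1
    while col != dst[1]:                 # horizontal leg along the source row
        col += step
        hops.append(f"r{row}c{col}")
    step = 1 if dst[0] >= row else -1
    while row != dst[0]:                 # vertical leg along the destination column
        row += step
        hops.append(f"r{row}c{col}")
    return hops


def segment_keys(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple]:
    """Directed segment keys in the ("H"|"V", band, min, max, direction) shape."""
    (sr, sc), (dr, dc) = src, dst
    keys = []
    if sc != dc:
        keys.append(("H", sr, min(sc, dc), max(sc, dc), "E" if dc > sc else "W"))
    if sr != dr:
        keys.append(("V", dc, min(sr, dr), max(sr, dr), "S" if dr > sr else "N"))
    return keys


hops = xy_route((0, 0), (1, 4))
keys = segment_keys((0, 0), (1, 4))
```

Two transactions whose key lists share an entry would traverse the same directed segment, which is the unit a per-segment contention model would serialize on.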
|
||||||
|
|
||||||
|
### D3. Per-segment contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for
the resource, modelling link-level serialization in a wormhole-routed
mesh.

With no contention, NOC traversal latency equals Manhattan distance ×
`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
delay.
|
||||||
|
|
||||||
|
### D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint
(`sip{S}.cube{C}.hbm_ctrl.pe{idx}`), which owns one slice of the cube's
HBM (one pseudo-channel group; see D8).

Other attachments:

- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).
|
||||||
|
|
||||||
|
```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+ r0c0    |  ...  |    r0c5 +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         | (conn x4)
           | r2c0    |       |         |
M_CPU <--->+         |       |         |
           | r3c0    |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+ r4c0    |  ...  |    r4c5 +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)
```
|
||||||
|
|
||||||
|
Per-PE HBM partitioning is the key invariant that makes local vs
cross-PE HBM distinguishable by mesh distance (see D7).
|
||||||
|
|
||||||
|
### D5. NOC edge bandwidths and distances

| Connection                | BW (GB/s) | Distance      | Notes                                      |
| ------------------------- | --------- | ------------- | ------------------------------------------ |
| PE_DMA → NOC              | 256.0     | Physical (PE) | Matches local-HBM aggregate BW             |
| NOC → PE_CPU              | n/a       | 0.0 mm        | Command path only                          |
| Router ↔ hbm_ctrl.pe{idx} | 256.0     | 0.0 mm        | Per PE router; N × per-channel BW (see D8) |
| NOC ↔ M_CPU               | n/a       | 0.0 mm        | Command path                               |
| NOC ↔ SRAM                | 128.0 × 4 | 0.0 mm        | 512 GB/s aggregate                         |
| NOC ↔ UCIe conn           | 128.0     | 0.0 mm        | Per connection; 4 conn per port            |

`0.0 mm` distances reflect the distributed nature of the NOC; actual
traversal distance is computed via Manhattan distance within the router
grid.
|
||||||
|
|
||||||
|
### D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port,
each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

```text
Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
        [UCIe link: 512 GB/s, 1.0 mm seam distance]
Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
crossing incurs 16 ns (TX port + RX port).
|
||||||
|
|
||||||
|
### D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh; there are no separate
fast paths.

**Local HBM** (same PE's own partition; 0 mesh hops):

```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx} (switching overhead only)
```

**Cross-PE HBM within cube** (target PE's partition, reached by mesh):

```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```

Dijkstra computes the shortest path within the mesh.

**Cross-cube HBM** (UCIe traversal):

```text
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
  → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

**Kernel launch command to PE**:

```text
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
```

**Shared SRAM access**:

```text
PE_DMA → r{x}c{y} → (mesh) → SRAM
```
|
||||||
|
|
||||||
|
### D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one     # one_to_one | n_to_one
    hbm_pseudo_channels: 64        # total pseudo-channel count
    hbm_channels_per_pe: 8         # per-PE local channel count
    hbm_channel_bw_gbs: 32.0       # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8         # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```

**n:1 mode (default, implemented).** Each PE's HBM partition is a single
endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe`
pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
interleave; only aggregate per-PE BW is modeled. No separate aggregated
router node exists; the per-PE router itself serves that role.

**1:1 mode (future).** Each PE router decomposes into N channel
mini-routers; per-channel routing carries fully-resolved PA + channel ID.
A `ChannelSplitter` resolves a logical access to N per-channel physical
requests. Each per-channel link models BW contention. Cross-PE channel
access semantics are deferred to the implementation ADR.

**BW math (defaults).**

| Parameter                | Value               |
| ------------------------ | ------------------- |
| pseudo channels per cube | 64 (parameter)      |
| PEs per cube             | 8 (parameter)       |
| channels per PE (N)      | 64 / 8 = 8          |
| per-channel BW           | 32 GB/s (parameter) |
| per-PE local BW          | N × 32 = 256 GB/s   |
| cube total HBM BW        | 64 × 32 = 2048 GB/s |

Both modes give the same per-PE effective BW; only the request shape and
contention model differ.
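
The derived rows of the BW table follow directly from the three parameters. The short calculation below reproduces them; the variable names mirror the YAML keys where one exists, and `pes_per_cube` is an assumed name for the "PEs per cube" parameter.

```python
# Parameters (defaults from the table above).
hbm_pseudo_channels = 64
pes_per_cube = 8              # assumed variable name for "PEs per cube"
hbm_channel_bw_gbs = 32.0

# Derived values.
channels_per_pe = hbm_pseudo_channels // pes_per_cube       # N
per_pe_bw_gbs = channels_per_pe * hbm_channel_bw_gbs        # Router <-> hbm_ctrl.pe link BW
cube_total_bw_gbs = hbm_pseudo_channels * hbm_channel_bw_gbs

# The per-PE partitions together account for the whole cube bandwidth.
partitions_cover_cube = per_pe_bw_gbs * pes_per_cube == cube_total_bw_gbs
```

Changing any one parameter and re-running the calculation is a quick way to confirm that the D5 edge bandwidth for `Router ↔ hbm_ctrl.pe{idx}` still matches the channel configuration.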
|
||||||
|
|
||||||
|
### D9. AddressResolver: per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's
partition:

```python
# policy/routing/router.py
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
```

The pe_id computation is intrinsic to the routing layer (not a
topology-time concern). Any HBM PA falls within exactly one partition,
yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
same resolver path; there is no separate fast path.
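
As a worked instance of the slice arithmetic, the sketch below resolves a few offsets with the default 48 GB / 8 slices configuration. `resolve_hbm_endpoint` is a hypothetical standalone function, and rejecting out-of-range offsets is an assumption added for the example, not documented resolver behavior.

```python
GIB = 1 << 30


def resolve_hbm_endpoint(sip: int, cube: int, hbm_offset: int,
                         hbm_total_gb_per_cube: int = 48,
                         hbm_slices_per_cube: int = 8) -> str:
    """Map an HBM byte offset to the owning per-PE controller endpoint."""
    total_bytes = hbm_total_gb_per_cube * GIB
    if not 0 <= hbm_offset < total_bytes:
        # Assumption for this sketch: offsets outside the cube's HBM are errors.
        raise ValueError(f"hbm_offset {hbm_offset} outside cube HBM")
    hbm_slice_bytes = total_bytes // hbm_slices_per_cube    # 6 GiB per PE by default
    pe_id = hbm_offset // hbm_slice_bytes
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"


first_slice = resolve_hbm_endpoint(0, 0, 0)
second_slice = resolve_hbm_endpoint(0, 0, 6 * GIB)
last_byte = resolve_hbm_endpoint(0, 1, 48 * GIB - 1)
```

Because integer division assigns every in-range offset to exactly one slice, the same offset always resolves to the same endpoint.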
|
||||||
|
|
||||||
|
### D10. Mesh generation parameters

`mesh_gen.py` produces `cube_mesh.yaml` from:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.

Output `mesh_data` dictionary contains:

- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.
|
||||||
|
|
||||||
|
## Consequences

- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
- All cube-internal traffic routes through one mesh: single contention
  model, single layout, single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural: split each PE router into
  N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.

## Links

- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition, extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing; LA model consumes per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc, analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
|
||||||
@@ -0,0 +1,516 @@
|
|||||||
|
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)

## Status

Accepted

## Context

The current simulation models **timing only**.
Calls such as `tl.load()` and `tl.composite(op="gemm")` generate SimPy latency,
but they neither read nor compute on actual tensor data.

### Required capabilities

1. HBM/TCM/SRAM must be able to store and return actual data.
2. PE_GEMM and PE_MATH must perform actual matrix operations so that results can be verified.
3. Simulation performance degradation must be minimized.

### Constraints

- SimPy is a single-threaded event loop: running numpy matmul inside it blocks the whole simulation.
- Components must remain swappable (ADR-0015): framework requirements must not leak into implementations.
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait): the same code must be reused.
- Kernel functions must remain plain Python functions (no conversion to generator/async).

### Design exploration results

| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution inside SimPy | Call numpy for GEMM inside SimPy | Rejected: blocks the single thread |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: result() blocks on back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: control-flow-dependent reads are hard to handle |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Fully separated, no performance impact |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision

### D1. 2-Pass execution model (Phase 0 removed)

The existing three stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into two**.

Before:
```
Phase 0: kernel → PeCommand list (no data, no branching)
Phase 1: SimPy replay of the PeCommand list (timing only)
```

After:
```
Phase 1 (timing): kernel and SimPy run together, greenlet-based
  - Memory read/write: SimPy timing + actual data in MemoryStore
  - Compute (GEMM/Math): SimPy timing + op_log record (actual compute happens in Phase 2)
  - Dynamic control flow is possible (tl.load returns actual data)

Phase 2 (data): actual compute driven by op_log, outside SimPy, parallelizable
```

This ADR **extends Phase 1 to be data-aware for memory operations only**.
Phase 1 covers latency/BW bottleneck analysis plus memory data tracking;
Phase 2 covers correctness verification of GEMM/Math operations.
Phase 2 is optional: if only timing is needed, run Phase 1 alone.
|
||||||
|
|
||||||
|
### D2. Op log recording: ComponentBase hook

Op log recording is performed by **hooks on the component base class**.
Individual component implementations are not modified.

```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```

`_forward_txn()` calls the hooks before and after `run()`.
`_op_logger` is optional; when it is absent there is zero overhead.

**Hook timing definitions**:

| Point | Meaning |
|-------|---------|
| `t_start` | The time at which the component **starts servicing** the msg (just before entering `run()`) |
| `t_end` | The time at which the component's **internal service completes** (just after `run()` returns) |

Link traversal latency is not included in t_start/t_end.
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
|
||||||
|
|
||||||
|
### D3. Greenlet-based kernel execution (Phase 0 removed)

The former Phase 0 (kernel → PeCommand list) is removed, and
**greenlet** is used to interleave the kernel and SimPy cooperatively.

#### How it works

greenlet is a C extension that provides cooperative context switching.
When the kernel (child greenlet) calls `tl.load()` or a similar function, control switches to the
SimPy loop (parent greenlet), which performs the timing simulation and then returns to the kernel with the actual data.

```
SimPy loop (parent greenlet)              Kernel (child greenlet)
─────────────────────────                 ──────────────────────
g.switch() ─────────────────────────────→ kernel starts
                                          a = tl.load(ptr, ...)
                                            internally: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←─────────────────────── (kernel suspended)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ─────────────────────────→ (kernel resumed)
                                          a = data            ← actual numpy array
                                          if a[0][0] > 0.5:   ← branching is possible
                                              ...
```

The kernel remains a **plain Python function**.
greenlet switches exist **only inside the implementations** of `tl.load()`, `tl.store()`, and similar calls.
|
||||||
|
|
||||||
|
#### KernelRunner: framework layer

The greenlet loop lives in the framework-layer **KernelRunner**,
not in the PE_CPU component.

```python
from greenlet import greenlet

# KernelRunner (framework: bridges greenlet and SimPy)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)

        while cmd is not None:
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)                      # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()                          # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                store.write(cmd.dst_addr, cmd.data)       # visibility = issue time
                yield from self._dispatch_dma(env, cmd)   # timing only
                cmd = g.switch()

# PE_CPU (component: kept simple, unaware of greenlet)
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```

**Op logging single source of truth**: KernelRunner never writes to op_log directly.
All op logging is handled **only by the ComponentBase hooks (`_on_process_start`/`_on_process_end`)**.
When KernelRunner delivers a message to a component via `_dispatch_gemm()` or similar,
the component base class hook records it automatically.

**Layer separation**:
- **Kernel code**: plain function, unaware of greenlet
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
- **KernelRunner**: bridges greenlet and SimPy, handles MemoryStore reads/writes. **Does no logging**.
- **ComponentBase hook**: the only path for op_log recording
- **PE_CPU**: only invokes KernelRunner; remains swappable as a component
|
||||||
|
|
||||||
|
#### How memory reads/writes and compute are handled differently

| Operation | In Phase 1 | In Phase 2 |
|-----------|------------|------------|
| `tl.load()` | SimPy timing + MemoryStore read → **returns actual data** | n/a |
| `tl.store()` | SimPy timing + MemoryStore write → **actually written** | n/a |
| `tl.composite(gemm)` | SimPy timing + **op_log record only** | actual numpy compute |
| `tl.dot()` / math ops | SimPy timing + **op_log record only** | actual numpy compute |

Memory reads/writes are handled immediately in Phase 1 (numpy slices, fast).
GEMM/Math operations are executed as a batch in Phase 2 (performance isolation).
|
||||||
|
|
||||||
|
#### Store Visibility Rule

`tl.store()` is **applied to MemoryStore immediately at issue time** (visibility = issue).
The SimPy DMA timing is simulated separately afterwards.

This deliberately separates timing from visibility:
- **visibility**: when the write is reflected in MemoryStore = when `store.write()` is called
- **timing**: when the DMA latency completes in SimPy

With this separation, a load issued right after a store observes the latest data under dynamic control flow.
|
||||||
|
|
||||||
|
#### Result Handle Semantics

`tl.composite()` (sync/async) returns a **handle** that references the result tensor.

Core contract in Phase 1:

1. **Every compute handle is always treated as pending in Phase 1.**
2. `tl.wait(handle)` **expresses timing synchronization only**;
   it does not make the handle ready.
3. Accessing the handle's actual result data (`handle.data`, element access,
   numpy conversion, etc.) is **possible only in Phase 2**.
4. Therefore **control flow based on compute results is not supported** in Phase 1.
5. In contrast, `tl.load()` returns actual data in Phase 1, so
   **control flow based on memory reads is supported**.

| Handle state | Phase | Allowed behavior |
|--------------|-------|------------------|
| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
| pending | Phase 1 | Passing the handle as the target of `tl.store()` (links the logical destination only; the payload is produced in Phase 2) |
| pending | Phase 1 | **No data access**: value-based branching is not possible |
| ready | Phase 2 | Access and verify actual numpy data |

This restriction is intentional. Running compute in Phase 1 would block
SimPy's single thread and defeat the purpose of the 2-pass separation.
|
||||||
|
|
||||||
|
#### Phase 1 Materialization (Future Extension)

If Phase 1 eager execution is later needed for small operations (scalars, small reductions),
a `materialized_in_phase1: bool` flag can be added to the op record to support
selective materialization. This is not implemented in the current scope.
|
||||||
|
|
||||||
|
### D4. `data_op` flag: message self-declaration

Whether a message is logged is determined by the `data_op` attribute of the message instance, not by the message type.
The framework does not hard-code message types.

```python
class MsgBase:
    data_op: bool = False   # default: not logged

class DmaReadCmd(MsgBase):
    data_op = True          # memory movement → logged

class GemmCmd(MsgBase):
    data_op = True          # compute → logged

class MathCmd(MsgBase):
    data_op = True          # compute → logged
```

When a new message type (e.g., IpcqMsg) is added, setting `data_op = True` is enough
for it to be logged automatically, without changes to framework code.
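
The following sketch shows how a hook can stay type-agnostic by checking only the `data_op` attribute. It is illustrative, not the framework's code: `ListLogger`, `CreditMsg`, and the free function `on_process_start` are hypothetical, and the `IpcqMsg` definition is assumed from the example name in the text.

```python
class MsgBase:
    data_op: bool = False          # default: not logged


class DmaReadCmd(MsgBase):
    data_op = True


class IpcqMsg(MsgBase):            # newly added type: opts in with one line
    data_op = True


class CreditMsg(MsgBase):          # hypothetical control message: inherits False
    pass


class ListLogger:
    """Hypothetical logger that just collects start events in a list."""

    def __init__(self):
        self.events = []

    def record_start(self, now, node_id, msg):
        self.events.append((now, node_id, type(msg).__name__))


def on_process_start(op_logger, now, node_id, msg):
    # Same guard as the base-class hook: no isinstance checks on message types.
    if op_logger and getattr(msg, "data_op", False):
        op_logger.record_start(now, node_id, msg)


logger = ListLogger()
for message in (DmaReadCmd(), CreditMsg(), IpcqMsg()):
    on_process_start(logger, 0.0, "pe0", message)
logged_types = [event[2] for event in logger.events]
```

Passing `None` as the logger skips recording entirely, which corresponds to the optional `_op_logger` case.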
|
||||||
|
|
||||||
|
### D5. Op log structure

#### Op classification

Ops are classified at two levels:

| Level | Field | Role |
|-------|-------|------|
| `op_kind` | `memory` \| `gemm` \| `math` | basis for executor dispatch |
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum`, etc. | identifies the concrete operation |

#### OpRecord definition

```python
from dataclasses import dataclass

@dataclass
class OpRecord:
    t_start: float              # SimPy time (ns): service start
    t_end: float                # SimPy time (ns): service complete
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # concrete operation name
    params: dict                # per-operation parameters (see below)
    dependency_ids: list[int]   # currently in-memory record indices; may later be replaced by stable op_ids
```
|
||||||
|
|
||||||
|
#### dependency_ids generation rules

`dependency_ids` is **optional**. By default the executor infers
dependencies from addresses (see D6).

Set it explicitly only when an exact execution order is required:

- **Default (address-based inference)**: the executor analyzes read/write sets and
  infers RAW/WAW/WAR dependencies automatically. This is sufficient in most cases.
- **Explicit setting**: set at the TLContext or command-creation stage when a logical
  dependency cannot be expressed through addresses.
  Example: completion-handle-based synchronization. A handle dependency relies on
  logical completion order rather than a memory address, so address inference does not catch it.
|
||||||
|
|
||||||
|
#### op_log ordering
|
||||||
|
|
||||||
|
op_log maintains a **stable ordering** by `t_start`.
Records with the same `t_start` preserve insertion order.
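
One way to obtain this ordering is a stable sort keyed on `t_start`. The sketch below assumes records may be appended out of time order and uses a hypothetical `Rec` stand-in for `OpRecord`; the real logger may simply append in time order instead.

```python
from dataclasses import dataclass


@dataclass
class Rec:                 # hypothetical stand-in for OpRecord
    t_start: float
    name: str


arrivals = [Rec(20.0, "c"), Rec(10.0, "a"), Rec(10.0, "b")]

# sorted() is guaranteed stable: records with equal t_start keep insertion order.
op_log = sorted(arrivals, key=lambda r: r.t_start)
names = [r.name for r in op_log]
```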
|
||||||
|
|
||||||
|
#### params details

**memory (dma_read / dma_write)**:
```python
{
    "src_addr": int,    # source address (byte)
    "dst_addr": int,    # destination address (byte)
    "nbytes": int,      # transfer size
    "src_space": str,   # "hbm" | "tcm" | "sram"
    "dst_space": str,   # "hbm" | "tcm" | "sram"
}
```
|
||||||
|
|
||||||
|
**gemm**:
|
||||||
|
```python
{
    "src_a_addr": int,     # operand A address
    "src_b_addr": int,     # operand B address
    "dst_addr": int,       # output address
    "shape_a": tuple,      # e.g. (128, 256)
    "shape_b": tuple,      # e.g. (256, 128)
    "shape_out": tuple,    # e.g. (128, 128)
    "dtype_in": str,       # e.g. "f16"
    "dtype_acc": str,      # accumulation dtype, e.g. "f32"
    "dtype_out": str,      # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,       # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,     # "tcm" (GEMM operands are always in TCM)
}
```
|
||||||
|
|
||||||
|
**math**:
|
||||||
|
```python
{
    "op": str,                   # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],    # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,          # reduction axis
    "addr_space": str,           # "tcm"
}
```
|
||||||
|
|
||||||
|
### D6. Phase 2 Executor
|
||||||
|
|
||||||
|
Phase 2 executes op_log outside SimPy.

```python
from itertools import groupby

class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # ordered by t_start (see op_log ordering), as groupby requires
        self.store = initial_store  # receives the Phase 1 MemoryStore snapshot as input

    def run(self):
        # Each group of ops sharing a t_start is one batch of parallel candidates.
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```
|
||||||
|
|
||||||
|
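**Parallel execution decision**:

Ops with the same `t_start` are treated as **parallel candidates**.
Whether they actually run in parallel is decided by the executor using these criteria:
- whether read/write address ranges overlap (WAW, RAW, WAR conflict check)
- whether the predecessor ops listed in `dependency_ids` have completed

Only ops whose address ranges do not overlap and that have no explicit dependencies run in parallel.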
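
The classification can be sketched as follows. This is a conservative illustration, not the executor's actual `_classify`: ops are plain dicts with hypothetical `reads`/`writes` lists of half-open byte ranges, address spaces are ignored, and every op involved in any hazard is sent to the sequential path.

```python
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if half-open byte ranges [start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def conflicts(first: dict, second: dict) -> bool:
    """True if `second` has a RAW, WAW, or WAR hazard against `first`."""
    raw = any(overlaps(w, r) for w in first["writes"] for r in second["reads"])
    waw = any(overlaps(w, v) for w in first["writes"] for v in second["writes"])
    war = any(overlaps(r, w) for r in first["reads"] for w in second["writes"])
    return raw or waw or war


def classify(batch: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a same-t_start batch into (independent, sequential) ops."""
    independent, sequential = [], []
    for i, op in enumerate(batch):
        clash = any(conflicts(other, op) or conflicts(op, other)
                    for j, other in enumerate(batch) if j != i)
        if clash or op.get("dependency_ids"):
            sequential.append(op)
        else:
            independent.append(op)
    return independent, sequential


batch = [
    {"name": "gemm0", "reads": [(0, 64)],    "writes": [(256, 320)]},
    {"name": "gemm1", "reads": [(64, 128)],  "writes": [(320, 384)]},
    {"name": "exp0",  "reads": [(256, 320)], "writes": [(512, 576)]},  # reads gemm0's output
]
independent, sequential = classify(batch)
```

Here `gemm1` touches no range used by the others and stays parallel, while `gemm0` and `exp0` form a RAW pair and keep their log order.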
|
||||||
|
|
||||||
|
**Batch optimization**: only independent ops with the same op_name and **identical
shape, dtype, layout, and transpose flags** are eligible for batching.
Example: same-shape GEMMs from several PEs are folded into a single `np.matmul(a_batch, b_batch)` call.
This improves BLAS efficiency on CPU and reduces launch overhead on GPU.
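
A sketch of the eligibility rule as a grouping key, assuming dict-shaped GEMM ops that carry the `params` fields from D5. The `batch_key`, `group_for_batching`, and `gemm` helpers are hypothetical; the actual stacking and `np.matmul` call are left out.

```python
from collections import defaultdict


def batch_key(op: dict) -> tuple:
    """Ops are batchable only if every field in this key is identical."""
    p = op["params"]
    return (op["op_name"], p["shape_a"], p["shape_b"], p["dtype_in"],
            p["layout_a"], p["layout_b"], p["transpose_a"], p["transpose_b"])


def group_for_batching(independent_ops: list[dict]) -> list[list[dict]]:
    """Group already-independent ops; each group could become one batched call."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for op in independent_ops:
        groups[batch_key(op)].append(op)
    return list(groups.values())


def gemm(pe: str, shape_a: tuple, shape_b: tuple) -> dict:
    """Hypothetical helper that builds a dict-shaped GEMM op for the demo."""
    return {"component_id": pe, "op_name": "gemm_f16", "params": {
        "shape_a": shape_a, "shape_b": shape_b, "dtype_in": "f16",
        "layout_a": "row_major", "layout_b": "row_major",
        "transpose_a": False, "transpose_b": False}}


ops = [gemm("pe0", (128, 256), (256, 128)),
       gemm("pe1", (128, 256), (256, 128)),
       gemm("pe2", (64, 256), (256, 128))]
groups = group_for_batching(ops)
```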
|
||||||
|
|
||||||
|
**Phase 2 execution-order guarantee**:

Phase 2 does not consider data arrival times. It guarantees execution order
only through dependencies (address-based inference plus explicit dependency_ids).
|
||||||
|
|
||||||
|
### D7. Memory Store
|
||||||
|
|
||||||
|
`MemoryStore` logically follows byte-addressable semantics;
the current implementation uses **tensor-granular storage** (an addr → numpy ndarray mapping).
|
||||||
|
|
||||||
|
```python
import numpy as np

class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
|
||||||
|
|
||||||
|
**Internal storage format: numpy ndarray**

MemoryStore stores tensors as **numpy ndarrays**.

| Candidate | store/load speed | Phase 2 compute | Verdict |
|------|----------------|-------------|------|
| **numpy ndarray** | Immediate (passed by reference, no copy) | `np.matmul` usable directly | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch ops available | Used only for GPU optimization |

- write: stores the numpy array **by reference** (no copy), so the Phase 1 overhead is one dict lookup
- read: returns the numpy array **by reference** (no copy)
- a second write to the same addr **overwrites the existing array as a whole tensor** (partial overwrite is not supported)
- dtypes are numpy-native (`np.float16`, `np.float32`, and so on); note that numpy itself has no `np.bfloat16`, so bf16 needs an extension dtype such as the one in `ml_dtypes`
- when byte-level access is needed, convert with `.view(np.uint8)`
- when Phase 2 applies GPU batch optimization, the executor handles the numpy → torch tensor conversion
|
||||||
|
|
||||||
|
**read/write contract**:

- read/write operates on **contiguous tensors**.
  When a non-contiguous strided view is needed, express it as a separate copy op.
- On the normal benchmark path, producer and consumer dtypes are expected to match.
  Reinterpret casts are a permissive behavior intended for low-level memory validation
  or special test cases.
- addr is byte-aligned, with a minimum alignment equal to the dtype size.
- A dtype mismatch (reading with a dtype different from the one written) is handled as a reinterpret cast.
  On a shape mismatch, validation is based on nbytes; if nbytes differs, it is an error.
- Correctness is defined by address-range-based read/write semantics.
- A tensor object cache may be used as an implementation optimization,
  but the canonical state is byte-addressable storage.
- The host injects the initial tensor data at deploy time.
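
The nbytes rule in the contract can be illustrated with plain arithmetic. This is a sketch only: the `ITEMSIZE` table, the dtype names, and `check_read` are assumptions for illustration and are not the MemoryStore API.

```python
from math import prod

# Assumed dtype names and element sizes in bytes (illustrative only).
ITEMSIZE = {"f16": 2, "bf16": 2, "f32": 4, "i32": 4, "u8": 1}


def nbytes(shape: tuple, dtype: str) -> int:
    return prod(shape) * ITEMSIZE[dtype]


def check_read(written_shape: tuple, written_dtype: str,
               read_shape: tuple, read_dtype: str) -> str:
    """Return 'match' or 'reinterpret'; raise ValueError when byte counts differ."""
    w, r = nbytes(written_shape, written_dtype), nbytes(read_shape, read_dtype)
    if w != r:
        raise ValueError(f"nbytes mismatch: wrote {w}, read requests {r}")
    return "match" if written_dtype == read_dtype else "reinterpret"


# (128, 256) f16 is 65536 bytes, the same as (128, 128) f32.
kind = check_read((128, 256), "f16", (128, 128), "f32")
```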
|
||||||
|
|
||||||
|
### D8. Benchmark kernel code

The **user-facing code API of benchmarks does not change**.
Call interfaces such as `tl.load()`, `tl.composite()`, and `tl.store()` are kept.

However, the internal command/message schema may be extended to carry the metadata
that Phase 2 execution needs (for example, additional fields such as dtype_acc and transpose).
|
||||||
|
|
||||||
|
### D9. No component changes

Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
Recording to op_log is the responsibility of the ComponentBase hook.
When a custom component is swapped in, only the timing model is replaced;
Phase 2 data execution is unaffected.
|
||||||
|
|
||||||
|
### D10. Phase 2 is optional

```python
engine = GraphEngine(graph)
engine.run(benchmark)                        # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    executor = DataExecutor(engine.op_log)   # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```

If only timing analysis is needed, skip Phase 2.
With op_logger disabled, Phase 1 performance is also the same as before.
|
||||||
|
|
||||||
|
### D11. Verification Contract
|
||||||
|
|
||||||
|
By default, verification compares the **final output tensor** against a reference backend (numpy).

Tolerance policy per dtype:

| dtype | Comparison | tolerance |
|-------|----------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |

- Default mode: compare only the final output (end-to-end correctness)
- Debug mode: intermediate tensors can also be compared per op
  (MemoryStore snapshot at each op boundary)
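
To make the tolerance table concrete, the sketch below applies the criterion that `np.allclose` documents, `|actual - expected| <= atol + rtol * |expected|`, to plain Python sequences. The `TOLERANCE` dict and `values_match` function are hypothetical names; a real verifier would call numpy directly.

```python
# (rtol, atol) per dtype, copied from the table above.
TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}


def values_match(actual, expected, dtype: str) -> bool:
    """Element-wise check; dtypes without a tolerance entry are compared exactly."""
    actual, expected = list(actual), list(expected)
    if len(actual) != len(expected):
        return False
    if dtype not in TOLERANCE:                 # integer types: exact equality
        return actual == expected
    rtol, atol = TOLERANCE[dtype]
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


ok_f16 = values_match([1.0005], [1.0], "f16")   # error 5e-4 is within 2e-3
ok_f32 = values_match([1.0005], [1.0], "f32")   # same error exceeds 2e-5
```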
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Non-goals
|
||||||
|
|
||||||
|
- **Compute-result-based control flow**: not supported.
  Every compute handle is pending in Phase 1, and
  `wait()` expresses timing synchronization only; it does not imply data readiness.
  In Phase 1, accessing `handle.data`, element access, and truth-value evaluation are
  **treated as errors**.
  Branching on memory data (the result of `tl.load()`) is supported via greenlet.
  Phase 1 materialization is a future extension (see D3).
- **Cycle-accurate overlap reconstruction**: Phase 2 does not exactly reproduce the
  execution-time overlap of Phase 1. Phase 2 verifies data correctness only.
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls and
  do not reproduce the microarchitecture of the actual hardware PE.
|
||||||
|
|
||||||
|
## Open Questions
|
||||||
|
|
||||||
|
- **Aliasing / slice view**: how MemoryStore should represent slices/views that reference
  the same backing storage (stride-based view vs copy semantics)
- **Generalizing IPCQ/descriptor reads**: whether to fully generalize PE-to-PE communication
  as memory ops or to introduce a separate op_kind
- **Op log streaming**: managing op_log memory usage in large simulations
  (in-memory list vs disk-backed streaming)
- **Fused operation**: whether to record tl.composite's tiled pipeline (READ→COMPUTE→WRITE)
  as a single fused op record or to split it into individual ops
- **Generalizing the math op schema**: math params are currently simple, but
  broadcasting rules, per-input dtype, keepdims, scalar/immediate operands,
  and where/mask representation may need to be generalized
- **Op record identifier**: dependency_ids is currently based on in-memory list indices and
  must be replaced by a stable op_id once a streaming/disk-backed mode is introduced
- **Phase 1 materialization policy**: see Future Extension in D3.
  If allowed, the Phase 2 handling of such ops (skip / verify / recompute) must be defined
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive

- Minimal impact on SimPy simulation performance (only an op_log append is added)
- Phase 2 is free to use multithreading/GPU
- Freedom to swap components is preserved (keeps the ADR-0015 design philosophy)
- No change to the benchmark user code API
- Adding a new message type only requires setting the data_op flag
- greenlet removes Phase 0: dynamic control flow based on memory data is supported
- `tl.load()` returns real data, which makes kernel debugging easier

### Negative

- op_log memory usage (in large simulations)
- Phase 2 execution time scales with tensor size (large GEMMs)
- No dynamic branching on pending handles (compute not yet complete)
  (compute runs in Phase 2, so result values are undetermined in Phase 1).
  Branching on memory data is supported via greenlet.
- Adds a dependency on the greenlet C extension (pip install greenlet)
|
||||||
@@ -0,0 +1,90 @@
|
|||||||
|
# ADR-0022: 2D Grid program_id Semantics
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
|
||||||
|
Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
|
||||||
|
The previous implementation ignored the `axis` parameter and always returned a flat PE index,
|
||||||
|
making it impossible for kernels to distinguish their cube-local position from their cube identity.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
Map `tl.program_id` and `tl.num_programs` to the 2D hardware grid:
|
||||||
|
|
||||||
|
| Call | Returns | Description |
|
||||||
|
|------|---------|-------------|
|
||||||
|
| `tl.program_id(axis=0)` | `local_pe_id` | PE index within cube |
|
||||||
|
| `tl.program_id(axis=1)` | `cube_id` | Cube index |
|
||||||
|
| `tl.num_programs(axis=0)` | `num_pes_per_cube` | PEs per cube |
|
||||||
|
| `tl.num_programs(axis=1)` | `num_cubes` | Total cubes |
|
||||||
|
|
||||||
|
Global PID is derived as:
|
||||||
|
|
||||||
|
```python
|
||||||
|
global_pid = tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)
|
||||||
|
```
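
As a worked check of this formula, the sketch below enumerates an assumed 2-cube, 4-PE-per-cube topology (the sizes are illustrative only) and confirms that the mapping is dense and that `divmod` inverts it.

```python
num_cubes, num_pes_per_cube = 2, 4     # assumed topology, for illustration only

seen = []
for cube_id in range(num_cubes):                  # value of tl.program_id(axis=1)
    for local_pe_id in range(num_pes_per_cube):   # value of tl.program_id(axis=0)
        global_pid = cube_id * num_pes_per_cube + local_pe_id
        # divmod recovers (cube_id, local_pe_id) from the flat index
        assert divmod(global_pid, num_pes_per_cube) == (cube_id, local_pe_id)
        seen.append(global_pid)
```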
|
||||||
|
|
||||||
|
### Axis mapping rationale
|
||||||
|
|
||||||
|
- **axis=0 = PE (innermost)**: PEs within a cube share HBM and communicate via local NOC mesh. This is the fast, tightly-coupled dimension — analogous to threads within a block.
|
||||||
|
- **axis=1 = Cube (outer)**: Cross-cube communication goes through UCIe with higher latency. This is the coarser scheduling dimension — analogous to blocks in a grid.
|
||||||
|
|
||||||
|
## Implementation
|
||||||
|
|
||||||
|
### TLContext (`triton_emu/tl_context.py`)
|
||||||
|
|
||||||
|
Added `cube_id` and `num_cubes` constructor parameters. `program_id()` and `num_programs()` dispatch on `axis`:
|
||||||
|
|
||||||
|
```python
def program_id(self, axis: int = 0) -> int:
    if axis == 1:
        return self._cube_id
    return self._pe_id

def num_programs(self, axis: int = 0) -> int:
    if axis == 1:
        return self._num_cubes
    return self._num_programs
```
|
||||||
|
|
||||||
|
### PE_CPU (`components/builtin/pe_cpu.py`)
|
||||||
|
|
||||||
|
- Extracts `num_cubes` from `ctx.spec["system"]["sips"]["cubes_per_sip"]`
|
||||||
|
- Passes `cube_id` (already available as `self._cube_idx`) and `num_cubes` to TLContext
|
||||||
|
|
||||||
|
### KernelRunner (`triton_emu/kernel_runner.py`)
|
||||||
|
|
||||||
|
- Receives `num_cubes` from PE_CPU
|
||||||
|
- Passes `cube_id` and `num_cubes` to TLContext in greenlet mode
|
||||||
|
|
||||||
|
## Backward Compatibility
|
||||||
|
|
||||||
|
- Existing code using `tl.program_id(0)` or `tl.program_id()` is unchanged — returns the same PE index as before.
|
||||||
|
- `cube_id` and `num_cubes` default to `0` and `1`, so callers that don't provide them (e.g. unit tests) continue to work.
|
||||||
|
|
||||||
|
## Usage Example
|
||||||
|
|
||||||
|
```python
def sharded_gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl):
    local_pid = tl.program_id(axis=0)   # PE within cube
    cube_id = tl.program_id(axis=1)     # which cube
    global_pid = cube_id * tl.num_programs(axis=0) + local_pid

    # Column-wise sharding across global PID
    n_per_pid = N // (tl.num_programs(axis=1) * tl.num_programs(axis=0))
    col_start = global_pid * n_per_pid

    a = tl.load(a_ptr, shape=(M, K), dtype="f16")
    b = tl.ref(b_ptr + col_start * K * 2, shape=(K, n_per_pid), dtype="f16")
    h = tl.composite(op="gemm", a=a, b=b, out_ptr=out_ptr + col_start * M * 2)
    tl.wait(h)
```
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Benchmarks can now express cube-aware sharding and addressing without hardcoding topology dimensions.
|
||||||
|
- Future axis=2 (SIP-level) can be added following the same pattern if needed.
|
||||||
File diff suppressed because it is too large
@@ -0,0 +1,206 @@
|
|||||||
|
# ADR-0024: SIP-level Launcher — rank = SIP
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### Goal

Align the participation unit (rank) of `torch.distributed` collective calls with the
**SIP** (device) boundary. The target is bench code that reads **indistinguishably at
the host level** from a real PyTorch DDP/TP script.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | spawns OS processes | greenlet fan-out |
|
||||||
|
|
||||||
|
### Problems to solve

1. **rank = SIP in the public API**: bench workers should not need to know about PEs.
2. **Greenlet-local rank/device tracking**: within the single-process model, each
   worker greenlet must correctly identify its own rank and its own SIP.
3. **Tensor placement = structural (sip, cube, pe)**: if a rank is a SIP,
   the default tensor placement must also be expressed in structural coordinates.

### Non-problems (outside this ADR)

- IPCQ direction addressing → ADR-0025
- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
override is kept as the legacy "rank = PE" test path.
|
||||||
|
|
||||||
|
### D2. Greenlet-local rank registry (+ debug warning)
|
||||||
|
|
||||||
|
```python
import os
import warnings

class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```
|
||||||
|
|
||||||
|
### D3. `torch.ahbm.set_device(rank)`: SIP binding

The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`, but since KernBench is not CUDA, it uses an honestly named
namespace.
|
||||||
|
|
||||||
|
```python
class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```
|
||||||
|
|
||||||
|
**PyTorch 2.x style supported in parallel**: recent PyTorch is moving toward the device-agnostic
`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). KernBench also supports this surface
for users who want to write code that is not tied to a device vendor.
|
||||||
|
|
||||||
|
```python
class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# Inside RuntimeContext.__init__ (excerpt):
#     self.ahbm = _AhbmNamespace()
#     self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
```
|
||||||
|
|
||||||
|
Bench authors pick either of the following; both are backed by the same registry internally:
|
||||||
|
|
||||||
|
```python
|
||||||
|
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
|
||||||
|
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
|
||||||
|
```
|
||||||
|
|
||||||
|
### D4. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and produces the placement in structural coordinates.
See ADR-0026 for details.
|
||||||
|
|
||||||
|
```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()   # (D3 naming)
if current_sip is None:
    current_sip = 0   # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```
|
||||||
|
|
||||||
|
There is no post-hoc `pe_index` shifting: ShardSpec holds the structural `(sip, cube, pe)`
coordinates directly. See ADR-0026 for ShardSpec details.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
  ShardSpec's structural coordinate representation.
- **ADR-0027** (Megatron TP + scheduler): the implementation reference for worker scheduling,
  `mp.spawn`, collective drain, and exception cleanup.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Non-goals
|
||||||
|
|
||||||
|
- **IPCQ protocol changes**: ADR-0023 stays as is.
- **DPPolicy field cleanup**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (across processes)**: single process only.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- **Bench = real PyTorch DDP** (from the public API's point of view).
- **Greenlet-local rank**: cross-rank correctness becomes possible in the single-process model.
- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 / ADR-0032)
  operate consistently on the `(sip, cube, pe)` 3-tuple.
|
||||||
|
|
||||||
|
### Neutral
|
||||||
|
|
||||||
|
- The IPCQ PE-level protocol (ADR-0023) is unchanged.
- The IO_CPU role is unchanged (existing transit behavior as is).
|
||||||
@@ -0,0 +1,283 @@
|
|||||||
|
# ADR-0025: IPCQ Direction Addressing — address-based matching
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### Goal

In the ADR-0023 IPCQ protocol, make **identifying "which direction pair a transfer goes through"**
consistent and **address-based**, without depending on topology or dict order.
It must work correctly in a 2-rank bidirectional ring (and, more generally, in any
topology where several directions point to the same peer).
|
||||||
|
|
||||||
|
### The exposed bug: 2-rank bidirectional ring

`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions point to the same peer.

**Bug 1 (install)**:
- `reverse_direction(0, 1)` returns "E" by dict order (wrong; "W" is correct under the
  opposite-direction convention)
- rank 0's E entry is set to `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)

**Bug 2 (runtime)**:
- Even if install sets the right address, the receiver's `_handle_meta_arrival`
  matches the direction from sender coordinates alone → the first direction (E) wins
- peer_head_cache[E] increases while peer_head_cache[W] stays unchanged
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever → IpcqDeadlock
|
||||||
|
|
||||||
|
### Root cause

The same problem appears on two axes:
1. **Install-time pairing**: deciding "which of the peer's directions pairs with my direction"
   depends on dict iteration order → fragile when several directions point to the same peer
2. **Runtime identification**: deciding "which qp must be updated" uses sender
   coordinates alone → ambiguous when directions are duplicated
|
||||||
|
|
||||||
|
### Approach: address-based matching

Each PE's rx buffer sits in an **address range unique to its direction** (rx_base_pa +
direction_idx × bytes_per_direction). Therefore:

- **Runtime**: match by **dst_addr range** instead of sender coord → unambiguous
- **Install**: a heuristic that prefers the opposite direction (the natural symmetry
  of rings / meshes)
- Duplicate metadata such as `peer_direction` is unnecessary: **the address is the single
  source of truth**

This design works **independently of the PhysAddr transition (ADR-0030)**. Whether addresses are
the current synthetic ones or PhysAddr, it applies equally as long as per-direction ranges are unique.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Install — `reverse_direction` opposite-preference
|
||||||
|
|
||||||
|
`src/kernbench/ccl/install.py`:
|
||||||
|
|
||||||
|
```python
# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
# which were introduced by configure_sfr_intercube_multisip to keep
# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
_OPPOSITE_DIR = {
    "E": "W", "W": "E", "N": "S", "S": "N",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
}

def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
    """Find peer's direction that reciprocates my_dir→peer_rank.

    Prefer the OPPOSITE direction (E↔W, N↔S) when the peer has it
    pointing back to us. This matters in 2-rank bidirectional rings
    where both E and W on one side point to the same peer — without
    the preference, the first-match-wins iteration would route data
    into the wrong rx slot. Falls back to any direction pointing back
    for topologies without an opposite convention (tree_binary's
    parent/child).
    """
    nt = neighbor_table[peer_rank]
    opp = _OPPOSITE_DIR.get(my_dir)
    if opp is not None and nt.get(opp) == my_rank:
        return opp
    for d, target in nt.items():
        if target == my_rank:
            return d
    return None
```
|
||||||
|
|
||||||
|
Call site:

```python
for d, peer_rank in nbrs.items():
    peer_dir = reverse_direction(r, peer_rank, d)  # pass my_dir
    if peer_dir is None:
        continue
    ...
```
|
||||||
|
|
||||||
|
### D2. Runtime: `_handle_meta_arrival` dst_addr matching
|
||||||
|
|
||||||
|
`src/kernbench/components/builtin/pe_ipcq.py`:
|
||||||
|
|
||||||
|
```python
def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
    """Match incoming token to the receiver-side direction by dst_addr range.

    Each direction has a unique rx buffer address range
    [my_rx_base_pa, my_rx_base_pa + n_slots * slot_size). The token's dst_addr (set by
    the sender's IPCQ when computing peer's slot address) falls within
    exactly one such range. This address-based matching is unambiguous
    even when multiple directions have the same peer (2-rank ring).
    """
    token = msg.token
    dst_addr = token.dst_addr
    for d, qp in self._queue_pairs.items():
        base = qp["my_rx_base_pa"]
        size = qp["n_slots"] * qp["slot_size"]
        if base <= dst_addr < base + size:
            qp["peer_head_cache"] = max(qp["peer_head_cache"],
                                        token.sender_seq + 1)
            self._arrived_tokens.setdefault(d, []).append(token)
            waiters = self._recv_waiters.get(d, [])
            self._recv_waiters[d] = []
            for ev in waiters:
                if not ev.triggered:
                    ev.succeed()
            any_waiters = self._any_recv_waiters
            self._any_recv_waiters = []
            for ev in any_waiters:
                if not ev.triggered:
                    ev.succeed()
            return
    # Unknown dst_addr — diagnostic log (should not happen under correct install)
```
|
||||||
|
|
||||||
|
The sender-coordinate check is **removed**; `dst_addr` already determines the direction.
|
||||||
|
|
||||||
|
### D3. Credit: add the `dst_rx_base_pa` field
|
||||||
|
|
||||||
|
`src/kernbench/common/ipcq_types.py`:
|
||||||
|
|
||||||
|
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IpcqCreditMetadata:
    consumer_seq: int
    dst_rx_base_pa: int   # NEW: matched against the original sender's peer.rx_base_pa
    # Existing fields (kept for diagnostics / logging)
    src_sip: int
    src_cube: int
    src_pe: int
    src_direction: str
```
|
||||||
|
|
||||||
|
When a credit is created (`_delayed_credit_send`), the direction's own `my_rx_base_pa` is
carried as `dst_rx_base_pa` (this is the `peer.rx_base_pa` the other side used when it sent).

Receiver side (`_credit_worker`):
|
||||||
|
|
||||||
|
```python
def _credit_worker(self, env):
    while True:
        credit = yield self._credit_inbox.get()
        for d, qp in self._queue_pairs.items():
            # find the qp whose peer rx_base_pa equals the credit's dst_rx_base_pa
            if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
                qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
                                            credit.consumer_seq)
                waiters = self._send_waiters.get(d, [])
                self._send_waiters[d] = []
                for ev in waiters:
                    if not ev.triggered:
                        ev.succeed()
                break
```
|
||||||
|
|
||||||
|
The sender-coordinate check is removed. Matching on `dst_rx_base_pa` is unambiguous.

### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`

The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is **unnecessary**.
Reasons:
- Meta arrival is matched by dst_addr (D2)
- Credit is matched by dst_rx_base_pa (D3)
- The qp does not need to store peer_direction
- Install uses peer_dir only internally when computing rx_base_pa (`reverse_direction`)

No change to the IpcqInitEntry schema. A **simplification** relative to Rev 1.
|
||||||
|
|
||||||
|
### D5. `IpcqDmaToken.src_direction` kept (diagnostic only)

The existing `src_direction` field is not removed. It is kept for:
- Logging / trace: the `(rank, t, dir, nbytes)` in `KERNBENCH_CCL_TRACE=1` output
- Diagnostics: showing the direction in pointer_dump and similar tools
- Room for future extension

Runtime matching uses `dst_addr` only.
|
||||||
|
|
||||||
|
### D6. Invariants (strengthening ADR-0023 I3)

**I3 (strict)**: for each direction pair `(my_direction, peer_direction)`, my
rx_base and the peer's rx_base must point to **distinct direction slots**. Install must
guarantee this (reverse_direction opposite-preference).

**I3.1 (new)**: for every qp, `qp["my_rx_base_pa"]` and `qp["peer"].rx_base_pa`
occupy mutually disjoint address ranges (buffers of different directions never
overlap). This is the precondition for the address-based matching in D2/D3.

This can be verified at install time:
```python
# ccl/install_plan.py: assertion at the end of build_install_plans
all_rx_ranges = set()
for plan in plans:
    for pe_install in plan.pe_installs:
        for entry in pe_install.neighbors:
            r = (entry.my_rx_base_pa,
                 entry.my_rx_base_pa + plan.n_slots * plan.slot_size)
            overlap = any(_ranges_overlap(r, e) for e in all_rx_ranges)
            assert not overlap
            all_rx_ranges.add(r)
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime matching logic
  (D2, D3) and improves the install heuristic (D1). The semantic layer of the IPCQ
  protocol is unchanged.
- **ADR-0024** (launcher): the 2-rank bidirectional ring actually occurs under
  ADR-0024's ws=SIP_count model. This ADR makes that case work.
- **ADR-0030** (PhysAddr transition, stub): **independent**. ADR-0025's address-based
  matching works the same with the current synthetic addresses or with PhysAddr.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Non-goals
|
||||||
|
|
||||||
|
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
|
||||||
|
인코딩되는가와 무관.
|
||||||
|
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
|
||||||
|
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
|
||||||
|
무관.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Open questions
|
||||||
|
|
||||||
|
- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
|
||||||
|
순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
|
||||||
|
전환 가능 (`_qp_by_rx_base`).
|
||||||
|
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
|
||||||
|
필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
|
||||||
|
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
|
||||||
|
대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
|
||||||
|
단순 구현 먼저.
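
If the quadratic I3.1 check ever becomes a bottleneck, sorting the ranges first is a simpler alternative to an interval tree. The sketch below is an illustration only: `find_overlap` is a hypothetical name, and ranges are assumed to be half-open `(start, end)` tuples as in D6.

```python
def find_overlap(ranges):
    """Return the first overlapping pair of half-open (start, end) ranges, or None.

    After sorting by start address, any overlap must involve the current
    range and the already-visited range that reaches furthest, so a single
    pass is enough: O(n log n) instead of the O(n^2) pairwise check.
    """
    widest = None  # visited range with the largest end address so far
    for current in sorted(ranges):
        if widest is not None and current[0] < widest[1]:
            return widest, current
        if widest is None or current[1] > widest[1]:
            widest = current
    return None
```

Returning the offending pair rather than a bare boolean makes the install-time assertion message more useful when the invariant is violated.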
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences

### Positive

- **Simplicity**: the duplicated `peer_direction` metadata is removed. The address is the single source of truth.
- **Unambiguous matching**: works on every topology, including ones with duplicate directions.
- **Minimal schema change**: `IpcqInitEntry` is unchanged; `IpcqCreditMetadata` gains one field.
- **Independent of the PhysAddr transition (ADR-0030)**: address-based matching does not depend on how addresses are encoded.
- **Diagnostics kept**: `IpcqDmaToken.src_direction` remains for logging.

### Negative

- Because runtime matching is now an address comparison, debugging questions such as
  "why was W updated instead of peer_head_cache[E]?" require tracing address ranges
  (previously the direction name was enough). Mitigation: include the
  "direction ↔ rx_base_pa" mapping in pointer_dump.

### Neutral

- The semantic layer of the IPCQ protocol (the sender computes dst_addr, the receiver
  receives) is unchanged.
|
||||||
@@ -0,0 +1,288 @@
|
|||||||
|
# ADR-0026: DPPolicy = Intra-Device Only — Remove the sip/num_sips Fields
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### Goal

Clarify `DPPolicy` as a pure intra-device abstraction that expresses only the
cube × PE distribution **inside a single device (SIP)**. Distribution across SIPs (TP)
is split into a separate layer, handled by ADR-0024's
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron parallel layers.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Remove the `sip` and `num_sips` fields from `DPPolicy`

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class DPPolicy:
    """Intra-device (cube × PE) data-parallel policy.

    SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
    (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
    layers (ADR-0027). DPPolicy does not cross SIP boundaries.
    """
    cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    pe: Literal["replicate", "column_wise", "row_wise"] = "replicate"
    num_pes: int | None = None
    num_cubes: int | None = None
```

Removed fields: `sip`, `num_sips`.
|
||||||
|
|
||||||
|
### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely

Today `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
pes + pe`). This is the form that ADR-0024 D4 called out as "abstraction leakage".

This ADR **redefines ShardSpec in terms of structural coordinates**, and `pe_index` is
**not kept, not even as a property**:

```python
# src/kernbench/policy/placement/dp.py (after)
@dataclass(frozen=True)
class ShardSpec:
    """Structural shard placement: intra-SIP (cube × PE) coord.

    Global-flat `pe_index` was removed in ADR-0026. Callers must use
    structural coords (sip, cube, pe) directly. If a flat integer key is
    needed (e.g. dict lookup), compute it explicitly at the call site.
    """
    sip: int    # structural: which SIP this shard lives on
    cube: int   # local within SIP
    pe: int     # local within cube
    offset_bytes: int
    nbytes: int
```

**Core principles**:
- The identity of a ShardSpec is the 3-tuple `(sip, cube, pe)`.
- **There is no `pe_index` property either**, which blocks silent semantics drift.
- Existing callers that expected a global flat index get an **immediate
  `AttributeError`** on `.pe_index` access, so they must migrate to structural coordinates.
- In local contexts that need a flat integer key (e.g. an internal dict lookup), the caller
  computes `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` explicitly.

**Justification for removing the property**: KernBench is an in-house project with a
limited set of call sites. Compared with the risk of silent drift (the meaning changes
but the type is still int), explicit breakage (AttributeError) is much safer.
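
The effect of these principles on a legacy caller can be sketched as follows. This is a standalone illustration, not project code: `flat_key` is a hypothetical call-site helper, and the cube/PE counts are made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    sip: int
    cube: int
    pe: int
    offset_bytes: int
    nbytes: int


def flat_key(spec: ShardSpec, n_cubes: int, n_pe: int) -> int:
    """Hypothetical call-site helper: explicit global-flat index."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=64)

try:
    spec.pe_index  # removed field: fails loudly instead of drifting silently
    legacy_access_failed = False
except AttributeError:
    legacy_access_failed = True

key = flat_key(spec, n_cubes=4, n_pe=8)  # 1*4*8 + 2*8 + 3
```

Because the flat key is computed at the call site, the topology constants it depends on are visible exactly where the assumption is made.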
|
||||||
|
|
||||||
|
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

Implements the ADR-0024 D4 contract. There is no post-hoc shifting.

```python
# src/kernbench/policy/placement/dp.py (after)

@dataclass(frozen=True)
class _LocalPeShard:
    """Internal: return type of the PE resolvers. Cube-local PE identifier + payload."""
    local_pe: int        # cube-local PE index (0..num_pe-1)
    offset_bytes: int
    nbytes: int


def resolve_dp_policy(
    policy: DPPolicy,
    *,
    shape: tuple[int, int],
    itemsize: int,
    num_pe: int,
    num_cubes: int = 1,
    target_sip: int,     # NEW: states explicitly which SIP to place on
) -> list[ShardSpec]:
    """2-level resolution (cube × PE) on a specified SIP.

    Returns ShardSpecs with structural coords (sip=target_sip, cube, pe).
    No SIP-level split: DPPolicy is intra-device only.
    """
    resolver = _PE_RESOLVERS[policy.pe]
    all_shards: list[ShardSpec] = []

    # Level 1: cube within SIP
    cube_splits = _split_shape(policy.cube, shape, num_cubes, itemsize)

    for cube_id, (cube_shape, cube_offset) in enumerate(cube_splits):
        # Level 2: PE within cube; resolver returns _LocalPeShard (local_pe)
        local_shards = resolver(shape=cube_shape, itemsize=itemsize,
                                num_pe=num_pe)

        for ls in local_shards:
            all_shards.append(ShardSpec(
                sip=target_sip,      # from caller (current_device)
                cube=cube_id,        # local within SIP
                pe=ls.local_pe,      # local within cube (explicit name)
                offset_bytes=cube_offset + ls.offset_bytes,
                nbytes=ls.nbytes,
            ))

    return all_shards
```

The **internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a list of
`_LocalPeShard`. The field name `local_pe` makes it **explicit that this is a
cube-local PE identifier**, which resolves the earlier naming confusion with
`ShardSpec.pe_index`.

**Naming conventions** (whole ADR):
- `ShardSpec.pe`: the final external API, a cube-local PE (structural coord)
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
- `pe_index`: **removed**. It is kept nowhere, externally or internally (a side
  effect of blocking silent drift: the name never reappears).
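
To make the resolver contract concrete, here is a rough sketch of what a `row_wise` PE resolver could look like. Everything in it is an assumption made for illustration: the name `resolve_row_wise`, the row-major contiguous layout, and the policy of giving leftover rows to the lowest-numbered PEs are not taken from the codebase. Only the keyword signature mirrors the `resolver(shape=..., itemsize=..., num_pe=...)` call shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class _LocalPeShard:
    local_pe: int
    offset_bytes: int
    nbytes: int


def resolve_row_wise(*, shape: tuple[int, int], itemsize: int,
                     num_pe: int) -> list[_LocalPeShard]:
    """Hypothetical row_wise resolver: split rows across cube-local PEs.

    Assumes a row-major contiguous tensor, so each PE's rows form one
    contiguous byte range. Leftover rows go to the lowest-numbered PEs.
    """
    rows, cols = shape
    row_bytes = cols * itemsize
    base, extra = divmod(rows, num_pe)
    shards = []
    start_row = 0
    for local_pe in range(num_pe):
        n_rows = base + (1 if local_pe < extra else 0)
        shards.append(_LocalPeShard(
            local_pe=local_pe,
            offset_bytes=start_row * row_bytes,
            nbytes=n_rows * row_bytes,
        ))
        start_row += n_rows
    return shards


shards = resolve_row_wise(shape=(10, 4), itemsize=2, num_pe=4)
```

The resolver only ever emits cube-local identifiers; attaching `sip` and `cube` stays the job of `resolve_dp_policy`.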
|
||||||
|
|
||||||
|
### D4. `_create_tensor`: direct placement with structural coordinates

Continues ADR-0024 D4. The post-hoc shifting is removed, and the structural coordinate is
specified directly at the `resolve_dp_policy` call.

```python
# context.py _create_tensor (after)
current_sip = self.ahbm.current_device()
if current_sip is None:
    # Single-driver fallback (consistent with ADR-0024 D2).
    # If launcher-based code forgets set_device(), tensors silently land on
    # SIP 0, so warn in debug mode.
    if os.environ.get("KERNBENCH_DEBUG"):
        import warnings
        warnings.warn(
            "torch.ahbm.current_device() is None; defaulting to SIP 0. "
            "If this is a multi-rank launcher context, you likely forgot "
            "torch.ahbm.set_device(rank) inside the worker.",
            stacklevel=2,
        )
    current_sip = 0

placement = resolve_dp_policy(
    dp,
    shape=shape_2d,
    itemsize=itemsize,
    num_pe=eff_num_pe,
    num_cubes=eff_num_cubes,
    target_sip=current_sip,  # structural coordinate specified up front
)

# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
# The old post-hoc shifting block is removed entirely.
```

**Every** tensor is placed on the current device's SIP. To create a multi-SIP tensor,
use the TP primitives of ADR-0027.

**Trade-off of the single-driver fallback**: defaulting to SIP 0 when set_device has
not been called is kept for compatibility with existing single-driver tests. With
`KERNBENCH_DEBUG=1`, a warning is emitted so that a set_device call accidentally
omitted in a launcher context, which would silently place tensors on the wrong SIP,
can be detected.
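
The fallback behaviour can be exercised in isolation. The sketch below is a stand-in, not the real `_create_tensor`: `pick_sip` is a hypothetical function that reproduces only the fallback branch, and it shows one way a test might assert that the debug warning fires.

```python
import os
import warnings


def pick_sip(current_device):
    """Hypothetical stand-in for the fallback branch of _create_tensor."""
    if current_device is None:
        if os.environ.get("KERNBENCH_DEBUG"):
            warnings.warn(
                "current_device() is None; defaulting to SIP 0.",
                stacklevel=2,
            )
        return 0
    return current_device


os.environ["KERNBENCH_DEBUG"] = "1"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not deduplicated
    sip = pick_sip(None)
```

Without `KERNBENCH_DEBUG` set, the same call returns 0 silently, which is the compatibility path for single-driver tests.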
|
||||||
|
|
||||||
|
### D5. Downstream: allocator lookup uses a structural tuple key

Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):

```python
for spec in placement:
    alloc = allocators[spec.pe_index]  # AttributeError (property removed)
```

Because `pe_index` is gone, migration to structural coordinates is **forced**:

```python
for spec in placement:
    alloc = allocators[(spec.sip, spec.cube, spec.pe)]
```

The dict population in `_ensure_allocators` also uses the tuple key:

```python
# context.py _ensure_allocators (after)
for sip_id in sip_range:
    for cube_id in range(cubes_per_sip):
        for pe_id in range(pes_per_cube):
            self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
                rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
            )
```

`_free_tensor` follows the same pattern: the existing `flat_idx = sip * ... + cube * ... + pe`
computation block is removed and `(shard.sip, shard.cube, shard.pe)` is used directly.

**Tuple vs dataclass `PEIdentity`**: a tuple is recommended because it is simple and
directly usable as a hashable key. A `PEIdentity` value object has the advantage of an
explicit type, but it adds boilerplate and would currently serve only as the key of the
allocator dict, so it is over-engineering. The tuple stays.
|
||||||
|
|
||||||
|
### D7. Backward compatibility: none (cleanup ADR)

This ADR is a **breaking change**.

1. Calling `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
2. Accessing `ShardSpec.pe_index` → `AttributeError`

Both are **immediate, explicit breakage**. There is no deprecation warning and no fallback
path. KernBench is an in-house project with a limited set of call sites, so the migration is
done in one pass.

**Blocking silent drift** is the main benefit of removing the property entirely: it
eliminates the possibility that code expecting a global flat index receives a SIP-local
result and silently indexes the wrong element.
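
The first breakage mode follows directly from how dataclass constructors work. A minimal standalone check, using a local copy of the D1 field list rather than the real module:

```python
from dataclasses import dataclass
from typing import Literal, Optional

Mode = Literal["replicate", "column_wise", "row_wise"]


@dataclass(frozen=True)
class DPPolicy:
    cube: Mode = "replicate"
    pe: Mode = "replicate"
    num_pes: Optional[int] = None
    num_cubes: Optional[int] = None


try:
    DPPolicy(sip="column_wise")  # removed field: unexpected keyword argument
    rejected = False
except TypeError:
    rejected = True

policy = DPPolicy(cube="row_wise", pe="column_wise")
```

A regression test of this shape is one way to keep the removed fields from being reintroduced by accident.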
|
||||||
|
|
||||||
|
## Dependencies

- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping provide the
  SIP placement mechanism. This ADR builds on that and narrows DPPolicy to pure
  intra-device use.
- **ADR-0027** (Megatron TP): the alternative path when a tensor must span multiple SIPs.
  Once this ADR is applied, multi-SIP use cases move to ADR-0027.

---

## Non-goals

- **Redesigning `DPPolicy.cube` / `pe`**: the existing replicate/column_wise/row_wise
  semantics are kept.
- **Unifying tiling policies**: `tiled_column_major` / `tiled_row_major` stay as they are.
- **A new multi-device tensor abstraction**: a DTensor-like abstraction is ADR-0028.

---

## Open questions

- **Default `current_sip` in `_create_tensor`**: whether a call without set_device should
  fall back to rank=0 (SIP 0) or raise an error. The recommendation is the fallback
  (compatibility with existing single-driver tests).
- **Scope of the `test_sip_parallel.py` rewrite**: moving the existing unit tests to the
  launcher while keeping their intent requires additional fixtures. Scoped as separate work.
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone, the `num_sips`
  concept itself disappears. The explicit answer is to use the TP primitives of ADR-0027
  to express multi-SIP placement.

**Resolved (open in earlier revisions)**:
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **removed entirely** (D2)
- ~~Key format of the `_ensure_allocators` dict~~ → **tuple `(sip, cube, pe)`** (D5)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences

### Positive

- **Clear separation of concepts**: DPPolicy = intra-device, TP = inter-device.
- **Simpler API**: the DPPolicy constructor has ~33% fewer fields.
- **Consistent structural coordinates**: ShardSpec is expressed as the `(sip, cube, pe)`
  tuple, which resolves the abstraction leakage (satisfies the ADR-0024 D4 contract).
- **Unambiguous PE index semantics**: SIP-local is the only interpretation. A global flat
  index, where needed, must be computed explicitly (D2).
- **Consistent launcher model**: ADR-0024's "1 worker per SIP" model is the only mechanism
  that controls SIP boundaries.

### Negative

- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
  `spec.pe_index` → `AttributeError`. All callers must be fixed at once.
- **ShardSpec schema change**: the single `pe_index` field becomes three fields,
  `sip`/`cube`/`pe`. This cascades into downstream code (`deploy_tensor`, `_free_tensor`,
  `_ensure_allocators`, the `allocators` dict key, and so on).
- **No silent drift**: removing the property entirely makes failures immediate at runtime,
  which cuts off migration leakage at the source. (Not a negative as such, but an explicit
  tradeoff.)
- The cost of rewriting `test_sip_parallel.py`.

### Neutral

- The meaning of the existing `cube` / `pe` fields is unchanged.
|
||||||
@@ -0,0 +1,888 @@
|
|||||||
|
# ADR-0027: Megatron-style Tensor Parallelism API
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### Goal

Support tensor parallelism (TP) across SIPs through an API of **explicit parallel layers in
the Megatron-LM style**. A declarative abstraction such as DTensor is future work in a
separate ADR (0028).

Reasons for choosing the Megatron style:
- TP occurs at specific layer boundaries of the model, so explicit primitives fit the mental
  model naturally.
- It is the industry standard established by NVIDIA Megatron / DeepSpeed.
- DTensor is declarative and has a larger design space, so it is approached in stages.

### TP primitive spec (after Megatron-LM)

- **ColumnParallelLinear**: shards the **column (out_features)** axis of the weight across
  TP ranks. The input is fully replicated and the output is column-sharded. When a
  RowParallelLinear follows, no forward all-reduce is needed.
- **RowParallelLinear**: shards the **row (in_features)** axis of the weight across TP ranks.
  The input is already column-sharded (the output of ColumnParallel). An **all-reduce**
  is required at the end of forward.
- **VocabParallelEmbedding**: shards the embedding along the vocab axis, with an all-reduce
  at the end of forward. (A stub in the initial scope; the real implementation first requires
  an all-gather kernel.)
- **`copy_to_tp_region`**, **`reduce_from_tp_region`**, **`scatter_to_tp_region`**,
  **`gather_from_tp_region`**: the basic primitives.
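
The claim that a ColumnParallelLinear followed by a RowParallelLinear needs only one all-reduce can be checked numerically. The sketch below is an illustration in plain Python, not KernBench code. It assumes weights are laid out as `(in_features, out_features)` so that a layer computes `x @ W`, and the element-wise `add` stands in for the all-reduce across ranks.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum, standing in for the all-reduce across ranks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, 2], [3, 4]]                      # replicated input, shape (2, 2)
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]         # shape (2, 4): sharded by column
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]     # shape (4, 2): sharded by row

tp = 2
cols_per_rank = len(w1[0]) // tp

partials = []
for rank in range(tp):
    lo, hi = rank * cols_per_rank, (rank + 1) * cols_per_rank
    w1_shard = [row[lo:hi] for row in w1]   # ColumnParallelLinear weight slice
    w2_shard = w2[lo:hi]                    # RowParallelLinear weight slice
    hidden = matmul(x, w1_shard)            # column-sharded activation, no comm
    partials.append(matmul(hidden, w2_shard))

tp_out = partials[0]
for p in partials[1:]:
    tp_out = add(tp_out, p)                 # the single forward all-reduce

reference = matmul(matmul(x, w1), w2)       # unsharded computation
```

Each rank's column shard of the first weight lines up with the matching row shard of the second, so the intermediate activation never has to be gathered.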
|
||||||
|
|
||||||
|
### Problems to solve

1. **Generalizing worker-wait (D0)**: extend the defer/yield/drain pattern of
   `dist.all_reduce` to every `ctx.wait` path. **This is the largest architectural decision
   in this ADR.**

2. **Normalizing the launcher API (D1)**: the current benches use hand-rolled greenlet loops.
   Absorbing them into `torch.multiprocessing.spawn(fn, args, nprocs)` keeps the
   real-PyTorch API surface and concentrates the D0 scheduler drain in a single
   implementation site.

3. **Expressing per-rank weight distribution**: each worker owns its own slice of the weight
   tensor. This is expressed naturally with ADR-0024's `set_device(rank)` plus ADR-0026's
   intra-device DPPolicy.

4. **Forward-only scope**: KernBench currently has no backward pass (it is a simulation).
   This ADR supports **forward only** first. Training simulation is a separate ADR.

5. **Where collectives are called**: RowParallelLinear calls `all_reduce` at the end of
   forward. This works naturally on ADR-0024's multi-greenlet structure with the D0
   generalization.

6. **The TP group concept**: Megatron combines DP × TP × PP groups. The initial scope
   simplifies this to **TP group = all SIPs**. Mixed DP+TP is future work.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D0. Generalizing worker-wait: `ctx.wait` defers to main when called from a worker context

**Restating the problem**. `kernel_runner.run` captures `greenlet.getcurrent()` at spawn
time as the kernel greenlet's `_parent`
([kernel_runner.py:94](src/kernbench/triton_emu/kernel_runner.py#L94)).
When `env.run` runs in the main context, parent=main and this is safe. When `env.run` runs
in a worker context, parent=worker, and the moment the worker yields or finishes the kernel
greenlet is orphaned → `GreenletExit` → the `ring_default_ws` failure of ADR-0024 Phase B.

**Resolution**. When a worker greenlet calls `ctx.wait(h)`, it **yields to the main
scheduler** instead of driving `env.run` itself. Main drives env.run until the handle
completes, and then control returns to the worker.
|
||||||
|
|
||||||
|
#### D0.1 `RuntimeContext` extension

```python
# context.py
@dataclass
class RuntimeContext:
    ...
    _pending_worker_waits: list[RequestHandle] = field(default_factory=list, init=False)
```

#### D0.2 Worker fork in `ctx.wait`

```python
def wait(self, handle, *, _meta=None):
    # Fast-path: already completed, so skip enqueue + switch (consistent with
    # D0.4-(3) idempotency). Avoids needless worker→main→worker round-trip
    # and prevents redundant _pending_worker_waits growth.
    if handle in self._completed:
        completion, _trace = self.engine.get_completion(handle)
        return completion

    from greenlet import getcurrent
    g = getcurrent()
    if g.parent is not None and not g.parent.dead:
        # Worker greenlet: defer to main. Push handle, yield to parent.
        # Parent (scheduler loop) drains env.run, then switches back.
        self._pending_worker_waits.append(handle)
        g.parent.switch()
        # On resume: handle must have completed (main drained the list).
        # Fall through to the status-quo completion/trace assembly.

    # Main context (or single-driver): drive engine directly.
    wait_fn = getattr(self.engine, "wait", None)
    if wait_fn is not None:
        wait_fn(handle)
    completion, trace = self.engine.get_completion(handle)
    self._completed.add(handle)
    if _meta is not None and trace is not None:
        entry = dict(trace) if isinstance(trace, dict) else {"raw": trace}
        entry.update(_meta)
        self._traces.append(entry)
    return completion
```
|
||||||
|
|
||||||
|
#### D0.3 Semantic contract of `ctx.wait` in a worker context (normative)

This ADR **explicitly changes** the semantics of `ctx.wait` in a worker context.

- **Submit-vs-complete separation**: when called from a worker, `ctx.wait(h)` does not
  guarantee immediate completion; it guarantees completion **after the next scheduler
  drain**. The point at which the worker returns from `wait()` is the point at which main
  has finished `engine.wait` for that handle. Calls from the main context remain
  immediately synchronous (status quo).
- **Resume invariant (normative)**: in a worker-deferred `ctx.wait(h)`, at the moment
  `g.parent.switch()` returns and the worker resumes, **`h in ctx._completed` must be
  True**. If this invariant is broken, the worker proceeds with later steps in a stale
  state, so any change to `_drain_pending`, the scheduler loop, or `ctx.wait` must
  preserve it. T3.b asserts this invariant directly.
- **Observable change**: inside a worker, the pattern `h = ctx.submit(msg); ctx.wait(h);
  read(handle_result)` still holds. The semantic specification additionally states that a
  main-drain is automatically inserted between `wait()` and `read`.
- **Direct reads of host objects, see D0.5**: the contract for calling `tensor.numpy()`
  without `ctx.wait` is defined separately in D0.5.
|
||||||
|
|
||||||
|
#### D0.4 Main scheduler drain: rules (normative)

(Internal implementation of `multiprocessing.spawn` in D1. What follows is the semantic
definition.)

```python
while alive:
    for g in alive:           # (1) round-based worker switch
        g.switch()
    _drain_pending(ctx)       # (2) drain in main context
```

(See D0.5 for the actual definition of `_drain_pending`: an outer while-loop drains until
both queues are empty.)
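
The round structure can be tried out without greenlets. The sketch below swaps greenlets for plain Python generators, which is a different mechanism chosen only so that the example runs on the standard library. `run_rounds`, `complete`, and `worker` are hypothetical names; a `yield` stands in for a worker-side `ctx.wait(handle)`.

```python
from collections import deque


def run_rounds(workers, complete):
    """Round-based cooperative scheduler over generator 'workers'.

    Each worker yields the handle it is waiting on. After every worker has
    had one turn, the main loop drains all pending handles in FIFO order.
    """
    pending = deque()
    done = set()
    alive = list(workers)
    while alive:
        still_alive = []
        for w in alive:                # (1) one switch per worker per round
            try:
                pending.append(next(w))
                still_alive.append(w)
            except StopIteration:
                pass                   # worker finished during its turn
        while pending:                 # (2) drain in the main context
            h = pending.popleft()
            if h not in done:
                complete(h)
                done.add(h)
        alive = still_alive
    return done


log = []


def worker(rank, steps):
    for step in range(steps):
        yield (rank, step)             # stands in for ctx.wait(handle)
        log.append(("resumed", rank, step))


completed = run_rounds(
    [worker(0, 2), worker(1, 1)],
    complete=lambda h: log.append(("done", h)),
)
```

In the recorded log every `done` entry for a handle precedes the matching `resumed` entry, which is the generator analogue of the resume invariant in D0.3.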
|
||||||
|
|
||||||
|
**Rules**:

1. **Round-based cooperative scheduling and the obligation to yield (worker contract)**.
   `g.switch()` does not return until the worker **yields voluntarily** (cooperative
   greenlet semantics). Therefore:
   - If a worker runs a pure-compute loop such as `while True: do_compute()` without
     yielding, `g.switch()` never returns and **the scheduler loop itself is hard-blocked**
     (other workers get no chance to be switched to, and no drain happens). This is not
     starvation but **scheduler non-progress (equivalent to deadlock)**, and this ADR
     defines it as **unsupported**.
   - A worker **must** call one of `ctx.wait(h)`, `dist.all_reduce`, or a host-read barrier
     (D0.5) within a finite number of steps. A TP layer's `forward` contains a launch→wait
     pair at the end of every layer, so it satisfies this condition naturally. CCL kernels
     also yield inside `dist.all_reduce`.
   - The implementation does not need to **detect** violations (timeouts,
     steps-since-yield counters, and so on). This is a user contract, and the symptom of a
     violation is a "simulation hang".
   - **Future extension**: if long non-collective compute paths become common, an explicit
     `torch.distributed.cooperative_yield()` primitive (a no-op yield) can be introduced.
     That is outside the scope of this ADR and is not a breaking change; it can be added
     when needed.
   - Within a round, every alive worker receives one `switch`. Even if one worker calls wait
     several times within a single round, the handles are enqueued sequentially within that
     turn and then processed together in one scheduler drain (FIFO).

2. **Drain order = submission order (FIFO)**. `_pending_worker_waits` is a strict FIFO via
   list append/pop(0). Handles are drained in submission order, not completion order.
   Because the SimPy scheduler itself guarantees a causally correct completion order,
   draining in submission order is safe. Do not confuse `completion order` with
   `drain order`.

   **Two-queue ordering (worker waits → collectives)**: `_drain_pending` drains the worker
   wait queue first and the collective queue second. The rationale for this order:
   - **The two queues have different dependency sources**: worker waits are handles that a
     worker created itself with a `submit + wait` pair (tensor deploy, MmuMap, and so on).
     The collective queue holds kernel launch handles that `dist.all_reduce` enqueued
     internally, and the worker does not wait on them directly (see the two-queue drain
     model in D0.5).
   - **Independent from a correctness standpoint**: from the worker's point of view, a
     collective is in the "already submitted, then yielded" state. Its completion only has
     to happen before the worker's next action. There is no ordering dependency on the
     worker wait queue.
   - **Both complete within a single drain barrier**: under the loop-until-empty rule of
     D0.5, one barrier invocation drains worker → collective → (repeat if anything new
     appeared). When the worker resumes, both queues are drained.
   - **The alternative (collectives first) is also possible**: this ADR fixes worker-first
     only for the simplicity of the current implementation; the two are semantically
     equivalent. Revisit if a difference in the performance profile is observed.

3. **Duplicate enqueue: correctness comes from idempotent drain; dedup is not guaranteed**.
   `ctx.wait(h)` returns immediately if `h in ctx._completed`. `_drain_pending` has the
   same guard. Even if the same handle is appended to `_pending_worker_waits` several
   times, the actual `engine.wait` is called only once (idempotent).
   - **Correctness**: relies on idempotent drain → safe.
   - **Memory/performance**: this ADR **does not guarantee dedup** of
     `_pending_worker_waits`. If the same handle is enqueued N times, the queue holds N
     elements and the drain performs N pops plus in-set guards. Unless a single worker
     waits on the same handle repeatedly (an abnormal pattern), N is between 1 and a few.
   - **Implementation freedom**: an implementation may optionally dedup (e.g. keep a `set`
     as a side index, or check `h not in pending_set` before append). This is classified
     as an optimization that does not change correctness.
|
||||||
|
|
||||||
|
4. **Exception propagation + sibling cleanup**.
   When a worker greenlet raises, `g.switch()` propagates the exception to main. The
   scheduler loop stops immediately and performs the following cleanup **explicitly**:

   ```python
   try:
       while True:
           alive = [g for g in gs if not g.dead]
           if not alive:
               break
           for g in alive:
               if not g.dead:
                   g.switch()
           _drain_pending(ctx)
   except Exception as outer:
       # (a) Force-terminate the surviving sibling worker greenlets.
       for other in gs:
           if not other.dead:
               try:
                   other.throw(SystemExit)
               except Exception:
                   pass  # silent: we are already handling an exception
       # (b) Reset backend barrier / pending state (prepares for a future epoch barrier).
       backend = getattr(ctx.distributed, "_backend", None)
       if backend is not None and hasattr(backend, "_barrier"):
           backend._barrier.reset()
       backend_pending = getattr(backend, "_pending_collective_handles", None)
       if backend_pending is not None:
           backend_pending.clear()
       ctx._pending_worker_waits.clear()
       # (c) Wrap the root-cause exception in SpawnException.
       # `errors` maps rank -> original exception (see the D1.2 entry wrapper).
       raise SpawnException(errors) from outer
   ```

   Rules:
   - **Sibling abort guarantee**: if one worker raises, `SystemExit` is thrown into every
     sibling greenlet, and the greenlets terminate immediately. No greenlet leaks.
   - **Explicit clear of pending queues**: both the worker-wait and collective-pending
     queues are emptied, which prevents contamination on reuse.
   - **Wrapping in `SpawnException(errors)`**: `errors: dict[int, Exception]` holds the
     original exception of each rank. This is compatible with the failure pattern of
     real-PyTorch `torch.multiprocessing.spawn`.
   - **Scope limit**: `errors` contains **only the ranks that raised from their own code
     (root cause)**. Ranks terminated by `throw(SystemExit)` during sibling cleanup do not
     appear in `errors` (SystemExit is not caught by the `try/except Exception` of the D1.2
     entry wrapper; this is by design, since sibling termination is a cleanup signal, not a
     failure). This is stated explicitly so that readers do not expect every failed rank to
     be included.
   - **`ctx._traces` is a partial state up to the point of the exception**. Trace
     completeness is not guaranteed (some launch/all_reduce calls may terminate without
     leaving an entry).
   - **Allocator / MemoryStore** keep their pre-exception state. Reuse is a non-goal;
     creating a new `RuntimeContext` is recommended.
   - **`join=False` / retry / partial recovery** are non-goals of this ADR.

   `SpawnException` is defined in `runtime_api/multiprocessing.py`:

   ```python
   class SpawnException(RuntimeError):
       def __init__(self, errors: dict[int, Exception]):
           self.errors = errors
           first = next(iter(errors.items()), None)
           msg = (f"spawn failed on ranks {sorted(errors.keys())}"
                  + (f": rank {first[0]} raised {first[1]!r}" if first else ""))
           super().__init__(msg)
   ```

5. **Single-driver compatibility**. In a main-only run where `g.parent is None` (legacy
   single-driver tests), the worker-fork condition of D0.2 is false, so the existing
   immediately-synchronous path is kept. `_drain_pending` is not called.
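
Rule 3 leaves dedup of the pending queue to the implementation. One possible shape is sketched below, under the assumption that handles are hashable. `PendingWaits` is a hypothetical class, not something defined by this ADR, and it uses `collections.deque` in place of the list append/pop(0) described in rule 2 while preserving the same FIFO order.

```python
from collections import deque


class PendingWaits:
    """FIFO queue of handles with a side index for optional dedup."""

    def __init__(self):
        self._queue = deque()
        self._queued = set()   # side index: handles currently in the queue

    def push(self, handle):
        if handle in self._queued:
            return False       # duplicate: skipped, correctness is unaffected
        self._queue.append(handle)
        self._queued.add(handle)
        return True

    def pop(self):
        handle = self._queue.popleft()   # O(1), unlike list.pop(0)
        self._queued.discard(handle)
        return handle

    def __len__(self):
        return len(self._queue)


q = PendingWaits()
accepted = [q.push(h) for h in ("h1", "h2", "h1", "h3")]
drained = [q.pop() for _ in range(len(q))]
```

Since the drain is idempotent either way, this only bounds queue growth; it does not change which handles end up completed.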
|
||||||
|
|
||||||
|
#### D0.5 Host-read barrier: decision (normative)

Inside a worker, **host-observable reads** such as `tensor.numpy()`,
`tensor.__getitem__`, and `tensor.data` are defined as an **automatic drain barrier**.
Immediately before the call:

1. If `ctx._pending_worker_waits` or `backend._pending_collective_handles` is non-empty,
   yield to main with `g.parent.switch()` → main runs `_drain_pending` → the worker
   resumes after completion.
2. If both queues are empty, read immediately.

**Barrier repetition rule (normative, re-entrance)**: `_drain_pending` drains in a
while-loop **until both queues are completely empty**. It is not a single pass:

```python
def _drain_pending(ctx):
    while ctx._pending_worker_waits or (
        ctx.distributed._backend
        and ctx.distributed._backend._pending_collective_handles
    ):
        while ctx._pending_worker_waits:
            h = ctx._pending_worker_waits.pop(0)
            if h not in ctx._completed:
                ctx.engine.wait(h)
        backend = ctx.distributed._backend
        if backend is not None:
            while backend._pending_collective_handles:
                h, _sip_id, meta = backend._pending_collective_handles.pop(0)
                # Main context: safe, because ctx.wait does not push to
                # pending again here.
                ctx.wait(h, _meta=meta)
```

**Main-context ctx.wait non-recursion invariant (normative)**: the
`ctx.wait(h, _meta=meta)` call inside `_drain_pending` runs in the main greenlet context.
The worker-fork condition of D0.2 (`g.parent is not None and not g.parent.dead`) is False,
so it takes the immediately-synchronous path → **it never enqueues into
`_pending_worker_waits`**. Thanks to this invariant the drain loop finishes without
recursion or queue regrowth. In the implementation it is important to keep
`g.parent is None` as the guarantee of a single main greenlet.

**Why a loop**: `ctx.wait(h, _meta=meta)` is called in the main context, so by the D0.2
path it **drives the engine directly** (no additional enqueue, per the invariant above).
In theory a single pass is therefore enough, but the rule is fixed as
**loop-until-empty**. Reasons:

1. **Safety under future extension**: an implementation in which new pending handles are
   enqueued during the drain (e.g. a tree-reduce whose collective has sub-handles) may
   appear later. With the loop rule, correctness holds in that case too.
2. **Readability**: the meaning closes in a single sentence, "the barrier drains until
   pending is empty". It does not depend on the non-trivial invariant that a `ctx.wait`
   call performs no new enqueue.
3. **The semantics of the barrier are "every dependency needed by this read is
   complete"**: in the current model all pending handles are exactly all dependencies, so
   the two are identical. The user's mental model is the former.

**Termination guarantee**: described separately for two regimes.

- **Current implementation**: when called from the main context, `ctx.wait` drives the
  engine directly (D0.2) and enqueues no new pending handles. Every iteration strictly
  reduces the size of pending through `pop(0)` + `engine.wait`. The number of iterations
  is **bounded by the initial pending size itself** → finite termination.
- **Future extension (the bound that justifies the loop rule)**: if an implementation that
  enqueues new pending handles during the drain (e.g. tree-reduce sub-handles) is
  introduced, the initial-size bound no longer holds. However, SimPy causality guarantees
  that the dependencies of a handle form a finite DAG, so **the nesting depth is finite**.
  The loop rule accommodates this case automatically.

Both regimes guarantee that an infinite loop is impossible. The single-pass bound of the
current implementation is only a reference value for aggressive optimization; the rule is
fixed as loop-until-empty.
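
The difference between a single pass and loop-until-empty only shows up when draining one handle enqueues another. The sketch below simulates that future-extension case with plain lists. `drain_until_empty` and the `wait` callback are hypothetical stand-ins, and the tree-reduce sub-handle is invented for the demonstration.

```python
def drain_until_empty(worker_waits, collectives, wait):
    """Drain both FIFO lists until neither has entries left.

    `wait(handle)` may append follow-up handles to either list; the outer
    loop picks those up, which a single pass would miss.
    """
    drained = []
    while worker_waits or collectives:
        while worker_waits:
            h = worker_waits.pop(0)
            wait(h)
            drained.append(h)
        while collectives:
            h = collectives.pop(0)
            wait(h)
            drained.append(h)
    return drained


worker_waits = ["deploy"]
collectives = ["all_reduce"]


def wait(handle):
    # Hypothetical tree-reduce: the collective spawns one sub-handle on the
    # worker-wait queue while it is being drained.
    if handle == "all_reduce":
        worker_waits.append("all_reduce.sub0")


order = drain_until_empty(worker_waits, collectives, wait)
```

The loop terminates here because the sub-handle spawns nothing further, mirroring the finite-DAG argument above.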
|
||||||
|
|
||||||
|
**왜 implicit drain at read가 맞는가**:
|
||||||
|
|
||||||
|
- 기존 open question에서 (a) implicit drain, (b) explicit barrier 둘 중 선택
|
||||||
|
문제였다. (b)는 명확하지만 TP layer 사용자가 `out = fc1.forward(x);
|
||||||
|
ctx.drain(); result = out.numpy()` 3-step을 매번 써야 하는 부담. (a)는
|
||||||
|
"읽을 때 반영된 값을 보장"하는 단일 규약으로 CUDA의 `cudaDeviceSynchronize
|
||||||
|
before host copy` 패턴과 동일 — 숨은 규칙이 아닌 **명명된 entry-point의
|
||||||
|
contract**이다.
|
||||||
|
- 본 ADR은 (a)를 채택하되 그 entry-point 목록을 **명시적으로 닫는다**:
|
||||||
|
`Tensor.numpy()`, `Tensor.data` (numpy alias), `Tensor.__getitem__`,
|
||||||
|
`Tensor.__repr__` (data가 포함되는 경우), 그 외 공식 host-read API는 본
|
||||||
|
ADR 구현 시점에 코드베이스 검색으로 확정. 추가되는 host-read API는 반드시
|
||||||
|
이 contract를 따라야 한다 (테스트로 회귀 방지).
|
||||||
|
- `ctx.submit`만 하고 `wait` 없이 `numpy`를 직접 호출하는 경우도 drain
|
||||||
|
barrier가 동작 (pending queue에 handle이 있기 때문). 사용자가 explicit
|
||||||
|
wait을 생략해도 read 시점에 invariant가 복원된다.
|
||||||
|
|
||||||
|
**`Tensor.copy_(source)` (write-barrier rule)**:

`copy_` is semantically a "write to target", but internally it calls
`source.numpy()` to fetch the source data on the host and then writes each
shard through `target._memory_store.write(...)`. Both directions are
barriered:

1. **Source side (read barrier)**: `source.numpy()` triggers the D0.5 read
   barrier (when the source itself is a deployed tensor and work is pending).
2. **Target side (write barrier, keyed on global pending state)**: on entry to
   `copy_`, if `ctx._pending_worker_waits` or
   `backend._pending_collective_handles` is non-empty, drain via
   `g.parent.switch()` before writing. The check uses the **global pending
   queue, not per-tensor or per-shard dependency tracking**.
   - Why global: KernBench's handle representation carries no back-reference
     saying "this handle writes shard X of the target". The safe, conservative
     rule is "drain if anything is pending globally". As a result, **pending
     work on an unrelated tensor can also block `copy_`**; the drop-in
     invariant takes priority.
   - **Explicit tradeoff**: this rule can introduce unnecessary serialization
     between mutually independent tensors. Under the current single-queue
     execution model the cost is acceptable: guaranteeing cross-rank
     correctness and the "fresh on read" invariant with a simple rule takes
     priority.
   - Practical impact: for a single worker, the pending work within one layer
     step is mostly its own, and the extra context switch caused by
     over-barriering often coincides with the scheduler drain at the end of
     the round, so it is rarely a problem.
   - Future refinement: introducing per-tensor pending tracking would narrow
     this rule, but that is outside this ADR's scope.

**Non-barrier**:

- **Metadata-only** accesses such as `tensor.shape`, `tensor.dtype`, and
  `tensor.name` do not drain. They carry no data dependency.
- Raw address accessors such as `tensor.pa` and `tensor.va` do not drain
  either (they expose addresses, not contents).

**Official barrier entry points (closed set)**:

| API | Kind | Rationale |
|---|---|---|
| `Tensor.numpy()` | read | host-observable copy |
| `Tensor.data` | read | `numpy()` alias |
| `Tensor.__getitem__` | read | shard-aligned read |
| `Tensor.__repr__` (when it includes data) | read | debugging/log |
| `Tensor.copy_(source)` | read + write | source read + target write |

T5/T6 verify this contract directly.

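The barrier rule can be illustrated with a toy model. This is a minimal sketch, not KernBench code: `ToyContext`, `ToyTensor`, and `drain_count` are hypothetical names, and the single Python list stands in for the global pending queue. It shows that metadata access leaves pending work queued, while `numpy()` and `copy_` drain before touching data.

```python
class ToyContext:
    """Hypothetical stand-in for the runtime context: one global pending queue."""

    def __init__(self):
        self.pending = []      # deferred work items (callables)
        self.drain_count = 0   # number of barriers that actually drained

    def submit(self, work):
        self.pending.append(work)

    def drain(self):
        if not self.pending:
            return             # nothing pending: the barrier is a no-op
        self.drain_count += 1
        while self.pending:
            self.pending.pop(0)()


class ToyTensor:
    def __init__(self, ctx, shape, values):
        self._ctx = ctx
        self.shape = shape             # metadata-only: never drains
        self._values = list(values)

    def numpy(self):
        self._ctx.drain()              # host-read barrier
        return list(self._values)

    def copy_(self, source):
        data = source.numpy()          # source side: read barrier
        self._ctx.drain()              # target side: global write barrier
        self._values = data


ctx = ToyContext()
t = ToyTensor(ctx, (3,), [0, 0, 0])
ctx.submit(lambda: t._values.__setitem__(slice(None), [1, 2, 3]))

shape_before = t.shape                  # metadata access: work stays queued
pending_after_shape = len(ctx.pending)
values = t.numpy()                      # barrier: drain, then read
```

A second `t.numpy()` call finds the queue empty and reads immediately, which corresponds to the "no pending, read at once" branch described in D0.6.
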
#### D0.6 Why the worker-function API is unchanged (informative)

- Internally, `torch.zeros(...)` is a `self.submit(msg)` + `self.wait(h)`
  pair. Per D0.2/D0.3, `wait` automatically defers to main, so the call looks
  synchronous but yields once.
- `tensor.numpy()` is a host-read barrier per D0.5: if anything is pending it
  drains and then reads; otherwise it reads immediately.
- `dist.all_reduce` keeps using the existing `_defer_wait=True` +
  `_pending_collective_handles` path. The D0.4 drain processes both queues
  together.

#### D0.7 Invariants

- **The `_parent` of a kernel greenlet is always main**, because `env.run`
  never runs in a worker context. (This is the core assertion of T3.)
- **Cross-rank synchronization point**: the drain runs only after every worker
  has yielded, so the kernels of all ranks advance together in one round (a
  prerequisite for cross-rank IPCQ exchange).
- **Single-driver compatibility**: D0.4-(5).

### D1. `torch.multiprocessing.spawn(fn, args, nprocs)`

Provides real-PyTorch API parity and is the single place where the D0
scheduler loop is implemented.

#### D1.0 API parity only, not execution parity (normative)

The name `torch.multiprocessing.spawn` is limited to **API signature
parity**. The actual execution model is a **cooperative greenlet scheduler**
(a single Python process, a single OS thread, driven round-robin per D0.4).
The following are **properties this ADR does not provide**; real-PyTorch
`torch.multiprocessing.spawn` guarantees them, but they are explicit
**non-goals** here:

- Process isolation (an independent OS process per rank).
- Independent address spaces (each rank owning its own Python heap).
- Failure isolation (a hard crash in one rank not affecting the others).
- OS-level scheduler fairness (preemptive time slicing across ranks).
- Inter-process primitives such as `mp.Queue` and `mp.Lock`.

Actual properties of this implementation:

- Every rank is a greenlet inside the same Python process. Shared global
  state is visible as-is (an intended simulation convenience).
- A single thread under the GIL, so there is no parallel execution. Only
  "logical concurrency" is reproduced, through SimPy event ordering.
- An unhandled exception in one worker aborts the whole simulation
  (D0.4-(4)).

**Caller obligation**: when porting a real-PyTorch multi-process sample to
KernBench, remove any logic that depends on process isolation (for example
`os.getpid`, independent temporary files, or signal handling). The namespace
name is kept for code portability; the semantics differ.

#### D1.1 Public surface

```python
# runtime_api/multiprocessing.py (new)
class _MultiprocessingNamespace:
    def __init__(self, ctx):
        self._ctx = ctx

    def spawn(self, fn, args: tuple, nprocs: int, join: bool = True) -> None:
        """Spawn `nprocs` worker greenlets, each calling fn(rank, *args).

        Mirrors torch.multiprocessing.spawn signature (minus `daemon`).
        Drives the D0 scheduler loop until all workers finish.
        """
        ...
```

#### D1.2 Implementation

```python
def spawn(self, fn, args, nprocs, join=True):
    from greenlet import greenlet
    ctx = self._ctx
    dist = ctx.distributed
    gs: list[greenlet] = []
    errors: dict[int, Exception] = {}
    for rank in range(nprocs):
        def _entry(r=rank):
            try:
                fn(r, *args)
            except Exception as e:
                errors[r] = e
                raise
        g = greenlet(_entry)
        dist._bind_rank(g, rank)
        gs.append(g)

    try:
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
            _drain_pending(ctx)  # D0.5
    except Exception as outer:
        # Sibling cleanup per D0.4-(4)
        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except (Exception, SystemExit):
                    # SystemExit is not an Exception subclass; catch it
                    # explicitly so cleanup of the remaining siblings continues.
                    pass
        backend = getattr(dist, "_backend", None)
        if backend is not None:
            if hasattr(backend, "_barrier"):
                backend._barrier.reset()
            if getattr(backend, "_pending_collective_handles", None) is not None:
                backend._pending_collective_handles.clear()
        ctx._pending_worker_waits.clear()
        raise SpawnException(errors) from outer
    # `join=True` semantics: we already wait for all workers.
```

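The round structure of the loop above (switch into every live worker, then drain once) can be tried without greenlet installed. The following is a minimal sketch that swaps greenlets for plain Python generators; `spawn`, `drain`, `pending`, and `log` here are hypothetical toy names, and a bare `yield` plays the role of a worker's `wait`.

```python
def spawn(fn, args, nprocs, drain):
    """Round-robin over generator-based workers; drain once per round."""
    workers = [fn(rank, *args) for rank in range(nprocs)]  # generator objects
    alive = list(workers)
    rounds = 0
    while alive:
        still_alive = []
        for w in alive:
            try:
                next(w)                  # run until the worker's next yield
                still_alive.append(w)
            except StopIteration:
                pass                     # this worker has finished
        drain()                          # every live worker has yielded
        alive = still_alive
        rounds += 1
    return rounds


pending = []   # toy global pending queue
log = []       # order in which pending items were drained


def drain():
    while pending:
        log.append(pending.pop(0))


def worker(rank, world_size):
    for step in range(2):
        pending.append((step, rank))     # "submit"
        yield                            # "wait": hand control back to main


rounds = spawn(worker, (2,), nprocs=2, drain=drain)
```

Because the drain runs only after both workers have yielded, the items of step 0 from every rank are processed together before any rank starts step 1, which is the cross-rank synchronization property listed in D0.7.
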
#### D1.3 Attaching to the `torch` namespace

In `__post_init__` of `runtime_api/context.py`:
```python
self.multiprocessing = _MultiprocessingNamespace(self)
```

Bench code can then call
`torch.multiprocessing.spawn(worker, args=(ws,), nprocs=ws)`.

#### D1.4 Migrating existing benches

The hand-rolled loop in `benches/ccl_allreduce.py` shrinks to a single
`torch.multiprocessing.spawn` call. The existing matrix regression is kept
as-is. `ring_default_ws`, currently xfail, is expected to flip to PASS thanks
to D0 (workers no longer orphan kernel greenlets).

### D2. New package `kernbench.tp`

```
src/kernbench/tp/
  __init__.py         # public API re-exports
  parallel_state.py   # TP group management (currently a single global group)
  layers.py           # ColumnParallelLinear, RowParallelLinear, VocabParallelEmbedding
  primitives.py       # copy/reduce/scatter/gather_to/from_tp_region
  kernels.py          # gemm kernel launched by TP layers (reusable)
  mappings.py         # forward identity/all_reduce, backward stub
```

### D3. `parallel_state`: TP group

```python
# parallel_state.py
_TP_WORLD_SIZE = None

def initialize_model_parallel(tensor_model_parallel_size: int) -> None:
    """Initialize TP group. Must be called after dist.init_process_group."""
    global _TP_WORLD_SIZE
    from kernbench.runtime_api.distributed import get_dist  # or torch.distributed
    dist = get_dist()
    total = dist.get_world_size()
    if tensor_model_parallel_size != total:
        raise NotImplementedError(
            "Only TP == world_size supported in initial scope"
        )
    _TP_WORLD_SIZE = tensor_model_parallel_size

def get_tensor_model_parallel_world_size() -> int:
    return _TP_WORLD_SIZE

def get_tensor_model_parallel_rank() -> int:
    from kernbench.runtime_api.distributed import get_dist
    return get_dist().get_rank()  # ADR-0024 greenlet-local rank
```

Initial scope: TP size = world_size = topology SIP count. Pure-TP model.

### D4-pre. TP shard ownership vs DPPolicy: separation of roles (normative)

The weight/output representation of a TP layer keeps two concepts strictly
separate:

| Concept | Decided by | Scope |
|---|---|---|
| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
| **Intra-rank placement** (how the owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **inside one rank (within the SIP boundary)** |

So when `ColumnParallelLinear` creates a weight of shape
`(in_features, out_features // ws)` and assigns
`DPPolicy(cube="column_wise", pe="column_wise")`:

- The slice owned by **rank r** is the column range
  [r * k_local, (r+1) * k_local) of the weight. **`set_device(r)`** decides
  this (the rank lives on SIP r).
- **Within that slice**, the column-wise distribution across cube × PE is
  decided by **DPPolicy**.

The two axes are **independent**. If two ranks build their slices with the
same DPPolicy, the slices live on different SIPs but share the same intra-SIP
placement pattern. Conversely, switching the DPPolicy to
`cube="replicate", pe="replicate"` preserves TP shard ownership and changes
only the intra-rank placement.

**Mistakes that blur this boundary** (forbidden by this ADR):

- A "SIP axis" reappearing in DPPolicy (it was removed in ADR-0026).
- A TP layer expressing cross-rank sharding with `DPPolicy` alone, without
  `set_device`; this is indistinguishable from a vertical cut inside a single
  rank.

The TP layers in this ADR always treat weights and outputs from the viewpoint
"rank = SIP = owner of one slice, with DPPolicy handling intra-SIP
distribution".

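The ownership rule "rank r owns columns [r * k_local, (r+1) * k_local)" is plain index arithmetic. A minimal sketch follows; `column_slice` is a hypothetical helper used only for illustration and is not part of `kernbench.tp`.

```python
def column_slice(rank, out_features, world_size):
    """Return the half-open column range [start, stop) owned by `rank`."""
    if out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    k_local = out_features // world_size
    return rank * k_local, (rank + 1) * k_local


# D_hidden = 2048 split across 4 ranks (one rank per SIP).
slices = [column_slice(r, 2048, 4) for r in range(4)]
```

The ranges are disjoint and together cover every column exactly once. How each 512-column slice is then spread across cube × PE inside its SIP is a separate question answered by DPPolicy.
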
### D4. `ColumnParallelLinear`

**Important**: no new host-side `torch.matmul` abstraction is introduced. A
layer's forward calls the existing gemm kernel via
`torch.launch("gemm", gemm_kernel, ...)`, the pattern KernBench benches
already use
([benches/gemm_single_pe.py](benches/gemm_single_pe.py),
[benches/gpt3_qkv.py](benches/gpt3_qkv.py)).

```python
# layers.py
from kernbench.policy.placement.dp import DPPolicy
from kernbench.tp.kernels import _gemm_kernel
from kernbench.tp.parallel_state import (
    get_tensor_model_parallel_rank,
    get_tensor_model_parallel_world_size,
)

class ColumnParallelLinear:
    """Shards the K (out_features) axis of the weight across TP ranks.

    forward(x):
        x:   (M, N)               fully replicated across ranks
        W_k: (N, K / world_size)  rank-local slice (resides on SIP r via set_device)
        y_k = x @ W_k → (M, K / world_size)  rank-local output

    The output is column-sharded, the input form RowParallelLinear expects.
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert out_features % ws == 0
        self.in_features = in_features
        self.k_local = out_features // ws
        self._torch = torch
        # Each rank owns its own slice, placed on SIP r by set_device(rank).
        self.weight = torch.zeros(
            (in_features, self.k_local), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_w",
        )
        self.bias = None
        if bias:
            self.bias = torch.zeros(
                (self.k_local,), dtype=dtype,
                dp=DPPolicy(cube="replicate", pe="replicate"),
                name="col_parallel_b",
            )

    def forward(self, x):
        # x is fully replicated (guaranteed by the caller). Plain local gemm.
        M = x.shape[0]
        out = self._torch.empty(
            (M, self.k_local), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="col_parallel_out",
        )
        self._torch.launch(
            "col_parallel_gemm", _gemm_kernel,
            x, self.weight, out, M, self.in_features, self.k_local,
        )
        # Bias add is a separate kernel or the fused bias of a composite gemm.
        # In the initial scope only bias=False is fully verified.
        return out
```

**Yield-safety contract (normative)**: a single `torch.launch` call in
`ColumnParallelLinear.forward` covers the kernel launch plus the internal
`ctx.wait` pair. This automatically satisfies the D0.4-(1) condition that "a
worker yields within a finite number of steps", so TP-layer users do not need
to insert yield patterns by hand.

### D5. `RowParallelLinear`

```python
class RowParallelLinear:
    """Shards the N (in_features) axis of the weight across TP ranks.

    forward(x):
        x:   (M, N / world_size)  rank-local slice (output of ColumnParallel)
        W_k: (N / world_size, K)  rank-local slice
        y_k = x @ W_k → (M, K)    partial sum on each rank
        y = all_reduce(y_k, op="sum") → (M, K) on every rank
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = False,
                 dtype: str = "f16", torch=None):
        ws = get_tensor_model_parallel_world_size()
        assert in_features % ws == 0
        self.n_local = in_features // ws
        self.out_features = out_features
        self._torch = torch
        self.weight = torch.zeros(
            (self.n_local, out_features), dtype=dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_w",
        )
        # Bias lives on rank 0 only (Megatron convention). Omitted in the initial scope.
        self.bias = None

    def forward(self, x):
        M = x.shape[0]
        y_partial = self._torch.empty(
            (M, self.out_features), dtype=x.dtype,
            dp=DPPolicy(cube="column_wise", pe="column_wise"),
            name="row_parallel_partial",
        )
        self._torch.launch(
            "row_parallel_gemm", _gemm_kernel,
            x, self.weight, y_partial, M, self.n_local, self.out_features,
        )
        # Cross-rank reduce. ADR-0024's dist.all_reduce works correctly under
        # D0 + mp.spawn (kernel parent stays main).
        self._torch.distributed.all_reduce(y_partial, op="sum")
        return y_partial
```

**Yield-safety contract (normative)**: `RowParallelLinear.forward` includes
the launch and its internal wait followed by `all_reduce` (defer + worker-yield
pattern), so **at least two yields** per forward are guaranteed. The D0.4-(1)
scheduler-progress condition is satisfied automatically. Every TP layer
forward in this ADR maintains the invariant "contains at least one wait or
collective and is therefore yield-safe"; TP primitives added later
(VocabParallelEmbedding and so on) must honor the same contract.

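Why a column-parallel layer followed by a row-parallel layer needs only one all-reduce can be checked numerically. The sketch below uses plain Python lists instead of KernBench tensors; `matmul`, `take_cols`, `take_rows`, and `add` are hypothetical helpers, and the final element-wise sum stands in for `all_reduce(op="sum")`.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def take_cols(m, lo, hi):
    return [row[lo:hi] for row in m]


def take_rows(m, lo, hi):
    return m[lo:hi]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


ws = 2
x = [[1, 2]]                                  # (M=1, D_in=2), replicated
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]             # (D_in=2, D_hidden=4)
w2 = [[1, 2], [0, 1], [3, 0], [1, 1]]         # (D_hidden=4, D_out=2)

reference = matmul(matmul(x, w1), w2)         # unsharded result

k_local = len(w1[0]) // ws
partials = []
for rank in range(ws):
    lo, hi = rank * k_local, (rank + 1) * k_local
    h_rank = matmul(x, take_cols(w1, lo, hi))               # column-parallel fc1
    partials.append(matmul(h_rank, take_rows(w2, lo, hi)))  # row-parallel fc2

result = partials[0]
for p in partials[1:]:
    result = add(result, p)                   # stands in for all_reduce(sum)
```

Integer inputs keep the comparison exact. The example has no nonlinearity between the two layers, matching the sample MLP in D7.
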
### D6. Primitive functions

```python
# primitives.py
def copy_to_tp_region(x):
    """Forward: identity. Backward: all-reduce. (Implemented when training is added.)"""
    return x

def reduce_from_tp_region(x, torch):
    """Forward: all-reduce. Backward: identity."""
    torch.distributed.all_reduce(x, op="sum")
    return x

def scatter_to_tp_region(x):
    raise NotImplementedError(
        "Phase 2: replaced by the user creating an already-sharded tensor"
    )

def gather_from_tp_region(x):
    raise NotImplementedError(
        "Phase 2: requires an all-gather kernel first (future)"
    )
```

### D7. Sample bench: 2-layer MLP with TP

```python
# benches/tp_mlp.py (new)
from kernbench.policy.placement.dp import DPPolicy
import kernbench.tp as tp
import numpy as np


def worker(rank: int, world_size: int, torch):
    torch.ahbm.set_device(rank)
    tp.initialize_model_parallel(world_size)

    B, D_in, D_hidden, D_out = 1, 512, 2048, 512
    fc1 = tp.ColumnParallelLinear(D_in, D_hidden, torch=torch)
    fc2 = tp.RowParallelLinear(D_hidden, D_out, torch=torch)

    x = torch.zeros(
        (B, D_in), dtype="f16",
        dp=DPPolicy(cube="replicate", pe="replicate"),
        name="x",
    )
    # init x with some pattern (e.g., constant)
    x.copy_(torch.from_numpy(np.full((B, D_in), 0.1, dtype=np.float16)))

    h = fc1.forward(x)  # column-sharded (B, D_hidden / ws)
    y = fc2.forward(h)  # all-reduced (B, D_out) on every rank

    # Only rank 0 prints / verifies the result.
    if rank == 0:
        result = y.numpy()
        # With zero-initialized weights every value is 0; in this scope the
        # check is simply that the run completes.
        print(f"  tp_mlp: shape={result.shape}, mean={float(result.mean()):.4f}")


def run(torch):
    torch.distributed.init_process_group(backend="ahbm")
    ws = torch.distributed.get_world_size()
    # spawn calls fn(rank, *args), so `torch` must be passed explicitly to
    # match worker(rank, world_size, torch).
    torch.multiprocessing.spawn(worker, args=(ws, torch), nprocs=ws)
```

### D8. Non-functional: training is not supported

This ADR covers **inference/forward only**. Backward, gradients, and
optimizers are future work. This is a natural fit because KernBench does not
currently target training.

### D9. Initial scope constraints

- TP size = world_size (no mixed DP+TP).
- `scatter_to_tp_region` and `gather_from_tp_region` are unimplemented.
- **Weights default to zero**. Proper init schemes (Xavier, Kaiming, etc.)
  are future work. Tests, however, inject a deterministic non-zero pattern
  via `tensor.copy_` to verify numerical correctness (T2/T6). In other words,
  operation is split into "production default = zero, verification =
  deterministic non-zero".
- Bias is omitted in the initial scope (Megatron's rank-0-only bias policy is
  future work).
- Pipeline parallelism is out of scope.
- VocabParallelEmbedding requires all-gather first, so it is stub only.

### D10. Regression: lifting the `ring_default_ws` xfail (required acceptance)

Thanks to D0 (generalized worker-wait) and D0.5 (host-read barrier), every
worker-driven `ctx.wait` and host read is routed through the main-drain path,
which eliminates the cause of the kernel-greenlet orphan in ADR-0024 Phase B.
Flipping the `ring_default_ws` strict-xfail case in the existing matrix test
to **PASS** after this ADR is implemented is included as a **required
regression criterion**. The observable acceptance criteria are spelled out in
**T7** (no deadlock, no GreenletExit, numerical tolerance, and so on).

---

## Dependencies

- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
  `torch.ahbm.set_device(rank)`.
- **ADR-0026** (DPPolicy intra-device): represents the per-rank slice of a
  weight tensor.
- **ADR-0023 / ADR-0025** (IPCQ): the foundation of the `dist.all_reduce`
  implementation.

---

## Non-goals

- **Backward pass / training**: inference only. Training simulation belongs
  in a separate ADR.
- **Mixed parallelism (DP + TP + PP)**: pure TP only at first.
- **Weight init schemes**: plain zero / debug pattern.
- **Fused ops**: Megatron's fused matmul+bias+gelu is a kernel-level concern.
- **DTensor integration**: ADR-0028, future.
- **Host-side `torch.matmul` abstraction**: TP layers call the existing gemm
  kernel via `torch.launch(gemm_kernel, ...)`. No new matmul host-op is
  introduced.

---

## Open questions

- **Location of `initialize_model_parallel`**:
  `kernbench.tp.initialize_model_parallel` (current decision) vs
  real-PyTorch's `torch.distributed.init_device_mesh`. Kept in the
  TP-specific module.
- **Weight init**: the ADR uses zero. A debug pattern (e.g., identity) may be
  needed for meaningful verification; add it in Phase 1 tests if required.
- **Bias placement policy**: Megatron places the RowParallelLinear bias on
  rank 0 only. The initial scope sidesteps this with bias=False.
- **GEMM kernel location**: `kernbench.tp.kernels._gemm_kernel` vs importing
  from the existing `benches/gemm_single_pe.py`. TP must not depend on
  benches, so the kernel is duplicated inside tp. It can later move to a
  shared `kernbench.kernels` package.

**Resolved (open in the previous rev)**:

- ~~Drain timing when `tensor.numpy()` is called~~ → **decided in D0.5**: the
  official host-read entry points (`numpy`, `data`, `__getitem__`, and
  `__repr__` when it includes data) are automatic drain barriers.
  Metadata-only accessors are not barriers.

---

## Consequences

### Positive

- **Easy porting of Megatron code**: the API matches real training code.
- **TP benchmarking becomes possible**: study of HW characteristics such as
  scaling and communication-compute overlap.
- **`ring_default_ws` xfail lifted**: as a by-product of D0, the ADR-0024
  Phase B blocker is resolved.
- **Unified scheduler loop**: introducing D1 (`mp.spawn`) removes the
  hand-rolled loop. Later collective/TP benches reuse the same pattern.
- **Clearer DPPolicy semantics** (synergy with ADR-0026): a reference example
  of TP layers using only intra-device DPPolicy.

### Negative

- Maintenance cost of a new module (`kernbench.tp`).
- The initial scope is limited (pure TP only, forward only).
- The D0 generalization changes the semantics of `ctx.wait`; compatibility
  with single-driver tests must be verified explicitly (T7).

### Neutral

- A pure upper layer added on top of ADR-0024/0026. No impact on the hardware
  simulation stack (except D0).

@@ -0,0 +1,256 @@
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange

## Status

Accepted (supersedes ADR-0029).

## Context

### Goal

Define a single all-reduce algorithm that exploits the topology hierarchy:
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.

### Why replace ADR-0029 (hierarchical 3-level)

ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
where every PE in the system participates. In practice this adds the
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
without matching the common workload pattern where the tensor is sharded
**per cube** (not per PE within a cube).

Moreover, the hierarchical design required:
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
- multi-level topology schema (`hierarchical_3level`)
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure

The intercube algorithm below removes all of that: **pe0-only same-lane
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
root cube, then broadcast back. Simpler kernel, simpler wiring, same
bandwidth characteristics for the common per-cube DP workload.

### Current state

- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
- `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
- `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
  `hierarchical_allreduce` modules and their tests are **removed**.

---

## Decision

### D1. Algorithm structure: 5 phases

For each SIP (launched concurrently by `mp.spawn`):

```
Phase 1 - Row reduce W → E (cube mesh, pe0 only):
  col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.

Phase 2 - Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
  row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
  holds the full SIP sum.

Phase 3 - Inter-SIP exchange on root cube (pe0 of root cube only):
  Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
  selected by sip_topo_kind (from topology.yaml sips.topology).

Phase 4 - Col broadcast S → N on rightmost column.

Phase 5 - Row broadcast E → W across the cube mesh.
```

After all phases every cube's pe0 holds the global sum.

The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
across topologies; only phase 3 branches. Helper functions
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
three exchange patterns.

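The five phases can be traced on scalars to confirm that every cube ends up with the global sum. This is a minimal host-side sketch, not the kernel: `intercube_allreduce` here is a hypothetical pure-Python function, each cube's pe0 holds a single number, and phase 3 models only the `ring_1d` case, assuming each SIP forwards the value it just received in the next round.

```python
def intercube_allreduce(values, mesh_w, mesh_h):
    """values[sip][cube] is the scalar held by pe0 of that cube (updated in place)."""
    n_sips = len(values)
    east = mesh_w - 1
    root = (mesh_h - 1) * mesh_w + east           # SE corner cube

    for v in values:
        # Phase 1: row reduce W -> E
        for row in range(mesh_h):
            for col in range(1, mesh_w):
                v[row * mesh_w + col] += v[row * mesh_w + col - 1]
        # Phase 2: column reduce N -> S on the rightmost column
        for row in range(1, mesh_h):
            v[row * mesh_w + east] += v[(row - 1) * mesh_w + east]

    # Phase 3: ring exchange between root cubes, n_sips - 1 rounds
    acc = [v[root] for v in values]
    send = list(acc)
    for _ in range(n_sips - 1):
        recv = [send[(s - 1) % n_sips] for s in range(n_sips)]  # from the west peer
        acc = [a + r for a, r in zip(acc, recv)]
        send = recv                                             # forward east next round

    for v, total in zip(values, acc):
        v[root] = total
        # Phase 4: column broadcast S -> N on the rightmost column
        for row in range(mesh_h - 2, -1, -1):
            v[row * mesh_w + east] = v[(row + 1) * mesh_w + east]
        # Phase 5: row broadcast E -> W
        for row in range(mesh_h):
            for col in range(mesh_w - 2, -1, -1):
                v[row * mesh_w + col] = v[row * mesh_w + col + 1]
    return values


# Two SIPs, 4x4 cube mesh; cube c on SIP s starts with s * 100 + c.
data = [[sip * 100 + cube for cube in range(16)] for sip in range(2)]
expected = sum(sum(row) for row in data)
result = intercube_allreduce(data, mesh_w=4, mesh_h=4)
```

With a 4×4 mesh the root index evaluates to 15, the cube named in phase 2.
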
### D2. Tensor layout (rank = SIP, per-worker)

Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
its own cube-mesh-spanning tensor:

```python
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
```

Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.

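The shard address formula is simple enough to tabulate. A minimal sketch, assuming the factor `2` is the byte size of one f16 element; `shard_addr`, `F16_BYTES`, and the base pointer `0x1000` are illustrative names and values, not taken from the kernel.

```python
F16_BYTES = 2  # assumption: the "* 2" in the formula is the f16 element size


def shard_addr(t_ptr, cube_id, n_elem, elem_bytes=F16_BYTES):
    """Address of the shard that belongs to `cube_id`."""
    return t_ptr + cube_id * n_elem * elem_bytes


# 16 cubes, n_elem = 8 (the ccl.yaml value), hypothetical base pointer 0x1000.
addrs = [shard_addr(0x1000, cube, 8) for cube in range(16)]
```

Consecutive cubes are 16 bytes apart under these assumptions, so the 16 shards occupy one contiguous 256-byte region.
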
### D3. SFR / IPCQ wiring: `configure_sfr_intercube_multisip`

Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
tables for **every cube's pe0 across every SIP**, regardless of which
cube is the root or which SIP topology is selected. This lets the kernel
elect the root cube at runtime and supports topology switches without
re-wiring.

| Level | Direction labels | Scope |
|---|---|---|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |

Inter-SIP directions use the `global_*` prefix to keep the namespace
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
with `global_E ↔ global_W` and `global_N ↔ global_S` so the
reverse-direction resolver handles 2-SIP bidirectional rings correctly.

Internally the function calls `install_ipcq` with:
- `world_size = n_sips × n_cubes`
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
- A closure-captured `neighbors()` function that builds the map above.

This `world_size` is internal to IPCQ wiring and does not leak to the
process-group rank.

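The neighbor map in the table can be sketched as two small functions. This is an illustration under stated assumptions, not the body of `configure_sfr_intercube_multisip`: `mesh_neighbors` and `ring_neighbors` are hypothetical helpers, cubes are numbered row-major with row 0 as north, and only the `ring_1d` inter-SIP case is shown.

```python
def rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ rank -> (sip, cube, pe); only pe0 participates."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def mesh_neighbors(cube, mesh_w, mesh_h):
    """Intercube N/S/E/W neighbors on a mesh without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if row > 0:
        out["N"] = cube - mesh_w
    if row < mesh_h - 1:
        out["S"] = cube + mesh_w
    if col < mesh_w - 1:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    return out


def ring_neighbors(sip, n_sips):
    """Inter-SIP peers for ring_1d, using the global_* namespace."""
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}


table = rank_to_pe(n_sips=2, n_cubes=16)
corner_nw = mesh_neighbors(0, 4, 4)
corner_se = mesh_neighbors(15, 4, 4)
two_sip_ring = ring_neighbors(0, 2)
```

In a 2-SIP ring both `global_E` and `global_W` resolve to the same peer, which is the bidirectional case the extended `_OPPOSITE_DIR` has to handle.
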
### D4. SIP topology: from `topology.yaml`

```yaml
system:
  sips:
    count: 2
    topology: ring_1d   # or torus_2d, mesh_2d_no_wrap
```

- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
  `global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
  broadcast per dimension.

2D variants require `n_sips` to be a perfect square.

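The perfect-square requirement and the name-to-kind mapping can be expressed in a few lines. A minimal sketch: `derive_sip_topo` is a hypothetical helper, the integer codes follow the 0/1/2 ordering given in D1, and returning `(n_sips, 1)` as the ring's width and height is an assumption made for this example.

```python
import math

# Codes follow the ordering in D1: ring_1d, torus_2d, mesh_2d_no_wrap.
TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def derive_sip_topo(name, n_sips):
    """Return (sip_topo_kind, sip_topo_w, sip_topo_h) for a topology name."""
    kind = TOPO_NAME_TO_KIND[name]
    if kind == 0:
        return kind, n_sips, 1            # assumption: a ring is n_sips x 1
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires a perfect-square n_sips, got {n_sips}")
    return kind, side, side


ring = derive_sip_topo("ring_1d", 2)
torus = derive_sip_topo("torus_2d", 4)
```

`math.isqrt` avoids floating-point rounding when testing for a perfect square.
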
### D5. Process-group integration: `AhbmCCLBackend`

At `init_process_group` time the backend:

1. Loads `ccl.yaml` + `topology.yaml`.
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)`: one-time
   SFR wiring, mirrors NCCL communicator creation.

At each `dist.all_reduce(tensor)` call:

1. Resolves `kernel_fn` from `cfg["module"]`.
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
   `kernel_args(world_size, n_elem)`.
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
   `sip_rank` is the current greenlet's bound rank.
4. Launches with `_defer_wait=True`; the main scheduler drains pending
   handles after all workers submit (per ADR-0027 D0.4).

### D6. Config schema

`ccl.yaml`:

```yaml
defaults:
  algorithm: intercube_allreduce
  buffer_kind: tcm
  # ...

algorithms:
  intercube_allreduce:
    module: kernbench.ccl.algorithms.intercube_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    root_cube: 15
```

`topology.yaml`:

```yaml
system:
  sips:
    count: 2
    topology: ring_1d
  sip:
    cube_mesh: { w: 4, h: 4 }
```

### D7. Algorithm module contract

Modules loaded via `cfg["module"]` must export:

| Name | Purpose |
|---|---|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |

---

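A loader could check the contract before using a module. The following is a minimal sketch of such a check, not something the backend is stated to do: `missing_exports` and `REQUIRED_EXPORTS` are hypothetical names, and a `SimpleNamespace` stands in for an imported algorithm module.

```python
from types import SimpleNamespace

REQUIRED_EXPORTS = (
    "kernel",
    "kernel_args",
    "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING",
    "SIP_TOPO_TORUS",
    "SIP_TOPO_MESH",
)


def missing_exports(module):
    """Names from the D7 contract that `module` does not define."""
    return [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]


# Stand-in for a module that forgot one constant.
fake_module = SimpleNamespace(
    kernel=lambda *args: None,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0,
    SIP_TOPO_TORUS=1,
)

missing = missing_exports(fake_module)
```

Reporting every missing name at once, rather than failing on the first `AttributeError`, makes a misconfigured `cfg["module"]` easier to diagnose.
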
## Dependencies

- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
- **ADR-0025**: Address-based IPCQ direction matching; extended
  `_OPPOSITE_DIR` with `global_*` pairs.
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.

## Non-goals

- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope: the
  workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
  `mesh_2d_no_wrap` require `n_sips = k²`.
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
- **Root cube runtime election**: the kernel currently uses
  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
  change when needed.

---

## Consequences

### Positive

- **Single kernel, single install path** for all-reduce: replaces four
  removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
  integer param, no kernel duplication.
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
  algorithm selection needed; config-driven end-to-end.
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
  available, which supports future dynamic root-cube election.

### Negative

- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
  shard within one cube across 8 PEs are not addressable by this kernel.
  Such workloads would need a separate intra-cube all-reduce path (not
  yet implemented).
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
  given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
  small but not zero.

---

## Affected files
|
||||||
|
|
||||||
|
| File | Change |
|---|---|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
| `ccl.yaml` | Single `intercube_allreduce` entry |
| `topology.yaml` | Added `system.sips.topology` |
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
| `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests | Removed |
|
||||||
@@ -0,0 +1,162 @@
|
|||||||
|
# ADR-0033 — Latency Model: Assumptions and Known Simplifications
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The simulator is an analytical, event-driven performance model — not a
|
||||||
|
cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
|
||||||
|
or omitted by design. To keep the model auditable and reviewable as a whole,
|
||||||
|
this ADR consolidates the assumptions in one place. Individual component ADRs
|
||||||
|
(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
|
||||||
|
the *limits of fidelity*.
|
||||||
|
|
||||||
|
## Decisions
|
||||||
|
|
||||||
|
### D1. Modeled precisely
|
||||||
|
|
||||||
|
- **Per-directed-edge BW occupancy** (FIFO serialization via `available_at`) —
|
||||||
|
ADR-0015 D2.
|
||||||
|
- **Per-component switching/overhead latency** (`overhead_ns` attr).
|
||||||
|
- **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
|
||||||
|
with address-based PC selection (ADR-0034 D3). Burst granularity tunable
|
||||||
|
(`burst_bytes`, default 256B). Read and write share each PC's
|
||||||
|
`available_at` (real HW command bus is per-PC shared).
|
||||||
|
- **HBM direction switching penalty mechanism**: per-PC last-direction
|
||||||
|
tracking + configurable `switch_penalty_ns`. Default 0 — see D2.
|
||||||
|
- **Wire chunk-streaming (Phase 2c)**: each wire decomposes Transactions
|
||||||
|
with payload into `Flit` objects of `flit_bytes` (default = HBM
|
||||||
|
`burst_bytes` = 256B). The wire emits each flit individually after
|
||||||
|
`prop_ns + flit_nbytes/bw_gbs` so the link's bandwidth throttles
|
||||||
|
flit arrival rate per real-HW wormhole semantics.
|
||||||
|
- **Separate Stores per directed edge** (Phase 2c key fix): the wire
|
||||||
|
is the *only* conduit between `src.out_ports[dst]` and
|
||||||
|
`dst.in_ports[src]`. Earlier the two were aliased to the same
|
||||||
|
`simpy.Store`; when the wire put a chunkified flit back, the
|
||||||
|
destination's `fan_in` could pull it before the wire applied
|
||||||
|
bandwidth delay, leaving half the flits bypassing the bottleneck.
|
||||||
|
- **Flit-aware pass-through** (`TransitComponent`, `HbmCtrlComponent`):
|
||||||
|
forward each flit serially with per-transaction overhead applied
|
||||||
|
ONCE on the first-flit arrival (header decode model). Subsequent
|
||||||
|
flits pipeline through with no extra delay. Wormhole emerges
|
||||||
|
naturally across multi-hop paths.
|
||||||
|
- **HBM CTRL per-flit PC commit**: each flit arriving at HBM CTRL
|
||||||
|
schedules a PC commit at `max(env.now, pc_avail[pc]) + chunk_time`,
|
||||||
|
with the `is_last` flit waiting for the last PC commit before
|
||||||
|
signaling `txn.done`.
|
||||||
|
- **Non-flit-aware components (default) reassemble flits at `_fan_in`**
before the legacy `_forward_txn` path runs. This
|
||||||
|
preserves backward compatibility for components that have not yet
|
||||||
|
been migrated to flit-aware processing (e.g., `MCpuComponent`,
|
||||||
|
`IoCpuComponent` sub-txn generators). Such components reassemble
|
||||||
|
*once per leg boundary*, NOT per hop — multi-hop wormhole timing
|
||||||
|
through a chain of flit-aware routers is preserved.
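
The wire chunk-streaming bullet above can be illustrated with a small timing calculation. This is a sketch under stated assumptions, not the simulator's wire process: it assumes flits serialize back-to-back on the link and that `prop_ns` is a constant offset added to each flit. The function names `flit_sizes` and `flit_arrival_times` are hypothetical.

```python
def flit_sizes(nbytes: int, flit_bytes: int = 256) -> list[int]:
    """Split a payload into full flits plus one shorter tail flit if needed."""
    full, tail = divmod(nbytes, flit_bytes)
    return [flit_bytes] * full + ([tail] if tail else [])


def flit_arrival_times(nbytes, bw_gbs, prop_ns, flit_bytes=256, start_ns=0.0):
    """Arrival time (ns) of each flit at the far end of one wire.

    1 GB/s equals 1 byte/ns, so bytes / bw_gbs is already in ns.
    Assumption: serialization accumulates, propagation is a fixed offset.
    """
    arrivals = []
    link_free = start_ns
    for size in flit_sizes(nbytes, flit_bytes):
        link_free += size / bw_gbs            # serialization on the link
        arrivals.append(link_free + prop_ns)  # constant propagation offset
    return arrivals


print(flit_sizes(600))
print(flit_arrival_times(512, bw_gbs=32.0, prop_ns=1.0))
```

Under these assumptions the first flit arrives after `prop_ns + flit_nbytes / bw_gbs`, matching the expression in the text, and later flits are spaced by the link's per-flit serialization time.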
|
||||||
|
|
||||||
|
### D2. Approximated (with known directional error)
|
||||||
|
|
||||||
|
| Effect | Real HW | Our model | Error direction |
|--------|---------|-----------|----------------|
| Router output port arbitration | Round-robin / weighted | Wire edge FIFO + serial worker | Fair when one txn per cycle; multi-stream sharing not modeled at flit level |
| HBM scheduler / write buffer | FR-FCFS + watermark drain | FIFO, no reordering | Pessimistic for mixed R/W when alternations are dense; default `switch_penalty_ns = 0` assumes ideal scheduler amortizes |
| Flit ↔ burst granularity | 32B flit < 256B burst | `flit_bytes = burst_bytes = 256B` | Sub-flit fine-grained timing noise; affects very small wire arbitration windows only |
| Wire-level RR fairness | Per-cycle multi-flow arbitration on shared link | Single serial wire process per edge | Fair only when one transaction is in flight on a given edge at a time. Multi-stream concurrent traffic on the same edge serializes by FIFO order |
|
||||||
|
|
||||||
|
### D3. Ignored (out of scope)
|
||||||
|
|
||||||
|
- Bank-level row buffer conflict penalty (assume no conflicts — best case;
|
||||||
|
the model has no per-bank state within a PC, so same-bank reuse cannot be
|
||||||
|
detected).
|
||||||
|
- HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
|
||||||
|
`burst_time = burst_bytes / pc_bw_gbs`).
|
||||||
|
- Refresh, ECC, thermal throttling, power gating.
|
||||||
|
- Clock domain crossings, PLL lock time.
|
||||||
|
- Upstream backpressure due to downstream buffer occupancy (input ports use
|
||||||
|
unbounded `simpy.Store`).
|
||||||
|
- Sub-flit cycle-level arbitration at routers (flit granularity is our
|
||||||
|
smallest unit).
|
||||||
|
|
||||||
|
### D4. Workload sensitivity
|
||||||
|
|
||||||
|
Workloads where the above simplifications meaningfully affect results:
|
||||||
|
|
||||||
|
- **Random scatter/gather**: bank conflict ignored → model optimistic.
|
||||||
|
- **Heavy mixed R/W traffic** (e.g., GEMM bias accumulation): HBM scheduler
|
||||||
|
absent. With default `switch_penalty_ns = 0` we assume ideal amortization;
|
||||||
|
setting it non-zero models pessimistic per-alternation cost.
|
||||||
|
- **High concurrency (>10 active flows on one link)**: HoL blocking and VC
|
||||||
|
limits not modeled → model optimistic.
|
||||||
|
- **Very small (sub-flit) transactions**: flit quantization noise.
|
||||||
|
- **Concurrent multi-flow on a single wire**: wire is serial FIFO at the
|
||||||
|
flit level, so per-flow fairness within a single edge is not modeled.
|
||||||
|
Pre-edge merging (multiple sources arriving at a router and being
|
||||||
|
forwarded to the same downstream wire) is correctly modeled via the
|
||||||
|
flit-aware router's serial worker.
|
||||||
|
|
||||||
|
### D5. Verification policy
|
||||||
|
|
||||||
|
For workloads in D4, cross-check against real HW or a cycle-accurate
|
||||||
|
simulator before drawing absolute-magnitude conclusions. The model remains
|
||||||
|
accurate for **relative comparisons** within the modeled regime.
|
||||||
|
|
||||||
|
### D6. Future work
|
||||||
|
|
||||||
|
Note: multi-stream merging at routers IS modeled correctly — each
|
||||||
|
in_port has its own fan_in process, all push to a shared inbox, and
|
||||||
|
the router worker forwards in inbox FIFO order. Flits from different
|
||||||
|
upstream streams naturally interleave at flit granularity. The items
|
||||||
|
below are different concerns, ordered by expected workload impact.
|
||||||
|
|
||||||
|
**Higher impact (workload accuracy gap)**:
|
||||||
|
|
||||||
|
- [ ] **Bank-level conflict modeling** within a PC (opt-in via
|
||||||
|
`track_banks: true`). Currently we assume no same-bank reuse;
|
||||||
|
random scatter/gather workloads are optimistic here.
|
||||||
|
- [ ] **HBM scheduler** with write buffer + watermark drain (Tier 2
|
||||||
|
from the design discussion). Default `switch_penalty_ns=0` is the
|
||||||
|
ideal-amortization stand-in; bursty mixed R/W workloads benefit
|
||||||
|
from explicit modeling.
|
||||||
|
- [ ] **Backpressure** modeling for finite component buffers. Matters
|
||||||
|
at high concurrency / sustained saturation where buffer occupancy
|
||||||
|
causes upstream stalls.
|
||||||
|
- [ ] **Op_log integration with chunk-streaming**: currently op_log
|
||||||
|
fires on PE-internal command messages (DmaReadCmd, DmaWriteCmd,
|
||||||
|
GemmCmd, MathCmd) which are not chunkified. Integration would
|
||||||
|
require flit-aware components to also emit op_log start/end hooks
|
||||||
|
per transaction (start on first flit, end on is_last).
|
||||||
|
|
||||||
|
**Lower impact (academic / specific use cases)**:
|
||||||
|
|
||||||
|
- [ ] **Cycle-accurate router arbitration policies** (RR with
|
||||||
|
priorities, age, iSLIP). The FIFO inbox is already approximately
|
||||||
|
fair when flit arrival times differ slightly between streams (the
|
||||||
|
common case for similar-rate workloads). True impact appears only
|
||||||
|
for: (a) priority/QoS modeling, (b) per-stream tail latency
|
||||||
|
analysis under sustained saturation. Not critical for makespan or
|
||||||
|
average-latency studies.
|
||||||
|
- [ ] **Sub-flit (32B) granularity** for finer wire arbitration
|
||||||
|
cycles. Our `flit_bytes` equals burst (256B); real HW arbitrates
|
||||||
|
per 32B flit. Effect is small for most workloads (sub-flit timing
|
||||||
|
noise on small messages).
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Single review point for all model fidelity questions. Each future PR
|
||||||
|
touching latency must update the relevant section here.
|
||||||
|
- Workload-specific magnitude error envelopes are explicit.
|
||||||
|
- Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
|
||||||
|
enforces the ADR-0017 D8 invariant in code rather than relying on yaml
|
||||||
|
manual consistency.
|
||||||
|
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c
|
||||||
|
per-flit timing) rather than via terminal `drain_ns` injection. Single
|
||||||
|
transactions land at `drain + commit_time + small_overheads`; multi-hop
|
||||||
|
preserves wormhole pipelining; multi-stream merge correctly serializes
|
||||||
|
at the shared wire's FIFO.
|
||||||
|
|
||||||
|
## Cross-references
|
||||||
|
|
||||||
|
- ADR-0015 — component / port / wire model.
|
||||||
|
- ADR-0017 — Cube NOC architecture and HBM connectivity.
|
||||||
|
- ADR-0004 — memory semantics, local HBM.
|
||||||
|
- ADR-0034 — HBM controller internal design.
|
||||||
@@ -0,0 +1,271 @@
|
|||||||
|
# ADR-0034: HBM Controller Internal Design
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
|
||||||
|
the cube NOC. One instance is created per PE under the topology node
|
||||||
|
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
|
||||||
|
(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
|
||||||
|
burst-granular commit timing, address-based PC selection, and response
|
||||||
|
routing back to the requester.
|
||||||
|
|
||||||
|
This ADR documents the component as currently implemented. ADR-0017 D4/D8
|
||||||
|
defines *where* HBM CTRL attaches and *what* aggregate BW it must
|
||||||
|
deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
|
||||||
|
scope. This ADR fills the gap between those two — the per-instance
|
||||||
|
internal scheduling model.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Role
|
||||||
|
|
||||||
|
`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
|
||||||
|
PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
|
||||||
|
attaches to that PE's router via the `peX.hbm` attachment list in
|
||||||
|
`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
|
||||||
|
(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
|
||||||
|
into one endpoint.
|
||||||
|
|
||||||
|
The component models:
|
||||||
|
|
||||||
|
- Per-PC scheduling (D2) with R/W command-bus sharing.
|
||||||
|
- Address-based PC selection (D3).
|
||||||
|
- Burst-granular commit timing (D4).
|
||||||
|
- Flit-aware per-flit PC commit and async finalize (D5, D6).
|
||||||
|
- Command-only Transaction handling for read-data drain (D7).
|
||||||
|
- Response routing back to the requester (D8).
|
||||||
|
|
||||||
|
It does not model:
|
||||||
|
|
||||||
|
- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
|
||||||
|
(ADR-0033 D3).
|
||||||
|
- Cross-PE HBM contention beyond its own router edge (handled by the
|
||||||
|
router mesh — ADR-0017 D3).
|
||||||
|
- 1:1 channel mode (ADR-0017 D8 future work).
|
||||||
|
|
||||||
|
### D2. Per-PC scheduling model
|
||||||
|
|
||||||
|
Per-instance state initialised in `start()`:
|
||||||
|
|
||||||
|
- `_pc_avail: list[float]` — earliest sim-time each PC is free; length
|
||||||
|
`num_pcs`, initial 0.0.
|
||||||
|
- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on
|
||||||
|
each PC, used for switch-penalty detection (D4); initial `None`.
|
||||||
|
|
||||||
|
`num_pcs` and `burst_bytes` must each be a positive power of two so
|
||||||
|
that address-based PC selection (D3) reduces to a shift-and-mask.
|
||||||
|
|
||||||
|
Read and write requests share the same `_pc_avail` slot per PC — the
|
||||||
|
real HW per-PC command bus is shared between read and write traffic, so
|
||||||
|
issuing a write to PC k blocks a subsequent read to PC k by exactly the
|
||||||
|
burst time.
|
||||||
|
|
||||||
|
Direction `dir` for a request is inferred from the request type:
|
||||||
|
|
||||||
|
- `MemoryWriteMsg` → `"W"`.
|
||||||
|
- `PeDmaMsg` with `is_write=True` → `"W"`.
|
||||||
|
- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.
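
The direction rule above can be written as a small function. This is a minimal sketch: the message classes below are stand-ins that carry only the fields needed for illustration (the real classes have more), and `infer_direction` is a hypothetical name.

```python
from dataclasses import dataclass


# Stand-in message types for illustration only.
@dataclass
class MemoryWriteMsg:
    nbytes: int = 0


@dataclass
class MemoryReadMsg:
    nbytes: int = 0


@dataclass
class PeDmaMsg:
    is_write: bool = False


def infer_direction(request) -> str:
    """Return "W" for writes and "R" for everything else."""
    if isinstance(request, MemoryWriteMsg):
        return "W"
    if isinstance(request, PeDmaMsg) and request.is_write:
        return "W"
    return "R"


print(infer_direction(PeDmaMsg(is_write=True)))
print(infer_direction(MemoryReadMsg(64)))
```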
|
||||||
|
|
||||||
|
### D3. Address-based PC selection
|
||||||
|
|
||||||
|
PC index for an access is derived from the access address by shift and
|
||||||
|
mask:
|
||||||
|
|
||||||
|
```text
pc_shift = log2(burst_bytes)   # default 8 (burst=256B)
pc_mask  = num_pcs - 1         # default 7 (8 PCs)
pc       = (address >> pc_shift) & pc_mask
```
|
||||||
|
|
||||||
|
Computed once in `start()` from topology config so alternative
|
||||||
|
`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
|
||||||
|
default `(256, 8)` this places the PC select field at bits `[10:8]` of
|
||||||
|
the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
|
||||||
|
`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
|
||||||
|
within the PC slice (see `phyaddr.py` comment).
|
||||||
|
|
||||||
|
Address-based striping — as opposed to address-blind global
|
||||||
|
round-robin — preserves PC parallelism for offset-disjoint concurrent
|
||||||
|
transfers: each transfer's bursts land deterministically on the PC set
|
||||||
|
implied by its byte addresses, so multi-PE workloads accessing disjoint
|
||||||
|
regions do not collide on a single PC.
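
The shift-and-mask selection can be sketched in Python as follows. This is an illustration, not the component's code: `make_pc_selector` is a hypothetical helper that mirrors the "computed once in `start()`" idea by precomputing the shift and mask and returning a closure.

```python
def make_pc_selector(burst_bytes: int = 256, num_pcs: int = 8):
    """Precompute shift/mask once and return an address -> PC index function."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    pc_shift = burst_bytes.bit_length() - 1   # log2 for a power of two
    pc_mask = num_pcs - 1

    def pc_for_address(address: int) -> int:
        return (address >> pc_shift) & pc_mask

    return pc_for_address


select = make_pc_selector()
# Consecutive 256B bursts walk through PCs 0..7 and then wrap.
print([select(i * 256) for i in range(10)])
```

With the default `(256, 8)` pair, every byte inside one 256B burst maps to the same PC, and a contiguous 2 KiB region touches each of the 8 PCs exactly once.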
|
||||||
|
|
||||||
|
### D4. Burst granularity and PC commit timing
|
||||||
|
|
||||||
|
A single PC commit takes:
|
||||||
|
|
||||||
|
```text
chunk_time = burst_bytes / pc_bw_gbs   # ns
```
|
||||||
|
|
||||||
|
- `burst_bytes` (default 256) is the burst granularity matching the
|
||||||
|
flit size (ADR-0033 D1).
|
||||||
|
- `pc_bw_gbs` is **builder-derived** from
|
||||||
|
`hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
|
||||||
|
the ADR-0017 D8 invariant that aggregate per-PE BW equals the
|
||||||
|
router-to-HBM link BW.
|
||||||
|
|
||||||
|
Per-PC commit scheduling for an arriving access on PC `pc` with
|
||||||
|
direction `dir`:
|
||||||
|
|
||||||
|
```text
switch_cost = switch_penalty_ns if pc_last_dir[pc] not in (None, dir) else 0
start       = max(env.now, pc_avail[pc]) + switch_cost
finish      = start + chunk_time
pc_avail[pc]    = finish
pc_last_dir[pc] = dir
```
|
||||||
|
|
||||||
|
Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM
|
||||||
|
scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
|
||||||
|
model pessimistic per-alternation cost.
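
The schedule above can be exercised outside the simulator. The following is a minimal sketch, assuming the current simulation time is passed in as `now` instead of read from `env.now`; the class name `PcSchedule` and its method `commit` are hypothetical.

```python
class PcSchedule:
    """Per-PC availability and last-direction state, as described in D2/D4."""

    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0,
                 switch_penalty_ns=0.0):
        self.chunk_time = burst_bytes / pc_bw_gbs   # ns (1 GB/s == 1 byte/ns)
        self.switch_penalty_ns = switch_penalty_ns
        self.pc_avail = [0.0] * num_pcs
        self.pc_last_dir = [None] * num_pcs

    def commit(self, now, pc, direction):
        """Schedule one burst on `pc` and return its finish time in ns."""
        switching = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switching else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish


sched = PcSchedule(switch_penalty_ns=2.0)
print(sched.commit(0.0, pc=0, direction="W"))  # first write on PC 0
print(sched.commit(0.0, pc=0, direction="W"))  # queues behind the first
print(sched.commit(0.0, pc=0, direction="R"))  # pays the switch penalty
print(sched.commit(0.0, pc=1, direction="R"))  # independent PC, no queueing
```

Because reads and writes share `pc_avail[pc]`, the read on PC 0 waits for both writes before its own burst, while the access to PC 1 is unaffected.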
|
||||||
|
|
||||||
|
### D5. Flit-aware per-flit PC commit (primary path)
|
||||||
|
|
||||||
|
`_handle_flit` is the primary worker path. For each arriving `Flit`:
|
||||||
|
|
||||||
|
1. On the **first** flit of a transaction (`tid = id(txn)` not in
   `_txn_state`):
   - Apply `overhead_ns` once via `run(env, nbytes)`: header decode
     model, first-flit overhead pattern (ADR-0033 D1).
   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
2. Compute `pc = _pc_for_address(flit.address)` (D3).
3. Apply the per-PC schedule (D4) using the request direction (D2).
4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
   (D6).
|
||||||
|
|
||||||
|
Per-flit address-aware commit is the mechanism that lets concurrent
|
||||||
|
multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
|
||||||
|
in parallel.
|
||||||
|
|
||||||
|
### D6. Async finalize per transaction
|
||||||
|
|
||||||
|
When a transaction's last flit has been scheduled, finalisation runs in
|
||||||
|
a separately-spawned process:
|
||||||
|
|
||||||
|
```python
def _finalize_txn(env, txn, last_finish):
    wait = last_finish - env.now
    if wait > 0:
        yield env.timeout(wait)
    yield from _send_response(env, txn)
```
|
||||||
|
|
||||||
|
`_handle_flit` spawns this via `env.process(...)` and returns
|
||||||
|
immediately, so the worker can pick up the next inbox message while the
|
||||||
|
last PC commit drains.
|
||||||
|
|
||||||
|
Without this split — i.e. if the worker itself did
|
||||||
|
`yield env.timeout(wait)` — concurrent single-flit transactions whose
|
||||||
|
addresses hit distinct PCs would still serialise at `chunk_time` each
|
||||||
|
inside the worker, hiding the PC parallelism that D3 and D5 are
|
||||||
|
designed to expose.
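
The effect of the split can be shown with a toy calculation. This is a sketch, not a SimPy model: it assumes all transactions are single-flit, arrive at t=0, and incur no `overhead_ns`. The function name `completion_times` is hypothetical.

```python
def completion_times(pcs, chunk_time, worker_blocks):
    """Completion time of single-flit transactions that all arrive at t=0.

    pcs           -- PC index hit by each transaction, in inbox order
    worker_blocks -- True if the worker itself waits out each commit
    """
    pc_avail = {}
    now = 0.0
    done = []
    for pc in pcs:
        finish = max(now, pc_avail.get(pc, 0.0)) + chunk_time
        pc_avail[pc] = finish
        done.append(finish)
        if worker_blocks:
            now = finish   # next inbox message is only seen after the wait
    return done


print(completion_times([0, 1, 2, 3], 8.0, worker_blocks=True))
print(completion_times([0, 1, 2, 3], 8.0, worker_blocks=False))
```

With a blocking worker the four transactions finish one `chunk_time` apart even though they hit distinct PCs; with the async finalize they all finish after a single `chunk_time`.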
|
||||||
|
|
||||||
|
### D7. Non-flit fallback for command-only transactions
|
||||||
|
|
||||||
|
`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
|
||||||
|
`Flit`. This is the path for command-only requests that the wire does
|
||||||
|
not chunk into flits — most notably `MemoryReadMsg` whose command txn
|
||||||
|
carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
|
||||||
|
not as inbound flits).
|
||||||
|
|
||||||
|
Procedure:
|
||||||
|
|
||||||
|
1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`:
   for read commands, work is sized by the request.
2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0`, else 0.
3. `chunk_interval = drain_ns / n_chunks` (when both > 0): chunks are
   scheduled over time at `drain_ns / n_chunks` ns intervals to model the
   bottleneck link's data arrival rate (ADR-0033 D1 chunk-loop drain).
4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
5. For each chunk `i`, advance `chunk_interval` ns, then apply the D4
   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
6. After scheduling all chunks, wait `last_finish - env.now`, then call
   `_send_response`.
|
||||||
|
|
||||||
|
`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
|
||||||
|
`_handle_flit` — there is exactly one source of PC scheduling truth
|
||||||
|
across both paths.
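
Steps 1 to 3 and 5 of the procedure can be sketched as a pure planning function. This is an illustration under assumptions: it only computes when each chunk would be offered to the D4 schedule and which PC it targets, and does not model `overhead_ns` or the commit itself. `plan_command_chunks` is a hypothetical name.

```python
import math


def plan_command_chunks(txn_nbytes, request_nbytes, base_address, drain_ns,
                        burst_bytes=256, num_pcs=8):
    """Return a (time offset in ns, PC index) pair per chunk."""
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    chunk_interval = drain_ns / n_chunks if n_chunks > 0 and drain_ns > 0 else 0.0
    pc_shift = burst_bytes.bit_length() - 1   # burst_bytes is a power of two
    plan = []
    for i in range(n_chunks):
        address = base_address + i * burst_bytes
        pc = (address >> pc_shift) & (num_pcs - 1)
        # Chunk i is offered after advancing chunk_interval (i + 1) times.
        plan.append(((i + 1) * chunk_interval, pc))
    return plan


# A 1 KiB read command (txn.nbytes == 0) starting at offset 0x200.
print(plan_command_chunks(0, 1024, 0x200, drain_ns=32.0))
```

The 1 KiB read is split into four bursts spaced `drain_ns / 4` apart, landing on four consecutive PCs.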
|
||||||
|
|
||||||
|
### D8. Response routing
|
||||||
|
|
||||||
|
`_send_response` dispatches on request type and path geometry:
|
||||||
|
|
||||||
|
| Case | Trigger | Response |
| --- | --- | --- |
| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
| Bypass (Memory Read) | `not any("m_cpu" in hop for hop in txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
| Bypass (Memory Write) | `not any("m_cpu" in hop for hop in txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |
|
||||||
|
|
||||||
|
The "bypass" classification matches the Memory R/W fabric path defined
|
||||||
|
in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
|
||||||
|
without M_CPU). The PE_DMA case uses its own dedicated reverse path to
|
||||||
|
keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
|
||||||
|
ResponseMsg envelope).
|
||||||
|
|
||||||
|
In all reverse-path cases, the response Transaction is put onto
|
||||||
|
`out_ports[reverse_path[1]]` — the first hop back along the recorded
|
||||||
|
forward path. If `reverse_path` has fewer than 2 entries (degenerate
|
||||||
|
path), the original `txn.done` is signalled directly.
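
The dispatch order in the table can be sketched as a classifier. This is a minimal illustration with assumptions: the message classes are stand-ins, path hops are assumed to be node-name strings such as `"sip0.cube0.m_cpu"`, `reverse_path` is assumed to be the forward path reversed, and the function names are hypothetical.

```python
from dataclasses import dataclass


# Stand-in message types for illustration only.
@dataclass
class PeDmaMsg:
    is_write: bool = False


@dataclass
class MemoryReadMsg:
    nbytes: int = 0


@dataclass
class MemoryWriteMsg:
    nbytes: int = 0


def classify_response(request, path):
    """Pick the D8 response case for a request and its forward path."""
    if isinstance(request, PeDmaMsg):
        return "pe_dma"
    bypass = not any("m_cpu" in hop for hop in path)
    if bypass and isinstance(request, MemoryReadMsg):
        return "bypass_read"
    if bypass:
        return "bypass_write"
    return "default"


def first_reverse_hop(path):
    """First hop back along the forward path, or None for a degenerate path."""
    reverse_path = list(reversed(path))
    return reverse_path[1] if len(reverse_path) >= 2 else None


fabric_path = ["pcie_ep", "io_noc", "ucie", "router", "hbm_ctrl.pe0"]
print(classify_response(MemoryReadMsg(64), fabric_path))
print(first_reverse_hop(fabric_path))
```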
|
||||||
|
|
||||||
|
### D9. Configurable attributes
|
||||||
|
|
||||||
|
| Attribute | Default | Source | Notes |
| --- | --- | --- | --- |
| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |
|
||||||
|
|
||||||
|
`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
|
||||||
|
directly so the aggregate per-PE BW matches the router-to-HBM link BW
|
||||||
|
without yaml-side duplication.
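
The derivation is a single division. Below is a minimal sketch, not the code in `topology/builder.py`: the function name `derive_pc_bw_gbs` is hypothetical, and the 256 GB/s input is inferred from the table defaults (32.0 GB/s per PC times 8 PCs), not stated in the source.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth such that num_pcs PCs sum to the router-to-HBM link BW."""
    if num_pcs <= 0 or num_pcs & (num_pcs - 1):
        raise ValueError(f"num_pcs must be a positive power of two, got {num_pcs}")
    return hbm_to_router_bw_gbs / num_pcs


pc_bw = derive_pc_bw_gbs(256.0, 8)   # 256.0 is an assumed link BW
print(pc_bw)
print(256 / pc_bw)                   # chunk_time in ns for a 256B burst
```

Deriving the value keeps the aggregate `num_pcs * pc_bw_gbs` equal to the link bandwidth by construction, so the two numbers cannot drift apart in configuration.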
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- Address-based PC selection preserves multi-stream HBM parallelism
|
||||||
|
that an address-blind round-robin would collapse — important for
|
||||||
|
multi-PE workloads with disjoint HBM regions.
|
||||||
|
- Flit-aware path (D5) + async finalize (D6) preserves wormhole
|
||||||
|
pipelining and exposes PC parallelism for back-to-back single-flit
|
||||||
|
transactions.
|
||||||
|
- Single source of PC scheduling truth (D4 mechanism, used by both D5
|
||||||
|
flit path and D7 chunk-loop path).
|
||||||
|
- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
|
||||||
|
discipline.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- No bank-level conflict modelling within a PC; address-blind to
|
||||||
|
bank/row-buffer reuse (ADR-0033 D3).
|
||||||
|
- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
|
||||||
|
FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
|
||||||
|
(ADR-0033 D2).
|
||||||
|
- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
|
||||||
|
accumulates per concurrent transaction and is removed only on
|
||||||
|
`is_last`. Adequate for current workloads.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0001 (Physical address layout — PC bit field comment)
|
||||||
|
- ADR-0015 D4 (Memory R/W fabric path — bypass response case)
|
||||||
|
- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers)
|
||||||
|
- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR
|
||||||
|
implements)
|
||||||
|
- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint
|
||||||
|
resolution)
|
||||||
|
- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty,
|
||||||
|
flit-aware PC commit, first-flit overhead, chunk-loop drain)
|
||||||
|
- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation)
|
||||||
@@ -0,0 +1,286 @@
|
|||||||
|
# ADR-0035: M_CPU and M_CPU.DMA Component Model
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
M_CPU is the cube-level command processor. It receives commands from
|
||||||
|
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
|
||||||
|
M_CPU as a fallback), fans them out to the PEs in its cube, and
|
||||||
|
aggregates per-PE responses into a single ResponseMsg sent back to
|
||||||
|
IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
|
||||||
|
fan-out. Per ADR-0015 D5 it is **not** a separate topology node —
|
||||||
|
it lives as internal state of `MCpuComponent`.
|
||||||
|
|
||||||
|
This ADR documents the M_CPU component implementation that realizes
|
||||||
|
those responsibilities, including the three distinct fan-out paths
|
||||||
|
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
|
||||||
|
model, and the response aggregation contract.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Role
|
||||||
|
|
||||||
|
M_CPU has three responsibilities:
|
||||||
|
|
||||||
|
1. **Transit forwarding** — when not the terminal hop (e.g., on the
|
||||||
|
reverse response path PE → M_CPU → IO_CPU), forwards Transactions
|
||||||
|
to `next_hop` in their pre-computed path.
|
||||||
|
2. **Multi-PE fan-out at terminal hop** — dispatches to one of three
|
||||||
|
fan-out paths based on request type (D2).
|
||||||
|
3. **Response aggregation** — collects per-PE responses, sends a
|
||||||
|
single aggregate ResponseMsg back to IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
Per invocation (`run()`): applies `overhead_ns` once per incoming
|
||||||
|
Transaction.
|
||||||
|
|
||||||
|
M_CPU does **not**:
|
||||||
|
|
||||||
|
- Decide routing — paths are pre-computed by the router (ADR-0002).
|
||||||
|
- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines
|
||||||
|
(ADR-0014).
|
||||||
|
- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE
|
||||||
|
`hbm_ctrl.pe{X}` directly (ADR-0017 D9).
|
||||||
|
- Interpret tensor or kernel semantics; fan-out is dispatched by a Python `isinstance` check only.
|
||||||
|
|
||||||
|
### D2. Three fan-out paths dispatched by request type
|
||||||
|
|
||||||
|
At the terminal hop the worker dispatches by request type:
|
||||||
|
|
||||||
|
```python
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
```
|
||||||
|
|
||||||
|
Each path uses a different router method:
|
||||||
|
|
||||||
|
- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the
|
||||||
|
M_CPU-specific DMA path that avoids PE pipeline nodes.
|
||||||
|
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the
|
||||||
|
generic NOC command path to PE_CPU.
|
||||||
|
- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command
|
||||||
|
path to PE_MMU.
|
||||||
|
|
||||||
|
### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)
|
||||||
|
|
||||||
|
`MCpuComponent.start()` initializes two SimPy resources:
|
||||||
|
|
||||||
|
```python
self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1)   # MemoryReadMsg
```
|
||||||
|
|
||||||
|
Properties:
|
||||||
|
|
||||||
|
- **Not a topology node** — managed entirely inside `MCpuComponent`;
|
||||||
|
does not appear in `topology.yaml` or in the compiled graph.
|
||||||
|
- **Independent read and write channels** — concurrent in-flight
|
||||||
|
Memory R/W is allowed.
|
||||||
|
- **Capacity=1 per channel** serializes the **dispatch step**
|
||||||
|
(`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
|
||||||
|
R/W requests at this M_CPU. Actual fabric transfer time is modeled
|
||||||
|
by wire processes between components (ADR-0015 D2) and by
|
||||||
|
`drain_ns` at terminal hops; the DMA resource does not gate
|
||||||
|
transfer duration.
|
||||||
|
|
||||||
|
Resource selection is request-type-based:
|
||||||
|
|
||||||
|
```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```
|
||||||
|
|
||||||
|
### D4. Transit forwarding at non-terminal hops
|
||||||
|
|
||||||
|
When `txn.next_hop` is not None — typical for the reverse response
|
||||||
|
path (PE → M_CPU → IO_CPU) — the worker forwards normally:
|
||||||
|
|
||||||
|
```python
if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
```
|
||||||
|
|
||||||
|
The fan-out branches fire only at the terminal hop. The same component
|
||||||
|
therefore serves both forward command dispatch and reverse response
|
||||||
|
relay roles.
|
||||||
|
|
||||||
|
### D5. DMA fan-out (`_dma_fanout` — Memory R/W)
|
||||||
|
|
||||||
|
For each Memory R/W request at terminal hop:
|
||||||
|
|
||||||
|
1. `_resolve_dma_destinations(request)` returns a per-PE
   `hbm_ctrl.pe{X}` derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to
     `path[1]`.
3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.
|
||||||
|
|
||||||
|
PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code —
|
||||||
|
no such node exists after ADR-0017 D4's per-PE partitioning. Kept
|
||||||
|
defensively but does not route to a real destination.
|
||||||
|
|
||||||
|
### D6. Kernel launch fan-out (`_kernel_launch_fanout`)
|
||||||
|
|
||||||
|
For `KernelLaunchMsg` at terminal hop:
|
||||||
|
|
||||||
|
1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
|
||||||
|
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
|
||||||
|
`ctx.router.find_node_path()`.
|
||||||
|
3. **`target_start_ns` handling** (ADR-0009 D5):
|
||||||
|
- If the request already carries `target_start_ns` (stamped by
|
||||||
|
IO_CPU per ADR-0036 D3): **pass through unchanged**.
|
||||||
|
- If absent (direct-to-M_CPU launch in unit tests): compute a
|
||||||
|
per-cube barrier `env.now + max(per-PE leg latency)` and stamp
|
||||||
|
via `dataclasses.replace`.
|
||||||
|
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
|
||||||
|
control message; preserving nbytes=0 keeps fan-out off the shared
|
||||||
|
first-hop fabric BW, mirroring ADR-0036 D4).
|
||||||
|
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
|
||||||
|
from each sub-Transaction's `result_data` into the parent
|
||||||
|
transaction:
|
||||||
|
|
||||||
|
```python
txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
txn.result_data["dma_ns"] = max(existing, max(dma_values))
txn.result_data["compute_ns"] = max(existing, max(compute_values))
```
|
||||||
|
|
||||||
|
The max-merge with the existing value matters because cross-cube
|
||||||
|
IO_CPU fan-out shares the same parent `result_data`; merging
|
||||||
|
prevents one cube from clobbering another's metric.
|
||||||
|
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
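
The max-merge in step 5 can be sketched as below. This is a minimal illustration under stated assumptions, not the simulator's code: the helper `merge_max`, the default of `0.0` for a missing key, and the sample values are hypothetical; only the `result_data` key name comes from the text above.

```python
def merge_max(result_data: dict, key: str, values: list) -> None:
    """Fold per-PE values into the shared parent dict without lowering it."""
    if not values:
        return  # nothing reported for this metric; leave the parent as-is
    existing = result_data.get(key, 0.0)  # assumed default when no cube has reported yet
    result_data[key] = max(existing, max(values))


# Two cubes share one parent result_data (cross-cube IO_CPU fan-out).
parent = {}
merge_max(parent, "pe_exec_ns", [120.0, 140.0])  # cube 0 reports first
merge_max(parent, "pe_exec_ns", [90.0, 100.0])   # cube 1 reports later
```

With a plain assignment, the second call would replace cube 0's 140.0 with cube 1's 100.0; merging against the existing value keeps the slower cube's figure.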
|
||||||
|
|
||||||
|
### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)
|
||||||
|
|
||||||
|
For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:
|
||||||
|
|
||||||
|
1. `_resolve_pe_ids(target_pe)` → PE ids.
|
||||||
|
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
|
||||||
|
`find_node_path()`.
|
||||||
|
3. Dispatch sub-Transactions with `nbytes=0`.
|
||||||
|
4. PE_MMU is a terminal node — it does **not** send a ResponseMsg
|
||||||
|
back. Instead, the sub-Transaction's own `sub_done` event is the
|
||||||
|
completion signal.
|
||||||
|
5. Wait for all `sub_done` events in-line (does **not** use
|
||||||
|
`_pending` counter — D8 is for response-bearing fan-out only).
|
||||||
|
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.
|
||||||
|
|
||||||
|
### D8. Response aggregation (`_pending` + `_parent_txns`)
|
||||||
|
|
||||||
|
For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg
|
||||||
|
arriving on the reverse path):
|
||||||
|
|
||||||
|
```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```
|
||||||
|
|
||||||
|
- On dispatch: register `(expected, received=0, all_done)` and
|
||||||
|
remember the parent transaction.
|
||||||
|
- `_worker` recognises responses by `is_response=True` and routes
|
||||||
|
them to `_collect_response`, which increments `received` and
|
||||||
|
signals `all_done` when `received >= expected`.
|
||||||
|
- After `yield all_done`, the fan-out path constructs the aggregate
|
||||||
|
ResponseMsg:
|
||||||
|
|
||||||
|
```python
resp_msg = ResponseMsg(
    correlation_id=request.correlation_id,
    request_id=request.request_id,
    src_cube=cube_id,
    src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
    success=True,  # no failure semantics implemented
)
```
|
||||||
|
|
||||||
|
- The response Transaction travels on `list(reversed(txn.path))`
|
||||||
|
back to IO_CPU.
|
||||||
|
|
||||||
|
MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
|
||||||
|
because PE_MMU is terminal — there is no ResponseMsg path to
|
||||||
|
intercept.
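
The counter-based aggregation described in D8 can be sketched without SimPy as follows. Everything here is illustrative: `Event` is a hypothetical stand-in for `simpy.Event`, `Aggregator` and its method names are invented for the example, and deleting the `_pending` entry on completion is an assumption rather than something D8 states.

```python
class Event:
    """Hypothetical stand-in for simpy.Event; only records whether it fired."""

    def __init__(self):
        self.triggered = False

    def succeed(self):
        self.triggered = True


class Aggregator:
    def __init__(self):
        # request_id -> (expected, received, all_done)
        self._pending = {}

    def register(self, request_id, expected):
        """Called on dispatch: remember how many responses to wait for."""
        all_done = Event()
        self._pending[request_id] = (expected, 0, all_done)
        return all_done

    def collect_response(self, request_id):
        """Called for each per-PE response arriving on the reverse path."""
        expected, received, all_done = self._pending[request_id]
        received += 1
        if received >= expected:
            all_done.succeed()
            del self._pending[request_id]  # assumed cleanup
        else:
            self._pending[request_id] = (expected, received, all_done)


agg = Aggregator()
done = agg.register("req-1", expected=3)
agg.collect_response("req-1")
agg.collect_response("req-1")
fired_early = done.triggered  # still waiting for the third response
agg.collect_response("req-1")
```

The sketch also makes the known limitation visible: if the third response never arrives, `done` never fires and the entry stays in `_pending`.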
|
||||||
|
|
||||||
|
### D9. Helpers and configurable attribute
|
||||||
|
|
||||||
|
`_resolve_pe_ids(target_pe)`:
|
||||||
|
|
||||||
|
- `int` → `[target_pe]`
|
||||||
|
- `tuple[int, ...]` → `list(target_pe)`
|
||||||
|
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
|
||||||
|
`memory_map.hbm_slices_per_cube` (default 8).
|
||||||
|
|
||||||
|
Used by kernel-launch and MMU fan-out paths.
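
A minimal sketch of the three-way resolution above, written as a free function for illustration. Passing `n_slices` as a parameter and raising `ValueError` for unrecognised input are assumptions; the actual method reads `n_slices` from the cube's `memory_map.hbm_slices_per_cube`.

```python
def resolve_pe_ids(target_pe, n_slices=8):
    """Normalise a target_pe selector into a list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    # Assumed behaviour: the text does not say what happens for other inputs.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")
```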
|
||||||
|
|
||||||
|
Single configurable attribute drives per-instance latency:
|
||||||
|
|
||||||
|
| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |
|
||||||
|
|
||||||
|
Applied once in `run()` per Transaction — models command
|
||||||
|
interpretation and dispatch-decision time at M_CPU.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- Three fan-out paths are clearly separated by request type — adding
|
||||||
|
a new request kind is an isinstance branch + one fan-out method.
|
||||||
|
- M_CPU.DMA channels are independent (read and write run concurrently)
|
||||||
|
and serialize only the dispatch step at capacity=1.
|
||||||
|
- Transit-vs-terminal behavior is a single `if next_hop` check, so
|
||||||
|
the same component handles forward dispatch and reverse response
|
||||||
|
relay without role duplication.
|
||||||
|
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
|
||||||
|
established by IO_CPU (ADR-0036 D3), while the fallback computation
|
||||||
|
keeps direct-to-M_CPU unit tests working.
|
||||||
|
- Per-PE metric `max`-merge against existing parent `result_data`
|
||||||
|
values is robust to cross-cube IO_CPU fan-out sharing the same
|
||||||
|
parent.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- No partial-failure semantics — a missing per-PE response stalls the
|
||||||
|
parent `all_done` indefinitely. Acceptable for simulation; not
|
||||||
|
suitable as a production-style endpoint.
|
||||||
|
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
|
||||||
|
code (no such node exists post-ADR-0017 D4). Kept defensively;
|
||||||
|
invites confusion and merits a follow-up cleanup.
|
||||||
|
- DMA resource serialization applies only at dispatch (the `put` call
|
||||||
|
is instantaneous in unbounded stores). The capacity=1 channel
|
||||||
|
models "one request in flight at a time at this M_CPU", not
|
||||||
|
"transfer duration serialization" — readers must consult wire
|
||||||
|
processes (ADR-0015 D2) and `drain_ns` for actual transfer
|
||||||
|
parallelism.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
|
||||||
|
- ADR-0009 D5 (`target_start_ns` — passed through unchanged when
|
||||||
|
present; computed as per-cube barrier when absent)
|
||||||
|
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
|
||||||
|
point)
|
||||||
|
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
|
||||||
|
contract at cube level)
|
||||||
|
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
|
||||||
|
topology node)
|
||||||
|
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
|
||||||
|
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
|
||||||
|
through unchanged; nbytes=0 invariant preserved through fan-out)
|
||||||
@@ -0,0 +1,216 @@
|
|||||||
|
# ADR-0036: IO_CPU Component Model
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
|
||||||
|
graph. PCIE_EP receives host messages from the runtime API and routes
|
||||||
|
them via the io_noc; for command-bearing requests (KernelLaunch,
|
||||||
|
MmuMap/Unmap) the io_noc forwards to IO_CPU, which:
|
||||||
|
|
||||||
|
- Fans out the request to per-cube M_CPUs.
|
||||||
|
- Aggregates per-cube responses into a single host-visible completion.
|
||||||
|
- For kernel launches, stamps a global `target_start_ns` barrier so
|
||||||
|
every PE across every targeted cube begins kernel body execution at
|
||||||
|
the same simulated time (ADR-0009 D5).
|
||||||
|
|
||||||
|
Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
|
||||||
|
this component therefore handles only command-plane traffic in normal
|
||||||
|
operation.
|
||||||
|
|
||||||
|
This ADR documents the IO_CPU component implementation that realizes
|
||||||
|
those responsibilities.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Role
|
||||||
|
|
||||||
|
IO_CPU is the host-facing endpoint of the IO chiplet. It has two
|
||||||
|
primary responsibilities:
|
||||||
|
|
||||||
|
1. **Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg /
|
||||||
|
MmuUnmapMsg to per-cube M_CPUs.
|
||||||
|
2. **Response aggregation** — collect per-cube ResponseMsg, signal
|
||||||
|
parent `txn.done` when all targeted cubes have responded.
|
||||||
|
|
||||||
|
A third, narrower responsibility applies only to KernelLaunchMsg:
|
||||||
|
**`target_start_ns` global barrier stamping** (D3).
|
||||||
|
|
||||||
|
The component does **not**:
|
||||||
|
|
||||||
|
- Decide routing — paths are pre-computed by the router (ADR-0002).
|
||||||
|
- Decode tensor or kernel internals — those concerns belong to
|
||||||
|
M_CPU / PE_CPU / engines.
|
||||||
|
- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3).
|
||||||
|
- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4
|
||||||
|
and ADR-0016 D3 (Memory R/W resolution code in
|
||||||
|
`_resolve_cube_targets` exists as a defensive fallback only).
|
||||||
|
|
||||||
|
Per invocation (`run()`): applies the configured `overhead_ns` once
|
||||||
|
per incoming Transaction (D8).
|
||||||
|
|
||||||
|
### D2. Forward path — multi-cube fan-out
|
||||||
|
|
||||||
|
When a non-response Transaction arrives, the worker:
|
||||||
|
|
||||||
|
1. Pays `overhead_ns` via `run()`.
|
||||||
|
2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
|
||||||
|
targets from the request (D5).
|
||||||
|
3. For each target:
|
||||||
|
- Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
|
||||||
|
- Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
|
||||||
|
- Creates a per-cube sub-Transaction with `path` populated and
|
||||||
|
forwards it to `path[1]` (the first hop on the io_noc).
|
||||||
|
4. Registers aggregation state: `_pending[request_id] = (expected,
|
||||||
|
received=0, parent_done)`.
|
||||||
|
|
||||||
|
### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)
|
||||||
|
|
||||||
|
IO_CPU is the canonical stamper for `target_start_ns`. When the
|
||||||
|
request is a `KernelLaunchMsg`, IO_CPU computes a single global
|
||||||
|
barrier covering every targeted PE across every targeted cube:
|
||||||
|
|
||||||
|
```text
global_max = 0
for (sip, cube) in cube_targets:
    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
    for pe_id in target_pe_ids:
        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
                                       nbytes=0)
        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)

target_start_ns = env.now + global_max
```
|
||||||
|
|
||||||
|
The request is then replaced (via `dataclasses.replace`) so the
|
||||||
|
stamped value propagates through the fan-out.
|
||||||
|
|
||||||
|
Two overhead corrections:
|
||||||
|
|
||||||
|
- `io_overhead_ns` is subtracted because IO_CPU has already paid it
|
||||||
|
in `run()` before this method runs.
|
||||||
|
- `m_overhead_ns` is subtracted once because it appears as the
|
||||||
|
endpoint of leg1 *and* the start of leg2 in path latency, but
|
||||||
|
M_CPU pays it only once at run time.
|
||||||
|
|
||||||
|
Every downstream PE_CPU yields until `target_start_ns` before
|
||||||
|
beginning kernel body execution; all PEs therefore start at the same
|
||||||
|
simulated time regardless of how long their individual dispatch path
|
||||||
|
took.
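
The barrier arithmetic can be checked with a small worked example. This is a sketch under stated assumptions: `global_barrier_ns` is a hypothetical helper, the per-leg latencies are made-up numbers standing in for `compute_path_latency_ns` results, and the overheads use the 10.0 ns (IO_CPU) and 5.0 ns (M_CPU) values from the `overhead_ns` tables.

```python
def global_barrier_ns(now, leg1_ns, leg2_ns, io_overhead_ns, m_overhead_ns):
    """Return target_start_ns for one launch.

    leg1_ns: {(sip, cube): io_cpu -> m_cpu latency}
    leg2_ns: {(sip, cube, pe_id): m_cpu -> pe_cpu latency}
    """
    global_max = 0.0
    for (sip, cube, pe_id), leg2 in leg2_ns.items():
        leg1 = leg1_ns[(sip, cube)]
        # Remove the IO_CPU overhead (already paid) and one copy of the
        # M_CPU overhead (counted at the end of leg1 and the start of leg2).
        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)
    return now + global_max


# Hypothetical latencies: cube 1 is farther from IO_CPU than cube 0.
leg1 = {(0, 0): 25.0, (0, 1): 40.0}
leg2 = {(0, 0, 0): 12.0, (0, 0, 1): 14.0, (0, 1, 0): 12.0}
target_start_ns = global_barrier_ns(100.0, leg1, leg2,
                                    io_overhead_ns=10.0, m_overhead_ns=5.0)
```

Here the slowest leg is cube 1, PE 0 (40 + 12 - 10 - 5 = 37 ns), so every PE is told to start at 137.0 ns even though the cube 0 PEs receive the launch earlier.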
|
||||||
|
|
||||||
|
### D4. KernelLaunch sub-Transactions carry `nbytes=0`
|
||||||
|
|
||||||
|
Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
|
||||||
|
overriding the parent `txn.nbytes`:
|
||||||
|
|
||||||
|
- Kernel launch is a control message; payload size is irrelevant at
|
||||||
|
the data-fabric level.
|
||||||
|
- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
|
||||||
|
io_noc's shared first hop. With 16 cubes this serializes fan-out,
|
||||||
|
pushing far M_CPUs past `target_start_ns` and breaking the D3
|
||||||
|
invariant.
|
||||||
|
|
||||||
|
Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
|
||||||
|
for the defensive Memory R/W fallback path, which carries actual
|
||||||
|
payload sizes).
|
||||||
|
|
||||||
|
### D5. Per-request-type cube target resolution
|
||||||
|
|
||||||
|
`_resolve_cube_targets` dispatches by request type:
|
||||||
|
|
||||||
|
| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
| --- | --- | --- |
| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |
|
||||||
|
|
||||||
|
Each IO_CPU instance fans out only within its own SIP — `_my_sip()`
|
||||||
|
parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).
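
One way such a parse could look is sketched below. The regular expression and the `ValueError` are assumptions; the text only gives the single example `sip0.io0.io_cpu` → 0, so the general node-id grammar is not confirmed here.

```python
import re


def my_sip(node_id: str) -> int:
    """Extract the SIP index from a node id such as 'sip0.io0.io_cpu'."""
    match = re.match(r"sip(\d+)\.", node_id)
    if match is None:
        # Assumed behaviour for ids that do not start with 'sip<N>.'
        raise ValueError(f"node id has no SIP prefix: {node_id!r}")
    return int(match.group(1))
```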
|
||||||
|
|
||||||
|
The Memory R/W rows exist for defensive completeness; the engine's
|
||||||
|
normal path routes Memory R/W via `_process_memory_direct()` /
|
||||||
|
`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
|
||||||
|
ADR-0016 D3).
|
||||||
|
|
||||||
|
### D6. Response aggregation
|
||||||
|
|
||||||
|
`_pending: dict[request_id → (expected, received, parent_done)]`:
|
||||||
|
|
||||||
|
- On dispatch: register `(len(cube_targets), 0, txn.done)`.
|
||||||
|
- `_worker` recognises responses by `is_response=True` and routes
|
||||||
|
them to `_collect_response`.
|
||||||
|
- `_collect_response` increments `received`; when `received >=
|
||||||
|
expected`, `parent_done.succeed()` is invoked and the entry is
|
||||||
|
removed from `_pending`.
|
||||||
|
|
||||||
|
This is a simple per-request counter. There is no per-cube identity
tracking and no partial-failure handling: a missing response stalls
the parent `done` event indefinitely. Production-style failure paths
are out of scope for the current simulator model.
|
||||||
|
|
||||||
|
### D7. `target_pe` resolution helper
|
||||||
|
|
||||||
|
`_resolve_pe_ids(target_pe)`:
|
||||||
|
|
||||||
|
- `int` → `[target_pe]`.
|
||||||
|
- `tuple[int, ...]` → `list(target_pe)`.
|
||||||
|
- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
|
||||||
|
`memory_map.hbm_slices_per_cube` (default 8).
|
||||||
|
|
||||||
|
Used in D3's barrier computation to enumerate every PE target per
|
||||||
|
cube.
|
||||||
|
|
||||||
|
### D8. Configurable `overhead_ns`
|
||||||
|
|
||||||
|
A single attribute drives per-instance latency:
|
||||||
|
|
||||||
|
| Site | impl name | overhead_ns |
| --- | --- | --- |
| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |
|
||||||
|
|
||||||
|
Applied once in `run()` per Transaction. Models command
|
||||||
|
interpretation + dispatch-decision time at IO_CPU.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- Cross-cube and cross-SIP kernel launches share a single global
|
||||||
|
barrier (D3 + D4) — no per-cube divergence in start time.
|
||||||
|
- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
|
||||||
|
BW, preserving the barrier's accuracy at scale (16 cubes).
|
||||||
|
- Response aggregation via a single counter → minimal state,
|
||||||
|
deterministic ordering of completion.
|
||||||
|
- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
|
||||||
|
cleanly independent.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- No partial-failure semantics — a missing per-cube response
|
||||||
|
indefinitely stalls the parent. Adequate for simulation but not
|
||||||
|
suitable as a production-style endpoint.
|
||||||
|
- `_pending` is a regular dict; in-flight requests accumulate state.
|
||||||
|
Acceptable for current benchmark workloads (few concurrent
|
||||||
|
outstanding launches); unbounded in principle.
|
||||||
|
- The Memory R/W resolution branches in `_resolve_cube_targets` are
|
||||||
|
dead code in the normal engine path. Kept defensively but invite
|
||||||
|
drift if the bypass path ever changes.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0002 (Routing distance — path computation)
|
||||||
|
- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
|
||||||
|
- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
|
||||||
|
cubes)
|
||||||
|
- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
|
||||||
|
- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
|
||||||
|
- ADR-0012 (Host ↔ IO_CPU message schema)
|
||||||
|
- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
|
||||||
|
- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here)
|
||||||
|
- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
|
||||||
|
- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
|
||||||
|
interpretation)
|
||||||
@@ -0,0 +1,200 @@
|
|||||||
|
# ADR-0037: Forwarding Component (forwarding_v1)
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The simulation graph has many node positions that exist purely to model
|
||||||
|
fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints,
|
||||||
|
IO chiplet io_noc, transit cubes. These share a common pattern: receive
|
||||||
|
a message, apply per-component overhead (modeling header decode +
|
||||||
|
routing decision time), forward to the next hop along the pre-computed
|
||||||
|
path.
|
||||||
|
|
||||||
|
This ADR defines the contract for these transit nodes: a single
|
||||||
|
component type (`TransitComponent`) that handles flit-aware forwarding
|
||||||
|
with wormhole cut-through semantics, used under multiple impl names
|
||||||
|
according to the conceptual role each instance plays.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Role
|
||||||
|
|
||||||
|
The Forwarding component (`TransitComponent` class) is a **stateless
|
||||||
|
transit node** in the simulation graph. It models any fabric position
|
||||||
|
where a message physically traverses but no semantic processing
|
||||||
|
happens.
|
||||||
|
|
||||||
|
Per traversal, the component:
|
||||||
|
|
||||||
|
1. Reads an incoming Transaction or Flit from an `in_port`.
|
||||||
|
2. Applies the configured per-component overhead (`overhead_ns`),
|
||||||
|
applied **once per Transaction** even across multi-flit payloads
|
||||||
|
(see D2).
|
||||||
|
3. Looks up the next hop along the Transaction's pre-computed `path`.
|
||||||
|
4. Forwards to the corresponding `out_port`; at the terminal node
|
||||||
|
(no next hop), signals `txn.done` once the `is_last` flit arrives.
|
||||||
|
|
||||||
|
The component **does NOT**:
|
||||||
|
|
||||||
|
- Decide routing — paths are pre-computed by the router (ADR-0002 /
|
||||||
|
ADR-0017 D2). Forwarding only executes the per-hop step.
|
||||||
|
- Model wire propagation or bandwidth occupancy — separate wire
|
||||||
|
processes between components handle that (ADR-0015 D2).
|
||||||
|
- Resolve addresses — the AddressResolver does that (ADR-0017 D9).
|
||||||
|
- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
|
||||||
|
handle that.
|
||||||
|
|
||||||
|
### D2. First-flit overhead model (header decode)
|
||||||
|
|
||||||
|
Per-Transaction `overhead_ns` is applied **exactly once**, at first
|
||||||
|
flit arrival:
|
||||||
|
|
||||||
|
- `_txn_decoded: set[int]` tracks which Transactions have already
|
||||||
|
paid the overhead at this node.
|
||||||
|
- On first-flit arrival for a Transaction: `yield self.run(env,
|
||||||
|
msg.txn.nbytes)` — pays the overhead.
|
||||||
|
- Subsequent flits of the same Transaction skip the overhead — they
|
||||||
|
pipeline through with no extra delay.
|
||||||
|
- On `is_last` flit: remove the Transaction from `_txn_decoded`.
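
The bookkeeping in the list above can be sketched as a plain function that totals the overhead one node charges for a stream of flits. This is an illustration only: `overhead_paid_ns` and the `(txn_id, is_last)` tuples are hypothetical simplifications of the real Flit objects, and no SimPy timing is modelled.

```python
def overhead_paid_ns(flits, overhead_ns):
    """Total overhead charged at one node.

    flits: iterable of (txn_id, is_last) in arrival order.
    """
    txn_decoded = set()  # transactions whose header was already decoded here
    total = 0.0
    for txn_id, is_last in flits:
        if txn_id not in txn_decoded:
            txn_decoded.add(txn_id)
            total += overhead_ns  # first flit pays the decode cost
        if is_last:
            txn_decoded.discard(txn_id)  # release per-transaction state
    return total


# Two interleaved transactions, three flits each.
arrivals = [(1, False), (2, False), (1, False), (2, False), (1, True), (2, True)]
total = overhead_paid_ns(arrivals, overhead_ns=2.0)
```

Six flits arrive but only two overheads are charged, one per Transaction, and the tracking set is empty again once both `is_last` flits have passed.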
|
||||||
|
|
||||||
|
This models real hardware behavior, where header decode and the
routing decision happen once on the first flit and payload flits then
stream through the same path (wormhole cut-through). Multi-hop
pipelining emerges naturally: each hop charges its own first-flit
overhead, while later flits of the same Transaction pay nothing at a
hop where the first flit has already paid.
|
||||||
|
|
||||||
|
### D3. Serial worker forwarding (preserves order)
|
||||||
|
|
||||||
|
The component's worker is a single SimPy process that consumes flits
|
||||||
|
from `_inbox` and forwards them serially in arrival order. The
|
||||||
|
component does NOT spawn `env.process(...)` per flit.
|
||||||
|
|
||||||
|
Rationale: if the first flit yields on `overhead_ns` while subsequent
|
||||||
|
flits run in parallel processes, the later flits can overtake the
|
||||||
|
first. This produces out-of-order delivery and lets the `is_last`
|
||||||
|
flit arrive at the destination before the first flit — corrupting
|
||||||
|
both the transaction's completion semantics and any flit-index-based
|
||||||
|
processing downstream.
|
||||||
|
|
||||||
|
### D4. Path-based next-hop routing
|
||||||
|
|
||||||
|
Routing is **not** a Forwarding-component concern. The Transaction
|
||||||
|
arrives with a pre-computed `path` (built by the router; ADR-0002 /
|
||||||
|
ADR-0017 D2). The component just looks up its own position in the
|
||||||
|
path and forwards to `path[index + 1]`:
|
||||||
|
|
||||||
|
```python
def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
```
|
||||||
|
|
||||||
|
If `next_hop` is found and present in `out_ports`, the flit is
|
||||||
|
forwarded. Otherwise (terminal node), `txn.done.succeed()` is
|
||||||
|
invoked when the `is_last` flit arrives.
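
To make the transit and terminal cases concrete, here is the same lookup as a free function with a sample path. The node ids other than `sip0.io0.io_cpu` are hypothetical and chosen only to resemble the naming used elsewhere in these ADRs.

```python
def next_hop_in_path(my_id, path):
    """Return the node after my_id on path, or None at the end or off-path."""
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None


# Hypothetical three-node command path.
path = ["sip0.io0.io_cpu", "sip0.io0.io_noc", "sip0.cube0.m_cpu"]

transit = next_hop_in_path("sip0.io0.io_noc", path)    # middle of the path
terminal = next_hop_in_path("sip0.cube0.m_cpu", path)  # last node
```

A `None` result is what drives the terminal branch: the component has nowhere to forward, so it completes the Transaction instead.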
|
||||||
|
|
||||||
|
### D5. Flit-aware mode with Non-Flit fallback
|
||||||
|
|
||||||
|
`_FLIT_AWARE = True` opts this component out of the base class's
|
||||||
|
flit-reassembly logic in `_fan_in`. Flits are placed directly on
|
||||||
|
`_inbox` (no reassembly), enabling per-flit handling in the worker
|
||||||
|
loop (D2, D3).
|
||||||
|
|
||||||
|
Non-Flit messages — zero-byte control Transactions and other
|
||||||
|
non-chunkified payloads — fall through to the base class's legacy
|
||||||
|
`_forward_txn` path via `env.process`. This preserves backward
|
||||||
|
compatibility for control-plane traffic that does not benefit from
|
||||||
|
flit-level processing.
|
||||||
|
|
||||||
|
### D6. Multi-stream merging at the base class
|
||||||
|
|
||||||
|
Multi-stream FIFO merging at routers is the base class's
|
||||||
|
responsibility, not Forwarding's. The base class's `_fan_in` spawns
|
||||||
|
one process per `in_port`; all push to a single shared `_inbox`.
|
||||||
|
Flits from different upstream streams therefore interleave at
|
||||||
|
flit granularity in `_inbox`'s FIFO order.
|
||||||
|
|
||||||
|
The Forwarding worker simply consumes `_inbox` in arrival order —
|
||||||
|
correctly modeling per-router multi-flow arbitration as
|
||||||
|
fair-FIFO over the shared inbox.
|
||||||
|
|
||||||
|
### D7. Single implementation under multiple impl names
|
||||||
|
|
||||||
|
A single `TransitComponent` class is registered under four impl names
|
||||||
|
in `components.yaml`:
|
||||||
|
|
||||||
|
- `builtin.forwarding` — generic forwarding (e.g., `io_noc`,
|
||||||
|
`noc_router`, UCIe conn bridges)
|
||||||
|
- `builtin.switch` — tray-level switch
|
||||||
|
- `builtin.noc` — cube-level NOC fabric (legacy singleton; current
|
||||||
|
NOC routers use `builtin.forwarding`)
|
||||||
|
- `builtin.ucie` — UCIe protocol endpoint
|
||||||
|
|
||||||
|
All four aliases instantiate the same class with the same behavior.
|
||||||
|
Per-instance differentiation lives only in `attrs.overhead_ns`.
|
||||||
|
Separate impl names exist as intent tags for readability and to
|
||||||
|
allow future divergence without backward-incompatible config
|
||||||
|
changes.
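
The aliasing can be pictured as one class bound to several registry keys. The sketch below is an assumption about shape only: the `REGISTRY` dict, the `build` helper, and the stub `TransitComponent` constructor are hypothetical, and the real registration happens through `components.yaml`, whose loader is not shown here.

```python
class TransitComponent:
    """Stub of the shared class; only the tunable attribute is modelled."""

    def __init__(self, overhead_ns: float = 0.0):
        self.overhead_ns = overhead_ns


# Four impl names, one implementation.
REGISTRY = {
    name: TransitComponent
    for name in ("builtin.forwarding", "builtin.switch",
                 "builtin.noc", "builtin.ucie")
}


def build(impl: str, attrs: dict) -> TransitComponent:
    """Instantiate the class registered under impl with its attrs."""
    return REGISTRY[impl](overhead_ns=attrs.get("overhead_ns", 0.0))


switch = build("builtin.switch", {"overhead_ns": 5.0})
bridge = build("builtin.forwarding", {})
```

Because the alias is only a lookup key, giving `builtin.ucie` its own class later would mean changing one registry entry while existing configs keep their impl names.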
|
||||||
|
|
||||||
|
### D8. Configurable `overhead_ns`
|
||||||
|
|
||||||
|
A single attribute drives per-instance latency:
|
||||||
|
|
||||||
|
| Usage site | impl name | overhead_ns |
| --- | --- | --- |
| Tray-level switch | `builtin.switch` | 5.0 |
| Cube NOC router | `builtin.forwarding` | 2.0 |
| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |
|
||||||
|
|
||||||
|
Default is 0.0. The attribute is read at each `run()` invocation, so
|
||||||
|
dynamic reconfiguration is possible but not currently used.
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- A single class handles all transit-node roles in the simulation
|
||||||
|
graph — minimal code surface for a high-population component type.
|
||||||
|
- Flit-aware processing + serial worker preserves wormhole semantics
|
||||||
|
across multi-hop paths without per-flit process overhead.
|
||||||
|
- `overhead_ns` is the only per-instance tunable; routing, BW, and
|
||||||
|
address resolution stay cleanly separated in their own components /
|
||||||
|
modules.
|
||||||
|
- Multi-stream merging emerges from the base-class structure; no
|
||||||
|
router-specific logic duplicates fair-FIFO arbitration.
|
||||||
|
- Non-Flit fallback path keeps control-plane traffic working without
|
||||||
|
forcing every message into the flit framework.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- The single class hides usage-site intent inside `attrs.overhead_ns`
|
||||||
|
configuration; readers must consult `topology.yaml` +
|
||||||
|
`components.yaml` to see which impl name maps to which behavior
|
||||||
|
class.
|
||||||
|
- Per-flit serial worker is a bottleneck if `overhead_ns` is large
|
||||||
|
and many concurrent transactions arrive at the same router; current
|
||||||
|
values (0–8 ns) make this negligible.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- ADR-0002 (Routing distance — path computation)
|
||||||
|
- ADR-0015 D1 (Component port model)
|
||||||
|
- ADR-0015 D2 (Wire process — BW + propagation, separate from this
|
||||||
|
component)
|
||||||
|
- ADR-0015 D6 (Transit cube forwarding pattern)
|
||||||
|
- ADR-0016 D1 (IO chiplet io_noc — uses this component)
|
||||||
|
- ADR-0017 D1 (Cube NOC routers — use this component)
|
||||||
|
- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this
|
||||||
|
component)
|
||||||
|
- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
|
||||||
|
multi-stream merge semantics)
|
||||||
@@ -18,7 +18,7 @@ We define stable, minimal message schemas for Host ↔ IO_CPU so that:
|
|||||||
- IO_CPU-internal fan-out/aggregation can evolve independently,
|
- IO_CPU-internal fan-out/aggregation can evolve independently,
|
||||||
- completion and failure propagation is deterministic.
|
- completion and failure propagation is deterministic.
|
||||||
|
|
||||||
We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
|
We also require PE-tagging (Scheme A): each shard explicitly carries (sip,cube,pe)
|
||||||
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
|
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -93,7 +93,7 @@ Rules:
|
|||||||
Mandatory fields:
|
Mandatory fields:
|
||||||
|
|
||||||
- common envelope fields (D3)
|
- common envelope fields (D3)
|
||||||
- destination placement tags (A 방식):
|
- destination placement tags (Scheme A):
|
||||||
- `dst_sip: int`
|
- `dst_sip: int`
|
||||||
- `dst_cube: int`
|
- `dst_cube: int`
|
||||||
- `dst_pe: int`
|
- `dst_pe: int`
|
||||||
@@ -130,7 +130,7 @@ Notes:
|
|||||||
Mandatory fields:
|
Mandatory fields:
|
||||||
|
|
||||||
- common envelope fields (D3)
|
- common envelope fields (D3)
|
||||||
- source placement tags (A 방식):
|
- source placement tags (Scheme A):
|
||||||
- `src_sip: int`
|
- `src_sip: int`
|
||||||
- `src_cube: int`
|
- `src_cube: int`
|
||||||
- `src_pe: int`
|
- `src_pe: int`
|
||||||
@@ -183,7 +183,7 @@ Tensor arg (mandatory):
|
|||||||
|
|
||||||
- `shards: list[TensorShard]`
|
- `shards: list[TensorShard]`
|
||||||
|
|
||||||
`TensorShard` MUST have (A 방식 강제):
|
`TensorShard` MUST have (Scheme A enforced):
|
||||||
|
|
||||||
- `sip: int`
|
- `sip: int`
|
||||||
- `cube: int`
|
- `cube: int`
|
||||||
|
|||||||
@@ -1,519 +0,0 @@
|
|||||||
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
|
|
||||||
|
|
||||||
## Status
|
|
||||||
|
|
||||||
Accepted
|
|
||||||
|
|
||||||
## Context
|
|
||||||
|
|
||||||
The current simulation models **timing only**.
|
|
||||||
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
|
|
||||||
but do not actually read tensor data or perform computations.
|
|
||||||
|
|
||||||
### Required Capabilities
|
|
||||||
|
|
||||||
1. Must be able to store and read actual data in HBM/TCM/SRAM
|
|
||||||
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
|
|
||||||
3. Must minimize simulation performance degradation
|
|
||||||
|
|
||||||
### Constraints
|
|
||||||
|
|
||||||
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
|
|
||||||
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
|
|
||||||
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
|
|
||||||
- Kernel functions must remain plain Python functions (no generator/async transformation)
|
|
||||||
|
|
||||||
### Design Exploration Results
|
|
||||||
|
|
||||||
| Option | Approach | Verdict |
|--------|----------|---------|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
### D1. 2-Pass Execution Model — Phase 0 Elimination
|
|
||||||
|
|
||||||
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
|
|
||||||
|
|
||||||
Before:
|
|
||||||
```
Phase 0: Kernel → PeCommand list (no data, no branching)
Phase 1: Replay PeCommand list via SimPy (timing only)
```
|
|
||||||
|
|
||||||
After:
|
|
||||||
```
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
  - Memory read/write: SimPy timing + MemoryStore actual data
  - Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
  - Dynamic control flow possible (tl.load returns actual data)

Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
```
|
|
||||||
|
|
||||||
This ADR **extends Phase 1 to be data-aware for memory operations only**.
|
|
||||||
Phase 1 handles latency/BW bottleneck analysis and memory data tracking;
Phase 2 handles correctness verification of GEMM/Math computation.
|
|
||||||
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
|
|
||||||
|
|
||||||
### D2. Op Log Recording — ComponentBase Hook
|
|
||||||
|
|
||||||
Op log recording is performed as a **hook in the component base class**.
|
|
||||||
Individual component implementations are not modified.
|
|
||||||
|
|
||||||
```python
class ComponentBase:
    def _on_process_start(self, env, msg):
        # Log only messages that self-declare as data ops (see D4).
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_start(env.now, self.node.id, msg)

    def _on_process_end(self, env, msg):
        # Same filter on the way out, so start/end records always pair up.
        if self._op_logger and getattr(msg, 'data_op', False):
            self._op_logger.record_end(env.now, self.node.id, msg)
```
|
|
||||||
|
|
||||||
Hooks are called before and after `run()` within `_forward_txn()`.
|
|
||||||
`_op_logger` is optional. When it is absent, each hook reduces to a single attribute check, so the overhead is negligible.
|
|
||||||
|
|
||||||
**Hook timing definitions**:
|
|
||||||
|
|
||||||
| Timing | Meaning |
|--------|---------|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
|
|
||||||
|
|
||||||
Link traversal latency is not included in t_start/t_end.
|
|
||||||
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
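
As a rough illustration of the last point, link latency can be recovered from two op records. This is a minimal sketch: the `Span` tuple, the `link_latency` helper, the component names, and the timestamps are all hypothetical and only mirror the `t_start`/`t_end` definitions above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for the timing fields of one op record."""
    component_id: str
    t_start: float  # ns, service start (just before run() entry)
    t_end: float    # ns, service completion (just after run() returns)


def link_latency(sender: Span, receiver: Span) -> float:
    """Difference between the receiver's t_start and the sender's t_end."""
    return receiver.t_start - sender.t_end


dma = Span("pe_dma", t_start=100.0, t_end=140.0)
hbm = Span("hbm_ctrl", t_start=152.0, t_end=190.0)
print(link_latency(dma, hbm))  # 12.0
```

Because the hooks never see the link itself, this subtraction is the only place link traversal time shows up in the op log.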
|
|
||||||
|
|
||||||
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
|
|
||||||
|
|
||||||
The existing Phase 0 (kernel → PeCommand list) is eliminated,
|
|
||||||
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
|
|
||||||
|
|
||||||
#### Operating Principle
|
|
||||||
|
|
||||||
greenlet is a C extension that provides cooperative context switching.
|
|
||||||
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
|
|
||||||
to perform timing simulation, and after completion, returns to the kernel with actual data.
|
|
||||||
|
|
||||||
```
SimPy loop (parent greenlet)            Kernel (child greenlet)
────────────────────────────            ───────────────────────
g.switch() ───────────────────────────→ Kernel starts
                                        a = tl.load(ptr, ...)
                                          internal: parent.switch(DmaReadCmd)
cmd = DmaReadCmd ←───────────────────── (kernel paused)
yield DmaReadMsg(...)
yield env.timeout(dma_latency)
data = memory_store.read(...)
g.switch(data) ───────────────────────→ (kernel resumed)
                                        a = data           ← actual numpy array
                                        if a[0][0] > 0.5:  ← branching possible
                                          ...
```
|
|
||||||
|
|
||||||
The kernel is maintained as a **plain Python function**.
|
|
||||||
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
|
|
||||||
|
|
||||||
#### KernelRunner — Framework Layer
|
|
||||||
|
|
||||||
The greenlet loop lives in the framework-layer **KernelRunner**, not in the PE_CPU component.
|
|
||||||
|
|
||||||
```python
from greenlet import greenlet

# KernelRunner (framework: greenlet ↔ SimPy bridge)
class KernelRunner:
    def run(self, env, kernel_fn, args, store):
        g = greenlet(self._run_kernel)
        cmd = g.switch(kernel_fn, args)  # run the kernel up to its first tl.* call

        while cmd is not None:  # None: the kernel greenlet has finished
            if isinstance(cmd, DmaReadCmd):
                yield from self._dispatch_dma(env, cmd)
                # NOTE: assumes the command carries src_space, to match the
                # MemoryStore.read(space, addr, shape, dtype) signature in D7.
                data = store.read(cmd.src_space, cmd.src_addr, cmd.shape, cmd.dtype)
                cmd = g.switch(data)   # resume with actual data
            elif isinstance(cmd, GemmCmd):
                yield from self._dispatch_gemm(env, cmd)
                cmd = g.switch()       # resume (no data)
            elif isinstance(cmd, DmaWriteCmd):
                # NOTE: dst_space is likewise assumed, per MemoryStore.write in D7.
                store.write(cmd.dst_space, cmd.dst_addr, cmd.data)  # visibility = issue time
                yield from self._dispatch_dma(env, cmd)             # timing only
                cmd = g.switch()
            else:
                # Fail loudly instead of spinning on an unhandled command type.
                raise TypeError(f"unsupported kernel command: {type(cmd).__name__}")

# PE_CPU (component: kept simple, unaware of greenlet)
# Method of PE_CPU; kernel_fn, args, and store come from its launch context.
def _execute_kernel(self, env):
    runner = KernelRunner(self.ctx)
    yield from runner.run(env, kernel_fn, args, store)
```
|
|
||||||
|
|
||||||
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
|
|
||||||
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
|
|
||||||
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
|
|
||||||
the component base class hooks automatically record them.
|
|
||||||
|
|
||||||
**Layer separation**:
|
|
||||||
- **Kernel code**: plain function, unaware of greenlet
|
|
||||||
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
|
|
||||||
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
|
|
||||||
- **ComponentBase hook**: the sole path for op_log recording
|
|
||||||
- **PE_CPU**: only calls KernelRunner, replaceable as a component
|
|
||||||
|
|
||||||
#### Handling Differences Between Memory Read/Write and Compute
|
|
||||||
|
|
||||||
| Operation | In Phase 1 | In Phase 2 |
|-----------|-----------|-----------|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | (none) |
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | (none) |
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
|
|
||||||
|
|
||||||
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
|
|
||||||
GEMM/Math operations are batch-executed in Phase 2 (performance separation).
|
|
||||||
|
|
||||||
#### Store Visibility Rule
|
|
||||||
|
|
||||||
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
|
|
||||||
SimPy DMA timing is simulated separately afterward.
|
|
||||||
|
|
||||||
This is an intentional separation of timing and visibility:
|
|
||||||
- **visibility**: the point at which it is reflected in MemoryStore = when `store.write()` is called
|
|
||||||
- **timing**: the point at which DMA latency completes in SimPy
|
|
||||||
|
|
||||||
This separation allows a load immediately after a store to see the latest data in dynamic control flow.
|
|
||||||
|
|
||||||
#### Result Handle Semantics
|
|
||||||
|
|
||||||
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
|
|
||||||
|
|
||||||
The key contract in Phase 1:
|
|
||||||
|
|
||||||
1. **All compute handles are always considered pending in Phase 1.**
|
|
||||||
2. `tl.wait(handle)` **expresses timing synchronization only**
|
|
||||||
and does not make the handle ready.
|
|
||||||
3. Accessing the handle's actual result data (`handle.data`, element access,
|
|
||||||
numpy conversion, etc.) is **only possible in Phase 2**.
|
|
||||||
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
|
|
||||||
5. In contrast, `tl.load()` returns actual data in Phase 1, so
|
|
||||||
**memory-read-based control flow is supported**.
|
|
||||||
|
|
||||||
| Handle state | Phase | Allowed operations |
|--------------|-------|--------------------|
| pending | Phase 1 | `tl.wait(handle)`: timing synchronization only |
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
| pending | Phase 1 | **Data access not allowed**: value-based branching not possible |
| ready | Phase 2 | Actual numpy data access, verification |
|
|
||||||
|
|
||||||
This restriction is intentional. If computations were executed in Phase 1,
|
|
||||||
the SimPy single-thread would block, defeating the purpose of 2-pass separation.
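
One way to enforce this contract is to make the handle itself refuse data access while it is pending. The following is a minimal sketch only: `ComputeHandle`, `PendingHandleError`, and `_materialize` are hypothetical names chosen for illustration, not the framework's actual classes.

```python
class PendingHandleError(RuntimeError):
    """Raised when Phase 1 code tries to read a compute result (hypothetical name)."""


class ComputeHandle:
    """Hypothetical handle: pending in Phase 1, ready once Phase 2 fills it in."""

    def __init__(self, op_index: int):
        self.op_index = op_index
        self._data = None
        self._ready = False

    def _materialize(self, data) -> None:
        # Would be called by the Phase 2 executor after the real computation.
        self._data = data
        self._ready = True

    @property
    def data(self):
        if not self._ready:
            raise PendingHandleError("compute results are only available in Phase 2")
        return self._data

    def __bool__(self) -> bool:
        # Truth-value evaluation is value-based branching, so it is rejected too.
        if not self._ready:
            raise PendingHandleError("cannot branch on a pending compute handle")
        return bool(self._data)


h = ComputeHandle(op_index=0)
try:
    h.data
except PendingHandleError as exc:
    print(f"Phase 1: {exc}")

h._materialize([[1.0, 2.0]])
print(h.data)  # [[1.0, 2.0]]
```

Raising early turns an accidental `if handle:` in a kernel into an immediate, explainable failure rather than a silently wrong branch.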
|
|
||||||
|
|
||||||
#### Phase 1 Materialization — Future Extension
|
|
||||||
|
|
||||||
If Phase 1 eager execution becomes necessary for small operations
|
|
||||||
(scalar, small reduction) in the future, selective materialization can be supported
|
|
||||||
by adding a `materialized_in_phase1: bool` flag to the op record.
|
|
||||||
This is not implemented in the current scope.
|
|
||||||
|
|
||||||
### D4. data_op Flag — Message Self-Declaration
|
|
||||||
|
|
||||||
The logging target is determined by the `data_op` attribute on the message instance,
|
|
||||||
not by message type. The framework does not hardcode message types.
|
|
||||||
|
|
||||||
```python
class MsgBase:
    data_op: bool = False   # default: no logging

class DmaReadCmd(MsgBase):
    data_op = True          # memory transfer → logging

class GemmCmd(MsgBase):
    data_op = True          # compute → logging

class MathCmd(MsgBase):
    data_op = True          # compute → logging
```
|
|
||||||
|
|
||||||
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
|
|
||||||
enables automatic logging without modifying framework code.
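
The effect of the flag can be sketched in a few lines. Everything beyond `MsgBase`, `IpcqMsg`, and the `getattr` check from D2 is hypothetical here: `CreditMsg`, `ListLogger`, and the free function `on_process_start` exist only to make the example self-contained.

```python
class MsgBase:
    data_op: bool = False   # default: no logging


class IpcqMsg(MsgBase):     # new message type; only the flag is set
    data_op = True


class CreditMsg(MsgBase):   # hypothetical control message, left at the default
    pass


class ListLogger:
    """Hypothetical logger that just collects (time, component, message type)."""

    def __init__(self):
        self.entries = []

    def record_start(self, now, component_id, msg):
        self.entries.append((now, component_id, type(msg).__name__))


def on_process_start(logger, now, component_id, msg):
    # Same check as the ComponentBase hook in D2: no isinstance() on message types.
    if logger and getattr(msg, "data_op", False):
        logger.record_start(now, component_id, msg)


logger = ListLogger()
on_process_start(logger, 10.0, "pe0", IpcqMsg())
on_process_start(logger, 11.0, "pe0", CreditMsg())
print(logger.entries)  # [(10.0, 'pe0', 'IpcqMsg')]
```

The hook never names a concrete message class, which is what lets new message types opt in without touching framework code.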
|
|
||||||
|
|
||||||
### D5. Op Log Structure
|
|
||||||
|
|
||||||
#### Op Classification Scheme
|
|
||||||
|
|
||||||
A two-level classification is used:
|
|
||||||
|
|
||||||
| Level | Field | Values | Role |
|-------|-------|--------|------|
| 1 | `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
| 2 | `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
|
|
||||||
|
|
||||||
#### OpRecord Definition
|
|
||||||
|
|
||||||
```python
from dataclasses import dataclass, field

@dataclass
class OpRecord:
    t_start: float              # SimPy time (ns), service start
    t_end: float                # SimPy time (ns), service completion
    component_id: str           # e.g. "sip0.cube0.pe0.pe_gemm"
    op_kind: str                # "memory" | "gemm" | "math"
    op_name: str                # specific operation name
    params: dict                # per-operation parameters (see below)
    # Optional. Currently based on the in-memory record index; may be replaced
    # with a stable op_id in the future.
    dependency_ids: list[int] = field(default_factory=list)
```
|
|
||||||
|
|
||||||
#### dependency_ids Generation Rules
|
|
||||||
|
|
||||||
`dependency_ids` is **optional**, and by default the executor performs
|
|
||||||
address-based dependency inference (see D6).
|
|
||||||
|
|
||||||
Explicit setting is only needed when precise execution ordering is required:
|
|
||||||
- **Default (address-based inference)**: the executor analyzes read/write sets to
|
|
||||||
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
|
|
||||||
- **Explicit setting**: set when logical dependencies cannot be expressed via addresses
|
|
||||||
at the TLContext or command generation stage.
|
|
||||||
Example: completion handle-based synchronization — handle dependencies depend on
|
|
||||||
logical completion order rather than memory addresses, so they cannot be captured
|
|
||||||
by address inference.
|
|
||||||
|
|
||||||
#### op_log Ordering
|
|
||||||
|
|
||||||
The op_log maintains **stable ordering** based on `t_start`.
|
|
||||||
Records with the same `t_start` preserve insertion order.
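
A stable sort is enough to obtain this ordering. The sketch below uses a cut-down, hypothetical `Rec` record rather than the full `OpRecord`, and relies only on the documented stability of Python's `sorted()`.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Hypothetical cut-down record: only the fields needed for ordering."""
    t_start: float
    op_name: str


# Records in the order they were appended (insertion order).
appended = [
    Rec(20.0, "gemm_f16"),
    Rec(10.0, "dma_read"),
    Rec(20.0, "exp"),
    Rec(10.0, "dma_write"),
]

# sorted() is stable: records with equal t_start keep their insertion order.
op_log = sorted(appended, key=lambda r: r.t_start)
print([r.op_name for r in op_log])
# ['dma_read', 'dma_write', 'gemm_f16', 'exp']
```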
|
|
||||||
|
|
||||||
#### params Details
|
|
||||||
|
|
||||||
**memory (dma_read / dma_write)**:
|
|
||||||
```python
{
    "src_addr": int,      # source address (byte)
    "dst_addr": int,      # destination address (byte)
    "nbytes": int,        # transfer size
    "src_space": str,     # "hbm" | "tcm" | "sram"
    "dst_space": str,     # "hbm" | "tcm" | "sram"
}
```
|
|
||||||
|
|
||||||
**gemm**:
|
|
||||||
```python
{
    "src_a_addr": int,    # operand A address
    "src_b_addr": int,    # operand B address
    "dst_addr": int,      # output address
    "shape_a": tuple,     # e.g. (128, 256)
    "shape_b": tuple,     # e.g. (256, 128)
    "shape_out": tuple,   # e.g. (128, 128)
    "dtype_in": str,      # e.g. "f16"
    "dtype_acc": str,     # accumulation dtype, e.g. "f32"
    "dtype_out": str,     # output dtype, e.g. "f16"
    "transpose_a": bool,
    "transpose_b": bool,
    "layout_a": str,      # "row_major" | "col_major"
    "layout_b": str,
    "layout_out": str,
    "addr_space": str,    # "tcm" (GEMM operands are always in TCM)
}
```
|
|
||||||
|
|
||||||
**math**:
|
|
||||||
```python
{
    "op": str,                    # "exp" | "add" | "sum" | "where" | ...
    "input_addrs": list[int],     # list of operand addresses
    "input_shapes": list[tuple],
    "dst_addr": int,
    "shape_out": tuple,
    "dtype": str,
    "axis": int | None,           # reduction axis
    "addr_space": str,            # "tcm"
}
```
|
|
||||||
|
|
||||||
### D6. Phase 2 Executor
|
|
||||||
|
|
||||||
Phase 2 executes the op_log outside of SimPy.
|
|
||||||
|
|
||||||
```python
from itertools import groupby

class DataExecutor:
    def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
        self.op_log = op_log        # kept so run() can iterate it
        self.store = initial_store  # Takes the Phase 1 MemoryStore snapshot as input

    def run(self):
        # groupby only merges adjacent records, so this relies on the op_log
        # being ordered by t_start (see "op_log Ordering" in D5).
        for t, ops in groupby(self.op_log, key=lambda o: o.t_start):
            batch = list(ops)
            independent, sequential = self._classify(batch)
            self._execute_parallel(independent)
            self._execute_sequential(sequential)
```
|
|
||||||
|
|
||||||
**Parallel execution determination**:
|
|
||||||
|
|
||||||
Ops with the same `t_start` are considered **parallel candidates**.
|
|
||||||
The executor determines actual parallel execution based on the following criteria:
|
|
||||||
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
|
|
||||||
- Whether predecessor ops specified in `dependency_ids` have completed
|
|
||||||
|
|
||||||
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
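
The address-range check can be sketched as follows. This is an illustration under stated assumptions, not the executor's actual code: `Access`, `overlaps`, and `conflicts` are hypothetical names, ranges are half-open byte intervals, and all ranges are assumed to live in a single address space.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    """Hypothetical read/write sets of one op, as half-open byte ranges [start, end)."""
    reads: list[tuple[int, int]] = field(default_factory=list)
    writes: list[tuple[int, int]] = field(default_factory=list)


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def any_overlap(xs, ys) -> bool:
    return any(overlaps(x, y) for x in xs for y in ys)


def conflicts(first: Access, second: Access) -> bool:
    """True if `second` must not run in parallel with `first`."""
    raw = any_overlap(first.writes, second.reads)   # read-after-write
    war = any_overlap(first.reads, second.writes)   # write-after-read
    waw = any_overlap(first.writes, second.writes)  # write-after-write
    return raw or war or waw


gemm0 = Access(reads=[(0x0000, 0x1000)], writes=[(0x2000, 0x2800)])
gemm1 = Access(reads=[(0x0000, 0x1000)], writes=[(0x3000, 0x3800)])
exp0 = Access(reads=[(0x2000, 0x2800)], writes=[(0x4000, 0x4800)])

print(conflicts(gemm0, gemm1))  # False: shared input only, disjoint outputs
print(conflicts(gemm0, exp0))   # True: exp0 reads what gemm0 writes
```

Two ops that only read the same range never conflict, which is why GEMMs sharing an operand can still run in parallel.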
|
|
||||||
|
|
||||||
**Batch optimization**: Only independent ops with the same op_name **and identical
|
|
||||||
shape, dtype, layout, and transpose flags** are eligible for batching.
|
|
||||||
Example: identical shape GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
|
|
||||||
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
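
Batch eligibility amounts to grouping ops by a key built from those fields. The sketch below assumes the ops passed in have already been classified as independent; `batch_key` and `group_for_batching` are hypothetical helper names, and the parameter names follow the gemm `params` schema in D5.

```python
from collections import defaultdict


def batch_key(op_name: str, params: dict) -> tuple:
    """Ops may share a batch only if every field in this key matches."""
    return (
        op_name,
        params["shape_a"], params["shape_b"],
        params["dtype_in"], params["dtype_acc"], params["dtype_out"],
        params["layout_a"], params["layout_b"], params["layout_out"],
        params["transpose_a"], params["transpose_b"],
    )


def group_for_batching(ops: list[tuple[str, dict]]) -> dict[tuple, list[int]]:
    """Map each batch key to the indices of the ops that can be stacked together."""
    groups: dict[tuple, list[int]] = defaultdict(list)
    for index, (op_name, params) in enumerate(ops):
        groups[batch_key(op_name, params)].append(index)
    return dict(groups)


base = dict(shape_a=(128, 256), shape_b=(256, 128), dtype_in="f16", dtype_acc="f32",
            dtype_out="f16", layout_a="row_major", layout_b="row_major",
            layout_out="row_major", transpose_a=False, transpose_b=False)
ops = [
    ("gemm_f16", base),
    ("gemm_f16", dict(base)),                    # same key: batchable with op 0
    ("gemm_f16", dict(base, transpose_b=True)),  # differs in a transpose flag
]
print(list(group_for_batching(ops).values()))  # [[0, 1], [2]]
```

Each group with more than one member is a candidate for a single stacked `np.matmul` call; singleton groups run as ordinary ops.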
|
|
||||||
|
|
||||||
**Phase 2 execution order guarantee**:
|
|
||||||
|
|
||||||
Phase 2 does not consider data arrival timing,
|
|
||||||
and guarantees execution order solely through
|
|
||||||
dependencies (address-based inference + explicit dependency_ids).
|
|
||||||
|
|
||||||
### D7. Memory Store
|
|
||||||
|
|
||||||
`MemoryStore` logically follows byte-addressable semantics,
|
|
||||||
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
|
|
||||||
|
|
||||||
```python
import numpy as np

class MemoryStore:
    def write(self, space: str, addr: int, data: np.ndarray) -> None: ...
    def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
```
|
|
||||||
|
|
||||||
**Internal storage format: numpy ndarray**
|
|
||||||
|
|
||||||
MemoryStore stores tensors as **numpy ndarrays**.
|
|
||||||
|
|
||||||
| Candidate | store/load speed | Phase 2 compute | Verdict |
|-----------|-----------------|-----------------|---------|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
|
|
||||||
|
|
||||||
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
|
|
||||||
- read: **returns numpy array by reference** (no copy)
|
|
||||||
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
|
|
||||||
- dtype uses numpy dtypes (`np.float16`, `np.float32`, etc.); bf16 is not a built-in NumPy dtype and needs an extension type such as the `bfloat16` provided by `ml_dtypes`
|
|
||||||
- For byte-level access, convert via `.view(np.uint8)`
|
|
||||||
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
|
|
||||||
|
|
||||||
**read/write contract**:
|
|
||||||
|
|
||||||
- read/write operates on a **contiguous tensor** basis.
|
|
||||||
If non-contiguous stride views are needed, express them as separate copy ops.
|
|
||||||
- In the normal benchmark path, producer/consumer dtype match is expected.
|
|
||||||
Reinterpret cast is a permissive behavior for low-level memory validation
|
|
||||||
or special test cases.
|
|
||||||
- addr is byte-aligned, with minimum alignment = dtype size.
|
|
||||||
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
|
|
||||||
Shape mismatch is verified based on nbytes, and raises an error on mismatch.
|
|
||||||
- Correctness criteria follow address-range-based read/write semantics.
|
|
||||||
- A tensor object cache may be used as an implementation optimization,
|
|
||||||
but the canonical state is byte-addressable storage.
|
|
||||||
- At deploy time, the host injects initial tensor data.
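
The nbytes and alignment rules in this contract can be sketched as a standalone check. This is a hedged illustration: `ITEMSIZE`, `nbytes`, and `check_read` are hypothetical names, and the dtype strings are the short forms used elsewhere in this ADR.

```python
from math import prod

# Hypothetical itemsize table keyed by the dtype strings used in this ADR.
ITEMSIZE = {"f16": 2, "bf16": 2, "f32": 4}


def nbytes(shape: tuple[int, ...], dtype: str) -> int:
    return prod(shape) * ITEMSIZE[dtype]


def check_read(written_shape, written_dtype, read_shape, read_dtype, addr: int) -> None:
    """Raise if a read does not cover exactly the bytes that were written."""
    if addr % ITEMSIZE[read_dtype] != 0:
        raise ValueError(f"addr {addr:#x} is not aligned to {read_dtype}")
    have = nbytes(written_shape, written_dtype)
    want = nbytes(read_shape, read_dtype)
    if have != want:
        raise ValueError(f"nbytes mismatch: stored {have}, requested {want}")


# Reinterpret cast: 128x128 f32 read back as 128x256 f16 covers the same bytes.
check_read((128, 128), "f32", (128, 256), "f16", addr=0x1000)

try:
    check_read((128, 128), "f32", (128, 128), "f16", addr=0x1000)
except ValueError as exc:
    print(exc)  # nbytes mismatch: stored 65536, requested 32768
```

Checking byte counts rather than shapes is what allows a reinterpret cast to pass while a genuinely short or long read is rejected.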
|
|
||||||
|
|
||||||
### D8. Benchmark Kernel Code
|
|
||||||
|
|
||||||
The benchmark's **user code API is not changed**.
|
|
||||||
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
|
|
||||||
|
|
||||||
However, internal command/message schemas may be extended to include metadata
|
|
||||||
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
|
|
||||||
|
|
||||||
### D9. No Component Changes
|
|
||||||
|
|
||||||
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
|
|
||||||
Op log recording is the responsibility of the ComponentBase hook.
|
|
||||||
When custom components are replaced, only the timing model changes,
|
|
||||||
and Phase 2 data execution is unaffected.
|
|
||||||
|
|
||||||
### D10. Phase 2 is Optional
|
|
||||||
|
|
||||||
```python
engine = GraphEngine(graph)
engine.run(benchmark)                      # Phase 1: timing only
result = engine.get_timing_result()

if verify_data:
    # DataExecutor (D6) also takes the Phase 1 MemoryStore snapshot;
    # `initial_store` stands for that snapshot here.
    executor = DataExecutor(engine.op_log, initial_store)  # Phase 2: data
    executor.run()
    executor.verify(expected_output)
```
|
|
||||||
|
|
||||||
If only timing analysis is needed, Phase 2 is skipped.
|
|
||||||
If the op_logger is deactivated, Phase 1 performance is identical to the original.
|
|
||||||
|
|
||||||
### D11. Verification Contract
|
|
||||||
|
|
||||||
Basic verification **compares the final output tensor** against a reference backend (numpy).
|
|
||||||
|
|
||||||
Per-dtype tolerance policy:
|
|
||||||
|
|
||||||
| dtype | Comparison method | Tolerance |
|-------|-------------------|-----------|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
| int types | `np.array_equal` | exact |
|
|
||||||
|
|
||||||
- Default mode: compare final output only (end-to-end correctness)
|
|
||||||
- Debug mode: can compare intermediate tensors on a per-op basis
|
|
||||||
(MemoryStore snapshot at each op boundary)
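
To make the tolerance policy concrete, the sketch below re-implements the element-wise criterion that `np.allclose` applies, `|a - b| <= atol + rtol * |b|`, using plain Python lists so that it runs without numpy. `TOLERANCE`, `close`, and `verify` are hypothetical names; a real check would call `np.allclose` or `np.array_equal` on the output tensors directly.

```python
# (rtol, atol) per dtype, copied from the tolerance table above.
TOLERANCE = {"f32": (1e-5, 1e-5), "f16": (1e-3, 1e-3), "bf16": (1e-2, 1e-2)}


def close(actual: float, expected: float, dtype: str) -> bool:
    """Element-wise criterion used by np.allclose: |a - b| <= atol + rtol * |b|."""
    rtol, atol = TOLERANCE[dtype]
    return abs(actual - expected) <= atol + rtol * abs(expected)


def verify(actual: list[float], expected: list[float], dtype: str) -> bool:
    if dtype not in TOLERANCE:  # integer types: exact comparison
        return actual == expected
    return len(actual) == len(expected) and all(
        close(a, e, dtype) for a, e in zip(actual, expected)
    )


expected = [1.0, 100.0]
actual = [1.0015, 100.05]
print(verify(actual, expected, "f16"))  # True
print(verify(actual, expected, "f32"))  # False
```

Note that the criterion is asymmetric: the relative term scales with the reference value, so the reference backend's output belongs in the `expected` position.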
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
- **Compute-result-based control flow**: not supported.
|
|
||||||
All compute handles are in the pending state during Phase 1;
`tl.wait()` expresses timing synchronization only and does not imply data readiness.
|
|
||||||
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
|
|
||||||
is **treated as an error**.
|
|
||||||
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
|
|
||||||
Phase 1 materialization is a future extension (see D3).
|
|
||||||
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
|
|
||||||
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
|
|
||||||
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
|
|
||||||
and do not reproduce the actual hardware PE microarchitecture.
|
|
||||||
|
|
||||||
## Open Questions
|
|
||||||
|
|
||||||
- **Aliasing / slice view**: How to represent slice/views referencing the same
|
|
||||||
backing storage in MemoryStore (stride-based view vs copy semantics)
|
|
||||||
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
|
|
||||||
communication as memory ops or introduce a separate op_kind
|
|
||||||
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
|
|
||||||
(in-memory list vs disk-backed streaming)
|
|
||||||
- **Fused operation**: Whether to record tl.composite's tiled pipeline
|
|
||||||
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
|
|
||||||
- **Math op schema generalization**: The current math params have a simple structure,
|
|
||||||
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
|
|
||||||
scalar/immediate operands, where/mask expressions, etc.
|
|
||||||
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
|
|
||||||
replacement with stable op_id is needed when introducing streaming/disk-backed mode
|
|
||||||
- **Phase 1 materialization policy**: See Future Extension in D3.
|
|
||||||
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
|
|
||||||
needs to be defined
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Consequences
|
|
||||||
|
|
||||||
### Positive
|
|
||||||
|
|
||||||
- Minimal impact on SimPy simulation performance (only op_log append added)
|
|
||||||
- Free to use multi-threading/GPU in Phase 2
|
|
||||||
- Component replaceability preserved (ADR-0015 design philosophy maintained)
|
|
||||||
- No changes needed to benchmark user code API
|
|
||||||
- When adding new message types, only set the data_op flag
|
|
||||||
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
|
|
||||||
- `tl.load()` returns actual data, making kernel debugging easier
|
|
||||||
|
|
||||||
### Negative
|
|
||||||
|
|
||||||
- op_log memory usage (for large-scale simulations)
|
|
||||||
- Phase 2 execution time is proportional to tensor size (large GEMM)
|
|
||||||
- Dynamic branching based on pending handles (incomplete computations) not possible
|
|
||||||
(computations execute in Phase 2, result values are undetermined in Phase 1).
|
|
||||||
Memory-data-based branching is supported via greenlet.
|
|
||||||
- greenlet C extension dependency added (pip install greenlet)
|
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
# ADR-0020: 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
|
# ADR-0020: 2-Pass Data Execution Model (Timing / Data Separation)
|
||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
@@ -6,65 +6,65 @@ Accepted
|
|||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
현재 시뮬레이션은 **타이밍만** 모델링한다.
|
The current simulation models **timing only**.
|
||||||
`tl.load()`, `tl.composite(op="gemm")` 등은 SimPy latency를 생성하지만,
|
`tl.load()`, `tl.composite(op="gemm")`, etc. generate SimPy latencies,
|
||||||
실제 텐서 데이터를 읽거나 연산하지 않는다.
|
but do not actually read tensor data or perform computations.
|
||||||
|
|
||||||
### 필요한 기능
|
### Required Capabilities
|
||||||
|
|
||||||
1. HBM/TCM/SRAM에 실제 데이터를 저장하고 읽을 수 있어야 한다
|
1. Must be able to store and read actual data in HBM/TCM/SRAM
|
||||||
2. PE_GEMM, PE_MATH가 실제 행렬 연산을 수행하고 결과를 검증할 수 있어야 한다
|
2. PE_GEMM, PE_MATH must be able to perform actual matrix operations and verify results
|
||||||
3. 시뮬레이션 성능 저하를 최소화해야 한다
|
3. Must minimize simulation performance degradation
|
||||||
|
|
||||||
### 제약 조건
|
### Constraints
|
||||||
|
|
||||||
- SimPy는 single-thread 이벤트 루프 — numpy matmul을 안에서 하면 전체가 block
|
- SimPy is a single-thread event loop — running numpy matmul inside it blocks everything
|
||||||
- 컴포넌트는 교체 가능해야 한다 (ADR-0015) — 프레임워크 요구사항이 구현에 침투하면 안 됨
|
- Components must be replaceable (ADR-0015) — framework requirements must not leak into implementations
|
||||||
- 벤치마크 커널은 명령형 코드(tl.load → tl.composite → tl.wait) — 같은 코드를 재사용해야 함
|
- Benchmark kernels are imperative code (tl.load → tl.composite → tl.wait) — the same code must be reused
|
||||||
- 커널 함수는 plain Python function으로 유지해야 한다 (generator/async 변환 불가)
|
- Kernel functions must remain plain Python functions (no generator/async transformation)
|
||||||
|
|
||||||
### 설계 탐색 결과
|
### Design Exploration Results
|
||||||
|
|
||||||
| Option | 방식 | 판정 |
|
| Option | Approach | Verdict |
|
||||||
|--------|------|------|
|
|--------|----------|---------|
|
||||||
| SimPy 내 직접 실행 | GEMM을 SimPy 안에서 numpy 호출 | 탈락: single-thread block |
|
| Direct execution in SimPy | Call numpy GEMM inside SimPy | Rejected: single-thread block |
|
||||||
| SimPy + ThreadPool | future.submit → timeout → result() | 탈락: back-to-back 요청 시 result()에서 block |
|
| SimPy + ThreadPool | future.submit → timeout → result() | Rejected: blocks on result() for back-to-back requests |
|
||||||
| Symbolic + lazy | 메타데이터만 추적, 나중에 실행 | 탈락: control-flow dependent 읽기 처리 곤란 |
|
| Symbolic + lazy | Track metadata only, execute later | Rejected: difficult to handle control-flow dependent reads |
|
||||||
| **2-pass (채택)** | Phase 1: 타이밍, Phase 2: 데이터 | 완전 분리, 성능 영향 없음 |
|
| **2-pass (adopted)** | Phase 1: timing, Phase 2: data | Full separation, no performance impact |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
### D1. 2-Pass 실행 모델 — Phase 0 제거
|
### D1. 2-Pass Execution Model — Phase 0 Elimination
|
||||||
|
|
||||||
기존의 3단계(Phase 0 → Phase 1 → Phase 2)를 **2단계로 통합**한다.
|
The existing 3 stages (Phase 0 → Phase 1 → Phase 2) are **consolidated into 2 stages**.
|
||||||
|
|
||||||
기존:
|
Before:
|
||||||
```
|
```
|
||||||
Phase 0: 커널 → PeCommand 리스트 (데이터 없음, 분기 불가)
|
Phase 0: Kernel → PeCommand list (no data, no branching)
|
||||||
Phase 1: PeCommand 리스트를 SimPy replay (타이밍만)
|
Phase 1: Replay PeCommand list via SimPy (timing only)
|
||||||
```
|
```
|
||||||
|
|
||||||
변경:
|
After:
|
||||||
```
|
```
|
||||||
Phase 1 (타이밍): 커널 + SimPy 통합 실행 — greenlet 기반
|
Phase 1 (timing): Kernel + SimPy integrated execution — greenlet-based
|
||||||
- 메모리 읽기/쓰기: SimPy 타이밍 + MemoryStore 실제 데이터
|
- Memory read/write: SimPy timing + MemoryStore actual data
|
||||||
- 연산 (GEMM/Math): SimPy 타이밍 + op_log 기록 (실제 연산은 Phase 2)
|
- Compute (GEMM/Math): SimPy timing + op_log recording (actual computation in Phase 2)
|
||||||
- dynamic control flow 가능 (tl.load가 실제 데이터 반환)
|
- Dynamic control flow possible (tl.load returns actual data)
|
||||||
|
|
||||||
Phase 2 (데이터): op_log 기반 실제 연산 실행 — SimPy 외부, 병렬 가능
|
Phase 2 (data): Actual computation execution based on op_log — outside SimPy, parallelizable
|
||||||
```
|
```
|
||||||
|
|
||||||
본 ADR은 **메모리 연산에 한해 Phase 1을 data-aware로 확장**한다.
|
This ADR **extends Phase 1 to be data-aware for memory operations only**.
|
||||||
Phase 1은 latency/BW 병목 분석 + 메모리 데이터 추적,
|
Phase 1 handles latency/BW bottleneck analysis + memory data tracking,
|
||||||
Phase 2는 GEMM/Math 연산 정합성 검증.
|
Phase 2 handles GEMM/Math computation correctness verification.
|
||||||
Phase 2는 optional — 타이밍만 필요하면 Phase 1만 실행.
|
Phase 2 is optional — if only timing is needed, run Phase 1 alone.
|
||||||
|
|
||||||
### D2. Op Log 기록 — ComponentBase hook
|
### D2. Op Log Recording — ComponentBase Hook
|
||||||
|
|
||||||
op_log 기록은 **컴포넌트 베이스 클래스의 hook**으로 수행한다.
|
Op log recording is performed as a **hook in the component base class**.
|
||||||
개별 컴포넌트 구현을 수정하지 않는다.
|
Individual component implementations are not modified.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class ComponentBase:
|
class ComponentBase:
|
||||||
@@ -77,56 +77,56 @@ class ComponentBase:
|
|||||||
self._op_logger.record_end(env.now, self.node.id, msg)
|
self._op_logger.record_end(env.now, self.node.id, msg)
|
||||||
```
|
```
|
||||||
|
|
||||||
`_forward_txn()` 에서 `run()` 전후로 hook을 호출한다.
|
Hooks are called before and after `run()` within `_forward_txn()`.
|
||||||
`_op_logger`는 optional — 없으면 오버헤드 제로.
|
`_op_logger` is optional — zero overhead when absent.
|
||||||
|
|
||||||
**hook 시점 정의**:
|
**Hook timing definitions**:
|
||||||
|
|
||||||
| 시점 | 의미 |
|
| Timing | Meaning |
|
||||||
|------|------|
|
|--------|---------|
|
||||||
| `t_start` | 컴포넌트가 해당 msg의 **service를 시작**한 시점 (`run()` 진입 직전) |
|
| `t_start` | The point at which the component **begins servicing** the msg (immediately before `run()` entry) |
|
||||||
| `t_end` | 컴포넌트의 **내부 service가 완료**된 시점 (`run()` 반환 직후) |
|
| `t_end` | The point at which the component's **internal service completes** (immediately after `run()` returns) |
|
||||||
|
|
||||||
link traversal latency는 t_start/t_end에 포함되지 않는다.
|
Link traversal latency is not included in t_start/t_end.
|
||||||
link latency는 발신 컴포넌트의 t_end와 수신 컴포넌트의 t_start 차이로 관측된다.
|
Link latency is observed as the difference between the sending component's t_end and the receiving component's t_start.
|
||||||
|
|
||||||
### D3. Greenlet 기반 커널 실행 — Phase 0 제거
|
### D3. Greenlet-Based Kernel Execution — Phase 0 Elimination
|
||||||
|
|
||||||
기존 Phase 0 (커널 → PeCommand 리스트)를 제거하고,
|
The existing Phase 0 (kernel → PeCommand list) is eliminated,
|
||||||
**greenlet**을 사용하여 커널과 SimPy를 협력적으로 interleave 실행한다.
|
and **greenlet** is used to cooperatively interleave kernel and SimPy execution.
|
||||||
|
|
||||||
#### 동작 원리
|
#### Operating Principle
|
||||||
|
|
||||||
greenlet은 협력적 context switch를 제공하는 C 확장이다.
|
greenlet is a C extension that provides cooperative context switching.
|
||||||
커널(child greenlet)이 `tl.load()` 등을 호출하면 SimPy 루프(parent greenlet)로
|
When the kernel (child greenlet) calls `tl.load()` etc., it switches to the SimPy loop (parent greenlet)
|
||||||
switch하여 타이밍 시뮬레이션을 수행하고, 완료 후 실제 데이터와 함께 커널로 돌아온다.
|
to perform timing simulation, and after completion, returns to the kernel with actual data.
|
||||||
|
|
||||||
```
|
```
|
||||||
SimPy 루프 (parent greenlet) 커널 (child greenlet)
|
SimPy loop (parent greenlet) Kernel (child greenlet)
|
||||||
───────────────────────── ──────────────────────
|
───────────────────────── ──────────────────────
|
||||||
g.switch() ─────────────────────────→ 커널 시작
|
g.switch() ─────────────────────────→ Kernel starts
|
||||||
a = tl.load(ptr, ...)
|
a = tl.load(ptr, ...)
|
||||||
내부: parent.switch(DmaReadCmd)
|
internal: parent.switch(DmaReadCmd)
|
||||||
cmd = DmaReadCmd ←────────────────── (커널 일시정지)
|
cmd = DmaReadCmd ←────────────────── (kernel paused)
|
||||||
yield DmaReadMsg(...)
|
yield DmaReadMsg(...)
|
||||||
yield env.timeout(dma_latency)
|
yield env.timeout(dma_latency)
|
||||||
data = memory_store.read(...)
|
data = memory_store.read(...)
|
||||||
g.switch(data) ─────────────────────→ (커널 재개)
|
g.switch(data) ─────────────────────→ (kernel resumed)
|
||||||
a = data ← 실제 numpy array
|
a = data ← actual numpy array
|
||||||
if a[0][0] > 0.5: ← 분기 가능
|
if a[0][0] > 0.5: ← branching possible
|
||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
커널은 **plain Python function**으로 유지된다.
|
The kernel is maintained as a **plain Python function**.
|
||||||
greenlet switch는 `tl.load()`, `tl.store()` 등의 **내부 구현에만** 존재한다.
|
greenlet switches exist **only within the internal implementation** of `tl.load()`, `tl.store()`, etc.
|
||||||
|
|
||||||
#### KernelRunner — 프레임워크 레이어
|
#### KernelRunner — Framework Layer
|
||||||
|
|
||||||
greenlet 루프는 PE_CPU 컴포넌트가 아니라 프레임워크 레이어인
|
The greenlet loop resides not in the PE_CPU component but in the framework layer,
|
||||||
**KernelRunner**에 위치한다.
|
**KernelRunner**.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# KernelRunner (프레임워크 — greenlet ↔ SimPy 연결)
|
# KernelRunner (framework — greenlet ↔ SimPy bridge)
|
||||||
class KernelRunner:
|
class KernelRunner:
|
||||||
def run(self, env, kernel_fn, args, store):
|
def run(self, env, kernel_fn, args, store):
|
||||||
g = greenlet(self._run_kernel)
|
g = greenlet(self._run_kernel)
|
||||||
@@ -136,160 +136,162 @@ class KernelRunner:
|
|||||||
if isinstance(cmd, DmaReadCmd):
|
if isinstance(cmd, DmaReadCmd):
|
||||||
yield from self._dispatch_dma(env, cmd)
|
yield from self._dispatch_dma(env, cmd)
|
||||||
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
|
data = store.read(cmd.src_addr, cmd.shape, cmd.dtype)
|
||||||
cmd = g.switch(data) # 실제 데이터와 함께 재개
|
cmd = g.switch(data) # resume with actual data
|
||||||
elif isinstance(cmd, GemmCmd):
|
elif isinstance(cmd, GemmCmd):
|
||||||
yield from self._dispatch_gemm(env, cmd)
|
yield from self._dispatch_gemm(env, cmd)
|
||||||
cmd = g.switch() # 재개 (데이터 없음)
|
cmd = g.switch() # resume (no data)
|
||||||
elif isinstance(cmd, DmaWriteCmd):
|
elif isinstance(cmd, DmaWriteCmd):
|
||||||
store.write(cmd.dst_addr, cmd.data) # visibility = issue 시점
|
store.write(cmd.dst_addr, cmd.data) # visibility = issue time
|
||||||
yield from self._dispatch_dma(env, cmd) # timing만 반영
|
yield from self._dispatch_dma(env, cmd) # timing only
|
||||||
cmd = g.switch()
|
cmd = g.switch()
|
||||||
|
|
||||||
# PE_CPU (컴포넌트 — 간단하게 유지, greenlet을 모름)
|
# PE_CPU (component — kept simple, unaware of greenlet)
|
||||||
def _execute_kernel(self, env):
|
def _execute_kernel(self, env):
|
||||||
runner = KernelRunner(self.ctx)
|
runner = KernelRunner(self.ctx)
|
||||||
yield from runner.run(env, kernel_fn, args, store)
|
yield from runner.run(env, kernel_fn, args, store)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Op logging single source of truth**: KernelRunner는 op_log에 직접 기록하지 않는다.
|
**Op logging single source of truth**: KernelRunner does not record directly to op_log.
|
||||||
모든 op logging은 **ComponentBase hook (_on_process_start/end)만** 담당한다.
|
All op logging is handled **solely by the ComponentBase hook (_on_process_start/end)**.
|
||||||
KernelRunner가 `_dispatch_gemm()` 등으로 컴포넌트에 메시지를 전달하면,
|
When KernelRunner delivers messages to components via `_dispatch_gemm()` etc.,
|
||||||
컴포넌트 베이스 클래스의 hook이 자동으로 기록한다.
|
the component base class hooks automatically record them.
|
||||||
|
|
||||||
**레이어 분리**:
|
**Layer separation**:
|
||||||
- **커널 코드**: plain function, greenlet 존재를 모름
|
- **Kernel code**: plain function, unaware of greenlet
|
||||||
- **TLContext**: `tl.load()` 내부에서 `parent.switch(cmd)` 호출
|
- **TLContext**: calls `parent.switch(cmd)` inside `tl.load()`
|
||||||
- **KernelRunner**: greenlet ↔ SimPy 연결, MemoryStore 읽기/쓰기 처리. **logging 안 함**.
|
- **KernelRunner**: greenlet ↔ SimPy bridge, handles MemoryStore read/write. **Does not log**.
|
||||||
- **ComponentBase hook**: op_log 기록의 유일한 경로
|
- **ComponentBase hook**: the sole path for op_log recording
|
||||||
- **PE_CPU**: KernelRunner를 호출만 함, 컴포넌트로서 교체 가능
|
- **PE_CPU**: only calls KernelRunner, replaceable as a component
|
||||||
|
|
||||||
#### 메모리 읽기/쓰기 vs 연산의 처리 차이
|
#### Handling Differences Between Memory Read/Write and Compute
|
||||||
|
|
||||||
| 연산 | Phase 1에서 | Phase 2에서 |
|
| Operation | In Phase 1 | In Phase 2 |
|
||||||
|------|------------|------------|
|
|-----------|-----------|-----------|
|
||||||
| `tl.load()` | SimPy 타이밍 + MemoryStore read → **실제 데이터 반환** | — |
|
| `tl.load()` | SimPy timing + MemoryStore read → **actual data returned** | — |
|
||||||
| `tl.store()` | SimPy 타이밍 + MemoryStore write → **실제 기록** | — |
|
| `tl.store()` | SimPy timing + MemoryStore write → **actual write** | — |
|
||||||
| `tl.composite(gemm)` | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
|
| `tl.composite(gemm)` | SimPy timing + **op_log recording only** | numpy actual computation |
|
||||||
| `tl.dot()` / math ops | SimPy 타이밍 + **op_log 기록만** | numpy 실제 연산 |
|
| `tl.dot()` / math ops | SimPy timing + **op_log recording only** | numpy actual computation |
|
||||||
|
|
||||||
메모리 읽기/쓰기는 Phase 1에서 즉시 처리 (numpy slice, 빠름).
|
Memory read/write is processed immediately in Phase 1 (numpy slice, fast).
|
||||||
GEMM/Math 연산은 Phase 2에서 batch 실행 (성능 분리).
|
GEMM/Math operations are executed in batches in Phase 2, which keeps heavy compute out of the single-threaded SimPy loop.
|
||||||
|
|
||||||
#### Store Visibility Rule
|
#### Store Visibility Rule
|
||||||
|
|
||||||
`tl.store()`는 **issue 시점에 MemoryStore에 즉시 반영**된다 (visibility = issue).
|
`tl.store()` is **immediately reflected in MemoryStore at issue time** (visibility = issue).
|
||||||
SimPy DMA 타이밍은 이후 별도로 시뮬레이션된다.
|
SimPy DMA timing is simulated separately afterward.
|
||||||
|
|
||||||
이는 timing과 visibility를 의도적으로 분리한 것이다:
|
This is an intentional separation of timing and visibility:
|
||||||
- **visibility**: MemoryStore에 반영되는 시점 = `store.write()` 호출 시
|
- **visibility**: the point at which the written data is reflected in MemoryStore = when `store.write()` is called
|
||||||
- **timing**: SimPy에서 DMA latency가 완료되는 시점
|
- **timing**: the point at which DMA latency completes in SimPy
|
||||||
|
|
||||||
이 분리로 dynamic control flow에서 store 직후 load가 최신 데이터를 볼 수 있다.
|
With this separation, a load issued immediately after a store sees the latest data, which dynamic control flow depends on.
|
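The visibility/timing split can be illustrated with a minimal sketch. It assumes a dict-backed `TinyStore` and a plain generator in place of a SimPy process; both names are hypothetical and are not part of the ADR's code.

```python
class TinyStore:
    """Hypothetical stand-in for MemoryStore: addr -> value."""

    def __init__(self):
        self._data = {}

    def write(self, addr, value):
        self._data[addr] = value

    def read(self, addr):
        return self._data[addr]


def issue_store(store, addr, value, dma_latency_ns):
    # Visibility: the value lands in the store at issue time.
    store.write(addr, value)
    # Timing: stand-in for `yield env.timeout(dma_latency)` in SimPy.
    yield dma_latency_ns


store = TinyStore()
process = issue_store(store, 0x100, [1, 2, 3], dma_latency_ns=40)
pending_latency = next(process)        # runs up to the timing yield
seen_by_next_load = store.read(0x100)  # already visible, DMA not yet complete
```

The generator is suspended at the latency yield, yet the read already returns the new value, which is the behavior the rule describes.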
||||||
|
|
||||||
#### Result Handle Semantics
|
#### Result Handle Semantics
|
||||||
|
|
||||||
`tl.composite()`(sync/async)는 결과 tensor를 참조하는 **handle**을 반환한다.
|
`tl.composite()` (sync/async) returns a **handle** referencing the result tensor.
|
||||||
|
|
||||||
Phase 1에서의 핵심 계약:
|
The key contract in Phase 1:
|
||||||
|
|
||||||
1. **모든 compute handle은 Phase 1에서 항상 pending 상태로 간주한다.**
|
1. **All compute handles are always considered pending in Phase 1.**
|
||||||
2. `tl.wait(handle)`은 **timing synchronization만 표현**하며,
|
2. `tl.wait(handle)` **expresses timing synchronization only**
|
||||||
handle을 ready로 만들지 않는다.
|
and does not make the handle ready.
|
||||||
3. handle의 실제 결과 데이터 접근(`handle.data`, element access,
|
3. Accessing the handle's actual result data (`handle.data`, element access,
|
||||||
numpy conversion 등)은 **Phase 2에서만 가능**하다.
|
numpy conversion, etc.) is **only possible in Phase 2**.
|
||||||
4. 따라서 Phase 1에서 **compute-result 기반 control flow는 지원하지 않는다.**
|
4. Therefore, **compute-result-based control flow is not supported in Phase 1.**
|
||||||
5. 반면 `tl.load()`는 Phase 1에서 실제 데이터를 반환하므로,
|
5. In contrast, `tl.load()` returns actual data in Phase 1, so
|
||||||
**memory-read 기반 control flow는 지원 가능**하다.
|
**memory-read-based control flow is supported**.
|
||||||
|
|
||||||
| handle 상태 | Phase | 허용 동작 |
|
| Handle state | Phase | Allowed operations |
|
||||||
|------------|-------|----------|
|
|------------|-------|----------|
|
||||||
| pending | Phase 1 | `tl.wait(handle)` — timing 동기화만 |
|
| pending | Phase 1 | `tl.wait(handle)` — timing synchronization only |
|
||||||
| pending | Phase 1 | handle을 `tl.store()`의 대상으로 전달 (logical destination 연결만, payload는 Phase 2) |
|
| pending | Phase 1 | Pass handle as target of `tl.store()` (logical destination binding only, payload in Phase 2) |
|
||||||
| pending | Phase 1 | **데이터 접근 불가** — 값 기반 분기 불가 |
|
| pending | Phase 1 | **Data access not allowed** — value-based branching not possible |
|
||||||
| ready | Phase 2 | 실제 numpy 데이터 접근, 검증 |
|
| ready | Phase 2 | Actual numpy data access, verification |
|
||||||
|
|
||||||
이 제약은 의도적이다. Phase 1에서 연산을 실행하면 SimPy single-thread가
|
This restriction is intentional. If computations were executed in Phase 1,
|
||||||
block되어 2-pass 분리의 존재 이유가 사라진다.
|
the single SimPy thread would block, defeating the purpose of the 2-pass separation.
|
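One way to enforce the pending-handle contract is to make every data access fail loudly. The sketch below is an assumption, not the ADR's implementation: `PendingHandle` and `Phase1AccessError` are hypothetical names, and only the access paths listed above are covered.

```python
class Phase1AccessError(RuntimeError):
    """Raised when Phase 1 code touches a compute result (hypothetical)."""


class PendingHandle:
    """Hypothetical Phase 1 handle: it can be waited on or passed around,
    but it never exposes result data."""

    def __init__(self, op_index):
        self.op_index = op_index  # index of the producing op record

    def _refuse(self, what):
        raise Phase1AccessError(
            f"{what} is not available in Phase 1 (op {self.op_index} is pending)"
        )

    @property
    def data(self):
        self._refuse("handle.data")

    def __getitem__(self, key):
        self._refuse("element access")

    def __bool__(self):
        # Blocks `if handle:` style compute-result branching.
        self._refuse("truth-value evaluation")

    def __array__(self, *args, **kwargs):
        self._refuse("numpy conversion")


handle = PendingHandle(op_index=7)
```

Raising in `__bool__` is what turns an accidental `if handle:` into an immediate error instead of a silently wrong branch.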
||||||
|
|
||||||
#### Phase 1 Materialization — Future Extension
|
#### Phase 1 Materialization — Future Extension
|
||||||
|
|
||||||
향후 소형 연산(scalar, 작은 reduction)에 대해 Phase 1 eager execution이
|
If Phase 1 eager execution becomes necessary for small operations
|
||||||
필요한 경우, `materialized_in_phase1: bool` 플래그를 op record에 추가하여
|
(scalar, small reduction) in the future, selective materialization can be supported
|
||||||
선택적 materialization을 지원할 수 있다. 현재 범위에서는 구현하지 않는다.
|
by adding a `materialized_in_phase1: bool` flag to the op record.
|
||||||
|
This is not implemented in the current scope.
|
||||||
|
|
||||||
### D4. data_op 플래그 — 메시지 자기 선언
|
### D4. data_op Flag — Message Self-Declaration
|
||||||
|
|
||||||
로깅 대상은 메시지 타입이 아니라 메시지 인스턴스의 `data_op` 속성으로 결정한다.
|
Whether a message is logged is determined by the `data_op` attribute on the message instance,
|
||||||
프레임워크가 메시지 타입을 하드코딩하지 않는다.
|
not by message type. The framework does not hardcode message types.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class MsgBase:
|
class MsgBase:
|
||||||
data_op: bool = False # 기본: 로깅 안 함
|
data_op: bool = False # default: no logging
|
||||||
|
|
||||||
class DmaReadCmd(MsgBase):
|
class DmaReadCmd(MsgBase):
|
||||||
data_op = True # 메모리 이동 → 로깅
|
data_op = True # memory transfer → logging
|
||||||
|
|
||||||
class GemmCmd(MsgBase):
|
class GemmCmd(MsgBase):
|
||||||
data_op = True # 연산 → 로깅
|
data_op = True # compute → logging
|
||||||
|
|
||||||
class MathCmd(MsgBase):
|
class MathCmd(MsgBase):
|
||||||
data_op = True # 연산 → 로깅
|
data_op = True # compute → logging
|
||||||
```
|
```
|
||||||
|
|
||||||
새 메시지 타입(예: IpcqMsg) 추가 시 `data_op = True`만 설정하면
|
When adding a new message type (e.g., IpcqMsg), simply setting `data_op = True`
|
||||||
프레임워크 코드 수정 없이 자동 로깅된다.
|
enables automatic logging without modifying framework code.
|
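A rough sketch of how a hook could honor the flag without knowing any message type. The `maybe_log` helper and the `CreditMsg` class are hypothetical; the ADR only states that the ComponentBase hook does the logging.

```python
class MsgBase:
    data_op: bool = False  # default: not logged


class DmaReadCmd(MsgBase):
    data_op = True


class CreditMsg(MsgBase):
    """Hypothetical control-only message: keeps the default."""


class IpcqMsg(MsgBase):
    data_op = True  # new type opts in; the hook below stays unchanged


def maybe_log(op_log, msg):
    # The hook inspects the instance attribute only, never the type.
    if getattr(msg, "data_op", False):
        op_log.append(type(msg).__name__)


op_log = []
for msg in (DmaReadCmd(), CreditMsg(), IpcqMsg()):
    maybe_log(op_log, msg)
```

Because the check is attribute-based, adding `IpcqMsg` required no edit to `maybe_log`.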
||||||
|
|
||||||
### D5. Op Log 구조
|
### D5. Op Log Structure
|
||||||
|
|
||||||
#### op 분류 체계
|
#### Op Classification Scheme
|
||||||
|
|
||||||
2단계로 분류한다:
|
A two-level classification is used:
|
||||||
|
|
||||||
| 레벨 | 필드 | 역할 |
|
| Level | Field | Role |
|
||||||
|------|------|------|
|
|-------|-------|------|
|
||||||
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch 기준 |
|
| `op_kind` | `memory` \| `gemm` \| `math` | executor dispatch criterion |
|
||||||
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` 등 | 구체 연산 식별 |
|
| `op_name` | `dma_read` \| `dma_write` \| `gemm_f16` \| `exp` \| `add` \| `sum` etc. | specific operation identification |
|
||||||
|
|
||||||
#### OpRecord 정의
|
#### OpRecord Definition
|
||||||
|
|
||||||
```python
|
```python
|
||||||
@dataclass
|
@dataclass
|
||||||
class OpRecord:
|
class OpRecord:
|
||||||
t_start: float # SimPy 시각 (ns) — service 시작
|
t_start: float # SimPy time (ns) — service start
|
||||||
t_end: float # SimPy 시각 (ns) — service 완료
|
t_end: float # SimPy time (ns) — service completion
|
||||||
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
|
component_id: str # e.g. "sip0.cube0.pe0.pe_gemm"
|
||||||
op_kind: str # "memory" | "gemm" | "math"
|
op_kind: str # "memory" | "gemm" | "math"
|
||||||
op_name: str # 구체 연산명
|
op_name: str # specific operation name
|
||||||
params: dict # 연산별 파라미터 (아래 참조)
|
params: dict # per-operation parameters (see below)
|
||||||
dependency_ids: list[int] # 현재는 in-memory record index 기반, 향후 stable op_id로 대체 가능
|
dependency_ids: list[int] # currently based on in-memory record index, may be replaced with stable op_id in the future
|
||||||
```
|
```
|
||||||
|
|
||||||
#### dependency_ids 생성 규칙
|
#### dependency_ids Generation Rules
|
||||||
|
|
||||||
`dependency_ids`는 **optional**이며, 기본적으로 executor는
|
`dependency_ids` is **optional**, and by default the executor performs
|
||||||
주소 기반 dependency 추론을 수행한다 (D6 참조).
|
address-based dependency inference (see D6).
|
||||||
|
|
||||||
정확한 실행 순서가 필요한 경우에만 명시적으로 설정한다:
|
Explicit setting is only needed when precise execution ordering is required:
|
||||||
- **기본 (address-based inference)**: executor가 read/write set을 분석하여
|
- **Default (address-based inference)**: the executor analyzes read/write sets to
|
||||||
RAW/WAW/WAR 의존성을 자동 추론. 대부분의 경우 이것으로 충분.
|
automatically infer RAW/WAW/WAR dependencies. This is sufficient for most cases.
|
||||||
- **명시적 설정**: TLContext 또는 command 생성 단계에서 logical dependency가
|
- **Explicit setting**: set at the TLContext or command generation stage
|
||||||
주소로 표현되지 않는 경우에 설정.
|
when logical dependencies cannot be expressed via addresses.
|
||||||
예: completion handle 기반 동기화 — handle dependency는 메모리 주소가 아니라
|
Example: completion handle-based synchronization — handle dependencies depend on
|
||||||
논리적 완료 순서에 의존하므로 address inference로 잡히지 않는다.
|
logical completion order rather than memory addresses, so they cannot be captured
|
||||||
|
by address inference.
|
||||||
|
|
||||||
#### op_log ordering
|
#### op_log Ordering
|
||||||
|
|
||||||
op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
The op_log maintains **stable ordering** based on `t_start`.
|
||||||
동일 `t_start`의 record들은 insertion order를 보존한다.
|
Records with the same `t_start` preserve insertion order.
|
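The ordering rule maps directly onto a stable sort. The sketch below assumes a reduced record type (`Rec` is hypothetical and carries only two of the `OpRecord` fields) and relies on the fact that Python's `sorted` is stable.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Rec:
    """Hypothetical cut-down OpRecord for illustration."""
    t_start: float
    op_name: str


# Records in the order the hooks happened to append them.
appended = [
    Rec(20.0, "gemm_f16"),
    Rec(10.0, "dma_read"),
    Rec(20.0, "exp"),
    Rec(10.0, "dma_write"),
]

# A stable sort keeps ties (equal t_start) in insertion order.
ordered = sorted(appended, key=attrgetter("t_start"))
names = [r.op_name for r in ordered]
```

Here `dma_read` stays ahead of `dma_write`, and `gemm_f16` ahead of `exp`, because each pair shares a `t_start`.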
||||||
|
|
||||||
#### params 상세
|
#### params Details
|
||||||
|
|
||||||
**memory (dma_read / dma_write)**:
|
**memory (dma_read / dma_write)**:
|
||||||
```python
|
```python
|
||||||
{
|
{
|
||||||
"src_addr": int, # source 주소 (byte)
|
"src_addr": int, # source address (byte)
|
||||||
"dst_addr": int, # destination 주소 (byte)
|
"dst_addr": int, # destination address (byte)
|
||||||
"nbytes": int, # 전송 크기
|
"nbytes": int, # transfer size
|
||||||
"src_space": str, # "hbm" | "tcm" | "sram"
|
"src_space": str, # "hbm" | "tcm" | "sram"
|
||||||
"dst_space": str, # "hbm" | "tcm" | "sram"
|
"dst_space": str, # "hbm" | "tcm" | "sram"
|
||||||
}
|
}
|
||||||
@@ -298,9 +300,9 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
|||||||
**gemm**:
|
**gemm**:
|
||||||
```python
|
```python
|
||||||
{
|
{
|
||||||
"src_a_addr": int, # operand A 주소
|
"src_a_addr": int, # operand A address
|
||||||
"src_b_addr": int, # operand B 주소
|
"src_b_addr": int, # operand B address
|
||||||
"dst_addr": int, # output 주소
|
"dst_addr": int, # output address
|
||||||
"shape_a": tuple, # e.g. (128, 256)
|
"shape_a": tuple, # e.g. (128, 256)
|
||||||
"shape_b": tuple, # e.g. (256, 128)
|
"shape_b": tuple, # e.g. (256, 128)
|
||||||
"shape_out": tuple, # e.g. (128, 128)
|
"shape_out": tuple, # e.g. (128, 128)
|
||||||
@@ -312,7 +314,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
|||||||
"layout_a": str, # "row_major" | "col_major"
|
"layout_a": str, # "row_major" | "col_major"
|
||||||
"layout_b": str,
|
"layout_b": str,
|
||||||
"layout_out": str,
|
"layout_out": str,
|
||||||
"addr_space": str, # "tcm" (GEMM operand는 항상 TCM)
|
"addr_space": str, # "tcm" (GEMM operands are always in TCM)
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -320,7 +322,7 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
|||||||
```python
|
```python
|
||||||
{
|
{
|
||||||
"op": str, # "exp" | "add" | "sum" | "where" | ...
|
"op": str, # "exp" | "add" | "sum" | "where" | ...
|
||||||
"input_addrs": list[int], # operand 주소 목록
|
"input_addrs": list[int], # list of operand addresses
|
||||||
"input_shapes": list[tuple],
|
"input_shapes": list[tuple],
|
||||||
"dst_addr": int,
|
"dst_addr": int,
|
||||||
"shape_out": tuple,
|
"shape_out": tuple,
|
||||||
@@ -332,12 +334,12 @@ op_log는 `t_start` 기준으로 **stable ordering**을 유지한다.
|
|||||||
|
|
||||||
### D6. Phase 2 Executor
|
### D6. Phase 2 Executor
|
||||||
|
|
||||||
Phase 2는 SimPy 밖에서 op_log를 실행한다.
|
Phase 2 executes the op_log outside of SimPy.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class DataExecutor:
|
class DataExecutor:
|
||||||
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
|
def __init__(self, op_log: list[OpRecord], initial_store: MemoryStore):
|
||||||
self.store = initial_store # Phase 1의 MemoryStore snapshot을 입력으로 받는다
|
self.store = initial_store # Takes the Phase 1 MemoryStore snapshot as input
|
||||||
|
|
||||||
def run(self):
|
def run(self):
|
||||||
for t, ops in groupby(op_log, key=lambda o: o.t_start):
|
for t, ops in groupby(op_log, key=lambda o: o.t_start):
|
||||||
@@ -347,30 +349,30 @@ class DataExecutor:
|
|||||||
self._execute_sequential(sequential)
|
self._execute_sequential(sequential)
|
||||||
```
|
```
|
||||||
|
|
||||||
**병렬 실행 판정**:
|
**Parallel execution determination**:
|
||||||
|
|
||||||
같은 `t_start`의 op들은 **병렬 후보**로 간주한다.
|
Ops with the same `t_start` are considered **parallel candidates**.
|
||||||
실제 병렬 실행 여부는 executor가 다음 기준으로 판정한다:
|
The executor determines actual parallel execution based on the following criteria:
|
||||||
- read/write 주소 범위 겹침 여부 (WAW, RAW, WAR 충돌 검사)
|
- Whether read/write address ranges overlap (WAW, RAW, WAR conflict checks)
|
||||||
- `dependency_ids`에 명시된 선행 op 완료 여부
|
- Whether predecessor ops specified in `dependency_ids` have completed
|
||||||
|
|
||||||
주소 범위가 겹치지 않고 명시적 의존성이 없는 op들만 병렬 실행한다.
|
Only ops with no overlapping address ranges and no explicit dependencies are executed in parallel.
|
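The address-range check can be sketched as an interval-overlap test applied to each hazard type. This is an illustration under assumptions: `Access`, `overlaps`, and `conflicts` are hypothetical names, ranges are `(addr, nbytes)` pairs, and all ranges are taken to live in one address space (the real records also carry `src_space` / `dst_space`).

```python
from dataclasses import dataclass


def overlaps(a_start, a_len, b_start, b_len):
    # Half-open byte ranges [start, start + len) intersect.
    return a_start < b_start + b_len and b_start < a_start + a_len


@dataclass(frozen=True)
class Access:
    reads: tuple   # tuple of (addr, nbytes)
    writes: tuple  # tuple of (addr, nbytes)


def conflicts(first, second):
    def any_overlap(xs, ys):
        return any(overlaps(*x, *y) for x in xs for y in ys)

    return (
        any_overlap(first.writes, second.reads)      # RAW
        or any_overlap(first.writes, second.writes)  # WAW
        or any_overlap(first.reads, second.writes)   # WAR
    )


a = Access(reads=((0, 64),), writes=((256, 64),))
b = Access(reads=((288, 32),), writes=((512, 64),))  # reads part of a's output
c = Access(reads=((0, 64),), writes=((1024, 64),))   # shares only an input with a
```

Two ops that merely read the same range, such as `a` and `c`, do not conflict and remain parallel candidates.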
||||||
|
|
||||||
**배치 최적화**: 동일 op_name이며 **shape, dtype, layout, transpose flag가
|
**Batch optimization**: Only independent ops with the same op_name **and identical
|
||||||
모두 동일한** 독립 op들만 batching 대상이 된다.
|
shape, dtype, layout, and transpose flags** are eligible for batching.
|
||||||
예: 여러 PE의 동일 shape GEMM → `np.matmul(a_batch, b_batch)` 한 번으로 묶음.
|
Example: identically shaped GEMMs from multiple PEs → bundled into a single `np.matmul(a_batch, b_batch)` call.
|
||||||
CPU에서도 BLAS 효율 향상, GPU에서는 launch overhead 절감.
|
Improves BLAS efficiency on CPU, reduces launch overhead on GPU.
|
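A minimal sketch of the batching idea, assuming each op is a plain dict that already holds its operands under the hypothetical keys `"a"` and `"b"`; the real executor would fetch them from MemoryStore by address. `np.stack` and `np.matmul` are used as documented in NumPy: `matmul` treats leading dimensions as a batch.

```python
from collections import defaultdict

import numpy as np


def batch_key(op):
    # Only ops that agree on all of these fields may share a batch.
    return (op["op_name"], op["shape_a"], op["shape_b"],
            op["dtype"], op["layout_a"], op["layout_b"])


def run_batched(ops):
    groups = defaultdict(list)
    for idx, op in enumerate(ops):
        groups[batch_key(op)].append(idx)

    results = [None] * len(ops)
    for indices in groups.values():
        a_batch = np.stack([ops[i]["a"] for i in indices])  # (B, M, K)
        b_batch = np.stack([ops[i]["b"] for i in indices])  # (B, K, N)
        out = np.matmul(a_batch, b_batch)                   # (B, M, N)
        for slot, i in enumerate(indices):
            results[i] = out[slot]
    return results


def make_op(rng, m, k, n):
    # Hypothetical helper that builds one independent GEMM op.
    return {
        "op_name": "gemm_f32", "dtype": "f32",
        "shape_a": (m, k), "shape_b": (k, n),
        "layout_a": "row_major", "layout_b": "row_major",
        "a": rng.standard_normal((m, k)).astype(np.float32),
        "b": rng.standard_normal((k, n)).astype(np.float32),
    }


rng = np.random.default_rng(0)
ops = [make_op(rng, 4, 8, 4), make_op(rng, 2, 3, 2),
       make_op(rng, 4, 8, 4), make_op(rng, 4, 8, 4)]
results = run_batched(ops)
```

In this example the three 4x8 by 8x4 GEMMs go through one `np.matmul` call, while the differently shaped op forms its own group.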
||||||
|
|
||||||
**Phase 2 실행 순서 보장**:
|
**Phase 2 execution order guarantee**:
|
||||||
|
|
||||||
Phase 2는 데이터 도착 시점을 고려하지 않으며,
|
Phase 2 does not consider data arrival timing,
|
||||||
dependency (주소 기반 추론 + 명시적 dependency_ids)를 통해서만
|
and guarantees execution order solely through
|
||||||
실행 순서를 보장한다.
|
dependencies (address-based inference + explicit dependency_ids).
|
||||||
|
|
||||||
### D7. Memory Store
|
### D7. Memory Store
|
||||||
|
|
||||||
`MemoryStore`는 논리적으로 byte-addressable semantics를 따르며,
|
`MemoryStore` logically follows byte-addressable semantics,
|
||||||
현재 구현은 **tensor-granular storage** (addr → numpy ndarray 매핑)를 사용한다.
|
and the current implementation uses **tensor-granular storage** (addr → numpy ndarray mapping).
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class MemoryStore:
|
class MemoryStore:
|
||||||
@@ -378,139 +380,140 @@ class MemoryStore:
|
|||||||
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
|
def read(self, space: str, addr: int, shape: tuple, dtype: str) -> np.ndarray: ...
|
||||||
```
|
```
|
||||||
|
|
||||||
**내부 저장 포맷: numpy ndarray**
|
**Internal storage format: numpy ndarray**
|
||||||
|
|
||||||
MemoryStore는 텐서를 **numpy ndarray**로 저장한다.
|
MemoryStore stores tensors as **numpy ndarrays**.
|
||||||
|
|
||||||
| 후보 | store/load 속도 | Phase 2 연산 | 판정 |
|
| Candidate | store/load speed | Phase 2 compute | Verdict |
|
||||||
|------|----------------|-------------|------|
|
|-----------|-----------------|-----------------|---------|
|
||||||
| **numpy ndarray** | 즉시 (참조 전달, 복사 없음) | `np.matmul` 바로 사용 | **채택** |
|
| **numpy ndarray** | Immediate (reference passing, no copy) | `np.matmul` directly usable | **Adopted** |
|
||||||
| bytearray | memcpy 필요 | `np.frombuffer` 변환 필요 | 탈락 |
|
| bytearray | Requires memcpy | Requires `np.frombuffer` conversion | Rejected |
|
||||||
| torch tensor | 즉시 | torch 연산 가능 | GPU 최적화 시만 사용 |
|
| torch tensor | Immediate | torch operations available | Use only for GPU optimization |
|
||||||
|
|
||||||
- write: numpy array를 **참조 저장** (복사 없음) → Phase 1 오버헤드 = dict lookup 1회
|
- write: **stores numpy array by reference** (no copy) → Phase 1 overhead = 1 dict lookup
|
||||||
- read: numpy array를 **참조 반환** (복사 없음)
|
- read: **returns numpy array by reference** (no copy)
|
||||||
- 동일 addr에 재 write 시 기존 array를 **tensor 단위로 덮어쓴다** (partial overwrite 미지원)
|
- Re-writing to the same addr **overwrites at tensor granularity** (partial overwrite not supported)
|
||||||
- dtype은 numpy native 사용 (`np.float16`, `np.float32`, `np.bfloat16` 등)
|
- dtype uses numpy native types (`np.float16`, `np.float32`, etc.); `bfloat16` is not a built-in numpy dtype and requires an extension such as `ml_dtypes`
|
||||||
- byte-level access가 필요한 경우 `.view(np.uint8)` 로 변환
|
- For byte-level access, convert via `.view(np.uint8)`
|
||||||
- Phase 2에서 GPU batch 최적화 시 numpy → torch tensor 변환은 executor가 담당
|
- For GPU batch optimization in Phase 2, numpy → torch tensor conversion is the executor's responsibility
|
||||||
|
|
||||||
**read/write contract**:
|
**read/write contract**:
|
||||||
|
|
||||||
- read/write는 **contiguous tensor** 기준이다.
|
- read/write operates on a **contiguous tensor** basis.
|
||||||
non-contiguous stride view가 필요한 경우 별도 copy op으로 표현한다.
|
If non-contiguous stride views are needed, express them as separate copy ops.
|
||||||
- 일반 benchmark path에서는 producer/consumer dtype 일치를 기대한다.
|
- In the normal benchmark path, producer/consumer dtype match is expected.
|
||||||
reinterpret cast는 low-level memory validation 또는 특수 테스트 케이스를 위한
|
Reinterpret cast is a permissive behavior for low-level memory validation
|
||||||
permissive behavior이다.
|
or special test cases.
|
||||||
- addr은 byte-aligned이며, 최소 alignment = dtype 크기.
|
- addr is byte-aligned, with minimum alignment = dtype size.
|
||||||
- dtype mismatch (write와 다른 dtype으로 read)는 reinterpret cast로 처리한다.
|
- dtype mismatch (reading with a different dtype than written) is handled as a reinterpret cast.
|
||||||
shape 불일치 시 nbytes 기준으로 검증하고, 불일치하면 error.
|
On a shape mismatch, the total nbytes is checked, and an error is raised if it differs.
|
||||||
- 정합성 기준은 주소 범위 기반 read/write semantics를 따른다.
|
- Correctness criteria follow address-range-based read/write semantics.
|
||||||
- 구현 최적화로 tensor object cache를 둘 수 있지만,
|
- A tensor object cache may be used as an implementation optimization,
|
||||||
canonical state는 byte-addressable storage이다.
|
but the canonical state is byte-addressable storage.
|
||||||
- deploy 시점에 호스트가 초기 텐서 데이터를 주입한다.
|
- At deploy time, the host injects initial tensor data.
|
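The dtype-mismatch rule in the contract above (reinterpret cast, validated by nbytes) could look roughly like this. `reinterpret_read` is a hypothetical helper, not the ADR's `MemoryStore.read`, and it assumes the stored tensor is contiguous as the contract requires.

```python
import numpy as np


def reinterpret_read(stored, shape, dtype):
    """Return `stored` viewed as (shape, dtype) without copying."""
    dtype = np.dtype(dtype)
    requested_nbytes = int(np.prod(shape)) * dtype.itemsize
    if requested_nbytes != stored.nbytes:
        raise ValueError(
            f"requested {requested_nbytes} bytes, stored tensor has {stored.nbytes}"
        )
    # Flatten first so the view works for any contiguous input shape.
    return stored.reshape(-1).view(dtype).reshape(shape)


stored = np.arange(8, dtype=np.float32).reshape(2, 4)   # 32 bytes
same = reinterpret_read(stored, (2, 4), np.float32)     # matching dtype
as_bytes = reinterpret_read(stored, (32,), np.uint8)    # reinterpret cast
```

A request such as `reinterpret_read(stored, (3, 4), np.float32)` asks for 48 bytes and is rejected with `ValueError`.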
||||||
|
|
||||||
### D8. 벤치마크 커널 코드
|
### D8. Benchmark Kernel Code
|
||||||
|
|
||||||
벤치마크의 **사용자 코드 API는 변경하지 않는다**.
|
The benchmark's **user code API is not changed**.
|
||||||
`tl.load()`, `tl.composite()`, `tl.store()` 등의 호출 인터페이스는 유지.
|
The call interfaces for `tl.load()`, `tl.composite()`, `tl.store()`, etc. are maintained.
|
||||||
|
|
||||||
단, 내부 command/message schema는 Phase 2 실행에 필요한 metadata를
|
However, internal command/message schemas may be extended to include metadata
|
||||||
포함하도록 확장될 수 있다 (예: dtype_acc, transpose 등 추가 필드).
|
required for Phase 2 execution (e.g., additional fields such as dtype_acc, transpose).
|
||||||
|
|
||||||
### D9. 컴포넌트 변경 없음
|
### D9. No Component Changes
|
||||||
|
|
||||||
개별 컴포넌트 구현(PE_GEMM, PE_DMA, HBM_CTRL 등)은 수정하지 않는다.
|
Individual component implementations (PE_GEMM, PE_DMA, HBM_CTRL, etc.) are not modified.
|
||||||
op_log 기록은 ComponentBase hook의 책임이다.
|
Op log recording is the responsibility of the ComponentBase hook.
|
||||||
커스텀 컴포넌트 교체 시 타이밍 모델만 교체되며,
|
When custom components are replaced, only the timing model changes,
|
||||||
Phase 2 데이터 실행은 영향받지 않는다.
|
and Phase 2 data execution is unaffected.
|
||||||
|
|
||||||
### D10. Phase 2는 Optional
|
### D10. Phase 2 is Optional
|
||||||
|
|
||||||
```python
|
```python
|
||||||
engine = GraphEngine(graph)
|
engine = GraphEngine(graph)
|
||||||
engine.run(benchmark) # Phase 1: 타이밍만
|
engine.run(benchmark) # Phase 1: timing only
|
||||||
result = engine.get_timing_result()
|
result = engine.get_timing_result()
|
||||||
|
|
||||||
if verify_data:
|
if verify_data:
|
||||||
executor = DataExecutor(engine.op_log) # Phase 2: 데이터
|
executor = DataExecutor(engine.op_log) # Phase 2: data
|
||||||
executor.run()
|
executor.run()
|
||||||
executor.verify(expected_output)
|
executor.verify(expected_output)
|
||||||
```
|
```
|
||||||
|
|
||||||
타이밍 분석만 필요하면 Phase 2를 건너뛴다.
|
If only timing analysis is needed, Phase 2 is skipped.
|
||||||
op_logger를 비활성화하면 Phase 1 성능도 기존과 동일.
|
If the op_logger is disabled, Phase 1 performance is the same as before this change.
|
||||||
|
|
||||||
### D11. Verification Contract
|
### D11. Verification Contract
|
||||||
|
|
||||||
기본 검증은 **최종 output tensor**를 reference backend(numpy)와 비교한다.
|
Basic verification **compares the final output tensor** against a reference backend (numpy).
|
||||||
|
|
||||||
dtype별 tolerance 정책:
|
Per-dtype tolerance policy:
|
||||||
|
|
||||||
| dtype | 비교 방식 | tolerance |
|
| dtype | Comparison method | Tolerance |
|
||||||
|-------|----------|-----------|
|
|-------|----------|-----------|
|
||||||
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
|
| f32 | `np.allclose` | rtol=1e-5, atol=1e-5 |
|
||||||
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
|
| f16 | `np.allclose` | rtol=1e-3, atol=1e-3 |
|
||||||
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
|
| bf16 | `np.allclose` | rtol=1e-2, atol=1e-2 |
|
||||||
| int 계열 | `np.array_equal` | exact |
|
| int types | `np.array_equal` | exact |
|
||||||
|
|
||||||
- 기본 모드: 최종 output만 비교 (end-to-end correctness)
|
- Default mode: compare final output only (end-to-end correctness)
|
||||||
- 디버그 모드: intermediate tensor도 op 단위로 비교 가능
|
- Debug mode: can compare intermediate tensors on a per-op basis
|
||||||
(MemoryStore snapshot at each op boundary)
|
(MemoryStore snapshot at each op boundary)
|
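The tolerance table can be expressed as a small lookup. This is a sketch under assumptions: `outputs_match` and the string dtype keys are hypothetical, and inputs are taken to be arrays that numpy can compare directly.

```python
import numpy as np

# (rtol, atol) per floating dtype, copied from the table above.
TOLERANCES = {
    "f32": (1e-5, 1e-5),
    "f16": (1e-3, 1e-3),
    "bf16": (1e-2, 1e-2),
}


def outputs_match(actual, expected, dtype_name):
    if dtype_name in TOLERANCES:
        rtol, atol = TOLERANCES[dtype_name]
        return bool(np.allclose(actual, expected, rtol=rtol, atol=atol))
    # Integer types: exact comparison.
    return bool(np.array_equal(actual, expected))


expected = np.array([1.0, 2.0], dtype=np.float32)
actual = expected + np.float32(5e-4)  # small numerical drift

f32_ok = outputs_match(actual, expected, "f32")
f16_ok = outputs_match(actual, expected, "f16")
int_ok = outputs_match(np.array([1, 2, 3]), np.array([1, 2, 3]), "i32")
```

The same drift of 5e-4 fails the f32 policy but passes the looser f16 policy, which is why the tolerance has to follow the dtype of the output under test.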
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Non-goals
|
## Non-goals
|
||||||
|
|
||||||
- **Compute-result-based control flow**: 지원하지 않는다.
|
- **Compute-result-based control flow**: not supported.
|
||||||
모든 compute handle은 Phase 1에서 pending 상태이며,
|
All compute handles are in pending state during Phase 1,
|
||||||
`wait()`는 timing synchronization만 표현하고 data readiness를 의미하지 않는다.
|
`wait()` expresses timing synchronization only and does not imply data readiness.
|
||||||
Phase 1에서 `handle.data` 접근, element access, truth-value evaluation은
|
Accessing `handle.data`, element access, or truth-value evaluation in Phase 1
|
||||||
**error로 처리**한다.
|
is **treated as an error**.
|
||||||
메모리 데이터 기반 분기(`tl.load()` 결과)는 greenlet으로 지원된다.
|
Memory-data-based branching (results of `tl.load()`) is supported via greenlet.
|
||||||
Phase 1 materialization은 future extension (D3 참조).
|
Phase 1 materialization is a future extension (see D3).
|
||||||
- **Cycle-accurate overlap reconstruction**: Phase 2에서 Phase 1의 실행 시간
|
- **Cycle-accurate overlap reconstruction**: Phase 2 does not precisely reproduce
|
||||||
overlap을 정확히 재현하지 않는다. Phase 2는 데이터 정합성만 검증한다.
|
the execution time overlap from Phase 1. Phase 2 only verifies data correctness.
|
||||||
- **GPU kernel compilation**: Phase 2의 GEMM/Math는 numpy/torch 호출이며,
|
- **GPU kernel compilation**: GEMM/Math in Phase 2 are numpy/torch calls
|
||||||
실제 하드웨어 PE의 마이크로아키텍처를 재현하지 않는다.
|
and do not reproduce the actual hardware PE microarchitecture.
|
||||||
|
|
||||||
## Open Questions
|
## Open Questions
|
||||||
|
|
||||||
- **Aliasing / slice view**: 동일 backing storage를 참조하는 slice/view를
|
- **Aliasing / slice view**: How to represent slice/views referencing the same
|
||||||
MemoryStore에서 어떻게 표현할지 (stride-based view vs copy semantics)
|
backing storage in MemoryStore (stride-based view vs copy semantics)
|
||||||
- **IPCQ/descriptor read 일반화**: PE-to-PE 통신을 memory op으로 완전히
|
- **IPCQ/descriptor read generalization**: Whether to fully generalize PE-to-PE
|
||||||
일반화할지, 별도 op_kind를 둘지
|
communication as memory ops or introduce a separate op_kind
|
||||||
- **Op log streaming**: 대규모 시뮬레이션에서 op_log 메모리 사용량 관리
|
- **Op log streaming**: Managing op_log memory usage in large-scale simulations
|
||||||
(in-memory list vs disk-backed streaming)
|
(in-memory list vs disk-backed streaming)
|
||||||
- **Fused operation**: tl.composite의 tiled pipeline (READ→COMPUTE→WRITE)을
|
- **Fused operation**: Whether to record tl.composite's tiled pipeline
|
||||||
하나의 fused op record로 기록할지, 개별 op으로 분리할지
|
(READ→COMPUTE→WRITE) as a single fused op record or separate individual ops
|
||||||
- **Math op schema 일반화**: 현재 math params는 단순 구조이나,
|
- **Math op schema generalization**: The current math params have a simple structure,
|
||||||
broadcasting rule, input별 dtype, keepdims, scalar/immediate operand,
|
but generalization may be needed for broadcasting rules, per-input dtype, keepdims,
|
||||||
where/mask 표현 등 일반화가 필요할 수 있음
|
scalar/immediate operands, where/mask expressions, etc.
|
||||||
- **Op record 식별자**: 현재 dependency_ids는 in-memory list index 기반이며,
|
- **Op record identifier**: Currently dependency_ids are based on in-memory list indices;
|
||||||
streaming/disk-backed mode 도입 시 stable op_id로 대체 필요
|
replacement with stable op_id is needed when introducing streaming/disk-backed mode
|
||||||
- **Phase 1 materialization policy**: D3의 Future Extension 참조.
|
- **Phase 1 materialization policy**: See Future Extension in D3.
|
||||||
허용 시 해당 op의 Phase 2 처리 방식 (skip / verify / recompute) 정의 필요
|
If allowed, the Phase 2 handling approach (skip / verify / recompute) for those ops
|
||||||
|
needs to be defined
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Consequences
|
## Consequences
|
||||||
|
|
||||||
### 긍정적
|
### Positive
|
||||||
|
|
||||||
- SimPy 시뮬레이션 성능 영향 최소 (op_log append만 추가)
|
- Minimal impact on SimPy simulation performance (only op_log append added)
|
||||||
- Phase 2에서 멀티스레드/GPU 자유롭게 사용 가능
|
- Phase 2 is free to use multi-threading or GPU
|
||||||
- 컴포넌트 교체 자유도 유지 (ADR-0015 설계 철학 보존)
|
- Component replaceability preserved (ADR-0015 design philosophy maintained)
|
||||||
- 벤치마크 사용자 코드 API 변경 불필요
|
- No changes needed to benchmark user code API
|
||||||
- 새 메시지 타입 추가 시 data_op 플래그만 설정
|
- Adding a new message type only requires setting the data_op flag
|
||||||
- greenlet으로 Phase 0 제거 — 메모리 데이터 기반 dynamic control flow 지원
|
- Phase 0 eliminated via greenlet — memory-data-based dynamic control flow supported
|
||||||
- `tl.load()`가 실제 데이터를 반환하므로 커널 디버깅 용이
|
- `tl.load()` returns actual data, making kernel debugging easier
|
||||||
|
|
||||||
### 부정적
|
### Negative
|
||||||
|
|
||||||
- op_log 메모리 사용량 (대규모 시뮬레이션 시)
|
- op_log memory usage (for large-scale simulations)
|
||||||
- Phase 2 실행 시간은 텐서 크기에 비례 (대형 GEMM)
|
- Phase 2 execution time is proportional to tensor size (large GEMM)
|
||||||
- pending handle (연산 미완료) 기반 동적 분기 불가
|
- Dynamic branching based on pending handles (incomplete computations) not possible
|
||||||
(연산은 Phase 2에서 실행, Phase 1에서 결과 값 미확정).
|
(computations execute in Phase 2, result values are undetermined in Phase 1).
|
||||||
메모리 데이터 기반 분기는 greenlet으로 지원된다.
|
Memory-data-based branching is supported via greenlet.
|
||||||
- greenlet C 확장 의존성 추가 (pip install greenlet)
|
- greenlet C extension dependency added (`pip install greenlet`)
|
||||||
|
|||||||
@@ -1,882 +0,0 @@
|
|||||||
# ADR-0023: PE-level IPCQ — Inter-PE Collective Communication
|
|
||||||
|
|
||||||
## Status
|
|
||||||
|
|
||||||
Accepted
|
|
||||||
|
|
||||||
## Context
|
|
||||||
|
|
||||||
### Goal
|
|
||||||
|
|
||||||
Add the infrastructure that lets CCL (Collective Communication Library)
|
|
||||||
kernels run **inside** a PE. The host just launches a kernel on each
|
|
||||||
SIP; the actual synchronization and data movement happen **inside the
|
|
||||||
PE kernel via an IPCQ (Inter-Process Communication Queue)**.
|
|
||||||
|
|
||||||
This mirrors how NCCL performs NVLink communication inside a GPU
|
|
||||||
kernel, or how Cerebras / Tenstorrent expose core-local communication
|
|
||||||
queues. Host-level collectives (`dist.all_reduce`) are deferred to
|
|
||||||
**future work**; this ADR focuses solely on the kernel-side collective
|
|
||||||
infrastructure.
|
|
||||||
|
|
||||||
### Problems to solve
|
|
||||||
|
|
||||||
1. PE-to-PE direct data movement (writing into a peer's memory).
|
|
||||||
2. Synchronization — the sender must check that the receiver has space
|
|
||||||
in its buffer (backpressure).
|
|
||||||
3. Resource contention between compute traffic and communication
|
|
||||||
traffic (Head-of-Line blocking).
|
|
||||||
4. The host must be able to construct logical neighbor topologies
|
|
||||||
(ring / mesh / tree) per algorithm.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
### D1. Add a new `PE_IPCQ` component
|
|
||||||
|
|
||||||
A new component `PE_IPCQ` is added inside each PE. It follows the same
|
|
||||||
pattern as PE_GEMM / PE_MATH — modeling a sub-block of the PE as a
|
|
||||||
distinct component.
|
|
||||||
|
|
||||||
```
|
|
||||||
PE
├── PE_CPU
├── PE_SCHEDULER
├── PE_DMA
├── PE_IPCQ          ← new
├── PE_FETCH_STORE
├── PE_GEMM
├── PE_MATH
├── PE_TCM
└── PE_MMU
|
|
||||||
```
|
|
||||||
|
|
||||||
**Role separation** (control plane vs. data plane):
|
|
||||||
|
|
||||||
- **PE_IPCQ (control plane)**: ring-buffer address arithmetic, head /
|
|
||||||
tail pointer management, peer pointer caches, backpressure, 4-direction
|
|
||||||
neighbor mapping.
|
|
||||||
- **PE_DMA (data plane)**: actually moves data through cube_noc / UCIe
|
|
||||||
/ PCIE into the peer's memory.
|
|
||||||
|
|
||||||
PE_IPCQ does **not** move data itself — it delegates to PE_DMA.
|
|
||||||
|
|
||||||
### D2. Ring buffer model
|
|
||||||
|
|
||||||
Each PE owns 4 directions (N/S/E/W) × {tx, rx} = 8 ring buffers.
|
|
||||||
|
|
||||||
```python
|
|
||||||
@dataclass
class IpcqQueuePair:
    direction: Direction    # N/S/E/W
    peer: IpcqEndpoint      # set by host at init time (D2.5)
    tx_buffer_base: int     # outgoing data base addr (in our memory)
    rx_buffer_base: int     # incoming data base addr (in our memory)
    slot_size: int          # 1 tile per slot
    n_slots: int            # ring depth
    my_head: int            # next slot we will write/send into
    my_tail: int            # next slot we will read/recv from
    peer_head_cache: int    # peer's last-seen head (updated via D9 piggyback)
    peer_tail_cache: int    # peer's last-seen tail (updated via D9 fast-path credit)
|
|
||||||
```
|
|
||||||
|
|
||||||
**Canonical field names**: throughout this ADR the four names above
|
|
||||||
(`my_head`, `my_tail`, `peer_head_cache`, `peer_tail_cache`) are used
|
|
||||||
consistently. Synonyms (`peer_head_local`, `peer_head`, `peer_tail`,
|
|
||||||
etc.) are not used.
|
|
||||||
|
|
||||||
| Field | Owner | Updated when |
|
|
||||||
|-------|-------|--------------|
|
|
||||||
| `my_head` | local PE_IPCQ | immediately after `tl.send` (send tracking) |
|
|
||||||
| `my_tail` | local PE_IPCQ | immediately after `tl.recv` (recv tracking) |
|
|
||||||
| `peer_head_cache` | local PE_IPCQ | on `IpcqMetaArrival` (D9 piggyback) |
|
|
||||||
| `peer_tail_cache` | local PE_IPCQ | on `IpcqCreditMetadata` (D9 fast path) |
|
|
||||||
|
|
||||||
**Slot unit**: fixed-size, one slot holds one full tile (no descriptor
|
|
||||||
indirection). Full data embedded in the slot. See D5.
|
|
||||||
|
|
||||||
### D2.5. `IpcqEndpoint` schema
|
|
||||||
|
|
||||||
`IpcqQueuePair.peer` carries everything the sender needs to compute the
|
|
||||||
peer's rx slot address:
|
|
||||||
|
|
||||||
```python
|
|
||||||
@dataclass(frozen=True)
class IpcqEndpoint:
    sip: int
    cube: int
    pe: int
    buffer_kind: str    # "tcm" | "hbm" | "sram"
    rx_base_pa: int     # peer rx_buffer base PA (PhysAddr.encode())
    rx_base_va: int     # peer rx_buffer base VA (optional, MMU mode)
    n_slots: int        # peer ring depth (for wrap-around)
    slot_size: int      # peer slot size (for offset)
|
|
||||||
```
|
|
||||||
|
|
||||||
Address computation:
|
|
||||||
|
|
||||||
```python
|
|
||||||
slot_idx = self.my_head % peer.n_slots
|
|
||||||
dst_pa = peer.rx_base_pa + slot_idx * peer.slot_size
|
|
||||||
```
|
|
||||||
|
|
||||||
PE_IPCQ passes `dst_pa` to PE_DMA inside an `IpcqDmaToken`. PE_DMA
|
|
||||||
(vc_comm) routes the data to `dst_pa` through the fabric.
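
The wrap-around in the address computation can be checked with a small standalone sketch. `Endpoint` and `peer_slot_pa` are hypothetical names introduced only for this illustration; `Endpoint` keeps just the three `IpcqEndpoint` fields the formula uses, and the base address and ring geometry are made-up example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Trimmed-down stand-in for IpcqEndpoint (hypothetical, illustration only)."""
    rx_base_pa: int
    n_slots: int
    slot_size: int


def peer_slot_pa(my_head: int, peer: Endpoint) -> int:
    """Physical address of the peer rx slot targeted by send number `my_head`."""
    slot_idx = my_head % peer.n_slots          # ring wrap-around
    return peer.rx_base_pa + slot_idx * peer.slot_size


# Example values are assumptions: 8 slots of 4096 bytes starting at 0x1000.
peer = Endpoint(rx_base_pa=0x1000, n_slots=8, slot_size=4096)
addresses = [peer_slot_pa(head, peer) for head in range(10)]
```

Because `my_head` only ever grows, the ninth send (`my_head == 8`) lands on the same address as the first one, which is why the backpressure check in D5 has to run before the DMA is submitted.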
|
|
||||||
|
|
||||||
**Endpoint construction order**: at backend init (D10), the IPCQ
|
|
||||||
buffers for **every PE** are allocated first (so each rank knows the
|
|
||||||
others' PA), then the per-rank neighbor tables are built and pushed to
|
|
||||||
PE_IPCQ via `IpcqInitMsg`.
|
|
||||||
|
|
||||||
### D3. Four-direction mapping ≡ logical ProcessGroup
|
|
||||||
|
|
||||||
The PE views four directions (N/S/E/W) as logical ports. Real peer
|
|
||||||
addresses are configured by the host CCL init, per the chosen
|
|
||||||
algorithm. The PE kernel never knows the topology, only directions.
|
|
||||||
|
|
||||||
```python
|
|
||||||
# 1D ring
for rank in range(world_size):
    ipcq_set_neighbor(rank, "E", peer=ranks[(rank + 1) % world_size])
    ipcq_set_neighbor(rank, "W", peer=ranks[(rank - 1) % world_size])

# 2D mesh
for r in range(R):
    for c in range(C):
        ipcq_set_neighbor((r, c), "N", peer=((r - 1) % R, c))
        ipcq_set_neighbor((r, c), "S", peer=((r + 1) % R, c))
        ipcq_set_neighbor((r, c), "E", peer=(r, (c + 1) % C))
        ipcq_set_neighbor((r, c), "W", peer=(r, (c - 1) % C))
|
|
||||||
```
|
|
||||||
|
|
||||||
The PE code does not need to know where `tl.send(dir="E", ...)` actually
|
|
||||||
ends up.
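
As a minimal sketch of the host-side mapping, the 1D ring can be built as a plain table and checked for symmetry (if A's `E` peer is B, then B's `W` peer is A). The function names and the dict-of-dicts table shape are assumptions made for this example, not the backend's actual data structures.

```python
REVERSE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def ring_1d_neighbors(world_size: int) -> dict:
    """Hypothetical helper: rank -> {direction: peer rank} for a 1D ring."""
    return {
        rank: {"E": (rank + 1) % world_size, "W": (rank - 1) % world_size}
        for rank in range(world_size)
    }


def is_symmetric(table: dict) -> bool:
    """True if every link is mirrored by the peer's reverse direction."""
    return all(
        table[peer].get(REVERSE[direction]) == rank
        for rank, dirs in table.items()
        for direction, peer in dirs.items()
    )


ring = ring_1d_neighbors(4)
broken = {0: {"E": 1}, 1: {"W": 1}}   # rank 1 points back at itself
```

The kernel side only ever sees the direction keys; the rank values stay inside the host-built table.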
|
|
||||||
|
|
||||||
### D4. PE kernel API
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Send (blocking; may stall on backpressure)
|
|
||||||
tl.send(dir: str, src=TensorHandle)
|
|
||||||
tl.send(dir: str, src_addr=..., nbytes=..., shape=..., dtype=..., space=...)
|
|
||||||
|
|
||||||
# Recv (blocking)
|
|
||||||
recv = tl.recv(dir: str, shape=..., dtype=...)
|
|
||||||
recv = tl.recv(shape=..., dtype=...) # round-robin across 4 directions
|
|
||||||
|
|
||||||
# Recv (non-blocking)
|
|
||||||
fut = tl.recv_async(dir: str, shape=..., dtype=...)
|
|
||||||
recv = tl.wait(fut)
|
|
||||||
```
|
|
||||||
|
|
||||||
`tl.recv()` (no direction) keeps a `last_polled_dir` cursor and on each
|
|
||||||
call rotates through directions, returning the first available slot.
|
|
||||||
If all four directions are empty, the call waits.
|
|
||||||
|
|
||||||
**Fairness is weak**: the rotating start mitigates simple bias, but if
|
|
||||||
one direction always wins the race the others can starve. Algorithms
|
|
||||||
that need strict fairness must call `tl.recv(dir=...)` explicitly.
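
The rotating cursor can be sketched without any simulator machinery. `RoundRobinRecv`, its `queues` attribute, and the N, S, E, W polling order are assumptions for this illustration; only the `last_polled_dir` cursor idea comes from the text above.

```python
from collections import deque

DIRECTIONS = ("N", "S", "E", "W")   # polling order is an assumption


class RoundRobinRecv:
    """Toy model of direction-less recv: rotate from the last served direction."""

    def __init__(self):
        self.queues = {d: deque() for d in DIRECTIONS}
        self.last_polled_dir = len(DIRECTIONS) - 1   # first poll starts at "N"

    def try_recv(self):
        """Return (direction, item) from the first non-empty queue, else None."""
        n = len(DIRECTIONS)
        for step in range(1, n + 1):
            idx = (self.last_polled_dir + step) % n
            direction = DIRECTIONS[idx]
            if self.queues[direction]:
                self.last_polled_dir = idx
                return direction, self.queues[direction].popleft()
        return None   # all four empty: the real recv would wait here


rx = RoundRobinRecv()
rx.queues["E"].extend(["e0", "e1"])
rx.queues["W"].append("w0")
order = [rx.try_recv() for _ in range(4)]
```

In this run `W` is served between the two `E` items because the cursor moves past `E` after each hit. A direction that refills faster than the cursor rotates can still dominate, which is the weak-fairness caveat stated above.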
|
|
||||||
|
|
||||||
### D5. Single-hop DMA write + full-data slot model
|
|
||||||
|
|
||||||
Data moves from sender memory into the receiver's ring slot in **one
|
|
||||||
DMA transfer**. Key properties:
|
|
||||||
|
|
||||||
- **Single-hop**: the sender already knows the peer rx slot address and
|
|
||||||
fires one fabric DMA into it.
|
|
||||||
- **No CPU memcpy**: the CPU never copies data.
|
|
||||||
- **No intermediate staging**: neither side keeps a separate staging
|
|
||||||
buffer (sender uses the source addr directly; receiver gets the data
|
|
||||||
in its ring slot directly).
|
|
||||||
|
|
||||||
(Strictly speaking the fabric DMA write does happen, so this is not
|
|
||||||
literally "no data movement" — it's the same property NCCL labels
|
|
||||||
"zero-copy", meaning no CPU memcpy and no staging copy.)
|
|
||||||
|
|
||||||
```
|
|
||||||
PE A: tl.send(E, src_addr, nbytes)
  1. IPCQ computes the peer rx slot address:
       dst_addr = peer.rx_base_pa + (my_head % peer.n_slots) * peer.slot_size
  2. Backpressure: my_head - peer_tail_cache < peer.n_slots ?
       (full → sleep / poll)
  3. Submit DMA on PE_DMA(vc_comm): src_addr → peer dst_addr, nbytes
  4. my_head += 1

PE B: data = tl.recv(W)
  1. Look at rx_buffer[my_tail % n_slots]
  2. Wait for the data to arrive (D7 backpressure mode)
  3. Return the slot address to the kernel (or fetch into register file)
  4. my_tail += 1
  5. Issue a credit-return fast path (D9): after the bottleneck-BW
     latency the peer A's peer_tail_cache is updated.
|
|
||||||
```
|
|
||||||
|
|
||||||
The slot holds the full tile. The receiver only reads its own
|
|
||||||
rx_buffer; it never reads back into A's memory. The sender knows the
|
|
||||||
peer rx slot address and DMAs directly into it (single-hop).
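
The pointer bookkeeping of the send and recv flows can be exercised with an untimed toy model. `RingLink` and its attribute names are hypothetical; the sketch ignores fabric latency, models the piggybacked head update and the credit return as immediate assignments, and stores Python objects in place of tiles.

```python
class RingLink:
    """One direction of an IPCQ link: A sends, B receives (untimed toy model)."""

    def __init__(self, n_slots: int):
        self.n_slots = n_slots
        self.rx_slots = [None] * n_slots   # B's rx ring
        self.a_head = 0                    # A.my_head
        self.a_peer_tail_cache = 0         # A's cached view of B.my_tail
        self.b_tail = 0                    # B.my_tail
        self.b_peer_head_cache = 0         # B's cached view of A.my_head

    def send(self, tile) -> bool:
        # Backpressure check: slots in flight must stay below the ring depth.
        if self.a_head - self.a_peer_tail_cache >= self.n_slots:
            return False                   # full: a real sender would sleep or poll
        self.rx_slots[self.a_head % self.n_slots] = tile   # single-hop write
        self.a_head += 1
        self.b_peer_head_cache = self.a_head   # metadata arrives with the data
        return True

    def recv(self):
        if self.b_tail == self.b_peer_head_cache:
            return None                    # nothing has arrived yet
        tile = self.rx_slots[self.b_tail % self.n_slots]
        self.b_tail += 1
        self.a_peer_tail_cache = self.b_tail   # credit return (latency ignored)
        return tile


link = RingLink(n_slots=2)
results = [link.send("t0"), link.send("t1"), link.send("t2")]
first = link.recv()
retry = link.send("t2")
rest = [link.recv(), link.recv(), link.recv()]
```

The third send is refused until the receiver frees a slot, after which the retry succeeds and the tiles come out in send order.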
|
|
||||||
|
|
||||||
The PE's own PE_TCM read/write does not go through DMA (PE_TCM is local
|
|
||||||
to the PE).
|
|
||||||
|
|
||||||
### D6. Buffer placement — three-way benchmark
|
|
||||||
|
|
||||||
The host CCL init picks the IPCQ ring-buffer location:
|
|
||||||
|
|
||||||
```python
|
|
||||||
ipcq_init(
    backend="ahbm",
    buffer_kind="tcm" | "hbm" | "sram",
    n_slots=8,
    slot_size=4096,
)
|
|
||||||
```
|
|
||||||
|
|
||||||
| Location | Trait | Trade-off |
|
|
||||||
|----------|-------|-----------|
|
|
||||||
| **PE_TCM** | Attached to the PE; fast | Small; competes with PE-internal resources |
|
|
||||||
| **PE-local HBM** | Large; via DMA | Higher latency |
|
|
||||||
| **Cube SRAM** | Mid-size; cube-shared | Cube-internal contention |
|
|
||||||
|
|
||||||
All three locations run the same kernel code; only the init differs.
|
|
||||||
|
|
||||||
### D7. Backpressure — two-mode benchmark
|
|
||||||
|
|
||||||
How the sender or receiver waits when peer slots are full or data has not
|
|
||||||
yet arrived:
|
|
||||||
|
|
||||||
| Mode | Behavior | Model |
|
|
||||||
|------|----------|-------|
|
|
||||||
| **poll** | Periodically re-check the cached peer pointer | Spin loop |
|
|
||||||
| **sleep** | Yield a SimPy event; wake on a peer-trigger | Interrupt-like |
|
|
||||||
|
|
||||||
```python
|
|
||||||
ipcq_init(backpressure="poll" | "sleep", ...)
|
|
||||||
```
|
|
||||||
|
|
||||||
Both modes are implemented so latency / throughput trade-offs can be
|
|
||||||
benchmarked.
|
|
||||||
|
|
||||||
### D8. PE_DMA virtual channels
|
|
||||||
|
|
||||||
Extend PE_DMA from a single queue into a **two-channel virtual-channel**
|
|
||||||
model.
|
|
||||||
|
|
||||||
```
|
|
||||||
PE_DMA
|
|
||||||
├── vc_compute: tile load / store / writeback for GEMM and Math
|
|
||||||
└── vc_comm: IPCQ send data
|
|
||||||
```
|
|
||||||
|
|
||||||
Each VC has an independent state machine:
|
|
||||||
|
|
||||||
- One channel stalling does not block the other.
|
|
||||||
- The same physical link (cube_noc, UCIe, …) is shared, but link BW is
|
|
||||||
split between channels.
|
|
||||||
|
|
||||||
**Chunk-level interleave**:
|
|
||||||
|
|
||||||
- Large GEMM tile DMAs do not lock the link end-to-end.
|
|
||||||
- Progress happens in chunks (e.g. 256 B); each chunk shares link BW
|
|
||||||
with the other VC's pending chunks.
|
|
||||||
- Chunk size is an init parameter (smaller = fairer, larger = more
|
|
||||||
efficient).
|
|
||||||
|
|
||||||
Net effect:
|
|
||||||
|
|
||||||
- HoL blocking is eliminated (an IPCQ send can interleave with a long
|
|
||||||
compute DMA).
|
|
||||||
- Compute / comm overlap is natural (NVIDIA copy-engine + compute-SM
|
|
||||||
pattern).
|
|
||||||
- Matches the NoC-virtual-channel pattern used in real HW.
|
|
||||||
|
|
||||||
**First-implementation accuracy limit (intentional)**: this ADR's
|
|
||||||
first cut uses **deterministic chunk-level interleave + weighted
|
|
||||||
round-robin arbitration** (default 50 / 50, exposed in `ccl.yaml`).
|
|
||||||
This is a first-order approximation and is simpler than real HW
|
|
||||||
dynamic-contention / credit-based arbiters. Functional correctness is
|
|
||||||
unaffected, but heavy-contention scenarios may report slightly
|
|
||||||
optimistic latency vs. real HW. A separate ADR can add a NoC arbiter
|
|
||||||
component later if more precision is needed.
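
A deterministic weighted round-robin over chunks can be sketched as follows. The function name, its signature, and the example sizes are assumptions for illustration; the actual arbiter works on SimPy events and link bandwidth, which this sketch leaves out.

```python
def interleave_chunks(compute_bytes, comm_bytes, chunk_size=256, weights=(1, 1)):
    """Return the order in which chunks of two pending DMAs occupy the link.

    weights = (compute, comm) chunks granted per arbitration round.
    """
    if min(weights) < 1:
        raise ValueError("each weight must be at least 1")
    remaining = {"vc_compute": compute_bytes, "vc_comm": comm_bytes}
    quota = {"vc_compute": weights[0], "vc_comm": weights[1]}
    schedule = []
    while any(remaining.values()):
        for vc in ("vc_compute", "vc_comm"):
            for _ in range(quota[vc]):
                if remaining[vc] <= 0:
                    break
                sent = min(chunk_size, remaining[vc])
                remaining[vc] -= sent
                schedule.append((vc, sent))
    return schedule


# A 1 KiB compute tile DMA competing with a single 256 B IPCQ send.
schedule = interleave_chunks(compute_bytes=1024, comm_bytes=256)
comm_position = schedule.index(("vc_comm", 256))
```

With the default 1:1 weights the IPCQ chunk goes out second, not after all four compute chunks, which is the head-of-line relief described above.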
|
|
||||||
|
|
||||||
#### Token routing
|
|
||||||
|
|
||||||
- Compute tokens (`TileToken`) — go through the existing
|
|
||||||
PE_FETCH_STORE → PE_DMA chain.
|
|
||||||
- Communication tokens (`IpcqDmaToken`, new) — PE_IPCQ → PE_DMA
|
|
||||||
self-routing.
|
|
||||||
- PE_DMA picks the channel by token type.
|
|
||||||
|
|
||||||
```python
|
|
||||||
class PeDmaComponent:
    def _process(self, env, token):
        if isinstance(token, IpcqDmaToken):
            yield from self._vc_comm_process(env, token)
        else:
            yield from self._vc_compute_process(env, token)
|
|
||||||
```
|
|
||||||
|
|
||||||
### D9. Pointer synchronization — DMA payload piggyback
|
|
||||||
|
|
||||||
Real HW (NVLink, UCIe, etc.) piggybacks metadata onto DMA payloads so
|
|
||||||
pointers update along with the data. This simulation adopts the same
|
|
||||||
model: **no separate control channel** — metadata travels with the
|
|
||||||
data.
|
|
||||||
|
|
||||||
The big benefits:
|
|
||||||
|
|
||||||
- **Automatic ordering**: data and metadata move on the same token, so
|
|
||||||
data is visible **before** the head_cache update. No race.
|
|
||||||
- **HW fidelity**: matches NVLink / UCIe piggybacked headers.
|
|
||||||
- **Component simplification**: no separate `IpcqPtrUpdate` event type.
|
|
||||||
|
|
||||||
#### Send flow (head update via piggyback)
|
|
||||||
|
|
||||||
```
|
|
||||||
PE A: tl.send(E, src_addr, nbytes)
  1. PE_IPCQ checks backpressure (using peer_tail_cache)
  2. PE_IPCQ creates an IpcqDmaToken:
       - data body (src_addr → peer dst_addr)
       - piggyback metadata: (sender_seq, src_sip/cube/pe, src_direction)
  3. Hand the token to PE_DMA(vc_comm)
  4. PE A increments my_head (send tracking)

      [fabric DMA: latency elapses]

PE B's PE_DMA receives the token
  5. Writes data into dst_addr (B's rx slot) via MemoryStore.write
  6. Forwards token metadata to PE B's PE_IPCQ (PE-internal wire, ~1 cycle)

PE B's PE_IPCQ receives the metadata
  7. Updates peer_head_cache (= A's head)
  8. Wakes any pending recv on that direction
|
|
||||||
```
|
|
||||||
|
|
||||||
**Steps 5 and 6 must execute in the same SimPy step** — DMA completion
|
|
||||||
makes data and metadata atomically visible.
|
|
||||||
|
|
||||||
#### Recv flow (credit return — fast path with bottleneck-BW latency)
|
|
||||||
|
|
||||||
When the receiver frees a slot, the sender must learn about it
|
|
||||||
(backpressure release). Unlike data, the credit return does **not**
|
|
||||||
travel through general vc_comm fabric — it uses a **separate fast
|
|
||||||
path**, an abstraction of the NVLink / UCIe credit-return wire.
|
|
||||||
|
|
||||||
**Latency** is computed from the **full path latency** (per-node
|
|
||||||
overhead + edge propagation + drain), not a magic constant:
|
|
||||||
|
|
||||||
```
|
|
||||||
credit_size_bytes = 16   (ccl.yaml: ipcq_credit_size_bytes)
path    = router.find_path(self_pe, peer_pe.pe_dma)
latency = compute_path_latency_ns(path, credit_size_bytes)
        = sum(edge.distance_mm * ns_per_mm)
        + sum(node_overhead_ns[n] for n in path)
        + credit_size_bytes / bottleneck_bw_on_path
|
|
||||||
```
|
|
||||||
|
|
||||||
The router auto-appends `.pe_dma` to the source only, so the
|
|
||||||
destination MUST be spelled with the explicit `.pe_dma` suffix or
|
|
||||||
`find_path` raises and the credit silently teleports at zero cost
|
|
||||||
(latent bug fixed alongside this update).
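
The three terms of the latency formula can be evaluated in a standalone sketch. `Edge` and `sketch_path_latency_ns` are hypothetical stand-ins, not the simulator's `compute_path_latency_ns`, and every distance, bandwidth, and overhead below is a made-up example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical link description (illustration only)."""
    distance_mm: float
    bw_bytes_per_ns: float   # 1 GB/s equals 1 byte/ns


def sketch_path_latency_ns(edges, node_overheads_ns, nbytes, ns_per_mm):
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overheads_ns)
    drain = nbytes / min(e.bw_bytes_per_ns for e in edges)
    return propagation + overhead + drain


CREDIT_BYTES = 16
NS_PER_MM = 0.01   # assumed propagation constant

# Assumed paths: one short hop inside a cube vs. three hops across SIPs.
in_cube = sketch_path_latency_ns(
    [Edge(2.0, 64.0)], [1.0, 1.0], CREDIT_BYTES, NS_PER_MM)
cross_sip = sketch_path_latency_ns(
    [Edge(2.0, 64.0), Edge(20.0, 16.0), Edge(2.0, 64.0)],
    [1.0, 2.0, 2.0, 1.0], CREDIT_BYTES, NS_PER_MM)
```

Under these assumed numbers the cross-SIP credit costs more than the in-cube one on all three terms: longer wires, more nodes, and a narrower bottleneck link.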
|
|
||||||
|
|
||||||
`tl.recv` blocks on the credit-emit completion (recv yields-from
|
|
||||||
`_delayed_credit_send` rather than spawning it as a fork). This puts
|
|
||||||
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
|
|
||||||
IPCQ control-plane completing the consume-acknowledgement before
|
|
||||||
recv returns to the kernel — the protocol equivalent of a non-posted
|
|
||||||
`tl.store` waiting for an HBM ack on the raw DMA path.
|
|
||||||
|
|
||||||
That gives us:
|
|
||||||
|
|
||||||
- **Topology-proportional approximation**: an in-cube credit return is
|
|
||||||
automatically faster than a cross-SIP credit return.
|
|
||||||
- **No magic constants**: every nanosecond comes from
|
|
||||||
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
|
|
||||||
as data traffic.
|
|
||||||
- **No deadlock risk**: unlike piggyback, B can issue credit even when
|
|
||||||
it has no data to send back. `peer_credit_store.put` is unbounded.
|
|
||||||
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
|
|
||||||
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
|
|
||||||
|
|
||||||
#### Component coupling — SimPy Store channel
|
|
||||||
|
|
||||||
PE B's PE_IPCQ does not call PE A's PE_IPCQ directly. Instead, at init
|
|
||||||
time, **a SimPy Store is wired between the two** (a per-direction
|
|
||||||
fast-path channel) and credit metadata is `put` into that store.
|
|
||||||
|
|
||||||
```python
|
|
||||||
class PeIpcqComponent:
    def _delayed_credit_send(self, env, peer_credit_store, my_tail, latency_ns):
        yield env.timeout(latency_ns)
        yield peer_credit_store.put(IpcqCreditMetadata(seq=my_tail, ...))
|
|
||||||
```
|
|
||||||
|
|
||||||
Backend init wires both directions of the fast-path channel as part of
|
|
||||||
fan-out (see `IpcqInitMsg` in D12).
|
|
||||||
|
|
||||||
#### Credit-return fast path limitations
|
|
||||||
|
|
||||||
- `credit_size_bytes` is an estimate (typically 16–64 bytes).
|
|
||||||
- The fast path is **excluded from vc_comm BW contention** (separate
|
|
||||||
wire). Real HW credit-return wires are very lightweight, so this is a
|
|
||||||
reasonable first approximation.
|
|
||||||
- A follow-up ADR can: model the credit fast path as a separate link
|
|
||||||
(BW limit + contention), or switch to piggyback (`credit_return_mode:
|
|
||||||
piggyback`).
|
|
||||||
|
|
||||||
#### PE_DMA's added responsibility
|
|
||||||
|
|
||||||
When `vc_comm` receives a token, PE_DMA processes it as the following
|
|
||||||
sequence: pay the Transaction's terminal BW drain, then atomically
|
|
||||||
write data and forward metadata. **No SimPy yield is allowed between
|
|
||||||
the data write and the metadata forward** (invariant I6). The drain
|
|
||||||
yield must sit before the atomic block, not inside it:
|
|
||||||
|
|
||||||
```python
|
|
||||||
def _on_vc_comm_recv(self, env, txn):
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
    # sender PE_DMA). MUST happen before the atomic block so recv only
    # wakes after the bytes have "landed".
    drain = getattr(txn, "drain_ns", 0.0)
    if drain > 0:
        yield env.timeout(drain)

    token = txn.request
    # ── ATOMIC: no yield between these two operations ──
    # 1. Copy the payload from the source into the destination rx slot
    data = self._memory_store.read(token.src_space, token.src_addr,
                                   shape=..., dtype=...)
    self._memory_store.write(token.dst_endpoint.buffer_kind,
                             token.dst_addr, data)
    # 2. Forward metadata to the local PE_IPCQ
    yield self.out_ports[self._ipcq_id].put(IpcqMetaArrival(token=token))
    # ───────────────────────────────────────────────────
|
|
||||||
```
|
|
||||||
|
|
||||||
The final `put` is yieldable but uses an unbounded internal store, so
|
|
||||||
it completes in a single step. That `put` is the closing call of the
|
|
||||||
atomic block; nothing may be inserted before it.
|
|
||||||
|
|
||||||
#### Drain-at-inbound semantics (D9 timing model)
|
|
||||||
|
|
||||||
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
|
|
||||||
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
|
|
||||||
is paid at each forwarding component via `run()`, and the remaining
|
|
||||||
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
|
|
||||||
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
|
|
||||||
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
|
|
||||||
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
|
|
||||||
(so IPCQ-specific data write + metadata forward can happen), so **the
|
|
||||||
drain MUST be paid explicitly at the top of that handler** to keep
|
|
||||||
IPCQ's timing model on par with every other fabric Transaction.
|
|
||||||
|
|
||||||
Side-effects of paying drain here:
|
|
||||||
|
|
||||||
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
|
|
||||||
preserved because the sender PE_DMA does not `yield sub_done`. The
|
|
||||||
`sub_done.succeed()` call (made after metadata forward below) is an
|
|
||||||
event with no listener on the sender side.
|
|
||||||
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
|
|
||||||
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
|
|
||||||
forward now happens after the drain, recv observes the full fabric
|
|
||||||
transfer time including bandwidth cost.
|
|
||||||
|
|
||||||
Matches the physical picture: send dispatches and leaves; recv waits
|
|
||||||
until the bytes have actually been drained into its inbox.
|
|
||||||
|
|
||||||
### D9.5. ADR-0020 (2-pass) integration
|
|
||||||
|
|
||||||
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
|
|
||||||
1 simulates timing **and** moves data via MemoryStore; Phase 2 enables
|
|
||||||
op-log-based correctness verification.
|
|
||||||
|
|
||||||
#### Phase 1 (timing + data)
|
|
||||||
|
|
||||||
D9 models head and tail updates with two different mechanisms:
|
|
||||||
|
|
||||||
- **Send-side (head update)** — DMA payload piggyback. Data write and
|
|
||||||
metadata forward happen in the same SimPy step → automatic atomic
|
|
||||||
visibility.
|
|
||||||
- **Recv-side (tail credit return)** — fast-path SimPy Store channel
|
|
||||||
with bottleneck-BW latency, then `peer_tail_cache` update.
|
|
||||||
|
|
||||||
Together they preserve ring-buffer pointer consistency.
|
|
||||||
|
|
||||||
The op-log records `op_kind="ipcq"` entries for sends (with
|
|
||||||
`src/dst/space/addr/nbytes/dir/dtype/shape/sender_seq`) and recvs (with
|
|
||||||
`recv_mode/src/dst/space/addr/nbytes/dir/dtype/shape/consumer_seq`).
|
|
||||||
Two recv modes:
|
|
||||||
|
|
||||||
- **`return_slot`** (default): the slot address is returned to the
|
|
||||||
kernel. Zero-copy.
|
|
||||||
- **`copy_to_dst`**: when the kernel passes `dst_addr` + `dst_space`,
|
|
||||||
PE_IPCQ copies the slot data into the user dst.
|
|
||||||
|
|
||||||
#### Phase 2 (op_log replay)
|
|
||||||
|
|
||||||
When `DataExecutor` encounters an `op_kind="ipcq"` record:
|
|
||||||
|
|
||||||
- **send**: idempotent `src → dst` ndarray write.
|
|
||||||
- **recv (`return_slot`)**: no-op (the slot already holds the data).
|
|
||||||
- **recv (`copy_to_dst`)**: idempotent `slot → dst_addr` copy.
|
|
||||||
|
|
||||||
IPCQ ops are pure data movement — Phase 2 has nothing extra to compute.
|
|
||||||
The downstream GEMM / Math ops in `DataExecutor` will consume the data
|
|
||||||
and naturally validate correctness.
|
|
||||||
|
|
||||||
### D10. Host CCL init keeps the PyTorch shape
|
|
||||||
|
|
||||||
The host code looks just like real PyTorch DDP. `init_process_group`
|
|
||||||
creates the backend object; it does **not** receive IPCQ knobs
|
|
||||||
(neighbor topology, buffer_kind, backpressure …).
|
|
||||||
|
|
||||||
```python
|
|
||||||
# benches/ccl_allreduce.py — same shape as real PyTorch
|
|
||||||
def worker(rank, world_size, torch):
    dist = torch.distributed
    dist.init_process_group(backend="ahbm")  # reads ccl.yaml + topology
    tensor = torch.zeros((1, world_size * N_ELEM), dtype="f16", dp=...)
    tensor.copy_(torch.from_numpy(init))
    dist.all_reduce(tensor, op="sum")
|
|
||||||
```
|
|
||||||
|
|
||||||
The IPCQ configuration is decided by the backend at
|
|
||||||
`init_process_group` time: it loads `ccl.yaml`, picks the algorithm,
|
|
||||||
and pushes IPCQ neighbor tables to every participating PE_IPCQ. The
|
|
||||||
host code never has to know about IPCQ.
|
|
||||||
|
|
||||||
A bench runs one algorithm, chosen via `ccl.yaml`'s `defaults.algorithm`.
|
|
||||||
Switching algorithms is purely a `ccl.yaml` change — no host edits
|
|
||||||
required.
|
|
||||||
|
|
||||||
#### Init flow (eager)
|
|
||||||
|
|
||||||
1. `init_process_group(backend="ahbm")` is called.
|
|
||||||
2. Backend loads `ccl.yaml` → resolves `defaults.algorithm`.
|
|
||||||
3. Pulls topology + buffer_kind + backpressure + slot config from
|
|
||||||
`algorithms[<algo>]`.
|
|
||||||
4. **Immediately** installs neighbor tables on every PE_IPCQ
|
|
||||||
(sideband or fabric `IpcqInitMsg`).
|
|
||||||
5. Subsequent `torch.launch(kernel_name, ...)` calls behave normally —
|
|
||||||
PE_IPCQ is already prepared whether the kernel is a CCL kernel or
|
|
||||||
not.
|
|
||||||
|
|
||||||
### D11. CCL config file (`ccl.yaml`)
|
|
||||||
|
|
||||||
IPCQ config and algorithm metadata live in a separate YAML file,
|
|
||||||
following the same pattern as `components.yaml` and `topology.yaml`.
|
|
||||||
|
|
||||||
A single benchmark execution runs one algorithm
|
|
||||||
(`defaults.algorithm`). Switching algorithms means editing
|
|
||||||
`defaults.algorithm` only.
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
defaults:
  algorithm: ring_allreduce_tcm
  buffer_kind: tcm            # tcm | hbm | sram
  backpressure: sleep         # poll | sleep
  n_slots: 8
  slot_size: 4096
  vc_chunk_size: 256
  ipcq_credit_size_bytes: 16

algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d         # builtin name or "custom"
    buffer_kind: tcm
    n_elem: 8                 # optional, per-algorithm tile width

  tree_allreduce_7:
    module: kernbench.ccl.algorithms.tree_allreduce
    topology: tree_binary
    buffer_kind: tcm
    world_size: 7             # algorithm-level override
    n_elem: 16

  custom_mesh:
    module: kernbench.ccl.algorithms.custom_mesh
    topology: custom          # the module supplies its own neighbors()
|
|
||||||
```
|
|
||||||
|
|
||||||
`world_size` is **not set in `defaults`**. The backend resolves it via:
|
|
||||||
`algorithm-level override > defaults override > topology spec`. The
|
|
||||||
last fallback (`sips × cubes_per_sip × pes_per_cube`) mirrors real DDP
|
|
||||||
where `WORLD_SIZE` comes from env vars rather than config files.
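
The precedence chain can be sketched as a small resolver. The function name and the dict-based config shapes are assumptions for this example; the topology keys follow the `sips × cubes_per_sip × pes_per_cube` fallback named above.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, topology: dict) -> int:
    """Hypothetical resolver: algorithm override, then defaults, then topology."""
    for cfg in (algo_cfg, defaults):
        if "world_size" in cfg:
            return int(cfg["world_size"])
    return topology["sips"] * topology["cubes_per_sip"] * topology["pes_per_cube"]


topology = {"sips": 2, "cubes_per_sip": 4, "pes_per_cube": 8}   # example values

from_algo = resolve_world_size({"world_size": 7}, {}, topology)
from_defaults = resolve_world_size({}, {"world_size": 4}, topology)
from_topology = resolve_world_size({}, {}, topology)
```

With the `ccl.yaml` shown earlier, `tree_allreduce_7` would take the first branch, while `ring_allreduce_tcm` would fall through to the topology product.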
|
|
||||||
|
|
||||||
#### Algorithm module structure
|
|
||||||
|
|
||||||
Each algorithm module exports two hooks — `kernel` (required) and
|
|
||||||
`neighbors` (optional) — plus a `kernel_args` helper that the
|
|
||||||
backend uses to populate positional kernel arguments at `all_reduce`
|
|
||||||
time:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# src/kernbench/ccl/algorithms/ring_allreduce.py
|
|
||||||
|
|
||||||
def kernel_args(world_size: int, n_elem: int) -> tuple:
    return (n_elem, world_size)


def kernel(t_ptr, n_elem, world_size, tl):
    """Required — the PE kernel.

    IPCQ is already installed by the backend before this is called.
    The kernel only uses the four-direction send / recv API.
    """
    ...


def neighbors(rank, world_size, neighbor_map):
    """Optional — override the builtin topology's neighbor map.

    Returns a new dict, the modified-in-place dict, or None to keep the
    builtin map.
    """
    return None
|
|
||||||
```
|
|
||||||
|
|
||||||
#### `neighbors` override patterns
|
|
||||||
|
|
||||||
- **Pattern A — tweak a builtin**: drop a direction for some ranks, etc.
|
|
||||||
- **Pattern B — replace entirely**: ignore `neighbor_map` and return a
|
|
||||||
brand-new dict.
|
|
||||||
- **Pattern C — keep builtin**: omit `neighbors` or return None.
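
The three patterns can be sketched against the `neighbors(rank, world_size, neighbor_map)` hook signature. The function names are hypothetical, and the sketch assumes `neighbor_map` is a dict from direction to peer rank, which the text above does not spell out.

```python
def neighbors_open_chain(rank, world_size, neighbor_map):
    """Pattern A: cut the wrap-around link of a ring to get an open chain."""
    tweaked = dict(neighbor_map)          # copy so the builtin map is untouched
    if rank == world_size - 1:
        tweaked.pop("E", None)            # last rank has no eastern peer
    if rank == 0:
        tweaked.pop("W", None)            # first rank has no western peer
    return tweaked


def neighbors_unidir(rank, world_size, neighbor_map):
    """Pattern B: ignore the builtin map and return a brand-new one."""
    return {"E": (rank + 1) % world_size}


def neighbors_keep(rank, world_size, neighbor_map):
    """Pattern C: returning None keeps the builtin map."""
    return None


head = neighbors_open_chain(0, 4, {"E": 1, "W": 3})
middle = neighbors_open_chain(1, 4, {"E": 2, "W": 0})
tail = neighbors_open_chain(3, 4, {"E": 0, "W": 2})
```

Any override still has to satisfy the endpoint-consistency check at init, so dropping `E` on the last rank is paired here with dropping `W` on the first.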
|
|
||||||
|
|
||||||
#### Builtin topologies
|
|
||||||
|
|
||||||
| topology | direction set |
|
|
||||||
|----------|---------------|
|
|
||||||
| `ring_1d` | E, W |
|
|
||||||
| `ring_1d_unidir` | E only |
|
|
||||||
| `mesh_2d` | N, S, E, W |
|
|
||||||
| `tree_binary` | parent, child_left, child_right |
|
|
||||||
| `none` | (empty) — algorithm must supply `neighbors()` |
|
|
||||||
|
|
||||||
#### Adding a new algorithm
|
|
||||||
|
|
||||||
1. Write `kernel` and `kernel_args` in
|
|
||||||
`src/kernbench/ccl/algorithms/<algo>.py`.
|
|
||||||
2. Add an entry in `ccl.yaml`'s `algorithms` section.
|
|
||||||
3. (Optional) provide `neighbors()` for custom topology.
|
|
||||||
4. Set `defaults.algorithm` to the new algorithm.
|
|
||||||
|
|
||||||
The host bench (`benches/ccl_allreduce.py`) does not change.
|
|
||||||
|
|
||||||
### D12. Message / token schema
|
|
||||||
|
|
||||||
The new message types added by this ADR. They live in
|
|
||||||
`src/kernbench/common/pe_commands.py` and
|
|
||||||
`src/kernbench/runtime_api/kernel.py`.
|
|
||||||
|
|
||||||
#### `IpcqInitMsg` (sideband, fan-out at init)
|
|
||||||
|
|
||||||
The backend pushes neighbor tables to every PE_IPCQ. Structure mirrors
|
|
||||||
`MmuMapMsg` (`target_sips`, `target_cubes`, `target_pe`, `entries`).
|
|
||||||
Each `IpcqInitEntry` has `direction`, `peer: IpcqEndpoint`,
|
|
||||||
`my_rx_base_pa/va`, `n_slots`, `slot_size`, plus a `peer_credit_store`
|
|
||||||
field — a `simpy.Store` instance pre-wired so the sender PE_IPCQ can
|
|
||||||
push `IpcqCreditMetadata` directly into the receiver's input queue.
|
|
||||||
|
|
||||||
#### `IpcqSendCmd` (PE_CPU → PE_IPCQ)
|
|
||||||
|
|
||||||
Carries `direction`, source addr/space, nbytes, shape, dtype, and a
|
|
||||||
handle id. `data_op=True` so it lands in the op_log.
|
|
||||||
|
|
||||||
#### `IpcqRecvCmd` (PE_CPU → PE_IPCQ)
|
|
||||||
|
|
||||||
Carries `direction` (or None for round-robin), `recv_mode`
|
|
||||||
(`return_slot` / `copy_to_dst`), optional `dst_addr/dst_space`, shape,
|
|
||||||
dtype, blocking flag.
|
|
||||||
|
|
||||||
#### `IpcqDmaToken` (PE_IPCQ → PE_DMA, vc_comm channel)
|
|
||||||
|
|
||||||
Per D9 piggyback: the token carries the data (`src/dst/space/nbytes`)
|
|
||||||
plus the head metadata (`sender_seq`, `src_sip/cube/pe`,
|
|
||||||
`src_direction`). PE_DMA picks the channel by token type
|
|
||||||
(`IpcqDmaToken → vc_comm`, `TileToken → vc_compute`).
|
|
||||||
|
|
||||||
The receiver's PE_DMA, on token arrival, performs the I6 atomic
|
|
||||||
sequence: write data into MemoryStore, then forward `IpcqMetaArrival`
|
|
||||||
to the local PE_IPCQ.
|
|
||||||
|
|
||||||
#### `IpcqCreditMetadata` (PE_IPCQ → peer PE_IPCQ, fast path)
|
|
||||||
|
|
||||||
Carries `consumer_seq` (= my_tail), source PE coords, and source
|
|
||||||
direction. Travels through the dedicated SimPy Store channel rather
|
|
||||||
than `vc_comm`. Latency = `compute_path_latency_ns(path, credit_size_bytes)` (full path latency, see D9).
|
|
||||||
|
|
||||||
There is **no `IpcqPtrUpdate` event** — head updates flow via D9
|
|
||||||
piggyback, tail updates via the D9 fast-path channel.
|
|
||||||
|
|
||||||
### D13. Test strategy

Test plan:

#### T1. Unit tests (component-level)

- **PE_IPCQ** (`tests/test_pe_ipcq.py`): send without backpressure
  immediately forwards a token; full peer slot triggers backpressure
  (poll / sleep modes); recv waits, wakes on `IpcqMetaArrival`;
  round-robin recv weak fairness; bad direction → `IpcqInvalidDirection`.
- **PE_DMA virtual channels** (`tests/test_pe_dma_vc.py`): `vc_compute`
  / `vc_comm` independent progress, chunk interleave, BW split.
- **Builtin topology** (`tests/test_ccl_topologies.py`): ring_1d /
  mesh_2d / tree_binary correctness, mesh_2d non-square →
  `ValueError`, custom resolver returns the module's `neighbors`.

#### T2. Integration tests (E2E send/recv)

- **`tests/test_ipcq_e2e.py`**: 2-rank ring, 4-rank ring (bidirectional
  no-deadlock), 4×4 mesh.
- **CCL kernel + 2-pass** (`tests/test_ipcq_2pass.py`): greenlet mode
  records `ipcq` ops in op_log; DataExecutor produces correct
  `out.data`.

#### T3. Backend init (`tests/test_ccl_backend_ipcq.py`)

`ccl.yaml` load, builtin topology → `IpcqInitMsg` fan-out, endpoint PA
consistency, per-`buffer_kind` allocation.

#### T4. Regression

All existing tests pass; ADR-0020 op_log / DataExecutor unaffected for
non-CCL benches.

#### T5. Performance / overhead

Single send/recv pair latency = (DMA latency) + (IPCQ overhead).
Should be close to a regular PE_DMA write of the same nbytes (IPCQ
overhead < 100 ns).

### D14. Invariants and failure modes

#### Invariants

I1. **Slot lifecycle exactly-once**: one send → exactly one recv.
I2. **Pointer monotonicity**: `my_head` / `my_tail` are monotonically
non-decreasing; `sender_seq` is strictly increasing.
I3. **Endpoint consistency**: if rank A's `direction=E` peer is rank
B, then rank B's reverse-direction peer must be rank A. Verified at
init.
I4. **`buffer_kind` consistency**: all PEs in a process group share
the same `buffer_kind` (no mixed mode in the first cut).
I5. **op_log ordering**: send → DMA complete → recv possible. The
`t_start` order in op_log respects this causality.
I6. **Atomic data + metadata visibility (MUST)**: at the receiver
side, data write (`MemoryStore.write`) and metadata forward
(`peer_head_cache` update) **must execute in the same SimPy step**.
No yield is allowed between the two operations in PE_DMA's vc_comm
handler. Code review must reject any inserted `yield` (or
`yield from`), because it would create a race in which
`peer_head_cache` becomes visible before or after the data.
I7. **MemoryStore slot existence ↔ pointer**: as a consequence of I6,
the step in which `peer_head_cache > my_tail` becomes true is the
same step in which the slot data is observable.
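
I6 and I7 can be illustrated with a plain Python generator standing in for a SimPy process, where every `yield` is a scheduling point. This is a sketch under stated assumptions: `MemoryStore`, `ReceiverState`, the dict-shaped token, and `vc_comm_handler` are hypothetical stand-ins, not the simulator's classes.

```python
from __future__ import annotations


class MemoryStore:
    """Hypothetical minimal stand-in for the simulator's MemoryStore."""

    def __init__(self) -> None:
        self._cells: dict[int, bytes] = {}

    def write(self, addr: int, data: bytes) -> None:
        self._cells[addr] = bytes(data)

    def read(self, addr: int) -> bytes | None:
        return self._cells.get(addr)


class ReceiverState:
    """Receiver-side pointers for a single direction (illustrative only)."""

    def __init__(self) -> None:
        self.store = MemoryStore()
        self.peer_head_cache = 0
        self.my_tail = 0

    def slot_ready(self) -> bool:
        return self.peer_head_cache > self.my_tail


def vc_comm_handler(state: ReceiverState):
    """Generator in SimPy-process style: each `yield` hands control back."""
    while True:
        token = yield  # the only scheduling point: wait for the next token
        # I6: the two statements below run back to back, no yield in between.
        state.store.write(token["dst_addr"], token["data"])
        state.peer_head_cache = max(state.peer_head_cache, token["sender_seq"])


state = ReceiverState()
handler = vc_comm_handler(state)
next(handler)  # prime the generator up to its first yield
handler.send({"dst_addr": 0x100, "data": b"abcd", "sender_seq": 1})
```

Once `send` returns, both the data and the pointer update are visible together, which is the I7 property. Inserting a `yield` between the two statements would let another process observe one without the other.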

#### Failure modes (runtime errors)

F1. **Bad direction**: `tl.send(dir="X")` for an uninstalled direction
→ `IpcqInvalidDirection`, simulation aborts.
F2. **Type mismatch**: dtype/shape/nbytes disagreement between matched
send and recv. Not validated by default; opt-in strict mode catches
it (`strict_validation: true` in a PE_IPCQ node's attrs).
F3. **Deadlock detection (timeout-based)**: the simulator empties its
schedule while a send/recv is still pending → engine raises
`IpcqDeadlock` and embeds a pointer dump.
F4. **Backend init failure**: missing `defaults.algorithm`, missing
`algorithms[name]`, module import failure, topology validation
failure (I3, I4). All are raised at `init_process_group` time.
F5. **Slot full + infinite backpressure**: the peer never calls recv.
Surfaces as an F3 timeout.

#### Diagnostics

- **CCL trace**: `KERNBENCH_CCL_TRACE=1` logs each send/recv as
  `(rank, t, dir, nbytes)`.
- **Pointer dump**: `kernbench.ccl.diagnostics.pointer_dump(engine)`
  prints every PE_IPCQ ring buffer's `my_head`, `my_tail`,
  `peer_head_cache`, `peer_tail_cache`.
- **Deadlock dump**: on hang the engine includes the pointer dump in
  the `IpcqDeadlock` exception message.

### D15. Algorithm-author cheat sheet

Full step-by-step lives in
[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
shortest version:

| Things you touch | Things you don't |
|------------------|-------------------|
| `src/kernbench/ccl/algorithms/<your_algo>.py` (`kernel`, `kernel_args`, optional `neighbors`) | `benches/ccl_allreduce.py` host code |
| One entry in `ccl.yaml` + optionally `defaults.algorithm` | `src/kernbench/ccl/` framework |
| (Optional) `tests/test_<your_algo>.py` mock test | PE_IPCQ component, AhbmCCLBackend |

5-step flow: write the kernel → register in `ccl.yaml` → optional
`neighbors` override → optional mock unit test → SimPy validation via
`kernbench run --bench ccl_allreduce --verify-data`.

Common mistakes: using a direction that wasn't installed, sends
without matching recvs (deadlock), dtype/shape disagreement, assuming
fairness from `tl.recv()` round-robin, confusing
`tl.num_programs(axis)` with the CCL group size.

---

## Non-goals

- **Host collective**: a model where `dist.all_reduce` itself moves
  data on the host side is out of scope. This ADR only covers
  communication that happens inside the PE kernel.
- **All-reduce algorithms**: ring / tree / etc. live in algorithm
  modules and can be added without amending this ADR.
- **Reliability / error handling**: link faults, send/recv failure
  recovery, etc. are out of scope.
- **NoC arbiter precision**: dynamic VC contention is left for a future
  ADR (see D8).

---

## Open questions

- **VC arbitration accuracy**: the first cut uses deterministic
  chunk interleave + weighted round-robin; heavy contention may report
  optimistic latency. A NoC arbiter component can be added later.
- **Credit return BW model**: the fast path is currently outside the
  fabric BW contention model. Can be modeled as a separate link or
  switched to piggyback (`credit_return_mode: piggyback`).
- **Ring buffer slot allocation metadata**: whether the host pushes
  IPCQ buffer metadata via sideband or via a fabric message similar to
  `MmuMapMsg` is open.
- **VC BW split default**: 50/50 vs. weighted (e.g. 80/20). Exposed in
  `ccl.yaml`; default value TBD.
- **Direction count**: 4 (N/S/E/W) is fixed in the first cut; 6
  (with Up/Down for 3D) or N (variable) is future work.
- **Multi-tile aggregation primitives**: whether
  `tl.recv_all` or similar is needed for fan-in.
- **Round-robin recv fairness**: current weak fairness can starve;
  strict fairness counter is future work.
- **Deadlock detection precision**: currently timeout-based; a
  realtime wait-for graph would enable deterministic detection.

---

## Consequences

### Positive

- Direct PE-to-PE communication makes it possible to write CCL kernels.
- Host stays minimal (just `launch`), synchronization happens inside
  the PE → strong compute / comm overlap.
- VCs eliminate HoL blocking → collective latency is not blocked by
  compute traffic.
- Buffer placement and backpressure mode are init-time parameters →
  easy to benchmark.
- Four-direction logical neighbors → host is free to map
  ring/mesh/tree algorithms.

### Negative

- One new component (PE_IPCQ) and a redesigned PE_DMA (VCs).
- IPCQ memory cost = 8 rings × `slot_size` × `n_slots` per PE.
- VC arbitration is a first-order approximation; heavy contention
  scenarios may report slightly optimistic latency vs real HW (D8).
- Chunk-level interleave makes PE_DMA implementation more complex.
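
To make the memory-cost bullet concrete, here is a small arithmetic sketch. The helper name and the example slot size and slot count are hypothetical; only the `8 × slot_size × n_slots` product comes from the list above.

```python
def ipcq_bytes_per_pe(slot_size: int, n_slots: int, n_rings: int = 8) -> int:
    """IPCQ ring-buffer footprint of one PE, in bytes."""
    if min(slot_size, n_slots, n_rings) <= 0:
        raise ValueError("slot_size, n_slots and n_rings must be positive")
    return n_rings * slot_size * n_slots


# Example values (assumed): 4 KiB slots, 8 slots per ring.
footprint = ipcq_bytes_per_pe(slot_size=4096, n_slots=8)  # 262144 bytes = 256 KiB
```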

File diff suppressed because it is too large
@@ -6,43 +6,46 @@ Accepted
|
|||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
### 목표
|
### Goal
|
||||||
|
|
||||||
`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device)
|
Align the participation unit (rank) of `torch.distributed` collective calls
|
||||||
경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이**
|
to the **SIP** (device) boundary. The aim is bench code that, at the host
|
||||||
읽히는 bench 코드를 목표로 한다.
|
level, reads **indistinguishably** from real PyTorch DDP/TP scripts.
|
||||||
|
|
||||||
real PyTorch와 비교:
|
Comparison with real PyTorch:
|
||||||
|
|
||||||
| 차원 | real PyTorch | KernBench |
|
| Dimension | real PyTorch | KernBench |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP |
|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
|
||||||
| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 |
|
| `get_rank()` | `RANK` env var | greenlet-local registry |
|
||||||
| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 |
|
| `get_world_size()` | `WORLD_SIZE` env var | SIP count from topology |
|
||||||
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
|
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
|
||||||
| `mp.spawn` | OS 프로세스 fork | greenlet fan-out |
|
| `mp.spawn` | OS process fork | greenlet fan-out |
|
||||||
|
|
||||||
### 풀어야 할 문제
|
### Problems to solve
|
||||||
|
|
||||||
1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록.
|
1. **Public API where rank = SIP** — so bench workers do not have to know
|
||||||
2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각
|
about the PE concept.
|
||||||
worker greenlet이 자기 rank / 자기 SIP를 정확히 식별.
|
2. **Greenlet-local rank/device tracking** — within the 1-process model,
|
||||||
3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면
|
each worker greenlet must correctly identify its own rank / its own SIP.
|
||||||
기본 텐서 배치도 구조적 좌표로 표현되어야 함.
|
3. **Tensor placement = structural (sip, cube, pe)** — if rank is SIP,
|
||||||
|
the default tensor placement should also be expressed in structural
|
||||||
|
coordinates.
|
||||||
|
|
||||||
### Non-problem (이 ADR 밖)
|
### Non-problem (outside this ADR)
|
||||||
|
|
||||||
- IPCQ direction addressing → ADR-0025
|
- IPCQ direction addressing → ADR-0025
|
||||||
- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026
|
- Removing `DPPolicy.sip`/`num_sips` → ADR-0026
|
||||||
- Megatron-style TP → ADR-0027
|
- Megatron-style TP → ADR-0027
|
||||||
- DTensor → ADR-0028 (future)
|
- DTensor → ADR-0028 (future)
|
||||||
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
|
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
|
||||||
→ ADR-0027 D0/D1
|
→ ADR-0027 D0/D1
|
||||||
- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032
|
- Collective algorithm implementation (intercube_allreduce, SFR config)
|
||||||
|
→ ADR-0032
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
### D1. rank = SIP (world_size 해석)
|
### D1. rank = SIP (world_size resolution)
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def _resolve_world_size(self) -> int:
|
def _resolve_world_size(self) -> int:
|
||||||
@@ -55,8 +58,8 @@ def _resolve_world_size(self) -> int:
|
|||||||
return int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
return int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
||||||
```
|
```
|
||||||
|
|
||||||
우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml`
|
Priority order: algorithm override > defaults override > SIP count. The
|
||||||
override는 legacy "rank = PE" 테스트 경로로 유지.
|
`ccl.yaml` override is retained as the legacy "rank = PE" test path.
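
The priority order can be sketched as a chain of lookups. This is an illustration only: the `world_size` key name inside `ccl.yaml` is an assumption (the excerpt does not show it), and `resolve_world_size` is a free-standing stand-in for the method above.

```python
def resolve_world_size(ccl_cfg: dict, topology_spec: dict) -> int:
    """Algorithm override > defaults override > SIP count from the topology."""
    defaults = ccl_cfg.get("defaults", {})
    algo_name = defaults.get("algorithm")
    algo_cfg = ccl_cfg.get("algorithms", {}).get(algo_name, {})
    # Hypothetical key name: "world_size" is assumed for illustration.
    for source in (algo_cfg, defaults):
        if "world_size" in source:
            return int(source["world_size"])
    return int(topology_spec.get("system", {}).get("sips", {}).get("count", 1))
```

With no override present, the function falls through to the SIP count, which is the rank = SIP behavior this decision establishes.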
|
||||||
|
|
||||||
### D2. Greenlet-local rank registry (+ debug warning)
|
### D2. Greenlet-local rank registry (+ debug warning)
|
||||||
|
|
||||||
@@ -83,11 +86,11 @@ class DistributedContext:
|
|||||||
return int(self._rank_by_greenlet[g])
|
return int(self._rank_by_greenlet[g])
|
||||||
```
|
```
|
||||||
|
|
||||||
### D3. `torch.ahbm.set_device(rank)` — SIP 바인딩
|
### D3. `torch.ahbm.set_device(rank)` — SIP binding
|
||||||
|
|
||||||
KernBench 백엔드 이름은 `ahbm` (ADR-0023). Real PyTorch는
|
The KernBench backend name is `ahbm` (ADR-0023). Real PyTorch uses
|
||||||
`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named
|
`torch.cuda.set_device(r)`; because KernBench is not CUDA, it uses an
|
||||||
namespace를 사용한다.
|
honestly-named namespace.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class _AhbmNamespace:
|
class _AhbmNamespace:
|
||||||
@@ -113,10 +116,12 @@ class _AhbmNamespace:
|
|||||||
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
|
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
|
||||||
```
|
```
|
||||||
|
|
||||||
**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한
|
**Side-by-side support for the PyTorch 2.x style**: recent PyTorch is moving toward a
|
||||||
`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`,
|
device-agnostic `torch.accelerator` namespace
|
||||||
`torch.accelerator.current_device_index()`). Device vendor에 종속되지 않는
|
(`torch.accelerator.set_device_index(r)`,
|
||||||
코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다.
|
`torch.accelerator.current_device_index()`). To support users who want to
|
||||||
|
write code that is not tied to a specific device vendor, KernBench also
|
||||||
|
exposes this surface alongside `torch.ahbm`.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
class _AcceleratorNamespace:
|
class _AcceleratorNamespace:
|
||||||
@@ -141,23 +146,23 @@ self.ahbm = _AhbmNamespace()
|
|||||||
self.accelerator = _AcceleratorNamespace(self.ahbm) # alias
|
self.accelerator = _AcceleratorNamespace(self.ahbm) # alias
|
||||||
```
|
```
|
||||||
|
|
||||||
Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유:
|
Bench authors may choose either — both share the same registry internally:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
|
torch.ahbm.set_device(rank) # KernBench-native, explicit backend
|
||||||
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
|
torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic
|
||||||
```
|
```
|
||||||
|
|
||||||
### D4. Tensor placement = structural (sip, cube, pe) 좌표
|
### D4. Tensor placement = structural (sip, cube, pe) coordinates
|
||||||
|
|
||||||
`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성.
|
`resolve_dp_policy` takes `target_sip` directly and produces placement in
|
||||||
세부는 ADR-0026.
|
structural coordinates. Details in ADR-0026.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# RuntimeContext._create_tensor
|
# RuntimeContext._create_tensor
|
||||||
current_sip = self.ahbm.current_device() # (D3 naming)
|
current_sip = self.ahbm.current_device() # (D3 naming)
|
||||||
if current_sip is None:
|
if current_sip is None:
|
||||||
current_sip = 0 # single-driver fallback (D2와 일관)
|
current_sip = 0 # single-driver fallback (consistent with D2)
|
||||||
placement = resolve_dp_policy(
|
placement = resolve_dp_policy(
|
||||||
dp, shape=shape_2d, itemsize=itemsize,
|
dp, shape=shape_2d, itemsize=itemsize,
|
||||||
num_pe=eff_num_pe, num_cubes=eff_num_cubes,
|
num_pe=eff_num_pe, num_cubes=eff_num_cubes,
|
||||||
@@ -165,29 +170,29 @@ placement = resolve_dp_policy(
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
|
No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
|
||||||
좌표를 직접 보유. ShardSpec 상세는 ADR-0026.
|
structural coordinates directly. ShardSpec details in ADR-0026.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Dependencies
|
## Dependencies
|
||||||
|
|
||||||
- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원.
|
- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
|
||||||
- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와
|
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature
|
||||||
ShardSpec의 구조적 좌표 표현.
|
used by D4 and the structural-coordinate representation of ShardSpec.
|
||||||
- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`,
|
- **ADR-0027** (Megatron TP + scheduler): the implementation baseline for
|
||||||
collective drain, exception cleanup의 구현 기준.
|
worker scheduling, `mp.spawn`, collective drain, and exception cleanup.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Non-goals
|
## Non-goals
|
||||||
|
|
||||||
- **IPCQ protocol 수정**: ADR-0023 유지.
|
- **Modifying the IPCQ protocol**: ADR-0023 remains as-is.
|
||||||
- **DPPolicy 필드 정리**: ADR-0026.
|
- **Cleaning up DPPolicy fields**: ADR-0026.
|
||||||
- **Megatron-style TP**: ADR-0027.
|
- **Megatron-style TP**: ADR-0027.
|
||||||
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
|
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
|
||||||
- **Collective algorithm 구현**: ADR-0032.
|
- **Collective algorithm implementation**: ADR-0032.
|
||||||
- **Multi-node (프로세스 간)**: 단일 프로세스.
|
- **Multi-node (cross-process)**: single process only.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -195,12 +200,14 @@ Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적
|
|||||||
|
|
||||||
### Positive
|
### Positive
|
||||||
|
|
||||||
- **Bench = real PyTorch DDP** (공개 API 관점).
|
- **Bench = real PyTorch DDP** (from the public-API point of view).
|
||||||
- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능.
|
- **Greenlet-local rank**: enables cross-rank correctness within the
|
||||||
- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이
|
1-process model.
|
||||||
`(sip, cube, pe)` 3튜플 위에서 일관되게 동작.
|
- **Structural placement coordinates**: lets the other ADRs (ADR-0026 /
|
||||||
|
ADR-0027 / ADR-0032) operate consistently on top of the `(sip, cube, pe)`
|
||||||
|
3-tuple.
|
||||||
|
|
||||||
### Neutral
|
### Neutral
|
||||||
|
|
||||||
- IPCQ PE-level protocol (ADR-0023) 불변.
|
- IPCQ PE-level protocol (ADR-0023) is unchanged.
|
||||||
- IO_CPU 역할 불변 (기존 transit 그대로).
|
- IO_CPU role is unchanged (existing transit behavior preserved).
|
||||||
|
|||||||
@@ -6,51 +6,58 @@ Accepted (Revision 2 — Address-based matching; peer_direction field dropped)
|
|||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
### 목표
|
### Goal
|
||||||
|
|
||||||
ADR-0023의 IPCQ protocol에서 **"어느 direction pair를 통한 전송인가"의 식별**을
|
In the IPCQ protocol of ADR-0023, make the **identification of "which
|
||||||
topology / dict-order에 의존하지 않고 **주소 기반**으로 일관되게 한다.
|
direction pair this transfer belongs to"** consistent and **address-based**,
|
||||||
2-rank bidirectional ring (또는 여러 direction이 동일 peer를 가리키는
|
without depending on topology / dict-order. It must work correctly in a
|
||||||
topology 일반)에서 정확히 동작하도록 한다.
|
2-rank bidirectional ring (and more generally in any topology where
|
||||||
|
multiple directions point to the same peer).
|
||||||
|
|
||||||
### 드러난 버그 — 2-rank bidirectional ring
|
### The bug that surfaced: 2-rank bidirectional ring
|
||||||
|
|
||||||
`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). 양쪽 방향이 같은 peer.
|
`ring_1d(rank, world_size=2)` → `{"E": 1, "W": 1}` (rank 0). Both directions
|
||||||
|
point to the same peer.
|
||||||
|
|
||||||
**버그 1 (install)**:
|
**Bug 1 (install)**:
|
||||||
- `reverse_direction(0, 1)` → dict order로 "E" 반환 (틀림, "W"가 맞음 — opposite
|
- `reverse_direction(0, 1)` → returns "E" by dict order (wrong; "W" is the
|
||||||
direction convention)
|
correct answer — opposite-direction convention)
|
||||||
- rank 0의 E entry가 `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`로 설정
|
- rank 0's E entry is set with `peer.rx_base_pa = rx_base(sip1, cube0, pe0, d="E")`
|
||||||
- tl.send(E) → data가 sip1의 E-rx buffer로 landing (should be W-rx)
|
- tl.send(E) → data lands in sip1's E-rx buffer (should be W-rx)
|
||||||
|
|
||||||
**버그 2 (runtime)**:
|
**Bug 2 (runtime)**:
|
||||||
- 설령 install이 올바른 주소로 설정해도, receiver의 `_handle_meta_arrival`이
|
- Even if install set up the correct address, the receiver's
|
||||||
sender 좌표만으로 direction 매칭 → 첫 direction (E) 승
|
`_handle_meta_arrival` matches direction by sender coordinates only → the
|
||||||
- peer_head_cache[E] 증가, peer_head_cache[W]는 불변
|
first direction (E) wins
|
||||||
- Kernel의 tl.recv(W)는 peer_head_cache[W] 대기 → 영원히 블록 → IpcqDeadlock
|
- peer_head_cache[E] is incremented; peer_head_cache[W] is unchanged
|
||||||
|
- The kernel's tl.recv(W) waits on peer_head_cache[W] → blocks forever →
|
||||||
|
IpcqDeadlock
|
||||||
|
|
||||||
### 근본 원인
|
### Root cause
|
||||||
|
|
||||||
두 축에서 동일 문제:
|
The same issue along two axes:
|
||||||
1. **Install-time pairing**: "내 direction과 peer의 어느 direction이 짝인가"
|
1. **Install-time pairing**: deciding "which of my directions pairs with
|
||||||
결정이 dict-iteration-order에 의존 → 여러 direction이 같은 peer를 가리킬 때
|
which direction of the peer" depends on dict-iteration-order → fragile
|
||||||
fragile
|
when multiple directions point to the same peer
|
||||||
2. **Runtime identification**: "어느 qp를 업데이트해야 하는가" 결정이 sender
|
2. **Runtime identification**: deciding "which qp should be updated" is
|
||||||
좌표만으로 이루어짐 → direction 중복 시 ambiguous
|
based on sender coordinates alone → ambiguous when directions are
|
||||||
|
duplicated
|
||||||
|
|
||||||
### 해결 방향 — address-based matching
|
### Solution direction — address-based matching
|
||||||
|
|
||||||
각 PE의 rx buffer는 **direction별로 고유한 주소 range**에 위치 (rx_base_pa +
|
Each PE's rx buffer sits at a **unique address range per direction**
|
||||||
direction_idx × bytes_per_direction). 따라서:
|
(rx_base_pa + direction_idx × bytes_per_direction). Therefore:
|
||||||
|
|
||||||
- **Runtime**: sender coord 대신 **dst_addr 범위**로 매칭 → unambiguous
|
- **Runtime**: match by **dst_addr range** instead of sender coord →
|
||||||
- **Install**: opposite-direction 우선 선택 heuristic (ring / mesh의 자연스러운
|
unambiguous
|
||||||
대칭성)
|
- **Install**: prefer the opposite direction as a heuristic (the natural
|
||||||
- `peer_direction` 같은 이중 메타데이터 불필요 — **주소가 single source of
|
symmetry of ring / mesh)
|
||||||
truth**
|
- No need for redundant metadata like `peer_direction` — **address is the
|
||||||
|
single source of truth**
|
||||||
|
|
||||||
이 설계는 **PhysAddr 전환 (ADR-0030)과 독립적**으로 작동. 현재 synthetic
|
This design works **independently of the PhysAddr transition (ADR-0030)**.
|
||||||
주소든 PhysAddr든 direction별 range 유일성만 지켜지면 동일하게 적용 가능.
|
Whether the current addresses are synthetic or PhysAddr, the same approach
|
||||||
|
applies as long as the per-direction range uniqueness is preserved.
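
The per-direction address layout and the resulting lookup can be sketched as follows. The direction ordering in `DIRECTIONS`, the helper names, and the example addresses are assumptions for illustration; only the `rx_base_pa + direction_idx × bytes_per_direction` layout comes from the text above.

```python
from __future__ import annotations

# Assumed ordering; the real direction_idx assignment may differ.
DIRECTIONS = ("N", "S", "E", "W")


def rx_range(rx_base_pa: int, direction: str, bytes_per_direction: int) -> tuple[int, int]:
    """Half-open [start, end) address range of one direction's rx buffer."""
    idx = DIRECTIONS.index(direction)
    start = rx_base_pa + idx * bytes_per_direction
    return start, start + bytes_per_direction


def direction_for(dst_addr: int, rx_base_pa: int, bytes_per_direction: int) -> str | None:
    """Recover the direction from a destination address, or None if outside."""
    offset = dst_addr - rx_base_pa
    if offset < 0:
        return None
    idx = offset // bytes_per_direction
    return DIRECTIONS[idx] if idx < len(DIRECTIONS) else None


# 2-rank ring: both E and W traffic comes from the same peer, yet the
# destination address alone tells the two directions apart.
e_start, _ = rx_range(0x1000, "E", 0x100)
w_start, _ = rx_range(0x1000, "W", 0x100)
```

Because the sender's coordinates never enter the lookup, two directions that share a peer cannot be confused.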
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -91,17 +98,17 @@ def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
|
|||||||
return None
|
return None
|
||||||
```
|
```
|
||||||
|
|
||||||
호출부:
|
Call site:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
for d, peer_rank in nbrs.items():
|
for d, peer_rank in nbrs.items():
|
||||||
peer_dir = reverse_direction(r, peer_rank, d) # my_dir 전달
|
peer_dir = reverse_direction(r, peer_rank, d) # pass my_dir
|
||||||
if peer_dir is None:
|
if peer_dir is None:
|
||||||
continue
|
continue
|
||||||
...
|
...
|
||||||
```
|
```
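
The opposite-direction preference can be sketched as below. This is not the body of `reverse_direction` (which is elided in this diff); it is a hypothetical free-standing helper, `pick_peer_direction`, that takes the peer's neighbor map explicitly so the example stays self-contained.

```python
from __future__ import annotations

# Opposite-direction convention for the four installed directions.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def pick_peer_direction(
    peer_neighbors: dict[str, int], my_rank: int, my_dir: str
) -> str | None:
    """Return the peer-side direction that pairs with `my_dir`, or None."""
    preferred = OPPOSITE.get(my_dir)
    if preferred is not None and peer_neighbors.get(preferred) == my_rank:
        return preferred
    # Fallback: any peer direction that points back at this rank.
    for peer_dir, rank in peer_neighbors.items():
        if rank == my_rank:
            return peer_dir
    return None


# 2-rank bidirectional ring: rank 1 sees rank 0 on both E and W.
rank1_neighbors = {"E": 0, "W": 0}
paired = pick_peer_direction(rank1_neighbors, my_rank=0, my_dir="E")
```

Checking the opposite direction first removes the dependence on dict iteration order in the symmetric ring and mesh cases; the fallback loop only matters for asymmetric custom topologies.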
|
||||||
|
|
||||||
### D2. Runtime — `_handle_meta_arrival` dst_addr 매칭
|
### D2. Runtime — `_handle_meta_arrival` dst_addr matching
|
||||||
|
|
||||||
`src/kernbench/components/builtin/pe_ipcq.py`:
|
`src/kernbench/components/builtin/pe_ipcq.py`:
|
||||||
|
|
||||||
@@ -138,9 +145,10 @@ def _handle_meta_arrival(self, msg: IpcqMetaArrival) -> None:
|
|||||||
# Unknown dst_addr — diagnostic log (should not happen under correct install)
|
# Unknown dst_addr — diagnostic log (should not happen under correct install)
|
||||||
```
|
```
|
||||||
|
|
||||||
Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
|
The sender-coordinate check is **removed**. `dst_addr` already determines
|
||||||
|
the direction.
|
||||||
|
|
||||||
### D3. Credit — `dst_rx_base_pa` 필드 추가
|
### D3. Credit — add `dst_rx_base_pa` field
|
||||||
|
|
||||||
`src/kernbench/common/ipcq_types.py`:
|
`src/kernbench/common/ipcq_types.py`:
|
||||||
|
|
||||||
@@ -148,25 +156,26 @@ Sender 좌표 검사는 **제거**. `dst_addr`가 이미 direction을 결정.
|
|||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class IpcqCreditMetadata:
|
class IpcqCreditMetadata:
|
||||||
consumer_seq: int
|
consumer_seq: int
|
||||||
dst_rx_base_pa: int # NEW: 원 sender의 peer.rx_base_pa와 매칭용
|
dst_rx_base_pa: int # NEW: matches the original sender's peer.rx_base_pa
|
||||||
# 기존 필드 (diagnostic / log 용도로 유지)
|
# Existing fields (kept for diagnostic / logging purposes)
|
||||||
src_sip: int
|
src_sip: int
|
||||||
src_cube: int
|
src_cube: int
|
||||||
src_pe: int
|
src_pe: int
|
||||||
src_direction: str
|
src_direction: str
|
||||||
```
|
```
|
||||||
|
|
||||||
Credit 생성 시 (`_delayed_credit_send`): 자기 direction의 `my_rx_base_pa`를
|
When the credit is generated (`_delayed_credit_send`): it carries this
|
||||||
`dst_rx_base_pa`로 실어 보냄 (이게 상대방이 sender 당시 썼던 `peer.rx_base_pa`).
|
direction's `my_rx_base_pa` as `dst_rx_base_pa` (this is the
|
||||||
|
`peer.rx_base_pa` the other side used when it was the sender).
|
||||||
|
|
||||||
수신 측 (`_credit_worker`):
|
Receiver side (`_credit_worker`):
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def _credit_worker(self, env):
|
def _credit_worker(self, env):
|
||||||
while True:
|
while True:
|
||||||
credit = yield self._credit_inbox.get()
|
credit = yield self._credit_inbox.get()
|
||||||
for d, qp in self._queue_pairs.items():
|
for d, qp in self._queue_pairs.items():
|
||||||
# peer의 rx_base_pa와 credit의 dst_rx_base_pa가 일치하는 qp 찾기
|
# Find the qp whose peer rx_base_pa matches the credit's dst_rx_base_pa
|
||||||
if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
|
if qp["peer"].rx_base_pa == credit.dst_rx_base_pa:
|
||||||
qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
|
qp["peer_tail_cache"] = max(qp["peer_tail_cache"],
|
||||||
credit.consumer_seq)
|
credit.consumer_seq)
|
||||||
@@ -178,41 +187,45 @@ def _credit_worker(self, env):
|
|||||||
break
|
break
|
||||||
```
|
```
|
||||||
|
|
||||||
Sender 좌표 검사 제거. `dst_rx_base_pa` 매칭으로 unambiguous.
|
Sender-coordinate check removed. Matching by `dst_rx_base_pa` is
|
||||||
|
unambiguous.
|
||||||
|
|
||||||
### D4. `IpcqInitEntry`에 `peer_direction` 필드를 **추가하지 않음**
|
### D4. Do **not** add a `peer_direction` field to `IpcqInitEntry`
|
||||||
|
|
||||||
ADR-0025 rev 1에서 제안했던 `IpcqInitEntry.peer_direction`은 **불필요**.
|
The `IpcqInitEntry.peer_direction` proposed in ADR-0025 rev 1 is
|
||||||
이유:
|
**unnecessary**. Reasons:
|
||||||
- Meta arrival은 dst_addr로 매칭 (D2)
|
- Meta arrivals are matched by dst_addr (D2)
|
||||||
- Credit은 dst_rx_base_pa로 매칭 (D3)
|
- Credits are matched by dst_rx_base_pa (D3)
|
||||||
- qp에 peer_direction 저장 필요 없음
|
- No need to store peer_direction on qp
|
||||||
- Install은 rx_base_pa 계산 시 내부적으로만 peer_dir 사용 (`reverse_direction`)
|
- Install only uses peer_dir internally when computing rx_base_pa
|
||||||
|
(`reverse_direction`)
|
||||||
|
|
||||||
IpcqInitEntry schema 변경 없음. Rev 1 대비 **단순화**.
|
No change to the IpcqInitEntry schema. **Simpler** than rev 1.
|
||||||
|
|
||||||
### D5. `IpcqDmaToken.src_direction` 유지 (diagnostic only)
|
### D5. Keep `IpcqDmaToken.src_direction` (diagnostic only)
|
||||||
|
|
||||||
기존 `src_direction` 필드는 제거하지 않는다. 다음 용도로 유지:
|
The existing `src_direction` field is not removed. It is retained for:
|
||||||
- Logging / trace: `KERNBENCH_CCL_TRACE=1` 출력의 `(rank, t, dir, nbytes)`
|
- Logging / trace: the `(rank, t, dir, nbytes)` output of
|
||||||
- Diagnostics: pointer_dump 등에서 direction 표시
|
`KERNBENCH_CCL_TRACE=1`
|
||||||
- 미래 확장 여지
|
- Diagnostics: showing direction in pointer_dump, etc.
|
||||||
|
- Room for future extension
|
||||||
|
|
||||||
Runtime matching은 `dst_addr`만 사용.
|
Runtime matching uses only `dst_addr`.
|
||||||
|
|
||||||
### D6. Invariants (ADR-0023 I3 강화)
|
### D6. Invariants (strengthens ADR-0023 I3)
|
||||||
|
|
||||||
**I3 (엄격)**: 각 방향 pair `(my_direction, peer_direction)`에 대해 my
|
**I3 (strict)**: For each direction pair `(my_direction, peer_direction)`,
|
||||||
rx_base와 peer rx_base는 **별개의 direction slot**을 가리켜야 함. Install은
|
my rx_base and peer rx_base must point to **distinct direction slots**.
|
||||||
이를 보장해야 한다 (reverse_direction opposite-preference).
|
Install must guarantee this (reverse_direction opposite-preference).
|
||||||
|
|
||||||
**I3.1 (신규)**: 모든 qp에 대해 `qp["my_rx_base_pa"]`와 `qp["peer"].rx_base_pa`는
|
**I3.1 (new)**: For every qp, `qp["my_rx_base_pa"]` and
|
||||||
서로 disjoint한 주소 range를 점유한다 (다른 direction의 buffer는 절대 겹치지
|
`qp["peer"].rx_base_pa` occupy mutually disjoint address ranges (buffers
|
||||||
않음). 이것이 D2/D3의 주소-기반 매칭의 전제.
|
of different directions never overlap). This is the prerequisite for the
|
||||||
|
address-based matching of D2/D3.
|
||||||
|
|
||||||
Install time에 검증 가능:
|
Verifiable at install time:
|
||||||
```python
|
```python
|
||||||
# ccl/install_plan.py: build_install_plans 끝에 assertion
|
# ccl/install_plan.py: assertion at the end of build_install_plans
|
||||||
all_rx_ranges = set()
|
all_rx_ranges = set()
|
||||||
for plan in plans:
|
for plan in plans:
|
||||||
for pe_install in plan.pe_installs:
|
for pe_install in plan.pe_installs:
|
||||||
@@ -228,36 +241,42 @@ for plan in plans:
|
|||||||
|
|
||||||
## Dependencies
|
## Dependencies
|
||||||
|
|
||||||
- **ADR-0023** (IPCQ protocol): 본 ADR은 ADR-0023의 runtime 매칭 로직 수정
|
- **ADR-0023** (IPCQ protocol): this ADR modifies ADR-0023's runtime
|
||||||
(D2, D3) + install heuristic 개선 (D1). IPCQ 프로토콜의 semantic layer
|
matching logic (D2, D3) and improves the install heuristic (D1). No
|
||||||
변경은 없음.
|
change to the IPCQ protocol's semantic layer.
|
||||||
- **ADR-0024** (launcher): 2-rank bidirectional ring이 실제 쓰이는 경우가
|
- **ADR-0024** (launcher): the case where a 2-rank bidirectional ring is
|
||||||
ADR-0024의 ws=SIP_count 모델. 본 ADR이 그 케이스를 작동시킴.
|
actually used is the ws=SIP_count model of ADR-0024. This ADR makes that
|
||||||
- **ADR-0030** (PhysAddr transition, stub): **독립적** — ADR-0025의
|
case work.
|
||||||
주소-기반 매칭은 현재 synthetic 주소든 PhysAddr이든 동일하게 작동.
|
- **ADR-0030** (PhysAddr transition, stub): **independent** — ADR-0025's
|
||||||
|
address-based matching works identically whether the current addresses
|
||||||
|
are synthetic or PhysAddr.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Non-goals
|
## Non-goals
|
||||||
|
|
||||||
- **IPCQ 주소 체계를 PhysAddr로 전환**: ADR-0030 scope. 본 ADR은 주소가 어떻게
|
- **Migrating IPCQ addressing to PhysAddr**: ADR-0030 scope. This ADR is
|
||||||
인코딩되는가와 무관.
|
agnostic to how addresses are encoded.
|
||||||
- **Multi-hop routing**: ADR-0023 D5의 single-hop DMA write 전제 유지.
|
- **Multi-hop routing**: the single-hop DMA write assumption of ADR-0023
|
||||||
- **Unidir ring 특수화**: `ring_1d_unidir`는 direction 하나만 있으므로 본 버그
|
D5 still holds.
|
||||||
무관.
|
- **Unidir ring specialization**: `ring_1d_unidir` only has a single
|
||||||
|
direction, so the bug does not apply.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Open questions
|
## Open questions
|
||||||
|
|
||||||
- **주소 매칭 성능**: `_handle_meta_arrival`과 `_credit_worker`가 qp를 선형
|
- **Address-matching performance**: `_handle_meta_arrival` and
|
||||||
순회 (max 4 direction). 성능 영향 무시 가능 수준. 문제 시 dict lookup으로
|
`_credit_worker` iterate qp linearly (max 4 directions). The performance
|
||||||
전환 가능 (`_qp_by_rx_base`).
|
impact is negligible. If it becomes an issue, this can be switched to a
|
||||||
- **`IpcqDmaToken.src_direction` 필요성 재평가**: diagnostic 용도로만 남긴
|
dict lookup (`_qp_by_rx_base`).
|
||||||
필드를 계속 유지할지, 또는 logging 외부로 분리할지. 현재는 유지.
|
- **Re-evaluating the need for `IpcqDmaToken.src_direction`**: whether to
|
||||||
- **Install-time invariant 검증 cost**: D6의 I3.1 검증은 O(N_PE × N_direction)^2.
|
keep this field, which is only kept for diagnostics, or to split it out
|
||||||
대형 topology에서 느려질 수 있음 → interval tree 등 자료구조로 개선 가능.
|
of logging. Currently retained.
|
||||||
단순 구현 먼저.
|
- **Cost of install-time invariant verification**: the I3.1 verification
|
||||||
|
of D6 is O((N_PE × N_direction)^2). It could be slow on large topologies,
|
||||||
|
though a data structure such as an interval tree could improve this. Start
|
||||||
|
with the simple implementation.
|
||||||
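The install-time cost concern above can be illustrated with a small sketch. It assumes, hypothetically, that invariant I3.1 means "no two rx ranges overlap" and that each range is a half-open `(base, size)` pair; the function name and the range representation are illustrative and not taken from `ccl/install_plan.py`. Sorting the ranges first reduces the pairwise O(n^2) comparison to O(n log n), which may be enough before reaching for an interval tree.

```python
from typing import Iterable


def find_overlapping_rx_ranges(
    ranges: Iterable[tuple[int, int]],
) -> list[tuple[tuple[int, int], tuple[int, int]]]:
    """Report overlaps among half-open (base, size) ranges.

    Hypothetical helper: returns a non-empty list if and only if at least
    one pair of ranges overlaps. Each reported pair is (earlier, later).
    """
    overlaps = []
    furthest = None  # the range whose end address is the largest seen so far
    for base, size in sorted(ranges):
        if furthest is not None and base < furthest[0] + furthest[1]:
            # This range starts before an earlier range has ended.
            overlaps.append((furthest, (base, size)))
        if furthest is None or base + size > furthest[0] + furthest[1]:
            furthest = (base, size)
    return overlaps
```

An install-time assertion could then be written as `assert not find_overlapping_rx_ranges(all_rx_ranges)`, again under the assumption that non-overlap is what the invariant requires.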
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -265,19 +284,26 @@ for plan in plans:
|
|||||||
|
|
||||||
### Positive
|
### Positive
|
||||||
|
|
||||||
- **단순함**: `peer_direction` 이중 메타데이터 제거. 주소가 single source of truth.
|
- **Simplicity**: redundant `peer_direction` metadata removed. Address is
|
||||||
- **Unambiguous matching**: 모든 topology (direction 중복 포함)에서 동작.
|
the single source of truth.
|
||||||
- **Schema 변경 최소**: `IpcqInitEntry` 불변, `IpcqCreditMetadata`에 1 필드 추가.
|
- **Unambiguous matching**: works on every topology (including duplicate
|
||||||
- **PhysAddr 전환 (ADR-0030) 독립**: 주소-기반 매칭은 주소 인코딩 방식과 무관.
|
directions).
|
||||||
- **Diagnostic 유지**: `IpcqDmaToken.src_direction`은 로깅 용도로 존치.
|
- **Minimal schema changes**: `IpcqInitEntry` unchanged, one field added
|
||||||
|
to `IpcqCreditMetadata`.
|
||||||
|
- **Independent of PhysAddr transition (ADR-0030)**: address-based matching
|
||||||
|
is agnostic to the address encoding.
|
||||||
|
- **Diagnostics retained**: `IpcqDmaToken.src_direction` is kept for
|
||||||
|
logging.
|
||||||
|
|
||||||
### Negative
|
### Negative
|
||||||
|
|
||||||
- Runtime 매칭이 주소 비교로 바뀌어서 디버깅 시 "왜 peer_head_cache[E]가 아닌
|
- Runtime matching is now by address comparison, so when debugging
|
||||||
W가 업데이트됐나" 같은 질문에 address range를 추적해야 함 (기존엔 direction
|
questions like "why did peer_head_cache[W] update rather than [E]" one
|
||||||
이름으로 충분). 해결: pointer_dump에 "direction ↔ rx_base_pa" 매핑 포함.
|
has to follow the address range (previously the direction name was
|
||||||
|
enough). Mitigation: include a "direction ↔ rx_base_pa" mapping in
|
||||||
|
pointer_dump.
|
||||||
|
|
||||||
### Neutral
|
### Neutral
|
||||||
|
|
||||||
- IPCQ protocol의 semantic layer (sender가 dst_addr 계산, receiver가 수신)는
|
- The semantic layer of the IPCQ protocol (sender computes dst_addr,
|
||||||
불변.
|
receiver receives) is unchanged.
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
# ADR-0026: DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
|
# ADR-0026: DPPolicy = Intra-Device Only — remove sip/num_sips fields
|
||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
@@ -6,16 +6,17 @@ Accepted (Revision 5 — Phase 2 landed 2026-04-14, 523 passed + 1 strict xfail)
|
|||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
### 목표
|
### Goal
|
||||||
|
|
||||||
`DPPolicy`를 **한 device(SIP) 내부의 cube × PE 분산**만 표현하는 순수한
|
Clarify `DPPolicy` as a pure intra-device abstraction that only expresses
|
||||||
intra-device 추상화로 명확화한다. SIP 간 분산(TP)은 별도 레이어로 분리
|
**cube × PE distribution within a single device (SIP)**. Inter-SIP
|
||||||
(ADR-0024의 `torch.ahbm.set_device(rank)` 또는 ADR-0027의 Megatron parallel
|
distribution (TP) is split into a separate layer (handled by ADR-0024's
|
||||||
layers가 담당).
|
`torch.ahbm.set_device(rank)` or by ADR-0027's Megatron-style parallel
|
||||||
|
layers).
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
### D1. `DPPolicy`에서 `sip` + `num_sips` 필드 제거
|
### D1. Remove `sip` + `num_sips` fields from `DPPolicy`
|
||||||
|
|
||||||
```python
|
```python
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
@@ -32,15 +33,16 @@ class DPPolicy:
|
|||||||
num_cubes: int | None = None
|
num_cubes: int | None = None
|
||||||
```
|
```
|
||||||
|
|
||||||
제거되는 필드: `sip`, `num_sips`.
|
Removed fields: `sip`, `num_sips`.
|
||||||
|
|
||||||
### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거
|
### D2. `ShardSpec` — structural (sip, cube, pe) coordinates, `pe_index` fully removed
|
||||||
|
|
||||||
현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube ×
|
The current `ShardSpec.pe_index` is a **global flat index**
|
||||||
pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태.
|
(`sip × cubes × pes + cube × pes + pe`). This is the form ADR-0024 D4
|
||||||
|
flagged as "abstraction leakage".
|
||||||
|
|
||||||
본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는
|
This ADR **redefines ShardSpec in structural coordinates** and **does
|
||||||
property로도 **남기지 않는다**:
|
not even leave `pe_index` as a property**:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# src/kernbench/policy/placement/dp.py (after)
|
# src/kernbench/policy/placement/dp.py (after)
|
||||||
@@ -59,28 +61,32 @@ class ShardSpec:
|
|||||||
nbytes: int
|
nbytes: int
|
||||||
```
|
```
|
||||||
|
|
||||||
**핵심 원칙**:
|
**Core principle**:
|
||||||
- ShardSpec의 정체성은 `(sip, cube, pe)` 3튜플.
|
- The identity of ShardSpec is the `(sip, cube, pe)` 3-tuple.
|
||||||
- **`pe_index` property도 없음** — silent semantics drift 차단.
|
- **No `pe_index` property either** — blocks silent semantics drift.
|
||||||
- Global flat을 기대한 기존 호출자는 `.pe_index` 접근 시 **즉시
|
- Existing callers expecting global-flat get an **immediate
|
||||||
`AttributeError`** → 반드시 구조적 좌표로 migration.
|
`AttributeError`** on `.pe_index` access → forced migration to
|
||||||
- Flat integer key가 필요한 국소 문맥 (예: 내부 dict lookup)은 호출자가
|
structural coordinates.
|
||||||
명시적으로 `spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe`를 계산.
|
- Local contexts that genuinely need a flat integer key (e.g. internal
|
||||||
|
dict lookup) explicitly compute
|
||||||
|
`spec.sip * N_CUBES * N_PE + spec.cube * N_PE + spec.pe` at the call
|
||||||
|
site.
|
||||||
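The principles listed above can be sketched as follows. This is a minimal illustration, not the actual `dp.py` definition: the field list of `ShardSpec` is an assumption (only `nbytes` is visible in the hunk), and `flat_key` with its `n_cubes` / `n_pe` parameters is a hypothetical call-site helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSpec:
    # Assumed field set; the real class may carry more fields.
    sip: int
    cube: int
    pe: int           # cube-local PE index
    offset_bytes: int
    nbytes: int


def flat_key(spec: ShardSpec, n_cubes: int, n_pe: int) -> int:
    """Hypothetical call-site helper: compute a global flat index explicitly."""
    return spec.sip * n_cubes * n_pe + spec.cube * n_pe + spec.pe


spec = ShardSpec(sip=1, cube=2, pe=3, offset_bytes=0, nbytes=64)

# There is no pe_index attribute, so stale callers fail loudly.
try:
    spec.pe_index
    stale_access_failed = False
except AttributeError:
    stale_access_failed = True
```

Because the flat index is computed only where a caller asks for it, its meaning (global versus SIP-local) is visible at the call site instead of being hidden behind a property.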
|
|
||||||
**Property 제거 정당화**: KernBench는 사내 프로젝트로 call site가 한정되어
|
**Justification for removing the property**: KernBench is an internal
|
||||||
있음. Silent drift 위험 (의미만 바뀌고 타입은 같은 int) 대비 explicit breakage
|
project with a limited number of call sites. Explicit breakage
|
||||||
(AttributeError)가 훨씬 안전.
|
(AttributeError) is much safer than the risk of silent drift (semantics
|
||||||
|
change while the type stays int).
|
||||||
|
|
||||||
### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성
|
### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates
|
||||||
|
|
||||||
ADR-0024 D4의 계약 구현. Post-hoc shifting 없음.
|
Implements the contract of ADR-0024 D4. No post-hoc shifting.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# src/kernbench/policy/placement/dp.py (after)
|
# src/kernbench/policy/placement/dp.py (after)
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class _LocalPeShard:
|
class _LocalPeShard:
|
||||||
"""Internal — PE resolver의 반환. Cube 내 local PE 식별자 + payload."""
|
"""Internal — return value of the PE resolver. Cube-local PE id + payload."""
|
||||||
local_pe: int # cube-local PE index (0..num_pe-1)
|
local_pe: int # cube-local PE index (0..num_pe-1)
|
||||||
offset_bytes: int
|
offset_bytes: int
|
||||||
nbytes: int
|
nbytes: int
|
||||||
@@ -93,7 +99,7 @@ def resolve_dp_policy(
|
|||||||
itemsize: int,
|
itemsize: int,
|
||||||
num_pe: int,
|
num_pe: int,
|
||||||
num_cubes: int = 1,
|
num_cubes: int = 1,
|
||||||
target_sip: int, # NEW — 어느 SIP에 배치할지 명시
|
target_sip: int, # NEW — explicitly state which SIP to place on
|
||||||
) -> list[ShardSpec]:
|
) -> list[ShardSpec]:
|
||||||
"""2-level resolution (cube × PE) on a specified SIP.
|
"""2-level resolution (cube × PE) on a specified SIP.
|
||||||
|
|
||||||
@@ -123,28 +129,30 @@ def resolve_dp_policy(
|
|||||||
return all_shards
|
return all_shards
|
||||||
```
|
```
|
||||||
|
|
||||||
**내부 resolver** (`column_wise`, `row_wise`, `replicate`)는 `_LocalPeShard`
|
**Internal resolvers** (`column_wise`, `row_wise`, `replicate`) return a
|
||||||
리스트 반환 — `local_pe` 필드명으로 **"cube-local PE identifier"임이 명시적**.
|
list of `_LocalPeShard` — the `local_pe` field name makes it **explicit
|
||||||
과거 `ShardSpec.pe_index`와 이름이 혼동되던 문제 해소.
|
that this is a "cube-local PE identifier"**. This resolves the previous
|
||||||
|
confusion with the name `ShardSpec.pe_index`.
|
||||||
|
|
||||||
**이름 규약 정리** (전체 ADR):
|
**Naming convention summary** (whole ADR):
|
||||||
- `ShardSpec.pe`: 최종 외부 API — cube-local PE (structural coord)
|
- `ShardSpec.pe`: the final external API — cube-local PE (structural coord)
|
||||||
- `_LocalPeShard.local_pe`: 내부 resolver 단계의 동일 의미
|
- `_LocalPeShard.local_pe`: the same meaning at the internal resolver stage
|
||||||
- `pe_index`: **제거**. 외부/내부 어디에도 남기지 않는다 (silent drift 차단의
|
- `pe_index`: **removed**. Not retained anywhere, internal or external
|
||||||
부가 효과: 이름 재등장 없음).
|
(additional benefit of preventing silent drift: the name does not
|
||||||
|
reappear).
|
||||||
|
|
||||||
### D4. `_create_tensor` — 구조적 좌표로 직접 placement
|
### D4. `_create_tensor` — placement directly in structural coordinates
|
||||||
|
|
||||||
ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy`
|
Continuation of ADR-0024 D4. Post-hoc shifting removed; structural
|
||||||
호출 시점에 직접 지정.
|
coordinates are specified directly at the `resolve_dp_policy` call site.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# context.py _create_tensor (after)
|
# context.py _create_tensor (after)
|
||||||
current_sip = self.ahbm.current_device()
|
current_sip = self.ahbm.current_device()
|
||||||
if current_sip is None:
|
if current_sip is None:
|
||||||
# Single-driver fallback (ADR-0024 D2와 일관).
|
# Single-driver fallback (consistent with ADR-0024 D2).
|
||||||
# Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는
|
# In launcher-based code, forgetting set_device() silently sticks the
|
||||||
# 문제가 있음 → debug mode에서 경고.
|
# tensor on SIP 0 — emit a warning in debug mode.
|
||||||
if os.environ.get("KERNBENCH_DEBUG"):
|
if os.environ.get("KERNBENCH_DEBUG"):
|
||||||
import warnings
|
import warnings
|
||||||
warnings.warn(
|
warnings.warn(
|
||||||
@@ -161,38 +169,39 @@ placement = resolve_dp_policy(
|
|||||||
itemsize=itemsize,
|
itemsize=itemsize,
|
||||||
num_pe=eff_num_pe,
|
num_pe=eff_num_pe,
|
||||||
num_cubes=eff_num_cubes,
|
num_cubes=eff_num_cubes,
|
||||||
target_sip=current_sip, # ← 구조적 좌표 일차 지정
|
target_sip=current_sip, # ← structural coord specified up front
|
||||||
)
|
)
|
||||||
|
|
||||||
# placement의 각 ShardSpec은 이미 (sip=current_sip, cube=local, pe=local) 포함.
|
# Each ShardSpec in placement already carries (sip=current_sip, cube=local, pe=local).
|
||||||
# 과거의 post-hoc shifting 블록은 완전히 제거.
|
# The old post-hoc shifting block is removed entirely.
|
||||||
```
|
```
|
||||||
|
|
||||||
**모든** 텐서가 current device SIP에 배치됨. Multi-SIP 텐서를 만들고 싶으면
|
**Every** tensor is placed on the current device's SIP. If you need a
|
||||||
ADR-0027의 TP primitive 사용.
|
multi-SIP tensor, use the TP primitive of ADR-0027.
|
||||||
|
|
||||||
**Single-driver fallback의 trade-off**: set_device 없는 호출에서 SIP 0으로
|
**Trade-off of the single-driver fallback**: When set_device is not
|
||||||
default는 기존 single-driver 테스트 호환을 위해 유지. `KERNBENCH_DEBUG=1`
|
called, defaulting to SIP 0 is kept for compatibility with existing
|
||||||
환경에서는 launcher 컨텍스트의 실수로 set_device 누락 시 조용히 잘못된 SIP에
|
single-driver tests. With `KERNBENCH_DEBUG=1`, a warning is emitted so
|
||||||
배치되는 것을 감지할 수 있도록 warning.
|
that accidentally omitting set_device in a launcher context — which would
|
||||||
|
silently place the tensor on the wrong SIP — can be detected.
|
||||||
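A self-contained sketch of the fallback described above, assuming the behavior is exactly "default to SIP 0, warn only when `KERNBENCH_DEBUG` is set". The function name `resolve_target_sip` and the warning text are hypothetical; the real logic lives inline in `_create_tensor`.

```python
import os
import warnings
from typing import Optional


def resolve_target_sip(current_sip: Optional[int]) -> int:
    """Return the SIP to place on, falling back to SIP 0 when none is set."""
    if current_sip is not None:
        return current_sip
    # Single-driver fallback: keep old tests working, but surface the
    # situation in debug mode so a forgotten set_device() is noticeable.
    if os.environ.get("KERNBENCH_DEBUG"):
        warnings.warn(
            "set_device() was not called; placing tensor on SIP 0",
            stacklevel=2,
        )
    return 0
```

Keeping the warning behind an environment variable leaves existing single-driver tests silent by default while still giving launcher-based code a way to detect the omission.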
|
|
||||||
### D5. Downstream — allocator lookup은 구조적 tuple key로
|
### D5. Downstream — allocator lookup by structural tuple key
|
||||||
|
|
||||||
기존 `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
|
Existing `deploy_tensor` (`src/kernbench/runtime_api/tensor.py`):
|
||||||
|
|
||||||
```python
|
```python
|
||||||
for spec in placement:
|
for spec in placement:
|
||||||
alloc = allocators[spec.pe_index] # ← AttributeError (property 제거됨)
|
alloc = allocators[spec.pe_index] # ← AttributeError (property removed)
|
||||||
```
|
```
|
||||||
|
|
||||||
`pe_index`가 없어졌으므로 구조적 좌표로 **강제** migration:
|
With `pe_index` gone, migration to structural coordinates is **forced**:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
for spec in placement:
|
for spec in placement:
|
||||||
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
|
alloc = allocators[(spec.sip, spec.cube, spec.pe)]
|
||||||
```
|
```
|
||||||
|
|
||||||
`_ensure_allocators`의 dict population도 tuple key로:
|
The dict population in `_ensure_allocators` is also tuple-keyed:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# context.py _ensure_allocators (after)
|
# context.py _ensure_allocators (after)
|
||||||
@@ -204,59 +213,71 @@ for sip_id in sip_range:
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
`_free_tensor`도 동일: 기존 `flat_idx = sip * ... + cube * ... + pe` 계산
|
`_free_tensor` is the same: the old
|
||||||
블록 제거, `(shard.sip, shard.cube, shard.pe)` 직접 사용.
|
`flat_idx = sip * ... + cube * ... + pe` computation block is removed,
|
||||||
|
and `(shard.sip, shard.cube, shard.pe)` is used directly.
|
||||||
|
|
||||||
**Tuple vs dataclass `PEIdentity`**: Tuple이 단순하고 hashable로 바로 써서
|
**Tuple vs dataclass `PEIdentity`**: Recommend the tuple — it is simple
|
||||||
권고. `PEIdentity` 값객체는 명시적 타입 장점은 있지만 boilerplate가 크고 현재
|
and hashable out of the box. A `PEIdentity` value object has the upside
|
||||||
allocator dict의 유일한 key라 오버엔지니어링. Tuple 유지.
|
of an explicit type, but the boilerplate is large and it is currently
|
||||||
|
the only key of the allocator dict, so it would be over-engineering.
|
||||||
|
Keep the tuple.
|
||||||
|
|
||||||
### D7. 하위 호환 — 불가 (cleanup ADR)
|
### D7. Backward compatibility — none (cleanup ADR)
|
||||||
|
|
||||||
이 ADR은 **breaking change**.
|
This ADR is a **breaking change**.
|
||||||
|
|
||||||
1. `DPPolicy(sip=...)` 또는 `DPPolicy(num_sips=...)` 호출 → `TypeError`
|
1. `DPPolicy(sip=...)` or `DPPolicy(num_sips=...)` → `TypeError`
|
||||||
2. `ShardSpec.pe_index` 접근 → `AttributeError`
|
2. `ShardSpec.pe_index` access → `AttributeError`
|
||||||
|
|
||||||
모두 **즉시 명시적 breakage**. Deprecation warning / fallback 경로 없음.
|
Both are **immediate, explicit breakage**. No deprecation warning /
|
||||||
KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 migration.
|
fallback path. KernBench is an internal project with a bounded set of
|
||||||
|
call sites, so migration happens in one pass.
|
||||||
|
|
||||||
**Silent drift 차단**이 property 완전 제거의 주된 이점: global flat을 기대한
|
**Blocking silent drift** is the main upside of fully removing the
|
||||||
코드가 SIP-local 결과를 받아 조용히 잘못된 인덱싱을 할 가능성 제거.
|
property: code that expected a global flat could otherwise silently
|
||||||
|
receive a SIP-local result and index incorrectly — that possibility is
|
||||||
|
eliminated.
|
||||||
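The explicit-breakage behavior described in D7 follows from how Python dataclasses handle unknown keyword arguments. The sketch below assumes a hypothetical four-field `DPPolicy` (only `num_cubes` is visible in the hunk; the other fields and their defaults are illustrative) and a hypothetical helper `rejects_removed_field`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DPPolicy:
    # Assumed field set after removing sip / num_sips.
    cube: str = "replicate"
    pe: str = "replicate"
    num_pe: Optional[int] = None
    num_cubes: Optional[int] = None


def rejects_removed_field(**kwargs) -> bool:
    """Return True if constructing DPPolicy with these kwargs raises TypeError."""
    try:
        DPPolicy(**kwargs)
    except TypeError:
        return True
    return False
```

With this shape, `DPPolicy(sip=...)` and `DPPolicy(num_sips=...)` fail at construction time, so no deprecation shim is needed to find the remaining call sites.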
|
|
||||||
## Dependencies
|
## Dependencies
|
||||||
|
|
||||||
- **ADR-0024** (launcher): `set_device(rank)` 및 current-device scoping이
|
- **ADR-0024** (launcher): `set_device(rank)` and current-device scoping
|
||||||
SIP 배치 메커니즘 제공. 본 ADR은 그 위에 서서 DPPolicy를 순수 intra-device로
|
provide the SIP placement mechanism. This ADR sits on top and narrows
|
||||||
좁힘.
|
DPPolicy to pure intra-device.
|
||||||
- **ADR-0027** (Megatron TP): 다중 SIP에 걸친 텐서가 필요한 경우의 대안 경로.
|
- **ADR-0027** (Megatron TP): the alternative path when a tensor spans
|
||||||
이 ADR 적용 후 multi-SIP use case는 ADR-0027로 이관.
|
multiple SIPs. After this ADR is applied, multi-SIP use cases move to
|
||||||
|
ADR-0027.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Non-goals
|
## Non-goals
|
||||||
|
|
||||||
- **`DPPolicy.cube` / `pe` 재설계**: 기존 replicate/column_wise/row_wise 의미
|
- **Redesign of `DPPolicy.cube` / `pe`**: existing
|
||||||
유지.
|
replicate/column_wise/row_wise semantics are kept.
|
||||||
- **Tiling 정책 통합**: `tiled_column_major` / `tiled_row_major`는 그대로.
|
- **Tiling policy consolidation**: `tiled_column_major` /
|
||||||
- **Multi-device 텐서 추상화 신규**: DTensor-like는 ADR-0028.
|
`tiled_row_major` stay as they are.
|
||||||
|
- **New multi-device tensor abstraction**: a DTensor-like abstraction is left to ADR-0028.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Open questions
|
## Open questions
|
||||||
|
|
||||||
- **`_create_tensor`의 current_sip 기본값**: set_device 없는 호출에서 rank=0
|
- **Default value of current_sip in `_create_tensor`**: for calls without
|
||||||
(SIP 0)로 fallback할지, 아니면 error 낼지. 권고는 fallback (기존 single-driver
|
set_device, whether to fall back to rank=0 (SIP 0) or to raise an
|
||||||
테스트와의 호환).
|
error. The recommendation is fallback (compatibility with existing
|
||||||
- **`test_sip_parallel.py` 재작성 범위**: 기존 단위 테스트의 의도를 유지하며
|
single-driver tests).
|
||||||
launcher 기반으로 옮기려면 추가 fixture 필요. 별도 작업으로 scope.
|
- **Scope of `test_sip_parallel.py` rewrite**: porting the existing unit
|
||||||
- **`DPPolicy`의 `num_sips=None` 의미**: 필드가 없어지면 `num_sips` 개념 자체가
|
tests to a launcher-based setup while preserving their intent requires
|
||||||
사라짐. Multi-SIP을 표현하고 싶으면 ADR-0027의 TP primitive를 쓰라는 것이
|
additional fixtures. Scoped as separate work.
|
||||||
명시적 답.
|
- **Meaning of `num_sips=None` on `DPPolicy`**: once the field is gone,
|
||||||
|
the concept of `num_sips` disappears entirely. The explicit answer for
|
||||||
|
expressing multi-SIP is to use the TP primitive of ADR-0027.
|
||||||
|
|
||||||
**Resolved (이전 rev에서 open이었던 것들)**:
|
**Resolved (items that were open in earlier revs)**:
|
||||||
- ~~`ShardSpec.pe_index` property 존치 여부~~ → **완전 제거** (D2)
|
- ~~Whether to keep the `ShardSpec.pe_index` property~~ → **fully
|
||||||
- ~~`_ensure_allocators` dict key 형식~~ → **tuple `(sip, cube, pe)`** (D5)
|
removed** (D2)
|
||||||
|
- ~~Form of `_ensure_allocators` dict key~~ → **tuple `(sip, cube, pe)`**
|
||||||
|
(D5)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -264,25 +285,31 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에
|
|||||||
|
|
||||||
### Positive
|
### Positive
|
||||||
|
|
||||||
- **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device.
|
- **Clean conceptual separation**: DPPolicy = intra-device, TP =
|
||||||
- **API 단순화**: DPPolicy 생성자 필드 ~33% 축소.
|
inter-device.
|
||||||
- **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 →
|
- **API simplification**: about a 33% reduction in DPPolicy constructor
|
||||||
abstraction leakage 해소 (ADR-0024 D4 계약 충족).
|
fields.
|
||||||
- **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시.
|
- **Structural-coordinate consistency**: ShardSpec is expressed as a
|
||||||
- **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP
|
`(sip, cube, pe)` tuple → abstraction leakage resolved (the ADR-0024
|
||||||
경계 제어 메커니즘.
|
D4 contract is satisfied).
|
||||||
|
- **Clear meaning of `pe_index`**: the single interpretation is
|
||||||
|
SIP-local. If global-flat is needed, it must be made explicit.
|
||||||
|
- **Launcher-model consistency**: ADR-0024's "1 worker per SIP" model is
|
||||||
|
the sole SIP-boundary control mechanism.
|
||||||
|
|
||||||
### Negative
|
### Negative
|
||||||
|
|
||||||
- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
|
- **Breaking change (explicit)**: `DPPolicy(sip=...)` → `TypeError`,
|
||||||
`spec.pe_index` → `AttributeError`. 모든 호출자 한 번에 수정 필요.
|
`spec.pe_index` → `AttributeError`. All callers need to be fixed at
|
||||||
- **ShardSpec schema 변경**: `pe_index` 단일 필드 → `sip`/`cube`/`pe` 세 필드.
|
once.
|
||||||
Downstream (`deploy_tensor`, `_free_tensor`, `_ensure_allocators`,
|
- **ShardSpec schema change**: a single `pe_index` field becomes three
|
||||||
`allocators` dict key 등) 연쇄 수정.
|
fields `sip`/`cube`/`pe`. Cascading edits downstream (`deploy_tensor`,
|
||||||
- **Silent drift 없음**: property 완전 제거로 runtime에서 즉시 실패 →
|
`_free_tensor`, `_ensure_allocators`, `allocators` dict key, etc.).
|
||||||
migration leakage 원천 차단. (Negative가 아니라 explicit tradeoff)
|
- **No silent drift**: with the property fully removed, runtime failure
|
||||||
- `test_sip_parallel.py` 재작성 비용.
|
is immediate → migration leakage is blocked at the source. (Not a
|
||||||
|
negative but an explicit tradeoff.)
|
||||||
|
- The cost of rewriting `test_sip_parallel.py`.
|
||||||
|
|
||||||
### Neutral
|
### Neutral
|
||||||
|
|
||||||
- 기존 `cube` / `pe` 필드 의미 불변.
|
- The meaning of the existing `cube` / `pe` fields is unchanged.
|
||||||
|
|||||||
File diff suppressed because it is too large
@@ -92,6 +92,18 @@ def test_crlf_normalization(tmp_path: Path) -> None:
|
|||||||
assert v.verify(tmp_path) == []
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_em_dash_title_separator_recognized(tmp_path: Path) -> None:
|
||||||
|
"""ADR-0033 uses ' — ' instead of ': ' between ADR-NNNN and the title."""
|
||||||
|
en = tmp_path / "docs/adr/ADR-0033-foo-bar.md"
|
||||||
|
ko = tmp_path / "docs/adr-ko/ADR-0033-foo-bar.md"
|
||||||
|
en.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
ko.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
body = "## Status\n\nAccepted\n\n## Context\n\nbody\n"
|
||||||
|
en.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
|
||||||
|
ko.write_text("# ADR-0033 — Latency Model\n\n" + body, encoding="utf-8")
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
|
def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
|
||||||
"""ADR-0013 uses an underscore in its slug; the regex must accept it."""
|
"""ADR-0013 uses an underscore in its slug; the regex must accept it."""
|
||||||
_make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
|
_make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
|
||||||
|
|||||||
@@ -24,7 +24,7 @@ import sys
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
|
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
|
||||||
TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
|
TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")
|
||||||
|
|
||||||
|
|
||||||
def _normalize(text: str) -> str:
|
def _normalize(text: str) -> str:
|
||||||
|
|||||||
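The effect of the `TITLE_RE` change in the hunk above can be checked with a short sketch. Both patterns are copied from the diff; the helper `adr_id` and the sample title lines other than the ADR-0033 one are illustrative assumptions.

```python
import re

OLD_TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
NEW_TITLE_RE = re.compile(r"^# ADR-(\d{4})\b")


def adr_id(pattern: re.Pattern, title_line: str):
    """Return the four-digit ADR id captured from a title line, or None."""
    match = pattern.match(title_line)
    return match.group(1) if match else None


colon_title = "# ADR-0020: Example title"
em_dash_title = "# ADR-0033 — Latency Model"
five_digit_title = "# ADR-00331: Example title"
```

The word boundary accepts any non-word character after the four digits, so both the colon and the em-dash separators are recognized, while a fifth digit still prevents a match.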