Compare commits
18 Commits
| SHA1 |
|---|
| 533e699299 |
| 54fcb7e4bc |
| ad5f01ab13 |
| 1c5752a9ec |
| 84a1325e5c |
| 1e39214f89 |
| fca24feac5 |
| d55dc6cb4f |
| 46291bf91b |
| 04c912f53e |
| 1c33afec55 |
| 81cc32c46b |
| e9cc40f74d |
| c1a5cf3a2a |
| 90874abbfe |
| 19dfc86dc3 |
| 14d800b0ae |
| 6918e6e906 |
@@ -1,25 +1,39 @@
|
|||||||
# ADR-0001: PhysAddr Layout & Address Decoding Contract
|
# ADR-0001: 51-bit Physical Address Layout & Decoding Contract
|
||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
Accepted
|
Accepted (Revision 2 — 2026-04-27: concrete bit layout, rack_id removal,
|
||||||
|
Tray->SIP / SIP->DIE renaming, PE/MCPU/IOCPU sub-unit tables.
|
||||||
|
Supersedes ADR-0031.)
|
||||||
|
|
||||||
## Date
|
## Date
|
||||||
|
|
||||||
2026-02-27
|
2026-04-27 (original: 2026-02-27)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
|
KernBench requires a stable, parsable physical address scheme that:
|
||||||
To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
|
|
||||||
|
|
||||||
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
|
- can be decoded into routing domains (SIP / die / HBM / PE-resource / IOCPU)
|
||||||
- remains topology-agnostic (no hardcoded counts)
|
- remains topology-agnostic (no hardcoded counts)
|
||||||
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
|
- supports swappable policy and DI-first components
|
||||||
|
- covers multiple SIPs, AHBM dies, and IO chiplet dies in a unified space
|
||||||
|
|
||||||
|
### History
|
||||||
|
|
||||||
|
- Original ADR-0001 defined a 51-bit layout with `rack_id(4) + sip_id(4) +
|
||||||
|
sip_seg(5) + local_offset(38)`. `rack_id` was never used in practice.
|
||||||
|
- ADR-0031 (stub) requested PE-resource range partition but was never
|
||||||
|
implemented.
|
||||||
|
|
||||||
|
Revision 2 removes `rack_id`, renames `sip_seg -> die_id`, and provides
|
||||||
|
concrete sub-unit tables for PE, MCPU, CUBE_SRAM, and IOCPU resources.
|
||||||
|
ADR-0031 is superseded.
|
||||||
|
|
||||||
## Decision
|
## Decision
|
||||||
|
|
||||||
We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
|
We define a **PhysAddr value object** and an **address decoding contract**
|
||||||
|
that converts an integer address into routing domains.
|
||||||
|
|
||||||
### D1. PhysAddr is an immutable value object
|
### D1. PhysAddr is an immutable value object
|
||||||
|
|
||||||
@@ -27,82 +41,322 @@ We define a **PhysAddr value object** and an **address decoding contract** that
|
|||||||
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
|
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
|
||||||
- No global state may be required to interpret a PhysAddr.
|
- No global state may be required to interpret a PhysAddr.
|
||||||
|
|
||||||
### D2. PhysAddr fields (logical contract)
|
### D2. 51-bit Physical Address Layout
|
||||||
|
|
||||||
PhysAddr must be able to represent at least:
|
A 51-bit physical address is adopted.
|
||||||
|
|
||||||
- `rack_id` (optional but reserved for scale-out)
|
#### 2.1 Top-Level Address Map
|
||||||
- `sip_id` (device / SIP domain)
|
|
||||||
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
|
|
||||||
- `local_offset` (offset within the chosen segment/window)
|
|
||||||
|
|
||||||
Decoded/derived fields may include (optional):
|
```text
|
||||||
|
[50:47] sip_id (4) -- 16 SIPs
|
||||||
|
[46:42] die_id (5) -- 32 dies per SIP
|
||||||
|
[41: 0] local_offset (42) -- 4 TB per die
|
||||||
|
```
|
||||||
|
|
||||||
- `cube_id`
|
```text
|
||||||
- `kind` (e.g., HBM vs PE-resource vs raw)
|
50 47 46 42 41 0
|
||||||
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
|
+---------+----------+-------------------------+
|
||||||
|
| sip_id | die_id | local_offset |
|
||||||
|
+---------+----------+-------------------------+
|
||||||
|
```
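The top-level split in 2.1 is purely positional, so it can be sketched in a few lines. The names `TopLevel` and `split_top_level` are illustrative assumptions, not identifiers from the repository.

```python
from typing import NamedTuple

# Bit positions taken from the 2.1 layout: [50:47] sip_id, [46:42] die_id,
# [41:0] local_offset.
SIP_SHIFT, SIP_BITS = 47, 4
DIE_SHIFT, DIE_BITS = 42, 5
LOCAL_BITS = 42
ADDR_BITS = 51


class TopLevel(NamedTuple):  # hypothetical container, for illustration only
    sip_id: int
    die_id: int
    local_offset: int


def split_top_level(addr: int) -> TopLevel:
    """Split a 51-bit address into its three top-level fields."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 51 bits")
    return TopLevel(
        sip_id=(addr >> SIP_SHIFT) & ((1 << SIP_BITS) - 1),
        die_id=(addr >> DIE_SHIFT) & ((1 << DIE_BITS) - 1),
        local_offset=addr & ((1 << LOCAL_BITS) - 1),
    )
```

No topology parameter appears anywhere in the function, which is the property D3 asks for.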
|
||||||
|
|
||||||
**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
|
#### 2.2 die_id Allocation
|
||||||
|
|
||||||
### D3. Decoding is deterministic and policy-compatible
|
| die_id | Meaning |
|
||||||
|
|--------|---------|
|
||||||
|
| 0..15 | AHBM dies |
|
||||||
|
| 16..20 | IOCHIPLET dies |
|
||||||
|
| 21..31 | Reserved |
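The die_id allocation above amounts to a three-way range dispatch. A minimal sketch, with `die_kind` as a hypothetical helper name:

```python
def die_kind(die_id: int) -> str:
    """Classify a 5-bit die_id according to the 2.2 allocation table."""
    if not 0 <= die_id <= 31:
        raise ValueError("die_id must fit in 5 bits")
    if die_id <= 15:
        return "AHBM"
    if die_id <= 20:
        return "IOCHIPLET"
    return "RESERVED"  # 21..31: 11 slots kept for future die types
```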
|
||||||
|
|
||||||
- Decoding must deterministically map an integer address to:
|
#### 2.3 AHBM Die Layout
|
||||||
- destination SIP domain (`sip_id`)
|
|
||||||
- destination sub-domain (`cube_id` if applicable)
|
|
||||||
- destination target kind (HBM/PE-resource/other)
|
|
||||||
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
|
|
||||||
|
|
||||||
### D4. Topology-derived constants live in the topology layer
|
Only the lower 256 GB of the 4 TB die-local window is assigned.
|
||||||
|
|
||||||
Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
|
```text
|
||||||
They must not be hardcoded in node implementations.
|
[41:38] MBZ (4)
|
||||||
|
[37] addr_space (1) -- 0 = local resource, 1 = HBM memory
|
||||||
|
[36: 0] sub-address (37)
|
||||||
|
```
|
||||||
|
|
||||||
|
| addr_space | Meaning |
|
||||||
|
|------------|---------|
|
||||||
|
| 0 | Local resource |
|
||||||
|
| 1 | HBM memory |
|
||||||
|
|
||||||
|
##### 2.3.1 HBM Window (addr_space = 1)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[36:0] hbm_offset (37) -- 128 GB decode window
|
||||||
|
```
|
||||||
|
|
||||||
|
The architectural decode window is fixed at 128 GB. Implemented capacity
|
||||||
|
may be smaller depending on SKU/topology (see D4).
|
||||||
|
|
||||||
|
##### 2.3.2 Resource Window (addr_space = 0)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[36:34] resource_kind (3)
|
||||||
|
[33: 0] kind_local (34) -- 16 GB per kind
|
||||||
|
```
|
||||||
|
|
||||||
|
| resource_kind | Meaning |
|
||||||
|
|---------------|---------|
|
||||||
|
| 000 | PE_LOCAL |
|
||||||
|
| 001 | MCPU_LOCAL |
|
||||||
|
| 010 | CUBE_SRAM |
|
||||||
|
| 011..111 | Reserved |
|
||||||
|
|
||||||
|
Each kind gets a 16 GB decode region.
|
||||||
|
|
||||||
|
##### 2.3.3 PE_LOCAL (resource_kind = 000)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33] MBZ (1)
|
||||||
|
[32:29] pe_id (4) -- 0..15
|
||||||
|
[28:25] pe_sub_unit (4)
|
||||||
|
[24: 0] sub_offset (25) -- 32 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
16 PEs x 16 sub-unit slots x 32 MB = 8 GB active decode.
|
||||||
|
|
||||||
|
| pe_sub_unit | Name | Budget |
|
||||||
|
|-------------|------|--------|
|
||||||
|
| 0 | PE_CPU_DTCM | 8 KB |
|
||||||
|
| 1 | MATH_ENGINE_DTCM | 8 KB |
|
||||||
|
| 2 | IPCQ | 256 KB |
|
||||||
|
| 3 | PE_CPU_SFR | 16 KB |
|
||||||
|
| 4 | MATH_ENGINE_SFR | 16 KB |
|
||||||
|
| 5 | DMA_ENGINE_SFR | 192 KB |
|
||||||
|
| 6 | PE_TCM | 2 MB |
|
||||||
|
| 7..15 | Reserved | -- |
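As an illustration of the PE_LOCAL layout and its budget table, the sketch below splits the 34-bit `kind_local` field and checks the offset against the sub-unit budget. `PeLocal`, `decode_pe_local`, and `within_budget` are assumed names; treating reserved sub-units as out of budget is also an assumption, since the ADR does not spell that case out.

```python
from typing import NamedTuple

# Implemented budgets in bytes, copied from the pe_sub_unit table.
PE_SUB_UNIT_BUDGET = {
    0: 8 << 10,    # PE_CPU_DTCM
    1: 8 << 10,    # MATH_ENGINE_DTCM
    2: 256 << 10,  # IPCQ
    3: 16 << 10,   # PE_CPU_SFR
    4: 16 << 10,   # MATH_ENGINE_SFR
    5: 192 << 10,  # DMA_ENGINE_SFR
    6: 2 << 20,    # PE_TCM
}


class PeLocal(NamedTuple):  # hypothetical container
    pe_id: int
    pe_sub_unit: int
    sub_offset: int


def decode_pe_local(kind_local: int) -> PeLocal:
    """Positional split of kind_local: [33] MBZ, [32:29], [28:25], [24:0]."""
    if not 0 <= kind_local < (1 << 34):
        raise ValueError("kind_local does not fit in 34 bits")
    if kind_local >> 33:
        raise ValueError("MBZ bit [33] is set")
    return PeLocal(
        pe_id=(kind_local >> 29) & 0xF,
        pe_sub_unit=(kind_local >> 25) & 0xF,
        sub_offset=kind_local & ((1 << 25) - 1),
    )


def within_budget(field: PeLocal) -> bool:
    """True when the offset lies inside the sub-unit's implemented budget."""
    budget = PE_SUB_UNIT_BUDGET.get(field.pe_sub_unit)
    return budget is not None and field.sub_offset < budget
```

Keeping the budget check separate from the positional split mirrors the split between decoding and validation described elsewhere in this ADR.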
|
||||||
|
|
||||||
|
##### 2.3.4 MCPU_LOCAL (resource_kind = 001)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33:30] MBZ (4)
|
||||||
|
[29:25] mcpu_sub_unit (5)
|
||||||
|
[24: 0] sub_offset (25) -- 32 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
32 sub-unit slots x 32 MB = 1 GB active decode.
|
||||||
|
|
||||||
|
| mcpu_sub_unit | Name | Budget |
|
||||||
|
|---------------|------|--------|
|
||||||
|
| 0 | MCPU_ITCM | 512 KB |
|
||||||
|
| 1 | MCPU_DTCM | 512 KB |
|
||||||
|
| 2 | IPCQ | 256 KB |
|
||||||
|
| 3 | MCPU_SFR | 8 KB |
|
||||||
|
| 4 | MCPU_DMA_SFR | 16 KB |
|
||||||
|
| 5 | MCPU_SRAM | 10 MB |
|
||||||
|
| 6..31 | Reserved | -- |
|
||||||
|
|
||||||
|
##### 2.3.5 CUBE_SRAM (resource_kind = 010)
|
||||||
|
|
||||||
|
```text
|
||||||
|
[33:25] MBZ (9)
|
||||||
|
[24: 0] sram_offset (25) -- flat 32 MB
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 2.4 IOCHIPLET Die Layout
|
||||||
|
|
||||||
|
Only the lower 1 TB of the 4 TB die-local window is assigned.
|
||||||
|
|
||||||
|
```text
|
||||||
|
[41:40] MBZ (2)
|
||||||
|
[39: 0] chiplet_offset (40) -- 1 TB
|
||||||
|
```
|
||||||
|
|
||||||
|
Region split by address range:
|
||||||
|
|
||||||
|
| Range | Meaning | Decode condition |
|
||||||
|
|-------|---------|------------------|
|
||||||
|
| [0, 2 GB) | IOCPU resource | chiplet_offset < 0x8000_0000 |
|
||||||
|
| [2 GB, 1 TB) | UAL | chiplet_offset >= 0x8000_0000 |
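The range split above can be sketched as a single comparison after the MBZ check. `classify_chiplet_offset` is a hypothetical name, and raising `ValueError` on non-zero MBZ bits is one possible behavior, not one the ADR prescribes.

```python
IOCPU_LIMIT = 0x8000_0000   # 2 GB boundary from the decode-condition column
CHIPLET_BITS = 40           # [39:0] chiplet_offset, 1 TB


def classify_chiplet_offset(local_offset: int) -> str:
    """Return "IOCPU" or "UAL" for an IOCHIPLET die-local offset."""
    if not 0 <= local_offset < (1 << 42):
        raise ValueError("local_offset does not fit in 42 bits")
    if local_offset >> CHIPLET_BITS:
        raise ValueError("MBZ bits [41:40] are non-zero")
    return "IOCPU" if local_offset < IOCPU_LIMIT else "UAL"
```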
|
||||||
|
|
||||||
|
##### 2.4.1 IOCPU Region
|
||||||
|
|
||||||
|
```text
|
||||||
|
[30:27] iocpu_sub_unit (4)
|
||||||
|
[26: 0] sub_offset (27) -- 128 MB per slot
|
||||||
|
```
|
||||||
|
|
||||||
|
16 x 128 MB slots. 2 GB active decode.
|
||||||
|
|
||||||
|
| iocpu_sub_unit | Name | Budget |
|
||||||
|
|----------------|------|--------|
|
||||||
|
| 0 | IOCPU_ITCM | 512 KB |
|
||||||
|
| 1 | IOCPU_DTCM | 512 KB |
|
||||||
|
| 2 | IPCQ | 2 MB |
|
||||||
|
| 3 | IOCPU_SFR | 8 KB |
|
||||||
|
| 4 | IO_DMA_SFR | 16 KB |
|
||||||
|
| 5 | IO_SRAM | 64 MB |
|
||||||
|
| 6..15 | Reserved | -- |
|
||||||
|
|
||||||
|
##### 2.4.2 UAL Region
|
||||||
|
|
||||||
|
Sub-layout TBD (separate ADR).
|
||||||
|
|
||||||
|
#### 2.5 Addressing Rules
|
||||||
|
|
||||||
|
1. MBZ bits must be zero. An address with non-zero MBZ bits is
|
||||||
|
**architecturally invalid**. An implementation may raise a decode fault
|
||||||
|
or return an error -- behavior is not prescribed by this ADR.
|
||||||
|
2. Fixed slot sizes are chosen for simple hardware decode; actual
|
||||||
|
implemented capacity may be smaller than the slot.
|
||||||
|
3. Access beyond a sub-unit's implemented budget within a slot is
|
||||||
|
**architecturally invalid** (same policy as MBZ).
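Rule 1 can be illustrated with a small MBZ helper applied to the AHBM die layout. `DecodeFault`, `require_mbz`, and `split_ahbm` are assumed names, and raising an exception is only one of the behaviors the rule permits.

```python
class DecodeFault(ValueError):
    """Hypothetical error type for architecturally invalid addresses."""


def require_mbz(value: int, hi: int, lo: int, what: str) -> None:
    """Raise DecodeFault if any bit in value[hi:lo] is set."""
    width = hi - lo + 1
    if (value >> lo) & ((1 << width) - 1):
        raise DecodeFault(f"{what}: MBZ bits [{hi}:{lo}] are non-zero")


def split_ahbm(local_offset: int) -> tuple[str, int]:
    """Split an AHBM die-local offset into (address space, 37-bit sub-address)."""
    require_mbz(local_offset, 41, 38, "AHBM local_offset")
    sub_address = local_offset & ((1 << 37) - 1)
    if (local_offset >> 37) & 1:
        return ("HBM", sub_address)       # addr_space = 1
    return ("RESOURCE", sub_address)      # addr_space = 0
```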
|
||||||
|
|
||||||
|
### D3. Bitfield decoding is deterministic
|
||||||
|
|
||||||
|
Given an integer address, field extraction (`sip_id`, `die_id`, `kind`,
|
||||||
|
`sub_unit`, `offset`) is purely positional. No runtime state is required.
|
||||||
|
Decoding deterministically maps an integer address to destination domains:
|
||||||
|
`sip_id`, `die_id`, target kind (HBM / PE_LOCAL / MCPU_LOCAL / CUBE_SRAM /
|
||||||
|
IOCPU / UAL).
|
||||||
|
|
||||||
|
### D4. Capacity validation may depend on topology config
|
||||||
|
|
||||||
|
Whether a decoded address falls within **implemented capacity** (e.g.,
|
||||||
|
HBM 96 GB on a specific SKU) is checked against topology parameters
|
||||||
|
provided via DI/config. Decode itself (D3) never consults topology --
|
||||||
|
only validation does. These parameters must live in the topology/config
|
||||||
|
layer, not in node implementations.
|
||||||
|
|
||||||
### D5. Routing consumes decoded domains, not raw bits
|
### D5. Routing consumes decoded domains, not raw bits
|
||||||
|
|
||||||
Routing policy uses decoded domains:
|
Routing policy uses decoded domains:
|
||||||
|
|
||||||
- `src` location (sip/cube/pe or node_id)
|
- `src` location (sip / die / pe or node_id)
|
||||||
- `dst` domains derived from PhysAddr decoding
|
- `dst` domains derived from PhysAddr decoding
|
||||||
- `size_bytes` for size-aware link latency
|
- `size_bytes` for size-aware link latency
|
||||||
Routing must not inspect raw bit-fields directly except inside the decoding module.
|
|
||||||
|
Routing must not inspect raw bit-fields directly except inside the
|
||||||
|
decoding module.
|
||||||
|
|
||||||
## Alternatives Considered
|
## Alternatives Considered
|
||||||
|
|
||||||
1) **Use raw integers everywhere, decode ad-hoc in routing**
|
1. **Keep `rack_id` (4 bits)**: Rejected -- never used in practice,
|
||||||
|
and it consumes 4 bits that are needed to expand local_offset to 42 bits
|
||||||
|
(IOCHIPLET 1 TB).
|
||||||
|
|
||||||
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
|
2. **Uniform 256 GB per die**: Rejected -- IOCHIPLET UAL requires ~1 TB.
|
||||||
|
Freed rack_id bits enable 42-bit local_offset.
|
||||||
|
|
||||||
1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
|
3. **Variable-width die windows (AHBM 256 GB, CHIPLET 1 TB via multi-seg
|
||||||
|
spanning)**: Rejected -- complicates D3 (deterministic decoding).
|
||||||
|
Uniform 4 TB window with MBZ padding is simpler.
|
||||||
|
|
||||||
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
|
4. **Use raw integers everywhere, decode ad-hoc in routing**: Rejected --
|
||||||
|
leads to duplicated logic, inconsistent routing, and hidden
|
||||||
|
assumptions.
|
||||||
|
|
||||||
1) **Put decoding inside memory controllers or routers**
|
5. **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**:
|
||||||
|
Rejected -- violates SPEC R3 and breaks swappability.
|
||||||
|
|
||||||
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
|
6. **Put decoding inside memory controllers or routers**: Rejected --
|
||||||
|
leaks policy into components, violates SPEC R4 / D5.
|
||||||
|
|
||||||
## Consequences
|
## Consequences
|
||||||
|
|
||||||
### Positive
|
### Positive
|
||||||
|
|
||||||
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
|
- Simple hierarchical decoder: SIP -> die -> kind -> sub-unit.
|
||||||
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
|
- Clean separation of memory (HBM) vs local resource (PE/MCPU/SRAM/IOCPU).
|
||||||
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
|
- Deterministic routing domains enable clear test invariants (SPEC R1, R5).
|
||||||
|
- Expandable: 11 reserved die_id slots, reserved resource_kind / sub-unit
|
||||||
|
slots, reserved MBZ bits.
|
||||||
|
- DI-first: decoder can be swapped without changing components (SPEC R4).
|
||||||
|
|
||||||
### Tradeoffs / Costs
|
### Tradeoffs
|
||||||
|
|
||||||
- Requires explicit configuration for any topology-derived sizes.
|
- Sparse address holes due to power-of-2 slot alignment.
|
||||||
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
|
- Large reserved/MBZ regions (intentional for future extension).
|
||||||
|
- Requires explicit configuration for topology-derived sizes (D4).
|
||||||
|
- Introduces a single "blessed" decoding module that must remain stable
|
||||||
|
and well-tested.
|
||||||
|
|
||||||
|
## Supersedes
|
||||||
|
|
||||||
|
- **ADR-0031 (PhysAddr PE-Resource Extension)**: stub status. The
|
||||||
|
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables in D2.3.3-D2.3.5
|
||||||
|
fulfill ADR-0031's stated goals.
|
||||||
|
|
||||||
## Implementation Notes (Non-normative)
|
## Implementation Notes (Non-normative)
|
||||||
|
|
||||||
- Recommended module boundary:
|
- Recommended module: `src/kernbench/policy/address/phyaddr.py`
|
||||||
- `src/kernbench/policy/address/phyaddr.py`
|
- Tests should cover: encode/decode round-trip per kind, MBZ enforcement,
|
||||||
|
die_id dispatch (AHBM / IOCHIPLET / reserved), sub-unit boundary
|
||||||
|
values, backward compatibility of factory APIs.
|
||||||
|
- Factory methods: `hbm_addr`, `pe_hbm_addr`, `pe_tcm_addr`,
|
||||||
|
`cube_sram_addr` retain signatures (minus `rack_id`); `cube_id`
|
||||||
|
parameter renamed to `die_id`.
|
||||||
|
- New factories: `pe_resource_addr`, `mcpu_resource_addr`,
|
||||||
|
`iocpu_resource_addr`, `ual_addr`.
|
||||||
|
|
||||||
- Tests should cover:
|
## Appendix A. Address Examples
|
||||||
- deterministic decoding
|
|
||||||
- local vs remote classification from decoded fields
|
### A.1 AHBM HBM access
|
||||||
- invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
|
|
||||||
|
sip=2, die=5, HBM offset=0x1000
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 2 -> [50:47] = 0b0010
|
||||||
|
die_id = 5 -> [46:42] = 0b00101
|
||||||
|
addr_space = 1 -> [37] = 1 (HBM)
|
||||||
|
hbm_offset = 0x1000 -> [36:0]
|
||||||
|
|
||||||
|
51-bit addr = (2 << 47) | (5 << 42) | (1 << 37) | 0x1000
|
||||||
|
```
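The A.1 composition can be reproduced with a short encoder. `encode_hbm_example` is a hypothetical helper written for this example and is not the `hbm_addr` factory named in the implementation notes.

```python
def encode_hbm_example(sip_id: int, die_id: int, hbm_offset: int) -> int:
    """Compose a 51-bit address for an HBM access on an AHBM die."""
    if not 0 <= sip_id < 16:
        raise ValueError("sip_id must fit in 4 bits")
    if not 0 <= die_id < 16:
        raise ValueError("AHBM dies use die_id 0..15")
    if not 0 <= hbm_offset < (1 << 37):
        raise ValueError("hbm_offset must fit in the 128 GB decode window")
    # addr_space = 1 selects the HBM window at bit [37].
    return (sip_id << 47) | (die_id << 42) | (1 << 37) | hbm_offset


addr = encode_hbm_example(sip_id=2, die_id=5, hbm_offset=0x1000)
print(hex(addr))
```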
|
||||||
|
|
||||||
|
### A.2 AHBM PE_LOCAL -- PE3 PE_TCM, offset=0x400
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 0 -> [50:47] = 0
|
||||||
|
die_id = 0 -> [46:42] = 0
|
||||||
|
addr_space = 0 -> [37] = 0
|
||||||
|
resource_kind = 0 -> [36:34] = 000 (PE_LOCAL)
|
||||||
|
pe_id = 3 -> [32:29] = 0011
|
||||||
|
pe_sub_unit = 6 -> [28:25] = 0110 (PE_TCM)
|
||||||
|
sub_offset = 0x400 -> [24:0]
|
||||||
|
|
||||||
|
local_offset = (0 << 34) | (3 << 29) | (6 << 25) | 0x400
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.3 AHBM MCPU_LOCAL -- MCPU_SRAM, offset=0x0
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 1 -> [50:47] = 0001
|
||||||
|
die_id = 3 -> [46:42] = 00011
|
||||||
|
addr_space = 0 -> [37] = 0
|
||||||
|
resource_kind = 1 -> [36:34] = 001 (MCPU_LOCAL)
|
||||||
|
mcpu_sub_unit = 5 -> [29:25] = 00101 (MCPU_SRAM)
|
||||||
|
sub_offset = 0 -> [24:0] = 0
|
||||||
|
|
||||||
|
local_offset = (1 << 34) | (5 << 25)
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.4 IOCHIPLET -- IOCPU IPCQ, offset=0x20000
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 1 -> [50:47] = 0001
|
||||||
|
die_id = 17 -> [46:42] = 10001 (IOCHIPLET[1])
|
||||||
|
iocpu_sub_unit = 2 -> [30:27] = 0010 (IPCQ)
|
||||||
|
sub_offset = 0x20000 -> [26:0]
|
||||||
|
|
||||||
|
chiplet_offset = (2 << 27) | 0x20000
|
||||||
|
(< 0x8000_0000 -> IOCPU region)
|
||||||
|
```
|
||||||
|
|
||||||
|
### A.5 IOCHIPLET -- UAL region, offset=4 GB
|
||||||
|
|
||||||
|
```text
|
||||||
|
sip_id = 0 -> [50:47] = 0
|
||||||
|
die_id = 16 -> [46:42] = 10000 (IOCHIPLET[0])
|
||||||
|
chiplet_offset = 0x1_0000_0000 (4 GB >= 2 GB -> UAL region)
|
||||||
|
```
|
||||||
|
|
||||||
## Links
|
## Links
|
||||||
|
|
||||||
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
|
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first),
|
||||||
|
R5 (multi-domain comm)
|
||||||
|
- ADR-0031: Superseded
|
||||||
|
|||||||
@@ -67,6 +67,76 @@ Completion semantics:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
### D5. Launch timing is endpoint-synchronized
|
||||||
|
|
||||||
|
All PEs targeted by a single kernel launch MUST begin executing the kernel
|
||||||
|
body at the same simulated time, regardless of their dispatch path length
|
||||||
|
from the launch entry point.
|
||||||
|
|
||||||
|
Rationale. The dispatch tree Host → IO_CPU → M_CPU → PE_CPU has variable
|
||||||
|
latency at every level. PEs near their M_CPU receive the launch earlier
|
||||||
|
than PEs farther away; cubes near an IO_CPU receive it earlier than cubes
|
||||||
|
farther away. Without synchronization, each PE's kernel begins at a
|
||||||
|
different `env.now`, making per-PE metrics such as `pe_exec_ns` a function
|
||||||
|
of dispatch-path geometry rather than of the kernel's behavior —
|
||||||
|
producing measurement artifacts in benchmarks that time kernel-internal
|
||||||
|
waits (for example `tl.recv` on cross-cube or cross-SIP hops).
|
||||||
|
|
||||||
|
Mechanism.
|
||||||
|
|
||||||
|
- `KernelLaunchMsg` carries an optional `target_start_ns: float | None`.
|
||||||
|
- **IO_CPU** is the canonical stamper. On fan-out to M_CPUs, it
|
||||||
|
computes `target_start_ns = env.now + max_latency` where
|
||||||
|
`max_latency` is the maximum, over every target (sip, cube, pe)
|
||||||
|
tuple, of the **two-leg dispatch chain**:
|
||||||
|
|
||||||
|
```
|
||||||
|
max_latency(sip, cube, pe) =
|
||||||
|
compute_path_latency_ns(find_node_path(io_cpu, m_cpu(sip, cube)))
|
||||||
|
+ compute_path_latency_ns(find_node_path(m_cpu(sip, cube), pe_cpu))
|
||||||
|
- io_cpu.overhead_ns
|
||||||
|
- m_cpu.overhead_ns
|
||||||
|
```
|
||||||
|
|
||||||
|
This models the actual dispatch as **two sequential Transactions**
|
||||||
|
(IO_CPU → M_CPU, then M_CPU → PE_CPU). Each leg's
|
||||||
|
`compute_path_latency_ns` adds its endpoints' `overhead_ns`;
|
||||||
|
`io_cpu.overhead_ns` is subtracted because IO_CPU has already
|
||||||
|
paid it before this method runs, and `m_cpu.overhead_ns` is
|
||||||
|
subtracted once because it appears as endpoint of leg1 *and*
|
||||||
|
start of leg2 but is paid only once at run time. A single
|
||||||
|
`find_node_path(io_cpu, pe_cpu)` walk is **not** equivalent —
|
||||||
|
it can pick a graph path that bypasses M_CPU and silently
|
||||||
|
under-shoots the prediction for far cubes, breaking the D5
|
||||||
|
invariant.
|
||||||
|
|
||||||
|
The fanned-out sub-Transactions carry **`nbytes = 0`** for
|
||||||
|
`KernelLaunchMsg` (control message only). Without this,
|
||||||
|
large kernel-launch payloads would occupy fabric BW on the
|
||||||
|
shared first hop and serialize the per-cube dispatch, pushing
|
||||||
|
far M_CPUs past `target_start_ns` and re-introducing the
|
||||||
|
late-arrival violation.
|
||||||
|
- **M_CPU** passes an already-stamped `target_start_ns` through
|
||||||
|
unchanged. Only when the value is absent (e.g. a direct
|
||||||
|
launch-to-M_CPU unit test) does M_CPU compute a per-cube barrier
|
||||||
|
`env.now + max(local command-path latency)`.
|
||||||
|
- **PE_CPU** yields `env.timeout(target_start_ns - env.now)` at the top
|
||||||
|
of `_execute_kernel`, before recording `pe_exec_start` and invoking
|
||||||
|
the kernel body.
|
||||||
|
- When `target_start_ns is None`, PE_CPU falls through to the legacy
|
||||||
|
unsynchronized behavior — preserving backward compatibility.
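The stamping arithmetic in the mechanism above can be sketched with plain numbers. The function names are illustrative, the per-leg latencies are assumed to be precomputed by `compute_path_latency_ns`, and clamping the PE wait at zero is an added assumption (the text only states the `target_start_ns - env.now` timeout).

```python
from typing import Iterable, Optional, Tuple

# One entry per targeted (sip, cube, pe): (leg1_ns, leg2_ns, m_cpu_overhead_ns),
# where leg1 is IO_CPU -> M_CPU and leg2 is M_CPU -> PE_CPU.
Leg = Tuple[float, float, float]


def two_leg_latency_ns(leg: Leg, io_cpu_overhead_ns: float) -> float:
    """Two-leg dispatch latency with the double-counted overheads removed."""
    leg1_ns, leg2_ns, m_cpu_overhead_ns = leg
    return leg1_ns + leg2_ns - io_cpu_overhead_ns - m_cpu_overhead_ns


def stamp_target_start_ns(now_ns: float, legs: Iterable[Leg],
                          io_cpu_overhead_ns: float) -> float:
    """IO_CPU side: barrier time is now plus the slowest dispatch chain."""
    return now_ns + max(two_leg_latency_ns(l, io_cpu_overhead_ns) for l in legs)


def pe_wait_ns(target_start_ns: Optional[float], now_ns: float) -> float:
    """PE_CPU side: how long to wait before starting the kernel body."""
    if target_start_ns is None:   # legacy unsynchronized launch
        return 0.0
    return max(0.0, target_start_ns - now_ns)  # clamp is an assumption
```

With two targets whose chains cost 11 ns and 28 ns, a launch at 100 ns is stamped with 128 ns, so the near PE waits longer than the far one and both start together.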
|
||||||
|
|
||||||
|
IO_CPU-level stamping guarantees every PE across every targeted cube
|
||||||
|
uses the same barrier sim-time, eliminating both the within-cube
|
||||||
|
dispatch-offset artifact *and* the cross-cube offset artifact in
|
||||||
|
multi-cube launches. This models a real-hardware timed-broadcast launch
|
||||||
|
(latency-equalized dispatch tree).
|
||||||
|
|
||||||
|
The synchronization is internal to the engine / IO_CPU / M_CPU / PE_CPU
|
||||||
|
control plane — runtime API and application kernels are unchanged.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Links
|
## Links
|
||||||
|
|
||||||
- SPEC R1, R2, R7, R8
|
- SPEC R1, R2, R7, R8
|
||||||
|
|||||||
@@ -372,24 +372,41 @@ When the receiver frees a slot, the sender must learn about it
|
|||||||
travel through general vc_comm fabric — it uses a **separate fast
|
travel through general vc_comm fabric — it uses a **separate fast
|
||||||
path**, an abstraction of the NVLink / UCIe credit-return wire.
|
path**, an abstraction of the NVLink / UCIe credit-return wire.
|
||||||
|
|
||||||
**Latency** is computed from the **bottleneck BW on the path**, not a
|
**Latency** is computed from the **full path latency** (per-node
|
||||||
magic constant:
|
overhead + edge propagation + drain), not a magic constant:
|
||||||
|
|
||||||
```
|
```
|
||||||
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
|
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
|
||||||
path = router.find_path(self_pe, peer_pe)
|
path = router.find_path(self_pe, peer_pe.pe_dma)
|
||||||
latency = compute_drain_ns(path, credit_size_bytes)
|
latency = compute_path_latency_ns(path, credit_size_bytes)
|
||||||
= credit_size_bytes / bottleneck_bw_on_path
|
= sum(edge.distance_mm * ns_per_mm)
|
||||||
|
+ sum(node_overhead_ns[n] for n in path)
|
||||||
|
+ credit_size_bytes / bottleneck_bw_on_path
|
||||||
```
|
```
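The three-term formula above can be made concrete as follows. This is a sketch under stated assumptions, not the repository's `compute_path_latency_ns`: the `Edge` type, the bytes-per-nanosecond bandwidth unit, and the explicit `ns_per_mm` argument are all illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Edge:  # hypothetical edge record
    distance_mm: float
    bw_bytes_per_ns: float


def path_latency_ns(edges: Sequence[Edge], node_overhead_ns: Sequence[float],
                    nbytes: int, ns_per_mm: float) -> float:
    """Propagation + per-node overhead + drain at the bottleneck bandwidth."""
    propagation = sum(e.distance_mm * ns_per_mm for e in edges)
    overhead = sum(node_overhead_ns)
    bottleneck_bw = min(e.bw_bytes_per_ns for e in edges)
    drain = nbytes / bottleneck_bw
    return propagation + overhead + drain


# A 16-byte credit over two hops: 7.5 ns propagation + 5.0 ns overhead
# + 1.0 ns drain at the 16 B/ns bottleneck.
credit_ns = path_latency_ns(
    edges=[Edge(10.0, 32.0), Edge(5.0, 16.0)],
    node_overhead_ns=[2.0, 1.0, 2.0],
    nbytes=16,
    ns_per_mm=0.5,
)
```

For a 16-byte credit the drain term is small, so the result is dominated by distance and node overhead, which is why in-cube and cross-SIP credits differ.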
|
||||||
|
|
||||||
|
The router auto-appends `.pe_dma` to the source only, so the
|
||||||
|
destination MUST be spelled with the explicit `.pe_dma` suffix or
|
||||||
|
`find_path` raises and the credit silently teleports at zero cost
|
||||||
|
(latent bug fixed alongside this update).
|
||||||
|
|
||||||
|
`tl.recv` blocks on the credit-emit completion (recv yields-from
|
||||||
|
`_delayed_credit_send` rather than spawning it as a fork). This puts
|
||||||
|
the credit-return cost on the receiver's `pe_exec_ns`, modeling the
|
||||||
|
IPCQ control-plane completing the consume-acknowledgement before
|
||||||
|
recv returns to the kernel — the protocol equivalent of a non-posted
|
||||||
|
`tl.store` waiting for an HBM ack on the raw DMA path.
|
||||||
|
|
||||||
That gives us:
|
That gives us:
|
||||||
|
|
||||||
- **Topology-proportional approximation**: an in-cube credit return is
|
- **Topology-proportional approximation**: an in-cube credit return is
|
||||||
automatically faster than a cross-SIP credit return.
|
automatically faster than a cross-SIP credit return.
|
||||||
- **No magic constants**: no arbitrary `ipcq_ctrl_latency_ns`.
|
- **No magic constants**: every nanosecond comes from
|
||||||
|
`compute_path_latency_ns` on the same edge_map and `node_overhead_ns`
|
||||||
|
as data traffic.
|
||||||
- **No deadlock risk**: unlike piggyback, B can issue credit even when
|
- **No deadlock risk**: unlike piggyback, B can issue credit even when
|
||||||
it has no data to send back.
|
it has no data to send back. `peer_credit_store.put` is unbounded.
|
||||||
- **Reuses existing utility**: `ComponentContext.compute_drain_ns`.
|
- **`IPCQ ≥ raw DMA`** for matched physical moves — the credit-emit
|
||||||
|
cost on recv balances the HBM ack-trip cost RAW pays on the sender.
|
||||||
|
|
||||||
#### Component coupling — SimPy Store channel
|
#### Component coupling — SimPy Store channel
|
||||||
|
|
||||||
@@ -420,11 +437,21 @@ fan-out (see `IpcqInitMsg` in D12).
|
|||||||
#### PE_DMA's added responsibility
|
#### PE_DMA's added responsibility
|
||||||
|
|
||||||
When `vc_comm` receives a token, PE_DMA processes it as the following
|
When `vc_comm` receives a token, PE_DMA processes it as the following
|
||||||
**atomic** sequence. **No SimPy yield is allowed between the two steps**
|
sequence: pay the Transaction's terminal BW drain, then atomically
|
||||||
(invariant I6):
|
write data and forward metadata. **No SimPy yield is allowed between
|
||||||
|
the data write and the metadata forward** (invariant I6). The drain
|
||||||
|
yield must sit before the atomic block, not inside it:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def _on_vc_comm_recv(self, env, token):
|
def _on_vc_comm_recv(self, env, txn):
|
||||||
|
    # Pay the terminal BW drain (nbytes / bottleneck_bw stamped by the
|
||||||
|
    # sender PE_DMA). MUST happen before the atomic block so recv only
|
||||||
|
    # wakes after the bytes have "landed".
|
||||||
|
    drain = getattr(txn, "drain_ns", 0.0)
|
||||||
|
    if drain > 0:
|
||||||
|
        yield env.timeout(drain)
|
||||||
|
|
||||||
|
    token = txn.request
|
||||||
    # ── ATOMIC: no yield between these two operations ──
|
    # ── ATOMIC: no yield between these two operations ──
|
||||||
    data = self._memory_store.read(token.src_space, token.src_addr,
|
    data = self._memory_store.read(token.src_space, token.src_addr,
|
||||||
                                   shape=..., dtype=...)
|
                                   shape=..., dtype=...)
|
||||||
@@ -439,6 +466,33 @@ The final `put` is yieldable but uses an unbounded internal store, so
|
|||||||
it completes in a single step. That `put` is the closing call of the
|
it completes in a single step. That `put` is the closing call of the
|
||||||
atomic block; nothing may be inserted before it.
|
atomic block; nothing may be inserted before it.
|
||||||
|
|
||||||
|
#### Drain-at-inbound semantics (D9 timing model)
|
||||||
|
|
||||||
|
The Transaction carries `drain_ns = nbytes / bottleneck_bw_on_path`
|
||||||
|
stamped at send-side PE_DMA. In this simulator per-hop `overhead_ns`
|
||||||
|
is paid at each forwarding component via `run()`, and the remaining
|
||||||
|
BW drain is paid once at the Transaction's terminal. Every non-IPCQ
|
||||||
|
Transaction (raw DMA, kernel-launch fanout, etc.) pays this drain via
|
||||||
|
`ComponentBase._forward_txn` at the terminal node. For IPCQ the
|
||||||
|
destination PE_DMA intercepts the Transaction with `_handle_ipcq_inbound`
|
||||||
|
(so IPCQ-specific data write + metadata forward can happen), so **the
|
||||||
|
drain MUST be paid explicitly at the top of that handler** to keep
|
||||||
|
IPCQ's timing model on par with every other fabric Transaction.
|
||||||
|
|
||||||
|
Side-effects of paying drain here:
|
||||||
|
|
||||||
|
- **SRC `tl.send`** is unchanged — fire-and-forget semantics are
|
||||||
|
preserved because the sender PE_DMA does not `yield sub_done`. The
|
||||||
|
`sub_done.succeed()` call (made after metadata forward below) is an
|
||||||
|
event with no listener on the sender side.
|
||||||
|
- **DST `tl.recv`** unblocks `drain_ns` later. Since recv wakes only
|
||||||
|
when `IpcqMetaArrival` reaches its local PE_IPCQ, and the metadata
|
||||||
|
forward now happens after the drain, recv observes the full fabric
|
||||||
|
transfer time including bandwidth cost.
|
||||||
|
|
||||||
|
This matches the physical picture: send dispatches and leaves; recv waits
|
||||||
|
until the bytes have actually been drained into its inbox.
|
||||||
|
|
||||||
### D9.5. ADR-0020 (2-pass) integration
|
### D9.5. ADR-0020 (2-pass) integration
|
||||||
|
|
||||||
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
|
`tl.send` / `tl.recv` integrates with ADR-0020's two-pass model. Phase
|
||||||
|
|||||||
@@ -365,23 +365,39 @@ Unlike the piggyback model on the data path, credit return skips the general vc_comm fabri
|
|||||||
and is handled over a **separate fast path**. This abstracts the NVLink/UCIe
|
and is handled over a **separate fast path**. This abstracts the NVLink/UCIe
|
||||||
credit-return fast path of real hardware.
|
credit-return fast path of real hardware.
|
||||||
|
|
||||||
**Latency calculation**: derived from the **bottleneck BW of the routing path**,
|
**Latency calculation**: derived from the **full path latency** of the routing
|
||||||
not from a magic constant.
|
path (per-node overhead + edge propagation + drain),
|
||||||
|
not from a magic constant.
|
||||||
|
|
||||||
```
|
```
|
||||||
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
|
credit_size_bytes = 16 (ccl.yaml: ipcq_credit_size_bytes)
|
||||||
path = router.find_path(self_pe, peer_pe)
|
path = router.find_path(self_pe, peer_pe.pe_dma)
|
||||||
latency = compute_drain_ns(path, credit_size_bytes)
|
latency = compute_path_latency_ns(path, credit_size_bytes)
|
||||||
= credit_size_bytes / bottleneck_bw_on_path
|
= sum(edge.distance_mm * ns_per_mm)
|
||||||
|
+ sum(node_overhead_ns[n] for n in path)
|
||||||
|
+ credit_size_bytes / bottleneck_bw_on_path
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The router auto-appends `.pe_dma` to the source only, so the destination MUST
|
||||||
|
spell out the `.pe_dma` suffix explicitly. Otherwise `find_path` raises and
|
||||||
|
the credit silently teleports at zero cost, a latent bug (fixed in
|
||||||
|
this update).
|
||||||
|
|
||||||
|
`tl.recv` waits for credit-emit completion via yield-from (previously it
|
||||||
|
forked it with `env.process`). This charges the credit-return cost to the receiver's
|
||||||
|
`pe_exec_ns`, so recv returns to the kernel only after the IPCQ control plane
|
||||||
|
has completed the consume-acknowledgement; this is the protocol-level equivalent of a non-posted
|
||||||
|
`tl.store` on the RAW DMA path waiting for the HBM ack trip.
|
||||||
|
|
||||||
This gives us:
|
This gives us:
|
||||||
- **Topology-proportional approximation**: an in-cube credit return and a cross-SIP credit return
|
- **Topology-proportional approximation**: an in-cube credit return and a cross-SIP credit return
|
||||||
automatically get different latencies (not exact, but more meaningful than a magic constant)
|
automatically get different latencies
|
||||||
- **No magic constants**: no arbitrary value such as a separate `ipcq_ctrl_latency_ns` is needed
|
- **No magic constants**: every ns value comes from `compute_path_latency_ns`,
|
||||||
- **No deadlock risk**: unlike piggyback, credit is issued automatically even when B
|
computed from the same edge_map and `node_overhead_ns` as data traffic
|
||||||
has no data to send to A
|
- **No deadlock risk**: `peer_credit_store.put` is unbounded; credit is issued automatically
|
||||||
- **Reuses existing utility**: uses `ComponentContext.compute_drain_ns` as-is
|
even when B has no data to send to A
|
||||||
|
- **`IPCQ ≥ raw DMA`** guaranteed: for matched physical moves, the credit-emit cost
|
||||||
|
balances RAW's ack-trip cost
|
||||||
|
|
||||||
```
|
```
|
||||||
PE B: tl.recv(W) → takes the data → my_tail++
|
PE B: tl.recv(W) → takes the data → my_tail++
|
||||||
@@ -426,11 +442,22 @@ during the IpcqInitMsg fan-out at backend init, the bidirectional fast path channel is also
|
|||||||
|
|
||||||
#### Additional PE_DMA responsibilities
|
#### Additional PE_DMA responsibilities
|
||||||
|
|
||||||
On receiving a token, PE_DMA (vc_comm) processes it with the following atomic sequence.
|
On receiving a token, PE_DMA (vc_comm) processes it in the following sequence: it first pays the
|
||||||
**There must be no SimPy yield between the two operations** (see the I6 MUST rule):
|
BW drain of the Transaction terminal, then atomically performs the data write +
|
||||||
|
metadata forward. **There must be no SimPy yield between the data write and the metadata
|
||||||
|
forward** (see the I6 MUST rule). The drain yield must sit before the atomic
|
||||||
|
section, not inside it:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
def _on_vc_comm_recv(self, env, token):
|
def _on_vc_comm_recv(self, env, txn):
|
||||||
|
    # Pay the drain_ns (= nbytes / bottleneck_bw) stamped by the sender PE_DMA
|
||||||
|
    # here. This must come before the atomic section, because recv must
|
||||||
|
    # wake only after the bytes have "arrived".
|
||||||
|
    drain = getattr(txn, "drain_ns", 0.0)
|
||||||
|
    if drain > 0:
|
||||||
|
        yield env.timeout(drain)
|
||||||
|
|
||||||
|
    token = txn.request
|
||||||
    # ── ATOMIC: no yield between the two operations ──
|
    # ── ATOMIC: no yield between the two operations ──
|
||||||
    # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
|
    # 1. write data to dst_addr (the dst memory space is token.dst_endpoint.buffer_kind)
|
||||||
    data = self._memory_store.read(token.src_space, token.src_addr,
|
    data = self._memory_store.read(token.src_space, token.src_addr,
|
||||||
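@@ -446,6 +473,32 @@ uses a store with unbounded capacity as the wire, so it completes immediately (
|
|||||||
single-step). This final put is the end of the atomic section, and no other yield may
|
single-step). This final put is the end of the atomic section, and no other yield may
|
||||||
be inserted before it.
|
be inserted before it.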
|
||||||
|
|
||||||
|
#### Drain-at-inbound semantics (D9 timing model)
|
||||||
|
|
||||||
|
A Transaction enters the fabric with `drain_ns = nbytes / bottleneck_bw_on_path`
|
||||||
|
already stamped by the sender PE_DMA. In this simulator, the per-hop `overhead_ns`
|
||||||
|
is paid in each forwarding component's `run()`, and the remaining BW drain is
|
||||||
|
paid once at the Transaction's terminal node. For every non-IPCQ
|
||||||
|
Transaction (raw DMA, kernel-launch fanout, etc.),
|
||||||
|
`ComponentBase._forward_txn` pays this drain at the terminal. For IPCQ,
|
||||||
|
the destination PE_DMA intercepts the Transaction in its `_handle_ipcq_inbound` handler
|
||||||
|
(because it must perform the IPCQ-specific data write + metadata forward), so
|
||||||
|
**the drain must be paid explicitly at the top of this handler**. Only then does the IPCQ
|
||||||
|
timing model line up with every other fabric Transaction.
|
||||||
|
|
||||||
|
Side effects of paying the drain here:
|
||||||
|
|
||||||
|
- **SRC `tl.send`**: behavior unchanged. The sender PE_DMA does not `yield` on `sub_done`,
|
||||||
|
 so fire-and-forget semantics are preserved. The `sub_done.succeed()` called after
|
||||||
|
 the metadata forward is, from the sender's point of view, an event with no listener.
|
||||||
|
- **DST `tl.recv`**: wakes up `drain_ns` later. recv is woken only when the local PE_IPCQ
|
||||||
|
 receives `IpcqMetaArrival`, and because the metadata forward has moved to after the
|
||||||
|
 drain, recv now observes the full fabric transfer time,
|
||||||
|
 bandwidth included.
|
||||||
|
|
||||||
|
This matches the physical picture: send dispatches and returns immediately; recv waits until the bytes have actually
|
||||||
|
drained into its own inbox.
|
||||||
|
|
||||||
#### Backpressure latency accuracy
|
#### Backpressure latency accuracy
|
||||||
|
|
||||||
Time until backpressure is released:
|
Time until backpressure is released:
|
||||||
|
|||||||
@@ -2,7 +2,14 @@
|
|||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
Proposed (Revision 8 — Hierarchical content split out to ADR-0029)
|
Accepted. rank = SIP process-group model stands. The allreduce algorithm
|
||||||
|
path (mapper / validator / per-PE install machinery originally targeted at
|
||||||
|
ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
|
||||||
|
`configure_sfr_intercube_multisip` at `init_process_group` time and the
|
||||||
|
intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
|
||||||
|
sip_topo_h)` appended after the module's `kernel_args()`. The
|
||||||
|
`leader_only` / `all_pes` mapper concepts in this document are no longer
|
||||||
|
used by the default allreduce path.
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
|
|||||||
@@ -89,7 +89,14 @@ direction_idx × bytes_per_direction). Therefore:
|
|||||||
`src/kernbench/ccl/install.py`:
|
`src/kernbench/ccl/install.py`:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
_OPPOSITE_DIR = {"E": "W", "W": "E", "N": "S", "S": "N"}
|
# Extended in ADR-0032 with global_* pairs for inter-SIP directions,
|
||||||
|
# which were introduced by configure_sfr_intercube_multisip to keep
|
||||||
|
# intercube (N/S/E/W) and inter-SIP (global_N/S/E/W) namespaces disjoint.
|
||||||
|
_OPPOSITE_DIR = {
|
||||||
|
"E": "W", "W": "E", "N": "S", "S": "N",
|
||||||
|
"global_E": "global_W", "global_W": "global_E",
|
||||||
|
"global_N": "global_S", "global_S": "global_N",
|
||||||
|
}
|
||||||
|
|
||||||
def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
|
def reverse_direction(my_rank: int, peer_rank: int, my_dir: str) -> str | None:
|
||||||
"""Find peer's direction that reciprocates my_dir→peer_rank.
|
"""Find peer's direction that reciprocates my_dir→peer_rank.
|
||||||
|
|||||||
@@ -2,7 +2,9 @@
|
|||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
Proposed
|
Superseded by ADR-0032 (Intercube all-reduce). The 3-level kernel and
|
||||||
|
`hierarchical_allreduce.py` module have been removed. The cube-mesh
|
||||||
|
intercube + inter-SIP path is now the single all-reduce algorithm.
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
|
|||||||
@@ -2,7 +2,11 @@
|
|||||||
|
|
||||||
## Status
|
## Status
|
||||||
|
|
||||||
Stub (Blocker for ADR-0030 — specific range allocations TBD)
|
Superseded by ADR-0001 (Revision 2, 2026-04-27).
|
||||||
|
PE_LOCAL / MCPU_LOCAL / CUBE_SRAM sub-unit tables are now defined in
|
||||||
|
ADR-0001 D2.3.3-D2.3.5.
|
||||||
|
|
||||||
|
Previous status: Stub (Blocker for ADR-0030 — specific range allocations TBD)
|
||||||
|
|
||||||
## Context
|
## Context
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,256 @@
|
|||||||
|
# ADR-0032: Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted (supersedes ADR-0029).
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
### Goal
|
||||||
|
|
||||||
|
Define a single all-reduce algorithm that exploits the topology hierarchy:
|
||||||
|
cube mesh within each SIP (intercube) + inter-SIP exchange. One kernel,
|
||||||
|
one SFR configuration path, driven by `topology.yaml` and `ccl.yaml`.
|
||||||
|
|
||||||
|
### Why replace ADR-0029 (hierarchical 3-level)
|
||||||
|
|
||||||
|
ADR-0029 proposed a 3-level (intra-cube → inter-cube → inter-SIP) algorithm
|
||||||
|
where every PE in the system participates. In practice this adds the
|
||||||
|
intra-cube PE-to-PE stage complexity (bidirectional reduce + chain broadcast)
|
||||||
|
without matching the common workload pattern where the tensor is sharded
|
||||||
|
**per cube** (not per PE within a cube).
|
||||||
|
|
||||||
|
Moreover, the hierarchical design required:
|
||||||
|
- per-PE neighbor graph installation (`_build_pe_installs` multi-level)
|
||||||
|
- multi-level topology schema (`hierarchical_3level`)
|
||||||
|
- `all_pes` mapper + `multi_pe_sip_local` validator infrastructure
|
||||||
|
|
||||||
|
The intercube algorithm below removes all of that: **pe0-only same-lane
|
||||||
|
intercube reduce on the 4×4 cube mesh**, then inter-SIP exchange on the
|
||||||
|
root cube, then broadcast back. Simpler kernel, simpler wiring, same
|
||||||
|
bandwidth characteristics for the common per-cube DP workload.
|
||||||
|
|
||||||
|
### Current state
|
||||||
|
|
||||||
|
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
|
||||||
|
- `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
|
||||||
|
- `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
|
||||||
|
automatically at `init_process_group` time.
|
||||||
|
- Old `ring_allreduce`, `mesh_allreduce`, `tree_allreduce`,
|
||||||
|
`hierarchical_allreduce` modules and their tests are **removed**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Algorithm structure — 5 phases
|
||||||
|
|
||||||
|
For each SIP (launched concurrently by `mp.spawn`):
|
||||||
|
|
||||||
|
```
|
||||||
|
Phase 1 — Row reduce W → E (cube mesh, pe0 only):
|
||||||
|
col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
|
||||||
|
|
||||||
|
Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
|
||||||
|
row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
|
||||||
|
holds the full SIP sum.
|
||||||
|
|
||||||
|
Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
|
||||||
|
Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
|
||||||
|
selected by sip_topo_kind (from topology.yaml sips.topology).
|
||||||
|
|
||||||
|
Phase 4 — Col broadcast S → N on rightmost column.
|
||||||
|
|
||||||
|
Phase 5 — Row broadcast E → W across the cube mesh.
|
||||||
|
```
|
||||||
|
|
||||||
|
After all phases every cube's pe0 holds the global sum.
|
||||||
|
|
||||||
|
The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
|
||||||
|
(ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
|
||||||
|
across topologies; only phase 3 branches. Helper functions
|
||||||
|
`_inter_sip_ring`, `_inter_sip_torus_2d`, `_inter_sip_mesh_2d` encode the
|
||||||
|
three exchange patterns.
|
||||||
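The five phases can be sketched as a plain-Python model over per-cube scalars. This is only an illustration under stated assumptions: cubes are numbered row-major, each cube's pe0 holds one number instead of a tensor shard, and phase 3 is collapsed into a simple sum across SIPs instead of the ring / torus / mesh exchange. The function names are hypothetical and are not the kernel's.

```python
def root_cube(mesh_w, mesh_h):
    """SE-corner cube id under row-major numbering."""
    return (mesh_h - 1) * mesh_w + (mesh_w - 1)


def intercube_allreduce_model(sips, mesh_w=4, mesh_h=4):
    """sips: one list per SIP holding mesh_w * mesh_h per-cube values (pe0 only)."""
    last = mesh_w - 1
    root = root_cube(mesh_w, mesh_h)
    work = [list(values) for values in sips]

    for v in work:
        # Phase 1: row reduce W -> E; the rightmost column ends up with row sums.
        for r in range(mesh_h):
            for c in range(1, mesh_w):
                v[r * mesh_w + c] += v[r * mesh_w + c - 1]
        # Phase 2: column reduce N -> S on the rightmost column; root holds the SIP sum.
        for r in range(1, mesh_h):
            v[r * mesh_w + last] += v[(r - 1) * mesh_w + last]

    # Phase 3: inter-SIP exchange on the root cube (modelled here as a plain sum).
    total = sum(v[root] for v in work)

    for v in work:
        v[root] = total
        # Phase 4: column broadcast S -> N on the rightmost column.
        for r in range(mesh_h - 2, -1, -1):
            v[r * mesh_w + last] = v[(r + 1) * mesh_w + last]
        # Phase 5: row broadcast E -> W across the mesh.
        for r in range(mesh_h):
            for c in range(mesh_w - 2, -1, -1):
                v[r * mesh_w + c] = v[r * mesh_w + c + 1]
    return work


result = intercube_allreduce_model([list(range(16)), [1] * 16])
```

In this model every entry of `result` ends up equal to the global sum of all inputs, matching the statement that every cube's pe0 holds the global sum after phase 5.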
|
|
||||||
|
### D2. Tensor layout (rank = SIP, per-worker)
|
||||||
|
|
||||||
|
Per ADR-0024 rank = SIP at the process-group level. Each worker allocates
|
||||||
|
its own cube-mesh-spanning tensor:
|
||||||
|
|
||||||
|
```python
|
||||||
|
dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=16, num_pes=1)
|
||||||
|
tensor = torch.zeros((n_cubes, n_elem), dtype="f16", dp=dp)
|
||||||
|
```
|
||||||
|
|
||||||
|
Shard layout: 16 shards per SIP, one per cube on pe0. The kernel addresses
|
||||||
|
each cube's shard as `pe_addr = t_ptr + cube_id * n_elem * 2`.
|
||||||
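As a small worked example of the shard addressing rule, assuming the factor `2` is the f16 element size in bytes and using an arbitrary base pointer chosen only for illustration:

```python
F16_BYTES = 2  # assumed meaning of the "* 2" factor: bytes per f16 element


def shard_addr(t_ptr, cube_id, n_elem, elem_bytes=F16_BYTES):
    """Byte address of the shard that belongs to cube_id."""
    return t_ptr + cube_id * n_elem * elem_bytes


base = 0x1000  # hypothetical base pointer
addrs = [shard_addr(base, cube, n_elem=8) for cube in range(16)]
stride = addrs[1] - addrs[0]
bytes_per_sip = addrs[-1] + stride - base
```

For `n_elem = 8` this gives a 16-byte stride per cube and 256 bytes for the 16 shards of one SIP.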
|
|
||||||
|
### D3. SFR / IPCQ wiring — `configure_sfr_intercube_multisip`
|
||||||
|
|
||||||
|
Replaces the rank-to-2-PE install from ADR-0024. Wires PE_IPCQ neighbor
|
||||||
|
tables for **every cube's pe0 across every SIP** — regardless of which
|
||||||
|
cube is the root or which SIP topology is selected. This lets the kernel
|
||||||
|
elect the root cube at runtime and supports topology switches without
|
||||||
|
re-wiring.
|
||||||
|
|
||||||
|
| Level | Direction labels | Scope |
|
||||||
|
|---|---|---|
|
||||||
|
| Intercube within SIP | N / S / E / W | pe0 of every cube → pe0 of mesh neighbors (no wrap) |
|
||||||
|
| Inter-SIP (all cubes) | global_E / global_W / global_N / global_S | pe0 of cube c on sip A → pe0 of cube c on peer SIP per `sips.topology` |
|
||||||
|
|
||||||
|
Inter-SIP directions use the `global_*` prefix to keep the namespace
|
||||||
|
disjoint from intercube directions. ADR-0025's `_OPPOSITE_DIR` is extended
|
||||||
|
with `global_E ↔ global_W` and `global_N ↔ global_S` so the reverse-
|
||||||
|
direction resolver handles 2-SIP bidirectional rings correctly.
|
||||||
|
|
||||||
|
Internally the function calls `install_ipcq` with:
|
||||||
|
- `world_size = n_sips × n_cubes`
|
||||||
|
- `rank_to_pe = [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]`
|
||||||
|
- A closure-captured `neighbors()` function that builds the map above.
|
||||||
|
|
||||||
|
This `world_size` is internal to IPCQ wiring and does not leak to the
|
||||||
|
process-group rank.
|
||||||
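The wiring inputs can be sketched as follows. This is an illustration, not the body of `configure_sfr_intercube_multisip`: the helper names are hypothetical, cubes are assumed to be numbered row-major with E as increasing column and S as increasing row, and the ring orientation (`global_E` pointing to the next SIP) is an assumption.

```python
def build_rank_to_pe(n_sips, n_cubes):
    """Internal IPCQ rank -> (sip, cube, pe) with pe0 only."""
    return [(sip, cube, 0) for sip in range(n_sips) for cube in range(n_cubes)]


def intercube_neighbors(cube, mesh_w, mesh_h):
    """N/S/E/W neighbours of a cube on a mesh without wrap-around."""
    row, col = divmod(cube, mesh_w)
    out = {}
    if col + 1 < mesh_w:
        out["E"] = cube + 1
    if col > 0:
        out["W"] = cube - 1
    if row > 0:
        out["N"] = cube - mesh_w
    if row + 1 < mesh_h:
        out["S"] = cube + mesh_w
    return out


def ring_sip_neighbors(sip, n_sips):
    """global_E / global_W peers on a 1D ring of SIPs (assumed orientation)."""
    return {"global_E": (sip + 1) % n_sips, "global_W": (sip - 1) % n_sips}


rank_to_pe = build_rank_to_pe(n_sips=2, n_cubes=16)
corner_nw = intercube_neighbors(0, 4, 4)
corner_se = intercube_neighbors(15, 4, 4)
two_sip_ring = ring_sip_neighbors(0, 2)
```

With two SIPs both `global_E` and `global_W` resolve to the same peer, which is the case where disjoint direction names and the extended `_OPPOSITE_DIR` matter.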
|
|
||||||
|
### D4. SIP topology — from `topology.yaml`
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
system:
|
||||||
|
  sips:
|
||||||
|
    count: 2
|
||||||
|
    topology: ring_1d  # or torus_2d, mesh_2d_no_wrap
|
||||||
|
```
|
||||||
|
|
||||||
|
- `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
|
||||||
|
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
|
||||||
|
`global_E/W` then col ring on `global_S/N`.
|
||||||
|
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
|
||||||
|
broadcast per dimension.
|
||||||
|
|
||||||
|
2D variants require `n_sips` to be a perfect square.
|
||||||
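A hedged sketch of how the topology name and SIP count might be turned into `(sip_topo_kind, sip_topo_w, sip_topo_h)`. The kind codes follow the order given in D1; the function name and the choice of `(n_sips, 1)` for the ring dimensions are assumptions for this example.

```python
import math

TOPO_NAME_TO_KIND = {"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}


def derive_sip_topo(name, n_sips):
    """Return (kind, width, height); 2D variants require a perfect square."""
    kind = TOPO_NAME_TO_KIND[name]
    if name == "ring_1d":
        return kind, n_sips, 1  # assumed convention for the 1D case
    side = math.isqrt(n_sips)
    if side * side != n_sips:
        raise ValueError(f"{name} requires n_sips to be a perfect square, got {n_sips}")
    return kind, side, side


ring = derive_sip_topo("ring_1d", 2)
torus = derive_sip_topo("torus_2d", 4)
```

Rejecting non-square counts up front keeps the failure at configuration time instead of inside the kernel's phase 3 branch.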
|
|
||||||
|
### D5. Process-group integration — `AhbmCCLBackend`
|
||||||
|
|
||||||
|
At `init_process_group` time the backend:
|
||||||
|
|
||||||
|
1. Loads `ccl.yaml` + `topology.yaml`.
|
||||||
|
2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
|
||||||
|
`system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
|
||||||
|
3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
|
||||||
|
SFR wiring, mirrors NCCL communicator creation.
|
||||||
|
|
||||||
|
At each `dist.all_reduce(tensor)` call:
|
||||||
|
|
||||||
|
1. Resolves `kernel_fn` from `cfg["module"]`.
|
||||||
|
2. Builds args: `(n_elem, cube_w, cube_h, n_sips)` from
|
||||||
|
`kernel_args(world_size, n_elem)`.
|
||||||
|
3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
|
||||||
|
`sip_rank` is the current greenlet's bound rank.
|
||||||
|
4. Launches with `_defer_wait=True`; the main scheduler drains pending
|
||||||
|
handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
|
||||||
|
|
||||||
|
### D6. Config schema
|
||||||
|
|
||||||
|
`ccl.yaml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
defaults:
|
||||||
|
  algorithm: intercube_allreduce
|
||||||
|
  buffer_kind: tcm
|
||||||
|
  ...
|
||||||
|
|
||||||
|
algorithms:
|
||||||
|
  intercube_allreduce:
|
||||||
|
    module: kernbench.ccl.algorithms.intercube_allreduce
|
||||||
|
    topology: none
|
||||||
|
    buffer_kind: tcm
|
||||||
|
    n_elem: 8
|
||||||
|
    root_cube: 15
|
||||||
|
```
|
||||||
|
|
||||||
|
`topology.yaml`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
system:
|
||||||
|
  sips:
|
||||||
|
    count: 2
|
||||||
|
    topology: ring_1d
|
||||||
|
  sip:
|
||||||
|
    cube_mesh: { w: 4, h: 4 }
|
||||||
|
```
|
||||||
|
|
||||||
|
### D7. Algorithm module contract
|
||||||
|
|
||||||
|
Modules loaded via `cfg["module"]` must export:
|
||||||
|
|
||||||
|
| Name | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `kernel` | callable, signature `(t_ptr, n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl)` |
|
||||||
|
| `kernel_args(world_size, n_elem) -> tuple` | returns the first 4 scalar args (per-tensor) |
|
||||||
|
| `TOPO_NAME_TO_KIND: dict[str, int]` | maps `system.sips.topology` name to kernel branch code |
|
||||||
|
| `SIP_TOPO_RING`, `SIP_TOPO_TORUS`, `SIP_TOPO_MESH` | integer constants (0, 1, 2) |
|
||||||
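The contract in the table, together with the argument assembly from D5, can be sketched as below. The helper names and the stand-in module are hypothetical, and it is an assumption that `world_size` equals the SIP count and that `t_ptr` and `tl` are supplied by the launcher rather than by this tuple.

```python
from types import SimpleNamespace

REQUIRED_EXPORTS = (
    "kernel",
    "kernel_args",
    "TOPO_NAME_TO_KIND",
    "SIP_TOPO_RING",
    "SIP_TOPO_TORUS",
    "SIP_TOPO_MESH",
)


def missing_exports(module):
    """Names from the contract table that the module does not provide."""
    return [name for name in REQUIRED_EXPORTS if not hasattr(module, name)]


def build_launch_args(module, world_size, n_elem, sip_rank, topo_name, topo_w, topo_h):
    """Per-tensor scalars from kernel_args, then the four per-SIP scalars."""
    per_tensor = tuple(module.kernel_args(world_size, n_elem))
    kind = module.TOPO_NAME_TO_KIND[topo_name]
    return per_tensor + (sip_rank, kind, topo_w, topo_h)


# Stand-in for an algorithm module; values are illustrative only.
fake_module = SimpleNamespace(
    kernel=lambda *args: None,
    kernel_args=lambda world_size, n_elem: (n_elem, 4, 4, world_size),
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2},
    SIP_TOPO_RING=0,
    SIP_TOPO_TORUS=1,
    SIP_TOPO_MESH=2,
)

launch_args = build_launch_args(fake_module, 2, 8, 1, "ring_1d", 2, 1)
```

Checking `missing_exports` before launch gives a clear error for a module that does not satisfy the contract, instead of an attribute error deep in the backend.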
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies
|
||||||
|
|
||||||
|
- **ADR-0023**: IPCQ protocol (neighbor table, send/recv, credit return).
|
||||||
|
- **ADR-0024**: rank = SIP launcher, `mp.spawn`, greenlet-local rank.
|
||||||
|
- **ADR-0025**: Address-based IPCQ direction matching; extended
|
||||||
|
`_OPPOSITE_DIR` with `global_*` pairs.
|
||||||
|
- **ADR-0027**: Worker-wait / collective-pending drain in main scheduler.
|
||||||
|
|
||||||
|
## Non-goals
|
||||||
|
|
||||||
|
- **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
|
||||||
|
workload for this algorithm is per-cube DP.
|
||||||
|
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
|
||||||
|
`mesh_2d_no_wrap` require `n_sips = k²`.
|
||||||
|
- **Pipelined chunks**: single-tile per cube, no pipelining yet.
|
||||||
|
- **Root cube runtime election**: the kernel currently uses
|
||||||
|
`root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
|
||||||
|
corner. SFR wiring covers all cubes, so runtime election is a pure kernel
|
||||||
|
change when needed.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
### Positive
|
||||||
|
|
||||||
|
- **Single kernel, single install path** for all-reduce — replaces four
|
||||||
|
removed modules (`ring`, `mesh`, `tree`, `hierarchical`).
|
||||||
|
- **Topology-agnostic kernel**: ring / torus / mesh selected via one
|
||||||
|
integer param, no kernel duplication.
|
||||||
|
- **Automatic via `dist.all_reduce`**: no bench-level or user-level
|
||||||
|
algorithm selection needed; config-driven end-to-end.
|
||||||
|
- **Full SFR wiring**: every cube on every SIP has inter-SIP links
|
||||||
|
available — supports future dynamic root-cube election.
|
||||||
|
|
||||||
|
### Negative
|
||||||
|
|
||||||
|
- **Not suitable for per-PE sharded tensors**: TP-layer-style tensors that
|
||||||
|
shard within one cube across 8 PEs are not addressable by this kernel.
|
||||||
|
Such workloads would need a separate intra-cube all-reduce path (not
|
||||||
|
yet implemented).
|
||||||
|
- **`configure_sfr_intercube_multisip` always wires all pe0s**: even if a
|
||||||
|
given run only needs a subset (e.g. 1 SIP, ring only). Install cost is
|
||||||
|
small but not zero.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Affected files
|
||||||
|
|
||||||
|
| File | Change |
|
||||||
|
|---|---|
|
||||||
|
| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
|
||||||
|
| `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
|
||||||
|
| `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
|
||||||
|
| `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
|
||||||
|
| `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
|
||||||
|
| `ccl.yaml` | Single `intercube_allreduce` entry |
|
||||||
|
| `topology.yaml` | Added `system.sips.topology` |
|
||||||
|
| `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
|
||||||
|
| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
|
||||||
|
| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
|
||||||
|
| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
|
||||||
|
| Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
|
||||||
|
hbm,torus_2d,6,128,256,2002.0399999999827
|
||||||
|
hbm,torus_2d,6,1024,2048,3541.0399999999827
|
||||||
|
hbm,torus_2d,6,8192,16384,15889.03999999999
|
||||||
|
hbm,torus_2d,6,32768,65536,58225.03999999998
|
||||||
|
sram,torus_2d,6,128,256,1762.0399999999827
|
||||||
|
sram,torus_2d,6,1024,2048,2293.0399999999827
|
||||||
|
sram,torus_2d,6,8192,16384,6577.039999999986
|
||||||
|
sram,torus_2d,6,32768,65536,21265.03999999992
|
||||||
|
tcm,torus_2d,6,128,256,1678.0399999999827
|
||||||
|
tcm,torus_2d,6,1024,2048,1957.0399999999827
|
||||||
|
tcm,torus_2d,6,8192,16384,4225.039999999986
|
||||||
|
tcm,torus_2d,6,32768,65536,12001.03999999992
|
||||||
|
|
@@ -0,0 +1,37 @@
|
|||||||
|
algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2626.302499999998
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2634.7399999999952
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2645.9899999999925
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,2668.489999999987
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,2812.489999999987
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3010.489999999987
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,3406.489999999987
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,4198.489999999965
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,5782.489999999969
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,8950.489999999925
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,15286.48999999986
|
||||||
|
intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,21622.489999999932
|
||||||
|
intercube_allreduce,ring_1d,6,8,16,256,2302.9849999999933
|
||||||
|
intercube_allreduce,ring_1d,6,32,64,1024,2310.8599999999906
|
||||||
|
intercube_allreduce,ring_1d,6,64,128,2048,2321.359999999988
|
||||||
|
intercube_allreduce,ring_1d,6,128,256,4096,2342.3599999999824
|
||||||
|
intercube_allreduce,ring_1d,6,512,1024,16384,2479.3599999999824
|
||||||
|
intercube_allreduce,ring_1d,6,1024,2048,32768,2669.3599999999824
|
||||||
|
intercube_allreduce,ring_1d,6,2048,4096,65536,3049.3599999999824
|
||||||
|
intercube_allreduce,ring_1d,6,4096,8192,131072,3809.3599999999715
|
||||||
|
intercube_allreduce,ring_1d,6,8192,16384,262144,5329.359999999979
|
||||||
|
intercube_allreduce,ring_1d,6,16384,32768,524288,8369.35999999992
|
||||||
|
intercube_allreduce,ring_1d,6,32768,65536,1048576,14449.359999999899
|
||||||
|
intercube_allreduce,ring_1d,6,49152,98304,1572864,20529.35999999997
|
||||||
|
intercube_allreduce,torus_2d,6,8,16,256,1644.2899999999936
|
||||||
|
intercube_allreduce,torus_2d,6,32,64,1024,1651.0399999999909
|
||||||
|
intercube_allreduce,torus_2d,6,64,128,2048,1660.0399999999881
|
||||||
|
intercube_allreduce,torus_2d,6,128,256,4096,1678.0399999999827
|
||||||
|
intercube_allreduce,torus_2d,6,512,1024,16384,1795.0399999999827
|
||||||
|
intercube_allreduce,torus_2d,6,1024,2048,32768,1957.0399999999827
|
||||||
|
intercube_allreduce,torus_2d,6,2048,4096,65536,2281.0399999999827
|
||||||
|
intercube_allreduce,torus_2d,6,4096,8192,131072,2929.039999999979
|
||||||
|
intercube_allreduce,torus_2d,6,8192,16384,262144,4225.039999999986
|
||||||
|
intercube_allreduce,torus_2d,6,16384,32768,524288,6817.039999999943
|
||||||
|
intercube_allreduce,torus_2d,6,32768,65536,1048576,12001.03999999992
|
||||||
|
intercube_allreduce,torus_2d,6,49152,98304,1572864,17185.039999999994
|
||||||
|
|
@@ -0,0 +1,81 @@
|
|||||||
|
hop,label,size_bytes,path,total_ns
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,31.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,12.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,33.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,13.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,35.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,14.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,37.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,15.019999999996799
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,41.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,17.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,45.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,19.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,61.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,27.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,93.6399999999976
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,43.0199999999968
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,157.64000000000306
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,75.02000000000407
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,189.64000000000306
|
||||||
|
h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,91.02000000000407
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,31.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,12.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,33.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,13.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,35.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,14.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,37.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,15.019999999996799
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,41.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,17.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,45.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,19.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,61.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,27.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,93.6399999999976
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,43.0199999999968
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,157.64000000000306
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,75.02000000000407
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,189.64000000000306
|
||||||
|
h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,91.02000000000407
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,67.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,68.53999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,69.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,70.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,71.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,71.53999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,73.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,73.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,77.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,76.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,81.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,79.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,97.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,91.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,129.65999999999804
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,115.03999999999724
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,193.65999999999985
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,163.04000000000087
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,225.65999999999985
|
||||||
|
h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,187.04000000000087
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,87.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,88.53999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,89.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,90.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,91.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,91.53999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,93.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,93.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,97.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,96.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,101.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,99.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,117.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,111.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,149.65999999999804
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,135.03999999999724
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,213.65999999999985
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,183.04000000000087
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,245.65999999999985
|
||||||
|
h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,207.04000000000087
|
||||||
|
@@ -0,0 +1,157 @@
|
|||||||
|
direction: right
|
||||||
|
|
||||||
|
pe: PE {
|
||||||
|
style.fill: "#f8f9fa"
|
||||||
|
style.stroke: "#495057"
|
||||||
|
style.border-radius: 8
|
||||||
|
|
||||||
|
cpu: PE_CPU (control) {
|
||||||
|
style.fill: "#bbdefb"
|
||||||
|
style.stroke: "#1565c0"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
sched: PE_SCHED (dispatch) {
|
||||||
|
style.fill: "#bbdefb"
|
||||||
|
style.stroke: "#1565c0"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
ipcq_added: IPCQ (added) {
|
||||||
|
style.fill: "#e1f5fe"
|
||||||
|
style.stroke: "#0277bd"
|
||||||
|
style.stroke-dash: 5
|
||||||
|
style.stroke-width: 2
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
ipcq: PE_IPCQ (control plane) {
|
||||||
|
style.fill: "#bbdefb"
|
||||||
|
style.stroke: "#1565c0"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
dma: PE_DMA (single FIFO inbox) {
|
||||||
|
style.fill: "#fff3e0"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
fs: PE_FETCH_STORE {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
tcm: TCM (16MB SRAM) {
|
||||||
|
style.fill: "#fce4ec"
|
||||||
|
style.stroke: "#c62828"
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
ipcq_slot: IPCQ Slot Region {
|
||||||
|
style.stroke-dash: 5
|
||||||
|
style.fill: "#ffcdd2"
|
||||||
|
style.stroke: "#c62828"
|
||||||
|
style.border-radius: 3
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
gemm: GEMM engine {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
math: MATH engine {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
fport: Fabric Port {
|
||||||
|
style.fill: "#ffe0b2"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
# Control → dispatch
|
||||||
|
cpu -> sched: cmd dispatch
|
||||||
|
cpu -> ipcq_added.ipcq: IpcqRequest
|
||||||
|
|
||||||
|
# Compute pipeline
|
||||||
|
sched -> dma: TileToken\n(compute port)
|
||||||
|
dma -> fs: TileToken
|
||||||
|
dma <-> tcm: DMA_READ/WRITE\n(HBM ↔ TCM)
|
||||||
|
fs <-> tcm: fetch/store\n(TCM ↔ reg)
|
||||||
|
fs -> gemm: TileToken
|
||||||
|
fs -> math: TileToken
|
||||||
|
gemm -> fs: TileToken
|
||||||
|
math -> fs: TileToken
|
||||||
|
|
||||||
|
# IPCQ data path — outbound
|
||||||
|
ipcq_added.ipcq -> dma: IpcqDmaToken\n(IPCQ port) {style.stroke: "#1565c0"}
|
||||||
|
|
||||||
|
# IPCQ data path — inbound (MetaArrival: DMA → IPCQ)
|
||||||
|
dma -> ipcq_added.ipcq: IpcqMetaArrival {style.stroke: "#1565c0"}
|
||||||
|
|
||||||
|
# Credit return (dashed)
|
||||||
|
ipcq_added.ipcq -> dma: IpcqCreditMetadata\n(NoC latency charged) {
|
||||||
|
style.stroke: "#7b1fa2"
|
||||||
|
style.stroke-dash: 5
|
||||||
|
}
|
||||||
|
|
||||||
|
# DMA ↔ Fabric
|
||||||
|
dma <-> fport
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── NoC Router + attached resources ──
|
||||||
|
|
||||||
|
noc: NoC Router {
|
||||||
|
style.fill: "#f3e5f5"
|
||||||
|
style.stroke: "#6a1b9a"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
hbm: Local HBM {
|
||||||
|
style.fill: "#e8eaf6"
|
||||||
|
style.stroke: "#283593"
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
ipcq_slot_hbm: IPCQ Slot Region {
|
||||||
|
style.stroke-dash: 5
|
||||||
|
style.fill: "#c5cae9"
|
||||||
|
style.stroke: "#283593"
|
||||||
|
style.border-radius: 3
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
sram: Cube SRAM {
|
||||||
|
style.fill: "#e0f7fa"
|
||||||
|
style.stroke: "#00695c"
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
ipcq_slot_sram: IPCQ Slot Region {
|
||||||
|
style.stroke-dash: 5
|
||||||
|
style.fill: "#b2dfdb"
|
||||||
|
style.stroke: "#00695c"
|
||||||
|
style.border-radius: 3
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
other_pe: Other PEs {
|
||||||
|
style.fill: "#ede7f6"
|
||||||
|
style.stroke: "#6a1b9a"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
other_cube: Other Cubes / SIPs {
|
||||||
|
style.fill: "#ede7f6"
|
||||||
|
style.stroke: "#6a1b9a"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
pe.fport <-> noc
|
||||||
|
noc <-> hbm
|
||||||
|
noc <-> sram
|
||||||
|
noc <-> other_pe
|
||||||
|
noc <-> other_cube
|
||||||
|
@@ -0,0 +1,166 @@
|
|||||||
|
direction: right
|
||||||
|
|
||||||
|
pe: PE {
|
||||||
|
style.fill: "#f8f9fa"
|
||||||
|
style.stroke: "#495057"
|
||||||
|
style.border-radius: 8
|
||||||
|
|
||||||
|
cpu: PE_CPU (control) {
|
||||||
|
style.fill: "#bbdefb"
|
||||||
|
style.stroke: "#1565c0"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
sched: PE_SCHED (dispatch) {
|
||||||
|
style.fill: "#bbdefb"
|
||||||
|
style.stroke: "#1565c0"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
ipcq: IPCQ Controller (NEW) {
|
||||||
|
style.fill: "#e1f5fe"
|
||||||
|
style.stroke: "#0277bd"
|
||||||
|
style.border-radius: 6
|
||||||
|
style.stroke-width: 2
|
||||||
|
|
||||||
|
ptrmgmt: Pointer Mgmt {
|
||||||
|
style.fill: "#b3e5fc"
|
||||||
|
style.stroke: "#0277bd"
|
||||||
|
style.border-radius: 4
|
||||||
|
|
||||||
|
qprf: QPair Reg File
|
||||||
|
bp: Backpressure
|
||||||
|
sag: Slot Addr Gen
|
||||||
|
}
|
||||||
|
|
||||||
|
sideband: Sideband {
|
||||||
|
style.fill: "#b3e5fc"
|
||||||
|
style.stroke: "#0277bd"
|
||||||
|
style.border-radius: 4
|
||||||
|
|
||||||
|
metax: Meta Extractor
|
||||||
|
crinj: Credit Injector
|
||||||
|
crrcv: Credit Receiver
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
dma: PE_DMA (MOD) {
|
||||||
|
style.fill: "#fff3e0"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
compute_port: compute port {
|
||||||
|
style.fill: "#ffe0b2"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
ipcq_port: IPCQ port {
|
||||||
|
style.fill: "#ffe0b2"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
wrr: WRR Arbiter (NEW) {
|
||||||
|
style.fill: "#ffcc80"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 4
|
||||||
|
style.stroke-width: 2
|
||||||
|
}
|
||||||
|
|
||||||
|
compute_port -> wrr
|
||||||
|
ipcq_port -> wrr
|
||||||
|
}
|
||||||
|
|
||||||
|
fs: PE_FETCH_STORE {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
tcm: TCM (16MB SRAM) {
|
||||||
|
style.fill: "#fce4ec"
|
||||||
|
style.stroke: "#c62828"
|
||||||
|
style.border-radius: 6
|
||||||
|
|
||||||
|
work: Kernel Working Memory {
|
||||||
|
style.fill: "#f8bbd0"
|
||||||
|
style.stroke: "#c62828"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
slot: IPCQ Slot Region (rsv) {
|
||||||
|
style.fill: "#f48fb1"
|
||||||
|
style.stroke: "#c62828"
|
||||||
|
style.border-radius: 4
|
||||||
|
style.stroke-width: 2
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
gemm: GEMM engine {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
math: MATH engine {
|
||||||
|
style.fill: "#c8e6c9"
|
||||||
|
style.stroke: "#2e7d32"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
fport: Fabric Port {
|
||||||
|
style.fill: "#ffe0b2"
|
||||||
|
style.stroke: "#e65100"
|
||||||
|
style.border-radius: 4
|
||||||
|
}
|
||||||
|
|
||||||
|
# Control
|
||||||
|
cpu -> sched: cmd dispatch
|
||||||
|
cpu -> ipcq: MMIO
|
||||||
|
|
||||||
|
# Compute pipeline
|
||||||
|
sched -> dma.compute_port: TileToken
|
||||||
|
dma -> fs: TileToken
|
||||||
|
dma <-> tcm.work: DMA_READ/WRITE\n(HBM ↔ TCM)
|
||||||
|
fs <-> tcm.work: fetch/store\n(TCM ↔ reg)
|
||||||
|
fs -> gemm: TileToken
|
||||||
|
fs -> math: TileToken
|
||||||
|
gemm -> fs: TileToken
|
||||||
|
math -> fs: TileToken
|
||||||
|
|
||||||
|
# IPCQ data path
|
||||||
|
ipcq -> dma.ipcq_port: IpcqDmaToken {style.stroke: "#0277bd"}
|
||||||
|
dma -> ipcq.sideband.metax: IpcqMetaArrival {style.stroke: "#0277bd"}
|
||||||
|
|
||||||
|
# IPCQ slot R/W
|
||||||
|
dma <-> tcm.slot: slot read/write {
|
||||||
|
style.stroke: "#0277bd"
|
||||||
|
style.stroke-dash: 3
|
||||||
|
}
|
||||||
|
|
||||||
|
# Credit via fabric port
|
||||||
|
ipcq.sideband.crinj -> fport: credit out (16B) {
|
||||||
|
style.stroke: "#7b1fa2"
|
||||||
|
style.stroke-dash: 5
|
||||||
|
}
|
||||||
|
fport -> ipcq.sideband.crrcv: credit in (16B) {
|
||||||
|
style.stroke: "#7b1fa2"
|
||||||
|
style.stroke-dash: 5
|
||||||
|
}
|
||||||
|
|
||||||
|
# DMA ↔ Fabric
|
||||||
|
dma.wrr <-> fport
|
||||||
|
}
|
||||||
|
|
||||||
|
noc: NoC Router {
|
||||||
|
style.fill: "#f3e5f5"
|
||||||
|
style.stroke: "#6a1b9a"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
ext: Other PEs / Cubes / SIPs {
|
||||||
|
style.fill: "#ede7f6"
|
||||||
|
style.stroke: "#6a1b9a"
|
||||||
|
style.border-radius: 6
|
||||||
|
}
|
||||||
|
|
||||||
|
pe.fport <-> noc
|
||||||
|
noc <-> ext
|
||||||
|
@@ -0,0 +1,548 @@
|
|||||||
|
# IPCQ-DMA Co-design Hardware Design Document
|
||||||
|
|
||||||
|
**Status**: Draft — Review Requested
|
||||||
|
**Date**: 2026-04-28
|
||||||
|
**Authors**: YW Kang
|
||||||
|
**Reviewers**: (HW team TBD)
|
||||||
|
**Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Background & Motivation
|
||||||
|
|
||||||
|
IPCQ (Inter-PE Communication Queue) is a hardware queue mechanism for collective communication between PEs. Its core design principle is the IPCQ-DMA co-design: **when the DMA transfers data, it updates the IPCQ head/tail pointers automatically from piggybacked metadata, without a separate control message.**

This document:

1. describes how IPCQ operates at the hardware level in the current PE architecture,
2. verifies how the simulator models that hardware,
3. proposes a design for the actual hardware implementation, and
4. reviews alternatives to settle on the best approach.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. High-level Behavior of PE_IPCQ
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> Source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2), rendered with `d2 --layout=elk --scale 1.5`.
|
||||||
|
|
||||||
|
### IPCQ Hardware Behavior

**HW Configuration**:

* IPCQ sets up unidirectional, ring-buffer-based queues between PEs to deliver data.
* Each PE maintains an independent queue pair per direction (N/S/E/W, etc.).
* For each queue pair, IPCQ maintains the sender's head/tail pointers and the receiver's head/tail pointers.
* **IPCQ Slot Region**: the IPCQ receive buffer. As the dashed boxes in the diagram show, it can be placed in TCM, Cube SRAM, or Local HBM by setting `buffer_kind`.

Performance characteristics per tier (simulation model values from `ipcq_types.py`):

| Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | Use |
|-------------|-------------|-------------------------------|------|
| TCM | 512 GB/s | 512 GB/s (directly attached, bypasses the NoC) | Lowest latency, PE-internal only |
| Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Shared within a cube, limited by NoC BW |
| Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | Large capacity, limited by NoC BW |
|
||||||
|
|
||||||
|
**Send path (fire-and-forget)**:
1. PE_CPU issues `tl.send(dir, src_addr)`, which delivers an IpcqRequest to PE_IPCQ.
2. PE_IPCQ checks backpressure: the send may proceed while `(my_head - peer_tail_cache) < peer.n_slots`.
3. PE_IPCQ computes the peer's rx slot address: `peer_rx_base + (my_head % n_slots) × slot_size`.
4. PE_IPCQ passes an IpcqDmaToken (data plus piggybacked metadata: `sender_seq`) to PE_DMA.
5. PE_IPCQ increments `my_head` and returns to PE_CPU immediately, without waiting for the DMA to complete.
6. PE_DMA snapshots the source data and sends it over the NoC to the peer's PE_DMA.
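
Steps 2 to 5 of the send path can be sketched in a few lines of Python. This is a minimal illustration rather than the simulator's `PeIpcqComponent`: the class name `SenderQueue`, the dict-shaped token, and the example addresses are assumptions, and a full ring raises an exception where the real path would stall.

```python
from dataclasses import dataclass


@dataclass
class SenderQueue:
    """Sender-side state for one direction (illustrative names, not the real API)."""
    peer_rx_base: int         # base address of the peer's rx slot region
    n_slots: int              # ring depth
    slot_size: int            # bytes per slot
    my_head: int = 0          # monotonic count of sends issued
    peer_tail_cache: int = 0  # last consumer_seq learned from a credit

    def can_send(self) -> bool:
        # Step 2: room exists while fewer than n_slots sends are unconsumed.
        return (self.my_head - self.peer_tail_cache) < self.n_slots

    def send(self, src_addr: int) -> dict:
        if not self.can_send():
            # The hardware would stall here; the sketch raises instead.
            raise BlockingIOError("ring full: wait for a credit")
        # Step 3: destination slot inside the peer's ring.
        dst = self.peer_rx_base + (self.my_head % self.n_slots) * self.slot_size
        # Step 4: token for the DMA, with sender_seq piggybacked.
        token = {"src": src_addr, "dst": dst, "sender_seq": self.my_head}
        # Step 5: advance the head and return without waiting for the DMA.
        self.my_head += 1
        return token


q = SenderQueue(peer_rx_base=0xE00000, n_slots=4, slot_size=0x1000)
tokens = [q.send(0x1000) for _ in range(4)]
print([hex(t["dst"]) for t in tokens])
print("can send again:", q.can_send())
```

After four sends into a four-slot ring the queue reports no room until `peer_tail_cache` advances, which is the role of the credit return described in the receive path.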
|
||||||
|
|
||||||
|
**Receive path (blocking)**:
1. The receiving PE's PE_DMA writes the data into the slot and, **in the same cycle**, extracts the metadata (`sender_seq`, `dst_addr`).
2. PE_IPCQ identifies the direction by range-matching `dst_addr` and updates `peer_head_cache`.
3. PE_IPCQ sends a wakeup signal to the PE_CPU waiting in `tl.recv(dir)`.
4. PE_CPU reads the data from the slot, and PE_IPCQ increments `my_tail`.
5. **Credit return**: PE_IPCQ sends a 16B credit packet (`consumer_seq`) to the sender over the NoC.
6. The sender's PE_IPCQ updates `peer_tail_cache`, releasing backpressure.
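
The receive path can be sketched the same way. The `ReceiverQueue` class and the dict-shaped credit are hypothetical; the update rules follow the steps above, and an empty queue raises an exception where `tl.recv` would block.

```python
from dataclasses import dataclass


@dataclass
class ReceiverQueue:
    """Receiver-side state for one direction (illustrative names)."""
    rx_base: int              # base address of this PE's rx slot region
    n_slots: int
    slot_size: int
    my_tail: int = 0          # monotonic count of slots consumed
    peer_head_cache: int = 0  # one past the highest sender_seq seen

    def on_meta_arrival(self, sender_seq: int) -> None:
        # Step 2: the slot that carried sender_seq now holds valid data.
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)

    def data_ready(self) -> bool:
        # Wake-up condition for a blocked recv.
        return self.peer_head_cache > self.my_tail

    def recv(self) -> tuple[int, dict]:
        if not self.data_ready():
            raise BlockingIOError("no data yet: recv would block")
        # Step 4: locate the slot, then advance the tail.
        slot_addr = self.rx_base + (self.my_tail % self.n_slots) * self.slot_size
        self.my_tail += 1
        # Step 5: credit packet contents returned to the sender.
        credit = {"consumer_seq": self.my_tail, "dst_rx_base_pa": self.rx_base}
        return slot_addr, credit


rq = ReceiverQueue(rx_base=0xE00000, n_slots=4, slot_size=0x1000)
rq.on_meta_arrival(sender_seq=0)
addr, credit = rq.recv()
print(hex(addr), credit)
```

The sender applies `consumer_seq` to its `peer_tail_cache` (step 6), which is what frees a slot for the next send.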
|
||||||
|
|
||||||
|
**Core design principles**:
- **Data + head pointer piggyback**: `sender_seq` rides on the DMA data flit, so no separate head synchronization message is needed.
- **Atomic write + metadata**: the receiving DMA performs the slot write and the metadata delivery in the same cycle (invariant I6).
- **Address-based direction matching**: even when several directions connect to the same peer, they are told apart by `dst_addr` range (ADR-0025).
- **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Simulator Implementation Verification
|
||||||
|
|
||||||
|
This section verifies how the simulator models the hardware behavior described above.

### 3.1 Mapping Design Intent to Implementation

| Design intent | Simulator implementation | Location |
|-----------|----------------|------|
| DMA piggybacks the head pointer on data transfers | The `IpcqDmaToken.sender_seq` field travels with the data flit | `ipcq_types.py:185` |
| Receiving DMA performs the data write and metadata delivery atomically | No yield between `store.write` and `IpcqMetaArrival` in `_handle_ipcq_inbound` (I6) | `pe_dma.py:232-275` |
| Send is fire-and-forget | `_handle_ipcq_outbound` does not wait on `sub_done` | `pe_dma.py:182` |
| Recv blocks until data arrives | Waits on the condition `peer_head_cache > my_tail` | `pe_ipcq.py:263` |
| Credit return uses a separate fast path | Direct put through a SimPy Store (latency charged from the NoC path) | `pe_ipcq.py:443-469` |
| In-flight data semantics (snapshot) | Data is snapshotted at send time and is unaffected by later writes to src | `pe_dma.py:142-155` |
| Single PE_DMA inbox | `_fan_in` merges every in_port into one FIFO; there is no arbiter between the compute port and the IPCQ port | `base.py:51-53` |
|
||||||
|
|
||||||
|
### 3.2 Credit Return Path Modeling Details

Credit return finds the actual NoC path with `router.find_path()` and charges the hop latency plus BW drain computed by `compute_path_latency_ns()`.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# pe_ipcq.py:471-492
|
||||||
|
def _credit_latency_ns(self, direction: str) -> float:
|
||||||
|
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
|
||||||
|
return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes)
|
||||||
|
```
|
||||||
|
|
||||||
|
Note, however, that the latency is paid with `env.timeout()` and the credit is then put directly into `peer_credit_store` (a SimPy Store). No actual `Transaction` is created and routed hop by hop through the NoC, so **bandwidth contention with other traffic is not modeled.**

| | Latency | BW Contention |
|---|---|---|
| Data path (IpcqDmaToken) | Modeled accurately as a NoC Transaction | Traverses the actual fabric |
| Credit path (16B) | NoC path latency reflected accurately | No fabric Transaction injected (simplification) |

A credit is 16B, negligible next to a data transfer (tens to hundreds of KB), so the practical error introduced by this simplification is close to zero.
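
A quick calculation shows the scale involved. The sketch below compares the serialization (drain) time of a 16B credit with that of a data slot; the 128 GB/s figure is the Cube SRAM effective bandwidth from the table in Section 2, while the 64KB slot size is borrowed from the open questions and is only an illustrative choice. Hop latency is left out on purpose.

```python
def drain_ns(nbytes: int, bw_gb_per_s: float) -> float:
    """Serialization time in nanoseconds (1 GB/s is exactly 1 byte/ns)."""
    return nbytes / bw_gb_per_s


BOTTLENECK_BW = 128.0                           # GB/s, Cube SRAM effective BW
credit_ns = drain_ns(16, BOTTLENECK_BW)         # one 16B credit packet
data_ns = drain_ns(64 * 1024, BOTTLENECK_BW)    # one 64KB slot (assumed size)

print(f"credit: {credit_ns} ns, data: {data_ns} ns, ratio: {data_ns / credit_ns:.0f}x")
```

Under these assumptions a credit occupies the link for a small fraction of a nanosecond per data slot transferred, which is why leaving it out of the contention model changes little.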
|
||||||
|
|
||||||
|
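### 3.3 Verification Conclusion

The simulator implementation **accurately models** the intent of the IPCQ-DMA co-design, with the one simplification noted in Section 3.2: credit packets do not contend for fabric bandwidth.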
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Proposed Hardware Design
|
||||||
|
|
||||||
|
### 4.1 Block Diagram (after the change)

Changes are highlighted: **(NEW)** = new, **(MOD)** = modified.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
> Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2), rendered with `d2 --layout=elk`.
|
||||||
|
|
||||||
|
**Key changes, Baseline → Proposed**:
- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
- The **reserved IPCQ Slot Region** inside TCM is made explicit
- Credit Injector / Receiver connect directly to the NoC through the Fabric Port
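
The document does not specify the WRR Arbiter's weights or internals, so the following is only a sketch of one common weighted round-robin scheme applied to the two ports. The class `WrrArbiter`, the 2:1 weights, and the token names are all assumptions.

```python
from collections import deque


class WrrArbiter:
    """Weighted round-robin over named ports (weights must be >= 1)."""

    def __init__(self, weights: dict[str, int]):
        self.weights = dict(weights)
        self.order = list(weights)
        self.queues = {name: deque() for name in weights}
        self.credits = dict(weights)   # grants left in the current turn
        self.idx = 0                   # port whose turn it is

    def push(self, port: str, item) -> None:
        self.queues[port].append(item)

    def pop(self):
        """Return (port, item) for the next grant, or None if all ports are idle."""
        if not any(self.queues.values()):
            return None
        while True:
            port = self.order[self.idx]
            if self.queues[port] and self.credits[port] > 0:
                self.credits[port] -= 1
                return port, self.queues[port].popleft()
            # Port is idle or has used its share: refill it and move on.
            self.credits[port] = self.weights[port]
            self.idx = (self.idx + 1) % len(self.order)


arb = WrrArbiter({"compute": 2, "ipcq": 1})
for i in range(4):
    arb.push("compute", f"tile{i}")
for i in range(2):
    arb.push("ipcq", f"ipcq{i}")

grants = []
while (grant := arb.pop()) is not None:
    grants.append(grant[1])
print(grants)
```

Unlike the baseline's single FIFO, this keeps IPCQ tokens from waiting behind an arbitrarily long run of compute tokens, and it stays work-conserving when one port is idle.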
|
||||||
|
|
||||||
|
### 4.2 Module Details
|
||||||
|
|
||||||
|
#### 4.2.1 IPCQ Controller (new module)

A hardware control block that sits between PE_CPU and the DMA Engine. It corresponds to the simulator's `PeIpcqComponent`.

##### QPair Register File

The per-direction queue pair state is held in flip-flops.
|
||||||
|
|
||||||
|
```
|
||||||
|
Per-direction registers (each 64-bit):
  my_head          - sender write position (monotonic)
  my_tail          - receiver read position (monotonic)
  peer_head_cache  - last known peer head (updated by Meta Extractor)
  peer_tail_cache  - last known peer tail (updated by Credit Receiver)
  rx_base_pa       - this PE's rx buffer base physical address
  peer_rx_base_pa  - peer's rx buffer base physical address
  n_slots          - ring depth (power-of-2 constraint, see below)
  slot_size        - bytes per slot
  peer_credit_tgt  - credit receive address on the peer PE

Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8B = 576B flip-flops
|
||||||
|
```
|
||||||
|
|
||||||
|
PE_CPU can read and write these registers through MMIO (CSR). Software fills them in at init time.
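
The sizing above can be checked mechanically. In the sketch below the register and direction names come from the listing, while the direction-major CSR layout computed by `csr_offset` is purely hypothetical, since the document does not define register offsets.

```python
REGISTERS = (
    "my_head", "my_tail", "peer_head_cache", "peer_tail_cache",
    "rx_base_pa", "peer_rx_base_pa", "n_slots", "slot_size", "peer_credit_tgt",
)
DIRECTIONS = ("N", "S", "E", "W", "parent", "child_left", "child_right", "spare")
REG_BYTES = 8  # each register is 64-bit


def csr_offset(direction: str, register: str) -> int:
    """Byte offset of a register in a hypothetical direction-major layout."""
    d = DIRECTIONS.index(direction)
    r = REGISTERS.index(register)
    return (d * len(REGISTERS) + r) * REG_BYTES


total_bytes = len(DIRECTIONS) * len(REGISTERS) * REG_BYTES
total_flops = total_bytes * 8  # one flip-flop per bit

print(total_bytes, "bytes,", total_flops, "flip-flops")
print(hex(csr_offset("E", "peer_tail_cache")))
```

The flip-flop count this yields matches the "~4.6K FF" entry in the area table of Section 6.1.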
|
||||||
|
|
||||||
|
##### Slot Address Generator (combinational)
|
||||||
|
|
||||||
|
```
|
||||||
|
Input: pointer (my_head or my_tail), n_slots, slot_size, base_pa
|
||||||
|
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size
|
||||||
|
|
||||||
|
Implementation:
|
||||||
|
n_slots power-of-2 constraint → pointer & (n_slots - 1) (AND mask, 1 gate delay)
|
||||||
|
slot_size power-of-2 → barrel shift (1 cycle)
|
||||||
|
64-bit add → ripple/kogge-stone adder (1 cycle)
|
||||||
|
|
||||||
|
Latency: 1-2 cycles combinational
|
||||||
|
```
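
The equivalence that the Slot Address Generator relies on is easy to confirm in software. The sketch below compares the reference formula with the mask-and-shift form; both function names are illustrative, and the shift form assumes `slot_size` is also a power of two.

```python
def slot_addr_reference(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """The formula as written: modulo and multiply."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_mask_shift(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Same result using an AND mask and a left shift (powers of two only)."""
    for value in (n_slots, slot_size):
        if value <= 0 or value & (value - 1):
            raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)          # replaces pointer % n_slots
    shift = slot_size.bit_length() - 1       # log2(slot_size)
    return base_pa + (index << shift)        # replaces index * slot_size


mismatches = [
    p for p in range(1000)
    if slot_addr_reference(p, 8, 4096, 0xE00000) != slot_addr_mask_shift(p, 8, 4096, 0xE00000)
]
print("mismatches:", mismatches)
```

With a non-power-of-two `n_slots` the mask no longer equals the modulo, which is the reason for the constraint listed in Section 6.4.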
|
||||||
|
|
||||||
|
##### Backpressure Comparator (combinational)
|
||||||
|
|
||||||
|
```
|
||||||
|
full = (my_head - peer_tail_cache) >= n_slots
|
||||||
|
|
||||||
|
Implementation: 64-bit subtract + unsigned compare
|
||||||
|
Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
|
||||||
|
Latency: 1 cycle
|
||||||
|
```
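
One property of the unsigned subtract is worth spelling out: if the pointers are fixed-width counters, the comparison stays correct even after `my_head` wraps past zero, provided the subtraction is done modulo the register width. The document does not state how wraparound is handled, so the following is a sketch under that assumption.

```python
MASK64 = (1 << 64) - 1  # registers are 64-bit


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """Backpressure comparator using 64-bit modular (unsigned) subtraction."""
    occupancy = (my_head - peer_tail_cache) & MASK64
    return occupancy >= n_slots


# Normal case: 3 of 4 slots in flight, then 4 of 4.
print(is_full(3, 0, 4), is_full(4, 0, 4))

# Wrapped case: head has wrapped to 2 while the cached tail is still near 2**64.
print(is_full(2, (1 << 64) - 2, 4))
```

In the wrapped case the true occupancy is 4, and the modular subtraction recovers it even though `my_head` is numerically smaller than `peer_tail_cache`.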
|
||||||
|
|
||||||
|
##### Meta Extractor (inbound datapath sideband)
|
||||||
|
|
||||||
|
Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the header of each arriving IPCQ flit and updates the queue pair state.
|
||||||
|
|
||||||
|
```
|
||||||
|
Trigger: DMA inbound write completion (same cycle)
|
||||||
|
Extract: {sender_seq, dst_addr} from flit header
|
||||||
|
|
||||||
|
Direction matching (ADR-0025 D2):
|
||||||
|
for each dir:
|
||||||
|
match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
|
||||||
|
8× parallel range comparators + priority encoder
|
||||||
|
|
||||||
|
Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
|
||||||
|
Output: recv_wake signal for matched direction → PE_CPU interrupt/flag
|
||||||
|
|
||||||
|
Implementation: 8× (2 comparators + AND) + priority encoder
|
||||||
|
Latency: 1 cycle (pipelined with the DMA write, so I6 atomicity holds by construction)
|
||||||
|
```
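
The direction matching performed by the Meta Extractor can be modeled in software as a range lookup. In this sketch the sequential loop stands in for the parallel comparators, dictionary order stands in for the priority encoder, and the region addresses are made up for illustration.

```python
def match_direction(dst_addr: int, qpairs: dict[str, tuple[int, int, int]]):
    """Return the first direction whose rx region contains dst_addr, else None.

    qpairs maps direction -> (base_pa, n_slots, slot_size).
    """
    for direction, (base_pa, n_slots, slot_size) in qpairs.items():
        if base_pa <= dst_addr < base_pa + n_slots * slot_size:
            return direction
    return None


# Two adjacent rx regions of 4 slots x 4KB each (illustrative addresses).
qpairs = {
    "N": (0xE00000, 4, 0x1000),
    "S": (0xE04000, 4, 0x1000),
}
print(match_direction(0xE03FFF, qpairs))  # last byte of the N region
print(match_direction(0xE04000, qpairs))  # first byte of the S region
print(match_direction(0x100000, qpairs))  # outside every region
```

Because the upper bound is exclusive, adjacent regions never both match the same address, so the priority encoder only matters if software programs overlapping regions.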
|
||||||
|
|
||||||
|
##### Credit Injector (outbound)
|
||||||
|
|
||||||
|
```
|
||||||
|
Trigger: recv completion (after my_tail is incremented)
Action: pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)
|
||||||
|
|
||||||
|
Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
|
||||||
|
Latency: 1 cycle to generate, then NoC traversal
|
||||||
|
```
|
||||||
|
|
||||||
|
##### Credit Receiver (inbound sideband)
|
||||||
|
|
||||||
|
```
|
||||||
|
Trigger: 16B credit packet arrival (from NoC)
|
||||||
|
Extract: {consumer_seq, dst_rx_base_pa}
|
||||||
|
|
||||||
|
Direction matching (ADR-0025 D3):
|
||||||
|
for each dir:
|
||||||
|
match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)
|
||||||
|
|
||||||
|
Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
|
||||||
|
Output: send_wake signal → deassert backpressure stall
|
||||||
|
|
||||||
|
Latency: 1 cycle
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 4.2.2 DMA Engine Changes

##### vc_comm IPCQ-aware mode

An IPCQ flit handling mode is added to the existing vc_comm channel.
|
||||||
|
|
||||||
|
**Outbound**:
1. Receive a command from the IPCQ Controller: {src_addr, dst_addr, nbytes, sender_seq}
2. Read src_addr from TCM and snapshot it into the DMA read buffer (existing DMA behavior)
3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
4. Inject into the NoC fabric port
5. Fire-and-forget (do not wait for completion)
|
||||||
|
|
||||||
|
**Inbound**:
1. Receive an IPCQ flit from the NoC
2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw)
3. Slot write latency charge (backing memory tier)
4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
5. Done

**Hardware guarantee of I6 atomicity**: the TCM write completion and the Meta Extractor trigger occur in the same pipeline stage, so no extra synchronization is needed. The simulator's "no yield between write and IpcqMetaArrival" rule is satisfied by construction.
|
||||||
|
|
||||||
|
##### Data Snapshot Semantics
|
||||||
|
|
||||||
|
Data latched in the DMA read buffer is unaffected by later modifications to the source memory. This is standard DMA read-then-write behavior, so no additional HW is required.

##### Credit Virtual Channel (optional)

- Option A: multiplex credits onto vc_comm (distinguished as 16B header-only flits)
- Option B: add a third, dedicated credit VC (strict priority > data)

Option B is better for deadlock prevention, but the BW impact of a 16B credit is negligible, so Option A is also sufficient.
|
||||||
|
|
||||||
|
#### 4.2.3 Fabric Flit Format Extension
|
||||||
|
|
||||||
|
```
|
||||||
|
Normal data flit (e.g., 512-bit):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│ [479:0]   payload (480b = 60B)           │
└──────────────────────────────────────────┘

IPCQ data flit (metadata in the first flit only):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│   [511]     ipcq_flag (1b)               │ ← distinguishes IPCQ from normal DMA
│   [510:509] vc_id (2b)                   │
│   [508:480] route + hop count            │
│ [479:416] ipcq_metadata (64b)            │ ← piggyback
│   [479:448] sender_seq (32b)             │
│   [447:416] dst_addr[31:0] (32b)         │ ← for direction matching
│ [415:0]   payload (416b = 52B)           │
└──────────────────────────────────────────┘
Subsequent flits: full 60B payload (no metadata)

Credit-only flit (128-bit, header-only):
┌──────────────────────────────────────────┐
│ [127:96] routing header (32b)            │
│   [127]    credit_flag (1b)              │
│ [95:64]  consumer_seq (32b)              │
│ [63:0]   dst_rx_base_pa (64b)            │
└──────────────────────────────────────────┘
|
||||||
|
```
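
To make the credit-only layout concrete, the sketch below packs and unpacks the 128-bit flit with plain integer arithmetic. The bit positions follow the diagram; the big-endian byte order, the helper names, and the treatment of the remaining 31 header bits as opaque `route_bits` are assumptions.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1
CREDIT_FLAG = 1 << 31  # flit bit 127 is the top bit of the 32-bit routing header


def pack_credit(route_bits: int, consumer_seq: int, dst_rx_base_pa: int) -> bytes:
    """Build a 16-byte credit flit: header[127:96], seq[95:64], base_pa[63:0]."""
    header = CREDIT_FLAG | (route_bits & 0x7FFFFFFF)
    word = (header << 96) | ((consumer_seq & MASK32) << 64) | (dst_rx_base_pa & MASK64)
    return word.to_bytes(16, "big")


def unpack_credit(raw: bytes) -> tuple[int, int]:
    """Return (consumer_seq, dst_rx_base_pa); reject flits without credit_flag."""
    word = int.from_bytes(raw, "big")
    if not (word >> 127) & 1:
        raise ValueError("credit_flag not set")
    return (word >> 64) & MASK32, word & MASK64


raw = pack_credit(route_bits=0x12, consumer_seq=7, dst_rx_base_pa=0xE00000)
print(len(raw), raw.hex())
print(unpack_credit(raw))
```

Note that `consumer_seq` is 32 bits on the wire while the QPair registers are 64 bits wide, so an implementation would need a rule for extending the received value; the document does not specify one.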
|
||||||
|
|
||||||
|
The first flit's payload shrinks from 60B to 52B (13% overhead). In a multi-flit transfer the subsequent flits carry a full payload, so the overhead for large transfers is < 1%.
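
The overhead claim can be checked by counting flits. The sketch below uses the payload sizes from the flit diagram (60B normally, 52B in the first IPCQ flit); the function names and the example transfer sizes are illustrative.

```python
import math

FLIT_PAYLOAD = 60        # bytes of payload in a normal flit
METADATA = 8             # bytes of IPCQ metadata in the first flit
FIRST_PAYLOAD = FLIT_PAYLOAD - METADATA  # 52 bytes


def flits_plain(nbytes: int) -> int:
    """Flits needed for a normal DMA transfer of nbytes (nbytes >= 1)."""
    return math.ceil(nbytes / FLIT_PAYLOAD)


def flits_ipcq(nbytes: int) -> int:
    """Flits needed when the first flit also carries IPCQ metadata."""
    if nbytes <= FIRST_PAYLOAD:
        return 1
    return 1 + math.ceil((nbytes - FIRST_PAYLOAD) / FLIT_PAYLOAD)


for size in (52, 60, 4096, 64 * 1024):
    extra = flits_ipcq(size) - flits_plain(size)
    print(f"{size:>6} B: plain={flits_plain(size)} ipcq={flits_ipcq(size)} "
          f"extra={extra} metadata share={METADATA / size:.4%}")
```

The metadata costs at most one extra flit per transfer, and for a 64KB transfer it costs none in this model because the last flit has spare room.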
|
||||||
|
|
||||||
|
#### 4.2.4 TCM IPCQ Slot Region
|
||||||
|
|
||||||
|
```
|
||||||
|
TCM Memory Map (16MB):
|
||||||
|
┌─────────────────────────────┐ 0x000000
|
||||||
|
│ Kernel Working Memory │
|
||||||
|
│ (compute tensors) │
|
||||||
|
│ ~14MB │
|
||||||
|
├─────────────────────────────┤ 0xE00000
|
||||||
|
│ IPCQ RX Buffers │
|
||||||
|
│ Dir N: slots × slot_size │
|
||||||
|
│ Dir S: slots × slot_size │
|
||||||
|
│ Dir E: slots × slot_size │
|
||||||
|
│ Dir W: slots × slot_size │
|
||||||
|
│ ~1MB │
|
||||||
|
├─────────────────────────────┤ 0xF00000
|
||||||
|
│ IPCQ Metadata / Scratch │
|
||||||
|
│ ~1MB │
|
||||||
|
└─────────────────────────────┘ 0xFFFFFF
|
||||||
|
```
|
||||||
|
|
||||||
|
The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts with compute accesses (see Section 7.1).
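
A small consistency check on the memory map helps when choosing queue parameters. The region boundaries below are taken from the map; the helper names are hypothetical, and the 64KB slot size with 4 or 8 directions mirrors the figures raised in the open questions rather than any settled configuration.

```python
KIB = 1 << 10
MIB = 1 << 20
TCM_SIZE = 16 * MIB

# name -> (start, end) with end exclusive, from the TCM memory map
REGIONS = {
    "kernel_working_memory": (0x000000, 0xE00000),
    "ipcq_rx_buffers": (0xE00000, 0xF00000),
    "ipcq_metadata_scratch": (0xF00000, 0x1000000),
}


def region_size(name: str) -> int:
    start, end = REGIONS[name]
    return end - start


def rx_bytes(n_dirs: int, n_slots: int, slot_size: int) -> int:
    """Total rx buffer footprint for a given queue configuration."""
    return n_dirs * n_slots * slot_size


def fits_rx_region(n_dirs: int, n_slots: int, slot_size: int) -> bool:
    return rx_bytes(n_dirs, n_slots, slot_size) <= region_size("ipcq_rx_buffers")


print({name: region_size(name) // MIB for name in REGIONS})
print("4 dirs x 4 slots x 64KB fits:", fits_rx_region(4, 4, 64 * KIB))
print("8 dirs x 8 slots x 64KB fits:", fits_rx_region(8, 8, 64 * KIB))
```

Four directions with four 64KB slots each fill the 1MB rx region exactly, while eight directions with eight slots need 4MB and would not fit the map as drawn.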
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. End-to-End Dataflow
|
||||||
|
|
||||||
|
### 5.1 Sequence Diagram
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
sequenceDiagram
|
||||||
|
participant CPU_A as PE_A: PE_CPU
|
||||||
|
participant IPCQ_A as PE_A: IPCQ Ctrl
|
||||||
|
participant DMA_A as PE_A: DMA
|
||||||
|
participant NOC as NoC Fabric
|
||||||
|
participant DMA_B as PE_B: DMA
|
||||||
|
participant IPCQ_B as PE_B: IPCQ Ctrl
|
||||||
|
participant TCM_B as PE_B: TCM
|
||||||
|
participant CPU_B as PE_B: PE_CPU
|
||||||
|
|
||||||
|
Note over CPU_A: tl.send(dir="E", src=0x1000)
|
||||||
|
|
||||||
|
CPU_A->>IPCQ_A: MMIO: send request
|
||||||
|
Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
|
||||||
|
IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
|
||||||
|
Note over IPCQ_A: my_head++
|
||||||
|
IPCQ_A-->>CPU_A: send returns (fire-and-forget)
|
||||||
|
|
||||||
|
Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
|
||||||
|
DMA_A->>NOC: IPCQ data flit(s)
|
||||||
|
|
||||||
|
Note over NOC: hop latency + BW drain
|
||||||
|
|
||||||
|
NOC->>DMA_B: IPCQ data flit(s)
|
||||||
|
Note over DMA_B: Terminal BW drain<br/>Slot write latency
|
||||||
|
|
||||||
|
rect rgb(255, 240, 220)
|
||||||
|
Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
|
||||||
|
DMA_B->>TCM_B: write data → slot address
|
||||||
|
DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
|
||||||
|
end
|
||||||
|
|
||||||
|
Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
|
||||||
|
IPCQ_B-->>CPU_B: recv_wake signal
|
||||||
|
|
||||||
|
Note over CPU_B: tl.recv(dir="W") wakes up
|
||||||
|
CPU_B->>IPCQ_B: recv request
|
||||||
|
Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
|
||||||
|
IPCQ_B-->>CPU_B: return slot_addr
|
||||||
|
CPU_B->>TCM_B: read data from slot
|
||||||
|
Note over IPCQ_B: my_tail++
|
||||||
|
|
||||||
|
IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
|
||||||
|
Note over NOC: credit traversal (NoC latency)
|
||||||
|
NOC->>IPCQ_A: Credit arrival
|
||||||
|
|
||||||
|
Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. 2nm Implementation Analysis
|
||||||
|
|
||||||
|
### 6.1 Area Estimate
|
||||||
|
|
||||||
|
| Module | Gate Count | Area (2nm est.) | Notes |
|--------|-----------|-----------------|-------|
| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
| **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
| **Total delta** | **~14.6K** | **~0.006 mm²** | |
|
||||||
|
|
||||||
|
### 6.2 Timing
|
||||||
|
|
||||||
|
| Path | Delay (2nm est.) | Target Clock | Margin |
|
||||||
|
|------|-------------------|-------------|--------|
|
||||||
|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
|
||||||
|
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
|
||||||
|
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
|
||||||
|
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |
|
||||||
|
|
||||||
|
All critical paths fit within one cycle, so no timing closure problem is expected.
|
||||||
|
|
||||||
|
### 6.3 Power
|
||||||
|
|
||||||
|
- Active: ~1 mW (register read/write + comparators, during send/recv activity)
- Idle: leakage only
- Negligible relative to total PE power
|
||||||
|
|
||||||
|
### 6.4 Constraints
|
||||||
|
|
||||||
|
| Item | Constraint | Rationale |
|------|------|------|
| `n_slots` | **Must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
| `slot_size` | **Power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
| TCM IPCQ region | **Placed in dedicated banks** | Prevents bank conflicts with compute accesses |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Risk Assessment
|
||||||
|
|
||||||
|
### 7.1 TCM Bank Conflict
|
||||||
|
|
||||||
|
- **Risk**: stalls when an IPCQ slot write and a compute read hit the same bank
- **Mitigation**: place the IPCQ region in dedicated banks at the top of the TCM address range
- **Cost**: slightly reduced TCM banking flexibility
- **Severity**: Medium (performance impact), Low (not a correctness problem)

### 7.2 Credit Return Latency under Congestion

- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
- **Mitigation**:
  - Move credits to a separate VC with strict priority (at 16B the BW impact is minimal)
  - Or set n_slots generously (8+) so the buffer absorbs credit delay
- **Severity**: Low (a 16B credit contributes almost nothing to congestion)

### 7.3 Inter-Direction Ordering

- **Risk**: ordering when one PE sends in several directions at the same time
- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is the responsibility of the kernel (software), as in the current simulator model
- **Severity**: Low (resolved by the architecture design)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Alternatives Considered
|
||||||
|
|
||||||
|
### 8.1 Doorbell + Polling (traditional approach)
|
||||||
|
|
||||||
|
```
|
||||||
|
Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
|
||||||
|
Recv: Polling loop on doorbell, or interrupt-driven
|
||||||
|
```
|
||||||
|
|
||||||
|
| Pros | Cons |
|------|------|
| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
| Reuses the existing DMA | Ordering between data and doorbell must be enforced (fence) |
| | Polling wastes power; interrupts add latency overhead |

**Evaluation**: latency increases 2-3× relative to piggyback. **Rejected.**
|
||||||
|
|
||||||
|
### 8.2 Hardware Message Queue (NVIDIA NVLink style)

```
Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
Recv: pop a descriptor from the HMQ → read the data pointer
```

| Pros | Cons |
|------|------|
| CPU only writes a descriptor | Requires a separate HMQ engine (~0.05 mm²) |
| Descriptor/data separation → flexible | Datapath separate from the DMA → duplicated area/power |
| | Large tensors still need the DMA |

**Evaluation**: the large-tensor pattern of CCL requires the DMA anyway, so a dual HMQ + DMA structure wastes area. **Rejected.**
|
||||||
|
|
||||||
|
### 8.3 RDMA-style Completion Queue (CQ)
|
||||||
|
|
||||||
|
```
Send: DMA write → a CQE is generated automatically at the peer
Recv: CQ poll/interrupt → locate the data
```

| Pros | Cons |
|------|------|
| Mature model from InfiniBand/RoCE | CQ management logic + CQE memory overhead |
| Easy multi-tenancy/isolation | CQE/data ordering needs an additional guarantee |
| | Over-engineered for PE-to-PE CCL |

**Evaluation**: an RDMA CQ suits the multi-tenant isolation of a host-facing NIC. In a single-owner environment between PEs it is unnecessary complexity. **Rejected.**
|
||||||
|
|
||||||
|
### 8.4 Credit-in-Data Piggyback (v2 optimization candidate)

In the current design, credit return is a separate 16B packet. In bidirectional communication patterns, **the credit can be merged into a data flit travelling in the reverse direction.**

```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit merged into the data
```

| Pros | Cons |
|------|------|
| Removes credit-only packets → saves NoC BW | Unidirectional patterns need a fallback |
| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minimal overhead) |
| | Slightly higher logic complexity |

**Evaluation**: a strong optimization of the current design. In bidirectional allreduce, credit packets can be eliminated entirely. The standalone credit fallback is retained. **Recommended for adoption in v2.**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Recommendations
|
||||||
|
|
||||||
|
1. **Adopt the current IPCQ-DMA co-design as the baseline hardware design.**
   It is simple, area-efficient, and shows no timing or power problems at 2nm.

2. **Constrain n_slots to a power of 2.**
   This replaces the mod operation with an AND mask and shortens the critical path.

3. **Assign dedicated banks to the IPCQ region in the TCM banking scheme.**
   This prevents bank conflicts with compute.

4. **Evaluate adding Credit-in-Data Piggyback (Section 8.4) in v2.**
   This removes credit overhead in bidirectional patterns.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 10. Open Questions
|
||||||
|
|
||||||
|
- [ ] What fraction of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
- [ ] Should credits get a dedicated VC, or be multiplexed onto vc_comm?
- [ ] Flit format compatibility on inter-SIP links needs to be verified
- [ ] Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
|
||||||
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
name = "kernbench"
|
name = "kernbench"
|
||||||
version = "0.1.0"
|
version = "0.1.0"
|
||||||
requires-python = ">=3.10"
|
requires-python = ">=3.10"
|
||||||
dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0"]
|
dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard]>=0.29", "websockets>=12", "numpy>=1.24", "greenlet>=3.0", "matplotlib>=3.7"]
|
||||||
|
|
||||||
[project.scripts]
|
[project.scripts]
|
||||||
kernbench = "kernbench.cli.main:main"
|
kernbench = "kernbench.cli.main:main"
|
||||||
|
|||||||
@@ -24,9 +24,7 @@ TOPO_NAME_TO_KIND = {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def kernel_args(world_size: int, n_elem: int) -> tuple:
|
def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
|
||||||
cube_w = 4
|
|
||||||
cube_h = 4
|
|
||||||
return (n_elem, cube_w, cube_h, world_size)
|
return (n_elem, cube_w, cube_h, world_size)
|
||||||
|
|
||||||
|
|
||||||
@@ -111,6 +109,11 @@ def allreduce_intercube_multidevice(
|
|||||||
):
|
):
|
||||||
"""Intercube all-reduce (pe0-only) with configurable SIP topology.
|
"""Intercube all-reduce (pe0-only) with configurable SIP topology.
|
||||||
|
|
||||||
|
Root cube sits at the geometric center (cube_w//2, cube_h//2) and
|
||||||
|
each phase converges bidirectionally so the intra-SIP critical path
|
||||||
|
is ~half what a corner-root walk would be (e.g., 4×4 mesh: 4 hops
|
||||||
|
reduce + 4 hops broadcast vs 6+6 with corner root).
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
t_ptr: VA base of the row-wise-sharded tensor on this SIP.
|
t_ptr: VA base of the row-wise-sharded tensor on this SIP.
|
||||||
n_elem: f16 elements per cube tile.
|
n_elem: f16 elements per cube tile.
|
||||||
@@ -127,61 +130,117 @@ def allreduce_intercube_multidevice(
|
|||||||
row = cube_id // cube_w
|
row = cube_id // cube_w
|
||||||
col = cube_id % cube_w
|
col = cube_id % cube_w
|
||||||
nbytes = n_elem * 2
|
nbytes = n_elem * 2
|
||||||
|
single_cube = (cube_w == 1 and cube_h == 1)
|
||||||
|
|
||||||
|
root_col = cube_w // 2
|
||||||
|
root_row = cube_h // 2
|
||||||
|
root_cube = root_row * cube_w + root_col
|
||||||
|
|
||||||
pe_addr = t_ptr + cube_id * nbytes
|
pe_addr = t_ptr + cube_id * nbytes
|
||||||
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
||||||
|
|
||||||
# ── Phase 1: row reduce W → E ──
|
if single_cube:
|
||||||
if col == 0:
|
# ── Single-cube mode: skip intra-SIP reduce, go directly to
|
||||||
|
# inter-SIP exchange (TP use case: one cube per rank). ──
|
||||||
|
if n_sips > 1:
|
||||||
|
if sip_topo_kind == SIP_TOPO_RING:
|
||||||
|
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
|
||||||
|
elif sip_topo_kind == SIP_TOPO_TORUS:
|
||||||
|
acc = _inter_sip_torus_2d(
|
||||||
|
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
||||||
|
elif sip_topo_kind == SIP_TOPO_MESH:
|
||||||
|
acc = _inter_sip_mesh_2d(
|
||||||
|
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
||||||
|
else:
|
||||||
|
# ── Multi-cube mode: center-root bidirectional reduce
|
||||||
|
# + inter-SIP exchange + bidirectional broadcast ──
|
||||||
|
|
||||||
|
# Phase 1: row reduce — converge at col == root_col.
|
||||||
|
# Left half (col < root_col) walks W→E; right half (col > root_col)
|
||||||
|
# walks E→W; the root_col cube merges both sides.
|
||||||
|
if col == 0 and root_col > 0:
|
||||||
tl.send(dir="E", src=acc)
|
tl.send(dir="E", src=acc)
|
||||||
elif col < cube_w - 1:
|
elif 0 < col < root_col:
|
||||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||||
acc = acc + recv
|
acc = acc + recv
|
||||||
tl.send(dir="E", src=acc)
|
tl.send(dir="E", src=acc)
|
||||||
else:
|
elif col == root_col:
|
||||||
|
if root_col > 0:
|
||||||
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
recv = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||||
acc = acc + recv
|
acc = acc + recv
|
||||||
|
if cube_w - 1 > root_col:
|
||||||
|
recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
|
||||||
|
acc = acc + recv
|
||||||
|
elif root_col < col < cube_w - 1:
|
||||||
|
recv = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
|
||||||
|
acc = acc + recv
|
||||||
|
tl.send(dir="W", src=acc)
|
||||||
|
elif col == cube_w - 1 and cube_w - 1 > root_col:
|
||||||
|
tl.send(dir="W", src=acc)
|
||||||
|
|
||||||
# ── Phase 2: col reduce N → S on rightmost column ──
|
# Phase 2: col reduce on col == root_col — converge at row == root_row.
|
||||||
if col == cube_w - 1:
|
if col == root_col:
|
||||||
if row == 0:
|
if row == 0 and root_row > 0:
|
||||||
tl.send(dir="S", src=acc)
|
tl.send(dir="S", src=acc)
|
||||||
elif row < cube_h - 1:
|
elif 0 < row < root_row:
|
||||||
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
|
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
|
||||||
acc = acc + recv
|
acc = acc + recv
|
||||||
tl.send(dir="S", src=acc)
|
tl.send(dir="S", src=acc)
|
||||||
else:
|
elif row == root_row:
|
||||||
|
if root_row > 0:
|
||||||
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
|
recv = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
|
||||||
acc = acc + recv
|
acc = acc + recv
|
||||||
|
if cube_h - 1 > root_row:
|
||||||
|
recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
|
||||||
|
acc = acc + recv
|
||||||
|
elif root_row < row < cube_h - 1:
|
||||||
|
recv = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
|
||||||
|
acc = acc + recv
|
||||||
|
tl.send(dir="N", src=acc)
|
||||||
|
elif row == cube_h - 1 and cube_h - 1 > root_row:
|
||||||
|
tl.send(dir="N", src=acc)
|
||||||
|
|
||||||
# ── Phase 3: inter-SIP exchange on root cube ──
|
# Phase 3: inter-SIP exchange on root cube.
|
||||||
root_cube = (cube_h - 1) * cube_w + (cube_w - 1)
|
|
||||||
if cube_id == root_cube and n_sips > 1:
|
if cube_id == root_cube and n_sips > 1:
|
||||||
if sip_topo_kind == SIP_TOPO_RING:
|
if sip_topo_kind == SIP_TOPO_RING:
|
||||||
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
|
acc = _inter_sip_ring(acc, n_sips, n_elem, tl)
|
||||||
elif sip_topo_kind == SIP_TOPO_TORUS:
|
elif sip_topo_kind == SIP_TOPO_TORUS:
|
||||||
acc = _inter_sip_torus_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
acc = _inter_sip_torus_2d(
|
||||||
|
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
||||||
elif sip_topo_kind == SIP_TOPO_MESH:
|
elif sip_topo_kind == SIP_TOPO_MESH:
|
||||||
acc = _inter_sip_mesh_2d(acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
acc = _inter_sip_mesh_2d(
|
||||||
|
acc, sip_rank, sip_topo_w, sip_topo_h, n_elem, tl)
|
||||||
|
|
||||||
# ── Phase 4: col broadcast S → N on rightmost column ──
|
# Phase 4: col broadcast on col == root_col, outward from root_row.
|
||||||
if col == cube_w - 1:
|
if col == root_col:
|
||||||
if row == cube_h - 1:
|
if row == root_row:
|
||||||
|
if root_row > 0:
|
||||||
tl.send(dir="N", src=acc)
|
tl.send(dir="N", src=acc)
|
||||||
elif row > 0:
|
if cube_h - 1 > root_row:
|
||||||
|
tl.send(dir="S", src=acc)
|
||||||
|
elif row < root_row:
|
||||||
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
|
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
|
||||||
|
if row > 0:
|
||||||
tl.send(dir="N", src=acc)
|
tl.send(dir="N", src=acc)
|
||||||
else:
|
elif row > root_row:
|
||||||
acc = tl.recv(dir="S", shape=(n_elem,), dtype="f16")
|
acc = tl.recv(dir="N", shape=(n_elem,), dtype="f16")
|
||||||
|
if row < cube_h - 1:
|
||||||
|
tl.send(dir="S", src=acc)
|
||||||
|
|
||||||
# ── Phase 5: row broadcast E → W ──
|
# Phase 5: row broadcast outward from root_col.
|
||||||
if col == cube_w - 1:
|
if col == root_col:
|
||||||
|
if root_col > 0:
|
||||||
tl.send(dir="W", src=acc)
|
tl.send(dir="W", src=acc)
|
||||||
elif col > 0:
|
if cube_w - 1 > root_col:
|
||||||
|
tl.send(dir="E", src=acc)
|
||||||
|
elif col < root_col:
|
||||||
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
|
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
|
||||||
|
if col > 0:
|
||||||
tl.send(dir="W", src=acc)
|
tl.send(dir="W", src=acc)
|
||||||
else:
|
elif col > root_col:
|
||||||
acc = tl.recv(dir="E", shape=(n_elem,), dtype="f16")
|
acc = tl.recv(dir="W", shape=(n_elem,), dtype="f16")
|
||||||
|
if col < cube_w - 1:
|
||||||
|
tl.send(dir="E", src=acc)
|
||||||
|
|
||||||
tl.store(pe_addr, acc)
|
tl.store(pe_addr, acc)
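
The hop counts quoted in the docstring (4 with a center root versus 6 with a corner root on a 4×4 mesh) can be reproduced with a small model. This sketch counts only intra-SIP mesh hops for the row phase followed by the column phase; `critical_path_hops` is a hypothetical helper and it ignores the inter-SIP exchange entirely.

```python
def critical_path_hops(cube_w: int, cube_h: int, root_col: int, root_row: int) -> int:
    """Hops for the slowest cube to reach the root: row phase, then column phase."""
    row_phase = max(root_col, cube_w - 1 - root_col)
    col_phase = max(root_row, cube_h - 1 - root_row)
    return row_phase + col_phase


# Center root as chosen by the kernel: (cube_w // 2, cube_h // 2).
center = critical_path_hops(4, 4, root_col=4 // 2, root_row=4 // 2)
# Corner root as in the previous version: (cube_w - 1, cube_h - 1).
corner = critical_path_hops(4, 4, root_col=3, root_row=3)
print(center, corner)  # 4 6
```

The broadcast phases retrace the same paths outward, so the same counts apply in the other direction.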
|
||||||
|
|
||||||
|
|||||||
@@ -221,6 +221,8 @@ def install_ipcq(
|
|||||||
|
|
||||||
_OPPOSITE_DIR = {
|
_OPPOSITE_DIR = {
|
||||||
"E": "W", "W": "E", "N": "S", "S": "N",
|
"E": "W", "W": "E", "N": "S", "S": "N",
|
||||||
|
"intra_E": "intra_W", "intra_W": "intra_E",
|
||||||
|
"intra_N": "intra_S", "intra_S": "intra_N",
|
||||||
"global_E": "global_W", "global_W": "global_E",
|
"global_E": "global_W", "global_W": "global_E",
|
||||||
"global_N": "global_S", "global_S": "global_N",
|
"global_N": "global_S", "global_S": "global_N",
|
||||||
}
|
}
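
One property worth checking after adding the `intra_*` entries is that the map stays an involution: the opposite of the opposite of a direction is the direction itself, and no direction is its own opposite. The sketch below copies the table from this hunk into a standalone constant (named `OPPOSITE_DIR` here purely for illustration).

```python
OPPOSITE_DIR = {
    "E": "W", "W": "E", "N": "S", "S": "N",
    "intra_E": "intra_W", "intra_W": "intra_E",
    "intra_N": "intra_S", "intra_S": "intra_N",
    "global_E": "global_W", "global_W": "global_E",
    "global_N": "global_S", "global_S": "global_N",
}


def is_involution(table: dict[str, str]) -> bool:
    """True when every direction maps to a distinct direction that maps back."""
    return all(table.get(table[d]) == d and table[d] != d for d in table)


assert is_involution(OPPOSITE_DIR)
```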
|
||||||
|
|||||||
@@ -1,22 +1,24 @@
|
|||||||
"""SFR configuration for intercube + inter-SIP IPCQ wiring.
|
"""SFR configuration for the full IPCQ hardware wiring.
|
||||||
|
|
||||||
Provides ``configure_sfr_intercube_multisip`` which programs PE_IPCQ
|
Installs PE_IPCQ neighbor tables modeling the physical hardware.
|
||||||
neighbor tables for:
|
Wiring is independent of DPPolicy / kernel choice — the kernel decides
|
||||||
|
at runtime which links to use.
|
||||||
|
|
||||||
1. Intercube within each SIP — pe0 of every cube connects to pe0 of
|
Direction label namespaces (disjoint):
|
||||||
its N/S/E/W mesh neighbors (no wrap-around).
|
|
||||||
2. Inter-SIP on ALL cubes — pe0 of cube_c on sip_A connects to pe0 of
|
|
||||||
cube_c on each peer SIP, using ``global_E``/``global_W`` (ring) or
|
|
||||||
``global_N``/``global_S``/``global_E``/``global_W`` (mesh/torus)
|
|
||||||
direction labels. Wiring all cubes allows the kernel to
|
|
||||||
dynamically elect the root cube at runtime.
|
|
||||||
|
|
||||||
SIP-level topology is read from ``topology.yaml`` →
|
- Intra-cube PE-to-PE: ``intra_N / intra_S / intra_E / intra_W``
|
||||||
``system.sips.topology`` (e.g. ``ring_1d``, ``mesh_2d``).
|
Logical 2×4 PE grid within a cube (no wrap):
|
||||||
Intercube mesh dimensions come from ``sip.cube_mesh.w/h``.
|
|
||||||
|
|
||||||
Internally delegates to ``install_ipcq`` with a computed ``rank_to_pe``
|
Row 0: pe0 pe1 pe2 pe3
|
||||||
(pe0-only) and a closure-captured ``neighbors()`` function.
|
Row 1: pe4 pe5 pe6 pe7
|
||||||
|
|
||||||
|
- Intercube same-lane: ``N / S / E / W``
|
||||||
|
``pe_i of cube_A ↔ pe_i of cube_B`` across the 4×4 cube mesh
|
||||||
|
(no wrap). Every PE i ∈ [0..7] wired independently.
|
||||||
|
|
||||||
|
- Inter-SIP same-(cube, pe): ``global_N / global_S / global_E / global_W``
|
||||||
|
``pe_i of cube_c on sip_A ↔ pe_i of cube_c on sip_B`` per
|
||||||
|
``topology.yaml → system.sips.topology``.
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
@@ -27,12 +29,46 @@ from kernbench.ccl.install import install_ipcq
|
|||||||
from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
|
from kernbench.ccl.topologies import _BUILTIN as _TOPO_BUILTINS
|
||||||
|
|
||||||
|
|
||||||
|
# ── Intra-cube 2×4 PE grid ───────────────────────────────────────────
|
||||||
|
|
||||||
|
_PE_GRID_COLS = 4
|
||||||
|
_PE_GRID_ROWS = 2
|
||||||
|
_PES_PER_CUBE = _PE_GRID_COLS * _PE_GRID_ROWS # 8
|
||||||
|
|
||||||
|
|
||||||
|
def _intra_cube_neighbors(pe: int) -> dict[str, int]:
|
||||||
|
"""Logical 2×4 PE grid neighbors within a cube (no wrap).
|
||||||
|
|
||||||
|
Returns directions in the ``intra_*`` namespace.
|
||||||
|
"""
|
||||||
|
row, col = divmod(pe, _PE_GRID_COLS)
|
||||||
|
nbrs: dict[str, int] = {}
|
||||||
|
if col < _PE_GRID_COLS - 1:
|
||||||
|
nbrs["intra_E"] = row * _PE_GRID_COLS + (col + 1)
|
||||||
|
if col > 0:
|
||||||
|
nbrs["intra_W"] = row * _PE_GRID_COLS + (col - 1)
|
||||||
|
if row < _PE_GRID_ROWS - 1:
|
||||||
|
nbrs["intra_S"] = (row + 1) * _PE_GRID_COLS + col
|
||||||
|
if row > 0:
|
||||||
|
nbrs["intra_N"] = (row - 1) * _PE_GRID_COLS + col
|
||||||
|
return nbrs
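
For reference, here is the same helper with its indentation restored, followed by two sample lookups. The diff view drops leading whitespace, so the indentation below is a reconstruction; the names are given without the leading underscore only to make the sketch standalone.

```python
PE_GRID_COLS = 4
PE_GRID_ROWS = 2


def intra_cube_neighbors(pe: int) -> dict[str, int]:
    """Neighbors of a PE in the logical 2x4 grid (no wrap-around)."""
    row, col = divmod(pe, PE_GRID_COLS)
    nbrs: dict[str, int] = {}
    if col < PE_GRID_COLS - 1:
        nbrs["intra_E"] = row * PE_GRID_COLS + (col + 1)
    if col > 0:
        nbrs["intra_W"] = row * PE_GRID_COLS + (col - 1)
    if row < PE_GRID_ROWS - 1:
        nbrs["intra_S"] = (row + 1) * PE_GRID_COLS + col
    if row > 0:
        nbrs["intra_N"] = (row - 1) * PE_GRID_COLS + col
    return nbrs


print(intra_cube_neighbors(0))  # {'intra_E': 1, 'intra_S': 4}
print(intra_cube_neighbors(5))  # {'intra_E': 6, 'intra_W': 4, 'intra_N': 1}
```

Corner PEs get two neighbors and the remaining PEs get three, since the grid has only two rows.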
|
||||||
|
|
||||||
|
|
||||||
|
# ── Public entry point ───────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
def configure_sfr_intercube_multisip(
|
def configure_sfr_intercube_multisip(
|
||||||
engine: Any,
|
engine: Any,
|
||||||
spec: dict,
|
spec: dict,
|
||||||
cfg: dict,
|
cfg: dict,
|
||||||
) -> dict[str, Any]:
|
) -> dict[str, Any]:
|
||||||
"""Wire IPCQ for intercube (pe0, mesh) + inter-SIP (pe0, all cubes).
|
"""Wire the full IPCQ hardware model.
|
||||||
|
|
||||||
|
Every PE on every cube on every SIP gets neighbor table entries for:
|
||||||
|
|
||||||
|
- intra-cube (2×4 grid) in the ``intra_*`` namespace
|
||||||
|
- intercube same-lane (4×4 cube mesh, no wrap) in ``N/S/E/W``
|
||||||
|
- inter-SIP same-(cube, pe) in ``global_*``
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
engine: GraphEngine with ``_components``.
|
engine: GraphEngine with ``_components``.
|
||||||
@@ -46,48 +82,71 @@ def configure_sfr_intercube_multisip(
|
|||||||
mesh_w = int(cm["w"])
|
mesh_w = int(cm["w"])
|
||||||
mesh_h = int(cm["h"])
|
mesh_h = int(cm["h"])
|
||||||
n_cubes = mesh_w * mesh_h
|
n_cubes = mesh_w * mesh_h
|
||||||
n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
sips_cfg = spec.get("system", {}).get("sips", {})
|
||||||
sip_topology = str(
|
n_sips = int(sips_cfg.get("count", 1))
|
||||||
spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
|
sip_topology = str(sips_cfg.get("topology", "ring_1d"))
|
||||||
)
|
sip_w = sips_cfg.get("w")
|
||||||
|
sip_h = sips_cfg.get("h")
|
||||||
|
sip_w = int(sip_w) if sip_w is not None else None
|
||||||
|
sip_h = int(sip_h) if sip_h is not None else None
|
||||||
|
|
||||||
if sip_topology not in _TOPO_BUILTINS:
|
if sip_topology not in _TOPO_BUILTINS:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"Unknown sip topology '{sip_topology}'. "
|
f"Unknown sip topology '{sip_topology}'. "
|
||||||
f"Available: {list(_TOPO_BUILTINS)}"
|
f"Available: {list(_TOPO_BUILTINS)}"
|
||||||
)
|
)
|
||||||
sip_topo_fn = _TOPO_BUILTINS[sip_topology]
|
_sip_topo_fn_raw = _TOPO_BUILTINS[sip_topology]
|
||||||
|
|
||||||
world_size = n_sips * n_cubes
|
def sip_topo_fn(rank: int, ws: int) -> dict:
|
||||||
|
if sip_w is not None and sip_h is not None:
|
||||||
|
try:
|
||||||
|
return _sip_topo_fn_raw(rank, ws, w=sip_w, h=sip_h)
|
||||||
|
except TypeError:
|
||||||
|
pass
|
||||||
|
return _sip_topo_fn_raw(rank, ws)
|
||||||
|
|
||||||
|
pes_per_cube = _PES_PER_CUBE
|
||||||
|
world_size = n_sips * n_cubes * pes_per_cube
|
||||||
pe_idx_to_pe: list[tuple[int, int, int]] = [
|
pe_idx_to_pe: list[tuple[int, int, int]] = [
|
||||||
(sip, cube, 0)
|
(sip, cube, pe)
|
||||||
for sip in range(n_sips)
|
for sip in range(n_sips)
|
||||||
for cube in range(n_cubes)
|
for cube in range(n_cubes)
|
||||||
|
for pe in range(pes_per_cube)
|
||||||
]
|
]
|
||||||
|
|
||||||
|
def _pe_idx(sip: int, cube: int, pe: int) -> int:
|
||||||
|
return (sip * n_cubes + cube) * pes_per_cube + pe
|
||||||
|
|
||||||
def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
|
def _neighbors(pe_idx: int, ws: int, _base: dict) -> dict[str, int]:
|
||||||
sip = pe_idx // n_cubes
|
tmp = pe_idx
|
||||||
cube = pe_idx % n_cubes
|
pe = tmp % pes_per_cube
|
||||||
|
tmp //= pes_per_cube
|
||||||
|
cube = tmp % n_cubes
|
||||||
|
sip = tmp // n_cubes
|
||||||
row = cube // mesh_w
|
row = cube // mesh_w
|
||||||
col = cube % mesh_w
|
col = cube % mesh_w
|
||||||
|
|
||||||
nbrs: dict[str, int] = {}
|
nbrs: dict[str, int] = {}
|
||||||
|
|
||||||
# Intercube within SIP (mesh, no wrap-around)
|
# ── Intra-cube (intra_N/S/E/W) ──
|
||||||
if col < mesh_w - 1:
|
for d, peer_pe in _intra_cube_neighbors(pe).items():
|
||||||
nbrs["E"] = sip * n_cubes + (row * mesh_w + col + 1)
|
nbrs[d] = _pe_idx(sip, cube, peer_pe)
|
||||||
if col > 0:
|
|
||||||
nbrs["W"] = sip * n_cubes + (row * mesh_w + col - 1)
|
|
||||||
if row < mesh_h - 1:
|
|
||||||
nbrs["S"] = sip * n_cubes + ((row + 1) * mesh_w + col)
|
|
||||||
if row > 0:
|
|
||||||
nbrs["N"] = sip * n_cubes + ((row - 1) * mesh_w + col)
|
|
||||||
|
|
||||||
# Inter-SIP on ALL cubes
|
# ── Intercube same-lane (N/S/E/W, 4×4 no wrap) ──
|
||||||
|
if col < mesh_w - 1:
|
||||||
|
nbrs["E"] = _pe_idx(sip, row * mesh_w + (col + 1), pe)
|
||||||
|
if col > 0:
|
||||||
|
nbrs["W"] = _pe_idx(sip, row * mesh_w + (col - 1), pe)
|
||||||
|
if row < mesh_h - 1:
|
||||||
|
nbrs["S"] = _pe_idx(sip, (row + 1) * mesh_w + col, pe)
|
||||||
|
if row > 0:
|
||||||
|
nbrs["N"] = _pe_idx(sip, (row - 1) * mesh_w + col, pe)
|
||||||
|
|
||||||
|
# ── Inter-SIP same-(cube, pe) (global_*) ──
|
||||||
if n_sips > 1:
|
if n_sips > 1:
|
||||||
sip_nbrs = sip_topo_fn(sip, n_sips)
|
sip_nbrs = sip_topo_fn(sip, n_sips)
|
||||||
for d, peer_sip in sip_nbrs.items():
|
for d, peer_sip in sip_nbrs.items():
|
||||||
nbrs[f"global_{d}"] = peer_sip * n_cubes + cube
|
nbrs[f"global_{d}"] = _pe_idx(peer_sip, cube, pe)
|
||||||
|
|
||||||
return nbrs
|
return nbrs
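
The flat index used by `_pe_idx` and unpacked at the top of `_neighbors` is a mixed-radix encoding of `(sip, cube, pe)`. A minimal round-trip sketch, assuming 16 cubes per SIP and 8 PEs per cube as in the docstring (the function names here are illustrative, not the module's):

```python
def pe_idx(sip: int, cube: int, pe: int, n_cubes: int, pes_per_cube: int) -> int:
    """Flatten (sip, cube, pe) with pe as the fastest-varying digit."""
    return (sip * n_cubes + cube) * pes_per_cube + pe


def decode_pe_idx(idx: int, n_cubes: int, pes_per_cube: int) -> tuple[int, int, int]:
    """Inverse of pe_idx: peel off pe, then cube, leaving sip."""
    pe = idx % pes_per_cube
    rest = idx // pes_per_cube
    return rest // n_cubes, rest % n_cubes, pe


idx = pe_idx(sip=1, cube=5, pe=3, n_cubes=16, pes_per_cube=8)
print(idx)                        # 171
print(decode_pe_idx(idx, 16, 8))  # (1, 5, 3)
```

This ordering matches the nested loops that build `pe_idx_to_pe`, so the list position of each tuple equals its flat index.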
|
||||||
|
|
||||||
|
|||||||
@@ -33,23 +33,41 @@ def ring_1d_unidir(rank: int, world_size: int) -> NeighborMap:
|
|||||||
return {"E": (rank + 1) % world_size}
|
return {"E": (rank + 1) % world_size}
|
||||||
|
|
||||||
|
|
||||||
def mesh_2d(rank: int, world_size: int) -> NeighborMap:
|
def _resolve_2d_dims(
|
||||||
"""Square 2D mesh (N/S/E/W).
|
world_size: int, w: int | None, h: int | None, name: str,
|
||||||
|
) -> tuple[int, int]:
|
||||||
Layout: rank = row * side + col, with side = sqrt(world_size).
|
if w is not None and h is not None:
|
||||||
Wrap-around (torus) on all four edges.
|
if w * h != world_size:
|
||||||
"""
|
raise ValueError(
|
||||||
|
f"{name}: w*h ({w}*{h}) != world_size ({world_size})"
|
||||||
|
)
|
||||||
|
return w, h
|
||||||
side = int(round(world_size ** 0.5))
|
side = int(round(world_size ** 0.5))
|
||||||
if side * side != world_size:
|
if side * side != world_size:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"mesh_2d requires square world_size, got {world_size}"
|
f"{name} requires square world_size or explicit w,h, "
|
||||||
|
f"got {world_size}"
|
||||||
)
|
)
|
||||||
r, c = divmod(rank, side)
|
return side, side
|
||||||
|
|
||||||
|
|
||||||
|
def mesh_2d(
|
||||||
|
rank: int, world_size: int,
|
||||||
|
w: int | None = None, h: int | None = None,
|
||||||
|
) -> NeighborMap:
|
||||||
|
"""2D mesh (N/S/E/W) with wrap-around on all four edges.
|
||||||
|
|
||||||
|
Layout: rank = row * w + col. When w, h are given, supports
|
||||||
|
rectangular (e.g. 2x3) layouts. Otherwise falls back to square
|
||||||
|
side = sqrt(world_size).
|
||||||
|
"""
|
||||||
|
w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d")
|
||||||
|
r, c = divmod(rank, w)
|
||||||
return {
|
return {
|
||||||
"N": ((r - 1) % side) * side + c,
|
"N": ((r - 1) % h) * w + c,
|
||||||
"S": ((r + 1) % side) * side + c,
|
"S": ((r + 1) % h) * w + c,
|
||||||
"W": r * side + (c - 1) % side,
|
"W": r * w + (c - 1) % w,
|
||||||
"E": r * side + (c + 1) % side,
|
"E": r * w + (c + 1) % w,
|
||||||
}
|
}
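
A standalone sketch of the rectangular wrap-around layout may help when checking neighbor tables by hand. It restates the logic of this hunk with indentation restored (the diff view drops it) and plain `dict` in place of `NeighborMap`; the sample uses w=3, h=2.

```python
from __future__ import annotations


def resolve_2d_dims(world_size: int, w: int | None, h: int | None, name: str) -> tuple[int, int]:
    """Use explicit w, h when given; otherwise require a square world_size."""
    if w is not None and h is not None:
        if w * h != world_size:
            raise ValueError(f"{name}: w*h ({w}*{h}) != world_size ({world_size})")
        return w, h
    side = int(round(world_size ** 0.5))
    if side * side != world_size:
        raise ValueError(
            f"{name} requires square world_size or explicit w,h, got {world_size}"
        )
    return side, side


def mesh_2d(rank: int, world_size: int, w: int | None = None, h: int | None = None) -> dict[str, int]:
    """Wrap-around neighbors with layout rank = row * w + col."""
    w, h = resolve_2d_dims(world_size, w, h, "mesh_2d")
    r, c = divmod(rank, w)
    return {
        "N": ((r - 1) % h) * w + c,
        "S": ((r + 1) % h) * w + c,
        "W": r * w + (c - 1) % w,
        "E": r * w + (c + 1) % w,
    }


print(mesh_2d(0, 6, w=3, h=2))  # {'N': 3, 'S': 3, 'W': 2, 'E': 1}
```

With only two rows, N and S wrap to the same peer, which is worth keeping in mind when a kernel treats the two directions as distinct links.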
|
||||||
|
|
||||||
|
|
||||||
@@ -73,36 +91,30 @@ def tree_binary(rank: int, world_size: int) -> NeighborMap:
|
|||||||
return n
|
return n
|
||||||
|
|
||||||
|
|
||||||
def torus_2d(rank: int, world_size: int) -> NeighborMap:
|
def torus_2d(
|
||||||
"""Square 2D torus (N/S/E/W) with wrap-around on all edges.
|
rank: int, world_size: int,
|
||||||
|
w: int | None = None, h: int | None = None,
|
||||||
Alias for mesh_2d (which already wraps). Explicit name for clarity
|
) -> NeighborMap:
|
||||||
when used as a SIP-level topology.
|
"""2D torus (N/S/E/W) with wrap-around on all edges. Alias for mesh_2d."""
|
||||||
"""
|
return mesh_2d(rank, world_size, w=w, h=h)
|
||||||
return mesh_2d(rank, world_size)
|
|
||||||
|
|
||||||
|
|
||||||
def mesh_2d_no_wrap(rank: int, world_size: int) -> NeighborMap:
|
def mesh_2d_no_wrap(
|
||||||
"""Square 2D mesh (N/S/E/W) WITHOUT wrap-around.
|
rank: int, world_size: int,
|
||||||
|
w: int | None = None, h: int | None = None,
|
||||||
Edge nodes have fewer neighbors (no wrapping). Used for SIP-level
|
) -> NeighborMap:
|
||||||
topologies where physical links don't wrap.
|
"""2D mesh (N/S/E/W) WITHOUT wrap-around. Supports rectangular dims."""
|
||||||
"""
|
w, h = _resolve_2d_dims(world_size, w, h, "mesh_2d_no_wrap")
|
||||||
side = int(round(world_size ** 0.5))
|
r, c = divmod(rank, w)
|
||||||
if side * side != world_size:
|
|
||||||
raise ValueError(
|
|
||||||
f"mesh_2d_no_wrap requires square world_size, got {world_size}"
|
|
||||||
)
|
|
||||||
r, c = divmod(rank, side)
|
|
||||||
n: NeighborMap = {}
|
n: NeighborMap = {}
|
||||||
if r > 0:
|
if r > 0:
|
||||||
n["N"] = (r - 1) * side + c
|
n["N"] = (r - 1) * w + c
|
||||||
if r < side - 1:
|
if r < h - 1:
|
||||||
n["S"] = (r + 1) * side + c
|
n["S"] = (r + 1) * w + c
|
||||||
if c > 0:
|
if c > 0:
|
||||||
n["W"] = r * side + (c - 1)
|
n["W"] = r * w + (c - 1)
|
||||||
if c < side - 1:
|
if c < w - 1:
|
||||||
n["E"] = r * side + (c + 1)
|
n["E"] = r * w + (c + 1)
|
||||||
return n
|
return n
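
The no-wrap variant differs only at the edges, where a direction is omitted instead of wrapping. A minimal sketch with explicit dimensions (the dimension-resolution step is left out and `no_wrap_neighbors` is an illustrative name):

```python
def no_wrap_neighbors(rank: int, w: int, h: int) -> dict[str, int]:
    """Neighbors in a w-by-h mesh without wrap-around; edge nodes get fewer entries."""
    r, c = divmod(rank, w)
    n: dict[str, int] = {}
    if r > 0:
        n["N"] = (r - 1) * w + c
    if r < h - 1:
        n["S"] = (r + 1) * w + c
    if c > 0:
        n["W"] = r * w + (c - 1)
    if c < w - 1:
        n["E"] = r * w + (c + 1)
    return n


print(no_wrap_neighbors(0, w=3, h=2))  # {'S': 3, 'E': 1}
print(no_wrap_neighbors(4, w=3, h=2))  # {'N': 1, 'W': 3, 'E': 5}
```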
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -23,7 +23,7 @@ def _hbm_pa(sip: int, cube: int, pe_id: int, spec: dict) -> int:
|
|||||||
mm = spec["cube"]["memory_map"]
|
mm = spec["cube"]["memory_map"]
|
||||||
slice_bytes = mm["hbm_total_gb_per_cube"] * (1 << 30) // mm["hbm_slices_per_cube"]
|
slice_bytes = mm["hbm_total_gb_per_cube"] * (1 << 30) // mm["hbm_slices_per_cube"]
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
|
sip_id=sip, die_id=cube, pe_id=pe_id,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
return pa.encode()
|
return pa.encode()
|
||||||
|
|||||||
@@ -31,6 +31,26 @@ class IpcqInvalidDirection(ValueError):
|
|||||||
has no neighbor installed for this PE."""
|
has no neighbor installed for this PE."""
|
||||||
|
|
||||||
|
|
||||||
|
# ── ADR-0023 D9.7: IPCQ slot-memory latency model ───────────────────
|
||||||
|
#
|
||||||
|
# Per-tier (bw_gbs, overhead_ns) used to charge the slot write (inbound)
|
||||||
|
# and slot read (recv consume). Mirrors topology.yaml component values.
|
||||||
|
_BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
|
||||||
|
"tcm": (512.0, 0.0),
|
||||||
|
"sram": (512.0, 2.0),
|
||||||
|
"hbm": (256.0, 6.0),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
|
||||||
|
"""Per-access latency for one slot read/write of ``nbytes`` against
|
||||||
|
the IPCQ backing memory tier (``buffer_kind``)."""
|
||||||
|
bw_gbs, overhead_ns = _BUFFER_KIND_BW.get(
|
||||||
|
buffer_kind, _BUFFER_KIND_BW["tcm"],
|
||||||
|
)
|
||||||
|
return float(nbytes) / bw_gbs + overhead_ns
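
A worked example of the latency formula, restated with indentation restored. Dividing bytes by GB/s yields nanoseconds only if 1 GB is read as 10^9 bytes (so 1 GB/s equals 1 byte/ns); that reading is an assumption of this sketch, as is the 4096-byte slot size.

```python
BUFFER_KIND_BW: dict[str, tuple[float, float]] = {
    "tcm": (512.0, 0.0),   # (bandwidth in GB/s, fixed overhead in ns)
    "sram": (512.0, 2.0),
    "hbm": (256.0, 6.0),
}


def slot_io_latency_ns(buffer_kind: str, nbytes: int) -> float:
    """Transfer time plus fixed overhead; unknown tiers fall back to TCM."""
    bw_gbs, overhead_ns = BUFFER_KIND_BW.get(buffer_kind, BUFFER_KIND_BW["tcm"])
    return float(nbytes) / bw_gbs + overhead_ns


for kind in ("tcm", "sram", "hbm"):
    print(kind, slot_io_latency_ns(kind, 4096))
# tcm 8.0
# sram 10.0
# hbm 22.0
```

Note the silent fallback: a misspelled `buffer_kind` is charged at TCM rates instead of raising, which can hide configuration mistakes.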
|
||||||
|
|
||||||
|
|
||||||
# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
|
# ── D2.5: IpcqEndpoint ───────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -58,7 +58,18 @@ class IoCpuComponent(ComponentBase):
|
|||||||
self._pending[key] = (expected, received, parent_done)
|
self._pending[key] = (expected, received, parent_done)
|
||||||
|
|
||||||
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
||||||
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
|
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses.
|
||||||
|
|
||||||
|
ADR-0009 D5 (extended): for KernelLaunchMsg, stamp a single global
|
||||||
|
target_start_ns = env.now + max(IO_CPU → any target PE_CPU path
|
||||||
|
latency across all target cubes). M_CPU passes this value through
|
||||||
|
unchanged; every PE in every cube yields until the same sim-time
|
||||||
|
before beginning kernel execution. Without this, cross-cube
|
||||||
|
launches would have each cube's M_CPU compute its own per-cube
|
||||||
|
barrier relative to its local env.now, leaving PEs on different
|
||||||
|
cubes out of sync (the "h3/h4 dispatch-offset artifact").
|
||||||
|
"""
|
||||||
|
import dataclasses
|
||||||
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
||||||
|
|
||||||
request = txn.request
|
request = txn.request
|
||||||
@@ -72,10 +83,60 @@ class IoCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
|
# For KernelLaunchMsg, compute the global barrier once here so
|
||||||
|
# every downstream PE_CPU uses the same target_start_ns.
|
||||||
|
if isinstance(request, KernelLaunchMsg):
|
||||||
|
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
|
||||||
|
global_max_latency = 0.0
|
||||||
|
pe_ids = self._resolve_pe_ids(
|
||||||
|
getattr(request, "target_pe", "all")
|
||||||
|
)
|
||||||
|
for sip, cube in cube_targets:
|
||||||
|
try:
|
||||||
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
|
io_to_m_path = self.ctx.router.find_node_path(
|
||||||
|
self.node.id, m_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(io_to_m_path) < 2:
|
||||||
|
continue
|
||||||
|
leg1 = self.ctx.compute_path_latency_ns(
|
||||||
|
io_to_m_path, nbytes=0,
|
||||||
|
)
|
||||||
|
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
|
||||||
|
for pe_id in pe_ids:
|
||||||
|
pe_cpu_id = (
|
||||||
|
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
m_to_pe_path = self.ctx.router.find_node_path(
|
||||||
|
m_cpu_id, pe_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(m_to_pe_path) < 2:
|
||||||
|
continue
|
||||||
|
leg2 = self.ctx.compute_path_latency_ns(
|
||||||
|
m_to_pe_path, nbytes=0,
|
||||||
|
)
|
||||||
|
latency = leg1 + leg2 - io_overhead - m_overhead
|
||||||
|
if latency > global_max_latency:
|
||||||
|
global_max_latency = latency
|
||||||
|
request = dataclasses.replace(
|
||||||
|
request,
|
||||||
|
target_start_ns=float(env.now) + global_max_latency,
|
||||||
|
)
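
The barrier stamping above reduces to "current time plus the worst path latency", written into a copy of the request. The sketch below isolates that step; `LaunchMsg` is a hypothetical stand-in for `KernelLaunchMsg`, and the latencies are made-up inputs in place of the router and `compute_path_latency_ns` calls.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass(frozen=True)
class LaunchMsg:
    """Hypothetical stand-in for KernelLaunchMsg."""
    request_id: int
    target_start_ns: float | None = None


def stamp_global_barrier(request: LaunchMsg, now_ns: float, path_latencies_ns: list[float]) -> LaunchMsg:
    """Return a copy whose start time is late enough for the slowest path."""
    worst = max(path_latencies_ns, default=0.0)
    return dataclasses.replace(request, target_start_ns=float(now_ns) + worst)


msg = LaunchMsg(request_id=7)
stamped = stamp_global_barrier(msg, now_ns=100.0, path_latencies_ns=[12.0, 42.5, 30.0])
print(stamped.target_start_ns)  # 142.5
print(msg.target_start_ns)      # None
```

Because `dataclasses.replace` builds a new object, the original request is left untouched and every sub-transaction can share the one stamped copy.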
|
||||||
|
|
||||||
# Setup aggregation
|
# Setup aggregation
|
||||||
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
||||||
|
|
||||||
# Fan out to each target cube's M_CPU
|
# Fan out to each target cube's M_CPU. Kernel-launch fanout
|
||||||
|
# carries control metadata only; nbytes is forced to 0 for
|
||||||
|
# KernelLaunchMsg so the launch sub-txns do not occupy data-fabric
|
||||||
|
# BW (would otherwise serialize 16 cubes worth of fanout on the
|
||||||
|
# shared first hop and break ADR-0009 D5's barrier prediction).
|
||||||
|
is_kernel_launch = isinstance(request, KernelLaunchMsg)
|
||||||
for sip, cube in cube_targets:
|
for sip, cube in cube_targets:
|
||||||
try:
|
try:
|
||||||
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
@@ -86,11 +147,25 @@ class IoCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=request, path=path, step=0,
|
||||||
nbytes=txn.nbytes, done=env.event(),
|
nbytes=0 if is_kernel_launch else txn.nbytes,
|
||||||
|
done=env.event(),
|
||||||
result_data=txn.result_data,
|
result_data=txn.result_data,
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
|
|
||||||
|
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
|
||||||
|
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
|
||||||
|
if isinstance(target_pe, int):
|
||||||
|
return [target_pe]
|
||||||
|
if isinstance(target_pe, tuple):
|
||||||
|
return list(target_pe)
|
||||||
|
# "all": all PEs in a cube
|
||||||
|
n_slices = 8
|
||||||
|
if self.ctx and self.ctx.spec:
|
||||||
|
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
|
||||||
|
n_slices = mm.get("hbm_slices_per_cube", 8)
|
||||||
|
return list(range(n_slices))
|
||||||
|
|
||||||
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
||||||
"""Return list of (sip, cube) pairs to fan out to."""
|
"""Return list of (sip, cube) pairs to fan out to."""
|
||||||
from kernbench.runtime_api.kernel import (
|
from kernbench.runtime_api.kernel import (
|
||||||
@@ -145,10 +220,10 @@ class IoCpuComponent(ComponentBase):
|
|||||||
return []
|
return []
|
||||||
|
|
||||||
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
|
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
|
||||||
"""Extract cube_id from a physical address, with fallback."""
|
"""Extract die_id from a physical address, with fallback."""
|
||||||
from kernbench.policy.address.phyaddr import PhysAddr
|
from kernbench.policy.address.phyaddr import PhysAddr
|
||||||
try:
|
try:
|
||||||
return PhysAddr.decode(pa_val).cube_id
|
return PhysAddr.decode(pa_val).die_id
|
||||||
except Exception:
|
except Exception:
|
||||||
return fallback
|
return fallback
|
||||||
|
|
||||||
|
|||||||
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
|
|||||||
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
||||||
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
||||||
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
ADR-0009 D5: stamps target_start_ns so every PE in this fanout
|
||||||
|
starts executing at the same env.now regardless of dispatch path.
|
||||||
"""
|
"""
|
||||||
|
import dataclasses
|
||||||
request = txn.request
|
request = txn.request
|
||||||
target_pe = getattr(request, "target_pe", "all")
|
target_pe = getattr(request, "target_pe", "all")
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
@@ -172,9 +176,13 @@ class MCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
# Fan out to each PE_CPU, using response-based aggregation
|
# Resolve per-PE paths. If IO_CPU already stamped a global
|
||||||
sub_txns: list[Transaction] = []
|
# target_start_ns (ADR-0009 D5 extended), pass it through
|
||||||
n_dispatched = 0
|
# unchanged so every PE across every cube uses the same barrier.
|
||||||
|
# Otherwise (e.g. direct-to-M_CPU launch in a unit test) compute
|
||||||
|
# a per-cube barrier from env.now.
|
||||||
|
per_pe: list[tuple[int, list[str], float]] = []
|
||||||
|
max_latency = 0.0
|
||||||
for pe_id in pe_ids:
|
for pe_id in pe_ids:
|
||||||
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
||||||
try:
|
try:
|
||||||
@@ -183,8 +191,24 @@ class MCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
if len(path) < 2:
|
if len(path) < 2:
|
||||||
continue
|
continue
|
||||||
|
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
|
||||||
|
per_pe.append((pe_id, path, latency))
|
||||||
|
if latency > max_latency:
|
||||||
|
max_latency = latency
|
||||||
|
|
||||||
|
if getattr(request, "target_start_ns", None) is not None:
|
||||||
|
stamped_request = request
|
||||||
|
else:
|
||||||
|
stamped_request = dataclasses.replace(
|
||||||
|
request, target_start_ns=float(env.now) + max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Fan out to each PE_CPU, using response-based aggregation
|
||||||
|
sub_txns: list[Transaction] = []
|
||||||
|
n_dispatched = 0
|
||||||
|
for pe_id, path, _lat in per_pe:
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=stamped_request, path=path, step=0,
|
||||||
nbytes=0, done=env.event(),
|
nbytes=0, done=env.event(),
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
@@ -204,16 +228,21 @@ class MCpuComponent(ComponentBase):
|
|||||||
yield all_done
|
yield all_done
|
||||||
del self._parent_txns[request.request_id]
|
del self._parent_txns[request.request_id]
|
||||||
|
|
||||||
# Aggregate PE-internal metrics (max across PEs)
|
# Aggregate PE-internal metrics (max across PEs and across cubes).
|
||||||
|
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
|
||||||
|
# merge against the existing value so cubes don't clobber each other.
|
||||||
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
||||||
if pe_exec_values:
|
if pe_exec_values:
|
||||||
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
|
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
|
||||||
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
||||||
if dma_values:
|
if dma_values:
|
||||||
txn.result_data["dma_ns"] = max(dma_values)
|
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["dma_ns"] = max(cur, max(dma_values))
|
||||||
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
||||||
if compute_values:
|
if compute_values:
|
||||||
txn.result_data["compute_ns"] = max(compute_values)
|
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["compute_ns"] = max(cur, max(compute_values))
|
||||||
|
|
||||||
# Send aggregate response on reverse command path back to IO_CPU
|
# Send aggregate response on reverse command path back to IO_CPU
|
||||||
reverse_path = list(reversed(txn.path))
|
reverse_path = list(reversed(txn.path))
|
||||||
|
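Review note: the barrier stamping in this hunk can be sketched in isolation as below. This is a minimal sketch, not the project's code: `LaunchMsg` and `stamp_barrier` are hypothetical stand-ins for `KernelLaunchMsg` and the inline logic in `MCpuComponent`, and only the `target_start_ns` field is modelled.

```python
import dataclasses
from typing import Optional, Sequence


@dataclasses.dataclass(frozen=True)
class LaunchMsg:
    """Hypothetical stand-in for KernelLaunchMsg (only the fields used here)."""
    request_id: str
    target_start_ns: Optional[float] = None


def stamp_barrier(request: LaunchMsg, now_ns: float,
                  per_pe_latency_ns: Sequence[float]) -> LaunchMsg:
    """Return a request whose target_start_ns is now + the slowest dispatch latency.

    An upstream stamp (e.g. from IO_CPU) is passed through unchanged so that
    every cube shares one barrier.
    """
    if request.target_start_ns is not None:
        return request
    max_latency = max(per_pe_latency_ns, default=0.0)
    return dataclasses.replace(request, target_start_ns=float(now_ns) + max_latency)


# Direct-to-M_CPU launch: no upstream stamp, so a per-cube barrier is computed.
stamped = stamp_barrier(LaunchMsg("r0"), now_ns=100.0, per_pe_latency_ns=[4.0, 9.5, 7.0])
# Launch already stamped upstream: the value is left as-is.
passthrough = stamp_barrier(LaunchMsg("r1", target_start_ns=250.0), 100.0, [4.0])
```

Using `dataclasses.replace` keeps the original message untouched, which matters when the same request object is shared by several sub-transactions.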
|||||||
@@ -95,6 +95,13 @@ class PeCpuComponent(ComponentBase):
|
|||||||
request = txn.request
|
request = txn.request
|
||||||
yield from self.run(env, 0)
|
yield from self.run(env, 0)
|
||||||
|
|
||||||
|
# ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
|
||||||
|
# target_start_ns, wait until then so every PE in this launch
|
||||||
|
# begins pe_exec measurement at the same simulated time.
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
if target_start is not None and target_start > env.now:
|
||||||
|
yield env.timeout(float(target_start) - env.now)
|
||||||
|
|
||||||
kernel_fn = get_kernel(request.kernel_ref.name)
|
kernel_fn = get_kernel(request.kernel_ref.name)
|
||||||
num_programs = self._derive_num_programs(request)
|
num_programs = self._derive_num_programs(request)
|
||||||
kernel_args = self._unpack_kernel_args(request)
|
kernel_args = self._unpack_kernel_args(request)
|
||||||
|
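Review note: the PE-side wait added above reduces to a small delay computation. The sketch below is illustrative only; `barrier_delay_ns` is a hypothetical helper, and the real code yields a SimPy timeout instead of returning a number.

```python
from typing import Optional


def barrier_delay_ns(target_start_ns: Optional[float], now_ns: float) -> float:
    """How long a PE should wait before starting the kernel.

    No stamp, or a stamp that is already in the past, means no wait.
    """
    if target_start_ns is None or target_start_ns <= now_ns:
        return 0.0
    return float(target_start_ns) - now_ns


# Three PEs receive the launch at different times but share one barrier.
barrier = 109.5
arrivals = [104.0, 109.5, 107.0]
start_times = [t + barrier_delay_ns(barrier, t) for t in arrivals]

# A PE that arrives after the barrier starts immediately (it is not pulled back).
late_start = 120.0 + barrier_delay_ns(barrier, 120.0)
```

The barrier only aligns PEs that arrive on time, so a barrier computed from an underestimated path latency would leave late PEs unsynchronised.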
|||||||
@@ -186,15 +186,49 @@ class PeDmaComponent(PeEngineBase):
|
|||||||
# ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
|
# ── IPCQ inbound (fabric → PE_DMA → MemoryStore + PE_IPCQ) ──────
|
||||||
|
|
||||||
def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
|
def _handle_ipcq_inbound(self, env: simpy.Environment, txn: Any) -> Generator:
|
||||||
"""At destination PE_DMA: atomically write data and forward metadata.
|
"""At destination PE_DMA: pay terminal drain, then atomically write
|
||||||
|
data and forward metadata.
|
||||||
|
|
||||||
|
ADR-0023 D9 (drain at inbound terminal): the Transaction carries
|
||||||
|
``drain_ns = nbytes / bottleneck_bw_on_path`` stamped by the sender
|
||||||
|
PE_DMA. Like every other Transaction terminal in the simulator (see
|
||||||
|
``ComponentBase._forward_txn``), this drain must be paid when the
|
||||||
|
Transaction reaches its destination. SRC-side ``tl.send`` is
|
||||||
|
fire-and-forget — it never yields on ``sub_done`` — so paying the
|
||||||
|
drain here does NOT delay the sender. What it DOES delay is the
|
||||||
|
IpcqMetaArrival forwarded below: that delay is the only signal
|
||||||
|
``tl.recv`` on DST blocks on, which is exactly the desired
|
||||||
|
semantics — "send dispatches and returns; recv waits until the
|
||||||
|
bytes have actually landed in its inbox".
|
||||||
|
|
||||||
|
The drain MUST be paid before the atomic block — inserting a yield
|
||||||
|
inside would break invariant I6.
|
||||||
|
|
||||||
I6 (MUST): no SimPy yield between MemoryStore.write and the
|
I6 (MUST): no SimPy yield between MemoryStore.write and the
|
||||||
IpcqMetaArrival put into PE_IPCQ.
|
IpcqMetaArrival put into PE_IPCQ.
|
||||||
"""
|
"""
|
||||||
from kernbench.common.ipcq_types import IpcqMetaArrival
|
from kernbench.common.ipcq_types import IpcqMetaArrival
|
||||||
|
|
||||||
|
# Pay terminal BW drain before the atomic write/metadata forward.
|
||||||
|
# Without this, IPCQ effectively got fabric bandwidth for free at
|
||||||
|
# the terminal (only intermediate-hop overhead_ns was charged),
|
||||||
|
# making IPCQ latency lower than raw DMA at large sizes in benchmarks.
|
||||||
|
drain = getattr(txn, "drain_ns", 0.0)
|
||||||
|
if drain > 0:
|
||||||
|
yield env.timeout(drain)
|
||||||
|
|
||||||
token = txn.request
|
token = txn.request
|
||||||
|
|
||||||
|
# ADR-0023 D9.7: charge IPCQ slot-WRITE latency against the
|
||||||
|
# backing-memory tier (tcm/sram/hbm) before the atomic block.
|
||||||
|
# Must come BEFORE the atomic write→IpcqMetaArrival pair (I6).
|
||||||
|
from kernbench.common.ipcq_types import slot_io_latency_ns
|
||||||
|
slot_write_ns = slot_io_latency_ns(
|
||||||
|
token.dst_endpoint.buffer_kind, token.nbytes,
|
||||||
|
)
|
||||||
|
if slot_write_ns > 0:
|
||||||
|
yield env.timeout(slot_write_ns)
|
||||||
|
|
||||||
# ── ATOMIC: do not introduce yield between these two operations ──
|
# ── ATOMIC: do not introduce yield between these two operations ──
|
||||||
# 1. Move data via MemoryStore (single-hop DMA write).
|
# 1. Move data via MemoryStore (single-hop DMA write).
|
||||||
# Prefer the in-flight snapshot stashed by the sender PE_DMA;
|
# Prefer the in-flight snapshot stashed by the sender PE_DMA;
|
||||||
@@ -278,7 +312,16 @@ class PeDmaComponent(PeEngineBase):
|
|||||||
dma_res = self._dma_write if is_write else self._dma_read
|
dma_res = self._dma_write if is_write else self._dma_read
|
||||||
assert dma_res is not None
|
assert dma_res is not None
|
||||||
|
|
||||||
pa = PhysAddr.decode(addr)
|
# Translate VA → PA via MMU (same logic as non-pipeline path)
|
||||||
|
target_pa = addr
|
||||||
|
if self._mmu is not None:
|
||||||
|
from kernbench.policy.address.pe_mmu import PageFault
|
||||||
|
try:
|
||||||
|
target_pa = self._mmu.translate(addr)
|
||||||
|
except PageFault:
|
||||||
|
target_pa = addr # fallback: treat as PA directly
|
||||||
|
|
||||||
|
pa = PhysAddr.decode(target_pa)
|
||||||
dst_node = self.ctx.resolver.resolve(pa)
|
dst_node = self.ctx.resolver.resolve(pa)
|
||||||
path = self.ctx.router.find_path(self._pe_prefix, dst_node)
|
path = self.ctx.router.find_path(self._pe_prefix, dst_node)
|
||||||
drain_ns = self.ctx.compute_drain_ns(path, nbytes)
|
drain_ns = self.ctx.compute_drain_ns(path, nbytes)
|
||||||
@@ -290,7 +333,7 @@ class PeDmaComponent(PeEngineBase):
|
|||||||
correlation_id="pipeline",
|
correlation_id="pipeline",
|
||||||
request_id=f"tile_{token.tile_id}",
|
request_id=f"tile_{token.tile_id}",
|
||||||
src_sip=0, src_cube=0, src_pe=0,
|
src_sip=0, src_cube=0, src_pe=0,
|
||||||
dst_pa=addr, nbytes=nbytes,
|
dst_pa=target_pa, nbytes=nbytes,
|
||||||
is_write=is_write,
|
is_write=is_write,
|
||||||
)
|
)
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
|
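Review note: the terminal drain charged in `_handle_ipcq_inbound` follows the docstring's `drain_ns = nbytes / bottleneck_bw_on_path`. A minimal sketch of that formula is below; `drain_ns` here is a hypothetical free function, and the bandwidth unit (bytes per nanosecond) is an assumption made for the example.

```python
from typing import Sequence


def drain_ns(nbytes: int, edge_bw_bytes_per_ns: Sequence[float]) -> float:
    """Time to push nbytes through the slowest edge on a path.

    A path with no bandwidth-limited edges contributes no drain.
    """
    min_bw = min(edge_bw_bytes_per_ns, default=float("inf"))
    if min_bw == float("inf"):
        return 0.0
    return nbytes / min_bw


# 4 KiB over edges of 64, 32 and 128 B/ns: the 32 B/ns edge is the bottleneck.
large = drain_ns(4096, [64.0, 32.0, 128.0])
# Zero-byte command traffic pays no drain regardless of bandwidth.
command = drain_ns(0, [64.0, 32.0])
# No bandwidth information at all is treated as "free".
unknown = drain_ns(4096, [])
```

Because the drain grows linearly with payload size, skipping it at the terminal mostly distorts large transfers, which matches the benchmark symptom described in the comment above.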
|||||||
@@ -329,6 +329,16 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
|
|
||||||
qp["my_tail"] += 1
|
qp["my_tail"] += 1
|
||||||
|
|
||||||
|
# ADR-0023 D9.7: charge IPCQ slot-READ latency against the
|
||||||
|
# backing-memory tier (tcm/sram/hbm). Recv blocks for the
|
||||||
|
# kernel-side slot consume; pe_exec_ns reflects this cost.
|
||||||
|
from kernbench.common.ipcq_types import slot_io_latency_ns
|
||||||
|
slot_read_ns = slot_io_latency_ns(
|
||||||
|
self._buffer_kind, req.result_data.get("nbytes", 0),
|
||||||
|
)
|
||||||
|
if slot_read_ns > 0:
|
||||||
|
yield env.timeout(slot_read_ns)
|
||||||
|
|
||||||
# Diagnostics trace (D14)
|
# Diagnostics trace (D14)
|
||||||
from kernbench.ccl import diagnostics
|
from kernbench.ccl import diagnostics
|
||||||
if diagnostics.trace_enabled():
|
if diagnostics.trace_enabled():
|
||||||
@@ -338,9 +348,13 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
nbytes=req.result_data.get("nbytes", 0),
|
nbytes=req.result_data.get("nbytes", 0),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Fast path credit return — bottleneck BW based latency
|
# Credit return: recv blocks on credit-emit so the protocol cost
|
||||||
env.process(
|
# (full path latency to deliver the credit metadata back to the
|
||||||
self._delayed_credit_send(env, direction, qp["peer_credit_store"], qp["my_tail"])
|
# sender) is reflected in the recv's pe_exec_ns. Models the IPCQ
|
||||||
|
# control-plane completing the consume-acknowledgement before
|
||||||
|
# recv returns to the kernel.
|
||||||
|
yield from self._delayed_credit_send(
|
||||||
|
env, direction, qp["peer_credit_store"], qp["my_tail"],
|
||||||
)
|
)
|
||||||
|
|
||||||
if not req.done.triggered:
|
if not req.done.triggered:
|
||||||
@@ -455,7 +469,12 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
yield peer_credit_store.put(meta)
|
yield peer_credit_store.put(meta)
|
||||||
|
|
||||||
def _credit_latency_ns(self, direction: str) -> float:
|
def _credit_latency_ns(self, direction: str) -> float:
|
||||||
"""Compute credit fast path latency = credit_size / bottleneck_bw.
|
"""Full path latency for the credit-return packet.
|
||||||
|
|
||||||
|
Pays per-node overhead + edge prop + drain along the same fabric
|
||||||
|
the data took. PathRouter.find_path() auto-appends ".pe_dma" to
|
||||||
|
the source only, so the destination MUST be spelled with the
|
||||||
|
explicit ".pe_dma" suffix.
|
||||||
|
|
||||||
Falls back to 0 when ctx/router is unavailable (unit-test mode).
|
Falls back to 0 when ctx/router is unavailable (unit-test mode).
|
||||||
"""
|
"""
|
||||||
@@ -463,10 +482,12 @@ class PeIpcqComponent(ComponentBase):
|
|||||||
return 0.0
|
return 0.0
|
||||||
qp = self._queue_pairs[direction]
|
qp = self._queue_pairs[direction]
|
||||||
peer = qp["peer"]
|
peer = qp["peer"]
|
||||||
peer_pe_prefix = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}"
|
peer_pe_dma = f"sip{peer.sip}.cube{peer.cube}.pe{peer.pe}.pe_dma"
|
||||||
try:
|
try:
|
||||||
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
|
path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
|
||||||
return self.ctx.compute_drain_ns(path, self._credit_size_bytes)
|
return self.ctx.compute_path_latency_ns(
|
||||||
|
path, self._credit_size_bytes,
|
||||||
|
)
|
||||||
except Exception:
|
except Exception:
|
||||||
return 0.0
|
return 0.0
|
||||||
|
|
||||||
|
|||||||
@@ -26,6 +26,9 @@ class ComponentContext:
|
|||||||
spec: dict = field(default_factory=dict) # topology spec (cube layout, PE count, etc.)
|
spec: dict = field(default_factory=dict) # topology spec (cube layout, PE count, etc.)
|
||||||
memory_store: Any = None # MemoryStore for Phase 1 data-aware execution (ADR-0020)
|
memory_store: Any = None # MemoryStore for Phase 1 data-aware execution (ADR-0020)
|
||||||
op_logger: Any = None # OpLogger for Phase 1 op recording (ADR-0020)
|
op_logger: Any = None # OpLogger for Phase 1 op recording (ADR-0020)
|
||||||
|
# node_id -> overhead_ns (ADR-0009 D5: used by M_CPU to compute per-PE
|
||||||
|
# dispatch latency when stamping target_start_ns on KernelLaunchMsg).
|
||||||
|
node_overhead_ns: dict[str, float] = field(default_factory=dict)
|
||||||
|
|
||||||
def get_shared_resource(
|
def get_shared_resource(
|
||||||
self, env: simpy.Environment, key: str, capacity: int = 1,
|
self, env: simpy.Environment, key: str, capacity: int = 1,
|
||||||
@@ -52,3 +55,19 @@ class ComponentContext:
|
|||||||
if min_bw == float("inf"):
|
if min_bw == float("inf"):
|
||||||
return 0.0
|
return 0.0
|
||||||
return nbytes / min_bw
|
return nbytes / min_bw
|
||||||
|
|
||||||
|
def compute_path_latency_ns(self, path: list[str], nbytes: int = 0) -> float:
|
||||||
|
"""Formula latency along path: wire + per-node overhead + drain.
|
||||||
|
|
||||||
|
ADR-0009 D5: M_CPU uses this to compute per-PE dispatch latency
|
||||||
|
when stamping target_start_ns on KernelLaunchMsg fanout.
|
||||||
|
"""
|
||||||
|
total = 0.0
|
||||||
|
for i in range(len(path) - 1):
|
||||||
|
edge = self.edge_map.get((path[i], path[i + 1]))
|
||||||
|
if edge:
|
||||||
|
total += edge.distance_mm * self.ns_per_mm
|
||||||
|
for node_id in path:
|
||||||
|
total += self.node_overhead_ns.get(node_id, 0.0)
|
||||||
|
total += self.compute_drain_ns(path, nbytes)
|
||||||
|
return total
|
||||||
|
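Review note: `compute_path_latency_ns` sums three terms: wire delay, per-node overhead, and drain. The standalone sketch below mirrors that structure with plain dictionaries; the function name, the dictionary shapes, and the node names are assumptions for illustration, not the `ComponentContext` API.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def path_latency_ns(path: List[str],
                    edge_mm: Dict[Edge, float],
                    ns_per_mm: float,
                    node_overhead_ns: Dict[str, float],
                    edge_bw: Dict[Edge, float],
                    nbytes: int = 0) -> float:
    """Formula latency along a path: wire + per-node overhead + drain."""
    hops = list(zip(path, path[1:]))
    # Wire propagation: distance of each known edge times the per-mm delay.
    wire = sum(edge_mm.get(hop, 0.0) * ns_per_mm for hop in hops)
    # Every node on the path, endpoints included, contributes its overhead.
    overhead = sum(node_overhead_ns.get(node, 0.0) for node in path)
    # Drain: payload size over the slowest edge (zero for command-only traffic).
    bws = [edge_bw[hop] for hop in hops if hop in edge_bw]
    drain = nbytes / min(bws) if bws and nbytes else 0.0
    return wire + overhead + drain


path = ["m_cpu", "noc", "pe_cpu"]
edge_mm = {("m_cpu", "noc"): 2.0, ("noc", "pe_cpu"): 3.0}
overheads = {"m_cpu": 1.0, "noc": 2.0, "pe_cpu": 1.5}
bw = {("m_cpu", "noc"): 32.0, ("noc", "pe_cpu"): 16.0}

command_latency = path_latency_ns(path, edge_mm, 0.5, overheads, bw, nbytes=0)
data_latency = path_latency_ns(path, edge_mm, 0.5, overheads, bw, nbytes=64)
```

Note that both endpoints' overheads are included, which is why callers that chain two paths through a shared node need to correct for double counting.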
|||||||
@@ -58,7 +58,13 @@ class IoCpuComponent(ComponentBase):
|
|||||||
self._pending[key] = (expected, received, parent_done)
|
self._pending[key] = (expected, received, parent_done)
|
||||||
|
|
||||||
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
def _dispatch_to_m_cpus(self, env: simpy.Environment, txn: Any) -> Generator:
|
||||||
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses."""
|
"""Fan out sub-Transactions to target cube M_CPUs, wait for responses.
|
||||||
|
|
||||||
|
ADR-0009 D5 (extended): stamp a global target_start_ns on
|
||||||
|
KernelLaunchMsg so every PE across every target cube starts at
|
||||||
|
the same env.now. See the non-legacy builtin for full rationale.
|
||||||
|
"""
|
||||||
|
import dataclasses
|
||||||
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
from kernbench.runtime_api.kernel import KernelLaunchMsg, MemoryReadMsg, MemoryWriteMsg
|
||||||
|
|
||||||
request = txn.request
|
request = txn.request
|
||||||
@@ -72,10 +78,53 @@ class IoCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
|
if isinstance(request, KernelLaunchMsg):
|
||||||
|
io_overhead = self.ctx.node_overhead_ns.get(self.node.id, 0.0)
|
||||||
|
global_max_latency = 0.0
|
||||||
|
pe_ids = self._resolve_pe_ids(
|
||||||
|
getattr(request, "target_pe", "all")
|
||||||
|
)
|
||||||
|
for sip, cube in cube_targets:
|
||||||
|
try:
|
||||||
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
|
io_to_m_path = self.ctx.router.find_node_path(
|
||||||
|
self.node.id, m_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(io_to_m_path) < 2:
|
||||||
|
continue
|
||||||
|
leg1 = self.ctx.compute_path_latency_ns(
|
||||||
|
io_to_m_path, nbytes=0,
|
||||||
|
)
|
||||||
|
m_overhead = self.ctx.node_overhead_ns.get(m_cpu_id, 0.0)
|
||||||
|
for pe_id in pe_ids:
|
||||||
|
pe_cpu_id = (
|
||||||
|
f"sip{sip}.cube{cube}.pe{pe_id}.pe_cpu"
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
m_to_pe_path = self.ctx.router.find_node_path(
|
||||||
|
m_cpu_id, pe_cpu_id,
|
||||||
|
)
|
||||||
|
except Exception:
|
||||||
|
continue
|
||||||
|
if len(m_to_pe_path) < 2:
|
||||||
|
continue
|
||||||
|
leg2 = self.ctx.compute_path_latency_ns(
|
||||||
|
m_to_pe_path, nbytes=0,
|
||||||
|
)
|
||||||
|
latency = leg1 + leg2 - io_overhead - m_overhead
|
||||||
|
if latency > global_max_latency:
|
||||||
|
global_max_latency = latency
|
||||||
|
request = dataclasses.replace(
|
||||||
|
request,
|
||||||
|
target_start_ns=float(env.now) + global_max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
# Setup aggregation
|
# Setup aggregation
|
||||||
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
self._pending[request.request_id] = (len(cube_targets), 0, txn.done)
|
||||||
|
|
||||||
# Fan out to each target cube's M_CPU
|
is_kernel_launch = isinstance(request, KernelLaunchMsg)
|
||||||
for sip, cube in cube_targets:
|
for sip, cube in cube_targets:
|
||||||
try:
|
try:
|
||||||
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
m_cpu_id = self.ctx.resolver.find_m_cpu(sip, cube)
|
||||||
@@ -86,11 +135,24 @@ class IoCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=request, path=path, step=0,
|
||||||
nbytes=txn.nbytes, done=env.event(),
|
nbytes=0 if is_kernel_launch else txn.nbytes,
|
||||||
|
done=env.event(),
|
||||||
result_data=txn.result_data,
|
result_data=txn.result_data,
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
|
|
||||||
|
def _resolve_pe_ids(self, target_pe: Any) -> list[int]:
|
||||||
|
"""Resolve target_pe → list of PE indices (mirrors M_CPU logic)."""
|
||||||
|
if isinstance(target_pe, int):
|
||||||
|
return [target_pe]
|
||||||
|
if isinstance(target_pe, tuple):
|
||||||
|
return list(target_pe)
|
||||||
|
n_slices = 8
|
||||||
|
if self.ctx and self.ctx.spec:
|
||||||
|
mm = self.ctx.spec.get("cube", {}).get("memory_map", {})
|
||||||
|
n_slices = mm.get("hbm_slices_per_cube", 8)
|
||||||
|
return list(range(n_slices))
|
||||||
|
|
||||||
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
def _resolve_cube_targets(self, request: Any) -> list[tuple[int, int]]:
|
||||||
"""Return list of (sip, cube) pairs to fan out to."""
|
"""Return list of (sip, cube) pairs to fan out to."""
|
||||||
from kernbench.runtime_api.kernel import (
|
from kernbench.runtime_api.kernel import (
|
||||||
@@ -145,10 +207,10 @@ class IoCpuComponent(ComponentBase):
|
|||||||
return []
|
return []
|
||||||
|
|
||||||
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
|
def _cube_from_pa(self, pa_val: int, fallback: int) -> int:
|
||||||
"""Extract cube_id from a physical address, with fallback."""
|
"""Extract die_id from a physical address, with fallback."""
|
||||||
from kernbench.policy.address.phyaddr import PhysAddr
|
from kernbench.policy.address.phyaddr import PhysAddr
|
||||||
try:
|
try:
|
||||||
return PhysAddr.decode(pa_val).cube_id
|
return PhysAddr.decode(pa_val).die_id
|
||||||
except Exception:
|
except Exception:
|
||||||
return fallback
|
return fallback
|
||||||
|
|
||||||
|
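Review note: the global barrier in `_dispatch_to_m_cpus` combines two path legs per PE as `leg1 + leg2 - io_overhead - m_overhead`. The sketch below reproduces that arithmetic with hypothetical helper names. The reading of the subtraction is an assumption: M_CPU appears as an endpoint of both legs, so its overhead would otherwise be counted twice, and IO_CPU's own overhead is presumably already spent by the time `env.now` is read.

```python
from typing import Dict, List, Tuple


def end_to_end_dispatch_ns(leg1_ns: float, leg2_ns: float,
                           io_overhead_ns: float, m_overhead_ns: float) -> float:
    """IO_CPU -> M_CPU -> PE_CPU latency, mirroring the diff's formula."""
    return leg1_ns + leg2_ns - io_overhead_ns - m_overhead_ns


def global_barrier_ns(now_ns: float,
                      cubes: Dict[str, Tuple[float, List[float]]],
                      io_overhead_ns: float, m_overhead_ns: float) -> float:
    """Barrier time = now + the slowest IO_CPU-to-PE latency over all cubes."""
    worst = 0.0
    for leg1_ns, leg2_list in cubes.values():
        for leg2_ns in leg2_list:
            worst = max(worst, end_to_end_dispatch_ns(
                leg1_ns, leg2_ns, io_overhead_ns, m_overhead_ns))
    return float(now_ns) + worst


# cube -> (IO_CPU->M_CPU latency, [M_CPU->PE_CPU latency per PE])
cubes = {
    "sip0.cube0": (10.0, [7.0, 8.0]),
    "sip0.cube1": (14.0, [6.0]),
}
barrier = global_barrier_ns(50.0, cubes, io_overhead_ns=2.0, m_overhead_ns=1.0)
```

Here the farther cube dominates (14 + 6 - 3 = 17 ns), so every PE in both cubes would be told to start at 67 ns.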
|||||||
@@ -162,7 +162,11 @@ class MCpuComponent(ComponentBase):
|
|||||||
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
Routes through find_node_path (M_CPU → NOC → PE_CPU command edges).
|
||||||
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
PE_CPU sends ResponseMsg back via NOC → M_CPU on completion.
|
||||||
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
Then sends aggregate ResponseMsg back to IO_CPU on the reverse path.
|
||||||
|
|
||||||
|
ADR-0009 D5: stamps target_start_ns so every PE in this fanout
|
||||||
|
starts executing at the same env.now regardless of dispatch path.
|
||||||
"""
|
"""
|
||||||
|
import dataclasses
|
||||||
request = txn.request
|
request = txn.request
|
||||||
target_pe = getattr(request, "target_pe", "all")
|
target_pe = getattr(request, "target_pe", "all")
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
@@ -172,9 +176,10 @@ class MCpuComponent(ComponentBase):
|
|||||||
txn.done.succeed()
|
txn.done.succeed()
|
||||||
return
|
return
|
||||||
|
|
||||||
# Fan out to each PE_CPU, using response-based aggregation
|
# Resolve per-PE paths. If IO_CPU already stamped a global
|
||||||
sub_txns: list[Transaction] = []
|
# target_start_ns (ADR-0009 D5 extended), pass it through.
|
||||||
n_dispatched = 0
|
per_pe: list[tuple[int, list[str], float]] = []
|
||||||
|
max_latency = 0.0
|
||||||
for pe_id in pe_ids:
|
for pe_id in pe_ids:
|
||||||
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
pe_cpu_id = f"{cube_prefix}.pe{pe_id}.pe_cpu"
|
||||||
try:
|
try:
|
||||||
@@ -183,8 +188,24 @@ class MCpuComponent(ComponentBase):
|
|||||||
continue
|
continue
|
||||||
if len(path) < 2:
|
if len(path) < 2:
|
||||||
continue
|
continue
|
||||||
|
latency = self.ctx.compute_path_latency_ns(path, nbytes=0)
|
||||||
|
per_pe.append((pe_id, path, latency))
|
||||||
|
if latency > max_latency:
|
||||||
|
max_latency = latency
|
||||||
|
|
||||||
|
if getattr(request, "target_start_ns", None) is not None:
|
||||||
|
stamped_request = request
|
||||||
|
else:
|
||||||
|
stamped_request = dataclasses.replace(
|
||||||
|
request, target_start_ns=float(env.now) + max_latency,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Fan out to each PE_CPU, using response-based aggregation
|
||||||
|
sub_txns: list[Transaction] = []
|
||||||
|
n_dispatched = 0
|
||||||
|
for pe_id, path, _lat in per_pe:
|
||||||
sub_txn = Transaction(
|
sub_txn = Transaction(
|
||||||
request=request, path=path, step=0,
|
request=stamped_request, path=path, step=0,
|
||||||
nbytes=0, done=env.event(),
|
nbytes=0, done=env.event(),
|
||||||
)
|
)
|
||||||
yield self.out_ports[path[1]].put(sub_txn.advance())
|
yield self.out_ports[path[1]].put(sub_txn.advance())
|
||||||
@@ -204,16 +225,21 @@ class MCpuComponent(ComponentBase):
|
|||||||
yield all_done
|
yield all_done
|
||||||
del self._parent_txns[request.request_id]
|
del self._parent_txns[request.request_id]
|
||||||
|
|
||||||
# Aggregate PE-internal metrics (max across PEs)
|
# Aggregate PE-internal metrics (max across PEs and across cubes).
|
||||||
|
# Multiple M_CPUs share the same result_data dict via IO_CPU fanout;
|
||||||
|
# merge against the existing value so cubes don't clobber each other.
|
||||||
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
pe_exec_values = [st.result_data.get("pe_exec_ns", 0.0) for st in sub_txns]
|
||||||
if pe_exec_values:
|
if pe_exec_values:
|
||||||
txn.result_data["pe_exec_ns"] = max(pe_exec_values)
|
cur = txn.result_data.get("pe_exec_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["pe_exec_ns"] = max(cur, max(pe_exec_values))
|
||||||
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
dma_values = [st.result_data.get("dma_ns", 0.0) for st in sub_txns]
|
||||||
if dma_values:
|
if dma_values:
|
||||||
txn.result_data["dma_ns"] = max(dma_values)
|
cur = txn.result_data.get("dma_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["dma_ns"] = max(cur, max(dma_values))
|
||||||
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
compute_values = [st.result_data.get("compute_ns", 0.0) for st in sub_txns]
|
||||||
if compute_values:
|
if compute_values:
|
||||||
txn.result_data["compute_ns"] = max(compute_values)
|
cur = txn.result_data.get("compute_ns", 0.0) or 0.0
|
||||||
|
txn.result_data["compute_ns"] = max(cur, max(compute_values))
|
||||||
|
|
||||||
# Send aggregate response on reverse command path back to IO_CPU
|
# Send aggregate response on reverse command path back to IO_CPU
|
||||||
reverse_path = list(reversed(txn.path))
|
reverse_path = list(reversed(txn.path))
|
||||||
|
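Review note: the aggregation change replaces an overwrite with a max-merge against the value already in the shared `result_data` dict. A minimal sketch of the difference follows; `merge_max` is a hypothetical helper, and the metric values are made up for the example.

```python
from typing import Dict, List, Optional


def merge_max(result_data: Dict[str, Optional[float]], key: str,
              values: List[float]) -> None:
    """Fold per-PE values into a shared dict, keeping the largest value seen.

    The `or 0.0` guards against a key that is present but set to None.
    """
    if values:
        cur = result_data.get(key, 0.0) or 0.0
        result_data[key] = max(cur, max(values))


# Two cubes report into the same dict, as happens with IO_CPU fanout.
shared: Dict[str, Optional[float]] = {"dma_ns": None}
merge_max(shared, "pe_exec_ns", [5.0, 9.0])   # cube 0
merge_max(shared, "pe_exec_ns", [7.0, 3.0])   # cube 1 must not clobber cube 0
merge_max(shared, "dma_ns", [2.0])            # None is treated as 0.0
merge_max(shared, "compute_ns", [])           # empty input leaves the dict alone

# The pre-change behaviour for comparison: the last cube to finish wins.
overwritten: Dict[str, float] = {}
for cube_values in ([5.0, 9.0], [7.0, 3.0]):
    overwritten["pe_exec_ns"] = max(cube_values)
```

With the overwrite, the reported `pe_exec_ns` depends on which cube finishes last, while the merge is order-independent.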
|||||||
@@ -71,6 +71,13 @@ class PeCpuComponent(ComponentBase):
|
|||||||
request = txn.request
|
request = txn.request
|
||||||
yield from self.run(env, 0)
|
yield from self.run(env, 0)
|
||||||
|
|
||||||
|
# ADR-0009 D5: synchronized launch barrier. If M_CPU stamped a
|
||||||
|
# target_start_ns, wait until then so every PE in this launch
|
||||||
|
# begins pe_exec measurement at the same simulated time.
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
if target_start is not None and target_start > env.now:
|
||||||
|
yield env.timeout(float(target_start) - env.now)
|
||||||
|
|
||||||
kernel_fn = get_kernel(request.kernel_ref.name)
|
kernel_fn = get_kernel(request.kernel_ref.name)
|
||||||
num_programs = self._derive_num_programs(request)
|
num_programs = self._derive_num_programs(request)
|
||||||
kernel_args = self._unpack_kernel_args(request)
|
kernel_args = self._unpack_kernel_args(request)
|
||||||
|
|||||||
@@ -89,11 +89,10 @@ class _FreeList:
|
|||||||
|
|
||||||
class PEMemAllocator:
|
class PEMemAllocator:
|
||||||
def __init__(
|
def __init__(
|
||||||
self, rack_id: int, sip_id: int, cube_id: int, pe_id: int, cfg: AddressConfig,
|
self, sip_id: int, die_id: int, pe_id: int, cfg: AddressConfig,
|
||||||
) -> None:
|
) -> None:
|
||||||
self._rack_id = rack_id
|
|
||||||
self._sip_id = sip_id
|
self._sip_id = sip_id
|
||||||
self._cube_id = cube_id
|
self._die_id = die_id
|
||||||
self._pe_id = pe_id
|
self._pe_id = pe_id
|
||||||
self._cfg = cfg
|
self._cfg = cfg
|
||||||
self._hbm = _FreeList(cfg.hbm_slice_bytes)
|
self._hbm = _FreeList(cfg.hbm_slice_bytes)
|
||||||
@@ -108,7 +107,7 @@ class PEMemAllocator:
|
|||||||
f"available {self._cfg.hbm_slice_bytes - self._hbm.used}"
|
f"available {self._cfg.hbm_slice_bytes - self._hbm.used}"
|
||||||
)
|
)
|
||||||
return PhysAddr.pe_hbm_addr(
|
return PhysAddr.pe_hbm_addr(
|
||||||
rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
|
sip_id=self._sip_id, die_id=self._die_id,
|
||||||
pe_id=self._pe_id, pe_local_hbm_offset=offset,
|
pe_id=self._pe_id, pe_local_hbm_offset=offset,
|
||||||
slice_size_bytes=self._cfg.hbm_slice_bytes,
|
slice_size_bytes=self._cfg.hbm_slice_bytes,
|
||||||
)
|
)
|
||||||
@@ -128,7 +127,7 @@ class PEMemAllocator:
|
|||||||
f"available {self._cfg.tcm_allocatable_bytes - self._tcm.used}"
|
f"available {self._cfg.tcm_allocatable_bytes - self._tcm.used}"
|
||||||
)
|
)
|
||||||
return PhysAddr.pe_tcm_addr(
|
return PhysAddr.pe_tcm_addr(
|
||||||
rack_id=self._rack_id, sip_id=self._sip_id, cube_id=self._cube_id,
|
sip_id=self._sip_id, die_id=self._die_id,
|
||||||
pe_id=self._pe_id, tcm_offset=offset,
|
pe_id=self._pe_id, tcm_offset=offset,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -19,7 +19,14 @@ class PageFault(Exception):
|
|||||||
|
|
||||||
|
|
||||||
class PeMMU:
|
class PeMMU:
|
||||||
"""Per-PE MMU with page-aligned VA→PA translation table.
|
"""Per-PE MMU with sub-page-capable VA→PA translation table.
|
||||||
|
|
||||||
|
Each page-table entry is a list of (start_in_page, end_in_page,
|
||||||
|
pa_at_offset_zero) regions. This is a SIMULATOR STOPGAP — real MMUs
|
||||||
|
store one PA per page-table entry. Sub-page regions exist here so
|
||||||
|
DPPolicy layouts that shard below page granularity (e.g. 128 B
|
||||||
|
payloads with 4 KB pages) don't silently mis-route through last-
|
||||||
|
write-wins overwrites. Memory note: project_mmu_subpage_stopgap.md.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
page_size: Page size in bytes (default 2 MB).
|
page_size: Page size in bytes (default 2 MB).
|
||||||
@@ -34,7 +41,11 @@ class PeMMU:
|
|||||||
self._page_size = page_size
|
self._page_size = page_size
|
||||||
self._page_shift = (page_size - 1).bit_length()
|
self._page_shift = (page_size - 1).bit_length()
|
||||||
self._page_mask = page_size - 1
|
self._page_mask = page_size - 1
|
||||||
self._table: dict[int, int] = {} # va_page_number → pa_page_base
|
# vpn → list of (start_in_page, end_in_page, pa_at_offset_zero).
|
||||||
|
# pa_at_offset_zero is the PA that offset 0 of the page would map
|
||||||
|
# to under this region — i.e. translate(off) = pa_at_offset_zero
|
||||||
|
# + off when start <= off < end.
|
||||||
|
self._table: dict[int, list[tuple[int, int, int]]] = {}
|
||||||
self._overhead_ns = overhead_ns
|
self._overhead_ns = overhead_ns
|
||||||
|
|
||||||
@property
|
@property
|
||||||
@@ -46,21 +57,67 @@ class PeMMU:
|
|||||||
return len(self._table)
|
return len(self._table)
|
||||||
|
|
||||||
def map(self, va: int, pa: int, size: int) -> None:
|
def map(self, va: int, pa: int, size: int) -> None:
|
||||||
"""Register VA→PA mapping for a contiguous range."""
|
"""Register VA→PA mapping for a contiguous range.
|
||||||
for off in range(0, size, self._page_size):
|
|
||||||
vpn = (va + off) >> self._page_shift
|
Sub-page-aware: a single page can hold multiple disjoint regions,
|
||||||
self._table[vpn] = pa + off
|
each pointing to a different PA. Later map() calls APPEND a new
|
||||||
|
region; on overlap with an existing region, the new region wins
|
||||||
|
for the overlapping offsets (translate iterates in reverse so the
|
||||||
|
last write takes precedence — matches legacy single-PA behavior
|
||||||
|
when a full page is re-mapped).
|
||||||
|
"""
|
||||||
|
end_va = va + size
|
||||||
|
cur = va
|
||||||
|
while cur < end_va:
|
||||||
|
vpn = cur >> self._page_shift
|
||||||
|
page_base_va = vpn << self._page_shift
|
||||||
|
page_end_va = page_base_va + self._page_size
|
||||||
|
region_start = cur - page_base_va
|
||||||
|
region_end = min(end_va, page_end_va) - page_base_va
|
||||||
|
# PA that offset 0 of this page would map to if the region extended down to it
|
||||||
|
pa_at_offset_zero = pa + (cur - va) - region_start
|
||||||
|
self._table.setdefault(vpn, []).append(
|
||||||
|
(region_start, region_end, pa_at_offset_zero)
|
||||||
|
)
|
||||||
|
cur = page_base_va + region_end
|
||||||
|
|
||||||
def unmap(self, va: int, size: int) -> None:
|
def unmap(self, va: int, size: int) -> None:
|
||||||
"""Remove VA mapping for a contiguous range."""
|
"""Remove VA mapping for a contiguous range.
|
||||||
for off in range(0, size, self._page_size):
|
|
||||||
vpn = (va + off) >> self._page_shift
|
Drops any region whose extent is contained within the unmapped
|
||||||
self._table.pop(vpn, None)
|
range. Partial overlaps (region straddles the range boundary)
|
||||||
|
are left in place — caller is expected to unmap on the same
|
||||||
|
boundaries it mapped on.
|
||||||
|
"""
|
||||||
|
end_va = va + size
|
||||||
|
cur = va
|
||||||
|
while cur < end_va:
|
||||||
|
vpn = cur >> self._page_shift
|
||||||
|
page_base_va = vpn << self._page_shift
|
||||||
|
page_end_va = page_base_va + self._page_size
|
||||||
|
unmap_start = cur - page_base_va
|
||||||
|
unmap_end = min(end_va, page_end_va) - page_base_va
|
||||||
|
regions = self._table.get(vpn)
|
||||||
|
if regions is not None:
|
||||||
|
kept = [
|
||||||
|
r for r in regions
|
||||||
|
if not (r[0] >= unmap_start and r[1] <= unmap_end)
|
||||||
|
]
|
||||||
|
if kept:
|
||||||
|
self._table[vpn] = kept
|
||||||
|
else:
|
||||||
|
del self._table[vpn]
|
||||||
|
cur = page_base_va + unmap_end
|
||||||
|
|
||||||
def translate(self, va: int) -> int:
|
def translate(self, va: int) -> int:
|
||||||
"""Translate VA to PA. Raises PageFault if unmapped."""
|
"""Translate VA to PA. Raises PageFault if unmapped."""
|
||||||
vpn = va >> self._page_shift
|
vpn = va >> self._page_shift
|
||||||
pa_page_base = self._table.get(vpn)
|
regions = self._table.get(vpn)
|
||||||
if pa_page_base is None:
|
if regions is None:
|
||||||
|
raise PageFault(va)
|
||||||
|
offset = va & self._page_mask
|
||||||
|
# Iterate latest-first so newer map() calls win on overlap
|
||||||
|
for start, end, pa_at_offset_zero in reversed(regions):
|
||||||
|
if start <= offset < end:
|
||||||
|
return pa_at_offset_zero + offset
|
||||||
raise PageFault(va)
|
raise PageFault(va)
|
||||||
return pa_page_base + (va & self._page_mask)
|
|
||||||
|
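Review note: the sub-page table can be exercised on its own with the sketch below. It follows the region layout from the diff, `(start_in_page, end_in_page, pa_at_offset_zero)`, but `SubPageMMU` is a hypothetical cut-down class: it raises `LookupError` instead of `PageFault`, omits `unmap`, and assumes `page_size` is a power of two.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int, int]  # (start_in_page, end_in_page, pa_at_offset_zero)


class SubPageMMU:
    """Illustrative VA->PA table where one page may hold several regions."""

    def __init__(self, page_size: int = 4096) -> None:
        self.page_size = page_size
        self.shift = (page_size - 1).bit_length()
        self.mask = page_size - 1
        self.table: Dict[int, List[Region]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end_va, cur = va + size, va
        while cur < end_va:
            vpn = cur >> self.shift
            page_base = vpn << self.shift
            start = cur - page_base
            end = min(end_va, page_base + self.page_size) - page_base
            # PA that offset 0 of the page would have under this region.
            pa_zero = pa + (cur - va) - start
            self.table.setdefault(vpn, []).append((start, end, pa_zero))
            cur = page_base + end

    def translate(self, va: int) -> int:
        offset = va & self.mask
        # Newest region first, so a later map() wins on overlap.
        for start, end, pa_zero in reversed(self.table.get(va >> self.shift, [])):
            if start <= offset < end:
                return pa_zero + offset
        raise LookupError(hex(va))


mmu = SubPageMMU(page_size=4096)
mmu.map(0x1000, 0xA000, 128)   # first 128 B of the page -> one target
mmu.map(0x1080, 0xB000, 128)   # next 128 B of the same page -> another target
mmu.map(0x1F80, 0xC000, 256)   # range that crosses into the following page
```

With the old one-PA-per-page table, the second `map` call would have replaced the first mapping for the whole page, which is the mis-routing the docstring describes.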
|||||||
@@ -6,6 +6,47 @@ from typing import Literal
|
|||||||
|
|
||||||
MAX_51 = (1 << 51) - 1
|
MAX_51 = (1 << 51) - 1
|
||||||
|
|
||||||
|
# ── Layout constants (ADR-0001 Rev 2) ────────────────────────────────
|
||||||
|
# [50:47] sip_id (4)
|
||||||
|
# [46:42] die_id (5)
|
||||||
|
# [41: 0] local_offset (42)
|
||||||
|
_SIP_SHIFT = 47
|
||||||
|
_DIE_SHIFT = 42
|
||||||
|
_LOCAL_BITS = 42
|
||||||
|
_LOCAL_MASK = (1 << _LOCAL_BITS) - 1
|
||||||
|
|
||||||
|
# AHBM die: [41:38] MBZ, [37] addr_space, [36:0] sub-address
|
||||||
|
_AHBM_SEL_BIT = 37
|
||||||
|
_AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM
|
||||||
|
|
||||||
|
# Resource window: [36:34] resource_kind, [33:0] kind_local
|
||||||
|
_RES_KIND_SHIFT = 34
|
||||||
|
_RES_KIND_MASK = 0x7
|
||||||
|
|
||||||
|
# PE_LOCAL: [32:29] pe_id, [28:25] pe_sub_unit, [24:0] sub_offset
|
||||||
|
_PE_ID_SHIFT = 29
|
||||||
|
_PE_SUB_SHIFT = 25
|
||||||
|
_PE_SUB_OFFSET_BITS = 25
|
||||||
|
|
||||||
|
# MCPU_LOCAL: [29:25] mcpu_sub_unit, [24:0] sub_offset
|
||||||
|
_MCPU_SUB_SHIFT = 25
|
||||||
|
|
||||||
|
# CUBE_SRAM: [24:0] sram_offset
|
||||||
|
_SRAM_OFFSET_BITS = 25
|
||||||
|
|
||||||
|
# IOCHIPLET: [41:40] MBZ, [39:0] chiplet_offset
|
||||||
|
_CHIPLET_LOCAL_BITS = 40
|
||||||
|
_IOCPU_BOUNDARY = 1 << 31 # 2 GB
|
||||||
|
|
||||||
|
# IOCPU: [30:27] iocpu_sub_unit, [26:0] sub_offset
|
||||||
|
_IOCPU_SUB_SHIFT = 27
|
||||||
|
_IOCPU_SUB_OFFSET_BITS = 27
|
||||||
|
|
||||||
|
# die_id ranges
|
||||||
|
_AHBM_DIE_MAX = 15
|
||||||
|
_CHIPLET_DIE_MIN = 16
|
||||||
|
_CHIPLET_DIE_MAX = 20
|
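As a rough illustration of the top-level layout constants above, the three fields can be packed and unpacked with plain shifts and masks. The helper names `pack` and `unpack` are assumptions made for this sketch; the module itself exposes `PhysAddr.encode()` and `PhysAddr.decode()`.

```python
from __future__ import annotations

SIP_SHIFT = 47                      # sip_id occupies bits [50:47]
DIE_SHIFT = 42                      # die_id occupies bits [46:42]
LOCAL_BITS = 42                     # local_offset occupies bits [41:0]
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def pack(sip_id: int, die_id: int, local_offset: int) -> int:
    """Pack the three top-level fields into one 51-bit integer."""
    if not 0 <= sip_id < (1 << 4):
        raise ValueError(f"sip_id out of range: {sip_id}")
    if not 0 <= die_id < (1 << 5):
        raise ValueError(f"die_id out of range: {die_id}")
    if not 0 <= local_offset <= LOCAL_MASK:
        raise ValueError(f"local_offset out of range: {local_offset:#x}")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local_offset


def unpack(addr: int) -> tuple[int, int, int]:
    """Split a 51-bit address back into (sip_id, die_id, local_offset)."""
    return (addr >> SIP_SHIFT) & 0xF, (addr >> DIE_SHIFT) & 0x1F, addr & LOCAL_MASK


addr = pack(sip_id=3, die_id=17, local_offset=0x1234)
print(hex(addr), unpack(addr))
```

The sketch deliberately skips the per-die MBZ rules and the reserved `die_id` range; those belong to the die-specific decode paths.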
||||||
|
|
||||||
|
|
||||||
class PhysAddrError(Exception):
|
class PhysAddrError(Exception):
|
||||||
pass
|
pass
|
||||||
@@ -22,163 +63,278 @@ def _chk_max(name: str, v: int, maxv: int) -> None:
|
|||||||
|
|
||||||
|
|
||||||
class UnitType(IntEnum):
|
class UnitType(IntEnum):
|
||||||
PE = 0
|
"""resource_kind values for AHBM resource window."""
|
||||||
MCPU = 1
|
PE = 0 # PE_LOCAL
|
||||||
SRAM = 2
|
MCPU = 1 # MCPU_LOCAL
|
||||||
|
SRAM = 2 # CUBE_SRAM
|
||||||
|
|
||||||
|
|
||||||
|
class PESubUnit(IntEnum):
|
||||||
|
PE_CPU_DTCM = 0
|
||||||
|
MATH_ENGINE_DTCM = 1
|
||||||
|
IPCQ = 2
|
||||||
|
PE_CPU_SFR = 3
|
||||||
|
MATH_ENGINE_SFR = 4
|
||||||
|
DMA_ENGINE_SFR = 5
|
||||||
|
PE_TCM = 6
|
||||||
|
|
||||||
|
|
||||||
|
class MCPUSubUnit(IntEnum):
|
||||||
|
MCPU_ITCM = 0
|
||||||
|
MCPU_DTCM = 1
|
||||||
|
IPCQ = 2
|
||||||
|
MCPU_SFR = 3
|
||||||
|
MCPU_DMA_SFR = 4
|
||||||
|
MCPU_SRAM = 5
|
||||||
|
|
||||||
|
|
||||||
|
class IOCPUSubUnit(IntEnum):
|
||||||
|
IOCPU_ITCM = 0
|
||||||
|
IOCPU_DTCM = 1
|
||||||
|
IPCQ = 2
|
||||||
|
IOCPU_SFR = 3
|
||||||
|
IO_DMA_SFR = 4
|
||||||
|
IO_SRAM = 5
|
||||||
|
|
||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class PhysAddr:
|
class PhysAddr:
|
||||||
"""
|
"""51-bit physical address value object (ADR-0001 Rev 2).
|
||||||
51-bit physical address value object.
|
|
||||||
|
|
||||||
Layout:
|
Layout:
|
||||||
[50:47] rack_id (4)
|
[50:47] sip_id (4) -- 16 SIPs
|
||||||
[46:43] sip_id (4)
|
[46:42] die_id (5) -- 0..15 AHBM, 16..20 IOCHIPLET, 21..31 reserved
|
||||||
[42:38] sip_seg (5) # cube_id
|
[41: 0] local_offset (42) -- 4 TB per die
|
||||||
[37:0] local_offset (38) => each segment is 256GB
|
|
||||||
|
|
||||||
local_offset:
|
|
||||||
[37] selector: 1 = HBM window (128GB reserved), 0 = PE resource window
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
rack_id: int
|
|
||||||
sip_id: int
|
sip_id: int
|
||||||
sip_seg: int
|
die_id: int
|
||||||
local_offset: int
|
local_offset: int
|
||||||
|
|
||||||
kind: Literal["hbm", "pe_resource", "raw"] = "raw"
|
kind: Literal["hbm", "pe_resource", "iocpu", "ual", "raw"] = "raw"
|
||||||
cube_id: int = 0
|
|
||||||
unit_type: UnitType = UnitType.PE
|
unit_type: UnitType = UnitType.PE
|
||||||
pe_id: int = 0
|
pe_id: int = 0
|
||||||
ext: int = 0
|
pe_sub_unit: int = 0
|
||||||
sub_offset: int = 0
|
sub_offset: int = 0
|
||||||
hbm_offset: int = 0
|
hbm_offset: int = 0
|
||||||
|
iocpu_sub_unit: int = 0
|
||||||
|
chiplet_offset: int = 0
|
||||||
|
mcpu_sub_unit: int = 0
|
||||||
|
|
||||||
HBM_WINDOW_BYTES = 1 << 37 # 128GB
|
HBM_WINDOW_BYTES = 1 << 37 # 128 GB
|
||||||
|
|
||||||
|
# ── encode / decode ──────────────────────────────────────────────
|
||||||
|
|
||||||
def encode(self) -> int:
|
def encode(self) -> int:
|
||||||
_chk_range("rack_id", self.rack_id, 4)
|
|
||||||
_chk_range("sip_id", self.sip_id, 4)
|
_chk_range("sip_id", self.sip_id, 4)
|
||||||
_chk_range("sip_seg", self.sip_seg, 5)
|
_chk_range("die_id", self.die_id, 5)
|
||||||
_chk_range("local_offset", self.local_offset, 38)
|
_chk_range("local_offset", self.local_offset, _LOCAL_BITS)
|
||||||
addr = (self.rack_id << 47) | (self.sip_id << 43) | (self.sip_seg << 38) | self.local_offset
|
# MBZ enforcement
|
||||||
if not (0 <= addr <= MAX_51):
|
if self.die_id <= _AHBM_DIE_MAX:
|
||||||
raise PhysAddrError("address exceeds 51-bit space")
|
mbz_top = (self.local_offset >> _AHBM_LOCAL_USED) & 0xF
|
||||||
|
if mbz_top != 0:
|
||||||
|
raise PhysAddrError("AHBM local_offset bits [41:38] must be zero")
|
||||||
|
elif _CHIPLET_DIE_MIN <= self.die_id <= _CHIPLET_DIE_MAX:
|
||||||
|
mbz_top = (self.local_offset >> _CHIPLET_LOCAL_BITS) & 0x3
|
||||||
|
if mbz_top != 0:
|
||||||
|
raise PhysAddrError("IOCHIPLET local_offset bits [41:40] must be zero")
|
||||||
|
addr = (self.sip_id << _SIP_SHIFT) | (self.die_id << _DIE_SHIFT) | self.local_offset
|
||||||
return addr
|
return addr
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def decode(addr: int) -> PhysAddr:
|
def decode(addr: int) -> PhysAddr:
|
||||||
if not (0 <= addr <= MAX_51):
|
if not (0 <= addr <= MAX_51):
|
||||||
raise PhysAddrError("addr must be a 51-bit value")
|
raise PhysAddrError("addr must be a 51-bit value")
|
||||||
rack = (addr >> 47) & 0xF
|
sip_id = (addr >> _SIP_SHIFT) & 0xF
|
||||||
sip_id = (addr >> 43) & 0xF
|
die_id = (addr >> _DIE_SHIFT) & 0x1F
|
||||||
sip_seg = (addr >> 38) & 0x1F
|
local_offset = addr & _LOCAL_MASK
|
||||||
off = addr & ((1 << 38) - 1)
|
|
||||||
cube_id = sip_seg
|
if die_id <= _AHBM_DIE_MAX:
|
||||||
sel = (off >> 37) & 0x1
|
return PhysAddr._decode_ahbm(sip_id, die_id, local_offset)
|
||||||
|
elif _CHIPLET_DIE_MIN <= die_id <= _CHIPLET_DIE_MAX:
|
||||||
|
return PhysAddr._decode_chiplet(sip_id, die_id, local_offset)
|
||||||
|
else:
|
||||||
|
raise PhysAddrError(f"die_id {die_id} is reserved (21..31)")
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _decode_ahbm(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
|
||||||
|
sel = (local_offset >> _AHBM_SEL_BIT) & 0x1
|
||||||
if sel == 1:
|
if sel == 1:
|
||||||
hbm_offset = int(off & ((1 << 37) - 1))
|
hbm_offset = int(local_offset & ((1 << _AHBM_SEL_BIT) - 1))
|
||||||
return PhysAddr(
|
return PhysAddr(
|
||||||
rack_id=rack,
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
sip_id=sip_id,
|
kind="hbm", hbm_offset=hbm_offset,
|
||||||
sip_seg=sip_seg,
|
|
||||||
local_offset=off,
|
|
||||||
kind="hbm",
|
|
||||||
cube_id=cube_id,
|
|
||||||
hbm_offset=hbm_offset,
|
|
||||||
)
|
)
|
||||||
# PE resource decode
|
# Resource window
|
||||||
raw_ut = int((off >> 34) & 0x7)
|
res_kind = int((local_offset >> _RES_KIND_SHIFT) & _RES_KIND_MASK)
|
||||||
try:
|
try:
|
||||||
unit_type = UnitType(raw_ut)
|
unit_type = UnitType(res_kind)
|
||||||
except ValueError:
|
except ValueError:
|
||||||
raise PhysAddrError(f"unknown unit_type: {raw_ut}") from None
|
raise PhysAddrError(f"unknown resource_kind: {res_kind}") from None
|
||||||
pe_id = int((off >> 30) & 0xF)
|
|
||||||
ext = int((off >> 29) & 0x1)
|
if unit_type == UnitType.PE:
|
||||||
sub_offset = int(off & ((1 << 29) - 1))
|
pe_id = int((local_offset >> _PE_ID_SHIFT) & 0xF)
|
||||||
|
pe_sub = int((local_offset >> _PE_SUB_SHIFT) & 0xF)
|
||||||
|
sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
|
||||||
return PhysAddr(
|
return PhysAddr(
|
||||||
rack_id=rack,
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
sip_id=sip_id,
|
kind="pe_resource", unit_type=unit_type,
|
||||||
sip_seg=sip_seg,
|
pe_id=pe_id, pe_sub_unit=pe_sub, sub_offset=sub_off,
|
||||||
local_offset=off,
|
)
|
||||||
kind="pe_resource",
|
elif unit_type == UnitType.MCPU:
|
||||||
cube_id=cube_id,
|
mcpu_sub = int((local_offset >> _MCPU_SUB_SHIFT) & 0x1F)
|
||||||
unit_type=unit_type,
|
sub_off = int(local_offset & ((1 << _PE_SUB_OFFSET_BITS) - 1))
|
||||||
pe_id=pe_id,
|
return PhysAddr(
|
||||||
ext=ext,
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
sub_offset=sub_offset,
|
kind="pe_resource", unit_type=unit_type,
|
||||||
hbm_offset=0,
|
mcpu_sub_unit=mcpu_sub, sub_offset=sub_off,
|
||||||
|
)
|
||||||
|
else: # SRAM
|
||||||
|
sub_off = int(local_offset & ((1 << _SRAM_OFFSET_BITS) - 1))
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="pe_resource", unit_type=unit_type,
|
||||||
|
sub_offset=sub_off,
|
||||||
)
|
)
|
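The AHBM branch above first tests the selector bit, then dispatches on `resource_kind`. A standalone sketch of that field extraction follows; `decode_ahbm_local` and its dict result are illustrative stand-ins for the `PhysAddr` objects that `_decode_ahbm` returns.

```python
def decode_ahbm_local(local_offset: int) -> dict:
    """Split an AHBM local_offset into its fields (illustrative only)."""
    if (local_offset >> 37) & 0x1:
        # Selector bit set: the low 37 bits are an HBM offset.
        return {"kind": "hbm", "hbm_offset": local_offset & ((1 << 37) - 1)}
    res_kind = (local_offset >> 34) & 0x7
    sub_offset = local_offset & ((1 << 25) - 1)
    if res_kind == 0:  # PE_LOCAL: [32:29] pe_id, [28:25] pe_sub_unit
        return {
            "kind": "pe",
            "pe_id": (local_offset >> 29) & 0xF,
            "pe_sub_unit": (local_offset >> 25) & 0xF,
            "sub_offset": sub_offset,
        }
    if res_kind == 1:  # MCPU_LOCAL: [29:25] mcpu_sub_unit
        return {
            "kind": "mcpu",
            "mcpu_sub_unit": (local_offset >> 25) & 0x1F,
            "sub_offset": sub_offset,
        }
    if res_kind == 2:  # CUBE_SRAM: [24:0] sram_offset
        return {"kind": "sram", "sub_offset": sub_offset}
    raise ValueError(f"unknown resource_kind: {res_kind}")


print(decode_ahbm_local((5 << 29) | (6 << 25) | 0x40))
print(decode_ahbm_local((1 << 37) | 0x1000))
```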
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def hbm_addr(*, rack_id: int, sip_id: int, cube_id: int, hbm_offset: int) -> PhysAddr:
|
def _decode_chiplet(sip_id: int, die_id: int, local_offset: int) -> PhysAddr:
|
||||||
_chk_max("cube_id", cube_id, 31)
|
chip_off = local_offset & ((1 << _CHIPLET_LOCAL_BITS) - 1)
|
||||||
_chk_range("hbm_offset", hbm_offset, 37)
|
if chip_off < _IOCPU_BOUNDARY:
|
||||||
sip_seg = cube_id
|
iocpu_sub = int((chip_off >> _IOCPU_SUB_SHIFT) & 0xF)
|
||||||
local_offset = (1 << 37) | int(hbm_offset)
|
sub_off = int(chip_off & ((1 << _IOCPU_SUB_OFFSET_BITS) - 1))
|
||||||
return PhysAddr(
|
return PhysAddr(
|
||||||
rack_id=rack_id,
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
sip_id=sip_id,
|
kind="iocpu", chiplet_offset=chip_off,
|
||||||
sip_seg=sip_seg,
|
iocpu_sub_unit=iocpu_sub, sub_offset=sub_off,
|
||||||
local_offset=local_offset,
|
)
|
||||||
kind="hbm",
|
else:
|
||||||
cube_id=cube_id,
|
return PhysAddr(
|
||||||
hbm_offset=int(hbm_offset),
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="ual", chiplet_offset=chip_off,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── AHBM factory methods ────────────────────────────────────────
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def hbm_addr(*, sip_id: int, die_id: int, hbm_offset: int) -> PhysAddr:
|
||||||
|
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
|
||||||
|
_chk_range("hbm_offset", hbm_offset, _AHBM_SEL_BIT)
|
||||||
|
local_offset = (1 << _AHBM_SEL_BIT) | int(hbm_offset)
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="hbm", hbm_offset=int(hbm_offset),
|
||||||
)
|
)
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pe_hbm_addr(
|
def pe_hbm_addr(
|
||||||
*,
|
*, sip_id: int, die_id: int,
|
||||||
rack_id: int,
|
pe_id: int, pe_local_hbm_offset: int, slice_size_bytes: int,
|
||||||
sip_id: int,
|
|
||||||
cube_id: int,
|
|
||||||
pe_id: int,
|
|
||||||
pe_local_hbm_offset: int,
|
|
||||||
slice_size_bytes: int,
|
|
||||||
) -> PhysAddr:
|
) -> PhysAddr:
|
||||||
_chk_max("cube_id", cube_id, 31)
|
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
|
||||||
_chk_range("pe_id", pe_id, 4)
|
_chk_range("pe_id", pe_id, 4)
|
||||||
if not (0 <= pe_local_hbm_offset < slice_size_bytes):
|
if not (0 <= pe_local_hbm_offset < slice_size_bytes):
|
||||||
raise PhysAddrError("pe_local_hbm_offset out of PE local slice range")
|
raise PhysAddrError("pe_local_hbm_offset out of PE local slice range")
|
||||||
hbm_offset = int(pe_id) * int(slice_size_bytes) + int(pe_local_hbm_offset)
|
hbm_offset = int(pe_id) * int(slice_size_bytes) + int(pe_local_hbm_offset)
|
||||||
if not (0 <= hbm_offset < PhysAddr.HBM_WINDOW_BYTES):
|
if not (0 <= hbm_offset < PhysAddr.HBM_WINDOW_BYTES):
|
||||||
raise PhysAddrError("HBM offset exceeds reserved 128GB window")
|
raise PhysAddrError("HBM offset exceeds reserved 128GB window")
|
||||||
return PhysAddr.hbm_addr(
|
return PhysAddr.hbm_addr(sip_id=sip_id, die_id=die_id, hbm_offset=hbm_offset)
|
||||||
rack_id=rack_id, sip_id=sip_id, cube_id=cube_id, hbm_offset=hbm_offset
|
|
||||||
)
|
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def hbm_pe_id(hbm_offset: int, slice_size_bytes: int) -> int:
|
def hbm_pe_id(hbm_offset: int, slice_size_bytes: int) -> int:
|
||||||
return hbm_offset // slice_size_bytes
|
return hbm_offset // slice_size_bytes
|
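The PE slice arithmetic used by `pe_hbm_addr` and `hbm_pe_id` can be checked with a small round trip. The function names below are placeholders for this sketch, and the 8 GiB slice size is only an example value.

```python
HBM_WINDOW_BYTES = 1 << 37  # 128 GB HBM window, as in the diff


def pe_hbm_offset(pe_id: int, pe_local_offset: int, slice_size_bytes: int) -> int:
    """Offset inside the HBM window for a PE-local offset."""
    if not 0 <= pe_local_offset < slice_size_bytes:
        raise ValueError("pe_local_offset outside the PE slice")
    offset = pe_id * slice_size_bytes + pe_local_offset
    if not 0 <= offset < HBM_WINDOW_BYTES:
        raise ValueError("offset exceeds the HBM window")
    return offset


def owning_pe(hbm_offset: int, slice_size_bytes: int) -> int:
    """Inverse lookup: which PE slice contains this HBM offset."""
    return hbm_offset // slice_size_bytes


slice_bytes = 1 << 33  # example: 8 GiB per PE
off = pe_hbm_offset(pe_id=3, pe_local_offset=0x1000, slice_size_bytes=slice_bytes)
print(hex(off), owning_pe(off, slice_bytes))
```

With this example slice size, sixteen slices fill the window exactly, so `pe_id=16` is rejected by the window check.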
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def cube_sram_addr(
|
def pe_tcm_addr(
|
||||||
*, rack_id: int, sip_id: int, cube_id: int, sram_offset: int,
|
*, sip_id: int, die_id: int, pe_id: int, tcm_offset: int,
|
||||||
) -> PhysAddr:
|
) -> PhysAddr:
|
||||||
_chk_max("cube_id", cube_id, 31)
|
return PhysAddr.pe_resource_addr(
|
||||||
_chk_range("sram_offset", sram_offset, 29)
|
sip_id=sip_id, die_id=die_id, pe_id=pe_id,
|
||||||
sip_seg = cube_id
|
pe_sub_unit=PESubUnit.PE_TCM, sub_offset=tcm_offset,
|
||||||
local_offset = (UnitType.SRAM << 34) | sram_offset
|
|
||||||
return PhysAddr(
|
|
||||||
rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
|
|
||||||
local_offset=local_offset,
|
|
||||||
kind="pe_resource", cube_id=cube_id,
|
|
||||||
unit_type=UnitType.SRAM, sub_offset=sram_offset,
|
|
||||||
)
|
)
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def pe_tcm_addr(
|
def pe_resource_addr(
|
||||||
*, rack_id: int, sip_id: int, cube_id: int, pe_id: int, tcm_offset: int,
|
*, sip_id: int, die_id: int, pe_id: int,
|
||||||
|
pe_sub_unit: int, sub_offset: int,
|
||||||
) -> PhysAddr:
|
) -> PhysAddr:
|
||||||
_chk_max("cube_id", cube_id, 31)
|
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
|
||||||
_chk_range("pe_id", pe_id, 4)
|
_chk_range("pe_id", pe_id, 4)
|
||||||
_chk_range("tcm_offset", tcm_offset, 29)
|
_chk_range("pe_sub_unit", pe_sub_unit, 4)
|
||||||
sip_seg = cube_id
|
_chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
|
||||||
local_offset = (UnitType.PE << 34) | (pe_id << 30) | tcm_offset
|
local_offset = (
|
||||||
return PhysAddr(
|
(UnitType.PE << _RES_KIND_SHIFT)
|
||||||
rack_id=rack_id, sip_id=sip_id, sip_seg=sip_seg,
|
| (pe_id << _PE_ID_SHIFT)
|
||||||
local_offset=local_offset,
|
| (pe_sub_unit << _PE_SUB_SHIFT)
|
||||||
kind="pe_resource", cube_id=cube_id,
|
| sub_offset
|
||||||
unit_type=UnitType.PE, pe_id=pe_id, sub_offset=tcm_offset,
|
)
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="pe_resource", unit_type=UnitType.PE,
|
||||||
|
pe_id=pe_id, pe_sub_unit=pe_sub_unit, sub_offset=sub_offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def cube_sram_addr(
|
||||||
|
*, sip_id: int, die_id: int, sram_offset: int,
|
||||||
|
) -> PhysAddr:
|
||||||
|
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
|
||||||
|
_chk_range("sram_offset", sram_offset, _SRAM_OFFSET_BITS)
|
||||||
|
local_offset = (UnitType.SRAM << _RES_KIND_SHIFT) | sram_offset
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="pe_resource", unit_type=UnitType.SRAM, sub_offset=sram_offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def mcpu_resource_addr(
|
||||||
|
*, sip_id: int, die_id: int, mcpu_sub_unit: int, sub_offset: int,
|
||||||
|
) -> PhysAddr:
|
||||||
|
_chk_max("die_id", die_id, _AHBM_DIE_MAX)
|
||||||
|
_chk_range("mcpu_sub_unit", mcpu_sub_unit, 5)
|
||||||
|
_chk_range("sub_offset", sub_offset, _PE_SUB_OFFSET_BITS)
|
||||||
|
local_offset = (
|
||||||
|
(UnitType.MCPU << _RES_KIND_SHIFT)
|
||||||
|
| (mcpu_sub_unit << _MCPU_SUB_SHIFT)
|
||||||
|
| sub_offset
|
||||||
|
)
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=local_offset,
|
||||||
|
kind="pe_resource", unit_type=UnitType.MCPU,
|
||||||
|
mcpu_sub_unit=mcpu_sub_unit, sub_offset=sub_offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── IOCHIPLET factory methods ────────────────────────────────────
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def iocpu_resource_addr(
|
||||||
|
*, sip_id: int, die_id: int, iocpu_sub_unit: int, sub_offset: int,
|
||||||
|
) -> PhysAddr:
|
||||||
|
_chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
|
||||||
|
if die_id < _CHIPLET_DIE_MIN:
|
||||||
|
raise PhysAddrError(
|
||||||
|
f"die_id {die_id} is not an IOCHIPLET "
|
||||||
|
f"(must be {_CHIPLET_DIE_MIN}..{_CHIPLET_DIE_MAX})"
|
||||||
|
)
|
||||||
|
_chk_range("iocpu_sub_unit", iocpu_sub_unit, 4)
|
||||||
|
_chk_range("sub_offset", sub_offset, _IOCPU_SUB_OFFSET_BITS)
|
||||||
|
chiplet_offset = (iocpu_sub_unit << _IOCPU_SUB_SHIFT) | sub_offset
|
||||||
|
if chiplet_offset >= _IOCPU_BOUNDARY:
|
||||||
|
raise PhysAddrError("IOCPU region overflow (must be < 2 GB)")
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
|
||||||
|
kind="iocpu", chiplet_offset=chiplet_offset,
|
||||||
|
iocpu_sub_unit=iocpu_sub_unit, sub_offset=sub_offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def ual_addr(*, sip_id: int, die_id: int, ual_offset: int) -> PhysAddr:
|
||||||
|
_chk_max("die_id", die_id, _CHIPLET_DIE_MAX)
|
||||||
|
if die_id < _CHIPLET_DIE_MIN:
|
||||||
|
raise PhysAddrError(f"die_id {die_id} is not an IOCHIPLET (must be {_CHIPLET_DIE_MIN}..{_CHIPLET_DIE_MAX})")
|
||||||
|
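_chk_range("ual_offset", ual_offset, _CHIPLET_LOCAL_BITS)  # reject negative offsets that would land in the IOCPU region
chiplet_offset = _IOCPU_BOUNDARY + ual_offset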
|
||||||
|
_chk_range("chiplet_offset", chiplet_offset, _CHIPLET_LOCAL_BITS)
|
||||||
|
return PhysAddr(
|
||||||
|
sip_id=sip_id, die_id=die_id, local_offset=chiplet_offset,
|
||||||
|
kind="ual", chiplet_offset=chiplet_offset,
|
||||||
)
|
)
|
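For the IOCHIPLET side, the split between the IOCPU region and the UAL region at the 2 GB boundary can be sketched as below. `classify_chiplet_offset` is a hypothetical helper; unlike `_decode_chiplet`, it returns the UAL offset relative to the boundary purely to make the arithmetic visible.

```python
IOCPU_BOUNDARY = 1 << 31        # first 2 GB of the chiplet space is IOCPU
CHIPLET_LOCAL_BITS = 40         # [39:0] chiplet_offset, [41:40] must be zero


def classify_chiplet_offset(local_offset: int) -> tuple:
    """Return ("iocpu", sub_unit, sub_offset) or ("ual", None, ual_offset)."""
    chip_off = local_offset & ((1 << CHIPLET_LOCAL_BITS) - 1)
    if chip_off < IOCPU_BOUNDARY:
        sub_unit = (chip_off >> 27) & 0xF          # [30:27] iocpu_sub_unit
        sub_offset = chip_off & ((1 << 27) - 1)    # [26:0] sub_offset
        return ("iocpu", sub_unit, sub_offset)
    return ("ual", None, chip_off - IOCPU_BOUNDARY)


print(classify_chiplet_offset((5 << 27) | 0x10))
print(classify_chiplet_offset(IOCPU_BOUNDARY + 0x20))
```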
||||||
|
|||||||
@@ -27,16 +27,16 @@ class AddressResolver:
|
|||||||
|
|
||||||
def resolve(self, addr: PhysAddr) -> str:
|
def resolve(self, addr: PhysAddr) -> str:
|
||||||
s = addr.sip_id
|
s = addr.sip_id
|
||||||
c = addr.cube_id
|
d = addr.die_id
|
||||||
if addr.kind == "hbm":
|
if addr.kind == "hbm":
|
||||||
node_id = f"sip{s}.cube{c}.hbm_ctrl"
|
node_id = f"sip{s}.cube{d}.hbm_ctrl"
|
||||||
elif addr.kind == "pe_resource":
|
elif addr.kind == "pe_resource":
|
||||||
if addr.unit_type == UnitType.PE:
|
if addr.unit_type == UnitType.PE:
|
||||||
node_id = f"sip{s}.cube{c}.pe{addr.pe_id}.pe_tcm"
|
node_id = f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
|
||||||
elif addr.unit_type == UnitType.SRAM:
|
elif addr.unit_type == UnitType.SRAM:
|
||||||
node_id = f"sip{s}.cube{c}.sram"
|
node_id = f"sip{s}.cube{d}.sram"
|
||||||
elif addr.unit_type == UnitType.MCPU:
|
elif addr.unit_type == UnitType.MCPU:
|
||||||
node_id = f"sip{s}.cube{c}.m_cpu"
|
node_id = f"sip{s}.cube{d}.m_cpu"
|
||||||
else:
|
else:
|
||||||
raise RoutingError(f"unsupported unit_type: {addr.unit_type}")
|
raise RoutingError(f"unsupported unit_type: {addr.unit_type}")
|
||||||
else:
|
else:
|
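The dispatch in this hunk maps a decoded address to a node id string, with `die_id` now filling the `cube` slot of the name. A self-contained sketch of the visible branches follows; `resolve_node_id` is a hypothetical free function, and raising `ValueError` for other kinds is an assumption, since the final `else` branch is not shown in the hunk.

```python
from enum import IntEnum


class UnitType(IntEnum):
    PE = 0
    MCPU = 1
    SRAM = 2


def resolve_node_id(kind: str, sip_id: int, die_id: int,
                    unit_type: UnitType = UnitType.PE, pe_id: int = 0) -> str:
    """Map decoded address fields to a graph node id (visible branches only)."""
    prefix = f"sip{sip_id}.cube{die_id}"
    if kind == "hbm":
        return f"{prefix}.hbm_ctrl"
    if kind == "pe_resource":
        if unit_type == UnitType.PE:
            return f"{prefix}.pe{pe_id}.pe_tcm"
        if unit_type == UnitType.SRAM:
            return f"{prefix}.sram"
        if unit_type == UnitType.MCPU:
            return f"{prefix}.m_cpu"
        raise ValueError(f"unsupported unit_type: {unit_type}")
    # Assumption: the real resolver's handling of other kinds is elided here.
    raise ValueError(f"unsupported kind: {kind}")


print(resolve_node_id("hbm", sip_id=1, die_id=2))
print(resolve_node_id("pe_resource", 0, 4, UnitType.PE, pe_id=3))
```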
||||||
|
|||||||
@@ -385,7 +385,7 @@ class RuntimeContext:
|
|||||||
for cube_id in range(cubes_per_sip):
|
for cube_id in range(cubes_per_sip):
|
||||||
for pe_id in range(pes_per_cube):
|
for pe_id in range(pes_per_cube):
|
||||||
self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
|
self._allocators[(sip_id, cube_id, pe_id)] = PEMemAllocator(
|
||||||
rack_id=0, sip_id=sip_id, cube_id=cube_id, pe_id=pe_id, cfg=cfg,
|
sip_id=sip_id, die_id=cube_id, pe_id=pe_id, cfg=cfg,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Initialize VA allocator (MMU mappings are installed via fabric MmuMapMsg)
|
# Initialize VA allocator (MMU mappings are installed via fabric MmuMapMsg)
|
||||||
|
|||||||
@@ -113,7 +113,18 @@ class AhbmCCLBackend:
|
|||||||
)
|
)
|
||||||
n_elem = shards[0].nbytes // tensor.itemsize
|
n_elem = shards[0].nbytes // tensor.itemsize
|
||||||
kernel_fn = self._algo_module.kernel
|
kernel_fn = self._algo_module.kernel
|
||||||
kernel_args = self._algo_module.kernel_args(self._world_size, n_elem)
|
# Derive effective cube dims from tensor's actual shard placement
|
||||||
|
# (may differ from topology mesh when TP uses fewer cubes).
|
||||||
|
sip0_cubes = sorted({s.cube for s in shards if s.sip == shards[0].sip})
|
||||||
|
eff_n_cubes = len(sip0_cubes) if sip0_cubes else 1
|
||||||
|
if eff_n_cubes == 1:
|
||||||
|
eff_cube_w, eff_cube_h = 1, 1
|
||||||
|
else:
|
||||||
|
eff_cube_w, eff_cube_h = self._cube_w, self._cube_h
|
||||||
|
kernel_args = self._algo_module.kernel_args(
|
||||||
|
self._world_size, n_elem,
|
||||||
|
cube_w=eff_cube_w, cube_h=eff_cube_h,
|
||||||
|
)
|
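The effective-dimension logic above can be exercised on its own. In this sketch `Shard` is a hypothetical stand-in carrying only the `sip` and `cube` attributes that the hunk reads; the real shard type is not shown in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int


def effective_cube_dims(shards: list, cube_w: int, cube_h: int) -> tuple:
    """Collapse to a 1x1 cube mesh when the first SIP's shards sit on one cube."""
    first_sip = shards[0].sip
    cubes_on_first_sip = {s.cube for s in shards if s.sip == first_sip}
    if len(cubes_on_first_sip) <= 1:
        return (1, 1)
    return (cube_w, cube_h)


print(effective_cube_dims([Shard(0, 0), Shard(1, 0)], 4, 4))
print(effective_cube_dims([Shard(0, 0), Shard(0, 1), Shard(1, 0)], 4, 4))
```

Only the shards on the first SIP are inspected, which matches the hunk and assumes every SIP uses the same cube placement.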
||||||
|
|
||||||
# Resolve sip_rank from the current greenlet's bound rank
|
# Resolve sip_rank from the current greenlet's bound rank
|
||||||
from greenlet import getcurrent as _gc
|
from greenlet import getcurrent as _gc
|
||||||
|
|||||||
@@ -90,6 +90,11 @@ class KernelLaunchMsg:
|
|||||||
args: tuple[KernelArg, ...]
|
args: tuple[KernelArg, ...]
|
||||||
target_cubes: tuple[int, ...] | Literal["all"] = "all"
|
target_cubes: tuple[int, ...] | Literal["all"] = "all"
|
||||||
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
|
target_pe: int | tuple[int, ...] | Literal["all"] = "all"
|
||||||
|
# ADR-0009 D5: synchronized kernel start. When set, each PE_CPU yields
|
||||||
|
# until env.now >= target_start_ns before beginning kernel execution,
|
||||||
|
# so every PE in a launch starts at the same simulated time regardless
|
||||||
|
# of its M_CPU dispatch path length. Stamped by M_CPU fan-out.
|
||||||
|
target_start_ns: float | None = None
|
||||||
msg_type: Literal["kernel_launch"] = "kernel_launch"
|
msg_type: Literal["kernel_launch"] = "kernel_launch"
|
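The synchronized start described for `target_start_ns` reduces to a per-PE wait computed from the current simulated time. A minimal sketch, assuming a hypothetical helper `start_delay_ns` and leaving the actual yield on the simulation clock out of scope:

```python
from __future__ import annotations


def start_delay_ns(now_ns: float, target_start_ns: float | None) -> float:
    """How long a PE should wait before starting its kernel."""
    if target_start_ns is None:
        return 0.0  # no synchronized start requested
    # A PE that arrives late starts immediately instead of waiting a negative time.
    return max(0.0, target_start_ns - now_ns)


# Two PEs reached by dispatch paths of different length start at the same time.
arrivals = {"pe0": 10.0, "pe1": 18.0}
starts = {pe: now + start_delay_ns(now, 25.0) for pe, now in arrivals.items()}
print(starts)
```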
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -67,6 +67,10 @@ class GraphEngine:
|
|||||||
spec=graph.spec,
|
spec=graph.spec,
|
||||||
memory_store=self._memory_store,
|
memory_store=self._memory_store,
|
||||||
op_logger=self._op_logger,
|
op_logger=self._op_logger,
|
||||||
|
node_overhead_ns={
|
||||||
|
nid: float(n.attrs.get("overhead_ns", 0.0))
|
||||||
|
for nid, n in graph.nodes.items()
|
||||||
|
},
|
||||||
)
|
)
|
||||||
self._components: dict[str, ComponentBase] = {
|
self._components: dict[str, ComponentBase] = {
|
||||||
node_id: ComponentRegistry.create(node, overrides, ctx)
|
node_id: ComponentRegistry.create(node, overrides, ctx)
|
||||||
|
|||||||
@@ -212,7 +212,7 @@ def _generate_probe_h2d(graph, edge_map) -> list[dict]:
|
|||||||
t_offset = 0.0
|
t_offset = 0.0
|
||||||
for rid, (name, cube, hops) in enumerate(cases):
|
for rid, (name, cube, hops) in enumerate(cases):
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
|
sip_id=0, die_id=cube, pe_id=0,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
dst_node = resolver.resolve(pa)
|
dst_node = resolver.resolve(pa)
|
||||||
@@ -256,7 +256,7 @@ def _generate_probe_d2h(graph, edge_map) -> list[dict]:
|
|||||||
t_offset = 0.0
|
t_offset = 0.0
|
||||||
for rid, (name, cube, hops) in enumerate(cases):
|
for rid, (name, cube, hops) in enumerate(cases):
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=0, cube_id=cube, pe_id=0,
|
sip_id=0, die_id=cube, pe_id=0,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
dst_node = resolver.resolve(pa)
|
dst_node = resolver.resolve(pa)
|
||||||
@@ -310,7 +310,7 @@ def _generate_probe_pe_dma(graph, edge_map) -> list[dict]:
|
|||||||
t_offset = 0.0
|
t_offset = 0.0
|
||||||
for rid, (name, sip, src_cube, src_pe, dst_cube, dst_pe) in enumerate(cases):
|
for rid, (name, sip, src_cube, src_pe, dst_cube, dst_pe) in enumerate(cases):
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=sip, cube_id=dst_cube, pe_id=dst_pe,
|
sip_id=sip, die_id=dst_cube, pe_id=dst_pe,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
dst_node = resolver.resolve(pa)
|
dst_node = resolver.resolve(pa)
|
||||||
|
|||||||
@@ -7,11 +7,49 @@ stateful/SimPy-event-consuming and MUST NOT be shared).
|
|||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import os
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
from kernbench.topology.builder import resolve_topology
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
|
||||||
|
def pytest_sessionfinish(session, exitstatus):
|
||||||
|
"""Aggregate parametrized sweep rows into combined CSV + PNG plots.
|
||||||
|
|
||||||
|
Runs on the controller node only (xdist worker processes set
|
||||||
|
``PYTEST_XDIST_WORKER``; we skip those). It is a no-op
|
||||||
|
if no sweep rows are present (e.g., when the sweep was filtered out).
|
||||||
|
"""
|
||||||
|
if os.environ.get("PYTEST_XDIST_WORKER"):
|
||||||
|
return
|
||||||
|
import importlib.util
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
def _exec(name: str, attr: str) -> None:
|
||||||
|
mod_path = Path(__file__).parent / name
|
||||||
|
if not mod_path.exists():
|
||||||
|
return
|
||||||
|
s = importlib.util.spec_from_file_location(
|
||||||
|
f"_{name.removesuffix('.py')}_for_aggregate", mod_path,
|
||||||
|
)
|
||||||
|
if s is None or s.loader is None:
|
||||||
|
return
|
||||||
|
mod = importlib.util.module_from_spec(s)
|
||||||
|
sys.modules[s.name] = mod
|
||||||
|
try:
|
||||||
|
s.loader.exec_module(mod)
|
||||||
|
fn = getattr(mod, attr, None)
|
||||||
|
if fn is not None:
|
||||||
|
fn()
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[conftest] aggregator {attr}() in {name} failed: {e}")
|
||||||
|
|
||||||
|
_exec("test_allreduce_multidevice.py", "_aggregate_sweep_plots")
|
||||||
|
_exec("test_allreduce_buffer_kind_sweep.py", "aggregate_buffer_kind_plot")
|
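The `_exec` helper loads a test module by file path and calls an optional aggregator on it. The same pattern as a standalone function is sketched below; `call_if_present` and its boolean return value are assumptions added here so the outcome can be observed.

```python
import importlib.util
import sys
from pathlib import Path


def call_if_present(mod_path: Path, attr: str) -> bool:
    """Import a module from a file path and call mod.<attr>() if it exists."""
    if not mod_path.exists():
        return False
    spec = importlib.util.spec_from_file_location(
        f"_{mod_path.stem}_for_aggregate", mod_path,
    )
    if spec is None or spec.loader is None:
        return False
    mod = importlib.util.module_from_spec(spec)
    # Register before executing so dataclasses and similar lookups can find the module.
    sys.modules[spec.name] = mod
    try:
        spec.loader.exec_module(mod)
        fn = getattr(mod, attr, None)
        if fn is None:
            return False
        fn()
        return True
    except Exception as exc:  # keep the session hook from failing the run
        print(f"aggregator {attr}() in {mod_path.name} failed: {exc}")
        return False
```

Loading under a private module name avoids clashing with the copy of the test module that pytest has already imported.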
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(scope="session")
|
@pytest.fixture(scope="session")
|
||||||
def topology():
|
def topology():
|
||||||
"""Session-scoped parsed topology (immutable graph + spec).
|
"""Session-scoped parsed topology (immutable graph + spec).
|
||||||
|
|||||||
@@ -149,7 +149,7 @@ def _make_tuple_allocators(
|
|||||||
) -> dict[tuple[int, int, int], PEMemAllocator]:
|
) -> dict[tuple[int, int, int], PEMemAllocator]:
|
||||||
return {
|
return {
|
||||||
(s, c, p): PEMemAllocator(
|
(s, c, p): PEMemAllocator(
|
||||||
rack_id=0, sip_id=s, cube_id=c, pe_id=p, cfg=_CFG,
|
sip_id=s, die_id=c, pe_id=p, cfg=_CFG,
|
||||||
)
|
)
|
||||||
for s in range(num_sips)
|
for s in range(num_sips)
|
||||||
for c in range(num_cubes)
|
for c in range(num_cubes)
|
||||||
|
|||||||
@@ -0,0 +1,196 @@
|
|||||||
|
"""Phase 1 buffer-kind allreduce sweep — torus_2d 6 SIPs.
|
||||||
|
|
||||||
|
Parametrized over (buffer_kind, n_elem). Each case runs the standard
|
||||||
|
config-driven allreduce app and writes a JSON row to a shared staging
|
||||||
|
dir; the conftest sessionfinish hook (added in Phase 1) aggregates
|
||||||
|
rows into ``docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.png``.
|
||||||
|
|
||||||
|
Pre-Phase-2: the three buffer-kind lines overlap exactly because slot
|
||||||
|
access is latency-free today. After Phase 2 they should spread out (tcm
|
||||||
|
fastest, hbm slowest).
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
import yaml
|
||||||
|
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
# Reuse the allreduce app helpers.
|
||||||
|
from tests.test_allreduce_multidevice import (
|
||||||
|
_write_temp_configs,
|
||||||
|
run_allreduce,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
_BUFFER_KINDS = ["tcm", "sram", "hbm"]
|
||||||
|
_N_ELEM_GRID = [128, 1024, 8192, 32768] # 256 B → 64 KB per slot
|
||||||
|
_ELEM_BYTES_F16 = 2
|
||||||
|
|
||||||
|
_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
|
||||||
|
/ "allreduce_latency_plots")
|
||||||
|
_ROWS_DIR = _OUT_DIR / "_buffer_kind_rows"
|
||||||
|
|
||||||
|
|
||||||
|
def _bk_params():
|
||||||
|
out = []
|
||||||
|
for bk in _BUFFER_KINDS:
|
||||||
|
for n_elem in _N_ELEM_GRID:
|
||||||
|
out.append(pytest.param(bk, n_elem, id=f"{bk}-n_elem{n_elem}"))
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize("buffer_kind,n_elem", _bk_params())
|
||||||
|
def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
|
||||||
|
"""One config of the buffer-kind sweep. xdist parallelizes."""
|
||||||
|
sub = tmp_path / f"{buffer_kind}_{n_elem}"
|
||||||
|
sub.mkdir()
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
sub,
|
||||||
|
sip_topology="torus_2d",
|
||||||
|
n_sips=6,
|
||||||
|
algorithm="intercube_allreduce",
|
||||||
|
sip_w=3, sip_h=2,
|
||||||
|
n_elem_override=n_elem,
|
||||||
|
)
|
||||||
|
# Override buffer_kind in the temp ccl.yaml.
|
||||||
|
with open(ccl_path) as f:
|
||||||
|
ccl_cfg = yaml.safe_load(f)
|
||||||
|
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
|
||||||
|
ccl_cfg.setdefault("algorithms", {}).setdefault(
|
||||||
|
"intercube_allreduce", {},
|
||||||
|
)["buffer_kind"] = buffer_kind
|
||||||
|
with open(ccl_path, "w") as f:
|
||||||
|
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
||||||
|
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"bk_sweep_{buffer_kind}_{n_elem}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
assert result["ok_cubes"] > 0
|
||||||
|
|
||||||
|
pe_exec_vals = [
|
||||||
|
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||||||
|
for _, (_, tr) in engine._results.items()
|
||||||
|
if isinstance(tr, dict)
|
||||||
|
]
|
||||||
|
crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
bytes_per_pe = n_elem * _ELEM_BYTES_F16
|
||||||
|
record = {
|
||||||
|
"buffer_kind": buffer_kind,
|
||||||
|
"sip_topology": "torus_2d",
|
||||||
|
"n_sips": 6,
|
||||||
|
"n_elem": n_elem,
|
||||||
|
"bytes_per_pe": bytes_per_pe,
|
||||||
|
"latency_ns": crit_ns,
|
||||||
|
}
|
||||||
|
_ROWS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
row_path = _ROWS_DIR / f"{buffer_kind}_{n_elem}.json"
|
||||||
|
with open(row_path, "w", encoding="utf-8") as f:
|
||||||
|
json.dump(record, f)
|
||||||
|
|
||||||
|
|
||||||
|
def aggregate_buffer_kind_plot() -> bool:
|
||||||
|
"""Read per-config rows and emit buffer_kind_sweep.png + CSV.
|
||||||
|
|
||||||
|
Called from conftest.pytest_sessionfinish (controller-only).
|
||||||
|
Returns True if rows were aggregated.
|
||||||
|
"""
|
||||||
|
import csv
|
||||||
|
|
||||||
|
if not _ROWS_DIR.exists():
|
||||||
|
return False
|
||||||
|
row_files = sorted(_ROWS_DIR.glob("*.json"))
|
||||||
|
if not row_files:
|
||||||
|
return False
|
||||||
|
|
||||||
|
records = []
|
||||||
|
for p in row_files:
|
||||||
|
with open(p, encoding="utf-8") as f:
|
||||||
|
records.append(json.load(f))
|
||||||
|
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.ticker import FuncFormatter
|
||||||
|
|
||||||
|
def _fmt_bytes(x, _pos):
|
||||||
|
if x <= 0:
|
||||||
|
return "0"
|
||||||
|
if x >= 1024 * 1024:
|
||||||
|
return f"{x / (1024 * 1024):.0f} MB"
|
||||||
|
if x >= 1024:
|
||||||
|
return f"{x / 1024:.0f} KB"
|
||||||
|
return f"{x:.0f} B"
|
||||||
|
|
||||||
|
_bytes_fmt = FuncFormatter(_fmt_bytes)
|
||||||
|
|
||||||
|
_OUT_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
with open(_OUT_DIR / "buffer_kind_sweep.csv", "w",
|
||||||
|
newline="", encoding="utf-8") as f:
|
||||||
|
w = csv.DictWriter(f, fieldnames=[
|
||||||
|
"buffer_kind", "sip_topology", "n_sips", "n_elem",
|
||||||
|
"bytes_per_pe", "latency_ns",
|
||||||
|
])
|
||||||
|
w.writeheader()
|
||||||
|
for r in sorted(records, key=lambda r: (
|
||||||
|
r["buffer_kind"], r["bytes_per_pe"],
|
||||||
|
)):
|
||||||
|
w.writerow(r)
|
||||||
|
|
||||||
|
colors = {"tcm": "tab:blue", "sram": "tab:orange", "hbm": "tab:red"}
|
||||||
|
fig, ax = plt.subplots(figsize=(10, 6))
|
||||||
|
for bk in ["tcm", "sram", "hbm"]:
|
||||||
|
rs = sorted(
|
||||||
|
[r for r in records if r["buffer_kind"] == bk],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
if not rs:
|
||||||
|
continue
|
||||||
|
ax.plot(
|
||||||
|
[r["bytes_per_pe"] for r in rs],
|
||||||
|
[r["latency_ns"] for r in rs],
|
||||||
|
marker="o", lw=2.0,
|
||||||
|
color=colors[bk], label=f"buffer_kind = {bk}",
|
||||||
|
)
|
||||||
|
ax.set_xscale("log", base=2)
|
||||||
|
ax.set_xlabel("Bytes per PE (log scale)")
|
||||||
|
ax.set_ylabel("Time (ns)")
|
||||||
|
ax.set_title(
|
||||||
|
"Allreduce torus_2d (6 SIPs, 3×2) — IPCQ slot memory tier"
|
||||||
|
)
|
||||||
|
ax.grid(True, alpha=0.3)
|
||||||
|
ax.legend()
|
||||||
|
ax.xaxis.set_major_formatter(_bytes_fmt)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(_OUT_DIR / "buffer_kind_sweep.png", dpi=130)
|
||||||
|
plt.close(fig)
|
||||||
|
|
||||||
|
for p in row_files:
|
||||||
|
try:
|
||||||
|
p.unlink()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
try:
|
||||||
|
_ROWS_DIR.rmdir()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
print(f"\nWrote {_OUT_DIR / 'buffer_kind_sweep.png'} "
|
||||||
|
f"from {len(records)} rows")
|
||||||
|
return True
|
||||||
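The row-staging pattern used by this aggregator (each parametrized test writes one JSON file, and a session-finish hook merges them into a CSV and then clears the staging directory) can be sketched without pytest or matplotlib. The function names `stage_row` and `aggregate_rows` are illustrative assumptions, not part of kernbench. The sort key mirrors the `(buffer_kind, bytes_per_pe)` ordering shown in the diff.

```python
import csv
import json
from pathlib import Path


def stage_row(rows_dir: Path, key: str, record: dict) -> Path:
    """Write one record as its own JSON file so parallel workers never share a file."""
    rows_dir.mkdir(parents=True, exist_ok=True)
    path = rows_dir / f"{key}.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    return path


def aggregate_rows(rows_dir: Path, out_csv: Path, fieldnames: list[str]) -> bool:
    """Merge staged rows into one CSV, then delete the staging files.

    Returns False when there is nothing to aggregate.
    """
    row_files = sorted(rows_dir.glob("*.json")) if rows_dir.exists() else []
    if not row_files:
        return False
    records = []
    for p in row_files:
        with open(p, encoding="utf-8") as f:
            records.append(json.load(f))
    # Same ordering as the diff: group by tier, then by payload size.
    records.sort(key=lambda r: (r["buffer_kind"], r["bytes_per_pe"]))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(records)
    # Remove staged rows so a later partial run does not pick up stale data.
    for p in row_files:
        p.unlink(missing_ok=True)
    return True
```

Writing one file per configuration avoids any locking between xdist workers. Only the controller process needs to run the merge step.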
@@ -22,13 +22,23 @@ from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
|
|||||||
from kernbench.policy.placement.dp import DPPolicy
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
|
||||||
|
|
||||||
def _sip_topo_dims(sip_topo: str, n_sips: int) -> tuple[int, int]:
|
def _sip_topo_dims(
|
||||||
|
sip_topo: str, n_sips: int,
|
||||||
|
spec_w: int | None = None, spec_h: int | None = None,
|
||||||
|
) -> tuple[int, int]:
|
||||||
if sip_topo == "ring_1d":
|
if sip_topo == "ring_1d":
|
||||||
return (0, 0)
|
return (0, 0)
|
||||||
|
if spec_w is not None and spec_h is not None:
|
||||||
|
if spec_w * spec_h != n_sips:
|
||||||
|
raise ValueError(
|
||||||
|
f"sip layout {spec_w}x{spec_h} != n_sips ({n_sips})"
|
||||||
|
)
|
||||||
|
return (spec_w, spec_h)
|
||||||
side = int(round(math.sqrt(n_sips)))
|
side = int(round(math.sqrt(n_sips)))
|
||||||
if side * side != n_sips:
|
if side * side != n_sips:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"SIP topology '{sip_topo}' requires square n_sips, got {n_sips}"
|
f"SIP topology '{sip_topo}' requires square n_sips or "
|
||||||
|
f"explicit w/h in spec, got {n_sips}"
|
||||||
)
|
)
|
||||||
return (side, side)
|
return (side, side)
|
||||||
|
|
||||||
@@ -54,10 +64,13 @@ def run_allreduce(
|
|||||||
topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
|
topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
|
||||||
|
|
||||||
n_elem = int(cfg.get("n_elem", 8))
|
n_elem = int(cfg.get("n_elem", 8))
|
||||||
n_sips = int(spec.get("system", {}).get("sips", {}).get("count", 1))
|
sips_cfg = spec.get("system", {}).get("sips", {})
|
||||||
sip_topo = str(
|
n_sips = int(sips_cfg.get("count", 1))
|
||||||
spec.get("system", {}).get("sips", {}).get("topology", "ring_1d")
|
sip_topo = str(sips_cfg.get("topology", "ring_1d"))
|
||||||
)
|
spec_sip_w = sips_cfg.get("w")
|
||||||
|
spec_sip_h = sips_cfg.get("h")
|
||||||
|
spec_sip_w = int(spec_sip_w) if spec_sip_w is not None else None
|
||||||
|
spec_sip_h = int(spec_sip_h) if spec_sip_h is not None else None
|
||||||
|
|
||||||
cm = spec["sip"]["cube_mesh"]
|
cm = spec["sip"]["cube_mesh"]
|
||||||
cube_w = int(cm["w"])
|
cube_w = int(cm["w"])
|
||||||
@@ -65,7 +78,9 @@ def run_allreduce(
|
|||||||
n_cubes = cube_w * cube_h
|
n_cubes = cube_w * cube_h
|
||||||
|
|
||||||
sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
|
sip_topo_kind = topo_name_to_kind.get(sip_topo, 0)
|
||||||
sip_topo_w, sip_topo_h = _sip_topo_dims(sip_topo, n_sips)
|
sip_topo_w, sip_topo_h = _sip_topo_dims(
|
||||||
|
sip_topo, n_sips, spec_w=spec_sip_w, spec_h=spec_sip_h,
|
||||||
|
)
|
||||||
|
|
||||||
algo_name = cfg.get("algorithm", "allreduce")
|
algo_name = cfg.get("algorithm", "allreduce")
|
||||||
print(f"\n{'=' * 60}")
|
print(f"\n{'=' * 60}")
|
||||||
@@ -173,18 +188,36 @@ from kernbench.topology.builder import resolve_topology
|
|||||||
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
CONFIGS = [
|
CONFIGS = [
|
||||||
pytest.param("intercube_allreduce", "ring_1d", 2, id="ring_2sip"),
|
pytest.param(
|
||||||
pytest.param("intercube_allreduce", "torus_2d", 4, id="torus_4sip"),
|
"intercube_allreduce", "ring_1d", 6, None, None,
|
||||||
pytest.param("intercube_allreduce", "mesh_2d_no_wrap", 4, id="mesh_4sip"),
|
id="ring_6sip",
|
||||||
|
),
|
||||||
|
pytest.param(
|
||||||
|
"intercube_allreduce", "torus_2d", 6, 2, 3,
|
||||||
|
id="torus_6sip_2x3",
|
||||||
|
),
|
||||||
|
pytest.param(
|
||||||
|
"intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
|
||||||
|
id="mesh_6sip_2x3",
|
||||||
|
),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
def _write_temp_configs(
|
||||||
|
tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
|
||||||
|
sip_w=None, sip_h=None,
|
||||||
|
):
|
||||||
"""Write temp topology.yaml and ccl.yaml with the given overrides."""
|
"""Write temp topology.yaml and ccl.yaml with the given overrides."""
|
||||||
with open(TOPOLOGY_PATH) as f:
|
with open(TOPOLOGY_PATH) as f:
|
||||||
topo_cfg = yaml.safe_load(f)
|
topo_cfg = yaml.safe_load(f)
|
||||||
topo_cfg["system"]["sips"]["count"] = n_sips
|
topo_cfg["system"]["sips"]["count"] = n_sips
|
||||||
topo_cfg["system"]["sips"]["topology"] = sip_topology
|
topo_cfg["system"]["sips"]["topology"] = sip_topology
|
||||||
|
if sip_w is not None and sip_h is not None:
|
||||||
|
topo_cfg["system"]["sips"]["w"] = int(sip_w)
|
||||||
|
topo_cfg["system"]["sips"]["h"] = int(sip_h)
|
||||||
|
else:
|
||||||
|
topo_cfg["system"]["sips"].pop("w", None)
|
||||||
|
topo_cfg["system"]["sips"].pop("h", None)
|
||||||
topo_path = tmp_path / "topology.yaml"
|
topo_path = tmp_path / "topology.yaml"
|
||||||
with open(topo_path, "w") as f:
|
with open(topo_path, "w") as f:
|
||||||
yaml.dump(topo_cfg, f, default_flow_style=False)
|
yaml.dump(topo_cfg, f, default_flow_style=False)
|
||||||
@@ -193,6 +226,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
|||||||
with open(ccl_path) as f:
|
with open(ccl_path) as f:
|
||||||
ccl_cfg = yaml.safe_load(f)
|
ccl_cfg = yaml.safe_load(f)
|
||||||
ccl_cfg["defaults"]["algorithm"] = algorithm
|
ccl_cfg["defaults"]["algorithm"] = algorithm
|
||||||
|
if n_elem_override is not None:
|
||||||
|
ccl_cfg.setdefault("algorithms", {}).setdefault(
|
||||||
|
algorithm, {},
|
||||||
|
)["n_elem"] = int(n_elem_override)
|
||||||
|
# Ensure IPCQ slot is big enough for the per-message payload.
|
||||||
|
per_msg_bytes = int(n_elem_override) * 2 # f16
|
||||||
|
default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
|
||||||
|
if per_msg_bytes > default_slot:
|
||||||
|
ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
|
||||||
tmp_ccl = tmp_path / "ccl.yaml"
|
tmp_ccl = tmp_path / "ccl.yaml"
|
||||||
with open(tmp_ccl, "w") as f:
|
with open(tmp_ccl, "w") as f:
|
||||||
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
||||||
@@ -200,10 +242,15 @@ def _write_temp_configs(tmp_path, sip_topology, n_sips, algorithm):
|
|||||||
return str(topo_path), str(tmp_ccl)
|
return str(topo_path), str(tmp_ccl)
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("algorithm,sip_topology,n_sips", CONFIGS)
|
@pytest.mark.parametrize(
|
||||||
def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
|
"algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
|
||||||
|
)
|
||||||
|
def test_allreduce(
|
||||||
|
tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
|
||||||
|
):
|
||||||
topo_path, ccl_path = _write_temp_configs(
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
tmp_path, sip_topology, n_sips, algorithm,
|
tmp_path, sip_topology, n_sips, algorithm,
|
||||||
|
sip_w=sip_w, sip_h=sip_h,
|
||||||
)
|
)
|
||||||
topo = resolve_topology(topo_path)
|
topo = resolve_topology(topo_path)
|
||||||
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
@@ -220,3 +267,570 @@ def test_allreduce(tmp_path, algorithm, sip_topology, n_sips):
|
|||||||
algorithm=algorithm, ccl_yaml=ccl_path,
|
algorithm=algorithm, ccl_yaml=ccl_path,
|
||||||
)
|
)
|
||||||
assert result["ok_cubes"] > 0
|
assert result["ok_cubes"] > 0
|
||||||
|
|
||||||
|
|
||||||
|
# ── Latency sweep (parametrized + xdist-friendly) ─────────────────────
|
||||||
|
|
||||||
|
# Avoid n_elem = 16 (== n_cubes, dim_map collision). The sweep goes up to 96 KB per PE:
|
||||||
|
# bytes_per_pe = n_elem * 2 (f16). 49152 elem * 2 = 96 KB / PE.
|
||||||
|
_SWEEP_N_ELEM = [
|
||||||
|
8, 32, 64, 128, 512, 1024, 2048,
|
||||||
|
4096, 8192, 16384, 32768, 49152,
|
||||||
|
]
|
||||||
|
_ELEM_BYTES_F16 = 2
|
||||||
|
|
||||||
|
_SWEEP_TOPOLOGIES = [
|
||||||
|
("intercube_allreduce", "ring_1d", 6, None, None),
|
||||||
|
("intercube_allreduce", "torus_2d", 6, 2, 3),
|
||||||
|
("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
|
||||||
|
]
|
||||||
|
|
||||||
|
# Shared on-disk staging dir for parametrized sweep rows. Each
|
||||||
|
# parametrized invocation writes one JSON file here; the aggregator
|
||||||
|
# (run from conftest.pytest_sessionfinish) reads them and emits the
|
||||||
|
# combined CSV + PNG plots.
|
||||||
|
_SWEEP_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
|
||||||
|
/ "allreduce_latency_plots")
|
||||||
|
_SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
|
||||||
|
|
||||||
|
|
||||||
|
def _sweep_params():
|
||||||
|
out = []
|
||||||
|
for algorithm, sip_topology, n_sips, sip_w, sip_h in _SWEEP_TOPOLOGIES:
|
||||||
|
for n_elem in _SWEEP_N_ELEM:
|
||||||
|
out.append(pytest.param(
|
||||||
|
algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
|
||||||
|
id=f"{sip_topology}-n_elem{n_elem}",
|
||||||
|
))
|
||||||
|
return out
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize(
|
||||||
|
"algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
|
||||||
|
)
|
||||||
|
def test_allreduce_latency_one(
|
||||||
|
tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
|
||||||
|
):
|
||||||
|
"""One config of the latency sweep. xdist parallelizes across params.
|
||||||
|
|
||||||
|
Writes a single JSON row to ``_SWEEP_ROWS_DIR``. The conftest
|
||||||
|
sessionfinish hook aggregates rows into CSV + plots after all
|
||||||
|
parametrized cases finish.
|
||||||
|
"""
|
||||||
|
import json
|
||||||
|
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
tmp_path, sip_topology, n_sips, algorithm,
|
||||||
|
sip_w=sip_w, sip_h=sip_h,
|
||||||
|
n_elem_override=n_elem,
|
||||||
|
)
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm=algorithm, ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
assert result["ok_cubes"] > 0
|
||||||
|
|
||||||
|
pe_exec_vals = [
|
||||||
|
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||||||
|
for _, (_, tr) in engine._results.items()
|
||||||
|
if isinstance(tr, dict)
|
||||||
|
]
|
||||||
|
crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
cm = spec["sip"]["cube_mesh"]
|
||||||
|
n_cubes = int(cm["w"]) * int(cm["h"])
|
||||||
|
bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
|
||||||
|
bytes_per_pe = n_elem * _ELEM_BYTES_F16
|
||||||
|
|
||||||
|
record = {
|
||||||
|
"algorithm": algorithm,
|
||||||
|
"sip_topology": sip_topology,
|
||||||
|
"n_sips": n_sips,
|
||||||
|
"n_elem": n_elem,
|
||||||
|
"bytes_per_pe": bytes_per_pe,
|
||||||
|
"bytes_per_sip": bytes_per_sip,
|
||||||
|
"latency_ns": crit_ns,
|
||||||
|
}
|
||||||
|
|
||||||
|
_SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
|
||||||
|
with open(row_path, "w", encoding="utf-8") as f:
|
||||||
|
json.dump(record, f)
|
||||||
|
|
||||||
|
|
||||||
|
def _aggregate_sweep_plots() -> bool:
|
||||||
|
"""Read all per-config rows and emit CSV + PNG plots.
|
||||||
|
|
||||||
|
Called by ``conftest.pytest_sessionfinish`` (controller node only).
|
||||||
|
Returns True if any rows were aggregated, False otherwise.
|
||||||
|
"""
|
||||||
|
import csv
|
||||||
|
import json
|
||||||
|
|
||||||
|
row_files = sorted(_SWEEP_ROWS_DIR.glob("*.json")) \
|
||||||
|
if _SWEEP_ROWS_DIR.exists() else []
|
||||||
|
records: list[dict] = []
|
||||||
|
if row_files:
|
||||||
|
for p in row_files:
|
||||||
|
with open(p, encoding="utf-8") as f:
|
||||||
|
records.append(json.load(f))
|
||||||
|
else:
|
||||||
|
# Fallback: replot from existing summary.csv (skip sweep re-run).
|
||||||
|
summary_path = _SWEEP_OUT_DIR / "summary.csv"
|
||||||
|
if not summary_path.exists():
|
||||||
|
return False
|
||||||
|
with open(summary_path, encoding="utf-8") as f:
|
||||||
|
for row in csv.DictReader(f):
|
||||||
|
records.append({
|
||||||
|
"algorithm": row["algorithm"],
|
||||||
|
"sip_topology": row["sip_topology"],
|
||||||
|
"n_sips": int(row["n_sips"]),
|
||||||
|
"n_elem": int(row["n_elem"]),
|
||||||
|
"bytes_per_pe": int(row["bytes_per_pe"]),
|
||||||
|
"bytes_per_sip": int(row["bytes_per_sip"]),
|
||||||
|
"latency_ns": float(row["latency_ns"]),
|
||||||
|
})
|
||||||
|
if not records:
|
||||||
|
return False
|
||||||
|
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.ticker import FuncFormatter
|
||||||
|
|
||||||
|
def _fmt_bytes(x, _pos):
|
||||||
|
if x <= 0:
|
||||||
|
return "0"
|
||||||
|
if x >= 1024 * 1024:
|
||||||
|
return f"{x / (1024 * 1024):.0f} MB"
|
||||||
|
if x >= 1024:
|
||||||
|
return f"{x / 1024:.0f} KB"
|
||||||
|
return f"{x:.0f} B"
|
||||||
|
|
||||||
|
_bytes_fmt = FuncFormatter(_fmt_bytes)
|
||||||
|
|
||||||
|
_SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
with open(_SWEEP_OUT_DIR / "summary.csv", "w",
|
||||||
|
newline="", encoding="utf-8") as f:
|
||||||
|
w = csv.DictWriter(f, fieldnames=[
|
||||||
|
"algorithm", "sip_topology", "n_sips", "n_elem",
|
||||||
|
"bytes_per_pe", "bytes_per_sip", "latency_ns",
|
||||||
|
])
|
||||||
|
w.writeheader()
|
||||||
|
for r in sorted(records, key=lambda r: (
|
||||||
|
r["sip_topology"], r["bytes_per_pe"],
|
||||||
|
)):
|
||||||
|
w.writerow(r)
|
||||||
|
|
||||||
|
topologies = sorted({r["sip_topology"] for r in records})
|
||||||
|
for topo_name in topologies:
|
||||||
|
rs = sorted(
|
||||||
|
[r for r in records if r["sip_topology"] == topo_name],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
if not rs:
|
||||||
|
continue
|
||||||
|
xs = [r["bytes_per_pe"] for r in rs]
|
||||||
|
ys = [r["latency_ns"] for r in rs]
|
||||||
|
title = (
|
||||||
|
f"Allreduce latency — {topo_name} "
|
||||||
|
f"(n_sips={rs[0]['n_sips']})"
|
||||||
|
)
|
||||||
|
fig, ax = plt.subplots(figsize=(8, 5))
|
||||||
|
ax.plot(xs, ys, marker="o", color="tab:blue")
|
||||||
|
ax.set_xscale("log", base=2)
|
||||||
|
ax.set_xlabel("Bytes per PE (log scale)")
|
||||||
|
ax.set_ylabel("Time (ns)")
|
||||||
|
ax.set_title(title)
|
||||||
|
ax.grid(True, alpha=0.3)
|
||||||
|
ax.xaxis.set_major_formatter(_bytes_fmt)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120)
|
||||||
|
plt.close(fig)
|
||||||
|
|
||||||
|
colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
|
||||||
|
"mesh_2d_no_wrap": "tab:green"}
|
||||||
|
|
||||||
|
# ── Hand-derived theoretical model for torus_2d (6 SIPs) ──
|
||||||
|
# Critical-path analysis (per packet, packet = 128 B at NoC):
|
||||||
|
# local intra-SIP reduce + broadcast = 8 hops × 57 ns = 456 ns
|
||||||
|
# global X-direction reduce = 5 UCIe + 1 UAL = 445 ns
|
||||||
|
# global Y-direction reduce = 5 UCIe + 1 UAL = 445 ns
|
||||||
|
# per-packet startup latency = 456 + 445 + 445 = 1346 ns
|
||||||
|
# Packet count is PER CUBE (8 PEs/cube cooperate on the cube tile).
|
||||||
|
# At 6144 packets/cube the pipelined total is 8741 ns, so the
|
||||||
|
# bottleneck-stage interval τ = (8741 − 1346) / (6144 − 1) ≈ 1.204 ns.
|
||||||
|
# T_theoretical(N) = 1346 + (N − 1) × τ
|
||||||
|
# where N = ceil((bytes_per_pe × 8) / 128) = ceil(bytes_per_pe / 16)
|
||||||
|
NOC_PACKET_BYTES = 128
|
||||||
|
PES_PER_CUBE = 8
|
||||||
|
T_STARTUP_NS = 1346.0
|
||||||
|
TAU_NS = (8741.0 - 1346.0) / (6144 - 1) # ≈ 1.2038 ns/packet
|
||||||
|
|
||||||
|
def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
|
||||||
|
bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
|
||||||
|
n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES)) # ceil
|
||||||
|
return T_STARTUP_NS + (n_packets - 1) * TAU_NS
|
||||||
|
|
||||||
|
fig, ax = plt.subplots(figsize=(9, 6))
|
||||||
|
for topo_name in topologies:
|
||||||
|
rs = sorted(
|
||||||
|
[r for r in records if r["sip_topology"] == topo_name],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
if not rs:
|
||||||
|
continue
|
||||||
|
ax.plot(
|
||||||
|
[r["bytes_per_pe"] for r in rs],
|
||||||
|
[r["latency_ns"] for r in rs],
|
||||||
|
marker="o",
|
||||||
|
label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
|
||||||
|
color=colors.get(topo_name),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Theoretical torus_2d curve across all payload sizes.
|
||||||
|
torus_rs = sorted(
|
||||||
|
[r for r in records if r["sip_topology"] == "torus_2d"],
|
||||||
|
key=lambda r: r["bytes_per_pe"],
|
||||||
|
)
|
||||||
|
if torus_rs:
|
||||||
|
xs_th = [r["bytes_per_pe"] for r in torus_rs]
|
||||||
|
ys_th = [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs]
|
||||||
|
ax.plot(
|
||||||
|
xs_th, ys_th,
|
||||||
|
color="tab:red", linestyle="--", linewidth=1.6, marker="x",
|
||||||
|
label="theoretical torus_2d (6 SIPs)",
|
||||||
|
)
|
||||||
|
|
||||||
|
ax.set_xscale("log", base=2)
|
||||||
|
ax.set_xlabel("Bytes per PE (log scale)")
|
||||||
|
ax.set_ylabel("Time (ns)")
|
||||||
|
ax.set_title("Multi-device allreduce latency by topology")
|
||||||
|
ax.grid(True, alpha=0.3)
|
||||||
|
ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
|
||||||
|
right=max(r["bytes_per_pe"] for r in records) * 1.5)
|
||||||
|
ax.legend()
|
||||||
|
ax.xaxis.set_major_formatter(_bytes_fmt)
|
||||||
|
fig.tight_layout()
|
||||||
|
fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
|
||||||
|
plt.close(fig)
|
||||||
|
|
||||||
|
# Cleanup row staging dir so a partial future run doesn't pick up
|
||||||
|
# stale rows.
|
||||||
|
for p in row_files:
|
||||||
|
try:
|
||||||
|
p.unlink()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
try:
|
||||||
|
_SWEEP_ROWS_DIR.rmdir()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} "
|
||||||
|
f"from {len(records)} rows")
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
# ── Topology diagram (device-level + cube-level reduction) ────────────
|
||||||
|
|
||||||
|
# Convention: "rows × cols" everywhere, row-major rank assignment
|
||||||
|
# (rank = row * n_cols + col). For the 2×3 inter-SIP grid, this means
|
||||||
|
# 2 rows × 3 columns: SIP 0 1 2 / SIP 3 4 5.
|
||||||
|
|
||||||
|
_PALETTE_BG = "#fafbfd"
|
||||||
|
_PALETTE_FRAME = "#3a3f4a"
|
||||||
|
_PALETTE_BLUE = "#2c6fb6"
|
||||||
|
_PALETTE_GREEN = "#2e8a4e"
|
||||||
|
_PALETTE_TEXT = "#1f2530"
|
||||||
|
_PALETTE_BOX_FILL = "#eaf2fb"
|
||||||
|
_PALETTE_BOX_EDGE = "#2c4a78"
|
||||||
|
_PALETTE_ROOT_FILL = "#ffd9b8"
|
||||||
|
_PALETTE_ROOT_EDGE = "#bd5a14"
|
||||||
|
|
||||||
|
|
||||||
|
def _arrow(ax, xy_from, xy_to, color="black", lw=1.4, alpha=1.0,
|
||||||
|
style="-|>", curve=0.0):
|
||||||
|
from matplotlib.patches import FancyArrowPatch
|
||||||
|
arrow = FancyArrowPatch(
|
||||||
|
xy_from, xy_to,
|
||||||
|
arrowstyle=style, mutation_scale=12,
|
||||||
|
color=color, lw=lw, alpha=alpha,
|
||||||
|
connectionstyle=f"arc3,rad={curve}",
|
||||||
|
)
|
||||||
|
ax.add_patch(arrow)
|
||||||
|
|
||||||
|
|
||||||
|
def _draw_sip_box(ax, cx, cy, w, h, label, *, fill=_PALETTE_BOX_FILL,
|
||||||
|
edge=_PALETTE_BOX_EDGE, text_color=_PALETTE_TEXT,
|
||||||
|
font=10):
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
box = FancyBboxPatch(
|
||||||
|
(cx - w / 2, cy - h / 2), w, h,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.10",
|
||||||
|
linewidth=1.4, edgecolor=edge, facecolor=fill,
|
||||||
|
)
|
||||||
|
ax.add_patch(box)
|
||||||
|
ax.text(cx, cy, label, ha="center", va="center",
|
||||||
|
color=text_color, fontsize=font, fontweight="bold")
|
||||||
|
|
||||||
|
|
||||||
|
def _frame_panel(ax, title, lim_x=10.0, lim_y=6.0):
|
||||||
|
"""Set up a square-ish panel with a visible outer border."""
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
ax.set_xlim(0, lim_x)
|
||||||
|
ax.set_ylim(0, lim_y)
|
||||||
|
ax.set_aspect("equal")
|
||||||
|
ax.axis("off")
|
||||||
|
ax.set_facecolor(_PALETTE_BG)
|
||||||
|
border = FancyBboxPatch(
|
||||||
|
(0.05, 0.05), lim_x - 0.10, lim_y - 0.10,
|
||||||
|
boxstyle="round,pad=0.01,rounding_size=0.12",
|
||||||
|
linewidth=1.4, edgecolor=_PALETTE_FRAME, facecolor=_PALETTE_BG,
|
||||||
|
zorder=0,
|
||||||
|
)
|
||||||
|
ax.add_patch(border)
|
||||||
|
ax.set_title(title, fontsize=12, fontweight="bold",
|
||||||
|
color=_PALETTE_TEXT, pad=8)
|
||||||
|
|
||||||
|
|
||||||
|
def _draw_ring_topology(ax):
|
||||||
|
_frame_panel(ax, "ring_1d (6 SIPs)", lim_x=10.0, lim_y=6.0)
|
||||||
|
|
||||||
|
xs = [1.2, 2.7, 4.2, 5.7, 7.2, 8.7]
|
||||||
|
y = 3.1
|
||||||
|
box_w, box_h = 1.05, 0.9
|
||||||
|
for i, x in enumerate(xs):
|
||||||
|
_draw_sip_box(ax, x, y, box_w, box_h, f"SIP {i}")
|
||||||
|
# Forward ring (global_E) — adjacent neighbours, anchored to box edges.
|
||||||
|
for i in range(5):
|
||||||
|
_arrow(ax, (xs[i] + box_w / 2, y),
|
||||||
|
(xs[i + 1] - box_w / 2, y),
|
||||||
|
color=_PALETTE_BLUE, lw=1.6)
|
||||||
|
# Wrap (SIP 5 → SIP 0). Anchor at right-CENTER of SIP 5 and
|
||||||
|
# left-CENTER of SIP 0; arc OUTSIDE (above) the row so it does not
|
||||||
|
# overlap any of the SIP boxes in between.
|
||||||
|
_arrow(
|
||||||
|
ax,
|
||||||
|
(xs[5] + box_w / 2, y),
|
||||||
|
(xs[0] - box_w / 2, y),
|
||||||
|
color=_PALETTE_BLUE, lw=1.6, curve=-0.40,
|
||||||
|
)
|
||||||
|
ax.text(5.0, y + 2.0, "global_E (ring)", ha="center",
|
||||||
|
color=_PALETTE_BLUE, fontsize=10, style="italic")
|
||||||
|
ax.text(5.0, y - 1.5,
|
||||||
|
"(global_W = reverse direction, used by the algorithm)",
|
||||||
|
ha="center", color="gray", fontsize=8, style="italic")
|
||||||
|
|
||||||
|
|
||||||
|
def _draw_grid_topology(ax, kind, *, n_rows=2, n_cols=3):
|
||||||
|
"""kind ∈ {'torus', 'mesh'}. Lays out as n_rows × n_cols (row-major).
|
||||||
|
|
||||||
|
For the sweep we use 2 rows × 3 cols → SIP layout::
|
||||||
|
|
||||||
|
row 0: SIP 0 SIP 1 SIP 2
|
||||||
|
row 1: SIP 3 SIP 4 SIP 5
|
||||||
|
"""
|
||||||
|
title = f"torus_2d ({n_rows}×{n_cols}, 6 SIPs)" if kind == "torus" \
|
||||||
|
else f"mesh_2d_no_wrap ({n_rows}×{n_cols}, 6 SIPs)"
|
||||||
|
_frame_panel(ax, title, lim_x=10.0, lim_y=6.0)
|
||||||
|
|
||||||
|
col_xs = [2.0, 5.0, 8.0] # 3 cols
|
||||||
|
row_ys = [4.3, 1.8] # 2 rows
|
||||||
|
box_w, box_h = 1.3, 0.95
|
||||||
|
pos: dict[tuple[int, int], tuple[float, float]] = {}
|
||||||
|
for r in range(n_rows):
|
||||||
|
for c in range(n_cols):
|
||||||
|
rank = r * n_cols + c
|
||||||
|
x, y = col_xs[c], row_ys[r]
|
||||||
|
pos[(r, c)] = (x, y)
|
||||||
|
_draw_sip_box(ax, x, y, box_w, box_h, f"SIP {rank}")
|
||||||
|
|
||||||
|
# Row edges (E↔W) — between adjacent columns within each row.
|
||||||
|
for r in range(n_rows):
|
||||||
|
for c in range(n_cols - 1):
|
||||||
|
x0, y0 = pos[(r, c)]
|
||||||
|
x1, y1 = pos[(r, c + 1)]
|
||||||
|
_arrow(ax, (x0 + box_w / 2, y0 + 0.10),
|
||||||
|
(x1 - box_w / 2, y1 + 0.10),
|
||||||
|
color=_PALETTE_BLUE, lw=1.5)
|
||||||
|
_arrow(ax, (x1 - box_w / 2, y1 - 0.10),
|
||||||
|
(x0 + box_w / 2, y0 - 0.10),
|
||||||
|
color=_PALETTE_BLUE, lw=1.5)
|
||||||
|
# Col edges (N↔S) — between adjacent rows within each column.
|
||||||
|
for c in range(n_cols):
|
||||||
|
for r in range(n_rows - 1):
|
||||||
|
x0, y0 = pos[(r, c)]
|
||||||
|
x1, y1 = pos[(r + 1, c)]
|
||||||
|
_arrow(ax, (x0 - 0.12, y0 - box_h / 2),
|
||||||
|
(x1 - 0.12, y1 + box_h / 2),
|
||||||
|
color=_PALETTE_GREEN, lw=1.5)
|
||||||
|
_arrow(ax, (x1 + 0.12, y1 + box_h / 2),
|
||||||
|
(x0 + 0.12, y0 - box_h / 2),
|
||||||
|
color=_PALETTE_GREEN, lw=1.5)
|
||||||
|
# Wrap arrows for torus only — anchor to the centre of the OUTER
|
||||||
|
# edge of the end SIPs and arc OUTSIDE the row/column so they do
|
||||||
|
# not overlap the SIPs in between.
|
||||||
|
if kind == "torus":
|
||||||
|
# Row wrap: last col → first col. Top row arcs UP, bottom row
|
||||||
|
# arcs DOWN, so each wrap sits clearly outside its own row.
|
||||||
|
for r in range(n_rows):
|
||||||
|
x0, y0 = pos[(r, 0)]
|
||||||
|
x1, y1 = pos[(r, n_cols - 1)]
|
||||||
|
curve = -0.45 if r == 0 else 0.45
|
||||||
|
_arrow(
|
||||||
|
ax,
|
||||||
|
(x1 + box_w / 2, y1),
|
||||||
|
(x0 - box_w / 2, y0),
|
||||||
|
color=_PALETTE_BLUE, lw=1.5,
|
||||||
|
curve=curve, alpha=0.9,
|
||||||
|
)
|
||||||
|
# Col wrap: last row → first row. Leftmost col arcs LEFT,
|
||||||
|
# rightmost col arcs RIGHT. Middle col(s) get a small inline
|
||||||
|
# marker + legend note (drawing them through the panel would
|
||||||
|
# collide with the row arrows).
|
||||||
|
for c in range(n_cols):
|
||||||
|
x0, y0 = pos[(0, c)]
|
||||||
|
x1, y1 = pos[(n_rows - 1, c)]
|
||||||
|
if c == 0:
|
||||||
|
curve = 0.55
|
||||||
|
elif c == n_cols - 1:
|
||||||
|
curve = -0.55
|
||||||
|
else:
|
||||||
|
continue # skip middle col — see legend note
|
||||||
|
_arrow(
|
||||||
|
ax,
|
||||||
|
(x1, y1 - box_h / 2),
|
||||||
|
(x0, y0 + box_h / 2),
|
||||||
|
color=_PALETTE_GREEN, lw=1.5,
|
||||||
|
curve=curve, alpha=0.9,
|
||||||
|
)
|
||||||
|
|
||||||
|
ax.text(0.7, 5.6, "global_E/W (row)", color=_PALETTE_BLUE,
|
||||||
|
fontsize=9, style="italic", fontweight="bold")
|
||||||
|
ax.text(0.7, 5.25, "global_N/S (col)", color=_PALETTE_GREEN,
|
||||||
|
fontsize=9, style="italic", fontweight="bold")
|
||||||
|
ax.text(0.7, 4.92,
|
||||||
|
"wrap = torus" if kind == "torus" else "no wrap = mesh",
|
||||||
|
color="gray", fontsize=8, style="italic")
|
||||||
|
if kind == "torus" and n_cols > 2:
|
||||||
|
ax.text(0.7, 0.3,
|
||||||
|
"(middle-col wrap omitted for clarity — every row "
|
||||||
|
"and every column wraps)",
|
||||||
|
color="gray", fontsize=7.5, style="italic")
|
||||||
|
|
||||||
|
|
||||||
|
def _draw_cube_reduction(ax):
|
||||||
|
"""4×4 cube grid inside SIP 0 — compact layout with phase legend."""
|
||||||
|
from matplotlib.patches import Rectangle
|
||||||
|
_frame_panel(ax, "Cube-level reduction inside SIP 0 (4×4 cubes)",
|
||||||
|
lim_x=10.0, lim_y=6.0)
|
||||||
|
|
||||||
|
cube_w = 0.65
|
||||||
|
cube_gap = 0.18
|
||||||
|
# Place the 4×4 grid in the left half of the panel (fixed origin below).
|
||||||
|
grid_total = 4 * cube_w + 3 * cube_gap
|
||||||
|
grid_x0 = 0.7
|
||||||
|
grid_y0 = 0.7
|
||||||
|
centers: dict[tuple[int, int], tuple[float, float]] = {}
|
||||||
|
for r in range(4):
|
||||||
|
for c in range(4):
|
||||||
|
cx = grid_x0 + c * (cube_w + cube_gap) + cube_w / 2
|
||||||
|
cy = grid_y0 + (3 - r) * (cube_w + cube_gap) + cube_w / 2
|
||||||
|
centers[(r, c)] = (cx, cy)
|
||||||
|
cube_id = r * 4 + c
|
||||||
|
is_root = (r == 3 and c == 3)
|
||||||
|
face = _PALETTE_ROOT_FILL if is_root else _PALETTE_BOX_FILL
|
||||||
|
edge = _PALETTE_ROOT_EDGE if is_root else _PALETTE_BOX_EDGE
|
||||||
|
rect = Rectangle(
|
||||||
|
(cx - cube_w / 2, cy - cube_w / 2), cube_w, cube_w,
|
||||||
|
linewidth=1.2, edgecolor=edge, facecolor=face,
|
||||||
|
)
|
||||||
|
ax.add_patch(rect)
|
||||||
|
label = f"c{cube_id}"
|
||||||
|
ax.text(cx, cy, label, ha="center", va="center",
|
||||||
|
fontsize=7.5, fontweight="bold",
|
||||||
|
color=_PALETTE_ROOT_EDGE if is_root
|
||||||
|
else _PALETTE_TEXT)
|
||||||
|
|
||||||
|
# Phase 1: row reduce W→E.
|
||||||
|
for r in range(4):
|
||||||
|
for c in range(3):
|
||||||
|
x0, y0 = centers[(r, c)]
|
||||||
|
x1, y1 = centers[(r, c + 1)]
|
||||||
|
_arrow(ax, (x0 + cube_w / 2, y0), (x1 - cube_w / 2, y1),
|
||||||
|
color=_PALETTE_BLUE, lw=1.5)
|
||||||
|
# Phase 2: col reduce N→S along rightmost column.
|
||||||
|
for r in range(3):
|
||||||
|
x0, y0 = centers[(r, 3)]
|
||||||
|
x1, y1 = centers[(r + 1, 3)]
|
||||||
|
_arrow(ax, (x0, y0 - cube_w / 2), (x1, y1 + cube_w / 2),
|
||||||
|
color=_PALETTE_GREEN, lw=1.7)
|
||||||
|
|
||||||
|
# Phase legend on the right side.
|
||||||
|
legend_x = grid_x0 + grid_total + 0.55
|
||||||
|
ax.text(legend_x, 5.0, "Phase 1: row reduce (W → E)",
|
||||||
|
color=_PALETTE_BLUE, fontsize=10, fontweight="bold")
|
||||||
|
ax.text(legend_x, 4.55, "Phase 2: col reduce (N → S, rightmost col)",
|
||||||
|
color=_PALETTE_GREEN, fontsize=10, fontweight="bold")
|
||||||
|
ax.text(legend_x, 4.10, "Phase 3: inter-SIP exchange at root cube",
|
||||||
|
color=_PALETTE_ROOT_EDGE, fontsize=10, fontweight="bold")
|
||||||
|
ax.text(legend_x, 3.65, "Phase 4: col broadcast (S → N)",
|
||||||
|
color=_PALETTE_GREEN, fontsize=10, style="italic")
|
||||||
|
ax.text(legend_x, 3.20, "Phase 5: row broadcast (E → W)",
|
||||||
|
color=_PALETTE_BLUE, fontsize=10, style="italic")
|
||||||
|
ax.text(legend_x, 2.55,
|
||||||
|
"(broadcast phases reverse phases 2 & 1)",
|
||||||
|
color="gray", fontsize=8.5, style="italic")
|
||||||
|
ax.text(legend_x, 1.7,
|
||||||
|
"Root cube (c15, bottom-right) is the only\n"
|
||||||
|
"cube that performs the inter-SIP exchange.",
|
||||||
|
color=_PALETTE_ROOT_EDGE, fontsize=9, style="italic")
|
||||||
|
|
||||||
|
|
||||||
|
def emit_topology_diagram() -> str:
|
||||||
|
"""Emit a 2×2-panel topology diagram into docs/diagrams/allreduce_latency_plots/.
|
||||||
|
|
||||||
|
Top row: ring_1d | torus_2d (2×3)
|
||||||
|
Bot row: mesh_2d_no_wrap (2×3) | cube-level reduction in SIP 0
|
||||||
|
"""
|
||||||
|
import matplotlib.gridspec as gridspec
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
|
||||||
|
_SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
fig = plt.figure(figsize=(16, 10), facecolor="white")
|
||||||
|
gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.30, wspace=0.10)
|
||||||
|
ax_ring = fig.add_subplot(gs[0, 0])
|
||||||
|
ax_torus = fig.add_subplot(gs[0, 1])
|
||||||
|
ax_mesh = fig.add_subplot(gs[1, 0])
|
||||||
|
ax_cube = fig.add_subplot(gs[1, 1])
|
||||||
|
|
||||||
|
_draw_ring_topology(ax_ring)
|
||||||
|
_draw_grid_topology(ax_torus, "torus", n_rows=2, n_cols=3)
|
||||||
|
_draw_grid_topology(ax_mesh, "mesh", n_rows=2, n_cols=3)
|
||||||
|
_draw_cube_reduction(ax_cube)
|
||||||
|
|
||||||
|
fig.suptitle(
|
||||||
|
"Allreduce topology — device-level (ring, torus, mesh) "
|
||||||
|
"and cube-level reduction in SIP 0",
|
||||||
|
fontsize=14, fontweight="bold", color=_PALETTE_TEXT, y=0.98,
|
||||||
|
)
|
||||||
|
out_path = _SWEEP_OUT_DIR / "topology.png"
|
||||||
|
fig.savefig(out_path, dpi=130, bbox_inches="tight",
|
||||||
|
facecolor=fig.get_facecolor())
|
||||||
|
plt.close(fig)
|
||||||
|
return str(out_path)
|
||||||
|
|
||||||
|
|
||||||
|
def test_emit_topology_diagram():
|
||||||
|
"""Emit topology.png alongside the sweep plots. Pure plotting; no sim."""
|
||||||
|
out = emit_topology_diagram()
|
||||||
|
assert Path(out).exists()
|
||||||
|
|||||||
@@ -23,7 +23,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
|||||||
@@ -1,48 +0,0 @@
|
|||||||
"""Test that tl.recv() (no direction) works under the mock runtime
|
|
||||||
and the SimPy PE_IPCQ component (ADR-0023 D4 weak fairness)."""
|
|
||||||
from __future__ import annotations
|
|
||||||
|
|
||||||
import numpy as np
|
|
||||||
|
|
||||||
from kernbench.ccl.testing import run_kernel_in_mock
|
|
||||||
|
|
||||||
|
|
||||||
def kernel_round_robin(t_ptr, n_elem, tl):
"""Each step, every PE sends one tile E and receives one tile via round-robin (N-1 steps).
|
|
||||||
Uses TensorHandle math (PE_MATH) so Phase 2 produces correct HBM
|
|
||||||
contents under SimPy + op_log replay."""
|
|
||||||
rank = tl.program_id(axis=0)
|
|
||||||
world_size = tl.num_programs(axis=0)
|
|
||||||
nbytes = n_elem * 2
|
|
||||||
|
|
||||||
pe_addr = t_ptr + rank * nbytes
|
|
||||||
acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
|
|
||||||
current = acc
|
|
||||||
|
|
||||||
for _step in range(world_size - 1):
|
|
||||||
tl.send(dir="E", src=current)
|
|
||||||
# No direction → round-robin
|
|
||||||
recv = tl.recv(shape=(n_elem,), dtype="f16")
|
|
||||||
acc = acc + recv
|
|
||||||
current = recv # forward W's tile to E next round
|
|
||||||
|
|
||||||
tl.store(pe_addr, acc)
|
|
||||||
|
|
||||||
|
|
||||||
def test_round_robin_recv_mock_runtime():
|
|
||||||
n_elem = 8
|
|
||||||
inputs = [
|
|
||||||
np.full((n_elem,), float(r + 1), dtype=np.float16)
|
|
||||||
for r in range(4)
|
|
||||||
]
|
|
||||||
expected = sum(inputs) # [10,...]
|
|
||||||
|
|
||||||
outputs = run_kernel_in_mock(
|
|
||||||
kernel_fn=kernel_round_robin,
|
|
||||||
world_size=4,
|
|
||||||
topology="ring_1d",
|
|
||||||
inputs=inputs,
|
|
||||||
kernel_args=(n_elem,),
|
|
||||||
)
|
|
||||||
for r in range(4):
|
|
||||||
assert np.allclose(outputs[r], expected)
|
|
||||||
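The kernel in the removed test forwards tiles around a ring: in each step a rank sends its current tile east, receives one tile, adds it to its accumulator, and forwards the received tile in the next step. The sketch below reproduces that loop in plain Python. It assumes a 1-D ring in which the only tile a rank can receive in a step is the one its west neighbour just sent; `simulate_ring_allreduce` is a hypothetical helper for illustration and does not use the kernbench mock runtime.

```python
def simulate_ring_allreduce(inputs):
    """Simulate the send-E / recv / accumulate loop on a 1-D ring.

    inputs[r] is the scalar tile owned by rank r. Hypothetical helper for
    illustration only; it is not part of kernbench.
    """
    world_size = len(inputs)
    acc = list(inputs)        # running sum held by each rank
    current = list(inputs)    # tile each rank forwards this round

    for _step in range(world_size - 1):
        # Every rank sends `current` east, so rank r receives the tile
        # that its west neighbour (r - 1) sent in the same step.
        received = [current[(r - 1) % world_size] for r in range(world_size)]
        for r in range(world_size):
            acc[r] += received[r]
        current = received    # forward the west tile east next round
    return acc


result = simulate_ring_allreduce([1.0, 2.0, 3.0, 4.0])
print(result)  # every rank ends with 10.0
```

After `world_size - 1` steps each rank has seen every other rank's tile exactly once, which is why the removed test expected `sum(inputs)` on all four ranks.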
@@ -30,7 +30,7 @@ def _graph():
 def _hbm_pa(pe_id: int = 0) -> int:
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id,
+        sip_id=0, die_id=0, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
|||||||
@@ -0,0 +1,194 @@
|
|||||||
|
"""ADR-0009 D5 invariant: all PEs targeted by a single kernel launch MUST
|
||||||
|
begin executing the kernel body at the same simulated time, regardless of
|
||||||
|
their dispatch path length.
|
||||||
|
|
||||||
|
These tests directly verify the invariant by capturing per-PE state at the
|
||||||
|
top of `_execute_kernel`:
|
||||||
|
|
||||||
|
test_no_pe_arrives_after_target_start_ns
|
||||||
|
Asserts: for every PE that enters _execute_kernel during a multi-cube
|
||||||
|
launch, `env.now` at entry must be <= target_start_ns. Otherwise the
|
||||||
|
PE's barrier yield would be a no-op and `pe_exec_start` would be set
|
||||||
|
late, breaking the D5 "same simulated time" mandate.
|
||||||
|
|
||||||
|
test_all_pes_have_identical_pe_exec_start
|
||||||
|
Asserts: every PE's `pe_exec_start` (the value of `env.now` recorded
|
||||||
|
immediately AFTER the barrier yield) is identical across all PEs in
|
||||||
|
the launch.
|
||||||
|
|
||||||
|
Both tests are expected to FAIL today. They become the regression checks
that the Phase 2 D5 predictor + fallback fix must make pass.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
|
||||||
|
def _capture_per_pe_d5_state():
|
||||||
|
"""Monkey-patch PeCpuComponent._execute_kernel to record, per PE:
|
||||||
|
|
||||||
|
- entry_now: env.now at function entry (before any yield)
|
||||||
|
- target_start_ns: the value carried by the request
|
||||||
|
- barrier_yielded: True if the barrier yield fired (entry_now < target)
|
||||||
|
- pe_exec_start: approximation of env.now immediately after the barrier check
(target_start_ns if the barrier yielded, otherwise entry_now)
|
||||||
|
|
||||||
|
Returns (records: list[dict], restore: callable).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
records: list[dict] = []
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
request = txn.request
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
entry_now = float(env.now)
|
||||||
|
rec = {
|
||||||
|
"node_id": self.node.id,
|
||||||
|
"entry_now": entry_now,
|
||||||
|
"target_start_ns": (
|
||||||
|
float(target_start) if target_start is not None else None
|
||||||
|
),
|
||||||
|
"barrier_yielded": (
|
||||||
|
target_start is not None
|
||||||
|
and float(target_start) > entry_now
|
||||||
|
),
"pe_exec_start": None, # filled in below (approximated from the barrier outcome)
|
||||||
|
"late_ns": (
|
||||||
|
None if target_start is None
|
||||||
|
else max(0.0, entry_now - float(target_start))
|
||||||
|
),
|
||||||
|
}
|
||||||
|
records.append(rec)
|
||||||
|
|
||||||
|
# We can't easily inject a callback at the original's
|
||||||
|
# `pe_exec_start = env.now` line without rewriting it. Approximate:
|
||||||
|
# if the original yields the barrier, env.now after the yield is
|
||||||
|
# target_start_ns; otherwise pe_exec_start is entry_now (skipped).
|
||||||
|
if rec["barrier_yielded"]:
|
||||||
|
rec["pe_exec_start"] = float(target_start)
|
||||||
|
else:
|
||||||
|
rec["pe_exec_start"] = entry_now
|
||||||
|
|
||||||
|
yield from original(self, env, txn)
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return records, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _run_multicube_launch():
|
||||||
|
"""Drive a no-op kernel launch across all 16 cubes x 8 PEs and return
|
||||||
|
the per-PE D5 records collected by the monkey-patch."""
|
||||||
|
records, restore = _capture_per_pe_d5_state()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine, target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="d5_barrier", spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=16, num_pes=8,
|
||||||
|
)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pass # no-op
|
||||||
|
|
||||||
|
ctx.ahbm.set_device(0)
|
||||||
|
t = ctx.zeros(
|
||||||
|
(16, 8 * 64), dtype="f16", dp=dp, name="probe",
|
||||||
|
)
|
||||||
|
t.copy_(ctx.from_numpy(
|
||||||
|
np.zeros((16, 8 * 64), dtype=np.float16),
|
||||||
|
))
|
||||||
|
|
||||||
|
pending = ctx.launch(
|
||||||
|
"d5_probe", kernel, t, 64, _defer_wait=True,
|
||||||
|
)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
finally:
|
||||||
|
restore()
|
||||||
|
return records
|
||||||
|
|
||||||
|
|
||||||
|
def test_no_pe_arrives_after_target_start_ns():
|
||||||
|
"""ADR-0009 D5: no PE may enter `_execute_kernel` after target_start_ns.
|
||||||
|
|
||||||
|
Today this fails because IO_CPU's predictor under-shoots actual
|
||||||
|
dispatch latency for far cubes (cube4, cube9-15). Phase 2 fix:
|
||||||
|
chain-aware predictor in IO_CPU + monotonic upward re-stamp in M_CPU.
|
||||||
|
"""
|
||||||
|
records = _run_multicube_launch()
|
||||||
|
assert records, "expected per-PE _execute_kernel records"
|
||||||
|
|
||||||
|
late = [
|
||||||
|
r for r in records
|
||||||
|
if r["target_start_ns"] is not None
|
||||||
|
and r["late_ns"] is not None
|
||||||
|
and r["late_ns"] > 1e-6
|
||||||
|
]
|
||||||
|
|
||||||
|
if late:
|
||||||
|
# Provide actionable diagnostic in the failure.
|
||||||
|
worst = sorted(late, key=lambda r: -r["late_ns"])[:5]
|
||||||
|
details = "\n".join(
|
||||||
|
f" {r['node_id']}: late by {r['late_ns']:.2f} ns "
|
||||||
|
f"(entry_now={r['entry_now']:.2f}, "
|
||||||
|
f"target_start_ns={r['target_start_ns']:.2f})"
|
||||||
|
for r in worst
|
||||||
|
)
|
||||||
|
pytest.fail(
|
||||||
|
f"ADR-0009 D5 violated: {len(late)}/{len(records)} PEs "
|
||||||
|
f"entered _execute_kernel AFTER target_start_ns "
|
||||||
|
f"(barrier yield silently skipped). "
|
||||||
|
f"Worst offenders:\n{details}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_all_pes_have_identical_pe_exec_start():
|
||||||
|
"""ADR-0009 D5: every PE's pe_exec_start must be identical.
|
||||||
|
|
||||||
|
With D5 honored, every PE either yields to target_start_ns (start =
|
||||||
|
target_start_ns) or, if late, would still be aligned by the M_CPU
|
||||||
|
upward re-stamp (Phase 2). Today: 75/128 PEs in this launch have
|
||||||
|
distinct pe_exec_start values because they skipped the barrier.
|
||||||
|
"""
|
||||||
|
records = _run_multicube_launch()
|
||||||
|
assert records, "expected per-PE _execute_kernel records"
|
||||||
|
|
||||||
|
starts = sorted({round(r["pe_exec_start"], 6) for r in records})
|
||||||
|
if len(starts) > 1:
|
||||||
|
spread = max(starts) - min(starts)
|
||||||
|
# Distribution of how many PEs at each distinct start time
|
||||||
|
from collections import Counter
|
||||||
|
bucket = Counter(round(r["pe_exec_start"], 6) for r in records)
|
||||||
|
details = "\n".join(
|
||||||
|
f" pe_exec_start={t}: {n} PEs"
|
||||||
|
for t, n in sorted(bucket.items())
|
||||||
|
)
|
||||||
|
pytest.fail(
|
||||||
|
f"ADR-0009 D5 violated: PEs have {len(starts)} distinct "
|
||||||
|
f"pe_exec_start values (spread = {spread:.2f} ns); "
|
||||||
|
f"D5 mandates a single common value. "
|
||||||
|
f"Distribution:\n{details}"
|
||||||
|
)
|
||||||
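The two tests above reduce to one property: if every PE enters `_execute_kernel` no later than `target_start_ns`, all PEs start at the same time. The sketch below models that rule without SimPy. It assumes the barrier behaves as the docstrings describe (an early PE waits until `target_start_ns`, a late PE starts at its own entry time); `pe_exec_start` and `check_d5` are hypothetical helpers, and the timestamps are illustrative values, not measurements.

```python
def pe_exec_start(entry_now, target_start_ns):
    """Start time of a PE under the barrier rule described in the tests.

    Assumption: an early PE waits until target_start_ns; a late PE
    cannot go back in time and starts at its own entry time.
    """
    return max(entry_now, target_start_ns)


def check_d5(entry_times, target_start_ns, tol=1e-6):
    """Return (late_pes, distinct_starts) for one launch."""
    late = {pe: t - target_start_ns
            for pe, t in entry_times.items()
            if t - target_start_ns > tol}
    starts = sorted({round(pe_exec_start(t, target_start_ns), 6)
                     for t in entry_times.values()})
    return late, starts


# Illustrative numbers only: pe2 arrives 30 ns after the target.
entries = {"pe0": 40.0, "pe1": 95.0, "pe2": 130.0}
late, starts = check_d5(entries, target_start_ns=100.0)
print(late)    # pe2 is late, so the first test would fail
print(starts)  # two distinct start times, so the second test would fail
```

A single late PE is enough to fail both checks, which is why the docstrings tie the fix to the predictor that chooses `target_start_ns`.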
@@ -50,7 +50,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
     from kernbench.policy.address.phyaddr import PhysAddr
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
|||||||
@@ -31,7 +31,7 @@ def _hbm_pa(sip=0, cube=0, pe_id=0):
     from kernbench.policy.address.phyaddr import PhysAddr
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
|||||||
@@ -0,0 +1,622 @@
|
|||||||
|
"""High-level IPCQ + SFR connection diagram (presentation only).
|
||||||
|
|
||||||
|
Renders ``docs/diagrams/ipcq_diagram_plots/ipcq_send_recv.png`` showing one
|
||||||
|
concrete example: SIP 0 / cube 0 / pe 0 sending to pe 1 in the
|
||||||
|
``intra_E`` direction. Boxes and arrows are grounded in the actual
|
||||||
|
code paths:
|
||||||
|
|
||||||
|
- PE_IPCQ SFR fields: src/kernbench/components/builtin/pe_ipcq.py
|
||||||
|
- SFR install: src/kernbench/ccl/install.py +
|
||||||
|
src/kernbench/ccl/sfr_config.py
|
||||||
|
- PE_DMA outbound /
|
||||||
|
inbound atomic write: src/kernbench/components/builtin/pe_dma.py
|
||||||
|
|
||||||
|
This is a pure-plotting test (no simulation). It exists so the diagram
|
||||||
|
can be regenerated reproducibly alongside the rest of the suite.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
|
||||||
|
/ "ipcq_diagram_plots")
|
||||||
|
|
||||||
|
# Color palette (matches the topology diagram for visual continuity).
|
||||||
|
_BG = "#fafbfd"
|
||||||
|
_FRAME = "#3a3f4a"
|
||||||
|
_TEXT = "#1f2530"
|
||||||
|
_BLUE = "#2c6fb6"
|
||||||
|
_GREEN = "#2e8a4e"
|
||||||
|
_ORANGE = "#d3722a"
|
||||||
|
_PURPLE = "#7a4cb6"
|
||||||
|
_BOX_FILL = "#eaf2fb"
|
||||||
|
_BOX_EDGE = "#2c4a78"
|
||||||
|
_HW_FILL = "#f3ecda"
|
||||||
|
_HW_EDGE = "#a07a2a"
|
||||||
|
_MEM_FILL = "#e8f3e8"
|
||||||
|
_MEM_EDGE = "#2e8a4e"
|
||||||
|
|
||||||
|
|
||||||
|
def _box(ax, x, y, w, h, title, lines, *, fill=_BOX_FILL, edge=_BOX_EDGE,
|
||||||
|
title_color=None, font=9):
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
box = FancyBboxPatch(
|
||||||
|
(x, y), w, h,
|
||||||
|
boxstyle="round,pad=0.04,rounding_size=0.18",
|
||||||
|
linewidth=1.6, edgecolor=edge, facecolor=fill, zorder=2,
|
||||||
|
)
|
||||||
|
ax.add_patch(box)
|
||||||
|
ax.text(x + w / 2, y + h - 0.45, title,
|
||||||
|
ha="center", va="top", fontsize=font + 1.5,
|
||||||
|
fontweight="bold",
|
||||||
|
color=title_color or edge, zorder=3)
|
||||||
|
for i, line in enumerate(lines):
|
||||||
|
ax.text(
|
||||||
|
x + 0.25, y + h - 1.1 - i * 0.45, line,
|
||||||
|
ha="left", va="top", fontsize=font - 0.5, color=_TEXT,
|
||||||
|
family="monospace", zorder=3,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _arrow(ax, xy_from, xy_to, *, color=_BLUE, lw=1.8, curve=0.0,
|
||||||
|
style="-|>", alpha=1.0, zorder=4):
|
||||||
|
from matplotlib.patches import FancyArrowPatch
|
||||||
|
arrow = FancyArrowPatch(
|
||||||
|
xy_from, xy_to,
|
||||||
|
arrowstyle=style, mutation_scale=14,
|
||||||
|
color=color, lw=lw, alpha=alpha,
|
||||||
|
connectionstyle=f"arc3,rad={curve}",
|
||||||
|
zorder=zorder,
|
||||||
|
)
|
||||||
|
ax.add_patch(arrow)
|
||||||
|
|
||||||
|
|
||||||
|
def _step_label(ax, x, y, n, text, color=_BLUE):
|
||||||
|
from matplotlib.patches import Circle
|
||||||
|
ax.add_patch(Circle((x, y), 0.28, facecolor=color, edgecolor="white",
|
||||||
|
linewidth=1.4, zorder=5))
|
||||||
|
ax.text(x, y, str(n), ha="center", va="center", fontsize=9,
|
||||||
|
fontweight="bold", color="white", zorder=6)
|
||||||
|
ax.text(x + 0.45, y, text, ha="left", va="center", fontsize=9,
|
||||||
|
color=_TEXT, zorder=6)
|
||||||
|
|
||||||
|
|
||||||
|
def emit_ipcq_diagram() -> str:
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.patches import FancyBboxPatch, Rectangle
|
||||||
|
|
||||||
|
_OUT_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
fig, ax = plt.subplots(figsize=(18, 11), facecolor="white")
|
||||||
|
ax.set_xlim(0, 22)
|
||||||
|
ax.set_ylim(0, 14)
|
||||||
|
ax.set_aspect("equal")
|
||||||
|
ax.axis("off")
|
||||||
|
ax.set_facecolor(_BG)
|
||||||
|
|
||||||
|
# Outer panel border.
|
||||||
|
border = FancyBboxPatch(
|
||||||
|
(0.15, 0.15), 21.7, 13.7,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.20",
|
||||||
|
linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
|
||||||
|
)
|
||||||
|
ax.add_patch(border)
|
||||||
|
|
||||||
|
ax.set_title(
|
||||||
|
"IPCQ — SFR state and send/recv path between pe0 and pe1 "
|
||||||
|
"(intra_E direction, SIP 0 / cube 0)",
|
||||||
|
fontsize=14, fontweight="bold", color=_TEXT, pad=12,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── pe0 side (left half) ────────────────────────────────────────
|
||||||
|
_box(
|
||||||
|
ax, x=0.8, y=8.4, w=8.4, h=5.0,
|
||||||
|
title="pe0.pe_ipcq (SFR — direction: intra_E)",
|
||||||
|
lines=[
|
||||||
|
"neighbor_table[intra_E]:",
|
||||||
|
" peer = sip0.cube0.pe1",
|
||||||
|
" peer.rx_base_pa → pe1's intra_W slot ring",
|
||||||
|
" my_rx_base_pa → pe0's intra_E slot ring",
|
||||||
|
" n_slots = 8 slot_size = 512 B",
|
||||||
|
"",
|
||||||
|
"head/tail counters (per direction):",
|
||||||
|
" my_head # ++ on tl.send",
|
||||||
|
" my_tail # ++ on tl.recv",
|
||||||
|
" peer_head_cache # updated on IpcqMetaArrival",
|
||||||
|
" peer_tail_cache # updated on IpcqCreditMetadata",
|
||||||
|
"",
|
||||||
|
"send blocks while (my_head − peer_tail_cache) ≥ n_slots",
|
||||||
|
],
|
||||||
|
edge=_BOX_EDGE, fill=_BOX_FILL,
|
||||||
|
)
|
||||||
|
|
||||||
|
_box(
|
||||||
|
ax, x=0.8, y=4.5, w=8.4, h=2.7,
|
||||||
|
title="pe0.pe_dma (outbound IPCQ driver)",
|
||||||
|
lines=[
|
||||||
|
"_handle_ipcq_outbound():",
|
||||||
|
" • snapshot src bytes from MemoryStore",
|
||||||
|
" • find fabric path → pe1.pe_dma",
|
||||||
|
" • send Transaction; do NOT wait (fire-and-forget)",
|
||||||
|
],
|
||||||
|
edge=_HW_EDGE, fill=_HW_FILL,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── pe1 side (right half) ───────────────────────────────────────
|
||||||
|
_box(
|
||||||
|
ax, x=12.8, y=8.4, w=8.4, h=5.0,
|
||||||
|
title="pe1.pe_ipcq (SFR — direction: intra_W)",
|
||||||
|
lines=[
|
||||||
|
"neighbor_table[intra_W]:",
|
||||||
|
" peer = sip0.cube0.pe0",
|
||||||
|
" peer.rx_base_pa → pe0's intra_E slot ring",
|
||||||
|
" my_rx_base_pa → pe1's intra_W slot ring",
|
||||||
|
" n_slots = 8 slot_size = 512 B",
|
||||||
|
"",
|
||||||
|
"head/tail counters (per direction):",
|
||||||
|
" my_head # ++ on tl.send (other direction)",
|
||||||
|
" my_tail # ++ on tl.recv (this direction)",
|
||||||
|
" peer_head_cache # updated on IpcqMetaArrival",
|
||||||
|
" peer_tail_cache # updated on IpcqCreditMetadata",
|
||||||
|
"",
|
||||||
|
"recv blocks while peer_head_cache ≤ my_tail",
|
||||||
|
],
|
||||||
|
edge=_BOX_EDGE, fill=_BOX_FILL,
|
||||||
|
)
|
||||||
|
|
||||||
|
_box(
|
||||||
|
ax, x=12.8, y=4.5, w=8.4, h=2.7,
|
||||||
|
title="pe1.pe_dma (inbound IPCQ driver)",
|
||||||
|
lines=[
|
||||||
|
"_handle_ipcq_inbound():",
|
||||||
|
" • pay terminal drain over fabric BW",
|
||||||
|
" • atomic: write data into pe1's intra_W slot",
|
||||||
|
" • forward IpcqMetaArrival → pe1.pe_ipcq",
|
||||||
|
],
|
||||||
|
edge=_HW_EDGE, fill=_HW_FILL,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── Slot ring buffer (under pe1.pe_dma) ─────────────────────────
|
||||||
|
ring_x0, ring_y0 = 12.8, 1.1
|
||||||
|
ring_w, ring_h = 8.4, 2.6
|
||||||
|
box = FancyBboxPatch(
|
||||||
|
(ring_x0, ring_y0), ring_w, ring_h,
|
||||||
|
boxstyle="round,pad=0.04,rounding_size=0.16",
|
||||||
|
linewidth=1.6, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=2,
|
||||||
|
)
|
||||||
|
ax.add_patch(box)
|
||||||
|
ax.text(
|
||||||
|
ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.42,
|
||||||
|
"MemoryStore[buffer_kind] pe1's intra_W slot ring "
|
||||||
|
"(n_slots = 8, slot_size = 512 B)",
|
||||||
|
ha="center", va="top", fontsize=10, fontweight="bold",
|
||||||
|
color=_MEM_EDGE, zorder=3,
|
||||||
|
)
|
||||||
|
# 8 slots laid out horizontally inside the ring panel.
|
||||||
|
n_slots = 8
|
||||||
|
pad = 0.35
|
||||||
|
slot_w = (ring_w - 2 * pad) / n_slots
|
||||||
|
slot_h = 0.85
|
||||||
|
slot_y = ring_y0 + 0.3
|
||||||
|
for i in range(n_slots):
|
||||||
|
sx = ring_x0 + pad + i * slot_w
|
||||||
|
is_active = (i == 3) # Highlight one example slot
|
||||||
|
face = "#ffd9b8" if is_active else "white"
|
||||||
|
edge = _ORANGE if is_active else _MEM_EDGE
|
||||||
|
rect = Rectangle(
|
||||||
|
(sx + 0.05, slot_y), slot_w - 0.10, slot_h,
|
||||||
|
linewidth=1.2, facecolor=face, edgecolor=edge, zorder=3,
|
||||||
|
)
|
||||||
|
ax.add_patch(rect)
|
||||||
|
ax.text(
|
||||||
|
sx + slot_w / 2, slot_y + slot_h / 2,
|
||||||
|
f"s{i}", ha="center", va="center", fontsize=9,
|
||||||
|
color=_ORANGE if is_active else _TEXT,
|
||||||
|
fontweight="bold" if is_active else "normal", zorder=4,
|
||||||
|
)
|
||||||
|
ax.text(
|
||||||
|
ring_x0 + pad + 3 * slot_w + slot_w / 2, slot_y - 0.30,
|
||||||
|
"slot_idx = my_head % n_slots",
|
||||||
|
ha="center", va="top", fontsize=8, style="italic",
|
||||||
|
color=_ORANGE,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── Fabric label (between pe0.pe_dma and pe1.pe_dma) ────────────
|
||||||
|
fab = FancyBboxPatch(
|
||||||
|
(9.6, 5.0), 2.6, 1.7,
|
||||||
|
boxstyle="round,pad=0.04,rounding_size=0.20",
|
||||||
|
linewidth=1.4, edgecolor=_PURPLE, facecolor="white", zorder=2,
|
||||||
|
)
|
||||||
|
ax.add_patch(fab)
|
||||||
|
ax.text(10.9, 6.4, "Fabric", ha="center", va="center",
|
||||||
|
fontsize=11, fontweight="bold", color=_PURPLE)
|
||||||
|
ax.text(10.9, 5.7, "(NoC routers,\npe_dma → pe_dma)",
|
||||||
|
ha="center", va="center", fontsize=8, color=_TEXT)
|
||||||
|
|
||||||
|
# ── Arrows + step labels ────────────────────────────────────────
|
||||||
|
# 1. tl.send ↘ pe0.pe_ipcq
|
||||||
|
_arrow(ax, (9.2, 12.9), (9.7, 12.9), color=_BLUE) # placeholder so number lands
|
||||||
|
_step_label(ax, 0.5, 13.6,
|
||||||
|
1, "kernel calls tl.send(dir='intra_E', src_addr=X)",
|
||||||
|
color=_BLUE)
|
||||||
|
# 2. pe0.pe_ipcq → pe0.pe_dma (IpcqDmaToken)
|
||||||
|
_arrow(ax, (5.0, 8.4), (5.0, 7.2), color=_BLUE, lw=2.0)
|
||||||
|
ax.text(5.2, 7.85, "IpcqDmaToken\n"
|
||||||
|
"dst = peer.rx_base_pa + slot_idx*512",
|
||||||
|
ha="left", va="center", fontsize=8, color=_BLUE,
|
||||||
|
family="monospace")
|
||||||
|
# 3. pe0.pe_dma → fabric → pe1.pe_dma (data, fire-and-forget)
|
||||||
|
_arrow(ax, (9.2, 5.85), (9.6, 5.85), color=_BLUE, lw=2.0)
|
||||||
|
_arrow(ax, (12.2, 5.85), (12.8, 5.85), color=_BLUE, lw=2.0)
|
||||||
|
ax.text(10.9, 4.7, "data (fire-and-forget)",
|
||||||
|
ha="center", va="center", fontsize=8, style="italic",
|
||||||
|
color=_BLUE)
|
||||||
|
# 4. pe1.pe_dma → MemoryStore slot (atomic)
|
||||||
|
_arrow(ax, (17.0, 4.5), (17.0, 3.7), color=_GREEN, lw=2.0)
|
||||||
|
ax.text(17.2, 4.10, "atomic write",
|
||||||
|
ha="left", va="center", fontsize=8, color=_GREEN,
|
||||||
|
family="monospace")
|
||||||
|
# 5. pe1.pe_dma → pe1.pe_ipcq (IpcqMetaArrival)
|
||||||
|
_arrow(ax, (15.0, 7.2), (15.0, 8.4), color=_GREEN, lw=2.0)
|
||||||
|
ax.text(13.0, 7.85, "IpcqMetaArrival\n"
|
||||||
|
"→ peer_head_cache update",
|
||||||
|
ha="left", va="center", fontsize=8, color=_GREEN,
|
||||||
|
family="monospace")
|
||||||
|
# 6. tl.recv unblocks (annotation only)
|
||||||
|
_step_label(ax, 12.85, 13.6,
|
||||||
|
6, "tl.recv(dir='intra_W') unblocks; consume slot; my_tail++",
|
||||||
|
color=_GREEN)
|
||||||
|
# 7. pe1.pe_ipcq → pe0.pe_ipcq (IpcqCreditMetadata, fast-path SimPy Store)
|
||||||
|
_arrow(ax, (12.8, 11.0), (9.2, 11.0),
|
||||||
|
color=_ORANGE, lw=2.0, curve=0.18)
|
||||||
|
ax.text(11.0, 11.55,
|
||||||
|
"IpcqCreditMetadata (consumer_seq, dst_rx_base_pa)\n"
|
||||||
|
"→ pe0's credit_inbox (SimPy Store, no fabric)",
|
||||||
|
ha="center", va="center", fontsize=8, color=_ORANGE,
|
||||||
|
family="monospace")
|
||||||
|
# 8. pe0.peer_tail_cache update unblocks tl.send
|
||||||
|
ax.text(0.5, 0.55,
|
||||||
|
"Steps 1–3 = data path (fabric, fire-and-forget); "
|
||||||
|
"4–6 = receiver wake-up; 7 = credit return (fast path); "
|
||||||
|
"8 = sender unblocks when peer_tail_cache catches up.",
|
||||||
|
ha="left", va="center", fontsize=9, color=_TEXT,
|
||||||
|
style="italic")
|
||||||
|
|
||||||
|
# In-figure step legend (top, between pe0/pe1 panels).
|
||||||
|
legend_x = 9.4
|
||||||
|
legend_y = 13.5
|
||||||
|
_step_label(ax, legend_x, legend_y, 2,
|
||||||
|
"PE_IPCQ → PE_DMA (token)", color=_BLUE)
|
||||||
|
_step_label(ax, legend_x, legend_y - 0.45, 3,
|
||||||
|
"PE_DMA → fabric → PE_DMA (data)", color=_BLUE)
|
||||||
|
_step_label(ax, legend_x, legend_y - 0.90, 4,
|
||||||
|
"atomic slot write", color=_GREEN)
|
||||||
|
_step_label(ax, legend_x, legend_y - 1.35, 5,
|
||||||
|
"IpcqMetaArrival", color=_GREEN)
|
||||||
|
_step_label(ax, legend_x, legend_y - 1.80, 7,
|
||||||
|
"IpcqCreditMetadata", color=_ORANGE)
|
||||||
|
|
||||||
|
out_path = _OUT_DIR / "ipcq_send_recv.png"
|
||||||
|
fig.savefig(out_path, dpi=130, bbox_inches="tight",
|
||||||
|
facecolor=fig.get_facecolor())
|
||||||
|
|
||||||
|
plt.close(fig)
|
||||||
|
return str(out_path)
|
||||||
|
|
||||||
|
|
||||||
|
def test_emit_ipcq_diagram():
|
||||||
|
out = emit_ipcq_diagram()
|
||||||
|
assert Path(out).exists()
|
||||||
|
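The blocking conditions written into the diagram labels above (`send blocks while (my_head − peer_tail_cache) ≥ n_slots`, `recv blocks while peer_head_cache ≤ my_tail`, `slot_idx = my_head % n_slots`, `dst = peer.rx_base_pa + slot_idx*512`) describe a credit-based ring buffer. The sketch below models one direction of one sender/receiver pair in plain Python. It is a simplification under stated assumptions: both sides read the same counters directly instead of cached copies refreshed by `IpcqMetaArrival` and `IpcqCreditMetadata`, and the `SlotRing` class and the base address `0x1000` are hypothetical.

```python
N_SLOTS = 8
SLOT_SIZE = 512  # bytes, as in the diagram labels


class SlotRing:
    """One direction of a sender/receiver pair (hypothetical model).

    Counters are shared directly here; in the diagram they are cached
    copies refreshed by IpcqMetaArrival and IpcqCreditMetadata.
    """

    def __init__(self, rx_base_pa):
        self.rx_base_pa = rx_base_pa
        self.head = 0   # sender's my_head: ++ on send
        self.tail = 0   # receiver's my_tail: ++ on recv
        self.slots = [None] * N_SLOTS

    def can_send(self):
        # send blocks while (my_head - peer_tail_cache) >= n_slots
        return (self.head - self.tail) < N_SLOTS

    def can_recv(self):
        # recv blocks while peer_head_cache <= my_tail
        return self.head > self.tail

    def send(self, payload):
        if not self.can_send():
            raise BlockingIOError("ring full: sender would block")
        slot_idx = self.head % N_SLOTS
        dst_pa = self.rx_base_pa + slot_idx * SLOT_SIZE
        self.slots[slot_idx] = payload
        self.head += 1
        return dst_pa

    def recv(self):
        if not self.can_recv():
            raise BlockingIOError("ring empty: receiver would block")
        payload = self.slots[self.tail % N_SLOTS]
        self.tail += 1  # consuming a slot returns one credit to the sender
        return payload


ring = SlotRing(rx_base_pa=0x1000)
addrs = [ring.send(i) for i in range(N_SLOTS)]
print(hex(addrs[3]))    # slot 3 lands at base + 3 * 512
print(ring.can_send())  # False: all 8 slots are in flight
print(ring.recv())      # consuming one slot frees a credit
print(ring.can_send())  # True again
```

Because `head` and `tail` only ever increase, the difference `head - tail` is the number of slots in flight, and the modulo is applied only when a counter is turned into a slot index.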
|
||||||
|
|
||||||
|
# ── 2nd diagram: two-PE data + DMA + IPCQ-memory layout ──────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _pe_panel(ax, x0, y0, w, h, label, *, edge=_FRAME, fill="white"):
|
||||||
|
"""Outer container for one PE: title bar + body."""
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
box = FancyBboxPatch(
|
||||||
|
(x0, y0), w, h,
|
||||||
|
boxstyle="round,pad=0.04,rounding_size=0.20",
|
||||||
|
linewidth=1.8, edgecolor=edge, facecolor=fill, zorder=1,
|
||||||
|
)
|
||||||
|
ax.add_patch(box)
|
||||||
|
# Title band
|
||||||
|
title_h = 0.55
|
||||||
|
band = FancyBboxPatch(
|
||||||
|
(x0 + 0.12, y0 + h - title_h - 0.10), w - 0.24, title_h,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.10",
|
||||||
|
linewidth=0, edgecolor="none", facecolor=edge, zorder=2,
|
||||||
|
)
|
||||||
|
ax.add_patch(band)
|
||||||
|
ax.text(
|
||||||
|
x0 + w / 2, y0 + h - title_h / 2 - 0.10, label,
|
||||||
|
ha="center", va="center", fontsize=12, fontweight="bold",
|
||||||
|
color="white", zorder=3,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _sub_block(ax, cx, cy, w, h, title, body_lines, *,
|
||||||
|
fill, edge, font=9):
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
rect = FancyBboxPatch(
|
||||||
|
(cx - w / 2, cy - h / 2), w, h,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.10",
|
||||||
|
linewidth=1.4, edgecolor=edge, facecolor=fill, zorder=3,
|
||||||
|
)
|
||||||
|
ax.add_patch(rect)
|
||||||
|
ax.text(cx, cy + h / 2 - 0.30, title, ha="center", va="top",
|
||||||
|
fontsize=font + 1, fontweight="bold", color=edge, zorder=4)
|
||||||
|
for i, line in enumerate(body_lines):
|
||||||
|
ax.text(
|
||||||
|
cx, cy + h / 2 - 0.75 - i * 0.34, line,
|
||||||
|
ha="center", va="top", fontsize=font - 0.5, color=_TEXT,
|
||||||
|
family="monospace", zorder=4,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _tcm_with_slots(ax, cx, cy, w, h, *, n_slots=8, active_slot=3,
|
||||||
|
title="PE_TCM (local memory)"):
|
||||||
|
"""Draw a TCM box that contains a source buffer + IPCQ slot ring."""
|
||||||
|
from matplotlib.patches import FancyBboxPatch, Rectangle
|
||||||
|
rect = FancyBboxPatch(
|
||||||
|
(cx - w / 2, cy - h / 2), w, h,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.10",
|
||||||
|
linewidth=1.4, edgecolor=_MEM_EDGE, facecolor=_MEM_FILL, zorder=3,
|
||||||
|
)
|
||||||
|
ax.add_patch(rect)
|
||||||
|
ax.text(
|
||||||
|
cx, cy + h / 2 - 0.28, title, ha="center", va="top",
|
||||||
|
fontsize=9.5, fontweight="bold", color=_MEM_EDGE, zorder=4,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Source buffer region (left part).
|
||||||
|
src_w = (w - 0.6) * 0.30
|
||||||
|
src_h = h - 1.20
|
||||||
|
sx = cx - w / 2 + 0.20
|
||||||
|
sy = cy - h / 2 + 0.20
|
||||||
|
src_rect = Rectangle(
|
||||||
|
(sx, sy), src_w, src_h,
|
||||||
|
linewidth=1.0, facecolor="white", edgecolor=_BLUE, zorder=4,
|
||||||
|
)
|
||||||
|
ax.add_patch(src_rect)
|
||||||
|
ax.text(sx + src_w / 2, sy + src_h / 2 + 0.18, "source",
|
||||||
|
ha="center", va="center", fontsize=8.5, color=_BLUE,
|
||||||
|
fontweight="bold", zorder=5)
|
||||||
|
ax.text(sx + src_w / 2, sy + src_h / 2 - 0.18, "buffer",
|
||||||
|
ha="center", va="center", fontsize=8.5, color=_BLUE,
|
||||||
|
fontweight="bold", zorder=5)
|
||||||
|
|
||||||
|
# Slot ring region (right part).
|
||||||
|
ring_x0 = sx + src_w + 0.30
|
||||||
|
ring_w = (cx + w / 2 - 0.20) - ring_x0
|
||||||
|
ring_y0 = sy
|
||||||
|
ring_h = src_h
|
||||||
|
ring_rect = Rectangle(
|
||||||
|
(ring_x0, ring_y0), ring_w, ring_h,
|
||||||
|
linewidth=1.0, facecolor="white", edgecolor=_ORANGE, zorder=4,
|
||||||
|
)
|
||||||
|
ax.add_patch(ring_rect)
|
||||||
|
ax.text(
|
||||||
|
ring_x0 + ring_w / 2, ring_y0 + ring_h - 0.18,
|
||||||
|
"IPCQ slot ring (intra_W)",
|
||||||
|
ha="center", va="top", fontsize=8.5, color=_ORANGE,
|
||||||
|
fontweight="bold", zorder=5,
|
||||||
|
)
|
||||||
|
# Draw the slots in a 2×4 grid (assumes n_slots <= 8).
|
||||||
|
cols = 4
|
||||||
|
rows = 2
|
||||||
|
slot_inner_pad = 0.12
|
||||||
|
sw = (ring_w - (cols + 1) * slot_inner_pad) / cols
|
||||||
|
sh = (ring_h - 0.65 - (rows + 1) * slot_inner_pad) / rows
|
||||||
|
for i in range(n_slots):
|
||||||
|
r = i // cols
|
||||||
|
c = i % cols
|
||||||
|
sx_i = ring_x0 + slot_inner_pad + c * (sw + slot_inner_pad)
|
||||||
|
sy_i = (ring_y0 + slot_inner_pad
|
||||||
|
+ (rows - 1 - r) * (sh + slot_inner_pad))
|
||||||
|
is_active = (i == active_slot)
|
||||||
|
face = "#ffd9b8" if is_active else "white"
|
||||||
|
edge = _ORANGE if is_active else "#c9c9c9"
|
||||||
|
ax.add_patch(Rectangle(
|
||||||
|
(sx_i, sy_i), sw, sh,
|
||||||
|
linewidth=1.0, facecolor=face, edgecolor=edge, zorder=5,
|
||||||
|
))
|
||||||
|
ax.text(
|
||||||
|
sx_i + sw / 2, sy_i + sh / 2, f"s{i}",
|
||||||
|
ha="center", va="center", fontsize=8,
|
||||||
|
fontweight="bold" if is_active else "normal",
|
||||||
|
color=_ORANGE if is_active else "#666",
|
||||||
|
zorder=6,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def emit_ipcq_dma_diagram() -> str:
|
||||||
|
"""Two-PE diagram emphasising: outbound DMA writes DIRECTLY into the
|
||||||
|
receiver's local memory (slot ring in PE_TCM). pe1.pe_dma is the
|
||||||
|
inbound memory port that pays drain + emits the MetaArrival notice;
|
||||||
|
the actual DMA payload terminates in the slot, not in another DMA.
|
||||||
|
"""
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.patches import FancyBboxPatch
|
||||||
|
|
||||||
|
_OUT_DIR.mkdir(parents=True, exist_ok=True)
|
||||||
|
fig, ax = plt.subplots(figsize=(22, 12), facecolor="white")
|
||||||
|
XMAX, YMAX = 28.0, 14.0
|
||||||
|
ax.set_xlim(0, XMAX)
|
||||||
|
ax.set_ylim(0, YMAX)
|
||||||
|
ax.set_aspect("equal")
|
||||||
|
ax.axis("off")
|
||||||
|
ax.set_facecolor(_BG)
|
||||||
|
|
||||||
|
# Outer page border.
|
||||||
|
ax.add_patch(FancyBboxPatch(
|
||||||
|
(0.20, 0.20), XMAX - 0.40, YMAX - 0.40,
|
||||||
|
boxstyle="round,pad=0.02,rounding_size=0.20",
|
||||||
|
linewidth=1.4, edgecolor=_FRAME, facecolor=_BG, zorder=0,
|
||||||
|
))
|
||||||
|
|
||||||
|
ax.set_title(
|
||||||
|
"Two PEs over IPCQ — outbound DMA lands DIRECTLY in receiver "
|
||||||
|
"memory (slot ring in PE_TCM)",
|
||||||
|
fontsize=14, fontweight="bold", color=_TEXT, pad=12,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── PE panels ───────────────────────────────────────────────────
|
||||||
|
PE0_X, PE0_W = 0.8, 11.6
|
||||||
|
PE1_X, PE1_W = 15.6, 11.6
|
||||||
|
PE_Y, PE_H = 1.6, 10.4
|
||||||
|
|
||||||
|
_pe_panel(ax, x0=PE0_X, y0=PE_Y, w=PE0_W, h=PE_H,
|
||||||
|
label="PE 0 (sender — sip0.cube0.pe0)",
|
||||||
|
edge=_BLUE, fill="white")
|
||||||
|
_pe_panel(ax, x0=PE1_X, y0=PE_Y, w=PE1_W, h=PE_H,
|
||||||
|
label="PE 1 (receiver — sip0.cube0.pe1)",
|
||||||
|
edge=_GREEN, fill="white")
|
||||||
|
|
||||||
|
# ── PE 0 sub-blocks ─────────────────────────────────────────────
|
||||||
|
# Top row: PE_CPU and PE_IPCQ
|
||||||
|
_sub_block(
|
||||||
|
ax, cx=PE0_X + 2.5, cy=10.3, w=3.4, h=1.6,
|
||||||
|
title="PE_CPU",
|
||||||
|
body_lines=["kernel:",
|
||||||
|
" tl.send(dir='intra_E',",
|
||||||
|
" src=ptr)"],
|
||||||
|
fill=_BOX_FILL, edge=_BOX_EDGE,
|
||||||
|
)
|
||||||
|
_sub_block(
|
||||||
|
ax, cx=PE0_X + 8.4, cy=10.3, w=4.0, h=1.6,
|
||||||
|
title="PE_IPCQ (control / SFR)",
|
||||||
|
body_lines=["per-direction state:",
|
||||||
|
" head/tail, peer.rx_base_pa,",
|
||||||
|
" peer_tail_cache"],
|
||||||
|
fill=_BOX_FILL, edge=_BOX_EDGE,
|
||||||
|
)
|
||||||
|
# Mid: PE_TCM (left, with src + slot ring) and PE_DMA outbound (right)
|
||||||
|
_tcm_with_slots(
|
||||||
|
ax, cx=PE0_X + 3.0, cy=5.4, w=5.6, h=3.6,
|
||||||
|
n_slots=8, active_slot=-1,
|
||||||
|
title="PE_TCM (local memory · buffer_kind = tcm)",
|
||||||
|
)
|
||||||
|
_sub_block(
|
||||||
|
ax, cx=PE0_X + 8.6, cy=5.4, w=3.6, h=3.6,
|
||||||
|
title="PE_DMA (outbound)",
|
||||||
|
body_lines=["snapshot src bytes",
|
||||||
|
" from PE_TCM",
|
||||||
|
"build Transaction",
|
||||||
|
" (dst = peer's slot PA)",
|
||||||
|
"fire onto fabric;",
|
||||||
|
" do not wait for ack"],
|
||||||
|
fill=_HW_FILL, edge=_HW_EDGE,
|
||||||
|
)
|
||||||
|
# Arrows on PE 0 side
|
||||||
|
_arrow(ax, (PE0_X + 4.20, 10.3), (PE0_X + 6.40, 10.3),
|
||||||
|
color=_BLUE, lw=1.7)
|
||||||
|
ax.text(PE0_X + 5.30, 10.65, "tl.send",
|
||||||
|
ha="center", va="center", fontsize=8.5, color=_BLUE,
|
||||||
|
fontweight="bold")
|
||||||
|
# PE_IPCQ → PE_DMA control (kept; label removed per request)
|
||||||
|
_arrow(ax, (PE0_X + 8.4, 9.50), (PE0_X + 8.6, 7.20),
|
||||||
|
color=_ORANGE, lw=1.6)
|
||||||
|
# PE_TCM(src) → PE_DMA (read source data)
|
||||||
|
_arrow(ax, (PE0_X + 5.80, 5.40), (PE0_X + 6.80, 5.40),
|
||||||
|
color=_BLUE, lw=2.0)
|
||||||
|
ax.text(PE0_X + 6.30, 6.05, "read source\n(snapshot)",
|
||||||
|
ha="center", va="bottom", fontsize=7.5, color=_BLUE,
|
||||||
|
family="monospace")
|
||||||
|
|
||||||
|
# ── Fabric in the middle ────────────────────────────────────────
|
||||||
|
FAB_X0, FAB_W = 12.6, 2.8
|
||||||
|
FAB_Y0, FAB_H = 4.6, 2.2
|
||||||
|
ax.add_patch(FancyBboxPatch(
|
||||||
|
(FAB_X0, FAB_Y0), FAB_W, FAB_H,
|
||||||
|
boxstyle="round,pad=0.04,rounding_size=0.20",
|
||||||
|
linewidth=1.6, edgecolor=_PURPLE, facecolor="white", zorder=2,
|
||||||
|
))
|
||||||
|
ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + FAB_H - 0.45,
|
||||||
|
"NoC Fabric", ha="center", va="center",
|
||||||
|
fontsize=12, fontweight="bold", color=_PURPLE)
|
||||||
|
ax.text(FAB_X0 + FAB_W / 2, FAB_Y0 + 0.55,
|
||||||
|
"(routers, links;\nfabric BW + drain time)",
|
||||||
|
ha="center", va="center", fontsize=8.5, color=_TEXT)
|
||||||
|
|
||||||
|
# ── PE 1 sub-blocks ─────────────────────────────────────────────
|
||||||
|
# Top row: PE_IPCQ and PE_CPU
|
||||||
|
_sub_block(
|
||||||
|
ax, cx=PE1_X + 3.2, cy=10.3, w=4.0, h=1.6,
|
||||||
|
title="PE_IPCQ (control / SFR)",
|
||||||
|
body_lines=["per-direction state:",
|
||||||
|
" head/tail, peer_head_cache,",
|
||||||
|
" my_rx_base_pa"],
|
||||||
|
fill=_BOX_FILL, edge=_BOX_EDGE,
|
||||||
|
)
|
||||||
|
_sub_block(
|
||||||
|
ax, cx=PE1_X + 9.1, cy=10.3, w=3.4, h=1.6,
|
||||||
|
title="PE_CPU",
|
||||||
|
body_lines=["kernel:",
|
||||||
|
" ptr = tl.recv(",
|
||||||
|
" dir='intra_W')"],
|
||||||
|
fill=_BOX_FILL, edge=_BOX_EDGE,
|
||||||
|
)
|
||||||
|
# Wide PE_TCM occupying the centre-bottom of PE 1 — the DMA payload
|
||||||
|
# terminates HERE (not in any DMA component).
|
||||||
|
_tcm_with_slots(
|
||||||
|
ax, cx=PE1_X + 5.0, cy=5.4, w=8.4, h=3.6,
|
||||||
|
n_slots=8, active_slot=3,
|
||||||
|
title="PE_TCM (local memory · buffer_kind = tcm)",
|
||||||
|
)
|
||||||
|
|
||||||
|
# ── DATA arrows: outbound DMA ──► RECEIVER MEMORY (the slot) ───
|
||||||
|
# The inbound PE_DMA is NOT on the data path — it's a sim-side
|
||||||
|
# bookkeeper that pays terminal drain + emits MetaArrival. The
|
||||||
|
# actual DMA payload jumps fabric → slot directly.
|
||||||
|
# 1) pe0.PE_DMA → fabric
|
||||||
|
_arrow(ax, (PE0_X + 10.40, 5.40), (FAB_X0, 5.40),
|
||||||
|
color=_BLUE, lw=2.8)
|
||||||
|
# 2) fabric → PE_TCM slot s3 (DMA payload terminates IN MEMORY)
|
||||||
|
SLOT_X = PE1_X + 2.95 # x-coordinate of slot s3 within PE_TCM
|
||||||
|
_arrow(ax, (FAB_X0 + FAB_W, 5.40), (SLOT_X, 5.40),
|
||||||
|
color=_BLUE, lw=2.8)
|
||||||
|
|
||||||
|
# PE_IPCQ → PE_CPU: tl.recv unblocks
|
||||||
|
_arrow(ax, (PE1_X + 5.20, 10.30), (PE1_X + 7.40, 10.30),
|
||||||
|
color=_GREEN, lw=1.7)
|
||||||
|
ax.text(PE1_X + 6.30, 10.65, "unblock tl.recv",
|
||||||
|
ha="center", va="center", fontsize=8.5, color=_GREEN,
|
||||||
|
fontweight="bold")
|
||||||
|
# PE_CPU → PE_TCM: kernel reads consumed slot via returned ptr
|
||||||
|
_arrow(ax, (PE1_X + 9.10, 9.50), (PE1_X + 8.10, 7.20),
|
||||||
|
color=_GREEN, lw=1.4, curve=0.10)
|
||||||
|
ax.text(PE1_X + 9.30, 8.30, "kernel reads\nslot data",
|
||||||
|
ha="left", va="center", fontsize=7.5, color=_GREEN)
|
||||||
|
|
||||||
|
# (Credit-return arrow + label removed per request — see code
|
||||||
|
# for the actual mechanism: pe1.pe_ipcq → pe0.credit_inbox via
|
||||||
|
# SimPy Store after env.timeout(fabric_path_latency_ns).)
|
||||||
|
|
||||||
|
# ── Footer legend ──────────────────────────────────────────────
|
||||||
|
ax.text(0.6, 0.85,
|
||||||
|
"DATA (blue) : pe0 PE_TCM[src] → pe0 PE_DMA → "
|
||||||
|
"NoC fabric → pe1 PE_TCM[slot s3] ← DMA write "
|
||||||
|
"terminates IN MEMORY",
|
||||||
|
ha="left", va="center", fontsize=9, color=_TEXT,
|
||||||
|
style="italic")
|
||||||
|
ax.text(0.6, 0.45,
|
||||||
|
"CTRL (orange) : PE_IPCQ issues IpcqDmaToken on send; "
|
||||||
|
"pe1's inbound port emits MetaArrival; credit return "
|
||||||
|
"uses the fabric path (timing) but bypasses the per-hop "
|
||||||
|
"component graph (D9 fast path).",
|
||||||
|
ha="left", va="center", fontsize=9, color=_TEXT,
|
||||||
|
style="italic")
|
||||||
|
|
||||||
|
out_path = _OUT_DIR / "ipcq_two_pe_dma.png"
|
||||||
|
fig.savefig(out_path, dpi=130, bbox_inches="tight",
|
||||||
|
facecolor=fig.get_facecolor())
|
||||||
|
plt.close(fig)
|
||||||
|
return str(out_path)
|
||||||
|
|
||||||
|
|
||||||
|
def test_emit_ipcq_dma_diagram():
|
||||||
|
out = emit_ipcq_dma_diagram()
|
||||||
|
assert Path(out).exists()
|
||||||
@@ -29,7 +29,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
|
|||||||
# 48 GB / 8 slices = 6 GB per slice
|
# 48 GB / 8 slices = 6 GB per slice
|
||||||
slice_bytes = 48 * (1 << 30) // 8
|
slice_bytes = 48 * (1 << 30) // 8
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
|
sip_id=sip, die_id=cube, pe_id=pe_id,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
return pa.encode()
|
return pa.encode()
|
||||||
@@ -37,7 +37,7 @@ def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
|
|||||||
|
|
||||||
def _sram_pa(sip: int = 0, cube: int = 0) -> int:
|
def _sram_pa(sip: int = 0, cube: int = 0) -> int:
|
||||||
"""Create an SRAM physical address."""
|
"""Create an SRAM physical address."""
|
||||||
pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=sip, cube_id=cube, sram_offset=0x800)
|
pa = PhysAddr.cube_sram_addr(sip_id=sip, die_id=cube, sram_offset=0x800)
|
||||||
return pa.encode()
|
return pa.encode()
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,139 @@
|
|||||||
|
"""Phase 1 test for moving the intercube_allreduce root cube from the
|
||||||
|
bottom-right corner (3,3) to the near-center cube (2,2).
|
||||||
|
|
||||||
|
Today's algorithm (intercube_allreduce.py) hardcodes
|
||||||
|
``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
|
||||||
|
The intra-SIP critical path for one allreduce is therefore::
|
||||||
|
|
||||||
|
Phase 1 (row reduce W→E to col 3) : 3 hops
|
||||||
|
Phase 2 (col reduce N→S to row 3 on col 3): 3 hops
|
||||||
|
Phase 3 (inter-SIP at root) : (separate)
|
||||||
|
Phase 4 (col broadcast S→N) : 3 hops
|
||||||
|
Phase 5 (row broadcast E→W) : 3 hops
|
||||||
|
Total intra-SIP critical path : 12 hops
|
||||||
|
|
||||||
|
Moving the root to (2,2) and using BIDIRECTIONAL convergence (cols 0..2
|
||||||
|
go W→E, col 3 goes E→W in parallel; rows 0..2 go N→S, row 3 goes S→N
|
||||||
|
in parallel) cuts each phase's critical path from 3 hops to 2::
|
||||||
|
|
||||||
|
Phase 1 critical path : max(2, 1) = 2 hops
|
||||||
|
Phase 2 critical path : max(2, 1) = 2 hops
|
||||||
|
Phase 4 critical path : 2 hops
|
||||||
|
Phase 5 critical path : 2 hops
|
||||||
|
Total intra-SIP critical path : 8 hops
|
||||||
|
|
||||||
|
Per-hop cost at 96 KB on TCM ≈ 600 ns (slot IO write+read 384 ns +
|
||||||
|
fabric drain ~217 ns). 4 fewer hops ⇒ ~2.4 µs reduction.
|
||||||
|
|
||||||
|
EXPECTED Phase 1 outcome:
|
||||||
|
- Today (root = corner) : ~22.0 µs ← test FAILS (> 20500 ns)
|
||||||
|
- After Phase 2 (root = center) : ~19.6 µs ← test PASSES (< 20500 ns)
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
from tests.test_allreduce_multidevice import (
|
||||||
|
_write_temp_configs,
|
||||||
|
run_allreduce,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _run_torus_96kb(tmp_path: Path) -> float:
|
||||||
|
"""Run torus_2d 6-SIP allreduce at 96 KB / slot, return critical-path
|
||||||
|
pe_exec_ns. Fixed at TCM (the project default)."""
|
||||||
|
sub = tmp_path / "torus_root_center"
|
||||||
|
sub.mkdir()
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
sub,
|
||||||
|
sip_topology="torus_2d",
|
||||||
|
n_sips=6,
|
||||||
|
algorithm="intercube_allreduce",
|
||||||
|
sip_w=3, sip_h=2,
|
||||||
|
n_elem_override=49152, # 49152 × 2 = 96 KB / slot
|
||||||
|
)
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="root_center_phase1",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
assert result["ok_cubes"] > 0
|
||||||
|
pe_exec_vals = [
|
||||||
|
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||||||
|
for _, (_, tr) in engine._results.items()
|
||||||
|
if isinstance(tr, dict)
|
||||||
|
]
|
||||||
|
return max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
def test_intra_sip_critical_path_at_96k_below_threshold(tmp_path):
|
||||||
|
"""Post-Phase-2 (root=center, bidirectional reduce) the torus_2d
|
||||||
|
96 KB allreduce on TCM should drop below 20.5 µs.
|
||||||
|
|
||||||
|
Today's value: ~22.0 µs (12-hop critical path with corner root).
|
||||||
|
Expected post-Phase-2: ~19.6 µs (8-hop critical path with
|
||||||
|
center root) — model estimate, ~11% reduction end-to-end.
|
||||||
|
"""
|
||||||
|
lat_ns = _run_torus_96kb(tmp_path)
|
||||||
|
THRESHOLD_NS = 20_500.0
|
||||||
|
assert lat_ns < THRESHOLD_NS, (
|
||||||
|
f"torus_2d 6-SIP 96 KB allreduce should land below "
|
||||||
|
f"{THRESHOLD_NS:.0f} ns post-Phase-2 (root=center, "
|
||||||
|
f"bidirectional reduce). got {lat_ns:.1f} ns "
|
||||||
|
f"({lat_ns / 1000:.2f} µs)"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_correctness_preserved(tmp_path):
|
||||||
|
"""Smoke check: at small n_elem the new algorithm must still produce
|
||||||
|
the correct sum across all 96 cubes. ``run_allreduce`` validates
|
||||||
|
every cube against the expected reduce result (``ok_cubes`` must be
|
||||||
|
96 = 6 SIPs × 16 cubes).
|
||||||
|
|
||||||
|
This guards against the obvious Phase 2 risk: bidirectional reduce
|
||||||
|
must sum each contribution exactly once. If implemented wrong (double-
|
||||||
|
counting or skipping the right edge column / bottom row), the
|
||||||
|
asserts inside run_allreduce fail.
|
||||||
|
"""
|
||||||
|
sub = tmp_path / "correctness"
|
||||||
|
sub.mkdir()
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
sub,
|
||||||
|
sip_topology="torus_2d",
|
||||||
|
n_sips=6,
|
||||||
|
algorithm="intercube_allreduce",
|
||||||
|
sip_w=3, sip_h=2,
|
||||||
|
n_elem_override=128, # tiny payload to keep this fast
|
||||||
|
)
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="root_center_correctness",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
n_cubes = 6 * 16 # 6 SIPs × 16 cubes/SIP
|
||||||
|
assert result["ok_cubes"] == n_cubes, (
|
||||||
|
f"all 96 cubes must validate; got {result['ok_cubes']} OK"
|
||||||
|
)
|
||||||
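Reviewer note: the hop counts quoted in the module docstring of this new test file can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. `phase_hops` and `intra_sip_critical_path` are hypothetical helper names, not part of kernbench, and the 600 ns per-hop cost is the approximate figure quoted in the docstring, not a measured value.

```python
# Hypothetical sketch: intra-SIP critical path of a 2D reduce + broadcast
# on a non-wrapping cube mesh, as a function of the root position.

def phase_hops(extent: int, root: int) -> int:
    """Longest distance from any index in [0, extent) to `root` on one axis.

    With bidirectional convergence both sides move toward the root in
    parallel, so the critical path is the longer of the two sides.
    """
    return max(root, extent - 1 - root)


def intra_sip_critical_path(cube_w: int, cube_h: int,
                            root_col: int, root_row: int) -> int:
    """Row reduce + column reduce, then column broadcast + row broadcast."""
    row_hops = phase_hops(cube_w, root_col)
    col_hops = phase_hops(cube_h, root_row)
    return 2 * (row_hops + col_hops)


PER_HOP_NS = 600.0  # assumption: approximate per-hop cost at 96 KB on TCM

corner_hops = intra_sip_critical_path(4, 4, root_col=3, root_row=3)  # 12
center_hops = intra_sip_critical_path(4, 4, root_col=2, root_row=2)  # 8
saving_ns = (corner_hops - center_hops) * PER_HOP_NS                 # 2400.0
```

Under these assumptions the corner root gives 12 hops and the (2,2) root gives 8, so the estimated saving is 4 hops × 600 ns, about 2.4 µs. That is consistent with the expected drop from ~22.0 µs to ~19.6 µs and with the 20 500 ns threshold sitting between the two.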
@@ -1,8 +1,9 @@
|
|||||||
"""Tests for configure_sfr_intercube_multisip neighbor table wiring.
|
"""Tests for configure_sfr_intercube_multisip neighbor table wiring.
|
||||||
|
|
||||||
Verifies that IPCQ neighbor tables are correctly installed for
|
Verifies full IPCQ hardware wiring (independent of DPPolicy):
|
||||||
intercube (pe0, 4×4 mesh N/S/E/W) + inter-SIP (pe0, all cubes,
|
- intra-cube (2×4 PE grid) → intra_N/S/E/W
|
||||||
global_E/global_W) communication.
|
- intercube same-lane → N/S/E/W
|
||||||
|
- inter-SIP same-(cube, pe) → global_N/S/E/W
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
@@ -16,6 +17,7 @@ from kernbench.topology.builder import resolve_topology
|
|||||||
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
N_CUBES = 16
|
N_CUBES = 16
|
||||||
|
PES_PER_CUBE = 8
|
||||||
|
|
||||||
|
|
||||||
def _engine_and_spec():
|
def _engine_and_spec():
|
||||||
@@ -36,78 +38,102 @@ class TestConfigureSfrNeighborTables:
|
|||||||
plan = configure_sfr_intercube_multisip(engine, spec, cfg)
|
plan = configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
n_sips = int(spec["system"]["sips"]["count"])
|
n_sips = int(spec["system"]["sips"]["count"])
|
||||||
assert plan["world_size"] == n_sips * N_CUBES
|
expected = n_sips * N_CUBES * PES_PER_CUBE
|
||||||
assert len(plan["rank_to_pe"]) == n_sips * N_CUBES
|
assert plan["world_size"] == expected
|
||||||
for pe_idx, (sip, cube, pe) in enumerate(plan["rank_to_pe"]):
|
assert len(plan["rank_to_pe"]) == expected
|
||||||
assert pe == 0, f"pe_idx {pe_idx}: pe must be 0, got {pe}"
|
|
||||||
|
|
||||||
def test_corner_cube0_has_E_and_S_only(self):
|
# ── Intra-cube (intra_N/S/E/W) ────────────────────────────────
|
||||||
"""Cube 0 (row=0, col=0) is NW corner: only E and S neighbors."""
|
|
||||||
|
def test_pe0_intra_cube_has_intra_E_and_intra_S(self):
|
||||||
|
"""pe0 is NW of the 2×4 PE grid: intra_E=pe1, intra_S=pe4."""
|
||||||
engine, spec = _engine_and_spec()
|
engine, spec = _engine_and_spec()
|
||||||
cfg = _merged_cfg()
|
cfg = _merged_cfg()
|
||||||
configure_sfr_intercube_multisip(engine, spec, cfg)
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
ipcq = engine._components["sip0.cube0.pe0.pe_ipcq"]
|
qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
|
||||||
qp = ipcq.queue_pairs
|
assert "intra_E" in qp
|
||||||
assert "E" in qp, "cube 0 must have E neighbor"
|
assert qp["intra_E"]["peer"].pe == 1
|
||||||
assert "S" in qp, "cube 0 must have S neighbor"
|
assert "intra_S" in qp
|
||||||
assert "W" not in qp, "cube 0 (col=0) must NOT have W neighbor"
|
assert qp["intra_S"]["peer"].pe == 4
|
||||||
assert "N" not in qp, "cube 0 (row=0) must NOT have N neighbor"
|
assert "intra_W" not in qp
|
||||||
|
assert "intra_N" not in qp
|
||||||
|
|
||||||
|
def test_pe5_intra_cube_has_N_E_W_but_no_S(self):
|
||||||
|
"""pe5 (row=1, col=1 in 2×4 grid) has three intra directions (N/E/W).
|
||||||
|
|
||||||
|
Intra neighbors: intra_N=pe1, intra_E=pe6, intra_W=pe4,
|
||||||
|
intra_S not present (row=1 is bottom row).
|
||||||
|
"""
|
||||||
|
engine, spec = _engine_and_spec()
|
||||||
|
cfg = _merged_cfg()
|
||||||
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
|
qp = engine._components["sip0.cube0.pe5.pe_ipcq"].queue_pairs
|
||||||
|
assert qp["intra_N"]["peer"].pe == 1
|
||||||
|
assert qp["intra_E"]["peer"].pe == 6
|
||||||
|
assert qp["intra_W"]["peer"].pe == 4
|
||||||
|
assert "intra_S" not in qp # bottom row
|
||||||
|
|
||||||
|
# ── Intercube same-lane (N/S/E/W) ─────────────────────────────
|
||||||
|
|
||||||
|
def test_corner_cube0_pe0_has_intercube_E_and_S(self):
|
||||||
|
"""Cube 0 (NW mesh corner): intercube E→cube1, S→cube4."""
|
||||||
|
engine, spec = _engine_and_spec()
|
||||||
|
cfg = _merged_cfg()
|
||||||
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
|
qp = engine._components["sip0.cube0.pe0.pe_ipcq"].queue_pairs
|
||||||
assert qp["E"]["peer"].cube == 1
|
assert qp["E"]["peer"].cube == 1
|
||||||
|
assert qp["E"]["peer"].pe == 0 # same-lane
|
||||||
assert qp["S"]["peer"].cube == 4
|
assert qp["S"]["peer"].cube == 4
|
||||||
|
assert qp["S"]["peer"].pe == 0
|
||||||
|
assert "W" not in qp, "cube 0 has no west neighbor"
|
||||||
|
assert "N" not in qp, "cube 0 has no north neighbor"
|
||||||
|
|
||||||
def test_interior_cube5_has_all_four(self):
|
def test_interior_cube5_pe3_has_all_four_intercube_same_lane(self):
|
||||||
"""Cube 5 (row=1, col=1) is interior: N/S/E/W all present."""
|
"""Cube 5 interior, pe3: intercube N/S/E/W all present, same-lane."""
|
||||||
engine, spec = _engine_and_spec()
|
engine, spec = _engine_and_spec()
|
||||||
cfg = _merged_cfg()
|
cfg = _merged_cfg()
|
||||||
configure_sfr_intercube_multisip(engine, spec, cfg)
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
ipcq = engine._components["sip0.cube5.pe0.pe_ipcq"]
|
qp = engine._components["sip0.cube5.pe3.pe_ipcq"].queue_pairs
|
||||||
qp = ipcq.queue_pairs
|
for d, expected_cube in [("N", 1), ("S", 9), ("E", 6), ("W", 4)]:
|
||||||
assert qp["N"]["peer"].cube == 1
|
assert qp[d]["peer"].cube == expected_cube
|
||||||
assert qp["S"]["peer"].cube == 9
|
assert qp[d]["peer"].pe == 3 # same-lane
|
||||||
assert qp["E"]["peer"].cube == 6
|
|
||||||
assert qp["W"]["peer"].cube == 4
|
|
||||||
|
|
||||||
def test_root_cube15_has_inter_sip(self):
|
def test_all_pes_have_intercube_wiring(self):
|
||||||
"""Cube 15 (root, SE corner) has N, W + global_E/global_W."""
|
"""Every PE on interior cube 5 has intercube N/S/E/W same-lane wiring."""
|
||||||
engine, spec = _engine_and_spec()
|
engine, spec = _engine_and_spec()
|
||||||
cfg = _merged_cfg()
|
cfg = _merged_cfg()
|
||||||
configure_sfr_intercube_multisip(engine, spec, cfg)
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
ipcq0 = engine._components["sip0.cube15.pe0.pe_ipcq"]
|
# Interior cube 5: every PE should have N/S/E/W same-lane.
|
||||||
qp0 = ipcq0.queue_pairs
|
for pe in range(PES_PER_CUBE):
|
||||||
assert "N" in qp0
|
qp = engine._components[f"sip0.cube5.pe{pe}.pe_ipcq"].queue_pairs
|
||||||
assert "W" in qp0
|
for d in ("N", "S", "E", "W"):
|
||||||
assert "E" not in qp0, "cube 15 (col=3) must NOT have E"
|
assert d in qp, f"sip0.cube5.pe{pe} missing intercube {d}"
|
||||||
assert "S" not in qp0, "cube 15 (row=3) must NOT have S"
|
assert qp[d]["peer"].pe == pe, (
|
||||||
assert "global_E" in qp0, "root cube must have global_E"
|
f"sip0.cube5.pe{pe} {d} not same-lane"
|
||||||
assert "global_W" in qp0, "root cube must have global_W"
|
)
|
||||||
assert qp0["global_E"]["peer"].sip == 1
|
|
||||||
assert qp0["global_E"]["peer"].cube == 15
|
|
||||||
|
|
||||||
ipcq1 = engine._components["sip1.cube15.pe0.pe_ipcq"]
|
# ── Inter-SIP (global_*) ──────────────────────────────────────
|
||||||
qp1 = ipcq1.queue_pairs
|
|
||||||
assert qp1["global_E"]["peer"].sip == 0
|
|
||||||
assert qp1["global_E"]["peer"].cube == 15
|
|
||||||
|
|
||||||
def test_all_cubes_have_inter_sip(self):
|
def test_every_pe_on_every_cube_has_inter_sip(self):
|
||||||
"""ALL cubes (not just root) are wired for inter-SIP."""
|
"""All PEs on all cubes wired for inter-SIP via global_*."""
|
||||||
engine, spec = _engine_and_spec()
|
engine, spec = _engine_and_spec()
|
||||||
cfg = _merged_cfg()
|
cfg = _merged_cfg()
|
||||||
configure_sfr_intercube_multisip(engine, spec, cfg)
|
configure_sfr_intercube_multisip(engine, spec, cfg)
|
||||||
|
|
||||||
root_cube = int(cfg.get("root_cube", N_CUBES - 1))
|
|
||||||
for cube_id in range(N_CUBES):
|
for cube_id in range(N_CUBES):
|
||||||
ipcq = engine._components[f"sip0.cube{cube_id}.pe0.pe_ipcq"]
|
for pe in range(PES_PER_CUBE):
|
||||||
qp = ipcq.queue_pairs
|
qp = engine._components[
|
||||||
|
f"sip0.cube{cube_id}.pe{pe}.pe_ipcq"
|
||||||
|
].queue_pairs
|
||||||
assert "global_E" in qp, (
|
assert "global_E" in qp, (
|
||||||
f"sip0.cube{cube_id}.pe0 missing global_E"
|
f"sip0.cube{cube_id}.pe{pe} missing global_E"
|
||||||
)
|
|
||||||
assert "global_W" in qp, (
|
|
||||||
f"sip0.cube{cube_id}.pe0 missing global_W"
|
|
||||||
)
|
|
||||||
if cube_id == root_cube:
|
|
||||||
assert qp["global_E"]["peer"].sip != 0, (
|
|
||||||
f"root cube {root_cube} global_E must point to another SIP"
|
|
||||||
)
|
)
|
||||||
|
assert "global_W" in qp
|
||||||
|
# Peer must be same (cube, pe) on another SIP.
|
||||||
|
assert qp["global_E"]["peer"].sip == 1
|
||||||
|
assert qp["global_E"]["peer"].cube == cube_id
|
||||||
|
assert qp["global_E"]["peer"].pe == pe
|
||||||
|
|||||||
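Reviewer note: the neighbor expectations asserted in these tests follow from row-major indexing on a non-wrapping grid (a 2×4 PE grid inside each cube, a 4×4 cube mesh inside each SIP). A minimal sketch under that assumption is shown below. `grid_neighbors` is a hypothetical helper for illustration and is not the production wiring done by `configure_sfr_intercube_multisip`.

```python
# Hypothetical sketch: derive N/S/E/W neighbors for an index in a
# row-major, non-wrapping grid. Edge positions simply omit the direction.

def grid_neighbors(idx: int, width: int, height: int,
                   prefix: str = "") -> dict[str, int]:
    row, col = divmod(idx, width)
    out: dict[str, int] = {}
    if row > 0:
        out[prefix + "N"] = idx - width
    if row < height - 1:
        out[prefix + "S"] = idx + width
    if col > 0:
        out[prefix + "W"] = idx - 1
    if col < width - 1:
        out[prefix + "E"] = idx + 1
    return out


# 2x4 PE grid (2 rows, 4 columns): pe0 is the NW corner, pe5 is on the
# bottom row and therefore has no intra_S neighbor.
pe0 = grid_neighbors(0, width=4, height=2, prefix="intra_")
pe5 = grid_neighbors(5, width=4, height=2, prefix="intra_")

# 4x4 cube mesh: cube 5 is interior, cube 0 is the NW corner.
cube5 = grid_neighbors(5, width=4, height=4)
cube0 = grid_neighbors(0, width=4, height=4)
```

For same-lane intercube and inter-SIP wiring the tests additionally require the peer to keep the same `pe` index (and the same `cube` index for `global_*`). This sketch only covers the grid geometry and does not model those peer fields.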
@@ -36,7 +36,7 @@ def _engine():
|
|||||||
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
|
def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
|
||||||
slice_bytes = 48 * (1 << 30) // 8
|
slice_bytes = 48 * (1 << 30) // 8
|
||||||
pa = PhysAddr.pe_hbm_addr(
|
pa = PhysAddr.pe_hbm_addr(
|
||||||
rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
|
sip_id=sip, die_id=cube, pe_id=pe_id,
|
||||||
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
|
||||||
)
|
)
|
||||||
return pa.encode()
|
return pa.encode()
|
||||||
|
|||||||
@@ -0,0 +1,219 @@
|
|||||||
|
"""Phase 1 micro-tests for IPCQ slot-memory latency model.
|
||||||
|
|
||||||
|
These tests assert the TARGET behavior expected after Phase 2 wires
|
||||||
|
``buffer_kind`` (tcm/sram/hbm) into the IPCQ slot read/write latency
|
||||||
|
charges. They are written BEFORE the production change and are
|
||||||
|
EXPECTED TO FAIL today.
|
||||||
|
|
||||||
|
Failure semantics today:
|
||||||
|
- Slot access is latency-free, so the tcm/sram/hbm runs produce
|
||||||
|
identical pe_exec_ns. The ordering assertion therefore fails with
|
||||||
|
"tcm == sram == hbm" — proving the test harness is wired and that
|
||||||
|
Phase 2 production work is what makes them pass.
|
||||||
|
|
||||||
|
Reference (Phase 2 will edit these):
|
||||||
|
- src/kernbench/components/builtin/pe_dma.py — _handle_ipcq_inbound
|
||||||
|
- src/kernbench/components/builtin/pe_ipcq.py — _handle_recv,
|
||||||
|
_BUFFER_KIND_BW table
|
||||||
|
- src/kernbench/runtime_api/kernel.py — IpcqDmaToken adds
|
||||||
|
buffer_kind field
|
||||||
|
- ccl.yaml — algorithm.buffer_kind
|
||||||
|
|
||||||
|
The tests reuse the existing config-driven allreduce app
|
||||||
|
(``run_allreduce`` in tests/test_allreduce_multidevice.py) with a 6-SIP
|
||||||
|
torus_2d topology and a SMALL n_elem so they finish fast (~3-5 s each).
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
# Reuse the test app's helpers so this micro-test file does not
|
||||||
|
# duplicate the run-allreduce + write-temp-configs plumbing.
|
||||||
|
from tests.test_allreduce_multidevice import (
|
||||||
|
_write_temp_configs,
|
||||||
|
run_allreduce,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# Expected per-tier BW + overhead (Phase 2 will encode this in
|
||||||
|
# pe_ipcq.py). Mirrors topology.yaml component values.
|
||||||
|
_EXPECTED_BW = {
|
||||||
|
"tcm": (512.0, 0.0),
|
||||||
|
"sram": (512.0, 2.0),
|
||||||
|
"hbm": (256.0, 6.0),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _expected_slot_io_ns(buffer_kind: str, nbytes: int) -> float:
|
||||||
|
"""Per-access latency the model is expected to add (write OR read)."""
|
||||||
|
bw_gbs, overhead_ns = _EXPECTED_BW[buffer_kind]
|
||||||
|
# 1 GB/s = 1 byte/ns
|
||||||
|
return nbytes / bw_gbs + overhead_ns
|
||||||
|
|
||||||
|
|
||||||
|
def _run_torus_allreduce(
|
||||||
|
tmp_path: Path, *, buffer_kind: str, n_elem: int,
|
||||||
|
) -> float:
|
||||||
|
"""Run one torus_2d 6-SIP allreduce and return critical-path
|
||||||
|
pe_exec_ns. The buffer_kind override is wired into ccl.yaml.
|
||||||
|
"""
|
||||||
|
sub = tmp_path / f"{buffer_kind}_{n_elem}"
|
||||||
|
sub.mkdir()
|
||||||
|
topo_path, ccl_path = _write_temp_configs(
|
||||||
|
sub,
|
||||||
|
sip_topology="torus_2d",
|
||||||
|
n_sips=6,
|
||||||
|
algorithm="intercube_allreduce",
|
||||||
|
sip_w=3, sip_h=2,
|
||||||
|
n_elem_override=n_elem,
|
||||||
|
)
|
||||||
|
# Patch ccl.yaml in-place so the algorithm picks up buffer_kind.
|
||||||
|
import yaml
|
||||||
|
|
||||||
|
with open(ccl_path) as f:
|
||||||
|
ccl_cfg = yaml.safe_load(f)
|
||||||
|
ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
|
||||||
|
ccl_cfg.setdefault("algorithms", {}).setdefault(
|
||||||
|
"intercube_allreduce", {},
|
||||||
|
)["buffer_kind"] = buffer_kind
|
||||||
|
with open(ccl_path, "w") as f:
|
||||||
|
yaml.dump(ccl_cfg, f, default_flow_style=False)
|
||||||
|
|
||||||
|
topo = resolve_topology(topo_path)
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id=f"bk_{buffer_kind}_{n_elem}",
|
||||||
|
spec=spec,
|
||||||
|
) as ctx:
|
||||||
|
result = run_allreduce(
|
||||||
|
ctx, engine, spec,
|
||||||
|
algorithm="intercube_allreduce", ccl_yaml=ccl_path,
|
||||||
|
)
|
||||||
|
assert result["ok_cubes"] > 0, "allreduce did not validate"
|
||||||
|
|
||||||
|
pe_exec_vals = [
|
||||||
|
float(tr.get("pe_exec_ns", 0.0) or 0.0)
|
||||||
|
for _, (_, tr) in engine._results.items()
|
||||||
|
if isinstance(tr, dict)
|
||||||
|
]
|
||||||
|
return max(pe_exec_vals) if pe_exec_vals else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
# ── Phase 1 assertions ───────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def test_slot_write_latency_orders_tcm_sram_hbm(tmp_path):
|
||||||
|
"""tcm < sram < hbm at 8192 B per send.
|
||||||
|
|
||||||
|
Pre-Phase-2: all three return the same pe_exec_ns and this
|
||||||
|
assertion fails. Post-Phase-2: the per-tier BW + overhead make
|
||||||
|
hbm visibly slower than sram, which is slower than tcm.
|
||||||
|
"""
|
||||||
|
n_elem = 4096 # 8192 B per slot
|
||||||
|
lat_tcm = _run_torus_allreduce(tmp_path, buffer_kind="tcm", n_elem=n_elem)
|
||||||
|
lat_sram = _run_torus_allreduce(tmp_path, buffer_kind="sram", n_elem=n_elem)
|
||||||
|
lat_hbm = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=n_elem)
|
||||||
|
|
||||||
|
# Expected slot-IO cost per hop (write + read = 2× the per-access value).
|
||||||
|
exp_tcm = 2 * _expected_slot_io_ns("tcm", n_elem * 2)
|
||||||
|
exp_sram = 2 * _expected_slot_io_ns("sram", n_elem * 2)
|
||||||
|
exp_hbm = 2 * _expected_slot_io_ns("hbm", n_elem * 2)
|
||||||
|
# Floor margin: 50% of the expected write+read delta (i.e. one per-access
|
||||||
|
# delta), which lets the Phase 2 implementation charge only one side
|
||||||
|
# without breaking the test while still requiring a clearly observable gap.
|
||||||
|
margin_sram_tcm = 0.5 * (exp_sram - exp_tcm)
|
||||||
|
margin_hbm_sram = 0.5 * (exp_hbm - exp_sram)
|
||||||
|
|
||||||
|
assert lat_sram > lat_tcm + margin_sram_tcm, (
|
||||||
|
f"sram should be slower than tcm by ≥ {margin_sram_tcm:.1f} ns "
|
||||||
|
f"per allreduce, got sram={lat_sram:.1f} tcm={lat_tcm:.1f} "
|
||||||
|
f"(delta={lat_sram - lat_tcm:.1f})"
|
||||||
|
)
|
||||||
|
assert lat_hbm > lat_sram + margin_hbm_sram, (
|
||||||
|
f"hbm should be slower than sram by ≥ {margin_hbm_sram:.1f} ns "
|
||||||
|
f"per allreduce, got hbm={lat_hbm:.1f} sram={lat_sram:.1f} "
|
||||||
|
f"(delta={lat_hbm - lat_sram:.1f})"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_slot_io_scales_linearly_with_nbytes(tmp_path):
|
||||||
|
"""For buffer_kind=hbm, doubling nbytes should add ~nbytes/256 ns
|
||||||
|
of latency to each slot access (256 GB/s = 256 B/ns). Sanity-checks the slope.
|
||||||
|
|
||||||
|
Pre-Phase-2: latency does not respond to nbytes via memory BW
|
||||||
|
(only via fabric drain), so the observed slope is dominated by
|
||||||
|
fabric BW and does NOT match 1/256 ns/B.
|
||||||
|
"""
|
||||||
|
lat_4k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=2048)
|
||||||
|
lat_8k = _run_torus_allreduce(tmp_path, buffer_kind="hbm", n_elem=4096)
|
||||||
|
|
||||||
|
# Expected delta from doubling: at least one slot-IO event per cube
|
||||||
|
# in the critical path (very conservative). Per-access add = 4096/256 = 16
|
||||||
|
# ns on HBM going from 4k → 8k. Multiple slot accesses on the critical
|
||||||
|
# path should make the observed delta meaningfully larger.
|
||||||
|
expected_min_delta = 0.5 * (4096 / 256.0) # ≈ 8 ns
|
||||||
|
assert lat_8k - lat_4k > expected_min_delta, (
|
||||||
|
f"doubling nbytes on hbm should add ≥ {expected_min_delta:.1f} ns "
|
||||||
|
f"of slot-IO latency, got delta={lat_8k - lat_4k:.1f} ns "
|
||||||
|
f"(lat_4k={lat_4k:.1f}, lat_8k={lat_8k:.1f})"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_buffer_kind_sensitivity_grows_with_payload(tmp_path):
|
||||||
|
"""Credit-return cost is fabric-only by design (16 B packet); only
|
||||||
|
the data slot-IO charge depends on ``buffer_kind``. Therefore the
|
||||||
|
tcm-vs-hbm gap must grow with payload size: the small-payload gap
|
||||||
|
must be a small fraction of the large-payload gap.
|
||||||
|
|
||||||
|
Concrete invariant the model must satisfy:
|
||||||
|
gap_small / gap_large < 0.10
|
||||||
|
|
||||||
|
Pre-Phase-2: gap_small == gap_large == 0 (ratio undefined → test
|
||||||
|
fails first on the gap_large > 1000 ns assertion). Post-Phase-2: at small
|
||||||
|
nbytes the slot-IO charge is dominated by the constant
|
||||||
|
``overhead_ns`` term, while at large nbytes it is dominated by the
|
||||||
|
``nbytes / bw_gbs`` term — so gap_large grows linearly while
|
||||||
|
gap_small stays small.
|
||||||
|
"""
|
||||||
|
n_elem_small = 8 # 16 B per slot — overhead-bound
|
||||||
|
n_elem_large = 16384 # 32 KB per slot — bandwidth-bound
|
||||||
|
|
||||||
|
lat_tcm_small = _run_torus_allreduce(
|
||||||
|
tmp_path, buffer_kind="tcm", n_elem=n_elem_small,
|
||||||
|
)
|
||||||
|
lat_hbm_small = _run_torus_allreduce(
|
||||||
|
tmp_path, buffer_kind="hbm", n_elem=n_elem_small,
|
||||||
|
)
|
||||||
|
lat_tcm_large = _run_torus_allreduce(
|
||||||
|
tmp_path, buffer_kind="tcm", n_elem=n_elem_large,
|
||||||
|
)
|
||||||
|
lat_hbm_large = _run_torus_allreduce(
|
||||||
|
tmp_path, buffer_kind="hbm", n_elem=n_elem_large,
|
||||||
|
)
|
||||||
|
|
||||||
|
gap_small = abs(lat_hbm_small - lat_tcm_small)
|
||||||
|
gap_large = abs(lat_hbm_large - lat_tcm_large)
|
||||||
|
|
||||||
|
assert gap_large > 1000.0, (
|
||||||
|
f"large-payload buffer_kind gap must be observably large "
|
||||||
|
f"(this is the sweep's whole point). got gap_large={gap_large:.1f} ns "
|
||||||
|
f"(lat_tcm_large={lat_tcm_large:.1f}, lat_hbm_large={lat_hbm_large:.1f})"
|
||||||
|
)
|
||||||
|
assert gap_small / gap_large < 0.10, (
|
||||||
|
f"buffer_kind sensitivity should grow with payload — "
|
||||||
|
f"small-payload gap should be < 10% of large-payload gap. "
|
||||||
|
f"got gap_small={gap_small:.1f} ns, gap_large={gap_large:.1f} ns, "
|
||||||
|
f"ratio={gap_small / gap_large:.3f}"
|
||||||
|
)
|
||||||
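The two payload-sensitivity tests above rest on a slot-IO charge of the form `overhead_ns + nbytes / bw_gbs`. The following is a minimal sketch of that cost model, not the simulator's implementation. The helper names (`slot_io_ns`, `gap_ns`) and the TCM/HBM parameter values are hypothetical and chosen only for illustration. The 256 GB/s HBM figure mirrors the `4096 / 256` arithmetic in the test comment.

```python
def slot_io_ns(nbytes: int, overhead_ns: float, bw_gbs: float) -> float:
    """Latency of one slot access: fixed overhead plus a bandwidth term.

    1 GB/s equals 1 byte/ns, so nbytes / bw_gbs is already in nanoseconds.
    """
    return overhead_ns + nbytes / bw_gbs


# Hypothetical per-buffer parameters (assumptions, not values from the repo).
HBM = {"overhead_ns": 5.0, "bw_gbs": 256.0}
TCM = {"overhead_ns": 2.0, "bw_gbs": 1024.0}


def gap_ns(nbytes: int) -> float:
    """Absolute tcm-vs-hbm difference for a single slot access."""
    return abs(slot_io_ns(nbytes, **HBM) - slot_io_ns(nbytes, **TCM))


gap_small = gap_ns(16)          # 8 f16 elements, overhead-bound
gap_large = gap_ns(32 * 1024)   # 16384 f16 elements, bandwidth-bound

# Doubling 4096 B -> 8192 B on HBM adds 4096 / 256 ns per access.
doubling_delta = slot_io_ns(8192, **HBM) - slot_io_ns(4096, **HBM)

print(f"gap_small={gap_small:.3f} ns, gap_large={gap_large:.3f} ns, "
      f"ratio={gap_small / gap_large:.3f}, doubling_delta={doubling_delta:.1f} ns")
```

Under this model the ratio stays below 0.10 only while the difference between the two constant overheads is small relative to the bandwidth term at the large payload. That is the property the second test asserts.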
@@ -0,0 +1,62 @@
|
|||||||
|
"""ADR-0009 D5: synchronized launch barrier.
|
||||||
|
|
||||||
|
M_CPU stamps KernelLaunchMsg with target_start_ns = env.now + max path
|
||||||
|
latency; PE_CPU yields until that time before recording pe_exec_start.
|
||||||
|
Every PE in a single launch MUST begin kernel execution at the same
|
||||||
|
env.now regardless of its dispatch path length.
|
||||||
|
|
||||||
|
We verify this indirectly: for a no-op kernel, pe_exec_ns = env.now -
|
||||||
|
pe_exec_start. If every PE's pe_exec_start is identical and every PE
|
||||||
|
runs the same no-op body, every pe_exec_ns value must be identical.
|
||||||
|
Without D5, pe_exec_start varies by dispatch-path length and so does
|
||||||
|
pe_exec_ns.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
|
||||||
|
def test_kernel_launch_sync_all_pes_have_equal_exec_time():
|
||||||
|
"""No-op kernel: every PE's pe_exec_ns must be identical under D5."""
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
with RuntimeContext(engine=engine, target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="sync_test", spec=spec) as ctx:
|
||||||
|
dp = DPPolicy(cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=16, num_pes=8)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pass # no-op
|
||||||
|
|
||||||
|
ctx.ahbm.set_device(0)
|
||||||
|
t = ctx.zeros((16, 8 * 64), dtype="f16", dp=dp, name="probe")
|
||||||
|
t.copy_(ctx.from_numpy(np.zeros((16, 8 * 64), dtype=np.float16)))
|
||||||
|
|
||||||
|
pending = ctx.launch("sync_probe", kernel, t, 64, _defer_wait=True)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
ctx.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
pe_exec_vals = []
|
||||||
|
for h, _sip, _meta in pending:
|
||||||
|
_, trace = engine.get_completion(h)
|
||||||
|
if trace and trace.get("pe_exec_ns") is not None:
|
||||||
|
pe_exec_vals.append(float(trace["pe_exec_ns"]))
|
||||||
|
|
||||||
|
assert pe_exec_vals, "expected completion traces with pe_exec_ns"
|
||||||
|
spread = max(pe_exec_vals) - min(pe_exec_vals)
|
||||||
|
assert spread < 1e-6, (
|
||||||
|
f"ADR-0009 D5 violated: pe_exec_ns spread across PEs = "
|
||||||
|
f"{spread:.6f} ns (expected 0). Values: {pe_exec_vals}"
|
||||||
|
)
|
||||||
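The D5 barrier described in the module docstring above can be illustrated without the simulator. This sketch assumes each PE simply waits until `target_start_ns`. The function `synchronized_starts` and the path latencies are hypothetical and do not come from the KernBench code base.

```python
from __future__ import annotations


def synchronized_starts(
    now_ns: float, path_latency_ns: dict[str, float]
) -> dict[str, float]:
    """Return the kernel start time of every PE under a launch barrier.

    The launcher stamps target_start_ns = now + max dispatch latency.
    Each PE receives the launch after its own path latency and then
    waits until target_start_ns before starting the kernel.
    """
    target_start_ns = now_ns + max(path_latency_ns.values())
    starts = {}
    for pe, latency in path_latency_ns.items():
        arrival_ns = now_ns + latency
        # A PE never starts before the barrier; the slowest path arrives
        # exactly at the barrier, so nobody starts after it either.
        starts[pe] = max(arrival_ns, target_start_ns)
    return starts


# Hypothetical dispatch-path latencies for three PEs.
paths = {"pe0": 12.0, "pe1": 30.5, "pe2": 47.25}
starts = synchronized_starts(100.0, paths)
print(starts)
```

Because every start time collapses to the same value, a no-op kernel reports the same `pe_exec_ns` on every PE, which is what the spread assertion checks.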
@@ -38,7 +38,7 @@ def _engine():
 def _hbm_pa(sip=0, cube=0, pe_id=0):
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
@@ -53,7 +53,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
@@ -0,0 +1,741 @@
|
|||||||
|
"""Diagnostic for the inter-cube RAW > IPCQ asymmetry on h3/h4 plots.
|
||||||
|
|
||||||
|
Single-shot run at h3 (sip0.cube0.pe0 -> sip0.cube1.pe0), nbytes=4096.
|
||||||
|
|
||||||
|
Captures per-PE pe_exec_ns and the actual path / drain / per-node overhead
|
||||||
|
breakdown for the RAW sub-txn (PE_DMA -> remote HBM_CTRL) vs the IPCQ
|
||||||
|
outbound sub-txn (PE_DMA -> peer PE_DMA), so we can localize the gap to
|
||||||
|
one of:
|
||||||
|
(a) drain at HBM-BW (RAW) vs fabric-BW (IPCQ)
|
||||||
|
(b) path-length / per-node overhead asymmetry
|
||||||
|
(c) RAW SRC paying tl.load (local HBM read) on top of remote tl.store
|
||||||
|
while IPCQ DST only pays inbound traversal+drain.
|
||||||
|
|
||||||
|
Phase 1 / test-only. No production code is modified.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
|
||||||
|
from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
|
||||||
|
from kernbench.policy.placement.dp import DPPolicy
|
||||||
|
from kernbench.runtime_api.context import RuntimeContext
|
||||||
|
from kernbench.runtime_api.types import DeviceSelector
|
||||||
|
from kernbench.sim_engine.engine import GraphEngine
|
||||||
|
from kernbench.topology.builder import resolve_topology
|
||||||
|
|
||||||
|
TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
|
||||||
|
|
||||||
|
import os
|
||||||
|
|
||||||
|
# Allow the test to be re-run for h4 (inter-cube vertical) at multiple sizes
|
||||||
|
# to investigate why IPCQ slope flattens past 8192 B (path may differ).
|
||||||
|
NBYTES = int(os.environ.get("DIAG_NBYTES", "4096"))
|
||||||
|
ELEM_BYTES = 2
|
||||||
|
N_ELEM = NBYTES // ELEM_BYTES
|
||||||
|
N_CUBES = 16
|
||||||
|
N_PES = 8
|
||||||
|
HOP = os.environ.get("DIAG_HOP", "h3")
|
||||||
|
if HOP == "h4":
|
||||||
|
SRC = (0, 0, 0)
|
||||||
|
DST = (0, 4, 0) # h4 inter-cube vertical
|
||||||
|
else:
|
||||||
|
SRC = (0, 0, 0)
|
||||||
|
DST = (0, 1, 0) # h3 inter-cube horizontal
|
||||||
|
|
||||||
|
|
||||||
|
# ── Per-PE pe_exec_ns capture via monkey-patch ───────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _install_barrier_capture():
|
||||||
|
"""Wrap PeCpuComponent._execute_kernel to log, for every PE that
|
||||||
|
enters: env.now at entry, target_start_ns the request carried,
|
||||||
|
whether the barrier yield fired (i.e. env.now < target_start_ns),
|
||||||
|
and how late entry was relative to target_start_ns (delta_late_ns).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
request = txn.request
|
||||||
|
target_start = getattr(request, "target_start_ns", None)
|
||||||
|
entry_now = float(env.now)
|
||||||
|
log_entry = {
|
||||||
|
"node_id": self.node.id,
|
||||||
|
"entry_now": entry_now,
|
||||||
|
"target_start_ns": (
|
||||||
|
float(target_start) if target_start is not None else None
|
||||||
|
),
|
||||||
|
"barrier_skipped": (
|
||||||
|
target_start is None
|
||||||
|
or float(target_start) <= entry_now
|
||||||
|
),
|
||||||
|
"delta_late_ns": (
|
||||||
|
None if target_start is None
|
||||||
|
else max(0.0, entry_now - float(target_start))
|
||||||
|
),
|
||||||
|
}
|
||||||
|
log.append(log_entry)
|
||||||
|
yield from original(self, env, txn)
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_per_pe_capture():
|
||||||
|
"""Wrap PeCpuComponent._execute_kernel so we record (node_id ->
|
||||||
|
pe_exec_ns) for every PE that executes a kernel during the run.
|
||||||
|
|
||||||
|
Returns (capture_dict, restore_callable).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_cpu as pe_cpu_mod
|
||||||
|
|
||||||
|
captured: dict[str, float] = {}
|
||||||
|
original = pe_cpu_mod.PeCpuComponent._execute_kernel
|
||||||
|
|
||||||
|
def patched(self, env, txn):
|
||||||
|
gen = original(self, env, txn)
|
||||||
|
try:
|
||||||
|
value = yield from gen
|
||||||
|
finally:
|
||||||
|
v = txn.result_data.get("pe_exec_ns")
|
||||||
|
if v is not None:
|
||||||
|
captured[self.node.id] = float(v)
|
||||||
|
return value
|
||||||
|
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_cpu_mod.PeCpuComponent._execute_kernel = original
|
||||||
|
|
||||||
|
return captured, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_recv_capture(target_node_id: str):
|
||||||
|
"""Wrap PeIpcqComponent._handle_recv to log entry/exit times and the
|
||||||
|
peer_head_cache/my_tail values seen at the start.
|
||||||
|
|
||||||
|
This pins down whether recv ever blocked on a wait_event, or whether
|
||||||
|
it consumed without waiting (i.e. peer_head_cache > my_tail at entry).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_ipcq_mod.PeIpcqComponent._handle_recv
|
||||||
|
|
||||||
|
def patched(self, env, req, cmd):
|
||||||
|
if self.node.id != target_node_id:
|
||||||
|
yield from original(self, env, req, cmd)
|
||||||
|
return
|
||||||
|
# Snapshot state before dispatch
|
||||||
|
d = cmd.direction
|
||||||
|
qp = self._queue_pairs.get(d, {})
|
||||||
|
log.append({
|
||||||
|
"phase": "enter",
|
||||||
|
"t": float(env.now),
|
||||||
|
"direction": d,
|
||||||
|
"peer_head_cache": qp.get("peer_head_cache"),
|
||||||
|
"my_tail": qp.get("my_tail"),
|
||||||
|
})
|
||||||
|
yield from original(self, env, req, cmd)
|
||||||
|
qp = self._queue_pairs.get(d, {})
|
||||||
|
log.append({
|
||||||
|
"phase": "exit",
|
||||||
|
"t": float(env.now),
|
||||||
|
"direction": d,
|
||||||
|
"peer_head_cache": qp.get("peer_head_cache"),
|
||||||
|
"my_tail": qp.get("my_tail"),
|
||||||
|
})
|
||||||
|
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_recv = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_recv = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _install_meta_arrival_capture(target_node_id: str):
|
||||||
|
"""Log every IpcqMetaArrival that lands on ``target_node_id`` PE_IPCQ.
|
||||||
|
|
||||||
|
Records (env_now, sender_seq, dst_addr, matched_direction,
|
||||||
|
peer_head_cache_before, my_tail_before).
|
||||||
|
"""
|
||||||
|
import kernbench.components.builtin.pe_ipcq as pe_ipcq_mod
|
||||||
|
|
||||||
|
log: list[dict] = []
|
||||||
|
original = pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival
|
||||||
|
|
||||||
|
def patched(self, msg):
|
||||||
|
if self.node.id == target_node_id:
|
||||||
|
token = msg.token
|
||||||
|
now = float(self._env.now) if hasattr(self, "_env") else 0.0
|
||||||
|
# _env may not be stored on the component; in that case fall back to
|
||||||
|
# introspection via self._inbox._env (SimPy stores reference).
|
||||||
|
try:
|
||||||
|
now = float(self._inbox._env.now)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
entry = {
|
||||||
|
"t": now,
|
||||||
|
"sender_seq": getattr(token, "sender_seq", None),
|
||||||
|
"dst_addr": getattr(token, "dst_addr", None),
|
||||||
|
"src_sip": getattr(token, "src_sip", None),
|
||||||
|
"src_cube": getattr(token, "src_cube", None),
|
||||||
|
"src_pe": getattr(token, "src_pe", None),
|
||||||
|
"src_direction": getattr(token, "src_direction", None),
|
||||||
|
"nbytes": getattr(token, "nbytes", None),
|
||||||
|
"matched_direction": None,
|
||||||
|
"peer_head_cache_before": {},
|
||||||
|
"my_tail_before": {},
|
||||||
|
}
|
||||||
|
for d, qp in self._queue_pairs.items():
|
||||||
|
entry["peer_head_cache_before"][d] = qp["peer_head_cache"]
|
||||||
|
entry["my_tail_before"][d] = qp["my_tail"]
|
||||||
|
base = qp["my_rx_base_pa"]
|
||||||
|
size = qp["n_slots"] * qp["slot_size"]
|
||||||
|
if entry["dst_addr"] is not None and base <= entry["dst_addr"] < base + size:
|
||||||
|
entry["matched_direction"] = d
|
||||||
|
log.append(entry)
|
||||||
|
return original(self, msg)
|
||||||
|
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = patched
|
||||||
|
|
||||||
|
def restore():
|
||||||
|
pe_ipcq_mod.PeIpcqComponent._handle_meta_arrival = original
|
||||||
|
|
||||||
|
return log, restore
|
||||||
|
|
||||||
|
|
||||||
|
def _snapshot_qp_state(engine, target_node_id: str) -> dict:
|
||||||
|
"""Snapshot every direction's qp state on the target PE_IPCQ now.
|
||||||
|
|
||||||
|
Captures peer_head_cache, my_tail, my_rx_base_pa, n_slots, slot_size
|
||||||
|
for each installed direction.
|
||||||
|
"""
|
||||||
|
comp = engine._components.get(target_node_id)
|
||||||
|
if comp is None:
|
||||||
|
return {}
|
||||||
|
return {
|
||||||
|
d: {
|
||||||
|
"peer_head_cache": qp["peer_head_cache"],
|
||||||
|
"my_tail": qp["my_tail"],
|
||||||
|
"my_rx_base_pa": qp["my_rx_base_pa"],
|
||||||
|
"n_slots": qp["n_slots"],
|
||||||
|
"slot_size": qp["slot_size"],
|
||||||
|
"rx_range": (
|
||||||
|
qp["my_rx_base_pa"],
|
||||||
|
qp["my_rx_base_pa"] + qp["n_slots"] * qp["slot_size"],
|
||||||
|
),
|
||||||
|
}
|
||||||
|
for d, qp in comp.queue_pairs.items()
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# ── Path / drain breakdown using engine ctx ──────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _path_breakdown(ctx, path: list[str], nbytes: int) -> dict:
|
||||||
|
edge_total_ns = 0.0
|
||||||
|
edge_details = []
|
||||||
|
min_bw = float("inf")
|
||||||
|
for i in range(len(path) - 1):
|
||||||
|
edge = ctx.edge_map.get((path[i], path[i + 1]))
|
||||||
|
if edge is None:
|
||||||
|
edge_details.append((path[i], path[i + 1], None, None, None))
|
||||||
|
continue
|
||||||
|
prop_ns = edge.distance_mm * ctx.ns_per_mm
|
||||||
|
edge_total_ns += prop_ns
|
||||||
|
bw = getattr(edge, "bw_gbs", None) or 0.0
|
||||||
|
if 0 < bw < min_bw:
|
||||||
|
min_bw = bw
|
||||||
|
edge_details.append(
|
||||||
|
(path[i], path[i + 1], edge.distance_mm, prop_ns, bw),
|
||||||
|
)
|
||||||
|
|
||||||
|
overhead_total_ns = 0.0
|
||||||
|
overhead_details = []
|
||||||
|
for nid in path:
|
||||||
|
oh = float(ctx.node_overhead_ns.get(nid, 0.0))
|
||||||
|
overhead_total_ns += oh
|
||||||
|
overhead_details.append((nid, oh))
|
||||||
|
|
||||||
|
drain_ns = ctx.compute_drain_ns(path, nbytes)
|
||||||
|
bottleneck_bw = None if min_bw == float("inf") else min_bw
|
||||||
|
|
||||||
|
return {
|
||||||
|
"path": path,
|
||||||
|
"edges": edge_details,
|
||||||
|
"edge_total_ns": edge_total_ns,
|
||||||
|
"overheads": overhead_details,
|
||||||
|
"overhead_total_ns": overhead_total_ns,
|
||||||
|
"drain_ns": drain_ns,
|
||||||
|
"bottleneck_bw_gbs": bottleneck_bw,
|
||||||
|
"expected_total_ns": edge_total_ns + overhead_total_ns + drain_ns,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _print_breakdown(label: str, br: dict) -> None:
|
||||||
|
print(f"\n {label}")
|
||||||
|
print(f" path ({len(br['path'])} nodes):")
|
||||||
|
for nid in br["path"]:
|
||||||
|
print(f" - {nid}")
|
||||||
|
print(f" edges (prop. delay):")
|
||||||
|
for src, dst, dist_mm, prop_ns, bw in br["edges"]:
|
||||||
|
if dist_mm is None:
|
||||||
|
print(f" ! {src} -> {dst} EDGE NOT FOUND IN edge_map")
|
||||||
|
continue
|
||||||
|
print(
|
||||||
|
f" {src} -> {dst} "
|
||||||
|
f"dist={dist_mm:.3f}mm prop={prop_ns:.2f}ns "
|
||||||
|
f"bw={bw or 0:.2f}GB/s"
|
||||||
|
)
|
||||||
|
print(f" per-node overhead_ns:")
|
||||||
|
for nid, oh in br["overheads"]:
|
||||||
|
if oh > 0:
|
||||||
|
print(f" {nid:<60s} overhead_ns={oh:.2f}")
|
||||||
|
print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
|
||||||
|
print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
|
||||||
|
print(f" bottleneck_bw_gbs = {br['bottleneck_bw_gbs']}")
|
||||||
|
print(f" drain_ns (nbytes={NBYTES}) = {br['drain_ns']:.2f}")
|
||||||
|
print(f" expected_total_ns = {br['expected_total_ns']:.2f}")
|
||||||
|
|
||||||
|
|
||||||
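The total that `_path_breakdown` reports is the sum of three terms: per-edge propagation delay, per-node overhead, and a drain term. The sketch below reproduces that arithmetic on a toy path. It assumes the drain equals `nbytes` divided by the bottleneck edge bandwidth. That is an assumption, since `ctx.compute_drain_ns` is not shown here. The `Edge` class, node names, and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    distance_mm: float
    bw_gbs: float  # 1 GB/s == 1 byte/ns


def expected_total_ns(path, edges, node_overhead_ns, ns_per_mm, nbytes):
    """Propagation + per-node overhead + drain at the bottleneck edge."""
    prop_ns = 0.0
    bandwidths = []
    for src, dst in zip(path, path[1:]):
        edge = edges[(src, dst)]
        prop_ns += edge.distance_mm * ns_per_mm
        if edge.bw_gbs > 0:
            bandwidths.append(edge.bw_gbs)
    overhead_ns = sum(node_overhead_ns.get(nid, 0.0) for nid in path)
    # Assumed drain model: payload size over the slowest edge on the path.
    drain_ns = nbytes / min(bandwidths) if bandwidths else 0.0
    return prop_ns + overhead_ns + drain_ns


# Toy three-node path with hypothetical distances, bandwidths, overheads.
path = ["a", "b", "c"]
edges = {("a", "b"): Edge(2.0, 128.0), ("b", "c"): Edge(3.0, 64.0)}
overheads = {"a": 1.0, "b": 2.0}  # "c" has no overhead entry
total_ns = expected_total_ns(path, edges, overheads, ns_per_mm=0.5, nbytes=4096)
print(f"expected_total_ns = {total_ns:.2f}")
```

Comparing such a hand-computed total against the measured `pe_exec_ns` is one way to tell whether a gap comes from drain, path length, or something outside the path model.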
|
# ── RAW path scenario ────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _dump_src_op_records(engine, src_sip, src_cube, src_pe, label) -> None:
|
||||||
|
"""Print op_logger records for ops on the SRC PE.
|
||||||
|
|
||||||
|
The op log captures t_start/t_end for memory/math/gemm/copy ops on
|
||||||
|
every component, so we can see how long tl.load vs tl.store vs
|
||||||
|
tl.send actually took at the engine level.
|
||||||
|
"""
|
||||||
|
op_logger = getattr(engine, "_op_logger", None)
|
||||||
|
if op_logger is None:
|
||||||
|
print(f" ({label}) op_logger not available")
|
||||||
|
return
|
||||||
|
src_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}."
|
||||||
|
recs = [r for r in op_logger.records if r.component_id.startswith(src_prefix)]
|
||||||
|
print(f" ({label}) op_logger records on SRC PE ({src_prefix}*):")
|
||||||
|
for r in recs[:40]:
|
||||||
|
dur = r.t_end - r.t_start
|
||||||
|
comp_short = r.component_id.replace(src_prefix, "")
|
||||||
|
params_short = ""
|
||||||
|
if "nbytes" in r.params:
|
||||||
|
params_short = f" nbytes={r.params['nbytes']}"
|
||||||
|
if "src_addr" in r.params:
|
||||||
|
params_short += f" src_addr={r.params['src_addr']}"
|
||||||
|
if "dst_addr" in r.params:
|
||||||
|
params_short += f" dst_addr={r.params['dst_addr']}"
|
||||||
|
print(
|
||||||
|
f" t=[{r.t_start:7.2f}..{r.t_end:7.2f}] dur={dur:6.2f}ns "
|
||||||
|
f"{comp_short:<25s} {r.op_kind:<8s} {r.op_name:<12s}{params_short}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _run_raw():
|
||||||
|
captured, restore = _install_per_pe_capture()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
src_sip, src_cube, src_pe = SRC
|
||||||
|
dst_sip, dst_cube, dst_pe = DST
|
||||||
|
assert src_sip == dst_sip
|
||||||
|
|
||||||
|
src_off = (src_cube * N_PES + src_pe) * N_ELEM * ELEM_BYTES
|
||||||
|
dst_off = (dst_cube * N_PES + dst_pe) * N_ELEM * ELEM_BYTES
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="diag_raw",
|
||||||
|
spec=spec,
|
||||||
|
) as rt:
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
rt.ahbm.set_device(src_sip)
|
||||||
|
t = rt.zeros(
|
||||||
|
(N_CUBES, N_PES * N_ELEM), dtype="f16",
|
||||||
|
dp=dp, name="raw_tensor",
|
||||||
|
)
|
||||||
|
t.copy_(rt.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(
|
||||||
|
t_ptr + src_off, shape=(n_elem,), dtype="f16",
|
||||||
|
)
|
||||||
|
tl.store(t_ptr + dst_off, data)
|
||||||
|
|
||||||
|
pending = rt.launch(
|
||||||
|
"diag_raw_kernel", kernel, t, N_ELEM, _defer_wait=True,
|
||||||
|
)
|
||||||
|
for h, _sip, meta in pending:
|
||||||
|
rt.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
# Compute the RAW sub-txn path: src PE_DMA -> dst HBM_CTRL
|
||||||
|
from kernbench.policy.address.phyaddr import PhysAddr
|
||||||
|
ctx = next(iter(engine._components.values())).ctx
|
||||||
|
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
|
||||||
|
# Resolve dst PA to HBM controller node
|
||||||
|
# The raw store kernel issues DmaWriteCmd on dst VA; in the engine
|
||||||
|
# this is translated via PE_MMU. For diagnostic we approximate
|
||||||
|
# the destination as the dst cube's HBM controller for slice
|
||||||
|
# belonging to dst_pe.
|
||||||
|
# Use the resolver on a constructed PA matching the same memory
|
||||||
|
# slice the kernel writes to.
|
||||||
|
# The tensor is "row_wise" sharded across cubes, so each cube
|
||||||
|
# owns row[cube_id, :], with each PE owning a column slice.
|
||||||
|
# The actual dst PA depends on the AHBM allocator; we read it
|
||||||
|
# via the tensor's shard map.
|
||||||
|
shard_map = getattr(t, "_shard_map", None) or getattr(t, "shard_map", None)
|
||||||
|
# Fallback: query the resolver directly by constructing a PA in
|
||||||
|
# the dst cube's HBM region. If shard_map is unavailable, still
|
||||||
|
# show the breakdown for src-PE-DMA -> first reachable HBM_CTRL
|
||||||
|
# in dst cube.
|
||||||
|
dst_hbm_id = f"sip{dst_sip}.cube{dst_cube}.hbm_ctrl"
|
||||||
|
if dst_hbm_id not in engine._components:
|
||||||
|
# try alternate naming
|
||||||
|
for nid in engine._components.keys():
|
||||||
|
if (
|
||||||
|
nid.startswith(f"sip{dst_sip}.cube{dst_cube}.")
|
||||||
|
and "hbm" in nid
|
||||||
|
):
|
||||||
|
dst_hbm_id = nid
|
||||||
|
break
|
||||||
|
|
||||||
|
# find_path() prepends ".pe_dma" to src_pe automatically
|
||||||
|
try:
|
||||||
|
raw_path = ctx.router.find_path(src_pe_prefix, dst_hbm_id)
|
||||||
|
except Exception as e:
|
||||||
|
raw_path = []
|
||||||
|
print(f" WARN: find_path raw failed: {e}")
|
||||||
|
if not raw_path:
|
||||||
|
# Try other HBM-related node names in dst cube
|
||||||
|
for nid in engine._components.keys():
|
||||||
|
if not nid.startswith(f"sip{dst_sip}.cube{dst_cube}."):
|
||||||
|
continue
|
||||||
|
if "hbm" not in nid:
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
p = ctx.router.find_path(src_pe_prefix, nid)
|
||||||
|
except Exception:
|
||||||
|
p = []
|
||||||
|
if p:
|
||||||
|
raw_path = p
|
||||||
|
print(f" (fallback raw dst node: {nid})")
|
||||||
|
break
|
||||||
|
|
||||||
|
return captured, ctx, raw_path, engine
|
||||||
|
finally:
|
||||||
|
restore()
|
||||||
|
|
||||||
|
|
||||||
|
# ── IPCQ path scenario ───────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
def _run_ipcq():
|
||||||
|
captured, restore = _install_per_pe_capture()
|
||||||
|
dst_pe_ipcq_id = (
|
||||||
|
f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_ipcq"
|
||||||
|
)
|
||||||
|
arrival_log, restore_arrival = _install_meta_arrival_capture(
|
||||||
|
dst_pe_ipcq_id,
|
||||||
|
)
|
||||||
|
recv_log, restore_recv = _install_recv_capture(dst_pe_ipcq_id)
|
||||||
|
barrier_log, restore_barrier = _install_barrier_capture()
|
||||||
|
try:
|
||||||
|
topo = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
|
engine = GraphEngine(topo.topology_obj, enable_data=True)
|
||||||
|
spec = topo.topology_obj.spec
|
||||||
|
|
||||||
|
src_sip, src_cube, src_pe = SRC
|
||||||
|
dst_sip, dst_cube, dst_pe = DST
|
||||||
|
|
||||||
|
cfg = load_ccl_config()
|
||||||
|
merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
|
||||||
|
merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
|
||||||
|
|
||||||
|
with RuntimeContext(
|
||||||
|
engine=engine,
|
||||||
|
target_device=DeviceSelector("all"),
|
||||||
|
correlation_id="diag_ipcq",
|
||||||
|
spec=spec,
|
||||||
|
) as rt:
|
||||||
|
configure_sfr_intercube_multisip(engine, spec, merged)
|
||||||
|
dp = DPPolicy(
|
||||||
|
cube="row_wise", pe="column_wise",
|
||||||
|
num_cubes=N_CUBES, num_pes=N_PES,
|
||||||
|
)
|
||||||
|
|
||||||
|
def kernel(t_ptr, n_elem, tl):
|
||||||
|
pe_id = tl.program_id(axis=0)
|
||||||
|
cube_id = tl.program_id(axis=1)
|
||||||
|
if cube_id == src_cube and pe_id == src_pe:
|
||||||
|
data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
|
||||||
|
tl.send(dir=("E" if HOP == "h3" else "S"), src=data)
|
||||||
|
elif cube_id == dst_cube and pe_id == dst_pe:
|
||||||
|
tl.recv(
|
||||||
|
dir=("W" if HOP == "h3" else "N"),
|
||||||
|
shape=(n_elem,), dtype="f16",
|
||||||
|
)
|
||||||
|
|
||||||
|
tensors = []
|
||||||
|
for s in sorted({src_sip, dst_sip}):
|
||||||
|
rt.ahbm.set_device(s)
|
||||||
|
t = rt.zeros(
|
||||||
|
(N_CUBES, N_PES * N_ELEM), dtype="f16",
|
||||||
|
dp=dp, name=f"sip{s}",
|
||||||
|
)
|
||||||
|
t.copy_(rt.from_numpy(
|
||||||
|
np.full((N_CUBES, N_PES * N_ELEM), 1.0, dtype=np.float16),
|
||||||
|
))
|
||||||
|
tensors.append(t)
|
||||||
|
|
||||||
|
all_pending = []
|
||||||
|
for tt in tensors:
|
||||||
|
pending = rt.launch(
|
||||||
|
"diag_ipcq_kernel", kernel, tt, N_ELEM, _defer_wait=True,
|
||||||
|
)
|
||||||
|
all_pending.extend(pending)
|
||||||
|
for h, _sip, meta in all_pending:
|
||||||
|
rt.wait(h, _meta=meta)
|
||||||
|
|
||||||
|
ctx = next(iter(engine._components.values())).ctx
|
||||||
|
src_pe_prefix = f"sip{src_sip}.cube{src_cube}.pe{src_pe}"
|
||||||
|
dst_pe_dma = f"sip{dst_sip}.cube{dst_cube}.pe{dst_pe}.pe_dma"
|
||||||
|
try:
|
||||||
|
ipcq_path = ctx.router.find_path(src_pe_prefix, dst_pe_dma)
|
||||||
|
except Exception as e:
|
||||||
|
ipcq_path = []
|
||||||
|
print(f" WARN: find_path ipcq failed: {e}")
|
||||||
|
# Snapshot DST PE_IPCQ qp state at end-of-run so we can see what
|
||||||
|
# peer_head_cache/my_tail looked like (and at which directions).
|
||||||
|
qp_state = _snapshot_qp_state(engine, dst_pe_ipcq_id)
|
||||||
|
return (captured, ctx, ipcq_path, engine,
|
||||||
|
arrival_log, qp_state, recv_log, barrier_log)
|
||||||
|
finally:
|
||||||
|
restore_barrier()
|
||||||
|
restore_recv()
|
||||||
|
restore_arrival()
|
||||||
|
restore()
|
||||||
|
|
||||||
|
|
||||||
|
# ── Test entry ───────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.diagnostic
|
||||||
|
def test_pe_to_pe_diagnostic_h3():
|
||||||
|
print("\n" + "=" * 78)
|
||||||
|
print(f" Diagnostic: {HOP} inter-cube {'vertical' if HOP == 'h4' else 'horizontal'}, nbytes={NBYTES}")
|
||||||
|
print(f" src={SRC} dst={DST}")
|
||||||
|
print("=" * 78)
|
||||||
|
|
||||||
|
# ── RAW scenario
|
||||||
|
print("\n[RAW] tl.load + tl.store (sender pays both legs)")
|
||||||
|
raw_per_pe, raw_ctx, raw_path, raw_engine = _run_raw()
|
||||||
|
print(f" per-PE pe_exec_ns ({len(raw_per_pe)} entries):")
|
||||||
|
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
|
||||||
|
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
|
||||||
|
for nid in (src_id, dst_id):
|
||||||
|
if nid in raw_per_pe:
|
||||||
|
print(f" {nid:<60s} {raw_per_pe[nid]:.2f} ns <-- key PE")
|
||||||
|
nonzero = {k: v for k, v in raw_per_pe.items() if v > 0.5}
|
||||||
|
if nonzero:
|
||||||
|
print(f" other PEs with pe_exec_ns > 0.5 ns:")
|
||||||
|
for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
|
||||||
|
if nid not in (src_id, dst_id):
|
||||||
|
print(f" {nid:<60s} {v:.2f} ns")
|
||||||
|
print(f" max(pe_exec_ns) = "
|
||||||
|
f"{max(raw_per_pe.values()) if raw_per_pe else 0:.2f} ns")
|
||||||
|
|
||||||
|
if raw_path:
|
||||||
|
br = _path_breakdown(raw_ctx, raw_path, NBYTES)
|
||||||
|
_print_breakdown("RAW sub-txn path (src.pe_dma -> dst.hbm_ctrl)", br)
|
||||||
|
_dump_src_op_records(raw_engine, *SRC, "RAW")
|
||||||
|
|
||||||
|
# ── IPCQ scenario
|
||||||
|
print("\n[IPCQ] tl.send + tl.recv (recv pays inbound traversal+drain)")
|
||||||
|
(ipcq_per_pe, ipcq_ctx, ipcq_path, ipcq_engine,
|
||||||
|
arrival_log, qp_state, recv_log, barrier_log) = _run_ipcq()
|
||||||
|
print(f"\n [BARRIER LOG] {len(barrier_log)} _execute_kernel entries:")
|
||||||
|
src_id = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}.pe_cpu"
|
||||||
|
dst_id = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}.pe_cpu"
|
||||||
|
n_skipped = 0
|
||||||
|
src_entry = None
|
||||||
|
dst_entry = None
|
||||||
|
for e in barrier_log:
|
||||||
|
if e["barrier_skipped"]:
|
||||||
|
n_skipped += 1
|
||||||
|
if e["node_id"] == src_id:
|
||||||
|
src_entry = e
|
||||||
|
if e["node_id"] == dst_id:
|
||||||
|
dst_entry = e
|
||||||
|
print(f" PEs entering _execute_kernel: {len(barrier_log)}")
|
||||||
|
print(f" PEs that SKIPPED barrier (no target_start, or env.now >= target_start): {n_skipped}")
|
||||||
|
if src_entry:
|
||||||
|
print(
|
||||||
|
f" SRC pe ({src_id}): entry_now={src_entry['entry_now']:.2f} "
|
||||||
|
f"target_start={src_entry['target_start_ns']:.2f} "
|
||||||
|
f"skipped={src_entry['barrier_skipped']} "
|
||||||
|
f"late_ns={src_entry['delta_late_ns']:.2f}"
|
||||||
|
)
|
||||||
|
if dst_entry:
|
||||||
|
print(
|
||||||
|
f" DST pe ({dst_id}): entry_now={dst_entry['entry_now']:.2f} "
|
||||||
|
f"target_start={dst_entry['target_start_ns']:.2f} "
|
||||||
|
f"skipped={dst_entry['barrier_skipped']} "
|
||||||
|
f"late_ns={dst_entry['delta_late_ns']:.2f}"
|
||||||
|
)
|
||||||
|
# Top 5 latest arrivals
|
||||||
|
sorted_late = sorted(
|
||||||
|
[e for e in barrier_log if e["delta_late_ns"] is not None],
|
||||||
|
key=lambda e: -e["delta_late_ns"],
|
||||||
|
)[:5]
|
||||||
|
print(f" Top 5 latest PE arrivals (positive = barrier missed):")
|
||||||
|
for e in sorted_late:
|
||||||
|
if e["delta_late_ns"] > 0:
|
||||||
|
print(
|
||||||
|
f" {e['node_id']}: late by {e['delta_late_ns']:.2f} ns "
|
||||||
|
f"(entry={e['entry_now']:.2f}, target={e['target_start_ns']:.2f})"
|
||||||
|
)
|
||||||
+    print(f"\n [RECV LOG on dst pe_ipcq] {len(recv_log)} entries:")
+    for e in recv_log:
+        print(
+            f" {e['phase']:5s} t={e['t']:8.2f} ns "
+            f"dir={e['direction']} "
+            f"peer_head_cache={e['peer_head_cache']} "
+            f"my_tail={e['my_tail']}"
+        )
+    print(f"\n [META-ARRIVAL LOG on dst pe_ipcq] {len(arrival_log)} arrivals:")
+    for i, e in enumerate(arrival_log):
+        print(
+            f" #{i:2d} t={e['t']:8.2f} ns "
+            f"src=(sip{e['src_sip']},cube{e['src_cube']},pe{e['src_pe']}) "
+            f"dir={e['src_direction']} "
+            f"sender_seq={e['sender_seq']} "
+            f"matched_dir={e['matched_direction']} "
+            f"nbytes={e['nbytes']}"
+        )
+        for d, ph in e["peer_head_cache_before"].items():
+            mt = e["my_tail_before"][d]
+            if ph != 0 or mt != 0 or d == e["matched_direction"]:
+                print(
+                    f" before: dir={d} peer_head_cache={ph} my_tail={mt}"
+                )
+    print(f"\n [QP STATE END-OF-RUN on dst pe_ipcq]:")
+    for d, st in qp_state.items():
+        print(
+            f" dir={d} peer_head_cache={st['peer_head_cache']} "
+            f"my_tail={st['my_tail']} rx_range=[{st['rx_range'][0]}..."
+            f"{st['rx_range'][1]}) n_slots={st['n_slots']} "
+            f"slot_size={st['slot_size']}"
+        )
+    print(f" per-PE pe_exec_ns ({len(ipcq_per_pe)} entries):")
+    for nid in (src_id, dst_id):
+        if nid in ipcq_per_pe:
+            print(f" {nid:<60s} {ipcq_per_pe[nid]:.2f} ns <-- key PE")
+    nonzero = {k: v for k, v in ipcq_per_pe.items() if v > 0.5}
+    if nonzero:
+        print(f" other PEs with pe_exec_ns > 0.5 ns:")
+        for nid, v in sorted(nonzero.items(), key=lambda kv: -kv[1])[:6]:
+            if nid not in (src_id, dst_id):
+                print(f" {nid:<60s} {v:.2f} ns")
+    print(f" max(pe_exec_ns) = "
+          f"{max(ipcq_per_pe.values()) if ipcq_per_pe else 0:.2f} ns")
+
+    if ipcq_path:
+        br = _path_breakdown(ipcq_ctx, ipcq_path, NBYTES)
+        _print_breakdown("IPCQ sub-txn path (src.pe_dma -> peer.pe_dma)", br)
+    _dump_src_op_records(ipcq_engine, *SRC, "IPCQ")
+    _dump_src_op_records(ipcq_engine, *DST, "IPCQ DST")
+
+    # ── Credit-return path analysis (where the missing IPCQ "ack" lives)
+    print("\n" + "-" * 78)
+    print("Credit-return path (current modeling)")
+    print("-" * 78)
+    src_pe_prefix = f"sip{SRC[0]}.cube{SRC[1]}.pe{SRC[2]}"
+    dst_pe_prefix = f"sip{DST[0]}.cube{DST[1]}.pe{DST[2]}"
+    # PE_IPCQ._credit_latency_ns calls
+    # ctx.router.find_path(self._pe_prefix, peer_pe_prefix)
+    # where the *destination* lacks the ".pe_dma" suffix. find_path()
+    # only auto-appends to the source, so this raises -> the except
+    # clause silently returns 0.0. Effectively credit latency = 0.
+    try:
+        ipcq_ctx.router.find_path(dst_pe_prefix, src_pe_prefix)
+        bug_caught = False
+    except Exception as e:
+        bug_caught = True
+        print(f" CONFIRMED BUG in _credit_latency_ns: dest lacks '.pe_dma' "
+              f"-> find_path raises -> caught exception -> returns 0.0")
+        print(f" Error: {e}")
+    # The intended credit path is recv -> sender (reverse data direction)
+    try:
+        credit_path = ipcq_ctx.router.find_path(
+            dst_pe_prefix, f"{src_pe_prefix}.pe_dma",
+        )
+    except Exception as e:
+        credit_path = []
+        print(f" WARN: corrected find_path credit failed: {e}")
+    if credit_path:
+        credit_size = 16 # PE_IPCQ default _credit_size_bytes
+        # Today's modeling: drain only, 16 bytes -> ~0.125 ns
+        cur = ipcq_ctx.compute_drain_ns(credit_path, credit_size)
+        # Proposed modeling: full path latency (edges + node overhead + drain)
+        proposed = ipcq_ctx.compute_path_latency_ns(credit_path, credit_size)
+        print(f" credit path nodes = {len(credit_path)} (recv -> sender)")
+        for nid in credit_path[:6]:
+            print(f" {nid}")
+        if len(credit_path) > 6:
+            print(f" ... {len(credit_path) - 6} more nodes")
+        br = _path_breakdown(ipcq_ctx, credit_path, credit_size)
+        print(f" edge_total_ns = {br['edge_total_ns']:.2f}")
+        print(f" overhead_total_ns = {br['overhead_total_ns']:.2f}")
+        print(f" drain_ns(16 bytes) = {br['drain_ns']:.2f}")
+        print(f" CURRENT _credit_latency_ns (drain only) = {cur:.3f} ns")
+        print(f" PROPOSED (compute_path_latency_ns) = {proposed:.2f} ns")
+        print(f" delta = {proposed - cur:+.2f} ns")
+
+    # ── Comparison summary
+    print("\n" + "-" * 78)
+    print("Summary")
+    print("-" * 78)
+    raw_max = max(raw_per_pe.values()) if raw_per_pe else 0.0
+    ipcq_max = max(ipcq_per_pe.values()) if ipcq_per_pe else 0.0
+    print(f" RAW max(pe_exec_ns) = {raw_max:.2f} ns")
+    print(f" IPCQ max(pe_exec_ns) (current) = {ipcq_max:.2f} ns")
+    print(f" delta (RAW - IPCQ current) = {raw_max - ipcq_max:+.2f} ns")
+    if credit_path:
+        ipcq_with_credit = ipcq_max + (proposed - cur)
+        print(
+            f" IPCQ projected w/ blocking credit + full path overhead "
+            f"= {ipcq_with_credit:.2f} ns"
+        )
+        print(
+            f" delta (RAW - IPCQ projected) = "
+            f"{raw_max - ipcq_with_credit:+.2f} ns "
+            f"(<= 0 means IPCQ >= RAW)"
+        )
+
+    # Observational only: the asserts below just confirm that both runs recorded data.
+    assert raw_per_pe, "no RAW pe_exec_ns recorded"
+    assert ipcq_per_pe, "no IPCQ pe_exec_ns recorded"
@@ -0,0 +1,347 @@
+"""PE-to-PE latency sweep across hop types and data sizes.
+
+Compares IPCQ send/recv vs raw-DMA (tl.load + tl.store) latency for four
+hop types:
+
+  H1  Intra-cube horizontal   pe0 → pe1
+  H2  Intra-cube vertical     pe0 → pe4
+  H3  Inter-cube horizontal   sip0.cube0.pe0 → sip0.cube1.pe0
+  H4  Inter-cube vertical     sip0.cube0.pe0 → sip0.cube4.pe0
+
+Sizes: 128..10240 bytes. Emits PNGs with both lines plus a CSV.
+"""
+from __future__ import annotations
+
+import csv
+from dataclasses import dataclass
+from pathlib import Path
+
+import numpy as np
+import pytest
+
+from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
+from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+PLOT_DIR = Path(__file__).parent / "pe2pe_latency_plots"
+
+SIZES = [128, 256, 384, 512, 768, 1024, 2048, 4096, 8192, 10240]
+
+N_CUBES = 16
+N_PES = 8
+ELEM_BYTES = 2 # f16
+
+
+@dataclass(frozen=True)
+class Hop:
+    id: str
+    label: str
+    src: tuple[int, int, int]
+    dst: tuple[int, int, int]
+    send_dir: str
+    recv_dir: str
+    supports_raw: bool
+
+
+HOPS = [
+    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
+        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
+    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
+        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
+    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
+        (0, 0, 0), (0, 1, 0), "E", "W", True),
+    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
+        (0, 0, 0), (0, 4, 0), "S", "N", True),
+]
+
+
+def _make_engine():
+    topo = resolve_topology(str(TOPOLOGY_PATH))
+    engine = GraphEngine(topo.topology_obj, enable_data=True)
+    return engine, topo.topology_obj.spec
+
+
+# ── IPCQ path ────────────────────────────────────────────────────────
+
+
+def _measure_ipcq(hop: Hop, nbytes: int) -> float:
+    engine, spec = _make_engine()
+
+    cfg = load_ccl_config()
+    merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
+    merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
+
+    n_elem = nbytes // ELEM_BYTES
+    src_sip, src_cube, src_pe = hop.src
+    dst_sip, dst_cube, dst_pe = hop.dst
+    send_dir, recv_dir = hop.send_dir, hop.recv_dir
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"ipcq_{hop.id}_{nbytes}",
+        spec=spec,
+    ) as ctx:
+        configure_sfr_intercube_multisip(engine, spec, merged)
+
+        dp = DPPolicy(
+            cube="row_wise", pe="column_wise",
+            num_cubes=N_CUBES, num_pes=N_PES,
+        )
+
+        def kernel(t_ptr, n_elem, tl):
+            pe_id = tl.program_id(axis=0)
+            cube_id = tl.program_id(axis=1)
+            if cube_id == src_cube and pe_id == src_pe:
+                data = tl.load(t_ptr, shape=(n_elem,), dtype="f16")
+                tl.send(dir=send_dir, src=data)
+            elif cube_id == dst_cube and pe_id == dst_pe:
+                tl.recv(dir=recv_dir, shape=(n_elem,), dtype="f16")
+
+        tensors = []
+        for s in sorted({src_sip, dst_sip}):
+            ctx.ahbm.set_device(s)
+            t = ctx.zeros(
+                (N_CUBES, N_PES * n_elem), dtype="f16",
+                dp=dp, name=f"sip{s}",
+            )
+            t.copy_(ctx.from_numpy(
+                np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
+            ))
+            tensors.append(t)
+
+        all_pending = []
+        for t in tensors:
+            pending = ctx.launch(
+                f"{hop.id}_ipcq", kernel, t, n_elem, _defer_wait=True,
+            )
+            all_pending.extend(pending)
+        for h, sip_id, meta in all_pending:
+            ctx.wait(h, _meta=meta)
+
+        # Per-PE kernel execution time (excludes launch dispatch and
+        # response aggregation). IPCQ: DST blocks on tl.recv until the
+        # send arrives, so max across SIPs = DST's transfer time.
+        pe_exec_vals = []
+        for h, _sip, _meta in all_pending:
+            _, trace = engine.get_completion(h)
+            if trace and trace.get("pe_exec_ns") is not None:
+                pe_exec_vals.append(float(trace["pe_exec_ns"]))
+
+    return max(pe_exec_vals) if pe_exec_vals else 0.0
+
+
+# ── Raw DMA path (intra-SIP only) ────────────────────────────────────
+
+
+def _measure_raw(hop: Hop, nbytes: int) -> float:
+    """tl.load from source slice + tl.store to destination slice. The VA
+    mapping spans the cube mesh within one SIP (MmuMapMsg broadcasts to all
+    cubes of the SIP), so the store goes through the fabric to the
+    destination PE's HBM. No IPCQ protocol involved.
+    """
+    if not hop.supports_raw:
+        raise RuntimeError(f"hop {hop.id} does not support raw path")
+
+    engine, spec = _make_engine()
+
+    n_elem = nbytes // ELEM_BYTES
+    src_sip, src_cube, src_pe = hop.src
+    dst_sip, dst_cube, dst_pe = hop.dst
+    assert src_sip == dst_sip
+
+    # Slice offsets in the (N_CUBES, N_PES * n_elem) tensor:
+    # row = cube, slice within row = pe * n_elem .. (pe+1)*n_elem
+    # Byte offsets from va_base:
+    src_off = (src_cube * N_PES + src_pe) * n_elem * ELEM_BYTES
+    dst_off = (dst_cube * N_PES + dst_pe) * n_elem * ELEM_BYTES
+
+    with RuntimeContext(
+        engine=engine,
+        target_device=DeviceSelector("all"),
+        correlation_id=f"raw_{hop.id}_{nbytes}",
+        spec=spec,
+    ) as ctx:
+        dp = DPPolicy(
+            cube="row_wise", pe="column_wise",
+            num_cubes=N_CUBES, num_pes=N_PES,
+        )
+        ctx.ahbm.set_device(src_sip)
+        t = ctx.zeros(
+            (N_CUBES, N_PES * n_elem), dtype="f16",
+            dp=dp, name="raw_tensor",
+        )
+        t.copy_(ctx.from_numpy(
+            np.full((N_CUBES, N_PES * n_elem), 1.0, dtype=np.float16),
+        ))
+
+        def kernel(t_ptr, n_elem, tl):
+            pe_id = tl.program_id(axis=0)
+            cube_id = tl.program_id(axis=1)
+            if cube_id == src_cube and pe_id == src_pe:
+                data = tl.load(
+                    t_ptr + src_off, shape=(n_elem,), dtype="f16",
+                )
+                tl.store(t_ptr + dst_off, data)
+
+        pending = ctx.launch(
+            f"{hop.id}_raw", kernel, t, n_elem, _defer_wait=True,
+        )
+        for h, sip_id, meta in pending:
+            ctx.wait(h, _meta=meta)
+
+        # Per-PE kernel execution time. Raw: only SRC does real work
+        # (tl.load + tl.store, store is blocking), so max across all PEs
+        # = SRC's transfer time. Idle PEs contribute only overhead_ns.
+        pe_exec_vals = []
+        for h, _sip, _meta in pending:
+            _, trace = engine.get_completion(h)
+            if trace and trace.get("pe_exec_ns") is not None:
+                pe_exec_vals.append(float(trace["pe_exec_ns"]))
+
+    return max(pe_exec_vals) if pe_exec_vals else 0.0
+
+
+# ── CSV + plotting ───────────────────────────────────────────────────
+
+
+def _write_csv(records, path: Path) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with open(path, "w", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(
+            f, fieldnames=["hop", "label", "size_bytes", "path", "total_ns"],
+        )
+        w.writeheader()
+        for r in records:
+            w.writerow(r)
+
+
+def _plot_per_hop(records, hop: Hop, path: Path) -> None:
+    import matplotlib.pyplot as plt
+
+    ipcq = sorted(
+        [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+        key=lambda r: r["size_bytes"],
+    )
+    raw = sorted(
+        [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
+        key=lambda r: r["size_bytes"],
+    )
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    if ipcq:
+        ax.plot(
+            [r["size_bytes"] for r in ipcq],
+            [r["total_ns"] for r in ipcq],
+            marker="o", label="IPCQ (send/recv)", color="tab:blue",
+        )
+    if raw:
+        ax.plot(
+            [r["size_bytes"] for r in raw],
+            [r["total_ns"] for r in raw],
+            marker="s", label="Raw DMA (load+store)", color="tab:orange",
+        )
+    ax.set_xlabel("Data size (bytes)")
+    ax.set_ylabel("Latency (ns)")
+    ax.set_title(hop.label)
+    ax.grid(True, alpha=0.3)
+    ax.legend()
+    fig.tight_layout()
+    fig.savefig(path, dpi=120)
+    plt.close(fig)
+
+
+def _plot_overview(records, path: Path) -> None:
+    import matplotlib.pyplot as plt
+
+    fig, axes = plt.subplots(2, 2, figsize=(13, 9))
+    axes = axes.flatten()
+    for i, hop in enumerate(HOPS):
+        ax = axes[i]
+        ipcq = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+            key=lambda r: r["size_bytes"],
+        )
+        raw = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "raw"],
+            key=lambda r: r["size_bytes"],
+        )
+        if ipcq:
+            ax.plot(
+                [r["size_bytes"] for r in ipcq],
+                [r["total_ns"] for r in ipcq],
+                marker="o", label="IPCQ", color="tab:blue",
+            )
+        if raw:
+            ax.plot(
+                [r["size_bytes"] for r in raw],
+                [r["total_ns"] for r in raw],
+                marker="s", label="Raw", color="tab:orange",
+            )
+        ax.set_title(hop.label, fontsize=10)
+        ax.set_xlabel("bytes")
+        ax.set_ylabel("ns")
+        ax.grid(True, alpha=0.3)
+        ax.legend(fontsize=8)
+    for j in range(len(HOPS), len(axes)):
+        axes[j].axis("off")
+    fig.suptitle(
+        "PE-to-PE latency: IPCQ vs raw DMA",
+        fontsize=14,
+    )
+    fig.tight_layout()
+    fig.savefig(path, dpi=120)
+    plt.close(fig)
+
+
+# ── Test entry ───────────────────────────────────────────────────────
+
+
+def test_pe_to_pe_latency_sweep():
+    records: list[dict] = []
+
+    for hop in HOPS:
+        for size in SIZES:
+            # IPCQ path
+            ipcq_ns = _measure_ipcq(hop, size)
+            records.append({
+                "hop": hop.id, "label": hop.label,
+                "size_bytes": size, "path": "ipcq",
+                "total_ns": ipcq_ns,
+            })
+
+            raw_s = "n/a"
+            if hop.supports_raw:
+                raw_ns = _measure_raw(hop, size)
+                records.append({
+                    "hop": hop.id, "label": hop.label,
+                    "size_bytes": size, "path": "raw",
+                    "total_ns": raw_ns,
+                })
+                raw_s = f"{raw_ns:7.1f}ns"
+
+            print(
+                f"[{hop.id}] size={size:5d} "
+                f"ipcq={ipcq_ns:7.1f}ns raw={raw_s}"
+            )
+
+    PLOT_DIR.mkdir(parents=True, exist_ok=True)
+    _write_csv(records, PLOT_DIR / "summary.csv")
+    for hop in HOPS:
+        _plot_per_hop(records, hop, PLOT_DIR / f"{hop.id}.png")
+    _plot_overview(records, PLOT_DIR / "overview.png")
+
+    for hop in HOPS:
+        rs = sorted(
+            [r for r in records if r["hop"] == hop.id and r["path"] == "ipcq"],
+            key=lambda r: r["size_bytes"],
+        )
+        for r in rs:
+            assert r["total_ns"] > 0, f"{hop.id}: total_ns must be > 0"
+
+    print(f"\n Plots + CSV written to {PLOT_DIR}")
@@ -1,7 +1,10 @@
 import pytest

 from kernbench.policy.address.allocator import AddressConfig, AllocationError, PEMemAllocator
-from kernbench.policy.address.phyaddr import PhysAddr, PhysAddrError, UnitType
+from kernbench.policy.address.phyaddr import (
+    PhysAddr, PhysAddrError, UnitType,
+    PESubUnit, MCPUSubUnit, IOCPUSubUnit,
+)

 _MB = 1 << 20
 _GB = 1 << 30
@@ -23,13 +26,11 @@ _CFG = AddressConfig(


 def test_physaddr_immutable():
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
     with pytest.raises(AttributeError):
-        pa.rack_id = 1 # type: ignore[misc]
-    # hashable
-    {pa}
-    # comparable
-    pa2 = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0)
+        pa.sip_id = 1 # type: ignore[misc]
+    {pa} # hashable
+    pa2 = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0)
     assert pa == pa2


@@ -37,120 +38,133 @@ def test_physaddr_immutable():


 def test_hbm_encode_decode_roundtrip():
-    pa = PhysAddr.hbm_addr(rack_id=2, sip_id=3, cube_id=5, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=3, die_id=5, hbm_offset=0x1000)
     raw = pa.encode()
     dec = PhysAddr.decode(raw)
-    assert dec.rack_id == 2
     assert dec.sip_id == 3
-    assert dec.cube_id == 5
+    assert dec.die_id == 5
     assert dec.kind == "hbm"
     assert dec.hbm_offset == 0x1000


-# ── PE resource encode/decode roundtrip ─────────────────────────────
+# ── PE resource encode/decode roundtrip (new layout) ───────────────


 def test_pe_resource_encode_decode_roundtrip():
-    pa = PhysAddr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=0,
-        kind="pe_resource", cube_id=7,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
+    pa = PhysAddr.pe_resource_addr(
+        sip_id=2, die_id=7, pe_id=3,
+        pe_sub_unit=PESubUnit.PE_TCM, sub_offset=0xFF,
     )
-    # manually build local_offset matching bit layout
-    local_offset = (UnitType.PE << 34) | (3 << 30) | (1 << 29) | 0xFF
-    pa2 = PhysAddr(
-        rack_id=1, sip_id=2, sip_seg=7, local_offset=local_offset,
-        kind="pe_resource", cube_id=7,
-        unit_type=UnitType.PE, pe_id=3, ext=1, sub_offset=0xFF,
-    )
-    raw = pa2.encode()
+    raw = pa.encode()
     dec = PhysAddr.decode(raw)
     assert dec.kind == "pe_resource"
     assert dec.unit_type == UnitType.PE
     assert dec.pe_id == 3
-    assert dec.ext == 1
+    assert dec.pe_sub_unit == PESubUnit.PE_TCM
     assert dec.sub_offset == 0xFF
+    assert dec.die_id == 7
+    assert dec.sip_id == 2
+
+
+def test_pe_resource_all_sub_units():
+    """Each PE sub-unit roundtrips correctly."""
+    for su in PESubUnit:
+        pa = PhysAddr.pe_resource_addr(
+            sip_id=0, die_id=0, pe_id=0,
+            pe_sub_unit=su, sub_offset=42,
+        )
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.pe_sub_unit == su
+        assert dec.sub_offset == 42


 # ── pe_hbm_addr factory ────────────────────────────────────────────


 def test_pe_hbm_addr_factory():
-    SLICE = 6 * (1 << 30) # 6 GB per PE slice
+    SLICE = 6 * _GB
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
         pe_id=2, pe_local_hbm_offset=1024, slice_size_bytes=SLICE,
     )
     assert pa.kind == "hbm"
-    assert pa.cube_id == 0
+    assert pa.die_id == 0
     assert pa.hbm_offset == 2 * SLICE + 1024


 def test_pe_hbm_addr_overflow():
-    SLICE = 6 * (1 << 30)
+    SLICE = 6 * _GB
     with pytest.raises(PhysAddrError, match="pe_local_hbm_offset"):
         PhysAddr.pe_hbm_addr(
-            rack_id=0, sip_id=0, cube_id=0,
+            sip_id=0, die_id=0,
             pe_id=0, pe_local_hbm_offset=SLICE, slice_size_bytes=SLICE,
         )


-# ── Invalid unit_type decode (fix #1) ──────────────────────────────
+# ── Invalid resource_kind decode ──────────────────────────────────


-def test_invalid_unit_type_raises():
-    # Craft a PE-resource address with unit_type=7 (invalid)
-    local_offset = (7 << 34) | (0 << 30) | 0
-    pa_raw = PhysAddr(
-        rack_id=0, sip_id=0, sip_seg=0, local_offset=local_offset,
-    )
+def test_invalid_resource_kind_raises():
+    # resource_kind=7 (invalid), addr_space=0
+    local_offset = (7 << 34) | 0
+    pa_raw = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
     raw = pa_raw.encode()
-    with pytest.raises(PhysAddrError, match="unit_type"):
+    with pytest.raises(PhysAddrError, match="resource_kind"):
         PhysAddr.decode(raw)


-# ── hbm_pe_id utility (fix #3) ─────────────────────────────────────
+# ── hbm_pe_id utility ─────────────────────────────────────────────


 def test_hbm_pe_id_utility():
-    SLICE = 6 * (1 << 30) # 6 GB
+    SLICE = 6 * _GB
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=0, cube_id=0,
+        sip_id=0, die_id=0,
         pe_id=5, pe_local_hbm_offset=256, slice_size_bytes=SLICE,
     )
     assert PhysAddr.hbm_pe_id(pa.hbm_offset, SLICE) == 5


-# ── UnitType.SRAM exists (fix #5) ──────────────────────────────────
+# ── UnitType / sub-unit enums ──────────────────────────────────────


 def test_sram_unit_type_exists():
     assert UnitType.SRAM == 2
+
+
+def test_pe_sub_unit_enum():
+    assert PESubUnit.PE_TCM == 6
+    assert PESubUnit.IPCQ == 2
+
+
+def test_mcpu_sub_unit_enum():
+    assert MCPUSubUnit.MCPU_SRAM == 5
+
+
+def test_iocpu_sub_unit_enum():
+    assert IOCPUSubUnit.IO_SRAM == 5


 # ── cube_sram_addr factory + roundtrip ──────────────────────────────


 def test_cube_sram_addr_roundtrip():
-    pa = PhysAddr.cube_sram_addr(
-        rack_id=0, sip_id=1, cube_id=3, sram_offset=0x800,
-    )
+    pa = PhysAddr.cube_sram_addr(sip_id=1, die_id=3, sram_offset=0x800)
     assert pa.kind == "pe_resource"
     assert pa.unit_type == UnitType.SRAM
-    assert pa.cube_id == 3
+    assert pa.die_id == 3
     assert pa.sub_offset == 0x800
-    # encode → decode roundtrip
     dec = PhysAddr.decode(pa.encode())
     assert dec.unit_type == UnitType.SRAM
-    assert dec.cube_id == 3
+    assert dec.die_id == 3
     assert dec.sub_offset == 0x800


 def test_cube_sram_addr_range_check():
     with pytest.raises(PhysAddrError):
         PhysAddr.cube_sram_addr(
-            rack_id=0, sip_id=0, cube_id=0,
-            sram_offset=(1 << 29), # exceeds 29-bit sub_offset
+            sip_id=0, die_id=0,
+            sram_offset=(1 << 25), # exceeds 25-bit sub_offset
         )


@@ -158,29 +172,137 @@ def test_cube_sram_addr_range_check():


 def test_pe_tcm_addr_roundtrip():
-    pa = PhysAddr.pe_tcm_addr(
-        rack_id=0, sip_id=0, cube_id=2, pe_id=7, tcm_offset=0x400,
-    )
+    pa = PhysAddr.pe_tcm_addr(sip_id=0, die_id=2, pe_id=7, tcm_offset=0x400)
     assert pa.kind == "pe_resource"
     assert pa.unit_type == UnitType.PE
     assert pa.pe_id == 7
-    assert pa.cube_id == 2
+    assert pa.die_id == 2
+    assert pa.pe_sub_unit == PESubUnit.PE_TCM
     assert pa.sub_offset == 0x400
-    # encode → decode roundtrip
     dec = PhysAddr.decode(pa.encode())
     assert dec.unit_type == UnitType.PE
     assert dec.pe_id == 7
+    assert dec.pe_sub_unit == PESubUnit.PE_TCM
     assert dec.sub_offset == 0x400


 def test_pe_tcm_addr_range_check():
     with pytest.raises(PhysAddrError):
         PhysAddr.pe_tcm_addr(
-            rack_id=0, sip_id=0, cube_id=0, pe_id=0,
-            tcm_offset=(1 << 29), # exceeds 29-bit sub_offset
+            sip_id=0, die_id=0, pe_id=0,
+            tcm_offset=(1 << 25), # exceeds 25-bit sub_offset
         )


+# ── MCPU resource factory + roundtrip ──────────────────────────────
+
+
+def test_mcpu_resource_roundtrip():
+    pa = PhysAddr.mcpu_resource_addr(
+        sip_id=0, die_id=1,
+        mcpu_sub_unit=MCPUSubUnit.MCPU_SRAM, sub_offset=0x100,
+    )
+    assert pa.kind == "pe_resource"
+    assert pa.unit_type == UnitType.MCPU
+    assert pa.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
+    assert pa.sub_offset == 0x100
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.unit_type == UnitType.MCPU
+    assert dec.mcpu_sub_unit == MCPUSubUnit.MCPU_SRAM
+    assert dec.sub_offset == 0x100
+
+
+# ── IOCHIPLET: IOCPU factory + roundtrip ────────────────────────────
+
+
+def test_iocpu_resource_roundtrip():
+    pa = PhysAddr.iocpu_resource_addr(
+        sip_id=1, die_id=17,
+        iocpu_sub_unit=IOCPUSubUnit.IPCQ, sub_offset=0x20000,
+    )
+    assert pa.kind == "iocpu"
+    assert pa.iocpu_sub_unit == IOCPUSubUnit.IPCQ
+    assert pa.sub_offset == 0x20000
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.kind == "iocpu"
+    assert dec.iocpu_sub_unit == IOCPUSubUnit.IPCQ
+    assert dec.sub_offset == 0x20000
+    assert dec.die_id == 17
+
+
+def test_iocpu_die_range_check():
+    with pytest.raises(PhysAddrError, match="IOCHIPLET"):
+        PhysAddr.iocpu_resource_addr(
+            sip_id=0, die_id=5, # not a chiplet die
+            iocpu_sub_unit=0, sub_offset=0,
+        )
+
+
+# ── IOCHIPLET: UAL factory + roundtrip ──────────────────────────────
+
+
+def test_ual_addr_roundtrip():
+    pa = PhysAddr.ual_addr(sip_id=0, die_id=16, ual_offset=0x1000)
+    assert pa.kind == "ual"
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.kind == "ual"
+    assert dec.die_id == 16
+    assert dec.chiplet_offset >= (1 << 31) # >= 2 GB boundary
+
+
+# ── die_id dispatch ────────────────────────────────────────────────
+
+
+def test_die_id_ahbm_range():
+    for die in [0, 15]:
+        pa = PhysAddr.hbm_addr(sip_id=0, die_id=die, hbm_offset=0)
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.kind == "hbm"
+        assert dec.die_id == die
+
+
+def test_die_id_chiplet_range():
+    for die in [16, 20]:
+        pa = PhysAddr.iocpu_resource_addr(
+            sip_id=0, die_id=die,
+            iocpu_sub_unit=0, sub_offset=0,
+        )
+        dec = PhysAddr.decode(pa.encode())
+        assert dec.kind == "iocpu"
+        assert dec.die_id == die
+
+
+def test_die_id_reserved_raises():
+    raw = (0 << 47) | (21 << 42) | 0 # die_id=21 (reserved)
+    with pytest.raises(PhysAddrError, match="reserved"):
+        PhysAddr.decode(raw)
+
+
+# ── Boundary values ────────────────────────────────────────────────
+
+
+def test_sip_boundary():
+    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
+    dec = PhysAddr.decode(pa.encode())
+    assert dec.sip_id == 15
+
+
+def test_mbz_enforcement_ahbm():
+    """AHBM local_offset bits [41:38] must be zero."""
+    local_offset = (1 << 38) | (1 << 37) # MBZ bit set + HBM
+    pa = PhysAddr(sip_id=0, die_id=0, local_offset=local_offset)
+    with pytest.raises(PhysAddrError, match="bits \\[41:38\\]"):
+        pa.encode()
+
+
+def test_mbz_enforcement_chiplet():
+    """IOCHIPLET local_offset bits [41:40] must be zero."""
+    local_offset = (1 << 40) | 0 # MBZ bit set
+    pa = PhysAddr(sip_id=0, die_id=16, local_offset=local_offset)
+    with pytest.raises(PhysAddrError, match="bits \\[41:40\\]"):
+        pa.encode()
+
+
 # ── AddressConfig ───────────────────────────────────────────────────


@@ -193,7 +315,7 @@ def test_address_config_derived_sizes():
 
 
 def _make_alloc(pe_id: int = 0) -> PEMemAllocator:
-    return PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=pe_id, cfg=_CFG)
+    return PEMemAllocator(sip_id=0, die_id=0, pe_id=pe_id, cfg=_CFG)
 
 
 def test_allocator_hbm_basic():
@@ -201,8 +323,7 @@ def test_allocator_hbm_basic():
     pa = a.alloc_hbm(4096)
     assert pa.kind == "hbm"
     assert pa.sip_id == 0
-    assert pa.cube_id == 0
-    # hbm_offset should be pe3's slice start
+    assert pa.die_id == 0
     assert pa.hbm_offset == 3 * 6 * _GB
 
 
@@ -210,8 +331,8 @@ def test_allocator_hbm_sequential():
     a = _make_alloc()
     pa1 = a.alloc_hbm(1024)
     pa2 = a.alloc_hbm(2048)
-    assert pa1.hbm_offset == 0  # pe0 slice start + 0
-    assert pa2.hbm_offset == 1024  # pe0 slice start + 1024
+    assert pa1.hbm_offset == 0
+    assert pa2.hbm_offset == 1024
 
 
 def test_allocator_hbm_overflow():
@@ -235,7 +356,6 @@ def test_allocator_tcm_basic():
 
 def test_allocator_tcm_respects_reserved():
     a = _make_alloc()
-    # allocatable = 12 MB, should succeed
     a.alloc_tcm(12 * _MB)
     assert a.tcm_used == 12 * _MB
     assert a.tcm_total == 12 * _MB
|||||||
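The address tests above pin down a packing contract: `sip_id` sits at bit 47, `die_id` at bit 42, the low 42 bits hold `local_offset`, die IDs 0..15 decode as AHBM, 16..20 as IO chiplet, and 21 is rejected as reserved. The sketch below illustrates that contract with standalone functions. It is not the project's `PhysAddr` class: the names `pack`, `unpack`, `die_class`, and `mbz_mask` are hypothetical, the field widths (4-bit `sip_id`, 5-bit `die_id`) are inferred from the 51-bit title and the shifts in `test_die_id_reserved_raises`, treating every die ID above 20 as reserved is an assumption, and `ValueError` stands in for `PhysAddrError`.

```python
from __future__ import annotations

# Field positions inferred from the tests: raw = (sip << 47) | (die << 42) | offset.
SIP_SHIFT, SIP_BITS = 47, 4
DIE_SHIFT, DIE_BITS = 42, 5
OFFSET_BITS = 42

AHBM_DIES = range(0, 16)      # die_id 0..15
CHIPLET_DIES = range(16, 21)  # die_id 16..20; higher IDs assumed reserved


def die_class(die_id: int) -> str:
    """Classify a die_id, rejecting the reserved range."""
    if die_id in AHBM_DIES:
        return "ahbm"
    if die_id in CHIPLET_DIES:
        return "iochiplet"
    raise ValueError(f"die_id={die_id} is reserved")


def mbz_mask(die_id: int) -> int:
    """Must-be-zero bits of local_offset: [41:38] for AHBM, [41:40] for IO chiplet."""
    lowest_mbz_bit = 38 if die_class(die_id) == "ahbm" else 40
    return ((1 << OFFSET_BITS) - 1) & ~((1 << lowest_mbz_bit) - 1)


def pack(sip_id: int, die_id: int, local_offset: int) -> int:
    """Pack the three fields into one integer, validating ranges and MBZ bits."""
    if not 0 <= sip_id < (1 << SIP_BITS):
        raise ValueError(f"sip_id={sip_id} out of range")
    if not 0 <= die_id < (1 << DIE_BITS):
        raise ValueError(f"die_id={die_id} out of range")
    if not 0 <= local_offset < (1 << OFFSET_BITS):
        raise ValueError("local_offset does not fit in 42 bits")
    if local_offset & mbz_mask(die_id):
        raise ValueError("must-be-zero bits set in local_offset")
    return (sip_id << SIP_SHIFT) | (die_id << DIE_SHIFT) | local_offset


def unpack(raw: int) -> tuple[int, int, int]:
    """Split a packed address back into (sip_id, die_id, local_offset)."""
    sip_id = (raw >> SIP_SHIFT) & ((1 << SIP_BITS) - 1)
    die_id = (raw >> DIE_SHIFT) & ((1 << DIE_BITS) - 1)
    die_class(die_id)  # raises ValueError for a reserved die_id
    local_offset = raw & ((1 << OFFSET_BITS) - 1)
    return sip_id, die_id, local_offset
```

Keeping the die classification in one helper means the encode path and the decode path reject reserved IDs through the same check, which is the behaviour the dispatch and MBZ tests exercise from both directions.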
@@ -21,7 +21,7 @@ def _engine():
 def _hbm_pa(sip: int = 0, cube: int = 0, pe_id: int = 0) -> int:
     slice_bytes = 48 * (1 << 30) // 8
     pa = PhysAddr.pe_hbm_addr(
-        rack_id=0, sip_id=sip, cube_id=cube, pe_id=pe_id,
+        sip_id=sip, die_id=cube, pe_id=pe_id,
         pe_local_hbm_offset=0x1000, slice_size_bytes=slice_bytes,
     )
     return pa.encode()
@@ -20,7 +20,7 @@ def test_resolve_hbm_addr():
     """HBM address -> sip{S}.cube{C}.hbm_ctrl (single controller per cube)."""
     g = _graph()
     resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=3, hbm_offset=0x1000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=3, hbm_offset=0x1000)
     assert resolver.resolve(pa) == "sip0.cube3.hbm_ctrl"
 
 
@@ -28,33 +28,33 @@ def test_resolve_hbm_addr_high_offset():
     """HBM address with large offset still resolves to same hbm_ctrl."""
     g = _graph()
     resolver = AddressResolver(g)
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=0, cube_id=0, hbm_offset=0x600000000)
+    pa = PhysAddr.hbm_addr(sip_id=0, die_id=0, hbm_offset=0x600000000)
     assert resolver.resolve(pa) == "sip0.cube0.hbm_ctrl"
 
 
 def test_resolve_pe_tcm_addr():
-    """PE TCM address → sip{S}.cube{C}.pe{P}.pe_tcm"""
+    """PE TCM address -> sip{S}.cube{C}.pe{P}.pe_tcm"""
     g = _graph()
     resolver = AddressResolver(g)
-    pa = PhysAddr.pe_tcm_addr(rack_id=0, sip_id=1, cube_id=5, pe_id=7, tcm_offset=0x400)
+    pa = PhysAddr.pe_tcm_addr(sip_id=1, die_id=5, pe_id=7, tcm_offset=0x400)
     assert resolver.resolve(pa) == "sip1.cube5.pe7.pe_tcm"
 
 
 def test_resolve_sram_addr():
-    """SRAM address → sip{S}.cube{C}.sram"""
+    """SRAM address -> sip{S}.cube{C}.sram"""
     g = _graph()
     resolver = AddressResolver(g)
-    pa = PhysAddr.cube_sram_addr(rack_id=0, sip_id=0, cube_id=10, sram_offset=0x800)
+    pa = PhysAddr.cube_sram_addr(sip_id=0, die_id=10, sram_offset=0x800)
     assert resolver.resolve(pa) == "sip0.cube10.sram"
 
 
 def test_resolve_mcpu_addr():
-    """MCPU pe_resource address → sip{S}.cube{C}.m_cpu"""
+    """MCPU pe_resource address -> sip{S}.cube{C}.m_cpu"""
     g = _graph()
     resolver = AddressResolver(g)
-    pa = PhysAddr(
-        rack_id=0, sip_id=0, sip_seg=2, local_offset=(UnitType.MCPU << 34),
-        kind="pe_resource", cube_id=2, unit_type=UnitType.MCPU,
+    pa = PhysAddr.mcpu_resource_addr(
+        sip_id=0, die_id=2,
+        mcpu_sub_unit=0, sub_offset=0,
     )
     assert resolver.resolve(pa) == "sip0.cube2.m_cpu"
 
@@ -64,7 +64,7 @@ def test_resolve_nonexistent_node():
     g = _graph()
     resolver = AddressResolver(g)
     # sip_id=15 doesn't exist in the 2-SIP topology
-    pa = PhysAddr.hbm_addr(rack_id=0, sip_id=15, cube_id=0, hbm_offset=0)
+    pa = PhysAddr.hbm_addr(sip_id=15, die_id=0, hbm_offset=0)
     with pytest.raises(RoutingError):
         resolver.resolve(pa)
 
@@ -73,7 +73,7 @@ def test_resolve_nonexistent_node():
 
 
 def test_path_local_hbm():
-    """PE0 -> hbm_ctrl: pe_dma → router → hbm_ctrl (through router mesh)."""
+    """PE0 -> hbm_ctrl: pe_dma -> router -> hbm_ctrl (through router mesh)."""
     g = _graph()
     router = PathRouter(g)
     path = router.find_path("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl")
@@ -107,7 +107,7 @@ def test_all_pe_hbm_equidistant():
     """All PEs in a cube have equal routing distance to hbm_ctrl.
 
     With n_to_one mapping and high routing weight on HBM edges,
-    all PE→hbm_ctrl paths have the same accumulated distance.
+    all PE->hbm_ctrl paths have the same accumulated distance.
     """
     g = _graph()
     router = PathRouter(g)
@@ -151,7 +151,7 @@ def test_path_remote_cube_hbm():
 
 
 def test_path_sram_via_router_mesh():
-    """PE → SRAM must go through router mesh nodes."""
+    """PE -> SRAM must go through router mesh nodes."""
     g = _graph()
     router = PathRouter(g)
     path = router.find_path("sip0.cube0.pe0", "sip0.cube0.sram")
@@ -168,7 +168,7 @@ def test_path_sram_via_router_mesh():
 
 
 def test_path_local_tcm():
-    """PE0 → own TCM is PE-internal, not via router mesh."""
+    """PE0 -> own TCM is PE-internal, not via router mesh."""
     g = _graph()
     router = PathRouter(g)
     path = router.find_path("sip0.cube0.pe0", "sip0.cube0.pe0.pe_tcm")
|||||||
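The resolver tests above document a naming convention for graph nodes: an HBM address maps to `sip{S}.cube{C}.hbm_ctrl`, a PE TCM address to `sip{S}.cube{C}.pe{P}.pe_tcm`, an SRAM address to `sip{S}.cube{C}.sram`, and an MCPU address to `sip{S}.cube{C}.m_cpu`, with the renamed `die_id` still filling the `cube{C}` slot. The sketch below shows one way that mapping could be written. It is only an illustration: `DecodedAddr`, `node_id`, and `resolve` are hypothetical names, the kind labels other than `"hbm"` are placeholders, and `LookupError` stands in for the `RoutingError` the tests expect.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodedAddr:
    """Hypothetical stand-in for a decoded physical address."""
    kind: str                    # "hbm", "pe_tcm", "sram", or "mcpu" (labels assumed)
    sip_id: int
    die_id: int
    pe_id: Optional[int] = None  # only meaningful for PE-scoped kinds


def node_id(addr: DecodedAddr) -> str:
    """Build the graph node name for an address, following the test docstrings."""
    base = f"sip{addr.sip_id}.cube{addr.die_id}"
    if addr.kind == "hbm":
        return f"{base}.hbm_ctrl"
    if addr.kind == "sram":
        return f"{base}.sram"
    if addr.kind == "mcpu":
        return f"{base}.m_cpu"
    if addr.kind == "pe_tcm":
        if addr.pe_id is None:
            raise ValueError("pe_tcm address needs a pe_id")
        return f"{base}.pe{addr.pe_id}.pe_tcm"
    raise ValueError(f"unknown address kind: {addr.kind!r}")


def resolve(addr: DecodedAddr, known_nodes: set[str]) -> str:
    """Return the node name, failing if the topology has no such node."""
    name = node_id(addr)
    if name not in known_nodes:
        raise LookupError(f"no node named {name} in this topology")
    return name
```

Separating name construction from the existence check keeps the address layer free of topology counts: only `resolve` needs to know which nodes a given graph actually contains.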
@@ -0,0 +1,106 @@
+"""Rectangular (non-square) SIP-level 2D topology support.
+
+Phase 1 regression target: today the 2D builtin topology functions in
+``kernbench.ccl.topologies`` (``mesh_2d``, ``torus_2d``,
+``mesh_2d_no_wrap``) hardcode ``side = sqrt(world_size)`` and raise
+``ValueError`` for any non-square ``world_size``. This blocks running
+the allreduce sweep at n_sips=6 on torus/mesh layouts.
+
+Phase 2 will extend these functions to accept optional ``w, h`` kwargs
+so a 2×3 (or 3×2, etc.) layout works. Until then, every test below that
+passes ``w``/``h`` is expected to FAIL.
+
+Layout convention used here (matches the square case):
+    rank = row * w + col    for 0 <= row < h, 0 <= col < w
+
+For w=2, h=3, world_size=6 the layout is:
+
+            col=0  col=1
+    row=0:    0      1
+    row=1:    2      3
+    row=2:    4      5
+"""
+from __future__ import annotations
+
+import pytest
+
+from kernbench.ccl.topologies import (
+    mesh_2d,
+    mesh_2d_no_wrap,
+    torus_2d,
+)
+
+
+# ── mesh_2d_no_wrap (no wrap-around) ──────────────────────────────────
+
+
+def test_mesh_2d_no_wrap_2x3_top_left():
+    """rank 0 (top-left, no N, no W): only S and E."""
+    nbrs = mesh_2d_no_wrap(rank=0, world_size=6, w=2, h=3)
+    assert nbrs == {"S": 2, "E": 1}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_top_right():
+    """rank 1 (top-right, no N, no E): only S and W."""
+    nbrs = mesh_2d_no_wrap(rank=1, world_size=6, w=2, h=3)
+    assert nbrs == {"S": 3, "W": 0}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_middle_left():
+    """rank 2 (middle-left, no W): N, S, E."""
+    nbrs = mesh_2d_no_wrap(rank=2, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 0, "S": 4, "E": 3}, nbrs
+
+
+def test_mesh_2d_no_wrap_2x3_bottom_right():
+    """rank 5 (bottom-right, no S, no E): only N and W."""
+    nbrs = mesh_2d_no_wrap(rank=5, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 3, "W": 4}, nbrs
+
+
+# ── torus_2d (wrap-around on all four edges) ─────────────────────────
+
+
+def test_torus_2d_2x3_top_left():
+    """rank 0: N wraps to row 2 col 0 (rank 4); W wraps to col 1 (rank 1)."""
+    nbrs = torus_2d(rank=0, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 4, "S": 2, "W": 1, "E": 1}, nbrs
+
+
+def test_torus_2d_2x3_bottom_right():
+    """rank 5: S wraps to row 0 (rank 1); E wraps to col 0 (rank 4)."""
+    nbrs = torus_2d(rank=5, world_size=6, w=2, h=3)
+    assert nbrs == {"N": 3, "S": 1, "W": 4, "E": 4}, nbrs
+
+
+# ── mesh_2d alias for torus_2d ───────────────────────────────────────
+
+
+def test_mesh_2d_2x3_matches_torus_2d():
+    """mesh_2d is currently a torus alias; behaviour must match torus_2d."""
+    for rank in range(6):
+        assert mesh_2d(rank=rank, world_size=6, w=2, h=3) == \
+            torus_2d(rank=rank, world_size=6, w=2, h=3)
+
+
+# ── Back-compat: square layouts still work without w/h kwargs ────────
+
+
+def test_square_back_compat_mesh_2d_no_wrap():
+    """Calling without w, h should still work for square world_size."""
+    nbrs = mesh_2d_no_wrap(rank=0, world_size=4)
+    assert nbrs == {"S": 2, "E": 1}, nbrs
+
+
+def test_square_back_compat_torus_2d():
+    nbrs = torus_2d(rank=0, world_size=4)
+    assert nbrs == {"N": 2, "S": 2, "W": 1, "E": 1}, nbrs
+
+
+# ── Validation: w*h must match world_size ────────────────────────────
+
+
+def test_rectangular_dims_must_match_world_size():
+    """Phase 2 contract: explicit w, h must satisfy w*h == world_size."""
+    with pytest.raises(ValueError):
+        mesh_2d_no_wrap(rank=0, world_size=6, w=3, h=3)  # 9 != 6
||||||
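The new test file above fixes the layout convention `rank = row * w + col` and the expected neighbour maps for a 2×3 grid. A minimal sketch of functions that would satisfy those expectations follows. These are not the functions in `kernbench.ccl.topologies`: the `_sketch` names and the `_dims` helper are hypothetical, and the optional `w`/`h` keyword handling is an assumption based on the Phase 2 description in the module docstring.

```python
from __future__ import annotations

import math


def _dims(world_size: int, w: int | None, h: int | None) -> tuple[int, int]:
    """Return (w, h); fall back to a square grid when neither is given."""
    if w is None and h is None:
        side = math.isqrt(world_size)
        if side * side != world_size:
            raise ValueError(f"world_size={world_size} is not a perfect square")
        return side, side
    if w is None or h is None:
        raise ValueError("w and h must be given together")
    if w * h != world_size:
        raise ValueError(f"w*h={w * h} does not match world_size={world_size}")
    return w, h


def torus_2d_sketch(rank: int, world_size: int,
                    w: int | None = None, h: int | None = None) -> dict[str, int]:
    """Neighbours with wrap-around on all four edges."""
    w, h = _dims(world_size, w, h)
    if not 0 <= rank < world_size:
        raise ValueError(f"rank={rank} out of range")
    row, col = divmod(rank, w)  # inverse of rank = row * w + col
    return {
        "N": ((row - 1) % h) * w + col,
        "S": ((row + 1) % h) * w + col,
        "W": row * w + (col - 1) % w,
        "E": row * w + (col + 1) % w,
    }


def mesh_2d_no_wrap_sketch(rank: int, world_size: int,
                           w: int | None = None, h: int | None = None) -> dict[str, int]:
    """Neighbours without wrap-around; edge ranks simply omit a direction."""
    w, h = _dims(world_size, w, h)
    if not 0 <= rank < world_size:
        raise ValueError(f"rank={rank} out of range")
    row, col = divmod(rank, w)
    nbrs: dict[str, int] = {}
    if row > 0:
        nbrs["N"] = (row - 1) * w + col
    if row < h - 1:
        nbrs["S"] = (row + 1) * w + col
    if col > 0:
        nbrs["W"] = row * w + (col - 1)
    if col < w - 1:
        nbrs["E"] = row * w + (col + 1)
    return nbrs
```

On a grid only two columns wide the torus maps `W` and `E` to the same rank, which is why the torus tests expect duplicated values such as `"W": 1, "E": 1` for rank 0.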
@@ -44,7 +44,7 @@ _CFG = AddressConfig(
 
 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
     return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
         for i in range(num_pe)
     }
 
@@ -55,7 +55,7 @@ def _make_ctx():
 
 def test_allocator_free_hbm_reclaims_space():
     """free_hbm returns HBM space; subsequent alloc can reuse it."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
     pa1 = a.alloc_hbm(4096)
     used_after_alloc = a.hbm_used
     a.free_hbm(pa1, 4096)
@@ -66,7 +66,7 @@ def test_allocator_free_hbm_reclaims_space():
 
 def test_allocator_free_tcm_reclaims_space():
     """free_tcm returns TCM space."""
-    a = PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=0, cfg=_CFG)
+    a = PEMemAllocator(sip_id=0, die_id=0, pe_id=0, cfg=_CFG)
     pa1 = a.alloc_tcm(256)
     used_after_alloc = a.tcm_used
     a.free_tcm(pa1, 256)
@@ -39,7 +39,7 @@ _CFG = AddressConfig(
 
 def _make_allocators(num_pe: int = 8) -> dict[tuple[int, int, int], PEMemAllocator]:
     return {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=_CFG)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=_CFG)
         for i in range(num_pe)
     }
 
@@ -70,7 +70,7 @@ def _make_standalone(shape, num_pe=NUM_PE):
         sram_bytes_per_cube=32 * _MB,
     )
     allocators = {
-        (0, 0, i): PEMemAllocator(rack_id=0, sip_id=0, cube_id=0, pe_id=i, cfg=cfg)
+        (0, 0, i): PEMemAllocator(sip_id=0, die_id=0, pe_id=i, cfg=cfg)
         for i in range(num_pe)
     }
     va_alloc = VirtualAllocator(va_base=0x1_0000_0000, va_size=64 * _GB, page_size=4096)