commit - release 1
@@ -0,0 +1,108 @@
# ADR-0001: PhysAddr Layout & Address Decoding Contract

## Status

Accepted

## Date

2026-02-27

## Context

The KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:

- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations

## Decision

We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.

### D1. PhysAddr is an immutable value object

- PhysAddr is immutable and comparable as a pure value.
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.

### D2. PhysAddr fields (logical contract)

PhysAddr must be able to represent at least:

- `rack_id` (optional but reserved for scale-out)
- `sip_id` (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)

Decoded/derived fields may include (optional):

- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)

**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
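
A minimal sketch of D1 and D2, assuming a frozen dataclass is an acceptable representation. The field names follow the lists above; the `Kind` members and the default values are illustrative assumptions, not part of the contract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    """Target kind; the member names are illustrative assumptions."""
    HBM = "hbm"
    PE_RESOURCE = "pe_resource"
    RAW = "raw"


@dataclass(frozen=True)
class PhysAddr:
    """Immutable value object: equality and hashing use field values only."""
    sip_id: int
    sip_seg: int
    local_offset: int
    rack_id: int = 0                 # reserved for scale-out
    cube_id: Optional[int] = None    # derived, optional
    kind: Kind = Kind.RAW            # derived, optional
    pe_id: Optional[int] = None      # only if PE-level addressing is modeled


a = PhysAddr(sip_id=1, sip_seg=2, local_offset=0x40)
b = PhysAddr(sip_id=1, sip_seg=2, local_offset=0x40)
```

Because the class is frozen, two instances with the same fields compare equal and can be used as dictionary keys, and attribute assignment raises an error.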

### D3. Decoding is deterministic and policy-compatible

- Decoding must deterministically map an integer address to:
  - destination SIP domain (`sip_id`)
  - destination sub-domain (`cube_id` if applicable)
  - destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in arbitrary components).

### D4. Topology-derived constants live in the topology layer

Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
They must not be hardcoded in node implementations.
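
The following sketch shows one way D3 and D4 could fit together: a pure decode function that receives every size it needs from an explicit layout object. The `AddressLayout` class, its field names, and the "one segment per cube window" mapping are assumptions made for illustration; the ADR does not fix a bit allocation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressLayout:
    """Explicit topology parameters (hypothetical names), loaded from config."""
    sip_window_bytes: int    # address span owned by one SIP
    cube_window_bytes: int   # address span of one cube window within a SIP


def decode(addr: int, layout: AddressLayout) -> dict:
    """Pure function: the result depends only on addr and layout."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    sip_id, in_sip = divmod(addr, layout.sip_window_bytes)
    sip_seg, local_offset = divmod(in_sip, layout.cube_window_bytes)
    return {
        "sip_id": sip_id,
        "sip_seg": sip_seg,
        "cube_id": sip_seg,   # assumption: one segment per cube window
        "local_offset": local_offset,
    }


layout = AddressLayout(sip_window_bytes=1 << 36, cube_window_bytes=1 << 32)
decoded = decode((3 << 36) + (5 << 32) + 0x100, layout)
```

Swapping the layout object changes the decoding without touching any node implementation, which is the property D4 asks for.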

### D5. Routing consumes decoded domains, not raw bits

Routing policy uses decoded domains:

- `src` location (sip/cube/pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency

Routing must not inspect raw bit-fields directly except inside the decoding module.
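
As an illustration of D5, a routing helper can classify traffic from decoded domains alone. The `Location` tuple and the three labels below are hypothetical names chosen for this sketch.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Decoded routing domains; no raw address bits are carried here."""
    sip_id: int
    cube_id: int


def classify(src: Location, dst: Location) -> str:
    """Classify traffic from decoded domains only."""
    if src.sip_id != dst.sip_id:
        return "remote_sip"
    if src.cube_id != dst.cube_id:
        return "remote_cube"
    return "local"
```

A test for the "local vs remote classification" invariant can then be written against `classify` without knowing the bit layout.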

## Alternatives Considered

1) **Use raw integers everywhere, decode ad-hoc in routing**

   - Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.

2) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**

   - Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.

3) **Put decoding inside memory controllers or routers**

   - Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).

## Consequences

### Positive

- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).

### Tradeoffs / Costs

- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.

## Implementation Notes (Non-normative)

- Recommended module boundary:
  - `src/kernbench/policy/address/phyaddr.py`

- Tests should cover:
  - deterministic decoding
  - local vs remote classification from decoded fields
  - invariants: “allocator returns full PhysAddr”, “decoding requires no global state”

## Links

- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)

@@ -0,0 +1,103 @@
# ADR-0002: Routing Distance, Ordering & Bypass Rules

## Status
Accepted

## Date
2026-02-27

## Context
The KernBench Graph Latency Simulator must compare kernel execution time
across different architectures and topologies by computing end-to-end
latency from graph traversal.

To support meaningful comparison:
- routing must be deterministic
- latency must reflect actual interconnect structure
- local vs remote traffic must be distinguishable
- “bypass” optimizations must not undermine debuggability or correctness

The simulator also aims to avoid software-managed metadata and hidden
shortcuts that obscure control paths.

## Decision

### D1. Distance is accumulated latency, not hop count
- Routing “distance” is defined as the **sum of per-node and per-link latency**.
- Hop count alone must not be used for ordering or path selection.
- Size-aware serialization latency (bytes / BW) contributes to distance.
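
The distance definition in D1 can be sketched as a simple sum. The `Hop` fields and the choice to charge serialization on every hop (a store-and-forward assumption) are illustrative; the ADR only states that `bytes / BW` contributes to distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node-plus-link step; field names are assumptions."""
    node_latency_ns: float
    link_latency_ns: float
    bw_bytes_per_ns: float   # 1 byte/ns equals 1 GB/s


def path_distance_ns(hops: list[Hop], size_bytes: int) -> float:
    """Accumulated latency: fixed node and link costs plus serialization."""
    total = 0.0
    for hop in hops:
        total += hop.node_latency_ns + hop.link_latency_ns
        total += size_bytes / hop.bw_bytes_per_ns
    return total


# Two fast hops can beat one slow hop, which is why hop count is not used.
two_fast = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 64.0)]
one_slow = [Hop(5.0, 20.0, 8.0)]
```

For a 256-byte request, the two-hop path costs 14 ns and the one-hop path costs 57 ns, so ordering by hop count would pick the slower route.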

### D2. Routing order is derived from graph traversal
- The chosen route is the path with minimum accumulated latency
  given the constructed graph and routing policy.
- Deterministic ordering must be guaranteed for identical inputs
  (topology + policy + request).
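
One way to get both properties in D2 is Dijkstra's algorithm with a deterministic tie-break. The sketch below is an assumption about how a router might do this, not the project's implementation; it breaks ties between equal-latency candidates by comparing their node-id paths, so identical inputs always yield the same route.

```python
import heapq


def min_latency_path(graph, src, dst):
    """Dijkstra over latency weights; ties break on the node-id path tuple.

    graph maps node id -> {neighbor id: latency_ns}; node ids are strings.
    """
    heap = [(0.0, (src,))]
    settled = set()
    while heap:
        dist, path = heapq.heappop(heap)
        node = path[-1]
        if node == dst:
            return dist, list(path)
        if node in settled:
            continue
        settled.add(node)
        for nxt in sorted(graph.get(node, {})):
            if nxt not in settled:
                heapq.heappush(heap, (dist + graph[node][nxt], path + (nxt,)))
    raise LookupError(f"no path from {src} to {dst}")


# Two routes with equal latency: a-b-d and a-c-d.
graph = {
    "a": {"b": 2.0, "c": 2.0},
    "b": {"d": 3.0},
    "c": {"d": 3.0},
}
```

With this graph both routes cost 5.0 ns, and the function returns the `a`, `b`, `d` route on every run because the heap compares the path tuples when latencies are equal.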

### D3. Bypass is explicit and graph-represented
- Any bypass (e.g., local cube HBM access via XBAR instead of NOC) must be:
  - explicitly represented as a graph path, and
  - subject to latency accumulation like any other path.
- Example: PE_DMA has dual egress: one to XBAR (HBM path) and one to NOC (non-HBM path).
  Both are explicit graph edges; neither is a “bypass”. They are distinct data paths
  serving different memory domains.
- Implicit or “magic” bypass paths are disallowed.

### D4. No zero-latency end-to-end paths

- Every routed request must incur **end-to-end** latency > 0.
- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
  when the fabric is distributed and distance is not meaningful at that granularity.
  This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
  UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
- Fully zero-latency end-to-end paths are disallowed, except for explicit
  test-only stubs clearly marked as such.
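
The D4 invariant lends itself to a small guard. The function name and the `test_stub` flag are hypothetical; the sketch only shows that individual segments may be zero while the total may not.

```python
def check_end_to_end_latency(segment_latencies_ns, test_stub=False):
    """Return the total; reject zero-latency routes unless marked as a stub."""
    if any(lat < 0 for lat in segment_latencies_ns):
        raise ValueError("segment latency must be non-negative")
    total = sum(segment_latencies_ns)
    if total <= 0 and not test_stub:
        raise ValueError("end-to-end latency must be > 0 (ADR-0002 D4)")
    return total
```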

### D5. Policy vs topology responsibility split
- Topology builder:
  - defines nodes and links and their latency/BW parameters
- Routing policy:
  - selects among available graph paths based on decoded domains
- Routing policy must not assume missing links; missing connectivity
  is a topology construction error.

### D6. No software-managed routing metadata
- Routing decisions must not rely on per-request software-managed metadata
  that tracks distance, hop count, or ordering outside the graph model.
- All distance/order computation is derived from traversal itself.

## Alternatives Considered

1) **Hop-count based routing**
   - Rejected: ignores heterogeneous latency/BW and misrepresents
     architectural differences.

2) **Implicit local shortcuts**
   - Rejected: breaks debuggability and violates traversal-based latency.

3) **Software-managed distance metadata**
   - Rejected: increases control overhead and obscures routing semantics.

## Consequences

### Positive
- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
- Architecture comparisons reflect real interconnect structure.
- Routing behavior is reproducible and deterministic.

### Tradeoffs / Costs
- Graph construction must be correct and complete.
- Bypass modeling requires explicit graph representation,
  which slightly increases topology description complexity.

## Implementation Notes (Non-normative)
- Recommended responsibilities:
  - Graph builder: ensure all required paths exist.
  - Router: select next hop based on decoded domains and policy.
- Tests should assert:
  - non-zero end-to-end latency
  - deterministic routing for identical inputs
  - bypass paths appear explicitly in emitted traces

## Links
- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
- ADR-0001: PhysAddr layout & decoding contract

@@ -0,0 +1,64 @@
# ADR-0003: Target System Hierarchy & Modeling Scope

## Status

Accepted

## Context

We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
through switching fabrics, with a host CPU issuing commands/kernels.

## Decision

We model the system hierarchy explicitly:

### D1. Tray-level

- A compute tray contains:
  - Host CPU (issues requests / coordinates runtime & data placement)
  - Multiple identical SIPs (accelerators)
  - Interconnect fabric between SIPs (PCIe and/or UAL via switches)

### D2. SIP-level

- A SIP is a multi-die package composed of:
  - Multiple CUBEs (HBM die + compute PEs + UCIe)
  - One or more IO chiplets (host/SIP interfaces)
- IO chiplets:
  - provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
  - can be multiple per SIP
  - placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 1–2 IO chiplets

### D3. CUBE-level

- A CUBE contains:
  - HBM + memory controller (HBM_CTRL)
  - XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
  - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
  - NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
    carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
  - multiple PEs
  - up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity

### D4. PE-level

- A PE can execute one kernel instance
- PE contains internal control + accelerators (modeled at PE view granularity):
  - PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues

## Consequences

- The simulator supports abstraction by “views”:
  - SIP view hides PE internals
  - CUBE view treats each PE as a single block
  - PE view expands PE internals
- Topology remains parameterized; sizes/counts/links come from configuration.

## Links

- SPEC R3/R5
- ADR-0005 (diagram views)

@@ -0,0 +1,64 @@
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee

## Status

Accepted

## Context

Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.

## Decision

### D1. Local HBM definition

- Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s DMA path
  via the XBAR (top or bottom, depending on PE corner placement).
- The path is: PE_DMA → XBAR.top/bottom → HBM_CTRL.
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.

### D2. Local HBM bandwidth guarantee contract

- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth
  independent of intervening fabric bandwidth limits.
- This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
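
A deliberately small sketch of the D2 contract, assuming a service model that picks the bandwidth per path kind. The function name and the `"local"` label are hypothetical; the point is only that the fabric limit is not applied on the local path.

```python
def effective_bw(path_kind: str, hbm_bw: float, fabric_bw: float) -> float:
    """Bandwidth seen by a PE, in the same unit as the inputs.

    "local" models the D2 guarantee: the fabric limit is not applied.
    Any other kind is capped by the slower of HBM and fabric.
    """
    if path_kind == "local":
        return hbm_bw
    return min(hbm_bw, fabric_bw)
```

A test for the local-HBM case can vary `fabric_bw` and assert that the local result does not change.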

### D3. Cross-half HBM semantics

- A PE connected to XBAR.bottom that accesses HBM pseudo-channels on the XBAR.top half
  (or vice versa) traverses a bridge:
  - PE_DMA → XBAR.bottom → bridge → XBAR.top → HBM_CTRL
- Bridge bandwidth may limit cross-half HBM access relative to local-half access.
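
The local-half and cross-half paths from D1 and D3 can be written down as explicit node sequences. The helper below is a sketch; the function name and the string node labels are assumptions that mirror the names used in this ADR.

```python
def hbm_path(pe_half: str, channel_half: str) -> list[str]:
    """Node sequence from PE_DMA to HBM_CTRL within one cube."""
    halves = {"top", "bottom"}
    if pe_half not in halves or channel_half not in halves:
        raise ValueError("half must be 'top' or 'bottom'")
    path = ["PE_DMA", f"XBAR.{pe_half}"]
    if pe_half != channel_half:
        # Cross-half access goes through the bridge to the other XBAR.
        path += ["bridge", f"XBAR.{channel_half}"]
    return path + ["HBM_CTRL"]
```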

### D4. Non-local HBM semantics (inter-cube / inter-SIP)

- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
  - NOC bandwidth within the cube,
  - inter-cube UCIe links,
  - inter-SIP fabric (PCIe/UAL).
- These paths MUST be explicit and traceable.

### D5. Shared SRAM semantics

- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
- Access path: PE_DMA → NOC → shared SRAM.
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.

## Verification Notes

Tests should cover:

- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- cross-half HBM case: latency includes bridge traversal
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
- shared SRAM case: access via NOC with correct BW

## Links

- SPEC R2/R5
- ADR-0002 (distance/order & explicit bypass)

@@ -0,0 +1,186 @@
# ADR-0005: Diagram Views & Distance-Aware Layout Rules

## Status

Accepted

## Context

We require verifiable and inspectable system modeling for a large-scale,
parameterized AI Accelerator system.

Humans must be able to:

- visually inspect the modeled topology,
- reason about communication structure and relative distance,
- do so at multiple abstraction levels without being overwhelmed by detail.

The simulator models distance (accumulated latency) as a first-class concept.
Diagrams must reflect this distance by default.

---

## Global Defaults

- All diagrams MUST be **distance-aware by default**.
- All diagrams MUST render **representative views** of the architecture.
- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
- Instance indices MAY be used ONLY:
  - to define a distance anchor in asymmetric or debugging scenarios, or
  - when explicitly requested.

---

## Representative Rendering Rule

- All CUBEs share the same internal structure.
- All PEs share the same internal structure.

Therefore:

- SIP-level diagrams render representative CUBEs and IO chiplets.
- CUBE-level diagrams render representative PEs as opaque blocks.
- PE-level diagrams render a representative PE with fully expanded internals.

Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
unless explicitly requested.

---

## Diagram Views

### View A: SIP-Level Diagram

**Purpose**
Explain system-scale structure and connectivity.

**Visible elements**

- SIP boundaries (optional)
- CUBEs (opaque blocks)
- IO chiplets (opaque blocks)
- Optional UCIe stubs only if needed to clarify connectivity

**Hidden elements**

- PE internals
- CUBE internal fabric
- IO chiplet internals

**Visible links**

- Host ↔ IO chiplets (PCIe)
- SIP ↔ SIP (PCIe / UAL via switches)
- IO ↔ CUBE (on-package links)

---

### View B: CUBE-Level Diagram

**Purpose**
Explain cube-internal structure and data/control flow.

**Visible elements**

- XBAR (top/bottom): HBM pseudo-channel crossbar
- Bridge (left/right): cross-half HBM connectors between XBAR.top and XBAR.bottom
- NOC: distributed on-die fabric for non-HBM traffic
- HBM subsystem (HBM_CTRL)
- Shared SRAM: cube-level shared memory
- Management CPU (M_CPU)
- PEs as opaque blocks (PE[0..N−1])
- UCIe endpoints (N/E/W/S) as ports

**Hidden elements**

- PE internals

**Visible links**

- PE → XBAR (HBM data path, top or bottom by corner placement)
- PE → NOC (non-HBM data path)
- XBAR ↔ bridge ↔ XBAR (cross-half HBM access)
- XBAR → HBM_CTRL
- NOC ↔ UCIe endpoints
- NOC ↔ shared SRAM
- M_CPU ↔ NOC (command path)
- NOC → PE_CPU (command delivery, collapsed into PE block)

---

### View C: PE-Level Diagram

**Purpose**
Explain internal PE behavior and execution structure.

**Visible elements**

- PE_CPU
- Command handler / scheduler
- PE_TCM (local SRAM)
- HW accelerators (DMA, GEMM, MATH, etc.)
- Local HBM interface
- Optional IPCQ / messaging endpoints

**Visible links**

- Control paths (CPU → scheduler → engines)
- Data paths (engines ↔ TCM, DMA ↔ local HBM)
- External fabric ports as abstract ports only

---

## Distance-Aware Layout (Default)

### Distance definition

- Distance is defined as **accumulated latency**, consistent with ADR-0002.
- Distance is computed from a single anchor node.

### Default anchor selection

- SIP view: IO chiplet (or Host CPU if present)
- CUBE view: a representative PE
- PE view: PE_CPU or Command Handler

Anchors are **implicit defaults** and MUST NOT be required to be specified.

### Layout rules

- Diagrams MUST be laid out in layers based on distance buckets.
- Layout direction MUST be consistent within a view type
  (preferred: left-to-right).
- Nodes with equal distance MUST have stable ordering
  (by role or identifier, deterministically).

Cycles MAY be rendered using dashed or curved edges for readability,
without affecting distance semantics.
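
The layering rule can be sketched as a bucketing step followed by a stable sort. The bucket width `bucket_ns` and the choice to order by identifier are assumptions; the ADR also allows ordering by role.

```python
from collections import defaultdict


def layout_layers(distances, bucket_ns):
    """Group nodes into layers by distance bucket, sorted within each layer.

    distances maps node id -> accumulated latency from the anchor, in ns.
    """
    if bucket_ns <= 0:
        raise ValueError("bucket_ns must be positive")
    buckets = defaultdict(list)
    for node, dist in distances.items():
        buckets[int(dist // bucket_ns)].append(node)
    return [sorted(buckets[k]) for k in sorted(buckets)]


layers = layout_layers(
    {"PE": 0.0, "XBAR.top": 4.0, "NOC": 4.0, "HBM_CTRL": 12.0}, bucket_ns=4.0
)
```

Here the anchor `PE` lands in the first layer, `NOC` and `XBAR.top` share the second layer in identifier order, and `HBM_CTRL` is placed last. Empty buckets are skipped, so layers stay contiguous in the rendered diagram.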

---

## Generation Contract (for Tools / Claude Code)

When generating diagrams:

- Assume distance-aware layout by default.
- Assume representative rendering by default.
- Do NOT ask for SIP/CUBE/PE indices unless required.
- Do NOT expand hidden abstraction levels.
- Prefer architectural clarity over micro-hop fidelity.

---

## Consequences

- Diagrams are stable across topology scaling.
- Changes in distance or routing policy are reflected visually.
- Diagrams serve as verifiable artifacts derived from the simulator model,
  not as hand-maintained documentation.

---

## Links

- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Routing distance semantics)
- ADR-0006 (Topology compilation & automatic diagram generation)

@@ -0,0 +1,130 @@
# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation

## Status

Accepted

## Context

The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
and computes routing and accumulated latency (distance).
Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
hand-maintained topology drawings.

Additionally, for usability, diagrams should be emitted automatically into a stable location
so that developers can preview them immediately in the repository.

---

## Decision

### D1. Topology compilation is the single source of truth

- topology.yaml (or equivalent config) is compiled into:
  - an explicit system graph,
  - node/link attributes,
  - routing policies.

This compiled graph is the authoritative representation of the system.

### D2. Distance extraction during compilation

- During or immediately after topology compilation, the simulator MUST compute distance metadata
  (accumulated latency) consistent with ADR-0002.
- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
  layout placement for such nodes uses explicit position metadata rather than distance buckets.

### D3. Diagram generation is a derived artifact

- Diagrams MUST be generated from:
  - the compiled topology graph,
  - extracted distance metadata,
  - view/layout rules defined in ADR-0005.
- Diagram generation MUST NOT require additional hand-written topology descriptions.

### D4. Automatic diagram emission to the repository

- As part of topology compilation, the implementation MUST produce the following diagrams by default:
  - SIP-level diagram (representative, distance-aware)
  - CUBE-level diagram (representative, distance-aware)
  - PE-level diagram (representative, distance-aware)
- The default output directory is:
  - `docs/diagrams/`
- The generator MUST overwrite or update existing diagram files only when the compiled topology or the diagram rules change.

### D5. View-specific projection and layout

For each view (SIP / CUBE / PE):

- The generator MUST project the compiled graph into a reduced view graph:
  - hide/collapse nodes according to ADR-0005,
  - preserve connectivity semantics relevant to that view,
  - compute distance buckets and assign layout layers deterministically.
- CUBE-level projection MUST include:
  - XBAR (top/bottom), bridge (left/right), NOC, HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
    and PEs as opaque blocks.
  - Distinct edge kinds for HBM path (PE→XBAR) vs non-HBM path (PE→NOC).
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.

### D6. Output formats and determinism

- The generator MUST output at least one of:
  - Mermaid (Markdown-native)
  - Graphviz DOT (rank-based control)
  - SVG (mm-accurate layout, no external dependencies)
- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
- Output MUST be deterministic:
  - same topology + same rules → identical diagram text
- File naming MUST be deterministic and stable (see "Output Conventions").
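
Deterministic text output mostly comes down to never iterating in insertion or hash order. The sketch below emits Mermaid flowchart text from an edge list; the function name and the node ids are hypothetical, and the ids are kept to letters and underscores so they are valid Mermaid identifiers.

```python
def to_mermaid(edges, direction="LR"):
    """Render (src, dst) edges as Mermaid text with sorted, de-duplicated lines."""
    lines = [f"flowchart {direction}"]
    for src, dst in sorted(set(edges)):
        lines.append(f"    {src} --> {dst}")
    return "\n".join(lines) + "\n"


text = to_mermaid([("PE", "XBAR_TOP"), ("PE", "NOC"), ("PE", "NOC")])
```

The same edge set in any input order yields byte-identical text, which keeps diagram diffs meaningful.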

### D7. Performance and caching

- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
  remain consistent with the compiled topology.
- The implementation SHOULD use a cache key based on:
  - topology content hash,
  - routing policy version,
  - diagram rules version,
  - view type (SIP/CUBE/PE).
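
A cache key covering the four inputs in D7 could be derived as follows. This is a sketch under assumptions: the function name, the lowercase view labels, and the use of SHA-256 over a canonical JSON payload are illustrative choices.

```python
import hashlib
import json


def diagram_cache_key(topology_text, policy_version, rules_version, view):
    """SHA-256 over the four inputs listed in D7, serialized canonically."""
    if view not in ("sip", "cube", "pe"):
        raise ValueError(f"unknown view: {view}")
    payload = json.dumps(
        {
            "topology": hashlib.sha256(topology_text.encode("utf-8")).hexdigest(),
            "policy": policy_version,
            "rules": rules_version,
            "view": view,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing with `sort_keys=True` keeps the key stable regardless of dictionary construction order, and any change to one of the four inputs produces a different key.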

---

## Output Conventions

### Directory

- `docs/diagrams/` is the canonical output directory for generated diagrams.

### File names (recommended, deterministic)

- `system_view.svg` / `system_view.mmd` / `system_view.dot`
- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`

Optionally, for multi-topology workflows:

- `sip_view__{topology_id}.svg`
- `cube_view__{topology_id}.svg`
- `pe_view__{topology_id}.svg`

### Repository policy

- Generated diagram files MAY be committed to the repository to enable diff-based review.
- If committed, they MUST be reproducible from topology compilation.

---

## Consequences

- Diagrams are always consistent with simulator behavior.
- Architectural changes automatically propagate to visualizations.
- Diagram diffs become meaningful indicators of architectural change.

---

## Links

- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Distance semantics)
- ADR-0005 (Diagram views and layout rules)

@@ -0,0 +1,89 @@
# ADR-0007: Runtime API and Simulation Engine Boundaries

## Status

Accepted

## Context

The simulator consists of multiple layers with distinct responsibilities:

- a host-facing API layer used by benchmarks and user code,
- a discrete-event simulation engine that executes requests,
- device components that model hardware behavior.

Without strict boundaries, orchestration logic can leak into components,
or simulation internals can become entangled with user-facing APIs.

This ADR defines clear responsibility boundaries between:

- runtime API,
- simulation engine (sim_engine),
- hardware components.

---

## Decision

### D1. Runtime API is host-facing orchestration only

The runtime API represents host/driver-level behavior and MUST:

- expose high-level operations (tensor deployment, kernel launch),
- submit requests only to endpoint components (e.g., IO_CPU),
- await completion via futures/handles,
- own and persist host-side metadata (tensor allocation maps, kernel bindings).

The runtime API MUST NOT:

- hardcode hop-by-hop routing or fan-out,
- directly invoke internal components (M_CPU, PE_CPU, engines),
- embed topology- or routing-specific assumptions.
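
The boundary in D1 can be illustrated with an injected endpoint and a future. `RuntimeApi`, `FakeIoCpu`, and the request dictionary shape are hypothetical stand-ins; a real endpoint would hand the request to the simulation engine instead of completing it immediately.

```python
from concurrent.futures import Future


class FakeIoCpu:
    """Stand-in endpoint; a real one would hand the request to sim_engine."""

    def submit(self, request: dict) -> Future:
        fut: Future = Future()
        fut.set_result({"ok": True, "op": request["op"]})
        return fut


class RuntimeApi:
    """Host-facing layer: builds requests and talks to the endpoint only."""

    def __init__(self, endpoint):
        self._endpoint = endpoint   # injected, never an internal component

    def launch_kernel(self, name: str) -> Future:
        return self._endpoint.submit({"op": "KernelLaunch", "kernel": name})


result = RuntimeApi(FakeIoCpu()).launch_kernel("gemm").result()
```

The runtime API holds no reference to M_CPU, PE_CPU, or the graph, so topology changes cannot reach user code through it.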

---

### D2. Simulation engine executes and schedules requests

The simulation engine (sim_engine) MUST:

- inject requests into the compiled topology graph,
- schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking,
- decompose operations into low-level requests when required
  (e.g., MemoryWrite events).

The simulation engine MUST NOT:

- define tensor semantics,
- define kernel execution policies,
- expose internal graph details to the runtime API.

---

### D3. Components own fan-out and aggregation

Device-side components MUST:

- fan out requests to downstream domains
  (IO_CPU → M_CPU → PE_CPU → schedulers/engines),
- aggregate completion and failure signals,
- propagate results deterministically upstream.

Neither the runtime API nor the simulation engine may orchestrate
component-level fan-out explicitly.
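
A minimal sketch of component-owned fan-out and aggregation, assuming a simple synchronous tree. The `Component` class and its `fails` flag are hypothetical; in the simulator this would be driven by events rather than direct calls.

```python
class Component:
    """A device-side node that forwards a request to its children."""

    def __init__(self, name, children=(), fails=False):
        self.name = name
        self.children = list(children)
        self.fails = fails

    def handle(self, request):
        """Return (ok, failed_names); children are visited in list order."""
        if not self.children:
            return (not self.fails, [self.name] if self.fails else [])
        failed = []
        for child in self.children:
            _, child_failed = child.handle(request)
            failed.extend(child_failed)
        return (not failed, failed)


pe_cpus = [Component("PE_CPU0"), Component("PE_CPU1", fails=True)]
io_cpu = Component("IO_CPU", [Component("M_CPU", pe_cpus)])
ok, failed = io_cpu.handle({"op": "KernelLaunch"})
```

The caller sees a single result from `IO_CPU`, and the failure list is ordered by the fixed child order, so aggregation is deterministic.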

---

## Consequences

- Runtime APIs remain stable as topology and routing evolve.
- Simulation internals can change without affecting user-facing code.
- Component implementations remain swappable via DI.

---

## Links

- SPEC R4, R7, R8
- ADR-0008 (Tensor deployment)
- ADR-0009 (Kernel execution)

@@ -0,0 +1,100 @@
# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)

## Status

Accepted

## Context

Benchmarks require PyTorch-like tensor semantics:

- tensor creation (empty, fill),
- deployment to accelerator devices (tensor.to()).

In a real system, host software manages allocation/mapping and installs
mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):

- device memory operations use PA only,
- VA/MMU/IOMMU is not modeled.

To keep the host↔device interface minimal, we avoid a separate
AllocateTensorMeta message. Instead, host allocation produces a PA shard map
that is used directly by MemoryWrite/Read and KernelLaunch.

---

## Decision

### D1. Tensor is a host-owned handle with PA shard mapping

A Tensor object is a host-owned handle that encapsulates:

- shape and dtype,
- initialization intent,
- device placement and allocation metadata as a PA shard map.

After deployment, the Tensor handle MUST contain:

- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).

This PA shard mapping is the single source of truth for kernel argument binding.
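
The shard tuple in D1 maps naturally onto a small record type. In this sketch the class names `Shard` and `TensorHandle`, the `dtype` string, and the helper method are assumptions; only the six shard fields come from the ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int             # physical address of the shard
    nbytes: int
    offset_bytes: int   # offset of this shard within the logical tensor


@dataclass(frozen=True)
class TensorHandle:
    shape: tuple
    dtype: str
    shards: tuple = ()  # empty until deployed

    def deployed_bytes(self) -> int:
        return sum(s.nbytes for s in self.shards)


# A 4 x 256 fp16 tensor (2048 bytes) split across two PEs.
t = TensorHandle(
    shape=(4, 256), dtype="fp16",
    shards=(Shard(0, 0, 0, 0x1000, 1024, 0), Shard(0, 0, 1, 0x1000, 1024, 1024)),
)
```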

---

### D2. Deployment uses a host allocator (Phase 0)

In Phase 0, tensor deployment produces PA shard mappings via a host allocator:

- placement (split/replicate/hybrid) is decided by a DP policy,
- allocation assigns PA ranges at the PE level and returns shard mappings,
- the Tensor handle stores the resulting shard list deterministically.

No separate host-visible device allocation RPC is required in Phase 0.
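
For the split placement case, a host allocator could be as simple as a per-PE bump pointer. The function below is a sketch under that assumption; the name `allocate_split`, the even-split rule, and the starting address of 0 are illustrative and not taken from the ADR.

```python
def allocate_split(nbytes, pes, next_free):
    """Split nbytes evenly across PEs using a per-PE bump pointer.

    pes is a list of (sip, cube, pe) tuples; next_free maps each tuple to the
    next unused PA and is updated in place. Returns shard tuples in PE order,
    each shaped (sip, cube, pe, pa, nbytes, offset_bytes).
    """
    if not pes or nbytes % len(pes) != 0:
        raise ValueError("nbytes must divide evenly across a non-empty PE list")
    per_pe = nbytes // len(pes)
    shards = []
    for i, key in enumerate(pes):
        pa = next_free.get(key, 0)
        next_free[key] = pa + per_pe
        shards.append((*key, pa, per_pe, i * per_pe))
    return shards


free = {}
first = allocate_split(2048, [(0, 0, 0), (0, 0, 1)], free)
second = allocate_split(512, [(0, 0, 0)], free)
```

The second allocation on PE `(0, 0, 0)` starts where the first one ended, and the shard order follows the PE list, so repeated runs produce the same mapping.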

---

### D3. Data initialization and transfer use MemoryWrite/Read only

Any data initialization or transfer implied by a tensor (e.g., fill, copy)
MUST be represented using Host ↔ IO_CPU messages only:

- MemoryWrite
- MemoryRead

Rules:

- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
- Allocation metadata MUST NOT be sent as a separate allocation message.
- Bulk tensor data MUST NOT be embedded in Phase 0 messages.

The simulation engine schedules MemoryWrite/Read through the graph so that
latency is computed by explicit traversal.
|
||||
|
||||
---
|
||||
|
||||
### D4. Extension path (non-breaking)
|
||||
|
||||
Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
|
||||
|
||||
- virtual addressing in tensor handles,
|
||||
- mapping install steps,
|
||||
- translation latency/page granularity.
|
||||
|
||||
The Phase 0 PA shard map remains a valid fast-path configuration.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
|
||||
- KernelLaunch can pass per-PE data placement explicitly via shard tags.
|
||||
- Early implementation stays simple and testable.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0011 (PA-first)
|
||||
- ADR-0012 (Host↔IO_CPU schema)
|
||||
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||
- ADR-0009 (Kernel execution)
|
||||
@@ -0,0 +1,74 @@
|
||||
# ADR-0009: Kernel Execution Messaging and Completion Semantics
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Kernel execution is initiated by the host and proceeds through
|
||||
device control components:
|
||||
|
||||
Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
|
||||
|
||||
Completion propagates in reverse order.
|
||||
|
||||
To keep benchmarks simple and topology-agnostic,
|
||||
kernel execution must be endpoint-driven with deterministic aggregation.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Kernel launch is an endpoint request
|
||||
|
||||
A kernel launch is initiated by submitting a single KernelLaunch request
|
||||
to the IO_CPU endpoint.
|
||||
|
||||
The runtime API MUST:
|
||||
|
||||
- construct the kernel launch request,
|
||||
- submit it to IO_CPU,
|
||||
- await a single completion result.
|
||||
|
||||
The runtime API MUST NOT orchestrate internal fan-out.
|
||||
|
||||
---
|
||||
|
||||
### D2. Tensor arguments are passed by metadata
|
||||
|
||||
KernelLaunch requests MUST reference tensor arguments via:
|
||||
|
||||
- host-owned tensor handles, or
|
||||
- resolved device address maps derived from those handles.
|
||||
|
||||
Bulk tensor data MUST NOT be embedded in kernel launch messages.
|
||||
|
||||
---
|
||||
|
||||
### D3. Fan-out and aggregation are component responsibilities
|
||||
|
||||
- IO_CPU fans out work to M_CPUs.
|
||||
- M_CPU fans out work to PE_CPUs.
|
||||
- PE_CPU manages kernel execution and engine dispatch.
|
||||
|
||||
Completion semantics:
|
||||
|
||||
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
|
||||
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
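
One way to express these completion semantics is sketched below. The policy names `fail_fast` and `collect_all` are borrowed from the `failure_policy` field in ADR-0012 D7. The function name and the result shape are assumptions made for illustration.

```python
def aggregate(child_results, failure_policy="fail_fast"):
    """Aggregate child completions in a fixed order.

    child_results: ordered list of (child_id, ok, error_code).
    """
    errors = []
    for child_id, ok, error_code in child_results:
        if not ok:
            errors.append((child_id, error_code))
            if failure_policy == "fail_fast":
                break  # stop at the first failure in child order
    return {"ok": not errors, "errors": errors}
```

Iterating children in a fixed order, rather than in arrival order, is what keeps the aggregated result deterministic.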
|
||||
|
||||
---
|
||||
|
||||
### D4. Completion and failure propagation
|
||||
|
||||
- All messages MUST carry correlation identifiers.
|
||||
- Completion and failure MUST propagate deterministically to the host.
|
||||
- The simulation engine provides futures/handles to observe completion.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- SPEC R1, R2, R7, R8
|
||||
- ADR-0007 (Runtime API boundaries)
|
||||
- ADR-0008 (Tensor deployment)
|
||||
@@ -0,0 +1,62 @@
|
||||
# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Benchmarks represent device-agnostic workloads that operate on a single device.
|
||||
Users may want to run a benchmark:
|
||||
|
||||
- on a specific device, or
|
||||
- across all devices in the system.
|
||||
|
||||
Device enumeration must not leak into benchmarks or runtime APIs.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Benchmarks are single-device by design
|
||||
|
||||
- A benchmark MUST define behavior for a single device only.
|
||||
- A benchmark MUST accept a device identifier as input.
|
||||
- Benchmarks MUST NOT enumerate or loop over multiple devices.
|
||||
|
||||
---
|
||||
|
||||
### D2. CLI controls device selection
|
||||
|
||||
The `kernbench run` command supports an optional `--device` argument:
|
||||
|
||||
- If `--device <id>` is specified:
  - the benchmark executes once for the specified device.
- If `--device` is omitted:
  - the benchmark executes once using all the SIPs discovered in the topology.
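
A sketch of the selection rule using `argparse` from the standard library. The positional `benchmark` argument and the `select_devices` helper are assumptions; only `kernbench run` and `--device` come from this ADR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="kernbench")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("benchmark")  # hypothetical positional argument
    run.add_argument("--device", default=None, help='device id, e.g. "sip:0"')
    return parser


def select_devices(args, discovered):
    """Return the devices to run on; enumeration stays in the CLI layer."""
    if args.device is not None:
        return [args.device]
    return list(discovered)
```

Keeping `select_devices` in the CLI means the benchmark itself only ever receives a device identifier and never enumerates devices.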
|
||||
|
||||
---
|
||||
|
||||
### D3. Multi-device execution is logically parallel
|
||||
|
||||
When running on multiple devices:
|
||||
|
||||
- benchmark executions are submitted to a single simulation engine instance,
|
||||
- executions are logically parallel in simulation time,
|
||||
- inter-device contention is naturally modeled.
|
||||
|
||||
---
|
||||
|
||||
### D4. Runtime API and simulation engine remain device-scoped
|
||||
|
||||
- Runtime API calls operate on one device per invocation.
|
||||
- The simulation engine schedules all requests deterministically.
|
||||
- Neither layer enumerates devices.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- SPEC R7, R8
|
||||
- ADR-0007 (Runtime API boundaries)
|
||||
@@ -0,0 +1,65 @@
|
||||
# ADR-0011: Memory Addressing Simplification (PA-first)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
|
||||
translation path for DMA: host allocates physical memory at PE level, maps it
|
||||
into a virtual address space, installs mappings, and DMA requests use virtual
|
||||
addresses that are translated to physical addresses.
|
||||
|
||||
For early development, we want a minimal, deterministic model that enables:
|
||||
|
||||
- correct routing and latency accounting through the graph,
|
||||
- stable tensor deployment and kernel execution semantics,
|
||||
- future extension toward VA/MMU without rewriting workflows.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Phase 0 model is PA-only
|
||||
|
||||
The simulator uses a PA-first model:
|
||||
|
||||
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
  addresses (PA) plus size.
- Tensor handles store PA-based shard mappings after deployment.
- KernelLaunch passes tensor arguments as PA-based mappings (or references to them).
- MMU/IOMMU concepts (virtual address spaces, page tables, translation latency)
  are NOT modeled in Phase 0.
|
||||
|
||||
### D2. Allocation produces PA mappings
|
||||
|
||||
Device allocation selects PE-local memory regions and returns PA mappings
|
||||
sufficient to execute kernels and issue DMA requests.
|
||||
|
||||
### D3. Extension path (non-breaking)
|
||||
|
||||
A future ADR MAY introduce an optional VA/MMU layer by:
|
||||
|
||||
- introducing virtual addresses in tensor handles,
|
||||
- adding a mapping-install step,
|
||||
- modeling translation latency and page granularity.
|
||||
|
||||
The Phase 0 PA model remains a valid fast-path configuration.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
- Early implementation stays simple and testable.
|
||||
- All latency remains explicit via graph traversal, not hidden translation.
|
||||
- Future VA/MMU modeling can be added without breaking existing benchmarks.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||
- ADR-0008 (tensor deployment)
|
||||
- ADR-0009 (kernel execution)
|
||||
- SPEC R2 (latency by traversal)
|
||||
@@ -0,0 +1,232 @@
|
||||
# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Phase 0 uses a PA-first memory model (ADR-0011):
|
||||
|
||||
- memory operations use device physical addresses (PA) only,
|
||||
- VA/MMU/IOMMU is not modeled.
|
||||
|
||||
The host-facing runtime API interacts with the device via the IO_CPU endpoint.
|
||||
We define stable, minimal message schemas for Host ↔ IO_CPU so that:
|
||||
|
||||
- benchmarks remain stable,
|
||||
- IO_CPU-internal fan-out/aggregation can evolve independently,
|
||||
- completion and failure propagation is deterministic.
|
||||
|
||||
We also require PE-tagging (Approach A): each shard explicitly carries (sip,cube,pe)
|
||||
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Contract scope
|
||||
|
||||
This schema is the stable contract ONLY for Host ↔ IO_CPU.
|
||||
|
||||
Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
|
||||
and are NOT part of this host contract in Phase 0.
|
||||
|
||||
---
|
||||
|
||||
### D2. Required message set
|
||||
|
||||
The runtime API MUST use only these message types for Host ↔ IO_CPU:
|
||||
|
||||
- MemoryWrite
|
||||
- MemoryRead
|
||||
- KernelLaunch
|
||||
|
||||
All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
|
||||
with these messages.
|
||||
|
||||
---
|
||||
|
||||
### D3. Common envelope (mandatory for all requests)
|
||||
|
||||
All Host ↔ IO_CPU requests MUST include:
|
||||
|
||||
- `msg_type: str`
- `correlation_id: str`
  - generated by the host
  - used to match responses deterministically
- `request_id: str`
  - unique within a correlation_id
- `target_device: str`
  - device identifier (e.g., "sip:0")
- `timestamp_tag: str | None` (optional)
  - debug tag only; MUST NOT affect determinism
|
||||
|
||||
All Host ↔ IO_CPU responses MUST include:
|
||||
|
||||
- `correlation_id: str`
|
||||
- `request_id: str`
|
||||
- `completion: Completion`
|
||||
|
||||
---
|
||||
|
||||
### D4. Completion schema (mandatory)
|
||||
|
||||
`Completion` MUST have:
|
||||
|
||||
- `ok: bool`
|
||||
- `error_code: str | None`
|
||||
- `error_message: str | None`
|
||||
|
||||
Rules:
|
||||
|
||||
- If `ok == true` then `error_code` and `error_message` MUST be null.
|
||||
- If `ok == false` then `error_code` MUST be non-null.
|
||||
- Completion semantics MUST be deterministic.
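
The first two rules can be enforced at construction time. The following is a minimal sketch; using a frozen dataclass and raising `ValueError` are implementation choices assumed here, not requirements of the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Completion:
    ok: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None

    def __post_init__(self):
        has_error_fields = self.error_code is not None or self.error_message is not None
        if self.ok and has_error_fields:
            raise ValueError("ok completions must not carry error fields")
        if not self.ok and self.error_code is None:
            raise ValueError("failed completions require error_code")
```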
|
||||
|
||||
---
|
||||
|
||||
### D5. MemoryWrite schema (PA-first, PE-tagged)
|
||||
|
||||
`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
|
||||
|
||||
Mandatory fields:
|
||||
|
||||
- common envelope fields (D3)
|
||||
- destination placement tags (Approach A):
  - `dst_sip: int`
  - `dst_cube: int`
  - `dst_pe: int`
- `dst_pa: int`
  - destination physical address in the destination PE's address space
- `nbytes: int`
- `src_kind: "pattern" | "host_buffer_ref"`
  - Phase 0 MUST support "pattern"
- `pattern: Pattern | None`
  - required if `src_kind == "pattern"`
|
||||
|
||||
`Pattern` (Phase 0 mandatory support):
|
||||
|
||||
- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
- `value: number | None`
  - required for fill_*; ignored for zero
|
||||
|
||||
Optional fields:
|
||||
|
||||
- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
|
||||
- `debug_label: str | None`
|
||||
|
||||
Notes:
|
||||
|
||||
- This message MUST NOT embed bulk tensor data in Phase 0.
|
||||
- All latency MUST come from explicit graph traversal and modeled components.
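
A sketch of a validator for the mandatory MemoryWrite fields listed above. It assumes the message is carried as a plain `dict`; the function name and the wording of the problem strings are illustrative only.

```python
REQUIRED = ("msg_type", "correlation_id", "request_id", "target_device",
            "dst_sip", "dst_cube", "dst_pe", "dst_pa", "nbytes", "src_kind")


def validate_memory_write(msg: dict) -> list[str]:
    """Return a list of problems; an empty list means the message is acceptable."""
    problems = [f"missing field: {name}" for name in REQUIRED if name not in msg]
    if msg.get("src_kind") == "pattern":
        pattern = msg.get("pattern")
        if pattern is None:
            problems.append("pattern is required when src_kind == 'pattern'")
        elif pattern.get("pattern_kind") != "zero" and pattern.get("value") is None:
            problems.append("value is required for fill_* patterns")
    return problems
```

Returning all problems at once, in field order, keeps rejection messages stable across runs.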
|
||||
|
||||
---
|
||||
|
||||
### D6. MemoryRead schema (PA-first, PE-tagged)
|
||||
|
||||
`MemoryRead` represents a host-initiated read from device memory.
|
||||
|
||||
Mandatory fields:
|
||||
|
||||
- common envelope fields (D3)
|
||||
- source placement tags (Approach A):
  - `src_sip: int`
  - `src_cube: int`
  - `src_pe: int`
- `src_pa: int`
- `nbytes: int`
|
||||
|
||||
Optional fields:
|
||||
|
||||
- `dst_kind: "host_sink" | "discard"` (default "host_sink")
|
||||
- `debug_label: str | None`
|
||||
|
||||
Response payload:
|
||||
|
||||
- actual bytes are NOT required in Phase 0 (latency/traces focus)
|
||||
- implementations MAY return lightweight stats or hashes later via a new ADR
|
||||
|
||||
---
|
||||
|
||||
### D7. KernelLaunch schema (PA-first, PE-tagged shards)
|
||||
|
||||
`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
|
||||
|
||||
Mandatory fields:
|
||||
|
||||
- common envelope fields (D3)
|
||||
- `kernel_ref: KernelRef`
|
||||
- `args: list[KernelArg]`
|
||||
|
||||
`KernelRef` MUST have:
|
||||
|
||||
- `name: str`
|
||||
- `kind: "deployed" | "builtin"`
|
||||
- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
|
||||
- `deploy_sip: int` — SIP where binary resides
|
||||
- `deploy_cube: int` — cube where binary resides
|
||||
- `deploy_pe: int` — PE where binary resides
|
||||
- `nbytes_code: int` — kernel binary size (for BW modeling)
|
||||
|
||||
Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
|
||||
KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
|
||||
|
||||
`KernelArg` supports tensor args by PA mapping and scalars by value.
|
||||
|
||||
Tensor arg (mandatory):
|
||||
|
||||
- `arg_kind: "tensor"`
|
||||
- `tensor_pa_map: TensorPAMap`
|
||||
|
||||
`TensorPAMap` MUST have:
|
||||
|
||||
- `shards: list[TensorShard]`
|
||||
|
||||
`TensorShard` MUST have (Approach A, enforced):
|
||||
|
||||
- `sip: int`
|
||||
- `cube: int`
|
||||
- `pe: int`
|
||||
- `pa: int`
|
||||
- `nbytes: int`
|
||||
- `offset_bytes: int`
|
||||
|
||||
Scalar arg (mandatory):
|
||||
|
||||
- `arg_kind: "scalar"`
|
||||
- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
|
||||
- `value: number | bool`
|
||||
|
||||
Optional KernelLaunch fields:
|
||||
|
||||
- `grid: dict | None`
|
||||
- `meta: dict | None`
|
||||
- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
|
||||
- `debug_label: str | None`
|
||||
|
||||
Notes:
|
||||
|
||||
- KernelLaunch MUST NOT embed bulk tensor data.
|
||||
- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
|
||||
- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
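
The last note can be illustrated by grouping shards on their placement tags. This is a sketch that assumes `args` is a list of plain dicts shaped like `KernelArg`; the function name is hypothetical, and the real IO_CPU may organize fan-out differently.

```python
from collections import defaultdict


def fan_out_targets(args):
    """Group tensor shards by (sip, cube, pe), rejecting shards without tags."""
    targets = defaultdict(list)
    for arg in args:
        if arg.get("arg_kind") != "tensor":
            continue  # scalars carry no placement
        for shard in arg["tensor_pa_map"]["shards"]:
            missing = [k for k in ("sip", "cube", "pe") if k not in shard]
            if missing:
                raise ValueError(f"shard missing placement tags: {missing}")
            targets[(shard["sip"], shard["cube"], shard["pe"])].append(shard)
    # Sorted keys keep the fan-out order deterministic.
    return dict(sorted(targets.items()))
```

No PA decoding is involved: the routing key comes entirely from the explicit tags.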
|
||||
|
||||
---
|
||||
|
||||
## Verification Notes
|
||||
|
||||
Tests SHOULD validate:
|
||||
|
||||
- schema validation rejects missing mandatory fields,
|
||||
- deterministic correlation/response matching,
|
||||
- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
|
||||
- all routed requests incur latency > 0.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- ADR-0011 (PA-first memory addressing)
|
||||
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||
- ADR-0009 (kernel execution fan-out/aggregation)
|
||||
- SPEC R2, R7, R8
|
||||
@@ -0,0 +1,139 @@
|
||||
# ADR-0013: Verification Strategy and Phase 1 Test Plan
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
KernBench is a system-level simulator whose correctness is defined by:
|
||||
|
||||
- adherence to SPEC-defined invariants,
|
||||
- determinism and debuggability,
|
||||
- explicit modeling of routing and latency.
|
||||
|
||||
Given the evolving implementation, we need a stable verification strategy
|
||||
that prevents architectural drift while allowing incremental development.
|
||||
|
||||
This ADR defines the Phase 1 verification plan and what constitutes
|
||||
"correct behavior" for early implementations.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Verification is contract-based
|
||||
|
||||
Verification MUST be derived from:
|
||||
|
||||
- SPEC requirements,
|
||||
- accepted ADRs.
|
||||
|
||||
Tests MUST validate architectural contracts, not incidental implementation details.
|
||||
|
||||
---
|
||||
|
||||
### D2. Phase 1 verification scope
|
||||
|
||||
Phase 1 verification focuses on:
|
||||
|
||||
- message contract validity (ADR-0012),
|
||||
- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
|
||||
- PA-first memory addressing and shard tagging (ADR-0011),
|
||||
- core latency and trace invariants (SPEC 0.1, R2).
|
||||
|
||||
Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
|
||||
are explicitly out of scope in Phase 1.
|
||||
|
||||
---
|
||||
|
||||
### D3. Required Phase 1 verification cases
|
||||
|
||||
The following verification cases MUST be supported by the implementation:
|
||||
|
||||
#### V1. Message schema validation
|
||||
|
||||
- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
|
||||
- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
|
||||
- Completion results MUST follow the `ok / error_code / error_message` contract.
|
||||
|
||||
#### V2. IO_CPU fan-out and aggregation
|
||||
|
||||
Given:
|
||||
|
||||
- a topology with one SIP, one CUBE, and two PEs,
|
||||
- a KernelLaunch request containing two tensor shards targeting different PEs,
|
||||
|
||||
The system MUST:
|
||||
|
||||
- submit a single KernelLaunch to IO_CPU,
|
||||
- fan-out work internally to both PEs,
|
||||
- aggregate completion and return a single deterministic completion to the host.
|
||||
|
||||
#### V3. Latency and trace invariants
|
||||
|
||||
For any valid request:
|
||||
|
||||
- the hop-by-hop trace MUST be non-empty,
|
||||
- total latency MUST be greater than zero,
|
||||
- repeated runs with identical inputs MUST produce identical traces.
|
||||
|
||||
#### V4. Topology independence and cross-domain coverage
|
||||
|
||||
Verification cases MUST pass for multiple topology shapes, including:
|
||||
|
||||
- minimal: (1 SIP, 1 CUBE, 1 PE)
|
||||
- multi-PE: (1 SIP, 1 CUBE, N PEs)
|
||||
- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
|
||||
- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
|
||||
|
||||
For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
|
||||
|
||||
- explicit connectivity (required links exist),
|
||||
- deterministic routing and control-path traversal,
|
||||
- non-empty traces and latency > 0 for representative cross-domain requests
  (inter-CUBE and inter-SIP paths).
|
||||
|
||||
Tests MUST NOT hardcode topology sizes, node ids, or link counts.
|
||||
Instead, tests MUST derive expectations from the compiled topology metadata.
|
||||
---
|
||||
|
||||
### D4. Phase 1 artifacts
|
||||
|
||||
Phase 1 MAY include:
|
||||
|
||||
- verification-only test code,
|
||||
- topology fixtures,
|
||||
- trace inspection utilities.
|
||||
|
||||
Phase 1 MUST NOT require:
|
||||
|
||||
- production code changes solely to satisfy tests,
|
||||
- weakening or removing tests to allow progress.
|
||||
|
||||
---
|
||||
|
||||
### D5. Phase 2 enforcement
|
||||
|
||||
Phase 2 (Apply) MUST:
|
||||
|
||||
- run the Phase 1 verification cases,
|
||||
- rollback all changes if any verification fails,
|
||||
- preserve tests as authoritative contracts.
|
||||
|
||||
---
|
||||
|
||||
## Consequences
|
||||
|
||||
- Architectural correctness is enforced early.
|
||||
- Tests serve as executable documentation of system behavior.
|
||||
- Implementation remains flexible without losing rigor.
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- SPEC 0.1, R2, R6
|
||||
- ADR-0011 (PA-first memory addressing)
|
||||
- ADR-0012 (Host ↔ IO_CPU message schema)
|
||||
- ADR-0009 (Kernel execution semantics)
|
||||
@@ -0,0 +1,364 @@
|
||||
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
|
||||
|
||||
- the dispatch model inside a PE,
|
||||
- the responsibilities of PE_SCHEDULER,
|
||||
- the PE_TCM-centric dataflow contract used by accelerator engines.
|
||||
|
||||
We need a deterministic and debuggable PE-internal execution contract that supports:
|
||||
|
||||
- simple single-engine commands
|
||||
- composite commands that build a tiled pipeline across DMA and accelerator engines
|
||||
|
||||
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. PE internal component roles
|
||||
|
||||
Each PE contains the following logical components.
|
||||
|
||||
**PE_CPU**
|
||||
|
||||
- Executes kernel instruction stream or kernel control logic.
|
||||
- Generates PE commands.
|
||||
- Submits commands to PE_SCHEDULER.
|
||||
- PE_CPU does NOT enqueue work directly into engine queues.
|
||||
|
||||
**PE_SCHEDULER**
|
||||
|
||||
- The sole dispatcher inside a PE.
|
||||
- Receives commands from PE_CPU.
|
||||
- Expands composite commands into sub-commands.
|
||||
- Tracks dependencies and command state.
|
||||
- Dispatches work to engine queues.
|
||||
- Manages tile scheduling for composite commands.
|
||||
|
||||
**PE_DMA**
|
||||
|
||||
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
  - **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
  - **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
  - HBM → PE_TCM (via XBAR)
  - PE_TCM → HBM (via XBAR)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)
|
||||
|
||||
**PE_GEMM**
|
||||
|
||||
- Matrix multiplication engine.
|
||||
- Reads activations from PE_TCM.
|
||||
- May stream weights directly from HBM.
|
||||
|
||||
**PE_MATH**
|
||||
|
||||
- Element-wise computation engine.
|
||||
- Reads and writes PE_TCM.
|
||||
|
||||
**PE_TCM**
|
||||
|
||||
- Local SRAM used as the staging memory for accelerator operations.
|
||||
|
||||
---
|
||||
|
||||
### D2. Command lifecycle and queues
|
||||
|
||||
PE_SCHEDULER maintains three logical structures.
|
||||
|
||||
**SubmissionQueue**
|
||||
|
||||
- Written by PE_CPU.
|
||||
- Contains incoming PE commands waiting to be processed.
|
||||
|
||||
**InflightTable**
|
||||
|
||||
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status
|
||||
|
||||
**CompletionQueue**
|
||||
|
||||
- Written by PE_SCHEDULER.
|
||||
- Contains final completion records for commands.
|
||||
|
||||
**Single-writer rule**
|
||||
|
||||
- Only PE_SCHEDULER is allowed to mutate command completion state.
|
||||
- Engine components must report completion via explicit completion events/messages.
|
||||
|
||||
**Command completion**
|
||||
|
||||
A command becomes DONE when:

- all of its sub-commands have completed, and
- PE_SCHEDULER has published a completion record to CompletionQueue.
|
||||
|
||||
---
|
||||
|
||||
### D3. Dispatch modes
|
||||
|
||||
PE commands are divided into two categories.
|
||||
|
||||
#### D3.1 Simple command
|
||||
|
||||
A simple command expands to exactly one engine sub-command.
|
||||
|
||||
Examples include:
|
||||
|
||||
- DMA transfer
|
||||
- GEMM compute
|
||||
- MATH compute
|
||||
|
||||
Execution flow:
|
||||
|
||||
```
|
||||
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
|
||||
```
|
||||
|
||||
#### D3.2 Composite command (tiled pipeline)
|
||||
|
||||
Composite commands implement tiled pipelined execution across engines.
|
||||
|
||||
Each tile executes the following pipeline:
|
||||
|
||||
```
|
||||
Input DMA (READ)
|
||||
→ Compute (GEMM or MATH)
|
||||
→ Output DMA (WRITE)
|
||||
```
|
||||
|
||||
**Tiling rule**
|
||||
|
||||
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
|
||||
Each tile is assigned a monotonically increasing `tile_id`.
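
A sketch of the tiling rule. `tile_bytes` stands in for the hardware tile size, which this ADR does not specify, and the function name is hypothetical.

```python
def split_into_tiles(nbytes: int, tile_bytes: int):
    """Return (tile_id, offset, size) tuples; tile_id increases from 0."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return [
        (tile_id, offset, min(tile_bytes, nbytes - offset))
        for tile_id, offset in enumerate(range(0, nbytes, tile_bytes))
    ]
```

A payload that fits in one tile yields a single entry with `tile_id` 0; only the last tile may be shorter than `tile_bytes`.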
|
||||
|
||||
**Tile dependency rules**
|
||||
|
||||
For tile `t`:
|
||||
|
||||
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
|
||||
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
|
||||
- All dependencies are enforced by PE_SCHEDULER.
|
||||
|
||||
**Overlap policy (Phase 0 default)**
|
||||
|
||||
Operations for different tiles may overlap when engine resources permit.
|
||||
|
||||
Allowed overlaps:
|
||||
|
||||
```
|
||||
DMA_READ(t+1) ∥ COMPUTE(t)
|
||||
DMA_WRITE(t−1) ∥ COMPUTE(t)
|
||||
DMA_READ(t+1) ∥ DMA_WRITE(t)
|
||||
```
|
||||
|
||||
Disallowed overlaps:
|
||||
|
||||
```
|
||||
GEMM(t) ∥ GEMM(t′)
|
||||
MATH(t) ∥ MATH(t′)
|
||||
GEMM(t) ∥ MATH(t′)
|
||||
```
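
The combined effect of the dependency and overlap rules can be checked with a small timeline calculation. This sketch assumes fixed per-stage durations, one read channel, one compute slot, one write channel, and enough tile buffers that reads may run ahead of compute. The function and parameter names are hypothetical.

```python
def tile_timeline(num_tiles, t_read, t_compute, t_write):
    """Return per-tile ((read), (compute), (write)) intervals as (start, end)."""
    read_free = compute_free = write_free = 0
    timeline = []
    for _ in range(num_tiles):
        r0 = read_free                 # read channel is serial across tiles
        r1 = r0 + t_read
        c0 = max(r1, compute_free)     # DMA_READ(t) -> COMPUTE(t), one compute slot
        c1 = c0 + t_compute
        w0 = max(c1, write_free)       # COMPUTE(t) -> DMA_WRITE(t), one write channel
        w1 = w0 + t_write
        read_free, compute_free, write_free = r1, c1, w1
        timeline.append(((r0, r1), (c0, c1), (w0, w1)))
    return timeline
```

With three tiles and durations 2, 5, and 1, the last write ends at time 18, compared with 24 for fully serial execution, because reads for later tiles overlap compute on earlier ones.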
|
||||
|
||||
---
|
||||
|
||||
### D4. Engine execution model (Phase 0 default)
|
||||
|
||||
Each engine behaves as a deterministic service resource.
|
||||
|
||||
**DMA engine**
|
||||
|
||||
PE_DMA contains two independent channels.
|
||||
|
||||
```
|
||||
DMA_READ capacity = 1
|
||||
DMA_WRITE capacity = 1
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- DMA_READ and DMA_WRITE may execute concurrently.
|
||||
- Multiple READs cannot overlap.
|
||||
- Multiple WRITEs cannot overlap.
|
||||
|
||||
Example allowed:
|
||||
|
||||
```
|
||||
DMA_READ(t+1) ∥ DMA_WRITE(t)
|
||||
```
|
||||
|
||||
Example not allowed:
|
||||
|
||||
```
|
||||
DMA_READ(t) ∥ DMA_READ(t+1)
|
||||
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
|
||||
```
|
||||
|
||||
**Compute engine**
|
||||
|
||||
Compute operations share a single compute resource.
|
||||
|
||||
```
|
||||
PE_ACCEL capacity = 1
|
||||
```
|
||||
|
||||
Both GEMM and MATH require this shared compute slot.
|
||||
|
||||
Consequences:
|
||||
|
||||
- GEMM ∥ GEMM not allowed
|
||||
- MATH ∥ MATH not allowed
|
||||
- GEMM ∥ MATH not allowed
|
||||
|
||||
Only one compute operation can run in a PE at a time.
|
||||
|
||||
**Compute opcode restriction**
|
||||
|
||||
Composite commands contain one compute opcode only.
|
||||
|
||||
Examples:
|
||||
|
||||
```
|
||||
COMPOSITE_GEMM
|
||||
COMPOSITE_MATH
|
||||
```
|
||||
|
||||
Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
|
||||
|
||||
**Engine completion signaling**
|
||||
|
||||
Every engine emits a completion event when a sub-command finishes.
|
||||
Completion events are delivered to PE_SCHEDULER.
|
||||
|
||||
---
|
||||
|
||||
### D5. Dataflow model
|
||||
|
||||
Compute operations use a TCM-centric dataflow model.
|
||||
|
||||
**Input path (HBM)**
|
||||
|
||||
```
|
||||
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
|
||||
```
|
||||
|
||||
**Input path (shared SRAM)**
|
||||
|
||||
```
|
||||
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||
```
|
||||
|
||||
**Compute stage**
|
||||
|
||||
Compute engines read input tensors from PE_TCM.
|
||||
|
||||
```
|
||||
PE_TCM → GEMM / MATH
|
||||
```
|
||||
|
||||
Weights for GEMM may optionally stream directly from HBM (via XBAR).
|
||||
|
||||
**Output path (HBM)**
|
||||
|
||||
Compute results are written to PE_TCM, then DMA writes to HBM.
|
||||
|
||||
```
|
||||
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
|
||||
```
|
||||
|
||||
**Output path (shared SRAM)**
|
||||
|
||||
```
|
||||
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
|
||||
```
|
||||
|
||||
#### D5.1 PE_TCM partitioning and ownership boundary
|
||||
|
||||
The PE_TCM address space is partitioned into two logical regions.
|
||||
|
||||
**SchedulerReservedTCM**
|
||||
|
||||
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime
|
||||
|
||||
**AllocatableTCM**
|
||||
|
||||
- General-purpose region managed by PEMemAllocator.
|
||||
- Used by host or DP-visible allocations.
|
||||
|
||||
**Visibility rule (hard isolation)**
|
||||
|
||||
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
|
||||
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
|
||||
- This prevents DP or host allocations from interfering with scheduler staging buffers.
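
Exclusion by construction can be sketched as follows: the allocator is only ever handed the allocatable range, so it has no way to return an address inside the reserved region. Placing the reserved region at the start of the TCM, and the names `PEMemAllocatorSketch` and `partition_tcm`, are assumptions made for illustration.

```python
class PEMemAllocatorSketch:
    """Bump allocator that only ever sees the allocatable range."""

    def __init__(self, base: int, size: int):
        self._next, self._end = base, base + size

    def alloc(self, nbytes: int) -> int:
        if self._next + nbytes > self._end:
            raise MemoryError("AllocatableTCM exhausted")
        addr, self._next = self._next, self._next + nbytes
        return addr


def partition_tcm(tcm_size: int, reserved_size: int):
    """Return ((reserved_base, reserved_size), allocator); reserved comes first."""
    if not 0 <= reserved_size <= tcm_size:
        raise ValueError("reserved_size must fit inside the TCM")
    allocator = PEMemAllocatorSketch(base=reserved_size, size=tcm_size - reserved_size)
    return (0, reserved_size), allocator
```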
|
||||
|
||||
**Tile buffer rules**
|
||||
|
||||
Within SchedulerReservedTCM:
|
||||
|
||||
- input buffers and output buffers must not overlap
|
||||
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
|
||||
- tile buffers remain valid until the corresponding DMA_WRITE completes
|
||||
- buffer reuse is allowed only after the tile lifetime finishes
|
||||
|
||||
---
|
||||
|
||||
### D6. Observability and trace contract
|
||||
|
||||
The simulator must emit deterministic trace events.
|
||||
|
||||
Required events include:
|
||||
|
||||
- `command_submitted`
|
||||
- `sub_command_dispatched`
|
||||
- `engine_start`
|
||||
- `engine_complete`
|
||||
- `tile_ready`
|
||||
- `command_complete`
|
||||
|
||||
Trace ordering must be deterministic for identical inputs.
|
||||
|
||||
---
|
||||
|
||||
### D7. Topology representation
|
||||
|
||||
PE internal components are declared in `cube.pe_template`.
|
||||
|
||||
The template is instantiated once per PE.
|
||||
|
||||
PE instances are derived from `cube.pe_layout`.
|
||||
|
||||
External connectivity such as:
|
||||
|
||||
- PE_DMA → XBAR (HBM data path)
|
||||
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
|
||||
- NOC → PE_CPU (command path from M_CPU)
|
||||
|
||||
is modeled at the CUBE level (see ADR-0003 D3).
|
||||
|
||||
---
|
||||
|
||||
## Links
|
||||
|
||||
- SPEC R3, R4
|
||||
- ADR-0003 D4 (PE-level system hierarchy)
|
||||
- ADR-0005 View C (PE-level diagram)
|
||||
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
|
||||
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
|
||||
@@ -0,0 +1,178 @@
|
||||
# ADR-0015: Component Port/Wire Model and Fabric Routing
|
||||
|
||||
## Status
|
||||
|
||||
Proposed
|
||||
|
||||
## Context
|
||||
|
||||
ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
|
||||
In practice, the engine iterates the topology path and calls `run()` on each component
|
||||
sequentially — conflating routing policy with component behavior and preventing realistic
|
||||
hardware modeling (queues, contention, fan-out).
|
||||
|
||||
ADR-0007 D3 already states that components own fan-out and aggregation, but the current
|
||||
implementation does not enforce this for fabric traversal.
|
||||
|
||||
This ADR defines:
|
||||
|
||||
- how components communicate via typed port queues,
|
||||
- how propagation delay is modeled (wire processes),
|
||||
- the fabric path for Memory R/W through M_CPU.DMA,
|
||||
- the reduced role of the simulation engine,
|
||||
- M_CPU.DMA as an internal subcomponent of M_CPU.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
### D1. Component port model
|
||||
|
||||
Each component has typed input/output ports modeled as SimPy Stores:
|
||||
|
||||
```
|
||||
in_ports: dict[str, simpy.Store] # keyed by source node_id
|
||||
out_ports: dict[str, simpy.Store] # keyed by destination node_id
|
||||
```
|
||||
|
||||
Ports are created at engine initialization based on graph edges.
|
||||
Each directed edge (src → dst) results in:
|
||||
|
||||
- `src.out_ports[dst]` — the sending end
|
||||
- `dst.in_ports[src]` — the receiving end
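
A sketch of port creation from graph edges. To stay runnable without SimPy it uses `collections.deque` as a stand-in store; in the simulator the factory would presumably be something like `lambda: simpy.Store(env)`. The function name and the node ids in the usage note are hypothetical.

```python
from collections import deque


def build_ports(edges, make_store=deque):
    """Create in/out port maps for every directed edge (src, dst).

    Returns (in_ports, out_ports), each keyed by node id and then by peer id.
    """
    in_ports, out_ports = {}, {}
    for src, dst in edges:
        out_ports.setdefault(src, {})[dst] = make_store()  # sending end
        in_ports.setdefault(dst, {})[src] = make_store()   # receiving end
    return in_ports, out_ports
```

For edges `[("io_cpu", "noc"), ("noc", "m_cpu")]`, the node `noc` ends up with one in-port keyed by `io_cpu` and one out-port keyed by `m_cpu`. The two ends of an edge are separate stores, which leaves room for a wire process to move items between them.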
|
||||
|
||||
---
|
||||
|
||||
### D2. Wire process (propagation delay)
|
||||
|
||||
For each directed edge (src, dst) in the topology graph, a SimPy wire process
|
||||
models propagation delay:
|
||||
|
||||
```python
|
||||
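def wire_process(env, out_port, in_port, delay_ns):
    """Forward each command from out_port to in_port after delay_ns."""
    while True:
        cmd = yield out_port.get()     # wait for the sender to emit a command
        yield env.timeout(delay_ns)    # model propagation delay on the wire
        yield in_port.put(cmd)         # deliver to the receiving component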
|
||||
```
|
||||
|
||||
Wire processes are started at engine initialization.
|
||||
BW constraints are enforced by the sending component's out_port capacity or token model,
|
||||
not by the wire process itself.
|
||||
|
||||
---
|
||||
|
||||
### D3. Engine role (reduced)

The simulation engine MUST:

- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.

The simulation engine MUST NOT:

- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.

This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
ADR-0007 D2 must be amended accordingly.

---

### D4. Unified fabric path for Memory R/W and Kernel Launch

Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
The difference is what M_CPU does upon receiving the request.

**Forward path (IO_CPU → target M_CPU):**

```
IO_CPU
  → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out]  (zero or more)
  → target cube: ucie_in → noc → M_CPU
```

**At M_CPU (diverges by operation type):**

```
Memory R/W:    M_CPU → M_CPU.DMA → noc → hbm_ctrl
Kernel Launch: M_CPU → PE[0..n]  (parallel fan-out)
```

**Completion path (reverse, same fabric):**

```
Memory R/W:    hbm_ctrl → noc → M_CPU.DMA → M_CPU
Kernel Launch: PE[0..n] all complete → M_CPU  (aggregation)

M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
```
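
One reading of the diagrams above, expressed as hop lists, might look like the sketch below. The function names, the `cube.component` naming scheme, and the `num_pes` parameter are assumptions made for illustration. Wires are treated as edges rather than hops, and the upstream `ucie_out` is omitted for brevity.

```python
def forward_path(transit_cubes: list[str], target_cube: str) -> list[str]:
    """Hops from IO_CPU to the target cube's M_CPU via zero or more transit cubes."""
    hops = ["IO_CPU"]
    for cube in transit_cubes:
        # A transit cube is entered through ucie_in and left through ucie_out.
        hops += [f"{cube}.ucie_in", f"{cube}.noc", f"{cube}.ucie_out"]
    hops += [f"{target_cube}.ucie_in", f"{target_cube}.noc", f"{target_cube}.M_CPU"]
    return hops


def at_m_cpu(op: str, target_cube: str, num_pes: int = 2) -> list[str]:
    """Hops taken after M_CPU, which diverge by operation type."""
    if op == "memory_rw":
        # M_CPU.DMA is internal to M_CPU (D5); it is listed only to mirror the diagram.
        return [f"{target_cube}.M_CPU.DMA", f"{target_cube}.noc", f"{target_cube}.hbm_ctrl"]
    if op == "kernel_launch":
        # Parallel fan-out: one entry per PE, not a sequential chain.
        return [f"{target_cube}.PE[{i}]" for i in range(num_pes)]
    raise ValueError(f"unknown operation type: {op}")


# Hypothetical cube names; one transit cube between IO_CPU and the target.
print(" -> ".join(forward_path(["cube0"], "cube1")))
print(at_m_cpu("memory_rw", "cube1"))
```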

---

### D5. M_CPU.DMA is an internal subcomponent of M_CPU

M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.

M_CPU.DMA:

- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.

M_CPU.DMA does not appear as a node in the compiled topology graph.

---

### D6. Transit cube forwarding

A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:

```
ucie_in (from upstream) → noc → ucie_out (to downstream)
```

Transit forwarding is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube forward the packet without modification.
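
The transit-versus-target decision made at ucie_in could be sketched like this. `Cmd`, its `target_cube` field, `ucie_in_next_hop`, and the `downstream_cube` argument are all hypothetical. How a real component learns its downstream neighbor (for example, from a routing table) is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cmd:
    """Hypothetical request envelope; only the target cube matters in this sketch."""
    target_cube: str
    payload: str = ""


def ucie_in_next_hop(cmd: Cmd, this_cube: str, downstream_cube: str) -> str:
    """Decide where a command arriving at this cube's ucie_in is routed."""
    if cmd.target_cube == this_cube:
        # Target cube: the request is consumed locally by M_CPU.
        return f"{this_cube}.M_CPU"
    # Transit cube: pass the command through unchanged toward the next cube.
    return f"{this_cube}.ucie_out->{downstream_cube}"


print(ucie_in_next_hop(Cmd("cube1"), "cube1", "cube2"))  # cube1.M_CPU
print(ucie_in_next_hop(Cmd("cube2"), "cube1", "cube2"))  # cube1.ucie_out->cube2
```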

---

### D7. `_formula_latency` is preserved as a lower-bound cross-check

The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.

Invariant:

- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
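
The invariant can be illustrated with a small check. This sketch assumes the formula latency is simply the sum of per-edge delays along a path; the actual `_formula_latency` may also include per-node service time. `formula_latency`, `check_invariant`, the tolerance, and the delay values are illustrative, not taken from the engine.

```python
def formula_latency(path: list[str], edge_delay_ns: dict[tuple[str, str], float]) -> float:
    """Sum per-edge delays along a path; queueing is ignored entirely."""
    return sum(edge_delay_ns[(src, dst)] for src, dst in zip(path, path[1:]))


def check_invariant(formula_ns: float, total_ns: float, phase: int, tol: float = 1e-9) -> bool:
    """Phase 0 demands equality; Phase 1+ only demands a lower bound."""
    if phase == 0:
        return abs(formula_ns - total_ns) <= tol
    return formula_ns <= total_ns + tol


# Made-up delays for a hypothetical three-edge path.
delays = {("io_cpu", "ucie_in"): 5.0, ("ucie_in", "noc"): 2.0, ("noc", "m_cpu"): 1.0}
lower = formula_latency(["io_cpu", "ucie_in", "noc", "m_cpu"], delays)
print(lower)                                  # 8.0
print(check_invariant(lower, 8.0, phase=0))   # True
print(check_invariant(lower, 11.5, phase=1))  # True (contention added 3.5 ns)
print(check_invariant(lower, 11.5, phase=0))  # False
```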

This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.

---

## Consequences

- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled per edge by a dedicated wire process (D2).
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).

---

## Links

- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)