commit - release 1

This commit is contained in:
2026-03-18 11:47:48 -07:00
commit 6f43807900
109 changed files with 14909 additions and 0 deletions
+108
View File
@@ -0,0 +1,108 @@
# ADR-0001: PhysAddr Layout & Address Decoding Contract
## Status
Accepted
## Date
2026-02-27
## Context
KernBench Graph Latency Simulator must route requests deterministically and compute end-to-end latency strictly by graph traversal.
To model local vs remote traffic (same/different SIP, same/different CUBE, optional PE-group), requests need a stable, parsable address/location scheme that:
- can be decoded into routing domains (SIP/CUBE/HBM/PE-resource, etc.)
- remains topology-agnostic (no hardcoded counts)
- supports swappable policy and DI-first components without leaking topology assumptions into node implementations
## Decision
We define a **PhysAddr value object** and an **address decoding contract** that converts an integer address into routing domains.
### D1. PhysAddr is an immutable value object
- PhysAddr is immutable and comparable as a pure value.
- Any allocator returns a **fully specified PhysAddr** (not partial metadata).
- No global state may be required to interpret a PhysAddr.
### D2. PhysAddr fields (logical contract)
PhysAddr must be able to represent at least:
- `rack_id` (optional but reserved for scale-out)
- `sip_id` (device / SIP domain)
- `sip_seg` (SIP-level segment/window selection, e.g., cube window)
- `local_offset` (offset within the chosen segment/window)
Decoded/derived fields may include (optional):
- `cube_id`
- `kind` (e.g., HBM vs PE-resource vs raw)
- `unit_type` / `pe_id` (if PE-level addressing is modeled)
**Important:** The exact bit allocation may evolve, but the *semantic fields above* must remain decodable without hidden assumptions.
### D3. Decoding is deterministic and policy-compatible
- Decoding must deterministically map an integer address to:
- destination SIP domain (`sip_id`)
- destination sub-domain (`cube_id` if applicable)
- destination target kind (HBM/PE-resource/other)
- Decoding must not depend on runtime topology sizes; it may depend on **explicit topology parameters** provided through configuration (e.g., segment size, slice size), and those parameters must live in the topology/config layer (not in random components).
### D4. Topology-derived constants live in the topology layer
Constants such as segment sizes (e.g., HBM slice size / window size) are derived from topology configuration (YAML/JSON/dict) and are provided to the decoder via DI/config.
They must not be hardcoded in node implementations.
### D5. Routing consumes decoded domains, not raw bits
Routing policy uses decoded domains:
- `src` location (sip/cube/pe or node_id)
- `dst` domains derived from PhysAddr decoding
- `size_bytes` for size-aware link latency
Routing must not inspect raw bit-fields directly except inside the decoding module.
## Alternatives Considered
1) **Use raw integers everywhere, decode ad-hoc in routing**
- Rejected: leads to duplicated logic, inconsistent routing, and hidden assumptions embedded in multiple components.
1) **Hardcode topology sizes (SIP/CUBE/PE counts) into decoding**
- Rejected: violates SPEC (R3) and breaks swappability and configuration-driven topologies.
1) **Put decoding inside memory controllers or routers**
- Rejected: leaks policy into components and undermines DI-first, swappable implementations (SPEC R4).
## Consequences
### Positive
- Deterministic routing domains enable clear test invariants for local vs remote paths (SPEC R1, R5).
- Keeps topology variability (SPEC R3) while preserving consistent semantics.
- DI-first: decoder can be swapped or extended without changing components or tests (SPEC R4).
### Tradeoffs / Costs
- Requires explicit configuration for any topology-derived sizes.
- Introduces a single “blessed” decoding module that must remain stable and well-tested.
## Implementation Notes (Non-normative)
- Recommended module boundary:
- `src/kernbench/policy/address/phyaddr.py`
- Tests should cover:
- deterministic decoding
- local vs remote classification from decoded fields
- invariants: “allocator returns full PhysAddr”, “decoding requires no global state”
## Links
- SPEC.md: R1 (routing), R3 (configurable topology), R4 (DI-first), R5 (multi-domain comm)
+103
View File
@@ -0,0 +1,103 @@
# ADR-0002: Routing Distance, Ordering & Bypass Rules
## Status
Accepted
## Date
2026-02-27
## Context
The KernBench Graph Latency Simulator must compare kernel execution time
across different architectures and topologies by computing end-to-end
latency from graph traversal.
To support meaningful comparison:
- routing must be deterministic
- latency must reflect actual interconnect structure
- local vs remote traffic must be distinguishable
- “bypass” optimizations must not undermine debuggability or correctness
The simulator also aims to avoid software-managed metadata and hidden
shortcuts that obscure control paths.
## Decision
### D1. Distance is accumulated latency, not hop count
- Routing “distance” is defined as the **sum of per-node and per-link latency**.
- Hop count alone must not be used for ordering or path selection.
- Size-aware serialization latency (bytes / BW) contributes to distance.
### D2. Routing order is derived from graph traversal
- The chosen route is the path with minimum accumulated latency
given the constructed graph and routing policy.
- Deterministic ordering must be guaranteed for identical inputs
(topology + policy + request).
### D3. Bypass is explicit and graph-represented
- Any bypass (e.g., local cube HBM access via XBAR instead of NOC) must be:
- explicitly represented as a graph path, and
- subject to latency accumulation like any other path.
- Example: PE_DMA has dual egress — one to XBAR (HBM path) and one to NOC (non-HBM path).
Both are explicit graph edges; neither is a “bypass” — they are distinct data paths
serving different memory domains.
- Implicit or “magic” bypass paths are disallowed.
### D4. No zero-latency end-to-end paths
- Every routed request must incur **end-to-end** latency > 0.
- Individual fabric segments (e.g., NOC hops) MAY have distance_mm = 0
when the fabric is distributed and distance is not meaningful at that granularity.
This is allowed because other components on the same path (e.g., PE_DMA, SRAM,
UCIe endpoints) contribute non-zero latency, ensuring the end-to-end invariant holds.
- Fully zero-latency end-to-end paths are disallowed, except for explicit
test-only stubs clearly marked as such.
### D5. Policy vs topology responsibility split
- Topology builder:
- defines nodes and links and their latency/BW parameters
- Routing policy:
- selects among available graph paths based on decoded domains
- Routing policy must not assume missing links; missing connectivity
is a topology construction error.
### D6. No software-managed routing metadata
- Routing decisions must not rely on per-request software-managed metadata
that tracks distance, hop count, or ordering outside the graph model.
- All distance/order computation is derived from traversal itself.
## Alternatives Considered
1) **Hop-count based routing**
- Rejected: ignores heterogeneous latency/BW and misrepresents
architectural differences.
2) **Implicit local shortcuts**
- Rejected: breaks debuggability and violates traversal-based latency.
3) **Software-managed distance metadata**
- Rejected: increases control overhead and obscures routing semantics.
## Consequences
### Positive
- Clear, debuggable hop-by-hop traces (SPEC R2, R4).
- Architecture comparisons reflect real interconnect structure.
- Routing behavior is reproducible and deterministic.
### Tradeoffs / Costs
- Graph construction must be correct and complete.
- Bypass modeling requires explicit graph representation,
which slightly increases topology description complexity.
## Implementation Notes (Non-normative)
- Recommended responsibilities:
- Graph builder: ensure all required paths exist.
- Router: select next hop based on decoded domains and policy.
- Tests should assert:
- non-zero end-to-end latency
- deterministic routing for identical inputs
- bypass paths appear explicitly in emitted traces
## Links
- SPEC.md: R1 (routing), R2 (latency), R3 (topology), R5 (multi-domain comm)
- ADR-0001: PhysAddr layout & decoding contract
@@ -0,0 +1,64 @@
# ADR-0003: Target System Hierarchy & Modeling Scope
## Status
Accepted
## Context
We need a system-level simulator to evaluate LLM kernel performance on our AI Accelerator platform.
The platform is organized as a compute tray containing multiple identical SIPs connected via PCIe or UAL
through switching fabrics, with a host CPU issuing commands/kernels.
## Decision
We model the system hierarchy explicitly:
### D1. Tray-level
- A compute tray contains:
- Host CPU (issues requests / coordinates runtime & data placement)
- Multiple identical SIPs (accelerators)
- Interconnect fabric between SIPs (PCIe and/or UAL via switches)
### D2. SIP-level
- A SIP is a multi-die package composed of:
- Multiple CUBEs (HBM die + compute PEs + UCIe)
- One or more IO chiplets (host/SIP interfaces)
- IO chiplets:
- provide interfaces: PCIe-EP, IO_CPU, optionally UAL-EP
- can be multiple per SIP
- placement constrained to SIP shoreline (top/bottom/left/right); each shoreline may host 12 IO chiplets
### D3. CUBE-level
- A CUBE contains:
- HBM + memory controller (HBM_CTRL)
- XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
- Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
- NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
- Shared SRAM: cube-level shared memory accessible by all PEs via NOC
- management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
- multiple PEs
- up to 4 UCIe endpoints (N/E/W/S) for CUBE↔CUBE and CUBE↔IO connectivity
### D4. PE-level
- A PE can execute one kernel instance
- PE contains internal control + accelerators (modeled at PE view granularity):
- PE_CPU, command handler, PE_TCM, DMA/GEMM/MATH engines, internal queues
## Consequences
- The simulator supports abstraction by “views”:
- SIP view hides PE internals
- CUBE view treats each PE as a single block
- PE view expands PE internals
- Topology remains parameterized; sizes/counts/links come from configuration.
## Links
- SPEC R3/R5
- ADR-0005 (diagram views)
@@ -0,0 +1,64 @@
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
## Status
Accepted
## Context
Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
## Decision
### D1. Local HBM definition
- Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PEs DMA path
via the XBAR (top or bottom, depending on PE corner placement).
- The path is: PE_DMA → XBAR.top/bottom → HBM_CTRL.
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
### D2. Local HBM bandwidth guarantee contract
- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth
independent of intervening fabric bandwidth limits.
- This guarantee is modeled by:
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
- while still incurring non-zero latency along explicitly modeled components.
### D3. Cross-half HBM semantics
- A PE connected to XBAR.bottom that accesses HBM pseudo-channels on the XBAR.top half
(or vice versa) traverses a bridge:
- PE_DMA → XBAR.bottom → bridge → XBAR.top → HBM_CTRL
- Bridge bandwidth may limit cross-half HBM access relative to local-half access.
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
- NOC bandwidth within the cube,
- inter-cube UCIe links,
- inter-SIP fabric (PCIe/UAL).
- These paths MUST be explicit and traceable.
### D5. Shared SRAM semantics
- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
- Access path: PE_DMA → NOC → shared SRAM.
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
## Verification Notes
Tests should cover:
- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- cross-half HBM case: latency includes bridge traversal
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
- shared SRAM case: access via NOC with correct BW
## Links
- SPEC R2/R5
- ADR-0002 (distance/order & explicit bypass)
@@ -0,0 +1,186 @@
# ADR-0005: Diagram Views & Distance-Aware Layout Rules
## Status
Accepted
## Context
We require verifiable and inspectable system modeling for a large-scale,
parameterized AI Accelerator system.
Humans must be able to:
- visually inspect the modeled topology,
- reason about communication structure and relative distance,
- do so at multiple abstraction levels without being overwhelmed by detail.
The simulator models distance (accumulated latency) as a first-class concept.
Diagrams must reflect this distance by default.
---
## Global Defaults
- All diagrams MUST be **distance-aware by default**.
- All diagrams MUST render **representative views** of the architecture.
- Instance indices (e.g., sip0, cube2, pe3) MUST NOT be required for diagram generation.
- Instance indices MAY be used ONLY:
- to define a distance anchor in asymmetric or debugging scenarios, or
- when explicitly requested.
---
## Representative Rendering Rule
- All CUBEs share the same internal structure.
- All PEs share the same internal structure.
Therefore:
- SIP-level diagrams render representative CUBEs and IO chiplets.
- CUBE-level diagrams render representative PEs as opaque blocks.
- PE-level diagrams render a representative PE with fully expanded internals.
Diagrams MUST NOT depend on specific SIP, CUBE, or PE indices
unless explicitly requested.
---
## Diagram Views
### View A — SIP-Level Diagram
**Purpose**
Explain system-scale structure and connectivity.
**Visible elements**
- SIP boundaries (optional)
- CUBEs (opaque blocks)
- IO chiplets (opaque blocks)
- Optional UCIe stubs only if needed to clarify connectivity
**Hidden elements**
- PE internals
- CUBE internal fabric
- IO chiplet internals
**Visible links**
- Host ↔ IO chiplets (PCIe)
- SIP ↔ SIP (PCIe / UAL via switches)
- IO ↔ CUBE (on-package links)
---
### View B — CUBE-Level Diagram
**Purpose**
Explain cube-internal structure and data/control flow.
**Visible elements**
- XBAR (top/bottom): HBM pseudo-channel crossbar
- Bridge (left/right): cross-half HBM connectors between XBAR.top and XBAR.bottom
- NOC: distributed on-die fabric for non-HBM traffic
- HBM subsystem (HBM_CTRL)
- Shared SRAM: cube-level shared memory
- Management CPU (M_CPU)
- PEs as opaque blocks (PE[0..N1])
- UCIe endpoints (N/E/W/S) as ports
**Hidden elements**
- PE internals
**Visible links**
- PE → XBAR (HBM data path, top or bottom by corner placement)
- PE → NOC (non-HBM data path)
- XBAR ↔ bridge ↔ XBAR (cross-half HBM access)
- XBAR → HBM_CTRL
- NOC ↔ UCIe endpoints
- NOC ↔ shared SRAM
- M_CPU ↔ NOC (command path)
- NOC → PE_CPU (command delivery, collapsed into PE block)
---
### View C — PE-Level Diagram
**Purpose**
Explain internal PE behavior and execution structure.
**Visible elements**
- PE_CPU
- Command handler / scheduler
- PE_TCM (local SRAM)
- HW accelerators (DMA, GEMM, MATH, etc.)
- Local HBM interface
- Optional IPCQ / messaging endpoints
**Visible links**
- Control paths (CPU → scheduler → engines)
- Data paths (engines ↔ TCM, DMA ↔ local HBM)
- External fabric ports as abstract ports only
---
## Distance-Aware Layout (Default)
### Distance definition
- Distance is defined as **accumulated latency**, consistent with ADR-0002.
- Distance is computed from a single anchor node.
### Default anchor selection
- SIP view: IO chiplet (or Host CPU if present)
- CUBE view: a representative PE
- PE view: PE_CPU or Command Handler
Anchors are **implicit defaults** and MUST NOT be required to be specified.
### Layout rules
- Diagrams MUST be laid out in layers based on distance buckets.
- Layout direction MUST be consistent within a view type
(preferred: left-to-right).
- Nodes with equal distance MUST have stable ordering
(by role or identifier, deterministically).
Cycles MAY be rendered using dashed or curved edges for readability,
without affecting distance semantics.
---
## Generation Contract (for Tools / Claude Code)
When generating diagrams:
- Assume distance-aware layout by default.
- Assume representative rendering by default.
- Do NOT ask for SIP/CUBE/PE indices unless required.
- Do NOT expand hidden abstraction levels.
- Prefer architectural clarity over micro-hop fidelity.
---
## Consequences
- Diagrams are stable across topology scaling.
- Changes in distance or routing policy are reflected visually.
- Diagrams serve as verifiable artifacts derived from the simulator model,
not as hand-maintained documentation.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Routing distance semantics)
- ADR-0006 (Topology compilation & automatic diagram generation)
@@ -0,0 +1,130 @@
# ADR-0006: Topology Compilation, Distance Extraction, and Automatic Diagram Generation
## Status
Accepted
## Context
The simulator compiles topology configuration (e.g., topology.yaml) into an explicit model graph,
and computes routing and accumulated latency (distance).
Diagrams should be generated from these authoritative artifacts to ensure consistency and avoid
hand-maintained topology drawings.
Additionally, for usability, diagrams should be emitted automatically into a stable location
so that developers can preview them immediately in the repository.
---
## Decision
### D1. Topology compilation is the single source of truth
- topology.yaml (or equivalent config) is compiled into:
- an explicit system graph,
- node/link attributes,
- routing policies.
This compiled graph is the authoritative representation of the system.
### D2. Distance extraction during compilation
- During or immediately after topology compilation, the simulator MUST compute distance metadata
(accumulated latency) consistent with ADR-0002.
- Distance metadata MUST be sufficient to support distance-aware diagram layout as defined in ADR-0005.
- Distributed fabric segments (e.g., NOC) MAY have distance_mm = 0 per ADR-0002 D4;
layout placement for such nodes uses explicit position metadata rather than distance buckets.
### D3. Diagram generation is a derived artifact
- Diagrams MUST be generated from:
- the compiled topology graph,
- extracted distance metadata,
- view/layout rules defined in ADR-0005.
- Diagram generation MUST NOT require additional hand-written topology descriptions.
### D4. Automatic diagram emission to the repository
- As part of topology compilation, the implementation MUST produce the following diagrams by default:
- SIP-level diagram (representative, distance-aware)
- CUBE-level diagram (representative, distance-aware)
- PE-level diagram (representative, distance-aware)
- The default output directory is:
- `docs/diagrams/`
- The generator MUST overwrite/update only when the compiled topology (or diagram rules) changes.
### D5. View-specific projection and layout
For each view (SIP / CUBE / PE):
- The generator MUST project the compiled graph into a reduced view graph:
- hide/collapse nodes according to ADR-0005,
- preserve connectivity semantics relevant to that view,
- compute distance buckets and assign layout layers deterministically.
- CUBE-level projection MUST include:
- XBAR (top/bottom), bridge (left/right), NOC, HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
and PEs as opaque blocks.
- Distinct edge kinds for HBM path (PE→XBAR) vs non-HBM path (PE→NOC).
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
### D6. Output formats and determinism
- The generator MUST output at least one of:
- Mermaid (Markdown-native)
- Graphviz DOT (rank-based control)
- SVG (mm-accurate layout, no external dependencies)
- SVG is preferred when mm-accurate position metadata is available from the compiled topology.
- Output MUST be deterministic:
- same topology + same rules → identical diagram text
- File naming MUST be deterministic and stable (see "Output Conventions").
### D7. Performance and caching
- Diagram generation MAY be lazy and/or cached, as long as the outputs in `docs/diagrams/`
remain consistent with the compiled topology.
- The implementation SHOULD use a cache key based on:
- topology content hash,
- routing policy version,
- diagram rules version,
- view type (SIP/CUBE/PE).
---
## Output Conventions
### Directory
- `docs/diagrams/` is the canonical output directory for generated diagrams.
### File names (recommended, deterministic)
- `system_view.svg` / `system_view.mmd` / `system_view.dot`
- `sip_view.svg` / `sip_view.mmd` / `sip_view.dot`
- `cube_view.svg` / `cube_view.mmd` / `cube_view.dot`
- `pe_view.svg` / `pe_view.mmd` / `pe_view.dot`
Optionally, for multi-topology workflows:
- `sip_view__{topology_id}.svg`
- `cube_view__{topology_id}.svg`
- `pe_view__{topology_id}.svg`
### Repository policy
- Generated diagram files MAY be committed to the repository to enable diff-based review.
- If committed, they MUST be reproducible from topology compilation.
---
## Consequences
- Diagrams are always consistent with simulator behavior.
- Architectural changes automatically propagate to visualizations.
- Diagram diffs become meaningful indicators of architectural change.
---
## Links
- SPEC Section 4 (Output, Debuggability, and Diagrams)
- ADR-0002 (Distance semantics)
- ADR-0005 (Diagram views and layout rules)
@@ -0,0 +1,89 @@
# ADR-0007: Runtime API and Simulation Engine Boundaries
## Status
Accepted
## Context
The simulator consists of multiple layers with distinct responsibilities:
- a host-facing API layer used by benchmarks and user code,
- a discrete-event simulation engine that executes requests,
- device components that model hardware behavior.
Without strict boundaries, orchestration logic can leak into components,
or simulation internals can become entangled with user-facing APIs.
This ADR defines clear responsibility boundaries between:
- runtime API,
- simulation engine (sim_engine),
- hardware components.
---
## Decision
### D1. Runtime API is host-facing orchestration only
The runtime API represents host/driver-level behavior and MUST:
- expose high-level operations (tensor deployment, kernel launch),
- submit requests only to endpoint components (e.g., IO_CPU),
- await completion via futures/handles,
- own and persist host-side metadata (tensor allocation maps, kernel bindings).
The runtime API MUST NOT:
- hardcode hop-by-hop routing or fan-out,
- directly invoke internal components (M_CPU, PE_CPU, engines),
- embed topology- or routing-specific assumptions.
---
### D2. Simulation engine executes and schedules requests
The simulation engine (sim_engine) MUST:
- inject requests into the compiled topology graph,
- schedule and execute events using a discrete-event model,
- manage correlation ids and completion tracking,
- decompose operations into low-level requests when required
(e.g., MemoryWrite events).
The simulation engine MUST NOT:
- define tensor semantics,
- define kernel execution policies,
- expose internal graph details to the runtime API.
---
### D3. Components own fan-out and aggregation
Device-side components MUST:
- fan-out requests to downstream domains
(IO_CPU → M_CPU → PE_CPU → schedulers/engines),
- aggregate completion and failure signals,
- propagate results deterministically upstream.
Neither the runtime API nor the simulation engine may orchestrate
component-level fan-out explicitly.
---
## Consequences
- Runtime APIs remain stable as topology and routing evolve.
- Simulation internals can change without affecting user-facing code.
- Component implementations remain swappable via DI.
---
## Links
- SPEC R4, R7, R8
- ADR-0008 (Tensor deployment)
- ADR-0009 (Kernel execution)
@@ -0,0 +1,100 @@
# ADR-0008: Tensor Deployment and Allocation (Host Allocator, PA-first)
## Status
Accepted
## Context
Benchmarks require PyTorch-like tensor semantics:
- tensor creation (empty, fill),
- deployment to accelerator devices (tensor.to()).
In the realistic system, host software manages allocation/mapping and installs
mappings for DMA/MMU. For Phase 0 we simplify (ADR-0011):
- device memory operations use PA only,
- VA/MMU/IOMMU is not modeled.
To keep the host↔device interface minimal, we avoid a separate
AllocateTensorMeta message. Instead, host allocation produces a PA shard map
that is used directly by MemoryWrite/Read and KernelLaunch.
---
## Decision
### D1. Tensor is a host-owned handle with PA shard mapping
A Tensor object is a host-owned handle that encapsulates:
- shape and dtype,
- initialization intent,
- device placement and allocation metadata as a PA shard map.
After deployment, the Tensor handle MUST contain:
- a list of shards, each with (sip,cube,pe,pa,nbytes,offset_bytes).
This PA shard mapping is the single source of truth for kernel argument binding.
---
### D2. Deployment uses a host allocator (Phase 0)
In Phase 0, tensor deployment produces PA shard mappings via a host allocator:
- placement (split/replicate/hybrid) is decided by a DP policy,
- allocation assigns PA ranges at the PE level and returns shard mappings,
- the Tensor handle stores the resulting shard list deterministically.
No separate host-visible device allocation RPC is required in Phase 0.
---
### D3. Data initialization and transfer uses MemoryWrite/Read only
Any data initialization or transfer implied by a tensor (e.g., fill, copy)
MUST be represented using Host ↔ IO_CPU messages only:
- MemoryWrite
- MemoryRead
Rules:
- MemoryWrite/Read MUST reference PA + (sip,cube,pe) tags (ADR-0012).
- Allocation metadata MUST NOT be embedded as a separate allocation message.
- Bulk tensor data MUST NOT be embedded in Phase 0 messages.
The simulation engine schedules MemoryWrite/Read through the graph so that
latency is computed by explicit traversal.
---
### D4. Extension path (non-breaking)
Future ADRs MAY introduce optional VA/MMU/IOMMU modeling by adding:
- virtual addressing in tensor handles,
- mapping install steps,
- translation latency/page granularity.
The Phase 0 PA shard map remains a valid fast-path configuration.
---
## Consequences
- Host↔IO_CPU contract remains minimal (MemoryRead/Write + KernelLaunch).
- KernelLaunch can pass per-PE data placement explicitly via shard tags.
- Early implementation stays simple and testable.
---
## Links
- ADR-0011 (PA-first)
- ADR-0012 (Host↔IO_CPU schema)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (Kernel execution)
@@ -0,0 +1,74 @@
# ADR-0009: Kernel Execution Messaging and Completion Semantics
## Status
Accepted
## Context
Kernel execution is initiated by the host and proceeds through
device control components:
Host → IO_CPU → M_CPU → PE_CPU → schedulers → engines
Completion propagates in reverse order.
To keep benchmarks simple and topology-agnostic,
kernel execution must be endpoint-driven with deterministic aggregation.
---
## Decision
### D1. Kernel launch is an endpoint request
A kernel launch is initiated by submitting a single KernelLaunch request
to the IO_CPU endpoint.
The runtime API MUST:
- construct the kernel launch request,
- submit it to IO_CPU,
- await a single completion result.
The runtime API MUST NOT orchestrate internal fan-out.
---
### D2. Tensor arguments are passed by metadata
KernelLaunch requests MUST reference tensor arguments via:
- host-owned tensor handles, or
- resolved device address maps derived from those handles.
Bulk tensor data MUST NOT be embedded in kernel launch messages.
---
### D3. Fan-out and aggregation are component responsibilities
- IO_CPU fans out work to M_CPUs.
- M_CPU fans out work to PE_CPUs.
- PE_CPU manages kernel execution and engine dispatch.
Completion semantics:
- M_CPU completes when all targeted PEs complete or a failure policy triggers.
- IO_CPU completes when all targeted CUBEs complete or a failure policy triggers.
---
### D4. Completion and failure propagation
- All messages MUST carry correlation identifiers.
- Completion and failure MUST propagate deterministically to the host.
- The simulation engine provides futures/handles to observe completion.
---
## Links
- SPEC R1, R2, R7, R8
- ADR-0007 (Runtime API boundaries)
- ADR-0008 (Tensor deployment)
+62
View File
@@ -0,0 +1,62 @@
# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
## Status
Accepted
## Context
Benchmarks represent device-agnostic workloads that operate on a single device.
Users may want to run a benchmark:
- on a specific device, or
- across all devices in the system.
Device enumeration must not leak into benchmarks or runtime APIs.
---
## Decision
### D1. Benchmarks are single-device by design
- A benchmark MUST define behavior for a single device only.
- A benchmark MUST accept a device identifier as input.
- Benchmarks MUST NOT enumerate or loop over multiple devices.
---
### D2. CLI controls device selection
The `kernbench run` command supports an optional `--device` argument:
- If `--device <id>` is specified:
- the benchmark executes once for the specified device.
- If `--device` is omitted:
- the benchmark executes once using all the SIPs discovered in the topology.
---
### D3. Multi-device execution is logically parallel
When running on multiple devices:
- benchmark executions are submitted to a single simulation engine instance,
- executions are logically parallel in simulation time,
- inter-device contention is naturally modeled.
---
### D4. Runtime API and simulation engine remain device-scoped
- Runtime API calls operate on one device per invocation.
- The simulation engine schedules all requests deterministically.
- Neither layer enumerates devices.
---
## Links
- SPEC R7, R8
- ADR-0007 (Runtime API boundaries)
@@ -0,0 +1,65 @@
# ADR-0011: Memory Addressing Simplification (PA-first)
## Status
Accepted
## Context
A realistic system uses host-side virtual addressing and an MMU/IOMMU-style
translation path for DMA: host allocates physical memory at PE level, maps it
into a virtual address space, installs mappings, and DMA requests use virtual
addresses that are translated to physical addresses.
For early development, we want a minimal, deterministic model that enables:
- correct routing and latency accounting through the graph,
- stable tensor deployment and kernel execution semantics,
- future extension toward VA/MMU without rewriting workflows.
---
## Decision
### D1. Phase 0 model is PA-only
The simulator uses a PA-first model:
- All device memory accesses (MemoryRead/MemoryWrite) operate on device physical
addresses (PA) plus size.
- Tensor handles store PA-based shard mappings after deployment.
- KernelLaunch passes tensor arguments as PA-based mappings (or references to them).
- MMU/IOMMU concepts (virtual address spaces, page tables, translation latency)
are NOT modeled in Phase 0.
### D2. Allocation produces PA mappings
Device allocation selects PE-local memory regions and returns PA mappings
sufficient to execute kernels and issue DMA requests.
### D3. Extension path (non-breaking)
A future ADR MAY introduce an optional VA/MMU layer by:
- introducing virtual addresses in tensor handles,
- adding a mapping-install step,
- modeling translation latency and page granularity.
The Phase 0 PA model remains a valid fast-path configuration.
---
## Consequences
- Early implementation stays simple and testable.
- All latency remains explicit via graph traversal, not hidden translation.
- Future VA/MMU modeling can be added without breaking existing benchmarks.
---
## Links
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0008 (tensor deployment)
- ADR-0009 (kernel execution)
- SPEC R2 (latency by traversal)
+232
View File
@@ -0,0 +1,232 @@
# ADR-0012: Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
## Status
Accepted
## Context
Phase 0 uses a PA-first memory model (ADR-0011):
- memory operations use device physical addresses (PA) only,
- VA/MMU/IOMMU is not modeled.
The host-facing runtime API interacts with the device via the IO_CPU endpoint.
We define stable, minimal message schemas for Host ↔ IO_CPU so that:
- benchmarks remain stable,
- IO_CPU-internal fan-out/aggregation can evolve independently,
- completion and failure propagation is deterministic.
We also require PE-tagging (A 방식): each shard explicitly carries (sip,cube,pe)
so IO_CPU can deterministically route/fan-out without relying on PA decoding.
---
## Decision
### D1. Contract scope
This schema is the stable contract ONLY for Host ↔ IO_CPU.
Messages beyond IO_CPU (to M_CPU, PE_CPU, schedulers, engines) are component-internal
and are NOT part of this host contract in Phase 0.
---
### D2. Required message set
The runtime API MUST use only these message types for Host ↔ IO_CPU:
- MemoryWrite
- MemoryRead
- KernelLaunch
All operations required by benchmarks (tensor init/copy, kernel run) MUST be expressible
with these messages.
---
### D3. Common envelope (mandatory for all requests)
All Host ↔ IO_CPU requests MUST include:
- `msg_type: str`
- `correlation_id: str`
- generated by the host
- used to match responses deterministically
- `request_id: str`
- unique within a correlation_id
- `target_device: str`
- device identifier (e.g., "sip:0")
- `timestamp_tag: str | None` (optional)
- debug tag only; MUST NOT affect determinism
All Host ↔ IO_CPU responses MUST include:
- `correlation_id: str`
- `request_id: str`
- `completion: Completion`
---
### D4. Completion schema (mandatory)
`Completion` MUST have:
- `ok: bool`
- `error_code: str | None`
- `error_message: str | None`
Rules:
- If `ok == true` then `error_code` and `error_message` MUST be null.
- If `ok == false` then `error_code` MUST be non-null.
- Completion semantics MUST be deterministic.
---
### D5. MemoryWrite schema (PA-first, PE-tagged)
`MemoryWrite` represents a host-initiated write/initialize operation to device memory.
Mandatory fields:
- common envelope fields (D3)
- destination placement tags (A 방식):
- `dst_sip: int`
- `dst_cube: int`
- `dst_pe: int`
- `dst_pa: int`
- destination physical address in the destination PE's address space
- `nbytes: int`
- `src_kind: "pattern" | "host_buffer_ref"`
- Phase 0 MUST support "pattern"
- `pattern: Pattern | None`
- required if `src_kind == "pattern"`
`Pattern` (Phase 0 mandatory support):
- `pattern_kind: "zero" | "fill_u8" | "fill_u16" | "fill_u32" | "fill_fp16" | "fill_fp32"`
- `value: number | None`
- required for fill_*; ignored for zero
Optional fields:
- `dst_mem_kind: "HBM" | "TCM" | "AUTO"` (default "AUTO")
- `debug_label: str | None`
Notes:
- This message MUST NOT embed bulk tensor data in Phase 0.
- All latency MUST come from explicit graph traversal and modeled components.
---
### D6. MemoryRead schema (PA-first, PE-tagged)
`MemoryRead` represents a host-initiated read from device memory.
Mandatory fields:
- common envelope fields (D3)
- source placement tags (A 방식):
- `src_sip: int`
- `src_cube: int`
- `src_pe: int`
- `src_pa: int`
- `nbytes: int`
Optional fields:
- `dst_kind: "host_sink" | "discard"` (default "host_sink")
- `debug_label: str | None`
Response payload:
- actual bytes are NOT required in Phase 0 (latency/traces focus)
- implementations MAY return lightweight stats or hashes later via a new ADR
---
### D7. KernelLaunch schema (PA-first, PE-tagged shards)
`KernelLaunch` represents launching a kernel on a target device via IO_CPU.
Mandatory fields:
- common envelope fields (D3)
- `kernel_ref: KernelRef`
- `args: list[KernelArg]`
`KernelRef` MUST have:
- `name: str`
- `kind: "deployed" | "builtin"`
- `deploy_pa: int | None` — PA where kernel binary was deployed (required for "deployed")
- `deploy_sip: int` — SIP where binary resides
- `deploy_cube: int` — cube where binary resides
- `deploy_pe: int` — PE where binary resides
- `nbytes_code: int` — kernel binary size (for BW modeling)
Kernel binaries MUST be pre-deployed to device memory via MemoryWrite.
KernelLaunch MUST NOT embed kernel source code or IR in the launch message.
`KernelArg` supports tensor args by PA mapping and scalars by value.
Tensor arg (mandatory):
- `arg_kind: "tensor"`
- `tensor_pa_map: TensorPAMap`
`TensorPAMap` MUST have:
- `shards: list[TensorShard]`
`TensorShard` MUST have (A 방식 강제):
- `sip: int`
- `cube: int`
- `pe: int`
- `pa: int`
- `nbytes: int`
- `offset_bytes: int`
Scalar arg (mandatory):
- `arg_kind: "scalar"`
- `dtype: "i32" | "i64" | "fp16" | "fp32" | "bool"`
- `value: number | bool`
Optional KernelLaunch fields:
- `grid: dict | None`
- `meta: dict | None`
- `failure_policy: "fail_fast" | "collect_all"` (default "fail_fast")
- `debug_label: str | None`
Notes:
- KernelLaunch MUST NOT embed bulk tensor data.
- KernelLaunch MUST be submitted only to the IO_CPU endpoint.
- IO_CPU MUST fan-out work internally using the shard (sip,cube,pe) tags.
---
## Verification Notes
Tests SHOULD validate:
- schema validation rejects missing mandatory fields,
- deterministic correlation/response matching,
- MemoryWrite/Read/KernelLaunch produce explicit hop traces,
- all routed requests incur latency > 0.
---
## Links
- ADR-0011 (PA-first memory addressing)
- ADR-0007 (runtime_api vs sim_engine boundaries)
- ADR-0009 (kernel execution fan-out/aggregation)
- SPEC R2, R7, R8
+139
View File
@@ -0,0 +1,139 @@
# ADR-0013: Verification Strategy and Phase 1 Test Plan
## Status
Accepted
## Context
KernBench is a system-level simulator whose correctness is defined by:
- adherence to SPEC-defined invariants,
- determinism and debuggability,
- explicit modeling of routing and latency.
Given the evolving implementation, we need a stable verification strategy
that prevents architectural drift while allowing incremental development.
This ADR defines the Phase 1 verification plan and what constitutes
"correct behavior" for early implementations.
---
## Decision
### D1. Verification is contract-based
Verification MUST be derived from:
- SPEC requirements,
- accepted ADRs.
Tests MUST validate architectural contracts, not incidental implementation details.
---
### D2. Phase 1 verification scope
Phase 1 verification focuses on:
- message contract validity (ADR-0012),
- routing and fan-out semantics at the IO_CPU boundary (ADR-0009),
- PA-first memory addressing and shard tagging (ADR-0011),
- core latency and trace invariants (SPEC 0.1, R2).
Microarchitectural accuracy, bandwidth contention, and cycle-level behavior
are explicitly out of scope in Phase 1.
---
### D3. Required Phase 1 verification cases
The following verification cases MUST be supported by the implementation:
#### V1. Message schema validation
- KernelLaunch requests missing `(sip, cube, pe)` in any tensor shard MUST be rejected.
- MemoryWrite/MemoryRead requests missing destination/source placement tags MUST be rejected.
- Completion results MUST follow the `ok / error_code / error_message` contract.
#### V2. IO_CPU fan-out and aggregation
Given:
- a topology with one SIP, one CUBE, and two PEs,
- a KernelLaunch request containing two tensor shards targeting different PEs,
The system MUST:
- submit a single KernelLaunch to IO_CPU,
- fan-out work internally to both PEs,
- aggregate completion and return a single deterministic completion to the host.
#### V3. Latency and trace invariants
For any valid request:
- the hop-by-hop trace MUST be non-empty,
- total latency MUST be greater than zero,
- repeated runs with identical inputs MUST produce identical traces.
#### V4. Topology independence and cross-domain coverage
Verification cases MUST pass for multiple topology shapes, including:
- minimal: (1 SIP, 1 CUBE, 1 PE)
- multi-PE: (1 SIP, 1 CUBE, N PEs)
- multi-CUBE within a SIP: (1 SIP, M CUBEs, ≥1 PE per CUBE)
- multi-SIP tray: (K SIPs, ≥1 CUBE per SIP, ≥1 PE per CUBE)
For multi-CUBE and multi-SIP topologies, Phase 1 verification focuses on:
- explicit connectivity (required links exist),
- deterministic routing and control-path traversal,
- non-empty traces and latency > 0 for representative cross-domain requests
(inter-CUBE and inter-SIP paths).
Tests MUST NOT hardcode topology sizes, node ids, or link counts.
Instead, tests MUST derive expectations from the compiled topology metadata
---
### D4. Phase 1 artifacts
Phase 1 MAY include:
- verification-only test code,
- topology fixtures,
- trace inspection utilities.
Phase 1 MUST NOT require:
- production code changes solely to satisfy tests,
- weakening or removing tests to allow progress.
---
### D5. Phase 2 enforcement
Phase 2 (Apply) MUST:
- run the Phase 1 verification cases,
- rollback all changes if any verification fails,
- preserve tests as authoritative contracts.
---
## Consequences
- Architectural correctness is enforced early.
- Tests serve as executable documentation of system behavior.
- Implementation remains flexible without losing rigor.
---
## Links
- SPEC 0.1, R2, R6
- ADR-0011 (PA-first memory addressing)
- ADR-0012 (Host ↔ IO_CPU message schema)
- ADR-0009 (Kernel execution semantics)
@@ -0,0 +1,364 @@
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)
## Status
Proposed
## Context
ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:
- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.
We need a deterministic and debuggable PE-internal execution contract that supports:
- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines
The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.
## Decision
### D1. PE internal component roles
Each PE contains the following logical components.
**PE_CPU**
- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.
**PE_SCHEDULER**
- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.
**PE_DMA**
- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA has **dual egress** at the CUBE level:
- **→ XBAR**: dedicated path to HBM (local and cross-half via bridge)
- **→ NOC**: path to non-HBM destinations (shared SRAM, inter-cube UCIe, etc.)
- Supported directions include:
- HBM → PE_TCM (via XBAR)
- PE_TCM → HBM (via XBAR)
- PE_TCM → shared SRAM (via NOC)
- PE_TCM → other memory domains (via NOC, if supported by topology)
**PE_GEMM**
- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.
**PE_MATH**
- Element-wise computation engine.
- Reads and writes PE_TCM.
**PE_TCM**
- Local SRAM used as the staging memory for accelerator operations.
---
### D2. Command lifecycle and queues
PE_SCHEDULER maintains three logical structures.
**SubmissionQueue**
- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.
**InflightTable**
- Owned and mutated only by PE_SCHEDULER.
- Tracks:
- expanded sub-commands
- dependency state
- engine assignment
- completion status
**CompletionQueue**
- Written by PE_SCHEDULER.
- Contains final completion records for commands.
**Single-writer rule**
- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.
**Command completion**
A command becomes DONE when:
- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.
---
### D3. Dispatch modes
PE commands are divided into two categories.
#### D3.1 Simple command
A simple command expands to exactly one engine sub-command.
Examples include:
- DMA transfer
- GEMM compute
- MATH compute
Execution flow:
```
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```
#### D3.2 Composite command (tiled pipeline)
Composite commands implement tiled pipelined execution across engines.
Each tile executes the following pipeline:
```
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```
**Tiling rule**
If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.
**Tile dependency rules**
For tile `t`:
- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.
**Overlap policy (Phase 0 default)**
Operations for different tiles may overlap when engine resources permit.
Allowed overlaps:
```
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t1) ∥ COMPUTE(t)
DMA_READ(t) ∥ DMA_WRITE(t)
```
Disallowed overlaps:
```
GEMM(t) ∥ GEMM(t)
MATH(t) ∥ MATH(t)
GEMM(t) ∥ MATH(t)
```
---
### D4. Engine execution model (Phase 0 default)
Each engine behaves as a deterministic service resource.
**DMA engine**
PE_DMA contains two independent channels.
```
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```
Rules:
- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.
Example allowed:
```
DMA_READ(t+1) ∥ DMA_WRITE(t)
```
Example not allowed:
```
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```
**Compute engine**
Compute operations share a single compute resource.
```
PE_ACCEL capacity = 1
```
Both GEMM and MATH require this shared compute slot.
Consequences:
- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed
Only one compute operation can run in a PE at a time.
**Compute opcode restriction**
Composite commands contain one compute opcode only.
Examples:
```
COMPOSITE_GEMM
COMPOSITE_MATH
```
Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.
**Engine completion signaling**
Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.
---
### D5. Dataflow model
Compute operations use a TCM-centric dataflow model.
**Input path (HBM)**
```
HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
```
**Input path (shared SRAM)**
```
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```
**Compute stage**
Compute engines read input tensors from PE_TCM.
```
PE_TCM → GEMM / MATH
```
Weights for GEMM may optionally stream directly from HBM (via XBAR).
**Output path (HBM)**
Compute results are written to PE_TCM, then DMA writes to HBM.
```
PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
```
**Output path (shared SRAM)**
```
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```
#### D5.1 PE_TCM partitioning and ownership boundary
The PE_TCM address space is partitioned into two logical regions.
**SchedulerReservedTCM**
- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
- partitions this region into tile buffers
- assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
- guarantees input/output buffer separation
- manages tile buffer lifetime
**AllocatableTCM**
- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.
**Visibility rule (hard isolation)**
- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.
**Tile buffer rules**
Within SchedulerReservedTCM:
- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.
---
### D6. Observability and trace contract
The simulator must emit deterministic trace events.
Required events include:
- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`
Trace ordering must be deterministic for identical inputs.
---
### D7. Topology representation
PE internal components are declared in `cube.pe_template`.
The template is instantiated once per PE.
PE instances are derived from `cube.pe_layout`.
External connectivity such as:
- PE_DMA → XBAR (HBM data path)
- PE_DMA → NOC (non-HBM data path: shared SRAM, inter-cube UCIe)
- NOC → PE_CPU (command path from M_CPU)
is modeled at the CUBE level (see ADR-0003 D3).
---
## Links
- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
@@ -0,0 +1,178 @@
# ADR-0015: Component Port/Wire Model and Fabric Routing
## Status
Proposed
## Context
ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
In practice, the engine iterates the topology path and calls `run()` on each component
sequentially — conflating routing policy with component behavior and preventing realistic
hardware modeling (queues, contention, fan-out).
ADR-0007 D3 already states that components own fan-out and aggregation, but the current
implementation does not enforce this for fabric traversal.
This ADR defines:
- how components communicate via typed port queues,
- how propagation delay is modeled (wire processes),
- the fabric path for Memory R/W through M_CPU.DMA,
- the reduced role of the simulation engine,
- M_CPU.DMA as an internal subcomponent of M_CPU.
---
## Decision
### D1. Component port model
Each component has typed input/output ports modeled as SimPy Stores:
```
in_ports: dict[str, simpy.Store] # keyed by source node_id
out_ports: dict[str, simpy.Store] # keyed by destination node_id
```
Ports are created at engine initialization based on graph edges.
Each directed edge (src → dst) results in:
- `src.out_ports[dst]` — the sending end
- `dst.in_ports[src]` — the receiving end
---
### D2. Wire process (propagation delay)
For each directed edge (src, dst) in the topology graph, a SimPy wire process
models propagation delay:
```python
def wire_process(env, out_port, in_port, delay_ns):
while True:
cmd = yield out_port.get()
yield env.timeout(delay_ns)
yield in_port.put(cmd)
```
Wire processes are started at engine initialization.
BW constraints are enforced by the sending component's out_port capacity or token model,
not by the wire process itself.
---
### D3. Engine role (reduced)
The simulation engine MUST:
- wire components at initialization (create port Stores, start wire processes),
- identify the entry component for each request type (PCIE_EP),
- put the request into the entry component's in_port,
- wait for a completion event.
The simulation engine MUST NOT:
- walk the topology path during request execution,
- call component `run()` methods directly,
- track per-hop latency or decompose fan-out.
This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
ADR-0007 D2 must be amended accordingly.
---
### D4. Unified fabric path for Memory R/W and Kernel Launch
Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
The difference is what M_CPU does upon receiving the request.
**Forward path (IO_CPU → target M_CPU):**
```
IO_CPU
→ [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out] (zero or more)
→ target cube: ucie_in → noc → M_CPU
```
**At M_CPU (diverges by operation type):**
```
Memory R/W: M_CPU → M_CPU.DMA → noc → hbm_ctrl
Kernel Launch: M_CPU → PE[0..n] (parallel fan-out)
```
**Completion path (reverse, same fabric):**
```
Memory R/W: hbm_ctrl → noc → M_CPU.DMA → M_CPU
Kernel Launch: PE[0..n] all complete → M_CPU (aggregation)
M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
```
---
### D5. M_CPU.DMA is an internal subcomponent of M_CPU
M_CPU.DMA is NOT a separate topology node.
It is an internal subcomponent owned by the M_CPU component implementation.
M_CPU.DMA:
- owns the DMA READ and DMA WRITE queues (capacity=1 each, per ADR-0014 D4),
- issues memory requests over the NOC to hbm_ctrl,
- receives completion from hbm_ctrl via the NOC,
- reports completion to M_CPU,
- is created and managed inside M_CPU's `__init__` and `run()`.
M_CPU.DMA does not appear as a node in the compiled topology graph.
---
### D6. Transit cube forwarding
A cube that is not the target of a memory or kernel request acts as a transit node.
Transit cubes forward requests without consuming them:
```
ucie_in (from upstream) → noc → ucie_out (to downstream)
```
Transit forwarding is implemented entirely within the ucie_in component.
The noc and ucie_out components in a transit cube forward the packet without modification.
---
### D7. _formula_latency is preserved as a lower-bound cross-check
The path-based formula latency function (`_formula_latency`) is preserved in the engine
as a lower bound for correctness verification.
Invariant:
- Phase 0: `_formula_latency == component model total_ns`
- Phase 1+: `_formula_latency <= component model total_ns` (contention adds queueing)
This function is independent of the port/wire model and requires only the topology graph.
It is used for shard comparison in `_route_kernel` and as a regression guard.
---
## Consequences
- Components model realistic hardware behavior (queues, contention, fan-out).
- Propagation delay is modeled accurately per edge.
- Engine is decoupled from routing policy.
- Component implementations remain swappable via DI (ADR-0007 D3).
- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
---
## Links
- ADR-0007 D2 (to be amended: engine path-walking clause)
- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
- ADR-0014 D4 (DMA engine capacity=1)
- ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
+363
View File
@@ -0,0 +1,363 @@
# 실무 DI 패턴: kernbench 구현으로 배우는 Dependency Injection
---
## 슬라이드 1 — 오늘 이야기할 것
**질문:** 코드를 어떻게 설계해야 테스트하기 쉽고, 갈아끼우기 쉬울까?
**답:** Dependency Injection (DI)
오늘은 이론이 아니라 **실제로 돌아가는 시뮬레이터 코드**를 보면서 배웁니다.
```
kernbench
└── AI 가속기 하드웨어를 Python으로 시뮬레이션하는 프레임워크
- 수십 개의 하드웨어 컴포넌트 (NOC, HBM, PE, CPU...)
- 각 컴포넌트는 런타임에 교체 가능
- 테스트에서 Mock 컴포넌트로 즉시 대체 가능
```
---
## 슬라이드 2 — DI가 없으면 어떤 일이 생기나
```python
# ❌ DI 없는 코드
class IoCpuComponent:
def run(self, env, nbytes):
router = PathRouter() # 직접 생성 — 교체 불가
hbm = HbmCtrlComponent() # 직접 생성 — 교체 불가
yield env.timeout(10.0)
```
**문제:**
- 테스트할 때 실제 `PathRouter``HbmCtrl`이 항상 따라온다
- 컴포넌트를 Mock으로 바꾸려면 **소스 코드를 수정**해야 한다
- 다른 topology(다른 라우팅 전략)를 쓰고 싶으면 **또 수정**
> 클래스가 자기 의존성을 스스로 만들면, 그 클래스는 의존성과 결합된다
---
## 슬라이드 3 — DI의 핵심 원칙
**의존성은 밖에서 만들어서 안으로 넣어준다**
```
┌────────────────────────────┐
│ 조립자 (Assembler) │ ← 누가 무엇을 쓸지 결정
│ GraphEngine.__init__ │
└────────────┬───────────────┘
│ ctx 주입
┌────────────────────────────┐
│ 컴포넌트 (Component) │ ← 어떻게 동작하는지만 알면 됨
│ IoCpuComponent │
│ self.ctx.router.find_path(...) ← 그냥 사용
└────────────────────────────┘
```
**세 가지 역할 분리:**
1. **Interface** — 무엇을 할 수 있는가 (`ComponentBase`)
2. **Implementation** — 어떻게 하는가 (`IoCpuComponent`, `HbmCtrlComponent`, ...)
3. **Assembler** — 무엇을 연결할 것인가 (`GraphEngine`)
---
## 슬라이드 4 — 패턴 1: Constructor Injection
> 생성자로 의존성을 받는다
```python
# kernbench/components/base.py
class ComponentBase(ABC):
def __init__(self, node: Node, ctx: ComponentContext | None = None):
self.node = node
self.ctx = ctx # 외부에서 주입받은 의존성
self.in_ports: dict[str, simpy.Store] = {}
self.out_ports: dict[str, simpy.Store] = {}
```
```python
# 사용 측 — ctx를 직접 만들지 않는다
class IoCpuComponent(ComponentBase):
def _dispatch(self, env, txn):
path = self.ctx.router.find_node_path(...) # ctx는 이미 들어와 있음
yield self.out_ports[next_hop].put(...)
```
**언제 쓰나:**
- 컴포넌트가 살아있는 동안 의존성이 바뀌지 않을 때
- 의존성 없이는 컴포넌트가 동작하지 않을 때 (필수 의존성)
---
## 슬라이드 5 — Context Object 패턴
> 의존성이 많아지면 묶어서 하나로
```python
# kernbench/components/context.py
@dataclass
class ComponentContext:
router: PathRouter # 라우팅 정책
resolver: AddressResolver # 주소 해석
positions: dict[str, ...] # 물리적 위치 정보
ns_per_mm: float # 전파 지연 상수
edge_map: dict[...] # 엣지 정보
spec: dict # 토폴로지 스펙
```
**왜 Context로 묶나?**
- 생성자 인자가 6개면 → 컴포넌트 추가할 때마다 시그니처 변경
- Context 하나면 → 새 필드 추가해도 기존 컴포넌트 무영향
- 컴포넌트는 **필요한 것만 꺼내 쓴다**
```python
class TwoDMeshNocComponent(ComponentBase):
def _route(self, env, txn):
src_pos = self.ctx.positions.get(prev_hop) # 위치만 사용
ns_per_mm = self.ctx.ns_per_mm # 상수만 사용
# router, resolver 등은 건드리지 않음
```
---
## 슬라이드 6 — 패턴 2: Registry + Factory
> 문자열 키 → 클래스 매핑으로 런타임 교체
```python
# kernbench/components/base.py
class ComponentRegistry:
_registry: dict[str, type[ComponentBase]] = {}
@classmethod
def register(cls, impl: str, component_cls: type[ComponentBase]):
cls._registry[impl] = component_cls
@classmethod
def create(cls, node, overrides=None, ctx=None) -> ComponentBase:
if overrides and node.impl in overrides:
return overrides[node.impl](node, ctx) # 1순위: 호출자 override
if node.impl in cls._registry:
return cls._registry[node.impl](node, ctx) # 2순위: 등록된 구현
return DefaultComponent(node, ctx) # 3순위: 기본값 fallback
```
**Resolution 우선순위:**
```
overrides[impl] ← 테스트/실험용 주입
↓ (없으면)
_registry[impl] ← 프로덕션 구현
↓ (없으면)
DefaultComponent ← 안전한 fallback
```
---
## 슬라이드 7 — Registry 등록 방식
```python
# kernbench/components/impls/__init__.py
from kernbench.components.base import ComponentRegistry
from kernbench.components.impls.noc import TwoDMeshNocComponent
from kernbench.components.impls.io_cpu import IoCpuComponent
# ...
ComponentRegistry.register("noc_2d_mesh_v1", TwoDMeshNocComponent)
ComponentRegistry.register("io_cpu_v1", IoCpuComponent)
ComponentRegistry.register("hbm_ctrl_v1", HbmCtrlComponent)
# ...
```
**topology.yaml (설정 파일)**
```yaml
nodes:
- id: sip0.cube0.noc
impl: noc_2d_mesh_v1 # ← 이 문자열이 Registry 키
```
**흐름:**
```
YAML → impl 문자열 → Registry.create() → 실제 컴포넌트 인스턴스
```
impl 문자열만 바꾸면 동작이 바뀐다. 코드 수정 없음.
---
## 슬라이드 8 — 패턴 3: Override Injection (테스트용)
> 호출자가 특정 impl만 갈아끼운다
```python
# tests/test_component_registry.py
class SpyXbar(ComponentBase):
calls = 0
def run(self, env, nbytes):
SpyXbar.calls += 1
yield env.timeout(0)
# 테스트에서 xbar_v1만 SpyXbar로 교체
engine = GraphEngine(
graph,
component_overrides={"xbar_v1": SpyXbar} # ← 이것만 추가
)
result = engine.run(msg)
assert SpyXbar.calls > 0 # Xbar가 실제로 호출됐는지 검증
```
**핵심:** 테스트 코드가 프로덕션 코드를 **수정하지 않는다**
---
## 슬라이드 9 — 조립자: GraphEngine
> 컴포넌트를 생성하고 연결하는 유일한 곳
```python
# kernbench/sim_engine/engine.py
class GraphEngine:
def __init__(self, graph, component_overrides=None):
# 1. 공유 의존성 생성
ctx = ComponentContext(
router=PathRouter(graph),
resolver=AddressResolver(graph),
positions={nid: n.pos_mm for nid, n in graph.nodes.items()},
ns_per_mm=...,
)
# 2. 컴포넌트 생성 (DI: ctx 주입)
self._components = {
node_id: ComponentRegistry.create(node, overrides, ctx)
for node_id, node in graph.nodes.items()
}
# 3. 포트 연결 (배선)
for e in graph.edges:
store = simpy.Store(self._env)
self._components[e.src].out_ports[e.dst] = store
self._components[e.dst].in_ports[e.src] = store
```
**생성 → 주입 → 연결** — 이 세 단계가 한 곳에서만 일어난다
---
## 슬라이드 10 — 전체 구조 한눈에 보기
```
topology.yaml
│ impl: "noc_2d_mesh_v1"
GraphEngine.__init__() ← 조립자
├── ComponentContext 생성 ← 공유 의존성 묶음
│ ├── PathRouter
│ ├── AddressResolver
│ └── positions, ns_per_mm, ...
├── ComponentRegistry.create(node, overrides, ctx)
│ ├── overrides["noc_2d_mesh_v1"]? → SpyNoc (테스트)
│ ├── registry["noc_2d_mesh_v1"]? → TwoDMeshNocComponent (프로덕션)
│ └── fallback → DefaultComponent
└── 포트 배선: out_ports / in_ports 연결
Component (TwoDMeshNocComponent)
└── self.ctx.positions, self.ctx.ns_per_mm 사용
(라우터, 리졸버는 건드리지 않음 — 필요한 것만)
```
---
## 슬라이드 11 — 무엇을 얻었나
| 상황 | DI 없이 | DI 있이 |
|------|---------|---------|
| NOC 알고리즘 교체 | 소스 코드 수정 | YAML에서 impl 문자열 변경 |
| Xbar 동작 검증 | 실제 HW 전부 구동 | `overrides={"xbar_v1": SpyXbar}` |
| 새 컴포넌트 추가 | 기존 코드 수정 | `register("new_v1", NewComp)` |
| 컨텍스트 필드 추가 | 모든 생성자 수정 | `ComponentContext`에 필드 추가 |
| 테스트 격리 | 불가능 | 필요한 것만 override |
---
## 슬라이드 12 — 실무 적용 체크리스트
**설계할 때 물어볼 것:**
1. **이 클래스가 직접 `new`(생성)하는 것은 무엇인가?**
→ 생성하는 것 = 교체할 수 없는 것. 생성자로 받을 수 없는지 검토.
2. **의존성이 3개 이상이면?**
→ Context Object로 묶어라.
3. **테스트에서 이 클래스를 단독으로 실행할 수 있는가?**
→ 없다면 DI가 필요하다는 신호.
4. **설정(YAML/config)으로 동작을 바꾸고 싶은가?**
→ Registry + 문자열 키 패턴.
5. **누가 조립하는가?**
→ 조립자는 하나여야 한다. 컴포넌트 안에 조립 로직이 있으면 안 된다.
---
## 슬라이드 13 — 안티패턴: 이것은 하지 말자
```python
# ❌ 서비스 로케이터 (컴포넌트 안에서 registry 호출)
class BadComponent(ComponentBase):
def run(self, env, nbytes):
router = ComponentRegistry.get("router") # 컴포넌트가 직접 찾는다
...
# ❌ 전역 싱글톤 직접 참조
class BadComponent(ComponentBase):
def run(self, env, nbytes):
router = GlobalRouter.instance() # 교체 불가
...
# ❌ 생성자 안에서 의존성 생성
class BadComponent(ComponentBase):
def __init__(self, node):
self.router = PathRouter(node.graph) # 테스트에서 격리 불가
```
**공통 문제:** 컴포넌트가 자기 의존성을 스스로 해결한다 → 결합도 증가
---
## 슬라이드 14 — 요약
> **DI = 의존성의 생성과 사용을 분리하는 것**
```
생성 → Registry / Assembler (GraphEngine)
사용 → Component (IoCpuComponent, TwoDMeshNocComponent, ...)
```
**kernbench에서 배운 패턴 3가지:**
1. **Constructor Injection** — 필수 의존성은 생성자로
2. **Context Object** — 의존성 묶음을 하나의 dataclass로
3. **Registry + Override** — 문자열 키로 구현체 선택, 테스트에서 교체
**결과:** 141개 테스트, YAML 한 줄로 컴포넌트 교체, 프로덕션 코드 수정 없이 Mock 주입
---
*참고 코드: kernbench/src/kernbench/components/*
+26
View File
@@ -0,0 +1,26 @@
# Generated Diagrams
This directory contains diagrams generated from topology compilation.
## What these files are
- Derived artifacts generated from:
- compiled topology graph
- distance (accumulated latency) metadata
- view/layout rules (ADR-0005)
These files are meant for quick visual inspection and review.
## Default outputs
- SIP view: `sip_view.mmd` (and/or `sip_view.dot`)
- CUBE view: `cube_view.mmd` (and/or `cube_view.dot`)
- PE view: `pe_view.mmd` (and/or `pe_view.dot`)
## How to preview
- In VS Code:
- open `.mmd` or `.md` containing Mermaid blocks and use Markdown Preview
- for `.dot`, use a Graphviz preview extension or `dot -Tpng`
## Notes
- Diagrams are representative and distance-aware by default.
- Instance indices are not required unless debugging asymmetry.
- Outputs should be deterministic for the same topology and rules.
+156
View File
@@ -0,0 +1,156 @@
<svg xmlns="http://www.w3.org/2000/svg" width="556" height="472" viewBox="0 0 556 472">
<title>cube</title>
<rect width="556" height="472" fill="#f8fafc"/>
<text x="278" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">CUBE VIEW</text>
<rect x="40.0" y="40.0" width="476.0" height="392.0" rx="6" fill="none" stroke="#475569" stroke-width="2" stroke-dasharray="8,4"/>
<rect x="152.0" y="166.0" width="252.0" height="140.0" rx="4" fill="#d1fae5" stroke="#10b981" stroke-width="1.5" stroke-dasharray="6,3" opacity="0.5"/>
<text x="278.0" y="278.0" text-anchor="middle" font-family="monospace" font-size="11" fill="#047857" opacity="0.7">HBM</text>
<polyline points="82.0,82.0 82.0,95.0 82.0,95.0 82.0,138.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="82.0" y="92.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="82.0,82.0 82.0,144.0 334.0,144.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,144.0 82.0,144.0 82.0,82.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="166.0,82.0 166.0,95.0 166.0,95.0 166.0,138.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="166.0" y="92.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="166.0,82.0 166.0,154.0 334.0,154.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,144.0 166.0,144.0 166.0,82.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="390.0,82.0 390.0,95.0 390.0,95.0 390.0,138.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="390.0" y="92.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="390.0,82.0 390.0,164.0 334.0,164.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,144.0 390.0,144.0 390.0,82.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="474.0,82.0 474.0,95.0 474.0,95.0 474.0,138.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="474.0" y="92.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="474.0,82.0 474.0,174.0 334.0,174.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,144.0 474.0,144.0 474.0,82.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="82.0,390.0 82.0,347.0 82.0,347.0 82.0,334.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="82.0" y="344.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="82.0,390.0 82.0,338.0 334.0,338.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,298.0 82.0,298.0 82.0,390.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="166.0,390.0 166.0,347.0 166.0,347.0 166.0,334.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="166.0" y="344.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="166.0,390.0 166.0,348.0 334.0,348.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,298.0 166.0,298.0 166.0,390.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="390.0,390.0 390.0,347.0 390.0,347.0 390.0,334.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="390.0" y="344.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="390.0,390.0 390.0,358.0 334.0,358.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,298.0 390.0,298.0 390.0,390.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="474.0,390.0 474.0,347.0 474.0,347.0 474.0,334.0" fill="none" stroke="#f97316" stroke-width="1" opacity="0.8"/>
<text x="474.0" y="344.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">6.0mm 256GB/s</text>
<polyline points="474.0,390.0 474.0,368.0 334.0,368.0 334.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<polyline points="334.0,236.0 334.0,298.0 474.0,298.0 474.0,390.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="82.0,138.0 222.0,138.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="152.0" y="183.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="166.0,138.0 222.0,138.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="194.0" y="183.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="390.0,138.0 222.0,138.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="306.0" y="183.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="474.0,138.0 222.0,138.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="348.0" y="183.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="82.0,334.0 222.0,334.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="152.0" y="281.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="166.0,334.0 222.0,334.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="194.0" y="281.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="390.0,334.0 222.0,334.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="306.0" y="281.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<polyline points="474.0,334.0 222.0,334.0 222.0,236.0" fill="none" stroke="#10b981" stroke-width="1" opacity="0.8"/>
<text x="348.0" y="281.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.5mm 256GB/s</text>
<line x1="82.0" y1="138.0" x2="166.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="124.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="166.0" y1="138.0" x2="82.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="124.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="166.0" y1="138.0" x2="390.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="278.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">10.0mm 128GB/s</text>
<line x1="390.0" y1="138.0" x2="166.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="278.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">10.0mm 128GB/s</text>
<line x1="390.0" y1="138.0" x2="474.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="432.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="474.0" y1="138.0" x2="390.0" y2="138.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="432.0" y="134.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="82.0" y1="334.0" x2="166.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="124.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="166.0" y1="334.0" x2="82.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="124.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="166.0" y1="334.0" x2="390.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="278.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">10.0mm 128GB/s</text>
<line x1="390.0" y1="334.0" x2="166.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="278.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">10.0mm 128GB/s</text>
<line x1="390.0" y1="334.0" x2="474.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="432.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<line x1="474.0" y1="334.0" x2="390.0" y2="334.0" stroke="#94a3b8" stroke-width="1" opacity="0.8"/>
<text x="432.0" y="330.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">2.0mm 128GB/s</text>
<polyline points="82.0,138.0 110.0,138.0 110.0,292.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="96.0" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="110.0,292.0 82.0,292.0 82.0,138.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="96.0" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="82.0,334.0 110.0,334.0 110.0,292.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="96.0" y="309.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="110.0,292.0 82.0,292.0 82.0,334.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="96.0" y="309.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="474.0,138.0 446.0,138.0 446.0,292.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="460.0" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="446.0,292.0 474.0,292.0 474.0,138.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="460.0" y="211.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="474.0,334.0 446.0,334.0 446.0,292.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="460.0" y="309.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="446.0,292.0 474.0,292.0 474.0,334.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.8"/>
<text x="460.0" y="309.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.0mm 512GB/s</text>
<polyline points="334.0,236.0 334.0,131.4 278.0,131.4 278.0,56.8" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.6"/>
<polyline points="334.0,236.0 334.0,310.6 278.0,310.6 278.0,415.2" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.6"/>
<polyline points="334.0,236.0 334.0,221.0 488.0,221.0 488.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.6"/>
<polyline points="334.0,236.0 334.0,221.0 68.0,221.0 68.0,236.0" fill="none" stroke="#a78bfa" stroke-width="1" opacity="0.6"/>
<polyline points="446.0,194.0 446.0,200.0 334.0,200.0 334.0,236.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="334.0,236.0 334.0,200.0 446.0,200.0 446.0,194.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.6"/>
<polyline points="334.0,236.0 110.0,236.0 110.0,194.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.8"/>
<polyline points="110.0,194.0 334.0,194.0 334.0,236.0" fill="none" stroke="#f59e0b" stroke-width="1" opacity="0.8"/>
<rect x="250.0" y="40.0" width="56.0" height="33.6" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<text x="278.0" y="60.8" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">UCIe-N</text>
<rect x="250.0" y="398.4" width="56.0" height="33.6" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<text x="278.0" y="419.2" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">UCIe-S</text>
<rect x="460.0" y="219.2" width="56.0" height="33.6" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<text x="488.0" y="240.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">UCIe-E</text>
<rect x="40.0" y="219.2" width="56.0" height="33.6" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<text x="68.0" y="240.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">UCIe-W</text>
<rect x="306.0" y="219.2" width="56.0" height="33.6" rx="4" fill="#a78bfa" stroke="#475569" stroke-width="1"/>
<text x="334.0" y="240.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">NOC</text>
<rect x="418.0" y="177.2" width="56.0" height="33.6" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
<text x="446.0" y="198.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">M CPU</text>
<rect x="194.0" y="219.2" width="56.0" height="33.6" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/>
<text x="222.0" y="240.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#ffffff">HBM CTRL</text>
<rect x="82.0" y="177.2" width="56.0" height="33.6" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
<text x="110.0" y="198.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">SRAM</text>
<rect x="82.0" y="275.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="110.0" y="296.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">Bridge LEFT</text>
<rect x="418.0" y="275.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="446.0" y="296.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">Bridge RIGHT</text>
<rect x="56.8" y="68.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="82.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE0</text>
<rect x="54.0" y="121.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="82.0" y="142.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE0</text>
<rect x="140.8" y="68.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="166.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE1</text>
<rect x="138.0" y="121.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="166.0" y="142.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE1</text>
<rect x="364.8" y="68.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="390.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE2</text>
<rect x="362.0" y="121.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="390.0" y="142.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE2</text>
<rect x="448.8" y="68.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="474.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE3</text>
<rect x="446.0" y="121.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="474.0" y="142.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE3</text>
<rect x="56.8" y="376.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="82.0" y="394.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE4</text>
<rect x="54.0" y="317.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="82.0" y="338.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE4</text>
<rect x="140.8" y="376.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="166.0" y="394.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE5</text>
<rect x="138.0" y="317.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="166.0" y="338.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE5</text>
<rect x="364.8" y="376.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="390.0" y="394.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE6</text>
<rect x="362.0" y="317.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="390.0" y="338.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE6</text>
<rect x="448.8" y="376.0" width="50.4" height="28.0" rx="4" fill="#94a3b8" stroke="#475569" stroke-width="1"/>
<text x="474.0" y="394.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">PE7</text>
<rect x="446.0" y="317.2" width="56.0" height="33.6" rx="4" fill="#f97316" stroke="#475569" stroke-width="1"/>
<text x="474.0" y="338.0" text-anchor="middle" font-family="monospace" font-size="8" fill="#1e293b">XBAR PE7</text>
</svg>

After

Width:  |  Height:  |  Size: 18 KiB

+31
View File
@@ -0,0 +1,31 @@
<svg xmlns="http://www.w3.org/2000/svg" width="500" height="360" viewBox="0 0 500 360">
<title>pe</title>
<rect width="500" height="360" fill="#f8fafc"/>
<text x="250" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">PE VIEW</text>
<line x1="92.5" y1="180.0" x2="180.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="136.2" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
<polyline points="180.0,180.0 180.0,92.5 285.0,92.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="232.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
<line x1="180.0" y1="180.0" x2="285.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="232.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
<polyline points="180.0,180.0 180.0,267.5 285.0,267.5" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="232.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm</text>
<polyline points="285.0,92.5 390.0,92.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="337.5" y="132.2" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
<line x1="285.0" y1="180.0" x2="390.0" y2="180.0" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="337.5" y="176.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
<polyline points="285.0,267.5 390.0,267.5 390.0,180.0" fill="none" stroke="#94a3b8" stroke-width="1.5" opacity="0.8"/>
<text x="337.5" y="219.8" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">0.5mm 512GB/s</text>
<rect x="48.8" y="155.5" width="87.5" height="49.0" rx="4" fill="#ef4444" stroke="#475569" stroke-width="1"/>
<text x="92.5" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE CPU</text>
<rect x="136.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#f59e0b" stroke="#475569" stroke-width="1"/>
<text x="180.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="9" fill="#1e293b">PE SCHEDULER</text>
<rect x="241.2" y="68.0" width="87.5" height="49.0" rx="4" fill="#3b82f6" stroke="#475569" stroke-width="1"/>
<text x="285.0" y="96.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE DMA</text>
<rect x="241.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#8b5cf6" stroke="#475569" stroke-width="1"/>
<text x="285.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE GEMM</text>
<rect x="241.2" y="243.0" width="87.5" height="49.0" rx="4" fill="#ec4899" stroke="#475569" stroke-width="1"/>
<text x="285.0" y="271.5" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE MATH</text>
<rect x="346.2" y="155.5" width="87.5" height="49.0" rx="4" fill="#10b981" stroke="#475569" stroke-width="1"/>
<text x="390.0" y="184.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#ffffff">PE TCM</text>
</svg>

After

Width:  |  Height:  |  Size: 3.2 KiB

+72
View File
@@ -0,0 +1,72 @@
<svg xmlns="http://www.w3.org/2000/svg" width="820" height="500" viewBox="0 0 820 500" font-family="monospace">
<rect width="820" height="500" fill="#f8fafc" rx="6"/>
<text x="410" y="32" text-anchor="middle" font-size="16" font-weight="bold" fill="#1e293b">Placement: column_wise</text>
<text x="410.0" y="54.0" text-anchor="middle" font-size="12" fill="#475569" font-weight="normal">Tensor (1024×512) fp16 → K axis split into 8 parts</text>
<text x="320.0" y="82.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">← K=512 →</text>
<text x="68.0" y="250.0" text-anchor="middle" font-size="11" fill="#475569" transform="rotate(-90 68.0 250.0)">↑ M=1024 ↓</text>
<rect x="80.0" y="90.0" width="60.0" height="320.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="110.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="110.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="140.0" y="90.0" width="60.0" height="320.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="170.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="170.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="200.0" y="90.0" width="60.0" height="320.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="246.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="230.0" y="262.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">(1024×64)</text>
<rect x="260.0" y="90.0" width="60.0" height="320.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="290.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="290.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="320.0" y="90.0" width="60.0" height="320.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="350.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="350.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="380.0" y="90.0" width="60.0" height="320.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="410.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="410.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="440.0" y="90.0" width="60.0" height="320.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="470.0" y="246.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="470.0" y="262.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">(1024×64)</text>
<rect x="500.0" y="90.0" width="60.0" height="320.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="530.0" y="246.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="530.0" y="262.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(1024×64)</text>
<rect x="80.0" y="90.0" width="480.0" height="320.0" fill="none" stroke="#1e293b" stroke-width="2" fill-opacity="1.0" rx="2"/>
<text x="110.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=0 B</text>
<text x="110.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="170.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=128 KB</text>
<text x="170.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="230.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=256 KB</text>
<text x="230.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="290.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=384 KB</text>
<text x="290.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="350.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=512 KB</text>
<text x="350.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=640 KB</text>
<text x="410.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="470.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=768 KB</text>
<text x="470.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="530.0" y="426.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">off=896 KB</text>
<text x="530.0" y="440.0" text-anchor="middle" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="670.0" y="100.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">PE Legend</text>
<rect x="620.0" y="106.0" width="16.0" height="16.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="118.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE0</text>
<rect x="620.0" y="128.0" width="16.0" height="16.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="140.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE1</text>
<rect x="620.0" y="150.0" width="16.0" height="16.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="162.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE2</text>
<rect x="620.0" y="172.0" width="16.0" height="16.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="184.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE3</text>
<rect x="620.0" y="194.0" width="16.0" height="16.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="206.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE4</text>
<rect x="620.0" y="216.0" width="16.0" height="16.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="228.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE5</text>
<rect x="620.0" y="238.0" width="16.0" height="16.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="250.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE6</text>
<rect x="620.0" y="260.0" width="16.0" height="16.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="642.0" y="272.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE7</text>
<rect x="620.0" y="320.0" width="167.0" height="120.0" fill="#e2e8f0" stroke="#94a3b8" stroke-width="1" fill-opacity="1.0" rx="2"/>
<text x="630.0" y="338.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Strategy: column_wise</text>
<text x="630.0" y="356.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Split axis: K</text>
<text x="630.0" y="374.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Shards: 8</text>
<text x="630.0" y="392.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Each: (1024, 64)</text>
<text x="630.0" y="410.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Each: 128 KB</text>
<text x="630.0" y="428.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Total: 1 MB</text>
</svg>

After

Width:  |  Height:  |  Size: 8.1 KiB

+47
View File
@@ -0,0 +1,47 @@
<svg xmlns="http://www.w3.org/2000/svg" width="820" height="500" viewBox="0 0 820 500" font-family="monospace">
<rect width="820" height="500" fill="#f8fafc" rx="6"/>
<text x="410" y="32" text-anchor="middle" font-size="16" font-weight="bold" fill="#1e293b">Placement: replicate</text>
<text x="410.0" y="54.0" text-anchor="middle" font-size="12" fill="#475569" font-weight="normal">Tensor (1024×512) fp16 → full copy to each PE</text>
<rect x="60.0" y="90.0" width="163.0" height="162.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="141.5" y="157.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE0</text>
<text x="141.5" y="177.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="141.5" y="193.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="141.5" y="207.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="239.0" y="90.0" width="163.0" height="162.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="320.5" y="157.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE1</text>
<text x="320.5" y="177.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="320.5" y="193.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="320.5" y="207.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="418.0" y="90.0" width="163.0" height="162.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="499.5" y="157.0" text-anchor="middle" font-size="14" fill="#000" font-weight="bold">PE2</text>
<text x="499.5" y="177.0" text-anchor="middle" font-size="11" fill="#000" font-weight="normal">(1024×512)</text>
<text x="499.5" y="193.0" text-anchor="middle" font-size="10" fill="#000" font-weight="normal">1 MB</text>
<text x="499.5" y="207.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">offset=0</text>
<rect x="597.0" y="90.0" width="163.0" height="162.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="678.5" y="157.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE3</text>
<text x="678.5" y="177.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="678.5" y="193.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="678.5" y="207.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="60.0" y="268.0" width="163.0" height="162.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="141.5" y="335.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE4</text>
<text x="141.5" y="355.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="141.5" y="371.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="141.5" y="385.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="239.0" y="268.0" width="163.0" height="162.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="320.5" y="335.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE5</text>
<text x="320.5" y="355.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="320.5" y="371.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="320.5" y="385.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="418.0" y="268.0" width="163.0" height="162.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="499.5" y="335.0" text-anchor="middle" font-size="14" fill="#000" font-weight="bold">PE6</text>
<text x="499.5" y="355.0" text-anchor="middle" font-size="11" fill="#000" font-weight="normal">(1024×512)</text>
<text x="499.5" y="371.0" text-anchor="middle" font-size="10" fill="#000" font-weight="normal">1 MB</text>
<text x="499.5" y="385.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">offset=0</text>
<rect x="597.0" y="268.0" width="163.0" height="162.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="678.5" y="335.0" text-anchor="middle" font-size="14" fill="#fff" font-weight="bold">PE7</text>
<text x="678.5" y="355.0" text-anchor="middle" font-size="11" fill="#fff" font-weight="normal">(1024×512)</text>
<text x="678.5" y="371.0" text-anchor="middle" font-size="10" fill="#fff" font-weight="normal">1 MB</text>
<text x="678.5" y="385.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">offset=0</text>
<rect x="60.0" y="450.0" width="496.0" height="30.0" fill="#e2e8f0" stroke="#94a3b8" stroke-width="1" fill-opacity="1.0" rx="2"/>
<text x="70.0" y="468.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Strategy: replicate | Shards: 8 | Each: 1 MB | Total mem: 8 MB</text>
</svg>

After

Width:  |  Height:  |  Size: 5.2 KiB

+72
View File
@@ -0,0 +1,72 @@
<svg xmlns="http://www.w3.org/2000/svg" width="820" height="560" viewBox="0 0 820 560" font-family="monospace">
<rect width="820" height="560" fill="#f8fafc" rx="6"/>
<text x="410" y="32" text-anchor="middle" font-size="16" font-weight="bold" fill="#1e293b">Placement: row_wise</text>
<text x="410.0" y="54.0" text-anchor="middle" font-size="12" fill="#475569" font-weight="normal">Tensor (1024×512) fp16 → M axis split into 8 parts</text>
<text x="240.0" y="82.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">← K=512 →</text>
<text x="68.0" y="290.0" text-anchor="middle" font-size="11" fill="#475569" transform="rotate(-90 68.0 290.0)">↑ M=1024 ↓</text>
<rect x="80.0" y="90.0" width="320.0" height="50.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="111.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="240.0" y="127.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="140.0" width="320.0" height="50.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="161.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="240.0" y="177.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="190.0" width="320.0" height="50.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="211.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="240.0" y="227.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">(128×512)</text>
<rect x="80.0" y="240.0" width="320.0" height="50.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="261.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="240.0" y="277.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="290.0" width="320.0" height="50.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="311.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="240.0" y="327.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="340.0" width="320.0" height="50.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="361.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="240.0" y="377.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="390.0" width="320.0" height="50.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="411.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="240.0" y="427.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">(128×512)</text>
<rect x="80.0" y="440.0" width="320.0" height="50.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="240.0" y="461.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="240.0" y="477.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">(128×512)</text>
<rect x="80.0" y="90.0" width="320.0" height="400.0" fill="none" stroke="#1e293b" stroke-width="2" fill-opacity="1.0" rx="2"/>
<text x="410.0" y="111.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=0 B</text>
<text x="410.0" y="125.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="161.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=128 KB</text>
<text x="410.0" y="175.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="211.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=256 KB</text>
<text x="410.0" y="225.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="261.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=384 KB</text>
<text x="410.0" y="275.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="311.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=512 KB</text>
<text x="410.0" y="325.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="361.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=640 KB</text>
<text x="410.0" y="375.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="411.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=768 KB</text>
<text x="410.0" y="425.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="410.0" y="461.0" text-anchor="start" font-size="9" fill="#475569" font-weight="normal">off=896 KB</text>
<text x="410.0" y="475.0" text-anchor="start" font-size="9" fill="#64748b" font-weight="normal">128 KB</text>
<text x="630.0" y="100.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">PE Legend</text>
<rect x="580.0" y="106.0" width="16.0" height="16.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="118.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE0</text>
<rect x="580.0" y="128.0" width="16.0" height="16.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="140.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE1</text>
<rect x="580.0" y="150.0" width="16.0" height="16.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="162.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE2</text>
<rect x="580.0" y="172.0" width="16.0" height="16.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="184.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE3</text>
<rect x="580.0" y="194.0" width="16.0" height="16.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="206.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE4</text>
<rect x="580.0" y="216.0" width="16.0" height="16.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="228.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE5</text>
<rect x="580.0" y="238.0" width="16.0" height="16.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="250.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE6</text>
<rect x="580.0" y="260.0" width="16.0" height="16.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="602.0" y="272.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE7</text>
<rect x="580.0" y="320.0" width="146.0" height="120.0" fill="#e2e8f0" stroke="#94a3b8" stroke-width="1" fill-opacity="1.0" rx="2"/>
<text x="590.0" y="338.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Strategy: row_wise</text>
<text x="590.0" y="356.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Split axis: M</text>
<text x="590.0" y="374.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Shards: 8</text>
<text x="590.0" y="392.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Each: (128, 512)</text>
<text x="590.0" y="410.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Each: 128 KB</text>
<text x="590.0" y="428.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Total: 1 MB</text>
</svg>

After

Width:  |  Height:  |  Size: 8.1 KiB

@@ -0,0 +1,116 @@
<svg xmlns="http://www.w3.org/2000/svg" width="820" height="620" viewBox="0 0 820 620" font-family="monospace">
<rect width="820" height="620" fill="#f8fafc" rx="6"/>
<text x="410" y="32" text-anchor="middle" font-size="16" font-weight="bold" fill="#1e293b">Placement: tiled_column_major</text>
<text x="410.0" y="54.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">Tensor (1024×512) fp16, tile=(256×128) → 4×4=16 tiles, column-major (K first)</text>
<text x="280.0" y="82.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">← K=512 →</text>
<text x="68.0" y="290.0" text-anchor="middle" font-size="11" fill="#475569" transform="rotate(-90 68.0 290.0)">↑ M=1024 ↓</text>
<rect x="80.0" y="90.0" width="100.0" height="100.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="130.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t0</text>
<rect x="180.0" y="90.0" width="100.0" height="100.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="230.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t1</text>
<rect x="280.0" y="90.0" width="100.0" height="100.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="136.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="330.0" y="152.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t2</text>
<rect x="380.0" y="90.0" width="100.0" height="100.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="430.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t3</text>
<rect x="80.0" y="190.0" width="100.0" height="100.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="130.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t4</text>
<rect x="180.0" y="190.0" width="100.0" height="100.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="230.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t5</text>
<rect x="280.0" y="190.0" width="100.0" height="100.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="236.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="330.0" y="252.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t6</text>
<rect x="380.0" y="190.0" width="100.0" height="100.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="430.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t7</text>
<rect x="80.0" y="290.0" width="100.0" height="100.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="336.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="130.0" y="352.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t8</text>
<rect x="180.0" y="290.0" width="100.0" height="100.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="336.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="230.0" y="352.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t9</text>
<rect x="280.0" y="290.0" width="100.0" height="100.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="336.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="330.0" y="352.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t10</text>
<rect x="380.0" y="290.0" width="100.0" height="100.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="336.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="430.0" y="352.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t11</text>
<rect x="80.0" y="390.0" width="100.0" height="100.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="130.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t12</text>
<rect x="180.0" y="390.0" width="100.0" height="100.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="230.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t13</text>
<rect x="280.0" y="390.0" width="100.0" height="100.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="436.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="330.0" y="452.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t14</text>
<rect x="380.0" y="390.0" width="100.0" height="100.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="430.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t15</text>
<rect x="80.0" y="90.0" width="400.0" height="400.0" fill="none" stroke="#1e293b" stroke-width="2" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=0..127</text>
<text x="230.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=128..255</text>
<text x="330.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=256..383</text>
<text x="430.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=384..511</text>
<text x="64.0" y="140.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=0..255</text>
<text x="64.0" y="240.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=256..511</text>
<text x="64.0" y="340.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=512..767</text>
<text x="64.0" y="440.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=768..1023</text>
<text x="590.0" y="90.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">PE Legend</text>
<rect x="540.0" y="96.0" width="16.0" height="16.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="108.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE0</text>
<rect x="540.0" y="118.0" width="16.0" height="16.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="130.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE1</text>
<rect x="540.0" y="140.0" width="16.0" height="16.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="152.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE2</text>
<rect x="540.0" y="162.0" width="16.0" height="16.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="174.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE3</text>
<rect x="540.0" y="184.0" width="16.0" height="16.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="196.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE4</text>
<rect x="540.0" y="206.0" width="16.0" height="16.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="218.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE5</text>
<rect x="540.0" y="228.0" width="16.0" height="16.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="240.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE6</text>
<rect x="540.0" y="250.0" width="16.0" height="16.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="262.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE7</text>
<text x="540.0" y="310.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">Tile Assignment Order</text>
<rect x="540.0" y="318.0" width="12.0" height="12.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="328.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 0 → PE0 (0,0) off=0 B</text>
<rect x="540.0" y="334.0" width="12.0" height="12.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="344.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 1 → PE1 (0,1) off=256 B</text>
<rect x="540.0" y="350.0" width="12.0" height="12.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="360.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 2 → PE2 (0,2) off=512 B</text>
<rect x="540.0" y="366.0" width="12.0" height="12.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="376.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 3 → PE3 (0,3) off=768 B</text>
<rect x="540.0" y="382.0" width="12.0" height="12.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="392.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 4 → PE4 (1,0) off=256 KB</text>
<rect x="540.0" y="398.0" width="12.0" height="12.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="408.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 5 → PE5 (1,1) off=256 KB</text>
<rect x="540.0" y="414.0" width="12.0" height="12.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="424.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 6 → PE6 (1,2) off=256 KB</text>
<rect x="540.0" y="430.0" width="12.0" height="12.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="440.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 7 → PE7 (1,3) off=256 KB</text>
<rect x="540.0" y="446.0" width="12.0" height="12.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="456.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 8 → PE0 (2,0) off=512 KB</text>
<rect x="540.0" y="462.0" width="12.0" height="12.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="472.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 9 → PE1 (2,1) off=512 KB</text>
<rect x="540.0" y="478.0" width="12.0" height="12.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="488.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t10 → PE2 (2,2) off=512 KB</text>
<rect x="540.0" y="494.0" width="12.0" height="12.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="504.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t11 → PE3 (2,3) off=512 KB</text>
<rect x="540.0" y="510.0" width="12.0" height="12.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="520.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t12 → PE4 (3,0) off=768 KB</text>
<rect x="540.0" y="526.0" width="12.0" height="12.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="536.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t13 → PE5 (3,1) off=768 KB</text>
<rect x="540.0" y="542.0" width="12.0" height="12.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="552.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t14 → PE6 (3,2) off=768 KB</text>
<rect x="540.0" y="558.0" width="12.0" height="12.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="568.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t15 → PE7 (3,3) off=768 KB</text>
<rect x="80.0" y="560.0" width="608.0" height="30.0" fill="#e2e8f0" stroke="#94a3b8" stroke-width="1" fill-opacity="1.0" rx="2"/>
<text x="90.0" y="578.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Strategy: tiled_column_major | Tile: (256×128)=64 KB | Tiles: 16 | Total: 1 MB</text>
</svg>

After

Width:  |  Height:  |  Size: 14 KiB

+116
View File
@@ -0,0 +1,116 @@
<svg xmlns="http://www.w3.org/2000/svg" width="820" height="620" viewBox="0 0 820 620" font-family="monospace">
<rect width="820" height="620" fill="#f8fafc" rx="6"/>
<text x="410" y="32" text-anchor="middle" font-size="16" font-weight="bold" fill="#1e293b">Placement: tiled_row_major</text>
<text x="410.0" y="54.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">Tensor (1024×512) fp16, tile=(256×128) → 4×4=16 tiles, row-major (M first)</text>
<text x="280.0" y="82.0" text-anchor="middle" font-size="11" fill="#475569" font-weight="normal">← K=512 →</text>
<text x="68.0" y="290.0" text-anchor="middle" font-size="11" fill="#475569" transform="rotate(-90 68.0 290.0)">↑ M=1024 ↓</text>
<rect x="80.0" y="90.0" width="100.0" height="100.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="130.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t0</text>
<rect x="80.0" y="190.0" width="100.0" height="100.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="130.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t1</text>
<rect x="80.0" y="290.0" width="100.0" height="100.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="336.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="130.0" y="352.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t2</text>
<rect x="80.0" y="390.0" width="100.0" height="100.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="130.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t3</text>
<rect x="180.0" y="90.0" width="100.0" height="100.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="230.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t4</text>
<rect x="180.0" y="190.0" width="100.0" height="100.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="230.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t5</text>
<rect x="180.0" y="290.0" width="100.0" height="100.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="336.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="230.0" y="352.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t6</text>
<rect x="180.0" y="390.0" width="100.0" height="100.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="230.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="230.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t7</text>
<rect x="280.0" y="90.0" width="100.0" height="100.0" fill="#3b82f6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE0</text>
<text x="330.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t8</text>
<rect x="280.0" y="190.0" width="100.0" height="100.0" fill="#10b981" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE1</text>
<text x="330.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t9</text>
<rect x="280.0" y="290.0" width="100.0" height="100.0" fill="#f59e0b" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="336.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE2</text>
<text x="330.0" y="352.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t10</text>
<rect x="280.0" y="390.0" width="100.0" height="100.0" fill="#ef4444" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="330.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE3</text>
<text x="330.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t11</text>
<rect x="380.0" y="90.0" width="100.0" height="100.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="136.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE4</text>
<text x="430.0" y="152.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t12</text>
<rect x="380.0" y="190.0" width="100.0" height="100.0" fill="#ec4899" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="236.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE5</text>
<text x="430.0" y="252.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t13</text>
<rect x="380.0" y="290.0" width="100.0" height="100.0" fill="#06b6d4" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="336.0" text-anchor="middle" font-size="12" fill="#000" font-weight="bold">PE6</text>
<text x="430.0" y="352.0" text-anchor="middle" font-size="9" fill="#000" font-weight="normal">t14</text>
<rect x="380.0" y="390.0" width="100.0" height="100.0" fill="#f97316" stroke="#334155" stroke-width="1.5" fill-opacity="1.0" rx="2"/>
<text x="430.0" y="436.0" text-anchor="middle" font-size="12" fill="#fff" font-weight="bold">PE7</text>
<text x="430.0" y="452.0" text-anchor="middle" font-size="9" fill="#fff" font-weight="normal">t15</text>
<rect x="80.0" y="90.0" width="400.0" height="400.0" fill="none" stroke="#1e293b" stroke-width="2" fill-opacity="1.0" rx="2"/>
<text x="130.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=0..127</text>
<text x="230.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=128..255</text>
<text x="330.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=256..383</text>
<text x="430.0" y="506.0" text-anchor="middle" font-size="9" fill="#475569" font-weight="normal">k=384..511</text>
<text x="64.0" y="140.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=0..255</text>
<text x="64.0" y="240.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=256..511</text>
<text x="64.0" y="340.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=512..767</text>
<text x="64.0" y="440.0" text-anchor="end" font-size="9" fill="#475569" font-weight="normal">m=768..1023</text>
<text x="590.0" y="90.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">PE Legend</text>
<rect x="540.0" y="96.0" width="16.0" height="16.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="108.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE0</text>
<rect x="540.0" y="118.0" width="16.0" height="16.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="130.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE1</text>
<rect x="540.0" y="140.0" width="16.0" height="16.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="152.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE2</text>
<rect x="540.0" y="162.0" width="16.0" height="16.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="174.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE3</text>
<rect x="540.0" y="184.0" width="16.0" height="16.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="196.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE4</text>
<rect x="540.0" y="206.0" width="16.0" height="16.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="218.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE5</text>
<rect x="540.0" y="228.0" width="16.0" height="16.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="240.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE6</text>
<rect x="540.0" y="250.0" width="16.0" height="16.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="562.0" y="262.0" text-anchor="start" font-size="11" fill="#1e293b" font-weight="normal">PE7</text>
<text x="540.0" y="310.0" text-anchor="middle" font-size="12" fill="#1e293b" font-weight="bold">Tile Assignment Order</text>
<rect x="540.0" y="318.0" width="12.0" height="12.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="328.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 0 → PE0 (0,0) off=0 B</text>
<rect x="540.0" y="334.0" width="12.0" height="12.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="344.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 1 → PE1 (1,0) off=256 KB</text>
<rect x="540.0" y="350.0" width="12.0" height="12.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="360.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 2 → PE2 (2,0) off=512 KB</text>
<rect x="540.0" y="366.0" width="12.0" height="12.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="376.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 3 → PE3 (3,0) off=768 KB</text>
<rect x="540.0" y="382.0" width="12.0" height="12.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="392.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 4 → PE4 (0,1) off=256 B</text>
<rect x="540.0" y="398.0" width="12.0" height="12.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="408.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 5 → PE5 (1,1) off=256 KB</text>
<rect x="540.0" y="414.0" width="12.0" height="12.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="424.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 6 → PE6 (2,1) off=512 KB</text>
<rect x="540.0" y="430.0" width="12.0" height="12.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="440.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 7 → PE7 (3,1) off=768 KB</text>
<rect x="540.0" y="446.0" width="12.0" height="12.0" fill="#3b82f6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="456.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 8 → PE0 (0,2) off=512 B</text>
<rect x="540.0" y="462.0" width="12.0" height="12.0" fill="#10b981" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="472.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t 9 → PE1 (1,2) off=256 KB</text>
<rect x="540.0" y="478.0" width="12.0" height="12.0" fill="#f59e0b" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="488.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t10 → PE2 (2,2) off=512 KB</text>
<rect x="540.0" y="494.0" width="12.0" height="12.0" fill="#ef4444" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="504.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t11 → PE3 (3,2) off=768 KB</text>
<rect x="540.0" y="510.0" width="12.0" height="12.0" fill="#8b5cf6" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="520.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t12 → PE4 (0,3) off=768 B</text>
<rect x="540.0" y="526.0" width="12.0" height="12.0" fill="#ec4899" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="536.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t13 → PE5 (1,3) off=256 KB</text>
<rect x="540.0" y="542.0" width="12.0" height="12.0" fill="#06b6d4" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="552.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t14 → PE6 (2,3) off=512 KB</text>
<rect x="540.0" y="558.0" width="12.0" height="12.0" fill="#f97316" stroke="#334155" stroke-width="1.0" fill-opacity="1.0" rx="2"/>
<text x="558.0" y="568.0" text-anchor="start" font-size="9" fill="#334155" font-weight="normal">t15 → PE7 (3,3) off=768 KB</text>
<rect x="80.0" y="560.0" width="587.0" height="30.0" fill="#e2e8f0" stroke="#94a3b8" stroke-width="1" fill-opacity="1.0" rx="2"/>
<text x="90.0" y="578.0" text-anchor="start" font-size="10" fill="#334155" font-weight="normal">Strategy: tiled_row_major | Tile: (256×128)=64 KB | Tiles: 16 | Total: 1 MB</text>
</svg>

After

Width:  |  Height:  |  Size: 14 KiB

+95
View File
@@ -0,0 +1,95 @@
<svg xmlns="http://www.w3.org/2000/svg" width="648" height="648" viewBox="0 0 648 648">
<title>sip</title>
<rect width="648" height="648" fill="#f8fafc"/>
<text x="324" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">SIP VIEW</text>
<line x1="108.0" y1="144.0" x2="252.0" y2="144.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="180.0" y="140.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="144.0" x2="108.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="108.0" y="200.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="144.0" x2="396.0" y2="144.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="324.0" y="140.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="144.0" x2="252.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="252.0" y="200.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="144.0" x2="540.0" y2="144.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="468.0" y="140.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="144.0" x2="396.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="396.0" y="200.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="540.0" y1="144.0" x2="540.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="540.0" y="200.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="264.0" x2="252.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="180.0" y="260.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="264.0" x2="108.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="108.0" y="320.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="264.0" x2="396.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="324.0" y="260.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="264.0" x2="252.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="252.0" y="320.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="264.0" x2="540.0" y2="264.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="468.0" y="260.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="264.0" x2="396.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="396.0" y="320.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="540.0" y1="264.0" x2="540.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="540.0" y="320.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="384.0" x2="252.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="180.0" y="380.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="384.0" x2="108.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="108.0" y="440.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="384.0" x2="396.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="324.0" y="380.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="384.0" x2="252.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="252.0" y="440.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="384.0" x2="540.0" y2="384.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="468.0" y="380.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="384.0" x2="396.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="396.0" y="440.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="540.0" y1="384.0" x2="540.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="540.0" y="440.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="108.0" y1="504.0" x2="252.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="180.0" y="500.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="252.0" y1="504.0" x2="396.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="324.0" y="500.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<line x1="396.0" y1="504.0" x2="540.0" y2="504.0" stroke="#3b82f6" stroke-width="1" opacity="0.8"/>
<text x="468.0" y="500.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">1.0mm 512GB/s</text>
<polyline points="324.0,56.0 108.0,56.0 108.0,144.0" fill="none" stroke="#0ea5e9" stroke-width="1" opacity="0.8"/>
<text x="216.0" y="96.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.5mm 512GB/s</text>
<polyline points="324.0,56.0 252.0,56.0 252.0,144.0" fill="none" stroke="#0ea5e9" stroke-width="1" opacity="0.8"/>
<text x="288.0" y="96.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.5mm 512GB/s</text>
<polyline points="324.0,56.0 396.0,56.0 396.0,144.0" fill="none" stroke="#0ea5e9" stroke-width="1" opacity="0.8"/>
<text x="360.0" y="96.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.5mm 512GB/s</text>
<polyline points="324.0,56.0 540.0,56.0 540.0,144.0" fill="none" stroke="#0ea5e9" stroke-width="1" opacity="0.8"/>
<text x="432.0" y="96.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">3.5mm 512GB/s</text>
<rect x="84.0" y="128.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="108.0" y="148.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (0,0)</text>
<rect x="228.0" y="128.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="252.0" y="148.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (1,0)</text>
<rect x="372.0" y="128.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="396.0" y="148.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (2,0)</text>
<rect x="516.0" y="128.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="540.0" y="148.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (3,0)</text>
<rect x="84.0" y="248.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="108.0" y="268.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (0,1)</text>
<rect x="228.0" y="248.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="252.0" y="268.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (1,1)</text>
<rect x="372.0" y="248.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="396.0" y="268.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (2,1)</text>
<rect x="516.0" y="248.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="540.0" y="268.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (3,1)</text>
<rect x="84.0" y="368.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="108.0" y="388.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (0,2)</text>
<rect x="228.0" y="368.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="252.0" y="388.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (1,2)</text>
<rect x="372.0" y="368.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="396.0" y="388.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (2,2)</text>
<rect x="516.0" y="368.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="540.0" y="388.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (3,2)</text>
<rect x="84.0" y="488.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="108.0" y="508.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (0,3)</text>
<rect x="228.0" y="488.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="252.0" y="508.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (1,3)</text>
<rect x="372.0" y="488.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="396.0" y="508.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (2,3)</text>
<rect x="516.0" y="488.0" width="48.0" height="32.0" rx="4" fill="#cbd5e1" stroke="#475569" stroke-width="1"/>
<text x="540.0" y="508.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#1e293b">CUBE (3,3)</text>
<rect x="308.0" y="50.0" width="32.0" height="12.0" rx="4" fill="#0ea5e9" stroke="#475569" stroke-width="1"/>
<text x="324.0" y="60.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#ffffff">IO io0</text>
</svg>

After

Width:  |  Height:  |  Size: 10 KiB

+19
View File
@@ -0,0 +1,19 @@
<svg xmlns="http://www.w3.org/2000/svg" width="768" height="396" viewBox="0 0 768 396">
<title>system</title>
<rect width="768" height="396" fill="#f8fafc"/>
<text x="384" y="18" text-anchor="middle" font-family="monospace" font-size="14" font-weight="bold" fill="#1e293b">SYSTEM VIEW</text>
<polyline points="384.0,60.0 182.0,60.0 182.0,120.0" fill="none" stroke="#6366f1" stroke-width="1" opacity="0.8"/>
<text x="283.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">20.0mm 256GB/s</text>
<polyline points="384.0,60.0 586.0,60.0 586.0,120.0" fill="none" stroke="#6366f1" stroke-width="1" opacity="0.8"/>
<text x="485.0" y="86.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#64748b">20.0mm 256GB/s</text>
<rect x="374.0" y="57.0" width="20.0" height="6.0" rx="4" fill="#6366f1" stroke="#475569" stroke-width="1"/>
<text x="384.0" y="64.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#ffffff">Fabric Switch</text>
<rect x="62.0" y="138.0" width="240.0" height="200.0" rx="4" fill="#e0e7ff" stroke="#475569" stroke-width="1"/>
<text x="182.0" y="242.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">SIP 0</text>
<rect x="174.0" y="117.0" width="16.0" height="6.0" rx="4" fill="#0ea5e9" stroke="#475569" stroke-width="1"/>
<text x="182.0" y="124.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#ffffff">IO io0</text>
<rect x="466.0" y="138.0" width="240.0" height="200.0" rx="4" fill="#e0e7ff" stroke="#475569" stroke-width="1"/>
<text x="586.0" y="242.0" text-anchor="middle" font-family="monospace" font-size="10" fill="#1e293b">SIP 1</text>
<rect x="578.0" y="117.0" width="16.0" height="6.0" rx="4" fill="#0ea5e9" stroke="#475569" stroke-width="1"/>
<text x="586.0" y="124.0" text-anchor="middle" font-family="monospace" font-size="7" fill="#ffffff">IO io0</text>
</svg>

After

Width:  |  Height:  |  Size: 1.9 KiB

+381
View File
@@ -0,0 +1,381 @@
# Latency Model
## Overview
kernbench uses a discrete-event simulation (SimPy) to compute end-to-end latency.
Every request flows through a graph of **components** connected by **wires**.
The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
not a static formula—so contention and queueing are captured automatically.
```
total_ns (actual) = wire_prop + component_overhead + drain + queueing
├── deterministic ──────────────────┘ │
└── contention-dependent ────────────────────┘
```
## Three Deterministic Cost Components
### 1. Wire Propagation
```
wire_ns = distance_mm × ns_per_mm (global: 0.01 = 10 ps/mm)
```
Every edge in the topology graph has a `distance_mm`. A SimPy wire process
delays each message by `wire_ns` before delivering it to the next component.
For on-chip silicon this is ~10 ps/mm; the same constant applies everywhere
since all links are on-die or interposer. Wire propagation is typically <1 ns
and negligible compared to other costs.
### 2. Component Overhead (`overhead_ns`)
```
component_ns = node.attrs["overhead_ns"]
```
Each component on the path adds a fixed processing delay via `yield env.timeout(overhead_ns)`.
This models arbitration, protocol processing, pipeline stages, etc.
| Component | overhead_ns | Meaning |
|-----------|-------------|---------|
| pcie_ep | 5.0 | PCIe protocol processing |
| io_cpu | 10.0 | Command decode / dispatch |
| m_cpu | 5.0 | DMA scheduling |
| fabric switch | 5.0 | Packet arbitration |
| xbar | 2.0 | Crossbar arbitration |
| xbar bridge | 1.0 | Bridge traversal between xbar halves |
| ucie | 1.0 | UCIe protocol overhead per port |
| noc (2D mesh) | 0.0 | Hop delay modeled internally via manhattan distance |
| hbm_ctrl | 0.0 | Access time captured in drain_ns |
| pe_cpu | 2.0 | Command dispatch |
| pe_scheduler | 1.0 | PE-internal scheduling |
| pe_gemm/math | 0.0 | Placeholder; will use flops-based model |
### 3. Drain (Serialization Delay)
```
drain_ns = nbytes / bottleneck_bw_gbs
```
**Wormhole (cut-through) model**: data flows through intermediate nodes as a
pipeline. Serialization cost is paid **once** at the terminal node, not at
every hop. The bottleneck is the minimum `bw_gbs` across all edges in the path.
Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32.0 ns`.
### Formula (Theoretical Lower Bound)
```
formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
```
This is the latency with **zero contention**—no other request competing for
any resource. The engine provides `_formula_latency()` for verification.
With no contention: `actual == formula`. With contention: `actual > formula`.
### Diagram: PE DMA Read (pe0 → local slice0, 4096 bytes)
```mermaid
sequenceDiagram
participant D as pe_dma
participant X as xbar.pe0
participant H as hbm_ctrl.slice0
D->>X: txn (4096B)
Note over X: overhead 2.0 ns
X->>H: txn (wire 0.025 ns)
Note over H: acquire Resource
Note over H: overhead 0 ns
Note over H: drain 4096/256 = 16.0 ns
Note over H: release Resource
H-->>D: done.succeed()
Note over D,H: total_ns = 18.09 ns<br/>formula = wire(0.025) + ovhd(2.0) + drain(16.0) = 18.025 ns<br/>actual ≈ formula (no contention)
```
### Diagram: Two Requests — No Contention vs HOL Blocking
#### Case 1: Different slices (parallel, no contention)
```mermaid
sequenceDiagram
participant A as Request A
participant S0 as hbm_ctrl.slice0<br/>Resource(cap=1)
participant S1 as hbm_ctrl.slice1<br/>Resource(cap=1)
Note over A,S1: t=2 ns — both requests arrive at their own slice
A->>S0: A (4KB)
A->>S1: B (4KB)
Note over S0: acquire (immediate)
Note over S1: acquire (immediate)
Note over S0: drain 16.0 ns
Note over S1: drain 16.0 ns
Note over S0: t=18 release
Note over S1: t=18 release
Note over A,S1: A actual = 18 ns, B actual = 18 ns<br/>No waiting — separate Resources
```
#### Case 2: Same slice (HOL blocking)
```mermaid
sequenceDiagram
participant A as Request A (4KB)
participant Q as hbm_ctrl.slice0<br/>Resource(cap=1)
participant B as Request B (64B)
Note over A,B: t=0 — A arrives first
A->>Q: acquire (immediate)
Note over Q: drain A = 16.0 ns
Note over B,Q: t=5 — B arrives, yield req → BLOCKED
B--xQ: waiting...
Note over Q: t=16 — A drain done, release
Q->>B: B acquires resource
Note over Q: drain B = 0.25 ns
Note over Q: t=16.25 — B done, release
Note over A,B: A actual = 16.0 ns (== formula)<br/>B actual = 11.25 ns (formula 0.25 + queueing 11.0)<br/>HOL blocking: short request waits behind long drain
```
---
## How SimPy Tracks Latency
### Measurement
```python
start_ns = env.now
yield txn_done # wait for the transaction to complete
total_ns = env.now - start_ns # ← this is what probe reports
```
`env.now` is SimPy's simulation clock. It only advances when a process `yield`s
a timeout or waits on a resource/store. The delta between start and done captures
**everything**: wire delays, component overheads, drain, and any queueing.
### Component Pipeline
Each component is a SimPy process:
```
_fan_in (per in_port) → _inbox (Store) → _worker → out_ports
```
1. **`_fan_in`**: relays messages from each `in_port` into a shared `_inbox` Store.
2. **`_worker`**: pulls from `_inbox`, spawns `_forward_txn` per message.
3. **`_forward_txn`**: calls `run()` (overhead), then puts to `out_ports[next_hop]`.
The worker uses `env.process()` (pipeline model), so multiple messages can be
in-flight through the same component concurrently. Contention happens when
they compete for shared resources (e.g., `simpy.Resource` in hbm_ctrl).
### Wire Process
```python
while True:
msg = yield out_port.get() # wait for sender
yield env.timeout(prop_ns) # propagation delay
yield in_port.put(msg) # deliver to receiver
```
Each directed edge has its own wire process. Messages are delayed by exactly
`distance_mm × ns_per_mm`.
---
## Contention and Queueing
Queueing delay is **not a separate formula term**—it emerges from SimPy's
event scheduling when multiple requests compete for the same resource.
### Where Contention Occurs
| Resource | SimPy Type | Capacity | Effect |
|----------|-----------|----------|--------|
| hbm_ctrl | `simpy.Resource` | 1 | Serializes HBM access |
| m_cpu DMA read engine | `simpy.Resource` | 1 | Serializes DMA reads |
| m_cpu DMA write engine | `simpy.Resource` | 1 | Serializes DMA writes |
| pe_dma channels | `simpy.Resource` | configurable | Serializes PE DMA ops |
| component inbox | `simpy.Store` | unbounded | No backpressure (FIFO) |
### How Queueing Works
```python
# hbm_ctrl._worker
with self._resource.request() as req:
yield req # ← BLOCKS if resource is occupied
yield from self.run(env, txn.nbytes)
yield env.timeout(drain_ns)
```
If request A holds the resource and request B arrives:
- B's `yield req` blocks until A releases the resource
- SimPy advances B's `env.now` by A's remaining service time
- This "extra" time shows up in B's `total_ns` automatically
```
No contention: actual_ns == formula_ns
Contention: actual_ns > formula_ns
queueing_delay = actual_ns - formula_ns
```
### Head-of-Line (HOL) Blocking at hbm_ctrl
The `simpy.Resource` is held for the **entire** `with` block—both overhead and
drain. The resource is NOT released between overhead and drain:
```python
with self._resource.request() as req:
yield req # acquire (or wait)
yield from self.run(env, txn.nbytes) # overhead_ns ─┐
yield env.timeout(drain_ns) # drain_ns │ resource held
# ← resource released here ───────────────────────────────┘
```
This means a short request arriving during a long request's drain must wait
for the full remaining drain time—classic head-of-line blocking:
```
Request A: 4 KB, drain = 16.0 ns (arrives at t=0)
Request B: 64 B, drain = 0.25 ns (arrives at t=5)
Timeline:
t=0.00 A acquires resource
t=0.00 A: overhead (0 ns)
t=0.00 A: drain starts (16.0 ns)
t=5.00 B arrives → yield req → BLOCKED (A holds resource)
t=16.00 A: drain done → resource released
t=16.00 B acquires resource
t=16.00 B: overhead (0 ns)
t=16.25 B: drain done → resource released
B actual = 11.25 ns (waited 11.0 + own 0.25)
B formula = 0.25 ns
B queueing = 11.0 ns ← HOL blocking penalty
```
**Why this is physically realistic**: An HBM channel processes one burst at a
time. While data is being serialized onto the channel (drain), no other request
can use that channel. The FIFO ordering (`simpy.Resource` default) reflects
the simplest controller scheduling policy.
**Alternative: priority scheduling**: If needed, `simpy.PriorityResource` can
prioritize shorter requests (Shortest Job First), but this is not currently
used since FIFO matches typical HBM controller behavior.
---
## Worked Example: Two Concurrent PE DMA Reads
Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices
(slice0 and slice1), submitted to the **same engine** at the same time.
### Paths
```
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
```
### No Contention (different HBM slices)
Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
`simpy.Resource(capacity=1)`, there is no resource competition.
```
DMA A timeline:
t=0.00 pe_dma dequeues txn
t=0.00 xbar.pe0: overhead_ns=2.0 → t=2.00
t=2.025 wire prop (2.5mm × 0.01) → t=2.025
t=2.025 hbm_ctrl.slice0: yield req → immediate (no contention)
t=2.025 hbm_ctrl.slice0: overhead_ns=0 → t=2.025
t=18.025 drain_ns = 4096/256 = 16.0 → t=18.025
t=18.025 done
DMA B timeline: (identical, on its own slice)
t=0.00 → ... → t=18.09 done
```
Both complete at ~18.09 ns. `actual == formula` for both.
### With Contention (same HBM slice)
Now suppose both PE0 and PE1 read from **slice0**:
```
DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
(chain traversal to reach slice0)
```
```
DMA A timeline:
t=0.00 xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=2.025 yield req → immediate (first to arrive)
t=18.025 drain 16.0 → release resource → done
actual_A = 18.025 ns (== formula)
DMA B timeline:
t=0.00 xbar.pe1(2.0) → xbar.pe0(2.0) → wire → hbm_ctrl.slice0
t=4.035 yield req → BLOCKED (A holds resource until t=18.025)
t=18.025 acquire resource
t=34.025 drain 16.0 → release → done
actual_B = 34.035 ns
formula_B = wire(0.035) + overhead(4.0) + drain(32.0) = 36.035 ns
But actual_B is different because drain uses bottleneck BW of B's path (128 GB/s)
while A's path has BW 256 GB/s. Let's recalculate:
B's bottleneck: xbar_x_bw = 128 GB/s → drain = 4096/128 = 32.0 ns
formula_B = 0.035 + 4.0 + 32.0 = 36.035 ns
actual_B = 36.035 + queueing ≈ 50+ ns
queueing = time waiting for A to release hbm_ctrl
```
The key insight: **queueing delay is not in the formula**. It only appears in
the actual SimPy simulation when resources are contested. The probe reports
`actual_ns`, which includes all queueing. To see pure queueing overhead,
compare `actual_ns` vs `formula_ns` (available in PE DMA traces).
---
## Probe Output Explained
```
=== PE DMA Latency ===
Case Target Actual Ovhd Drain Wire Ovhd% Drain% Eff.BW BN.BW Util%
pe-local-hbm c0.pe0->c0.slice0 18.09 2.0 16.0 0.08 11.1% 88.5% 226.49 256.0 88.5%
pe-cross-half-hbm c0.pe0->c0.slice4 37.14 5.0 32.0 0.14 13.5% 86.1% 110.27 128.0 86.1%
```
| Column | Meaning |
|--------|---------|
| **Actual** | SimPy measured `env.now` delta (includes contention if any) |
| **Ovhd** | Sum of `overhead_ns` for all components on the forward path |
| **Drain** | `nbytes / bottleneck_bw` — serialization at terminal |
| **Wire** | Sum of `distance_mm × ns_per_mm` for all edges |
| **Ovhd%** | `Ovhd / Actual × 100` — fraction of time spent in component processing |
| **Drain%** | `Drain / Actual × 100` — fraction of time spent in data transfer |
| **Eff.BW** | `nbytes / Actual` — achieved bandwidth |
| **BN.BW** | Bottleneck bandwidth (min `bw_gbs` on path) |
| **Util%** | `Eff.BW / BN.BW × 100` — how close to theoretical max BW |
### Why Util% < 100%
`Util% = Drain% = drain_ns / actual_ns`. The gap from 100% is the overhead
fraction. For small transfers (4KB), overhead is significant relative to drain.
For large transfers, drain dominates and utilization approaches 100%.
```
4 KB: Ovhd=2.0, Drain=16.0 → Util=88.5% (overhead is 11% of time)
64 KB: Ovhd=2.0, Drain=256.0 → Util=99.2% (overhead is <1% of time)
```
### H2D Path: Why Ovhd% is ~40%
H2D traverses many components (pcie_ep → io_cpu → ucie → noc → m_cpu → noc →
xbar → hbm_ctrl + response path). Total forward overhead is ~23 ns vs drain
of 32 ns for 4KB, so overhead is comparable to data transfer time—resulting
in ~55% utilization. This is expected for small command-path transfers.