adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)
Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR (add an entry when introducing a new ADR); the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,836 @@
|
||||
# KernBench — Architecture Design Document
|
||||
*2026 1H*
|
||||
|
||||
KernBench is a system-level, discrete-event simulator for AI-accelerator
|
||||
chiplet systems. It models the data-movement and control paths across
|
||||
the full hardware hierarchy and reports end-to-end execution latency
|
||||
for kernels dispatched to the device's compute units.
|
||||
|
||||
This document is a public summary of the architecture as designed and
|
||||
implemented in the first half of 2026. It assumes no prior knowledge of
|
||||
the simulator's internal documents; terms specific to the system are
|
||||
defined on first use.
|
||||
|
||||
---
|
||||
|
||||
## Design Principles
|
||||
|
||||
KernBench is grounded in two foundational commitments: every measured
|
||||
latency must trace to explicit, modeled events on the simulator's graph,
|
||||
and every behavioral claim must be verifiable through tests that target
|
||||
spec-level invariants rather than incidental implementation details.
|
||||
|
||||
<!-- src: ADR-0013 Context, Decision -->
|
||||
Development is verification-driven. Tests validate the architectural
contracts that the simulator exposes (correct routing, deterministic
results, monotonic latency under increasing hop counts) rather than
mirroring the call graph of the implementation. Two phases coexist: a
fast timing phase that exercises the simulator's discrete-event engine
and produces a timestamped log of operations, and an optional
data-replay phase that uses that log to compute real numerical
results. Tests can target either phase.
|
||||
|
||||
<!-- src: ADR-0033 Context, Decision -->
|
||||
The latency model is intentionally abstract rather than
|
||||
cycle-accurate. Each modeled node contributes a configurable per-node
|
||||
overhead, each link contributes wire delay plus byte-over-bandwidth
|
||||
serialization, and each terminal service contributes its own service
|
||||
time. The simulator does not attempt to reproduce cache coherence
|
||||
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
|
||||
correctness; those are explicitly out of scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that reproduces microarchitectural detail.
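
The additive latency model described above can be sketched as a plain
sum over the hops of a path. This is an illustration only: the `Hop`
type, its field names, the nanosecond and bytes-per-nanosecond units,
and the choice to charge serialization on every hop are assumptions,
not the simulator's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node plus its outgoing link (hypothetical structure)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # propagation delay of the link
    bandwidth_bytes_per_ns: float  # 1 GB/s is 1 byte/ns


def path_latency_ns(hops: list[Hop], nbytes: int, service_ns: float) -> float:
    """Sum per-node overhead, wire delay and serialization, then add
    the terminal's service time."""
    total = service_ns
    for hop in hops:
        serialization = nbytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization
    return total


if __name__ == "__main__":
    route = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 32.0)]
    print(path_latency_ns(route, nbytes=128, service_ns=10.0))
```

Every term in the sum corresponds to a named contribution in the
prose, which is the property the design principles ask for: no latency
appears that is not attributable to a node, a link, or a terminal.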
|
||||
|
||||
<!-- src: ADR-0033 Decision, Consequences -->
|
||||
Determinism is a hard requirement. Given identical inputs — topology,
|
||||
routing policy, and request stream — the simulator must produce
|
||||
identical outputs, hop traces included. This rules out reliance on
|
||||
unordered set iteration on the critical path and forces every latency
|
||||
contribution to come from an explicitly scheduled event on a modeled
|
||||
component or link. There are no implicit waits, no hardcoded magic
|
||||
delays, and no shortcuts that bypass the modeled graph.
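
One common way to obtain this kind of determinism in a discrete-event
engine is to break timestamp ties with a monotonically increasing
sequence number, so that ordering never depends on hashing or set
iteration. The sketch below assumes a hypothetical `EventQueue` class;
it is not KernBench's engine.

```python
import heapq
import itertools


class EventQueue:
    """Minimal event queue with deterministic tie-breaking."""

    def __init__(self) -> None:
        self.now = 0.0
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()  # schedule order breaks ties

    def schedule(self, delay: float, label: str) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), label))

    def run(self) -> list[tuple[float, str]]:
        """Pop events in (time, schedule order) and return the trace."""
        trace = []
        while self._heap:
            self.now, _, label = heapq.heappop(self._heap)
            trace.append((self.now, label))
        return trace


if __name__ == "__main__":
    q = EventQueue()
    q.schedule(5, "b")
    q.schedule(5, "a")
    q.schedule(1, "c")
    print(q.run())
```

Two events scheduled for the same time always fire in the order they
were scheduled, so identical inputs give identical traces.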
|
||||
|
||||
---
|
||||
|
||||
## High-level Architecture
|
||||
|
||||
<!-- src: ADR-0003 Context, Decision -->
|
||||
The simulated system is a four-level hierarchy. A **Tray** holds one or
|
||||
more **SIPs** (system-in-package), each containing a 2D mesh of
|
||||
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
|
||||
host. Each CUBE contains a regular grid of **PEs** (processing
|
||||
elements) plus its own attached resources — high-bandwidth memory
|
||||
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
|
||||
itself is a composite of nine sub-components rather than a monolithic
|
||||
core. This hierarchy is fixed; the parameters along each axis (counts,
|
||||
mesh dimensions, link widths) are configurable through the topology
|
||||
spec.
|
||||
|
||||
<!-- src: ADR-0007 Context, Decision -->
|
||||
A clean separation runs along the request flow. A **runtime API** at
|
||||
the top is the host-facing surface; it exposes tensor and kernel
|
||||
operations, owns host-side allocation metadata, and is topology-
|
||||
agnostic — it does not route or fan out. Below it the **simulation
|
||||
engine** decomposes runtime operations into discrete graph requests
|
||||
(memory writes, memory reads, kernel launches, MMU map installs) and
|
||||
schedules events deterministically. At the bottom, **components** model
|
||||
device behavior on a graph of nodes connected by links; they
|
||||
implement the actual latency contributions and pass requests along.
|
||||
No component reaches up into the runtime API, and no runtime call
|
||||
shortcuts the engine.
|
||||
|
||||
<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
|
||||
|
||||
### Tray
|
||||
|
||||
<!-- src: ADR-0003 Decision -->
|
||||
The Tray is the outermost boundary. It owns the host CPU on one side
|
||||
and one or more SIPs on the other, connected through a fabric switch.
|
||||
For collective communication that must traverse multiple SIPs, the
|
||||
fabric switch acts as the common rendezvous: device-side outbound
|
||||
traffic from one SIP routes through the switch and back into the
|
||||
target SIP's IO chiplet.
|
||||
|
||||
### SIP
|
||||
|
||||
<!-- src: ADR-0003 Decision, ADR-0017 Context -->
|
||||
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
default topology used by the simulator is a 4×4 cube mesh; the mesh
dimensions are configurable. Each cube connects to its neighbors in
the mesh over UCIe (die-to-die) links arranged on its four cardinal
sides: north, south, east, and west. The IO chiplets sit on one side
of the SIP and provide the bridge to the host across PCIe.
|
||||
|
||||
<!-- src: ADR-0016 Context, Decision -->
|
||||
The IO chiplet contains its own internal network. A host-facing PCIe
endpoint passes traffic to a small NOC (network on chip); from there
traffic either branches to a control-plane CPU that processes
kernel-launch messages or takes the direct memory data path to the
target cube's HBM controller. The direct path bypasses the control CPU
by design, so that host-issued memory writes do not pay control-plane
overhead on the data path.
|
||||
|
||||
### CUBE
|
||||
|
||||
<!-- src: ADR-0017 Decision -->
|
||||
Each CUBE owns a 2D mesh of NOC routers and a set of attached
|
||||
resources: PEs, the cube-local SRAM scratchpad, the management CPU
|
||||
(M_CPU), and the HBM partition (split across multiple PE-private
|
||||
slices for bandwidth). The router mesh uses deterministic XY routing.
|
||||
Attached components do not connect to each other directly — they all
|
||||
sit on the router mesh, and every cube-internal transfer pays the
|
||||
mesh distance from source to destination.
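
Deterministic XY routing can be illustrated in a few lines: a request
first travels along X until the column matches, then along Y. The
coordinate convention and the function name `xy_route` are
assumptions made for this sketch.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the router coordinates visited from src to dst,
    moving along X first and then along Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    step = 1 if dst_x > x else -1
    while x != dst_x:
        x += step
        path.append((x, y))
    step = 1 if dst_y > y else -1
    while y != dst_y:
        y += step
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

The hop count is the Manhattan distance between the two routers, which
is the "mesh distance" every cube-internal transfer pays.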
|
||||
|
||||
<!-- src: ADR-0017 Decision -->
|
||||
The HBM partition is per-PE: each PE owns one HBM slice, and the
|
||||
controller exposes per-PE channels so that the same PE always
|
||||
addresses the same set of HBM channels. This makes the local-HBM
|
||||
bandwidth from a PE to its own slice predictable, while accesses to
|
||||
another PE's slice — or a different cube's slice — pay the mesh
|
||||
distance and any UCIe crossings.
|
||||
|
||||
### PE
|
||||
|
||||
<!-- src: ADR-0014 Context, Decision -->
|
||||
A PE is not a monolithic core. Internally it is a set of nine
|
||||
sub-components, each modeling one stage of a request's flow: a small
|
||||
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
|
||||
engine that moves data between the on-PE scratchpad and the register
|
||||
file, a GEMM compute engine, a math compute engine, the tightly-
|
||||
coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
|
||||
physical address translation, and an inter-PE collective queue
|
||||
(IPCQ). The scheduler decomposes higher-level operations into per-tile
|
||||
stage sequences, and tile tokens self-route from one sub-component
|
||||
to the next.
|
||||
|
||||
<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
|
||||
|
||||
---
|
||||
|
||||
## Detailed Architecture
|
||||
|
||||
This section describes each modeled device-side component in turn.
|
||||
Components are listed in the alphabetical order used by the
|
||||
simulator's source tree.
|
||||
|
||||
### forwarding
|
||||
|
||||
<!-- src: ADR-0037 Context, Decision -->
|
||||
The forwarding component is the generic routing relay used wherever a
|
||||
node only needs to apply a small processing overhead and pass the
|
||||
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
|
||||
reduce to this. Its first act on receiving a request is to apply the
|
||||
per-node overhead configured for it in the topology spec; after the
|
||||
overhead it simply hands the request to the next hop along the path.
|
||||
|
||||
<!-- src: ADR-0037 Decision, Consequences -->
|
||||
The decision to share one implementation across these roles was made
|
||||
to keep the simulator's component set small without sacrificing
|
||||
modeling fidelity. Each instance still carries its own overhead and
|
||||
its own link bandwidth contributions, so different roles still produce
|
||||
different timing. What is shared is the dispatcher loop, not the
|
||||
parameter values.
|
||||
|
||||
### hbm_ctrl
|
||||
|
||||
<!-- src: ADR-0034 Context, Decision -->
|
||||
The HBM controller is the terminal node for all memory traffic that
|
||||
reaches HBM. Internally it owns a number of pseudo channels, partitioned
|
||||
per-PE so that each PE addresses a deterministic subset. On a request
|
||||
arrival the controller first selects the right pseudo channel from the
|
||||
target address, then enters a chunk-loop that drains the requested
|
||||
size in fixed-size flits over the channel's bandwidth.
|
||||
|
||||
<!-- src: ADR-0034 Decision, Consequences -->
|
||||
The chunk-loop pattern replaces an earlier all-at-once drain. The
|
||||
benefit is that the controller no longer presents a flit-aware fabric
|
||||
with a single bulk transfer; instead it emits flits at a paced rate
|
||||
matching the channel bandwidth, which makes cross-flow contention
|
||||
visible. The bandwidth budget is calibrated against the configured
|
||||
HBM total bandwidth divided across the channel count.
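
A minimal sketch of the chunk-loop idea follows, assuming the
per-channel budget is the total bandwidth divided by the channel count
as stated above. The function name, parameter names, and units
(bytes per nanosecond) are hypothetical, and the real controller
schedules simulator events rather than returning a list.

```python
def drain_schedule(
    nbytes: int,
    flit_bytes: int,
    total_bw_bytes_per_ns: float,
    num_channels: int,
    start_ns: float = 0.0,
) -> list[float]:
    """Return the completion time of each flit-sized chunk drained
    over one pseudo channel."""
    channel_bw = total_bw_bytes_per_ns / num_channels
    now = start_ns
    remaining = nbytes
    completions = []
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # last chunk may be short
        now += chunk / channel_bw           # paced at channel bandwidth
        completions.append(now)
        remaining -= chunk
    return completions


if __name__ == "__main__":
    print(drain_schedule(160, flit_bytes=64,
                         total_bw_bytes_per_ns=256.0, num_channels=8))
```

Because each chunk completes at its own time, a second flow arriving
mid-transfer can interleave with the first, which an all-at-once drain
would hide.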
|
||||
|
||||
### io_cpu
|
||||
|
||||
<!-- src: ADR-0036 Context, Decision -->
|
||||
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
|
||||
It receives kernel-launch messages from the host, decodes them, and
|
||||
dispatches per-cube launches to the cube's management CPU. Pure memory
|
||||
operations bypass it entirely, taking the direct data path established
|
||||
inside the IO chiplet.
|
||||
|
||||
<!-- src: ADR-0036 Decision -->
|
||||
On receiving a kernel-launch message, the IO_CPU consults the message's
|
||||
shard list — which already names the target SIP, cube, and PE for each
|
||||
piece of the tensor argument — and forwards a per-cube launch to each
|
||||
cube the kernel needs to reach. This makes the IO_CPU a deterministic
|
||||
fan-out point: it does not decode physical addresses to route, it just
|
||||
follows the explicit per-shard targets it was handed.
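
The fan-out step can be pictured as grouping the shard list by its
explicit (SIP, cube) targets. The `Shard` record and
`fan_out_per_cube` below are hypothetical names for illustration; the
actual message schema is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """One piece of a tensor argument with its explicit placement."""
    sip: int
    cube: int
    pe: int
    pa: int


def fan_out_per_cube(shards: list[Shard]) -> dict[tuple[int, int], list[int]]:
    """Group target PEs by (sip, cube), in first-seen order.

    No address decoding happens here: placement comes only from the
    fields carried by each shard."""
    launches: dict[tuple[int, int], list[int]] = {}
    for shard in shards:
        launches.setdefault((shard.sip, shard.cube), []).append(shard.pe)
    return launches


if __name__ == "__main__":
    shards = [Shard(0, 1, 0, 0x100), Shard(0, 0, 2, 0x200), Shard(0, 1, 3, 0x300)]
    print(fan_out_per_cube(shards))
```

Python dictionaries preserve insertion order, so the per-cube launch
order depends only on the order of the shard list.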
|
||||
|
||||
### m_cpu
|
||||
|
||||
<!-- src: ADR-0035 Context, Decision -->
|
||||
The M_CPU is the cube's management processor. It plays two distinct
|
||||
roles: as a control-plane fan-out point for kernel launches arriving
|
||||
from the IO chiplet, and as a DMA endpoint for host-initiated memory
|
||||
writes that need to land in this cube's HBM. The control role
|
||||
forwards launches to the right PE control CPUs; the DMA role places
|
||||
the actual bytes into HBM through the router mesh.
|
||||
|
||||
<!-- src: ADR-0035 Decision -->
|
||||
The component model deliberately distinguishes the two roles because
|
||||
their routing differs: the control fan-out path uses command-kind
|
||||
links that do not appear on data-path routes, while the DMA path uses
|
||||
the same router mesh as PE-initiated DMA, with PE-internal nodes
|
||||
excluded. The routing layer knows about both modes and selects the
|
||||
appropriate adjacency at request time.
|
||||
|
||||
### pcie_ep
|
||||
|
||||
<!-- src: ADR-0038 Context, Decision -->
|
||||
The PCIe endpoint is the protocol boundary at the host-device edge.
|
||||
Its first act on each incoming request is to apply a configured
|
||||
protocol-processing overhead; after that it simply forwards. There is
|
||||
no internal queuing model, no retry, and no TLP-level fidelity — those
|
||||
are deliberately outside scope. The endpoint is bidirectional: host →
|
||||
device traffic (memory writes, kernel launches) flows one way, and
|
||||
device-side outbound traffic (cross-SIP collective sends) flows the
|
||||
other.
|
||||
|
||||
<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
|
||||
A more detailed PCIe model was considered and rejected. The simulator
targets system-level latency comparisons; adding credit management and
retry logic to the endpoint would not improve the metrics being
studied. The endpoint therefore remains the documented
protocol-boundary node, named consistently so that routing helpers can
locate it by SIP and IO instance.
|
||||
|
||||
### pe_cpu
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The PE control CPU is the entry point for kernel work arriving from
|
||||
the cube's management CPU. It receives kernel-launch messages, resolves
|
||||
the kernel function by name, and hands execution to the scheduler with
|
||||
the resolved tensor arguments. From the scheduler's point of view, the
|
||||
PE_CPU is the upstream source of high-level commands; from the rest
|
||||
of the system's point of view, the PE_CPU is where a kernel's
|
||||
execution begins on a given PE.
|
||||
|
||||
### pe_dma
|
||||
|
||||
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
|
||||
The DMA engine on each PE has two distinct modes. In the standard PE
|
||||
pipeline it consumes tile tokens issued by the scheduler, acquires a
|
||||
read or write channel (modeled as a one-in-flight resource per
|
||||
direction), and runs the bytes to or from HBM through the mesh. In
|
||||
its collective mode it forwards send tokens for the PE's IPCQ into
|
||||
the fabric, snapshotting the source data at send time so later
|
||||
mutations cannot race the receiver's read. Both modes share the same
|
||||
channel resources but differ in their downstream handling — one
|
||||
returns when the round-trip completes, the other dispatches
|
||||
fire-and-forget.
|
||||
|
||||
### pe_fetch_store
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The fetch-store engine is the bridge between the on-PE scratchpad
|
||||
(TCM) and the register file. It does not run DMA; it only moves bytes
|
||||
internally. On receiving a tile-stage token it sends a short request
|
||||
to the TCM, waits for the bandwidth-serialized delay, and continues
|
||||
the pipeline. The split between this engine and the TCM lets the
|
||||
scratchpad model its own read/write bandwidth independently.
|
||||
|
||||
### pe_gemm
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The GEMM engine is the matrix-multiply compute unit. Tile tokens
arriving at this stage carry the per-tile dimensions, and the engine
contributes a service time that charges one fused multiply-add for
each multiply-accumulate (MAC) in the tile. Composite operations (where
the same tensor pair is streamed across many tiles) reuse the engine
through the scheduler; the engine itself is stateless between tiles.
|
||||
|
||||
### pe_ipcq
|
||||
|
||||
<!-- src: ADR-0023 Context, Decision -->
|
||||
The IPCQ (inter-PE collective queue) is each PE's
collective-communication endpoint. It owns ring buffers that hold
|
||||
inbound messages from neighbor PEs and bookkeeping for send credits.
|
||||
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
|
||||
"global_*" for cross-SIP neighbors) are resolved to physical peer
|
||||
endpoints by a neighbor table installed at process-group creation
|
||||
time. The component itself does not move bytes — it issues DMA tokens
|
||||
through the local PE_DMA, which performs the actual cross-PE
|
||||
transfer.
|
||||
|
||||
<!-- src: ADR-0023 Decision, Consequences -->
|
||||
A key invariant is that the inbound terminal — where data lands at
|
||||
the receiver — pays the link bandwidth drain plus any cube-internal
|
||||
mesh hop to the slot's backing memory. This prevents IPCQ from
|
||||
silently outpacing raw DMA at large transfer sizes. Outbound sends
|
||||
are fire-and-forget; credit return is the only backpressure signal.
|
||||
|
||||
### pe_math
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The math engine handles element-wise and reduction operations. It
|
||||
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
|
||||
`where`, etc.) and contributes a service time proportional to the
|
||||
number of elements processed. Like the GEMM engine it is stateless;
|
||||
chained epilogues (a sequence of math operations after a GEMM tile)
|
||||
are scheduled as separate stages.
|
||||
|
||||
### pe_mmu
|
||||
|
||||
<!-- src: ADR-0039 Context, Decision -->
|
||||
The MMU has two roles, exposed through one component. As a node on
|
||||
the cube NOC it receives MMU-map and MMU-unmap messages and updates
|
||||
its internal page table, so that the runtime API can install
|
||||
virtual-to-physical mappings with measured fabric latency. As a
|
||||
utility object held inside the PE it offers synchronous translate
|
||||
calls to the PE's DMA and GEMM engines without taking simulator time
|
||||
itself; the calling engine pays any configured TLB overhead in its
|
||||
own process.
|
||||
|
||||
<!-- src: ADR-0039 Decision, Alternatives Considered -->
|
||||
The page table supports multiple disjoint regions inside a single
page, with later-write-wins semantics on overlap. This is a deliberate
simulator stopgap: it lets parallelization policies shard data at
sub-page granularity, where a one-PA-per-entry page table, as in a
real hardware MMU, would silently mis-route part of the page. A real
MMU does not work this way; the model documents this as a
simplification.
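
The region semantics can be sketched with a list of mappings searched
newest-first, so that a later mapping wins wherever it overlaps an
earlier one. `RegionTable` and its methods are hypothetical; the
fallback to the input address when no mapping exists follows the PA
fallback described under the address scheme.

```python
class RegionTable:
    """Sub-page virtual-to-physical regions with later-write-wins."""

    def __init__(self) -> None:
        # (va_start, va_end, pa_base), kept in installation order
        self._regions: list[tuple[int, int, int]] = []

    def map(self, va: int, size: int, pa: int) -> None:
        self._regions.append((va, va + size, pa))

    def translate(self, va: int) -> int:
        # Search newest mapping first so the latest write wins.
        for start, end, pa_base in reversed(self._regions):
            if start <= va < end:
                return pa_base + (va - start)
        return va  # no mapping: treat the address as physical


if __name__ == "__main__":
    table = RegionTable()
    table.map(0x1000, 0x100, 0xA000)
    table.map(0x1100, 0x100, 0xB000)  # second region, same 4 KiB page
    print(hex(table.translate(0x1010)), hex(table.translate(0x1110)))
```

A linear search is adequate for a sketch; it keeps the overlap rule
obvious at the cost of lookup speed.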
|
||||
|
||||
### pe_scheduler
|
||||
|
||||
<!-- src: ADR-0014 Decision -->
|
||||
The scheduler is the sole dispatcher inside a PE. Simple commands are
|
||||
routed directly to the right engine. Composite commands generate a
|
||||
tile plan, and the resulting tile tokens are fed into the pipeline.
|
||||
Self-routing keeps the scheduler off the per-stage hot path: each
|
||||
engine, on finishing a stage, advances the token to the next stage's
|
||||
component itself, so the scheduler only does initial dispatch and
|
||||
completion tracking.
|
||||
|
||||
### pe_tcm
|
||||
|
||||
<!-- src: ADR-0040 Context, Decision -->
|
||||
The TCM is the per-PE tightly-coupled scratchpad memory. It models
|
||||
time only, not data — the actual payload lives in the simulator's
|
||||
memory store. Read and write are independent channels: each is
|
||||
modeled as a one-in-flight resource, so same-direction requests
|
||||
serialize but a read and a write can overlap. The bandwidth of each
|
||||
direction is configured separately and applied as bytes-over-bandwidth
|
||||
on each request.
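
The timing behaviour of two independent one-in-flight channels can be
approximated by tracking when each direction next becomes free. The
class below is a sketch under that assumption; `TcmTiming`, its
parameters, and the bytes-per-nanosecond unit are illustrative, and
the real component uses simulator resources rather than timestamps.

```python
class TcmTiming:
    """Time-only model: one in-flight request per direction."""

    def __init__(self, read_bw: float, write_bw: float) -> None:
        self._bw = {"read": read_bw, "write": write_bw}  # bytes per ns
        self._free_at = {"read": 0.0, "write": 0.0}

    def access(self, direction: str, nbytes: int, now: float) -> float:
        """Return the completion time of a request issued at `now`."""
        start = max(now, self._free_at[direction])  # wait for the channel
        done = start + nbytes / self._bw[direction]
        self._free_at[direction] = done
        return done


if __name__ == "__main__":
    tcm = TcmTiming(read_bw=64.0, write_bw=32.0)
    print(tcm.access("read", 128, now=0.0))
    print(tcm.access("read", 128, now=0.0))   # serializes behind the first read
    print(tcm.access("write", 128, now=0.0))  # overlaps with the reads
```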
|
||||
|
||||
<!-- src: ADR-0040 Decision, Alternatives Considered -->
|
||||
The decision to keep read and write on separate channels was made
|
||||
because the PE pipeline's normal case overlaps fetch (read) and store
|
||||
(write). Collapsing them into a single shared channel would have
|
||||
artificially serialized that overlap and produced an incorrect
|
||||
bandwidth ceiling.
|
||||
|
||||
### sram
|
||||
|
||||
<!-- src: ADR-0041 Context, Decision -->
|
||||
The cube SRAM is a per-cube scratchpad attached to one of the cube's
|
||||
routers. As a node it applies a configured access overhead, pays the
|
||||
link-bandwidth drain stamped on the incoming request, and sends a
|
||||
response on the reverse path. It is a terminal — it does not forward.
|
||||
|
||||
<!-- src: ADR-0041 Decision, Consequences -->
|
||||
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
that an inter-PE collective slot can live in. When the slot lives in
SRAM, the PE_DMA pays the slot read or write latency directly, using
the configured SRAM bandwidth and overhead. The SRAM component
therefore needs no knowledge of collective semantics.
|
||||
|
||||
### tiling
|
||||
|
||||
<!-- src: ADR-0042 Context, Decision -->
|
||||
The tile-plan generator is not a runtime component — it is a pure
|
||||
module of functions that take a problem shape (matrix dimensions, tile
|
||||
sizes) and produce an ordered list of tile-stage sequences. The
|
||||
scheduler consumes this list. Each tile's stage sequence depends on
|
||||
how its operands are staged: operands streamed from HBM produce
|
||||
DMA_READ stages, operands already resident in TCM (because they were
|
||||
loaded eagerly upfront) skip them.
|
||||
|
||||
<!-- src: ADR-0042 Decision, Consequences -->
|
||||
The plan generator is intentionally pure — given the same input it
|
||||
returns the same plan, with no simulator events created. This lets
|
||||
the rest of the system reason about tile sequences as data, and it
|
||||
makes the plan testable in isolation without simulator state. New
|
||||
plan variants (for example, K-major or DTensor-aware plans) can be
|
||||
added as new functions following the same shape.
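
A pure plan generator of this shape might look like the sketch below.
The stage names follow the PE pipeline order DMA_READ, FETCH, GEMM,
STORE, DMA_WRITE; the function name, the `TilePlanEntry` record, and
the single `operands_in_tcm` switch are assumptions for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class TilePlanEntry:
    tile: tuple[int, int]    # (row tile index, column tile index)
    stages: tuple[str, ...]  # ordered stage sequence for this tile


def gemm_tile_plan(
    m: int, n: int, tile_m: int, tile_n: int, operands_in_tcm: bool = False
) -> list[TilePlanEntry]:
    """Return an ordered tile plan; creates no simulator events."""
    # Operands already resident in TCM skip the DMA_READ stage.
    load = () if operands_in_tcm else ("DMA_READ",)
    stages = load + ("FETCH", "GEMM", "STORE", "DMA_WRITE")
    return [
        TilePlanEntry((i, j), stages)
        for i in range(ceil(m / tile_m))
        for j in range(ceil(n / tile_n))
    ]


if __name__ == "__main__":
    for entry in gemm_tile_plan(256, 128, tile_m=128, tile_n=128):
        print(entry)
```

Because the function touches no shared state, calling it twice with
the same arguments returns equal plans, which is what makes the plan
testable as plain data.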
|
||||
|
||||
---
|
||||
|
||||
## Implementation Decisions
|
||||
|
||||
This section collects cross-cutting decisions — algorithms, policies,
|
||||
schemes, and contracts — that span multiple components rather than
|
||||
living inside one.
|
||||
|
||||
### Address Scheme
|
||||
|
||||
<!-- src: ADR-0001 Context, Decision -->
|
||||
Every physical address in the simulator decodes into a structured
|
||||
location. A fixed-width physical address carries the SIP id, the
|
||||
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
|
||||
others), and a type-specific offset. HBM addresses additionally encode
|
||||
the per-PE slice offset so the controller can determine which PE
|
||||
owns the target slice without external lookup. The layout is
|
||||
deliberately reserved rather than packed-to-fit, so new sub-units can
|
||||
be added at the type-discriminator level without rewriting existing
|
||||
addresses.
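
A structured physical address of this kind can be packed and unpacked
with shifts and masks. The field widths and type codes below are
invented for the example (the document does not state them), and the
per-PE slice encoding inside the HBM offset is left out.

```python
from typing import NamedTuple

# Hypothetical field widths, most significant field first.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 8, 8, 4, 44
# Hypothetical type-discriminator values.
TYPE_HBM, TYPE_PE_RESOURCE = 0, 1


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int
    offset: int


def encode(loc: Location) -> int:
    """Pack a Location into one integer physical address."""
    widths = (SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS)
    for value, bits in zip(loc, widths):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << TYPE_BITS) | loc.kind
    pa = (pa << OFFSET_BITS) | loc.offset
    return pa


def decode(pa: int) -> Location:
    """Unpack a physical address produced by encode()."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    sip = pa >> CUBE_BITS
    return Location(sip, cube, kind, offset)


if __name__ == "__main__":
    pa = encode(Location(sip=1, cube=5, kind=TYPE_HBM, offset=0x1234))
    print(hex(pa), decode(pa))
```

Leaving the type discriminator wider than the number of types in use
is one way to keep the layout "reserved rather than packed-to-fit".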
|
||||
|
||||
<!-- src: ADR-0011 Context, Decision -->
|
||||
On top of physical addressing, the simulator defines three address
models, which the runtime API selects between. Direct physical
addressing is retained as a fallback. Virtual addressing, the current
default, gives each tensor a contiguous virtual range at deployment,
with the per-PE MMU translating on each access. A logical-address
scheme remains a future option. Every current test takes the
virtual-address path; the MMU itself falls back to the physical
address when no mapping exists for an address (a deliberate signal,
not an error).
|
||||
|
||||
<!-- src: ADR-0011 Decision, Consequences -->
|
||||
Tensor placement is represented as a list of physical-address shards,
|
||||
each tagged with target SIP, cube, and PE, plus a single tensor-wide
|
||||
virtual base. This means a kernel sees one virtual base for the whole
|
||||
tensor while the host driver and the engine still know exactly where
|
||||
each shard lives. Replicated tensors get per-cube local PA mappings;
|
||||
sharded tensors broadcast their mapping across cubes within a SIP.
|
||||
|
||||
### Routing, Distance & Helper API
|
||||
|
||||
<!-- src: ADR-0002 Context, Decision -->
|
||||
Routing is policy-driven, deterministic, and topology-aware. Given a
|
||||
source, a destination, and an intent — for example, PE-initiated
|
||||
DMA versus host-initiated memory write versus a generic
|
||||
component-to-component query — the routing layer picks the right
|
||||
path. The intent matters because different traffic types must avoid
|
||||
different categories of edges: PE-initiated DMA should not traverse
|
||||
command-only links; M_CPU DMA should not pass through PE-internal
|
||||
pipeline edges; cube-local transfers should not use the
|
||||
zero-distance UCIe bus that would otherwise look attractive to a
|
||||
shortest-path search.
|
||||
|
||||
<!-- src: ADR-0051 Decision -->
|
||||
The routing layer therefore maintains four separate adjacency graphs
|
||||
at construction, each excluding a different category of edges, and
|
||||
picks the appropriate one per intent. On top of the graphs sits a
|
||||
helper API that hides the topology's naming convention: callers ask
|
||||
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
|
||||
the HBM destination for a given physical address, and receive the
|
||||
corresponding node id. No component constructs node-id strings
|
||||
directly; if the naming convention ever changes, the change is local
|
||||
to the helper layer.
|
||||
|
||||
<!-- src: ADR-0051 Decision, Consequences -->
|
||||
Path-finding itself uses Dijkstra with explicit per-edge weights
(routing weight is allowed to differ from physical distance; for
example, UCIe is configured to be routing-preferable). Tie-breaks
follow insertion order, which keeps results deterministic. A query
between unreachable nodes raises an error rather than returning an
empty path, which surfaces topology errors immediately.
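
The sketch below shows one way to get these properties from a
textbook Dijkstra: a push counter breaks ties between equal-cost
candidates in the order they were discovered, and an unreachable
destination raises. The adjacency format and the function name are
assumptions; this is not the routing layer's actual code.

```python
import heapq
import itertools


def shortest_path(
    adj: dict[str, list[tuple[str, float]]], src: str, dst: str
) -> list[str]:
    """Dijkstra over an insertion-ordered adjacency map."""
    counter = itertools.count()  # deterministic tie-break
    heap = [(0.0, next(counter), src, None)]
    parent: dict[str, str | None] = {}
    while heap:
        dist, _, node, prev = heapq.heappop(heap)
        if node in parent:
            continue  # already settled with a cost at least as good
        parent[node] = prev
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor, weight in adj.get(node, []):
            if neighbor not in parent:
                heapq.heappush(heap, (dist + weight, next(counter), neighbor, node))
    raise ValueError(f"no route from {src!r} to {dst!r}")


if __name__ == "__main__":
    graph = {"a": [("b", 1), ("c", 1)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
    print(shortest_path(graph, "a", "d"))
```

In the example both routes to `d` cost 2; the route through `b` wins
because `b` was listed first in `a`'s adjacency.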
|
||||
|
||||
### Memory Semantics and Local-HBM Bandwidth
|
||||
|
||||
<!-- src: ADR-0004 Context, Decision -->
|
||||
A PE accessing its own HBM slice through its own cube's NOC must see
|
||||
the full local HBM bandwidth — that is the model's intent. Memory
|
||||
traffic accumulates latency from per-component overhead and
|
||||
bytes-over-link-bandwidth serialization along the path, but the
|
||||
controller does not throttle below the slice's allotted bandwidth.
|
||||
Cross-PE-slice accesses inside the same cube, cross-cube accesses
|
||||
through UCIe, and cross-SIP accesses through PCIe each pay
|
||||
progressively more overhead as the path grows.
|
||||
|
||||
### Topology Compilation, Diagrams & Builder Algorithms
|
||||
|
||||
<!-- src: ADR-0006 Context, Decision -->
|
||||
Topology is configurable, not hardcoded. The simulator reads a YAML
|
||||
spec, compiles it into a flat graph of nodes and edges plus four
|
||||
view projections at different abstraction levels — system, SIP, cube,
|
||||
PE — and uses the compiled graph as the single source for both
|
||||
execution and visualization. Distance metadata used by routing is
|
||||
extracted at compile time so that diagrams and routing decisions
|
||||
agree by construction.
|
||||
|
||||
<!-- src: ADR-0005 Context, Decision -->
|
||||
Diagrams are derived artifacts of the compiled topology. The visualizer
|
||||
produces one SVG per view at the appropriate abstraction level; nothing
|
||||
in the diagrams is hand-drawn or hand-positioned. Distance-aware
|
||||
layout rules place nodes in the diagrams using the same coordinates
|
||||
that routing uses to compute distance, so a diagram that "looks
|
||||
wrong" is a signal that the topology itself has a problem, not the
|
||||
visualizer.
|
||||
|
||||
<!-- src: ADR-0053 Decision -->
|
||||
Inside a cube the router mesh is generated automatically. PE corner
|
||||
positions are fixed by convention; the relay-column algorithm
|
||||
inserts additional grid columns whenever the gap between adjacent PE
|
||||
columns would exceed a tunable maximum. HBM occupies a central
|
||||
exclusion zone — router slots inside the zone are deliberately empty,
|
||||
since HBM controllers attach as separate named nodes. M_CPU and SRAM
|
||||
attach to the nearest router by Euclidean distance from their
|
||||
configured placement coordinates, and UCIe physical lanes distribute
|
||||
along the boundary rows and columns. The whole mesh is cached
|
||||
beside the topology spec and invalidated only when one of a small set
|
||||
of layout-relevant fields changes.
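
The nearest-router attachment rule is simple enough to sketch
directly. The router names and coordinates here are made up, and the
first-listed router winning a tie is an assumption chosen to keep the
result deterministic; the relay-column insertion and the HBM exclusion
zone are not modeled.

```python
from math import hypot


def nearest_router(
    routers: dict[str, tuple[float, float]], placement: tuple[float, float]
) -> str:
    """Return the router closest to `placement` by Euclidean distance.

    `routers` maps a router name to its (x, y) coordinate; min() keeps
    the first-listed router when distances are equal."""
    px, py = placement
    return min(
        routers,
        key=lambda name: hypot(routers[name][0] - px, routers[name][1] - py),
    )


if __name__ == "__main__":
    mesh = {"r0_0": (0.0, 0.0), "r0_1": (1.0, 0.0), "r1_0": (0.0, 1.0)}
    print(nearest_router(mesh, (0.9, 0.1)))
```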
|
||||
|
||||
<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
|
||||
|
||||
### Tensor Deployment and Allocation
|
||||
|
||||
<!-- src: ADR-0008 Context, Decision -->
|
||||
Tensor deployment in the runtime API produces a list of physical-address
|
||||
shards plus a single tensor-wide virtual base. The host allocator
|
||||
walks the data-parallelism policy, computes per-shard placement, and
|
||||
emits the per-shard physical addresses through the per-PE allocators.
|
||||
No separate "allocate then later attach to a device" RPC exists —
|
||||
allocation and deployment are a single operation that produces a
|
||||
deployed tensor handle.
|
||||
|
||||
### Memory Allocator Algorithms
|
||||
|
||||
<!-- src: ADR-0048 Context, Decision -->
|
||||
Each per-PE allocator owns two channels — HBM slice and TCM — each
|
||||
backed by an offset-keyed free-list. Allocation is first-fit; freeing
|
||||
coalesces with adjacent free blocks. A device-wide virtual allocator
|
||||
sits above the per-PE allocators, aligns requests up to the configured
|
||||
page size, and coalesces on free in the same way. The trade-off is
|
||||
explicit: first-fit is simpler and cheaper than best-fit or buddy
|
||||
allocation, and the simulator's workload is stack-like enough
|
||||
(deploy / kernel / free in matched order) that fragmentation is not
|
||||
a practical concern.
|
||||
|
||||
<!-- src: ADR-0048 Decision, Consequences -->
|
||||
Allocation failure raises rather than silently returning a partial
|
||||
result. A partial tensor reaching the engine would route over wrong
|
||||
PAs and silently corrupt simulator output, so an out-of-memory signal
|
||||
is preferred. The free path trusts its caller to pass back exactly
|
||||
what was allocated; the small risk of caller error in exchange for
|
||||
fast common-case freeing is documented as a deliberate trade.
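
A first-fit allocator with a coalescing, offset-sorted free list can
be sketched as follows. The class name and method signatures are
hypothetical. As in the text, allocation failure raises, and `free`
trusts the caller to pass back exactly what was allocated, so it does
no validation.

```python
import bisect


class FirstFitAllocator:
    """Offset-keyed free list: first-fit alloc, coalescing free."""

    def __init__(self, capacity: int) -> None:
        # Sorted list of (offset, size) free blocks.
        self.free_blocks: list[tuple[int, int]] = [(0, capacity)]

    def alloc(self, size: int) -> int:
        if size <= 0:
            raise ValueError("size must be positive")
        for i, (offset, block) in enumerate(self.free_blocks):
            if block >= size:  # first block that fits
                if block == size:
                    del self.free_blocks[i]
                else:
                    self.free_blocks[i] = (offset + size, block - size)
                return offset
        raise MemoryError(f"cannot allocate {size} bytes")

    def free(self, offset: int, size: int) -> None:
        i = bisect.bisect_left(self.free_blocks, (offset, size))
        # Merge with the following free block if it is adjacent.
        if i < len(self.free_blocks) and offset + size == self.free_blocks[i][0]:
            size += self.free_blocks[i][1]
            del self.free_blocks[i]
        # Merge with the preceding free block if it is adjacent.
        if i > 0 and sum(self.free_blocks[i - 1]) == offset:
            prev_offset, prev_size = self.free_blocks[i - 1]
            offset, size = prev_offset, prev_size + size
            del self.free_blocks[i - 1]
            i -= 1
        self.free_blocks.insert(i, (offset, size))


if __name__ == "__main__":
    heap = FirstFitAllocator(1024)
    a, b = heap.alloc(256), heap.alloc(256)
    heap.free(a, 256)
    heap.free(b, 256)
    print(heap.free_blocks)
```

After both blocks are freed the list collapses back to a single block
covering the whole range, which is the coalescing behaviour described
above.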
|
||||
|
||||
### Kernel Execution and Host-Device Messaging

<!-- src: ADR-0009 Context, Decision -->
Kernel execution decomposes into a small set of messages that travel
the device graph. The host issues a single kernel-launch message; the
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
PE CPU resolves the kernel and runs it through the scheduler.
Completion flows back the same way, gated by per-shard completion
tracking. Memory operations follow the same pattern: a memory write
or read travels as one message that the engine routes to the right
HBM controller, with a response taking the reverse path.

<!-- src: ADR-0012 Context, Decision -->
The schema between the host and the device-side IO_CPU is PA-first
and shard-tagged. Every byte of host-issued payload arrives with an
explicit target SIP, cube, PE, and physical address. The IO_CPU does
not decode addresses to derive placement; placement is named
explicitly by the shard list. This makes the host-device interface
deterministic and keeps the routing helper free of host-derived
intent.

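To illustrate what "shard-tagged" buys, the sketch below groups a shard
list per cube and then per PE using only the explicit tags, with no
address decoding. The `ShardTarget` record and `fan_out` helper are
hypothetical names chosen for this example; the real message schema is
not shown in this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardTarget:
    sip: int    # target SIP
    cube: int   # target cube within the SIP
    pe: int     # target PE within the cube
    pa: int     # physical address inside that PE's memory


def fan_out(shards):
    """Group shard targets per cube, then per PE, from the tags alone."""
    per_cube = defaultdict(lambda: defaultdict(list))
    for s in shards:
        per_cube[s.cube][s.pe].append(s.pa)
    # Convert to plain dicts so the result is easy to inspect.
    return {cube: dict(pes) for cube, pes in per_cube.items()}
```

Because placement is read directly from the record, the same shard list
always produces the same fan-out, which is the determinism the schema
aims for.
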
### CLI Surface and Semantics

<!-- src: ADR-0010 Context, Decision -->
The command-line interface exposes four subcommands. A bench runner
loads a topology, resolves a registered benchmark by name or index,
and runs it on a selected device. A bench-listing command enumerates
the registered benchmarks. A probe utility runs a fixed catalog of
traffic patterns through the engine for latency and bandwidth
verification. A web viewer renders the topology in a browser. A
benchmark instance is always single-device by convention; multi-SIP
collective work happens inside the benchmark through the launcher
abstraction, not by multiplexing the CLI.

### Component Port and Wire Fabric Model

<!-- src: ADR-0015 Context, Decision -->
Every modeled component exposes input and output ports, and every
edge in the topology connects an output port on one component to an
input port on another. Bandwidth and propagation delay are properties
of the wire between ports, not of the component endpoints. A
component's responsibility is to apply its configured per-node
overhead and either forward to the next hop or terminate; the wire
charges the byte-over-bandwidth serialization separately.

<!-- src: ADR-0015 Decision, Consequences -->
This separation lets components be swapped behind their port
interface without changing the rest of the model, and it keeps
bandwidth contention at the wire level where multiple components may
contend for the same edge. Future component models can refine
internal behavior without disturbing the fabric.

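The split between node overhead and wire cost can be made concrete with a
small latency calculation. The sketch assumes an uncontended,
store-and-forward path in which every wire serializes the full payload;
the `Wire` class, its field names, and the nanosecond units are
assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wire:
    bandwidth_bytes_per_ns: float   # property of the wire, not the endpoints
    propagation_ns: float

    def delay_ns(self, nbytes):
        # Byte-over-bandwidth serialization plus propagation delay.
        return nbytes / self.bandwidth_bytes_per_ns + self.propagation_ns


def path_latency_ns(nbytes, node_overheads_ns, wires):
    """Sum per-node overhead and per-wire delay along one path."""
    if len(node_overheads_ns) != len(wires) + 1:
        raise ValueError("a path has exactly one more node than wires")
    return sum(node_overheads_ns) + sum(w.delay_ns(nbytes) for w in wires)
```

For example, 256 bytes over a 64 B/ns wire and a 32 B/ns wire with
1 ns and 2 ns propagation, through nodes costing 2, 3, and 2 ns, gives
7 ns of node overhead plus 5 ns and 10 ns of wire delay, or 22 ns in
total.
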
### Two-Pass Data Execution

<!-- src: ADR-0020 Context, Decision -->
The simulator runs in two passes. The first pass, fast and always
on, runs the discrete-event engine and records every data operation
in an operation log with timestamps, component identifiers, and
per-operation parameters. The second pass, which is opt-in, replays
the log against an in-memory tensor store to produce actual numerical
results. Tests that only need timing skip the second pass; tests that
need to verify correctness opt in.

<!-- src: ADR-0020 Decision, Consequences -->
The split lets the timing engine remain unconcerned with data
semantics: kernels move handles around, not bytes. The replay phase
recovers data semantics from the recorded operations, in their
original time order with a small set of secondary-sort rules. The
op-log records carry enough metadata (input snapshots for compute
operations, source snapshots for cross-component copies) that the
replay phase cannot apply operations out of order with respect to
in-flight mutations.

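The two passes can be sketched in a few lines. This is a toy model, not
the engine: operations run strictly one after another, the record is a
plain dictionary, and the operation names `copy` and `add` are
placeholders chosen for the example.

```python
def timing_pass(ops):
    """Pass 1: assign timestamps and log each operation; no data is touched."""
    log, now = [], 0
    for name, params, duration in ops:
        log.append({"t_start": now, "t_end": now + duration,
                    "name": name, "params": params})
        now += duration
    return log


def replay_pass(log, store):
    """Pass 2 (opt-in): replay the log in time order against a tensor store."""
    for rec in sorted(log, key=lambda r: r["t_start"]):  # sorted() is stable
        p = rec["params"]
        if rec["name"] == "copy":
            store[p["dst"]] = list(store[p["src"]])
        elif rec["name"] == "add":
            store[p["dst"]] = [a + b for a, b in zip(store[p["a"]], store[p["b"]])]
    return store
```

A timing-only test would stop after `timing_pass` and inspect the final
`t_end`; a correctness test would additionally call `replay_pass` and
compare the resulting values.
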
### Sim-engine Op Log and Memory Store Schemas

<!-- src: ADR-0052 Context, Decision -->
The operation log holds typed records with seven fields each: start
and end timestamps, the component that issued the operation, an
operation kind ("memory", "gemm", "math"), an operation name, a
parameter dictionary, and a (currently unused) dependency list.
Records are kept in stable timestamp order. The parameter dictionary
varies by operation: a DMA read carries source address and byte count;
a GEMM carries operand shapes, dtypes, and address spaces; a math
operation carries input addresses and snapshots.

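One way to picture the seven-field record is as a dataclass. The field
names below are assumptions made for this sketch (the document lists the
fields but not their identifiers), and `append_stable` is a hypothetical
helper showing what "stable timestamp order" means: ties keep their
insertion order.

```python
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: int
    t_end: int
    component: str                             # issuer of the operation
    kind: str                                  # "memory", "gemm", or "math"
    name: str
    params: dict
    deps: list = field(default_factory=list)   # currently unused


def append_stable(log, rec):
    """Insert rec so the log stays sorted by t_start; ties keep arrival order."""
    i = len(log)
    while i > 0 and log[i - 1].t_start > rec.t_start:
        i -= 1
    log.insert(i, rec)
```

A DMA read record under these assumptions might carry
`params={"src": 0x1000, "nbytes": 4096}`, while a GEMM record would
carry operand shapes, dtypes, and address spaces instead.
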
<!-- src: ADR-0052 Decision, Consequences -->
The companion memory store is a two-level dictionary keyed by
address space ("hbm", "tcm", "sram", others) and integer address.
Reads and writes are reference-based (no copy by default), so
callers wanting to detach a snapshot must copy explicitly. This is
deliberate: the engine-internal snapshot paths copy at well-defined
points (math input capture, HBM source capture for DMA writes,
inbound collective copies) and downstream replay code therefore
sees stable data even when slot or scratch addresses are reused by
later operations.

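The reference semantics are easy to get wrong, so a short sketch may
help. `MemoryStore` here is a hypothetical stand-in built on nested
dictionaries; it only demonstrates the aliasing behaviour described
above and the explicit copy a caller needs to take a snapshot.

```python
class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        # Stores the reference itself; no copy is made.
        self._spaces.setdefault(space, {})[addr] = value

    def read(self, space, addr):
        # Returns the stored reference; callers copy if they need a snapshot.
        return self._spaces[space][addr]


store = MemoryStore()
buf = [1, 2, 3]
store.write("hbm", 0x100, buf)

snapshot = list(store.read("hbm", 0x100))   # explicit copy detaches the data
buf[0] = 99                                 # later mutation of the same slot
```

After the mutation, `store.read("hbm", 0x100)` reflects the new value
because it aliases `buf`, while `snapshot` still holds the original
contents.
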
### 2D Grid Program Identity

<!-- src: ADR-0022 Context, Decision -->
Inside a kernel the program identity is two-dimensional. The
first axis corresponds to the PE index within a cube; the second
corresponds to the cube index within a SIP. Together they let a
kernel address its position both within its cube and within the
larger system without needing to know the full topology. Total
program counts along each axis are exposed symmetrically.

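As a small worked example, the two axes can be enumerated and flattened
as shown below. The helper names are hypothetical, and the row-major
flattening (cube index major, PE index minor) is one reasonable
convention assumed for illustration rather than something the document
specifies.

```python
def program_ids(num_pes_per_cube, num_cubes):
    """Enumerate (pe_index, cube_index) pairs for every program in a SIP."""
    return [(pe, cube)
            for cube in range(num_cubes)
            for pe in range(num_pes_per_cube)]


def flat_index(pe_index, cube_index, num_pes_per_cube):
    """Flatten a 2D identity into a single SIP-wide index (assumed layout)."""
    return cube_index * num_pes_per_cube + pe_index
```

With 4 PEs per cube and 2 cubes, the program at PE 1 of cube 1 has the
identity `(1, 1)` and, under this layout, the flat index 5.
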
### Parallelism: SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module

<!-- src: ADR-0024 Context, Decision -->
The launcher model treats each SIP as one rank. Inside a process the
launcher spawns one greenlet per SIP rank; the rank is bound to its
greenlet so that any code running in that worker sees the right
distributed-style rank. This is a deliberately PyTorch-compatible
shape: a benchmark looks like a small DDP training script that
initializes a process group and spawns workers, each of which runs
the same body.

<!-- src: ADR-0026 Context, Decision -->
Data-parallelism policy lives in a single object that names the
sharding strategy along the cube axis (replicate, row-wise,
column-wise) and along the PE axis (same set of values), and optionally
overrides the number of cubes or PEs participating. The policy is
intra-device: it does not cross SIP boundaries. SIP-level parallelism
is the launcher's responsibility, and the two axes compose
orthogonally.

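A policy object of this shape might look like the following sketch. The
class, enum, and field names are assumptions for illustration, and
`shard_shape` is a hypothetical helper that shows how the cube-axis and
PE-axis strategies compose on a 2D tensor.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    REPLICATE = "replicate"
    ROW = "row"
    COLUMN = "column"


@dataclass(frozen=True)
class DPPolicy:
    cube: Strategy = Strategy.REPLICATE   # sharding along the cube axis
    pe: Strategy = Strategy.REPLICATE     # sharding along the PE axis
    num_cubes: Optional[int] = None       # optional participation override
    num_pes: Optional[int] = None


def shard_shape(shape, policy, topo_cubes, topo_pes):
    """Per-PE shard shape of a 2D tensor after applying both axes in turn."""
    rows, cols = shape
    axes = ((policy.cube, policy.num_cubes or topo_cubes),
            (policy.pe, policy.num_pes or topo_pes))
    for strategy, count in axes:
        if strategy is Strategy.ROW:
            rows = -(-rows // count)   # ceiling division
        elif strategy is Strategy.COLUMN:
            cols = -(-cols // count)
    return rows, cols
```

For a 1024 x 512 tensor on 4 cubes of 8 PEs, row-wise across cubes and
column-wise across PEs yields 256 x 64 shards; the default policy
replicates the full tensor everywhere.
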
<!-- src: ADR-0027 Context, Decision -->
A Megatron-style tensor-parallel API sits on top of the launcher and
the DP policy. Layer-level building blocks (column-parallel linear,
row-parallel linear, all-reduce) name their sharding intent in terms
the launcher and the placement policy can compose. This is the layer
that bench code typically writes against.

<!-- src: ADR-0047 Context, Decision -->
For collective operations the runtime exposes a PyTorch-compatible
distributed backend named "ahbm". On process-group initialization the
backend loads the configured collective-algorithm module, resolves
the world size (priority: explicit ccl.yaml override → defaults
section → topology SIP count), imports the algorithm module
dynamically, derives the SIP topology kind, and pushes the inter-PE
neighbor table to every participating PE. From that point on, an
all-reduce call dispatches the algorithm's kernel function across
all ranks.

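The world-size priority chain can be written as a short fallback
function. The key names `world_size` and `defaults` are assumptions about
how a parsed ccl.yaml might look; only the ordering (explicit override,
then defaults section, then topology SIP count) comes from the text.

```python
def resolve_world_size(ccl_config, topology_sip_count):
    """Return the world size using override -> defaults -> topology order."""
    override = ccl_config.get("world_size")            # hypothetical key
    if override is not None:
        return override
    default = ccl_config.get("defaults", {}).get("world_size")
    if default is not None:
        return default
    return topology_sip_count
```

Checking `is not None` rather than truthiness keeps each level distinct
from "not set", so a missing key always falls through to the next level.
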
<!-- src: ADR-0050 Context, Decision -->
A collective-algorithm module is a Python module with a small, fixed
contract. It exposes topology-kind integer constants, a name-to-kind
mapping for the YAML configuration, a kernel-arguments builder, and
a kernel function, which is aliased to the name `kernel` so the
backend can find it generically. The kernel itself
takes the tensor pointer, the per-cube element count, cube mesh
width and height, the world size, the current rank, and the SIP
topology dimensions; the backend appends those last four arguments
automatically. New collectives slot in by adding a new module that
follows this shape.

<!-- src: ADR-0027 Decision, Consequences -->
The combination is deliberate: bench authors get to write code that
looks like a regular distributed training script, while the launcher,
backend, and placement policies behind it remain free to redirect
work to the right SIP, cube, and PE without exposing topology to the
kernel.

### IPCQ Direction Addressing

<!-- src: ADR-0025 Context, Decision -->
Inside a collective algorithm, peer PEs are named by direction:
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
cross-SIP neighbors. Direction is the only address the algorithm uses:
the algorithm names a direction, the IPCQ neighbor table installed
at process-group time resolves the direction to the peer endpoint's
physical-address coordinates, and the PE_DMA performs the actual
transfer. The algorithm itself does not see PA arithmetic; direction
is the user-facing handle.

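A neighbor table for the cube-internal directions can be sketched as a
dictionary lookup. This example resolves directions to grid coordinates
rather than physical addresses, assumes "N" means decreasing `y`, and
omits the "global_*" entries; the function name is hypothetical.

```python
def neighbor_table(x, y, width, height):
    """Map direction names to the (x, y) of each in-mesh neighbor."""
    candidates = {
        "N": (x, y - 1),
        "S": (x, y + 1),
        "E": (x + 1, y),
        "W": (x - 1, y),
    }
    # Drop directions that would fall off the edge of a non-wrapping mesh.
    return {d: (nx, ny) for d, (nx, ny) in candidates.items()
            if 0 <= nx < width and 0 <= ny < height}
```

An algorithm written against this table only ever says "send east"; what
"east" resolves to is decided once, when the table is installed.
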
### Intercube All-Reduce

<!-- src: ADR-0032 Context, Decision -->
The default all-reduce algorithm uses a center-rooted bidirectional
phase inside each SIP's cube mesh followed by an inter-SIP exchange
on the mesh's root cube, and then a bidirectional broadcast back
out. Center-rooting roughly halves the in-cube hop count compared
with a corner-rooted walk. The inter-SIP exchange itself follows the
configured SIP topology (ring, torus, or non-wrapping mesh),
selected at runtime through the SIP-topology kind integer the
backend passes to the kernel.

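The hop-count claim can be checked with a little arithmetic. The sketch
below measures the farthest Manhattan distance from a root to any cube
in a non-wrapping mesh; treating "hop count" as this worst-case distance
is an assumption made for the example.

```python
def max_hops(width, height, root):
    """Largest Manhattan distance from root to any cube in the mesh."""
    rx, ry = root
    return max(abs(x - rx) + abs(y - ry)
               for x in range(width)
               for y in range(height))


corner_5x5 = max_hops(5, 5, (0, 0))   # farthest cube is the opposite corner
center_5x5 = max_hops(5, 5, (2, 2))   # farthest cubes are the four corners
```

On a 5 x 5 mesh the corner root needs 8 hops and the center root 4, an
exact halving. On an even-sized 4 x 4 mesh there is no true center, so
the figures are 6 and 4.
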
### Evaluation Harnesses

<!-- src: ADR-0043 Context, Decision -->
The all-reduce evaluation harness drives correctness and the
latency/buffer-kind sweeps through the public distributed path
(initialize process group, spawn workers, call all-reduce) rather
than the lower-level engine interface. A shared helper module factors
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
HBM) and the inter-SIP topology variants. The plots produced by the
harness are part of its output contract; the harness regenerates them
on demand.

<!-- src: ADR-0044 Context, Decision -->
The GEMM evaluation harness is split into two layers. A heavy
shape-and-variant sweep lives as a manual script: it runs the same
composite-GEMM benchmark across many shapes and operand-staging
variants, harvests the resulting op-log, and writes a JSON summary.
A faster figure-generation layer lives in the test suite and consumes
that JSON to render plots. The split keeps the heavy data
generation explicit and out of the regular test path.

### Bench Module Contract

<!-- src: ADR-0045 Context, Decision -->
Adding a new benchmark requires only dropping a file into the
benchmarks directory. The file registers one or more benchmark
functions through a small decorator that takes a kebab-case name and
a human-readable description. The decorator is the registration
mechanism; there is no separate manifest. Each benchmark function
takes one argument, conventionally named `torch`, which is the
runtime context exposing tensor allocation, kernel launch,
distributed APIs, and process-spawning. The function name is `run` by
convention.

<!-- src: ADR-0045 Decision, Consequences -->
A benchmark must submit at least one operation, or the runner
returns an error. A benchmark instance is single-device by default;
when a benchmark is collective, it uses the distributed-process-spawn
pattern internally: one worker greenlet per rank, with each worker
binding to its rank. Multi-device benchmark patterns outside that
shape are not supported.

### Kernel-side `tl.*` API

<!-- src: ADR-0046 Context, Decision -->
Inside a kernel function, the `tl` argument exposes the kernel-side
API in a shape that mirrors the conventions of established
GPU-kernel languages. Categories: reference handles that name HBM
data without issuing DMA; data movement (load, store) that does
issue DMA; GEMM and math compute (dot, composite, the unary and
binary math operations, reductions); index and scalar helpers
(program identity, range-builders); metadata-only operations like
transpose; and the collective primitives (send, receive,
non-blocking receive). Tensor handles support arithmetic operators
via a thread-local active context so kernel code reads naturally.

<!-- src: ADR-0046 Decision, Consequences -->
The API supports two execution modes. A command-list mode records
operations into a list without consuming simulator time, which is
useful for inspection and lightweight tests. A greenlet-driven mode
runs the kernel as a child greenlet that switches back to the
simulator on each `tl.*` call; the simulator drives the event
scheduler and hands real data back to the kernel as DMA reads
complete. The two modes share the same surface; the kernel does not
know which one it is running under.

### Probe Subcommand

<!-- src: ADR-0049 Context, Decision -->
The probe utility runs three families of traffic patterns through
the engine (host-to-device writes at increasing hop counts,
device-to-host reads at increasing hop counts, and PE-initiated DMA
across the cube mesh) and reports actual latency, the analytical
formula breakdown, effective bandwidth, bottleneck bandwidth, and
utilization. A fixed reference size is used for the summary table;
a separate utilization-versus-size sweep covers a logarithmic range
of transfer sizes. Each case runs in its own engine instance so
cases do not perturb each other.

<!-- src: ADR-0049 Decision, Consequences -->
The probe also checks a small set of invariants automatically:
monotonic latency increase with hop count, device-to-host latency
at least as large as host-to-device for the same hop count, and a
faster best-case path than worst-case for cross-cube PE DMA. Failures
print prominently. The output is meant for human reading; automated
parsing should not depend on column widths or whitespace.

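The three invariants translate directly into checks over measured
latencies. The function below is a sketch with hypothetical names and
input shapes (dictionaries mapping hop count to latency), and it treats
"monotonic increase" as strictly increasing, which is an assumption.

```python
def check_invariants(h2d, d2h, dma_best_ns, dma_worst_ns):
    """Return a list of human-readable failures; empty means all passed."""
    failures = []
    # 1. Latency must increase with hop count in both directions.
    for label, series in (("h2d", h2d), ("d2h", d2h)):
        hops = sorted(series)
        for lo, hi in zip(hops, hops[1:]):
            if series[hi] <= series[lo]:
                failures.append(
                    f"{label}: latency not increasing from {lo} to {hi} hops")
    # 2. Device-to-host must be at least as slow as host-to-device.
    for hop in sorted(set(h2d) & set(d2h)):
        if d2h[hop] < h2d[hop]:
            failures.append(f"d2h faster than h2d at {hop} hops")
    # 3. Best-case cross-cube PE DMA must beat the worst case.
    if not dma_best_ns < dma_worst_ns:
        failures.append("PE DMA best case is not faster than worst case")
    return failures
```

Returning structured failures rather than formatted text also gives
automated callers something to consume that does not depend on column
widths or whitespace.
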
---

This document summarizes 46 architecture decisions captured during
the first half of 2026. It is regenerated mechanically from the
decision corpus; sources are recorded in HTML comments throughout.