e33e76f2d1
Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool. tools/generate_adr_index.py: - Single CLASSIFICATION dict per ADR — add an entry when introducing a new ADR; the script fails loud if any file is missing from the table. - DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq). - Accepts both ":" and "—" title separators (matching ADR-0033's existing format). - --check mode for CI: exits 1 if INDEX.md is stale. Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
837 lines
40 KiB
Markdown
837 lines
40 KiB
Markdown
# KernBench — Architecture Design Document
|
||
*2026 1H*
|
||
|
||
KernBench is a system-level, discrete-event simulator for AI-accelerator
|
||
chiplet systems. It models the data-movement and control paths across
|
||
the full hardware hierarchy and reports end-to-end execution latency
|
||
for kernels dispatched to the device's compute units.
|
||
|
||
This document is a public summary of the architecture as designed and
|
||
implemented in the first half of 2026. It assumes no prior knowledge of
|
||
the simulator's internal documents; terms specific to the system are
|
||
defined on first use.
|
||
|
||
---
|
||
|
||
## Design Principles
|
||
|
||
KernBench is grounded in two foundational commitments: every measured
|
||
latency must trace to explicit, modeled events on the simulator's graph,
|
||
and every behavioral claim must be verifiable through tests that target
|
||
spec-level invariants rather than incidental implementation details.
|
||
|
||
<!-- src: ADR-0013 Context, Decision -->
|
||
The verification posture is verification-driven. Tests are written to
|
||
validate the architectural contracts that the simulator exposes —
|
||
correct routing, deterministic results, monotonic latency under
|
||
increasing hop counts — rather than to mirror the call graph of the
|
||
implementation. Two phases coexist: a fast timing phase that exercises
|
||
the simulator's discrete-event engine and produces a log of operations
|
||
with timestamps, and an optional data-replay phase that uses that log
|
||
to compute real numerical results. Tests can target either phase.
|
||
|
||
<!-- src: ADR-0033 Context, Decision -->
|
||
The latency model is intentionally abstract rather than
|
||
cycle-accurate. Each modeled node contributes a configurable per-node
|
||
overhead, each link contributes wire delay plus byte-over-bandwidth
|
||
serialization, and each terminal service contributes its own service
|
||
time. The simulator does not attempt to reproduce cache coherence
|
||
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
|
||
correctness; those are explicitly outside the scope. The aim is a
|
||
simulator that compares system-level configurations meaningfully and
|
||
deterministically, not one that ships microarchitectural truths.
|
||
|
||
<!-- src: ADR-0033 Decision, Consequences -->
|
||
Determinism is a hard requirement. Given identical inputs — topology,
|
||
routing policy, and request stream — the simulator must produce
|
||
identical outputs, hop traces included. This rules out reliance on
|
||
unordered set iteration on the critical path and forces every latency
|
||
contribution to come from an explicitly scheduled event on a modeled
|
||
component or link. There are no implicit waits, no hardcoded magic
|
||
delays, and no shortcuts that bypass the modeled graph.
|
||
|
||
---
|
||
|
||
## High-level Architecture
|
||
|
||
<!-- src: ADR-0003 Context, Decision -->
|
||
The simulated system is a four-level hierarchy. A **Tray** holds one or
|
||
more **SIPs** (system-in-package), each containing a 2D mesh of
|
||
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
|
||
host. Each CUBE contains a regular grid of **PEs** (processing
|
||
elements) plus its own attached resources — high-bandwidth memory
|
||
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
|
||
itself is a composite of nine sub-components rather than a monolithic
|
||
core. This hierarchy is fixed; the parameters along each axis (counts,
|
||
mesh dimensions, link widths) are configurable through the topology
|
||
spec.
|
||
|
||
<!-- src: ADR-0007 Context, Decision -->
|
||
A clean separation runs along the request flow. A **runtime API** at
|
||
the top is the host-facing surface; it exposes tensor and kernel
|
||
operations, owns host-side allocation metadata, and is topology-
|
||
agnostic — it does not route or fan out. Below it the **simulation
|
||
engine** decomposes runtime operations into discrete graph requests
|
||
(memory writes, memory reads, kernel launches, MMU map installs) and
|
||
schedules events deterministically. At the bottom, **components** model
|
||
device behavior on a graph of nodes connected by links; they
|
||
implement the actual latency contributions and pass requests along.
|
||
No component reaches up into the runtime API, and no runtime call
|
||
shortcuts the engine.
|
||
|
||
<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
|
||
|
||
### Tray
|
||
|
||
<!-- src: ADR-0003 Decision -->
|
||
The Tray is the outermost boundary. It owns the host CPU on one side
|
||
and one or more SIPs on the other, connected through a fabric switch.
|
||
For collective communication that must traverse multiple SIPs, the
|
||
fabric switch acts as the common rendezvous: device-side outbound
|
||
traffic from one SIP routes through the switch and back into the
|
||
target SIP's IO chiplet.
|
||
|
||
### SIP
|
||
|
||
<!-- src: ADR-0003 Decision, ADR-0017 Context -->
|
||
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
|
||
default topology used by the simulator is a 4×4 cube mesh; the
|
||
mesh dimensions are configurable. Each cube on the boundary of the
|
||
mesh connects to its neighbors over UCIe (die-to-die) links arranged
|
||
on the four cardinal sides — north, south, east, and west. The IO
|
||
chiplets sit on one side of the SIP and provide the bridge to the host
|
||
across PCIe.
|
||
|
||
<!-- src: ADR-0016 Context, Decision -->
|
||
The IO chiplet itself contains its own internal network. A
|
||
host-facing PCIe endpoint passes traffic to a small NOC ("network on
|
||
chip"); from there it can branch to a control-plane CPU that processes
|
||
kernel-launch messages, or it can take the direct memory data path to
|
||
the cube's HBM controller. The decision to provide a direct memory
|
||
path that bypasses the control CPU was a deliberate concession to
|
||
keep host-issued memory writes from paying control-plane overhead on
|
||
the data path.
|
||
|
||
### CUBE
|
||
|
||
<!-- src: ADR-0017 Decision -->
|
||
Each CUBE owns a 2D mesh of NOC routers and a set of attached
|
||
resources: PEs, the cube-local SRAM scratchpad, the management CPU
|
||
(M_CPU), and the HBM partition (split across multiple PE-private
|
||
slices for bandwidth). The router mesh uses deterministic XY routing.
|
||
Attached components do not connect to each other directly — they all
|
||
sit on the router mesh, and every cube-internal transfer pays the
|
||
mesh distance from source to destination.
|
||
|
||
<!-- src: ADR-0017 Decision -->
|
||
The HBM partition is per-PE: each PE owns one HBM slice, and the
|
||
controller exposes per-PE channels so that the same PE always
|
||
addresses the same set of HBM channels. This makes the local-HBM
|
||
bandwidth from a PE to its own slice predictable, while accesses to
|
||
another PE's slice — or a different cube's slice — pay the mesh
|
||
distance and any UCIe crossings.
|
||
|
||
### PE
|
||
|
||
<!-- src: ADR-0014 Context, Decision -->
|
||
A PE is not a monolithic core. Internally it is a set of nine
|
||
sub-components, each modeling one stage of a request's flow: a small
|
||
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
|
||
engine that moves data between the on-PE scratchpad and the register
|
||
file, a GEMM compute engine, a math compute engine, the tightly-
|
||
coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
|
||
physical address translation, and an inter-PE collective queue
|
||
(IPCQ). The scheduler decomposes higher-level operations into per-tile
|
||
stage sequences, and tile tokens self-route from one sub-component
|
||
to the next.
|
||
|
||
<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
|
||
|
||
---
|
||
|
||
## Detailed Architecture
|
||
|
||
This section describes each modeled device-side component in turn.
|
||
Components are listed in the alphabetical order used by the
|
||
simulator's source tree.
|
||
|
||
### forwarding
|
||
|
||
<!-- src: ADR-0037 Context, Decision -->
|
||
The forwarding component is the generic routing relay used wherever a
|
||
node only needs to apply a small processing overhead and pass the
|
||
request to the next hop. NOC routers, conn nodes, and ucie phys all
|
||
reduce to this. Its first act on receiving a request is to apply the
|
||
per-node overhead configured for it in the topology spec; after the
|
||
overhead it simply hands the request to the next hop along the path.
|
||
|
||
<!-- src: ADR-0037 Decision, Consequences -->
|
||
The decision to share one implementation across these roles was made
|
||
to keep the simulator's component set small without sacrificing
|
||
modeling fidelity. Each instance still carries its own overhead and
|
||
its own link bandwidth contributions, so different roles still produce
|
||
different timing. What is shared is the dispatcher loop, not the
|
||
parameter values.
|
||
|
||
### hbm_ctrl
|
||
|
||
<!-- src: ADR-0034 Context, Decision -->
|
||
The HBM controller is the terminal node for all memory traffic that
|
||
reaches HBM. Internally it owns a number of pseudo channels, partitioned
|
||
per-PE so that each PE addresses a deterministic subset. On a request
|
||
arrival the controller first selects the right pseudo channel from the
|
||
target address, then enters a chunk-loop that drains the requested
|
||
size in fixed-size flits over the channel's bandwidth.
|
||
|
||
<!-- src: ADR-0034 Decision, Consequences -->
|
||
The chunk-loop pattern replaces an earlier all-at-once drain. The
|
||
benefit is that the controller no longer presents a flit-aware fabric
|
||
with a single bulk transfer; instead it emits flits at a paced rate
|
||
matching the channel bandwidth, which makes cross-flow contention
|
||
visible. The bandwidth budget is calibrated against the configured
|
||
HBM total bandwidth divided across the channel count.
|
||
|
||
### io_cpu
|
||
|
||
<!-- src: ADR-0036 Context, Decision -->
|
||
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
|
||
It receives kernel-launch messages from the host, decodes them, and
|
||
dispatches per-cube launches to the cube's management CPU. Pure memory
|
||
operations bypass it entirely, taking the direct data path established
|
||
inside the IO chiplet.
|
||
|
||
<!-- src: ADR-0036 Decision -->
|
||
On receiving a kernel-launch message, the IO_CPU consults the message's
|
||
shard list — which already names the target SIP, cube, and PE for each
|
||
piece of the tensor argument — and forwards a per-cube launch to each
|
||
cube the kernel needs to reach. This makes the IO_CPU a deterministic
|
||
fan-out point: it does not decode physical addresses to route, it just
|
||
follows the explicit per-shard targets it was handed.
|
||
|
||
### m_cpu
|
||
|
||
<!-- src: ADR-0035 Context, Decision -->
|
||
The M_CPU is the cube's management processor. It owns two distinct
|
||
roles: as a control-plane fan-out point for kernel launches arriving
|
||
from the IO chiplet, and as a DMA endpoint for host-initiated memory
|
||
writes that need to land in this cube's HBM. The control role
|
||
forwards launches to the right PE control CPUs; the DMA role places
|
||
the actual bytes into HBM through the router mesh.
|
||
|
||
<!-- src: ADR-0035 Decision -->
|
||
The component model deliberately distinguishes the two roles because
|
||
their routing differs: the control fan-out path uses command-kind
|
||
links that do not appear on data-path routes, while the DMA path uses
|
||
the same router mesh as PE-initiated DMA, with PE-internal nodes
|
||
excluded. The routing layer knows about both modes and selects the
|
||
appropriate adjacency at request time.
|
||
|
||
### pcie_ep
|
||
|
||
<!-- src: ADR-0038 Context, Decision -->
|
||
The PCIE endpoint is the protocol boundary at the host-device edge.
|
||
Its first act on each incoming request is to apply a configured
|
||
protocol-processing overhead; after that it simply forwards. There is
|
||
no internal queuing model, no retry, and no TLP-level fidelity — those
|
||
are deliberately outside scope. The endpoint is bidirectional: host →
|
||
device traffic (memory writes, kernel launches) flows one way, and
|
||
device-side outbound traffic (cross-SIP collective sends) flows the
|
||
other.
|
||
|
||
<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
|
||
A more detailed PCIe model was considered and rejected. The simulator
|
||
is targeting system-level latency comparisons; making the endpoint
|
||
heavier with credit-management and retry logic would not improve the
|
||
metrics being studied. The decision keeps the endpoint as the
|
||
documented protocol-boundary node, named consistently so routing
|
||
helpers can locate it by SIP and IO instance.
|
||
|
||
### pe_cpu
|
||
|
||
<!-- src: ADR-0014 Decision -->
|
||
The PE control CPU is the entry point for kernel work arriving from
|
||
the cube's management CPU. It receives kernel-launch messages, resolves
|
||
the kernel function by name, and hands execution to the scheduler with
|
||
the resolved tensor arguments. From the scheduler's point of view, the
|
||
PE_CPU is the upstream source of high-level commands; from the rest
|
||
of the system's point of view, the PE_CPU is where a kernel's
|
||
execution begins on a given PE.
|
||
|
||
### pe_dma
|
||
|
||
<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
|
||
The DMA engine on each PE has two distinct modes. In the standard PE
|
||
pipeline it consumes tile tokens issued by the scheduler, acquires a
|
||
read or write channel (modeled as a one-in-flight resource per
|
||
direction), and runs the bytes to or from HBM through the mesh. In
|
||
its collective mode it forwards send tokens for the cube's IPCQ into
|
||
the fabric, snapshotting the source data at send time so later
|
||
mutations cannot race the receiver's read. Both modes share the same
|
||
channel resources but differ in their downstream handling — one
|
||
returns when the round-trip completes, the other dispatches
|
||
fire-and-forget.
|
||
|
||
### pe_fetch_store
|
||
|
||
<!-- src: ADR-0014 Decision -->
|
||
The fetch-store engine is the bridge between the on-PE scratchpad
|
||
(TCM) and the register file. It does not run DMA; it only moves bytes
|
||
internally. On receiving a tile-stage token it sends a short request
|
||
to the TCM, waits for the bandwidth-serialized delay, and continues
|
||
the pipeline. The split between this engine and the TCM lets the
|
||
scratchpad model its own read/write bandwidth independently.
|
||
|
||
### pe_gemm
|
||
|
||
<!-- src: ADR-0014 Decision -->
|
||
The GEMM engine is the matrix-multiply compute unit. Tile tokens
|
||
arriving at this stage carry the per-tile dimensions, and the engine
|
||
contributes a service time accounting for one fused multiply-add over
|
||
the tile's macs. Composite operations (where the same tensor pair is
|
||
streamed across many tiles) reuse the engine through the scheduler;
|
||
the engine itself is stateless between tiles.
|
||
|
||
### pe_ipcq
|
||
|
||
<!-- src: ADR-0023 Context, Decision -->
|
||
The IPCQ — inter-process communication queue — is each PE's
|
||
collective-communication endpoint. It owns ring buffers that hold
|
||
inbound messages from neighbor PEs and bookkeeping for send credits.
|
||
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
|
||
"global_*" for cross-SIP neighbors) are resolved to physical peer
|
||
endpoints by a neighbor table installed at process-group creation
|
||
time. The component itself does not move bytes — it issues DMA tokens
|
||
through the local PE_DMA, which performs the actual cross-PE
|
||
transfer.
|
||
|
||
<!-- src: ADR-0023 Decision, Consequences -->
|
||
A key invariant is that the inbound terminal — where data lands at
|
||
the receiver — pays the link bandwidth drain plus any cube-internal
|
||
mesh hop to the slot's backing memory. This prevents IPCQ from
|
||
silently outpacing raw DMA at large transfer sizes. Outbound sends
|
||
are fire-and-forget; credit return is the only backpressure signal.
|
||
|
||
### pe_math
|
||
|
||
<!-- src: ADR-0014 Decision -->
|
||
The math engine handles element-wise and reduction operations. It
|
||
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
|
||
`where`, etc.) and contributes a service time proportional to the
|
||
number of elements processed. Like the GEMM engine it is stateless;
|
||
chained epilogues (a sequence of math operations after a GEMM tile)
|
||
are scheduled as separate stages.
|
||
|
||
### pe_mmu
|
||
|
||
<!-- src: ADR-0039 Context, Decision -->
|
||
The MMU has two roles, exposed through one component. As a node on
|
||
the cube NOC it receives MMU-map and MMU-unmap messages and updates
|
||
its internal page table, so that the runtime API can install
|
||
virtual-to-physical mappings with measured fabric latency. As a
|
||
utility object held inside the PE it offers synchronous translate
|
||
calls to the PE's DMA and GEMM engines without taking simulator time
|
||
itself; the calling engine pays any configured TLB overhead in its
|
||
own process.
|
||
|
||
<!-- src: ADR-0039 Decision, Alternatives Considered -->
|
||
The page table supports multiple disjoint regions inside a single
|
||
page, with later-write-wins semantics on overlap. This is a deliberate
|
||
simulator stopgap to support parallelization policies that shard data
|
||
at sub-page granularity without silent mis-routing through a real
|
||
hardware MMU's one-PA-per-entry assumption. A real MMU does not work
|
||
this way; the model documents this as a simplification.
|
||
|
||
### pe_scheduler
|
||
|
||
<!-- src: ADR-0014 Decision -->
|
||
The scheduler is the sole dispatcher inside a PE. Simple commands are
|
||
routed directly to the right engine. Composite commands generate a
|
||
tile plan, and the resulting tile tokens are fed into the pipeline.
|
||
Self-routing keeps the scheduler off the per-stage hot path: each
|
||
engine, on finishing a stage, advances the token to the next stage's
|
||
component itself, so the scheduler only does initial dispatch and
|
||
completion tracking.
|
||
|
||
### pe_tcm
|
||
|
||
<!-- src: ADR-0040 Context, Decision -->
|
||
The TCM is the per-PE tightly-coupled scratchpad memory. It models
|
||
time only, not data — the actual payload lives in the simulator's
|
||
memory store. Read and write are independent channels: each is
|
||
modeled as a one-in-flight resource, so same-direction requests
|
||
serialize but a read and a write can overlap. The bandwidth of each
|
||
direction is configured separately and applied as bytes-over-bandwidth
|
||
on each request.
|
||
|
||
<!-- src: ADR-0040 Decision, Alternatives Considered -->
|
||
The decision to keep read and write on separate channels was made
|
||
because the PE pipeline's normal case overlaps fetch (read) and store
|
||
(write). Collapsing them into a single shared channel would have
|
||
artificially serialized that overlap and produced an incorrect
|
||
bandwidth ceiling.
|
||
|
||
### sram
|
||
|
||
<!-- src: ADR-0041 Context, Decision -->
|
||
The cube SRAM is a per-cube scratchpad attached to one of the cube's
|
||
routers. As a node it applies a configured access overhead, pays the
|
||
link-bandwidth drain stamped on the incoming request, and sends a
|
||
response on the reverse path. It is a terminal — it does not forward.
|
||
|
||
<!-- src: ADR-0041 Decision, Consequences -->
|
||
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
|
||
that an inter-PE collective slot can live in. When the slot lives in
|
||
SRAM, the PE_DMA pays the slot read or write latency directly using
|
||
the configured SRAM bandwidth and overhead; the SRAM component does
|
||
not need to know about collective semantics. This separation keeps
|
||
the SRAM component agnostic to the collective subsystem.
|
||
|
||
### tiling
|
||
|
||
<!-- src: ADR-0042 Context, Decision -->
|
||
The tile-plan generator is not a runtime component — it is a pure
|
||
module of functions that take a problem shape (matrix dimensions, tile
|
||
sizes) and produce an ordered list of tile-stage sequences. The
|
||
scheduler consumes this list. Each tile's stage sequence depends on
|
||
how its operands are staged: operands streamed from HBM produce
|
||
DMA_READ stages, operands already resident in TCM (because they were
|
||
loaded eagerly upfront) skip them.
|
||
|
||
<!-- src: ADR-0042 Decision, Consequences -->
|
||
The plan generator is intentionally pure — given the same input it
|
||
returns the same plan, with no simulator events created. This lets
|
||
the rest of the system reason about tile sequences as data, and it
|
||
makes the plan testable in isolation without simulator state. New
|
||
plan variants (for example, K-major or DTensor-aware plans) can be
|
||
added as new functions following the same shape.
|
||
|
||
---
|
||
|
||
## Implementation Decisions
|
||
|
||
This section collects cross-cutting decisions — algorithms, policies,
|
||
schemes, and contracts — that span multiple components rather than
|
||
living inside one.
|
||
|
||
### Address Scheme
|
||
|
||
<!-- src: ADR-0001 Context, Decision -->
|
||
Every physical address in the simulator decodes into a structured
|
||
location. A fixed-width physical address carries the SIP id, the
|
||
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
|
||
others), and a type-specific offset. HBM addresses additionally encode
|
||
the per-PE slice offset so the controller can determine which PE
|
||
owns the target slice without external lookup. The layout is
|
||
deliberately reserved rather than packed-to-fit, so new sub-units can
|
||
be added at the type-discriminator level without rewriting existing
|
||
addresses.
|
||
|
||
<!-- src: ADR-0011 Context, Decision -->
|
||
On top of physical addressing, the simulator supports three address
|
||
models that the runtime API selects between. Direct physical
|
||
addressing is retained as a fallback. Virtual addressing — the
|
||
current default — gives each tensor a contiguous virtual range at
|
||
deployment, with the per-PE MMU translating per access; an
|
||
alternative logical-address scheme remains a future option. The
|
||
virtual-address path is what every modern test path takes; the PA
|
||
fallback is used by the MMU itself when no mapping exists for an
|
||
address (a deliberate signal, not an error).
|
||
|
||
<!-- src: ADR-0011 Decision, Consequences -->
|
||
Tensor placement is represented as a list of physical-address shards,
|
||
each tagged with target SIP, cube, and PE, plus a single tensor-wide
|
||
virtual base. This means a kernel sees one virtual base for the whole
|
||
tensor while the host driver and the engine still know exactly where
|
||
each shard lives. Replicated tensors get per-cube local PA mappings;
|
||
sharded tensors broadcast their mapping across cubes within a SIP.
|
||
|
||
### Routing, Distance & Helper API
|
||
|
||
<!-- src: ADR-0002 Context, Decision -->
|
||
Routing is policy-driven, deterministic, and topology-aware. Given a
|
||
source, a destination, and an intent — for example, PE-initiated
|
||
DMA versus host-initiated memory write versus a generic
|
||
component-to-component query — the routing layer picks the right
|
||
path. The intent matters because different traffic types must avoid
|
||
different categories of edges: PE-initiated DMA should not traverse
|
||
command-only links; M_CPU DMA should not pass through PE-internal
|
||
pipeline edges; cube-local transfers should not use the
|
||
zero-distance UCIe bus that would otherwise look attractive to a
|
||
shortest-path search.
|
||
|
||
<!-- src: ADR-0051 Decision -->
|
||
The routing layer therefore maintains four separate adjacency graphs
|
||
at construction, each excluding a different category of edges, and
|
||
picks the appropriate one per intent. On top of the graphs sits a
|
||
helper API that hides the topology's naming convention: callers ask
|
||
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
|
||
the HBM destination for a given physical address, and receive the
|
||
corresponding node id. No component constructs node-id strings
|
||
directly; if the naming convention ever changes, the change is local
|
||
to the helper layer.
|
||
|
||
<!-- src: ADR-0051 Decision, Consequences -->
|
||
Path-finding itself uses Dijkstra with explicit per-edge weights
|
||
(routing weight is allowed to differ from physical distance — for
|
||
example, UCIe is configured to be routing-preferable). Tie-breaks
|
||
follow insertion order, which keeps results deterministic. Paths
|
||
between unreachable nodes raise rather than returning empty, surfacing
|
||
topology errors immediately.
|
||
|
||
### Memory Semantics and Local-HBM Bandwidth
|
||
|
||
<!-- src: ADR-0004 Context, Decision -->
|
||
A PE accessing its own HBM slice through its own cube's NOC must see
|
||
the full local HBM bandwidth — that is the model's intent. Memory
|
||
traffic accumulates latency from per-component overhead and
|
||
bytes-over-link-bandwidth serialization along the path, but the
|
||
controller does not throttle below the slice's allotted bandwidth.
|
||
Cross-PE-slice accesses inside the same cube, cross-cube accesses
|
||
through UCIe, and cross-SIP accesses through PCIe each pay
|
||
progressively more overhead as the path grows.
|
||
|
||
### Topology Compilation, Diagrams & Builder Algorithms
|
||
|
||
<!-- src: ADR-0006 Context, Decision -->
|
||
Topology is configurable, not hardcoded. The simulator reads a YAML
|
||
spec, compiles it into a flat graph of nodes and edges plus four
|
||
view projections at different abstraction levels — system, SIP, cube,
|
||
PE — and uses the compiled graph as the single source for both
|
||
execution and visualization. Distance metadata used by routing is
|
||
extracted at compile time so that diagrams and routing decisions
|
||
agree by construction.
|
||
|
||
<!-- src: ADR-0005 Context, Decision -->
|
||
Diagrams are derived artifacts of the compiled topology. The visualizer
|
||
produces one SVG per view at the appropriate abstraction level; nothing
|
||
in the diagrams is hand-drawn or hand-positioned. Distance-aware
|
||
layout rules place nodes in the diagrams using the same coordinates
|
||
that routing uses to compute distance, so a diagram that "looks
|
||
wrong" is a signal that the topology itself has a problem, not the
|
||
visualizer.
|
||
|
||
<!-- src: ADR-0053 Decision -->
|
||
Inside a cube the router mesh is generated automatically. PE corner
|
||
positions are fixed by convention; the relay-column algorithm
|
||
inserts additional grid columns whenever the gap between adjacent PE
|
||
columns would exceed a tunable maximum. HBM occupies a central
|
||
exclusion zone — router slots inside the zone are deliberately empty,
|
||
since HBM controllers attach as separate named nodes. M_CPU and SRAM
|
||
attach to the nearest router by Euclidean distance from their
|
||
configured placement coordinates, and UCIe physical lanes distribute
|
||
along the boundary rows and columns. The whole mesh is cached
|
||
beside the topology spec and invalidated only when one of a small set
|
||
of layout-relevant fields changes.
|
||
|
||
<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
|
||
|
||
### Tensor Deployment and Allocation
|
||
|
||
<!-- src: ADR-0008 Context, Decision -->
|
||
Tensor deployment in the runtime API produces a list of physical-address
|
||
shards plus a single tensor-wide virtual base. The host allocator
|
||
walks the data-parallelism policy, computes per-shard placement, and
|
||
emits the per-shard physical addresses through the per-PE allocators.
|
||
No separate "allocate then later attach to a device" RPC exists —
|
||
allocation and deployment are a single operation that produces a
|
||
deployed tensor handle.
|
||
|
||
### Memory Allocator Algorithms
|
||
|
||
<!-- src: ADR-0048 Context, Decision -->
|
||
Each per-PE allocator owns two channels — HBM slice and TCM — each
|
||
backed by an offset-keyed free-list. Allocation is first-fit; freeing
|
||
coalesces with adjacent free blocks. A device-wide virtual allocator
|
||
sits above the per-PE allocators, aligns requests up to the configured
|
||
page size, and coalesces on free in the same way. The trade-off is
|
||
explicit: first-fit is simpler and cheaper than best-fit or buddy
|
||
allocation, and the simulator's workload is stack-like enough
|
||
(deploy / kernel / free in matched order) that fragmentation is not
|
||
a practical concern.
|
||
|
||
<!-- src: ADR-0048 Decision, Consequences -->
|
||
Allocation failure raises rather than silently returning a partial
|
||
result. A partial tensor reaching the engine would route over wrong
|
||
PAs and silently corrupt simulator output, so an out-of-memory signal
|
||
is preferred. The free path trusts its caller to pass back exactly
|
||
what was allocated; the small risk of caller error in exchange for
|
||
fast common-case freeing is documented as a deliberate trade.
|
||
|
||
### Kernel Execution and Host-Device Messaging
|
||
|
||
<!-- src: ADR-0009 Context, Decision -->
|
||
Kernel execution decomposes into a small set of messages that travel
|
||
the device graph. The host issues a single kernel-launch message; the
|
||
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
|
||
PE CPU resolves the kernel and runs it through the scheduler.
|
||
Completion flows back the same way, gated by per-shard completion
|
||
tracking. Memory operations follow the same pattern: a memory write
|
||
or read travels as one message that the engine routes to the right
|
||
HBM controller, with a response taking the reverse path.
|
||
|
||
<!-- src: ADR-0012 Context, Decision -->
|
||
The schema between the host and the device-side IO CPU is PA-first
|
||
and shard-tagged. Every byte of host-issued payload arrives with an
|
||
explicit target SIP, cube, PE, and physical address. The IO_CPU does
|
||
not decode addresses to derive placement — placement is named
|
||
explicitly by the shard list. This makes the host-device interface
|
||
deterministic and keeps the routing helper free of host-derived
|
||
intent.
|
||
|
||
### CLI Surface and Semantics
|
||
|
||
<!-- src: ADR-0010 Context, Decision -->
|
||
The command-line interface exposes four subcommands. A bench runner
|
||
loads a topology, resolves a registered benchmark by name or index,
|
||
and runs it on a selected device. A bench-listing command enumerates
|
||
the registered benchmarks. A probe utility runs a fixed catalog of
|
||
traffic patterns through the engine for latency and bandwidth
|
||
verification. A web viewer renders the topology in a browser. A
|
||
benchmark instance is always single-device by convention; multi-SIP
|
||
collective work happens inside the benchmark through the launcher
|
||
abstraction, not by multiplexing the CLI.
|
||
|
||
### Component Port and Wire Fabric Model
|
||
|
||
<!-- src: ADR-0015 Context, Decision -->
|
||
Every modeled component exposes input and output ports, and every
|
||
edge in the topology connects an output port on one component to an
|
||
input port on another. Bandwidth and propagation delay are properties
|
||
of the wire between ports, not of the component endpoints. A
|
||
component's responsibility is to apply its configured per-node
|
||
overhead and either forward to the next hop or terminate; the wire
|
||
charges the byte-over-bandwidth serialization separately.
|
||
|
||
<!-- src: ADR-0015 Decision, Consequences -->
|
||
This separation lets components be swapped behind their port
|
||
interface without changing the rest of the model, and it keeps
|
||
bandwidth contention at the wire level where multiple components may
|
||
contend for the same edge. Future component models can refine
|
||
internal behavior without disturbing the fabric.
|
||
|
||
### Two-Pass Data Execution
|
||
|
||
<!-- src: ADR-0020 Context, Decision -->
|
||
The simulator runs in two passes. The first pass — fast and always
|
||
on — runs the discrete-event engine and records every data operation
|
||
in an operation log with timestamps, component identifiers, and per-
|
||
operation parameters. The second pass — optional, opt-in — replays
|
||
the log against an in-memory tensor store to produce actual numerical
|
||
results. Tests that only need timing skip the second pass; tests that
|
||
need to verify correctness opt in.
|
||
|
||
<!-- src: ADR-0020 Decision, Consequences -->
|
||
The split lets the timing engine remain unconcerned with data
|
||
semantics: kernels move handles around, not bytes. The replay phase
|
||
recovers data semantics from the recorded operations, in their
|
||
original time order with a small set of secondary-sort rules. The
|
||
op-log records carry enough metadata — input snapshots for compute
|
||
operations, source snapshots for cross-component copies — that the
|
||
replay phase cannot mis-order with respect to in-flight mutations.
|
||
|
||
### Sim-engine Op Log and Memory Store Schemas
|
||
|
||
<!-- src: ADR-0052 Context, Decision -->
|
||
The operation log holds typed records with seven fields each: start
|
||
and end timestamps, the component that issued the operation, an
|
||
operation kind ("memory", "gemm", "math"), an operation name, a
|
||
parameter dictionary, and a (currently unused) dependency list.
|
||
Records are kept in stable timestamp order. The parameter dictionary
|
||
varies by operation: a DMA read carries source address and byte count;
|
||
a GEMM carries operand shapes, dtypes, and address spaces; a math
|
||
operation carries input addresses and snapshots.
|
||
|
||
<!-- src: ADR-0052 Decision, Consequences -->
|
||
The companion memory store is a two-level dictionary keyed by
|
||
address space ("hbm", "tcm", "sram", others) and integer address.
|
||
Reads and writes are reference-based — no copy by default — so
|
||
callers wanting to detach a snapshot must copy explicitly. This is
|
||
deliberate: the engine-internal snapshot paths copy at well-defined
|
||
points (math input capture, HBM source capture for DMA writes,
|
||
inbound collective copies) and downstream replay code therefore
|
||
sees stable data even when slot or scratch addresses are reused by
|
||
later operations.
|
||
|
||
### 2D Grid Program Identity
|
||
|
||
<!-- src: ADR-0022 Context, Decision -->
|
||
Inside a kernel the program identity is two-dimensional. The
|
||
first axis corresponds to the PE index within a cube; the second
|
||
corresponds to the cube index within a SIP. Together they let a
|
||
kernel address its position both within its cube and within the
|
||
larger system without needing to know the full topology. Total
|
||
program counts along each axis are exposed symmetrically.
|
||
|
||
### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
|
||
|
||
<!-- src: ADR-0024 Context, Decision -->
|
||
The launcher model treats each SIP as one rank. Inside a process the
|
||
launcher spawns one greenlet per SIP rank; the rank is bound to its
|
||
greenlet so that any code running in that worker sees the right
|
||
distributed-style rank. This is a deliberately PyTorch-compatible
|
||
shape: a benchmark looks like a small DDP training script — initialize
|
||
a process group, spawn workers, each worker runs the same body.
|
||
|
||
<!-- src: ADR-0026 Context, Decision -->
|
||
Data-parallelism policy lives in a single object that names the
|
||
sharding strategy along the cube axis (replicate, row-wise,
|
||
column-wise) and along the PE axis (same set of values), and optionally
|
||
overrides the number of cubes or PEs participating. The policy is
|
||
intra-device — it does not cross SIP boundaries. SIP-level parallelism
|
||
is the launcher's responsibility, and the two axes compose
|
||
orthogonally.
|
||
|
||
<!-- src: ADR-0027 Context, Decision -->
|
||
A Megatron-style tensor-parallel API sits on top of the launcher and
|
||
the DP policy. Layer-level building blocks — column-parallel linear,
|
||
row-parallel linear, all-reduce — name their sharding intent in terms
|
||
the launcher and the placement policy can compose. This is the layer
|
||
that bench code typically writes against.
|
||
|
||
<!-- src: ADR-0047 Context, Decision -->
|
||
For collective operations the runtime exposes a PyTorch-compatible
|
||
distributed backend named "ahbm". On process-group initialization the
|
||
backend loads the configured collective-algorithm module, resolves
|
||
the world size (priority: explicit ccl.yaml override → defaults
|
||
section → topology SIP count), imports the algorithm module
|
||
dynamically, derives the SIP topology kind, and pushes the inter-PE
|
||
neighbor table to every participating PE. From that point on, an
|
||
all-reduce call dispatches the algorithm's kernel function across
|
||
all ranks.
|
||
|
||
<!-- src: ADR-0050 Context, Decision -->
|
||
A collective-algorithm module is a Python module with a small, fixed
|
||
contract. It exposes topology-kind integer constants, a name-to-kind
|
||
mapping for the YAML configuration, a kernel-arguments builder, and
|
||
a kernel function — the kernel function being aliased to the name
|
||
`kernel` so the backend can find it generically. The kernel itself
|
||
takes the tensor pointer, the per-cube element count, cube mesh
|
||
width and height, the world size, the current rank, and the SIP
|
||
topology dimensions; the backend appends those last four arguments
|
||
automatically. New collectives slot in by adding a new module that
|
||
follows this shape.
|
||
|
||
<!-- src: ADR-0027 Decision, Consequences -->
|
||
The combination is deliberate: bench authors get to write code that
|
||
looks like a regular distributed training script, while the launcher,
|
||
backend, and placement policies behind it remain free to redirect
|
||
work to the right SIP, cube, and PE without exposing topology to the
|
||
kernel.
|
||
|
||
### IPCQ Direction Addressing
|
||
|
||
<!-- src: ADR-0025 Context, Decision -->
|
||
Inside a collective algorithm, peer PEs are named by direction —
|
||
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
|
||
cross-SIP neighbors. Direction addressing is the addressing scheme:
|
||
the algorithm names a direction, the IPCQ neighbor table installed
|
||
at process-group time resolves the direction to the peer endpoint's
|
||
physical-address coordinates, and the PE_DMA performs the actual
|
||
transfer. The algorithm itself does not see PA arithmetic — direction
|
||
is the user-facing handle.
|
||
|
||
### Intercube All-Reduce
|
||
|
||
<!-- src: ADR-0032 Context, Decision -->
|
||
The default all-reduce algorithm uses a center-rooted bidirectional
|
||
phase inside each SIP's cube mesh followed by an inter-SIP exchange
|
||
on the mesh's root cube, and then a bidirectional broadcast back
|
||
out. Center-rooting halves the in-cube hop count compared with a
|
||
corner-rooted walk. The inter-SIP exchange itself follows the
|
||
configured SIP topology — ring, torus, or non-wrapping mesh —
|
||
selected at runtime through the SIP-topology kind integer the
|
||
backend passes to the kernel.
|
||
|
||
### Evaluation Harnesses
|
||
|
||
<!-- src: ADR-0043 Context, Decision -->
|
||
The all-reduce evaluation harness drives correctness and the
|
||
latency/buffer-kind sweeps through the public distributed path —
|
||
initialize process group, spawn workers, call all-reduce — rather
|
||
than the lower-level engine interface. A shared helper module factors
|
||
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
|
||
HBM) and the inter-SIP topology variants. The plots produced by the
|
||
harness are part of its output contract; the harness regenerates them
|
||
on demand.
|
||
|
||
<!-- src: ADR-0044 Context, Decision -->
|
||
The GEMM evaluation harness is split into two layers. A heavy
|
||
shape-and-variant sweep lives as a manual script — it runs the same
|
||
composite-GEMM benchmark across many shapes and operand-staging
|
||
variants, harvests the resulting op-log, and writes a JSON summary.
|
||
A faster figure-generation layer lives in the test suite and consumes
|
||
that JSON to render plots. The split keeps the heavy data
|
||
generation explicit and out of the regular test path.
|
||
|
||
### Bench Module Contract
|
||
|
||
<!-- src: ADR-0045 Context, Decision -->
|
||
Adding a new benchmark requires only dropping a file into the
|
||
benchmarks directory. The file registers one or more benchmark
|
||
functions through a small decorator that takes a kebab-case name and
|
||
a human-readable description. The decorator is the registration
|
||
mechanism — there is no separate manifest. Each benchmark function
|
||
takes one argument, conventionally named `torch`, which is the
|
||
runtime context exposing tensor allocation, kernel launch,
|
||
distributed APIs, and process-spawning. The function name is `run` by
|
||
convention.
|
||
|
||
<!-- src: ADR-0045 Decision, Consequences -->
|
||
A benchmark must submit at least one operation, or the runner
|
||
returns an error. A benchmark instance is single-device by default;
|
||
when a benchmark is collective, it uses the distributed-process-spawn
|
||
pattern internally — one worker greenlet per rank, with each worker
|
||
binding to its rank. Multi-device benchmark patterns outside that
|
||
shape are not supported.
|
||
|
||
### Kernel-side `tl.*` API
|
||
|
||
<!-- src: ADR-0046 Context, Decision -->
|
||
Inside a kernel function, the `tl` argument exposes the kernel-side
|
||
API in a shape that mirrors the conventions of established
|
||
GPU-kernel languages. Categories: reference handles that name HBM
|
||
data without issuing DMA; data movement (load, store) that does
|
||
issue DMA; GEMM and math compute (dot, composite, the unary and
|
||
binary math operations, reductions); index and scalar helpers
|
||
(program identity, range-builders); metadata-only operations like
|
||
transpose; and the collective primitives (send, receive,
|
||
non-blocking receive). Tensor handles support arithmetic operators
|
||
via a thread-local active context so kernel code reads naturally.
|
||
|
||
<!-- src: ADR-0046 Decision, Consequences -->
|
||
The API supports two execution modes. A command-list mode records
|
||
operations into a list without consuming simulator time — useful for
|
||
inspection and lightweight tests. A greenlet-driven mode runs the
|
||
kernel as a child greenlet that switches back to the simulator on
|
||
each `tl.*` call; the simulator drives the event scheduler and hands
|
||
real data back to the kernel as DMA reads complete. The two modes
|
||
share the same surface; the kernel does not know which one it is
|
||
running under.
|
||
|
||
### Probe Subcommand
|
||
|
||
<!-- src: ADR-0049 Context, Decision -->
|
||
The probe utility runs three families of traffic patterns through
|
||
the engine — host-to-device writes at increasing hop counts,
|
||
device-to-host reads at increasing hop counts, and PE-initiated DMA
|
||
across the cube mesh — and reports actual latency, the analytical
|
||
formula breakdown, effective bandwidth, bottleneck bandwidth, and
|
||
utilization. A fixed reference size is used for the summary table;
|
||
a separate utilization-versus-size sweep covers a logarithmic range
|
||
of transfer sizes. Each case runs in its own engine instance so
|
||
cases do not perturb each other.
|
||
|
||
<!-- src: ADR-0049 Decision, Consequences -->
|
||
The probe also checks a small set of invariants automatically:
|
||
monotonic latency increase with hop count, device-to-host latency
|
||
at least as large as host-to-device for the same hop count, and a
|
||
faster best-case path than worst-case for cross-cube PE DMA. Failures
|
||
print prominently. The output is meant for human reading; automated
|
||
parsing should not depend on column widths or whitespace.
|
||
|
||
---
|
||
|
||
This document summarizes 46 architecture decisions captured during
|
||
the first half of 2026. It is regenerated mechanically from the
|
||
decision corpus; sources are recorded in HTML comments throughout.
|