
# KernBench — Architecture Design Document
*2026 1H*

KernBench is a system-level, discrete-event simulator for AI-accelerator
chiplet systems. It models the data-movement and control paths across
the full hardware hierarchy and reports end-to-end execution latency
for kernels dispatched to the device's compute units.

This document is a public summary of the architecture as designed and
implemented in the first half of 2026. It assumes no prior knowledge of
the simulator's internal documents; terms specific to the system are
defined on first use.

---

## Design Principles

KernBench is grounded in two foundational commitments: every measured
latency must trace to explicit, modeled events on the simulator's graph,
and every behavioral claim must be verifiable through tests that target
spec-level invariants rather than incidental implementation details.

<!-- src: ADR-0013 Context, Decision -->
The testing approach is verification-driven. Tests are written to
validate the architectural contracts that the simulator exposes —
correct routing, deterministic results, monotonic latency under
increasing hop counts — rather than to mirror the call graph of the
implementation. Two phases coexist: a fast timing phase that exercises
the simulator's discrete-event engine and produces a log of operations
with timestamps, and an optional data-replay phase that uses that log
to compute real numerical results. Tests can target either phase.

<!-- src: ADR-0033 Context, Decision -->
The latency model is intentionally abstract rather than
cycle-accurate. Each modeled node contributes a configurable per-node
overhead, each link contributes wire delay plus byte-over-bandwidth
serialization, and each terminal service contributes its own service
time. The simulator does not attempt to reproduce cache coherence
protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
correctness; those are explicitly out of scope. The aim is a
simulator that compares system-level configurations meaningfully and
deterministically, not one that reproduces microarchitectural detail.
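
As a rough illustration of how these contributions could add up along a path, the sketch below sums them for a single request. The `Hop` structure, the `path_latency_ns` function, and the choice of nanoseconds and bytes-per-nanosecond as units are assumptions made for this example; they are not taken from the simulator's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One traversed node and the link leading into it (hypothetical structure)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # propagation delay of the inbound link
    bandwidth_bytes_per_ns: float  # link bandwidth; 1 byte/ns equals 1 GB/s


def path_latency_ns(hops, nbytes, service_time_ns):
    """Add up overhead, wire delay, and serialization per hop, plus terminal service time."""
    total = 0.0
    for hop in hops:
        serialization_ns = nbytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
    return total + service_time_ns


# Two hops carrying a 256-byte request to a terminal with a 10 ns service time.
hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 32.0)]
latency = path_latency_ns(hops, nbytes=256, service_time_ns=10.0)
print(latency)  # 28.0
```

Every term in the sum corresponds to one of the three contribution types named above, so halving a link's bandwidth or adding a hop changes the result in a way that can be traced to a specific component or link.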

<!-- src: ADR-0033 Decision, Consequences -->
Determinism is a hard requirement. Given identical inputs — topology,
routing policy, and request stream — the simulator must produce
identical outputs, hop traces included. This rules out reliance on
unordered set iteration on the critical path and forces every latency
contribution to come from an explicitly scheduled event on a modeled
component or link. There are no implicit waits, no hardcoded magic
delays, and no shortcuts that bypass the modeled graph.

---

## High-level Architecture

<!-- src: ADR-0003 Context, Decision -->
The simulated system is a four-level hierarchy. A **Tray** holds one or
more **SIPs** (system-in-package), each containing a 2D mesh of
**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
host. Each CUBE contains a regular grid of **PEs** (processing
elements) plus its own attached resources — high-bandwidth memory
(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
itself is a composite of nine sub-components rather than a monolithic
core. This hierarchy is fixed; the parameters along each axis (counts,
mesh dimensions, link widths) are configurable through the topology
spec.

<!-- src: ADR-0007 Context, Decision -->
A clean separation runs along the request flow. A **runtime API** at
the top is the host-facing surface; it exposes tensor and kernel
operations, owns host-side allocation metadata, and is
topology-agnostic: it does not route or fan out. Below it, the **simulation
engine** decomposes runtime operations into discrete graph requests
(memory writes, memory reads, kernel launches, MMU map installs) and
schedules events deterministically. At the bottom, **components** model
device behavior on a graph of nodes connected by links; they
implement the actual latency contributions and pass requests along.
No component reaches up into the runtime API, and no runtime call
shortcuts the engine.

<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->

### Tray

<!-- src: ADR-0003 Decision -->
The Tray is the outermost boundary. It owns the host CPU on one side
and one or more SIPs on the other, connected through a fabric switch.
For collective communication that must traverse multiple SIPs, the
fabric switch acts as the common rendezvous: device-side outbound
traffic from one SIP routes through the switch and back into the
target SIP's IO chiplet.

### SIP

<!-- src: ADR-0003 Decision, ADR-0017 Context -->
A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
default topology used by the simulator is a 4×4 cube mesh; the
mesh dimensions are configurable. Each cube connects to its mesh
neighbors over UCIe (die-to-die) links arranged on its four cardinal
sides: north, south, east, and west. The IO
chiplets sit on one side of the SIP and provide the bridge to the host
across PCIe.

<!-- src: ADR-0016 Context, Decision -->
The IO chiplet itself contains its own internal network. A
host-facing PCIe endpoint passes traffic to a small NOC ("network on
chip"); from there it can branch to a control-plane CPU that processes
kernel-launch messages, or it can take the direct memory data path to
the cube's HBM controller. The decision to provide a direct memory
path that bypasses the control CPU was a deliberate concession to
keep host-issued memory writes from paying control-plane overhead on
the data path.

### CUBE

<!-- src: ADR-0017 Decision -->
Each CUBE owns a 2D mesh of NOC routers and a set of attached
resources: PEs, the cube-local SRAM scratchpad, the management CPU
(M_CPU), and the HBM partition (split across multiple PE-private
slices for bandwidth). The router mesh uses deterministic XY routing.
Attached components do not connect to each other directly — they all
sit on the router mesh, and every cube-internal transfer pays the
mesh distance from source to destination.
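
A minimal sketch of deterministic XY routing on a router mesh follows. It assumes routers are addressed by `(x, y)` grid coordinates and that the X dimension is traversed first; the function name `xy_route` and the coordinate convention are illustrative and not taken from the simulator.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the path depends only on the two endpoints, the hop count between any pair of attached components equals their Manhattan distance on the mesh, and repeated runs always yield the same trace.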

<!-- src: ADR-0017 Decision -->
The HBM partition is per-PE: each PE owns one HBM slice, and the
controller exposes per-PE channels so that the same PE always
addresses the same set of HBM channels. This makes the local-HBM
bandwidth from a PE to its own slice predictable, while accesses to
another PE's slice — or a different cube's slice — pay the mesh
distance and any UCIe crossings.

### PE

<!-- src: ADR-0014 Context, Decision -->
A PE is not a monolithic core. Internally it is a set of nine
sub-components, each modeling one stage of a request's flow: a small
control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
engine that moves data between the on-PE scratchpad and the register
file, a GEMM compute engine, a math compute engine, the
tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
virtual-to-physical address translation, and an inter-PE collective queue
(IPCQ). The scheduler decomposes higher-level operations into per-tile
stage sequences, and tile tokens self-route from one sub-component
to the next.

<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->

---

## Detailed Architecture

This section describes each modeled device-side component in turn.
Components are listed in the alphabetical order used by the
simulator's source tree.

### forwarding

<!-- src: ADR-0037 Context, Decision -->
The forwarding component is the generic routing relay used wherever a
node only needs to apply a small processing overhead and pass the
request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
reduce to this. Its first act on receiving a request is to apply the
per-node overhead configured for it in the topology spec; after the
overhead it simply hands the request to the next hop along the path.

<!-- src: ADR-0037 Decision, Consequences -->
The decision to share one implementation across these roles was made
to keep the simulator's component set small without sacrificing
modeling fidelity. Each instance still carries its own overhead and
its own link bandwidth contributions, so different roles still produce
different timing. What is shared is the dispatcher loop, not the
parameter values.

### hbm_ctrl

<!-- src: ADR-0034 Context, Decision -->
The HBM controller is the terminal node for all memory traffic that
reaches HBM. Internally it owns a number of pseudo channels, partitioned
per-PE so that each PE addresses a deterministic subset. When a request
arrives, the controller first selects the pseudo channel from the
target address, then enters a chunk-loop that drains the requested
size in fixed-size flits at the channel's bandwidth.

<!-- src: ADR-0034 Decision, Consequences -->
The chunk-loop pattern replaces an earlier all-at-once drain. The
benefit is that the controller no longer presents a flit-aware fabric
with a single bulk transfer; instead it emits flits at a paced rate
matching the channel bandwidth, which makes cross-flow contention
visible. The bandwidth budget is calibrated against the configured
HBM total bandwidth divided across the channel count.
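
The paced drain can be sketched as a loop that emits one flit at a time and advances time by that flit's serialization delay. The function name, the parameter names, and the even split of total bandwidth across channels are assumptions made for illustration, not the controller's actual code.

```python
def hbm_drain_schedule(nbytes, flit_bytes, total_bw_bytes_per_ns, num_channels, start_ns=0.0):
    """Return (completion_time_ns, flit_size) pairs for a flit-by-flit drain on one pseudo channel."""
    channel_bw = total_bw_bytes_per_ns / num_channels  # per-channel share of the HBM bandwidth
    schedule = []
    now = start_ns
    remaining = nbytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # the last flit may be partial
        now += chunk / channel_bw
        schedule.append((now, chunk))
        remaining -= chunk
    return schedule


# 160 bytes in 64-byte flits over one of 8 channels sharing 256 bytes/ns.
print(hbm_drain_schedule(160, 64, 256.0, 8))  # [(2.0, 64), (4.0, 64), (5.0, 32)]
```

In an event-driven engine each loop iteration would yield control between flits, which is what lets flits from competing flows interleave instead of one request occupying the channel for its whole duration.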

### io_cpu

<!-- src: ADR-0036 Context, Decision -->
The IO_CPU is the control-plane processor sitting inside the IO chiplet.
It receives kernel-launch messages from the host, decodes them, and
dispatches per-cube launches to the cube's management CPU. Pure memory
operations bypass it entirely, taking the direct data path established
inside the IO chiplet.

<!-- src: ADR-0036 Decision -->
On receiving a kernel-launch message, the IO_CPU consults the message's
shard list — which already names the target SIP, cube, and PE for each
piece of the tensor argument — and forwards a per-cube launch to each
cube the kernel needs to reach. This makes the IO_CPU a deterministic
fan-out point: it does not decode physical addresses to route, it just
follows the explicit per-shard targets it was handed.
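
The fan-out described above amounts to grouping shards by their named target. The sketch below assumes a hypothetical `Shard` record with `sip`, `cube`, `pe`, `pa`, and `nbytes` fields; the real message schema may differ.

```python
from collections import namedtuple

# Hypothetical shard record; field names are assumptions for this example.
Shard = namedtuple("Shard", "sip cube pe pa nbytes")


def fan_out_per_cube(shards):
    """Group shards by (sip, cube), keeping first-seen order so the fan-out is deterministic."""
    launches = {}
    for shard in shards:
        launches.setdefault((shard.sip, shard.cube), []).append(shard)
    return launches


shards = [
    Shard(0, 1, 0, 0x1000, 256),
    Shard(0, 0, 2, 0x2000, 256),
    Shard(0, 1, 3, 0x3000, 256),
]
launches = fan_out_per_cube(shards)
print(list(launches))  # [(0, 1), (0, 0)]
```

No address decoding takes place: the target cube comes straight from the shard tags, and Python dictionaries preserve insertion order, so the launch order follows the shard list.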

### m_cpu

<!-- src: ADR-0035 Context, Decision -->
The M_CPU is the cube's management processor. It has two distinct
roles: as a control-plane fan-out point for kernel launches arriving
from the IO chiplet, and as a DMA endpoint for host-initiated memory
writes that need to land in this cube's HBM. The control role
forwards launches to the right PE control CPUs; the DMA role places
the actual bytes into HBM through the router mesh.

<!-- src: ADR-0035 Decision -->
The component model deliberately distinguishes the two roles because
their routing differs: the control fan-out path uses command-kind
links that do not appear on data-path routes, while the DMA path uses
the same router mesh as PE-initiated DMA, with PE-internal nodes
excluded. The routing layer knows about both modes and selects the
appropriate adjacency at request time.

### pcie_ep

<!-- src: ADR-0038 Context, Decision -->
The PCIe endpoint is the protocol boundary at the host-device edge.
Its first act on each incoming request is to apply a configured
protocol-processing overhead; after that it simply forwards. There is
no internal queuing model, no retry, and no TLP-level fidelity — those
are deliberately outside scope. The endpoint is bidirectional: host →
device traffic (memory writes, kernel launches) flows one way, and
device-side outbound traffic (cross-SIP collective sends) flows the
other.

<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
A more detailed PCIe model was considered and rejected. The simulator
is targeting system-level latency comparisons; making the endpoint
heavier with credit-management and retry logic would not improve the
metrics being studied. The decision keeps the endpoint as the
documented protocol-boundary node, named consistently so routing
helpers can locate it by SIP and IO instance.

### pe_cpu

<!-- src: ADR-0014 Decision -->
The PE control CPU is the entry point for kernel work arriving from
the cube's management CPU. It receives kernel-launch messages, resolves
the kernel function by name, and hands execution to the scheduler with
the resolved tensor arguments. From the scheduler's point of view, the
PE_CPU is the upstream source of high-level commands; from the rest
of the system's point of view, the PE_CPU is where a kernel's
execution begins on a given PE.

### pe_dma

<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
The DMA engine on each PE has two distinct modes. In the standard PE
pipeline it consumes tile tokens issued by the scheduler, acquires a
read or write channel (modeled as a one-in-flight resource per
direction), and runs the bytes to or from HBM through the mesh. In
its collective mode it forwards send tokens for the PE's IPCQ into
the fabric, snapshotting the source data at send time so later
mutations cannot race the receiver's read. Both modes share the same
channel resources but differ in their downstream handling — one
returns when the round-trip completes, the other dispatches
fire-and-forget.

### pe_fetch_store

<!-- src: ADR-0014 Decision -->
The fetch-store engine is the bridge between the on-PE scratchpad
(TCM) and the register file. It does not run DMA; it only moves bytes
internally. On receiving a tile-stage token it sends a short request
to the TCM, waits for the bandwidth-serialized delay, and continues
the pipeline. The split between this engine and the TCM lets the
scratchpad model its own read/write bandwidth independently.

### pe_gemm

<!-- src: ADR-0014 Decision -->
The GEMM engine is the matrix-multiply compute unit. Tile tokens
arriving at this stage carry the per-tile dimensions, and the engine
contributes a service time that charges one fused multiply-add per
multiply-accumulate (MAC) in the tile. Composite operations (where the same tensor pair is
streamed across many tiles) reuse the engine through the scheduler;
the engine itself is stateless between tiles.

### pe_ipcq

<!-- src: ADR-0023 Context, Decision -->
The IPCQ (inter-PE collective queue) is each PE's
collective-communication endpoint. It owns ring buffers that hold
inbound messages from neighbor PEs and bookkeeping for send credits.
Direction names ("N", "S", "E", "W" for cube-internal neighbors and
"global_*" for cross-SIP neighbors) are resolved to physical peer
endpoints by a neighbor table installed at process-group creation
time. The component itself does not move bytes — it issues DMA tokens
through the local PE_DMA, which performs the actual cross-PE
transfer.

<!-- src: ADR-0023 Decision, Consequences -->
A key invariant is that the inbound terminal — where data lands at
the receiver — pays the link bandwidth drain plus any cube-internal
mesh hop to the slot's backing memory. This prevents IPCQ from
silently outpacing raw DMA at large transfer sizes. Outbound sends
are fire-and-forget; credit return is the only backpressure signal.

### pe_math

<!-- src: ADR-0014 Decision -->
The math engine handles element-wise and reduction operations. It
consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
`where`, etc.) and contributes a service time proportional to the
number of elements processed. Like the GEMM engine it is stateless;
chained epilogues (a sequence of math operations after a GEMM tile)
are scheduled as separate stages.

### pe_mmu

<!-- src: ADR-0039 Context, Decision -->
The MMU has two roles, exposed through one component. As a node on
the cube NOC it receives MMU-map and MMU-unmap messages and updates
its internal page table, so that the runtime API can install
virtual-to-physical mappings with measured fabric latency. As a
utility object held inside the PE it offers synchronous translate
calls to the PE's DMA and GEMM engines without taking simulator time
itself; the calling engine pays any configured TLB overhead in its
own process.

<!-- src: ADR-0039 Decision, Alternatives Considered -->
The page table supports multiple disjoint regions inside a single
page, with later-write-wins semantics on overlap. This is a deliberate
simulator stopgap: it lets parallelization policies shard data at
sub-page granularity, which a one-PA-per-entry page table would
silently mis-route. A real hardware MMU does not work this way, and
the model documents this as a simplification.
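
A small sketch of region-granular translation with later-write-wins semantics is shown below. The `PageTable` class, its method names, and the identity fallback for unmapped addresses are assumptions for illustration; the fallback reflects one reading of the PA-fallback behavior, not confirmed code.

```python
class PageTable:
    """Region-granular page table sketch: regions are kept in install order."""

    def __init__(self):
        self._regions = []  # (va_start, va_end, pa_start)

    def map(self, va, size, pa):
        """Install a mapping; it may overlap earlier regions."""
        self._regions.append((va, va + size, pa))

    def translate(self, va):
        """Return the PA for va, preferring the most recently installed region."""
        for va_start, va_end, pa_start in reversed(self._regions):
            if va_start <= va < va_end:
                return pa_start + (va - va_start)
        return va  # assumption: no mapping means the address is used as a PA


table = PageTable()
table.map(0x1000, 0x100, 0x8000)  # two disjoint regions inside one 4 KiB page
table.map(0x1100, 0x100, 0x9000)
table.map(0x1080, 0x100, 0xA000)  # overlaps both; the later write wins

print(hex(table.translate(0x1010)))  # 0x8010
print(hex(table.translate(0x1090)))  # 0xa010
```

Scanning regions newest-first is the simplest way to get later-write-wins behavior; a production structure would more likely split or trim overlapped regions at install time.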

### pe_scheduler

<!-- src: ADR-0014 Decision -->
The scheduler is the sole dispatcher inside a PE. Simple commands are
routed directly to the right engine. Composite commands generate a
tile plan, and the resulting tile tokens are fed into the pipeline.
Self-routing keeps the scheduler off the per-stage hot path: each
engine, on finishing a stage, advances the token to the next stage's
component itself, so the scheduler only does initial dispatch and
completion tracking.

### pe_tcm

<!-- src: ADR-0040 Context, Decision -->
The TCM is the per-PE tightly-coupled scratchpad memory. It models
time only, not data — the actual payload lives in the simulator's
memory store. Read and write are independent channels: each is
modeled as a one-in-flight resource, so same-direction requests
serialize but a read and a write can overlap. The bandwidth of each
direction is configured separately and applied as bytes-over-bandwidth
on each request.

<!-- src: ADR-0040 Decision, Alternatives Considered -->
The decision to keep read and write on separate channels was made
because the PE pipeline's normal case overlaps fetch (read) and store
(write). Collapsing them into a single shared channel would have
artificially serialized that overlap and produced an incorrect
bandwidth ceiling.
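
The effect of independent read and write channels can be sketched with a per-direction "free at" timestamp. The `TcmTiming` class and its units (nanoseconds, bytes per nanosecond) are hypothetical; the sketch computes completion times analytically rather than through the event engine.

```python
class TcmTiming:
    """Time-only TCM sketch: one in-flight request per direction, no data stored."""

    def __init__(self, read_bw, write_bw):
        self._bw = {"read": read_bw, "write": write_bw}  # bytes per ns
        self._free_at = {"read": 0.0, "write": 0.0}      # when each channel is next idle

    def access(self, direction, nbytes, arrival_ns):
        """Return the completion time of a request arriving at arrival_ns."""
        start = max(arrival_ns, self._free_at[direction])  # wait for same-direction traffic
        done = start + nbytes / self._bw[direction]
        self._free_at[direction] = done
        return done


tcm = TcmTiming(read_bw=64.0, write_bw=32.0)
first_read = tcm.access("read", 128, arrival_ns=0.0)    # 2.0
second_read = tcm.access("read", 128, arrival_ns=0.0)   # 4.0, queued behind the first read
write = tcm.access("write", 128, arrival_ns=0.0)        # 4.0, overlaps with both reads
```

With a single shared channel the write in this example could not start before 4.0 ns and would finish at 8.0 ns, which is the artificial serialization the separate-channel design avoids.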

### sram

<!-- src: ADR-0041 Context, Decision -->
The cube SRAM is a per-cube scratchpad attached to one of the cube's
routers. As a node it applies a configured access overhead, pays the
link-bandwidth drain stamped on the incoming request, and sends a
response on the reverse path. It is a terminal — it does not forward.

<!-- src: ADR-0041 Decision, Consequences -->
A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
that an inter-PE collective slot can live in. When the slot lives in
SRAM, the PE_DMA pays the slot read or write latency directly using
the configured SRAM bandwidth and overhead; the SRAM component does
not need to know about collective semantics. This separation keeps
the SRAM component agnostic to the collective subsystem.

### tiling

<!-- src: ADR-0042 Context, Decision -->
The tile-plan generator is not a runtime component — it is a pure
module of functions that take a problem shape (matrix dimensions, tile
sizes) and produce an ordered list of tile-stage sequences. The
scheduler consumes this list. Each tile's stage sequence depends on
how its operands are staged: operands streamed from HBM produce
DMA_READ stages, operands already resident in TCM (because they were
loaded eagerly upfront) skip them.

<!-- src: ADR-0042 Decision, Consequences -->
The plan generator is intentionally pure — given the same input it
returns the same plan, with no simulator events created. This lets
the rest of the system reason about tile sequences as data, and it
makes the plan testable in isolation without simulator state. New
plan variants (for example, K-major or DTensor-aware plans) can be
added as new functions following the same shape.
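
A pure plan generator in this spirit might look like the sketch below. The stage names follow the DMA_READ, FETCH, GEMM, STORE, DMA_WRITE sequence used for the PE pipeline, but the function signature, the single `operands_in_tcm` flag, and the tuple layout are assumptions for illustration.

```python
def tile_plan(m, n, tile_m, tile_n, operands_in_tcm=False):
    """Return an ordered list of (row_offset, col_offset, stages) for an m x n output."""
    stages = ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
    if not operands_in_tcm:
        stages = ["DMA_READ"] + stages  # operands streamed from HBM need a DMA read first
    return [
        (row, col, tuple(stages))
        for row in range(0, m, tile_m)
        for col in range(0, n, tile_n)
    ]


streamed = tile_plan(4, 4, 2, 2)
resident = tile_plan(4, 4, 2, 2, operands_in_tcm=True)
print(streamed[0])     # (0, 0, ('DMA_READ', 'FETCH', 'GEMM', 'STORE', 'DMA_WRITE'))
print(resident[0][2])  # ('FETCH', 'GEMM', 'STORE', 'DMA_WRITE')
```

Since the function touches no simulator state, a test can compare two plans directly as plain data.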

---

## Implementation Decisions

This section collects cross-cutting decisions — algorithms, policies,
schemes, and contracts — that span multiple components rather than
living inside one.

### Address Scheme

<!-- src: ADR-0001 Context, Decision -->
Every physical address in the simulator decodes into a structured
location. A fixed-width physical address carries the SIP id, the
cube id within the SIP, a type discriminator (HBM vs PE-resource vs
others), and a type-specific offset. HBM addresses additionally encode
the per-PE slice offset so the controller can determine which PE
owns the target slice without external lookup. The layout is
deliberately reserved rather than packed-to-fit, so new sub-units can
be added at the type-discriminator level without rewriting existing
addresses.
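
To make the structured decode concrete, the sketch below packs and unpacks the four fields with bit shifts. The field widths, the field order, and the discriminator values are invented for this example; the document does not state the actual layout.

```python
# Hypothetical field widths, most significant field first: SIP | cube | type | offset.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 8, 8, 4, 36
TYPE_HBM, TYPE_PE_RESOURCE = 0, 1  # assumed discriminator values


def encode_pa(sip, cube, kind, offset):
    """Pack the four fields into one integer physical address."""
    for value, bits in ((sip, SIP_BITS), (cube, CUBE_BITS), (kind, TYPE_BITS), (offset, OFFSET_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    pa = sip
    pa = (pa << CUBE_BITS) | cube
    pa = (pa << TYPE_BITS) | kind
    pa = (pa << OFFSET_BITS) | offset
    return pa


def decode_pa(pa):
    """Unpack a physical address into (sip, cube, kind, offset)."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    sip = pa >> CUBE_BITS
    return sip, cube, kind, offset


pa = encode_pa(sip=1, cube=5, kind=TYPE_HBM, offset=0x1234)
print(decode_pa(pa))  # (1, 5, 0, 4660)
```

Leaving spare values in the type field is what allows a new sub-unit to be introduced without moving any existing address.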

<!-- src: ADR-0011 Context, Decision -->
On top of physical addressing, the simulator defines three address
models that the runtime API selects between. Direct physical
addressing is retained as a fallback. Virtual addressing, the
current default, gives each tensor a contiguous virtual range at
deployment, with the per-PE MMU translating on each access. A
logical-address scheme remains a future option. All current test
paths use virtual addressing; the MMU itself applies the PA
fallback when no mapping exists for an address (a deliberate
signal, not an error).

<!-- src: ADR-0011 Decision, Consequences -->
Tensor placement is represented as a list of physical-address shards,
each tagged with target SIP, cube, and PE, plus a single tensor-wide
virtual base. This means a kernel sees one virtual base for the whole
tensor while the host driver and the engine still know exactly where
each shard lives. Replicated tensors get per-cube local PA mappings;
sharded tensors broadcast their mapping across cubes within a SIP.

### Routing, Distance & Helper API

<!-- src: ADR-0002 Context, Decision -->
Routing is policy-driven, deterministic, and topology-aware. Given a
source, a destination, and an intent — for example, PE-initiated
DMA versus host-initiated memory write versus a generic
component-to-component query — the routing layer picks the right
path. The intent matters because different traffic types must avoid
different categories of edges: PE-initiated DMA should not traverse
command-only links; M_CPU DMA should not pass through PE-internal
pipeline edges; cube-local transfers should not use the
zero-distance UCIe bus that would otherwise look attractive to a
shortest-path search.

<!-- src: ADR-0051 Decision -->
The routing layer therefore maintains four separate adjacency graphs
at construction, each excluding a different category of edges, and
picks the appropriate one per intent. On top of the graphs sits a
helper API that hides the topology's naming convention: callers ask
for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
the HBM destination for a given physical address, and receive the
corresponding node id. No component constructs node-id strings
directly; if the naming convention ever changes, the change is local
to the helper layer.

<!-- src: ADR-0051 Decision, Consequences -->
Path-finding itself uses Dijkstra with explicit per-edge weights
(routing weight is allowed to differ from physical distance — for
example, UCIe is configured to be routing-preferable). Tie-breaks
follow insertion order, which keeps results deterministic. Paths
between unreachable nodes raise rather than returning empty, surfacing
topology errors immediately.
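
The path-finding behavior can be sketched with the standard-library `heapq` module. The adjacency format (`{node: [(neighbor, weight), ...]}`), the function name, and the use of a push counter to model insertion-order tie-breaking are assumptions; the simulator's own data structures are not shown in this document.

```python
import heapq


def shortest_path(adjacency, src, dst):
    """Dijkstra over weighted edges; equal-cost ties resolve by discovery order."""
    counter = 0                     # increases with every push, so ties pop in push order
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, counter, src)]
    done = set()                    # membership checks only, never iterated
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for neighbor, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):  # strict: first route found is kept
                best[neighbor] = candidate
                prev[neighbor] = node
                counter += 1
                heapq.heappush(heap, (candidate, counter, neighbor))
    if dst not in done:
        raise ValueError(f"no route from {src!r} to {dst!r}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


graph = {"a": [("b", 1), ("c", 1)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
print(shortest_path(graph, "a", "d"))  # ['a', 'b', 'd']
```

Both routes from `a` to `d` cost 2, and the one through `b` wins because `b` is listed first in the adjacency list. Asking for a route from `d` back to `a` raises `ValueError` instead of returning an empty path.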

### Memory Semantics and Local-HBM Bandwidth

<!-- src: ADR-0004 Context, Decision -->
A PE accessing its own HBM slice through its own cube's NOC must see
the full local HBM bandwidth — that is the model's intent. Memory
traffic accumulates latency from per-component overhead and
bytes-over-link-bandwidth serialization along the path, but the
controller does not throttle below the slice's allotted bandwidth.
Cross-PE-slice accesses inside the same cube, cross-cube accesses
through UCIe, and cross-SIP accesses through PCIe each pay
progressively more overhead as the path grows.

### Topology Compilation, Diagrams & Builder Algorithms

<!-- src: ADR-0006 Context, Decision -->
Topology is configurable, not hardcoded. The simulator reads a YAML
spec, compiles it into a flat graph of nodes and edges plus four
view projections at different abstraction levels — system, SIP, cube,
PE — and uses the compiled graph as the single source for both
execution and visualization. Distance metadata used by routing is
extracted at compile time so that diagrams and routing decisions
agree by construction.

<!-- src: ADR-0005 Context, Decision -->
Diagrams are derived artifacts of the compiled topology. The visualizer
produces one SVG per view at the appropriate abstraction level; nothing
in the diagrams is hand-drawn or hand-positioned. Distance-aware
layout rules place nodes in the diagrams using the same coordinates
that routing uses to compute distance, so a diagram that "looks
wrong" is a signal that the topology itself has a problem, not the
visualizer.

<!-- src: ADR-0053 Decision -->
Inside a cube the router mesh is generated automatically. PE corner
positions are fixed by convention; the relay-column algorithm
inserts additional grid columns whenever the gap between adjacent PE
columns would exceed a tunable maximum. HBM occupies a central
exclusion zone — router slots inside the zone are deliberately empty,
since HBM controllers attach as separate named nodes. M_CPU and SRAM
attach to the nearest router by Euclidean distance from their
configured placement coordinates, and UCIe physical lanes distribute
along the boundary rows and columns. The whole mesh is cached
beside the topology spec and invalidated only when one of a small set
of layout-relevant fields changes.
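
Two of the builder steps can be sketched in a few lines each. Both functions below are illustrative guesses at the shape of the algorithms: the even spacing of relay columns, the one-dimensional column coordinates, and the function names are assumptions, not the builder's actual rules.

```python
import math


def relay_columns(pe_cols, max_gap):
    """Insert evenly spaced relay columns wherever adjacent PE columns are more than max_gap apart."""
    cols = []
    for left, right in zip(pe_cols, pe_cols[1:]):
        cols.append(left)
        gap = right - left
        relays = math.ceil(gap / max_gap) - 1  # 0 when the gap already fits
        for k in range(1, relays + 1):
            cols.append(left + gap * k / (relays + 1))
    cols.append(pe_cols[-1])
    return cols


def nearest_router(routers, placement):
    """Pick the router closest to placement; min() keeps the first candidate on ties."""
    return min(routers, key=lambda xy: math.dist(xy, placement))


print(relay_columns([0, 9], max_gap=3))                     # [0, 3.0, 6.0, 9]
print(nearest_router([(0, 0), (4, 0), (0, 4)], (3, 1)))     # (4, 0)
```

Resolving distance ties by list order keeps attachment deterministic, in line with the determinism requirement stated in the design principles.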

<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->

### Tensor Deployment and Allocation

<!-- src: ADR-0008 Context, Decision -->
Tensor deployment in the runtime API produces a list of physical-address
shards plus a single tensor-wide virtual base. The host allocator
walks the data-parallelism policy, computes per-shard placement, and
emits the per-shard physical addresses through the per-PE allocators.
No separate "allocate then later attach to a device" RPC exists —
allocation and deployment are a single operation that produces a
deployed tensor handle.

### Memory Allocator Algorithms

<!-- src: ADR-0048 Context, Decision -->
Each per-PE allocator owns two channels — HBM slice and TCM — each
backed by an offset-keyed free-list. Allocation is first-fit; freeing
coalesces with adjacent free blocks. A device-wide virtual allocator
sits above the per-PE allocators, aligns requests up to the configured
page size, and coalesces on free in the same way. The trade-off is
explicit: first-fit is simpler and cheaper than best-fit or buddy
allocation, and the simulator's workload is stack-like enough
(deploy / kernel / free in matched order) that fragmentation is not
a practical concern.
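
A compact sketch of a first-fit allocator with coalescing on free is given below. The class name, the sorted-list representation of the free-list, and the use of `MemoryError` for exhaustion are assumptions chosen for brevity; the actual allocator may organize its free-list differently.

```python
import bisect


class FirstFitAllocator:
    """Offset-keyed free-list sketch: first-fit allocation, coalescing on free."""

    def __init__(self, size):
        self.free_list = [(0, size)]  # sorted (offset, length) blocks

    def alloc(self, nbytes):
        """Carve nbytes out of the first block large enough; raise when none fits."""
        for i, (offset, length) in enumerate(self.free_list):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (offset + nbytes, length - nbytes)
                return offset
        raise MemoryError(f"cannot allocate {nbytes} bytes")

    def free(self, offset, nbytes):
        """Return a block and merge it with adjacent free neighbors (caller is trusted)."""
        i = bisect.bisect_left(self.free_list, (offset, 0))
        self.free_list.insert(i, (offset, nbytes))
        if i + 1 < len(self.free_list) and offset + nbytes == self.free_list[i + 1][0]:
            _, next_len = self.free_list.pop(i + 1)
            self.free_list[i] = (offset, nbytes + next_len)
        if i > 0 and sum(self.free_list[i - 1]) == self.free_list[i][0]:
            _, cur_len = self.free_list.pop(i)
            prev_off, prev_len = self.free_list[i - 1]
            self.free_list[i - 1] = (prev_off, prev_len + cur_len)


heap = FirstFitAllocator(1024)
a, b, c = heap.alloc(256), heap.alloc(256), heap.alloc(512)
heap.free(a, 256)
heap.free(c, 512)
heap.free(b, 256)      # merges with both neighbors
print(heap.free_list)  # [(0, 1024)]
```

Freeing the middle block last exercises both merge directions, leaving a single block that spans the whole region.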

<!-- src: ADR-0048 Decision, Consequences -->
Allocation failure raises rather than silently returning a partial
result. A partial tensor reaching the engine would route over wrong
PAs and silently corrupt simulator output, so an out-of-memory signal
is preferred. The free path trusts its caller to pass back exactly
what was allocated; the small risk of caller error in exchange for
fast common-case freeing is documented as a deliberate trade.

### Kernel Execution and Host-Device Messaging

<!-- src: ADR-0009 Context, Decision -->
Kernel execution decomposes into a small set of messages that travel
the device graph. The host issues a single kernel-launch message; the
IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
PE_CPU resolves the kernel and runs it through the scheduler.
Completion flows back the same way, gated by per-shard completion
tracking. Memory operations follow the same pattern: a memory write
or read travels as one message that the engine routes to the right
HBM controller, with a response taking the reverse path.

<!-- src: ADR-0012 Context, Decision -->
The schema between the host and the device-side IO_CPU is PA-first
and shard-tagged. Every byte of host-issued payload arrives with an
explicit target SIP, cube, PE, and physical address. The IO_CPU does
not decode addresses to derive placement — placement is named
explicitly by the shard list. This makes the host-device interface
deterministic and keeps the routing helper free of host-derived
intent.

### CLI Surface and Semantics

<!-- src: ADR-0010 Context, Decision -->
The command-line interface exposes four subcommands. A bench runner
loads a topology, resolves a registered benchmark by name or index,
and runs it on a selected device. A bench-listing command enumerates
the registered benchmarks. A probe utility runs a fixed catalog of
traffic patterns through the engine for latency and bandwidth
verification. A web viewer renders the topology in a browser. A
benchmark instance is always single-device by convention; multi-SIP
collective work happens inside the benchmark through the launcher
abstraction, not by multiplexing the CLI.

### Component Port and Wire Fabric Model

<!-- src: ADR-0015 Context, Decision -->
Every modeled component exposes input and output ports, and every
edge in the topology connects an output port on one component to an
input port on another. Bandwidth and propagation delay are properties
of the wire between ports, not of the component endpoints. A
component's responsibility is to apply its configured per-node
overhead and either forward to the next hop or terminate; the wire
charges the byte-over-bandwidth serialization separately.

<!-- src: ADR-0015 Decision, Consequences -->
This separation lets components be swapped behind their port
interface without changing the rest of the model, and it keeps
bandwidth contention at the wire level where multiple components may
contend for the same edge. Future component models can refine
internal behavior without disturbing the fabric.

### Two-Pass Data Execution

<!-- src: ADR-0020 Context, Decision -->
The simulator runs in two passes. The first pass — fast and always
on — runs the discrete-event engine and records every data operation
in an operation log with timestamps, component identifiers, and
per-operation parameters. The second pass, which is optional, replays
the log against an in-memory tensor store to produce actual numerical
results. Tests that only need timing skip the second pass; tests that
need to verify correctness opt in.

<!-- src: ADR-0020 Decision, Consequences -->
The split lets the timing engine remain unconcerned with data
semantics: kernels move handles around, not bytes. The replay phase
recovers data semantics from the recorded operations, in their
original time order with a small set of secondary-sort rules. The
op-log records carry enough metadata — input snapshots for compute
operations, source snapshots for cross-component copies — that the
replay phase cannot mis-order with respect to in-flight mutations.

### Sim-engine Op Log and Memory Store Schemas

<!-- src: ADR-0052 Context, Decision -->
The operation log holds typed records with seven fields each: start
and end timestamps, the component that issued the operation, an
operation kind ("memory", "gemm", "math"), an operation name, a
parameter dictionary, and a (currently unused) dependency list.
Records are kept in stable timestamp order. The parameter dictionary
varies by operation: a DMA read carries source address and byte count;
a GEMM carries operand shapes, dtypes, and address spaces; a math
operation carries input addresses and snapshots.

<!-- src: ADR-0052 Decision, Consequences -->
The companion memory store is a two-level dictionary keyed by
address space ("hbm", "tcm", "sram", others) and integer address.
Reads and writes are reference-based — no copy by default — so
callers wanting to detach a snapshot must copy explicitly. This is
deliberate: the engine-internal snapshot paths copy at well-defined
points (math input capture, HBM source capture for DMA writes,
inbound collective copies) and downstream replay code therefore
sees stable data even when slot or scratch addresses are reused by
later operations.
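
The two schemas can be sketched as a seven-field dataclass and a nested dictionary. The field names, the class names, and the use of plain Python lists as payloads are assumptions for illustration; only the field count and the reference-based semantics come from the description above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class OpRecord:
    """One op-log entry; field names are hypothetical."""
    t_start: float
    t_end: float
    component: str
    kind: str                                  # "memory", "gemm", or "math"
    name: str
    params: dict = field(default_factory=dict)
    deps: list = field(default_factory=list)   # currently unused


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> payload reference."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, data):
        self._spaces.setdefault(space, {})[addr] = data  # stores the reference, no copy

    def read(self, space, addr):
        return self._spaces[space][addr]                 # returns the same object


store = MemoryStore()
payload = [1, 2, 3]
store.write("hbm", 0x100, payload)
alias = store.read("hbm", 0x100)
snapshot = list(alias)   # explicit copy detaches the snapshot
payload[0] = 99
print(alias[0], snapshot[0])  # 99 1

log = [OpRecord(5.0, 6.0, "pe0.dma", "memory", "dma_read"),
       OpRecord(1.0, 2.0, "pe0.gemm", "gemm", "matmul")]
log.sort(key=lambda record: record.t_start)  # Python's sort is stable for equal timestamps
```

The mutation of `payload` is visible through `alias` but not through `snapshot`, which is why snapshot points must copy explicitly.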

### 2D Grid Program Identity

<!-- src: ADR-0022 Context, Decision -->
Inside a kernel the program identity is two-dimensional. The
first axis corresponds to the PE index within a cube; the second
corresponds to the cube index within a SIP. Together they let a
kernel address its position both within its cube and within the
larger system without needing to know the full topology. Total
program counts along each axis are exposed symmetrically.
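
A minimal sketch of how a kernel might use the two axes, assuming accessors named `program_id(axis)` and `num_programs(axis)`. These names follow common GPU-kernel conventions and are not confirmed by the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GridContext:
    pe_index: int        # axis 0: PE index within the cube
    cube_index: int      # axis 1: cube index within the SIP
    pes_per_cube: int
    cubes_per_sip: int

    def program_id(self, axis: int) -> int:
        return (self.pe_index, self.cube_index)[axis]

    def num_programs(self, axis: int) -> int:
        return (self.pes_per_cube, self.cubes_per_sip)[axis]

def flat_index(ctx: GridContext) -> int:
    """Position within the SIP, derived without any other topology knowledge."""
    return ctx.program_id(1) * ctx.num_programs(0) + ctx.program_id(0)

ctx = GridContext(pe_index=3, cube_index=2, pes_per_cube=8, cubes_per_sip=4)
print(flat_index(ctx))
```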

### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module

<!-- src: ADR-0024 Context, Decision -->
The launcher model treats each SIP as one rank. Inside a process the
launcher spawns one greenlet per SIP rank; the rank is bound to its
greenlet so that any code running in that worker sees the right
distributed-style rank. This is a deliberately PyTorch-compatible
shape: a benchmark looks like a small DDP training script — initialize
a process group, spawn workers, each worker runs the same body.
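
The rank-binding idea can be sketched with the standard library alone. This sketch swaps in OS threads and `threading.local` for greenlets so that it runs without third-party packages; `spawn` and `get_rank` are hypothetical names, and the real launcher's scheduling is cooperative rather than preemptive.

```python
import threading

_rank_state = threading.local()

def get_rank():
    """Rank bound to the worker that is currently running."""
    return _rank_state.rank

def spawn(worker, world_size):
    """Run one worker per rank, binding the rank before the body executes."""
    results = [None] * world_size

    def _entry(rank):
        _rank_state.rank = rank
        results[rank] = worker(rank, world_size)

    threads = [threading.Thread(target=_entry, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def worker(rank, world_size):
    # Any code called from here sees the rank bound to this worker.
    return (get_rank(), world_size)

results = spawn(worker, world_size=4)
```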

<!-- src: ADR-0026 Context, Decision -->
Data-parallelism policy lives in a single object that names the
sharding strategy along the cube axis (replicate, row-wise,
column-wise) and along the PE axis (same set of values), and optionally
overrides the number of cubes or PEs participating. The policy is
intra-device — it does not cross SIP boundaries. SIP-level parallelism
is the launcher's responsibility, and the two axes compose
orthogonally.
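
One possible shape for such a policy object is sketched below. The class and field names are hypothetical, and the shape helper assumes the tensor dimensions divide evenly by the number of participants.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Shard(Enum):
    REPLICATE = "replicate"
    ROW_WISE = "row_wise"
    COLUMN_WISE = "column_wise"

@dataclass(frozen=True)
class DPPolicy:
    cube: Shard = Shard.REPLICATE    # strategy along the cube axis
    pe: Shard = Shard.REPLICATE      # strategy along the PE axis
    num_cubes: Optional[int] = None  # optional override
    num_pes: Optional[int] = None    # optional override

def shard_shape(rows, cols, strategy, parts):
    """Shape of one shard; assumes even divisibility."""
    if strategy is Shard.ROW_WISE:
        return (rows // parts, cols)
    if strategy is Shard.COLUMN_WISE:
        return (rows, cols // parts)
    return (rows, cols)

def per_pe_shape(shape, policy, cubes, pes):
    """Apply the cube-axis strategy first, then the PE-axis strategy."""
    if policy.num_cubes is not None:
        cubes = policy.num_cubes
    if policy.num_pes is not None:
        pes = policy.num_pes
    rows, cols = shard_shape(shape[0], shape[1], policy.cube, cubes)
    return shard_shape(rows, cols, policy.pe, pes)

policy = DPPolicy(cube=Shard.ROW_WISE, pe=Shard.COLUMN_WISE)
tile = per_pe_shape((1024, 512), policy, cubes=4, pes=8)
```

Here a 1024 x 512 tensor is split row-wise across 4 cubes and then column-wise across 8 PEs, so each PE holds a 256 x 64 tile.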

<!-- src: ADR-0027 Context, Decision -->
A Megatron-style tensor-parallel API sits on top of the launcher and
the DP policy. Layer-level building blocks — column-parallel linear,
row-parallel linear, all-reduce — name their sharding intent in terms
the launcher and the placement policy can compose. This is the layer
that bench code typically writes against.

<!-- src: ADR-0047 Context, Decision -->
For collective operations the runtime exposes a PyTorch-compatible
distributed backend named "ahbm". On process-group initialization the
backend resolves the world size (priority: explicit ccl.yaml override →
defaults section → topology SIP count), dynamically imports the
configured collective-algorithm module, derives the SIP topology kind,
and pushes the inter-PE neighbor table to every participating PE. From
that point on, an all-reduce call dispatches the algorithm's kernel
function across all ranks.
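
The world-size priority chain can be sketched as a small function. The dictionary keys (`world_size`, `defaults`) are assumptions about how the parsed ccl.yaml might look; the source specifies only the order of precedence.

```python
def resolve_world_size(ccl_config, topology_sip_count):
    """Explicit override, then defaults section, then topology SIP count."""
    override = ccl_config.get("world_size")
    if override is not None:
        return override
    default = ccl_config.get("defaults", {}).get("world_size")
    if default is not None:
        return default
    return topology_sip_count

explicit = resolve_world_size({"world_size": 2, "defaults": {"world_size": 4}}, 8)
from_defaults = resolve_world_size({"defaults": {"world_size": 4}}, 8)
from_topology = resolve_world_size({}, 8)
```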

<!-- src: ADR-0050 Context, Decision -->
A collective-algorithm module is a Python module with a small, fixed
contract. It exposes topology-kind integer constants, a name-to-kind
mapping for the YAML configuration, a kernel-arguments builder, and
a kernel function — the kernel function being aliased to the name
`kernel` so the backend can find it generically. The kernel itself
takes the tensor pointer, the per-cube element count, cube mesh
width and height, the world size, the current rank, and the SIP
topology dimensions; the backend appends those last four arguments
automatically. New collectives slot in by adding a new module that
follows this shape.
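
A skeleton of such a module might look like the sketch below. Every identifier except the `kernel` alias is hypothetical, the kernel body is a placeholder, and the sketch assumes that the SIP topology dimensions are a width and a height, so that the four backend-appended arguments are world size, rank, and those two dimensions.

```python
# Topology-kind integer constants and the name-to-kind mapping used by YAML.
TOPO_RING = 0
TOPO_TORUS = 1
TOPO_MESH = 2
TOPOLOGY_KINDS = {"ring": TOPO_RING, "torus": TOPO_TORUS, "mesh": TOPO_MESH}

def build_kernel_args(tensor_ptr, n_elem_per_cube, mesh_w, mesh_h):
    """Arguments the caller supplies; the backend appends the rest."""
    return (tensor_ptr, n_elem_per_cube, mesh_w, mesh_h)

def allreduce_kernel(tensor_ptr, n_elem_per_cube, mesh_w, mesh_h,
                     world_size, rank, sip_w, sip_h):
    # Placeholder body: a real kernel would issue sends and receives here.
    return {"rank": rank, "world_size": world_size, "sip_dims": (sip_w, sip_h)}

# Generic alias so a backend can locate the entry point by one fixed name.
kernel = allreduce_kernel

def dispatch(rank, world_size, sip_dims, user_args):
    """Backend-side call: append world size, rank, and SIP dimensions."""
    return kernel(*user_args, world_size, rank, *sip_dims)

out = dispatch(rank=1, world_size=4, sip_dims=(2, 2),
               user_args=build_kernel_args(0x1000, 256, 4, 4))
```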

<!-- src: ADR-0027 Decision, Consequences -->
The combination is deliberate: bench authors get to write code that
looks like a regular distributed training script, while the launcher,
backend, and placement policies behind it remain free to redirect
work to the right SIP, cube, and PE without exposing topology to the
kernel.

### IPCQ Direction Addressing

<!-- src: ADR-0025 Context, Decision -->
Inside a collective algorithm, peer PEs are named by direction —
"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
cross-SIP neighbors. Direction addressing is the addressing scheme:
the algorithm names a direction, the IPCQ neighbor table installed
at process-group time resolves the direction to the peer endpoint's
physical-address coordinates, and the PE_DMA performs the actual
transfer. The algorithm itself does not see PA arithmetic — direction
is the user-facing handle.
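
Direction resolution can be illustrated with a toy neighbor table. This sketch uses plain grid coordinates in place of physical-address coordinates, assumes a non-wrapping mesh, and assumes that "N" means decreasing `y`; none of these details is stated in the source.

```python
def build_neighbor_table(x, y, width, height):
    """Map each direction to a peer coordinate, omitting off-mesh peers."""
    candidates = {
        "N": (x, y - 1),
        "S": (x, y + 1),
        "E": (x + 1, y),
        "W": (x - 1, y),
    }
    return {
        direction: (px, py)
        for direction, (px, py) in candidates.items()
        if 0 <= px < width and 0 <= py < height
    }

def resolve(table, direction):
    """Look up the peer for a direction; the caller never does the arithmetic."""
    if direction not in table:
        raise ValueError(f"no neighbor in direction {direction!r}")
    return table[direction]

corner_table = build_neighbor_table(0, 0, width=2, height=2)
peer = resolve(corner_table, "E")
```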

### Intercube All-Reduce

<!-- src: ADR-0032 Context, Decision -->
The default all-reduce algorithm uses a center-rooted bidirectional
reduce phase inside each SIP's cube mesh, followed by an inter-SIP
exchange on the mesh's root cube, and then a bidirectional broadcast
back out. Center-rooting roughly halves the worst-case in-cube hop
count compared with a corner-rooted walk. The inter-SIP exchange itself follows the
configured SIP topology — ring, torus, or non-wrapping mesh —
selected at runtime through the SIP-topology kind integer the
backend passes to the kernel.
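
The hop-count benefit of center-rooting can be checked with a short calculation. The sketch assumes hops are counted as Manhattan distance on a non-wrapping cube mesh, which is a simplification of whatever route the algorithm actually takes.

```python
def max_hops(width, height, root):
    """Worst-case Manhattan distance from the root to any cube in the mesh."""
    rx, ry = root
    return max(
        abs(x - rx) + abs(y - ry)
        for x in range(width)
        for y in range(height)
    )

corner_5x5 = max_hops(5, 5, root=(0, 0))
center_5x5 = max_hops(5, 5, root=(2, 2))
corner_4x4 = max_hops(4, 4, root=(0, 0))
center_4x4 = max_hops(4, 4, root=(1, 1))
```

For an odd-sized mesh such as 5 x 5 the worst case drops from 8 hops to 4. For an even-sized mesh such as 4 x 4 there is no exact center, so the worst case drops from 6 hops to 4, slightly more than half.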

### Evaluation Harnesses

<!-- src: ADR-0043 Context, Decision -->
The all-reduce evaluation harness drives correctness and the
latency/buffer-kind sweeps through the public distributed path —
initialize process group, spawn workers, call all-reduce — rather
than the lower-level engine interface. A shared helper module factors
out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
HBM) and the inter-SIP topology variants. The plots produced by the
harness are part of its output contract; the harness regenerates them
on demand.

<!-- src: ADR-0044 Context, Decision -->
The GEMM evaluation harness is split into two layers. A heavy
shape-and-variant sweep lives as a manual script — it runs the same
composite-GEMM benchmark across many shapes and operand-staging
variants, harvests the resulting op-log, and writes a JSON summary.
A faster figure-generation layer lives in the test suite and consumes
that JSON to render plots. The split keeps the heavy data
generation explicit and out of the regular test path.

### Bench Module Contract

<!-- src: ADR-0045 Context, Decision -->
Adding a new benchmark requires only dropping a file into the
benchmarks directory. The file registers one or more benchmark
functions through a small decorator that takes a kebab-case name and
a human-readable description. The decorator is the registration
mechanism — there is no separate manifest. Each benchmark function
takes one argument, conventionally named `torch`, which is the
runtime context exposing tensor allocation, kernel launch,
distributed APIs, and process-spawning. The function name is `run` by
convention.

<!-- src: ADR-0045 Decision, Consequences -->
A benchmark must submit at least one operation, or the runner
returns an error. A benchmark instance is single-device by default;
when a benchmark is collective, it uses the distributed-process-spawn
pattern internally — one worker greenlet per rank, with each worker
binding to its rank. Multi-device benchmark patterns outside that
shape are not supported.
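
The registration contract and the at-least-one-operation rule can be sketched together. The decorator name `benchmark`, the `StubContext` class, and its `submit` method are hypothetical, and the sketch raises an exception where the real runner is described only as returning an error.

```python
BENCHMARKS = {}

def benchmark(name, description):
    """Register a benchmark function under a kebab-case name."""
    def decorate(fn):
        BENCHMARKS[name] = (fn, description)
        return fn
    return decorate

class StubContext:
    """Stand-in for the runtime context passed as `torch`."""
    def __init__(self):
        self.ops = []

    def submit(self, op_name):
        self.ops.append(op_name)

@benchmark("vector-add", "Element-wise add of two tensors")
def run(torch):
    torch.submit("vector_add")

# Named differently from `run` only so both can live in one example file.
@benchmark("empty", "Submits nothing")
def run_empty(torch):
    pass

def run_benchmark(name):
    fn, _description = BENCHMARKS[name]
    ctx = StubContext()
    fn(ctx)
    if not ctx.ops:
        raise RuntimeError(f"benchmark {name!r} submitted no operations")
    return ctx.ops
```

Calling `run_benchmark("vector-add")` returns the recorded operations, while `run_benchmark("empty")` fails because nothing was submitted.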

### Kernel-side `tl.*` API

<!-- src: ADR-0046 Context, Decision -->
Inside a kernel function, the `tl` argument exposes the kernel-side
API in a shape that mirrors the conventions of established
GPU-kernel languages. Categories: reference handles that name HBM
data without issuing DMA; data movement (load, store) that does
issue DMA; GEMM and math compute (dot, composite, the unary and
binary math operations, reductions); index and scalar helpers
(program identity, range-builders); metadata-only operations like
transpose; and the collective primitives (send, receive,
non-blocking receive). Tensor handles support arithmetic operators
via a thread-local active context so kernel code reads naturally.
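
The operator-overloading mechanism can be sketched as follows. `RecordingContext` and `Handle` are hypothetical names; the sketch shows only how a thread-local active context lets `a * b + c` record operations without the kernel passing a context around.

```python
import threading

_active = threading.local()

class Handle:
    """Tensor handle: names a value without holding data."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return _active.ctx.emit("add", self, other)

    def __mul__(self, other):
        return _active.ctx.emit("mul", self, other)

class RecordingContext:
    def __init__(self):
        self.ops = []

    def __enter__(self):
        _active.ctx = self  # becomes the active context for this thread
        return self

    def __exit__(self, exc_type, exc, tb):
        _active.ctx = None
        return False

    def emit(self, op, *inputs):
        self.ops.append((op, tuple(h.name for h in inputs)))
        return Handle(f"t{len(self.ops)}")  # fresh handle for the result

with RecordingContext() as ctx:
    a, b, c = Handle("a"), Handle("b"), Handle("c")
    out = a * b + c
```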

<!-- src: ADR-0046 Decision, Consequences -->
The API supports two execution modes. A command-list mode records
operations into a list without consuming simulator time — useful for
inspection and lightweight tests. A greenlet-driven mode runs the
kernel as a child greenlet that switches back to the simulator on
each `tl.*` call; the simulator drives the event scheduler and hands
real data back to the kernel as DMA reads complete. The two modes
share the same surface; the kernel does not know which one it is
running under.
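
The idea of one kernel surface under two drivers can be sketched with Python generators. Generators are swapped in for greenlets here so the example needs only the standard library; the request tuples, the fixed DMA latency, and both driver functions are hypothetical.

```python
def kernel():
    # Each yield hands control to the driver, much as a tl.* call would.
    x = yield ("load", 0x1000)
    y = yield ("load", 0x2000)
    yield ("store", 0x3000, x + y)

def run_command_list(kernel_fn):
    """Record operation kinds without advancing any clock."""
    ops = []
    gen = kernel_fn()
    try:
        request = next(gen)
        while True:
            ops.append(request[0])
            request = gen.send(0)  # placeholder value, no real data
    except StopIteration:
        pass
    return ops

def run_driven(kernel_fn, memory, dma_latency=10):
    """Advance a clock per request and hand real data back to the kernel."""
    now = 0
    gen = kernel_fn()
    try:
        request = next(gen)
        while True:
            now += dma_latency
            if request[0] == "load":
                request = gen.send(memory[request[1]])
            else:
                memory[request[1]] = request[2]
                request = gen.send(None)
    except StopIteration:
        pass
    return now

memory = {0x1000: 2, 0x2000: 3}
recorded = run_command_list(kernel)
elapsed = run_driven(kernel, memory)
```

The kernel body is identical in both runs; only the driver decides whether time advances and whether loads return real data.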

### Probe Subcommand

<!-- src: ADR-0049 Context, Decision -->
The probe utility runs three families of traffic patterns through
the engine — host-to-device writes at increasing hop counts,
device-to-host reads at increasing hop counts, and PE-initiated DMA
across the cube mesh — and reports actual latency, the analytical
formula breakdown, effective bandwidth, bottleneck bandwidth, and
utilization. A fixed reference size is used for the summary table;
a separate utilization-versus-size sweep covers a logarithmic range
of transfer sizes. Each case runs in its own engine instance so
cases do not perturb each other.
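
The reported quantities can be related with a simplified analytical model. The sketch assumes a per-hop cost of node overhead plus wire delay, with serialization charged once at the bottleneck link; the probe's actual formula may differ. Bandwidth is expressed in bytes per nanosecond, which is numerically equal to GB/s.

```python
def analytical_latency_ns(nbytes, hops, node_overhead_ns, wire_delay_ns,
                          bottleneck_bytes_per_ns):
    """Per-hop fixed cost plus one serialization term at the bottleneck."""
    fixed = hops * (node_overhead_ns + wire_delay_ns)
    serialization = nbytes / bottleneck_bytes_per_ns
    return fixed + serialization

def utilization(nbytes, latency_ns, bottleneck_bytes_per_ns):
    """Effective bandwidth as a fraction of bottleneck bandwidth."""
    effective = nbytes / latency_ns
    return effective / bottleneck_bytes_per_ns

# Logarithmic size sweep, 64 B to 1 MiB (illustrative parameters).
sizes = [2 ** k for k in range(6, 21, 2)]
utils = [
    utilization(n, analytical_latency_ns(n, 3, 5.0, 2.0, 64.0), 64.0)
    for n in sizes
]
for n, u in zip(sizes, utils):
    print(f"{n:>8} B  utilization {u:.3f}")
```

Under this model utilization rises with transfer size because the fixed per-hop cost is amortized over more bytes, which is the trend a utilization-versus-size sweep is designed to expose.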

<!-- src: ADR-0049 Decision, Consequences -->
The probe also checks a small set of invariants automatically:
monotonic latency increase with hop count, device-to-host latency
at least as large as host-to-device for the same hop count, and a
faster best-case path than worst-case for cross-cube PE DMA. Failures
print prominently. The output is meant for human reading; automated
parsing should not depend on column widths or whitespace.
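
The three invariants can be expressed as a small checker. The function name and input layout are hypothetical, and the sketch treats "monotonic increase" as strictly increasing, which the source does not specify.

```python
def check_invariants(h2d, d2h, pe_best, pe_worst):
    """h2d and d2h map hop count to latency; returns failure messages."""
    failures = []

    # 1. Latency must increase with hop count in each direction.
    for label, series in (("h2d", h2d), ("d2h", d2h)):
        hops = sorted(series)
        for lo, hi in zip(hops, hops[1:]):
            if series[hi] <= series[lo]:
                failures.append(
                    f"{label}: latency not increasing from {lo} to {hi} hops"
                )

    # 2. Device-to-host must be at least as slow as host-to-device.
    for hop in sorted(h2d.keys() & d2h.keys()):
        if d2h[hop] < h2d[hop]:
            failures.append(f"d2h faster than h2d at {hop} hops")

    # 3. Best-case cross-cube PE DMA must beat the worst case.
    if not pe_best < pe_worst:
        failures.append("PE DMA best-case path is not faster than worst-case")

    return failures

ok = check_invariants({1: 100, 2: 140}, {1: 110, 2: 150}, pe_best=60, pe_worst=90)
bad = check_invariants({1: 100, 2: 90}, {1: 95, 2: 150}, pe_best=60, pe_worst=60)
```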

---

This document summarizes 46 architecture decisions captured during
the first half of 2026. It is regenerated mechanically from the
decision corpus; sources are recorded in HTML comments throughout.