# KernBench: Architecture Design Document

*2026 1H*

KernBench is a system-level, discrete-event simulator for AI-accelerator chiplet systems. It models the data-movement and control paths across the full hardware hierarchy and reports end-to-end execution latency for kernels dispatched to the device's compute units.

This document is a public summary of the architecture as designed and implemented in the first half of 2026. It assumes no prior knowledge of the simulator's internal documents; terms specific to the system are defined on first use.

---

## Design Principles

KernBench is grounded in two foundational commitments: every measured latency must trace to explicit, modeled events on the simulator's graph, and every behavioral claim must be verifiable through tests that target spec-level invariants rather than incidental implementation details.

Development is verification-driven. Tests are written to validate the architectural contracts that the simulator exposes (correct routing, deterministic results, monotonic latency under increasing hop counts) rather than to mirror the call graph of the implementation. Two phases coexist: a fast timing phase that exercises the simulator's discrete-event engine and produces a log of operations with timestamps, and an optional data-replay phase that uses that log to compute real numerical results. Tests can target either phase.

The latency model is intentionally abstract rather than cycle-accurate. Each modeled node contributes a configurable per-node overhead, each link contributes wire delay plus byte-over-bandwidth serialization, and each terminal service contributes its own service time. The simulator does not attempt to reproduce cache coherence protocols, microarchitectural pipelines, or full PCIe/UCIe protocol correctness; those are explicitly out of scope.
The aim is a simulator that compares system-level configurations meaningfully and deterministically, not one that ships microarchitectural truths.

Determinism is a hard requirement. Given identical inputs (topology, routing policy, and request stream), the simulator must produce identical outputs, hop traces included. This rules out reliance on unordered set iteration on the critical path and forces every latency contribution to come from an explicitly scheduled event on a modeled component or link. There are no implicit waits, no hardcoded magic delays, and no shortcuts that bypass the modeled graph.

---

## High-level Architecture

The simulated system is a four-level hierarchy. A **Tray** holds one or more **SIPs** (system-in-package), each containing a 2D mesh of **CUBEs** plus one or more **IO chiplets** that connect the SIP to the host. Each CUBE contains a regular grid of **PEs** (processing elements) plus its own attached resources: high-bandwidth memory (HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE itself is a composite of nine sub-components rather than a monolithic core. This hierarchy is fixed; the parameters along each axis (counts, mesh dimensions, link widths) are configurable through the topology spec.

A clean separation runs along the request flow. A **runtime API** at the top is the host-facing surface; it exposes tensor and kernel operations, owns host-side allocation metadata, and is topology-agnostic: it does not route or fan out. Below it, the **simulation engine** decomposes runtime operations into discrete graph requests (memory writes, memory reads, kernel launches, MMU map installs) and schedules events deterministically. At the bottom, **components** model device behavior on a graph of nodes connected by links; they implement the actual latency contributions and pass requests along. No component reaches up into the runtime API, and no runtime call shortcuts the engine.
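
As a rough illustration of the determinism requirement and of the additive latency model from the Design Principles, the sketch below schedules each hop of a request as an explicit event and breaks timestamp ties with an insertion counter. Every name in it (`Hop`, `EventQueue`, `send`, `simulate`) is hypothetical and is not KernBench's actual API. It also assumes that serialization is charged once per link and that a terminal node's service time can be folded into its overhead.

```python
import heapq
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Hop:
    """One node on a path plus its outgoing link (hypothetical structure)."""
    node: str
    overhead_ns: float             # per-node overhead (or service time at a terminal)
    wire_ns: float = 0.0           # wire delay of the outgoing link
    bw_bytes_per_ns: float = 0.0   # outgoing link bandwidth; 0 marks a terminal node


def hop_latency(hop: Hop, nbytes: int) -> float:
    """Per-node overhead + wire delay + bytes-over-bandwidth serialization."""
    serialization = nbytes / hop.bw_bytes_per_ns if hop.bw_bytes_per_ns else 0.0
    return hop.overhead_ns + hop.wire_ns + serialization


class EventQueue:
    """Minimal discrete-event queue with a deterministic tie-break."""

    def __init__(self) -> None:
        self.now = 0.0
        self._heap = []
        self._seq = count()  # equal timestamps pop in the order they were scheduled

    def schedule(self, delay: float, callback, *args) -> None:
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), callback, args))

    def run(self) -> None:
        while self._heap:
            self.now, _, callback, args = heapq.heappop(self._heap)
            callback(*args)


def send(queue: EventQueue, path, nbytes: int, trace: list, idx: int = 0) -> None:
    """Walk `path` hop by hop; every latency contribution is a scheduled event."""
    if idx == len(path):
        return
    hop = path[idx]

    def hop_done() -> None:
        trace.append((queue.now, hop.node))   # record the hop trace
        send(queue, path, nbytes, trace, idx + 1)

    queue.schedule(hop_latency(hop, nbytes), hop_done)


def simulate(path, nbytes: int) -> list:
    """Run one request over `path` and return its (timestamp, node) hop trace."""
    queue, trace = EventQueue(), []
    send(queue, path, nbytes, trace)
    queue.run()
    return trace
```

Calling `simulate` twice with the same path and size returns the same hop trace, and the final timestamp equals the sum of the per-hop contributions. The counter in the heap key is what keeps simultaneous events from being ordered by anything other than scheduling order.
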

### Tray

The Tray is the outermost boundary. It owns the host CPU on one side and one or more SIPs on the other, connected through a fabric switch. For collective communication that must traverse multiple SIPs, the fabric switch acts as the common rendezvous: device-side outbound traffic from one SIP routes through the switch and back into the target SIP's IO chiplet.

### SIP

A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The default topology used by the simulator is a 4×4 cube mesh; the mesh dimensions are configurable. Each cube connects to its mesh neighbors over UCIe (die-to-die) links arranged on the four cardinal sides: north, south, east, and west. The IO chiplets sit on one side of the SIP and provide the bridge to the host across PCIe.

The IO chiplet itself contains its own internal network. A host-facing PCIe endpoint passes traffic to a small NOC ("network on chip"); from there it can branch to a control-plane CPU that processes kernel-launch messages, or it can take the direct memory data path to the target cube's HBM controller. Providing a direct memory path that bypasses the control CPU was a deliberate choice: it keeps host-issued memory writes from paying control-plane overhead on the data path.

### CUBE

Each CUBE owns a 2D mesh of NOC routers and a set of attached resources: PEs, the cube-local SRAM scratchpad, the management CPU (M_CPU), and the HBM partition (split across multiple PE-private slices for bandwidth). The router mesh uses deterministic XY routing. Attached components do not connect to each other directly; they all sit on the router mesh, and every cube-internal transfer pays the mesh distance from source to destination.

The HBM partition is per-PE: each PE owns one HBM slice, and the controller exposes per-PE channels so that the same PE always addresses the same set of HBM channels.
This makes the local-HBM bandwidth from a PE to its own slice predictable, while accesses to another PE's slice, or to a different cube's slice, pay the mesh distance and any UCIe crossings.

### PE

A PE is not a monolithic core. Internally it is a set of nine sub-components, each modeling one stage of a request's flow: a small control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store engine that moves data between the on-PE scratchpad and the register file, a GEMM compute engine, a math compute engine, the tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-physical address translation, and an inter-PE collective queue (IPCQ). The scheduler decomposes higher-level operations into per-tile stage sequences, and tile tokens self-route from one sub-component to the next.

---

## Detailed Architecture

This section describes each modeled device-side component in turn. Components are listed in the alphabetical order used by the simulator's source tree.

### forwarding

The forwarding component is the generic routing relay used wherever a node only needs to apply a small processing overhead and pass the request to the next hop. NOC routers, conn nodes, and UCIe PHYs all reduce to this. Its first act on receiving a request is to apply the per-node overhead configured for it in the topology spec; after the overhead it simply hands the request to the next hop along the path.

The decision to share one implementation across these roles was made to keep the simulator's component set small without sacrificing modeling fidelity. Each instance still carries its own overhead and its own link bandwidth contributions, so different roles still produce different timing. What is shared is the dispatcher loop, not the parameter values.

### hbm_ctrl

The HBM controller is the terminal node for all memory traffic that reaches HBM. Internally it owns a number of pseudo channels, partitioned per-PE so that each PE addresses a deterministic subset.
When a request arrives, the controller first selects the pseudo channel from the target address, then enters a chunk loop that drains the requested size in fixed-size flits over the channel's bandwidth.

The chunk-loop pattern replaces an earlier all-at-once drain. The benefit is that the controller no longer presents a flit-aware fabric with a single bulk transfer; instead it emits flits at a paced rate matching the channel bandwidth, which makes cross-flow contention visible. The per-channel bandwidth budget is the configured total HBM bandwidth divided across the channel count.

### io_cpu

The IO_CPU is the control-plane processor sitting inside the IO chiplet. It receives kernel-launch messages from the host, decodes them, and dispatches per-cube launches to each target cube's management CPU. Pure memory operations bypass it entirely, taking the direct data path established inside the IO chiplet.

On receiving a kernel-launch message, the IO_CPU consults the message's shard list, which already names the target SIP, cube, and PE for each piece of the tensor argument, and forwards a per-cube launch to each cube the kernel needs to reach. This makes the IO_CPU a deterministic fan-out point: it does not decode physical addresses to route, it just follows the explicit per-shard targets it was handed.

### m_cpu

The M_CPU is the cube's management processor. It has two distinct roles: a control-plane fan-out point for kernel launches arriving from the IO chiplet, and a DMA endpoint for host-initiated memory writes that need to land in this cube's HBM. The control role forwards launches to the right PE control CPUs; the DMA role places the actual bytes into HBM through the router mesh.

The component model deliberately distinguishes the two roles because their routing differs: the control fan-out path uses command-kind links that do not appear on data-path routes, while the DMA path uses the same router mesh as PE-initiated DMA, with PE-internal nodes excluded.
The routing layer knows about both modes and selects the appropriate adjacency at request time.

### pcie_ep

The PCIe endpoint is the protocol boundary at the host-device edge. Its first act on each incoming request is to apply a configured protocol-processing overhead; after that it simply forwards. There is no internal queuing model, no retry, and no TLP-level fidelity; those are deliberately out of scope. The endpoint is bidirectional: host-to-device traffic (memory writes, kernel launches) flows one way, and device-side outbound traffic (cross-SIP collective sends) flows the other.

A more detailed PCIe model was considered and rejected. The simulator targets system-level latency comparisons; making the endpoint heavier with credit-management and retry logic would not improve the metrics being studied. The decision keeps the endpoint as the documented protocol-boundary node, named consistently so routing helpers can locate it by SIP and IO instance.

### pe_cpu

The PE control CPU is the entry point for kernel work arriving from the cube's management CPU. It receives kernel-launch messages, resolves the kernel function by name, and hands execution to the scheduler with the resolved tensor arguments. From the scheduler's point of view, the PE_CPU is the upstream source of high-level commands; from the rest of the system's point of view, the PE_CPU is where a kernel's execution begins on a given PE.

### pe_dma

The DMA engine on each PE has two distinct modes. In the standard PE pipeline it consumes tile tokens issued by the scheduler, acquires a read or write channel (modeled as a one-in-flight resource per direction), and moves the bytes to or from HBM through the mesh. In its collective mode it forwards send tokens for the PE's IPCQ into the fabric, snapshotting the source data at send time so later mutations cannot race the receiver's read.
Both modes share the same channel resources but differ in their downstream handling: one returns when the round-trip completes, the other dispatches fire-and-forget.

### pe_fetch_store

The fetch-store engine is the bridge between the on-PE scratchpad (TCM) and the register file. It does not run DMA; it only moves bytes internally. On receiving a tile-stage token it sends a short request to the TCM, waits for the bandwidth-serialized delay, and continues the pipeline. The split between this engine and the TCM lets the scratchpad model its own read/write bandwidth independently.

### pe_gemm

The GEMM engine is the matrix-multiply compute unit. Tile tokens arriving at this stage carry the per-tile dimensions, and the engine contributes a service time that charges one fused multiply-add per MAC in the tile. Composite operations (where the same tensor pair is streamed across many tiles) reuse the engine through the scheduler; the engine itself is stateless between tiles.

### pe_ipcq

The IPCQ (inter-PE collective queue) is each PE's collective-communication endpoint. It owns ring buffers that hold inbound messages from neighbor PEs and bookkeeping for send credits. Direction names ("N", "S", "E", "W" for cube-internal neighbors and "global_*" for cross-SIP neighbors) are resolved to physical peer endpoints by a neighbor table installed at process-group creation time. The component itself does not move bytes; it issues DMA tokens through the local PE_DMA, which performs the actual cross-PE transfer.

A key invariant is that the inbound terminal (where data lands at the receiver) pays the link bandwidth drain plus any cube-internal mesh hop to the slot's backing memory. This prevents IPCQ from silently outpacing raw DMA at large transfer sizes. Outbound sends are fire-and-forget; credit return is the only backpressure signal.

### pe_math

The math engine handles element-wise and reduction operations.
It consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`, `where`, etc.) and contributes a service time proportional to the number of elements processed. Like the GEMM engine it is stateless; chained epilogues (a sequence of math operations after a GEMM tile) are scheduled as separate stages.

### pe_mmu

The MMU has two roles, exposed through one component. As a node on the cube NOC it receives MMU-map and MMU-unmap messages and updates its internal page table, so that the runtime API can install virtual-to-physical mappings with measured fabric latency. As a utility object held inside the PE it offers synchronous translate calls to the PE's DMA and GEMM engines without taking simulator time itself; the calling engine pays any configured TLB overhead in its own process.

The page table supports multiple disjoint regions inside a single page, with later-write-wins semantics on overlap. This is a deliberate simulator stopgap to support parallelization policies that shard data at sub-page granularity without silent mis-routing through a real hardware MMU's one-PA-per-entry assumption. A real MMU does not work this way; the model documents this as a simplification.

### pe_scheduler

The scheduler is the sole dispatcher inside a PE. Simple commands are routed directly to the right engine. Composite commands generate a tile plan, and the resulting tile tokens are fed into the pipeline. Self-routing keeps the scheduler off the per-stage hot path: each engine, on finishing a stage, advances the token to the next stage's component itself, so the scheduler only does initial dispatch and completion tracking.

### pe_tcm

The TCM is the per-PE tightly-coupled scratchpad memory. It models time only, not data; the actual payload lives in the simulator's memory store. Read and write are independent channels: each is modeled as a one-in-flight resource, so same-direction requests serialize but a read and a write can overlap.
The bandwidth of each direction is configured separately and applied as bytes-over-bandwidth on each request.

The decision to keep read and write on separate channels was made because the PE pipeline's normal case overlaps fetch (read) and store (write). Collapsing them into a single shared channel would have artificially serialized that overlap and produced an incorrect bandwidth ceiling.

### sram

The cube SRAM is a per-cube scratchpad attached to one of the cube's routers. As a node it applies a configured access overhead, pays the link-bandwidth drain stamped on the incoming request, and sends a response on the reverse path. It is a terminal: it does not forward.

A second role is as one of three backing-memory tiers (TCM, SRAM, HBM) that an inter-PE collective slot can live in. When the slot lives in SRAM, the PE_DMA pays the slot read or write latency directly using the configured SRAM bandwidth and overhead; the SRAM component does not need to know about collective semantics. This separation keeps the SRAM component agnostic to the collective subsystem.

### tiling

The tile-plan generator is not a runtime component; it is a pure module of functions that take a problem shape (matrix dimensions, tile sizes) and produce an ordered list of tile-stage sequences. The scheduler consumes this list. Each tile's stage sequence depends on how its operands are staged: operands streamed from HBM produce DMA_READ stages, while operands already resident in TCM (because they were loaded eagerly upfront) skip them.

The plan generator is intentionally pure: given the same input it returns the same plan, with no simulator events created. This lets the rest of the system reason about tile sequences as data, and it makes the plan testable in isolation without simulator state. New plan variants (for example, K-major or DTensor-aware plans) can be added as new functions following the same shape.

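
A minimal sketch of what such a pure plan function might look like is shown below. Only the `DMA_READ` stage name comes from the text above; the function name `gemm_tile_plan`, the `TileStages` record, the other stage names, and the rule that the output is written back after the last K tile are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class TileStages(NamedTuple):
    tile: Tuple[int, int, int]    # (m, n, k) tile indices
    stages: Tuple[str, ...]       # ordered stage names for this tile


def _ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def gemm_tile_plan(M: int, N: int, K: int,
                   tile_m: int, tile_n: int, tile_k: int,
                   a_in_tcm: bool = False,
                   b_in_tcm: bool = False) -> List[TileStages]:
    """Return an ordered tile plan; no simulator state is read or written."""
    k_tiles = _ceil_div(K, tile_k)
    plan: List[TileStages] = []
    for m in range(_ceil_div(M, tile_m)):
        for n in range(_ceil_div(N, tile_n)):
            for k in range(k_tiles):
                stages: List[str] = []
                # Operands streamed from HBM need a DMA_READ stage;
                # operands already resident in TCM skip it.
                if not a_in_tcm:
                    stages.append("DMA_READ:A")
                if not b_in_tcm:
                    stages.append("DMA_READ:B")
                stages += ["FETCH", "GEMM"]
                # Assumed policy: write the output tile back after the last K step.
                if k == k_tiles - 1:
                    stages += ["STORE", "DMA_WRITE"]
                plan.append(TileStages((m, n, k), tuple(stages)))
    return plan
```

Because the function only builds tuples, two calls with the same arguments compare equal, which is the property that lets a plan be tested without constructing an engine.
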

---

## Implementation Decisions

This section collects cross-cutting decisions (algorithms, policies, schemes, and contracts) that span multiple components rather than living inside one.

### Address Scheme

Every physical address in the simulator decodes into a structured location. A fixed-width physical address carries the SIP id, the cube id within the SIP, a type discriminator (HBM vs PE-resource vs others), and a type-specific offset. HBM addresses additionally encode the per-PE slice offset so the controller can determine which PE owns the target slice without external lookup. The layout is deliberately reserved rather than packed-to-fit, so new sub-units can be added at the type-discriminator level without rewriting existing addresses.

On top of physical addressing, the simulator supports three address models that the runtime API selects between. Direct physical addressing is retained as a fallback. Virtual addressing, the current default, gives each tensor a contiguous virtual range at deployment, with the per-PE MMU translating per access. An alternative logical-address scheme remains a future option. The virtual-address path is what every current test path takes; the PA fallback is used by the MMU itself when no mapping exists for an address (a deliberate signal, not an error).

Tensor placement is represented as a list of physical-address shards, each tagged with target SIP, cube, and PE, plus a single tensor-wide virtual base. This means a kernel sees one virtual base for the whole tensor while the host driver and the engine still know exactly where each shard lives. Replicated tensors get per-cube local PA mappings; sharded tensors broadcast their mapping across cubes within a SIP.

### Routing, Distance & Helper API

Routing is policy-driven, deterministic, and topology-aware.
Given a source, a destination, and an intent (for example, PE-initiated DMA versus host-initiated memory write versus a generic component-to-component query), the routing layer picks the right path. The intent matters because different traffic types must avoid different categories of edges: PE-initiated DMA should not traverse command-only links; M_CPU DMA should not pass through PE-internal pipeline edges; cube-local transfers should not use the zero-distance UCIe bus that would otherwise look attractive to a shortest-path search. The routing layer therefore maintains four separate adjacency graphs at construction, each excluding a different category of edges, and picks the appropriate one per intent.

On top of the graphs sits a helper API that hides the topology's naming convention: callers ask for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or the HBM destination for a given physical address, and receive the corresponding node id. No component constructs node-id strings directly; if the naming convention ever changes, the change is local to the helper layer.

Path-finding itself uses Dijkstra with explicit per-edge weights (routing weight is allowed to differ from physical distance; for example, UCIe is configured to be routing-preferable). Tie-breaks follow insertion order, which keeps results deterministic. A path query between unreachable nodes raises an error rather than returning an empty path, surfacing topology errors immediately.

### Memory Semantics and Local-HBM Bandwidth

A PE accessing its own HBM slice through its own cube's NOC must see the full local HBM bandwidth; that is the model's intent. Memory traffic accumulates latency from per-component overhead and bytes-over-link-bandwidth serialization along the path, but the controller does not throttle below the slice's allotted bandwidth.
Cross-PE-slice accesses inside the same cube, cross-cube accesses through UCIe, and cross-SIP accesses through PCIe each pay progressively more overhead as the path grows.

### Topology Compilation, Diagrams & Builder Algorithms

Topology is configurable, not hardcoded. The simulator reads a YAML spec, compiles it into a flat graph of nodes and edges plus four view projections at different abstraction levels (system, SIP, cube, PE), and uses the compiled graph as the single source for both execution and visualization. Distance metadata used by routing is extracted at compile time so that diagrams and routing decisions agree by construction.

Diagrams are derived artifacts of the compiled topology. The visualizer produces one SVG per view at the appropriate abstraction level; nothing in the diagrams is hand-drawn or hand-positioned. Distance-aware layout rules place nodes in the diagrams using the same coordinates that routing uses to compute distance, so a diagram that "looks wrong" is a signal that the topology itself has a problem, not the visualizer.

Inside a cube the router mesh is generated automatically. PE corner positions are fixed by convention; the relay-column algorithm inserts additional grid columns whenever the gap between adjacent PE columns would exceed a tunable maximum. HBM occupies a central exclusion zone: router slots inside the zone are deliberately empty, since HBM controllers attach as separate named nodes. M_CPU and SRAM attach to the nearest router by Euclidean distance from their configured placement coordinates, and UCIe physical lanes distribute along the boundary rows and columns. The whole mesh is cached beside the topology spec and invalidated only when one of a small set of layout-relevant fields changes.

### Tensor Deployment and Allocation

Tensor deployment in the runtime API produces a list of physical-address shards plus a single tensor-wide virtual base.
The host allocator walks the data-parallelism policy, computes per-shard placement, and emits the per-shard physical addresses through the per-PE allocators. No separate "allocate then later attach to a device" RPC exists; allocation and deployment are a single operation that produces a deployed tensor handle.

### Memory Allocator Algorithms

Each per-PE allocator owns two channels (HBM slice and TCM), each backed by an offset-keyed free-list. Allocation is first-fit; freeing coalesces with adjacent free blocks. A device-wide virtual allocator sits above the per-PE allocators, aligns requests up to the configured page size, and coalesces on free in the same way. The trade-off is explicit: first-fit is simpler and cheaper than best-fit or buddy allocation, and the simulator's workload is stack-like enough (deploy / kernel / free in matched order) that fragmentation is not a practical concern.

Allocation failure raises rather than silently returning a partial result. A partial tensor reaching the engine would route over wrong PAs and silently corrupt simulator output, so an out-of-memory signal is preferred. The free path trusts its caller to pass back exactly what was allocated; the small risk of caller error in exchange for fast common-case freeing is documented as a deliberate trade.

### Kernel Execution and Host-Device Messaging

Kernel execution decomposes into a small set of messages that travel the device graph. The host issues a single kernel-launch message; the IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the PE CPU resolves the kernel and runs it through the scheduler. Completion flows back along the same path in reverse, gated by per-shard completion tracking. Memory operations follow the same pattern: a memory write or read travels as one message that the engine routes to the right HBM controller, with a response taking the reverse path.

The schema between the host and the device-side IO CPU is PA-first and shard-tagged.
Every byte of host-issued payload arrives with an explicit target SIP, cube, PE, and physical address. The IO_CPU does not decode addresses to derive placement; placement is named explicitly by the shard list. This makes the host-device interface deterministic and keeps the routing helper free of host-derived intent.

### CLI Surface and Semantics

The command-line interface exposes four subcommands. A bench runner loads a topology, resolves a registered benchmark by name or index, and runs it on a selected device. A bench-listing command enumerates the registered benchmarks. A probe utility runs a fixed catalog of traffic patterns through the engine for latency and bandwidth verification. A web viewer renders the topology in a browser. A benchmark instance is always single-device by convention; multi-SIP collective work happens inside the benchmark through the launcher abstraction, not by multiplexing the CLI.

### Component Port and Wire Fabric Model

Every modeled component exposes input and output ports, and every edge in the topology connects an output port on one component to an input port on another. Bandwidth and propagation delay are properties of the wire between ports, not of the component endpoints. A component's responsibility is to apply its configured per-node overhead and either forward to the next hop or terminate; the wire charges the byte-over-bandwidth serialization separately.

This separation lets components be swapped behind their port interface without changing the rest of the model, and it keeps bandwidth contention at the wire level where multiple components may contend for the same edge. Future component models can refine internal behavior without disturbing the fabric.

### Two-Pass Data Execution

The simulator runs in two passes. The first pass, fast and always on, runs the discrete-event engine and records every data operation in an operation log with timestamps, component identifiers, and per-operation parameters.
The second pass, optional and opt-in, replays the log against an in-memory tensor store to produce actual numerical results. Tests that only need timing skip the second pass; tests that need to verify correctness opt in.

The split lets the timing engine remain unconcerned with data semantics: kernels move handles around, not bytes. The replay phase recovers data semantics from the recorded operations, in their original time order with a small set of secondary-sort rules. The op-log records carry enough metadata (input snapshots for compute operations, source snapshots for cross-component copies) that the replay phase cannot mis-order with respect to in-flight mutations.

### Sim-engine Op Log and Memory Store Schemas

The operation log holds typed records with seven fields each: start and end timestamps, the component that issued the operation, an operation kind ("memory", "gemm", "math"), an operation name, a parameter dictionary, and a (currently unused) dependency list. Records are kept in stable timestamp order. The parameter dictionary varies by operation: a DMA read carries source address and byte count; a GEMM carries operand shapes, dtypes, and address spaces; a math operation carries input addresses and snapshots.

The companion memory store is a two-level dictionary keyed by address space ("hbm", "tcm", "sram", others) and integer address. Reads and writes are reference-based, with no copy by default, so callers wanting to detach a snapshot must copy explicitly. This is deliberate: the engine-internal snapshot paths copy at well-defined points (math input capture, HBM source capture for DMA writes, inbound collective copies) and downstream replay code therefore sees stable data even when slot or scratch addresses are reused by later operations.

### 2D Grid Program Identity

Inside a kernel the program identity is two-dimensional. The first axis corresponds to the PE index within a cube; the second corresponds to the cube index within a SIP.
Together they let a kernel address its position both within its cube and within the larger system without needing to know the full topology. Total program counts along each axis are exposed symmetrically.

### Parallelism: SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module

The launcher model treats each SIP as one rank. Inside a process the launcher spawns one greenlet per SIP rank; the rank is bound to its greenlet so that any code running in that worker sees the right distributed-style rank. This is a deliberately PyTorch-compatible shape: a benchmark looks like a small DDP training script that initializes a process group and spawns workers, each of which runs the same body.

Data-parallelism policy lives in a single object that names the sharding strategy along the cube axis (replicate, row-wise, column-wise) and along the PE axis (same set of values), and optionally overrides the number of cubes or PEs participating. The policy is intra-device: it does not cross SIP boundaries. SIP-level parallelism is the launcher's responsibility, and the two axes compose orthogonally.

A Megatron-style tensor-parallel API sits on top of the launcher and the DP policy. Layer-level building blocks (column-parallel linear, row-parallel linear, all-reduce) name their sharding intent in terms the launcher and the placement policy can compose. This is the layer that bench code typically writes against.

For collective operations the runtime exposes a PyTorch-compatible distributed backend named "ahbm". On process-group initialization the backend loads the configured collective-algorithm module, resolves the world size (in priority order: explicit ccl.yaml override, then the defaults section, then the topology SIP count), imports the algorithm module dynamically, derives the SIP topology kind, and pushes the inter-PE neighbor table to every participating PE. From that point on, an all-reduce call dispatches the algorithm's kernel function across all ranks.
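
The world-size priority chain above can be sketched as a small pure function. The configuration layout is an assumption: the text names an explicit ccl.yaml override and a defaults section but not their keys, so the top-level `world_size` key and the `defaults.world_size` key used here are hypothetical, as is the function name.

```python
from typing import Any, Mapping


def resolve_world_size(ccl_config: Mapping[str, Any], topology_sip_count: int) -> int:
    """Resolve world size: explicit override, then defaults, then topology SIP count.

    `ccl_config` stands in for an already-parsed ccl.yaml; both keys read
    below are hypothetical placeholders for whatever the real file uses.
    """
    explicit = ccl_config.get("world_size")
    if explicit is not None:
        return int(explicit)

    defaults = ccl_config.get("defaults") or {}
    default_value = defaults.get("world_size")
    if default_value is not None:
        return int(default_value)

    # Nothing configured: fall back to one rank per SIP in the topology.
    return topology_sip_count
```

Keeping the resolution in one function means the fallback order is stated in a single place and can be unit-tested without initializing a process group.
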

A collective-algorithm module is a Python module with a small, fixed contract. It exposes topology-kind integer constants, a name-to-kind mapping for the YAML configuration, a kernel-arguments builder, and a kernel function, with the kernel function aliased to the name `kernel` so the backend can find it generically. The kernel itself takes the tensor pointer, the per-cube element count, cube mesh width and height, the world size, the current rank, and the SIP topology dimensions; the backend appends those last four arguments automatically. New collectives slot in by adding a new module that follows this shape.

The combination is deliberate: bench authors get to write code that looks like a regular distributed training script, while the launcher, backend, and placement policies behind it remain free to redirect work to the right SIP, cube, and PE without exposing topology to the kernel.

### IPCQ Direction Addressing

Inside a collective algorithm, peer PEs are named by direction: "N", "S", "E", "W" for cube-internal neighbors, and "global_*" for cross-SIP neighbors. Direction addressing is the addressing scheme: the algorithm names a direction, the IPCQ neighbor table installed at process-group time resolves the direction to the peer endpoint's physical-address coordinates, and the PE_DMA performs the actual transfer. The algorithm itself does not see PA arithmetic; direction is the user-facing handle.

### Intercube All-Reduce

The default all-reduce algorithm uses a center-rooted bidirectional phase inside each SIP's cube mesh followed by an inter-SIP exchange on the mesh's root cube, and then a bidirectional broadcast back out. Center-rooting roughly halves the worst-case hop count inside the cube mesh compared with a corner-rooted walk. The inter-SIP exchange itself follows the configured SIP topology (ring, torus, or non-wrapping mesh), selected at runtime through the SIP-topology kind integer the backend passes to the kernel.
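
The hop-count benefit of center-rooting can be checked with a short calculation. The sketch below assumes hops are counted as Manhattan distance on the cube mesh and that the center root sits at `(width // 2, height // 2)`; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Tuple


def worst_case_hops(width: int, height: int, root: Tuple[int, int]) -> int:
    """Largest Manhattan distance from `root` to any cube in a width x height mesh."""
    rx, ry = root
    return max(abs(x - rx) + abs(y - ry)
               for x in range(width)
               for y in range(height))


def corner_vs_center(width: int, height: int) -> Tuple[int, int]:
    """Return (corner-rooted, center-rooted) worst-case hop counts."""
    corner = worst_case_hops(width, height, (0, 0))
    center = worst_case_hops(width, height, (width // 2, height // 2))
    return corner, center
```

Under these assumptions `corner_vs_center(4, 4)` gives `(6, 4)` for the default 4×4 mesh and `corner_vs_center(5, 5)` gives `(8, 4)`: the saving is exactly one half for odd mesh sizes and somewhat less for even ones.
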
### Evaluation Harnesses

The all-reduce evaluation harness drives correctness and the latency/buffer-kind sweeps through the public distributed path (initialize process group, spawn workers, call all-reduce) rather than the lower-level engine interface. A shared helper module factors out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM, HBM) and the inter-SIP topology variants. The plots produced by the harness are part of its output contract; the harness regenerates them on demand.

The GEMM evaluation harness is split into two layers. A heavy shape-and-variant sweep lives as a manual script: it runs the same composite-GEMM benchmark across many shapes and operand-staging variants, harvests the resulting op-log, and writes a JSON summary. A faster figure-generation layer lives in the test suite and consumes that JSON to render plots. The split keeps the heavy data generation explicit and out of the regular test path.

### Bench Module Contract

Adding a new benchmark requires only dropping a file into the benchmarks directory. The file registers one or more benchmark functions through a small decorator that takes a kebab-case name and a human-readable description. The decorator is the registration mechanism; there is no separate manifest.

Each benchmark function takes one argument, conventionally named `torch`, which is the runtime context exposing tensor allocation, kernel launch, distributed APIs, and process-spawning. The function name is `run` by convention. A benchmark must submit at least one operation, or the runner returns an error.

A benchmark instance is single-device by default; when a benchmark is collective, it uses the distributed-process-spawn pattern internally: one worker greenlet per rank, with each worker binding to its rank. Multi-device benchmark patterns outside that shape are not supported.
### Kernel-side `tl.*` API

Inside a kernel function, the `tl` argument exposes the kernel-side API in a shape that mirrors the conventions of established GPU-kernel languages. The categories are: reference handles that name HBM data without issuing DMA; data movement (load, store) that does issue DMA; GEMM and math compute (dot, composite, the unary and binary math operations, reductions); index and scalar helpers (program identity, range-builders); metadata-only operations like transpose; and the collective primitives (send, receive, non-blocking receive). Tensor handles support arithmetic operators via a thread-local active context so kernel code reads naturally.

The API supports two execution modes. A command-list mode records operations into a list without consuming simulator time, which is useful for inspection and lightweight tests. A greenlet-driven mode runs the kernel as a child greenlet that switches back to the simulator on each `tl.*` call; the simulator drives the event scheduler and hands real data back to the kernel as DMA reads complete. The two modes share the same surface; the kernel does not know which one it is running under.

### Probe Subcommand

The probe utility runs three families of traffic patterns through the engine (host-to-device writes at increasing hop counts, device-to-host reads at increasing hop counts, and PE-initiated DMA across the cube mesh) and reports actual latency, the analytical formula breakdown, effective bandwidth, bottleneck bandwidth, and utilization. A fixed reference size is used for the summary table; a separate utilization-versus-size sweep covers a logarithmic range of transfer sizes. Each case runs in its own engine instance so cases do not perturb each other.

The probe also checks a small set of invariants automatically: monotonic latency increase with hop count, device-to-host latency at least as large as host-to-device for the same hop count, and a faster best-case path than worst-case for cross-cube PE DMA.
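
The three invariants listed above can be expressed as a small checker over measured latencies. This is a sketch under assumptions, not the probe's code: the function name, the dictionary inputs keyed by hop count, the nanosecond unit, and the reading of "monotonic increase" as strictly increasing are all hypothetical choices made for the example.

```python
from typing import Dict, List


def check_probe_invariants(
    h2d_ns: Dict[int, float],
    d2h_ns: Dict[int, float],
    pe_dma_best_ns: float,
    pe_dma_worst_ns: float,
) -> List[str]:
    """Return human-readable failure messages; an empty list means all pass."""
    failures: List[str] = []

    # Invariant 1: latency rises with hop count (assumed strict here).
    for label, series in (("host-to-device", h2d_ns), ("device-to-host", d2h_ns)):
        hops = sorted(series)
        for prev, cur in zip(hops, hops[1:]):
            if series[cur] <= series[prev]:
                failures.append(
                    f"{label}: latency not increasing from {prev} to {cur} hops"
                )

    # Invariant 2: device-to-host is at least as slow as host-to-device
    # for every hop count measured in both directions.
    for hop in sorted(set(h2d_ns) & set(d2h_ns)):
        if d2h_ns[hop] < h2d_ns[hop]:
            failures.append(f"device-to-host faster than host-to-device at {hop} hops")

    # Invariant 3: the best-case cross-cube PE DMA path beats the worst case.
    if not pe_dma_best_ns < pe_dma_worst_ns:
        failures.append("PE DMA best-case path is not faster than worst-case path")

    return failures
```

Returning messages instead of raising on the first violation lets a caller report every broken invariant in one run.
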
Failures print prominently. The output is meant for human reading; automated parsing should not depend on column widths or whitespace.

---

This document summarizes 46 architecture decisions captured during the first half of 2026. It is regenerated mechanically from the decision corpus; sources are recorded in HTML comments throughout.