adr: add INDEX.md (auto-generated by tools/generate_adr_index.py)

Adds a section-based table of contents for the 46-ADR corpus, mirroring
the /report skill's classification (Design Principles / High-level
Architecture / Detailed Architecture by component / Implementation
Decisions by topic). Generated for both docs/adr/ (EN titles) and
docs/adr-ko/ (KO titles) from one tool.

tools/generate_adr_index.py:
- Single CLASSIFICATION dict per ADR: add an entry when introducing a
  new ADR; the script fails loud if any file is missing from the table.
- DETAILED_COMPONENTS lists each builtin component and the ADR(s) that
  cover it (ADR-0014 appears under six PE engines; ADR-0023 under
  pe_dma + pe_ipcq).
- Accepts both ":" and "—" title separators (matching ADR-0033's
  existing format).
- --check mode for CI: exits 1 if INDEX.md is stale.

Also includes the docs/report/architecture-2026-1H.md generated by the
prior /report write (the public-facing architecture document; 836
lines, 76 source-attribution comments).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 11:15:37 -07:00
parent bd49c93703
commit e33e76f2d1
4 changed files with 1517 additions and 0 deletions
@@ -0,0 +1,836 @@
+# KernBench — Architecture Design Document
+*2026 1H*
+
+KernBench is a system-level, discrete-event simulator for AI-accelerator
+chiplet systems. It models the data-movement and control paths across
+the full hardware hierarchy and reports end-to-end execution latency
+for kernels dispatched to the device's compute units.
+
+This document is a public summary of the architecture as designed and
+implemented in the first half of 2026. It assumes no prior knowledge of
+the simulator's internal documents; terms specific to the system are
+defined on first use.
+
+---
+
+## Design Principles
+
+KernBench is grounded in two foundational commitments: every measured
+latency must trace to explicit, modeled events on the simulator's graph,
+and every behavioral claim must be verifiable through tests that target
+spec-level invariants rather than incidental implementation details.
+
+<!-- src: ADR-0013 Context, Decision -->
+The testing posture is verification-driven. Tests are written to
+validate the architectural contracts that the simulator exposes —
+correct routing, deterministic results, monotonic latency under
+increasing hop counts — rather than to mirror the call graph of the
+implementation. Two phases coexist: a fast timing phase that exercises
+the simulator's discrete-event engine and produces a log of operations
+with timestamps, and an optional data-replay phase that uses that log
+to compute real numerical results. Tests can target either phase.
+
+<!-- src: ADR-0033 Context, Decision -->
+The latency model is intentionally abstract rather than
+cycle-accurate. Each modeled node contributes a configurable per-node
+overhead, each link contributes wire delay plus byte-over-bandwidth
+serialization, and each terminal service contributes its own service
+time. The simulator does not attempt to reproduce cache coherence
+protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
+correctness; those are explicitly out of scope. The aim is a
+simulator that compares system-level configurations meaningfully and
+deterministically, not one that reproduces microarchitectural detail.
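As a minimal sketch of the latency model described above, the following Python sums per-node overhead, wire delay, bytes-over-bandwidth serialization, and terminal service time along a path. The `Hop` type, the function name, the units, and the assumption that every hop charges serialization independently are illustrative; none of them is taken from the simulator's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node plus the link used to reach it (hypothetical type)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # propagation delay of the incoming link
    bandwidth_bytes_per_ns: float  # link bandwidth (1 GB/s == 1 byte/ns)


def path_latency_ns(hops: list[Hop], size_bytes: int, service_ns: float) -> float:
    """Sum every modeled latency contribution along one path."""
    total = service_ns  # terminal service time
    for hop in hops:
        serialization_ns = size_bytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
    return total


# Two hops with different link bandwidths, then a terminal service.
hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 32.0)]
latency = path_latency_ns(hops, size_bytes=4096, service_ns=10.0)
```

Because every term is an explicit parameter, two configurations can be compared by changing only the hop list, which matches the stated aim of comparing system-level configurations.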
+
+<!-- src: ADR-0033 Decision, Consequences -->
+Determinism is a hard requirement. Given identical inputs — topology,
+routing policy, and request stream — the simulator must produce
+identical outputs, hop traces included. This rules out reliance on
+unordered set iteration on the critical path and forces every latency
+contribution to come from an explicitly scheduled event on a modeled
+component or link. There are no implicit waits, no hardcoded magic
+delays, and no shortcuts that bypass the modeled graph.
+
+---
+
+## High-level Architecture
+
+<!-- src: ADR-0003 Context, Decision -->
+The simulated system is a four-level hierarchy. A **Tray** holds one or
+more **SIPs** (system-in-package), each containing a 2D mesh of
+**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
+host. Each CUBE contains a regular grid of **PEs** (processing
+elements) plus its own attached resources — high-bandwidth memory
+(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
+itself is a composite of nine sub-components rather than a monolithic
+core. This hierarchy is fixed; the parameters along each axis (counts,
+mesh dimensions, link widths) are configurable through the topology
+spec.
+
+<!-- src: ADR-0007 Context, Decision -->
+A clean separation runs along the request flow. A **runtime API** at
+the top is the host-facing surface; it exposes tensor and kernel
+operations, owns host-side allocation metadata, and is topology-
+agnostic — it does not route or fan out. Below it the **simulation
+engine** decomposes runtime operations into discrete graph requests
+(memory writes, memory reads, kernel launches, MMU map installs) and
+schedules events deterministically. At the bottom, **components** model
+device behavior on a graph of nodes connected by links; they
+implement the actual latency contributions and pass requests along.
+No component reaches up into the runtime API, and no runtime call
+shortcuts the engine.
+
+<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
+
+### Tray
+
+<!-- src: ADR-0003 Decision -->
+The Tray is the outermost boundary. It owns the host CPU on one side
+and one or more SIPs on the other, connected through a fabric switch.
+For collective communication that must traverse multiple SIPs, the
+fabric switch acts as the common rendezvous: device-side outbound
+traffic from one SIP routes through the switch and back into the
+target SIP's IO chiplet.
+
+### SIP
+
+<!-- src: ADR-0003 Decision, ADR-0017 Context -->
+A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
+default topology used by the simulator is a 4×4 cube mesh; the
+mesh dimensions are configurable. Each cube connects to its mesh
+neighbors over UCIe (die-to-die) links arranged on its four cardinal
+sides: north, south, east, and west. The IO
+chiplets sit on one side of the SIP and provide the bridge to the host
+across PCIe.
+
+<!-- src: ADR-0016 Context, Decision -->
+The IO chiplet itself contains its own internal network. A
+host-facing PCIe endpoint passes traffic to a small NOC ("network on
+chip"); from there it can branch to a control-plane CPU that processes
+kernel-launch messages, or it can take the direct memory data path to
+the cube's HBM controller. The direct memory path bypasses the
+control CPU deliberately, so that host-issued memory writes do not pay
+control-plane overhead on the data path.
+
+### CUBE
+
+<!-- src: ADR-0017 Decision -->
+Each CUBE owns a 2D mesh of NOC routers and a set of attached
+resources: PEs, the cube-local SRAM scratchpad, the management CPU
+(M_CPU), and the HBM partition (split across multiple PE-private
+slices for bandwidth). The router mesh uses deterministic XY routing.
+Attached components do not connect to each other directly — they all
+sit on the router mesh, and every cube-internal transfer pays the
+mesh distance from source to destination.
+
+<!-- src: ADR-0017 Decision -->
+The HBM partition is per-PE: each PE owns one HBM slice, and the
+controller exposes per-PE channels so that the same PE always
+addresses the same set of HBM channels. This makes the local-HBM
+bandwidth from a PE to its own slice predictable, while accesses to
+another PE's slice — or a different cube's slice — pay the mesh
+distance and any UCIe crossings.
+
+### PE
+
+<!-- src: ADR-0014 Context, Decision -->
+A PE is not a monolithic core. Internally it is a set of nine
+sub-components, each modeling one stage of a request's flow: a small
+control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
+engine that moves data between the on-PE scratchpad and the register
+file, a GEMM compute engine, a math compute engine, the tightly-
+coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
+physical address translation, and an inter-PE collective queue
+(IPCQ). The scheduler decomposes higher-level operations into per-tile
+stage sequences, and tile tokens self-route from one sub-component
+to the next.
+
+<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
+
+---
+
+## Detailed Architecture
+
+This section describes each modeled device-side component in turn.
+Components are listed in the alphabetical order used by the
+simulator's source tree.
+
+### forwarding
+
+<!-- src: ADR-0037 Context, Decision -->
+The forwarding component is the generic routing relay used wherever a
+node only needs to apply a small processing overhead and pass the
+request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
+reduce to this. Its first act on receiving a request is to apply the
+per-node overhead configured for it in the topology spec; after the
+overhead it simply hands the request to the next hop along the path.
+
+<!-- src: ADR-0037 Decision, Consequences -->
+The decision to share one implementation across these roles was made
+to keep the simulator's component set small without sacrificing
+modeling fidelity. Each instance still carries its own overhead and
+its own link bandwidth contributions, so different roles still produce
+different timing. What is shared is the dispatcher loop, not the
+parameter values.
+
+### hbm_ctrl
+
+<!-- src: ADR-0034 Context, Decision -->
+The HBM controller is the terminal node for all memory traffic that
+reaches HBM. Internally it owns a number of pseudo channels, partitioned
+per-PE so that each PE addresses a deterministic subset. On a request
+arrival the controller first selects the right pseudo channel from the
+target address, then enters a chunk-loop that drains the requested
+size in fixed-size flits over the channel's bandwidth.
+
+<!-- src: ADR-0034 Decision, Consequences -->
+The chunk-loop pattern replaces an earlier all-at-once drain. The
+benefit is that the controller no longer presents a flit-aware fabric
+with a single bulk transfer; instead it emits flits at a paced rate
+matching the channel bandwidth, which makes cross-flow contention
+visible. Each channel's bandwidth budget is the configured total HBM
+bandwidth divided by the channel count.
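The chunk-loop can be sketched as a paced drain. The function names, the flit size, and the bandwidth figures below are assumptions chosen for illustration; the sketch also lets the final flit be shorter than the fixed size when the request is not a multiple of it, which the text does not specify.

```python
def channel_bandwidth(total_bw_bytes_per_ns: float, num_channels: int) -> float:
    """Per-channel budget: total HBM bandwidth split evenly across channels."""
    return total_bw_bytes_per_ns / num_channels


def drain_schedule(size_bytes: int, flit_bytes: int, bw_bytes_per_ns: float,
                   start_ns: float = 0.0) -> list[tuple[float, int]]:
    """Return (completion_time_ns, flit_size) for each flit of one request."""
    schedule = []
    now = start_ns
    remaining = size_bytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)
        now += chunk / bw_bytes_per_ns  # pace each flit at channel bandwidth
        schedule.append((now, chunk))
        remaining -= chunk
    return schedule


bw = channel_bandwidth(total_bw_bytes_per_ns=1024.0, num_channels=16)
flits = drain_schedule(size_bytes=160, flit_bytes=64, bw_bytes_per_ns=bw)
```

Emitting one entry per flit, rather than one entry per request, is what gives a flit-aware fabric the chance to interleave flits from competing flows.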
+
+### io_cpu
+
+<!-- src: ADR-0036 Context, Decision -->
+The IO_CPU is the control-plane processor sitting inside the IO chiplet.
+It receives kernel-launch messages from the host, decodes them, and
+dispatches per-cube launches to the cube's management CPU. Pure memory
+operations bypass it entirely, taking the direct data path established
+inside the IO chiplet.
+
+<!-- src: ADR-0036 Decision -->
+On receiving a kernel-launch message, the IO_CPU consults the message's
+shard list — which already names the target SIP, cube, and PE for each
+piece of the tensor argument — and forwards a per-cube launch to each
+cube the kernel needs to reach. This makes the IO_CPU a deterministic
+fan-out point: it does not decode physical addresses to route, it just
+follows the explicit per-shard targets it was handed.
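The fan-out described above can be sketched as a grouping step over the shard list. The `Shard` fields and the `fan_out` function are hypothetical names; the only behavior taken from the text is that targets come from the shards themselves, not from address decoding.

```python
from typing import NamedTuple


class Shard(NamedTuple):
    """One piece of a tensor argument with its explicit target (hypothetical)."""
    sip: int
    cube: int
    pe: int
    pa: int


def fan_out(shards: list[Shard]) -> dict[tuple[int, int], list[Shard]]:
    """Group shards by (sip, cube); each key corresponds to one per-cube launch.

    Python dicts preserve insertion order, so the launch order is the
    first-seen order of cubes in the shard list and is deterministic.
    """
    launches: dict[tuple[int, int], list[Shard]] = {}
    for shard in shards:
        launches.setdefault((shard.sip, shard.cube), []).append(shard)
    return launches


shards = [Shard(0, 1, 0, 0x1000), Shard(0, 0, 2, 0x2000), Shard(0, 1, 3, 0x3000)]
launches = fan_out(shards)
```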
+
+### m_cpu
+
+<!-- src: ADR-0035 Context, Decision -->
+The M_CPU is the cube's management processor. It owns two distinct
+roles: as a control-plane fan-out point for kernel launches arriving
+from the IO chiplet, and as a DMA endpoint for host-initiated memory
+writes that need to land in this cube's HBM. The control role
+forwards launches to the right PE control CPUs; the DMA role places
+the actual bytes into HBM through the router mesh.
+
+<!-- src: ADR-0035 Decision -->
+The component model deliberately distinguishes the two roles because
+their routing differs: the control fan-out path uses command-kind
+links that do not appear on data-path routes, while the DMA path uses
+the same router mesh as PE-initiated DMA, with PE-internal nodes
+excluded. The routing layer knows about both modes and selects the
+appropriate adjacency at request time.
+
+### pcie_ep
+
+<!-- src: ADR-0038 Context, Decision -->
+The PCIe endpoint is the protocol boundary at the host-device edge.
+Its first act on each incoming request is to apply a configured
+protocol-processing overhead; after that it simply forwards. There is
+no internal queuing model, no retry, and no TLP-level fidelity — those
+are deliberately outside scope. The endpoint is bidirectional: host →
+device traffic (memory writes, kernel launches) flows one way, and
+device-side outbound traffic (cross-SIP collective sends) flows the
+other.
+
+<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
+A more detailed PCIe model was considered and rejected. The simulator
+targets system-level latency comparisons; adding credit-management
+and retry logic to the endpoint would not improve the metrics being
+studied. The decision keeps the endpoint as the
+documented protocol-boundary node, named consistently so routing
+helpers can locate it by SIP and IO instance.
+
+### pe_cpu
+
+<!-- src: ADR-0014 Decision -->
+The PE control CPU is the entry point for kernel work arriving from
+the cube's management CPU. It receives kernel-launch messages, resolves
+the kernel function by name, and hands execution to the scheduler with
+the resolved tensor arguments. From the scheduler's point of view, the
+PE_CPU is the upstream source of high-level commands; from the rest
+of the system's point of view, the PE_CPU is where a kernel's
+execution begins on a given PE.
+
+### pe_dma
+
+<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
+The DMA engine on each PE has two distinct modes. In the standard PE
+pipeline it consumes tile tokens issued by the scheduler, acquires a
+read or write channel (modeled as a one-in-flight resource per
+direction), and runs the bytes to or from HBM through the mesh. In
+its collective mode it forwards send tokens for the cube's IPCQ into
+the fabric, snapshotting the source data at send time so later
+mutations cannot race the receiver's read. Both modes share the same
+channel resources but differ in their downstream handling — one
+returns when the round-trip completes, the other dispatches
+fire-and-forget.
+
+### pe_fetch_store
+
+<!-- src: ADR-0014 Decision -->
+The fetch-store engine is the bridge between the on-PE scratchpad
+(TCM) and the register file. It does not run DMA; it only moves bytes
+internally. On receiving a tile-stage token it sends a short request
+to the TCM, waits for the bandwidth-serialized delay, and continues
+the pipeline. The split between this engine and the TCM lets the
+scratchpad model its own read/write bandwidth independently.
+
+### pe_gemm
+
+<!-- src: ADR-0014 Decision -->
+The GEMM engine is the matrix-multiply compute unit. Tile tokens
+arriving at this stage carry the per-tile dimensions, and the engine
+contributes a service time that covers the fused multiply-add (MAC)
+operations in the tile. Composite operations (where the same tensor pair is
+streamed across many tiles) reuse the engine through the scheduler;
+the engine itself is stateless between tiles.
+
+### pe_ipcq
+
+<!-- src: ADR-0023 Context, Decision -->
+The IPCQ (inter-PE collective queue) is each PE's
+collective-communication endpoint. It owns ring buffers that hold
+inbound messages from neighbor PEs and bookkeeping for send credits.
+Direction names ("N", "S", "E", "W" for cube-internal neighbors and
+"global_*" for cross-SIP neighbors) are resolved to physical peer
+endpoints by a neighbor table installed at process-group creation
+time. The component itself does not move bytes — it issues DMA tokens
+through the local PE_DMA, which performs the actual cross-PE
+transfer.
+
+<!-- src: ADR-0023 Decision, Consequences -->
+A key invariant is that the inbound terminal — where data lands at
+the receiver — pays the link bandwidth drain plus any cube-internal
+mesh hop to the slot's backing memory. This prevents IPCQ from
+silently outpacing raw DMA at large transfer sizes. Outbound sends
+are fire-and-forget; credit return is the only backpressure signal.
+
+### pe_math
+
+<!-- src: ADR-0014 Decision -->
+The math engine handles element-wise and reduction operations. It
+consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
+`where`, etc.) and contributes a service time proportional to the
+number of elements processed. Like the GEMM engine it is stateless;
+chained epilogues (a sequence of math operations after a GEMM tile)
+are scheduled as separate stages.
+
+### pe_mmu
+
+<!-- src: ADR-0039 Context, Decision -->
+The MMU has two roles, exposed through one component. As a node on
+the cube NOC it receives MMU-map and MMU-unmap messages and updates
+its internal page table, so that the runtime API can install
+virtual-to-physical mappings with measured fabric latency. As a
+utility object held inside the PE it offers synchronous translate
+calls to the PE's DMA and GEMM engines without taking simulator time
+itself; the calling engine pays any configured TLB overhead in its
+own process.
+
+<!-- src: ADR-0039 Decision, Alternatives Considered -->
+The page table supports multiple disjoint regions inside a single
+page, with later-write-wins semantics on overlap. This is a deliberate
+simulator stopgap: it lets parallelization policies shard data at
+sub-page granularity, which a real hardware MMU's one-PA-per-entry
+page table would silently mis-route. A real MMU does not work this
+way; the model documents this as a simplification.
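One way to picture the sub-page regions with later-write-wins semantics is the sketch below. The class, the 4 KiB page size, and the choice to return `None` for an unmapped address are assumptions for illustration; how the simulator stores regions and handles a missing mapping is not shown in this document.

```python
from typing import Optional

PAGE_SIZE = 4096  # assumed page size, not taken from the source


class PageTable:
    """Page table that allows several regions inside one page (hypothetical)."""

    def __init__(self) -> None:
        # page number -> list of (start_offset, end_offset, pa_base)
        self._pages: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, size: int, pa: int) -> None:
        """Install a mapping for [va, va + size) inside a single page."""
        page, start = divmod(va, PAGE_SIZE)
        if start + size > PAGE_SIZE:
            raise ValueError("region crosses a page boundary")
        self._pages.setdefault(page, []).append((start, start + size, pa))

    def translate(self, va: int) -> Optional[int]:
        """Return the PA for va, or None when no region covers it."""
        page, offset = divmod(va, PAGE_SIZE)
        # Scan newest-first so a later mapping wins where regions overlap.
        for start, end, pa in reversed(self._pages.get(page, [])):
            if start <= offset < end:
                return pa + (offset - start)
        return None


table = PageTable()
table.map(0x1000, 2048, 0xA000)  # first half of the page
table.map(0x1800, 2048, 0xB000)  # second half, disjoint from the first
table.map(0x1400, 1024, 0xC000)  # overlaps the first region; later write wins
```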
+
+### pe_scheduler
+
+<!-- src: ADR-0014 Decision -->
+The scheduler is the sole dispatcher inside a PE. Simple commands are
+routed directly to the right engine. Composite commands generate a
+tile plan, and the resulting tile tokens are fed into the pipeline.
+Self-routing keeps the scheduler off the per-stage hot path: each
+engine, on finishing a stage, advances the token to the next stage's
+component itself, so the scheduler only does initial dispatch and
+completion tracking.
+
+### pe_tcm
+
+<!-- src: ADR-0040 Context, Decision -->
+The TCM is the per-PE tightly-coupled scratchpad memory. It models
+time only, not data — the actual payload lives in the simulator's
+memory store. Read and write are independent channels: each is
+modeled as a one-in-flight resource, so same-direction requests
+serialize but a read and a write can overlap. The bandwidth of each
+direction is configured separately and applied as bytes-over-bandwidth
+on each request.
+
+<!-- src: ADR-0040 Decision, Alternatives Considered -->
+The decision to keep read and write on separate channels was made
+because the PE pipeline's normal case overlaps fetch (read) and store
+(write). Collapsing them into a single shared channel would have
+artificially serialized that overlap and produced an incorrect
+bandwidth ceiling.
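The effect of separate read and write channels can be sketched without an event engine by tracking when each direction next becomes free. The class name, the bandwidth values, and the closed-form timing are illustrative assumptions, not the simulator's implementation.

```python
class TcmTiming:
    """Timing-only scratchpad model with one in-flight request per direction."""

    def __init__(self, read_bw: float, write_bw: float) -> None:
        self._bw = {"read": read_bw, "write": write_bw}  # bytes per ns
        self._free_at = {"read": 0.0, "write": 0.0}      # next idle time

    def access(self, direction: str, size_bytes: int, arrival_ns: float) -> float:
        """Return the completion time of one request on the given channel."""
        start = max(arrival_ns, self._free_at[direction])
        done = start + size_bytes / self._bw[direction]
        self._free_at[direction] = done
        return done


tcm = TcmTiming(read_bw=64.0, write_bw=32.0)
r1 = tcm.access("read", 640, arrival_ns=0.0)   # starts immediately
r2 = tcm.access("read", 640, arrival_ns=0.0)   # waits for r1 on the read channel
w1 = tcm.access("write", 640, arrival_ns=0.0)  # overlaps both reads
```

With a single shared channel, `w1` would have queued behind both reads; here it starts at time zero, which is the fetch/store overlap the decision is meant to preserve.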
+
+### sram
+
+<!-- src: ADR-0041 Context, Decision -->
+The cube SRAM is a per-cube scratchpad attached to one of the cube's
+routers. As a node it applies a configured access overhead, pays the
+link-bandwidth drain stamped on the incoming request, and sends a
+response on the reverse path. It is a terminal — it does not forward.
+
+<!-- src: ADR-0041 Decision, Consequences -->
+A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
+that an inter-PE collective slot can live in. When the slot lives in
+SRAM, the PE_DMA pays the slot read or write latency directly using
+the configured SRAM bandwidth and overhead; the SRAM component does
+not need to know about collective semantics. This separation keeps
+the SRAM component agnostic to the collective subsystem.
+
+### tiling
+
+<!-- src: ADR-0042 Context, Decision -->
+The tile-plan generator is not a runtime component — it is a pure
+module of functions that take a problem shape (matrix dimensions, tile
+sizes) and produce an ordered list of tile-stage sequences. The
+scheduler consumes this list. Each tile's stage sequence depends on
+how its operands are staged: operands streamed from HBM produce
+DMA_READ stages, operands already resident in TCM (because they were
+loaded eagerly upfront) skip them.
+
+<!-- src: ADR-0042 Decision, Consequences -->
+The plan generator is intentionally pure — given the same input it
+returns the same plan, with no simulator events created. This lets
+the rest of the system reason about tile sequences as data, and it
+makes the plan testable in isolation without simulator state. New
+plan variants (for example, K-major or DTensor-aware plans) can be
+added as new functions following the same shape.
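A pure plan generator of the kind described might look like the sketch below. The stage names follow the PE diagram comment earlier in this document; the function signature, the two-dimensional tiling, and the single `operands_in_tcm` flag are simplifying assumptions.

```python
from math import ceil

Stage = str
TilePlan = list[tuple[tuple[int, int], tuple[Stage, ...]]]


def tile_plan(m: int, n: int, tile_m: int, tile_n: int,
              operands_in_tcm: bool = False) -> TilePlan:
    """Return an ordered list of (tile_index, stage_sequence) pairs.

    The function creates no simulator events and reads no global state,
    so the same inputs always yield the same plan.
    """
    stages: list[Stage] = ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
    if not operands_in_tcm:
        # Operands streamed from HBM need a DMA_READ stage first.
        stages = ["DMA_READ"] + stages
    plan: TilePlan = []
    for i in range(ceil(m / tile_m)):
        for j in range(ceil(n / tile_n)):
            plan.append(((i, j), tuple(stages)))
    return plan


streamed = tile_plan(128, 64, tile_m=64, tile_n=64)
resident = tile_plan(128, 64, tile_m=64, tile_n=64, operands_in_tcm=True)
```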
+
+---
+
+## Implementation Decisions
+
+This section collects cross-cutting decisions — algorithms, policies,
+schemes, and contracts — that span multiple components rather than
+living inside one.
+
+### Address Scheme
+
+<!-- src: ADR-0001 Context, Decision -->
+Every physical address in the simulator decodes into a structured
+location. A fixed-width physical address carries the SIP id, the
+cube id within the SIP, a type discriminator (HBM vs PE-resource vs
+others), and a type-specific offset. HBM addresses additionally encode
+the per-PE slice offset so the controller can determine which PE
+owns the target slice without external lookup. The layout is
+deliberately reserved rather than packed-to-fit, so new sub-units can
+be added at the type-discriminator level without rewriting existing
+addresses.
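To make the structured decoding concrete, here is a sketch of a bit-field layout. The field widths, the field order, and every name below are invented for illustration; the document does not state the real layout, and the per-PE slice encoding for HBM addresses is omitted.

```python
from typing import NamedTuple

# Hypothetical field widths; the real layout is not given in this document.
CUBE_BITS, TYPE_BITS, OFFSET_BITS = 8, 4, 40


class Location(NamedTuple):
    sip: int
    cube: int
    kind: int    # type discriminator, e.g. HBM vs PE-resource
    offset: int  # type-specific offset


def encode(loc: Location) -> int:
    """Pack a structured location into one integer physical address."""
    value = loc.sip
    value = (value << CUBE_BITS) | loc.cube
    value = (value << TYPE_BITS) | loc.kind
    value = (value << OFFSET_BITS) | loc.offset
    return value


def decode(pa: int) -> Location:
    """Unpack a physical address by peeling fields off the low end."""
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    pa >>= CUBE_BITS
    return Location(sip=pa, cube=cube, kind=kind, offset=offset)


loc = Location(sip=1, cube=5, kind=0, offset=0x1234)
pa = encode(loc)
```

Leaving spare values in the discriminator field is what allows new sub-unit types to be added later without moving existing addresses.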
+
+<!-- src: ADR-0011 Context, Decision -->
+On top of physical addressing, the simulator defines three address
+models that the runtime API selects between. Direct physical
+addressing is retained as a fallback. Virtual addressing, the current
+default, gives each tensor a contiguous virtual range at deployment,
+with the per-PE MMU translating on each access. A logical-address
+scheme remains a future option. Every current test takes the
+virtual-address path; the MMU itself uses the PA fallback when no
+mapping exists for an address (a deliberate signal, not an error).
+
+<!-- src: ADR-0011 Decision, Consequences -->
+Tensor placement is represented as a list of physical-address shards,
+each tagged with target SIP, cube, and PE, plus a single tensor-wide
+virtual base. This means a kernel sees one virtual base for the whole
+tensor while the host driver and the engine still know exactly where
+each shard lives. Replicated tensors get per-cube local PA mappings;
+sharded tensors broadcast their mapping across cubes within a SIP.
+
+### Routing, Distance & Helper API
+
+<!-- src: ADR-0002 Context, Decision -->
+Routing is policy-driven, deterministic, and topology-aware. Given a
+source, a destination, and an intent — for example, PE-initiated
+DMA versus host-initiated memory write versus a generic
+component-to-component query — the routing layer picks the right
+path. The intent matters because different traffic types must avoid
+different categories of edges: PE-initiated DMA should not traverse
+command-only links; M_CPU DMA should not pass through PE-internal
+pipeline edges; cube-local transfers should not use the
+zero-distance UCIe bus that would otherwise look attractive to a
+shortest-path search.
+
+<!-- src: ADR-0051 Decision -->
+The routing layer therefore maintains four separate adjacency graphs
+at construction, each excluding a different category of edges, and
+picks the appropriate one per intent. On top of the graphs sits a
+helper API that hides the topology's naming convention: callers ask
+for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
+the HBM destination for a given physical address, and receive the
+corresponding node id. No component constructs node-id strings
+directly; if the naming convention ever changes, the change is local
+to the helper layer.
+
+<!-- src: ADR-0051 Decision, Consequences -->
+Path-finding itself uses Dijkstra with explicit per-edge weights
+(routing weight is allowed to differ from physical distance — for
+example, UCIe is configured to be routing-preferable). Tie-breaks
+follow insertion order, which keeps results deterministic. A query
+between unreachable nodes raises an error rather than returning an
+empty path, so topology errors surface immediately.
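The path-finding behavior can be sketched with a standard heap-based Dijkstra. Using a monotonically increasing counter as the tie-breaker is one plausible reading of "insertion order", not a statement of how the simulator does it; the adjacency format and the exception type are also assumptions.

```python
import heapq
from itertools import count


def shortest_path(adj: dict, src: str, dst: str) -> tuple[float, list[str]]:
    """Dijkstra over adj: node -> list of (neighbor, weight) in insertion order.

    Equal-cost candidates are popped in the order they were pushed, so the
    result does not depend on hash ordering.
    """
    order = count()  # unique, increasing tie-breaker for heap entries
    heap = [(0.0, next(order), src, None)]
    parent: dict = {}
    while heap:
        dist, _, node, prev = heapq.heappop(heap)
        if node in parent:
            continue  # already settled via an earlier or cheaper entry
        parent[node] = prev
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return dist, path[::-1]
        for neighbor, weight in adj.get(node, []):
            if neighbor not in parent:
                heapq.heappush(heap, (dist + weight, next(order), neighbor, node))
    raise ValueError(f"no path from {src!r} to {dst!r}")


adj = {
    "a": [("b", 1.0), ("c", 1.0)],
    "b": [("d", 1.0)],
    "c": [("d", 1.0)],
    "d": [],
    "e": [],
}
cost, path = shortest_path(adj, "a", "d")
```

Both `a-b-d` and `a-c-d` cost the same here; the route through `b` is returned because the edge to `b` was inserted first.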
+
+### Memory Semantics and Local-HBM Bandwidth
+
+<!-- src: ADR-0004 Context, Decision -->
+A PE accessing its own HBM slice through its own cube's NOC must see
+the full local HBM bandwidth — that is the model's intent. Memory
+traffic accumulates latency from per-component overhead and
+bytes-over-link-bandwidth serialization along the path, but the
+controller does not throttle below the slice's allotted bandwidth.
+Cross-PE-slice accesses inside the same cube, cross-cube accesses
+through UCIe, and cross-SIP accesses through PCIe each pay
+progressively more overhead as the path grows.
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+<!-- src: ADR-0006 Context, Decision -->
+Topology is configurable, not hardcoded. The simulator reads a YAML
+spec, compiles it into a flat graph of nodes and edges plus four
+view projections at different abstraction levels — system, SIP, cube,
+PE — and uses the compiled graph as the single source for both
+execution and visualization. Distance metadata used by routing is
+extracted at compile time so that diagrams and routing decisions
+agree by construction.
+
+<!-- src: ADR-0005 Context, Decision -->
+Diagrams are derived artifacts of the compiled topology. The visualizer
+produces one SVG per view at the appropriate abstraction level; nothing
+in the diagrams is hand-drawn or hand-positioned. Distance-aware
+layout rules place nodes in the diagrams using the same coordinates
+that routing uses to compute distance, so a diagram that "looks
+wrong" is a signal that the topology itself has a problem, not the
+visualizer.
+
+<!-- src: ADR-0053 Decision -->
+Inside a cube the router mesh is generated automatically. PE corner
+positions are fixed by convention; the relay-column algorithm
+inserts additional grid columns whenever the gap between adjacent PE
+columns would exceed a tunable maximum. HBM occupies a central
+exclusion zone — router slots inside the zone are deliberately empty,
+since HBM controllers attach as separate named nodes. M_CPU and SRAM
+attach to the nearest router by Euclidean distance from their
+configured placement coordinates, and UCIe physical lanes distribute
+along the boundary rows and columns. The whole mesh is cached
+beside the topology spec and invalidated only when one of a small set
+of layout-relevant fields changes.
+
+<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
+
+### Tensor Deployment and Allocation
+
+<!-- src: ADR-0008 Context, Decision -->
+Tensor deployment in the runtime API produces a list of physical-address
+shards plus a single tensor-wide virtual base. The host allocator
+walks the data-parallelism policy, computes per-shard placement, and
+emits the per-shard physical addresses through the per-PE allocators.
+No separate "allocate then later attach to a device" RPC exists —
+allocation and deployment are a single operation that produces a
+deployed tensor handle.
+
+### Memory Allocator Algorithms
+
+<!-- src: ADR-0048 Context, Decision -->
+Each per-PE allocator owns two channels — HBM slice and TCM — each
+backed by an offset-keyed free-list. Allocation is first-fit; freeing
+coalesces with adjacent free blocks. A device-wide virtual allocator
+sits above the per-PE allocators, aligns requests up to the configured
+page size, and coalesces on free in the same way. The trade-off is
+explicit: first-fit is simpler and cheaper than best-fit or buddy
+allocation, and the simulator's workload is stack-like enough
+(deploy / kernel / free in matched order) that fragmentation is not
+a practical concern.
+
+<!-- src: ADR-0048 Decision, Consequences -->
+Allocation failure raises rather than silently returning a partial
+result. A partial tensor reaching the engine would route over wrong
+PAs and silently corrupt simulator output, so an out-of-memory signal
+is preferred. The free path trusts its caller to pass back exactly
+what was allocated; the small risk of caller error in exchange for
+fast common-case freeing is documented as a deliberate trade.
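A first-fit free-list with coalescing on free, as described above, can be sketched in a few lines. The class name, the sorted-list representation, and the use of `MemoryError` are assumptions; as in the text, `free` trusts the caller to pass back exactly what was allocated.

```python
import bisect


class FirstFitAllocator:
    """Offset-keyed free-list: first-fit allocation, coalescing free."""

    def __init__(self, capacity: int) -> None:
        self.free_list = [(0, capacity)]  # (offset, size), sorted by offset

    def alloc(self, size: int) -> int:
        for i, (offset, block) in enumerate(self.free_list):
            if block >= size:  # first block that fits wins
                if block == size:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (offset + size, block - size)
                return offset
        # Raise instead of returning a partial result.
        raise MemoryError(f"cannot allocate {size} bytes")

    def free(self, offset: int, size: int) -> None:
        i = bisect.bisect_left(self.free_list, (offset, 0))
        self.free_list.insert(i, (offset, size))
        # Merge with the following block if the two are adjacent.
        if i + 1 < len(self.free_list) and offset + size == self.free_list[i + 1][0]:
            self.free_list[i] = (offset, size + self.free_list[i + 1][1])
            del self.free_list[i + 1]
        # Merge with the preceding block if the two are adjacent.
        if i > 0:
            prev_offset, prev_size = self.free_list[i - 1]
            if prev_offset + prev_size == self.free_list[i][0]:
                self.free_list[i - 1] = (prev_offset, prev_size + self.free_list[i][1])
                del self.free_list[i]


heap = FirstFitAllocator(capacity=1024)
a = heap.alloc(256)
b = heap.alloc(256)
heap.free(a, 256)
c = heap.alloc(128)  # first fit reuses the hole left at offset 0
```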
+
+### Kernel Execution and Host-Device Messaging
+
+<!-- src: ADR-0009 Context, Decision -->
+Kernel execution decomposes into a small set of messages that travel
+the device graph. The host issues a single kernel-launch message; the
+IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
+PE CPU resolves the kernel and runs it through the scheduler.
+Completion flows back the same way, gated by per-shard completion
+tracking. Memory operations follow the same pattern: a memory write
+or read travels as one message that the engine routes to the right
+HBM controller, with a response taking the reverse path.
+
+<!-- src: ADR-0012 Context, Decision -->
+The schema between the host and the device-side IO CPU is PA-first
+and shard-tagged. Every byte of host-issued payload arrives with an
+explicit target SIP, cube, PE, and physical address. The IO_CPU does
+not decode addresses to derive placement — placement is named
+explicitly by the shard list. This makes the host-device interface
+deterministic and keeps the routing helper free of host-derived
+intent.
+
+### CLI Surface and Semantics
+
+<!-- src: ADR-0010 Context, Decision -->
+The command-line interface exposes four subcommands. A bench runner
+loads a topology, resolves a registered benchmark by name or index,
+and runs it on a selected device. A bench-listing command enumerates
+the registered benchmarks. A probe utility runs a fixed catalog of
+traffic patterns through the engine for latency and bandwidth
+verification. A web viewer renders the topology in a browser. A
+benchmark instance is always single-device by convention; multi-SIP
+collective work happens inside the benchmark through the launcher
+abstraction, not by multiplexing the CLI.
+
+### Component Port and Wire Fabric Model
+
+<!-- src: ADR-0015 Context, Decision -->
+Every modeled component exposes input and output ports, and every
+edge in the topology connects an output port on one component to an
+input port on another. Bandwidth and propagation delay are properties
+of the wire between ports, not of the component endpoints. A
+component's responsibility is to apply its configured per-node
+overhead and either forward to the next hop or terminate; the wire
+charges the byte-over-bandwidth serialization separately.
+
+<!-- src: ADR-0015 Decision, Consequences -->
+This separation lets components be swapped behind their port
+interface without changing the rest of the model, and it keeps
+bandwidth contention at the wire level where multiple components may
+contend for the same edge. Future component models can refine
+internal behavior without disturbing the fabric.
+
+### Two-Pass Data Execution
+
+<!-- src: ADR-0020 Context, Decision -->
+The simulator runs in two passes. The first pass — fast and always
+on — runs the discrete-event engine and records every data operation
+in an operation log with timestamps, component identifiers, and per-
+operation parameters. The second pass — optional, opt-in — replays
+the log against an in-memory tensor store to produce actual numerical
+results. Tests that only need timing skip the second pass; tests that
+need to verify correctness opt in.
+
+<!-- src: ADR-0020 Decision, Consequences -->
+The split lets the timing engine remain unconcerned with data
+semantics: kernels move handles around, not bytes. The replay phase
+recovers data semantics by applying the recorded operations in their
+original time order, with a small set of secondary-sort rules to
+order records that share a timestamp. The op-log records carry enough
+metadata (input snapshots for compute operations, source snapshots
+for cross-component copies) that the replay phase cannot mis-order
+with respect to in-flight mutations.
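+
+The two passes can be sketched as follows. This is an illustration
+under stated assumptions, not the simulator's implementation: the
+tuple layout, the operation names, and the use of an issue sequence
+number as the secondary sort key are all hypothetical.
+
```python
import operator

# Pass 1 output: (start_ns, seq, op_name, params). Values are illustrative
# and deliberately stored out of order to show that replay re-sorts them.
op_log = [
    (20.0, 2, "store", {"dst": "out", "src": "tmp"}),
    (0.0, 0, "load", {"dst": "a", "value": [1.0, 2.0, 3.0]}),
    (10.0, 1, "scale", {"dst": "tmp", "src": "a", "factor": 2.0}),
]


def replay(log):
    """Pass 2: apply recorded operations to a tensor store in time order."""
    store = {}
    # Sort by timestamp, then by issue sequence as an assumed secondary key.
    for _start, _seq, name, p in sorted(log, key=operator.itemgetter(0, 1)):
        if name == "load":
            store[p["dst"]] = list(p["value"])
        elif name == "scale":
            store[p["dst"]] = [x * p["factor"] for x in store[p["src"]]]
        elif name == "store":
            store[p["dst"]] = list(store[p["src"]])
        else:
            raise ValueError(f"unknown op: {name}")
    return store


result = replay(op_log)
```
+
+A timing-only test would stop after producing `op_log`; a correctness
+test would also call `replay` and compare `result["out"]` with a
+reference.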
+
+### Sim-engine Op Log and Memory Store Schemas
+
+<!-- src: ADR-0052 Context, Decision -->
+The operation log holds typed records with seven fields each: start
+and end timestamps, the component that issued the operation, an
+operation kind ("memory", "gemm", "math"), an operation name, a
+parameter dictionary, and a (currently unused) dependency list.
+Records are kept in stable timestamp order. The parameter dictionary
+varies by operation: a DMA read carries source address and byte count;
+a GEMM carries operand shapes, dtypes, and address spaces; a math
+operation carries input addresses and snapshots.
+
+<!-- src: ADR-0052 Decision, Consequences -->
+The companion memory store is a two-level dictionary keyed by
+address space ("hbm", "tcm", "sram", others) and integer address.
+Reads and writes are reference-based — no copy by default — so
+callers wanting to detach a snapshot must copy explicitly. This is
+deliberate: the engine-internal snapshot paths copy at well-defined
+points (math input capture, HBM source capture for DMA writes,
+inbound collective copies) and downstream replay code therefore
+sees stable data even when slot or scratch addresses are reused by
+later operations.
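+
+A sketch of both schemas, showing why a snapshot has to be copied
+explicitly. The class and field names are assumptions made for this
+example; only the seven-field shape, the two-level keying, and the
+reference-based semantics come from the description above.
+
```python
import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    """Seven-field record; field names here are illustrative."""
    t_start: float
    t_end: float
    component: str
    kind: str            # "memory", "gemm", or "math"
    name: str
    params: dict[str, Any] = field(default_factory=dict)
    deps: list[int] = field(default_factory=list)  # currently unused


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self):
        self._spaces: dict[str, dict[int, Any]] = {}

    def write(self, space: str, addr: int, value: Any) -> None:
        # Stores the reference; no copy is made.
        self._spaces.setdefault(space, {})[addr] = value

    def read(self, space: str, addr: int) -> Any:
        # Returns the stored reference; no copy is made.
        return self._spaces[space][addr]


store = MemoryStore()
store.write("tcm", 0x100, [1, 2, 3])

alias = store.read("tcm", 0x100)                    # shares storage
snapshot = copy.deepcopy(store.read("tcm", 0x100))  # detached copy

store.read("tcm", 0x100).append(4)  # a later operation mutates the slot
record = OpRecord(0.0, 4.0, "pe0", "math", "add", {"inputs": snapshot})
```
+
+After the mutation, `alias` reflects the new contents while
+`snapshot` (and therefore `record.params["inputs"]`) still holds the
+values captured at issue time.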
+
+### 2D Grid Program Identity
+
+<!-- src: ADR-0022 Context, Decision -->
+Inside a kernel the program identity is two-dimensional. The
+first axis corresponds to the PE index within a cube; the second
+corresponds to the cube index within a SIP. Together they let a
+kernel address its position both within its cube and within the
+larger system without needing to know the full topology. Total
+program counts along each axis are exposed symmetrically.
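+
+As an illustration, the two axes can be modeled like this. The class
+and method names are hypothetical; the axis order (0 for the PE
+within the cube, 1 for the cube within the SIP) follows the
+description above.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridContext:
    """Program identity for one kernel instance (illustrative)."""
    pe_index: int
    cube_index: int
    pes_per_cube: int
    cubes_per_sip: int

    def program_id(self, axis: int) -> int:
        # Axis 0: PE within the cube. Axis 1: cube within the SIP.
        return (self.pe_index, self.cube_index)[axis]

    def num_programs(self, axis: int) -> int:
        # Total program count along the same axis numbering.
        return (self.pes_per_cube, self.cubes_per_sip)[axis]

    def flat_index(self) -> int:
        """Position within the SIP, cube-major."""
        return self.program_id(1) * self.num_programs(0) + self.program_id(0)


ctx = GridContext(pe_index=3, cube_index=2, pes_per_cube=8, cubes_per_sip=4)
```
+
+A kernel can derive its SIP-wide position from these four values
+alone, without consulting the topology.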
+
+### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
+
+<!-- src: ADR-0024 Context, Decision -->
+The launcher model treats each SIP as one rank. Inside a process the
+launcher spawns one greenlet per SIP rank; the rank is bound to its
+greenlet so that any code running in that worker sees the right
+distributed-style rank. This is a deliberately PyTorch-compatible
+shape: a benchmark looks like a small DDP training script — initialize
+a process group, spawn workers, each worker runs the same body.
+
+<!-- src: ADR-0026 Context, Decision -->
+Data-parallelism policy lives in a single object that names the
+sharding strategy along the cube axis (replicate, row-wise,
+column-wise) and along the PE axis (same set of values), and optionally
+overrides the number of cubes or PEs participating. The policy is
+intra-device — it does not cross SIP boundaries. SIP-level parallelism
+is the launcher's responsibility, and the two axes compose
+orthogonally.
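+
+A sketch of how such a policy object might compose the two axes for
+a 2-D tensor. Everything here is an assumption for illustration
+(`DPPolicySketch`, `Shard`, `per_pe_shape`, and the even-divisibility
+requirement); it is not the simulator's DPPolicy class.
+
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Shard(Enum):
    REPLICATE = "replicate"
    ROW_WISE = "row_wise"
    COLUMN_WISE = "column_wise"


def _split(shape, strategy, parts):
    # Assumes the sharded dimension divides evenly by `parts`.
    rows, cols = shape
    if strategy is Shard.ROW_WISE:
        return (rows // parts, cols)
    if strategy is Shard.COLUMN_WISE:
        return (rows, cols // parts)
    return (rows, cols)  # REPLICATE: every participant holds the full tensor


@dataclass(frozen=True)
class DPPolicySketch:
    cube: Shard = Shard.REPLICATE
    pe: Shard = Shard.REPLICATE
    num_cubes: Optional[int] = None  # None means "use the device's count"
    num_pes: Optional[int] = None

    def per_pe_shape(self, shape, device_cubes, device_pes):
        cubes = self.num_cubes or device_cubes
        pes = self.num_pes or device_pes
        per_cube = _split(shape, self.cube, cubes)  # cube axis first
        return _split(per_cube, self.pe, pes)       # then PE axis


policy = DPPolicySketch(cube=Shard.ROW_WISE, pe=Shard.COLUMN_WISE, num_cubes=2)
tile = policy.per_pe_shape((1024, 512), device_cubes=4, device_pes=8)
```
+
+Here the override restricts the work to two cubes, so each cube sees
+512 rows and each of its eight PEs sees 64 columns of them.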
+
+<!-- src: ADR-0027 Context, Decision -->
+A Megatron-style tensor-parallel API sits on top of the launcher and
+the DP policy. Layer-level building blocks — column-parallel linear,
+row-parallel linear, all-reduce — name their sharding intent in terms
+the launcher and the placement policy can compose. This is the layer
+that bench code typically writes against.
+
+<!-- src: ADR-0047 Context, Decision -->
+For collective operations the runtime exposes a PyTorch-compatible
+distributed backend named "ahbm". On process-group initialization the
+backend resolves the world size (priority: explicit ccl.yaml override
+→ defaults section → topology SIP count), dynamically imports the
+configured collective-algorithm module, derives the SIP topology
+kind, and pushes the inter-PE neighbor table to every participating
+PE. From that point on, an all-reduce call dispatches the algorithm's
+kernel function across all ranks.
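+
+The world-size priority order can be sketched as a small function.
+The key names (`world_size`, `defaults`) are assumptions made for
+this example and do not describe the actual ccl.yaml schema.
+
```python
def resolve_world_size(ccl_config: dict, topology_sip_count: int) -> int:
    """Pick the world size: explicit override, then defaults, then topology."""
    override = ccl_config.get("world_size")
    if override is not None:
        return int(override)
    default = ccl_config.get("defaults", {}).get("world_size")
    if default is not None:
        return int(default)
    # Nothing configured: fall back to the number of SIPs in the topology.
    return topology_sip_count
```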
+
+<!-- src: ADR-0050 Context, Decision -->
+A collective-algorithm module is a Python module with a small, fixed
+contract. It exposes topology-kind integer constants, a name-to-kind
+mapping for the YAML configuration, a kernel-arguments builder, and
+a kernel function — the kernel function being aliased to the name
+`kernel` so the backend can find it generically. The kernel itself
+takes the tensor pointer, the per-cube element count, cube mesh
+width and height, the world size, the current rank, and the SIP
+topology dimensions; the backend appends those last four arguments
+automatically. New collectives slot in by adding a new module that
+follows this shape.
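+
+One way to picture the contract is a checker that a backend could run
+after importing a module. The attribute names `TOPOLOGY_KINDS` and
+`build_kernel_args`, the constant values, and the checker itself are
+hypothetical; only the `kernel` alias is named in the description
+above.
+
```python
import types

REQUIRED_ATTRS = ("TOPOLOGY_KINDS", "build_kernel_args", "kernel")


def check_ccl_module(module) -> list[str]:
    """Return the names of contract attributes the module is missing."""
    missing = [name for name in REQUIRED_ATTRS if not hasattr(module, name)]
    for name in ("build_kernel_args", "kernel"):
        if name not in missing and not callable(getattr(module, name)):
            missing.append(name)
    return missing


# A stand-in module built in memory; a real one would be a .py file.
demo = types.ModuleType("demo_allreduce")
demo.TOPO_RING, demo.TOPO_TORUS, demo.TOPO_MESH = 0, 1, 2
demo.TOPOLOGY_KINDS = {"ring": demo.TOPO_RING, "torus": demo.TOPO_TORUS,
                       "mesh": demo.TOPO_MESH}
demo.build_kernel_args = lambda tensor_ptr, n_elem: (tensor_ptr, n_elem)


def _allreduce_kernel(*args):
    # Placeholder body: a real kernel would issue collective operations.
    return args


demo.kernel = _allreduce_kernel  # alias so a backend can find it generically
```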
+
+<!-- src: ADR-0027 Decision, Consequences -->
+The combination is deliberate: bench authors get to write code that
+looks like a regular distributed training script, while the launcher,
+backend, and placement policies behind it remain free to redirect
+work to the right SIP, cube, and PE without exposing topology to the
+kernel.
+
+### IPCQ Direction Addressing
+
+<!-- src: ADR-0025 Context, Decision -->
+Inside a collective algorithm, peer PEs are named by direction:
+"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
+cross-SIP neighbors. Resolution happens in three steps: the algorithm
+names a direction, the IPCQ neighbor table installed at process-group
+time resolves the direction to the peer endpoint's physical-address
+coordinates, and the PE_DMA performs the actual transfer. The
+algorithm itself never sees PA arithmetic; direction is the
+user-facing handle.
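+
+A minimal sketch of the lookup step, assuming a per-PE table keyed by
+direction string. The `Endpoint` layout, the class name, and the
+specific `global_E` key (one hypothetical member of the `global_*`
+family) are illustrative only.
+
```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    """Peer coordinates; the field layout is illustrative."""
    sip: int
    cube: int
    pe: int


class NeighborTable:
    """Direction -> peer endpoint, installed once per PE."""

    def __init__(self, entries: dict[str, Endpoint]):
        self._entries = dict(entries)

    def resolve(self, direction: str) -> Endpoint:
        try:
            return self._entries[direction]
        except KeyError:
            raise LookupError(f"no neighbor in direction {direction!r}") from None


table = NeighborTable({
    "E": Endpoint(sip=0, cube=0, pe=4),
    "W": Endpoint(sip=0, cube=0, pe=2),
    "global_E": Endpoint(sip=1, cube=0, pe=3),  # hypothetical cross-SIP name
})
peer = table.resolve("E")
```
+
+The algorithm only ever passes the string; turning `peer` into an
+address is left to the layer that owns the table.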
+
+### Intercube All-Reduce
+
+<!-- src: ADR-0032 Context, Decision -->
+The default all-reduce algorithm uses a center-rooted bidirectional
+reduce phase inside each SIP's cube mesh, followed by an inter-SIP
+exchange on the mesh's root cube, and then a bidirectional broadcast
+back out. Center-rooting roughly halves the worst-case in-cube hop
+count compared with a corner-rooted walk. The inter-SIP exchange
+itself follows the configured SIP topology (ring, torus, or
+non-wrapping mesh), selected at runtime through the SIP-topology kind
+integer the backend passes to the kernel.
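+
+The hop-count argument can be checked with a short calculation. This
+sketch assumes a non-wrapping 2-D cube mesh in which the hop count
+between two cubes is their Manhattan distance, and it places the
+center root at `(width // 2, height // 2)`; both are assumptions of
+the example rather than statements about the shipped algorithm.
+
```python
def max_hops(width: int, height: int, root: tuple[int, int]) -> int:
    """Largest Manhattan distance from root to any cube in the mesh."""
    rx, ry = root
    return max(abs(x - rx) + abs(y - ry)
               for x in range(width) for y in range(height))


def compare(width: int, height: int) -> tuple[int, int]:
    """Return (corner-rooted, center-rooted) worst-case hop counts."""
    corner = max_hops(width, height, (0, 0))
    center = max_hops(width, height, (width // 2, height // 2))
    return corner, center
```
+
+For a 5x5 mesh `compare` returns `(8, 4)`, an exact halving. For a
+4x4 mesh it returns `(6, 4)`, so on even-sided meshes the saving is
+somewhat less than half.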
+
+### Evaluation Harnesses
+
+<!-- src: ADR-0043 Context, Decision -->
+The all-reduce evaluation harness drives correctness and the
+latency/buffer-kind sweeps through the public distributed path —
+initialize process group, spawn workers, call all-reduce — rather
+than the lower-level engine interface. A shared helper module factors
+out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
+HBM) and the inter-SIP topology variants. The plots produced by the
+harness are part of its output contract; the harness regenerates them
+on demand.
+
+<!-- src: ADR-0044 Context, Decision -->
+The GEMM evaluation harness is split into two layers. A heavy
+shape-and-variant sweep lives as a manual script — it runs the same
+composite-GEMM benchmark across many shapes and operand-staging
+variants, harvests the resulting op-log, and writes a JSON summary.
+A faster figure-generation layer lives in the test suite and consumes
+that JSON to render plots. The split keeps the heavy data
+generation explicit and out of the regular test path.
+
+### Bench Module Contract
+
+<!-- src: ADR-0045 Context, Decision -->
+Adding a new benchmark requires only dropping a file into the
+benchmarks directory. The file registers one or more benchmark
+functions through a small decorator that takes a kebab-case name and
+a human-readable description. The decorator is the registration
+mechanism — there is no separate manifest. Each benchmark function
+takes one argument, conventionally named `torch`, which is the
+runtime context exposing tensor allocation, kernel launch,
+distributed APIs, and process-spawning. The function name is `run` by
+convention.
+
+<!-- src: ADR-0045 Decision, Consequences -->
+A benchmark must submit at least one operation, or the runner
+returns an error. A benchmark instance is single-device by default;
+when a benchmark is collective, it uses the distributed-process-spawn
+pattern internally — one worker greenlet per rank, with each worker
+binding to its rank. Multi-device benchmark patterns outside that
+shape are not supported.
+
+### Kernel-side `tl.*` API
+
+<!-- src: ADR-0046 Context, Decision -->
+Inside a kernel function, the `tl` argument exposes the kernel-side
+API in a shape that mirrors the conventions of established
+GPU-kernel languages. Categories: reference handles that name HBM
+data without issuing DMA; data movement (load, store) that does
+issue DMA; GEMM and math compute (dot, composite, the unary and
+binary math operations, reductions); index and scalar helpers
+(program identity, range-builders); metadata-only operations like
+transpose; and the collective primitives (send, receive,
+non-blocking receive). Tensor handles support arithmetic operators
+via a thread-local active context so kernel code reads naturally.
+
+<!-- src: ADR-0046 Decision, Consequences -->
+The API supports two execution modes. A command-list mode records
+operations into a list without consuming simulator time — useful for
+inspection and lightweight tests. A greenlet-driven mode runs the
+kernel as a child greenlet that switches back to the simulator on
+each `tl.*` call; the simulator drives the event scheduler and hands
+real data back to the kernel as DMA reads complete. The two modes
+share the same surface; the kernel does not know which one it is
+running under.
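+
+The command-list mode can be pictured with a recording context. This
+is a sketch only: the class, the method signatures, and the use of
+list indices as handles are assumptions, and the greenlet-driven mode
+is not shown.
+
```python
class CommandListContext:
    """Records tl-style calls instead of executing them (illustrative)."""

    def __init__(self):
        self.commands = []

    def _record(self, op, **params):
        handle = len(self.commands)  # opaque handle: index of this command
        self.commands.append((op, params))
        return handle

    def load(self, addr, nbytes):
        return self._record("load", addr=addr, nbytes=nbytes)

    def dot(self, a, b):
        return self._record("dot", a=a, b=b)

    def store(self, addr, value):
        return self._record("store", addr=addr, value=value)


def kernel(tl):
    # The kernel body is written once and does not know which mode runs it.
    a = tl.load(0x0000, 4096)
    b = tl.load(0x1000, 4096)
    tl.store(0x2000, tl.dot(a, b))


ctx = CommandListContext()
kernel(ctx)
```
+
+After the call, `ctx.commands` holds the four operations in issue
+order and no simulated time has elapsed.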
+
+### Probe Subcommand
+
+<!-- src: ADR-0049 Context, Decision -->
+The probe utility runs three families of traffic patterns through
+the engine — host-to-device writes at increasing hop counts,
+device-to-host reads at increasing hop counts, and PE-initiated DMA
+across the cube mesh — and reports actual latency, the analytical
+formula breakdown, effective bandwidth, bottleneck bandwidth, and
+utilization. A fixed reference size is used for the summary table;
+a separate utilization-versus-size sweep covers a logarithmic range
+of transfer sizes. Each case runs in its own engine instance so
+cases do not perturb each other.
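+
+The summary columns relate to each other roughly as follows. The
+definitions used here (effective bandwidth as bytes over measured
+latency, bottleneck as the slowest wire on the path, utilization as
+their ratio) are assumptions of this sketch, as are the function and
+key names.
+
```python
def probe_metrics(nbytes: int, latency_ns: float, wire_bandwidths_gbps):
    """Derive the summary columns for one probe case.

    Bandwidths are in decimal GB/s, which equals bytes per nanosecond.
    """
    effective = nbytes / latency_ns          # bytes/ns == GB/s
    bottleneck = min(wire_bandwidths_gbps)   # slowest wire on the path
    return {
        "effective_gbps": effective,
        "bottleneck_gbps": bottleneck,
        "utilization": effective / bottleneck,
    }


m = probe_metrics(nbytes=4096, latency_ns=128.0,
                  wire_bandwidths_gbps=[64.0, 128.0])
```
+
+Small transfers are dominated by fixed overheads and show low
+utilization under this definition, which is what a
+utilization-versus-size sweep makes visible.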
+
+<!-- src: ADR-0049 Decision, Consequences -->
+The probe also checks a small set of invariants automatically:
+monotonic latency increase with hop count, device-to-host latency
+at least as large as host-to-device for the same hop count, and a
+faster best-case path than worst-case for cross-cube PE DMA. Failures
+print prominently. The output is meant for human reading; automated
+parsing should not depend on column widths or whitespace.
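+
+The three invariants can be expressed as a small checker. The
+function name, the argument layout, and the choice to treat equal
+latencies at adjacent hop counts as acceptable are assumptions of
+this sketch.
+
```python
def check_invariants(h2d, d2h, pe_best, pe_worst):
    """Return a list of human-readable failures (empty when all hold).

    h2d and d2h are latencies indexed by increasing hop count.
    """
    failures = []
    for label, series in (("h2d", h2d), ("d2h", d2h)):
        # Latency must not drop when the hop count grows.
        if any(b < a for a, b in zip(series, series[1:])):
            failures.append(f"{label}: latency decreases as hop count grows")
    # Device-to-host must be at least as slow as host-to-device per hop count.
    if any(r < w for w, r in zip(h2d, d2h)):
        failures.append("d2h is faster than h2d at the same hop count")
    if not pe_best < pe_worst:
        failures.append("cross-cube PE DMA best case is not faster than worst case")
    return failures
```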
+
+---
+
+This document summarizes 46 architecture decisions captured during
+the first half of 2026. It is regenerated mechanically from the
+decision corpus; sources are recorded in HTML comments throughout.