Add CHANGES.md, README, update SPEC/ADRs for release 2

- CHANGES.md: detailed changelog for release 1 and 2
- README.md: full project docs with install, probe, run, test usage
- SPEC.md: add ADR-0014~0017 references, update R7 for pcie_ep endpoint
- ADR-0003: update NOC description to reference ADR-0017
- ADR-0004: add HBM efficiency factor (0.8) to BW guarantee contract
- ADR-0014: status Proposed -> Accepted
- ADR-0015: update D4 to M_CPU bypass for Memory R/W, add ADR-0016/0017 links
- ADR-0016 (new): IOChiplet NOC and memory data path
- ADR-0017 (new): Cube NOC 2D mesh architecture
- Fix MD lint warnings (fenced code blocks without a language tag) across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 01:43:15 -07:00
parent d75da439c6
commit fc6abbc8ee
10 changed files with 613 additions and 65 deletions
@@ -37,8 +37,10 @@ We model the system hierarchy explicitly:
  - HBM + memory controller (HBM_CTRL)
  - XBAR (top/bottom): HBM pseudo-channel crossbar, PE's dedicated path to HBM
  - Bridge (left/right): connects XBAR.top ↔ XBAR.bottom for cross-half HBM access
-  - NOC: distributed on-die fabric spanning the entire cube (distance modeled as 0);
-    carries non-HBM traffic including inter-cube (UCIe), command (M_CPU↔PE_CPU), and shared SRAM access
+  - NOC: 2D mesh router grid spanning the entire cube with XY routing and
+    per-segment contention modeling; carries all intra-cube traffic including
+    PE DMA to xbar (HBM), inter-cube (UCIe), command (M_CPU↔PE_CPU), and
+    shared SRAM access. See ADR-0017 for full NOC architecture.
  - Shared SRAM: cube-level shared memory accessible by all PEs via NOC
  - management/control CPU (M_CPU) coordinating PE command distribution and completion aggregation
  - multiple PEs
@@ -62,3 +64,4 @@ We model the system hierarchy explicitly:

 - SPEC R3/R5
 - ADR-0005 (diagram views)
+- ADR-0017 (cube NOC 2D mesh architecture)
@@ -21,8 +21,15 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,

 ### D2. Local HBM bandwidth guarantee contract

- Accesses from a PE to its local HBM MUST guarantee full HBM read/write bandwidth
-  independent of intervening fabric bandwidth limits.
+- Accesses from a PE to its local HBM MUST guarantee full effective HBM
+  read/write bandwidth independent of intervening fabric bandwidth limits.
+- Effective HBM bandwidth = spec bandwidth x efficiency factor.
+  The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
+  models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
+  misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
+- The topology builder applies the efficiency factor to xbar-to-hbm edge
+  bandwidth at graph construction time, so all downstream routing and latency
+  computation uses the effective value.
 - This guarantee is modeled by:
  - a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
  - while still incurring non-zero latency along explicitly modeled components.
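
The effective-bandwidth rule added to D2 is a single multiplication applied when the graph is built. A minimal sketch, assuming a hypothetical helper name (`effective_hbm_bw_gbs`); only the `efficiency` attribute, its 0.8 default, and the 256 GB/s example come from the ADR text:

```python
def effective_hbm_bw_gbs(spec_bw_gbs: float, efficiency: float = 0.8) -> float:
    """Scale the spec HBM bandwidth by the efficiency factor.

    Hypothetical helper: the real topology builder may be structured differently.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return spec_bw_gbs * efficiency


# Example from D2: 256 GB/s spec bandwidth with the default factor.
xbar_to_hbm_bw = effective_hbm_bw_gbs(256.0)
```

Applying the factor once, on the xbar-to-hbm edge, means routing and latency code never need to know about the efficiency attribute.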
@@ -62,3 +69,4 @@ Tests should cover:

 - SPEC R2/R5
 - ADR-0002 (distance/order & explicit bypass)
+- ADR-0017 D7 (PE DMA data paths through NOC to HBM)
@@ -2,7 +2,7 @@

 ## Status

-Proposed
+Accepted

 ## Context

@@ -123,7 +123,7 @@ Examples include:

 Execution flow:

-```
+```text
 PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
 ```

@@ -133,7 +133,7 @@ Composite commands implement tiled pipelined execution across engines.

 Each tile executes the following pipeline:

-```
+```text
 Input DMA (READ)
 → Compute (GEMM or MATH)
 → Output DMA (WRITE)
@@ -158,7 +158,7 @@ Operations for different tiles may overlap when engine resources permit.

 Allowed overlaps:

-```
+```text
 DMA_READ(t+1) ∥ COMPUTE(t)
 DMA_WRITE(t−1) ∥ COMPUTE(t)
 DMA_READ(t) ∥ DMA_WRITE(t)
@@ -166,7 +166,7 @@ DMA_READ(t) ∥ DMA_WRITE(t)

 Disallowed overlaps:

-```
+```text
 GEMM(t) ∥ GEMM(t′)
 MATH(t) ∥ MATH(t′)
 GEMM(t) ∥ MATH(t′)
@@ -182,7 +182,7 @@ Each engine behaves as a deterministic service resource.

 PE_DMA contains two independent channels.

-```
+```text
 DMA_READ capacity  = 1
 DMA_WRITE capacity = 1
 ```
@@ -195,13 +195,13 @@ Rules:

 Example allowed:

-```
+```text
 DMA_READ(t+1) ∥ DMA_WRITE(t)
 ```

 Example not allowed:

-```
+```text
 DMA_READ(t) ∥ DMA_READ(t+1)
 DMA_WRITE(t) ∥ DMA_WRITE(t+1)
 ```
@@ -210,7 +210,7 @@ DMA_WRITE(t) ∥ DMA_WRITE(t+1)

 Compute operations share a single compute resource.

-```
+```text
 PE_ACCEL capacity = 1
 ```

@@ -230,7 +230,7 @@ Composite commands contain one compute opcode only.

 Examples:

-```
+```text
 COMPOSITE_GEMM
 COMPOSITE_MATH
 ```
@@ -250,13 +250,13 @@ Compute operations use a TCM-centric dataflow model.

 **Input path (HBM)**

-```
+```text
 HBM → XBAR → PE_DMA (DMA_READ) → PE_TCM
 ```

 **Input path (shared SRAM)**

-```
+```text
 Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
 ```

@@ -264,7 +264,7 @@ Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM

 Compute engines read input tensors from PE_TCM.

-```
+```text
 PE_TCM → GEMM / MATH
 ```

@@ -274,13 +274,13 @@ Weights for GEMM may optionally stream directly from HBM (via XBAR).

 Compute results are written to PE_TCM, then DMA writes to HBM.

-```
+```text
 PE_TCM → PE_DMA (DMA_WRITE) → XBAR → HBM
 ```

 **Output path (shared SRAM)**

-```
+```text
 PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
 ```

@@ -17,8 +17,8 @@ implementation does not enforce this for fabric traversal.
 This ADR defines:

 - how components communicate via typed port queues,
- how propagation delay is modeled (wire processes),
- the fabric path for Memory R/W through M_CPU.DMA,
+- how propagation delay is modeled (wire processes with BW occupancy),
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
 - the reduced role of the simulation engine,
 - M_CPU.DMA as an internal subcomponent of M_CPU.

@@ -30,7 +30,7 @@ This ADR defines:

 Each component has typed input/output ports modeled as SimPy Stores:

-```
+```text
 in_ports:  dict[str, simpy.Store]   # keyed by source node_id
 out_ports: dict[str, simpy.Store]   # keyed by destination node_id
 ```
@@ -93,35 +93,51 @@ ADR-0007 D2 must be amended accordingly.

 ---

-### D4. Unified fabric path for Memory R/W and Kernel Launch
+### D4. Fabric paths for Memory R/W and Kernel Launch

-Both Memory R/W and Kernel Launch use the same fabric path to reach the target cube's M_CPU.
-The difference is what M_CPU does upon receiving the request.
+Memory R/W and Kernel Launch use **different** fabric paths.
+Memory operations bypass M_CPU and route directly to HBM via the crossbar.
+Kernel Launch routes through M_CPU for PE fan-out.

-**Forward path (IO_CPU → target M_CPU):**
+**Memory R/W forward path (pcie_ep → hbm_ctrl, M_CPU bypass):**

-```
-IO_CPU
-  → [transit cubes: ucie_out → wire → ucie_in → noc → ucie_out]  (zero or more)
-  → target cube: ucie_in → noc → M_CPU
+```text
+pcie_ep → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → xbar → hbm_ctrl
 ```

-**At M_CPU (diverges by operation type):**
+**Memory R/W completion path:**

-```
-Memory R/W:     M_CPU → M_CPU.DMA → noc → hbm_ctrl
-Kernel Launch:  M_CPU → PE[0..n] (parallel fan-out)
+```text
+hbm_ctrl → xbar → noc → [transit cubes: ucie → noc → ucie]
+  → io_ucie → io_noc → pcie_ep
 ```

-**Completion path (reverse, same fabric):**
+**Kernel Launch forward path (pcie_ep → io_cpu → M_CPU → PE):**

+```text
+pcie_ep → io_noc → io_cpu → io_noc → io_ucie
+  → [transit cubes: ucie_in → noc → ucie_out]  (zero or more)
+  → target cube: ucie_in → noc → M_CPU → PE[0..n] (parallel fan-out)
 ```
-Memory R/W:     hbm_ctrl → noc → M_CPU.DMA → M_CPU
-Kernel Launch:  PE[0..n] all complete → M_CPU (aggregation)

-M_CPU → [transit cubes: ucie → noc → ucie] → IO_CPU → runtime_api
+**Kernel Launch completion path:**
+
+```text
+PE[0..n] all complete → M_CPU (aggregation)
+  → noc → [transit cubes: ucie → noc → ucie]
+  → io_ucie → io_noc → io_cpu → io_noc → pcie_ep
 ```

+**Rationale for M_CPU bypass on Memory R/W:**
+
+Memory write/read operations do not require command interpretation or PE
+dispatch — they are direct data transfers to/from HBM. Routing through M_CPU
+would add unnecessary overhead (5ns) without functional benefit. Inside the IO
+chiplet, memory operations pass through io_noc directly to the cube fabric,
+while kernel launches are forwarded to io_cpu first (the route is selected by
+message type; see ADR-0016 D1).
+
 ---

 ### D5. M_CPU.DMA is an internal subcomponent of M_CPU
@@ -146,7 +162,7 @@ M_CPU.DMA does not appear as a node in the compiled topology graph.
 A cube that is not the target of a memory or kernel request acts as a transit node.
 Transit cubes forward requests without consuming them:

-```
+```text
 ucie_in (from upstream) → noc → ucie_out (to downstream)
 ```

@@ -187,3 +203,5 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
+- ADR-0016 (IOChiplet NOC and memory data path)
+- ADR-0017 (cube NOC 2D mesh architecture)
@@ -0,0 +1,98 @@
+# ADR-0016: IOChiplet NOC and Memory Data Path
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D2 defines IO chiplets as SIP-level components providing PCIe-EP and
+IO_CPU interfaces, but does not specify internal routing within the IO chiplet.
+ADR-0015 D4 was updated to document the M_CPU bypass for Memory R/W, but the
+IO chiplet's internal NOC architecture that enables this routing was not
+formally documented.
+
+The IO chiplet needs an internal routing fabric (io_noc) to:
+
+- connect pcie_ep, io_cpu, and per-cube UCIe PHY ports
+- route memory operations (MemoryWrite/Read) directly to cube fabric without
+  passing through io_cpu
+- route kernel launch commands through io_cpu for command interpretation
+
+## Decision
+
+### D1. IOChiplet internal NOC (io_noc)
+
+Each IO chiplet instance contains an internal NOC node (`io_noc`) that connects:
+
+- `pcie_ep` — host-facing PCIe endpoint
+- `io_cpu` — command processor for kernel launch interpretation
+- `io_ucie-{PHY}.conn{N}` — per-PHY connection nodes to cube UCIe ports
+
+The io_noc is a forwarding-only fabric (`forwarding_v1` implementation) with
+zero overhead. All routing decisions are made by the simulation engine based
+on message type, not by io_noc itself.
+
+### D2. IOChiplet UCIe decomposition
+
+Each IO chiplet PHY port is decomposed into:
+
+- `io_ucie-{PHY}` — the UCIe protocol endpoint (overhead = 8ns)
+- `io_ucie-{PHY}.conn{N}` — N connection nodes between io_noc and io_ucie
+
+This mirrors the cube-side UCIe decomposition (ADR-0015 D1) and allows
+multiple independent NOC-to-UCIe connections per PHY.
+
+### D3. Memory R/W path (M_CPU bypass)
+
+Memory operations (MemoryWrite, MemoryRead) are routed directly from pcie_ep
+through io_noc to the target cube's HBM, bypassing io_cpu in the IO chiplet
+and M_CPU in the target cube:
+
+```text
+pcie_ep → io_noc → conn → io_ucie → [cube UCIe] → noc → xbar → hbm_ctrl
+```
+
+This avoids the 10ns io_cpu overhead for pure data transfers. The simulation
+engine's `_process_memory_direct()` method uses `find_memory_path()` which
+resolves the shortest path from pcie_ep to the target HBM node.
+
+### D4. Kernel Launch path (via io_cpu)
+
+Kernel launch commands require io_cpu for command interpretation and PE
+fan-out setup:
+
+```text
+pcie_ep → io_noc → io_cpu → io_noc → conn → io_ucie → [cube UCIe]
+  → noc → m_cpu → PE
+```
+
+The engine's `_entry_points()` method routes KernelLaunchMsg through both
+pcie_ep (entry) and io_cpu (command processing).
+
+### D5. IOChiplet-to-cube port mapping
+
+Each IO chiplet instance declares which cube ports it connects to:
+
+```yaml
+cube_ports:
+  - { cube: {xy: [0,0]}, cube_side: N, phy: P0, distance_mm: 2.0 }
+  - { cube: {xy: [1,0]}, cube_side: N, phy: P1, distance_mm: 2.0 }
+```
+
+The topology builder creates edges from io_ucie PHY nodes to the
+corresponding cube UCIe port nodes, with the specified distance and
+the IO chiplet's `per_connection_bw_gbs` as link bandwidth.
+
+## Consequences
+
+- IO chiplet has a well-defined internal routing fabric
+- Memory operations avoid unnecessary io_cpu overhead
+- Kernel launch commands still get proper command interpretation
+- The io_noc pattern is consistent with cube-level NOC design
+- ADR-0003 D2 is extended (not contradicted) by this ADR
+
+## Links
+
+- ADR-0003 D2 (IO chiplet definition)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0012 D1 (host-to-IO_CPU message schema)
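
The split between ADR-0016 D3 and D4 can be illustrated as a lookup keyed by message type. This is a sketch only: `io_chiplet_path` and the string labels are hypothetical stand-ins, not the engine's `_process_memory_direct()` or `_entry_points()` methods, and the node names are shortened from the ADR's path diagrams.

```python
MEMORY_MSGS = {"MemoryWrite", "MemoryRead"}  # assumed string labels


def io_chiplet_path(msg_type: str) -> list[str]:
    """Return the node sequence inside the IO chiplet for a message type."""
    if msg_type in MEMORY_MSGS:
        # D3: memory traffic goes straight through io_noc and skips io_cpu.
        return ["pcie_ep", "io_noc", "conn", "io_ucie"]
    if msg_type == "KernelLaunchMsg":
        # D4: kernel launches visit io_cpu, then re-enter io_noc.
        return ["pcie_ep", "io_noc", "io_cpu", "io_noc", "conn", "io_ucie"]
    raise ValueError(f"unknown message type: {msg_type}")
```

Because io_noc is forwarding-only, the choice sits outside the fabric node, which matches D1's statement that routing is decided by message type rather than by io_noc.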
@@ -0,0 +1,189 @@
+# ADR-0017: Cube NOC 2D Mesh Architecture
+
+## Status
+
+Accepted
+
+## Context
+
+ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
+does not specify the internal routing model, contention semantics, or
+attachment topology. The implementation uses a 2D mesh router grid with
+XY routing and per-segment contention modeling. This ADR formalizes that
+architecture.
+
+## Decision
+
+### D1. NOC node and router grid
+
+Each cube contains a single NOC topology node (`sip{S}.cube{C}.noc`)
+implemented as `noc_2d_mesh_v1`. Internally, the NOC models a 2D router
+grid generated by `mesh_gen.py`.
+
+Grid properties:
+
+- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
+- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
+- HBM exclusion zone: center rows/columns are excluded where HBM physically
+  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
+- Router positions are derived from physical PE corner placement and cube
+  geometry
+
+The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
+traversal within the mesh (distance_mm x ns_per_mm).
+
+### D2. XY routing algorithm
+
+The NOC uses deterministic XY routing:
+
+1. Horizontal segment: travel along the source row (source Y) from source X to destination X
+2. Vertical segment: travel along the destination column (destination X) from source Y to destination Y
+
+Each directed segment is identified by a unique link key:
+
+- Horizontal: `("H", y_band, x_min, x_max, direction)`
+- Vertical: `("V", x_band, y_min, y_max, direction)`
+
+Grid positions are snapped to the router grid, excluding the HBM zone.
+
+### D3. Contention model
+
+Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
+sharing a segment (same row or column band, same direction) contend for the
+resource. This models link-level serialization in a wormhole-routed mesh.
+
+With no contention, NOC traversal latency equals the Manhattan distance
+multiplied by `ns_per_mm`. Under contention, additional queueing delay
+is added by SimPy's resource scheduling.
+
+### D4. NOC attachment points
+
+The NOC connects to all major cube-level components:
+
+```text
+                    UCIe-N (conn x4)
+                         |
+           +---------+---+---+---------+
+           |         |       |         |
+PE0.dma ---+  r0c0   |  ...  |  r0c5  +--- PE2.dma
+PE0.cpu <--+         |       |         +--< PE2.cpu
+           |         |       |         |
+UCIe-W ----+  ...    | [HBM] |  ...   +---- UCIe-E
+(conn x4)  |         | zone  |         |  (conn x4)
+           |  r2c0   |       |         |
+M_CPU <--->+         |       |         |
+           |  r3c0   |       |         |
+SRAM <---->+         |       |         |
+           |         |       |         |
+PE4.dma ---+  r4c0   |  ...  |  r4c5  +--- PE6.dma
+PE4.cpu <--+         |       |         +--< PE6.cpu
+           |         |       |         |
+           +---------+---+---+---------+
+                         |
+                    UCIe-S (conn x4)
+
+xbar_top attached to: r0c0, r0c1, r1c4, r1c5 (top-half PE routers)
+xbar_bot attached to: r4c0, r4c1, r5c4, r5c5 (bottom-half PE routers)
+```
+
+### D5. NOC edge bandwidths and distances
+
+| Connection | BW (GB/s) | Distance | Notes |
+| --- | --- | --- | --- |
+| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
+| NOC -> PE_CPU | - | 0.0 mm | Command path only |
+| NOC <-> xbar_top | 256.0 | 0.0 mm | Per xbar half |
+| NOC <-> xbar_bot | 256.0 | 0.0 mm | Per xbar half |
+| NOC <-> M_CPU | - | 0.0 mm | Command path |
+| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
+| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |
+
+Distance 0.0 mm for most connections reflects the distributed nature of
+the NOC; the actual traversal distance is computed internally via Manhattan
+distance within the router grid.
+
+### D6. UCIe decomposition and inter-cube traffic
+
+Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:
+
+- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
+- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe
+
+This decomposition enables N=4 independent NOC-to-UCIe connections per port,
+each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
+
+Inter-cube traffic path:
+
+```text
+Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
+                    [UCIe link: 512 GB/s, 1.0mm seam distance]
+Target: ucie-{PORT} -> conn{i} -> NOC -> xbar -> HBM
+```
+
+UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
+full crossing incurs 16 ns (TX port + RX port).
+
+### D7. Data paths through the NOC
+
+**PE DMA to local HBM (same half):**
+
+```text
+PE_DMA -> NOC -> xbar_top -> HBM_CTRL.slice{0-3}
+```
+
+**PE DMA to cross-half HBM:**
+
+```text
+PE_DMA -> NOC -> xbar_top -> bridge -> xbar_bot -> HBM_CTRL.slice{4-7}
+```
+
+**PE DMA to remote cube HBM:**
+
+```text
+PE_DMA -> NOC -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> NOC -> xbar -> HBM
+```
+
+**Kernel Launch command to PE:**
+
+```text
+[from io_noc] -> ucie -> conn -> NOC -> M_CPU -> NOC -> PE_CPU
+```
+
+**Shared SRAM access:**
+
+```text
+PE_DMA -> NOC -> SRAM
+```
+
+### D8. Mesh generation
+
+The router grid is generated by `mesh_gen.py` based on:
+
+- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
+- `cube.geometry`: cube physical dimensions and HBM zone
+- `cube.ucie.n_connections`: determines router count for UCIe attachment
+
+The generator produces a `mesh_data` dictionary containing:
+
+- Router grid with positions and HBM exclusion zones
+- PE-to-router attachments (pe_dma, pe_cpu per PE)
+- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
+- M_CPU and SRAM router attachments
+- xbar_top/bot router assignments (top-half vs bottom-half PE routers)
+
+## Consequences
+
+- NOC provides position-aware, deterministic XY routing; uncontended latency depends only on Manhattan distance
+- Contention is captured per directed segment (not per-node)
+- All cube-internal traffic is explicitly routed through the NOC
+- HBM exclusion zone reflects physical die layout constraints
+- The mesh generation is fully parameterized by `topology.yaml`
+
+## Links
+
+- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
+- ADR-0004 D1 (PE DMA to local HBM path via xbar)
+- ADR-0004 D3 (cross-half HBM via bridge)
+- ADR-0014 D1 (PE_DMA dual egress: xbar for HBM, NOC for non-HBM)
+- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
+- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
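
The XY routing rule and link-key shape from ADR-0017 D2 can be sketched in a few lines. Assumptions: coordinates are integer `(x, y)` router-grid indices, the `"+"`/`"-"` direction labels are invented for illustration (the ADR only says a direction is part of the key), and snapping around the HBM exclusion zone is left out. `xy_segments` is a hypothetical name, not a function from `mesh_gen.py`.

```python
def xy_segments(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple]:
    """Return directed segment keys for an XY route from src to dst."""
    (sx, sy), (dx, dy) = src, dst
    segments = []
    if sx != dx:
        # Horizontal leg first, along the source row.
        direction = "+" if dx > sx else "-"
        segments.append(("H", sy, min(sx, dx), max(sx, dx), direction))
    if sy != dy:
        # Vertical leg second, along the destination column.
        direction = "+" if dy > sy else "-"
        segments.append(("V", dx, min(sy, dy), max(sy, dy), direction))
    return segments


def manhattan_hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    """Grid distance covered by the two legs combined."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

In the D3 contention model each such key would map to one capacity-1 resource, so a route holds at most two resources: one per leg.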
@@ -7,7 +7,7 @@ Every request flows through a graph of **components** connected by **wires**.
 The total latency reported is the **actual SimPy wall-clock** (`env.now` delta),
 not a static formula—so contention and queueing are captured automatically.

-```
+```text
 total_ns (actual) = wire_prop + component_overhead + drain + queueing
                    ├── deterministic ──────────────────┘       │
                    └── contention-dependent ────────────────────┘
@@ -17,7 +17,7 @@ total_ns (actual) = wire_prop + component_overhead + drain + queueing

 ### 1. Wire Propagation

-```
+```text
 wire_ns = distance_mm × ns_per_mm       (global: 0.01 = 10 ps/mm)
 ```

@@ -29,7 +29,7 @@ and negligible compared to other costs.

 ### 2. Component Overhead (`overhead_ns`)

-```
+```text
 component_ns = node.attrs["overhead_ns"]
 ```

@@ -53,7 +53,7 @@ This models arbitration, protocol processing, pipeline stages, etc.

 ### 3. Drain (Serialization Delay)

-```
+```text
 drain_ns = nbytes / bottleneck_bw_gbs
 ```
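
The drain formula needs no unit conversion because, with decimal gigabytes (10^9 bytes, an assumption consistent with the document's own examples), 1 GB/s equals 1 byte/ns. A minimal sketch with a hypothetical function name:

```python
def drain_ns(nbytes: int, bottleneck_bw_gbs: float) -> float:
    """Serialization delay in ns for nbytes over the slowest link on the path."""
    if bottleneck_bw_gbs <= 0:
        raise ValueError("bandwidth must be positive")
    # bytes / (bytes per ns) = ns, since 1 GB/s == 1 byte/ns.
    return nbytes / bottleneck_bw_gbs
```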

@@ -65,7 +65,7 @@ Example: 4096 bytes through a path with bottleneck 128 GB/s → `4096 / 128 = 32

 ### Formula (Theoretical Lower Bound)

-```
+```text
 formula_ns = Σ(wire_prop) + Σ(overhead_ns) + drain_ns
 ```
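
The lower bound can be evaluated by walking a path and summing the three deterministic terms. The sketch below assumes a hypothetical `Hop` record (wire length, component overhead, edge bandwidth); the field names and the sample distances are illustrative, not the simulator's data model.

```python
from dataclasses import dataclass

NS_PER_MM = 0.01  # global wire delay from the document (10 ps/mm)


@dataclass
class Hop:
    distance_mm: float  # wire length leading into this component
    overhead_ns: float  # the component's overhead_ns attribute
    bw_gbs: float       # bandwidth of the edge leading into this component


def formula_ns(path: list[Hop], nbytes: int, ns_per_mm: float = NS_PER_MM) -> float:
    """Theoretical lower bound: wire + overhead + drain, with no queueing."""
    wire = sum(h.distance_mm * ns_per_mm for h in path)
    overhead = sum(h.overhead_ns for h in path)
    drain = nbytes / min(h.bw_gbs for h in path)  # bottleneck link sets drain
    return wire + overhead + drain


# Illustrative two-hop path (made-up distances), 4096-byte transfer.
example = formula_ns([Hop(4.0, 2.0, 256.0), Hop(4.0, 0.0, 256.0)], 4096)
```

Drain is charged once for the whole path at the bottleneck bandwidth, whereas wire and overhead accumulate per hop.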

@@ -159,7 +159,7 @@ a timeout or waits on a resource/store. The delta between start and done capture

 Each component is a SimPy process:

-```
+```text
 _fan_in (per in_port)  →  _inbox (Store)  →  _worker  →  out_ports
 ```

@@ -215,7 +215,7 @@ If request A holds the resource and request B arrives:
 - SimPy advances B's `env.now` by A's remaining service time
 - This "extra" time shows up in B's `total_ns` automatically

-```
+```text
 No contention:  actual_ns == formula_ns
 Contention:     actual_ns  > formula_ns
                queueing_delay = actual_ns - formula_ns
@@ -237,7 +237,7 @@ with self._resource.request() as req:
 This means a short request arriving during a long request's drain must wait
 for the full remaining drain time—classic head-of-line blocking:

-```
+```text
 Request A: 4 KB,  drain = 16.0 ns   (arrives at t=0)
 Request B: 64 B,  drain = 0.25 ns   (arrives at t=5)
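
The head-of-line effect for these two requests can be reproduced without SimPy by treating the resource as a capacity-1 FIFO server. This is a simplified sketch that charges only the drain time as service time; the function name and tuple layout are assumptions.

```python
def fifo_single_server(requests):
    """Simulate a capacity-1 FIFO resource.

    requests: list of (name, arrival_ns, service_ns), sorted by arrival time.
    Returns {name: (start_ns, finish_ns, queueing_ns)}.
    """
    free_at = 0.0
    result = {}
    for name, arrival, service in requests:
        start = max(arrival, free_at)  # wait while the resource is still held
        finish = start + service
        result[name] = (start, finish, start - arrival)
        free_at = finish
    return result


timeline = fifo_single_server([("A", 0.0, 16.0), ("B", 5.0, 0.25)])
```

Under these assumptions request B waits 11 ns for a 0.25 ns drain, so its queueing delay dwarfs its own service time.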

@@ -274,7 +274,7 @@ Setup: PE0 and PE1 in cube0 both read 4096 bytes from their local HBM slices

 ### Paths

-```
+```text
 DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
 DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
 ```
@@ -284,7 +284,7 @@ DMA B: pe1.pe_dma → xbar.pe1 → hbm_ctrl.slice1
 Since slice0 and slice1 are **separate** hbm_ctrl instances, each with its own
 `simpy.Resource(capacity=1)`, there is no resource competition.

-```
+```text
 DMA A timeline:
  t=0.00   pe_dma dequeues txn
  t=0.00   xbar.pe0: overhead_ns=2.0 → t=2.00
@@ -304,13 +304,13 @@ Both complete at ~18.09 ns. `actual == formula` for both.

 Now suppose both PE0 and PE1 read from **slice0**:

-```
+```text
 DMA A: pe0.pe_dma → xbar.pe0 → hbm_ctrl.slice0
 DMA B: pe1.pe_dma → xbar.pe1 → xbar.pe0 → hbm_ctrl.slice0
                                (chain traversal to reach slice0)
 ```

-```
+```text
 DMA A timeline:
  t=0.00   xbar.pe0(2.0) → wire → hbm_ctrl.slice0
  t=2.025  yield req → immediate (first to arrive)
@@ -343,7 +343,7 @@ compare `actual_ns` vs `formula_ns` (available in PE DMA traces).

 ## Probe Output Explained

-```
+```text
 === PE DMA Latency ===
 Case                Target              Actual  Ovhd  Drain  Wire  Ovhd% Drain%  Eff.BW   BN.BW   Util%
 pe-local-hbm        c0.pe0->c0.slice0    18.09   2.0  16.0  0.08  11.1% 88.5%   226.49   256.0   88.5%
@@ -368,7 +368,7 @@ pe-cross-half-hbm   c0.pe0->c0.slice4    37.14   5.0  32.0  0.14  13.5% 86.1%
 fraction. For small transfers (4KB), overhead is significant relative to drain.
 For large transfers, drain dominates and utilization approaches 100%.

-```
+```text
  4 KB:  Ovhd=2.0, Drain=16.0  → Util=88.5%   (overhead is 11% of time)
 64 KB:  Ovhd=2.0, Drain=256.0 → Util=99.2%   (overhead is <1% of time)
 ```
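
The utilization trend follows from dividing drain time by total uncontended time. A small sketch, assuming the 256 GB/s bottleneck, 2.0 ns overhead, and 0.08 ns wire delay shown in the `pe-local-hbm` probe row; the function name is hypothetical.

```python
def utilization(nbytes: int, bw_gbs: float = 256.0,
                overhead_ns: float = 2.0, wire_ns: float = 0.08) -> float:
    """Fraction of total uncontended latency spent draining data."""
    drain = nbytes / bw_gbs
    return drain / (drain + overhead_ns + wire_ns)


util_4k = utilization(4 * 1024)
util_64k = utilization(64 * 1024)
```

Overhead and wire delay are fixed per request, so their share shrinks as the transfer grows.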