Files
kernbench2/docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md
T
ywkang 687c98086d ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037
Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 3-letter category prefixes
  (dev / mem / lat / prog / algo / par / api / ver). Numbers stay immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 01:15:55 -07:00

77 lines
3.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# ADR-0004: Memory Semantics & Local-HBM Bandwidth Guarantee
## Status
Accepted
## Context
Accurately modeling PE↔HBM behavior is essential for kernel latency estimation.
Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, independent of intervening on-die fabric bandwidth.
## Decision
### D1. Local HBM definition
- Each PE is assigned a logically defined “local HBM” region.
- Local HBM corresponds to the pseudo-channel subset directly attached to that PEs
router in the NOC mesh (ADR-0017 D4).
- The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
- The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
### D2. Local HBM bandwidth guarantee contract
- Accesses from a PE to its local HBM MUST guarantee full effective HBM
read/write bandwidth independent of intervening fabric bandwidth limits.
- Effective HBM bandwidth = spec bandwidth x efficiency factor.
The efficiency factor (configured via `hbm_ctrl.attrs.efficiency`, default 0.8)
models real-world DRAM inefficiencies (refresh cycles, bank conflicts, page
misses). For example: 256 GB/s spec x 0.8 = 204.8 GB/s effective.
- The topology builder applies the efficiency factor to router-to-hbm edge
bandwidth at graph construction time, so all downstream routing and latency
computation uses the effective value.
- This guarantee is modeled by:
- a dedicated logical path and/or service model that enforces HBM BW at the PE-local-HBM interaction point,
- while still incurring non-zero latency along explicitly modeled components.
- HBM CTRL internal modeling (PC striping, cut-through, scheduling fidelity)
is consolidated in ADR-0033 (Latency Model: Assumptions and Known
Simplifications). The aggregate BW guarantee here remains the contract;
ADR-0033 documents how the per-PC model realizes it and which scheduler
effects are intentionally simplified.
### D3. Remote PE HBM semantics (intra-cube)
- A PE that accesses another PE's local HBM traverses the NOC:
- PE_DMA → NOC → (fabric hops) → target PE's NOC port → HBM_CTRL
- NOC bandwidth and hop count may limit remote HBM access relative to local access.
### D4. Non-local HBM semantics (inter-cube / inter-SIP)
- Accesses from a PE to HBM in a different cube or SIP MAY be limited by:
- NOC bandwidth within the cube,
- inter-cube UCIe links,
- inter-SIP fabric (PCIe/UAL).
- These paths MUST be explicit and traceable.
### D5. Shared SRAM semantics
- Each CUBE contains a shared SRAM accessible by all PEs in that CUBE.
- Access path: PE_DMA → NOC → shared SRAM.
- Shared SRAM bandwidth is limited by the NOC↔SRAM link bandwidth.
- Shared SRAM is not part of the HBM address space; it is a separate memory domain.
## Verification Notes
Tests should cover:
- local-HBM case: BW matches HBM BW regardless of fabric BW parameter
- remote PE HBM case: latency includes mesh hop traversal
- non-local cases (inter-cube/inter-SIP): BW/latency respond to fabric/link parameters
- shared SRAM case: access via NOC with correct BW
## Links
- SPEC R2/R5
- ADR-0002 (distance/order & explicit bypass)
- ADR-0017 D7 (PE DMA data paths through NOC to HBM)