diff --git a/.claude/commands/report.md b/.claude/commands/report.md new file mode 100644 index 0000000..a6ea2bc --- /dev/null +++ b/.claude/commands/report.md @@ -0,0 +1,318 @@ +--- +description: Generate a public-facing architecture design document from approved ADRs and SPEC.md, with gap analysis reported to chat only. +--- + +# `/report` — Architecture Design Document Generator + +Generates a **public-facing** architecture design document at +`docs/report/architecture-{YYYY}-{1H|2H}.md` derived from the current ADR +corpus, SPEC.md, CLAUDE.md, and the canonical component list. + +This command is **strictly read-only** on `docs/adr/`, `SPEC.md`, +`CLAUDE.md`, and `src/`. The only write is the report file itself +(a derived artifact under `docs/report/`). + +--- + +## Invocation + +Two modes: + +- `/report` — **dry-run** (default). No file is written. The command + reads sources, performs classification, and reports the planned TOC + + gap analysis to chat only. Use this to validate ADR-to-section + mapping before committing. +- `/report write` — **write mode**. Performs the same procedure and + writes `docs/report/architecture-{period}.md`. Use after a dry-run + whose classification looks correct. + +Period determination (both modes), from system date: + +- month 1–6 → `{YYYY}-1H` +- month 7–12 → `{YYYY}-2H` + +In write mode, if `docs/report/architecture-{period}.md` already exists, +overwrite it without asking (regeneration is the expected operation). + +--- + +## Output Contract + +### Document body (`docs/report/architecture-{period}.md`) + +Public release form. Reader is an external developer/architect. They do +**not** have access to SPEC.md or ADR files. Therefore: + +- **No `ADR-NNNN` identifiers** in visible prose. +- **No `SPEC R/§` identifiers** in visible prose. +- **No internal jargon** assumed without definition. +- **No diagram embeds** — only `` placeholders. 
+- **Attribution via HTML comments** — every prose paragraph that derives + from a source carries an inline comment immediately above it: + `` (multiple sources allowed). + +### Chat-only report (not written to any file) + +After the procedure completes (either mode), report to the user in the chat response: + +- File path written. +- Section counts (e.g., "Detailed Architecture: 8 components covered, + 2 in `builtin/` have no ADR backing"). +- **G1 gaps** — SPEC requirements (R-numbers / §) with no ADR citing them. +- **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives + and Consequences are optional; their absence is NOT a gap. +- **G3 gaps** — ADR cross-references without a back-reference. +- **G4 suggestions** — areas where an ADR seems missing based on the + ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each + G4 item must say *why* it's suggested and remain falsifiable. +- **G5 consistency issues** — ADR-to-ADR inconsistencies: + - **G5a (supersession not reflected)** — ADR-A states it supersedes + ADR-B, but ADR-B's Status is not marked as Superseded. + - **G5b (merge candidates)** — two or more ADRs cover near-identical + scope (detected naturally during section assignment, not via + exhaustive pair-wise scan). + - **G5c (explicit contradictions)** — two ADRs whose Decisions + directly oppose each other. Must cite both quotations; do not + speculate contradictions from topical similarity alone. +- **TOC rationale** — for each section, list contributing ADR IDs + (this is for the user's verification only, never written to the + document itself). + +G4 must never appear in the document body. G1–G3 and G5 are also chat-only. + +--- + +## Procedure + +### Step 1 — Determine period + +Use current system date. Compute `{YYYY}-1H` or `{YYYY}-2H`. + +### Step 2 — Ingest ADRs + +For each `docs/adr/ADR-NNNN-*.md`: + +- If both `ADR-NNNN-*.md` (Korean) and `ADR-NNNN-*.en.md` (English) + exist for the same number, **prefer the Korean `.md`** version. 
+- Parse for the four canonical sections: Context, Decision, Alternatives + (also accept "Alternatives Considered"), Consequences. +- Record presence/absence of **Context** and **Decision** for G2. + Alternatives and Consequences presence is recorded for use during + authoring, but their absence is not a gap. +- Record ADR-NNNN cross-references for G3. +- Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes + ADR-NNNN" text in the body for G5a. + +Process ADRs in **numerical order** for determinism. + +### Step 3 — Read canonical component list + +List `src/kernbench/components/builtin/*.py`, excluding `__init__.py`, +`pe_types.py`, and `__pycache__/`. Sort alphabetically. This is the +canonical order for Detailed Architecture subsections. + +### Step 4 — Read SPEC.md and CLAUDE.md + +For G1 detection: extract every `R` and `§` identifier mentioned +in SPEC.md. For each ADR, check which of these it cites. SPEC IDs with +zero citing ADRs → G1. + +### Step 5 — Section assignment + +Assign each ADR to exactly one of: + +- **Design Principles** — project-wide rationale, philosophy, mission + (e.g., "why source-level kernel execution", "why fast multi-device + scaling"). Includes ADRs that describe foundational invariants + (e.g., latency model assumptions, verification strategy). +- **High-level Architecture** — Tray / SIP / CUBE / PE hierarchy and + cross-layer boundaries (e.g., runtime API ↔ sim_engine ↔ components). +- **Detailed Architecture** — single-component internal designs. One + subsection per file in the canonical component list. ADRs whose + primary topic is the internal structure of one component go here. +- **Implementation Decisions** — **cross-cutting** algorithms / policies + / schemes / models that don't belong to a single component: + collective algorithms, parallelization policies, address schemes, + routing algorithms, model assumptions. 
+ +Boundary rule between Detailed Architecture and Implementation Decisions: + +> Detailed Architecture = component-internal. +> Implementation Decisions = spans multiple components OR is an +> algorithm/policy/scheme/assumption rather than a structural choice. + +If an ADR fits two sections plausibly, prefer the one that minimizes +duplication and pick the more specific bucket (Detailed if it primarily +concerns one component, else Implementation Decisions). + +During classification, opportunistically detect ADR consistency issues: + +- **G5b (merge candidate)** — if two or more ADRs land in the same + Detailed Architecture subsection or the same Implementation Decisions + topic AND their primary scope is near-identical, record as a merge + candidate. Topical adjacency is not enough; the scopes must be + effectively the same question. +- **G5c (explicit contradiction)** — if while reading you encounter two + ADRs whose Decisions directly oppose each other on the same question, + record both quotations verbatim with their ADR IDs. Do NOT speculate + contradictions from similarity, vocabulary, or domain overlap — only + explicit, citable opposition. + +Do NOT perform an exhaustive pair-wise scan of all ADRs. G5b/G5c are +byproducts of normal reading; if not encountered, the chat report +shows "(none)". + +### Step 6 — Write the document (write mode only) + +In **dry-run mode**, skip this step entirely. Proceed directly to Step 7. + +```markdown +# KernBench — Architecture Design Document +*{YYYY} {1H|2H}* + +## Design Principles + + +## High-level Architecture + + +### Tray +### SIP +### CUBE +### PE + +## Detailed Architecture +### +### +... + +## Implementation Decisions +### +### +... +``` + +#### Authoring rules (apply to every section) + +- **Stay grounded.** Every claim must trace to an ADR's stated content + (Context / Decision / Alternatives / Consequences). No invented + motivation, no invented alternatives, no invented trade-offs. 
+- **4-part discipline, naturally.** Each subsection should naturally + cover: the problem the design addresses, the decision made, the + alternatives considered, the consequences. Do **not** label these + with rigid headers like "**Problem.**" — weave them into prose. But + ensure all four are present *if the source ADR documents them*. +- **Missing → omit, not fabricate.** If a source ADR has no + "Alternatives" section, do **not** invent alternatives for the + report. Simply write the remaining parts; this is not a G2 gap. +- **Attribution.** Every paragraph derived from one or more ADRs + carries an HTML comment immediately above: + ``. +- **Diagram placeholders.** Where a diagram would help, insert + `` + on its own line. **Never** embed an image (`![...](...)`). +- **Public tone.** Self-contained. Define internal terms (SIP, CUBE, + PE, Tray, NOC, IPCQ, TCM, etc.) on first use within the document. + Do not assume reader has read SPEC or ADRs. +- **No internal references.** No `ADR-NNNN` in body text. No + `SPEC §X.Y` or `R` in body text. These appear only inside HTML + attribution comments. +- **Detailed Architecture component subsections.** Use the canonical + list from Step 3 in order. For each component file, write a + subsection drawing from any ADR that primarily concerns that + component. If no ADR covers a component, write a one-line stub + noting the component exists and flag it in chat report. If an ADR + covers a topic not in the canonical list, place it under + "Detailed Architecture → Other" (sub-subsection) and flag for + canonical-list extension in chat. +- **Implementation Decisions topic naming.** Derive topic names from + ADR titles, made reader-friendly (no ADR number). Group related + ADRs under one topic when natural (e.g., multiple address-related + ADRs under "Address Scheme"). + +### Step 7 — Generate chat report + +After Step 6 (write mode) or directly from Step 5 (dry-run mode), +emit the following to chat. 
Do **not** write any of this to a file. + +In **dry-run mode**, replace the `Wrote:` line with: +`**DRY-RUN — no file written.** Review TOC and gaps below. Run \`/report write\` to commit.` + +``` +## /report — Generation Summary + +**Wrote:** docs/report/architecture-{period}.md + +**Section coverage** +- Design Principles: ADRs +- High-level Architecture: ADRs +- Detailed Architecture: / components ; components without ADR: [...] +- Implementation Decisions: topics, ADRs + +**TOC rationale (ADR → section mapping)** +- Design Principles: ADR-NNNN, ADR-MMMM +- High-level Architecture: ... +- Detailed Architecture → : ADR-NNNN +- Implementation Decisions → : ADR-NNNN, ADR-MMMM + +**G1 — SPEC requirements without ADR support** +- R / §: not cited by any ADR +- (or "none") + +**G2 — ADRs missing required sections (Context or Decision)** +- ADR-NNNN: missing +- (or "none") + +**G3 — Broken cross-references** +- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference +- (or "none") + +**G4 — Suggested topics that may warrant a new ADR (verify before acting)** +- : +- (or "none") + +**G5 — ADR consistency issues** +- **G5a (supersession not reflected)** + - ADR-NNNN claims to supersede ADR-MMMM, but ADR-MMMM Status is "" + - (or "none") +- **G5b (merge candidates)** + - ADR-NNNN + ADR-MMMM: near-identical scope on — evaluate merge + - (or "none") +- **G5c (explicit contradictions)** + - ADR-NNNN says ""; ADR-MMMM says "" — direct opposition on + - (or "none") +``` + +--- + +## Constraints (do not violate) + +1. **Read-only on source.** No writes to `docs/adr/`, `SPEC.md`, + `CLAUDE.md`, or `src/`. Only write is + `docs/report/architecture-{period}.md`. +2. **No fabrication.** Every body paragraph traces to ADR content via + HTML attribution comment. +3. **No diagram embeds.** Placeholders only. +4. **No internal IDs in body.** ADR-NNNN and SPEC R/§ stay inside + HTML comments only. +5. 
**Determinism.** ADRs processed in numerical order; components in + canonical (alphabetical) order. Same inputs → same output. +6. **G4 stays in chat.** Never written to the document. +7. **Korean bilingual preference.** When both `.md` and `.en.md` + exist for the same ADR number, use `.md`. +8. **All ADRs included.** No exclusion list. ADRs about internal + tooling (CLI, diagram views, verification strategy) are still + included — usually under Design Principles or Implementation + Decisions, written in publishable form. + +--- + +## Failure modes to avoid + +- **Padding** with general background not present in the source ADRs. +- **Inferring alternatives** the ADR doesn't mention. +- **Quietly skipping** an ADR because it seems internal. Include it, + rephrase for public audience. +- **Inventing components** not in `src/kernbench/components/builtin/`. +- **Auto-selecting diagrams** from `docs/diagrams/`. Only placeholders. +- **Promoting G4 suggestions to the document.** They stay in chat. 
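Reviewer note on `.claude/commands/report.md`: two of the mechanical rules in the command (the half-year period in Step 1 and the G3 back-reference check) are simple enough to pin down in code. The sketch below is only an illustration under stated assumptions. `report_period`, `report_path`, and `missing_back_references` are hypothetical names, not part of the repository, and the G3 check assumes that a citation to an ADR outside the supplied corpus is ignored rather than reported.

```python
import re
from datetime import date
from pathlib import Path

# Matches identifiers such as ADR-0007 (four digits, as used in this repo's file names).
ADR_REF = re.compile(r"ADR-(\d{4})")


def report_period(today: date) -> str:
    """Months 1-6 map to {YYYY}-1H, months 7-12 to {YYYY}-2H."""
    half = "1H" if today.month <= 6 else "2H"
    return f"{today.year}-{half}"


def report_path(today: date, root: Path = Path("docs/report")) -> Path:
    """Path of the derived report file for the period containing `today`."""
    return root / f"architecture-{report_period(today)}.md"


def missing_back_references(bodies: dict[str, str]) -> list[tuple[str, str]]:
    """Return (citing, cited) pairs where `cited` never mentions `citing`.

    `bodies` maps an ADR id (e.g. "ADR-0007") to its full text.
    Assumption: citations to ADRs not present in `bodies` are skipped.
    """
    refs = {
        adr: {f"ADR-{n}" for n in ADR_REF.findall(text)} - {adr}
        for adr, text in bodies.items()
    }
    gaps = []
    # Sorted iteration keeps the output deterministic, matching Constraint 5.
    for citing in sorted(refs):
        for cited in sorted(refs[citing]):
            if cited in refs and citing not in refs[cited]:
                gaps.append((citing, cited))
    return gaps
```

Passing `date.today()` to `report_path` gives the file that write mode would overwrite. Each pair returned by `missing_back_references` corresponds to one line under "G3 — Broken cross-references" in the chat report.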
diff --git a/.claude/settings.local.json b/.claude/settings.local.json index fbd7d37..3c0351e 100644 --- a/.claude/settings.local.json +++ b/.claude/settings.local.json @@ -30,7 +30,10 @@ "Bash(python -m pytest tests/test_pe_components.py -v)", "Bash(python -m pytest tests/test_triton_emu.py -v)", "Bash(python -m pytest tests/test_pe_components.py tests/test_triton_emu.py -v)", - "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)" + "Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)", + "Bash(git add:*)", + "Bash(git commit:*)", + "Bash(git push:*)" ] } } diff --git a/.gitignore b/.gitignore index ff7356b..93a3b88 100644 --- a/.gitignore +++ b/.gitignore @@ -29,4 +29,6 @@ build/ # Logs *.log -.claude/ +.claude/* +!.claude/commands/ +!.claude/commands/*.md diff --git a/CLAUDE.md b/CLAUDE.md index 2057f0c..afc2ab4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -218,17 +218,43 @@ General fallbacks. Apply to anything not explicitly covered above. ### ADR Lifecycle -- `docs/adr/` contains ADRs reflecting current implementation or - work-in-progress designs. -- `docs/history/` contains superseded ADRs as historical record. -- When an ADR is superseded: - 1. The superseding ADR includes a "Supersedes ADR-NNNN" line. - 2. The superseded ADR's Status is set to "Superseded by ADR-MMMM". - 3. The superseded ADR file is **moved** (git mv) to `docs/history/`. -- Cross-references between ADRs use the ADR-NNNN ID and remain - valid regardless of file location. -- ADR numbers are **immutable**; never renumber. Numbering holes - from moved ADRs are expected. +ADRs live in one of three folders based on lifecycle state: + +- `docs/adr/` — **Accepted** (current implementation reflected). 
+- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design + only / future-work exploration / retroactive documentation pending + verification). +- `docs/adr-history/` — **Superseded** or **Merged** (no longer the + authoritative source; kept as historical record). + +Status field values: + +- `Accepted` — design is in current implementation. +- `Proposed` — design is concrete but not yet implemented. +- `Stub (Future Work)` — design space exploration; no commitment yet. +- `Draft` — retroactive documentation drafted but not yet verified + against the implementation it describes. +- `Superseded by ADR-NNNN` — replaced by another ADR. +- `Merged into ADR-NNNN` — content absorbed by another ADR. + +Transitions: + +- **Proposed/Stub → Accepted**: when the ADR's decisions are + reflected in production code AND covered by tests. `git mv` from + `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`. +- **Draft → Accepted**: when the ADR's text has been verified to + accurately describe the existing implementation. `git mv` from + `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`. +- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM` + and `git mv` to `docs/adr-history/`. The superseding ADR includes + a "Supersedes ADR-NNNN" reference (or, for partial supersession of + clauses, documents this in its own body). +- **Accepted → Merged**: set Status to `Merged into ADR-MMMM` + (single-line stub) and `git mv` to `docs/adr-history/`. + +Cross-references between ADRs use the `ADR-NNNN` ID and remain valid +regardless of folder location. ADR numbers are **immutable**; never +renumber. Numbering holes from moved ADRs are expected. 
## Terminology diff --git a/README.md b/README.md index 36d969e..77a9f5e 100644 --- a/README.md +++ b/README.md @@ -155,5 +155,6 @@ kernbench/ ## Documentation - [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release -- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples +- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples +- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation) - [docs/adr/](docs/adr/) — Architecture Decision Records diff --git a/SPEC.md b/SPEC.md index 0850a1b..bdcfaaf 100644 --- a/SPEC.md +++ b/SPEC.md @@ -51,7 +51,7 @@ Major architectural decisions are documented in ADRs and referenced by number. - ADR-0007: runtime_api vs sim_engine responsibility boundaries - ADR-0008: Tensor deployment and allocation (Host allocator, PA-first) - ADR-0009: Kernel execution fan-out and completion semantics -- ADR-0010: CLI device selection and multi-device execution semantics +- ADR-0010: Command line interface and execution semantics - ADR-0011: Memory Addressing — PA / VA / LA Address Models - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards) - ADR-0013: Verification strategy and Phase 1 test plan diff --git a/docs/history/ADR-0018-Logical Address.en.md b/docs/adr-history/ADR-0018-mem-logical-address.en.md similarity index 100% rename from docs/history/ADR-0018-Logical Address.en.md rename to docs/adr-history/ADR-0018-mem-logical-address.en.md diff --git a/docs/history/ADR-0018-Logical Address.md b/docs/adr-history/ADR-0018-mem-logical-address.md similarity index 100% rename from docs/history/ADR-0018-Logical Address.md rename to docs/adr-history/ADR-0018-mem-logical-address.md diff --git a/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md b/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md new file mode 100644 index 
0000000..1d746ae --- /dev/null +++ b/docs/adr-history/ADR-0019-dev-noc-local-hbm.en.md @@ -0,0 +1,5 @@ +# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC + +## Status + +Merged into ADR-0017 (Cube NOC and HBM Connectivity). diff --git a/docs/adr-history/ADR-0019-dev-noc-local-hbm.md b/docs/adr-history/ADR-0019-dev-noc-local-hbm.md new file mode 100644 index 0000000..63efb1d --- /dev/null +++ b/docs/adr-history/ADR-0019-dev-noc-local-hbm.md @@ -0,0 +1,5 @@ +# ADR-0019: CUBE NOC 내 Per-Channel 및 Aggregated HBM 연결 모델 + +## Status + +Merged into ADR-0017 (Cube NOC and HBM Connectivity). diff --git a/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md new file mode 100644 index 0000000..710dadd --- /dev/null +++ b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.en.md @@ -0,0 +1,5 @@ +# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing + +## Status + +Merged into ADR-0014 (PE Pipeline Execution Model). diff --git a/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md new file mode 100644 index 0000000..82b48e2 --- /dev/null +++ b/docs/adr-history/ADR-0021-dev-pe-pipeline-refactor.md @@ -0,0 +1,5 @@ +# ADR-0021: PE 파이프라인 리팩토링 — 컴포넌트 분리 + Scheduler 기반 라우팅 + +## Status + +Merged into ADR-0014 (PE Pipeline Execution Model). 
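Reviewer note: the merged stubs above follow the folder-per-Status rule that this change adds to `CLAUDE.md` (ADR Lifecycle). A minimal sketch of how that rule could be checked mechanically is shown below. `expected_folder` is a hypothetical helper, not something the repository provides, and it assumes the Status text is matched by prefix so that trailing notes such as "(Cube NOC and HBM Connectivity)." are tolerated.

```python
def expected_folder(status: str) -> str:
    """Map an ADR Status value to the folder CLAUDE.md says it belongs in."""
    s = status.strip()
    if s == "Accepted":
        return "docs/adr"
    # Proposed, Stub (Future Work), and Draft all live in the proposed folder.
    if s in ("Proposed", "Draft") or s.startswith("Stub"):
        return "docs/adr-proposed"
    # Superseded and Merged ADRs are kept only as historical record.
    if s.startswith(("Superseded by ADR-", "Merged into ADR-")):
        return "docs/adr-history"
    raise ValueError(f"unrecognized ADR status: {status!r}")
```

A check like this could run over the `## Status` section of each ADR file and compare the result with the file's actual parent directory; how the Status text is extracted is left out here.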
diff --git a/docs/history/ADR-0029-hierarchical-allreduce.md b/docs/adr-history/ADR-0029-algo-hierarchical-allreduce.md similarity index 100% rename from docs/history/ADR-0029-hierarchical-allreduce.md rename to docs/adr-history/ADR-0029-algo-hierarchical-allreduce.md diff --git a/docs/history/ADR-0031-physaddr-pe-resource-extension.md b/docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md similarity index 99% rename from docs/history/ADR-0031-physaddr-pe-resource-extension.md rename to docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md index e8bb39e..6aa23c5 100644 --- a/docs/history/ADR-0031-physaddr-pe-resource-extension.md +++ b/docs/adr-history/ADR-0031-mem-physaddr-pe-resource-extension.md @@ -257,5 +257,5 @@ PhysAddr encoding. 호출자는 어느 range인지 몰라도 됨. |------|--------| | `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, 신규 component-specific factory들 (`pe_ipcq_addr` 등), 기존 `pe_tcm_addr` 내부 인코딩 갱신 | | `src/kernbench/policy/address/allocator.py` | Range-aware pool 분리 (TCM pool / IPCQ pool / scratchpad pool 등 per-PE) | -| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition | +| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition | | `tests/test_phyaddr.py` | Range table 검증, 각 factory의 encode/decode round-trip, 기존 `pe_tcm_addr` 회귀 | diff --git a/docs/adr/ADR-0028-dtensor-support.md b/docs/adr-proposed/ADR-0028-par-dtensor-support.md similarity index 100% rename from docs/adr/ADR-0028-dtensor-support.md rename to docs/adr-proposed/ADR-0028-par-dtensor-support.md diff --git a/docs/adr/ADR-0030-ipcq-physaddr.md b/docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md similarity index 99% rename from docs/adr/ADR-0030-ipcq-physaddr.md rename to docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md index ab735be..cfc9add 100644 --- a/docs/adr/ADR-0030-ipcq-physaddr.md +++ b/docs/adr-proposed/ADR-0030-mem-ipcq-physaddr.md @@ -340,7 
+340,7 @@ encoding can be plugged in later" 약속이 이행된 것. | `src/kernbench/sim_engine/memory_store.py` | D3: IPCQ buffer가 기존 space와 공유되는지 검증 | | `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing이 PhysAddr-based fabric 경로 사용 | | `src/kernbench/ccl/diagnostics.py` | D5: pointer_dump를 PhysAddr 포매팅으로 개선 | -| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note | +| `docs/adr/ADR-0023-dev-ipcq-pe-collective.md` | D6: D2.5 amendment note | | `tests/test_ipcq_physaddr.py` (new) | T1 | | `tests/test_ipcq_alloc.py` (new) | T2 | | `tests/test_ccl_install_plan.py` | T3 확장 | diff --git a/docs/adr/ADR-0001-physaddr-layout.md b/docs/adr/ADR-0001-mem-physaddr-layout.md similarity index 100% rename from docs/adr/ADR-0001-physaddr-layout.md rename to docs/adr/ADR-0001-mem-physaddr-layout.md diff --git a/docs/adr/ADR-0002-routing-distance.md b/docs/adr/ADR-0002-lat-routing-distance.md similarity index 97% rename from docs/adr/ADR-0002-routing-distance.md rename to docs/adr/ADR-0002-lat-routing-distance.md index 34bd7e4..19849f9 100644 --- a/docs/adr/ADR-0002-routing-distance.md +++ b/docs/adr/ADR-0002-lat-routing-distance.md @@ -35,7 +35,7 @@ shortcuts that obscure control paths. ### D3. Bypass is explicit and graph-represented - All paths must be explicitly represented in the graph and subject to latency accumulation. -- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations +- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops. Local HBM access has minimal hops (switching overhead only); remote access traverses additional routers. 
diff --git a/docs/adr/ADR-0003-target-system-hierarchy.md b/docs/adr/ADR-0003-dev-target-system-hierarchy.md similarity index 100% rename from docs/adr/ADR-0003-target-system-hierarchy.md rename to docs/adr/ADR-0003-dev-target-system-hierarchy.md diff --git a/docs/adr/ADR-0004-memory-semantics-local-hbm.md b/docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md similarity index 98% rename from docs/adr/ADR-0004-memory-semantics-local-hbm.md rename to docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md index cb7d3ff..d9144b0 100644 --- a/docs/adr/ADR-0004-memory-semantics-local-hbm.md +++ b/docs/adr/ADR-0004-mem-memory-semantics-local-hbm.md @@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth, - Each PE is assigned a logically defined “local HBM” region. - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s - router in the NOC mesh (ADR-0019). + router in the NOC mesh (ADR-0017 D4). - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops). - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration. diff --git a/docs/adr/ADR-0005-diagram-views-distance-layout.md b/docs/adr/ADR-0005-dev-diagram-views-distance-layout.md similarity index 92% rename from docs/adr/ADR-0005-diagram-views-distance-layout.md rename to docs/adr/ADR-0005-dev-diagram-views-distance-layout.md index 6908409..6391f19 100644 --- a/docs/adr/ADR-0005-diagram-views-distance-layout.md +++ b/docs/adr/ADR-0005-dev-diagram-views-distance-layout.md @@ -20,7 +20,9 @@ Diagrams must reflect this distance by default. --- -## Global Defaults +## Decision + +### D1. Global Defaults - All diagrams MUST be **distance-aware by default**. - All diagrams MUST render **representative views** of the architecture. @@ -31,7 +33,7 @@ Diagrams must reflect this distance by default. --- -## Representative Rendering Rule +### D2. 
Representative Rendering Rule - All CUBEs share the same internal structure. - All PEs share the same internal structure. @@ -47,9 +49,9 @@ unless explicitly requested. --- -## Diagram Views +### D3. Diagram Views -### View A — SIP-Level Diagram +#### View A — SIP-Level Diagram **Purpose** Explain system-scale structure and connectivity. @@ -75,7 +77,7 @@ Explain system-scale structure and connectivity. --- -### View B — CUBE-Level Diagram +#### View B — CUBE-Level Diagram **Purpose** Explain cube-internal structure and data/control flow. @@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow. --- -### View C — PE-Level Diagram +#### View C — PE-Level Diagram **Purpose** Explain internal PE behavior and execution structure. @@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure. --- -## Distance-Aware Layout (Default) +### D4. Distance-Aware Layout (Default) -### Distance definition +#### Distance definition - Distance is defined as **accumulated latency**, consistent with ADR-0002. - Distance is computed from a single anchor node. -### Default anchor selection +#### Default anchor selection - SIP view: IO chiplet (or Host CPU if present) - CUBE view: a representative PE @@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure. Anchors are **implicit defaults** and MUST NOT be required to be specified. -### Layout rules +#### Layout rules - Diagrams MUST be laid out in layers based on distance buckets. - Layout direction MUST be consistent within a view type @@ -156,7 +158,7 @@ without affecting distance semantics. --- -## Generation Contract (for Tools / Claude Code) +### D5. 
Generation Contract (for Tools / Claude Code) When generating diagrams: diff --git a/docs/adr/ADR-0006-topology-compilation-distance-diagram.md b/docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md similarity index 99% rename from docs/adr/ADR-0006-topology-compilation-distance-diagram.md rename to docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md index 60b0d8b..4b3767c 100644 --- a/docs/adr/ADR-0006-topology-compilation-distance-diagram.md +++ b/docs/adr/ADR-0006-dev-topology-compilation-distance-diagram.md @@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE): - CUBE-level projection MUST include: - Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports, and PEs as opaque blocks. - - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019). + - All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017). - Default anchors are implicit (ADR-0005) and MUST NOT require instance indices. ### D6. Output formats and determinism diff --git a/docs/adr/ADR-0007-runtime-api-boundaries.md b/docs/adr/ADR-0007-api-runtime-api-boundaries.md similarity index 73% rename from docs/adr/ADR-0007-runtime-api-boundaries.md rename to docs/adr/ADR-0007-api-runtime-api-boundaries.md index 51975be..9522b9d 100644 --- a/docs/adr/ADR-0007-runtime-api-boundaries.md +++ b/docs/adr/ADR-0007-api-runtime-api-boundaries.md @@ -42,21 +42,25 @@ The runtime API MUST NOT: --- -### D2. Simulation engine executes and schedules requests +### D2. 
Simulation engine wires components and tracks completion The simulation engine (sim_engine) MUST: -- inject requests into the compiled topology graph, +- wire components at initialization (create port stores + start wire + processes per the component port/wire framework — ADR-0015), +- inject requests into the compiled topology graph at entry components + (e.g., PCIE_EP for memory operations, IO_CPU for kernel launch), - schedule and execute events using a discrete-event model, -- manage correlation ids and completion tracking, -- decompose operations into low-level requests when required - (e.g., MemoryWrite events). +- manage correlation ids and completion tracking. The simulation engine MUST NOT: - define tensor semantics, - define kernel execution policies, -- expose internal graph details to the runtime API. +- expose internal graph details to the runtime API, +- walk the topology path during request execution, +- call component `run()` methods directly, +- track per-hop latency or decompose fan-out (components own this). --- @@ -87,3 +91,5 @@ component-level fan-out explicitly. 
- SPEC R4, R7, R8 - ADR-0008 (Tensor deployment) - ADR-0009 (Kernel execution) +- ADR-0015 (Component port/wire model and engine role) +- ADR-0010 (CLI surface and execution semantics — runtime API consumer) diff --git a/docs/adr/ADR-0008-tensor-deploy-and-allocation.md b/docs/adr/ADR-0008-api-tensor-deploy-and-allocation.md similarity index 100% rename from docs/adr/ADR-0008-tensor-deploy-and-allocation.md rename to docs/adr/ADR-0008-api-tensor-deploy-and-allocation.md diff --git a/docs/adr/ADR-0009-kernel-execution-messaging.md b/docs/adr/ADR-0009-api-kernel-execution-messaging.md similarity index 97% rename from docs/adr/ADR-0009-kernel-execution-messaging.md rename to docs/adr/ADR-0009-api-kernel-execution-messaging.md index fae01b5..a94be07 100644 --- a/docs/adr/ADR-0009-kernel-execution-messaging.md +++ b/docs/adr/ADR-0009-api-kernel-execution-messaging.md @@ -142,3 +142,5 @@ control plane — runtime API and application kernels are unchanged. - SPEC R1, R2, R7, R8 - ADR-0007 (Runtime API boundaries) - ADR-0008 (Tensor deployment) +- ADR-0013 (Verification strategy — V2 fan-out tests) +- ADR-0015 D4 (concrete fabric path for kernel launch) diff --git a/docs/adr/ADR-0010-api-cli-surface-and-semantics.md b/docs/adr/ADR-0010-api-cli-surface-and-semantics.md new file mode 100644 index 0000000..4925a64 --- /dev/null +++ b/docs/adr/ADR-0010-api-cli-surface-and-semantics.md @@ -0,0 +1,131 @@ +# ADR-0010: Command Line Interface and Execution Semantics + +## Status + +Accepted + +## Context + +The `kernbench` CLI is the user-facing entry point of the simulator. It +exposes three subcommands: + +- `run` — execute a benchmark against a topology. +- `probe` — diagnostic utility for latency / BW measurement. +- `web` — interactive topology viewer. + +Device enumeration is centralized in the CLI; neither the runtime API +nor the simulation engine enumerates devices. Benchmarks remain +single-device by design and accept a device identifier as input. + +## Decision + +### D1. 
Benchmark contract — single-device by design + +- A benchmark MUST define behavior for a single device only. +- A benchmark MUST accept a device identifier as input. +- Benchmarks MUST NOT enumerate or loop over multiple devices. + +Multi-device execution is the CLI's concern (D3), not the benchmark's. + +### D2. `kernbench run` — benchmark execution + +Required arguments: + +- `--topology `: topology YAML file path. Loaded via + `resolve_topology()`. +- `--bench `: benchmark name. Resolved via + `benches.loader.resolve_bench()`. + +Optional arguments: + +- `--device ` (default: `all`): + - `all` — run once per discovered SIP (see D3). + - `sip:` — run only on SIP N. + - Parsed via `resolve_device()`. +- `--verify-data` (default: off) — enable Phase 2 data verification + (see ADR-0020). When set, `engine_factory` constructs the engine + with `enable_data=True`. After the benchmark runs, a diagnostic + summary of recorded ops is printed. + +Each invocation runs the benchmark once within a single simulation +instance. + +### D3. Multi-device execution is logically parallel + +When `--device all` (or omitted) and the topology has multiple SIPs: + +- Benchmark executions are submitted to a single simulation engine + instance. +- Executions are logically parallel in simulation time. +- Inter-device contention is naturally modeled (shared fabric + bandwidth, cross-SIP traffic, etc.). + +The CLI does NOT spawn multiple OS processes or independent +simulation runs — parallelism is internal to one simulation instance. + +### D4. `kernbench probe` — latency / BW diagnostic utility + +Required argument: + +- `--topology `: topology YAML file path. + +Optional argument: + +- `--case ` (default: `all`) — run a predefined traffic + pattern, or `all` to run every defined case. + +Probe runs each pattern through the simulation engine and reports +per case: + +- End-to-end latency (ns). +- Effective bandwidth (nbytes / total_ns). 
+- Bottleneck bandwidth (min edge BW along the chosen path). +- Utilization (effective / bottleneck). + +Probe additionally validates monotonicity invariants — for example +that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤ +cross-SIP — and reports violations. Probe is a developer tool for +verifying the latency / BW model; it is not a benchmark. + +### D5. `kernbench web` — topology viewer + +Optional arguments: + +- `--port ` (default: `8765`) — HTTP port. +- `--no-open` — do not auto-open the browser. + +Launches a local HTTP server that renders the compiled topology in +the browser. Distinct from the static `docs/diagrams/` artifacts: + +- `docs/diagrams/` files are derived at topology-compile time + (ADR-0006). +- `kernbench web` is interactive — pan/zoom, hover for component + attributes, switch between SIP / CUBE / PE views. + +### D6. Runtime API and simulation engine remain device-scoped + +- Runtime API calls operate on one device per invocation. +- The simulation engine schedules all requests deterministically. +- Neither layer enumerates devices. + +This invariant keeps each layer testable in isolation; device +enumeration and multi-device fan-out live only in the CLI's `run` +command (D3). + +## Consequences + +- Benchmark authors write single-device logic; multi-device behavior + emerges from the CLI dispatching across SIPs. +- Adding a new subcommand (e.g., trace export, replay) does not + require benchmark or runtime-API changes — the CLI is the + extension point. +- `probe` and `web` are diagnostic / visualization tools, not + benchmarks; they bypass the benchmark loader path. 
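
The per-case metrics that `probe` reports (D4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: `ProbeResult`, `edge_bw_gbs`, and `monotonicity_violations` are hypothetical names, and the sketch assumes bandwidths are expressed in GB/s, so that bytes divided by nanoseconds gives GB/s directly.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical container for one probe case (names are illustrative)."""

    case: str
    nbytes: int
    total_ns: float
    edge_bw_gbs: tuple[float, ...]  # bandwidth of each edge on the chosen path

    @property
    def effective_bw_gbs(self) -> float:
        # 1 byte/ns equals 1 GB/s, so no unit conversion is needed.
        return self.nbytes / self.total_ns

    @property
    def bottleneck_bw_gbs(self) -> float:
        # The slowest edge along the path bounds achievable bandwidth.
        return min(self.edge_bw_gbs)

    @property
    def utilization(self) -> float:
        return self.effective_bw_gbs / self.bottleneck_bw_gbs


def monotonicity_violations(
    latency_ns: dict[str, float], order: tuple[str, ...]
) -> list[tuple[str, str]]:
    """Return adjacent pairs (a, b) in `order` where latency(a) > latency(b)."""
    return [
        (a, b)
        for a, b in zip(order, order[1:])
        if latency_ns[a] > latency_ns[b]
    ]
```

For example, 4096 bytes delivered in 32 ns gives an effective bandwidth of 128 GB/s; over a path whose slowest edge is 128 GB/s, utilization is 1.0. The monotonicity helper takes the expected ordering (local HBM, cross-PE, cross-cube, cross-SIP) as an argument rather than hard-coding it.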
+ +## Links + +- SPEC R7, R8, R9 +- ADR-0007 (Runtime API and Simulation Engine Boundaries) +- ADR-0020 (Two-pass data execution — `--verify-data`) +- ADR-0006 (Topology compilation and diagram generation — + background for `kernbench web`) diff --git a/docs/adr/ADR-0010-cli-device-selection.md b/docs/adr/ADR-0010-cli-device-selection.md deleted file mode 100644 index bed601b..0000000 --- a/docs/adr/ADR-0010-cli-device-selection.md +++ /dev/null @@ -1,62 +0,0 @@ -# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics - -## Status - -Accepted - -## Context - -Benchmarks represent device-agnostic workloads that operate on a single device. -Users may want to run a benchmark: - -- on a specific device, or -- across all devices in the system. - -Device enumeration must not leak into benchmarks or runtime APIs. - ---- - -## Decision - -### D1. Benchmarks are single-device by design - -- A benchmark MUST define behavior for a single device only. -- A benchmark MUST accept a device identifier as input. -- Benchmarks MUST NOT enumerate or loop over multiple devices. - ---- - -### D2. CLI controls device selection - -The `kernbench run` command supports an optional `--device` argument: - -- If `--device ` is specified: - - the benchmark executes once for the specified device. - -- If `--device` is omitted: - - the benchmark executes once using all the SIPs discovered in the topology. - ---- - -### D3. Multi-device execution is logically parallel - -When running on multiple devices: - -- benchmark executions are submitted to a single simulation engine instance, -- executions are logically parallel in simulation time, -- inter-device contention is naturally modeled. - ---- - -### D4. Runtime API and simulation engine remain device-scoped - -- Runtime API calls operate on one device per invocation. -- The simulation engine schedules all requests deterministically. -- Neither layer enumerates devices. 
- ---- - -## Links - -- SPEC R7, R8 -- ADR-0007 (Runtime API boundaries) diff --git a/docs/adr/ADR-0011-memory-addressing-simplification.md b/docs/adr/ADR-0011-mem-memory-addressing-simplification.md similarity index 98% rename from docs/adr/ADR-0011-memory-addressing-simplification.md rename to docs/adr/ADR-0011-mem-memory-addressing-simplification.md index 8961b24..064c365 100644 --- a/docs/adr/ADR-0011-memory-addressing-simplification.md +++ b/docs/adr/ADR-0011-mem-memory-addressing-simplification.md @@ -396,7 +396,7 @@ Other N values: #### D-LA7. n:1 mode detail - One logical access → one aggregated request. -- Target: aggregated router → hbm_ctrl (see ADR-0019). +- Target: aggregated router → hbm_ctrl (see ADR-0017 D8). - Aggregated link BW = `channels_per_pe × channel_bw_gbs` (e.g. 8 × 32 = 256 GB/s). - Single queue / resource for modelling. @@ -516,6 +516,6 @@ Negative: - ADR-0009 (kernel execution) - ADR-0014 (PE-internal execution model) - ADR-0015 (component port/wire model) -- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology - consumer) +- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer) +- ADR-0013 (Verification strategy — V1 PA tagging) - SPEC R2 (latency by traversal), R10 (memory addressing) diff --git a/docs/adr/ADR-0012-host-io-message-schema.md b/docs/adr/ADR-0012-api-host-io-message-schema.md similarity index 98% rename from docs/adr/ADR-0012-host-io-message-schema.md rename to docs/adr/ADR-0012-api-host-io-message-schema.md index c9332f5..07d95c5 100644 --- a/docs/adr/ADR-0012-host-io-message-schema.md +++ b/docs/adr/ADR-0012-api-host-io-message-schema.md @@ -229,4 +229,5 @@ Tests SHOULD validate: - ADR-0011 (Memory Addressing — PA / VA / LA) - ADR-0007 (runtime_api vs sim_engine boundaries) - ADR-0009 (kernel execution fan-out/aggregation) +- ADR-0013 (Verification strategy — V1 message schema validation) - SPEC R2, R7, R8 diff --git a/docs/adr/ADR-0013-verification_strategy.md 
b/docs/adr/ADR-0013-ver-verification_strategy.md similarity index 100% rename from docs/adr/ADR-0013-verification_strategy.md rename to docs/adr/ADR-0013-ver-verification_strategy.md diff --git a/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md b/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md new file mode 100644 index 0000000..ccb63f3 --- /dev/null +++ b/docs/adr/ADR-0014-dev-pe-pipeline-execution-model.md @@ -0,0 +1,451 @@ +# ADR-0014: PE Pipeline Execution Model + +## Status + +Accepted + +## Context + +This ADR defines the PE-internal kernel execution model: + +- Role decomposition of PE-internal components +- Command dispatch paths (simple / composite / multi-op composite with epilogue) +- TileToken-based self-routing pipeline (scheduler does dispatch + completion only) +- TCM-centric dataflow with a register-file intermediary +- Engine resource model +- Observability and trace contract +- Topology representation + +PE-internal structure (7 components in scope; 2 cross-referenced): + +- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`, + `pe_tcm` — defined here +- `pe_mmu` — VA model, defined in ADR-0011 D-VA +- `pe_ipcq` — collective communication, defined in ADR-0023 + +The goal is a deterministic, trace-friendly execution contract that keeps +each block independently swappable. + +## Decision + +### D1. PE-internal component roles + +**PE_CPU** + +- Executes kernel instruction stream / control logic. +- Generates PE commands and submits them to `PE_SCHEDULER` (via + `PeInternalTxn`). +- Does NOT enqueue work directly into engine queues. + +**PE_SCHEDULER** + +- Sole dispatcher inside a PE. +- Receives commands from `PE_CPU`. Dispatch by command type: + - Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`) + → forward directly to the target engine. + - `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline + via a single `_feed_loop` (D6). 
+- Does not participate in stage-to-stage chaining within a composite; + that is handled by token self-routing (D6). + +**PE_DMA** + +- Handles memory transfers between TCM and external memory domains + (HBM, shared SRAM, cross-cube UCIe) through the cube NOC. +- Two execution channels: + - `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4. +- Additional virtual channels: + - `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles. + - `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8). + +**PE_FETCH_STORE** + +- TCM ↔ Register File transfer unit. +- Isolates register-file access semantics from compute engines so that + GEMM/MATH stay pure compute components. +- BW-based latency model; TCM access contention naturally serializes + through `PE_TCM`'s BW resource. + +**PE_GEMM** + +- MAC array. Reads operands from the register file; writes results to + the register file. Does not touch `PE_TCM` directly. + +**PE_MATH** + +- Element-wise / reduction / SIMD unit. Reads / writes the register file. + +**PE_TCM** + +- Tightly-coupled scratchpad with BW-serialized access. Two logical + regions partitioned by ownership (see D5). + +**Cross-referenced components** (defined elsewhere): + +- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA). +- `pe_ipcq` — collective ring buffers and peer endpoint metadata + (ADR-0023). + +### D2. Command lifecycle and queues + +`PE_SCHEDULER` maintains three logical structures: + +**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler. + +**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks +expanded sub-commands, dependency state, engine assignment, and +completion status. + +**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion +records. + +**Single-writer rule**: only `PE_SCHEDULER` mutates command completion +state. Engines report completion via explicit events / messages +consumed by the scheduler. 
+ +**Command completion**: when all sub-commands complete, `PE_SCHEDULER` +publishes a completion record. + +### D3. Dispatch modes + +#### D3.1 Simple command + +A simple command expands to exactly one engine sub-command: + +- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA` +- `GemmCmd` → `PE_GEMM` +- `MathCmd` → `PE_MATH` + +Flow: + +```text +PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution + → completion → PE_SCHEDULER → CompletionQueue +``` + +#### D3.2 Composite command (single-op tiled pipeline) + +The default `CompositeCmd` runs a single compute op as a tile-pipelined +sequence: + +```text +DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE +``` + +`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one +`TileToken` per tile with a monotonically increasing `tile_id`. + +Tile dependency (within one tile `t`): + +```text +DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t) +``` + +Inter-tile overlap is allowed wherever engine resources permit +(D4 governs the constraints): + +```text +DMA_READ(t+1) ∥ COMPUTE(t) +DMA_WRITE(t-1) ∥ COMPUTE(t) +``` + +#### D3.3 Multi-op composite (head + epilogue with scope) + +A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a +multi-op pipeline: + +```python +@dataclass(frozen=True) +class OpSpec: + kind: str # "gemm" | "math.exp" | "math.bias_add" | ... + scope: Scope # "per_k_tile" | "per_output_tile" | "once" + ... +``` + +- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines + M/K/N partition). +- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how + often they fire: + - `per_k_tile` — every K-reduction step. + - `per_output_tile` — once per output tile. + - `once` — once per kernel. 
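
The effect of `scope` on how often an epilogue op fires can be sketched as below. This is an illustration only: `fire_count` and the tile-count parameters are hypothetical, the `OpSpec` here carries just the two fields shown above, and the sketch assumes that `per_k_tile` means once per K step of every output tile.

```python
from dataclasses import dataclass
from typing import Literal

Scope = Literal["per_k_tile", "per_output_tile", "once"]


@dataclass(frozen=True)
class OpSpec:
    kind: str
    scope: Scope


def fire_count(op: OpSpec, m_tiles: int, n_tiles: int, k_tiles: int) -> int:
    """Number of times `op` fires for an M x N output grid with K reduction steps.

    Assumption: per_k_tile fires once per K step of each output tile.
    """
    output_tiles = m_tiles * n_tiles
    if op.scope == "per_k_tile":
        return output_tiles * k_tiles
    if op.scope == "per_output_tile":
        return output_tiles
    if op.scope == "once":
        return 1
    raise ValueError(f"unknown scope: {op.scope!r}")
```

With a 2 x 3 output grid and 4 K steps, a `per_k_tile` op would fire 24 times, a `per_output_tile` op 6 times, and a `once` op a single time.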
+ +Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural — +each stage is dispatched via token self-routing (D6), so GEMM and MATH +participate serially within the same composite even though they share +the compute slot (D4). + +The empty-`ops` form is the legacy single-op path. + +### D4. Engine resource model + +**DMA engine**: + +- `DMA_READ`: `simpy.Resource(capacity=1)`. +- `DMA_WRITE`: `simpy.Resource(capacity=1)`. +- Both channels run concurrently (READ ∥ WRITE allowed). +- Within a channel, requests serialize (READ ∥ READ disallowed; same + for WRITE). +- `vc_comm` is an orthogonal channel for IPCQ traffic defined in + ADR-0023 D8 — out of scope for this ADR. + +**Compute engine**: + +- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and + `PE_MATH`. +- At most one compute op runs at a time within a PE. +- Multi-op composite chains (D3.3) execute their compute stages serially + through this slot; token self-routing (D6) ensures the next stage + starts only after the previous compute releases the slot. + +**Engine completion**: each engine emits a completion event consumed by +the scheduler / `PipelineContext` (D6). + +### D5. Dataflow + +**Input path (HBM source)**: + +```text +HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM +PE_TCM → PE_FETCH_STORE → Register File +Register File → PE_GEMM | PE_MATH +``` + +**Input path (shared SRAM source)**: + +```text +Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM +PE_TCM → PE_FETCH_STORE → Register File +``` + +**Output path (HBM destination)**: + +```text +Register File → PE_FETCH_STORE → PE_TCM +PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM +``` + +GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the +single TCM↔register-file gateway. This makes TCM BW contention +explicit and lets fetch unit policies (e.g., prefetch) be replaced +independently of compute engines. 
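
The gateway rule above (compute stages reach `PE_TCM` only through `PE_FETCH_STORE`) can be expressed as a check on a tile's stage sequence. The sketch below is hypothetical: it reuses the `StageType` names from D6, while `stage_sequence` and `respects_gateway` are illustrative helpers that are not part of the simulator.

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


COMPUTE = {StageType.GEMM, StageType.MATH}


def stage_sequence(compute_ops):
    """Wrap one or more compute stages in the TCM-centric load/store path."""
    return (
        StageType.DMA_READ,
        StageType.FETCH,
        *compute_ops,
        StageType.STORE,
        StageType.DMA_WRITE,
    )


def respects_gateway(stages) -> bool:
    """True if every run of compute stages is entered via FETCH and left via STORE."""
    for i, stage in enumerate(stages):
        if stage not in COMPUTE:
            continue
        prev = stages[i - 1] if i > 0 else None
        nxt = stages[i + 1] if i + 1 < len(stages) else None
        # A compute stage may follow another compute stage or FETCH only.
        if prev not in COMPUTE and prev is not StageType.FETCH:
            return False
        # A compute stage may be followed by another compute stage or STORE only.
        if nxt not in COMPUTE and nxt is not StageType.STORE:
            return False
    return True
```

A GEMM head followed by a MATH epilogue passes the check, whereas a sequence that goes straight from `DMA_READ` to `GEMM` does not.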
+ +#### D5.1 PE_TCM partitioning + +`PE_TCM` is split into two logical regions: + +**SchedulerReservedTCM** + +- Owned exclusively by `PE_SCHEDULER`. +- Holds composite-command tile buffers. +- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ / + COMPUTE / DMA_WRITE stage, guarantees input/output separation, and + manages tile-buffer lifetimes. + +**AllocatableTCM** + +- General-purpose region managed by `PEMemAllocator`. +- Used for host / DP-visible allocations. + +**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or +allocate inside `SchedulerReservedTCM`. The reserved region is excluded +from allocator-managed ranges by construction. + +**Tile buffer rules**: + +- Input and output buffers within `SchedulerReservedTCM` MUST NOT + overlap during a tile's active lifetime. +- A tile buffer remains valid until the corresponding `DMA_WRITE` + completes. +- Buffer reuse is permitted only after the consuming tile's lifetime + ends. + +### D6. TileToken self-routing pipeline + +A composite's stage-to-stage progression happens **without** routing +through the scheduler. Each component forwards the token directly to +the next stage's component using the token's `plan`: + +```text +Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete) + ↑ chaining: no scheduler hop ↑ + PipelineContext.complete_tile() +``` + +This mirrors real-HW done-wire chains. The scheduler handles only +**initial dispatch + completion aggregation**. + +#### TilePlan / Stage + +```python +class StageType(Enum): + DMA_READ = 0 + FETCH = 1 + GEMM = 2 + MATH = 3 + STORE = 4 + DMA_WRITE = 5 + +@dataclass(frozen=True) +class Stage: + stage_type: StageType + component: str # topology node id (e.g., "sip0.cube0.pe0.pe_dma") + params: dict # stage-specific parameters + +@dataclass(frozen=True) +class TilePlan: + tile_id: int + stages: tuple[Stage, ...] 
+``` + +#### TileToken + +```python +@dataclass +class TileToken: + tile_id: int + pipeline_ctx: PipelineContext + plan: TilePlan + stage_idx: int + params: dict # cached current stage params + data_op: bool = True # op_log opt-in (ADR-0020 D4) +``` + +Single-owner invariant: a token is owned by exactly one component at a +time. Lifecycle: scheduler creates with `stage_idx=0` → component +`_process()` → increment `stage_idx` → put to next stage's `in_port` → +last stage calls `pipeline_ctx.complete_tile()`. + +#### PipelineContext (exactly-once completion) + +```python +@dataclass +class PipelineContext: + id: str + total_tiles: int + completed_tiles: int = 0 + done_event: simpy.Event = None + + def complete_tile(self) -> None: + self.completed_tiles += 1 + if self.completed_tiles == self.total_tiles: + self.done_event.succeed() +``` + +Each tile's last stage MUST call `complete_tile()` exactly once. +Duplicate calls are bugs (SimPy `Event` can succeed at most once). + +#### Feed ordering + +`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a +`_pending_feeds` FIFO. Composite commands are enqueued in submission +order; tile feed for a command runs to completion before the next +command's feed begins. **Tile-feed interleaving between commands is +disallowed.** + +Within a single command's tiles, downstream pipeline overlap arises +naturally — earlier tiles progress through later stages while the feeder +keeps pushing remaining tiles into the first stage queue (SimPy Store +backpressure governs flow control). If the first-stage queue is full, +only the feeder blocks; the scheduler worker's inbox processing +continues. 
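
The no-interleaving rule can be illustrated with a plain-Python sketch of the feed order. It deliberately omits SimPy, backpressure, and the real `_feed_loop`; `feed_order` and the `(command_id, tile_count)` pairs are hypothetical stand-ins for entries of the `_pending_feeds` FIFO.

```python
from collections import deque


def feed_order(pending_feeds):
    """Return the (command_id, tile_id) feed order for a FIFO of pending commands.

    Each entry of `pending_feeds` is a (command_id, tile_count) pair. All tiles
    of one command are fed before the next command is dequeued, so tiles from
    different commands never interleave at the first stage.
    """
    order = []
    queue = deque(pending_feeds)
    while queue:
        command_id, tile_count = queue.popleft()
        for tile_id in range(tile_count):
            order.append((command_id, tile_id))
    return order
```

In the simulator the inner loop would block on the first-stage queue when it is full. The sketch captures only the ordering guarantee, not the timing.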
+ +#### Token routing pattern (base class) + +```python +def _pipeline_worker(self, env): + while True: + token = yield self._inbox.get() + yield from self._process(env, token) # stage-specific logic + next_idx = token.stage_idx + 1 + if next_idx < len(token.plan.stages): + next_stage = token.plan.stages[next_idx] + token.stage_idx = next_idx + token.params = next_stage.params + yield self.out_ports[next_stage.component].put(token) + else: + token.pipeline_ctx.complete_tile() +``` + +Each component implements only `_process()`; chaining lives in the +base class. + +### D7. Observability and trace contract + +The simulator emits deterministic trace events: + +- `command_submitted` +- `sub_command_dispatched` +- `engine_start` +- `engine_complete` +- `tile_ready` +- `command_complete` + +For identical inputs, trace ordering MUST be deterministic. + +### D8. Topology representation + +PE-internal components are declared in `cube.pe_template`: + +```yaml +pe_template: + components: + pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } } + pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } } + pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } } + pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } } + pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } } + pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } } + pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } } + pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } } # ADR-0011 D-VA + pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... 
} } # ADR-0023 + links: + # Scheduler dispatch edges (initial) + scheduler_to_dma_mm: 0.0 + scheduler_to_fetch_store_mm: 0.0 + scheduler_to_gemm_mm: 0.0 + scheduler_to_math_mm: 0.0 + # Pipeline chaining edges (token self-routing per D6) + dma_to_fetch_store_mm: 0.0 + fetch_store_to_gemm_mm: 0.0 + fetch_store_to_math_mm: 0.0 + gemm_to_fetch_store_mm: 0.0 + gemm_to_math_mm: 0.0 + math_to_fetch_store_mm: 0.0 + fetch_store_to_dma_mm: 0.0 + fetch_store_to_tcm_bw_gbs: ... +``` + +Template is instantiated once per PE. PE instances are derived from +`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔ +cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4). + +## Consequences + +### Positive + +- Each block is an independent topology node — individually swappable + via DI (ADR-0015). +- PE-internal structure is visible in the topology graph. +- Components do not know their downstream — plan-based routing gives + flexibility (e.g., epilogue chains require no scheduler change). +- DMA and compute overlap naturally via SimPy Store backpressure. +- Multi-op composite expresses fused operations (e.g., GEMM + bias_add) + without engine-level coupling. +- TCM access contention is realistic — `PE_FETCH_STORE` is the single + TCM↔RF gateway. + +### Negative + +- Intra-PE component count is higher than a coarser model (7 base + 2 + cross-referenced) — more topology nodes/edges. +- Intra-PE token forwarding is explicit in traces (acceptable trade for + HW fidelity). 
+ +## Links + +- ADR-0011 D-VA (PE_MMU component, VA translation) +- ADR-0015 D4 (component port/wire model) +- ADR-0020 (greenlet kernel execution / two-pass) +- ADR-0023 (PE_IPCQ + PE_DMA virtual channels) +- SPEC R3, R4 diff --git a/docs/adr/ADR-0014-pe-internal-execution-model.md b/docs/adr/ADR-0014-pe-internal-execution-model.md deleted file mode 100644 index 7153b2a..0000000 --- a/docs/adr/ADR-0014-pe-internal-execution-model.md +++ /dev/null @@ -1,365 +0,0 @@ -# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands) - -## Status - -Accepted - -## Context - -ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define: - -- the dispatch model inside a PE, -- the responsibilities of PE_SCHEDULER, -- the PE_TCM-centric dataflow contract used by accelerator engines. - -We need a deterministic and debuggable PE-internal execution contract that supports: - -- simple single-engine commands -- composite commands that build a tiled pipeline across DMA and accelerator engines - -The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling. - -## Decision - -### D1. PE internal component roles - -Each PE contains the following logical components. - -**PE_CPU** - -- Executes kernel instruction stream or kernel control logic. -- Generates PE commands. -- Submits commands to PE_SCHEDULER. -- PE_CPU does NOT enqueue work directly into engine queues. - -**PE_SCHEDULER** - -- The sole dispatcher inside a PE. -- Receives commands from PE_CPU. -- Expands composite commands into sub-commands. -- Tracks dependencies and command state. -- Dispatches work to engine queues. -- Manages tile scheduling for composite commands. - -**PE_DMA** - -- Handles memory transfers between PE_TCM and external memory domains. 
-- PE_DMA connects to the cube-level NOC (on-die fabric): - - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC - - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop) - - Remote/shared: PE_DMA → NOC → (fabric hops) → destination -- Supported directions include: - - HBM → PE_TCM (via NOC) - - PE_TCM → HBM (via NOC) - - PE_TCM → shared SRAM (via NOC) - - PE_TCM → other memory domains (via NOC, if supported by topology) - -**PE_GEMM** - -- Matrix multiplication engine. -- Reads activations from PE_TCM. -- May stream weights directly from HBM. - -**PE_MATH** - -- Element-wise computation engine. -- Reads and writes PE_TCM. - -**PE_TCM** - -- Local SRAM used as the staging memory for accelerator operations. - ---- - -### D2. Command lifecycle and queues - -PE_SCHEDULER maintains three logical structures. - -**SubmissionQueue** - -- Written by PE_CPU. -- Contains incoming PE commands waiting to be processed. - -**InflightTable** - -- Owned and mutated only by PE_SCHEDULER. -- Tracks: - - expanded sub-commands - - dependency state - - engine assignment - - completion status - -**CompletionQueue** - -- Written by PE_SCHEDULER. -- Contains final completion records for commands. - -**Single-writer rule** - -- Only PE_SCHEDULER is allowed to mutate command completion state. -- Engine components must report completion via explicit completion events/messages. - -**Command completion** - -A command becomes DONE when: - -- all sub-commands complete -- PE_SCHEDULER publishes a completion record to CompletionQueue. - ---- - -### D3. Dispatch modes - -PE commands are divided into two categories. - -#### D3.1 Simple command - -A simple command expands to exactly one engine sub-command. 
- -Examples include: - -- DMA transfer -- GEMM compute -- MATH compute - -Execution flow: - -```text -PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue -``` - -#### D3.2 Composite command (tiled pipeline) - -Composite commands implement tiled pipelined execution across engines. - -Each tile executes the following pipeline: - -```text -Input DMA (READ) -→ Compute (GEMM or MATH) -→ Output DMA (WRITE) -``` - -**Tiling rule** - -If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles. -Each tile is assigned a monotonically increasing `tile_id`. - -**Tile dependency rules** - -For tile `t`: - -- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)` -- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)` -- All dependencies are enforced by PE_SCHEDULER. - -**Overlap policy (Phase 0 default)** - -Operations for different tiles may overlap when engine resources permit. - -Allowed overlaps: - -```text -DMA_READ(t+1) ∥ COMPUTE(t) -DMA_WRITE(t−1) ∥ COMPUTE(t) -DMA_READ(t) ∥ DMA_WRITE(t) -``` - -Disallowed overlaps: - -```text -GEMM(t) ∥ GEMM(t′) -MATH(t) ∥ MATH(t′) -GEMM(t) ∥ MATH(t′) -``` - ---- - -### D4. Engine execution model (Phase 0 default) - -Each engine behaves as a deterministic service resource. - -**DMA engine** - -PE_DMA contains two independent channels. - -```text -DMA_READ capacity = 1 -DMA_WRITE capacity = 1 -``` - -Rules: - -- DMA_READ and DMA_WRITE may execute concurrently. -- Multiple READs cannot overlap. -- Multiple WRITEs cannot overlap. - -Example allowed: - -```text -DMA_READ(t+1) ∥ DMA_WRITE(t) -``` - -Example not allowed: - -```text -DMA_READ(t) ∥ DMA_READ(t+1) -DMA_WRITE(t) ∥ DMA_WRITE(t+1) -``` - -**Compute engine** - -Compute operations share a single compute resource. - -```text -PE_ACCEL capacity = 1 -``` - -Both GEMM and MATH require this shared compute slot. 
- -Consequences: - -- GEMM ∥ GEMM not allowed -- MATH ∥ MATH not allowed -- GEMM ∥ MATH not allowed - -Only one compute operation can run in a PE at a time. - -**Compute opcode restriction** - -Composite commands contain one compute opcode only. - -Examples: - -```text -COMPOSITE_GEMM -COMPOSITE_MATH -``` - -Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0. - -**Engine completion signaling** - -Every engine emits a completion event when a sub-command finishes. -Completion events are delivered to PE_SCHEDULER. - ---- - -### D5. Dataflow model - -Compute operations use a TCM-centric dataflow model. - -**Input path (HBM)** - -```text -HBM → NOC → PE_DMA (DMA_READ) → PE_TCM -``` - -**Input path (shared SRAM)** - -```text -Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM -``` - -**Compute stage** - -Compute engines read input tensors from PE_TCM. - -```text -PE_TCM → GEMM / MATH -``` - -Weights for GEMM may optionally stream directly from HBM (via NOC). - -**Output path (HBM)** - -Compute results are written to PE_TCM, then DMA writes to HBM. - -```text -PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM -``` - -**Output path (shared SRAM)** - -```text -PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM -``` - -#### D5.1 PE_TCM partitioning and ownership boundary - -The PE_TCM address space is partitioned into two logical regions. - -**SchedulerReservedTCM** - -- A staging region owned exclusively by PE_SCHEDULER. -- This region is used for composite command tile buffers. -- PE_SCHEDULER: - - partitions this region into tile buffers - - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages - - guarantees input/output buffer separation - - manages tile buffer lifetime - -**AllocatableTCM** - -- General-purpose region managed by PEMemAllocator. -- Used by host or DP-visible allocations. - -**Visibility rule (hard isolation)** - -- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM. 
-- SchedulerReservedTCM is excluded from allocator-managed ranges by construction. -- This prevents DP or host allocations from interfering with scheduler staging buffers. - -**Tile buffer rules** - -Within SchedulerReservedTCM: - -- input buffers and output buffers must not overlap -- PE_SCHEDULER assigns tile buffers for DMA and compute stages -- tile buffers remain valid until the corresponding DMA_WRITE completes -- Buffer reuse is allowed only after the tile lifetime finishes. - ---- - -### D6. Observability and trace contract - -The simulator must emit deterministic trace events. - -Required events include: - -- `command_submitted` -- `sub_command_dispatched` -- `engine_start` -- `engine_complete` -- `tile_ready` -- `command_complete` - -Trace ordering must be deterministic for identical inputs. - ---- - -### D7. Topology representation - -PE internal components are declared in `cube.pe_template`. - -The template is instantiated once per PE. - -PE instances are derived from `cube.pe_layout`. - -External connectivity such as: - -- PE_DMA → NOC → HBM (data path) -- PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path) -- NOC → PE_CPU (command path from M_CPU) - -is modeled at the CUBE level (see ADR-0003 D3). 
- ---- - -## Links - -- SPEC R3, R4 -- ADR-0003 D4 (PE-level system hierarchy) -- ADR-0005 View C (PE-level diagram) -- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance) -- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch) diff --git a/docs/adr/ADR-0015-component-port-wire-model.md b/docs/adr/ADR-0015-dev-component-port-wire-model.md similarity index 86% rename from docs/adr/ADR-0015-component-port-wire-model.md rename to docs/adr/ADR-0015-dev-component-port-wire-model.md index acfbb9c..5f999af 100644 --- a/docs/adr/ADR-0015-component-port-wire-model.md +++ b/docs/adr/ADR-0015-dev-component-port-wire-model.md @@ -6,20 +6,19 @@ Accepted ## Context -ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine. -In practice, the engine iterates the topology path and calls `run()` on each component -sequentially — conflating routing policy with component behavior and preventing realistic -hardware modeling (queues, contention, fan-out). - -ADR-0007 D3 already states that components own fan-out and aggregation, but the current -implementation does not enforce this for fabric traversal. +Realistic hardware modeling — queues, contention, fan-out — requires +that components own fabric traversal while the simulation engine +handles only initialization and completion observation. Direct method +calls between components, or path-walking inside the engine, defeat +queueing and contention semantics. This ADR defines: - how components communicate via typed port queues, - how propagation delay is modeled (wire processes with BW occupancy), -- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU), -- the reduced role of the simulation engine, +- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch + (via M_CPU), +- the engine's reduced role (wire init + completion observation only), - M_CPU.DMA as an internal subcomponent of M_CPU. 
--- @@ -88,9 +87,6 @@ The simulation engine MUST NOT: - call component `run()` methods directly, - track per-hop latency or decompose fan-out. -This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause. -ADR-0007 D2 must be amended accordingly. - --- ### D4. Fabric paths for Memory R/W and Kernel Launch @@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard. - Propagation delay is modeled accurately per edge. - Engine is decoupled from routing policy. - Component implementations remain swappable via DI (ADR-0007 D3). -- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities. -- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above). --- ## Links -- ADR-0007 D2 (to be amended: engine path-walking clause) -- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced) +- ADR-0007 D2 (engine role boundary) +- ADR-0009 D3 (kernel execution fan-out hierarchy) - ADR-0014 D4 (DMA engine capacity=1) - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal) - ADR-0016 (IOChiplet NOC and memory data path) - ADR-0017 (cube NOC 2D mesh architecture) +- ADR-0033 (Latency model assumptions built on these mechanisms) diff --git a/docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md b/docs/adr/ADR-0016-dev-iochiplet-noc-and-memory-path.md similarity index 100% rename from docs/adr/ADR-0016-iochiplet-noc-and-memory-path.md rename to docs/adr/ADR-0016-dev-iochiplet-noc-and-memory-path.md diff --git a/docs/adr/ADR-0017-cube-noc-2d-mesh.md b/docs/adr/ADR-0017-cube-noc-2d-mesh.md deleted file mode 100644 index c43c841..0000000 --- a/docs/adr/ADR-0017-cube-noc-2d-mesh.md +++ /dev/null @@ -1,189 +0,0 @@ -# ADR-0017: Cube NOC 2D Mesh Architecture - -## Status - -Accepted - -## Context - -ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but -does not specify the internal routing model, contention semantics, or -attachment 
topology. The implementation uses a 2D mesh router grid with -XY routing and per-segment contention modeling. This ADR formalizes that -architecture. - -## Decision - -### D1. NOC node and router grid - -Each cube contains a 2D router mesh generated by `mesh_gen.py`. -Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`) -implemented as `forwarding_v1`. (Supersedes the original single-node -`noc_2d_mesh_v1` design — see ADR-0019.) - -Grid properties: - -- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections) -- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`) -- HBM exclusion zone: center rows/columns are excluded where HBM physically - occupies space (e.g., r2c2, r2c3, r3c2, r3c3) -- Router positions are derived from physical PE corner placement and cube - geometry - -The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance -traversal within the mesh (distance_mm x ns_per_mm). - -### D2. XY routing algorithm - -The NOC uses deterministic XY routing: - -1. Horizontal segment: route from source X to destination X at source Y -2. Vertical segment: route from destination X at source Y to destination Y - -Each directed segment is identified by a unique link key: - -- Horizontal: `("H", y_band, x_min, x_max, direction)` -- Vertical: `("V", x_band, y_min, y_max, direction)` - -Grid positions are snapped to the router grid, excluding the HBM zone. - -### D3. Contention model - -Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions -sharing a segment (same row or column band, same direction) contend for the -resource. This models link-level serialization in a wormhole-routed mesh. - -With no contention, NOC traversal latency equals the Manhattan distance -multiplied by `ns_per_mm`. Under contention, additional queueing delay -is added by SimPy's resource scheduling. - -### D4. 
NOC attachment points - -The NOC connects to all major cube-level components: - -```text - UCIe-N (conn x4) - | - +---------+---+---+---------+ - | | | | -PE0.dma ---+ r0c0 | ... | r0c5 +--- PE2.dma -PE0.cpu <--+ | | +--< PE2.cpu - | | | | -UCIe-W ----+ ... | [HBM] | ... +---- UCIe-E -(conn x4) | | zone | | (conn x4) - | r2c0 | | | -M_CPU <--->+ | | | - | r3c0 | | | -SRAM <---->+ | | | - | | | | -PE4.dma ---+ r4c0 | ... | r4c5 +--- PE6.dma -PE4.cpu <--+ | | +--< PE6.cpu - | | | | - +---------+---+---+---------+ - | - UCIe-S (conn x4) - -HBM attach: hbm_ctrl is also connected to each router that hosts a PE (ADR-0019 D1) -(xbar_top/xbar_bot were removed by ADR-0019) -``` - -### D5. NOC edge bandwidths and distances - -| Connection | BW (GB/s) | Distance | Notes | -| --- | --- | --- | --- | -| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW | -| NOC -> PE_CPU | - | 0.0 mm | Command path only | -| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) | -| NOC <-> M_CPU | - | 0.0 mm | Command path | -| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate | -| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port | - -Distance 0.0 mm for most connections reflects the distributed nature of -the NOC; the actual traversal distance is computed internally via Manhattan -distance within the router grid. - -### D6. UCIe decomposition and inter-cube traffic - -Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into: - -- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns) -- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe - -This decomposition enables N=4 independent NOC-to-UCIe connections per port, -each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.
- -Inter-cube traffic path: - -```text -Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT} - [UCIe link: 512 GB/s, 1.0mm seam distance] -Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl -``` - -UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a -full crossing incurs 16 ns (TX port + RX port). - -### D7. Data paths through the NOC - -**PE DMA to local HBM (same half):** - -```text -PE_DMA -> r{x}c{y} -> hbm_ctrl (local: 0 mesh hops, switching overhead only) -``` - -**PE DMA to remote PE's HBM:** - -```text -PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl -``` - -**PE DMA to remote cube HBM:** - -```text -PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl -``` - -**Kernel Launch command to PE:** - -```text -[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU -``` - -**Shared SRAM access:** - -```text -PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM -``` - -### D8. Mesh generation - -The router grid is generated by `mesh_gen.py` based on: - -- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner -- `cube.geometry`: cube physical dimensions and HBM zone -- `cube.ucie.n_connections`: determines router count for UCIe attachment - -The generator produces a `mesh_data` dictionary containing: - -- Router grid with positions and HBM exclusion zones -- PE-to-router attachments (pe_dma, pe_cpu per PE) -- UCIe-to-router attachments (N/S/E/W, distributed across edge routers) -- M_CPU and SRAM router attachments -- HBM attachment per PE router (ADR-0019) - -## Consequences - -- NOC provides position-aware routing with deterministic latency -- Contention is captured per directed segment (not per-node) -- All cube-internal traffic is explicitly routed through the NOC -- HBM exclusion zone reflects physical die layout constraints -- The mesh generation is fully parameterized by `topology.yaml` - -## Links - -- ADR-0003 D3 (cube-level NOC definition — 
extended by this ADR) -- ADR-0004 D1 (PE DMA to local HBM path via router mesh) -- ADR-0014 D1 (PE_DMA egress via router mesh) -- ADR-0019 (NOC-Local HBM — xbar/bridge removal, explicit router mesh) -- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) -- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level) diff --git a/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md b/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md new file mode 100644 index 0000000..c442dde --- /dev/null +++ b/docs/adr/ADR-0017-dev-cube-noc-and-hbm-connectivity.md @@ -0,0 +1,291 @@ +# ADR-0017: Cube NOC and HBM Connectivity + +## Status + +Accepted + +## Context + +The CUBE-level NOC is a 2D router mesh that carries every intra-cube +request: PE-to-HBM data, PE-to-PE traffic, command paths +(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic. + +The CUBE's HBM is exposed through per-PE controller endpoints attached +to PE routers. This per-PE partitioning makes local-vs-remote HBM +distinguishable by mesh distance: a PE's own HBM partition sits at its +own router (switching overhead only); another PE's HBM partition is +reachable by mesh hops to that PE's router. + +Two channel-mapping modes are supported in the design space: + +- **n:1 (default, implemented)** — each PE's HBM partition aggregates + `channels_per_pe` pseudo-channels into one endpoint. Effective + per-PE BW = N × per-channel BW. +- **1:1 (future)** — each PE router decomposes into per-channel + mini-routers; per-channel BW contention is modeled directly. + +In both modes the per-PE effective BW is identical; only the connectivity +granularity differs. + +## Decision + +### D1. 2D router mesh + +Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`. + +- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`). +- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
+- Default 6×6 grid (sized from PE corner placement + UCIe attachment + count); larger PE counts scale the grid up. +- HBM exclusion zone: center rows/columns are excluded where HBM die + physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6). +- Latency = Manhattan distance × `ns_per_mm`. + +### D2. XY routing algorithm + +Deterministic XY routing: + +1. Horizontal segment: route from source X to destination X at source Y. +2. Vertical segment: route from destination X at source Y to destination Y. + +Each directed segment carries a unique key: + +- Horizontal: `("H", y_band, x_min, x_max, direction)` +- Vertical: `("V", x_band, y_min, y_max, direction)` + +Grid positions are snapped to the router grid, excluding the HBM zone. + +### D3. Per-segment contention model + +Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions +sharing a segment (same row or column band, same direction) contend for +the resource — modelling link-level serialization in a wormhole-routed +mesh. + +With no contention, NOC traversal latency equals Manhattan distance × +`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing +delay. + +### D4. NOC attachment points (per-PE HBM partition) + +Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`, +and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint — +`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's +HBM (one pseudo-channel group; see D8). + +Other attachments: + +- M_CPU and shared SRAM each occupy a dedicated edge router. +- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed + along that edge (see D6). + +```text + UCIe-N (conn x4) + | + +---------+---+---+---------+ + | | | | +PE0.dma ---+ r0c0 | ... | r0c5 +--- PE2.dma +PE0.cpu <--+ +hbm.pe0| | +hbm.pe2+--< PE2.cpu + | | | | +UCIe-W ----+ ... | [HBM] | ... 
+---- UCIe-E +(conn x4) | | zone | | (conn x4) + | r2c0 | | | +M_CPU <--->+ | | | + | r3c0 | | | +SRAM <---->+ | | | + | | | | +PE4.dma ---+ r4c0 | ... | r4c5 +--- PE6.dma +PE4.cpu <--+ +hbm.pe4| | +hbm.pe6+--< PE6.cpu + | | | | + +---------+---+---+---------+ + | + UCIe-S (conn x4) +``` + +Per-PE HBM partitioning is the key invariant that makes local vs +cross-PE HBM distinguishable by mesh distance (see D7). + +### D5. NOC edge bandwidths and distances + +| Connection | BW (GB/s) | Distance | Notes | +| ----------------------------- | ---------- | ------------- | ------------------------------------------- | +| PE_DMA → NOC | 256.0 | Physical (PE) | Matches local-HBM aggregate BW | +| NOC → PE_CPU | — | 0.0 mm | Command path only | +| Router ↔ hbm_ctrl.pe{idx} | 256.0 | 0.0 mm | Per PE router; N × per-channel BW (see D8) | +| NOC ↔ M_CPU | — | 0.0 mm | Command path | +| NOC ↔ SRAM | 128.0 × 4 | 0.0 mm | 512 GB/s aggregate | +| NOC ↔ UCIe conn | 128.0 | 0.0 mm | Per connection; 4 conn per port | + +`0.0 mm` distances reflect the distributed nature of the NOC; actual +traversal distance is computed via Manhattan distance within the router +grid. + +### D6. UCIe decomposition and inter-cube traffic + +Each of the 4 UCIe ports (N, S, E, W) decomposes into: + +- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`). +- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe. + +This decomposition gives 4 independent NOC↔UCIe connections per port, +each with 128 GB/s bandwidth (512 GB/s aggregate per port). + +Inter-cube traffic path: + +```text +Source: PE_DMA → NOC → conn{i} → ucie-{PORT} + [UCIe link: 512 GB/s, 1.0mm seam distance] +Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx} +``` + +UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full +crossing incurs 16 ns (TX port + RX port). + +### D7. 
Data paths through the NOC + +All intra-cube traffic uses the same router mesh — no separate fast +paths. + +**Local HBM** (same PE's own partition; 0 mesh hops): + +```text +PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx} (switching overhead only) +``` + +**Cross-PE HBM within cube** (target PE's partition, reached by mesh): + +```text +PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'} +``` + +Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`): + +```text +PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2 +``` + +Dijkstra computes the shortest path within the mesh. + +**Cross-cube HBM** (UCIe traversal): + +```text +PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn + → r{x'}c{y'} → hbm_ctrl.pe{idx'} +``` + +**Kernel launch command to PE**: + +```text +[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU +``` + +**Shared SRAM access**: + +```text +PE_DMA → r{x}c{y} → (mesh) → SRAM +``` + +### D8. HBM channel mapping mode + +Channel mapping is configured at cube scope: + +```yaml +cube: + memory_map: + hbm_mapping_mode: n_to_one # one_to_one | n_to_one + hbm_pseudo_channels: 64 # total pseudo-channel count + hbm_channels_per_pe: 8 # per-PE local channel count + hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s) + hbm_slices_per_cube: 8 # number of per-PE partitions + hbm_total_gb_per_cube: 48 +``` + +**n:1 mode (default, implemented).** Each PE's HBM partition is a single +endpoint `hbm_ctrl.pe{idx}` that aggregates `channels_per_pe` pseudo- +channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals +`channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to +interleave; only aggregate per-PE BW is modeled. No separate aggregated +router node exists — the per-PE router itself serves that role. + +**1:1 mode (future).** Each PE router decomposes into N channel +mini-routers; per-channel routing carries fully-resolved PA + channel ID. 
+A `ChannelSplitter` resolves a logical access to N per-channel physical +requests. Per-channel link models BW contention. Cross-PE channel +access semantics are deferred to the implementation ADR. + +**BW math (defaults).** + +| Parameter | Value | +| ---------------------------------- | -------------------------- | +| pseudo channels per cube | 64 (parameter) | +| PEs per cube | 8 (parameter) | +| channels per PE (N) | 64 / 8 = 8 | +| per-channel BW | 32 GB/s (parameter) | +| per-PE local BW | N × 32 = 256 GB/s | +| cube total HBM BW | 64 × 32 = 2048 GB/s | + +Both modes give the same per-PE effective BW; only the request shape and +contention model differ. + +### D9. AddressResolver — per-PE HBM endpoint + +The address resolver decodes a PA's HBM offset to the owning PE's +partition: + +```python +# policy/routing/router.py +hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube + +if addr.kind == "hbm": + pe_id = int(addr.hbm_offset) // hbm_slice_bytes + return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}" +``` + +The pe_id computation is intrinsic to the routing layer (not a +topology-time concern). Any HBM PA falls within exactly one partition, +yielding deterministic routing. + +External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the +same resolver path — there is no separate fast path. + +### D10. Mesh generation parameters + +`mesh_gen.py` produces `cube_mesh.yaml` from: + +- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner. +- `cube.geometry`: cube physical dimensions and HBM zone. +- `cube.ucie.n_connections`: determines router count for UCIe attachment. + +Output `mesh_data` dictionary contains: + +- Router grid with positions and HBM exclusion zones. +- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm` + per PE). +- UCIe-to-router attachments (N/S/E/W distributed across edge routers). +- M_CPU and SRAM router attachments. 
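The router grid described in D1 and D10 can be sketched as follows. This is a minimal illustration, not the project's `mesh_gen.py`: the names `build_router_grid`, `router_name`, and `shortest_hops` are hypothetical, the HBM exclusion cells are the 6×6 example from D1, and a plain breadth-first search stands in for whatever path finder the simulator actually uses.

```python
from collections import deque

# Hypothetical helper names; this is not the project's mesh_gen.py.
# (row, col) cells excluded for the HBM zone, taken from the 6x6 example in D1.
HBM_ZONE_6X6 = {(2, 2), (2, 3), (3, 2), (3, 3)}


def build_router_grid(rows=6, cols=6, hbm_zone=HBM_ZONE_6X6):
    """Return (routers, edges): non-excluded cells and their mesh neighbours."""
    routers = {(r, c) for r in range(rows) for c in range(cols)} - set(hbm_zone)
    edges = {}
    for r, c in routers:
        candidates = ((r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c))
        # Keep only neighbours that exist, so excluded cells are skipped.
        edges[(r, c)] = [n for n in candidates if n in routers]
    return routers, edges


def router_name(pos, sip=0, cube=0):
    """Format a (row, col) position using the D1 naming scheme."""
    row, col = pos
    return f"sip{sip}.cube{cube}.r{row}c{col}"


def shortest_hops(edges, src, dst):
    """Breadth-first search hop count between two routers (None if unreachable)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


routers, edges = build_router_grid()
print(len(routers))                          # router count after exclusion
print(router_name((0, 0)))                   # D1-style node name
print(shortest_hops(edges, (0, 0), (1, 4)))  # PE0 router to PE2 router (D7 example)
```

With the default arguments the sketch yields 32 routers (36 cells minus the 4 excluded ones) and a 5-hop distance from `r0c0` to `r1c4`, which matches the hop count of the cross-PE example path in D7.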
+ +## Consequences + +- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM + (mesh hops) are naturally distinguishable, satisfying SPEC R5 + (multi-domain communication) and ADR-0002 (no zero-latency end-to-end + paths). +- All cube-internal traffic routes through one mesh — single contention + model, single layout, single set of edge BWs. +- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each + PE's partition is the n:1 aggregate of its assigned pseudo-channels. +- 1:1 mode extension is structurally natural — split each PE router into + N channel routers. +- Mesh generation is fully parameterised by `topology.yaml`; PE/cube + geometry changes propagate without code edits. + +## Links + +- ADR-0002 (Routing distance, ordering, no zero-latency paths) +- ADR-0003 D3 (cube-level NOC definition — extended here) +- ADR-0004 (Memory semantics, local HBM) +- ADR-0011 (Memory addressing — LA model consumes per-PE partition) +- ADR-0014 D1 (PE_DMA egress via router mesh) +- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch) +- ADR-0016 (IOChiplet io_noc — analogous pattern at IO chiplet level) +- ADR-0033 (Latency model: per-PC parallelism, switch penalty) diff --git a/docs/adr/ADR-0019-NOC-Local HBM.en.md b/docs/adr/ADR-0019-NOC-Local HBM.en.md deleted file mode 100644 index 815a70b..0000000 --- a/docs/adr/ADR-0019-NOC-Local HBM.en.md +++ /dev/null @@ -1,305 +0,0 @@ -# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC - -## Status - -Accepted - -## Context - -The CUBE-internal NOC must connect each PE to HBM. KernBench needs -to evaluate two connectivity models: - -- **1:1 mode** — PE_DMA connects to N separate per-channel routers, - each with its own link to hbm_ctrl. Models per-channel BW - contention precisely. - N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`). -- **n:1 mode** — PE_DMA connects to a single aggregated router with - one link to hbm_ctrl. 
Channels are treated as interleaved; only - aggregate BW is modeled. - -Effective PE-local BW is identical under both modes -(= N × per-channel BW); only the connectivity granularity differs. - ---- - -## Decision - -### D1. HBM Attaches to PE Routers - -Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**, -and attach the HBM access point to the same router where the PE is attached. - -- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops) -- Remote PE's HBM access: reaches the target PE's router via mesh hops -- The read/write resource model within the HBM controller is preserved - -Node naming changes: - -| Current | After Change | -| ---- | ------- | -| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) | - -In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that -the builder generates an edge between that router and hbm_ctrl. - ---- - -### D2. Complete Removal of xbar, bridge, and Single NOC Node - -Remove all of the following nodes and related edges: - -- `{cube}.xbar_top`, `{cube}.xbar_bot` -- `{cube}.bridge.left`, `{cube}.bridge.right` -- `{cube}.noc` (single TwoDMeshNocComponent node) -- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar` -- Edges of type `xbar_to_bridge`, `bridge_to_xbar` -- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node - -Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**. -Each router (r0c0, r0c1, ...) from the 6x6 router grid generated by `mesh_gen.py` -is created as a separate SimPy node in the topology graph, -and adjacent routers are connected via XY mesh edges. - ---- - -### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1) - -#### Router Nodes Based on cube_mesh.yaml - -Each non-null router from cube_mesh.yaml generated by `mesh_gen.py` -is created as a **separate SimPy node** in the topology graph. 
- -- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`) -- kind: `noc_router`, impl: `forwarding_v1` -- pos_mm: taken from cube_mesh.yaml - -Based on the attach information in cube_mesh.yaml, components are connected to each router: -- `pe{p}.dma` → PE_DMA ↔ router edge -- `pe{p}.cpu` → PE_CPU ↔ router edge -- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1) -- `m_cpu` → M_CPU ↔ router edge -- `sram` → SRAM ↔ router edge -- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge - -Router-to-router XY mesh edges: bidirectional edges between adjacent routers. -Null routers (HBM exclusion zones) are skipped. - -#### 1:1 Mode Extension (To Be Implemented Later) - -In 1:1 mode, each router differentiates into N channel mini-routers. -Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required. -N GEMM engines per PE are also added at this point. - ---- - -### D4. Cross-PE HBM Access (n:1 Mode) - -In n:1 mode, when a PE accesses another PE's local HBM, -it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router. - -Example: PE0 (r0c0) accessing PE2's (r1c4) HBM: - -```text -PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl -``` - -The Dijkstra router finds the shortest path in the mesh. - -Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3. - ---- - -### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh - -In n:1 mode, no separate "aggregated router" is created. -The existing router grid from cube_mesh.yaml serves that role. - -#### Connection Structure - -PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached: - -```text -sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) -sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) -``` - -Routers are connected via XY mesh edges. PE's local HBM access goes -directly from its own router (switching overhead only). 
- -#### n:1 Mode Full Data Paths - -**Local HBM (0 hops):** -```text -PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only) -``` - -**Remote HBM (mesh hops):** -```text -PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl -``` - -**M_CPU DMA:** -```text -M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl -``` - ---- - -### D6. All Traffic Is Unified onto the Same Router Mesh - -- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh -- Local access does not use a separate fast path (xbar) -- Cross-cube (remote) access path: - -```text -PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT} - → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl -``` - -UCIe connections maintain the existing structure, -but both endpoints become mesh routers instead of xbars. - -The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`. - ---- - -### D7. AddressResolver Changes - -Current `AddressResolver.resolve()`: - -```python -# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" -pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes) -return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" -``` - -After change: - -```python -# Changed: HBM → single endpoint -return f"sip{s}.cube{c}.hbm_ctrl" -``` - -The pe_slice calculation is removed. -In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router. - -resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility. - ---- - -### D8. 
topology.yaml Configuration Changes - -#### Added Settings - -```yaml -cube: - memory_map: - hbm_mapping_mode: n_to_one # one_to_one | n_to_one - hbm_pseudo_channels: 64 # total pseudo channel count - hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube) - hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s) - hbm_total_gb_per_cube: 48 # retained -``` - -#### Removed Settings - -```yaml -# To be removed -links: - xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe - xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm - xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge) - xbar_to_bridge_mm: 3.0 # → removed - noc_to_xbar_bw_gbs: ... # → removed - noc_to_xbar_mm: ... # → removed -``` - -#### Added Link Settings - -```yaml -links: - router_link_bw_gbs: 256.0 # XY mesh link BW between routers - router_overhead_ns: 2.0 # router switching overhead - pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router - hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw) -``` - ---- - -### D9. 
Bandwidth Numerical Consistency - -| Configuration | Value | -| ---- | --- | -| pseudo channels per cube | 64 (parameter) | -| PEs per cube | 8 (parameter) | -| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 | -| per-channel BW | 32 GB/s (parameter) | -| per-PE local BW | N × 32 = 256 GB/s | -| cube total HBM BW | 64 × 32 = 2048 GB/s | - -The effective BW per PE is identical in both modes: - -- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s -- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s - ---- - -## Consequences - -### Positive - -- The router mesh based on cube_mesh.yaml accurately reflects physical placement -- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low -- Local / remote / command traffic is unified onto the same mesh, resulting in simplicity -- Aligns well with graph compiler-based topology generation -- Channel count and PE count are both parameterized, enabling testing of various configurations -- 1:1 mode extension naturally follows through router differentiation - -### Negative - -- The number of SimPy nodes increases due to explicit router nodes (6x6 = up to 32 routers/cube) -- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model - ---- - -## Alternatives - -### A1. Retain Existing xbar + HBM Slices - -- Local/remote paths remain bifurcated -- Cannot model at pseudo-channel granularity -- Cannot switch between 1:1/n:1 modes - -### A2. Always Generate Per-Channel Links and Aggregate Only in n:1 - -- Topology structure always has 1:1 size -- Expressing n:1 semantics via link aggregation is complex -- No reduction in router node count - -### A3. 
Gradual Transition (Retain xbar + Add NOC Path) - -- Higher compatibility, but dual-path coexistence increases complexity -- Since xbar removal is ultimately necessary, the intermediate step provides little value - ---- - -## Test Requirements - -- Verify that requests are delivered via per-channel links in 1:1 mode -- Verify that requests are delivered via the aggregated link in n:1 mode -- Verify that topology is correctly generated in both modes: - - 1:1: `total_ch` channel routers + per-PE links + horizontal links - - n:1: `pes_per_cube` aggregated routers + per-PE links -- Verify that effective BW is consistent across both modes for the same workload -- Verify that horizontal line routing works for cross-PE access -- Verify that routing through UCIe works for cross-cube access -- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.) - ---- - -## Links - -- ADR-0011 (LA model) → addressing-side integration -- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion -- ADR-0004 (Memory Semantics) → BW model redefinition -- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes diff --git a/docs/adr/ADR-0019-NOC-Local HBM.md b/docs/adr/ADR-0019-NOC-Local HBM.md deleted file mode 100644 index d1aed2b..0000000 --- a/docs/adr/ADR-0019-NOC-Local HBM.md +++ /dev/null @@ -1,305 +0,0 @@ -# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC - -## Status - -Accepted - -## Context - -The CUBE-internal NOC must connect each PE to HBM. KernBench must be able to -compare and evaluate two connectivity models. - -- **1:1 mode** — PE_DMA connects to each of N per-channel routers over a separate link, - and each router has its own channel link to hbm_ctrl. - Models per-channel BW contention precisely. - N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`). -- **n:1 mode** — PE_DMA connects through a single aggregated router over one link - to hbm_ctrl. Channels are assumed to be interleaved, and - only aggregate BW is modeled.
- -Effective BW per PE is identical in both modes (= N × per-channel BW); -only the connectivity granularity differs. - ---- - -## Decision - -### D1. HBM Attaches to PE Routers - -Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**, -and attach the HBM access point to the same router where the PE is attached. - -- n:1 mode: a PE's local HBM access goes directly from its own router (switching overhead only, 0 hops) -- Remote PE's HBM access: reaches the target PE's router via mesh hops -- The read/write resource model within the HBM controller is preserved - -Node naming changes: - -| Current | After Change | -| ---- | ------- | -| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) | - -In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that -the builder generates an edge between that router and hbm_ctrl. - ---- - -### D2. Complete Removal of xbar, bridge, and Single NOC Node - -Remove all of the following existing nodes and related edges: - -- `{cube}.xbar_top`, `{cube}.xbar_bot` -- `{cube}.bridge.left`, `{cube}.bridge.right` -- `{cube}.noc` (single TwoDMeshNocComponent node) -- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar` -- Edges of type `xbar_to_bridge`, `bridge_to_xbar` -- Edges such as `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu` that reference the single noc node - -Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**. -Each router (r0c0, r0c1, ...) in the 6×6 router grid generated by the existing `mesh_gen.py` -is created as a separate SimPy node in the topology graph, -and adjacent routers are connected via XY mesh edges. - ---- - -### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1) - -#### Router Nodes Based on cube_mesh.yaml - -Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py` -is created as a **separate SimPy node** in the topology graph. - -- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`) -- kind: `noc_router`, impl: `forwarding_v1` -- pos_mm: taken from cube_mesh.yaml - -Components are connected to each router according to the attach information in the existing cube_mesh.yaml: -- `pe{p}.dma` → PE_DMA ↔ router edge -- `pe{p}.cpu` → PE_CPU ↔ router edge -- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1) -- `m_cpu` → M_CPU ↔ router edge -- `sram` → SRAM ↔ router edge -- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge - -Router-to-router XY mesh edges: bidirectional edges between adjacent routers. -Null routers (HBM exclusion zone) are skipped. - -#### 1:1 Mode Extension (To Be Implemented Later) - -In 1:1 mode, each router splits into N channel mini-routers.
-Per-channel routing and a ChannelSplitter (LA → per-channel PA) must be introduced. -N GEMM engines per PE are also added at this point. - ---- - -### D4. Cross-PE HBM Access (n:1 Mode) - -In n:1 mode, when a PE accesses another PE's local HBM, -it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router. - -Example: PE0 (r0c0) accessing the HBM of PE2 (r1c4): - -```text -PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl -``` - -The Dijkstra router finds the shortest path in the mesh. - -Cross-PE channel access in 1:1 mode will be defined with the 1:1 extension in D3. - ---- - -### D5. n:1 Mode: Uses the cube_mesh.yaml Router Mesh - -In n:1 mode, no separate "aggregated router" is created. -The existing router grid from cube_mesh.yaml serves that role. - -#### Connection Structure - -PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached: - -```text -sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) -sip0.cube0.hbm_ctrl ←→ sip0.cube0.r0c0 (bw: N × channel_bw_gbs) -``` - -Routers are connected via XY mesh edges. A PE's local HBM access goes -directly from its own router (switching overhead only). - -#### n:1 Mode Full Data Paths - -**Local HBM (0 hops):** -```text -PE0.pe_dma → r0c0 → hbm_ctrl (switching overhead only) -``` - -**Remote HBM (mesh hops):** -```text -PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl -``` - -**M_CPU DMA:** -```text -M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl -``` - ---- - -### D6. All Traffic Is Unified onto the Same Router Mesh - -- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh -- Local access does not use a separate fast path (xbar) either -- Cross-cube (remote) access path: - -```text -PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT} - → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl -``` - -UCIe connections keep the existing structure, -but both endpoints become mesh routers instead of xbars. - -The number of UCIe lines is determined by the BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`. - ---- - -### D7.
AddressResolver Changes - -Current `AddressResolver.resolve()`: - -```python -# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" -pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes) -return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}" -``` - -After change: - -```python -# Changed: HBM → single endpoint -return f"sip{s}.cube{c}.hbm_ctrl" -``` - -The pe_slice calculation is removed. -In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router. - -resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility. - ---- - -### D8. topology.yaml Configuration Changes - -#### Added Settings - -```yaml -cube: - memory_map: - hbm_mapping_mode: n_to_one # one_to_one | n_to_one - hbm_pseudo_channels: 64 # total pseudo channel count - hbm_channels_per_pe: 8 # local channels per PE (= pseudo_channels / pes_per_cube) - hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s) - hbm_total_gb_per_cube: 48 # retained -``` - -#### Removed Settings - -```yaml -# To be removed -links: - xbar_to_hbm_bw_gbs: 256.0 # → replaced by channel_bw_gbs × channels_per_pe - xbar_to_hbm_mm: 2.5 # → replaced by ch_router_to_hbm_mm - xbar_to_bridge_bw_gbs: 128.0 # → removed (no bridge) - xbar_to_bridge_mm: 3.0 # → removed - noc_to_xbar_bw_gbs: ... # → removed - noc_to_xbar_mm: ... # → removed -``` - -#### Added Link Settings - -```yaml -links: - router_link_bw_gbs: 256.0 # XY mesh link BW between routers - router_overhead_ns: 2.0 # router switching overhead - pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router - hbm_to_router_bw_gbs: 256.0 # HBM ↔ router (= N × channel_bw) -``` - ---- - -### D9.
대역폭 수치 정합 - -| 구성 | 값 | -| ---- | --- | -| pseudo channels per cube | 64 (파라미터) | -| PEs per cube | 8 (파라미터) | -| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 | -| per-channel BW | 32 GB/s (파라미터) | -| per-PE local BW | N × 32 = 256 GB/s | -| cube total HBM BW | 64 × 32 = 2048 GB/s | - -두 모드에서 PE당 effective BW는 동일: - -- 1:1 mode: N개 channel link × channel_bw_gbs = N × 32 = 256 GB/s -- n:1 mode: 1개 aggregated link = N × channel_bw_gbs = 256 GB/s - ---- - -## Consequences - -### Positive - -- cube_mesh.yaml 기반 라우터 mesh로 물리적 배치를 정확히 반영한다 -- n:1 mode에서 기존 VA 체계를 유지하여 전환 비용이 낮다 -- local / remote / command 트래픽이 동일 mesh로 통일되어 단순하다 -- graph compiler 기반 topology 생성과 잘 맞는다 -- channel 수, PE 수가 모두 파라미터이므로 다양한 구성을 테스트할 수 있다 -- 1:1 mode 확장이 라우터 분화로 자연스럽게 가능하다 - -### Negative - -- 명시적 라우터 노드로 인해 SimPy 노드 수가 증가한다 (6×6 = 최대 32개 라우터/cube) -- TwoDMeshNocComponent의 내부 contention 모델을 라우터별 모델로 교체 필요 - ---- - -## Alternatives - -### A1. 기존 xbar + HBM slice 유지 - -- local/remote 경로가 이원화됨 -- pseudo-channel 단위 모델링 불가 -- 1:1/n:1 mode 전환 불가 - -### A2. per-channel link를 항상 생성하고 n:1에서만 집계 - -- topology 구조가 항상 1:1 크기 -- n:1 semantics를 link aggregation으로 표현하기 복잡 -- router 노드 수 감소 효과 없음 - -### A3. 
단계적 전환 (xbar 유지 + NOC 경로 추가) - -- 호환성은 높으나 두 경로 공존으로 복잡도 증가 -- 최종적으로 xbar 제거가 필요하므로 중간 단계의 가치가 낮음 - ---- - -## Test Requirements - -- 1:1 mode에서 channel별 link로 request가 전달되는지 확인 -- n:1 mode에서 aggregated link로 request가 전달되는지 확인 -- 두 mode에서 topology가 올바르게 생성되는지 검증: - - 1:1: `total_ch`개 channel router + per-PE link + horizontal link - - n:1: `pes_per_cube`개 aggregated router + per-PE link -- 동일 workload에서 effective BW가 두 모드에서 일관적인지 확인 -- cross-PE 접근 시 horizontal line routing이 동작하는지 확인 -- cross-cube 접근 시 UCIe를 통한 routing이 동작하는지 확인 -- 파라미터 변경 (channels_per_pe = 4, 8, 16 등)에서 topology 생성이 정상인지 확인 - ---- - -## Links - -- ADR-0011 (LA model) → addressing 측 연동 -- ADR-0017 (Cube NOC 2D Mesh) → 본 ADR이 xbar/bridge 부분을 대체 -- ADR-0004 (Memory Semantics) → BW 모델 재정의 -- ADR-0014 (PE Internal Execution Model) → PE_DMA 경로 변경 영향 diff --git a/docs/adr/ADR-0020-data-execution-two-pass.en.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.en.md similarity index 100% rename from docs/adr/ADR-0020-data-execution-two-pass.en.md rename to docs/adr/ADR-0020-prog-data-execution-two-pass.en.md diff --git a/docs/adr/ADR-0020-data-execution-two-pass.md b/docs/adr/ADR-0020-prog-data-execution-two-pass.md similarity index 100% rename from docs/adr/ADR-0020-data-execution-two-pass.md rename to docs/adr/ADR-0020-prog-data-execution-two-pass.md diff --git a/docs/adr/ADR-0021-pe-pipeline-refactor.en.md b/docs/adr/ADR-0021-pe-pipeline-refactor.en.md deleted file mode 100644 index 0d5dd6e..0000000 --- a/docs/adr/ADR-0021-pe-pipeline-refactor.en.md +++ /dev/null @@ -1,432 +0,0 @@ -# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing - -## Status - -Accepted - -## Context - -### Actual Hardware Structure - -``` -HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine -``` - -- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns) -- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns) -- GEMM/MATH Engine: computation between Register 
Files (cycle-accurate)
-- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
----
-
-## Decision
-
-### D1. Separate Each Block into an Independent Component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 blocks + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW Correspondence |
-|-----------|------|-------------------|
-| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` allows changing the timing model of an individual block.
-
-### D2. Token Self-Routing — Scheduler Handles Only Dispatch + Completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan so that components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-            ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW structure where each block's done signal is directly
-connected to the next block via wire. The scheduler is responsible **only for
-initial dispatch + completion aggregation**.
-
-#### Stage Definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan Structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** for each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str   # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict     # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
-```
-
-The stage sequence varies depending on the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward it directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion follows an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler Role (Reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order and
-**does not allow tile feed interleaving across composite commands.**
-That is, the feed for the next command begins only after all tiles of the current
-command have been injected into the first stage queue.
-
-There is **exactly one `_feed_loop`** per scheduler, and
-tile feed for composite commands is performed exclusively through this single process.
-Command issue order refers to **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure maintains command issue order while ensuring that when the first stage
-queue is full, only the feeder process blocks — the scheduler worker's inbox processing
-itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only enqueue to feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        The feed for the next command begins only after all tiles of the
-        current command have been injected into the first stage queue.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → feeder blocks only when full
-```
-
-In this ADR, the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data Transfer vs. Completion Signal — HW Modeling Criteria
-
-| Communication Type | Method | HW Correspondence |
-|-------------------|--------|-------------------|
-| Tile token (work directive) | message via out_port | enqueue to command queue |
-| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to
-PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
-is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous Pipeline — Natural Overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command; instead,
-the scheduler's internal **single feeder process** performs the feed in FIFO order.
-Therefore, the scheduler can continue to receive the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
-- When the queue is full, put() naturally blocks (backpressure)
-- While DMA is processing tile 0, GEMM can start fetching an already-completed tile
-- When a second CompositeCmd arrives, it is immediately queued to the DMA queue
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
-                     ↑ pipeline overlap within the same command
-```
-
-Here, the overlap does not come from tile feed interleaving across different commands,
-but occurs naturally as tiles from earlier commands progress to downstream stages
-while the feeder continues injecting subsequent tiles.
-
-For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
-injected into the first stage queue. However, while cmd1.tile0 has already progressed
-to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
-**pipeline overlap within the same command occurs naturally**.
-
-#### Component Chaining Pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — Dedicated TCM ↔ Register File Transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
-This is separated into a **PE_FETCH_STORE component**.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)
-```
-
-Advantages:
-- GEMM/MATH perform **pure compute only** — no TCM access logic
-- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
-- Prefetch strategies can be experimented with by replacing the fetch unit alone
-
-### D6. Simplification of Each Compute Component
-
-GEMM/MATH perform compute only with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-By replacing only the timing model, one can freely switch between cycle-accurate
-and analytical models. Since the chaining logic resides in the base class,
-each component only implements its pure stage logic.
-
-### D7. Topology Changes
-
-Add PE_FETCH_STORE to the PE template:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu, impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler, impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma, impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm, impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math, impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu, impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm, impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER   → PE_DMA          (initial dispatch)
-PE_SCHEDULER   → PE_FETCH_STORE  (initial dispatch)
-PE_SCHEDULER   → PE_GEMM         (initial dispatch)
-PE_SCHEDULER   → PE_MATH         (initial dispatch)
-PE_DMA         → PE_FETCH_STORE  (chaining)
-PE_FETCH_STORE → PE_GEMM         (chaining)
-PE_FETCH_STORE → PE_MATH         (chaining)
-PE_GEMM        → PE_FETCH_STORE  (store chaining)
-PE_MATH        → PE_FETCH_STORE  (store chaining)
-PE_FETCH_STORE → PE_DMA          (writeback chaining)
-PE_FETCH_STORE → PE_TCM          (BW request)
-```
-
-Topology edges encompass both **control/dispatch visibility + runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, while
-inter-component edges are runtime chaining paths driven by token self-routing.
-
-### D9. TileToken Message Definition
-
-A message used for passing tile work between components.
-The token carries the plan and stage index, enabling self-routing.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext   # completion tracking
-    plan: TilePlan                  # full stage sequence for this tile (immutable)
-    stage_idx: int                  # current stage index in plan.stages
-    params: dict                    # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True            # op_log recording target (ADR-0020)
-```
-
-A TileToken is **owned by exactly one component at a time** and
-is never referenced by multiple components simultaneously (single-owner).
-
-Token lifecycle:
-1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
-2. The component executes _process(), increments stage_idx, and puts it to the next component
-3. The last stage component calls pipeline_ctx.complete_tile()
-4. When all tiles are complete, PipelineContext calls done_event.succeed()
-
-Relationship with existing PeInternalTxn:
-- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
-- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
-
----
-
-## Non-goals
-
-- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
-  (PeInternalTxn-based, ADR-0014 maintained)
-- **Resource contention model across multiple pipelines**: the current scope focuses on
-  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
-  are future work.
-
-## Open Questions
-
-- **Register File capacity model**: whether to model capacity limits when the fetch unit
-  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
-  the number of tiles that can be held simultaneously is determined by tile size.
-  When capacity is exceeded, fetch stalls, creating natural backpressure.
-- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
-  commands. Therefore, overlap arises not from pre-injection across commands, but
-  naturally from pipeline progression of tiles within the same command.
-  If additional prefetch is needed, it should be considered at the level of tile ordering
-  within the same command or fetch/store unit policy, not cross-command injection.
-- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
-  Direction is to merge/coalesce within DMA without scheduler involvement.
-- **Synchronous execution mode**: this ADR adopts asynchronous pipeline as the
-  default/sole execution model. If a sync mode is needed for debug or validation
-  purposes, it will be considered in a future ADR.
-- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
-  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.
-
----
-
-## Consequences
-
-### Positive
-
-- Each block is an independent component — individually replaceable (ADR-0015 compliant)
-- PE internal structure is visible in the topology
-- Components do not know the next component — plan-based routing provides flexibility
-- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
-- Improved HW modeling accuracy (done signal = Event, data transfer = message)
-- Fetch/store separation enables accurate TCM BW contention modeling
-
-### Negative
-
-- Increased number of PE internal components (5 → 6) — more topology nodes/edges
-- Component separation makes intra-PE token forwarding more explicit than before
-
diff --git a/docs/adr/ADR-0021-pe-pipeline-refactor.md b/docs/adr/ADR-0021-pe-pipeline-refactor.md
deleted file mode 100644
index 628937b..0000000
--- a/docs/adr/ADR-0021-pe-pipeline-refactor.md
+++ /dev/null
@@ -1,426 +0,0 @@
-# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing
-
-## Status
-
-Accepted
-
-## Context
-
-### Actual hardware structure
-
-```
-HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
-```
-
-- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
-- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
-- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
-- Completion signal: PE-internal 1-cycle wire signal (done pin assert)
-
----
-
-## Decision
-
-### D1. Separate each block into an independent component
-
-The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
-Existing 5 + 1 Fetch/Store Unit = 6 components.
-
-| Component | Role | HW correspondence |
-|----------|------|---------|
-| PE_SCHEDULER | plan generation, tile state management, stage routing | Scheduler/Sequencer |
-| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
-| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
-| PE_GEMM | MAC compute (register only) | MAC Array |
-| PE_MATH | element-wise/reduction (register only) | SIMD/Vector Unit |
-| PE_TCM | BW-serialized scratchpad | SRAM Bank |
-
-Each component exists as a topology node and is connected via ports/wires.
-Replacing the `impl` changes the timing model of an individual block.
-
-### D2. Token Self-Routing — Scheduler handles only dispatch + completion
-
-**Components do not pass through the scheduler at every stage.**
-The token carries a plan, so components chain directly to the next stage.
-
-```
-Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
-            ↑ chaining: does not go through scheduler          completion only
-```
-
-This matches the actual HW, where each block's done signal is wired directly
-to the next block. The scheduler handles **only initial dispatch + completion aggregation**.
-
-#### Stage definition
-
-```python
-class StageType(Enum):
-    DMA_READ = 0
-    FETCH = 1
-    GEMM = 2
-    MATH = 3
-    STORE = 4
-    DMA_WRITE = 5
-```
-
-#### Plan structure
-
-When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
-The plan defines the **stage sequence** of each tile:
-
-```python
-@dataclass
-class Stage:
-    stage_type: StageType
-    component: str   # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
-    params: dict     # per-stage parameters (dynamic)
-
-@dataclass(frozen=True)
-class TilePlan:
-    tile_id: int
-    stages: tuple[Stage, ...]  # list of stages to execute in order (immutable)
-```
-
-The stage sequence varies with the plan:
-
-```python
-# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
-stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)
-
-# GEMM directly from TCM data (skip DMA read):
-stages = (FETCH, GEMM, STORE, DMA_WRITE)
-
-# MATH element-wise:
-stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)
-
-# GEMM + accumulation (intermediate K-tile, skip writeback):
-stages = (DMA_READ, FETCH, GEMM, STORE)  # store to TCM only
-```
-
-**Components do not hardcode the next component.**
-They read the next stage from the token's plan and forward the token directly via out_port.
-This is the same pattern as a network packet carrying a routing header.
-
-#### Pipeline Context
-
-```python
-@dataclass
-class PipelineContext:
-    id: str
-    total_tiles: int
-    completed_tiles: int = 0
-    done_event: simpy.Event = None  # succeeds when all tiles are complete
-
-    def complete_tile(self) -> None:
-        self.completed_tiles += 1
-        if self.completed_tiles == self.total_tiles:
-            self.done_event.succeed()
-```
-
-**Completion is an exactly-once contract**: the last stage of each tile must call
-`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
-succeed only once (SimPy Event constraint).
-
-#### Scheduler role (reduced)
-
-When the scheduler receives a CompositeCmd, it creates a plan and a PipelineContext,
-enqueues them into the scheduler's internal `_pending_feeds` FIFO, and returns immediately.
-
-Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
-This feeder consumes `_pending_feeds` in FIFO order, and
-**tile feed interleaving across composite commands is not allowed.**
-That is, only after all tiles of one command have been injected into the first stage queue
-does the feed for the next command begin.
-
-There is **exactly one** `_feed_loop` per scheduler, and
-tile feed for composite commands is performed only through this single process.
-Command issue order means **the order in which PE_SCHEDULER receives PeInternalTxn**.
-
-This structure preserves command issue order while ensuring that, when the first stage queue is full,
-only the feeder process blocks and the scheduler worker's inbox processing itself does not stall.
-
-```python
-class PeSchedulerV2(PeEngineBase):
-    _pipelines: dict[str, PipelineContext]
-    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)
-
-    def start(self, env):
-        super().start(env)
-        self._pending_feeds = simpy.Store(env)
-        env.process(self._feed_loop(env))
-
-    def _dispatch_composite(self, env, pe_txn, cmd):
-        plan = generate_plan(cmd)
-        ctx = PipelineContext(
-            id=next_id(),
-            total_tiles=len(plan.tiles),
-            done_event=pe_txn.done,
-        )
-        self._pipelines[ctx.id] = ctx
-
-        # only register with the feeder queue and return immediately
-        yield self._pending_feeds.put((plan, ctx))
-
-    def _feed_loop(self, env):
-        """Single feeder process: feeds composite commands in FIFO order.
-
-        Tile feed interleaving across composite commands is not allowed.
-        Only after all tiles of one command have been injected into the
-        first stage queue does the feed for the next command begin.
-
-        When the first stage queue is full, only this feeder blocks;
-        the scheduler worker's inbox processing does not stall.
-        """
-        while True:
-            plan, ctx = yield self._pending_feeds.get()
-            for tile in plan.tiles:
-                token = TileToken(
-                    tile_id=tile.tile_id,
-                    pipeline_ctx=ctx,
-                    plan=tile,
-                    stage_idx=0,
-                    params=tile.stages[0].params,
-                )
-                yield self.out_ports[tile.stages[0].component].put(token)
-                # queue capacity = HW queue depth → only the feeder blocks when full
-```
-
-In this ADR the scheduler can accept multiple composite commands,
-but tile submission order follows per-command FIFO.
-Within a command, tile-level pipeline overlap is allowed,
-but tile feed interleaving across commands is not.
-
-### D3. Data transfer vs. completion signal — HW modeling criteria
-
-| Communication type | Method | HW correspondence |
-|----------|------|---------|
-| tile token (work directive) | message via out_port | enqueue to command queue |
-| stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
-| pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |
-
-**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.
-
-**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
-is applied to intra-PE stage triggers. Chaining between components corresponds to PE-internal wires,
-and since there is no scheduler round-trip, no artificial hop cost is incurred.
-
-**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
-When all tiles are complete, PipelineContext calls done_event.succeed().
-
-### D4. Asynchronous pipeline — natural overlap
-
-The scheduler processes CompositeCmds **asynchronously**.
-However, tile feed does not spawn an independent process per command;
-the scheduler's internal **single feeder process** performs it in FIFO order.
-The scheduler can therefore keep receiving the next command,
-but the first-stage tile injection order is guaranteed per command.
-
-Since **SimPy Store capacity = HW queue depth**:
-- when the queue is full, put() naturally blocks (backpressure)
-- while DMA is processing tile 0, GEMM starts fetching an already-completed tile
-- when a second CompositeCmd arrives, it is queued to the DMA queue right behind the first
-
-```
-First-stage feed order (feeder → DMA queue):
-  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
-                                            ↑ cmd2 starts after cmd1 feed completes
-
-Runtime pipeline (downstream overlap):
-  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
-  PE_FETCH:          [cmd1:t0][cmd1:t1]...
-  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
-                     ↑ pipeline overlap within the same cmd
-```
-
-Here the overlap does not come from tile feed interleaving across different commands;
-it occurs naturally as tiles of the earlier command progress to downstream stages while the feeder
-keeps injecting the following tiles.
-
-For example, until all tiles of cmd1 have been injected into the first stage queue,
-tile feed for cmd2 does not start. However, cmd1.tile0 may already have progressed to GEMM
-while cmd1.tile1 and cmd1.tile2 still remain in DMA/FETCH, so
-**pipeline overlap occurs naturally within the same command**.
-
-#### Component chaining pattern
-
-All components follow the same pattern:
-
-```python
-def _pipeline_worker(self, env):
-    while True:
-        token = yield self._inbox.get()
-
-        # process own stage
-        yield from self._process(env, token)
-
-        # chain to next stage (read from plan)
-        next_idx = token.stage_idx + 1
-        if next_idx < len(token.plan.stages):
-            next_stage = token.plan.stages[next_idx]
-            token.stage_idx = next_idx
-            token.params = next_stage.params
-            yield self.out_ports[next_stage.component].put(token)
-        else:
-            # last stage — pipeline completion
-            token.pipeline_ctx.complete_tile()
-```
-
-### D5. PE_FETCH_STORE — dedicated TCM ↔ Register File transfer
-
-Previously, GemmBlock and MathBlock each implemented their own TCM read/write;
-this is separated into a **PE_FETCH_STORE component**.
-
-```python
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-    # chaining is handled by the base class (D4 pattern)
-```
-
-Advantages:
-- GEMM/MATH perform **pure compute only** — no TCM access logic
-- fetch/store BW contention is modeled naturally (serialization via the PE_TCM resource)
-- prefetch strategies and the like can be tried by replacing the fetch unit alone
-
-### D6. Simplification of each compute component
-
-GEMM/MATH perform compute only, with register data already prepared.
-**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**
-
-```python
-# PE_GEMM._process()
-def _process(self, env, token):
-    yield env.timeout(self._mac_latency(token.params))
-
-# PE_MATH._process()
-def _process(self, env, token):
-    yield env.timeout(self._simd_latency(token.params))
-
-# PE_FETCH_STORE._process()
-def _process(self, env, token):
-    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
-    yield tcm_done
-
-# PE_DMA._process()
-def _process(self, env, token):
-    yield from self._do_fabric_dma(token.params)
-```
-
-Replacing only the timing model allows switching freely between cycle-accurate and analytical models.
-Since the chaining logic lives in the base class, each component implements only its pure stage logic.
-
-### D7. Topology changes
-
-Add PE_FETCH_STORE to the PE template:
-
-```yaml
-pe_template:
-  components:
-    pe_cpu:         { kind: pe_cpu, impl: pe_cpu_v1, ... }
-    pe_scheduler:   { kind: pe_scheduler, impl: pe_scheduler_v2, ... }
-    pe_dma:         { kind: pe_dma, impl: pe_dma_v1, ... }
-    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
-    pe_gemm:        { kind: pe_gemm, impl: pe_gemm_v1, ... }
-    pe_math:        { kind: pe_math, impl: pe_math_v1, ... }
-    pe_mmu:         { kind: pe_mmu, impl: pe_mmu_v1, ... }
-    pe_tcm:         { kind: pe_tcm, impl: pe_tcm_v1, ... }
-  links:
-    # existing links...
-    fetch_store_to_tcm_bw_gbs: 512.0
-    fetch_store_to_tcm_mm: 0.0
-```
-
-PE internal edge connections:
-```
-PE_SCHEDULER   → PE_DMA          (initial dispatch)
-PE_SCHEDULER   → PE_FETCH_STORE  (initial dispatch)
-PE_SCHEDULER   → PE_GEMM         (initial dispatch)
-PE_SCHEDULER   → PE_MATH         (initial dispatch)
-PE_DMA         → PE_FETCH_STORE  (chaining)
-PE_FETCH_STORE → PE_GEMM         (chaining)
-PE_FETCH_STORE → PE_MATH         (chaining)
-PE_GEMM        → PE_FETCH_STORE  (store chaining)
-PE_MATH        → PE_FETCH_STORE  (store chaining)
-PE_FETCH_STORE → PE_DMA          (writeback chaining)
-PE_FETCH_STORE → PE_TCM          (BW request)
-```
-
-Topology edges cover both **control/dispatch visibility + runtime chaining**.
-Scheduler → sub-component edges are initial dispatch paths, and
-inter-component edges are runtime chaining paths driven by token self-routing.
-
-### D9. TileToken message definition
-
-A message used to pass tile work between components.
-The token carries the plan and the stage index, which enables self-routing.
-
-```python
-@dataclass
-class TileToken:
-    tile_id: int
-    pipeline_ctx: PipelineContext   # completion tracking
-    plan: TilePlan                  # full stage sequence for this tile (immutable)
-    stage_idx: int                  # current stage index in plan.stages
-    params: dict                    # current stage parameter cache (canonical: plan.stages[stage_idx].params)
-    data_op: bool = True            # op_log recording target (ADR-0020)
-```
-
-A TileToken is **owned by only one component** at any point in time and
-is never referenced by multiple components simultaneously (single-owner).
-
-Token lifecycle:
-1. The scheduler creates it with stage_idx=0 and puts it to the first stage component
-2. The component runs _process(), increments stage_idx, and puts it to the next component
-3. The last stage component calls pipeline_ctx.complete_tile()
-4. When all tiles are complete, PipelineContext calls done_event.succeed()
-
-Relationship with the existing PeInternalTxn:
-- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, kept)
-- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)
-
----
-
-## Non-goals
-
-- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not changed
-  (PeInternalTxn-based, ADR-0014 maintained)
-- **Resource contention model across multiple pipelines**: the current scope focuses on
-  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines and the like are future work.
-
-## Open Questions
-
-- **Register File capacity model**: whether to model the capacity limit when the fetch unit
-  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
-  the number of tiles that can be held simultaneously is determined by tile size.
-  When capacity is exceeded, fetch stalls, which creates natural backpressure.
-- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
-  commands. Overlap therefore arises not from pre-injection across commands,
-  but naturally from the pipeline progression of tiles within the same command.
-  If additional prefetch is needed, it is considered not as cross-command injection but at the level of
-  tile ordering within the same command or fetch/store unit policy.
-- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
-  The direction is to merge/coalesce inside DMA without scheduler involvement.
-- **Synchronous execution mode**: this ADR adopts the asynchronous pipeline as the default/sole
-  execution model. If a sync mode is needed for debug or validation purposes, it is considered in a future ADR.
-- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
-  The bank conflict model for multiple pipelines accessing TCM simultaneously is future work.
- ---- - -## Consequences - -### 긍정적 - -- 각 블록이 독립 컴포넌트 — 개별 교체 가능 (ADR-0015 준수) -- topology에서 PE 내부 구조 가시화 -- 컴포넌트가 다음 컴포넌트를 모름 — plan 기반 라우팅으로 유연성 확보 -- DMA와 compute의 자연스러운 파이프라인 overlap (SimPy Store backpressure) -- HW 모델링 정확도 향상 (done signal = Event, data transfer = message) -- fetch/store 분리로 TCM BW 경합 정확히 모델링 - -### 부정적 - -- PE 내부 컴포넌트 수 증가 (5 → 6) — topology 노드/edge 증가 -- 컴포넌트 분리로 인해 intra-PE token forwarding이 이전 대비 더 명시적으로 드러남 - diff --git a/docs/adr/ADR-0022-program-id-2d-grid.md b/docs/adr/ADR-0022-prog-program-id-2d-grid.md similarity index 96% rename from docs/adr/ADR-0022-program-id-2d-grid.md rename to docs/adr/ADR-0022-prog-program-id-2d-grid.md index 9bf7966..371bb49 100644 --- a/docs/adr/ADR-0022-program-id-2d-grid.md +++ b/docs/adr/ADR-0022-prog-program-id-2d-grid.md @@ -1,10 +1,10 @@ # ADR-0022: 2D Grid program_id Semantics -- **Status**: Accepted -- **Date**: 2026-04-09 -- **Context**: Triton-style kernel addressing for multi-cube PE topology +## Status -## Problem +Accepted + +## Context Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid. Our hardware has a 2-level hierarchy: **cubes** contain **PEs**. diff --git a/docs/adr/ADR-0023-ipcq-pe-collective.en.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md similarity index 99% rename from docs/adr/ADR-0023-ipcq-pe-collective.en.md rename to docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md index 5322753..e6b6334 100644 --- a/docs/adr/ADR-0023-ipcq-pe-collective.en.md +++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.en.md @@ -709,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel. ### D13. Test strategy -Following the ADR-0021 D8 pattern. +Test plan: #### T1. Unit tests (component-level) @@ -801,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs. ### D15. Algorithm-author cheat sheet Full step-by-step lives in -[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). 
The +[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The shortest version: | Things you touch | Things you don't | diff --git a/docs/adr/ADR-0023-ipcq-pe-collective.md b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md similarity index 77% rename from docs/adr/ADR-0023-ipcq-pe-collective.md rename to docs/adr/ADR-0023-dev-ipcq-pe-collective.md index 026b6f3..5fd174d 100644 --- a/docs/adr/ADR-0023-ipcq-pe-collective.md +++ b/docs/adr/ADR-0023-dev-ipcq-pe-collective.md @@ -969,7 +969,7 @@ tail 갱신은 D9 fast path SimPy Store 채널로 처리된다. ### D13. 테스트 전략 -ADR-0021의 D8 패턴을 따라 단위/통합/regression 테스트를 명시한다. +단위/통합/regression 테스트를 명시한다. #### T1. 단위 테스트 (component-level) @@ -1102,7 +1102,7 @@ F5. **Slot full + 무한 backpressure**: ### D15. 알고리즘 작성자 가이드 (요약) 본 섹션은 알고리즘 작성자가 한 화면으로 시작점을 잡을 수 있도록 한다. -자세한 step-by-step 가이드는 [docs/ccl-author-guide.md](../ccl-author-guide.md) 참조. +자세한 step-by-step 가이드는 [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md) 참조. #### 만지는 것 / 만지지 않는 것 @@ -1175,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None: 2. **send/recv 짝 맞지 않음** — peer 측 recv 없으면 hang (slot full backpressure) 3. **dtype/shape 불일치** — 첫 구현은 검증 안 함, 작성자 책임 -자세한 step-by-step과 hello-world 예제는 `docs/ccl-author-guide.md` 참조. +자세한 step-by-step과 hello-world 예제는 `docs/onboarding/ccl-author-guide.md` 참조. + +--- + +## HW Realization Notes (Informative) + +**Status of this section**: Forward-looking. Describes how the simulator +contract (D1–D15) would map to silicon. Not currently implemented; +subject to revision before tapeout. The simulator implements the +contract via Python/SimPy equivalents in +[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and +[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py). + +### D16. 
Proposed HW Block Diagram and End-to-End Dataflow + +![PE Baseline Architecture](../diagrams/pe_baseline.png) + +> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`. + +![PE Proposed Architecture](../diagrams/pe_proposed.png) + +> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`. + +**Baseline → Proposed 핵심 변경**: + +- 단일 FIFO inbox → **compute port / IPCQ port 분리 + WRR Arbiter** (NEW) +- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic) +- TCM 내 **IPCQ Slot Region 예약 영역** 명시 +- Credit Injector / Receiver가 Fabric Port를 통해 NoC에 직접 연결 + +#### End-to-End Sequence (HW view) + +```mermaid +sequenceDiagram + participant CPU_A as PE_A: PE_CPU + participant IPCQ_A as PE_A: IPCQ Ctrl + participant DMA_A as PE_A: DMA + participant NOC as NoC Fabric + participant DMA_B as PE_B: DMA + participant IPCQ_B as PE_B: IPCQ Ctrl + participant TCM_B as PE_B: TCM + participant CPU_B as PE_B: PE_CPU + + Note over CPU_A: tl.send(dir="E", src=0x1000) + + CPU_A->>IPCQ_A: MMIO: send request + Note over IPCQ_A: Backpressure check:
(head - peer_tail_cache) < n_slots → PASS
Slot addr gen:
dst = peer_rx_base + (head%n) × slot_size + IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head} + Note over IPCQ_A: my_head++ + IPCQ_A-->>CPU_A: send returns (fire-and-forget) + + Note over DMA_A: TCM read → snapshot in read buffer
Flit pack: data + {sender_seq, dst_addr} + DMA_A->>NOC: IPCQ data flit(s) + + Note over NOC: hop latency + BW drain + + NOC->>DMA_B: IPCQ data flit(s) + Note over DMA_B: Terminal BW drain
Slot write latency + + rect rgb(255, 240, 220) + Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall + DMA_B->>TCM_B: write data → slot address + DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr} + end + + Note over IPCQ_B: Range match dst_addr → direction "W"
peer_head_cache["W"] = sender_seq + 1 + IPCQ_B-->>CPU_B: recv_wake signal + + Note over CPU_B: tl.recv(dir="W") wakes up + CPU_B->>IPCQ_B: recv request + Note over IPCQ_B: peer_head_cache > my_tail → YES
slot_addr = rx_base + (tail%n) × slot_size + IPCQ_B-->>CPU_B: return slot_addr + CPU_B->>TCM_B: read data from slot + Note over IPCQ_B: my_tail++ + + IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa} + Note over NOC: credit traversal (NoC latency) + NOC->>IPCQ_A: Credit arrival + + Note over IPCQ_A: Match dst_rx_base_pa → direction "E"
peer_tail_cache["E"] = consumer_seq
Backpressure deassert (if stalled) +``` + +### D17. IPCQ Controller HW Module (신규) + +PE_CPU와 DMA Engine 사이에 위치하는 하드웨어 제어 블록. 시뮬레이터의 +`PeIpcqComponent`에 대응한다. + +#### QPair Register File + +방향별 queue pair 상태를 flip-flop으로 유지. PE_CPU가 MMIO(CSR)로 읽기/쓰기 +가능하며, init 시점에 소프트웨어가 채워넣는다. + +``` +Per-direction registers (each 64-bit): + my_head — sender write position (monotonic) + my_tail — receiver read position (monotonic) + peer_head_cache — last known peer head (updated by Meta Extractor) + peer_tail_cache — last known peer tail (updated by Credit Receiver) + rx_base_pa — this PE's rx buffer base physical address + peer_rx_base_pa — peer's rx buffer base physical address + n_slots — ring depth (power-of-2 제약, D21 참조) + slot_size — bytes per slot + peer_credit_tgt — peer PE의 credit receive 주소 + +Directions: 최대 8 (N/S/E/W/parent/child_left/child_right + spare) +Total: 8 dirs × 9 regs × 8B = 576B flip-flops +``` + +#### Slot Address Generator (combinational) + +``` +Input: pointer (my_head or my_tail), n_slots, slot_size, base_pa +Output: slot_addr = base_pa + (pointer % n_slots) * slot_size + +Implementation: + n_slots power-of-2 → pointer & (n_slots - 1) (AND mask, 1 gate) + slot_size power-of-2 → barrel shift (1 cycle) + 64-bit add → ripple/kogge-stone adder (1 cycle) + +Latency: 1-2 cycles combinational +``` + +#### Backpressure Comparator (combinational) + +``` +full = (my_head - peer_tail_cache) >= n_slots + +Implementation: 64-bit subtract + unsigned compare +Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold +Latency: 1 cycle +``` + +#### Meta Extractor (inbound datapath sideband) + +DMA Engine의 inbound vc_comm path에 wired. 도착하는 IPCQ flit의 header에서 +metadata를 추출하여 queue pair 상태를 갱신한다. 
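
The Slot Address Generator and Backpressure Comparator described above reduce to a few integer operations. The following is a minimal Python sketch, assuming 64-bit wrapping pointers and power-of-2 values for both `n_slots` and `slot_size` (the text only recommends this for `slot_size`). The function names `slot_addr` and `is_full` are illustrative and are not taken from the simulator source.

```python
# Assumption: pointers wrap at 64 bits, because the QPair registers are 64-bit.
MASK64 = (1 << 64) - 1


def _is_pow2(x: int) -> bool:
    return x > 0 and (x & (x - 1)) == 0


def slot_addr(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """slot_addr = base_pa + (pointer % n_slots) * slot_size, without divide or multiply."""
    if not _is_pow2(n_slots) or not _is_pow2(slot_size):
        raise ValueError("this sketch assumes power-of-2 n_slots and slot_size")
    index = pointer & (n_slots - 1)        # modulo via AND mask
    shift = slot_size.bit_length() - 1     # log2(slot_size), used as a shift amount
    return (base_pa + (index << shift)) & MASK64


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """full = (my_head - peer_tail_cache) >= n_slots, using an unsigned 64-bit difference."""
    return ((my_head - peer_tail_cache) & MASK64) >= n_slots


addr = slot_addr(pointer=5, n_slots=4, slot_size=0x1000, base_pa=0xE00000)
ring_full = is_full(my_head=4, peer_tail_cache=0, n_slots=4)
ring_has_room = is_full(my_head=3, peer_tail_cache=0, n_slots=4)
# Head has wrapped past 2**64 while the cached tail has not: 3 slots in flight.
wrapped = is_full(my_head=1, peer_tail_cache=MASK64 - 1, n_slots=4)
```

The unsigned difference keeps the comparison meaningful even after the monotonic pointers wrap, which is the reason for masking rather than comparing the raw pointers.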
+ +``` +Trigger: DMA inbound write completion (same cycle) +Extract: {sender_seq, dst_addr} from flit header + +Direction matching (ADR-0025 D2): + for each dir: + match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir]) + 8× parallel range comparators + priority encoder + +Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1) +Output: recv_wake signal → PE_CPU interrupt/flag +Latency: 1 cycle (pipelined with DMA write — I6 atomicity 자연 보장) +``` + +#### Credit Injector (outbound) + +``` +Trigger: recv completion (my_tail 증가 후) +Action: pack 16B credit packet → DMA vc_comm (또는 dedicated credit VC) + +Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa} +Latency: 1 cycle to generate, then NoC traversal +``` + +#### Credit Receiver (inbound sideband) + +``` +Trigger: 16B credit packet arrival (from NoC) +Extract: {consumer_seq, dst_rx_base_pa} + +Direction matching (ADR-0025 D3): + for each dir: + match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa) + +Update: peer_tail_cache[matched_dir] = max(peer_tail_cache, consumer_seq) +Output: send_wake signal → deassert backpressure stall +Latency: 1 cycle +``` + +### D18. DMA Engine vc_comm IPCQ-aware Mode + +기존 vc_comm 채널(D8)에 IPCQ flit 처리 모드를 추가한다. + +**Outbound**: + +1. IPCQ Controller로부터 command 수신: `{src_addr, dst_addr, nbytes, sender_seq}` +2. TCM에서 src_addr read → DMA read buffer에 snapshot (standard DMA behavior) +3. Flit pack: data + piggyback metadata (sender_seq, dst_addr) +4. NoC fabric port에 inject +5. Fire-and-forget (completion 미대기) + +**Inbound**: + +1. NoC로부터 IPCQ flit 수신 +2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`) +3. Slot write latency charge (backing memory tier) +4. **ATOMIC** (same pipeline stage, no stall insertion): + - TCM write: data → slot address + - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller +5. 
Done + +**I6 atomicity 하드웨어 보장**: TCM write completion과 Meta Extractor trigger가 +동일 pipeline stage에서 발생하므로 별도 synchronization이 불필요. 시뮬레이터의 +"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" (D9, I6)이 +자연스럽게 보장된다. + +#### Data Snapshot Semantics + +DMA read buffer에 latch된 데이터는 src memory의 이후 수정에 영향받지 않는다. +이는 DMA standard read-then-write behavior이므로 추가 HW 불필요. + +#### Credit Virtual Channel (선택적) + +- **옵션 A**: vc_comm에 credit을 multiplexing (16B header-only flit으로 구분). +- **옵션 B**: 3rd dedicated credit VC 추가 (strict priority > data). + +옵션 B가 deadlock prevention에 유리하나, 16B credit의 BW 영향이 무시 가능하므로 +옵션 A로도 충분. + +### D19. Fabric Flit Format Extension + +``` +일반 data flit (예: 512-bit): +┌──────────────────────────────────────────┐ +│ [511:480] routing header (32b) │ +│ [479:0] payload (480b = 60B) │ +└──────────────────────────────────────────┘ + +IPCQ data flit (첫 flit에만 metadata 포함): +┌──────────────────────────────────────────┐ +│ [511:480] routing header (32b) │ +│ [511] ipcq_flag (1b) │ ← IPCQ vs normal DMA 식별 +│ [510:509] vc_id (2b) │ +│ [508:480] route + hop count │ +│ [479:416] ipcq_metadata (64b) │ ← piggyback +│ [479:448] sender_seq (32b) │ +│ [447:416] dst_addr[31:0] (32b) │ ← direction matching용 +│ [415:0] payload (416b = 52B) │ +└──────────────────────────────────────────┘ +후속 flits: full 60B payload (metadata 없음) + +Credit-only flit (128-bit, header-only): +┌──────────────────────────────────────────┐ +│ [127:96] routing header (32b) │ +│ [127] credit_flag (1b) │ +│ [95:64] consumer_seq (32b) │ +│ [63:0] dst_rx_base_pa (64b) │ +└──────────────────────────────────────────┘ +``` + +첫 flit의 payload가 60B → 52B로 감소 (13% overhead). Multi-flit transfer에서는 +후속 flit이 full payload이므로 대형 전송에서 overhead < 1%. + +### D20. 
TCM IPCQ Slot Region Layout + +``` +TCM Memory Map (16MB): +┌─────────────────────────────┐ 0x000000 +│ Kernel Working Memory │ +│ (compute tensors) │ +│ ~14MB │ +├─────────────────────────────┤ 0xE00000 +│ IPCQ RX Buffers │ +│ Dir N: slots × slot_size │ +│ Dir S: slots × slot_size │ +│ Dir E: slots × slot_size │ +│ Dir W: slots × slot_size │ +│ ~1MB │ +├─────────────────────────────┤ 0xF00000 +│ IPCQ Metadata / Scratch │ +│ ~1MB │ +└─────────────────────────────┘ 0xFFFFFF +``` + +IPCQ region을 TCM의 상위 bank에 배치하여 compute access와의 bank conflict를 +최소화한다 (Risk D22 참조). + +### D21. 2nm Implementation Analysis + +#### Area Estimate + +| Module | Gate Count | Area (2nm est.) | Notes | +|---|---|---|---| +| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops | +| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational | +| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators | +| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **PE 전체 대비 < 0.1%** | +| DMA vc_comm 확장 | ~2K gates | 0.002 mm² | Flit pack/unpack | +| **Total 변경분** | **~14.6K** | **~0.006 mm²** | | + +#### Timing + +| Path | Delay (2nm est.) | Target Clock | Margin | +|---|---|---|---| +| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× | +| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× | +| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× | +| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× | + +모든 critical path가 1 cycle 이내. Timing closure 문제 없음. + +#### Power + +- Active: ~1 mW (register R/W + comparators, send/recv 동작 시) +- Idle: leakage only +- PE 전체 전력 대비 무시 가능 + +#### Constraints + +| 항목 | 제약 | 근거 | +|---|---|---| +| `n_slots` | **반드시 power-of-2** | mod → AND mask (1 gate). 임의 값은 divider 필요 (~10 cycles) | +| `slot_size` | **power-of-2 권장** | mul → barrel shift. 임의 값은 multiplier 필요 | +| TCM IPCQ region | **전용 bank 배치** | Compute access와 bank conflict 방지 | + +### D22. 
Risk Assessment + +#### TCM Bank Conflict + +- **Risk**: IPCQ slot write와 compute read가 동일 bank 접근 시 stall +- **Mitigation**: IPCQ region을 TCM 상위 address의 전용 bank에 배치 (D20) +- **Cost**: TCM banking flexibility 소폭 감소 +- **Severity**: Medium (성능 영향), Low (correctness 문제 아님) + +#### Credit Return Latency under Congestion + +- **Risk**: NoC 혼잡 시 credit return 지연 → sender backpressure stall +- **Mitigation**: + - Credit을 별도 VC로 분리 + strict priority (16B로 BW impact 미미) + - 또는 n_slots를 넉넉히(8+) 설정하여 credit 지연을 buffer로 흡수 +- **Severity**: Low (credit 16B는 congestion에 거의 기여하지 않음) + +#### Inter-Direction Ordering + +- **Risk**: 같은 PE에서 여러 방향으로 동시 send 시 순서 +- **Mitigation**: Per-direction monotonic seq으로 충분. Inter-direction ordering은 + kernel(소프트웨어) 책임 — 현재 시뮬레이터 모델과 동일 (D2 + D4) +- **Severity**: Low (아키텍처 설계에 의해 해소) + +### D23. HW Alternatives Considered + +#### Doorbell + Polling (전통적 방식) + +``` +Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell +Recv: Polling loop on doorbell, or interrupt-driven +``` + +| 장점 | 단점 | +|---|---| +| 단순한 HW (IPCQ controller 불필요) | 2번의 DMA transaction (data + doorbell) | +| 기존 DMA 재사용 | Data/doorbell 사이 ordering 보장 필요 (fence) | +| | Polling은 전력 낭비, interrupt는 latency overhead | + +**평가**: Piggyback 대비 latency 2-3× 증가. **불채택.** + +#### Hardware Message Queue (NVIDIA NVLink 스타일) + +``` +Send: CPU → HMQ에 descriptor push → HW가 peer HMQ로 자동 전달 +Recv: HMQ에서 descriptor pop → data pointer 확인 +``` + +| 장점 | 단점 | +|---|---| +| CPU는 descriptor만 작성 | 별도 HMQ engine 필요 (~0.05 mm²) | +| Descriptor/data 분리 → 유연 | DMA와 별개 datapath → area/power 중복 | +| | Large tensor에는 결국 DMA 필요 | + +**평가**: CCL의 large tensor 패턴에서 DMA 필수이므로 HMQ + DMA 이중 구조는 +면적 낭비. 
**불채택.** + +#### RDMA-style Completion Queue (CQ) + +``` +Send: DMA write → peer에 CQE 자동 생성 +Recv: CQ poll/interrupt → data 위치 확인 +``` + +| 장점 | 단점 | +|---|---| +| InfiniBand/RoCE 성숙 모델 | CQ 관리 logic + CQE memory overhead | +| Multi-tenant/isolation 용이 | CQE/data ordering 보장 추가 필요 | +| | PE-to-PE CCL에는 over-engineered | + +**평가**: RDMA CQ는 host-facing NIC의 multi-tenant 격리에 적합. +PE 간 단일 owner 환경에서는 불필요한 복잡성. **불채택.** + +#### Credit-in-Data Piggyback (v2 최적화 후보) + +현재 설계에서 credit return은 별도 16B packet이다. Bidirectional 통신 +패턴에서는 **reverse 방향 data flit에 credit을 합칠 수 있다.** + +``` +PE_A →E→ PE_B: data + sender_seq=3 +PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit이 data에 합쳐짐 +``` + +| 장점 | 단점 | +|---|---| +| Credit 전용 packet 제거 → NoC BW 절약 | Unidirectional 패턴에서는 fallback 필요 | +| Bidirectional allreduce에서 credit latency → 0 | Flit header에 8B 추가 (overhead 미미) | +| | Logic 복잡도 소폭 증가 | + +**평가**: 현재 설계의 우수한 최적화. Bidirectional allreduce에서 credit packet을 +완전 제거 가능. Standalone credit fallback도 유지. **v2로 채택 권고.** + +### Open HW Questions + +- IPCQ slot region size를 TCM의 몇 %까지 허용할 것인가? (현재 가정: ~1MB / 16MB = 6.25%) +- Credit VC를 별도로 둘 것인가, vc_comm에 multiplexing할 것인가? (D18 참조) +- Inter-SIP link에서의 flit format 호환성 검증 필요 +- n_slots 최대값 제한? (8 directions × 8 slots × 64KB = 4MB → TCM의 25%) --- diff --git a/docs/adr/ADR-0024-par-sip-tp-launcher.md b/docs/adr/ADR-0024-par-sip-tp-launcher.md new file mode 100644 index 0000000..b321e84 --- /dev/null +++ b/docs/adr/ADR-0024-par-sip-tp-launcher.md @@ -0,0 +1,206 @@ +# ADR-0024: SIP-level Launcher — rank = SIP + +## Status + +Accepted + +## Context + +### 목표 + +`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device) +경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이** +읽히는 bench 코드를 목표로 한다. 
+ +real PyTorch와 비교: + +| 차원 | real PyTorch | KernBench | +| --- | --- | --- | +| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP | +| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 | +| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 | +| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP | +| `mp.spawn` | OS 프로세스 fork | greenlet fan-out | + +### 풀어야 할 문제 + +1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록. +2. **Greenlet-local rank/device tracking** — 1-프로세스 모델 안에서 각 + worker greenlet이 자기 rank / 자기 SIP를 정확히 식별. +3. **Tensor placement = structural (sip, cube, pe)** — rank가 SIP이면 + 기본 텐서 배치도 구조적 좌표로 표현되어야 함. + +### Non-problem (이 ADR 밖) + +- IPCQ direction addressing → ADR-0025 +- `DPPolicy.sip`/`num_sips` 제거 → ADR-0026 +- Megatron-style TP → ADR-0027 +- DTensor → ADR-0028 (future) +- Worker scheduling / `mp.spawn` / collective drain / exception cleanup + → ADR-0027 D0/D1 +- Collective algorithm 구현 (intercube_allreduce, SFR config) → ADR-0032 + +## Decision + +### D1. rank = SIP (world_size 해석) + +```python +def _resolve_world_size(self) -> int: + if "world_size" in self._merged: + return int(self._merged["world_size"]) + defaults = self._cfg_all.get("defaults", {}) + if "world_size" in defaults: + return int(defaults["world_size"]) + spec = self.ctx.spec or {} + return int(spec.get("system", {}).get("sips", {}).get("count", 1)) +``` + +우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml` +override는 legacy "rank = PE" 테스트 경로로 유지. + +### D2. 
Greenlet-local rank registry (+ debug warning)
+
+```python
+import os        # needed by get_rank() for the KERNBENCH_DEBUG check
+import warnings  # needed by get_rank() for the debug warning
+
+class DistributedContext:
+    def __init__(self):
+        self._backend = None
+        self._rank_by_greenlet: dict = {}
+
+    def _bind_rank(self, g, rank: int) -> None:
+        self._rank_by_greenlet[g] = int(rank)
+
+    def get_rank(self) -> int:
+        self._ensure_initialized()
+        from greenlet import getcurrent
+        g = getcurrent()
+        if g not in self._rank_by_greenlet:
+            if os.environ.get("KERNBENCH_DEBUG"):
+                warnings.warn(
+                    "get_rank() called outside a bound greenlet; returning 0. "
+                    "Likely a bug unless running single-driver."
+                )
+            return 0
+        return int(self._rank_by_greenlet[g])
+```
+
+### D3. `torch.ahbm.set_device(rank)`: SIP binding
+
+The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
+`torch.cuda.set_device(r)`, but since we are not CUDA, we use an honestly named
+namespace.
+
+```python
+class _AhbmNamespace:
+    """torch.ahbm: per-greenlet SIP device binding.
+
+    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
+    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
+    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
+    """
+
+    def __init__(self):
+        self._device_by_greenlet: dict = {}
+
+    def set_device(self, device: int) -> None:
+        from greenlet import getcurrent
+        self._device_by_greenlet[getcurrent()] = int(device)
+
+    def current_device(self) -> int | None:
+        from greenlet import getcurrent
+        return self._device_by_greenlet.get(getcurrent())
+
+# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
+# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
+```
+
+**Parallel support for the PyTorch 2.x style**: Recent PyTorch is moving toward the device-agnostic
+`torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
+`torch.accelerator.current_device_index()`). For users who want to write code that is
+not tied to a device vendor, KernBench supports this surface in parallel.
+
+```python
+class _AcceleratorNamespace:
+    """torch.accelerator: device-agnostic API (PyTorch 2.x style).
+ + Aliases torch.ahbm for bench code that prefers device-neutral idiom: + torch.accelerator.set_device_index(rank) + torch.accelerator.current_device_index() + """ + + def __init__(self, ahbm: _AhbmNamespace): + self._ahbm = ahbm + + def set_device_index(self, device: int) -> None: + self._ahbm.set_device(device) + + def current_device_index(self) -> int | None: + return self._ahbm.current_device() + +# RuntimeContext +self.ahbm = _AhbmNamespace() +self.accelerator = _AcceleratorNamespace(self.ahbm) # alias +``` + +Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유: + +```python +torch.ahbm.set_device(rank) # KernBench-native, explicit backend +torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic +``` + +### D4. Tensor placement = structural (sip, cube, pe) 좌표 + +`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성. +세부는 ADR-0026. + +```python +# RuntimeContext._create_tensor +current_sip = self.ahbm.current_device() # (D3 naming) +if current_sip is None: + current_sip = 0 # single-driver fallback (D2와 일관) +placement = resolve_dp_policy( + dp, shape=shape_2d, itemsize=itemsize, + num_pe=eff_num_pe, num_cubes=eff_num_cubes, + target_sip=current_sip, +) +``` + +Post-hoc `pe_index` shifting 없음 — ShardSpec이 `(sip, cube, pe)` 구조적 +좌표를 직접 보유. ShardSpec 상세는 ADR-0026. + +--- + +## Dependencies + +- **ADR-0023** (IPCQ): backend `ahbm` namespace의 기원. +- **ADR-0026** (DPPolicy intra-device): D4의 `resolve_dp_policy` 시그니처와 + ShardSpec의 구조적 좌표 표현. +- **ADR-0027** (Megatron TP + scheduler): worker scheduling, `mp.spawn`, + collective drain, exception cleanup의 구현 기준. + +--- + +## Non-goals + +- **IPCQ protocol 수정**: ADR-0023 유지. +- **DPPolicy 필드 정리**: ADR-0026. +- **Megatron-style TP**: ADR-0027. +- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1. +- **Collective algorithm 구현**: ADR-0032. +- **Multi-node (프로세스 간)**: 단일 프로세스. + +--- + +## Consequences + +### Positive + +- **Bench = real PyTorch DDP** (공개 API 관점). 
+- **Greenlet-local rank**: 1-프로세스 모델에서 cross-rank correctness 가능. +- **Structural placement 좌표**: ADR-0026 / ADR-0027 / ADR-0032의 다른 ADR이 + `(sip, cube, pe)` 3튜플 위에서 일관되게 동작. + +### Neutral + +- IPCQ PE-level protocol (ADR-0023) 불변. +- IO_CPU 역할 불변 (기존 transit 그대로). diff --git a/docs/adr/ADR-0024-sip-tp-launcher.md b/docs/adr/ADR-0024-sip-tp-launcher.md deleted file mode 100644 index fcc97bb..0000000 --- a/docs/adr/ADR-0024-sip-tp-launcher.md +++ /dev/null @@ -1,868 +0,0 @@ -# ADR-0024: SIP-level TP Launcher — rank = SIP (host-driven dispatch) - -## Status - -Accepted. rank = SIP process-group model stands. The allreduce algorithm -path (mapper / validator / per-PE install machinery originally targeted at -ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls -`configure_sfr_intercube_multisip` at `init_process_group` time and the -intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w, -sip_topo_h)` appended after the module's `kernel_args()`. The -`leader_only` / `all_pes` mapper concepts in this document are no longer -used by the default allreduce path. - -## Context - -### 목표 - -`torch.distributed` collective 호출의 참여 단위(rank)를 **SIP**(device) -경계에 맞춘다. 실제 PyTorch DDP/TP 스크립트와 **호스트 레벨에서 구분 없이** -읽히는 bench 코드를 목표로 한다. 
- -real PyTorch와 비교: - -| 차원 | real PyTorch | KernBench (이 ADR 이후) | -|---|---|---| -| 프로세스 모델 | N개 프로세스, 각 1 GPU | 1 프로세스, N greenlet, 각 1 SIP | -| `get_rank()` | `RANK` env var | greenlet-local 레지스트리 | -| `get_world_size()` | `WORLD_SIZE` env var | topology의 SIP 수 | -| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP | -| `mp.spawn` | OS 프로세스 fork | greenlet fan-out | - -### 설계 원칙 — 공개 API의 추상화, 내부는 기존 path 활용 - -**공개 API (bench worker) 수준의 추상화**: -``` -rank = SIP -DPPolicy = intra-device (cube × PE) 분산만 -dist.all_reduce, torch.ahbm.set_device, mp.spawn 등 PyTorch-style 표면 -``` - -**Framework 내부 구현**: -``` -build_install_plans (host): topology + mapper + algorithm → SipInstallPlan - ↓ -backend (host): plan의 per-PE spec을 engine.submit으로 IpcqInitMsg 디스패치 - ↓ -engine: 기존 PE-scoped routing (MmuMapMsg 등과 동일 경로) - ↓ -PE_IPCQ: 자체 message loop에서 IpcqInitMsg 처리 (기존 capability) -``` - -**핵심**: 새 message 타입이나 IO_CPU 확장 없음. 기존 engine routing과 기존 -`IpcqInitMsg` 타입을 그대로 사용. 기존의 "sideband direct call" 우회만 -제거하여 convention 일원화. - -### 풀어야 할 문제 - -1. **공개 API에서 rank = SIP** — bench worker가 PE 개념을 알지 않도록. -2. **Multi-worker 실행** — N개 rank가 독립 worker 코드 실행. 1 프로세스 제약 - 하에서 greenlet + barrier 동기화. -3. **Cross-rank collective submit 동기화** — 첫 rank가 혼자 wait하면 peer 부재로 - SimPy deadlock. 모든 rank submit 후 drain 보장. -4. **기존 sideband install 제거** — IpcqInitMsg를 engine.submit으로 일원화. - MmuMapMsg 등 다른 control-plane 메시지와 동일 패턴. -5. **Algorithm / mapper / validator 분리** — 알고리즘 모듈은 kernel 코드만 - 담고, topology / mapping / validation은 registry + 선언. - -### Non-problem (이 ADR 밖) - -- IPCQ direction addressing fix → **ADR-0025** -- `DPPolicy.sip`/`num_sips` 제거 → **ADR-0026** -- Megatron-style TP → **ADR-0027** -- DTensor → **ADR-0028 (future)** -- **IO_CPU를 SIP-level control-plane 단일 endpoint로 승격**: 이 ADR에서는 - invariant으로 채택하지 않음. 현재 KernBench에 해당 원칙이 없고, 단독으로 - 도입하기엔 정당화가 약함. 미래에 control-plane latency 모델링 정밀도 요구가 - 생기면 별도 ADR. 
- -## Decision - -### D1. rank = SIP (world_size 해석) - -```python -def _resolve_world_size(self) -> int: - if "world_size" in self._merged: - return int(self._merged["world_size"]) - defaults = self._cfg_all.get("defaults", {}) - if "world_size" in defaults: - return int(defaults["world_size"]) - spec = self.ctx.spec or {} - return int(spec.get("system", {}).get("sips", {}).get("count", 1)) -``` - -우선순위: 알고리즘 override > defaults override > SIP count. `ccl.yaml` -override는 legacy "rank = PE" 테스트 경로로 유지. - -### D2. Install 경로 — engine.submit 일원화 - -`ccl/install.py`의 sideband direct call을 제거하고, `IpcqInitMsg`를 -`engine.submit`으로 보낸다. MmuMapMsg / MemoryWriteMsg 등이 이미 동일 패턴. - -```python -# Backend (AhbmCCLBackend.__init__ 또는 init_process_group 시점) -from kernbench.ccl.install_plan import build_install_plans - -plans = build_install_plans( - world_size=self._world_size, - algorithm=self._merged["algorithm"], - algorithm_config=self._merged, - spec=self.ctx.spec, -) -self._plans = plans - -# Each PE_IPCQ가 자기 neighbor table을 받도록 engine 경유 submit -handles = [] -for plan in plans: - for pe_install in plan.pe_installs: - h = self.ctx.submit(IpcqInitMsg( - correlation_id=self.ctx.correlation_id, - request_id=f"ipcq_init_s{plan.sip}c{pe_install.cube}p{pe_install.pe}", - target_sips=(plan.sip,), - target_cubes=(pe_install.cube,), - target_pe=pe_install.pe, - entries=pe_install.neighbors, - buffer_kind=plan.buffer_kind, - n_slots=plan.n_slots, - slot_size=plan.slot_size, - # ... (기존 IpcqInitMsg 필드) - )) - handles.append(h) - -# Eager install — init_process_group이 반환하기 전에 완료 보장 -for h in handles: - self.ctx.wait(h) -``` - -**PE_IPCQ 컴포넌트**는 이미 `IpcqInitMsg`를 main loop에서 처리 (`pe_ipcq.py` -라인 145-147). 변경 불필요. 유일한 차이는 "message가 sideband Python call이 -아니라 engine queue를 거쳐 도착한다"는 점. - -**Correctness invariant (equivalence)**: `init_process_group()`은 모든 -install handle을 `wait()`한 후 반환하므로 launch-before-install 문제는 -구조적으로 없다. 
남는 correctness 질문은 단 하나: - -> Engine-routed `IpcqInitMsg` 처리가 기존 sideband -> `pe_ipcq._install_neighbors(msg)` 호출과 **동일한 최종 PE_IPCQ 상태**를 -> 생성하는가. - -검증 포인트 (T3 참고): - -1. **State equivalence**: `_install_neighbors()` 내부 상태 전이가 engine - dispatch path에서도 동일하게 일어나 최종 PE_IPCQ state - (`_queue_pairs`, `_installed`, `_credit_inbox` 등)가 일치. - -2. **Sideband-only side effect 부재**: Sideband path에서만 있던 부수 효과가 - 없음 (예: engine.submit이 설정하는 request_id / correlation tracking 등이 - install semantics를 왜곡하지 않음). - -3. **Ordering independence**: 서로 다른 PE들의 install message가 engine - 큐에서 임의 순서로 처리되어도 최종 상태가 동일. 즉 install은 **PE별 - 독립 연산**이어야 하고, cross-PE 순서 의존성이 있으면 안 됨. - -4. **Idempotency**: 동일 PE에 대해 `IpcqInitMsg`가 두 번 도착하면? 현재 - 설계 전제는 "per-PE 단 한 번 install". 중복 install 시 동작은 정의되지 - 않음. 보수적 정책: - - 최초 install 시 `_installed = True`로 전이 - - 이후 중복 install msg는 **에러** (raise) 또는 **silent idempotent** - (no-op) 둘 중 하나로 명시 - - Recommend: **raise** (명시적 에러 → 버그 조기 검출). T3에 duplicate - install 케이스 추가. - -5. **Partial install visibility**: 일부 PE만 install 완료된 중간 상태가 - 외부에 observable한가? 현재 구조에서는 `init_process_group()`의 eager - wait-all이 barrier 역할을 하므로 partial state는 bench 코드에 노출되지 - 않음. 단, debugging / introspection API는 중간 상태를 볼 수 있음 (문제 - 아님, 문서화만). - -**Timing 영향**: Engine-routed install은 `init_process_group()`이 SimPy 시간을 -소비하게 만든다. 기존 sideband install은 사실상 zero-cost. ADR 계약: - -> Benchmarks must not rely on zero-cost initialization. -> `init_process_group()` consumes simulated time proportional to the number -> of participating PEs × per-PE install latency. First collective call -> starts at a well-defined but non-zero sim time. - -### D3. Launch 경로 — non-CCL 커널과 동일 primitive - -**CCL 커널은 non-CCL 커널과 동일한 `KernelLaunchMsg` submission path를 쓴다.** -Engine 내부의 IO_CPU/M_CPU transit 같은 것은 **기존 구현 세부이지 CCL-specific -장치가 아님**. Backend는 plan의 `participating_pes` 목록을 돌면서 `KernelLaunchMsg`를 -submit할 뿐이다. 새 메시지 타입 없음, 새 라우팅 경로 없음. 
- -```python -# AhbmCCLBackend.all_reduce -def all_reduce(self, tensor, op="sum"): - if op != "sum": - raise NotImplementedError(...) - if tensor._handle is None or not tensor._handle.shards: - raise RuntimeError(...) - - # Validator — global handle 기준 (D8) - validator_name = self._merged.get("validator") - if validator_name: - resolve_validator(validator_name)(tensor._handle, self._world_size, self.ctx.spec) - - rank = self.ctx.distributed.get_rank() - plan = self._plans[rank] - tensor_view = _tensor_slice_for_sip(tensor._handle, plan.sip) - - # Plan에서 kernel args 계산 (host-side) - import importlib - mod = importlib.import_module(plan.kernel_module) - n_elem = tensor_view.shards[0].nbytes // tensor.itemsize - kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size, - **plan.kernel_config) - - def _submit(): - out = [] - for (cube, pe) in plan.participating_pes: - h = self.ctx.submit(KernelLaunchMsg( - correlation_id=self.ctx.correlation_id, - request_id=f"allreduce_r{rank}_c{cube}p{pe}", - kernel_ref=KernelRef(name=plan.algorithm_name, kind="builtin"), - args=(_tensor_arg_for_pe(tensor_view, cube, pe), *kargs), - target_sips=(plan.sip,), - target_cubes=(cube,), - target_pe=pe, - )) - out.append(h) - return out - - self._barrier.submit_and_drain(self.ctx, rank, _submit) -``` - -### D4. Algorithm ABI — 얇게 + 명시적 arg 계약 - -각 알고리즘 모듈은 **kernel + kernel_args만 필수**. - -```python -# src/kernbench/ccl/algorithms/ring_allreduce.py -def kernel(t_ptr, n_elem, world_size, tl): - """PE-side kernel code. - - Signature convention: first positional arg is the tensor pointer - (per-PE slice), subsequent positional args are whatever - kernel_args() returns. `tl` is injected by the TLContext runtime. - """ - -def kernel_args(*, n_elem: int, world_size: int, **kw) -> tuple: - """Return the tuple of non-tensor positional args. - - Signature contract: - - Called keyword-only with n_elem and world_size plus kernel_config. 
- - Returns a tuple (possibly empty) of scalar / metadata args. - - The backend constructs the final KernelLaunchMsg.args as: - (per_pe_tensor_arg, *kernel_args(...)) - where per_pe_tensor_arg is a TensorArg containing only the shards - local to the receiving PE (derived from tensor_view). - """ - return (n_elem, world_size) -``` - -**Arg assembly in backend (reference)**: - -```python -# AhbmCCLBackend.all_reduce (D3에서 발췌) -kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size, - **plan.kernel_config) -for (cube, pe) in plan.participating_pes: - pe_tensor_arg = _tensor_arg_for_pe(tensor_view, cube, pe) - self.ctx.submit(KernelLaunchMsg( - args=(pe_tensor_arg, *kargs), # tensor first, then kernel_args return - target_sips=(plan.sip,), - target_cubes=(cube,), - target_pe=pe, - ... - )) -``` - -**ccl.yaml**에서 선언적 metadata: - -```yaml -algorithms: - ring_allreduce_tcm: - module: kernbench.ccl.algorithms.ring_allreduce - topology: ring_1d # kernbench/ccl/topologies.py - mapper: leader_only # kernbench/ccl/mappers.py (신규) - validator: single_shard_per_rank # kernbench/ccl/validators.py (신규) - buffer_kind: tcm - n_elem: 8 -``` - -- `topology` (필수) -- `mapper` (선택, default `"leader_only"`) -- `validator` (선택) - -알고리즘 모듈 자체에는 mapper/validator/participating_pes/neighbor -생성기가 **들어가지 않음**. - -### D5. Mapper + validator — registry key **또는** import path - -Host-side framework가 built-in registry 제공. 커스텀 확장은 dot-import path. - -```python -# src/kernbench/ccl/mappers.py (new) -Mapper = Callable[[dict, int], list[tuple[int, int]]] - -def leader_only(spec, rank): - """Single leader PE per SIP. Ring/tree/mesh용.""" - return [(0, 0)] - -def all_pes(spec, rank): - """Every PE in the SIP. 알고리즘이 intra-SIP 전체 PE를 참여시킬 때 사용 - (e.g. 
intra-SIP reduction, intra-SIP broadcast, hierarchical collective - 의 낮은 레벨 등).""" - cm = spec["sip"]["cube_mesh"] - pl = spec["cube"]["pe_layout"] - n_cubes = cm["w"] * cm["h"] - n_pes = pl["pe_per_corner"] * len(pl["corners"]) - return [(c, p) for c in range(n_cubes) for p in range(n_pes)] - -MAPPER_REGISTRY = {"leader_only": leader_only, "all_pes": all_pes} - -def resolve_mapper(key_or_path: str) -> Mapper: - if key_or_path in MAPPER_REGISTRY: - return MAPPER_REGISTRY[key_or_path] - if "." in key_or_path: - import importlib - mod_path, fn_name = key_or_path.rsplit(".", 1) - return getattr(importlib.import_module(mod_path), fn_name) - raise ValueError(f"unknown mapper: {key_or_path!r}") -``` - -Validator도 동일 패턴 (`src/kernbench/ccl/validators.py`). 입력은 **global -TensorHandle** (D8 참고). - -### D6. Host-side install plan builder - -```python -# src/kernbench/ccl/install_plan.py (new; 기존 install.py의 재구성) -from dataclasses import dataclass -from typing import Any, Mapping - -@dataclass(frozen=True) -class NeighborTableEntry: - direction: str - peer_direction: str # ADR-0025 - peer_sip: int - peer_cube: int - peer_pe: int - rx_base_pa: int - # ... 기타 IPCQ 설정 ... - -@dataclass(frozen=True) -class PeInstallSpec: - cube: int - pe: int - neighbors: tuple[NeighborTableEntry, ...] - -@dataclass(frozen=True) -class SipInstallPlan: - algorithm_name: str # human-readable ("ring_allreduce_tcm") - sip: int - rank: int - world_size: int - pe_installs: tuple[PeInstallSpec, ...] # per-PE neighbor tables - buffer_kind: str - n_slots: int - slot_size: int - kernel_module: str - participating_pes: tuple[tuple[int, int], ...] 
- kernel_config: Mapping[str, Any] - - -def build_install_plans( - world_size: int, - algorithm: str, - algorithm_config: dict, - spec: dict, -) -> list[SipInstallPlan]: - """Compose topology + mapper + algorithm into per-SIP plan list.""" - topo_fn = _resolve_topology(algorithm_config["topology"]) - mapper = resolve_mapper(algorithm_config.get("mapper", "leader_only")) - - # kernel_config: launch 시 kernel_args에 전달할 algorithm-specific params - kernel_config = { - k: v for k, v in algorithm_config.items() - if k in {"n_elem", "reduce_op", "chunk_size"} or k.startswith("kernel_") - } - - plans = [] - for rank in range(world_size): - sip = rank # identity mapping (non-identity는 open question) - pes = mapper(spec, rank) - pe_installs = _build_pe_installs( - rank=rank, world_size=world_size, sip=sip, - pes=pes, topo_fn=topo_fn, algorithm_config=algorithm_config, spec=spec, - ) - plans.append(SipInstallPlan( - algorithm_name=algorithm, - sip=sip, rank=rank, world_size=world_size, - pe_installs=pe_installs, - buffer_kind=algorithm_config["buffer_kind"], - n_slots=algorithm_config["n_slots"], - slot_size=algorithm_config["slot_size"], - kernel_module=algorithm_config["module"], - participating_pes=tuple(pes), - kernel_config=kernel_config, - )) - return plans -``` - -`_build_pe_installs`는 기존 `ccl/install.py`의 neighbor 계산 로직을 재활용 -(ADR-0025의 `reverse_direction` 개선 반영). - -**Multi-PE 매퍼와 neighbor 생성 책임**: mapper가 SIP 내 여러 PE를 반환하는 -경우 (`all_pes` 등), PE-level neighbor 그래프는 `_build_pe_installs` 내부에 -형성된다. 즉 topology 모듈은 rank-level 관계만 제공하고, PE-level 연결은 -builder에서 풀어낸다. 복잡한 multi-level 패턴을 쓰는 알고리즘은 이 책임 -분산이 관리 부담이 될 수 있음 — 관련 논의는 ADR-0029 참고. - -### D7. Epoch-based collective barrier - -Cross-rank submit 동기화. 각 collective 호출은 독립 epoch. 같은 rank의 -중복 join은 즉시 에러. 
- -```python -# src/kernbench/runtime_api/distributed.py -@dataclass -class _EpochState: - participants: set[int] = field(default_factory=set) - pending: list = field(default_factory=list) - drained: bool = False - returned: int = 0 - - -class _CollectiveBarrier: - """Epoch-based barrier. - - Contract: - - Each call joins the earliest non-drained epoch. - - Each rank may join a given epoch at most once. Duplicate join raises. - - Last arriver (participants == world_size) performs drain and advances - _next_epoch. Earlier arrivers yield and re-check drained on resume. - - Epoch state is GC'd when returned == world_size (success path). - - On failure paths, residual state is acceptable; reset() clears it. - """ - - def __init__(self, world_size: int): - self._world_size = world_size - self._next_epoch = 0 - self._state: dict[int, _EpochState] = {} - - def submit_and_drain(self, ctx, rank: int, submit_fn) -> None: - epoch = self._next_epoch - state = self._state.setdefault(epoch, _EpochState()) - - if rank in state.participants: - raise RuntimeError( - f"rank {rank} attempted duplicate join to epoch {epoch}" - ) - state.participants.add(rank) - - handles = submit_fn() - state.pending.extend(handles) - - is_last = len(state.participants) >= self._world_size - - if is_last: - for h in state.pending: - ctx.wait(h) - state.drained = True - self._next_epoch = epoch + 1 - else: - from greenlet import getcurrent - g = getcurrent() - if g.parent is None: - raise RuntimeError("barrier requires a bound worker greenlet") - while not state.drained: - g.parent.switch() - - state.returned += 1 - if state.returned >= self._world_size: - self._state.pop(epoch, None) - - def reset(self) -> None: - """Explicit cleanup on spawn exception unwinding.""" - self._state.clear() - self._next_epoch = 0 -``` - -### D8. 
Per-rank tensor view + validator contract - -**Validator** (host-side, pre-slice, global handle 기준): - -```python -# src/kernbench/ccl/validators.py -Validator = Callable[[TensorHandle, int, dict], None] - -def single_shard_per_rank(handle, world_size, spec): - """Ring 계열: 정확히 world_size개 shard, SIP당 1개.""" - if len(handle.shards) != world_size: - raise ValueError(...) - per_sip = {} - for s in handle.shards: - per_sip[s.sip] = per_sip.get(s.sip, 0) + 1 - if any(c != 1 for c in per_sip.values()): - raise ValueError(...) - -def multi_pe_sip_local(handle, world_size, spec): - """Multi-PE per SIP layout: 각 SIP에 intra-SIP PE 수만큼 shard 존재. - Intra-SIP 전체 PE를 참여시키는 알고리즘이 사용.""" - cm = spec["sip"]["cube_mesh"] - pl = spec["cube"]["pe_layout"] - per_sip = cm["w"] * cm["h"] * pl["pe_per_corner"] * len(pl["corners"]) - if len(handle.shards) != world_size * per_sip: - raise ValueError(...) - -VALIDATOR_REGISTRY = {...} -def resolve_validator(key_or_path): ... -``` - -Validator는 world 전체의 shard layout 불변량을 본다. Per-rank view는 -backend가 validator 호출 **후** `_tensor_slice_for_sip`로 생성. - -**Per-rank tensor view** — SIP-local slice: - -```python -def _tensor_slice_for_sip(handle, sip) -> TensorArg: - sip_shards = [s for s in handle.shards if s.sip == sip] - if not sip_shards: - raise RuntimeError(f"tensor has no shards on SIP {sip}") - # Deterministic ordering contract: (cube, pe, offset_bytes) ascending. - # Multi-PE mappers (hierarchical 등) rely on this ordering to align - # per-PE tensor arg construction with participating_pes enumeration. - sip_shards.sort(key=lambda s: (s.cube, s.pe, s.offset_bytes)) - min_offset = min(s.offset_bytes for s in sip_shards) - local_va_base = handle.va_base + min_offset if handle.va_base else 0 - return TensorArg( - shards=tuple(TensorArgShard(...) for s in sip_shards), - va_base=local_va_base, - ) -``` - -**Ordering invariant**: slice의 shard는 `(cube, pe, offset_bytes)` 오름차순. 
-Backend가 `participating_pes`를 iterate하며 `_tensor_arg_for_pe(view, cube, pe)`를 -구성할 때, 결정론적 ordering을 전제할 수 있다. 특히 `all_pes` mapper + -hierarchical 알고리즘이 per-PE slice 조합을 순서 의존적으로 해석하는 경우에 -중요. - -### D9. Greenlet-local rank registry (+ debug warning) - -```python -class DistributedContext: - def __init__(self): - self._backend = None - self._rank_by_greenlet: dict = {} - - def _bind_rank(self, g, rank: int) -> None: - self._rank_by_greenlet[g] = int(rank) - - def get_rank(self) -> int: - self._ensure_initialized() - from greenlet import getcurrent - g = getcurrent() - if g not in self._rank_by_greenlet: - if os.environ.get("KERNBENCH_DEBUG"): - warnings.warn( - "get_rank() called outside a bound greenlet — returning 0. " - "Likely a bug unless running single-driver." - ) - return 0 - return int(self._rank_by_greenlet[g]) -``` - -### D10. `torch.ahbm.set_device(rank)` — SIP 바인딩 - -KernBench 백엔드 이름은 `ahbm` (ADR-0023 D10). Real PyTorch는 -`torch.cuda.set_device(r)`이지만 우리는 CUDA가 아니므로 honestly-named -namespace를 사용한다. - -```python -class _AhbmNamespace: - """torch.ahbm — per-greenlet SIP device binding. - - Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since - KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent - API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime. - """ - - def __init__(self): - self._device_by_greenlet: dict = {} - - def set_device(self, device: int) -> None: - from greenlet import getcurrent - self._device_by_greenlet[getcurrent()] = int(device) - - def current_device(self) -> int | None: - from greenlet import getcurrent - return self._device_by_greenlet.get(getcurrent()) - -# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`. -# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`. -``` - -**PyTorch 2.x style 병행 지원**: 최신 PyTorch는 device-agnostic한 -`torch.accelerator` 네임스페이스를 지향 (`torch.accelerator.set_device_index(r)`, -`torch.accelerator.current_device_index()`). 
Device vendor에 종속되지 않는 -코드를 쓰려는 사용자를 위해 KernBench도 이 표면을 병행 지원한다. - -```python -class _AcceleratorNamespace: - """torch.accelerator — device-agnostic API (PyTorch 2.x style). - - Aliases torch.ahbm for bench code that prefers device-neutral idiom: - torch.accelerator.set_device_index(rank) - torch.accelerator.current_device_index() - """ - - def __init__(self, ahbm: _AhbmNamespace): - self._ahbm = ahbm - - def set_device_index(self, device: int) -> None: - self._ahbm.set_device(device) - - def current_device_index(self) -> int | None: - return self._ahbm.current_device() - -# RuntimeContext -self.ahbm = _AhbmNamespace() -self.accelerator = _AcceleratorNamespace(self.ahbm) # alias -``` - -Bench 작성자는 다음 중 하나를 선택 — 둘 다 내부적으로 같은 레지스트리를 보유: - -```python -torch.ahbm.set_device(rank) # KernBench-native, explicit backend -torch.accelerator.set_device_index(rank) # PyTorch 2.x device-agnostic -``` - -### D11. Tensor placement = structural (sip, cube, pe) 좌표 - -`resolve_dp_policy`가 `target_sip`을 직접 받아 구조적 좌표로 placement 생성. -세부는 ADR-0026. - -```python -# RuntimeContext._create_tensor -current_sip = self.ahbm.current_device() # (D10 naming) -if current_sip is None: - current_sip = 0 # single-driver fallback (D9와 일관) -placement = resolve_dp_policy( - dp, shape=shape_2d, itemsize=itemsize, - num_pe=eff_num_pe, num_cubes=eff_num_cubes, - target_sip=current_sip, -) -``` - -Post-hoc `pe_index` shifting 제거 — ShardSpec이 `(sip, cube, pe)` 구조적 -좌표 보유. - -### D12. `torch.multiprocessing.spawn`-compat surface - -Bench 작성자 표면은 real PyTorch `mp.spawn`과 동일: - -```python -# src/kernbench/runtime_api/multiprocessing.py (new) -def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"): - """Drop-in for torch.multiprocessing.spawn. - Internal: greenlet fan-out + epoch-barrier sync + exception propagation. - """ - ... 
- -# torch namespace에 부착 -torch.multiprocessing = SimpleNamespace(spawn=spawn) -``` - -Bench: - -```python -import torch.multiprocessing as mp -mp.spawn(worker, nprocs=world_size, args=(world_size, torch)) -``` - -### D13. Scheduler + exception handling - -```python -def spawn(fn, args, nprocs, ...): - dist = torch.distributed - gs: list[greenlet] = [] - errors: dict[int, Exception] = {} - - for rank in range(nprocs): - def _entry(r=rank): - try: - fn(r, *args) - except Exception as e: - errors[r] = e - raise - g = greenlet(_entry) - dist._bind_rank(g, rank) - gs.append(g) - - try: - while True: - alive = [g for g in gs if not g.dead] - if not alive: - break - for g in alive: - if not g.dead: - g.switch() - except Exception as outer: - for other in gs: - if not other.dead: - try: - other.throw(SystemExit) - except Exception: - pass - # Epoch barrier state 명시적 cleanup - backend = getattr(dist, "_backend", None) - if backend is not None and hasattr(backend, "_barrier"): - backend._barrier.reset() - raise SpawnException(errors) from outer -``` - -**Scheduler contract**: -- Deterministic round-robin over insertion order (rank 0, 1, ..., N-1). -- 동기화 지점은 epoch barrier (D7)만. Scheduler 순서에 의존하는 correctness 없음. -- 예외 발생 시 다른 greenlet 강제 종료 + `SpawnException` 전파. - -**Starvation guideline**: -- 일반적으로 collective barrier가 workers를 동기화. 큰 편차 없음. -- 극단적 non-collective 루프 대비 cooperative yield 제공: - `torch.distributed.cooperative_yield()`. - -### D14. Backward compatibility - -1. **Single-driver 호출**: `get_rank()` 0 반환 (D9). -2. **`ccl.yaml` world_size override**: D1 fallback 우회 — legacy "rank = PE" - 테스트 경로로 사용 가능. -3. **`DPPolicy.sip="column_wise"` 명시**: ADR-0026 scope. -4. **`install_ipcq()` compatibility wrapper**: - -기존 `ccl/install.py`의 `install_ipcq()` API는 곧바로 제거하지 않는다. -Thin compatibility wrapper로 남겨 기존 직접 호출자가 점진적으로 migration할 -수 있게 한다. 
- -```python -# src/kernbench/ccl/install.py (after this ADR) -def install_ipcq(engine, spec, merged, *, algo_module=None, rank_to_pe=None): - """DEPRECATED: legacy host-side PE installer. - - Internally delegates to build_install_plans + engine-routed IpcqInitMsg. - Use dist.init_process_group() instead. - """ - from kernbench.ccl.install_plan import build_install_plans - import warnings - warnings.warn( - "install_ipcq() is deprecated; use dist.init_process_group()", - DeprecationWarning, stacklevel=2, - ) - plans = build_install_plans( - world_size=merged.get("world_size", 1), - algorithm=merged["algorithm"], - algorithm_config=merged, - spec=spec, - ) - handles = [] - for plan in plans: - for pe_install in plan.pe_installs: - h = engine.submit(IpcqInitMsg( - target_sips=(plan.sip,), - target_cubes=(pe_install.cube,), - target_pe=pe_install.pe, - entries=pe_install.neighbors, - buffer_kind=plan.buffer_kind, - n_slots=plan.n_slots, - slot_size=plan.slot_size, - )) - handles.append(h) - for h in handles: - engine.wait(h) - return {"world_size": merged.get("world_size", 1), "plans": plans} -``` - -Migration 스케줄: -- Phase 1: wrapper로 유지 + DeprecationWarning -- Phase 2: 직접 호출자 grep-audit → 각각 `dist.init_process_group()` 또는 - `build_install_plans()` 직접 사용으로 이관 -- Phase 3: wrapper 제거 (별도 cleanup ADR 또는 PR) - ---- - -## Dependencies - -- **ADR-0023** (IPCQ): `IpcqInitMsg` 메시지 타입과 PE_IPCQ 핸들링을 그대로 - 활용. Engine-routed submit으로 전환하는 것이 유일한 변경. -- **ADR-0025** (IPCQ direction fix): `_build_pe_installs`의 neighbor 계산이 - 2-rank ring 등에서 정확히 동작하려면 필요. -- **ADR-0003 / 0016** (IO_CPU): IO_CPU는 기존 transit 역할 그대로. 본 ADR에서 - IO_CPU 역할 변경 없음. - ---- - -## Non-goals - -- **IPCQ protocol 수정**: ADR-0023 유지. -- **DPPolicy 필드 정리**: ADR-0026. -- **Megatron-style TP**: ADR-0027. -- **Multi-node (프로세스 간)**: 단일 프로세스. -- **IO_CPU SIP control-plane 단일 endpoint 원칙 채택**: 본 ADR 범위 밖. 현재 - KernBench에 이 원칙이 없고, 도입은 별도 ADR. -- **Hierarchical all-reduce 알고리즘 설계**: ADR-0029. 
본 ADR은 그 알고리즘이 - 쓸 framework 인프라 (`all_pes` mapper, `multi_pe_sip_local` validator, - registry 확장점)만 제공. - ---- - -## Open questions - -### 🟡 Nice-to-have — scope 경계 관련 - -- **Install timing 허용치**: SimPy 시간 상 install이 몇 ns~us 소모. 기존 - sideband는 0ns. 기존 테스트가 t=0 시작을 전제로 하는지 확인 (audit 결과에 - 따라 테스트 교정 필요). - -- **`IpcqInitMsg` 배치 가능성**: MmuMapMsg처럼 `target_pe="all"` 브로드캐스트 - 는 IPCQ에서는 부적합 (PE마다 neighbor가 다름). 현재는 per-PE 개별 submit. - Per-PE payload를 담는 batched IpcqInitMsg 타입은 future optimization. - -- **`_rank_to_sip` 매핑**: 현재 identity. Non-trivial mapping 요구 시 별도. - -- **Cooperative yield API 위치**: `torch.distributed.cooperative_yield()`로 - 노출 예정. 실제 필요성은 Phase 2 이후 벤치 추가 시 판단. - -(PE-level topology 일원화 관련 중장기 방향은 **ADR-0029** 참고 — 복잡한 -multi-level 알고리즘이 driving force가 되는 framework 진화 방향.) - ---- - -## Consequences - -### Positive - -- **새 message 타입 0개**: 기존 `IpcqInitMsg` + `KernelLaunchMsg`만으로 구현. -- **IO_CPU / engine 변경 없음**: 기존 routing 그대로. -- **Sideband install convention 제거**: MmuMapMsg 등과 동일 패턴으로 일원화. -- **Plan state stale 문제 소멸**: Plan은 host 단일 소유. -- **Bench = real PyTorch DDP** (공개 API 관점). -- **Algorithm ABI 경량**: `kernel` + `kernel_args`만 필수. -- **Epoch-based barrier**: interleaved collective 안전. -- **Control/data plane 분리**: data plane(PE_IPCQ)은 ADR-0023 유지, control - plane은 host-driven. -- 장기 확장성: Megatron TP, DTensor 기반. - -### Negative - -- 신규 모듈: `install_plan.py`, `mappers.py`, `validators.py`, - `multiprocessing.py`. -- Engine이 `IpcqInitMsg`를 엔진-path로 라우팅할 수 있는지 구현 시 확인 필요 - (minor hook 가능성). -- Install이 SimPy 시간을 소모 (positive로도 볼 수 있으나, 기존 sideband 시점 - 0ns 전제인 테스트가 있으면 교정 필요). - -### Neutral - -- IPCQ PE-level protocol (ADR-0023) 불변. -- `DPPolicy` 필드 변경은 ADR-0026. -- IO_CPU 역할 불변 (기존 transit 그대로). 
diff --git a/docs/adr/ADR-0025-ipcq-direction-addressing.md b/docs/adr/ADR-0025-algo-ipcq-direction-addressing.md similarity index 100% rename from docs/adr/ADR-0025-ipcq-direction-addressing.md rename to docs/adr/ADR-0025-algo-ipcq-direction-addressing.md diff --git a/docs/adr/ADR-0026-dppolicy-intra-device.md b/docs/adr/ADR-0026-par-dppolicy-intra-device.md similarity index 96% rename from docs/adr/ADR-0026-dppolicy-intra-device.md rename to docs/adr/ADR-0026-par-dppolicy-intra-device.md index f11beef..d043f59 100644 --- a/docs/adr/ADR-0026-dppolicy-intra-device.md +++ b/docs/adr/ADR-0026-par-dppolicy-intra-device.md @@ -23,7 +23,7 @@ class DPPolicy: """Intra-device (cube × PE) data-parallel policy. SIP-level placement is controlled by ``torch.ahbm.set_device(rank)`` - (ADR-0024 D10) and, for model-level TP, by Megatron-style parallel + (ADR-0024 D3) and, for model-level TP, by Megatron-style parallel layers (ADR-0027). DPPolicy does not cross SIP boundaries. """ cube: Literal["replicate", "column_wise", "row_wise"] = "replicate" @@ -37,7 +37,7 @@ class DPPolicy: ### D2. `ShardSpec` — structural (sip, cube, pe) 좌표, `pe_index` 완전 제거 현재 `ShardSpec.pe_index`는 **global flat index** (`sip × cubes × pes + cube × -pes + pe`). 이는 ADR-0024 D11이 "abstraction leakage"로 지적한 형태. +pes + pe`). 이는 ADR-0024 D4이 "abstraction leakage"로 지적한 형태. 본 ADR에서 ShardSpec을 **structural 좌표로 재정의**하고, `pe_index`는 property로도 **남기지 않는다**: @@ -73,7 +73,7 @@ class ShardSpec: ### D3. `resolve_dp_policy`가 `target_sip`을 받아 structural 좌표 생성 -ADR-0024 D11의 계약 구현. Post-hoc shifting 없음. +ADR-0024 D4의 계약 구현. Post-hoc shifting 없음. ```python # src/kernbench/policy/placement/dp.py (after) @@ -135,14 +135,14 @@ def resolve_dp_policy( ### D4. `_create_tensor` — 구조적 좌표로 직접 placement -ADR-0024 D11 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy` +ADR-0024 D4 연속선. Post-hoc shifting 제거, 구조적 좌표를 `resolve_dp_policy` 호출 시점에 직접 지정. 
```python # context.py _create_tensor (after) current_sip = self.ahbm.current_device() if current_sip is None: - # Single-driver fallback (ADR-0024 D9와 일관). + # Single-driver fallback (ADR-0024 D2와 일관). # Launcher 기반 코드가 set_device()를 빼먹으면 조용히 SIP 0에 박히는 # 문제가 있음 → debug mode에서 경고. if os.environ.get("KERNBENCH_DEBUG"): @@ -267,7 +267,7 @@ KernBench는 사내 프로젝트로 call site가 한정되어 있어 한 번에 - **개념 분리 명확**: DPPolicy = intra-device, TP = inter-device. - **API 단순화**: DPPolicy 생성자 필드 ~33% 축소. - **Structural 좌표 일관성**: ShardSpec이 `(sip, cube, pe)` 튜플로 표현 → - abstraction leakage 해소 (ADR-0024 D11 계약 충족). + abstraction leakage 해소 (ADR-0024 D4 계약 충족). - **`pe_index` 의미 명확**: SIP-local이 단일 해석. Global flat이 필요하면 명시. - **Launcher 모델 일관성**: ADR-0024의 "1 worker per SIP" 모델이 유일한 SIP 경계 제어 메커니즘. diff --git a/docs/adr/ADR-0027-megatron-tp.md b/docs/adr/ADR-0027-par-megatron-tp.md similarity index 94% rename from docs/adr/ADR-0027-megatron-tp.md rename to docs/adr/ADR-0027-par-megatron-tp.md index bdacbcf..7b04254 100644 --- a/docs/adr/ADR-0027-megatron-tp.md +++ b/docs/adr/ADR-0027-par-megatron-tp.md @@ -2,9 +2,7 @@ ## Status -Accepted (Revision 7 — resume invariant / main-context wait 비재귀 invariant / -global barrier over-serialization tradeoff / TP forward yield-safety 명시, -2026-04-14) +Accepted ## Context @@ -166,9 +164,9 @@ while alive: - 구현이 이를 **감지**할 필요는 없다 (타임아웃/steps-since-yield 카운터 등). 이는 user contract이며 위반 시 증상은 "simulation hang"이다. - **Future extension**: non-collective 긴 계산 경로가 자주 나오면 - ADR-0024 D13의 `torch.distributed.cooperative_yield()` primitive (명시적 - no-op yield)를 도입할 수 있다. 현 ADR 범위 밖. Breaking change 아님 — - 필요 시 추가하면 됨. + 명시적 `torch.distributed.cooperative_yield()` primitive (no-op yield)를 + 도입할 수 있다. 현 ADR 범위 밖. Breaking change 아님 — 필요 시 추가하면 + 됨. - Round 내에서는 alive worker 전체가 한 번씩 `switch`를 받는다. 단일 round 안에서 한 worker가 여러 번 wait를 호출해도 그 turn 안에서 순차적으로 enqueue된 뒤 scheduler drain 한 번에 일괄 처리 (FIFO). 
@@ -183,7 +181,7 @@ while alive: - **두 큐는 서로 다른 dependency source**: worker wait은 worker가 직접 `submit + wait` 쌍으로 만들어낸 handle (tensor deploy, MmuMap 등). collective 큐는 `dist.all_reduce`가 내부적으로 enqueue한 kernel launch handle이며 - worker는 이걸 직접 wait하지 않는다 (ADR-0024 D7). + worker는 이걸 직접 wait하지 않는다 (D0.5의 두 큐 drain 모델 참조). - **Correctness 관점 독립**: collective는 worker 관점에선 "이미 submit된 후 yield한" 상태. 그 완료 타이밍은 worker의 다음 action 시점 이전이기만 하면 됨. worker wait 큐와의 순서 dependency 없음. @@ -206,7 +204,7 @@ while alive: index로 두거나 append 전 `h not in pending_set` 검사) 가능. correctness 를 바꾸지 않는 최적화로 분류. -4. **Exception propagation + sibling cleanup (ADR-0024 D13 방식 채택)**. +4. **Exception propagation + sibling cleanup**. worker greenlet이 raise하면 `g.switch()`가 main으로 예외를 전달한다. scheduler loop은 즉시 중단되고 다음 cleanup을 **명시적으로** 수행: @@ -581,7 +579,7 @@ TP layer의 weight/output 표현에서 두 개념을 명확히 분리한다: | 개념 | 결정 주체 | 범위 | |---|---|---| -| **TP shard ownership** (어느 rank가 weight의 어떤 slice를 소유하는가) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D9/D10) | **cross-rank, cross-SIP** | +| **TP shard ownership** (어느 rank가 weight의 어떤 slice를 소유하는가) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** | | **Intra-rank placement** (소유된 slice를 rank 내부에서 cube × PE로 어떻게 분산하는가) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **한 rank 내부 (SIP 경계 안)** | 따라서 `ColumnParallelLinear`가 `(in_features, out_features // ws)` shape로 @@ -825,40 +823,11 @@ strict-xfail 케이스를 본 ADR 구현 이후 **PASS**로 전환하는 것을 ## Dependencies -- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `dist.all_reduce`, - `torch.ahbm.set_device(rank)`. 본 ADR의 D0/D1이 이 인프라를 확장. +- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, + `torch.ahbm.set_device(rank)`. - **ADR-0026** (DPPolicy intra-device): weight tensor의 per-rank slice 표현. - **ADR-0023 / ADR-0025** (IPCQ): `dist.all_reduce` 구현의 기반. 
-### Supersedes (partial) - -ADR-0024의 다음 섹션은 **미구현 상태의 설계**이며, 본 ADR이 더 단순한 모델로 -대체한다: - -- **ADR-0024 D7 (`_CollectiveBarrier.submit_and_drain`)** — epoch 기반 last- - arriver-drains 패턴. 문제: last arriver가 **worker 컨텍스트에서** `ctx.wait`을 - 호출해 env.run을 drive → D0.2가 막으려는 orphan 원인을 재현한다. 본 ADR의 - **D0.4 two-queue drain** (worker가 모두 yield한 뒤 main이 drain)이 동일한 - "모든 rank가 submit 완료 전까지 어떤 rank의 collective도 진행되지 않음" - invariant를 **worker-safe하게** 제공한다. `_CollectiveBarrier` 클래스는 - 구현하지 않는다. -- **ADR-0024 D12/D13 (`spawn_workers` skeleton)** — signature / scheduler - loop / exception handling 설계. 본 ADR의 **D1**이 real-PyTorch API와 일치하는 - signature (`spawn(fn, args, nprocs)`)로 재정의하며, D0 scheduler drain을 단일 - 위치에서 수행한다. ADR-0024 D13의 exception cleanup (siblings - `throw(SystemExit)` + `SpawnException` 래핑)은 본 ADR에 그대로 흡수 - (D0.4-(4) 참조). - -현 구현은 ADR-0024의 D7/D12/D13 어느 것도 landing하지 않았으므로 supersede에 -따른 마이그레이션 비용은 없음. 향후 `docs/adr/ADR-0024`에 "superseded by -ADR-0027 D0/D1" 주석만 추가하면 정합. - -**Source of truth (normative, 구현자 대상)**: worker scheduling / collective -drain / spawn / exception cleanup의 구현 기준은 **ADR-0027 D0/D1이다**. 구현 -시 ADR-0024 D7/D12/D13의 pseudocode / contract / signature를 참고하지 말 것 — -두 ADR이 다른 결론을 낼 때는 항상 ADR-0027이 우선한다. 리뷰어도 이 원칙으로 -PR을 심사. - --- ## Non-goals diff --git a/docs/adr/ADR-0032-intercube-allreduce.md b/docs/adr/ADR-0032-algo-intercube-allreduce.md similarity index 99% rename from docs/adr/ADR-0032-intercube-allreduce.md rename to docs/adr/ADR-0032-algo-intercube-allreduce.md index c3df130..bb6ba3c 100644 --- a/docs/adr/ADR-0032-intercube-allreduce.md +++ b/docs/adr/ADR-0032-algo-intercube-allreduce.md @@ -146,7 +146,7 @@ At each `dist.all_reduce(tensor)` call: 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where `sip_rank` is the current greenlet's bound rank. 4. Launches with `_defer_wait=True`; the main scheduler drains pending - handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4). 
+ handles after all workers submit (per ADR-0027 D0.4). ### D6. Config schema diff --git a/docs/adr/ADR-0033-latency-model-assumptions.md b/docs/adr/ADR-0033-lat-latency-model-assumptions.md similarity index 83% rename from docs/adr/ADR-0033-latency-model-assumptions.md rename to docs/adr/ADR-0033-lat-latency-model-assumptions.md index 5d4024f..13ca1f9 100644 --- a/docs/adr/ADR-0033-latency-model-assumptions.md +++ b/docs/adr/ADR-0033-lat-latency-model-assumptions.md @@ -10,7 +10,7 @@ The simulator is an analytical, event-driven performance model — not a cycle-accurate or RTL-level simulator. Many real-HW effects are approximated or omitted by design. To keep the model auditable and reviewable as a whole, this ADR consolidates the assumptions in one place. Individual component ADRs -(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines +(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines the *limits of fidelity*. ## Decisions @@ -21,7 +21,7 @@ the *limits of fidelity*. ADR-0015 D2. - **Per-component switching/overhead latency** (`overhead_ns` attr). - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array - with global round-robin chunking. Burst granularity tunable + with address-based PC selection (ADR-0034 D3). Burst granularity tunable (`burst_bytes`, default 256B). Read and write share each PC's `available_at` (real HW command bus is per-PC shared). - **HBM direction switching penalty mechanism**: per-PC last-direction @@ -66,8 +66,8 @@ the *limits of fidelity*. ### D3. Ignored (out of scope) - Bank-level row buffer conflict penalty (assume no conflicts — best case; - round-robin chunk assignment is address-blind so we cannot detect same-bank - reuse). + the model has no per-bank state within a PC, so same-bank reuse cannot be + detected). - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state `burst_time = burst_bytes / pc_bw_gbs`). 
- Refresh, ECC, thermal throttling, power gating. @@ -110,29 +110,6 @@ below are different concerns, ordered by expected workload impact. **Higher impact (workload accuracy gap)**: -- [ ] **Address-based PC selection at HBM CTRL** (replace the - address-blind global round-robin). Compute the PC index from - the HBM byte offset using parameters already in topology config: - - pc_shift = log2(burst_bytes) # default 8 (burst=256B) - pc_mask = num_pcs - 1 # default 7 (8 PCs) - pc = (hbm_offset >> pc_shift) & pc_mask - - For the default `burst_bytes=256, num_pcs=8` this places the PC - select field at HBM byte-offset bits **[10:8]**: bits [7:0] are - the within-burst offset (same PC), bits [10:8] are the 3-bit PC - index, and bits [36:11] are row/bank/column within the PC slice. - Shift/mask are derived from topology config rather than hardcoded - so alternative `(burst_bytes, num_pcs)` pairs stay consistent. - See `src/kernbench/policy/address/phyaddr.py` for the canonical - comment. - - Real-HW workloads where this matters most: (a) strided multi- - transaction streams that under global-RR collide on the same PCs - but under address-striping land on disjoint sets; (b) offset- - disjoint parallel transfers where address-striping preserves - parallelism while global-RR re-serializes them. Directly affects - multi-PE concurrent HBM workload latencies. - [ ] **Bank-level conflict modeling** within a PC (opt-in via `track_banks: true`). Currently we assume no same-bank reuse; random scatter/gather workloads are optimistic here. @@ -169,7 +146,7 @@ below are different concerns, ordered by expected workload impact. touching latency must update the relevant section here. - Workload-specific magnitude error envelopes are explicit. - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` - enforces the ADR-0019 D9 invariant in code rather than relying on yaml + enforces the ADR-0017 D8 invariant in code rather than relying on yaml manual consistency. 
- Wire transfer time is charged once per bottleneck-link transit (Phase 2c per-flit timing) rather than via terminal `drain_ns` injection. Single @@ -180,5 +157,6 @@ below are different concerns, ordered by expected workload impact. ## Cross-references - ADR-0015 — component / port / wire model. -- ADR-0019 — NoC and local HBM topology. +- ADR-0017 — Cube NOC architecture and HBM connectivity. - ADR-0004 — memory semantics, local HBM. +- ADR-0034 — HBM controller internal design. diff --git a/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md b/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md new file mode 100644 index 0000000..b7d3e8f --- /dev/null +++ b/docs/adr/ADR-0034-dev-hbm-controller-internal-design.md @@ -0,0 +1,271 @@ +# ADR-0034: HBM Controller Internal Design + +## Status + +Accepted + +## Context + +`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of +the cube NOC. One instance is created per PE under the topology node +`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router +(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling, +burst-granular commit timing, address-based PC selection, and response +routing back to the requester. + +This ADR documents the component as currently implemented. ADR-0017 D4/D8 +defines *where* HBM CTRL attaches and *what* aggregate BW it must +deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in +scope. This ADR fills the gap between those two — the per-instance +internal scheduling model. + +## Decision + +### D1. Role + +`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per +PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`) +attaches to that PE's router via the `peX.hbm` attachment list in +`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping +(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels +into one endpoint. 
+ +The component models: + +- Per-PC scheduling (D2) with R/W command-bus sharing. +- Address-based PC selection (D3). +- Burst-granular commit timing (D4). +- Flit-aware per-flit PC commit and async finalize (D5, D6). +- Command-only Transaction handling for read-data drain (D7). +- Response routing back to the requester (D8). + +It does not model: + +- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling + (ADR-0033 D3). +- Cross-PE HBM contention beyond its own router edge (handled by the + router mesh — ADR-0017 D3). +- 1:1 channel mode (ADR-0017 D8 future work). + +### D2. Per-PC scheduling model + +Per-instance state initialised in `start()`: + +- `_pc_avail: list[float]` — earliest sim-time each PC is free; length + `num_pcs`, initial 0.0. +- `_pc_last_dir: list["R"|"W"|None]` — direction of the last commit on + each PC, used for switch-penalty detection (D4); initial `None`. + +`num_pcs` and `burst_bytes` must each be a positive power of two so +that address-based PC selection (D3) reduces to a shift-and-mask. + +Read and write requests share the same `_pc_avail` slot per PC — the +real HW per-PC command bus is shared between read and write traffic, so +issuing a write to PC k blocks a subsequent read to PC k by exactly the +burst time. + +Direction `dir` for a request is inferred from the request type: + +- `MemoryWriteMsg` → `"W"`. +- `PeDmaMsg` with `is_write=True` → `"W"`. +- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`. + +### D3. Address-based PC selection + +PC index for an access is derived from the access address by shift and +mask: + +```text +pc_shift = log2(burst_bytes) # default 8 (burst=256B) +pc_mask = num_pcs - 1 # default 7 (8 PCs) +pc = (address >> pc_shift) & pc_mask +``` + +Computed once in `start()` from topology config so alternative +`(burst_bytes, num_pcs)` pairs stay consistent. 
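The shift-and-mask selection in D3, together with the power-of-two requirement from D2, can be sketched as below. This is a minimal illustration and not the component's actual code: the standalone helpers `pc_select_params` and `pc_for_address` are hypothetical names introduced here, and only `burst_bytes`, `num_pcs`, `pc_shift`, and `pc_mask` come from the ADR text.

```python
def pc_select_params(burst_bytes: int, num_pcs: int) -> tuple[int, int]:
    """Derive (pc_shift, pc_mask) once from config (hypothetical helper).

    Both inputs must be positive powers of two so that PC selection
    reduces to a shift and a mask.
    """
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    # For a power of two, bit_length() - 1 equals log2(value).
    return burst_bytes.bit_length() - 1, num_pcs - 1


def pc_for_address(address: int, pc_shift: int, pc_mask: int) -> int:
    """Map an HBM byte offset to a pseudo-channel index (hypothetical helper)."""
    return (address >> pc_shift) & pc_mask


# Default configuration from the ADR: 256-byte bursts, 8 pseudo-channels.
shift, mask = pc_select_params(burst_bytes=256, num_pcs=8)

# Eight consecutive bursts land on eight distinct PCs; the ninth wraps to PC 0.
pcs = [pc_for_address(addr, shift, mask) for addr in range(0, 8 * 256, 256)]
wrapped = pc_for_address(8 * 256, shift, mask)
```

Deriving the shift and mask from configuration, instead of hardcoding 8 and 7, is what keeps other `(burst_bytes, num_pcs)` pairs valid without touching the selection logic.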
For the canonical +default `(256, 8)` this places the PC select field at bits `[10:8]` of +the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits +`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column +within the PC slice (see `phyaddr.py` comment). + +Address-based striping — as opposed to address-blind global +round-robin — preserves PC parallelism for offset-disjoint concurrent +transfers: each transfer's bursts land deterministically on the PC set +implied by its byte addresses, so multi-PE workloads accessing disjoint +regions do not collide on a single PC. + +### D4. Burst granularity and PC commit timing + +A single PC commit takes: + +```text +chunk_time = burst_bytes / pc_bw_gbs # ns +``` + +- `burst_bytes` (default 256) is the burst granularity matching the + flit size (ADR-0033 D1). +- `pc_bw_gbs` is **builder-derived** from + `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing + the ADR-0017 D8 invariant that aggregate per-PE BW equals the + router-to-HBM link BW. + +Per-PC commit scheduling for an arriving access on PC `pc` with +direction `dir`: + +```text +switch_cost = switch_penalty_ns + if pc_last_dir[pc] not in (None, dir) else 0 +start = max(env.now, pc_avail[pc]) + switch_cost +finish = start + chunk_time +pc_avail[pc] = finish +pc_last_dir[pc] = dir +``` + +Default `switch_penalty_ns = 0` — Tier 0 assumption that an ideal HBM +scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values +model pessimistic per-alternation cost. + +### D5. Flit-aware per-flit PC commit (primary path) + +`_handle_flit` is the primary worker path. For each arriving `Flit`: + +1. On the **first** flit of a transaction (`tid = id(txn)` not in + `_txn_state`): + - Apply `overhead_ns` once via `run(env, nbytes)` — header decode + model, first-flit overhead pattern (ADR-0033 D1). + - Initialise `_txn_state[tid] = {"last_finish": env.now}`. +2. Compute `pc = _pc_for_address(flit.address)` (D3). +3. 
Apply the per-PC schedule (D4) using the request direction (D2). +4. Update `state["last_finish"] = max(state["last_finish"], finish)`. +5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn` + (D6). + +Per-flit address-aware commit is the mechanism that lets concurrent +multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs +in parallel. + +### D6. Async finalize per transaction + +When a transaction's last flit has been scheduled, finalisation runs in +a separately-spawned process: + +```python +def _finalize_txn(env, txn, last_finish): + wait = last_finish - env.now + if wait > 0: + yield env.timeout(wait) + yield from _send_response(env, txn) +``` + +`_handle_flit` spawns this via `env.process(...)` and returns +immediately, so the worker can pick up the next inbox message while the +last PC commit drains. + +Without this split — i.e. if the worker itself did +`yield env.timeout(wait)` — concurrent single-flit transactions whose +addresses hit distinct PCs would still serialise at `chunk_time` each +inside the worker, hiding the PC parallelism that D3 and D5 are +designed to expose. + +### D7. Non-flit fallback for command-only transactions + +`_handle_txn` runs when the inbox delivers a `Transaction` rather than a +`Flit`. This is the path for command-only requests that the wire does +not chunk into flits — most notably `MemoryReadMsg` whose command txn +carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing, +not as inbound flits). + +Procedure: + +1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)` + — for read commands, work is sized by the request. +2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else + 0. +3. `chunk_interval = drain_ns / n_chunks` (when both > 0) — chunks are + scheduled over time at `drain/n_chunks` ns intervals to model the + bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain). +4. 
Apply `run(env, txn.nbytes)` once for `overhead_ns`. +5. For each chunk `i`, advance `chunk_interval` ns then apply the D4 + schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`. +6. After scheduling all chunks, wait `last_finish - env.now` then call + `_send_response`. + +`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with +`_handle_flit` — there is exactly one source of PC scheduling truth +across both paths. + +### D8. Response routing + +`_send_response` dispatches on request type and path geometry: + +| Case | Trigger | Response | +| --- | --- | --- | +| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` | +| Bypass — Memory Read | `"m_cpu" not in any(txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) | +| Bypass — Memory Write | `"m_cpu" not in any(txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) | +| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path | + +The "bypass" classification matches the Memory R/W fabric path defined +in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl, +without M_CPU). The PE_DMA case is its own dedicated reverse-path to +keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a +ResponseMsg envelope). + +In all reverse-path cases, the response Transaction is put onto +`out_ports[reverse_path[1]]` — the first hop back along the recorded +forward path. If `reverse_path` has fewer than 2 entries (degenerate +path), the original `txn.done` is signalled directly. + +### D9. 
Configurable attributes + +| Attribute | Default | Source | Notes | +| --- | --- | --- | --- | +| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 | +| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant | +| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) | +| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching | +| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) | +| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) | + +`pc_bw_gbs` is derived by `topology/builder.py` rather than configured +directly so the aggregate per-PE BW matches the router-to-HBM link BW +without yaml-side duplication. + +## Consequences + +### Positive + +- Address-based PC selection preserves multi-stream HBM parallelism + that an address-blind round-robin would collapse — important for + multi-PE workloads with disjoint HBM regions. +- Flit-aware path (D5) + async finalize (D6) preserves wormhole + pipelining and exposes PC parallelism for back-to-back single-flit + transactions. +- Single source of PC scheduling truth (D4 mechanism, used by both D5 + flit path and D7 chunk-loop path). +- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml + discipline. + +### Negative + +- No bank-level conflict modelling within a PC; address-blind to + bank/row-buffer reuse (ADR-0033 D3). +- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed + FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns` + (ADR-0033 D2). +- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state + accumulates per concurrent transaction and is removed only on + `is_last`. Adequate for current workloads. 
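
The address-based PC selection (D3), the per-PC commit schedule (D4), and the builder-style `pc_bw_gbs` derivation (D9) can be sketched together in plain Python. This is a minimal illustration, not the component's code: `PcScheduler`, `pc_for_address`, and `commit` are hypothetical names, and the `hbm_to_router_bw_gbs = 256.0` default is only what the documented defaults (`pc_bw_gbs = 32.0`, `num_pcs = 8`) imply.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PcScheduler:
    """Hypothetical stand-in for the per-PC state described in D2-D4."""

    num_pcs: int = 8
    burst_bytes: int = 256
    hbm_to_router_bw_gbs: float = 256.0  # assumed; implied by 32.0 GB/s x 8 PCs
    switch_penalty_ns: float = 0.0
    pc_avail: List[float] = field(init=False)
    pc_last_dir: List[Optional[str]] = field(init=False)

    def __post_init__(self) -> None:
        # D2: both values must be positive powers of two for shift-and-mask.
        for name in ("num_pcs", "burst_bytes"):
            value = getattr(self, name)
            if value <= 0 or value & (value - 1):
                raise ValueError(f"{name} must be a positive power of two")
        self.pc_shift = self.burst_bytes.bit_length() - 1  # log2(burst_bytes)
        self.pc_mask = self.num_pcs - 1
        # D9: per-PC bandwidth is derived, not configured directly.
        self.pc_bw_gbs = self.hbm_to_router_bw_gbs / self.num_pcs
        self.chunk_time = self.burst_bytes / self.pc_bw_gbs  # bytes / (GB/s) = ns
        self.pc_avail = [0.0] * self.num_pcs
        self.pc_last_dir = [None] * self.num_pcs

    def pc_for_address(self, address: int) -> int:
        # D3: PC index is a bit field of the byte address.
        return (address >> self.pc_shift) & self.pc_mask

    def commit(self, now: float, address: int, direction: str) -> float:
        # D4: reads and writes share one availability slot per PC.
        pc = self.pc_for_address(address)
        last = self.pc_last_dir[pc]
        switch_cost = self.switch_penalty_ns if last not in (None, direction) else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + self.chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish
```

With the defaults, two bursts at offsets `0x000` and `0x100` land on PC 0 and PC 1 and both finish at 8 ns, while a third burst at `0x800` wraps back to PC 0 and finishes at 16 ns.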
+ +## Links + +- ADR-0001 (Physical address layout — PC bit field comment) +- ADR-0015 D4 (Memory R/W fabric path — bypass response case) +- ADR-0017 D4 (Per-PE HBM partitioning — attachment to PE routers) +- ADR-0017 D8 (HBM channel mapping mode — n:1 aggregate this ADR + implements) +- ADR-0017 D9 (AddressResolver — `hbm_ctrl.pe{pe_id}` endpoint + resolution) +- ADR-0033 D1 (Modelled precisely — per-PC parallelism, switch penalty, + flit-aware PC commit, first-flit overhead, chunk-loop drain) +- ADR-0033 D2 (Switch-penalty default 0 — ideal scheduler amortisation) diff --git a/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md b/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md new file mode 100644 index 0000000..8d22a7b --- /dev/null +++ b/docs/adr/ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md @@ -0,0 +1,286 @@ +# ADR-0035: M_CPU and M_CPU.DMA Component Model + +## Status + +Accepted + +## Context + +M_CPU is the cube-level command processor. It receives commands from +IO_CPU (or from PCIE_EP when the engine routes Memory R/W through +M_CPU as a fallback), fans them out to the PEs in its cube, and +aggregates per-PE responses into a single ResponseMsg sent back to +IO_CPU on the reverse path. + +M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W +fan-out. Per ADR-0015 D5 it is **not** a separate topology node — +it lives as internal state of `MCpuComponent`. + +This ADR documents the M_CPU component implementation that realizes +those responsibilities, including the three distinct fan-out paths +(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource +model, and the response aggregation contract. + +## Decision + +### D1. Role + +M_CPU has three responsibilities: + +1. **Transit forwarding** — when not the terminal hop (e.g., on the + reverse response path PE → M_CPU → IO_CPU), forwards Transactions + to `next_hop` in their pre-computed path. +2. 
**Multi-PE fan-out at terminal hop** — dispatches to one of three + fan-out paths based on request type (D2). +3. **Response aggregation** — collects per-PE responses, sends a + single aggregate ResponseMsg back to IO_CPU on the reverse path. + +Per invocation (`run()`): applies `overhead_ns` once per incoming +Transaction. + +M_CPU does **not**: + +- Decide routing — paths are pre-computed by the router (ADR-0002). +- Handle PE-internal execution — PE_CPU / PE_SCHEDULER / engines + (ADR-0014). +- Decode addresses — `ctx.resolver.resolve(pa)` returns the per-PE + `hbm_ctrl.pe{X}` directly (ADR-0017 D9). +- Interpret tensor or kernel semantics — fan-out dispatch by Python + isinstance check only. + +### D2. Three fan-out paths dispatched by request type + +At the terminal hop the worker dispatches by request type: + +```python +elif self.ctx is not None and txn.request is not None: + if isinstance(txn.request, KernelLaunchMsg): + env.process(self._kernel_launch_fanout(env, txn)) + elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)): + env.process(self._mmu_msg_fanout(env, txn)) + else: + env.process(self._dma_fanout(env, txn)) +``` + +Each path uses a different router method: + +- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()` — the + M_CPU-specific DMA path that avoids PE pipeline nodes. +- `_kernel_launch_fanout` uses `ctx.router.find_node_path()` — the + generic NOC command path to PE_CPU. +- `_mmu_msg_fanout` uses `ctx.router.find_node_path()` — NOC command + path to PE_MMU. + +### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5) + +`MCpuComponent.start()` initializes two SimPy resources: + +```python +self._dma_write = simpy.Resource(env, capacity=1) # MemoryWriteMsg +self._dma_read = simpy.Resource(env, capacity=1) # MemoryReadMsg +``` + +Properties: + +- **Not a topology node** — managed entirely inside `MCpuComponent`; + does not appear in `topology.yaml` or in the compiled graph. 
+- **Independent read and write channels** — concurrent in-flight + Memory R/W is allowed. +- **Capacity=1 per channel** serializes the **dispatch step** + (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory + R/W requests at this M_CPU. Actual fabric transfer time is modeled + by wire processes between components (ADR-0015 D2) and by + `drain_ns` at terminal hops; the DMA resource does not gate + transfer duration. + +Resource selection is request-type-based: + +```python +dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read +``` + +### D4. Transit forwarding at non-terminal hops + +When `txn.next_hop` is not None — typical for the reverse response +path (PE → M_CPU → IO_CPU) — the worker forwards normally: + +```python +if next_hop: + yield self.out_ports[next_hop].put(txn.advance()) +``` + +The fan-out branches fire only at the terminal hop. The same component +therefore serves both forward command dispatch and reverse response +relay roles. + +### D5. DMA fan-out (`_dma_fanout` — Memory R/W) + +For each Memory R/W request at terminal hop: + +1. `_resolve_dma_destinations(request)` returns a per-PE + `hbm_ctrl.pe{X}` derived from the request's PA via + `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9). +2. For each destination: + - Acquire the appropriate DMA resource (`_dma_write` or + `_dma_read`) via `with dma_res.request() as req`. + - Resolve path via `ctx.router.find_mcpu_dma_path()`. + - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`. + - Create sub-Transaction carrying `drain_ns` and dispatch to + `path[1]`. +3. Track `max_drain_ns` across destinations and record it as + `txn.result_data["xfer_ns"]` after all responses arrive. +4. After all per-PE responses are collected (D8), send an aggregate + ResponseMsg on the reverse command path back to IO_CPU. + +PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead code — +no such node exists after ADR-0017 D4's per-PE partitioning. 
Kept +defensively but does not route to a real destination. + +### D6. Kernel launch fan-out (`_kernel_launch_fanout`) + +For `KernelLaunchMsg` at terminal hop: + +1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube. +2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via + `ctx.router.find_node_path()`. +3. **`target_start_ns` handling** (ADR-0009 D5): + - If the request already carries `target_start_ns` (stamped by + IO_CPU per ADR-0036 D3): **pass through unchanged**. + - If absent (direct-to-M_CPU launch in unit tests): compute a + per-cube barrier `env.now + max(per-PE leg latency)` and stamp + via `dataclasses.replace`. +4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a + control message; preserving nbytes=0 keeps fan-out off the shared + first-hop fabric BW, mirroring ADR-0036 D4). +5. After all per-PE responses arrive (D8), aggregate per-PE metrics + from each sub-Transaction's `result_data` into the parent + transaction: + + ```python + txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values)) + txn.result_data["dma_ns"] = max(existing, max(dma_values)) + txn.result_data["compute_ns"] = max(existing, max(compute_values)) + ``` + + The max-merge with the existing value matters because cross-cube + IO_CPU fan-out shares the same parent `result_data`; merging + prevents one cube from clobbering another's metric. +6. Send aggregate ResponseMsg on reverse path back to IO_CPU. + +### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`) + +For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop: + +1. `_resolve_pe_ids(target_pe)` → PE ids. +2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via + `find_node_path()`. +3. Dispatch sub-Transactions with `nbytes=0`. +4. PE_MMU is a terminal node — it does **not** send a ResponseMsg + back. Instead, the sub-Transaction's own `sub_done` event is the + completion signal. +5. 
Wait for all `sub_done` events in-line (does **not** use + `_pending` counter — D8 is for response-bearing fan-out only). +6. Send aggregate ResponseMsg on reverse path back to IO_CPU. + +### D8. Response aggregation (`_pending` + `_parent_txns`) + +For DMA and kernel-launch fan-out (which expect per-PE ResponseMsg +arriving on the reverse path): + +```python +self._pending: dict[str, tuple[int, int, simpy.Event]] = {} +self._parent_txns: dict[str, Any] = {} +``` + +- On dispatch: register `(expected, received=0, all_done)` and + remember the parent transaction. +- `_worker` recognises responses by `is_response=True` and routes + them to `_collect_response`, which increments `received` and + signals `all_done` when `received >= expected`. +- After `yield all_done`, the fan-out path constructs the aggregate + ResponseMsg: + + ```python + resp_msg = ResponseMsg( + correlation_id=request.correlation_id, + request_id=request.request_id, + src_cube=cube_id, + src_pe=-1, # -1 = M_CPU aggregate, not a single PE + success=True, # no failure semantics implemented + ) + ``` + +- The response Transaction travels on `list(reversed(txn.path))` + back to IO_CPU. + +MMU fan-out (D7) uses a simpler in-line list of `sub_done` events +because PE_MMU is terminal — there is no ResponseMsg path to +intercept. + +### D9. Helpers and configurable attribute + +`_resolve_pe_ids(target_pe)`: + +- `int` → `[target_pe]` +- `tuple[int, ...]` → `list(target_pe)` +- `"all"` → `range(n_slices)` where `n_slices` comes from cube + `memory_map.hbm_slices_per_cube` (default 8). + +Used by kernel-launch and MMU fan-out paths. + +Single configurable attribute drives per-instance latency: + +| Site | impl name | overhead_ns | +| --- | --- | --- | +| Cube `m_cpu` | `builtin.m_cpu` | 5.0 | + +Applied once in `run()` per Transaction — models command +interpretation and dispatch-decision time at M_CPU. 
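
The `_pending` counter contract from D8 can be sketched without SimPy. In this illustration a plain callback stands in for the `all_done` event, `ResponseAggregator` and `register` are hypothetical names, and removing the entry on completion is an assumption (D8 does not state when the entry is dropped).

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Pending:
    expected: int
    received: int
    on_all_done: Callable[[], None]  # stands in for all_done.succeed()


class ResponseAggregator:
    """Hypothetical per-request counter mirroring the D8 description."""

    def __init__(self) -> None:
        self._pending: Dict[str, _Pending] = {}

    def register(self, request_id: str, expected: int,
                 on_all_done: Callable[[], None]) -> None:
        # On dispatch: (expected, received=0, all_done).
        self._pending[request_id] = _Pending(expected, 0, on_all_done)

    def collect_response(self, request_id: str) -> bool:
        """Count one per-PE response; return True once all have arrived."""
        entry = self._pending.get(request_id)
        if entry is None:
            return False  # unknown or already completed request
        entry.received += 1
        if entry.received >= entry.expected:
            del self._pending[request_id]  # assumed cleanup on completion
            entry.on_all_done()
            return True
        return False
```

Because the counter only ever increments, a response that never arrives leaves the entry in `_pending` and the callback is never invoked, which is the stall behaviour noted under Consequences.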
+ +## Consequences + +### Positive + +- Three fan-out paths are clearly separated by request type — adding + a new request kind is an isinstance branch + one fan-out method. +- M_CPU.DMA channels are independent (read and write run concurrently) + and serialize only the dispatch step at capacity=1. +- Transit-vs-terminal behavior is a single `if next_hop` check, so + the same component handles forward dispatch and reverse response + relay without role duplication. +- `target_start_ns` passthrough (D6) preserves the cross-cube barrier + established by IO_CPU (ADR-0036 D3), while the fallback computation + keeps direct-to-M_CPU unit tests working. +- Per-PE metric `max`-merge against existing parent `result_data` + values is robust to cross-cube IO_CPU fan-out sharing the same + parent. + +### Negative + +- No partial-failure semantics — a missing per-PE response stalls the + parent `all_done` indefinitely. Acceptable for simulation; not + suitable as a production-style endpoint. +- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead + code (no such node exists post-ADR-0017 D4). Kept defensively; + invites confusion and merits a follow-up cleanup. +- DMA resource serialization applies only at dispatch (the `put` call + is instantaneous in unbounded stores). The capacity=1 channel + models "one request in flight at a time at this M_CPU", not + "transfer duration serialization" — readers must consult wire + processes (ADR-0015 D2) and `drain_ns` for actual transfer + parallelism. 
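
The max-merge of per-PE metrics into the shared parent `result_data` (D6) can be sketched as a small helper. `merge_pe_metrics` is a hypothetical name, and treating a missing parent value as `0.0` is an assumption made for the illustration.

```python
from typing import Dict, Iterable, Mapping

METRIC_KEYS = ("pe_exec_ns", "dma_ns", "compute_ns")


def merge_pe_metrics(parent: Dict[str, float],
                     sub_results: Iterable[Mapping[str, float]]) -> Dict[str, float]:
    """Fold per-PE metrics into the parent without lowering existing values."""
    results = list(sub_results)
    for key in METRIC_KEYS:
        values = [r[key] for r in results if key in r]
        if not values:
            continue  # no PE reported this metric; leave the parent untouched
        # Compare against the existing value so another cube's larger
        # metric, already written to the same parent, is not clobbered.
        parent[key] = max(parent.get(key, 0.0), max(values))
    return parent
```

If another cube has already recorded `pe_exec_ns = 120.0` on the shared parent, merging this cube's per-PE values of 100.0 and 90.0 leaves 120.0 in place.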
+ +## Links + +- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics) +- ADR-0009 D5 (`target_start_ns` — passed through unchanged when + present; computed as per-cube barrier when absent) +- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out + point) +- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same + contract at cube level) +- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a + topology node) +- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`) +- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes + through unchanged; nbytes=0 invariant preserved through fan-out) diff --git a/docs/adr/ADR-0036-dev-io-cpu-component-model.md b/docs/adr/ADR-0036-dev-io-cpu-component-model.md new file mode 100644 index 0000000..b79e9ad --- /dev/null +++ b/docs/adr/ADR-0036-dev-io-cpu-component-model.md @@ -0,0 +1,216 @@ +# ADR-0036: IO_CPU Component Model + +## Status + +Accepted + +## Context + +IO_CPU is the IO chiplet's host-facing endpoint inside the simulation +graph. PCIE_EP receives host messages from the runtime API and routes +them via the io_noc; for command-bearing requests (KernelLaunch, +MmuMap/Unmap) the io_noc forwards to IO_CPU, which: + +- Fans out the request to per-cube M_CPUs. +- Aggregates per-cube responses into a single host-visible completion. +- For kernel launches, stamps a global `target_start_ns` barrier so + every PE across every targeted cube begins kernel body execution at + the same simulated time (ADR-0009 D5). + +Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3; +this component therefore handles only command-plane traffic in normal +operation. + +This ADR documents the IO_CPU component implementation that realizes +those responsibilities. + +## Decision + +### D1. Role + +IO_CPU is the host-facing endpoint of the IO chiplet. It has two +primary responsibilities: + +1. 
**Multi-cube fan-out** — distribute KernelLaunchMsg / MmuMapMsg / + MmuUnmapMsg to per-cube M_CPUs. +2. **Response aggregation** — collect per-cube ResponseMsg, signal + parent `txn.done` when all targeted cubes have responded. + +A third, narrower responsibility applies only to KernelLaunchMsg: +**`target_start_ns` global barrier stamping** (D3). + +The component does **not**: + +- Decide routing — paths are pre-computed by the router (ADR-0002). +- Decode tensor or kernel internals — those concerns belong to + M_CPU / PE_CPU / engines. +- Handle PE-level fan-out — M_CPU fans out within a cube (ADR-0009 D3). +- Handle Memory R/W data path — those bypass IO_CPU per ADR-0015 D4 + and ADR-0016 D3 (Memory R/W resolution code in + `_resolve_cube_targets` exists as a defensive fallback only). + +Per invocation (`run()`): applies the configured `overhead_ns` once +per incoming Transaction (D8). + +### D2. Forward path — multi-cube fan-out + +When a non-response Transaction arrives, the worker: + +1. Pays `overhead_ns` via `run()`. +2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)` + targets from the request (D5). +3. For each target: + - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`. + - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`. + - Creates a per-cube sub-Transaction with `path` populated and + forwards it to `path[1]` (the first hop on the io_noc). +4. Registers aggregation state: `_pending[request_id] = (expected, + received=0, parent_done)`. + +### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5) + +IO_CPU is the canonical stamper for `target_start_ns`. 
When the +request is a `KernelLaunchMsg`, IO_CPU computes a single global +barrier covering every targeted PE across every targeted cube: + +```text +for (sip, cube) in cube_targets: + leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0) + for pe_id in target_pe_ids: + leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id), + nbytes=0) + latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns + global_max = max(global_max, latency) + +target_start_ns = env.now + global_max +``` + +The request is then replaced (via `dataclasses.replace`) so the +stamped value propagates through the fan-out. + +Two overhead corrections: + +- `io_overhead_ns` is subtracted because IO_CPU has already paid it + in `run()` before this method runs. +- `m_overhead_ns` is subtracted once because it appears as the + endpoint of leg1 *and* the start of leg2 in path latency, but + M_CPU pays it only once at run time. + +Every downstream PE_CPU yields until `target_start_ns` before +beginning kernel body execution; all PEs therefore start at the same +simulated time regardless of how long their individual dispatch path +took. + +### D4. KernelLaunch sub-Transactions carry `nbytes=0` + +Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`, +overriding the parent `txn.nbytes`: + +- Kernel launch is a control message; payload size is irrelevant at + the data-fabric level. +- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the + io_noc's shared first hop. With 16 cubes this serializes fan-out, + pushing far M_CPUs past `target_start_ns` and breaking the D3 + invariant. + +Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant +for the defensive Memory R/W fallback path, which carries actual +payload sizes). + +### D5. 
Per-request-type cube target resolution + +`_resolve_cube_targets` dispatches by request type: + +| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics | +| --- | --- | --- | +| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode | +| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode | +| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP | +| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec | + +Each IO_CPU instance fans out only within its own SIP — `_my_sip()` +parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0). + +The Memory R/W rows exist for defensive completeness; the engine's +normal path routes Memory R/W via `_process_memory_direct()` / +`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 / +ADR-0016 D3). + +### D6. Response aggregation + +`_pending: dict[request_id → (expected, received, parent_done)]`: + +- On dispatch: register `(len(cube_targets), 0, txn.done)`. +- `_worker` recognises responses by `is_response=True` and routes + them to `_collect_response`. +- `_collect_response` increments `received`; when `received >= + expected`, `parent_done.succeed()` is invoked and the entry is + removed from `_pending`. + +This is a simple per-request counter. There is no per-cube identity +tracking and no partial-failure handling — a missing response +indefinitely stalls the parent done. Production-style failure paths +are out of scope for the current simulator model. + +### D7. `target_pe` resolution helper + +`_resolve_pe_ids(target_pe)`: + +- `int` → `[target_pe]`. +- `tuple[int, ...]` → `list(target_pe)`. +- `"all"` → `range(n_slices)`, where `n_slices` comes from cube + `memory_map.hbm_slices_per_cube` (default 8). 
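
The three cases above can be sketched as a standalone function. `resolve_pe_ids` is a hypothetical free-function form of the helper; it returns a `list` for `"all"` rather than a bare `range`, and raising `TypeError` for other inputs is an assumption.

```python
from typing import List, Tuple, Union

TargetPe = Union[int, Tuple[int, ...], str]


def resolve_pe_ids(target_pe: TargetPe, n_slices: int = 8) -> List[int]:
    """Expand a target_pe selector into an explicit list of PE ids."""
    if target_pe == "all":
        # n_slices would come from memory_map.hbm_slices_per_cube.
        return list(range(n_slices))
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    raise TypeError(f"unsupported target_pe: {target_pe!r}")
```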
+ +Used in D3's barrier computation to enumerate every PE target per +cube. + +### D8. Configurable `overhead_ns` + +A single attribute drives per-instance latency: + +| Site | impl name | overhead_ns | +| --- | --- | --- | +| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 | + +Applied once in `run()` per Transaction. Models command +interpretation + dispatch-decision time at IO_CPU. + +## Consequences + +### Positive + +- Cross-cube and cross-SIP kernel launches share a single global + barrier (D3 + D4) — no per-cube divergence in start time. +- nbytes=0 invariant keeps fan-out off the shared first-hop fabric + BW, preserving the barrier's accuracy at scale (16 cubes). +- Response aggregation via a single counter → minimal state, + deterministic ordering of completion. +- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs + cleanly independent. + +### Negative + +- No partial-failure semantics — a missing per-cube response + indefinitely stalls the parent. Adequate for simulation but not + suitable as a production-style endpoint. +- `_pending` is a regular dict; in-flight requests accumulate state. + Acceptable for current benchmark workloads (few concurrent + outstanding launches); unbounded in principle. +- The Memory R/W resolution branches in `_resolve_cube_targets` are + dead code in the normal engine path. Kept defensively but invite + drift if the bypass path ever changes. 
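
The D3 barrier computation, including both overhead corrections, can be sketched as a pure function. This is an illustration under stated assumptions: `compute_target_start_ns` is a hypothetical name, the two `leg*_ns` callables stand in for `compute_path_latency_ns` with `nbytes=0`, `global_max` is assumed to start at `0.0`, and the overhead defaults are the 10.0 ns (IO_CPU) and 5.0 ns (M_CPU) values from the attribute tables.

```python
from typing import Callable, Iterable, Tuple

CubeTarget = Tuple[int, int]  # (sip, cube)


def compute_target_start_ns(
    now: float,
    cube_targets: Iterable[CubeTarget],
    pe_ids: Iterable[int],
    leg1_ns: Callable[[int, int], float],       # io_cpu -> m_cpu(sip, cube)
    leg2_ns: Callable[[int, int, int], float],  # m_cpu -> pe_cpu(sip, cube, pe)
    io_overhead_ns: float = 10.0,
    m_overhead_ns: float = 5.0,
) -> float:
    """Return one barrier time covering every targeted PE in every cube."""
    pe_list = list(pe_ids)
    global_max = 0.0
    for sip, cube in cube_targets:
        leg1 = leg1_ns(sip, cube)
        for pe_id in pe_list:
            leg2 = leg2_ns(sip, cube, pe_id)
            # IO_CPU overhead is already paid; M_CPU overhead is counted
            # in both legs but paid once, so subtract each one time.
            latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
            global_max = max(global_max, latency)
    return now + global_max
```

The slowest dispatch path alone sets the barrier, so adding a nearer cube or PE to the target set never moves `target_start_ns`.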
+ +## Links + +- ADR-0002 (Routing distance — path computation) +- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU) +- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across + cubes) +- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU) +- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out) +- ADR-0012 (Host ↔ IO_CPU message schema) +- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU) +- ADR-0016 D1 (IO chiplet io_noc — IO_CPU attaches here) +- ADR-0016 D3 (Memory R/W path bypasses IO_CPU) +- ADR-0016 D4 (Kernel Launch path through IO_CPU for command + interpretation) diff --git a/docs/adr/ADR-0037-dev-forwarding-component.md b/docs/adr/ADR-0037-dev-forwarding-component.md new file mode 100644 index 0000000..193dbe0 --- /dev/null +++ b/docs/adr/ADR-0037-dev-forwarding-component.md @@ -0,0 +1,200 @@ +# ADR-0037: Forwarding Component (forwarding_v1) + +## Status + +Accepted + +## Context + +The simulation graph has many node positions that exist purely to model +fabric traversal — NOC mesh routers, switches, UCIe protocol endpoints, +IO chiplet io_noc, transit cubes. These share a common pattern: receive +a message, apply per-component overhead (modeling header decode + +routing decision time), forward to the next hop along the pre-computed +path. + +This ADR defines the contract for these transit nodes: a single +component type (`TransitComponent`) that handles flit-aware forwarding +with wormhole cut-through semantics, used under multiple impl names +according to the conceptual role each instance plays. + +## Decision + +### D1. Role + +The Forwarding component (`TransitComponent` class) is a **stateless +transit node** in the simulation graph. It models any fabric position +where a message physically traverses but no semantic processing +happens. + +Per traversal, the component: + +1. Reads an incoming Transaction or Flit from an `in_port`. +2. 
Applies the configured per-component overhead (`overhead_ns`), + applied **once per Transaction** even across multi-flit payloads + (see D2). +3. Looks up the next hop along the Transaction's pre-computed `path`. +4. Forwards to the corresponding `out_port`; at the terminal node + (no next hop), signals `txn.done` once the `is_last` flit arrives. + +The component **does NOT**: + +- Decide routing — paths are pre-computed by the router (ADR-0002 / + ADR-0017 D2). Forwarding only executes the per-hop step. +- Model wire propagation or bandwidth occupancy — separate wire + processes between components handle that (ADR-0015 D2). +- Resolve addresses — the AddressResolver does that (ADR-0017 D9). +- Aggregate completion — terminal endpoints (IO_CPU, M_CPU, HBM_CTRL) + handle that. + +### D2. First-flit overhead model (header decode) + +Per-Transaction `overhead_ns` is applied **exactly once**, at first +flit arrival: + +- `_txn_decoded: set[int]` tracks which Transactions have already + paid the overhead at this node. +- On first-flit arrival for a Transaction: `yield self.run(env, + msg.txn.nbytes)` — pays the overhead. +- Subsequent flits of the same Transaction skip the overhead — they + pipeline through with no extra delay. +- On `is_last` flit: remove the Transaction from `_txn_decoded`. + +This models the real-HW behavior where header decode and routing +decision happen once on first flit; payload flits then stream through +the same path (wormhole cut-through). Multi-hop pipelining emerges +naturally — each hop adds its own first-flit overhead, but flits +after the first do not re-pay overhead at any hop they have already +passed first. + +### D3. Serial worker forwarding (preserves order) + +The component's worker is a single SimPy process that consumes flits +from `_inbox` and forwards them serially in arrival order. The +component does NOT spawn `env.process(...)` per flit. 
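
The D2 bookkeeping, as seen by the single serial worker, can be sketched without SimPy. The `Flit` dataclass here is a simplified stand-in for the real flit type, `FirstFlitOverhead` and `delay_for` are hypothetical names, and the returned value is the delay the worker would yield on before forwarding.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Flit:
    txn_id: int    # stands in for the identity of the owning Transaction
    is_last: bool


class FirstFlitOverhead:
    """Charge overhead_ns once per Transaction at this node."""

    def __init__(self, overhead_ns: float) -> None:
        self.overhead_ns = overhead_ns
        self._txn_decoded: Set[int] = set()

    def delay_for(self, flit: Flit) -> float:
        delay = 0.0
        if flit.txn_id not in self._txn_decoded:
            # First flit of this Transaction: pay header decode once.
            self._txn_decoded.add(flit.txn_id)
            delay = self.overhead_ns
        if flit.is_last:
            # Last flit: drop the bookkeeping entry for this Transaction.
            self._txn_decoded.discard(flit.txn_id)
        return delay
```

For a three-flit Transaction at a node with `overhead_ns = 2.0`, the delays are 2.0, 0.0, 0.0, and the set is empty again after the last flit. A single-flit Transaction is both first and last, so it pays the overhead and leaves no state behind.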
+ +Rationale: if the first flit yields on `overhead_ns` while subsequent +flits run in parallel processes, the later flits can overtake the +first. This produces out-of-order delivery and lets the `is_last` +flit arrive at the destination before the first flit — corrupting +both the transaction's completion semantics and any flit-index-based +processing downstream. + +### D4. Path-based next-hop routing + +Routing is **not** a Forwarding-component concern. The Transaction +arrives with a pre-computed `path` (built by the router; ADR-0002 / +ADR-0017 D2). The component just looks up its own position in the +path and forwards to `path[index + 1]`: + +```python +def _next_hop_in_path(self, txn): + my_id = self.node.id + path = txn.path + for i, n in enumerate(path): + if n == my_id and i + 1 < len(path): + return path[i + 1] + return None +``` + +If `next_hop` is found and present in `out_ports`, the flit is +forwarded. Otherwise (terminal node), `txn.done.succeed()` is +invoked when the `is_last` flit arrives. + +### D5. Flit-aware mode with Non-Flit fallback + +`_FLIT_AWARE = True` opts this component out of the base class's +flit-reassembly logic in `_fan_in`. Flits are placed directly on +`_inbox` (no reassembly), enabling per-flit handling in the worker +loop (D2, D3). + +Non-Flit messages — zero-byte control Transactions and other +non-chunkified payloads — fall through to the base class's legacy +`_forward_txn` path via `env.process`. This preserves backward +compatibility for control-plane traffic that does not benefit from +flit-level processing. + +### D6. Multi-stream merging at the base class + +Multi-stream FIFO merging at routers is the base class's +responsibility, not Forwarding's. The base class's `_fan_in` spawns +one process per `in_port`; all push to a single shared `_inbox`. +Flits from different upstream streams therefore interleave at +flit granularity in `_inbox`'s FIFO order. 
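
The interleaving described above can be approximated offline by merging per-port arrival lists by timestamp. This is only a sketch: `merge_into_inbox` and the `(arrival_ns, stream, flit_index)` tuple are hypothetical, and the order of same-timestamp arrivals in the real simulator depends on SimPy event scheduling, which this does not model.

```python
import heapq
from typing import Iterable, List, Tuple

FlitArrival = Tuple[float, str, int]  # (arrival_ns, stream, flit_index)


def merge_into_inbox(*streams: Iterable[FlitArrival]) -> List[FlitArrival]:
    """Merge time-sorted per-port arrivals into one FIFO-ordered inbox."""
    # Each input stream must already be sorted by arrival time, which
    # holds for flits arriving in order on a single in_port.
    return list(heapq.merge(*streams, key=lambda arrival: arrival[0]))
```

Two streams arriving 8 ns apart per flit, offset by 4 ns from each other, alternate flit by flit in the merged inbox while each stream keeps its own flit order.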
+ +The Forwarding worker simply consumes `_inbox` in arrival order — +correctly modeling per-router multi-flow arbitration as +fair-FIFO over the shared inbox. + +### D7. Single implementation under multiple impl names + +A single `TransitComponent` class is registered under four impl names +in `components.yaml`: + +- `builtin.forwarding` — generic forwarding (e.g., `io_noc`, + `noc_router`, UCIe conn bridges) +- `builtin.switch` — tray-level switch +- `builtin.noc` — cube-level NOC fabric (legacy singleton; current + NOC routers use `builtin.forwarding`) +- `builtin.ucie` — UCIe protocol endpoint + +All four aliases instantiate the same class with the same behavior. +Per-instance differentiation lives only in `attrs.overhead_ns`. +Separate impl names exist as intent tags for readability and to +allow future divergence without backward-incompatible config +changes. + +### D8. Configurable `overhead_ns` + +A single attribute drives per-instance latency: + +| Usage site | impl name | overhead_ns | +| --- | --- | --- | +| Tray-level switch | `builtin.switch` | 5.0 | +| Cube NOC router | `builtin.forwarding` | 2.0 | +| IO chiplet io_noc | `builtin.forwarding` | 0.0 | +| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 | +| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 | + +Default is 0.0. The attribute is read at each `run()` invocation, so +dynamic reconfiguration is possible but not currently used. + +## Consequences + +### Positive + +- A single class handles all transit-node roles in the simulation + graph — minimal code surface for a high-population component type. +- Flit-aware processing + serial worker preserves wormhole semantics + across multi-hop paths without per-flit process overhead. +- `overhead_ns` is the only per-instance tunable; routing, BW, and + address resolution stay cleanly separated in their own components / + modules. 
+- Multi-stream merging emerges from the base-class structure; no + router-specific logic duplicates fair-FIFO arbitration. +- Non-Flit fallback path keeps control-plane traffic working without + forcing every message into the flit framework. + +### Negative + +- The single class hides usage-site intent inside `attrs.overhead_ns` + configuration; readers must consult `topology.yaml` + + `components.yaml` to see which impl name maps to which behavior + class. +- Per-flit serial worker is a bottleneck if `overhead_ns` is large + and many concurrent transactions arrive at the same router; current + values (0–8 ns) make this negligible. + +## Links + +- ADR-0002 (Routing distance — path computation) +- ADR-0015 D1 (Component port model) +- ADR-0015 D2 (Wire process — BW + propagation, separate from this + component) +- ADR-0015 D6 (Transit cube forwarding pattern) +- ADR-0016 D1 (IO chiplet io_noc — uses this component) +- ADR-0017 D1 (Cube NOC routers — use this component) +- ADR-0017 D6 (UCIe decomposition — `ucie-{PORT}` instances use this + component) +- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead, + multi-stream merge semantics) diff --git a/docs/ipcq-dma-codesign-hw.md b/docs/ipcq-dma-codesign-hw.md deleted file mode 100644 index 379ad77..0000000 --- a/docs/ipcq-dma-codesign-hw.md +++ /dev/null @@ -1,548 +0,0 @@ -# IPCQ-DMA Co-design Hardware Design Document - -**Status**: Draft — Review Requested -**Date**: 2026-04-28 -**Authors**: YW Kang -**Reviewers**: (HW team TBD) -**Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing) - ---- - -## 1. Background & Motivation - -IPCQ(Inter-PE Communication Queue)는 PE 간 collective communication을 위한 -하드웨어 큐 메커니즘이다. 핵심 설계 원리는 **DMA가 데이터 전송 시 별도의 -제어 메시지 없이, piggyback된 메타 정보를 바탕으로 IPCQ의 head/tail pointer를 -자동 업데이트**하는 IPCQ-DMA co-design이다. - -이 문서는: - -1. 현재 PE 아키텍처에서 IPCQ가 하드웨어 수준에서 어떻게 동작하는지 기술하고, -2. 이 하드웨어를 시뮬레이터에서 어떻게 모델링하고 있는지 검증하며, -3. 실제 하드웨어 구현을 위한 설계를 제안하고, -4. 대안들을 검토하여 최적 접근을 확정한다. 
- ---- - -## 2. High-level Behavior of PE_IPCQ - -![PE Baseline Architecture](diagrams/pe_baseline.png) - -> source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5` 로 렌더링. - -### IPCQ 하드웨어 동작 - -**HW Configuration**: -* IPCQ는 PE 간에 ring buffer 기반의 단방향 큐를 설정하여 데이터를 전달한다. -* 각 PE는 방향별(N/S/E/W 등)로 독립적인 queue pair 를 유지한다. -* IPCQ는 각 queue pair 마다 sender's head/tail pointer, receiver's head/tail pointer 를 유지한다. - -* **IPCQ Slot Region**: IPCQ의 수신 버퍼로, 다이어그램의 점선 박스로 표시된 것처럼 TCM, Cube SRAM, Local HBM 중 하나를 buffer_kind로 지정하여 사용할 수 있다. -각 tier별 성능 특성 (시뮬레이션 모델 값, `ipcq_types.py`): - -| Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | 용도 | -|-------------|-------------|-------------------------------|------| -| TCM | 512 GB/s | 512 GB/s (직결, NoC 미경유) | 최저 latency, PE 내부 전용 | -| Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Cube 내 공유, NoC BW에 제한 | -| Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | 대용량, NoC BW에 제한 | - -**Send 경로 (fire-and-forget)**: -1. PE_CPU가 `tl.send(dir, src_addr)` 발행 → PE_IPCQ에 IpcqRequest 전달 -2. PE_IPCQ가 backpressure 확인: `(my_head - peer_tail_cache) < peer.n_slots` -3. Peer의 rx slot 주소 계산: `peer_rx_base + (my_head % n_slots) × slot_size` -4. IpcqDmaToken(data + piggyback metadata: sender_seq)을 PE_DMA에 전달 -5. PE_IPCQ가 `my_head++`, PE_CPU에 즉시 반환 (DMA 완료를 기다리지 않음) -6. PE_DMA가 src data를 snapshot 후 NoC를 통해 peer PE_DMA로 전송 - -**Receive 경로 (blocking)**: -1. Peer PE_DMA가 data를 slot에 write하고, **같은 사이클에** metadata(sender_seq, dst_addr)를 추출 -2. PE_IPCQ가 dst_addr range matching으로 방향을 식별, `peer_head_cache` 업데이트 -3. `tl.recv(dir)` 대기 중인 PE_CPU에 wakeup signal 전달 -4. PE_CPU가 slot에서 데이터 읽기, PE_IPCQ가 `my_tail++` -5. **Credit return**: PE_IPCQ가 16B credit packet(`consumer_seq`)을 NoC를 통해 sender에게 전송 -6. 
Sender PE_IPCQ가 `peer_tail_cache` 업데이트, backpressure 해제 - -**핵심 설계 원리**: -- **Data + head pointer piggyback**: 별도의 head 동기화 메시지 없이, DMA data flit에 sender_seq를 실어보냄 -- **Atomic write + metadata**: 수신측 DMA가 slot write와 metadata 전달을 같은 사이클에 수행 (I6 invariant) -- **Address-based direction matching**: 같은 peer에 여러 방향이 연결되어도 dst_addr range로 구분 (ADR-0025) -- **Credit-based flow control**: Receiver가 slot 소비 후 16B credit으로 sender에게 알림 - ---- - -## 3. Simulator Implementation Verification - -위의 하드웨어 동작을 시뮬레이터에서 어떻게 모델링하는지 검증한다. - -### 3.1 의도와 구현의 매핑 - -| 설계 의도 | 시뮬레이터 구현 | 위치 | -|-----------|----------------|------| -| DMA가 데이터 전송 시 head pointer를 piggyback | `IpcqDmaToken.sender_seq` 필드가 data flit과 함께 전달 | `ipcq_types.py:185` | -| 수신측 DMA가 data write + metadata 전달을 atomic 처리 | `_handle_ipcq_inbound`에서 `store.write` → `IpcqMetaArrival` 사이에 yield 없음 (I6) | `pe_dma.py:232-275` | -| Send는 fire-and-forget | `_handle_ipcq_outbound`에서 `sub_done`을 기다리지 않음 | `pe_dma.py:182` | -| Recv는 데이터 도착까지 block | `peer_head_cache > my_tail` 조건으로 대기 | `pe_ipcq.py:263` | -| Credit return은 별도 fast-path | SimPy Store를 통한 direct put (latency는 NoC 경로 기반으로 charge) | `pe_ipcq.py:443-469` | -| In-flight data semantics (snapshot) | Send 시점에 data snapshot 보존, 이후 src 수정과 무관 | `pe_dma.py:142-155` | -| PE_DMA 단일 inbox | 모든 in_port를 `_fan_in`으로 단일 FIFO에 merge (`base.py:51-53`) | compute port와 IPCQ port 사이에 arbiter 없음 | - -### 3.2 Credit Return Path 모델링 상세 - -Credit return은 실제 NoC 경로를 `router.find_path()`로 찾고, -`compute_path_latency_ns()`로 hop latency + BW drain을 계산하여 charge한다. - -```python -# pe_ipcq.py:471-492 -def _credit_latency_ns(self, direction: str) -> float: - path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma) - return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes) -``` - -단, latency를 `env.timeout()`으로 지불한 후 `peer_credit_store`(SimPy Store)에 -직접 put하는 방식이다. 
실제 `Transaction`을 만들어 NoC를 hop-by-hop 통과시키지는 -않으므로, **다른 트래픽과의 bandwidth contention은 모델링되지 않는다.** - -| | Latency | BW Contention | -|---|---|---| -| Data path (IpcqDmaToken) | NoC Transaction으로 정확 모델링 | 실제 fabric 통과 | -| Credit path (16B) | NoC 경로 latency 정확 반영 | fabric Transaction 미주입 (단순화) | - -Credit은 16B로 data transfer(수십~수백 KB) 대비 무시 가능한 크기이므로, -이 단순화로 인한 실질적 오차는 거의 없다. - -### 3.3 검증 결론 - -시뮬레이터 구현은 IPCQ-DMA co-design 의도를 **정확하게 모델링**하고 있다. - ---- - -## 4. Proposed Hardware Design - -### 4.1 Block Diagram (변경 후) - -변경점을 강조 표시: **(NEW)** = 신규, **(MOD)** = 수정. - -![PE Proposed Architecture](diagrams/pe_proposed.png) - -> Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2) — `d2 --layout=elk` 로 렌더링. - -**Baseline → Proposed 핵심 변경**: -- 단일 FIFO inbox → **compute port / IPCQ port 분리 + WRR Arbiter** (NEW) -- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW register + combinational logic) -- TCM 내 **IPCQ Slot Region 예약 영역** 명시 -- Credit Injector / Receiver가 Fabric Port를 통해 NoC에 직접 연결 - -### 4.2 Module Details - -#### 4.2.1 IPCQ Controller (신규 모듈) - -PE_CPU와 DMA Engine 사이에 위치하는 하드웨어 제어 블록. -시뮬레이터의 `PeIpcqComponent`에 대응한다. - -##### QPair Register File - -방향별 queue pair 상태를 flip-flop으로 유지한다. - -``` -Per-direction registers (each 64-bit): - my_head — sender write position (monotonic) - my_tail — receiver read position (monotonic) - peer_head_cache — last known peer head (updated by Meta Extractor) - peer_tail_cache — last known peer tail (updated by Credit Receive) - rx_base_pa — this PE's rx buffer base physical address - peer_rx_base_pa — peer's rx buffer base physical address - n_slots — ring depth (power-of-2 제약, 아래 참조) - slot_size — bytes per slot - peer_credit_tgt — peer PE의 credit receive 주소 - -Directions: 최대 8 (N/S/E/W/parent/child_left/child_right + spare) -Total: 8 dirs × 9 regs × 8B = 576B flip-flops -``` - -PE_CPU가 MMIO(CSR)로 읽기/쓰기 가능. Init 시점에 소프트웨어가 채워넣는다. 
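The per-direction register set above can be modeled in a few lines to cross-check the footprint arithmetic. This is a sketch, not the RTL or the simulator's `PeIpcqComponent`: `QPairRegs`, `regfile_bytes`, and `init_qpair` are hypothetical names, the example addresses are made up, and the field list simply mirrors the nine registers listed above.

```python
from dataclasses import dataclass, fields

REG_BYTES = 8        # each register is 64-bit
MAX_DIRECTIONS = 8   # N/S/E/W/parent/child_left/child_right + spare


@dataclass
class QPairRegs:
    """Hypothetical software view of one direction's register set."""
    my_head: int = 0
    my_tail: int = 0
    peer_head_cache: int = 0
    peer_tail_cache: int = 0
    rx_base_pa: int = 0
    peer_rx_base_pa: int = 0
    n_slots: int = 0
    slot_size: int = 0
    peer_credit_tgt: int = 0


def regfile_bytes() -> int:
    """Total register-file size: directions x registers x bytes per register."""
    return MAX_DIRECTIONS * len(fields(QPairRegs)) * REG_BYTES


def init_qpair(rx_base_pa, peer_rx_base_pa, n_slots, slot_size, peer_credit_tgt):
    """Fill one direction's registers at init, enforcing the power-of-2 ring depth."""
    if n_slots <= 0 or n_slots & (n_slots - 1):
        raise ValueError("n_slots must be a power of two")
    return QPairRegs(
        rx_base_pa=rx_base_pa,
        peer_rx_base_pa=peer_rx_base_pa,
        n_slots=n_slots,
        slot_size=slot_size,
        peer_credit_tgt=peer_credit_tgt,
    )


# Example values are assumptions chosen only for illustration.
regs = init_qpair(0xE00000, 0xE00000, n_slots=8, slot_size=64 * 1024,
                  peer_credit_tgt=0xF00000)
total_bytes = regfile_bytes()
```

`regfile_bytes()` evaluates to 576, matching the 8 dirs × 9 regs × 8B figure, which is 4608 flip-flop bits.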
- -##### Slot Address Generator (combinational) - -``` -Input: pointer (my_head or my_tail), n_slots, slot_size, base_pa -Output: slot_addr = base_pa + (pointer % n_slots) * slot_size - -Implementation: - n_slots power-of-2 제약 → pointer & (n_slots - 1) (AND mask, 1 gate delay) - slot_size power-of-2 → barrel shift (1 cycle) - 64-bit add → ripple/kogge-stone adder (1 cycle) - -Latency: 1-2 cycles combinational -``` - -##### Backpressure Comparator (combinational) - -``` -full = (my_head - peer_tail_cache) >= n_slots - -Implementation: 64-bit subtract + unsigned compare -Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold -Latency: 1 cycle -``` - -##### Meta Extractor (inbound datapath sideband) - -DMA Engine의 inbound vc_comm path에 wired. Arriving IPCQ flit의 header에서 -metadata를 추출하여 queue pair 상태를 업데이트한다. - -``` -Trigger: DMA inbound write completion (same cycle) -Extract: {sender_seq, dst_addr} from flit header - -Direction matching (ADR-0025 D2): - for each dir: - match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir]) - 8× parallel range comparators + priority encoder - -Update: peer_head_cache[matched_dir] = max(peer_head_cache, sender_seq + 1) -Output: recv_wake signal for matched direction → PE_CPU interrupt/flag - -Implementation: 8× (2 comparators + AND) + priority encoder -Latency: 1 cycle (pipelined with DMA write — I6 atomicity 자연 보장) -``` - -##### Credit Injector (outbound) - -``` -Trigger: recv completion (my_tail 증가 후) -Action: pack 16B credit packet → DMA vc_comm (또는 dedicated credit VC) - -Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa} -Latency: 1 cycle to generate, then NoC traversal -``` - -##### Credit Receiver (inbound sideband) - -``` -Trigger: 16B credit packet arrival (from NoC) -Extract: {consumer_seq, dst_rx_base_pa} - -Direction matching (ADR-0025 D3): - for each dir: - match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa) - -Update: peer_tail_cache[matched_dir] = 
max(peer_tail_cache, consumer_seq) -Output: send_wake signal → deassert backpressure stall - -Latency: 1 cycle -``` - -#### 4.2.2 DMA Engine 수정사항 - -##### vc_comm IPCQ-aware mode - -기존 vc_comm 채널에 IPCQ flit 처리 모드를 추가한다. - -**Outbound**: -1. IPCQ Controller로부터 command 수신: {src_addr, dst_addr, nbytes, sender_seq} -2. TCM에서 src_addr read → DMA read buffer에 snapshot (기존 DMA behavior) -3. Flit pack: data + piggyback metadata (sender_seq, dst_addr) -4. NoC fabric port에 inject -5. Fire-and-forget (completion을 기다리지 않음) - -**Inbound**: -1. NoC로부터 IPCQ flit 수신 -2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw) -3. Slot write latency charge (backing memory tier) -4. **ATOMIC** (same pipeline stage, no stall insertion): - - TCM write: data → slot address - - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller -5. Done - -**I6 atomicity 하드웨어 보장**: TCM write completion과 Meta Extractor trigger가 -동일 pipeline stage에서 발생하므로 별도 synchronization이 불필요하다. -시뮬레이터의 "no yield between write and IpcqMetaArrival"이 자연스럽게 보장된다. - -##### Data Snapshot Semantics - -DMA read buffer에 latch된 데이터는 src memory의 이후 수정에 영향받지 않는다. -이는 DMA의 standard read-then-write behavior이므로 추가 HW가 불필요하다. - -##### Credit Virtual Channel (선택적) - -옵션 A: vc_comm에 credit을 multiplexing (16B header-only flit으로 구분) -옵션 B: 3rd dedicated credit VC 추가 (strict priority > data) - -옵션 B가 deadlock prevention에 유리하나, 16B credit의 BW 영향이 무시 가능하므로 -옵션 A로도 충분하다. 
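The pointer protocol spread across the Backpressure Comparator, Meta Extractor, Credit Injector, and Credit Receiver can be sketched as a single-direction software model. It assumes one direction, a power-of-two `n_slots`, and instantaneous delivery (no NoC latency or contention). `Sender`, `Receiver`, and their method names are illustrative, not the simulator's API.

```python
class Sender:
    def __init__(self, n_slots, slot_size, peer_rx_base_pa):
        self.n_slots = n_slots            # assumed power of two
        self.slot_size = slot_size
        self.peer_rx_base_pa = peer_rx_base_pa
        self.my_head = 0                  # monotonic write position
        self.peer_tail_cache = 0          # last credit seen from the peer

    def full(self):
        # Backpressure Comparator: full = (my_head - peer_tail_cache) >= n_slots
        return (self.my_head - self.peer_tail_cache) >= self.n_slots

    def send(self):
        """Return (dst_addr, sender_seq), or None when stalled by backpressure."""
        if self.full():
            return None
        slot = self.my_head & (self.n_slots - 1)   # mod via AND mask
        dst_addr = self.peer_rx_base_pa + slot * self.slot_size
        sender_seq = self.my_head
        self.my_head += 1                          # fire-and-forget
        return dst_addr, sender_seq

    def on_credit(self, consumer_seq):
        # Credit Receiver: the cached tail only moves forward.
        self.peer_tail_cache = max(self.peer_tail_cache, consumer_seq)


class Receiver:
    def __init__(self):
        self.my_tail = 0                  # monotonic read position
        self.peer_head_cache = 0          # updated from piggybacked sender_seq

    def on_data(self, sender_seq):
        # Meta Extractor: the cached head only moves forward.
        self.peer_head_cache = max(self.peer_head_cache, sender_seq + 1)

    def recv(self):
        """Consume one slot; return consumer_seq for the credit, or None if empty."""
        if self.peer_head_cache <= self.my_tail:
            return None
        self.my_tail += 1
        return self.my_tail               # Credit Injector payload


# Two-slot ring with hypothetical addresses.
tx = Sender(n_slots=2, slot_size=0x100, peer_rx_base_pa=0xE00000)
rx = Receiver()
first = tx.send()
second = tx.send()
stalled = tx.send()        # ring is full, no credit has returned yet
rx.on_data(first[1])
credit = rx.recv()
tx.on_credit(credit)
resumed = tx.send()        # one slot freed by the credit
```

With a two-slot ring, the third send stalls until the first credit returns, after which the sender reuses slot 0 with `sender_seq` 2.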
- -#### 4.2.3 Fabric Flit Format 확장 - -``` -일반 data flit (예: 512-bit): -┌──────────────────────────────────────────┐ -│ [511:480] routing header (32b) │ -│ [479:0] payload (480b = 60B) │ -└──────────────────────────────────────────┘ - -IPCQ data flit (첫 flit에만 metadata 포함): -┌──────────────────────────────────────────┐ -│ [511:480] routing header (32b) │ -│ [511] ipcq_flag (1b) │ ← IPCQ vs normal DMA 식별 -│ [510:509] vc_id (2b) │ -│ [508:480] route + hop count │ -│ [479:416] ipcq_metadata (64b) │ ← piggyback -│ [479:448] sender_seq (32b) │ -│ [447:416] dst_addr[31:0] (32b) │ ← direction matching용 -│ [415:0] payload (416b = 52B) │ -└──────────────────────────────────────────┘ -후속 flits: full 60B payload (metadata 없음) - -Credit-only flit (128-bit, header-only): -┌──────────────────────────────────────────┐ -│ [127:96] routing header (32b) │ -│ [127] credit_flag (1b) │ -│ [95:64] consumer_seq (32b) │ -│ [63:0] dst_rx_base_pa (64b) │ -└──────────────────────────────────────────┘ -``` - -첫 flit의 payload가 60B → 52B로 감소 (13% overhead). -Multi-flit transfer에서는 후속 flit이 full payload이므로 대형 전송에서 overhead < 1%. - -#### 4.2.4 TCM IPCQ Slot Region - -``` -TCM Memory Map (16MB): -┌─────────────────────────────┐ 0x000000 -│ Kernel Working Memory │ -│ (compute tensors) │ -│ ~14MB │ -├─────────────────────────────┤ 0xE00000 -│ IPCQ RX Buffers │ -│ Dir N: slots × slot_size │ -│ Dir S: slots × slot_size │ -│ Dir E: slots × slot_size │ -│ Dir W: slots × slot_size │ -│ ~1MB │ -├─────────────────────────────┤ 0xF00000 -│ IPCQ Metadata / Scratch │ -│ ~1MB │ -└─────────────────────────────┘ 0xFFFFFF -``` - -IPCQ region을 TCM의 상위 bank에 배치하여 compute access와의 -bank conflict를 최소화한다 (Section 6.1 참조). - ---- - -## 5. 
End-to-End Dataflow - -### 5.1 Sequence Diagram - -```mermaid -sequenceDiagram - participant CPU_A as PE_A: PE_CPU - participant IPCQ_A as PE_A: IPCQ Ctrl - participant DMA_A as PE_A: DMA - participant NOC as NoC Fabric - participant DMA_B as PE_B: DMA - participant IPCQ_B as PE_B: IPCQ Ctrl - participant TCM_B as PE_B: TCM - participant CPU_B as PE_B: PE_CPU - - Note over CPU_A: tl.send(dir="E", src=0x1000) - - CPU_A->>IPCQ_A: MMIO: send request - Note over IPCQ_A: Backpressure check:
(head - peer_tail_cache) < n_slots → PASS
Slot addr gen:
dst = peer_rx_base + (head%n) × slot_size - IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head} - Note over IPCQ_A: my_head++ - IPCQ_A-->>CPU_A: send returns (fire-and-forget) - - Note over DMA_A: TCM read → snapshot in read buffer
Flit pack: data + {sender_seq, dst_addr} - DMA_A->>NOC: IPCQ data flit(s) - - Note over NOC: hop latency + BW drain - - NOC->>DMA_B: IPCQ data flit(s) - Note over DMA_B: Terminal BW drain
Slot write latency - - rect rgb(255, 240, 220) - Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall - DMA_B->>TCM_B: write data → slot address - DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr} - end - - Note over IPCQ_B: Range match dst_addr → direction "W"
peer_head_cache["W"] = sender_seq + 1 - IPCQ_B-->>CPU_B: recv_wake signal - - Note over CPU_B: tl.recv(dir="W") wakes up - CPU_B->>IPCQ_B: recv request - Note over IPCQ_B: peer_head_cache > my_tail → YES
slot_addr = rx_base + (tail%n) × slot_size - IPCQ_B-->>CPU_B: return slot_addr - CPU_B->>TCM_B: read data from slot - Note over IPCQ_B: my_tail++ - - IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa} - Note over NOC: credit traversal (NoC latency) - NOC->>IPCQ_A: Credit arrival - - Note over IPCQ_A: Match dst_rx_base_pa → direction "E"
peer_tail_cache["E"] = consumer_seq
Backpressure deassert (if stalled) -``` - ---- - -## 6. 2nm Implementation Analysis - -### 6.1 Area Estimate - -| Module | Gate Count | Area (2nm est.) | Notes | -|--------|-----------|-----------------|-------| -| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops | -| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational | -| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators | -| **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **PE 전체 대비 < 0.1%** | -| DMA vc_comm 확장 | ~2K gates | 0.002 mm² | Flit pack/unpack | -| **Total 변경분** | **~14.6K** | **~0.006 mm²** | | - -### 6.2 Timing - -| Path | Delay (2nm est.) | Target Clock | Margin | -|------|-------------------|-------------|--------| -| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× | -| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× | -| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× | -| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× | - -모든 critical path가 1 cycle 이내. Timing closure 문제 없음. - -### 6.3 Power - -- Active: ~1 mW (register read/write + comparators, send/recv 동작 시) -- Idle: leakage only -- PE 전체 전력 대비 무시 가능 - -### 6.4 Constraints - -| 항목 | 제약 | 근거 | -|------|------|------| -| `n_slots` | **반드시 power-of-2** | mod → AND mask (1 gate). 임의 값은 divider 필요 (~10 cycles) | -| `slot_size` | **power-of-2 권장** | mul → barrel shift. 임의 값은 multiplier 필요 | -| TCM IPCQ region | **전용 bank 배치** | Compute access와 bank conflict 방지 | - ---- - -## 7. 
Risk Assessment - -### 7.1 TCM Bank Conflict - -- **Risk**: IPCQ slot write와 compute read가 동일 bank 접근 시 stall -- **Mitigation**: IPCQ region을 TCM 상위 address의 전용 bank에 배치 -- **Cost**: TCM banking flexibility 소폭 감소 -- **Severity**: Medium (성능 영향), Low (correctness 문제 아님) - -### 7.2 Credit Return Latency under Congestion - -- **Risk**: NoC 혼잡 시 credit return 지연 → sender backpressure stall -- **Mitigation**: - - Credit을 별도 VC로 분리 + strict priority (16B로 BW impact 미미) - - 또는 n_slots를 넉넉히(8+) 설정하여 credit 지연을 buffer로 흡수 -- **Severity**: Low (credit 16B는 congestion에 거의 기여하지 않음) - -### 7.3 Inter-Direction Ordering - -- **Risk**: 같은 PE에서 여러 방향으로 동시 send 시 순서 -- **Mitigation**: Per-direction monotonic seq으로 충분. Inter-direction ordering은 - kernel(소프트웨어) 책임 — 현재 시뮬레이터 모델과 동일 -- **Severity**: Low (아키텍처 설계에 의해 해소) - ---- - -## 8. Alternatives Considered - -### 8.1 Doorbell + Polling (전통적 방식) - -``` -Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell -Recv: Polling loop on doorbell, or interrupt-driven -``` - -| 장점 | 단점 | -|------|------| -| 단순한 HW (IPCQ controller 불필요) | 2번의 DMA transaction (data + doorbell) | -| 기존 DMA 재사용 | Data/doorbell 사이 ordering 보장 필요 (fence) | -| | Polling은 전력 낭비, interrupt는 latency overhead | - -**평가**: Piggyback 대비 latency 2-3× 증가. **불채택.** - -### 8.2 Hardware Message Queue (NVIDIA NVLink 스타일) - -``` -Send: CPU → HMQ에 descriptor push → HW가 peer HMQ로 자동 전달 -Recv: HMQ에서 descriptor pop → data pointer 확인 -``` - -| 장점 | 단점 | -|------|------| -| CPU는 descriptor만 작성 | 별도 HMQ engine 필요 (~0.05 mm²) | -| Descriptor/data 분리 → 유연 | DMA와 별개 datapath → area/power 중복 | -| | Large tensor에는 결국 DMA 필요 | - -**평가**: CCL의 large tensor 패턴에서 DMA 필수이므로 HMQ + DMA 이중 구조는 -면적 낭비. 
**불채택.** - -### 8.3 RDMA-style Completion Queue (CQ) - -``` -Send: DMA write → peer에 CQE 자동 생성 -Recv: CQ poll/interrupt → data 위치 확인 -``` - -| 장점 | 단점 | -|------|------| -| InfiniBand/RoCE 성숙 모델 | CQ 관리 logic + CQE memory overhead | -| Multi-tenant/isolation 용이 | CQE/data ordering 보장 추가 필요 | -| | PE-to-PE CCL에는 over-engineered | - -**평가**: RDMA CQ는 host-facing NIC의 multi-tenant 격리에 적합. -PE 간 단일 owner 환경에서는 불필요한 복잡성. **불채택.** - -### 8.4 Credit-in-Data Piggyback (v2 최적화 후보) - -현재 설계에서 credit return은 별도 16B packet이다. -Bidirectional 통신 패턴에서는 **reverse 방향 data flit에 credit을 합칠 수 있다.** - -``` -PE_A →E→ PE_B: data + sender_seq=3 -PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4 ← credit이 data에 합쳐짐 -``` - -| 장점 | 단점 | -|------|------| -| Credit 전용 packet 제거 → NoC BW 절약 | Unidirectional 패턴에서는 fallback 필요 | -| Bidirectional allreduce에서 credit latency → 0 | Flit header에 8B 추가 (overhead 미미) | -| | Logic 복잡도 소폭 증가 | - -**평가**: 현재 설계의 우수한 최적화. -Bidirectional allreduce에서 credit packet을 완전 제거 가능. -Standalone credit fallback도 유지. **v2로 채택 권고.** - ---- - -## 9. Recommendations - -1. **현재 IPCQ-DMA co-design을 기본 하드웨어 설계로 채택** - — 단순하고, 면적 효율적이며, 2nm에서 timing/power 문제 없음 - -2. **n_slots를 반드시 power-of-2로 제약** - — mod 연산을 AND mask로 대체, critical path 단축 - -3. **TCM banking에서 IPCQ region 전용 bank 할당** - — compute와의 bank conflict 방지 - -4. **v2에서 Credit-in-Data Piggyback (Section 8.4) 추가 검토** - — bidirectional 패턴에서 credit overhead 제거 - ---- - -## 10. Open Questions - -- [ ] IPCQ slot region size를 TCM의 몇 %까지 허용할 것인가? (현재 가정: ~1MB / 16MB = 6.25%) -- [ ] Credit VC를 별도로 둘 것인가, vc_comm에 multiplexing할 것인가? -- [ ] Inter-SIP link에서의 flit format 호환성 검증 필요 -- [ ] n_slots 최대값 제한? 
(8 directions × 8 slots × 64KB = 4MB → TCM의 25%) diff --git a/docs/ccl-author-guide.en.md b/docs/onboarding/ccl-author-guide.en.md similarity index 99% rename from docs/ccl-author-guide.en.md rename to docs/onboarding/ccl-author-guide.en.md index 7fd38e1..3c46885 100644 --- a/docs/ccl-author-guide.en.md +++ b/docs/onboarding/ccl-author-guide.en.md @@ -582,7 +582,7 @@ If you add a new algorithm or pattern, please send a PR. - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design. - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1). - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution. -- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor. +- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model. Existing algorithm examples: diff --git a/docs/ccl-author-guide.md b/docs/onboarding/ccl-author-guide.md similarity index 99% rename from docs/ccl-author-guide.md rename to docs/onboarding/ccl-author-guide.md index d785f24..0603826 100644 --- a/docs/ccl-author-guide.md +++ b/docs/onboarding/ccl-author-guide.md @@ -527,7 +527,7 @@ direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (tok - [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계 - [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1) - [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution -- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor +- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model 기존 알고리즘 예제: diff --git a/docs/di-presentation.md b/docs/onboarding/di-presentation.md similarity index 100% rename from docs/di-presentation.md rename to docs/onboarding/di-presentation.md diff --git a/docs/hw-architecture-overview.md b/docs/onboarding/hw-architecture-overview.md similarity index 100% rename from docs/hw-architecture-overview.md rename to 
docs/onboarding/hw-architecture-overview.md diff --git a/docs/latency-model.md b/docs/onboarding/latency-model.md similarity index 100% rename from docs/latency-model.md rename to docs/onboarding/latency-model.md diff --git a/src/kernbench/ccl/__init__.py b/src/kernbench/ccl/__init__.py index aa60e46..d4494a8 100644 --- a/src/kernbench/ccl/__init__.py +++ b/src/kernbench/ccl/__init__.py @@ -5,5 +5,5 @@ This package provides: - helpers: utilities for algorithm authors (chunked, ring_step, ...) - testing: mock CCL runtime for fast unit tests of algorithm kernels -See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md. +See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md. """ diff --git a/src/kernbench/common/pe_commands.py b/src/kernbench/common/pe_commands.py index 1c47c4f..02a32d4 100644 --- a/src/kernbench/common/pe_commands.py +++ b/src/kernbench/common/pe_commands.py @@ -24,7 +24,7 @@ class Scope(Enum): @dataclass(frozen=True) class OpSpec: - """One operation in a multi-op composite (head + epilogue, ADR-0021). + """One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3). The head op (first in CompositeCmd.ops) defines tile geometry; subsequent ops are epilogue stages whose ``scope`` controls how often they fire @@ -156,7 +156,7 @@ class CompositeCmd: out_nbytes: int math_op: str | None = None # for op="math": which math operation data_op: bool = True - # Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the + # Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the # head and ops[1:] are epilogue stages with explicit scope. When empty, # the legacy single-op semantics (op/a/b/math_op) apply. ops: tuple[OpSpec, ...] 
= () diff --git a/src/kernbench/components/builtin/hbm_ctrl.py b/src/kernbench/components/builtin/hbm_ctrl.py index 09bcaa6..f45d7c7 100644 --- a/src/kernbench/components/builtin/hbm_ctrl.py +++ b/src/kernbench/components/builtin/hbm_ctrl.py @@ -15,7 +15,7 @@ if TYPE_CHECKING: class HbmCtrlComponent(ComponentBase): - """HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033). + """HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033). Stateless per-PC ``available_at`` array; each incoming transaction is split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin diff --git a/src/kernbench/components/builtin/m_cpu.py b/src/kernbench/components/builtin/m_cpu.py index 740a272..d8fffaf 100644 --- a/src/kernbench/components/builtin/m_cpu.py +++ b/src/kernbench/components/builtin/m_cpu.py @@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase): def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]: """Return list of HBM destination node_ids for DMA fan-out. - With single hbm_ctrl per cube (ADR-0019), always returns one node. - PA-based resolution still used for cross-cube routing. + The PA-based resolver maps each address to one per-PE + ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly + one node. Cross-cube routing uses the same resolution. """ cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0" diff --git a/src/kernbench/components/builtin/pe_dma.py b/src/kernbench/components/builtin/pe_dma.py index ed3de9e..a416fa5 100644 --- a/src/kernbench/components/builtin/pe_dma.py +++ b/src/kernbench/components/builtin/pe_dma.py @@ -17,9 +17,11 @@ if TYPE_CHECKING: class PeDmaComponent(PeEngineBase): """PE_DMA: dual-channel DMA engine with READ and WRITE resources. - Each channel has capacity=1 (ADR-0014 D4): + Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4): - DMA_READ and DMA_WRITE may execute concurrently. 
- Multiple READs cannot overlap; multiple WRITEs cannot overlap. + The orthogonal vc_comm channel for IPCQ traffic is defined in + ADR-0023 D8. Handles two message types: - Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA) diff --git a/src/kernbench/components/builtin/pe_fetch_store.py b/src/kernbench/components/builtin/pe_fetch_store.py index 3d65e2c..5e3f34c 100644 --- a/src/kernbench/components/builtin/pe_fetch_store.py +++ b/src/kernbench/components/builtin/pe_fetch_store.py @@ -1,4 +1,4 @@ -"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5). +"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1). Handles both fetch (TCM → register) and store (register → TCM). BW serialization is delegated to PE_TCM via port communication. @@ -18,7 +18,7 @@ if TYPE_CHECKING: class PeFetchStoreComponent(PeEngineBase): - """PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5). + """PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1). Receives TileTokens via pipeline self-routing. Sends TcmRequest to PE_TCM for BW-based latency. diff --git a/src/kernbench/components/builtin/pe_gemm.py b/src/kernbench/components/builtin/pe_gemm.py index 718d130..ea5a916 100644 --- a/src/kernbench/components/builtin/pe_gemm.py +++ b/src/kernbench/components/builtin/pe_gemm.py @@ -1,4 +1,4 @@ -"""PE_GEMM: matrix multiplication engine (ADR-0021 D6). +"""PE_GEMM: matrix multiplication engine (ADR-0014 D1). Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken. In pipeline mode, receives token after fetch stage, computes MAC, chains to next. @@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = { class PeGemmComponent(PeEngineBase): - """PE_GEMM: MAC array (ADR-0021 D6). + """PE_GEMM: MAC array (ADR-0014 D1). In pipeline mode: pure compute — register data already fetched. In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot. 
diff --git a/src/kernbench/components/builtin/pe_math.py b/src/kernbench/components/builtin/pe_math.py index cf5bcf9..eeb819e 100644 --- a/src/kernbench/components/builtin/pe_math.py +++ b/src/kernbench/components/builtin/pe_math.py @@ -1,4 +1,4 @@ -"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6). +"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1). Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken. In pipeline mode, receives token after fetch stage, computes SIMD, chains to next. @@ -24,7 +24,7 @@ if TYPE_CHECKING: class PeMathComponent(PeEngineBase): - """PE_MATH: SIMD/Vector unit (ADR-0021 D6). + """PE_MATH: SIMD/Vector unit (ADR-0014 D1). In pipeline mode: pure compute — register data already fetched. In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot. diff --git a/src/kernbench/components/builtin/pe_scheduler.py b/src/kernbench/components/builtin/pe_scheduler.py index 994acfb..87c73b9 100644 --- a/src/kernbench/components/builtin/pe_scheduler.py +++ b/src/kernbench/components/builtin/pe_scheduler.py @@ -1,10 +1,10 @@ -"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2). +"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6). Receives PeInternalTxn from PE_CPU, routes to engines: - Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine - CompositeCmd → generate TilePlan, feed tiles via _feed_loop -Composite pipeline uses token self-routing (ADR-0021 D4): +Composite pipeline uses token self-routing (ADR-0014 D6): Scheduler only does initial dispatch + completion tracking. Tiles chain through components based on their plan's stage sequence. """ @@ -24,7 +24,7 @@ if TYPE_CHECKING: class PeSchedulerComponent(ComponentBase): - """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2). + """PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6). Simple commands are forwarded to the appropriate engine. 
CompositeCmd creates a TilePlan and feeds tiles into the pipeline. @@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase): def _dispatch_composite( self, env: simpy.Environment, pe_txn: Any, cmd: Any, ) -> Generator: - """Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4).""" + """Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6).""" from kernbench.components.builtin.pe_types import PipelineContext plan = self._generate_plan(cmd) @@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase): yield self._pending_feeds.put((plan, ctx)) def _feed_loop(self, env: simpy.Environment) -> Generator: - """Single feeder process: FIFO command ordering (ADR-0021 D2). + """Single feeder process: FIFO command ordering (ADR-0014 D6). No tile feed interleaving between commands. Queue full → only this process blocks. diff --git a/src/kernbench/components/builtin/pe_tcm.py b/src/kernbench/components/builtin/pe_tcm.py index dfe940e..dfb2b36 100644 --- a/src/kernbench/components/builtin/pe_tcm.py +++ b/src/kernbench/components/builtin/pe_tcm.py @@ -1,4 +1,4 @@ -"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021). +"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1). Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access. @@ -32,7 +32,7 @@ class TcmRequest: class PeTcmComponent(ComponentBase): - """PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1). + """PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1). Dual-channel: read and write can proceed in parallel, but concurrent reads serialize, concurrent writes serialize. 
diff --git a/src/kernbench/components/builtin/pe_types.py b/src/kernbench/components/builtin/pe_types.py index 77b92bb..64cfb26 100644 --- a/src/kernbench/components/builtin/pe_types.py +++ b/src/kernbench/components/builtin/pe_types.py @@ -1,4 +1,4 @@ -"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext. +"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext. These types are used by the PE_SCHEDULER and all PE engine components for tile-based pipeline execution with self-routing. @@ -84,7 +84,7 @@ class PipelineContext: @dataclass class TileToken: - """Self-routing tile token passed between PE components (ADR-0021 D9). + """Self-routing tile token passed between PE components (ADR-0014 D6). Single-owner: only one component holds this token at any time. params is a cache of plan.stages[stage_idx].params (canonical source). diff --git a/src/kernbench/components/builtin/tiling.py b/src/kernbench/components/builtin/tiling.py index 88884a6..0704ea3 100644 --- a/src/kernbench/components/builtin/tiling.py +++ b/src/kernbench/components/builtin/tiling.py @@ -1,4 +1,4 @@ -"""Tile plan generators for PE pipeline (ADR-0021). +"""Tile plan generators for PE pipeline (ADR-0014 D6). Generates TilePlan with stage sequences for GEMM and Math operations. Ported from pe_accel tiling.py with stage-based plan structure. diff --git a/src/kernbench/components/legacy/__init__.py b/src/kernbench/components/legacy/__init__.py index 03d0854..948d652 100644 --- a/src/kernbench/components/legacy/__init__.py +++ b/src/kernbench/components/legacy/__init__.py @@ -1,2 +1,2 @@ # Legacy component backups — not actively used. -# Kept for reference during ADR-0021 migration. +# Kept for reference during the PE pipeline refactor (ADR-0014). 
diff --git a/src/kernbench/components/legacy/builtin/m_cpu.py b/src/kernbench/components/legacy/builtin/m_cpu.py index b8e928c..b69b248 100644 --- a/src/kernbench/components/legacy/builtin/m_cpu.py +++ b/src/kernbench/components/legacy/builtin/m_cpu.py @@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase): def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]: """Return list of HBM destination node_ids for DMA fan-out. - With single hbm_ctrl per cube (ADR-0019), always returns one node. - PA-based resolution still used for cross-cube routing. + The PA-based resolver maps each address to one per-PE + ``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly + one node. Cross-cube routing uses the same resolution. """ cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0" diff --git a/src/kernbench/policy/address/phyaddr.py b/src/kernbench/policy/address/phyaddr.py index 4394ba6..a1e566e 100644 --- a/src/kernbench/policy/address/phyaddr.py +++ b/src/kernbench/policy/address/phyaddr.py @@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37 _AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM # HBM-offset bit layout for PC (pseudo-channel) striping -# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P +# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P # configured at hbm_ctrl, the PC index is derived from hbm_offset as # pc_shift = B; pc_mask = (1 << P) - 1 # pc = (hbm_offset >> pc_shift) & pc_mask diff --git a/src/kernbench/policy/routing/router.py b/src/kernbench/policy/routing/router.py index ca1e105..0869079 100644 --- a/src/kernbench/policy/routing/router.py +++ b/src/kernbench/policy/routing/router.py @@ -35,7 +35,7 @@ class AddressResolver: def __init__(self, graph: TopologyGraph) -> None: self._node_ids = set(graph.nodes) # HBM slice size (bytes) — used to decode pe_id from hbm_offset - # so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4). + # so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9). 
mm = graph.spec.get("cube", {}).get("memory_map", {}) hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48)) slices_per_cube = int(mm.get("hbm_slices_per_cube", 8)) @@ -129,7 +129,7 @@ class PathRouter: Otherwise the cube's own UCIe port appears as a zero-distance bus that Dijkstra prefers over the mesh — that is intended only for cross-cube routing. Local PE_DMA must traverse the mesh so - cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4). + cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7). """ start = f"{src_pe}.pe_dma" adj = self._adj_local if _same_cube(start, dst_node) else self._adj @@ -137,13 +137,13 @@ class PathRouter: def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]: """Match find_path's cube-local routing so reported distance reflects - the actual chosen path (ADR-0019 D4).""" + the actual chosen path (ADR-0017 D7).""" start = f"{src_pe}.pe_dma" adj = self._adj_local if _same_cube(start, dst_node) else self._adj return self._run_dijkstra_with_dist(adj, start, dst_node) def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]: - """M_CPU DMA path: routes through router mesh (ADR-0019). + """M_CPU DMA path: routes through router mesh (ADR-0017). Same-cube: uses _adj_local (no UCIe) to stay within mesh. Cross-cube: uses _adj_all to route via UCIe. diff --git a/src/kernbench/runtime_api/context.py b/src/kernbench/runtime_api/context.py index d339875..6e94388 100644 --- a/src/kernbench/runtime_api/context.py +++ b/src/kernbench/runtime_api/context.py @@ -58,7 +58,7 @@ def _get_active_context(): class _AhbmNamespace: - """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10). + """torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3). Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. 
KernBench's backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be @@ -124,7 +124,7 @@ class RuntimeContext: dc = DistributedContext() dc._ctx_ref = self # back-reference for AhbmCCLBackend to reach ctx.launch etc. self.distributed = dc - # ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator + # ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator # (PyTorch 2.x portable) namespaces for per-greenlet device binding. self.ahbm = _AhbmNamespace() self.accelerator = _AcceleratorNamespace(self.ahbm) @@ -472,7 +472,7 @@ class RuntimeContext: eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes # ADR-0026 D4: resolve structural coords directly at resolve time. - # ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target + # ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target # SIP; if unset, fall back to SIP 0 for single-driver compatibility. current_sip = ( self.ahbm.current_device() if hasattr(self, "ahbm") else None @@ -619,7 +619,7 @@ class RuntimeContext: Creates per-SIP KernelLaunchMsg with local va_base per tensor (like host driver sending per-rank launch commands). - When ``_defer_wait=True`` (ADR-0024 D7), returns the list of + When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of ``(handle, sip_id, meta)`` tuples instead of waiting. Caller is responsible for waiting — used by collective ops to yield between submit and wait so all sibling ranks can submit first. @@ -786,7 +786,7 @@ class RuntimeContext: last_handle = h if _defer_wait: - # ADR-0024 D7: return the pending-list so the caller can yield + # ADR-0027 D0.4: return the pending-list so the caller can yield # between submit and drain. Used by collective ops that need # all sibling ranks to submit before any rank waits. 
return [ diff --git a/src/kernbench/runtime_api/distributed.py b/src/kernbench/runtime_api/distributed.py index a56086f..3d5f4b6 100644 --- a/src/kernbench/runtime_api/distributed.py +++ b/src/kernbench/runtime_api/distributed.py @@ -178,7 +178,7 @@ class DistributedContext: def __init__(self) -> None: self._backend: AhbmCCLBackend | None = None - # ADR-0024 D9: greenlet-local rank registry. Bench launcher calls + # ADR-0024 D2: greenlet-local rank registry. Bench launcher calls # _bind_rank(g, rank) when spawning workers; get_rank() resolves the # current greenlet to its rank. Unbound greenlets fall back to 0 for # single-driver test compat. @@ -220,7 +220,7 @@ class DistributedContext: def get_rank(self) -> int: """Return the rank bound to the current greenlet (default 0). - ADR-0024 D9: workers spawned by the bench launcher each get a rank + ADR-0024 D2: workers spawned by the bench launcher each get a rank registered via ``_bind_rank``. Callers outside any bound greenlet fall back to rank 0 for single-driver test compat. """ @@ -230,7 +230,7 @@ class DistributedContext: return int(self._rank_by_greenlet.get(g, 0)) def _bind_rank(self, g: Any, rank: int) -> None: - """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9).""" + """Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2).""" self._rank_by_greenlet[g] = int(rank) def get_backend(self) -> str: diff --git a/src/kernbench/runtime_api/multiprocessing.py b/src/kernbench/runtime_api/multiprocessing.py index 53994f2..0b50fae 100644 --- a/src/kernbench/runtime_api/multiprocessing.py +++ b/src/kernbench/runtime_api/multiprocessing.py @@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None: # Populate _completed so fast-path in ctx.wait short-circuits # on the return leg. ctx._completed.add(h) - # (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)). + # (b) Collective backend queue (ADR-0027 D0.4-(2)). 
if backend is not None: pending_list = getattr(backend, "_pending_collective_handles", None) if pending_list is not None: diff --git a/src/kernbench/sim_engine/op_log.py b/src/kernbench/sim_engine/op_log.py index acc0d5d..51d2d30 100644 --- a/src/kernbench/sim_engine/op_log.py +++ b/src/kernbench/sim_engine/op_log.py @@ -51,7 +51,7 @@ class OpLogger: record_end fires. """ snap: dict[str, Any] = {} - # TileToken (ADR-0021 pipeline) — capture which stage this is and its + # TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its # per-stage params (e.g. op_kind/scope for epilogue MATH stages) so # we can recover them at record_end even after the token advances. try: diff --git a/src/kernbench/topology/builder.py b/src/kernbench/topology/builder.py index 8516b3b..f33f1ed 100644 --- a/src/kernbench/topology/builder.py +++ b/src/kernbench/topology/builder.py @@ -356,7 +356,7 @@ def _instantiate_cube( ) -> None: """Add all cube-internal nodes and edges, including PE instances. - Topology: explicit router mesh from cube_mesh.yaml (ADR-0019). + Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1). Each router is a separate SimPy node. Components attach to routers based on cube_mesh.yaml attachment lists. """ @@ -367,10 +367,10 @@ def _instantiate_cube( clinks = cube["links"] mm = cube["memory_map"] - # ── Mode branch (ADR-0019) ── + # ── Mode branch (ADR-0017 D8) ── mode = mm.get("hbm_mapping_mode", "n_to_one") if mode == "one_to_one": - raise NotImplementedError("1:1 mode: ADR-0019 D3") + raise NotImplementedError("1:1 mode: ADR-0017 D8") # ── UCIe ports + connection nodes ── ucie_cfg = cube["ucie"] @@ -404,11 +404,10 @@ def _instantiate_cube( label=name.upper().replace("_", " "), ) - # ── Per-PE HBM controller (ADR-0019 D1/D4) ── + # ── Per-PE HBM controller (ADR-0017 D4) ── # Each PE owns one slice of the cube's HBM. 
The slice has its own # set of pseudo-channels and is reachable ONLY through that PE's # attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists). - # Restored after the ADR-0019 over-consolidation in commit 5917b34. hbm_spec = cube["components"]["hbm_ctrl"] hbm_lx, hbm_ly = local_pos["hbm_ctrl"] _hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0)) @@ -425,7 +424,7 @@ def _instantiate_cube( label=f"HBM CTRL pe{pe_idx}", ) - # ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ── + # ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ── routers = mesh_data["routers"] router_spec = cube["components"]["noc_router"] router_bw = clinks.get("router_link_bw_gbs", 256.0) @@ -573,7 +572,7 @@ def _instantiate_cube( )) elif item.endswith(".hbm"): # peX.hbm: router rXcY owns the entry to hbm_ctrl.peX. - # (ADR-0019 D1/D4 — per-PE HBM partitioning.) + # (ADR-0017 D4 — per-PE HBM partitioning.) pe_prefix = item.rsplit(".", 1)[0] pe_idx = int(pe_prefix.replace("pe", "")) pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}" @@ -645,13 +644,12 @@ def _instantiate_cube( )) # NOTE: HBM↔router edges are created in the per-router attach loop - # above (peX.hbm items map router → hbm_ctrl.peX). Removed the - # legacy "all routers → single hbm_ctrl" loop that bypassed the - # ADR-0019 D4 per-PE partition. + # above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4 + # for the per-PE partition contract. 
def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None: - """Add PE-internal edges for a single PE instance (ADR-0021).""" + """Add PE-internal edges for a single PE instance (ADR-0014 D8).""" edges.append(Edge( src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler", distance_mm=pe_links["pe_cpu_to_scheduler_mm"], @@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None: kind="pe_internal", )) - # Fetch/Store → TCM (ADR-0021 D5) + # Fetch/Store → TCM (ADR-0014 D5) if "fetch_store_to_tcm_mm" in pe_links: edges.append(Edge( src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm", @@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None: kind="pe_internal", )) - # Chaining edges (ADR-0021 D4 — token self-routing) + # Chaining edges (ADR-0014 D6 — token self-routing) chaining = [ ("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"), ("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"), diff --git a/src/kernbench/tp/layers.py b/src/kernbench/tp/layers.py index 27e0bd8..5d66429 100644 --- a/src/kernbench/tp/layers.py +++ b/src/kernbench/tp/layers.py @@ -6,7 +6,7 @@ forward(x) ends with ``dist.all_reduce`` to sum partial products. Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard -ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10). +ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3). Yield-safety contract (ADR-0027 D4/D5): every forward path contains at least one ``ctx.wait`` (via ``torch.launch``) or one collective; this @@ -53,7 +53,7 @@ class ColumnParallelLinear: self.k_local = out_features // ws self.dtype = dtype self._torch = torch - # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places + # Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places # it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026). 
self.weight = torch.zeros( (in_features, self.k_local), diff --git a/src/kernbench/tp/parallel_state.py b/src/kernbench/tp/parallel_state.py index 2952f83..3aa3dbb 100644 --- a/src/kernbench/tp/parallel_state.py +++ b/src/kernbench/tp/parallel_state.py @@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int: """Return this worker's rank within the TP group. Delegates to the greenlet-local rank registered by the spawn launcher - (ADR-0024 D9 via ``torch.distributed.get_rank``). + (ADR-0024 D2 via ``torch.distributed.get_rank``). """ # Resolve via the global torch.distributed facade on the active ctx. return _current_rank() diff --git a/tests/test_e2e_pipeline.py b/tests/test_e2e_pipeline.py index aa35685..39aa01e 100644 --- a/tests/test_e2e_pipeline.py +++ b/tests/test_e2e_pipeline.py @@ -1,4 +1,4 @@ -"""End-to-end pipeline tests (ADR-0020 + ADR-0021). +"""End-to-end pipeline tests (ADR-0020 + ADR-0014). Verifies: 1. Actual benchmark kernel → greenlet mode → op_log → DataExecutor → accuracy diff --git a/tests/test_hbm_pc_striping.py b/tests/test_hbm_pc_striping.py index 535c45f..5a0a7ee 100644 --- a/tests/test_hbm_pc_striping.py +++ b/tests/test_hbm_pc_striping.py @@ -68,7 +68,7 @@ def _path_drain_for_write(eng: GraphEngine, msg: MemoryWriteMsg) -> float: def test_builder_derives_pc_bw_gbs(): """Topology builder must inject `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs` - as an attr on every hbm_ctrl node. Enforces ADR-0019 D9 invariant + as an attr on every hbm_ctrl node. Enforces ADR-0017 D8 invariant (channels_per_PE × per-PC BW = aggregated link BW) at build time. 
""" handle = resolve_topology(str(TOPOLOGY_PATH)) diff --git a/tests/test_ipcq_buffer_kind_locations.py b/tests/test_ipcq_buffer_kind_locations.py index 912b3a5..b7b4270 100644 --- a/tests/test_ipcq_buffer_kind_locations.py +++ b/tests/test_ipcq_buffer_kind_locations.py @@ -192,13 +192,10 @@ def test_hbm_pe_hop_charged_at_large_payload(tmp_path): chunk of latency from the PE↔HBM hop on send and recv, so the total HBM/TCM gap should clearly clear the threshold below. - Threshold history: the gap was 4 µs under the over-consolidated - single-hbm_ctrl model (commit 5917b34), inflated by serialization - on the shared HBM controller. With ADR-0019 D1 per-PE HBM CTRL - restored, each PE's slice runs on its own controller with no - cross-PE contention, so the IPCQ pattern (each PE writes its own - slice) drops the gap to ≈ 1.7 µs — still well above the bare - slot-IO term, confirming the PE↔HBM hop is being charged. + Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own + controller with no cross-PE contention, so the IPCQ pattern (each + PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the + bare slot-IO term, confirming the PE↔HBM hop is being charged. """ n_elem = 16384 # 32 KB / PE lat_tcm = _run_allreduce_with_buffer_kind( diff --git a/tests/test_noc_mesh.py b/tests/test_noc_mesh.py index bb9821f..2bb5666 100644 --- a/tests/test_noc_mesh.py +++ b/tests/test_noc_mesh.py @@ -1,4 +1,4 @@ -"""Tests for CUBE NOC Explicit Router Mesh (ADR-0019). +"""Tests for CUBE NOC Explicit Router Mesh (ADR-0017). 
Key changes verified: - Explicit router nodes per cube from cube_mesh.yaml (6×6 grid) @@ -125,14 +125,14 @@ def test_mesh_file_pe_corner_positions(): def test_mesh_file_no_xbar_section(): - """mesh output must not contain xbar section (ADR-0019 D2).""" + """mesh output must not contain xbar section (ADR-0017 D1).""" _graph() mesh = yaml.safe_load(MESH_PATH.read_text()) assert "xbar" not in mesh, "xbar section should be removed from cube_mesh.yaml" def test_mesh_file_pe_hbm_attached(): - """PE routers must have pe{idx}.hbm in attach list (ADR-0019 D1).""" + """PE routers must have pe{idx}.hbm in attach list (ADR-0017 D4).""" _graph() mesh = yaml.safe_load(MESH_PATH.read_text()) for rid, rdata in mesh["routers"].items(): @@ -235,7 +235,7 @@ def test_mesh_ucie_all_four_directions(): # ══════════════════════════════════════════════════════════════════ -# 2. Topology Graph: Explicit Router Mesh (ADR-0019) +# 2. Topology Graph: Explicit Router Mesh (ADR-0017) # ══════════════════════════════════════════════════════════════════ @@ -247,7 +247,7 @@ def test_router_nodes_exist(): def test_no_xbar_or_bridge_nodes(): - """xbar/bridge nodes must not exist (ADR-0019 D2).""" + """xbar/bridge nodes must not exist (ADR-0017 D1).""" graph = _graph() bad = [n for n in graph.nodes if "xbar" in n or "bridge" in n] assert len(bad) == 0, f"Old xbar/bridge nodes found: {bad[:5]}" @@ -260,11 +260,10 @@ def test_no_single_noc_node(): def test_per_pe_hbm_ctrl_nodes(): - """Each cube has 8 per-PE HBM CTRL instances (ADR-0019 D1). + """Each cube has 8 per-PE HBM CTRL instances (ADR-0017 D4). - Restored from over-consolidation in commit 5917b34. The legacy - single ``sip0.cube0.hbm_ctrl`` is gone; each PE owns its own - ``hbm_ctrl.pe{X}`` reachable through that PE's attaching router. + Each PE owns its own ``hbm_ctrl.pe{X}`` reachable through that PE's + attaching router. No cube-wide single ``hbm_ctrl`` node exists. 
""" graph = _graph() for pe in range(8): @@ -272,7 +271,7 @@ def test_per_pe_hbm_ctrl_nodes(): # Legacy single hbm_ctrl must not exist legacy_id = "sip0.cube0.hbm_ctrl" assert legacy_id not in graph.nodes, ( - f"legacy {legacy_id} must be removed (per-PE partitioning, ADR-0019 D1)" + f"legacy {legacy_id} must not exist (per-PE partitioning, ADR-0017 D4)" ) @@ -297,9 +296,7 @@ def test_pe_dma_connects_to_router(): def test_each_hbm_ctrl_connects_only_to_owning_router(): """Each ``hbm_ctrl.pe{X}`` must have exactly one router edge (router_to_hbm + hbm_to_router) to its owning PE's attaching - router (ADR-0019 D4). Replaces a prior test that asserted the - single hbm_ctrl was connected to all routers — that asserted the - spec-violating consolidation introduced in commit 5917b34. + router (ADR-0017 D7). """ graph = _graph() pe_router = {0: "r0c0", 1: "r0c1", 2: "r1c4", 3: "r1c5", @@ -513,7 +510,7 @@ def test_null_routers_excluded(): # ══════════════════════════════════════════════════════════════════ -# 7. Router Mesh Latency (ADR-0019) +# 7. Router Mesh Latency (ADR-0017) # ══════════════════════════════════════════════════════════════════ diff --git a/tests/test_pe_pipeline.py b/tests/test_pe_pipeline.py index 2f404e0..2b3dc87 100644 --- a/tests/test_pe_pipeline.py +++ b/tests/test_pe_pipeline.py @@ -1,4 +1,4 @@ -"""Tests for ADR-0021 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy. +"""Tests for ADR-0014 D6 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy. Test plan items: 3. Phase 1 → Phase 2 end-to-end (op_log → DataExecutor → verify) diff --git a/tests/test_per_pe_hbm_partition.py b/tests/test_per_pe_hbm_partition.py index 930d042..15314a1 100644 --- a/tests/test_per_pe_hbm_partition.py +++ b/tests/test_per_pe_hbm_partition.py @@ -1,18 +1,13 @@ -"""Tests for ADR-0019 D1/D4 per-PE HBM partitioning. +"""Tests for ADR-0017 D4/D7 per-PE HBM partitioning. 
-Restores the architectural property that was lost in commit 5917b34 -(2026-04-04 "Replace xbar/bridge/single-NOC with explicit router mesh"), -which over-consolidated 8 per-slice HBM CTRL nodes into one cube-wide -HBM CTRL connected to every router. ADR-0019 D1/D4 specifies: +ADR-0017 D4/D7 specifies: - Each PE owns 8 of the cube's 64 pseudo-channels (PE_X → PCs 8X..8X+7). - HBM CTRL is split per-PE: ``hbm_ctrl.pe{X}`` is reachable ONLY through PE_X's attaching router. Accessing PE_Y's slice from PE_X requires mesh routing to r_Y_attach before entering hbm_ctrl.pe{Y}. -These tests are written BEFORE the production change and are expected -to FAIL on current code (HBM CTRL is a single ``hbm_ctrl`` node attached -to all routers). Phase 2 must make them PASS without weakening +These tests enforce that property without weakening assertions. """ from __future__ import annotations @@ -66,16 +61,16 @@ def test_topology_has_8_hbm_ctrl_per_cube(): for pe in range(8): nid = f"sip0.cube0.hbm_ctrl.pe{pe}" assert nid in graph.nodes, ( - f"Expected per-PE HBM CTRL node {nid!r} (ADR-0019 D1)" + f"Expected per-PE HBM CTRL node {nid!r} (ADR-0017 D4)" ) node = graph.nodes[nid] assert int(node.attrs.get("num_pcs", 0)) == 8, ( f"{nid} must have num_pcs=8; got {node.attrs.get('num_pcs')}" ) - # Legacy single hbm_ctrl must not exist + # Cube-wide single hbm_ctrl must not exist assert "sip0.cube0.hbm_ctrl" not in graph.nodes, ( - "Legacy single sip0.cube0.hbm_ctrl must be removed in favor of " - "per-PE hbm_ctrl.pe{X} (ADR-0019 D1)" + "Cube-wide single sip0.cube0.hbm_ctrl must not exist; only " + "per-PE hbm_ctrl.pe{X} (ADR-0017 D4)" ) @@ -199,10 +194,8 @@ def test_probe_cli_intra_cube_cases_are_monotonic(): """Probe CLI cases must show monotonic latency: pe-local-hbm < pe-same-half-hbm < pe-cross-half-hbm. - Prior to per-PE partitioning these three return identical latency - because all roads lead to the same hbm_ctrl. 
With ADR-0019 D4 - restored, same-half (pe0→pe1) is 1 mesh hop further than local, - and cross-half (pe0→pe4) is several hops further. + Per ADR-0017 D7, same-half (pe0→pe1) is 1 mesh hop further than + local, and cross-half (pe0→pe4) is several hops further. """ graph = _graph() spec = graph.spec diff --git a/tests/test_routing.py b/tests/test_routing.py index cda2943..4633a25 100644 --- a/tests/test_routing.py +++ b/tests/test_routing.py @@ -17,7 +17,7 @@ def _graph(): def test_resolve_hbm_addr(): - """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0019 D1).""" + """HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0017 D9).""" g = _graph() resolver = AddressResolver(g) # offset 0x1000 falls inside PE0's slice (slice_size = 6 GB) @@ -102,16 +102,13 @@ def test_path_remote_pe_hbm(): assert not any("xbar" in n or "bridge" in n for n in path) -# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0019 D4) ─ +# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0017 D7) ─ def test_cross_pe_hbm_distance_increases_with_mesh_hops(): - """Restored ADR-0019 D4 behavior: accessing another PE's HBM slice - must take more routing distance than accessing one's own slice, - because each per-PE hbm_ctrl is reachable only via its PE's router. - - Replaces a previous ``test_all_pe_hbm_equidistant`` that asserted the - over-consolidated (spec-violating) behavior introduced in 5917b34. + """ADR-0017 D7: accessing another PE's HBM slice must take more + routing distance than accessing one's own slice, because each + per-PE hbm_ctrl is reachable only via its PE's router. 
""" g = _graph() router = PathRouter(g) diff --git a/tests/test_topology_compile.py b/tests/test_topology_compile.py index 77fe943..a66174d 100644 --- a/tests/test_topology_compile.py +++ b/tests/test_topology_compile.py @@ -21,7 +21,7 @@ def test_full_graph_node_count(): # + 20 ucie (4 ports x (1 port + 4 conn)) # + 8 PEs x 9 pe_comps)) (ADR-0023: +pe_ipcq) # IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23 - # cube: 32 + 10 + 20 + 72 = 134 (was 127; ADR-0019 D1 per-PE HBM CTRL) + # cube: 32 + 10 + 20 + 72 = 134 (per-PE HBM CTRL, ADR-0017 D4) # = 1 + 2*(23 + 16*134) = 1 + 2*(23+2144) = 1 + 4334 = 4335 assert len(g.nodes) == 4335 @@ -29,9 +29,9 @@ def test_full_graph_node_count(): def test_full_graph_edge_count(): g = _graph() # ADR-0023: +3 IPCQ edges per PE - # ADR-0019 D1 (restored): HBM↔router edges drop from 32 routers × 2 - # to 8 PE-routers × 2 per cube. 32 cubes × (16-64) = -1536 edges. - # Multi-op composite (ADR-0021): +1 gemm→math edge per PE for + # ADR-0017 D4: HBM↔router edges = 8 PE-routers × 2 per cube + # (per-PE partition; not all 32 routers). + # Multi-op composite (ADR-0014 D3.3): +1 gemm→math edge per PE for # epilogue chaining = 2 SIPs × 16 cubes × 8 PEs = +256 edges. 
assert len(g.edges) == 12412 @@ -73,7 +73,7 @@ def test_cube_component_nodes_exist(): # Null holes must not exist for null_rc in ("r2c2", "r2c3", "r3c2", "r3c3"): assert f"{cp}.{null_rc}" not in g.nodes - # Per-PE HBM CTRL (ADR-0019 D1) — 8 instances, no legacy single node + # Per-PE HBM CTRL (ADR-0017 D4) — 8 instances; no cube-wide single node for pe in range(8): nid = f"{cp}.hbm_ctrl.pe{pe}" assert g.nodes[nid].kind == "hbm_ctrl" @@ -94,7 +94,7 @@ def test_pe_component_nodes_exist(): def test_hbm_ctrl_at_cube_center(): g = _graph() - # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0019 D1) + # Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0017 D4) # cube0 origin = (0, 0), hbm at (6.5, 7.0) for pe in range(8): node = g.nodes[f"sip0.cube0.hbm_ctrl.pe{pe}"] @@ -190,8 +190,7 @@ def test_pe_internal_edges(): def test_per_pe_hbm_ctrl_connects_only_to_owning_router(): """Each hbm_ctrl.pe{X} connects ONLY to PE_X's attaching router - (ADR-0019 D4). Replaces a prior test that asserted the - spec-violating all-routers consolidation (commit 5917b34).""" + (ADR-0017 D7).""" g = _graph() es = _edge_set(g) cp = "sip0.cube0" diff --git a/tests/test_tp_parallel_state.py b/tests/test_tp_parallel_state.py index de2aae7..c2c6ccd 100644 --- a/tests/test_tp_parallel_state.py +++ b/tests/test_tp_parallel_state.py @@ -56,7 +56,7 @@ def test_initialize_mismatched_ws_raises(topology): def test_get_tp_rank_is_greenlet_local(topology): """D3: get_tensor_model_parallel_rank returns greenlet-local rank - (delegates to torch.distributed.get_rank, ADR-0024 D9).""" + (delegates to torch.distributed.get_rank, ADR-0024 D2).""" import kernbench.tp as tp with _make_ctx(topology) as ctx: diff --git a/tests/test_verify_adr_lang_pairs.py b/tests/test_verify_adr_lang_pairs.py new file mode 100644 index 0000000..190197a --- /dev/null +++ b/tests/test_verify_adr_lang_pairs.py @@ -0,0 +1,107 @@ +"""Tests for tools/verify_adr_lang_pairs.py.""" + +from __future__ import annotations 
+ +import sys +from pathlib import Path + +_REPO_ROOT = Path(__file__).resolve().parents[1] +sys.path.insert(0, str(_REPO_ROOT / "tools")) + +import verify_adr_lang_pairs as v # noqa: E402 + + +def _make_adr( + path: Path, + title_id: str, + title_text: str = "Some Title", + status: str = "Accepted", +) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text( + f"# ADR-{title_id}: {title_text}\n\n" + f"## Status\n\n{status}\n\n" + f"## Context\n\nbody\n", + encoding="utf-8", + ) + + +def test_complete_pairs_pass(tmp_path: Path) -> None: + _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", "Foo EN") + _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", "Foo KO") + assert v.verify(tmp_path) == [] + + +def test_empty_dirs_pass(tmp_path: Path) -> None: + assert v.verify(tmp_path) == [] + + +def test_missing_ko_fails(tmp_path: Path) -> None: + _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001") + errs = v.verify(tmp_path) + assert any("missing KO" in e and "ADR-0001-foo-bar.md" in e for e in errs) + + +def test_orphan_ko_fails(tmp_path: Path) -> None: + _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001") + errs = v.verify(tmp_path) + assert any("orphan KO" in e and "ADR-0001-foo-bar.md" in e for e in errs) + + +def test_status_mismatch_fails(tmp_path: Path) -> None: + _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status="Accepted") + _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status="Proposed") + errs = v.verify(tmp_path) + assert any("Status block mismatch" in e for e in errs) + + +def test_title_id_mismatch_fails(tmp_path: Path) -> None: + _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0002") + _make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001") + errs = v.verify(tmp_path) + assert any("EN title ADR-ID" in e for e in errs) + + +def test_multiline_status_with_parenthetical_passes(tmp_path: Path) -> None: + """Real ADRs like ADR-0001 have multi-line 
Status with revision notes.""" + multiline_status = ( + "Accepted (Revision 2 - 2026-04-27: concrete bit layout,\n" + "Supersedes ADR-0031.)" + ) + _make_adr( + tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status=multiline_status + ) + _make_adr( + tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status=multiline_status + ) + assert v.verify(tmp_path) == [] + + +def test_crlf_normalization(tmp_path: Path) -> None: + """KO has CRLF, EN has LF; Status content is otherwise identical -> pass.""" + en = tmp_path / "docs/adr/ADR-0001-foo-bar.md" + ko = tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md" + en.parent.mkdir(parents=True, exist_ok=True) + ko.parent.mkdir(parents=True, exist_ok=True) + en.write_bytes( + b"# ADR-0001: Foo\n\n## Status\n\nAccepted\n\n## Context\n\nbody\n" + ) + ko.write_bytes( + b"# ADR-0001: Foo\r\n\r\n## Status\r\n\r\nAccepted\r\n\r\n## Context\r\n\r\nbody\r\n" + ) + assert v.verify(tmp_path) == [] + + +def test_underscore_in_slug_recognized(tmp_path: Path) -> None: + """ADR-0013 uses an underscore in its slug; the regex must accept it.""" + _make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013") + _make_adr(tmp_path / "docs/adr-ko/ADR-0013-ver-verification_strategy.md", "0013") + assert v.verify(tmp_path) == [] + + +def test_main_exit_codes(tmp_path: Path, capsys) -> None: + assert v.main(["--root", str(tmp_path)]) == 0 + _make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001") + assert v.main(["--root", str(tmp_path)]) == 1 + out = capsys.readouterr().out + assert "FAILED" in out diff --git a/tools/verify_adr_lang_pairs.py b/tools/verify_adr_lang_pairs.py new file mode 100644 index 0000000..d0147d5 --- /dev/null +++ b/tools/verify_adr_lang_pairs.py @@ -0,0 +1,144 @@ +"""Verify ADR language pair invariants. 
+ +Policy (see CLAUDE.md Part 2 -> ADR Translation Discipline): + docs/adr/ : English canonical + docs/adr-ko/ : Korean translation (1:1 mirror) + docs/adr-history/: frozen, not checked (transitional) + docs/adr-proposed/: language-free, not checked + +Checks: + - every docs/adr/.md has a matching docs/adr-ko/.md + - every docs/adr-ko/.md has a matching docs/adr/.md (no orphans) + - title line `# ADR-NNNN:` of each pair matches the filename's NNNN + - `## Status` block content is byte-equal (after CRLF/LF normalization) + between EN and KO + +Exit code: 0 if all OK, 1 if any mismatch. +""" + +from __future__ import annotations + +import argparse +import re +import sys +from pathlib import Path + +ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$") +TITLE_RE = re.compile(r"^# ADR-(\d{4}):") + + +def _normalize(text: str) -> str: + return text.replace("\r\n", "\n").replace("\r", "\n") + + +def find_adr_files(adr_dir: Path) -> dict[str, Path]: + if not adr_dir.is_dir(): + return {} + return { + p.name: p + for p in sorted(adr_dir.iterdir()) + if p.is_file() and ADR_FILENAME_RE.match(p.name) + } + + +def extract_title_id(text: str) -> str | None: + lines = _normalize(text).splitlines() + if not lines: + return None + m = TITLE_RE.match(lines[0]) + return m.group(1) if m else None + + +def extract_status_block(text: str) -> str | None: + """Return content between `## Status` and the next `## ` heading, stripped. + + Returns None if no `## Status` heading exists. 
+ """ + lines = _normalize(text).splitlines() + in_status = False + collected: list[str] = [] + for line in lines: + if line.strip() == "## Status": + in_status = True + continue + if in_status and line.startswith("## "): + break + if in_status: + collected.append(line) + if not in_status: + return None + return "\n".join(collected).strip() + + +def verify(root: Path) -> list[str]: + errors: list[str] = [] + en_dir = root / "docs" / "adr" + ko_dir = root / "docs" / "adr-ko" + + en_files = find_adr_files(en_dir) + ko_files = find_adr_files(ko_dir) + + for name in en_files: + if name not in ko_files: + errors.append(f"missing KO translation: docs/adr-ko/{name}") + for name in ko_files: + if name not in en_files: + errors.append(f"orphan KO (no canonical EN): docs/adr-ko/{name}") + + for name in sorted(en_files.keys() & ko_files.keys()): + m = ADR_FILENAME_RE.match(name) + assert m is not None + expected_id = m.group(1) + + en_text = en_files[name].read_text(encoding="utf-8") + ko_text = ko_files[name].read_text(encoding="utf-8") + + en_id = extract_title_id(en_text) + ko_id = extract_title_id(ko_text) + if en_id != expected_id: + errors.append( + f"{name}: EN title ADR-ID {en_id!r} != filename {expected_id!r}" + ) + if ko_id != expected_id: + errors.append( + f"{name}: KO title ADR-ID {ko_id!r} != filename {expected_id!r}" + ) + + en_status = extract_status_block(en_text) + ko_status = extract_status_block(ko_text) + if en_status is None: + errors.append(f"{name}: EN missing `## Status` section") + if ko_status is None: + errors.append(f"{name}: KO missing `## Status` section") + if en_status is not None and ko_status is not None and en_status != ko_status: + errors.append( + f"{name}: Status block mismatch\n" + f" EN: {en_status!r}\n" + f" KO: {ko_status!r}" + ) + + return errors + + +def main(argv: list[str] | None = None) -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument( + "--root", + type=Path, + default=Path.cwd(), + help="Repository 
root (default: cwd)",
+    )
+    args = p.parse_args(argv)
+
+    errors = verify(args.root)
+    if errors:
+        print("ADR language pair verification FAILED:")
+        for e in errors:
+            print(f" - {e}")
+        return 1
+    print("ADR language pair verification OK")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/topology.yaml b/topology.yaml
index 6ccb9c3..3f83a46 100644
--- a/topology.yaml
+++ b/topology.yaml
@@ -78,15 +78,15 @@ cube:
     scheduler_to_fetch_store_mm: 0.5
     dma_to_tcm_bw_gbs: 512.0
     dma_to_tcm_mm: 0.5
-    dma_to_fetch_store_mm: 0.0  # DMA → fetch_store chaining (ADR-0021)
+    dma_to_fetch_store_mm: 0.0  # DMA → fetch_store chaining (ADR-0014 D6)
     fetch_store_to_tcm_bw_gbs: 512.0
     fetch_store_to_tcm_mm: 0.0
-    fetch_store_to_gemm_mm: 0.0  # fetch → GEMM chaining (ADR-0021)
-    fetch_store_to_math_mm: 0.0  # fetch → MATH chaining (ADR-0021)
-    gemm_to_fetch_store_mm: 0.0  # GEMM → store chaining (ADR-0021)
-    gemm_to_math_mm: 0.0  # GEMM → MATH epilogue chaining (ADR-0021)
-    math_to_fetch_store_mm: 0.0  # MATH → store chaining (ADR-0021)
-    fetch_store_to_dma_mm: 0.0  # store → DMA writeback chaining (ADR-0021)
+    fetch_store_to_gemm_mm: 0.0  # fetch → GEMM chaining (ADR-0014 D6)
+    fetch_store_to_math_mm: 0.0  # fetch → MATH chaining (ADR-0014 D6)
+    gemm_to_fetch_store_mm: 0.0  # GEMM → store chaining (ADR-0014 D6)
+    gemm_to_math_mm: 0.0  # GEMM → MATH epilogue chaining (ADR-0014 D6)
+    math_to_fetch_store_mm: 0.0  # MATH → store chaining (ADR-0014 D6)
+    fetch_store_to_dma_mm: 0.0  # store → DMA writeback chaining (ADR-0014 D6)
     gemm_to_tcm_bw_gbs: 512.0
     gemm_to_tcm_mm: 0.5
     math_to_tcm_bw_gbs: 512.0
@@ -99,7 +99,7 @@ cube:
     hbm_total_gb_per_cube: 48
     hbm_slices_per_cube: 8
     hbm_total_bw_gbs: 1024.0
-    hbm_mapping_mode: n_to_one  # one_to_one | n_to_one (ADR-0019)
+    hbm_mapping_mode: n_to_one  # one_to_one | n_to_one (ADR-0017 D8)
     hbm_pseudo_channels: 64  # total pseudo channels per cube
     hbm_channels_per_pe: 8  # = pseudo_channels / pes_per_cube
     hbm_channel_bw_gbs: 32.0  # per-channel
bandwidth (GB/s)
@@ -123,7 +123,7 @@ cube:
     per_connection_bw_gbs: 128.0  # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW

   links:
-    # Router mesh links (ADR-0019)
+    # Router mesh links (ADR-0017 D5)
     router_link_bw_gbs: 256.0  # inter-router XY mesh link BW
     router_overhead_ns: 2.0  # per-router switching overhead
     pe_to_router_bw_gbs: 256.0  # PE_DMA ↔ router (= N × channel_bw)
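
Review note: the inline comments in these `topology.yaml` hunks state three arithmetic relations (`hbm_channels_per_pe = pseudo_channels / pes_per_cube`, `pe_to_router_bw_gbs = N × channel_bw`, and `4 × 128 = 512 GB/s`). The sketch below is one possible way to check them mechanically; it is not part of this change. The function name `check_topology` is hypothetical, `pes_per_cube = 8` is an assumption inferred from 64 / 8 rather than a value shown in the diff, and `N` is assumed to mean `hbm_channels_per_pe`.

```python
import math

# Values copied from the topology.yaml hunks in this diff.
params = {
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "pe_to_router_bw_gbs": 256.0,
    "per_connection_bw_gbs": 128.0,
}

# Assumptions, not keys shown in the diff:
PES_PER_CUBE = 8          # inferred from 64 pseudo channels / 8 channels per PE
UCIE_CONNECTIONS = 4      # taken from the "4 × 128" comment
UCIE_PHY_BW_GBS = 512.0   # taken from the "= 512 GB/s = UCIe PHY BW" comment


def check_topology(p, pes_per_cube, connections, phy_bw_gbs):
    """Return a list of mismatch descriptions (empty when all relations hold)."""
    problems = []
    # hbm_channels_per_pe = pseudo_channels / pes_per_cube
    if p["hbm_channels_per_pe"] * pes_per_cube != p["hbm_pseudo_channels"]:
        problems.append("hbm_channels_per_pe != hbm_pseudo_channels / pes_per_cube")
    # pe_to_router_bw_gbs = N × channel_bw, with N assumed to be hbm_channels_per_pe
    expected_pe_bw = p["hbm_channels_per_pe"] * p["hbm_channel_bw_gbs"]
    if not math.isclose(expected_pe_bw, p["pe_to_router_bw_gbs"]):
        problems.append("pe_to_router_bw_gbs != hbm_channels_per_pe * hbm_channel_bw_gbs")
    # connections × per_connection_bw_gbs = UCIe PHY bandwidth
    if not math.isclose(connections * p["per_connection_bw_gbs"], phy_bw_gbs):
        problems.append("connections * per_connection_bw_gbs != UCIe PHY BW")
    return problems


if __name__ == "__main__":
    found = check_topology(params, PES_PER_CUBE, UCIE_CONNECTIONS, UCIE_PHY_BW_GBS)
    for line in found or ["topology parameters are consistent"]:
        print(line)
```

With the values in this diff, all three relations hold under the stated assumptions. The check deliberately covers only relations written in the comments; it does not try to relate `hbm_total_bw_gbs` to the per-channel figures, since the diff does not state how those are meant to combine.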