ADR housekeeping: category prefixes, lifecycle folders, retroactive 0034-0037

Filename + lifecycle:
- ADR rename to ADR-NNNN-<cat>-title.md with 8 short (3-4 letter) category
  prefixes (dev / mem / lat / prog / algo / par / api / ver). Numbers stay
  immutable.
- ADR Lifecycle split into 3 folders, documented in CLAUDE.md Part 2:
  docs/adr/ (Accepted), docs/adr-proposed/ (Proposed/Stub/Draft),
  docs/adr-history/ (Superseded/Merged). Status field gains "Draft" for
  retroactive docs pending verification.

Merges (one ADR per topic, no change-history annotations):
- ADR-0017 absorbs ADR-0019 (Cube NOC + per-PE HBM connectivity, 10 D-items)
- ADR-0014 absorbs ADR-0021 (PE pipeline execution model, 8 D-items incl.
  TileToken self-routing and multi-op composite epilogue scope)
- ADR-0023 absorbs docs/ipcq-dma-codesign-hw.md as new "HW Realization
  Notes (Informative)" section (D16-D23 + Open HW Questions). codesign-hw.md
  deleted; ADR-0019/0021 moved to adr-history with one-line stub status.

Retroactive documentation (G4 closures, code-verified):
- ADR-0037 forwarding component (TransitComponent: first-flit overhead,
  serial worker, path-based routing, single impl/multiple names)
- ADR-0036 IO_CPU component (target_start_ns global barrier stamping,
  per-cube fan-out, response aggregation)
- ADR-0035 M_CPU & M_CPU.DMA component (3 fan-out paths, DMA Resources,
  target_start_ns passthrough)
- ADR-0034 HBM controller internal design (per-PC state, address-based
  selection, flit-aware per-flit commit, async finalize, command-only
  fallback path)

Content updates:
- ADR-0010 expanded to full CLI surface (run/probe/web), retitled
  "Command Line Interface and Execution Semantics"
- ADR-0007 D2 rewritten to current state; ADR-0015 supersession notes pruned
- ADR-0005 wrapped in Decision header with D1-D5; ADR-0022 metadata
  block replaced with standard Status header
- ADR-0024 trimmed to rank=SIP launcher essentials (D1-D4);
  ADR-0027 cleaned of supersession history
- ADR-0033 D6 cleanup: address-based PC selection moved out of future-work
  (now documented in ADR-0034 D3); related D1/D3 wording realigned
- Cross-references back-filled in 5 ADRs (G3 gaps closed)

Onboarding docs split:
- docs/onboarding/ created
- moved: hw-architecture-overview.md, latency-model.md, di-presentation.md,
  ccl-author-guide{,.en}.md
- references updated in README, ADR-0023{,.en}, src/kernbench/ccl/__init__.py

Source / test / yaml: ADR-NNNN cross-references in docstrings and YAML
comments updated after the merges (ADR-0021->0014 D6, ADR-0019->0017 D8).
No behavior change.

Tooling:
- tools/verify_adr_lang_pairs.py + tests/test_verify_adr_lang_pairs.py
  (ADR EN/KO pair invariant checker)
- .claude/commands/report.md tracked (/report slash command)
- .gitignore: allow .claude/commands/*.md while keeping settings files ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,318 @@
---
description: Generate a public-facing architecture design document from approved ADRs and SPEC.md, with gap analysis reported to chat only.
---

# `/report` — Architecture Design Document Generator

Generates a **public-facing** architecture design document at
`docs/report/architecture-{YYYY}-{1H|2H}.md` derived from the current ADR
corpus, SPEC.md, CLAUDE.md, and the canonical component list.

This command is **strictly read-only** on `docs/adr/`, `SPEC.md`,
`CLAUDE.md`, and `src/`. The only write is the report file itself
(a derived artifact under `docs/report/`).

---

## Invocation

Two modes:

- `/report` — **dry-run** (default). No file is written. The command
  reads sources, performs classification, and reports the planned TOC
  and gap analysis to chat only. Use this to validate ADR-to-section
  mapping before committing.
- `/report write` — **write mode**. Performs the same procedure and
  writes `docs/report/architecture-{period}.md`. Use after a dry-run
  whose classification looks correct.

Period determination (both modes), from system date:

- month 1–6 → `{YYYY}-1H`
- month 7–12 → `{YYYY}-2H`
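
The period rule above can be sketched as a small helper. This is an illustrative sketch only: the function names `report_period` and `report_path` are hypothetical and are not part of the command definition.

```python
from datetime import date


def report_period(today: date) -> str:
    """Map a date to '{YYYY}-1H' (months 1-6) or '{YYYY}-2H' (months 7-12)."""
    half = "1H" if today.month <= 6 else "2H"
    return f"{today.year}-{half}"


def report_path(today: date) -> str:
    """Build the report path for the half-year that contains `today`."""
    return f"docs/report/architecture-{report_period(today)}.md"


if __name__ == "__main__":
    print(report_path(date.today()))
```

Passing the date in explicitly, rather than reading the clock inside the helper, keeps the mapping easy to check at the June/July boundary.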

In write mode, if `docs/report/architecture-{period}.md` already exists,
overwrite it without asking (regeneration is the expected operation).

---

## Output Contract

### Document body (`docs/report/architecture-{period}.md`)

Public release form. The reader is an external developer or architect who
does **not** have access to SPEC.md or the ADR files. Therefore:

- **No `ADR-NNNN` identifiers** in visible prose.
- **No `SPEC R/§` identifiers** in visible prose.
- **No internal jargon** assumed without definition.
- **No diagram embeds** — only `<!-- DIAGRAM: ... -->` placeholders.
- **Attribution via HTML comments** — every prose paragraph that derives
  from a source carries an inline comment immediately above it:
  `<!-- src: ADR-NNNN <section-name> -->` (multiple sources allowed).

### Chat-only report (not written to any file)

After writing the document (or, in dry-run mode, after classification),
report to the user in the chat response:

- File path written (write mode only).
- Section counts (e.g., "Detailed Architecture: 8 components covered,
  2 in `builtin/` have no ADR backing").
- **G1 gaps** — SPEC requirements (R-numbers / §) with no ADR citing them.
- **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
  and Consequences are optional; their absence is NOT a gap.
- **G3 gaps** — ADR cross-references without a back-reference.
- **G4 suggestions** — areas where an ADR seems missing based on the
  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
  G4 item must say *why* it's suggested and remain falsifiable.
- **G5 consistency issues** — ADR-to-ADR inconsistencies:
  - **G5a (supersession not reflected)** — ADR-A states it supersedes
    ADR-B, but ADR-B's Status is not marked as Superseded.
  - **G5b (merge candidates)** — two or more ADRs cover near-identical
    scope (detected naturally during section assignment, not via
    exhaustive pair-wise scan).
  - **G5c (explicit contradictions)** — two ADRs whose Decisions
    directly oppose each other. Must cite both quotations; do not
    speculate contradictions from topical similarity alone.
- **TOC rationale** — for each section, list contributing ADR IDs
  (this is for the user's verification only, never written to the
  document itself).

G4 must never appear in the document body. G1–G3 and G5 are also chat-only.
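
As an illustration of the two mechanical checks in this list (G3 and G5a), here is a minimal sketch. It assumes the cross-references, supersession claims, and Status values have already been extracted into plain dictionaries. The function names and the dictionary shapes are hypothetical, not something the command prescribes.

```python
def g3_missing_back_references(citations: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (citing, cited) pairs where `cited` does not reference `citing` back.

    `citations` maps an ADR ID to the set of ADR IDs it mentions.
    An ADR absent from the mapping is treated as citing nothing.
    """
    gaps = []
    for citing in sorted(citations):
        for cited in sorted(citations[citing]):
            if cited != citing and citing not in citations.get(cited, set()):
                gaps.append((citing, cited))
    return gaps


def g5a_unreflected_supersessions(
    supersedes: dict[str, set[str]], status: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Return (newer, older, older_status) where `older` is not marked Superseded."""
    issues = []
    for newer in sorted(supersedes):
        for older in sorted(supersedes[newer]):
            older_status = status.get(older, "")
            if not older_status.startswith("Superseded"):
                issues.append((newer, older, older_status))
    return issues
```

Both helpers iterate in sorted order so that the same inputs always produce the same chat report.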

---

## Procedure

### Step 1 — Determine period

Use current system date. Compute `{YYYY}-1H` or `{YYYY}-2H`.

### Step 2 — Ingest ADRs

For each `docs/adr/ADR-NNNN-*.md`:

- If both `ADR-NNNN-*.md` (Korean) and `ADR-NNNN-*.en.md` (English)
  exist for the same number, **prefer the Korean `.md`** version.
- Parse for the four canonical sections: Context, Decision, Alternatives
  (also accept "Alternatives Considered"), Consequences.
- Record presence/absence of **Context** and **Decision** for G2.
  Alternatives and Consequences presence is recorded for use during
  authoring, but their absence is not a gap.
- Record ADR-NNNN cross-references for G3.
- Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
  ADR-NNNN" text in the body for G5a.

Process ADRs in **numerical order** for determinism.
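
The file-selection and section-presence rules above could look roughly like the following. This is a sketch under stated assumptions: the regular expressions, the helper names, and the assumption that section titles appear verbatim as Markdown headings are all illustrative, not taken from the repository.

```python
import re

# Assumed filename shape: ADR-NNNN-<anything>.md or ADR-NNNN-<anything>.en.md
ADR_FILE = re.compile(r"^ADR-(\d{4})-.+?(\.en)?\.md$")
HEADING = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)


def select_adr_files(names: list[str]) -> list[str]:
    """Pick one file per ADR number, preferring `.md` over `.en.md`.

    The result is ordered by ADR number for determinism.
    """
    chosen: dict[int, str] = {}
    for name in sorted(names):
        match = ADR_FILE.match(name)
        if match is None:
            continue  # not an ADR file
        number = int(match.group(1))
        is_english = match.group(2) is not None
        current = chosen.get(number)
        if current is None or (current.endswith(".en.md") and not is_english):
            chosen[number] = name
    return [chosen[number] for number in sorted(chosen)]


def missing_required_sections(text: str) -> list[str]:
    """Return which of Context / Decision have no heading in the ADR text (G2)."""
    titles = {m.group(1).lower() for m in HEADING.finditer(text)}
    return [name for name in ("Context", "Decision") if name.lower() not in titles]
```

An ADR that uses numbered or translated headings would need a looser match than the exact-title comparison used here.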

### Step 3 — Read canonical component list

List `src/kernbench/components/builtin/*.py`, excluding `__init__.py`,
`pe_types.py`, and `__pycache__/`. Sort alphabetically. This is the
canonical order for Detailed Architecture subsections.
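
A minimal sketch of this listing step, assuming it is run from the repository root. The helper name `canonical_components` is hypothetical.

```python
from pathlib import Path

# Files named in Step 3 as excluded from the canonical list.
EXCLUDED = {"__init__.py", "pe_types.py"}


def canonical_components(builtin_dir: Path) -> list[str]:
    """Return the component file names in alphabetical order.

    `glob("*.py")` only matches Python files directly in the folder,
    so the `__pycache__/` directory is skipped without special handling.
    """
    return sorted(p.name for p in builtin_dir.glob("*.py") if p.name not in EXCLUDED)


if __name__ == "__main__":
    for name in canonical_components(Path("src/kernbench/components/builtin")):
        print(name)
```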

### Step 4 — Read SPEC.md and CLAUDE.md

For G1 detection: extract every `R<N>` and `§<X.Y>` identifier mentioned
in SPEC.md. For each ADR, check which of these it cites. SPEC IDs with
zero citing ADRs → G1.
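
The G1 check can be sketched as a set difference. The regular expression below is an assumption about how `R<N>` and `§<X.Y>` identifiers are written; the real SPEC.md may need a different pattern. The helper names are hypothetical.

```python
import re

# Assumed identifier shapes: "R12" as a whole word, or "§3", "§3.2", "§3.2.1".
SPEC_ID = re.compile(r"\bR\d+\b|§\d+(?:\.\d+)*")


def spec_ids(text: str) -> set[str]:
    """Collect every SPEC identifier mentioned in a text."""
    return set(SPEC_ID.findall(text))


def g1_gaps(spec_text: str, adr_texts: dict[str, str]) -> list[str]:
    """Return SPEC identifiers that no ADR cites, in sorted order."""
    cited: set[str] = set()
    for text in adr_texts.values():
        cited |= spec_ids(text)
    return sorted(spec_ids(spec_text) - cited)
```

The word boundary before `R` keeps strings such as `ADR-0017` from being read as requirement IDs.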

### Step 5 — Section assignment

Assign each ADR to exactly one of:

- **Design Principles** — project-wide rationale, philosophy, mission
  (e.g., "why source-level kernel execution", "why fast multi-device
  scaling"). Includes ADRs that describe foundational invariants
  (e.g., latency model assumptions, verification strategy).
- **High-level Architecture** — Tray / SIP / CUBE / PE hierarchy and
  cross-layer boundaries (e.g., runtime API ↔ sim_engine ↔ components).
- **Detailed Architecture** — single-component internal designs. One
  subsection per file in the canonical component list. ADRs whose
  primary topic is the internal structure of one component go here.
- **Implementation Decisions** — **cross-cutting** algorithms / policies
  / schemes / models that don't belong to a single component:
  collective algorithms, parallelization policies, address schemes,
  routing algorithms, model assumptions.

Boundary rule between Detailed Architecture and Implementation Decisions:

> Detailed Architecture = component-internal.
> Implementation Decisions = spans multiple components OR is an
> algorithm/policy/scheme/assumption rather than a structural choice.

If an ADR fits two sections plausibly, prefer the one that minimizes
duplication and pick the more specific bucket (Detailed if it primarily
concerns one component, else Implementation Decisions).

During classification, opportunistically detect ADR consistency issues:

- **G5b (merge candidate)** — if two or more ADRs land in the same
  Detailed Architecture subsection or the same Implementation Decisions
  topic AND their primary scope is near-identical, record as a merge
  candidate. Topical adjacency is not enough; the scopes must be
  effectively the same question.
- **G5c (explicit contradiction)** — if while reading you encounter two
  ADRs whose Decisions directly oppose each other on the same question,
  record both quotations verbatim with their ADR IDs. Do NOT speculate
  contradictions from similarity, vocabulary, or domain overlap — only
  explicit, citable opposition.

Do NOT perform an exhaustive pair-wise scan of all ADRs. G5b/G5c are
byproducts of normal reading; if not encountered, the chat report
shows "(none)".

### Step 6 — Write the document (write mode only)

In **dry-run mode**, skip this step entirely. Proceed directly to Step 7.

In **write mode**, write the document using this skeleton:

```markdown
# KernBench — Architecture Design Document
*{YYYY} {1H|2H}*

## Design Principles
<prose>

## High-level Architecture
<intro prose>

### Tray
### SIP
### CUBE
### PE

## Detailed Architecture
### <component-1>
### <component-2>
...

## Implementation Decisions
### <topic-1>
### <topic-2>
...
```

#### Authoring rules (apply to every section)

- **Stay grounded.** Every claim must trace to an ADR's stated content
  (Context / Decision / Alternatives / Consequences). No invented
  motivation, no invented alternatives, no invented trade-offs.
- **4-part discipline, naturally.** Each subsection should naturally
  cover: the problem the design addresses, the decision made, the
  alternatives considered, the consequences. Do **not** label these
  with rigid headers like "**Problem.**" — weave them into prose. But
  ensure all four are present *if the source ADR documents them*.
- **Missing → omit, not fabricate.** If a source ADR has no
  "Alternatives" section, do **not** invent alternatives for the
  report. Simply write the remaining parts. Record G2 in chat only
  when the missing section is Context or Decision; missing Alternatives
  or Consequences is not a gap.
- **Attribution.** Every paragraph derived from one or more ADRs
  carries an HTML comment immediately above:
  `<!-- src: ADR-NNNN <section> [, ADR-MMMM <section>] -->`.
- **Diagram placeholders.** Where a diagram would help, insert
  `<!-- DIAGRAM: <short description of what the diagram should show> -->`
  on its own line. **Never** embed an image (`![alt](path)`).
- **Public tone.** Self-contained. Define internal terms (SIP, CUBE,
  PE, Tray, NOC, IPCQ, TCM, etc.) on first use within the document.
  Do not assume the reader has read SPEC or ADRs.
- **No internal references.** No `ADR-NNNN` in body text. No
  `SPEC §X.Y` or `R<N>` in body text. These appear only inside HTML
  attribution comments.
- **Detailed Architecture component subsections.** Use the canonical
  list from Step 3 in order. For each component file, write a
  subsection drawing from any ADR that primarily concerns that
  component. If no ADR covers a component, write a one-line stub
  noting the component exists and flag it in the chat report. If an ADR
  covers a topic not in the canonical list, place it under
  "Detailed Architecture → Other" (sub-subsection) and flag for
  canonical-list extension in chat.
- **Implementation Decisions topic naming.** Derive topic names from
  ADR titles, made reader-friendly (no ADR number). Group related
  ADRs under one topic when natural (e.g., multiple address-related
  ADRs under "Address Scheme").
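
The "no internal references" and "no image embeds" rules lend themselves to a mechanical post-check of the generated body. The sketch below is only an assumption about how such a check might be written. The patterns and the name `body_violations` are hypothetical, and the `R<N>` pattern may produce false positives on ordinary text.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumed shapes of internal identifiers: ADR-NNNN, a section sign followed by a digit, R<N>.
INTERNAL_ID = re.compile(r"ADR-\d{4}|§\s*\d+(?:\.\d+)*|\bR\d+\b")
IMAGE_EMBED = re.compile(r"!\[[^\]]*\]\(")


def body_violations(markdown: str) -> list[str]:
    """List internal IDs and image embeds found outside HTML comments."""
    visible = HTML_COMMENT.sub("", markdown)  # attribution comments are allowed to carry IDs
    found = INTERNAL_ID.findall(visible)
    if IMAGE_EMBED.search(visible):
        found.append("image embed")
    return found
```

An empty result means the visible prose carries no internal identifiers; attribution comments and `<!-- DIAGRAM: ... -->` placeholders are ignored by construction.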

### Step 7 — Generate chat report

After Step 6 (write mode) or directly from Step 5 (dry-run mode),
emit the following to chat. Do **not** write any of this to a file.

In **dry-run mode**, replace the `Wrote:` line with:
``**DRY-RUN — no file written.** Review TOC and gaps below. Run `/report write` to commit.``

```
## /report — Generation Summary

**Wrote:** docs/report/architecture-{period}.md

**Section coverage**
- Design Principles: <N> ADRs
- High-level Architecture: <N> ADRs
- Detailed Architecture: <covered>/<total> components ; components without ADR: [...]
- Implementation Decisions: <N> topics, <N> ADRs

**TOC rationale (ADR → section mapping)**
- Design Principles: ADR-NNNN, ADR-MMMM
- High-level Architecture: ...
- Detailed Architecture → <component>: ADR-NNNN
- Implementation Decisions → <topic>: ADR-NNNN, ADR-MMMM

**G1 — SPEC requirements without ADR support**
- R<N> / §<X.Y>: not cited by any ADR
- (or "none")

**G2 — ADRs missing required sections (Context or Decision)**
- ADR-NNNN: missing <Context|Decision>
- (or "none")

**G3 — Broken cross-references**
- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
- (or "none")

**G4 — Suggested topics that may warrant a new ADR (verify before acting)**
- <topic>: <why agent thinks it may be missing — must be falsifiable>
- (or "none")

**G5 — ADR consistency issues**
- **G5a (supersession not reflected)**
  - ADR-NNNN claims to supersede ADR-MMMM, but ADR-MMMM Status is "<status>"
  - (or "none")
- **G5b (merge candidates)**
  - ADR-NNNN + ADR-MMMM: near-identical scope on <topic> — evaluate merge
  - (or "none")
- **G5c (explicit contradictions)**
  - ADR-NNNN says "<quote>"; ADR-MMMM says "<quote>" — direct opposition on <question>
  - (or "none")
```

---

## Constraints (do not violate)

1. **Read-only on source.** No writes to `docs/adr/`, `SPEC.md`,
   `CLAUDE.md`, or `src/`. Only write is
   `docs/report/architecture-{period}.md`.
2. **No fabrication.** Every body paragraph traces to ADR content via
   HTML attribution comment.
3. **No diagram embeds.** Placeholders only.
4. **No internal IDs in body.** ADR-NNNN and SPEC R/§ stay inside
   HTML comments only.
5. **Determinism.** ADRs processed in numerical order; components in
   canonical (alphabetical) order. Same inputs → same output.
6. **G4 stays in chat.** Never written to the document.
7. **Korean bilingual preference.** When both `.md` and `.en.md`
   exist for the same ADR number, use `.md`.
8. **All ADRs included.** No exclusion list. ADRs about internal
   tooling (CLI, diagram views, verification strategy) are still
   included — usually under Design Principles or Implementation
   Decisions, written in publishable form.

---

## Failure modes to avoid

- **Padding** with general background not present in the source ADRs.
- **Inferring alternatives** the ADR doesn't mention.
- **Quietly skipping** an ADR because it seems internal. Include it,
  rephrase for public audience.
- **Inventing components** not in `src/kernbench/components/builtin/`.
- **Auto-selecting diagrams** from `docs/diagrams/`. Only placeholders.
- **Promoting G4 suggestions to the document.** They stay in chat.
@@ -30,7 +30,10 @@
 "Bash(python -m pytest tests/test_pe_components.py -v)",
 "Bash(python -m pytest tests/test_triton_emu.py -v)",
 "Bash(python -m pytest tests/test_pe_components.py tests/test_triton_emu.py -v)",
-"Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)"
+"Bash(python -m pytest tests/test_pe_components.py::test_mcpu_multi_pe_kernel_launch tests/test_pe_components.py::test_qkv_gemm_bench_multi_pe_completes -v)",
+"Bash(git add:*)",
+"Bash(git commit:*)",
+"Bash(git push:*)"
 ]
 }
 }
@@ -29,4 +29,6 @@ build/
 
 # Logs
 *.log
-.claude/
+.claude/*
+!.claude/commands/
+!.claude/commands/*.md
@@ -218,17 +218,43 @@ General fallbacks. Apply to anything not explicitly covered above.
 
 ### ADR Lifecycle
 
-- `docs/adr/` contains ADRs reflecting current implementation or
-  work-in-progress designs.
-- `docs/history/` contains superseded ADRs as historical record.
-- When an ADR is superseded:
-  1. The superseding ADR includes a "Supersedes ADR-NNNN" line.
-  2. The superseded ADR's Status is set to "Superseded by ADR-MMMM".
-  3. The superseded ADR file is **moved** (git mv) to `docs/history/`.
-- Cross-references between ADRs use the ADR-NNNN ID and remain
-  valid regardless of file location.
-- ADR numbers are **immutable**; never renumber. Numbering holes
-  from moved ADRs are expected.
+ADRs live in one of three folders based on lifecycle state:
+
+- `docs/adr/` — **Accepted** (current implementation reflected).
+- `docs/adr-proposed/` — **Proposed**, **Stub**, or **Draft** (design
+  only / future-work exploration / retroactive documentation pending
+  verification).
+- `docs/adr-history/` — **Superseded** or **Merged** (no longer the
+  authoritative source; kept as historical record).
+
+Status field values:
+
+- `Accepted` — design is in current implementation.
+- `Proposed` — design is concrete but not yet implemented.
+- `Stub (Future Work)` — design space exploration; no commitment yet.
+- `Draft` — retroactive documentation drafted but not yet verified
+  against the implementation it describes.
+- `Superseded by ADR-NNNN` — replaced by another ADR.
+- `Merged into ADR-NNNN` — content absorbed by another ADR.
+
+Transitions:
+
+- **Proposed/Stub → Accepted**: when the ADR's decisions are
+  reflected in production code AND covered by tests. `git mv` from
+  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+- **Draft → Accepted**: when the ADR's text has been verified to
+  accurately describe the existing implementation. `git mv` from
+  `docs/adr-proposed/` to `docs/adr/`, change Status to `Accepted`.
+- **Accepted → Superseded**: set Status to `Superseded by ADR-MMMM`
+  and `git mv` to `docs/adr-history/`. The superseding ADR includes
+  a "Supersedes ADR-NNNN" reference (or, for partial supersession of
+  clauses, documents this in its own body).
+- **Accepted → Merged**: set Status to `Merged into ADR-MMMM`
+  (single-line stub) and `git mv` to `docs/adr-history/`.
+
+Cross-references between ADRs use the `ADR-NNNN` ID and remain valid
+regardless of folder location. ADR numbers are **immutable**; never
+renumber. Numbering holes from moved ADRs are expected.
 
 ## Terminology
 
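
The lifecycle rules in the hunk above amount to a mapping from Status to folder plus a small set of allowed transitions. A minimal sketch of that mapping follows. The names `expected_folder`, `STATUS_FOLDERS`, and `ALLOWED_TRANSITIONS` are hypothetical and do not describe an existing tool in the repository.

```python
# Folder for each fixed Status value, as listed in the ADR Lifecycle section.
STATUS_FOLDERS = {
    "Accepted": "docs/adr",
    "Proposed": "docs/adr-proposed",
    "Stub (Future Work)": "docs/adr-proposed",
    "Draft": "docs/adr-proposed",
}

# (from, to) state pairs named under "Transitions".
ALLOWED_TRANSITIONS = {
    ("Proposed", "Accepted"),
    ("Stub", "Accepted"),
    ("Draft", "Accepted"),
    ("Accepted", "Superseded"),
    ("Accepted", "Merged"),
}


def expected_folder(status: str) -> str:
    """Return the folder an ADR with this Status line should live in."""
    status = status.strip()
    # "Superseded by ADR-NNNN" and "Merged into ADR-NNNN" carry a variable suffix.
    if status.startswith(("Superseded by ADR-", "Merged into ADR-")):
        return "docs/adr-history"
    if status in STATUS_FOLDERS:
        return STATUS_FOLDERS[status]
    raise ValueError(f"unknown ADR status: {status!r}")
```

A check like this could flag an ADR whose Status and folder disagree, for example a `Draft` file left under `docs/adr/`.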
@@ -155,5 +155,6 @@ kernbench/
 ## Documentation
 
 - [CHANGES.md](CHANGES.md) — changelog with detailed descriptions of each release
-- [docs/latency-model.md](docs/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/latency-model.md](docs/onboarding/latency-model.md) — latency model explanation with worked examples
+- [docs/onboarding/](docs/onboarding/) — onboarding guides (architecture overview, latency model, CCL author guide, intro presentation)
 - [docs/adr/](docs/adr/) — Architecture Decision Records
@@ -51,7 +51,7 @@ Major architectural decisions are documented in ADRs and referenced by number.
 - ADR-0007: runtime_api vs sim_engine responsibility boundaries
 - ADR-0008: Tensor deployment and allocation (Host allocator, PA-first)
 - ADR-0009: Kernel execution fan-out and completion semantics
-- ADR-0010: CLI device selection and multi-device execution semantics
+- ADR-0010: Command line interface and execution semantics
 - ADR-0011: Memory Addressing — PA / VA / LA Address Models
 - ADR-0012: Host ↔ IO_CPU message schema (PA-first, PE-tagged shards)
 - ADR-0013: Verification strategy and Phase 1 test plan
@@ -0,0 +1,5 @@
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

## Status

Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
# ADR-0019: CUBE NOC 내 Per-Channel 및 Aggregated HBM 연결 모델

## Status

Merged into ADR-0017 (Cube NOC and HBM Connectivity).
@@ -0,0 +1,5 @@
# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing

## Status

Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -0,0 +1,5 @@
# ADR-0021: PE 파이프라인 리팩토링 — 컴포넌트 분리 + Scheduler 기반 라우팅

## Status

Merged into ADR-0014 (PE Pipeline Execution Model).
@@ -257,5 +257,5 @@ PhysAddr encoding. The caller does not need to know which range it is.
 |------|--------|
 | `src/kernbench/policy/address/phyaddr.py` | Range table (`PE_RESOURCE_MAP`), range-based decode, new component-specific factories (`pe_ipcq_addr`, etc.), updated internal encoding of the existing `pe_tcm_addr` |
 | `src/kernbench/policy/address/allocator.py` | Range-aware pool separation (TCM pool / IPCQ pool / scratchpad pool, etc., per PE) |
-| `docs/adr/ADR-0001-physaddr-layout.md` | Amendment note: range-based PE resource partition |
+| `docs/adr/ADR-0001-mem-physaddr-layout.md` | Amendment note: range-based PE resource partition |
 | `tests/test_phyaddr.py` | Range table validation, encode/decode round-trip for each factory, regression for the existing `pe_tcm_addr` |
@@ -340,7 +340,7 @@ encoding can be plugged in later" promise has been fulfilled.
 | `src/kernbench/sim_engine/memory_store.py` | D3: verify whether the IPCQ buffer is shared with the existing space |
 | `src/kernbench/sim_engine/engine.py` | D4: IPCQ token routing uses the PhysAddr-based fabric path |
 | `src/kernbench/ccl/diagnostics.py` | D5: improve pointer_dump with PhysAddr formatting |
-| `docs/adr/ADR-0023-ipcq-pe-collective.md` | D6: D2.5 amendment note |
+| `docs/adr/ADR-0023-dev-ipcq-pe-collective.md` | D6: D2.5 amendment note |
 | `tests/test_ipcq_physaddr.py` (new) | T1 |
 | `tests/test_ipcq_alloc.py` (new) | T2 |
 | `tests/test_ccl_install_plan.py` | T3 extension |
@@ -35,7 +35,7 @@ shortcuts that obscure control paths.
 
 ### D3. Bypass is explicit and graph-represented
 - All paths must be explicitly represented in the graph and subject to latency accumulation.
-- Example: PE_DMA connects to the NOC router mesh (ADR-0019). All destinations
+- Example: PE_DMA connects to the NOC router mesh (ADR-0017 D7). All destinations
   (HBM, shared SRAM, inter-cube UCIe) are reached via explicit mesh hops.
   Local HBM access has minimal hops (switching overhead only); remote access
   traverses additional routers.
@@ -15,7 +15,7 @@ Each PE has a notion of “local HBM” that must guarantee full HBM bandwidth,
 
 - Each PE is assigned a logically defined “local HBM” region.
 - Local HBM corresponds to the pseudo-channel subset directly attached to that PE’s
-  router in the NOC mesh (ADR-0019).
+  router in the NOC mesh (ADR-0017 D4).
 - The path is: PE_DMA → local router → HBM_CTRL (switching overhead only, 0 mesh hops).
 - The mapping (HBM pseudo-channels → PE local regions) is derived from topology configuration.
 
@@ -20,7 +20,9 @@ Diagrams must reflect this distance by default.
 
 ---
 
-## Global Defaults
+## Decision
+
+### D1. Global Defaults
 
 - All diagrams MUST be **distance-aware by default**.
 - All diagrams MUST render **representative views** of the architecture.
@@ -31,7 +33,7 @@ Diagrams must reflect this distance by default.
 
 ---
 
-## Representative Rendering Rule
+### D2. Representative Rendering Rule
 
 - All CUBEs share the same internal structure.
 - All PEs share the same internal structure.
@@ -47,9 +49,9 @@ unless explicitly requested.
 
 ---
 
-## Diagram Views
+### D3. Diagram Views
 
-### View A — SIP-Level Diagram
+#### View A — SIP-Level Diagram
 
 **Purpose**
 Explain system-scale structure and connectivity.
@@ -75,7 +77,7 @@ Explain system-scale structure and connectivity.
 
 ---
 
-### View B — CUBE-Level Diagram
+#### View B — CUBE-Level Diagram
 
 **Purpose**
 Explain cube-internal structure and data/control flow.
@@ -106,7 +108,7 @@ Explain cube-internal structure and data/control flow.
 
 ---
 
-### View C — PE-Level Diagram
+#### View C — PE-Level Diagram
 
 **Purpose**
 Explain internal PE behavior and execution structure.
@@ -128,14 +130,14 @@ Explain internal PE behavior and execution structure.
 
 ---
 
-## Distance-Aware Layout (Default)
+### D4. Distance-Aware Layout (Default)
 
-### Distance definition
+#### Distance definition
 
 - Distance is defined as **accumulated latency**, consistent with ADR-0002.
 - Distance is computed from a single anchor node.
 
-### Default anchor selection
+#### Default anchor selection
 
 - SIP view: IO chiplet (or Host CPU if present)
 - CUBE view: a representative PE
@@ -143,7 +145,7 @@ Explain internal PE behavior and execution structure.
|
|||||||
|
|
||||||
Anchors are **implicit defaults** and MUST NOT be required to be specified.
|
Anchors are **implicit defaults** and MUST NOT be required to be specified.
|
||||||
|
|
||||||
### Layout rules
|
#### Layout rules
|
||||||
|
|
||||||
- Diagrams MUST be laid out in layers based on distance buckets.
|
- Diagrams MUST be laid out in layers based on distance buckets.
|
||||||
- Layout direction MUST be consistent within a view type
|
- Layout direction MUST be consistent within a view type
|
||||||
@@ -156,7 +158,7 @@ without affecting distance semantics.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Generation Contract (for Tools / Claude Code)
|
### D5. Generation Contract (for Tools / Claude Code)
|
||||||
|
|
||||||
When generating diagrams:
|
When generating diagrams:
|
||||||
|
|
||||||
+1
-1
@@ -63,7 +63,7 @@ For each view (SIP / CUBE / PE):
|
|||||||
- CUBE-level projection MUST include:
|
- CUBE-level projection MUST include:
|
||||||
- Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
|
- Router mesh (from cube_mesh.yaml), HBM_CTRL, shared SRAM, M_CPU, UCIe ports,
|
||||||
and PEs as opaque blocks.
|
and PEs as opaque blocks.
|
||||||
- All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0019).
|
- All paths (HBM, non-HBM, command) route through the same router mesh (ADR-0017).
|
||||||
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
|
- Default anchors are implicit (ADR-0005) and MUST NOT require instance indices.
|
||||||
|
|
||||||
### D6. Output formats and determinism
|
### D6. Output formats and determinism
|
||||||
+12
-6
@@ -42,21 +42,25 @@ The runtime API MUST NOT:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### D2. Simulation engine executes and schedules requests
|
### D2. Simulation engine wires components and tracks completion
|
||||||
|
|
||||||
The simulation engine (sim_engine) MUST:
|
The simulation engine (sim_engine) MUST:
|
||||||
|
|
||||||
- inject requests into the compiled topology graph,
|
- wire components at initialization (create port stores + start wire
|
||||||
|
processes per the component port/wire framework — ADR-0015),
|
||||||
|
- inject requests into the compiled topology graph at entry components
|
||||||
|
(e.g., PCIE_EP for memory operations, IO_CPU for kernel launch),
|
||||||
- schedule and execute events using a discrete-event model,
|
- schedule and execute events using a discrete-event model,
|
||||||
- manage correlation ids and completion tracking,
|
- manage correlation ids and completion tracking.
|
||||||
- decompose operations into low-level requests when required
|
|
||||||
(e.g., MemoryWrite events).
|
|
||||||
|
|
||||||
The simulation engine MUST NOT:
|
The simulation engine MUST NOT:
|
||||||
|
|
||||||
- define tensor semantics,
|
- define tensor semantics,
|
||||||
- define kernel execution policies,
|
- define kernel execution policies,
|
||||||
- expose internal graph details to the runtime API.
|
- expose internal graph details to the runtime API,
|
||||||
|
- walk the topology path during request execution,
|
||||||
|
- call component `run()` methods directly,
|
||||||
|
- track per-hop latency or decompose fan-out (components own this).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -87,3 +91,5 @@ component-level fan-out explicitly.
|
|||||||
- SPEC R4, R7, R8
|
- SPEC R4, R7, R8
|
||||||
- ADR-0008 (Tensor deployment)
|
- ADR-0008 (Tensor deployment)
|
||||||
- ADR-0009 (Kernel execution)
|
- ADR-0009 (Kernel execution)
|
||||||
|
- ADR-0015 (Component port/wire model and engine role)
|
||||||
|
- ADR-0010 (CLI surface and execution semantics — runtime API consumer)
|
||||||
+2
@@ -142,3 +142,5 @@ control plane — runtime API and application kernels are unchanged.
|
|||||||
- SPEC R1, R2, R7, R8
|
- SPEC R1, R2, R7, R8
|
||||||
- ADR-0007 (Runtime API boundaries)
|
- ADR-0007 (Runtime API boundaries)
|
||||||
- ADR-0008 (Tensor deployment)
|
- ADR-0008 (Tensor deployment)
|
||||||
|
- ADR-0013 (Verification strategy — V2 fan-out tests)
|
||||||
|
- ADR-0015 D4 (concrete fabric path for kernel launch)
|
||||||
@@ -0,0 +1,131 @@
|
|||||||
|
# ADR-0010: Command Line Interface and Execution Semantics
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
The `kernbench` CLI is the user-facing entry point of the simulator. It
|
||||||
|
exposes three subcommands:
|
||||||
|
|
||||||
|
- `run` — execute a benchmark against a topology.
|
||||||
|
- `probe` — diagnostic utility for latency / BW measurement.
|
||||||
|
- `web` — interactive topology viewer.
|
||||||
|
|
||||||
|
Device enumeration is centralized in the CLI; neither the runtime API
|
||||||
|
nor the simulation engine enumerates devices. Benchmarks remain
|
||||||
|
single-device by design and accept a device identifier as input.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. Benchmark contract — single-device by design
|
||||||
|
|
||||||
|
- A benchmark MUST define behavior for a single device only.
|
||||||
|
- A benchmark MUST accept a device identifier as input.
|
||||||
|
- Benchmarks MUST NOT enumerate or loop over multiple devices.
|
||||||
|
|
||||||
|
Multi-device execution is the CLI's concern (D3), not the benchmark's.
|
||||||
|
|
||||||
|
### D2. `kernbench run` — benchmark execution
|
||||||
|
|
||||||
|
Required arguments:
|
||||||
|
|
||||||
|
- `--topology <path>`: topology YAML file path. Loaded via
|
||||||
|
`resolve_topology()`.
|
||||||
|
- `--bench <name>`: benchmark name. Resolved via
|
||||||
|
`benches.loader.resolve_bench()`.
|
||||||
|
|
||||||
|
Optional arguments:
|
||||||
|
|
||||||
|
- `--device <selector>` (default: `all`):
|
||||||
|
- `all` — run once per discovered SIP (see D3).
|
||||||
|
- `sip:<N>` — run only on SIP N.
|
||||||
|
- Parsed via `resolve_device()`.
|
||||||
|
- `--verify-data` (default: off) — enable Phase 2 data verification
|
||||||
|
(see ADR-0020). When set, `engine_factory` constructs the engine
|
||||||
|
with `enable_data=True`. After the benchmark runs, a diagnostic
|
||||||
|
summary of recorded ops is printed.
|
||||||
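The `--device` selector grammar above can be sketched as a small parser. This is an illustration only: `parse_device_selector` and `DeviceSelection` are hypothetical names, not the actual `resolve_device()` implementation, and rejecting unknown selectors with `ValueError` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeviceSelection:
    """Hypothetical result type: sip_index is None when every SIP is selected."""
    sip_index: Optional[int]


def parse_device_selector(selector: str = "all") -> DeviceSelection:
    """Parse `all` or `sip:<N>`; anything else is rejected (assumed behaviour)."""
    if selector == "all":
        return DeviceSelection(sip_index=None)
    prefix, sep, index = selector.partition(":")
    # Require plain ASCII digits after "sip:" so int() cannot fail.
    if prefix == "sip" and sep and index.isascii() and index.isdigit():
        return DeviceSelection(sip_index=int(index))
    raise ValueError(f"unsupported device selector: {selector!r}")
```

Keeping the parsed result separate from device enumeration matches D6: the selector only names a SIP, and discovering which SIPs exist stays with the CLI.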
|
|
||||||
|
Each invocation runs the benchmark once within a single simulation
|
||||||
|
instance.
|
||||||
|
|
||||||
|
### D3. Multi-device execution is logically parallel
|
||||||
|
|
||||||
|
When `--device all` is given (or `--device` is omitted) and the topology has multiple SIPs:
|
||||||
|
|
||||||
|
- Benchmark executions are submitted to a single simulation engine
|
||||||
|
instance.
|
||||||
|
- Executions are logically parallel in simulation time.
|
||||||
|
- Inter-device contention is naturally modeled (shared fabric
|
||||||
|
bandwidth, cross-SIP traffic, etc.).
|
||||||
|
|
||||||
|
The CLI does NOT spawn multiple OS processes or independent
|
||||||
|
simulation runs — parallelism is internal to one simulation instance.
|
||||||
|
|
||||||
|
### D4. `kernbench probe` — latency / BW diagnostic utility
|
||||||
|
|
||||||
|
Required argument:
|
||||||
|
|
||||||
|
- `--topology <path>`: topology YAML file path.
|
||||||
|
|
||||||
|
Optional argument:
|
||||||
|
|
||||||
|
- `--case <name>` (default: `all`) — run a predefined traffic
|
||||||
|
pattern, or `all` to run every defined case.
|
||||||
|
|
||||||
|
Probe runs each pattern through the simulation engine and reports
|
||||||
|
per case:
|
||||||
|
|
||||||
|
- End-to-end latency (ns).
|
||||||
|
- Effective bandwidth (nbytes / total_ns).
|
||||||
|
- Bottleneck bandwidth (min edge BW along the chosen path).
|
||||||
|
- Utilization (effective / bottleneck).
|
||||||
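The four reported values follow directly from one measured transfer. A minimal sketch, assuming edge bandwidths are expressed in GB/s (the `_gbs` suffix used elsewhere suggests this); `probe_metrics` and its dictionary keys are hypothetical names, not the probe's actual output schema.

```python
def probe_metrics(nbytes: int, total_ns: float, edge_bw_gbs: list[float]) -> dict[str, float]:
    """Derive the per-case report values from one measured transfer.

    1 byte/ns equals 1 GB/s (decimal), so nbytes / total_ns is already in GB/s.
    """
    if total_ns <= 0 or not edge_bw_gbs:
        raise ValueError("need a positive latency and at least one edge")
    effective = nbytes / total_ns   # effective bandwidth in GB/s
    bottleneck = min(edge_bw_gbs)   # slowest edge along the chosen path
    return {
        "latency_ns": total_ns,
        "effective_bw_gbs": effective,
        "bottleneck_bw_gbs": bottleneck,
        "utilization": effective / bottleneck,
    }
```

For example, 4096 bytes in 64 ns over edges of 256, 128, and 512 GB/s gives an effective bandwidth of 64 GB/s against a 128 GB/s bottleneck, so utilization is 0.5.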
|
|
||||||
|
Probe additionally validates monotonicity invariants — for example
|
||||||
|
that latency for local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
|
||||||
|
cross-SIP — and reports violations. Probe is a developer tool for
|
||||||
|
verifying the latency / BW model; it is not a benchmark.
|
||||||
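A monotonicity check of this kind reduces to comparing adjacent cases in distance order. The sketch below assumes the invariant is applied to end-to-end latency; the function name and the case labels are hypothetical.

```python
def monotonicity_violations(
    latency_ns: dict[str, float], order: list[str]
) -> list[tuple[str, str]]:
    """Return adjacent (nearer, farther) pairs where the nearer case is slower."""
    return [
        (near, far)
        for near, far in zip(order, order[1:])
        if latency_ns[near] > latency_ns[far]
    ]
```

An empty result means the ordering holds. Each returned pair names the two cases whose measured latencies contradict the expected distance order.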
|
|
||||||
|
### D5. `kernbench web` — topology viewer
|
||||||
|
|
||||||
|
Optional arguments:
|
||||||
|
|
||||||
|
- `--port <N>` (default: `8765`) — HTTP port.
|
||||||
|
- `--no-open` — do not auto-open the browser.
|
||||||
|
|
||||||
|
Launches a local HTTP server that renders the compiled topology in
|
||||||
|
the browser. Distinct from the static `docs/diagrams/` artifacts:
|
||||||
|
|
||||||
|
- `docs/diagrams/` files are derived at topology-compile time
|
||||||
|
(ADR-0006).
|
||||||
|
- `kernbench web` is interactive — pan/zoom, hover for component
|
||||||
|
attributes, switch between SIP / CUBE / PE views.
|
||||||
|
|
||||||
|
### D6. Runtime API and simulation engine remain device-scoped
|
||||||
|
|
||||||
|
- Runtime API calls operate on one device per invocation.
|
||||||
|
- The simulation engine schedules all requests deterministically.
|
||||||
|
- Neither layer enumerates devices.
|
||||||
|
|
||||||
|
This invariant keeps each layer testable in isolation; device
|
||||||
|
enumeration and multi-device fan-out live only in the CLI's `run`
|
||||||
|
command (D3).
|
||||||
|
|
||||||
|
## Consequences
|
||||||
|
|
||||||
|
- Benchmark authors write single-device logic; multi-device behavior
|
||||||
|
emerges from the CLI dispatching across SIPs.
|
||||||
|
- Adding a new subcommand (e.g., trace export, replay) does not
|
||||||
|
require benchmark or runtime-API changes — the CLI is the
|
||||||
|
extension point.
|
||||||
|
- `probe` and `web` are diagnostic / visualization tools, not
|
||||||
|
benchmarks; they bypass the benchmark loader path.
|
||||||
|
|
||||||
|
## Links
|
||||||
|
|
||||||
|
- SPEC R7, R8, R9
|
||||||
|
- ADR-0007 (Runtime API and Simulation Engine Boundaries)
|
||||||
|
- ADR-0020 (Two-pass data execution — `--verify-data`)
|
||||||
|
- ADR-0006 (Topology compilation and diagram generation —
|
||||||
|
background for `kernbench web`)
|
||||||
@@ -1,62 +0,0 @@
|
|||||||
# ADR-0010: CLI Device Selection and Multi-Device Execution Semantics
|
|
||||||
|
|
||||||
## Status
|
|
||||||
|
|
||||||
Accepted
|
|
||||||
|
|
||||||
## Context
|
|
||||||
|
|
||||||
Benchmarks represent device-agnostic workloads that operate on a single device.
|
|
||||||
Users may want to run a benchmark:
|
|
||||||
|
|
||||||
- on a specific device, or
|
|
||||||
- across all devices in the system.
|
|
||||||
|
|
||||||
Device enumeration must not leak into benchmarks or runtime APIs.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Decision
|
|
||||||
|
|
||||||
### D1. Benchmarks are single-device by design
|
|
||||||
|
|
||||||
- A benchmark MUST define behavior for a single device only.
|
|
||||||
- A benchmark MUST accept a device identifier as input.
|
|
||||||
- Benchmarks MUST NOT enumerate or loop over multiple devices.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### D2. CLI controls device selection
|
|
||||||
|
|
||||||
The `kernbench run` command supports an optional `--device` argument:
|
|
||||||
|
|
||||||
- If `--device <id>` is specified:
|
|
||||||
- the benchmark executes once for the specified device.
|
|
||||||
|
|
||||||
- If `--device` is omitted:
|
|
||||||
- the benchmark executes once using all the SIPs discovered in the topology.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### D3. Multi-device execution is logically parallel
|
|
||||||
|
|
||||||
When running on multiple devices:
|
|
||||||
|
|
||||||
- benchmark executions are submitted to a single simulation engine instance,
|
|
||||||
- executions are logically parallel in simulation time,
|
|
||||||
- inter-device contention is naturally modeled.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### D4. Runtime API and simulation engine remain device-scoped
|
|
||||||
|
|
||||||
- Runtime API calls operate on one device per invocation.
|
|
||||||
- The simulation engine schedules all requests deterministically.
|
|
||||||
- Neither layer enumerates devices.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Links
|
|
||||||
|
|
||||||
- SPEC R7, R8
|
|
||||||
- ADR-0007 (Runtime API boundaries)
|
|
||||||
+3
-3
@@ -396,7 +396,7 @@ Other N values:
|
|||||||
#### D-LA7. n:1 mode detail
|
#### D-LA7. n:1 mode detail
|
||||||
|
|
||||||
- One logical access → one aggregated request.
|
- One logical access → one aggregated request.
|
||||||
- Target: aggregated router → hbm_ctrl (see ADR-0019).
|
- Target: aggregated router → hbm_ctrl (see ADR-0017 D8).
|
||||||
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
|
- Aggregated link BW = `channels_per_pe × channel_bw_gbs`
|
||||||
(e.g. 8 × 32 = 256 GB/s).
|
(e.g. 8 × 32 = 256 GB/s).
|
||||||
- Single queue / resource for modelling.
|
- Single queue / resource for modelling.
|
||||||
@@ -516,6 +516,6 @@ Negative:
|
|||||||
- ADR-0009 (kernel execution)
|
- ADR-0009 (kernel execution)
|
||||||
- ADR-0014 (PE-internal execution model)
|
- ADR-0014 (PE-internal execution model)
|
||||||
- ADR-0015 (component port/wire model)
|
- ADR-0015 (component port/wire model)
|
||||||
- ADR-0019 (NOC + per-channel HBM connectivity — LA model topology
|
- ADR-0017 (Cube NOC and HBM connectivity — LA model topology consumer)
|
||||||
consumer)
|
- ADR-0013 (Verification strategy — V1 PA tagging)
|
||||||
- SPEC R2 (latency by traversal), R10 (memory addressing)
|
- SPEC R2 (latency by traversal), R10 (memory addressing)
|
||||||
+1
@@ -229,4 +229,5 @@ Tests SHOULD validate:
|
|||||||
- ADR-0011 (Memory Addressing — PA / VA / LA)
|
- ADR-0011 (Memory Addressing — PA / VA / LA)
|
||||||
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
- ADR-0007 (runtime_api vs sim_engine boundaries)
|
||||||
- ADR-0009 (kernel execution fan-out/aggregation)
|
- ADR-0009 (kernel execution fan-out/aggregation)
|
||||||
|
- ADR-0013 (Verification strategy — V1 message schema validation)
|
||||||
- SPEC R2, R7, R8
|
- SPEC R2, R7, R8
|
||||||
@@ -0,0 +1,451 @@
|
|||||||
|
# ADR-0014: PE Pipeline Execution Model
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
This ADR defines the PE-internal kernel execution model:
|
||||||
|
|
||||||
|
- Role decomposition of PE-internal components
|
||||||
|
- Command dispatch paths (simple / composite / multi-op composite with epilogue)
|
||||||
|
- TileToken-based self-routing pipeline (scheduler does dispatch + completion only)
|
||||||
|
- TCM-centric dataflow with a register-file intermediary
|
||||||
|
- Engine resource model
|
||||||
|
- Observability and trace contract
|
||||||
|
- Topology representation
|
||||||
|
|
||||||
|
PE-internal structure (7 components in scope; 2 cross-referenced):
|
||||||
|
|
||||||
|
- `pe_cpu`, `pe_scheduler`, `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`,
|
||||||
|
`pe_tcm` — defined here
|
||||||
|
- `pe_mmu` — VA model, defined in ADR-0011 D-VA
|
||||||
|
- `pe_ipcq` — collective communication, defined in ADR-0023
|
||||||
|
|
||||||
|
The goal is a deterministic, trace-friendly execution contract that keeps
|
||||||
|
each block independently swappable.
|
||||||
|
|
||||||
|
## Decision
|
||||||
|
|
||||||
|
### D1. PE-internal component roles
|
||||||
|
|
||||||
|
**PE_CPU**
|
||||||
|
|
||||||
|
- Executes kernel instruction stream / control logic.
|
||||||
|
- Generates PE commands and submits them to `PE_SCHEDULER` (via
|
||||||
|
`PeInternalTxn`).
|
||||||
|
- Does NOT enqueue work directly into engine queues.
|
||||||
|
|
||||||
|
**PE_SCHEDULER**
|
||||||
|
|
||||||
|
- Sole dispatcher inside a PE.
|
||||||
|
- Receives commands from `PE_CPU`. Dispatch by command type:
|
||||||
|
- Simple command (`DmaReadCmd`, `DmaWriteCmd`, `GemmCmd`, `MathCmd`)
|
||||||
|
→ forward directly to the target engine.
|
||||||
|
- `CompositeCmd` → generate a `TilePlan`, feed tiles into the pipeline
|
||||||
|
via a single `_feed_loop` (D6).
|
||||||
|
- Does not participate in stage-to-stage chaining within a composite;
|
||||||
|
that is handled by token self-routing (D6).
|
||||||
|
|
||||||
|
**PE_DMA**
|
||||||
|
|
||||||
|
- Handles memory transfers between TCM and external memory domains
|
||||||
|
(HBM, shared SRAM, cross-cube UCIe) through the cube NOC.
|
||||||
|
- Two execution channels:
|
||||||
|
- `DMA_READ` (capacity = 1) and `DMA_WRITE` (capacity = 1) — see D4.
|
||||||
|
- Additional virtual channels:
|
||||||
|
- `vc_compute` — load/store/writeback traffic for GEMM/MATH tiles.
|
||||||
|
- `vc_comm` — IPCQ collective send data (defined in ADR-0023 D8).
|
||||||
|
|
||||||
|
**PE_FETCH_STORE**
|
||||||
|
|
||||||
|
- TCM ↔ Register File transfer unit.
|
||||||
|
- Isolates register-file access semantics from compute engines so that
|
||||||
|
GEMM/MATH stay pure compute components.
|
||||||
|
- BW-based latency model; TCM access contention naturally serializes
|
||||||
|
through `PE_TCM`'s BW resource.
|
||||||
|
|
||||||
|
**PE_GEMM**
|
||||||
|
|
||||||
|
- MAC array. Reads operands from the register file; writes results to
|
||||||
|
the register file. Does not touch `PE_TCM` directly.
|
||||||
|
|
||||||
|
**PE_MATH**
|
||||||
|
|
||||||
|
- Element-wise / reduction / SIMD unit. Reads / writes the register file.
|
||||||
|
|
||||||
|
**PE_TCM**
|
||||||
|
|
||||||
|
- Tightly-coupled scratchpad with BW-serialized access. Two logical
|
||||||
|
regions partitioned by ownership (see D5).
|
||||||
|
|
||||||
|
**Cross-referenced components** (defined elsewhere):
|
||||||
|
|
||||||
|
- `pe_mmu` — VA→PA translation per access (ADR-0011 D-VA).
|
||||||
|
- `pe_ipcq` — collective ring buffers and peer endpoint metadata
|
||||||
|
(ADR-0023).
|
||||||
|
|
||||||
|
### D2. Command lifecycle and queues
|
||||||
|
|
||||||
|
`PE_SCHEDULER` maintains three logical structures:
|
||||||
|
|
||||||
|
**SubmissionQueue** — written by `PE_CPU`; consumed by the scheduler.
|
||||||
|
|
||||||
|
**InflightTable** — owned and mutated only by `PE_SCHEDULER`; tracks
|
||||||
|
expanded sub-commands, dependency state, engine assignment, and
|
||||||
|
completion status.
|
||||||
|
|
||||||
|
**CompletionQueue** — written by `PE_SCHEDULER`; holds final completion
|
||||||
|
records.
|
||||||
|
|
||||||
|
**Single-writer rule**: only `PE_SCHEDULER` mutates command completion
|
||||||
|
state. Engines report completion via explicit events / messages
|
||||||
|
consumed by the scheduler.
|
||||||
|
|
||||||
|
**Command completion**: when all sub-commands complete, `PE_SCHEDULER`
|
||||||
|
publishes a completion record.
|
||||||
|
|
||||||
|
### D3. Dispatch modes
|
||||||
|
|
||||||
|
#### D3.1 Simple command
|
||||||
|
|
||||||
|
A simple command expands to exactly one engine sub-command:
|
||||||
|
|
||||||
|
- `DmaReadCmd` / `DmaWriteCmd` → `PE_DMA`
|
||||||
|
- `GemmCmd` → `PE_GEMM`
|
||||||
|
- `MathCmd` → `PE_MATH`
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
```text
|
||||||
|
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution
|
||||||
|
→ completion → PE_SCHEDULER → CompletionQueue
|
||||||
|
```
|
||||||
|
|
||||||
|
#### D3.2 Composite command (single-op tiled pipeline)
|
||||||
|
|
||||||
|
The default `CompositeCmd` runs a single compute op as a tile-pipelined
|
||||||
|
sequence:
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ → FETCH (TCM → RF) → COMPUTE (GEMM | MATH) → STORE (RF → TCM) → DMA_WRITE
|
||||||
|
```
|
||||||
|
|
||||||
|
`PE_SCHEDULER` splits the DMA payload into hardware tiles and emits one
|
||||||
|
`TileToken` per tile with a monotonically increasing `tile_id`.
|
||||||
|
|
||||||
|
Tile dependency (within one tile `t`):
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ(t) → FETCH(t) → COMPUTE(t) → STORE(t) → DMA_WRITE(t)
|
||||||
|
```
|
||||||
|
|
||||||
|
Inter-tile overlap is allowed wherever engine resources permit
|
||||||
|
(D4 governs the constraints):
|
||||||
|
|
||||||
|
```text
|
||||||
|
DMA_READ(t+1) ∥ COMPUTE(t)
|
||||||
|
DMA_WRITE(t-1) ∥ COMPUTE(t)
|
||||||
|
```
|
||||||
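The effect of inter-tile overlap on total runtime can be estimated with a simple recurrence. This is a sketch under loud assumptions, not the simulator's model: every stage is treated as its own capacity-1 resource (the ADR states this only for `DMA_READ`, `DMA_WRITE`, and the compute slot), stage durations are fixed, and buffering between stages is unlimited. `pipeline_finish_times` is a hypothetical helper.

```python
STAGES = ("DMA_READ", "FETCH", "COMPUTE", "STORE", "DMA_WRITE")


def pipeline_finish_times(num_tiles: int, stage_ns: dict[str, float]) -> list[dict[str, float]]:
    """Finish time of every stage for every tile.

    A stage starts once (a) the same tile has left the previous stage and
    (b) the previous tile has left this stage (assumed capacity-1 resource).
    """
    finish: list[dict[str, float]] = []
    for t in range(num_tiles):
        row: dict[str, float] = {}
        ready = 0.0  # time at which tile t becomes available to the current stage
        for stage in STAGES:
            free = finish[t - 1][stage] if t > 0 else 0.0
            start = max(ready, free)
            row[stage] = start + stage_ns[stage]
            ready = row[stage]
        finish.append(row)
    return finish
```

With three tiles and 10 ns per stage, the last `DMA_WRITE` finishes at 70 ns, compared with 150 ns if the tiles ran strictly one after another.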
|
|
||||||
|
#### D3.3 Multi-op composite (head + epilogue with scope)
|
||||||
|
|
||||||
|
A `CompositeCmd` MAY carry `ops: tuple[OpSpec, ...]` to express a
|
||||||
|
multi-op pipeline:
|
||||||
|
|
||||||
|
```python
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class OpSpec:
|
||||||
|
    kind: str # "gemm" | "math.exp" | "math.bias_add" | ...
|
||||||
|
    scope: Scope # "per_k_tile" | "per_output_tile" | "once"
|
||||||
|
    ...
|
||||||
|
```
|
||||||
|
|
||||||
|
- `ops[0]` (head) defines tile geometry (e.g., the head GEMM determines
|
||||||
|
M/K/N partition).
|
||||||
|
- `ops[1:]` (epilogue) are subsequent stages whose `scope` decides how
|
||||||
|
often they fire:
|
||||||
|
- `per_k_tile` — every K-reduction step.
|
||||||
|
- `per_output_tile` — once per output tile.
|
||||||
|
- `once` — once per kernel.
|
||||||
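One way to read the three scopes is as firing counts over a tiled GEMM. The sketch assumes every output tile accumulates over the same number of K-reduction steps; `epilogue_fire_count` and its parameters are hypothetical and only illustrate the scope semantics listed above.

```python
def epilogue_fire_count(scope: str, output_tiles: int, k_tiles: int) -> int:
    """Number of times an epilogue op fires for one composite command."""
    if scope == "per_k_tile":
        # Assumed: each output tile runs k_tiles reduction steps.
        return output_tiles * k_tiles
    if scope == "per_output_tile":
        return output_tiles
    if scope == "once":
        return 1
    raise ValueError(f"unknown scope: {scope!r}")
```

For a kernel with 4 output tiles and 8 K-steps each, the counts are 32, 4, and 1 respectively.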
|
|
||||||
|
Cross-engine chains (e.g., GEMM head → MATH epilogue) are natural —
|
||||||
|
each stage is dispatched via token self-routing (D6), so GEMM and MATH
|
||||||
|
participate serially within the same composite even though they share
|
||||||
|
the compute slot (D4).
|
||||||
|
|
||||||
|
The empty-`ops` form is the legacy single-op path.
|
||||||
|
|
||||||
|
### D4. Engine resource model
|
||||||
|
|
||||||
|
**DMA engine**:
|
||||||
|
|
||||||
|
- `DMA_READ`: `simpy.Resource(capacity=1)`.
|
||||||
|
- `DMA_WRITE`: `simpy.Resource(capacity=1)`.
|
||||||
|
- Both channels run concurrently (READ ∥ WRITE allowed).
|
||||||
|
- Within a channel, requests serialize (READ ∥ READ disallowed; same
|
||||||
|
for WRITE).
|
||||||
|
- `vc_comm` is an orthogonal channel for IPCQ traffic defined in
|
||||||
|
ADR-0023 D8 — out of scope for this ADR.
|
||||||
|
|
||||||
|
**Compute engine**:
|
||||||
|
|
||||||
|
- `accel_slot`: `simpy.Resource(capacity=1)` shared by `PE_GEMM` and
|
||||||
|
`PE_MATH`.
|
||||||
|
- At most one compute op runs at a time within a PE.
|
||||||
|
- Multi-op composite chains (D3.3) execute their compute stages serially
|
||||||
|
through this slot; token self-routing (D6) ensures the next stage
|
||||||
|
starts only after the previous compute releases the slot.
|
||||||
|
|
||||||
|
**Engine completion**: each engine emits a completion event consumed by
|
||||||
|
the scheduler / `PipelineContext` (D6).
|
||||||
|
|
||||||
|
### D5. Dataflow
|
||||||
|
|
||||||
|
**Input path (HBM source)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
HBM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||||
|
PE_TCM → PE_FETCH_STORE → Register File
|
||||||
|
Register File → PE_GEMM | PE_MATH
|
||||||
|
```
|
||||||
|
|
||||||
|
**Input path (shared SRAM source)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Shared SRAM → cube NOC → PE_DMA (DMA_READ) → PE_TCM
|
||||||
|
PE_TCM → PE_FETCH_STORE → Register File
|
||||||
|
```
|
||||||
|
|
||||||
|
**Output path (HBM destination)**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Register File → PE_FETCH_STORE → PE_TCM
|
||||||
|
PE_TCM → PE_DMA (DMA_WRITE) → cube NOC → HBM
|
||||||
|
```
|
||||||
|
|
||||||
|
GEMM/MATH never touch `PE_TCM` directly — `PE_FETCH_STORE` is the
|
||||||
|
single TCM↔register-file gateway. This makes TCM BW contention
|
||||||
|
explicit and lets fetch unit policies (e.g., prefetch) be replaced
|
||||||
|
independently of compute engines.
|
||||||
|
|
||||||
|
#### D5.1 PE_TCM partitioning
|
||||||
|
|
||||||
|
`PE_TCM` is split into two logical regions:
|
||||||
|
|
||||||
|
**SchedulerReservedTCM**
|
||||||
|
|
||||||
|
- Owned exclusively by `PE_SCHEDULER`.
|
||||||
|
- Holds composite-command tile buffers.
|
||||||
|
- `PE_SCHEDULER` partitions this region, assigns buffers per DMA_READ /
|
||||||
|
COMPUTE / DMA_WRITE stage, guarantees input/output separation, and
|
||||||
|
manages tile-buffer lifetimes.
|
||||||
|
|
||||||
|
**AllocatableTCM**
|
||||||
|
|
||||||
|
- General-purpose region managed by `PEMemAllocator`.
|
||||||
|
- Used for host / DP-visible allocations.
|
||||||
|
|
||||||
|
**Visibility rule (hard isolation)**: `PEMemAllocator` MUST NOT see or
|
||||||
|
allocate inside `SchedulerReservedTCM`. The reserved region is excluded
|
||||||
|
from allocator-managed ranges by construction.
|
||||||
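Exclusion "by construction" can be illustrated with an allocator that is only ever handed the allocatable range. This is a toy sketch, not `PEMemAllocator`: the class name, the bump-allocation policy, and placing the reserved region at the lowest addresses are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTcm:
    """Toy TCM with a scheduler-reserved prefix and a bump-allocated remainder."""
    size: int
    reserved: int  # bytes owned by the scheduler (assumed to sit at offset 0)
    _next: int = field(init=False)

    def __post_init__(self) -> None:
        if not 0 <= self.reserved <= self.size:
            raise ValueError("reserved region must fit inside the TCM")
        # The allocator's range starts after the reserved region, so it can
        # never hand out an address inside it.
        self._next = self.reserved

    def alloc(self, nbytes: int) -> int:
        """Return the base offset of a new allocation in the allocatable region."""
        if nbytes <= 0 or self._next + nbytes > self.size:
            raise MemoryError("allocatable TCM exhausted")
        base = self._next
        self._next += nbytes
        return base
```

Because the reserved range is never part of the allocator's state, no runtime check is needed to keep allocations out of it.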
|
|
||||||
|
**Tile buffer rules**:
|
||||||
|
|
||||||
|
- Input and output buffers within `SchedulerReservedTCM` MUST NOT
|
||||||
|
overlap during a tile's active lifetime.
|
||||||
|
- A tile buffer remains valid until the corresponding `DMA_WRITE`
|
||||||
|
completes.
|
||||||
|
- Buffer reuse is permitted only after the consuming tile's lifetime
|
||||||
|
ends.
|
||||||
|
|
||||||
|
### D6. TileToken self-routing pipeline
|
||||||
|
|
||||||
|
A composite's stage-to-stage progression happens **without** routing
|
||||||
|
through the scheduler. Each component forwards the token directly to
|
||||||
|
the next stage's component using the token's `plan`:
|
||||||
|
|
||||||
|
```text
|
||||||
|
Scheduler → DMA → Fetch → GEMM → Math (epi) → Store → DMA_WB → (complete)
|
||||||
|
↑ chaining: no scheduler hop ↑
|
||||||
|
PipelineContext.complete_tile()
|
||||||
|
```
|
||||||
|
|
||||||
|
This mirrors real-HW done-wire chains. The scheduler handles only
|
||||||
|
**initial dispatch + completion aggregation**.
|
||||||
|
|
||||||
|
#### TilePlan / Stage
|
||||||
|
|
||||||
|
```python
|
||||||
|
class StageType(Enum):
|
||||||
|
    DMA_READ = 0
|
||||||
|
    FETCH = 1
|
||||||
|
    GEMM = 2
|
||||||
|
    MATH = 3
|
||||||
|
    STORE = 4
|
||||||
|
    DMA_WRITE = 5
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class Stage:
|
||||||
|
    stage_type: StageType
|
||||||
|
    component: str # topology node id (e.g., "sip0.cube0.pe0.pe_dma")
|
||||||
|
    params: dict # stage-specific parameters
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class TilePlan:
|
||||||
|
    tile_id: int
|
||||||
|
    stages: tuple[Stage, ...]
|
||||||
|
```
|
||||||
|
|
||||||
|
#### TileToken
|
||||||
|
|
||||||
|
```python
|
||||||
|
@dataclass
|
||||||
|
class TileToken:
|
||||||
|
    tile_id: int
|
||||||
|
    pipeline_ctx: PipelineContext
|
||||||
|
    plan: TilePlan
|
||||||
|
    stage_idx: int
|
||||||
|
    params: dict # cached current stage params
|
||||||
|
    data_op: bool = True # op_log opt-in (ADR-0020 D4)
|
||||||
|
```
|
||||||
|
|
||||||
|
Single-owner invariant: a token is owned by exactly one component at a
|
||||||
|
time. Lifecycle: scheduler creates with `stage_idx=0` → component
|
||||||
|
`_process()` → increment `stage_idx` → put to next stage's `in_port` →
|
||||||
|
last stage calls `pipeline_ctx.complete_tile()`.
|
||||||
|
|
||||||
|
#### PipelineContext (exactly-once completion)
|
||||||
|
|
||||||
|
```python
|
||||||
|
@dataclass
|
||||||
|
class PipelineContext:
|
||||||
|
    id: str
|
||||||
|
    total_tiles: int
|
||||||
|
    completed_tiles: int = 0
|
||||||
|
    done_event: simpy.Event = None
|
||||||
|
|
||||||
|
    def complete_tile(self) -> None:
|
||||||
|
        self.completed_tiles += 1
|
||||||
|
        if self.completed_tiles == self.total_tiles:
|
||||||
|
            self.done_event.succeed()
|
||||||
|
```
|
||||||
|
|
||||||
|
Each tile's last stage MUST call `complete_tile()` exactly once.
|
||||||
|
Duplicate calls are bugs (SimPy `Event` can succeed at most once).
|
||||||
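A guard can make an extra call fail at the offending call site instead of surfacing later. The sketch below is a SimPy-free illustration: `CountingContext` and its `on_done` callback are hypothetical stand-ins for `PipelineContext` and its `done_event`, and the guard only detects over-counting in total, not which tile reported twice.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CountingContext:
    """Counts tile completions and fires on_done exactly once."""
    total_tiles: int
    on_done: Callable[[], None]  # stands in for done_event.succeed()
    completed_tiles: int = 0

    def complete_tile(self) -> None:
        if self.completed_tiles >= self.total_tiles:
            raise RuntimeError("complete_tile() called more often than total_tiles")
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.on_done()
```

With `total_tiles=2`, the callback runs on the second call and a third call raises `RuntimeError`.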
|
|
||||||
|
#### Feed ordering
|
||||||
|
|
||||||
|
`PE_SCHEDULER` has exactly one `_feed_loop` process consuming a
|
||||||
|
`_pending_feeds` FIFO. Composite commands are enqueued in submission
|
||||||
|
order; tile feed for a command runs to completion before the next
|
||||||
|
command's feed begins. **Tile-feed interleaving between commands is
|
||||||
|
disallowed.**
|
||||||
|
|
||||||
|
Within a single command's tiles, downstream pipeline overlap arises
|
||||||
|
naturally — earlier tiles progress through later stages while the feeder
|
||||||
|
keeps pushing remaining tiles into the first stage queue (SimPy Store
|
||||||
|
backpressure governs flow control). If the first-stage queue is full,
|
||||||
|
only the feeder blocks; the scheduler worker's inbox processing
|
||||||
|
continues.
|
||||||
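The no-interleaving rule amounts to draining a FIFO of commands one command at a time. A minimal sketch, ignoring backpressure and simulation time; `feed_order` is a hypothetical helper, not the scheduler's `_feed_loop`.

```python
from collections import deque


def feed_order(commands: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return (command_id, tile_id) pairs in the order tiles enter stage 0.

    `commands` holds (command_id, num_tiles) in submission order.
    """
    pending = deque(commands)  # FIFO in submission order
    fed: list[tuple[str, int]] = []
    while pending:
        cmd_id, num_tiles = pending.popleft()
        # All tiles of one command are fed before the next command starts.
        for tile_id in range(num_tiles):
            fed.append((cmd_id, tile_id))
    return fed
```

Feeding `[("A", 2), ("B", 3)]` yields both tiles of `A` followed by all three tiles of `B`, with no `B` tile between the `A` tiles.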
|
|
||||||
|
#### Token routing pattern (base class)
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _pipeline_worker(self, env):
|
||||||
|
    while True:
|
||||||
|
        token = yield self._inbox.get()
|
||||||
|
        yield from self._process(env, token) # stage-specific logic
|
||||||
|
        next_idx = token.stage_idx + 1
|
||||||
|
        if next_idx < len(token.plan.stages):
|
||||||
|
            next_stage = token.plan.stages[next_idx]
|
||||||
|
            token.stage_idx = next_idx
|
||||||
|
            token.params = next_stage.params
|
||||||
|
            yield self.out_ports[next_stage.component].put(token)
|
||||||
|
        else:
|
||||||
|
            token.pipeline_ctx.complete_tile()
|
||||||
|
```
|
||||||
|
|
||||||
|
Each component implements only `_process()`; chaining lives in the
|
||||||
|
base class.
|
||||||
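The routing decision itself is independent of SimPy and can be traced with plain data. In this sketch `ToyStage`, `ToyToken`, and `route` are hypothetical names; processing and port hand-off are reduced to recording which component holds the token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStage:
    component: str  # name of the component that runs this stage


@dataclass
class ToyToken:
    stages: tuple[ToyStage, ...]
    stage_idx: int = 0


def route(token: ToyToken) -> list[str]:
    """Return the components visited, in order, until the plan is exhausted."""
    visited: list[str] = []
    while True:
        visited.append(token.stages[token.stage_idx].component)  # stands in for _process()
        if token.stage_idx + 1 < len(token.stages):
            token.stage_idx += 1  # hand-off to the next stage's component
        else:
            return visited        # last stage: this is where complete_tile() would run
```

A plan of `pe_dma`, `pe_fetch_store`, `pe_gemm`, `pe_math`, `pe_fetch_store`, `pe_dma` is visited in exactly that order, and the token ends with `stage_idx` pointing at the final stage.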
|
|
||||||
|
### D7. Observability and trace contract
|
||||||
|
|
||||||
|
The simulator emits deterministic trace events:
|
||||||
|
|
||||||
|
- `command_submitted`
|
||||||
|
- `sub_command_dispatched`
|
||||||
|
- `engine_start`
|
||||||
|
- `engine_complete`
|
||||||
|
- `tile_ready`
|
||||||
|
- `command_complete`
|
||||||
|
|
||||||
|
For identical inputs, trace ordering MUST be deterministic.
|
||||||
|
|
||||||
|
### D8. Topology representation
|
||||||
|
|
||||||
|
PE-internal components are declared in `cube.pe_template`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
pe_template:
|
||||||
|
components:
|
||||||
|
pe_cpu: { kind: pe_cpu, impl: builtin.pe_cpu, attrs: { overhead_ns: ... } }
|
||||||
|
pe_scheduler: { kind: pe_scheduler, impl: builtin.pe_scheduler, attrs: { overhead_ns: ... } }
|
||||||
|
pe_dma: { kind: pe_dma, impl: builtin.pe_dma, attrs: { rd_engines: 1, wr_engines: 1 } }
|
||||||
|
pe_fetch_store: { kind: pe_fetch_store, impl: builtin.pe_fetch_store, attrs: { ... } }
|
||||||
|
pe_gemm: { kind: pe_gemm, impl: builtin.pe_gemm, attrs: { shared_resource: accel_slot, ... } }
|
||||||
|
pe_math: { kind: pe_math, impl: builtin.pe_math, attrs: { shared_resource: accel_slot, ... } }
|
||||||
|
pe_tcm: { kind: pe_tcm, impl: builtin.pe_tcm, attrs: { size_mb: ..., read_bw_gbs: ..., write_bw_gbs: ... } }
|
||||||
|
pe_mmu: { kind: pe_mmu, impl: builtin.pe_mmu, attrs: { ... } } # ADR-0011 D-VA
|
||||||
|
pe_ipcq: { kind: pe_ipcq, impl: builtin.pe_ipcq, attrs: { ... } } # ADR-0023
|
||||||
|
links:
|
||||||
|
# Scheduler dispatch edges (initial)
|
||||||
|
scheduler_to_dma_mm: 0.0
|
||||||
|
scheduler_to_fetch_store_mm: 0.0
|
||||||
|
scheduler_to_gemm_mm: 0.0
|
||||||
|
scheduler_to_math_mm: 0.0
|
||||||
|
# Pipeline chaining edges (token self-routing per D6)
|
||||||
|
dma_to_fetch_store_mm: 0.0
|
||||||
|
fetch_store_to_gemm_mm: 0.0
|
||||||
|
fetch_store_to_math_mm: 0.0
|
||||||
|
gemm_to_fetch_store_mm: 0.0
|
||||||
|
gemm_to_math_mm: 0.0
|
||||||
|
math_to_fetch_store_mm: 0.0
|
||||||
|
fetch_store_to_dma_mm: 0.0
|
||||||
|
fetch_store_to_tcm_bw_gbs: ...
|
||||||
|
```
|
||||||
|
|
||||||
|
Template is instantiated once per PE. PE instances are derived from
|
||||||
|
`cube.pe_layout` (corner placement). External connectivity (PE_DMA ↔
|
||||||
|
cube NOC ↔ HBM, etc.) is modeled at the cube level (ADR-0017 D4).

## Consequences

### Positive

- Each block is an independent topology node — individually swappable
  via DI (ADR-0015).
- PE-internal structure is visible in the topology graph.
- Components do not know their downstream — plan-based routing gives
  flexibility (e.g., epilogue chains require no scheduler change).
- DMA and compute overlap naturally via SimPy Store backpressure.
- Multi-op composite expresses fused operations (e.g., GEMM + bias_add)
  without engine-level coupling.
- TCM access contention is realistic — `PE_FETCH_STORE` is the single
  TCM↔RF gateway.
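
The plan-based routing point can be sketched as follows. The `TileToken` fields and the `forward` helper are hypothetical stand-ins, not the simulator's types. The sketch shows only that a fused GEMM + bias_add epilogue changes the plan carried by the token, while the components stay unaware of their successors.

```python
from dataclasses import dataclass


@dataclass
class TileToken:
    """Hypothetical token that carries its own remaining route (the plan)."""
    tile_id: int
    plan: list   # remaining component names, in visit order
    trace: list  # components visited so far


def forward(token, components):
    """Walk the token through its plan; a component only ever sees the token."""
    while token.plan:
        name = token.plan.pop(0)
        components[name](token)   # the component does its work...
        token.trace.append(name)  # ...and never names its successor
    return token


# Stand-in components that do nothing; real ones would model latency.
components = {name: (lambda tok: None)
              for name in ("pe_dma", "pe_fetch_store", "pe_gemm", "pe_math")}

# GEMM followed by a bias_add epilogue on PE_MATH: only the plan differs.
plan = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math",
        "pe_fetch_store", "pe_dma"]
done = forward(TileToken(tile_id=0, plan=list(plan), trace=[]), components)
print(done.trace == plan)  # True
```

Each consecutive pair in the plan corresponds to one of the chaining edges declared under `links` (`dma_to_fetch_store`, `fetch_store_to_gemm`, `gemm_to_math`, `math_to_fetch_store`, `fetch_store_to_dma`).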

### Negative

- Intra-PE component count is higher than a coarser model (7 base + 2
  cross-referenced) — more topology nodes/edges.
- Intra-PE token forwarding is explicit in traces (acceptable trade for
  HW fidelity).

## Links

- ADR-0011 D-VA (PE_MMU component, VA translation)
- ADR-0015 D4 (component port/wire model)
- ADR-0020 (greenlet kernel execution / two-pass)
- ADR-0023 (PE_IPCQ + PE_DMA virtual channels)
- SPEC R3, R4
@@ -1,365 +0,0 @@
# ADR-0014: PE Internal Execution Model (PE_CPU, PE_SCHEDULER, and Composite Commands)

## Status

Accepted

## Context

ADR-0003 (system hierarchy) and ADR-0009 (kernel execution semantics) reference PE internals but do not define:

- the dispatch model inside a PE,
- the responsibilities of PE_SCHEDULER,
- the PE_TCM-centric dataflow contract used by accelerator engines.

We need a deterministic and debuggable PE-internal execution contract that supports:

- simple single-engine commands
- composite commands that build a tiled pipeline across DMA and accelerator engines

The simulator must produce deterministic traces and allow modeling of PE-internal pipelining without introducing nondeterministic engine scheduling.

## Decision

### D1. PE internal component roles

Each PE contains the following logical components.

**PE_CPU**

- Executes kernel instruction stream or kernel control logic.
- Generates PE commands.
- Submits commands to PE_SCHEDULER.
- PE_CPU does NOT enqueue work directly into engine queues.

**PE_SCHEDULER**

- The sole dispatcher inside a PE.
- Receives commands from PE_CPU.
- Expands composite commands into sub-commands.
- Tracks dependencies and command state.
- Dispatches work to engine queues.
- Manages tile scheduling for composite commands.

**PE_DMA**

- Handles memory transfers between PE_TCM and external memory domains.
- PE_DMA connects to the cube-level NOC (on-die fabric):
  - All destinations (HBM, shared SRAM, inter-cube UCIe) are reached via the NOC
  - Local HBM access: PE_DMA → NOC → hbm_ctrl (minimal hop)
  - Remote/shared: PE_DMA → NOC → (fabric hops) → destination
- Supported directions include:
  - HBM → PE_TCM (via NOC)
  - PE_TCM → HBM (via NOC)
  - PE_TCM → shared SRAM (via NOC)
  - PE_TCM → other memory domains (via NOC, if supported by topology)

**PE_GEMM**

- Matrix multiplication engine.
- Reads activations from PE_TCM.
- May stream weights directly from HBM.

**PE_MATH**

- Element-wise computation engine.
- Reads and writes PE_TCM.

**PE_TCM**

- Local SRAM used as the staging memory for accelerator operations.

---

### D2. Command lifecycle and queues

PE_SCHEDULER maintains three logical structures.

**SubmissionQueue**

- Written by PE_CPU.
- Contains incoming PE commands waiting to be processed.

**InflightTable**

- Owned and mutated only by PE_SCHEDULER.
- Tracks:
  - expanded sub-commands
  - dependency state
  - engine assignment
  - completion status

**CompletionQueue**

- Written by PE_SCHEDULER.
- Contains final completion records for commands.

**Single-writer rule**

- Only PE_SCHEDULER is allowed to mutate command completion state.
- Engine components must report completion via explicit completion events/messages.

**Command completion**

A command becomes DONE when:

- all sub-commands complete
- PE_SCHEDULER publishes a completion record to CompletionQueue.

---

### D3. Dispatch modes

PE commands are divided into two categories.

#### D3.1 Simple command

A simple command expands to exactly one engine sub-command.

Examples include:

- DMA transfer
- GEMM compute
- MATH compute

Execution flow:

```text
PE_CPU → SubmissionQueue → PE_SCHEDULER → engine queue → engine execution → completion event → PE_SCHEDULER → CompletionQueue
```

#### D3.2 Composite command (tiled pipeline)

Composite commands implement tiled pipelined execution across engines.

Each tile executes the following pipeline:

```text
Input DMA (READ)
→ Compute (GEMM or MATH)
→ Output DMA (WRITE)
```

**Tiling rule**

If the DMA payload exceeds hardware tile size, PE_SCHEDULER splits the transfer into tiles.
Each tile is assigned a monotonically increasing `tile_id`.

**Tile dependency rules**

For tile `t`:

- Compute must wait for input DMA: `DMA_READ(t) → COMPUTE(t)`
- Output DMA must wait for compute: `COMPUTE(t) → DMA_WRITE(t)`
- All dependencies are enforced by PE_SCHEDULER.

**Overlap policy (Phase 0 default)**

Operations for different tiles may overlap when engine resources permit.

Allowed overlaps:

```text
DMA_READ(t+1) ∥ COMPUTE(t)
DMA_WRITE(t−1) ∥ COMPUTE(t)
DMA_READ(t) ∥ DMA_WRITE(t)
```

Disallowed overlaps:

```text
GEMM(t) ∥ GEMM(t′)
MATH(t) ∥ MATH(t′)
GEMM(t) ∥ MATH(t′)
```

---

### D4. Engine execution model (Phase 0 default)

Each engine behaves as a deterministic service resource.

**DMA engine**

PE_DMA contains two independent channels.

```text
DMA_READ capacity = 1
DMA_WRITE capacity = 1
```

Rules:

- DMA_READ and DMA_WRITE may execute concurrently.
- Multiple READs cannot overlap.
- Multiple WRITEs cannot overlap.

Example allowed:

```text
DMA_READ(t+1) ∥ DMA_WRITE(t)
```

Example not allowed:

```text
DMA_READ(t) ∥ DMA_READ(t+1)
DMA_WRITE(t) ∥ DMA_WRITE(t+1)
```

**Compute engine**

Compute operations share a single compute resource.

```text
PE_ACCEL capacity = 1
```

Both GEMM and MATH require this shared compute slot.

Consequences:

- GEMM ∥ GEMM not allowed
- MATH ∥ MATH not allowed
- GEMM ∥ MATH not allowed

Only one compute operation can run in a PE at a time.

**Compute opcode restriction**

Composite commands contain one compute opcode only.

Examples:

```text
COMPOSITE_GEMM
COMPOSITE_MATH
```

Mixed compute pipelines such as `GEMM → MATH` are not supported in Phase 0.

**Engine completion signaling**

Every engine emits a completion event when a sub-command finishes.
Completion events are delivered to PE_SCHEDULER.

---

### D5. Dataflow model

Compute operations use a TCM-centric dataflow model.

**Input path (HBM)**

```text
HBM → NOC → PE_DMA (DMA_READ) → PE_TCM
```

**Input path (shared SRAM)**

```text
Shared SRAM → NOC → PE_DMA (DMA_READ) → PE_TCM
```

**Compute stage**

Compute engines read input tensors from PE_TCM.

```text
PE_TCM → GEMM / MATH
```

Weights for GEMM may optionally stream directly from HBM (via NOC).

**Output path (HBM)**

Compute results are written to PE_TCM, then DMA writes to HBM.

```text
PE_TCM → PE_DMA (DMA_WRITE) → NOC → HBM
```

**Output path (shared SRAM)**

```text
PE_TCM → PE_DMA (DMA_WRITE) → NOC → Shared SRAM
```

#### D5.1 PE_TCM partitioning and ownership boundary

The PE_TCM address space is partitioned into two logical regions.

**SchedulerReservedTCM**

- A staging region owned exclusively by PE_SCHEDULER.
- This region is used for composite command tile buffers.
- PE_SCHEDULER:
  - partitions this region into tile buffers
  - assigns buffers for DMA_READ, COMPUTE, and DMA_WRITE stages
  - guarantees input/output buffer separation
  - manages tile buffer lifetime

**AllocatableTCM**

- General-purpose region managed by PEMemAllocator.
- Used by host or DP-visible allocations.

**Visibility rule (hard isolation)**

- PEMemAllocator must not see or allocate memory inside SchedulerReservedTCM.
- SchedulerReservedTCM is excluded from allocator-managed ranges by construction.
- This prevents DP or host allocations from interfering with scheduler staging buffers.

**Tile buffer rules**

Within SchedulerReservedTCM:

- input buffers and output buffers must not overlap
- PE_SCHEDULER assigns tile buffers for DMA and compute stages
- tile buffers remain valid until the corresponding DMA_WRITE completes
- Buffer reuse is allowed only after the tile lifetime finishes.

---

### D6. Observability and trace contract

The simulator must emit deterministic trace events.

Required events include:

- `command_submitted`
- `sub_command_dispatched`
- `engine_start`
- `engine_complete`
- `tile_ready`
- `command_complete`

Trace ordering must be deterministic for identical inputs.

---

### D7. Topology representation

PE internal components are declared in `cube.pe_template`.

The template is instantiated once per PE.

PE instances are derived from `cube.pe_layout`.

External connectivity such as:

- PE_DMA → NOC → HBM (data path)
- PE_DMA → NOC → shared SRAM, inter-cube UCIe (non-HBM data path)
- NOC → PE_CPU (command path from M_CPU)

is modeled at the CUBE level (see ADR-0003 D3).

---

## Links

- SPEC R3, R4
- ADR-0003 D4 (PE-level system hierarchy)
- ADR-0005 View C (PE-level diagram)
- ADR-0008 D2 (PA-level allocation at PE scope; PEMemAllocator is the per-PE allocator instance)
- ADR-0009 D3 (kernel execution fan-out and PE_CPU dispatch)
+11
-16
@@ -6,20 +6,19 @@ Accepted
 
 ## Context
 
-ADR-0007 D2 assigns path-walking and low-level request decomposition to the simulation engine.
-In practice, the engine iterates the topology path and calls `run()` on each component
-sequentially — conflating routing policy with component behavior and preventing realistic
-hardware modeling (queues, contention, fan-out).
-
-ADR-0007 D3 already states that components own fan-out and aggregation, but the current
-implementation does not enforce this for fabric traversal.
+Realistic hardware modeling — queues, contention, fan-out — requires
+that components own fabric traversal while the simulation engine
+handles only initialization and completion observation. Direct method
+calls between components, or path-walking inside the engine, defeat
+queueing and contention semantics.
 
 This ADR defines:
 
 - how components communicate via typed port queues,
 - how propagation delay is modeled (wire processes with BW occupancy),
-- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch (via M_CPU),
-- the reduced role of the simulation engine,
+- the fabric paths for Memory R/W (M_CPU bypass) and Kernel Launch
+  (via M_CPU),
+- the engine's reduced role (wire init + completion observation only),
 - M_CPU.DMA as an internal subcomponent of M_CPU.
 
 ---
@@ -88,9 +87,6 @@ The simulation engine MUST NOT:
 - call component `run()` methods directly,
 - track per-hop latency or decompose fan-out.
 
-This supersedes ADR-0007 D2's "decompose operations into low-level requests" clause.
-ADR-0007 D2 must be amended accordingly.
-
 ---
 
 ### D4. Fabric paths for Memory R/W and Kernel Launch
@@ -192,16 +188,15 @@ It is used for shard comparison in `_route_kernel` and as a regression guard.
 - Propagation delay is modeled accurately per edge.
 - Engine is decoupled from routing policy.
 - Component implementations remain swappable via DI (ADR-0007 D3).
-- ADR-0007 D2 must be amended to remove path-walking from engine responsibilities.
-- ADR-0009 D3 should be updated to reference the unified fabric path (D4 above).
 
 ---
 
 ## Links
 
-- ADR-0007 D2 (to be amended: engine path-walking clause)
-- ADR-0009 D3 (kernel execution fan-out; fabric path to be referenced)
+- ADR-0007 D2 (engine role boundary)
+- ADR-0009 D3 (kernel execution fan-out hierarchy)
 - ADR-0014 D4 (DMA engine capacity=1)
 - ADR-0012 D1 (host ↔ IO_CPU message schema; M_CPU.DMA is component-internal)
 - ADR-0016 (IOChiplet NOC and memory data path)
 - ADR-0017 (cube NOC 2D mesh architecture)
+- ADR-0033 (Latency model assumptions built on these mechanisms)
@@ -1,189 +0,0 @@
# ADR-0017: Cube NOC 2D Mesh Architecture

## Status

Accepted

## Context

ADR-0003 D3 defines the cube-level NOC as a "distributed on-die fabric" but
does not specify the internal routing model, contention semantics, or
attachment topology. The implementation uses a 2D mesh router grid with
XY routing and per-segment contention modeling. This ADR formalizes that
architecture.

## Decision

### D1. NOC node and router grid

Each cube contains a 2D router mesh generated by `mesh_gen.py`.
Each router is a separate topology node (`sip{S}.cube{C}.r{row}c{col}`)
implemented as `forwarding_v1`. (Supersedes the original single-node
`noc_2d_mesh_v1` design — see ADR-0019.)

Grid properties:

- Default dimensions: 6x6 routers (derived from PE layout + UCIe connections)
- Router naming: `r{row}c{col}` (e.g., `r0c0`, `r5c5`)
- HBM exclusion zone: center rows/columns are excluded where HBM physically
  occupies space (e.g., r2c2, r2c3, r3c2, r3c3)
- Router positions are derived from physical PE corner placement and cube
  geometry

The NOC overhead_ns is 0.0. Latency is modeled by Manhattan distance
traversal within the mesh (distance_mm x ns_per_mm).

### D2. XY routing algorithm

The NOC uses deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y
2. Vertical segment: route from destination X at source Y to destination Y

Each directed segment is identified by a unique link key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.

### D3. Contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for the
resource. This models link-level serialization in a wormhole-routed mesh.

With no contention, NOC traversal latency equals the Manhattan distance
multiplied by `ns_per_mm`. Under contention, additional queueing delay
is added by SimPy's resource scheduling.

### D4. NOC attachment points

The NOC connects to all major cube-level components:

```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu <--+         |       |         +--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu <--+         |       |         +--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)

HBM attach: hbm_ctrl is also connected to each router that hosts a PE (ADR-0019 D1)
(xbar_top/xbar_bot were removed by ADR-0019)
```

### D5. NOC edge bandwidths and distances

| Connection | BW (GB/s) | Distance | Notes |
| --- | --- | --- | --- |
| PE_DMA -> NOC | 256.0 | Physical (PE pos) | Matches HBM slice BW |
| NOC -> PE_CPU | - | 0.0 mm | Command path only |
| Router <-> HBM_CTRL | 256.0 | 0.0 mm | Per PE router (ADR-0019) |
| NOC <-> M_CPU | - | 0.0 mm | Command path |
| NOC <-> SRAM | 128.0 x4 | 0.0 mm | 512 GB/s aggregate |
| NOC <-> UCIe conn | 128.0 | 0.0 mm | Per connection, 4 per port |

Distance 0.0 mm for most connections reflects the distributed nature of
the NOC; the actual traversal distance is computed internally via Manhattan
distance within the router grid.

### D6. UCIe decomposition and inter-cube traffic

Each cube has 4 UCIe ports (N, S, E, W). Each port is decomposed into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (overhead = 8.0 ns)
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe

This decomposition enables N=4 independent NOC-to-UCIe connections per port,
each with 128 GB/s bandwidth. Total aggregate per port: 512 GB/s.

Inter-cube traffic path:

```text
Source: PE_DMA -> NOC -> conn{i} -> ucie-{PORT}
            [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} -> conn{i} -> r{x}c{y} -> (mesh hops) -> hbm_ctrl
```

UCIe overhead (8.0 ns) is applied at each ucie-{PORT} node, so a
full crossing incurs 16 ns (TX port + RX port).

### D7. Data paths through the NOC

**PE DMA to local HBM (same half):**

```text
PE_DMA -> r{x}c{y} -> hbm_ctrl (local: 0 mesh hops, switching overhead only)
```

**PE DMA to remote PE's HBM:**

```text
PE_DMA -> r{x}c{y} -> (mesh hops) -> r{x'}c{y'} -> hbm_ctrl
```

**PE DMA to remote cube HBM:**

```text
PE_DMA -> r{x}c{y} -> conn -> ucie-E -> [seam] -> ucie-W -> conn -> r{x'}c{y'} -> hbm_ctrl
```

**Kernel Launch command to PE:**

```text
[from io_noc] -> ucie -> conn -> r{x}c{y} -> (mesh hops) -> M_CPU -> (mesh hops) -> PE_CPU
```

**Shared SRAM access:**

```text
PE_DMA -> r{x}c{y} -> (mesh hops) -> SRAM
```

### D8. Mesh generation

The router grid is generated by `mesh_gen.py` based on:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner
- `cube.geometry`: cube physical dimensions and HBM zone
- `cube.ucie.n_connections`: determines router count for UCIe attachment

The generator produces a `mesh_data` dictionary containing:

- Router grid with positions and HBM exclusion zones
- PE-to-router attachments (pe_dma, pe_cpu per PE)
- UCIe-to-router attachments (N/S/E/W, distributed across edge routers)
- M_CPU and SRAM router attachments
- HBM attachment per PE router (ADR-0019)

## Consequences

- NOC provides position-aware routing with deterministic latency
- Contention is captured per directed segment (not per-node)
- All cube-internal traffic is explicitly routed through the NOC
- HBM exclusion zone reflects physical die layout constraints
- The mesh generation is fully parameterized by `topology.yaml`

## Links

- ADR-0003 D3 (cube-level NOC definition — extended by this ADR)
- ADR-0004 D1 (PE DMA to local HBM path via router mesh)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0019 (NOC-Local HBM: xbar/bridge removed, explicit router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 D1 (IOChiplet io_noc — analogous pattern at IO chiplet level)
@@ -0,0 +1,291 @@
# ADR-0017: Cube NOC and HBM Connectivity

## Status

Accepted

## Context

The CUBE-level NOC is a 2D router mesh that carries every intra-cube
request: PE-to-HBM data, PE-to-PE traffic, command paths
(M_CPU↔PE_CPU), shared SRAM access, and inter-cube UCIe traffic.

The CUBE's HBM is exposed through per-PE controller endpoints attached
to PE routers. This per-PE partitioning makes local-vs-remote HBM
distinguishable by mesh distance: a PE's own HBM partition sits at its
own router (switching overhead only); another PE's HBM partition is
reachable by mesh hops to that PE's router.

Two channel-mapping modes are supported in the design space:

- **n:1 (default, implemented)** — each PE's HBM partition aggregates
  `channels_per_pe` pseudo-channels into one endpoint. Effective
  per-PE BW = N × per-channel BW.
- **1:1 (future)** — each PE router decomposes into per-channel
  mini-routers; per-channel BW contention is modeled directly.

In both modes the per-PE effective BW is identical; only the connectivity
granularity differs.

## Decision

### D1. 2D router mesh

Each cube contains a 2D mesh of NOC routers generated by `mesh_gen.py`.

- Node naming: `sip{S}.cube{C}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`).
- Implementation: `forwarding_v1`. NOC `overhead_ns = 0`.
- Default 6×6 grid (sized from PE corner placement + UCIe attachment
  count); larger PE counts scale the grid up.
- HBM exclusion zone: center rows/columns are excluded where HBM die
  physically occupies space (e.g., r2c2, r2c3, r3c2, r3c3 for a 6×6).
- Latency = Manhattan distance × `ns_per_mm`.

### D2. XY routing algorithm

Deterministic XY routing:

1. Horizontal segment: route from source X to destination X at source Y.
2. Vertical segment: route from destination X at source Y to destination Y.

Each directed segment carries a unique key:

- Horizontal: `("H", y_band, x_min, x_max, direction)`
- Vertical: `("V", x_band, y_min, y_max, direction)`

Grid positions are snapped to the router grid, excluding the HBM zone.
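
The two routing steps can be sketched as below. This is an illustration only: `xy_route` is a hypothetical helper, X is taken to be the column index and Y the row index of `r{row}c{col}`, and snapping around the HBM exclusion zone is deliberately left out.

```python
def xy_route(src, dst):
    """Return router names from src to dst, horizontal leg first.

    src and dst are (row, col) pairs; X is the column, Y is the row.
    """
    row, col = src
    dst_row, dst_col = dst
    path = [f"r{row}c{col}"]

    step = 1 if dst_col >= col else -1
    while col != dst_col:   # 1. horizontal segment at source Y
        col += step
        path.append(f"r{row}c{col}")

    step = 1 if dst_row >= row else -1
    while row != dst_row:   # 2. vertical segment at destination X
        row += step
        path.append(f"r{row}c{col}")
    return path


print(xy_route((0, 0), (1, 4)))
# ['r0c0', 'r0c1', 'r0c2', 'r0c3', 'r0c4', 'r1c4']
```

The hop count is always the Manhattan distance between the two routers, which is what makes the uncontended latency a pure function of distance.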

### D3. Per-segment contention model

Each directed XY segment is a `simpy.Resource(capacity=1)`. Transactions
sharing a segment (same row or column band, same direction) contend for
the resource — modelling link-level serialization in a wormhole-routed
mesh.

With no contention, NOC traversal latency equals Manhattan distance ×
`ns_per_mm`. Under contention, SimPy's resource scheduling adds queueing
delay.
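
The effect of a capacity-1 segment can be reproduced without SimPy. The sketch below assumes FIFO service and assumes the segment is held for distance × `ns_per_mm`; the `serialize_on_segment` helper and the numeric values are hypothetical.

```python
def serialize_on_segment(requests):
    """Schedule requests on one capacity-1 segment in FIFO order.

    requests: list of (arrival_ns, hold_ns), sorted by arrival time.
    Returns a list of (start_ns, finish_ns, queue_delay_ns).
    """
    free_at = 0.0
    schedule = []
    for arrival, hold in requests:
        start = max(arrival, free_at)  # wait while the segment is busy
        finish = start + hold
        schedule.append((start, finish, start - arrival))
        free_at = finish
    return schedule


NS_PER_MM = 0.5         # hypothetical wire delay
hold = 4.0 * NS_PER_MM  # 4 mm segment, 2.0 ns occupancy

for row in serialize_on_segment([(0.0, hold), (0.5, hold), (10.0, hold)]):
    print(row)
# (0.0, 2.0, 0.0)    uncontended
# (2.0, 4.0, 1.5)    queued behind the first transaction
# (10.0, 12.0, 0.0)  segment idle again
```

Only the second transaction pays queueing delay, because it arrives while the first still occupies the segment.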

### D4. NOC attachment points (per-PE HBM partition)

Every PE router carries three attachments: `pe{idx}.dma`, `pe{idx}.cpu`,
and `pe{idx}.hbm`. The last is the per-PE HBM controller endpoint —
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` — which owns one slice of the cube's
HBM (one pseudo-channel group; see D8).

Other attachments:

- M_CPU and shared SRAM each occupy a dedicated edge router.
- UCIe endpoints (N/S/E/W) each expose 4 connection routers distributed
  along that edge (see D6).

```text
                 UCIe-N (conn x4)
                         |
           +---------+---+---+---------+
           |         |       |         |
PE0.dma ---+  r0c0   |  ...  |  r0c5   +--- PE2.dma
PE0.cpu <--+ +hbm.pe0|       | +hbm.pe2+--< PE2.cpu
           |         |       |         |
UCIe-W ----+   ...   | [HBM] |   ...   +---- UCIe-E
(conn x4)  |         | zone  |         |  (conn x4)
           |  r2c0   |       |         |
M_CPU <--->+         |       |         |
           |  r3c0   |       |         |
SRAM <---->+         |       |         |
           |         |       |         |
PE4.dma ---+  r4c0   |  ...  |  r4c5   +--- PE6.dma
PE4.cpu <--+ +hbm.pe4|       | +hbm.pe6+--< PE6.cpu
           |         |       |         |
           +---------+---+---+---------+
                         |
                 UCIe-S (conn x4)
```

Per-PE HBM partitioning is the key invariant that makes local vs
cross-PE HBM distinguishable by mesh distance (see D7).

### D5. NOC edge bandwidths and distances

| Connection                | BW (GB/s) | Distance      | Notes                                      |
| ------------------------- | --------- | ------------- | ------------------------------------------ |
| PE_DMA → NOC              | 256.0     | Physical (PE) | Matches local-HBM aggregate BW             |
| NOC → PE_CPU              | —         | 0.0 mm        | Command path only                          |
| Router ↔ hbm_ctrl.pe{idx} | 256.0     | 0.0 mm        | Per PE router; N × per-channel BW (see D8) |
| NOC ↔ M_CPU               | —         | 0.0 mm        | Command path                               |
| NOC ↔ SRAM                | 128.0 × 4 | 0.0 mm        | 512 GB/s aggregate                         |
| NOC ↔ UCIe conn           | 128.0     | 0.0 mm        | Per connection; 4 conn per port            |

`0.0 mm` distances reflect the distributed nature of the NOC; actual
traversal distance is computed via Manhattan distance within the router
grid.

### D6. UCIe decomposition and inter-cube traffic

Each of the 4 UCIe ports (N, S, E, W) decomposes into:

- 1 `ucie-{PORT}` node: UCIe protocol endpoint (`overhead = 8.0 ns`).
- 4 `ucie-{PORT}.conn{0-3}` nodes: connection bridges between NOC and UCIe.

This decomposition gives 4 independent NOC↔UCIe connections per port,
each with 128 GB/s bandwidth (512 GB/s aggregate per port).

Inter-cube traffic path:

```text
Source: PE_DMA → NOC → conn{i} → ucie-{PORT}
            [UCIe link: 512 GB/s, 1.0mm seam distance]
Target: ucie-{PORT} → conn{i} → r{x}c{y} → (mesh hops) → hbm_ctrl.pe{idx}
```

UCIe overhead (8.0 ns) is applied at each `ucie-{PORT}` node, so a full
crossing incurs 16 ns (TX port + RX port).
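
The D6 numbers compose as in the sketch below. The constants come from this section; the helper names and the 4096-byte payload are illustrative assumptions, and bandwidth is treated as decimal GB/s (1 GB/s equals 1 byte/ns).

```python
UCIE_OVERHEAD_NS = 8.0  # per ucie-{PORT} node (D6)
CONN_BW_GBS = 128.0     # per NOC-to-UCIe connection (D6)
CONNS_PER_PORT = 4      # conn{0-3}


def ucie_crossing_overhead_ns(ports_traversed=2):
    """Protocol overhead for one crossing: TX port plus RX port."""
    return UCIE_OVERHEAD_NS * ports_traversed


def port_aggregate_bw_gbs():
    """Aggregate bandwidth of one UCIe port across its connections."""
    return CONN_BW_GBS * CONNS_PER_PORT


def transfer_time_ns(nbytes, bw_gbs):
    """1 GB/s == 1 byte/ns, so bytes divided by GB/s yields nanoseconds."""
    return nbytes / bw_gbs


print(ucie_crossing_overhead_ns())          # 16.0
print(port_aggregate_bw_gbs())              # 512.0
print(transfer_time_ns(4096, CONN_BW_GBS))  # 32.0
```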

### D7. Data paths through the NOC

All intra-cube traffic uses the same router mesh — no separate fast
paths.

**Local HBM** (same PE's own partition; 0 mesh hops):

```text
PE_DMA → r{x}c{y} → hbm_ctrl.pe{idx}      (switching overhead only)
```

**Cross-PE HBM within cube** (target PE's partition, reached by mesh):

```text
PE_DMA → r{x}c{y} → (mesh hops) → r{x'}c{y'} → hbm_ctrl.pe{idx'}
```

Example: PE0 (on `r0c0`) accessing PE2's HBM (PE2 on `r1c4`):

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl.pe2
```

Dijkstra computes the shortest path within the mesh.
|
||||||
|
|
||||||
|
**Cross-cube HBM** (UCIe traversal):
|
||||||
|
|
||||||
|
```text
|
||||||
|
PE_DMA → r{x}c{y} → conn → ucie-{PORT} → [seam] → ucie-{PORT'} → conn
|
||||||
|
→ r{x'}c{y'} → hbm_ctrl.pe{idx'}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Kernel launch command to PE**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
[from io_noc] → ucie → conn → r{x}c{y} → (mesh) → M_CPU → (mesh) → PE_CPU
|
||||||
|
```
|
||||||
|
|
||||||
|
**Shared SRAM access**:
|
||||||
|
|
||||||
|
```text
|
||||||
|
PE_DMA → r{x}c{y} → (mesh) → SRAM
|
||||||
|
```
|
||||||
|
|
||||||
|
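The cross-PE example in D7 (PE0 on `r0c0` reaching PE2 on `r1c4`) can be reproduced with a small Dijkstra search over the router grid. This is an illustrative sketch under stated assumptions: unit edge cost per hop, a 6×6 grid, and a hypothetical `null_routers` set for exclusion-zone holes. The real router may weight edges by distance or overhead, and may break ties between equal-length paths differently.

```python
import heapq


def mesh_shortest_path(rows, cols, src, dst, null_routers=frozenset()):
    """Dijkstra over a rows x cols router grid with unit-cost XY edges.

    src/dst are (row, col) tuples; null_routers are grid positions with no
    router. Returns router names such as "r0c0", or [] if unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue  # stale heap entry
        r, c = node
        for nxt in ((r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or nxt in null_routers:
                continue
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                prev[nxt] = node
                heapq.heappush(heap, (d + 1, nxt))
    if dst not in dist:
        return []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return [f"r{r}c{c}" for r, c in reversed(path)]


path = mesh_shortest_path(6, 6, (0, 0), (1, 4))
print(len(path) - 1)  # 5 mesh hops, the Manhattan distance from r0c0 to r1c4
```

Several 5-hop paths exist between these two routers, so the exact router sequence returned depends on tie-breaking; only the hop count is fixed.
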
### D8. HBM channel mapping mode

Channel mapping is configured at cube scope:

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo-channel count
    hbm_channels_per_pe: 8        # per-PE local channel count
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth (GB/s)
    hbm_slices_per_cube: 8        # number of per-PE partitions
    hbm_total_gb_per_cube: 48
```

**n:1 mode (default, implemented).** Each PE's HBM partition is a single
endpoint `hbm_ctrl.pe{idx}` that aggregates `hbm_channels_per_pe`
pseudo-channels. The `Router ↔ hbm_ctrl.pe{idx}` link bandwidth equals
`hbm_channels_per_pe × hbm_channel_bw_gbs`. Pseudo-channels are assumed to
interleave; only aggregate per-PE BW is modeled. No separate aggregated
router node exists; the per-PE router itself serves that role.

**1:1 mode (future).** Each PE router decomposes into N channel
mini-routers; per-channel routing carries fully-resolved PA + channel ID.
A `ChannelSplitter` resolves a logical access to N per-channel physical
requests. Per-channel links model BW contention. Cross-PE channel
access semantics are deferred to the implementation ADR.

**BW math (defaults).**

| Parameter                | Value               |
| ------------------------ | ------------------- |
| pseudo channels per cube | 64 (parameter)      |
| PEs per cube             | 8 (parameter)       |
| channels per PE (N)      | 64 / 8 = 8          |
| per-channel BW           | 32 GB/s (parameter) |
| per-PE local BW          | N × 32 = 256 GB/s   |
| cube total HBM BW        | 64 × 32 = 2048 GB/s |

Both modes give the same per-PE effective BW; only the request shape and
contention model differ.

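The BW math in D8 follows directly from the `memory_map` keys. The sketch below is a hedged illustration, not the builder's actual code: the function name is hypothetical, and the consistency check (channels per PE × partitions = total pseudo-channels) is an assumption drawn from the defaults rather than a documented rule.

```python
def hbm_bandwidths(memory_map: dict) -> dict:
    """Derive per-PE and cube-total HBM bandwidth (GB/s) from D8-style keys."""
    n = memory_map["hbm_channels_per_pe"]
    ch_bw = memory_map["hbm_channel_bw_gbs"]
    total_ch = memory_map["hbm_pseudo_channels"]
    slices = memory_map["hbm_slices_per_cube"]
    # Assumed invariant: every pseudo-channel belongs to exactly one PE partition.
    if n * slices != total_ch:
        raise ValueError("hbm_channels_per_pe * hbm_slices_per_cube != hbm_pseudo_channels")
    return {
        "per_pe_gbs": n * ch_bw,             # Router <-> hbm_ctrl.pe{idx} link BW
        "cube_total_gbs": total_ch * ch_bw,  # sum over all pseudo-channels
    }


defaults = {
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "hbm_slices_per_cube": 8,
}
print(hbm_bandwidths(defaults))  # {'per_pe_gbs': 256.0, 'cube_total_gbs': 2048.0}
```
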
### D9. AddressResolver — per-PE HBM endpoint

The address resolver decodes a PA's HBM offset to the owning PE's
partition:

```python
# policy/routing/router.py
hbm_slice_bytes = hbm_total_gb_per_cube * (1 << 30) // hbm_slices_per_cube

if addr.kind == "hbm":
    pe_id = int(addr.hbm_offset) // hbm_slice_bytes
    return f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
```

The `pe_id` computation is intrinsic to the routing layer (not a
topology-time concern). Any HBM PA falls within exactly one partition,
yielding deterministic routing.

External callers (e.g., M_CPU DMA, Memory R/W from PCIE_EP) follow the
same resolver path; there is no separate fast path.

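As a worked example of the D9 decode, the following self-contained sketch maps HBM offsets to endpoint names using the D8 defaults (48 GB per cube, 8 partitions, so 6 GiB per partition). The `HbmAddr` type, the function name, and the range check are hypothetical and exist only for illustration; the excerpt above is the authoritative form.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass(frozen=True)
class HbmAddr:
    """Hypothetical stand-in for a decoded physical address."""
    sip: int
    cube: int
    hbm_offset: int


def resolve_hbm_endpoint(addr: HbmAddr, total_gb_per_cube: int = 48,
                         slices_per_cube: int = 8) -> str:
    """Map an HBM offset to the owning PE's hbm_ctrl endpoint name."""
    slice_bytes = total_gb_per_cube * GIB // slices_per_cube  # 6 GiB by default
    pe_id = addr.hbm_offset // slice_bytes
    # Assumed guard: offsets outside the cube's HBM are rejected.
    if not 0 <= pe_id < slices_per_cube:
        raise ValueError(f"HBM offset {addr.hbm_offset:#x} is outside the cube")
    return f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"


print(resolve_hbm_endpoint(HbmAddr(0, 0, 0)))             # sip0.cube0.hbm_ctrl.pe0
print(resolve_hbm_endpoint(HbmAddr(0, 0, 6 * GIB)))       # sip0.cube0.hbm_ctrl.pe1
print(resolve_hbm_endpoint(HbmAddr(0, 1, 13 * GIB + 5)))  # sip0.cube1.hbm_ctrl.pe2
```

Because the partitions are contiguous and equal-sized, integer division is enough to make the mapping total and deterministic for every in-range offset.
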
### D10. Mesh generation parameters

`mesh_gen.py` produces `cube_mesh.yaml` from:

- `cube.pe_layout`: corner placement (NW, NE, SW, SE) and PEs per corner.
- `cube.geometry`: cube physical dimensions and HBM zone.
- `cube.ucie.n_connections`: determines router count for UCIe attachment.

Output `mesh_data` dictionary contains:

- Router grid with positions and HBM exclusion zones.
- PE-to-router attachments (`pe{idx}.dma`, `pe{idx}.cpu`, `pe{idx}.hbm`
  per PE).
- UCIe-to-router attachments (N/S/E/W distributed across edge routers).
- M_CPU and SRAM router attachments.

## Consequences

- Local HBM (0 mesh hops, switching overhead only) and cross-PE HBM
  (mesh hops) are naturally distinguishable, satisfying SPEC R5
  (multi-domain communication) and ADR-0002 (no zero-latency end-to-end
  paths).
- All cube-internal traffic routes through one mesh: a single contention
  model, a single layout, and a single set of edge BWs.
- Per-PE HBM partitioning maps cleanly to the LA model (ADR-0011): each
  PE's partition is the n:1 aggregate of its assigned pseudo-channels.
- 1:1 mode extension is structurally natural: split each PE router into
  N channel routers.
- Mesh generation is fully parameterised by `topology.yaml`; PE/cube
  geometry changes propagate without code edits.

## Links

- ADR-0002 (Routing distance, ordering, no zero-latency paths)
- ADR-0003 D3 (cube-level NOC definition, extended here)
- ADR-0004 (Memory semantics, local HBM)
- ADR-0011 (Memory addressing; the LA model consumes the per-PE partition)
- ADR-0014 D1 (PE_DMA egress via router mesh)
- ADR-0015 D4 (fabric paths for Memory R/W and Kernel Launch)
- ADR-0016 (IOChiplet io_noc; analogous pattern at IO chiplet level)
- ADR-0033 (Latency model: per-PC parallelism, switch penalty)
@@ -1,305 +0,0 @@
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

## Status

Accepted

## Context

The CUBE-internal NOC must connect each PE to HBM. KernBench needs
to evaluate two connectivity models:

- **1:1 mode**: PE_DMA connects to N separate per-channel routers,
  each with its own link to hbm_ctrl. Models per-channel BW
  contention precisely.
  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode**: PE_DMA connects to a single aggregated router with
  one link to hbm_ctrl. Channels are treated as interleaved; only
  aggregate BW is modeled.

Effective PE-local BW is identical under both modes
(= N × per-channel BW); only the connectivity granularity differs.

---

## Decision

### D1. HBM Attaches to PE Routers

Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.

- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved

Node naming changes:

| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |

In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.

---

### D2. Complete Removal of xbar, bridge, and Single NOC Node

Remove all of the following nodes and related edges:

- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges of type `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu`, etc. referencing the single noc node

Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) from the 6×6 router grid generated by `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.

---

### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)

#### Router Nodes Based on cube_mesh.yaml

Each non-null router from cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.

- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml

Based on the attach information in cube_mesh.yaml, components are connected to each router:

- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge

Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.

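The node and edge rules above can be sketched as a small builder. This is a hypothetical illustration, not the topology builder's code: the grid is assumed to be a list of rows in which `None` marks a null router, and the function name and return shape are invented for the example.

```python
def build_router_mesh(cube: str, grid: list) -> tuple:
    """Create router node IDs and bidirectional XY edges, skipping null routers.

    grid[row][col] is any non-None value for a router and None for a hole
    (HBM exclusion zone). Returns (nodes, directed_edges).
    """
    nodes, edges = [], []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is None:
                continue  # null router: no node, no edges
            here = f"{cube}.r{r}c{c}"
            nodes.append(here)
            # Link to the east and south neighbours; adding both directions
            # here covers west/north when those routers are visited.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < len(grid) and nc < len(grid[nr]) and grid[nr][nc] is not None:
                    there = f"{cube}.r{nr}c{nc}"
                    edges.append((here, there))
                    edges.append((there, here))
    return nodes, edges


# 2 x 3 toy grid with one exclusion-zone hole at r1c1.
toy = [[1, 1, 1],
       [1, None, 1]]
nodes, edges = build_router_mesh("sip0.cube0", toy)
print(len(nodes), len(edges))  # 5 8
```
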
#### 1:1 Mode Extension (To Be Implemented Later)

In 1:1 mode, each router differentiates into N channel mini-routers.
Per-channel routing and ChannelSplitter (LA → per-channel PA) introduction are required.
N GEMM engines per PE are also added at this point.

---

### D4. Cross-PE HBM Access (n:1 Mode)

In n:1 mode, when a PE accesses another PE's local HBM,
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.

Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```

The Dijkstra router finds the shortest path in the mesh.

Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.

---

### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh

In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.

#### Connection Structure

PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:

```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
```

Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).

#### n:1 Mode Full Data Paths

**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl    (switching overhead only)
```

**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```

**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```

---

### D6. All Traffic Is Unified onto the Same Router Mesh

- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar)
- Cross-cube (remote) access path:

```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```

UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.

The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.

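The line-count rule can be evaluated directly. This is a minimal sketch: the function and argument names are hypothetical, and the example inputs (512 GB/s per UCIe port, 128 GB/s per NOC-side connection) are the figures quoted for the UCIe decomposition in ADR-0017.

```python
import math


def ucie_lines_per_side(ucie_bw_gbs: float, noc_line_bw_gbs: float) -> int:
    """Number of NOC-side lines needed to carry one UCIe port's bandwidth."""
    if noc_line_bw_gbs <= 0:
        raise ValueError("noc_line_bw_gbs must be positive")
    return math.ceil(ucie_bw_gbs / noc_line_bw_gbs)


print(ucie_lines_per_side(512.0, 128.0))  # 4
print(ucie_lines_per_side(600.0, 128.0))  # 5 (rounds up on a partial line)
```
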
---

### D7. AddressResolver Changes

Current `AddressResolver.resolve()`:

```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```

After change:

```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```

The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.

resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.

---

### D8. topology.yaml Configuration Changes

#### Added Settings

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48     # retained
```

#### Removed Settings

```yaml
# To be removed
links:
  xbar_to_hbm_bw_gbs: 256.0     # → replaced by channel_bw_gbs × channels_per_pe
  xbar_to_hbm_mm: 2.5           # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0  # → removed (no bridge)
  xbar_to_bridge_mm: 3.0        # → removed
  noc_to_xbar_bw_gbs: ...       # → removed
  noc_to_xbar_mm: ...           # → removed
```

#### Added Link Settings

```yaml
links:
  router_link_bw_gbs: 256.0     # XY mesh link BW between routers
  router_overhead_ns: 2.0       # router switching overhead
  pe_to_router_bw_gbs: 256.0    # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0   # HBM ↔ router (= N × channel_bw)
```

---

### D9. Bandwidth Numerical Consistency

| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

The effective BW per PE is identical in both modes:

- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s

---

## Consequences

### Positive

- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which keeps the model simple
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameterized, enabling testing of various configurations
- 1:1 mode extension naturally follows through router differentiation

### Negative

- The number of SimPy nodes increases due to explicit router nodes (6×6 grid, up to 32 routers/cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

---

## Alternatives

### A1. Retain Existing xbar + HBM Slices

- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes

### A2. Always Generate Per-Channel Links and Aggregate Only in n:1

- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count

### A3. Gradual Transition (Retain xbar + Add NOC Path)

- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value

---

## Test Requirements

- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)

---

## Links

- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,305 +0,0 @@
# ADR-0019: Per-Channel and Aggregated HBM Connection Models within CUBE NOC

## Status

Accepted

## Context

The CUBE-internal NOC must connect each PE to HBM. KernBench must be
able to compare and evaluate two connectivity models.

- **1:1 mode**: PE_DMA connects to each of N per-channel routers over a
  separate link, and each router has its own channel link to hbm_ctrl.
  Models per-channel BW contention precisely.
  N = `hbm_pseudo_channels / pes_per_cube` (= `channels_per_pe`).
- **n:1 mode**: PE_DMA connects to hbm_ctrl over one link through a single
  aggregated router. Channels are assumed to be interleaved, and
  only aggregate BW is modeled.

Effective BW per PE is identical under both modes (= N × per-channel BW);
only the connectivity granularity differs.

---

## Decision

### D1. HBM Attaches to PE Routers

Consolidate the current `hbm_ctrl.slice{0-7}` (8 nodes) into a **single `hbm_ctrl` node**,
and attach the HBM access point to the same router where the PE is attached.

- n:1 mode: PE's local HBM access goes directly from its own router (switching overhead only, 0 hops)
- Remote PE's HBM access: reaches the target PE's router via mesh hops
- The read/write resource model within the HBM controller is preserved

Node naming changes:

| Current | After Change |
| ---- | ------- |
| `sip0.cube0.hbm_ctrl.slice0` ~ `slice7` | `sip0.cube0.hbm_ctrl` (single) |

In `mesh_gen.py`, add `pe{idx}.hbm` to the PE attachment so that
the builder generates an edge between that router and hbm_ctrl.

---

### D2. Complete Removal of xbar, bridge, and Single NOC Node

Remove all of the following existing nodes and related edges:

- `{cube}.xbar_top`, `{cube}.xbar_bot`
- `{cube}.bridge.left`, `{cube}.bridge.right`
- `{cube}.noc` (single TwoDMeshNocComponent node)
- Edges of type `noc_to_xbar`, `xbar_to_noc`, `xbar_to_hbm`, `hbm_to_xbar`
- Edges of type `xbar_to_bridge`, `bridge_to_xbar`
- Edges such as `pe_to_noc`, `noc_to_pe`, `noc_to_pe_cpu` that reference the single noc node

Their role is replaced by an **explicit router mesh based on cube_mesh.yaml**.
Each router (r0c0, r0c1, ...) of the 6×6 router grid generated by the existing `mesh_gen.py`
is created as a separate SimPy node in the topology graph,
and adjacent routers are connected via XY mesh edges.

---

### D3. Explicit Router Mesh (Common Basis for n:1 / 1:1)

#### Router Nodes Based on cube_mesh.yaml

Each non-null router in the cube_mesh.yaml generated by `mesh_gen.py`
is created as a **separate SimPy node** in the topology graph.

- Node ID: `{cube}.r{row}c{col}` (e.g., `sip0.cube0.r0c0`)
- kind: `noc_router`, impl: `forwarding_v1`
- pos_mm: taken from cube_mesh.yaml

Components are connected to each router according to the attach information in the existing cube_mesh.yaml:

- `pe{p}.dma` → PE_DMA ↔ router edge
- `pe{p}.cpu` → PE_CPU ↔ router edge
- `pe{p}.hbm` → HBM_CTRL ↔ router edge (added in n:1)
- `m_cpu` → M_CPU ↔ router edge
- `sram` → SRAM ↔ router edge
- `ucie_{dir}.c{i}` → UCIe conn ↔ router edge

Router-to-router XY mesh edges: bidirectional edges between adjacent routers.
Null routers (HBM exclusion zones) are skipped.

#### 1:1 Mode Extension (To Be Implemented Later)

In 1:1 mode, each router differentiates into N channel mini-routers.
Per-channel routing and a ChannelSplitter (LA → per-channel PA) must be introduced.
N GEMM engines per PE are also added at this point.

---

### D4. Cross-PE HBM Access (n:1 Mode)

In n:1 mode, when a PE accesses another PE's local HBM,
it hops through the XY mesh in cube_mesh.yaml to reach the target PE's router.

Example: PE0 (r0c0) accessing PE2's (r1c4) HBM:

```text
PE0.pe_dma → r0c0 → r0c1 → r0c2 → r0c3 → r0c4 → r1c4 → hbm_ctrl
```

The Dijkstra router finds the shortest path in the mesh.

Cross-PE channel access in 1:1 mode will be defined during the 1:1 extension in D3.

---

### D5. n:1 Mode: Uses cube_mesh.yaml Router Mesh

In n:1 mode, no separate "aggregated router" is created.
The existing router grid from cube_mesh.yaml serves that role.

#### Connection Structure

PE_DMA, PE_CPU, and HBM are all connected to the router where each PE is attached:

```text
sip0.cube0.pe0.pe_dma ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
sip0.cube0.hbm_ctrl   ←→ sip0.cube0.r0c0  (bw: N × channel_bw_gbs)
```

Routers are connected via XY mesh edges. PE's local HBM access goes
directly from its own router (switching overhead only).

#### n:1 Mode Full Data Paths

**Local HBM (0 hops):**
```text
PE0.pe_dma → r0c0 → hbm_ctrl    (switching overhead only)
```

**Remote HBM (mesh hops):**
```text
PE0.pe_dma → r0c0 → r0c1 → ... → r1c4 → hbm_ctrl
```

**M_CPU DMA:**
```text
M_CPU → r2c0 → (mesh hops) → r{x}c{y} → hbm_ctrl
```

---

### D6. All Traffic Is Unified onto the Same Router Mesh

- All memory accesses (DMA data) and commands (PE_CPU) use the same router mesh
- Local access does not use a separate fast path (xbar) either
- Cross-cube (remote) access path:

```text
PE_DMA → r{x}c{y} → (mesh hops) → ucie_conn → ucie-{PORT}
  → [UCIe link] → remote ucie → remote conn → remote r{x}c{y} → hbm_ctrl
```

UCIe connections maintain the existing structure,
but both endpoints become mesh routers instead of xbars.

The number of UCIe lines is determined by BW ratio: `ucie_lines_per_side = ceil(ucie_bw / noc_line_bw)`.

---

### D7. AddressResolver Changes

Current `AddressResolver.resolve()`:

```python
# Current: HBM offset → pe_slice → "sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
pe_slice = PhysAddr.hbm_pe_id(addr.hbm_offset, self._slice_size_bytes)
return f"sip{s}.cube{c}.hbm_ctrl.slice{pe_slice}"
```

After change:

```python
# Changed: HBM → single endpoint
return f"sip{s}.cube{c}.hbm_ctrl"
```

The pe_slice calculation is removed.
In n:1 mode, PE_DMA directly accesses the hbm_ctrl attached to its own router.

resolver.resolve() is retained for external access (M_CPU DMA, etc.) and backward compatibility.

---

### D8. topology.yaml Configuration Changes

#### Added Settings

```yaml
cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # local channels per PE (= pseudo_channels / pes_per_cube)
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth (GB/s)
    hbm_total_gb_per_cube: 48     # retained
```

#### Removed Settings

```yaml
# To be removed
links:
  xbar_to_hbm_bw_gbs: 256.0     # → replaced by channel_bw_gbs × channels_per_pe
  xbar_to_hbm_mm: 2.5           # → replaced by ch_router_to_hbm_mm
  xbar_to_bridge_bw_gbs: 128.0  # → removed (no bridge)
  xbar_to_bridge_mm: 3.0        # → removed
  noc_to_xbar_bw_gbs: ...       # → removed
  noc_to_xbar_mm: ...           # → removed
```

#### Added Link Settings

```yaml
links:
  router_link_bw_gbs: 256.0     # XY mesh link BW between routers
  router_overhead_ns: 2.0       # router switching overhead
  pe_to_router_bw_gbs: 256.0    # PE_DMA ↔ router
  hbm_to_router_bw_gbs: 256.0   # HBM ↔ router (= N × channel_bw)
```

---

### D9. Bandwidth Numerical Consistency

| Configuration | Value |
| ---- | --- |
| pseudo channels per cube | 64 (parameter) |
| PEs per cube | 8 (parameter) |
| channels per PE (N) | `pseudo_channels / pes_per_cube` = 8 |
| per-channel BW | 32 GB/s (parameter) |
| per-PE local BW | N × 32 = 256 GB/s |
| cube total HBM BW | 64 × 32 = 2048 GB/s |

The effective BW per PE is identical in both modes:

- 1:1 mode: N channel links × channel_bw_gbs = N × 32 = 256 GB/s
- n:1 mode: 1 aggregated link = N × channel_bw_gbs = 256 GB/s

---

## Consequences

### Positive

- The router mesh based on cube_mesh.yaml accurately reflects physical placement
- In n:1 mode, the existing VA scheme is preserved, keeping transition costs low
- Local / remote / command traffic is unified onto the same mesh, which keeps the model simple
- Aligns well with graph compiler-based topology generation
- Channel count and PE count are both parameters, so various configurations can be tested
- 1:1 mode extension follows naturally through router differentiation

### Negative

- The number of SimPy nodes increases due to explicit router nodes (6×6 grid, up to 32 routers/cube)
- The internal contention model of TwoDMeshNocComponent needs to be replaced with a per-router model

---

## Alternatives

### A1. Retain Existing xbar + HBM Slices

- Local/remote paths remain bifurcated
- Cannot model at pseudo-channel granularity
- Cannot switch between 1:1/n:1 modes

### A2. Always Generate Per-Channel Links and Aggregate Only in n:1

- Topology structure always has 1:1 size
- Expressing n:1 semantics via link aggregation is complex
- No reduction in router node count

### A3. Gradual Transition (Retain xbar + Add NOC Path)

- Higher compatibility, but dual-path coexistence increases complexity
- Since xbar removal is ultimately necessary, the intermediate step provides little value

---

## Test Requirements

- Verify that requests are delivered via per-channel links in 1:1 mode
- Verify that requests are delivered via the aggregated link in n:1 mode
- Verify that topology is correctly generated in both modes:
  - 1:1: `total_ch` channel routers + per-PE links + horizontal links
  - n:1: `pes_per_cube` aggregated routers + per-PE links
- Verify that effective BW is consistent across both modes for the same workload
- Verify that horizontal line routing works for cross-PE access
- Verify that routing through UCIe works for cross-cube access
- Verify that topology generation is correct under parameter variations (channels_per_pe = 4, 8, 16, etc.)

---

## Links

- ADR-0011 (LA model) → addressing-side integration
- ADR-0017 (Cube NOC 2D Mesh) → this ADR replaces the xbar/bridge portion
- ADR-0004 (Memory Semantics) → BW model redefinition
- ADR-0014 (PE Internal Execution Model) → impact from PE_DMA path changes
@@ -1,432 +0,0 @@
# ADR-0021: PE Pipeline Refactoring — Component Separation + Scheduler-Based Routing

## Status

Accepted

## Context

### Actual Hardware Structure

```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```

- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)

---

## Decision

### D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
Existing 5 blocks + 1 Fetch/Store Unit = 6 components.

| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` allows changing the timing model of an individual block.

### D2. Token Self-Routing: Scheduler Handles Only Dispatch + Completion

**Components do not route through the scheduler at every stage.**
The token carries a plan, so each component chains directly to the next stage.

```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
            ↑ chaining: does not go through scheduler    ↑ completion only
```

This matches the actual HW structure, where each block's done signal is wired directly
to the next block. The scheduler is responsible **only for initial dispatch and
completion aggregation**.

#### Stage Definition

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
```

#### Plan Structure

When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stage_type: StageType
    component: str   # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict     # per-stage parameters (dynamic)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]   # stages to execute, in order (immutable)
```

The stage sequence varies depending on the plan:

```python
# StageType members are written unqualified here for brevity.

# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)   # store to TCM only
```

**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward the token directly via out_port.
This is the same pattern as a network packet carrying a routing header.

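The routing-header analogy can be sketched without SimPy. This is a minimal illustration, not the ADR's implementation: `next_hop`, `make_plan`, `COMPONENT_OF`, and the shortened component names are hypothetical, and `Stage` omits `params` for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5


@dataclass(frozen=True)
class Stage:
    stage_type: StageType
    component: str  # topology node ID (shortened here for readability)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]


# Hypothetical mapping from stage type to the component that executes it.
COMPONENT_OF = {
    StageType.DMA_READ: "pe_dma",
    StageType.FETCH: "pe_fetch_store",
    StageType.GEMM: "pe_gemm",
    StageType.MATH: "pe_math",
    StageType.STORE: "pe_fetch_store",
    StageType.DMA_WRITE: "pe_dma",
}


def make_plan(tile_id: int, types: tuple[StageType, ...]) -> TilePlan:
    """Build a plan by attaching a component to each stage type."""
    return TilePlan(tile_id, tuple(Stage(t, COMPONENT_OF[t]) for t in types))


def next_hop(plan: TilePlan, stage_idx: int) -> Optional[str]:
    """Return the component of the stage after stage_idx, or None at the end."""
    nxt = stage_idx + 1
    return plan.stages[nxt].component if nxt < len(plan.stages) else None


# Intermediate K-tile: accumulate and store to TCM only, no writeback.
plan = make_plan(0, (StageType.DMA_READ, StageType.FETCH,
                     StageType.GEMM, StageType.STORE))
hops = [next_hop(plan, i) for i in range(len(plan.stages))]
print(hops)
```

Each component only ever asks "what follows my index in this plan", so skipping the writeback is a matter of supplying a shorter plan rather than changing any component.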
#### Pipeline Context

```python
from dataclasses import dataclass

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (a SimPy Event cannot be triggered a second time).

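One way to make the exactly-once contract fail loudly is to track completed tile IDs instead of a bare counter. The sketch below is an assumption, not the ADR's code: `GuardedPipelineContext` is a hypothetical variant, `complete_tile` takes a `tile_id` argument that the original does not, and a plain callback stands in for `done_event.succeed` so the example runs without SimPy.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GuardedPipelineContext:
    """Hypothetical PipelineContext variant that rejects duplicate completions."""
    id: str
    total_tiles: int
    on_done: Callable[[], None]              # stands in for done_event.succeed
    _completed: set = field(default_factory=set)

    def complete_tile(self, tile_id: int) -> None:
        if tile_id in self._completed:
            raise RuntimeError(f"tile {tile_id} completed twice in {self.id}")
        self._completed.add(tile_id)
        if len(self._completed) == self.total_tiles:
            self.on_done()                   # fires once, on the last new tile


fired = []
ctx = GuardedPipelineContext("p0", total_tiles=2,
                             on_done=lambda: fired.append("done"))
ctx.complete_tile(0)
ctx.complete_tile(1)

try:
    ctx.complete_tile(1)                     # duplicate call is rejected
    duplicate_rejected = False
except RuntimeError:
    duplicate_rejected = True

print(fired, duplicate_rejected)
```

With a counter alone, a duplicate call on an early tile would make the pipeline appear complete one tile too soon; keying on `tile_id` turns that silent error into an exception.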
#### Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and a PipelineContext,
enqueues them into its internal `_pending_feeds` FIFO, and returns immediately.

Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
The feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.

There is **exactly one `_feed_loop`** per scheduler, and tile feed for composite
commands is performed exclusively through this single process.
Command issue order means **the order in which PE_SCHEDULER receives PeInternalTxn**.

This structure preserves command issue order while ensuring that, when the first stage
queue is full, only the feeder process blocks; the scheduler worker's inbox processing
itself does not stall.

```python
import simpy


class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
```

In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.

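The no-interleaving rule can be illustrated with a plain queue. This is a simplified sketch under stated assumptions: `feed_order` is a hypothetical helper, commands are reduced to a name and a tile count, and backpressure from the first stage queue is ignored.

```python
from collections import deque


def feed_order(commands: dict[str, int]) -> list[str]:
    """Return the first-stage injection order for the given commands.

    `commands` maps a command name to its tile count, in issue order
    (dicts preserve insertion order).
    """
    pending = deque(commands.items())    # plays the role of _pending_feeds
    injected = []
    while pending:
        name, tiles = pending.popleft()  # one command at a time, FIFO
        for t in range(tiles):           # all tiles of this command first
            injected.append(f"{name}:t{t}")
    return injected


order = feed_order({"cmd1": 3, "cmd2": 2})
print(order)
```

Because a single loop drains one command completely before taking the next, no tile of `cmd2` can appear ahead of a tile of `cmd1`, regardless of how quickly the commands were accepted.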
### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.

**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.

**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().

### D4. Asynchronous Pipeline: Natural Overlap

The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
The scheduler can therefore keep receiving the next command,
while the first-stage tile injection order is guaranteed per command.

Since **SimPy Store capacity = HW queue depth**:
- When a queue is full, put() naturally blocks (backpressure)
- While DMA is processing one tile, fetch and GEMM can already start on a tile whose DMA has completed
- When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly behind those of the first command

```
First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command
```

Here, the overlap does not come from tile feed interleaving across different commands.
It occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.

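The benefit of this overlap can be estimated with the standard linear-pipeline recurrence. The sketch below is an analytical approximation, not the simulator's model: `finish_times` is a hypothetical helper, each stage is assumed to handle one tile at a time, queue-depth limits are ignored, and the latencies are illustrative values rather than measurements.

```python
def finish_times(stage_latency_ns: list[float], num_tiles: int) -> list[list[float]]:
    """finish[t][s] = time at which tile t leaves stage s.

    A tile starts a stage once it has left the previous stage and the
    stage has released the previous tile.
    """
    finish = [[0.0] * len(stage_latency_ns) for _ in range(num_tiles)]
    for t in range(num_tiles):
        for s, lat in enumerate(stage_latency_ns):
            tile_ready = finish[t][s - 1] if s > 0 else 0.0
            stage_free = finish[t - 1][s] if t > 0 else 0.0
            finish[t][s] = max(tile_ready, stage_free) + lat
    return finish


# Illustrative latencies in ns for DMA, FETCH, GEMM (assumed, not measured).
lat = [100.0, 5.0, 40.0]
times = finish_times(lat, num_tiles=3)
pipelined = times[-1][-1]      # last tile leaves the last stage
serial = sum(lat) * 3          # no overlap: every tile runs start to finish alone
print(pipelined, serial)
```

With these numbers the slowest stage (DMA) sets the steady-state rate, so the three tiles finish at 345 ns instead of 435 ns.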
#### Component Chaining Pattern

All components follow the same pattern:

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage: pipeline completion
            token.pipeline_ctx.complete_tile()
```

### D5. PE_FETCH_STORE: Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This is now separated into a **PE_FETCH_STORE component**.

```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
    # issue the transfer request to PE_TCM, then wait for it to complete
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
```

Advantages:
- GEMM/MATH perform **pure compute only**, with no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone

### D6. Simplification of Each Compute Component

GEMM/MATH perform compute only, on register data that has already been prepared.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**

```python
# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
```

By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component implements only its own stage logic.

### D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
```

PE internal edge connections:
```
PE_SCHEDULER   → PE_DMA          (initial dispatch)
PE_SCHEDULER   → PE_FETCH_STORE  (initial dispatch)
PE_SCHEDULER   → PE_GEMM         (initial dispatch)
PE_SCHEDULER   → PE_MATH         (initial dispatch)
PE_DMA         → PE_FETCH_STORE  (chaining)
PE_FETCH_STORE → PE_GEMM         (chaining)
PE_FETCH_STORE → PE_MATH         (chaining)
PE_GEMM        → PE_FETCH_STORE  (store chaining)
PE_MATH        → PE_FETCH_STORE  (store chaining)
PE_FETCH_STORE → PE_DMA          (writeback chaining)
PE_FETCH_STORE → PE_TCM          (BW request)
```

Topology edges cover both **control/dispatch visibility and runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.

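Since a plan can name any component as its next stage, it can be useful to check a plan's hops against the edge list above. The following is a sketch under assumptions: `EDGES` transcribes the listed edges with shortened lowercase names, and `missing_edges` is a hypothetical helper, not part of the topology loader.

```python
# Directed edges transcribed from the PE internal edge list (shortened names).
EDGES = {
    ("pe_scheduler", "pe_dma"), ("pe_scheduler", "pe_fetch_store"),
    ("pe_scheduler", "pe_gemm"), ("pe_scheduler", "pe_math"),
    ("pe_dma", "pe_fetch_store"),
    ("pe_fetch_store", "pe_gemm"), ("pe_fetch_store", "pe_math"),
    ("pe_gemm", "pe_fetch_store"), ("pe_math", "pe_fetch_store"),
    ("pe_fetch_store", "pe_dma"), ("pe_fetch_store", "pe_tcm"),
}


def missing_edges(path: list[str]) -> list[tuple[str, str]]:
    """Return the hops in `path` (scheduler first) that have no topology edge."""
    hops = zip(path, path[1:])
    return [hop for hop in hops if hop not in EDGES]


# Normal GEMM plan: DMA_READ, FETCH, GEMM, STORE, DMA_WRITE.
gemm_path = ["pe_scheduler", "pe_dma", "pe_fetch_store", "pe_gemm",
             "pe_fetch_store", "pe_dma"]
# A plan that chains GEMM directly into MATH before storing.
fused_path = ["pe_scheduler", "pe_dma", "pe_fetch_store", "pe_gemm",
              "pe_math", "pe_fetch_store", "pe_dma"]

print(missing_edges(gemm_path))
print(missing_edges(fused_path))
```

Under the edge list as written, the fused path reports a missing `pe_gemm → pe_math` hop. Whether such a direct edge is intended for plans that run MATH right after GEMM is not stated in this section.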
### D9. TileToken Message Definition

A message used for passing tile work between components.
The token carries the plan and stage index, enabling self-routing.

```python
from dataclasses import dataclass


@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext   # completion tracking
    plan: TilePlan                  # full stage sequence for this tile (immutable)
    stage_idx: int                  # current stage index in plan.stages
    params: dict                    # current stage parameter cache (canonical: plan.stages[stage_idx].params)
    data_op: bool = True            # op_log recording target (ADR-0020)
```

A TileToken is **owned by exactly one component at a time** and
is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:
1. Scheduler creates it with stage_idx=0 and puts it to the first stage component
2. The component executes _process(), increments stage_idx, and puts it to the next component
3. The last stage component calls pipeline_ctx.complete_tile()
4. When all tiles are complete, PipelineContext calls done_event.succeed()

Relationship with existing PeInternalTxn:
- PeInternalTxn: command transfer between PE_CPU → PE_SCHEDULER (existing, unchanged)
- TileToken: per-tile work transfer from PE_SCHEDULER → sub-components (new, self-routing)

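The four lifecycle steps can be traced for a single token without a simulator. This is a trimmed-down sketch: `Token` and `run_token` are hypothetical stand-ins for `TileToken` and the per-component workers, and the plan is reduced to a tuple of component names.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Trimmed-down stand-in for TileToken (hypothetical, no SimPy)."""
    tile_id: int
    stages: tuple[str, ...]   # component name per stage, in plan order
    stage_idx: int = 0        # step 1: created with stage_idx=0


def run_token(token: Token, on_complete) -> list[str]:
    """Walk one token through its plan and return the sequence of owners."""
    owners = []
    while True:
        owners.append(token.stages[token.stage_idx])   # current single owner
        if token.stage_idx + 1 < len(token.stages):
            token.stage_idx += 1                       # step 2: hand over
        else:
            on_complete(token.tile_id)                 # step 3: last stage
            return owners


completed = []
tok = Token(7, ("pe_dma", "pe_fetch_store", "pe_math", "pe_fetch_store", "pe_dma"))
trace = run_token(tok, completed.append)
print(trace, completed)
```

The token appears in exactly one owner slot per step, which mirrors the single-owner rule: a component gives up the token at the moment it forwards it.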
---

## Non-goals

- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is not modified
  (PeInternalTxn-based, ADR-0014 maintained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
  accurate modeling of a single pipeline. TCM bank conflicts across multiple pipelines
  are future work.

## Open Questions

- **Register File capacity model**: whether to model capacity limits when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held simultaneously is determined by tile size.
  When capacity is exceeded, fetch stalls, creating natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Therefore, overlap arises not from pre-injection across commands, but
  naturally from pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be considered at the level of tile ordering
  within the same command or fetch/store unit policy, not cross-command injection.
- **PE_DMA coalescing**: per-tile DMA may cause fragmentation.
  The intended direction is to merge/coalesce within DMA without scheduler involvement.
- **Synchronous execution mode**: this ADR adopts the asynchronous pipeline as the
  default and sole execution model. If a sync mode is needed for debug or validation
  purposes, it will be considered in a future ADR.
- **TCM bank conflict across multiple pipelines**: currently based on a single pipeline.
  Bank conflict modeling when multiple pipelines simultaneously access TCM is future work.

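For the Register File capacity question, the relationship between `register_file_bytes` and tile size reduces to an integer division. The sketch below only illustrates that arithmetic: `tiles_in_flight` and `fetch_must_stall` are hypothetical names, and the 64 KiB capacity and 128×128 two-byte tile are assumed example values, not figures from this ADR.

```python
def tiles_in_flight(register_file_bytes: int, tile_bytes: int) -> int:
    """Number of whole tiles the register file can hold at once."""
    if tile_bytes <= 0:
        raise ValueError("tile_bytes must be positive")
    return register_file_bytes // tile_bytes


def fetch_must_stall(resident_tiles: int, register_file_bytes: int,
                     tile_bytes: int) -> bool:
    """True when loading one more tile would exceed the capacity."""
    return resident_tiles >= tiles_in_flight(register_file_bytes, tile_bytes)


# Assumed example values: 64 KiB register file, 128x128 tile of 2-byte elements.
capacity = 64 * 1024
tile = 128 * 128 * 2
print(tiles_in_flight(capacity, tile))
print(fetch_must_stall(2, capacity, tile))
```

If such a limit were modeled, a stalled fetch would hold its input token, and the bounded queues upstream would propagate the backpressure described in the bullet.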
---

## Consequences

### Positive

- Each block is an independent component and is individually replaceable (ADR-0015 compliant)
- PE internal structure is visible in the topology
- Components do not know the next component; plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling accuracy (done signal = Event, data transfer = message)
- Fetch/store separation enables accurate TCM BW contention modeling

### Negative

- Increased number of PE internal components (5 → 6), which means more topology nodes/edges
- Component separation makes intra-PE token forwarding more explicit than before

@@ -1,426 +0,0 @@
# ADR-0021: PE Pipeline Refactoring (Component Separation + Scheduler-Based Routing)

## Status

Accepted

## Context

### Actual Hardware Structure

```
HBM ←(DMA)→ TCM ←(Fetch/Store Unit)→ Register File ←→ GEMM/MATH Engine
```

- DMA: HBM ↔ TCM transfer (via fabric, tens to hundreds of ns)
- Fetch/Store Unit: TCM ↔ Register File transfer (BW-based, a few ns)
- GEMM/MATH Engine: computation between Register Files (cycle-accurate)
- Completion signal: PE-internal 1-cycle wire signal (done pin assert)

---

## Decision

### D1. Separate Each Block into an Independent Component

The internal blocks of pe_accel are separated into **independent PeEngineBase components**.
The 5 existing blocks plus 1 new Fetch/Store Unit give 6 components.

| Component | Role | HW Correspondence |
|-----------|------|-------------------|
| PE_SCHEDULER | Plan generation, tile state management, stage routing | Scheduler/Sequencer |
| PE_DMA | HBM ↔ TCM (via fabric) | DMA Engine |
| PE_FETCH_STORE | TCM ↔ Register File | Load/Store Unit |
| PE_GEMM | MAC compute (register only) | MAC Array |
| PE_MATH | Element-wise/reduction (register only) | SIMD/Vector Unit |
| PE_TCM | BW-serialized scratchpad | SRAM Bank |

Each component exists as a topology node and is connected via ports/wires.
Replacing the `impl` changes the timing model of an individual block.

### D2. Token Self-Routing: Scheduler Handles Only Dispatch + Completion

**Components do not route through the scheduler at every stage.**
The token carries a plan, so each component chains directly to the next stage.

```
Scheduler → DMA → Fetch → GEMM → Math → Store → DMA_WB → (done) → Scheduler
            ↑ chaining: does not go through scheduler    ↑ completion only
```

This matches the actual HW structure, where each block's done signal is wired directly
to the next block. The scheduler is responsible **only for initial dispatch and
completion aggregation**.

#### Stage Definition

```python
from enum import Enum


class StageType(Enum):
    DMA_READ = 0
    FETCH = 1
    GEMM = 2
    MATH = 3
    STORE = 4
    DMA_WRITE = 5
```

#### Plan Structure

When the scheduler receives a CompositeCmd, it generates a **per-tile execution plan**.
The plan defines the **stage sequence** for each tile:

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stage_type: StageType
    component: str   # topology node ID (e.g. "sip0.cube0.pe0.pe_dma")
    params: dict     # per-stage parameters (dynamic)


@dataclass(frozen=True)
class TilePlan:
    tile_id: int
    stages: tuple[Stage, ...]   # stages to execute, in order (immutable)
```

The stage sequence varies depending on the plan:

```python
# Normal GEMM: HBM → TCM → Register → Compute → Register → TCM → HBM
stages = (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE)

# GEMM directly from TCM data (skip DMA read):
stages = (FETCH, GEMM, STORE, DMA_WRITE)

# MATH element-wise:
stages = (DMA_READ, FETCH, MATH, STORE, DMA_WRITE)

# GEMM + accumulation (intermediate K-tile, skip writeback):
stages = (DMA_READ, FETCH, GEMM, STORE)   # store to TCM only
```

**Components do not hardcode the next component.**
They read the next stage from the token's plan and forward the token directly via out_port.
This is the same pattern as a network packet carrying a routing header.

#### Pipeline Context

```python
from dataclasses import dataclass

import simpy


@dataclass
class PipelineContext:
    id: str
    total_tiles: int
    completed_tiles: int = 0
    done_event: simpy.Event | None = None  # succeeds when all tiles are complete

    def complete_tile(self) -> None:
        self.completed_tiles += 1
        if self.completed_tiles == self.total_tiles:
            self.done_event.succeed()
```

**Completion follows an exactly-once contract**: the last stage of each tile must call
`complete_tile()` exactly once. Duplicate calls are a bug, and `done_event` must
succeed only once (SimPy Event constraint).

#### Scheduler Role (Reduced)

When the scheduler receives a CompositeCmd, it creates a plan and a PipelineContext,
enqueues them into its internal `_pending_feeds` FIFO, and returns immediately.

Actual tile injection is handled by a **single feeder process** (`_feed_loop`).
The feeder consumes `_pending_feeds` in FIFO order and
**does not allow tile feed interleaving across composite commands.**
That is, the feed for the next command begins only after all tiles of the current
command have been injected into the first stage queue.

There is **exactly one `_feed_loop`** per scheduler, and tile feed for composite
commands is performed exclusively through this single process.
Command issue order means **the order in which PE_SCHEDULER receives PeInternalTxn**.

This structure preserves command issue order while ensuring that, when the first stage
queue is full, only the feeder process blocks; the scheduler worker's inbox processing
itself does not stall.

```python
import simpy


class PeSchedulerV2(PeEngineBase):
    _pipelines: dict[str, PipelineContext]
    _pending_feeds: simpy.Store  # FIFO of (plan, ctx)

    def start(self, env):
        super().start(env)
        self._pending_feeds = simpy.Store(env)
        env.process(self._feed_loop(env))

    def _dispatch_composite(self, env, pe_txn, cmd):
        plan = generate_plan(cmd)
        ctx = PipelineContext(
            id=next_id(),
            total_tiles=len(plan.tiles),
            done_event=pe_txn.done,
        )
        self._pipelines[ctx.id] = ctx

        # only enqueue to feeder queue and return immediately
        yield self._pending_feeds.put((plan, ctx))

    def _feed_loop(self, env):
        """Single feeder process: feeds composite commands in FIFO order.

        Tile feed interleaving across composite commands is not allowed.
        The feed for the next command begins only after all tiles of the
        current command have been injected into the first stage queue.

        When the first stage queue is full, only this feeder blocks;
        the scheduler worker's inbox processing does not stall.
        """
        while True:
            plan, ctx = yield self._pending_feeds.get()
            for tile in plan.tiles:
                token = TileToken(
                    tile_id=tile.tile_id,
                    pipeline_ctx=ctx,
                    plan=tile,
                    stage_idx=0,
                    params=tile.stages[0].params,
                )
                yield self.out_ports[tile.stages[0].component].put(token)
                # queue capacity = HW queue depth → feeder blocks only when full
```

In this ADR, the scheduler can accept multiple composite commands,
but tile submission order follows per-command FIFO.
Within a command, tile-level pipeline overlap is allowed,
but tile feed interleaving across commands is not.

### D3. Data Transfer vs. Completion Signal: HW Modeling Criteria

| Communication Type | Method | HW Correspondence |
|-------------------|--------|-------------------|
| Tile token (work directive) | message via out_port | enqueue to command queue |
| Stage completion → next stage | component directly calls out_port.put | done-triggered local enqueue |
| Pipeline completion → scheduler | PipelineContext.complete_tile() | completion interrupt |

**Tile token**: uses out_port.put(). SimPy Store capacity = HW queue depth.

**Intra-PE chaining latency**: within the scope of this ADR, no explicit latency model
is applied to intra-PE stage triggers. Chaining between components corresponds to
PE-internal wires, and since there is no scheduler round-trip, no artificial hop cost
is incurred.

**Pipeline completion**: the component at the last stage calls `pipeline_ctx.complete_tile()`.
When all tiles are complete, PipelineContext calls done_event.succeed().

### D4. Asynchronous Pipeline: Natural Overlap

The scheduler processes CompositeCmds **asynchronously**.
However, tile feed does not spawn an independent process per command; instead,
the scheduler's internal **single feeder process** performs the feed in FIFO order.
The scheduler can therefore keep receiving the next command,
while the first-stage tile injection order is guaranteed per command.

Since **SimPy Store capacity = HW queue depth**:
- When a queue is full, put() naturally blocks (backpressure)
- While DMA is processing one tile, fetch and GEMM can already start on a tile whose DMA has completed
- When a second CompositeCmd arrives, its tiles are fed into the DMA queue directly behind those of the first command

```
First-stage feed order (feeder → DMA queue):
  [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN] | [cmd2:t0][cmd2:t1]...
                                            ↑ cmd2 starts after cmd1 feed completes

Runtime pipeline (downstream overlap):
  PE_DMA:   [cmd1:t0][cmd1:t1][cmd1:t2]...[cmd1:tN][cmd2:t0][cmd2:t1]...
  PE_FETCH:          [cmd1:t0][cmd1:t1]...
  PE_GEMM:                    [cmd1:t0][cmd1:t1]...
                              ↑ pipeline overlap within the same command
```

Here, the overlap does not come from tile feed interleaving across different commands.
It occurs naturally as tiles from earlier commands progress to downstream stages
while the feeder continues injecting subsequent tiles.

For example, tile feed for cmd2 does not start until all tiles of cmd1 have been
injected into the first stage queue. However, while cmd1.tile0 has already progressed
to GEMM, cmd1.tile1 and cmd1.tile2 may still remain in DMA/FETCH, so
**pipeline overlap within the same command occurs naturally**.

#### Component Chaining Pattern

All components follow the same pattern:

```python
def _pipeline_worker(self, env):
    while True:
        token = yield self._inbox.get()

        # process own stage
        yield from self._process(env, token)

        # chain to next stage (read from plan)
        next_idx = token.stage_idx + 1
        if next_idx < len(token.plan.stages):
            next_stage = token.plan.stages[next_idx]
            token.stage_idx = next_idx
            token.params = next_stage.params
            yield self.out_ports[next_stage.component].put(token)
        else:
            # last stage: pipeline completion
            token.pipeline_ctx.complete_tile()
```

### D5. PE_FETCH_STORE: Dedicated TCM ↔ Register File Transfer

Previously, GemmBlock and MathBlock each implemented their own TCM read/write.
This is now separated into a **PE_FETCH_STORE component**.

```python
# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done
    # chaining is handled by the base class (D4 pattern)
```

Advantages:
- GEMM/MATH perform **pure compute only**, with no TCM access logic
- Fetch/store BW contention is naturally modeled (serialization via PE_TCM resource)
- Prefetch strategies can be experimented with by replacing the fetch unit alone

### D6. Simplification of Each Compute Component

GEMM/MATH perform compute only, on register data that has already been prepared.
**Chaining follows the common pattern (D4), so only _process() needs to be implemented:**

```python
# PE_GEMM._process()
def _process(self, env, token):
    yield env.timeout(self._mac_latency(token.params))

# PE_MATH._process()
def _process(self, env, token):
    yield env.timeout(self._simd_latency(token.params))

# PE_FETCH_STORE._process()
def _process(self, env, token):
    yield self.out_ports[tcm_id].put(TcmRequest(token.params["direction"], ...))
    yield tcm_done

# PE_DMA._process()
def _process(self, env, token):
    yield from self._do_fabric_dma(token.params)
```

By replacing only the timing model, one can freely switch between cycle-accurate
and analytical models. Since the chaining logic resides in the base class,
each component implements only its own stage logic.

### D7. Topology Changes

Add PE_FETCH_STORE to the PE template:

```yaml
pe_template:
  components:
    pe_cpu:         { kind: pe_cpu,         impl: pe_cpu_v1, ... }
    pe_scheduler:   { kind: pe_scheduler,   impl: pe_scheduler_v2, ... }
    pe_dma:         { kind: pe_dma,         impl: pe_dma_v1, ... }
    pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, ... }
    pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1, ... }
    pe_math:        { kind: pe_math,        impl: pe_math_v1, ... }
    pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1, ... }
    pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1, ... }
  links:
    # existing links...
    fetch_store_to_tcm_bw_gbs: 512.0
    fetch_store_to_tcm_mm: 0.0
```

PE internal edge connections:
```
PE_SCHEDULER   → PE_DMA          (initial dispatch)
PE_SCHEDULER   → PE_FETCH_STORE  (initial dispatch)
PE_SCHEDULER   → PE_GEMM         (initial dispatch)
PE_SCHEDULER   → PE_MATH         (initial dispatch)
PE_DMA         → PE_FETCH_STORE  (chaining)
PE_FETCH_STORE → PE_GEMM         (chaining)
PE_FETCH_STORE → PE_MATH         (chaining)
PE_GEMM        → PE_FETCH_STORE  (store chaining)
PE_MATH        → PE_FETCH_STORE  (store chaining)
PE_FETCH_STORE → PE_DMA          (writeback chaining)
PE_FETCH_STORE → PE_TCM          (BW request)
```

Topology edges cover both **control/dispatch visibility and runtime chaining**.
Scheduler → sub-component edges are initial dispatch paths, while
inter-component edges are runtime chaining paths driven by token self-routing.

### D9. TileToken 메시지 정의
|
|
||||||
|
|
||||||
컴포넌트 간 tile 작업 전달에 사용하는 메시지.
|
|
||||||
Token이 plan과 stage index를 가지고 있어 self-routing이 가능하다.
|
|
||||||
|
|
||||||
```python
from __future__ import annotations  # PipelineContext / TilePlan are defined elsewhere

from dataclasses import dataclass


@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext  # completion tracking
    plan: TilePlan                 # full stage sequence for this tile (immutable)
    stage_idx: int                 # current stage index in plan.stages
    params: dict                   # cache of current-stage params (canonical: plan.stages[stage_idx].params)
    data_op: bool = True           # recorded in op_log (ADR-0020)
```

A TileToken is **owned by exactly one component** at any point in time and
is never referenced by multiple components simultaneously (single-owner).

Token lifecycle:
1. The scheduler creates the token with stage_idx=0 and puts it to the first stage's component
2. Each component runs _process(), increments stage_idx, and puts the token to the next component
3. The last stage's component calls pipeline_ctx.complete_tile()
4. Once all tiles are complete, PipelineContext calls done_event.succeed()

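The lifecycle above can be sketched without SimPy. In this illustrative version, plain lists stand in for the component inboxes (SimPy Stores in the simulator), a boolean stands in for `done_event`, and `Stage`, `advance`, and `run` are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Stage:
    component: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class TilePlan:
    stages: tuple  # immutable stage sequence for one tile


@dataclass
class PipelineContext:
    total_tiles: int
    completed: int = 0
    done: bool = False  # stands in for done_event.succeed()

    def complete_tile(self):
        self.completed += 1
        if self.completed == self.total_tiles:
            self.done = True


@dataclass
class TileToken:
    tile_id: int
    pipeline_ctx: PipelineContext
    plan: TilePlan
    stage_idx: int = 0


def advance(token, inboxes, trace):
    """One component step: process the current stage, then forward or complete."""
    stage = token.plan.stages[token.stage_idx]
    trace.append((token.tile_id, stage.component))  # the "_process()" placeholder
    token.stage_idx += 1
    if token.stage_idx == len(token.plan.stages):
        token.pipeline_ctx.complete_tile()  # last stage reports completion
    else:
        nxt = token.plan.stages[token.stage_idx].component
        inboxes[nxt].append(token)  # the token names its own next hop


def run(plan, n_tiles):
    ctx = PipelineContext(total_tiles=n_tiles)
    inboxes = {s.component: [] for s in plan.stages}
    trace = []
    # Step 1: the scheduler creates tokens at stage_idx=0 and dispatches them.
    for tile_id in range(n_tiles):
        inboxes[plan.stages[0].component].append(TileToken(tile_id, ctx, plan))
    # Steps 2-4: components drain their inboxes until nothing is left.
    while any(inboxes.values()):
        for inbox in inboxes.values():
            if inbox:
                advance(inbox.pop(0), inboxes, trace)
    return ctx, trace


plan = TilePlan(stages=(Stage("PE_DMA"), Stage("PE_FETCH_STORE"), Stage("PE_GEMM")))
ctx, trace = run(plan, n_tiles=2)
```

Note that `advance` never hard-codes a successor: the next component is always read from the token's plan, which is the property that lets components stay unaware of each other.
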
Relationship to the existing PeInternalTxn:
- PeInternalTxn: carries commands from PE_CPU to PE_SCHEDULER (kept as is)
- TileToken: carries tile-granularity work from PE_SCHEDULER to the subcomponents (new, self-routing)

---

## Non-goals

- **PE_CPU changes**: the PE_CPU → PE_SCHEDULER interface is unchanged
  (PeInternalTxn-based, ADR-0014 retained)
- **Resource contention model across multiple pipelines**: the current scope focuses on
  modeling a single pipeline accurately. TCM bank conflicts between multiple pipelines and similar effects are future work.

## Open Questions

- **Register File capacity model**: whether to model a capacity limit when the fetch unit
  loads into registers. Capacity is expressed in bytes (register_file_bytes), and
  the number of tiles that can be held concurrently is determined by the tile size.
  When capacity is exceeded, fetch stalls, which produces natural backpressure.
- **Prefetch strategy**: this ADR does not allow tile feed interleaving across composite
  commands. Overlap therefore does not come from issuing the next command early;
  it arises naturally from the pipeline progression of tiles within the same command.
  If additional prefetch is needed, it should be explored through tile ordering within
  a command or through fetch/store unit policy, not through cross-command issue.
- **PE_DMA coalescing**: tile-granularity DMA can cause fragmentation.
  The direction is to merge/coalesce inside the DMA without involving the scheduler.
- **Synchronous execution mode**: this ADR adopts the asynchronous pipeline as the default
  and only execution model. If a sync mode is needed for debugging or validation, it will be considered in a future ADR.
- **TCM bank conflicts across multiple pipelines**: currently modeled for a single pipeline.
  A bank conflict model for multiple pipelines accessing TCM concurrently is future work.

---

## Consequences

### Positive

- Each block is an independent component and can be replaced individually (complies with ADR-0015)
- The PE's internal structure becomes visible in the topology
- Components do not know their successor; plan-based routing provides flexibility
- Natural pipeline overlap between DMA and compute (SimPy Store backpressure)
- Improved HW modeling fidelity (done signal = Event, data transfer = message)
- Separating fetch/store lets TCM BW contention be modeled accurately

### Negative

- More PE-internal components (5 → 6), which increases topology nodes/edges
- Splitting components makes intra-PE token forwarding more explicit than before

+4
-4
@@ -1,10 +1,10 @@
 # ADR-0022: 2D Grid program_id Semantics
 
-- **Status**: Accepted
-- **Date**: 2026-04-09
-- **Context**: Triton-style kernel addressing for multi-cube PE topology
+## Status
 
-## Problem
+Accepted
+
+## Context
 
 Triton kernels use `tl.program_id(axis)` to identify their position in a launch grid.
 Our hardware has a 2-level hierarchy: **cubes** contain **PEs**.
+2
-2
@@ -709,7 +709,7 @@ piggyback, tail updates via the D9 fast-path channel.
 
 ### D13. Test strategy
 
-Following the ADR-0021 D8 pattern.
+Test plan:
 
 #### T1. Unit tests (component-level)
 
@@ -801,7 +801,7 @@ F5. **Slot full + infinite backpressure**: the peer never recvs.
 ### D15. Algorithm-author cheat sheet
 
 Full step-by-step lives in
-[`docs/ccl-author-guide.en.md`](../ccl-author-guide.en.md). The
+[`docs/onboarding/ccl-author-guide.en.md`](../onboarding/ccl-author-guide.en.md). The
 shortest version:
 
 | Things you touch | Things you don't |
+412
-3
@@ -969,7 +969,7 @@ tail updates are handled via the D9 fast-path SimPy Store channel.
 
 ### D13. Test strategy
 
-Following the ADR-0021 D8 pattern, this section specifies unit/integration/regression tests.
+This section specifies unit/integration/regression tests.
 
 #### T1. Unit tests (component-level)
 
@@ -1102,7 +1102,7 @@ F5. **Slot full + infinite backpressure**:
 ### D15. Algorithm-author guide (summary)
 
 This section gives algorithm authors a one-screen starting point.
-For the detailed step-by-step guide, see [docs/ccl-author-guide.md](../ccl-author-guide.md).
+For the detailed step-by-step guide, see [docs/onboarding/ccl-author-guide.md](../onboarding/ccl-author-guide.md).
 
 #### What you touch / what you don't
 
@@ -1175,7 +1175,416 @@ def neighbors(rank, world_size, neighbor_map) -> dict | None:
 2. **send/recv mismatch**: if the peer has no matching recv, the sender hangs (slot-full backpressure)
 3. **dtype/shape mismatch**: not validated in the first implementation; this is the author's responsibility
 
-For the detailed step-by-step guide and a hello-world example, see `docs/ccl-author-guide.md`.
+For the detailed step-by-step guide and a hello-world example, see `docs/onboarding/ccl-author-guide.md`.
 
---

## HW Realization Notes (Informative)

**Status of this section**: Forward-looking. Describes how the simulator
contract (D1–D15) would map to silicon. Not currently implemented;
subject to revision before tapeout. The simulator implements the
contract via Python/SimPy equivalents in
[pe_ipcq.py](../../src/kernbench/components/builtin/pe_ipcq.py) and
[pe_dma.py](../../src/kernbench/components/builtin/pe_dma.py).

### D16. Proposed HW Block Diagram and End-to-End Dataflow



> Source: [`../diagrams/pe_baseline.d2`](../diagrams/pe_baseline.d2) — `d2 --layout=elk --scale 1.5`.



> Source: [`../diagrams/pe_proposed.d2`](../diagrams/pe_proposed.d2) — `d2 --layout=elk`.

**Key changes, Baseline → Proposed**:

- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
- The **reserved IPCQ Slot Region** inside TCM is made explicit
- Credit Injector / Receiver connect directly to the NoC through the Fabric Port

#### End-to-End Sequence (HW view)

```mermaid
sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU

    Note over CPU_A: tl.send(dir="E", src=0x1000)

    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)

    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)

    Note over NOC: hop latency + BW drain

    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency

    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end

    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal

    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++

    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival

    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
```

### D17. IPCQ Controller HW Module (new)

A hardware control block located between PE_CPU and the DMA Engine. It corresponds
to the simulator's `PeIpcqComponent`.

#### QPair Register File

Per-direction queue pair state is held in flip-flops. PE_CPU can read and write it
via MMIO (CSR), and software populates it at init time.

```
Per-direction registers (each 64-bit):
  my_head          — sender write position (monotonic)
  my_tail          — receiver read position (monotonic)
  peer_head_cache  — last known peer head (updated by Meta Extractor)
  peer_tail_cache  — last known peer tail (updated by Credit Receiver)
  rx_base_pa       — this PE's rx buffer base physical address
  peer_rx_base_pa  — peer's rx buffer base physical address
  n_slots          — ring depth (power-of-2 constraint, see D21)
  slot_size        — bytes per slot
  peer_credit_tgt  — credit receive address of the peer PE

Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8B = 576B flip-flops
```

#### Slot Address Generator (combinational)

```
Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size

Implementation:
  n_slots power-of-2   → pointer & (n_slots - 1)  (AND mask, 1 gate)
  slot_size power-of-2 → barrel shift  (1 cycle)
  64-bit add           → ripple/kogge-stone adder  (1 cycle)

Latency: 1-2 cycles combinational
```

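The equivalence that the Slot Address Generator relies on can be checked in software. The sketch below compares the generic modulo/multiply form with the mask/shift form under the power-of-2 assumption; the function names and the example base address are illustrative only.

```python
def slot_addr_generic(pointer, n_slots, slot_size, base_pa):
    """Reference form: modulo and multiply."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_pow2(pointer, n_slots, slot_size, base_pa):
    """Mask-and-shift form, valid only when both sizes are powers of two."""
    assert n_slots > 0 and (n_slots & (n_slots - 1)) == 0
    assert slot_size > 0 and (slot_size & (slot_size - 1)) == 0
    index = pointer & (n_slots - 1)        # replaces pointer % n_slots
    shift = slot_size.bit_length() - 1     # log2(slot_size)
    return base_pa + (index << shift)      # replaces index * slot_size
```

For example, with 8 slots of 4 KiB each, pointer 9 wraps to slot 1 in both forms.
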
#### Backpressure Comparator (combinational)

```
full = (my_head - peer_tail_cache) >= n_slots

Implementation: 64-bit subtract + unsigned compare
Output: stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
Latency: 1 cycle
```

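A minimal software model of the comparator, assuming the subtraction is performed modulo 2^64 as the "64-bit subtract + unsigned compare" wording suggests. The wraparound handling is an assumption of this sketch; the text does not state how counter overflow is treated.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head, peer_tail_cache, n_slots):
    """True when every slot is occupied from the sender's point of view."""
    # Unsigned 64-bit subtract: stays correct even after the counters wrap.
    in_flight = (my_head - peer_tail_cache) & MASK64
    return in_flight >= n_slots
```

With 8 slots, a sender at head 8 whose last known peer tail is 0 is stalled, and it is released as soon as a credit moves the cached tail to 1.
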
#### Meta Extractor (inbound datapath sideband)

Wired into the DMA Engine's inbound vc_comm path. It extracts metadata from the
header of each arriving IPCQ flit and updates the queue pair state.

```
Trigger: DMA inbound write completion (same cycle)
Extract: {sender_seq, dst_addr} from flit header

Direction matching (ADR-0025 D2):
  for each dir:
    match = (base_pa[dir] <= dst_addr) && (dst_addr < base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder

Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
Output: recv_wake signal → PE_CPU interrupt/flag
Latency: 1 cycle (pipelined with DMA write; I6 atomicity holds by construction)
```

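The range match and the monotonic `peer_head_cache` update can be sketched as follows. The `QPair` class, the dictionary keyed by direction name, and the example addresses are assumptions made for illustration; hardware would evaluate all comparators in parallel rather than loop.

```python
from dataclasses import dataclass


@dataclass
class QPair:
    rx_base_pa: int
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0


def match_direction(qpairs, dst_addr):
    """Return the direction whose rx ring contains dst_addr, or None."""
    for name, qp in qpairs.items():
        if qp.rx_base_pa <= dst_addr < qp.rx_base_pa + qp.n_slots * qp.slot_size:
            return name
    return None


def on_ipcq_flit(qpairs, sender_seq, dst_addr):
    """Model of the Meta Extractor: match the direction, then bump the cached head."""
    name = match_direction(qpairs, dst_addr)
    if name is None:
        return None
    qp = qpairs[name]
    # max() keeps the cache monotonic even if an older flit is observed late.
    qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
    return name


qpairs = {
    "N": QPair(rx_base_pa=0xE00000, n_slots=8, slot_size=0x1000),
    "W": QPair(rx_base_pa=0xE08000, n_slots=8, slot_size=0x1000),
}
```
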
#### Credit Injector (outbound)

```
Trigger: recv completion (after my_tail is incremented)
Action: pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)

Packet: {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
Latency: 1 cycle to generate, then NoC traversal
```

#### Credit Receiver (inbound sideband)

```
Trigger: 16B credit packet arrival (from NoC)
Extract: {consumer_seq, dst_rx_base_pa}

Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)

Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
Output: send_wake signal → deassert backpressure stall
Latency: 1 cycle
```

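To make the credit round trip concrete, the sketch below packs a 16-byte credit and applies it on the sender side. It assumes the 128-bit credit-only flit layout given in D19 (32-bit routing header with the flag in the top bit, 32-bit `consumer_seq`, 64-bit `dst_rx_base_pa`); the big-endian byte order, the `route` argument, and the function names are assumptions of this example.

```python
import struct

CREDIT_FMT = ">IIQ"      # 32b routing header, 32b consumer_seq, 64b dst_rx_base_pa
CREDIT_FLAG = 1 << 31    # top bit of the header word, i.e. flit bit [127]


def pack_credit(consumer_seq, dst_rx_base_pa, route=0):
    """Credit Injector: build the 16B packet after my_tail has advanced."""
    header = CREDIT_FLAG | (route & 0x7FFFFFFF)
    return struct.pack(CREDIT_FMT, header, consumer_seq & 0xFFFFFFFF, dst_rx_base_pa)


def apply_credit(peer_rx_base_pa, peer_tail_cache, packet):
    """Credit Receiver: equality-match the direction, then update the cached tail."""
    header, consumer_seq, dst_rx_base_pa = struct.unpack(CREDIT_FMT, packet)
    if not header & CREDIT_FLAG:
        return None
    for name, base in peer_rx_base_pa.items():
        if base == dst_rx_base_pa:
            peer_tail_cache[name] = max(peer_tail_cache[name], consumer_seq)
            return name
    return None


peer_rx = {"E": 0xE00000, "W": 0xE08000}
tail_cache = {"E": 2, "W": 0}
packet = pack_credit(consumer_seq=5, dst_rx_base_pa=0xE00000)
matched = apply_credit(peer_rx, tail_cache, packet)
```
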
### D18. DMA Engine vc_comm IPCQ-aware Mode

Adds an IPCQ flit handling mode to the existing vc_comm channel (D8).

**Outbound**:

1. Receive a command from the IPCQ Controller: `{src_addr, dst_addr, nbytes, sender_seq}`
2. Read src_addr from TCM → snapshot into the DMA read buffer (standard DMA behavior)
3. Flit pack: data + piggyback metadata (sender_seq, dst_addr)
4. Inject into the NoC fabric port
5. Fire-and-forget (does not wait for completion)

**Inbound**:

1. Receive the IPCQ flit from the NoC
2. Terminal BW drain charge (`drain_ns = nbytes / bottleneck_bw`)
3. Slot write latency charge (backing memory tier)
4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
5. Done

**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor trigger
occur in the same pipeline stage, so no separate synchronization is needed. The simulator's
"no SimPy yield between MemoryStore.write and IpcqMetaArrival put" rule (D9, I6) is
thereby guaranteed naturally.

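As a small worked example of the inbound charges in steps 2 and 3, the sketch below evaluates `drain_ns = nbytes / bottleneck_bw`. It assumes that bandwidths are given in GB/s (so 1 GB/s equals 1 byte per ns), that the bottleneck is the minimum bandwidth along the path, and that the slot write latency is a fixed number of nanoseconds; none of these details are specified here, and the function name is hypothetical.

```python
def inbound_latency_ns(nbytes, path_bw_gbs, slot_write_ns):
    """Terminal drain plus slot write latency for one inbound IPCQ transfer."""
    bottleneck_bw = min(path_bw_gbs)      # assumed: slowest link dominates
    drain_ns = nbytes / bottleneck_bw     # GB/s == bytes per ns
    return drain_ns + slot_write_ns
```

For instance, 4096 bytes over a path whose slowest hop runs at 256 GB/s drains in 16 ns, so with a 2 ns slot write the transfer is charged 18 ns.
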
#### Data Snapshot Semantics

Data latched in the DMA read buffer is unaffected by later modifications to src memory.
This is standard DMA read-then-write behavior, so no additional HW is needed.

#### Credit Virtual Channel (optional)

- **Option A**: multiplex credits onto vc_comm (distinguished as 16B header-only flits).
- **Option B**: add a third, dedicated credit VC (strict priority > data).

Option B is better for deadlock prevention, but because the BW impact of 16B credits is
negligible, Option A is also sufficient.

### D19. Fabric Flit Format Extension

```
Regular data flit (e.g. 512-bit):
┌────────────────────────────────────────────┐
│ [511:480] routing header (32b)             │
│ [479:0]   payload (480b = 60B)             │
└────────────────────────────────────────────┘

IPCQ data flit (metadata in the first flit only):
┌────────────────────────────────────────────┐
│ [511:480] routing header (32b)             │
│   [511]     ipcq_flag (1b)                 │ ← distinguishes IPCQ from normal DMA
│   [510:509] vc_id (2b)                     │
│   [508:480] route + hop count              │
│ [479:416] ipcq_metadata (64b)              │ ← piggyback
│   [479:448] sender_seq (32b)               │
│   [447:416] dst_addr[31:0] (32b)           │ ← for direction matching
│ [415:0]   payload (416b = 52B)             │
└────────────────────────────────────────────┘
Subsequent flits: full 60B payload (no metadata)

Credit-only flit (128-bit, header-only):
┌────────────────────────────────────────────┐
│ [127:96]  routing header (32b)             │
│   [127]     credit_flag (1b)               │
│ [95:64]   consumer_seq (32b)               │
│ [63:0]    dst_rx_base_pa (64b)             │
└────────────────────────────────────────────┘
```

The first flit's payload shrinks from 60B to 52B (13% overhead). In a multi-flit transfer
the subsequent flits carry a full payload, so the overhead is < 1% for large transfers.

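The overhead figures can be reproduced with a few lines of arithmetic. This sketch uses the payload sizes from the flit layouts (60B regular, 52B in the first IPCQ flit) and measures overhead as metadata bytes relative to the transferred bytes, which is one reasonable reading of the percentages; the function names are illustrative.

```python
import math

FLIT_PAYLOAD = 60         # bytes of payload in a regular data flit
FIRST_FLIT_PAYLOAD = 52   # bytes of payload in the first IPCQ flit
METADATA_BYTES = FLIT_PAYLOAD - FIRST_FLIT_PAYLOAD  # 8B of piggyback metadata


def flit_count(nbytes):
    """Number of flits needed for an IPCQ transfer of nbytes."""
    if nbytes <= FIRST_FLIT_PAYLOAD:
        return 1
    return 1 + math.ceil((nbytes - FIRST_FLIT_PAYLOAD) / FLIT_PAYLOAD)


def metadata_overhead(nbytes):
    """Metadata bytes as a fraction of the payload bytes transferred."""
    return METADATA_BYTES / nbytes
```

A single full flit loses 8 of its 60 payload bytes (about 13%), while a 4 KiB transfer pays the same 8 bytes once, which is roughly 0.2%.
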
### D20. TCM IPCQ Slot Region Layout

```
TCM Memory Map (16MB):
┌─────────────────────────────┐ 0x000000
│ Kernel Working Memory       │
│ (compute tensors)           │
│ ~14MB                       │
├─────────────────────────────┤ 0xE00000
│ IPCQ RX Buffers             │
│   Dir N: slots × slot_size  │
│   Dir S: slots × slot_size  │
│   Dir E: slots × slot_size  │
│   Dir W: slots × slot_size  │
│ ~1MB                        │
├─────────────────────────────┤ 0xF00000
│ IPCQ Metadata / Scratch     │
│ ~1MB                        │
└─────────────────────────────┘ 0xFFFFFF
```

The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts with
compute accesses (see Risk D22).

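A quick way to sanity-check a ring configuration against this memory map is to lay the per-direction buffers out back to back and compare the end address with the region limit. The sketch below assumes contiguous packing in the listed direction order, which the diagram suggests but does not state; `rx_layout` is a hypothetical helper.

```python
MIB = 1024 * 1024
TCM_BYTES = 16 * MIB
IPCQ_RX_BASE = 0xE00000    # start of "IPCQ RX Buffers" in the memory map
IPCQ_RX_LIMIT = 0xF00000   # start of "IPCQ Metadata / Scratch"


def rx_layout(directions, n_slots, slot_size):
    """Assign each direction a contiguous ring and check it fits the RX region."""
    layout = {}
    addr = IPCQ_RX_BASE
    for direction in directions:
        layout[direction] = addr
        addr += n_slots * slot_size
    if addr > IPCQ_RX_LIMIT:
        needed = addr - IPCQ_RX_BASE
        raise ValueError(f"RX buffers need {needed} bytes, region holds {IPCQ_RX_LIMIT - IPCQ_RX_BASE}")
    return layout


layout = rx_layout(["N", "S", "E", "W"], n_slots=8, slot_size=32 * 1024)
```

Four directions with 8 slots of 32 KiB fill the 1MB region exactly; doubling the slot size to 64 KiB no longer fits and raises `ValueError`.
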
### D21. 2nm Implementation Analysis

#### Area Estimate

| Module | Gate Count | Area (2nm est.) | Notes |
|---|---|---|---|
| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
| **IPCQ Controller subtotal** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
| **Total delta** | **~14.6K** | **~0.006 mm²** | |

#### Timing

| Path | Delay (2nm est.) | Target Clock | Margin |
|---|---|---|---|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |

All critical paths fit within 1 cycle. No timing closure issues.

#### Power

- Active: ~1 mW (register R/W + comparators, during send/recv activity)
- Idle: leakage only
- Negligible relative to total PE power

#### Constraints

| Item | Constraint | Rationale |
|---|---|---|
| `n_slots` | **must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
| `slot_size` | **power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
| TCM IPCQ region | **placed in dedicated banks** | Avoids bank conflicts with compute accesses |

### D22. Risk Assessment

#### TCM Bank Conflict

- **Risk**: stall when an IPCQ slot write and a compute read hit the same bank
- **Mitigation**: place the IPCQ region in dedicated banks at the upper TCM addresses (D20)
- **Cost**: slightly reduced TCM banking flexibility
- **Severity**: Medium (performance impact), Low (not a correctness issue)

#### Credit Return Latency under Congestion

- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
- **Mitigation**:
  - Move credits to a separate VC with strict priority (at 16B the BW impact is minimal)
  - Or set n_slots generously (8+) so that the buffer absorbs credit delay
- **Severity**: Low (16B credits contribute almost nothing to congestion)

#### Inter-Direction Ordering

- **Risk**: ordering when the same PE sends in multiple directions concurrently
- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is
  the kernel's (software's) responsibility, the same as in the current simulator model (D2 + D4)
- **Severity**: Low (resolved by architectural design)

### D23. HW Alternatives Considered

#### Doorbell + Polling (traditional approach)

```
Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
Recv: Polling loop on doorbell, or interrupt-driven
```

| Pros | Cons |
|---|---|
| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
| Reuses the existing DMA | Ordering between data and doorbell must be enforced (fence) |
| | Polling wastes power; interrupts add latency overhead |

**Assessment**: 2-3× higher latency than piggyback. **Rejected.**

#### Hardware Message Queue (NVIDIA NVLink style)

```
Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
Recv: pop a descriptor from the HMQ → check the data pointer
```

| Pros | Cons |
|---|---|
| The CPU only writes descriptors | Requires a separate HMQ engine (~0.05 mm²) |
| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
| | Large tensors still need DMA in the end |

**Assessment**: CCL's large-tensor pattern makes DMA mandatory, so a dual HMQ + DMA
structure wastes area. **Rejected.**

#### RDMA-style Completion Queue (CQ)

```
Send: DMA write → CQE generated automatically at the peer
Recv: CQ poll/interrupt → locate the data
```

| Pros | Cons |
|---|---|
| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
| Easy multi-tenancy/isolation | Additional CQE/data ordering guarantees needed |
| | Over-engineered for PE-to-PE CCL |

**Assessment**: RDMA CQs suit multi-tenant isolation on host-facing NICs.
In a single-owner PE-to-PE environment they add unnecessary complexity. **Rejected.**

#### Credit-in-Data Piggyback (v2 optimization candidate)

In the current design, credit return is a separate 16B packet. In bidirectional
communication patterns, **the credit can be merged into a data flit travelling in the reverse direction.**

```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
```

| Pros | Cons |
|---|---|
| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minor overhead) |
| | Slightly higher logic complexity |

**Assessment**: a strong optimization of the current design. Credit packets can be eliminated
entirely in bidirectional allreduce, while the standalone credit fallback is retained. **Recommended for adoption in v2.**

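The piggyback-with-fallback decision can be sketched in a few lines. Everything here is hypothetical: the dictionary-based flit representation, the `credit_ack` field name (taken from the example trace above), and the `return_credit` helper are only meant to illustrate the choice between merging a credit into pending reverse-direction data and sending a standalone credit.

```python
def return_credit(consumer_seq, pending_reverse_data):
    """Attach the credit to queued reverse-direction data if any, else send it alone."""
    if pending_reverse_data:
        flit = pending_reverse_data.pop(0)
        flit["credit_ack"] = consumer_seq       # piggyback on the data flit
        return ("data+credit", flit)
    # Unidirectional pattern: fall back to the standalone 16B credit.
    return ("credit_only", {"consumer_seq": consumer_seq})


queue = [{"sender_seq": 5}]
merged = return_credit(4, queue)
standalone = return_credit(4, [])
```
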
### Open HW Questions

- What fraction of TCM should the IPCQ slot region be allowed to occupy? (current assumption: ~1MB / 16MB = 6.25%)
- Should credits get a separate VC, or be multiplexed onto vc_comm? (see D18)
- Flit format compatibility on inter-SIP links needs verification
- Should the maximum n_slots be capped? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)

---

@@ -0,0 +1,206 @@
# ADR-0024: SIP-level Launcher — rank = SIP

## Status

Accepted

## Context

### Goal

Align the participation unit (rank) of `torch.distributed` collective calls with the
**SIP** (device) boundary. The target is bench code that reads, **at the host level,
indistinguishably** from a real PyTorch DDP/TP script.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench |
| --- | --- | --- |
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS process fork | greenlet fan-out |

### Problems to solve

1. **rank = SIP in the public API**: bench workers should not need to know about PEs.
2. **Greenlet-local rank/device tracking**: within the single-process model, each
   worker greenlet must correctly identify its own rank and its own SIP.
3. **Tensor placement = structural (sip, cube, pe)**: if rank is the SIP, default
   tensor placement must also be expressed in structural coordinates.

### Non-problems (outside this ADR)

- IPCQ direction addressing → ADR-0025
- Removal of `DPPolicy.sip`/`num_sips` → ADR-0026
- Megatron-style TP → ADR-0027
- DTensor → ADR-0028 (future)
- Worker scheduling / `mp.spawn` / collective drain / exception cleanup
  → ADR-0027 D0/D1
- Collective algorithm implementation (intercube_allreduce, SFR config) → ADR-0032

## Decision

### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
override is kept as the legacy "rank = PE" test path.

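The resolution order in D1 can be exercised in isolation. The sketch below restates the method as a free function; the `merged`, `cfg_all`, and `spec` arguments stand in for `self._merged`, `self._cfg_all`, and `self.ctx.spec`, and their shapes are assumed from the method body rather than taken from the actual configuration schema.

```python
def resolve_world_size(merged, cfg_all, spec):
    """Algorithm override > defaults override > SIP count from the topology spec."""
    if "world_size" in merged:
        return int(merged["world_size"])
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    spec = spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
from_topology = resolve_world_size({}, {}, spec)
from_defaults = resolve_world_size({}, {"defaults": {"world_size": 8}}, spec)
from_algorithm = resolve_world_size({"world_size": 16}, {"defaults": {"world_size": 8}}, spec)
```

With no overrides the world size follows the SIP count; an explicit `world_size` at either level takes precedence, which is how a legacy "rank = PE" test can still request a larger world.
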
### D2. Greenlet-local rank registry (+ debug warning)

```python
import os
import warnings


class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```

### D3. `torch.ahbm.set_device(rank)`: SIP binding

The KernBench backend is named `ahbm` (ADR-0023). Real PyTorch uses
`torch.cuda.set_device(r)`, but since we are not CUDA we use an honestly named
namespace.

```python
class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```

**Parallel support for the PyTorch 2.x style**: recent PyTorch is moving toward the
device-agnostic `torch.accelerator` namespace (`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). For users who want to write code that is
not tied to a device vendor, KernBench supports this surface as well.

```python
class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)  # alias
```

Bench authors pick one of the following; both are backed by the same registry internally:

```python
torch.ahbm.set_device(rank)                  # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)     # PyTorch 2.x device-agnostic
```

### D4. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and generates the placement in structural
coordinates. See ADR-0026 for details.

```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()   # (D3 naming)
if current_sip is None:
    current_sip = 0   # single-driver fallback (consistent with D2)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```

There is no post-hoc `pe_index` shifting: ShardSpec holds the `(sip, cube, pe)` structural
coordinates directly. See ADR-0026 for ShardSpec details.

---

## Dependencies

- **ADR-0023** (IPCQ): origin of the backend `ahbm` namespace.
- **ADR-0026** (DPPolicy intra-device): the `resolve_dp_policy` signature used in D4 and
  the structural-coordinate representation of ShardSpec.
- **ADR-0027** (Megatron TP + scheduler): the implementation reference for worker
  scheduling, `mp.spawn`, collective drain, and exception cleanup.

---

## Non-goals

- **IPCQ protocol changes**: ADR-0023 stands.
- **DPPolicy field cleanup**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Worker scheduling / spawn / drain / exception cleanup**: ADR-0027 D0/D1.
- **Collective algorithm implementation**: ADR-0032.
- **Multi-node (cross-process)**: single process only.

---

## Consequences

### Positive

- **Bench = real PyTorch DDP** (from the public API's point of view).
- **Greenlet-local rank**: enables cross-rank correctness in the single-process model.
- **Structural placement coordinates**: the other ADRs (ADR-0026 / ADR-0027 / ADR-0032)
  operate consistently on the `(sip, cube, pe)` 3-tuple.

### Neutral

- IPCQ PE-level protocol (ADR-0023) unchanged.
- IO_CPU role unchanged (existing transit behavior as is).

@@ -1,868 +0,0 @@
# ADR-0024: SIP-level TP Launcher — rank = SIP (host-driven dispatch)

## Status

Accepted. rank = SIP process-group model stands. The allreduce algorithm
path (mapper / validator / per-PE install machinery originally targeted at
ADR-0029) has been replaced by ADR-0032: `AhbmCCLBackend` now calls
`configure_sfr_intercube_multisip` at `init_process_group` time and the
intercube kernel receives `(sip_rank, sip_topo_kind, sip_topo_w,
sip_topo_h)` appended after the module's `kernel_args()`. The
`leader_only` / `all_pes` mapper concepts in this document are no longer
used by the default allreduce path.

## Context

### Goal

Align the participation unit (rank) of `torch.distributed` collective calls with the
**SIP** (device) boundary. The target is bench code that reads, **at the host level,
indistinguishably** from a real PyTorch DDP/TP script.

Comparison with real PyTorch:

| Dimension | real PyTorch | KernBench (after this ADR) |
|---|---|---|
| Process model | N processes, 1 GPU each | 1 process, N greenlets, 1 SIP each |
| `get_rank()` | `RANK` env var | greenlet-local registry |
| `get_world_size()` | `WORLD_SIZE` env var | number of SIPs in the topology |
| `torch.cuda.set_device(r)` (real) / `torch.ahbm.set_device(r)` (KernBench) | rank → GPU | rank → SIP |
| `mp.spawn` | OS process fork | greenlet fan-out |

### Design principle: abstraction at the public API, existing paths internally

**Abstraction at the public API (bench worker) level**:

```
rank = SIP
DPPolicy = intra-device (cube × PE) distribution only
PyTorch-style surface: dist.all_reduce, torch.ahbm.set_device, mp.spawn, etc.
```

**Framework-internal implementation**:

```
build_install_plans (host): topology + mapper + algorithm → SipInstallPlan
↓
backend (host): dispatches each per-PE spec in the plan as an IpcqInitMsg via engine.submit
↓
engine: existing PE-scoped routing (same path as MmuMapMsg etc.)
↓
PE_IPCQ: handles IpcqInitMsg in its own message loop (existing capability)
```

**Key point**: no new message types and no IO_CPU extension. The existing
engine routing and the existing `IpcqInitMsg` type are used as is. Only the
old "sideband direct call" bypass is removed, which unifies the convention.

### Problems to solve

1. **rank = SIP in the public API**: bench workers should not need to know
   about PEs.
2. **Multi-worker execution**: N ranks run independent worker code. Under the
   single-process constraint this uses greenlets plus barrier synchronization.
3. **Cross-rank collective submit synchronization**: if the first rank waits
   alone, SimPy deadlocks because its peers are absent. Drain must happen only
   after every rank has submitted.
4. **Removing the existing sideband install**: `IpcqInitMsg` is unified onto
   `engine.submit`, the same pattern as other control-plane messages such as
   MmuMapMsg.
5. **Separating algorithm / mapper / validator**: an algorithm module contains
   kernel code only; topology / mapping / validation come from registries plus
   declarations.

### Non-problems (outside this ADR)

- IPCQ direction addressing fix → **ADR-0025**
- Removal of `DPPolicy.sip` / `num_sips` → **ADR-0026**
- Megatron-style TP → **ADR-0027**
- DTensor → **ADR-0028 (future)**
- **Promoting IO_CPU to the single SIP-level control-plane endpoint**: not
  adopted as an invariant in this ADR. KernBench has no such principle today,
  and the justification for introducing it on its own is weak. If a need for
  more precise control-plane latency modeling arises later, it gets a separate
  ADR.

## Decision

### D1. rank = SIP (world_size resolution)

```python
def _resolve_world_size(self) -> int:
    # 1. Per-algorithm override.
    if "world_size" in self._merged:
        return int(self._merged["world_size"])
    # 2. Override from the `defaults` section.
    defaults = self._cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Fall back to the SIP count in the topology spec.
    spec = self.ctx.spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))
```

Priority: algorithm override > defaults override > SIP count. The `ccl.yaml`
override is kept as the legacy "rank = PE" test path.

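
The D1 precedence can be restated as a standalone function. This is a minimal sketch, assuming plain dictionaries stand in for `self._merged`, `self._cfg_all`, and `self.ctx.spec`; the free function `resolve_world_size` is a hypothetical name used only for illustration.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    merged: Mapping[str, Any],
    cfg_all: Mapping[str, Any],
    spec: Optional[Mapping[str, Any]],
) -> int:
    """Hypothetical standalone restatement of the D1 precedence."""
    # 1. A per-algorithm override wins.
    if "world_size" in merged:
        return int(merged["world_size"])
    # 2. Then the `defaults` section.
    defaults = cfg_all.get("defaults", {})
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # 3. Otherwise the SIP count from the topology spec, defaulting to 1.
    spec = spec or {}
    return int(spec.get("system", {}).get("sips", {}).get("count", 1))


spec = {"system": {"sips": {"count": 4}}}
ws_topology = resolve_world_size({}, {}, spec)
ws_defaults = resolve_world_size({}, {"defaults": {"world_size": 2}}, spec)
ws_algorithm = resolve_world_size(
    {"world_size": 8}, {"defaults": {"world_size": 2}}, spec
)
ws_empty = resolve_world_size({}, {}, None)
```

With no override the world size follows the SIP count, so the bench's notion of rank matches the topology unless a legacy test opts out explicitly.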
### D2. Install path: unified on engine.submit

Remove the sideband direct call in `ccl/install.py` and send `IpcqInitMsg`
through `engine.submit`. MmuMapMsg / MemoryWriteMsg and others already follow
the same pattern.

```python
# Backend (in AhbmCCLBackend.__init__ or at init_process_group time)
from kernbench.ccl.install_plan import build_install_plans

plans = build_install_plans(
    world_size=self._world_size,
    algorithm=self._merged["algorithm"],
    algorithm_config=self._merged,
    spec=self.ctx.spec,
)
self._plans = plans

# Submit through the engine so that each PE_IPCQ receives its own neighbor table
handles = []
for plan in plans:
    for pe_install in plan.pe_installs:
        h = self.ctx.submit(IpcqInitMsg(
            correlation_id=self.ctx.correlation_id,
            request_id=f"ipcq_init_s{plan.sip}c{pe_install.cube}p{pe_install.pe}",
            target_sips=(plan.sip,),
            target_cubes=(pe_install.cube,),
            target_pe=pe_install.pe,
            entries=pe_install.neighbors,
            buffer_kind=plan.buffer_kind,
            n_slots=plan.n_slots,
            slot_size=plan.slot_size,
            # ... (remaining existing IpcqInitMsg fields)
        ))
        handles.append(h)

# Eager install: guaranteed complete before init_process_group returns
for h in handles:
    self.ctx.wait(h)
```

The **PE_IPCQ component** already handles `IpcqInitMsg` in its main loop
(`pe_ipcq.py`, lines 145-147), so it needs no change. The only difference is
that the message arrives through the engine queue rather than through a
sideband Python call.

**Correctness invariant (equivalence)**: `init_process_group()` returns only
after calling `wait()` on every install handle, so a launch-before-install
problem is structurally impossible. Only one correctness question remains:

> Does engine-routed `IpcqInitMsg` handling produce **the same final PE_IPCQ
> state** as the existing sideband `pe_ipcq._install_neighbors(msg)` call?

Verification points (see T3):

1. **State equivalence**: the internal state transitions of
   `_install_neighbors()` occur identically on the engine dispatch path, so
   the final PE_IPCQ state (`_queue_pairs`, `_installed`, `_credit_inbox`,
   etc.) matches.

2. **No sideband-only side effects**: there is no side effect that existed
   only on the sideband path (e.g. the request_id / correlation tracking set
   up by engine.submit does not distort install semantics).

3. **Ordering independence**: the final state is the same even if install
   messages for different PEs are processed in arbitrary order in the engine
   queue. That is, install must be a **per-PE independent operation** with no
   cross-PE ordering dependency.

4. **Idempotency**: what happens if `IpcqInitMsg` arrives twice for the same
   PE? The current design assumes "exactly one install per PE"; behavior on a
   duplicate install is undefined. Conservative policy:
   - On first install, transition to `_installed = True`.
   - Specify explicitly that any later duplicate install message is either an
     **error** (raise) or a **silent idempotent** no-op.
   - Recommendation: **raise** (an explicit error surfaces bugs early). Add a
     duplicate-install case to T3.

5. **Partial install visibility**: is the intermediate state, in which only
   some PEs have finished installing, observable from outside? In the current
   structure the eager wait-all in `init_process_group()` acts as a barrier,
   so partial state is never exposed to bench code. Debugging / introspection
   APIs can still see the intermediate state (not a problem; it only needs
   documenting).

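
The recommended "raise on duplicate install" policy from verification point 4 could look roughly like the following. This is a sketch under stated assumptions: only the attribute names `_installed` and `_queue_pairs` come from the ADR, while `PeIpcqInstallState`, `DuplicateInstallError`, and the shape of the neighbor table are hypothetical stand-ins for the real PE_IPCQ component.

```python
class DuplicateInstallError(RuntimeError):
    """Raised when a PE receives a second install message (hypothetical name)."""


class PeIpcqInstallState:
    """Toy stand-in for the install-related state of one PE_IPCQ."""

    def __init__(self, sip: int, cube: int, pe: int) -> None:
        self.coord = (sip, cube, pe)
        self._installed = False
        self._queue_pairs: dict[str, tuple[int, int, int]] = {}

    def install(self, neighbors: dict[str, tuple[int, int, int]]) -> None:
        # Policy: exactly one install per PE; a duplicate is an explicit error.
        if self._installed:
            raise DuplicateInstallError(f"PE {self.coord} is already installed")
        self._queue_pairs = dict(neighbors)
        self._installed = True


pe = PeIpcqInstallState(sip=0, cube=0, pe=0)
first_table = {"next": (1, 0, 0), "prev": (1, 0, 0)}
pe.install(first_table)

try:
    pe.install({"next": (1, 0, 0)})
    duplicate_rejected = False
except DuplicateInstallError:
    duplicate_rejected = True
```

Because the guard runs before any state is touched, a rejected duplicate leaves the first neighbor table intact, which is the property a T3 duplicate-install case would check.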
**Timing impact**: engine-routed install makes `init_process_group()` consume
SimPy time, whereas the previous sideband install was effectively zero-cost.
ADR contract:

> Benchmarks must not rely on zero-cost initialization.
> `init_process_group()` consumes simulated time proportional to the number
> of participating PEs × per-PE install latency. First collective call
> starts at a well-defined but non-zero sim time.

### D3. Launch path: same primitive as non-CCL kernels

**CCL kernels use the same `KernelLaunchMsg` submission path as non-CCL
kernels.** Things like the IO_CPU/M_CPU transit inside the engine are
**existing implementation details, not CCL-specific machinery**. The backend
simply iterates over the plan's `participating_pes` list and submits a
`KernelLaunchMsg` for each entry. No new message type, no new routing path.

```python
# AhbmCCLBackend.all_reduce
def all_reduce(self, tensor, op="sum"):
    if op != "sum":
        raise NotImplementedError(...)
    if tensor._handle is None or not tensor._handle.shards:
        raise RuntimeError(...)

    # Validator: runs against the global handle (D8)
    validator_name = self._merged.get("validator")
    if validator_name:
        resolve_validator(validator_name)(tensor._handle, self._world_size, self.ctx.spec)

    rank = self.ctx.distributed.get_rank()
    plan = self._plans[rank]
    tensor_view = _tensor_slice_for_sip(tensor._handle, plan.sip)

    # Compute kernel args from the plan (host-side)
    import importlib
    mod = importlib.import_module(plan.kernel_module)
    n_elem = tensor_view.shards[0].nbytes // tensor.itemsize
    kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
                            **plan.kernel_config)

    def _submit():
        out = []
        for (cube, pe) in plan.participating_pes:
            h = self.ctx.submit(KernelLaunchMsg(
                correlation_id=self.ctx.correlation_id,
                request_id=f"allreduce_r{rank}_c{cube}p{pe}",
                kernel_ref=KernelRef(name=plan.algorithm_name, kind="builtin"),
                args=(_tensor_arg_for_pe(tensor_view, cube, pe), *kargs),
                target_sips=(plan.sip,),
                target_cubes=(cube,),
                target_pe=pe,
            ))
            out.append(h)
        return out

    self._barrier.submit_and_drain(self.ctx, rank, _submit)
```

### D4. Algorithm ABI: thin, with an explicit arg contract

Each algorithm module is **required to provide only `kernel` and
`kernel_args`**.

```python
# src/kernbench/ccl/algorithms/ring_allreduce.py
def kernel(t_ptr, n_elem, world_size, tl):
    """PE-side kernel code.

    Signature convention: first positional arg is the tensor pointer
    (per-PE slice), subsequent positional args are whatever
    kernel_args() returns. `tl` is injected by the TLContext runtime.
    """

def kernel_args(*, n_elem: int, world_size: int, **kw) -> tuple:
    """Return the tuple of non-tensor positional args.

    Signature contract:
    - Called keyword-only with n_elem and world_size plus kernel_config.
    - Returns a tuple (possibly empty) of scalar / metadata args.
    - The backend constructs the final KernelLaunchMsg.args as:
        (per_pe_tensor_arg, *kernel_args(...))
      where per_pe_tensor_arg is a TensorArg containing only the shards
      local to the receiving PE (derived from tensor_view).
    """
    return (n_elem, world_size)
```

**Arg assembly in backend (reference)**:

```python
# AhbmCCLBackend.all_reduce (excerpt from D3)
kargs = mod.kernel_args(n_elem=n_elem, world_size=plan.world_size,
                        **plan.kernel_config)
for (cube, pe) in plan.participating_pes:
    pe_tensor_arg = _tensor_arg_for_pe(tensor_view, cube, pe)
    self.ctx.submit(KernelLaunchMsg(
        args=(pe_tensor_arg, *kargs),  # tensor first, then kernel_args return
        target_sips=(plan.sip,),
        target_cubes=(cube,),
        target_pe=pe,
        ...
    ))
```

Declarative metadata in **ccl.yaml**:

```yaml
algorithms:
  ring_allreduce_tcm:
    module: kernbench.ccl.algorithms.ring_allreduce
    topology: ring_1d                  # kernbench/ccl/topologies.py
    mapper: leader_only                # kernbench/ccl/mappers.py (new)
    validator: single_shard_per_rank   # kernbench/ccl/validators.py (new)
    buffer_kind: tcm
    n_elem: 8
```

- `topology` (required)
- `mapper` (optional, default `"leader_only"`)
- `validator` (optional)

The algorithm module itself does **not** contain a mapper, validator,
participating_pes, or neighbor generator.

### D5. Mapper + validator: registry key **or** import path

The host-side framework provides a built-in registry. Custom extensions use a
dotted import path.

```python
# src/kernbench/ccl/mappers.py (new)
from typing import Callable

Mapper = Callable[[dict, int], list[tuple[int, int]]]

def leader_only(spec, rank):
    """Single leader PE per SIP. For ring/tree/mesh."""
    return [(0, 0)]

def all_pes(spec, rank):
    """Every PE in the SIP. Used when an algorithm involves all intra-SIP PEs
    (e.g. intra-SIP reduction, intra-SIP broadcast, the lower level of a
    hierarchical collective)."""
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    n_cubes = cm["w"] * cm["h"]
    n_pes = pl["pe_per_corner"] * len(pl["corners"])
    return [(c, p) for c in range(n_cubes) for p in range(n_pes)]

MAPPER_REGISTRY = {"leader_only": leader_only, "all_pes": all_pes}

def resolve_mapper(key_or_path: str) -> Mapper:
    # Built-in registry keys take priority over dotted import paths.
    if key_or_path in MAPPER_REGISTRY:
        return MAPPER_REGISTRY[key_or_path]
    if "." in key_or_path:
        import importlib
        mod_path, fn_name = key_or_path.rsplit(".", 1)
        return getattr(importlib.import_module(mod_path), fn_name)
    raise ValueError(f"unknown mapper: {key_or_path!r}")
```

Validators follow the same pattern (`src/kernbench/ccl/validators.py`). Their
input is the **global TensorHandle** (see D8).

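
Since mappers and validators share the "registry key or dotted import path" rule, the lookup could be factored into one helper. The sketch below is an assumption about how that might look, not the KernBench implementation: `resolve_from_registry` and `always_ok` are hypothetical names, and the stdlib function `os.path.basename` merely stands in for a user-supplied dotted path.

```python
import importlib
from typing import Any, Callable, Mapping


def resolve_from_registry(
    registry: Mapping[str, Callable[..., Any]],
    key_or_path: str,
    kind: str,
) -> Callable[..., Any]:
    """Resolve a built-in key first, then fall back to a dotted import path."""
    if key_or_path in registry:
        return registry[key_or_path]
    if "." in key_or_path:
        mod_path, fn_name = key_or_path.rsplit(".", 1)
        return getattr(importlib.import_module(mod_path), fn_name)
    raise ValueError(f"unknown {kind}: {key_or_path!r}")


def always_ok(handle, world_size, spec) -> None:
    """Toy validator that accepts every layout (illustration only)."""


VALIDATOR_REGISTRY = {"always_ok": always_ok}

by_key = resolve_from_registry(VALIDATOR_REGISTRY, "always_ok", "validator")
# A stdlib function stands in for a custom "package.module.function" path.
by_path = resolve_from_registry(VALIDATOR_REGISTRY, "os.path.basename", "validator")

try:
    resolve_from_registry(VALIDATOR_REGISTRY, "missing", "validator")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```

Checking the registry before testing for a dot means a built-in name always wins, so a custom module cannot silently shadow `leader_only` or a built-in validator.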
### D6. Host-side install plan builder

```python
# src/kernbench/ccl/install_plan.py (new; restructured from the existing install.py)
from dataclasses import dataclass
from typing import Any, Mapping

@dataclass(frozen=True)
class NeighborTableEntry:
    direction: str
    peer_direction: str       # ADR-0025
    peer_sip: int
    peer_cube: int
    peer_pe: int
    rx_base_pa: int
    # ... other IPCQ settings ...

@dataclass(frozen=True)
class PeInstallSpec:
    cube: int
    pe: int
    neighbors: tuple[NeighborTableEntry, ...]

@dataclass(frozen=True)
class SipInstallPlan:
    algorithm_name: str       # human-readable ("ring_allreduce_tcm")
    sip: int
    rank: int
    world_size: int
    pe_installs: tuple[PeInstallSpec, ...]   # per-PE neighbor tables
    buffer_kind: str
    n_slots: int
    slot_size: int
    kernel_module: str
    participating_pes: tuple[tuple[int, int], ...]
    kernel_config: Mapping[str, Any]


def build_install_plans(
    world_size: int,
    algorithm: str,
    algorithm_config: dict,
    spec: dict,
) -> list[SipInstallPlan]:
    """Compose topology + mapper + algorithm into per-SIP plan list."""
    topo_fn = _resolve_topology(algorithm_config["topology"])
    mapper = resolve_mapper(algorithm_config.get("mapper", "leader_only"))

    # kernel_config: algorithm-specific params passed to kernel_args at launch
    kernel_config = {
        k: v for k, v in algorithm_config.items()
        if k in {"n_elem", "reduce_op", "chunk_size"} or k.startswith("kernel_")
    }

    plans = []
    for rank in range(world_size):
        sip = rank  # identity mapping (non-identity mapping is an open question)
        pes = mapper(spec, rank)
        pe_installs = _build_pe_installs(
            rank=rank, world_size=world_size, sip=sip,
            pes=pes, topo_fn=topo_fn, algorithm_config=algorithm_config, spec=spec,
        )
        plans.append(SipInstallPlan(
            algorithm_name=algorithm,
            sip=sip, rank=rank, world_size=world_size,
            pe_installs=pe_installs,
            buffer_kind=algorithm_config["buffer_kind"],
            n_slots=algorithm_config["n_slots"],
            slot_size=algorithm_config["slot_size"],
            kernel_module=algorithm_config["module"],
            participating_pes=tuple(pes),
            kernel_config=kernel_config,
        ))
    return plans
```

`_build_pe_installs` reuses the neighbor computation logic from the existing
`ccl/install.py` (incorporating the `reverse_direction` improvement from
ADR-0025).

**Multi-PE mappers and responsibility for neighbor generation**: when a mapper
returns multiple PEs within a SIP (`all_pes` etc.), the PE-level neighbor
graph is formed inside `_build_pe_installs`. That is, the topology module
provides rank-level relations only, and PE-level connectivity is resolved in
the builder. For algorithms with complex multi-level patterns this split of
responsibility can become a maintenance burden; see ADR-0029 for the related
discussion.

### D7. Epoch-based collective barrier

Cross-rank submit synchronization. Each collective call is an independent
epoch. A duplicate join by the same rank is an immediate error.

```python
# src/kernbench/runtime_api/distributed.py
from dataclasses import dataclass, field

@dataclass
class _EpochState:
    participants: set[int] = field(default_factory=set)
    pending: list = field(default_factory=list)
    drained: bool = False
    returned: int = 0


class _CollectiveBarrier:
    """Epoch-based barrier.

    Contract:
    - Each call joins the earliest non-drained epoch.
    - Each rank may join a given epoch at most once. Duplicate join raises.
    - Last arriver (participants == world_size) performs drain and advances
      _next_epoch. Earlier arrivers yield and re-check drained on resume.
    - Epoch state is GC'd when returned == world_size (success path).
    - On failure paths, residual state is acceptable; reset() clears it.
    """

    def __init__(self, world_size: int):
        self._world_size = world_size
        self._next_epoch = 0
        self._state: dict[int, _EpochState] = {}

    def submit_and_drain(self, ctx, rank: int, submit_fn) -> None:
        epoch = self._next_epoch
        state = self._state.setdefault(epoch, _EpochState())

        if rank in state.participants:
            raise RuntimeError(
                f"rank {rank} attempted duplicate join to epoch {epoch}"
            )
        state.participants.add(rank)

        handles = submit_fn()
        state.pending.extend(handles)

        is_last = len(state.participants) >= self._world_size

        if is_last:
            # Last arriver drains every pending handle for the epoch.
            for h in state.pending:
                ctx.wait(h)
            state.drained = True
            self._next_epoch = epoch + 1
        else:
            # Earlier arrivers yield to the scheduler until the epoch drains.
            from greenlet import getcurrent
            g = getcurrent()
            if g.parent is None:
                raise RuntimeError("barrier requires a bound worker greenlet")
            while not state.drained:
                g.parent.switch()

        state.returned += 1
        if state.returned >= self._world_size:
            self._state.pop(epoch, None)

    def reset(self) -> None:
        """Explicit cleanup on spawn exception unwinding."""
        self._state.clear()
        self._next_epoch = 0
```

### D8. Per-rank tensor view + validator contract

**Validator** (host-side, pre-slice, operating on the global handle):

```python
# src/kernbench/ccl/validators.py
from typing import Callable

Validator = Callable[[TensorHandle, int, dict], None]

def single_shard_per_rank(handle, world_size, spec):
    """Ring family: exactly world_size shards, one per SIP."""
    if len(handle.shards) != world_size:
        raise ValueError(...)
    per_sip = {}
    for s in handle.shards:
        per_sip[s.sip] = per_sip.get(s.sip, 0) + 1
    if any(c != 1 for c in per_sip.values()):
        raise ValueError(...)

def multi_pe_sip_local(handle, world_size, spec):
    """Multi-PE per SIP layout: each SIP holds as many shards as it has
    intra-SIP PEs. Used by algorithms that involve every intra-SIP PE."""
    cm = spec["sip"]["cube_mesh"]
    pl = spec["cube"]["pe_layout"]
    per_sip = cm["w"] * cm["h"] * pl["pe_per_corner"] * len(pl["corners"])
    if len(handle.shards) != world_size * per_sip:
        raise ValueError(...)

VALIDATOR_REGISTRY = {...}
def resolve_validator(key_or_path): ...
```

A validator checks shard-layout invariants across the whole world. The
per-rank view is created by the backend **after** the validator call, via
`_tensor_slice_for_sip`.

**Per-rank tensor view** (SIP-local slice):

```python
def _tensor_slice_for_sip(handle, sip) -> TensorArg:
    sip_shards = [s for s in handle.shards if s.sip == sip]
    if not sip_shards:
        raise RuntimeError(f"tensor has no shards on SIP {sip}")
    # Deterministic ordering contract: (cube, pe, offset_bytes) ascending.
    # Multi-PE mappers (hierarchical etc.) rely on this ordering to align
    # per-PE tensor arg construction with participating_pes enumeration.
    sip_shards.sort(key=lambda s: (s.cube, s.pe, s.offset_bytes))
    min_offset = min(s.offset_bytes for s in sip_shards)
    local_va_base = handle.va_base + min_offset if handle.va_base else 0
    return TensorArg(
        shards=tuple(TensorArgShard(...) for s in sip_shards),
        va_base=local_va_base,
    )
```

**Ordering invariant**: the shards of a slice are in ascending
`(cube, pe, offset_bytes)` order. When the backend iterates
`participating_pes` and builds `_tensor_arg_for_pe(view, cube, pe)`, it can
assume deterministic ordering. This matters especially when the `all_pes`
mapper plus a hierarchical algorithm interprets the combination of per-PE
slices in an order-dependent way.

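
The ordering invariant can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical `Shard` record with only the four fields the sort key needs; the real shard type and `TensorArg` construction are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    """Hypothetical minimal shard record (illustration only)."""
    sip: int
    cube: int
    pe: int
    offset_bytes: int


def sip_local_shards(shards: list[Shard], sip: int) -> list[Shard]:
    """Filter to one SIP and order by (cube, pe, offset_bytes) ascending."""
    local = [s for s in shards if s.sip == sip]
    if not local:
        raise RuntimeError(f"tensor has no shards on SIP {sip}")
    return sorted(local, key=lambda s: (s.cube, s.pe, s.offset_bytes))


shards = [
    Shard(sip=0, cube=1, pe=0, offset_bytes=256),
    Shard(sip=1, cube=0, pe=0, offset_bytes=512),
    Shard(sip=0, cube=0, pe=1, offset_bytes=128),
    Shard(sip=0, cube=0, pe=0, offset_bytes=0),
]

ordered = sip_local_shards(shards, sip=0)
coords = [(s.cube, s.pe) for s in ordered]

# Input order must not matter: reversing the list yields the same view.
ordered_reversed_input = sip_local_shards(list(reversed(shards)), sip=0)
```

Because the result depends only on the shard coordinates, a per-PE lookup that walks `participating_pes` in `(cube, pe)` order sees the shards in the same sequence regardless of how the handle was populated.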
### D9. Greenlet-local rank registry (+ debug warning)

```python
import os
import warnings

class DistributedContext:
    def __init__(self):
        self._backend = None
        self._rank_by_greenlet: dict = {}

    def _bind_rank(self, g, rank: int) -> None:
        self._rank_by_greenlet[g] = int(rank)

    def get_rank(self) -> int:
        self._ensure_initialized()
        from greenlet import getcurrent
        g = getcurrent()
        if g not in self._rank_by_greenlet:
            # Unbound greenlet: single-driver fallback, with an opt-in warning.
            if os.environ.get("KERNBENCH_DEBUG"):
                warnings.warn(
                    "get_rank() called outside a bound greenlet — returning 0. "
                    "Likely a bug unless running single-driver."
                )
            return 0
        return int(self._rank_by_greenlet[g])
```

### D10. `torch.ahbm.set_device(rank)`: SIP binding

KernBench's backend is named `ahbm` (ADR-0023 D10). Real PyTorch uses
`torch.cuda.set_device(r)`, but since this is not CUDA we use an honestly
named namespace.

```python
class _AhbmNamespace:
    """torch.ahbm — per-greenlet SIP device binding.

    Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. Since
    KernBench's backend is 'ahbm' (not CUDA), we expose the equivalent
    API under ``torch.ahbm`` to avoid pretending to be a CUDA runtime.
    """

    def __init__(self):
        self._device_by_greenlet: dict = {}

    def set_device(self, device: int) -> None:
        from greenlet import getcurrent
        self._device_by_greenlet[getcurrent()] = int(device)

    def current_device(self) -> int | None:
        from greenlet import getcurrent
        return self._device_by_greenlet.get(getcurrent())

# Attached to RuntimeContext as `self.ahbm = _AhbmNamespace()`.
# Bench code: `torch.ahbm.set_device(rank)` mirrors `torch.cuda.set_device`.
```

**PyTorch 2.x style supported in parallel**: recent PyTorch is moving toward
the device-agnostic `torch.accelerator` namespace
(`torch.accelerator.set_device_index(r)`,
`torch.accelerator.current_device_index()`). For users who want to write code
that is not tied to a device vendor, KernBench supports this surface as well.

```python
class _AcceleratorNamespace:
    """torch.accelerator — device-agnostic API (PyTorch 2.x style).

    Aliases torch.ahbm for bench code that prefers device-neutral idiom:
        torch.accelerator.set_device_index(rank)
        torch.accelerator.current_device_index()
    """

    def __init__(self, ahbm: _AhbmNamespace):
        self._ahbm = ahbm

    def set_device_index(self, device: int) -> None:
        self._ahbm.set_device(device)

    def current_device_index(self) -> int | None:
        return self._ahbm.current_device()

# RuntimeContext
self.ahbm = _AhbmNamespace()
self.accelerator = _AcceleratorNamespace(self.ahbm)   # alias
```

Bench authors pick either one; both share the same registry internally:

```python
torch.ahbm.set_device(rank)               # KernBench-native, explicit backend
torch.accelerator.set_device_index(rank)  # PyTorch 2.x device-agnostic
```

### D11. Tensor placement = structural (sip, cube, pe) coordinates

`resolve_dp_policy` takes `target_sip` directly and produces a placement in
structural coordinates. Details are in ADR-0026.

```python
# RuntimeContext._create_tensor
current_sip = self.ahbm.current_device()   # (D10 naming)
if current_sip is None:
    current_sip = 0  # single-driver fallback (consistent with D9)
placement = resolve_dp_policy(
    dp, shape=shape_2d, itemsize=itemsize,
    num_pe=eff_num_pe, num_cubes=eff_num_cubes,
    target_sip=current_sip,
)
```

Post-hoc `pe_index` shifting is removed; ShardSpec holds structural
`(sip, cube, pe)` coordinates.

### D12. `torch.multiprocessing.spawn`-compat surface

The bench-author surface is identical to real PyTorch `mp.spawn`:

```python
# src/kernbench/runtime_api/multiprocessing.py (new)
from types import SimpleNamespace

def spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method="spawn"):
    """Drop-in for torch.multiprocessing.spawn.
    Internal: greenlet fan-out + epoch-barrier sync + exception propagation.
    """
    ...

# Attach to the torch namespace
torch.multiprocessing = SimpleNamespace(spawn=spawn)
```

Bench:

```python
import torch.multiprocessing as mp
mp.spawn(worker, nprocs=world_size, args=(world_size, torch))
```

### D13. Scheduler + exception handling

```python
def spawn(fn, args, nprocs, ...):
    dist = torch.distributed
    gs: list[greenlet] = []
    errors: dict[int, Exception] = {}

    for rank in range(nprocs):
        def _entry(r=rank):
            try:
                fn(r, *args)
            except Exception as e:
                errors[r] = e
                raise
        g = greenlet(_entry)
        dist._bind_rank(g, rank)
        gs.append(g)

    try:
        # Round-robin over live greenlets until all have finished.
        while True:
            alive = [g for g in gs if not g.dead]
            if not alive:
                break
            for g in alive:
                if not g.dead:
                    g.switch()
    except Exception as outer:
        # Forcibly terminate the remaining workers.
        for other in gs:
            if not other.dead:
                try:
                    other.throw(SystemExit)
                except Exception:
                    pass
        # Explicit cleanup of epoch barrier state
        backend = getattr(dist, "_backend", None)
        if backend is not None and hasattr(backend, "_barrier"):
            backend._barrier.reset()
        raise SpawnException(errors) from outer
```

**Scheduler contract**:
- Deterministic round-robin over insertion order (rank 0, 1, ..., N-1).
- The epoch barrier (D7) is the only synchronization point. No correctness
  property depends on scheduler order.
- On an exception, the other greenlets are forcibly terminated and a
  `SpawnException` is propagated.

**Starvation guideline**:
- In general the collective barrier keeps workers in sync, so there is no
  large skew.
- For extreme non-collective loops, a cooperative yield is provided:
  `torch.distributed.cooperative_yield()`.

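
The round-robin contract can be demonstrated without the greenlet dependency. The sketch below swaps in plain Python generators as a stand-in for greenlets (an assumption made purely so the example is self-contained); `run_round_robin` and `worker` are hypothetical names, and each `yield` plays the role of a cooperative yield point.

```python
from typing import Generator, List, Tuple


def run_round_robin(workers: List[Generator[None, None, None]]) -> None:
    """Resume each live worker once per pass, in insertion (rank) order."""
    alive = list(workers)
    while alive:
        still_alive = []
        for w in alive:
            try:
                next(w)  # analogous to g.switch() in the greenlet version
                still_alive.append(w)
            except StopIteration:
                pass  # this worker has finished
        alive = still_alive


trace: List[Tuple[int, int]] = []


def worker(rank: int, steps: int) -> Generator[None, None, None]:
    for step in range(steps):
        trace.append((rank, step))
        yield  # cooperative yield point


# Three ranks with different amounts of work.
run_round_robin([worker(0, 2), worker(1, 3), worker(2, 1)])
```

The recorded trace interleaves ranks in insertion order on every pass, and finished workers simply drop out, so the schedule is deterministic even when ranks do unequal amounts of work.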
### D14. Backward compatibility

1. **Single-driver calls**: `get_rank()` returns 0 (D9).
2. **`ccl.yaml` world_size override**: bypasses the D1 fallback; usable as the
   legacy "rank = PE" test path.
3. **Explicit `DPPolicy.sip="column_wise"`**: ADR-0026 scope.
4. **`install_ipcq()` compatibility wrapper**:

The existing `install_ipcq()` API in `ccl/install.py` is not removed right
away. It remains as a thin compatibility wrapper so that existing direct
callers can migrate incrementally.

```python
# src/kernbench/ccl/install.py (after this ADR)
def install_ipcq(engine, spec, merged, *, algo_module=None, rank_to_pe=None):
    """DEPRECATED: legacy host-side PE installer.

    Internally delegates to build_install_plans + engine-routed IpcqInitMsg.
    Use dist.init_process_group() instead.
    """
    from kernbench.ccl.install_plan import build_install_plans
    import warnings
    warnings.warn(
        "install_ipcq() is deprecated; use dist.init_process_group()",
        DeprecationWarning, stacklevel=2,
    )
    plans = build_install_plans(
        world_size=merged.get("world_size", 1),
        algorithm=merged["algorithm"],
        algorithm_config=merged,
        spec=spec,
    )
    handles = []
    for plan in plans:
        for pe_install in plan.pe_installs:
            h = engine.submit(IpcqInitMsg(
                target_sips=(plan.sip,),
                target_cubes=(pe_install.cube,),
                target_pe=pe_install.pe,
                entries=pe_install.neighbors,
                buffer_kind=plan.buffer_kind,
                n_slots=plan.n_slots,
                slot_size=plan.slot_size,
            ))
            handles.append(h)
    for h in handles:
        engine.wait(h)
    return {"world_size": merged.get("world_size", 1), "plans": plans}
```

Migration schedule:
- Phase 1: keep as a wrapper + DeprecationWarning
- Phase 2: grep-audit direct callers → migrate each to
  `dist.init_process_group()` or direct use of `build_install_plans()`
- Phase 3: remove the wrapper (separate cleanup ADR or PR)

---

## Dependencies

- **ADR-0023** (IPCQ): the `IpcqInitMsg` message type and PE_IPCQ handling
  are used as is. Switching to engine-routed submit is the only change.
- **ADR-0025** (IPCQ direction fix): required for the neighbor computation in
  `_build_pe_installs` to work correctly on e.g. a 2-rank ring.
- **ADR-0003 / 0016** (IO_CPU): IO_CPU keeps its existing transit role. This
  ADR does not change the IO_CPU role.

---

## Non-goals

- **IPCQ protocol changes**: ADR-0023 stays as is.
- **DPPolicy field cleanup**: ADR-0026.
- **Megatron-style TP**: ADR-0027.
- **Multi-node (cross-process)**: single process only.
- **Adopting the principle of IO_CPU as the single SIP control-plane
  endpoint**: outside this ADR's scope. KernBench has no such principle today;
  introducing it is a separate ADR.
- **Hierarchical all-reduce algorithm design**: ADR-0029. This ADR only
  provides the framework infrastructure that algorithm will use (`all_pes`
  mapper, `multi_pe_sip_local` validator, registry extension points).

---

## Open questions

### 🟡 Nice-to-have: scope-boundary items

- **Install timing tolerance**: install consumes a few ns to µs of SimPy
  time, whereas the old sideband path took 0 ns. Check whether existing tests
  assume a t=0 start (tests may need correcting depending on the audit
  result).

- **Batching `IpcqInitMsg`**: a `target_pe="all"` broadcast like MmuMapMsg's
  is unsuitable for IPCQ (neighbors differ per PE). Currently each PE gets an
  individual submit. A batched IpcqInitMsg type carrying per-PE payloads is a
  future optimization.

- **`_rank_to_sip` mapping**: currently identity. A non-trivial mapping, if
  required, is separate work.

- **Location of the cooperative yield API**: planned to be exposed as
  `torch.distributed.cooperative_yield()`. Whether it is actually needed will
  be judged when benches are added after Phase 2.

(For the mid-to-long-term direction on unifying PE-level topology, see
**ADR-0029**: a framework evolution driven by complex multi-level algorithms.)

---
|
|
||||||
|
|
||||||
## Consequences

### Positive

- **Zero new message types**: implemented with the existing `IpcqInitMsg` and
  `KernelLaunchMsg` only.
- **No IO_CPU / engine changes**: existing routing is unchanged.
- **Sideband install convention removed**: unified on the same pattern as
  MmuMapMsg and similar messages.
- **Stale plan state problem disappears**: the plan is owned solely by the host.
- **Bench = real PyTorch DDP** (from the public API perspective).
- **Lightweight algorithm ABI**: only `kernel` + `kernel_args` are required.
- **Epoch-based barrier**: safe for interleaved collectives.
- **Control/data plane separation**: the data plane (PE_IPCQ) stays as defined in
  ADR-0023; the control plane is host-driven.
- Long-term extensibility: a foundation for Megatron TP and DTensor.

### Negative

- New modules: `install_plan.py`, `mappers.py`, `validators.py`,
  `multiprocessing.py`.
- Whether the engine can route `IpcqInitMsg` through the engine path must be
  confirmed during implementation (a minor hook may be needed).
- Install consumes SimPy time (arguably a positive, but any test that assumes the
  existing sideband 0 ns timing needs correcting).

### Neutral

- IPCQ PE-level protocol (ADR-0023) unchanged.
- `DPPolicy` field changes are covered by ADR-0026.
- IO_CPU role unchanged (existing transit role).
+6
-6
@@ -23,7 +23,7 @@ class DPPolicy:
 """Intra-device (cube × PE) data-parallel policy.

 SIP-level placement is controlled by ``torch.ahbm.set_device(rank)``
-(ADR-0024 D10) and, for model-level TP, by Megatron-style parallel
+(ADR-0024 D3) and, for model-level TP, by Megatron-style parallel
 layers (ADR-0027). DPPolicy does not cross SIP boundaries.
 """
 cube: Literal["replicate", "column_wise", "row_wise"] = "replicate"
@@ -37,7 +37,7 @@ class DPPolicy:
 ### D2. `ShardSpec`: structural (sip, cube, pe) coordinates, `pe_index` removed entirely

 Currently `ShardSpec.pe_index` is a **global flat index** (`sip × cubes × pes + cube ×
-pes + pe`). This is the form that ADR-0024 D11 flagged as "abstraction leakage".
+pes + pe`). This is the form that ADR-0024 D4 flagged as "abstraction leakage".

 This ADR **redefines ShardSpec in structural coordinates**, and `pe_index` is
 **not kept**, not even as a property:
@@ -73,7 +73,7 @@ class ShardSpec:

 ### D3. `resolve_dp_policy` takes `target_sip` and produces structural coordinates

-Implements the ADR-0024 D11 contract. No post-hoc shifting.
+Implements the ADR-0024 D4 contract. No post-hoc shifting.

 ```python
 # src/kernbench/policy/placement/dp.py (after)
@@ -135,14 +135,14 @@ def resolve_dp_policy(

 ### D4. `_create_tensor`: direct placement with structural coordinates

-Continuation of ADR-0024 D11. Post-hoc shifting is removed; structural coordinates are
+Continuation of ADR-0024 D4. Post-hoc shifting is removed; structural coordinates are
 specified directly at the `resolve_dp_policy` call site.

 ```python
 # context.py _create_tensor (after)
 current_sip = self.ahbm.current_device()
 if current_sip is None:
-    # Single-driver fallback (consistent with ADR-0024 D9).
+    # Single-driver fallback (consistent with ADR-0024 D2).
     # If launcher-based code forgets set_device(), tensors silently land on
     # SIP 0, so warn in debug mode.
     if os.environ.get("KERNBENCH_DEBUG"):
@@ -267,7 +267,7 @@ KernBench is an in-house project with limited call sites, so at once
 - **Clear conceptual separation**: DPPolicy = intra-device, TP = inter-device.
 - **Simpler API**: DPPolicy constructor fields reduced by ~33%.
 - **Consistent structural coordinates**: ShardSpec is expressed as a `(sip, cube, pe)` tuple →
-  abstraction leakage resolved (satisfies the ADR-0024 D11 contract).
+  abstraction leakage resolved (satisfies the ADR-0024 D4 contract).
 - **Unambiguous `pe_index` meaning**: SIP-local is the only interpretation. If a global flat index is needed, say so explicitly.
 - **Consistent launcher model**: the "1 worker per SIP" model of ADR-0024 is the only
   mechanism that controls SIP boundaries.
@@ -2,9 +2,7 @@

 ## Status

-Accepted (Revision 7: makes explicit the resume invariant, the non-recursive main-context wait invariant,
-the global barrier over-serialization tradeoff, and TP forward yield-safety;
-2026-04-14)
+Accepted

 ## Context

@@ -166,9 +164,9 @@ while alive:
 - The implementation does **not need to detect** this (timeouts, steps-since-yield counters,
   and so on). It is a user contract, and the symptom of a violation is a "simulation hang".
 - **Future extension**: if long non-collective compute paths become common,
-  the `torch.distributed.cooperative_yield()` primitive from ADR-0024 D13 (an explicit
-  no-op yield) can be introduced. Out of scope for this ADR. Not a breaking change:
-  it can be added when needed.
+  an explicit `torch.distributed.cooperative_yield()` primitive (no-op yield) can be
+  introduced. Out of scope for this ADR. Not a breaking change: it can be added when
+  needed.
 - Within a round, every alive worker receives `switch` exactly once. Even if one worker
   calls wait several times within a single round, the waits are enqueued sequentially within
   that turn and then processed in one batch by a single scheduler drain (FIFO).
@@ -183,7 +181,7 @@ while alive:
 - **The two queues have different dependency sources**: worker waits are handles that the worker
   itself created through a `submit + wait` pair (tensor deploy, MmuMap, and so on). The collective
   queue holds kernel launch handles that `dist.all_reduce` enqueued internally, and
-  the worker does not wait on them directly (ADR-0024 D7).
+  the worker does not wait on them directly (see the two-queue drain model in D0.5).
 - **Independent from a correctness standpoint**: from the worker's point of view a collective is
   "already submitted, then yielded". Its completion only has to precede the worker's next action.
   There is no ordering dependency on the worker wait queue.
@@ -206,7 +204,7 @@ while alive:
 as an index, or checking `h not in pending_set` before append) is possible. This is
 classified as an optimization that does not change correctness.

-4. **Exception propagation + sibling cleanup (adopts the ADR-0024 D13 approach)**.
+4. **Exception propagation + sibling cleanup**.
 When a worker greenlet raises, `g.switch()` propagates the exception to main.
 The scheduler loop stops immediately and **explicitly** performs the following cleanup:

@@ -581,7 +579,7 @@ The weight/output representation of TP layers clearly separates two concepts:

 | Concept | Decided by | Scope |
 |---|---|---|
-| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D9/D10) | **cross-rank, cross-SIP** |
+| **TP shard ownership** (which rank owns which slice of the weight) | greenlet-local rank + `torch.ahbm.set_device(rank)` (ADR-0024 D2/D3) | **cross-rank, cross-SIP** |
 | **Intra-rank placement** (how the owned slice is distributed across cube × PE inside the rank) | `DPPolicy(cube=..., pe=...)` (ADR-0026) | **within one rank (inside the SIP boundary)** |

 Therefore `ColumnParallelLinear`, with a `(in_features, out_features // ws)` shape,
@@ -825,40 +823,11 @@ switching the strict-xfail cases to **PASS** once this ADR is implemented

 ## Dependencies

-- **ADR-0024** (launcher): rank = SIP, greenlet-local rank, `dist.all_reduce`,
-  `torch.ahbm.set_device(rank)`. D0/D1 of this ADR extend this infrastructure.
+- **ADR-0024** (launcher): rank = SIP, greenlet-local rank,
+  `torch.ahbm.set_device(rank)`.
 - **ADR-0026** (DPPolicy intra-device): per-rank slice representation of weight tensors.
 - **ADR-0023 / ADR-0025** (IPCQ): the basis of the `dist.all_reduce` implementation.

-### Supersedes (partial)
-
-The following sections of ADR-0024 are **designs that were never implemented**, and this ADR
-replaces them with a simpler model:
-
-- **ADR-0024 D7 (`_CollectiveBarrier.submit_and_drain`)**: an epoch-based
-  last-arriver-drains pattern. Problem: the last arriver calls `ctx.wait` **from the worker
-  context** and drives env.run, which reproduces the orphan cause that D0.2 is meant to
-  prevent. This ADR's **D0.4 two-queue drain** (main drains after all workers have yielded)
-  provides the same invariant, "no rank's collective makes progress until every rank has
-  finished submitting", in a **worker-safe** way. The `_CollectiveBarrier` class is
-  not implemented.
-- **ADR-0024 D12/D13 (`spawn_workers` skeleton)**: the signature / scheduler
-  loop / exception handling design. **D1** of this ADR redefines it with a signature that
-  matches the real PyTorch API (`spawn(fn, args, nprocs)`) and performs the D0 scheduler
-  drain in a single place. The exception cleanup from ADR-0024 D13 (siblings
-  `throw(SystemExit)` + `SpawnException` wrapping) is absorbed into this ADR as is
-  (see D0.4-(4)).
-
-The current implementation never landed any of ADR-0024 D7/D12/D13, so superseding them
-carries no migration cost. Adding a "superseded by ADR-0027 D0/D1" note to
-`docs/adr/ADR-0024` later is enough to keep the two consistent.
-
-**Source of truth (normative, for implementers)**: the implementation reference for worker
-scheduling / collective drain / spawn / exception cleanup is **ADR-0027 D0/D1**. Do not
-consult the pseudocode / contracts / signatures of ADR-0024 D7/D12/D13 when implementing.
-Whenever the two ADRs reach different conclusions, ADR-0027 always takes precedence.
-Reviewers judge PRs by the same rule.
-
 ---

 ## Non-goals
+1
-1
@@ -146,7 +146,7 @@ At each `dist.all_reduce(tensor)` call:
 3. Appends `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` where
    `sip_rank` is the current greenlet's bound rank.
 4. Launches with `_defer_wait=True`; the main scheduler drains pending
-   handles after all workers submit (per ADR-0024 D7 / ADR-0027 D0.4).
+   handles after all workers submit (per ADR-0027 D0.4).

 ### D6. Config schema

+7
-29
@@ -10,7 +10,7 @@ The simulator is an analytical, event-driven performance model — not a
 cycle-accurate or RTL-level simulator. Many real-HW effects are approximated
 or omitted by design. To keep the model auditable and reviewable as a whole,
 this ADR consolidates the assumptions in one place. Individual component ADRs
-(ADR-0015, ADR-0019, ADR-0004) define the *mechanisms*; this document defines
+(ADR-0015, ADR-0017, ADR-0004) define the *mechanisms*; this document defines
 the *limits of fidelity*.

 ## Decisions
@@ -21,7 +21,7 @@ the *limits of fidelity*.
 ADR-0015 D2.
 - **Per-component switching/overhead latency** (`overhead_ns` attr).
 - **HBM per-pseudo-channel parallelism** via stateless `pc_avail[N]` array
-  with global round-robin chunking. Burst granularity tunable
+  with address-based PC selection (ADR-0034 D3). Burst granularity tunable
   (`burst_bytes`, default 256B). Read and write share each PC's
   `available_at` (real HW command bus is per-PC shared).
 - **HBM direction switching penalty mechanism**: per-PC last-direction
@@ -66,8 +66,8 @@ the *limits of fidelity*.
 ### D3. Ignored (out of scope)

 - Bank-level row buffer conflict penalty (assume no conflicts, the best case;
-  round-robin chunk assignment is address-blind so we cannot detect same-bank
-  reuse).
+  the model has no per-bank state within a PC, so same-bank reuse cannot be
+  detected).
 - HBM tRP / tRCD / tFAW / tRC timing constraints (absorbed into the steady-state
   `burst_time = burst_bytes / pc_bw_gbs`).
 - Refresh, ECC, thermal throttling, power gating.
@@ -110,29 +110,6 @@ below are different concerns, ordered by expected workload impact.

 **Higher impact (workload accuracy gap)**:

-- [ ] **Address-based PC selection at HBM CTRL** (replace the
-      address-blind global round-robin). Compute the PC index from
-      the HBM byte offset using parameters already in topology config:
-
-          pc_shift = log2(burst_bytes)    # default 8 (burst=256B)
-          pc_mask  = num_pcs - 1          # default 7 (8 PCs)
-          pc       = (hbm_offset >> pc_shift) & pc_mask
-
-      For the default `burst_bytes=256, num_pcs=8` this places the PC
-      select field at HBM byte-offset bits **[10:8]**: bits [7:0] are
-      the within-burst offset (same PC), bits [10:8] are the 3-bit PC
-      index, and bits [36:11] are row/bank/column within the PC slice.
-      Shift/mask are derived from topology config rather than hardcoded
-      so alternative `(burst_bytes, num_pcs)` pairs stay consistent.
-      See `src/kernbench/policy/address/phyaddr.py` for the canonical
-      comment.
-
-      Real-HW workloads where this matters most: (a) strided
-      multi-transaction streams that under global-RR collide on the same PCs
-      but under address-striping land on disjoint sets; (b)
-      offset-disjoint parallel transfers where address-striping preserves
-      parallelism while global-RR re-serializes them. Directly affects
-      multi-PE concurrent HBM workload latencies.
 - [ ] **Bank-level conflict modeling** within a PC (opt-in via
       `track_banks: true`). Currently we assume no same-bank reuse;
       random scatter/gather workloads are optimistic here.
@@ -169,7 +146,7 @@ below are different concerns, ordered by expected workload impact.
 touching latency must update the relevant section here.
 - Workload-specific magnitude error envelopes are explicit.
 - Builder-side derivation of `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
-  enforces the ADR-0019 D9 invariant in code rather than relying on yaml
+  enforces the ADR-0017 D8 invariant in code rather than relying on yaml
   manual consistency.
 - Wire transfer time is charged once per bottleneck-link transit (Phase 2c
   per-flit timing) rather than via terminal `drain_ns` injection. Single
@@ -180,5 +157,6 @@ below are different concerns, ordered by expected workload impact.
 ## Cross-references

 - ADR-0015: component / port / wire model.
-- ADR-0019: NoC and local HBM topology.
+- ADR-0017: Cube NOC architecture and HBM connectivity.
 - ADR-0004: memory semantics, local HBM.
+- ADR-0034: HBM controller internal design.
@@ -0,0 +1,271 @@
|
|||||||
|
# ADR-0034: HBM Controller Internal Design
|
||||||
|
|
||||||
|
## Status
|
||||||
|
|
||||||
|
Accepted
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
`HbmCtrlComponent` is the per-PE HBM partition endpoint at the leaf of
|
||||||
|
the cube NOC. One instance is created per PE under the topology node
|
||||||
|
`sip{S}.cube{C}.hbm_ctrl.pe{idx}` and attaches to that PE's router
|
||||||
|
(ADR-0017 D4). The component models per-pseudo-channel (PC) scheduling,
|
||||||
|
burst-granular commit timing, address-based PC selection, and response
|
||||||
|
routing back to the requester.
|
||||||
|
|
||||||
|
This ADR documents the component as currently implemented. ADR-0017 D4/D8
|
||||||
|
defines *where* HBM CTRL attaches and *what* aggregate BW it must
|
||||||
|
deliver. ADR-0033 D1/D2 defines *what fidelity* of HBM modelling is in
|
||||||
|
scope. This ADR fills the gap between those two — the per-instance
|
||||||
|
internal scheduling model.
|
||||||
|
|
||||||
## Decision

### D1. Role

`HbmCtrlComponent` is a per-PE HBM partition endpoint. One instance per
PE (default 8 per cube, set by `cube.memory_map.hbm_slices_per_cube`)
attaches to that PE's router via the `peX.hbm` attachment list in
`cube_mesh.yaml` (ADR-0017 D4). In the default n:1 channel mapping
(ADR-0017 D8) the instance aggregates `channels_per_pe` pseudo-channels
into one endpoint.

The component models:

- Per-PC scheduling (D2) with R/W command-bus sharing.
- Address-based PC selection (D3).
- Burst-granular commit timing (D4).
- Flit-aware per-flit PC commit and async finalize (D5, D6).
- Command-only Transaction handling for read-data drain (D7).
- Response routing back to the requester (D8).

It does not model:

- Bank-level row-buffer conflicts, refresh, ECC, thermal throttling
  (ADR-0033 D3).
- Cross-PE HBM contention beyond its own router edge (handled by the
  router mesh; see ADR-0017 D3).
- 1:1 channel mode (ADR-0017 D8 future work).

### D2. Per-PC scheduling model

Per-instance state initialised in `start()`:

- `_pc_avail: list[float]`: earliest sim-time each PC is free; length
  `num_pcs`, initial 0.0.
- `_pc_last_dir: list["R"|"W"|None]`: direction of the last commit on
  each PC, used for switch-penalty detection (D4); initial `None`.

`num_pcs` and `burst_bytes` must each be a positive power of two so
that address-based PC selection (D3) reduces to a shift-and-mask.

Read and write requests share the same `_pc_avail` slot per PC. The
real HW per-PC command bus is shared between read and write traffic, so
issuing a write to PC k blocks a subsequent read to PC k by exactly the
burst time.

Direction `dir` for a request is inferred from the request type:

- `MemoryWriteMsg` → `"W"`.
- `PeDmaMsg` with `is_write=True` → `"W"`.
- All others (`MemoryReadMsg`, `PeDmaMsg` read) → `"R"`.

### D3. Address-based PC selection

PC index for an access is derived from the access address by shift and
mask:

```text
pc_shift = log2(burst_bytes)    # default 8 (burst=256B)
pc_mask  = num_pcs - 1          # default 7 (8 PCs)
pc       = (address >> pc_shift) & pc_mask
```

Computed once in `start()` from topology config so alternative
`(burst_bytes, num_pcs)` pairs stay consistent. For the canonical
default `(256, 8)` this places the PC select field at bits `[10:8]` of
the HBM byte offset: bits `[7:0]` are within-burst (same PC), bits
`[10:8]` are the 3-bit PC index, bits `[36:11]` are row/bank/column
within the PC slice (see `phyaddr.py` comment).

Address-based striping, as opposed to address-blind global
round-robin, preserves PC parallelism for offset-disjoint concurrent
transfers: each transfer's bursts land deterministically on the PC set
implied by its byte addresses, so multi-PE workloads accessing disjoint
regions do not collide on a single PC.

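
The shift-and-mask rule can be sketched in a few lines of Python. This is an illustration only: `pc_params` and `pc_for_address` are hypothetical helper names, not the component's actual methods, and the power-of-two check mirrors the constraint stated in D2.

```python
def pc_params(burst_bytes: int, num_pcs: int) -> tuple[int, int]:
    """Derive (pc_shift, pc_mask); both inputs must be positive powers of two."""
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two, got {value}")
    # For a power of two, bit_length() - 1 equals log2(value).
    return burst_bytes.bit_length() - 1, num_pcs - 1


def pc_for_address(address: int, pc_shift: int, pc_mask: int) -> int:
    """Select the pseudo-channel from the HBM byte offset."""
    return (address >> pc_shift) & pc_mask


shift, mask = pc_params(256, 8)
# Consecutive 256 B bursts walk PCs 0..7 and then wrap around.
pcs = [pc_for_address(i * 256, shift, mask) for i in range(10)]
```

With the default `(256, 8)` pair, a contiguous transfer touches each PC in turn, which is what lets offset-disjoint transfers spread across PCs instead of piling onto one.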
### D4. Burst granularity and PC commit timing

A single PC commit takes:

```text
chunk_time = burst_bytes / pc_bw_gbs    # ns
```

- `burst_bytes` (default 256) is the burst granularity matching the
  flit size (ADR-0033 D1).
- `pc_bw_gbs` is **builder-derived** from
  `hbm_to_router_bw_gbs / num_pcs` (`topology/builder.py`), enforcing
  the ADR-0017 D8 invariant that aggregate per-PE BW equals the
  router-to-HBM link BW.

Per-PC commit scheduling for an arriving access on PC `pc` with
direction `dir`:

```text
switch_cost      = switch_penalty_ns
                   if pc_last_dir[pc] not in (None, dir) else 0
start            = max(env.now, pc_avail[pc]) + switch_cost
finish           = start + chunk_time
pc_avail[pc]     = finish
pc_last_dir[pc]  = dir
```

Default `switch_penalty_ns = 0`: the Tier 0 assumption that an ideal HBM
scheduler amortises R/W switching cost (ADR-0033 D2). Non-zero values
model pessimistic per-alternation cost.

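
As a worked instance of the schedule above, the following sketch keeps the two per-PC arrays in a small class and takes the current time as an explicit argument instead of reading `env.now`. `PcState` and `commit` are hypothetical names; the defaults come from the D9 table, and the 2 ns switch penalty is chosen only to make the effect visible.

```python
from dataclasses import dataclass, field


@dataclass
class PcState:
    num_pcs: int = 8
    burst_bytes: int = 256
    pc_bw_gbs: float = 32.0          # 1 GB/s is numerically 1 byte per ns
    switch_penalty_ns: float = 0.0
    pc_avail: list = field(init=False)
    pc_last_dir: list = field(init=False)

    def __post_init__(self):
        self.pc_avail = [0.0] * self.num_pcs
        self.pc_last_dir = [None] * self.num_pcs

    def commit(self, now: float, pc: int, direction: str) -> float:
        """Schedule one burst on `pc` and return its finish time in ns."""
        chunk_time = self.burst_bytes / self.pc_bw_gbs
        switching = self.pc_last_dir[pc] not in (None, direction)
        switch_cost = self.switch_penalty_ns if switching else 0.0
        start = max(now, self.pc_avail[pc]) + switch_cost
        finish = start + chunk_time
        self.pc_avail[pc] = finish
        self.pc_last_dir[pc] = direction
        return finish


state = PcState(switch_penalty_ns=2.0)
first = state.commit(0.0, 3, "W")    # idle PC: 0 + 256/32 = 8 ns
second = state.commit(0.0, 3, "R")   # same PC, W→R switch: 8 + 2 + 8 = 18 ns
third = state.commit(0.0, 4, "R")    # different PC, unaffected: 8 ns
```

The second commit shows both effects at once: it queues behind the write on the shared `pc_avail` slot and pays the direction-switch penalty, while the commit on PC 4 proceeds in parallel.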
### D5. Flit-aware per-flit PC commit (primary path)

`_handle_flit` is the primary worker path. For each arriving `Flit`:

1. On the **first** flit of a transaction (`tid = id(txn)` not in
   `_txn_state`):
   - Apply `overhead_ns` once via `run(env, nbytes)` (header decode
     model, first-flit overhead pattern; ADR-0033 D1).
   - Initialise `_txn_state[tid] = {"last_finish": env.now}`.
2. Compute `pc = _pc_for_address(flit.address)` (D3).
3. Apply the per-PC schedule (D4) using the request direction (D2).
4. Update `state["last_finish"] = max(state["last_finish"], finish)`.
5. If `flit.is_last`: pop `_txn_state[tid]` and spawn `_finalize_txn`
   (D6).

Per-flit address-aware commit is the mechanism that lets concurrent
multi-PE traffic to disjoint HBM offsets pipeline through distinct PCs
in parallel.

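
The numbered steps can be traced with a simplified, SimPy-free model. It is a sketch under stated assumptions: `Flit` and `FlitCommitModel` are hypothetical stand-ins, `overhead_ns` and the switch penalty are taken as 0, the transaction is identified by an explicit `txn_id` rather than `id(txn)`, and the time of arrival is passed in by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    txn_id: int
    address: int
    is_last: bool


class FlitCommitModel:
    def __init__(self, num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0):
        self.pc_shift = burst_bytes.bit_length() - 1
        self.pc_mask = num_pcs - 1
        self.chunk_time = burst_bytes / pc_bw_gbs
        self.pc_avail = [0.0] * num_pcs
        self.txn_state = {}

    def handle_flit(self, now, flit):
        """Commit one flit; return last_finish on the last flit, else None."""
        # Step 1: first flit of a transaction creates its state entry.
        state = self.txn_state.setdefault(flit.txn_id, {"last_finish": now})
        # Steps 2-3: pick the PC from the address and apply the PC schedule.
        pc = (flit.address >> self.pc_shift) & self.pc_mask
        finish = max(now, self.pc_avail[pc]) + self.chunk_time
        self.pc_avail[pc] = finish
        # Step 4: track the latest commit finish for this transaction.
        state["last_finish"] = max(state["last_finish"], finish)
        # Step 5: on the last flit, drop the state and hand off last_finish.
        if flit.is_last:
            return self.txn_state.pop(flit.txn_id)["last_finish"]
        return None


model = FlitCommitModel()
a = model.handle_flit(0.0, Flit(1, 0x000, True))   # PC 0, idle
b = model.handle_flit(0.0, Flit(2, 0x100, True))   # PC 1, idle
c = model.handle_flit(0.0, Flit(3, 0x800, True))   # PC 0 again, queues behind a
mid = model.handle_flit(0.0, Flit(4, 0x200, False))
```

Transactions 1 and 2 arrive together and hit different PCs, so both finish at 8 ns; transaction 3 maps back to PC 0 and finishes at 16 ns. The returned value is what a finalize step would wait on.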
### D6. Async finalize per transaction

When a transaction's last flit has been scheduled, finalisation runs in
a separately-spawned process:

```python
def _finalize_txn(env, txn, last_finish):
    wait = last_finish - env.now
    if wait > 0:
        yield env.timeout(wait)
    yield from _send_response(env, txn)
```

`_handle_flit` spawns this via `env.process(...)` and returns
immediately, so the worker can pick up the next inbox message while the
last PC commit drains.

Without this split (i.e. if the worker itself did
`yield env.timeout(wait)`), concurrent single-flit transactions whose
addresses hit distinct PCs would still serialise at `chunk_time` each
inside the worker, hiding the PC parallelism that D3 and D5 are
designed to expose.

### D7. Non-flit fallback for command-only transactions

`_handle_txn` runs when the inbox delivers a `Transaction` rather than a
`Flit`. This is the path for command-only requests that the wire does
not chunk into flits, most notably `MemoryReadMsg` whose command txn
carries `nbytes=0` (data drain is modelled at HBM CTRL post-processing,
not as inbound flits).

Procedure:

1. `work_bytes = txn.nbytes if txn.nbytes > 0 else int(request.nbytes or 0)`;
   for read commands, work is sized by the request.
2. `n_chunks = ceil(work_bytes / burst_bytes)` if `work_bytes > 0` else
   0.
3. `chunk_interval = drain_ns / n_chunks` (when both > 0): chunks are
   scheduled over time at `drain/n_chunks` ns intervals to model the
   bottleneck-link's data arrival rate (ADR-0033 D1 chunk-loop drain).
4. Apply `run(env, txn.nbytes)` once for `overhead_ns`.
5. For each chunk `i`, advance `chunk_interval` ns then apply the D4
   schedule with `pc = _pc_for_address(base_address + i * burst_bytes)`.
6. After scheduling all chunks, wait `last_finish - env.now` then call
   `_send_response`.

`_handle_txn` shares the same `_pc_avail` / `_pc_last_dir` state with
`_handle_flit`, so there is exactly one source of PC scheduling truth
across both paths.

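
The chunk-loop arithmetic in the procedure can be checked with a standalone sketch. `schedule_command_txn` is a hypothetical function, not the component's `_handle_txn`: it assumes `overhead_ns = 0` and no switch penalty, starts from an idle set of PCs on every call, and advances a plain `now` variable where the real code would yield a timeout.

```python
import math


def schedule_command_txn(now, base_address, txn_nbytes, request_nbytes, drain_ns,
                         num_pcs=8, burst_bytes=256, pc_bw_gbs=32.0):
    """Return (n_chunks, last_finish) for one command-only transaction."""
    pc_shift = burst_bytes.bit_length() - 1
    pc_mask = num_pcs - 1
    chunk_time = burst_bytes / pc_bw_gbs
    pc_avail = [0.0] * num_pcs  # idle PCs; the real state is shared across calls

    # Steps 1-3: size the work, count chunks, derive the pacing interval.
    work_bytes = txn_nbytes if txn_nbytes > 0 else int(request_nbytes or 0)
    n_chunks = math.ceil(work_bytes / burst_bytes) if work_bytes > 0 else 0
    chunk_interval = drain_ns / n_chunks if drain_ns > 0 and n_chunks > 0 else 0.0

    # Step 5: advance by the interval, then commit each chunk on its PC.
    last_finish = now
    for i in range(n_chunks):
        now += chunk_interval
        pc = ((base_address + i * burst_bytes) >> pc_shift) & pc_mask
        finish = max(now, pc_avail[pc]) + chunk_time
        pc_avail[pc] = finish
        last_finish = max(last_finish, finish)
    return n_chunks, last_finish


# A 1024 B read command (txn.nbytes == 0), paced over a 16 ns drain.
paced = schedule_command_txn(0.0, 0x0, 0, 1024, drain_ns=16.0)
# Same read with no pacing: four chunks land on four idle PCs at once.
unpaced = schedule_command_txn(0.0, 0x0, 0, 1024, drain_ns=0.0)
# Same read with a single PC: the four chunks serialise.
single_pc = schedule_command_txn(0.0, 0x0, 0, 1024, drain_ns=0.0, num_pcs=1)
```

Comparing `unpaced` with `single_pc` isolates the benefit of address-based PC selection on this path, while `paced` shows the link arrival rate, not the PCs, setting the finish time.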
### D8. Response routing

`_send_response` dispatches on request type and path geometry:

| Case | Trigger | Response |
| --- | --- | --- |
| PE_DMA | `isinstance(txn.request, PeDmaMsg)` | New reverse-path Transaction (`is_response=True`, `nbytes=0`), same `done` |
| Bypass: Memory Read | `"m_cpu" not in any(txn.path)` AND `MemoryReadMsg` | Reverse-path Transaction with `nbytes=request.nbytes` (data return) |
| Bypass: Memory Write | `"m_cpu" not in any(txn.path)` AND not Memory Read | `txn.done.succeed()` (write completes locally) |
| Default | otherwise | New `ResponseMsg(correlation_id, request_id, src_cube, src_pe, success=True)` on reverse path |

The "bypass" classification matches the Memory R/W fabric path defined
in ADR-0015 D4 (PCIE_EP → io_noc → ucie → cube router → hbm_ctrl,
without M_CPU). The PE_DMA case is its own dedicated reverse-path to
keep the inner-loop DMA fast (PE_DMA reads/writes do not synthesise a
ResponseMsg envelope).

In all reverse-path cases, the response Transaction is put onto
`out_ports[reverse_path[1]]`, the first hop back along the recorded
forward path. If `reverse_path` has fewer than 2 entries (degenerate
path), the original `txn.done` is signalled directly.

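
The dispatch order in the table can be sketched as a pure classification function. Everything here is illustrative: the message classes are minimal stand-ins for the real ones, the hop names are made up, and the trigger `"m_cpu" not in any(txn.path)` is read as "no hop name contains the substring `m_cpu`", which is an assumption about the intended meaning.

```python
from dataclasses import dataclass


@dataclass
class MemoryReadMsg:       # stand-in, not the real message class
    nbytes: int


@dataclass
class MemoryWriteMsg:      # stand-in
    nbytes: int


@dataclass
class PeDmaMsg:            # stand-in
    nbytes: int
    is_write: bool = False


@dataclass
class KernelLaunchMsg:     # stand-in for any M_CPU-routed request
    kernel: str


def classify_response(request, path):
    """Return which row of the response table applies."""
    if isinstance(request, PeDmaMsg):
        return "pe_dma"                      # checked first, regardless of path
    via_m_cpu = any("m_cpu" in hop for hop in path)
    if not via_m_cpu:
        if isinstance(request, MemoryReadMsg):
            return "bypass_read"             # data returns on the reverse path
        return "bypass_write"                # completes locally via done
    return "default"                         # ResponseMsg on the reverse path


bypass_path = ["pcie_ep", "io_noc", "ucie", "cube0.router.pe0", "cube0.hbm_ctrl.pe0"]
m_cpu_path = ["io_cpu", "cube0.m_cpu", "cube0.router.pe0", "cube0.hbm_ctrl.pe0"]

cases = [
    classify_response(PeDmaMsg(64), m_cpu_path),
    classify_response(MemoryReadMsg(4096), bypass_path),
    classify_response(MemoryWriteMsg(4096), bypass_path),
    classify_response(KernelLaunchMsg("gemm"), m_cpu_path),
]
```

Checking PE_DMA before the path test keeps the ordering of the table explicit: a PE_DMA request never falls through to the bypass or default rows.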
### D9. Configurable attributes

| Attribute | Default | Source | Notes |
| --- | --- | --- | --- |
| `num_pcs` | 8 | topology cube `hbm_ctrl.attrs` | Must be power of 2 |
| `pc_bw_gbs` | 32.0 | builder-derived: `hbm_to_router_bw_gbs / num_pcs` | Enforces ADR-0017 D8 invariant |
| `burst_bytes` | 256 | topology attrs | Must be power of 2; equals `flit_bytes` (ADR-0033 D1) |
| `switch_penalty_ns` | 0.0 | topology attrs | Tier 0 default; non-zero models pessimistic R/W switching |
| `efficiency` | 1.0 | topology attrs | Applied at builder time to `hbm_to_router_bw_gbs` (router-edge BW scaling only) |
| `overhead_ns` | 0.0 | topology attrs | First-flit decode overhead (D5) |

`pc_bw_gbs` is derived by `topology/builder.py` rather than configured
directly so the aggregate per-PE BW matches the router-to-HBM link BW
without yaml-side duplication.

## Consequences

### Positive

- Address-based PC selection preserves multi-stream HBM parallelism
  that an address-blind round-robin would collapse, which is important
  for multi-PE workloads with disjoint HBM regions.
- Flit-aware path (D5) + async finalize (D6) preserves wormhole
  pipelining and exposes PC parallelism for back-to-back single-flit
  transactions.
- Single source of PC scheduling truth (D4 mechanism, used by both D5
  flit path and D7 chunk-loop path).
- Builder-derived `pc_bw_gbs` enforces ADR-0017 D8 in code, not yaml
  discipline.

### Negative

- No bank-level conflict modelling within a PC; address-blind to
  bank/row-buffer reuse (ADR-0033 D3).
- No HBM scheduler (FR-FCFS / write-buffer / watermark drain); fixed
  FIFO per PC. Bursty mixed R/W is approximated by `switch_penalty_ns`
  (ADR-0033 D2).
- `_txn_state` is a regular dict keyed by `id(txn)`; in-flight state
  accumulates per concurrent transaction and is removed only on
  `is_last`. Adequate for current workloads.

## Links

- ADR-0001 (Physical address layout: PC bit field comment)
- ADR-0015 D4 (Memory R/W fabric path: bypass response case)
- ADR-0017 D4 (Per-PE HBM partitioning: attachment to PE routers)
- ADR-0017 D8 (HBM channel mapping mode: the n:1 aggregate this ADR
  implements)
- ADR-0017 D9 (AddressResolver: `hbm_ctrl.pe{pe_id}` endpoint
  resolution)
- ADR-0033 D1 (Modelled precisely: per-PC parallelism, switch penalty,
  flit-aware PC commit, first-flit overhead, chunk-loop drain)
- ADR-0033 D2 (Switch-penalty default 0: ideal scheduler amortisation)
@@ -0,0 +1,286 @@
# ADR-0035: M_CPU and M_CPU.DMA Component Model

## Status

Accepted

## Context

M_CPU is the cube-level command processor. It receives commands from
IO_CPU (or from PCIE_EP when the engine routes Memory R/W through
M_CPU as a fallback), fans them out to the PEs in its cube, and
aggregates per-PE responses into a single ResponseMsg sent back to
IO_CPU on the reverse path.

M_CPU.DMA is the cube-level DMA channel pair that handles Memory R/W
fan-out. Per ADR-0015 D5 it is **not** a separate topology node;
it lives as internal state of `MCpuComponent`.

This ADR documents the M_CPU component implementation that realizes
those responsibilities, including the three distinct fan-out paths
(Memory R/W, Kernel Launch, MMU Map/Unmap), the M_CPU.DMA resource
model, and the response aggregation contract.

## Decision

### D1. Role

M_CPU has three responsibilities:

1. **Transit forwarding**: when not the terminal hop (e.g., on the
   reverse response path PE → M_CPU → IO_CPU), forwards Transactions
   to `next_hop` in their pre-computed path.
2. **Multi-PE fan-out at terminal hop**: dispatches to one of three
   fan-out paths based on request type (D2).
3. **Response aggregation**: collects per-PE responses, sends a
   single aggregate ResponseMsg back to IO_CPU on the reverse path.

Each `run()` invocation applies `overhead_ns` once per incoming
Transaction.

M_CPU does **not**:

- Decide routing: paths are pre-computed by the router (ADR-0002).
- Handle PE-internal execution: that belongs to PE_CPU /
  PE_SCHEDULER / engines (ADR-0014).
- Decode addresses: `ctx.resolver.resolve(pa)` returns the per-PE
  `hbm_ctrl.pe{X}` directly (ADR-0017 D9).
- Interpret tensor or kernel semantics: fan-out dispatch is by Python
  `isinstance` check only.

### D2. Three fan-out paths dispatched by request type

At the terminal hop the worker dispatches by request type:

```python
# Excerpt from the worker: terminal-hop dispatch by request type.
elif self.ctx is not None and txn.request is not None:
    if isinstance(txn.request, KernelLaunchMsg):
        env.process(self._kernel_launch_fanout(env, txn))
    elif isinstance(txn.request, (MmuMapMsg, MmuUnmapMsg)):
        env.process(self._mmu_msg_fanout(env, txn))
    else:
        env.process(self._dma_fanout(env, txn))
```

The three paths resolve their routes through two router methods:

- `_dma_fanout` uses `ctx.router.find_mcpu_dma_path()`, the
  M_CPU-specific DMA path that avoids PE pipeline nodes.
- `_kernel_launch_fanout` uses `ctx.router.find_node_path()`, the
  generic NOC command path to PE_CPU.
- `_mmu_msg_fanout` also uses `ctx.router.find_node_path()`, here for
  the NOC command path to PE_MMU.

### D3. M_CPU.DMA internal subcomponent (ADR-0015 D5)

`MCpuComponent.start()` initializes two SimPy resources:

```python
self._dma_write = simpy.Resource(env, capacity=1)  # MemoryWriteMsg
self._dma_read = simpy.Resource(env, capacity=1)   # MemoryReadMsg
```

Properties:

- **Not a topology node**: managed entirely inside `MCpuComponent`;
  does not appear in `topology.yaml` or in the compiled graph.
- **Independent read and write channels**: concurrent in-flight
  Memory R/W is allowed.
- **Capacity=1 per channel** serializes the **dispatch step**
  (`yield self.out_ports[...].put(...)`) of concurrent in-flight Memory
  R/W requests at this M_CPU. Actual fabric transfer time is modeled
  by wire processes between components (ADR-0015 D2) and by
  `drain_ns` at terminal hops; the DMA resource does not gate
  transfer duration.

Resource selection is request-type-based:

```python
dma_res = self._dma_write if isinstance(request, MemoryWriteMsg) else self._dma_read
```

### D4. Transit forwarding at non-terminal hops

When `txn.next_hop` is not None, which is typical for the reverse
response path (PE → M_CPU → IO_CPU), the worker forwards normally:

```python
if next_hop:
    yield self.out_ports[next_hop].put(txn.advance())
```

The fan-out branches fire only at the terminal hop. The same component
therefore serves both forward command dispatch and reverse response
relay roles.

### D5. DMA fan-out (`_dma_fanout`, Memory R/W)

For each Memory R/W request at terminal hop:

1. `_resolve_dma_destinations(request)` returns a per-PE
   `hbm_ctrl.pe{X}` derived from the request's PA via
   `ctx.resolver.resolve(PhysAddr.decode(pa))` (ADR-0017 D9).
2. For each destination:
   - Acquire the appropriate DMA resource (`_dma_write` or
     `_dma_read`) via `with dma_res.request() as req`.
   - Resolve path via `ctx.router.find_mcpu_dma_path()`.
   - Compute `drain_ns = ctx.compute_drain_ns(path, nbytes)`.
   - Create sub-Transaction carrying `drain_ns` and dispatch to
     `path[1]`.
3. Track `max_drain_ns` across destinations and record it as
   `txn.result_data["xfer_ns"]` after all responses arrive.
4. After all per-PE responses are collected (D8), send an aggregate
   ResponseMsg on the reverse command path back to IO_CPU.

The PA decode fallback (`f"{cube_prefix}.hbm_ctrl"`) is legacy dead
code: no such node exists after ADR-0017 D4's per-PE partitioning. It
is kept defensively but does not route to a real destination.

### D6. Kernel launch fan-out (`_kernel_launch_fanout`)

For `KernelLaunchMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → list of PE ids in this cube.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_cpu"` via
   `ctx.router.find_node_path()`.
3. **`target_start_ns` handling** (ADR-0009 D5):
   - If the request already carries `target_start_ns` (stamped by
     IO_CPU per ADR-0036 D3): **pass through unchanged**.
   - If absent (direct-to-M_CPU launch in unit tests): compute a
     per-cube barrier `env.now + max(per-PE leg latency)` and stamp
     via `dataclasses.replace`.
4. Dispatch sub-Transactions with `nbytes=0` (kernel launch is a
   control message; preserving nbytes=0 keeps fan-out off the shared
   first-hop fabric BW, mirroring ADR-0036 D4).
5. After all per-PE responses arrive (D8), aggregate per-PE metrics
   from each sub-Transaction's `result_data` into the parent
   transaction:

   ```python
   # `existing` is the value already stored under the same key.
   txn.result_data["pe_exec_ns"] = max(existing, max(pe_exec_values))
   txn.result_data["dma_ns"] = max(existing, max(dma_values))
   txn.result_data["compute_ns"] = max(existing, max(compute_values))
   ```

   The max-merge with the existing value matters because cross-cube
   IO_CPU fan-out shares the same parent `result_data`; merging
   prevents one cube from clobbering another's metric.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

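The `target_start_ns` handling in step 3 can be sketched in isolation. This is a minimal illustration, not the component's code: the `KernelLaunchMsg` stand-in, its two fields, and the `stamp_target_start_ns` helper are hypothetical, and the per-PE leg latencies are passed in as plain numbers instead of being computed from the router.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelLaunchMsg:
    """Hypothetical stand-in: only the fields this sketch needs."""
    request_id: str
    target_start_ns: Optional[float] = None


def stamp_target_start_ns(
    request: KernelLaunchMsg, now_ns: float, leg_latencies_ns: Sequence[float]
) -> KernelLaunchMsg:
    """Pass an existing barrier through; otherwise stamp a per-cube one."""
    if request.target_start_ns is not None:
        # Already stamped upstream (IO_CPU in the normal path): leave it alone.
        return request
    # Fallback: barrier = now + slowest M_CPU -> PE_CPU leg in this cube.
    return replace(request, target_start_ns=now_ns + max(leg_latencies_ns))


stamped = stamp_target_start_ns(KernelLaunchMsg("r1", 500.0), 100.0, [7.0, 9.0])
fallback = stamp_target_start_ns(KernelLaunchMsg("r2"), 100.0, [7.0, 9.0])
print(stamped.target_start_ns, fallback.target_start_ns)  # 500.0 109.0
```

Using `dataclasses.replace` leaves the original request object untouched, which matters when the same request is referenced by more than one sub-Transaction.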

### D7. MMU map/unmap fan-out (`_mmu_msg_fanout`)

For `MmuMapMsg` / `MmuUnmapMsg` at terminal hop:

1. `_resolve_pe_ids(target_pe)` → PE ids.
2. For each PE: find path to `f"{cube_prefix}.pe{pe_id}.pe_mmu"` via
   `find_node_path()`.
3. Dispatch sub-Transactions with `nbytes=0`.
4. PE_MMU is a terminal node: it does **not** send a ResponseMsg
   back. Instead, the sub-Transaction's own `sub_done` event is the
   completion signal.
5. Wait for all `sub_done` events in-line. This path does **not** use
   the `_pending` counter; D8 is for response-bearing fan-out only.
6. Send aggregate ResponseMsg on reverse path back to IO_CPU.

### D8. Response aggregation (`_pending` + `_parent_txns`)

DMA and kernel-launch fan-out expect a per-PE ResponseMsg arriving on
the reverse path. M_CPU tracks them in two dicts:

```python
self._pending: dict[str, tuple[int, int, simpy.Event]] = {}
self._parent_txns: dict[str, Any] = {}
```

- On dispatch: register `(expected, received=0, all_done)` and
  remember the parent transaction.
- `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`, which increments `received` and
  signals `all_done` when `received >= expected`.
- After `yield all_done`, the fan-out path constructs the aggregate
  ResponseMsg:

  ```python
  resp_msg = ResponseMsg(
      correlation_id=request.correlation_id,
      request_id=request.request_id,
      src_cube=cube_id,
      src_pe=-1,     # -1 = M_CPU aggregate, not a single PE
      success=True,  # no failure semantics implemented
  )
  ```

- The response Transaction travels on `list(reversed(txn.path))`
  back to IO_CPU.

MMU fan-out (D7) uses a simpler in-line list of `sub_done` events
because PE_MMU is terminal: there is no ResponseMsg path to
intercept.

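The counter contract above can be sketched without SimPy. In this illustrative version a plain callback stands in for the `simpy.Event`, and the function names `register` and `collect_response` are hypothetical; dropping the entry on completion is also an assumption, since this ADR does not state when `_pending` entries are removed.

```python
from typing import Callable, Dict, Tuple

# request_id -> (expected, received, on_all_done); the callback is a
# hypothetical stand-in for the simpy.Event used by the component.
Pending = Dict[str, Tuple[int, int, Callable[[], None]]]


def register(pending: Pending, request_id: str, expected: int,
             on_all_done: Callable[[], None]) -> None:
    """Record how many per-PE responses a fan-out expects."""
    pending[request_id] = (expected, 0, on_all_done)


def collect_response(pending: Pending, request_id: str) -> bool:
    """Count one response; fire the callback when the last one arrives."""
    expected, received, on_all_done = pending[request_id]
    received += 1
    if received >= expected:
        del pending[request_id]  # assumption: state is dropped on completion
        on_all_done()
        return True
    pending[request_id] = (expected, received, on_all_done)
    return False


pending: Pending = {}
fired = []
register(pending, "req-1", expected=3, on_all_done=lambda: fired.append("req-1"))
results = [collect_response(pending, "req-1") for _ in range(3)]
print(results, fired, pending)  # [False, False, True] ['req-1'] {}
```

The sketch also makes the stall risk visible: if only two of the three responses ever arrive, the callback never fires and the entry stays in `pending`.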

### D9. Helpers and configurable attribute

`_resolve_pe_ids(target_pe)`:

- `int` → `[target_pe]`
- `tuple[int, ...]` → `list(target_pe)`
- `"all"` → `range(n_slices)` where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).

Used by kernel-launch and MMU fan-out paths.

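The three cases map directly onto a small function. The standalone `resolve_pe_ids` below is a sketch: it takes `n_slices` as a parameter instead of reading `memory_map.hbm_slices_per_cube`, and raising on any other input is an assumption, since the ADR does not describe that case.

```python
from typing import List, Tuple, Union

TargetPe = Union[int, Tuple[int, ...], str]


def resolve_pe_ids(target_pe: TargetPe, n_slices: int = 8) -> List[int]:
    """Expand a target_pe selector into an explicit list of PE ids."""
    if isinstance(target_pe, int):
        return [target_pe]
    if isinstance(target_pe, tuple):
        return list(target_pe)
    if target_pe == "all":
        return list(range(n_slices))
    # Assumption: anything else is rejected; the source does not say.
    raise ValueError(f"unsupported target_pe: {target_pe!r}")


print(resolve_pe_ids(3))           # [3]
print(resolve_pe_ids((0, 2)))      # [0, 2]
print(resolve_pe_ids("all", 4))    # [0, 1, 2, 3]
```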

A single configurable attribute drives per-instance latency:

| Site | impl name | overhead_ns |
| --- | --- | --- |
| Cube `m_cpu` | `builtin.m_cpu` | 5.0 |

Applied once in `run()` per Transaction; models command
interpretation and dispatch-decision time at M_CPU.

## Consequences

### Positive

- Three fan-out paths are clearly separated by request type: adding
  a new request kind is an `isinstance` branch + one fan-out method.
- M_CPU.DMA channels are independent (read and write run concurrently)
  and serialize only the dispatch step at capacity=1.
- Transit-vs-terminal behavior is a single `if next_hop` check, so
  the same component handles forward dispatch and reverse response
  relay without role duplication.
- `target_start_ns` passthrough (D6) preserves the cross-cube barrier
  established by IO_CPU (ADR-0036 D3), while the fallback computation
  keeps direct-to-M_CPU unit tests working.
- Per-PE metric `max`-merge against existing parent `result_data`
  values is robust to cross-cube IO_CPU fan-out sharing the same
  parent.

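The `max`-merge in the last bullet (D6 step 5) can be sketched as follows. The helper name `merge_max` and the sample numbers are hypothetical; the point is only that a second cube reporting smaller values must not lower what the first cube already stored.

```python
from typing import Dict, Iterable


def merge_max(result_data: Dict[str, float], key: str,
              values: Iterable[float]) -> None:
    """Fold per-PE values into result_data[key] without lowering it."""
    candidates = list(values)
    if key in result_data:
        candidates.append(result_data[key])
    if candidates:  # leave the key untouched when there is nothing to merge
        result_data[key] = max(candidates)


shared: Dict[str, float] = {}                     # parent result_data shared by two cubes
merge_max(shared, "pe_exec_ns", [120.0, 95.0])    # cube 0 reports first
merge_max(shared, "pe_exec_ns", [80.0, 60.0])     # cube 1 must not clobber 120.0
print(shared)  # {'pe_exec_ns': 120.0}
```

A plain assignment in the second call would have left 80.0 in place, under-reporting the slowest PE across cubes.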

### Negative

- No partial-failure semantics: a missing per-PE response stalls the
  parent `all_done` indefinitely. Acceptable for simulation; not
  suitable as a production-style endpoint.
- `_resolve_dma_destinations`'s cube-wide hbm_ctrl fallback is dead
  code (no such node exists post-ADR-0017 D4). Kept defensively;
  invites confusion and merits a follow-up cleanup.
- DMA resource serialization applies only at dispatch (the `put` call
  is instantaneous in unbounded stores). The capacity=1 channel
  models "one request in flight at a time at this M_CPU", not
  "transfer duration serialization"; readers must consult wire
  processes (ADR-0015 D2) and `drain_ns` for actual transfer
  parallelism.

## Links

- ADR-0009 D3 (M_CPU fan-out and aggregation completion semantics)
- ADR-0009 D5 (`target_start_ns`: passed through unchanged when
  present; computed as per-cube barrier when absent)
- ADR-0011 D-VA3 (MmuMapMsg fabric path includes M_CPU as PE fan-out
  point)
- ADR-0014 D4 (DMA engine capacity=1; M_CPU.DMA mirrors the same
  contract at cube level)
- ADR-0015 D5 (M_CPU.DMA is internal subcomponent of M_CPU, not a
  topology node)
- ADR-0017 D9 (AddressResolver returns per-PE `hbm_ctrl.pe{X}`)
- ADR-0036 D3 / D4 (IO_CPU stamps `target_start_ns`; M_CPU passes
  through unchanged; nbytes=0 invariant preserved through fan-out)
@@ -0,0 +1,216 @@
# ADR-0036: IO_CPU Component Model

## Status

Accepted

## Context

IO_CPU is the IO chiplet's host-facing endpoint inside the simulation
graph. PCIE_EP receives host messages from the runtime API and routes
them via the io_noc; for command-bearing requests (KernelLaunch,
MmuMap/Unmap) the io_noc forwards to IO_CPU, which:

- Fans out the request to per-cube M_CPUs.
- Aggregates per-cube responses into a single host-visible completion.
- For kernel launches, stamps a global `target_start_ns` barrier so
  every PE across every targeted cube begins kernel body execution at
  the same simulated time (ADR-0009 D5).

Memory R/W traffic bypasses IO_CPU per ADR-0015 D4 / ADR-0016 D3;
this component therefore handles only command-plane traffic in normal
operation.

This ADR documents the IO_CPU component implementation that realizes
those responsibilities.

## Decision

### D1. Role

IO_CPU is the host-facing endpoint of the IO chiplet. It has two
primary responsibilities:

1. **Multi-cube fan-out**: distribute KernelLaunchMsg / MmuMapMsg /
   MmuUnmapMsg to per-cube M_CPUs.
2. **Response aggregation**: collect per-cube ResponseMsg, signal
   parent `txn.done` when all targeted cubes have responded.

A third, narrower responsibility applies only to KernelLaunchMsg:
**`target_start_ns` global barrier stamping** (D3).

The component does **not**:

- Decide routing: paths are pre-computed by the router (ADR-0002).
- Decode tensor or kernel internals: those concerns belong to
  M_CPU / PE_CPU / engines.
- Handle PE-level fan-out: M_CPU fans out within a cube (ADR-0009 D3).
- Handle the Memory R/W data path: that traffic bypasses IO_CPU per
  ADR-0015 D4 and ADR-0016 D3 (Memory R/W resolution code in
  `_resolve_cube_targets` exists as a defensive fallback only).

Each `run()` invocation applies the configured `overhead_ns` once
per incoming Transaction (D8).

### D2. Forward path: multi-cube fan-out

When a non-response Transaction arrives, the worker:

1. Pays `overhead_ns` via `run()`.
2. Calls `_resolve_cube_targets` to derive the list of `(sip, cube)`
   targets from the request (D5).
3. For each target:
   - Resolves M_CPU node id via `ctx.resolver.find_m_cpu(sip, cube)`.
   - Resolves the path via `ctx.router.find_node_path(io_cpu, m_cpu)`.
   - Creates a per-cube sub-Transaction with `path` populated and
     forwards it to `path[1]` (the first hop on the io_noc).
4. Registers aggregation state:
   `_pending[request_id] = (expected, received=0, parent_done)`.

### D3. KernelLaunch `target_start_ns` global barrier (ADR-0009 D5)

IO_CPU is the canonical stamper for `target_start_ns`. When the
request is a `KernelLaunchMsg`, IO_CPU computes a single global
barrier covering every targeted PE across every targeted cube:

```text
for (sip, cube) in cube_targets:
    leg1 = compute_path_latency_ns(io_cpu → m_cpu(sip, cube), nbytes=0)
    for pe_id in target_pe_ids:
        leg2 = compute_path_latency_ns(m_cpu → pe_cpu(sip, cube, pe_id),
                                       nbytes=0)
        latency = leg1 + leg2 - io_overhead_ns - m_overhead_ns
        global_max = max(global_max, latency)

target_start_ns = env.now + global_max
```

The request is then replaced (via `dataclasses.replace`) so the
stamped value propagates through the fan-out.

Two overhead corrections:

- `io_overhead_ns` is subtracted because IO_CPU has already paid it
  in `run()` before this method runs.
- `m_overhead_ns` is subtracted once because it appears as the
  endpoint of leg1 *and* the start of leg2 in path latency, but
  M_CPU pays it only once at run time.

Every downstream PE_CPU yields until `target_start_ns` before
beginning kernel body execution; all PEs therefore start at the same
simulated time regardless of how long their individual dispatch path
took.

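A worked example of the barrier arithmetic, as a runnable sketch. The leg latencies are made-up numbers and `global_barrier_ns` is a hypothetical helper that receives them as dictionaries instead of calling `compute_path_latency_ns`; the two overhead constants are the defaults listed in D8 of this ADR and in ADR-0035 D9.

```python
from typing import Dict, Tuple

IO_OVERHEAD_NS = 10.0   # builtin.io_cpu overhead_ns (D8)
M_OVERHEAD_NS = 5.0     # builtin.m_cpu overhead_ns (ADR-0035 D9)


def global_barrier_ns(
    now_ns: float,
    leg1_ns: Dict[Tuple[int, int], float],
    leg2_ns: Dict[Tuple[int, int, int], float],
) -> float:
    """Return now + the slowest corrected IO_CPU -> PE_CPU latency."""
    global_max = 0.0
    for (sip, cube, pe_id), leg2 in leg2_ns.items():
        leg1 = leg1_ns[(sip, cube)]
        # IO_CPU overhead is already paid; M_CPU overhead sits in both legs.
        latency = leg1 + leg2 - IO_OVERHEAD_NS - M_OVERHEAD_NS
        global_max = max(global_max, latency)
    return now_ns + global_max


# Made-up leg latencies (ns) for two cubes with two PEs each.
leg1 = {(0, 0): 25.0, (0, 1): 40.0}
leg2 = {(0, 0, 0): 12.0, (0, 0, 1): 16.0, (0, 1, 0): 12.0, (0, 1, 1): 16.0}
barrier = global_barrier_ns(1000.0, leg1, leg2)
print(barrier)  # 1041.0
```

The slowest target is cube 1, PE 1: 40 + 16 - 10 - 5 = 41 ns, so every PE waits until 1041 ns even though cube 0's PEs could have started at 1022 ns or 1026 ns.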

### D4. KernelLaunch sub-Transactions carry `nbytes=0`

Per-cube sub-Transactions for KernelLaunchMsg force `nbytes=0`,
overriding the parent `txn.nbytes`:

- Kernel launch is a control message; payload size is irrelevant at
  the data-fabric level.
- If `nbytes > 0`, every per-cube sub-txn occupies fabric BW on the
  io_noc's shared first hop. With 16 cubes this serializes fan-out,
  pushing far M_CPUs past `target_start_ns` and breaking the D3
  invariant.

Non-KernelLaunch sub-Transactions preserve `txn.nbytes` (only relevant
for the defensive Memory R/W fallback path, which carries actual
payload sizes).

### D5. Per-request-type cube target resolution

`_resolve_cube_targets` dispatches by request type:

| Request type | Source of `(sip, cube)` | `target_cubes="all"` semantics |
| --- | --- | --- |
| `MemoryWriteMsg` | `dst_sip`, `dst_cube` (or `PhysAddr.decode(dst_pa).die_id` fallback) | single cube derived from PA decode |
| `MemoryReadMsg` | `src_sip`, `src_cube` (or `PhysAddr.decode(src_pa).die_id` fallback) | single cube derived from PA decode |
| `KernelLaunchMsg` | tensor shards filtered by `shard.sip == my_sip` | every cube that owns a shard on this SIP |
| `MmuMapMsg` / `MmuUnmapMsg` | `target_cubes` list, filtered to this SIP | `range(cubes_per_sip)` from spec |

Each IO_CPU instance fans out only within its own SIP: `_my_sip()`
parses the SIP id from the node id (e.g., `sip0.io0.io_cpu` → 0).

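One way to parse the SIP id from a node id is a small regular expression. This is a sketch under the assumption that every IO_CPU node id starts with a `sip<N>.` segment, as in the example above; the function name `my_sip` and the error behavior are illustrative, not taken from the implementation.

```python
import re

_SIP_RE = re.compile(r"^sip(\d+)\.")


def my_sip(node_id: str) -> int:
    """Extract the SIP id from a node id such as 'sip0.io0.io_cpu'."""
    match = _SIP_RE.match(node_id)
    if match is None:
        # Assumption: ids without a leading 'sip<N>.' segment are an error.
        raise ValueError(f"node id has no SIP prefix: {node_id!r}")
    return int(match.group(1))


print(my_sip("sip0.io0.io_cpu"))   # 0
print(my_sip("sip12.io0.io_cpu"))  # 12
```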

The Memory R/W rows exist for defensive completeness; the engine's
normal path routes Memory R/W via `_process_memory_direct()` /
`find_memory_path()`, bypassing IO_CPU entirely (ADR-0015 D4 /
ADR-0016 D3).

### D6. Response aggregation

`_pending: dict[request_id → (expected, received, parent_done)]`:

- On dispatch: register `(len(cube_targets), 0, txn.done)`.
- `_worker` recognises responses by `is_response=True` and routes
  them to `_collect_response`.
- `_collect_response` increments `received`; when
  `received >= expected`, `parent_done.succeed()` is invoked and the
  entry is removed from `_pending`.

This is a simple per-request counter. There is no per-cube identity
tracking and no partial-failure handling: a missing response stalls
the parent `done` indefinitely. Production-style failure paths are
out of scope for the current simulator model.

### D7. `target_pe` resolution helper

`_resolve_pe_ids(target_pe)`:

- `int` → `[target_pe]`.
- `tuple[int, ...]` → `list(target_pe)`.
- `"all"` → `range(n_slices)`, where `n_slices` comes from cube
  `memory_map.hbm_slices_per_cube` (default 8).

Used in D3's barrier computation to enumerate every PE target per
cube.

### D8. Configurable `overhead_ns`

A single attribute drives per-instance latency:

| Site | impl name | overhead_ns |
| --- | --- | --- |
| IO chiplet `io_cpu` | `builtin.io_cpu` | 10.0 |

Applied once in `run()` per Transaction. Models command
interpretation + dispatch-decision time at IO_CPU.

## Consequences

### Positive

- Cross-cube and cross-SIP kernel launches share a single global
  barrier (D3 + D4), so there is no per-cube divergence in start time.
- nbytes=0 invariant keeps fan-out off the shared first-hop fabric
  BW, preserving the barrier's accuracy at scale (16 cubes).
- Response aggregation via a single counter keeps state minimal and
  completion ordering deterministic.
- Per-SIP scoping (`_my_sip()`) keeps IO_CPUs in different SIPs
  cleanly independent.

### Negative

- No partial-failure semantics: a missing per-cube response
  indefinitely stalls the parent. Adequate for simulation but not
  suitable as a production-style endpoint.
- `_pending` is a regular dict; in-flight requests accumulate state.
  Acceptable for current benchmark workloads (few concurrent
  outstanding launches); unbounded in principle.
- The Memory R/W resolution branches in `_resolve_cube_targets` are
  dead code in the normal engine path. Kept defensively but invite
  drift if the bypass path ever changes.

## Links

- ADR-0002 (Routing distance: path computation)
- ADR-0009 D1 (Kernel launch is an endpoint request to IO_CPU)
- ADR-0009 D3 (M_CPU fans out within a cube; IO_CPU fans out across
  cubes)
- ADR-0009 D5 (target_start_ns canonical stamping at IO_CPU)
- ADR-0011 D-VA3 (MmuMapMsg routes through IO_CPU for cube fan-out)
- ADR-0012 (Host ↔ IO_CPU message schema)
- ADR-0015 D4 (Memory R/W bypasses IO_CPU; Kernel Launch via IO_CPU)
- ADR-0016 D1 (IO chiplet io_noc; IO_CPU attaches here)
- ADR-0016 D3 (Memory R/W path bypasses IO_CPU)
- ADR-0016 D4 (Kernel Launch path through IO_CPU for command
  interpretation)
@@ -0,0 +1,200 @@
# ADR-0037: Forwarding Component (forwarding_v1)

## Status

Accepted

## Context

The simulation graph has many node positions that exist purely to model
fabric traversal: NOC mesh routers, switches, UCIe protocol endpoints,
IO chiplet io_noc, transit cubes. These share a common pattern: receive
a message, apply per-component overhead (modeling header decode +
routing decision time), forward to the next hop along the pre-computed
path.

This ADR defines the contract for these transit nodes: a single
component type (`TransitComponent`) that handles flit-aware forwarding
with wormhole cut-through semantics, used under multiple impl names
according to the conceptual role each instance plays.

## Decision

### D1. Role

The Forwarding component (`TransitComponent` class) is a **stateless
transit node** in the simulation graph. It models any fabric position
where a message physically traverses but no semantic processing
happens.

Per traversal, the component:

1. Reads an incoming Transaction or Flit from an `in_port`.
2. Applies the configured per-component overhead (`overhead_ns`),
   applied **once per Transaction** even across multi-flit payloads
   (see D2).
3. Looks up the next hop along the Transaction's pre-computed `path`.
4. Forwards to the corresponding `out_port`; at the terminal node
   (no next hop), signals `txn.done` once the `is_last` flit arrives.

The component **does NOT**:

- Decide routing: paths are pre-computed by the router (ADR-0002 /
  ADR-0017 D2). Forwarding only executes the per-hop step.
- Model wire propagation or bandwidth occupancy: separate wire
  processes between components handle that (ADR-0015 D2).
- Resolve addresses: the AddressResolver does that (ADR-0017 D9).
- Aggregate completion: terminal endpoints (IO_CPU, M_CPU, HBM_CTRL)
  handle that.

### D2. First-flit overhead model (header decode)

Per-Transaction `overhead_ns` is applied **exactly once**, at first
flit arrival:

- `_txn_decoded: set[int]` tracks which Transactions have already
  paid the overhead at this node.
- On first-flit arrival for a Transaction:
  `yield self.run(env, msg.txn.nbytes)` pays the overhead.
- Subsequent flits of the same Transaction skip the overhead; they
  pipeline through with no extra delay.
- On `is_last` flit: remove the Transaction from `_txn_decoded`.

This models the real-HW behavior where header decode and routing
decision happen once on first flit; payload flits then stream through
the same path (wormhole cut-through). Multi-hop pipelining emerges
naturally: each hop adds its own first-flit overhead, but later flits
do not pay the overhead again at a hop where the first flit has
already paid it.

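The bookkeeping described above can be sketched in a few lines. The `Flit` class and `overhead_per_flit` function are hypothetical stand-ins; the local `decoded` set plays the role of `_txn_decoded`, and the sketch only reports which flits pay the overhead instead of yielding in a simulation.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Flit:
    """Hypothetical stand-in for a flit: owning transaction id + last flag."""
    txn_id: int
    is_last: bool


def overhead_per_flit(flits: Iterable[Flit], overhead_ns: float) -> List[float]:
    """Return the decode overhead each flit pays at one node, in arrival order."""
    decoded: Set[int] = set()   # plays the role of _txn_decoded
    paid: List[float] = []
    for flit in flits:
        if flit.txn_id not in decoded:
            decoded.add(flit.txn_id)
            paid.append(overhead_ns)      # first flit: header decode
        else:
            paid.append(0.0)              # later flits stream through
        if flit.is_last:
            decoded.discard(flit.txn_id)  # forget the transaction
    return paid


# Two interleaved transactions: txn 1 has three flits, txn 2 has two.
stream = [Flit(1, False), Flit(2, False), Flit(1, False), Flit(2, True), Flit(1, True)]
paid = overhead_per_flit(stream, 2.0)
print(paid)  # [2.0, 2.0, 0.0, 0.0, 0.0]
```

Each transaction pays 2.0 ns once at this node regardless of its flit count, and a single-flit transaction (first and last at once) pays and is forgotten in the same step.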

### D3. Serial worker forwarding (preserves order)

The component's worker is a single SimPy process that consumes flits
from `_inbox` and forwards them serially in arrival order. The
component does NOT spawn `env.process(...)` per flit.

Rationale: if the first flit yields on `overhead_ns` while subsequent
flits run in parallel processes, the later flits can overtake the
first. This produces out-of-order delivery and lets the `is_last`
flit arrive at the destination before the first flit, corrupting
both the transaction's completion semantics and any flit-index-based
processing downstream.

### D4. Path-based next-hop routing

Routing is **not** a Forwarding-component concern. The Transaction
arrives with a pre-computed `path` (built by the router; ADR-0002 /
ADR-0017 D2). The component just looks up its own position in the
path and forwards to `path[index + 1]`:

```python
def _next_hop_in_path(self, txn):
    my_id = self.node.id
    path = txn.path
    for i, n in enumerate(path):
        if n == my_id and i + 1 < len(path):
            return path[i + 1]
    return None
```

If `next_hop` is found and present in `out_ports`, the flit is
forwarded. Otherwise (terminal node), `txn.done.succeed()` is
invoked when the `is_last` flit arrives.

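The lookup can be exercised on its own with a free-function variant. The node ids in the example path are hypothetical and chosen only to show the three outcomes; the logic mirrors the method above.

```python
from typing import Optional, Sequence


def next_hop_in_path(my_id: str, path: Sequence[str]) -> Optional[str]:
    """Return the node after my_id on path, or None at the terminal node."""
    for i, node_id in enumerate(path):
        if node_id == my_id and i + 1 < len(path):
            return path[i + 1]
    return None


# Hypothetical node ids, chosen only to illustrate the lookup.
path = ["pcie_ep", "io_noc", "ucie-N", "noc_router", "hbm_ctrl.pe0"]
print(next_hop_in_path("io_noc", path))        # ucie-N (transit node)
print(next_hop_in_path("hbm_ctrl.pe0", path))  # None (terminal node)
print(next_hop_in_path("switch", path))        # None (node not on the path)
```

Note that a node absent from the path also gets `None`, so it would be treated like a terminal node; the sketch does not distinguish the two cases.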

### D5. Flit-aware mode with Non-Flit fallback

`_FLIT_AWARE = True` opts this component out of the base class's
flit-reassembly logic in `_fan_in`. Flits are placed directly on
`_inbox` (no reassembly), enabling per-flit handling in the worker
loop (D2, D3).

Non-Flit messages (zero-byte control Transactions and other
non-chunkified payloads) fall through to the base class's legacy
`_forward_txn` path via `env.process`. This preserves backward
compatibility for control-plane traffic that does not benefit from
flit-level processing.

### D6. Multi-stream merging at the base class

Multi-stream FIFO merging at routers is the base class's
responsibility, not Forwarding's. The base class's `_fan_in` spawns
one process per `in_port`; all push to a single shared `_inbox`.
Flits from different upstream streams therefore interleave at
flit granularity in `_inbox`'s FIFO order.

The Forwarding worker simply consumes `_inbox` in arrival order,
correctly modeling per-router multi-flow arbitration as
fair-FIFO over the shared inbox.

### D7. Single implementation under multiple impl names

A single `TransitComponent` class is registered under four impl names
in `components.yaml`:

- `builtin.forwarding`: generic forwarding (e.g., `io_noc`,
  `noc_router`, UCIe conn bridges)
- `builtin.switch`: tray-level switch
- `builtin.noc`: cube-level NOC fabric (legacy singleton; current
  NOC routers use `builtin.forwarding`)
- `builtin.ucie`: UCIe protocol endpoint

All four aliases instantiate the same class with the same behavior.
Per-instance differentiation lives only in `attrs.overhead_ns`.
Separate impl names exist as intent tags for readability and to
allow future divergence without backward-incompatible config
changes.

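The alias scheme amounts to several registry keys pointing at one class. The sketch below is illustrative only: the `TransitComponent` stand-in, the `REGISTRY` dict, and the `build` helper are hypothetical, and the real registration goes through `components.yaml`, whose loader is not shown in this ADR.

```python
from typing import Dict, Type


class TransitComponent:
    """Hypothetical stand-in for the real class: stores only overhead_ns."""

    def __init__(self, overhead_ns: float = 0.0) -> None:
        self.overhead_ns = overhead_ns


# One class, four impl names (names from D7; the dict itself is illustrative).
REGISTRY: Dict[str, Type[TransitComponent]] = {
    name: TransitComponent
    for name in ("builtin.forwarding", "builtin.switch", "builtin.noc", "builtin.ucie")
}


def build(impl: str, overhead_ns: float = 0.0) -> TransitComponent:
    """Instantiate the class registered under an impl name."""
    return REGISTRY[impl](overhead_ns=overhead_ns)


switch = build("builtin.switch", overhead_ns=5.0)
ucie = build("builtin.ucie", overhead_ns=8.0)
print(type(switch) is type(ucie), switch.overhead_ns, ucie.overhead_ns)  # True 5.0 8.0
```

If one role later needs different behavior, only its registry entry has to point at a new class; existing configs that name the alias keep working.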

### D8. Configurable `overhead_ns`

A single attribute drives per-instance latency:

| Usage site | impl name | `overhead_ns` |
| --- | --- | --- |
| Tray-level switch | `builtin.switch` | 5.0 |
| Cube NOC router | `builtin.forwarding` | 2.0 |
| IO chiplet io_noc | `builtin.forwarding` | 0.0 |
| UCIe protocol endpoint (`ucie-{N,S,E,W}`) | `builtin.ucie` | 8.0 |
| UCIe conn bridge (`ucie-{PORT}.conn{N}`) | `builtin.forwarding` | 0.0 |

Default is 0.0. The attribute is read at each `run()` invocation, so
dynamic reconfiguration is possible but not currently used.
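
As a rough illustration of how the table values add up, the sketch below sums `overhead_ns` over the transit nodes of a path. The dictionary keys, the helper name, and the example path are hypothetical. Charging the overhead once per hop is an assumption of this sketch, and wire bandwidth and propagation delay (handled elsewhere per ADR-0015 D2) are deliberately left out.

```python
# overhead_ns per usage site, copied from the table above.
OVERHEAD_NS = {
    "switch": 5.0,            # builtin.switch
    "noc_router": 2.0,        # builtin.forwarding
    "io_noc": 0.0,            # builtin.forwarding
    "ucie_endpoint": 8.0,     # builtin.ucie
    "ucie_conn_bridge": 0.0,  # builtin.forwarding
}


def path_overhead_ns(hops):
    """Sum the per-node overhead along a list of usage-site names."""
    return sum(OVERHEAD_NS[kind] for kind in hops)


# Hypothetical cube-to-cube path; the real topology may differ.
hypothetical_path = [
    "noc_router", "noc_router",
    "ucie_endpoint", "ucie_conn_bridge", "ucie_endpoint",
    "noc_router",
]
total = path_overhead_ns(hypothetical_path)
print(total)
```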

## Consequences

### Positive

- A single class handles all transit-node roles in the simulation
  graph, giving a minimal code surface for a high-population component type.
- Flit-aware processing + serial worker preserves wormhole semantics
  across multi-hop paths without per-flit process overhead.
- `overhead_ns` is the only per-instance tunable; routing, BW, and
  address resolution stay cleanly separated in their own components /
  modules.
- Multi-stream merging emerges from the base-class structure; no
  router-specific logic duplicates fair-FIFO arbitration.
- Non-Flit fallback path keeps control-plane traffic working without
  forcing every message into the flit framework.

### Negative

- The single class hides usage-site intent inside `attrs.overhead_ns`
  configuration; readers must consult `topology.yaml` +
  `components.yaml` to see which impl name maps to which behavior
  class.
- Per-flit serial worker is a bottleneck if `overhead_ns` is large
  and many concurrent transactions arrive at the same router; current
  values (0–8 ns) make this negligible.

## Links

- ADR-0002 (Routing distance: path computation)
- ADR-0015 D1 (Component port model)
- ADR-0015 D2 (Wire process: BW + propagation, separate from this
  component)
- ADR-0015 D6 (Transit cube forwarding pattern)
- ADR-0016 D1 (IO chiplet io_noc: uses this component)
- ADR-0017 D1 (Cube NOC routers: use this component)
- ADR-0017 D6 (UCIe decomposition: `ucie-{PORT}` instances use this
  component)
- ADR-0033 D1 (Flit-aware pass-through, first-flit overhead,
  multi-stream merge semantics)
@@ -1,548 +0,0 @@
# IPCQ-DMA Co-design Hardware Design Document

**Status**: Draft - Review Requested
**Date**: 2026-04-28
**Authors**: YW Kang
**Reviewers**: (HW team TBD)
**Related**: ADR-0023 (IPCQ PE Collective), ADR-0025 (Direction Addressing)

---

## 1. Background & Motivation

IPCQ (Inter-PE Communication Queue) is a hardware queue mechanism for
collective communication between PEs. Its core design principle is the
IPCQ-DMA co-design: **when the DMA transfers data, it automatically updates
the IPCQ head/tail pointers from piggybacked metadata, without any separate
control message.**

This document:

1. describes how IPCQ operates at the hardware level in the current PE architecture,
2. verifies how the simulator models that hardware,
3. proposes a design for the actual hardware implementation, and
4. reviews the alternatives and settles on the preferred approach.

---

## 2. High-level Behavior of PE_IPCQ



> Source: [`diagrams/pe_baseline.d2`](diagrams/pe_baseline.d2), rendered with `d2 --layout=elk --scale 1.5`.

### IPCQ hardware operation

**HW Configuration**:
* IPCQ sets up ring-buffer-based unidirectional queues between PEs to carry data.
* Each PE maintains an independent queue pair per direction (N/S/E/W, etc.).
* For each queue pair, IPCQ maintains the sender's head/tail pointers and the receiver's head/tail pointers.

* **IPCQ Slot Region**: the IPCQ receive buffer. As the dashed box in the diagram shows, one of TCM, Cube SRAM, or Local HBM is selected through `buffer_kind`.
  Per-tier performance characteristics (simulation model values, `ipcq_types.py`):

| Buffer Kind | Intrinsic BW | Effective BW (NoC bottleneck) | Use |
|-------------|-------------|-------------------------------|------|
| TCM | 512 GB/s | 512 GB/s (directly attached, bypasses the NoC) | Lowest latency, PE-internal only |
| Cube SRAM | 512 GB/s | 128 GB/s (`sram_to_router_bw`) | Shared within the cube, limited by NoC BW |
| Local HBM | 256 GB/s | 256 GB/s (`hbm_to_router_bw`) | Large capacity, limited by NoC BW |

**Send path (fire-and-forget)**:
1. PE_CPU issues `tl.send(dir, src_addr)`, which delivers an IpcqRequest to PE_IPCQ.
2. PE_IPCQ checks backpressure: `(my_head - peer_tail_cache) < peer.n_slots`.
3. PE_IPCQ computes the peer's rx slot address: `peer_rx_base + (my_head % n_slots) × slot_size`.
4. PE_IPCQ hands an IpcqDmaToken (data + piggybacked metadata: `sender_seq`) to PE_DMA.
5. PE_IPCQ performs `my_head++` and returns to PE_CPU immediately, without waiting for DMA completion.
6. PE_DMA snapshots the source data and sends it over the NoC to the peer PE_DMA.

**Receive path (blocking)**:
1. The peer (receiving) PE_DMA writes the data into the slot and extracts the metadata (`sender_seq`, `dst_addr`) **in the same cycle**.
2. PE_IPCQ identifies the direction by range-matching `dst_addr` and updates `peer_head_cache`.
3. PE_IPCQ sends a wakeup signal to the PE_CPU waiting in `tl.recv(dir)`.
4. PE_CPU reads the data from the slot, and PE_IPCQ performs `my_tail++`.
5. **Credit return**: PE_IPCQ sends a 16B credit packet (`consumer_seq`) to the sender over the NoC.
6. The sender's PE_IPCQ updates `peer_tail_cache` and releases backpressure.

**Core design principles**:
- **Data + head pointer piggyback**: `sender_seq` rides on the DMA data flit, so no separate head synchronization message is needed.
- **Atomic write + metadata**: the receiving DMA performs the slot write and the metadata delivery in the same cycle (I6 invariant).
- **Address-based direction matching**: even when several directions connect to the same peer, they are told apart by `dst_addr` range (ADR-0025).
- **Credit-based flow control**: after consuming a slot, the receiver notifies the sender with a 16B credit.

---

## 3. Simulator Implementation Verification

This section verifies how the simulator models the hardware behavior described above.

### 3.1 Mapping of intent to implementation

| Design intent | Simulator implementation | Location |
|-----------|----------------|------|
| DMA piggybacks the head pointer on the data transfer | The `IpcqDmaToken.sender_seq` field travels with the data flit | `ipcq_types.py:185` |
| Receiving DMA handles data write + metadata delivery atomically | In `_handle_ipcq_inbound`, there is no yield between `store.write` and `IpcqMetaArrival` (I6) | `pe_dma.py:232-275` |
| Send is fire-and-forget | `_handle_ipcq_outbound` does not wait on `sub_done` | `pe_dma.py:182` |
| Recv blocks until data arrives | Waits on the condition `peer_head_cache > my_tail` | `pe_ipcq.py:263` |
| Credit return uses a separate fast path | Direct put through a SimPy Store (latency is charged based on the NoC path) | `pe_ipcq.py:443-469` |
| In-flight data semantics (snapshot) | The data snapshot is preserved at send time and is unaffected by later modification of src | `pe_dma.py:142-155` |
| Single PE_DMA inbox | All in_ports are merged into a single FIFO by `_fan_in`; there is no arbiter between the compute port and the IPCQ port | `base.py:51-53` |

### 3.2 Credit Return Path modeling details

Credit return finds the actual NoC path with `router.find_path()` and charges
the hop latency + BW drain computed by `compute_path_latency_ns()`.

```python
# pe_ipcq.py:471-492 (excerpt)
def _credit_latency_ns(self, direction: str) -> float:
    # `peer_pe_dma` is not defined in this excerpt; its lookup
    # (presumably from `direction`) is elided.
    path = self.ctx.router.find_path(self._pe_prefix, peer_pe_dma)
    return self.ctx.compute_path_latency_ns(path, self._credit_size_bytes)
```

However, the latency is paid with `env.timeout()` and the credit is then put
directly into `peer_credit_store` (a SimPy Store). No actual `Transaction` is
created and pushed hop-by-hop through the NoC, so **bandwidth contention with
other traffic is not modeled.**

| Path | Latency | BW Contention |
|---|---|---|
| Data path (IpcqDmaToken) | Modeled exactly as a NoC Transaction | Traverses the actual fabric |
| Credit path (16B) | NoC path latency reflected exactly | No Transaction injected into the fabric (simplification) |

A credit is 16B, negligible next to a data transfer (tens to hundreds of KB),
so the practical error introduced by this simplification is close to zero.

### 3.3 Verification conclusion

The simulator implementation **accurately models** the IPCQ-DMA co-design intent,
apart from the credit-path contention simplification noted in Section 3.2.

---

## 4. Proposed Hardware Design

### 4.1 Block Diagram (after the change)

Changes are highlighted: **(NEW)** = new, **(MOD)** = modified.



> Source: [`diagrams/pe_proposed.d2`](diagrams/pe_proposed.d2), rendered with `d2 --layout=elk`.

**Key changes, Baseline → Proposed**:
- Single FIFO inbox → **separate compute port / IPCQ port + WRR Arbiter** (NEW)
- PE_IPCQ (SimPy component) → **IPCQ Controller** (HW registers + combinational logic)
- The **reserved IPCQ Slot Region** inside TCM is made explicit
- Credit Injector / Receiver connect directly to the NoC through the Fabric Port

### 4.2 Module Details

#### 4.2.1 IPCQ Controller (new module)

A hardware control block that sits between PE_CPU and the DMA Engine.
It corresponds to `PeIpcqComponent` in the simulator.

##### QPair Register File

Per-direction queue pair state is held in flip-flops.

```
Per-direction registers (each 64-bit):
  my_head          : sender write position (monotonic)
  my_tail          : receiver read position (monotonic)
  peer_head_cache  : last known peer head (updated by Meta Extractor)
  peer_tail_cache  : last known peer tail (updated by Credit Receiver)
  rx_base_pa       : this PE's rx buffer base physical address
  peer_rx_base_pa  : peer's rx buffer base physical address
  n_slots          : ring depth (power-of-2 constraint, see below)
  slot_size        : bytes per slot
  peer_credit_tgt  : credit receive address of the peer PE

Directions: up to 8 (N/S/E/W/parent/child_left/child_right + spare)
Total: 8 dirs × 9 regs × 8B = 576B flip-flops
```

PE_CPU can read and write these registers via MMIO (CSR). Software populates them at init time.

##### Slot Address Generator (combinational)

```
Input:  pointer (my_head or my_tail), n_slots, slot_size, base_pa
Output: slot_addr = base_pa + (pointer % n_slots) * slot_size

Implementation:
  n_slots power-of-2 constraint → pointer & (n_slots - 1)  (AND mask, 1 gate delay)
  slot_size power-of-2          → barrel shift (1 cycle)
  64-bit add                    → ripple or Kogge-Stone adder (1 cycle)

Latency: 1-2 cycles combinational
```
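
The mask-and-shift form above can be checked against the generic modulo/multiply form with a short sketch. The function names are illustrative, and the example values (`0xE00000` base, 8 slots of 64 KiB) are sample inputs rather than a configuration stated by this document.

```python
def is_pow2(x: int) -> bool:
    """True when x is a positive power of two."""
    return x > 0 and (x & (x - 1)) == 0


def slot_addr_reference(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Generic form: modulo and multiply."""
    return base_pa + (pointer % n_slots) * slot_size


def slot_addr_pow2(pointer: int, n_slots: int, slot_size: int, base_pa: int) -> int:
    """Hardware-friendly form: AND mask, shift, add."""
    if not (is_pow2(n_slots) and is_pow2(slot_size)):
        raise ValueError("n_slots and slot_size must be powers of two")
    index = pointer & (n_slots - 1)       # replaces pointer % n_slots
    shift = slot_size.bit_length() - 1    # log2(slot_size)
    return base_pa + (index << shift)     # replaces index * slot_size


addr = slot_addr_pow2(pointer=11, n_slots=8, slot_size=0x10000, base_pa=0xE00000)
print(hex(addr))
```

The two forms agree only under the power-of-2 constraints; for other values the mask and shift would silently compute a different address, which is why the sketch rejects them.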

##### Backpressure Comparator (combinational)

```
full = (my_head - peer_tail_cache) >= n_slots

Implementation: 64-bit subtract + unsigned compare
Output:  stall signal → PE_CPU (IPCQ send blocked) or DMA issue hold
Latency: 1 cycle
```
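
A minimal sketch of the comparator follows. Masking the difference to 64 bits mimics an unsigned hardware subtract; treating the counters as wrapping 64-bit values is an assumption of this sketch, since the document only states that they are monotonic 64-bit registers.

```python
MASK64 = (1 << 64) - 1


def is_full(my_head: int, peer_tail_cache: int, n_slots: int) -> bool:
    """Stall condition: outstanding slots >= ring depth.

    The subtraction is reduced modulo 2**64 so the result stays a valid
    unsigned distance even if my_head has wrapped past zero.
    """
    outstanding = (my_head - peer_tail_cache) & MASK64
    return outstanding >= n_slots


print(is_full(7, 0, 8))   # 7 slots in flight, ring of 8
print(is_full(8, 0, 8))   # ring exhausted, sender must stall
```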

##### Meta Extractor (inbound datapath sideband)

Wired to the inbound vc_comm path of the DMA Engine. It extracts metadata from
the header of each arriving IPCQ flit and updates the queue pair state.

```
Trigger: DMA inbound write completion (same cycle)
Extract: {sender_seq, dst_addr} from flit header

Direction matching (ADR-0025 D2):
  for each dir:
    match = (rx_base_pa[dir] <= dst_addr) && (dst_addr < rx_base_pa[dir] + n_slots[dir] * slot_size[dir])
  8× parallel range comparators + priority encoder

Update: peer_head_cache[matched_dir] = max(peer_head_cache[matched_dir], sender_seq + 1)
Output: recv_wake signal for matched direction → PE_CPU interrupt/flag

Implementation: 8× (2 comparators + AND) + priority encoder
Latency: 1 cycle (pipelined with the DMA write, so I6 atomicity holds naturally)
```
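
The range match and the monotonic `peer_head_cache` update can be sketched in software as below. The `QPair` dataclass and function names are illustrative, and iterating the dictionary in insertion order stands in for the priority encoder, which is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class QPair:
    rx_base_pa: int
    n_slots: int
    slot_size: int
    peer_head_cache: int = 0


def match_direction(qpairs, dst_addr):
    """Return the first direction whose rx buffer range contains dst_addr."""
    for name, qp in qpairs.items():
        end = qp.rx_base_pa + qp.n_slots * qp.slot_size
        if qp.rx_base_pa <= dst_addr < end:
            return name
    return None


def on_inbound(qpairs, sender_seq, dst_addr):
    """Apply the Meta Extractor update for one arriving IPCQ transfer."""
    direction = match_direction(qpairs, dst_addr)
    if direction is None:
        return None  # not an IPCQ slot address
    qp = qpairs[direction]
    # max() keeps the cache monotonic even if arrivals are observed out of order.
    qp.peer_head_cache = max(qp.peer_head_cache, sender_seq + 1)
    return direction


qpairs = {
    "N": QPair(rx_base_pa=0xE00000, n_slots=4, slot_size=0x10000),
    "S": QPair(rx_base_pa=0xE40000, n_slots=4, slot_size=0x10000),
}
matched = on_inbound(qpairs, sender_seq=2, dst_addr=0xE50000)
print(matched, qpairs["S"].peer_head_cache)
```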

##### Credit Injector (outbound)

```
Trigger: recv completion (after my_tail is incremented)
Action:  pack 16B credit packet → DMA vc_comm (or a dedicated credit VC)

Packet:  {consumer_seq = my_tail, dst_rx_base_pa = my_rx_base_pa}
Latency: 1 cycle to generate, then NoC traversal
```

##### Credit Receiver (inbound sideband)

```
Trigger: 16B credit packet arrival (from NoC)
Extract: {consumer_seq, dst_rx_base_pa}

Direction matching (ADR-0025 D3):
  for each dir:
    match = (peer_rx_base_pa[dir] == credit.dst_rx_base_pa)

Update: peer_tail_cache[matched_dir] = max(peer_tail_cache[matched_dir], consumer_seq)
Output: send_wake signal → deassert backpressure stall

Latency: 1 cycle
```

#### 4.2.2 DMA Engine modifications

##### vc_comm IPCQ-aware mode

An IPCQ flit handling mode is added to the existing vc_comm channel.

**Outbound**:
1. Receive a command from the IPCQ Controller: {src_addr, dst_addr, nbytes, sender_seq}
2. Read src_addr from TCM → snapshot into the DMA read buffer (existing DMA behavior)
3. Flit pack: data + piggybacked metadata (sender_seq, dst_addr)
4. Inject into the NoC fabric port
5. Fire-and-forget (do not wait for completion)

**Inbound**:
1. Receive the IPCQ flit from the NoC
2. Terminal BW drain charge (drain_ns = nbytes / bottleneck_bw)
3. Slot write latency charge (backing memory tier)
4. **ATOMIC** (same pipeline stage, no stall insertion):
   - TCM write: data → slot address
   - Meta Extractor trigger: sender_seq + dst_addr → IPCQ Controller
5. Done

**Hardware guarantee of I6 atomicity**: TCM write completion and the Meta Extractor
trigger occur in the same pipeline stage, so no separate synchronization is needed.
The simulator's "no yield between write and IpcqMetaArrival" property holds naturally.

##### Data Snapshot Semantics

Data latched in the DMA read buffer is unaffected by later modification of the source memory.
This is standard DMA read-then-write behavior, so no additional HW is required.

##### Credit Virtual Channel (optional)

Option A: multiplex credits onto vc_comm (distinguished as 16B header-only flits)
Option B: add a third, dedicated credit VC (strict priority > data)

Option B is better for deadlock prevention, but because the BW impact of a 16B credit
is negligible, Option A is also sufficient.

#### 4.2.3 Fabric Flit Format extension

```
Regular data flit (e.g., 512-bit):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│ [479:0]   payload (480b = 60B)           │
└──────────────────────────────────────────┘

IPCQ data flit (metadata in the first flit only):
┌──────────────────────────────────────────┐
│ [511:480] routing header (32b)           │
│   [511]     ipcq_flag (1b)               │  ← distinguishes IPCQ from normal DMA
│   [510:509] vc_id (2b)                   │
│   [508:480] route + hop count            │
│ [479:416] ipcq_metadata (64b)            │  ← piggyback
│   [479:448] sender_seq (32b)             │
│   [447:416] dst_addr[31:0] (32b)         │  ← used for direction matching
│ [415:0]   payload (416b = 52B)           │
└──────────────────────────────────────────┘
Subsequent flits: full 60B payload (no metadata)

Credit-only flit (128-bit, header-only):
┌──────────────────────────────────────────┐
│ [127:96]  routing header (32b)           │
│   [127]     credit_flag (1b)             │
│ [95:64]   consumer_seq (32b)             │
│ [63:0]    dst_rx_base_pa (64b)           │
└──────────────────────────────────────────┘
```

The first flit's payload shrinks from 60B to 52B (about 13% overhead on that flit).
In a multi-flit transfer the subsequent flits carry the full payload, so for large transfers the overhead is below 1%.
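
The bit layout of the first IPCQ flit can be exercised with plain Python integers. This is a sketch of the field positions listed above only; the function names are hypothetical, and the internal split of the 29-bit "route + hop count" field is not specified in the document, so it is treated here as one opaque value.

```python
PAYLOAD_BITS = 416
ROUTE_MASK = (1 << 29) - 1
U32 = 0xFFFFFFFF


def pack_first_flit(vc_id, route, sender_seq, dst_addr, payload=0):
    """Build the 512-bit first flit of an IPCQ transfer as an int."""
    flit = 1 << 511                                 # [511]     ipcq_flag
    flit |= (vc_id & 0x3) << 509                    # [510:509] vc_id
    flit |= (route & ROUTE_MASK) << 480             # [508:480] route + hop count
    flit |= (sender_seq & U32) << 448               # [479:448] sender_seq
    flit |= (dst_addr & U32) << 416                 # [447:416] dst_addr[31:0]
    flit |= payload & ((1 << PAYLOAD_BITS) - 1)     # [415:0]   payload (52B)
    return flit


def unpack_first_flit(flit):
    """Recover the header and metadata fields from a first flit."""
    return {
        "ipcq_flag": (flit >> 511) & 1,
        "vc_id": (flit >> 509) & 0x3,
        "route": (flit >> 480) & ROUTE_MASK,
        "sender_seq": (flit >> 448) & U32,
        "dst_addr_lo": (flit >> 416) & U32,
        "payload": flit & ((1 << PAYLOAD_BITS) - 1),
    }


flit = pack_first_flit(vc_id=1, route=0x123, sender_seq=5, dst_addr=0xE30000, payload=0xABCD)
fields = unpack_first_flit(flit)
print(fields)
```

Only the low 32 bits of `dst_addr` travel in the flit, so direction matching at the receiver has to work on that truncated address.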

#### 4.2.4 TCM IPCQ Slot Region

```
TCM Memory Map (16MB):
┌─────────────────────────────┐ 0x000000
│ Kernel Working Memory       │
│ (compute tensors)           │
│ ~14MB                       │
├─────────────────────────────┤ 0xE00000
│ IPCQ RX Buffers             │
│   Dir N: slots × slot_size  │
│   Dir S: slots × slot_size  │
│   Dir E: slots × slot_size  │
│   Dir W: slots × slot_size  │
│ ~1MB                        │
├─────────────────────────────┤ 0xF00000
│ IPCQ Metadata / Scratch     │
│ ~1MB                        │
└─────────────────────────────┘ 0xFFFFFF
```

The IPCQ region is placed in the upper banks of TCM to minimize bank conflicts
with compute accesses (see Section 7.1).
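
A quick sizing check against the memory map above, as a sketch. The constants come from the map; the helper names and the two example configurations are illustrative, and the document's "MB" is read as MiB because the map boundaries are binary multiples.

```python
KiB = 1 << 10
MiB = 1 << 20

TCM_BYTES = 16 * MiB
IPCQ_RX_BASE = 0xE00000
IPCQ_RX_END = 0xF00000


def rx_region_bytes(n_dirs: int, n_slots: int, slot_size: int) -> int:
    """Total rx buffer footprint for a given queue configuration."""
    return n_dirs * n_slots * slot_size


def fits_in_rx_region(n_dirs: int, n_slots: int, slot_size: int) -> bool:
    """True when the configuration fits the reserved IPCQ RX window."""
    return rx_region_bytes(n_dirs, n_slots, slot_size) <= IPCQ_RX_END - IPCQ_RX_BASE


small = rx_region_bytes(4, 4, 64 * KiB)   # 4 directions, 4 slots, 64 KiB slots
large = rx_region_bytes(8, 8, 64 * KiB)   # 8 directions, 8 slots, 64 KiB slots
print(small // MiB, fits_in_rx_region(4, 4, 64 * KiB))
print(large // MiB, large / TCM_BYTES)
```

Under these numbers a four-direction, four-slot setup exactly fills the 1 MiB window, while the eight-by-eight case needs 4 MiB, a quarter of the TCM.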

---

## 5. End-to-End Dataflow

### 5.1 Sequence Diagram

```mermaid
sequenceDiagram
    participant CPU_A as PE_A: PE_CPU
    participant IPCQ_A as PE_A: IPCQ Ctrl
    participant DMA_A as PE_A: DMA
    participant NOC as NoC Fabric
    participant DMA_B as PE_B: DMA
    participant IPCQ_B as PE_B: IPCQ Ctrl
    participant TCM_B as PE_B: TCM
    participant CPU_B as PE_B: PE_CPU

    Note over CPU_A: tl.send(dir="E", src=0x1000)

    CPU_A->>IPCQ_A: MMIO: send request
    Note over IPCQ_A: Backpressure check:<br/>(head - peer_tail_cache) < n_slots → PASS<br/>Slot addr gen:<br/>dst = peer_rx_base + (head%n) × slot_size
    IPCQ_A->>DMA_A: IpcqDmaToken {src, dst, sender_seq=head}
    Note over IPCQ_A: my_head++
    IPCQ_A-->>CPU_A: send returns (fire-and-forget)

    Note over DMA_A: TCM read → snapshot in read buffer<br/>Flit pack: data + {sender_seq, dst_addr}
    DMA_A->>NOC: IPCQ data flit(s)

    Note over NOC: hop latency + BW drain

    NOC->>DMA_B: IPCQ data flit(s)
    Note over DMA_B: Terminal BW drain<br/>Slot write latency

    rect rgb(255, 240, 220)
        Note over DMA_B,IPCQ_B: ATOMIC (I6): same cycle, no stall
        DMA_B->>TCM_B: write data → slot address
        DMA_B->>IPCQ_B: Meta Extractor: {sender_seq, dst_addr}
    end

    Note over IPCQ_B: Range match dst_addr → direction "W"<br/>peer_head_cache["W"] = sender_seq + 1
    IPCQ_B-->>CPU_B: recv_wake signal

    Note over CPU_B: tl.recv(dir="W") wakes up
    CPU_B->>IPCQ_B: recv request
    Note over IPCQ_B: peer_head_cache > my_tail → YES<br/>slot_addr = rx_base + (tail%n) × slot_size
    IPCQ_B-->>CPU_B: return slot_addr
    CPU_B->>TCM_B: read data from slot
    Note over IPCQ_B: my_tail++

    IPCQ_B->>NOC: Credit (16B): {consumer_seq, dst_rx_base_pa}
    Note over NOC: credit traversal (NoC latency)
    NOC->>IPCQ_A: Credit arrival

    Note over IPCQ_A: Match dst_rx_base_pa → direction "E"<br/>peer_tail_cache["E"] = consumer_seq<br/>Backpressure deassert (if stalled)
```
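
The pointer updates in the sequence above can be replayed with a zero-latency toy model. It is a sketch only: the `Endpoint` class and helper names are hypothetical, a single direction is modeled, and the NoC, DMA timing, and credit packet format are collapsed into direct field updates.

```python
class Endpoint:
    """One side of a single-direction IPCQ queue pair."""

    def __init__(self, n_slots: int):
        self.n_slots = n_slots
        self.my_head = 0
        self.my_tail = 0
        self.peer_head_cache = 0
        self.peer_tail_cache = 0
        self.slots = [None] * n_slots


def send(sender: Endpoint, receiver: Endpoint, payload) -> bool:
    """Fire-and-forget send; returns False when backpressure stalls it."""
    if sender.my_head - sender.peer_tail_cache >= receiver.n_slots:
        return False
    seq = sender.my_head
    sender.my_head += 1
    # Inbound side: slot write and metadata update happen together (I6).
    receiver.slots[seq % receiver.n_slots] = payload
    receiver.peer_head_cache = max(receiver.peer_head_cache, seq + 1)
    return True


def recv(receiver: Endpoint, sender: Endpoint):
    """Return the next payload, or None when nothing has arrived yet."""
    if receiver.peer_head_cache <= receiver.my_tail:
        return None
    payload = receiver.slots[receiver.my_tail % receiver.n_slots]
    receiver.my_tail += 1
    # Credit return: the sender learns the new consumer position.
    sender.peer_tail_cache = max(sender.peer_tail_cache, receiver.my_tail)
    return payload


a, b = Endpoint(n_slots=2), Endpoint(n_slots=2)
results = [send(a, b, f"m{i}") for i in range(3)]   # third send hits backpressure
first = recv(b, a)                                   # frees one slot via credit
retry = send(a, b, "m2")
print(results, first, retry)
```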

---

## 6. 2nm Implementation Analysis

### 6.1 Area Estimate

| Module | Gate Count | Area (2nm est.) | Notes |
|--------|-----------|-----------------|-------|
| QPair Register File | ~4.6K FF | 0.002 mm² | 576B flip-flops |
| Slot Addr Gen + Backpressure | ~5K gates | 0.001 mm² | Combinational |
| Meta Extractor + Credit Logic | ~3K gates | 0.001 mm² | 8× parallel comparators |
| **Total IPCQ Controller** | **~12.6K** | **~0.004 mm²** | **< 0.1% of the whole PE** |
| DMA vc_comm extension | ~2K gates | 0.002 mm² | Flit pack/unpack |
| **Total change** | **~14.6K** | **~0.006 mm²** | |

### 6.2 Timing

| Path | Delay (2nm est.) | Target Clock | Margin |
|------|-------------------|-------------|--------|
| Backpressure (sub + cmp) | ~0.3 ns | 1 GHz (1 ns) | 3× |
| Slot Addr Gen (mask + shift + add) | ~0.5 ns | 1 GHz | 2× |
| Meta Extractor (8× range match) | ~0.4 ns | 1 GHz | 2.5× |
| Credit Receiver (8× equality) | ~0.3 ns | 1 GHz | 3× |

All critical paths fit within one cycle, so no timing-closure problem is expected.

### 6.3 Power

- Active: ~1 mW (register read/write + comparators, during send/recv operation)
- Idle: leakage only
- Negligible relative to total PE power

### 6.4 Constraints

| Item | Constraint | Rationale |
|------|------|------|
| `n_slots` | **Must be a power of 2** | mod → AND mask (1 gate). Arbitrary values need a divider (~10 cycles) |
| `slot_size` | **Power of 2 recommended** | mul → barrel shift. Arbitrary values need a multiplier |
| TCM IPCQ region | **Placed in a dedicated bank** | Prevents bank conflicts with compute accesses |

---

## 7. Risk Assessment

### 7.1 TCM Bank Conflict

- **Risk**: a stall when an IPCQ slot write and a compute read hit the same bank
- **Mitigation**: place the IPCQ region in a dedicated bank at the upper TCM addresses
- **Cost**: slightly reduced TCM banking flexibility
- **Severity**: Medium for performance; Low for correctness (this is not a correctness issue)

### 7.2 Credit Return Latency under Congestion

- **Risk**: under NoC congestion, delayed credit return → sender backpressure stall
- **Mitigation**:
  - Move credits to a separate VC with strict priority (at 16B the BW impact is minimal)
  - Or size n_slots generously (8+) so the buffer absorbs credit delay
- **Severity**: Low (16B credits contribute almost nothing to congestion)

### 7.3 Inter-Direction Ordering

- **Risk**: ordering when the same PE sends in several directions at once
- **Mitigation**: a per-direction monotonic seq is sufficient. Inter-direction ordering is
  the kernel's (software's) responsibility, the same as in the current simulator model
- **Severity**: Low (resolved by the architecture design)

---

## 8. Alternatives Considered

### 8.1 Doorbell + Polling (traditional approach)

```
Send: DMA write data → DMA write doorbell register at peer → peer polls doorbell
Recv: Polling loop on doorbell, or interrupt-driven
```

| Pros | Cons |
|------|------|
| Simple HW (no IPCQ controller needed) | Two DMA transactions (data + doorbell) |
| Reuses the existing DMA | Ordering between data and doorbell must be guaranteed (fence) |
| | Polling wastes power; interrupts add latency overhead |

**Evaluation**: latency is 2-3× higher than with piggybacking. **Rejected.**

### 8.2 Hardware Message Queue (NVIDIA NVLink style)

```
Send: CPU pushes a descriptor to the HMQ → HW forwards it to the peer HMQ automatically
Recv: pop a descriptor from the HMQ → read the data pointer
```

| Pros | Cons |
|------|------|
| CPU only writes a descriptor | Requires a separate HMQ engine (~0.05 mm²) |
| Descriptor/data separation → flexible | Datapath separate from DMA → duplicated area/power |
| | Large tensors still need DMA in the end |

**Evaluation**: DMA is mandatory for the large-tensor patterns of CCL, so a dual
HMQ + DMA structure wastes area. **Rejected.**

### 8.3 RDMA-style Completion Queue (CQ)

```
Send: DMA write → a CQE is generated automatically at the peer
Recv: CQ poll/interrupt → locate the data
```

| Pros | Cons |
|------|------|
| Mature InfiniBand/RoCE model | CQ management logic + CQE memory overhead |
| Easy multi-tenancy/isolation | CQE/data ordering guarantee needed in addition |
| | Over-engineered for PE-to-PE CCL |

**Evaluation**: an RDMA CQ suits multi-tenant isolation on a host-facing NIC.
In a single-owner PE-to-PE environment it is unnecessary complexity. **Rejected.**

### 8.4 Credit-in-Data Piggyback (v2 optimization candidate)

In the current design, credit return is a separate 16B packet.
For bidirectional communication patterns, **the credit can be merged into a data flit travelling in the reverse direction.**

```
PE_A →E→ PE_B: data + sender_seq=3
PE_B →W→ PE_A: data + sender_seq=5 + credit_ack=4  ← credit merged into data
```

| Pros | Cons |
|------|------|
| Removes dedicated credit packets → saves NoC BW | Unidirectional patterns need a fallback |
| Credit latency → 0 in bidirectional allreduce | Adds 8B to the flit header (minimal overhead) |
| | Slightly higher logic complexity |

**Evaluation**: a strong optimization of the current design.
Credit packets can be eliminated entirely in bidirectional allreduce.
The standalone credit fallback is retained. **Recommended for adoption in v2.**

---

## 9. Recommendations

1. **Adopt the current IPCQ-DMA co-design as the baseline hardware design.**
   It is simple and area-efficient, with no timing/power issues at 2nm.

2. **Constrain n_slots to a power of 2.**
   This replaces the mod operation with an AND mask and shortens the critical path.

3. **Allocate a dedicated bank for the IPCQ region in TCM banking.**
   This prevents bank conflicts with compute.

4. **Evaluate adding Credit-in-Data Piggyback (Section 8.4) in v2.**
   This removes credit overhead in bidirectional patterns.

---

## 10. Open Questions

- [ ] What percentage of TCM may the IPCQ slot region occupy? (current assumption: ~1MB / 16MB = 6.25%)
- [ ] Should credits get a separate VC, or be multiplexed onto vc_comm?
- [ ] Flit format compatibility on the inter-SIP link needs to be verified
- [ ] Should n_slots have a maximum? (8 directions × 8 slots × 64KB = 4MB → 25% of TCM)
@@ -582,7 +582,7 @@ If you add a new algorithm or pattern, please send a PR.
|
|||||||
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
|
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective design.
|
||||||
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
|
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1).
|
||||||
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
|
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution.
|
||||||
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor.
|
- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model.
|
||||||
|
|
||||||
Existing algorithm examples:
|
Existing algorithm examples:
|
||||||
|
|
||||||
@@ -527,7 +527,7 @@ direct send 후 다른 step에서 같은 주소를 store해도 안전하다 (tok
|
|||||||
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계
|
- [ADR-0023](adr/ADR-0023-ipcq-pe-collective.md): IPCQ + PE-level collective 설계
|
||||||
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
|
- [ADR-0022](adr/ADR-0022-program-id-2d-grid.md): 2D grid program_id (axis=0/1)
|
||||||
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
|
- [ADR-0020](adr/ADR-0020-data-execution-two-pass.md): 2-pass data execution
|
||||||
- [ADR-0021](adr/ADR-0021-pe-pipeline-refactor.md): PE pipeline refactor
|
- [ADR-0014](adr/ADR-0014-dev-pe-pipeline-execution-model.md): PE pipeline execution model
|
||||||
|
|
||||||
기존 알고리즘 예제:
|
기존 알고리즘 예제:
|
||||||
|
|
||||||
@@ -5,5 +5,5 @@ This package provides:
|
|||||||
- helpers: utilities for algorithm authors (chunked, ring_step, ...)
|
- helpers: utilities for algorithm authors (chunked, ring_step, ...)
|
||||||
- testing: mock CCL runtime for fast unit tests of algorithm kernels
|
- testing: mock CCL runtime for fast unit tests of algorithm kernels
|
||||||
|
|
||||||
See docs/adr/ADR-0023-ipcq-pe-collective.md and docs/ccl-author-guide.md.
|
See docs/adr/ADR-0023-dev-ipcq-pe-collective.md and docs/onboarding/ccl-author-guide.md.
|
||||||
"""
|
"""
|
||||||
|
|||||||
@@ -24,7 +24,7 @@ class Scope(Enum):
|
|||||||
|
|
||||||
@dataclass(frozen=True)
|
@dataclass(frozen=True)
|
||||||
class OpSpec:
|
class OpSpec:
|
||||||
"""One operation in a multi-op composite (head + epilogue, ADR-0021).
|
"""One operation in a multi-op composite (head + epilogue, ADR-0014 D3.3).
|
||||||
|
|
||||||
The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
|
The head op (first in CompositeCmd.ops) defines tile geometry; subsequent
|
||||||
ops are epilogue stages whose ``scope`` controls how often they fire
|
ops are epilogue stages whose ``scope`` controls how often they fire
|
||||||
@@ -156,7 +156,7 @@ class CompositeCmd:
|
|||||||
out_nbytes: int
|
out_nbytes: int
|
||||||
math_op: str | None = None # for op="math": which math operation
|
math_op: str | None = None # for op="math": which math operation
|
||||||
data_op: bool = True
|
data_op: bool = True
|
||||||
# Multi-op composite (ADR-0021 extension): when non-empty, ops[0] is the
|
# Multi-op composite (ADR-0014 D3.3): when non-empty, ops[0] is the
|
||||||
# head and ops[1:] are epilogue stages with explicit scope. When empty,
|
# head and ops[1:] are epilogue stages with explicit scope. When empty,
|
||||||
# the legacy single-op semantics (op/a/b/math_op) apply.
|
# the legacy single-op semantics (op/a/b/math_op) apply.
|
||||||
ops: tuple[OpSpec, ...] = ()
|
ops: tuple[OpSpec, ...] = ()
|
||||||
|
|||||||
@@ -15,7 +15,7 @@ if TYPE_CHECKING:
|
|||||||
|
|
||||||
|
|
||||||
class HbmCtrlComponent(ComponentBase):
|
class HbmCtrlComponent(ComponentBase):
|
||||||
"""HBM controller with per-pseudo-channel (PC) striping (ADR-0019 D1, ADR-0033).
|
"""HBM controller with per-pseudo-channel (PC) striping (ADR-0017 D4, ADR-0033).
|
||||||
|
|
||||||
Stateless per-PC ``available_at`` array; each incoming transaction is
|
Stateless per-PC ``available_at`` array; each incoming transaction is
|
||||||
split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
|
split into ``ceil(nbytes / burst_bytes)`` chunks distributed round-robin
|
||||||
|
|||||||
@@ -267,8 +267,9 @@ class MCpuComponent(ComponentBase):
|
|||||||
def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
|
def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
|
||||||
"""Return list of HBM destination node_ids for DMA fan-out.
|
"""Return list of HBM destination node_ids for DMA fan-out.
|
||||||
|
|
||||||
With single hbm_ctrl per cube (ADR-0019), always returns one node.
|
The PA-based resolver maps each address to one per-PE
|
||||||
PA-based resolution still used for cross-cube routing.
|
``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
|
||||||
|
one node. Cross-cube routing uses the same resolution.
|
||||||
"""
|
"""
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
|
|
||||||
|
|||||||
@@ -17,9 +17,11 @@ if TYPE_CHECKING:
|
|||||||
class PeDmaComponent(PeEngineBase):
|
class PeDmaComponent(PeEngineBase):
|
||||||
"""PE_DMA: dual-channel DMA engine with READ and WRITE resources.
|
"""PE_DMA: dual-channel DMA engine with READ and WRITE resources.
|
||||||
|
|
||||||
Each channel has capacity=1 (ADR-0014 D4):
|
Compute channels (vc_compute) have capacity=1 each (ADR-0014 D4):
|
||||||
- DMA_READ and DMA_WRITE may execute concurrently.
|
- DMA_READ and DMA_WRITE may execute concurrently.
|
||||||
- Multiple READs cannot overlap; multiple WRITEs cannot overlap.
|
- Multiple READs cannot overlap; multiple WRITEs cannot overlap.
|
||||||
|
The orthogonal vc_comm channel for IPCQ traffic is defined in
|
||||||
|
ADR-0023 D8.
|
||||||
|
|
||||||
Handles two message types:
|
Handles two message types:
|
||||||
- Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
|
- Transaction: external fabric messages (PeDmaMsg probes, M_CPU DMA)
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0021 D5).
|
"""PE_FETCH_STORE: TCM ↔ Register File transfer unit (ADR-0014 D1).
|
||||||
|
|
||||||
Handles both fetch (TCM → register) and store (register → TCM).
|
Handles both fetch (TCM → register) and store (register → TCM).
|
||||||
BW serialization is delegated to PE_TCM via port communication.
|
BW serialization is delegated to PE_TCM via port communication.
|
||||||
@@ -18,7 +18,7 @@ if TYPE_CHECKING:
|
|||||||
|
|
||||||
|
|
||||||
class PeFetchStoreComponent(PeEngineBase):
|
class PeFetchStoreComponent(PeEngineBase):
|
||||||
"""PE_FETCH_STORE: TCM ↔ Register File (ADR-0021 D5).
|
"""PE_FETCH_STORE: TCM ↔ Register File (ADR-0014 D1).
|
||||||
|
|
||||||
Receives TileTokens via pipeline self-routing.
|
Receives TileTokens via pipeline self-routing.
|
||||||
Sends TcmRequest to PE_TCM for BW-based latency.
|
Sends TcmRequest to PE_TCM for BW-based latency.
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""PE_GEMM: matrix multiplication engine (ADR-0021 D6).
|
"""PE_GEMM: matrix multiplication engine (ADR-0014 D1).
|
||||||
|
|
||||||
Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
|
Handles both legacy PeInternalTxn (GemmCmd) and pipeline TileToken.
|
||||||
In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
|
In pipeline mode, receives token after fetch stage, computes MAC, chains to next.
|
||||||
@@ -32,7 +32,7 @@ _DTYPE_BITS: dict[str, int] = {
|
|||||||
|
|
||||||
|
|
||||||
class PeGemmComponent(PeEngineBase):
|
class PeGemmComponent(PeEngineBase):
|
||||||
"""PE_GEMM: MAC array (ADR-0021 D6).
|
"""PE_GEMM: MAC array (ADR-0014 D1).
|
||||||
|
|
||||||
In pipeline mode: pure compute — register data already fetched.
|
In pipeline mode: pure compute — register data already fetched.
|
||||||
In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
|
In legacy mode: handles PeInternalTxn(GemmCmd) with shared accel_slot.
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""PE_MATH: element-wise / reduction computation engine (ADR-0021 D6).
|
"""PE_MATH: element-wise / reduction computation engine (ADR-0014 D1).
|
||||||
|
|
||||||
Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
|
Handles both legacy PeInternalTxn (MathCmd) and pipeline TileToken.
|
||||||
In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
|
In pipeline mode, receives token after fetch stage, computes SIMD, chains to next.
|
||||||
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
|
|||||||
|
|
||||||
|
|
||||||
class PeMathComponent(PeEngineBase):
|
class PeMathComponent(PeEngineBase):
|
||||||
"""PE_MATH: SIMD/Vector unit (ADR-0021 D6).
|
"""PE_MATH: SIMD/Vector unit (ADR-0014 D1).
|
||||||
|
|
||||||
In pipeline mode: pure compute — register data already fetched.
|
In pipeline mode: pure compute — register data already fetched.
|
||||||
In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
|
In legacy mode: handles PeInternalTxn(MathCmd) with shared accel_slot.
|
||||||
|
|||||||
@@ -1,10 +1,10 @@
|
|||||||
"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0021 D2).
|
"""PE_SCHEDULER: plan generation + tile dispatch (ADR-0014 D6).
|
||||||
|
|
||||||
Receives PeInternalTxn from PE_CPU, routes to engines:
|
Receives PeInternalTxn from PE_CPU, routes to engines:
|
||||||
- Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
|
- Simple commands (DmaReadCmd, GemmCmd, etc.) → direct dispatch to engine
|
||||||
- CompositeCmd → generate TilePlan, feed tiles via _feed_loop
|
- CompositeCmd → generate TilePlan, feed tiles via _feed_loop
|
||||||
|
|
||||||
Composite pipeline uses token self-routing (ADR-0021 D4):
|
Composite pipeline uses token self-routing (ADR-0014 D6):
|
||||||
Scheduler only does initial dispatch + completion tracking.
|
Scheduler only does initial dispatch + completion tracking.
|
||||||
Tiles chain through components based on their plan's stage sequence.
|
Tiles chain through components based on their plan's stage sequence.
|
||||||
"""
|
"""
|
||||||
@@ -24,7 +24,7 @@ if TYPE_CHECKING:
|
|||||||
|
|
||||||
|
|
||||||
class PeSchedulerComponent(ComponentBase):
|
class PeSchedulerComponent(ComponentBase):
|
||||||
"""PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, ADR-0021 D2).
|
"""PE_SCHEDULER: sole dispatcher inside a PE (ADR-0014 D1, D6).
|
||||||
|
|
||||||
Simple commands are forwarded to the appropriate engine.
|
Simple commands are forwarded to the appropriate engine.
|
||||||
CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
|
CompositeCmd creates a TilePlan and feeds tiles into the pipeline.
|
||||||
@@ -104,7 +104,7 @@ class PeSchedulerComponent(ComponentBase):
|
|||||||
def _dispatch_composite(
|
def _dispatch_composite(
|
||||||
self, env: simpy.Environment, pe_txn: Any, cmd: Any,
|
self, env: simpy.Environment, pe_txn: Any, cmd: Any,
|
||||||
) -> Generator:
|
) -> Generator:
|
||||||
"""Generate plan and enqueue to feeder. Non-blocking (ADR-0021 D4)."""
|
"""Generate plan and enqueue to feeder. Non-blocking (ADR-0014 D6)."""
|
||||||
from kernbench.components.builtin.pe_types import PipelineContext
|
from kernbench.components.builtin.pe_types import PipelineContext
|
||||||
|
|
||||||
plan = self._generate_plan(cmd)
|
plan = self._generate_plan(cmd)
|
||||||
@@ -121,7 +121,7 @@ class PeSchedulerComponent(ComponentBase):
|
|||||||
yield self._pending_feeds.put((plan, ctx))
|
yield self._pending_feeds.put((plan, ctx))
|
||||||
|
|
||||||
def _feed_loop(self, env: simpy.Environment) -> Generator:
|
def _feed_loop(self, env: simpy.Environment) -> Generator:
|
||||||
"""Single feeder process: FIFO command ordering (ADR-0021 D2).
|
"""Single feeder process: FIFO command ordering (ADR-0014 D6).
|
||||||
|
|
||||||
No tile feed interleaving between commands.
|
No tile feed interleaving between commands.
|
||||||
Queue full → only this process blocks.
|
Queue full → only this process blocks.
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0021).
|
"""PE_TCM: tightly-coupled memory with BW-based access serialization (ADR-0014 D1).
|
||||||
|
|
||||||
Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
|
Models scratchpad memory inside the PE. Handles both legacy Transaction forwarding
|
||||||
and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
|
and TcmRequest from PE_FETCH_STORE for BW-serialized read/write access.
|
||||||
@@ -32,7 +32,7 @@ class TcmRequest:
|
|||||||
|
|
||||||
|
|
||||||
class PeTcmComponent(ComponentBase):
|
class PeTcmComponent(ComponentBase):
|
||||||
"""PE_TCM: BW-serialized scratchpad memory (ADR-0021 D1).
|
"""PE_TCM: BW-serialized scratchpad memory (ADR-0014 D1).
|
||||||
|
|
||||||
Dual-channel: read and write can proceed in parallel,
|
Dual-channel: read and write can proceed in parallel,
|
||||||
but concurrent reads serialize, concurrent writes serialize.
|
but concurrent reads serialize, concurrent writes serialize.
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""PE pipeline types for ADR-0021: TileToken, TilePlan, Stage, PipelineContext.
|
"""PE pipeline types for ADR-0014 D6: TileToken, TilePlan, Stage, PipelineContext.
|
||||||
|
|
||||||
These types are used by the PE_SCHEDULER and all PE engine components
|
These types are used by the PE_SCHEDULER and all PE engine components
|
||||||
for tile-based pipeline execution with self-routing.
|
for tile-based pipeline execution with self-routing.
|
||||||
@@ -84,7 +84,7 @@ class PipelineContext:
|
|||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class TileToken:
|
class TileToken:
|
||||||
"""Self-routing tile token passed between PE components (ADR-0021 D9).
|
"""Self-routing tile token passed between PE components (ADR-0014 D6).
|
||||||
|
|
||||||
Single-owner: only one component holds this token at any time.
|
Single-owner: only one component holds this token at any time.
|
||||||
params is a cache of plan.stages[stage_idx].params (canonical source).
|
params is a cache of plan.stages[stage_idx].params (canonical source).
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Tile plan generators for PE pipeline (ADR-0021).
|
"""Tile plan generators for PE pipeline (ADR-0014 D6).
|
||||||
|
|
||||||
Generates TilePlan with stage sequences for GEMM and Math operations.
|
Generates TilePlan with stage sequences for GEMM and Math operations.
|
||||||
Ported from pe_accel tiling.py with stage-based plan structure.
|
Ported from pe_accel tiling.py with stage-based plan structure.
|
||||||
|
|||||||
@@ -1,2 +1,2 @@
|
|||||||
# Legacy component backups — not actively used.
|
# Legacy component backups — not actively used.
|
||||||
# Kept for reference during ADR-0021 migration.
|
# Kept for reference during the PE pipeline refactor (ADR-0014).
|
||||||
|
|||||||
@@ -264,8 +264,9 @@ class MCpuComponent(ComponentBase):
|
|||||||
def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
|
def _resolve_dma_destinations(self, request: Any, target_pe: int | str) -> list[str]:
|
||||||
"""Return list of HBM destination node_ids for DMA fan-out.
|
"""Return list of HBM destination node_ids for DMA fan-out.
|
||||||
|
|
||||||
With single hbm_ctrl per cube (ADR-0019), always returns one node.
|
The PA-based resolver maps each address to one per-PE
|
||||||
PA-based resolution still used for cross-cube routing.
|
``hbm_ctrl.pe{X}`` (ADR-0017 D9), so this method returns exactly
|
||||||
|
one node. Cross-cube routing uses the same resolution.
|
||||||
"""
|
"""
|
||||||
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
cube_prefix = self.node.id.rsplit(".", 1)[0] # e.g. "sip0.cube0"
|
||||||
|
|
||||||
|
|||||||
@@ -20,7 +20,7 @@ _AHBM_SEL_BIT = 37
|
|||||||
_AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM
|
_AHBM_LOCAL_USED = 38 # bits actually meaningful for AHBM
|
||||||
|
|
||||||
# HBM-offset bit layout for PC (pseudo-channel) striping
|
# HBM-offset bit layout for PC (pseudo-channel) striping
|
||||||
# (ADR-0033 D6, ADR-0019). Given burst_bytes = 2^B and num_pcs = 2^P
|
# (ADR-0033 D6, ADR-0017 D8). Given burst_bytes = 2^B and num_pcs = 2^P
|
||||||
# configured at hbm_ctrl, the PC index is derived from hbm_offset as
|
# configured at hbm_ctrl, the PC index is derived from hbm_offset as
|
||||||
# pc_shift = B; pc_mask = (1 << P) - 1
|
# pc_shift = B; pc_mask = (1 << P) - 1
|
||||||
# pc = (hbm_offset >> pc_shift) & pc_mask
|
# pc = (hbm_offset >> pc_shift) & pc_mask
|
||||||
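The PC-index formula in the comment above can be checked with a small standalone sketch. The function name `pc_index` and the power-of-two validation are assumptions for illustration; the shift and mask follow the comment (`pc_shift = B`, `pc_mask = (1 << P) - 1`).

```python
def pc_index(hbm_offset: int, burst_bytes: int, num_pcs: int) -> int:
    """Derive the pseudo-channel index from an HBM offset.

    Assumes burst_bytes = 2^B and num_pcs = 2^P, as stated in the comment.
    """
    for name, value in (("burst_bytes", burst_bytes), ("num_pcs", num_pcs)):
        if value <= 0 or value & (value - 1):
            raise ValueError(f"{name} must be a positive power of two")
    pc_shift = burst_bytes.bit_length() - 1  # B
    pc_mask = num_pcs - 1                    # (1 << P) - 1
    return (hbm_offset >> pc_shift) & pc_mask
```

Consecutive bursts land on consecutive PCs and wrap after `num_pcs` bursts, which is the striping behaviour the comment describes.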
|
|||||||
@@ -35,7 +35,7 @@ class AddressResolver:
|
|||||||
def __init__(self, graph: TopologyGraph) -> None:
|
def __init__(self, graph: TopologyGraph) -> None:
|
||||||
self._node_ids = set(graph.nodes)
|
self._node_ids = set(graph.nodes)
|
||||||
# HBM slice size (bytes) — used to decode pe_id from hbm_offset
|
# HBM slice size (bytes) — used to decode pe_id from hbm_offset
|
||||||
# so HBM PA → hbm_ctrl.pe{X} (ADR-0019 D1/D4).
|
# so HBM PA → hbm_ctrl.pe{X} (ADR-0017 D4/D9).
|
||||||
mm = graph.spec.get("cube", {}).get("memory_map", {})
|
mm = graph.spec.get("cube", {}).get("memory_map", {})
|
||||||
hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
|
hbm_total_gb = int(mm.get("hbm_total_gb_per_cube", 48))
|
||||||
slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
|
slices_per_cube = int(mm.get("hbm_slices_per_cube", 8))
|
||||||
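The slice decode described in the comment (HBM PA to `hbm_ctrl.pe{X}`) might look roughly like the sketch below. It assumes that slices are equal-sized and contiguous, and that "GB" in `hbm_total_gb_per_cube` means 2^30 bytes; neither is confirmed by the hunk. The helper name `hbm_node_for_offset` is hypothetical; the defaults of 48 and 8 and the node-id shape are taken from the code shown.

```python
def hbm_node_for_offset(
    cube_prefix: str,
    hbm_offset: int,
    hbm_total_gb: int = 48,
    slices_per_cube: int = 8,
) -> str:
    """Map an HBM offset within a cube to the per-PE controller node id."""
    slice_bytes = (hbm_total_gb << 30) // slices_per_cube  # assumes GB == 2^30 bytes
    pe_id = hbm_offset // slice_bytes
    if not 0 <= pe_id < slices_per_cube:
        raise ValueError(f"hbm_offset {hbm_offset:#x} is outside the cube's HBM")
    return f"{cube_prefix}.hbm_ctrl.pe{pe_id}"
```

With the defaults each slice is 6 GiB, so an offset of exactly 6 GiB is the first byte of PE 1's slice.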
@@ -129,7 +129,7 @@ class PathRouter:
|
|||||||
Otherwise the cube's own UCIe port appears as a zero-distance
|
Otherwise the cube's own UCIe port appears as a zero-distance
|
||||||
bus that Dijkstra prefers over the mesh — that is intended only
|
bus that Dijkstra prefers over the mesh — that is intended only
|
||||||
for cross-cube routing. Local PE_DMA must traverse the mesh so
|
for cross-cube routing. Local PE_DMA must traverse the mesh so
|
||||||
cross-PE-slice access pays the mesh-distance cost (ADR-0019 D4).
|
cross-PE-slice access pays the mesh-distance cost (ADR-0017 D7).
|
||||||
"""
|
"""
|
||||||
start = f"{src_pe}.pe_dma"
|
start = f"{src_pe}.pe_dma"
|
||||||
adj = self._adj_local if _same_cube(start, dst_node) else self._adj
|
adj = self._adj_local if _same_cube(start, dst_node) else self._adj
|
||||||
@@ -137,13 +137,13 @@ class PathRouter:
|
|||||||
|
|
||||||
def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
|
def find_path_with_distance(self, src_pe: str, dst_node: str) -> tuple[list[str], float]:
|
||||||
"""Match find_path's cube-local routing so reported distance reflects
|
"""Match find_path's cube-local routing so reported distance reflects
|
||||||
the actual chosen path (ADR-0019 D4)."""
|
the actual chosen path (ADR-0017 D7)."""
|
||||||
start = f"{src_pe}.pe_dma"
|
start = f"{src_pe}.pe_dma"
|
||||||
adj = self._adj_local if _same_cube(start, dst_node) else self._adj
|
adj = self._adj_local if _same_cube(start, dst_node) else self._adj
|
||||||
return self._run_dijkstra_with_dist(adj, start, dst_node)
|
return self._run_dijkstra_with_dist(adj, start, dst_node)
|
||||||
|
|
||||||
def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
|
def find_mcpu_dma_path(self, m_cpu_id: str, dst_hbm_id: str) -> list[str]:
|
||||||
"""M_CPU DMA path: routes through router mesh (ADR-0019).
|
"""M_CPU DMA path: routes through router mesh (ADR-0017).
|
||||||
|
|
||||||
Same-cube: uses _adj_local (no UCIe) to stay within mesh.
|
Same-cube: uses _adj_local (no UCIe) to stay within mesh.
|
||||||
Cross-cube: uses _adj_all to route via UCIe.
|
Cross-cube: uses _adj_all to route via UCIe.
|
||||||
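The adjacency selection in `find_path` and `find_path_with_distance` can be illustrated with a self-contained sketch. The `same_cube` helper below assumes node ids of the form `sipN.cubeM.…` and compares the first two components; the real `_same_cube` is not shown in this diff, so that rule is an assumption. The Dijkstra routine is a generic stand-in for `_run_dijkstra_with_dist`, not a copy of it.

```python
import heapq


def same_cube(a: str, b: str) -> bool:
    """Assumed rule: two node ids share a cube if their 'sipN.cubeM' prefixes match."""
    return a.split(".")[:2] == b.split(".")[:2]


def dijkstra(adj, start, goal):
    """Return (path, distance) over adj: {node: [(neighbor, weight), ...]}."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        raise ValueError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[goal]


def find_path_with_distance(adj_all, adj_local, src_pe, dst_node):
    """Use the UCIe-free graph for same-cube traffic, the full graph otherwise."""
    start = f"{src_pe}.pe_dma"
    adj = adj_local if same_cube(start, dst_node) else adj_all
    return dijkstra(adj, start, dst_node)
```

Routing same-cube traffic over the local graph is what keeps a zero-distance UCIe port from short-circuiting the mesh, so the reported distance matches the path actually taken.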
|
|||||||
@@ -58,7 +58,7 @@ def _get_active_context():
|
|||||||
|
|
||||||
|
|
||||||
class _AhbmNamespace:
|
class _AhbmNamespace:
|
||||||
"""torch.ahbm — per-greenlet SIP device binding (ADR-0024 D10).
|
"""torch.ahbm — per-greenlet SIP device binding (ADR-0024 D3).
|
||||||
|
|
||||||
Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
|
Real-PyTorch parity idiom: ``torch.cuda.set_device(rank)``. KernBench's
|
||||||
backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
|
backend is 'ahbm' (not CUDA), so this namespace avoids pretending to be
|
||||||
@@ -124,7 +124,7 @@ class RuntimeContext:
|
|||||||
dc = DistributedContext()
|
dc = DistributedContext()
|
||||||
dc._ctx_ref = self # back-reference for AhbmCCLBackend to reach ctx.launch etc.
|
dc._ctx_ref = self # back-reference for AhbmCCLBackend to reach ctx.launch etc.
|
||||||
self.distributed = dc
|
self.distributed = dc
|
||||||
# ADR-0024 D10: torch.ahbm (KernBench-native) + torch.accelerator
|
# ADR-0024 D3: torch.ahbm (KernBench-native) + torch.accelerator
|
||||||
# (PyTorch 2.x portable) namespaces for per-greenlet device binding.
|
# (PyTorch 2.x portable) namespaces for per-greenlet device binding.
|
||||||
self.ahbm = _AhbmNamespace()
|
self.ahbm = _AhbmNamespace()
|
||||||
self.accelerator = _AcceleratorNamespace(self.ahbm)
|
self.accelerator = _AcceleratorNamespace(self.ahbm)
|
||||||
@@ -472,7 +472,7 @@ class RuntimeContext:
|
|||||||
eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
|
eff_num_pe = dp.num_pes if dp.num_pes is not None else self._pes_per_cube
|
||||||
eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
|
eff_num_cubes = dp.num_cubes if dp.num_cubes is not None else self._num_cubes
|
||||||
# ADR-0026 D4: resolve structural coords directly at resolve time.
|
# ADR-0026 D4: resolve structural coords directly at resolve time.
|
||||||
# ``torch.ahbm.set_device(rank)`` (ADR-0024 D10) selects the target
|
# ``torch.ahbm.set_device(rank)`` (ADR-0024 D3) selects the target
|
||||||
# SIP; if unset, fall back to SIP 0 for single-driver compatibility.
|
# SIP; if unset, fall back to SIP 0 for single-driver compatibility.
|
||||||
current_sip = (
|
current_sip = (
|
||||||
self.ahbm.current_device() if hasattr(self, "ahbm") else None
|
self.ahbm.current_device() if hasattr(self, "ahbm") else None
|
||||||
@@ -619,7 +619,7 @@ class RuntimeContext:
|
|||||||
Creates per-SIP KernelLaunchMsg with local va_base per tensor
|
Creates per-SIP KernelLaunchMsg with local va_base per tensor
|
||||||
(like host driver sending per-rank launch commands).
|
(like host driver sending per-rank launch commands).
|
||||||
|
|
||||||
When ``_defer_wait=True`` (ADR-0024 D7), returns the list of
|
When ``_defer_wait=True`` (ADR-0027 D0.4), returns the list of
|
||||||
``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
|
``(handle, sip_id, meta)`` tuples instead of waiting. Caller is
|
||||||
responsible for waiting — used by collective ops to yield between
|
responsible for waiting — used by collective ops to yield between
|
||||||
submit and wait so all sibling ranks can submit first.
|
submit and wait so all sibling ranks can submit first.
|
||||||
@@ -786,7 +786,7 @@ class RuntimeContext:
|
|||||||
last_handle = h
|
last_handle = h
|
||||||
|
|
||||||
if _defer_wait:
|
if _defer_wait:
|
||||||
# ADR-0024 D7: return the pending-list so the caller can yield
|
# ADR-0027 D0.4: return the pending-list so the caller can yield
|
||||||
# between submit and drain. Used by collective ops that need
|
# between submit and drain. Used by collective ops that need
|
||||||
# all sibling ranks to submit before any rank waits.
|
# all sibling ranks to submit before any rank waits.
|
||||||
return [
|
return [
|
||||||
|
|||||||
@@ -178,7 +178,7 @@ class DistributedContext:
|
|||||||
|
|
||||||
def __init__(self) -> None:
|
def __init__(self) -> None:
|
||||||
self._backend: AhbmCCLBackend | None = None
|
self._backend: AhbmCCLBackend | None = None
|
||||||
# ADR-0024 D9: greenlet-local rank registry. Bench launcher calls
|
# ADR-0024 D2: greenlet-local rank registry. Bench launcher calls
|
||||||
# _bind_rank(g, rank) when spawning workers; get_rank() resolves the
|
# _bind_rank(g, rank) when spawning workers; get_rank() resolves the
|
||||||
# current greenlet to its rank. Unbound greenlets fall back to 0 for
|
# current greenlet to its rank. Unbound greenlets fall back to 0 for
|
||||||
# single-driver test compat.
|
# single-driver test compat.
|
||||||
@@ -220,7 +220,7 @@ class DistributedContext:
|
|||||||
def get_rank(self) -> int:
|
def get_rank(self) -> int:
|
||||||
"""Return the rank bound to the current greenlet (default 0).
|
"""Return the rank bound to the current greenlet (default 0).
|
||||||
|
|
||||||
ADR-0024 D9: workers spawned by the bench launcher each get a rank
|
ADR-0024 D2: workers spawned by the bench launcher each get a rank
|
||||||
registered via ``_bind_rank``. Callers outside any bound greenlet
|
registered via ``_bind_rank``. Callers outside any bound greenlet
|
||||||
fall back to rank 0 for single-driver test compat.
|
fall back to rank 0 for single-driver test compat.
|
||||||
"""
|
"""
|
||||||
@@ -230,7 +230,7 @@ class DistributedContext:
|
|||||||
return int(self._rank_by_greenlet.get(g, 0))
|
return int(self._rank_by_greenlet.get(g, 0))
|
||||||
|
|
||||||
def _bind_rank(self, g: Any, rank: int) -> None:
|
def _bind_rank(self, g: Any, rank: int) -> None:
|
||||||
"""Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D9)."""
|
"""Bind a greenlet to a rank so ``get_rank()`` returns it (ADR-0024 D2)."""
|
||||||
self._rank_by_greenlet[g] = int(rank)
|
self._rank_by_greenlet[g] = int(rank)
|
||||||
|
|
||||||
def get_backend(self) -> str:
|
def get_backend(self) -> str:
|
||||||
|
|||||||
@@ -65,7 +65,7 @@ def _drain_pending(ctx: Any) -> None:
|
|||||||
# Populate _completed so fast-path in ctx.wait short-circuits
|
# Populate _completed so fast-path in ctx.wait short-circuits
|
||||||
# on the return leg.
|
# on the return leg.
|
||||||
ctx._completed.add(h)
|
ctx._completed.add(h)
|
||||||
# (b) Collective backend queue (ADR-0024 D7 + D0.4-(2)).
|
# (b) Collective backend queue (ADR-0027 D0.4-(2)).
|
||||||
if backend is not None:
|
if backend is not None:
|
||||||
pending_list = getattr(backend, "_pending_collective_handles", None)
|
pending_list = getattr(backend, "_pending_collective_handles", None)
|
||||||
if pending_list is not None:
|
if pending_list is not None:
|
||||||
|
|||||||
@@ -51,7 +51,7 @@ class OpLogger:
|
|||||||
record_end fires.
|
record_end fires.
|
||||||
"""
|
"""
|
||||||
snap: dict[str, Any] = {}
|
snap: dict[str, Any] = {}
|
||||||
# TileToken (ADR-0021 pipeline) — capture which stage this is and its
|
# TileToken (ADR-0014 D6 pipeline) — capture which stage this is and its
|
||||||
# per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
|
# per-stage params (e.g. op_kind/scope for epilogue MATH stages) so
|
||||||
# we can recover them at record_end even after the token advances.
|
# we can recover them at record_end even after the token advances.
|
||||||
try:
|
try:
|
||||||
|
|||||||
@@ -356,7 +356,7 @@ def _instantiate_cube(
|
|||||||
) -> None:
|
) -> None:
|
||||||
"""Add all cube-internal nodes and edges, including PE instances.
|
"""Add all cube-internal nodes and edges, including PE instances.
|
||||||
|
|
||||||
Topology: explicit router mesh from cube_mesh.yaml (ADR-0019).
|
Topology: explicit router mesh from cube_mesh.yaml (ADR-0017 D1).
|
||||||
Each router is a separate SimPy node. Components attach to routers
|
Each router is a separate SimPy node. Components attach to routers
|
||||||
based on cube_mesh.yaml attachment lists.
|
based on cube_mesh.yaml attachment lists.
|
||||||
"""
|
"""
|
||||||
@@ -367,10 +367,10 @@ def _instantiate_cube(
|
|||||||
clinks = cube["links"]
|
clinks = cube["links"]
|
||||||
mm = cube["memory_map"]
|
mm = cube["memory_map"]
|
||||||
|
|
||||||
# ── Mode branch (ADR-0019) ──
|
# ── Mode branch (ADR-0017 D8) ──
|
||||||
mode = mm.get("hbm_mapping_mode", "n_to_one")
|
mode = mm.get("hbm_mapping_mode", "n_to_one")
|
||||||
if mode == "one_to_one":
|
if mode == "one_to_one":
|
||||||
raise NotImplementedError("1:1 mode: ADR-0019 D3")
|
raise NotImplementedError("1:1 mode: ADR-0017 D8")
|
||||||
|
|
||||||
# ── UCIe ports + connection nodes ──
|
# ── UCIe ports + connection nodes ──
|
||||||
ucie_cfg = cube["ucie"]
|
ucie_cfg = cube["ucie"]
|
||||||
@@ -404,11 +404,10 @@ def _instantiate_cube(
|
|||||||
label=name.upper().replace("_", " "),
|
label=name.upper().replace("_", " "),
|
||||||
)
|
)
|
||||||
|
|
||||||
# ── Per-PE HBM controller (ADR-0019 D1/D4) ──
|
# ── Per-PE HBM controller (ADR-0017 D4) ──
|
||||||
# Each PE owns one slice of the cube's HBM. The slice has its own
|
# Each PE owns one slice of the cube's HBM. The slice has its own
|
||||||
# set of pseudo-channels and is reachable ONLY through that PE's
|
# set of pseudo-channels and is reachable ONLY through that PE's
|
||||||
# attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
|
# attaching router (see cube_mesh.yaml ``peX.hbm`` attach lists).
|
||||||
# Restored after the ADR-0019 over-consolidation in commit 5917b34.
|
|
||||||
hbm_spec = cube["components"]["hbm_ctrl"]
|
hbm_spec = cube["components"]["hbm_ctrl"]
|
||||||
hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
|
hbm_lx, hbm_ly = local_pos["hbm_ctrl"]
|
||||||
_hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
|
_hbm_total_bw = float(cube["links"].get("hbm_to_router_bw_gbs", 256.0))
|
||||||
@@ -425,7 +424,7 @@ def _instantiate_cube(
|
|||||||
label=f"HBM CTRL pe{pe_idx}",
|
label=f"HBM CTRL pe{pe_idx}",
|
||||||
)
|
)
|
||||||
|
|
||||||
# ── Router mesh from cube_mesh.yaml (ADR-0019 D3) ──
|
# ── Router mesh from cube_mesh.yaml (ADR-0017 D1) ──
|
||||||
routers = mesh_data["routers"]
|
routers = mesh_data["routers"]
|
||||||
router_spec = cube["components"]["noc_router"]
|
router_spec = cube["components"]["noc_router"]
|
||||||
router_bw = clinks.get("router_link_bw_gbs", 256.0)
|
router_bw = clinks.get("router_link_bw_gbs", 256.0)
|
||||||
@@ -573,7 +572,7 @@ def _instantiate_cube(
|
|||||||
))
|
))
|
||||||
elif item.endswith(".hbm"):
|
elif item.endswith(".hbm"):
|
||||||
# peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
|
# peX.hbm: router rXcY owns the entry to hbm_ctrl.peX.
|
||||||
# (ADR-0019 D1/D4 — per-PE HBM partitioning.)
|
# (ADR-0017 D4 — per-PE HBM partitioning.)
|
||||||
pe_prefix = item.rsplit(".", 1)[0]
|
pe_prefix = item.rsplit(".", 1)[0]
|
||||||
pe_idx = int(pe_prefix.replace("pe", ""))
|
pe_idx = int(pe_prefix.replace("pe", ""))
|
||||||
pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
|
pe_hbm_id = f"{cp}.hbm_ctrl.pe{pe_idx}"
|
||||||
@@ -645,13 +644,12 @@ def _instantiate_cube(
|
|||||||
))
|
))
|
||||||
|
|
||||||
# NOTE: HBM↔router edges are created in the per-router attach loop
|
# NOTE: HBM↔router edges are created in the per-router attach loop
|
||||||
# above (peX.hbm items map router → hbm_ctrl.peX). Removed the
|
# above (peX.hbm items map router → hbm_ctrl.peX). See ADR-0017 D4
|
||||||
# legacy "all routers → single hbm_ctrl" loop that bypassed the
|
# for the per-PE partition contract.
|
||||||
# ADR-0019 D4 per-PE partition.
|
|
||||||
|
|
||||||
|
|
||||||
def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
|
def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
|
||||||
"""Add PE-internal edges for a single PE instance (ADR-0021)."""
|
"""Add PE-internal edges for a single PE instance (ADR-0014 D8)."""
|
||||||
edges.append(Edge(
|
edges.append(Edge(
|
||||||
src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
|
src=f"{pp}.pe_cpu", dst=f"{pp}.pe_scheduler",
|
||||||
distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
|
distance_mm=pe_links["pe_cpu_to_scheduler_mm"],
|
||||||
@@ -685,7 +683,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
|
|||||||
kind="pe_internal",
|
kind="pe_internal",
|
||||||
))
|
))
|
||||||
|
|
||||||
# Fetch/Store → TCM (ADR-0021 D5)
|
# Fetch/Store → TCM (ADR-0014 D5)
|
||||||
if "fetch_store_to_tcm_mm" in pe_links:
|
if "fetch_store_to_tcm_mm" in pe_links:
|
||||||
edges.append(Edge(
|
edges.append(Edge(
|
||||||
src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
|
src=f"{pp}.pe_fetch_store", dst=f"{pp}.pe_tcm",
|
||||||
@@ -694,7 +692,7 @@ def _add_pe_internal_edges(edges: list[Edge], pp: str, pe_links: dict) -> None:
|
|||||||
kind="pe_internal",
|
kind="pe_internal",
|
||||||
))
|
))
|
||||||
|
|
||||||
# Chaining edges (ADR-0021 D4 — token self-routing)
|
# Chaining edges (ADR-0014 D6 — token self-routing)
|
||||||
chaining = [
|
chaining = [
|
||||||
("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
|
("pe_dma", "pe_fetch_store", "dma_to_fetch_store_mm"),
|
||||||
("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
|
("pe_fetch_store", "pe_gemm", "fetch_store_to_gemm_mm"),
|
||||||
|
|||||||
@@ -6,7 +6,7 @@
|
|||||||
forward(x) ends with ``dist.all_reduce`` to sum partial products.
|
forward(x) ends with ``dist.all_reduce`` to sum partial products.
|
||||||
|
|
||||||
Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
|
Both layers use the intra-device ``DPPolicy`` (ADR-0026). TP shard
|
||||||
ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D10).
|
ownership is determined by ``torch.ahbm.set_device(rank)`` (ADR-0024 D3).
|
||||||
|
|
||||||
Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
|
Yield-safety contract (ADR-0027 D4/D5): every forward path contains at
|
||||||
least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
|
least one ``ctx.wait`` (via ``torch.launch``) or one collective; this
|
||||||
@@ -53,7 +53,7 @@ class ColumnParallelLinear:
|
|||||||
self.k_local = out_features // ws
|
self.k_local = out_features // ws
|
||||||
self.dtype = dtype
|
self.dtype = dtype
|
||||||
self._torch = torch
|
self._torch = torch
|
||||||
# Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D10) places
|
# Per-rank weight slice. ``set_device(rank)`` (ADR-0024 D3) places
|
||||||
# it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
|
# it on SIP ``rank``. Intra-SIP layout comes from DPPolicy (ADR-0026).
|
||||||
self.weight = torch.zeros(
|
self.weight = torch.zeros(
|
||||||
(in_features, self.k_local),
|
(in_features, self.k_local),
|
||||||
|
|||||||
@@ -43,7 +43,7 @@ def get_tensor_model_parallel_rank() -> int:
|
|||||||
"""Return this worker's rank within the TP group.
|
"""Return this worker's rank within the TP group.
|
||||||
|
|
||||||
Delegates to the greenlet-local rank registered by the spawn launcher
|
Delegates to the greenlet-local rank registered by the spawn launcher
|
||||||
(ADR-0024 D9 via ``torch.distributed.get_rank``).
|
(ADR-0024 D2 via ``torch.distributed.get_rank``).
|
||||||
"""
|
"""
|
||||||
# Resolve via the global torch.distributed facade on the active ctx.
|
# Resolve via the global torch.distributed facade on the active ctx.
|
||||||
return _current_rank()
|
return _current_rank()
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""End-to-end pipeline tests (ADR-0020 + ADR-0021).
|
"""End-to-end pipeline tests (ADR-0020 + ADR-0014).
|
||||||
|
|
||||||
Verifies:
|
Verifies:
|
||||||
1. Actual benchmark kernel → greenlet mode → op_log → DataExecutor → accuracy
|
1. Actual benchmark kernel → greenlet mode → op_log → DataExecutor → accuracy
|
||||||
|
|||||||
@@ -68,7 +68,7 @@ def _path_drain_for_write(eng: GraphEngine, msg: MemoryWriteMsg) -> float:
|
|||||||
|
|
||||||
def test_builder_derives_pc_bw_gbs():
|
def test_builder_derives_pc_bw_gbs():
|
||||||
"""Topology builder must inject `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
|
"""Topology builder must inject `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`
|
||||||
as an attr on every hbm_ctrl node. Enforces ADR-0019 D9 invariant
|
as an attr on every hbm_ctrl node. Enforces ADR-0017 D8 invariant
|
||||||
(channels_per_PE × per-PC BW = aggregated link BW) at build time.
|
(channels_per_PE × per-PC BW = aggregated link BW) at build time.
|
||||||
"""
|
"""
|
||||||
handle = resolve_topology(str(TOPOLOGY_PATH))
|
handle = resolve_topology(str(TOPOLOGY_PATH))
|
||||||
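The invariant this test enforces can be written out as a short sketch. The helper name `derive_pc_bw_gbs` and the example of 8 pseudo-channels are hypothetical; the division itself and the 256.0 GB/s link default come from the builder code and the docstring in this diff.

```python
def derive_pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth such that num_pcs * per-PC BW equals the link BW."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


# Example: a 256 GB/s HBM-to-router link shared by 8 pseudo-channels (assumed count).
pc_bw = derive_pc_bw_gbs(256.0, 8)
assert pc_bw * 8 == 256.0  # channels x per-PC BW == aggregated link BW
```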
|
|||||||
@@ -192,13 +192,10 @@ def test_hbm_pe_hop_charged_at_large_payload(tmp_path):
|
|||||||
chunk of latency from the PE↔HBM hop on send and recv, so the
|
chunk of latency from the PE↔HBM hop on send and recv, so the
|
||||||
total HBM/TCM gap should clearly clear the threshold below.
|
total HBM/TCM gap should clearly clear the threshold below.
|
||||||
|
|
||||||
Threshold history: the gap was 4 µs under the over-consolidated
|
Under ADR-0017 D4 per-PE HBM CTRL, each PE's slice runs on its own
|
||||||
single-hbm_ctrl model (commit 5917b34), inflated by serialization
|
controller with no cross-PE contention, so the IPCQ pattern (each
|
||||||
on the shared HBM controller. With ADR-0019 D1 per-PE HBM CTRL
|
PE writes its own slice) yields a gap of ≈ 1.7 µs — well above the
|
||||||
restored, each PE's slice runs on its own controller with no
|
bare slot-IO term, confirming the PE↔HBM hop is being charged.
|
||||||
cross-PE contention, so the IPCQ pattern (each PE writes its own
|
|
||||||
slice) drops the gap to ≈ 1.7 µs — still well above the bare
|
|
||||||
slot-IO term, confirming the PE↔HBM hop is being charged.
|
|
||||||
"""
|
"""
|
||||||
n_elem = 16384 # 32 KB / PE
|
n_elem = 16384 # 32 KB / PE
|
||||||
lat_tcm = _run_allreduce_with_buffer_kind(
|
lat_tcm = _run_allreduce_with_buffer_kind(
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Tests for CUBE NOC Explicit Router Mesh (ADR-0019).
|
"""Tests for CUBE NOC Explicit Router Mesh (ADR-0017).
|
||||||
|
|
||||||
Key changes verified:
|
Key changes verified:
|
||||||
- Explicit router nodes per cube from cube_mesh.yaml (6×6 grid)
|
- Explicit router nodes per cube from cube_mesh.yaml (6×6 grid)
|
||||||
@@ -125,14 +125,14 @@ def test_mesh_file_pe_corner_positions():
|
|||||||
|
|
||||||
|
|
||||||
def test_mesh_file_no_xbar_section():
|
def test_mesh_file_no_xbar_section():
|
||||||
"""mesh output must not contain xbar section (ADR-0019 D2)."""
|
"""mesh output must not contain xbar section (ADR-0017 D1)."""
|
||||||
_graph()
|
_graph()
|
||||||
mesh = yaml.safe_load(MESH_PATH.read_text())
|
mesh = yaml.safe_load(MESH_PATH.read_text())
|
||||||
assert "xbar" not in mesh, "xbar section should be removed from cube_mesh.yaml"
|
assert "xbar" not in mesh, "xbar section should be removed from cube_mesh.yaml"
|
||||||
|
|
||||||
|
|
||||||
def test_mesh_file_pe_hbm_attached():
|
def test_mesh_file_pe_hbm_attached():
|
||||||
"""PE routers must have pe{idx}.hbm in attach list (ADR-0019 D1)."""
|
"""PE routers must have pe{idx}.hbm in attach list (ADR-0017 D4)."""
|
||||||
_graph()
|
_graph()
|
||||||
mesh = yaml.safe_load(MESH_PATH.read_text())
|
mesh = yaml.safe_load(MESH_PATH.read_text())
|
||||||
for rid, rdata in mesh["routers"].items():
|
for rid, rdata in mesh["routers"].items():
|
||||||
@@ -235,7 +235,7 @@ def test_mesh_ucie_all_four_directions():
|
|||||||
|
|
||||||
|
|
||||||
# ══════════════════════════════════════════════════════════════════
|
# ══════════════════════════════════════════════════════════════════
|
||||||
# 2. Topology Graph: Explicit Router Mesh (ADR-0019)
|
# 2. Topology Graph: Explicit Router Mesh (ADR-0017)
|
||||||
# ══════════════════════════════════════════════════════════════════
|
# ══════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
|
||||||
@@ -247,7 +247,7 @@ def test_router_nodes_exist():
|
|||||||
|
|
||||||
|
|
||||||
def test_no_xbar_or_bridge_nodes():
|
def test_no_xbar_or_bridge_nodes():
|
||||||
"""xbar/bridge nodes must not exist (ADR-0019 D2)."""
|
"""xbar/bridge nodes must not exist (ADR-0017 D1)."""
|
||||||
graph = _graph()
|
graph = _graph()
|
||||||
bad = [n for n in graph.nodes if "xbar" in n or "bridge" in n]
|
bad = [n for n in graph.nodes if "xbar" in n or "bridge" in n]
|
||||||
assert len(bad) == 0, f"Old xbar/bridge nodes found: {bad[:5]}"
|
assert len(bad) == 0, f"Old xbar/bridge nodes found: {bad[:5]}"
|
||||||
@@ -260,11 +260,10 @@ def test_no_single_noc_node():
|
|||||||
|
|
||||||
|
|
||||||
def test_per_pe_hbm_ctrl_nodes():
|
def test_per_pe_hbm_ctrl_nodes():
|
||||||
"""Each cube has 8 per-PE HBM CTRL instances (ADR-0019 D1).
|
"""Each cube has 8 per-PE HBM CTRL instances (ADR-0017 D4).
|
||||||
|
|
||||||
Restored from over-consolidation in commit 5917b34. The legacy
|
Each PE owns its own ``hbm_ctrl.pe{X}`` reachable through that PE's
|
||||||
single ``sip0.cube0.hbm_ctrl`` is gone; each PE owns its own
|
attaching router. No cube-wide single ``hbm_ctrl`` node exists.
|
||||||
``hbm_ctrl.pe{X}`` reachable through that PE's attaching router.
|
|
||||||
"""
|
"""
|
||||||
graph = _graph()
|
graph = _graph()
|
||||||
for pe in range(8):
|
for pe in range(8):
|
||||||
@@ -272,7 +271,7 @@ def test_per_pe_hbm_ctrl_nodes():
|
|||||||
# Legacy single hbm_ctrl must not exist
|
# Legacy single hbm_ctrl must not exist
|
||||||
legacy_id = "sip0.cube0.hbm_ctrl"
|
legacy_id = "sip0.cube0.hbm_ctrl"
|
||||||
assert legacy_id not in graph.nodes, (
|
assert legacy_id not in graph.nodes, (
|
||||||
f"legacy {legacy_id} must be removed (per-PE partitioning, ADR-0019 D1)"
|
f"legacy {legacy_id} must not exist (per-PE partitioning, ADR-0017 D4)"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@@ -297,9 +296,7 @@ def test_pe_dma_connects_to_router():
|
|||||||
def test_each_hbm_ctrl_connects_only_to_owning_router():
|
def test_each_hbm_ctrl_connects_only_to_owning_router():
|
||||||
"""Each ``hbm_ctrl.pe{X}`` must have exactly one router edge
|
"""Each ``hbm_ctrl.pe{X}`` must have exactly one router edge
|
||||||
(router_to_hbm + hbm_to_router) to its owning PE's attaching
|
(router_to_hbm + hbm_to_router) to its owning PE's attaching
|
||||||
router (ADR-0019 D4). Replaces a prior test that asserted the
|
router (ADR-0017 D7).
|
||||||
single hbm_ctrl was connected to all routers — that asserted the
|
|
||||||
spec-violating consolidation introduced in commit 5917b34.
|
|
||||||
"""
|
"""
|
||||||
graph = _graph()
|
graph = _graph()
|
||||||
pe_router = {0: "r0c0", 1: "r0c1", 2: "r1c4", 3: "r1c5",
|
pe_router = {0: "r0c0", 1: "r0c1", 2: "r1c4", 3: "r1c5",
|
||||||
@@ -513,7 +510,7 @@ def test_null_routers_excluded():
|
|||||||
|
|
||||||
|
|
||||||
# ══════════════════════════════════════════════════════════════════
|
# ══════════════════════════════════════════════════════════════════
|
||||||
# 7. Router Mesh Latency (ADR-0019)
|
# 7. Router Mesh Latency (ADR-0017)
|
||||||
# ══════════════════════════════════════════════════════════════════
|
# ══════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
"""Tests for ADR-0021 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
|
"""Tests for ADR-0014 D6 PE pipeline: TileToken self-routing, pipeline overlap, e2e accuracy.
|
||||||
|
|
||||||
Test plan items:
|
Test plan items:
|
||||||
3. Phase 1 → Phase 2 end-to-end (op_log → DataExecutor → verify)
|
3. Phase 1 → Phase 2 end-to-end (op_log → DataExecutor → verify)
|
||||||
|
|||||||
@@ -1,18 +1,13 @@
|
|||||||
"""Tests for ADR-0019 D1/D4 per-PE HBM partitioning.
|
"""Tests for ADR-0017 D4/D7 per-PE HBM partitioning.
|
||||||
|
|
||||||
Restores the architectural property that was lost in commit 5917b34
|
ADR-0017 D4/D7 specifies:
|
||||||
(2026-04-04 "Replace xbar/bridge/single-NOC with explicit router mesh"),
|
|
||||||
which over-consolidated 8 per-slice HBM CTRL nodes into one cube-wide
|
|
||||||
HBM CTRL connected to every router. ADR-0019 D1/D4 specifies:
|
|
||||||
|
|
||||||
- Each PE owns 8 of the cube's 64 pseudo-channels (PE_X → PCs 8X..8X+7).
|
- Each PE owns 8 of the cube's 64 pseudo-channels (PE_X → PCs 8X..8X+7).
|
||||||
- HBM CTRL is split per-PE: ``hbm_ctrl.pe{X}`` is reachable ONLY through
|
- HBM CTRL is split per-PE: ``hbm_ctrl.pe{X}`` is reachable ONLY through
|
||||||
PE_X's attaching router. Accessing PE_Y's slice from PE_X requires
|
PE_X's attaching router. Accessing PE_Y's slice from PE_X requires
|
||||||
mesh routing to r_Y_attach before entering hbm_ctrl.pe{Y}.
|
mesh routing to r_Y_attach before entering hbm_ctrl.pe{Y}.
|
||||||
|
|
||||||
These tests are written BEFORE the production change and are expected
|
These tests enforce that property without weakening
|
||||||
to FAIL on current code (HBM CTRL is a single ``hbm_ctrl`` node attached
|
|
||||||
to all routers). Phase 2 must make them PASS without weakening
|
|
||||||
assertions.
|
assertions.
|
||||||
"""
|
"""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
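The partitioning rule in the docstring above (PE_X owns pseudo-channels 8X..8X+7 and is served by its own `hbm_ctrl.pe{X}` node) can be sketched as follows. This is an illustrative sketch, not the repository's resolver: the helper names are hypothetical, and the contiguous 6 GiB-per-PE slice layout is an assumption inferred from the "slice_size = 6 GB" comment elsewhere in this diff.

```python
# Hypothetical helpers illustrating per-PE HBM partitioning (ADR-0017 D4/D7).
PCS_PER_PE = 8
PES_PER_CUBE = 8
SLICE_BYTES = 6 * 1024**3  # assumption: 48 GB per cube / 8 slices, contiguous


def pcs_for_pe(pe: int) -> range:
    """Pseudo-channels owned by PE `pe`: 8*pe .. 8*pe + 7."""
    return range(PCS_PER_PE * pe, PCS_PER_PE * (pe + 1))


def hbm_ctrl_node(sip: int, cube: int, pe: int) -> str:
    """Node id in the form used by the tests in this diff."""
    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe}"


def pe_for_offset(offset: int) -> int:
    """Owning PE for a cube-local HBM offset (assumed contiguous slices)."""
    pe = offset // SLICE_BYTES
    if not 0 <= pe < PES_PER_CUBE:
        raise ValueError(f"offset {offset:#x} is outside the cube's HBM")
    return pe


owner = pe_for_offset(0x1000)
print(hbm_ctrl_node(0, 0, owner), list(pcs_for_pe(owner)))
```

Under these assumptions the eight per-PE ranges tile all 64 pseudo-channels without overlap, which is the property the tests in this file rely on.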
@@ -66,16 +61,16 @@ def test_topology_has_8_hbm_ctrl_per_cube():
|
|||||||
for pe in range(8):
|
for pe in range(8):
|
||||||
nid = f"sip0.cube0.hbm_ctrl.pe{pe}"
|
nid = f"sip0.cube0.hbm_ctrl.pe{pe}"
|
||||||
assert nid in graph.nodes, (
|
assert nid in graph.nodes, (
|
||||||
f"Expected per-PE HBM CTRL node {nid!r} (ADR-0019 D1)"
|
f"Expected per-PE HBM CTRL node {nid!r} (ADR-0017 D4)"
|
||||||
)
|
)
|
||||||
node = graph.nodes[nid]
|
node = graph.nodes[nid]
|
||||||
assert int(node.attrs.get("num_pcs", 0)) == 8, (
|
assert int(node.attrs.get("num_pcs", 0)) == 8, (
|
||||||
f"{nid} must have num_pcs=8; got {node.attrs.get('num_pcs')}"
|
f"{nid} must have num_pcs=8; got {node.attrs.get('num_pcs')}"
|
||||||
)
|
)
|
||||||
# Legacy single hbm_ctrl must not exist
|
# Cube-wide single hbm_ctrl must not exist
|
||||||
assert "sip0.cube0.hbm_ctrl" not in graph.nodes, (
|
assert "sip0.cube0.hbm_ctrl" not in graph.nodes, (
|
||||||
"Legacy single sip0.cube0.hbm_ctrl must be removed in favor of "
|
"Cube-wide single sip0.cube0.hbm_ctrl must not exist; only "
|
||||||
"per-PE hbm_ctrl.pe{X} (ADR-0019 D1)"
|
"per-PE hbm_ctrl.pe{X} (ADR-0017 D4)"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
@@ -199,10 +194,8 @@ def test_probe_cli_intra_cube_cases_are_monotonic():
|
|||||||
"""Probe CLI cases must show monotonic latency:
|
"""Probe CLI cases must show monotonic latency:
|
||||||
pe-local-hbm < pe-same-half-hbm < pe-cross-half-hbm.
|
pe-local-hbm < pe-same-half-hbm < pe-cross-half-hbm.
|
||||||
|
|
||||||
Prior to per-PE partitioning these three return identical latency
|
Per ADR-0017 D7, same-half (pe0→pe1) is 1 mesh hop further than
|
||||||
because all roads lead to the same hbm_ctrl. With ADR-0019 D4
|
local, and cross-half (pe0→pe4) is several hops further.
|
||||||
restored, same-half (pe0→pe1) is 1 mesh hop further than local,
|
|
||||||
and cross-half (pe0→pe4) is several hops further.
|
|
||||||
"""
|
"""
|
||||||
graph = _graph()
|
graph = _graph()
|
||||||
spec = graph.spec
|
spec = graph.spec
|
||||||
|
|||||||
@@ -17,7 +17,7 @@ def _graph():
|
|||||||
|
|
||||||
|
|
||||||
def test_resolve_hbm_addr():
|
def test_resolve_hbm_addr():
|
||||||
"""HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0019 D1)."""
|
"""HBM address -> sip{S}.cube{C}.hbm_ctrl.pe{X} (per-PE controller, ADR-0017 D9)."""
|
||||||
g = _graph()
|
g = _graph()
|
||||||
resolver = AddressResolver(g)
|
resolver = AddressResolver(g)
|
||||||
# offset 0x1000 falls inside PE0's slice (slice_size = 6 GB)
|
# offset 0x1000 falls inside PE0's slice (slice_size = 6 GB)
|
||||||
@@ -102,16 +102,13 @@ def test_path_remote_pe_hbm():
|
|||||||
assert not any("xbar" in n or "bridge" in n for n in path)
|
assert not any("xbar" in n or "bridge" in n for n in path)
|
||||||
|
|
||||||
|
|
||||||
# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0019 D4) ─
|
# ── PathRouter: cross-PE HBM distance reflects mesh hops (ADR-0017 D7) ─
|
||||||
|
|
||||||
|
|
||||||
def test_cross_pe_hbm_distance_increases_with_mesh_hops():
|
def test_cross_pe_hbm_distance_increases_with_mesh_hops():
|
||||||
"""Restored ADR-0019 D4 behavior: accessing another PE's HBM slice
|
"""ADR-0017 D7: accessing another PE's HBM slice must take more
|
||||||
must take more routing distance than accessing one's own slice,
|
routing distance than accessing one's own slice, because each
|
||||||
because each per-PE hbm_ctrl is reachable only via its PE's router.
|
per-PE hbm_ctrl is reachable only via its PE's router.
|
||||||
|
|
||||||
Replaces a previous ``test_all_pe_hbm_equidistant`` that asserted the
|
|
||||||
over-consolidated (spec-violating) behavior introduced in 5917b34.
|
|
||||||
"""
|
"""
|
||||||
g = _graph()
|
g = _graph()
|
||||||
router = PathRouter(g)
|
router = PathRouter(g)
|
||||||
|
|||||||
@@ -21,7 +21,7 @@ def test_full_graph_node_count():
|
|||||||
# + 20 ucie (4 ports x (1 port + 4 conn))
|
# + 20 ucie (4 ports x (1 port + 4 conn))
|
||||||
# + 8 PEs x 9 pe_comps)) (ADR-0023: +pe_ipcq)
|
# + 8 PEs x 9 pe_comps)) (ADR-0023: +pe_ipcq)
|
||||||
# IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23
|
# IO: pcie_ep + io_cpu + noc + 4 io_ucie_ports + 4*4 io_ucie_conn = 23
|
||||||
# cube: 32 + 10 + 20 + 72 = 134 (was 127; ADR-0019 D1 per-PE HBM CTRL)
|
# cube: 32 + 10 + 20 + 72 = 134 (per-PE HBM CTRL, ADR-0017 D4)
|
||||||
# = 1 + 2*(23 + 16*134) = 1 + 2*(23+2144) = 1 + 4334 = 4335
|
# = 1 + 2*(23 + 16*134) = 1 + 2*(23+2144) = 1 + 4334 = 4335
|
||||||
assert len(g.nodes) == 4335
|
assert len(g.nodes) == 4335
|
||||||
|
|
||||||
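The node-count arithmetic in the comment above can be re-derived mechanically. A minimal sketch, using only the numbers stated in the comment; the variable names are illustrative, and the first two per-cube terms (32 and 10) are taken at face value because their breakdown is not shown in this hunk.

```python
# Re-derive the expected node count from the figures in the test comment.
ucie_nodes = 4 * (1 + 4)          # 4 ports x (1 port + 4 conn) = 20
pe_nodes = 8 * 9                  # 8 PEs x 9 pe_comps = 72
cube_terms = [32, 10, ucie_nodes, pe_nodes]
nodes_per_cube = sum(cube_terms)  # 134

io_nodes = 1 + 1 + 1 + 4 + 4 * 4  # pcie_ep + io_cpu + noc + ports + conns = 23
cubes_per_sip = 16
sips = 2
top_level_nodes = 1               # the leading "1 +" in the comment

total = top_level_nodes + sips * (io_nodes + cubes_per_sip * nodes_per_cube)
print(nodes_per_cube, io_nodes, total)
```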
@@ -29,9 +29,9 @@ def test_full_graph_node_count():
|
|||||||
def test_full_graph_edge_count():
|
def test_full_graph_edge_count():
|
||||||
g = _graph()
|
g = _graph()
|
||||||
# ADR-0023: +3 IPCQ edges per PE
|
# ADR-0023: +3 IPCQ edges per PE
|
||||||
# ADR-0019 D1 (restored): HBM↔router edges drop from 32 routers × 2
|
# ADR-0017 D4: HBM↔router edges = 8 PE-routers × 2 per cube
|
||||||
# to 8 PE-routers × 2 per cube. 32 cubes × (16-64) = -1536 edges.
|
# (per-PE partition; not all 32 routers).
|
||||||
# Multi-op composite (ADR-0021): +1 gemm→math edge per PE for
|
# Multi-op composite (ADR-0014 D3.3): +1 gemm→math edge per PE for
|
||||||
# epilogue chaining = 2 SIPs × 16 cubes × 8 PEs = +256 edges.
|
# epilogue chaining = 2 SIPs × 16 cubes × 8 PEs = +256 edges.
|
||||||
assert len(g.edges) == 12412
|
assert len(g.edges) == 12412
|
||||||
|
|
||||||
@@ -73,7 +73,7 @@ def test_cube_component_nodes_exist():
|
|||||||
# Null holes must not exist
|
# Null holes must not exist
|
||||||
for null_rc in ("r2c2", "r2c3", "r3c2", "r3c3"):
|
for null_rc in ("r2c2", "r2c3", "r3c2", "r3c3"):
|
||||||
assert f"{cp}.{null_rc}" not in g.nodes
|
assert f"{cp}.{null_rc}" not in g.nodes
|
||||||
# Per-PE HBM CTRL (ADR-0019 D1) — 8 instances, no legacy single node
|
# Per-PE HBM CTRL (ADR-0017 D4) — 8 instances; no cube-wide single node
|
||||||
for pe in range(8):
|
for pe in range(8):
|
||||||
nid = f"{cp}.hbm_ctrl.pe{pe}"
|
nid = f"{cp}.hbm_ctrl.pe{pe}"
|
||||||
assert g.nodes[nid].kind == "hbm_ctrl"
|
assert g.nodes[nid].kind == "hbm_ctrl"
|
||||||
@@ -94,7 +94,7 @@ def test_pe_component_nodes_exist():
|
|||||||
|
|
||||||
def test_hbm_ctrl_at_cube_center():
|
def test_hbm_ctrl_at_cube_center():
|
||||||
g = _graph()
|
g = _graph()
|
||||||
# Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0019 D1)
|
# Per-PE hbm_ctrl nodes share the cube's HBM placement (ADR-0017 D4)
|
||||||
# cube0 origin = (0, 0), hbm at (6.5, 7.0)
|
# cube0 origin = (0, 0), hbm at (6.5, 7.0)
|
||||||
for pe in range(8):
|
for pe in range(8):
|
||||||
node = g.nodes[f"sip0.cube0.hbm_ctrl.pe{pe}"]
|
node = g.nodes[f"sip0.cube0.hbm_ctrl.pe{pe}"]
|
||||||
@@ -190,8 +190,7 @@ def test_pe_internal_edges():
|
|||||||
|
|
||||||
def test_per_pe_hbm_ctrl_connects_only_to_owning_router():
|
def test_per_pe_hbm_ctrl_connects_only_to_owning_router():
|
||||||
"""Each hbm_ctrl.pe{X} connects ONLY to PE_X's attaching router
|
"""Each hbm_ctrl.pe{X} connects ONLY to PE_X's attaching router
|
||||||
(ADR-0019 D4). Replaces a prior test that asserted the
|
(ADR-0017 D7)."""
|
||||||
spec-violating all-routers consolidation (commit 5917b34)."""
|
|
||||||
g = _graph()
|
g = _graph()
|
||||||
es = _edge_set(g)
|
es = _edge_set(g)
|
||||||
cp = "sip0.cube0"
|
cp = "sip0.cube0"
|
||||||
|
|||||||
@@ -56,7 +56,7 @@ def test_initialize_mismatched_ws_raises(topology):
|
|||||||
|
|
||||||
def test_get_tp_rank_is_greenlet_local(topology):
|
def test_get_tp_rank_is_greenlet_local(topology):
|
||||||
"""D3: get_tensor_model_parallel_rank returns greenlet-local rank
|
"""D3: get_tensor_model_parallel_rank returns greenlet-local rank
|
||||||
(delegates to torch.distributed.get_rank, ADR-0024 D9)."""
|
(delegates to torch.distributed.get_rank, ADR-0024 D2)."""
|
||||||
import kernbench.tp as tp
|
import kernbench.tp as tp
|
||||||
|
|
||||||
with _make_ctx(topology) as ctx:
|
with _make_ctx(topology) as ctx:
|
||||||
|
|||||||
@@ -0,0 +1,107 @@
|
|||||||
|
"""Tests for tools/verify_adr_lang_pairs.py."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
_REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||||
|
sys.path.insert(0, str(_REPO_ROOT / "tools"))
|
||||||
|
|
||||||
|
import verify_adr_lang_pairs as v # noqa: E402
|
||||||
|
|
||||||
|
|
||||||
|
def _make_adr(
|
||||||
|
path: Path,
|
||||||
|
title_id: str,
|
||||||
|
title_text: str = "Some Title",
|
||||||
|
status: str = "Accepted",
|
||||||
|
) -> None:
|
||||||
|
path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
path.write_text(
|
||||||
|
f"# ADR-{title_id}: {title_text}\n\n"
|
||||||
|
f"## Status\n\n{status}\n\n"
|
||||||
|
f"## Context\n\nbody\n",
|
||||||
|
encoding="utf-8",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_complete_pairs_pass(tmp_path: Path) -> None:
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", "Foo EN")
|
||||||
|
_make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", "Foo KO")
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_empty_dirs_pass(tmp_path: Path) -> None:
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_missing_ko_fails(tmp_path: Path) -> None:
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
|
||||||
|
errs = v.verify(tmp_path)
|
||||||
|
assert any("missing KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
|
||||||
|
|
||||||
|
|
||||||
|
def test_orphan_ko_fails(tmp_path: Path) -> None:
|
||||||
|
_make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
|
||||||
|
errs = v.verify(tmp_path)
|
||||||
|
assert any("orphan KO" in e and "ADR-0001-foo-bar.md" in e for e in errs)
|
||||||
|
|
||||||
|
|
||||||
|
def test_status_mismatch_fails(tmp_path: Path) -> None:
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status="Accepted")
|
||||||
|
_make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status="Proposed")
|
||||||
|
errs = v.verify(tmp_path)
|
||||||
|
assert any("Status block mismatch" in e for e in errs)
|
||||||
|
|
||||||
|
|
||||||
|
def test_title_id_mismatch_fails(tmp_path: Path) -> None:
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0002")
|
||||||
|
_make_adr(tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001")
|
||||||
|
errs = v.verify(tmp_path)
|
||||||
|
assert any("EN title ADR-ID" in e for e in errs)
|
||||||
|
|
||||||
|
|
||||||
|
def test_multiline_status_with_parenthetical_passes(tmp_path: Path) -> None:
|
||||||
|
"""Real ADRs like ADR-0001 have multi-line Status with revision notes."""
|
||||||
|
multiline_status = (
|
||||||
|
"Accepted (Revision 2 - 2026-04-27: concrete bit layout,\n"
|
||||||
|
"Supersedes ADR-0031.)"
|
||||||
|
)
|
||||||
|
_make_adr(
|
||||||
|
tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001", status=multiline_status
|
||||||
|
)
|
||||||
|
_make_adr(
|
||||||
|
tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md", "0001", status=multiline_status
|
||||||
|
)
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_crlf_normalization(tmp_path: Path) -> None:
|
||||||
|
"""KO has CRLF, EN has LF; Status content is otherwise identical -> pass."""
|
||||||
|
en = tmp_path / "docs/adr/ADR-0001-foo-bar.md"
|
||||||
|
ko = tmp_path / "docs/adr-ko/ADR-0001-foo-bar.md"
|
||||||
|
en.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
ko.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
en.write_bytes(
|
||||||
|
b"# ADR-0001: Foo\n\n## Status\n\nAccepted\n\n## Context\n\nbody\n"
|
||||||
|
)
|
||||||
|
ko.write_bytes(
|
||||||
|
b"# ADR-0001: Foo\r\n\r\n## Status\r\n\r\nAccepted\r\n\r\n## Context\r\n\r\nbody\r\n"
|
||||||
|
)
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_underscore_in_slug_recognized(tmp_path: Path) -> None:
|
||||||
|
"""ADR-0013 uses an underscore in its slug; the regex must accept it."""
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0013-ver-verification_strategy.md", "0013")
|
||||||
|
_make_adr(tmp_path / "docs/adr-ko/ADR-0013-ver-verification_strategy.md", "0013")
|
||||||
|
assert v.verify(tmp_path) == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_main_exit_codes(tmp_path: Path, capsys) -> None:
|
||||||
|
assert v.main(["--root", str(tmp_path)]) == 0
|
||||||
|
_make_adr(tmp_path / "docs/adr/ADR-0001-foo-bar.md", "0001")
|
||||||
|
assert v.main(["--root", str(tmp_path)]) == 1
|
||||||
|
out = capsys.readouterr().out
|
||||||
|
assert "FAILED" in out
|
||||||
@@ -0,0 +1,144 @@
|
|||||||
|
"""Verify ADR language pair invariants.
|
||||||
|
|
||||||
|
Policy (see CLAUDE.md Part 2 -> ADR Translation Discipline):
|
||||||
|
docs/adr/ : English canonical
|
||||||
|
docs/adr-ko/ : Korean translation (1:1 mirror)
|
||||||
|
docs/adr-history/: frozen, not checked (transitional)
|
||||||
|
docs/adr-proposed/: language-free, not checked
|
||||||
|
|
||||||
|
Checks:
|
||||||
|
- every docs/adr/<X>.md has a matching docs/adr-ko/<X>.md
|
||||||
|
- every docs/adr-ko/<X>.md has a matching docs/adr/<X>.md (no orphans)
|
||||||
|
- title line `# ADR-NNNN:` of each pair matches the filename's NNNN
|
||||||
|
- `## Status` block content is byte-equal (after CRLF/LF normalization)
|
||||||
|
between EN and KO
|
||||||
|
|
||||||
|
Exit code: 0 if all OK, 1 if any mismatch.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-[a-z0-9_-]+\.md$")
|
||||||
|
TITLE_RE = re.compile(r"^# ADR-(\d{4}):")
|
||||||
|
|
||||||
|
|
||||||
|
def _normalize(text: str) -> str:
|
||||||
|
return text.replace("\r\n", "\n").replace("\r", "\n")
|
||||||
|
|
||||||
|
|
||||||
|
def find_adr_files(adr_dir: Path) -> dict[str, Path]:
|
||||||
|
if not adr_dir.is_dir():
|
||||||
|
return {}
|
||||||
|
return {
|
||||||
|
p.name: p
|
||||||
|
for p in sorted(adr_dir.iterdir())
|
||||||
|
if p.is_file() and ADR_FILENAME_RE.match(p.name)
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def extract_title_id(text: str) -> str | None:
|
||||||
|
lines = _normalize(text).splitlines()
|
||||||
|
if not lines:
|
||||||
|
return None
|
||||||
|
m = TITLE_RE.match(lines[0])
|
||||||
|
return m.group(1) if m else None
|
||||||
|
|
||||||
|
|
||||||
|
def extract_status_block(text: str) -> str | None:
|
||||||
|
"""Return content between `## Status` and the next `## ` heading, stripped.
|
||||||
|
|
||||||
|
Returns None if no `## Status` heading exists.
|
||||||
|
"""
|
||||||
|
lines = _normalize(text).splitlines()
|
||||||
|
in_status = False
|
||||||
|
collected: list[str] = []
|
||||||
|
for line in lines:
|
||||||
|
if line.strip() == "## Status":
|
||||||
|
in_status = True
|
||||||
|
continue
|
||||||
|
if in_status and line.startswith("## "):
|
||||||
|
break
|
||||||
|
if in_status:
|
||||||
|
collected.append(line)
|
||||||
|
if not in_status:
|
||||||
|
return None
|
||||||
|
return "\n".join(collected).strip()
|
||||||
|
|
||||||
|
|
||||||
|
def verify(root: Path) -> list[str]:
|
||||||
|
errors: list[str] = []
|
||||||
|
en_dir = root / "docs" / "adr"
|
||||||
|
ko_dir = root / "docs" / "adr-ko"
|
||||||
|
|
||||||
|
en_files = find_adr_files(en_dir)
|
||||||
|
ko_files = find_adr_files(ko_dir)
|
||||||
|
|
||||||
|
for name in en_files:
|
||||||
|
if name not in ko_files:
|
||||||
|
errors.append(f"missing KO translation: docs/adr-ko/{name}")
|
||||||
|
for name in ko_files:
|
||||||
|
if name not in en_files:
|
||||||
|
errors.append(f"orphan KO (no canonical EN): docs/adr-ko/{name}")
|
||||||
|
|
||||||
|
for name in sorted(en_files.keys() & ko_files.keys()):
|
||||||
|
m = ADR_FILENAME_RE.match(name)
|
||||||
|
assert m is not None
|
||||||
|
expected_id = m.group(1)
|
||||||
|
|
||||||
|
en_text = en_files[name].read_text(encoding="utf-8")
|
||||||
|
ko_text = ko_files[name].read_text(encoding="utf-8")
|
||||||
|
|
||||||
|
en_id = extract_title_id(en_text)
|
||||||
|
ko_id = extract_title_id(ko_text)
|
||||||
|
if en_id != expected_id:
|
||||||
|
errors.append(
|
||||||
|
f"{name}: EN title ADR-ID {en_id!r} != filename {expected_id!r}"
|
||||||
|
)
|
||||||
|
if ko_id != expected_id:
|
||||||
|
errors.append(
|
||||||
|
f"{name}: KO title ADR-ID {ko_id!r} != filename {expected_id!r}"
|
||||||
|
)
|
||||||
|
|
||||||
|
en_status = extract_status_block(en_text)
|
||||||
|
ko_status = extract_status_block(ko_text)
|
||||||
|
if en_status is None:
|
||||||
|
errors.append(f"{name}: EN missing `## Status` section")
|
||||||
|
if ko_status is None:
|
||||||
|
errors.append(f"{name}: KO missing `## Status` section")
|
||||||
|
if en_status is not None and ko_status is not None and en_status != ko_status:
|
||||||
|
errors.append(
|
||||||
|
f"{name}: Status block mismatch\n"
|
||||||
|
f" EN: {en_status!r}\n"
|
||||||
|
f" KO: {ko_status!r}"
|
||||||
|
)
|
||||||
|
|
||||||
|
return errors
|
||||||
|
|
||||||
|
|
||||||
|
def main(argv: list[str] | None = None) -> int:
|
||||||
|
p = argparse.ArgumentParser(description=__doc__)
|
||||||
|
p.add_argument(
|
||||||
|
"--root",
|
||||||
|
type=Path,
|
||||||
|
default=Path.cwd(),
|
||||||
|
help="Repository root (default: cwd)",
|
||||||
|
)
|
||||||
|
args = p.parse_args(argv)
|
||||||
|
|
||||||
|
errors = verify(args.root)
|
||||||
|
if errors:
|
||||||
|
print("ADR language pair verification FAILED:")
|
||||||
|
for e in errors:
|
||||||
|
print(f" - {e}")
|
||||||
|
return 1
|
||||||
|
print("ADR language pair verification OK")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
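The Status comparison performed by the tool above can be illustrated in isolation. The following is a condensed re-statement for illustration only, not the tool's code: it normalizes line endings, collects the lines between `## Status` and the next `## ` heading, and shows that an LF file and a CRLF file yield the same block. The function names are local to this sketch.

```python
from __future__ import annotations


def normalize(text: str) -> str:
    # Map CRLF and bare CR to LF so both files compare equal.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def status_block(text: str) -> str | None:
    lines = normalize(text).splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "## Status":
            body = []
            for nxt in lines[i + 1:]:
                if nxt.startswith("## "):
                    break
                body.append(nxt)
            return "\n".join(body).strip()
    return None  # no `## Status` heading at all


en = "# ADR-0001: Foo\n\n## Status\n\nAccepted\n\n## Context\n\nbody\n"
ko = en.replace("\n", "\r\n")  # same content with CRLF line endings
print(status_block(en) == status_block(ko))
```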
+9
-9
@@ -78,15 +78,15 @@ cube:
|
|||||||
scheduler_to_fetch_store_mm: 0.5
|
scheduler_to_fetch_store_mm: 0.5
|
||||||
dma_to_tcm_bw_gbs: 512.0
|
dma_to_tcm_bw_gbs: 512.0
|
||||||
dma_to_tcm_mm: 0.5
|
dma_to_tcm_mm: 0.5
|
||||||
dma_to_fetch_store_mm: 0.0 # DMA → fetch_store chaining (ADR-0021)
|
dma_to_fetch_store_mm: 0.0 # DMA → fetch_store chaining (ADR-0014 D6)
|
||||||
fetch_store_to_tcm_bw_gbs: 512.0
|
fetch_store_to_tcm_bw_gbs: 512.0
|
||||||
fetch_store_to_tcm_mm: 0.0
|
fetch_store_to_tcm_mm: 0.0
|
||||||
fetch_store_to_gemm_mm: 0.0 # fetch → GEMM chaining (ADR-0021)
|
fetch_store_to_gemm_mm: 0.0 # fetch → GEMM chaining (ADR-0014 D6)
|
||||||
fetch_store_to_math_mm: 0.0 # fetch → MATH chaining (ADR-0021)
|
fetch_store_to_math_mm: 0.0 # fetch → MATH chaining (ADR-0014 D6)
|
||||||
gemm_to_fetch_store_mm: 0.0 # GEMM → store chaining (ADR-0021)
|
gemm_to_fetch_store_mm: 0.0 # GEMM → store chaining (ADR-0014 D6)
|
||||||
gemm_to_math_mm: 0.0 # GEMM → MATH epilogue chaining (ADR-0021)
|
gemm_to_math_mm: 0.0 # GEMM → MATH epilogue chaining (ADR-0014 D6)
|
||||||
math_to_fetch_store_mm: 0.0 # MATH → store chaining (ADR-0021)
|
math_to_fetch_store_mm: 0.0 # MATH → store chaining (ADR-0014 D6)
|
||||||
fetch_store_to_dma_mm: 0.0 # store → DMA writeback chaining (ADR-0021)
|
fetch_store_to_dma_mm: 0.0 # store → DMA writeback chaining (ADR-0014 D6)
|
||||||
gemm_to_tcm_bw_gbs: 512.0
|
gemm_to_tcm_bw_gbs: 512.0
|
||||||
gemm_to_tcm_mm: 0.5
|
gemm_to_tcm_mm: 0.5
|
||||||
math_to_tcm_bw_gbs: 512.0
|
math_to_tcm_bw_gbs: 512.0
|
||||||
@@ -99,7 +99,7 @@ cube:
|
|||||||
hbm_total_gb_per_cube: 48
|
hbm_total_gb_per_cube: 48
|
||||||
hbm_slices_per_cube: 8
|
hbm_slices_per_cube: 8
|
||||||
hbm_total_bw_gbs: 1024.0
|
hbm_total_bw_gbs: 1024.0
|
||||||
hbm_mapping_mode: n_to_one # one_to_one | n_to_one (ADR-0019)
|
hbm_mapping_mode: n_to_one # one_to_one | n_to_one (ADR-0017 D8)
|
||||||
hbm_pseudo_channels: 64 # total pseudo channels per cube
|
hbm_pseudo_channels: 64 # total pseudo channels per cube
|
||||||
hbm_channels_per_pe: 8 # = pseudo_channels / pes_per_cube
|
hbm_channels_per_pe: 8 # = pseudo_channels / pes_per_cube
|
||||||
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
|
hbm_channel_bw_gbs: 32.0 # per-channel bandwidth (GB/s)
|
||||||
@@ -123,7 +123,7 @@ cube:
|
|||||||
per_connection_bw_gbs: 128.0 # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW
|
per_connection_bw_gbs: 128.0 # BW per connection; 4 × 128 = 512 GB/s = UCIe PHY BW
|
||||||
|
|
||||||
links:
|
links:
|
||||||
# Router mesh links (ADR-0019)
|
# Router mesh links (ADR-0017 D5)
|
||||||
router_link_bw_gbs: 256.0 # inter-router XY mesh link BW
|
router_link_bw_gbs: 256.0 # inter-router XY mesh link BW
|
||||||
router_overhead_ns: 2.0 # per-router switching overhead
|
router_overhead_ns: 2.0 # per-router switching overhead
|
||||||
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router (= N × channel_bw)
|
pe_to_router_bw_gbs: 256.0 # PE_DMA ↔ router (= N × channel_bw)
|
||||||
|
|||||||
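The bandwidth relationship noted in the config comments (`pe_to_router_bw_gbs` "= N × channel_bw", and the ADR-0017 D8 invariant that channels_per_PE × per-PC BW equals the aggregated link BW) can be checked with a few lines. A minimal sketch, assuming the values shown in this diff, 8 PEs per cube, and that the PE-to-router link is the "aggregated link" in question; the function name is hypothetical and the real build-time check may differ.

```python
# Values copied from the YAML hunks above; pes_per_cube = 8 is taken from the tests.
cfg = {
    "hbm_total_gb_per_cube": 48,
    "hbm_slices_per_cube": 8,
    "hbm_pseudo_channels": 64,
    "hbm_channels_per_pe": 8,
    "hbm_channel_bw_gbs": 32.0,
    "pe_to_router_bw_gbs": 256.0,
}
PES_PER_CUBE = 8


def check_hbm_invariants(cfg: dict, pes_per_cube: int) -> list[str]:
    """Return a list of violated invariants (empty when consistent)."""
    errors = []
    if cfg["hbm_pseudo_channels"] != cfg["hbm_channels_per_pe"] * pes_per_cube:
        errors.append("pseudo_channels != channels_per_pe * pes_per_cube")
    per_pe_bw = cfg["hbm_channels_per_pe"] * cfg["hbm_channel_bw_gbs"]
    if per_pe_bw != cfg["pe_to_router_bw_gbs"]:
        errors.append(f"per-PE HBM BW {per_pe_bw} != link BW")
    return errors


slice_gb = cfg["hbm_total_gb_per_cube"] / cfg["hbm_slices_per_cube"]
print(check_hbm_invariants(cfg, PES_PER_CUBE), slice_gb)
```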