Files

T

ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft

- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 11:42:45 -07:00

15 KiB

Raw Blame History

ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction

Status

Merged into ADR-0011 (Address Model: LA section).

Context

Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE. Currently, a VA-based access path is used; however, the following two channel mapping models are difficult to represent consistently.

Background: Local-HBM Pseudo Channel Structure

The HBM in a CUBE consists of 32 or 64 pseudo channels. In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.

Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM

Both the number of pseudo channels and the number of PEs are topology parameters. N = hbm_pseudo_channels / pes_per_cube (= channels_per_pe) determines the number of local channels per PE.

The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel (e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the maximum memory BW.

Limitations of the Current VA Model

When channels are divided into 8, requests must also be generated per channel and sent to DMA. However, in the current architecture, the kernel generates requests with VA (tl.load) and passes them directly to DMA, making it difficult for PE_CPU to generate per-channel DMA requests.

Therefore, instead of VA, we propose using Logical Address (LA), where the BAAW (Logical-to-Physical Mapping Unit) inside PE_DMA converts LA to PA or a list of PAs based on segment-based mapping.

Two Channel Mapping Modes

1:1 mode: Creates and executes per-channel requests. Precise per-channel modeling.
n:1 mode (default): Assumes interleaving across local HBM channels. Aggregated BW modeling.

By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.

Core Requirements

The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
The difference must only be in the request representation and resource modeling approach
The kernel programming model must not be changed
Physical channel information must not be exposed to the kernel

Existing Physical Address

The current system's 51-bit Physical Address is defined in policy/address/phyaddr.py:

[50:47] rack_id (4 bit)
[46:43] sip_id  (4 bit)
[42:38] cube_id (5 bit, sip_seg)
[37]    hbm_selector (1=HBM window)
[36:0]  hbm_offset   (37 bit, 128GB per cube)

PA is used to represent the final routable canonical physical destination, and this role is preserved. However, the timing and policy of logical access -> physical request conversion are not clearly separated.

Decision

D1. Introduction of LA (Logical Address) — Replacing VA

The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).

Characteristics of LA

Like VA, tensors can be mapped to a contiguous memory space
Represents logical buffer + offset
Does not directly contain physical channel information
An intermediate abstraction maintained until physical resolution
The sole address scheme used by kernel code (tl.load, tl.store, tl.composite)

LA Space Definition

Item	Value
LA start address	`0x1_0000_0000` (4 GB, preserving the existing VA start point)
LA space size	64 GB per PE
Alignment unit	Segment-based (see D3 below)

LA is a PE-local address space. Even if different PEs use the same LA value, they resolve to different PAs because each PE has a different BAAW segment table.

VA Infrastructure Removal Scope

With the introduction of LA, the following existing code will be replaced/removed:

Removal Target	Replacement
`policy/address/va_allocator.py` (VirtualAllocator)	LA allocator (same free-list approach, name/role changed)
`policy/address/pe_mmu.py` (PeMMU)	BAAW segment table (inside PE_DMA)
`components/builtin/pe_mmu.py` (PeMmuComponent)	Removed — BAAW is internal PE_DMA logic, not a separate component
`runtime_api/kernel.py`: MmuMapMsg, MmuUnmapMsg	Replaced with BaawSegmentInstallMsg
`runtime_api/context.py`: VA alloc + MMU mapping install	LA alloc + BAAW segment install
`runtime_api/tensor.py`: `va_base` field	`la_base` field
`topology.yaml`: pe_mmu component entry	Removed

D2. Mapping Mode Configuration

The mapping mode is configured at the cube level in topology.yaml:

cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # local channel count per PE
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth

This configuration is referenced during graph compilation (topology builder) and BAAW initialization.

D3. Segments and BAAW

Segment Definition

A segment is a logical allocation unit that partitions the LA space so that each segment maps to a specific HBM channel or channel group.

Segments are created by the runtime allocator during tensor deployment, and BAAW uses them to convert LA into physical requests.

BAAW Segment Table Entry

@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields
    channel_count: int    # number of channels assigned to this segment (e.g., 8)
    pa_bases: list[int]   # per-channel PA start address list (len = channel_count)
    channel_ids: list[int]  # per-channel logical IDs (e.g., [0,1,2,...,7])
    channel_size: int     # per-channel size (la_size // channel_count)
    # n:1 mode fields
    agg_pa_base: int      # aggregated PA start address
    agg_node_id: str      # aggregated router node_id (for routing)

Segment Lifecycle

Allocation time (tensor deploy):
- RuntimeContext allocates LA space from the LA allocator
- PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
- Sends BaawSegmentInstallMsg to PE_DMA to register in the segment table
Usage time (kernel execution):
- Kernel issues tl.load(la_ptr) -> DmaReadCmd(src_addr=LA)
- PE_DMA looks up the segment corresponding to the LA in BAAW
- Converts to PA(s) according to the mode
Deallocation time (tensor free):
- Removed from the segment table
- LA space returned, PA deallocated

D4. BAAW (Logical-to-Physical Mapping Unit)

Location

BAAW is placed as a front-end stage inside PE_DMA. It is not a separate SimPy component; it is synchronous address resolution logic executed at the beginning of PE_DMA's handle_command().

Input

LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
access size (bytes)

Output

1:1 mode: list[PhysicalRequest] — each request is (PA, nbytes, channel_node_id)
n:1 mode: 1 PhysicalRequest — (agg_PA, nbytes, agg_node_id)

@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)

BAAW Resolve Logic

def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    elif seg.mode == "one_to_one":
        requests = []
        per_ch_size = seg.channel_size
        for i, (pa_base, ch_id) in enumerate(zip(seg.pa_bases, seg.channel_ids)):
            ch_offset = offset % per_ch_size  # interleaved or striped
            ch_nbytes = nbytes // seg.channel_count
            pa = pa_base + ch_offset
            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
        return requests

Scope of Responsibility

BAAW is responsible for:

Converting logical accesses into physical request units
Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
Generating Physical Addresses and determining target nodes

BAAW is NOT responsible for:

Performing actual data movement
Executing NOC routing
Simulating bandwidth consumption (this is the role of downstream components)

Output Contract

The output of BAAW must be request units that can be directly used by the simulator's routing and resource model without any additional address decoding.

D5. PE_DMA handle_command() Changes

Current Flow (VA-based)

DmaReadCmd.src_addr (VA)
  -> MMU.translate(VA) -> PA
  -> PhysAddr.decode(PA) -> PhysAddr object
  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
  -> router.find_path(pe_prefix, dst_node_id) -> path
  -> 1 sub-Transaction created -> fabric inject

New Flow (LA-based)

DmaReadCmd.src_addr (LA)
  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
  -> For each PhysicalRequest:
      -> router.find_path(pe_prefix, req.dst_node) -> path
      -> compute_drain_ns(path, req.nbytes) -> drain
      -> sub-Transaction created -> fabric inject
  -> Wait for all sub-Transactions to complete
  -> pe_txn.done.succeed()

Key changes:

MMU reference removed -> replaced with BAAW resolve
PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
1 request -> N requests injected in parallel (1:1 mode)

D6. 1:1 Mode Details

One logical access -> N (= channels_per_pe) physical requests
N is a parameter determined by hbm_pseudo_channels / pes_per_cube
Each request:
- Fully resolved 51-bit PA
- Targets a specific channel router ({pe_prefix}.ch_r{channel_id})
BW contention modeling via per-channel links
PE_DMA injects N sub-transactions simultaneously

1:1 Mode Example

Configuration: hbm_pseudo_channels=64, pes_per_cube=8 -> channels_per_pe=8, PE0 owns ch0-7

Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,  # = la_size / channel_count
}

BAAW resolve result (N=8 requests):
  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  -> ...
  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: N sub-transactions injected in parallel
  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
  Total effective BW = N x channel_bw_gbs

Examples with different N values:

hbm_pseudo_channels=32, pes_per_cube=8 -> channels_per_pe=4, 4 requests
hbm_pseudo_channels=64, pes_per_cube=4 -> channels_per_pe=16, 16 requests

D7. n:1 Mode Details

One logical access -> one aggregated request
Target: aggregated router -> hbm_ctrl (see ADR-0019)
Aggregated link BW = channels_per_pe x channel_bw_gbs (e.g., 8 x 32 = 256 GB/s)
Modeled as a single queue / resource
No per-channel PA decomposition

n:1 Mode Example

Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction injected
  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)

D8. Kernel Model Preservation

The kernel still issues only single memory ops (tl.load, tl.store, tl.composite)
LA is the address scheme passed to the kernel
Channel decomposition/aggregation is performed by BAAW inside PE_DMA
Physical channel information is not exposed to kernel code

Consequences

Positive

1:1 vs n:1 semantics are clearly separated at a single point: BAAW
Kernel abstraction is preserved — no kernel code changes required
Topology-based policy control is possible (mode switching via yaml)
Improved simulation model consistency and debuggability
Segment-based mapping is simpler and has lower overhead compared to page tables

Negative

Full refactoring of VA/MMU-based code is required
Increased complexity in the request generation path (managing N requests in 1:1 mode)
Reduced per-channel visibility in n:1 mode
Existing VA-related tests must be rewritten

Alternatives

A1. Keep VA + Fan-out at MMU

Extend MMU to return per-channel PAs
Problem: MMU's role expands beyond address translation to include request decomposition
Problem: Aggregation representation is difficult in n:1 mode

A2. Kernel Generates Channel-Aware Requests

Kernel directly calls per-channel load/store
Problem: Abstraction leakage, reduced portability
Problem: All benchmark code must be modified

A3. Always Use PA (Without LA)

Runtime directly passes per-channel PA to the kernel
Problem: Conflicts with the aggregation model
Problem: Conversion timing is unclear, channel information exposed to kernel

Implementation Notes

Implementation Order

Introduce LA type (policy/address/la_allocator.py)
Implement BAAW segment table (policy/address/baaw.py)
Add BaawSegmentInstallMsg message type (runtime_api/kernel.py)
Integrate BAAW into PE_DMA (components/builtin/pe_dma.py handle_command changes)
Modify RuntimeContext: LA alloc + segment install (runtime_api/context.py)
Change Tensor.va_base -> la_base (runtime_api/tensor.py)
Remove VA/MMU code
Remove pe_mmu from topology.yaml, add mapping mode configuration
Test migration

Affected Existing Tests

Test File	Impact
`tests/test_mmu_component.py`	Remove -> replace with BAAW segment install test
`tests/test_mmu_fabric.py`	Remove -> replace with BAAW + fabric integration test
`tests/test_pe_mmu.py`	Remove
`tests/test_va_allocator.py`	Replace with LA allocator test
`tests/test_va_integration.py`	Replace with LA + BAAW integration test
`tests/test_va_offset.py`	Replace with LA offset test

Test Requirements

For the same logical access:
- 1:1 -> verify N requests are generated
- n:1 -> verify 1 aggregated request is generated
Verify effective bandwidth consistency across both modes
1:1 -> verify per-channel contention modeling
n:1 -> verify aggregated bandwidth is reflected
Verify operation without kernel code changes
Verify correct BAAW segment install/uninstall operation
Verify no conflicts when multiple tensors are assigned to different segments

15 KiB Raw Blame History