kernbench2/docs/history/ADR-0018-Logical Address.en.md
ywkang 22fd0d2b9d ADR: introduce docs/history/, merge 0011+0018, prune migration cruft
- CLAUDE.md: add ADR Lifecycle subsection (superseded → docs/history/,
  immutable numbering, no renumber)
- ADR-0011: merge ADR-0018 content as "Address Model: LA" section
  alongside PA / VA; status notes VA model is currently implemented
- ADR-0018 / 0029 / 0031: moved to docs/history/ with status updates
  (0018 merged into 0011, 0029 superseded by 0032, 0031 absorbed
  into 0001 rev 2)
- ADR-0019: rewrite Context as PE-HBM connectivity decision
  (self-contained, no LA model framing)
- ADR-0019/0020/0021/0023/0025/0027: Status Proposed → Accepted
  (code verified) and prune Implementation Notes / Affected files /
  Test strategy / "현재 상태" ("Current State") sub-sections describing pre-impl state
- ADR-0024/0026: same migration-flavor cleanup; 0026 also drops D6
  Migration and D8 docs-update sub-decisions
- ADR-0030: status simplified (blocker ADR-0031 now superseded)
- SPEC.md: R10 + §0.2 reflect PA / VA / LA model names
- ADR-0008/0012/0013: refresh ADR-0011 subtitle in Links

21 files changed, 553 insertions(+), 1290 deletions(-).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 11:42:45 -07:00

ADR-0018: LA-Based Memory Address Abstraction and HBM Channel Mapping Mode Introduction

Status

Merged into ADR-0011 (Address Model: LA section).

Context

Kernbench simulates memory access between PE_DMA and Local-HBM within a CUBE. A VA-based access path is currently used; however, it cannot consistently represent the two channel mapping models described below (1:1 and n:1).

Background: Local-HBM Pseudo Channel Structure

The HBM in a CUBE consists of 32 or 64 pseudo channels. In the PE-Local-HBM model, each PE is responsible for an equal number of pseudo channels.

Example: 64 pseudo channels, 8 PEs per cube -> each PE accesses 8 pseudo channels as local HBM

Both the number of pseudo channels and the number of PEs are topology parameters. N = hbm_pseudo_channels / pes_per_cube (= channels_per_pe) determines the number of local channels per PE.

The routing path BW between DMA and each pseudo channel matches the BW of each pseudo channel (e.g., 32 GB/s), so if a PE sends simultaneous requests to N channels, it can utilize the maximum memory BW.
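
The arithmetic above can be sketched in a few lines. This is only an illustration of the relationship between the topology parameters; the function names are hypothetical and not part of Kernbench.

```python
def channels_per_pe(hbm_pseudo_channels: int, pes_per_cube: int) -> int:
    """N = hbm_pseudo_channels / pes_per_cube; the split must be even."""
    if pes_per_cube <= 0 or hbm_pseudo_channels % pes_per_cube != 0:
        raise ValueError("pseudo channels must split evenly across PEs")
    return hbm_pseudo_channels // pes_per_cube


def peak_local_bw_gbs(n_channels: int, channel_bw_gbs: float) -> float:
    """Peak local-HBM bandwidth when all N channels are busy at once."""
    return n_channels * channel_bw_gbs


n = channels_per_pe(hbm_pseudo_channels=64, pes_per_cube=8)
print(n, peak_local_bw_gbs(n, 32.0))  # 8 256.0
```

With 64 pseudo channels, 8 PEs, and 32 GB/s per channel, a PE reaches 256 GB/s only if it keeps all 8 local channels busy at the same time.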

Limitations of the Current VA Model

When a PE's local HBM is split into N channels (8 in the example above), requests must also be generated per channel before they are sent to DMA. In the current architecture, however, the kernel generates a single VA-based request (tl.load) and passes it directly to DMA, which makes it difficult for PE_CPU to generate per-channel DMA requests.

Therefore, instead of VA, we propose using Logical Address (LA), where the BAAW (Logical-to-Physical Mapping Unit) inside PE_DMA converts LA to PA or a list of PAs based on segment-based mapping.

Two Channel Mapping Modes

  • 1:1 mode: Creates and executes per-channel requests. Precise per-channel modeling.
  • n:1 mode (default): Assumes interleaving across local HBM channels. Aggregated BW modeling.

By supporting both modes, the overhead of the n:1 mode can be measured and evaluated.

Core Requirements

  • The effective bandwidth semantics of PE_DMA -> HBM_CTRL must be identical in both modes
  • The difference must only be in the request representation and resource modeling approach
  • The kernel programming model must not be changed
  • Physical channel information must not be exposed to the kernel

Existing Physical Address

The current system's 51-bit Physical Address is defined in policy/address/phyaddr.py:

[50:47] rack_id (4 bit)
[46:43] sip_id  (4 bit)
[42:38] cube_id (5 bit, sip_seg)
[37]    hbm_selector (1=HBM window)
[36:0]  hbm_offset   (37 bit, 128GB per cube)

PA is used to represent the final routable canonical physical destination, and this role is preserved. However, the timing and policy of logical access -> physical request conversion are not clearly separated.
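
As an illustration of the bit layout listed above, the following sketch packs and unpacks the five fields. It is not the content of policy/address/phyaddr.py; the names PaFields, encode_pa, and decode_pa are assumptions made for this example.

```python
from typing import NamedTuple

# (name, low bit, width) for each field, taken from the layout above.
_FIELDS = (
    ("rack_id", 47, 4),
    ("sip_id", 43, 4),
    ("cube_id", 38, 5),
    ("hbm_selector", 37, 1),
    ("hbm_offset", 0, 37),
)


class PaFields(NamedTuple):
    rack_id: int
    sip_id: int
    cube_id: int
    hbm_selector: int
    hbm_offset: int


def encode_pa(fields: PaFields) -> int:
    """Pack the fields into a 51-bit integer, rejecting out-of-range values."""
    pa = 0
    for (name, low, width), value in zip(_FIELDS, fields):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        pa |= value << low
    return pa


def decode_pa(pa: int) -> PaFields:
    """Split a 51-bit integer back into its fields."""
    if not 0 <= pa < (1 << 51):
        raise ValueError("PA must fit in 51 bits")
    return PaFields(*((pa >> low) & ((1 << width) - 1) for _, low, width in _FIELDS))


pa = encode_pa(PaFields(rack_id=0, sip_id=0, cube_id=3, hbm_selector=1, hbm_offset=0x1000))
print(hex(pa))  # 0xe000001000
```

The 37-bit hbm_offset field spans 2^37 bytes, which is the 128 GB per cube noted in the layout.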


Decision

D1. Introduction of LA (Logical Address) — Replacing VA

The existing VA (Virtual Address) infrastructure is replaced with LA (Logical Address).

Characteristics of LA

  • Like VA, tensors can be mapped to a contiguous memory space
  • Represents logical buffer + offset
  • Does not directly contain physical channel information
  • An intermediate abstraction maintained until physical resolution
  • The sole address scheme used by kernel code (tl.load, tl.store, tl.composite)

LA Space Definition

| Item | Value |
| --- | --- |
| LA start address | 0x1_0000_0000 (4 GB, preserving the existing VA start point) |
| LA space size | 64 GB per PE |
| Alignment unit | Segment-based (see D3 below) |

LA is a PE-local address space. Even if different PEs use the same LA value, they resolve to different PAs because each PE has a different BAAW segment table.
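
A minimal sketch of a per-PE LA allocator follows, assuming a first-fit free list over the 64 GB window. The class name LaAllocator and its methods are hypothetical, and segment alignment is deliberately left out; the actual allocator may differ.

```python
LA_BASE = 0x1_0000_0000   # LA start address (4 GB)
LA_SIZE = 64 * 2**30      # LA space size per PE (64 GB)


class LaAllocator:
    """First-fit free-list allocator over one PE's LA window (illustrative)."""

    def __init__(self, base: int = LA_BASE, size: int = LA_SIZE) -> None:
        self._free = [(base, size)]       # sorted list of (start, length)
        self._live: dict[int, int] = {}   # la_base -> length

    def alloc(self, nbytes: int) -> int:
        if nbytes <= 0:
            raise ValueError("allocation size must be positive")
        for i, (start, length) in enumerate(self._free):
            if length >= nbytes:
                if length > nbytes:
                    self._free[i] = (start + nbytes, length - nbytes)
                else:
                    del self._free[i]
                self._live[start] = nbytes
                return start
        raise MemoryError("LA space exhausted")

    def free(self, la_base: int) -> None:
        length = self._live.pop(la_base)
        self._free.append((la_base, length))
        self._free.sort()
        # Coalesce neighbouring free ranges so large allocations stay possible.
        merged = [self._free[0]]
        for start, size in self._free[1:]:
            prev_start, prev_size = merged[-1]
            if prev_start + prev_size == start:
                merged[-1] = (prev_start, prev_size + size)
            else:
                merged.append((start, size))
        self._free = merged


# Each PE owns its own allocator, so both PEs can hand out the same LA value.
pe0, pe1 = LaAllocator(), LaAllocator()
print(hex(pe0.alloc(4096)), hex(pe1.alloc(4096)))  # 0x100000000 0x100000000
```

Because the LA value alone does not identify a physical location, the per-PE BAAW segment table is what distinguishes the two allocations.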

VA Infrastructure Removal Scope

With the introduction of LA, the following existing code will be replaced/removed:

| Removal Target | Replacement |
| --- | --- |
| policy/address/va_allocator.py (VirtualAllocator) | LA allocator (same free-list approach, name/role changed) |
| policy/address/pe_mmu.py (PeMMU) | BAAW segment table (inside PE_DMA) |
| components/builtin/pe_mmu.py (PeMmuComponent) | Removed; BAAW is internal PE_DMA logic, not a separate component |
| runtime_api/kernel.py: MmuMapMsg, MmuUnmapMsg | Replaced with BaawSegmentInstallMsg |
| runtime_api/context.py: VA alloc + MMU mapping install | LA alloc + BAAW segment install |
| runtime_api/tensor.py: va_base field | la_base field |
| topology.yaml: pe_mmu component entry | Removed |

D2. Mapping Mode Configuration

The mapping mode is configured at the cube level in topology.yaml:

cube:
  memory_map:
    hbm_mapping_mode: n_to_one    # one_to_one | n_to_one
    hbm_pseudo_channels: 64       # total pseudo channel count
    hbm_channels_per_pe: 8        # local channel count per PE
    hbm_channel_bw_gbs: 32.0      # per-channel bandwidth

This configuration is referenced during graph compilation (topology builder) and BAAW initialization.
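
One way the topology builder could sanity-check this block is sketched below. The loader name, the HbmMapConfig type, and the idea of passing pes_per_cube separately are assumptions for illustration; the memory_map keys are the ones shown above, and the input is a plain dict as a YAML parser would produce.

```python
from dataclasses import dataclass

VALID_MODES = ("one_to_one", "n_to_one")


@dataclass(frozen=True)
class HbmMapConfig:
    mode: str
    pseudo_channels: int
    channels_per_pe: int
    channel_bw_gbs: float


def load_memory_map(memory_map: dict, pes_per_cube: int) -> HbmMapConfig:
    """Validate the cube.memory_map keys and return them as a typed record."""
    cfg = HbmMapConfig(
        mode=memory_map.get("hbm_mapping_mode", "n_to_one"),  # n:1 is the default
        pseudo_channels=int(memory_map["hbm_pseudo_channels"]),
        channels_per_pe=int(memory_map["hbm_channels_per_pe"]),
        channel_bw_gbs=float(memory_map["hbm_channel_bw_gbs"]),
    )
    if cfg.mode not in VALID_MODES:
        raise ValueError(f"unknown hbm_mapping_mode: {cfg.mode!r}")
    if cfg.channels_per_pe * pes_per_cube != cfg.pseudo_channels:
        raise ValueError("hbm_channels_per_pe must equal hbm_pseudo_channels / pes_per_cube")
    if cfg.channel_bw_gbs <= 0:
        raise ValueError("hbm_channel_bw_gbs must be positive")
    return cfg


cfg = load_memory_map(
    {
        "hbm_mapping_mode": "n_to_one",
        "hbm_pseudo_channels": 64,
        "hbm_channels_per_pe": 8,
        "hbm_channel_bw_gbs": 32.0,
    },
    pes_per_cube=8,
)
print(cfg.mode, cfg.channels_per_pe * cfg.channel_bw_gbs)  # n_to_one 256.0
```

Checking hbm_channels_per_pe against hbm_pseudo_channels / pes_per_cube catches the case where the explicit value drifts from the derived N used elsewhere in this ADR.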


D3. Segments and BAAW

Segment Definition

A segment is a logical allocation unit that partitions the LA space so that each segment maps to a specific HBM channel or channel group.

Segments are created by the runtime allocator during tensor deployment, and BAAW uses them to convert LA into physical requests.

BAAW Segment Table Entry

from dataclasses import dataclass, field

@dataclass
class BaawSegment:
    la_base: int          # segment start LA
    la_size: int          # segment size (bytes)
    mode: str             # "one_to_one" | "n_to_one"
    # 1:1 mode fields (left at their defaults in n:1 mode)
    channel_count: int = 0    # number of channels assigned to this segment (e.g., 8)
    pa_bases: list[int] = field(default_factory=list)     # per-channel PA start address list (len = channel_count)
    channel_ids: list[int] = field(default_factory=list)  # per-channel logical IDs (e.g., [0,1,2,...,7])
    channel_size: int = 0     # per-channel size (la_size // channel_count)
    # n:1 mode fields (left at their defaults in 1:1 mode)
    agg_pa_base: int = 0      # aggregated PA start address
    agg_node_id: str = ""     # aggregated router node_id (for routing)

Segment Lifecycle

  1. Allocation time (tensor deploy):

    • RuntimeContext allocates LA space from the LA allocator
    • PEMemAllocator allocates per-channel PA (1:1) or aggregated PA (n:1)
    • Sends BaawSegmentInstallMsg to PE_DMA to register in the segment table
  2. Usage time (kernel execution):

    • Kernel issues tl.load(la_ptr) -> DmaReadCmd(src_addr=LA)
    • PE_DMA looks up the segment corresponding to the LA in BAAW
    • Converts to PA(s) according to the mode
  3. Deallocation time (tensor free):

    • Removed from the segment table
    • LA space returned, PA deallocated
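
The install / lookup / remove steps of the lifecycle can be sketched as a small sorted table. This is an assumption about one possible structure, not the PE_DMA implementation; Segment and SegmentTable are hypothetical names, and the segment here carries only the LA range.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    la_base: int
    la_size: int


class SegmentTable:
    """Segments kept sorted by la_base so lookup is a binary search."""

    def __init__(self) -> None:
        self._bases: list[int] = []
        self._segs: list[Segment] = []

    def install(self, seg: Segment) -> None:
        i = bisect.bisect_left(self._bases, seg.la_base)
        # Reject overlap with the neighbour on either side.
        if i > 0:
            prev = self._segs[i - 1]
            if prev.la_base + prev.la_size > seg.la_base:
                raise ValueError("segment overlaps its predecessor")
        if i < len(self._segs) and seg.la_base + seg.la_size > self._segs[i].la_base:
            raise ValueError("segment overlaps its successor")
        self._bases.insert(i, seg.la_base)
        self._segs.insert(i, seg)

    def find(self, la: int) -> Segment:
        i = bisect.bisect_right(self._bases, la) - 1
        if i >= 0:
            seg = self._segs[i]
            if la < seg.la_base + seg.la_size:
                return seg
        raise KeyError(f"no segment covers LA {la:#x}")

    def uninstall(self, la_base: int) -> None:
        i = bisect.bisect_left(self._bases, la_base)
        if i == len(self._bases) or self._bases[i] != la_base:
            raise KeyError(f"no segment installed at LA {la_base:#x}")
        del self._bases[i]
        del self._segs[i]


table = SegmentTable()
table.install(Segment(0x1_0000_0000, 4096))          # allocation time
print(table.find(0x1_0000_0100).la_base == 0x1_0000_0000)  # usage time: True
table.uninstall(0x1_0000_0000)                       # deallocation time
```

Rejecting overlapping ranges at install time keeps the lookup unambiguous when several tensors are deployed to the same PE.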

D4. BAAW (Logical-to-Physical Mapping Unit)

Location

BAAW is placed as a front-end stage inside PE_DMA. It is not a separate SimPy component; it is synchronous address resolution logic executed at the beginning of PE_DMA's handle_command().

Input

  • LA (Logical Address) — DmaReadCmd.src_addr or DmaWriteCmd.dst_addr
  • access size (bytes)

Output

  • 1:1 mode: list[PhysicalRequest] — each request is (PA, nbytes, channel_node_id)
  • n:1 mode: 1 PhysicalRequest — (agg_PA, nbytes, agg_node_id)

from dataclasses import dataclass

@dataclass
class PhysicalRequest:
    pa: int           # 51-bit Physical Address
    nbytes: int       # transfer size for this request
    dst_node: str     # target node_id (channel router or aggregated router)

BAAW Resolve Logic

def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
    seg = self._find_segment(la)  # la_base <= la < la_base + la_size
    offset = la - seg.la_base

    if seg.mode == "n_to_one":
        # Pass-through: one aggregated request covers the whole access
        pa = seg.agg_pa_base + offset
        return [PhysicalRequest(pa=pa, nbytes=nbytes, dst_node=seg.agg_node_id)]

    if seg.mode == "one_to_one":
        # Fan-out: every channel of the segment receives an equal share.
        # Assumes nbytes is a multiple of channel_count.
        requests = []
        ch_offset = offset % seg.channel_size  # interleaved or striped
        ch_nbytes = nbytes // seg.channel_count
        for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids):
            pa = pa_base + ch_offset
            dst_node = f"{self._pe_prefix}.ch_r{ch_id}"
            requests.append(PhysicalRequest(pa=pa, nbytes=ch_nbytes, dst_node=dst_node))
        return requests

    raise ValueError(f"unknown mapping mode: {seg.mode!r}")

Scope of Responsibility

BAAW is responsible for:

  • Converting logical accesses into physical request units
  • Performing fan-out (1:1) or pass-through (n:1) according to the mapping mode
  • Generating Physical Addresses and determining target nodes

BAAW is NOT responsible for:

  • Performing actual data movement
  • Executing NOC routing
  • Simulating bandwidth consumption (this is the role of downstream components)

Output Contract

The output of BAAW must be request units that can be directly used by the simulator's routing and resource model without any additional address decoding.
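
To make the resolve contract concrete, here is a self-contained sketch that runs the logic above for both modes. The Baaw class wrapper, the linear segment search, the bounds check, and the PA values are all assumptions chosen for illustration; only the field names and the resolve arithmetic follow this ADR.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalRequest:
    pa: int
    nbytes: int
    dst_node: str


@dataclass
class BaawSegment:
    la_base: int
    la_size: int
    mode: str  # "one_to_one" | "n_to_one"
    channel_count: int = 0
    pa_bases: list[int] = field(default_factory=list)
    channel_ids: list[int] = field(default_factory=list)
    channel_size: int = 0
    agg_pa_base: int = 0
    agg_node_id: str = ""


class Baaw:
    def __init__(self, pe_prefix: str) -> None:
        self._pe_prefix = pe_prefix
        self._segments: list[BaawSegment] = []

    def install(self, seg: BaawSegment) -> None:
        self._segments.append(seg)

    def _find_segment(self, la: int) -> BaawSegment:
        for seg in self._segments:
            if seg.la_base <= la < seg.la_base + seg.la_size:
                return seg
        raise LookupError(f"no segment covers LA {la:#x}")

    def resolve(self, la: int, nbytes: int) -> list[PhysicalRequest]:
        seg = self._find_segment(la)
        offset = la - seg.la_base
        if offset + nbytes > seg.la_size:
            raise ValueError("access runs past the end of the segment")
        if seg.mode == "n_to_one":
            return [PhysicalRequest(seg.agg_pa_base + offset, nbytes, seg.agg_node_id)]
        if seg.mode == "one_to_one":
            ch_offset = offset % seg.channel_size
            ch_nbytes = nbytes // seg.channel_count
            return [
                PhysicalRequest(pa_base + ch_offset, ch_nbytes, f"{self._pe_prefix}.ch_r{ch_id}")
                for pa_base, ch_id in zip(seg.pa_bases, seg.channel_ids)
            ]
        raise ValueError(f"unknown mapping mode: {seg.mode!r}")


LA = 0x1_0000_0000
PA_CH = [0x20_0000_0000 + ch * 0x1000_0000 for ch in range(8)]  # hypothetical per-channel PAs
PA_AGG = 0x20_8000_0000                                         # hypothetical aggregated PA

baaw = Baaw("sip0.cube0.pe0")
baaw.install(BaawSegment(LA, 4096, "one_to_one", channel_count=8, pa_bases=PA_CH,
                         channel_ids=list(range(8)), channel_size=512))
baaw.install(BaawSegment(LA + 4096, 4096, "n_to_one", agg_pa_base=PA_AGG,
                         agg_node_id="sip0.cube0.pe0.agg_router"))

fan_out = baaw.resolve(LA, 4096)
single = baaw.resolve(LA + 4096, 4096)
print(len(fan_out), fan_out[0].nbytes, len(single), single[0].nbytes)  # 8 512 1 4096
```

Each returned request already carries its destination node, so downstream routing needs no further address decoding, which is the point of the output contract.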


D5. PE_DMA handle_command() Changes

Current Flow (VA-based)

DmaReadCmd.src_addr (VA)
  -> MMU.translate(VA) -> PA
  -> PhysAddr.decode(PA) -> PhysAddr object
  -> resolver.resolve(PhysAddr) -> dst_node_id (e.g., "sip0.cube0.hbm_ctrl")
  -> router.find_path(pe_prefix, dst_node_id) -> path
  -> 1 sub-Transaction created -> fabric inject

New Flow (LA-based)

DmaReadCmd.src_addr (LA)
  -> BAAW.resolve(LA, nbytes) -> list[PhysicalRequest]
  -> For each PhysicalRequest:
      -> router.find_path(pe_prefix, req.dst_node) -> path
      -> compute_drain_ns(path, req.nbytes) -> drain
      -> sub-Transaction created -> fabric inject
  -> Wait for all sub-Transactions to complete
  -> pe_txn.done.succeed()

Key changes:

  • MMU reference removed -> replaced with BAAW resolve
  • PhysAddr.decode() + resolver.resolve() -> BAAW directly returns dst_node
  • 1 request -> N requests injected in parallel (1:1 mode)

D6. 1:1 Mode Details

  • One logical access -> N (= channels_per_pe) physical requests
  • N is a parameter determined by hbm_pseudo_channels / pes_per_cube
  • Each request:
    • Fully resolved 51-bit PA
    • Targets a specific channel router ({pe_prefix}.ch_r{channel_id})
  • BW contention modeling via per-channel links
  • PE_DMA injects N sub-transactions simultaneously

1:1 Mode Example

Configuration: hbm_pseudo_channels=64, pes_per_cube=8 -> channels_per_pe=8, PE0 owns ch0-7

Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "one_to_one", channel_count: 8,  # = channels_per_pe
    pa_bases: [PA_ch0, PA_ch1, ..., PA_ch7],
    channel_ids: [0, 1, 2, 3, 4, 5, 6, 7],
    channel_size: 512,  # = la_size / channel_count
}

BAAW resolve result (N=8 requests):
  -> PhysicalRequest(pa=PA_ch0, nbytes=512, dst_node="sip0.cube0.pe0.ch_r0")
  -> PhysicalRequest(pa=PA_ch1, nbytes=512, dst_node="sip0.cube0.pe0.ch_r1")
  -> ...
  -> PhysicalRequest(pa=PA_ch7, nbytes=512, dst_node="sip0.cube0.pe0.ch_r7")

PE_DMA: N sub-transactions injected in parallel
  Each accesses HBM via channel router -> hbm_ctrl link (channel_bw_gbs)
  Total effective BW = N x channel_bw_gbs

Examples with different N values:

  • hbm_pseudo_channels=32, pes_per_cube=8 -> channels_per_pe=4, 4 requests
  • hbm_pseudo_channels=64, pes_per_cube=4 -> channels_per_pe=16, 16 requests

D7. n:1 Mode Details

  • One logical access -> one aggregated request
  • Target: aggregated router -> hbm_ctrl (see ADR-0019)
  • Aggregated link BW = channels_per_pe x channel_bw_gbs (e.g., 8 x 32 = 256 GB/s)
  • Modeled as a single queue / resource
  • No per-channel PA decomposition

n:1 Mode Example

Tensor A (4 KB) -> LA 0x1_0000_0000, size=4096 bytes
BAAW segment: {
    la_base: 0x1_0000_0000, la_size: 4096,
    mode: "n_to_one",
    agg_pa_base: PA_agg,
    agg_node_id: "sip0.cube0.pe0.agg_router",
}

BAAW resolve result:
  -> PhysicalRequest(pa=PA_agg, nbytes=4096, dst_node="sip0.cube0.pe0.agg_router")

PE_DMA: 1 sub-transaction injected
  Accesses HBM via aggregated router -> hbm_ctrl link (256 GB/s)
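
The requirement that both modes yield the same effective bandwidth can be checked with a short calculation for the two 4 KB examples. This is a simplified sketch that assumes 1 GB/s equals 1 byte per ns, independent per-channel links, and no latency or contention; drain_ns and completion_ns are hypothetical helpers, not the simulator's compute_drain_ns.

```python
def drain_ns(nbytes: int, link_bw_gbs: float) -> float:
    """Time to push nbytes over one link; 1 GB/s is 1 byte per ns."""
    return nbytes / link_bw_gbs


def completion_ns(request_sizes: list[int], link_bw_gbs: float) -> float:
    """Parallel requests on independent links finish when the slowest one does."""
    return max(drain_ns(n, link_bw_gbs) for n in request_sizes)


CHANNELS, CHANNEL_BW_GBS, ACCESS_BYTES = 8, 32.0, 4096

# 1:1 mode: 8 requests of 512 bytes, each on its own 32 GB/s channel link.
one_to_one = completion_ns([ACCESS_BYTES // CHANNELS] * CHANNELS, CHANNEL_BW_GBS)
# n:1 mode: 1 request of 4096 bytes on the 256 GB/s aggregated link.
n_to_one = completion_ns([ACCESS_BYTES], CHANNELS * CHANNEL_BW_GBS)

print(one_to_one, n_to_one)  # 16.0 16.0
```

Under these idealized assumptions both modes complete in the same time; differences between them should therefore come only from per-channel contention that the 1:1 mode models and the n:1 mode aggregates.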

D8. Kernel Model Preservation

  • The kernel still issues only single memory ops (tl.load, tl.store, tl.composite)
  • LA is the address scheme passed to the kernel
  • Channel decomposition/aggregation is performed by BAAW inside PE_DMA
  • Physical channel information is not exposed to kernel code

Consequences

Positive

  • 1:1 vs n:1 semantics are clearly separated at a single point: BAAW
  • Kernel abstraction is preserved — no kernel code changes required
  • Topology-based policy control is possible (mode switching via yaml)
  • Improved simulation model consistency and debuggability
  • Segment-based mapping is simpler and has lower overhead compared to page tables

Negative

  • Full refactoring of VA/MMU-based code is required
  • Increased complexity in the request generation path (managing N requests in 1:1 mode)
  • Reduced per-channel visibility in n:1 mode
  • Existing VA-related tests must be rewritten

Alternatives

A1. Keep VA + Fan-out at MMU

  • Extend MMU to return per-channel PAs
  • Problem: MMU's role expands beyond address translation to include request decomposition
  • Problem: Aggregation representation is difficult in n:1 mode

A2. Kernel Generates Channel-Aware Requests

  • Kernel directly calls per-channel load/store
  • Problem: Abstraction leakage, reduced portability
  • Problem: All benchmark code must be modified

A3. Always Use PA (Without LA)

  • Runtime directly passes per-channel PA to the kernel
  • Problem: Conflicts with the aggregation model
  • Problem: Conversion timing is unclear, channel information exposed to kernel

Implementation Notes

Implementation Order

  1. Introduce LA type (policy/address/la_allocator.py)
  2. Implement BAAW segment table (policy/address/baaw.py)
  3. Add BaawSegmentInstallMsg message type (runtime_api/kernel.py)
  4. Integrate BAAW into PE_DMA (components/builtin/pe_dma.py handle_command changes)
  5. Modify RuntimeContext: LA alloc + segment install (runtime_api/context.py)
  6. Change Tensor.va_base -> la_base (runtime_api/tensor.py)
  7. Remove VA/MMU code
  8. Remove pe_mmu from topology.yaml, add mapping mode configuration
  9. Test migration

Affected Existing Tests

| Test File | Impact |
| --- | --- |
| tests/test_mmu_component.py | Remove -> replace with BAAW segment install test |
| tests/test_mmu_fabric.py | Remove -> replace with BAAW + fabric integration test |
| tests/test_pe_mmu.py | Remove |
| tests/test_va_allocator.py | Replace with LA allocator test |
| tests/test_va_integration.py | Replace with LA + BAAW integration test |
| tests/test_va_offset.py | Replace with LA offset test |

Test Requirements

  • For the same logical access:
    • 1:1 -> verify N requests are generated
    • n:1 -> verify 1 aggregated request is generated
  • Verify effective bandwidth consistency across both modes
  • 1:1 -> verify per-channel contention modeling
  • n:1 -> verify aggregated bandwidth is reflected
  • Verify operation without kernel code changes
  • Verify correct BAAW segment install/uninstall operation
  • Verify no conflicts when multiple tensors are assigned to different segments

Links

  • ADR-0011 (Memory Addressing Simplification — PA-first, VA/MMU introduction) -> superseded by this ADR
  • ADR-0019 (NOC Per-Channel HBM Connection Model) -> topology-side integration
  • ADR-0014 (PE Internal Execution Model) -> PE_DMA change impact