Add SIP-level tensor parallelism, component registry YAML, VA offset verification
- DPPolicy: 3-level (sip/cube/pe), unified naming (column_wise/row_wise)
- PE_CPU: auto num_programs from cube shard count
- context.launch(): per-SIP KernelLaunchMsg with local va_base + auto local shape
- deploy_tensor: removed mmus param, MMU mapping is context-only responsibility
- ComponentRegistry: YAML-based lazy loading (components.yaml), impls→builtin rename
- VA offset bench + tests: 2D/1D, standard Triton kernel pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -58,7 +58,7 @@ sufficient to execute kernels and issue DMA requests.
 - Mapping strategy based on `DPPolicy.cube`:
   - **Replicate** (`cube="replicate"`): per-(sip, cube) local mapping only.
     Each cube's PEs see only their local PA. No cross-cube mapping installed.
-  - **Sharded** (`cube="shard_m"`, etc.): broadcast all shard mappings to all
+  - **Sharded** (`cube="column_wise"`, etc.): broadcast all shard mappings to all
     target cubes. Enables cross-PE and cross-cube DMA.

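The two mapping strategies above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the kernbench code: the function `plan_mappings`, the `local_pa` dictionary of per-(sip, cube) physical addresses, and the choice to treat every listed (sip, cube) pair as one broadcast set are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

Target = Tuple[int, int]          # (sip, cube), hypothetical identifier
Mapping = Tuple[Target, int]      # (shard owner, physical address)


def plan_mappings(cube_policy: str,
                  local_pa: Dict[Target, int]) -> Dict[Target, List[Mapping]]:
    """Decide which shard mappings each target cube would get installed.

    Assumption: any policy other than "replicate" is a sharded policy
    (for example "column_wise" or "row_wise").
    """
    targets = sorted(local_pa)
    if cube_policy == "replicate":
        # Each cube sees only its own local PA; nothing cross-cube.
        return {t: [(t, local_pa[t])] for t in targets}
    # Sharded: every target cube receives the mapping of every shard,
    # which is what makes cross-cube DMA addressable.
    all_shards = [(owner, local_pa[owner]) for owner in targets]
    return {t: list(all_shards) for t in targets}


local_pa = {(0, 0): 0x1000, (0, 1): 0x2000}
replicated = plan_mappings("replicate", local_pa)
sharded = plan_mappings("column_wise", local_pa)
```

With two cubes on one SIP, `replicated` holds a single local entry per cube, while `sharded` gives each cube both entries.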
#### D3.4 Tensor Lifecycle
@@ -163,11 +163,11 @@ DefaultComponent ← safe fallback
 ## Slide 7: Registry registration method
 
 ```python
-# kernbench/components/impls/__init__.py
+# kernbench/components/builtin/__init__.py
 
 from kernbench.components.base import ComponentRegistry
-from kernbench.components.impls.noc import TwoDMeshNocComponent
-from kernbench.components.impls.io_cpu import IoCpuComponent
+from kernbench.components.builtin.noc import TwoDMeshNocComponent
+from kernbench.components.builtin.io_cpu import IoCpuComponent
 # ...
 
 ComponentRegistry.register("noc_2d_mesh_v1", TwoDMeshNocComponent)