Implement ADR-0021: PE pipeline refactor with token self-routing

Step 1-2: Backup existing code
- builtin/ → builtin_legacy/ (unchanged backup)
- custom/pe_accel/ → custom/pe_accel_legacy/ (unchanged backup)

Step 3-4: New pipeline types and tiling
- pe_types.py: StageType, Stage, TilePlan, PipelinePlan, PipelineContext, TileToken
- tiling.py: generate_gemm_plan, generate_math_plan (ported from pe_accel)

Step 5: Component implementations (ADR-0021 D4-D6)
- PE_SCHEDULER: _feed_loop (singleton FIFO feeder) + plan generation
- PE_FETCH_STORE: new component (TCM ↔ Register File)
- PE_GEMM: TileToken pipeline + legacy PeInternalTxn dual-mode
- PE_MATH: TileToken pipeline + legacy dual-mode
- PE_DMA: TileToken pipeline + legacy + fabric Transaction triple-mode
- PE_TCM: TcmRequest handler with dual-channel BW serialization

Step 6: Infrastructure
- topology.yaml: pe_fetch_store component + chaining edges
- components.yaml: pe_fetch_store_v1 registration
- builder.py: PE_COMP_OFFSETS, _add_pe_internal_edges, PE view positions
- Tests: node/edge counts, PE component sets updated

All components handle both TileToken (pipeline) and PeInternalTxn (legacy).
Token self-routing: components read next stage from token.plan, chain via out_port.

366 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
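
Illustration of the token self-routing described above, as a minimal sketch rather than the code in this commit. Only the names TileToken, StageType, token.plan and out_port come from the message; the enum members, the stage_idx field, the COMPONENT_FOR mapping, the port-name format and the handle/run helpers are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class StageType(Enum):
    # Member names are assumptions; the real enum lives in pe_types.py.
    DMA_LOAD = auto()
    FETCH = auto()
    GEMM = auto()
    STORE = auto()
    DMA_WRITEBACK = auto()


# Hypothetical mapping from a stage to the component that executes it.
COMPONENT_FOR = {
    StageType.DMA_LOAD: "pe_dma",
    StageType.FETCH: "pe_fetch_store",
    StageType.GEMM: "pe_gemm",
    StageType.STORE: "pe_fetch_store",
    StageType.DMA_WRITEBACK: "pe_dma",
}


@dataclass
class TileToken:
    plan: List[StageType]  # ordered stages for one tile (assumed layout)
    stage_idx: int = 0     # index of the stage currently being executed

    def next_stage(self) -> Optional[StageType]:
        nxt = self.stage_idx + 1
        return self.plan[nxt] if nxt < len(self.plan) else None


def handle(component: str, token: TileToken) -> Optional[str]:
    """Run the current stage, then pick the out_port from the token's plan.

    Returns None when the plan is exhausted.
    """
    assert COMPONENT_FOR[token.plan[token.stage_idx]] == component
    nxt = token.next_stage()
    if nxt is None:
        return None
    token.stage_idx += 1
    return f"{component}_to_{COMPONENT_FOR[nxt]}"


def run(token: TileToken) -> List[str]:
    """Follow the token hop by hop; no central router decides the path."""
    component = COMPONENT_FOR[token.plan[0]]
    hops: List[str] = []
    while True:
        out_port = handle(component, token)
        if out_port is None:
            return hops
        hops.append(out_port)
        component = COMPONENT_FOR[token.plan[token.stage_idx]]
```

In this sketch a GEMM tile planned as DMA_LOAD, FETCH, GEMM, STORE, DMA_WRITEBACK takes four hops, which correspond to the dma → fetch_store → gemm → fetch_store → dma chaining edges added to topology.yaml.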
2026-04-08 23:35:31 -07:00
parent 161132cdcb
commit b6eb97c49a
40 changed files with 4055 additions and 214 deletions
@@ -63,19 +63,28 @@ cube:
      pe_cpu:       { kind: pe_cpu,       impl: pe_cpu_v1,       attrs: { overhead_ns: 2.0 } }
      pe_scheduler: { kind: pe_scheduler, impl: pe_scheduler_v2, attrs: { overhead_ns: 1.0 } }
      pe_dma:       { kind: pe_dma,       impl: pe_dma_v1,       attrs: { rd_engines: 1, wr_engines: 1 } }
-      pe_gemm:      { kind: pe_gemm,      impl: pe_gemm_v1,      attrs: { overhead_ns: 0.0, shared_resource: accel_slot, peak_tflops_f16: 8.0 } }
-      pe_math:      { kind: pe_math,      impl: pe_math_v1,      attrs: { overhead_ns: 0.0, shared_resource: accel_slot } }
-      pe_mmu:       { kind: pe_mmu,       impl: pe_mmu_v1,       attrs: { tlb_overhead_ns: 0.5, page_size: 4096 } }
-      pe_tcm:       { kind: pe_tcm,       impl: pe_tcm_v1,       attrs: { size_mb: 16 } }
+      pe_gemm:        { kind: pe_gemm,        impl: pe_gemm_v1,        attrs: { overhead_ns: 0.0, shared_resource: accel_slot, peak_tflops_f16: 8.0 } }
+      pe_math:        { kind: pe_math,        impl: pe_math_v1,        attrs: { overhead_ns: 0.0, shared_resource: accel_slot } }
+      pe_fetch_store: { kind: pe_fetch_store, impl: pe_fetch_store_v1, attrs: { overhead_ns: 0.0 } }
+      pe_mmu:         { kind: pe_mmu,         impl: pe_mmu_v1,         attrs: { tlb_overhead_ns: 0.5, page_size: 4096 } }
+      pe_tcm:         { kind: pe_tcm,         impl: pe_tcm_v1,         attrs: { size_mb: 16, read_bw_gbs: 512.0, write_bw_gbs: 512.0 } }
    links:
      pe_cpu_to_scheduler_mm:  0.5
      scheduler_to_dma_mm:     0.5
      scheduler_to_gemm_mm:    0.5
      scheduler_to_math_mm:    0.5
+      scheduler_to_fetch_store_mm: 0.5
      dma_to_tcm_bw_gbs:       512.0
      dma_to_tcm_mm:           0.5
-      gemm_to_tcm_bw_gbs:      512.0    # GEMM reads inputs from TCM (ADR-0014 D5)
+      dma_to_fetch_store_mm:   0.0     # DMA → fetch_store chaining (ADR-0021)
+      fetch_store_to_tcm_bw_gbs: 512.0
+      fetch_store_to_tcm_mm:   0.0
+      fetch_store_to_gemm_mm:  0.0     # fetch → GEMM chaining (ADR-0021)
+      fetch_store_to_math_mm:  0.0     # fetch → MATH chaining (ADR-0021)
+      gemm_to_fetch_store_mm:  0.0     # GEMM → store chaining (ADR-0021)
+      math_to_fetch_store_mm:  0.0     # MATH → store chaining (ADR-0021)
+      fetch_store_to_dma_mm:   0.0     # store → DMA writeback chaining (ADR-0021)
+      gemm_to_tcm_bw_gbs:      512.0
      gemm_to_tcm_mm:          0.5
      math_to_tcm_bw_gbs:      512.0
      math_to_tcm_mm:          0.5