Morning Briefing · Wednesday, May 6, 2026

Astera Labs Redefines AI Scale-Up Fabric as KV Cache Becomes the Infrastructure Bottleneck

Plate I · leaf · spine
Schematic leaf-spine fabric — explicit-path traffic flows across the spine plane, pods at the edges.
№ 01·Top Highlights

Top 3 Highlights

1. Astera Labs Scorpio X-Series: Memory-Semantic Fabric for Fragmented AI Workloads

TL;DR: Astera Labs launched the 320-lane Scorpio X-Series smart fabric switch targeting the gap between traditional GPU cluster designs and real-world agentic and multi-step AI workloads that constantly pause, branch, and wait.

Key Points:

  • 320 PCIe Gen 6 lanes per chip — enables direct load/store access across the fabric as if all GPUs shared local memory
  • Memory-semantic architecture: GPU accelerators access remote resources without explicit message-passing, collapsing east-west latency
  • Hardware-accelerated Hypercast and In-Network Compute boost collective operations by up to 2x, improving tokens-per-watt
  • Optical connectivity option enables multi-rack deployments scaling to thousands of GPUs
  • Shipments to hyperscalers have begun; broader production ramp H2 2026; merchant scale-up switch market projected at $20 billion by 2030
  • The expanded P-Series covers 32 to 320 lanes, targeting both boutique and hyperscale cluster sizes

So What? The NVIDIA NVSwitch monopoly on rack-scale AI interconnect has a real challenger. The Scorpio X-Series bets that agentic workloads — which pause constantly for tool calls, context retrieval, and human approval gates — need memory-semantic access more than they need the raw bandwidth of tightly-coupled GPU collectives. If that bet is right, Ethernet-like open fabric architectures win the AI scale-up layer, not just the scale-out layer. Run an Ethernet-versus-proprietary fabric evaluation for any AI cluster RFQ with a 2027 or later delivery date — the answer is no longer obvious.

Sources: https://www.asteralabs.com/news/astera-labs-broadens-scorpio-x-series-smart-fabric-switch-roadmap-to-address-expanding-scale-up-market-opportunities/, https://www.datacenterknowledge.com/infrastructure/astera-labs-targets-fragmented-ai-workloads-with-new-fabric, https://siliconangle.com/2026/05/05/astera-labs-debuts-new-scorpio-smart-fabric-data-center-switch-scale-ai-compute-clusters/


2. KV Cache Is the New Fabric Bottleneck — Two Papers Reshape Inference Architecture

TL;DR: Two papers arriving simultaneously make the same architectural argument from different angles: the KV cache, not the compute, is now the primary constraint for agentic LLM inference, and managing it well requires rethinking scheduling and routing at the infrastructure layer.

Key Points:

  • PPD Disaggregation (arxiv 2603.13358): Standard prefill-decode disaggregation assumes every turn starts fresh. For multi-turn agents, this forces repeated full prefills and saturates bandwidth between prefill and decode nodes. Append-prefill, which processes only the new input tokens while reusing cached KV states, causes an order of magnitude less decode slowdown than a full prefill. Routing append-prefill locally to decode nodes cuts time-to-first-token for turn 2 and later by 68% while maintaining competitive throughput
  • Continuum KV Cache TTL (arxiv 2511.02230): Agentic workloads interleave LLM calls with tool executions, creating pauses that standard eviction policies interpret as completed requests. Continuum pins the KV cache in GPU memory with a time-to-live determined by reload cost versus benefit of retention. On real agentic benchmarks (SWE-Bench, BFCL) with Llama-3.1 8B and 70B, job completion time improves significantly and the benefit scales with turn depth
  • In production agentic workloads, KV cache hit rates can exceed 95% — the model barely recomputes, it mostly loads cached state
  • These are complementary: PPD addresses routing between cluster nodes, Continuum addresses scheduling within a node
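
The time-to-live idea in the Continuum bullet can be sketched as a break-even calculation. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names, the `hold_cost_per_s` parameter, and the example numbers are all hypothetical.

```python
def kv_ttl_seconds(cached_tokens: int,
                   prefill_tokens_per_s: float,
                   resume_probability: float,
                   hold_cost_per_s: float) -> float:
    """Longest pause for which keeping the KV cache pinned is expected to pay off.

    hold_cost_per_s is an assumed, dimensionless measure: seconds of delay
    imposed on queued requests for every second the cache stays pinned.
    """
    # Reload cost avoided if the session resumes: re-running prefill over
    # every cached token, weighted by the chance the session comes back.
    expected_saving_s = resume_probability * cached_tokens / prefill_tokens_per_s
    # Break-even point: the pause length where holding costs equal the saving.
    return expected_saving_s / hold_cost_per_s


def should_pin(expected_pause_s: float, ttl_s: float) -> bool:
    """Pin the cache only if the tool call is expected to return within the TTL."""
    return expected_pause_s <= ttl_s


# Hypothetical session: 40K cached tokens, 10K tokens/s prefill rate.
ttl = kv_ttl_seconds(cached_tokens=40_000, prefill_tokens_per_s=10_000,
                     resume_probability=0.9, hold_cost_per_s=0.5)
print(f"TTL: {ttl:.1f} s, pin for a 5 s tool call: {should_pin(5.0, ttl)}")
```

The point of the sketch is that the TTL grows with the cost of rebuilding the cache and shrinks as memory pressure rises, so deep sessions with large contexts are the ones worth holding across a tool call.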

So What? If you're operating LLM inference infrastructure today using standard vLLM or TGI defaults, your serving system was designed for single-shot queries, not agents. Both papers are pointing at the same gap: the scheduler and router need to understand workflow structure, not just individual requests. Before your next inference cluster spec, ask vendors explicitly how they handle KV state across tool-call gaps. The answer differentiates purpose-built agentic serving from repurposed chat infrastructure.
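
As a rough illustration of what a workflow-aware router has to decide, the sketch below routes a turn either to a prefill node (full prefill) or to the decode node that already holds the session's KV state (append-prefill). The `Turn` type, the pool names, and the `decode_cache` bookkeeping are hypothetical; this is not the PPD paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    session_id: str
    prompt_tokens: int  # full conversation length, including the new input
    new_tokens: int     # tokens added since the previous turn


def route(turn: Turn, decode_cache: dict[str, int]) -> tuple[str, int]:
    """Return (target_pool, tokens_to_prefill) for one turn.

    decode_cache maps session_id to the number of tokens whose KV state is
    already resident on that session's decode node (assumed bookkeeping).
    """
    cached = decode_cache.get(turn.session_id, 0)
    if cached > 0 and cached + turn.new_tokens == turn.prompt_tokens:
        # Append-prefill: only the new tokens are processed, on the decode node.
        return "decode", turn.new_tokens
    # First turn, or the cached prefix does not line up: full prefill.
    return "prefill", turn.prompt_tokens


cache = {"session-a": 1_000}
print(route(Turn("session-a", 1_200, 200), cache))    # cached prefix reused
print(route(Turn("session-b", 1_200, 1_200), cache))  # cold start
```

A router that only sees individual requests always takes the second branch, which is the gap the paragraph above describes.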

Sources: https://arxiv.org/abs/2603.13358, https://arxiv.org/abs/2511.02230


3. NVIDIA Extreme Co-Design: Agentic AI Demands a Six-Chip Stack

TL;DR: NVIDIA's Vera Rubin platform frames agentic AI as an infrastructure category requiring purpose-built codesign across GPU, CPU, DPU, NIC, fabric switch, and storage — not incremental scaling of existing GPU server configurations.

Key Points:

  • NVIDIA claims Vera Rubin NVL72 delivers one-tenth the cost per million tokens compared to Blackwell for equivalent inference workloads
  • Real Claude Code sessions show context windows growing from 15K to 156K tokens per session, with agents consuming up to 15x more tokens than standard chat
  • The six chips — Vera CPU, Rubin GPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet Switch — are codesigned as a single system, not assembled from separate product lines
  • Vera CPU handles tool execution gaps and KV cache offload; Groq 3 LPX (following NVIDIA's acquisition) covers SRAM-first decode for latency-bounded workloads
  • NVIDIA CMX purpose-built storage sustains 95%+ cache hit rates — prompt caching is now a storage problem, not just a GPU memory problem
  • Platform claims 400+ tokens per second per user on trillion-parameter MoE models with 400K context
  • Production availability from AWS, Google Cloud, Microsoft, and OCI in H2 2026

So What? The 15x token multiplier between chat and agents is the number to internalize. If you planned AI inference capacity based on GPT-4 chat usage patterns, your capacity estimates are wrong by an order of magnitude for agentic workloads. The codesign argument also means point-solution optimizations — swap the GPU, leave the rest — won't recover that gap. Build your 2027 AI infrastructure cost models around agentic consumption patterns, not chat baselines.

Sources: https://developer.nvidia.com/blog/building-for-the-rising-complexity-of-agentic-systems-with-extreme-co-design/, https://nvidianews.nvidia.com/news/nvidia-vera-rubin-platform


№ 02·Networking

Networking

Plate II · networking
Schematic leaf-spine fabric — explicit-path traffic flows across the spine plane, pods at the edges.

Astera Labs Takes Aim at NVSwitch with Open Memory-Semantic Fabric

The Scorpio X-Series launch represents the clearest challenge to NVIDIA's NVSwitch dominance in rack-scale AI interconnect. The memory-semantic model — where GPUs treat remote accelerators as addressable memory rather than message-passing endpoints — is architecturally distinct from both traditional Ethernet and from InfiniBand/NVLink. Hyperscalers are already taking delivery. The broader significance is market structure: if Scorpio gains traction, the scale-up layer follows the same open-standards trajectory that the scale-out layer already took with Ultra Ethernet Consortium.

The Scorpio P-Series expansion — from 32 to 320 lanes — signals that Astera is targeting everything from mid-scale enterprise AI clusters to hyperscale build-outs with a single product family. That breadth matters for vendor evaluation: this is no longer a niche hyperscaler product.

Recommendation: Include Scorpio X-Series in any AI cluster NIC and fabric RFQ alongside ConnectX-9. The memory-semantic architecture changes the evaluation criteria — benchmark workload patterns that pause and resume, not just sustained all-to-all collectives.

Sources: https://www.asteralabs.com/news/astera-labs-broadens-scorpio-x-series-smart-fabric-switch-roadmap-to-address-expanding-scale-up-market-opportunities/, https://www.datacenterknowledge.com/infrastructure/astera-labs-targets-fragmented-ai-workloads-with-new-fabric


Cloudflare Automatic Return Routing: Solving IP Overlap Elegantly

Ivan Pepelnjak flagged Cloudflare's automatic return routing solution for a problem that's increasingly common as enterprises consolidate cloud access: multiple clients with overlapping private address ranges accessing shared resources. The solution avoids NAT by maintaining return-routing state at the egress point — traffic enters labeled with enough context that return packets reach the correct overlapping-addressed client. The architectural lesson is that stateful routing intelligence at the edge can solve address-space problems that would otherwise require renumbering or complex NAT translation chains.
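
To make the stateful return-routing idea concrete, here is a minimal sketch of a flow table that remembers which ingress tunnel each flow arrived on and sends replies back the same way. It is an illustration only, not Cloudflare's implementation: the `Flow` and `ReturnRouter` names and the tunnel labels are hypothetical, and the sketch assumes the two clients' flows differ in at least one field of the flow key.

```python
from __future__ import annotations

from typing import NamedTuple


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class ReturnRouter:
    """Remembers the ingress tunnel per flow so replies need no NAT."""

    def __init__(self) -> None:
        self._origin: dict[Flow, str] = {}

    def on_forward(self, flow: Flow, ingress_tunnel: str) -> None:
        # Record where the flow came from before forwarding it unchanged.
        self._origin[flow] = ingress_tunnel

    def on_return(self, src_ip: str, src_port: int,
                  dst_ip: str, dst_port: int) -> str | None:
        # A reply swaps source and destination relative to the forward flow.
        forward = Flow(dst_ip, dst_port, src_ip, src_port)
        return self._origin.get(forward)


router = ReturnRouter()
# Two tenants use the same private address 10.0.0.5.
router.on_forward(Flow("10.0.0.5", 40001, "203.0.113.10", 443), "tunnel-a")
router.on_forward(Flow("10.0.0.5", 40002, "203.0.113.10", 443), "tunnel-b")
print(router.on_return("203.0.113.10", 443, "10.0.0.5", 40002))
```

The routing table alone cannot tell the two 10.0.0.5 clients apart; the per-flow state can, which is why the addresses never need to be rewritten.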

Recommendation: If you're designing multi-tenant access to shared services (AI model endpoints, internal APIs, partner connectivity) and overlapping address space is a constraint, review the Cloudflare approach before defaulting to NAT.

Sources: https://blog.ipspace.net/2026/05/worth-reading-cloudflare-automatic-return-routing/


№ 03·Automation

Automation

Plate III · automation
Source-of-truth pipeline — intent → diff → apply → verify, idempotent on every revolution.

AutoCon 5 Munich: June 8-12, 2026 — 700 Engineers, Zero Vendor Marketing

AutoCon 5 convenes June 10-12 in Munich (with pre-conference workshops June 8-9). At 700+ registered engineers it is the largest dedicated network automation conference ever held. The program guide explicitly bars vendor marketing content — every session is an operator presenting production experience or a practitioner running a workshop on real tooling.

The program covers automation foundations (Git, Nornir, NAPALM, Jinja2, CI/CD), emerging topics (AI/LLM for network ops, self-healing networks, digital twins, gRPC protocols), and organizational challenges (ops-led automation, air-gap constraints, scaling to 500+ devices). Speakers include engineers from Swisscom, Deutsche Bahn, PlayStation, and LINX. The emphasis on organizational barriers rather than purely technical ones signals the community has moved past "can we automate this?" to "why haven't we automated this yet?"

Recommendation: Register for AutoCon 5. At minimum, download the program guide now and block June 8-12 — the workshops are where the highest-density practitioner exchange happens. Program guide at ac5-guide.pages.dev.

Sources: https://networkautomation.forum/autocon5, https://ac5-guide.pages.dev/


SAGA: Workflow-Aware Scheduling Closes the Agentic Inference Gap on GPU Clusters

Published at HPDC 2026, SAGA (Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters) surfaces multi-step agent workflow structure explicitly to the GPU cluster scheduler. Traditional cluster schedulers see individual LLM requests without knowing whether they belong to a sequential chain that shares KV state. SAGA treats the full agent workflow as the atomic scheduling unit — reserving compute and cache resources across steps rather than releasing them between tool calls. The result: online KV-cache management approaches the offline-optimal Bélády policy, which is the theoretical best achievable.

This complements both the PPD and Continuum papers above: PPD addresses per-turn routing between prefill and decode nodes, Continuum addresses within-node cache lifetime, and SAGA addresses cluster-level scheduling across the requests of a workflow. Together they outline an emerging reference architecture for agentic inference infrastructure.
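
For readers unfamiliar with the Bélády policy mentioned above, the sketch below compares it with LRU on a toy trace of session resumptions. Bélády evicts the entry whose next use lies farthest in the future, which requires knowing the whole trace in advance; that is why it serves as an offline upper bound rather than a deployable policy. The trace and capacity are made up, and nothing here reproduces SAGA's scheduler.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Count cache hits under least-recently-used eviction."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits


def belady_hits(trace, capacity):
    """Count cache hits when evicting the entry reused farthest in the future."""
    cache = set()
    hits = 0
    for i, key in enumerate(trace):
        if key in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(k):
                # Position of the next access to k, or infinity if never reused.
                try:
                    return trace.index(k, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(key)
    return hits


# Four agent sessions resuming in a cycle, with room for three KV caches.
trace = ["A", "B", "C", "D", "A", "B", "C", "D"]
print("LRU hits:", lru_hits(trace, 3))
print("Belady hits:", belady_hits(trace, 3))
```

On this cyclic trace LRU always evicts the session that is about to resume, while the future-aware policy keeps it. A scheduler that knows the workflow structure has partial knowledge of that future, which is how an online policy can approach the offline bound.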

Recommendation: When evaluating or building LLM serving infrastructure for agentic use cases, treat workflow-awareness as a first-class scheduler requirement, not an optimization. Ask inference platform vendors how their scheduler handles multi-step agent workflows before committing to a deployment architecture.

Sources: https://arxiv.org/html/2605.00528v1


№ 04·AI / ML

AI & Machine Learning

Plate IV · ai / ml
Embedding space — clusters carry related concepts; the highlighted query vector pulls its nearest neighbors.

KV Cache Infrastructure: The Defining Constraint of Agentic AI in 2026

The convergence of PPD disaggregation, Continuum TTL scheduling, SAGA workflow-atomic cluster management, and NVIDIA's CMX storage announcement all point at the same conclusion: KV cache management has become the primary infrastructure design problem for agentic AI, not GPU compute. In 2023, the constraint was GPU memory. In 2024, it was compute-to-memory bandwidth. In 2026, it is cache hit rates, eviction policy, and KV state routing across a distributed serving cluster.

The economic driver is clear: agentic workloads consume 10-15x more tokens than chat equivalents, and with KV cache hit rates above 95% in production, the cost difference between a system that handles cache well and one that doesn't is an order of magnitude.

Recommendation: Rebuild your AI infrastructure cost model for 2027 using agentic consumption patterns — assume 10-15x token multipliers over chat workloads, 95%+ cache hit rates as the design target, and KV routing as a first-class network design problem. Standard chat-era capacity planning is systematically wrong for agent workloads.
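
A back-of-the-envelope calculation shows why the cache hit rate dominates the cost model. The sketch assumes a hypothetical 48-turn session whose context grows linearly between the 15K and 156K token endpoints quoted earlier; the turn count, the linear growth, and the function name are illustrative assumptions.

```python
def recomputed_tokens(prompt_tokens_per_turn, cache_hit_rate):
    """Prompt tokens that must be prefilled again because they missed the cache."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return sum(prompt_tokens_per_turn) * (1.0 - cache_hit_rate)


# Assumed session shape: context grows by 3K tokens per turn, 15K to 156K.
session = [15_000 + 3_000 * turn for turn in range(48)]

no_cache = recomputed_tokens(session, 0.0)
good_cache = recomputed_tokens(session, 0.95)

print(f"Prompt tokens submitted over the session: {sum(session):,}")
print(f"Recomputed with no cache reuse: {no_cache:,.0f}")
print(f"Recomputed at a 95% hit rate: {good_cache:,.0f}")
print(f"Reduction factor: {no_cache / good_cache:.0f}x")
```

At a 95% hit rate only one token in twenty is recomputed, so the prefill work differs by a factor of 20 between the two cases, consistent with the order-of-magnitude claim above.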

Sources: https://arxiv.org/abs/2603.13358, https://arxiv.org/abs/2511.02230, https://arxiv.org/html/2605.00528v1


SparKV: Streaming KV Cache from Cloud to Edge Devices

SparKV (arxiv 2604.21231) presents an adaptive framework for on-device LLM inference that decides, per KV chunk, whether to stream it from cloud or compute it locally. The system models per-chunk cost, overlaps communication with computation, and adjusts at runtime when wireless connectivity fluctuates. On diverse edge hardware and LLMs, SparKV reduces latency in the prefill-heavy phase that makes on-device inference frustrating today. The architectural principle — cloud and edge as a dynamic compute continuum, not a binary choice — has network design implications: edge inference nodes need both low-latency compute access and reliable cloud KV streaming paths.
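
The per-chunk decision can be sketched as a greedy two-resource schedule: the network link and the local accelerator work in parallel, and each chunk goes to whichever would finish it sooner. This is a simplified illustration under assumed inputs, not SparKV's cost model; the function name, chunk sizes, and timings are hypothetical.

```python
def plan_chunks(chunk_bytes, compute_s, bandwidth_bytes_per_s):
    """Greedily assign each KV chunk to streaming or local compute.

    Returns (plan, makespan_s). Streaming and computing overlap, so each
    resource keeps its own "free at" clock.
    """
    link_free_at = 0.0
    compute_free_at = 0.0
    plan = []
    for size, local_s in zip(chunk_bytes, compute_s):
        stream_done = link_free_at + size / bandwidth_bytes_per_s
        compute_done = compute_free_at + local_s
        if stream_done <= compute_done:
            plan.append("stream")
            link_free_at = stream_done
        else:
            plan.append("compute")
            compute_free_at = compute_done
    return plan, max(link_free_at, compute_free_at)


chunks = [10_000_000] * 4  # four 10 MB KV chunks (assumed)
local = [0.4] * 4          # 0.4 s to recompute each chunk locally (assumed)

print(plan_chunks(chunks, local, 50_000_000))  # healthy 50 MB/s link
print(plan_chunks(chunks, local, 5_000_000))   # degraded 5 MB/s link
```

Re-running the plan whenever the measured bandwidth changes gives the runtime adaptation the paragraph describes: on a good link most chunks stream, and on a degraded link the device falls back to computing everything locally.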

Recommendation: If you're designing network infrastructure for edge AI inference, treat KV cache streaming bandwidth as a traffic class requiring QoS guarantees, not best-effort background traffic.

Sources: https://arxiv.org/abs/2604.21231


№ 05·Datacenter

Datacenter

Plate V · datacenter
Datacenter row — per-rack utilization at a glance. Cool colors are slack; warmer fills are pressure.

Hyperscaler AI Capacity 6x by 2035 — Grid Adjacency Is the New Real Estate

ABI Research (May 5) projects active hyperscaler IT load growing from 24.37 gigawatts in 2025 to 147.13 gigawatts by 2035, a 6x increase driven by AI build-outs, core cloud growth, and enterprise migration. The projection places grid-adjacent capacity — sites that can get power interconnection priority — as the primary physical constraint, ahead of land, water, or construction labor.

The 2026 AI datacenter liquid cooling market is projected at $3.7 billion, expanding to $17.83 billion by 2036. The acceleration between 2026 and 2030 reflects GPU density hitting thermal thresholds where air cooling becomes physically inadequate, not just economically inferior.

For network architects, a 6x capacity increase between 2025 and 2035 means the network management plane must scale with it. Fabrics deployed today may be managing on the order of 6x more endpoints by 2035, so automation and programmatic control are not optional at that scale.
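
The projections above are easier to compare as compound annual growth rates. The sketch below uses only the figures quoted in this section; the `cagr` helper is a standard formula, not something taken from the ABI Research release.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


# Hyperscaler IT load: 24.37 GW (2025) to 147.13 GW (2035).
it_load = cagr(24.37, 147.13, 2035 - 2025)

# AI datacenter liquid cooling market: $3.7B (2026) to $17.83B (2036).
cooling = cagr(3.7, 17.83, 2036 - 2026)

print(f"IT load growth multiple: {147.13 / 24.37:.2f}x")
print(f"IT load CAGR: {it_load:.1%}")
print(f"Liquid cooling CAGR: {cooling:.1%}")
```

Both series compound at a little under 20% a year, which means capacity roughly doubles every four years. That doubling interval is a more practical planning input than the headline 6x figure.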

Recommendation: Any facility selection made today should include power interconnection queue position as a primary site evaluation criterion. AI workload growth rates make today's 5-year site capacity estimates look conservative by 2030.

Sources: https://www.globenewswire.com/news-release/2026/05/05/3287693/0/en/Hyperscaler-Data-Center-Capacity-to-Surge-More-Than-6x-by-2035-as-AI-and-Cloud-Expansion-Reshape-Global-Infrastructure.html


№ 06·Science

Science

Plate VI · science
Field schematic — three-body stability under quasi-equal masses, drawn from the day's central result.

QuantWare Raises $178M to Become the TSMC of Quantum Processors

QuantWare closed a $178 million Series B on May 5 — the largest private funding round ever raised by a dedicated quantum processor company. The round funds two things: VIO-40K, a modular architecture targeting 10,000 physical qubits (100x the current state of the art), and KiloFab, a dedicated quantum processor fab increasing production capacity by 20x. Intel Capital, IQT, and ETF Partners joined as new investors.

The TSMC framing is intentional: QuantWare's VIO architecture is explicitly open — third-party qubit chiplets and designs can plug into the platform. Rather than building a closed QPU for one customer, they are building the foundry layer that decouples QPU design from fabrication. That model, if it works, compresses the timeline between a breakthrough qubit design and a production quantum processor.

Context matters: the Caltech-Oratomic result covered Monday suggested that a fault-tolerant quantum computer capable of breaking RSA may need only 10K-20K physical qubits. QuantWare's VIO-40K, at 10,000 physical qubits, targets the low end of that range. The threat window for harvest-now-decrypt-later attacks is shortening.

Recommendation: The VIO-40K timeline combined with the Caltech-Oratomic qubit reduction result means the PQC migration window is measured in years. If you have data with 5+ year confidentiality requirements and haven't started ML-KEM or ML-DSA migration, start the project planning now.

Sources: https://quantware.com/news/quantware-raises-178-million, https://quantumcomputingreport.com/quantware-secures-178m-series-b-to-scale-vio-architecture-and-expand-fabrication-capacity/


№ 07·Security

Security

Plate VII · security
Zero-trust egress — credentials are injected at the proxy boundary, never reaching the client runtime.

Identity-Based Microsegmentation for AI Agents: The Fabric Layer Closes the Gap

Elisity published analysis showing that only 14.4% of AI agents go live with full security approval; the other 85.6% have partial oversight or none. The architectural argument is direct: when an agent is compromised, traditional perimeter controls don't slow lateral movement because agents operate with broad credential scope by design. Identity-based microsegmentation at the network switch layer enforces policy based on device and workload identity verified against authoritative sources (CrowdStrike, ServiceNow CMDB, Active Directory), and it operates independently of whatever credentials the agent has acquired.

The enforcement point at the access switch is significant: it operates below the agent runtime, meaning an agent cannot disable or evade it by manipulating its own process environment. This aligns with the Five Eyes guidance covered Monday requiring enforcement at the action boundary, not the prompt layer.
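
A minimal sketch of the enforcement logic helps show why stolen credentials do not matter at this layer: the decision depends only on verified workload identity and an allow-list of group-to-group flows. The workload names, group names, and tables below are hypothetical and do not represent Elisity's product or data model.

```python
# Hypothetical identity source: workload name -> policy group.
IDENTITY_GROUPS = {
    "coding-agent-01": "ai-agents",
    "ticketing-api": "internal-apis",
    "payroll-db": "finance-data",
}

# Hypothetical allow-list of (source group, destination group, port).
ALLOWED_FLOWS = {
    ("ai-agents", "internal-apis", 443),
}


def permit(src: str, dst: str, port: int) -> bool:
    """Default-deny check that never inspects credentials carried in the traffic."""
    src_group = IDENTITY_GROUPS.get(src)
    dst_group = IDENTITY_GROUPS.get(dst)
    if src_group is None or dst_group is None:
        return False  # unknown identity on either side: deny
    return (src_group, dst_group, port) in ALLOWED_FLOWS


print(permit("coding-agent-01", "ticketing-api", 443))  # allowed flow
print(permit("coding-agent-01", "payroll-db", 5432))    # lateral movement attempt
```

Even if the agent has obtained valid database credentials, the second flow is dropped before any authentication exchange can take place.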

Recommendation: Before granting AI agents write access to production infrastructure, verify that network-level identity enforcement is in place at the access layer. Policy Group enforcement at the switch is the containment layer that survives agent compromise — prompt-layer guardrails do not.

Sources: https://www.elisity.com/blog/ai-agent-network-security-microsegmentation-2026


№ 08·Quick Takes

Quick Takes

  • NVIDIA Vera Rubin in production: All six chips now sampling; AWS, Google Cloud, Microsoft Azure, and OCI committed to H2 2026 availability. The Spectrum-6 Ethernet Switch integration means Vera Rubin clusters ship with an 800G AI Ethernet fabric built in — not as an afterthought.

  • Agentic workload token cost inflection: NVIDIA's published data shows Claude Code sessions scaling to 156K tokens at 15x chat consumption. Enterprise cost models that use chat-era token pricing for agent deployments systematically underestimate costs.

  • AutoCon 5 workshop pre-registration: The June 8-9 pre-conference workshops are separate from the main conference registration — they fill faster. Workshops cover Nornir, NAPALM, CI/CD, and gRPC deep dives with hands-on labs.


№ 09·Automation

Pipeline Stats

Plate VIII · automation
Source-of-truth pipeline — intent → diff → apply → verify, idempotent on every revolution.
  • Articles processed: 80 (RSS digest) + supplemental web research
  • Topics researched: 6 domains (networking, automation, ai-ml, datacenter, science, security)
  • Quality score average: 4.5/5
  • Dedup rejections: all May 4-5 items excluded per 72-hour cooldown
  • RSS digest top score: 7.8 (PPD disaggregation, Continuum KV cache)