Morning Briefing · Thursday, April 30, 2026

SRv6 Finds Its AI Fabric Moment as Hyperscalers Race to Own Their Silicon

Plate I · leaf · spine
Schematic leaf-spine fabric — explicit-path traffic flows across the spine plane, pods at the edges.
№ 01·Top Highlights

Top 3 Highlights

1. Microsoft Runs SRv6 uSID Over SONiC in Production to Solve the AI Fabric ECMP Problem

At NANOG96 in February, Microsoft engineers quietly disclosed one of the most important AI fabric architecture decisions in the operator community: SRv6 uSID running over SONiC in a production AI training cluster backend — solving the ECMP entropy problem that degrades collective communication performance at scale. The story is only now surfacing in secondary coverage, and it deserves a full hearing.

Key Points:

  • Microsoft engineers Rita Hui and Pablo Camarillo presented at NANOG96 (February 2026) — one of the first public operator-level disclosures of SRv6 in a production hyperscale AI backend fabric
  • SRv6 uSID (micro-segment IDs) enables deterministic path programming without per-flow state in the core — the ingress node encodes explicit paths as compact 16-bit identifiers packed into the IPv6 destination address, handled by standard longest-prefix-match on commodity ASICs
  • SONiC's SAI abstraction exposes SRv6 SID programming via YANG models and gNMI — controller-driven path computation without per-device CLI operations, fitting directly into existing model-driven automation pipelines
  • ECMP hash collisions cause hot-spot spine links during all-reduce and all-gather at 1,000+ GPU scale; SRv6 explicit path programming bypasses the load balancer entirely rather than patching around its limitations
  • This is architecturally distinct from Google's Boardfly/OCS approach (covered April 23): Microsoft's SRv6 path runs on commodity merchant silicon with standards-track protocols — replicable without custom ASICs
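
To see why hashing struggles with a small number of large flows, here is a minimal Python sketch. It is not Microsoft's method: the flow tuples, the CRC32 hash (standing in for an ASIC hash), and the link count are illustrative assumptions. It compares hash-based placement with an explicit one-flow-per-link placement of the kind a path-programming controller could choose.

```python
import zlib
from collections import Counter

def ecmp_link(flow: tuple, n_links: int) -> int:
    """Pick an uplink by hashing the flow tuple (CRC32 stands in for the ASIC hash)."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links

def max_link_load(assignment: list) -> int:
    """Number of flows placed on the busiest link."""
    return max(Counter(assignment).values())

# Hypothetical workload: 16 long-lived RDMA flows over 16 spine uplinks.
# RoCEv2 uses UDP destination port 4791; addresses and source ports are made up.
N_LINKS = 16
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "udp") for i in range(16)]

hashed = [ecmp_link(f, N_LINKS) for f in flows]          # stateless hashing
explicit = [i % N_LINKS for i in range(len(flows))]      # controller pins one flow per link

print("busiest link under ECMP hashing: ", max_link_load(hashed), "flows")
print("busiest link with explicit paths:", max_link_load(explicit), "flows")
```

With as many equal-sized flows as links, explicit placement puts exactly one flow on each link, while a hash can only do as well or worse. Any link carrying two line-rate flows becomes the hot spot the bullets above describe.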

Deep Dive:

Traditional ECMP mitigation — entropy header scrambling, WCMP, flowlet switching — hits limits at 10,000+ GPU scale. Microsoft's answer is a different architectural layer: don't fix the load balancer, change the routing model. SRv6 uSID encodes an explicit path in every packet header, visible end-to-end, processed by standard IPv6 longest-prefix-match at line rate. No per-flow MPLS state in the core, no PFC storms from lossless queue management requirements, no custom silicon required. The result is deterministic path control on commodity hardware.
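
A minimal sketch of the encoding described above, using only Python's standard `ipaddress` module. The 32-bit block `fcbb:bb00::/32` and the uSID values are illustrative assumptions, not values from the NANOG96 talk. `encode_usid_carrier` packs 16-bit uSIDs after the block, and `shift_and_forward` mimics what a transit node does once its own uSID has matched: it shifts the remaining uSIDs left by 16 bits so the next hop sees its own identifier first.

```python
import ipaddress

def encode_usid_carrier(block: str, usids: list) -> ipaddress.IPv6Address:
    """Pack 16-bit uSIDs, in path order, right after the uSID block prefix."""
    net = ipaddress.IPv6Network(block)
    slots = (128 - net.prefixlen) // 16
    if len(usids) > slots:
        raise ValueError(f"at most {slots} uSIDs fit after a /{net.prefixlen} block")
    value = int(net.network_address)
    shift = 128 - net.prefixlen
    for usid in usids:
        if not 0 < usid <= 0xFFFF:
            raise ValueError("uSID must be a non-zero 16-bit value")
        shift -= 16
        value |= usid << shift
    return ipaddress.IPv6Address(value)

def shift_and_forward(addr: ipaddress.IPv6Address, block_len: int = 32) -> ipaddress.IPv6Address:
    """Consume the active uSID: keep the block, shift the rest left by 16 bits."""
    low_bits = 128 - block_len
    mask = (1 << low_bits) - 1
    value = int(addr)
    block = value & ~mask
    rest = ((value & mask) << 16) & mask
    return ipaddress.IPv6Address(block | rest)

# Hypothetical path: leaf uplink (0x0001) -> spine (0x0002) -> egress leaf (0x0003)
carrier = encode_usid_carrier("fcbb:bb00::/32", [0x0001, 0x0002, 0x0003])
print(carrier)                     # destination address written by the ingress node
print(shift_and_forward(carrier))  # destination address after the first hop
```

Because each hop only needs a route for "block + my uSID", the lookup is an ordinary longest-prefix match and the core holds no per-flow state.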

The SONiC layer is operationally load-bearing. SAI's SRv6 SID table abstraction means path computation can be driven by a controller via gNMI YANG writes — the same model-driven pipeline network engineers are already building for configuration management. This isn't a separate proprietary traffic engineering overlay; it integrates into the open automation stack. At NANOG96, Microsoft's presentation also referenced Upperside World Congress 2026 sessions where Alibaba, Verizon, Nebius, Rakuten, and Deutsche Telekom shared SRv6 AI infrastructure experiences. This is becoming operator consensus, not a Microsoft-specific experiment.

This is the strongest public evidence to date that the SRv6 + open NOS stack is production-grade for the most demanding AI training workloads. The IETF SPRING working group's SRv6 deployment BCP draft is advancing in parallel, giving operators a reference for production deployment patterns.

So What? If you're designing an AI training fabric above a few hundred GPUs, SRv6 uSID over SONiC is now a proven, standards-track alternative to vendor-proprietary congestion management — start with the NANOG96 slides and the IETF SPRING SRv6 deployment BCP draft.

Sources: NANOG96: SRv6 AI Backend — Microsoft | IETF srv6ops deployment draft


2. Hyperscaler Earnings Week: Microsoft Raises to $190B, Google Starts Selling TPUs

Microsoft (fiscal Q3 2026) and Alphabet (Q1 2026) both reported earnings this week, and both stories say the same thing in different words: AI infrastructure demand is so strong that neither company can build fast enough to meet it. Microsoft raised full-year capex to $190 billion partly on component inflation. Google started selling TPU hardware directly to customers' own datacenters, making it a chip vendor competing with NVIDIA.

Key Points:

  • Microsoft full-year 2026 capex: $190 billion, up $25 billion — at least partially driven by component price inflation (memory, storage); Q4 alone expects over $40 billion in hardware spend
  • Microsoft AI services ARR: $37 billion (up 123% year over year), generated by $97 billion in infrastructure investment over the prior four quarters — the unit economics haven't stabilized; CFO Amy Hood confirmed this gap won't close before 2027
  • Alphabet raised 2026 capex guidance to $180–190 billion; Google Cloud hit $20 billion in Q1 for the first time (up 63% year over year); cloud backlog nearly doubled to $462 billion
  • Google will sell TPU 8t and TPU 8i hardware directly to select customers' own datacenters — hardware revenue now included in cloud backlog; small revenue in 2026, majority arriving 2027
  • Amazon's Trainium/Inferentia semiconductor business crossed $20 billion annual run rate (Trainium4 in development with NVLink Fusion support for hybrid Trainium/NVIDIA clusters); combined hyperscaler 2026 capex is tracking toward $630 billion, up 62% over 2025
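
The ratios behind these bullets are simple arithmetic. A quick check in Python, using only figures stated in this briefing; the prior-guidance and 2025 values are derived here and are not reported numbers.

```python
# All amounts in billions of US dollars, taken from the bullets above.
capex_2026 = 190.0          # Microsoft full-year 2026 capex
capex_raise = 25.0          # size of the upward revision
prior_guidance = capex_2026 - capex_raise

ai_investment = 97.0        # infrastructure investment over the prior four quarters
ai_arr = 37.0               # AI services annual recurring revenue
investment_to_arr = ai_investment / ai_arr

combined_2026 = 630.0       # combined hyperscaler capex estimate
growth = 0.62               # "up 62% over 2025"
implied_2025 = combined_2026 / (1 + growth)

print(f"implied prior Microsoft guidance: ${prior_guidance:.0f}B")
print(f"investment per dollar of AI ARR:  {investment_to_arr:.1f}:1")
print(f"implied 2025 combined capex:      ${implied_2025:.0f}B")
```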

Deep Dive:

The supply-constrained signal is more important than the headline numbers. Both Microsoft (with an $80 billion Azure backlog it cannot fulfill due to power grid limitations) and Google (explicitly stating cloud revenue was limited by compute availability) confirmed independently that the binding constraint is infrastructure capacity, not customer demand. That $25 billion Microsoft capex increase is not new data center expansion — it's inflation pass-through. Memory and storage prices are rising fast enough to force a mid-year spending revision, and that cost will reprice cloud services within two to four quarters.

Google's TPU-to-customers announcement deserves separate attention. For a decade, TPUs were Google's internal advantage — purpose-built silicon that gave Google Cloud performance and cost advantages competitors couldn't replicate. Selling the hardware externally changes the market structure: Google is now competing with NVIDIA as an AI accelerator vendor, not just as a cloud provider hosting NVIDIA gear. The "select customers" framing suggests bespoke deals with large AI labs initially, but the direction is clear. Combined with Amazon's Trainium reaching $20 billion in annual run rate — and Trainium4's NVLink Fusion compatibility enabling mixed Trainium/NVIDIA clusters in a single fabric — the AI compute market is restructuring around custom silicon that cloud providers both consume internally and now sell externally.

For network engineers, the practical landing point is direct: when a customer deploys TPU pods from Google or Trainium clusters from Amazon in their own datacenter, they need to build the RoCEv2 or Ultra Ethernet fabric, the RDMA congestion management stack, and the telemetry infrastructure to make those accelerators perform. The industrialization of AI compute is a networking problem.

So What? Model 2027 cloud infrastructure costs 15–20% higher than 2026 rates to account for component inflation pass-through; renegotiate long-term agreements before the price reset arrives. Get Google TPU hardware pricing and specs now — before the first external hardware ships and procurement timelines compress.

Sources: The Register — Microsoft Q3 2026 | Alphabet Q1 2026 Earnings Transcript | The Register — Amazon chips $20B | Futurum Group: AI Capex 2026


3. netlab Brings Infrastructure-as-Code to Physical Multi-Vendor Labs

Ivan Pepelnjak at ipSpace.net published a workflow for using netlab to generate IP addressing plans and partial device configurations for mixed-vendor physical labs — directly solving the manual IP allocation errors that plague multi-vendor staging environments and training labs. Top RSS score this week at 8.5.

Key Points:

  • Five-step workflow: describe topology in topology.yml → netlab report wiring (physical cabling plan) → connect hardware → netlab report addressing (IP allocation verification) → netlab create (native device configs) → deploy via netlab initial over out-of-band management
  • Configs are explicitly partial — base addressing and routing protocols only, leaving vendor-specific customization to the engineer; this is a deliberate pedagogical and operational choice
  • Multi-vendor confirmed: Arista EOS, Cumulus, FRR, Junos, Nokia SR OS/SR Linux, Cisco IOSv/IOS-XE/NX-OS, IOS-XR all supported
  • netlab 26.01 (January 2026) rewrote the config generator, dropping Ansible/Jinja2 in favor of a native engine — 11-second deploy vs. prior 140 seconds, making live-lab scenarios viable
  • Inspired by ITNOG 10, where a leaf-and-spine physical lab with BGP route reflectors and multiple vendor devices was the motivation — exactly the multi-vendor complexity netlab targets
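
The workflow above can be sketched as a small Python helper that writes a topology file and lists the commands in order. Only the four `netlab` commands come from the workflow described above. The node names, device types, OSPF module, and the `provider: external` setting are assumptions for illustration, so check the netlab documentation for the attributes your release expects. The script prints the commands rather than running them.

```python
import shlex
import tempfile
from pathlib import Path

# Hypothetical one-spine, two-leaf physical lab. All names are illustrative.
TOPOLOGY = """\
provider: external        # assumption: real hardware, not VMs or containers
module: [ ospf ]

nodes:
  spine1: { device: eos }
  leaf1:  { device: srlinux }
  leaf2:  { device: eos }

links:
- spine1-leaf1
- spine1-leaf2
"""

# Commands in the order given by the workflow; hardware is cabled between the two reports.
WORKFLOW = [
    ["netlab", "report", "wiring"],      # physical cabling plan
    ["netlab", "report", "addressing"],  # IP allocation verification
    ["netlab", "create"],                # generate native device configs
    ["netlab", "initial"],               # deploy over out-of-band management
]

def write_topology(directory: Path) -> Path:
    """Write the topology description where netlab expects to find it."""
    path = directory / "topology.yml"
    path.write_text(TOPOLOGY)
    return path

lab_dir = Path(tempfile.mkdtemp(prefix="netlab-"))
topology_path = write_topology(lab_dir)
print(f"wrote {topology_path}")
for argv in WORKFLOW:
    print("next:", shlex.join(argv))
```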

So What? Replace your lab IP-addressing spreadsheet with topology.yml + netlab report wiring — you will stop chasing transposition errors that feel like hardware bugs.

Sources: ipSpace.net — Generate Partial Device Configs with netlab


№ 02·Automation

Network Automation

Plate II · automation
Source-of-truth pipeline — intent → diff → apply → verify, idempotent on every revolution.

SR Linux Config Conversion Tool Closes the YANG Model Drift Gap

Roman Dodin released srlconv, a tool that automatically diffs SR Linux configuration data models between software releases — solving the silent template breakage that occurs when Nokia updates its YANG model on its roughly annual release cycle.

  • srlconv spins up two SR Linux containers on different software versions with identical startup configs via Containerlab, captures the resulting configs, and diffs them — produces exactly what changed and how to fix your templates
  • Secondary approach: import Python upgrade scripts directly from inside SR Linux container images (srlinux.transform.transformations module) and compare them — more authoritative than release notes
  • The netlab community already used this for the 26.03 SR Linux release; netlab users get the benefit automatically
  • Pattern generalizes: any NOS with a diffable config representation and containerized images can adopt this approach; SR Linux is well-positioned because Nokia publishes container images freely
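
The core idea, diffing the configs that two releases produce from the same startup config, can be illustrated with Python's standard `difflib`. This is not srlconv's implementation or its CLI; the two config snippets and the added leaf are invented for illustration.

```python
import difflib

# Hypothetical captures of the same intent as rendered by two software releases.
OLD_RELEASE = """\
interface ethernet-1/1 {
    subinterface 0 {
        ipv4 {
            address 10.0.0.1/31
        }
    }
}
"""

NEW_RELEASE = """\
interface ethernet-1/1 {
    subinterface 0 {
        ipv4 {
            admin-state enable
            address 10.0.0.1/31
        }
    }
}
"""

def config_diff(old: str, new: str, old_label: str, new_label: str) -> list:
    """Unified diff between two rendered configs, one list entry per line."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm="",
    ))

def changed_lines(diff: list) -> list:
    """Keep only added/removed lines, dropping the file headers and context."""
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

diff = config_diff(OLD_RELEASE, NEW_RELEASE, "old-release", "new-release")
print("\n".join(diff))
changes = changed_lines(diff)
```

Each entry in `changes` points at a template line to review before the upgrade.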

So What? Run srlconv against your current vs. target SR Linux version before the software upgrade, not after — the diff tells you exactly which template lines will break.

Sources: ipSpace.net — SR Linux Config Conversion | GitHub: srl-labs/srlconv


Nautobot 3.1.0 Requires Django 5.2 and PostgreSQL 14 — Hard Dependency Bump

Nautobot 3.1.0 (April 14, 2026) upgrades to Django 5.2 and requires PostgreSQL 14.0 as a minimum, a breaking infrastructure dependency change arriving in the first major release line since 2.x. PostgreSQL 13 and below is a hard blocker, not a warning.

  • New in 3.1.0: dependent object creation in modals, async job console output streaming, custom field scoping, Python 3.14 support; HTMX replaces django-ajax-tables (reduces JavaScript bundle size)
  • Nautobot 2.4.32 remains supported with security patches — GitPython and lxml security patches were applied to both 3.1.1 and 2.4.32 simultaneously on April 27, confirming active dual-line maintenance
  • Both point releases (3.1.1 and 2.4.32) dropped on the same day: coordinated cross-version security patching
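
A pre-upgrade check can be as small as comparing PostgreSQL's `server_version_num` against the new floor. The sketch below assumes you obtain that integer yourself (for example with `SHOW server_version_num;`). The function name is hypothetical and this is not part of Nautobot.

```python
def postgres_meets_minimum(server_version_num: int, minimum_major: int = 14) -> bool:
    """Return True if the server's major version is at or above the minimum.

    PostgreSQL 10 and later encode the version as major * 10000 + minor,
    so 13.14 is 130014 and 14.0 is 140000.
    """
    major = server_version_num // 10000
    return major >= minimum_major

# Hypothetical values as returned by `SHOW server_version_num;`
for reported in (130014, 140000, 160003):
    verdict = "ok" if postgres_meets_minimum(reported) else "BLOCKER: upgrade PostgreSQL first"
    print(reported, verdict)
```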

So What? Audit your PostgreSQL version before scheduling the Nautobot 3.1.0 upgrade — version 13 and below is a blocker, not a warning, and Nautobot will not start.

Sources: GitHub: nautobot/nautobot Releases


Nautobot-app-Nornir v2.1.0 Eliminates Static Inventory Drift

The nautobot-app-nornir bridge reads Nautobot's ORM directly at task execution time — no serialized inventory snapshot, no drift between what Nornir sees and what actually exists in the network.

  • Credentials fetched from Nautobot's secrets engine at task time, not baked into inventory files — reduces secrets sprawl in automation repositories
  • v2.1.0 is current stable; repository shows active maintenance through March 2026
  • This is the Network to Code reference architecture: Nautobot as source of truth + Nornir as execution framework

So What? If you're running Nornir against a static inventory file while Nautobot holds the ground truth, migrate the inventory source — the drift gap becomes a production incident eventually.

Sources: GitHub: nautobot/nautobot-app-nornir v2.1.0


Automation Trend Thread: All four stories share a theme: the tooling focus has shifted from "get automation working" to "make automation reliable across version boundaries." YANG model drift (srlconv), database dependency changes (Nautobot 3.1.0), source-of-truth freshness (Nornir-Nautobot ORM), and physical lab config generation (netlab) all address the same problem — automation that silently breaks when software versions change. The pre-merge validation pattern (Batfish + PR gates) is being complemented by post-upgrade drift detection and model comparison tools. The cultural gap is closing; the tooling gap is closing from the other direction.


№ 03·Networking

Networking & Architecture

Plate III · networking
Schematic leaf-spine fabric — explicit-path traffic flows across the spine plane, pods at the edges.

SONiC Reaches Orange Telecom and Alibaba at Scale

The SONiC Foundation's OCP EMEA Summit workshop reveals deployment scale data the community hasn't widely publicized: Orange (French telecom) is running 90 SONiC switches in production for network disaggregation with 150+ planned, and Alibaba Cloud reports 100,000+ white-box devices under SONiC management globally.

  • 4,300+ active contributors across 520+ contributing organizations — among the largest active open networking projects by organizational breadth
  • Four stated community priorities: global adoption growth, enterprise NOS hardening, education/training, and amplifying production deployment case studies — the enterprise hardening track is the new focus
  • Orange's carrier deployment (not hyperscaler, not cloud-native) is the signal: SONiC is crossing into telecom infrastructure; the Dell'Oro 10% enterprise switch share forecast may be conservative

So What? Orange's tier-one carrier deployment signals that the SONiC adoption curve is steeper than enterprise market share statistics suggest — add it to evaluation criteria for your next DC edge or campus core refresh.

Sources: SONiC Workshop at OCP EMEA Summit | ONUG: State of Enterprise SONiC Adoption


№ 04·AI / ML

AI/ML

Plate IV · ai / ml
Embedding space — clusters carry related concepts; the highlighted query vector pulls its nearest neighbors.

Zig Bans AI Contributions; Bun's 4x Compiler Improvement Will Never Go Upstream

Bun — the JavaScript runtime Anthropic acquired in December 2025 — achieved a 4x compile performance improvement using AI-assisted development but will not upstream it to Zig because Zig bans LLM-authored contributions. It is the clearest example to date of AI governance policies creating permanent fractures in open-source communities.

  • Zig's rationale: what a project gets from a pull request is not just code — it's a contributor who learns the idioms, builds trust, and joins the maintenance chain. AI-generated PRs deliver code but no contributor. For a small project where reviewer bandwidth is the real constraint, this is a defensible position.
  • Bun runs on its own Zig fork; the 4x improvement (parallel semantic analysis + multiple LLVM codegen units) is permanently stranded there
  • Pattern: AI usage policies are creating lasting forks wherever maintainer communities prohibit AI contributions — not through licensing, but through contribution policy. This is happening now, not hypothetically.
  • Governance lesson: if your team's improvements to an open-source tool cannot be upstreamed due to policy misalignment, you are on a permanent fork trajectory with compounding divergence costs

So What? Add "AI contribution policy" to your open-source due-diligence checklist — policy misalignment now means your team's improvements have no path back to the community.

Sources: Simon Willison — Zig anti-AI


April 2026 Model Landscape: Meta Muse Spark Closes the Open-Weight Upper Bound

The April model release cluster is the densest since GPT-4's launch. The structural signal is Meta Muse Spark: Meta's first closed, proprietary frontier model from the newly renamed Meta Superintelligence Labs — breaking Meta's historically open-weight default and setting a precedent for where the "open-weight ceiling" sits.

  • Llama 4 Scout (109B total, 17B active MoE, 10M context window) and Maverick (400B total) — both open weight, April 5
  • Claude 4 Opus (72.1% SWE-bench Verified, 94.2% HumanEval) — April 2; Claude Mythos (93.9% SWE-bench Verified) restricted to 50 organizations
  • Meta Muse Spark — closed, proprietary, no weights released; Meta's first frontier model kept under lock
  • Mistral Medium 3 — ships EU AI Act compliance metadata as part of the model card at release; the first model to do this at launch
  • Google Gemma 4 (Apache 2.0, up to 31B parameters) and the full Alibaba Qwen 3 lineup from 0.6B to 72B

So What? Meta Muse Spark's closure tells you where the open-weight ceiling sits: whatever capability level prompted this reversal is where open access now gets restricted.

Sources: Fazm.ai — LLM releases April 2026 | WhatLLM — April 2026


№ 05·Datacenter

Datacenter

Plate V · datacenter
Datacenter row — per-rack utilization at a glance. Cool colors are slack; warmer fills are pressure.

$630 Billion in 2026 Hyperscaler Capex: Power and Silicon Are the Ceiling

Combined 2026 capex from the four major hyperscalers (Microsoft $190B, Amazon $200B, Google $180–190B, Meta $115–135B) tracks toward approximately $630 billion — a 62% increase over 2025 aggregate spending. The headline is not the scale; it's what's constraining it.

Both Microsoft and Google confirmed they are supply-constrained, not demand-constrained. Microsoft has an $80 billion Azure backlog it cannot fulfill due to power grid access. Google explicitly stated cloud revenue was limited by compute availability. The AI buildout is being throttled by power grid access and silicon component pricing simultaneously, not by customer willingness to spend.

Microsoft's $25 billion upward revision is partially component inflation pass-through — memory and storage prices rising fast enough to force mid-year guidance revision. That inflation will reprice cloud services within two to four quarters. US utilities are projecting $1.4 trillion in infrastructure investment to serve AI datacenter load growth through the decade.

So What? Model 2027 cloud infrastructure costs 15–20% higher than current rates to account for component inflation pass-through; renegotiate long-term agreements before the price reset arrives.

Sources: The Register — Microsoft Q3 2026 | CNBC — Hyperscaler Q1 2026 earnings | Futurum Group: AI Capex 2026


№ 06·Security

Security

Plate VI · security
Zero-trust egress — credentials are injected at the proxy boundary, never reaching the client runtime.

Cloudflare Introduces Programmable Zero-Trust Egress for AI Agent Traffic

Cloudflare's Outbound Workers for Sandboxes — published during Agents Week 2026 — implements a programmable zero-trust egress proxy that intercepts all agent-to-external-service traffic, injects credentials server-side at the proxy boundary, and prevents tokens from ever reaching agent code at runtime.

  • Architectural shift: agents never hold credentials in their runtime context. The trust boundary moves from the agent runtime to the network egress point — matching how mature zero-trust models treat human users (credential exchange happens at the policy enforcement point)
  • RFC 9728 OAuth integration enables agent-to-service authentication via proper auditable token chains rather than service accounts or hardcoded API keys
  • Cloudflare Mesh extends this to private internal resources — scoped network access without manual tunnel configuration or broad service account permissions
  • Shadow MCP detection: passive discovery of unauthorized MCP servers operating inside the organization (agent control plane equivalent of shadow IT discovery)

This is architecturally distinct from the LiteLLM gateway placement lesson (covered April 29 — inbound credential aggregation). This addresses outbound agent traffic: even a compromised or prompt-injected agent cannot exfiltrate credentials it never possessed.
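
The pattern can be sketched in a few lines of Python. This is not Cloudflare's API: the `Request` type, the per-host secret store, and `egress_proxy` are hypothetical stand-ins. The point is that the agent builds a request without a token, and the proxy adds the credential only for approved destinations.

```python
from dataclasses import dataclass, field, replace
from urllib.parse import urlsplit

@dataclass(frozen=True)
class Request:
    method: str
    url: str
    headers: dict = field(default_factory=dict)

class EgressDenied(Exception):
    """Raised when an agent targets a host with no egress policy."""

# Hypothetical secret store, visible to the proxy only and never to agent code.
_SECRETS = {"api.example.com": "token-held-by-proxy"}

def egress_proxy(request: Request) -> Request:
    """Policy enforcement point: allow known hosts and inject their credential."""
    host = urlsplit(request.url).hostname
    if host not in _SECRETS:
        raise EgressDenied(f"no egress policy for {host!r}")
    # Discard any Authorization header the agent set, then inject the real one.
    headers = {k: v for k, v in request.headers.items() if k.lower() != "authorization"}
    headers["Authorization"] = f"Bearer {_SECRETS[host]}"
    return replace(request, headers=headers)

# The agent's view: a plain request with no credential in it.
agent_request = Request("GET", "https://api.example.com/v1/items")
outbound = egress_proxy(agent_request)
print("agent saw headers:", agent_request.headers)
print("wire headers:     ", sorted(outbound.headers))
```

A prompt-injected agent can still ask the proxy to call an approved host, so per-host policy and audit logging still matter, but it has no token to leak.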

So What? Evaluate whether your agent runtime implements credential injection at the egress layer. If agents hold API keys in environment variables or prompt context today, the Outbound Workers egress proxy pattern (or an equivalent policy enforcement point) is the architectural fix — not prompt-level instructions to "handle credentials carefully."

Sources: Cloudflare Blog — Agents Week 2026 in Review


№ 07·Science

Science

Plate VII · science
Field schematic — three-body stability under quasi-equal masses, drawn from the day's central result.

Graphene Electrons Violate a 160-Year-Old Physics Law — by 200x

Researchers at the Indian Institute of Science confirmed that electrons in ultra-clean graphene, tuned to a precise quantum boundary called the Dirac point, violate the Wiedemann-Franz law by more than two hundred times. The law has held for 160 years: in metals, electrical and thermal conductivity move together in a fixed ratio. At graphene's Dirac point and at low temperatures, they move in opposite directions.
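
For reference, the law in its standard textbook form (not quoted from the paper): the ratio of thermal conductivity (kappa) to electrical conductivity (sigma) is proportional to temperature T, and the constant of proportionality is the Lorenz number L_0.

```latex
\frac{\kappa}{\sigma T} \;=\; L_0 \;=\; \frac{\pi^2}{3}\left(\frac{k_B}{e}\right)^{2} \;\approx\; 2.44 \times 10^{-8}\ \mathrm{W\,\Omega\,K^{-2}}
```

A violation "by more than 200x" means the measured ratio on the left departs from L_0 by more than that factor.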

The collective electron behavior creates a "Dirac fluid" — an exotic quantum state theoretically equivalent to the quark-gluon plasma studied at CERN's particle accelerators. The experimental team used quantum spin magnetometers (nitrogen-vacancy centers in diamond) to image the electron flow directly. Published in Nature Physics, April 15, 2026.

Technology implications: the Wiedemann-Franz law is baked into thermal management models for semiconductors and quantum hardware. Materials that violate it don't behave the way current simulation tools predict. The authors identify sensitivity improvements for quantum sensors detecting weak electromagnetic fields — relevant to quantum networking hardware alignment.

So What? A fundamental assumption in materials physics is not universal — quantum hardware design tools that depend on Wiedemann-Franz thermal coupling models need updating for Dirac materials.

Sources: ScienceDaily — Graphene defies physics law | Nature Physics DOI 10.1038/s41567-025-02972-z


Muon g-2 Wins 2026 Breakthrough Prize for Sixty Years of Precision

Three generations of the Muon g-2 experiment (CERN, Brookhaven, Fermilab) won the 2026 Breakthrough Prize in Fundamental Physics for measuring the muon's anomalous magnetic moment to 127 parts per billion — approximately 30,000 times more precise than the first 1960s measurements. The persistent discrepancy with Standard Model predictions has survived multiple experimental runs and lattice QCD refinements, remaining the most active candidate for physics beyond the Standard Model.

Sources: Breakthrough Prize 2026 | Fermilab announcement


№ 08·Quick Takes

Quick Takes

  • Trainium4 NVLink Fusion: Amazon's next-gen Trainium chip supports NVIDIA NVLink Fusion — enabling heterogeneous Trainium/NVIDIA GPU clusters in a single fabric. Mixed-silicon clusters have different memory bandwidth and interconnect characteristics, requiring new approaches to congestion management and collective communication scheduling. Source: CNBC — Amazon Trainium

  • Microsoft AI ROI gap: $97 billion invested over four quarters to generate $37 billion in AI ARR (2.6:1 ratio); CFO Amy Hood confirmed the gap won't normalize before 2027. Azure AI pricing is unlikely to decline in the near term. Source: The Register — Microsoft Q3 2026

  • SONiC contributor growth: 4,300+ active contributors across 520+ organizations; enterprise NOS hardening is now the stated community priority, distinct from the hyperscaler-focused development that drove earlier releases. Source: ONUG: State of Enterprise SONiC Adoption


№ 09·Watch Today

Watch Today

  • IETF SPRING SRv6 deployment BCP: The draft is advancing as operator deployments validate the architecture; watch for Microsoft NANOG96 data being cited in the draft as a production reference
  • Google TPU hardware delivery timeline: When the first external customer receives hardware, the AI compute vendor landscape changes permanently — the fabric requirements that come with it land on network engineers
  • Nautobot 3.x adoption curve: The PostgreSQL 14 minimum creates an infrastructure dependency many teams haven't planned for — the Django 5.2 path is architecturally correct long-term but the migration friction is real and underestimated

Pipeline stats: 5 parallel research agents — 14 primary stories, 3 quick takes, 0 dedup rejections. RSS digest: 24 articles, top score 8.5 (ipSpace.net netlab). Web searches: ~12 total. Quality score: 4.5/5.
