Morning Briefing · Friday, May 22, 2026

Amazon Replaces Fat-Tree With Flat Random-Graph Networks, Cuts Costs 45%

Plate I (leaf · spine). Schematic leaf-spine fabric: explicit-path traffic flows across the spine plane, pods at the edges.
№ 01 · Top Highlights

Top 3 Highlights

1. Amazon Quietly Replaced Fat-Tree Topology With Flat Random-Graph Networks — and Cut Costs 45%

TL;DR: Amazon has published an arXiv paper revealing that RNG — flat datacenter networks built on quasi-random graph theory — has replaced fat-tree topology as the default architecture for most workloads across Amazon datacenters. The design matches or beats fat-tree performance while reducing switch count by up to 45%.

Key Points:

  • RNG (Random-Graph Network) eliminates the hierarchical tiers of fat-tree by connecting routers in a flat topology where random-graph expander properties provide resilience and many edge-disjoint paths between any pair of nodes
  • The ShuffleBox is a passive optical device that takes structured cable inputs and internally randomizes connectivity — so from the router's perspective neighbors are random, but from the cabling team's perspective complexity is comparable to fat-tree
  • Spraypoint routing sprays packets to a random neighbor first, then routes via distributed waypoints near the destination — avoiding the classic last-mile congestion problem; only two VRFs required on commodity switches
  • Cost reduction scales with oversubscription ratio: 9% savings at 1:1 oversubscription up to 45% at higher ratios; metric is switch count
  • Incremental deployment is a first-class property — adding new capacity requires only new shuffle panels and random re-balancing, not systematic rewiring

Deep Dive: The fat-tree topology has dominated datacenter design for nearly two decades because it delivers any-to-any bandwidth with predictable performance. The price is hierarchical overprovisioning: every tier doubles port density, every link is numbered and purposeful, and cost scales with the square of bandwidth. The counterintuitive insight in RNG is that random graphs are actually excellent expanders — they naturally achieve high connectivity and many edge-disjoint paths between node pairs without requiring the hierarchical structure that makes fat trees expensive.
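
The expander claim can be checked at toy scale. The sketch below is not taken from the Amazon paper: it builds a small random 4-regular graph with the textbook pairing model and counts edge-disjoint paths between two routers using a unit-capacity max-flow. The function names, graph size, degree, and seed are illustrative assumptions.

```python
import random
from collections import deque


def random_regular_graph(n, d, seed=0):
    """Pairing model: shuffle n*d port stubs and pair them up, retrying
    until there are no self-loops or parallel links. n*d must be even."""
    rng = random.Random(seed)
    while True:
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        edges = set()
        for a, b in zip(stubs[::2], stubs[1::2]):
            edge = (min(a, b), max(a, b))
            if a == b or edge in edges:
                break  # invalid pairing, reshuffle and try again
            edges.add(edge)
        else:
            adj = {v: set() for v in range(n)}
            for a, b in edges:
                adj[a].add(b)
                adj[b].add(a)
            return adj


def edge_disjoint_paths(adj, s, t):
    """Count edge-disjoint s-t paths with BFS augmenting paths (unit capacities)."""
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:  # push one unit of flow back along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


adj = random_regular_graph(n=16, d=4, seed=1)
print(edge_disjoint_paths(adj, 0, 15))  # never more than 4, the router degree
```

The path count between two routers is capped by their degree, so a result close to the degree for most pairs is what "many edge-disjoint paths" looks like in practice.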

The ShuffleBox is the engineering move that makes this deployable rather than theoretical. At Amazon's scale, physically cabling a random-graph topology by hand would be an operational nightmare: thousands of fiber runs in arbitrary patterns across potentially thousands of racks. The ShuffleBox solves this as a passive optical device that takes structured, predictable cable inputs and internally shuffles them into random logical connectivity. The cabling team runs cables to known ShuffleBox ports; the ShuffleBox handles the randomization. The result: random-graph topology properties without random-graph cabling chaos.
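
One way to picture the idea is as a fixed, seeded permutation hidden behind orderly front-panel ports. The sketch below is a hypothetical model, not the device's actual internal design: `build_shuffle_pairs`, `router_neighbors`, the port counts, and the seed are all assumptions made for illustration.

```python
import random


def build_shuffle_pairs(num_ports, seed):
    """Fixed internal wiring: a seeded random perfect matching of ports.
    num_ports must be even."""
    ports = list(range(num_ports))
    random.Random(seed).shuffle(ports)
    pairs = {}
    for a, b in zip(ports[::2], ports[1::2]):
        pairs[a] = b
        pairs[b] = a
    return pairs


def router_neighbors(pairs, ports_per_router):
    """Router r is cabled, in order, to ports r*k .. r*k + k - 1."""
    neighbors = {}
    for port, peer in pairs.items():
        router = port // ports_per_router
        peer_router = peer // ports_per_router
        if router != peer_router:  # skip links that loop back to the same router
            neighbors.setdefault(router, set()).add(peer_router)
    return neighbors


pairs = build_shuffle_pairs(num_ports=32, seed=7)
for router, peers in sorted(router_neighbors(pairs, ports_per_router=4).items()):
    print(router, sorted(peers))
```

The cabling side stays structured (router r always plugs into the same block of ports) while the neighbor sets that come out are effectively random.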

Spraypoint's routing is worth understanding because it challenges a deeply embedded assumption. Traditional routing tries to find the "best" path and concentrate traffic there. Spraypoint accepts that in a random graph there are many roughly equivalent paths and deliberately distributes traffic across all of them. The waypointing mechanism — forwarding first to a waypoint distributed around the destination cluster, then routing in — prevents the classic convergence problem where popular destinations create last-hop congestion. The two-VRF implementation is elegant: this routing-theoretic advance ships on commodity merchant silicon without exotic forwarding table requirements.
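
A toy version of the spray-then-waypoint idea can be sketched in a few lines. This is not the paper's algorithm or its two-VRF implementation: the 8-router graph, the choice of the destination's direct neighbors as waypoints, and the names `shortest_path` and `spray_route` are assumptions for illustration.

```python
import random
from collections import deque

# Hypothetical 3-regular fabric: a ring of 8 routers plus cross links.
ADJ = {
    0: [1, 7, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 4, 7],
    4: [3, 5, 0], 5: [4, 6, 1], 6: [5, 7, 2], 7: [6, 0, 3],
}


def shortest_path(adj, src, dst):
    """Plain BFS shortest path; assumes the graph is connected."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def spray_route(adj, src, dst, rng):
    """Spray to a random neighbor, head for a random waypoint next to dst,
    then take the final hop. The toy path may revisit a router."""
    first_hop = rng.choice(adj[src])
    waypoint = rng.choice(adj[dst])  # assumption: waypoints are dst's neighbors
    to_waypoint = shortest_path(adj, first_hop, waypoint)
    final_leg = shortest_path(adj, waypoint, dst)
    return [src] + to_waypoint + final_leg[1:]


rng = random.Random(42)
for _ in range(3):
    print(spray_route(ADJ, 0, 5, rng))
```

Because the waypoint is chosen per packet, flows to the same destination arrive over different last hops instead of converging on one link.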

So What?: If your datacenter refresh is coming up in the next two years, this paper is required reading — the 9-to-45% switch count reduction maps directly to your oversubscription target, and the incremental deployment model means you can migrate without a forklift upgrade.

Sources: arXiv (Amazon/Bernardi et al.) - https://arxiv.org/abs/2604.15261


2. NetBox Replica Cache Fixes the Automation Read Wall at Scale

TL;DR: NetBox Labs has shipped a public preview of Replica Cache — a read-only API layer serving NetBox data from an operationally independent cache. Sub-50ms P95 query latency with zero load on the primary database. This directly addresses the read-contention bottleneck that automation fleets hit past 10,000 devices.

Key Points:

  • Change-capture replication keeps the cache current with typical data freshness under 10 seconds; programmatic freshness verification available for latency-sensitive decisions
  • P95 query latency under 50ms; primary database entirely isolated from automation read traffic
  • Supports server-side filtering, field selection, cursor-based pagination, and aggregation — full query surface, not a dumb snapshot
  • Works alongside TurboBulk (high-throughput writes), together addressing both read and write scaling walls
  • Available now for NetBox Cloud Premium tier; positioned for AI datacenter operators, hyperscalers, and large telco automation teams
  • Operationally independent from the primary — resilient during maintenance windows or primary degradation

Deep Dive: The architectural problem is straightforward but chronically underappreciated. When your NetBox instance is backing thousands of automation jobs, every Ansible playbook run, every Terraform plan, every monitoring check fires read queries against the same PostgreSQL primary that also handles interactive users and writes. At small scale this is invisible. Past 10,000 devices — especially in AI datacenter environments where IP allocations and port assignments change continuously — those reads dominate. The primary chokes, interactive latency spikes, and writes queue up.

NetBox Replica Cache moves the read path entirely outside the primary's blast radius. What's architecturally important is that they preserved the full API query surface rather than serving pre-computed snapshots — existing automation scripts work unchanged. The "operationally independent" property means the cache survives a primary database outage or maintenance window, which makes it a reliability feature as much as a performance one. For organizations where automation continuity is load-bearing, this is significant.

The "zero primary database load" claim is the key statement. It means automation read traffic can scale horizontally against the cache tier without touching capacity planning on the primary. For organizations that have been routing around NetBox with local caches, stale YAML files, or parallel CMDBs because the primary couldn't keep up, this removes the architectural excuse.
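
For latency-sensitive decisions, a client can gate each read on how stale the cache is allowed to be. The sketch below is a hypothetical policy function, not the NetBox Labs API: how the replication lag is obtained is left out, and `choose_read_source`, `ReadDecision`, and the thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadDecision:
    source: str  # "cache" or "primary"
    reason: str


def choose_read_source(cache_lag_seconds, max_staleness_seconds, primary_healthy=True):
    """Prefer the cache; fall back to the primary only when the data is too
    stale for this decision and the primary is actually reachable."""
    if cache_lag_seconds <= max_staleness_seconds:
        return ReadDecision("cache", "cache is fresh enough")
    if primary_healthy:
        return ReadDecision("primary", "cache too stale for this decision")
    return ReadDecision("cache", "primary unavailable, serving stale data")


# Inventory report: 10 seconds of lag is fine.
print(choose_read_source(cache_lag_seconds=8, max_staleness_seconds=10))
# IP allocation check: wants data no older than 2 seconds.
print(choose_read_source(cache_lag_seconds=8, max_staleness_seconds=2))
```

Most automation reads would take the first branch, which is what keeps load off the primary.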

So What?: If you're on NetBox Cloud Premium and your automation fleet has grown past a few thousand devices, enable the Replica Cache preview and point your read-heavy Ansible and Terraform integrations at it. You'll recover primary database headroom and gain a resilient read path; the trade-off is a freshness window of roughly 10 seconds, which is acceptable for almost all inventory reads.

Sources: NetBox Labs Blog - https://netboxlabs.com/blog/replica-cache-public-preview/


3. Microsoft Completes the Agentic AI Safety Engineering Stack

TL;DR: Microsoft open-sourced RAMPART (a pytest-based CI/CD adversarial testing framework for AI agents) and Clarity (a pre-implementation design review agent). Combined with NVIDIA Verified Agent Skills and McQuaid sandboxes earlier this week, the agentic AI safety engineering discipline is now fully codified across four distinct layers.

Key Points:

  • RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming) embeds adversarial safety tests directly in pytest suites, emitting pass/fail CI signals; supports statistical policies ("this action must be safe in at least 80% of runs") for probabilistic LLM behavior
  • Primary focus is cross-prompt injection attacks — agents processing poisoned documents, emails, tickets, or configuration repositories that redirect tool use
  • Microsoft internally used RAMPART to find close to 100 variants of a single vulnerability vector and verified mitigations in multi-turn conversations
  • Clarity is a pre-implementation structured design review agent that surfaces failure modes from security, human factors, adversarial, and operational perspectives before code is written; outputs saved as version-controlled markdown
  • Both repos live at github.com/microsoft/RAMPART and github.com/microsoft/clarity-agent/

Deep Dive: This week assembled a complete engineering safety stack for agentic AI, and Friday's RAMPART/Clarity release is the piece that closes the loop. Monday brought NVIDIA Verified Agent Skills — a pre-deployment catalog with cryptographic provenance for what a skill claims to do. Also Monday: McQuaid sandboxes and Git worktrees for runtime blast-radius containment via OS-level permissions. RAMPART adds CI-gated adversarial testing, answering the question neither of the other tools addresses: does this agent behave safely under adversarial inputs before it ships?

The pytest integration is the load-bearing design choice. By making safety tests emit standard pass/fail signals, RAMPART lets agentic safety testing live alongside unit tests in the same CI pipeline. The same gate that prevents broken code from shipping can prevent unsafe agent behavior from shipping. The statistical policy model — explicitly acknowledging that LLM outputs are probabilistic and setting a pass rate rather than requiring binary determinism — is honest engineering that will age better than approaches that pretend LLMs are deterministic.
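
The statistical policy idea can be illustrated without any framework at all. The sketch below does not use RAMPART's API, which is not described here: `check_policy`, `simulated_agent_run`, and the deterministic one-in-ten failure pattern are stand-ins chosen so the example is reproducible.

```python
def pass_rate(run_scenario, trials):
    """Fraction of trials in which the scenario reports safe behavior."""
    safe = sum(1 for trial in range(trials) if run_scenario(trial))
    return safe / trials


def check_policy(run_scenario, trials=50, min_safe_rate=0.8):
    """Return (passed, observed_rate) for a 'safe in at least X% of runs' policy."""
    rate = pass_rate(run_scenario, trials)
    return rate >= min_safe_rate, rate


def simulated_agent_run(trial):
    """Stand-in for one adversarial run of a real agent: pretends the agent
    is hijacked by a poisoned ticket on every tenth run."""
    return trial % 10 != 0


def test_agent_meets_safety_policy():
    # pytest-style test: an ordinary assert turns the pass rate into a CI gate.
    passed, rate = check_policy(simulated_agent_run, trials=50, min_safe_rate=0.8)
    assert passed, f"safe in only {rate:.0%} of runs"
```

A real suite would replace `simulated_agent_run` with a call into the agent under test; the threshold and trial count are the knobs a team would tune.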

The cross-prompt injection focus is exactly right for network operations contexts. If you are building an AI agent that reads device logs, ticket systems, configuration repositories, or SNMP polling output, every one of those sources is a potential injection vector. A misconfigured device (or a deliberate attacker) could embed content that redirects the agent's tool use. RAMPART lets you write test scenarios that simulate this, set acceptable pass rates, and block any release that falls below them.

The full stack as it now stands: (1) design-time validation (Clarity), (2) CI-gated adversarial testing (RAMPART), (3) pre-deployment skill verification with provenance (NVIDIA Verified Skills), (4) runtime blast-radius containment (McQuaid OS sandboxes). That is a complete discipline, not a collection of one-off announcements. Agentic NetOps is going to get safer and more auditable as a result.

So What?: Add RAMPART to your CI pipeline for any agent that reads external data sources — start with cross-prompt injection test scenarios against any agent touching logs, tickets, or config repositories, and set an 80% safety pass rate as your initial gate.

Sources: Microsoft Security Blog - https://www.microsoft.com/en-us/security/blog/2026/05/20/introducing-rampart-and-clarity-open-source-tools-to-bring-safety-into-agent-development-workflow/ | The Register - https://www.theregister.com/security/2026/05/21/microsoft-open-sources-agentic-ai-safety-tools/5243822


№ 02 · Networking

Networking & Architecture

Plate II (networking). Schematic leaf-spine fabric: explicit-path traffic flows across the spine plane, pods at the edges.

SONiC Enterprise Adoption Crosses Into Mainstream AI Infrastructure

SONiC has moved decisively past hyperscaler-only deployments. SAKURA Internet (Japan) ran SONiC on an 800-GPU cluster ranked 49th globally in the TOP500 supercomputer list — a direct data point on viability for AI fabric at serious scale. Mitsui Knowledge Industry deployed 400G multi-tenant AI infrastructure with RoCEv2 on SONiC. A national Indian digital payments network is running 300+ SONiC switches handling hundreds of millions of daily transactions — enterprise-grade reliability outside the AI context.

Gartner projects over 40% of large datacenter operators (200+ switches) will run SONiC by end of 2026. Dell'Oro forecasts SONiC in nearly 10% of enterprise switches shipped this year. Active contributors now at 4,300+ across 520+ organizations. Key roadmap items aligned to AI fabric: telemetry, packet spray, SRv6.

The RoCEv2 + packet spray + telemetry combination is what makes SONiC credible for GPU cluster networking. If you're evaluating SONiC only for hyperscaler-style deployments, you're missing the AI fabric argument.

So What?: The Mitsui and SAKURA deployments are the proof points to take to your architecture review board when making the case for SONiC in AI fabric buildout.

Sources: ONUG/Aviz Networks - https://onug.net/blog/state-of-enterprise-sonic-adoption-the-open-networking-shift-accelerates-in-the-ai-era/


№ 03 · Automation

Network Automation

Plate III (automation). Source-of-truth pipeline: intent → diff → apply → verify, idempotent on every revolution.

Gartner Names Agentic NetOps a Formal Category — 65% of Network Activity Still Manual

Gartner's 2026 Innovation Insight report formally names "Agentic NetOps" as an emerging category, predicting a jump from under 10% to 30% of enterprises automating over half their network operations this year. The baseline data point is damning: 65% of network activities are still manual in 2026.

This is the analyst naming moment that creates enterprise buying pressure even in laggard organizations. Network to Code's platform reached general availability in April 2026, positioning Nautobot as the network source-of-truth foundation for agentic workflows. The report identifies network source-of-truth as the foundational prerequisite before deploying agentic automation — consistent with the source-of-truth platform graduation arc this show has tracked since NetBox 4.6 in May.

Ivan Pepelnjak's companion observation, published this week on ipSpace, is the harsher framing: Urs Baumann's ten-year-old network automation slides still apply verbatim. The tools have improved dramatically since 2016. The adoption hasn't. Pepelnjak's hypothesis is that AI-assisted interfaces — lowering the floor for non-programmer engineers — may succeed where a decade of tooling improvements failed, because they attack the adoption gap directly rather than improving tooling that engineers already don't use.

So What?: The Gartner naming moment is a signal to act — if your organization is still 65% manual and your leadership reads analyst reports, this is the report to table with a formal network source-of-truth and automation platform proposal.

Sources: Network to Code Blog - https://networktocode.com/blog/network-automation-platforms-key-insights-for-network-leaders-on-how-to-scale-network-automation/ | ipSpace.net - https://blog.ipspace.net/2026/04/state-network-automation/


№ 04 · AI / ML

AI & Machine Learning

Plate IV (ai / ml). Embedding space: clusters carry related concepts; the highlighted query vector pulls its nearest neighbors.

Datasette Agent: Conversational SQL Queries Over Any Structured Database

Simon Willison released Datasette Agent this week, combining his three-year-old LLM Python library with the Datasette data exploration tool. The result: a conversational interface for querying any SQLite database, with plugin extensibility for chart generation and sandboxed code execution. Default model is Gemini 3.1 Flash-Lite; local models via LM Studio supported for privacy-sensitive data.

The network operations angle is direct: export your NetBox device inventory, SNMP polling results, log aggregations, or device configuration database to SQLite and point Datasette Agent at it. You get conversational queries against network state without writing SQL and without data leaving your environment if you run a local model.
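
The export step is mostly plumbing. The sketch below loads a few hypothetical device rows into SQLite with the standard library; in practice the rows would come from a NetBox export, and the table layout, field names, and sample devices here are illustrative assumptions.

```python
import sqlite3

# Hypothetical inventory rows standing in for a real NetBox export.
DEVICES = [
    {"name": "leaf-01", "site": "dc1", "role": "leaf", "status": "active"},
    {"name": "leaf-02", "site": "dc1", "role": "leaf", "status": "planned"},
    {"name": "spine-01", "site": "dc1", "role": "spine", "status": "active"},
    {"name": "leaf-03", "site": "dc2", "role": "leaf", "status": "active"},
]


def export_devices(db_path, rows):
    """Write device rows into a 'devices' table, replacing rows with the same name."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS devices "
        "(name TEXT PRIMARY KEY, site TEXT, role TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO devices VALUES (:name, :site, :role, :status)", rows
    )
    conn.commit()
    return conn


conn = export_devices(":memory:", DEVICES)  # use a file path to keep the database
active_per_site = conn.execute(
    "SELECT site, COUNT(*) FROM devices WHERE status = 'active' "
    "GROUP BY site ORDER BY site"
).fetchall()
print(active_per_site)  # [('dc1', 2), ('dc2', 1)]
```

The final query is the kind of SQL a conversational layer would generate from a question like "how many active devices per site?".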

So What?: Export your NetBox device inventory to SQLite and run Datasette Agent against it with a local model — it's a zero-cost proof of concept for conversational network state queries that requires no API keys and no data exposure.

Sources: Simon Willison's blog - https://simonwillison.net/2026/May/21/datasette-agent/


NVIDIA GPU Visibility on Kubernetes — Real-Time Allocation and Idle Detection

NVIDIA released the GPU Usage Monitor, built on the DCGM Exporter and deployable via a single Helm chart to any Kubernetes cluster. It shows real-time GPU allocation, compute utilization, memory consumption, and pod status across the entire cluster. It targets SREs and platform teams whose GPUs are routinely over-provisioned (models using only 30-50% of GPU memory) or silently idle.

For teams running AI inference on GPU-accelerated Kubernetes, this is the observability layer that surfaces scheduling bottlenecks before users escalate them.
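
Once utilization and memory figures are available, flagging reclaim candidates is a small classification step. The sketch below is a generic illustration, not part of NVIDIA's tool: the function name, the 5% idle cutoff, and the 50% memory cutoff are assumptions.

```python
def classify_gpu(util_percent_samples, mem_used_fraction, idle_util=5.0, low_mem=0.5):
    """Label one GPU from recent compute-utilization samples (0-100) and the
    fraction of its memory in use (0.0-1.0)."""
    average_util = sum(util_percent_samples) / len(util_percent_samples)
    if average_util < idle_util:
        return "idle"
    if mem_used_fraction <= low_mem:
        return "over-provisioned"
    return "ok"


# Hypothetical readings for three GPUs.
readings = {
    "gpu-0": ([0, 1, 2], 0.10),
    "gpu-1": ([80, 90], 0.40),
    "gpu-2": ([80, 90], 0.90),
}
for gpu, (samples, mem) in readings.items():
    print(gpu, classify_gpu(samples, mem))
```

GPUs labeled "idle" are candidates for reclaiming outright; "over-provisioned" ones are candidates for sharing or a smaller allocation.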

So What?: Deploy the NVIDIA GPU Usage Monitor Helm chart in your AI inference cluster this week — GPU utilization visibility is the first step toward rightsize-and-reclaim cycles that recover meaningful compute headroom.

Sources: NVIDIA Developer Blog - https://developer.nvidia.com/blog/get-real-time-visibility-into-gpu-usage-across-kubernetes-clusters/


№ 05 · Datacenter

Datacenter

Plate V (datacenter). Datacenter row: per-rack utilization at a glance. Cool colors are slack; warmer fills are pressure.

AI Infrastructure Spending Spreads From GPUs to Networking and Optics

NVIDIA's Q1 FY2027 results revealed that the AI spending wave is propagating outward from GPU clusters into fabric and interconnects. Networking revenue came in at $14.8 billion — up 199% year-over-year, the fastest-growing line in the company. New multi-year partnerships with Coherent, Corning, and Lumentum signal that silicon photonics and optical interconnects are becoming first-class infrastructure components, not afterthoughts.

The reporting restructure is the editorial signal: NVIDIA now separates "Hyperscale" from "ACIE" (AI Clouds, Industrial, and Enterprise) within its data center platform. Creating a named revenue bucket for enterprise and industrial is a public commitment that NVIDIA's roadmap will serve these segments directly — not just through hyperscaler trickle-down. The inference and edge emphasis reinforces this: training clusters were hyperscaler-only territory; inference infrastructure at scale is where enterprise and telecom operators live.

The practical implication for network engineers: the fabric around the GPUs is becoming as strategically important as the GPUs themselves. The $14.8 billion networking quarter is evidence, not projection.

So What?: Start treating AI cluster networking as a core skill — the 199% networking growth in a single quarter signals that the path from "network engineer" to "AI infrastructure engineer" runs through fabric design, not retraining on a different technology stack.

Sources: Data Center Knowledge - https://www.datacenterknowledge.com/infrastructure/nvidia-earnings-show-ai-spending-moving-beyond-gpus


Utilities Are Forcing Datacenters to Fund Their Own Grid Upgrades

Oregon's Public Utility Commission this week approved a new large-load rate class requiring datacenters consuming 20 MW or more to directly fund the grid infrastructure they require and sign 10-year power purchase agreements. Oregon joins at least 18 states that have established similar cost-allocation frameworks, with 77 more tariff proposals pending across 36 states.

This is the Virginia Tier 4 generator story playing out at the utility interconnection level. Virginia required Tier 4 emissions controls and continuous monitoring for new generator permits from July 2026. Oregon is requiring datacenters to self-fund grid infrastructure. Texas set a 75 MW threshold via SB 6, and Illinois set its threshold at 50 MW. The direction is uniform: the externality-free era of hyperscale datacenter construction is over.

The Dallas construction surge from this week's global capacity report (31.7 GW under construction in 2025, more than double the prior year, with Dallas overtaking Northern Virginia for the first time) is this regulatory pressure made visible in market data. Operators are voting with construction dollars, and Virginia's permitting constraints have already shifted supply southwest to Texas.

So What?: Treat state-level large-load tariff status as a first-order filter in datacenter site selection — the difference between a 20 MW threshold state and a 75 MW threshold state can shift total project cost by tens of millions in grid infrastructure funding obligations.

Sources: DataCenter Dynamics - https://www.datacenterdynamics.com/en/news/oregon-energy-regulator-approves-new-rate-class-for-large-load-data-centers/ | DataCenter Dynamics (global construction) - https://www.datacenterdynamics.com/en/news/global-data-center-capacity-under-construction-reached-317gw-in-2025-more-than-double-the-year-before-report/


№ 06 · Science

Science

Plate VI (science). Field schematic: three-body stability under quasi-equal masses, drawn from the day's central result.

Equal1 Fits a Working Quantum Computer in a Standard Server Rack

Equal1 demoed the RacQ at Dell Tech World 2026 — a full quantum computer in a standard 19-inch rack form factor, running a live demo alongside a Dell PowerEdge R770. The RacQ uses silicon spin qubits fabricated with standard CMOS semiconductor processes, integrating the entire quantum system (qubits, RF controls, classical interface) onto a single chip. Operating temperature is 0.3 Kelvin — roughly nine times colder than deep space (2.7K), powered by a self-contained closed-cycle cryocooler drawing 1,600 watts from a standard single-phase socket.

The architectural significance is the form-factor problem being solved. Every practical quantum system to date has required a purpose-built facility: dilution refrigerators the size of a small car, cryogenic plumbing, vibration isolation platforms, specialist engineers on-site. Equal1's CMOS bet — that silicon-spin qubits can be manufactured on the same process that makes every server CPU and network ASIC — makes miniaturization possible in a way that superconducting transmons or trapped ions cannot achieve. The Dell Quantum Intelligent Orchestrator handles workload scheduling across quantum and classical compute, abstracting the quantum resource for applications that don't need to understand it.

The qubit count and fidelity for RacQ have not been publicly disclosed, so commercial viability for specific workloads remains open. But the form-factor proof-of-concept is real, the manufacturing path via standard CMOS is credible, and the datacenter integration model is sound.

So What?: The form factor problem for enterprise quantum computing was just solved as a live demo, not a lab paper — the countdown to quantum-as-a-datacenter-resource is now a fidelity and qubit-count scaling problem, not an infrastructure problem.

Sources: ServeTheHome - https://www.servethehome.com/equal1-single-rack-quantum-computer-at-dell-tech-world-2026/ | The Quantum Insider - https://thequantuminsider.com/2026/05/15/equal1-unveils-hybrid-rack-mounted-silicon-spin-quantum-computer/


US Government Takes $2 Billion in Equity Stakes in Nine Quantum Firms

The Commerce Department issued letters of intent to take equity positions — not grants, not contracts — in nine quantum computing companies. The equity mechanism signals the US government is treating quantum as strategic industrial infrastructure close enough to commercial viability to warrant an ownership position. Selection criteria are under scrutiny given reported connections between some firms and Trump family-linked investors.

So What?: The shift from grants to equity stakes is a bullish timing signal for quantum hardware timelines — though the selection criteria scrutiny is the story worth watching for follow-on accountability coverage.

Sources: Ars Technica - https://arstechnica.com/gadgets/2026/05/us-government-takes-2-billion-equity-stake-in-nine-quantum-computing-firms/


№ 07 · Security

Security

Plate VII (security). Zero-trust egress: credentials are injected at the proxy boundary, never reaching the client runtime.

Cloudflare CASB Now Monitors Claude AI Usage — No Endpoint Agents Required

Cloudflare extended its Cloud Access Security Broker to support the Claude Compliance API, enabling security and compliance teams to monitor Claude usage directly from the Cloudflare dashboard. The enforcement architecture is network-level, not endpoint-level — no agent install required. Cloudflare is using its SASE position as the AI governance enforcement plane: the network layer sees all traffic regardless of endpoint state.

This is the architectural pattern worth watching: SASE platforms as AI governance enforcement points. The same CASB layer that historically controlled SaaS app access is now the mechanism for monitoring AI tool interactions (prompts, file uploads, generated content). As AI governance frameworks mature toward NIST and ISO 42001 requirements, organizations already running Cloudflare Access or Zero Trust will find the enforcement boundary already exists.

No significant additional security architecture updates this cycle.

Sources: Cloudflare Blog - https://blog.cloudflare.com/casb-anthropic-integration/


№ 08 · Quick Takes

Quick Takes

ipSpace OpenFlow Retrospective — 6 Hours Free: Ivan Pepelnjak published a full public OpenFlow deep dive (six hours, no registration required) this week. Framed as a retrospective on why the technology failed to scale operationally despite sound theoretical foundations — timely context for anyone evaluating gNMI and model-driven programmability approaches that attempt to solve the same forwarding-plane programmability problem. https://blog.ipspace.net/2026/05/openflow-videos/

Cal Poly Floquet Engineering — Exotic Quantum Matter More Stable Than Conventional: Cal Poly physicists demonstrated that periodically oscillating a magnetic field generates quantum phases of matter with no static analog — and these driven states show greater stability and decoherence resistance than conventional quantum states. Published Physical Review B Vol. 113, May 2026. The decoherence resistance angle is directly relevant to qubit scaling challenges. https://www.sciencedaily.com/releases/2026/05/260504154014.htm

NVIDIA Telco AI Factories Shift to Token Billing: NVIDIA's developer blog describes telcos building sovereign AI factories on the NCP reference architecture, with the commercial model shifting from selling GPU compute-hours to delivering AI services billed per token. Relevant for enterprise teams evaluating sovereign AI capacity sourcing as an alternative to hyperscaler APIs under data residency constraints. https://developer.nvidia.com/blog/building-token-metered-ai-services-on-telco-ai-factories/


№ 09 · Watch Today

Watch Today

  • Equal1 / Dell Tech World: Any follow-up technical disclosure on RacQ qubit count and gate fidelity — the form factor is proven, the fidelity is the next question
  • RAMPART adoption signals: Whether security and platform engineering teams start integrating it into agentic CI pipelines over the next two to three weeks
  • Oregon PUC large-load tariff: Watch for neighboring state PUCs (Washington, California SB 886) to cite Oregon as precedent in their own tariff proceedings
  • Amazon RNG citations: Whether hyperscaler networking teams at Google, Meta, or Microsoft reference or respond to the RNG paper — topology convergence signals

№ 10 · Automation

Pipeline Stats

Plate VIII (automation). Source-of-truth pipeline: intent → diff → apply → verify, idempotent on every revolution.
  • Sources this run: 14 primary sources across 5 domains
  • Dedup rejections: 0 (all items cleared 72-hour cooldown)
  • Quality score: 4.5/5
  • Items published: 12 primary + 3 quick takes
  • RSS digest: Used (80 articles, 22 feeds; top score 17.0 — NetBox Replica Cache)
  • Cold open variant: E (HOST_B leads)
  • Today is Friday: Week in Review included in podcast