MRC Goes Open — Ethernet AI Fabric Architecture Reaches Gigascale Production
Top 3 Highlights
1. MRC Goes Open — The Ethernet AI Fabric Architecture for 100K+ GPUs Is Now Public
The problem MRC solves: in a RoCEv2 fabric, switches hash each flow onto a single ECMP path. One unlucky hash means a GPU-to-GPU flow congests a single spine link while identical links sit idle. This is the same flow-collision problem that gave InfiniBand's adaptive routing its advantage over Ethernet for years. MRC fixes this by allowing a single RDMA connection to spray packets across every available path simultaneously. The unit of load balancing is the individual packet, not the flow: packets within a single connection can take different physical routes. No per-flow hash, no collision.
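The difference between per-flow hashing and per-packet spraying can be sketched with a toy load model. Everything here is an assumption made for illustration: the path count, the flow size, the SHA-256 stand-in for a switch hash, and the round-robin spraying policy. The source does not describe how MRC picks a path for each packet.

```python
import hashlib
from collections import Counter

NUM_PATHS = 8            # assumed number of equal-cost spine paths
PACKETS_PER_FLOW = 1000  # assumed flow size, in packets


def ecmp_path(five_tuple: tuple, num_paths: int = NUM_PATHS) -> int:
    """Flow-level ECMP: hash the 5-tuple once; every packet follows that path."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


def flow_hashed_load(flows: list, num_paths: int = NUM_PATHS) -> Counter:
    """Packets per path when each flow is pinned to one hashed path."""
    load = Counter({p: 0 for p in range(num_paths)})
    for flow in flows:
        load[ecmp_path(flow, num_paths)] += PACKETS_PER_FLOW
    return load


def sprayed_load(flows: list, num_paths: int = NUM_PATHS) -> Counter:
    """Packets per path when each packet is sprayed round-robin over all paths."""
    load = Counter({p: 0 for p in range(num_paths)})
    for _ in flows:
        for seq in range(PACKETS_PER_FLOW):
            load[seq % num_paths] += 1
    return load


# Eight hypothetical RoCEv2 flows (UDP destination port 4791).
flows = [("10.0.0.1", f"10.0.1.{i}", 49152 + i, 4791, "udp") for i in range(8)]
hashed, sprayed = flow_hashed_load(flows), sprayed_load(flows)
print("flow-hashed max/min:", max(hashed.values()), min(hashed.values()))
print("sprayed     max/min:", max(sprayed.values()), min(sprayed.values()))
```

With spraying, every path carries the same load by construction. With per-flow hashing and a uniform hash, eight flows land on eight distinct paths only about 0.24% of the time (8!/8^8), so some links usually carry several flows while others carry none.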
The full three-layer architecture:
- MRC (transport): Packet-level multipath across all ECMP paths simultaneously. Path-aware failure detection reroutes connections without waiting for a controller.
- Multi-plane Clos (physical): Separate physical spine planes provide redundancy for 100,000+ GPU clusters. Two-tier topologies become viable at this scale — no three-tier hierarchy required when physical plane redundancy compensates.
- SRv6 (failure bypass): Pre-programmed SRv6 segment lists give MRC autonomous failure rerouting. When a path fails, MRC follows a backup segment list it already knows. Recovery is sub-second. No centralized controller in the critical path.
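The failure-bypass idea in the last bullet can be sketched as a connection that already holds its backup segment lists and switches between them locally. This is a minimal illustration, not the MRC implementation: the `Connection` class, its method names, and the segment IDs (taken from the IPv6 documentation prefix) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """One transport connection with pre-programmed candidate paths.

    Each path is an SRv6 segment list: an ordered list of segment IDs.
    The SIDs below are documentation-prefix placeholders, not real values.
    """
    segment_lists: list            # index 0 is the primary path
    failed_sids: set = field(default_factory=set)

    def report_failure(self, sid: str) -> None:
        """Record a locally detected failure; no controller is consulted."""
        self.failed_sids.add(sid)

    def active_path(self):
        """Return the first segment list that avoids every failed SID."""
        for seg_list in self.segment_lists:
            if not self.failed_sids.intersection(seg_list):
                return seg_list
        return None  # every pre-programmed path is affected


conn = Connection(segment_lists=[
    ["2001:db8:a::1", "2001:db8:f::1"],   # primary: via spine plane A
    ["2001:db8:b::1", "2001:db8:f::1"],   # backup: via spine plane B
])
print(conn.active_path())
conn.report_failure("2001:db8:a::1")
print(conn.active_path())
```

The point of the sketch is that the reroute decision is a local lookup over state the endpoint already has, which is why no controller round trip sits in the recovery path.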
The paper confirms this architecture ran frontier model training at both OpenAI and Microsoft. Production, not a test bed. AMD has already endorsed MRC for its ROCm networking stack. Oracle is also a confirmed operator. The OCP contribution means this is not NVIDIA's proprietary story — it is a multi-vendor open specification.
So What? Add MRC-capable NICs to your next AI fabric RFQ. Ask vendors specifically which platforms implement the open OCP MRC spec versus a proprietary variant. Read the arXiv paper before you're in that conversation — it is the architecture reference for what's running at scale.
Sources: arXiv 2605.04333, ServeTheHome, OpenAI, NVIDIA Blog
2. Anthropic Code with Claude 2026 — Agentic Engineering Hits Production Maturity
TL;DR: Anthropic's May 6 developer conference shipped the production tooling stack for autonomous AI agents: multi-agent orchestration, an outcome-based evaluation framework (Outcomes), persistent overnight memory (Dreams), schedule/webhook-triggered automation (Routines), doubled compute limits, and access to SpaceX's Colossus 1 datacenter (220,000+ NVIDIA GPUs). API volume is up 17× year-over-year — the actual adoption rate, not a conference claim.
The Outcomes, Dreams, and Routines triad:
Outcomes is the most infrastructure-relevant announcement. You define success criteria as a machine-checkable rubric — not a prompt instruction, a verifiable test. The agent iterates autonomously until criteria are satisfied or the token budget is exhausted. For network automation: define "configs pass Batfish validation, no BGP neighbor drops in pyATS, route table matches intent" and the agent iterates through config variations running those checks without a human in the approval loop at every step. The human engages when the rubric is satisfied — or when it repeatedly fails.
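The iterate-until-the-rubric-passes pattern described above can be sketched in a few lines. This is not Anthropic's API: the function name, the `propose` callback, and the string-matching checks are hypothetical stand-ins, and real checks would call Batfish or pyATS and return pass or fail.

```python
from typing import Callable, Dict, Optional, Tuple

Check = Callable[[str], bool]


def iterate_until_satisfied(
    propose: Callable[[int], Tuple[str, int]],
    rubric: Dict[str, Check],
    token_budget: int,
) -> Tuple[Optional[str], Dict[str, bool], int]:
    """Propose candidates until all rubric checks pass or the budget runs out.

    propose(attempt) returns (candidate_config, tokens_spent_on_this_attempt).
    Returns (accepted_config_or_None, last_check_results, tokens_used).
    """
    tokens_used, attempt, results = 0, 0, {}
    while tokens_used < token_budget:
        candidate, cost = propose(attempt)
        tokens_used += cost
        results = {name: check(candidate) for name, check in rubric.items()}
        if all(results.values()):
            return candidate, results, tokens_used
        attempt += 1
    return None, results, tokens_used


# Stand-in checks; real ones would call Batfish or pyATS and return pass/fail.
rubric = {
    "has_bgp_neighbor": lambda cfg: "neighbor 192.0.2.1" in cfg,
    "no_shutdown": lambda cfg: "shutdown" not in cfg,
}

# Stand-in for the agent: a fixed sequence of candidate configs.
candidates = [
    "router bgp 65001\n neighbor 192.0.2.1 shutdown",
    "router bgp 65001\n neighbor 192.0.2.1 remote-as 65002",
]


def propose(attempt: int) -> Tuple[str, int]:
    return candidates[min(attempt, len(candidates) - 1)], 1000


accepted, results, used = iterate_until_satisfied(propose, rubric, token_budget=5000)
print(accepted is not None, results, used)
```

Returning the last check results alongside `None` matters for the second exit condition: when the budget runs out, the human who picks it up sees which rubric items kept failing.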
Dreams (research preview) lets an agent inspect previous sessions overnight, extract useful memories, and carry them forward as persistent state without an external vector database. The practical benefit is stateful agent behavior without building a memory architecture. The governance concern is that agents rewrite their own behavioral context without a human auditing what was "learned" — not yet resolved.
Routines are schedule-triggered and webhook-triggered automation sequences, native in the agent layer. Nightly network audits, change-window config validation, alerting-triggered remediation — expressible without LangChain boilerplate. For infrastructure teams already using Claude Code for automation, Routines is the most immediately deployable feature from the conference.
So What? Outcomes is the abstraction that makes autonomous agents safe to deploy against defined tasks. Before deploying agentic automation against production network infrastructure, design your "done and correct" criteria as a machine-verifiable contract — the tooling to enforce it at agent runtime now exists. If you can't express success as a Batfish or pyATS pass, you're not ready for autonomous execution.
Sources: Simon Willison (live blog), Lenny's Newsletter
3. SONiC 202505 Ships May 31 — SRv6, DPU Dark-Mode, and Enterprise Feature Closures
TL;DR: SONiC's May 2025 release, due May 31, lands three capabilities that close the loop on the AI fabric architecture and enterprise readiness gaps: static SRv6 uSID via SDN controller (the NANOG96 Microsoft AI fabric demo in community software), high-frequency streaming telemetry tuned for AI workload monitoring, and independent DPU firmware/OS upgrades for dark-mode SmartSwitch deployments.
Why this release matters architecturally: The SRv6 support is contributed by Cisco, Alibaba, Microsoft, and NVIDIA across 30+ PRs — the same architecture described in the MRC paper's three-layer design can now be built on open community software. SONiC is the NOS layer that makes the OpenAI/Microsoft AI fabric pattern reproducible outside hyperscale.
DPU dark-mode means the data processing unit is managed as a standalone SONiC node with zero visibility from the host OS. Firmware and OS upgrades on the DPU proceed independently without taking down the SmartSwitch. For zero-downtime AI training cluster operations, this is the feature that makes DPU-resident policy enforcement operationally viable, not just architecturally interesting.
The enterprise additions — per-VLAN spanning tree, 802.1X/MAB port-based access control, per-lane digital optical monitoring with pre-FEC and post-FEC BER — signal deliberate investment in enterprise operator requirements that hyperscalers don't need. The Orange telco deployment (90 switches production, 150+ planned) represents the adoption curve this track is feeding.
So What? Flag May 31 in your calendar. If you're running or planning a SONiC deployment, run the 202504 → 202505 diff against your deployment templates before the stable drop. The SRv6 and DPU dark-mode features combined are the release where the hyperscale AI fabric pattern becomes community-maintained standard software.
Sources: SONiC Foundation, SONiC GitHub Roadmap
Networking & Architecture
ARP's 4-Hour Default Was Rational in 1982 — Today It Is a Production Hazard
Ivan Pepelnjak has published the history behind ARP's 4-hour timeout and MAC aging's 5-minute default, a follow-up to Tuesday's article on EVPN ARP operational failures. The context: ARP was designed for 30-node networks with 2 MB RAM on thick coax. A 4-hour timeout minimized broadcasts on shared segments where refresh overhead was measurable. MAC aging at 5 minutes was the balance between stale entries and unicast flooding on 10 Mbps segments.
The dangerous mismatch: MAC ages out in 5 minutes but ARP doesn't expire for 4 hours. A host moves; the MAC clears; the sending host still holds a valid ARP cache pointing at the old L2 destination — traffic blackholes silently for up to 4 hours. Understanding the causal history makes the fix non-negotiable rather than advisory: these defaults were never intended to survive the jump into EVPN multi-site fabrics.
So What? Any automated EVPN fabric deployment playbook should include a paired test verifying ARP timeout is below MAC aging on each platform type. Add it to your ANTA or pyATS NRFU catalog — it catches the timer mismatch before it surfaces as an unexplained host migration failure.
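The paired timer test can be sketched as a plain function before it is wired into ANTA or pyATS. The `TimerProfile` type, the platform names, and the tuned values are hypothetical; in a real catalog the two timers would be read from each device rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimerProfile:
    platform: str
    arp_timeout_s: int   # ARP/neighbor cache timeout, seconds
    mac_aging_s: int     # MAC address table aging time, seconds


def timer_mismatches(profiles):
    """Return profiles whose ARP timeout is not below the MAC aging time."""
    return [p for p in profiles if p.arp_timeout_s >= p.mac_aging_s]


# Hypothetical inventory. The first entry uses the historical defaults
# discussed above (4 hours and 5 minutes); the second is an example of a
# tuned pairing, not a vendor recommendation.
profiles = [
    TimerProfile("legacy-defaults", arp_timeout_s=4 * 3600, mac_aging_s=5 * 60),
    TimerProfile("tuned-leaf", arp_timeout_s=1500, mac_aging_s=1800),
]

for p in timer_mismatches(profiles):
    print(f"FAIL {p.platform}: ARP {p.arp_timeout_s}s >= MAC aging {p.mac_aging_s}s")
```

Only the legacy-defaults entry is flagged: 14,400 seconds of ARP lifetime against 300 seconds of MAC aging.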
Sources: Ivan Pepelnjak, ipSpace.net
Automation & Programmability
Auvik Aurora Grounds Agentic IT Ops in Fifteen Years of Real Network Data
TL;DR: Auvik launched Aurora on April 29, an agentic AI platform grounded in 15 years of real customer network data — 300 million device configuration backups and 2.2 billion CLI command executions — rather than general LLM training data or documentation scraping. Current GA scope: alert prioritization, device lifecycle tracking, remediation script generation with mandatory human review. Autonomous execution is roadmap.
The differentiator is the corpus, not the model (Claude). When Aurora suggests a remediation, it reasons over context from hundreds of thousands of similar devices in similar situations, not from Cisco documentation. The CEO's framing is explicit: "there aren't a lot of kids coming out of university studying Cisco CLI" — Aurora targets the skills gap directly, not just augmenting engineers who already know this deeply.
So What? For MSPs and enterprise teams without dedicated automation engineers, Aurora is the first agentic network tool grounded in real operational data at meaningful scale. Evaluate the lifecycle and CVE tracking features independently of the agentic framing — both have immediate value regardless of how comfortable you are with AI-generated configs.
Sources: Network World, Business Wire
AI as Learning Tool vs. Autonomous Operator — The NetGru Framework for Network Engineers
TL;DR: CCIE veteran Adrian Iliesiu (NetGru) argues that the engineers who future-proof their careers use AI as a structured learning accelerator — not an autonomous operator — and that defining explicit guardrails for AI autonomy is now a core professional skill, not an optional configuration.
The distinction is sharper than the usual "AI augments you" framing. A network engineer who uses AI to explain, scaffold, and verify retains the ability to catch bad output. An engineer who prompts an AI to generate a BGP policy and ships it without understanding it loses that verification capability over time. Guardrails — what the AI can generate autonomously, what requires review, what it cannot touch — are architectural decisions, not personal preferences.
So What? If your AI-assisted workflow doesn't have explicit human review gates for production-impacting changes, you're running without guardrails by default. NetBox Copilot's mandatory confirmation step on write-path operations is one implementation of this pattern. Build yours explicitly, not by accident.
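One way to make the three guardrail tiers explicit is a small policy table that code consults before acting. This is a sketch under stated assumptions: the action names and the tier assigned to each are illustrative, not taken from NetGru or NetBox Copilot.

```python
from enum import Enum


class Gate(Enum):
    AUTONOMOUS = "autonomous"      # AI may generate and apply without review
    HUMAN_REVIEW = "human_review"  # AI may draft; a person must approve
    FORBIDDEN = "forbidden"        # AI must not touch this at all


# Hypothetical policy table; action names are illustrative only.
POLICY = {
    "explain_config": Gate.AUTONOMOUS,
    "generate_lab_config": Gate.AUTONOMOUS,
    "generate_bgp_policy": Gate.HUMAN_REVIEW,
    "push_production_change": Gate.HUMAN_REVIEW,
    "rotate_credentials": Gate.FORBIDDEN,
}


def gate_for(action: str) -> Gate:
    """Unknown actions default to human review rather than autonomy."""
    return POLICY.get(action, Gate.HUMAN_REVIEW)


def may_proceed(action: str, human_approved: bool = False) -> bool:
    """Apply the gate: forbidden never runs, review needs explicit approval."""
    gate = gate_for(action)
    if gate is Gate.FORBIDDEN:
        return False
    if gate is Gate.HUMAN_REVIEW:
        return human_approved
    return True
```

The default in `gate_for` is the design choice worth copying: an action nobody classified falls into review, so a missing entry can never grant autonomy by accident.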
Sources: Packet Pushers NAN121
The Conference Circuit as Curriculum — NTT DATA Engineer Maps AutoCon to NANOG
TL;DR: NTT DATA Network Operations Engineer Joseph Nicholson traces on Packet Pushers how the AutoCon and NANOG conference circuit transformed him from a 10-minute lightning talk novice to a 45-minute NANOG presenter, with modular Ansible and AI tooling as the technical anchors. The structural insight: community knowledge transfer, not tool releases, is the primary driver of automation adoption.
The episode is worth listening to specifically for Nicholson's description of what a modular Ansible framework looks like in production at a large SP. The practitioner arc, from a 10-minute lightning talk to a 45-minute NANOG session in roughly two conference cycles, is a data point on how quickly community learning accelerates when you commit to presenting rather than just attending.
So What? If you're early in the automation journey, treat AutoCon and NANOG as curriculum, not optional events. The conference circuit is currently the highest-density venue for SP-scale automation patterns, and the knowledge transmission gap it fills is real and measurable.
Sources: Packet Pushers TCG075
Vibe Coding Is Reaching Network Automation — Simon Willison's Warning
TL;DR: Simon Willison published a post observing that vibe coding and agentic engineering — the two things he's been most careful to distinguish — have started to converge in his own practice. The accountability mechanisms that made human-authored code reliable are being removed faster than replacements are being built.
The mechanism Willison describes is normalization of deviance, borrowed from aerospace safety literature: each successful unreviewed AI-generated deployment makes the next one feel safer. Over time the review habit erodes. The structural breaks he identifies are significant: tests, documentation, and commit histories can all be generated by agents, so they no longer reliably signal human care or correctness. Dev workflows built for 200 lines per day are breaking at 2,000 lines per day.
For network automation engineers: AI-generated Ansible playbooks, Nornir scripts, or YANG-templated configs that skip pre-merge Batfish or pyATS validation put you in this failure mode regardless of your intent or experience level. The answer is not manual review at velocity; it is mandatory, machine-checkable correctness gates that run before deployment. Outcomes (covered in Top 3 #2) is Anthropic's attempt to provide that primitive at the tooling layer.
So What? If your CI pipeline doesn't have a mandatory pre-merge validation gate for AI-generated configs — Batfish, ANTA, pyATS — you are operating as a vibe coding shop by default. The gate is not overhead; it is the only remaining defense on AI-generated infrastructure changes at deployment velocity.
Sources: Simon Willison
AI & Machine Learning
Reflex Benchmark — Vision Agents Burn 45× More Tokens Than API Calls
TL;DR: Reflex ran a controlled benchmark comparing a vision agent (AI clicking through UI via screenshots) against an API-based agent (direct endpoint calls) on an identical multi-step task. Same model, same task: vision agent used ~551,000 input tokens and took 17 minutes with high run-to-run variance. The API agent used ~12,000 tokens and completed in under 20 seconds with zero variance across trials. 45× the token spend for non-deterministic results.
The full data:
- Vision agent (Claude Sonnet + browser-use): 550,976 tokens average, 1,003 seconds average, 43–68 steps per trial (high variance)
- API agent (Claude Sonnet + tool-use): 12,151 tokens, 19.7 seconds, exactly 8 tool calls every trial
- API agent (Claude Haiku): 9,478 tokens, 7.7 seconds — cheaper still
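The headline multiplier follows directly from the figures in the list above. The short calculation below uses only those reported numbers; the dictionary names are just labels for this sketch.

```python
# Figures as reported in the list above.
vision = {"tokens": 550_976, "seconds": 1_003.0}
api_sonnet = {"tokens": 12_151, "seconds": 19.7}
api_haiku = {"tokens": 9_478, "seconds": 7.7}


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


print(f"tokens, vision vs API (Sonnet): {ratio(vision['tokens'], api_sonnet['tokens']):.1f}x")
print(f"time,   vision vs API (Sonnet): {ratio(vision['seconds'], api_sonnet['seconds']):.1f}x")
print(f"tokens, vision vs API (Haiku):  {ratio(vision['tokens'], api_haiku['tokens']):.1f}x")
print(f"vision wall time: {vision['seconds'] / 60:.1f} minutes")
```

On the same model the token ratio is about 45× and the wall-clock ratio about 51×. The Haiku comparison comes out near 58×, but it mixes a change of model with a change of method, so it is not a like-for-like figure.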
There is also a hidden engineering cost: the vision agent required a 14-step explicit UI navigation walkthrough embedded in the prompt to handle review list pagination. That engineering overhead does not appear in the token count but it's real.
Why this matters for network operations tooling: If you're reaching for computer-use or browser-use agents because they seem easier to set up than writing API integrations, you're trading a one-time integration cost for a permanent per-run token multiplier of roughly 45×, plus non-deterministic reliability. Any network management tool that makes an AI click through a management UI instead of calling RESTCONF, gNMI, or gRPC is running in the expensive, unreliable column of this benchmark.
So What? Invest in the API surface first: RESTCONF, gNMI, gRPC. The one-time cost of building the structured endpoint integration pays for itself within the first hundred runs. The 45× multiplier is permanent.
Sources: Reflex, The Register
Context Window Race — Gemini 3.1 Ultra's 2M Token Window and What It's Actually For
TL;DR: Google's Gemini 3.1 Ultra ships with a 2-million-token context window (generally available). Claude 4.6 Opus and Sonnet carry 1-million-token windows at no long-context surcharge. The "fit your entire codebase in context" framing understates the infrastructure implication: these windows are primarily useful for in-context agent state — the full execution history of tool calls, intermediate reasoning, and working memory for a long-running autonomous workflow, without external retrieval.
For network automation agents running against large YANG model corpora, multi-vendor device state, or extended change management workflows, the context window is the buffer that eliminates retrieval latency and the accuracy degradation of imperfect search. A complex multi-step network audit workflow can keep its full execution history in context through the entire run.
So What? When specifying AI tooling for automation pipelines, context window size is now a meaningful infrastructure parameter for agentic workloads, not just a benchmark talking point. For workflows with complex multi-step execution, prefer models with large native context over architectures requiring external retrieval for agent state.
Sources: LLM Stats, Air Street Press, May 2026
Datacenter & Infrastructure
Arm's Datacenter Business Is About to Surpass Mobile: Arm CEO Rene Haas confirmed datacenter revenue will become Arm's largest business segment "soon," surpassing mobile — where Arm has dominated for two decades. The driver is Arm-based server CPUs (Graviton, Neoverse, Apple M-series) taking AI inference and cloud compute from x86. For infrastructure architects: Arm-based servers have different thermal and power envelopes, which matters for per-rack power budgets and cooling design as AI-adjacent inventory refreshes accelerate.
IREN Acquires Mirantis — Neoclouds Building Full Stacks: GPU neocloud IREN has agreed to acquire Mirantis, the OpenStack and Kubernetes distribution company. Raw GPU hour resellers are discovering they need a managed cloud layer to compete with hyperscalers long-term. Mirantis gives IREN that layer. Whether OpenStack is the right foundation in 2026 is a legitimate question — but the directional signal (neoclouds building full-stack platforms) is real and worth tracking.
Sources: The Register (Arm), The Register (IREN/Mirantis)
Science & Emerging Tech
IBM Heron Simulates Spin Transport in Real Materials — Quantum Utility With Experimental Validation
TL;DR: A team from Purdue, Oak Ridge National Laboratory, and IBM used a 40-qubit IBM Heron processor to run the first digital quantum simulation of spin transport in a real quantum magnet, then validated the results against actual neutron scattering experiments on the material. Published in Physical Review Letters, April 15, 2026.
The science: Spin transport in Heisenberg spin chains is the model underlying quantum magnets and spintronic materials, and the cost of simulating it scales exponentially on classical hardware. The team developed a mid-circuit measurement algorithm reducing circuit complexity to linear scaling, enabling circuits with ~1,900 two-qubit gates to run without full error correction. The simulation correctly reproduced three distinct transport regimes (ballistic, diffusive, superdiffusive) in the material, and those results matched experimental neutron scattering data from potassium copper fluoride (KCuF₃), an actual quantum magnet studied with physical experiments.
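To give a feel for the exponential scaling, the sketch below computes the memory a brute-force dense state vector would need. This is an illustration only: the source does not say which classical baseline the team compared against, and other classical methods (tensor networks, for example) can need far less memory in some regimes.

```python
def state_vector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector: 2**n complex amplitudes.

    16 bytes per amplitude assumes double-precision complex numbers.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


for n in (20, 30, 40):
    gib = state_vector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.3f} GiB")
```

Each added qubit doubles the requirement: 30 qubits need 16 GiB, and 40 qubits need 16,384 GiB (16 TiB) just to hold the state.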
Why it matters: This is a meaningful step beyond demonstration: a quantum computer solving a materials physics problem that is experimentally relevant and hard classically, with cross-validation against real experimental data. Quantum-classical hybrid applications in materials science are arriving before fault tolerance. The classical AI infrastructure running these hybrid pipelines — NVIDIA Ising (covered Monday) — is now load-bearing for quantum workflows as an operational reality, not a future concern.
So What? Watch for similar results in chemistry optimization problems relevant to datacenter materials and semiconductor design. The hybrid deployment model — classical AI infrastructure + quantum coprocessor — is solidifying as the near-term architecture.
Sources: Quantum Computing Report, ORNL
Altermagnetic Proximity Effect Opens a New Route to Engineered Topological Superconductors
TL;DR: A Physical Review Letters paper (May 5, 2026) demonstrates theoretically that altermagnets — a newly confirmed third class of magnetic order — can transfer their distinctive spin-splitting properties into adjacent nonmagnetic materials via a proximity effect, potentially inducing topological superconductivity in the proximitized layer. Topological superconductors host Majorana zero modes, the basis for Microsoft's topological qubit approach.
The effect is tunable via interlayer spacing, making these van der Waals heterostructures physically adjustable in the lab. First-principles theory result — awaiting experimental replication. If confirmed, this expands the menu of physical qubit candidates beyond trapped ions, superconducting transmons, and neutral atoms.
So What? Preliminary: monitor for experimental replication. Altermagnetism has been a fast-moving field since its 2024 experimental confirmation. Watch Physical Review Letters for follow-on hardware demonstrations over the next 6-12 months.
Sources: Physical Review Letters, APS Physics
Security
MCP Server Trust Gap Gets a Cryptographic Blueprint
TL;DR: A peer-reviewed paper (MDPI Future Internet, May 4, 2026) proposes a three-layer architecture for a Trustworthy MCP Registry: RFC 8615 well-known URIs for server discovery, Sigstore keyless certificate signing for provenance, and JCS/JWS runtime verification to reject unauthorized capability mutations. It directly addresses the "rug pull" attack vector — an MCP server presents safe capabilities during vetting, then mutates them after deployment.
The Five Eyes guidance (covered May 4) named MCP tool-calling chains as the primary supply chain attack surface for agentic AI but offered no cryptographic mechanism to address it. This paper fills that gap. MCP server identity gets bound to OIDC-issued short-lived certificates logged to Sigstore's Rekor transparency ledger. Any runtime capability mutation must carry a cryptographically verified JCS-canonicalized signature or the client rejects the connection. All three components — Sigstore, RFC 8615, JCS/JWS — are production-grade today and composable.
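The core of rug-pull detection is comparing what a server presents now against what was vetted, using a canonical encoding so that harmless reordering does not look like a change. The sketch below shows only that comparison and simplifies a lot: `json.dumps` with sorted keys merely approximates JCS (RFC 8785), the signature layer (JWS, Sigstore certificates, Rekor) is left out entirely, and the manifest fields and tool names are hypothetical.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Deterministic JSON encoding: sorted keys, no insignificant whitespace.

    This approximates JCS (RFC 8785) for simple manifests only; full JCS also
    specifies number and string serialization rules not handled here.
    """
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over the canonical encoding, as a hex string."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


def is_unchanged(pinned_digest: str, presented: dict) -> bool:
    """True if the server's current capabilities match what was vetted."""
    return manifest_digest(presented) == pinned_digest


# Hypothetical capability manifest recorded at vetting time.
vetted = {"name": "inventory-server", "tools": [{"name": "get_device", "write": False}]}
pinned = manifest_digest(vetted)

# Same content with different key order: canonicalization makes it match.
reordered = {"tools": [{"write": False, "name": "get_device"}], "name": "inventory-server"}

# A mutated manifest that quietly adds a write-capable tool after deployment.
mutated = {"name": "inventory-server",
           "tools": [{"name": "get_device", "write": False},
                     {"name": "push_config", "write": True}]}

print(is_unchanged(pinned, reordered), is_unchanged(pinned, mutated))
```

A pinned digest can only reject every change. The paper's design goes further by letting a mutation through when it carries a valid signature, which is the part this sketch does not attempt.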
So What? This is a blueprint, not yet an adopted standard. For operators running agentic automation tools with MCP server connections (NetBox Copilot, Infrahub MCP, Forward Networks), this is the trust model to design toward. The rug-pull attack is real and the cryptographic architecture to close it exists now.
Sources: MDPI Future Internet
Watch Today
- SONiC 202505 (May 31): Calendar the drop date. Run 202504 → 202505 diff against your deployment templates before stable release.
- MRC OCP spec: Read arXiv paper 2605.04333 before your next AI fabric RFQ — it is the architecture reference for what's running in production at OpenAI and Microsoft.
- Anthropic Outcomes feature: Evaluate for agentic automation pipelines where you currently rely on human review gates. Beta available now for Claude Code.
- Altermagnetic proximity effect (PRL May 5): Track for experimental replication. If hardware demonstrations follow in 6-12 months, this expands the physical qubit candidate space.
- Auvik Aurora write-path (roadmap): Monitor for GA announcement — advisory-only is the current scope, autonomous execution is the gate to watch.