Open Inference Stack Reshuffles — TGI Exits and SGLang Leads
Top 3 Highlights
1. Open-Source Inference Stack Reshuffles — TGI Retires, SGLang Leads, China at 41% of Downloads
The inference serving stack that most teams standardized on twelve to eighteen months ago is no longer the right answer. TGI entering maintenance mode is not a minor footnote — it means Hugging Face quietly acknowledged their own infrastructure lost the performance race. The project receives security patches, not features. Teams running TGI in new deployments are on a dead-end stack, full stop.
SGLang has emerged as the open-source throughput leader. The benchmark gap between SGLang and vLLM that existed in mid-2025 has closed or reversed, depending on workload type. TokenSpeed, released in preview on May 7 by the LightSeek Foundation under the MIT license, adds a Blackwell-optimized engine that posts nine percent lower minimum latency and eleven percent higher throughput than TensorRT-LLM on NVIDIA B200 hardware. It targets agentic workloads with contexts of fifty thousand tokens or more, where other engines do not optimize for user-perceived responsiveness and per-GPU throughput at the same time. Its MLA decode kernels have already been adopted upstream by vLLM, a sign that the framework is influencing the broader ecosystem.
The China angle in the Spring report deserves more attention than it's getting. Chinese organizations now represent 41% of all Hugging Face downloads, with Alibaba's Qwen family alone producing over 113,000 derivative models on the platform — more derivatives than Google and Meta combined. Independent developers have overtaken industry in total model creation. Open models are achieving ten-to-one-thousand-times cost reduction versus flagship closed models. The structural competitive landscape of AI has shifted, and the inference serving toolchain is catching up to reflect it.
The Hugging Face Kernel Hub — modular GPU-optimized kernels available as loadable libraries for both NVIDIA and AMD — is the abstraction layer that makes heterogeneous accelerator serving feasible without per-engine reimplementation. This is relevant for any team evaluating AMD or domestic accelerator alternatives to NVIDIA.
So What? If your team defaults to TGI for new deployments, stop. Evaluate SGLang and TokenSpeed before committing to vLLM or TGI on your next build. The Kernel Hub is the correct abstraction layer if you want GPU vendor flexibility. Invest one afternoon benchmarking SGLang against your current serving setup before your next GPU procurement conversation.
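For that benchmarking afternoon, it helps to reduce each run to the same few numbers before comparing engines. The sketch below is one way to do that, assuming you have already collected per-request latencies and output token counts from your own load test; the `summarize` helper and the sample figures are illustrative and do not come from any engine's tooling or from the report.

```python
import math
import statistics


def summarize(latencies_s, output_tokens, wall_clock_s):
    """Reduce one benchmark run to the three numbers worth comparing."""
    ordered = sorted(latencies_s)
    # Nearest-rank p99: the smallest sample with at least 99% of samples at or below it.
    p99_index = math.ceil(0.99 * len(ordered)) - 1
    return {
        "p50_latency_s": statistics.median(ordered),
        "p99_latency_s": ordered[p99_index],
        "throughput_tok_s": sum(output_tokens) / wall_clock_s,
    }


# Made-up sample: five requests, 200 output tokens each, 4 s of wall-clock time.
report = summarize([0.8, 1.0, 1.2, 0.9, 3.0], [200] * 5, wall_clock_s=4.0)
print(report)
```

Run the same prompt set and concurrency against each engine, then compare the resulting dictionaries side by side. With only a handful of samples the p99 is simply the slowest request, so collect a few hundred before trusting the tail.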
Sources: Hugging Face Spring 2026 State of Open Source Report, MarkTechPost — TokenSpeed Preview Release
2. NetDevOps Maturity Gap — 70% of Teams Immature Despite Tool Proliferation
TL;DR: A Broadcom network observability analysis and a Network World survey arrive at the same uncomfortable conclusion: 70% of organizations have immature automation practices despite widespread tool adoption, 87% have major visibility gaps across cloud, internet, and hybrid segments, and 39% cite those gaps as the direct blocker to AI-assisted operations. Intel's case study shows what the other side looks like: 13 engineers managing 5,500 devices in 2025, up from 20 managing 3,000 in 2019.
The tool availability problem for network automation was solved years ago. The adoption problem has not been. And Broadcom's data makes the sequence painfully clear: you cannot automate what you cannot see, and AI-generated configs built on networks without consistent telemetry are a liability, not an asset. The 39% who cite visibility gaps as the direct AI blocker are describing a structural problem that an AI overlay cannot paper over.
Intel's operational data is the kind of number that wins budget conversations. From 2019 to 2025, they went from 20 engineers managing 3,000 devices to 13 managing 5,500 — a roughly 2.8x improvement in device-per-engineer ratio. That didn't happen by getting lucky on hiring. It happened by automating the routine work. The prerequisite, both studies agree, is a 1-to-2-year data quality and source-of-truth consolidation effort before automation bears weight. Teams that skip that step hit a wall when automation generates configs from stale IPAM data.
Broadcom's recommended progression — observability first, then automation, then AI — is the correct order of operations. Push back on any vendor who asks you to invert it. You cannot layer AI on top of gaps you haven't closed.
So What? Before the next budget cycle, run the honest audit: do you have consistent telemetry across every segment you consume but don't own — cloud, ISP, SaaS, hybrid workforce? If not, that gap is your highest-leverage investment. Then build your automation business case from Intel's numbers: calculate your current device-per-engineer ratio and what 2.8x improvement means in headcount terms. That's the conversation that gets budget.
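The headcount arithmetic is easy to reproduce. This sketch recomputes the ratio from the Intel figures quoted above and applies the same multiplier to a hypothetical team; the 1,200-device, 10-engineer team is an invented example, not data from either study.

```python
def devices_per_engineer(devices: int, engineers: int) -> float:
    return devices / engineers


def engineers_needed(devices: int, ratio: float) -> float:
    """Engineers required to run `devices` at a given devices-per-engineer ratio."""
    return devices / ratio


# Intel figures as quoted in the briefing.
intel_2019 = devices_per_engineer(3_000, 20)
intel_2025 = devices_per_engineer(5_500, 13)
improvement = intel_2025 / intel_2019

# Hypothetical team, used only to show the calculation.
my_devices, my_engineers = 1_200, 10
my_ratio = devices_per_engineer(my_devices, my_engineers)
equivalent_headcount = engineers_needed(my_devices, my_ratio * improvement)

print(f"Intel improvement: {improvement:.2f}x")
print(f"Same estate at that ratio: {equivalent_headcount:.1f} engineers")
```

In budget terms, the second number is usually better framed as capacity freed for growth than as headcount cut, which is how the Intel figures read: fewer engineers, but nearly twice the devices.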
Sources: Broadcom Network Observability — NetDevOps in 2026, Network World — Network Engineers Take on NetDevOps Roles
3. CoreWeave Crosses 1GW and Pivots to Self-Build — AI Factory Architecture Moves Past Colo
TL;DR: CoreWeave confirmed it has surpassed one gigawatt of operational data center capacity and is accelerating a shift from leasing to owned self-build facilities. Meanwhile Amazon purchased 1,300 acres near Austin for datacenter land-banking. Together these moves reveal the pattern: purpose-built AI factories with purpose-built networking are outgrowing what conventional colo can support.
CoreWeave's 1GW milestone matters less as a number and more as an architectural signal. Their current capacity is entirely leased, and they've stated clearly that leased colo is an architectural ceiling — you cannot run 130kW-per-rack liquid cooling in a facility designed for 5-to-10kW racks. The self-build pivot (first owned facility later this year, 3.5GW contracted by end of 2027, 8GW by 2030) is CoreWeave deciding that owning the physical layer is the only way to control the architecture.
NVIDIA's DSX reference architecture — which codifies Spectrum-X Ethernet, RoCE v2 with adaptive routing, and 800G fabric targeting 95% bandwidth efficiency at hundred-thousand-plus GPU scale — is the spec that CoreWeave and other NVIDIA-partnered neoclouds are building to. DSX is on track to become what Cisco UCS was in the virtualization era: the reference spec that your customer's procurement team arrives with before the architecture conversation starts. The networking requirements (RoCE v2, adaptive routing, Spectrum switches, ConnectX-8 SuperNICs) are not optional components — they're load-bearing assumptions of the design.
So What? Get familiar with DSX's networking requirements now. When the first AI factory RFQ lands, you don't want to be reading the spec in the room. Specifically: understand RoCE v2 adaptive routing, Spectrum-X congestion control, and how ConnectX-8 SuperNIC offloads differ from standard NIC deployments. That's the conversation you'll be having within eighteen months.
Sources: DataCenter Dynamics — CoreWeave 1GW Self-Build, DataCenter Dynamics — Amazon Texas Land
Networking & Architecture
ipSpace SR-MPLS Workshop Series Begins — Nine Topologies from ITNOG 10
Ivan Pepelnjak reorganized ipSpace.net's Segment Routing resources on May 11 and announced an upcoming multi-part blog series working through nine SR-MPLS topologies from the ITNOG 10 workshop (April 2026). The lab configurations are already available in the public ipspace/SR-workshop GitHub repository — usable today in any lab environment.
- Nine SR-MPLS network topologies with working configurations in the repo; blog series will document each topology's routing table behavior and design tradeoffs
- SR-MPLS and SRv6 tracks now separated on ipSpace.net — reflects the real deployment divergence (SR-MPLS proven in SP WAN; SRv6 uSID advancing in AI fabric ECMP, as covered at NANOG 96)
- The ITNOG 10 workshop ran out of time before covering all nine topologies; the series covers what got skipped
This connects to the SR thread we've been tracking all spring — Microsoft's SRv6 uSID deployment over SONiC for AI training fabric ECMP was the production proof point; Ivan's SR-MPLS series is the foundational design work that explains the tradeoffs. The two tracks are complementary.
So What? Pull the SR-workshop GitHub repo and lab through at least two topologies this week — specifically the ones covering how SR-MPLS interacts with ECMP. If you missed the ITNOG 10 workshop, this series is the closest substitute.
Sources: Ivan Pepelnjak, ipSpace.net — SR Reorganization and ITNOG 10 Series
PCIe 8.0 Draft 0.5 — 256 GT/s and 1 TB/s Per Slot, Final Spec Targeting 2028
PCI-SIG published Draft 0.5 of the PCIe 8.0 specification on May 6, doubling PCIe 7.0's 128 GT/s throughput to 256 GT/s. A 16-lane slot delivers one terabyte per second bidirectional.
- PAM4 signaling and FLIT encoding carry over from PCIe 7.0, which keeps the physical-layer delta manageable
- Optical-aware retimer specs planned; copper traces cannot reliably carry 256 GT/s at board-to-board scale
- CopprLink active cable support for disaggregated GPU pooling and CXL memory expansion
- Eliminates the host-to-accelerator bandwidth bottleneck; enables next-generation CXL memory pooling without adding GPUs
- Final spec targets 2028; hardware with PCIe 8.0 and CXL 4.0 arrives 2028-2029
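The one-terabyte headline follows directly from the per-lane rate. A minimal sketch of the arithmetic, assuming each transfer carries one bit per lane and ignoring FLIT and error-correction overhead, so these are raw ceilings and not delivered throughput:

```python
def pcie_bandwidth_gbytes(gt_per_s: float, lanes: int = 16) -> tuple[float, float]:
    """Return (per-direction, bidirectional) raw bandwidth in GB/s."""
    # One bit per transfer per lane, eight bits per byte.
    per_direction = gt_per_s * lanes / 8
    return per_direction, per_direction * 2


for generation, rate in (("PCIe 7.0", 128), ("PCIe 8.0", 256)):
    one_way, both_ways = pcie_bandwidth_gbytes(rate)
    print(f"{generation} x16: {one_way:.0f} GB/s per direction, {both_ways:.0f} GB/s bidirectional")
```

For PCIe 8.0 that works out to 512 GB/s in each direction, or 1,024 GB/s across both, which is where the "1 TB/s per slot" figure comes from.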
So What? Two years from final spec, but the optical and CopprLink roadmap signals where composable AI infrastructure is going. Rack designs with five-year depreciation cycles should assume PCIe 8.0/CXL 4.0 hardware availability in the 2028-2029 window.
Sources: ServeTheHome — PCIe 8.0 Draft 0.5 Released
Automation & Programmability
Extreme Networks Agent ONE — Knowledge Graph Architecture for Enterprise NetOps
At Extreme Connect 2026 (May 4-7, Orlando), Extreme Networks unveiled Agent ONE — a two-tier agentic AI stack built around a networking-specific knowledge graph, not a generic LLM wrapper.
- Architecture: frontier models → AI Core knowledge graph (encodes MAC/client/policy/site/service relationships) → Skills layer (connectors, pipelines, ITSM) → Agent ONE Coworker (current, human-approved actions) → Agent ONE Operator (Q4 2026 roadmap, autonomous execution)
- The knowledge graph approach is what separates production-ready AI ops from chatbot wrappers — structured relational network data enables questions about MAC-to-policy bindings that a generic LLM cannot answer reliably
- Platform ONE now exposes unified topology across physical, Wi-Fi, and fabric layers with integrated alerting and inventory; Edge Services manages third-party Cisco, Aruba, HPE, and Juniper gear
- SPB (Shortest Path Bridging, IEEE 802.1aq) remains Extreme's enterprise fabric bet — zero-touch topology for teams that cannot operationalize BGP/EVPN expertise
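To make the knowledge-graph point concrete, here is a toy sketch of why structured relationships answer a MAC-to-policy question deterministically. The node names, relation names, and `NetworkGraph` class are all invented for illustration; nothing here reflects Extreme's actual AI Core schema or API.

```python
from collections import defaultdict


class NetworkGraph:
    """Tiny in-memory graph of (subject, relation) -> objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def follow(self, start, *relations):
        """Walk a chain of relations from `start` and return the nodes reached."""
        frontier = {start}
        for relation in relations:
            frontier = {
                obj
                for node in frontier
                for obj in self._edges.get((node, relation), ())
            }
        return frontier


# Hypothetical inventory data.
graph = NetworkGraph()
mac = "mac:aa:bb:cc:00:11:22"
graph.add(mac, "identifies", "client:badge-printer-3")
graph.add("client:badge-printer-3", "bound_to", "policy:iot-restricted")
graph.add("client:badge-printer-3", "located_at", "site:orlando-hq")
graph.add("policy:iot-restricted", "permits", "service:print-server")

print(graph.follow(mac, "identifies", "bound_to"))
print(graph.follow(mac, "identifies", "bound_to", "permits"))
```

The answer comes from traversing recorded relationships, so it is only as good as the data loaded into the graph. That is why the coverage question for your vendor stack matters more than the model sitting on top.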
So What? When evaluating AI-assisted NetOps platforms, ask whether the AI layer is built on a structured network knowledge model or wrapping a generic LLM. Knowledge graph architecture is the right answer; the correct follow-up question is whether that graph covers your specific vendor stack.
Sources: SiliconAngle — Extreme Connect 2026, Packet Pushers — Extreme Connect Orlando
ADTRAN: NetDevOps CI/CD Pattern for Optical Network Automation
ADTRAN published a practitioner architecture post detailing a Git-anchored CI/CD pipeline for optical network automation using Ansible, NETCONF/RESTCONF, and ONF-TAPI northbound interfaces — covering full service lifecycle with automated rollback.
- Stack: Git (per-device config history) → Ansible (automated deployment) → NETCONF/RESTCONF (programmatic config) → ONF-TAPI (vendor-agnostic optical service interface)
- Per-device files in Git preserve complete change history while enabling per-device customization
- ONF-TAPI northbound interface provides the vendor-agnostic hook that survives multi-vendor optical refreshes
- Automated rollback to prior stable state on failure
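The rollback step is the part worth sketching, since it is what makes the pipeline safe to run unattended. The example below is a simplified stand-in and not ADTRAN's implementation: `ConfigHistory` plays the role of the per-device files in Git, while `push` and `healthy` are hypothetical placeholders for the Ansible/NETCONF deployment and the post-change verification.

```python
from typing import Callable, Dict, List


class ConfigHistory:
    """In-memory stand-in for per-device config files tracked in Git."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def commit(self, device: str, config: str) -> None:
        self._versions.setdefault(device, []).append(config)

    def last_stable(self, device: str) -> str:
        return self._versions[device][-1]


def deploy_with_rollback(
    device: str,
    candidate: str,
    history: ConfigHistory,
    push: Callable[[str, str], None],
    healthy: Callable[[str], bool],
) -> bool:
    """Push a candidate config; restore the last stable one if verification fails."""
    previous = history.last_stable(device)
    push(device, candidate)
    if healthy(device):
        history.commit(device, candidate)  # only verified configs enter history
        return True
    push(device, previous)  # automated rollback
    return False


# Fake device state and checks, for illustration only.
running: Dict[str, str] = {}


def push(device: str, config: str) -> None:
    running[device] = config


def healthy(device: str) -> bool:
    return "shutdown" not in running[device]


history = ConfigHistory()
history.commit("roadm-1", "channel-1 power -2dBm")
push("roadm-1", history.last_stable("roadm-1"))

bad = deploy_with_rollback("roadm-1", "channel-1 shutdown", history, push, healthy)
after_bad = running["roadm-1"]
good = deploy_with_rollback("roadm-1", "channel-1 power -1dBm", history, push, healthy)
print(bad, after_bad, good, running["roadm-1"])
```

Committing only after the health check passes keeps the history a record of known-good states, so "last stable" always has a meaningful answer.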
So What? Optical automation teams still running manual CLI workflows have a clear reference pattern here. Standardize on ONF-TAPI as the northbound interface before adding automation layers — it's the interface that won't force a rewrite when you swap vendors.
Sources: ADTRAN Blog — NetDevOps Optical Network Automation
AI & Machine Learning
WebRTC Is Architecturally Broken for AI Voice — The Protocol Has to Change
A technical analysis (via Simon Willison, May 9) makes clear that WebRTC's core design — aggressively dropping audio packets to maintain low latency — is structurally incompatible with LLM voice applications. The failure mode is architectural, not implementational.
- WebRTC has no audio packet retransmission in browser implementations — designed for conferencing, where dropping a packet beats receiving it late
- LLM voice needs the inverse guarantee: losing a prompt packet is catastrophic; a 200ms delay is irrelevant
- Discord attempted engineering workarounds inside WebRTC constraints and could not solve it
- Production AI voice infrastructure likely requires QUIC-based alternatives or purpose-built AI audio transport protocols
- The industry is currently building AI voice products on WebRTC by default because it's ubiquitous, accumulating hidden architectural debt in the process
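The difference between the two delivery guarantees shows up even in a toy model. The sketch below is a deliberately crude illustration and not a model of RTP, WebRTC, or QUIC internals: each "packet" is a word standing in for an audio frame, and a retransmission is assumed to cost one round trip and to succeed on the first retry.

```python
def deliver(packets, lost, retransmit, rtt_ms=50):
    """Return (received_packets, added_delay_ms) for one delivery policy."""
    received, delay_ms = [], 0
    for seq, payload in enumerate(packets):
        if seq in lost:
            if not retransmit:
                continue  # conferencing-style: drop it and keep playing
            delay_ms += rtt_ms  # reliable-style: wait one round trip for the resend
        received.append(payload)
    return received, delay_ms


prompt = "transfer 500 dollars not 5000".split()
lost = {3}  # the packet carrying "not" is lost in transit

realtime, realtime_delay = deliver(prompt, lost, retransmit=False)
reliable, reliable_delay = deliver(prompt, lost, retransmit=True)

print(" ".join(realtime), f"(+{realtime_delay} ms)")
print(" ".join(reliable), f"(+{reliable_delay} ms)")
```

The drop-and-continue policy arrives on time with a different meaning; the reliable policy arrives 50 ms late and intact. For a human listener the first trade is usually right, and for a model transcribing a prompt it is the wrong one.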
So What? If you're specifying networking requirements for an AI voice product, design for QUIC-capable transport from the start. The WebRTC default is an architectural liability. This analysis is the technical foundation for that pushback conversation.
Sources: Simon Willison's Weblog — WebRTC and AI Voice
Simon Willison: Use HTML, Not Markdown, as Your Default LLM Output Format
In a May 8 post, Willison argues that HTML is functionally superior to Markdown as the default LLM output format — not as style preference but as structural capability. HTML enables interactive widgets, SVG diagrams, inline annotations, and in-page navigation that Markdown cannot carry, and modern token budgets make HTML's verbosity a non-factor.
- Demonstrated with GPT-5.5 producing a self-contained interactive HTML explanation — a deliverable that requires a separate rendering pipeline if done in Markdown
- The verbosity trade-off that made Markdown attractive at GPT-4 scale largely disappears at 1M-token contexts
- Practical for internal tooling: topology explainers, runbook generators, config diff summaries
So What? Next time you prompt a model for a config review summary or topology explanation, append "output as a self-contained HTML page with color-coded sections" — compare the result to your Markdown default and decide.
Sources: Simon Willison's Weblog
Datacenter & Infrastructure
Samsung + M3 to Develop Floating Data Centers — The Fun One
Samsung and newly formed company M3 announced on May 8 a collaboration to develop floating data centers — literal waterway-mounted facilities using proximity to seawater or river water for cooling, sidestepping the land-and-power-grid constraints choking conventional builds.
- M3 is built by engineers with prior floating server deployment experience
- Key driver: power-adjacent land with fiber access is running short; water-cooled floating platforms bypass land acquisition
- Microsoft's Project Natick (submerged on the seabed; its test vessel was retrieved in 2020) explored adjacent territory; Samsung's approach is surface-floating
- The hard networking problem: subsea fiber landing, wireless backhaul, and physical maintenance access introduce latency and availability challenges terrestrial facilities don't face
So What? Track this as a signal that desperation for power-adjacent, water-adjacent land is driving genuinely unconventional infrastructure directions. The real engineering question for network architects is the last-mile connectivity problem — a floating 50MW facility without direct fiber creates interesting operational challenges.
Sources: DataCenter Dynamics — Samsung + M3 Floating Data Centers
Science & Emerging Tech
China's Hanyuan-2 Claims First Dual-Core Neutral-Atom Quantum Computer — 200 Qubits [unverified]
CAS Cold Atom Technology unveiled Hanyuan-2 on May 8 — a 200-qubit neutral-atom system using two independent 100-qubit cores of distinct rubidium isotopes. One core targets error correction while the other executes computation.
- Claimed qubit manipulation accuracy: 99% (up from 90% on Hanyuan-1) — the key figure for evaluating the claim
- Claimed coherence time: over 100 seconds (up from ~20 seconds) — would be competitive with Western neutral-atom leaders if confirmed
- Power consumption: under 7 kilowatts using laser cooling, not dilution refrigeration
- Important caveat: no peer-reviewed publication, no independent benchmarks. This is a vendor announcement amplified by Chinese state media.
The dual-core architecture — running error correction and computation on separate isotope arrays simultaneously — is a structurally novel approach that mirrors disaggregation patterns in classical compute.
So What? [unverified] Monitor for independent replication. If the coherence time and manipulation accuracy claims hold under third-party scrutiny, PQC migration timelines compress further. ML-KEM and ML-DSA are NIST-finalized; if you're managing data with five-plus year confidentiality requirements, that migration is active engineering work now, not future planning.
Sources: The Quantum Insider — Hanyuan-2, Quantum Computing Report
QKD Crosses 120km on Standard Telecom Fiber — Six Hours Stable, No Manual Adjustment
An international team demonstrated quantum key distribution across 120 kilometers of standard telecom fiber using semiconductor quantum dots, with continuous stable operation for over six hours without manual recalibration. Published in Light: Science & Applications (Vol. 15, 2026). Note: paper published February 25; news coverage appeared May 8.
- Time-bin encoding makes the system resistant to temperature shifts and vibration — key for deployed fiber
- Secure key rate: ~15 bits per second at 120km — sufficient for symmetric key distribution, not bulk data encryption
- Compatible with telecom C-band; works on deployed fiber infrastructure without modifications
- Six-hour unattended operation window vs. prior QKD systems requiring frequent recalibration
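A quick calculation shows why roughly 15 bits per second is enough for key distribution but nowhere near enough for bulk traffic. This sketch takes the quoted rate at face value and assumes 256-bit symmetric keys, which is an assumption of the example and not a detail from the paper.

```python
KEY_RATE_BPS = 15  # approximate secure key rate at 120 km, as quoted above


def seconds_per_key(key_bits: int, rate_bps: float = KEY_RATE_BPS) -> float:
    """Time to accumulate enough secure bits for one symmetric key."""
    return key_bits / rate_bps


def keys_per_window(hours: float, key_bits: int, rate_bps: float = KEY_RATE_BPS) -> int:
    """Whole keys that can be distributed during an unattended window."""
    return int(hours * 3600 * rate_bps // key_bits)


print(f"One 256-bit key every {seconds_per_key(256):.1f} s")
print(f"{keys_per_window(6, 256)} keys in a six-hour window")
```

That is about one fresh 256-bit key every 17 seconds, or a little over 1,200 in the six-hour stable window: plenty for rekeying a point-to-point encryptor, and orders of magnitude short of carrying data directly.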
So What? QKD at metro-area distances on deployed fiber without constant babysitting is the threshold that moves it from lab curiosity to infrastructure-plausible. This doesn't displace PQC — it complements it for high-value point-to-point connections.
Sources: ScienceDaily — QKD 120km, EurekAlert — QKD Quantum Dots
Security
No significant security architecture updates this cycle. The May 8-11 window was quiet on zero-trust and microsegmentation fronts. Items on watch for next cycle: NVIDIA confidential AI factory TEE/Kata Containers reference architecture (awaiting a fresh deployment hook) and EQTY Lab verifiable runtime for NVIDIA Enterprise AI Factory.
Quick Takes
- Claude Opus 4.7 (released April 16): Thirteen percent improvement on a ninety-three-task coding benchmark; new xhigh effort level for reasoning/latency control; state-of-the-art on Finance Agent benchmarks. Five dollars per million input tokens, twenty-five per million output. Blackstone and Goldman Sachs named as enterprise adopters for financial services AI agents.
- NYT issues editor's note after AI summary misattributed as direct quote (May 10): A workflow failure in which AI summary output was promoted to quotable without validation against primary sources. Case study for any team building AI-assisted operations: design so AI output cannot be treated as authoritative without explicit confirmation against the source. This is an organizational process failure, not a model quality failure.
- Hugging Face Kernel Hub: Modular GPU-optimized kernels available as separate loadable libraries for NVIDIA and AMD hardware. The abstraction layer that makes heterogeneous accelerator serving feasible, and relevant for any team evaluating alternatives to NVIDIA in inference tiers.
Sources: Anthropic — Claude Opus 4.7, Simon Willison's Weblog — NYT correction, Hugging Face Spring 2026 Report
Watch This Week
- Google I/O 2026 (May 19): Expect Gemini infrastructure and Vertex AI / Gemini Enterprise Agent Platform updates, TPU 8t/8i follow-ups. Key question: does Google address the agent identity and registry gaps from Cloud Next?
- SONiC 202505 (May 31 target): Run the 202504-to-202505 diff against your deployment templates before the stable drop. DPU dark-mode for BlueField-4 Astra and SRv6 uSID via SDN controller are the two changes most likely to affect existing deployments.
- ipSpace SR-MPLS topology series: New posts weekly from May 11. Subscribe to ipSpace.net for notifications — nine topologies from ITNOG 10, starting now.
- Hanyuan-2 independent benchmarks: Watch Quantum Computing Report and The Quantum Insider for third-party validation of the 99% manipulation accuracy and 100+ second coherence claims.
5 domains researched | 15 web searches | 14 primary stories + 3 quick takes | quality score 4.5/5