Founded in 2002, our laboratory conducts research on the design and implementation of a wide range of networked computing systems.
Today, large-scale software-defined networks use microservice-based controllers. Bugs in these controllers can reduce network availability by making the data plane state inconsistent with the high-level intent. To recover from such inconsistencies, modern controllers periodically reconcile the state of all the switches with the desired intent. However, periodic reconciliation limits the availability and performance of the network at scale. We introduce Zenith, a microservice-based controller that avoids inconsistencies by design rather than always relying on recovery mechanisms. We have formally verified Zenith’s specifications and have proved that it ensures the network state will eventually be consistent with intent. We automatically generate Zenith’s code from its specification to minimize the likelihood of errors in the final implementation. Zenith’s guarantees and abstractions also enable developers to independently verify SDN applications and ensure end-to-end safety and correctness. Zenith resolves inconsistencies 5× faster than today’s designs and significantly improves availability.
In this paper, we consider a new workload for which serverless platforms are well suited: the execution of a 3D printer controller in the cloud. This workload is qualitatively different from those considered in prior work because of its stringent timing requirements. Our measurements on popular serverless platforms reveal millisecond-level overheads that impair the timely execution of the example control algorithm we consider. To mitigate the impact of these overheads, we judiciously partition the execution of the algorithm across a set of serverless functions and exploit timely speculation. Our evaluations on AWS Lambda show that, for 30 diverse print jobs, Cosmic ensures the timely execution of the controller while reducing cost by 2.8×–3.5× compared to other approaches.
Cloud providers install mitigations to reduce the impact of network failures within their datacenters. Existing network mitigation systems rely on simple local criteria or global proxy metrics to determine the best action. In this paper, we show that we can support a broader range of actions and select more effective mitigations by directly optimizing end-to-end flow-level metrics and analyzing actions holistically. To achieve this, we develop novel techniques to quickly estimate the impact of different mitigations and rank them with high fidelity. Our results on incidents from a large cloud provider show orders of magnitude improvements in flow completion time and throughput. We also show our approach scales to large datacenters.
July 11, 2025
Two papers accepted at SIGCOMM 2025.
Nov 18, 2024
Pooria Namyar awarded the 2024 Google Fellowship in networking.
Sep 1, 2024
Weiwu Pang joins Google Cloud NetInfra. Congrats!
July 24, 2024
Three papers accepted at NSDI 2025 (Spring).
June 21, 2024
RECACP accepted at Mobicom 2024.
Jan 4, 2024
Four papers accepted at NSDI 2024.
Oct 1, 2023
Jane Yen joins Google Cloud NetInfra Team. Congrats!
Sep 27, 2023
AeroTraj accepted at IMWUT/UbiComp 2023.
July 1, 2023
Jianfeng Wang joins Oracle. Congrats!