I am a Ph.D. student in the ECE department at the University of Southern California. I am currently a member of the Networked Systems Lab and am very fortunate to be advised by Prof. Ramesh Govindan. I am broadly interested in computer networks and systems. My recent focus is on datacenter networks, wide-area networks, and network availability and reliability.
Before joining USC, I completed my B.Sc. in Electrical Engineering at the Sharif University of Technology in 2019. During my undergraduate studies, I worked on the cloudification of telecommunication networks by leveraging SDN and NFV.
Feel free to drop me an email if you have any questions or want to discuss new ideas!
Graduate Research Assistant (Aug 2019 - present)
University of Southern California, Los Angeles, CA
Research Intern (Sep 2023 - present)
Google, Sunnyvale, CA
Mentors: Devdeep Ray, Yuliang Li, KK Yap, Nandita Dukkipati
Research Intern & Student Researcher (Jun 2022 - Jan 2023)
Microsoft Research, Redmond, WA
Mentors: Behnaz Arzani, Ryan Beckett, Srikanth Kandula
Research Intern & Student Researcher (Jun 2021 - Jan 2022)
Microsoft Research, Redmond, WA
Mentors: Behnaz Arzani, Dan Crankshaw, Srikanth Kandula
Production systems use heuristics because they are faster or scale better than their optimal counterparts. Yet, practitioners are often unaware of the performance gap between a heuristic and the optimum or between two heuristics in realistic scenarios. We present MetaOpt, a system that helps analyze heuristics. Users specify the heuristic and the optimal (or another heuristic) as input, and MetaOpt automatically encodes these efficiently for a solver to find performance gaps and their corresponding adversarial inputs. Its suite of built-in optimizations helps it scale its analysis to practical problem sizes. To show it is versatile, we used MetaOpt to analyze heuristics from three domains (traffic engineering, vector bin packing, and packet scheduling). We found a production traffic engineering heuristic can require 30% more capacity than the optimal to satisfy realistic demands. Based on the patterns in the adversarial inputs MetaOpt produced, we modified the heuristic to reduce its performance gap by 12.5×. We examined adversarial inputs to a vector bin packing heuristic and proved a new lower bound on its performance.
We consider the max-min fair resource allocation problem. The best-known solutions use either a sequence of optimizations or waterfilling, which only applies to a narrow set of cases. These solutions have become a practical bottleneck in WAN traffic engineering and cluster scheduling, especially at larger problem sizes. We improve both approaches: (1) we show how to convert the optimization sequence into a single fast optimization, and (2) we generalize waterfilling to the multi-path case. We empirically show our new algorithms Pareto-dominate prior techniques: they produce faster, fairer, and more efficient allocations. Some of our allocators also have theoretical guarantees: they trade off a bounded amount of unfairness for faster allocation. We have deployed our allocators in Azure’s WAN traffic engineering pipeline, where we preserve solution quality and achieve a roughly 3× speedup.
Some faults in data center networks require hours to days to repair because they may need reboots, re-imaging, or manual work by technicians. To reduce traffic impact, cloud providers mitigate the effect of faults, for example, by steering traffic to alternate paths. The state of the art in automatic network mitigations uses simple safety checks and proxy metrics to determine mitigations. SWARM, the approach described in this paper, can pick orders of magnitude better mitigations by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity and, on failures observed at a large cloud provider, outperforms the state of the art by over 700× in some cases.
Oblivious routing distributes traffic from sources to destinations following predefined routes with rules independent of traffic demands. While finding optimal oblivious routing with a concave objective is intractable for general topologies, we show that it is tractable for structured topologies often used in datacenter networks. To achieve this, we apply graph automorphism and prove the existence of the optimal automorphism-invariant solution. This result reduces the search space to targeting the optimal automorphism-invariant solution. We design an iterative algorithm to obtain such a solution by alternating between convex optimization and a linear program. The convex optimization finds an automorphism-invariant solution based on representative variables and constraints, making the problem tractable. The linear program generates adversarial demands to ensure the final result satisfies all possible demands. Since constructing the representative variables and constraints is a combinatorial problem, we design polynomial-time algorithms for the construction. We evaluate the iterative algorithm in terms of throughput performance, scalability, and generality over three potential applications. The algorithm i) improves throughput by up to 87.5% for partially deployed FatTree and achieves up to 2.55× throughput gain for DRing over heuristic algorithms, ii) scales to the three considered topologies with a thousand switches, and iii) applies to a general structured topology with non-uniform link capacity and server distribution.
Production systems use heuristics because they are faster or scale better than the corresponding optimal algorithms. Yet, practitioners are often unaware of how much worse a heuristic’s solution may be than the optimum in realistic scenarios. Leveraging two-stage games and convex optimization, we present a provable framework that unveils settings where a given heuristic underperforms.
Oblivious routing distributes traffic from sources to destinations following predefined routes with rules independent of traffic demands. While finding optimal oblivious routing is intractable for general topologies, we show that it is tractable for structured topologies often used in datacenter networks. To achieve this, we apply graph automorphism and prove the existence of the optimal automorphism-invariant solution. This result reduces the search space to targeting the optimal automorphism-invariant solution. We design an iterative algorithm to obtain such a solution by alternating between two linear programs. The first program finds an automorphism-invariant solution based on representative variables and constraints, making the problem tractable. The second program generates adversarial demands to ensure the final result satisfies all possible demands. Since constructing the representative variables and constraints is a combinatorial problem, we design polynomial-time algorithms for the construction. We evaluate the proposed iterative algorithm in terms of throughput performance, scalability, and generality over three potential applications. The algorithm i) improves throughput by up to 87.5% over a heuristic algorithm for partially deployed FatTree, ii) scales for FatClique with a thousand switches, and iii) is applicable to a general structured topology with non-uniform link capacity and server distribution.
While prior work has explored many proposed datacenter designs, only two designs, Clos-based and expander-based, are generally considered practical because they can scale using commodity switching chips. Prior work has used two different metrics, bisection bandwidth and throughput, for evaluating these topologies at scale. Little is known, theoretically or practically, about how these metrics relate to each other. Exploiting characteristics of these topologies, we prove an upper bound on their throughput, then show that this upper bound better estimates worst-case throughput than all previously proposed throughput estimators and scales better than most of them. Using this upper bound, we show that for expander-based topologies, unlike Clos, beyond a certain size of the network, no topology can have full throughput, even if it has full bisection bandwidth; in fact, even relatively small expander-based topologies fail to achieve full throughput. We conclude by showing that using throughput to evaluate datacenter performance instead of bisection bandwidth can alter conclusions in prior work about datacenter cost, manageability, and reliability.