(she/her/hers)
Princeton University
graph applications, data supply, memory management
I am a fifth-year graduate student in Computer Science at Princeton University advised by Professor Margaret Martonosi. My broad area of research is Computer Architecture and more specifically, I am interested in optimizing data supply and memory management for graph applications. Throughout my PhD, I have been a contributor to the DECADES project, a full-stack system design that aims to improve the performance, power, and programmability of several emerging workflows in the broad areas of machine learning and graph analytics. This platform rapidly adapts to the increasingly flexible and blurred boundary between software and hardware through various reconfigurable hardware features, depending on application characteristics. I have developed hardware-software co-design and memory hierarchy approaches and am currently designing operating system techniques to address the memory latency bottlenecks of irregular workloads in the domains of graph and sparse applications.
Optimizing Data Supply and Memory Management for Graph Applications in Post-Moore Hardware-Software Systems
Graph structures naturally and efficiently capture relationships between entities, such as individuals in a social network, pages in the World Wide Web, and amino acids in protein molecules. Therefore, graph processing algorithms are commonly used in many important big-data applications. However, modern accelerator-oriented, heterogeneous system designs are inefficient for graph applications because they are characterized by irregular memory access patterns that confound sophisticated instruction scheduling and caching mechanisms as well as traditional prefetching and speculation techniques. As a result, these applications are severely bottlenecked by several long-latency memory accesses to off-chip DRAM and address translation overheads. My thesis aims to tackle these bottlenecks with three main thrusts spanning the hardware-software stack.
First, I propose a hardware-software co-design composed of specialized access tracking buffers and precise compiler-driven program slicing techniques. This design enables significant memory-level parallelism and performance gains by issuing and overlapping multiple long-latency memory accesses to amortize their cost. This approach outperforms out-of-order cores, do-all parallelism, prefetching, and prior decoupling approaches, achieving a 2.87x speedup and 8.61x gain in energy efficiency across a range of graph applications. Second, I present domain-specific cache management techniques that leverage tailored fetch, insertion, and replacement policies to alleviate memory bandwidth pressure and minimize off-chip DRAM accesses. This work yields up to 1.79x (geomean 1.3x) speedup over state-of-the-art cache management policies. Lastly, I describe application-specific operating system (OS) techniques to efficiently manage page sizes and mitigate address translation overheads. These mechanisms offer 77.3-96.3% the performance of ideal huge page usage, while requiring only 0.58-2.92% of the huge page resources that modern operating systems, namely Linux, require.
Overall, my work demonstrates the need for architects to revisit and equip simple hardware with specialized data supply technology to accelerate memory-bound applications when Dennard scaling no longer helps performance and energy savings are critical. Furthermore, as the end of Moore's Law has pushed computer architects to employ hardware accelerators in order to reach aggressive power and performance targets for compute-bound workloads, domain-specific cache hierarchy approaches and OS techniques for memory-bound workloads are becoming increasingly necessary. Together, the three proposals in my dissertation work attack the memory bottlenecks in graph applications from different perspectives of the hardware-software stack to enable optimal performance.