An increasing number of business applications naturally model data as graphs to capture complex relationships and dynamic interactions among entities. Processing such big graphs typically involves a variety of graph computations over billions of vertices and edges, which imposes many new challenges on users and developers. In this talk, I will first give an overview of the landscape of big graph processing and then present the design and implementation of GraphScope, a unified, open-source engine for big graph processing from Alibaba and Ant Group. GraphScope provides a unified, high-level programming interface to a wide range of graph computations, including graph traversal, pattern matching, iterative algorithms, and graph neural networks, and supports parallel and distributed execution of sophisticated graph analyses on a cluster of machines. In addition, it provides a seamless integration of a highly optimized graph engine into a general-purpose data-parallel computing system. Finally, I will outline some challenges and future research directions for complex graph computations.
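As a concrete illustration of this unified interface, the following minimal sketch mixes an iterative algorithm with an interactive Gremlin traversal through GraphScope's Python client. It follows the project's published quickstart examples, but the dataset loader, session options, and algorithm signatures should be treated as assumptions that may vary across releases.

```python
import graphscope
from graphscope.dataset import load_ogbn_mag  # sample dataset loader from the quickstart

# Start a session (here on local hosts; a Kubernetes cluster works the same way)
# and load a property graph of papers, authors, and citations.
sess = graphscope.session(cluster_type="hosts")
g = load_ogbn_mag(sess)

# Iterative analytics: project to a simple paper-cites-paper graph and run PageRank.
simple_g = g.project(vertices={"paper": []}, edges={"cites": []})
pr = graphscope.pagerank(simple_g, delta=0.85, max_round=10)

# Interactive traversal: the same graph served through a Gremlin endpoint.
interactive = sess.gremlin(g)
num_papers = interactive.execute("g.V().hasLabel('paper').count()").one()

sess.close()
```

The point of the sketch is that both workloads run against one graph abstraction within one session, which is the unification the talk describes.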
Jingren Zhou is Senior Vice President at Alibaba and Ant Group. He is responsible for driving data intelligence infrastructure and many key data-driven businesses at Alibaba and Ant Group. In particular, he has led work to develop advanced techniques for personalized search, product recommendation, and advertising on Alibaba's e-commerce platform and Alipay, a leading online payment and digital lifestyle platform. He also manages the data analytics and intelligence research lab at Alibaba DAMO Academy. His research interests include cloud computing, databases, and large-scale machine learning. He received his PhD in Computer Science from Columbia University. He is a Fellow of the IEEE.
High performance computing (HPC) is undergoing many changes at the system level. While scientific applications can reach petaflops or more in computing performance, potentially generating data at higher rates and checkpointing more frequently, moving that data to the parallel file system remains costly because of the constraints HPC centers impose on IO bandwidth. In other words, the bandwidth to the file system is outpaced by the rate of data generation; the resulting IO contention increases job runtime and delays execution. The situation is aggravated by the fact that when users submit their jobs to an HPC system, they rely on resource managers and job schedulers to monitor and manage the computing resources (i.e., nodes), yet both resource managers and job schedulers remain blind to the impact of IO contention on overall simulation performance.
In this talk we discuss how Artificial Intelligence (AI) can augment HPC systems to prevent and mitigate IO contention under IO bandwidth constraints. Our solution, called Analytics for IO (AI4IO), is a suite of AI-based tools that enable IO-awareness on HPC systems. Specifically, we present two AI4IO tools: PRIONN and CanarIO. PRIONN automates predictions of user-submitted job resource usage, including per-job IO bandwidth; CanarIO detects, in real time, the presence of IO contention on HPC systems and predicts which jobs are affected by that contention (e.g., because of their frequent checkpointing). Working in concert, PRIONN and CanarIO supply the a priori knowledge necessary to prevent and mitigate IO contention with IO-aware scheduling. We integrate AI4IO into the Flux scheduler and show that it improves simulation performance: we observe up to 6.2% improvement in the makespan of HPC job workloads, which amounts to more than 18,000 node-hours saved per week on a production-size cluster. Our work is a first step toward implementing IO-aware scheduling on production HPC systems.
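To make the idea of IO-aware scheduling concrete, here is a toy sketch of the kind of admission policy such predictions enable. The Job class, bandwidth budget, and greedy policy are hypothetical illustrations, not AI4IO's actual implementation; the predicted bandwidth stands in for a PRIONN-style per-job estimate.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    predicted_io_bw: float  # GB/s, standing in for a PRIONN-style prediction

FS_BW_BUDGET = 100.0  # GB/s, assumed parallel-file-system bandwidth cap

def admit(queued: list[Job], running_bw: float) -> list[Job]:
    """Admit queued jobs while the predicted total IO stays under budget."""
    admitted = []
    for job in sorted(queued, key=lambda j: j.predicted_io_bw):
        if running_bw + job.predicted_io_bw <= FS_BW_BUDGET:
            admitted.append(job)
            running_bw += job.predicted_io_bw
    return admitted

# Example: with 70 GB/s already in flight, only the lighter job fits.
print([j.name for j in admit([Job("sim-A", 40.0), Job("sim-B", 20.0)], 70.0)])
```

In AI4IO itself the predictions also feed real-time contention detection (CanarIO) and the Flux scheduler's placement decisions; the sketch only shows the bandwidth-budget intuition.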
Michela Taufer is an ACM Distinguished Scientist and holds the Jack Dongarra Professorship in High Performance Computing in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville (UTK). She earned her undergraduate degrees in Computer Engineering from the University of Padova (Italy) and her doctoral degree in Computer Science from the Swiss Federal Institute of Technology (ETH) in Switzerland. From 2003 to 2004 she was a La Jolla Interfaces in Science Training Program (LJIS) Postdoctoral Fellow at the University of California San Diego (UCSD) and The Scripps Research Institute (TSRI), where she worked on interdisciplinary projects in computer systems and computational chemistry.
Michela has a long history of interdisciplinary work with scientists. Her research interests include scientific applications on heterogeneous platforms (e.g., multi-core platforms and accelerators); performance analysis, modeling, and optimization; Artificial Intelligence (AI) for cyberinfrastructures (CI); and the integration of AI into scientific workflows, computer simulations, and data analytics. She serves as the principal investigator of several NSF collaborative projects and has significant experience mentoring a diverse population of students in interdisciplinary research. Her training efforts include broadening participation in high-performance computing in undergraduate education and research, as well as increasing the interest and participation of diverse populations in interdisciplinary studies.
Today’s high-performance applications demand microsecond-scale tail latencies and high request rates from operating systems, and most applications must handle loads with high variance over multiple timescales. Achieving these goals efficiently, on machines whose resources are shared among multiple applications, is a difficult problem for today’s operating systems. Their primary shortcoming is that they adjust resource allocations too slowly, over tens or hundreds of milliseconds. As a result, they are vulnerable to extreme spikes in tail latency when resource demand shifts while, at the same time, they leave resources underutilized.
In this talk, I will present Caladan, a new approach to CPU scheduling that achieves significantly better tail latency, throughput, and resource utilization through a collection of control signals and policies built on faster core allocation. Caladan consists of a centralized scheduler core that actively manages resource contention in the memory hierarchy and between hyperthreads, and a kernel module that bypasses the standard Linux kernel scheduler to support microsecond-scale monitoring and placement of tasks. Caladan maintains nearly perfect performance isolation between colocated applications with phased behaviors, and it lays the groundwork for applying faster scheduling techniques to a broader range of systems problems.
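The core idea, a dedicated scheduler polling congestion signals and reallocating cores on microsecond timescales, can be sketched in a few lines. The Task class, delay signal, and thresholds below are hypothetical stand-ins for Caladan's actual shared-memory signals and kernel-module mechanism, and Python is used only for readability.

```python
import random

DELAY_TARGET_US = 10.0   # assumed target: grant a core when queueing delay exceeds this
CONTROL_ROUNDS = 5       # bounded loop so the sketch terminates

class Task:
    def __init__(self, name, guaranteed_cores):
        self.name = name
        self.cores = guaranteed_cores
        self.guaranteed_cores = guaranteed_cores

    def queueing_delay_us(self):
        # Stand-in for a congestion signal the scheduler core would read cheaply.
        return random.uniform(0.0, 20.0) / self.cores

def control_loop(tasks, free_cores):
    """Toy version of a centralized scheduler core: poll each task's
    congestion signal every control interval and grant or revoke cores."""
    for _ in range(CONTROL_ROUNDS):
        for task in tasks:
            delay = task.queueing_delay_us()
            if delay > DELAY_TARGET_US and free_cores > 0:
                task.cores += 1          # grant: the task is congested
                free_cores -= 1
            elif delay < 1.0 and task.cores > task.guaranteed_cores:
                task.cores -= 1          # revoke: a core is going idle
                free_cores += 1
        print({t.name: t.cores for t in tasks}, "free:", free_cores)

control_loop([Task("memcached", 2), Task("batch", 1)], free_cores=4)
```

In Caladan itself, the signals are read from shared memory and core grants go through the kernel module rather than a Python loop, which is what makes microsecond-scale reactions feasible.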
Adam Belay is an Assistant Professor of Computer Science at the Massachusetts Institute of Technology, where he works on operating systems, networking, and systems security. During his Ph.D. at Stanford, he developed Dune, a system that safely exposes privileged CPU instructions to userspace; and IX, a dataplane operating system that significantly accelerates I/O performance. Dr. Belay’s current research interests lie in developing systems that cut across hardware and software layers to increase datacenter efficiency and performance. He is a member of the Parallel and Distributed Operating Systems Group, and a recipient of a Google Faculty Award, a Facebook Research Award, and the Stanford Graduate Fellowship. http://abelay.me