Lecture: Enabling High-performance Sampling for Big Data Processing

Posted: 2020-12-21

Subject: Enabling High-performance Sampling for Big Data Processing
Lecturer: Prof. Wang Jun, University of Central Florida
Time: 5:00 p.m.-6:30 p.m., Oct. 26, 2019
Place: Room 235, CIE

Abstract:
In this talk, we aim to demonstrate how to perform sampling on today’s big data processing platforms. We enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Because caching offline samples for every sub-dataset incurs prohibitive storage overhead, existing offline-sample-based systems provide highly accurate results for only a limited number of sub-datasets, such as the most popular ones. Current online-sample-based approximation systems, on the other hand, generate samples at runtime but do not take into account the uneven storage distribution of a sub-dataset. They work well when a sub-dataset is uniformly distributed but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed sub-datasets.
To address this problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the number of occurrences of a sub-dataset in each logical partition of a dataset (its storage distribution) in the distributed system, and to use this information to facilitate online sampling. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20x over precise execution.
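
To make the idea concrete, the following is a minimal sketch of distribution-aware sampling. It is not Sapprox's actual implementation. It assumes a hypothetical in-memory layout in which each partition is a list of (key, value) records, and it swaps in a standard technique: partitions are drawn with probability proportional to their occurrence counts, and the draws are combined with a Hansen-Hurwitz estimator. The names estimate_total, partitions, and occurrences are illustrative only.

```python
import random


def estimate_total(partitions, occurrences, key, n_draws, rng):
    """Estimate the sum of values for `key` over all partitions.

    partitions  : list of partitions, each a list of (key, value) records
                  (hypothetical layout, assumed for illustration)
    occurrences : occurrences[i] is the number of records with `key`
                  in partition i (the "storage distribution")
    n_draws     : number of partitions to sample, with replacement
    rng         : a random.Random instance
    """
    total_occ = sum(occurrences)
    if total_occ == 0:
        return 0.0  # the sub-dataset does not occur anywhere

    # Draw probability of a partition is proportional to how many
    # matching records it holds, so empty partitions are never read.
    probs = [count / total_occ for count in occurrences]
    drawn = rng.choices(range(len(partitions)), weights=occurrences, k=n_draws)

    # Hansen-Hurwitz estimator: average of (partition sum / draw probability).
    scaled = []
    for i in drawn:
        part_sum = sum(value for k, value in partitions[i] if k == key)
        scaled.append(part_sum / probs[i])
    return sum(scaled) / n_draws


# Usage: sub-dataset "a" is spread unevenly over three partitions.
partitions = [
    [("a", 1), ("b", 1)],
    [("b", 1)],
    [("a", 1), ("a", 1), ("a", 1)],
]
occurrences = [1, 0, 3]  # occurrences of "a" per partition
estimate = estimate_total(partitions, occurrences, "a", 5, random.Random(0))
print(estimate)
```

In this sketch the second partition contains no records for "a", so it has zero weight and is never scanned. That is the intuition behind using storage distribution to avoid wasted reads on unevenly distributed sub-datasets.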

Biography:
Dr. Jun Wang is a Full Professor of Computer Engineering and Director of the Computer Architecture and Storage Systems (CASS) Laboratory at the University of Central Florida, Orlando, FL, USA. He has authored over 120 publications in premier journals such as IEEE Transactions on Computers and IEEE Transactions on Parallel and Distributed Systems, and in leading HPC and systems conferences such as VLDB, HPDC, EuroSys, IPDPS, ICS, Middleware, and FAST. He has conducted extensive research in the areas of computer systems and high-performance computing. His specific research interests include massive storage and file systems in local, distributed, and parallel system environments. His group has secured multi-million-dollar federal research funding in the last five years. At present, his group is working on three US National Science Foundation projects, one DARPA project, and one NASA project. He has graduated 13 Ph.D. students, who upon graduation were employed by major US IT corporations (e.g., Google and Microsoft). In 2019, he won the IEEE Transactions on Cloud Computing Editorial Excellence and Eminence (EEE) award. He serves on the editorial boards of IEEE Transactions on Parallel and Distributed Systems and IEEE Transactions on Cloud Computing. He served as a general executive chair for IEEE DASC/DataCom/PIcom/CyberSciTech 2017 and has co-chaired technical programs at numerous computer systems conferences, including the 2018 IEEE International Conference on High Performance Computing and Communications (HPCC18), the 10th IEEE International Conference on Networking, Architecture, and Storage (NAS 2015), and the 1st International Workshop on Storage and I/O Virtualization, Performance, Energy, Evaluation and Dependability (SPEED 2008), held together with HPCA.
