Tuesday, June 21 • 4:50pm - 5:15pm
On The [Ir]relevance of Network Performance for Data Processing

Modern data processing frameworks are used in a variety of settings for a diverse set of workloads such as sorting, indexing, iterative computations, and structured query processing. As these frameworks run in a distributed environment, a natural question is: how important is the network to their performance? Recent research in this field has led to contradictory results. One camp argues that networking performance has only a limited impact on the overall performance of the framework. On the other hand, there is a large body of work on networking optimizations for data processing frameworks.

In this paper, we search for a better understanding of the matter. While answering the basic question concerning the importance of network performance, our analysis raises new questions and points to previously unexplored or unnoticed avenues for performance optimization. We take Apache Spark as a representative of a modern data processing framework. However, to broaden the scope of our investigation, we also experiment with other frameworks such as Flink, PowerGraph, and Timely. In our study, rather than analysing Spark-specific peculiarities, we look into procedures and subsystems that are common to all of these frameworks, such as networking I/O, shuffle data management, object (de)serialization, copies, and job scheduling and coordination. Nonetheless, we are aware that the roles of those individual components differ across the various systems, and we exercise caution when making generalized statements about performance.

Tuesday June 21, 2016 4:50pm - 5:15pm
Denver Marriott City Center 1701 California Street, Denver, CO 80202