Cluster computing frameworks such as Apache Hadoop and Apache Spark are commonly used to analyze large data sets. The analysis often involves running multiple, similar queries on the same data sets. This data reuse should improve query performance, but we find that these frameworks schedule query tasks independently of each other and are thus unable to exploit the data sharing across these tasks. We present Quartet, a system that leverages information on cached data to schedule together tasks that share data. Our preliminary results are promising, showing that Quartet can increase the cache hit rate of Hadoop and Spark jobs by up to 54%. Our results suggest a shift in the way we think about job and task scheduling today, as Quartet is expected to perform better as more jobs are dispatched on the same data.
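The core idea can be shown in a toy sketch (all names here are illustrative, not the actual Quartet API): instead of placing each task independently, route tasks that read the same input block to the node that already caches that block.

```python
# Toy sketch of cache-aware task placement in the spirit of Quartet.
# cache_map is hypothetical bookkeeping: which node caches which block.

def schedule(tasks, cache_map, default_node="n0"):
    """tasks: list of (task_id, input_block); cache_map: block -> node.
    Returns a placement that co-locates tasks sharing cached data."""
    placement = {}
    for task_id, block in tasks:
        # Prefer the node already caching the block; otherwise fall back.
        placement[task_id] = cache_map.get(block, default_node)
    return placement

tasks = [("q1-t1", "blockA"), ("q2-t1", "blockA"), ("q2-t2", "blockB")]
cache_map = {"blockA": "n3"}
print(schedule(tasks, cache_map))
# Both tasks reading blockA land on n3, so the second one hits the cache.
```

A real scheduler would also weigh load balance and locality tiers; this only illustrates the data-sharing signal the abstract describes.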
Monday June 20, 2016 9:00am - 9:25am MDT
Denver Marriott City Center, 1701 California Street, Denver, CO 80202
Cloud providers have begun to allow users to bid for surplus servers on a spot market. These servers are allocated if a user’s bid price is higher than their market price and revoked otherwise. Thus, analyzing price data to derive optimal bidding strategies has become a popular research topic. In this paper, we argue that sophisticated bidding strategies, in practice, do not provide any advantages over simple strategies for multiple reasons. First, due to price characteristics, there are a wide range of bid prices that yield the optimal cost and availability. Second, given the large number of spot markets, there is always a market with available surplus resources. Thus, if resources become unavailable due to a price spike, users need not wait until the spike subsides, but can instead provision a new spot resource elsewhere and migrate to it. Third, current spot market rules enable users to place maximum bids for resources without any penalty. Given bidding’s irrelevance, users can adopt trivial bidding strategies and focus instead on modifying applications to efficiently seek out and migrate to the lowest cost resources.
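The third point suggests a near-trivial client-side strategy: bid the maximum, and when the current market's price spikes past it, migrate to the cheapest market still under the ceiling. A minimal sketch with invented market names and prices (not any real cloud API):

```python
# Hedged sketch of the paper's "trivial bidding" argument: place a
# maximum bid, and on a price spike provision in the cheapest other
# market rather than waiting for the spike to subside.

def provision(markets, max_bid):
    """markets: dict name -> current spot price. Returns (market, price)
    for the cheapest market whose price is under the bid ceiling."""
    affordable = {m: p for m, p in markets.items() if p <= max_bid}
    if not affordable:
        raise RuntimeError("no market under the bid ceiling")
    name = min(affordable, key=affordable.get)
    return name, affordable[name]

markets = {"us-east-1a": 0.35, "us-east-1b": 0.11, "us-west-2a": 0.18}
print(provision(markets, max_bid=1.0))   # -> ('us-east-1b', 0.11)
```

The application-side work the paper points to is then migration (checkpointing state to the newly provisioned resource), not bid tuning.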
Monday June 20, 2016 9:15am - 9:40am MDT
We propose a class of query, called a derange query, that maps a function over a set of records and lazily aggregates the results. Derange queries defer work until it is either convenient or necessary, and, as a result, can reduce total I/O costs of the system.
Derange queries operate on a view of the data that is consistent with the point in time that they are issued, regardless of when the computation completes. They are most useful for performing calculations where the results are not needed until some future deadline. When necessary, derange queries can also execute immediately. Users can view partial results of in-progress queries at low cost.
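The behavior described above can be sketched in miniature (class and method names are hypothetical, not the paper's interface): snapshot the records at issue time, do work lazily in small steps, expose cheap partial results, and force completion at the deadline.

```python
# Minimal sketch of a derange query: snapshot-consistent, lazily
# evaluated map-and-aggregate with inspectable partial results.

class DerangeQuery:
    def __init__(self, records, fn, combine, init):
        self._snapshot = list(records)   # view consistent with issue time
        self._fn, self._combine = fn, combine
        self._acc, self._pos = init, 0

    def step(self, n=1):
        """Do a little work now (e.g. when I/O is otherwise idle)."""
        for rec in self._snapshot[self._pos:self._pos + n]:
            self._acc = self._combine(self._acc, self._fn(rec))
        self._pos = min(self._pos + n, len(self._snapshot))

    def partial(self):
        return self._acc                 # cheap peek at in-progress result

    def result(self):
        self.step(len(self._snapshot) - self._pos)   # deadline: finish now
        return self._acc

records = [1, 2, 3, 4]
q = DerangeQuery(records, fn=lambda r: r * r, combine=lambda a, b: a + b, init=0)
records.append(5)         # writes after issue time do not affect the query
q.step(2)
print(q.partial())        # partial sum of squares: 1 + 4 = 5
print(q.result())         # 1 + 4 + 9 + 16 = 30
```

A real system would defer I/O rather than compute, but the deferral and snapshot semantics are the same shape.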
Monday June 20, 2016 9:25am - 9:50am MDT
Cloud services today are increasingly built using functionality from other running services. In this paper, we question whether legacy Quality of Service (QoS) metrics and enforcement techniques are sufficient, as they are producer-centric. We argue that, similar to customer rating systems found in banking systems and many sharing-economy apps (e.g., Uber and Airbnb), Quality of Consumption (QoC) should be introduced to capture different metrics about service consumers. We show how the combination of QoS and QoC, dubbed QoX, can be used by consumers and providers to improve the security and management of their infrastructure. In addition, we demonstrate how sharing information among other consumers and providers increases the value of QoX. To address the main challenge with sharing information, namely Sybil attacks and misinformation, we describe how we can leverage cloud providers as vouching authorities to ensure the integrity of information. We explore the motivations, challenges, and potential of introducing such a framework in the cloud environment.
Monday June 20, 2016 9:40am - 10:05am MDT
The key to successful deployment of big data solutions lies in the timely distillation of meaningful information. This is made difficult by the mismatch between volume and velocity of data at scale and challenges posed by disparate speeds of IO, CPU, memory and communication links of data storage and processing systems. Instead of viewing storage as a bottleneck in this pipeline, we believe that storage systems are best positioned to discover and exploit intrinsic data properties to enhance information density of stored data. This has the potential to reduce the amount of new information that needs to be processed by an analytics workflow. Towards exploring this possibility, we propose SEeSAW, a Similarity Exploiting Storage for Accelerating Analytics Workflows that makes similarity a fundamental storage primitive. We show that SEeSAW transparently eliminates the need for applications to process uninformative data, thereby incurring substantially lower costs on IO, memory, computation and communication while speeding up (about 97% as observed in our experiment) the rate at which actionable outcomes can be derived by analyzing data. By increasing capacity of analytics workloads to absorb more data within the same resource envelope, SEeSAW can open up rich opportunities to reap greater benefits from machine and human generated data accumulated from various sources.
Monday June 20, 2016 9:50am - 10:15am MDT
Computational spot markets enable users to bid on servers and then continuously allocate them to the highest bidder: if a user is outbid for a server, the market revokes it and re-allocates it to the new highest bidder. Spot markets are common when trading commodities to balance real-time supply and demand—cloud platforms use them to sell their idle capacity, which varies over time. However, server-time differs from other commodities in that it is “stateful”: losing a spot server incurs an overhead that decreases the useful work it performs. Thus, variations in the spot price actually affect the inherent value of server-time bought in the spot market. As the spot market matures, we argue that price volatility will significantly decrease the value of spot servers. Thus, somewhat counter-intuitively, spot markets may not maximize the value of idle server capacity. To address the problem, we propose a more sustainable alternative that offers a variable amount of idle capacity to users for a fixed price, but with transient guarantees.
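The "stateful" argument reduces to simple arithmetic: each revocation costs a fixed overhead (checkpoint, restore, lost in-flight work), so useful work per unit of server-time falls as revocations become more frequent. A back-of-envelope sketch with invented numbers (not figures from the paper):

```python
# Why volatility lowers spot value: useful fraction of purchased
# server-time as a function of revocation frequency and per-revocation
# overhead. All numbers below are illustrative.

def useful_fraction(hours, revocations, overhead_hours):
    lost = revocations * overhead_hours
    return max(0.0, (hours - lost) / hours)

print(useful_fraction(100, revocations=2, overhead_hours=1))   # 0.98
print(useful_fraction(100, revocations=20, overhead_hours=1))  # 0.8
```

Under this model, a volatile market that revokes often can be worth less per hour than a pricier but stable one, which is the intuition behind the fixed-price, transient-guarantee alternative.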
Monday June 20, 2016 10:05am - 10:30am MDT
In-memory analytics frameworks such as Apache Spark are rapidly gaining popularity as they provide order-of-magnitude performance speedups over disk-based systems for iterative workloads. For example, Spark uses the Resilient Distributed Dataset (RDD) abstraction to cache data in memory and iteratively compute on it in a distributed cluster.
In this paper, we make the case that existing abstractions such as RDD are coarse-grained and only allow discrete cache levels to be used for caching data. This results in inefficient memory utilization and lower-than-optimal performance. In addition, relying on the programmer to enforce caching decisions for an RDD makes it infeasible for the system to adapt to runtime changes. To overcome these challenges, we propose Neutrino, which employs fine-grained memory caching of RDD partitions and adapts the use of different in-memory cache levels to the runtime characteristics of the cluster. First, it extracts a data flow graph to capture the data access dependencies between RDDs across different stages of a Spark application without relying on cache enforcement decisions from the programmer. Second, it uses a dynamic-programming-based algorithm to guide caching decisions across the cluster and adaptively convert or discard the RDD partitions from the different cache levels.
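Neutrino's actual algorithm is more elaborate, but the flavor of a dynamic-programming caching decision can be sketched as a 0/1 knapsack: choose which RDD partitions to keep in memory, given each partition's size and the recomputation cost it saves, under a memory budget. All names and numbers below are illustrative, not from the paper.

```python
# Knapsack-style sketch of a fine-grained caching decision: maximize
# total recomputation cost avoided, subject to a memory capacity.

def choose_partitions(parts, capacity):
    """parts: list of (name, size, benefit); capacity: memory units.
    Returns the set of partitions worth keeping in memory."""
    best = {0: (0, [])}                    # memory used -> (benefit, chosen)
    for name, size, benefit in parts:
        for used, (b, chosen) in list(best.items()):
            u = used + size
            if u <= capacity and (u not in best or best[u][0] < b + benefit):
                best[u] = (b + benefit, chosen + [name])
    return max(best.values())[1]

parts = [("p0", 2, 10), ("p1", 3, 14), ("p2", 4, 16)]
print(choose_partitions(parts, capacity=5))   # ['p0', 'p1']
```

The real system also decides *which* cache level each partition occupies and revisits decisions at runtime; this sketch only shows the optimize-under-budget core.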
We have implemented a prototype of Neutrino as an extension to Spark and use four different machine-learning workloads for performance evaluation. Neutrino improves the average job execution time by up to 70% over the use of Spark’s native memory cache levels.
Monday June 20, 2016 10:15am - 10:40am MDT
An abundance of data in many disciplines has accelerated the adoption of distributed technologies such as Hadoop and Spark, which provide simple programming semantics and an active ecosystem. However, the current cloud computing model lacks the kinds of expressive and interactive debugging features found in traditional desktop computing. We seek to address these challenges with the development of BigDebug, a framework providing interactive debugging primitives and tool-assisted fault localization services for big data analytics. We showcase the data provenance and optimized incremental computation features to effectively and efficiently support interactive debugging, and investigate new research directions on how to automatically pinpoint and repair the root cause of errors in large-scale distributed data processing.
Monday June 20, 2016 11:15am - 11:40am MDT
Microsoft’s Pelican storage rack uses a new class of hard disk drive (HDD), known by vendors as archival class HDD. These HDDs are explicitly designed to store cooler and archival data, differing from existing HDDs by trading performance for cost. Our early Pelican experiences have helped some vendors define the particular characteristics of this class of drive. During the last twelve or so months we have gained considerable data on how these drives perform in Pelicans, and in this paper we present data gathered from a test and a production environment. A key design choice for Pelican was to have only a small fraction of the HDDs concurrently spun up, making Pelican a harsh environment in which to operate an HDD. We present data showing how the drives have been used, their power profile, and their annualized failure rate (AFR), and conclude by discussing some issues for the future of these archive HDDs. As flash capacities increase, eventually all HDDs will be archive class, so understanding their characteristics is of wide interest.
Monday June 20, 2016 11:15am - 11:40am MDT
We present Ovid, a framework for building evolvable large-scale distributed systems that run in the cloud. Ovid constructs and deploys distributed systems as a collection of simple components, creating systems suited for containerization in the cloud. Ovid supports evolution of systems through transformations, which are automated refinements. Examples of transformations include replication, batching, sharding, and encryption. Ovid transformations guarantee that an evolving system still implements the same specification. Moreover, systems built with transformations can be combined with other systems to implement more complex infrastructure services. The result of this framework is a software-defined distributed system, in which a logically centralized controller specifies the components, their interactions, and their transformations.
Monday June 20, 2016 11:40am - 12:05pm MDT
Digital data is projected to double every two years, creating the need for cost-effective and performant storage media [4]. Hard disk drives (HDDs) are a cost-effective storage medium that sits between speedy yet costly flash-based storage and cheap but slower media such as tape drives. However, virtually all HDDs today use a technology called perpendicular magnetic recording, and the density achieved with this technology is reaching scalability limits due to physical properties of the technology [17]. While new technologies such as shingled magnetic recording (SMR) that further increase areal density are slated to enter the market [6], existing systems software is not prepared to fully utilize these devices because of the unique I/O constraints that they introduce.
SMR requires systems software to conform to the shingling constraint. The shingling constraint is an I/O ordering constraint imposed at the device level, and requires that writes be sequential and contiguous within a subset of the disk, called a zone. Thus, software that requires random block updates must use a scheme to serialize writes to the drive. This scheme can be handled internally by the drive; alternatively, the zone abstraction and shingling constraint can be exposed to the host operating system. Host-level solutions are challenging because the shingling constraint is not compatible with software that assumes a random-write block device model, which has been in use for decades. The shingling constraint influences all layers of the I/O stack, and each layer must be made SMR compliant.
In order to manage the shingling write constraint of SMR HDDs, we have designed a zone-based extent allocator that maps ZEA logical blocks (ZBA) to LBAs of the HDD. Figure 1a depicts how ZEA is mapped onto a SMR HDD comprised of multiple types of zones, which are described in Table 1. ZEA writes logical extents, comprised of data and metadata, sequentially onto the SMR zone maintaining the shingling constraint.
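The allocator's core behavior can be sketched in a few lines (simplified; the real ZEA also persists per-extent metadata on disk): logical extents are appended strictly sequentially within a zone, so the shingling constraint is satisfied by construction, and a map records where each ZEA logical block landed.

```python
# Simplified sketch of a zone-based extent allocator in the spirit of
# ZEA: random logical block addresses (ZBAs) are written as extents at
# a forward-only write pointer, yielding purely sequential LBAs.

class ZoneAllocator:
    def __init__(self, zone_start_lba, zone_size):
        self.start, self.size = zone_start_lba, zone_size
        self.write_ptr = zone_start_lba      # next LBA; only moves forward
        self.zba_map = {}                    # logical block -> physical LBA

    def append_extent(self, zba, num_blocks):
        if self.write_ptr + num_blocks > self.start + self.size:
            raise IOError("zone full; allocate from the next zone")
        lba = self.write_ptr
        for i in range(num_blocks):          # writes are sequential by design
            self.zba_map[zba + i] = lba + i
        self.write_ptr += num_blocks
        return lba

z = ZoneAllocator(zone_start_lba=1000, zone_size=256)
z.append_extent(zba=0, num_blocks=4)
z.append_extent(zba=100, num_blocks=2)       # random ZBAs, sequential LBAs
print(z.zba_map[100])                        # -> 1004
```

Updating an already-written ZBA would append a fresh extent and remap it, which is why garbage collection of stale extents becomes the follow-on problem in such designs.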
Monday June 20, 2016 11:40am - 12:05pm MDT
We present OpenLambda, a new, open-source platform for building next-generation web services and applications in the burgeoning model of serverless computation. We describe the key aspects of serverless computation, and present numerous research challenges that must be addressed in the design and implementation of such systems. We also include a brief study of current web applications, so as to better motivate some aspects of serverless application construction.
Monday June 20, 2016 12:05pm - 12:30pm MDT
Shingled Magnetic Recording (SMR) technology increases the areal density of hard disk drives. Among the three types of SMR drives on the market today, Host Aware SMR (HA-SMR) drives look the most promising. In this paper, we carry out an evaluation to understand the performance of HA-SMR drives, with the objective of building large-scale storage systems using this type of drive. We focus on evaluating the special features of HA-SMR drives, such as the open-zone issue and media cache cleaning efficiency. Based on our observations, we propose a novel host-controlled indirection buffer to enhance the drive’s I/O performance. Finally, we present a case study of the open-zone issue to show the potential of this host-controlled indirection buffer for HA-SMR drives.
Monday June 20, 2016 12:05pm - 12:30pm MDT
The ability to move data quickly between the nodes of a distributed system is important for the performance of cluster computing frameworks such as Hadoop and Spark. We show that in a cluster with modern networking technology, data serialization is the main bottleneck and source of overhead in the transfer of rich data in systems based on high-level programming languages such as Java. We propose a new data transfer mechanism that avoids serialization altogether by using a shared cluster-wide address space to store data. The design and a prototype implementation of this approach are described. We show that our mechanism is significantly faster than serialized data transfer, and propose a number of possible applications for it.
Monday June 20, 2016 2:00pm - 2:25pm MDT
Storage systems are designed and optimized relying on wisdom derived from analysis studies of file-system and block-level workloads. However, while SSDs are becoming a dominant building block in many storage systems, their design continues to build on knowledge derived from analysis targeted at hard disk optimization. Though still valuable, it does not cover important aspects relevant for SSD performance. In a sense, we are “searching under the streetlight”, possibly missing important opportunities for optimizing storage system design.
We present the first I/O workload analysis designed with SSDs in mind. We characterize traces from four repositories and examine their ‘temperature’ ranges, sensitivity to page size, and ‘logical locality’. Our initial results reveal nontrivial aspects that can significantly influence the design and performance of SSD-based systems.
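The kind of 'temperature' characterization described above can be sketched with a toy metric (our own illustrative definition, not the paper's exact one): bucket logical byte offsets into pages at a chosen page size and count accesses per page.

```python
# Toy temperature analysis of a block trace: access counts per logical
# page at a given page size. Trace and thresholds are invented.

from collections import Counter

def temperature(trace, page_size):
    """trace: list of byte offsets accessed; returns page -> access count."""
    return Counter(off // page_size for off in trace)

trace = [0, 100, 4096, 4100, 4200, 8192]
temps = temperature(trace, page_size=4096)
hot = [p for p, c in temps.items() if c >= 3]
print(temps)   # page 1 accessed 3 times, page 0 twice, page 2 once
print(hot)     # page 1 is the hottest
```

Re-running the same analysis at several page sizes shows the page-size sensitivity the abstract mentions: offsets that share a page at 16 KB may spread across pages at 4 KB, changing which pages look hot.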
Monday June 20, 2016 2:00pm - 2:25pm MDT
Multicast has long been a performance bottleneck for data centers. Traditional solutions relying on IP multicast suffer from poor congestion control and loss recovery on the data plane, as well as slow and complex group membership and multicast tree management on the control plane. Some recent proposals have employed alternate optical circuit switched paths to enable lossless multicast and a centralized control architecture to quickly configure multicast trees. However, the high circuit reconfiguration delay of optical switches has substantially limited multicast performance.
In this paper, we propose to eliminate this reconfiguration delay by an unconventional optical multicast architecture called HyperOptics that directly interconnects top of rack switches by low cost optical splitters, thereby eliminating the need for optical switches. The ToRs are organized to form the connectivity of a regular graph. We analytically show that this architecture is scalable and efficient for multicasts. Preliminary simulations show that running multicasts on HyperOptics can on average be 2.1x faster than on an optical circuit switched network.
Monday June 20, 2016 2:25pm - 2:50pm MDT
The performance of storage devices has increased significantly due to emerging technologies such as Solid State Drives (SSDs) and the Non-Volatile Memory Express (NVMe) interface. However, the complex I/O stack of the kernel impedes utilizing the full performance of NVMe SSDs. Application-specific optimization is also difficult in the kernel, because the kernel must provide generality and fairness.
In this paper, we propose a user-level I/O framework which improves performance by allowing user applications to access commercial NVMe SSDs directly, without any hardware modification. Moreover, the proposed framework provides flexibility: user applications can select their own I/O policies, including the I/O completion method, caching, and I/O scheduling. Our evaluation results show that the proposed framework outperforms kernel-based I/O by up to 30% on microbenchmarks and by up to 15% on Redis.
Monday June 20, 2016 2:25pm - 2:50pm MDT
In this paper, we make a key observation: by using multiple priority queues and weighted fair sharing at each port, Aalo does a good job of approximating shortest-job-first (SJF) scheduling, but only at queue granularity, because using FIFO to schedule CoFlows within each queue is simplistic and bears little resemblance to SJF.
Instead, we present three insights into Aalo’s scheduler, showing that exploiting the spatial dimension of the problem domain, i.e., the width (number of ports) of the CoFlows, can lead to better scheduling policies within each priority queue, improving the overall CoFlow completion time (CCT).
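A simplified sketch of the width-aware idea (our own illustrative ordering key, which may differ from the paper's exact policies): within a priority queue, order CoFlows by a size estimate that incorporates width, so that narrow or short CoFlows finish ahead of wide, heavy ones.

```python
# Hedged sketch: replace FIFO inside a priority queue with an ordering
# that uses both bytes sent and CoFlow width (number of ports).

def order_queue(coflows):
    """coflows: list of (name, bytes_sent, width).
    Smaller bytes*width is served first; name breaks ties."""
    return sorted(coflows, key=lambda c: (c[1] * c[2], c[0]))

q = [("cfA", 100, 50), ("cfB", 100, 2), ("cfC", 10, 50)]
print([name for name, _, _ in order_queue(q)])   # ['cfB', 'cfC', 'cfA']
```

Under FIFO, cfA would block both smaller CoFlows; the width-aware key lets the narrow cfB and the short cfC complete first, lowering average CCT in this toy case.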
Monday June 20, 2016 2:50pm - 3:15pm MDT
Flash-based key-value cache systems, such as Facebook’s McDipper [1] and Twitter’s Fatcache [2], provide a cost-efficient solution for high-speed key-value caching. These cache solutions typically take commercial SSDs and adopt a Memcached-like scheme to store and manage key-value pairs in flash. Such a practice, though simple, is inefficient. We advocate reconsidering the hardware/software architecture design by directly opening device-level details to key-value cache systems. This co-design approach can effectively bridge the semantic gap and closely connect the two layers. Leveraging the domain knowledge of key-value caches and the unique device-level properties, we can maximize the efficiency of a key-value cache system on flash devices while minimizing its weaknesses. We are implementing a prototype based on the Open-channel SSD hardware platform. Our preliminary experiments show very promising results.
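A conceptual sketch of what "opening device-level details" buys (hypothetical structures, not the actual prototype): with channels and blocks visible, the cache can map keys straight to physical locations and stripe writes across channels for parallelism, instead of working against a hidden FTL.

```python
# Toy model of a key-value cache that places values directly at
# (channel, block, offset) on an open-channel SSD; allocation here is
# a naive channel-striping cursor purely for illustration.

class OpenChannelKVCache:
    def __init__(self, channels, blocks_per_channel):
        self.index = {}           # key -> (channel, block, offset)
        self.next = (0, 0, 0)     # append-only allocation cursor
        self.channels, self.blocks = channels, blocks_per_channel

    def put(self, key):
        ch, blk, off = self.next
        self.index[key] = (ch, blk, off)
        # Stripe across channels; bump the block after a full stripe.
        self.next = ((ch + 1) % self.channels,
                     blk + (1 if ch + 1 == self.channels else 0), off)
        return self.index[key]

cache = OpenChannelKVCache(channels=2, blocks_per_channel=128)
print(cache.put("user:1"))   # (0, 0, 0)
print(cache.put("user:2"))   # (1, 0, 0) - parallel channel
print(cache.put("user:3"))   # (0, 1, 0)
```

Because the cache controls placement, eviction can erase a whole flash block of expired values at once, avoiding the device-level garbage collection a conventional FTL would trigger.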
Monday June 20, 2016 2:50pm - 3:15pm MDT
The growing variety of data storage and retrieval needs is driving the design and development of an increasing number of distributed storage applications such as key-value stores, distributed file systems, object stores, and databases. We observe that, to a large extent, such applications each implement their own handling of features such as data replication, failover, consistency, cluster topology, and leadership election. We found that 45–82% of the code in six popular distributed storage applications can be classified as implementations of such common features. While such implementations allow for deeper optimizations tailored to a specific application, writing new applications to satisfy the ever-changing requirements of new types of data or I/O patterns is challenging, as it is notoriously hard to get all the features right in a distributed setting.
In this paper, we argue that for most modern storage applications, the common feature implementation (i.e., the distributed part) can be automated and offloaded, so developers can focus on the core application functions. We are designing a framework, ClusterOn, which aims to take care of the messy plumbing of distributed storage applications. The envisioned goal is that a developer simply “drops” a non-distributed application into ClusterOn, which will convert it into a scalable and highly configurable distributed application.
Monday June 20, 2016 3:45pm - 4:10pm MDT
We present Mlcached, a multi-level DRAM-NAND key-value cache designed to enable independent resource provisioning of DRAM and NAND flash memory by completely decoupling the caching layers. Mlcached utilizes DRAM for the L1 cache and our new KV-cache device for the L2 cache. The index-integrated FTL is implemented in the KV-cache device to eliminate any in-memory indexes that prohibit independent resource provisioning. We show that Mlcached is only 12.8% slower than a DRAM-only Web caching service in average RTT with an 80% L1 cache hit rate, while saving two-thirds of its TCO. Moreover, our model-based study shows that Mlcached can provide up to 6X lower cost or 4X lower latency at the same SLA or TCO, respectively.
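The two-level look-up path can be sketched in miniature (illustrative only; in the real system the L2 index lives inside the KV-cache device rather than in host memory): check DRAM first, fall back to the flash tier, and promote on an L2 hit.

```python
# Toy two-level cache look-up: L1 (DRAM dict) backed by L2 (flash dict),
# with promotion on an L2 hit. Stats show where each request was served.

def get(key, l1, l2, stats):
    if key in l1:
        stats["l1_hits"] += 1
        return l1[key]
    if key in l2:
        stats["l2_hits"] += 1
        l1[key] = l2[key]          # promote into DRAM for next time
        return l2[key]
    stats["misses"] += 1
    return None

l1, l2 = {}, {"a": 1}
stats = {"l1_hits": 0, "l2_hits": 0, "misses": 0}
print(get("a", l1, l2, stats))   # 1 (served from flash, promoted)
print(get("a", l1, l2, stats))   # 1 (now a DRAM hit)
print(stats)
```

Decoupling means each tier can be sized independently: growing L2 flash capacity does not consume DRAM for indexing, which is the provisioning property the abstract emphasizes.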
Monday June 20, 2016 4:00pm - 4:25pm MDT
The storage needs of users have shifted from just needing to store data to requiring a rich interface which enables the efficient query of versions, snapshots and creation of clones. Providing these features in a distributed file system while maintaining scalability, strong consistency and performance remains a challenge. In this paper we introduce Silver, a file system which leverages the Corfu distributed logging system to not only store data, but to provide fast strongly consistent snapshots, clones and multi-versioning while preserving the scalability and performance of the distributed shared log. We describe and implement Silver using a FUSE prototype and show its performance characteristics.
Monday June 20, 2016 4:10pm - 4:35pm MDT
FPGA-enabled datacenters have shown great potential for providing performance and energy efficiency improvement. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks like Apache Spark? To provide a generalized methodology and insights for efficient integration, we conduct an in-depth analysis of challenges at single-thread, single-node multi-thread, and multi-node levels, and propose solutions including batch processing and the FPGA-as-a-Service framework to address them. With a step-by-step case study for the next-generation DNA sequencing application, we demonstrate how a straightforward integration with 1,000x slowdown can be tuned into an efficient integration with 2.6x overall system speedup and 2.4x energy efficiency improvement.
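The batch-processing solution rests on a simple amortization argument: each FPGA offload call pays a fixed invocation cost (data movement, kernel launch), so per-item overhead shrinks as the batch grows. A toy model with invented costs (not measurements from the paper):

```python
# Why naive per-record offload can be 1,000x slower: fixed invocation
# cost dominates until batching amortizes it. Numbers are illustrative.

def time_per_item(batch, invoke_cost_us=1000.0, item_cost_us=1.0):
    return (invoke_cost_us + batch * item_cost_us) / batch

print(time_per_item(1))      # 1001.0 us/item: overhead dominates
print(time_per_item(1000))   # 2.0 us/item: overhead amortized
```

The FPGA-as-a-Service layer then shares one accelerator across threads and nodes, which is what keeps batches full in a multi-tenant cluster.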
Monday June 20, 2016 4:25pm - 4:50pm MDT
Our key innovation is to allow volume snapshots in VDFS (our native hyper-converged distributed file system) to be exported as stand-alone regular files, called exo-clones, that can be imported into another VDFS cluster efficiently (zero-copy when possible). Our exo-clones carry provenance, policy, and, similar to git commits, the fingerprints of the parent clones from which they were derived. They are analogous to commits in a distributed source control system, and can be stored outside of VDFS, rebased, and signed. Although they can be unpacked to any directory, when used with VDFS they can be mounted directly with zero copying and are instantly available to all nodes mounting VDFS. VDFS with exo-clones provides the format and the tools necessary to both transfer and run encapsulated applications in both public and private clouds, and in both test/dev and production environments.
Monday June 20, 2016 4:35pm - 5:00pm MDT
Recently, unikernels have emerged as an exploration of minimalist software stacks to improve the security of applications in the cloud. In this paper, we propose extending the notion of minimalism beyond an individual virtual machine to include the underlying monitor and the interface it exposes. We propose unikernel monitors. Each unikernel is bundled with a tiny, specialized monitor that only contains what the unikernel needs, both in terms of interface and implementation. Unikernel monitors improve isolation through minimal interfaces, reduce complexity, and boot unikernels quickly. Our initial prototype, ukvm, is less than 5% the code size of a traditional monitor, and boots MirageOS unikernels in as little as 10ms (8× faster than a traditional monitor).
Monday June 20, 2016 4:50pm - 5:15pm MDT
We present a new problem in data storage: how to build efficient backup and restore tools for increasingly popular Next-generation Eventually Consistent STorage systems (NECST). We show that the lack of a concise, consistent, logical view of data at a point-in-time is the key underlying problem; we suggest a deep semantic understanding of the data stored within the system of interest as a solution. We discuss research and productization challenges in this new domain, and present the status of our platform, Datos CODR (Consistent Orchestrated Distributed Recovery), which can extract consistent and deduplicated backups from NECST systems such as Cassandra, MongoDB, and many others.
Monday June 20, 2016 5:00pm - 5:25pm MDT
Big data analytics became a hot research topic nearly ten years ago, but since that time, a lot of things have changed. On the hardware side, trends such as the slowdown of processing with respect to I/O are starting to affect the design of big data systems. On the application side, big data systems are increasingly being used by non-programmers and require similar forms of interaction to "small data" analysis tools. Finally, big data systems are increasingly provided "as a service" on cloud infrastructure. I'll talk about these changes from the perspective of the Apache Spark project and from my experience at a company offering a cloud service for big data (Databricks).
Tuesday June 21, 2016 9:00am - 10:30am MDT
Recent years have seen significant growth in the cloud computing market, both in terms of provider competition (including private offerings) and customer adoption. However, the cloud computing world still lacks adopted standard programming interfaces, which has a knock-on effect on the costs associated with interoperability and severely limits the flexibility and portability of applications and virtual infrastructures. This has brought about an increasing number of cross-cloud architectures, i.e., systems that span cloud provisioning boundaries. This paper condenses discussions from the CrossCloud event series to outline the types of cross-cloud systems and their associated design decisions, and discusses the challenges and opportunities they create.
Tuesday June 21, 2016 11:00am - 11:25am MDT
Much research effort has been devoted to improving the performance of the I/O stack in mobile devices, but limited time has been spent evaluating mobile application performance from the user’s perspective. In this paper, we try to understand how applications running on the newest devices behave with respect to this metric. We develop a methodology for quantifying user-perceived latency and use it to evaluate four common application benchmarks with I/O stack optimizations on two of the latest smartphones. Contrary to our expectation, we find that (i) these applications respond reasonably fast and (ii) the user-perceived latency does not benefit drastically (at most 11.8%) from I/O stack optimizations.
Tuesday June 21, 2016 11:00am - 11:25am MDT
Virtually all public clouds today are run by single providers, and this creates near-monopolies, inefficient markets, and hinders innovation at the infrastructure level. There are current proposals to change this, by creating open architectures that allow providers of computing and storage resources to compete for tenant services at multiple levels, all the way down to the bare metal. Networking, however, is not part of this, and is viewed as a commodity much like power or cooling. In this paper we borrow ideas from the Internet architecture, and propose to structure the cloud datacenter network as a marketplace where multiple service providers can offer connectivity services to tenants. Our marketplace, NetEx, divides the network into independently managed pods of resources, interconnected with multiple providers through special programmable switches that play a role analogous to that of an IXP. We demonstrate the feasibility of such an architecture by a prototype in Mininet, and argue that this can be a way to provide innovation, competition, and efficiency in future cloud datacenter networks.
Tuesday June 21, 2016 11:25am - 11:50am MDT
Nowadays, mobile devices have become necessities of everyday life. However, users may notice that after a long period of usage, mobile devices start experiencing sluggish response. In this paper, by conducting an empirical study of filesystem fragmentation on several aged mobile devices, we found that: 1) files may suffer from severe fragmentation, and database files are among the most severely fragmented; 2) filesystem fragmentation does affect the performance of mobile devices, and the impact varies from device to device. Conventional defragmentation schemes do not work well on mobile devices because they do not consider the characteristics specific to mobile storage. We then suggest two pilot solutions to enhance file defragmentation for mobile devices.
Tuesday June 21, 2016 11:25am - 11:50am MDT
With public cloud providers poised to become indispensable utility providers, neutrality-related mandates will likely emerge to ensure a level playing field among their customers (“tenants”). We analogize with net neutrality to discuss: (i) what form cloud neutrality might take, (ii) what lessons the net neutrality debate has to offer, and (iii) in what ways cloud neutrality would be different from (and even more difficult than) net neutrality. We use idealized thought experiments and simple workload case studies to illustrate our points and conclude with a discussion of challenges and future directions. Our paper points to a rich and important area for future work.
Tuesday June 21, 2016 11:50am - 12:15pm MDT
Photo management has become a sizable fraction of our computer interaction. Driven by economic incentives, every software company wants to lock users into its own software for photo management and use. Unfortunately, this results in duplicated images, repeated image manipulation operations, and an overall uneven and siloed user experience. In this paper, we motivate the need for a dedicated platform service for photo management that can not only manage the photos on one device, but also transparently manage content adaptation, image manipulation, and the propagation of manipulations to all applications on a device, and to all devices using the service. We present Pixelsior, our study of the requirements of such a system, as well as a preliminary design motivated by consistency and efficiency requirements.
Tuesday June 21, 2016 11:50am - 12:15pm MDT
Software will always be vulnerable to attacks. Although techniques exist that could prevent or limit the risk of exploits, performance overhead blocks their adoption. Services deployed into the cloud are typically customer facing, leaving them even more exposed to attacks from malicious users. However, the use of virtual machines, and the economy of scale found in cloud platforms, provides an opportunity to offer strong security guarantees to tenants at low cost to the cloud provider. We present ScaaS, a security Scanning as a Service framework for cloud platforms that uses frequent virtual machine checkpointing coupled with memory introspection techniques to detect bugs and malicious behavior in real time. By buffering VM outputs (i.e., outgoing network packets and disk writes) until a scan has been completed, ScaaS gives strong guarantees about the amount of damage an attack can do, while minimizing overheads.
Tuesday June 21, 2016 2:00pm - 2:25pm MDT
Deduplication is a widely studied capacity optimization technique that replaces redundant regions of data with references. Not only is deduplication an ongoing area of academic research, but numerous vendors also ship deduplicating storage products. Historically, most deduplication-related publications focus on a narrow range of topics: maximizing deduplication ratios and read/write performance. While future research will continue to optimize these areas, we believe that there are numerous novel, deduplication-specific problems that have been largely ignored by the academic community. Based on feedback from customers as well as internal architecture discussions, we present new deduplication problems that we hope will spur the next generation of research.
Tuesday June 21, 2016 2:00pm - 2:25pm MDT
This paper presents a source-side backup scheme with low resource usage, based on collaborative deduplication and approximated lazy deletion, for settings where frequent virtual machine snapshot backup is required in a large-scale cloud cluster. The key ideas are to orchestrate multi-round duplicate detection batches among machines in a partitioned, asynchronous manner and to remove most unreferenced content chunks with approximated snapshot deletion. The paper discusses the challenges, the main design and strategies, and evaluation results.
Tuesday June 21, 2016 2:25pm - 2:50pm MDT
Cold storage devices such as tape and optical discs are a good solution for reducing the total cost of ownership of storing data. However, they have a drawback: media and drives are separated, and placing media into drives when accessing data takes a few minutes. Although placing correlated data together on the same medium reduces media exchanges, multi-dimensional searches disrupt this locality. We propose two approaches that replicate data and place the replicas in different layouts to solve this problem. By concentrating on relative latency reduction, or by utilizing replicas originally generated to avoid data loss, our method achieves a high latency reduction with a restricted loss of capacity efficiency. A simulation shows a 31% average relative latency reduction with capacity efficiency remaining at 91%.
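The gain from keeping replicas in differently clustered layouts can be illustrated with a toy model (our construction with illustrative item keys and media, not the paper's placement scheme or simulator):

```python
# Items are keyed by (date, region). Replica 1 clusters media by date,
# replica 2 clusters media by region (illustrative layouts only).
by_date = {d: [(d, r) for r in range(4)] for d in range(4)}
by_region = {r: [(d, r) for d in range(4)] for r in range(4)}

def media_loads(query_items, layout):
    """Count how many media must be placed into drives (each costing
    minutes on cold storage) to read all queried items under a layout."""
    wanted = set(query_items)
    return sum(1 for contents in layout.values() if wanted & set(contents))

single_date_query = [(2, r) for r in range(4)]   # a query along one dimension
print(media_loads(single_date_query, by_date),    # 1 medium load
      media_loads(single_date_query, by_region))  # 4 media loads
```

Serving each query from whichever replica layout matches its search dimension minimizes the number of slow media exchanges, at the cost of the extra replica capacity.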
Tuesday June 21, 2016 2:25pm - 2:50pm MDT
In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers [15, 22, 1, 2] are particularly well-suited as the fundamental “object” in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason: thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and algorithms.
This paper describes three types of design patterns that we have observed emerging in container-based distributed systems: single-container patterns for container management, single-node patterns of closely cooperating containers, and multi-node patterns for distributed algorithms. Like object-oriented patterns before them, these patterns for distributed computation encode best practices, simplify development, and make the systems where they are used more reliable.
Tuesday June 21, 2016 2:50pm - 3:15pm MDT
Data compression and deduplication are two common approaches to increasing storage efficiency in the cloud environment. Both users and cloud service providers have economic incentives to compress their data before storing it in the cloud. However, our analysis indicates that compressed packages of different data, and differently compressed packages of the same data, are usually fundamentally different from one another even when they share a large amount of redundant data. Existing data deduplication systems cannot detect the redundancy among them. We propose the X-Ray Dedup approach, which extracts from these packages unique metadata, such as the “checksum” and “file length” information, and uses it as the compressed file’s content signature to help detect and remove file-level data redundancy. Our evaluations show that X-Ray Dedup can break through the boundaries of compressed packages and significantly reduce their space requirements, further optimizing storage space in the cloud.
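The signature idea can be sketched with standard archive metadata (a hypothetical simplification using the per-file CRC-32 checksum and uncompressed length that ZIP archives already store; the paper's exact signature format is not reproduced here):

```python
import io
import zipfile

def package_signatures(zip_bytes):
    """Extract content signatures from a compressed package: each
    member file's stored CRC-32 checksum paired with its uncompressed
    length, without decompressing any data."""
    sigs = set()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            sigs.add((info.CRC, info.file_size))
    return sigs

def make_zip(files):
    """Build a small ZIP package in memory from {name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

# Two packages of mostly different data that share one redundant file.
a = make_zip({"a.txt": b"x" * 1000, "b.txt": b"shared payload" * 50})
b = make_zip({"c.txt": b"y" * 1000, "d.txt": b"shared payload" * 50})

shared = package_signatures(a) & package_signatures(b)
print(len(shared))  # the shared file is detected despite differing packages
```

Byte-level deduplication of the two archives would find little in common, but the metadata signatures expose the shared member file directly.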
Tuesday June 21, 2016 2:50pm - 3:15pm MDT
Non-volatile memory, or NVM, is coming. Several technologies are maturing (FeRAM, ReRAM, PCM, DWM, FJG RAM), and soon we expect products from Intel, Micron, HP, SanDisk, and/or Samsung. Some of these products promise memory density close to flash and performance within a reasonable factor of DRAM. This technology could substantially improve the performance of software systems, especially storage systems.
Unfortunately, using NVM is hard: each technology has its quirks, and the details of products are not yet available. We need a way to integrate NVM into our software systems, without full knowledge of all the NVM product details and without having to redesign every software system for each forthcoming NVM technology.
We advocate the use of customized key-value stores. Rather than programming directly on NVM, developers (1) design a key-value store customized for the application, (2) implement the key-value store for the target NVM technology, and (3) program the application using the key-value store. When new NVM products emerge, with similar performance characteristics but different access mechanisms, developers need only modify the key-value store implementation, which is simpler, faster, and cheaper than redesigning the application. Thus, the key-value store serves as a middle layer that hides the details of the NVM technology, while providing a simple and familiar interface to the application. Customization ensures that the design is performant and simple.
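The proposed middle layer might look like the following interface sketch (the names and the DRAM-backed stand-in are our illustrations, not an API from the paper):

```python
from abc import ABC, abstractmethod
from typing import Optional

class AppKVStore(ABC):
    """Middle-layer sketch: the application programs against a small
    key-value interface customized for its needs, and each NVM
    technology supplies its own implementation behind it."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]: ...

class DictBackedStore(AppKVStore):
    """Stand-in backend using a plain in-memory dict; a PCM or ReRAM
    product would ship its own subclass with technology-specific
    access code, and the application above would not change."""

    def __init__(self) -> None:
        self._m = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._m[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._m.get(key)

store: AppKVStore = DictBackedStore()
store.put(b"user:1", b"alice")
print(store.get(b"user:1"))  # b'alice'
```

Swapping NVM products then means swapping the concrete subclass, not redesigning the application.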
Tuesday June 21, 2016 3:15pm - 3:40pm MDT
Many big data customers use on-demand platforms in the cloud, where they can get a dedicated virtual cluster in a couple of minutes and pay only for the time they use. Increasingly, there is demand for bare-metal big data solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster by installing operating systems and applications on the local disks of servers. This has motivated recent research into sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption is true for application data, it is incorrect for operating systems and applications: network-mounting the boot disk and applications results in negligible run-time impact while leading to faster provisioning time.
Tuesday June 21, 2016 4:00pm - 4:25pm MDT
We apply an extent-based clustering technique to the problem of identifying “hot” or frequently-written data in an SSD, allowing such data to be segregated for improved cleaning performance. We implement and evaluate this technique in simulation, using a page-mapped FTL with Greedy cleaning and separate hot and cold write frontiers. We compare it with two recently proposed hot data identification algorithms, Multiple Hash Functions and Multiple Bloom Filters, keeping the remainder of the FTL / cleaning algorithm unchanged. In almost all cases write amplification was lower with the extent-based algorithm; although in some cases the improvement was modest, in others it was as much as 20%. These gains are achieved with very small amounts of memory, e.g., roughly 10KB for the implementation tested, an important factor for SSDs where most DRAM is dedicated to address maps and data buffers.
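As a rough illustration of the hash-based hot-data identification that the compared baselines use, here is a count-min-style sketch with illustrative parameters (our simplification; it is neither the paper's extent-based algorithm nor the exact Multiple Hash Functions / Multiple Bloom Filters schemes):

```python
import hashlib

class HotnessSketch:
    """Track write frequency per LBA in a few KB of memory: K hash
    rows of counters; an LBA is 'hot' when its minimum counter
    reaches a threshold. Parameters are illustrative."""

    def __init__(self, rows=4, cols=1024, threshold=4):
        self.rows, self.cols, self.threshold = rows, cols, threshold
        self.table = [[0] * cols for _ in range(rows)]

    def _slots(self, lba):
        # One independent hash per row, derived from a keyed digest.
        for r in range(self.rows):
            h = hashlib.blake2b(f"{r}:{lba}".encode(), digest_size=4)
            yield r, int.from_bytes(h.digest(), "big") % self.cols

    def record_write(self, lba):
        for r, c in self._slots(lba):
            self.table[r][c] += 1

    def is_hot(self, lba):
        # Taking the minimum across rows damps hash-collision inflation.
        return min(self.table[r][c] for r, c in self._slots(lba)) >= self.threshold

s = HotnessSketch()
for _ in range(10):
    s.record_write(42)   # frequently written LBA
s.record_write(7)        # written once
print(s.is_hot(42), s.is_hot(7))
```

Hot and cold writes can then be steered to separate write frontiers, which is what lowers cleaning-time write amplification.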
Tuesday June 21, 2016 4:10pm - 4:35pm MDT
Scale-out applications have emerged as the dominant Internet services today. A request in a scale-out workload generally involves task partitioning and merging with barrier synchronization, making it difficult to predict the request tail latency to meet stringent tail Service Level Objectives (SLOs). In this paper, we find that the request tail latency can be faithfully predicted, in the high load region, by a prediction model using only the mean and variance of the task response time as input. The prediction errors for the 99th percentile request latency are found to be consistently within 10% at the load of 90% for both model and measurement-based testing cases. Consequently, the work in this paper establishes an important link between the request tail SLOs and the low order task statistics in a high load region, where the resource provisioning is desired. Finally, we discuss how the prediction model may facilitate highly scalable, tail-constrained resource provisioning for scale-out workloads.
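A back-of-envelope version of such a prediction can be sketched as follows (our simplification: i.i.d. normal task response times and a pure fork-join request, which is not the paper's exact model):

```python
from statistics import NormalDist

def predict_request_p99(mu, sigma, n_tasks):
    """A request forks into n_tasks parallel tasks and completes when
    the slowest finishes. Assuming i.i.d. normal task response times
    with the given mean and standard deviation, the request's 99th
    percentile x solves P(task <= x) ** n_tasks = 0.99."""
    per_task_quantile = 0.99 ** (1.0 / n_tasks)
    return mu + sigma * NormalDist().inv_cdf(per_task_quantile)

# Tasks averaging 10 ms with 2 ms standard deviation, fanned out 32 ways.
p99 = predict_request_p99(mu=10.0, sigma=2.0, n_tasks=32)
print(round(p99, 2))  # well above the 10 ms task mean
```

The point the abstract makes survives even in this toy form: only the task mean and variance enter the formula, yet the request-level tail falls out of the barrier-synchronized structure.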
Tuesday June 21, 2016 4:25pm - 4:50pm MDT
In container-based virtualization, where multiple isolated containers share I/O resources on top of a single operating system, efficient and proportional I/O resource sharing is an important system requirement. Motivated by the lack of adequate support for I/O resource sharing in Linux Cgroup for high-performance NVMe SSDs, we developed a new weight-based dynamic throttling technique that provides proportional I/O sharing for container-based virtualization solutions running on NUMA multi-core systems with NVMe SSDs. By intelligently predicting the future I/O bandwidth requirements of containers based on the past I/O service rates of I/O-active containers, and by modifying the current Linux Cgroup implementation for better NUMA-scalable performance, our scheme achieves highly accurate I/O resource sharing while reducing wasted I/O bandwidth. Based on a Linux kernel 4.0.4 implementation running on a 4-node NUMA multi-core system with NVMe SSDs, our experimental results show that the proposed technique can efficiently share the I/O bandwidth of NVMe SSDs among multiple containers according to given I/O weights.
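The weight-based, demand-aware split can be sketched as follows (our simplified allocator with illustrative numbers, not the kernel implementation):

```python
def share_bandwidth(total_bw, weights, predicted_demand):
    """Weight-proportional split that reclaims slack: a container
    predicted to need less than its weighted share keeps only its
    predicted demand, and the leftover bandwidth is redistributed
    by weight among the still-hungry containers."""
    alloc = {c: 0.0 for c in weights}
    active, remaining = set(weights), total_bw
    while active and remaining > 1e-9:
        wsum = sum(weights[c] for c in active)
        satisfied, spent = set(), 0.0
        for c in active:
            fair = remaining * weights[c] / wsum
            want = predicted_demand[c] - alloc[c]
            give = min(fair, want)
            alloc[c] += give
            spent += give
            if give >= want - 1e-9:
                satisfied.add(c)
        remaining -= spent
        active -= satisfied
        if not satisfied:        # everyone left is bandwidth-limited
            break
    return alloc

shares = share_bandwidth(
    120.0,
    weights={"a": 2, "b": 1, "c": 1},
    predicted_demand={"a": 100.0, "b": 10.0, "c": 100.0})
print({c: round(v, 1) for c, v in shares.items()})
```

Container "b" keeps only the bandwidth it is predicted to use, and its unused share flows to "a" and "c" in their 2:1 weight ratio, which is the "reducing wasted I/O bandwidth" behavior the abstract describes.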
Tuesday June 21, 2016 4:35pm - 5:00pm MDT
Modern data processing frameworks are used in a variety of settings for a diverse set of workloads such as sorting, indexing, iterative computations, and structured query processing. As these frameworks run in a distributed environment, a natural question to ask is: how important is the network to the performance of these frameworks? Recent research in this field has led to contradictory results. One camp advocates the limited impact of networking performance on the overall performance of the framework. On the other hand, there is a large body of work on networking optimizations for data processing frameworks.
In this paper, we search for a better understanding of the matter. While answering the basic question concerning the importance of network performance, our analysis raises new questions and points to previously unexplored or unnoticed avenues for performance optimization. We take Apache Spark as a representative of a modern data-processing framework. However, to broaden the scope of our investigation, we also experiment with other frameworks such as Flink, PowerGraph, and Timely. In our study, rather than analysing Spark-specific peculiarities, we look into procedures and subsystems that are common to all of these frameworks, such as network I/O, shuffle data management, object (de)serialization, copies, and job scheduling and coordination. Nonetheless, we are aware that the roles of these individual components differ across the various systems, and we exercise caution when making generalized statements about performance.
Tuesday June 21, 2016 4:50pm - 5:15pm MDT
In this paper, we present a flash solid-state drive (SSD) optimization that provides hints about SSD internal behaviors, such as device I/O time and buffer activity, to the OS in order to mitigate the impact of I/O completion scheduling delays. The hints enable the OS to make reliable latency predictions for each I/O request, so that it can make accurate scheduling decisions about when to yield (block) or busy-wait the CPU, ultimately improving user-perceived I/O performance. We achieve this by implementing latency predictors supported by an SSD I/O behavior tracker within the SSD that tracks I/O behavior at the level of internal resources, such as DRAM buffers or NAND chips. Evaluations with an SSD prototype based on a Xilinx Zynq-7000 FPGA and MLC flash chips show that our optimizations enable the OS to mask the scheduling delays without severely impacting system parallelism, compared to prior I/O completion methods.
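The kind of scheduling decision such latency predictions enable can be sketched as a simple policy (our illustrative threshold rule, not the paper's exact predictor):

```python
def completion_strategy(predicted_latency_us, ctx_switch_cost_us=5.0):
    """If the predicted device I/O time is shorter than the round-trip
    cost of blocking and being rescheduled (two context switches in
    this toy model), busy-wait; otherwise yield the CPU."""
    if predicted_latency_us <= 2 * ctx_switch_cost_us:
        return "busy-wait"
    return "block"

print(completion_strategy(4.0),    # e.g., a read served from the SSD's DRAM buffer
      completion_strategy(80.0))   # e.g., a read that must touch a NAND chip
```

Without the hint, the OS must pick one strategy for all requests; with it, fast buffer hits avoid context-switch overhead while slow NAND reads still free the CPU for other work.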
Tuesday June 21, 2016 5:00pm - 5:25pm MDT
Data centre networks are increasingly programmable, with application-specific network services proliferating, from custom load-balancers to middleboxes providing caching and aggregation. Developers must currently implement these services using traditional low-level APIs, which neither support natural operations on application data nor provide efficient performance isolation.
We describe FLICK, a framework for the programming and execution of application-specific network services on multi-core CPUs. Developers write network services in the FLICK language, which offers high-level processing constructs and application-relevant data types. FLICK programs are translated automatically to efficient, parallel task graphs, implemented in C++ on top of a user-space TCP stack. Task graphs have bounded resource usage at runtime, which means that the graphs of multiple services can execute concurrently without interference using cooperative scheduling. We evaluate FLICK with several services (an HTTP load-balancer, a Memcached router and a Hadoop data aggregator), showing that it achieves good performance while reducing development effort.
Wednesday June 22, 2016 10:30am - 10:55am MDT
Open vSwitch is a high-performance multi-layer virtual switch that serves as a flexible foundation for building virtualized, stateless Layer 2 and 3 network services in multitenant datacenters. As workloads become more sophisticated, providing tenants with virtualized middlebox services is an increasingly important and recurring theme, yet it remains difficult to integrate these stateful services efficiently into Open vSwitch and its OpenFlow forwarding model: middleboxes perform complex operations that depend on internal state and inspection of packet payloads, functionality that is impossible to express in OpenFlow. In this paper, we present SoftFlow, an extension of Open vSwitch that seamlessly integrates middlebox functionality while maintaining the familiar OpenFlow forwarding model and performing significantly better than alternative techniques for middlebox integration.
Wednesday June 22, 2016 10:55am - 11:20am MDT
To achieve low TCP flow completion time (FCT) in data center networks (DCNs), it is critical and challenging to recover losses rapidly without adding extra congestion. In this paper we therefore propose FUSO, a novel loss recovery approach that exploits multi-path diversity in DCNs for transport loss recovery. In FUSO, when a multi-path transport sender suspects loss on one sub-flow, recovery packets are immediately sent over another sub-flow that is less lossy (or not lossy at all) and has spare congestion window slots. FUSO is fast in that it does not need to wait for a timeout on the lossy sub-flow, and it is cautious in that it does not violate the congestion control algorithm. Testbed experiments and simulations show that FUSO decreases the latency-sensitive flows’ 99th percentile FCT by up to ~82.3% in a 1 Gbps testbed, and by up to ~87.9% in a 10 Gbps large-scale simulated network.
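The recovery sub-flow choice can be sketched as follows (the field names and the exact selection rule are our illustration of the idea, not FUSO's implementation):

```python
def pick_recovery_subflow(subflows):
    """Among sub-flows that still have spare congestion-window slots
    (cwnd > packets in flight), pick the least lossy one to carry
    recovery packets for the suspected-lost data. Returning None means
    staying cautious: no sub-flow has a spare slot, so sending now
    would violate congestion control."""
    candidates = [s for s in subflows if s["cwnd"] > s["inflight"]]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["loss_rate"])["name"]

subflows = [
    {"name": "sf0", "cwnd": 10, "inflight": 10, "loss_rate": 0.001},  # window full
    {"name": "sf1", "cwnd": 10, "inflight": 4,  "loss_rate": 0.020},
    {"name": "sf2", "cwnd": 10, "inflight": 6,  "loss_rate": 0.002},
]
print(pick_recovery_subflow(subflows))  # sf2: spare window and low loss
```

Because recovery rides an already-open window on a healthier path, it is faster than waiting for a timeout on the lossy sub-flow yet adds no congestion beyond what the controller allows.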
Wednesday June 22, 2016 11:20am - 11:45am MDT
StackMap leverages the best aspects of kernel-bypass networking in a new low-latency OS network service based on the full-featured kernel TCP implementation, by dedicating network interfaces to applications and offering an extended version of the netmap API for a zero-copy, low-overhead data path alongside a control path based on the socket API. For small-message, transactional workloads, StackMap outperforms baseline Linux by 4 to 78% in latency and 42 to 133% in throughput. It also achieves performance comparable to Seastar, a highly optimized user-level TCP/IP stack that runs on top of DPDK.
Wednesday June 22, 2016 11:45am - 12:10pm MDT
Many large-scale key-value storage systems sacrifice features like secondary indexing and/or consistency in favor of scalability or performance. This limits the ease and efficiency of application development on such systems. Implementing secondary indexing in a large-scale memory based system is challenging because the goals for low latency, high scalability, consistency and high availability often conflict with each other. This paper shows how a large-scale key-value storage system can be extended to provide secondary indexes while meeting those goals. The architecture, called SLIK, enables multiple secondary indexes for each table. SLIK represents index B+ trees using objects in the underlying key-value store. It allows indexes to be partitioned and distributed independently of the data in tables while providing reasonable consistency guarantees using a lightweight ordered write approach. Our implementation of this design on RAMCloud (a main memory key-value store) performs indexed reads in 11 μs and writes in 30 μs. The architecture supports indexes spanning thousands of nodes, and provides linear scalability for throughput.
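The ordered-write idea can be sketched in miniature (our simplification, with plain dicts standing in for the distributed B+ tree objects and the key-value store):

```python
class IndexedStore:
    """Ordered-write sketch: the new index entry is inserted before the
    object is written, and the stale entry is removed afterwards, so
    the index never misses a live object. Lookups re-check each index
    hit against the object itself to filter premature or stale
    entries, giving reasonable consistency without heavy locking."""

    def __init__(self):
        self.objects = {}   # primary key -> (secondary key, value)
        self.index = {}     # secondary key -> set of primary keys

    def write(self, pk, secondary, value):
        old = self.objects.get(pk)
        self.index.setdefault(secondary, set()).add(pk)   # 1) index first
        self.objects[pk] = (secondary, value)             # 2) then the object
        if old and old[0] != secondary:
            self.index[old[0]].discard(pk)                # 3) clean up last

    def lookup(self, secondary):
        return [obj[1] for pk in self.index.get(secondary, ())
                if (obj := self.objects.get(pk)) and obj[0] == secondary]

s = IndexedStore()
s.write("pk1", "alice", "record-1")
s.write("pk1", "bob", "record-2")          # re-index under a new secondary key
print(s.lookup("alice"), s.lookup("bob"))  # [] ['record-2']
```

Ordering the three steps this way means a crash or concurrent reader between steps sees at worst an extra index entry, never a missing one, and the lookup-time re-check filters the extras out.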
Wednesday June 22, 2016 1:40pm - 2:05pm MDT
Time manipulation, typically done using gettime() and settime(), happens extensively across all software layers in smartphones, from the kernel, to the framework, to millions of apps. This paper presents the first study of a new class of software bugs on smartphones called sleep-induced time bugs (SITBs). An SITB happens when the phone is suspended, due to the aggressive sleep policy adopted in smartphones, in the middle of a time-critical section where time is being manipulated, and the delay caused by the unexpected suspension alters the intended program behavior.
We first characterize time usage in the Android kernel, framework, and 978 apps into four categories and study their vulnerability to system suspension. Our study shows that time manipulation happens extensively in all three software layers, totaling 1047, 1737, and 7798 occurrences, respectively, and that all four usage patterns are vulnerable to SITBs. We then present a tool called KLOCK that uses a set of static analyses to systematically identify sleep-induced time bugs in three of the four time usage categories. When applied to five different Android Linux kernels, KLOCK correctly flagged 63 SITB-vulnerable time manipulation instances as time bugs.
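The vulnerable pattern can be illustrated with a small sketch (the injectable clock used to simulate a suspension is our construction; the real bugs live in kernel and framework code):

```python
def send_with_timeout(clock, send, timeout_s=1.0):
    """Sketch of a time-critical section: elapsed time is derived from
    two gettime() calls bracketing an operation. If the phone suspends
    between the two calls, the suspension interval inflates 'elapsed'
    and spuriously triggers the timeout path, altering program
    behavior even though the operation itself succeeded."""
    start = clock()
    send()
    elapsed = clock() - start
    return "timeout" if elapsed > timeout_s else "ok"

# Simulate a 30 s suspension landing inside the critical section: the
# two clock readings straddle the sleep.
readings = iter([100.0, 130.05])
print(send_with_timeout(lambda: next(readings), send=lambda: None))  # timeout
```

A static analysis like KLOCK's flags exactly such sections: a pair of time reads whose difference feeds a branch, with no protection against the device suspending in between.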
Wednesday June 22, 2016 1:40pm - 2:05pm MDT
Multicore processors are not energy proportional: the first running CPU core, which activates shared resources, incurs a much higher power cost than each additional core does. On the other hand, typical smartphone applications exhibit little parallelism, so when one core is activated by an interactive application, computing resources on other cores are available at a deep energy discount. Using non-work-conserving scheduling, we exploit these energy-discounted co-run opportunities to process best-effort smartphone tasks that involve no direct user interaction (e.g., data compression/encryption for cloud backup, background sensing, and offline bytecode compilation). We show that, for an optimal co-run energy discount, the best-effort processing must not elevate the overall system power state (specifically, no reduction of the multicore CPU idle state, no increase of the core frequency, and no impact on the system suspension period). In addition, we use available ARM performance counters to identify co-run resource contention on the multicore processor and throttle the best-effort task when it interferes with interactivity. Experimental results on a multicore smartphone show that we can reach up to a 63% energy discount in best-effort task processing with little performance impact on the interactive applications.
Wednesday June 22, 2016 2:05pm - 2:30pm MDT
We analyze the manycore scalability of five widely deployed file systems, namely ext4, XFS, btrfs, F2FS, and tmpfs, using our open source benchmark suite, FXMARK. FXMARK implements 19 microbenchmarks to stress specific components of each file system and includes three application benchmarks to measure macroscopic scalability behavior. We observe that file systems are hidden scalability bottlenecks in many I/O-intensive applications, even when there is no apparent contention at the application level. We found 25 scalability bottlenecks in file systems, many of which are unexpected or counterintuitive. We draw a set of observations on file system scalability behavior and unveil several core aspects of file system design that systems researchers must address.
Wednesday June 22, 2016 2:05pm - 2:30pm MDT
The proliferation of connected sensing devices (or Internet of Things) can in theory enable a range of applications that make rich inferences about users and their environment. But in practice developing such applications today is arduous because they must implement all data sensing and inference logic, even as devices move or are temporarily disconnected. We develop Beam, a framework that simplifies IoT applications by letting them specify “what should be sensed or inferred,” without worrying about “how it is sensed or inferred.” Beam introduces the key abstraction of an inference graph to decouple applications from the mechanics of sensing and drawing inferences. The inference graph allows Beam to address three important challenges: (1) device selection in heterogeneous environments, (2) efficient resource usage, and (3) handling device disconnections. Using Beam we develop two diverse applications that use several different types of devices and show that their implementations required up to 12x fewer source lines of code while resulting in up to 3x higher inference accuracy.
Wednesday June 22, 2016 2:30pm - 2:55pm MDT
File system designs are undergoing rapid evolution to exploit the potential of flash memory. However, internal parallelism, a key feature of flash devices, is hard to leverage at the file system level due to the semantic gap caused by the flash translation layer (FTL). We observe that even flash-optimized file systems have serious garbage collection problems, which lead to significant performance degradation for write-intensive workloads on multi-channel flash devices.
In this paper, we propose ParaFS to exploit internal parallelism while ensuring efficient garbage collection. ParaFS is a log-structured file system over a simplified block-level FTL that exposes the physical layout. With knowledge of this device information, ParaFS first introduces 2-D data allocation to maintain hot/cold data grouping in flash memory while exploiting channel-level parallelism. ParaFS then coordinates garbage collection at both the FS and FTL levels to make garbage collection more efficient. In addition, ParaFS schedules read/write/erase requests over multiple channels to achieve consistent performance. Evaluations show that ParaFS effectively improves system performance for write-intensive workloads by 1.6x to 3.1x, compared to the flash-optimized F2FS file system.
Wednesday June 22, 2016 2:30pm - 2:55pm MDT
A recent NSDI paper [1] reported that increasing the cache hit ratio for an HTTP proxy from 22% to 32% improved median page load time (PLT) for mobile clients by less than 2%. We argue that there are two main causes for this weak improvement: objects on the critical path are often not cached, and the limited computational power of mobile devices causes computational delays to comprise a large portion of the critical path.
Both of these factors were, in fact, outlined by a previous analysis of desktop web performance [2]. However, we (as the authors of the HTTP proxy [1]) did not properly understand the analysis and could have saved ourselves substantial engineering costs if we had. We therefore argue for the need to highlight this prior analysis, and we extend it to include mobile devices with slow CPUs, precise cache hit ratios, and a controlled reproduction of the HTTP proxy caching results [1]. In the extreme case of a perfect cache hit ratio, desktop page load times improve notably, by 34% compared to no caching, but mobile page load times improve by only 13% in the median case. We extract a back-of-envelope performance model from these results to help understand their underlying causes.
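Such a back-of-envelope model can be written down directly (our toy formulation with illustrative numbers, not the measured values or the exact model from the paper):

```python
def plt_model(compute_s, network_s, hit_ratio):
    """Toy page-load-time model: PLT is the computational delay on the
    critical path plus the uncached fraction of the network fetch
    time on the critical path."""
    return compute_s + (1.0 - hit_ratio) * network_s

# A fast desktop CPU vs. a slow mobile CPU, same network, perfect cache.
desktop_gain = 1 - plt_model(0.5, 1.5, 1.0) / plt_model(0.5, 1.5, 0.0)
mobile_gain = 1 - plt_model(2.5, 1.5, 1.0) / plt_model(2.5, 1.5, 0.0)
print(f"desktop {desktop_gain:.0%}, mobile {mobile_gain:.0%}")
```

With a perfect cache, the toy model gives a 75% PLT reduction on the desktop but only half that on the mobile device, because the fixed computational delay dominates the mobile critical path, which is qualitatively the asymmetry described above.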
Wednesday June 22, 2016 2:55pm - 3:20pm MDT
Content-Defined Chunking (CDC) has played a key role in data deduplication systems over the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a fast and efficient CDC approach that builds on and improves the latest Gear-based CDC approach, one of the fastest CDC methods to our knowledge. The key idea behind FastCDC is the combined use of three techniques: simplifying and enhancing the hash judgment to address the challenges we observed in Gear-based CDC, skipping sub-minimum chunk cut-points to further speed up CDC, and normalizing the chunk-size distribution into a small specified region to address the decreased deduplication ratio stemming from cut-point skipping. Our evaluation results show that, by combining the three techniques, FastCDC is about 10x faster than the best open-source Rabin-based CDC, and about 3x faster than the state-of-the-art Gear- and AE-based CDC, while achieving nearly the same deduplication ratio as the classic Rabin-based approach.
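The Gear rolling hash and the cut-point-skipping step can be sketched as follows (plain Gear judgment with illustrative parameters; FastCDC's enhanced hash judgment and chunk-size normalization are omitted from this sketch):

```python
import random

random.seed(1)
GEAR = [random.getrandbits(64) for _ in range(256)]  # random per-byte table
MASK = (1 << 13) - 1          # 13 mask bits -> ~8 KiB expected after the minimum
MIN_CHUNK, MAX_CHUNK = 2048, 65536
M64 = (1 << 64) - 1

def gear_chunks(data):
    """Gear-based CDC with sub-minimum cut-point skipping: the rolling
    hash is only computed and judged past the minimum chunk size, so
    most bytes of each chunk are never hashed at all."""
    chunks, start, n = [], 0, len(data)
    while start < n:
        end = min(start + MAX_CHUNK, n)
        cut = end                                  # max-size or tail cut
        h = 0
        for i in range(start + MIN_CHUNK, end):    # skip the sub-minimum region
            h = ((h << 1) + GEAR[data[i]]) & M64   # Gear rolling hash
            if h & MASK == 0:
                cut = i + 1                        # content-defined cut-point
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks

data = random.randbytes(300_000)
chunks = gear_chunks(data)
print(len(chunks), min(map(len, chunks)), max(map(len, chunks)))
```

Because cut-points depend only on local content, an insertion early in the stream shifts chunk boundaries with it rather than invalidating every later chunk, which is what makes CDC so effective for deduplication.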
Wednesday June 22, 2016 2:55pm - 3:20pm MDT
Memory corruption vulnerabilities remain a grave threat to systems software written in C/C++. Current best practices dictate compiling programs with exploit mitigations such as stack canaries, address space layout randomization, and control-flow integrity. However, adversaries quickly find ways to circumvent such mitigations, sometimes even before these mitigations are widely deployed.
In this paper, we focus on an “orthogonal” defense that amplifies the effectiveness of traditional exploit mitigations. The key idea is to create multiple diversified replicas of a vulnerable program and then execute these replicas in lockstep on identical inputs while simultaneously monitoring their behavior. A malicious input that causes the diversified replicas to diverge in their behavior will be detected by the monitor; this allows discovery of previously unknown attacks such as zero-day exploits.
So far, such multi-variant execution environments (MVEEs) have been held back by substantial runtime overheads. This paper presents a new design, ReMon, that is non-intrusive, secure, and highly efficient. Whereas previous schemes either monitor every system call or none at all, our system enforces cross-checking only for security-critical system calls while supporting more relaxed monitoring policies for system calls that are not security critical. We achieve this by splitting the monitoring and replication logic into an in-process component and a cross-process component. Our evaluation shows that ReMon offers the same level of security as conservative MVEEs and runs realistic server benchmarks at near-native speeds.
Wednesday June 22, 2016 3:50pm - 4:15pm MDT
Blockchains like Bitcoin and Namecoin and their respective P2P networks have seen significant adoption in the past few years and show promise as naming systems with no trusted parties. Users can register human meaningful names and securely associate data with them, and only the owner of the particular private keys that registered them can write or update the name-value pair. In theory, many decentralized systems can be built using these blockchain networks, such as new, decentralized versions of DNS and PKI. As the technology is relatively new and evolving rapidly, however, little production data or experience is available to guide design tradeoffs.
In this paper, we describe our experiences operating a large deployment of a decentralized PKI service built on top of the Namecoin blockchain. We present various challenges pertaining to network reliability, throughput, and security that we needed to overcome while registering and updating over 33,000 entries and 200,000 transactions on the Namecoin blockchain. Further, we discuss how our experience informed the design of a new blockchain-based naming and storage system called Blockstack. We detail why we switched from the Namecoin network to the Bitcoin network for the new system, and present operational lessons from this migration. Blockstack is released as open source software and currently powers a production PKI system for 55,000 users.
Wednesday June 22, 2016 4:15pm - 4:40pm MDT
File systems that employ write-optimized dictionaries (WODs) can perform random-writes, metadata updates, and recursive directory traversals orders of magnitude faster than conventional file systems. However, previous WOD-based file systems have not obtained all of these performance gains without sacrificing performance on other operations, such as file deletion, file or directory renaming, or sequential writes.
Using three techniques (late-binding journaling, zoning, and range deletion), we show that there is no fundamental trade-off in write-optimization: these dramatic improvements can be retained while matching conventional file systems on all other operations.
BetrFS 0.2 delivers order-of-magnitude better performance than conventional file systems on directory scans and small random writes and matches the performance of conventional file systems on rename, delete, and sequential I/O. For example, BetrFS 0.2 performs directory scans 2.2x faster, and small random writes over two orders of magnitude faster, than the fastest conventional file system. But unlike BetrFS 0.1, it renames and deletes files commensurate with conventional file systems and performs large sequential I/O at nearly disk bandwidth. The performance benefits of these techniques extend to applications as well. BetrFS 0.2 continues to outperform conventional file systems on many applications, such as rsync, git-diff, and tar, but improves git-clone performance by 35% over BetrFS 0.1, yielding performance comparable to other file systems.
Wednesday June 22, 2016 4:15pm - 4:40pm MDT
Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today – logs, counters, and metrics – have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
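Very loosely, the happened-before join can be approximated as an ordinary join on a request context that flows between components, which lets a metric from one component be grouped by an attribute only another component knows. The component names and fields below are invented for illustration:

```python
# Toy happened-before join: events from two components are joined on the
# request context that travels with each request. All names are made up.

client_events = [                       # e.g. recorded at a client tier
    {"request": "r1", "user": "ann"},
    {"request": "r2", "user": "bob"},
]
disk_events = [                         # e.g. recorded at a storage tier
    {"request": "r1", "bytes": 4096},
    {"request": "r1", "bytes": 8192},
    {"request": "r2", "bytes": 1024},
]

def happened_before_join(left, right, key="request"):
    by_key = {e[key]: e for e in left}
    return [{**by_key[e[key]], **e} for e in right if e[key] in by_key]

# Group disk bytes by the *user* attribute, which only the client knows.
bytes_by_user = {}
for e in happened_before_join(client_events, disk_events):
    bytes_by_user[e["user"]] = bytes_by_user.get(e["user"], 0) + e["bytes"]
```

In Pivot Tracing proper, the joined context is propagated dynamically ("baggage") rather than looked up in tables after the fact.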
Wednesday June 22, 2016 4:40pm - 5:05pm MDT
Satellite is a methodology, tool chain, and data-set for understanding global trends in website deployment and accessibility using only a single or small number of standard measurement nodes. Satellite collects information on DNS resolution and resource availability around the Internet by probing the IPv4 address space. These measurements are valuable in their breadth and sustainability: they do not require a distributed measurement infrastructure and therefore can be run at low cost and by multiple organizations. We demonstrate a clustering procedure which accurately captures the IP footprints of CDN deployments, and then show how this technique allows for more accurate determination of correct and incorrect IP resolutions. Satellite has multiple applications. It reveals the prevalence of CDNs by showing that 20% of the top 10,000 Alexa domains are hosted on shared infrastructure, and that CloudFlare alone accounts for nearly 10% of these sites. The same data-set detects 4,819 instances of ISP-level DNS hijacking in 117 countries.
Wednesday June 22, 2016 4:40pm - 5:05pm MDT
Free cooling lowers datacenter costs significantly, but may also expose servers to higher and more variable temperatures and relative humidities. It is currently unclear whether these environmental conditions have a significant impact on hardware component reliability. Thus, in this paper, we use data from nine hyperscale datacenters to study the impact of environmental conditions on the reliability of server hardware, with a particular focus on disk drives and free cooling. Based on this study, we derive and validate a new model of disk lifetime as a function of environmental conditions. Furthermore, we quantify the tradeoffs between energy consumption, environmental conditions, component reliability, and datacenter costs. Finally, based on our analyses and model, we derive server and datacenter design lessons.
We draw many interesting observations, including (1) relative humidity seems to have a dominant impact on component failures; (2) disk failures increase significantly when operating at high relative humidity, due to controller/adaptor malfunction; and (3) though higher relative humidity increases component failures, software availability techniques can mask them and enable free-cooled operation, resulting in significantly lower infrastructure and energy costs that far outweigh the cost of the extra component failures.
Wednesday June 22, 2016 5:05pm - 5:30pm MDT
The lower layers in the modern computing infrastructure are written in languages threatened by exploitation of memory management errors. Recently deployed exploit mitigations such as control-flow integrity (CFI) can prevent traditional return-oriented programming (ROP) exploits but are much less effective against newer techniques such as Counterfeit Object-Oriented Programming (COOP) that execute a chain of C++ virtual methods. Since these methods are valid control-flow targets, COOP attacks are hard to distinguish from benign computations. Code randomization is likewise ineffective against COOP. Until now, however, COOP attacks have been limited to vulnerable C++ applications which makes it unclear whether COOP is as general and portable a threat as ROP.
This paper demonstrates the first COOP-style exploit for Objective-C, the predominant programming language on Apple’s OS X and iOS platforms. We also retrofit the Objective-C runtime with the first practical and efficient defense against our novel attack. Our defense is able to protect complex, real-world software such as iTunes without recompilation. Our performance experiments show that the overhead of our defense is low in practice.
Wednesday June 22, 2016 5:05pm - 5:30pm MDT
Mingle with fellow attendees in the outdoor plaza for the Conference Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.
Wednesday June 22, 2016 6:30pm - 8:00pm MDT
This paper presents Callinicos, a robust storage system with a novel transaction protocol that generalizes minitransactions. This protocol allows Callinicos to cope with Byzantine failures, support cross-partition communication with transactions, and implement on-demand contention management. We have evaluated Callinicos with a set of micro-benchmarks and two realistic applications: a Twitter-like social network and a distributed message queue. Our experiments show that: (i) cross-partition communication improves performance by reducing the number of aborts, and (ii) the conflict resolution protocol results in no aborts in the presence of contention and no overhead in the absence of contention.
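For reference, the minitransaction primitive that Callinicos generalizes bundles compare, read, and write sets, applying the writes atomically only if every compare matches. A minimal single-node sketch (hypothetical in-memory store, not Callinicos's Byzantine-tolerant, partitioned protocol):

```python
# Sketch of the classic minitransaction primitive: the writes commit
# atomically only if every compare item matches its expected value.

def minitransaction(store, compares, reads, writes):
    """compares: {key: expected}, reads: [key], writes: {key: value}."""
    if any(store.get(k) != v for k, v in compares.items()):
        return None                       # abort: a compare failed
    result = {k: store.get(k) for k in reads}
    store.update(writes)                  # commit
    return result
```

Per the abstract, Callinicos goes further than this abort-on-conflict behavior, resolving contention on demand so that conflicting transactions need not abort.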
Thursday June 23, 2016 10:30am - 10:55am MDT
Hash tables are important data structures that lie at the heart of important applications such as key-value stores and relational databases. Typically, bucketized cuckoo hash tables (BCHTs) are used because they provide high-throughput lookups and load factors that exceed 95%. Unfortunately, this performance comes at the cost of reduced memory access efficiency. Positive lookups (where the key is in the table) and negative lookups (where it is not) access 1.5 and 2.0 buckets on average, respectively, which causes 50 to 100% more table-containing cache lines to be accessed than is minimally necessary.
To reduce these surplus accesses, this paper presents the Horton table, a revamped BCHT that reduces the expected cost of positive and negative lookups to fewer than 1.18 and 1.06 buckets, respectively, while still achieving load factors of 95%. The key innovation is remap entries, small in-bucket records that allow (1) more elements to be hashed using a single, primary hash function, (2) items that overflow buckets to be tracked and rehashed with one of many alternate functions while maintaining a worst-case lookup cost of 2 buckets, and (3) shortening the vast majority of negative searches to 1 bucket access. With these advancements, Horton tables outperform BCHTs by 17% to 89%.
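For context, a minimal BCHT baseline shows where the surplus probes come from: every key has two candidate buckets, so positive lookups may probe both and negative lookups always do. This sketch (illustrative sizes, no remap entries) is the behavior Horton tables improve on:

```python
# Minimal bucketized cuckoo hash table: two candidate buckets per key.

SLOTS = 4          # entries per bucket
MAX_KICKS = 64

class BCHT:
    def __init__(self, nbuckets):
        self.n = nbuckets
        self.buckets = [[] for _ in range(nbuckets)]

    def _b(self, key, fn):
        return self.buckets[hash((fn, key)) % self.n]

    def insert(self, key, val, depth=0):
        if depth > MAX_KICKS:
            raise RuntimeError("too full; a real BCHT would resize")
        for fn in (0, 1):                  # try both candidate buckets
            b = self._b(key, fn)
            if len(b) < SLOTS:
                b.append((key, val))
                return
        # Cuckoo kick: displace a resident of the primary bucket.
        b = self._b(key, 0)
        vk, vv = b.pop(0)
        b.append((key, val))
        self.insert(vk, vv, depth + 1)

    def lookup(self, key):
        probes = 0
        for fn in (0, 1):                  # a lookup probes up to 2 buckets
            probes += 1
            for k, v in self._b(key, fn):
                if k == key:
                    return v, probes
        return None, probes
```

Horton tables' remap entries let most keys live under the single primary hash function, so the common case probes only one bucket.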
Thursday June 23, 2016 10:30am - 10:55am MDT
Consensus is at the core of many production-grade distributed systems. Given the prevalence of these systems, it is important to offer consensus as a cloud service. To match the multi-tenant requirements of the cloud, consensus as a service must provide performance guarantees and prevent aggressive tenants from disrupting the others. Fulfilling this goal without overprovisioning and under-utilizing resources is not trivial.
We present Filo, the first system to provide consensus as a multi-tenant cloud service with throughput guarantees and efficient utilization of cloud resources. Tenants request an SLA by specifying their target throughput and degree of fault-tolerance. Filo then efficiently consolidates tenants on a shared set of servers using a novel placement algorithm that respects constraints imposed by the consensus problem. To respond to the load variations at runtime, Filo proposes a novel distributed controller that piggybacks on the consensus protocol to coordinate resource allocations across the servers and distribute the unused capacity fairly. Using a real testbed and simulations, we show that our placement algorithm is efficient at consolidating tenants, and while obtaining comparable efficiency and fairness, our distributed controller is ~5x faster than the centralized baseline approach.
Thursday June 23, 2016 10:55am - 11:20am MDT
Cloud providers must dynamically allocate their physical resources to the right client to maximize the benefit that they can get out of given hardware. Cache Allocation Technology (CAT) makes it possible for the provider to allocate last level cache to virtual machines to prevent cache pollution. The provider can also allocate the cache to optimize client benefit. But how should it optimize client benefit, when it does not even know what the client plans to do?
We present an auction-based mechanism that dynamically allocates cache while optimizing client benefit and improving hardware utilization. We evaluate our mechanism on benchmarks from the Phoronix Test Suite. Experimental results show that Ginseng for cache allocation improved clients' aggregated benefit by up to 42.8x compared with state-of-the-art static and dynamic algorithms.
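A greatly simplified sketch of auction-style cache allocation: each client reports marginal values for additional cache ways, and ways go one at a time to the highest current bid. The actual mechanism, bidding language, and pricing in Ginseng differ; this only conveys the flavor:

```python
# Toy auction for cache ways: allocate each way to the client whose
# next (marginal) way is worth the most. Bids and units are invented.

def allocate_ways(total_ways, marginal_bids):
    """marginal_bids: {client: [value of 1st way, 2nd way, ...]}."""
    alloc = {c: 0 for c in marginal_bids}
    for _ in range(total_ways):
        best, best_bid = None, 0.0
        for c, bids in marginal_bids.items():
            if alloc[c] < len(bids) and bids[alloc[c]] > best_bid:
                best, best_bid = c, bids[alloc[c]]
        if best is None:
            break
        alloc[best] += 1
    return alloc
```

Declining marginal bids model the diminishing benefit of extra cache; a real mechanism would also charge prices that keep clients truthful.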
Thursday June 23, 2016 10:55am - 11:20am MDT
Web services from search to games to stock trading impose strict Service Level Objectives (SLOs) on tail latency. Meeting these objectives is challenging because the computational demand of each request is highly variable and load is bursty. Consequently, many servers run at low utilization (10 to 45%); turn off simultaneous multithreading (SMT); and execute only a single service—wasting hardware, energy, and money. Although co-running batch jobs with latency critical requests to utilize multiple SMT hardware contexts (lanes) is appealing, unmitigated sharing of core resources induces non-linear effects on tail latency and SLO violations.
We introduce principled borrowing to control SMT hardware execution in which batch threads borrow core resources. A batch thread executes in a reserved batch SMT lane when no latency-critical thread is executing in the partner request lane. We instrument batch threads to quickly detect execution in the request lane, step out of the way, and promptly return the borrowed resources. We introduce the nanonap system call to stop the batch thread’s execution without yielding its lane to the OS scheduler, ensuring that requests have exclusive use of the core’s resources. We evaluate our approach for colocating batch workloads with latency-critical requests using the Apache Lucene search engine. A conservative policy that executes batch threads only when the request lane is idle improves utilization by between 25% and 90% on one core, depending on load, without compromising request SLOs. Our approach is straightforward, robust, and unobtrusive, opening the way to substantially improved resource utilization in datacenters running latency-critical workloads.
Thursday June 23, 2016 11:20am - 11:45am MDT
Coordination services like ZooKeeper, etcd, Doozer, and Consul are increasingly used by distributed applications for consistent, reliable, and high-speed coordination. When applications execute in multiple geographic regions, coordination service deployments trade off between performance (achieved by using independent services in separate regions) and consistency.
We present a system design for modular composition of services that addresses this trade-off. We implement ZooNet, a prototype of this concept over ZooKeeper. ZooNet allows users to compose multiple instances of the service in a consistent fashion, facilitating applications that execute in multiple regions. In ZooNet, clients that access only local data suffer no performance penalty compared to working with a standard single ZooKeeper. Clients that use remote and local ZooKeepers show up to 7x performance improvement compared to consistent solutions available today.
Thursday June 23, 2016 11:20am - 11:45am MDT
This paper shows that, by combining on-demand instantiation and lazy recovery, we can reduce the cost of asynchronous state machine replication protocols, such as Paxos and UpRight, while maintaining their high availability. To reduce cost, we incorporate on-demand instantiation, which activates only a subset of replicas at first and activates backup replicas when active ones fail. To address its key limitation (the system can stall for a long time while activating a backup replica), we apply lazy recovery, allowing the system to proceed while recovering backup nodes in the background. The key contribution of this paper is to identify that, when agreement nodes and execution nodes are logically separated, each presents a unique property that enables lazy recovery. We have applied this idea to Paxos and built ThriftyPaxos, which, as shown in our evaluation, achieves higher throughput and similar availability compared to standard Paxos, despite activating fewer replicas.
Thursday June 23, 2016 11:45am - 12:10pm MDT
The efficiency of modern multiprogrammed multicore machines is heavily impacted by traffic due to data sharing and contention due to competition for shared resources. In this paper, we demonstrate the importance of identifying latency tolerance, coupled with instruction-level parallelism, in assessing the benefits of colocating threads on the same socket or physical core for parallel efficiency. By adding hardware-counted CPU stall cycles due to cache misses to the measured statistics, we show that it is possible to infer latency tolerance at low cost. We develop and evaluate SAM-MPH, a multicore CPU scheduler that combines information on sources of traffic with tolerance for latency and need for computational resources. We also show the benefits of using a history of past intervals to introduce hysteresis when making mapping decisions, thereby avoiding oscillatory mappings and transient migrations that would impact performance. Experiments with a broad range of multiprogrammed parallel, graph processing, and data management workloads on 40-CPU and 80-CPU machines show that SAM-MPH obtains ideal performance for standalone applications and improves performance by up to 61% over the default Linux scheduler for mixed workloads.
Thursday June 23, 2016 11:45am - 12:10pm MDT
The need for scalable, high-performance datastores has led to the development of NoSQL databases, which achieve scalability by partitioning data over a single key. However, programmers often need to query data with other keys, which data stores provide by either querying every partition, eliminating the benefits of partitioning, or replicating additional indexes, wasting the benefits of data replication.
In this paper, we show there is no need to compromise scalability for functionality. We present Replex, a datastore that enables efficient querying on multiple keys by rethinking data placement during replication. Traditionally, a data store is first globally partitioned, then each partition is replicated identically to multiple nodes. Instead, Replex relies on a novel replication unit, termed replex, which partitions a full copy of the data based on its unique key. Replexes eliminate any additional overhead to maintaining indices, at the cost of increasing recovery complexity. To address this issue, we also introduce hybrid replexes, which enable a rich design space for trading off steady-state performance with faster recovery. We build, parameterize, and evaluate Replex on multiple dimensions and find that Replex surpasses the steady-state and failure recovery performance of HyperDex, a state-of-the-art multi-key data store.
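The core placement idea can be sketched directly: keep two full copies of the data, each partitioned by a different key, so a lookup on either key touches one partition of one copy. The names, keys, and partitioning function below are illustrative, not Replex's actual scheme:

```python
# Sketch of the replex idea: each full copy of the data is partitioned
# by a different key, so either lookup is a single-partition operation.

NPART = 4

def part(key):
    return hash(key) % NPART

class ReplexStore:
    def __init__(self):
        # One replex partitioned by id, one by email: same rows, two keys.
        self.by_id = [dict() for _ in range(NPART)]
        self.by_email = [dict() for _ in range(NPART)]

    def put(self, row):
        self.by_id[part(row["id"])][row["id"]] = row
        self.by_email[part(row["email"])][row["email"]] = row

    def get_by_id(self, id_):            # single-partition lookup
        return self.by_id[part(id_)].get(id_)

    def get_by_email(self, email):       # also single-partition
        return self.by_email[part(email)].get(email)
```

The recovery-complexity trade-off the abstract mentions arises because the two copies are partitioned differently, so rebuilding one partition must read from many partitions of the other replex.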
Thursday June 23, 2016 1:40pm - 2:05pm MDT
Locks are a natural place for improving the energy efficiency of software systems. First, concurrent systems are mainstream and when their threads synchronize, they typically do it with locks. Second, locks are well-defined abstractions, hence changing the algorithm implementing them can be achieved without modifying the system. Third, some locking strategies consume more power than others, thus the strategy choice can have a real effect. Last but not least, as we show in this paper, improving the energy efficiency of locks goes hand in hand with improving their throughput. It is a win-win situation.
We make our case for this throughput/energy-efficiency correlation through a series of observations obtained from an exhaustive analysis of the energy efficiency of locks on two modern processors and six software systems: Memcached, MySQL, SQLite, RocksDB, HamsterDB, and Kyoto Cabinet. We propose simple lock-based techniques for improving the energy efficiency of these systems by 33% on average, driven by higher throughput, and without modifying the systems.
Thursday June 23, 2016 1:40pm - 2:05pm MDT
Video transcoding plays a critical role in a video streaming service. Content owners and publishers need video transcoders to adapt their videos to different formats, bitrates, and qualities before streaming them to end users with the best quality of service. In this paper, we report our experience developing and deploying VideoCoreCluster, a low-cost, highly efficient video transcoder cluster for live video streaming services. We implemented the video transcoder cluster with low-cost single-board computers, specifically the Raspberry Pi Model B. The quality of the transcoded video delivered by our cluster is comparable with the best open source software-based video transcoder, and our video transcoders consume much less energy. We designed a scheduling algorithm based on priority and capacity so that the cluster manager can leverage the characteristics of adaptive bitrate video streaming technologies to provide a reliable and scalable service for the video streaming infrastructure. We have replaced the software-based transcoders for some TV channels in a live TV streaming service deployment on our university campus with this cluster.
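A toy sketch of priority-and-capacity scheduling in this spirit: tasks are assigned highest-priority-first to whichever worker has the most spare capacity. The workers, loads, and units below are invented, not the deployed algorithm:

```python
# Toy priority-and-capacity scheduler: highest-priority tasks first,
# each placed on the worker with the most free capacity that fits it.

def schedule(tasks, capacity):
    """tasks: [(name, priority, load)]; capacity: {worker: free_load}."""
    assignment = {}
    for name, _, load in sorted(tasks, key=lambda t: -t[1]):
        worker = max(capacity, key=capacity.get)   # most free capacity
        if capacity[worker] >= load:
            assignment[name] = worker
            capacity[worker] -= load
    return assignment
```

With adaptive bitrate streaming, lower-priority renditions of a channel can be dropped under overload while the high-priority rendition keeps the stream alive, which is the property such a scheduler exploits.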
Thursday June 23, 2016 2:05pm - 2:30pm MDT
The reuse distance (LRU stack distance) is an essential metric for performance prediction and optimization of storage and CPU caches. Over the last four decades, there have been steady improvements in the algorithmic efficiency of reuse distance measurement, and this progress has accelerated in recent years in both theory and practical implementation.
In this paper, we present a kinetic model of LRU cache memory, based on the average eviction time (AET) of the cached data. The AET model enables fast measurement and low-cost sampling. It can produce the miss ratio curve (MRC) in linear time with extremely low space costs. On both CPU and storage benchmarks, AET reduces the time and space costs compared to prior techniques. Furthermore, AET is a composable model that can characterize shared cache behavior by modeling individual programs.
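The AET construction can be sketched in a few lines: from the reuse-time histogram, P(t) is the fraction of references whose reuse time exceeds t, AET(c) is the smallest T whose cumulative P reaches the cache size c, and the miss ratio at size c is P(AET(c)). This simplified version follows that formulation but omits the paper's sampling machinery:

```python
from collections import Counter

# Miss ratio curve from reuse times, in the spirit of the AET model.
# P(t) = fraction of references with reuse time > t; AET(c) is the
# smallest T with sum_{t<T} P(t) >= c; miss ratio mr(c) = P(AET(c)).

def miss_ratio_curve(trace, max_c):
    n = len(trace)
    last, rtd, cold = {}, Counter(), 0
    for t, x in enumerate(trace):
        if x in last:
            rtd[t - last[x]] += 1        # reuse time in references
        else:
            cold += 1                    # first access: infinite reuse time
        last[x] = t
    horizon = max(rtd) if rtd else 1
    P, tail = [], n
    for t in range(horizon + 1):
        P.append(tail / n)
        tail -= rtd.get(t + 1, 0)
    mrc = []
    for c in range(max_c + 1):
        acc, T = 0.0, 0
        while acc < c and T < horizon:
            acc += P[T]
            T += 1
        mrc.append(P[T])
    return mrc
```

On a cyclic trace like abcabc..., the curve stays at 1.0 until the cache holds the whole working set, then drops to the cold-miss rate.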
Thursday June 23, 2016 2:05pm - 2:30pm MDT
Energy efficiency and timeliness (i.e., predictable job latency) are two essential – yet opposing – concerns for embedded systems. Hard timing guarantees require conservative resource allocation while energy minimization requires aggressively releasing resources and occasionally violating timing constraints. Recent work on approximate computing, however, opens up a new dimension of optimization: application accuracy. In this paper, we use approximate computing to achieve both hard timing guarantees and energy efficiency. Specifically, we propose MEANTIME: a runtime system that delivers hard latency guarantees and energy-minimal resource usage through small accuracy reductions. We test MEANTIME on a real Linux/ARM system with six applications. Overall, we find that MEANTIME never violates real-time deadlines and sacrifices a small amount (typically less than 2%) of accuracy while reducing energy to 54% of a conservative, full accuracy approach.
Thursday June 23, 2016 2:30pm - 2:55pm MDT
We propose a new HTM-assisted concurrency control protocol, called HTCC, that achieves high scalability and robustness when processing OLTP workloads. HTCC attains its goal using a two-pronged strategy that exploits the strengths of HTM. First, it distinguishes between hot and cold records, and deals with each type differently – while accesses to highly contended data are protected using conventional fine-grained locks, accesses to cold data are HTM-guarded. This remarkably reduces the database transaction abort rate and exploits HTM’s effectiveness in executing low-contention critical sections. Second, to minimize the overhead inherited from successive restarts of aborted database transactions, HTCC caches the internal execution states of a transaction for performing delta-restoration, which partially updates the maintained read/write set and bypasses redundant index lookups during transaction re-execution on a best-effort basis. This approach is greatly facilitated by HTM’s speedy hardware mechanism for ensuring atomicity and isolation. We evaluated HTCC in a main-memory database prototype running on a 4-socket machine (40 cores in total), and confirmed that HTCC can scale near-linearly, yielding high transaction rates even under highly contended workloads.
Thursday June 23, 2016 2:30pm - 2:55pm MDT
NAND-based solid-state (flash) drives are known for providing better performance than magnetic disk drives, but they have limits on endurance, the number of times data can be erased and overwritten. Furthermore, the unit of erasure can be many times larger than the basic unit of I/O; this leads to complexity with respect to consolidating live data and erasing obsolete data. When flash drives are used as a cache for a larger, disk-based storage system, the choice of a cache replacement algorithm can make a significant difference in both performance and endurance. While there are many cache replacement algorithms, their effectiveness is hard to judge due to the lack of a baseline against which to compare them: Belady’s MIN, the usual offline best-case algorithm, considers read hit ratio but not endurance.
We explore offline algorithms for flash caching in terms of both hit ratio and flash lifespan. We design and implement a multi-stage heuristic by synthesizing several techniques that manage data at the granularity of a flash erasure unit (which we call a container) to approximate the offline optimal algorithm. We find that simple techniques contribute most of the available erasure savings. Our evaluation shows that the container-optimized offline heuristic is able to provide the same optimal read hit ratio as MIN with 67% fewer flash erasures. More fundamentally, our investigation provides a useful approximate baseline for evaluating any online algorithm, highlighting the importance of comparing new policies for caching compound blocks in flash.
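Belady's MIN, the offline baseline discussed above, is short enough to state exactly: on a miss with a full cache, evict the block whose next use lies farthest in the future. A direct implementation:

```python
# Belady's MIN: the offline hit-ratio-optimal replacement policy.
# On a miss with a full cache, evict the block reused farthest ahead.

def min_hits(trace, size):
    # Precompute, for each position, the next use of that block.
    next_use = [float("inf")] * len(trace)
    seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = seen.get(trace[i], float("inf"))
        seen[trace[i]] = i
    cache = {}                     # block -> position of its next use
    hits = 0
    for i, b in enumerate(trace):
        if b in cache:
            hits += 1
        elif len(cache) >= size:
            farthest = max(cache, key=cache.get)
            del cache[farthest]
        cache[b] = next_use[i]
    return hits
```

MIN maximizes read hits but is indifferent to erasures, which is exactly the gap the container-optimized heuristic above targets.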
Thursday June 23, 2016 2:55pm - 3:20pm MDT
Modern RDMA hardware offers the potential for exceptional performance, but design choices including which RDMA operations to use and how to use them significantly affect observed performance. This paper lays out guidelines that can be used by system designers to navigate the RDMA design space. Our guidelines emphasize paying attention to low-level details such as individual PCIe transactions and NIC architecture. We empirically demonstrate how these guidelines can be used to improve the performance of RDMA-based systems: we design a networked sequencer that outperforms an existing design by 50x, and improve the CPU efficiency of a prior high-performance key-value store by 83%. We also present and evaluate several new RDMA optimizations and pitfalls, and discuss how they affect the design of RDMA systems.
Thursday June 23, 2016 3:50pm - 4:15pm MDT
FSCQ is the first file system with a machine-checkable proof (using the Coq proof assistant) that its implementation meets its specification and whose specification includes crashes. FSCQ provably avoids bugs that have plagued previous file systems, such as performing disk writes without sufficient barriers or forgetting to zero out directory blocks. If a crash happens at an inopportune time, these bugs can lead to data loss. FSCQ’s theorems prove that, under any sequence of crashes followed by reboots, FSCQ will recover the file system correctly without losing data.
To state FSCQ’s theorems, this paper introduces the Crash Hoare logic (CHL), which extends traditional Hoare logic with a crash condition, a recovery procedure, and logical address spaces for specifying disk states at different abstraction levels. CHL also reduces the proof effort for developers through proof automation. Using CHL, we developed, specified, and proved the correctness of the FSCQ file system. Although FSCQ’s design is relatively simple, experiments with FSCQ running as a user-level file system show that it is sufficient to run Unix applications with usable performance. FSCQ’s specifications and proofs required significantly more work than the implementation, but the work was manageable even for a small team of a few researchers.
Thursday June 23, 2016 3:50pm - 4:15pm MDT
In traditional client-server designs, all requests are processed at the server storing the state, thereby maintaining strict locality between computation and state. The adoption of RDMA (Remote Direct Memory Access) makes it practical to relax locality by letting clients fetch server state and process requests themselves. Such client-side processing improves performance when the server CPU, instead of the network, is the bottleneck. We observe that combining server-side and client-side processing allows systems to balance and adapt to the available CPU and network resources with minimal configuration, and can free resources for other CPU-intensive work.
We present Cell, a distributed B-tree store that combines client-side and server-side processing. Cell distributes a global B-tree of “fat” (64MB) nodes across machines for server-side searches. Within each fat node, Cell organizes keys as a local B-tree of RDMA-friendly small nodes for client-side searches. Cell clients dynamically select whether to use client-side or server-side processing in response to available resources and the current workload. Our evaluation on a large RDMA-capable cluster shows that Cell scales well and that its dynamic selector effectively responds to resource availability and workload properties.
Thursday June 23, 2016 4:15pm - 4:40pm MDT
Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities.
This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus “virtually” speeding it up.
We present COZ, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. COZ identifies previously unknown optimization opportunities that are both significant and targeted. Guided by COZ, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.
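The virtual-speedup idea can be illustrated with the arithmetic for the simplest (serial) case. This is a toy sketch, not COZ's implementation; the function names are made up:

```python
# Toy illustration of virtual speedup: the profiler never edits the line.
# It measures a run with inserted pauses and, knowing the total pause it
# inserted, recovers the runtime an actual optimization would achieve.

def actual_optimized_runtime(line_time, other_time, speedup):
    """Ground truth: the line really runs `speedup` fraction faster."""
    return line_time * (1 - speedup) + other_time

def virtual_speedup_prediction(line_time, other_time, speedup):
    """Emulation: pause the rest of the program by line_time * speedup
    while the line runs, then correct for the known inserted delay."""
    delay = line_time * speedup
    measured_with_pauses = line_time + other_time + delay
    # Subtract the delay once to undo the inserted pause, and once more
    # for the time a real optimization of the line would have saved.
    return measured_with_pauses - 2 * delay
```

In the concurrent setting COZ targets, the pauses are applied to all other threads whenever the selected line executes, but the correction principle is the same: the inserted delay is known, so the optimized runtime can be predicted without modifying any code.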
Thursday June 23, 2016 4:15pm - 4:40pm MDT
We present new biases in RC4, break the Wi-Fi Protected Access Temporal Key Integrity Protocol (WPA-TKIP), and design a practical plaintext recovery attack against the Transport Layer Security (TLS) protocol. To empirically find new biases in the RC4 keystream we use statistical hypothesis tests. This reveals many new biases in the initial keystream bytes, as well as several new long-term biases. Our fixed-plaintext recovery algorithms are capable of using multiple types of biases, and return a list of plaintext candidates in order of decreasing likelihood. To break WPA-TKIP we introduce a method to generate a large number of identical packets. This repeated packet is decrypted by generating its plaintext candidate list, and using redundant packet structure to prune bad candidates. From the decrypted packet we derive the TKIP MIC key, which can be used to inject and decrypt packets. In practice the attack can be executed within an hour. We also attack TLS as used by HTTPS, where we show how to decrypt a secure cookie with a success rate of 94% using 9 × 2^27 ciphertexts. This is done by injecting known data around the cookie, abusing this using Mantin’s ABSAB bias, and brute-forcing the cookie by traversing the plaintext candidates. Using our traffic generation technique, we are able to execute the attack in merely 75 hours.
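The candidate-pruning step described above can be sketched as follows. This is a hypothetical illustration; the toy checksum stands in for real packet redundancy such as TKIP's structure:

```python
# Hypothetical sketch of pruning a ranked plaintext-candidate list using
# redundant packet structure: candidates arrive ordered by decreasing
# likelihood from the bias-based recovery algorithm, and structurally
# invalid guesses are discarded.

def prune_candidates(candidates, is_well_formed):
    """Return the most likely candidate that passes the structural check,
    or None if every candidate is rejected."""
    for plaintext in candidates:
        if is_well_formed(plaintext):
            return plaintext
    return None

def toy_checksum_ok(packet):
    """Toy redundancy check: last byte is the sum of the rest mod 256."""
    return len(packet) >= 2 and packet[-1] == sum(packet[:-1]) % 256
```

With a real packet format, the structural check would validate known header fields and checksums rather than this invented rule.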
Thursday June 23, 2016 4:40pm - 5:05pm MDT
We present a comprehensive and quantitative study of the development of the Linux memory manager. The study examines 4587 committed patches over five years (2009-2015), since Linux version 2.6.32. Insights derived from this study concern the development process of the virtual memory system, including its patch distribution and patterns, and techniques for memory optimizations and semantics. Specifically, we find that changes to the memory manager are highly centralized around key functionalities, such as the memory allocator, page fault handler, and memory resource controller. Unexpectedly, the well-developed memory manager still suffers from an increasing number of bugs. Memory optimizations mainly focus on data structures, memory policies, and the fast path. To the best of our knowledge, this is the first such study of the virtual memory system.
Thursday June 23, 2016 4:40pm - 5:05pm MDT
In the enterprise world, retaining data backups is the de-facto solution against data loss in the event of catastrophic failures. As backup software evolves to achieve faster backup and recovery times, however, backup systems deploying it become increasingly complex to administer. This complexity stems from optimizations targeted to specific applications, which increase the number of configuration parameters for the system. Still, there is no work in the literature that attempts to study the error characteristics of enterprise backup systems, despite our reliance on the guarantees they provide.
With this study we aim to help researchers and practitioners understand how backup system jobs fail, and identify factors that can be used to predict these failures. Our results are derived from an analysis of data on 775 million jobs, collected from more than 20,000 backup software installations over a span of 3 years. We confirm that trends reported in the software reliability literature also hold for backup systems, such as that the majority of job errors are due to misconfigurations. For the systems in our dataset, we find that error rates remain stable across software versions and over time. To better understand these errors, we investigate the effect of several factors on the system’s error rate, such as job sizes and policy complexity, and demonstrate their predictive power for future errors.
Thursday June 23, 2016 5:05pm - 5:30pm MDT
Software bugs are a well-known source of security vulnerabilities. One technique for finding bugs, symbolic execution, considers all possible inputs to a program but suffers from scalability limitations. This paper uses a variant, under-constrained symbolic execution, that improves scalability by directly checking individual functions, rather than whole programs. We present UC-KLEE, a novel, scalable framework for checking C/C++ systems code, along with two use cases. First, we use UC-KLEE to check whether patches introduce crashes. We check over 800 patches from BIND and OpenSSL and find 12 bugs, including two OpenSSL denial-of-service vulnerabilities. We also verify (with caveats) that 115 patches do not introduce crashes. Second, we use UC-KLEE as a generalized checking framework and implement checkers to find memory leaks, uninitialized data, and unsafe user input. We evaluate the checkers on over 20,000 functions from BIND, OpenSSL, and the Linux kernel, find 67 bugs, and verify that hundreds of functions are leak free and that thousands of functions do not access uninitialized data.
Thursday June 23, 2016 5:05pm - 5:30pm MDT
Check out the cool new ideas and the latest preliminary work on display at the Poster Session and Happy Hour. Take advantage of an opportunity to mingle with colleagues who may be interested in the same area while enjoying complimentary food and drinks. The list of accepted posters is now available. https://www.usenix.org/conference/atc16/accepted-posters
Thursday June 23, 2016 6:30pm - 8:00pm MDT
Deterministic replay, which provides the ability to travel backward in time and reconstruct the past execution flow of a multiprocessor system, has many prominent applications. Prior research in this area can be classified into two categories: hardware-only schemes and software-only schemes. While hardware-only schemes deliver high performance, they require significant modifications to the existing hardware which makes them difficult to deploy in real systems. In contrast, software-only schemes work on commodity hardware, but suffer from excessive performance overhead and huge logs caused by tracing every single memory access in the software layer.
In this paper, we present the design and implementation of a novel system, Samsara, which uses hardware-assisted virtualization (HAV) extensions to achieve efficient and practical deterministic replay without requiring any hardware modification. Unlike prior software schemes, which trace every single memory access to record interleaving, Samsara leverages the HAV extensions on commodity processors to track the read-set and write-set, implementing a chunk-based recording scheme in software. By doing so, we avoid all memory access detections, which are a major source of overhead in prior works. We implement and evaluate our system in KVM on a commodity Intel Haswell processor. Evaluation results show that, compared with prior software-only schemes, Samsara significantly reduces the log file size to 1/70th on average, and further reduces the recording overhead from about 10x, as reported by state-of-the-art works, to 2.3x on average.
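A chunk-based scheme of this kind decides whether two concurrently executed chunks can commit independently by intersecting their tracked read- and write-sets. A minimal sketch, with hypothetical structures (page-granularity sets, as HAV tracking would naturally provide):

```python
# Illustrative conflict check for chunk-based recording: two chunks conflict
# (and must be ordered in the log) if either chunk's write-set intersects
# the other's read-set or write-set.

def chunks_conflict(chunk_a, chunk_b):
    """Each chunk is {'reads': set_of_pages, 'writes': set_of_pages}."""
    return bool(
        chunk_a["writes"] & (chunk_b["reads"] | chunk_b["writes"]) or
        chunk_b["writes"] & (chunk_a["reads"] | chunk_a["writes"])
    )
```

Read-read overlap is deliberately not a conflict: replay only needs an order between chunks when one of them modified data the other touched.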
Friday June 24, 2016 9:00am - 9:25am MDT
There is rising interest in accelerating stream processing on modern parallel hardware, yet it remains a challenge to exploit the available resources for higher throughput without sacrificing latency, given the increased length of the processing pipeline and communication path and the need for central coordination. To achieve these objectives, we introduce a novel top-down data flow model for stream join processing (arguably one of the most resource-intensive operators in stream processing), called SplitJoin, that operates by splitting the join operation into independent storing and processing steps that gracefully scale with the number of cores. Furthermore, SplitJoin eliminates the need for global coordination while preserving the order of input streams by re-thinking how streams are channeled into distributed join computation cores, and maintains the order of output streams through a novel distributed punctuation technique. Throughout our experimental analysis, SplitJoin offered up to 60% improvement in throughput while reducing latency by up to 3.3x compared to state-of-the-art solutions.
Friday June 24, 2016 9:00am - 9:25am MDT
As more and more mobile applications need to run security-critical codes (SCCs) for secure transactions and critical information handling, the demand for a Trusted Execution Environment (TEE) to ensure safe execution of SCCs is rapidly escalating. Although a number of studies have implemented TEEs using TrustZone or hypervisors and have demonstrated their effectiveness in terms of security, these approaches face major challenges when considering deployment in mobile devices. TrustZone-based approaches bloat the TCB of the system, as they must increase the code base size of the most privileged software. Hypervisor-based approaches incur performance overhead on mobile devices that are already suffering from resource restrictions.
To alleviate these problems, in this paper we propose a hybrid approach that utilizes both TrustZone and a hypervisor. Our approach implements a TEE using a hypervisor, while mitigating performance overhead by activating the hypervisor only when the TEE is demanded by SCCs. This scheme, called on-demand hypervisor activation, has been efficiently and securely implemented by leveraging the memory protection capability of TrustZone. We have implemented our system and evaluated it with real-world applications. The results show that our system can successfully protect SCCs without any noticeable delay (< 100 μs), while limiting the overhead increase due to our hypervisor during its hibernation to near 0%.
Friday June 24, 2016 9:25am - 9:50am MDT
Single-PC, disk-based processing of big graphs has recently gained much popularity. At the core of an efficient disk-based system is a well-designed partition structure that can minimize random disk accesses. All existing systems use static partitions that are created before processing starts. These partitions have static layouts and are loaded entirely into memory in every single iteration even though much of the edge data is not changed across many iterations, causing these unchanged edges to have zero new impact on the computation of vertex values.
This work provides a general optimization that removes this I/O inefficiency by employing dynamic partitions whose layouts are dynamically adjustable. Our implementation of this optimization in GraphChi, a representative out-of-core vertex-centric graph system, yielded speedups of 1.5–2.8× on six large graphs. Our idea is generally applicable to other systems as well.
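The I/O saving from dynamic partitions can be sketched with a toy model. This is purely illustrative; the edge ids stand in for the edge data in a GraphChi-style on-disk partition:

```python
# Toy model of why dynamic partitions save I/O: a static layout is loaded
# in full every iteration, while a dynamically adjusted layout only reloads
# the edges whose values changed in the previous iteration.

def iteration_reads(partition, changed):
    """partition: edge ids in layout order; changed: ids updated in the
    previous iteration. Returns (static_reads, dynamic_reads)."""
    static_reads = list(partition)                      # entire partition
    dynamic_reads = [e for e in partition if e in changed]
    return static_reads, dynamic_reads
```

When only a small fraction of edges change between iterations, `dynamic_reads` is correspondingly small, which is the source of the reported speedups.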
Friday June 24, 2016 9:25am - 9:50am MDT
With increasingly GPU-intensive workloads deployed in the cloud, cloud service providers are seeking practical and efficient GPU virtualization solutions. However, cutting-edge GPU virtualization techniques such as gVirt still suffer from limited scalability, which constrains the number of guest virtual GPU instances.
This paper introduces gScale, a scalable GPU virtualization solution. By taking advantage of the GPU programming model, gScale presents a dynamic sharing mechanism which combines partition and sharing together to break the hardware limitation of global graphics memory space. Particularly, we propose three approaches for gScale: (1) the private shadow graphics translation table, which enables global graphics memory space sharing among virtual GPU instances, (2) ladder mapping and fence memory space pool, which allows the CPU to access host physical memory space (serving the graphics memory) bypassing global graphics memory space, (3) slot sharing, which improves the performance of vGPU under a high density of instances.
The evaluation shows that gScale scales up to 15 guest virtual GPU instances in Linux or 12 guest virtual GPU instances in Windows, 5x and 4x the scalability of gVirt, respectively. At the same time, gScale incurs only a slight runtime overhead on the performance of gVirt when hosting multiple virtual GPU instances.
Friday June 24, 2016 9:50am - 10:15am MDT
Multi-version graph processing, where each version corresponds to a snapshot of an evolving graph, is a common scenario in large-scale graph processing. Straightforward application of existing graph processing systems often yields suboptimal performance due to high version-switching cost. We present Version Traveler (VT), a graph processing system featuring fast and memory-efficient version switching. VT achieves fast version switching by (i) representing differences among versions as deltas and (ii) constructing the next version by integrating the in-memory graph representation of the current version with the delta(s) relating the two versions. Furthermore, VT maintains high computation performance and memory compactness. Our evaluation using multi-version processing workloads with realistic datasets shows that VT outperforms PowerGraph, running 23x faster with a 15% memory overhead. VT is also superior to four multi-version processing systems, achieving up to 90% improvement when jointly considering processing time and resource consumption.
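Constructing the next version by applying a delta to the current in-memory representation can be sketched as follows. The structures are hypothetical, not VT's actual code:

```python
# Illustrative delta-based version switch: instead of loading each snapshot
# from scratch, the next version's adjacency is produced by applying a delta
# of edge additions and removals to the current version in place.

def switch_version(adjacency, delta):
    """adjacency: {vertex: set(neighbors)};
    delta: {'add': [(u, v), ...], 'remove': [(u, v), ...]}.
    Mutates and returns the adjacency."""
    for u, v in delta["remove"]:
        adjacency[u].discard(v)
    for u, v in delta["add"]:
        adjacency.setdefault(u, set()).add(v)
    return adjacency
```

The cost of a switch is proportional to the delta size rather than the snapshot size, which is where the speedup over reloading whole versions comes from.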
Friday June 24, 2016 9:50am - 10:15am MDT
Dynamic binary translation (DBT) translates binary code from one instruction set architecture (ISA) to another (same or different) ISA at runtime, which makes it very useful in many applications such as system virtualization, whole program analysis, system debugging, and system security. Many techniques have been proposed to improve the efficiency of DBT systems for long-running and loop-intensive applications. However, for applications with short running time or long-running but with few hot code regions such as JavaScript and C# applications in web services, such techniques have difficulty in amortizing the overhead incurred during binary translation.
To reduce the translation overhead for such applications, this paper presents a general persistent code caching framework, which allows the reuse of translated binary code across different executions for the same or different applications. Compared to existing approaches, the proposed approach can seamlessly handle even dynamically generated code, which is very popular in script applications today. A prototype of the proposed framework has been implemented in an existing retargetable DBT system. Experimental results on a set of applications, including C/C++ and JavaScript, demonstrate that it can achieve 76.4% performance improvement on average compared to the original DBT system without helper threads for dynamic binary translation, and 9% performance improvement on average over the same DBT system with helper threads when code reuse is combined with helper threads.
Friday June 24, 2016 10:15am - 10:40am MDT
Given current technology trends towards fast storage devices and the need for increasing data processing density, it is important to examine key-value store designs that reduce CPU overhead. However, current key-value stores are still designed mostly for hard disk drives (HDDs) that exhibit a large difference between sequential and random access performance, and they incur high CPU overheads.
In this paper we present Tucana, a feature-rich key-value store that achieves low CPU overhead. Our design starts from a Bε-tree approach to maintain asymptotic properties for inserts and uses three techniques to reduce overheads: copy-on-write, private allocation, and direct device management. In our design we favor choices that reduce overheads over sequential device accesses and large I/Os.
We evaluate our approach against RocksDB, a state-of-the-art key-value store, and show that our approach improves CPU efficiency by up to 9.2x and by an average of 6x across all workloads we examine. In addition, Tucana improves throughput compared to RocksDB by up to 7x. We then use Tucana to replace the storage engine of HBase and compare it to native HBase and Cassandra, two of the most popular NoSQL stores. Our results show that Tucana outperforms HBase by up to 8x in CPU efficiency and by up to 10x in throughput. Tucana’s improvements are even higher when compared to Cassandra.
Friday June 24, 2016 10:15am - 10:40am MDT
In recent years, operating systems have become increasingly complex and thus prone to security and performance issues. Accordingly, system updates to address these issues have become more frequently available and increasingly important. To complete such updates, users must reboot their systems, resulting in unavoidable downtime and loss of the state of running applications.
We present KUP, a practical OS update mechanism that uses a userspace checkpoint-and-restart mechanism, by using an optimized data structure for checkpointing and a memory persistence mechanism across the update, combined with a fast in-place kernel switch. This allows for instant kernel updates spanning major kernel versions without any kernel modifications.
Our evaluation shows that KUP can support any type of real kernel patch (e.g., security fixes, minor updates, or even major releases) with large-scale applications such as memcached and mysql, or even in the middle of a Linux kernel compilation, unlike well-known dynamic hot-patching techniques (e.g., ksplice). Moreover, KUP can update a running Linux kernel in 3 seconds (overall downtime).
Friday June 24, 2016 11:00am - 11:25am MDT
Wi-Fi has traditionally been considered a power-consuming communication system and has not been widely adopted in the sensor network and IoT space. We introduce Passive Wi-Fi, which demonstrates for the first time that one can generate 802.11b transmissions using backscatter communication, while consuming 3–4 orders of magnitude lower power than existing Wi-Fi chipsets. Passive Wi-Fi transmissions can be decoded on any Wi-Fi device, including routers, mobile phones, and tablets. Building on this, we also present a network stack design that enables passive Wi-Fi transmitters to coexist with other devices in the ISM band, without incurring the power consumption of carrier sense and medium access control operations. We build prototype hardware and implement all four 802.11b bit rates on an FPGA platform. Our experimental evaluation shows that passive Wi-Fi transmissions can be decoded on off-the-shelf smartphones and Wi-Fi chipsets over distances of 30–100 feet in various line-of-sight and through-the-wall scenarios. Finally, we design a passive Wi-Fi IC that shows that 1 and 11 Mbps transmissions consume 14.5 and 59.2 µW respectively. This translates to 10,000x lower power than existing Wi-Fi chipsets and 1000x lower power than Bluetooth LE and ZigBee.
Friday June 24, 2016 11:00am - 11:25am MDT
Software-Defined Internet Exchange Points (SDXes) promise to significantly increase the flexibility and function of interdomain traffic delivery on the Internet. Unfortunately, current SDX designs cannot yet achieve the scale required for large Internet exchange points (IXPs), which can host hundreds of participants exchanging traffic for hundreds of thousands of prefixes. Existing platforms are indeed too slow and inefficient to operate at this scale, typically requiring minutes to compile policies and millions of forwarding rules in the data plane.
We motivate, design, and implement iSDX, the first SDX architecture that can operate at the scale of the largest IXPs. We show that iSDX reduces both policy compilation time and forwarding table size by two orders of magnitude compared to current state-of-the-art SDX controllers. Our evaluation against a trace from one of the largest IXPs in the world found that iSDX can compile a realistic set of policies for 500 IXP participants in less than three seconds. Our public release of iSDX, complete with tutorials and documentation, is already spurring early adoption in operational networks.
Friday June 24, 2016 11:25am - 11:50am MDT
In this paper we present a novel system that incorporates programmable hardware (an FPGA) into a smartphone to enable a vision where apps can include both software and hardware components, or "apps with hardware". We introduce a novel mechanism for sharing the FPGA in a practical manner by leveraging the unique deployment model of mobile applications, namely that deployment is via an app store, for which we introduce a new cloud-based compilation. We present our prototype smartphone using the Zedboard, which pairs a Xilinx Zynq FPGA with an embedded Cortex A9, running an Android-based system that we extended to provide run-time support for dynamically managing apps with hardware and a secure loading system. With this prototype, our evaluation demonstrates performance gains of 3x to several orders of magnitude (with room for improvement) for an AES encryption module (representing cryptography), a QAM modulation module (representing software-defined radio), and a hardware-based memory scanner (representing custom co-processors). We demonstrate the feasibility of our cloud-based compilation within the context of real app store statistics. Finally, we present a case study of a complete integration of hardware into an existing application (the Orbot Tor client).
Friday June 24, 2016 11:25am - 11:50am MDT
Recognition of human activities and gestures using preexisting WiFi signals has been shown to be feasible in recent studies. Given the pervasiveness of WiFi signals, this emerging sort of sensing poses a serious privacy threat. This paper is the first to counter the threat of unwanted or even malicious communication-based sensing: it proposes PhyCloak, a black-box sensor obfuscation technique that distorts only the physical information in the communication signal that leaks privacy. The data in the communication signal is preserved and, in fact, the throughput of the link is increased with careful design. Moreover, the design allows coupling of the PhyCloak module with legitimate sensors, so that their sensing is preserved while that of illegitimate sensors is obfuscated. The effectiveness of the design is validated via a prototype implementation on an SDR platform.
Friday June 24, 2016 11:50am - 12:15pm MDT
Device drivers may encounter errors when communicating with the OS kernel and hardware. However, error handling code often receives insufficient attention in driver development and testing, because these errors rarely occur in real execution. For this reason, many bugs hide in error handling code. Previous approaches for testing error handling code often neglect the characteristics of device drivers, which limits their efficiency and accuracy. In this paper, we first study the source code of Linux drivers to find useful characteristics of error handling code. We then use these characteristics in fault injection testing and propose a practical approach named EH-Test, which can automatically and efficiently test error handling code in drivers. To improve the representativeness of injected faults, we design a pattern-based extraction strategy to automatically and accurately extract target functions that can actually fail and trigger error handling code. During execution, we use a monitor to record runtime information and paired checkers to check resource usages. We have evaluated EH-Test on 15 real Linux device drivers and found 50 new bugs in Linux 3.17.2. Code coverage is also effectively increased. Comparison experiments with previous approaches also show the effectiveness of EH-Test.
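Fault injection into "can fail" target functions can be sketched in user space as follows. This is a hypothetical analogue for illustration; EH-Test itself instruments kernel driver code:

```python
# Illustrative fault injector: a function matching a "can fail" pattern is
# wrapped so that one chosen call returns an error value instead of its
# normal result, exercising the caller's error handling path.

def make_fault_injector(fn, fail_on_call, error_value=None):
    """Return a wrapper behaving like `fn`, except that call number
    `fail_on_call` returns `error_value` (simulating, e.g., an allocation
    failure)."""
    state = {"calls": 0}

    def wrapper(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] == fail_on_call:
            return error_value  # injected fault
        return fn(*args, **kwargs)

    return wrapper
```

Varying `fail_on_call` across runs walks the injected fault through successive call sites, which is the basic way fault injection exposes untested error handling branches.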
Friday June 24, 2016 11:50am - 12:15pm MDT
In a virtualized environment, it is not difficult to retrieve guest OS information from its hypervisor. However, it is very challenging to retrieve information in the reverse direction, i.e., retrieve the hypervisor information from within a guest OS, which remains an open problem that has not yet been comprehensively studied. In this paper, we take the initiative and study this reverse information retrieval problem. In particular, we investigate how to determine the host OS kernel version from within a guest OS. We observe that modern commodity hypervisors introduce new features and bug fixes in almost every new release. Thus, by carefully analyzing the seven-year evolution of Linux KVM development (including 3485 patches), we identify 19 features and 20 bugs in the hypervisor detectable from within a guest OS. Building on our detection of these features and bugs, we present a novel framework called Hyperprobe that, for the first time, enables users in a guest OS to automatically detect the underlying host OS kernel version in a few minutes. We implement a prototype of Hyperprobe and evaluate its effectiveness in five real world clouds, including Google Compute Engine (a.k.a. Google Cloud), HP Helion Public Cloud, ElasticHosts, Joyent Cloud, and CloudSigma, as well as in a controlled testbed environment, all yielding promising results.
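Inferring the host kernel version from observed features and bugs can be sketched as interval intersection: each detectable feature or bug constrains the version to a range, and combining observations narrows the range. The feature names and version numbers below are invented for illustration:

```python
# Illustrative Hyperprobe-style inference: each feature/bug is present from
# the version that introduced it until the version that removed/fixed it
# (exclusive); intersecting the constraints bounds the host kernel version.

def infer_version_range(observations, feature_ranges):
    """observations: {feature: bool (detected or not)};
    feature_ranges: feature -> (introduced, removed), as comparable version
    tuples, with removed=None meaning never removed.
    Returns (lower_bound, exclusive_upper_bound)."""
    lo, hi = (0,), (999,)
    for feat, present in observations.items():
        introduced, removed = feature_ranges[feat]
        if present:
            lo = max(lo, introduced)
            if removed is not None:
                hi = min(hi, removed)   # feature was later removed/fixed
        elif removed is None:
            hi = min(hi, introduced)    # absence means we predate it
        # An absent, later-removed feature is ambiguous (before introduction
        # or after removal); that disjunctive case is ignored in this sketch.
    return lo, hi
```

With enough observed features and bug fixes, the surviving interval shrinks to a handful of candidate versions, which matches the paper's "a few minutes to a narrow version estimate" workflow.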
Friday June 24, 2016 12:15pm - 12:40pm MDT
NUMA multicore machines are pervasive and many multithreaded applications are suffering from lock contention. To mitigate this issue, application and library developers can choose from the plethora of optimized mutex lock algorithms that have been designed over the past 25 years. Unfortunately, there is currently no broad study of the behavior of these optimized lock algorithms on realistic applications. In this paper, we fill this gap. We perform a performance study of 19 state-of-the-art mutex lock algorithms on 36 realistic applications. Our study shows that regarding locking on multicore machines, the case is not closed yet. Indeed, our conclusions include the following findings: (i) no single lock is the best for more than 50% of the studied workloads; (ii) every lock is harmful for several applications, even if the application parallelism is properly tuned; (iii) for several applications, the optimal lock changes when varying the number of applications or the workload. These findings call for further research on optimized lock algorithms and dynamic adaptation of contention management.
Friday June 24, 2016 12:15pm - 12:40pm MDT