Arrakis: The Operating System is the Control Plane

Simon Peter∗  Jialin Li∗  Irene Zhang∗  Dan R. K. Ports∗  Doug Woos∗
Arvind Krishnamurthy∗  Thomas Anderson∗  Timothy Roscoe†

∗University of Washington  †ETH Zurich

Abstract

Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications, to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2–5× in latency and 9× in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.

1 Introduction

Reducing the overhead of the operating system process abstraction has been a longstanding goal of systems design. This issue has become particularly salient with modern client-server computing. The combination of high speed Ethernet and low latency persistent memories is considerably raising the efficiency bar for I/O intensive software. Many servers spend much of their time executing operating system code: delivering interrupts, demultiplexing and copying network packets, and maintaining file system meta-data. Server applications often perform very simple functions, such as key-value table lookup and storage, yet traverse the OS kernel multiple times per client request.

These trends have led to a long line of research aimed at optimizing kernel code paths for various use cases: eliminating redundant copies in the kernel [45], reducing the overhead for large numbers of connections [27], protocol specialization [44], resource containers [8, 39], direct transfers between disk and network buffers [45], interrupt steering [46], system call batching [49], hardware TCP acceleration, etc. Much of this has been adopted in mainline commercial OSes, and yet it has been a losing battle: we show that the Linux network and file system stacks have latency and throughput many times worse than that achieved by the raw hardware.

Twenty years ago, researchers proposed streamlining packet handling for parallel computing over a network of workstations by mapping the network hardware directly

into user space [19, 22, 54]. Although commercially unsuccessful at the time, the virtualization market has now led hardware vendors to revive the idea [6, 38, 48], and also extend it to disks [52, 53].

This paper explores the OS implications of removing the kernel from the data path for nearly all I/O operations. We argue that doing this must provide applications with the same security model as traditional designs; it is easy to get good performance by extending the trusted computing base to include application code, e.g., by allowing applications unfiltered direct access to the network/disk.

We demonstrate that operating system protection is not contradictory with high performance. For our prototype implementation, a client request to the Redis persistent NoSQL store has 2× better read latency, 5× better write latency, and 9× better write throughput compared to Linux. We make three specific contributions:

  • We give an architecture for the division of labor between the device hardware, kernel, and runtime for direct network and disk I/O by unprivileged processes, and we show how to efficiently emulate our model for I/O devices that do not fully support virtualization (§3).
  • We implement a prototype of our model as a set of modifications to the open source Barrelfish operating system, running on commercially available multi-core computers and I/O device hardware (§3.8).
  • We use our prototype to quantify the potential benefits of user-level I/O for several widely used network services, including a distributed object cache, Redis, an IP-layer middlebox, and an HTTP load balancer (§4). We show that significant gains are possible in terms of both latency and scalability, relative to Linux, in many cases without modifying the application programming interface; additional gains are possible by changing the POSIX API (§4.3).

2 Background

We first give a detailed breakdown of the OS and application overheads in network and storage operations today, followed by a discussion of current hardware technologies that support user-level networking and I/O virtualization.

To analyze the sources of overhead, we record timestamps at various stages of kernel and user-space processing. Our experiments are conducted on a six machine cluster consisting of 6-core Intel Xeon E5-2430 (Sandy Bridge) systems at 2.2 GHz running Ubuntu Linux 13.04.


                          Linux                               Arrakis
                          Receiver running  CPU idle          Arrakis/P        Arrakis/N
Network stack   in        1.26 (37.6%)      1.24 (20.0%)      0.32 (22.3%)     0.21 (55.3%)
                out       1.05 (31.3%)      1.42 (22.9%)      0.27 (18.7%)     0.17 (44.7%)
Scheduler                 0.17 (5.0%)       2.40 (38.8%)      -                -
Copy            in        0.24 (7.1%)       0.25 (4.0%)       0.27 (18.7%)     -
                out       0.44 (13.2%)      0.55 (8.9%)       0.58 (40.3%)     -
Kernel crossing return    0.10 (2.9%)       0.20 (3.3%)       -                -
                syscall   0.10 (2.9%)       0.13 (2.1%)       -                -
Total                     3.36 (σ=0.66)     6.19 (σ=0.82)     1.44 (σ<0.01)    0.38 (σ<0.01)

Table 1: Sources of packet processing overhead in Linux and Arrakis. All times are averages over 1,000 samples, given in μs (standard deviation given for totals). Arrakis/P uses the POSIX interface, Arrakis/N uses the native Arrakis interface.

The systems have an Intel X520 (82599-based) 10Gb Ethernet adapter and an Intel MegaRAID RS3DC RAID controller with 1GB of flash-backed DRAM in front of a 100GB Intel DC S3700 SSD. All machines are connected to a 10Gb Dell PowerConnect 8024F Ethernet switch. One system (the server) executes the application under scrutiny, while the others act as clients.

2.1 Networking Stack Overheads

Consider a UDP echo server implemented as a Linux process. The server performs recvmsg and sendmsg calls in a loop, with no application-level processing, so it stresses packet processing in the OS. Figure 1 depicts the typical workflow for such an application. As Table 1 shows, operating system overhead for packet processing falls into four main categories.

  • Network stack costs: packet processing at the hardware, IP, and UDP layers.
  • Scheduler overhead: waking up a process (if necessary), selecting it to run, and context switching to it.
  • Kernel crossings: from kernel to user space and back.
  • Copying of packet data: from the kernel to a user buffer on receive, and back on send.

Of the total 3.36 μs (see Table 1) spent processing each packet in Linux, nearly 70% is spent in the network stack. This work is mostly software demultiplexing, security checks, and overhead due to indirection at various layers. The kernel must validate the header of incoming packets and perform security checks on arguments provided by the application when it sends a packet. The stack also performs checks at layer boundaries.

Scheduler overhead depends significantly on whether the receiving process is currently running. If it is, only 5% of processing time is spent in the scheduler; if it is not, the time to context-switch to the server process from the idle process adds an extra 2.2 μs and a further 0.6 μs slowdown in other parts of the network stack.
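For concreteness, the server measured here is nothing more than a receive/transmit loop. A minimal sketch of such a UDP echo server against the standard POSIX socket API (our code, not the authors' benchmark tool; the port number is arbitrary):

```c
/* Minimal UDP echo server of the kind measured above: a recvmsg/
 * sendmsg loop with no application-level processing. A sketch, not
 * the authors' benchmark code; the port number is arbitrary. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buf[2048];
    struct sockaddr_in peer;
    for (;;) {
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr msg = { .msg_name = &peer,
                              .msg_namelen = sizeof(peer),
                              .msg_iov = &iov, .msg_iovlen = 1 };
        ssize_t n = recvmsg(fd, &msg, 0); /* kernel copies packet in */
        if (n < 0)
            continue;
        iov.iov_len = (size_t)n;
        sendmsg(fd, &msg, 0);             /* kernel copies packet out */
    }
}
```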

Cache and lock contention issues on multicore systems add further overhead and are exacerbated by the fact that incoming messages can be delivered on different queues by the network card, causing them to be processed by different CPU cores—which may not be the same as the cores on which the user-level process is scheduled, as depicted in Figure 1. Advanced hardware support such as accelerated receive flow steering [4] aims to mitigate this cost, but these solutions themselves impose non-trivial setup costs [46].

By leveraging hardware support to remove kernel mediation from the data plane, Arrakis can eliminate certain categories of overhead entirely, and minimize the effect of others. Table 1 also shows the corresponding overhead for two variants of Arrakis. Arrakis eliminates scheduling and kernel crossing overhead entirely, because packets are delivered directly to user space. Network stack processing is still required, of course, but it is greatly simplified: it is no longer necessary to demultiplex packets for different applications, and the user-level network stack need not validate parameters provided by the user as extensively as a kernel implementation must. Because each application has a separate network stack, and packets are delivered to cores where the application is running, lock contention and cache effects are reduced.

In the Arrakis network stack, the time to copy packet data to and from user-provided buffers dominates the processing cost, a consequence of the mismatch between the POSIX interface (Arrakis/P) and NIC packet queues. Arriving data is first placed by the network hardware into a network buffer and then copied into the location specified by the POSIX read call. Data to be transmitted is moved into a buffer that can be placed in the network hardware queue; the POSIX write can then return, allowing the user memory to be reused before the data is sent. Although researchers have investigated ways to eliminate this copy from kernel network stacks [45], as Table 1 shows, most of the overhead for a kernel-resident network stack is elsewhere. Once the overhead of traversing the kernel is


                   Read hit                         Durable write
                   Linux            Arrakis/P       Linux              Arrakis/P
epoll              2.42 (27.91%)    1.12 (27.52%)   2.64 (1.62%)       1.49 (4.73%)
recv               0.98 (11.30%)    0.29 (7.13%)    1.55 (0.95%)       0.66 (2.09%)
Parse input        0.85 (9.80%)     0.66 (16.22%)   2.34 (1.43%)       1.19 (3.78%)
Lookup/set key     0.10 (1.15%)     0.10 (2.46%)    1.03 (0.63%)       0.43 (1.36%)
Log marshaling     -                -               3.64 (2.23%)       2.43 (7.71%)
write              -                -               6.33 (3.88%)       0.10 (0.32%)
fsync              -                -               137.84 (84.49%)    24.26 (76.99%)
Prepare response   0.60 (6.92%)     0.64 (15.72%)   0.59 (0.36%)       0.10 (0.32%)
send               3.17 (36.56%)    0.71 (17.44%)   5.06 (3.10%)       0.33 (1.05%)
Other              0.55 (6.34%)     0.46 (11.30%)   2.12 (1.30%)       0.52 (1.65%)
Total              8.67 (σ=2.55)    4.07 (σ=0.44)   163.14 (σ=13.68)   31.51 (σ=1.91)
99th percentile    15.21            4.25            188.67             35.

Table 2: Overheads in the Redis NoSQL store for memory reads (hits) and durable writes (legend in Table 1).

In Arrakis, we use SR-IOV, the IOMMU, and supporting adapters to provide direct application-level access to I/O devices. This is a modern implementation of an idea which was implemented twenty years ago with U-Net [54], but generalized to flash storage and Ethernet network adapters. To make user-level I/O stacks tractable, we need a hardware-independent device model and API that captures the important features of SR-IOV adapters [31, 40, 41, 51]; a hardware-specific device driver matches our API to the specifics of the particular device. We discuss this model in the next section, along with potential improvements to the existing hardware to better support user-level I/O.

Remote Direct Memory Access (RDMA) is another popular model for user-level networking [48]. RDMA gives applications the ability to read from or write to a region of virtual memory on a remote machine directly from user-space, bypassing the operating system kernel on both sides. The intended use case is for a parallel program to be able to directly read and modify its data structures even when they are stored on remote machines.

While RDMA provides the performance benefits of user-level networking to parallel applications, it is challenging to apply the model to a broader class of client-server applications [21]. Most importantly, RDMA is point-to-point. Each participant receives an authenticator providing it permission to remotely read/write a particular region of memory. Since clients in client-server computing are not mutually trusted, the hardware would need to keep a separate region of memory for each active connection. Therefore we do not consider RDMA operations here.

3 Design and Implementation

Arrakis has the following design goals:

  • Minimize kernel involvement for data-plane operations: Arrakis is designed to limit or remove kernel mediation for most I/O operations. I/O requests are routed to and from the application's address space without requiring kernel involvement and without sacrificing security and isolation properties.
  • Transparency to the application programmer: Arrakis is designed to significantly improve performance without requiring modifications to applications written to the POSIX API. Additional performance gains are possible if the developer can modify the application.
  • Appropriate OS/hardware abstractions: Arrakis' abstractions should be sufficiently flexible to efficiently support a broad range of I/O patterns, scale well on multicore systems, and support application requirements for locality and load balance.

In this section, we show how we achieve these goals in Arrakis. We describe an ideal set of hardware facilities that should be present to take full advantage of this architecture, and we detail the design of the control plane and data plane interfaces that we provide to the application. Finally, we describe our implementation of Arrakis based on the Barrelfish operating system.

3.1 Architecture Overview

[Figure 3: Arrakis architecture. The storage controller maps VSAs to physical storage.]

Arrakis targets I/O hardware with support for virtualization, and Figure 3 shows the overall architecture. In this paper, we focus on hardware that can present multiple instances of itself to the operating system and the applications running on the node. For each of these virtualized device instances, the underlying physical device provides unique memory mapped register files, descriptor queues, and interrupts, hence allowing the control plane to map each device instance to a separate protection domain. The device exports a management interface that is accessible from the control plane in order to create or destroy virtual device instances, associate individual instances with network flows or storage areas, and allocate shared resources to the different instances. Applications conduct I/O



through their protected virtual device instance without requiring kernel intervention. In order to perform these operations, applications rely on a user-level I/O stack that is provided as a library. The user-level I/O stack can be tailored to the application as it can assume exclusive access to a virtualized device instance, allowing us to remove any features not necessary for the application's functionality. Finally, (de-)multiplexing operations and security checks are not needed in this dedicated environment and can be removed.

The user naming and protection model is unchanged. A global naming system is provided by the control plane. This is especially important for sharing stored data. Applications implement their own storage, while the control plane manages naming and coarse-grain allocation, by associating each application with the directories and files it manages. Other applications can still read those files by indirecting through the kernel, which hands the directory or read request to the appropriate application.

3.2 Hardware Model

A key element of our work is to develop a hardware-independent layer for virtualized I/O—that is, a device model providing an “ideal” set of hardware features. This device model captures the functionality required to implement in hardware the data plane operations of a traditional kernel. Our model resembles what is already provided by some hardware I/O adapters; we hope it will provide guidance as to what is needed to support secure user-level networking and storage.

In particular, we assume our network devices provide support for virtualization by presenting themselves as multiple virtual network interface cards (VNICs) and that they can also multiplex/demultiplex packets based on complex filter expressions, directly to queues that can be managed entirely in user space without the need for kernel intervention. Similarly, each storage controller exposes multiple virtual storage interface controllers (VSICs) in our model. Each VSIC provides independent storage command queues (e.g., of SCSI or ATA format) that are multiplexed by the hardware. Associated with each such virtual interface card (VIC) are queues and rate limiters.

VNICs also provide filters and VSICs provide virtual storage areas. We discuss these components below.

Queues: Each VIC contains multiple pairs of DMA queues for user-space send and receive. The exact form of these VIC queues could depend on the specifics of the I/O interface card. For example, it could support a scatter/gather interface to aggregate multiple physically-disjoint memory regions into a single data transfer. For NICs, it could also optionally support hardware checksum offload and TCP segmentation facilities. These features enable I/O to be handled more efficiently by performing additional work in hardware. In such cases, the Arrakis system offloads operations and further reduces overheads.

Transmit and receive filters: A transmit filter is a predicate on network packet header fields that the hardware will use to determine whether to send the packet or discard it (possibly signaling an error either to the application or the OS). The transmit filter prevents applications from spoofing information such as IP addresses and VLAN tags and thus eliminates kernel mediation to enforce these security checks. It can also be used to limit an application to communicate with only a pre-selected set of nodes.

A receive filter is a similar predicate that determines which packets received from the network will be delivered to a VNIC and to a specific queue associated with the target VNIC. For example, a VNIC can be set up to receive all packets sent to a particular port, so both connection setup and data transfers can happen at user-level. Installation of transmit and receive filters is a privileged operation performed via the kernel control plane.
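To make the filter abstraction concrete, the following is a conceptual sketch of a receive filter expressed as mask-and-compare rules over packet header words, the form of matching §3.4 describes; all names here are ours, not an Arrakis or NIC interface:

```c
/* Conceptual sketch of a receive filter as mask-and-compare rules
 * over packet header words. All names are illustrative; this is not
 * an Arrakis or NIC-specific interface. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct match_rule {
    uint16_t offset; /* byte offset into the packet header */
    uint32_t mask;   /* bits the rule constrains */
    uint32_t value;  /* required value of those bits */
};

/* The hardware delivers a packet to the filter's queue only if every
 * rule matches, e.g. "dst IP in 10.0.0.0/8 AND dst UDP port == 9000". */
static bool filter_matches(const uint8_t *hdr,
                           const struct match_rule *rules, int nrules) {
    for (int i = 0; i < nrules; i++) {
        uint32_t word;
        memcpy(&word, hdr + rules[i].offset, sizeof(word));
        if ((word & rules[i].mask) != rules[i].value)
            return false;
    }
    return true;
}
```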

Virtual storage areas: Storage controllers need to provide an interface via their physical function to map virtual storage areas (VSAs) to extents of physical drives, and associate them with VSICs. A typical VSA will be large enough to allow the application to ignore the underlying multiplexing—e.g., multiple erasure blocks on flash, or cylinder groups on disk. An application can store multiple sub-directories and files in a single VSA, providing precise control over multi-object serialization constraints. A VSA is thus a persistent segment [13]. Applications reference blocks in the VSA using virtual offsets, converted by hardware into physical storage locations. A VSIC may have multiple VSAs, and each VSA may be mapped into multiple VSICs for interprocess sharing.

Bandwidth allocators: This includes support for resource allocation mechanisms such as rate limiters and pacing/traffic shaping of I/O. Once a frame has been removed from a transmit rate-limited or paced queue, the next time another frame could be fetched from that queue is regulated by the rate limits and the inter-packet pacing controls associated with the queue. Installation of these controls is also a privileged operation.
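As a worked example of this pacing rule (our arithmetic and names, not a hardware interface), the earliest time the next frame may leave a paced queue follows from the rate limit and the inter-packet gap:

```c
/* Sketch of the pacing rule described above: after a frame leaves a
 * rate-limited queue, the earliest time the next frame may be fetched
 * is bounded by the queue's rate limit and its inter-packet gap.
 * Names and units are illustrative, not a hardware interface. */
#include <stdint.h>

struct paced_queue {
    uint64_t rate_bps;      /* configured rate limit, bits per second */
    uint64_t min_gap_ns;    /* configured inter-packet pacing gap */
    uint64_t next_fetch_ns; /* earliest time the next frame may leave */
};

static void on_frame_sent(struct paced_queue *q, uint64_t now_ns,
                          uint32_t frame_bytes) {
    /* Serialization time of the frame at the configured rate
     * (rate_bps is assumed non-zero for a rate-limited queue). */
    uint64_t serialize_ns =
        (uint64_t)frame_bytes * 8u * 1000000000ull / q->rate_bps;
    uint64_t gap = serialize_ns > q->min_gap_ns ? serialize_ns
                                                : q->min_gap_ns;
    q->next_fetch_ns = now_ns + gap;
}
```

For instance, at a 100 Mb/s rate limit, a 1,500-byte frame yields a serialization gap of 1500 · 8 / 10^8 s = 120 μs before the next fetch.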


3.4 Control Plane Interface

The interface between an application and the Arrakis control plane is used to request resources from the system and direct I/O flows to and from user programs. The key abstractions presented by this interface are VICs, doorbells, filters, VSAs, and rate specifiers.

An application can create and delete VICs, and associate doorbells with particular events on particular VICs. A doorbell is an IPC end-point used to notify the application that an event (e.g., packet arrival or I/O completion) has occurred, and is discussed below. VICs are hardware resources and so Arrakis must allocate them among applications according to an OS policy. Currently this is done on a first-come-first-served basis, followed by spilling to software emulation (§3.3).

Filters have a type (transmit or receive) and a predicate which corresponds to a convex sub-volume of the packet header space (for example, obtained with a set of mask-and-compare operations). Filters can be used to specify ranges of IP addresses and port numbers associated with valid packets transmitted/received at each VNIC. Filters are a better abstraction for our purposes than a conventional connection identifier (such as a TCP/IP 5-tuple), since they can encode a wider variety of communication patterns, as well as subsuming traditional port allocation and interface specification.

For example, in the “map” phase of a MapReduce job we would like the application to send to, and receive from, an entire class of machines using the same communication end-point, but nevertheless isolate the data comprising the shuffle from other data. As a second example, web servers with a high rate of incoming TCP connections can run into scalability problems processing connection requests [46]. In Arrakis, a single filter can safely express both a listening socket and all subsequent connections to that socket, allowing server-side TCP connection establishment to avoid kernel mediation.

Applications create a filter with a control plane operation. In the common case, a simple higher-level wrapper suffices: filter = create_filter(flags, peerlist, servicelist). flags specifies the filter direction (transmit or receive) and whether the filter refers to the Ethernet, IP, TCP, or UDP header. peerlist is a list of accepted communication peers specified according to the filter type, and servicelist contains a list of accepted service addresses (e.g., port numbers) for the filter. Wildcards are permitted. The call to create_filter returns filter, a kernel-protected capability conferring authority to send or receive packets matching its predicate, and which can then be assigned to a specific queue on a VNIC.

VSAs are acquired and assigned to VSICs in a similar fashion. Finally, a rate specifier can also be assigned to a queue, either to throttle incoming traffic (in the network receive case) or pace outgoing packets and I/O requests. Rate

specifiers and filters associated with a VIC queue can be updated dynamically, but all such updates require mediation from the Arrakis control plane. Our network filters are less expressive than OpenFlow matching tables, in that they do not support priority-based overlapping matches. This is a deliberate choice based on hardware capabilities: NICs today only support simple matching, and to support priorities in the API would lead to unpredictable consumption of hardware resources below the abstraction. Our philosophy is therefore to support expressing such policies only when the hardware can implement them efficiently.
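Putting these pieces together, an application's control-plane setup might look as follows. Only the shape of create_filter(flags, peerlist, servicelist) comes from the text above; the handle types, flag constants, and the attach call are assumptions for illustration:

```c
/* Sketch of control-plane setup. Only create_filter(flags, peerlist,
 * servicelist) is described in the text; every other name here is a
 * hypothetical stand-in. */
#include <stdint.h>

typedef uint64_t filter_t; /* kernel-protected capability */
typedef uint64_t queue_t;  /* handle to a VNIC queue */

enum { FILTER_RECEIVE = 1, FILTER_TRANSMIT = 2, FILTER_UDP = 4 };

filter_t create_filter(int flags, const char *peerlist,
                       const char *servicelist);
int filter_attach(queue_t q, filter_t f); /* bind capability to a queue */

void setup_receive_path(queue_t q) {
    /* Accept UDP packets from any peer in 10.0.0.0/8 addressed to
     * service port 9000; wildcards are permitted per the text. */
    filter_t f = create_filter(FILTER_RECEIVE | FILTER_UDP,
                               "10.0.0.0/8", "9000");
    filter_attach(q, f);
}
```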

3.5 File Name Lookup

A design principle in Arrakis is to separate file naming from implementation. In a traditional system, the fully-qualified filename specifies the file system used to store the file and thus its metadata format. To work around this, many applications build their own metadata indirection inside the file abstraction [28]. Instead, Arrakis provides applications direct control over VSA storage allocation: an application is free to use its VSA to store metadata, directories, and file data. To allow other applications access to its data, an application can export file and directory names to the kernel virtual file system (VFS). To the rest of the VFS, an application-managed file or directory appears like a remote mount point—an indirection to a file system implemented elsewhere. Operations within the file or directory are handled locally, without kernel intervention.

Other applications can gain access to these files in three ways. By default, the Arrakis application library managing the VSA exports a file server interface; other applications can use normal POSIX API calls via user-level RPC to the embedded library file server. This library can also run as a standalone process to provide access when the original application is not active. Just like a regular mounted file system, the library needs to implement only functionality required for file access on its VSA and may choose to skip any POSIX features that it does not directly support.

Second, VSAs can be mapped into multiple processes. If an application, like a virus checker or backup system, has both permission to read the application's metadata and the appropriate library support, it can directly access the file data in the VSA. In this case, access control is done for the entire VSA and not per file or directory.

Finally, the user can direct the originating application to export its data into a standard format, such as a PDF file, stored as a normal file in the kernel-provided file system.

The combination of VFS and library code implement POSIX semantics seamlessly. For example, if execute rights are revoked from a directory, the VFS prevents future traversal of that directory's subtree, but existing RPC connections to parts of the subtree may remain intact until closed. This is akin to a POSIX process retaining a


subdirectory as the current working directory—relative traversals are still permitted.

3.6 Network Data Plane Interface

In Arrakis, applications send and receive network packets by directly communicating with hardware. The data plane interface is therefore implemented in an application library, allowing it to be co-designed with the application [43]. The Arrakis library provides two interfaces to applications. We describe the native Arrakis interface, which departs slightly from the POSIX standard to support true zero-copy I/O; Arrakis also provides a POSIX compatibility layer that supports unmodified applications.

Applications send and receive packets on queues, which have previously been assigned filters as described above. While filters can include IP, TCP, and UDP field predicates, Arrakis does not require the hardware to perform protocol processing, only multiplexing. In our implementation, Arrakis provides a user-space network stack above the data plane interface. This stack is designed to minimize latency and maximize throughput. We maintain a clean separation between three aspects of packet transmission and reception.

Firstly, packets are transferred asynchronously between the network and main memory using conventional DMA techniques using rings of packet buffer descriptors.

Secondly, the application transfers ownership of a transmit packet to the network hardware by enqueuing a chain of buffers onto the hardware descriptor rings, and acquires a received packet by the reverse process. This is performed by two VNIC driver functions. send_packet(queue, packet_array) sends a packet on a queue; the packet is specified by the scatter-gather array packet_array, and must conform to a filter already associated with the queue. receive_packet(queue) = packet receives a packet from a queue and returns a pointer to it. Both operations are asynchronous. packet_done(packet) returns ownership of a received packet to the VNIC.

For optimal performance, the Arrakis stack would interact with the hardware queues not through these calls but directly via compiler-generated, optimized code tailored to the NIC descriptor format. However, the implementation we report on in this paper uses function calls to the driver.

Thirdly, we handle asynchronous notification of events using doorbells associated with queues. Doorbells are delivered directly from hardware to user programs via hardware virtualized interrupts when applications are running and via the control plane to invoke the scheduler when applications are not running. In the latter case, higher latency is tolerable. Doorbells are exposed to Arrakis programs via regular event delivery mechanisms (e.g., a file descriptor event) and are fully integrated with existing I/O multiplexing interfaces (e.g., select). They are useful both to notify an application of general availability of packets in receive queues, as well as a

lightweight notification mechanism for I/O completion and the reception of packets in high-priority queues. This design results in a protocol stack that decouples hardware from software as much as possible using the descriptor rings as a buffer, maximizing throughput and minimizing overhead under high packet rates, yielding low latency.

On top of this native interface, Arrakis provides POSIX-compatible sockets. This compatibility layer allows Arrakis to support unmodified Linux applications. However, we show that performance gains can be achieved by using the asynchronous native interface.
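A sketch of the receive path over this interface is shown below. The three driver calls and their ownership rules come from the text; the doorbell_fd() plumbing and all types are our assumptions:

```c
/* Receive loop over the native data-plane calls named above. The
 * driver calls (receive_packet, packet_done, send_packet) and their
 * ownership rules come from the text; doorbell_fd() and the types
 * are assumptions. */
#include <stddef.h>
#include <sys/select.h>

typedef struct queue queue_t;
typedef struct packet packet_t;

void send_packet(queue_t *q, packet_t *pkt_array); /* async transmit */
packet_t *receive_packet(queue_t *q);              /* NULL if queue empty */
void packet_done(packet_t *p);      /* return buffer ownership to VNIC */
int doorbell_fd(queue_t *q);        /* hypothetical: doorbell as an fd */

void rx_loop(queue_t *rxq) {
    int dfd = doorbell_fd(rxq);
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(dfd, &rfds);
        /* Doorbell fires when packets arrive in the receive queue;
         * it integrates with select() per the text. */
        select(dfd + 1, &rfds, NULL, NULL, NULL);

        packet_t *p;
        while ((p = receive_packet(rxq)) != NULL) {
            /* ... inspect the packet in place (zero-copy) ... */
            packet_done(p); /* hand the receive buffer back */
        }
    }
}
```

Note that a buffer handed to send_packet may be reused only after the hardware signals send completion (also via a doorbell); §4.3 discusses the bookkeeping this requires in an application.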

3.7 Storage Data Plane Interface

The low-level storage API provides a set of commands to asynchronously read, write, and flush hardware caches at any offset and of arbitrary size in a VSA via a command queue in the associated VSIC. To do so, the caller provides an array of virtual memory ranges (address and size) in RAM to be read/written, the VSA identifier, queue number, and matching array of ranges (offset and size) within the VSA. The implementation enqueues the corresponding commands to the VSIC, coalescing and reordering commands if this makes sense to the underlying media. I/O completion events are reported using doorbells. On top of this, a POSIX-compliant file system is provided.

We have also designed a library of persistent data structures, Caladan, to take advantage of low-latency storage devices. Persistent data structures can be more efficient than a simple read/write interface provided by file systems. Their drawback is a lack of backwards-compatibility to the POSIX API. Our design goals for persistent data structures are that (1) operations are immediately persistent, (2) the structure is robust versus crash failures, and (3) operations have minimal latency.

We have designed persistent log and queue data structures according to these goals and modified a number of applications to use them (e.g., §4.4). These data structures manage all metadata required for persistence, which allows tailoring of that data to reduce latency. For example, metadata can be allocated along with each data structure entry and persisted in a single hardware write operation. For the log and queue, the only metadata that needs to be kept is where they start and end. Pointers link entries to accommodate wrap-arounds and holes, optimizing for linear access and efficient prefetch of entries. By contrast, a filesystem typically has separate inodes to manage block allocation. The in-memory layout of Caladan structures is as stored, eliminating marshaling.

The log API includes operations to open and close a log, create log entries (for metadata allocation), append them to the log (for persistence), iterate through the log (for reading), and trim the log. The queue API adds a pop operation to combine trimming and reading the queue. Persistence is asynchronous: an append operation returns immediately
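A usage sketch of the log follows. The operations (open/close, create entry, append, iterate, trim) are those listed above; all function names, types, and the log_entry_data() accessor are our assumptions, not Caladan's actual API:

```c
/* Usage sketch of the Caladan persistent log. The operations (open,
 * create entry, append, trim) are those listed in the text; the
 * function names, types, and log_entry_data() accessor are our
 * assumptions, not Caladan's actual API. */
#include <stddef.h>
#include <string.h>

typedef struct plog plog_t;
typedef struct plog_entry plog_entry_t;

plog_t *log_open(const char *name);
plog_entry_t *log_entry_create(plog_t *l, size_t size); /* metadata alloc */
void *log_entry_data(plog_entry_t *e);       /* payload area of the entry */
void log_append(plog_t *l, plog_entry_t *e); /* async; returns at once */
void log_trim(plog_t *l, plog_entry_t *upto);
void log_close(plog_t *l);

void persist_update(plog_t *l, const void *data, size_t len) {
    /* Entry metadata is allocated alongside the data, so one hardware
     * write persists both (per the text); completion is reported
     * asynchronously, e.g. via a doorbell. */
    plog_entry_t *e = log_entry_create(l, len);
    memcpy(log_entry_data(e), data, len); /* stored layout == in-memory */
    log_append(l, e);
}
```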

4 Evaluation

  • What are the major contributors to performance overhead in Arrakis and how do they compare to those of Linux (presented in §2)?
  • Does Arrakis provide better latency and throughput for real-world cloud applications? How does the throughput scale with the number of CPU cores for these workloads?
  • Can Arrakis retain the benefits of user-level application execution and kernel enforcement, while providing high-performance packet-level network I/O?
  • What additional performance gains are possible by departing from the POSIX interface?

We compare the performance of the following OS configurations: Linux kernel version 3.8 (Ubuntu version 13.04), Arrakis using the POSIX interface (Arrakis/P), and Arrakis using its native interface (Arrakis/N).

We tuned Linux network performance by installing the latest ixgbe device driver version 3.17.3 and disabling receive side scaling (RSS) when applications execute on only one processor. RSS spreads packets over several NIC receive queues, but incurs needless coherence overhead on a single core. The changes yield a throughput improvement of 10% over non-tuned Linux. We use the kernel-shipped MegaRAID driver version 6.600.18.00-rc1.

Linux uses a number of performance-enhancing features of the network hardware, which Arrakis does not currently support. Among these features is the use of direct processor cache access by the NIC, TCP and UDP segmentation offload, large receive offload, and network packet header splitting. All of these features can be implemented in Arrakis; thus, our performance comparison is weighted in favor of Linux.

4.1 Server-side Packet Processing Performance

We load the UDP echo benchmark from §2 on the server and use all other machines in the cluster as load generators. These generate 1 KB UDP packets at a fixed rate and record the rate at which their echoes arrive. Each experiment exposes the server to maximum load for 20 seconds.

Shown in Table 1, compared to Linux, Arrakis eliminates two system calls, software demultiplexing overhead, socket buffer locks, and security checks. In Arrakis/N, we additionally eliminate two socket buffer copies. Arrakis/P incurs a total server-side overhead of 1.44 μs, 57% less than Linux. Arrakis/N reduces this overhead to 0.38 μs.

The echo server is able to add a configurable delay before sending back each packet. We use this delay to simulate additional application-level processing time at the server. Figure 4 shows the average throughput attained by each system over various such delays; the theoretical line rate is 1.26M pps with zero processing. In the best case (no additional processing time), Arrakis/P achieves 2.3× the throughput of Linux.

[Figure 4: Average UDP echo throughput for packets with 1024-byte payload over various processing times. The top y-axis value shows theoretical maximum throughput on the 10G network. Error bars in this and following figures show min/max measured over 5 repeats of the experiment.]

By departing from POSIX, Arrakis/N achieves 3.9× the throughput of Linux. The relative benefit of Arrakis disappears at 64 μs. To gauge how close Arrakis comes to the maximum possible throughput, we embedded a minimal echo server directly into the NIC device driver, eliminating any remaining API overhead. Arrakis/N achieves 94% of the driver limit.

4.2 Memcached Key-Value Store

Memcached is an in-memory key-value store used by many cloud applications. It incurs a processing overhead of 2–3 μs for an average object fetch request, comparable to the overhead of OS kernel network processing.

We benchmark memcached 1.4.15 by sending it requests at a constant rate via its binary UDP protocol, using a tool similar to the popular memslap benchmark [2]. We configure a workload pattern of 90% fetch and 10% store requests on a pre-generated range of 128 different keys of a fixed size of 64 bytes and a value size of 1 KB, in line with real cloud deployments [7].

To measure network stack scalability for multiple cores, we vary the number of memcached server processes. Each server process executes independently on its own port number, such that measurements are not impacted by scalability bottlenecks in memcached itself, and we distribute load equally among the available memcached instances. On Linux, memcached processes share the kernel-level network stack. On Arrakis, each process obtains its own VNIC with an independent set of packet queues, each controlled by an independent instance of Extaris.

Figure 5 shows that memcached on Arrakis/P achieves 1.7× the throughput of Linux on one core, and attains near line-rate at 4 CPU cores. The slightly lower throughput on all 6 cores is due to contention with Barrelfish system management processes [10]. By contrast, Linux throughput nearly plateaus beyond two cores. A single, multi-threaded memcached instance shows no noticeable throughput difference to the multi-process scenario. This is not surprising as memcached is optimized to scale well.


[Figure 5: Average memcached transaction throughput and scalability. Top y-axis value = 10Gb/s.]

To conclude, the separation of network stack and application in Linux provides only limited information about the application's packet processing and poses difficulty assigning threads to the right CPU core. The resulting cache misses and socket lock contention are responsible for much of the Linux overhead. In Arrakis, the application is in control of the whole packet processing flow: assignment of packets to packet queues, packet queues to cores, and finally the scheduling of its own threads on these cores. The network stack thus does not need to acquire any locks, and packet data is always available in the right processor cache.

Memcached is also an excellent example of the communication endpoint abstraction: we can create hardware filters to allow packet reception and transmission only between the memcached server and a designated list of client machines that are part of the cloud application. In the Linux case, we have to filter connections in the application.

4.3 Arrakis Native Interface Case Study

As a case study, we modified memcached to make use of Arrakis/N. In total, 74 lines of code were changed, with 11 pertaining to the receive side, and 63 to the send side. On the receive side, the changes involve eliminating memcached’s receive buffer and working directly with pointers to packet buffers provided by Extaris, as well as returning completed buffers to Extaris. The changes increase average throughput by 9% over Arrakis/P. On the send side, changes include allocating a number of send buffers to allow buffering of responses until fully sent by the NIC, which now must be done within memcached itself. They also involve the addition of reference counts to hash table entries and send buffers to determine when it is safe to reuse buffers and hash table entries that might otherwise still be processed by the NIC. We gain an additional 10% average throughput when using the send side API in addition to the receive side API.
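The buffer-reuse bookkeeping is standard reference counting: a send buffer may be recycled only after both the application and the NIC completion path have released it. A generic sketch (ours, not memcached's code):

```c
/* Generic sketch of the reference counting described above: a send
 * buffer may be reused only after both the application and the NIC
 * (via send completion) release it. Illustrative, not memcached code. */
#include <stdatomic.h>
#include <stdlib.h>

struct send_buf {
    atomic_int refs; /* one for the app + one per in-flight send */
    char data[2048];
};

static void buf_get(struct send_buf *b) {
    atomic_fetch_add(&b->refs, 1); /* e.g., when handing to the NIC */
}

static void buf_put(struct send_buf *b) {
    /* Last release (application done AND send completed) frees it. */
    if (atomic_fetch_sub(&b->refs, 1) == 1)
        free(b);
}
```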

4.4 Redis NoSQL Store

Redis [18] extends the memcached model from a cache to a persistent NoSQL object store. Our results in Table 2 show that Redis operations—while more laborious than Memcached—are still dominated by I/O stack overheads.

[Figure 6: Average Redis transaction throughput for GET and SET operations. The Arrakis/P [15us] and Linux/Caladan configurations apply only to SET operations.]

Redis can be used in the same scenario as Memcached and we follow an identical experiment setup, using Redis version 2.8.5. We use the benchmarking tool distributed with Redis and configure it to execute GET and SET requests in two separate benchmarks to a range of 65, random keys with a value size of 1,024 bytes, persisting each SET operation individually, with a total concurrency of 1,600 connections from 16 benchmark clients executing on the client machines. Redis is single-threaded, so we investigate only single-core performance.

The Arrakis version of Redis uses Caladan. We changed 109 lines in the application to manage and exchange records with the Caladan log instead of a file. We did not eliminate Redis' marshaling overhead (cf. Table 2). If we did, we would save another 2.43 μs of write latency.

Due to the fast I/O stacks, Redis' read performance mirrors that of Memcached and write latency improves by 63%, while write throughput improves vastly, by 9×. To investigate what would happen if we had access to state-of-the-art storage hardware, we simulate (via a write-delaying RAM disk) a storage backend with 15 μs write latency, such as the ioDrive2 [24]. Write throughput improves by another 1.6×, nearing Linux read throughput.

Both network and disk virtualization is needed for good Redis performance. We tested this by porting Caladan to run on Linux, with the unmodified Linux network stack. This improved write throughput by only 5× compared to Linux, compared to 9× on Arrakis. Together, the combination of data-plane network and storage stacks can yield large benefits in latency and throughput for both read and write-heavy workloads. The tight integration of storage and data structure in Caladan allows for a number of latency-saving techniques that eliminate marshaling overhead and book-keeping of journals for file system metadata, and can offset storage allocation overhead. These benefits will increase further with upcoming hardware improvements.

4.5 HTTP Load Balancer

To aid scalability of web services, HTTP load balancers are often deployed to distribute client load over a number


[Figure 9: Memcached transaction throughput over 5 instances (colors), with and without rate limiting.]

and its performance is 2.6× that of Linux. We also see an interesting effect: the Linux implementation does not scale at all in this configuration. The reason for this is the raw IP sockets, which carry no connection information. Without an indication of which connections to steer to which sockets, each middlebox instance has to look at each incoming packet to determine whether it should handle it. This added overhead outweighs any performance gained via parallelism. In Arrakis, we can configure the hardware filters to steer packets based on packet header information and thus scale until we quickly hit the NIC throughput limit at two cores.

We conclude that Arrakis allows us to retain the safety, abstraction, and management benefits of software development at user-level, while vastly improving the performance of low-level packet operations. Filters provide a versatile interface to steer packet workloads based on arbitrary information stored in packet headers to effectively leverage multi-core parallelism, regardless of protocol specifics.

4.7 Performance Isolation

We show that QoS limits can be enforced in Arrakis, by simulating a simple multi-tenant scenario with 5 memcached instances pinned to distinct cores, to minimize processor crosstalk. One tenant has an SLA that allows it to send up to 100Mb/s. The other tenants are not limited. We use rate specifiers in Arrakis to set the transmit rate limit of the VNIC of the limited process. On Linux, we use queuing disciplines [29] (specifically, HTB [20]) to rate limit the source port of the equivalent process.

We repeat the experiment from §4.2, plotting the throughput achieved by each memcached instance, shown in Figure 9. The bottom-most process (barely visible) is rate-limited to 100Mb/s in the experiment shown on the right hand side of the figure. All runs remained within the error bars shown in Figure 5. When rate-limiting, a bit of the total throughput is lost for both OSes because clients keep sending packets at the same high rate. These consume network bandwidth, even when later dropped due to the rate limit.

We conclude that it is possible to provide the same kind of QoS enforcement—in this case, rate limiting—in Arrakis, as in Linux. Thus, we are able to retain the protection and policing benefits of user-level application execution, while providing improved network performance.

5 Discussion

In this section, we discuss how we can extend the Arrakis model to apply to virtualized guest environments, as well as to interprocessor interrupts.

5.1 Arrakis as Virtualized Guest

Arrakis' model can be extended to virtualized environments. Making Arrakis a host in this environment is straightforward—this is what the technology was originally designed for. The best way to support Arrakis as a guest is by moving the control plane into the virtual machine monitor (VMM). Arrakis guest applications can then allocate virtual interface cards directly from the VMM. A simple way of accomplishing this is by pre-allocating a number of virtual interface cards in the VMM to the guest and letting applications pick only from this pre-allocated set, without requiring a special interface to the VMM. The hardware limits apply to a virtualized environment in the same way as they do in the regular Arrakis environment. We believe the current limits on virtual adapters (typically 64) to be balanced with the number of available processing resources.

5.2 Virtualized Interprocessor Interrupts

To date, most parallel applications are designed assuming that shared memory is (relatively) efficient, while interprocessor signaling is (relatively) inefficient. A cache miss to data written by another core is handled in hardware, while alerting a thread on another processor requires kernel mediation on both the sending and receiving side. The kernel is involved even when signaling an event between two threads running inside the same application. With kernel bypass, a remote cache miss and a remote event delivery are similar in cost at a physical level.

Modern hardware already provides the operating system the ability to control how device interrupts are routed. To safely deliver an interrupt within an application, without kernel mediation, requires that the hardware add access control. With this, the kernel could configure the interrupt routing hardware to permit signaling among cores running the same application, trapping to the kernel only when signaling between different applications.

6 Related Work

SPIN [14] and Exokernel [25] reduced shared kernel components to allow each application to have customized operating system management. Nemesis [15] reduces shared components to provide more performance isolation for multimedia applications. All three mediated I/O in the kernel. Relative to these systems, Arrakis shows that


application customization is consistent with very high performance.

Following U-Net, a sequence of hardware standards such as VIA [19] and Infiniband [30] addressed the challenge of minimizing, or eliminating entirely, operating system involvement in sending and receiving network packets in the common case. To a large extent, these systems have focused on the needs of parallel applications for high-throughput, low-overhead communication. Arrakis supports a more general networking model including client-server and peer-to-peer communication.

Our work was inspired in part by previous work on Dune [11], which used nested paging to provide support for user-level control over virtual memory, and Exitless IPIs [26], which presented a technique to demultiplex hardware interrupts between virtual machines without mediation from the virtual machine monitor.

Netmap [49] implements high throughput network I/O by doing DMAs directly from user space. Sends and receives still require system calls, as the OS needs to do permission checks on every operation. Throughput is achieved at the expense of latency, by batching reads and writes. Similarly, IX [12] implements a custom, per-application network stack in a protected domain accessed with batched system calls. Arrakis eliminates the need for batching by handling operations at user level in the common case.

Concurrently with our work, mTCP uses Intel's DPDK interface to implement a scalable user-level TCP [36]; mTCP focuses on scalable network stack design, while our focus is on the operating system API for general client-server applications. We expect the performance of Extaris and mTCP to be similar. OpenOnload [50] is a hybrid user- and kernel-level network stack. It is completely binary-compatible with existing Linux applications; to support this, it has to keep a significant amount of socket state in the kernel and supports only a traditional socket API. Arrakis, in contrast, allows applications to access the network hardware directly and does not impose API constraints.

Recent work has focused on reducing the overheads imposed by traditional file systems and block device drivers, given the availability of low latency persistent memory. DFS [37] and PMFS [23] are file systems designed for these devices. DFS relies on the flash storage layer for functionality traditionally implemented in the OS, such as block allocation. PMFS exploits the byte-addressability of persistent memory, avoiding the block layer. Both DFS and PMFS are implemented as kernel-level file systems, exposing POSIX interfaces. They focus on optimizing file system and device driver design for specific technologies, while Arrakis investigates how to allow applications fast, customized device access.

Moneta-D [16] is a hardware and software platform for fast, user-level I/O to solid-state devices. The hardware and operating system cooperate to track permissions on hardware extents, while a user-space driver communicates with the device through a virtual interface. Applications interact with the system through a traditional file system. Moneta-D is optimized for large files, since each open operation requires communication with the OS to check permissions; Arrakis does not have this issue, since applications have complete control over their VSAs. Aerie [53] proposes an architecture in which multiple processes communicate with a trusted user-space file system service for file metadata and lock operations, while directly accessing the hardware for reads and data-only writes. Arrakis provides more flexibility than Aerie, since storage solutions can be integrated tightly with applications rather than provided in a shared service, allowing for the development of higher-level abstractions, such as persistent data structures.

7 Conclusion

In this paper, we described and evaluated Arrakis, a new operating system designed to remove the kernel from the I/O data path without compromising process isolation. Unlike a traditional operating system, which mediates all I/O operations to enforce process isolation and resource limits, Arrakis uses device hardware to deliver I/O directly to a customized user-level library. The Arrakis kernel operates in the control plane, configuring the hardware to limit application misbehavior.

To demonstrate the practicality of our approach, we have implemented Arrakis on commercially available network and storage hardware and used it to benchmark several typical server workloads. We are able to show that protection and high performance are not contradictory: end-to-end client read and write latency to the Redis persistent NoSQL store is 2–5× faster and write throughput 9× higher on Arrakis than on a well-tuned Linux implementation.

Acknowledgments

This work was supported by NetApp, Google, and the National Science Foundation. We would like to thank the anonymous reviewers and our shepherd, Emmett Witchel, for their comments and feedback. We also thank Oleg Godunok for implementing the IOMMU driver, Antoine Kaufmann for implementing MSI-X support, and Taesoo Kim for implementing interrupt support into Extaris.

References

[1] http://www.barrelfish.org/.

[2] http://www.libmemcached.org/.

[3] http://haproxy.1wt.eu.

[4] Scaling in the Linux networking stack. https://www.kernel.org/doc/Documentation/networking/scaling.txt.


[33] Intel Corporation. Intel RAID Controllers RS3DC080 and RS3DC040, Aug 2013. Product Brief. http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/raid-controller-rs3dc-brief.pdf.

[34] Intel Corporation. Intel virtualization technology for directed I/O architecture specification. Technical Report Order Number: D51397-006, Intel Corporation, Sep 2013.

[35] Intel Corporation. NVM Express, revision 1.1a edition, Sep 2013. http://www.nvmexpress.org/wp-content/uploads/NVM-Express-1_1a.pdf.

[36] E. Jeong, S. Woo, M. Jamshed, H. Jeong, S. Ihm, D. Han, and K. Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In NSDI, 2014.

[37] W. K. Josephson, L. A. Bongo, K. Li, and D. Flynn. DFS: A file system for virtualized flash storage. Trans. Storage, 6(3):14:1–14:25, Sep 2010.

[38] P. Kutch. PCI-SIG SR-IOV primer: An introduction to SR-IOV technology. Intel application note, 321211–002, Jan 2011.

[39] I. M. Leslie, D. McAuley, R. Black, T. Roscoe, P. Barham, D. Evers, R. Fairbairns, and E. Hyden. The design and implementation of an operating system to support distributed multimedia applications. IEEE J. Sel. A. Commun., 14(7):1280–1297, Sep 1996.

[40] LSI Corporation. LSISAS2308 PCI Express to 8-Port 6Gb/s SAS/SATA Controller, Feb 2010. Product Brief. http://www.lsi.com/downloads/Public/SAS%20ICs/LSI_PB_SAS2308.pdf.

[41] LSI Corporation. LSISAS3008 PCI Express to 8-Port 12Gb/s SAS/SATA Controller, Feb 2014. Product Brief. http://www.lsi.com/downloads/Public/SAS%20ICs/LSI_PB_SAS3008.pdf.

[42] lwIP. http://savannah.nongnu.org/projects/lwip/.

[43] I. Marinos, R. N. M. Watson, and M. Handley. Network stack specialization for performance. In SIGCOMM, 2014.

[44] D. Mosberger and L. L. Peterson. Making paths explicit in the Scout operating system. In OSDI, 1996.

[45] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A unified I/O buffering and caching system. In OSDI, 1999.

[46] A. Pesterev, J. Strauss, N. Zeldovich, and R. T. Morris. Improving network connection locality on multicore systems. In EuroSys, 2012.

[47] S. Radhakrishnan, Y. Geng, V. Jeyakumar, A. Kabbani, G. Porter, and A. Vahdat. SENIC: Scalable NIC for end-host rate limiting. In NSDI, 2014.

[48] RDMA Consortium. Architectural specifications for RDMA over TCP/IP. http://www.rdmaconsortium.org/.

[49] L. Rizzo. Netmap: A novel framework for fast packet I/O. In USENIX ATC, 2012.

[50] SolarFlare Communications, Inc. OpenOnload. http://www.openonload.org/.

[51] Solarflare Communications, Inc. Solarflare SFN5122F Dual-Port 10GbE Enterprise Server Adapter, 2010.

[52] A. Trivedi, P. Stuedi, B. Metzler, R. Pletka, B. G. Fitch, and T. R. Gross. Unified high-performance I/O: One stack to rule them all. In HotOS, 2013.

[53] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In EuroSys, 2014.

[54] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed computing. In SOSP, 1995.