Mercury: Enabling Remote Procedure Call for High-Performance Computing

Abstract

Remote Procedure Call (RPC) is a technique that has been widely used to build distributed services. This technique, now more and more used in the context of High-Performance Computing (HPC), allows the execution of routines to be delegated to remote nodes, which can be set aside and dedicated to specific tasks. However, existing RPC frameworks assume a socket-based network interface (usually on top of TCP/IP), which is not appropriate for HPC systems, as this API does not typically map well to the native network transport used on those systems, resulting in lower network performance. In addition, existing RPC frameworks often do not support handling large data arguments, such as those found in read or write calls.

We present in this paper an asynchronous RPC interface specifically designed for use in HPC systems that allows asynchronous transfer of parameters and execution requests and direct support of large data arguments. The interface is generic to allow any function call to be shipped. Additionally, the network implementation is abstracted, allowing easy porting to future systems and efficient use of existing native transport mechanisms.


RPC (Remote Procedure Call) is a protocol and programming model for remote communication in distributed systems. It allows one computing node (usually a client) to call a function or method on another computing node (usually a server) over the network as if it were a local call. RPC hides the details of network communication, making remote calls more transparent and simpler for developers.
In RPC, communication between the client and the server is usually carried over a network transport protocol such as TCP or UDP. The whole process typically includes the following steps:
1. Define the interface: First, the interface for communication between the client and the server needs to be defined. This can be done using an interface definition language (IDL) such as Protocol Buffers, Thrift, or WSDL (for SOAP services). The interface definition describes the available methods, their parameters, and their return types.
2. Generate code: From the interface definition, an IDL compiler generates client and server code. The generated code includes client stubs and server skeletons, which handle network communication and serialize/deserialize data.
3. Remote call: The client application triggers a remote call by invoking the local client stub. The stub encapsulates the call request into a network message and sends it to the server. When the server receives the request, the skeleton parses it and executes the corresponding method. The result is encapsulated in a response message and sent back to the client.
4. Data transmission and serialization: Function parameters and return values need to be serialized and deserialized between the client and server, converting the data into a format suitable for network transmission so that it can be exchanged between different computing nodes (a minimal sketch of this flow appears below).
5. Error handling and exceptions: RPC frameworks usually provide error handling and exception propagation. When a remote call fails or raises an exception, the error information is passed back to the client so it can be handled properly.
Common RPC frameworks include gRPC, Apache Thrift, Apache Avro, XML-RPC, etc. These frameworks provide features such as cross-language support, load balancing, service discovery, and authentication, making remote calls more convenient and reliable.
RPC is widely used in distributed systems, including microservice architectures, distributed computing, and remote data access, to enable communication and collaboration between nodes.
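To make the stub/skeleton flow above concrete, here is a minimal sketch in C. It is only an illustration under simplifying assumptions: the "network" is replaced by a direct function call so the example stays self-contained, and the names (OP_ADD, rpc_add, server_dispatch) are invented rather than taken from any real RPC framework.

```c
/* Minimal sketch of the stub/skeleton flow: the client stub serializes the
 * operation ID and arguments into a buffer, the server-side dispatcher
 * decodes them, executes the routine, and encodes the result. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { OP_ADD = 1 };            /* operation ID shared by client and server */

/* --- server-side skeleton: decode request, execute, encode response --- */
static size_t server_dispatch(const uint8_t *req, size_t req_len, uint8_t *resp)
{
    uint32_t op;
    memcpy(&op, req, sizeof op);
    if (op == OP_ADD && req_len >= sizeof op + 2 * sizeof(int32_t)) {
        int32_t a, b, sum;
        memcpy(&a, req + 4, sizeof a);
        memcpy(&b, req + 8, sizeof b);
        sum = a + b;                       /* the actual remote procedure */
        memcpy(resp, &sum, sizeof sum);
        return sizeof sum;
    }
    return 0;
}

/* --- client-side stub: encode arguments, "send", decode result --- */
static int32_t rpc_add(int32_t a, int32_t b)
{
    uint8_t req[12], resp[4];
    uint32_t op = OP_ADD;
    memcpy(req, &op, 4);
    memcpy(req + 4, &a, 4);
    memcpy(req + 8, &b, 4);
    /* a real stub would send req over the network and wait for resp */
    server_dispatch(req, sizeof req, resp);
    int32_t sum;
    memcpy(&sum, resp, sizeof sum);
    return sum;
}

int main(void)
{
    printf("rpc_add(2, 3) = %d\n", rpc_add(2, 3));   /* prints 5 */
    return 0;
}
```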

I. INTRODUCTION

When working in a heterogeneous environment, it is often very useful for an engineer or a scientist to be able to distribute the various steps of an application workflow; particularly so in high-performance computing, where it is common to see systems or nodes embedding different types of resources and libraries, which can be dedicated to specific tasks such as computation, storage or analysis and visualization. Remote procedure call (RPC) [1] is a technique that follows a client/server model and allows local calls to be transparently executed on remote resources. It consists of serializing the local function parameters into a memory buffer and sending that buffer to a remote target, which in turn deserializes the parameters and executes the corresponding function call. Libraries implementing this technique can be found in various domains such as web services with Google Protocol Buffers [2] or Facebook Thrift [3], or in domains such as grid computing with GridRPC [4]. RPC can also be realized using a more object-oriented approach with frameworks such as CORBA [5] or Java RMI [6], where abstract objects and methods can be distributed across a range of nodes or machines.


Heterogeneous environment: A heterogeneous environment refers to a computing system composed of different types of computing devices and processors. In a heterogeneous environment, different devices may have different architectures, processing capabilities, and characteristics. Common heterogeneous environments include:
1. Multi-core CPUs and accelerators: In this environment, the computing system consists of multiple CPU cores and accelerators (such as GPUs, FPGAs, or ASICs). CPU cores are typically used for general-purpose computing tasks, while accelerators are used for highly parallel specific computing tasks, such as graphics rendering, deep learning, or cryptography operations.
2. Distributed computing cluster: In this environment, the computing system consists of multiple computers, each of which may have different processor types and architectures. Computers in a cluster can be connected through a network and work together to complete large-scale computing tasks.
3. Cloud computing environment: Cloud computing provides network-based computing resources in the form of virtual machines or containers. In a cloud environment, users can choose among different types of virtual machine instances, each of which may have different processing capabilities and characteristics. This enables users to choose the most suitable computing resources according to their needs and budget.
In a heterogeneous environment, corresponding programming models and tools are needed to fully exploit the various computing devices: for example, parallel programming frameworks such as CUDA or OpenCL to use the parallel processing capabilities of accelerators, or MPI together with task schedulers and distributed system management tools to manage distributed computing clusters.
The challenge of a heterogeneous environment is to efficiently utilize the computing power of different devices and manage data transfer and synchronization issues. However, high-performance and efficient computing can be achieved through reasonable task allocation and scheduling.

However, using these standard and generic RPC frameworks on an HPC system presents two main limitations: the inability to take advantage of the native transport mechanism to transfer data efficiently, as these frameworks are mainly designed on top of TCP/IP protocols; and the inability to transfer very large amounts of data, as the limit imposed by the RPC interface is generally of the order of the megabyte. In addition, even if no limit is enforced, transferring large amounts of data through the RPC library is usually discouraged, mostly due to overhead caused by serialization and encoding, causing the data to be copied many times before reaching the remote node.


The paper is organized as follows: we first discuss related work in section II, then in section III we discuss the network abstraction layer on top of which the interface is built, as well as the architecture defined to transfer small and large data efficiently. Section IV outlines the API and shows its advantages to enable the use of pipelining techniques. We then describe the development of network transport plugins for our interface as well as performance evaluation results. Section V presents conclusions and future work directions.


II. RELATED WORK

The Network File System (NFS) [7] is a very good example of the use of RPC with large data transfers and therefore very close to the use of RPC on an HPC system. It makes use of XDR [8] to serialize arbitrary data structures and create a system-independent description; the resulting stream of bytes is then sent to a remote resource, which can deserialize it and get the data back from it. It can also make use of separate transport mechanisms (on recent versions of NFS) to transfer data over RDMA protocols, in which case the data is processed outside of the XDR stream. The interface that we present in this paper follows similar principles but in addition handles bulk data directly. It is also not limited to the use of XDR for data encoding, which can be a performance hit, especially when sender and receiver share a common system architecture. By providing a network abstraction layer, the RPC interface that we define gives the user the ability to send small data and large data efficiently, using either small messages or remote memory access (RMA) types of transfer that fully support the one-sided semantics present on recent HPC systems. Furthermore, the entire interface presented is non-blocking and therefore allows an asynchronous mode of operation, preventing the caller from having to wait for an operation to complete before another one can be issued.


Network File System (NFS): NFS (Network File System) is a protocol for sharing files and directories in a computer network. It allows multiple computers to transparently access a shared file system over a network as if they were local files. NFS was originally developed by Sun Microsystems and became a widely used network file sharing protocol. It uses a client-server model, where files are stored and managed on a file server, and clients request access to files on the server over a network.
The working principle of NFS is as follows:
1. File server setup: A directory on the file server is marked as a shared directory, and the list of clients allowed to access it is configured.

2. Client mount: The client mounts the shared directory on the server to the local file system by using a specific NFS mount command. Once mounted, clients can access shared directories and files as if they were local files.

3. File access: The client can perform standard file operations on the mounted shared directories and files, such as reading, writing, creating and deleting. These operations are transmitted to the file server, and the corresponding operations are performed on the server.

4. File synchronization: NFS supports concurrent access to files, and multiple clients can read and write the same file at the same time. The server is responsible for handling synchronization and conflict resolution for concurrent access.

NFS provides the following advantages and features:

1. Transparency: NFS makes the remote file system transparent to the client, and the client can access the remote file as if it were a local file.

2. Shareability: Multiple clients can access and share the same file system at the same time, promoting collaboration and resource sharing.

3. Cross-platform support: NFS is cross-platform and can share files between different operating systems, such as UNIX, Linux, Windows, etc.

4. Performance optimization: NFS supports mechanisms such as caching and read-ahead to improve file access performance.

Although NFS has many advantages for file sharing, there are some considerations such as network latency, security, and performance. When configuring and using NFS, you need to consider the network environment and security requirements, and take corresponding measures to ensure the reliability and security of file access.

XDR (External Data Representation) is a data serialization format used to transmit and store data between different computer systems. It defines a set of specifications describing how to convert data into a machine-independent format for data exchange among different systems. XDR serialization uses a fixed data representation format, so that data has a consistent representation and parsing method across different machines and operating systems. This ensures data portability and interoperability. XDR serialization converts data into a canonical binary representation for transmission over the network or storage on disk and other media. The serialized data can be carried by a network protocol (for example, as part of an RPC exchange) or stored persistently.
Features of XDR serialization include:

1. Machine-independent: XDR defines a set of data representation independent of specific machines and operating systems, so that data can be exchanged on different platforms.

2. Simplicity: The XDR serialization format is relatively simple, does not contain redundant information, and provides efficient data representation and analysis methods.

3. Scalability: XDR serialization supports multiple data types, including integers, floating-point numbers, strings, structures, etc., and complex data structures can be defined and used as needed.

4. Interoperability: The XDR serialization format is supported by many programming languages and platforms, so that different systems can use a unified data representation for data exchange.

XDR serialization is often used in the RPC (Remote Procedure Call) protocol as the data transmission format. By serializing the data into XDR format, parameters and return values can conveniently be transferred between the client and the server. Note that using XDR serialization requires defining the structure and type of the data and writing the corresponding serialization and parsing code. When using XDR for data exchange, the sender and receiver need to use the same data definition and serialization rules to ensure correct parsing and consistency of the data.
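Since XDR is referenced throughout the paper (NFS and Nessie both rely on it), the following small C example shows what XDR encoding and decoding look like with the classic xdr(3) routines (xdrmem_create, xdr_int, xdr_string). On modern Linux systems these routines are typically provided by libtirpc, so the include path and -ltirpc link flag may be needed; treat this as a sketch of the API rather than a build recipe.

```c
/* Encode an int and a string into a buffer with XDR, then decode them back,
 * as the receiving side of an RPC exchange would. */
#include <rpc/xdr.h>
#include <stdio.h>

int main(void)
{
    char buf[256];
    XDR enc, dec;

    /* encode */
    int value = 42;
    char *msg = "hello";
    xdrmem_create(&enc, buf, sizeof buf, XDR_ENCODE);
    xdr_int(&enc, &value);
    xdr_string(&enc, &msg, 128);
    unsigned len = xdr_getpos(&enc);        /* number of encoded bytes */

    /* decode */
    int out_value = 0;
    char out_msg[128];
    char *out_ptr = out_msg;
    xdrmem_create(&dec, buf, len, XDR_DECODE);
    xdr_int(&dec, &out_value);
    xdr_string(&dec, &out_ptr, sizeof out_msg);

    printf("decoded: %d \"%s\" (%u bytes on the wire)\n", out_value, out_ptr, len);
    return 0;
}
```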

RMA (Remote Memory Access) is a technique for remote memory access in parallel computing and high-performance computing (HPC) systems. It allows computing nodes to directly access the memory of other nodes without transferring data through a traditional message-passing interface such as MPI. In the traditional message-passing programming model, data exchange between nodes is completed through explicit send and receive operations, involving data copying and message transmission. In the RMA model, nodes can directly access the memory of remote nodes, making data exchange more efficient and reducing the overhead of data copying.

The RMA model usually uses two main operations:
Remote Write: Through a remote write operation, a node can write data into the memory of a remote node. The sending node writes the data directly into the memory space of the target node without an explicit receive operation on the target node.

Remote Read: Through a remote read operation, a node can read data from the memory of a remote node. The reading node fetches the data directly from the target node's memory without an explicit send operation by the target node.

The implementation of the RMA model usually relies on dedicated high-performance network interfaces such as InfiniBand (IB) or RDMA (Remote Direct Memory Access) over Ethernet. These interfaces provide low-latency and high-bandwidth communication capabilities to support efficient execution of RMA operations. The RMA model has important applications in the field of high-performance computing, especially in parallel computing and large-scale data processing. It can reduce data copying and communication overhead and improve the efficiency of data exchange between computing nodes, thereby accelerating the execution of parallel computing tasks. However, the RMA model also requires careful programming and debugging to ensure the correctness and consistency of memory accesses and to avoid data races and concurrency issues.
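The put/get model described above can be illustrated with standard MPI-3 one-sided operations, which follow the same expose/transfer pattern. The sketch below is generic MPI code (not Mercury or DART code): rank 0 exposes an integer through a window and rank 1 writes into it with MPI_Put, with no receive posted on the target.

```c
/* Minimal MPI one-sided (RMA) example; run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = -1;                       /* memory exposed through the window */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                /* open access epoch */
    if (rank == 1) {
        int value = 42;
        /* remote write: no matching receive is posted on rank 0 */
        MPI_Put(&value, 1, MPI_INT, 0 /* target rank */,
                0 /* target displacement */, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);                /* close epoch: transfer is complete */

    if (rank == 0)
        printf("rank 0 window now contains %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```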

RDMA (Remote Direct Memory Access) is a high-performance network communication technology for remote memory access between computing nodes. RDMA technology allows computing nodes to transfer data directly between each other without the intervention of a central processing unit (CPU). In the traditional network communication model, the transmission of data usually requires the intervention of the CPU, that is, the data is copied from the source node to the kernel buffer of the sending node, and then copied from the kernel buffer of the receiving node to the memory of the target node. This data copy process involves CPU processing and context switching, resulting in additional latency and system overhead. With RDMA technology, data can be copied directly from the memory of the source node to the memory of the target node, bypassing the intervention of the CPU and kernel buffers. This makes data transfer more efficient, featuring low latency and high bandwidth.

RDMA technologies are usually based on dedicated high-performance network interfaces such as InfiniBand (IB) or RDMA over Converged Ethernet (RoCE). These interfaces provide hardware-level support, including functions such as data packet transmission, data integrity checking, and flow control. RDMA technology is widely used in the field of high-performance computing (HPC), and is especially suitable for scenarios such as large-scale parallel computing, data center interconnection, and storage systems. It can provide communication with low latency, high throughput, and low CPU usage, thereby accelerating data transmission and processing tasks. Note that data transmission using RDMA requires appropriate programming models and software support, such as MPI (Message Passing Interface) or RDMA-aware libraries and frameworks. These tools provide abstraction and encapsulation of RDMA so that developers can easily use it for efficient remote memory access.

The asynchronous operation mode is a programming pattern for handling long-running operations such as network requests, file reads and writes, and database queries. In the traditional synchronous mode, the program waits for an operation to complete before continuing, while in the asynchronous mode the program can continue to perform other tasks without waiting for the operation to complete. The key to the asynchronous pattern is to delegate the operation to another task or thread, and to notify the main thread or invoke a callback function to process the result when the operation completes. This improves the concurrency and responsiveness of the program, especially in scenarios that handle a large number of I/O operations.

Asynchronous operation mode usually involves the following concepts:

1. Asynchronous function/method: An asynchronous function or method is one that can be suspended while an operation is in progress and return control to the caller before the final result is available. Such functions are usually marked with special keywords or modifiers (such as async/await) that instruct the compiler to turn them into appropriate asynchronous code.

2. Callback function: The callback function is a function that is called when the asynchronous operation is completed. It is usually passed as a parameter to an asynchronous operation and is called after the operation completes to process the result of the operation. Callback functions can be used to handle success conditions, error handling, clean up resources, etc.

3. Asynchronous task scheduler: The asynchronous task scheduler is responsible for managing and scheduling the execution of asynchronous tasks. It can determine the execution order of tasks and allocate resources according to conditions such as available system resources and task priorities.

4. Asynchronous event loop: The asynchronous event loop is a loop running in the main thread, which is used to receive and process the completion notification of the asynchronous operation, and schedule the execution of the callback function. It is responsible for managing the event queue, processing completed operations, and pushing corresponding callback functions onto the execution queue.

The benefits of the asynchronous mode of operation include:

1. Improve the concurrency and responsiveness of the program: multiple operations can be processed at the same time without waiting for each operation to complete.
2. Reduce waste of resources: While waiting for IO operations to complete, CPU resources can be used to perform other tasks to improve resource utilization.
3. Better user experience: Asynchronous operations can keep programs responsive while performing time-consuming operations, improving user experience.
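As a minimal illustration of the callback style described above, the following C sketch uses a POSIX thread as the "asynchronous task": the caller launches the operation, continues with other work, and a callback is invoked when the operation completes. Real frameworks add an event loop and a task scheduler; the names (async_op, worker, print_result) are illustrative only.

```c
/* Asynchronous operation with completion callback, sketched with pthreads. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

typedef void (*completion_cb)(int result);

struct async_op {
    int input;
    completion_cb on_done;
};

static void *worker(void *arg)
{
    struct async_op *op = arg;
    sleep(1);                       /* stand-in for a slow I/O operation */
    op->on_done(op->input * 2);     /* notify completion via the callback */
    return NULL;
}

static void print_result(int result)
{
    printf("operation completed, result = %d\n", result);
}

int main(void)
{
    struct async_op op = { .input = 21, .on_done = print_result };
    pthread_t tid;
    pthread_create(&tid, NULL, worker, &op);

    /* the caller is not blocked and can make progress in the meantime */
    printf("doing other work while the operation runs...\n");

    pthread_join(tid, NULL);        /* wait for completion before exiting */
    return 0;
}
```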

The I/O Forwarding Scalability Layer (IOFSL) [9] is another project upon which part of the work presented in this paper is based. IOFSL makes use of RPC to specifically forward I/O calls. It defines an API called ZOIDFS that locally serializes function parameters and sends them to a remote server, where they can in turn get mapped onto file system specific I/O operations. One of the main motivations for extending the work that already exists in IOFSL is the ability to send not only a specific set of calls, such as the ones that are defined through the ZOIDFS API, but a varied set of calls, which can be dynamically and generically defined. It is also worth noting that IOFSL is built on top of the BMI [10] network transport layer used in the Parallel Virtual File System (PVFS) [11]. It allows support for dynamic connection as well as fault tolerance and also defines two types of messaging, unexpected and expected (described in section III-B), that can enable an asynchronous mode of operation. Nevertheless, BMI is limited in its design by not directly exposing the RMA semantics that are required to explicitly achieve RDMA operations from the client memory to the server memory, which can be an issue and a performance limitation (the main advantages of using an RMA approach are described in section III-B). In addition, while BMI does not offer one-sided operations, it does provide a relatively high level set of network operations. This makes porting BMI to new network transports (such as the Cray Gemini interconnect [12]) non-trivial work, and more time consuming than it should be, as only a subset of the functionality provided by BMI is required for implementing RPC in our context.


IOFSL (I/O Forwarding Scalability Layer) is an I/O forwarding layer used to provide scalability in high-performance computing (HPC) systems. It aims to solve I/O bottlenecks and performance problems in large-scale parallel computing, and improve the I/O performance and scalability of the system by optimizing and distributing I/O loads. IOFSL forwards I/O requests from computing nodes to storage nodes by introducing an intermediate layer between computing nodes and storage nodes, and balances and optimizes I/O loads. It provides the following key features:

1. I/O distribution and load balancing: The IOFSL layer implements load balancing and parallel I/O operations by distributing I/O requests to multiple disk devices on multiple storage nodes. In this way, the bandwidth and throughput of the storage system can be fully utilized and the overall I/O performance can be improved.

2. Data aggregation and caching: The IOFSL layer can aggregate small I/O requests from different computing nodes into larger data blocks and perform data caching to reduce the load on the storage system and network transmission overhead. This helps improve data access efficiency and overall I/O performance.

3. Data compression and encoding: The IOFSL layer can implement data compression and encoding techniques to reduce the bandwidth consumption of data in network transmission. This helps reduce the load on the storage system and speed up the data transfer process.

4. Scalability and elasticity: IOFSL is designed to be scalable and configurable, and can adapt to HPC systems of different sizes and needs. It supports dynamic increase or decrease of storage nodes, and automatically adapts to system changes and failures.

By introducing the IOFSL layer, the HPC system can make full use of storage resources and improve I/O performance and scalability. It is very beneficial for large-scale parallel computing and data-intensive applications. It can optimize I/O operations and reduce data transmission overhead, thereby improving overall computing efficiency and system performance.

The Parallel Virtual File System (PVFS) is a distributed file system used to build high-performance parallel computing environments. The design goal of PVFS is to provide high throughput, low latency, and scalability to meet the high-performance requirements of file systems in massively parallel computing. PVFS organizes multiple storage nodes into a single logical file system and provides the ability to access files in parallel. It distributes file data across multiple storage nodes and allows multiple computing nodes to access files in parallel to achieve efficient I/O operations.

Key features and design principles of PVFS include:

1. Distributed architecture: PVFS adopts a distributed architecture to disperse and store file data on multiple storage nodes to achieve parallel data access and high-throughput I/O operations.

2. Parallel access: PVFS allows multiple computing nodes to access files at the same time, realizing parallel read and write operations. This parallel access method can improve the performance and response speed of the file system.

3. Data distribution and load balancing: PVFS uses data distribution strategies to evenly distribute file data to multiple storage nodes to achieve load balancing and efficient data access.

4. Fault tolerance and scalability: PVFS has a fault tolerance mechanism that supports automatic detection and recovery of node failures. In addition, PVFS also has good scalability, and storage nodes can be added as needed to expand the capacity and performance of the file system.

BMI (Buffered Message Interface) is the network abstraction layer originally developed for the Parallel Virtual File System (PVFS) and reused by IOFSL. It hides the details of the underlying interconnect behind a single message-passing API so that the same file system or I/O forwarding code can run over different network transports. Key characteristics of BMI include:

1. Non-blocking operation: BMI send and receive operations are posted and later tested or waited on for completion, allowing communication to be overlapped with computation and I/O.

2. Expected and unexpected messages: expected messages require the receiver to know the sender and post a matching receive, while unexpected messages can arrive from any client without a pre-posted receive; this is the messaging model referred to in the paragraph above.

3. Multiple transports: BMI implementations exist for TCP/IP as well as for high-performance interconnects, so the same code can be deployed on clusters with different networks.

Note, however, that BMI is a two-sided interface: it does not expose the one-sided RMA semantics needed to perform RDMA directly between client and server memory, which is precisely the limitation discussed above.

Another project, Sandia National Laboratories' NEtwork Scalable Service Interface (Nessie) [13] system, provides a simple RPC mechanism originally developed for the Lightweight File Systems [14] project. It provides an asynchronous RPC solution, which is mainly designed to overlap computation and I/O. The RPC interface of Nessie directly relies on the Sun XDR solution, which is mainly designed to communicate between heterogeneous architectures, even though practically all High-Performance Computing systems are homogeneous. Nessie provides a separate mechanism to handle bulk data transfers, which can use RDMA to transfer data efficiently from one memory to the other, and supports several network transports. The Nessie client uses the RPC interface to push control messages to the servers. Additionally, Nessie exposes a different, one-sided API (similar to Portals [15]), which the user can use to push or pull data between client and server. Mercury is different, in that its interface, which also supports RDMA natively, can transparently handle bulk data for the user by automatically generating abstract memory handles representing the remote large data arguments, which are easier to manipulate and do not require any extra effort by the user. Mercury also provides fine-grained control on the data transfer if required (for example to implement pipelining). In addition, Mercury provides a higher level interface than Nessie, greatly reducing the amount of user code needed to implement RPC functionality.


Another similar approach can be seen with the Decoupled and Asynchronous Remote Transfers (DART) [16] project. While DART is not defined as an explicit RPC framework, it allows transfer of large amounts of data using a client/server model from applications running on the compute nodes of an HPC system to local storage or remote locations, to enable remote application monitoring, data analysis, code coupling, and data archiving. The key requirements that DART tries to satisfy include minimizing data transfer overheads on the application, achieving high-throughput, low-latency data transfers, and preventing data losses. Towards achieving these goals, DART is designed so that dedicated nodes, i.e., separate from the application compute nodes, asynchronously extract data from the memory of the compute nodes using RDMA. In this way, expensive data I/O and streaming operations from the application compute nodes to dedicated nodes are offloaded, allowing the application to progress while data is transferred. While using DART is not transparent and therefore requires explicit requests to be sent by the user, there is no inherent limitation preventing the integration of such a framework within our network abstraction layer, and therefore it could be wrapped within the RPC layer that we define, allowing users to transfer data using DART on the platforms it supports.


III. ARCHITECTURE

As mentioned in the previous section, Mercury’s interface relies on three main components: a network abstraction layer, an RPC interface that is able to handle calls in a generic fashion and a bulk data interface, which complements the RPC layer and is intended to easily transfer large amounts of data by abstracting memory segments. We present in this section the overall architecture and each of its components.


A. Overview

The RPC interface follows a client / server architecture. As described in figure 1, issuing a remote call results in different steps depending on the size of the data associated with the call. We distinguish two types of transfers: transfers containing typical function parameters, which are generally small, referred to as metadata, and transfers of function parameters describing large amounts of data, referred to as bulk data.



Figure 1: Architectural overview: each party uses an RPC handler to serialize and deserialize the parameters sent through the interface. Calling functions with relatively small parameters results in the use of the short messaging mechanism exposed by the network abstraction layer, while functions with large data parameters additionally use the RMA mechanism.

Every RPC call sent through the interface results in the serialization of function parameters into a memory buffer (its size generally being limited to one kilobyte, depending on the interconnect), which is then sent to the server using the network abstraction layer interface. One of the key requirements is to limit memory copies at any stage of the transfer, especially when transferring large amounts of data. Therefore, if the data sent is small, it is serialized and sent using small messages; otherwise a description of the memory region that is to be transferred is sent within this same small message to the server, which can then start pulling the data (if the data is the input of the remote call) or pushing the data (if the data is the output of the remote call). Limiting the size of the initial RPC request to the server also helps in scalability, as it avoids unnecessary server resource consumption in case of large numbers of clients concurrently accessing the same server. Depending on the degree of control desired, all these steps can be transparently handled by Mercury or directly exposed to the user.


B. Network Abstraction Layer

The main purpose of the network abstraction layer is, as its name suggests, to abstract the network protocols that are exposed to the user, allowing multiple transports to be integrated through a system of plugins. A direct consequence imposed by this architecture is the need to provide a lightweight interface, for which only a reasonable effort will be required to implement a new plugin. The interface itself must define three main types of mechanisms for transferring data: unexpected messaging, expected messaging and remote memory access; it must also define the additional setup required to dynamically establish a connection between the client and the server (although a dynamic connection may not always be feasible, depending on the underlying network implementation used).


Unexpected and expected messaging is limited to the transfer of short messages and makes use of a two-sided approach. The maximum message size is, for performance reasons, determined by the interconnect and can be as small as a few kilobytes. The concept of unexpected messaging is used in other communication protocols such as BMI [10]. Sending an unexpected message through the network abstraction layer does not require a matching receive to be posted before it can complete. By using this mechanism, clients are not blocked and the server can, every time an unexpected receive is issued, pick up the new messages that have been posted. Another difference between expected and unexpected messages is that unexpected messages can arrive from any remote source, while expected messages require the remote source to be known.


The remote memory access (RMA) interface allows remote memory chunks (contiguous and non-contiguous) to be accessed. In most one-sided interfaces and RDMA protocols, memory must be registered with the network interface controller (NIC) before it can be used. The purpose of the interface defined in the network abstraction layer is to create a first degree of abstraction and define an API that is compatible with most RMA protocols. Registering a memory segment with the NIC typically results in the creation of a handle to that segment containing virtual address information, etc. The local handle created needs to be communicated to the remote node before that node can start a put or get operation. The network abstraction is responsible for ensuring that these memory handles can be serialized and transferred across the network. Once handles are exchanged, a non-blocking put or get can be initiated. On most interconnects, put and get will map to the corresponding put and get operations of the interconnect's native API. The network abstraction interface is designed to allow the emulation of one-sided transfers on top of two-sided sends and receives for network protocols such as TCP/IP that only support a two-sided messaging approach.


With this network abstraction layer in place, Mercury can easily be ported to support new interconnects. The relatively limited functionality provided by the network abstraction (for example, no unlimited size two-sided messages) ensures close to native performance.
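To give an idea of what "implementing a new plugin" involves, the following is a hypothetical sketch of the operations a network abstraction plugin would have to provide: unexpected/expected short messages, plus registration and one-sided put/get for RMA. The struct and function names are invented for illustration and do not reproduce Mercury's actual NA API.

```c
/* Hypothetical network abstraction plugin interface (illustrative only). */
#include <stddef.h>

typedef struct na_addr   na_addr_t;     /* opaque peer address             */
typedef struct na_mem    na_mem_t;      /* opaque registered memory region */
typedef void (*na_done_cb)(void *arg);  /* completion callback             */

struct na_plugin_ops {
    /* connection / addressing */
    int (*lookup)(const char *name, na_addr_t **addr);

    /* two-sided short messages */
    int (*msg_send_unexpected)(na_addr_t *dest, const void *buf, size_t len,
                               na_done_cb cb, void *arg);
    int (*msg_recv_unexpected)(void *buf, size_t len,
                               na_done_cb cb, void *arg);
    int (*msg_send_expected)(na_addr_t *dest, const void *buf, size_t len,
                             na_done_cb cb, void *arg);
    int (*msg_recv_expected)(na_addr_t *src, void *buf, size_t len,
                             na_done_cb cb, void *arg);

    /* one-sided RMA on registered memory */
    int (*mem_register)(void *buf, size_t len, na_mem_t **handle);
    int (*mem_deregister)(na_mem_t *handle);
    int (*put)(na_mem_t *local, na_addr_t *dest, na_mem_t *remote,
               size_t len, na_done_cb cb, void *arg);
    int (*get)(na_mem_t *local, na_addr_t *dest, na_mem_t *remote,
               size_t len, na_done_cb cb, void *arg);
};
```

For transports that only offer two-sided messaging (such as TCP/IP), the put and get entries would be emulated internally with matched sends and receives, as described above.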


Two-sided messaging is a communication pattern in which both endpoints participate in every transfer: a send operation on one side must be matched by a corresponding receive operation on the other side before the data can be delivered. This contrasts with one-sided (RMA) communication, where the target does not post a matching operation.

In parallel computing and distributed systems, two-sided messages are used to exchange data between two processes or nodes: one entity initiates the communication by sending a message, and the receiver processes it and may send a response. Two-sided messaging is the basic model of message-passing systems such as the Message Passing Interface (MPI) in high-performance computing, and of network protocols such as TCP/IP. It naturally supports request/response interactions, where the sender requests information or a service from the receiver and obtains a specific response, as illustrated by the MPI sketch below.
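The minimal MPI example below shows the two-sided pattern: the send on rank 0 is only delivered against the matching receive posted on rank 1, in contrast to the one-sided MPI_Put example given earlier, where the target posts nothing.

```c
/* Two-sided messaging in MPI; run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, /* tag */ 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        /* the transfer completes only because this matching receive exists */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```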

C. RPC Interface and Metadata

Sending a call that only involves small data makes use of the unexpected / expected messaging defined in III-B. However, at a higher level, sending a function call to the server means concretely that the client must know how to encode the input parameters before it can start sending information and know how to decode the output parameters once it receives a response from the server. On the server side, the server must also have knowledge of what to execute when it receives an RPC request and how it can decode and encode the input and output parameters. The framework for describing the function calls and encoding/decoding parameters is key to the operation of our interface.


One of the important points is the ability to support a set of function calls that can be sent to the server in a generic fashion, avoiding the limitations of a hard-coded set of routines. The generic framework is described in figure 2. During the initialization phase, the client and server register encoding and decoding functions by using a unique function name that is mapped to a unique ID for each operation, shared by the client and server. The server also registers the callback that needs to be executed when an operation ID is received with a function call. To send a function call that does not involve bulk data transfer, the client encodes the input parameters along with that operation's ID into a buffer and sends it to the server using an unexpected messaging protocol, which is non-blocking. To ensure full asynchrony, the memory buffer used to receive the response back from the server is also pre-posted by the client. For reasons of efficiency and resource consumption, these messages are limited in size (typically a few kilobytes). However, if the metadata exceeds the size of an unexpected message, the client will need to transfer the metadata in separate messages, making transparent use of the bulk data interface described in III-D to expose the additional metadata to the server.



Figure 2: Asynchronous execution flow of RPC calls. The receive buffer is pre-posted, allowing the client to do other work while the call is executed remotely and the response is sent.

When the server receives a new request ID, it looks up the corresponding callback, decodes the input parameters, executes the function call, encodes the output parameters and starts sending the response back to the client. Sending a response back to the client is also non-blocking, therefore, while receiving new function calls, the server can also test a list of response requests to check their completion, freeing the corresponding resources when an operation completes. Once the client has knowledge that the response has been received (using a wait/test call) and therefore that the function call has been remotely completed, it can decode the output parameters and free the resources that were used for the transfer.
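The registration and dispatch mechanism described above can be sketched as a small operation table, shown below in C. This is a hypothetical illustration, not Mercury's API: a function name is hashed to an operation ID shared by client and server, and the server associates decode, execute and encode callbacks with that ID so that an incoming request can be dispatched generically.

```c
/* Hypothetical registration table mapping operation IDs to callbacks. */
#include <stddef.h>
#include <stdint.h>

typedef int  (*decode_in_cb)(const void *buf, size_t len, void *in);
typedef int  (*encode_out_cb)(void *buf, size_t len, const void *out);
typedef void (*exec_cb)(const void *in, void *out);

struct rpc_op {
    uint32_t      id;          /* hash of the function name            */
    decode_in_cb  decode_in;   /* unpack input parameters              */
    exec_cb       execute;     /* the actual routine to run remotely   */
    encode_out_cb encode_out;  /* pack output parameters for the reply */
};

static struct rpc_op op_table[64];
static unsigned      op_count;

/* simple string hash so client and server derive the same ID */
static uint32_t hash_name(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

uint32_t rpc_register(const char *name, decode_in_cb dec,
                      exec_cb exec, encode_out_cb enc)
{
    uint32_t id = hash_name(name);
    op_table[op_count++] = (struct rpc_op){ id, dec, exec, enc };
    return id;   /* the client only needs the ID to issue the call */
}

/* server side: called when an unexpected message carrying 'id' arrives */
const struct rpc_op *rpc_lookup(uint32_t id)
{
    for (unsigned i = 0; i < op_count; i++)
        if (op_table[i].id == id)
            return &op_table[i];
    return NULL;
}
```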


With this mechanism in place, it becomes simple to extend it to handle bulk data.

D. Bulk Data Interface

In addition to the previous interface, some function calls may require the transfer of larger amounts of data. For these function calls, the bulk data interface is used; it is built on top of the remote memory access protocol defined in the network abstraction layer. Only the RPC server initiates one-sided transfers, so that it can both control the data flow and protect its memory from concurrent accesses.

As described in figure 3, the bulk data transfer interface uses a one-sided communication approach. The RPC client exposes a memory region to the RPC server by creating a bulk data descriptor (which contains virtual memory address information, the size of the memory region that is being exposed, and other parameters that may depend on the underlying network implementation). The bulk data descriptor can then be serialized and sent to the RPC server along with the RPC request parameters (using the RPC interface defined in section III-C). When the server decodes the input parameters, it deserializes the bulk data descriptor and gets the size of the memory buffer that has to be transferred.

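As a minimal sketch of the client side of this exchange, a single contiguous buffer can be exposed with one segment and the resulting handle attached to the input structure of the request (the non-contiguous, multi-segment variant is shown in the listing of section IV-A):

/* Sketch: expose one contiguous buffer as a bulk data descriptor */
hg_bulk_segment_t segment;
hg_bulk_t bulk_handle = HG_BULK_NULL;
char data[4096];

segment.address = data;
segment.size = sizeof(data);
HG_Bulk_handle_create_segments(&segment, 1, HG_BULK_READ_ONLY, &bulk_handle);

/* The descriptor is serialized along with the other input parameters */
in_struct.bulk_handle = bulk_handle;
HG_Forward(server_addr, rpc_id, &in_struct, &out_struct, &rpc_request);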

[Figure 3: bulk data transfer using one-sided communication]

In the case of an RPC request that consumes large data parameters, the RPC server may allocate a buffer of the size of the data that needs to be received, expose its local memory region by creating a bulk data block descriptor, and initiate an asynchronous read/get operation on that memory region. The RPC server then waits/tests for the completion of the operation and executes the call once the data has been fully received (or partially, if the execution of the call supports it). The response (i.e., the result of the call) is then sent back to the RPC client and the memory handles are freed.

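A non-pipelined sketch of this consume path, using the same calls as the full server listing in section IV-A, would be:

/* Sketch: pull the entire exposed region with one read/get, then execute */
nbytes = HG_Bulk_handle_get_size(bulk_handle);
buf = malloc(nbytes);
HG_Bulk_block_handle_create(buf, nbytes, HG_BULK_READWRITE, &bulk_block_handle);

HG_Bulk_read(client_addr, bulk_handle, 0, bulk_block_handle, 0, nbytes, &bulk_request);
HG_Bulk_wait(bulk_request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);

/* Data fully received: execute the call, send the response, free the handles */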

In the case of an RPC request that produces large data parameters, the RPC server may allocate a buffer of the size of the data that is going to be produced, expose the memory region by creating a bulk data block descriptor, execute the call, then initiate an asynchronous write/put operation to the client memory region that has been exposed. The RPC server may then wait/test for the completion of the operation and send the response (i.e., the result of the call) back to the RPC client. Memory handles can then be freed.

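The produce path can be sketched in the same way; HG_Bulk_write is assumed here to be the write/put counterpart of the HG_Bulk_read call used elsewhere in this paper, with a mirrored argument order:

/* Sketch: execute first, then push the produced data to the exposed client region */
nbytes = HG_Bulk_handle_get_size(bulk_handle);
buf = malloc(nbytes);
HG_Bulk_block_handle_create(buf, nbytes, HG_BULK_READWRITE, &bulk_block_handle);

/* ... execute the call, filling buf with the produced data ... */

HG_Bulk_write(client_addr, bulk_handle, 0, bulk_block_handle, 0, nbytes, &bulk_request);
HG_Bulk_wait(bulk_request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);

/* Completion: send the response back, then free the handles and buf */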

Transferring data through this process can be transparent for the user, especially since the RPC interface can also take care of serializing/deserializing the memory handles along with the other parameters. This is particularly important when non-contiguous memory segments have to be transferred. In either case, memory segments are automatically registered on the RPC client and are abstracted by the memory handle that is created. The memory handle is then serialized along with the parameters of the RPC function, and transferring large data using non-contiguous memory regions therefore results in the same process described above. Note that the handle may be of variable size in this case, as it may contain more information and also depends on the underlying network implementation, which may support registration of memory segments directly.

IV. EVALUATION

The architecture previously defined enables generic RPC calls to be shipped along with handles that can describe contiguous and non-contiguous memory regions when a bulk data transfer is required. We present in this section how one can take advantage of this architecture to build a pipelining mechanism that can easily request blocks of data on demand.

A. Pipelining Bulk Data Transfers

Pipelining transfers is a typical use case when one wants to overlap communication and execution. In the architecture that we described, requesting a large amount of data to be processed results in an RPC request being sent from the RPC client to the RPC server as well as a bulk data transfer. In a common use case, the server may wait for the entire data to be received before executing the requested call. However, by pipelining the transfers, one can in fact start processing the data while it is being transferred, avoiding the latency cost of waiting for an entire RMA transfer to complete. Note that although we focus on this point in the example below, using this technique can also be particularly useful if the RPC server does not have enough memory to handle all the data that needs to be sent, in which case it will also need to transfer data as it processes it.

A simplified version of the RPC client code is presented below:

#define BULK_NX 16
#define BULK_NY 128

int main(int argc, char *argv[])
{
    hg_id_t rpc_id;
    write_in_t in_struct;
    write_out_t out_struct;
    hg_request_t rpc_request;
    int buf[BULK_NX][BULK_NY];
    hg_bulk_segment_t segments[BULK_NX];
    hg_bulk_t bulk_handle = HG_BULK_NULL;
    int i, ret;

    /* Initialize the interface */
    [...]
    /* Register RPC call */
    rpc_id = HG_REGISTER("write", write_in_t, write_out_t);

    /* Provide data layout information */
    for (i = 0; i < BULK_NX; i++) {
        segments[i].address = buf[i];
        segments[i].size = BULK_NY * sizeof(int);
    }

    /* Create bulk handle with segment info */
    HG_Bulk_handle_create_segments(segments, BULK_NX, HG_BULK_READ_ONLY, &bulk_handle);

    /* Attach bulk handle to input parameters */
    [...]
    in_struct.bulk_handle = bulk_handle;

    /* Send RPC request */
    HG_Forward(server_addr, rpc_id, &in_struct, &out_struct, &rpc_request);

    /* Wait for RPC completion and response */
    HG_Wait(rpc_request, HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);

    /* Get output parameters */
    [...]
    ret = out_struct.ret;

    /* Free bulk handle */
    HG_Bulk_handle_free(bulk_handle);

    /* Finalize the interface */
    [...]
}

When the client initializes, it registers the RPC call it wants to send. Because this call involves non-contiguous bulk data transfers, memory segments that describe the memory regions are created and registered. The resulting bulk_handle is then passed to the HG_Forward call along with the other call parameters. One may then wait for the response and free the bulk handle when the request has completed (a notification may also be sent in the future to allow the bulk handle to be freed earlier, and hence the memory to be unpinned).

The pipelining mechanism happens on the server, which takes care of the bulk transfers. In this example the pipeline has a fixed pipeline size and a fixed pipeline buffer size. A simplified version of the RPC server code is presented below:

#define PIPELINE_BUFFER_SIZE 256
#define PIPELINE_SIZE 4

int rpc_write(hg_handle_t handle)
{
    write_in_t in_struct;
    write_out_t out_struct;
    hg_bulk_t bulk_handle;
    hg_bulk_block_t bulk_block_handle;
    hg_bulk_request_t bulk_request[PIPELINE_SIZE];
    void *buf;
    size_t nbytes, nbytes_read = 0;
    size_t start_offset = 0;
    int p, ret;

    /* Get input parameters and bulk handle */
    HG_Handler_get_input(handle, &in_struct);
    [...]
    bulk_handle = in_struct.bulk_handle;

    /* Get size of data and allocate buffer */
    nbytes = HG_Bulk_handle_get_size(bulk_handle);
    buf = malloc(nbytes);

    /* Create block handle to read data */
    HG_Bulk_block_handle_create(buf, nbytes, HG_BULK_READWRITE, &bulk_block_handle);

    /* Initialize pipeline and start reads */
    for (p = 0; p < PIPELINE_SIZE; p++) {
        size_t offset = p * PIPELINE_BUFFER_SIZE;
        /* Start read of data chunk */
        HG_Bulk_read(client_addr, bulk_handle, offset, bulk_block_handle,
                     offset, PIPELINE_BUFFER_SIZE, &bulk_request[p]);
    }

    while (nbytes_read != nbytes) {
        for (p = 0; p < PIPELINE_SIZE; p++) {
            size_t offset = start_offset + p * PIPELINE_BUFFER_SIZE;
            /* Wait for data chunk */
            HG_Bulk_wait(bulk_request[p], HG_MAX_IDLE_TIME, HG_STATUS_IGNORE);
            nbytes_read += PIPELINE_BUFFER_SIZE;

            /* Do work (write data chunk) */
            write((char *) buf + offset, PIPELINE_BUFFER_SIZE);

            /* Start another read, one pipeline width further ahead */
            offset += PIPELINE_BUFFER_SIZE * PIPELINE_SIZE;
            if (offset < nbytes) {
                HG_Bulk_read(client_addr, bulk_handle, offset, bulk_block_handle,
                             offset, PIPELINE_BUFFER_SIZE, &bulk_request[p]);
            } else {
                /* Start read with remaining piece */
            }
        }
        start_offset += PIPELINE_BUFFER_SIZE * PIPELINE_SIZE;
    }

    /* Free block handle */
    HG_Bulk_block_handle_free(bulk_block_handle);
    free(buf);

    /* Start sending response back */
    [...]
    out_struct.ret = ret;
    HG_Handler_start_output(handle, &out_struct);
}

int main(int argc, char *argv[])
{
    /* Initialize the interface */
    [...]
    /* Register RPC call */
    HG_HANDLER_REGISTER("write", rpc_write, write_in_t, write_out_t);

    while (!finalized) {
        /* Process RPC requests (non-blocking) */
        HG_Handler_process(0, HG_STATUS_IGNORE);
    }

    /* Finalize the interface */
    [...]
}

Every RPC server, once it is initialized, must loop over an HG_Handler_process call, which waits for new RPC requests and executes the corresponding registered callback (in the same thread or in a new thread, depending on user needs). Once the request is deserialized, the bulk_handle parameter can be used to get the total size of the data that is to be transferred, allocate a buffer of the appropriate size and start the bulk data transfers. In this example, the pipeline size is set to 4 and the pipeline buffer size is set to 256 bytes, which means that 4 RMA requests of 256 bytes each are initiated. One can then wait for the first piece of 256 bytes to arrive and process it. While it is being processed, other pieces may arrive. Once one piece is processed, a new RMA transfer is started for the piece that is at stage 4 in the pipeline, and one can then wait for and process the next piece. Note that while the memory region registered on the client is non-contiguous, the HG_Bulk_read call on the server presents it as a contiguous region, simplifying the server code. In addition, logical offsets (relative to the beginning of the data) can be given to move data pieces individually, with the bulk data interface taking care of mapping from the contiguous logical offsets to the non-contiguous client memory regions.

We continue this process until all the data has been read and processed and the response (i.e., the result of the function call) can be sent back. Again, calling HG_Handler_start_output only starts sending the response; its completion is tested by subsequent calls to HG_Handler_process, at which point the resources associated with the response are freed. Note that all functions support asynchronous execution, allowing Mercury to be used in event-driven code if so desired.

B. Network Plugins and Testing Environment

As of this writing, two plugins have been developed to illustrate the functionality of the network abstraction layer; at this point they have not been optimized for performance. One is built on top of BMI [10]. However, as already pointed out in section II, BMI does not provide the RMA semantics needed to efficiently take advantage of the network abstraction layer and the one-sided bulk data transfer architecture. The other is built on top of MPI [17], which has only recently provided full RMA semantics [18] with MPI-3 [19]. Many MPI implementations, specifically those delivered with already installed machines, do not yet provide all of the MPI-3 functionality. As BMI has not yet been ported to recent HPC systems, we only consider the MPI plugin in this paper to illustrate the functionality and measure early performance results. To be able to run on existing HPC systems limited to MPI-2 functionality, such as Cray systems, this plugin implements bulk data transfers on top of two-sided messaging.

In practice, this means that for each bulk data transfer, an additional bulk data control message needs to be sent to the client to request either sending or receiving data. Progress on the transfer can then be realized by using a progress thread or by entering progress functions.

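The following sketch illustrates the idea of this emulation (our illustration only, not the plugin's actual code): to emulate a server-initiated get, the server sends a small control message describing the region it needs, the client answers with a plain two-sided send of that region, and the server receives it into its local buffer. The control structure, tags, ranks and buffers are all assumptions made for the sketch.

/* Sketch: emulating a one-sided "get" with two-sided MPI messaging */
typedef struct {
    size_t offset;    /* offset into the exposed client region */
    size_t size;      /* number of bytes requested */
} bulk_ctrl_t;

/* RPC server side: request the data, then receive it */
bulk_ctrl_t ctrl = { 0, nbytes };
MPI_Request ctrl_req, data_req;
MPI_Isend(&ctrl, (int) sizeof(ctrl), MPI_BYTE, client_rank, CTRL_TAG, comm, &ctrl_req);
MPI_Irecv(server_buf, (int) nbytes, MPI_BYTE, client_rank, DATA_TAG, comm, &data_req);
/* a progress thread or progress function drives these requests to completion */
MPI_Wait(&ctrl_req, MPI_STATUS_IGNORE);
MPI_Wait(&data_req, MPI_STATUS_IGNORE);

/* RPC client side: answer the control message by sending the exposed region */
MPI_Recv(&ctrl, (int) sizeof(ctrl), MPI_BYTE, server_rank, CTRL_TAG, comm, MPI_STATUS_IGNORE);
MPI_Send((char *) exposed_buf + ctrl.offset, (int) ctrl.size, MPI_BYTE,
         server_rank, DATA_TAG, comm);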

For testing we make use of two different HPC systems. One is an InfiniBand QDR 4X cluster with MVAPICH [20] 1.8.1; the other is a Cray XE6 with Cray MPT [21] 5.6.0.

C. Performance Evaluation

As a first experiment, we measured the time it takes to execute a small RPC call (without any bulk data transfer involved) for an empty function (i.e., a function that returns immediately). On the Cray XE6 machine, measuring the average time over 20 RPC invocations, each call took 23 µs. This time includes the XDR encoding and decoding of the parameters of the function. However, as pointed out earlier, most HPC systems are homogeneous and thus do not require the data portability provided by XDR. When disabling XDR encoding (performing a simple memory copy instead), the time drops to 20 µs. This non-negligible improvement (15%) demonstrates the benefit of designing an RPC framework specifically for HPC environments.

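The difference between the two paths can be sketched as follows (illustration only; the parameter structure and field names are placeholders): the portable path runs each field through an XDR stream, while the homogeneous path simply copies the parameter structure into the message buffer.

#include <rpc/xdr.h>
#include <string.h>

typedef struct { int fd; unsigned int count; } params_t;   /* placeholder parameters */

/* Portable path: XDR-encode each field into the message buffer */
size_t encode_xdr(char *msg_buf, unsigned int buf_size, params_t *p)
{
    XDR xdrs;
    xdrmem_create(&xdrs, msg_buf, buf_size, XDR_ENCODE);
    xdr_int(&xdrs, &p->fd);
    xdr_u_int(&xdrs, &p->count);
    return xdr_getpos(&xdrs);
}

/* Homogeneous path: a raw memory copy of the parameter structure */
size_t encode_raw(char *msg_buf, params_t *p)
{
    memcpy(msg_buf, p, sizeof(*p));
    return sizeof(*p);
}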

The second experiment consists of testing the previously described pipelining technique for bulk data transfers between one client and one server. As shown in table I, on the Cray XE6 pipelining transfers can be particularly efficient when requests have already completed while other pipeline stages are being processed, allowing us to obtain very high bandwidth. However, the high injection bandwidth on this system makes it difficult to get good performance for small packets (such as the bulk data control messages required by the emulation of one-sided features on this system), particularly when the data flow is not continuous.

[Table I: pipelined bulk data transfer bandwidth measured on the Cray XE6]

Finally, we evaluated the scalability of the RPC server by measuring the total data throughput while increasing the number of clients. Figures 4 and 5 show the results for a QDR InfiniBand system (using MVAPICH) and the Cray XE6 system, respectively. In both cases, in part due to the server-side bulk data flow control mechanism, Mercury shows excellent scalability, with throughput either increasing or remaining stable as the number of concurrent clients increases. For comparison, the point-to-point message bandwidth on each system is shown. On the InfiniBand system, Mercury achieves about 70% of the maximum network bandwidth. This is an excellent result, considering that the Mercury time represents an RPC call in addition to the data transfer, compared to the time to send a single message for the OSU benchmark. On the Cray system, performance is less good (about 40% of peak). We expect that this is mainly due to the relatively poor small-message performance of the system, combined with the extra control messages caused by the one-sided emulation. However, the lower performance may also be caused by system limitations, considering that a similar operation (read) in Nessie [22] shows the same lower bandwidth, even though it bypasses MPI and uses the native uGNI API of the RDMA interconnect directly.

[Figures 4 and 5: total data throughput with increasing number of clients on the InfiniBand QDR cluster and on the Cray XE6]

V. CONCLUSION AND FUTURE WORK

In this paper we presented the Mercury framework. Mercury is specifically designed to offer RPC services in a High-Performance Computing environment. Mercury builds on a small, easily ported network abstraction layer providing operations closely matched to the capabilities of contemporary HPC network environments. Unlike most other RPC frameworks, Mercury offers direct support for handling remote calls containing large data arguments. Mercury's network protocol is designed to scale to thousands of clients. We demonstrated the power of the framework by implementing a remote write function including pipelining of large data arguments. We subsequently evaluated our implementation on two different HPC systems, showing both single-client performance and multi-client scalability.

With the availability of the high-performing, portable, generic RPC functionality provided by Mercury, IOFSL can be simplified and modernized by replacing the internal, hard-coded IOFSL code with Mercury calls. As the network abstraction layer on top of which Mercury is built already supports using BMI for network connectivity, existing deployments of IOFSL continue to be supported, at the same time taking advantage of the improved scalability and performance of Mercury's network protocol.

Currently, Mercury does not offer support for canceling ongoing RPC calls. Cancellation is important for resiliency in environments where nodes or the network can fail. Future work will include support for cancellation.

While Mercury already supports all required functionality to efficiently execute RPC calls, the amount of user code required for each call can be further reduced. Future versions of Mercury will provide a set of preprocessor macros, reducing the user's effort by automatically generating as much boilerplate code as possible.

The network abstraction layer currently has plugins for BMI, MPI-2 and MPI-3. However, as MPI RMA functionality is difficult to use in a client/server context [23], we intend to add native support for InfiniBand networks, the Cray XT and IBM BG/P and Q networks.
