Keeping GPUs Busy
As AI and HPC datasets continue to grow, the time spent loading data begins to strain overall application performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O.
I/O, the process of loading data from storage to GPUs for processing, has historically been controlled by the CPU. As computation shifts from slower CPUs to faster GPUs, I/O becomes more of a bottleneck to overall application performance.
Just as GPUDirect RDMA (Remote Direct Memory Access) improved bandwidth and latency when moving data directly between a network interface card (NIC) and GPU memory, a new technology called GPUDirect Storage enables a direct data path between local or remote storage, like NVMe or NVMe over Fabrics (NVMe-oF), and GPU memory. Both GPUDirect RDMA and GPUDirect Storage avoid extra copies through a bounce buffer in the CPU's memory and enable a direct memory access (DMA) engine near the NIC or storage to move data on a direct path into or out of GPU memory, all without burdening the CPU or GPU. This is illustrated in Figure 1. For GPUDirect Storage, storage location doesn't matter; it could be inside an enclosure, within the rack, or connected over the network. In an NVIDIA DGX-2, the bandwidth from CPU system memory (SysMem) to GPUs is limited to 50 GB/s. By contrast, the bandwidth from SysMem, from many local drives, and from many NICs can be combined to achieve an upper bandwidth limit of nearly 200 GB/s.
Figure 1: The standard path between GPU memory and NVMe drives uses a bounce buffer in system memory attached to the CPU. The direct data path from storage achieves higher bandwidth by skipping the CPU altogether.
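To get a feel for what those two bandwidth limits mean in practice, here is a rough back-of-the-envelope sketch in Python. It only divides a dataset size by the two DGX-2 figures quoted above. The 1,000 GB dataset size and the helper name `transfer_seconds` are assumptions made for illustration, and the model ignores latency, file system overhead, and the fact that "nearly 200 GB/s" is an upper limit rather than a sustained rate.

```python
# Bandwidth limits quoted in the text for an NVIDIA DGX-2, in GB/s.
SYSMEM_TO_GPU_GBPS = 50.0    # CPU system memory (SysMem) to GPUs
COMBINED_PEAK_GBPS = 200.0   # SysMem + local drives + NICs; "nearly 200", used as an upper bound


def transfer_seconds(dataset_gb: float, bandwidth_gbps: float) -> float:
    """Idealized transfer time: size divided by bandwidth, ignoring all overheads."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return dataset_gb / bandwidth_gbps


# Hypothetical dataset size, chosen only for illustration.
dataset_gb = 1000.0

bounce_s = transfer_seconds(dataset_gb, SYSMEM_TO_GPU_GBPS)
direct_s = transfer_seconds(dataset_gb, COMBINED_PEAK_GBPS)
speedup = bounce_s / direct_s

print(f"via SysMem only:      {bounce_s:.1f} s")
print(f"all paths combined:   {direct_s:.1f} s")
print(f"upper-bound speedup:  {speedup:.1f}x")
```

Under these idealized assumptions the ratio is simply the ratio of the two bandwidth limits, so real workloads should be expected to land somewhere below it.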
In this post, we expand on a previous post demonstrating GPUDirect Storage: a proof of concept enabling direct memory access (DMA) from storage that is either local to a given server or outside of the enclosure via NVMe-oF. We demonstrate that direct memory access from storage to GPU relieves the CPU I/O bottleneck and enables increased I/O bandwidth and capacity. Further, we provide initial performance metrics presented at GTC19 in San Jose, based on the RAPIDS project's GPU-accelerated CSV reader housed within the cuDF library. Lastly, we suggest key applications that can make use of higher bandwidth, lower latency, and increased capacity between storage and GPUs.
In future postings, as this feature…