GPUDirect Use

cuFile Configuration Settings

The cuFile library has a configuration file, located by default at /etc/cufile.json. The default configuration may not suit every workload or application, in which case users will need to modify it. To do this, first make a local copy at some path:

cp /etc/cufile.json /some/path/cufile.json

Then the CUFILE_ENV_PATH_JSON environment variable must be updated to point to the new configuration file:

export CUFILE_ENV_PATH_JSON="/some/path/cufile.json"

The available parameters and their values are described in the GPUDirect Storage Benchmarking and Configuration Guide. Make and save any changes necessary.
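Parameters in the file are grouped into sections such as properties. A minimal sketch of an edited copy (parameter names and values here are taken from the gdscheck output shown below; consult the guide for the authoritative list) might look like:

{
  "properties": {
    "use_compat_mode": true,
    "max_direct_io_size_kb": 16384,
    "max_device_cache_size_kb": 131072
  }
}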

Verify GPUDirect Storage is Supported

GPUDirect Storage (GDS) is an experimental and evolving technology, and there may come times when users suspect GDS is not working as intended. A first check is to verify that the GDS-associated libraries are loaded correctly. This can be done by calling gdscheck.py -p.

[strugf@gpu0206 ~]$ /usr/local/cuda-12.6/gds/tools/gdscheck.py -p
 GDS release version: 1.11.1.6
 nvidia_fs version:  2.25 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 1 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 2 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Pass-through or Enabled
 GPU index 3 NVIDIA RTX A6000 bar:1 bar size (MiB):256 supports GDS, IOMMU State: Pass-through or Enabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: Pass-through or enabled
 WARN: GDS is not guaranteed to work functionally or in a performant way with iommu=on/pt
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12060
 Platform: MZ33-AR0-000, Arch: x86_64(Linux 5.14.0-503.33.1.el9_5.x86_64)
 Platform verification succeeded

Let’s read through some of the most important lines of the output.

GDS release version: 1.11.1.6
 nvidia_fs version:  2.25 libcufile version: 2.12

This informs us that libcufile has been loaded and that the nvidia-fs driver, which is responsible for memory mapping between storage and the GPU, is installed.
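If gdscheck.py is not at hand, the nvidia-fs kernel module can also be checked directly with standard Linux tools:

# Confirm the nvidia_fs kernel module is loaded.
lsmod | grep nvidia_fs

# Report the installed nvidia-fs module version.
modinfo -F version nvidia_fs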

=====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0

The line NVMe: Supported tells us that GDS is currently configured to work with NVMe drives, while every other storage type is marked Unsupported because it is not configured.

It is also possible to verify that GDS is working by setting the cufile.json parameter logging:level to TRACE.
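A minimal excerpt of the relevant section might look like the following (other sections of the file are omitted; level also accepts values such as ERROR, WARN, INFO, and DEBUG):

{
  "logging": {
    "level": "TRACE"
  }
}

When a process that makes cuFile API calls runs, a cufile.log file will be created in the directory the process was launched from. The log is highly verbose, but the GDS transfers are visible in it: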

25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG cufio_core:2716 gds path taken with ODIRECT fd: 4 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:1709 nvfs_io_submit file_offset 0 size 4194304 gpu_offset 0 nvbuf 0x7fae3d1bb700 is_unaligned 0 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:461 current cuda context present 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:472 Allocate buffer of size 1048576 on GPU 0 PCI-Group 0 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:389 Bounce buffer 140380150431744 GPU page aligned 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:481 Buffer from aligned alloc, dptr 0x7faccd000000 aligned_dptr 0x7faccd000000 size 1048576 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:335 map buf 0x7faccd000000 Size 1048576 sbuf_size 1048576 pin_gpu_memory 1 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:336 map buf 0x7faccd000000 bounce-buffer 1 groupId 0 
25-06-2025 13:24:20:556 [pid=1535029 tid=1535093] DEBUG 0:1093 Total usage 0 Max Usage 33554432 
25-06-2025 13:24:20:558 [pid=1535029 tid=1535093] DEBUG 0:487 MAP gpu index : 0 bdf: 0 1 0 0 
25-06-2025 13:24:20:559 [pid=1535029 tid=1535093] DEBUG 0:507 Buffer allocation and map success on GPU: 0 
25-06-2025 13:24:20:559 [pid=1535029 tid=1535093] DEBUG 0:842 Bounce-buffer allocated from PCI-Group: 0 GPU: 0

KvikIO: Making cuFile API Calls in Python

KvikIO is a Python library for high-performance file I/O. It provides Python bindings to cuFile, which enable the use of GDS in Python applications. KvikIO comes with its own runtime properties that can be set globally or with a context manager.

# Set the property globally.
kvikio.defaults.set({"prop1": value1, "prop2": value2})

# Set the property with a context manager.
# The property automatically reverts to its old value
# after leaving the `with` block.
with kvikio.defaults.set({"prop1": value1, "prop2": value2}):
    ...

The list of properties and their allowed values can be found in the KvikIO Python documentation.
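For instance, to bound the size of KvikIO's internal thread pool (num_threads is one of the documented properties; verify the name against the documentation for your installed version):

import kvikio.defaults

# All subsequent KvikIO operations in this process use at most 4 I/O threads.
kvikio.defaults.set({"num_threads": 4})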

Below is a minimal example of reading a file using kvikio’s cuFile bindings.

import kvikio
import kvikio.defaults
import cupy

filepath = "/mnt/nvme/temp.file"
MiB = 1048576
size = 10 * MiB

with kvikio.defaults.set({"compat_mode": False}):
    with kvikio.CuFile(filepath, "rb") as f:
        buf = cupy.empty(size, dtype="b")
        # pread is non-blocking and returns a future that must be waited upon.
        fut = f.pread(buf)
        fut.get()

print(buf)
print("Done!")

GDS is only available when reading from configured drives. At ACCRE, the /mnt/nvme* drives support GDS reading and writing. We set compat_mode to False to enforce cuFile I/O. Before making the read call, a cupy.ndarray large enough to contain the requested byte range must be allocated. The buffer is then passed to pread, which infers the size of the read from buf.size. Since pread() is non-blocking, kvikio.cufile.CuFile.pread() returns an IOFuture that must be waited on with IOFuture.get().
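Writing works the same way. Below is a minimal sketch, assuming the same GDS-capable path as above and flags that mirror the read example; pwrite, like pread, is non-blocking and returns an IOFuture.

import kvikio
import kvikio.defaults
import cupy

filepath = "/mnt/nvme/temp.file"
MiB = 1048576

with kvikio.defaults.set({"compat_mode": False}):
    # The data to write must already reside in GPU memory.
    buf = cupy.ones(10 * MiB, dtype="b")
    with kvikio.CuFile(filepath, "wb") as f:
        # pwrite is non-blocking and returns an IOFuture.
        fut = f.pwrite(buf)
        nbytes = fut.get()  # Wait; returns the number of bytes written.

print(f"Wrote {nbytes} bytes")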

A Word on Compatibility Modes

Both cuFile and KvikIO provide similar but distinctly separate compatibility modes. A GDS-supported setup is not strictly required to make cuFile API calls: when GDS is not supported, cuFile falls back to compatibility mode and performs read/write operations through the POSIX API.

KvikIO similarly provides a compatibility mode for when it is unable to load libcufile. When KvikIO is in compatibility mode, all read/write operations are made using the POSIX API.

cuFile and KvikIO have separate compatibility modes; that is, cuFile can run in compatibility mode while KvikIO does not, and vice versa. KvikIO's compatibility mode is provided for performance reasons and bypasses the cuFile API to make POSIX backend calls directly.
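As a rough sketch of how the two layers are controlled independently (the cufile.json key is the one visible in the gdscheck output above, and the KvikIO property is the one used in the read example):

import kvikio.defaults

# KvikIO layer: compat_mode False forces KvikIO to go through the cuFile
# API instead of its own direct POSIX path.
with kvikio.defaults.set({"compat_mode": False}):
    # cuFile layer: whether cuFile itself falls back to POSIX is governed
    # separately by "properties.use_compat_mode" in cufile.json; nothing
    # set here changes that.
    ...  # CuFile reads/writes as in the examples above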