

Speakers: Qiu Lu, Tang Chunxu, Wang Beinan
Artificial intelligence (AI) and machine learning (ML) are developing rapidly, and handling large datasets efficiently during training is becoming increasingly critical. Ray has become an important player in this field, enabling training on large-scale datasets through efficient data streaming. Ray breaks a large dataset into manageable chunks and divides the training job into smaller tasks, so the entire dataset does not need to be stored locally on the training machine. However, this approach also faces certain challenges.
Although Ray facilitates training on large datasets, data loading remains a serious bottleneck. Each epoch reloads the entire dataset from remote storage, which severely reduces GPU utilization and increases data transfer costs. A more optimized way to manage data during training is therefore needed.
Ray mainly keeps data in memory, and its in-memory object store is designed for large task data. However, this approach becomes a bottleneck for data-intensive workloads, because the data a large task needs must be preloaded into Ray's object store before execution. Since the object store is usually too small to hold the training dataset, it is not suitable for caching data across multiple training epochs, which highlights the need for a more scalable data management solution for Ray.
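The point about an undersized in-memory store can be illustrated with a small simulation. The sketch below assumes a plain LRU cache and a sequential scan over the dataset in every epoch; this is an illustrative model only, not Ray's actual eviction or spilling policy, and the function name and block counts are hypothetical.

```python
from collections import OrderedDict


def count_lru_hits(capacity, num_blocks, epochs):
    """Scan blocks 0..num_blocks-1 once per epoch through an LRU cache.

    Returns the total number of cache hits across all epochs.
    """
    cache = OrderedDict()
    hits = 0
    for _ in range(epochs):
        for block in range(num_blocks):
            if block in cache:
                hits += 1
                cache.move_to_end(block)  # mark as most recently used
            else:
                cache[block] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits


# Cache large enough for the whole dataset: every block hits after epoch 1.
print(count_lru_hits(capacity=10, num_blocks=10, epochs=3))  # 20
# Cache slightly too small: each block is evicted before it is read again.
print(count_lru_hits(capacity=8, num_blocks=10, epochs=3))   # 0
```

Under these assumptions, a cache that holds 80% of the dataset yields no hits at all, which is why a cache tier sized for the full dataset matters more than a faster but smaller one.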
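One of Ray's key strengths is that it uses GPUs for training while using CPUs for data loading and preprocessing. This ensures that GPU, CPU, and memory resources within a Ray cluster are used effectively, but it also leaves disk resources underutilized and without effective management. This leads to a transformative idea: build a high-performance data access layer that intelligently manages underused disk resources across machines to cache and serve training datasets, which can significantly improve overall training performance and reduce how often remote storage is accessed.
Alluxio accelerates training on large-scale datasets by efficiently using unused disk capacity on GPU machines and adjacent CPU machines as a distributed cache. This approach significantly improves data loading performance, which is critical for training on large-scale datasets, while also reducing dependence on remote storage and the associated data transfer costs.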
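The caching idea can be sketched on a single machine as a read-through disk cache. This is a minimal illustration and not Alluxio's implementation: the class name, the `fetch_remote` callable, and the in-memory `FAKE_REMOTE` dictionary standing in for remote storage are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


class ReadThroughDiskCache:
    """Serve reads from local disk; fetch from remote storage only on a miss."""

    def __init__(self, cache_dir, fetch_remote):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch_remote = fetch_remote  # hypothetical callable: path -> bytes
        self.remote_reads = 0

    def _local_path(self, remote_path):
        # Hash the remote path to get a flat, filesystem-safe file name.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def read(self, remote_path):
        local = self._local_path(remote_path)
        if local.exists():
            return local.read_bytes()  # cache hit: no remote access
        data = self.fetch_remote(remote_path)
        self.remote_reads += 1
        local.write_bytes(data)  # populate the cache for later epochs
        return data


# Hypothetical stand-in for remote storage such as S3.
FAKE_REMOTE = {f"s3://bucket/train/part-{i}": bytes([i]) * 1024 for i in range(4)}


def run_epochs(epochs):
    """Read every file once per epoch and return the number of remote reads."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = ReadThroughDiskCache(tmp, FAKE_REMOTE.__getitem__)
        for _ in range(epochs):
            for path in FAKE_REMOTE:
                cache.read(path)
        return cache.remote_reads


print(run_epochs(3))  # 4: each file is fetched remotely once, then served from disk
```

In this sketch the number of remote reads depends only on the number of files, not on the number of epochs, which is the effect a distributed disk cache aims for at cluster scale.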

Integrating Alluxio improves Ray's data management capabilities and brings many benefits:
√ Scalability
  Data access and caching are highly scalable
√ Faster data access
  Uses high-performance disks to cache data
  Optimized for highly concurrent random reads of columnar file formats such as Parquet
  Zero-copy
√ Reliability and availability
  No single point of failure
  Robust remote storage access during outages
√ Flexible resource management
  Dynamically allocates and releases cache resources based on workload needs
Ray can effectively orchestrate machine learning workflows and seamlessly integrate with data loading, preprocessing and training frameworks. As a high-performance data access layer, Alluxio can greatly optimize AI/ML training and inference tasks, especially when remote storage data needs to be accessed repeatedly.
Ray uses PyArrow to load data and convert it into the Arrow format, which the next stage of the Ray workflow then consumes. PyArrow delegates storage connectivity to the fsspec framework, and Alluxio serves as an intermediate caching layer between Ray and the underlying storage systems (such as S3, Azure Blob Storage, and Hugging Face).

When using Alluxio as a caching layer between Ray and S3, simply import alluxiofs, initialize the Alluxio file system, and pass it to Ray as the file system.
# Import Ray, fsspec, and the Alluxio fsspec implementation
import fsspec
import ray
from alluxiofs import AlluxioFileSystem
fsspec.register_implementation("alluxio", AlluxioFileSystem)
# Create the Alluxio file system with S3 as the underlying storage system
# (args is assumed to come from the script's argument parser and to carry the etcd host)
alluxio = fsspec.filesystem("alluxio", target_protocol="s3", etcd_host=args.etcd_host)
# Read data through Alluxio using the original S3 URL
ds = ray.data.read_images("s3://datasets/imagenet-full/train", filesystem=alluxio)
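We used the Ray Data nightly test to compare the data loading performance of Alluxio and same-region S3 across training epochs. The benchmark results show that integrating Alluxio with Ray significantly reduces storage costs and substantially improves throughput.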

√ Improved data access performance: when Ray's object store is not under memory pressure, Alluxio's throughput is 2 times that of S3 in the same region.
√ Greater advantage under memory pressure: notably, when Ray's object store is under memory pressure, Alluxio's performance advantage grows significantly, with throughput 5 times higher than S3's.
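A comparison of this kind rests on measuring data loading throughput per epoch. The helper below is a minimal sketch of such a measurement, not the Ray Data nightly test itself; `measure_epoch_throughput` and the stand-in loader `fake_load_epoch` are hypothetical names, and a real run would replace the loader with one that iterates over the dataset and returns the number of bytes read.

```python
import time


def measure_epoch_throughput(load_epoch, epochs):
    """Run load_epoch() once per epoch and return the throughput in MiB/s for each.

    load_epoch is a caller-supplied function that loads one full epoch of data
    and returns the number of bytes it read.
    """
    results = []
    for _ in range(epochs):
        start = time.perf_counter()
        num_bytes = load_epoch()
        elapsed = time.perf_counter() - start
        # Guard against a zero interval on very fast loaders.
        results.append(num_bytes / (1024 * 1024) / max(elapsed, 1e-9))
    return results


def fake_load_epoch():
    """Hypothetical loader: touches 1000 in-memory blocks of 1 KiB each."""
    return sum(len(bytes(1024)) for _ in range(1000))


for epoch, mib_per_s in enumerate(measure_epoch_throughput(fake_load_epoch, 3), 1):
    print(f"epoch {epoch}: {mib_per_s:.1f} MiB/s")
```

Reporting throughput per epoch rather than as a single average makes it possible to see whether later epochs benefit from cached data.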
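For Ray workloads, using otherwise idle disk resources as storage for a distributed cache is of great strategic value. This approach significantly improves data loading performance and is especially useful when training or tuning on the same dataset across multiple epochs. In addition, when Ray is under memory pressure, it offers a practical way to optimize and simplify data management in these scenarios.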
This article is shared from the WeChat public account - Alluxio (Alluxio_China).