Binhang Yuan


Publications 1 - 10 of 11
  • Xu, Lijie; Qiu, Shuang; Yuan, Binhang; et al. (2022)
    SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
    Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD and is 1.6X-12.8X faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
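    The two-level idea behind CorgiPile can be sketched as a toy in a few lines. This is an illustrative reading of the abstract, not the paper's implementation; the function name, block sizes, and buffer size are ours: shuffle the order of on-disk blocks, read a buffer of blocks sequentially, then shuffle the tuples inside each buffer.

    ```python
    import random

    def corgipile_order(num_blocks, block_size, buffer_blocks, seed=0):
        """Two-level shuffle sketch: block-level shuffle, then
        tuple-level shuffle within a buffer of blocks."""
        rng = random.Random(seed)
        blocks = list(range(num_blocks))
        rng.shuffle(blocks)                       # level 1: shuffle block order
        order = []
        for i in range(0, num_blocks, buffer_blocks):
            pool = []
            for b in blocks[i:i + buffer_blocks]:  # read a buffer of blocks
                pool.extend(range(b * block_size, (b + 1) * block_size))
            rng.shuffle(pool)                      # level 2: shuffle tuples in buffer
            order.extend(pool)
        return order
    ```

    The output visits every tuple exactly once while only ever reading `buffer_blocks` blocks sequentially at a time, which is the I/O-versus-randomness trade-off the abstract describes.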
  • Wang, Jue; Yuan, Binhang; Rimanic, Luka; et al. (2022)
    Advances in Neural Information Processing Systems 35
    Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data-parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AQ-SGD, a novel activation compression algorithm for communication-efficient pipeline-parallel training over slow networks. Different from previous efforts in activation compression, instead of compressing activation values directly, AQ-SGD compresses the changes of the activations. This allows us to show, to the best of our knowledge for the first time, that one can still achieve an O(1/√T) convergence rate for non-convex objectives under activation compression, without making assumptions on gradient unbiasedness that do not hold for deep learning models with non-linear activation functions. We then show that AQ-SGD can be optimized and implemented efficiently, without additional end-to-end runtime overhead. We evaluate AQ-SGD by fine-tuning language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AQ-SGD provides up to 4.3× end-to-end speed-up in slower networks, without sacrificing model quality. Moreover, we also show that AQ-SGD can be combined with state-of-the-art gradient compression algorithms to enable end-to-end communication compression: all communications between machines, including model gradients, forward activations, and backward gradients, are compressed into lower precision. This provides up to 4.9× end-to-end speed-up, without sacrificing model quality.
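    The core trick, quantizing the *change* in activations rather than the activations themselves, can be sketched as follows. This is an illustrative toy, not AQ-SGD's actual quantizer or protocol; the class and the uniform quantizer are our assumptions. Sender and receiver maintain the same running reference, so the receiver can reconstruct it from the quantized deltas alone.

    ```python
    import numpy as np

    def quantize(x, bits=2):
        """Illustrative uniform quantizer over the tensor's value range."""
        lo, hi = x.min(), x.max()
        if hi == lo:
            return x.copy()
        levels = 2 ** bits - 1
        q = np.round((x - lo) / (hi - lo) * levels)
        return lo + q / levels * (hi - lo)

    class DeltaCompressor:
        """Sketch of the AQ-SGD idea: transmit quantize(activation - reference)
        and advance a shared reference on both ends."""
        def __init__(self):
            self.ref = None
        def compress(self, act):
            if self.ref is None:
                self.ref = act.copy()
                return act                      # first step: send full activation
            delta_q = quantize(act - self.ref, bits=2)
            self.ref = self.ref + delta_q       # receiver rebuilds the same ref
            return self.ref
    ```

    Because activations change slowly between optimizer steps, the deltas have a much smaller range than the raw values, so few bits suffice.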
  • Wang, Jue; Lu, Yucheng; Yuan, Binhang; et al. (2023)
    Proceedings of Machine Learning Research ~ Proceedings of the 40th International Conference on Machine Learning
    Distributed training of foundation models, especially large language models (LLMs), is communication-intensive and so has heavily relied on centralized data centers with fast interconnects. Can we train on slow networks and unlock the potential of decentralized infrastructure for foundation models? In this paper, we propose CocktailSGD, a novel communication-efficient training framework that combines three distinct compression techniques (random sparsification, top-K sparsification, and quantization) to achieve much greater compression than each individual technique alone. We justify the benefit of such a hybrid approach through a theoretical analysis of convergence. Empirically, we show that CocktailSGD achieves up to 117× compression in fine-tuning LLMs with up to 20 billion parameters without hurting convergence. On a 500 Mbps network, CocktailSGD only incurs a ~1.2× slowdown compared with data center networks.
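    A composed compressor of this kind can be sketched as below. This is our own illustrative chaining of the three named techniques, not CocktailSGD's actual operator ordering or error-feedback machinery; parameter names and defaults are assumptions.

    ```python
    import numpy as np

    def cocktail_compress(g, p_random=0.5, k=4, bits=4, rng=None):
        """Chain three compressors: random sparsification, then top-K
        on the survivors, then uniform quantization of the kept values."""
        rng = rng or np.random.default_rng(0)
        # 1) random sparsification: keep each coordinate with probability p_random
        mask = rng.random(g.shape) < p_random
        sparse = np.where(mask, g, 0.0)
        # 2) top-K: keep the K largest-magnitude survivors
        idx = np.argsort(np.abs(sparse))[-k:]
        topk = np.zeros_like(g)
        topk[idx] = sparse[idx]
        # 3) quantize the kept values to low precision
        nz = topk != 0
        if nz.any():
            scale = np.abs(topk[nz]).max()
            levels = 2 ** (bits - 1) - 1
            topk[nz] = np.round(topk[nz] / scale * levels) / levels * scale
        return topk
    ```

    Each stage multiplies the compression ratio of the previous one, which is how a hybrid scheme can reach a far higher overall ratio than any single technique.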
  • Xu, Lijie; Qiu, Shuang; Yuan, Binhang; et al. (2024)
    The VLDB Journal
    Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance: the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or a low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6x-12.8x faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5x faster than PyTorch with a full data shuffle.
  • Liu, Zichang; Wang, Jue; Dao, Tri; et al. (2023)
    Proceedings of Machine Learning Research ~ Proceedings of the 40th International Conference on Machine Learning
    Large language models (LLMs) with hundreds of billions of parameters have sparked a new wave of exciting AI applications. However, they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, forgo the LLM's in-context learning ability, or do not yield wall-clock time speedup on modern hardware. We hypothesize that contextual sparsity, i.e., small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output as the dense model for a given input, can address these issues. We show that contextual sparsity exists, that it can be accurately predicted, and that we can exploit it to speed up LLM inference in wall-clock time without compromising the LLM's quality or in-context learning ability. Based on these insights, we propose DejaVu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given the inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. We validate that DejaVu can reduce the inference latency of OPT-175B by over 2X compared to the state-of-the-art FasterTransformer, and over 6X compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
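    For an MLP block, the contextual-sparsity idea can be sketched in a few lines. This is an illustrative toy, not DejaVu's actual predictor or kernel design: a cheap linear scorer picks which hidden neurons to compute for this particular input, and only those rows and columns of the weight matrices are touched.

    ```python
    import numpy as np

    def sparse_mlp_forward(x, W1, W2, predictor_W, keep):
        """Compute a ReLU MLP using only the `keep` hidden neurons the
        cheap predictor ranks highest for this input."""
        scores = predictor_W @ x                          # low-cost proxy scores
        active = np.argsort(scores)[scores.size - keep:]  # input-dependent neuron set
        h = np.maximum(W1[active] @ x, 0.0)               # compute only active rows
        return W2[:, active] @ h                          # project back with the
                                                          # matching output columns
    ```

    In the idealized case where the predictor recovers the true pre-activations and `keep` covers every neuron with a positive pre-activation, the sparse output equals the dense one exactly, which is the "approximately the same output" property the abstract hypothesizes.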
  • Yuan, Binhang; He, Yongjun; Quincy Davis, Jared; et al. (2022)
    Advances in Neural Information Processing Systems 35
    Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron and Deepspeed, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational “tasklets” in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8× faster than prior state-of-the-art training systems.
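    The scheduling step can be sketched as a toy evolutionary search. This is only in the spirit of the paper's allocator: the real cost model covers data and pipeline parallelism over heterogeneous links, while here the cost, the chain-of-tasklets layout, and all parameters are our simplifying assumptions.

    ```python
    import random

    def evolve_schedule(link_cost, tasklets, gens=200, pop=20, seed=0):
        """Toy evolutionary search: assign a chain of `tasklets` to devices
        so adjacent tasklets sit on cheaply connected devices.
        `link_cost[i][j]` is an assumed pairwise communication cost."""
        rng = random.Random(seed)
        n = len(link_cost)

        def cost(a):
            # adjacent tasklets exchange activations; pay the link cost
            return sum(link_cost[a[t]][a[t + 1]] for t in range(tasklets - 1))

        population = [[rng.randrange(n) for _ in range(tasklets)]
                      for _ in range(pop)]
        for _ in range(gens):
            population.sort(key=cost)
            survivors = population[:pop // 2]            # selection
            children = []
            for _ in range(pop - len(survivors)):        # mutation
                child = list(rng.choice(survivors))
                child[rng.randrange(tasklets)] = rng.randrange(n)
                children.append(child)
            population = survivors + children
        best = min(population, key=cost)
        return best, cost(best)
    ```

    Even this crude loop illustrates why an evolutionary heuristic fits the problem: the assignment space is discrete and huge, but good allocations cluster communicating tasklets, so selection plus mutation finds them quickly.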
  • Lian, Xiangru; Yuan, Binhang; Zhu, Xuefeng; et al. (2022)
    KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    Recent years have witnessed exponential growth of model scale in deep learning-based recommender systems, from Google's 2016 model with 1 billion parameters to the latest Facebook model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, the training of such models is challenging even within industrial-scale data centers. We resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, to ensure both training efficiency and training accuracy, we design a novel hybrid training algorithm, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; we then build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical analysis and empirical studies with up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia. We make Persia publicly available (at github.com/PersiaML/Persia) so that anyone can easily train a recommender model at the scale of 100 trillion parameters.
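    The hybrid synchronization idea can be sketched in one step of updates. This is an illustrative reading of the abstract, not Persia's actual protocol; names, shapes, and the learning rate are ours: dense parameters take one synchronous update from the averaged gradient, while sparse embedding gradients are applied one by one as they arrive, without averaging.

    ```python
    import numpy as np

    def hybrid_step(dense_w, emb_w, dense_grads, emb_grads, lr=0.1):
        """One hybrid update: synchronous for dense weights,
        asynchronous per-worker updates for embedding rows."""
        # synchronous: average the workers' gradients, apply one update
        dense_w = dense_w - lr * np.mean(dense_grads, axis=0)
        # asynchronous: apply each worker's sparse embedding update in turn
        for rows, g in emb_grads:
            emb_w[rows] -= lr * g
        return dense_w, emb_w
    ```

    The split makes sense because the embedding layer is enormous but sparsely touched (staleness is tolerable), while the dense network is small but dense (synchrony is cheap and preserves accuracy).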
  • Pan, Rui; Lei, Yiming; Li, Jialong; et al. (2022)
    HotNets '22: Proceedings of the 21st ACM Workshop on Hot Topics in Networks
    This paper discusses why flow scheduling does not apply to distributed deep learning training and presents EchelonFlow, the first network abstraction to bridge the gap. EchelonFlow deviates from the common belief that semantically related flows should finish at the same time. After extensive workflow analysis of diverse training paradigms, we made the key observation that distributed training jobs follow strict computation patterns and may consume data at different times. We devise a generic method to model the drastically different computation patterns across training paradigms, and formulate EchelonFlow to regulate flow finish times accordingly. Case studies of mainstream training paradigms under EchelonFlow demonstrate the expressiveness of the abstraction, and our system sketch suggests the feasibility of an EchelonFlow scheduling system.
  • Sheng, Ying; Zheng, Lianmin; Yuan, Binhang; et al. (2023)
    Proceedings of Machine Learning Research ~ Proceedings of the 40th International Conference on Machine Learning
    The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput than state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen.
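    The batch-size search can be caricatured in a few lines. The paper solves a linear program over placement percentages across GPU, CPU, and disk; this toy instead brute-forces the largest batch whose tensors fit in the aggregate memory, and every number below (weight size, per-sequence KV-cache and activation sizes, memory capacities) is a made-up illustration, not a measurement.

    ```python
    def best_batch_size(weights_gb, kv_gb_per_seq, act_gb_per_seq,
                        gpu_gb=16, cpu_gb=200, disk_gb=1500):
        """Largest batch size whose weights + per-sequence KV cache and
        activations fit within the aggregate GPU+CPU+disk capacity."""
        total = gpu_gb + cpu_gb + disk_gb
        best = 0
        for bs in range(1, 1024):
            need = weights_gb + bs * (kv_gb_per_seq + act_gb_per_seq)
            if need <= total:
                best = bs
        return best
    ```

    The point of the exercise: once CPU RAM and disk count toward capacity, the feasible batch size grows by orders of magnitude over GPU-only serving, which is where FlexGen's throughput comes from.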
  • Zhang, Yuji; Xu, Shusheng; Xing, Wenhui; et al. (2024)
    Journal of the American Heart Association
    Background: Atrial fibrillation (AF) increases the risk of embolic stroke and, in postoperative patients, increases the cost of care. Consequently, ECG screening for AF in high-risk patients is important but labor-intensive. Artificial intelligence (AI) may reduce the AF detection workload, but AI development presents challenges. Methods and Results: We used a novel approach to AI development for AF detection using both surface ECG recordings and atrial epicardial electrograms obtained in postoperative cardiac patients. Atrial electrograms were used only to facilitate establishing true AF for AI development; this permitted the establishment of an AI-based tool for subsequent AF detection using ECG records alone. A total of 5 million 30-second epochs from 329 patients were annotated as AF or non-AF by expert ECG readers for AI training and validation, while 5 million 30-second epochs from 330 different patients were used for AI testing. AI performance was assessed at the epoch level, as was AF burden at the patient level. The AI achieved an area under the receiver operating characteristic curve of 0.932 on validation and 0.953 on testing. At the epoch level, testing results showed mean AF detection sensitivity, specificity, negative predictive value, positive predictive value, and F1 (the harmonic mean of positive predictive value and sensitivity) of 0.970, 0.814, 0.976, 0.776, and 0.862, respectively, while the intraclass correlation coefficient for AF burden detection was 0.952. At the patient level, AF burden sensitivity and positive predictive value were 96.2% and 94.5%, respectively. Conclusions: Use of both atrial electrograms and surface ECG permitted development of a robust AI-based approach to postoperative AF recognition and AF burden assessment. This novel tool may enhance the detection and management of AF, particularly in patients following cardiac surgery.
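    As a quick sanity check on the epoch-level numbers above, F1 is defined as the harmonic mean of positive predictive value and sensitivity, and the reported values are internally consistent:

    ```python
    def f1_score(ppv, sensitivity):
        """F1 as the harmonic mean of PPV (precision) and sensitivity (recall)."""
        return 2 * ppv * sensitivity / (ppv + sensitivity)

    # With the reported PPV of 0.776 and sensitivity of 0.970:
    round(f1_score(0.776, 0.970), 3)  # → 0.862, matching the reported F1
    ```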