GVLE: a highly optimized GPU-based implementation of variable-length encoding
Metadata only
Date
2023-05
Type
Journal Article
ETH Bibliography
yes
Abstract
Nowadays, the massive use of multimedia data gives data compression a fundamental role in reducing storage requirements and communication bandwidth. Variable-length encoding (VLE) is a relevant data compression method that reduces input data size by assigning shorter codewords to frequently used symbols and longer codewords to rarely used symbols. As it is a common strategy in many compression algorithms, such as the popular Huffman coding, speeding up VLE is essential to accelerating them. For this reason, during the last decade and a half, efficient VLE implementations have been presented in the area of General-Purpose Graphics Processing Units (GPGPU). The main performance issues of the state-of-the-art GPU-based implementations of VLE are the following. First, the way in which the codeword look-up table is stored in shared memory is not optimized to reduce bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the way in which the thread-codes are built is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving these performance issues. To this end, we propose GVLE, a highly optimized implementation of VLE on GPU, which uses the following optimization strategies. First, the codeword look-up table is cached in a way that minimizes bank conflicts. Second, input data are read using vectorized loads to fully exploit the available global memory bandwidth. Third, each thread's encoding is performed efficiently in the register space with high instruction-level parallelism and a lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, is used to calculate the bit-positions of the thread-block encodings in the output bit-stream.
Our proposed mechanism is based on a regular segmented scan performed efficiently on sequences of bit-lengths of 32 consecutive thread-block encodings by using global atomic additions. Fifth, output data are written efficiently by executing coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6x faster than the best state-of-the-art implementation. Additionally, it shows that the scan algorithm is on average 1.62x faster if it uses our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities for accelerating algorithms that require it, such as the scan itself or stream compaction.
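The core idea behind VLE and the bit-position computation described above can be sketched sequentially: each input symbol is mapped to a codeword via a look-up table, and an exclusive prefix scan over the codeword bit-lengths yields each codeword's starting bit position in the output bit-stream (the paper performs this scan in parallel on the GPU, per thread-block). The codeword table below is a made-up illustrative example, not one from the paper:

```python
def vle_encode(symbols, table):
    """Sequential sketch of variable-length encoding.

    table maps each symbol to (codeword, bit_length); frequent
    symbols should get shorter codewords (Huffman-style).
    Returns (packed bytes, total number of bits used).
    """
    lengths = [table[s][1] for s in symbols]

    # Exclusive prefix scan: starting bit position of each codeword.
    positions = [0]
    for n in lengths[:-1]:
        positions.append(positions[-1] + n)
    total_bits = positions[-1] + lengths[-1]

    # Pack codewords MSB-first into the output bit-stream.
    out = bytearray((total_bits + 7) // 8)
    for s, pos in zip(symbols, positions):
        code, n = table[s]
        for i in range(n):
            if (code >> (n - 1 - i)) & 1:
                out[(pos + i) // 8] |= 0x80 >> ((pos + i) % 8)
    return bytes(out), total_bits

# Hypothetical table: frequent 'a' gets 1 bit, rarer 'b'/'c' get 2 bits.
codes = {'a': (0b0, 1), 'b': (0b10, 2), 'c': (0b11, 2)}
stream, nbits = vle_encode("aab", codes)  # bits 0,0,1,0 -> one byte, 4 bits
```

In GVLE, the analogous per-block scan is what determines where each thread-block's encoded bits land in the shared output stream; the sequential loop here stands in for that parallel inter-block scan.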
Publication status
published
External links
Journal / series
The Journal of Supercomputing
Volume
Pages / Article No.
Publisher
Springer
Subject
Data compression; Variable-length encoding; Huffman coding; GPU; CUDA