GVLE: a highly optimized GPU-based implementation of variable-length encoding


METADATA ONLY

Date

2023-05

Publication Type

Journal Article

ETH Bibliography

yes


Abstract

Nowadays, the massive use of multimedia data gives data compression a fundamental role in reducing storage requirements and communication bandwidth. Variable-length encoding (VLE) is a relevant data compression method that reduces input data size by assigning shorter codewords to frequently used symbols and longer codewords to rarely used symbols. As it is a common strategy in many compression algorithms, such as the popular Huffman coding, speeding up VLE is essential to accelerating them. For this reason, during the last decade and a half, efficient VLE implementations have been presented in the area of General-Purpose Graphics Processing Units (GPGPU). The main performance issues of the state-of-the-art GPU-based implementations of VLE are the following. First, the way in which the codeword look-up table is stored in shared memory is not optimized to reduce bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the way in which the thread-codes are built is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving their performance issues. To this end, we propose GVLE, a highly optimized implementation of VLE on GPU, which uses the following optimization strategies. First, the codeword look-up table is cached in a way that minimizes bank conflicts. Second, input data are read using vectorized loads to fully exploit the available global memory bandwidth. Third, each thread's encoding is performed efficiently in the register space with high instruction-level parallelism and a lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, is used to calculate the bit-positions of the thread-block encodings in the output bit-stream.
Our proposed mechanism is based on a regular segmented scan performed efficiently on sequences of bit-lengths of 32 consecutive thread-block encodings by using global atomic additions. Fifth, output data are written efficiently by executing coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6x faster than the best state-of-the-art implementation. Additionally, it shows that the scan algorithm is on average 1.62x faster if it uses our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities for accelerating algorithms that require it, such as scan itself or stream compaction.

Publication status

published

Volume

79 (8)

Pages / Article No.

8447 - 8474

Publisher

Springer

Subject

Data compression; Variable-length encoding; Huffman coding; GPU; CUDA
