GVLE: a highly optimized GPU-based implementation of variable-length encoding
Metadata only
Date
2023-05
Type
Journal Article
ETH Bibliography
yes
Abstract
Nowadays, the massive use of multimedia data gives data compression a fundamental role in reducing storage requirements and communication bandwidth. Variable-length encoding (VLE) is a relevant data compression method that reduces input data size by assigning shorter codewords to frequently used symbols and longer codewords to rarely used symbols. As it is a common strategy in many compression algorithms, such as the popular Huffman coding, speeding up VLE is essential to accelerating them. For this reason, during the last decade and a half, efficient VLE implementations have been presented in the area of General-Purpose Graphics Processing Units (GPGPU). The main performance issues of the state-of-the-art GPU-based implementations of VLE are the following. First, the way in which the codeword look-up table is stored in shared memory is not optimized to reduce bank conflicts. Second, input/output data are read/written through inefficient strided global memory accesses. Third, the way in which the thread-codes are built is not optimized to reduce the number of executed instructions. Our goal in this work is to significantly speed up the state-of-the-art implementations of VLE by solving these performance issues. To this end, we propose GVLE, a highly optimized implementation of VLE on GPU, which uses the following optimization strategies. First, the codeword look-up table is cached in a way that minimizes bank conflicts. Second, input data are read using vectorized loads to fully exploit the available global memory bandwidth. Third, each thread's encoding is performed efficiently in the register space with high instruction-level parallelism and a lower number of executed instructions. Fourth, a novel inter-block scan method, which outperforms those of state-of-the-art solutions, is used to calculate the bit-positions of the thread-block encodings in the output bit-stream.
Our proposed mechanism is based on a regular segmented scan performed efficiently on sequences of bit-lengths of 32 consecutive thread-block encodings by using global atomic additions. Fifth, output data are written efficiently by executing coalesced global memory stores. An exhaustive experimental evaluation shows that our solution is on average 2.6x faster than the best state-of-the-art implementation. Additionally, it shows that the scan algorithm is on average 1.62x faster if it uses our inter-block scan method instead of that of the best state-of-the-art VLE solution. Hence, our inter-block scan method offers promising possibilities for accelerating algorithms that require it, such as the scan itself or stream compaction.
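The core idea behind VLE and the bit-position computation described above can be sketched sequentially: each input symbol is mapped to a codeword via a look-up table, and an exclusive prefix scan over the codeword bit-lengths yields each codeword's starting bit position in the output bit-stream (the paper performs this scan in parallel on the GPU, per thread-block). The codeword table below is a made-up illustrative example, not one from the paper:

```python
def vle_encode(symbols, table):
    """Sequential sketch of variable-length encoding.

    table maps each symbol to (codeword, bit_length); frequent
    symbols should get shorter codewords (Huffman-style).
    Returns (packed bytes, total number of bits used).
    """
    lengths = [table[s][1] for s in symbols]

    # Exclusive prefix scan: starting bit position of each codeword.
    positions = [0]
    for n in lengths[:-1]:
        positions.append(positions[-1] + n)
    total_bits = positions[-1] + lengths[-1]

    # Pack codewords MSB-first into the output bit-stream.
    out = bytearray((total_bits + 7) // 8)
    for s, pos in zip(symbols, positions):
        code, n = table[s]
        for i in range(n):
            if (code >> (n - 1 - i)) & 1:
                out[(pos + i) // 8] |= 0x80 >> ((pos + i) % 8)
    return bytes(out), total_bits

# Hypothetical table: frequent 'a' gets 1 bit, rarer 'b'/'c' get 2 bits.
codes = {'a': (0b0, 1), 'b': (0b10, 2), 'c': (0b11, 2)}
stream, nbits = vle_encode("aab", codes)  # bits 0,0,1,0 -> one byte, 4 bits
```

In GVLE, the analogous per-block scan is what determines where each thread-block's encoded bits land in the shared output stream; the sequential loop here stands in for that parallel inter-block scan.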
Publication status
published
External links
Journal / series
The Journal of Supercomputing
Volume
Pages / Article No.
Publisher
Springer
Subject
Data compression; Variable-length encoding; Huffman coding; GPU; CUDA