Distributed Recommendation Inference on FPGA Clusters


Date

2021

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: a memory-bound embedding layer and compute-bound fully-connected layers. Existing solutions are either slow on both stages or optimize only one of them. To implement recommendation inference efficiently in the context of a real deployment, we design and implement an FPGA cluster that optimizes the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding-table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected by a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPUs, AVX2-enabled) and a single-node FPGA implementation, our four-node system achieves 28.95x and 7.68x higher throughput, respectively. The proposed system also guarantees a latency of tens of microseconds per single inference, significantly better than CPU- and GPU-based systems, which take at least milliseconds.
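
The abstract describes a two-stage workload: memory-bound embedding-table lookups followed by compute-bound fully-connected (MLP) layers. As a rough illustration of that model structure only, and not of the paper's FPGA design, the Python sketch below mimics a single-item inference; every table count, embedding dimension, and layer size is a hypothetical placeholder.

```python
import numpy as np

# Hypothetical sketch of a DLRM-style recommendation model: stage 1 is
# memory-bound embedding-table lookups, stage 2 is compute-bound dense
# (fully-connected) layers. All shapes are illustrative placeholders,
# not parameters reported in the paper.

rng = np.random.default_rng(0)

NUM_TABLES, ROWS_PER_TABLE, EMB_DIM = 26, 10_000, 16
tables = [rng.standard_normal((ROWS_PER_TABLE, EMB_DIM), dtype=np.float32)
          for _ in range(NUM_TABLES)]

def embedding_stage(sparse_ids):
    """Memory-bound stage: one random-access row lookup per table."""
    return np.concatenate([t[i] for t, i in zip(tables, sparse_ids)])

def mlp_stage(x, weights):
    """Compute-bound stage: dense matrix-vector products, ReLU on hidden layers."""
    for w in weights[:-1]:
        x = np.maximum(w @ x, 0.0)
    return weights[-1] @ x

dims = [NUM_TABLES * EMB_DIM, 512, 256, 1]            # hypothetical MLP sizes
weights = [0.01 * rng.standard_normal((o, i)).astype(np.float32)
           for i, o in zip(dims[:-1], dims[1:])]

ids = rng.integers(0, ROWS_PER_TABLE, size=NUM_TABLES)  # one sparse input
score = mlp_stage(embedding_stage(ids), weights)
print(float(score[0]))  # raw score for this single inference
```

On a CPU, the first stage is dominated by irregular memory accesses and the second by dense-compute throughput, which is why the paper maps the lookups to HBM and spreads the MLP computation across network-connected FPGAs.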

Publication status

published

Book title

2021 31st International Conference on Field-Programmable Logic and Applications (FPL)

Pages / Article No.

279–285

Publisher

IEEE

Event

31st International Conference on Field-Programmable Logic and Applications (FPL 2021)

Organisational unit

03506 - Alonso, Gustavo

Notes

Conference lecture held on September 3, 2021. Due to the COVID-19 pandemic, the conference was conducted virtually.
