Open access
Date
2021
Type
Conference Paper
ETH Bibliography
yes
Abstract
Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: the memory-bound embedding layer and the computation-bound fully-connected layers. Existing solutions are either slow on both stages or only optimize one of them. To implement recommendation inference efficiently in the context of a real deployment, we design and implement an FPGA cluster optimizing the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected via a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPU, AVX2-enabled) and a one-node FPGA implementation, our system (four-node version) achieves throughput speedups of 28.95x and 7.68x, respectively. The proposed system also guarantees a latency of tens of microseconds per single inference, significantly better than CPU- and GPU-based systems, which take at least milliseconds.
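The abstract refers to the two stages of a typical deep recommendation model: sparse embedding-table lookups followed by dense fully-connected layers. The sketch below (plain NumPy; table count, table sizes, embedding dimension, and layer widths are illustrative assumptions, not the paper's configuration) shows why the first stage is memory-bound (random gathers into large tables) while the second is computation-bound (dense matrix multiplies).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- not taken from the paper.
NUM_TABLES = 8            # number of sparse-feature embedding tables
ROWS_PER_TABLE = 100_000  # rows per table
EMB_DIM = 16              # embedding vector width
DENSE_DIM = 13            # dense feature vector length
HIDDEN = [256, 64, 1]     # fully-connected (MLP) layer widths

# Stage 1 state: embedding tables (memory-bound, lookup-dominated).
tables = [rng.standard_normal((ROWS_PER_TABLE, EMB_DIM), dtype=np.float32)
          for _ in range(NUM_TABLES)]

# Stage 2 state: dense MLP weights (computation-bound, GEMM-dominated).
layer_dims = [DENSE_DIM + NUM_TABLES * EMB_DIM] + HIDDEN
weights = [rng.standard_normal((m, n), dtype=np.float32) * 0.01
           for m, n in zip(layer_dims[:-1], layer_dims[1:])]

def infer(dense_features: np.ndarray, sparse_ids: np.ndarray) -> np.ndarray:
    """One batch of recommendation inferences: gather embeddings, then run the MLP."""
    # Stage 1: one lookup per table; each gather is a random memory access.
    gathered = [tables[t][sparse_ids[:, t]] for t in range(NUM_TABLES)]
    x = np.concatenate([dense_features] + gathered, axis=1)

    # Stage 2: dense fully-connected layers with ReLU, sigmoid on the output.
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return 1.0 / (1.0 + np.exp(-(x @ weights[-1])))

# Example: a batch of 4 requests.
batch = 4
dense = rng.standard_normal((batch, DENSE_DIM), dtype=np.float32)
ids = rng.integers(0, ROWS_PER_TABLE, size=(batch, NUM_TABLES))
print(infer(dense, ids).ravel())  # predicted click-through probabilities
```

In the system described by the abstract, stage 1 maps to concurrent HBM lookups on the FPGA and stage 2 is partitioned across multiple FPGAs connected over the 100 Gbps TCP/IP network.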
Permanent link
https://doi.org/10.3929/ethz-b-000485145
Publication status
published
External links
Book title
2021 31st International Conference on Field-Programmable Logic and Applications (FPL)
Pages / Article No.
Publisher
IEEE
Event
Organisational unit
03506 - Alonso, Gustavo / Alonso, Gustavo
Notes
Conference lecture held on September 3, 2021. Due to the Coronavirus (COVID-19), the conference was conducted virtually.