- Conference Paper
Rights / license: In Copyright - Non-Commercial Use Permitted
Deep neural networks are widely used in personalized recommendation systems. Such models involve two major components: the memory-bound embedding layer and the computation-bound fully-connected layers. Existing solutions are either slow on both stages or only optimize one of them. To implement recommendation inference efficiently in the context of a real deployment, we design and implement an FPGA cluster optimizing the performance of both stages. To remove the memory bottleneck, we take advantage of the High-Bandwidth Memory (HBM) available on the latest FPGAs for highly concurrent embedding table lookups. To match the required DNN computation throughput, we partition the workload across multiple FPGAs interconnected via a 100 Gbps TCP/IP network. Compared to an optimized CPU baseline (16 vCPU, AVX2-enabled) and a one-node FPGA implementation, our system (four-node version) achieves 28.95x and 7.68x speedup in terms of throughput respectively. The proposed system also guarantees a latency of tens of microseconds per single inference, significantly better than CPU and GPU-based systems which take at least milliseconds.
Book title: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL)
Organisational unit: 03506 - Alonso, Gustavo
Notes: Conference lecture held on September 3, 2021. Due to the Coronavirus (COVID-19) pandemic, the conference was conducted virtually.