Journal: Proceedings of the VLDB Endowment
Loading...
Abbreviation
Proc. VLDB Endow.
Publisher
Association for Computing Machinery
78 results
Search Results
Publications 1 - 10 of 78
- Federated Matrix Factorization with Privacy GuaranteeItem type: Journal Article
Proceedings of the VLDB EndowmentLi, Zitao; Ding, Bolin; Zhang, Ce; et al. (2021)Matrix factorization (MF) approximates unobserved ratings in a rating matrix, whose rows correspond to users and columns correspond to items to be rated, and has been serving as a fundamental building block in recommendation systems. This paper comprehensively studies the problem of matrix factorization in different federated learning (FL) settings, where a set of parties want to cooperate in training but refuse to share data directly. We first propose a generic algorithmic framework for various settings of federated matrix factorization (FMF) and provide a theoretical convergence guarantee. We then systematically characterize privacy-leakage risks in data collection, training, and publishing stages for three different settings and introduce privacy notions to provide end-to-end privacy protections. The first one is vertical federated learning (VFL), where multiple parties have the ratings from the same set of users but on disjoint sets of items. The second one is horizontal federated learning (HFL), where parties have ratings from different sets of users but on the same set of items. The third setting is local federated learning (LFL), where the ratings of the users are only stored on their local devices. We introduce adapted versions of FMF with the privacy notions guaranteed in the three settings. In particular, a new private learning technique called embedding clipping is introduced and used in all the three settings to ensure differential privacy. For the LFL setting, we combine differential privacy with secure aggregation to protect the communication between user devices and the server with a strength similar to the local differential privacy model, but much better accuracy. We perform experiments to demonstrate the effectiveness of our approaches. - BAGUA: Scaling up Distributed Learning with System RelaxationsItem type: Journal Article
Proceedings of the VLDB EndowmentGan, Shaoduo; Jiang, Jiawei; Yuan, Binhang; et al. (2021)Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via "system relaxations": quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) based optimization, therefore, cannot take advantage of all possible optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, a MPI-style communication library, providing a collection of primitives, that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. Powered by this design, BAGUA has a great ability to implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 2x) across a diverse range of tasks. Moreover, we conduct a rigorous tradeo. exploration showing that di.erent algorithms and system relaxations achieve the best performance over di.erent network conditions. - Cracking Vector Search IndexesItem type: Journal Article
Proceedings of the VLDB EndowmentMageirakos, Vasilis; Wu, Bowen; Alonso, Gustavo (2025)Retrieval Augmented Generation (RAG) uses vector databases to expand the expertise of an LLM model without having to retrain it. The idea can be applied over data lakes, leading to the notion of embedding data lakes, i.e., a pool of vector databases ready to be used by RAGs. The key component in these systems is the indexes enabling Approximated Nearest Neighbor Search (ANNS). However, in data lakes, one cannot realistically expect to build indexes for every dataset. Thus, we propose an adaptive, partition-based index, CrackIVF, that performs much better than up-front index building. CrackIVF starts answering as a small index, and only expands to improve performance as it sees enough queries. It does so by progressively adapting the index to the query workload. That way, queries can be answered right away without having to build a full index first. After seeing enough queries, CrackIVF will produce an index comparable to those built with conventional techniques. CrackIVF can often answer more than 1 million queries before other approaches have even built the index, achieving 10–1000x faster initialization times. This makes it ideal for cold or infrequently used data and as a way to bootstrap access to unseen datasets. - Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain PredictionsItem type: Journal Article
Proceedings of the VLDB EndowmentKarlaš, Bojan; Li, Peng; Wu, Renzhi; et al. (2020)Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP) --- a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) checking query that determines whether a data example can be CP'ed; and (Q2) counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumption over the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed --- we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach built based on CP can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% gap on average by cleaning 36% of dirty data on average, while the best automatic cleaning approach BoostClean can only close 14% gap on average. - Evaluating Query Languages and Systems for High-Energy Physics DataItem type: Journal Article
Proceedings of the VLDB EndowmentGraur, Dan; Müller, Ingo; Proffitt, Mason; et al. (2021)In the domain of high-energy physics (HEP), query languages in general and SQL in particular have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried using mostly standard operators. To gain insights on why this is the case, we perform a comprehensive analysis of six diverse, general-purpose data processing platforms using an HEP benchmark. The result of the evaluation is an interesting and rather complex picture of existing solutions: Their query languages vary greatly in how natural and concise HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are in principle viable tools for HEP, significant work remains to make them relevant to HEP researchers. - On uncertain graphs modeling and queriesItem type: Conference Paper
Proceedings of the VLDB EndowmentKhan, Arijit; Chen, Lei (2015) - Distributed join algorithms on thousands of coresItem type: Conference Paper
Proceedings of the VLDB EndowmentBarthels, Claude; Müller, Ingo; Schneider, Timo; et al. (2017)Traditional database operators such as joins are relevant not only in the context of database engines but also as a building block in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that can scale with the input data size and the available hardware resources. In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network as used in high performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. In the paper, we explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting vs. hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4,096 cores. - Flexible Online Task Assignment in Real-Time Spatial DataItem type: Journal Article
Proceedings of the VLDB EndowmentTong, Yongxin; Wang, Libin; Zimu, Zhou; et al. (2017)The popularity of Online To Offline (O2O) service platforms has spurred the need for online task assignment in real-time spatial data, where streams of spatially distributed tasks and workers are matched in real time such that the total number of assigned pairs is maximized. Existing online task assignment models assume that each worker is either assigned a task immediately or waits for a subsequent task at a fixed location once she/he appears on the platform. Yet in practice a worker may actively move around rather than passively wait in place if no task is assigned. In this paper, we define a new problem Flexible Two-sided Online task Assignment (FTOA). FTOA aims to guide idle workers based on the prediction of tasks and workers so as to increase the total number of assigned worker-task pairs. To address the FTOA problem, we face two challenges: (i) How to generate guidance for idle workers based on the prediction of the spatiotemporal distribution of tasks and workers? (ii) How to leverage the guidance of workers’ movements to optimize the online task assignment? To this end, we propose a novel two-step framework, which integrates offline prediction and online task assignment. Specifically, we estimate the distributions of tasks and workers per time slot and per unit area, and design an online task assignment algorithm, Prediction-oriented Online task Assignment in Real-time spatial data (POLAR-OP). It yields a 0.47-competitive ratio, which is nearly twice better than that of the state-of-the-art. POLAR-OP also reduces the time complexity to process each newly-arrived task/worker to O(1). We validate the effectiveness and efficiency of our methods via extensive experiments on both synthetic datasets and real-world datasets from a large-scale taxi-calling platform. - Deployment of Query Plans on MulticoresItem type: Journal Article
Proceedings of the VLDB EndowmentGiceva, Jana; Alonso, Gustavo; Roscoe, Timothy; et al. (2014)Efficient resource scheduling of multithreaded software on multicore hardware is difficult given the many parameters involved and the hardware heterogeneity of existing systems. In this paper we explore the efficient deployment of query plans over a multicore machine. We focus on shared query systems, and implement the proposed ideas using SharedDB. The goal of the paper is to explore how to deliver maximum performance and predictability, while minimizing resource utilization when deploying query plans on multicore machines. We propose to use resource activity vectors to characterize the behavior of individual database operators. We then present a novel deployment algorithm which uses these vectors together with dataflow information from the query plan to optimally assign relational operators to physical cores. Experiments demonstrate that this approach significantly reduces resource requirements while preserving performance and is robust across different server architectures. - Hardware Acceleration of Compression and Encryption in SAP HANAItem type: Conference Paper
Proceedings of the VLDB EndowmentChiosa, Monica; Maschi, Fabio; Müller, Ingo; et al. (2022)With the advent of cloud computing, where computational resources are expensive and data movement needs to be secured and minimized, database management systems need to reconsider their architecture to accommodate such requirements. In this paper, we present our analysis, design and evaluation of an FPGA-based hardware accelerator for offloading compression and encryption for SAP HANA, SAP's Software-as-a-Service (SaaS) in-memory database. Firstly, we identify expensive data-transformation operations in the I/O path. Then we present the design details of a system consisting of compression followed by different types of encryption to accommodate different security levels, and identify which combinations maximize performance. We also analyze the performance benefits of offloading decryption to the FPGA followed by decompression on the CPU. The experimental evaluation using SAP HANA traces shows that analytical engines can benefit from FPGA hardware offloading. The results identify a number of important trade-offs (e.g., the system can accommodate low-latency secured transactions to high-performance use cases or offer lower storage cost by also compressing payloads for less critical use cases), and provide valuable information to researchers and practitioners exploring the nascent space of hardware accelerators for database engines.
Publications 1 - 10 of 78