Fraud Detection in Ethereum Using Web-scraping and Natural Language Processing Techniques

Open access
Author
Date
2021-09-11
Type
Master Thesis
ETH Bibliography
yes
Abstract
The objective of this thesis is to distinguish fraudulent smart contracts (defined as smart contracts related to Ponzi schemes) from non-fraudulent ones on the Ethereum blockchain. For this purpose, we employ web scraping techniques to retrieve data on the transactions of each smart contract. More importantly, we retrieve the opcode sequence of each smart contract, that is, the set of instructions that determine the contract’s behaviour on the network. Each contract’s opcode sequence is then embedded using natural language processing (NLP) techniques and fed to a classifier ensemble.
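As a rough illustration of what it means to embed an opcode sequence into a numeric vector, the sketch below treats each contract as a "document" of opcode tokens and builds TF-IDF vectors. This is a deliberately simple stand-in and not the embedding used in the thesis. The example opcode strings, the function names, and the choice to drop hexadecimal operands are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(opcodes: str) -> list[str]:
    """Split a whitespace-separated opcode string into tokens.

    Assumption: hexadecimal operands (e.g. 0x60) are dropped so that
    only the instruction mnemonics remain.
    """
    return [tok for tok in opcodes.split() if not tok.startswith("0x")]


def tfidf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one TF-IDF vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many contracts each opcode appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical opcode sequences, invented for this example.
contracts = [
    "PUSH1 0x60 PUSH1 0x40 MSTORE CALLVALUE DUP1 ISZERO",
    "PUSH1 0x60 PUSH1 0x40 MSTORE CALLDATASIZE ISZERO JUMPI",
]
docs = [tokenize(c) for c in contracts]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has one entry per opcode in `vocab`, so contracts of different lengths end up as fixed-length inputs that a classifier can consume.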
As is typical for most problems concerned with fraud detection, the dataset we work on is characterized by a vast class imbalance. The model we propose addresses this issue in three ways: (i) resampling the training set, (ii) filtering out ‘obvious negative’ (i.e. non-fraudulent) instances, and (iii) weighting each classifier’s predictions in the ensemble by the estimated balanced accuracy of the classifier. The class imbalance also has important implications for the metric used to assess the classifier. In this regard, whereas the most common metric used in the literature is arguably the F1-score, we found balanced accuracy to be better suited to our setting.
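The following sketch shows how balanced accuracy and the F1-score are computed from a confusion matrix, and one way to weight ensemble votes by balanced accuracy. The labels, the three toy classifiers, the 0.5 voting threshold, and the exact voting rule are assumptions for illustration, since the abstract does not specify them. For brevity the weights are computed on the same labels used for voting, whereas in practice they would be estimated on held-out data.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = fraudulent."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary predictions using normalized classifier weights."""
    total = sum(weights)
    n = len(predictions[0])
    combined = []
    for i in range(n):
        score = sum(w * preds[i] for preds, w in zip(predictions, weights)) / total
        combined.append(1 if score >= threshold else 0)
    return combined


# Toy imbalanced data: 2 fraudulent contracts out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
clf_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # conservative classifier
clf_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # aggressive classifier
clf_c = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts non-fraudulent

predictions = [clf_a, clf_b, clf_c]
weights = [balanced_accuracy(y_true, p) for p in predictions]
ensemble = weighted_vote(predictions, weights)
```

Here `clf_c` reaches 80% plain accuracy by never flagging fraud, yet its balanced accuracy is only 0.5, so it receives the smallest weight in the vote.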
Chapter 1 provides the background the reader needs to become familiar with the topic. It covers the conceptual difference between a traditional fiat currency such as the Swiss franc and a cryptocurrency. It goes on to explain concepts central to cryptocurrencies, foremost the Proof of Work (PoW) protocol and the smart contract feature. It then provides a literature review on fraud detection, focusing on common challenges such as that of imbalanced datasets. Chapter 2 illustrates the retrieval process for transactional and opcode data. It also discusses the NLP techniques we use to embed sequences of opcodes into numeric vectors. Chapter 3 presents the proposed model ensemble, explaining the value added by each stage of the ensemble. In particular, it focuses on how the ensemble addresses the challenge posed by the vast class imbalance. Chapter 4 examines the adequacy and usefulness of the model and, more broadly, of the thesis. In this regard, it discusses not only technical considerations of a statistical nature, but also economic and legal ones.
Permanent link
https://doi.org/10.3929/ethz-b-000523134
Publication status
published
Publisher
ETH Zurich
Subject
Cryptocurrencies; Fraud detection; Fraud; Ethereum; Cryptocurrency; Ledger; NLP; NATURAL LANGUAGE PROCESSING (ARTIFICIAL INTELLIGENCE); Imbalanced data; smart contract; Smart Contracts; Word embeddings; Document embeddings
Organisational unit
02537 - Seminar für Statistik (SfS) / Seminar for Statistics (SfS)