The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification



Date

2020-03-31

Publication Type

Working Paper

ETH Bibliography

yes

Abstract

In this tutorial we introduce three approaches to preprocessing text data with Natural Language Processing (NLP) and performing text document classification using machine learning. The first approach is based on 'bag-of-' models, the second on word embeddings, and the third introduces the two most popular Recurrent Neural Networks (RNNs), i.e. the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures. We apply all approaches to a case study in which we classify movie reviews using Python and TensorFlow 2.0. The results of the case study show that extreme gradient boosting algorithms outperform adaptive boosting and random forests on bag-of-words and word embedding models, as well as LSTM and GRU RNNs, but at a steep computational cost. Finally, we provide the reader with comments on NLP applications for the insurance industry.

Publication status

published

Pages / Article No.

3547887

Publisher

Social Science Research Network

Subject

Natural language processing; Bag-of-words models; Word embeddings; Machine learning; Recurrent neural networks; Deep learning; Python; Tensorflow 2.0; Keras

Organisational unit

03995 - von Wangenheim, Florian / von Wangenheim, Florian
