dc.contributor.author
Ferrario, Andrea
dc.contributor.author
Naegelin, Mara
dc.date.accessioned
2020-12-17T06:20:34Z
dc.date.available
2020-12-16T16:54:15Z
dc.date.available
2020-12-17T06:20:34Z
dc.date.issued
2020-03-31
dc.identifier.other
10.2139/ssrn.3547887
en_US
dc.identifier.uri
http://hdl.handle.net/20.500.11850/456731
dc.description.abstract
In this tutorial we introduce three approaches to preprocessing text data with Natural Language Processing (NLP) and performing text document classification using machine learning. The first approach is based on 'bag-of-' models, the second on word embeddings, and the third introduces the two most popular Recurrent Neural Networks (RNNs), i.e., the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures. We apply all approaches to a case study in which we classify movie reviews using Python and TensorFlow 2.0. The results of the case study show that extreme gradient boosting algorithms outperform adaptive boosting and random forests on bag-of-words and word embedding models, as well as LSTM and GRU RNNs, but at a steep computational cost. Finally, we provide the reader with comments on NLP applications for the insurance industry.
en_US
dc.language.iso
en
en_US
dc.publisher
Social Science Research Network
en_US
dc.subject
Natural language processing
en_US
dc.subject
Bag-of-words models
en_US
dc.subject
Word embeddings
en_US
dc.subject
Machine learning
en_US
dc.subject
Recurrent neural networks
en_US
dc.subject
Deep learning
en_US
dc.subject
Python
en_US
dc.subject
TensorFlow 2.0
en_US
dc.subject
Keras
en_US
dc.title
The Art of Natural Language Processing: Classical, Modern and Contemporary Approaches to Text Document Classification
en_US
dc.type
Working Paper
ethz.journal.title
SSRN
ethz.pages.start
3547887
en_US
ethz.size
51 p.
en_US
ethz.code.jel
JEL - JEL::C - Mathematical and Quantitative Methods::C4 - Econometric and Statistical Methods: Special Topics::C45 - Neural Networks and Related Topics
en_US
ethz.code.jel
JEL - JEL::C - Mathematical and Quantitative Methods::C5 - Econometric Modeling::C51 - Model Construction and Estimation
en_US
ethz.code.jel
JEL - JEL::C - Mathematical and Quantitative Methods::C5 - Econometric Modeling::C52 - Model Evaluation, Validation, and Selection
en_US
ethz.code.jel
JEL - JEL::G - Financial Economics::G2 - Financial Institutions and Services::G22 - Insurance; Insurance Companies; Actuarial Studies
en_US
ethz.publication.place
Rochester, NY
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02120 - Dep. Management, Technologie und Ökon. / Dep. of Management, Technology, and Ec.::03995 - von Wangenheim, Florian / von Wangenheim, Florian
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02120 - Dep. Management, Technologie und Ökon. / Dep. of Management, Technology, and Ec.::03995 - von Wangenheim, Florian / von Wangenheim, Florian
en_US
ethz.date.deposited
2020-12-16T16:54:25Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Metadata only
en_US
ethz.rosetta.installDate
2020-12-17T06:20:45Z
ethz.rosetta.lastUpdated
2021-02-15T22:34:05Z
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=The%20Art%20of%20Natural%20Language%20Processing:%20Classical,%20Modern%20and%20Contemporary%20Approaches%20to%20Text%20Document%20Classification&rft.jtitle=SSRN&rft.date=2020-03-31&rft.spage=3547887&rft.au=Ferrario,%20Andrea&Naegelin,%20Mara&rft.genre=preprint&rft_id=info:doi/10.2139/ssrn.3547887&

Files in this item

There are no files associated with this item.
