Leveraging large amounts of weakly supervised data for multi-language sentiment classification
Abstract
This paper presents a novel approach for multi-lingual sentiment classification in short texts. This is a challenging task, as the amount of training data in languages other than English is very limited. Previously proposed multi-lingual approaches typically require establishing a correspondence to English, for which powerful classifiers are already available. In contrast, our method does not require such supervision. We leverage large amounts of weakly-supervised data in various languages to train a multi-layer convolutional network and demonstrate the importance of pre-training such networks. We thoroughly evaluate our approach on various multi-lingual datasets, including the recent SemEval-2016 sentiment prediction benchmark (Task 4), where we achieved state-of-the-art performance. We also compare the performance of our model trained individually for each language to a variant trained for all languages at once. We show that the latter model reaches slightly worse, but still acceptable, performance compared to the single-language model, while benefiting from better generalization properties across languages.
Permanent link: https://doi.org/10.3929/ethz-b-000126764
Publication status: published
Book title: Proceedings of the 26th International Conference on World Wide Web (WWW '17)
Publisher: ACM
Subject: Sentiment classification; multi-language; weak supervision; neural networks
Organisational unit: 09462 - Hofmann, Thomas / Hofmann, Thomas