Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo, Veronika Laippala

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform the previous state of the art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse the classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.

Original language: English (US)
Title of host publication: EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 183-191
Number of pages: 9
ISBN (Electronic): 9781954085046
State: Published - 2021
Event: 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021 - Virtual, Online
Duration: Apr 19 2021 → Apr 23 2021

Conference

Conference: 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL 2021
City: Virtual, Online
Period: 4/19/21 → 4/23/21

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics
  • Linguistics and Language
