Paper Title

UPB at SemEval-2020 Task 12: Multilingual Offensive Language Detection on Social Media by Fine-tuning a Variety of BERT-based Models

Paper Authors

Mircea-Adrian Tanase, Dumitru-Clementin Cercel, Costin-Gabriel Chiru

Paper Abstract

Offensive language detection is one of the most challenging problems in the natural language processing field, driven by the rising presence of this phenomenon in online social media. This paper describes our Transformer-based solutions for identifying offensive language on Twitter in five languages (i.e., English, Arabic, Danish, Greek, and Turkish), which were employed in Subtask A of the OffensEval 2020 shared task. Several neural architectures (i.e., BERT, mBERT, RoBERTa, XLM-RoBERTa, and ALBERT), pre-trained using both single-language and multilingual corpora, were fine-tuned and compared using multiple combinations of datasets. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team 21st of 85, 28th of 53, 19th of 39, 16th of 37, and 10th of 46 for English, Arabic, Danish, Greek, and Turkish, respectively.
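To make the fine-tuning setup described in the abstract concrete, below is a minimal sketch (not the authors' released code) of fine-tuning one of the models they name, XLM-RoBERTa, for binary offensive-vs-not classification using the Hugging Face Transformers library. The toy training examples, hyperparameters, and two-label scheme (OFF/NOT, following the OffensEval task format) are illustrative assumptions.

```python
# Hypothetical sketch: fine-tuning xlm-roberta-base as a binary classifier,
# standing in for the paper's model-comparison pipeline.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class TweetDataset(Dataset):
    """Pairs of (tweet text, label); 1 = offensive (OFF), 0 = not (NOT)."""

    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, max_length=max_len,
                             padding="max_length", return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Toy examples standing in for the OffensEval training data (assumption).
train_ds = TweetDataset(["have a nice day", "you are an idiot"],
                        [0, 1], tokenizer)
loader = DataLoader(train_ds, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(2):  # placeholder epoch count
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the two labels
        loss.backward()
        optimizer.step()
```

Swapping the checkpoint string (e.g., "bert-base-multilingual-cased" for mBERT) is enough to reproduce the kind of architecture comparison the abstract describes, since AutoTokenizer and AutoModelForSequenceClassification resolve the correct classes per checkpoint.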
