Morphosyntactic characterization of a tweet corpus and preliminary analysis of tagging errors

Publisher

Universidade Federal de São Carlos

Abstract

Part-of-speech tagging is one of the first steps in natural language interpretation in Natural Language Processing (NLP) systems and applications. Defined as the identification of the grammatical category of each word or token in a text, tagging produces knowledge that other components of the system or application rely on, such as syntactic analysis (parsing). Given the relevance of social networks, much research on tagging has focused on processing different types of “user-generated content” (UGC). As for the grammatical framework and linguistic resources, most of this research has relied on the Universal Dependencies (UD) model and on the construction of annotated tweet corpora (also called tweebanks). In this work, we first carried out a statistical characterization of the gold-standard morphosyntactic annotation of the DANTEStocks corpus. This resource comprises tweets from the stock market domain and is the first tweebank with UD annotation in Portuguese. We found that (i) the posts in the corpus tend to be fragmented and to contain informal uses of punctuation marks (such as reduplication), as evidenced by the high frequency of the PUNCT tag; (ii) the tweets seem to have a nominal structure, since NOUN and PROPN are highly frequent, unlike VERB; (iii) users rarely use interjections to express feelings and emotions, given the low frequency of INTJ; and (iv) the character limit imposed by the platform seems to favor syntactically simple tweets, in which CCONJ, SCONJ and AUX are avoided. Next, we carried out an initial analysis of the tagging errors made by UDPipe 2.1 when annotating a subset of tweets from DANTEStocks. This analysis resulted in a set of post-editing tagging rules, which still need to be evaluated, and in a classification of these rules according to their degree of generalization. Furthermore, we found that (i) the vast majority of the tagger's errors concern general language knowledge rather than domain knowledge, and (ii) the UGC phenomena (lexical and/or orthographic) in the corpus seem to have little influence on the tagging process, since only a minority of errors involve tokens that exhibit these phenomena. We believe that this work contributes both to linguistic-descriptive studies and to NLP.

Citation

CEREGATTO, Gabriel. Caracterização morfossintática de um corpus de tweets e análise preliminar de erros de tagging. 2022. Undergraduate thesis (Bachelor's degree in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2022. Available at: https://repositorio.ufscar.br/handle/20.500.14289/20374.

Creative Commons License

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil