Morphosyntactic characterization of a tweet corpus and preliminary analysis of tagging errors
Publisher
Universidade Federal de São Carlos
Abstract
Part-of-speech tagging is one of the first steps in natural language interpretation in Natural Language Processing (NLP) systems and applications. Defined as the identification of the grammatical category of each word or token in a text, tagging generates knowledge that other processes of the system or application rely on, such as syntactic analysis (parsing). Given the relevance of social networks, much research on tagging has focused on processing different types of “user-generated content” (UGC). As for the grammatical framework and linguistic resources, most of this research has relied on the Universal Dependencies (UD) model and on the construction of annotated tweet corpora (also called tweebanks). In this work, we first carried out a statistical characterization of the gold-standard morphosyntactic annotation of the DANTEStocks corpus. This resource comprises tweets from the stock market domain and is the first tweebank with UD annotation in Portuguese. We found that (i) the posts in the corpus tend to be fragmented and to use punctuation marks informally (for example, reduplication), as evidenced by the high frequency of the PUNCT tag; (ii) the tweets seem to have a nominal structure, since NOUN and PROPN are highly frequent, unlike VERB; (iii) users rarely use interjections to express feelings and emotions, given the low frequency of INTJ; and (iv) the character limit imposed by the platform seems to favor syntactically simple tweets, in which CCONJ, SCONJ and AUX are avoided. Next, we carried out an initial analysis of the tagging errors made by UDPipe 2.1 when annotating a subset of tweets from DANTEStocks. This analysis resulted in a set of post-editing tagging rules, which still need to be evaluated, and in a classification of these rules according to their degree of generalization. Furthermore, we found that (i) the vast majority of the tagger's errors concern general language knowledge rather than domain knowledge and (ii) the (lexical and/or orthographic) UGC phenomena of the corpus seem to have little influence on the tagging process, since only a minority of errors involve tokens characterized by these phenomena. We therefore believe that this work contributes to linguistic-descriptive studies and to NLP.
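The two analyses summarized in the abstract (the relative frequency of each UD part-of-speech tag, and a comparison of gold-standard tags with tagger output) can be illustrated with a minimal sketch. It assumes the annotations are available as CoNLL-U text, the standard UD exchange format; the function names and the four-token toy sentence are invented for illustration and are not taken from DANTEStocks or from the thesis.

```python
from collections import Counter


def read_upos(conllu_text):
    """Return (form, upos) pairs for the syntactic words in CoNLL-U text."""
    pairs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        pairs.append((cols[1], cols[3]))  # FORM and UPOS columns
    return pairs


def upos_distribution(pairs):
    """Relative frequency of each UPOS tag."""
    counts = Counter(tag for _, tag in pairs)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def confusion(gold, pred):
    """Count (gold_tag, predicted_tag) pairs where the tagger disagrees."""
    return Counter((g, p) for (_, g), (_, p) in zip(gold, pred) if g != p)


def make_row(idx, form, upos):
    """Build one 10-column CoNLL-U token line with unused fields set to '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", "_", "_", "_", "_", "_"])


# Invented toy data (hypothetical, not from the corpus).
forms = ["PETR4", "subiu", "hoje", "!!!"]
gold_tags = ["PROPN", "VERB", "ADV", "PUNCT"]
pred_tags = ["NOUN", "VERB", "ADV", "PUNCT"]

GOLD = "\n".join(make_row(i + 1, f, t) for i, (f, t) in enumerate(zip(forms, gold_tags)))
PRED = "\n".join(make_row(i + 1, f, t) for i, (f, t) in enumerate(zip(forms, pred_tags)))

gold = read_upos(GOLD)
pred = read_upos(PRED)
distribution = upos_distribution(gold)
errors = confusion(gold, pred)
```

The sketch aligns gold and predicted tokens by position, so it only applies when both annotations share the same tokenization.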
Citation
CEREGATTO, Gabriel. Caracterização morfossintática de um corpus de tweets e análise preliminar de erros de tagging. 2022. Trabalho de Conclusão de Curso (Graduação em Linguística) – Universidade Federal de São Carlos, São Carlos, 2022. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/20374.
Creative Commons License
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil
