Proposta e validação de uma taxonomia de variações ortográficas em tweets do mercado financeiro
Publisher
Universidade Federal de São Carlos
Abstract
Given the relevance of the Twitter platform (now X) for various segments of society, Natural Language Processing (NLP) tools and applications capable of handling the primarily non-canonical language of the tweet genre (currently referred to as posts) are in high demand. To develop them, annotated corpora (known as tweetbanks) are essential resources, as are the description and analysis of their linguistic characteristics. This undergraduate thesis investigates DANTEStocks, a corpus of 4,048 financial market tweets and the first in Portuguese to be annotated according to the Universal Dependencies (UD) model. More precisely, orthographic phenomena in the corpus were surveyed and treated as "variations" rather than "errors", in accordance with the theoretical assumptions of Variationist Sociolinguistics. These phenomena were systematized into a hierarchical typology with two dimensions, "Standard Norm" and "Innovative Norm", which seek to capture variations of canonical and innovative lexical forms, respectively. Based on this typology, the phenomena observed in 3,614 tokens (found in 1,069 tweets) from DANTEStocks were manually annotated, yielding a preliminary characterization of the corpus. The results showed that the Innovative Norm predominates, accounting for 92.81% of the annotated phenomena (3,457 occurrences). This reinforces the hypothesis that, in User-Generated Content (UGC) on financial topics, innovative linguistic strategies prevail, manifesting in tokens that function as codes and systematic forms of digital communication characteristic of the medium and of a particular social context. The annotation of orthographic variations in DANTEStocks thus broadens the understanding of the language of financial market tweets and may enhance the tolerance of NLP models toward non-canonical language, enabling them to recognize variant forms as linguistically valid and semantically informative.
Citation
SCANDAROLLI, Clarissa Lenina. Proposta e validação de uma taxonomia de variações ortográficas em tweets do mercado financeiro. 2025. Undergraduate thesis (undergraduate degree in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/23898.
Creative Commons License
Except where otherwise noted, this item is licensed under Attribution-NonCommercial-NoDerivs 3.0 Brazil
