Proposta e validação de uma taxonomia de variações ortográficas em tweets do mercado financeiro (Proposal and validation of a taxonomy of orthographic variations in financial market tweets)

Publisher

Universidade Federal de São Carlos

Abstract

Given the relevance of the Twitter platform (now X) to many segments of society, Natural Language Processing (NLP) tools and applications capable of handling the largely non-canonical language of the tweet genre (now called posts) are in high demand. Developing them requires annotated corpora (known as tweetbanks), along with the description and analysis of their linguistic characteristics. This undergraduate thesis investigates DANTEStocks, a corpus of 4,048 financial market tweets and the first in Portuguese to be annotated according to the Universal Dependencies (UD) model. More precisely, it surveys the orthographic phenomena in the corpus, treating them as "variations" rather than "errors", in line with the theoretical assumptions of Variationist Sociolinguistics. These phenomena were systematized into a hierarchical typology with two dimensions, "Standard Norm" and "Innovative Norm", which seek to capture variations of canonical and innovative lexical forms. Based on this typology, the phenomena observed in 3,614 tokens (found in 1,069 tweets) from DANTEStocks were manually annotated, yielding a preliminary characterization of the corpus. The results show a predominance of the Innovative Norm, which accounts for 92.81% of the annotated phenomena (3,457 occurrences). This reinforces the hypothesis that, in User-Generated Content (UGC) on financial topics, innovative linguistic strategies prevail, manifesting in tokens that function as codes and systematic forms of digital communication, characteristic of the medium and of a particular social context. The annotation of graphic variations in DANTEStocks thus broadens the understanding of the language of financial market tweets and may make NLP models more tolerant of non-canonical language, enabling them to recognize variant forms as linguistically valid and semantically informative.

Citation

SCANDAROLLI, Clarissa Lenina. Proposta e validação de uma taxonomia de variações ortográficas em tweets do mercado financeiro. 2025. Undergraduate thesis (Bachelor's degree in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/23898.

Creative Commons License

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil