Formal meaning representation: the case of financial market tweets

Publisher

Universidade Federal de São Carlos

Abstract

Semantic annotation of corpora plays a crucial role in the development of Natural Language Processing (NLP) tools. For user-generated content (UGC) corpora composed of tweets, whose language is predominantly informal, semantic annotation typically focuses on lexical aspects such as named entities, emotion, or polarity. In this study, we investigated the sentence-level semantic representation of Portuguese-language financial market tweets using the Abstract Meaning Representation (AMR) framework, under the hypothesis that syntactic information can aid such representations. To this end, we annotated part of the DANTEStocks corpus, the first tweebank with grammatical annotation according to the Universal Dependencies (UD) formalism, covering a subset of 1,128 of its 4,048 tweets (about 28% of the total). The AMR annotation followed a hybrid methodology, comprising a manual phase to establish guidelines and reference models, and a semiautomatic phase in which graphs produced by a large language model were manually corrected. This process confirmed our hypothesis that UD syntactic dependencies facilitate the construction of AMR graphs, particularly in identifying subgraphs and relations among concepts. Additional contributions include: (i) the development and validation of AMR annotation guidelines for phenomena specific to Portuguese, UGC, and the financial domain; (ii) the proposal of labels for three types of financial-domain entities (URLs, tickers, and users); and (iii) the introduction of a frameset for the VerboBrasil repository covering the financial verb repicar. The primary challenge was interpreting tweets for AMR graph construction: their fragmentation, truncation, contextual dependence, and specialized vocabulary required continuous expert consultation and external financial-domain resources. Nevertheless, even with a relatively small validation set, an inter-annotator F-score of 89% indicates that the AMR-annotated portion of DANTEStocks is a reliable resource for pioneering research on AMR parsing of tweets in NLP and for guiding the annotation of the remaining corpus.
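
As a rough illustration of what an inter-annotator F-score over AMR graphs measures, the sketch below compares two annotators' graphs as sets of (source, relation, target) triples. It is not the procedure used in the dissertation: the abstract does not name the metric or tool, and the function name, variable names, and example triples here are all hypothetical. It also assumes that variable names are already aligned between the two graphs, whereas AMR metrics such as Smatch search for the best variable alignment first.

```python
def triple_f_score(gold, other):
    """F-score between two AMR graphs given as collections of triples.

    Assumes variables are already aligned across the two graphs
    (a simplification; Smatch-style metrics search for the alignment).
    """
    gold, other = set(gold), set(other)
    matched = len(gold & other)
    if matched == 0:
        return 0.0
    precision = matched / len(other)  # share of the second graph confirmed by the first
    recall = matched / len(gold)      # share of the first graph recovered by the second
    return 2 * precision * recall / (precision + recall)


# Hypothetical triples for a tweet meaning roughly "the stock rose".
annotator_a = [
    ("r", "instance", "rise-01"),
    ("s", "instance", "stock"),
    ("r", "ARG1", "s"),
]
annotator_b = [
    ("r", "instance", "rise-01"),
    ("s", "instance", "stock"),
    ("r", "ARG0", "s"),  # the annotators disagree on the role label
]

score = triple_f_score(annotator_a, annotator_b)
print(f"{score:.3f}")
```

In this toy case the annotators agree on two of three triples each, so precision and recall are both 2/3 and the F-score is 2/3.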

Citation

CEREGATTO, Gabriel. Representação formal de significado: o caso dos tweets do mercado financeiro. 2025. Dissertation (Master's in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/22742.

Creative Commons License

Except where otherwise noted, this item is licensed under Attribution-NonCommercial-NoDerivs 3.0 Brazil