Representação formal de significado: o caso dos tweets do mercado financeiro [Formal meaning representation: the case of financial market tweets]
Publisher
Universidade Federal de São Carlos
Abstract
Semantic annotation of corpora plays a crucial role in the development of Natural Language Processing (NLP) tools. For user-generated content (UGC) corpora composed of tweets, whose language is predominantly informal, semantic annotation typically focuses on lexical aspects such as named entities, emotion, or polarity. In this study, we investigated the sentence-level semantic representation of Portuguese-language financial market tweets via the Abstract Meaning Representation (AMR) framework, under the hypothesis that syntactic information can aid such representations. To this end, we annotated a subset of the DANTEStocks corpus, the first tweebank with grammatical annotation according to the Universal Dependencies (UD) formalism, covering 1,128 of its 4,048 tweets (about 28% of the total). The AMR annotation followed a hybrid methodology comprising a manual phase, to establish guidelines and reference models, and a semiautomatic phase, in which manually curated corrections were applied to graphs produced by a large language model. This process confirmed our hypothesis that UD syntactic dependencies facilitate the construction of AMR graphs, particularly in identifying subgraphs and relations among concepts. Additional contributions include: (i) the development and validation of AMR annotation guidelines for phenomena specific to Portuguese, UGC, and the financial domain; (ii) the proposal of labels for three types of financial domain entities (URLs, tickers, and users); and (iii) the introduction of a frameset for the VerboBrasil repository covering the financial verb repicar. The primary challenge encountered was interpreting tweets for AMR graph construction, given their fragmentation, truncation, contextual dependence, and specialized vocabulary, which required continuous expert consultation and external financial-domain resources.
Nevertheless, even with a relatively small validation set, an inter-annotator F-score of 89% indicates that the AMR-annotated portion of DANTEStocks is a reliable resource for pioneering research on AMR parsing of tweets in NLP and for guiding the annotation of the remaining corpus.
Citation
CEREGATTO, Gabriel. Representação formal de significado: o caso dos tweets do mercado financeiro. 2025. Dissertation (Master's in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2025. Available at: https://repositorio.ufscar.br/handle/20.500.14289/22742.
Creative Commons License
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil
