Anotação de corpus: caracterização de Entidades Nomeadas em tweets do mercado financeiro

Carregando...
Imagem de Miniatura

Título da Revista

ISSN da Revista

Título de Volume

Editor

Universidade Federal de São Carlos

Resumo

Corpus annotation is a cornerstone of Natural Language Processing (NLP), providing the foundation for training and evaluating Machine Learning systems, as well as for investigating linguistic behavior in various domains. A key challenge in this area is the annotation of Named Entities (NEs) in user-generated content (UGC). The informal language and a wide range of platform and domain-specific phenomena found in tweets demand highly adapted annotation methodologies. This dissertation addresses this challenge by conducting an independent reannotation of the DANTEStocks, a Portuguese corpus of 4,048 financial market tweets (84,396 tokens). While a previous annotation based on the 10 general categories of the Second HAREM existed, our work expands this taxonomy to a more granular set of 47 types. This was achieved by refining the original guidelines based on linguistically motivated decisions and by introducing four new domain-specific types: certificate, indicator, ticker, and user. The annotation was carried out by a single annotator using a semi-automatic approach that combined rule-based methods with manual curation, resulting in a new reference annotation and a comprehensive set of guidelines. The resulting annotation comprises 20,092 entities (24,825 tokens). A detailed characterization of the corpus revealed a linguistic profile dominated by single-token (i.e. those consisting of a single word) entities and by specific stock market types such as ticker, money, and virtual. The main contributions of this work are therefore: the enrichment of the DANTEStocks corpus with a fine-grained NE annotation, a set of guidelines for annotating UGC, and a discussion of the strategies developed to overcome the inherent challenges of this task.

Descrição

Citação

PIAI, Laís. Anotação de corpus: caracterização de Entidades Nomeadas em tweets do mercado financeiro. 2025. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2025. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/22651.

item.page.endorsement

item.page.review

item.page.supplemented

item.page.referenced

Licença Creative Commons

Exceto quando indicado de outra forma, a licença deste item é descrita como Attribution-NonCommercial-NoDerivs 3.0 Brazil