Anotação de corpus: caracterização de Entidades Nomeadas em tweets do mercado financeiro
Carregando...
Data
Autores
Título da Revista
ISSN da Revista
Título de Volume
Editor
Universidade Federal de São Carlos
Resumo
Corpus annotation is a cornerstone of Natural Language Processing (NLP), providing the foundation for training and evaluating Machine Learning systems, as well as for investigating linguistic behavior in various domains. A key challenge in this area is the annotation of Named Entities (NEs) in user-generated content (UGC). The informal language and a wide range of platform and domain-specific phenomena found in tweets demand highly adapted annotation methodologies. This dissertation addresses this challenge by conducting an independent reannotation of the DANTEStocks, a Portuguese corpus of 4,048 financial market tweets (84,396 tokens). While a previous annotation based on the 10 general categories of the Second HAREM existed, our work expands this taxonomy to a more granular set of 47 types. This was achieved by refining the original guidelines based on linguistically motivated decisions and by introducing four new domain-specific types: certificate, indicator, ticker, and user. The annotation was carried out by a single annotator using a semi-automatic approach that combined rule-based methods with manual curation, resulting in a new reference annotation and a comprehensive set of guidelines. The resulting annotation comprises 20,092 entities (24,825 tokens). A detailed characterization of the corpus revealed a linguistic profile dominated by single-token (i.e. those consisting of a single word) entities and by specific stock market types such as ticker, money, and virtual. The main contributions of this work are therefore: the enrichment of the DANTEStocks corpus with a fine-grained NE annotation, a set of guidelines for annotating UGC, and a discussion of the strategies developed to overcome the inherent challenges of this task.
Descrição
Palavras-chave
Citação
PIAI, Laís. Anotação de corpus: caracterização de Entidades Nomeadas em tweets do mercado financeiro. 2025. Dissertação (Mestrado em Linguística) – Universidade Federal de São Carlos, São Carlos, 2025. Disponível em: https://repositorio.ufscar.br/handle/20.500.14289/22651.
Coleções
item.page.endorsement
item.page.review
item.page.supplemented
item.page.referenced
Licença Creative Commons
Exceto quando indicado de outra forma, a licença deste item é descrita como Attribution-NonCommercial-NoDerivs 3.0 Brazil
