Análise de regras linguísticas para o aperfeiçoamento de anotações automáticas de part-of-speech

Ribeiro, Lucas Lopes

dc.contributor.author	Ribeiro, Lucas Lopes
dc.date.accessioned	2024-08-19T13:00:07Z
dc.date.available	2024-08-19T13:00:07Z
dc.date.issued	2023-09-11
dc.identifier.citation	RIBEIRO, Lucas Lopes. Análise de regras linguísticas para o aperfeiçoamento de anotações automáticas de part-of-speech. 2023. Trabalho de Conclusão de Curso (Graduação em Linguística) – Universidade Federal de São Carlos, São Carlos, 2023. Disponível em: https://repositorio.ufscar.br/handle/ufscar/20375.	*
dc.identifier.uri	https://repositorio.ufscar.br/handle/ufscar/20375
dc.description.abstract	Automatic part-of-speech annotation, also known as PoS tagging, is an essential task because it is one of the first processing steps a text undergoes when analyzed by Natural Language Processing (NLP) applications or methods. The task consists of classifying the words of a text according to their grammatical classes. The literature contains numerous studies on this activity, most of them focused on corpora of more formal genres, such as journalistic and academic texts. Furthermore, Universal Dependencies (UD) is currently the most widely adopted linguistic theory in research on automatic tagging because it provides universal guidelines for morphosyntactic labels. For Portuguese, there are still few works based on this formalism, especially for user-generated content (UGC). Therefore, the objective of this study was to analyze, refine, and evaluate a set of post-editing tagging rules proposed by Ceregatto (2022), derived from errors made by the UDPipe 2.1 model when annotating the DANTEStocks corpus, a collection of financial market tweets. These rules aim to enrich tagging methods (statistical and/or probabilistic, such as UDPipe 2.1) with linguistic knowledge for UGC texts in Portuguese. The main results are the reduction of the initial rule set, the formalization of the rule descriptions, and the evaluation of the refined rules for the ADJ tag.	eng
dc.description.sponsorship	No funding received	eng
dc.language.iso	por	por
dc.publisher	Universidade Federal de São Carlos	por
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Brazil	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/br/	*
dc.subject	Etiquetação morfossintática	por
dc.subject	Tweet	eng
dc.subject	Universal dependencies	eng
dc.subject	Tagging	eng
dc.title	Análise de regras linguísticas para o aperfeiçoamento de anotações automáticas de part-of-speech	por
dc.title.alternative	Analysis of linguistic rules for improving automatic part-of-speech annotation	eng
dc.type	TCC	por
dc.contributor.advisor1	Di Felippo, Ariani
dc.contributor.advisor1Lattes	https://lattes.cnpq.br/8648412103197455	por
dc.description.resumo	A etiquetação morfossintática automática de Part-of-speech, também denominada PoS tagging, é uma tarefa essencial, pois é um dos primeiros processamentos textuais pelo qual um texto é submetido durante a análise realizada por aplicações ou métodos de Processamento Automático das Línguas Naturais. A tarefa consiste em classificar as palavras de um texto de acordo com as suas classes gramaticais. Na literatura, há diversas pesquisas voltadas para esse tipo de atividade, em sua maioria voltada para corpus dos gêneros mais formais como o jornalístico e acadêmico. Além disso, a Universal Dependencies (UD) é a teoria linguística mais adotada nas pesquisas de etiquetação automática atualmente, por apresentar diretrizes universais de etiquetas morfossintáticas. Para a língua portuguesa, ainda há poucos trabalhos baseado nesse formalismo, sobretudo quando se trata de conteúdo (principalmente, textos) gerado por usuários (CGU). Portanto, o objetivo deste trabalho foi o de analisar, refinar e avaliar um conjunto de regras de pós-edição de tagging, propostas por Ceregatto (2022), a partir de erros cometidos pelo modelo UDPipe 2.1 quando da anotação do corpus DANTEStocks, que reúne um conjunto de tweets do mercado financeiro. Tais regras objetivam enriquecer os métodos de tagging (estatísticos e/ou probabilísticos, como o UDPipe 2.1) com conhecimento linguístico para textos do tipo CGU em língua portuguesa. Como resultado, destaca-se a redução do conjunto inicial de regras, a formalização de sua descrição e a avaliação das regras refinadas que se referem à tag ADJ.	por
dc.publisher.initials	UFSCar	por
dc.subject.cnpq	LINGUISTICA, LETRAS E ARTES::LINGUISTICA::TEORIA E ANALISE LINGUISTICA	por
dc.publisher.address	Câmpus São Carlos	por
dc.contributor.authorlattes	http://lattes.cnpq.br/1786710232811854	por
dc.publisher.course	Linguística - Ling	por
dc.contributor.advisor1orcid	https://orcid.org/0000-0002-4566-9352	por

Files in this item

Name: RibeiroLL_TCC_Ling_2023.pdf
Size: 1.417Mb
Format: PDF


Name: license_rdf
Size: 810 bytes
Format: application/rdf+xml


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil