Typology of linguistic features of Brazilian Portuguese texts from the 16th, 17th, 18th and 19th centuries: a proposal for the automatic classification of text genres
Souza, Jacqueline Aparecida de
Drawing on the methodological principles of Corpus Linguistics and on the concepts of genre proposed by Swales (1990) and Biber (1995), this research describes linguistic features characteristic of historical texts and correlates them with their respective genres, and proposes a typology of features that makes automatic genre identification possible. The study used the corpus of 16th-, 17th- and 18th-century Portuguese from the project Historical Dictionary of the Portuguese of Brazil (Institutes of the Millennium program, CNPq, UNESP/Araraquara), which comprises 2,459 texts and 7.5 million words. To produce a historical description, the study started from synchronic characteristics taken from the table of contemporary features compiled by Aires (2005). The corpus was processed with PhiloLogic and Unitex, together with a further tool developed for extracting and quantifying features. For classification, algorithms available in Weka (Waikato Environment for Knowledge Analysis) were used: Naive Bayes, Bayes Net, SMO, Multilayer Perceptron, RBFNetwork, J48 and NBTree. The description is based on 62 features, which include statistics computed over the text as a whole and over words, covering classes of verbs, pronouns and adverbs as well as discourse markers, expressions and lexical units. The study concludes that the genres share specific linguistic characteristics but also show patterns of their own in the use of specific expressions and the frequency of lexical units. Despite the limitations and complications of working with a historical corpus, the classifiers based on the selected features performed satisfactorily, with correct-classification rates of 84% and 92%.
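As an illustration of the kind of per-text statistics the abstract describes (measures over the text as a whole, over words, and over the frequency of lexical units), the following is a minimal Python sketch. It is not the extraction tool developed in the thesis: the function name `extract_features`, the three statistics chosen, and the `MARKERS` list are all assumptions made for this example only.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical lexical units, chosen only for illustration;
# they are NOT taken from the thesis's table of 62 features.
MARKERS = ("vossa", "mercê", "senhor")


def extract_features(text, markers=MARKERS):
    """Return a dict of simple per-text statistics for one document."""
    # Normalize to composed form so accented letters are single characters.
    text = unicodedata.normalize("NFC", text).lower()
    tokens = re.findall(r"\w+", text)
    n = len(tokens)
    counts = Counter(tokens)

    features = {
        # Statistic over the text as a whole: its length in tokens.
        "tokens": n,
        # Lexical variety: distinct word forms divided by total tokens.
        "type_token_ratio": len(counts) / n if n else 0.0,
        # Statistic over words: mean length in characters.
        "mean_word_length": sum(len(t) for t in tokens) / n if n else 0.0,
    }
    # Relative frequency of each (hypothetical) lexical unit.
    for marker in markers:
        features[f"freq_{marker}"] = counts[marker] / n if n else 0.0
    return features


if __name__ == "__main__":
    sample = "Senhor, beijo as mãos de Vossa Mercê. Vossa Mercê me perdoe."
    for name, value in extract_features(sample).items():
        print(name, value)
```

Each text would yield one such feature vector, and a table of these vectors labelled by genre is the kind of input that a classifier, such as those available in Weka, could then be trained on.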