Metodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico
Matos, Pablo Freire
MetadataMostrar registro completo
There is a large volume of unstructured information (i.e., in text format) being published in electronic media, in digital libraries particularly. Thus, the human being becomes restricted to an amount of text that is able to process and to assimilate over time. In this dissertation is proposed a methodology for textual preprocessing to extract information about disease effects in the biomedical domain papers, in order to identify relevant information from a text, to structure and to store this information in a database to provide a future discovery of interesting relationships between the extracted information. The methodology consists of four steps: Data Entrance (Step 1), Sentence Classification (Step 2), Identification of Relevant Terms (Step 3) and Terms Management (Step 4). This methodology uses three information extraction approaches from the literature: machine learning approach, dictionary-based approach and rule-based approach. The first one is developed in Step 2, in which a supervised machine learning algorithm is responsible for classify the sentences. The second and third ones are developed in Step 3, in which a dictionary of terms validated by an expert and rules developed through regular expressions were used to identify relevant terms in sentences. The methodology validation was carried out through its instantiation to an area of the biomedical domain, more specifically using papers on Sickle Cell Anemia. Accordingly, two case studies were conducted in both Step 2 and in Step 3. The obtained accuracy in the sentence classification was above of 60% and F-measure for the negative effect class was above of 70%. These values correspond to the results achieved with the Support Vector Machine algorithm along with the use of the Noise Removal filter. The obtained F-measure with the identification of relevant terms was above of 85% for the fictitious extraction (i.e., manual classification performed by the expert) and above of 80% for the actual extraction (i.e., automatic classification performed by the classifier). The F-measure of the classifier above of 70% and F-measure of the actual extraction above 80% show the relevance of the sentence classification in the proposed methodology. Importantly to say that many false positives would be identified in full text papers without the sentence classification step.