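A paragraph-based process for extracting treatments from scientific papers in the biomedical domain (original title: Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico)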

Duque, Juliana Lilian

Files

4310.pdf (3.11 MB)

Date

2012-02-24

Authors

Duque, Juliana Lilian

Publisher

Universidade Federal de São Carlos

Abstract

The medical field currently holds a large amount of unstructured information (i.e., in textual format). This volume makes it impossible for doctors and specialists to manually analyze all the relevant literature, which calls for techniques that analyze documents automatically. To identify relevant information, structure and store it in a database, and enable the future discovery of significant relationships, this work proposes a paragraph-based process for extracting treatments from scientific papers in the biomedical domain. The hypothesis is that first searching for sentences that contain complication terms improves the identification and extraction of treatment terms, because treatments mainly occur in the same sentence as a complication or in nearby sentences of the same paragraph. The methodology employs three approaches to information extraction: a machine learning-based approach, which classifies the sentences of interest that contain terms to be extracted; a dictionary-based approach, which uses terms validated by an expert in the field; and a rule-based approach. The methodology was validated as a proof of concept on papers from the biomedical domain, specifically papers related to Sickle Cell Anemia. The proof of concept covered the classification of sentences and the identification of relevant terms. Sentence classification accuracy was 79% for the complication classifier and 71% for the treatment classifier. These values correspond to the results obtained by combining the Support Vector Machine learning algorithm with the Noise Removal and Balancing of Classes filter.
In the identification of relevant terms, the methodology achieved a higher F-measure (42%) than the manual classification (31%) and than the partial process, i.e., the process without the complication classifier (36%). Even with a low recall percentage, no impact on the extraction process was observed, since it was possible to obtain 100% recall for different terms. In addition, the working hypothesis of this study was validated.
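
The paragraph-based idea described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: a simple dictionary lookup stands in for the Support Vector Machine complication classifier, the term lists are invented examples rather than the expert-validated dictionaries, and the names `extract_treatments` and `window` are hypothetical.

```python
import re

# Hypothetical example dictionaries; the dissertation uses expert-validated terms.
COMPLICATION_TERMS = {"stroke", "pain crisis", "acute chest syndrome"}
TREATMENT_TERMS = {"hydroxyurea", "blood transfusion", "penicillin"}


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def find_terms(sentence, terms):
    """Return the dictionary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return sorted(
        t for t in terms
        if re.search(r"\b" + re.escape(t) + r"\b", lowered)
    )


def extract_treatments(paragraph, window=1):
    """Look for treatment terms only near sentences that mention a complication.

    `window` is an assumed parameter: how many neighbouring sentences on each
    side of a complication sentence are also searched.
    """
    sentences = split_sentences(paragraph)
    # Step 1: locate complication sentences (dictionary lookup stands in
    # for the complication classifier).
    hits = [i for i, s in enumerate(sentences)
            if find_terms(s, COMPLICATION_TERMS)]
    # Step 2: collect those sentences and their neighbours in the paragraph.
    candidates = set()
    for i in hits:
        start = max(0, i - window)
        stop = min(len(sentences), i + window + 1)
        candidates.update(range(start, stop))
    # Step 3: dictionary-based extraction of treatment terms.
    found = set()
    for i in sorted(candidates):
        found.update(find_terms(sentences[i], TREATMENT_TERMS))
    return sorted(found)
```

Under these assumptions, a paragraph with no complication term yields no treatments at all, which mirrors the hypothesis that complication sentences anchor the search for treatments.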

Keywords

Artificial Intelligence, Databases, Text Mining, Pattern Recognition, Information Extraction, Sickle Cell Anemia, Treatments, Preprocessing, Biomedical Domain

Citation

DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.

URI

https://repositorio.ufscar.br/handle/20.500.14289/496

Collections

Theses and Dissertations
