Information retrieval in meeting minutes using text segmentation and topic extraction
Abstract
In a context where most of the information stored by organizations is in text format,
the development of computational tools for knowledge discovery, aimed at organizing and
retrieving information, is a task of considerable interest and relevance. Topic extraction
models are often employed in this task. These models are able to establish relations between
documents and to find latent patterns in sets of them. However, there is an additional
challenge for documents composed of multiple subjects. For texts with subject shifts, it
is first necessary to find the chunks of text that address a single subject. To this end, text
segmentation techniques are used to break a text into segments, each with a relatively
independent subject. The combination of text segmentation and topic extraction algorithms can be
used to build a structure that helps in understanding textual data, which are inherently
unstructured. The creation of a topic-organized structure that incorporates latent
information about the corpus favors Information Retrieval techniques. This approach
allows the query space to be expanded beyond the original set of terms of each segment, and
the most relevant pieces of text to be identified. This work presents a methodology that connects
text segmentation techniques to topic extraction models in order to generate a
derived structure from an unstructured corpus, which gathers the original texts
plus the latent information, organized by semantic similarity. The research for this thesis
investigates text segmentation techniques and topic extraction models in order to develop
an Information Retrieval system for meeting minutes. We developed a system that returns to
the user the segments most relevant to the query. The segments presented are clustered
by topic, which also enables exploratory searches of the database by browsing
groups of semantically related segments. Furthermore, less relevant segments are omitted,
allowing results focused on the searched subject. Five text segmentation techniques
and three topic extraction models from the literature were evaluated. Based on the set of
meeting minutes analysed, we created a manually annotated corpus with information on
the thematic composition of the meeting minutes and on the subject shifts. The annotated
corpus served as a reference for the objective evaluation of the text segmentation techniques. We
also evaluated the results obtained with the topic extraction models. These results were
analysed in relation to the context of the meeting minutes. The results indicate that the
techniques employed and the methodology presented give satisfactory answers. However, more data
may be necessary to conduct new experiments in order to improve the techniques used. The
system implementation and the tools used in this work, as well as the results obtained in
the evaluations and the annotated corpus, are the main contributions of this work and are
intended to support future work.
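
The idea of segmenting minutes by subject and then returning only the segments relevant to a query can be illustrated with a minimal sketch. It is not the system described in this work: it assumes a simple lexical-cohesion heuristic (a boundary is placed wherever two adjacent sentences share too little vocabulary) in place of the segmentation techniques evaluated, it ranks segments by plain cosine similarity over word counts, and it omits the topic extraction step entirely. The function names, the threshold value, and the sample sentences are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Return lowercased word counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Group sentences into segments (assumed heuristic, not the thesis method).

    A new segment starts whenever the similarity between two adjacent
    sentences falls below the (hypothetical) threshold.
    """
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    return [" ".join(group) for group in groups]


def search(query, segments, top_k=3):
    """Return up to top_k segments ranked by similarity to the query.

    Segments with zero similarity are omitted from the results.
    """
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(s)), s) for s in segments]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [s for score, s in ranked if score > 0][:top_k]


# Hypothetical minutes, already split into sentences.
sentences = [
    "The budget for the laboratory was approved.",
    "The laboratory budget covers new equipment.",
    "Two students defended their thesis proposals.",
    "The thesis proposals were accepted by the committee.",
]
segments = segment(sentences)
results = search("thesis committee", segments)
```

In a fuller pipeline, a topic model would be fitted on the segments so that results could also be grouped and browsed by topic; that step is left out here to keep the sketch dependency-free.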