Investigação de estratégias de seleção de conteúdo baseadas na UNL (Universal Networking Language) [Investigation of content selection strategies based on UNL (Universal Networking Language)]

Chaud, Matheus Rigobelo

Files

6636.pdf (2.99 MB)

Date

2015-03-03

Authors

Chaud, Matheus Rigobelo

Publisher

Universidade Federal de São Carlos

Abstract

The field of Natural Language Processing (NLP) has seen growing attention to Multilingual Multidocument Summarization (MMS), whose goal is to process a cluster of source documents written in more than one language and generate a summary of the collection in one of the target languages. In MMS, the selection of sentences from the source texts for summary generation may be based on either shallow or deep linguistic features. The purpose of this research was to investigate whether deep knowledge, obtained from a conceptual representation of the source texts, could be useful for content selection in texts of the newspaper genre. In this study, we used a formal representation system, the UNL (Universal Networking Language). To investigate content selection strategies based on this interlingua, three clusters of texts were represented in UNL, each consisting of one text in Portuguese, one text in English and one human-written reference summary. Additionally, in each cluster, the sentences of the source texts were aligned to the sentences of their respective human summaries in order to identify total or partial content overlap between them. The data collected allowed a comparison between content selection strategies based on conceptual information and a traditional selection method based on a superficial feature: the position of the sentence in the source text. According to the results, content selection based on sentence position was more closely correlated with the selection made by the human summarizer than the conceptual methods investigated were. Furthermore, the sentences at the beginning of the source texts, which in newspaper articles usually convey the most relevant information, did not necessarily contain the most frequent concepts in the text collection; on several occasions, the sentences with the most frequent concepts were in the middle or at the end of the text.
These results indicate that, at least in the clusters analyzed, other criteria besides concept frequency help determine the relevance of a sentence. In other words, content selection in human multidocument summarization may not be limited to the selection of the sentences with the most frequent concepts. In fact, it seems to be a much more complex process.
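
The contrast between the two families of strategies described in the abstract can be illustrated with a minimal sketch. It is not the method used in the dissertation: it assumes each sentence has already been reduced to a flat list of concept labels (a simplification of a UNL graph), the toy documents and labels are invented for illustration, and scoring a sentence by the mean collection frequency of its concepts is only one hypothetical way of using concept frequency.

```python
from collections import Counter

def concept_frequencies(documents):
    """Count how often each concept label occurs across the whole collection."""
    counts = Counter()
    for sentences in documents:
        for concepts in sentences:
            counts.update(concepts)
    return counts

def rank_by_concept_frequency(sentences, counts):
    """Order sentence indices by the mean collection frequency of their concepts.

    Ties are broken by original position (earlier sentence first).
    The mean is an assumed scoring choice, not taken from the source.
    """
    def score(concepts):
        if not concepts:
            return 0.0
        return sum(counts[c] for c in concepts) / len(concepts)
    return sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))

def rank_by_position(sentences):
    """Position baseline: sentences are ranked in their original order."""
    return list(range(len(sentences)))

# Hypothetical toy cluster: one Portuguese and one English text,
# each sentence given as a list of (language-independent) concept labels.
doc_pt = [["fire", "city"], ["fire", "victim", "rescue"], ["mayor", "statement"]]
doc_en = [["city", "mayor"], ["fire", "rescue", "victim"], ["weather"]]

counts = concept_frequencies([doc_pt, doc_en])
frequency_ranking = rank_by_concept_frequency(doc_en, counts)
position_ranking = rank_by_position(doc_en)
```

In this toy cluster the two rankings disagree for the English text: the frequency-based ranking puts its second sentence first, while the position baseline keeps the original order. That is the kind of divergence the abstract reports between conceptual and positional selection.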

Keywords

Applied linguistics, Automatic summarization, Content selection strategies, UNL (Universal Networking Language) interlingua, Automatic natural language processing, Knowledge representation systems, Multilingual multidocument summarization, Natural language processing, Universal networking language, Content selection

Citation

CHAUD, Matheus Rigobelo. Investigação de estratégias de seleção de conteúdo baseadas na UNL (Universal Networking Language). 2015. 171 f. Dissertação (Mestrado em Ciências Humanas) - Universidade Federal de São Carlos, São Carlos, 2015.

URI

https://repositorio.ufscar.br/handle/20.500.14289/5799

Collections

Theses and Dissertations
