Sumarização automática multidocumento multilíngue: seleção de conteúdo e tratamento da redundância com base em conhecimento léxico-conceitual

Camargo, Yasmin Vizeu

Files

Sumarização Automática Multidocumento Multilíngue seleção de conteúdo e tratamento da redundância com base em conhecimento léxico-conceitual.pdf (2.38 MB)

Anexo I Carta Orientador (a).pdf (178.09 KB)

Date

2020-03-19

Authors

Camargo, Yasmin Vizeu

Publisher

Universidade Federal de São Carlos

Abstract

Multilingual Multi-document Summarization consists of automatically producing, from a collection of texts on the same topic written in different languages, a summary in one of the source languages. The task therefore faces the problems of Multi-document Summarization, such as identifying relevant content and handling redundancy, as well as the multiplicity of source languages. For producing multilingual summaries in Portuguese, CFUL is the best-performing method. CFUL is extractive: it scores the source sentences in their original languages based on the simple frequency of their nominal concepts in the collection and selects the best-ranked Portuguese sentences for the summary, avoiding redundancy based on word overlap between them. This work proposes the extractive method CFULHiper. It also selects content based on the simple frequency of nominal concepts, but additionally assigns a differentiated score to superordinate concepts that stand in hierarchical relations with other concepts in the collection. The method assumes that superordinate concepts convey generic information, which is relevant for composing informative summaries. Moreover, CFULHiper avoids redundancy based on concept overlap, capturing sentence similarity at the conceptual rather than the lexical level. To develop CFULHiper, we selected the CM2News corpus, which consists of 20 bilingual (Portuguese and English) collections of news texts whose nouns were annotated with concepts from Princeton WordNet. The corpus was extended with 10 new collections, resulting in the second version, CM2News 2.0. The CM2News 2.0 corpus was then automatically pre-processed. For each collection, we performed: (ii) identification of the conceptual hierarchical relations across the source texts, and (iii) calculation of the simple and cumulative frequencies of the nominal concepts. To calculate the cumulative frequency of a hypernym x, the simple frequency of x is added to the simple frequencies of its hyponyms.
We then automatically applied CFULHiper to each collection of the corpus, producing 30 summaries in Portuguese with a 70% compression rate. We evaluated the linguistic quality (grammaticality, non-redundancy, referential clarity, focus, and structure/coherence) and the informativeness (ROUGE) of all summaries generated by CFULHiper. The informativeness of the CFULHiper extracts is slightly better, which indicates that more generic information is relevant for composing multilingual extracts. Conceptual overlap, however, had no impact on the treatment of redundancy. Because sentences selected exclusively from a single source text show little redundancy among themselves, and multi-document clusters tend to have few cases of synonymy and polysemy, applying a lexical or a conceptual overlap measure yields essentially the same results for similarity identification.
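
The cumulative frequency described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the toy concept labels, and the `hyponyms` mapping are hypothetical, and the abstract does not say whether only direct hyponyms are counted (this sketch assumes the mapping already lists the hyponyms to include). The Jaccard measure used for concept overlap is likewise an assumption, since the abstract does not name the overlap measure.

```python
from collections import Counter


def cumulative_frequencies(simple_freq, hyponyms):
    """Return, for each concept, its simple frequency plus the simple
    frequencies of its hyponyms (as listed in the `hyponyms` mapping)."""
    cumulative = {}
    for concept, freq in simple_freq.items():
        children = hyponyms.get(concept, ())
        cumulative[concept] = freq + sum(simple_freq.get(c, 0) for c in children)
    return cumulative


def concept_overlap(concepts_a, concepts_b):
    """Jaccard overlap between the concept sets of two sentences.
    The choice of Jaccard is an assumption made for this sketch."""
    a, b = set(concepts_a), set(concepts_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy collection: concept labels annotated on the nouns of the source texts.
annotated_nouns = ["vehicle", "car", "car", "truck", "city"]
simple = Counter(annotated_nouns)

# Hypothetical hierarchy: "vehicle" is a hypernym of "car" and "truck".
hyponyms = {"vehicle": ["car", "truck"]}

cumulative = cumulative_frequencies(simple, hyponyms)
print(cumulative["vehicle"])  # 1 (own) + 2 (car) + 1 (truck) = 4
```

In this toy example the hypernym "vehicle" ends up with a higher score than any of its hyponyms, which mirrors the assumption that superordinate concepts carry generic, summary-worthy information. Two sentences whose `concept_overlap` exceeds some threshold (a parameter the abstract does not specify) would be treated as redundant.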

Keywords

Multilingual multi-document summarization, Lexical-conceptual knowledge, Content selection, Redundancy, Hierarchical relation

Citation

CAMARGO, Yasmin Vizeu. Sumarização automática multidocumento multilíngue: seleção de conteúdo e tratamento da redundância com base em conhecimento léxico-conceitual. 2020. Dissertation (Master's in Linguistics) – Universidade Federal de São Carlos, São Carlos, 2020. Available at: https://repositorio.ufscar.br/handle/20.500.14289/12445.

URI

https://repositorio.ufscar.br/handle/20.500.14289/12445

Collections

Theses and Dissertations

Creative Commons License

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Brazil
