Sumarização automática multidocumento multilíngue: seleção de conteúdo e tratamento da redundância com base em conhecimento léxico-conceitual
Abstract
Multilingual Multi-document Summarization consists in automatically producing, from a collection of texts on the same topic and in different languages, a summary in one of the source languages. Thus, this task deals with the problems of Multi-document Summarization, such as the identification of relevant content and the treatment of redundancy, and with the multiplicity of source languages. For the production of multilingual summaries in Portuguese, CFUL is the method with the best performance. CFUL is extractive and thus it punctuates the source sentences in their original languages based on the simple frequency of their nominal concepts in the collection and it selects the best-ranked ones in Portuguese for the summary, avoiding redundancy based on word overlapping between them. In this work, the CFULHiper extractive method is proposed. It also selects content based on the simple frequency of the nominal concepts, but it additionally takes into account a differentiated score for the superordinate concepts that are in hierarchical relations with others in the collection. The method assumes that superordinate concepts convey generic information, which is relevant to compose informative summaries. Moreover, CFULHiper avoids redundancy based on concept overlapping, capturing sentence similarity in a more intelligent manner. To develop CFULHiper, we have selected the CM2News corpus, which consists of 20 bilingual collections (Portuguese and English) of news, whose nouns of the source texts were annotated with concepts from WordNet of Princeton. The corpus was extended with the inclusion of 10 new collections, resulting at the second version of CM2News. The CM2News 2.0 corpus was submitted to an automatic pre-processing. For each collection, we have performed: (ii) identification of the conceptual hierarquical relations across the source-texts, and (iii) calculation of the simple and cumulative frequencies of the nominal concepts. To calculate the accumulated frequency of a hyperonym x, the simple frequency of x is added to the simple frequency of its hyponyms. Then, we automatically applied CFULHiper to each collection of the corpus, producing 30 summaries in Portuguese with 70% compression. We have evaluated the linguistic quality (gramaticality, non-redundancy, referential clarity, focus and estructure/coherence) and the informativeness (ROUGE) of all summaires generated by CFULHiper. The informativeness of the CFULHiper extracts is slightly better, which indicates that more generic information is relevant for composing multilingual extracts. The conceptual overlap, however, had no impact on the treatment of redundancy. Since sentences selected exclusively from a single source-text no longer have much redundancy between themselves, and multi-document clusters tend to have few cases of synonymy and polysemy, the application of a lexical or conceptual overlap measure basically generates the same results for similarity identification.
Collections
The following license files are associated with this item: