Developed at the University of Lisbon, Dept. of Informatics, by the NLX-Natural Language and Speech Group.
SIMBA is a multi-document summarization system for the Portuguese language freely available as an online service.
It was developed and is maintained by the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
SIMBA produces extractive summaries for collections of documents written in Portuguese. It starts by annotating the text. Then it runs two clustering procedures in sequence: clustering by similarity and clustering by keywords. This double-clustering procedure aims first to remove redundancy within the collection of documents and then to select the relevant content found in the input documents, so that the summary contains the most significant topics covered.
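The two-step idea can be illustrated with a minimal Python sketch. It is not SIMBA's implementation: the Jaccard word-overlap measure, the 0.5 threshold, the greedy clustering strategy, the externally supplied keyword list and all function names are assumptions made for illustration only.

```python
import re


def tokens(sentence):
    """Lowercase word tokens; \\w also matches accented Portuguese letters."""
    return re.findall(r"\w+", sentence.lower())


def jaccard(a, b):
    """Jaccard overlap between two token lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_by_similarity(sentences, threshold=0.5):
    """Greedy single pass: a sentence joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for sentence in sentences:
        for cluster in clusters:
            if jaccard(tokens(sentence), tokens(cluster[0])) >= threshold:
                cluster.append(sentence)
                break
        else:
            clusters.append([sentence])
    return clusters


def cluster_by_keywords(sentences, keywords):
    """Group sentences under the first keyword they contain.
    Sentences that contain no keyword are dropped."""
    groups = {keyword: [] for keyword in keywords}
    for sentence in sentences:
        words = set(tokens(sentence))
        for keyword in keywords:
            if keyword in words:
                groups[keyword].append(sentence)
                break
    return groups


def double_clustering(sentences, keywords, threshold=0.5):
    # Step 1: remove redundancy by keeping one sentence per similarity cluster.
    representatives = [c[0] for c in cluster_by_similarity(sentences, threshold)]
    # Step 2: keep only the content tied to one of the keywords.
    return cluster_by_keywords(representatives, keywords)


sentences = [
    "O governo aprovou o orçamento.",
    "O governo aprovou o orçamento hoje.",
    "A equipa venceu o jogo.",
    "Choveu muito.",
]
result = double_clustering(sentences, ["governo", "jogo"])
```

In this toy run the second sentence is treated as redundant with the first, and the last sentence is discarded because it matches no keyword.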
The summary length is defined relative to the size of the input documents by means of a compression rate (for instance, 0.15 means that the size of the summary will be 15% of the size of the original input documents).
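As a worked example of the compression rate, the following sketch computes a summary budget. Measuring size in whitespace-separated words is an assumption (the unit SIMBA uses is not stated here), and the function name `summary_budget` is hypothetical.

```python
def summary_budget(documents, compression_rate):
    """Return the target summary size, in words, for a document collection.

    Assumption: size is counted in whitespace-separated words.
    """
    if not 0 < compression_rate <= 1:
        raise ValueError("compression_rate must be in the interval (0, 1]")
    total_words = sum(len(document.split()) for document in documents)
    # Round to the nearest whole word.
    return round(total_words * compression_rate)


# Two documents of 100 words each and a rate of 0.15 give a 30-word budget.
budget = summary_budget(["palavra " * 100, "texto " * 100], 0.15)
```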
Two types of summaries can be created: "with post-processing" or "without post-processing". The post-processing procedure is an independent module of SIMBA that aims to improve the textual quality of the final summary delivered to the end user. It performs three operations on the candidate summary: sentence reduction, paragraph creation and discourse connective insertion. Sentence reduction trims sentences down to their main content. Paragraph creation groups sentences by topic into paragraphs, making the text easier to read. Discourse connective insertion aims to strengthen the connections between sentences by inserting expressions that make the text easier to understand. All in all, the post-processing module seeks to improve the readability, cohesion and fluency of the text, and thus its textual quality.
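Of the three operations, paragraph creation is the simplest to illustrate. The sketch below is a hypothetical stand-in, not the SIMBA module: it assumes each sentence already carries a topic label (for example, one produced by an earlier clustering step), and the name `create_paragraphs` is invented for this example.

```python
def create_paragraphs(labelled_sentences):
    """Group (topic, sentence) pairs into one paragraph per topic.

    Topics keep the order in which they first appear, and sentences keep
    their original order within each topic.
    """
    paragraphs = {}
    for topic, sentence in labelled_sentences:
        paragraphs.setdefault(topic, []).append(sentence)
    # Sentences are joined with spaces, paragraphs with a blank line.
    return "\n\n".join(" ".join(group) for group in paragraphs.values())


text = create_paragraphs([
    ("economia", "O orçamento foi aprovado."),
    ("desporto", "A equipa venceu o jogo."),
    ("economia", "Os impostos não sobem."),
])
```

Here the two sentences labelled "economia" end up in the first paragraph and the "desporto" sentence in the second.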
SIMBA makes use of several tools developed in the NLX-Natural Language and Speech Group, namely LX-Suite and LX-Parser.
SIMBA was developed by Sara Botelho Silveira in her Ph.D. research, supervised by António Branco, at the NLX-Natural Language and Speech Group at the University of Lisbon, Department of Informatics.
The work leading to SIMBA was supported by FCT — Fundação para a Ciência e Tecnologia — under the research grant SFRH/BD/45133/2008.
Sara Botelho Silveira and António Branco. Extracting multi-document summaries with a double clustering approach. In Proceedings of the 17th International Conference on Applications of Natural Language Processing to Information Systems (NLDB 2012), pages 70–81, Groningen, The Netherlands, June 2012. Springer Berlin/Heidelberg. [ pdf ]
Sara Botelho Silveira and António Branco. Enhancing multi-document summaries with sentence simplification. In Proceedings of the International Conference on Artificial Intelligence (ICAI 2012), pages 742–748, Las Vegas, USA, July 2012. [ pdf ]
Sara Botelho Silveira and António Branco. Combining a double clustering approach with sentence simplification to produce highly informative multi-document summaries. In Proceedings of the 14th International Conference on Artificial Intelligence (IRI 2012), pages 482–489, Las Vegas, USA, August 2012. [ pdf ]
Sara Botelho Silveira and António Branco. Using a double clustering approach to build extractive multi-document summaries. In Proceedings of the 15th International Conference on Text, Speech and Dialogue (TSD 2012), pages 298–305, Brno, Czech Republic, September 2012. Springer Berlin/Heidelberg. [ pdf ]
Sara Botelho Silveira and António Branco. Compressing multi-document summaries through sentence simplification. In ICAART 2013: 5th International Conference on Agents and Artificial Intelligence, Barcelona, Spain, February 2013. [ pdf ]
Sara Botelho Silveira and António Branco. Sentence reduction algorithms to improve multi-document summarization. In Lecture Notes – Communications in Computer and Information Science. Springer-Verlag, 2014. [ pdf ]
Sara Botelho Silveira and António Branco. Uncovering discourse relations to insert connectives between the sentences of an automatic summary. In PolTAL 2014: 9th International Conference on Natural Language Processing. Springer LNCS/LNAI, 2014. [ pdf ]
When mentioning SIMBA, this is the canonical reference to be used:
Sara Botelho Silveira. Enhancing Extractive Summarization with Automatic Post-processing. Ph.D. thesis, Universidade de Lisboa, Lisbon, 2015. [ pdf ]
Contact us using the following email address: 'nlxgroup' concatenated with 'at' concatenated with 'di.fc.ul.pt'.
SIMBA stands for Summarization Improved By Automatic Post-processing, which is the main hypothesis this system aims to validate.