Mining and Enriching Multilingual Scientific Text Collections: Challenges and Opportunities

Place: Large Lecture Room

Abstract:

Scientists worldwide are confronted with exponential growth in the number of scientific documents being made available. In this scenario of scientific information overload, natural language processing has a key role to play.

Over the past few years, a number of tools and methods have been developed for analyzing the structure of scientific documents (e.g. transforming PDF to XML), extracting keywords, or classifying sentences into argumentative categories. However, deep analysis of scientific documents, such as finding key claims, assessing the argumentative quality and strength of the research, or summarizing the key contributions of a piece of work, is less common. Besides, most research in scientific text processing is carried out for the English language, neglecting both the share of scientific information available in other languages and the fact that scientific publications are often bilingual.

In this talk, I will present work carried out in our laboratory towards the development of a system for “deep” analysis and annotation of scientific text collections. Originally developed for English, the system has now been adapted to Spanish. After a brief overview of the system and its main components, I will present our recent work on a bilingual (Spanish and English), fully annotated text resource in the field of natural language processing, which we created with our system, together with a faceted-search and visualization system for exploring the resource.

The talk will be preceded by an overview of the research activities and projects of the Natural Language Processing Group (TALN) at Universitat Pompeu Fabra.