Feature Extraction and Quantitative Analysis of Large Scientific Document Corpora (2015-2016)

Corpora of scientific and technical documents hold a wealth of information. For example, arXiv, created in 1991 by Paul Ginsparg and now located at Cornell University, has seen submissions grow to nearly ten thousand per month and has expanded to cover more subjects. Another example is PubMed, the search site for biomedical literature at the National Institutes of Health, which comprises more than 24 million citation records.

While a journal article, a technical report or a thesis typically focuses on a single subject of interest to the author or authors, it commonly draws on multiple or interdisciplinary forms of knowledge, technologies and techniques. It may make new connections, or advance research and development in more than one field, beyond what its authors originally intended. Although such documents are publicly accessible, the means to navigate and utilize a vast document corpus remain limited to traditional query searches and category browsing mechanisms.

This project created corpus maps, at multiple resolution scales in themes and time periods, to help expert and new researchers navigate the literature: to summarize it, discover new connections or new frontiers, identify existing gaps and make decisions. Team members investigated several fundamental, integral and challenging components of document corpus analysis.


Fall 2015 – Spring 2016

Team Leaders

  • Nikos Pitsianis, Arts & Sciences-Computer Science
  • Xiaobai Sun, Arts & Sciences-Computer Science

Undergraduate Team Members

  • Devavrat Dabke, Mathematics (BS), Computer Science (BS2)