Tracing the flow of knowledge using PySpark

1:30pm - 1:55pm on Friday, October 6 in PennTop North

Jessica Cox, Corey Harper

Audience Level:


Scientists are eager for feedback on their work. What better place to look than the sentences that cite their discoveries? Join us for a PySpark tutorial in which we apply NLP techniques to a CC-BY corpus of scientific journal articles to understand why and how the literature is being cited.


This talk focuses on doing Natural Language Processing (NLP) in a Python-based Spark environment using PySpark. Examples will be drawn from a Citing Sentences project underway within Elsevier Labs. The goal of this project is to build and analyze citation networks to understand the diffusion and flow of ideas through the scientific research landscape. Much as in a social network, scientists want to understand how others are ‘talking’ about their papers. Are others supporting the work? Disagreeing with it? Referring to it as a discovery?

The development of our input datasets is out of scope for this talk, partly because the framework for citing sentence extraction is built in Spark Scala rather than PySpark. However, our citing sentence dataframe formats will be described and documented, and sample data will be provided so that others can explore and reproduce our analyses.
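To give a flavor of the kind of analysis involved, here is a minimal sketch of labeling citing sentences in PySpark. It is not the project's actual method: the column names (`citing_id`, `cited_id`, `sentence`), the cue phrases, the sample rows, and both function names are hypothetical, and a simple regular-expression match stands in for whatever NLP techniques the project really uses.

```python
import re

# Hypothetical cue phrases for illustration only; a real project would
# likely rely on richer NLP features or a trained classifier.
CUE_PATTERNS = {
    "supporting": re.compile(
        r"\b(consistent with|in agreement with|confirm(s|ed)?|support(s|ed)?)\b", re.I),
    "disagreeing": re.compile(
        r"\b(in contrast to|contradict(s|ed)?|inconsistent with)\b", re.I),
    "discovery": re.compile(
        r"\b(first (reported|described|demonstrated)|discover(y|ed))\b", re.I),
}

# Invented sample rows: (citing paper, cited paper, citing sentence).
SAMPLE_ROWS = [
    ("paperA", "paperX", "Our results are consistent with earlier findings [3]."),
    ("paperB", "paperX", "In contrast to [3], we observed no such effect."),
    ("paperC", "paperY", "This mechanism was first described in [7]."),
    ("paperD", "paperY", "We used the dataset from [7]."),
]


def label_citing_sentence(text):
    """Return the first label whose cue phrase matches, else 'neutral'."""
    for label, pattern in CUE_PATTERNS.items():
        if pattern.search(text or ""):
            return label
    return "neutral"


def label_with_spark(rows):
    """Label each citing sentence and count labels per cited paper."""
    # Imported here so the plain-Python helper above works without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("citing-sentences-sketch")
             .getOrCreate())
    df = spark.createDataFrame(rows, ["citing_id", "cited_id", "sentence"])

    # Wrap the plain-Python function as a Spark UDF returning a string.
    label_udf = udf(label_citing_sentence, StringType())
    labeled = df.withColumn("label", label_udf("sentence"))

    counts = labeled.groupBy("cited_id", "label").count()
    return sorted((r["cited_id"], r["label"], r["count"]) for r in counts.collect())
```

Calling `label_with_spark(SAMPLE_ROWS)` on a machine with PySpark installed returns one `(cited_id, label, count)` tuple per combination. The Python UDF keeps the sketch short; at scale, built-in column functions such as `rlike` would usually be faster because they avoid moving each row between the JVM and Python.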

The presentation will cover:

The following code will be provided so that audience members can return to the topic and continue learning after the event:
