Tracing the flow of knowledge using PySpark

1:30pm - 1:55pm on Friday, October 6 in PennTop North

Jessica Cox, Corey Harper

Audience Level: Intermediate
Slides: https://data.mendeley.com/datasets/8kyckg3dh5/1
Watch: https://youtu.be/KsjD9EXQe9M

Overview

Scientists are eager for feedback on their work. What better place to look than the sentences that cite their discoveries? Join us for a tutorial in PySpark, where we explore NLP techniques on a CC-BY corpus of scientific journal articles to understand why and how the literature is being cited.

Description

This talk focuses on Natural Language Processing (NLP) in a Python-based Spark environment using PySpark. Examples will be drawn from a Citing Sentences project underway within Elsevier Labs (http://labs.elsevier.com/). The goal of this project is to build and analyze citation networks to understand the diffusion and flow of ideas through the scientific research landscape. Much as in a social network, scientists want to understand how others are ‘talking’ about their papers. Are they supporting their work? Disagreeing with it? Is it being referred to as a discovery?

The development of our input datasets is out of scope for this talk, partly because the framework for citing sentence extraction is built in Scala on Spark rather than in PySpark. However, our citing sentence dataframe formats will be described and documented, and sample data will be provided so that others can explore and reproduce our analyses.

The presentation will cover:

Reformatting, manipulating, and combining dataframes to meet specific analysis needs
Preparing data for use with NLP tools and techniques
Using PySpark, SparkSQL, SparkML and other Spark libraries within Python code to perform NLP
Moving Spark DataFrames in and out of Pandas for additional analysis and visualization
Performing additional natural language analysis in NLTK within the PySpark environment
Generating export formats suitable for other tools, such as for visualization with Gephi
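
As a rough illustration of the last item, the sketch below counts citing sentences per (citing, cited) pair and writes an edge table with Source, Target, and Weight columns, a layout that Gephi's spreadsheet importer accepts. It uses plain Python rather than Spark so that it runs anywhere. The record layout, the paper identifiers, the sentences, and the helper names build_edges and write_gephi_edges are all assumptions made for the example, not the project's actual schema or code.

```python
import csv
import io
from collections import Counter

# Hypothetical citing-sentence records: (citing_id, cited_id, sentence).
# Identifiers and sentences are invented for illustration only.
records = [
    ("paperA", "paperB", "Our results support the findings of [1]."),
    ("paperA", "paperB", "As first shown in [1], the effect is robust."),
    ("paperC", "paperB", "This contradicts the model proposed in [2]."),
    ("paperC", "paperA", "We extend the method of [3]."),
]


def build_edges(rows):
    """Count citing sentences for each (citing, cited) pair."""
    return Counter((citing, cited) for citing, cited, _ in rows)


def write_gephi_edges(edges, handle):
    """Write a Source,Target,Weight edge table to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    # Sort so the output order is stable from run to run.
    for (citing, cited), weight in sorted(edges.items()):
        writer.writerow([citing, cited, weight])


edges = build_edges(records)
buffer = io.StringIO()
write_gephi_edges(edges, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

In a PySpark setting the same aggregation would more likely be a groupBy on the citing and cited columns followed by count, with the small result converted to Pandas before writing the CSV; the column names in that case would again depend on the actual dataframe format.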

The following materials will be provided so that audience members can return to the topic and continue learning after the event:

A “Community Edition” Databricks-compatible notebook with SparkML, SparkSQL, PySpark, and NLTK code
A sample data file of citing sentences from Elsevier’s CC-BY-licensed articles