PySHED: a Python framework for Streaming Heterogeneous Event Data

4:55pm - 5:20pm on Friday, October 6 in PennTop North

Christopher J. Wright

Audience Level:
Intermediate
Watch:
https://youtu.be/XqSvpVqY3_8

Overview

We present a streaming data processing library tailored to the heterogeneous data produced by scientific experiments. The library emphasizes data provenance, retrieval, and pipeline flexibility. We will discuss its application to materials experiments at an x-ray synchrotron.

Description

Data is naturally heterogeneous: measurements and their metadata form a highly interrelated web. Financial data is one example, where the goal is to correlate stock prices with contextual metadata such as news stories. This class of data is difficult to handle in a traditional pipeline, because each data type needs its own bespoke treatment. Our new library, PySHED, aims to tackle these issues by defining a streaming data processing protocol for heterogeneous data. The simple and flexible protocol lets developers handle each data type properly while retaining the full pipelining power of combining, processing, and splitting streams of data. Furthermore, our approach automatically stores provenance information, enabling traceback, reanalysis of data, and data introspection. We will discuss the application of this framework to live x-ray experiment data analysis. Finally, we will discuss future integrations with parallel processing and feedback between data collection and analysis.
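
To make the idea of a heterogeneous stream with provenance concrete, here is a minimal sketch in plain Python. It is not PySHED's API: the `Event` class, the `map_data` function, the event kinds ("start", "data", "stop"), and the sample values are all hypothetical names chosen for illustration. It assumes a stream where metadata events and data events travel together, and where each processing step tags its outputs with the events they were derived from.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    """One item in a heterogeneous stream (hypothetical structure)."""
    kind: str                       # "start" (metadata), "data", or "stop"
    payload: Any                    # metadata dict or a measured value
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    parents: Tuple[str, ...] = ()   # uids of the events this one derives from
    step: str = ""                  # name of the function that produced it


def map_data(func: Callable[[Any], Any], stream: Iterable[Event]) -> Iterator[Event]:
    """Apply func to data events only; pass metadata events through.

    Every emitted event records its parent uid and the step name, so a
    result can later be traced back to the raw event it came from.
    """
    for ev in stream:
        if ev.kind == "data":
            new_payload = func(ev.payload)
        else:
            new_payload = ev.payload  # metadata is forwarded untouched
        yield Event(ev.kind, new_payload, parents=(ev.uid,), step=func.__name__)


def double(x):
    return 2 * x


# Illustrative stream: one metadata header, two measurements, one footer.
raw = [
    Event("start", {"sample": "example"}),
    Event("data", 1),
    Event("data", 2),
    Event("stop", {}),
]
processed = list(map_data(double, raw))
```

In this sketch, treating each event kind separately inside one operator is what allows metadata and measurements to share a single pipeline, and the `parents` field is the hook that makes traceback and reanalysis possible.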
