Indexing all the things: Building your search engine in Python

2:00pm - 2:25pm on Friday, October 6 in Madison

Joe Cabrera

Audience Level:: Intermediate
Slides:: https://docs.google.com/presentation/d/e/2PACX-1vTtLD6Hd4exZFSUM_-lpYTxli6qeo69TDFE69IyvS9OWIjT-7w4N-lWsLMVxd9vxgW9Cruogk4vvGq0/pub?start=false&loop=false&delayms=3000
Watch:: https://youtu.be/ZBbiFGCLbAA

Overview

Python is an excellent language for building a search engine. However, indexing data for use in a search engine is challenging when your database and your search index must stay in sync. Elasticsearch exists for creating the search engine initially, but you need a custom solution to keep the database and the index synced.

Description

Since the emergence of Elasticsearch, common information retrieval tasks such as indexing, scoring, and retrieving documents in a search engine have never been easier. However, unique challenges still exist when indexing large sets of data from databases. At Jopwell, we need to ensure that the data in our database is kept in constant sync with the data in our search index.

Initially, you need to take data from a traditional SQL database and flatten it for indexing in Elasticsearch. Since indexing this data can be a memory-intensive task, Celery is useful for ensuring you can index large sets of data in both a distributed and memory-conservative manner. Once all your documents are in your Elasticsearch index, you need to retrieve the data from your database that relates to a user’s search results.
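As a rough illustration of the flattening and batching steps described above, here is a minimal sketch in plain Python. The `candidate` and `skills` row shapes, the field names, and both helper functions are hypothetical and are not taken from the talk. In a real setup, each batch of ids would presumably be handed to a Celery task that loads the rows, flattens them, and writes them to Elasticsearch; those calls are left out so the example stays self-contained.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def flatten_candidate(row: Dict[str, Any], skills: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Collapse one parent row and its child rows into a single flat document.

    The row shapes here are assumptions made for illustration only.
    """
    return {
        "id": row["id"],
        "name": row["name"],
        # Child rows from a related table become a plain, sorted list of strings.
        "skills": sorted(s["name"] for s in skills if s["candidate_id"] == row["id"]),
    }


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield lists of at most `size` ids so no worker has to hold the full set."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Passing batches of ids to workers, rather than the rows themselves, keeps each task message small and lets every worker load only the rows it needs, which is one way to keep memory use bounded.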

In this talk, I’ll show the basics of creating a search engine in Python, keeping it synced with another data store, and how you can keep your index running smoothly.

Talk Outline

Introduction to the problem (2 min)

Building your document indexer (7 min)

Flattening database data into a search document
Using Celery to index documents efficiently

Scoring and search results retrieval (7 min)

Scoring algorithms
Retrieving matching results from the database

Strategies for syncing data between (7 min)

Traditional SQL database
Elasticsearch index