Indexing all the things: Building your search engine in Python
2:00pm - 2:25pm on Friday, October 6 in Madison
Joe Cabrera
- Audience Level: Intermediate
- Slides: https://docs.google.com/presentation/d/e/2PACX-1vTtLD6Hd4exZFSUM_-lpYTxli6qeo69TDFE69IyvS9OWIjT-7w4N-lWsLMVxd9vxgW9Cruogk4vvGq0/pub?start=false&loop=false&delayms=3000
- Watch: https://youtu.be/ZBbiFGCLbAA
Overview
Python is an excellent language for building a search engine. However, indexing data for use in a search engine is challenging when your database and your search index must stay in sync. Tools such as Elasticsearch exist for creating the search engine initially, but you need a custom solution to keep the two synced.
Description
Since the emergence of Elasticsearch, common Information Retrieval tasks such as indexing, scoring and retrieving documents in a search engine have never been easier. However, unique challenges still exist for indexing large sets of data from databases. At Jopwell, we need to ensure that data in our database is kept in constant sync with data in our search index.
Initially, you need to take data from a traditional SQL database and flatten it for indexing in Elasticsearch. Since indexing this data can be a memory-intensive task, Celery is useful for ensuring you can index large sets of data in both a distributed and memory-conservative manner. Once all your documents are in your Elasticsearch index, you need to retrieve data from your database related to a user’s search results.
In this talk, I’ll show the basics of creating a search engine in Python, how to keep it synced with another data store, and how you can keep your index running smoothly.
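As a rough illustration of the flattening and batched indexing steps described above, here is a minimal sketch using only the standard library. The `candidates` and `skills` tables, the field names, and the helper functions are all hypothetical and are not taken from the talk. The Celery and Elasticsearch wiring is deliberately left out: in a real setup, each batch could be handed to a Celery task that writes it to the index.

```python
import sqlite3
from itertools import islice


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE skills (candidate_id INTEGER, skill TEXT);
        INSERT INTO candidates VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO skills VALUES (1, 'python'), (1, 'sql'), (2, 'cobol');
    """)
    return conn


def flatten_candidates(conn):
    """Yield one flat search document per candidate, joining in related skill rows."""
    query = """
        SELECT c.id, c.name, GROUP_CONCAT(s.skill, ',') AS skills
        FROM candidates AS c
        LEFT JOIN skills AS s ON s.candidate_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
    """
    # Iterating the cursor streams rows instead of loading the whole table.
    for cand_id, name, skills in conn.execute(query):
        yield {
            "id": cand_id,
            "name": name,
            # GROUP_CONCAT order is not guaranteed, so sort for stable documents.
            "skills": sorted(skills.split(",")) if skills else [],
        }


def batched(iterable, size):
    """Yield lists of at most `size` items so only one batch is in memory at a time."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    conn = build_demo_db()
    for batch in batched(flatten_candidates(conn), 2):
        # Assumption: this is where a batch would be dispatched to a worker.
        print(len(batch), [doc["name"] for doc in batch])
```

Keeping the flattening step as a generator means the batch size, not the table size, bounds memory use.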
Talk Outline
Introduction to the problem (2 min)
Building your document indexer (7 min)
- Flattening database data into a search document
- Using Celery to index documents efficiently
Scoring and search results retrieval (7 min)
- Scoring algorithms
- Retrieving matching results from the database
Strategies for syncing data from (7 min)
- Traditional SQL database
- Elasticsearch index
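The outline item on retrieving matching results from the database can be sketched as follows. This is one possible approach, not necessarily the one presented in the talk: it assumes the search engine returns a ranked list of document ids, and it uses a hypothetical `candidates` table. Rows are fetched in a single query and then reordered to match the search ranking, and ids that exist in the index but no longer in the database are skipped.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE candidates (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO candidates VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    """)
    return conn


def fetch_in_rank_order(conn, ranked_ids):
    """Return (id, name) rows for ranked_ids, in the order the search engine ranked them."""
    if not ranked_ids:
        return []
    # One parameter placeholder per id; the values are bound, never interpolated.
    placeholders = ",".join("?" for _ in ranked_ids)
    rows = conn.execute(
        f"SELECT id, name FROM candidates WHERE id IN ({placeholders})",
        list(ranked_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL returns rows in arbitrary order, so re-apply the search ranking here.
    # Ids missing from the database (stale index entries) are dropped.
    return [by_id[i] for i in ranked_ids if i in by_id]


if __name__ == "__main__":
    conn = make_demo_db()
    # Assumption: [3, 1, 99] stands in for ids returned by a search query.
    print(fetch_in_rank_order(conn, [3, 1, 99]))
```

Dropping missing ids silently is a design choice for the sketch; a production system might also log them as a signal that the index and database have drifted apart.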