Building a Scalable & Isolated Architecture for Preprocessing Medical Records

Presented at Airflow Summit 2021

By Mikaela Pisani and Anthony Figueroa

After several experiments with Airflow, we arrived at the architecture that worked best for us for processing free-text medical records at scale. Our hybrid solution combines Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Running each component of the pipeline in a container on Kubernetes gives it a consistent, portable, and isolated environment. Apache Livy lets tasks run on a Spark cluster at scale. Apache cTAKES extracts information from the clinical free text of electronic medical records, using natural language processing techniques to identify codable entities, temporal events, properties, and relations.
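
As a rough illustration of how a pipeline task might hand work to Spark through Livy, the sketch below builds the JSON body for Livy's batch endpoint (POST /batches) and posts it with the Python standard library. The Livy URL, jar path, class name, and storage paths are hypothetical placeholders, not values from the talk, and in a real deployment the call would more likely be made from an Airflow task than from a standalone script.

```python
import json
import urllib.request

# Hypothetical endpoint for illustration only; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.internal:8998"


def build_batch_payload(jar_path, class_name, input_path, output_path):
    """Build the JSON body for Livy's POST /batches endpoint."""
    return {
        "file": jar_path,          # application jar visible to the cluster
        "className": class_name,   # main class of the Spark job
        "args": [input_path, output_path],
    }


def submit_batch(payload, livy_url=LIVY_URL):
    """POST the payload to Livy and return the parsed JSON response."""
    request = urllib.request.Request(
        url=f"{livy_url}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# All names below are assumptions made up for this sketch.
payload = build_batch_payload(
    jar_path="hdfs:///jobs/records-preprocessing.jar",
    class_name="com.example.PreprocessRecords",
    input_path="hdfs:///records/raw",
    output_path="hdfs:///records/annotated",
)
print(json.dumps(payload, indent=2))
```

Calling submit_batch(payload) against a reachable Livy server would start the batch. Keeping payload construction separate from the HTTP call makes the request easy to inspect without a running cluster.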

Mikaela Pisani

Head of Data Science at Rootstrap

Anthony Figueroa

CTO at Rootstrap