Building a Scalable & Isolated Architecture for Preprocessing Medical Records

Presented at Airflow Summit 2021

After performing several experiments with Airflow, we reached the best architectural design for processing text medical records in scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes’ containers has the benefit of having a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks in a Spark Cluster at scale. Additionally, Apache cTAKES helps with the extraction of information from electronic medical records clinical free-text by using natural language processing techniques to identify codable entities, temporal events, properties, and relations.