Accolade - Modular Data Pipeline

Accolade combines intelligent technologies with human expertise to help its clients’ employees make the most of their health plans. The company approached UC Santa Cruz with two interconnected engineering challenges: Master Data Management, which called for a machine learning model capable of classifying whether two healthcare records belong to the same person, and the Modular Data Pipeline, a tool for Accolade’s data scientists that would automate the laborious process of preparing data for machine learning.

The UCSC team was divided into subspecialties according to each member’s strengths: one was a web services expert, another had taken machine learning courses, and a third focused on automation. They met three or four times a week and were in constant communication with the sponsors. Agile project management, with its fast iterations, kept the team on target.

“The problem we’re tackling with Accolade is how to provide personalized health care advice as efficiently as possible,” engineering student Juan Andreas said. “Our high-level goal is to predict when someone will call based on the data that we have on them.”

Michael Distasio, Accolade Senior Data Scientist and project sponsor, said, “Accolade has people answering phone calls for clients. Our job is to take the data that we have and predict when a certain person will call again.” He added, “For example, if we have six months of history on someone, when will they call again?”

The Modular Data Pipeline (MDP) team built an Extract-Transform-Load (ETL) data pipeline that produces a featured data repository, along with an encounter prediction model that both validates the pipeline’s output and predicts future client encounters.

“What the ETL model does is take data that’s written in plain English,” said Christian Ortiz, one of the students on the Accolade project. “And it takes this data and transforms it into machine-readable code. Then what the load phase is, is when it takes the transformed data and puts it back into the database in an intelligible form, where it’s combined with other databases and fed into our model.”
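The three phases Ortiz describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's pipeline: the table names, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for Accolade's real data stores.

```python
import sqlite3
from datetime import date

# Hypothetical raw records: (member id, call date, free-text reason).
RAW_ROWS = [
    ("m1", "2019-01-05", "Billing question"),
    ("m1", "2019-02-04", "Find a doctor"),
    ("m2", "2019-01-20", "billing question "),
]


def extract(conn):
    """Extract: read the human-readable rows from the source table."""
    return conn.execute(
        "SELECT member_id, call_date, reason FROM raw_encounters "
        "ORDER BY member_id, call_date"
    ).fetchall()


def transform(rows):
    """Transform: convert text fields into numeric, model-ready features."""
    reason_codes = {}
    features = []
    for member_id, call_date, reason in rows:
        # Normalize the free text, then map each distinct reason to an integer.
        key = reason.strip().lower()
        code = reason_codes.setdefault(key, len(reason_codes))
        day_of_year = date.fromisoformat(call_date).timetuple().tm_yday
        features.append((member_id, day_of_year, code))
    return features


def load(conn, features):
    """Load: write the transformed rows back into a feature table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS encounter_features "
        "(member_id TEXT, day_of_year INTEGER, reason_code INTEGER)"
    )
    conn.executemany("INSERT INTO encounter_features VALUES (?, ?, ?)", features)
    conn.commit()


def run_pipeline():
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_encounters (member_id TEXT, call_date TEXT, reason TEXT)"
    )
    conn.executemany("INSERT INTO raw_encounters VALUES (?, ?, ?)", RAW_ROWS)
    load(conn, transform(extract(conn)))
    return conn.execute(
        "SELECT member_id, day_of_year, reason_code FROM encounter_features "
        "ORDER BY member_id, day_of_year"
    ).fetchall()
```

Calling `run_pipeline()` returns the rows of the feature table, in which the two differently formatted "billing question" entries share one numeric code.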

The pipeline uses scikit-learn to train the model, a Random Forest Regressor developed by Accolade’s Master Data Management team. According to Andreas, there was an enormous amount of data to process.
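For readers unfamiliar with the technique, the following is a minimal sketch of training a Random Forest Regressor with scikit-learn. Everything about the data is an assumption made for illustration: the two features (calls in the last six months, days since the last call), the synthetic target (days until the next call), and the hyperparameters are invented, and none of it reflects Accolade's actual data or model settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def train_encounter_model(n_samples=200, seed=0):
    """Fit a random forest on synthetic data and predict on a held-out split."""
    rng = np.random.default_rng(seed)

    # Hypothetical features, generated at random for the sake of the example.
    calls = rng.integers(0, 20, size=n_samples)
    days_since_last_call = rng.integers(1, 180, size=n_samples)
    X = np.column_stack([calls, days_since_last_call]).astype(float)

    # Synthetic target: frequent callers call back sooner, plus some noise.
    y = 90.0 / (1.0 + calls) + rng.normal(0.0, 2.0, size=n_samples)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return model, predictions, y_test
```

The held-out split is what lets a model like this double as a check on the pipeline's output: if the features are malformed, the predictions on unseen rows degrade visibly.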

“Healthcare means handling highly confidential data, so we had to do tons of de-identification for privacy and security,” said Andreas. “Another important part of the project was learning how to use all kinds of new technology, which of course was really fun because we got to learn things like Spark, which is a distributed computing language lots of companies are using.”
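The article does not say how the team de-identified its data. One common building block, shown here purely as an assumed example, is to drop direct identifiers and replace the record key with a keyed hash so that rows can still be joined without exposing the original ID. The field names and the secret key below are hypothetical, and real healthcare de-identification involves considerably more than this step.

```python
import hashlib
import hmac

# Hypothetical values for illustration; a real key would be stored securely.
SECRET_KEY = b"replace-with-a-secret-key"
DIRECT_IDENTIFIERS = {"name", "phone"}


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, key=SECRET_KEY):
    """Drop direct identifiers and pseudonymize the member ID."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["member_id"] = pseudonymize(str(record["member_id"]), key)
    return cleaned
```

Because the hash is deterministic for a given key, the same member maps to the same pseudonym across tables, which keeps joins possible after the names and phone numbers are gone.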

Other valuable experiences during the senior project included working with version control and getting GitHub to cooperate with previously implemented code, programming dynamic functions (such as people’s ages) into the databases, and working around a technical issue that forced the team to build the pipeline locally. In the end, it all came together.

“It was so great being able to work on such a huge project and seeing our code work,” said Andreas.  

“We’re incredibly proud of the work they did for Accolade,” said sponsor Michael Distasio. “We are about to go back to Seattle and turn their work into a tool that will be used by our teams.”