‘Big data’ is a big tool for the medical community
Two decades ago, Dr. David Eddy approached Kaiser Permanente — the massive managed healthcare organization headquartered in Oakland, Calif. — with a big idea: the creation of a large-scale simulation model that would take into account a plethora of factors to predict medical outcomes for patients. This data would then guide physicians in determining the proper course of action, as well as help patients better understand the health consequences of their own decisions and actions.
The result was the Archimedes Model, which today is a shining example of the concept of "big data" — the analytical processing of enormous volumes of information using large clusters of computers. To accomplish this, Archimedes leverages its proprietary platforms aided by the Hadoop open-source software platform and by Grid Engine, a distributed resource management system developed by Univa, a data-center-automation solutions provider based in Hoffman Estates, Ill.
"Our simulator takes data from the real world and everything we know about how diseases behave to create data sets that describe what the future for the population is going to be," said Katrina Montinola, vice president of engineering for Kaiser Pemanente's Archimedes unit, which is based in San Francisco.
For example, a doctor could predict for a smoker, based on decades of information, what his likely health prognosis would be, depending on certain behaviors, treatments and socio-economic and hereditary indicators.
However, the raw data sets emerging from the simulator are far too large to be of any practical use, according to Montinola. That is why Archimedes developed IndiGO (Individualized Guidelines and Outcomes), a platform that uses Hadoop to aggregate the data in ways that will be beneficial to those making healthcare decisions.
"Now the data is in a form that lets you spot trends," Montinola said. "For example, you might see that one population has filled a lot of prescriptions for a certain drug, and they seem to have suffered a lot less heart attacks compared with another population that didn't have access to that drug. Based on that, you might decide that the underserved population is prescribed that medication that seems to have worked.
"That's the whole point of data analytics — to try to spot these trends that you wouldn't be able to know without large data sets that allow you to see the connection between disease and the desired outcome."
Clearly, the ability to crunch mind-numbing amounts of data is beneficial. But just as beneficial is the speed at which the data is crunched, according to Montinola.
"Computers have gotten faster, faster and faster over the last 30-40 years, to the point where now you can look at large, large data sets really quickly, she said. "The ability to make a useful decision depends on being able to crunch that data in a useful timeframe."
That's where the Grid Engine platform comes into play, according to Univa CEO Gary Tyreman.
"It allows you to take a number of servers and share them across different groups in order to put as much work through as you can," Tyreman said. "One of the applications that has become more and more popular over the past year is big data. Our product lets you integrate your big-data applications — Hadoop for example — into that compute environment."
In Archimedes' case, Grid Engine integrates its simulation and IndiGO platforms to create efficiencies that otherwise wouldn't be possible, which shaves a great deal of time from the process.
"We're the infrastructure," Tyreman said. "If you think of this as two sets of servers, with the first one being the simulator and the second is this model that has Hadoop beside it, our software pulls these two environments together. So, instead of two complementary systems, they have one. Without Grid Engine, they would need to double the number of servers, because they'd have two environments."
In other words, the simulation, aggregation and modeling functions previously operated as silos, each running independently on its own servers. With Grid Engine, all of these functions can run simultaneously on a shared set of servers, which greatly conserves server capacity.
"So, you can have multiple physicians asking multiple questions, because — by eliminating the serialized [data processing] — we are able to drive utilization up," Tyreman said.
This efficiency has a profound impact on the amount of time needed to crunch data, which is particularly advantageous given the massive amounts that need to be processed in a big-data environment, Tyreman said.
"They no longer have to move data between two applications and two environments," he said. "Let's say that it takes a couple of hours to compute. If I can give you double the capacity, it's now one hour. Time is very important in the medical community."