Monday, October 24, 2011

Saving lives through data mining

At the Open Science Summit (@OpenScienceSum) this weekend, a major theme was applying the lessons of open source software (and other means of social production) to human health and drug discovery. Within this, my interest was (as in OSS a decade ago) in how new firms were organizing to take advantage of these opportunities.

Several of the sessions focused on creating (or simulating) a data commons in which the collective findings of dozens (or thousands) of researchers could be aggregated to allow researchers to spot patterns that would be otherwise unobservable in a single trial.
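The intuition behind such a commons can be sketched with a toy simulation (everything here is invented for illustration): a treatment effect too small to stand out in any one small study becomes visible once the patients from several studies are pooled.

```python
import random
import statistics

random.seed(42)

# Hypothetical setup: five small studies of the same treatment.
# The true effect (0.3) is noisy at n=20 per arm, but clearer
# once all 100 patients per arm are pooled in a shared commons.
TRUE_EFFECT = 0.3

def run_study(n=20):
    """One small study: treated outcomes shifted by TRUE_EFFECT."""
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(TRUE_EFFECT, 1) for _ in range(n)]
    return control, treated

def effect(control, treated):
    """Estimated treatment effect: difference in group means."""
    return statistics.mean(treated) - statistics.mean(control)

studies = [run_study() for _ in range(5)]
per_study = [effect(c, t) for c, t in studies]

# Pool every patient record into one aggregate and estimate once.
all_control = [x for c, _ in studies for x in c]
all_treated = [x for _, t in studies for x in t]
pooled = effect(all_control, all_treated)

print("per-study estimates:", [round(e, 2) for e in per_study])
print("pooled estimate:    ", round(pooled, 2))
```

The individual estimates scatter widely around the true value; the pooled estimate is far more stable, which is the whole argument for aggregation.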

One example of that was Marty Tenenbaum, founder of Cancer Commons, which as its name suggests is seeking to be the clearinghouse for cancer data. Some of this is published research, but the goal is to gather and share its own data, as the mission statement articulates:
Cancer Commons is a new patient-centric paradigm for translational medicine, in which every patient receives personalized therapy based upon the best available science, and researchers continuously test and refine their models of cancer biology and therapeutics based on the resulting clinical responses.
Representatives of two for-profit companies talked about their efforts to aggregate the terabytes (or petabytes) of medical data to allow researchers to mine existing data for new insights.

NextBio aggregates public genomic data and allows (paid) customers to combine that with proprietary data to identify possible relationships. The freemium model gets academic researchers started for free — presumably so that they get hooked as grad students and take the tool to their new employers.
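A minimal sketch of that "public plus proprietary" combination, with all data, gene names, and field meanings invented for illustration: match records on a shared identifier, then look for an association between a public measurement and an in-house one.

```python
# Hypothetical example: public expression levels joined against
# in-house drug-response scores, keyed by gene symbol.
public = {"TP53": 2.1, "BRCA1": 1.4, "EGFR": 3.0, "KRAS": 0.9}
proprietary = {"TP53": 0.8, "EGFR": 1.1, "KRAS": 0.3, "MYC": 1.7}

# Only genes present in both datasets can be compared.
shared = sorted(public.keys() & proprietary.keys())
xs = [public[g] for g in shared]
ys = [proprietary[g] for g in shared]

def pearson(xs, ys):
    """Pearson correlation: a crude first pass at 'possible relationships'."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print("genes in both datasets:", shared)
print("correlation:", round(pearson(xs, ys), 2))
```

Real platforms do this at genome scale with far more sophisticated statistics, but the join-then-correlate shape is the same.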

DNAnexus is preparing for the shift from a world with one sequenced human genome to millions, and the vast amounts of storage (and computing power) that will be necessary in only a few years to handle this avalanche of data. Earlier this month it teamed with Google (one of its investors) to host the Sequence Read Archive on the cloud for all researchers to have access.

Meanwhile, the consumer genomics pioneer 23andMe has both genotype and phenotype data for 125,000 people: while the scale is much smaller, the uniformity is greater. One audience member called on the company to use the data for medical research, given the notorious lack of comparability for most phenotype data.

Data mining for medical research reminds me of data mining for social science research. Spending money to gather data is normally an entry barrier in the social sciences, so when a free source of data appears (patents, open source repositories), it attracts a flood of new researchers and studies.

So if researchers don’t have to gather their own data to hypothesize — or evaluate — potential treatments, this is going to drastically cut the cost of entry into pharmaceutical research. We’ve seen this story before, again in open source: there will be a few big winners that move quickly and decisively to exploit this new technology, while other incumbents will find their traditional business models fail as barriers to entry crumble.

Of course, there are other entry barriers for pharma besides research — trials, funding, distribution, brand. Pressure will come on these fronts too.

An executive of Celtic Therapeutics talked about approaches to cut the cost of clinical trials, by using crowdsourcing for the design of trials and telemedicine to process patients more efficiently. He claimed the costs of trials could be cut from $15-20 million to $1-2 million. Normally I’d take such dramatic improvements with a pound of salt, but the company claims as an advisor Karim Lakhani — one of the world’s leading crowdsourcing experts (and a personal friend).