Pharma, I am constantly told by colleagues in Silicon Valley, needs to acquire a “data science mindset.” Yet the entire business of drug discovery and development is deeply rooted in data, in science, and in evidence. Robust documentation is required by strict regulators at every step along the path, from the composition of the therapeutic molecule to exactly how it’s manufactured to exactly how it behaves in model systems to exactly how it behaves in people – its absorption, metabolism, distribution, and of course, critically, its safety and efficacy in patients with a particular condition, as demonstrated, generally, in multiple prospective, randomized controlled trials (RCTs). I’ve heard stories of companies shipping off reams of documentation to the FDA in trucks, prior to the days of electronic submissions.
In this context, you can perhaps appreciate why pharma folks tend to bristle when data scientists say it’s time for pharmas to treat data “as a first class citizen,” and say things like “data science is the application of the scientific method to data.” Funny story, pharma researchers often think, we’re pretty much neck deep in both data and the scientific method, our entire business is absolutely steeped in it, what’s your point?
After struggling with this for a while, I think I can finally offer some insight. But first, a few stipulations. It’s incredibly easy to dismiss the entire conversation in tribal fashion by assuming “it’s all consultants just trying to make money,” or “pharma doesn’t care about making change anyway”; some of these conclusions may even be true in some contexts, but even so, I’m convinced there’s a real conversation to be had. The top data scientists I tend to think of – folks like MacArthur genius Daphne Koller, for example – are not shilling some transformational change consulting project. Koller is deeply rooted in data science and AI, and taught it to Stanford graduate students for many years (as she described to Lisa Suennen and me on our Tech Tonics podcast). She and many others believe there are significant opportunities to apply data science to the way new medicines are discovered and developed, and believe it can ultimately make a profound difference to patients.
Similarly, within pharma, there are intense pressures on the R&D side, because it’s incredibly difficult for companies to come up with their next product; biology is incredibly complex, people are incredibly complex, and sticking a new chemical in the body and asking that it just does what you want and doesn’t do any harm requires both audacity and luck, and the vast, vast majority of candidate molecules never make it through. It’s not surprising that so many R&D heads view each rare success as something akin to a miracle. Biopharma urgently needs to find ways to discover and develop new medicines better, faster, and cheaper – which of course is the unofficial mantra of tech innovation.
In short: we have real problems in the way new medicines are discovered and developed, and there are thoughtful data scientists with deep expertise who truly believe technology could profoundly help the process and accelerate the delivery of impactful medicines for patients.
So if there is at least a measure of good, authentic intention on both sides, why does it still feel like pharma researchers and data scientists are talking past each other?
The Pharma Perspective
From what I’ve pieced together, here’s the story.
From the pharma perspective, developing a new product requires a succession of teams that each take a product from one “stage gate” to another, such as from target to hit, or hit to lead, etc. There are maybe a dozen teams that ultimately are responsible for the development of a new medicine, and each ultimately has to deliver a dossier of product-specific and stage-specific evidence. In many cases, the evidence will suggest the candidate molecule isn’t a promising drug; other times, the data will provide encouragement to continue to advance the molecule, in what is largely a step-wise fashion. I’m simplifying a bit, but ultimately it’s the collection of these dossiers that is collated and presented to regulators, a collection of evidence meeting pre-specified criteria in a range of categories. So at one level, this is a process replete with data, evidence, and science.
What really seems to drive sophisticated data scientists crazy about pharma is the loculated nature of the data, the fact that the data associated with each project and each step seem to exist in their own universe. There are huge variations in the way data are collected and organized at each step, and even within the same step, the data from one project are often not easily relatable to data from another project.
This is a problem even at the earliest stages of discovery and development, and an exponentially greater problem as a promising molecule enters clinical development; at this point, the data are effectively “owned” by a dedicated product team that’s formed around that molecule, and they tend to be exceptionally protective of it. To be clear, every product team I’ve ever engaged with wanted to execute responsible and well-designed clinical studies, studies explicitly designed to evaluate whether or not the molecule worked, and whether or not it was safe and well-tolerated. The goal of the product team was to get the required studies done, and to evaluate the results. This is the mission – to obtain the information required to assess whether or not a product is safe and effective, according to demanding regulatory standards.
The Data Scientist Perspective
Yet when data scientists look at all the data collected by pharmas, they generally find themselves in shock, appalled by what they see as a huge missed opportunity. As Andrew Carroll, a brilliant former colleague of mine at DNAnexus, who now works at Google, explained in a recent blog post (which grew out of a constructive Twitter dialogue), RCTs represent:
“one of the most rigorous manifestations of scientific design and practice you can have. But in a trial, look at where all of the emphasis on experimental design goes. It all goes into designing the enrollment, procedures, and collection. It is true that managing data is an essential part of a trial, but not in a way that affords any agency or discovery in the data. In fact, a trial is specifically designed (and rightly so) to limit the ability to do anything but yield a single, fixed statistically valid outcome.
The difference in data science is that data is an input. The problem is that many are conditioned to think of data as the object of value which comes out of experiments….”
As Carroll astutely points out,
“I think some of this mentality explains the tension of ‘data parasites.’ When you think of data as the valuable output of your experiments, which naturally contain your valuable papers, you are protective of it. When you see data as the starting inputs for well-structured science, this mentality seems weird. Since you are doing the same type of scientific design, you feel more like a data symbiont than a parasite.”
What data scientists would like is a rich and robust collection of consistently collected, well-annotated data to play with and analyze, so they can discover unexpected patterns and connections. Pharma, they argue, is missing a huge opportunity to leverage and learn from its own data – think of the insights that are possible.
“Pharma runs Phase 3 trial. Drug fails, but some patients do respond. Data scientist mindset (DSM) -> collect ‘omic data from each patient & apply ML to learn new molecular mechanisms of response. Non-DSM: let’s ask our clinical experts why they think it failed.”
While almost every traditional pharma R&D leader I know is skeptical about the immediate potential of data science, most are now hard at work building out their own data science teams, FOMO in action. A common first step many pharmas seem to be taking or contemplating is embarking on some sort of grand project to make all (or most) (or some) existing data inter-relatable, which in practice is both an extremely heavy lift and distracting to existing teams who now must both get their actual work done and contribute to this corporate mandate.
This feels like a recurrent pattern in a number of areas involving data science and data scientists. From the data scientist view, front line practitioners, whether drug developers or clinicians, could do the greatest good by richly documenting their observations, enabling data scientists (perhaps working with practitioners or other domain experts) to analyze the data and identify patterns, and suggest an action plan informed by rigorous analysis. Yet front line practitioners typically don’t relish the role of data entry clerk; they generally want to enter the minimum amount of information needed to do their jobs, and to do as much of the thinking as possible themselves, leveraging their experience and intuition, which in some cases can be valuable, in other cases might lead to avoidable errors associated with cognitive bias – see my discussion of Kahneman vs. Klein, here.
Is The Juice Worth The Squeeze?
Back to pharma’s effort to make existing data inter-relatable: it is unclear whether the juice will be worth the squeeze. In theory, having all the data better organized could, as data scientists envision, lead to greater insights. But most pharma folks I know are pretty skeptical about this, and many data science people worry that retrofitting data – and culture – may be prohibitively difficult. Speaking at a panel at the recent Precision Medicine World Conference – Silicon Valley meeting (video here), Koller discussed a traditional problem that tech companies confront, “technical debt” (essentially the messy code that accumulates over time and must eventually be cleaned up – see this nice discussion by Vijay Pande), and said that as bad as technical debt is, “cultural debt” is much worse, meaning that once a company is built without a data science culture, it’s hard to really acquire that capability. Perhaps not surprisingly, in her own company, insitro (I have no conflicts to disclose), Koller says she’s trying to build a data science-oriented pharma company from the ground up, integrating biopharma domain experts and data science experts from the beginning.
I suspect that among pharma incumbents, the greatest value of these “connect the data” efforts might be forcing them to think more deeply about how best to organize their information going forward. If transitioning to better data structures, which enable the use of more sophisticated analytical tools, actually does result in tangible R&D insights, then adoption will be rapid. But unless or until these wins appear, there’s likely to remain an atmosphere that feels like “assent without belief,” where pharmas embark on some kind of highly visible data science effort, but drug development continues pretty much as usual.
Given the urgency and the difficulty of discovering, developing, and delivering impactful new medicines for patients, I hope we collectively can figure out how to effectively apply the tools and approach of modern data science, in a thoughtful, humble, and fit-for-purpose fashion.
The humility aspect seems especially important. According to biotech journalist Luke Timmerman (via Twitter), tech guru turned biology funder Sean Parker “dismisses idea of AI/ML solving everything in biology. Showing respect for immense complexity, mystery of biology. AI/ML can be good when u understand basics of the problem and data inputs are solid. Otherwise, garbage in/garbage out.” Similarly, Joi Ito, director of the MIT Media Lab, reminds us (via Timmerman) “basically no ML applications are clinically validated yet,” and notes “when I talk with SV engineers, they have a hard time with complexity. They prefer structure.”
As oncologist and author (and former housestaff colleague) Sid Mukherjee observes (via Timmerman), “There’s a big data fetishization going on,” adding he “encourages Sean Parker to continue pooh-poohing big data for biology…. linear thinking gave us many medical discoveries, not ‘throw 300 million data points into a bucket and see what comes out.’” (I’ve discussed the fetishization of DNA here, pace Lewontin, and our obsession with big data here).
Data scientists are optimistic about the opportunity to improve how new medicines are discovered and delivered; most traditional medical scientists (including most pharma researchers) are skeptical that these new approaches will deliver benefit to patients, but convincible, saying, appropriately: “show me the data.”