Biology has over recent decades moved to finer and finer levels of detail, be it in the type of readout or in better resolution in the spatial (eg single-cell) or time domain. And, understandably, there is considerable excitement every time we are able to generate data in a technologically novel way.
The question is, though: In which way is this data practically useful, either to understand biology (and patient subtypes/endotypes), and/or to discover new drugs?
Especially currently, with the majority of ‘AI in drug discovery’ startups focused on generating novel drugs, there needs to be data that links disease biology (genes, mutations, …) to potential therapy (be it small molecules or biologics) – and this link can only be as strong as the data is (since even the fanciest algorithm will not make up for poor data!).
Which types of ‘-omics’ data are around currently?
Some types of biological data that we are currently able to generate relatively easily in this context are summarized in the table below and described further in the text (the points in italics are discussed below with respect to their current practical utility in drug discovery, in the personal opinion of the author):
Data type | Information provided | (Potential) Benefits |
Genome Sequencing | DNA Sequence of organism (human, pathogen, etc.) | Understanding ‘building blocks’ of life; variations associated with disease; identifying drug targets |
Single cell sequencing | Sequence/expression level on single cell level | Understanding heterogeneous cell populations (cells that drive disease, contribute to eg drug resistance in cancer, etc.) |
Gene Expression | Expression levels of genes | Identifying activity of genes related to cellular function, disease, drug efficacy/resistance, … |
Cellular Imaging | Geometry (morphology) of cell and its organelles | Understanding visually (via markers) changes in cellular organization |
- Genome sequencing data – data available about both human and other DNA has increased particularly since the Human Genome Project in the 1990s, with the expectation at the time that we would learn more both about human biology, and in turn also about potential drug targets;
- Single cell sequencing – a concept that has become significantly more popular in the last decade, with the realization that cell populations are heavily heterogeneous; eg the Sanger Institute in Hinxton has become heavily invested in the area recently. This type of data hence allows for the generation of spatially better resolved data, which is of importance eg to understand heterogeneous cancer cell populations;
- Epigenetics information – this describes heritable traits beyond modifications of the DNA sequence, a concept which interestingly goes back more than 60 years already, to 1942 (before even actual ‘genes’ were known) and the work of Waddington;
- Gene expression data – capturing not the sequence of genes but rather their expression levels, where larger-scale work from the disease side goes back to around the early 1980s. The popularity of this field increased significantly with Affymetrix GeneChips in the 90s, and more recently data from the compound side has become available, eg via Connectivity Map and LINCS, within the last 10 years or so. During this time other techniques such as RNA-Seq, and its experimentally simpler and more affordable cousins such as RASL-seq, TempO-Seq, DRUG-Seq (and others), have also been established;
- Proteomics data – which describes a biological system not on the gene but on the protein level, thereby also taking into account that gene and protein levels are often only weakly correlated. In this area experimental approaches have made significant leaps in recent years, but generally both the experimental setup and data analysis remain significantly more difficult than on the transcriptomic level;
- Metabolomic information – which identifies and quantifies metabolites in a living system, with claims to be ‘closer to the phenotype’ than even the proteomics level. Aspects such as metabolite identification remain tricky, but structure elucidation and experimental techniques continue to evolve significantly;
- Imaging data – which can refer to data on rather different levels, from the cellular to the organ and organism level. On the cellular level the field was driven eg by developments in confocal microscopy (interestingly the first confocal microscopy patents date back to 1957!), but it also comprises eg 3D imaging methods of tumors. While some of the underlying physical principles for data generation have been known for longer, efficient data analysis methods only emerged much more recently, in the last 15-20 years. In the context of this discussion the focus will be on cellular imaging and its use in the drug discovery context, in particular on High-Content readouts, such as those available via the recent Cell Painting assay/datasets and similar formats;
- And others, which are not included here for now
We have data, great – and now?
To preempt the conclusion of this piece somewhat: I entirely share the excitement about generating data on a finer and finer level of detail from the purely technological side, and I can also see implications for the understanding of fundamental biology to a good extent. What, however, is less clear to me in many cases is whether we actually know what to do with all that data subsequently, in the rather practical context of discovering safe and efficacious drugs at reasonable expense and pace, and/or patient subtyping for personalized medicine. My point is not that this is never possible – my point is that we have a huge amount of data, and compared to that the practical utility is small.
Personally, I first encountered this conundrum when analyzing High-Content Screens myself during my postdoc at Novartis more than 10 years ago, where the cellular parameters determined from automated microscopy seemed (and after 10 years still seem!) rather cryptic to me with respect to biological interpretation and utility (though some approaches to eg rationalize high-content readouts mechanistically have been published recently). To me, our (relative) lack of understanding doesn’t seem to be limited to cellular microscopy readouts – I would argue that it is common to the majority of ‘-omics’ types of data generated. We need to observe the response of biology to compound application – fully agreed, so high-dimensional readouts can in principle tell us more than eg target-based assays alone. But my point is that, without proper hypotheses being used in the first place, the data generated is often difficult to handle subsequently – either for statistical reasons (eg weak signals in cases where we have few samples and a high-dimensional readout space), and/or because the experimental setup is simply irrelevant for any in vivo situation (say, due to using single-cell systems, or physiologically irrelevant doses or time points), or for many other possible reasons.
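To make the statistical point above concrete, here is a minimal simulation sketch in Python with NumPy. The function name, sample sizes and threshold are illustrative assumptions, not taken from any particular study. It draws pure noise for two groups of 10 samples with 5,000 features each and counts how many features pass an uncorrected two-sided 5% t-test threshold.

```python
import numpy as np


def count_chance_hits(n_per_group=10, n_features=5000, t_crit=2.101, seed=0):
    """Count features that look 'significant' although the data is pure noise.

    t_crit = 2.101 is the two-sided 5% critical value of Student's t with
    18 degrees of freedom, ie it matches two groups of 10 samples each.
    It must be changed if n_per_group is changed.
    """
    rng = np.random.default_rng(seed)
    # Pure noise: no feature truly differs between the two groups.
    group_a = rng.normal(size=(n_per_group, n_features))
    group_b = rng.normal(size=(n_per_group, n_features))
    # Pooled-variance two-sample t statistic, computed per feature.
    pooled_var = (group_a.var(axis=0, ddof=1) + group_b.var(axis=0, ddof=1)) / 2
    std_err = np.sqrt(pooled_var * 2 / n_per_group)
    t_stat = (group_a.mean(axis=0) - group_b.mean(axis=0)) / std_err
    return int(np.sum(np.abs(t_stat) > t_crit))


hits = count_chance_hits()
print(f"{hits} of 5000 noise features pass the uncorrected 5% threshold")
```

With these defaults one would expect roughly 5% of the features, ie around 250 of 5,000, to pass the threshold by chance alone, although not a single one carries real signal. Multiple-testing correction helps, but with few samples it also removes weak true signals, which is exactly the difficulty described above.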
So we can generate all this data – but what does it mean, how can it be used? In the following I will (very) briefly – and admittedly subjectively, though not without evidence – shed some light on the impact that genome sequencing, single cell sequencing, gene expression data, and cellular imaging have had so far on both aspects: understanding disease and discovering drugs.
So what did – in brief – genome sequencing, single cell sequencing, gene expression data, and cellular imaging data contribute to drug discovery today?
Sequencing: The sequencing of the human genome was an ambitious project, compared at the time to putting the first human on the moon (though this picture has been used rather frequently, also more recently, in this context). At the time it was expected that there would be “More drug targets… 3000–10 000 targets compared with 483” (luckily this work didn’t state a particular time frame for that to happen). It appears that, at least 20 years later, we didn’t really get there – recent (2017) estimates put the number of drug targets at around 667. On the other hand, with CRISPR and related techniques, will we be able to expand the number of drug targets in the near future, and isn’t this based on previous projects, such as the Human Genome Project? In addition, maybe a focus on small molecules held us back – and will new chemical modalities/biologics help in the future? So I wouldn’t put the impact of sequencing the human genome down to an expansion to ‘3,000-10,000 drug targets’; that hasn’t materialized yet. But we certainly can annotate genomes, and hence proteins, more systematically than we were able to do before. So in a way, genome sequencing by itself didn’t really unravel the dynamic interactions of living systems and expand druggable targets on a huge scale. But it helped catalog biology better, which is crucial to store and annotate data in the future. Sequencing also helped for practical purposes, such as understanding the heterogeneity of cancers (and that eg two cell lines in the NCI 60 screening set are actually the same, which hadn’t been known before), thereby providing a basis for defining what we actually deal with.
On the patient level, genetic drivers for other diseases have also been identified. As an example I will pick Pulmonary Arterial Hypertension (PAH), given recent local work here in Cambridge – where, inch by inch, the authors were able to tease out genetic factors which contribute to PAH, given suitable datasets and methods and a dedicated effort.
VERDICT: Sequencing data has helped us greatly to catalog and annotate biology. Some advances have been made to identify genetic drivers for disease as well. However, its impact on developing new drugs has been limited in my personal opinion. Reasons include that few diseases (except some genetic diseases) are purely defined by DNA sequence; that even if there is a genetic contributing factor, other contributing factors are required as well for a disease to develop; and that on the methodological side sample sizes have often been small and mathematical methods are more tricky to use (eg when it comes to biases) than a simple ‘data in – knowledge out’ would suggest.
Single cell sequencing data: The core argument for spatially resolved single-cell sequencing data is that transcription events are not uniformly distributed across cells, and hence for understanding cellular populations across different areas (say, development, understanding disease, and drug response) a finer level of detail than cellular population averages is needed. On the fundamental level this is intuitively true – and some research exists that underlines the practical usefulness of this type of data, such as when understanding developmental processes and understanding the heterogeneity of brain tumors, with practical implications for drug response. Other studies used single-cell data to describe determinants of drug response in cancer immunotherapy. It seems to me that single-cell sequencing is currently somewhat at the peak of the hype cycle, with articles such as “A Project to Map All Human Cells Will Change How Disease Is Cured” – what is this claim really based on though? Having read the article I cannot say – lots of ‘might’ and ‘could’ feature in it. Practically I wonder if the finer and finer level of spatial and temporal resolution will really lead us to a more unified picture of disease biology that will be useful for practical purposes – since we need to generate the data in the first place, store and analyze it, and then, importantly, identify common patterns in it – which, given the larger number of variables, will be more and more tricky the finer the level of data we generate in the first place. Beyond the first studies on understanding development and disease which have already appeared, I think we need to see whether the insight gained really justifies the (rather large) investments in the area.
VERDICT: ‘Proof of concept’ has certainly been established, eg for understanding development, or understanding drug response in patients, so there is a clear scientific rationale for this work to further our understanding of biology. Where does it help in drug discovery though, or practical patient subtyping (in the clinic)? I think this is where the level of detail generated can be tricky to handle (see also below) – even ‘conventional’ sequencing often isn’t really used in cancer clinics right now. So from the viewpoint of ‘value for money’ this might be difficult to justify… but maybe fundamental research never is?
Gene Expression data: Firstly a personal disclaimer: I have used gene expression data quite a lot (both personally and in my research group), and I love it! That being said: Maybe it’s more of a love/hate relationship. There have been many successful examples of using gene expression data eg for repurposing, mode of action analysis, understanding drug efficacy and understanding the toxicity of potential new drugs, so from the empirical angle transcriptomics data seems to be very valuable. On the other hand, would I claim that we really understand gene expression data? We can do Gene Set Enrichment Analysis (GSEA), Weighted Gene Co-Expression Network Analysis (WGCNA), etc. … but it seems the outcome of such analyses is very often that a rather large number of genes are differentially expressed, heavily dependent on the precise method and parameters used, leading to a number of colorful pathway annotations being modulated… and drawing concrete conclusions from the finding, a proper interpretation, is rather tricky. Even beyond the method used – is the data that has been generated even coming from the right disease state/tissue, and has it been taken at the right time point/compound dose/etc.? Quite often the answer to those questions is simply – ‘we don’t know!’. In addition, much of current (compound-derived) gene expression data has been generated in cell lines – how does this extrapolate to decision making in patients? So, while empirically gene expression data – which is quite cheap and fast to generate – has turned out to be very useful, understanding gene expression data is, in my personal opinion, often tricky. But maybe a signal is good enough – eg looking at the most up- and downregulated genes, and applying them for signal detection and repurposing, without even understanding the data in every detail? It seems that this works rather well, as studies of other groups as well as our own have shown.
VERDICT: I would put gene expression data into the category ‘we don’t quite understand what we do, but often it’s rather useful in practice’ – we in many cases do not really model and understand the data, far from it; but for a variety of practical applications transcriptomics data has turned out to be useful. And it’s cheap and easy to generate, which is a plus – though care has to be taken precisely how to set up a biological system to be predictive for a given question one asks.
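The ‘a signal is good enough’ idea, ie using only the most up- and downregulated genes without interpreting them, can be sketched in a few lines of Python. This is a deliberately crude illustration under stated assumptions, not the scoring actually used by Connectivity Map or LINCS. The function names, gene symbols and fold-change values are all hypothetical placeholders.

```python
def top_genes(signature, n):
    """Return (up, down): the n most up- and n most downregulated genes.

    signature: dict mapping gene symbol -> log fold change.
    """
    ranked = sorted(signature, key=signature.get)  # ascending by fold change
    return set(ranked[-n:]), set(ranked[:n])


def reversal_score(disease_sig, compound_sig, n=2):
    """Crude score: positive if the compound moves the disease's top genes
    in the opposite direction, negative if it mimics the disease."""
    up, down = top_genes(disease_sig, n)
    # Genes up in disease should go down under compound treatment ...
    score = sum(-compound_sig.get(gene, 0.0) for gene in up)
    # ... and genes down in disease should go up.
    score += sum(compound_sig.get(gene, 0.0) for gene in down)
    return score / (2 * n)


# Hypothetical placeholder signatures (log fold changes), not real data.
disease = {"G1": 2.0, "G2": 1.5, "G3": 0.1, "G4": -1.2, "G5": -2.5}
reverser = {"G1": -1.0, "G2": -2.0, "G3": 0.0, "G4": 1.0, "G5": 3.0}

print(reversal_score(disease, reverser))  # 1.75, ie the signature is reversed
```

Note how many choices are hidden even in this toy version: the number of top genes `n`, how missing genes are handled, and how the score is normalized. Each of these is one of the method and parameter dependencies mentioned above.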
Cellular Imaging data: While DNA sequencing and transcriptomics have been around for a while and are now rather established techniques, cellular imaging data seems to be in its ‘second spring’ currently, after the first practical demonstration of general principles in the late 1990s, and the more recent standardization of readouts, such as in the Cell Painting assay. To me personally it was rather surprising to see how long it took to establish standards for data generation and handling in the field (processes still ongoing) – but now that standards are emerging, organizations such as the EPA (and also many pharmaceutical companies) are rather swift adopters of the readout, and companies such as Recursion Pharmaceuticals are banking heavily on this type of readout, along with others among the big pharma companies. Some first examples of drug repurposing applications (with somewhat different formats) do exist, so there inherently seems to be information content in cell morphology based readouts. The same holds true for compound target prediction using imaging data. But what is the biological setup I need (in particular when it comes to the ‘ugly siblings’: cell line, time point and dose), and how do I need to analyze data for a given purpose? Do I really need imaging data, and is it worthwhile using such data compared to other tools, such as ligand structure-based target prediction? This may very well be the case, in particular given the reusability of images for different endpoints, but at least according to the information available in the public domain we probably still need to wait for further applications and comparative studies to emerge. In my very personal opinion, given the research we also perform in the group, we have frequently encountered situations where trivial signals were easy to detect in Cell Painting readouts – but where subsequent, finer-grained information was much more difficult or even impossible to tease out.
Is this due to the data, or rather the analysis method and endpoint used by us? Very difficult to say at this stage.
VERDICT: While image-based cellular morphology readouts have now been around for more than 20 years it is, in my opinion, still difficult to say what precisely they can be used for, and what the best setup for a given purpose is. It seems that cellular imaging – on the data generation, storage/handling, and analysis side, and with respect to practical impact – has been in the ‘establishing best practice’ stage for quite a while now, in particular when it comes to hypothesis-free/general data generation (obviously looking for particular markers is very different!). Hence I would say that no final verdict is possible at this stage, but in order to establish practical value eg of the Cell Painting assay we would ideally need to move on to ‘production phase’ shortly, and likely this needs to involve multiple partners and larger consortia.
So what now – is there a point in using ‘-omics’ data, or not?
Overall I think the picture, looking at those four different types of readouts as a sample, is rather mixed – and for very different reasons in every case. Sequencing allows us to catalog – with little direct, but much indirect, impact on understanding disease and on drug discovery. Single cell sequencing allows us to understand healthy and disease biology better – but likely its cost will be prohibitive for some time for practical applications in the clinic, such as patient subtyping. Gene expression data is practically tremendously useful in different areas, such as repurposing and others – without us really understanding what is happening (which may or may not be a problem in particular cases). Cellular imaging has recently arrived at standards for data generation, which is good to see – but the readouts are still rather cryptic, and practical utility still remains to be shown, at least in the public domain.
Data type | Claim | Contribution | Not yet realized |
Sequencing | ‘understanding the building blocks of life’ | Significant contributions to systematically cataloging biology, some impact on patient subtyping | Little direct impact on drug discovery itself (certainly compared to original claims) |
Single cell sequencing | Understanding heterogeneity in disease, processes in development | Case studies show first suitability for understanding eg developmental processes, cell heterogeneity, and drug response | Applications for drug discovery & in clinic (not the lab!) still to be established, amount of data generated might be prohibitive for some (many?) applications |
Gene expression data | Understanding cellular states and processes | Various successful applications eg in repurposing, understanding modes of action, and understanding and predicting modes of toxicity | Signals often difficult to interpret overall (but genes can be interpreted and eg put on pathways individually), analyses are heavily method- and parameter-dependent |
Imaging | Understanding cellular processes on morphological level | Allow generation of standardized cell morphology data on large scale, comparatively cheap | Readouts are cryptic; practical utility beyond simple examples still needs to be demonstrated (at least in the public domain) |
Problems with ‘-omics’ data
While every new technology has a certain period where it needs to show its value, some aspects can be commonly observed (especially in the current context of ‘AI’ in drug discovery, which doesn’t always consider the complexity, predictivity, and variance of biological data fully):
To a good extent, simplified model systems are still used – often simple cell lines (though there is a considerable effort to go into 3D and heterogeneous cell cultures etc.). To what extent does this represent the patient, or a situation in the clinic? This uncertainty relates to any of the -omics readouts mentioned above where cell lines are used to generate data. (On the other hand, cells can of course simply be seen as ‘signal generators’ that respond to eg compound treatment – but in this case one needs to let go of the idea that measurements in cells have direct clinical applications in a then different biological context.)
We assume that ‘more is better’ – finer levels of spatial resolution, temporal resolution, different types of readouts, … until we can generate spatially and temporally resolved maps of our body on a cellular level. Apart from practical problems of data handling and analysis the problem that emerges is: How can we generalize this? How do we integrate this type of information, to go from data to understanding? This is not a trivial point – we generate more and more variables with the data we generate, which isn’t really matched by the number of data points we have – but to go to knowledge we need to have an underlying map to integrate this data into a unified concept. Otherwise we just – generate data. But generating data cannot be the end of it in order to arrive at practically useful solutions.
Technology push vs scientific pull – I am old enough to have observed this in the 2000 biotech bubble, and again now – our human mindset in ‘the West’ is (usually) skewed towards (a) new is better (‘Artificial Intelligence is better than Machine Learning’ – although the latter has been around for decades), (b) analytical approaches (where ‘more data is better’ – although predictivity/signal should be the guideline instead), and (c) economic interests of companies, who have the (rather natural, from their perspective) incentive of pushing their own product into the market. If you have a start-up presenting on ‘AI using ‘omics data and deep learning’ then the VCs will flock to you – and this company will create a technology push, where other potential customers, because of ‘Fear Of Missing Out’ or other reasons, will find it difficult to resist and not buy into the new offering. The skeptical voice will receive less attention – maybe he or she is simply behind the current state of the art! Scientific pull is, as far as I can observe it, less often the reason for developments than technology push. Jumping on every new technology will never allow us to understand the capabilities of the tools we already have. Promoting a new hype, though, is more sexy than issuing warnings – leading to article headlines such as ‘More is better‘ when it comes to -omics data, with the contents however being rather light on detail as to why precisely ‘more’ should be ‘better’, in particular across applications, and when taking the cost of generating such data into account as well.
Where is the signal in the data? This goes back to the question of hypothesis-free vs hypothesis-driven data generation. Based on my experience hypothesis-free data might well seem more ‘universal’ at the start – but it harbors the problem of not being able to identify the signal one needs for decision making. I think in some cases (such as gene expression data) we have more experience than in others that this is a useful type of data to generate – and it has the advantage that we can easily link it back to existing knowledge (genes, pathways, etc.). So if you can – generate hypothesis-driven data. This needs to be driven by a scientific question, covering suitable experimental design, and likely by consortia in many cases (not just for sharing data, which is often the case, but also for generating data in the first place).
Data is crucial for use in AI – but in many currently published studies it isn’t always clear how data has been used precisely for decision-making, which methods were applied, and what the contribution of ‘AI’ compared to a negative control (eg an established method) was. As one example, selecting a drug candidate for Wilson’s Disease was based on data from “more than 2,400 diseases and over 100,000 pathogenic mutations” – in my experience though, this doesn’t give you a neat mutation, or a handful of drugs for testing in return. You rather get lots of possible links back… and then you need a human to sift through the information (which is often the much more tricky part!). As for ‘validation’ it is simply very easy to ‘re-discover’ what you have known anyway using AI – but without a negative control it is very difficult to assess what the added advantage of AI is.
Translation from research to the clinic is poor: As in: really poor. There are fancy things that appear on the research front, characterizing cancer landscapes etc. The wife of a friend of mine had breast cancer in Germany, and what was the genetic information used for decision making? Nada – standard approach for everyone (operation, radiation and subsequently Tamoxifen), and that was it. I have heard similar stories from Spain – lots going on in research, but in the clinic? Translation is meager, partially due to lack of validation of methods in the clinic, partially due to established workflows and difficulty of implementation, partially due to cost. Do we really get the most ‘bang per buck’ for taxpayers’ money this way though? I am not sure about that.
Conclusions
We need to generate data, such as -omics data, for describing biology – but the devil is in the detail: which data do we generate, in which biological setup, and how do we analyze it for which purpose? Here, much remains to be done. We will likely run into problems with generating and understanding the data we generate if we always jump on finer levels of resolution in the spatial and temporal domain, both on the practical (eg with respect to data storage and analysis) as well as the scientific level (how to make sense of fine-grained data). Biological noise, and the inability to generate the numbers of data points needed, will likely set limits to what we can achieve with always finer levels of resolution (not necessarily when characterizing systems, but probably when trying to draw actionable conclusions from the data).
We probably need to explore collaborative approaches for establishing state of the art when it comes to data – the current focus on ‘proof by example’ is likely not useful; we need to have controls and established methods as a baseline, and studies where reproducibility is ensured. Hiding logistic regression baselines in the Supplementary Material to promote the apparent superiority of ‘deep learning’ is also not a good approach. A certain distrust in scientific publications is understandable – and ‘higher-ranked’ journals do not really seem to fare better than ‘lower-ranked’ ones – “only half of the articles [from machine learning in biology and medicine] shared software, 64% shared data and 81% applied any kind of evaluation. Although crucial for ensuring the validity of ML applications, these aspects were met more by publications in lower-ranked journals.”
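What ‘an established method as a baseline’ could look like in practice is sketched below in plain Python. It assumes, hypothetically, that a new model and a simple baseline (eg logistic regression) have been scored on the same evaluation splits, and it reports the mean difference together with a bootstrap interval rather than a single headline number. The function name and defaults are illustrative assumptions, and with only a handful of splits such an interval is itself rather crude.

```python
import random


def paired_bootstrap_diff(model_scores, baseline_scores, n_boot=10000, seed=0):
    """Mean score difference (model - baseline) with a 95% bootstrap interval.

    Both lists must hold scores from the SAME evaluation splits, in the
    same order, so that the comparison is paired.
    """
    if len(model_scores) != len(baseline_scores):
        raise ValueError("scores must be paired, one per evaluation split")
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    rng = random.Random(seed)
    # Resample the per-split differences with replacement.
    boot_means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return sum(diffs) / len(diffs), lower, upper
```

If the interval includes zero, the data at hand does not support the claim that the new method beats the baseline, however sophisticated the new method may be.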
Let’s see what we can discover in -omics data in the future – I am certainly very curious and hope that they can be put to good use when answering practical questions in the healthcare setting in the future.
/Andreas