Machine Learning will Revolutionize Metabolomics

1. What is untargeted metabolomics?

Metabolomics is a rapidly emerging field that seeks to measure the landscape of metabolites, or organic small molecules, in living organisms. Often, when scientists are interested in a very specific set of known metabolites, the protocols are tuned to detect this subset of the global metabolome. This is known as targeted metabolomics. Alternatively, untargeted metabolomics [1] refers to the identification and analysis of the entire global metabolome within the living organism. While this requires a significantly more complicated set of experiments, it has the potential to yield far richer datasets that can reveal new disease mechanisms, identify active compounds in bioactive extracts, and characterize new ways in which drugs target specific diseases.

As with other kinds of -omics (genomics, transcriptomics, proteomics), the goal is to get as comprehensive a survey as possible of the chemical space. Biology being dynamic, complex, and interconnected, if you want to understand an organism, it’s best to gather as much information as possible. Metabolomics is a particularly promising type of -omics data for a few reasons. Firstly, the metabolome is a closer representation of phenotype than upstream -omes. It also concerns a domain of chemistry, small organic molecules, of extremely high diversity and economic value across multiple sectors of the economy: agriculture, pharmaceuticals, medical diagnostics, and biomanufacturing to name a few.

Besides being particularly valuable, metabolomics can be particularly challenging. Since metabolites are not composed of one-dimensional sequences of building blocks, like the four nucleotides of DNA or the twenty amino acids of proteins, we can’t take advantage of that inherent structure in the data or leverage sequencing to identify the compound structures. We’ll get into more details on those difficulties below, but before that, here is a high-level summary of a tandem mass spectrometry (MS/MS)-based metabolomics workflow [2]:

  1. Start with a soup of thousands of molecules (metabolites) that you’ve extracted from an organism. A typical plant, for example, may have around 10,000 distinct metabolites.
  2. Separate them using chromatography and pass them through a tandem (2-stage) mass spectrometer [2]. The first stage will measure the individual compounds’ masses and abundances (MS1 spectra).
  3. The second stage will fragment the compounds into bits and, for each one, measure the masses and abundances of the resulting fragments (MS2 spectra).
  4. Analyze potential metabolites by comparing their masses and fragmentation spectra to each other and to libraries of known spectra (library matching).

Fragmentation pattern of acetylsalicylic acid (Aspirin)

Other than library matching, you might do many other kinds of analyses with spectral data. Examples include predicting fragmentation trees, molecular fingerprints, chemical classes, or structures of the individual metabolites. Through molecular networking, you can perform high-level analyses on compound families by clustering compounds based on the similarity of their spectra.
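To make the molecular networking idea a bit more concrete, here is a minimal Python sketch. It assumes pairwise spectral similarity scores have already been computed, links two spectra when their score clears a cutoff, and reports the connected groups as candidate compound families. The spectrum IDs, the scores, the 0.7 threshold, and the function name are all made up for illustration; real molecular networking tools apply more filters than this.

```python
def molecular_network(similarities, threshold=0.7):
    """Group spectra into families.

    `similarities` maps (id_a, id_b) pairs to a similarity score in [0, 1].
    Two spectra are linked when their score is >= threshold (an assumed
    cutoff), and each connected component of the graph is one family.
    """
    neighbors = {}
    for (a, b), score in similarities.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if score >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    families, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk to collect everything reachable from `start`.
        stack, family = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            family.add(node)
            stack.extend(neighbors[node] - seen)
        families.append(sorted(family))
    return families


# Hypothetical pairwise scores between five spectra.
pairwise = {
    ("s1", "s2"): 0.91,
    ("s2", "s3"): 0.78,
    ("s1", "s3"): 0.40,
    ("s3", "s4"): 0.15,
    ("s4", "s5"): 0.85,
}
families = molecular_network(pairwise)
```

Note that s1 and s3 end up in the same family even though their direct score is low, because both are similar to s2. That chaining behavior is a property of this connected-components simplification.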

2. The big problem: identifying compounds from their mass spectra

Mass spectra of two very similar compounds can look very different. Left: two very similar metabolites (92% Tanimoto similarity) with differences highlighted in grey. Right: their mass spectra mirrored one on top of the other. The cosine similarity of the spectra is 0.2. (Both similarity measures scale from 0 to 1.)

One of the biggest (if not the absolute biggest) bottlenecks to doing metabolomics with mass spectrometry is identifying the molecules. Without this step, the amount of insight you can gain from metabolomics is fairly limited. Mostly, identifying compounds happens by matching to similar MS2 spectra (with similar parent masses and column retention times) in reference databases.

MS2 matching requires a measure of similarity that tells you how similar two spectra are. Spectral similarity comes in a lot of flavors, but it typically works by constructing vectors of likely matches for the individual mass peaks and then measuring the cosine similarity of those vectors. Using spectral similarity allows you to find close or exact matches in databases of known spectra, which can help identify the compounds or their families and be a starting point for inferring the chemical structure of the molecule itself.
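As a rough illustration of that idea, the Python sketch below turns each spectrum into a vector by summing peak intensities into fixed-width m/z bins and then takes the cosine of the two vectors. Binning is a deliberate simplification of the peak-matching step (production tools usually match peaks within a mass tolerance), and the bin width, peak lists, and function names are assumptions made up for this example.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0 = no overlap, 1 = same shape)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy MS2 spectra as (m/z, intensity) pairs; the values are illustrative only.
query = [(77.04, 20.0), (105.03, 100.0), (123.04, 45.0)]
near_replicate = [(77.05, 22.0), (105.04, 95.0), (123.05, 50.0)]
unrelated = [(91.05, 100.0), (119.08, 60.0)]

score_replicate = cosine_similarity(query, near_replicate)
score_unrelated = cosine_similarity(query, unrelated)
```

In this toy case the near replicate scores close to 1 and the unrelated spectrum scores 0 because no bins overlap. A library lookup would rank reference spectra by this kind of score, typically after first filtering on parent mass.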

There are a couple of things that make all of that hard to do. As with most experimental data, even spectra that should be exact matches are not always alignable due to a range of matrix effects, batch effects, solvent effects, different ions, equipment differences, and general experimental variability. Furthermore, spectral reference libraries have fairly low coverage; only a small percentage of the molecules of any given organism are described and registered in spectral databases.

Most important, however, is the fact that spectral similarity is not the same as structural similarity. When no exact spectral match exists in a database, it’s useful to find near matches. Unfortunately, even relatively small changes to a molecule can result in dramatically different fragmentation patterns and thus dramatically different spectra. The example above illustrates this: even a slight structural difference can produce fragmentation patterns dissimilar enough that the two compounds do not match in a spectral library lookup. Without a reliable measure of compound similarity, it is extremely hard to identify compounds, cluster them into families, and analyze structure-to-function relationships.
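For readers less familiar with the structural side of that comparison, the Tanimoto similarity quoted in the figure caption can be sketched in a few lines of Python. It assumes each molecule has already been reduced to a fingerprint, represented here as the set of "on" bit positions. The two fingerprints below are invented for illustration and do not correspond to the compounds in the figure.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity: shared on-bits divided by all on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints that differ in a single bit position.
fp_a = {1, 4, 7, 9, 12, 15, 18, 21, 23, 27, 30, 33}
fp_b = {1, 4, 7, 9, 12, 15, 18, 21, 23, 27, 30, 35}

structural_similarity = tanimoto(fp_a, fp_b)  # 11 shared bits out of 13 total
```

Two molecules can score high on this measure and still produce spectra with a low cosine similarity, which is exactly the mismatch described above.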

3. Mass spectra and machine learning: an ideal match

Interpreting mass spectra is a phenomenal fit for machine learning. That’s because library matching, molecular networking, and molecule prediction are fundamentally problems of data representation: Can you represent bags of masses and abundances in a way that preserves structural similarity or identifies structural fragments or chemical class? Representing these kinds of complicated relationships between multiple data modes is part of what makes deep learning so powerful. It’s what is responsible for the recent massive gains in fields like machine translation and facial recognition.

Recently, Huber and colleagues demonstrated the power of machine learning to interpret spectra by creating the first Spec2vec model, which uses Word2vec to learn representations of mass peaks based on their co-occurrence in MS2 spectra across large reference databases. In Spec2vec, each mass peak is treated as a word and assigned a 300-dimensional vector, which is optimized so that mass peaks that commonly occur together have similar vectors. MS2 spectra are represented as weighted averages of their peaks’ vectors, meaning similar spectra should have similar vectors as well. The method is unsupervised, based only on learning co-occurrence, but even without supervision, it creates a similarity measure that approximates structural similarity strikingly better than cosine similarity measures (see figure below). However, there is still a lot of room for an even better approximation.
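The weighted-average step is easy to sketch in Python. The code below is not the Spec2vec library's API: the peak vectors are hand-written stand-ins for what Word2vec would learn (3 dimensions instead of 300, for readability), the `peak@...` token format and function names are hypothetical, and weighting by raw intensity is an assumption.

```python
import math

# Hypothetical "learned" peak vectors; a real model would learn these
# from peak co-occurrence across a large reference library.
PEAK_VECTORS = {
    "peak@77.0": [0.9, 0.1, 0.0],
    "peak@105.0": [0.8, 0.2, 0.1],
    "peak@91.1": [0.0, 0.9, 0.3],
    "peak@119.1": [0.1, 0.8, 0.4],
}


def peak_token(mz, decimals=1):
    """Turn an m/z value into a 'word' by rounding it (assumed token format)."""
    return f"peak@{round(mz, decimals):.{decimals}f}"


def embed_spectrum(peaks, vectors=PEAK_VECTORS):
    """Intensity-weighted average of the vectors of the peaks in a spectrum."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for mz, intensity in peaks:
        vec = vectors.get(peak_token(mz))
        if vec is None:  # peak not in the vocabulary: skip it
            continue
        for i, x in enumerate(vec):
            total[i] += intensity * x
        weight_sum += intensity
    if weight_sum == 0:
        return total
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


spec_a = [(77.04, 20.0), (105.03, 100.0)]
spec_b = [(77.02, 40.0), (105.01, 80.0)]
spec_c = [(91.08, 100.0), (119.08, 60.0)]

sim_ab = cosine(embed_spectrum(spec_a), embed_spectrum(spec_b))
sim_ac = cosine(embed_spectrum(spec_a), embed_spectrum(spec_c))
```

Because spectra are compared through their peak vectors rather than through exact peak overlap, two spectra that share no peaks can still come out similar if their peaks tend to co-occur in the training data.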

Above: average structural similarity of the top x percent most similar spectrum pairs from a selection of around 13,000 known molecules from GNPS. The grey line at the top represents the best possible approximation of structural similarity. Figure adapted from Huber et al. (2021).

Similarity is not the only area for improvement by ML in structural annotation. Even where no close match exists in libraries, there is a lot to learn about the structures and properties of molecules from their spectra. Structure and property inference is about pattern recognition. Domain experts can often gain enough expertise to “read” from mass spectra very specific differences in molecular structures within their area of expertise. This suggests that information in the spectra may be enough to make a good guess at the compound classes and structure of the molecules they represent.

Tools like CSI:FingerID and SubFragment-Matching have shown that ML algorithms can learn to predict useful properties and substructures of molecules from spectra alone. Coupling those predictive algorithms with more powerful data encoding provided by deep learning, and coupling prediction with molecular generation models, might allow a model to better translate a spectrum from an unknown compound directly into a predicted structure.

4. Building an ML-driven data set

With active learning (human-in-the-loop machine learning), intelligent selection of small numbers of key unknowns for labeling allows much more efficient model learning. Image credit: Human-in-the-loop machine learning, Robert (Munro) Monarch, Manning Publications, 2021

As every machine learning practitioner knows, a model is only as good as its data. Public metabolomics reference databases like those in GNPS, MetaboLights, and Metabolomics Workbench are extremely useful, but large reference datasets are built around specific classes of interest and compound availability, not necessarily to maximize the information available to an ML algorithm.

To fill this gap, at Enveda Biosciences we’re building the largest dataset of naturally occurring metabolites purpose-built for machine learning. We are starting with phytochemicals, which are historically among the richest sources of therapeutic drugs. Even so, the vast majority of plant compounds are still unidentified. Enveda aims to change this by collecting and analyzing the largest collection of plant chemistry ever assembled.

Our search for new drugs to bring to the clinic will feed massive data into our ML algorithms, which will in turn provide better guidance to our drug discovery programs. Active learning strategies will help us identify and characterize the mass spectra whose identity is most likely to improve our models, which we can then actively characterize, until our model performs well across all phytochemical space. Crucially, that data will be internally consistent and controlled by our acquisition platform.
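One common way to pick which unknowns to label next is uncertainty sampling, sketched below in Python. This illustrates the general active learning idea only and is not a description of Enveda's actual selection strategy. The spectrum IDs, the predicted class probabilities, and the labeling budget are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(predictions, budget=2):
    """Return the IDs of the `budget` spectra the model is least certain about."""
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]


# Hypothetical model output: class probabilities for four unlabeled spectra.
predictions = {
    "spectrum_001": [0.98, 0.01, 0.01],
    "spectrum_002": [0.40, 0.35, 0.25],
    "spectrum_003": [0.70, 0.20, 0.10],
    "spectrum_004": [0.34, 0.33, 0.33],
}

to_characterize = select_for_labeling(predictions, budget=2)
```

Here the two spectra with the flattest predicted distributions are chosen for characterization, while the confidently classified ones are left alone. After labeling, the model would be retrained and the selection repeated.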

5. Machine learning for metabolomics is a wide-open opportunity for drug discovery

Similarity and structure prediction are only the tip of the iceberg of how machine learning might revolutionize metabolomics-focused drug discovery.

As we run our extracts and mixtures through bioactivity assays, we will be able to leverage the natural diversity of molecules to infer which chemical structures are responsible for desirable efficacy, toxicity, and pharmacokinetic properties, based only on the existing variability of natural molecules.

Being able to identify small structural changes in molecules can help us track metabolic changes to drugs as they pass through the body and identify reaction centers.

Better dimensionality reduction methods and clustering will enable us to characterize large numbers of molecules in many plants at once and optimize sourcing for our compounds.

Data extraction, NLP, and knowledge graphs will enable us to ingest and join the world’s published information on the properties of these molecules, connect molecules to cellular pathways, and prioritize lead generation, target ID, and production.  

6. The golden age of ML for metabolomics is just beginning

We are at a juncture where improvements in mass spectrometry equipment and computational workflows have enabled large amounts of data to be collected and processed at once, while machine learning, particularly deep neural networks, is enabling the kind of pattern recognition needed to automate mass spectral interpretation. This opens up the possibility of scaling metabolomics to a degree not possible before.

The technology is converging from the technical and computational sides, but ML for metabolomics is still a relatively small field compared to genomics, transcriptomics, or proteomics. The advances we make over the next 3-5 years will create an enormous opportunity for applied machine learning to revolutionize metabolomics and by extension the numerous industries that rely on it.


[1] Some terminology: untargeted metabolomics attempts to quantify all metabolites in a sample, as opposed to targeted metabolomics, where there is a specific class or set of classes of molecules of interest to quantify. This article uses the term metabolomics as shorthand specifically for the untargeted variety.

[2] Tandem mass-spectrometry is the most common way of measuring the metabolome, but there are other methods as well, including NMR.
