GRAFF-MS: a new scalable ML model for accurate prediction of small molecule mass spectra

Analyzing the molecular basis of life is hard

If you’re a scientist working with any kind of biological sample (blood, tumors, cell culture, plants, fungi, microbes, whatever), understanding the chemical composition of your sample is highly valuable. Today, the big macromolecules that make up the cell (DNA, RNA, and proteins) can all be assayed at relatively high throughput. But there is another key component of the cell that has so far received much less attention: the small organic molecules, or metabolites.

Nature has a lot of metabolites! A single extract of a plant, say, or a cancer tumor, might contain thousands or tens of thousands of distinct compounds, and across all of nature there may be hundreds of millions to tens of billions of metabolites. Being able to compare metabolites between samples can tell you a lot, such as pinpointing the difference between diseased and healthy tissues. These metabolites are centerpieces of all essential cellular and organismal processes: energetics, membranes, hormones, intracellular signaling, immunity, cooperation, competition, and death. At some level, one can even think of DNA, RNA, and proteins as existing in order to make these organic molecules, break them apart, combine them, alter them, excrete them, sense them, and respond to them. A full accounting of all of the metabolites and their abundances would go a long way towards describing what an organism is up to.

But despite their criticality, experimental methods to identify and quantify metabolites – aka metabolomics – from a biological sample are lacking. Unlike DNA, RNA, and proteins, metabolites aren’t strings of repeated building blocks (nucleotides and amino acids) that can be analyzed by a sequencer. Instead, the way you study metabolites is mainly through tandem mass spectrometry (MS/MS) [1], in which molecules are ionized and then broken into pieces. The masses of the resulting pieces are detected and combined into an MS2 [2] mass spectrum that looks like this:

MS2 spectrum of Quinine. Image credit: MassBank

The MS2 spectrum is notoriously hard to interpret, and that’s what makes metabolomics hard. You can detect a metabolite, its mass, and the masses of its component fragments, but putting all of that together into an identified metabolite is extremely difficult.

One resource that makes the interpretation of mass spectra easier is a reference library of known metabolites, their structures, and their mass spectra. This allows researchers to match an experimental spectrum to a previously determined one. These libraries do exist, but their downside is that they only cover a small fraction of molecules (about 50,000 in total), many of which are synthetic molecules that would not be seen in a biological sample. This approach will typically identify only around 2% to 10% of the spectra, and thus metabolites, from biological samples.

To widen the scope of these experimental reference libraries, researchers supplement them with libraries of predicted spectra from known metabolites. This takes advantage of the fact that there are many more known metabolite structures (over 400,000) than there are metabolites that have been profiled by mass spec. Commonly used tools for generating predicted spectra either simulate bond breaking, use machine learning, or combine the two. [3]

As we have used these libraries augmented with predicted spectra in the course of our work, we have found the accuracy of the predictions insufficient, and have struggled with how long these models take to run. Our goal was to create a better method for generating these predicted spectra as a means of improving our reference library and thus our overall ability to identify metabolites from samples.

Using graph neural networks and finite vocabularies to better predict mass spectra

Last summer at Enveda, we launched a program to train a machine learning fragmentation model for generating predicted spectra. This work was undertaken by graduate summer intern Michael Murphy under the direction of Tom Butler and data science fellow Tobias Kind at Enveda. Michael has since returned to finish his PhD with Stefanie Jegelka and Ernest Fraenkel at MIT. The resulting paper is available as a preprint here.

One of the key challenges with predicting spectra using machine learning is that the data in a mass spectrum is not the kind of data machine learning models handle well. Modern mass spectrometers are extremely precise, often accurate to within 1/1,000 of the mass of a neutron. That precision carries a lot of information about what the ions are. Because binding energies in the nucleus affect the masses of elements, there are many fewer molecules that can have a mass of 241.4005 Daltons than there are whose mass rounds to 241.4 Daltons, for example. While this precision is key for metabolite identification, it is not naturally compatible with machine learning, as these models struggle with predictions requiring so much precision.
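A small sketch (not from the paper, and using illustrative formulas) makes this concrete: several molecular formulas can share the same integer mass while their exact monoisotopic masses differ by far more than an instrument's precision, so keeping the high-precision value retains formula information that rounding destroys.

```python
# Monoisotopic masses of the most abundant isotopes, in Daltons.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def exact_mass(formula: dict) -> float:
    """Sum monoisotopic atomic masses over an {element: count} formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items())

# CO, N2, and C2H4 all have nominal (integer) mass 28, but their exact
# masses differ because nuclear binding energies shift each isotope's
# mass away from a whole number.
co   = exact_mass({"C": 1, "O": 1})   # ~27.9949 Da
n2   = exact_mass({"N": 2})           # ~28.0061 Da
c2h4 = exact_mass({"C": 2, "H": 4})   # ~28.0313 Da
```

All three values round to 28, so a measurement good to ~0.001 Da distinguishes formulas that coarse rounding would conflate.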

Other models like CFM-ID get around this problem by simulating iterative bond breaking over the molecular graph to produce a set of putative fragments for each molecule. This yields a space of precise masses but is computationally expensive, takes a very long time for large molecules, and misses hard-to-predict bond rearrangements that are characteristic of mass spec fragmentation. Alternatively, models like NEIMS round or bin the masses to 0.1 or 0.01 Daltons to get a smaller representation space very efficiently, but at the cost of losing mass precision.
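To illustrate the binning trade-off, here is a hedged sketch of coarse mass binning in the spirit of NEIMS (the bin width and mass range are illustrative assumptions, not the model's actual settings):

```python
import numpy as np

def bin_spectrum(peaks, bin_width=0.1, max_mass=500.0):
    """peaks: list of (m/z, intensity) pairs.
    Returns a fixed-length intensity vector over coarse mass bins."""
    n_bins = int(max_mass / bin_width)
    vec = np.zeros(n_bins)
    for mz, intensity in peaks:
        idx = int(mz / bin_width)  # rounding down to the bin loses precision
        if idx < n_bins:
            vec[idx] += intensity
    return vec

# Two hypothetical peaks from an MS2 spectrum:
spec = bin_spectrum([(136.0757, 100.0), (118.0651, 40.0)])
```

A fixed-length vector like this is easy for a neural network to predict, but ions at 136.0757 and 136.1203 Da would land in the same 0.1 Da bin, erasing a distinction that could separate two candidate formulas.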

Left, CFM-ID uses iterative bond breaking to create a vocabulary of possible masses. Right, NEIMS is a feed-forward neural network that accepts precomputed fingerprints as inputs and outputs probabilities over coarse-binned masses.

Michael hypothesized that, as in many computational chemistry applications, a graph neural network trained on 2-D chemical structures might be better able to learn fragmentation patterns while still maintaining the speed and efficiency of a neural network. Graph neural networks have exploded in popularity in recent years, and are incorporated in technologies like AlphaFold and modern recommendation systems. On molecules, graph neural networks work by iteratively combining information about atoms (nodes of the graph) with information about neighboring atoms that are connected by chemical bonds. This allows the model to represent neighborhoods of atoms and learn how atom and bond types contribute to fragmentation during mass spectrometry.

Successive layers of graph neural networks allow atoms to “see” increasingly distant atoms along paths formed by atomic bonds. Image credit: Michael Bronstein.
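That "seeing farther with each layer" behavior can be demonstrated with a minimal message-passing sketch (illustrative only, not GRAFF-MS itself): each layer replaces an atom's feature vector with an average over itself and its bonded neighbors, so k layers propagate information up to k bonds away.

```python
import numpy as np

def message_passing(node_feats: np.ndarray, adjacency: np.ndarray,
                    n_layers: int = 3) -> np.ndarray:
    """Mean-aggregation message passing over a molecular graph."""
    h = node_feats.astype(float)
    # Add self-loops so each atom keeps its own information.
    a = adjacency + np.eye(adjacency.shape[0])
    a = a / a.sum(axis=1, keepdims=True)  # row-normalize: mean over neighborhood
    for _ in range(n_layers):
        h = a @ h  # a real GNN would apply a learned transform + nonlinearity here
    return h

# Heavy-atom graph of ethanol, C-C-O: a simple 3-node path.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
feats = np.eye(3)  # one-hot "atom identity" features
out = message_passing(feats, adj, n_layers=2)
```

After one layer, the first carbon carries no information about the oxygen two bonds away; after two layers it does, which is exactly the receptive-field growth the figure describes.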

One thing we noticed when analyzing the training data was that the vast majority of the mass signal comes from a small percentage (about 2%) of the total fragment ions (defined as ions and neutral losses). This allowed us to represent the observed spectra as a probability distribution over a combination of those specific fragments and the masses you’d get from subtracting those fragments from the original molecule. We discard the bulk of the possible masses, which don’t contribute much to the overall corpus of spectra. This brings the vocabulary size down to a very manageable 10,000. Using this fixed vocabulary gets around the problem of the super-precise masses, and allows the model to predict a spectrum by predicting the probability of each vocabulary entry.
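The vocabulary-building step can be sketched roughly as follows (the formulas, corpus, and sizes here are hypothetical; the real pipeline works over annotated training spectra at much larger scale): count how often each fragment or neutral-loss formula appears across the corpus, then keep only the most frequent ones.

```python
from collections import Counter

def build_vocabulary(annotated_spectra, vocab_size=10_000):
    """annotated_spectra: iterable of fragment-formula lists, one per spectrum.
    Returns the vocab_size most frequent formulas across the corpus."""
    counts = Counter()
    for fragments in annotated_spectra:
        counts.update(fragments)
    return [formula for formula, _ in counts.most_common(vocab_size)]

# Toy corpus of three annotated spectra (formulas are placeholders):
corpus = [["C8H10N", "C7H8N", "C4H7"],
          ["C8H10N", "C4H7"],
          ["C8H10N"]]
vocab = build_vocabulary(corpus, vocab_size=2)
```

Because a fragment and its complementary neutral loss are two sides of the same bond cleavage, each vocabulary entry can contribute two candidate peak masses per molecule: the fragment's own mass and the precursor mass minus the fragment mass.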

Performance results for GRAFF-MS vs CFM-ID and NEIMS evaluated on a structure-disjoint test set from NIST and on an independent dataset from CASMI 2016

We trained GRAFF-MS on the NIST20 library, which contains spectra from around 31,000 compounds, and evaluated it both on a held-out NIST subset of 1,637 structures and on spectra from the 2016 CASMI challenge, which was constructed specifically by domain experts for testing such algorithms. In both evaluations, GRAFF-MS outperforms both CFM-ID and NEIMS, predicting 76% of the CASMI 2016 spectra with a cosine similarity of at least 0.7 to the actual spectrum. Below is an example of two molecules from the test set which are extremely similar structurally, with predicted and actual mass spectra for each.

Predicted (blue) and actual (orange) mass spectra for two extremely similar molecules. GRAFF-MS is able to make predictions that respond to fairly subtle changes in molecular structure.
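The cosine similarity behind these numbers is the standard vector measure applied to predicted and measured spectra aligned on a shared set of mass positions. A minimal version (with made-up intensity vectors for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two aligned spectral intensity vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical intensities over five shared mass positions:
pred = np.array([0.0, 5.0, 1.0, 0.0, 3.0])
true = np.array([0.0, 4.0, 0.5, 0.2, 3.5])
score = cosine_similarity(pred, true)  # close to 1.0 when the spectra match well
```

A score of 1.0 means the spectra have identical relative peak intensities; the 0.7 threshold in the benchmark corresponds to predictions that capture most of the dominant peaks.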

Furthermore, GRAFF-MS is a much more resource-efficient way to generate large in silico spectral datasets, particularly for larger molecules (>500 Da), where CFM-ID’s bond-breaking simulation becomes expensive. Since GRAFF-MS is a neural network, it can take advantage of GPU acceleration. We estimate that generating predicted spectra for the 2.3 million ChEMBL 3.1 molecules under 1,000 Da, for example, would take 2 hours for GRAFF-MS running on a single GPU, while the same task would take 96 hours running CFM-ID continuously in parallel on 64 CPUs.

By increasing both the speed and accuracy of spectrum prediction, we can now build much larger and more diverse reference libraries, making our efforts to identify metabolites from nature to inspire new medicines faster and more effective.

Next steps: way more data

Machine learning models like GRAFF-MS promise to unlock truly large-scale metabolomics by learning the language of mass spectra and the language of chemical structures, and how to translate between them. To truly achieve this, more research, more development, and more data are still needed.

It’s also important to note that GRAFF-MS can only predict spectra for known molecules. Predicting across the entire space of plausible molecules, rather than only those already known, could help illuminate the truly dark chemical space of unknown metabolites in nature. We’ve discussed our efforts to predict properties of metabolites directly from spectra, without a reference, in an earlier post and this preprint. Since those predictions typically come from machine learning models trained on spectral reference databases, a good fragmentation model like GRAFF-MS can also augment that training data enormously.

Relative to other -omics fields, metabolomics is in its infancy in terms of gathering data: as we’ve mentioned before, publicly available datasets have spectra for only around 50,000 molecules. At Enveda Biosciences, we see this as a huge opportunity. We are already at the forefront of machine learning in metabolomics, and through the course of identifying novel molecules to pursue as drug candidates, we will continue to gather the largest dataset of labeled and unlabeled metabolomics data in the world to train better and better machine learning models. If you’re interested in working with cutting-edge machine learning in a field with wide-open possibilities like metabolomics, please reach out!


[1] The other way is through NMR, which works in a totally different way than mass spectrometry, and which we will not discuss in any detail except to say that of the two methods, MS/MS is by far the more common and scalable way to get a handle on all the molecules in a biological sample. It is very sensitive: it can detect even very low-abundance molecules, which make up the vast majority of molecules in a typical sample, and can detect thousands of molecules in very complex samples in a single pass.

[2] The initial, or MS1, spectrum records the masses of all the ions in the sample, after which individual molecules are selected for fragmentation, producing the MS2 spectrum. You can go even further and fragment the fragments (MS3, MS4, etc.), but we will just talk about MS2 for now; it’s the most common and scalable method.

[3] We should note another common variation on the way people augment spectral lookup with known metabolite structure. Models like CSI:FingerID and MIST use molecular fingerprints instead of predicted spectra, where the fingerprints are predicted from the spectra and then compared to computed fingerprints from known metabolites.
