BioByte 078: a foundation model for life's chemistry, reprogramming the germ line and AI losing steam

Zahra Khwaja, Patrick Malone, Ketan Yerneni, and 3 others

Jun 07, 2024

Welcome to Decoding Bio’s BioByte: each week our writing collective highlights notable news, from the latest scientific papers to the latest funding rounds and everything in between, all in one place.

pov living our best scifi summer

What we read

PRISM: A foundation model for life’s chemistry [David Healey et al., Enveda Biosciences, 2024]

The team at Enveda Biosciences released its foundation model PRISM (Pretrained Representations Informed by Spectral Masking), which was trained on 1.2B small-molecule mass spectra and 85B tokens.

To determine which metabolites exist in an organism, such as a plant, extracts are processed via liquid chromatography and tandem mass spectrometry (LC-MS/MS). In this process, samples are separated via a gradient and individual molecules are fragmented into pieces. The masses of the fragments are measured and represented as MS/MS spectra. Mass spectra contain structural information about the original molecule, but that information can only be used if the spectrum from the sample can be matched to a reference spectrum. In practice, only a few percent of the molecules in a sample can be identified this way, because most molecules in nature have never been identified or characterized.
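To make the matching step concrete, here is a minimal sketch of comparing a query spectrum against a small reference library. It assumes a simple binned cosine similarity, a commonly used spectral matching score; the write-up does not say which scoring method Enveda or any particular library search uses. The peak values and the names `reference_A` and `reference_B` are made up for illustration.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(a, b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0 = no overlap)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative (made-up) spectra: lists of (m/z, relative intensity) peaks.
query = [(91.05, 1.0), (119.09, 0.5), (147.08, 0.2)]
library = {
    "reference_A": [(91.04, 0.9), (119.08, 0.6), (147.09, 0.1)],
    "reference_B": [(55.05, 1.0), (83.09, 0.7), (147.08, 0.1)],
}

scores = {name: cosine_similarity(query, ref) for name, ref in library.items()}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Fixed-width binning is a simplification: peaks that fall near a bin edge can land in different bins. The underlying limitation is the same either way, since a query with no similar reference spectrum cannot be identified by matching.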

PRISM adapts the BERT architecture, which is trained with masked language modeling, for use with mass spectra. Analogously, PRISM is trained with masked peak modeling: it masks peaks instead of words and learns to predict them. This training allows the model to learn the “grammar” of chemistry.
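As a rough illustration of the masking idea, the sketch below hides a random subset of peaks in a spectrum and keeps the originals as prediction targets. This is not Enveda's implementation: the function name `mask_peaks`, the 25% mask fraction, the use of `None` as the mask placeholder, and the example peaks are all assumptions made for illustration.

```python
import random

def mask_peaks(peaks, mask_fraction=0.25, seed=0):
    """Hide a random subset of (m/z, intensity) peaks.

    Returns the masked spectrum (hidden peaks replaced by None) and a
    dict mapping each hidden position to the original peak, which a
    model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(peaks) * mask_fraction))
    hidden = sorted(rng.sample(range(len(peaks)), n_mask))
    masked = list(peaks)
    targets = {}
    for i in hidden:
        targets[i] = peaks[i]
        masked[i] = None  # placeholder standing in for a [MASK] token
    return masked, targets

# Illustrative (made-up) spectrum with eight peaks.
spectrum = [
    (41.04, 0.12), (69.07, 0.35), (91.05, 1.00), (119.09, 0.48),
    (147.08, 0.22), (165.09, 0.07), (193.12, 0.15), (221.11, 0.05),
]

masked, targets = mask_peaks(spectrum)
print(masked)
print(targets)
```

In a BERT-style setup, the model would see `masked` as input and be scored on how well it recovers the entries in `targets`, which requires no human labels and is what lets pre-training use unannotated spectra.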

The team compared fine-tuning PRISM’s pretrained weights on labeled data against training a standard ML model on annotated spectra. They observed a relative improvement of 7-16% across 12 chemical drug-likeness properties. PRISM pre-training also improved structure prediction for unknown spectra by 23%.

The AI Revolution is Already Losing Steam [Wall Street Journal, May 2024]

If I had a pound for every time someone mentioned AI, LLMs, or chips, I’d be worth more than Nvidia. However, as with all hyped technologies, the promise and future may be farther away than they seem. This article by the WSJ discusses a few reasons why.

Firstly, the pace of improvement in AI is slowing. We are reaching the limits of available data to train on, such as the internet, meaning companies will need to generate new data from sources like non-digital texts, use synthetic data techniques, or focus on architectural improvements instead.

Secondly, AI may become a commodity. As AI models are increasingly open-sourced and converging around similar performance levels, companies will have to compete on cost, reducing profit margins.

Thirdly, AI is expensive to run and train. Every time a query is given to an AI, it must 'think anew,' which is costly. Without innovations in chips or AI architecture, these operational costs could lead to low margins for companies.

Last but not least, the use cases are not yet transformative. Many of us have used ChatGPT, but has it really transformed our lives?

In conclusion, these factors do not suggest that AI is overhyped; rather, they indicate that a lot of work and innovation are needed for AI to reach its full potential.

Regulatory T-Cell Complexities [Derek Lowe, In the Pipeline, 2024]

Autoimmune disease has increasingly come to the fore of biomedicine (don’t we know it), and significant work over the past several years has begun to elucidate some of the complex mechanisms of disease development and progression. Derek highlights a recent trial in Type I diabetes (T1D) which evaluated the use of autologous polyclonal expanded regulatory T cells (Tregs) in children with new-onset T1D.

Tregs have several roles, but one thing is for certain: they are a small but critical cell population in autoimmunity and immune tolerance. This has led to considerable efforts to modulate Tregs to dampen the immune response across a number of indications. In this study, children with newly diagnosed T1D had a portion of their own Tregs extracted, expanded in vitro, and then re-infused. Although earlier work over the years had demonstrated that this approach is safe, outcomes were middling at best. Patients in this trial were randomized to placebo, low-dose Tregs, or high-dose Tregs; there were no signs of any adverse events. Although high-dose patients exhibited a greater number of expanded T-cells after one week, these dropped significantly afterwards, matching the levels seen in the placebo group at three months. C-peptide levels (a marker of insulin secretion) decreased over two years of follow-up, suggesting disease progression, and were statistically similar across all groups.

Interestingly enough, the authors found that lower proliferation rates of T-cells in vitro were associated with a reduced decline in C-peptide in vivo, suggesting that lower-proliferating Tregs may actually have been more functional in vivo, while higher-proliferating ones may have led to an inflammatory response. Thus it seems, yet again, that impressive data in mouse models doesn’t translate to humans. Significant work will be necessary to understand how to modulate Tregs in vivo (whether through engineering or other approaches) to re-establish immune self-tolerance.

In vitro reconstitution of epigenetic reprogramming in the human germ line [Saitou, Nature May 2024]

The ability to artificially create sperm and egg cells in a dish would have profound implications for fertility. However, the creation of gametes requires epigenetic reprogramming. Over the course of our lives our epigenome changes, but these acquired marks must be erased in developing egg and sperm cells to enable progression towards gametogenesis. How to activate this reprogramming artificially has been a challenge. Saitou and his team found that the protein BMP is a key driver of this process. When BMP was added to the culture, the cells could progress a step further along their development cycle than without it. However, the reprogramming is not confirmed to be fully complete, with some epigenetic marks remaining. Even one aberrant mark can cause disease in later life, so more work is needed to establish a robust protocol here.

Mapping medically relevant RNA isoform diversity in the aged human frontal cortex with deep long-read RNA-seq [Ebbert et al, Nature Biotechnology, 2024]

Why it matters: A novel application of long-read RNA sequencing advances our understanding of RNA isoform diversity in the human brain and its potential relevance to neurological diseases. The study discovers isoform-level changes in Alzheimer’s disease that were missed by gene-level analysis.

A recent study published in Nature Biotech applied deep long-read RNA sequencing to 12 aged human frontal cortex samples (6 Alzheimer's disease cases and 6 controls) to comprehensively map the diversity of RNA isoforms (different versions of RNA produced from the same gene through alternative splicing). Mapping RNA isoforms from medically relevant genes and characterizing their functions could facilitate direct targeting of RNA isoforms for disease treatment.

The authors discovered hundreds of new isoforms, including from medically relevant genes, and identified isoform expression changes associated with Alzheimer's disease that were not detectable by standard gene-level analysis. The findings demonstrate the power of long-read sequencing to uncover isoform complexity and disease-associated changes in the human brain. For example, the gene TNFSF12 is not differentially expressed between AD and controls; however, the TNFSF12-219 isoform is significantly upregulated in AD whereas the TNFSF12-203 isoform is significantly upregulated in controls.
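A toy calculation shows why gene-level analysis can miss this kind of change. The sketch below assumes gene-level expression is simply the sum of its isoforms' expression; the isoform names and numbers are hypothetical and are not values from the study.

```python
import math

# Hypothetical expression values for two isoforms of one gene
# (arbitrary units; NOT data from the paper).
ad = {"isoform_A": 30.0, "isoform_B": 10.0}
control = {"isoform_A": 10.0, "isoform_B": 30.0}

def log2_fold_change(case, ctrl):
    """log2 ratio of case over control expression."""
    return math.log2(case / ctrl)

# Gene-level view: sum over isoforms before comparing groups.
gene_lfc = log2_fold_change(sum(ad.values()), sum(control.values()))

# Isoform-level view: compare each isoform separately.
isoform_lfc = {name: log2_fold_change(ad[name], control[name]) for name in ad}

print("gene-level log2FC:", gene_lfc)
print("isoform-level log2FC:", isoform_lfc)
```

Here the two isoforms move in opposite directions by the same amount, so the gene total is identical in both groups and the gene-level fold change is zero even though each isoform changes threefold.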

Notable deals

Sanofi, OpenAI and Formation Bio collaborate to build drug development models. Each player brings something unique: vast sums of data, compute and novel model architecture.
Enveda Biosciences and Microsoft collaborate to build a foundation model trained on mass spectra data. In another big tech + biotech collaboration, Enveda will utilize Microsoft's cloud compute to help train and run its PRISM model.
Cradle launches with $48M to build reversible cryo technologies to pause life for those in need of life-critical treatment.

In case you missed it

Best practices for machine learning in antibody discovery and development

What we listened to

Ground Truths

Venki Ramakrishnan: The New Science of Aging

Eric Topol

Events

AI x Bio Summit 2024 @ NYSE: Sign up here for the reception only and here if you’re interested in participating in both the conference & reception. Space is very limited!