July 28, 2025

How Public Data are Fueling the Next Wave of Drug Innovation

Contributing Writer

By Ryan Flinn

Overview

Boltz-2, an open-source AI model, and Tahoe-100M, an open dataset, are helping resource-strapped healthcare entrepreneurs and academic scientists accelerate AI-driven drug discovery without the high data costs.

Boltz-2 and Tahoe-100M Give Biotech Startups and Academic Labs Access to Big Pharma-grade Data

Two groundbreaking open-source resources, Boltz-2 and Tahoe-100M, are unlocking powerful capabilities once limited to Big Pharma, giving startups and academic researchers access to tools that can dramatically accelerate and de-risk the R&D involved in discovering new medicines.

These recently released, freely available resources let scientists explore how drugs might interact with the body and reveal how different cells respond to treatments, without large budgets or proprietary constraints.

Boltz-2 matches industry gold standards for accuracy in predicting how tightly drugs bind to their targets, yet it can make those predictions up to 1,000 times faster than traditional methods. Tahoe-100M covers 100 million cells and is 50 times larger than previous public drug response datasets.

Boltz-2 launched in June from a collaboration between Recursion, the MIT Jameel Clinic, and NVIDIA. The model was trained on Recursion’s supercomputer and developed with input from academic researchers at MIT. Vevo Therapeutics (since renamed Tahoe) created Tahoe-100M with single-cell analysis from Parse Biosciences and sequencing from Ultima Genomics. The full dataset is now available as part of the Arc Institute’s Virtual Cell Atlas, which launched in February.

Thousands of researchers are already using them. Boltz-2 has been downloaded more than 170,000 times by 41,500 unique users, according to Recursion. The company said companies such as Tamarind Bio, Rowan, deepmirror, and ReSync Bio have already integrated Boltz-2 into their platforms, while NVIDIA announced software improvements that double Boltz-2’s speed and reduce its memory requirements.

“Boltz-2 gives R&D teams a powerful tool to triage more effectively and focus resources on the most promising compounds,” said Najat Khan, PhD, Recursion’s Chief R&D Officer and Chief Commercial Officer, in a LinkedIn post. “Collaborations like this, bridging academic innovation and industry application, play an important role in advancing the field and, ultimately, improving how we develop and deliver medicines for patients.”

“Accurately predicting how strongly molecules bind has been a long-standing challenge in drug discovery—one that required novel machine learning and computer science techniques to address,” said Regina Barzilay, PhD, a professor at MIT and AI faculty lead at the Jameel Clinic.

While Boltz-2 tackles the challenge of predicting how tightly drugs bind to their targets, Tahoe is focused on how drugs affect individual cells, particularly in cancer. Tahoe-100M maps how more than 1,200 drug treatments change gene activity in 100 million cancer cells, representing 50 different tumor types.

“Now researchers can analyze together both observational natural cell states and cells that have been deliberately perturbed by drugs or chemicals to see how they respond,” said Dave Burke, PhD, Arc Institute’s Chief Technology Officer, in a news release.

Johnny Yu, PhD, Chief Scientific Officer and Co-Founder of Tahoe (formerly Vevo Therapeutics), said that instead of just looking at how a drug attaches to a single protein in a lab dish, the new dataset makes it possible to see how drugs actually affect real patient cells, showing in detail how each drug changes the activity of individual cells and genes.

“This accelerates the path towards building clinical products at a fraction of the cost and allows us to pick the best drug molecules, identify the best drug combinations to increase efficacy, and find the patients that are most likely to respond to them, all in one experiment,” he said in a news release.

Nima Alidoust, PhD, Tahoe CEO and Co-Founder, said that while AI models have emerged that can predict protein structures and functions, his company’s goal is to build AI models that can predict how diseased cells interact with potential drug molecules. Making the dataset free and available to everyone helps change the drug discovery dynamic.

“Open sourcing a dataset of this magnitude is a momentous step towards creating a more open and collaborative community in biological research, which can ultimately help us design better therapeutics for patients,” Alidoust said in a news release. “It further demonstrates our confidence in our ability to generate transformative datasets and reflects our commitment to enabling researchers worldwide to build innovative AI models.”
