7 Comments
Surag Nair:

On the point of data quality over quantity: if the end goal is to make patient-level predictions (e.g., response to therapy), won't we eventually need large-scale data (even 10-100k+ patients)? High-dimensional, multi-modal data per patient is crucial, but with few patients the analysis risks becoming more descriptive than predictive. That's still great for hypothesis generation, but maybe not for ML. One analogy is models that predict sex from retinal images, where the signal is real and non-obvious but only becomes robust and generalizable with scale.

Abhishaike Mahajan:

i think it is an open question how much data is necessary! in the short term, i am much more bullish on hypothesis generation, which is also why it is good that Noetik's collected dataset is (currently) one of a kind. i agree data throughput will need to improve regardless, but the bottleneck is much more on the machine side, and people besides us (spatial transcriptomics companies) are working hard on that

zdk:

All the good ones are leaving NY for SF 😞

Abhishaike Mahajan:

currently plan to stay in NY! at least for the moment

Matt Schwartz:

I think there's an opportunity to combine quantity and quality. In endoscopy, we're finding that we can use massive quantities of unlabeled data to train a self-supervised encoder. That encoder then lets us train downstream application decoders with relatively small datasets that are well-curated and labeled. The example we've shown so far: we can take the placebo arm of a Phase 3 ulcerative colitis trial, about 300 patients, and classify the responders vs. non-responders from only their baseline colonoscopy video!
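A minimal sketch of that two-stage recipe, on synthetic data. PCA stands in for the real self-supervised objective, and every name and number here is illustrative, not Matt's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared low-dimensional structure: observed features are linear images of
# 8 latent factors (a toy stand-in for real video-frame embeddings).
n_latent, n_obs = 8, 64
B = rng.normal(size=(n_latent, n_obs))

# --- Stage 1: pretrain an encoder on abundant *unlabeled* data. ---
# PCA here plays the role of the self-supervised objective: the top
# principal directions become a frozen encoder learned without labels.
H_u = rng.normal(size=(5000, n_latent))
X_unlabeled = H_u @ B + 0.1 * rng.normal(size=(5000, n_obs))
mu = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mu, full_matrices=False)
encoder = Vt[:n_latent].T                      # 64-dim -> 8-dim projection

def encode(X):
    return (X - mu) @ encoder

# --- Stage 2: small, well-curated *labeled* cohort (e.g. ~300 patients). ---
H_l = rng.normal(size=(300, n_latent))
X_labeled = H_l @ B + 0.1 * rng.normal(size=(300, n_obs))
y = (H_l[:, 0] > 0).astype(float)              # responder vs. non-responder

# Lightweight "decoder": logistic regression on frozen, standardized features.
Z = encode(X_labeled)
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
w = np.zeros(n_latent)
for _ in range(500):                           # plain gradient descent
    p = 1.0 / (1.0 + np.exp(-Z @ w))
    w -= 0.1 * Z.T @ (p - y) / len(y)

# Training accuracy, just to show the pipeline wires up end to end.
accuracy = ((Z @ w > 0).astype(float) == y).mean()
```

Because the encoder is learned from the large unlabeled pool, the labeled cohort only has to fit a tiny linear head, which is what makes ~300 labeled patients workable.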

Eric Kernfeld:

Hi Dr. Owl. I spent a big chunk of my Ph.D. evaluating counterfactual predictions about genetic perturbation outcomes. I spent some time looking at the OCTO-VC demos and I found it very worrisome. There is a growing graveyard of similar models that seem to do worse than the mean of their training data. Here are 8 independent evaluations that differ in many details but are all broadly compatible with poor performance of virtual cell predictions.

Ahlmann-Eltze et al.: https://www.biorxiv.org/content/10.1101/2024.09.16.613342v5
Csendes et al.: https://pmc.ncbi.nlm.nih.gov/articles/PMC12016270/
PertEval-scFM: https://icml.cc/virtual/2025/poster/43799
scEval: https://www.biorxiv.org/content/10.1101/2023.09.08.555192v7
C. Li et al.: https://www.biorxiv.org/content/10.1101/2024.12.20.629581v1.full
L. Li et al.: https://www.biorxiv.org/content/10.1101/2024.12.23.630036v1
Wong et al.: https://www.biorxiv.org/content/10.1101/2025.01.06.631555v3
My Ph.D. work: https://www.biorxiv.org/content/10.1101/2023.07.28.551039v2

I would be interested to hear your thoughts on this. Are you worried about it? If OCTO-VC doesn't predict counterfactuals well, how will that affect Noetik's strategy?
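For concreteness, the "mean of the training data" baseline those evaluations use is simple to state. A toy sketch on synthetic profiles (all names and numbers illustrative) of how a model can lose to it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: post-perturbation expression profiles over 200 genes
# for 50 training perturbations, plus one held-out perturbation whose true
# profile sits near the overall mean (as most perturbations do).
n_genes = 200
train_profiles = rng.normal(size=(50, n_genes))
held_out_truth = train_profiles.mean(axis=0) + 0.5 * rng.normal(size=n_genes)

# The baseline: predict the training mean for every unseen perturbation.
baseline_pred = train_profiles.mean(axis=0)

# A hypothetical model whose prediction adds uninformative structure on top
# of the mean -- the failure mode the cited evaluations report.
model_pred = baseline_pred + 1.0 * rng.normal(size=n_genes)

def mse(pred):
    return float(np.mean((pred - held_out_truth) ** 2))

baseline_mse, model_mse = mse(baseline_pred), mse(model_pred)
```

Any model whose perturbation-specific signal is mostly noise will score worse than this constant predictor, which is why the baseline is such a useful sanity check.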

Abhishaike Mahajan:

what specifically was worrying about the OCTO demo?
