On the point of data quality over quantity: if the end goal is to make patient-level predictions (e.g., response to therapy), won’t we eventually need large-scale data (perhaps 10-100k+ patients)? High-dimensional, multi-modal data per patient is crucial, but with few patients the analysis risks becoming more descriptive than predictive. That’s still great for hypothesis generation, but maybe not for ML. One analogy is models that predict sex from retinal images, where the signal is real and non-obvious but only becomes robust and generalizable at scale.
i think it is an open question how much data is necessary! i think in the short term, i am much more bullish on hypothesis generation, which is also why it is good that noetik’s collected dataset is (currently) one of a kind. i agree data throughput will need to improve regardless, but the bottleneck is much more on the machine side, and people besides us are working hard on that (spatial transcriptomics companies)
I think there's an opportunity to combine quantity and quality. In endoscopy, we're finding that we can use massive quantities of unlabeled data to train a self-supervised encoder. That encoder lets us train downstream application decoders with relatively small, well-curated, labeled datasets. The example we've shown so far: from the 300-patient placebo arm of a Ph3 ulcerative colitis trial, we can classify responders vs. non-responders from their baseline colonoscopy videos alone!
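For concreteness, here is a minimal sketch of that pattern: freeze a self-supervised encoder pretrained on the large unlabeled corpus and train only a small head on the curated labeled set. The encoder, loader, and label scheme below are placeholders, not the actual endoscopy system.

```python
# Minimal sketch of the "pretrain big, fine-tune small" pattern described above.
# The encoder, embedding size, data loader, and label scheme are all placeholders,
# not the actual endoscopy pipeline.
import torch
import torch.nn as nn


class FrozenEncoderClassifier(nn.Module):
    """A frozen self-supervised encoder with a small trainable decoder head."""

    def __init__(self, encoder: nn.Module, embed_dim: int, n_classes: int = 2):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the SSL features fixed
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        with torch.no_grad():
            z = self.encoder(x)  # features from the pretrained encoder
        return self.head(z)


def train_head(model: FrozenEncoderClassifier, loader, epochs: int = 10, lr: float = 1e-3):
    """Train only the small head on a curated labeled set (e.g., responder vs. non-responder)."""
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:  # a few hundred labeled patients, not millions of frames
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```

Because only the head's parameters are fit, a labeled cohort of a few hundred patients can be enough; the heavy lifting comes from the unlabeled pretraining.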
Hi Dr. Owl. I spent a big chunk of my Ph.D. evaluating counterfactual predictions about genetic perturbation outcomes, and after spending some time with the OCTO-VC demos I found them very worrisome. There is a growing graveyard of similar models that seem to do worse than the mean of their training data. Here are 8 independent evaluations that differ in many details but are all broadly consistent with poor performance of virtual cell predictions.
Ahlmann-Eltze et al.: https://www.biorxiv.org/content/10.1101/2024.09.16.613342v5
Csendes et al.: https://pmc.ncbi.nlm.nih.gov/articles/PMC12016270/
PertEval-scFM: https://icml.cc/virtual/2025/poster/43799
scEval: https://www.biorxiv.org/content/10.1101/2023.09.08.555192v7
C. Li et al.: https://www.biorxiv.org/content/10.1101/2024.12.20.629581v1.full
L. Li et al.: https://www.biorxiv.org/content/10.1101/2024.12.23.630036v1
Wong et al.: https://www.biorxiv.org/content/10.1101/2025.01.06.631555v3
My Ph.D. work: https://www.biorxiv.org/content/10.1101/2023.07.28.551039v2
I would be interested to hear your thoughts on this. Are you worried about it? If OCTO-VC doesn't predict counterfactuals well, how will that affect Noetik's strategy?
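For readers outside the subfield, the "worse than the mean of their training data" critique boils down to a simple baseline check: does the model's prediction for a held-out perturbation beat a prediction that just reuses the average training-set effect? A minimal sketch, with hypothetical inputs and a plain MSE metric (the cited evaluations differ in their exact metrics and splits):

```python
# Illustrative version of that baseline check (not any specific paper's exact protocol):
# compare the model's predicted expression changes for held-out perturbations against
# a baseline that simply predicts the average training-set effect. Array names,
# shapes, and the MSE metric are assumptions for the sketch.
import numpy as np


def mse_per_perturbation(pred: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """pred, obs: (n_perturbations, n_genes) matrices of expression changes."""
    return ((pred - obs) ** 2).mean(axis=1)


def compare_to_mean_baseline(model_pred: np.ndarray,
                             obs_test: np.ndarray,
                             obs_train: np.ndarray) -> dict:
    # Baseline: predict the mean training-set effect for every held-out perturbation.
    baseline = np.tile(obs_train.mean(axis=0), (obs_test.shape[0], 1))
    model_err = mse_per_perturbation(model_pred, obs_test)
    base_err = mse_per_perturbation(baseline, obs_test)
    return {
        "model_mse": float(model_err.mean()),
        "baseline_mse": float(base_err.mean()),
        # Fraction of perturbations where the model actually beats the dumb baseline.
        "frac_beating_baseline": float((model_err < base_err).mean()),
    }
```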
what specifically was worrying about the octo demo?
OCTO-VC showed very few examples of success, and those examples seemed to be selected by starting with known causal effects and then checking the model's predictions. It would be more reassuring to also include negative examples (perturbations with no effect) and show that OCTO-VC does not predict an effect, or to start from the model's top predictions. It would also be reassuring to see an acknowledgement of prior negative findings up front, along with a strategy for working through them, as in the txpert demo. Without that, it's hard to tell whether the OCTO-VC team considers these findings relevant.
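A rough sketch of the evaluation design being asked for here, with hypothetical inputs: score the model on known no-effect perturbations (where it should predict roughly nothing) and on its own top-ranked predictions (checked against observed effects), rather than only on hand-picked known positives.

```python
# Sketch of the evaluation the comment asks for, with hypothetical inputs:
# pred_effect / observed_effect map perturbation -> effect size, known_nulls is a set
# of perturbations with no real effect. Two numbers come out: how often the model
# predicts an effect where none exists, and how well its own top-ranked predictions
# hold up against observed effects.
import numpy as np


def evaluate_panel(pred_effect: dict, observed_effect: dict, known_nulls: set,
                   top_k: int = 20, threshold: float = 0.5) -> dict:
    # 1) Negative controls: the model should predict (roughly) nothing for these.
    null_calls = [abs(pred_effect[p]) > threshold for p in known_nulls if p in pred_effect]
    null_fp_rate = float(np.mean(null_calls)) if null_calls else float("nan")

    # 2) Model-first selection: take the model's top-ranked perturbations and ask
    #    how many have a real, observed effect (precision of the top predictions).
    ranked = sorted(pred_effect, key=lambda p: abs(pred_effect[p]), reverse=True)
    top = [p for p in ranked[:top_k] if p in observed_effect]
    top_hits = [abs(observed_effect[p]) > threshold for p in top]
    top_precision = float(np.mean(top_hits)) if top_hits else float("nan")

    return {"null_false_positive_rate": null_fp_rate, "top_k_precision": top_precision}
```

A low false-positive rate on the nulls and decent precision on the top-k are together much harder to cherry-pick than a handful of known hits.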
That's fair! We'll hopefully discuss some of the negative cases in future tech reports
also, thank you for the comment :) we'll hopefully live up to your (very understandable) standards with follow-on releases
Cool!
I want to apologize for the harshness of my comments and congratulate you on your new endeavor. I tend to get bogged down in the details, but there are a lot of different ways to get insight from -omics. Wishing you folks success.
no worries at all! i greatly appreciate the high standards and think it's much more useful information to have than pure positivity
say hi to Ron for us!
All the good ones are leaving NY for SF 😞
currently plan to stay in NY! at least for the moment
There are two similar-yet-different strategies to that of Noetik, and I'm curious for your thoughts about each of them. One direction (eg Tempus) is to first focus on scaling one's patient population, and then on getting additional modalities of data for individuals of particular interest. To be fair, at least in Tempus' case, the "get additional modalities of data" is driven by patients and doctors, not Tempus itself, but it turns out that those all select for the same thing (difficult cases with poor existing mainline treatment options). In contrast to these, it seems that Noetik is going to have data from fewer individuals yet more modalities, at least to begin with. Is your optimism about Noetik primarily driven by optimism about this strategic bet, or by optimism that Noetik has the ML chops and vision to build and utilize foundation models?
The other direction is to focus on perturbation data rather than observational data. The advantage there is that a perturbation screen hit also directly tells you how exactly to modify the disease state. (The disadvantage is of course that perturbation data is less realistically contextualized.) Do you think Noetik's models will also be able to answer the question of how to perturb disease states, or do you think that other parallel work in AIxbio (eg via protein structure models) will commodify solving this problem?
Hmm, hard to pattern match the company's plan into one of those two
Here are some general thoughts that maybe help answer the questions:
1. In-vivo, observational human data is the most valuable, and collecting this (with many modalities) to train a model is our highest priority
2. We can poke at ‘perturbational’ data by helping other pharmas design their clinical trials. We’re doing this right now with one of our partners (the Agenus partnership) and it is where I feel most bullish (blog post someday!)
3. You can poke at (less realistic) perturbational data at higher scales with mouse models, which we are currently doing as well.
4. ML is rarely a good moat, but (i think) data fed into the ML can be! I think we picked a set of data that will be very hard for other people to ‘accidentally’ acquire, eg via the Tempus/Flatiron strategy, purely bc it’s so new/expensive to gather (mostly the spatial transcriptomics). Someday this won’t be the case, but all advantages eventually fade
5. I think we have early signs that our largely observational data is good enough to get at *some* disease state perturbation, though with the caveat that it is of course not perfect. Blog post about this actually coming soon
But tbh, strategy matters a lot more than any ML or even the data fed into the ML. I think the BD team at Noetik is really quite good, though I’m at less liberty to talk about how they are pondering things
What's your take on datasets and models like State x Tahoe-100M, and their off-the-shelf value (as a function of their scale and training) compared to smaller, tailored datasets like Noetik's for hypothesis generation?