Introduction
Amongst the greatest pleasures of being involved in science is getting to see a deeply knowledgeable scientist stand up, look at what their peers are working on, and loudly proclaim (for others to hear) that it’s all bullshit — while having plenty of evidence to back up such a strong claim. It’s an invitation to observe a usually intensely private side of the scientific process, something typically discussed behind closed doors or indirectly addressed via a competing paper. It’s also, in my opinion, one of the best ways to understand the limitations of any given field.
This is a compilation of the posts I’ve seen in this genre that I really like. I’ll attach a link to each post and go over the following things:
Who posted it + when they posted it
A short TLDR of the point the article is making
Any further comments I have on the article
One quick note before we move on: are these sorts of articles always, strictly speaking, correct? Asserting something negative is always challenging — some would say impossible outside of very specific areas. These articles should serve to entertain and inform, but they are an opinion amongst many. While I’d broadly agree with many of the articles I’m posting here, always look into things yourself before trusting any one person!
Clinical Medicine
The pursuit of noninvasive glucose - hunting the deceitful turkey
Who/when: This is by John L. Smith, written in 2023. There have been nine versions of this post; the ninth (the one linked here) will unfortunately be the last one John releases.
TLDR: Noninvasive continuous glucose monitors have been worked on for decades by dozens of different companies using dozens of different approaches. They have never, ever worked. And it’s unlikely they ever will, given the biological and physical limitations of the noninvasive route.
My thoughts: I read this back when I was a college sophomore interning at a physiological device company, and I think it partially drove me to realize how intensely interesting medicine is as a field. The post touches on so many cool concepts over 250 (!) pages; I highly, highly recommend you at least read this one if nothing else.
Hidden stratification causes clinically meaningful failures in machine learning for medical imaging
Who/when: This is by Lauren Oakden-Rayner, written in 2019. I have named this entry after the published paper, but have attached the blog post Lauren wrote about it, since it’s more readable.
TLDR: Medical imaging models can incorrectly categorize distinct medical conditions due to unacknowledged variability within disease classes, largely because this variability is not reflected in the labels of the training dataset (referred to in the paper as ‘hidden stratification’). Because of this, AI performance in medical imaging settings is massively overestimated.
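To make the failure mode concrete, here’s a minimal sketch (all numbers invented for illustration, not taken from the paper) of how a strong aggregate metric can hide a collapse on a clinically critical subclass — think of a detector that keys on an easy, common presentation of a disease while missing the subtle, urgent one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical evaluation: 1,000 positive cases of some disease class,
# with a hidden split into an easy, common subclass and a subtle,
# clinically urgent one. The labels never record this split.
n = 1_000
subclass = rng.choice(["easy", "subtle"], size=n, p=[0.85, 0.15])

# A model that nails the common subclass but misses the critical one.
hit_rate = np.where(subclass == "easy", 0.95, 0.40)
detected = rng.random(n) < hit_rate

print(f"aggregate sensitivity: {detected.mean():.2f}")  # looks fine (~0.87)
for s in ("easy", "subtle"):
    mask = subclass == s
    print(f"{s:>7} sensitivity: {detected[mask].mean():.2f}")
```

On the aggregate number alone this model looks deployable; stratify by the hidden subclass and the urgent cases are being missed more than half the time.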
My thoughts: I bring this post up so often! The hidden stratification concept is extremely useful, and it shows up in all sorts of ML fields. I hope Lauren eventually writes an update on the state of hidden stratification in medical imaging; it would be interesting to know how the field has moved forward since 2019.
The amoral nonsense of Orchid’s embryo selection
Who/when: This is by Lior Pachter, written in 2021.
TLDR: Orchid Health (a fertility startup) plans to use polygenic risk scores (PRS) derived from GWAS studies to select IVF embryos at reduced risk for complex diseases like schizophrenia and heart disease. Lior has three main points. One, the predictive power of PRS is extremely weak, so most customers are likely wasting their money. Two, many disease-associated genes are pleiotropic, meaning they are simultaneously involved in multiple unrelated physiological processes, so optimizing for one variant over another introduces completely unknown risks. Three, most GWAS studies largely focus on individuals with European ancestry, so their results may be entirely useless to anybody of a different ancestry.
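For context, a PRS is mechanically very simple: a weighted sum of a person’s risk-allele counts, weighted by GWAS effect sizes. A minimal sketch with made-up numbers (in practice such scores often explain only a small fraction of the variance in complex traits, which is the substance of Lior’s first point):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented inputs: per-variant GWAS effect sizes and one person's
# risk-allele counts (0, 1, or 2 copies at each variant).
n_variants = 10_000
betas = rng.normal(0.0, 0.01, n_variants)       # effect size per variant
allele_counts = rng.integers(0, 3, n_variants)  # this person's genotype

# The polygenic risk score is just the dot product of the two.
prs = float(allele_counts @ betas)
print(f"polygenic risk score: {prs:.3f}")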
My thoughts: Nothing to comment on, cool post!
Why conventional wisdom on health care is wrong (a primer)
Who/when: This is by Random Critical Analysis (anonymous), written in 2016. This person generally writes a lot about healthcare misconceptions (but hasn’t posted anything since 2020) and has separate posts about many of their individual points, like this one; this post is just a compilation of the main arguments.
TLDR: People living in the US have a weirdly low average lifespan given how much we spend on healthcare. The usual explanations are healthcare mismanagement, poverty, healthcare access, and so on. But the author argues that none of those compare to two simple facts: the US has high rates of diseases of affluence (obesity, lack of exercise, etc.) AND the US has massive amounts of money to spend on (badly) curing them. In summary: diminishing returns to spending and worse lifestyle factors explain America’s mediocre health outcomes. I’m leaving out a lot; this essay is probably the most information-dense thing I’m posting.
My thoughts: This is a really controversial post. It goes so counter to the usual mindset about why healthcare spending in the US is so messed up, but it presents a ridiculous amount of evidence to back up its points. Do I believe it? I’m not sure. I think what one does walk away with is an understanding that healthcare economics is stupidly complicated. I would love to know if anyone has a strong rebuttal to the arguments laid out here!
Dissecting racial bias in an algorithm used to manage the health of populations
Who/when: This is by Ziad Obermeyer, written in 2019.
TLDR: Honestly, it’d be hard to summarize this in a way that is better than their own summary. I’ll just repeat it: “For millions of patients across the US, hospitals use commercial risk scores to target those needing extra help with complex health needs. We examine a widely used commercial algorithm for racial bias. Thanks to a unique dataset, we also study the algorithm’s construction, gaining a rare window into the mechanisms of bias. We find significant racial bias: at the same risk score, blacks are considerably sicker than whites….We isolate the problem to the algorithm’s objective function: it predicts costs, and since blacks incur lower costs than whites conditional on health, accurate cost predictions produce racially biased health predictions. We find suggestive evidence of a “problem formulation error”: as algorithmic prediction is in a nascent stage, convenient choices of proxy labels to predict (in this case, cost) can inadvertently produce biases at scale.”
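The mechanism is easy to reproduce in a toy simulation. Below is a sketch of my own (every distribution and the 0.7 access factor are invented assumptions, not the paper’s data) showing how an accurate cost predictor becomes a biased health predictor when one group incurs lower costs at the same level of need:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy population: a latent "health need", with Black patients incurring
# lower costs at the same need (standing in for the access gap the
# paper documents). All numbers here are invented for illustration.
group = rng.choice(["white", "black"], size=n)
need = rng.gamma(shape=2.0, scale=1.0, size=n)
access = np.where(group == "black", 0.7, 1.0)
cost = need * access * rng.lognormal(0.0, 0.2, size=n)

# A perfectly "accurate" cost model, used as the risk score.
score = cost
cutoff = np.quantile(score, 0.97)  # top 3% get flagged for extra care

for g in ("white", "black"):
    flagged = (group == g) & (score >= cutoff)
    print(f"{g}: mean need among flagged = {need[flagged].mean():.2f}")
```

At the same score cutoff, the flagged Black patients come out considerably sicker — the same pattern the paper reports, produced entirely by the choice of cost as the proxy label.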
My thoughts: All of Ziad’s articles are really fun to read through. I was first introduced to his work via an ex-manager and I’ve been a fan since. He’s made somewhat of a career out of these ‘there’s a massive flaw in the current system’ studies and does them extraordinarily well. If you’re interested in further reading, I’d recommend checking out his article on unexplained pain disparities in marginalized communities and his more recent article on the clinical consequences of copays.
Deep learning on electronic medical records is doomed to fail
Who/when: This is by Brian Kihoon Lee, written in 2022.
TLDR: As the title implies, deep learning on electronic medical records isn’t going to work. Brian lays out a few reasons for this. One, the data is extremely fragmented amongst the players in the space (Epic, Cerner, etc.), and while dataset interoperability is underway, it’ll be a long time till it happens. Two, there are strong distribution shifts in how to interpret health record data from hospital to hospital; each one has its own unique workflows that ML models may overfit to. Three, medical records exist not to explain medical information but to support billing, so a pretty significant number of datapoints in any records dataset may just be an insurance hack.
My thoughts: All of Brian’s essays are great reads; this one just deeply resonated with me as someone who spent two years applying ML to healthcare claims data. The whole subject is just unbelievably messy; it’s hard to imagine any company in the space really figuring it out.
Biology
Structure is beauty, but not always truth
Who/when: This is by James S. Fraser, written in 2024.
TLDR: Protein structures are often trusted as a good basis for building a therapeutic (e.g. a binder), but they shouldn’t be, and there are four reasons why. One, high-confidence AF2-predicted structures can still be meaningfully wrong because their underlying training dataset (experimentally derived structures) is reasonably noisy. Two, proteins move around a lot, and even minor movements can cause meaningful differences in how ligands/proteins interact with them, yet most structures include no movement information. Three, an experimentally derived structure may look completely different from how the protein exists in vivo, due to post-translational modifications and (possibly) being part of a larger complex. Four, proteins don’t exist in isolation, and any therapeutic intended to interact with one protein will also necessarily interact with many other proteins on the way.
My thoughts: It’s a short paper, but an excellent read. It was first recommended to me by a friend in reply to my computational toxicity post, and I can see why: the challenge of predicting toxicity ties in quite well with the aforementioned four problems. What was especially interesting was the section on in-vitro structures being deceiving. It’s not something I think about very often! I’m often pondering how AF2-predicted structures are wrong, how proteins are flexible, and how any therapeutic will touch many proteins in its journey around the body. But I do often look at experimentally derived structures and think ‘okay, this is what the protein definitely looks like’, and that’s not true at all!
The specious art of single-cell genomics
Who/when: This is by Lior Pachter, written in 2021. Another Lior post! This is the title of a paper, but I’m linking the Twitter thread about it, since it’s a decent overview of the core concept.
TLDR: UMAP and t-SNE are commonly used to showcase low-dimensional representations of scRNA-seq data. They make for very pretty plots! But they absolutely, without a shadow of a doubt, should not be used for interpreting the data. Ever. The ‘layout’ of the data created by t-SNE/UMAP is entirely arbitrary and can be made to resemble basically any desired shape. What to do instead? Train an autoencoder and use its two-dimensional hidden layer to plot things!
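For what that alternative looks like in practice, here’s a minimal PyTorch sketch of the idea (my own toy version, with random data standing in for a real log-normalized counts matrix; the paper’s actual setup differs):

```python
import torch
import torch.nn as nn

# Stand-in data: a (cells x genes) matrix of log-normalized counts.
n_cells, n_genes = 2_000, 500
X = torch.randn(n_cells, n_genes)

# Encoder compresses each cell to a 2-d bottleneck; decoder reconstructs.
encoder = nn.Sequential(nn.Linear(n_genes, 64), nn.ReLU(), nn.Linear(64, 2))
decoder = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, n_genes))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

for epoch in range(200):
    opt.zero_grad()
    z = encoder(X)                                # (n_cells, 2)
    loss = nn.functional.mse_loss(decoder(z), X)  # reconstruction error
    loss.backward()
    opt.step()

# The 2-d bottleneck doubles as the plot coordinates.
with torch.no_grad():
    embedding = encoder(X)                        # (n_cells, 2)
```

The point is that the 2-d coordinates are whatever the reconstruction objective forces them to be, rather than the output of a neighborhood-graph layout algorithm with largely arbitrary geometry.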
My thoughts: Lior is well known for being extremely anti-UMAP/t-SNE, and this is a good starter post for understanding where he is coming from. Reading the quote-tweets is also pretty entertaining: here is a post by somebody disagreeing with him, here is Lior’s response, and here is the response to that response. So many of these! I don’t work much in scRNA-seq outside of side projects, so I don’t have a dog in this fight, but I’m much more on Lior’s side here.
Biology + ML
We need better benchmarks for machine learning in drug discovery
Who/when: This is by Pat Walters, written in 2023.
TLDR: Small molecule datasets used for ML have a ton of problems. Here they are: invalid chemical structures that can't be parsed by common cheminformatics tools, inconsistent stereochemistry/chemical representations, data combined from different sources without standardization, poorly defined training/test splits, data curation errors like duplicate structures with conflicting labels, and assays with high rates of artifactual activity. All of this results in current toxicity models being basically unusable, and the post ends with a call to the academic community to release more curated datasets rather than yet another molecular LLM.
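Two of those failure modes, unparseable structures and conflicting duplicate labels, are easy to screen for yourself. A minimal sketch using RDKit, with a made-up four-row dataset ("CCO" and "OCC" are the same molecule written two ways):

```python
from collections import defaultdict
from rdkit import Chem

# Toy (SMILES, activity) data with two planted problems: an unparseable
# structure ("C1CC" has an unclosed ring) and a duplicate molecule with
# conflicting labels ("CCO" and "OCC" are both ethanol).
data = [("CCO", 0), ("c1ccccc1", 1), ("C1CC", 1), ("OCC", 1)]

labels_by_molecule = defaultdict(set)
for smi, label in data:
    mol = Chem.MolFromSmiles(smi)  # returns None on invalid input
    if mol is None:
        print(f"dropping unparseable structure: {smi}")
        continue
    # Canonical SMILES collapses different spellings of the same molecule.
    labels_by_molecule[Chem.MolToSmiles(mol)].add(label)

for canonical, labels in labels_by_molecule.items():
    if len(labels) > 1:
        print(f"conflicting labels for {canonical}: {sorted(labels)}")
```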
My thoughts: Besides this being a fun post to read, it was fairly reassuring after going through the cheminformatics ML literature; understanding what’s good and bad is genuinely challenging and not just a me problem. Pat Walters’s blog in general has a ton of great articles in this same skeptical direction on the current generation of small molecule AI tools; this and this are my other favorite articles by him.
A future history of biomedical progress
Who/when: This is by markov.bio (a stealth biotech startup run by Adam Green), written in 2022.
TLDR: The inclusion of this post is a little wonky, because it is pessimistic in the first half and optimistic in the second. The core thesis of the first half is that modern biomedicine, despite seemingly revolutionary advances, is living in an age of stagnation; very little progress has been made in solving most diseases. This is because, the author argues, biomedical research has historically relied on human-legible explanations of biological phenomena to make biological predictions. That mindset has immense diminishing returns as we try to create therapies for conditions that cannot be simplified down to a single pathway, and many conditions interact with nearly all pathways, of which we still barely understand the full scope. The author terms this approach to drug design the ‘mechanistic mind’. There is an alternative mindset, which defers full understanding of a system to an ML model that has implicitly learned the dynamics of the biological system but is completely uninterpretable to a human mind. The second half focuses on this mindset, expanding on how such a model could be built. The answer? In a word, scale.
My thoughts: Most people who work in the bio-ML intersection have likely read this essay; it has become surprisingly famous. The first half is a very nice isolation of the problem with a lot of modern biological research, plus an excellent recap of the history of the field. I think the second half of the essay is interesting but somewhat less compelling; I’m really hoping for follow-up posts from them!
Machine Learning
Deep reinforcement learning doesn't work yet
Who/when: This is by Alex Irpan, written in 2018.
TLDR: Deep RL has a lot of potential, but it doesn’t currently work well. One, it’s extremely sample-inefficient, making it hard to apply in any situation where you don’t have access to a fast simulator (e.g. an Atari game). Two, the problems that can be solved by RL are often better solved by basic MCTS (the author notes that this is an unfair comparison, but like, still!). Three, reward function design is really hard to do well and can easily lead to weird behavior if done incorrectly (see the sketch below). Finally, most deep RL results are really hard to reproduce due to extreme training instability.
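The reward design point is the easiest to demonstrate. Here’s a small self-contained sketch (my own toy construction, not from the essay): tabular Q-learning on a 5-cell corridor, where a dense ‘reward for being near the goal’ teaches the agent to loiter next to the goal forever instead of reaching it, while a sparse reach-the-goal reward works fine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, GOAL, HORIZON = 5, 4, 20  # corridor cells 0..4, start at cell 0

def step(pos, action):
    """Move left (0) or right (1); the episode ends on reaching the goal."""
    pos = max(0, min(N - 1, pos + (1 if action == 1 else -1)))
    return pos, pos == GOAL

# Misspecified: dense reward for being *near* the goal every step.
near_reward = lambda pos, done: 1.0 / (1.0 + abs(GOAL - pos))
# Correct but sparse: reward only for actually reaching the goal.
sparse_reward = lambda pos, done: 1.0 if done else 0.0

def greedy(Q, pos):
    q = Q[pos]
    return int(rng.choice(np.flatnonzero(q == q.max())))  # random tie-break

def train(reward_fn, episodes=5000, eps=0.2, lr=0.1, gamma=0.95):
    Q = np.zeros((N, 2))
    for _ in range(episodes):
        pos = 0
        for _ in range(HORIZON):
            a = rng.integers(2) if rng.random() < eps else greedy(Q, pos)
            nxt, done = step(pos, a)
            r = reward_fn(nxt, done)
            target = r + (0.0 if done else gamma * Q[nxt].max())
            Q[pos, a] += lr * (target - Q[pos, a])
            pos = nxt
            if done:
                break
    return Q

for name, fn in [("near-goal reward", near_reward), ("sparse reward", sparse_reward)]:
    Q, pos, reached = train(fn), 0, False
    for _ in range(HORIZON):
        pos, reached = step(pos, greedy(Q, pos))
        if reached:
            break
    # With the dense reward the agent hovers one cell away, harvesting
    # reward forever; reaching the goal would end the episode and the income.
    print(f"{name:>16}: reached goal = {reached}")
```

The dense reward looks like helpful shaping, but because reaching the goal terminates the episode, the discounted value of loitering next to it exceeds the one-time payoff of arriving, which is exactly the kind of quiet misbehavior the essay warns about.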
My thoughts: This post is burnt into my head because I first read it after taking David Silver’s RL class back in 2020. I felt incredibly hyped about the future of the field after the class, and then came crashing back down to earth after I read this essay. I stopped thinking about deep RL entirely and became, for the most part, anti-RL based entirely on this essay. This is likely not at all what the author intended, given that he is a deep RL researcher! How has the criticism aged, especially given the extreme success of RLHF? I stumbled across an interesting r/machinelearning thread from January 2024 where lots of anecdotal experiences with deep RL are shared, and most of them are explicitly negative. How come RLHF worked so well then? It may simply be a case of fine-tuning from human feedback not being a uniquely hard problem, given that Direct Preference Optimization (which has zero reliance on RL) also seems to work quite well. All this to say: the original post seems to still be accurate.
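For reference, the reason DPO counts as ‘zero reliance on RL’ is visible in its objective, which (as stated in the DPO paper) is just a logistic loss over preference pairs, optimizable with ordinary supervised gradient descent. Here $y_w$ and $y_l$ are the preferred and dispreferred completions for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference policy, and $\beta$ is a temperature:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

No rollouts, no value function, no reward model in the loop: which is why its success says more about preference fine-tuning being an easy problem than about deep RL having been fixed.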