The Twenty Percent: Where AI Stops and Your Doctor Starts

May 17, 2026

It’s 7:55 in the hotel breakfast room, day three of the global partner meeting. Over two thousand people are here, riding elevators, picking up badges, ordering coffees they’ll leave unfinished. At my table, there are four of us. Eggs and bacon on our plates, two coffee refills, a phone face down next to the salt and pepper.

The senior partner across from me, maybe mid-fifties, picks up his phone, checks the screen, and places it on the napkin in front of me. His latest annual physical is open in ChatGPT, the AI’s answer in blue under the labs. He says, “Tell me what I’m missing.”

I read through it. The numbers are his. The interpretation is solid. The recommendations aren’t quite what I’d give.

I look up. He’s watching my face, like clients do when the slides go silent. Three things are obvious. He’s got the right instinct. He’s reading his own body the way he reads a P&L. The tool covers about 80% of what a good doctor would do for him, without needing to fly anywhere. The missing 20% is what can cost you years, a decade from now, if nobody calls it out.

“What’s it not catching?” he says. The bacon is getting cold.

The Week That Made the Question

This is the conversation of the week. Two thousand partners in San Francisco for our annual offsite. AI is the topic in every panel, every hallway, every drink. The questions keep landing with me.

I’ve been on both sides of this all week. Bloodwork drawn. DEXA scan two blocks away. Then the results came back: in my hotel room at 10:47 pm, lab dashboard open, Claude on my MacBook, I was checking where each marker sits, not just whether it falls inside a normal range. Ten years ago, this would have meant a long trip to the doctor and a substantial bill.

Two truths from this week. The access upgrade is the biggest expansion of clinical reasoning for patients in 50 years. The new cognitive tax is real. The layer AI changes is narrower than the conference rooms make it sound.

Where the Model Wins

Start with imaging and the trace. This is where the evidence is clearest.

A 2017 Nature paper trained a convolutional network on 129,450 images and matched 21 board-certified dermatologists in diagnosing biopsy-proven melanoma [1]. A deep neural network in a 2019 Nature Medicine study, trained on 91,232 single-lead ECGs, beat the average cardiologist on 12 rhythm classes, with an F1 of 0.837 vs 0.780 [2]. A Mayo AI-ECG screen for asymptomatic left-ventricular dysfunction achieved an AUC of 0.93, higher than the AUCs usually cited for mammography (0.85) or cervical cytology (0.70) [3]. A Google Health mammography system reduced false positives by an absolute 5.7% and false negatives by an absolute 9.4% versus US radiologists [4]; Nature published a reproducibility challenge the same year, so cite the headline with the caveat.

That’s the specialist layer. The harder part is that the model now matches primary care on the conversational side.

A 2020 dermatology system achieved top-1 accuracy of 0.66, compared with 0.44 for primary care physicians, across 26 skin conditions that account for roughly 80% of the skin cases seen in primary care [5]. Med-PaLM 2 hit 86.5% on USMLE-style questions, and physicians preferred its consumer-question answers to physician-written ones on eight of nine clinical-utility axes [6]. GPT-4 achieved 57% accuracy in diagnosing NEJM clinicopathologic cases, outperforming 99.98% of simulated human readers [7]. Google’s AMIE, in a double-blind OSCE with 159 cases, beat 20 primary care physicians on 30 of 32 specialist-rated axes [8]. A 2025 NEJM AI RCT of an AI therapy chatbot reported a 51% reduction in depression symptoms over 4 weeks, with a therapeutic alliance comparable to that of human clinicians [9].

This is what the phone in the senior partner’s hand does. He uploads his ECG strip, maybe from his Apple Watch. It’s real. The model works.

The Twenty Percent

The opener made a promise. Here’s what it is.

The de-prescribing layer. I’ve reviewed a senior executive’s medication list and asked the basic questions. The statin started in 2014, twelve years ago, after the event. Is anyone planning to revisit? The PPI for reflux that cleared up in 2019. Still needed? The beta-blocker for situational anxiety from 2017. Has anyone looked at it this decade? AI adds to the list. The clinician who knows your story edits it down.

The hands-on layer. Roughly 1 in 20 adult men has a palpable thyroid nodule that has never been imaged. A hand exam detects tendon xanthomas, a sign of inherited dyslipidemia, before the lipid panel results come back. The difference between standing and sitting blood pressure is a frailty marker you won’t see on the panel. None of this is in the PDF you uploaded.

The non-quantifiable signal. The patient who pauses when you ask about alcohol. The wife who insisted on the appointment. Fingers tapping on the chair. The way he says the kids are fine three times. AI gets data. The clinician gets the person.

Realistic behavior change. A good doctor knows this person, with this calendar, this travel, this marriage, won’t do 14 things. They’ll do one. The doctor picks which lever to move.

From my week: My DEXA shows total body fat at the 6th percentile for men forty-five to forty-nine (13.1%). Visceral fat at the 39th (VAT mass 0.6 kg, or 639 cm³). The model finds the mismatch. I’m carrying more visceral fat than you’d expect for how lean I am (leaner than 94% of men my age). Probably genetic; central adiposity runs in my family. The doctor knows what to do: a specific cardio plan, watch ApoB and insulin trends, and factor in family history on Lp(a) no matter the number. AI spots the pattern. The clinician knows what it means for me.

A coda on what the model gets wrong. All four major LLMs tested returned race-based medicine errors on clinical questions [10]. Medical hallucination benchmarks find models confidently endorsing wrong multiple-choice answers even with “none of the above” available [11]. USMLE-style benchmarks miss the clinical dialogue and longitudinal reasoning that real care requires [12]. And the human-in-the-loop is poorly trained: in a 2024 JAMA Network Open trial, physicians plus GPT-4 scored 76% on diagnostic reasoning; GPT-4 alone scored 92% [13]. The model has capability the clinician has not yet learned to extract.

The Cognitive Tax

A breakfast later, a different partner. He looks gray. That’s what happens when you sleep five hours for six nights. He admits he’s been asking Claude one more thing every night until 1 am this week. Eight specific things. None moved.

The cost. Attention residue is the measurable performance drag when you switch from an unfinished Task A to a new Task B; the cleaner the close-out of A, the less the drag [14]. Cognitive offloading is the broader pattern: when future access is expected, recall of the information drops, and recall of where to find it rises [15], driven by metacognitive misjudgment in which we offload even when internal cognition would be more accurate [16]. Interrupted work finishes faster at the same quality level, but at higher stress and frustration [17].

Then sleep. Four hours of evening iPad reading in a 2015 PNAS study suppressed melatonin by 55% and delayed circadian phase by over ninety minutes compared to print [18]. Late-night work-related smartphone use among mid- and high-level managers reduced sleep and increased next-morning depletion, more than for TV, laptop, or tablet use [19]. The evidence is smartphone-specific. The mechanism is the same.

The Fourteen Vials

Three days before the breakfast, at a Quest Diagnostics lab, the phlebotomist labeled the last of 14 vials and pointed me to the bathroom. The blood draw took two minutes. The urine cup would take 90 seconds.

I came out of the bathroom, and she was on the phone.

I didn’t need to hear the words. Her shoulders told me. She was on the phone with her supervisor. Behind her, the 14 vials of my blood were still lined up on the counter like a teaching demo. The reagent for one of the assays was missing from the kit. The panel couldn’t be run. They apologized. Could I come back tomorrow?

I came back the next day. The other arm. Two more minutes. Another bathroom trip. This time, the reagent was there. Somewhere in a centrifuge, my 100+ biomarkers became data.

This is the part the AI conversation in San Francisco kept missing. At least in health.

The model on my phone reads 100 biomarkers for $20 a month. Function Health, through Quest, runs the panel for a dollar a day. The interpretation and the test are now cheap. What isn’t cheap or automated is the person who puts the needle in your arm. The phlebotomist who finds your vein on the first try while you look away. Who tells you to keep breathing while she’s doing her thing. Who knows which vial will get agitated and which won’t. Who calls her supervisor when the reagent is missing. Nobody has built an AI for the inside of your elbow (yet).

The bottleneck didn’t disappear. It moved.

This sits in the Capacity pillar of my Upward ARC framework, the long arc of what you can know, decide, and do for yourself over thirty years. The second tether is Recover. Expanded medical reasoning is useless if running it every night costs you the sleep and parasympathetic foundation on which everything else depends. Both halves are the same edit. Be deliberate about what you let into your head, and when you let it out.

Try This Today

Build the Baseline First. Before you paste a lab into the model, build your context doc. AI without context defaults to population norms, which is the worst answer for a senior executive. Write three sections and save them. Family history: cause and age of death for grandparents and parents, first-degree relatives’ chronic conditions, and any cardiovascular event under sixty. Current meds and supplements: every pill, every dose, since when, why. Lifestyle context: work schedule including travel, sleep timing, training routine, alcohol honestly, diet, and mental state. Past data points go here too: prior bloodwork, BP and weight trends, and imaging. Save it as your patient context prompt. Add the question that matters for the next ten years: what biomarkers should I track in future readings so the next interpretation has trajectory, not just snapshots? Everything else depends on it.
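One way to keep that context doc consistent is to hold it as structured data and render it to text whenever you need it. The sketch below is a minimal illustration, not a prescribed format: the `PatientContext` class, its field names, and the `render_context` function are all hypothetical, the entries are placeholders rather than real data, and giving past data points their own heading is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PatientContext:
    """Hypothetical container for the baseline sections described above."""
    family_history: list[str] = field(default_factory=list)
    meds_and_supplements: list[str] = field(default_factory=list)
    lifestyle: list[str] = field(default_factory=list)
    # Assumption: past data points are kept under their own heading.
    past_data: list[str] = field(default_factory=list)


# The standing question from the text, appended to every rendered doc.
TRAJECTORY_QUESTION = (
    "What biomarkers should I track in future readings so the next "
    "interpretation has trajectory, not just snapshots?"
)


def render_context(ctx: PatientContext) -> str:
    """Render the context as plain text that can be pasted ahead of new labs."""
    sections = [
        ("Family history", ctx.family_history),
        ("Current meds and supplements", ctx.meds_and_supplements),
        ("Lifestyle context", ctx.lifestyle),
        ("Past data points", ctx.past_data),
    ]
    lines = ["PATIENT CONTEXT"]
    for title, items in sections:
        lines.append("")
        lines.append(f"{title}:")
        if items:
            lines.extend(f"- {item}" for item in items)
        else:
            # Make gaps visible instead of silently dropping the section.
            lines.append("- (not yet filled in)")
    lines.append("")
    lines.append(f"Standing question: {TRAJECTORY_QUESTION}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder entries only; replace with your own details.
    example = PatientContext(
        family_history=["Father: <cause and age of death>"],
        meds_and_supplements=["<drug>, <dose>, since <year>, for <reason>"],
        lifestyle=["Travel: <nights away per month>", "Sleep: <usual window>"],
    )
    print(render_context(example))
```

Empty sections print as "(not yet filled in)", so a missing family history or medication list stays visible each time you reuse the doc.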

The Six-Question Prompt. With the baseline set, when new labs or imaging come in, ask these six. One: trajectory, not snapshot. Two: family history weighting on the markers near the top of the panel. Three: what tests are missing that I should request. Four: cross-correlations between markers, the syndrome that a single marker might miss. Five: de-prescribing review. Six: the one or two changes that would move the most. The model does useful work on questions one to four and decent work on five and six. The doctor’s edit on five and six is in your physical exam. Bring both.
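Assembled as a reusable prompt, the six questions might look like the sketch below. The question wording is a paraphrase of the list above, and `SIX_QUESTIONS`, `build_prompt`, and the section labels are hypothetical names chosen for illustration; the context and results strings are placeholders.

```python
# Paraphrased from the six questions in the text, kept in the same order.
SIX_QUESTIONS = [
    "Trajectory, not snapshot: how has each marker moved against my prior results?",
    "Weight my family history on the markers near the top of the panel.",
    "What tests are missing that I should request?",
    "Which cross-correlations between markers point to a syndrome a single marker might miss?",
    "De-prescribing review: what on my medication list should be revisited?",
    "What are the one or two changes that would move the most?",
]


def build_prompt(context_doc: str, new_results: str) -> str:
    """Combine the saved context doc, the new results, and the numbered questions."""
    numbered = "\n".join(
        f"{number}. {question}"
        for number, question in enumerate(SIX_QUESTIONS, start=1)
    )
    return (
        f"{context_doc.strip()}\n\n"
        f"NEW RESULTS\n{new_results.strip()}\n\n"
        f"QUESTIONS\n{numbered}\n"
    )


if __name__ == "__main__":
    # Placeholder inputs; paste your own context doc and lab text instead.
    print(build_prompt("<your saved patient context>", "<pasted lab or imaging report>"))
```

Keeping the questions in a fixed order means the model’s answers to five and six are easy to find and bring to the physical.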

The Physical-First Rule. AI doesn’t replace the annual physical with a doctor who puts hands on you. The orthostatic blood pressure, the thyroid check, the hand exam, the conversation where you finally say what you haven’t told a GP in 15 years. Do both. AI interprets what the physical produces. The 20% is in the room.

The AI Curfew. No LLM after 9 pm on a workday. I spend six to eight hours a day in Claude Code right now. The capabilities are addictive. I learn, I experiment, I build. It’s a tool. Without boundaries, it turns into a video game. The same discipline that keeps your twelve-year-old off Fortnite at 9 pm applies to you and the model. It’ll be there at 7 am, with a clearer head, not at the cost of your sleep.

The Deliberate-Tool Rule. Don’t use five AI tools at once. Pick one or two and go deep. Learn their prompt patterns, their failure modes, your workflow. The knowledge transfers to other tools later. Depth beats breadth.

Back to the Breakfast Room

The senior partner is still waiting. I put my coffee down. Ask it three things it did not ask you, I tell him. Family history weighting on the markers near the top of the panel. What it would take off your medication list, not just add to it. What test it would order if it were sitting across from you. He pulls up ChatGPT and starts typing. The bacon is cold now.

This is the next wave in the corner office’s relationship with its own body. Done well, it’s the biggest gift of clinical reasoning to the senior executive in 50 years. The second opinion you waited 6 weeks for is now available at 7 am, between a board call and a flight. Done badly, it’s a new cortisol drip with no off switch.

The senior executives who win this decade will be the ones who use it well and know when to turn it off. That’s one skill, not two.

Stay healthy.

Andre

PS: If you read this and you’re anything like either the partner who got 80% from ChatGPT or the partner asking Claude at 1 in the morning, forward it to one peer on the other side of that line. That’s how this newsletter grows, and it’s the only way I want it to.

References

[1] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115-118.

[2] Hannun, A. Y., Rajpurkar, P., Haghpanahi, M., Tison, G. H., Bourn, C., Turakhia, M. P., & Ng, A. Y. (2019). Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nature Medicine, 25(1), 65-69.

[3] Attia, Z. I., Kapa, S., Lopez-Jimenez, F., McKie, P. M., Ladewig, D. J., Satam, G., Pellikka, P. A., Enriquez-Sarano, M., Noseworthy, P. A., Munger, T. M., Asirvatham, S. J., Scott, C. G., Carter, R. E., & Friedman, P. A. (2019). Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nature Medicine, 25(1), 70-74.

[4] McKinney, S. M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G. S., Darzi, A., Etemadi, M., Garcia-Vicente, F., Gilbert, F. J., Halling-Brown, M., Hassabis, D., Jansen, S., Karthikesalingam, A., Kelly, C. J., King, D., … Shetty, S. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577(7788), 89-94.

[5] Liu, Y., Jain, A., Eng, C., Way, D. H., Lee, K., Bui, P., Kanada, K., de Oliveira Marinho, G., Gallegos, J., Gabriele, S., Gupta, V., Singh, N., Natarajan, V., Hofmann-Wellenhof, R., Corrado, G. S., Peng, L. H., Webster, D. R., Ai, D., Huang, S. J., … Coz, D. (2020). A deep learning system for differential diagnosis of skin diseases. Nature Medicine, 26(6), 900-908.

[6] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., Payne, P., Seneviratne, M., Gamble, P., Kelly, C., Babiker, A., Schärli, N., Chowdhery, A., Mansfield, P., Demner-Fushman, D., … Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.

[7] Eriksen, A. V., Möller, S., & Ryg, J. (2024). Use of GPT-4 to diagnose complex clinical cases. NEJM AI, 1(1), AIp2300031.

[8] Tu, T., Palepu, A., Schaekermann, M., Saab, K., Freyberg, J., Tanno, R., Wang, A., Li, B., Amin, M., Cheng, Y., Vedadi, E., Tomasev, N., Azizi, S., Singhal, K., Hou, L., Webson, A., Kulkarni, K., Mahdavi, S. S., Semturs, C., … Natarajan, V. (2025). Towards conversational diagnostic artificial intelligence. Nature, 642(8068), 442-450.

[9] Heinz, M. V., Mackin, D. M., Trudeau, B. M., Bhattacharya, S., Wang, Y., Banta, H. A., Jewett, A. D., Salzhauer, A. J., Griffin, T. Z., & Jacobson, N. C. (2025). Randomized trial of a generative AI chatbot for mental health treatment. NEJM AI, 2(4), AIoa2400802.

[10] Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V., & Daneshjou, R. (2023). Large language models propagate race-based medicine. npj Digital Medicine, 6(1), 195.

[11] Pal, A., Umapathi, L. K., & Sankarasubbu, M. (2023). Med-HALT: Medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL) (pp. 314-334). Association for Computational Linguistics.

[12] Mehandru, N., Miao, B. Y., Almaraz, E. R., Sushil, M., Butte, A. J., & Alaa, A. (2024). Evaluating large language models as agents in the clinic. npj Digital Medicine, 7(1), 84.

[13] Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., Cool, J. A., Kanjee, Z., Parsons, A. S., Ahuja, N., Horvitz, E., Yang, D., Milstein, A., Olson, A. P. J., Rodman, A., & Chen, J. H. (2024). Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Network Open, 7(10), e2440969.

[14] Leroy, S. (2009). Why is it so hard to do my work? The challenge of attention residue when switching between work tasks. Organizational Behavior and Human Decision Processes, 109(2), 168-181.

[15] Sparrow, B., Liu, J., & Wegner, D. M. (2011). Google effects on memory: Cognitive consequences of having information at our fingertips. Science, 333(6043), 776-778.

[16] Risko, E. F., & Gilbert, S. J. (2016). Cognitive offloading. Trends in Cognitive Sciences, 20(9), 676-688.

[17] Mark, G., Gudith, D., & Klocke, U. (2008). The cost of interrupted work: More speed and stress. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08) (pp. 107-110). Association for Computing Machinery.

[18] Chang, A.-M., Aeschbach, D., Duffy, J. F., & Czeisler, C. A. (2015). Evening use of light-emitting eReaders negatively affects sleep, circadian timing, and next-morning alertness. Proceedings of the National Academy of Sciences, 112(4), 1232-1237.

[19] Lanaj, K., Johnson, R. E., & Barnes, C. M. (2014). Beginning the workday yet already depleted? Consequences of late-night smartphone use and sleep. Organizational Behavior and Human Decision Processes, 124(1), 11-23.

A note for new readers:

I’m a trained reconstructive facial surgeon, medical doctor, and dentist. Before launching this newsletter, I had a varied career: competitive freestyle wrestler, management consultant (McKinsey), entrepreneur (Zocdoc, Thermondo, and docdre ventures), and corporate executive (Sandoz). Today, I’m a Managing Director and Partner at BCG.

Husband of one. Father of three. Split between Berlin’s urban pulse and our Baltic Sea retreat. I’d rather be moving than sitting. Not just hobbies. Research. My body is my primary laboratory; I’ve been conducting experiments for decades.

If this is your first time here, welcome. I’m excited to share what I’ve learned and will continue to learn with you.

DISCLAIMER:

Let’s get one thing straight: None of this, whether text, graphics, images, or anything else, is medical or health advice. This newsletter is here to inform, educate, and (hopefully) entertain you, not to diagnose or treat you.

Yes, I’m a trained medical doctor and dentist. No, I’m not your doctor. The content here isn’t a replacement for professional medical advice, diagnosis, or treatment.

If you have questions about your health, talk to your physician or a qualified health professional. Don’t ignore their advice or delay getting care because of something you read in The Upward ARC. Be smart. Do your research. And, as always, take care of yourself.
