
How scientists are using Claude to accelerate research and discovery

Not to be a Luddite, but large language models are fundamentally not meant for tasks of this nature. And listen to this:

> Most notably, it provides confidence levels in its findings, which Cheeseman emphasizes is crucial.

These 'confidence levels' are suspect. You can ask Claude today, "What is your confidence in __" and it will, unsurprisingly, give a 'confidence interval'. I'd like to better understand the system implemented by Cheeseman. Otherwise I find the whole thing, heh, cheesy!

3 days agojadenpeterson

I've spent the last ~9 months building a system that, amongst other things, uses a vLLM to classify and describe >40 million images of house number signs across all of Italy. I wish I were joking, but that aside.

When asked about their confidence, these things are almost entirely useless. If the Magic Disruption Box is incapable of knowing whether or not it read "42/A" correctly, I'm not convinced it's gonna revolutionize science by doing autonomous research.

3 days agoisoprophlex

How exactly are we asking for the confidence level?

If you give the model the image and a prior prediction, what can it tell you? Asking for it to produce a 1-10 figure in the same token stream as the actual task seems like a flawed strategy.

3 days agobob1029

I’m not saying the LLM will give a good confidence value; maybe it will, maybe it won’t, depending on its training. But why is making it produce the confidence value in the same token stream as the actual task a flawed strategy?

That’s how typical classification and detection CNNs work: class and confidence value, plus a bounding box for detection CNNs.

3 days agokelipso

Because it's not calibrated to be. In LLMs, next-token probabilities are calibrated: the training loss drives them to be accurate. Likewise in typical classification models for images or whatever else. It's not beyond possibility to train a model to give confidence values.

But the second-order 'confidence as a symbolic sequence in the stream' is only (very) vaguely tied to this. Numbers-as-symbols are of a different kind from numbers-as-next-token-probabilities. I don't doubt there is _some_ relation, but it's too much inferential distance away and thus worth almost nothing.

With that said, nothing really stops you from finetuning an LLM to produce accurately calibrated confidence values as symbols in the token stream. But you have to actually do that; it doesn't come for free by default.
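
To make the distinction concrete, here's a minimal sketch (gpt2 and the prompt are arbitrary stand-ins, nothing to do with the article's setup): the first quantity is what the training loss actually calibrates; the second is just more sampled text.

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")        # stand-in model
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  prompt = "The house number sign in the image reads: 42/"
  ids = tok(prompt, return_tensors="pt").input_ids

  with torch.no_grad():
      logits = model(ids).logits[0, -1]              # scores for the *next* token
  probs = torch.softmax(logits, dim=-1)

  # 1) the calibrated quantity: probability mass on the continuation "A"
  p_a = probs[tok.encode("A")[0]].item()
  print(f"P(next token = 'A') = {p_a:.3f}")

  # 2) the verbalized quantity: whatever digits get sampled after
  #    "on a scale of 1-10 my confidence is ..." are just more generated
  #    tokens, with no training objective tying them to the probability above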

3 days agohexaga

Yeah, I agree you should be able to train it to output confidence values; restricting the confidence to an integer from 0 to 9 should make it so it won’t be as confused.

2 days agokelipso

CNNs and LLMs are fundamentally different architectures. LLMs do not operate on images directly; the images need to be transformed into something that can ultimately be fed in as tokens. Producing a confidence figure isn't possible until we've reached the end of that pipeline and the vision encoder has already done its job.

3 days agobob1029

The images get converted to tokens by the vision encoder, but those tokens are just embedding vectors. So the model should be able to produce a confidence figure if you train it to.

CNNs and LLMs are not that different. You can train an LLM architecture to do the same thing that CNNs do with a few modifications; see Vision Transformers.
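
Rough sketch of the patch-embedding step, just to make "converted to tokens" concrete (the sizes are arbitrary ViT-ish defaults; real vision encoders also add positional embeddings, a CLS token, and many transformer layers on top):

  import torch
  import torch.nn as nn

  image = torch.randn(1, 3, 224, 224)                        # one RGB image
  to_patches = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patches -> 768-d

  tokens = to_patches(image)                  # [1, 768, 14, 14]
  tokens = tokens.flatten(2).transpose(1, 2)  # [1, 196, 768]: 196 patch "tokens"
  # from here the LLM attends over these embeddings like any other tokens,
  # so in principle the stack can be trained end-to-end to emit a confidence too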

2 days agokelipso

> If the Magic Disruption Box is incapable of knowing whether or not it read "42/A" correctly

Are you implying that science done by humans is entirely error-free?

3 days agoanal_reactor

There exists human research that is worse than AI slop. There is no AI research worthy of the Nobel prize

3 days agomxkopy

yet.

3 days agoanal_reactor

Yes and no at the same time, depending on what you intend to get from asking. I don't know what you were doing with this project, obviously, so I can't speak to that, but science (well, stats in general, but science needs stats) depends hugely on being sure the question was the correct one and not just rhyming.

Reading hand-written digits was the 'hello world' of AI well before LLMs came along. I know, because I did it well before LLMs came along.

Obviously a simple model itself can't know if it's right or wrong, as one of Wittgenstein's quotes puts it:

  If there were a verb meaning 'to believe falsely', it would not have any significant first person, present indicative.
That said, it's IMO not (as Wittgenstein seemed to be claiming) impossible, as at the very least human brains are not single monolithic slabs of logic: https://www.lesswrong.com/posts/CFbStXa6Azbh3z9gq/wittgenste...

In the case of software, whatever system surrounds this unit of machine classification (be it scripts or more ML) can know how accurately that unit classifies things under certain conditions. Take my own MNIST hello-world example: split the data into a training set and a test set, and the test set tells you (roughly!) how good the training was. While this still won't tell you if any given answer is wrong, it will tell you how many of those 40 million are probably wrong.
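
A toy version of that arithmetic, with invented numbers (only the shape of the argument matters):

  # held-out error rate -> expected number of wrong labels in the full
  # collection, even though you can't say *which* ones are wrong
  test_size = 10_000
  test_errors = 230                      # hypothetical count of misread signs
  error_rate = test_errors / test_size   # ~2.3%

  total_items = 40_000_000
  expected_wrong = error_rate * total_items
  print(f"roughly {expected_wrong:,.0f} of {total_items:,} labels are wrong")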

Humans and complex AI can, in principle, know their own uncertainty. E.g. I currently estimate my knowledge of physics to be around the level of a first-year undergraduate, because I have looked at what gets studied in the first year, plus some past papers, and most of it is not surprising (just don't ask me which one is a kaon and which one is a pion).

Unfortunately "capable" doesn't mean "good", and indeed humans are also pretty bad at this, the general example is Dunning Kruger, and my personal experience of that from the inside is that I've spent the last 7.5 years living in Germany, and at all points I've been sure (with evidence, even!) that my German is around B1 level, and yet it has also been the case that with each passing year my grasp of the language has improved, so what I'm really sure of is that I was wrong 7 years ago, but I don't know if I still am or not, and will only find out at the end of next month when I get the results of an exam I have yet to sit.

3 days agoben_w

A blind mathematician can do revolutionary work despite not being able to see

3 days agoYajirobe

Here's a logical step you skipped: a blind mathematician can do revolutionary work in mathematics. He is highly unlikely to do revolutionary work in agriculture.

3 days agotroupo

> large language models are fundamentally not meant for tasks of this nature

There should be some research results showing their fundamental limitations, as opposed to empirical observations. Can you point to them?

What about VLMs, VLAs, LMMs?

3 days agored75prime

Old "agged Technological Frontier" but explains a bit the challenge https://www.hbs.edu/faculty/Pages/item.aspx?num=64700 namely... it's hard and the lack of reproducibility (models getting inaccessible to researcher quickly) makes this kind of studies very challenging.

3 days agoutopiah

That is an old empirical study. jadenpeterson was talking about some fundamental limitations of LLMs.

3 days agored75prime

Finding patterns in large datasets is one of the things LLMs are really good at. Genetics is an area where scientists have already done impressive things with LLMs.

However you feel about LLMs (and I'm guessing you're not a fan, and I say this because you don't have to use them for very long before you witness how useful they can be for large datasets), they are undeniably incredible tools in some areas of science.

https://news.stanford.edu/stories/2025/02/generative-ai-tool...

https://www.nature.com/articles/s41562-024-02046-9

3 days agopost_below

In reference to the second article: who cares? What we care about is experimental verification. I could see maybe accurate prediction being helpful in focusing funding, but you still gotta do the experimentation.

Not disagreeing with your initial statement about LLMs being good at finding patterns in datasets btw.

3 days agocatlifeonmars

This is also true of lots of human research; there's always a theory side of research that guides the experimental side. Even if only informally, experimental researchers have priors about what experimental verification they should attempt.

3 days agowasabi991011

Yeah, there’s an infinite number of experiments you could run, but obviously infinite resources don’t exist, so you need theory to guide where to look. For example, using computational methods in bioinformatics to guess a protein’s function so that experimental researchers can verify it (which takes weeks to months for a given protein-function hypothesis) is an entire field.

2 days agokelipso

You need to search in both likely and unlikely places. This is pretty common in high dimensional search spaces. Searching only in the most likely places gets you stuck in local minima

2 days agocatlifeonmars

As a scientist, I find the two links you provided severely lacking in utility.

The first developed a model to calculate protein function based on DNA sequence - yet provides no results from testing the model. Until it does, it’s no better than the hundreds of predictive models thrown on the trash heap of science.

The second tested a model’s “ability to predict neuroscience results” (which reads really oddly). How did they test it? They pitted humans against LLMs in determining which published abstracts were correct.

Well yeah? That’s exactly what LLMs are good at - predicting language. But science is not advanced by predicting which abstracts of known science are correct.

It reminds me of my days working with computational chemists - we had an x-ray structure of the molecule bound to the target. You can’t get much better than that for hard, objective data.

“Oh yeah, if you just add a methyl group here you’ll improve binding by an order of magnitude”.

So we went back to the lab, spent a week synthesizing the molecule, sent it to the biologists for a binding study. And the new molecule was 50% worse at binding.

And that’s not to blame the computational chemist. Biology is really damn hard. Scientists are constantly being surprised by results that contradict current knowledge.

Could LLMs be used in the future to help come up with broad hypotheses in new areas? Sure! Are the hypotheses going to prove fruitless most of the time? Yes! But that’s science.

But any claim of a massive leap in scientific productivity (whether LLMs or something else) should be taken with a grain of salt.

3 days agorefurb

> Finding patterns in large datasets is one of the things LLMs are really good at.

Where by "good at" you mean "are totally shit at"?

They routinely hallucinate things even on tiny datasets like codebases.

3 days agotroupo

I don't follow the logic that "it hallucinates so it's useless". In the context of codebases I know for sure that they can be useful. Large datasets too. Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes.

But the latter doesn't invalidate the former.

3 days agopost_below

> I don't follow the logic that "it hallucinates so it's useless".

I... don't even know how to respond to that.

Also. I didn't say they were useless. Please re-read the claim I responded to.

> Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes.

Indeed.

Now combine "Finding patterns in large datasets is one of the things LLMs are really good at." with "they hallucinate even on small datasets" and "Are they also really bad at some aspects of dealing with both? Absolutely. Dangerously, humorously bad sometimes"

Translation, in case logic somehow eludes you: if an LLM finds a pattern in a large dataset, given that it often hallucinates (dangerously, humorously badly), what are the chances that the pattern it found isn't a hallucination (often a subtle one)?

Especially given the undeniable verifiable fact that LLMs are shit at working with large datasets (unless they are explicitly trained on them, but then it still doesn't remove the problem of hallucinations)

3 days agotroupo

LLMs do typically encode a confidence level in their embeddings; they just never use it when asked. There were multiple papers on this a few years back, and they got reasonable results out of it. I think it was in the GPT-3.5 era, though.
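
From memory, the setups looked roughly like this (a sketch only, with a stand-in model and placeholder data, not any particular paper's code): fit a linear probe on hidden states to predict whether the answer was right.

  import numpy as np
  import torch
  from sklearn.linear_model import LogisticRegression
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

  def last_hidden(text):
      ids = tok(text, return_tensors="pt").input_ids
      with torch.no_grad():
          out = model(ids)
      return out.hidden_states[-1][0, -1].numpy()   # final layer, final token

  # (model answer, was it correct) pairs harvested from some eval set
  examples = [("Q: 2+2? A: 4", 1), ("Q: 2+2? A: 5", 0)]  # ...many more in practice
  X = np.stack([last_hidden(t) for t, _ in examples])
  y = np.array([label for _, label in examples])

  probe = LogisticRegression(max_iter=1000).fit(X, y)
  # probe.predict_proba(...) is the confidence signal sitting in the embeddings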

2 days ago3836293648

I made a toy order item cost extractor out of my pile of emails. Claude added confidence percentage tracking and it couldn't be more useless.

3 days agoeurekin

This is what Yann LeCun means when he talks about how research is at a dead end at the moment, with everyone all in on LLMs to a fault

3 days agovimda

I'm just a noob, but LeCun seems obsessed with the idea of world models, which I assume means a more rigorous physical approach, and I don't understand (again, confused noob here) how that would help precise abstract thinking.

3 days agoagumonkey

Can't LLMs be fed the entire corpus of literature to synthesise (if not "insight") useful intersections? Not to mention much better search than what was available when I was a lowly grad...

3 days agodjtango

I use Gemini almost obsessively but I don't think feeding the entire corpus of a subject would work great.

The problem is that so much of the consensus is wrong, and it is going to start by giving you the consensus answer on anything.

There are subjects where I can get it to tell me the consensus answer, then say "what about x", and it completely changes and contradicts the first answer, because x contradicts the standard consensus orthodoxy.

To me it is not much different than going to the library to research something. The library is not useless because the books don't read themselves or because there are numerous books on a subject that contradict each other. Gaining insight from reading the book is my role.

I suspect much LLM criticism is from people who neither much use LLMs nor learn much of anything new anyway.

3 days agofatherwavelet

I never suggested I want an LLM to be the definitive answer to a question, but I'm certain that there is a lot of low-hanging fruit across disciplines where the limit is one field's awareness of another field's work, and the limiting factor has been the friction of discovery. I can't see how a specialised research tool powered by LLMs and RAG wouldn't be a net gain for research, if only to generate promising new leads.

Throwing compute to mine a search space seems like one of the less controversial ways to use technology...
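
Even something as simple as embedding-based retrieval over abstracts would surface those cross-field overlaps; a rough sketch (the library, model name, and abstracts are just plausible placeholders, not any particular product):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")

  # abstracts from different fields; millions more in a real index
  abstracts = [
      "We characterize protein folding kinetics under thermal stress ...",
      "A sampling algorithm for escaping local minima in rugged landscapes ...",
  ]
  corpus_emb = model.encode(abstracts, convert_to_tensor=True)

  query = "methods for exploring rugged conformational energy landscapes"
  hits = util.semantic_search(model.encode(query, convert_to_tensor=True),
                              corpus_emb, top_k=2)[0]
  for h in hits:
      print(round(h["score"], 3), abstracts[h["corpus_id"]])
  # an LLM then reads the retrieved abstracts and drafts the promising lead,
  # which a human still has to vet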

3 days agodjtango

Call me when a disinterested third-party says so. PR announcements by the very people who have a large stake in our belief in their product are unreliable.

3 days agoalsetmusic

This company predicts software development is a dead occupation yet ships a mobile chat UI that appears to be perpetually full of bugs, and has had a number of high profile incidents.

3 days agojoshribakoff

"This company predicts software development is a dead occupation"

Citation needed?

Closest I've seen to that was Dario saying AI would write 90% of the code, but that's very different from declaring the death of software development as an occupation.

3 days agosimonw

The clear disdain he has for the profession is evident in any interview he gives. His saying AI would write 90% of the code was not a signal to us; it was directed at his fellow execs, telling them they can soon get rid of 90% of the engineers and some other related professions.

3 days agofalloutx

I think it's pretty clear that Anthropic was the main AI lab pushing code automation right from the start. Their blog posts, everything, just targeted code generation. Even their headings for new models in articles would be "code". My view is that if they weren't around, even if it would have happened eventually, code would have been solved at the same cadence as other use cases (i.e. gradually, as per general demand).

AI engineers aren't actually SWEs per se; they use code but they see it as tedious, secondary work IMO. They are happy to automate their complement and rise in status vs SWEs, who typically, before all of this, had more employment opportunities and more practical ways to show value.

2 days agothrow234234234

AI is already writing 90% of my code. 100% of Claude Code's code, too. So Amodei was right.

3 days agothrow310822

Is your argument that the quotes by the researchers in the article are not real?

3 days agoNewsaHackO

What quotes? This is an AI summary that may or may not have summarized actual quotes from the researchers, but I don't see a single quote in this article, or a source.

3 days agotaormina

Why are you commenting if you can't even take a few minutes to read this? It's quite bizarre. There's a quote and repo for Cheeseman, and a paper for Biomni.

3 days agofamouswaffles

There is only one quote in the entire article, though:

> Cheeseman finds Claude consistently catches things he missed. “Every time I go through I’m like, I didn’t notice that one! And in each case, these are discoveries that we can understand and verify,” he says.

Pretty vague and not really quantifiable. You would think an article making a bold claim would contain more than a single, hand-wavy quote from an actual scientist.

3 days agoWD-42

>Pretty vague and not really quantifiable. You would think an article making a bold claim would contain more than a single, hand-wavy quote from an actual scientist.

Why? What purpose would quotes serve better than a paper with numbers and code? Just seems like nitpicking here. The article could have gone without a single quote (or had several more) and it wouldn't really change anything. And that quote is not really vague in the context of the article.

3 days agofamouswaffles

[flagged]

3 days agoinferiorhuman

The point is to look at who is making a claim and asking what they hope to gain from it. This is orthogonal to what the thing is, really. It’s just basic skepticism.

Even if the article is accurate, it still makes sense to question the motives of the publisher. Especially if they’re selling a product.

3 days agocatlifeonmars

[flagged]

3 days agoinferiorhuman

Most people aren't software developers. The HN audience can benefit from LLMs in ways that many people don't value.

3 days agosimonw

Are you accusing Anthropic of hallucinating an MIT lab under the MIT domain? I mean they literally link to it https://cheesemanlab.wi.mit.edu/

3 days agobpodgursky

And if you go to that site, the researchers say nothing about using Claude, or any LLMs for that matter.

3 days agofalloutx

Honestly, it doesn't even seem they read the article, just came in, saw it was pro-AI, and commented.

3 days agoNewsaHackO

You know things have shifted a gear when people just start flat out denying reality.

3 days agofamouswaffles

[flagged]

3 days agoinferiorhuman

> Call me when a disinterested third-party says so

Saying what? This describes three projects that use an Anthropic product. Do you need a third party to confirm that? Or do you need someone to tell you if they are legit?

There are hundreds of announcements by vendors on HN. Did you object to them all, or only to the ones that go against your own beliefs?

3 days agosignatoremo

This will quickly corrupt the scientific record.

3 days agotim-tday

Of course this comes from Anthropic PR. Stanford basically has a stake in LLM and AI hype, so no wonder they are the most receptive.

3 days agofalloutx

Pairs well with this: https://hegemon.substack.com/p/the-age-of-academic-slop-is-u...

Taking CV-filler from 80% to 95% of published academic work is yet another revolutionary breakthrough on the road to superintelligence.

3 days agousername223

> scholarly dark matter that exists to pad CVs and satisfy bureaucratic metrics, but which no one actually reads or relies upon.

Is it cynical to believe this is already true and has been forever?

Is it naive to hope that when AI can do this work, we will all admit that much of the work was never worth doing in the first place, our academic institutions are broken, and new incentives are sorely needed?

I’m reminded of a chapter in Abundance where Ezra Klein notes how successful (NIH?) grant awardees are getting older over time, nobody will take risks on young scientists, and everyone is spending more of their time churning out bureaucratic compliance than doing science.

3 days agosubdavis

*source, maker of Claude.

3 days agojonplackett
[deleted]
3 days ago

oh look, marketing, disguised as "news"

3 days agoblahblaher

Honestly, it was a fun ride, HN, but the never-ending AI blog spam & advertising is 'gumming up the works'.

The writing is on the wall that AI will continue to dominate discussion on here for the foreseeable future (yawn...)

It's time for me to move to a space where the AI bros are not, wherever that may be. I hope to see some of you there!

a day agosieep

oh look another advertisement for anthropic

3 days agoLegitShady

Steady stream of these, very regularly. Lately it feels like this place is a marketing board for AI-anything companies.

3 days agoRonsenshi

I say hacker news should just be for low information clickbait headlines and daily pop culture articles that have nothing to do with technology, not AI advertisements.

2 days agoCyberDildonics

Anything posted by Anthropic always gets to the front page.

3 days agofalloutx

Their blog is their advertisement? You mean water is wet?

As for why it’s featured on HN, do you think it’s less important than the politics of the Nobel Peace Prize that’s at the top of the front page at the moment?

3 days agosignatoremo

[dead]

3 days agoblack_13

[flagged]

3 days agobpodgursky

Have you done anything interesting with this army that you can share and are proud of? Specifically something concrete you can link to, not just something you can imagine or describe?

3 days agograyhatter

Nothing I can see on https://github.com/bpodgursky/ nor https://bpodgursky.com/projects/ but I'd be curious.

Honestly it's a pattern when arguing about the topic:

- wow AI is amazing!

- OK, why?

- because it can do so much stuff! I'm making so many projects and earning a lot!

- cool, care to share a link to anything?

- ... radio silence or yet another NES emulator or something that is clearly so curated it might have taken more work than without any "AI assistance"

3 days agoutopiah

Problem with modern tech people is their obsession with pointing to things in public. People can't just be writing software anymore.

3 days agopertymcpert

I believe LLMs (specifically code gen) have produced nothing of substance. I'm looking for evidence to disprove that assumption. You're welcome to share nothing, but when you brag about how fantastic it is, it's reasonable to ask. And then no one can ever prove it... I can only hear that as: I could if I wanted to, I just don't want to.

If you don't want to field questions about it, don't brag about it?

Equally, to your condemnation: the problem with AI enjoyers is they claim it's nearly perfect, and it can do everything, and it makes them so much faster. But every example is barely more than boilerplate, or it's a sham.

3 days agograyhatter

Some of us work on proprietary software. Systems, firmware, memory allocators, compilers, runtimes. You know, things that don't have fancy web pages or even stand-alone git repos, because the code is just being continually reviewed and merged alongside human-written code.

I'm not even OP so I don't know if it applies to them, but the above applies to me and it's been extremely helpful.

11 hours agopertymcpert

I'm not looking for a fancy webpage. I'm not looking for revolutionary output. I'm looking for, at its most basic level, commits that are without a doubt meaningful. I'm looking for someone to say: "Look at this. I claim that I couldn't have done this in the same amount of time without AI, it would have been harder without it, and I'm proud of it."

Someone to claim, honestly, that they've created something they take pride in, something that feels impressive to them, that they couldn't have done without AI.

And then for it to appear, to anyone who looks, that the thing they're proud of is obviously and meaningfully significant, such that it couldn't be considered boilerplate by anyone half reasonable. And then, ideally but not required: for me to not feel embarrassed by proxy for them when they claim they're proud of the AI-supported output.

Surely, if it's as helpful and as revolutionary as the current fervor claims, someone, anyone, should be willing to say: look at this commit, look at this project, look at this repo. I'm proud of it, I want everyone to know I'm proud of it, and I wouldn't have been able to do it without the support of an LLM. At least someone would have created something they wish to release into the open source world and claim they take pride in?

4 hours agograyhatter

Indeed, all this praise gives off a very "I do have a girlfriend. You don’t know her. She’s from Canada." feel.

3 days agoRonsenshi

It's very hard for any of the LLM fans to share anything substantial; it's always just demos and prototypes that even they don't understand. If you work at any big company, you know the LLM fans who are just desperately trying to show their managers and leaders that they can use LLMs. Even publicly famous programmers who run 10 agents at the same time: when you use their products you see they have become buggy and they've been shipping more slop than their customers require.

3 days agofalloutx

Don't worry, they have multiple agents working on that, right now.

3 days agoWD-42

[flagged]

3 days agosaagarjha

[flagged]

3 days agoinferiorhuman

You broke the site guidelines badly and repeatedly in this thread. We ban accounts that do that. I don't want to ban you, so if you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.

2 days agodang

[flagged]

3 days agobaxtr

By paying Anthropic large sums of money?!?

Funny you say that.