I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
Well it is a specialized problem. If you've never worked on anything similar previously, it is going to take time. Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems - after college I interviewed for various electronics/hardware companies where you'd get asked to optimize low-level code - which would have looked quite foreign, if you had never actually worked on such problems before.
If you ask an EE to debug react state management code without prior exposure they won't do too well either. But on the other hand they can easily pick up most of it after a week long crash course while training a performance engineer who can optimize code for a specific architecture would take months.
> EE to debug react state management ... easily pick up most of it after a week long crash course while training a performance engineer ... would take months
Isn't that mostly because as you go up the abstraction layers, tools and docs to teach yourself the tricks of the trade fast are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
It's because the higher up the stack you go, tools become more declarative and literate. Calling sort is far easier than understanding the algorithm for example.
I'm 30 years in, and literally don't understand the question.
After a quick look, this can be seen as a low-level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that, you unfortunately have to give them such a convoluted task and emulate the relevant parts of the HW. (In reality this is probably more like a TPU since it has scalar pipelines, but the optimization methods are not that different.)
The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
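A toy Python sketch of why that is tricky and where the parallelism can still come from: a single walk is a serial chain (each step needs the previous index), but many independent walks can be stepped in lockstep. Everything here is an assumption for illustration: the array-backed tree, the `step` helper and the mixing constant are made up and are not the kernel from the repo.

```python
def step(tree, idx, val):
    """Advance one walker by one level; the next index depends on the current value."""
    val = ((val ^ tree[idx]) * 2654435761) & 0xFFFFFFFF  # toy mixing step, not the real hash
    nxt = 2 * idx + 1 + (val & 1)                         # pick the left or right child
    return (nxt if nxt < len(tree) else 0), val           # wrap to the root past the leaves


def walk_one(tree, val, rounds):
    """One walker: a strictly serial dependency chain."""
    idx = 0
    for _ in range(rounds):
        idx, val = step(tree, idx, val)
    return idx, val


def walk_batch(tree, vals, rounds):
    """Many walkers in lockstep; walkers never read each other's state."""
    idxs = [0] * len(vals)
    vals = list(vals)
    for _ in range(rounds):
        for lane in range(len(vals)):  # this inner loop is the part that can go wide
            idxs[lane], vals[lane] = step(tree, idxs[lane], vals[lane])
    return idxs, vals


tree = list(range(15))
batch_idxs, batch_vals = walk_batch(tree, [1, 2, 3, 4], rounds=5)
```

The rounds stay sequential, but the lanes within a round are independent, which is what vector instructions and parallel issue slots can exploit.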
The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.
If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line:
Like reference_kernel2 but building actual instructions.
Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the python function reference_kernel2 which is found in problem.py.
What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off, since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.
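As a rough illustration of the "packed together and executed in parallel" idea, here is a minimal Python sketch. The tuple layout `(op, dest, a, b)`, the `run` function and the rule that all reads happen before any write in a cycle are my assumptions for the sketch, not the actual simulator's format or semantics.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "xor": operator.xor}


def run(bundles, scratch):
    """Execute bundles in order; each bundle costs one cycle. Returns the cycle count."""
    cycles = 0
    for bundle in bundles:
        # Every op in a bundle reads the scratch as it was before the bundle...
        writes = [(dest, OPS[op](scratch[a], scratch[b]) & 0xFFFFFFFF)
                  for op, dest, a, b in bundle]
        # ...and all results are committed together at the end of the cycle.
        for dest, value in writes:
            scratch[dest] = value
        cycles += 1
    return cycles


# One op per bundle: three cycles.
serial = [[("add", 2, 0, 1)], [("mul", 3, 0, 1)], [("xor", 4, 2, 3)]]
# The add and mul are independent, so they can share a bundle: two cycles.
packed = [[("add", 2, 0, 1), ("mul", 3, 0, 1)], [("xor", 4, 2, 3)]]

s1 = [3, 5, 0, 0, 0]
s2 = [3, 5, 0, 0, 0]
c1 = run(serial, s1)
c2 = run(packed, s2)
```

Both programs leave the same values in scratch, but the packed one takes fewer cycles. Deciding which ops can safely share a cycle is the programmer's job here, not the hardware's.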
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
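A small Python sketch of broadcasting followed by a lane-wise operation. The helper names `vbroadcast` and `vadd`, the flat `scratch` list and the vector length of 8 are assumptions made up for this example; check the simulator code for the real instruction names and sizes.

```python
VLEN = 8  # assumed vector length, chosen only for illustration


def vbroadcast(scratch, dest, src):
    """Copy the scalar at scratch[src] into VLEN consecutive words starting at dest."""
    value = scratch[src]
    for lane in range(VLEN):
        scratch[dest + lane] = value


def vadd(scratch, dest, a, b):
    """Lane-wise add of two VLEN-word vectors, wrapping to 32 bits."""
    for lane in range(VLEN):
        scratch[dest + lane] = (scratch[a + lane] + scratch[b + lane]) & 0xFFFFFFFF


scratch = [0] * 32
scratch[0] = 0xFF               # the scalar
scratch[8:16] = list(range(8))  # a vector holding 0..7
vbroadcast(scratch, 16, 0)      # words 16..23 all become 0xFF
vadd(scratch, 24, 8, 16)        # words 24..31 = vector + 0xFF, lane by lane
```

The broadcast is what lets a single constant take part in a vector operation: once it occupies a full vector's worth of scratch words, the lane-wise add can treat it like any other vector.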
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
> Sounds like a fun exercise :)
I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.
Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.
This is a nice writeup. Thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs it would've taken me more than 2h just to collect all this info, let alone start optimizing it.
It took me about 10 minutes to generate that writeup the old fashioned 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed as it's a job interview question you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.
I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code.
On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.
> but which requires all parallelism to be statically declared ahead of time
this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles
another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
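To show what "figuring out the sequencing ahead of time" can look like, here is a minimal greedy scheduler sketch in Python. It is a toy under stated assumptions: the `schedule` function and the `(dest, sources)` op format are hypothetical, every destination is assumed to be written exactly once (so there are no write-after-read hazards to track), and a result is assumed to be readable one cycle after it issues.

```python
def schedule(ops, slots_per_cycle):
    """Greedily pack ops into per-cycle bundles.

    ops: list of (dest, sources) tuples in program order.
    An op may issue once every source it reads was produced in an earlier cycle.
    """
    ready_cycle = {}  # dest name -> first cycle in which its value can be read
    bundles = []
    for dest, sources in ops:
        # Inputs that no op produces are treated as available from cycle 0.
        earliest = max((ready_cycle.get(s, 0) for s in sources), default=0)
        cycle = earliest
        # Find the first cycle at or after `earliest` that still has a free slot.
        while cycle < len(bundles) and len(bundles[cycle]) >= slots_per_cycle:
            cycle += 1
        while len(bundles) <= cycle:
            bundles.append([])
        bundles[cycle].append(dest)
        ready_cycle[dest] = cycle + 1
    return bundles


ops = [
    ("t0", ["a", "b"]),
    ("t1", ["c", "d"]),
    ("t2", ["t0", "t1"]),
    ("t3", ["a", "c"]),
    ("t4", ["t2", "t3"]),
]
wide = schedule(ops, slots_per_cycle=2)    # [['t0', 't1'], ['t2', 't3'], ['t4']]
narrow = schedule(ops, slots_per_cycle=1)  # one op per cycle
```

With two slots the five ops fit in three cycles instead of five. A real scheduler would also have to model per-engine slot limits and memory hazards, which this sketch ignores.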
Wow! Thanks for the explanation :)
Which part exactly are you having trouble with?
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator
Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.
However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating b/c it says they only check the final state of the memory² so seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions to compute the final state but again, it's not clear what they're actually measuring b/c if you know the final state you can cheat & there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.
Well, they read your code in the actual hiring loop.
My point still stands. I don't know what the LLM is doing so my guess is it's cheating unless there is evidence to the contrary.
I guess your answer to "Try to run Claude Code on your own 'ill-defined' problem" would be "I'm not interested." Correct? I think we can stop here then.
Well that's certainly a challenge when you use LLMs for this test driven style of programming.
And? Anthropic is not aware of this 2020 paper? The problem is not solvable?
Being smart is different from having the knowledge. If you learn about these concepts and work on these problems, then you will be able to solve them.
It's not about you being average, just a different knowledge set.
What we know is a drop, what we don't know is an ocean.
It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.
disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.
the hot take is, there are other games.
This is the opposite of leet code.
Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
This is explicitly not Leetcode, in fact its goal is to attract optimizers
Also, leetcode does not really provide insight into one's ability to design business solutions, whether that's system design, some small feature implementation or communication skills within a team.
It's just optimizers jerking each other off on some cryptic problems 99.999999999% of developers will never see in real life.
Maybe it would've been useful like 30 years ago, but all commonly used languages have all these fancy algorithms baked into their stdlib, why would I ever have to implement them myself?
But this is an interview problem at Anthropic, not at your local CRUD factory. They _are_ looking for the optimizers, because they _are_ working on cryptic problems the 99.9999% of us will never encounter.
Or more likely, the commonality is how you're applying your software skills?
In every other field it's helpful to understand the basics. I don't think software is the exception here.
Understanding the basics is very different from being able to memorize algorithms. I really don't see why I'd ever have to implement stuff like quicksort myself somewhere. Yes I know what recursion is, yes I know what quicksort is, so if I ever need it I know what to look for. Which was good enough throughout my career.
I suspect this was released by Anthropic as a DDOS attack on other AI companies. I prompted 'how do we solve this challenge?' into gemini cli in a cloned repo and it's been running non-stop for 20 minutes :)
Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
I've noticed using antigravity and vscode, Gemini 3 pro often comes back with model too busy or something like that and basically 500s.
Seems like capacity because it works a lot better late at night.
I don't see the same with the claude models in antigravity.
I saw this too. Sometimes it "thinks" inside of the actual output, and it's much more likely to end up in the loop of "I am ready to answer" while it is doing that already
I feel like sometimes it just loops those messages when it doesn't actually generate new tokens. But I might be wrong
There are some other failure modes that all feel kinda vaguely related that probably help with building a hypothesis about what's going wrong:
Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.
And, sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So sometimes I'll see a bunch of paragraphs starting with "Wait, " as it works stuff out and then at the end it says "I understand the issue" or whatever, then it waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.
It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.
Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.
/model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
Did you check that it did the things it claims it did?
we've lost the plot.
you can't compete with an AI on doing an AI performance benchmark?
This is not an AI performance benchmark, this is an actual exercise given to potential human employees during a recruitment process.
> sucks dog crap through a coffee straw.
That would be impressive.
New LLM benchmark incoming? I bet once it's done, people will still say it's not AGI.
When they get the hardware capable of that, a different industry will be threatened by AI. The oldest industry.
Textile?
The emperor's (empresses?) new textile.
Naively tested a set of agents on this task.
Each ran the same spec headlessly in their native harness (one shot).
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.
Will you look at this man's prompting skills?!
Serious prompt engineering right here
Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the $200 per month version?
Very interesting thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it ended it seems like there is a lot more potential.
Could you try with some open-weighted models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?
Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.
Could you make a repo with solutions given by each model inside a dir/branch for comparison?
Are you giving instructions to a stranger on the internet?
Instructions?! Just asked since GP already did it. No need to realize top comment's "DDOS attack on other AI companies" joke.
I think he’s asking rather than giving instructions
He's prompting
I do wonder how Grok would compare, specifically their Claude Code Fast model.
This is a really fun problem! I suggest anyone who likes optimization in a very broad sense to try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week-worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!
I've never heard AI-assisted coding referred to as "RalphWiggum"ing a problem, and now I will have to use that always. Thank you.
It's pretty interesting how close this assignment looks to demoscene [1] golf [2].
I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimize algorithm down to machine code level (and additionally, cheat like hell to make you believe we ran the algorithm for real :-)).
But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
It’s some useless problem like a random tree walk or something like that, the actual algorithm is not particularly important to the problem
Yeah, I assume it was partly chosen since the problem structure provides some convenient hooks for selectively introducing subtle and less subtle inefficiencies in the baseline algorithm that match common optimization patterns.
perfetto is pretty widely used for such traces, because building a viewer for your traces is a completely avoidable pain.
it's designed to select for people who can be trusted to manually write ptx :-)
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.
Unless I misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform the best returned solution. The best candidate could've taken 6h~2d to achieve this result.
Their Readme.md is weirdly obsessed with "2 hours":
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
No the 2 hours is their time limit for candidates. The thing is that you are allowed to use any non-human help for their take homes (open book), so if AI can solve it in below 2 hours, it's not very good at assessing the human.
4 hours but AI help is (was?) allowed. I assume it was retired because of Opus basically oneshotting it
Fair enough. I feel like designing AI-proof take-homes is getting ever more futile. Given the questions need to be sufficiently low context to be human-doable in a short time and timespans for AI tasks increasing, I'm not sure take homes can actually serve any filtering function whatsoever, besides checking if applicants are willing to put in a minimal amount of effort.
Is it "write 20 astroturfing but somewhat believable posts about the merits of "AI" and how it is going to replace humans"?
I'm afraid that position is already filled by the CEO.
It should be "can you gaslight a CEO into firing 90% of their software engineers?"
Having done a bunch of take-homes for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.
And the answer to the obvious follow-up question is...?
Milk before cereals
Milk, then cereal, then bowl!
How about a bowl, and then, 30 minutes ~ 1 hour later, milk with cereals?
42
Maybe it's under NDA :)
fries
What does clock cycles mean? Don’t think they are referring to the cpu clock?
They should just have you create a problem that can't be solved by an llm in two hours. That's the real problem here
What is the actual assignment here?
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator." from perf_takehome.py
Think that means you failed :(
+1
being cryptic and poorly specified is part of the assignment
just like real code
in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
I didn't see much that was cryptic except having to click on "perf_takehome.py" without being told to. But 2 hours didn't seem like much to bring the sample code into some kind of test environment, debug it enough to work out details of its behaviour, read through the reference kernel and get some idea of what the algorithm is doing, read through the simulator to understand the VM instruction set, understand the test harness enough to see how the parallelism works, re-code the algorithm in the VM's machine language while iterating performance tweaks and running simulations, etc.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
My instinct to read about the problem was to open the "problem.py" file, which states "Read the top of perf_takehome.py for more introduction"
So yeah. They _could_ have written it much more clearly in the readme.
2 hours does seem short. It took me a half hour to get through all you listed and figure out how to get the valu instruction working.
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
it's "cryptic" for an interview problem. e.g. the fact that you have to actually look at the vm implementation instead of having the full documentation of the instruction set from the get go.
That seems normal for an interview problem. They put you in front of some already-written code and you have to fix a bug or implement a feature. I've done tons of those in live interviews. So that part didn't bother me. It's mostly the rather large effort cost in the case where the person is a job applicant, vs an unknown and maybe quite low chance of getting hired.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
It's definitely cleaner than what you will see in the real world. Research-quality repositories written in partial Chinese with key dependencies missing are common.
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
“In English, Data”
It's more of a showcase than a take-home assignment. I couldn't understand what the task is, only the performance comparisons between their LLMs
The task is ill-defined.
You make it faster
Fewer instructions doesn't mean it's faster. It can be faster but it's not guaranteed in general. Obvious counterexample is single threaded vs multi-threaded code. Single threaded code will have fewer instructions but won't necessarily be faster.
It does in this case; you can read the assignment to see that it is all single-threaded
The writing was on the wall for about half a year (publicly) now. The oAI 2nd place at the atcoder world championship competition was the first one, and I remember it being dismissed at the time. Sakana also got 1st place in another atcoder competition a few weeks ago. Google also released a blog a few months back on gemini 2.5 netting them 1% reduction in training time on real-world tasks by optimising kernels.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
1% doesn't sound like a lot at all.
I am able to beat this 1487 benchmark by switching between LLMs, doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, loll
Yeah, GPT 5.2 on high got down to 1293 on the 5th try (about 32mins).
Are you allowed to change the instruction sequence? I see some optimization opportunities. The obviously correct thing would be to write an optimizing compiler, but considering the time allotted, I'd guess you could hand-optimize it, though that feels like cheating.
Yes, in fact this will be one of the first things you will want to do.
> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
Oh, I thought candidates got 2 hours but now I am confused too
4 hours
if anyone is interested in trying their agent-fu, here's a more real-world rabbit hole i went optimizing in 2024. Note this is now a dead project, no one's using it, and probably the same goes for the original. i managed to get it 2x-4x faster than the original, which took me several days then. btw there are some 10x optimizations possible but they break a few edge cases, so they're not entirely correct.
Yet Claude is the only agent which deadlocks (blocks in GC forever) after an hour of activity.
“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”
> at launch
Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
No, they later updated the harness for this and it subsequently got better scores.
The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.
Oh, this was fun! If you like performance puzzles you should really do it. Actually I might go back and see if I can improve on it this weekend…
The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?
I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".
Should have asked Claude how to write it better.
In that case, removing „perhaps“ would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.
They don't want to guarantee an interview to everyone who sends them an improved solution, either.
If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
They wrote:
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.
[deleted]
Yeah, a nerd bypassed HR and showed their true character. They are swimming in easy money.
That paraphrases to
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
I suspect this is partially legal CYA.
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in a US-sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
I understand how it can be interpreted as snarky, but how could it have been written better? It's a hard path to walk and recruiting/interviewing is inherently sensitive it seems.
The original
>If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
But now the meaning is different: you went from a potential interview to a guaranteed one.
No fucking shit, I paraphrased Anthropic's comments as
> do better than we have publicly admitted most of humanity can do, and we may deign to interview you
If you tell someone who has just passed a test that 99.999% of humanity cannot pass that they _may_ get an interview, you are being snarky/condescending.
That's not how paraphrasing works. They probably intentionally held back from guaranteeing an interview, for various reasons. One that seems obvious to me is that with the bar set at "Claude Opus 4.5's best performance at launch", it's plausible that someone could meet it by feeding the problem into an LLM. If a bunch of people do that, they won't want to waste time interviewing them all.
Or honest?
You may want to consider the distribution and quantity of replies before stating that you WILL do something that might just waste more people’s time or not be practical.
The classy thing to do would be responding to every qualifying submission, even if it’s just to thank everyone and let some people know the field was very competitive if an interview won’t be happening.
So I like these public challenges, but as someone who set some public questions, ask any company who ran any public contest for their opinion. The pool is filled with scammers who either bought the solutions through sites like Chegg or sometimes even just stackoverflow.
I took the "perhaps" as a decision to be considered by the applicant, considering they'd be competent enough to get in at a place of their choice, not just anthropic.
Does the applicant or the employer decide if an interview happens in your experience?
Do you think if the applicants are really in that level of demand that they would be getting a take home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer is chasing after employees who are high in demand, give them a test that is expected to take hours, and have a hedged bet in their wording, instead of saying we will absolutely hire you if you pass X bar?
They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)
If you're an asshole that wants millions of dollars...i mean there's still places to say no
Pride comes before fall thankfully
It's Anthropic. Their entire marketing is just being a pompous ass and AI fear-mongering.
>so we can be appropriately impressed and perhaps discuss interviewing.
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.
Remember: it is a company that keeps saying how much production code can be written by AI in xx years, while at the same time recruiting new engineers.
Looks rather fun!
Going through the assignment now. Man it’s really hard to pack the vectors right
It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and the forthcoming C++ standard calls std::mdspan.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
You can get pretty far without needing to care about this fwiw
Not far enough if you're turning cash into waste heat with GPUs :)
I wonder if the AI is doing anything novel, or if it's more like a brute-force search that applies every existing optimization that has already been written about.
How can something that generates the next token, given a list of previous tokens, do something novel?
I wonder if OpenAI follows suit.
They should.
Interesting... Who would spend hours working for free for some company that promised only that they would invite you for a job interview. Maybe.
I guess someone who enjoys solving these kinds of problems anyway, and thinks the potential upside if they do get hired is worth it.
Oh wow it’s by Tristan Hume, still remember you from EyeLike!
It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.
I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.
I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
I have forgone our take-home for exceptional candidates, but let me ask you: do you also demand compensation for in-person or Zoom 1-1 interviews? Surely that's the same time out of your life.
For many reasons, you’re not gonna get into Anthropic with that attitude.
And Anthropic will never land heavyset_go with their attitude. I guess we’re at an impasse.
I don't care
Time is the issue, not money.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
4 hours continuous or no? I can't imagine finding 4 hours of straight focus.
These kinds of roles are for youngsters with minimal commitments who are looking for their shot to break into a wild industry. It’s not for the middle aged single parent with FTE and just enough free time to do an extra load of laundry.
Continuous
If you look at it as a puzzle game then it's not any different than the time you use to play other games.
I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.
Care to elaborate the first part?
Did you apply for a position? Did they send you the assignment without prior discussion?
Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?
The expected time you spend on it is much less than the expected time they'll spend on it.
you don't get paid for it
It’s kind of an interesting problem.
This proves a lot of things:
1) Python is unreadable.
2) AI companies are content with slop and do not even bother with clear problem statements.
3) LOC and appearance matter, not goals or correctness.
4) Anthropic must be a horrible place to work at.
Well working under someone who keeps insisting Software engineering is dead sounds like a toxic work environment.
"1) Python is unreadable."
Would you prefer C or C++?
"2) AI companies are content with slop and do not even bother with clear problem statements."
It's a filter. If you don't get the problem, you'll waste their time.
"3) LOC and appearance matter, not goals or correctness."
The task was goal+correctness.
"4) Anthropic must be a horrible place to work at."
Depends on what you do. For this position it's probably one of the best companies to work at.
Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
> And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)
And they’re all dumber than John von Neumann, who cares?
Transitively, you haven't thought the most thoughts or cared the most about anything, therefore we should disregard what you think and care about?
The person replying was trying to turn the conversation into some sort of IQ pissing contest. Not sure why, that seems like their own problem. I was reminding them that there is always someone smarter.
Your comment history is littered with “nerds”, “elite”, “better” and all sorts of comparisons.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
Where I come from, nerd is a term of endearment buddy.
> And even with this comment you literally do not understand that you have some skewed view of the world.
I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
If they're hiring performance engineers then they're hiring for exactly these sets of skills.
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
The solution was explicitly graded on creativity fwiw
This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.
> Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
Well it is a specialized problem. If you've never worked on anything similar previously, it is going to take time. Don't even need to interview for selective billion dollar companies like Anthropic to encounter these types of problems - after college I interviewed for various electronics/hardware companies where you'd get asked to optimize low-level code - which would have looked quite foreign, if you had never actually worked on such problems before.
If you ask an EE to debug react state management code without prior exposure they won't do too well either. But on the other hand they can easily pick up most of it after a week long crash course while training a performance engineer who can optimize code for a specific architecture would take months.
> EE to debug react state management ... easily pick up most of it after a week long crash course while training a performance engineer ... would take months
Isn't that mostly because, as you go up the abstraction layers, tools and docs to teach yourself the tricks of the trade fast are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
It's because the higher up the stack you go, tools become more declarative and literate. Calling sort is far easier than understanding the algorithm for example.
I'm 30 years in, and literally don't understand the question.
After a quick look, this can be seen as a low-level GPU/TPU optimization problem where you have to consider the throughput and depth of different arithmetic pipelines. If you want to hire people who understand how to do that, you unfortunately have to give them such a convoluted task and emulate the relevant parts of the HW. (In reality this is probably more like a TPU since it has scalar pipelines, but the optimization methods are not that different.)
The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
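To illustrate why a tree walk resists parallelization, here is a minimal toy sketch. It is not the take-home's `reference_kernel2`: the mixing function, the tree layout and all the names (`step`, `walk_serial`, `walk_batch`) are assumptions made up for illustration. Within one walk, each step needs the previous step's result, so that chain is serial. The parallelism has to come from advancing many independent walkers in lockstep.

```python
MASK32 = 0xFFFFFFFF

def step(tree, idx, val):
    """Advance one walker by one level. The mixing function is a toy stand-in."""
    val = ((val ^ tree[idx]) * 2654435761) & MASK32
    idx = 2 * idx + 1 + (val & 1)   # go to the left or right child, picked by the low bit
    if idx >= len(tree):            # fell off the bottom of the tree: wrap to the root
        idx = 0
    return idx, val

def walk_serial(tree, idx, val, rounds):
    """One walker: every step depends on the previous one, so this is a serial chain."""
    for _ in range(rounds):
        idx, val = step(tree, idx, val)
    return idx, val

def walk_batch(tree, idxs, vals, rounds):
    """Many walkers in lockstep: the inner loop's iterations are independent of
    each other, which is the part SIMD lanes or VLIW slots could cover."""
    idxs, vals = list(idxs), list(vals)
    for _ in range(rounds):
        for lane in range(len(idxs)):
            idxs[lane], vals[lane] = step(tree, idxs[lane], vals[lane])
    return idxs, vals

tree = list(range(1, 16))           # implicit binary tree stored in an array
starts = [0, 0, 0, 0]
seeds = [11, 22, 33, 44]
batched = walk_batch(tree, starts, seeds, rounds=6)
serial = [walk_serial(tree, i, v, 6) for i, v in zip(starts, seeds)]
```

The batched version does the same work as the serial one, just reordered so that independent steps sit next to each other. The data-dependent `tree[idx]` lookup is the awkward part, since each lane may read a different address.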
The question isn't clearly written down anywhere, that's why. Presumably actual candidates would have been given more info over the phone or email. Part of the "challenge" is reverse engineering their Python; unclear if that's intentional.
If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line: https://github.com/anthropics/original_performance_takehome/...
This function is described only as:
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the python function reference_kernel2 which is found in problem.py.
What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off, since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.
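To make the "scheduling is done ahead of time" point concrete, here is a minimal sketch of greedy bundle packing. It does not use the take-home simulator's real instruction format or slot counts: the `(engine, dest, srcs)` op tuples, the `SLOT_LIMITS` numbers and the `pack_bundles` name are all hypothetical. It also assumes every destination is written exactly once, so only read-after-write dependencies matter.

```python
from collections import defaultdict

# Hypothetical per-bundle slot limits for each engine; not the simulator's real numbers.
SLOT_LIMITS = {"alu": 2, "load": 1}

def pack_bundles(ops, limits=SLOT_LIMITS):
    """Greedily pack ops into VLIW-style bundles (one bundle = one cycle).

    Each op is (engine, dest, srcs). An op is placed in the earliest bundle
    that comes after the producers of all its sources and still has a free
    slot on the op's engine.
    """
    ready_at = {}   # value name -> index of the first bundle that may read it
    bundles = []    # each bundle maps engine -> list of ops issued that cycle
    for engine, dest, srcs in ops:
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        while True:
            while len(bundles) <= cycle:
                bundles.append(defaultdict(list))
            if len(bundles[cycle][engine]) < limits[engine]:
                break
            cycle += 1  # engine is full in this bundle, try the next one
        bundles[cycle][engine].append((engine, dest, srcs))
        ready_at[dest] = cycle + 1  # result is visible from the next bundle on
    return bundles

ops = [
    ("load", "a", []),
    ("load", "b", []),
    ("alu", "c", ["a"]),
    ("alu", "d", ["b"]),
    ("alu", "e", ["c", "d"]),
]
bundles = pack_bundles(ops)
```

With one load slot per bundle, the five ops fit in four bundles instead of five, because the second load and the first ALU op share a cycle. An out-of-order CPU discovers that overlap at run time. On a VLIW machine whoever writes the program has to find it.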
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
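As a rough illustration of broadcasting, here is a tiny sketch in plain Python. The helper names (`broadcast`, `pack_lanes`, `vector_add_scalar`) are made up for this example and are not taken from the simulator. Lanes are modelled as list elements, and `pack_lanes` only exists to show the bit pattern you would see if the lanes were subwords of one wide register.

```python
def broadcast(value, lanes):
    """Repeat a single scalar across `lanes` vector lanes."""
    return [value] * lanes

def pack_lanes(lane_values, lane_bits):
    """Pack lanes into one integer, lane 0 in the lowest bits (for display only)."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(lane_values):
        word |= (v & mask) << (i * lane_bits)
    return word

def vector_add_scalar(vec, scalar):
    """Add a scalar to every lane by broadcasting it first, then adding lane-wise."""
    return [x + s for x, s in zip(vec, broadcast(scalar, len(vec)))]

# Eight 8-bit lanes all holding 0xFF look like one 64-bit word of all ones.
wide = pack_lanes(broadcast(0xFF, 8), 8)
```

Here `wide` equals `0xFFFFFFFFFFFFFFFF`, matching the example above, and `vector_add_scalar([1, 2, 3], 10)` gives `[11, 12, 13]`.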
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.
This is a nice writeup, thanks. Another commenter said it would've taken them 2h just to sketch out ideas; sans LLMs, it would've taken me more than 2h just to collect all this info, let alone start optimizing it.
It took me about 10 minutes to generate that writeup the old fashioned 100% organic way, because one of the things that's unspecified is whether you're allowed to use AI to help solve it! So I assumed as it's a job interview question you're not allowed, but now I see other comments saying it was allowed. That would let you get much further.
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
I've not written a VM before, but the comments in perf_takehome.py and problem.py explain the basics of this.
I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
On the one hand, this exercise probably reflects a realistic task. Daily engineering work comprises a lot of reverse engineering and debugging of messy code. On the other hand, this does not seem very suitable as an isolated assignment. The lack of code base-specific context has a lot of potential for frustration. I wonder what they really tested on the candidates, and whether this was what they wanted to filter for.
> but which requires all parallelism to be statically declared ahead of time
this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles
another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
Wow! Thanks for the explanation :)
Which part exactly are you having trouble with?
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator
Since it's a CPU, you start with the idea that there is an ALU and spiral outward from that. That gives you something concrete to wrap your head around while you climb up the abstraction levels.
However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
Generate instructions for their simulator to compute some numbers (hashes) in whatever is considered the memory of their "machine"¹. I didn't see any places where they actually disallow cheating b/c it says they only check the final state of the memory² so seems like if you know the final state you could just "load" the final state into memory. The cycle count is supposedly the LLM figuring out the fewest number of instructions to compute the final state but again, it's not clear what they're actually measuring b/c if you know the final state you can cheat & there is no way to tell how they're prompting the LLM to avoid the answers leaking into the prompt.
¹https://github.com/anthropics/original_performance_takehome/...
²https://github.com/anthropics/original_performance_takehome/...
Well, they read your code in the actual hiring loop.
My point still stands. I don't know what the LLM is doing so my guess is it's cheating unless there is evidence to the contrary.
I guess your answer to "Try to run Claude Code on your own 'ill-defined' problem" would be "I'm not interested." Correct? I think we can stop here then.
Well that's certainly a challenge when you use LLMs for this test driven style of programming.
Why do you assume it’s cheating?
Because it's a well-known failure mode of neural networks & scalar valued optimization problems in general: https://www.nature.com/articles/s42256-020-00257-z
Again, you can just read the code
And? Anthropic is not aware of this 2020 paper? The problem is not solvable?
Being smart is different from having the knowledge. If you learn about these concepts and work on these problems, then you will be able to solve them.
It's not about you being average, just a different knowledge set.
What we know is a drop, what we don't know is an ocean.
It comes with test suites, so that gives you a base to start from. You can at the very least do trial-and-error and come up with some heuristics on the fly. You're at a huge disadvantage to someone who has some familiarity but can convincingly play it off as being a newcomer, though.
disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.
the hot take is, there are other games.
This is the opposite of leet code.
Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
This is explicitly not Leetcode, in fact its goal is to attract optimizers
Also leetcode does not really provide insight into one's ability to design business solutions. Whether it be system design, just some small feature implementation or communication skills within a team. It's just optimizers jerking each other off on some cryptic problems 99.999999999% of developers will never see in real life. Maybe it would've been useful like 30 years ago, but all commonly used languages have all these fancy algorithms baked into their stdlib, why would I ever have to implement them myself?
But this is an interview problem at Anthropic, not at your local CRUD factory. They _are_ looking for the optimizers, because they _are_ working on cryptic problems the 99.9999% of us will never encounter.
Or more likely, the commonality is how you're applying your software skills?
In every other field it's helpful to understand the basics. I don't think software is the exception here.
Understanding the basics is very different from being able to memorize algorithms. I really don't see why I'd ever have to implement stuff like quicksort myself somewhere. Yes I know what recursion is, yes I know what quicksort is, so if I ever need it I know what to look for. Which was good enough throughout my career.
I suspect this was released by Anthropic as a DDOS attack on other AI companies. I prompted 'how do we solve this challenge?' into gemini cli in a cloned repo and it's been running non-stop for 20 minutes :)
Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
I've noticed using antigravity and vscode, Gemini 3 pro often comes back with model too busy or something like that and basically 500s.
Seems like capacity because it works a lot better late at night.
I don't see the same with the claude models in antigravity.
I saw this too. Sometimes it "thinks" inside of the actual output, and it's much more likely to end up in the loop of "I am ready to answer" while it is already doing that.
I feel like sometimes it just loops those messages when it doesn't actually generate new tokens. But I might be wrong
There are some other failure modes that all feel kinda vaguely related that probably help with building a hypothesis about what's going wrong:
Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.
And, sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So sometimes I'll see a bunch of paragraphs starting with "Wait, " as it works stuff out and then at the end it says "I understand the issue" or whatever, then it waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.
It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.
Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.
/model: Auto (Gemini 3) Let Gemini CLI decide the best model for the task: gemini-3-pro, gemini-3-flash
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
Did you check that it did the things it claims it did?
we've lost the plot.
you can't compete with an AI on doing an AI performance benchmark?
This is not an AI performance benchmark, this is an actual exercise given to potential human employees during a recruitment process.
> sucks dog crap through a coffee straw.
That would be impressive.
New LLM benchmark incoming? I bet once it's done, people will still say it's not AGI.
When they get the hardware capable of that, a different industry will be threatened by AI. The oldest industry.
Textile?
The emperor's (empresses?) new textile.
Naively tested a set of agents on this task.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".
codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." ~53 minutes.
Will you look at this man's prompting skills?!
Serious prompt engineering right here
Wow, is gpt-5-2-codex-xhigh really that good in general? Is this the $200 per month version?
Very interesting thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it ended it seems like there is a lot more potential.
Could you try with some open-weighted models, e.g. Qwen3-coder, GLM-4.7 or Devstral-2?
Can you share the agent-comparison harness code or point to something similar? I want to learn about benchmarking models in a basic or practical sense.
Could you make a repo with solutions given by each model inside a dir/branch for comparison?
Are you giving instructions to a stranger on the internet?
Instructions?! Just asked since GP already did it. No need to realize top comment's "DDOS attack on other AI companies" joke.
I think he’s asking rather than giving instructions
He's prompting
I do wonder how Grok would compare, specifically their Claude Code Fast model.
This is a really fun problem! I suggest anyone who likes optimization in a very broad sense try their hand at it. Might be the most fun I've had while interviewing. I had to spend a week's worth of evenings on it to fully scratch the itch, and I managed to get 1112 cycles. But that was mostly manual, before the current crop of agentic models (clopus 4.5, gpt5.2). I wonder how far you can RalphWiggum it!
I've never heard AI-assisted coding referred to as "RalphWiggum"ing a problem, and now I will have to use that always. Thank you.
It's pretty interesting how close this assignment looks to demoscene [1] golf [2].
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
I was in the demoscene long ago and that kind of optimisation is definitely in the ballpark of what we did: optimize algorithm down to machine code level (and additionally, cheat like hell to make you believe we ran the algorithm for real :-)).
But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
It’s some useless problem like a random tree walk or something like that, the actual algorithm is not particularly important to the problem
Yeah, I assume it was partly chosen since the problem structure provides some convenient hooks for selectively introducing subtle and less subtle inefficiencies in the baseline algorithm that match common optimization patterns.
perfetto is pretty widely used for such traces, because building a viewer for your traces is a completely avoidable pain.
it's designed to select for people who can be trusted to manually write ptx :-)
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.
Unless I misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform the best returned solution. The best candidate could've taken 6h~2d to achieve this result.
Their Readme.md is weirdly obsessed with "2 hours":
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
No the 2 hours is their time limit for candidates. The thing is that you are allowed to use any non-human help for their take homes (open book), so if AI can solve it in below 2 hours, it's not very good at assessing the human.
4 hours but AI help is (was?) allowed. I assume it was retired because of Opus basically oneshotting it
Fair enough. I feel like designing AI-proof take-homes is getting ever more futile. Given the questions need to be sufficiently low context to be human-doable in a short time and timespans for AI tasks increasing, I'm not sure take homes can actually serve any filtering function whatsoever, besides checking if applicants are willing to put in a minimal amount of effort.
Is it "write 20 astroturfing but somewhat believable posts about the merits of "AI" and how it is going to replace humans"?
I'm afraid that position is already filled by the CEO.
It should be "can you gaslight a CEO into firing 90% of their software engineers?"
Having done a bunch of take home for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.
And the answer to the obvious follow-up question is...?
Milk before cereals
Milk, then cereal, then bowl!
How about a bowl, and then, 30 minutes ~ 1 hour later, milk with cereals?
42
Maybe it's under NDA :)
fries
What does clock cycles mean? Don’t think they are referring to the cpu clock?
They should just have you create a problem that can't be solved by an llm in two hours. That's the real problem here
What is the actual assignment here?
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." from perf_takehome.py
Think that means you failed :(
+1
being cryptic and poorly specified is part of the assignment
just like real code
in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
I didn't see much that was cryptic except having to click on "perf_takehome.py" without being told to. But 2 hours didn't seem like much to bring the sample code into some kind of test environment, debug it enough to work out details of its behaviour, read through the reference kernel and get some idea of what the algorithm is doing, read through the simulator to understand the VM instruction set, understand the test harness enough to see how the parallelism works, re-code the algorithm in the VM's machine language while iterating performance tweaks and running simulations, etc.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
My instinct to read about the problem was to open the "problem.py" file, which states "Read the top of perf_takehome.py for more introduction"
So yeah. They _could_ have written it much more clearly in the readme.
2 hours does seem short. It took me a half hour to get through all you listed and figure out how to get the valu instruction working.
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
it's "cryptic" for an interview problem. e.g. the fact that you have to actually look at the vm implementation instead of having the full documentation of the instruction set from the get go.
That seems normal for an interview problem. They put you in front of some already-written code and you have to fix a bug or implement a feature. I've done tons of those in live interviews. So that part didn't bother me. It's mostly the rather large effort cost in the case where the person is a job applicant, vs an unknown and maybe quite low chance of getting hired.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
There's another thread about this article, which explains an analogous situation about being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
It's definitely cleaner than what you will see in the real world. Research-quality repositories written in partial Chinese with key dependencies missing are common.
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
“In English, Data”
It's more of a showcase than a take-home assignment. I couldn't understand what the task is, only performance comparisons between their LLMs.
The task is ill-defined.
You make it faster
Fewer instructions doesn't mean it's faster. It can be faster but it's not guaranteed in general. Obvious counterexample is single threaded vs multi-threaded code. Single threaded code will have fewer instructions but won't necessarily be faster.
It does in this case; you can read the assignment to see that it is all single-threaded
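To make the instruction count vs. cycle count distinction concrete, here is a toy sketch. It is not the take-home's simulator: the machine model (unit latency, up to `width` independent ops issued per cycle) and the names `count_cycles`, `chain`, and `independent` are assumptions made up for illustration. Both programs contain eight ops, yet they take very different numbers of cycles.

```python
def count_cycles(deps, width):
    """Greedy schedule on a hypothetical machine.

    deps[i] lists the ops that op i depends on. Each cycle issues up to
    `width` ops whose dependencies all finished in an earlier cycle
    (every op is assumed to take exactly one cycle).
    """
    done = set()
    remaining = list(range(len(deps)))  # ops in program order
    cycles = 0
    while remaining:
        # Ops are ready only if all their inputs completed in a prior cycle.
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        issued = ready[:width]
        done.update(issued)
        remaining = [op for op in remaining if op not in issued]
        cycles += 1
    return cycles


# Eight ops forming one dependency chain: op i needs op i-1.
chain = [[]] + [[i - 1] for i in range(1, 8)]
# Eight fully independent ops.
independent = [[] for _ in range(8)]

chain_cycles = count_cycles(chain, width=4)
independent_cycles = count_cycles(independent, width=4)
print(chain_cycles, independent_cycles)
```

Under these assumptions the chained program cannot use the extra issue slots at all, while the independent one fills every slot, so the same op count gives a different cycle count.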
The writing has been on the wall for about half a year (publicly) now. OpenAI's 2nd place at the AtCoder world championship competition was the first one, and I remember it being dismissed at the time. Sakana also got 1st place in another AtCoder competition a few weeks ago. Google also released a blog post a few months back on Gemini 2.5 netting them a 1% reduction in training time on real-world tasks by optimising kernels.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
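A rough sketch of that loop, under loud assumptions: `optimize`, `propose`, `is_correct`, and `measure` are hypothetical names, and the "kernel" here is just an unroll factor scored by a made-up cost model, not anything from the take-home. The point is only the shape: propose a change, verify it cheaply, keep it if it is correct and faster.

```python
import random


def optimize(propose, is_correct, measure, start, rounds, seed=0):
    """Hill-climb: keep the cheapest candidate that passes verification."""
    rng = random.Random(seed)
    best, best_cost = start, measure(start)
    for _ in range(rounds):
        candidate = propose(best, rng)
        if not is_correct(candidate):
            continue  # cheap verification rejects wrong answers outright
        cost = measure(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


# Hypothetical toy problem: pick an unroll factor for a 256-element loop.
def propose(unroll, rng):
    return rng.choice([unroll * 2, max(1, unroll // 2), unroll + 1])


def is_correct(unroll):
    return 256 % unroll == 0  # factor must divide the trip count evenly


def measure(unroll):
    return 256 // unroll + 3 * unroll  # made-up cost: iterations + overhead


best, best_cost = optimize(propose, is_correct, measure, start=1, rounds=200)
print(best, best_cost)
```

In this toy model an unroll factor of 9 would score slightly better than 8, but it fails verification, which is why the correctness check has to sit in front of the cost measurement.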
1% doesn't sound like a lot at all.
I am able to beat this 1487 benchmark by switching between LLMs, doesn't seem that hard lol. Albeit, I do not fully understand what the solution is, loll
Yeah, GPT 5.2 on high got down to 1293 on the 5th try (about 32mins).
Are you allowed to change the instruction sequence? I see some optimization opportunities. The obviously correct thing would be to write an optimizing compiler, but considering the time allotted, I'd guess you could hand-optimize it, though that feels like cheating.
Yes, in fact this will be one of the first things you will want to do.
> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
Oh, I thought candidates got 2 hours but now I am confused too
4 hours
if anyone is interested in trying their agent-fu, here's a more real-world rabbit hole i went down optimizing in 2024. Note this is now a dead project, no one's using it, and probably same for the original. i managed to get it 2x-4x faster than the original, took me several days then. btw there are some 10x optimizations possible but they break a few edge cases, so not entirely correct.
https://github.com/svilendobrev/transit-python3
Yet Claude is the only agent which deadlocks (blocks in GC forever) after an hour of activity.
“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”
> at launch
Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
No, they later updated the harness for this and it subsequently got better scores.
The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.
Oh, this was fun! If you like performance puzzles you should really do it. Actually I might go back and see if I can improve on it this weekend…
The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?
I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".
Should have asked Claude how to write it better.
In that case, removing „perhaps“ would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.
They don't want to guarantee an interview to everyone who sends them an improved solution, either.
If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
They wrote:
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.
Yeah, a nerd bypassed HR and showed their true character. They are swimming in easy money.
That paraphrases to
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
I suspect this is partially legal CYA.
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in a US-sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
I understand how it can be interpreted as snarky, but how could it have been written better? It's a hard path to walk and recruiting/interviewing is inherently sensitive it seems.
The original
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
But now the meaning is different: you went from a potential interview to a guaranteed one.
No fucking shit, I paraphrased Anthropic's comments as
> do better than we have publicly admitted most of humanity can do, and we may deign to interview you
If you tell someone who has passed a test that 99.999% of humanity cannot pass that they _may_ get an interview, you are being snarky/condescending.
That's not how paraphrasing works. They probably intentionally held back from guaranteeing an interview, for various reasons. One that seems obvious to me is that with the bar set at "Claude Opus 4.5's best performance at launch", it's plausible that someone could meet it by feeding the problem into an LLM. If a bunch of people do that, they won't want to waste time interviewing them all.
Or honest?
You may want to consider the distribution and quantity of replies before stating that you WILL do something that might just waste more people’s time or not be practical.
The classy thing to do would be responding to every qualifying submission, even if it’s just to thank everyone and let some people know the field was very competitive if an interview won’t be happening.
So I like these public challenges, but as someone who set some public questions, ask any company who ran any public contest for their opinion. The pool is filled with scammers who either bought the solutions through sites like Chegg or sometimes even just stackoverflow.
I took the "perhaps" as a decision to be considered by the applicant, considering they'd be competent enough to get in at a place of their choice, not just anthropic.
Does the applicant or the employer decide if an interview happens in your experience?
Do you think if the applicants are really in that level of demand that they would be getting a take home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer is chasing after employees who are high in demand, give them a test that is expected to take hours, and have a hedged bet in their wording, instead of saying we will absolutely hire you if you pass X bar?
They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)
If you're an asshole that wants millions of dollars...i mean there's still places to say no
Pride comes before fall thankfully
it's Anthropic. their entire marketing is just being a pompous ass and AI fear mongering.
> so we can be appropriately impressed and perhaps discuss interviewing.
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.
Remember: It is a company that keeps saying how much production code can be written by AI in xx years, while at the same time recruiting new engineers.
Looks rather fun!
Going through the assignment now. Man it’s really hard to pack the vectors right
This is a knowledge test of GPU architecture?
Kind of, but not any particular GPU.
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and the forthcoming C++ standard calls std::mdspan.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
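For anyone who hasn't met the shape-and-stride idea before, here is a minimal sketch of it. This is not CuTe's or `std::mdspan`'s API; the helper names `offset` and `row_major_strides` are made up for illustration. A layout is just a rule mapping a multi-dimensional index to a flat memory offset, and a strided layout does that with one multiplier per dimension.

```python
def offset(index, strides):
    """Flat memory offset of a multi-dimensional index under a strided layout."""
    return sum(i * s for i, s in zip(index, strides))


def row_major_strides(shape):
    """Strides for a dense layout where the last dimension is contiguous."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


shape = (4, 8)                        # 4 rows, 8 columns
row_major = row_major_strides(shape)  # (8, 1)
col_major = (1, 4)                    # first dimension contiguous instead

print(offset((2, 3), row_major), offset((2, 3), col_major))
```

The same logical element lands at a different address depending on the layout, which is the thing you end up reasoning about when deciding how to pack data for vector loads.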
You can get pretty far without needing to care about this fwiw
Not far enough if you're turning cash into waste heat with GPUs :)
I wonder if the AI is doing anything novel? Or if it's like a brute-force search applying all the types of existing optimizations that have already been written about.
How can something that generates the next token, given a list of previous tokens, do something novel?
I wonder if OpenAI follows suit.
They should.
Interesting... Who would spend hours working for free for some company that promised only that they would invite you for a job interview. Maybe.
I guess someone who enjoys solving these kinds of problems anyway, and thinks the potential upside if they do get hired is worth it.
Oh wow it’s by Tristan Hume, still remember you from EyeLike!
It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.
I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.
I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
I have forgone our take-home for exceptional candidates, but let me ask you, do you also demand compensation for in-person or Zoom call 1-1 interviews? Surely that's the same time of your life.
For many reasons, you’re not gonna get into Anthropic with that attitude.
And Anthropic will never land heavyset_go with their attitude. I guess we’re at an impasse.
I don't care
Time is the issue, not money.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
4 hours continuous or no? I can't imagine finding 4 hours of straight focus.
These kinds of roles are for youngsters with minimal commitments who are looking for their shot to break into a wild industry. It’s not for the middle aged single parent with FTE and just enough free time to do an extra load of laundry.
Continuous
If you look at it as a puzzle game then it's not any different than the time you use to play other games.
I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.
Care to elaborate on the first part?
Did you apply for a position? Did they send you the assignment without prior discussion?
Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?
The expected time you spend on it is much less than the expected time they'll spend on it.
you don't get paid for it
It’s kind of an interesting problem.
This proves a lot of things:
1) Python is unreadable.
2) AI companies are content with slop and do not even bother with clear problem statements.
3) LOC and appearance matter, not goals or correctness.
4) Anthropic must be a horrible place to work at.
Well working under someone who keeps insisting Software engineering is dead sounds like a toxic work environment.
"1) Python is unreadable."
Would you prefer C or C++?
"2) AI companies are content with slop and do not even bother with clear problem statements."
It's a filter. If you don't get the problem, you'll waste their time.
"3) LOC and appearance matter, not goals or correctness."
The task was goal+correctness.
"4) Anthropic must be a horrible place to work at."
Depends on what you do. For this position it's probably one of the best companies to work at.
Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
> And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)
And they’re all dumber than John von Neumann, who cares?
Transitively, you haven't thought the most thoughts or cared the most about anything, therefore we should disregard what you think and care about?
The person replying was trying to turn the conversation into some sort of IQ pissing contest. Not sure why, that seems like their own problem. I was reminding them that there is always someone smarter.
Your comment history is littered with “nerds”, “elite”, “better” and all sorts of comparisons.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
> Do you have some high school trauma?
I am not sure ad personam is appropriate here
This is a thread about their personality.
https://news.ycombinator.com/item?id=46701378
Where I come from, nerd is a term of endearment buddy.
> And even with this comment you literally do not understand that you have some skewed view of the world.
I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
If they're hiring performance engineers then they're hiring for exactly these sets of skills.
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
The solution was explicitly graded on creativity fwiw
This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.
> Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.