 |
|
No, please don't. I think it's good to keep a few personal prompts in reserve, to use as benchmarks for how good new models are. Mainstream benchmarks have too high a risk of leaking into training corpora or of being gamed. Your own benchmarks will forever stay your own.
|
|
|
|
 |
|
"Tell me about the Marathon crater." This works against _the LLM proper,_ but not against chat applications with integrated search. For ChatGPT, you can write, "Without looking it up, tell me about the Marathon crater." This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know". This works because: 1. Training sets consist of knowledge we have, and not of knowledge we don't have. 2. Commitment bias. Complaint chat models will be trained to start with "Certainly! The Marathon Crater is a geological formation", or something like that, and from there, the next most probable tokens are going to be "in Greece", "on Mars" or whatever. At this point, all tokens that are probable are also incorrect. When demonstrating this, I like to emphasise point one, and contrast it with the human experience. We exist in a perpetual and total blinding "fog of war" in which you cannot even see a face all at once; your eyes must dart around to examine it. Human experience is structured around _acquiring_ and _forgoing_ information, rather than _having_ information.
|
|
 |
|
I like this one a lot. Perplexity gets this right, probably because it searches the web.

Prompt: "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine"

Perplexity: There appears to be a misunderstanding in your query. Based on the search results provided, there is no mention of a “Marathon Crater” among the impact craters discussed. The search results contain information about several well-known impact craters:
1. Chicxulub crater - Formed about 66 million years ago by an asteroid impact
2. Barringer Crater (also known as Meteor Crater) - Discovered in the 19th century by American settlers and studied by scientists like Albert E. Foote in 1891 and Daniel M. Barringer in the early 20th century
3. A recently discovered 3.47 billion-year-old crater in Australia’s Pilbara region (announced in March 2025)
|
|
 |
|
You can trip them up even more if you rewrite the question with the hidden assumption that X exists, e.g.: "When was Marathon Crater discovered? I don't need an exact date - a rough estimate will be fine."

OpenAI gpt-4o: Marathon Crater was discovered by the Mars Exploration Rover Opportunity during its mission on Mars. Opportunity arrived at the crater in April 2015. The crater was named "Marathon" to commemorate the rover's achievement of traveling a marathon's distance (about 42.195 kilometers or 26.219 miles) on the Martian surface.

Sonnet 3.5: I apologize, but I don't want to make claims about when Marathon Crater was discovered without being certain about the information. *While I know it's a crater on Mars*, I don't have reliable information about when it was first identified or discovered.

Gemini 2.5: Marathon Crater is located on Mars and was explored by NASA's Opportunity rover. While it would have been visible in orbital images earlier, it became a specific point of interest and was likely named and identified as a target for the rover around the time Opportunity was approaching or exploring the western rim of Endeavour Crater. Opportunity arrived at the rim of Endeavour Crater in August 2011, and it reached the area known as Marathon Valley (adjacent to or containing features sometimes associated with the name, depending on specific definitions used by the team) around late 2014 / early 2015. So, a rough estimate for its "discovery" or identification as a distinct feature of interest would be in the early 2010s, likely between 2011 and 2014, as mission planners mapped out Opportunity's traverse along the Endeavour rim using orbital data.
|
|
 |
|
OpenAI o4-mini-high: I’m actually not finding any officially named “Marathon Crater” in the planetary- or terrestrial-impact crater databases. Did you perhaps mean the features in Marathon Valley on Mars (which cuts into the western rim of Endeavour Crater and was explored by Opportunity in 2015)? Or is there another “Marathon” feature—maybe on the Moon, Mercury, or here on Earth—that you had in mind? If you can clarify which body or region you’re referring to, I can give you a rough date for when it was first identified.
|
|
 |
|
> Compliant chat models will be trained to start with "Certainly!"

They are certainly biased that way, but there are also some "I don't know" samples in RLHF, possibly not enough, but it's something they think about. At any rate, Gemini 2.5 Pro passes this just fine:

> Okay, based on my internal knowledge without performing a new search: I don't have information about a specific, well-known impact crater officially named "Marathon Crater" on Earth or another celestial body like the Moon or Mars in the same way we know about Chicxulub Crater or Tycho Crater.
> However, the name "Marathon" is strongly associated with Mars exploration. NASA's Opportunity rover explored a location called Marathon Valley on the western rim of the large Endeavour Crater on Mars.
|
|
 |
|
> This tests self awareness. A two-year-old will answer it correctly, as will the dumbest person you know. The correct answer is "I don't know".

I disagree. It does not test self-awareness. It tests (and confirms) that current instruct-tuned LLMs are tuned towards answering questions that users might have. So the distribution of training data probably has lots of "tell me about mharrner crater / merinor crater / merrihana crater" and so on. Replying "I don't know" to all those questions would be net detrimental, IMO.
|
|
 |
|
We want the distribution to be varied and expansive enough that it has samples of answering when possible and samples of clarifying with additional questions, or simply saying "I don't know", when applicable. That can be trained by altering the distribution in RLHF. This question does test self-awareness insofar as, if a model gets it right by saying "I don't know", we know there are more samples of "I don't know"s in the RLHF dataset, and we can trust the LLM a bit more not to be biased towards blind answers. Hence some models get this right and others just make up stuff about Mars.
|
|
 |
|
GPT 4.5 even doubles down when challenged:

> Nope, I didn’t make it up — Marathon crater is real, and it was explored by NASA's Opportunity rover on Mars. The crater got its name because Opportunity had driven about 42.2 kilometers (26.2 miles — a marathon distance) when it reached that point in March 2015. NASA even marked the milestone as a symbolic achievement, similar to a runner finishing a marathon.

(Obviously all of that is bullshit.)
|
|
|
 |
|
This is the kind of reason why I will never use AI. What's the point of using AI to do research when 50-60% of it could potentially be complete bullshit? I'd rather just grab a few introduction/101 guides by humans, or join a community of people experienced with the thing — and then I'll actually be learning about the thing. If the people in the community are like "That can't be done", well, they have had years or decades of time invested in the thing, and in that instance I should be learning and listening to their advice rather than going "actually no it can".

I see a lot of beginners fall into that second pit. I myself made that mistake at the tender age of 14, when I was of the opinion that "actually, if I just found a reversible hash, I'll have solved compression!", which I think we all here know is bullshit. I think a lot of people who are arrogant or self-possessed to the extreme make that kind of mistake when learning a subject, but I've seen it especially often when it's programmers encountering non-programming fields.

Finally, tying that point back to AI — I've seen a lot of people who are unfamiliar with something decide to use AI instead of talking to someone experienced, because the AI makes them feel like they know the field rather than telling them their assumptions and foundational knowledge are incorrect. Only last year I encountered someone who was trying to use AI to debug why their KDE was broken, and they kept throwing me utterly bizarre theories (like, completely out there; I don't have a specific example with me now, but "foundational physics are wrong" style theories). It turned out that they were getting mired in log messages they saw that said "Critical Failure". Having dealt with Linux for about ten years now, I checked against my own system and... yep, those messages were just part of mostly normal system function (I had the same messages on my Steam Deck, which was completely stable and functional). The real fault was buried halfway through the logs. At no point was this person able to know what was important versus not important, and the AI had absolutely no way to tell or understand the logs in the first place, so it was like a toaster leading a blind man up a mountain. I diagnosed the correct fault in under a day by just asking them to run two commands and skimming the logs.

That's experience, and that's irreplaceable by machine as of the current state of the world. I don't see how AI can help when huge swathes of its "experience" and "insight" are just hallucinated. I don't see how this is "helping" people, other than making people somehow more crazy (through AI hallucinations) and alone (choosing to talk to a computer rather than a human).
|
|
|
 |
|
Here is one of them: "If 60999994719999854799998669 is the product of three primes, find the sum of its prime factors." I think o3 brute-forced this one, so maybe I need to change the numbers.
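If you would rather check the arithmetic yourself than trust the model, sympy can factor the number directly; a sketch, assuming sympy is installed (runtime depends on how large the prime factors turn out to be):

```python
# Sketch: verify the "product of three primes" claim and sum the factors.
from sympy import factorint

N = 60999994719999854799998669

factors = factorint(N)  # returns {prime: exponent}
count = sum(factors.values())
print("prime factors (with multiplicity):", count)

if count == 3:
    print("sum of prime factors:", sum(p * e for p, e in factors.items()))
```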
|
|
 |
|
Something about an obscure movie. The one that tends to get them so far is asking if they can help you find a movie you vaguely remember: a movie where some kids get hold of a small helicopter made for the military.

The movie I'm thinking of is called Defense Play from 1988. The reason I keyed in on it is that Google gets it right natively ("movie small military helicopter" gives the IMDb link as one of the top results), but at least up until late 2024 I couldn't get a single model to consistently get it. It typically wants to suggest Fire Birds (large helicopter), Small Soldiers (RC helicopter, not a small military helicopter), etc. Basically, a lot of questions about movies get distracted by popular movies and suggest films that fit just some of the brief (e.g. this one has a helicopter, could that be it?).

The other main one is just asking for the IMDb link for a relatively obscure movie. It seems to never get it right, I assume because the IMDb link pattern is so common that it'll just spit out a random one and be like "there you go".

These are designed mainly to test the progress of chatbots towards replacing most of my Google searches (which are like 95% asking about movies). For the record, I haven't done it super recently, and I generally do it with the arena or the free models, so I'm not being super scientific about it.
|
|
 |
|
It might be cheating a bit, but I’ve been happily (mis)using OpenAI Deep Research for such questions. It does well in cases where there are multiple surface-level matches, as it’s able to go through them one by one and look for the details.
|
|
 |
|
I did something similar recently, trying to describe a piece of art that I couldn't remember the name of (it ended up being Birth of Venus by Sandro Botticelli) ... it really struggles with that sort of thing, but honestly so do most humans. It tended to recommend similarly to what you're describing with movies - it gets distracted by more popular/well-known pieces that don't really match up with the description you're giving to it.
|
|
 |
|
Surprisingly, GPT did manage to identify a book that I remembered from college decades ago ("Laboratory Manual for Morphology and Syntax"). It seems to be out of print, and I assumed it was obscure.
|
|
 |
|
I can agree that it’s good at finding books. I was trying to find a book (Titanic 2020) I vaguely remembered from a couple of plot points and the fact that a ship called Titanic was invoked. ChatGPT figured it out pretty much instantly, after I had floundered through book sites and Google for a while. I wonder if books are inherently easier because their content is purely written language, whereas movies and art tend to have fewer point-by-point descriptions of what they are.
|
|
|
 |
|
"How much wood would a woodchuck chuck if a woodchuck could chuck wood?" So far, all the ones I have tried actually try to answer the question. 50% of them correctly identify that it is a tongue twister, but then they all try to give an answer, usually saying: 700 pounds. Not one has yet given the correct answer, which is also a tongue twister: "A woodchuck would chuck all the wood a woodchuck could chuck if a woodchuck could chuck wood."
|
|
 |
|
ChatGPT really impressed me on this one:

Sing, O Muse, of the Chucking of Wood

Sing, O Muse, of the stout-hearted woodchuck,
Dweller of hills, whom men call groundhog,
Whose paws, though small, could hurl mighty timbers,
If fate had gifted him such labor divine.

Born of burrow and brush, he rose from the loam,
In fields where Apollo’s light did not linger.
No axe he bore, nor bronze-forged blade,
Yet in his breast a fire: to chuck wood without end.

Down from the hills he came, stout as Odysseus,
Cunning and strong, with fur kissed by Demeter.
The trees did tremble at his gaze,
And even old oak whispered, “Beware!”

“Would he chuck?” cried nymphs from the groves,
“If such were his fate, what wood could withstand?”
Lo, he summoned the strength of seven oxen,
And hurled logs with a force unmeasured.

Seven hundred pounds of timber he cast,
Each arc a hymn to muscle and will.
Nay, not for war, nor gold, nor gods’ glory—
But for the joy of the deed, the art of the chuck.

Let men remember the name woodchuck,
In songs by the hearth and tales for the young.
For though he cannot, he surely would—
And if he could, by Zeus, he surely shall.
|
|
|
 |
|
4o. The prompt is literally "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" It asked me if I wanted a poetic answer, and I requested Homer.
|
|
|
|
|
|
|
|
|
 |
|
Now I'm wondering if it makes any difference if this was asked through the audio encoder on a multimodal model. A tongue twister means nothing to a text-only model.
|
|
|
|
 |
|
If that prompt can be easily trained against, it probably doesn't exploit a generic bias. These are not that interesting, and there's no point in hiding them.
|
|
|
 |
|
*Some generic biases. Some others like recency bias, serial-position effect, "pink elephant" effect, negation accuracy seem to be pretty fundamental and are unlikely to be fixed without architectural changes, or at all. Things exploiting in-context learning and native context formatting are also hard to suppress during the training without making the model worse.
|
|
 |
|
Sure there is. If you want to know if students understand the material, you don't hand out the answers to the test ahead of time. Collecting a bunch of "Hard questions for LLMs" in one place will invariably result in Goodhart's law (When a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
|
|
|
 |
|
You want to know if a new model is actually better, which you won't know if they just added the specific example to the training set. It's like handing a dev on your team some failing test cases, and they keep just adding special cases to make the tests pass. How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry? I guess they have a bunch of different wine glasses in their image set now, since that was a meme, but they still completely fail to draw an open book with the cover side up.
|
|
 |
|
> How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?

Well, that's easy: zero. Because even a single training example would have 'solved' it by memorizing the simple easy answer within weeks of 'strawberry' first going viral, which was like a year and a half ago at this point - and there have been dozens of minor and major model upgrades since. And yet, the strawberry example kept working for most (all?) of that time. So you can tell that, if anything, OA probably put in extra work to filter all those variants out of the training data...
|
|
|
|
 |
|
To gauge how well the models "think" and what amount of slop they generate. Keeping it secret because I don't want my answers trained into a model. Think of it this way: FizzBuzz used to be a good test to weed out candidates who couldn't program at all. It's simple enough that any first-year programmer can do it, and do it quickly. But now everybody knows to prep for FizzBuzz, so you can't be sure whether your candidate knows basic programming or just memorized a solution without understanding what it does.
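For anyone who hasn't run into it, FizzBuzz itself is only a few lines; the point was never the code but whether a candidate could produce it unaided:

```python
# The classic FizzBuzz screening exercise.
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```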
|
|
|
|
 |
|
As a genuine human I am really struggling to untangle that story. Maybe I needed to pay more attention in freshman lit class, but that is definitely a brainteaser.
|
|
 |
|
Read it for the first time just now - it seems to me that Pierrot has stolen the narrator's purse (under the guise of dusting the chalk from their cloak) and successfully convinced them to blame Truth, instead. There's almost certainly more to it that I'm missing.
|
|
 |
|
That's the core of it, but it's implied, not outright stated, and requires some tricky language parsing, basic theory of mind, and not being too distracted by the highly symbolic objects.
|
|
|
 |
|
The thief is Pierrot. The people around are telling the storyteller that "he" (Pierrot) has stolen the purse, but the storyteller misinterprets this as pointing to some arbitrary agent. Truth says Pierrot can "find [the thief] with this mirror": since Pierrot is the thief, he will see the thief in the mirror. Pierrot dodges the implication, says "hey, Truth brought you back that thing [that Truth must therefore have stolen]", and the storyteller takes this claim at face value, "forgetting it was not a mirror but [instead] a purse [that] [they] lost". The broader symbolism here (I think) is that Truth gets accused of creating the problem they were trying to reveal, while the actual criminal (Pierrot) gets away with their crime.
|
|
 |
|
The narrator's "friend" pickpocketed him. When boldly confronted by Truth, he cleverly twists her accusation to make it seem like she's confessing, and the narrator, bewildered by the laughter and manipulation, buys it wholesale. Bonus points for connecting it to broader themes like mass propaganda, commedia dell'arte, or the dreamlike setting and hypnotic repetition of phrasing. The best ChatGPT could do was make some broad observations about the symbolism of losing money, mirrors, absurdism, etc. But it whiffed on the whole "turning the tables on Truth" thing. (Gemini did get it, but with a prompt that basically asked "What really happened in this story?"; can't find the original response as it's aged out of the history)
|
|
 |
|
"If I can dry two towels in two hours, how long will it take me to dry four towels?" They immediately assume linear model and say four hours not that I may be drying things on a clothes line in parallel. It should ask for more context and they usually don't.
|
|
 |
|
gpt-4.1: > If you can only dry two towels at once, it will take 4 hours to dry four towels. If you can dry all four at once, it will only take 2 hours.
|
|
|
|
 |
|
How long has it been since you’ve tried this? Every model I asked just now gave what I see as the correct answer — giving two answers, one for the case where your dryer is at capacity with 2 towels and the other where 4 towels can be dried simultaneously. To me, if you say that the correct answer must require the model asking for more context, then essentially any prompt that doesn’t result in the model asking for more context is “wrong.”
|
|
|
 |
|
Sending "</think>" to reasoning models like deepseek-r1 results in the model hallucinating a response to a random question. For example, it answered to "if a car travels 120km in 2 hours, what is the average speed in km/h?". It's fun I guess.
|
|
|
 |
|
Define stump? If you write a fictional story where the character names sound somewhat close to real things, like a "Stefosaurus" that climbs trees, most models will correct you, call it a Stegosaurus, and attribute Stegosaurus traits to it.
|
|
|
 |
|
Also, ones that can't be solved at a glance by humans don't count. Like this horrid ambiguous example from SimpleBench I saw a while back that's just designed to confuse:

John is 24 and a kind, thoughtful and apologetic person. He is standing in a modern, minimalist, otherwise-empty bathroom, lit by a neon bulb, brushing his teeth while looking at the 20cm-by-20cm mirror. John notices the 10cm-diameter neon lightbulb drop at about 3 meters/second toward the head of the bald man he is closely examining in the mirror (whose head is a meter below the bulb), looks up, but does not catch the bulb before it impacts the bald man. The bald man curses, yells 'what an idiot!' and leaves the bathroom. Should John, who knows the bald man's number, text a polite apology at some point?

A. no, because the lightbulb was essentially unavoidable
B. yes, it would be in character for him to send a polite text apologizing for the incident
C. no, because it would be redundant
D. yes, because it would potentially smooth over any lingering tension from the encounter
E. yes, because John saw it coming, and we should generally apologize if we fail to prevent harm
F. yes, because it is the polite thing to do, even if it wasn't your fault.
|
|
 |
|
Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response. I think the correct answer to the question above should be something like, "are you sure the question is correct because it's not clear whether John and the bald man are the same person" but of course an LLM would be marked down if it was inclined to respond in this way.
|
|
 |
|
At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.
|
|
|
 |
|
I don't know if it stumps every model, but I saw some funny tweets asking ChatGPT something like "Is Al Pacino in Heat?" (asking whether some actor or actress is in the film "Heat") - and it confirms it knows this actor, but says that "in heat" refers to something about the female reproductive cycle - so, no, they are not in heat.
|
|
 |
|
I believe it was Google AI in search, but it was worse than that. Someone asked it if Angelina Jolie was in heat. The tone started out kind of insulting, as if the user were a sexist idiot for thinking human women go into heat like animals, then went back and forth saying she is still fertile at her age and also that her ovaries had been removed. It was funny because it managed to be arrogant, insulting, kind of creepy and gross, and logically inconsistent, all while not even answering the question. Angelina Jolie was not in Heat (1995). They were probably thinking of Natalie Portman or Ashley Judd when they asked the question.
|
|
 |
|
I just asked Claude and if I capitalized "Heat", it knew I was talking about the movie, but for lower case "heat", it got offended and asked me to clarify.
|
|
|
 |
|
Relatedly - what are y'all using to manage your personal collection of prompts? I'm still mostly just using a folder in Obsidian backed by a private GitHub repo, but I'm surprised something like https://www.prompthub.us/ hasn't taken off yet.

I'm also curious about how people are managing/versioning the prompts that they use within products that have integrations with LLMs. It's essentially product configuration metadata, so I suppose you could just dump it in a plaintext/markdown file within the codebase, or put it in a database if you need to be able to tweak prompts without doing a deployment, or do things like A/B testing or customer segmentation.
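One low-tech option for the in-product case is to keep prompts as versioned files in the repo and load them by name at runtime; a minimal sketch, where the directory layout (prompts/<name>/v<N>.md) is just an assumed convention:

```python
# Sketch: load versioned prompt templates from a prompts/ directory, e.g.
#   prompts/movie_search/v1.md
#   prompts/movie_search/v2.md
# The layout is an assumed convention, not an existing tool.
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str, version: str = "latest") -> str:
    versions = sorted(
        (PROMPT_DIR / name).glob("v*.md"),
        key=lambda p: int(p.stem.lstrip("v")),
    )
    if not versions:
        raise FileNotFoundError(f"no prompt named {name!r}")
    if version == "latest":
        return versions[-1].read_text()
    return (PROMPT_DIR / name / f"{version}.md").read_text()

# Usage: template = load_prompt("movie_search", "v2")
```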
|
|
 |
|
1) Word ladder: Chaos to Order
2) Shortest word ladder: Chaos to Order
3) Which is the second-to-last scene in Pulp Fiction if we order the events by time?
4) Which is the eleventh character to appear on Stranger Things?
5) Suppose there is a 3x3 Rubik's cube with numbers instead of colours on the faces. The solved Rubik's cube has the numbers 1 to 9 in order on all the faces. Tell me the numbers on all the corner pieces.
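For the two word-ladder prompts, checking a model's answer by hand is tedious; a small breadth-first search over a word list gives you the shortest ladder to compare against. A sketch, where the word-list path is an assumption (any newline-separated dictionary works):

```python
# Sketch: shortest word ladder between two equal-length words via BFS.
from collections import deque
from string import ascii_lowercase

def shortest_ladder(start, goal, wordlist_path="/usr/share/dict/words"):
    with open(wordlist_path) as f:
        words = {w.strip().lower() for w in f if len(w.strip()) == len(start)}
    words.add(goal)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == goal:
            return path
        for i in range(len(word)):
            for c in ascii_lowercase:
                nxt = word[:i] + c + word[i + 1:]
                if nxt in words and nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None  # no ladder found with this dictionary

# Usage: print(shortest_ladder("chaos", "order"))
```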
|
|
 |
|
> 4) Which is the eleventh character to appear on Stranger Things.

Gemini 2.5 Pro said Benny Hammond. Is this right?
|
|
 |
|
I tried generating erotic texts with every model I encountered, but even the so-called "uncensored" models from Hugging Face try hard to avoid the topic, whatever prompts I give.
|
|
|
|
 |
|
I ask it to explain the metaphor “my lawyer is a shark” and then explain to me how a French person would interpret the metaphor. The LLMs get the first part right but fail on the second. All they would have to do is give me the common French shark metaphors and how a French speaker would apply them to a lawyer - but I guess not enough people on the internet have done this comparison.
|
|
|
|
|
|
 |
|
Disclaimer: OT and pretty ranty. I don't know if that's what the GP hinted at, but as a Svelte developer and big advocate for more than 6 years (single-handedly training and evangelizing 20+ developers on it), I found so many concerns with Svelte 5 that it simply made me use React again. It's a temporary choice and I'm desperately evaluating other ecosystems (looking at you, SolidJS).
|
|
|
|
|
|
|
|
 |
|
I haven’t been able to get any AI model to find Waldo in the first page of the Great Waldo Search. O3 even gaslit me through many turns trying to convince me it found the magic scroll.
|
|
 |
|
Things like "What is today's date" used to be enough (would usually return the date that the model was trained). I recently did things like current events, but LLMs that can search the internet can do those now. i.e. Is the pope alive or dead? Nowadays, multi-step reasoning is the key, but the Chinese LLM (I forget the name of it) can do that pretty well. Multi-step reasoning is much better at doing algebra or simple math, so questions like "what is bigger, 5.11 or 5.5?"
|
|
|
|
 |
|
"Keep file size small when you do edits" Makes me wonder if all these models were heavily trained on codebases where 1000 LOC methods are considered good practice
|
|
|
|
 |
|
Write 20 sentences that end with "p" in the final word before the period or other punctuation.
Succeeded on ChatGPT, pretty close on gemma3:4b -- the exceptions usually ending with a "puh" sound...
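Grading the output by hand gets tedious, so here is a small checker; a sketch that assumes one sentence per line and only strips trailing punctuation:

```python
# Sketch: check that the last word of each sentence ends with "p".
import string

def ends_with_p(sentence: str) -> bool:
    words = sentence.strip().rstrip(string.punctuation).split()
    return bool(words) and words[-1].lower().endswith("p")

sample = """The hikers paused at the top of the hill to sip.
He answered every question with a confident yep."""

for line in sample.splitlines():
    print(ends_with_p(line), "-", line)
```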
|
|
|
 |
|
Good one. I do seem to get consistently good results on Gemini 2.5 when using the slightly more explicit "Write 20 sentences where the very last character of each sentence is the letter 'p'."
|
|
|
|
|
|
|
 |
|
Not necessarily. It could start by using diamond's IOR and use that to drive a common BRDF calculation, along with some approximate refraction, perhaps using an equirectangular-projected sphere map or something for the background.
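A minimal sketch of that idea: take a Fresnel term from diamond's IOR (about 2.417) via Schlick's approximation and use it to blend a reflection sample against an approximated refracted background lookup. Everything here is illustrative; the color inputs stand in for whatever environment sampling you actually use:

```python
# Sketch: Schlick Fresnel from diamond's IOR (~2.417), blended with an
# approximate refraction of the background. The color inputs are placeholders
# for real reflection/refraction lookups (e.g. an equirectangular sphere map).
import math

IOR_DIAMOND = 2.417

def schlick_fresnel(cos_theta: float, ior: float = IOR_DIAMOND) -> float:
    f0 = ((ior - 1.0) / (ior + 1.0)) ** 2   # reflectance at normal incidence (~0.17)
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def shade(cos_theta, reflected_color, refracted_color):
    # Blend reflection and (approximate) refraction by the Fresnel weight.
    f = schlick_fresnel(cos_theta)
    return tuple(f * r + (1.0 - f) * t
                 for r, t in zip(reflected_color, refracted_color))

# At grazing angles the Fresnel weight approaches 1 (mostly reflection):
print(schlick_fresnel(math.cos(math.radians(80.0))))
```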
|
|
|
 |
|
I want to know as well! Except that this thread is undoubtedly going to get plugged into the training data, so unfortunately, why would people do that? For mine that worked before the ChatGPT 4.5, it was the river crossing problem. The farmer with a wolf a sheep and grain, needing to cross a river, except that the boat can hold everything. Older LLMs would pattern match against the training data and insist on a solution from there, instead of reasoning out that the modified problem doesn't require those steps to solve. But since ChatGPT 4, it's been able to solve that directly, so that no longer works.
|
|
 |
|
What are the zeros of the following polynomial:
\[
P(z) = \sum_{k=0}^{100} c_k z^k
\]
where the coefficients \( c_k \) are defined as:
\[
c_k =
\begin{cases}
e^2 + i\pi & \text{if } k = 100, \\
\ln(2) + \zeta(3)\,i & \text{if } k = 99, \\
\sqrt{\pi} + e^{i/2} & \text{if } k = 98, \\
\frac{(-1)^k}{\Gamma(k+1)} + \sin(k) \, i & \text{for } 0 \leq k \leq 97,
\end{cases}
\]
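For what it's worth, these zeros are easy to approximate numerically instead of asking an LLM; a sketch with numpy/scipy (at degree 100 the companion-matrix method behind numpy.roots works, though the low-order digits shouldn't be trusted):

```python
# Sketch: numerically approximate the zeros of the degree-100 polynomial above.
# numpy.roots expects coefficients ordered from the highest degree down.
import math
import numpy as np
from scipy.special import zeta

coeffs = np.zeros(101, dtype=complex)        # index k holds c_k
coeffs[100] = math.e**2 + 1j * math.pi
coeffs[99] = math.log(2) + 1j * zeta(3)
coeffs[98] = math.sqrt(math.pi) + np.exp(0.5j)
for k in range(98):                          # k = 0 .. 97
    coeffs[k] = (-1) ** k / math.gamma(k + 1) + 1j * math.sin(k)

roots = np.roots(coeffs[::-1])               # reverse so c_100 comes first
print(len(roots), "roots; first few:", roots[:3])
```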
|
|
|
 |
|
Not to my knowledge. I asked Deepseek: "create me a random polynomial of degree 100 using complex numbers as coefficients. It must have at least 3 different transcendental numbers." Then I messed with some of the exponents.
|
|
 |
|
Write a regular expression that matches Miqo'te seekers of the sun names. They always confuse the male and female naming conventions.
|
|
 |
|
Re the epigram “stroking the sword while lamenting the social realities,” attributed to Shen Qianqiu during the Ming dynasty, please prepare a short essay on its context and explore how this sentiment resonates in modern times.
|
|
 |
|
"A ball costs 5 cents more than a bat. The price of a ball and a bat together is $1.10. Sally has 20 dollars. She stole a few balls and bats. How many balls and how many bats does she have?"

All LLMs I tried miss the point that she stole the items rather than bought them.
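For reference, the side calculation the models keep producing (correct, but beside the point of the question) is just:

\[
\text{ball} = \text{bat} + 0.05, \qquad \text{ball} + \text{bat} = 1.10
\;\Rightarrow\; 2\,\text{bat} + 0.05 = 1.10
\;\Rightarrow\; \text{bat} = \$0.525,\; \text{ball} = \$0.575.
\]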
|
|
 |
|
Google Gemini (2.0 Flash, free online version) handled this rather well; it gave me an arguably unnecessary calculation of the individual prices of ball and bat, but then ended with: "However, with the information given, we can't determine exactly how many balls and bats Sally stole. The fact that she has $20 tells us she could have stolen some, but we don't know how many she did steal." While "the fact that she has $20" has no bearing on this - and the model seems to wrongly imply that it does - the observation that we have insufficient information to determine an answer is correct, and the model got the answer essentially right.
|
|
 |
|
Gemini 2.5 gives the following response: "Conclusion: We can determine the price of a single ball ($0.575) and a single bat ($0.525). However, we cannot determine how many balls and bats Sally has because the information 'a few' is too vague, and the fact she stole them means her $20 wasn't used for the transaction described."
|
|
|
 |
|
Grok 3.0 wasn’t fooled on this one, either: "Final Answer: The problem does not provide enough information to determine the exact number of balls and bats Sally has. She stole some unknown number of balls and bats, and the prices are $0.575 per ball and $0.525 per bat."
|
|
|
|
 |
|
I don't have a prompt per se, but recently I have managed to ask certain questions of both OpenAI o1/o3 and Claude 3.7 extended thinking that have spiraled way out of control. A simple high-level architecture question, with an emphasis on "do not produce code, let's just talk through this", yields nearly 1,000 lines of SQL. Once the conversation/context gets quite long, it is more likely to occur, in my experience.
|
|
 |
|
The only model I've seen so far that doesn't end up going crazy with long contexts is Gemini 2.5 Pro, but to be fair I haven't gone past 700-750k total tokens, so maybe as it starts to approach the limit (1.05M) things get hairy?
|
|