Prompt Diff Tool
Compare two prompts side by side and highlight differences. Optimize your LLM prompts by seeing exactly what changed between versions.
Prompt Engineering Tips
Use this tool to A/B test your prompts. Small wording changes can significantly affect LLM output quality. Track character and token counts to stay within model context limits. The token estimate is a rough approximation of about 4 characters per token, so treat it as a guide rather than an exact count.
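A minimal sketch of the estimate described above, assuming the count is simply the character length divided by 4 and rounded up. The names `estimate_tokens` and `fits_in_context` are hypothetical, not the tool's actual code, and real tokenizers will give different counts depending on the model and the text.

```python
import math

# Rough rule of thumb from the text above; real tokenizers vary by model.
CHARS_PER_TOKEN = 4


def estimate_tokens(prompt: str) -> int:
    """Estimate the token count as ceil(character count / 4)."""
    return math.ceil(len(prompt) / CHARS_PER_TOKEN)


def fits_in_context(prompt: str, context_limit: int, reserved_for_output: int = 0) -> bool:
    """Compare the estimate to a context limit, leaving room for the reply."""
    return estimate_tokens(prompt) + reserved_for_output <= context_limit


prompt = "You are a helpful assistant. Answer the user's question concisely and accurately."
print(len(prompt), "characters, about", estimate_tokens(prompt), "tokens")
```

Because the estimate can be off in either direction, it is safer to leave headroom than to fill the limit exactly.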
About This Tool
You're tweaking an LLM prompt to fix a specific failure mode, and you want to compare your iteration against the previous version to see exactly what changed. Long prompts develop subtle drift over time — a word swapped here, an example added there — and the cumulative effect is hard to track when you're three iterations deep.
Drop your two prompt versions side by side. The diff highlights additions, deletions, and changed words. Useful for verifying that a "small tweak" was actually small, for code review when a prompt is part of a pull request, and for documenting what changed when behavior shifts between versions. By default the diff treats whitespace as significant, because in prompt engineering whitespace can genuinely affect model behavior.
The algorithm is a standard text diff (typically Myers' algorithm or one of its variants), which finds the longest common subsequence between two texts and marks everything outside as added or removed. Word-level diff treats the text as a sequence of tokens (words plus punctuation), which produces readable output for prose. Character-level diff treats every character independently, which catches small punctuation changes but produces visual noise for normal edits. The default is word-level with a character-level option for fine-grained inspection. Whitespace handling is configurable but defaults to significant — meaning a difference in spacing or line breaks shows as a change rather than being normalized away.
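To make the longest-common-subsequence idea concrete, here is a small sketch. It uses a plain dynamic-programming LCS table rather than Myers' algorithm (same kind of result, slower on large inputs), and the names `tokenize`, `lcs_diff`, and `diff_text` are illustrative assumptions, not the tool's implementation. Whitespace runs are kept as tokens, so spacing changes show up as differences.

```python
import re


def tokenize(text: str, level: str = "word") -> list[str]:
    """Split text into diff units; whitespace is kept so it stays significant."""
    if level == "char":
        return list(text)
    # Words, whitespace runs, and single punctuation characters.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def lcs_diff(old: list[str], new: list[str]) -> list[tuple[str, str]]:
    """Return (tag, token) pairs where tag is '=' kept, '-' removed, '+' added."""
    n, m = len(old), len(new)
    # table[i][j] holds the LCS length of old[i:] and new[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start, emitting everything outside the LCS.
    out: list[tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            out.append(("=", old[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            out.append(("-", old[i]))
            i += 1
        else:
            out.append(("+", new[j]))
            j += 1
    out.extend(("-", tok) for tok in old[i:])
    out.extend(("+", tok) for tok in new[j:])
    return out


def diff_text(old: str, new: str, level: str = "word") -> list[tuple[str, str]]:
    return lcs_diff(tokenize(old, level), tokenize(new, level))


# A doubled space is reported as a change instead of being normalized away.
print(diff_text("Be brief.", "Be  brief."))
# A punctuation swap is easiest to see at character level.
print(diff_text("a,b", "a;b", level="char"))
```

The table costs time and memory proportional to the product of the two token counts, which is one reason production diff tools prefer Myers-style algorithms for long inputs.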
A worked example: version 1 of your prompt says "You are a helpful assistant. Answer the user's question concisely and accurately." Version 2 says "You are a helpful, expert assistant. Answer the user's question concisely, accurately, and with citations where relevant." The diff shows ", expert" added after "helpful", and the ending "concisely and accurately" rewritten as "concisely, accurately, and with citations where relevant". Two changes that together can shift the model's behavior: the "expert" framing tends to produce more confident answers, and the citation request changes how the model handles factual claims. Without the diff you might think you only added one thing; the diff makes both changes visible, which matters when one of them is responsible for the behavior shift you're seeing.
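The same two versions can be compared with Python's standard `difflib` module, as a rough stand-in for the tool. This is a sketch under assumptions: `difflib.SequenceMatcher` uses its own matching heuristic rather than Myers' algorithm, so the way it groups changed spans may differ from what this page displays, and the helper names `words` and `changed_spans` are hypothetical.

```python
import difflib
import re

v1 = "You are a helpful assistant. Answer the user's question concisely and accurately."
v2 = ("You are a helpful, expert assistant. Answer the user's question "
      "concisely, accurately, and with citations where relevant.")


def words(text: str) -> list[str]:
    """Word-level units: words, whitespace runs, single punctuation marks."""
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def changed_spans(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (tag, removed_text, added_text) for every non-equal opcode."""
    a, b = words(old), words(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for tag, removed, added in changed_spans(v1, v2):
    print(f"{tag:8} removed={removed!r} added={added!r}")
```

Whatever the exact grouping, the output contains more than one changed span, which is the point of the example: the edit that felt like a single addition touched the prompt in several places.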
Why this matters more for prompts than for code: small changes can have outsized effects on model behavior, and the cause is often hard to attribute. A prompt that worked at temperature 0.7 might break at the same temperature after a single word change, and the diff is your only record of what the change was. Treat prompts like code: version them in a file or repo, commit each iteration, and use diffs in code review. Much prompt-engineering pain traces back to losing track of what changed between the working version and the broken one. The structured-diff alternative (for JSON or XML prompts) gives semantic comparison rather than text comparison, which can be cleaner when the prompt has a defined structure, but plain-text diff handles the typical case fine. For very long prompts (tens of thousands of tokens), splitting into sections and diffing each is more readable than one massive block diff.
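For the structured-diff alternative, a minimal sketch of what "semantic comparison" can mean for a JSON prompt: compare the parsed values key by key and report the paths that differ, so key order and formatting are ignored. The function name `json_diff`, the `MISSING` sentinel, the `$.key[index]` path notation, and the example prompt layout are all assumptions made for illustration; this page does not provide such a mode.

```python
import json

# Sentinel for "this key or index does not exist on that side".
MISSING = object()


def json_diff(old, new, path="$"):
    """Yield (path, old_value, new_value) for every value that differs."""
    if isinstance(old, dict) and isinstance(new, dict):
        for key in sorted(old.keys() | new.keys()):
            yield from json_diff(old.get(key, MISSING), new.get(key, MISSING), f"{path}.{key}")
    elif isinstance(old, list) and isinstance(new, list):
        # Lists are compared position by position (no reordering detection).
        for i in range(max(len(old), len(new))):
            o = old[i] if i < len(old) else MISSING
            n = new[i] if i < len(new) else MISSING
            yield from json_diff(o, n, f"{path}[{i}]")
    elif type(old) is not type(new) or old != new:
        yield (path, old, new)


# Hypothetical prompt layout; key order differs but only real changes are reported.
v1 = json.loads('{"system": "You are a helpful assistant.", "examples": ["Q: 2+2? A: 4"]}')
v2 = json.loads('{"examples": ["Q: 2+2? A: 4", "Q: 3+3? A: 6"], "system": "You are a helpful, expert assistant."}')

for path, before, after in json_diff(v1, v2):
    print(path, "|", "(absent)" if before is MISSING else before, "->", "(absent)" if after is MISSING else after)
```

A changed string is reported as a whole value here; running a plain-text diff on just that value then shows the wording change inside it.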
The about text and FAQ on this page were drafted with AI assistance and reviewed by a member of the Coherence Daddy team before publishing. See our Content Policy for editorial standards.