Remove Duplicate Lines
Remove duplicate lines from text, keeping only unique entries.
About This Tool
Removing duplicate lines is a common preprocessing step for log files, email lists, word lists, and dataset cleaning. The basic operation reads a list of lines and emits each unique line once, preserving first-occurrence order or sorting alphabetically.
This tool runs entirely in the browser. Options: case-sensitive or case-insensitive matching, whitespace trimming, skipping empty lines, and sorted or order-preserving output.
Deduplication implementations fall into two families: hash-based and sort-based. The hash-based approach (the default in most modern tools) reads each line, computes a hash key, and tracks the keys it has seen in a set. Time complexity is O(n) on average, and memory is proportional to the number of unique lines. The sort-based approach (Unix `sort | uniq`) sorts the input and emits each line that differs from the previous one; time is O(n log n), and memory depends on the sort implementation. For inputs that fit in memory, the hash approach is faster; for data that exceeds RAM, sort-based deduplication with disk-backed sorting is the standard. Browser implementations use the hash approach because inputs are typically interactive-scale and fit comfortably in memory.
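The two approaches can be sketched side by side in TypeScript. This is a minimal illustration, not this tool's actual source; the function names `dedupeWithSet` and `dedupeWithSort` and the sample data are made up for the example.

```typescript
// Hash-based: one pass over the input, remembering every distinct line in a Set.
function dedupeWithSet(lines: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const line of lines) {
    if (!seen.has(line)) {
      seen.add(line);
      out.push(line); // first occurrence wins, so input order is preserved
    }
  }
  return out;
}

// Sort-based: sort a copy, then keep each line that differs from its predecessor.
function dedupeWithSort(lines: string[]): string[] {
  const sorted = lines.slice().sort(); // default sort compares UTF-16 code units
  return sorted.filter((line, i) => i === 0 || line !== sorted[i - 1]);
}

// Hypothetical sample data.
const sampleLines = ["pear", "apple", "pear", "fig", "apple"];
const byHash = dedupeWithSet(sampleLines); // ["pear", "apple", "fig"]
const bySort = dedupeWithSort(sampleLines); // ["apple", "fig", "pear"]
```

Note that the sort-based variant loses the original order as a side effect, which is why order-preserving output is usually built on the hash-based variant.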
A worked example: a file of 100,000 email addresses with an unknown number of duplicates. Sample input: `alice@example.com\nBOB@example.com\nalice@example.com\nbob@example.com ` (the last line has a trailing space). Default (case-sensitive, no trim): output is 3 lines, because the second `alice@example.com` is an exact duplicate, while `BOB@example.com` and `bob@example.com ` differ in both case and trailing whitespace and so count as distinct lines. Case-insensitive with whitespace trim: output is 2 lines (alice and bob). For email lists, case-insensitive plus trim is almost always correct: RFC 5321 technically allows the local part to be case-sensitive, but virtually all email providers treat it as case-insensitive. The tool reports the input count, the output count, and the number of duplicates removed.
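The worked example above can be reproduced with a short TypeScript sketch. The names `DedupeOptions`, `DedupeReport`, and `dedupeEmails` are hypothetical and chosen for illustration; the tool's real option names and internals may differ.

```typescript
interface DedupeOptions {
  caseSensitive: boolean;
  trim: boolean;
}

interface DedupeReport {
  lines: string[];
  inputCount: number;
  outputCount: number;
  removed: number;
}

function dedupeEmails(text: string, options: DedupeOptions): DedupeReport {
  const input = text.split("\n");
  const seen = new Set<string>();
  const lines: string[] = [];
  for (const raw of input) {
    const line = options.trim ? raw.trim() : raw;
    // The comparison key may be case-folded; the emitted line keeps its original case.
    const key = options.caseSensitive ? line : line.toLowerCase();
    if (!seen.has(key)) {
      seen.add(key);
      lines.push(line);
    }
  }
  return {
    lines,
    inputCount: input.length,
    outputCount: lines.length,
    removed: input.length - lines.length,
  };
}

// Same four-line sample as in the text; note the trailing space on the last line.
const emailSample = "alice@example.com\nBOB@example.com\nalice@example.com\nbob@example.com ";

const strict = dedupeEmails(emailSample, { caseSensitive: true, trim: false });
// strict.outputCount is 3, strict.removed is 1

const relaxed = dedupeEmails(emailSample, { caseSensitive: false, trim: true });
// relaxed.lines is ["alice@example.com", "BOB@example.com"], relaxed.removed is 2
```

One design choice in this sketch: the first-seen spelling (`BOB@example.com`) is what gets emitted in case-insensitive mode. A real tool might instead lowercase the output, so treat that behavior as an assumption.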
Limitations: line-level deduplication is exact-match only. "john.smith@example.com" and "j.smith@example.com" are different lines, even though they may belong to the same person. Fuzzy matching, near-duplicate detection (using edit distance or token overlap), and column-based deduplication (treating rows as duplicates when a given CSV column matches) require different tools. Browser memory limits processing to roughly 100 MB of input; beyond that, use a script. The order-preserving option uses a Map under the hood to track the first occurrence of each line. This takes slightly more memory than the sort-based approach, but it is worth it when the original order matters (log analysis, replay scripts).
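As a rough sketch of how a Map can track first occurrences, the TypeScript below records the 1-based line number at which each exact line first appears. The function name `firstOccurrences` and the sample log lines are hypothetical; this shows the general technique, not the tool's actual code.

```typescript
// Maps each distinct line (exact match) to the 1-based number of its first occurrence.
function firstOccurrences(lines: string[]): Map<string, number> {
  const first = new Map<string, number>();
  lines.forEach((line, index) => {
    if (!first.has(line)) {
      first.set(line, index + 1);
    }
  });
  return first;
}

// Hypothetical log-like input.
const logLines = ["start", "retry", "start", "stop", "retry"];
const firstSeen = firstOccurrences(logLines);

// A Map iterates in insertion order, so the keys come back in first-occurrence order.
const kept: string[] = [];
firstSeen.forEach((_lineNumber, line) => kept.push(line));
// kept is ["start", "retry", "stop"]; firstSeen.get("stop") is 4
```

Storing the line number alongside each key is the extra memory cost mentioned above; in exchange, the original position of every kept line stays available, for example to restore input order after another processing step.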
The about text and FAQ on this page were drafted with AI assistance and reviewed by a member of the Coherence Daddy team before publishing. See our Content Policy for editorial standards.