Free Tool

Duplicate Content Detector

Paste 2–6 blocks of text and get a pairwise Jaccard similarity matrix. Catches near-duplicates, partial rewrites, and keyword-cannibalising pages — all in your browser.

Jaccard + containment · 5-word shingle analysis · Shared-phrase preview · Cannibalisation detection · 4 verdicts per pair · CSV export · In-browser only

Documents

1

Paste text blocks

Two to six documents. Each needs at least 5 words. Label each block so the pair matrix stays readable.

2

Shingle size

N-gram length for comparison. 5 is the SEO standard — catches genuine phrase overlap without being fooled by stopword collisions.
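The tool's exact tokeniser isn't shown here, but 5-word shingling can be sketched roughly as follows (the function name and the simple lowercase word split are illustrative assumptions, not the tool's actual code):

```python
import re

def shingles(text, n=5):
    """Return the set of n-word shingles (contiguous n-grams) in a text.

    Tokenisation here is a plain lowercase word split; a production
    tokeniser may handle punctuation and hyphens differently.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

s = shingles("WorkspaceIN is an Australian SEO agency helping businesses grow")
# 9 words -> 5 overlapping five-word shingles
```

A 9-word sentence yields 9 − 5 + 1 = 5 shingles, which is why each document needs at least 5 words before it can be compared at all.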

Similarity matrix

Status: worst pair 26% (Somewhat similar)

Page A vs Page B: Somewhat similar
Jaccard: 25.7%
Containment: 40.9%
Shared phrases (5 shown): "workspacein is an australian seo", "is an australian seo agency", "an australian seo agency helping", "australian seo agency helping businesses", "seo agency helping businesses grow"

Page A vs Page C: Unique
Jaccard: 0.0%
Containment: 0.0%

Page B vs Page C: Unique
Jaccard: 0.0%
Containment: 0.0%
Recommendations
  • Page A vs Page B: 26% Jaccard — some overlap but probably acceptable.

How to Fix Duplicate Content

Four options, ranked from best to worst.

1

301 redirect to one canonical

The cleanest fix. Pick the URL with the most backlinks and authority, 301 all duplicates to it. You consolidate signals and stop keyword cannibalisation in one move.

2

Use rel="canonical"

If you must keep both URLs live (e.g. filtered views, pagination), point every variant's canonical tag at the authoritative version. Note that rel="canonical" is a hint, not a directive: Google honours it most of the time, but can choose a different canonical if other signals disagree.

3

Rewrite to genuine uniqueness

If both pages need to exist AND both need to rank, rewrite the bodies to target different intent. "SEO services" and "SEO services Sydney" should have distinctive content, not swapped-in location words.

4

Noindex the weakest one

Last resort when you can't redirect, canonical, or rewrite. Keeps the page live for users but out of Google. Lose the rank potential, keep the UX.

Similarity Thresholds & What They Mean

The thresholds we use in SEO audits.

75%+ Near-duplicate

Almost identical. Pick one canonical and 301 the others. Leaving both live wastes crawl budget and confuses Google.

40–75% Substantial

Rewrite shared passages or consolidate pages. Often the result of "localised" landing pages that only swap place names.

15–40% Similar

Probably fine. Same topic, genuinely different angles. Audit shared phrases — maybe tighten if they're boilerplate.

Under 15% Unique

Different pages on a similar theme. Healthy variety. Common shingles are usually high-frequency natural language.

Jaccard vs Containment

Jaccard treats pages symmetrically. Containment shows what % of the smaller doc is inside the bigger one — catches partial rewrites.
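On shingle sets, both metrics reduce to simple set arithmetic. A minimal sketch (plain set operations, not the tool's actual implementation):

```python
def jaccard(a: set, b: set) -> float:
    """Symmetric overlap: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def containment(a: set, b: set) -> float:
    """Share of the smaller set found inside the other:
    |A ∩ B| / min(|A|, |B|). Catches partial rewrites that
    Jaccard's union denominator dilutes."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller

a = {"one two three four five", "two three four five six"}
b = {"two three four five six", "seven eight nine ten eleven",
     "twelve thirteen fourteen fifteen sixteen"}
# jaccard(a, b) = 1/4 = 0.25; containment(a, b) = 1/2 = 0.5
```

The example shows why containment matters: when a short page is copied wholesale into a much longer one, the union grows with the longer page and Jaccard stays low, while containment stays high.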

5-word shingle size

Standard for SEO duplicate detection. Short enough to catch paraphrases, long enough to avoid false positives on common stopword chains.

Cannibalisation

Two of your own pages targeting the same keyword. Jaccard helps spot it. Fix by merging, redirecting, or sharpening intent.

Scraped content

If your content appears on another domain at 80%+ Jaccard, it's likely scraped. DMCA takedown or ensure your canonical is strong.

Where Duplicate Content Usually Hides

Common sources of accidental duplication — almost all of them fixable.

CMS-generated

  • Tag + category archives
  • Author archives
  • Pagination /page/2
  • Print versions
  • Parameter variations
  • Filtered listing URLs

Editorial patterns

  • Location swap-out landing pages
  • Product description boilerplate
  • Service page near-clones
  • Auto-translated content
  • Syndicated feeds
  • Cross-posted blog posts

Technical

  • HTTP + HTTPS live
  • www + non-www both live
  • Trailing-slash variations
  • UTM-parameter versions
  • Staging sites indexable
  • Mobile (m.) subdomain

External duplication

  • Scraper content theft
  • Syndication without canonical
  • Manufacturer descriptions
  • Press release reposts
  • Affiliate mirror sites
  • AI-generated re-publish

Duplicate Content FAQ

What counts as duplicate content for SEO?

Google broadly treats substantial blocks of matching or near-matching content as duplicate — both within a site and across domains. "Substantial" is Google's word, not a precise percentage, but a Jaccard similarity over 50% on 5-word shingles is a strong flag worth investigating.

Does duplicate content always hurt rankings?

Not always. Google picks one version to rank and suppresses others — but the penalty is lost rankings, not a manual action. Exact duplicates across domains (scraped content) are higher risk. Internal duplicates are usually just wasted crawl budget.

How do I fix duplicate content?

Best: 301 redirect duplicates to one canonical URL. Second: rel="canonical" to the authoritative version. Third: rewrite the duplicate content to be genuinely unique. Worst: leave it — Google will still pick one and ignore the rest, but you lose control of which.

Is my text sent anywhere?

No. All similarity calculations run locally in your browser. Nothing is uploaded or stored.

Want a Full Site Duplicate-Content Audit?

Our Australian SEO team crawls your site, maps duplicate and near-duplicate content, and delivers a prioritised redirect / canonical / rewrite plan.

  • Site-wide duplication scan
  • Canonical + redirect plan
  • No lock-in commitment
Book a Free Consultation First

No long-term commitment. Cancel anytime. 100% satisfaction guaranteed.