Wednesday, July 2, 2008

Shingles and Near Duplicate Detection

Sergei Vassilvitskii of Yahoo! has a useful ppt describing work to identify duplicate and near duplicate pages on the Web using shingles. Claims that 25%-40% of all WWW documents are duplicates or near duplicates. Hashing of documents cannot identify near duplicates while edit distance will not scale. Uses a hash of a small number of shingles (ngrams), calculating similarity by rate at which mini-hashes agree. Also has a useful discussion of Jaccard similarities. Talk is based on Andrei Broder's (AltaVista and Yahoo!) work, described in Identifying and filtering near-duplicate documents and previous papers cited there. There are other commercial applications of this approach, such as Equivio's near duplication identification service which uses a related similarity measure.

While I am at it, have a look at Detecting Near Duplicates in Big Data for pointers to recent work at Google on the same problem. Also, the recent International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN).