Wednesday, July 2, 2008

Shingles and Near Duplicate Detection

Sergei Vassilvitskii of Yahoo! has a useful ppt describing work to identify duplicate and near-duplicate pages on the Web using shingles. He claims that 25%-40% of all WWW documents are duplicates or near duplicates. Hashing whole documents cannot identify near duplicates, while edit distance will not scale. The approach instead hashes a small number of shingles (n-grams) from each document and calculates similarity by the rate at which these min-hashes agree. The talk also has a useful discussion of Jaccard similarity. It is based on Andrei Broder's (AltaVista and Yahoo!) work, described in Identifying and filtering near-duplicate documents and the previous papers cited there. There are other commercial applications of this approach, such as Equivio's near-duplicate identification service, which uses a related similarity measure.
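To make the idea concrete, here is a minimal Python sketch of shingling, exact Jaccard similarity, and a min-hash estimate of it. This is my own illustration, not code from the talk or from Broder's papers: the word-level shingles of length 3, the use of seeded BLAKE2b as the hash family, the 100 hash functions, and all function names are assumptions chosen for readability.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in text."""
    words = text.lower().split()
    if len(words) < n:
        # Too short for a full shingle: treat the whole text as one shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; seed selects the hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=100):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree (estimates Jaccard)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact Jaccard:   ", jaccard(sa, sb))
    print("min-hash estimate:", estimated_similarity(
        minhash_signature(sa), minhash_signature(sb)))
```

The point of the signature is that two documents can be compared using only a fixed number of integers each, rather than their full shingle sets; more hash functions give a tighter estimate at the cost of a longer signature.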

While I am at it, have a look at Detecting Near Duplicates in Big Data for pointers to recent work at Google on the same problem. Also see the recent International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN).

Tuesday, June 24, 2008

Datawocky: More data and human evaluation

Anand Rajaraman, in Datawocky, makes the case that more data usually beats better algorithms, citing the Netflix challenge, and provides a little more detail in part two of the same post. He also notes, in Are Machine-Learned Models Prone to Catastrophic Errors?, that Google continues to use human evaluation as part of its search algorithm tuning, suggesting that machine learning, based on seen instances, can suffer from the "Black Swan" problem. Finally, he makes the case, based on another blog entry, that one should Change the algorithm, not the dataset if your approach can't handle the scale of data you are throwing at it. Interesting comments all. A blog to watch.

Monday, May 26, 2008

From Words to Works: Machine Learning and Text Mining at ARTFL

I recently had the opportunity to present an overview of our current work in machine learning and text mining at the 2008 meeting of Technological Innovation and Cooperation for Foreign Information Access (TICFIA), held in Chicago on the first of May. [slides]

Wednesday, May 21, 2008

Similarity as a Scholarly Primitive

I gave this 4/6 talk at the Chicago Bamboo Project Workshop last week. In place of PowerPoint I used Google's Presentation system, which allows you to present with only a browser and to embed the talk in posts. Very handy, particularly since one can collaborate with others and provide links to the full-screen presentation [Click here].