Motivated by our work with political scientists we present an algorithm that detects near-duplicate Web pages. These scientists analyze Web archives of news sites. The archives were collected with crawlers and contain a large number of pages that look very different because the frame around their core content differs. However, the news stories in the pages are nearly identical. The close proximity of unrelated items on the pages makes the detection of content overlap difficult. Our SpotSigs algorithm generates signatures that are spread across each document. Places for these signatures are determined by the placement of common words, like 'is' and 'the' in the documents. We can vary our method of computing the signatures. Using hash collisions the algorithm detects overlap among the signatures of matching contents. We study how the different SpotSigs parameters impact precision and recall performance. We propose and evaluate variants of SpotSigs on a test bed of 2168 Web Pages and study the tradeoffs involved. One of our motivations was also to keep pre-processing requirements low for the detection of near duplicates and to this end we do not remove ads, client side scripts and other HTML formatting elements from the documents. On this data set SpotSigs obtains a precision of over 93% and a recall of over 85% for near duplicate detection.
The FreePint Family is a family of resources to help information workers be more effective, raise the value of information in their organisations and contribute to success.
'FreePint... provides most of my professional development because it won't come through work and [other resources] just don't cut it.'
FUMSI Forum: Do you have a research question? Post it to the FUMSI Forum, where professionals share Q&A and useful tips on how to Find, Use, Manage and Share Information. It's free.