Wednesday, 30th January 2008
SpotSigs: Near Duplicate Detection in Web Page Collections (Thesis)
Jonathan, Siddharth; Paepcke, Andreas. SpotSigs: Near Duplicate Detection in Web Page Collections,
Motivated by our work with political scientists, we present an algorithm that detects near-duplicate Web pages. These scientists analyze Web archives of news sites. The archives were collected with crawlers and contain a large number of pages that look very different because the frame around their core content differs. However, the news stories in the pages are nearly identical. The close proximity of unrelated items on the pages makes the detection of content overlap difficult. Our SpotSigs algorithm generates signatures that are spread across each document. Places for these signatures are determined by the placement of common words, like 'is' and 'the', in the documents. We can vary our method of computing the signatures. Using hash collisions, the algorithm detects overlap among the signatures of matching contents. We study how the different SpotSigs parameters impact precision and recall. We propose and evaluate variants of SpotSigs on a test bed of 2168 Web pages and study the tradeoffs involved. Another of our motivations was to keep pre-processing requirements low for the detection of near duplicates; to this end, we do not remove ads, client-side scripts, or other HTML formatting elements from the documents. On this data set, SpotSigs obtains a precision of over 93% and a recall of over 85% for near-duplicate detection.
Source: Stanford Info Lab
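To make the idea in the abstract concrete, here is a minimal Python sketch of signatures anchored at common words. It is not the authors' implementation. The antecedent set (only 'is' and 'the', the two examples the abstract names), the chain length of two words, the function names, and the use of Jaccard similarity to score overlap are all assumptions made for illustration. Python's built-in set, which is hash-based, stands in for the hash-collision matching step.

```python
import re

# Assumed antecedent words; the abstract gives only 'is' and 'the' as examples.
ANTECEDENTS = frozenset({"is", "the"})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def spot_signatures(text, chain_length=2, antecedents=ANTECEDENTS):
    """Return the set of signatures anchored at antecedent words.

    Each signature is the antecedent plus the next `chain_length` tokens
    that are not themselves antecedents (an assumed construction).
    """
    tokens = tokenize(text)
    signatures = set()
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        chain = []
        for following in tokens[i + 1:]:
            if following not in antecedents:
                chain.append(following)
                if len(chain) == chain_length:
                    break
        # Skip anchors too close to the end of the text to fill a chain.
        if len(chain) == chain_length:
            signatures.add((token, *chain))
    return signatures


def signature_overlap(sigs_a, sigs_b):
    """Jaccard similarity of two signature sets (an assumed overlap measure)."""
    if not sigs_a and not sigs_b:
        return 0.0
    return len(sigs_a & sigs_b) / len(sigs_a | sigs_b)


if __name__ == "__main__":
    page_a = "The cat is sitting on the mat near the door"
    page_b = "Breaking news: the cat is sitting on the mat near the window today"
    score = signature_overlap(spot_signatures(page_a), spot_signatures(page_b))
    print(f"overlap: {score:.2f}")
```

Because signatures are generated only where antecedent words occur, text with few such words, such as navigation links or ad labels, contributes few signatures. A pair of pages would be flagged as near duplicates when the overlap exceeds a chosen threshold, which is left open here.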