Saturday, 13th January 2007

New from Yahoo Research: Web Spam Detection using the Web Topology & Challenges in Distributed Information Retrieval

Two papers from Yahoo Research that might be of interest.

#1
Challenges in Distributed Information Retrieval
by Ricardo Baeza-Yates, Carlos Castillo, Flavio Junqueira, Vassilis Plachouras and Fabrizio Silvestri
This invited paper will be presented at ICDE [International Conference on Data Engineering] in April 2007.
From the abstract:

In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need of fully distributed search engines. Such engines need to achieve the following goals: high quality answers, fast response time, high query throughput, and scalability. In this paper we survey and organize recent research results, outlining the main challenges of designing a distributed Web retrieval system.

Request the full text from YR or download (15 pages; PDF) via the web site of Carlos Castillo.

#2
Know your Neighbors: Web Spam Detection using the Web Topology
by Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri (2006)

To access the full text of this report, either visit the web site of Carlos Castillo and download the paper (PDF; 10 pages) or request it directly from Yahoo Research.

Abstract:

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to large-scale Web data.
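
To make methods (i) and (iii) from the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the tie-breaking rule, and the toy host graph are all assumptions made for illustration, and the base classifier's predictions are simply given as booleans (True = spam).

```python
from collections import defaultdict

def smooth_by_cluster(pred, cluster_of, threshold=0.5):
    """Sketch of method (i): give every host in a cluster the majority label.

    pred:       host -> bool (True = base classifier says spam)
    cluster_of: host -> cluster id (assumed to come from some graph clustering)
    Ties and anything at or below the threshold fall to non-spam (an assumption).
    """
    votes = defaultdict(list)
    for host, label in pred.items():
        votes[cluster_of[host]].append(label)
    majority = {c: sum(v) / len(v) > threshold for c, v in votes.items()}
    return {host: majority[cluster_of[host]] for host in pred}

def neighbor_spam_fraction(pred, neighbors):
    """Sketch of the extra feature in method (iii): the fraction of a host's
    neighbors that the base classifier predicted as spam. A classifier would
    then be retrained with this value added to each host's feature vector."""
    features = {}
    for host in pred:
        linked = [n for n in neighbors.get(host, ()) if n in pred]
        features[host] = sum(pred[n] for n in linked) / len(linked) if linked else 0.0
    return features

# Hypothetical toy data: five hosts in two clusters.
pred = {"a": True, "b": True, "c": False, "d": False, "e": False}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": ["e"], "e": ["d"]}

smoothed = smooth_by_cluster(pred, cluster_of)
features = neighbor_spam_fraction(pred, neighbors)
```

In this toy example host "c" sits in a cluster where two of three hosts are predicted spam, so the majority vote flips it to spam. This mirrors the abstract's observation that linked hosts tend to belong to the same class.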

Source: Yahoo Research; Web Site of Carlos Castillo

See Also: SIGIR Forum: A Reference Collection for Web Spam, Open Source Information Retrieval Systems Other Articles

