Friday, 9th November 2007

Technical Report: A Comparison of Open Source Search Engines

A Comparison of Open Source Search Engines
46 pages; PDF.
by Christian Middleton, Ricardo Baeza-Yates

The present work is the first study, to the best of our knowledge, to compare the main features of 17 search engines, as well as their performance during indexing and retrieval tasks with different document collections and several types of queries. The objective of this work is to serve as a reference for deciding which open source search engine best fits the particular constraints of the search problem to be solved. In chapter 2 we present a background of the general concepts of Information Retrieval. Chapter 3 describes the search engines used in this work. Then, chapter 4 describes the methodology used during the experiments. In chapters 5.1 and 5.2 we present the results of the different experiments conducted, and in chapter 5.3 the analysis of these results. Finally, chapter 6 presents the conclusions.

Which engines were considered? Which were compared? From pages 17-18:

We considered 29 search engines: ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, Lucene, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.

Based on the information collected, it is possible to discard some projects because they are considered outdated (e.g. the last update is prior to the year 2000), because the project is unmaintained or stalled, or because it was not possible to obtain information about them. For these reasons we discarded ASPSeek, BBDBot, ebhath, Eureka, ISearch, MPS Information Server, PLWeb, and WAIS/freeWAIS.

In some cases, a project was rejected because of additional factors. For example, although the MG project (presented in the book “Managing Gigabytes”) is one of the most important works in the area, it was not included in this work because it has not been updated since 1999. Another special case is the Nutch project. The Nutch search engine is based on the Lucene search engine, and is just an implementation that uses the API provided by Lucene. For this reason, only the Lucene project will be analyzed. And finally, XML Query Engine and Zebra were discarded since they focus on structured data (XML) rather than on semi-structured data such as HTML. Therefore, the initial list of search engines that we wanted to cover in the present work was:

Datapark, ht://Dig, Indri, IXE, Lucene, MG4J, mnoGoSearch, Namazu, OmniFind, OpenFTS, Omega, SWISH-E, SWISH++, Terrier, WebGlimpse (Glimpse), XMLSearch, and Zettair. However, in the preliminary tests we observed that the indexing time for Datapark, mnoGoSearch, Namazu, OpenFTS, and Glimpse was 3 to 6 times longer than for the rest of the search engines on the smallest database, and hence we did not consider them in the final performance comparison either.
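The selection steps in the excerpt can be checked with a little set arithmetic. The sketch below is only an illustration and not something taken from the report: the engine names are copied from the excerpt, while the variable names and the grouping into sets are assumptions made here. It also assumes that "Glimpse" and "WebGlimpse (Glimpse)" refer to the same entry, and it shortens "OmniFind IBM Yahoo! Ed." to "OmniFind". The count of 12 remaining engines is derived from the lists and is not stated in the excerpt.

```python
# Engine names come from the quoted excerpt; the set names are hypothetical.
considered = {
    "ASPSeek", "BBDBot", "Datapark", "ebhath", "Eureka", "ht://Dig", "Indri",
    "ISearch", "IXE", "Lucene", "Managing Gigabytes (MG)", "MG4J",
    "mnoGoSearch", "MPS Information Server", "Namazu", "Nutch", "Omega",
    "OmniFind", "OpenFTS", "PLWeb", "SWISH-E", "SWISH++", "Terrier",
    "WAIS/freeWAIS", "WebGlimpse", "XML Query Engine", "XMLSearch", "Zebra",
    "Zettair",
}

# Discarded as outdated, unmaintained, or lacking available information.
outdated_or_unavailable = {
    "ASPSeek", "BBDBot", "ebhath", "Eureka", "ISearch",
    "MPS Information Server", "PLWeb", "WAIS/freeWAIS",
}

# Rejected for additional reasons: MG (no update since 1999), Nutch (built on
# Lucene), XML Query Engine and Zebra (focus on structured XML data).
special_cases = {"Managing Gigabytes (MG)", "Nutch", "XML Query Engine", "Zebra"}

# Dropped after preliminary tests: indexing took 3 to 6 times longer.
# Assumption: "Glimpse" in the excerpt is the same entry as "WebGlimpse".
slow_indexers = {"Datapark", "mnoGoSearch", "Namazu", "OpenFTS", "WebGlimpse"}

# Set difference applies each round of exclusions in turn.
initial = considered - outdated_or_unavailable - special_cases
final = initial - slow_indexers

print(len(considered), len(initial), len(final))  # 29 17 12
print(sorted(final))
```

The middle figure agrees with the 17 engines mentioned in the abstract, and the last set is what remains for the final performance comparison under the assumptions above.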

Source: Universitat Pompeu Fabra (Barcelona, Spain)

