
Friday, 9th November 2007

Technical Report: A Comparison of Open Source Search Engines

A Comparison of Open Source Search Engines
46 pages; PDF.
by Christian Middleton, Ricardo Baeza-Yates

The present work is the first study, to the best of our knowledge, to cover a comparison of the main features of 17 search engines, as well as a comparison of their performance during the indexing and retrieval tasks with different document collections and several types of queries. The objective of this work is to serve as a reference for deciding which open source search engine best fits the particular constraints of the search problem to be solved. In chapter 2 we present a background of the general concepts of Information Retrieval. Chapter 3 describes the search engines used in this work. Chapter 4 then describes the methodology used during the experiments. In chapters 5.1 and 5.2 we present the results of the different experiments conducted, and in chapter 5.3 the analysis of these results. Finally, chapter 6 presents the conclusions.

Which engines were considered? Which were compared? From pages 17-18:

We considered 29 search engines: ASPSeek, BBDBot, Datapark, ebhath, Eureka, ht://Dig, Indri, ISearch, IXE, Lucene, Managing Gigabytes (MG), MG4J, mnoGoSearch, MPS Information Server, Namazu, Nutch, Omega, OmniFind IBM Yahoo! Ed., OpenFTS, PLWeb, SWISH-E, SWISH++, Terrier, WAIS/freeWAIS, WebGlimpse, XML Query Engine, XMLSearch, Zebra, and Zettair.

Based on the information collected, it is possible to discard some projects because they are considered outdated (e.g. the last update is prior to the year 2000), because the project is no longer maintained or has stalled, or because it was not possible to obtain information about them. For these reasons we discarded ASPSeek, BBDBot, ebhath, Eureka, ISearch, MPS Information Server, PLWeb, and WAIS/freeWAIS.

In some cases, a project was rejected because of additional factors. For example, although the MG project (presented in the book “Managing Gigabytes”) is one of the most important works in the area, it was not included in this work because it has not been updated since 1999. Another special case is the Nutch project. The Nutch search engine is based on the Lucene search engine, and is just an implementation that uses the API provided by Lucene. For this reason, only the Lucene project will be analyzed. And finally, XML Query Engine and Zebra were discarded since they focus on structured data (XML) rather than on semi-structured data such as HTML. Therefore, the initial list of search engines that we wanted to cover in the present work was:

Datapark, ht://Dig, Indri, IXE, Lucene, MG4J, mnoGoSearch, Namazu, OmniFind, OpenFTS, Omega, SWISH-E, SWISH++, Terrier, WebGlimpse (Glimpse), XMLSearch, and Zettair. However, in the preliminary tests we observed that the indexing time for Datapark, mnoGoSearch, Namazu, OpenFTS, and Glimpse was 3 to 6 times longer than for the rest of the search engines on the smallest database, and hence we did not consider them in the final performance comparison.
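
The selection steps quoted above can be tallied with a short sketch. The engine names are taken from the passage; the variable names and the `drop` helper are hypothetical and only illustrate the filtering, and "WebGlimpse" is assumed to be the same project the passage later calls "Glimpse".

```python
# Engine names as listed in the quoted passage (pages 17-18 of the report).
candidates = [
    "ASPSeek", "BBDBot", "Datapark", "ebhath", "Eureka", "ht://Dig", "Indri",
    "ISearch", "IXE", "Lucene", "Managing Gigabytes (MG)", "MG4J",
    "mnoGoSearch", "MPS Information Server", "Namazu", "Nutch", "Omega",
    "OmniFind IBM Yahoo! Ed.", "OpenFTS", "PLWeb", "SWISH-E", "SWISH++",
    "Terrier", "WAIS/freeWAIS", "WebGlimpse", "XML Query Engine",
    "XMLSearch", "Zebra", "Zettair",
]

# Step 1: outdated, unmaintained, or undocumented projects.
outdated_or_unmaintained = {
    "ASPSeek", "BBDBot", "ebhath", "Eureka", "ISearch",
    "MPS Information Server", "PLWeb", "WAIS/freeWAIS",
}

# Step 2: special cases (MG not updated since 1999, Nutch built on Lucene,
# XML Query Engine and Zebra aimed at structured data).
special_cases = {"Managing Gigabytes (MG)", "Nutch", "XML Query Engine", "Zebra"}

# Step 3: indexing 3 to 6 times slower in the preliminary tests.
# Assumption: "Glimpse" in the passage refers to the WebGlimpse entry.
slow_indexing = {"Datapark", "mnoGoSearch", "Namazu", "OpenFTS", "WebGlimpse"}


def drop(engines, excluded):
    """Return the engines that are not in the excluded set, keeping order."""
    return [name for name in engines if name not in excluded]


initial_list = drop(drop(candidates, outdated_or_unmaintained), special_cases)
final_list = drop(initial_list, slow_indexing)

print(len(candidates), len(initial_list), len(final_list))
```

Under these assumptions the counts are 29 engines considered, 17 in the initial list, and 12 in the final performance comparison.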

Source: Universitat Pompeu Fabra (Barcelona, Spain)

