Thursday, 2nd March 2006

OVER 75 Million Pages: NARA/Internet Archive Collection of U.S. Government Web Material Becomes Keyword Searchable, Powered by Nutch Technology

Resources, Reports, Tools, Lists, and Full Text
U.S. Government--Web Content--Archives--Databases
Source: IA
NARA/Internet Archive Collection of U.S. Government Web Material NOW Keyword Searchable
The Internet Archive (IA) reports that a special collection of about 75 million pages of web material, which it harvested for the National Archives (NARA), is now keyword searchable. The collection is titled "2004 Presidential Term Web Harvest." Look for the new search box on the right side of the page. The collection first became available to the public in January 2005.

From the web site:
"The National Archives and Records Administration (NARA) conducted a harvest (i.e., capture) of Federal Agency public web sites as they existed prior to January 20, 2005. This harvest was intended to document Federal agencies' presence on the World Wide Web at the time that the Presidential Administration term ended in early 2005."

"The 2004 Presidential Term Web Harvest is a National Archives and Records Administration (NARA) project that produced a collection of federal web sites copied, or harvested, from the world wide web between 10/14/04 and 11/19/04. The Heritrix web harvester (http://crawler.archive.org/) and a list of 982 active and unrestricted second level URLs were used to capture all linked federal sites down to the fourth level. Those initial 982 '.gov' and '.mil' URLs were provided by U.S. General Services Administration's (GSA) '.GOV' Internet Domain Registry and the Defense Information Systems Agency (DOD/DISA)."

Today's posting from the Internet Archive:
"This is our [IA] largest public single text searchable collection to date. The index was created using the NUTCH and NUTCHWAX extensions open source software."

Kudos to the IA. I wonder if they have any plans to bring back keyword search capabilities to The Wayback Machine (or at least a portion of it). I hope so. A few years ago the IA offered Recall, which allowed users to keyword search a portion of The Wayback Machine (TWM). More about this gone-but-not-forgotten service here. The search technology was developed by Anna Patterson, who now works at Google.
