
Thursday, 2nd March 2006

OVER 75 Million Pages: NARA/Internet Archive Collection of U.S. Government Web Material Becomes Keyword Searchable, Powered by Nutch Technology

Resources, Reports, Tools, Lists, and Full Text
U.S. Government--Web Content--Archives--Databases
Source: IA
NARA/Internet Archive Collection of U.S. Government Web Material NOW Keyword Searchable
The Internet Archive (IA) reports that a special collection (about 75 million pages) of web material that it harvested for the National Archives (NARA) is now keyword searchable. The collection is titled "2004 Presidential Term Web Harvest." Look for the new search box on the right side of the page. This collection first became available to the public in January 2005.

From the web site:
"The National Archives and Records Administration (NARA) conducted a harvest (i.e., capture) of Federal Agency public web sites as they existed prior to January 20, 2005. This harvest was intended to document Federal agencies' presence on the World Wide Web at the time that the Presidential Administration term ended in early 2005."

"The 2004 Presidential Term Web Harvest is a National Archives and Records Administration (NARA) project that produced a collection of federal web sites copied, or harvested, from the world wide web between 10/14/04 and 11/19/04. The Heritrix web harvester (http://crawler.archive.org/) and a list of 982 active and unrestricted second level URLs were used to capture all linked federal sites down to the fourth level. Those initial 982 '.gov' and '.mil' URLs were provided by U.S. General Services Administration's (GSA) '.GOV' Internet Domain Registry and the Defense Information Systems Agency (DOD/DISA)."

Today's posting from the Internet Archive:
"This is our [IA] largest public single text searchable collection to date. The index was created using the NUTCH and NUTCHWAX extensions open source software."

Kudos to the IA. I wonder if they have any plans to bring back keyword search capabilities to The Wayback Machine (TWM), or at least a portion of it. I hope so. A few years ago the IA offered Recall, which allowed users to keyword search a portion of TWM. More about this gone-but-not-forgotten service here. The search technology was developed by Anna Patterson, who now works at Google.


