
Monday, 9th August 2010

Google's Count of 130 Million Books is Probably Bunk

This Ars Technica post responds to the estimates Google posted on Inside Google Book Search last week, which we did our best to break down in this post. As for accuracy, who knows? It would be interesting to hear organizations such as OCLC and companies such as Bowker respond, and we continue to be on the lookout for their responses. Others with access to large amounts of bibliographic data, such as national libraries and the Library of Congress, could also respond with estimates of their own.

Jon Stokes writes:

It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most of those who read the post. It's also quite likely to be complete bunk.

The author includes comments (from last year) by Geoff Nunberg who had some very strong words about the quality of Google metadata.

The first two posts here provide comments from Professor Nunberg along with slides from a presentation at UC Berkeley.

Shortly after Nunberg made his comments, Dr. Peter Jacso, a librarian and head of the Library and Info Science program at the University of Hawaii, wrote an article showing, in GREAT detail, some of the problems Nunberg mentions.

The problem here is not with Nunberg or Jacso but with the author. A year is a long time in the publishing/library world and even longer in the search engine world. A few phone calls or emails to Mr. Nunberg and Dr. Jacso (assuming the author had found the LJ article) to solicit fresh comments would have made the Ars Technica article much stronger. Perhaps Google is doing better with metadata, or perhaps the problem has become worse. We don't know.

Next, he refers to subject headings as classifications. It's minor, but our point is that if the author had solicited the views of catalogers, metadata experts (often the same people), and others in the library world, the difference would have been explained and the article made more useful. One of the best things the Internet can do is get you in touch with experts. Also, a bit of research would likely have directed him to OCLC, OCLC Research, Hathi Trust, and other organizations.

The classifications are a mess, and Nunberg's presentation points out that the first 10 classifications for Walt Whitman's "Leaves of Grass" classify it as Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, Counterfeits and Counterfeiting. Then there are authors that are missing or misattributed, and titles that bear no relation to the linked work.

Again, this is very possible. However, this was written a year ago. Perhaps things have changed. Some research by the author could have cleared it all up.

While Google blaming libraries (again, last year) was not a good idea, if for no other reason than public relations, how do librarians feel about this now? Have they noticed any improvements? Has it gotten worse?

He then moves on to quote "Evan" Hellman from the "Go To Hellman" blog. Eric knows his stuff, is a true expert, and should be heard, but again there are problems.

1) His name is Eric Hellman, not Evan. Minor? Absolutely, and an honest mistake. Still, where were the Ars Technica editors?

2) The quote from the Go To Hellman blog is about library-generated metadata. The post is an important read and makes several interesting and provocative points. But once again, soliciting some new comments from Mr. Hellman (the post is from September 2009) would have helped the article and helped the reader. Metadata, Google, and related issues were discussion topics during this year's "annual conference" season. Perhaps Hellman or others have insights from the conferences about library-generated metadata and its use by Google.

Let's be clear: it's QUITE POSSIBLE that today's Ars Technica article is on the money and the headline tells it like it is. We just wish the article offered more facts and current expert comments to explain why the number is accurate or bunk.

However, not actually talking to ANYONE (in the past week or so) about what has gone on in the last year leaves us with more questions than answers. A very small amount of legwork could have created a useful and important article. Contact info for Professor Nunberg and Eric Hellman is readily available, and the Google blog post itself mentions other organizations. How about just calling a library and asking for the PR/Media Relations department? No? What about the media folks at ALA? They're available to help with articles like this. Still a problem? Between social media and old-school email lists, getting to good people to chat with can be easy.

Btw, Google provided two totals. One has serials removed (129,864,880), but they are more sure of the number with serials:

[Our emphasis] Counting only things that are printed and bound, we arrive at about 146 million. This is our best answer today. It will change.

Again, rather minor but worth mentioning.

Bottom Line: A bit of research could have taken an article that doesn't add any new info or comment to the topic and turned it into a "must read" and "must quote" piece.
