Google's Count of 130 Million Books is Probably Bunk
This Ars Technica post is in response to the estimates Google posted on Inside Google Book Search last week that we did our best to break down in this post. In terms of accuracy, who knows. It would be interesting to hear other organizations (OCLC) and companies (Bowker) respond. We continue to be on the lookout for them. Others with access to large amounts of bibliographic data, such as national libraries and the Library of Congress, could also respond with estimates.
It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most of those who read the post. It's also quite likely to be complete bunk.
The first two posts here provide comments from Professor Nunberg along with slides from a presentation at UC Berkeley.
Shortly after Nunberg's comments were made, Dr. Peter Jacso, a librarian and head of the Library and Info Science program at the University of Hawaii, wrote an article showing, in GREAT detail, some of the problems Nunberg mentions.
The problem here is not with Nunberg or Jacso but with the author. A year is a long time in the publishing/library world and even longer in the search engine world. A few phone calls or emails to Mr. Nunberg and Dr. Jacso (assuming the author had found the LJ article) to solicit comments from them would have made the Ars Technica article much stronger. Perhaps Google is doing better with metadata or perhaps the problem has become worse. We don't know.
Next, he refers to subject headings as classifications. It's minor, but our point is that if the author had solicited the views of catalogers, metadata experts (often the same person), and others in the library world, the differences would have been explained and the article made more useful. One of the best things the Internet can do is get you in touch with experts. Also, a bit of research would likely have directed him to OCLC, OCLC Research, Hathi Trust, and other organizations.
The classifications are a mess, and Nunberg's presentation points out that the first 10 classifications for Walt Whitman's "Leaves of Grass" classify it as Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, Counterfeits and Counterfeiting. Then there are authors that are missing or misattributed, and titles that bear no relation to the linked work.
Again, this is very possible. However, this was written a year ago. Perhaps things have changed. Some research by the author could have cleared it all up.
While Google blaming libraries (again, last year) was not a good idea, if for no other reason than public relations, how do librarians feel about this now? Have they noticed any improvements? Has it gotten worse?
He then moves on to quote "Evan" Hellman from the "Go to Hellman" blog. Eric knows his stuff, is a true expert, and should be heard, but again there are problems.
1) His name is Eric Hellman, not Evan. Minor? Absolutely, and an honest mistake. Still, where were the Ars Technica editors?
2) The quote from a Go To Hellman blog post is about library-generated metadata. The post is an important read and makes several interesting and provocative points. But once again, soliciting some new comments (the post is from September 2009) from Mr. Hellman would have helped the article and helped the reader. Discussions of metadata, Google, and related issues were topics during this year's "annual conference" season. Perhaps Hellman or others have insights from the conferences about library-generated metadata and its use by Google.
Let's be clear, it's QUITE POSSIBLE that today's Ars Technica article is on the money and the headline says it like it is. We just wish the article offered more facts and current expert comments to explain why the number is accurate or bunk.
However, not actually talking to ANYONE (in the past week or so) about what has gone on in the last year leaves us with more questions than answers. A very small amount of legwork could have created a useful and important article. Contact info for Professor Nunberg and Eric Hellman is readily available, and other organizations are mentioned in the Google blog post. How about just calling a library and asking for the PR/Media Relations department? No? What about the media folks at ALA? They're available to help with articles like this. Still a problem? Between social media and old school email lists, getting to good people to chat with can be easy.
Btw, Google provided two totals. One has serials removed (129,864,880), but they are more sure of the number that includes serials:
[Our emphasis] Counting only things that are printed and bound, we arrive at about 146 million. This is our best answer today. It will change.
Again, rather minor but worth mentioning.
Bottom Line: A bit of research could have taken an article that doesn't add any new info or comment to the topic and turned it into a "must read" and "must quote" piece.