Google's Count of 130 Million Books is Probably Bunk
This Ars Technica post is in response to the estimates Google posted on Inside Google Book Search last week that we did our best to break down in this post. In terms of accuracy, who knows. It would be interesting to hear other organizations (OCLC) and companies (Bowker) respond. We continue to be on the lookout for them. Others with access to large amounts of bibliographic data, such as national libraries and the Library of Congress, could also respond with estimates.
It's a large, official-sounding number, and the explanation for how Google arrived at it involves a number of acronyms and terms that will be unfamiliar to most of those who read the post. It's also quite likely to be complete bunk.
The first two posts here provide comments from Professor Nunberg along with slides from a presentation at UC Berkeley.
Shortly after Nunberg's comments were made, Dr. Peter Jacso, a librarian and head of the Library and Info Science program at the University of Hawaii, wrote an article showing, in GREAT detail, some of the problems Nunberg mentions.
The problem here is not with Nunberg or Jacso but with the author. A year is a long time in the publishing/library world and even longer in the search engine world. A few phone calls or emails to Mr. Nunberg and Dr. Jacso (assuming the author would have found the LJ article) to solicit comments from them would have made the Ars Technica article much stronger. Perhaps Google is doing better with metadata or perhaps the problem has become worse. We don't know.
Next, he refers to subject headings as classifications. It's minor, but our point is that if the author had solicited the views of catalogers, metadata experts (often the same people), and others in the library world, the differences would have been explained and the article made more useful. One of the best things the Internet can do is get you in touch with experts. Also, a bit of research would likely have directed him to OCLC, OCLC Research, Hathi Trust, and other organizations.
The classifications are a mess, and Nunberg's presentation points out that the first 10 classifications for Walt Whitman's "Leaves of Grass" classify it as Juvenile Nonfiction, Poetry, Fiction, Literary Criticism, Biography & Autobiography, Counterfeits and Counterfeiting. Then there are authors that are missing or misattributed, and titles that bear no relation to the linked work.
Again, this is very possible. However, this was written a year ago. Perhaps things have changed. Some research by the author could have cleared it all up.
While Google blaming libraries (again, last year) was not a good idea, if for no other reason than public relations, how do librarians feel about this now? Have they noticed any improvements? Has it gotten worse?
He then moves on to quote "Evan" Hellman from the "Go to Hellman" blog. Eric knows his stuff, is a true expert, and should be heard, but again there are problems.
1) His name is Eric Hellman, not Evan. Minor? Absolutely, and an honest mistake. Still, where were the Ars Technica editors?
2) The quote from the Go To Hellman blog is about library-generated metadata. The post is an important read and makes several interesting and provocative points. But once again, soliciting some new comments from Mr. Hellman (the post is from September 2009) would have helped the article and helped the reader. Discussions of metadata, Google, and related issues were topics during this year's "annual conference" season. Perhaps Hellman or others have insights from the conferences about library-generated metadata and its use by Google.
Let's be clear, it's QUITE POSSIBLE that today's Ars Technica article is on the money and the headline says it like it is. We just wish the article offered more facts and current expert comments to explain why the number is accurate or bunk.
However, not actually talking to ANYONE (in the past week or so) about what has gone on in the last year leaves us with more questions than answers. A very small amount of legwork could have created a useful and important article. Contact info for Professor Nunberg and Eric Hellman is readily available, and the Google blog post itself mentions other organizations. How about just calling a library and asking for the PR/Media Relations department? No? What about the media folks at ALA? They're available to help with articles like this. Still a problem? Between social media and old school email lists, getting to good people to chat with can be easy.
Btw, Google provided two totals. One has serials removed (129,864,880), but they are more sure of the number with serials:
[Our emphasis] Counting only things that are printed and bound, we arrive at about 146 million. This is our best answer today. It will change.
Again, rather minor but worth mentioning.
Bottom Line: A bit of research could have taken an article that doesn't add any new info or comment to the topic and turned it into a "must read" and "must quote" piece.