Receive the weekly sampler of posts and "Resource of the Week".
Subscribe »

Enter your
email address:

My Account »


Bookmark and Share

Testimonial?
If you find ResourceShelf useful, please supply a testimonial »








Home > ResourceBlog > Article

« All ResourceBlog Articles

 

Bookmark and Share   \"Feed\"

Thursday, 5th August 2010

Google's Estimates Approx. 146 Million Books (Printed and Bound); 129,864,880 Minus Serials

Google comes up with total book estimates and then explains how they arrived at the various numbers. If nothing else, it will show the non-cataloger is not as easy as it looks and many types and sources of bibliographic info exist.

As to the accuracy of the numbers. We'll leave that to the experts, people with the tools to not only access the data but to also manipulate it in a number of ways.

From Inside Google Book Search

One definition of a book we find helpful inside Google when handling book metadata is a “tome,” an idealized bound volume. A tome can have millions of copies (e.g. a particular edition of “Angels and Demons” by Dan Brown) or can exist in just one or two copies (such as an obscure master’s thesis languishing in a university library). This is a convenient definition to work with, but it has drawbacks. For example, we count hardcover and paperback books produced from the same text twice, but treat several pamphlets bound together by a library as a single book.

You'll then read about ISBN's, SBN's, Library of Congress Accession Numbers, and OCLC Accession Numbers.

Then, Google begins their explanation of how they count.

So what does Google do? We collect metadata from many providers (more than 150 and counting) that include libraries, WorldCat, national union catalogs and commercial providers. At the moment we have close to a billion unique raw records. We then further analyze these records to reduce the level of duplication within each provider, bringing us down to close to 600 million records.

Does this mean that there are 600 million unique books in the world? Hardly. There is still a lot of duplication within a single provider (e.g. libraries holding multiple distinct copies of a book) and among providers -- for example, we have 96 records from 46 providers for “Programming Perl, 3rd Edition”. Twice every week we group all those records into “tome” clusters, taking into account nearly all attributes of each record.

Next, the "trust" they assign to various data sources.

When evaluating record similarity, not all attributes are created equal. For example, when two records contain the same ISBN this is a very strong (but not absolute) signal that they describe the same book, but if they contain different ISBNs, then they definitely describe different books. We trust OCLC and LCCN number similarity slightly less, both because of the inconsistencies noted above and because these numbers do not have checksums, so catalogers have a tendency to mistype them.

"We put even less trust in the “free-form” attributes such as titles, author names and publisher names."

[Examples/Snip]

We tend to rely on publisher names, as they are cataloged, even less. While publishers are very protective of their names, catalogers are much less so.

[Examples/Snip]

So after all is said and done, how many clusters does our algorithm come up with? The answer changes every time the computation is performed, as we accumulate more data and fine-tune the algorithm. The current number is around 210 million.

But that's not the end. They post goes on to mention what types of material (non-books) they exclude from the Google total:

+ Microforms (8 million)
+ Audio Recordings (4.5 million)
+ Videos (2 million)
+ Maps (another 2 million)
+ T-Shirts with ISBNs (about one thousand),
+ Turkey probes (1, added to a library catalog as an April Fools joke), and other items for which we receive catalog entries.

[Our emphasis] Counting only things that are printed and bound, we arrive at about 146 million. This is our best answer today. It will change.

Finally, what to do about serials and government documents?

Our handling of serials is still imperfect. Serials cataloging practices vary widely across institutions. The volume descriptions are free-form and are often entered as an afterthought. For example, “volume 325, number 6”, “no. 325 sec. 6”, and “V325NO6” all describe the same bound volume. The same can be said for the vast holdings of the government documents in US libraries [What about government docs from other countries, the UN, and ngo's?] At the moment we estimate that we know of 16 million bound serial and government document volumes. This number is likely to rise as our disambiguating algorithms become smarter.

After we exclude serials, we can finally count all the books in the world. There are 129,864,880 of them. At least until Sunday.

Source: Inside Google Book Search

Views: 1313



blog comments powered by Disqus

« All ResourceBlog Articles

 

Read about the FreePint FamilyThe FreePint Family is a family of resources to help information workers be more effective, raise the value of information in their organisations and contribute to success.

'FreePint... provides most of my professional development because it won't come through work and [other resources] just don't cut it.'

Read about the FreePint Family »


Visit the FreePint ShopFreePint Shop: FreePint sells reports, resources and subscription products to support your information work and information-related decisions.

Latest: FreePint Volume: Critical Insight on Social Media 2012 (01 Feb 2012) | FUMSI Report: Folio on Conferences and Continuing Professional Development (26 Jan 2012) | FreePint Research Report: Information Governance Policies and Priorities (25 Jan 2012) | Docuticker Report: DocuTips on Health Literacy (19 Jan 2012) | VIP Magazine: 98 (18 Jan 2012)

Browse the FreePint Shop »


FUMSI ForumFUMSI Forum: Do you have a research question? Post it to the FUMSI Forum, where professionals share Q&A and useful tips on how to Find, Use, Manage and Share Information. It's free.

Latest FUMSI Forum postings: Most Shared Content on Finding Information (09 Feb 2012) | Times are changing - a FUMSI Editorial (09 Feb 2012) | [TIPPLE] eBook resources - Share (07 Feb 2012) | Most Shared Content on Sharing Information (01 Feb 2012) | Our own worst enemy? - a FUMSI Editorial (01 Feb 2012)

Visit the FUMSI Forum and post »


VIP LiveWireVIP LiveWire: Offers commentary on emerging news stories of interest to premium content users, vendors and industry insiders.

Latest VIP LiveWire postings: Social media and BRIC - new report (08 Feb 2012) | Reuters takes the social media pulse (08 Feb 2012) | How to deal with the tech-savvy customer? (08 Feb 2012) | More ways for employers to poke around (01 Feb 2012) | Trust your supplier? Check with the Armadillo (01 Feb 2012)

Visit the VIP LiveWire »






Subscribe

Subscribe to the ResourceShelf Newsletter and receive the weekly sampler of posts and Resource of the Week.

Find out more »

ResourceShelf sponsored by:

Article Categories

All Article Categories »

Archive

All Archives »