
Thursday, 24th September 2009

Peter Jacso Takes on Google Scholar: Finding Ghost Authors, Lost Authors, and Other Problems

Access the Full Text of the Entire Article

With all of the talk about Google Book Search lately, little has been written about Google Scholar. Now, in a lengthy and well-documented analysis (numerous screenshots) published in Library Journal, Dr. Peter Jacso from the University of Hawaii at Manoa, a monthly columnist for Gale/Cengage and a friend of ResourceShelf, documents some of the problems (two of them named in the title of the article) that he has found while using Google Scholar [GS] during the past several months. Actually, some of the problems go back years.

Here are just a few passages from Dr. Jacso's article that we found to be of greatest interest:

They [the Google Scholar developers] decided—very unwisely—not to use the good metadata generously offered to them by scholarly publishers and indexing/abstracting services, but instead chose to try and figure them out through ostensibly smart crawler and parser programs.

Millions of records have erroneous metadata, as well as inflated publication and citation counts

A free tool, Google Scholar has become the most convenient resource to find a few good scholarly papers—often in free full-text format—on even the most esoteric topics. [Our emphasis] For topical keyword searches, GS is most valuable. But it cannot be used to analyze the publishing performance and impact of researchers.

Very often, the real authors are relegated to ghost authors deprived of their authorship along with publication and citation counts. [Our emphasis] In the scholarly world, this is critical, as the mantra “publish or perish” is changing to “publish, get cited or perish.”


[Our emphasis] While GS developers have fixed some of the most egregious problems that I reported in several reviews, columns and conference/workshop presentations since 2004—such as the 910,000 papers attributed to an author named “Password”—other large-scale nonsense remains and new absurdities are produced every day.

The numbers in GS are inflated for two main reasons. First, GS lumps together the number of master records (created from actual publications), and the number of citation records (distinguished by the prefix: [citation]) when reporting the total hits for author name search.

...fee-based Web of Science and Scopus have lower article and citation counts and scientometric indicators, as they have a far more selectively defined source base with fewer journals from which to gather publication and citations data. In addition, they count only the master records for the authors’ publication count (as they should), and keep the stray and orphan citations in a separate file.
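The counting difference described in the two passages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google Scholar's actual logic: the `Record` layout, the `is_citation_only` flag (standing in for the "[citation]" prefix), and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    is_citation_only: bool  # hypothetical flag standing in for the "[citation]" prefix


def lumped_count(records):
    """Count every hit, master and citation-only alike (the inflated total)."""
    return len(records)


def master_count(records):
    """Count only master records, i.e. entries built from actual publications."""
    return sum(1 for r in records if not r.is_citation_only)


# Invented sample hits for a single author-name search
hits = [
    Record("Paper A", False),
    Record("Paper B", False),
    Record("[citation] Paper A", True),
    Record("[citation] Paper C", True),
    Record("[citation] Paper B", True),
]

print(lumped_count(hits))  # 5
print(master_count(hits))  # 2
```

With this invented data, lumping the two record types together reports five hits, while counting master records alone reports two.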

Unfortunately, the bad metadata has a long reach. These numbers are taken at face value by the free utilities such as the Google Scholar Citation Count gadget by Jan Feyereisl and the sophisticated and pretty Publish or Perish (PoP) software (produced by Tarma Software).

As about 10.2 million records from GBS [Google Book Search] are incorporated now in GS, the metadata disaster likely will continue unabated. It is bad enough to have so many records with erroneous publication years, titles, authors, and journal names.

In its stupor, the parser fancies as author names (parts of) section titles, article titles, journal names, company names, and addresses, such as Methods (42,700 records), Evaluation (43,900), Population (23,300), Contents (25,200), Technique(s) (30,000), Results (17,900), Background (10,500), or, in a whopping number of records, Limited (234,000) and Ltd (452,000). The numbers kept growing by several hundred thousand hits for the cumulative total of the above “authors” during the few days this paper was being written. More screenshots are available here.
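As a rough illustration of the kind of sanity check that would catch these cases, here is a minimal Python sketch. The word list comes from the passage above; the function name `looks_like_ghost_author` and the sample input are hypothetical, and nothing here reflects how the Google Scholar parser actually works.

```python
# Words the quoted passage reports being mistaken for author names.
SUSPECT_AUTHORS = {
    "methods", "evaluation", "population", "contents",
    "technique", "techniques", "results", "background",
    "limited", "ltd",
}


def looks_like_ghost_author(name: str) -> bool:
    """Return True when a parsed 'author' is really a heading or company suffix."""
    # Normalize surrounding whitespace, trailing periods, and letter case.
    return name.strip().strip(".").lower() in SUSPECT_AUTHORS


# Invented parser output for illustration
parsed = ["Results", "P Jacso", "Ltd.", "Background"]
flagged = [n for n in parsed if looks_like_ghost_author(n)]
print(flagged)  # ['Results', 'Ltd.', 'Background']
```

A fixed word list like this would only catch the examples already known; it is shown to make the failure mode concrete, not as a complete remedy.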

Lost Authors

These errors could be considered relatively harmless if they did not affect the contributions of genuine, real scholars. But the biggest problem is when the mess replaces real scholars with ghost authors, leaving the former as lost authors.


[Our emphasis] Certainly the entire database isn’t rotten, just a few million records. That may be a relatively small percentage—Google won’t reveal the total number of records, and these are just my few forensic search test queries—but there’s ample cause for worry.

In case of GBS [Google Book Search], Google relied on its collective Pavlovian reflex to blame the publishers and libraries (meaning the librarians, catalogers, indexers) for the wrong metadata.

In the case of Google Scholar, these same Googlish arguments will not fly, because practically all the scholarly publishers gave Google—hats in hand—their digital archive with metadata. The idea was to have Google index it and drive traffic to the publishers’ sites.

Yes, GS has fixed fairly quickly some of the major errors that I earlier used to demonstrate its illiteracy and innumeracy, but has so far left millions of others untouched.

GS designers have sent very under-trained, ignorant crawlers/parsers to recognize and fetch the metadata elements on their own. Not all of the indexing/abstracting services are perfect and consistent, but their errors are dwarfed by the types and volume of those in GS. This is the perfect example of the lethal mix of ignorance and arrogance GS developers applied to metadata and relevance ranking issues.

The parsers have not improved much in the past five years despite much criticism. GS developers corrected some errors that got negative publicity, but these were Band-Aids, where brain surgery and extensive parser training is required. Without these, GS will keep producing similar errors on a mega-scale.

Again, these highlights are only a small portion of the entire article, which also includes numerous screenshots. You can access the full text here.

Source: Library Journal

