
Monday, 6th August 2007

New program color-codes text in Wikipedia entries to indicate trustworthiness

The online reference site Wikipedia enjoys immense popularity despite nagging doubts about the reliability of entries written by its all-volunteer team. A new program developed at the University of California, Santa Cruz, aims to help with the problem by color-coding an entry's individual phrases based on contributors' past performance.

The program analyzes Wikipedia's entire editing history--nearly two million pages and some 40 million edits for the English-language site alone--to estimate the trustworthiness of each page. It then shades the text in deepening hues of orange to signal dubious content. A 1,000-page demonstration version is already available on a web page operated by the program's creator, Luca de Alfaro, associate professor of computer engineering at UCSC.

Other sites already employ user ratings as a measure of reliability, but they typically depend on users' feedback about each other. This method makes the ratings vulnerable to grudges and subjectivity. The new program takes a radically different approach, using the longevity of the content itself to learn what information is useful and which contributors are the most reliable.
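To make the "longevity of the content" idea concrete for ourselves, we put together a minimal sketch. It is not the UCSC program: the function names, the bag-of-words comparison, the `horizon` parameter, and the +1/-1 scoring are all our own assumptions, chosen only to illustrate how surviving text could raise an author's score and undone text could lower it.

```python
from collections import Counter, defaultdict


def words_added(prev_text, new_text):
    """Multiset of words that appear in new_text but not in prev_text."""
    return Counter(new_text.split()) - Counter(prev_text.split())


def longevity_reputation(history, horizon=1):
    """Score authors by how much of their added text survives.

    history: list of (author, full_page_text) in chronological order.
    horizon: how many revisions later an edit is judged (an assumption).
    Each judged edit contributes between -1 (fully undone) and +1
    (fully preserved). Edits too recent to judge are skipped.
    """
    reputation = defaultdict(float)
    prev_text = ""
    for i, (author, text) in enumerate(history):
        before = Counter(prev_text.split())
        added = words_added(prev_text, text)
        total = sum(added.values())
        judge_index = i + horizon
        if total and judge_index < len(history):
            later = Counter(history[judge_index][1].split())
            # Words the author added that are still present later,
            # beyond what the page already contained before the edit.
            surviving = sum((added & (later - before)).values())
            reputation[author] += 2 * (surviving / total) - 1
        prev_text = text
    return dict(reputation)


# Hypothetical page history, made up for illustration.
history = [
    ("alice", "the sky is blue"),
    ("bob", "the sky is blue and made of cheese"),
    ("carol", "the sky is blue"),
    ("dave", "the sky is blue today"),
]
scores = longevity_reputation(history)
print(scores)  # {'alice': 1.0, 'bob': -1.0}
```

In this toy history, alice's text is still there one revision later, bob's addition is undone by carol, and carol and dave get no score because carol added nothing and dave's edit is too recent to judge. A real system would need a proper text diff rather than word counts.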

+ UCSC Wiki Lab: Wikipedia trust coloring demo

Source: University of California-Santa Cruz

A very interesting read! Some quick things we thought about as we read it:

+ This demo is based on a few thousand pages. We wonder how (or if) this system would scale from this relatively small amount of content to the massive amount of constantly changing and expanding content that makes up Wikipedia today. What about future growth? Wikipedia is likely to keep growing, with new entries added regularly. Can this system handle it? The larger the corpus becomes, the harder it is to maintain, even assuming the Wikipedia user/editor base stays the same. What if volunteers move on to the "next big thing"? What does a large and (by many) respected database of old, incorrect, out-of-date, and potentially manipulated data mean for the researcher who has come to depend on Wikipedia?

+ The demo is based on a snapshot of Wikipedia from a specific date. As we've mentioned in the past, anyone can download Wikipedia data, claim it as their own, and place ads on it. Wikipedia's management does not require whoever downloads the content to make changes or updates to it. So one of the positive attributes of Wikipedia, quick changes, is lost.

+ We have seen and read how easy it is to manipulate content on other sites. From the web site: "authors whose contributions are undone lose reputation..." Can this also be gamed? We've heard many stories in which accurate Wikipedia material added by honest Wikipedians has been removed, replaced, or changed because an editor "doesn't like it". The content might be a specific statistic (e.g., the population of "xxx") or an entire entry. This News.com article tells the story of one News.com writer.

+ Can this system alert editors to "long tail" material that has not been updated in xxx amount of time? It's one thing to have a Wikipedia entry, but something else for it to be kept up to date with the latest information. Popular topics are checked constantly, but doesn't "long tail" material deserve the same review and updating? As the database grows larger, this will become more of a challenge. Can this system help?

+ Could authors with strong, positive reputations begin "selling" their reputations by placing or changing content for others?

+ "...changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms."

How does one measure quality? Quality of writing style? Source of information? Currency of information? How does one measure content that might be undone because someone else just doesn't like it for one reason or another?

+ Since anyone can download Wikipedia data and use it on their own site, how would a typical user know that they might be viewing pages that do not reflect the rankings this algorithm provides?

ResourceShelf Contributing Editor, Dan Giancaterino adds:

The problem with basing this system on author reputation is that it can be gamed. There's nothing that would prevent anyone from adding lots of edits to obscure pages. Since very few people are actually seeing these edits, there's little chance of being reversed. Bingo ... cheap reputation. They've got to find a way to factor in the importance of the page being edited into the author reputation algorithm. And all this is obscuring the real point: I want/need an accurate answer to something. Not shades of orange.
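One way to picture Dan's suggestion is to scale each edit's contribution by how widely read the page is. The sketch below is only our illustration, not anything from the paper: the saturating weight, the `half_weight_views` constant, and the monthly view counts are all hypothetical.

```python
def importance_weight(monthly_views, half_weight_views=1000):
    """Hypothetical weight in [0, 1): near 0 for pages almost nobody
    reads, approaching 1 for heavily read pages."""
    return monthly_views / (monthly_views + half_weight_views)


def reputation_scores(edits, weighted=True):
    """edits: iterable of (author, survived_fraction, monthly_views).

    survived_fraction is the share of the edit still standing (0 to 1).
    Each edit contributes between -1 and +1, optionally scaled by the
    importance of the page it was made on.
    """
    scores = {}
    for author, survived_fraction, monthly_views in edits:
        quality = 2 * survived_fraction - 1
        weight = importance_weight(monthly_views) if weighted else 1.0
        scores[author] = scores.get(author, 0.0) + quality * weight
    return scores


# Made-up data: a "reputation farmer" making 50 unchallenged edits on
# obscure pages, and a regular editor making 5 edits on busy pages.
edits = [("farmer", 1.0, 9)] * 50 + [("regular", 1.0, 99_999)] * 5

unweighted = reputation_scores(edits, weighted=False)
weighted = reputation_scores(edits, weighted=True)
print(unweighted)
print(weighted)
```

Without the weight, the farmer comes out well ahead of the regular editor. With it, the 50 unseen edits add up to less than the 5 edits that many readers had the chance to challenge. Whether page views are the right measure of "importance" is an open question.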

+ Kudos to the author for also pointing out this issue in their paper. Yes, a new user is considered guilty until their reputation (over time) proves otherwise. Sad but true. And as we know, constantly generating new user names to place or change content is a closely related issue, and likely easy for someone who can write the code. Additionally, what about the Wikipedia user who registers (with good intentions) just to change, update, or modify a single entry?

Wikipedia allows users to register, and create an author identity, whenever they wish. As a consequence, we need to make the initial reputation of new authors very low, close to the minimum possible (in our case, 0). If we made the initial reputation of new authors any higher, then authors, after committing revisions that damage their reputation, would simply re-register as new users to gain the higher value. An unfortunate side-effect of allowing people to obtain new identities at will is that we cannot presume that people are innocent until proven otherwise: we have to assign to newcomers the same reputation as proven offenders.
This is a contributing factor to our reputation having low precision: many authors who have low reputation still perform very good quality revisions, as they are simply new authors, rather than proven offenders.
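The re-registration argument in the passage above can be walked through with a toy simulation. This is our own sketch under simple assumptions (a fixed penalty per undone edit, a floor of 0, and an author who re-registers whenever a fresh account would start higher); none of these numbers come from the paper.

```python
MIN_REPUTATION = 0.0  # the floor the quoted passage mentions


def simulate_vandal(initial_reputation, bad_edits, penalty=1.0):
    """Return the reputation attached to each bad edit when it is made.

    Assumes every edit is undone and costs `penalty`, and that the
    author opens a new account whenever the starting reputation is
    strictly higher than the current one. Both are hypothetical.
    """
    reputation = initial_reputation
    seen = []
    for _ in range(bad_edits):
        if initial_reputation > reputation:
            reputation = initial_reputation  # re-register as a new user
        seen.append(reputation)
        reputation = max(MIN_REPUTATION, reputation - penalty)
    return seen


print(simulate_vandal(initial_reputation=5.0, bad_edits=4))  # [5.0, 5.0, 5.0, 5.0]
print(simulate_vandal(initial_reputation=0.0, bad_edits=4))  # [0.0, 0.0, 0.0, 0.0]
```

With a generous starting value, the vandal's edits always carry the full newcomer reputation, so the penalties never stick. With a starting value at the floor, re-registering gains nothing, which is also why honest newcomers end up looking the same as proven offenders.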

+ Might having some confirmed and accurate information about the author/contributor also be of value and cut down on these and similar problems? In other words, signed articles that also link to that person's background, reputation, etc. This is, to some degree, what Larry Sanger and the Citizendium project are up to.
From the Citizendium web site:

+ We offer gentle expert oversight.
+ We use our real names, not pseudonyms.

