
Monday, 6th August 2007

New program color-codes text in Wikipedia entries to indicate trustworthiness

The online reference site Wikipedia enjoys immense popularity despite nagging doubts about the reliability of entries written by its all-volunteer team. A new program developed at the University of California, Santa Cruz, aims to help with the problem by color-coding an entry's individual phrases based on contributors' past performance.

The program analyzes Wikipedia's entire editing history--nearly two million pages and some 40 million edits for the English-language site alone--to estimate the trustworthiness of each page. It then shades the text in deepening hues of orange to signal dubious content. A 1,000-page demonstration version is already available on a web page operated by the program's creator, Luca de Alfaro, associate professor of computer engineering at UCSC.
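The article doesn't give the demo's exact colour mapping, but the idea of shading text by trust score can be sketched like this (a hypothetical mapping, assuming a trust score between 0 and 1; the actual UCSC scheme may differ):

```python
def trust_to_color(trust):
    """Map a trust score in [0, 1] to an RGB background colour.

    trust = 1.0 -> white (fully trusted text, no shading)
    trust = 0.0 -> deep orange (most dubious text)
    """
    trust = max(0.0, min(1.0, trust))
    r = 255                        # red channel stays saturated throughout
    g = int(165 + 90 * trust)      # 165 (orange) up to 255 (white)
    b = int(255 * trust)           # 0 (orange) up to 255 (white)
    return "#{:02x}{:02x}{:02x}".format(r, g, b)
```

Fully trusted text would render on a white background, while untrusted text would sit on a deepening orange, matching the demo's description of "deepening hues of orange".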

Other sites already employ user ratings as a measure of reliability, but they typically depend on users' feedback about each other. This method makes the ratings vulnerable to grudges and subjectivity. The new program takes a radically different approach, using the longevity of the content itself to learn what information is useful and which contributors are the most reliable.
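A much-simplified sketch of that longevity idea (not the UCSC algorithm itself; the gain/loss weights here are illustrative assumptions): an author's reputation rises when a later revision preserves their edit and falls, more sharply, when the edit is undone.

```python
def update_reputation(rep, survived, gain=1.0, loss=3.0):
    """Return an author's new reputation after one of their edits is judged
    by a subsequent revision.

    survived=True  -> the edit was kept by a later revision (reward)
    survived=False -> the edit was undone (penalty, weighted more heavily)
    Reputation is floored at 0, the minimum value mentioned in the paper.
    """
    return max(0.0, rep + gain) if survived else max(0.0, rep - loss)

# Example: a new author (reputation 0) whose first edit survives, then is undone.
rep = 0.0
rep = update_reputation(rep, survived=True)   # edit kept: reputation rises to 1.0
rep = update_reputation(rep, survived=False)  # edit undone: back to the floor of 0.0
```

The key design point is that the signal comes from what later editors do to the content, not from explicit user-on-user ratings, which is what makes it harder to game through grudges.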

+ UCSC Wiki Lab: Wikipedia trust coloring demo

Source: University of California-Santa Cruz

A very interesting read! Some quick things we thought about as we read it:

+ This demo is based on a few thousand pages. We wonder how (or whether) this system would scale from that relatively small sample to the massive, constantly changing and expanding corpus that makes up Wikipedia at the present time. What about future growth? Wikipedia is likely to keep growing, with new entries added regularly. Can this system handle it? The larger the corpus of data becomes, the more difficult it can be to maintain, even assuming that the Wikipedia user/editor base stays the same. What if volunteers move on to the "next big thing"? And what issues does a large and widely respected database of old, incorrect, out-of-date, and potentially manipulated data pose for the researcher who has come to depend on Wikipedia?

+ The demo is based on Wikipedia as of a specific date. As we've mentioned in the past, anyone can download Wikipedia data, claim it as their own, and place ads on it. Wikipedia's management does not require anyone who downloads the content to make changes or updates to it. So one of Wikipedia's positive attributes, quick changes, is lost.

+ We have seen and read how easy it is to manipulate content on other sites.
From the web site: "authors whose contributions are undone lose reputation..." Can this also be gamed? We've heard many stories of accurate Wikipedia material, added by honest Wikipedians, being removed, replaced, or changed because an editor "doesn't like it". This content might be a specific statistic (e.g. the population of "xxx") or an entire entry. This News.com article tells the story of one News.com writer.

+ Can this system alert editors to "long tail" material that has not been updated in xxx amount of time? It's one thing to have a Wikipedia entry, but something else for it to be kept current with the latest information. Popular topics are checked constantly, but doesn't "long tail" material deserve the same review and updating? As the database grows larger, this will become more of a challenge. Can this system help?

+ Could authors with strong, positive reputations begin "selling" their reputations by placing or changing content for others?


...changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms.

How does one measure quality? Quality of writing style? Source of information? Currency of information? How does one measure content that might be undone because someone else just doesn't like it for one reason or another?

+ Since anyone can download Wikipedia data and use it on their own site, how would a typical user know and understand that they might be viewing pages that do not reflect the rankings this algorithm provides?

ResourceShelf Contributing Editor, Dan Giancaterino adds:

The problem with basing this system on author reputation is that it can be gamed. Nothing would prevent someone from making lots of edits to obscure pages. Since very few people actually see those edits, there is little chance of their being reversed. Bingo ... cheap reputation. They've got to find a way to factor the importance of the page being edited into the author reputation algorithm. And all this obscures the real point: I want/need an accurate answer to something, not shades of orange.
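Dan's suggestion made concrete: a hypothetical tweak (not in the UCSC paper) that weights each reputation change by how visible the edited page is, so farming edits on obscure pages earns almost nothing. The view-count proxy and the `scale` parameter are illustrative assumptions.

```python
def weighted_rep_change(base_change, page_views, scale=10000.0):
    """Scale a reputation change by page visibility.

    page_views is used here as a rough proxy for page importance.
    The weight runs from 0 for a never-viewed page towards 1 for a
    heavily viewed one, so obscure-page edit farming yields ~nothing.
    """
    weight = page_views / (page_views + scale)
    return base_change * weight
```

An edit to an unwatched page would contribute essentially zero reputation, while the same edit to a popular, heavily watched page would count nearly in full.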

+ Kudos to the authors for also pointing out this issue in their paper. Yes, a new user is considered guilty until their reputation (over time) proves otherwise. Sad but true, and as we know, constantly generating new user names to place or change content is a closely related issue, likely easily done by anyone who can write the code. Additionally, what about the Wikipedia user who registers (with good intentions) just to change, update, or modify a single entry?

Wikipedia allows users to register, and create an author identity, whenever they wish. As a consequence, we need to make the initial reputation of new authors very low, close to the minimum possible (in our case, 0). If we made the initial reputation of new authors any higher, then authors, after committing revisions that damage their reputation, would simply re-register as new users to gain the higher value. An unfortunate side-effect of allowing people to obtain new identities at will is that we cannot presume that people are innocent until proven otherwise: we have to assign to newcomers the same reputation as proven offenders.
This is a contributing factor to our reputation having low precision: many authors who have low reputation still perform very good quality revisions, as they are simply new authors, rather than proven offenders.
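The incentive the quoted passage describes can be made concrete in a small sketch (the numbers are illustrative only): if new accounts start above the minimum reputation, any author whose reputation drops below the starting value is better off re-registering, which is exactly why the paper pins the initial value to the minimum.

```python
MIN_REP = 0.0  # the minimum (and, per the paper, initial) reputation

def best_strategy(current_rep, initial_rep):
    """What a purely self-interested author would do after losing reputation."""
    return "re-register" if initial_rep > current_rep else "keep account"

# With a generous starting reputation, damaged accounts get thrown away:
assert best_strategy(current_rep=1.0, initial_rep=5.0) == "re-register"
# With the initial reputation at the minimum, re-registering never helps:
assert best_strategy(current_rep=1.0, initial_rep=MIN_REP) == "keep account"
```

The cost, as the authors admit, is low precision: genuinely new, well-intentioned authors are indistinguishable from proven offenders until their edits start surviving.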

+ Might having some confirmed and accurate information about the author/contributor also be of value and cut down on these and similar problems? In other words, signed articles that also link to that person's background, reputation, etc. This is, to some degree, what Larry Sanger and the Citizendium project are up to.
From the Citizendium web site:

+ We offer gentle expert oversight.
+ We use our real names, not pseudonyms.

