
May 24, 2007

reCaptcha - brilliant idea

This is a brilliant idea that kills two birds with one stone.

Project Gutenberg started the ball rolling with the idea of making public domain literature available to a wide audience. Their idea was to scan books, OCR them into text and make them available on their website. It's a brilliant strategy, one that has allowed me to read a few free novels - turn-of-the-century ones, yes, but still a glittering array of reading material to choose from.

The problem with OCR is that the computer often misreads characters. The volunteer scanning the book then needs to go meticulously through the text produced by the OCR process and correct any and all errors. A hugely time-consuming business, I'm sure you'll agree.

The second 'bird' that I alluded to is the fact that many sites use a captcha to distinguish between humans and computers. A typical captcha consists of distorted text that only a human can read correctly. A computer attempting to enter spam onto a comment form has to rely on OCR to work out what the text says, and OCR struggles with distorted text.

This is where the makers of reCaptcha made the connection. In the book scanning process, the computer's OCR fails to recognise a word, but a human often has no difficulty in reading it.

Using a cache of problematic OCR'd words as captchas, the correction step of digitising books is largely, if not fully, automated. The problem word is displayed alongside a word whose text is already known. If the human types the known word correctly, their answer for the problem word is assumed to be correct too. Check that against a few other people's answers and you can be sure it is.
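
To make the idea concrete, here is a rough Python sketch of the check described above. It is only my guess at the logic, not reCaptcha's actual code: the class name, the method names and the `min_votes` threshold are all made up for illustration.

```python
from collections import Counter, defaultdict


class WordPairChallenge:
    """Hypothetical sketch: pair a known word with an unknown OCR word."""

    def __init__(self, known_words, min_votes=3):
        # known_words maps an image id to its already-verified text.
        self.known_words = known_words
        # min_votes is an assumed threshold, not a documented value.
        self.min_votes = min_votes
        # For each unknown image id, count the answers humans have typed.
        self.votes = defaultdict(Counter)

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        """Return True if the user passes the captcha."""
        expected = self.known_words[known_id].lower()
        if known_answer.strip().lower() != expected:
            # Failed the control word: reject and ignore the other answer.
            return False
        # Passed the control word: trust the answer for the unknown word.
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def resolved(self, unknown_id):
        """Return the agreed text once enough people gave the same answer."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        word, count = counts.most_common(1)[0]
        return word if count >= self.min_votes else None
```

The control word does the spam filtering, and the vote count does the proofreading: a single person's answer for the problem word is never trusted on its own.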

Brilliant strategy.


Posted by dottie at May 24, 2007 9:22 PM
