Google To Use OCR To Index Scanned Docs

Friday, October 31, 2008

Google has announced that it will now begin including scanned documents in its search results.

Unlike standard text documents, scanned files don't contain any text data that Google's spiders can index. To correct this, Google has begun using Optical Character Recognition (OCR) technology to convert images, including scanned documents, into digital text. That text can then be indexed and searched by Google.
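To make the idea concrete, here is a minimal sketch in Python of how OCR output could feed a search index. It is an illustration only, not a description of Google's actual system: the names build_index, search and fake_ocr are hypothetical, and fake_ocr is a stand-in that simply decodes bytes, where a real pipeline would call an actual OCR engine.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(scans: Dict[str, bytes],
                ocr: Callable[[bytes], str]) -> Dict[str, Set[str]]:
    """Run OCR on each scan and map every recognized word to its documents."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, image in scans.items():
        text = ocr(image)  # the OCR step: image bytes in, plain text out
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


def fake_ocr(image: bytes) -> str:
    """Hypothetical stand-in for an OCR engine: treats the bytes as text."""
    return image.decode("utf-8")


if __name__ == "__main__":
    # Assumed sample data; real scans would be image or PDF bytes.
    scans = {
        "wiring.pdf": b"Repairing aluminum wiring in older homes",
        "locks.pdf": b"Spin lock performance on multiprocessors",
    }
    index = build_index(scans, fake_ocr)
    print(search(index, "aluminum wiring"))
```

Passing the OCR step in as a function keeps the indexing logic separate from whichever recognition engine is used, so the stub can be swapped for a real one without touching the rest.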

In the past, Google would attempt to index these image files as well as possible, but could typically search only file titles and nearby metadata, not the contents of the documents. From now on, Google searches will include the text within these scanned images in normal search results. When you encounter a scanned document, you’ll be able to view it in its original form as a PDF, or as a converted text file (click “View as HTML”).

Google has provided a few sample searches so users can check out the new system at work. Try any of the queries below, and note the document excerpt in the search results, along with the full text presented after the “View as HTML” link:

repairing aluminum wiring

spin lock performance

Mumps and Severe Neutropenia

Steady success in a volatile world