The first document management system (scanning, indexing and retrieval) that Copilabs sold went to the original “MVP Sports” headquarters. It cost nearly $50,000 and was readily cost-justified. “How is that possible?” you ask.

They had a filing system of 19 lateral file cabinets with 4 drawers each, 76 drawers in all. Bulging folders always sat on top of the cabinets, awaiting re-filing. The opportunities for losing pages, or simply misplacing files, were large. The cost of lost time and lost documents was greater than the cost of financing the new electronic filing system.

For that system, 22 years ago, one operator had to select and enter indices for the kinds of documents contained in each “file.” That took 4 to 5 hours per day, and the system was still cost-justified.

How wonderful that the filing and indexing that once took that huge investment can now be matched with a fast PC, a good copier/scanner and $400 worth of software. The indexing is now done automatically, thanks to “OCR,” or Optical Character Recognition.

OCR works from a database (that is itself indexed) of known patterns of black and white pixels (“pels,” more correctly): the shapes of the letters and numbers that carry information to humans. The software matches the patterns it finds in a scanned image against those known patterns. We say that OCR “reads” the images we “capture” by scanning, and turns them back into real text. For indexing purposes, OCR need not be perfect. It only has to come “close” to a complete translation of the black and white areas of our scanned images into whole words, or strings of numbers or characters, that can be connected to the “page-image” we’d like to look at right now, thank you very much.
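
To make the pattern-matching idea concrete, here is a toy sketch in Python. It assumes tiny 3-by-5 grids of pels and three made-up letter shapes; real OCR engines are far more elaborate, so treat the names `TEMPLATES`, `score` and `recognize` as illustrations only. It shows how a scanned shape with a stray pel can still be matched to the closest known pattern.

```python
# Toy glyph templates on a 3x5 grid: '#' is a black pel, '.' is a white pel.
# These shapes are invented for illustration, not taken from a real OCR engine.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def score(glyph, template):
    """Count the pels that agree between a scanned glyph and a template."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the character whose template agrees with the glyph on the most pels."""
    return max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))


# A scanned "L" with one pel of noise: a stray black pel in the top row.
noisy_l = ["#.#", "#..", "#..", "#..", "###"]
print(recognize(noisy_l))  # prints "L"
```

The noisy shape agrees with the “L” template on 14 of 15 pels, which beats the other letters, so a speck of dirt on the page does not change the answer.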

At a speed of about one second per page, OCR software can “read” scanned pages and attach to each page-image a “hidden” page holding the adequately translated words. Those sets of words, even the imperfect translations, are themselves indexed. The indices tie the words, phrases and number-strings to the specific page – or pages – on which they appear: word – page – document – folder – directory – drive.
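
The word-to-page index described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product does it: the sample pages, the file paths and the function names `build_index` and `search` are all hypothetical. The lookup uses the standard library's `difflib.get_close_matches`, so that an imperfect translation (here “rernittance” for “remittance”) can still be found.

```python
import re
from collections import defaultdict
from difflib import get_close_matches


def build_index(pages):
    """Map each word to the set of (document, page number) pairs where it appears."""
    index = defaultdict(set)
    for (document, page_number), text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((document, page_number))
    return index


def search(index, term, cutoff=0.8):
    """Look up a term, also accepting close misspellings left behind by OCR."""
    candidates = get_close_matches(term.lower(), list(index), n=5, cutoff=cutoff)
    hits = set()
    for word in candidates:
        hits |= index[word]
    return sorted(hits)


# Hypothetical OCR output, keyed by (document path, page number).
pages = {
    ("invoices/acme.pdf", 1): "Invoice 10452 for baseball gloves",
    ("invoices/acme.pdf", 2): "Shipping terms and rernittance address",  # misread "remittance"
    ("contracts/lease.pdf", 1): "Lease agreement for warehouse",
}
index = build_index(pages)
print(search(index, "10452"))       # [('invoices/acme.pdf', 1)]
print(search(index, "remittance"))  # [('invoices/acme.pdf', 2)]
```

The document path in each entry already carries the folder, directory and drive, so one lookup leads from the word all the way back to the page-image.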

You type in the “key words” or number-string and up pops the document or even the single page inside that document that you need to see, in seconds.

OCR-based indexing can be very sophisticated or relatively simple. In a simple form, your OCR indexing software “watches” the directories and folders where scanned images are “saved” as part of the scanning process. When a document is added or augmented, the OCR and indexing database swing into action, reading all the new or changed pages, indexing their “read” characters and words and updating the database for the filing system. As soon as that is done, a user can find one or more of those pages with a pertinent keyword or character string.
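
The “watching” step can be approximated by checking a folder for files whose modification time has changed since the last look. The sketch below assumes that simple polling approach (commercial products may hook into the operating system instead), and `fake_ocr` is a hypothetical stand-in that just reads a text file where a real OCR engine would read an image.

```python
import os


def scan_for_changes(folder, seen):
    """Return files under folder that are new or modified since the last scan.

    `seen` maps each path to the modification time recorded on the previous
    scan, and is updated in place.
    """
    changed = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
    return sorted(changed)


def update_index(folder, seen, index, ocr):
    """Re-read every new or changed file and refresh its words in the index."""
    for path in scan_for_changes(folder, seen):
        # Drop stale entries for this file before adding the fresh ones.
        for paths in index.values():
            paths.discard(path)
        for word in ocr(path).lower().split():
            index.setdefault(word, set()).add(path)
    return index


def fake_ocr(path):
    """Hypothetical stand-in for an OCR engine: reads the file as plain text."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling `update_index(folder, seen, index, fake_ocr)` every few seconds, starting with an empty `seen` and an empty `index`, keeps the index current: only files that are new or changed get “read” again.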

Thanks!

Your Friend, Ohhh-Cee-Ahhr