Your Friend, Ohhh-Cee-Ahhr

The first document management system (scanning, indexing and retrieval) Copilabs sold went to the original “MVP Sports” headquarters. It cost nearly $50,000 and was readily cost-justified. “How is that possible?” you ask.

They had a filing system composed of 19 lateral file cabinets with 4 drawers each, 76 drawers in all. There were always bulging folders sitting on top of the cabinets, awaiting re-filing. The opportunities for losing pages, or simply misplacing files, were large. The costs in time and lost documents were larger than the cost of financing the new electronic filing system.

For that system, 22 years ago, one operator had to select and enter indices for the kinds of documents contained in each “file.” That took 4 to 5 hours per day, and the system was still cost-justified.

How wonderful that the filing and indexing that once required that huge investment can now be matched with a fast PC, a good copier/scanner and $400 worth of software. The indexing is now done automatically, thanks to “OCR,” or Optical Character Recognition.

OCR software works from a database (that is itself indexed) of known patterns of black and white pixels (also called “pels”). It matches the patterns on a scanned page against the known patterns that form the letters and numbers that transmit information to humans. We say that OCR “reads” the images we “capture” by scanning, and turns them back into real text. For indexing purposes, OCR need not be perfect. It need only come “close” to a complete translation of the black and white areas of our scanned images into whole words or strings of numbers or characters that can be connected to the “page-image” we’d like to look at right now, thank you very much.
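To see why “close” can be good enough, here is a minimal sketch in Python. It assumes the indexing software compares what you type against the imperfect words it stored, using the standard library’s difflib module. The word list, the function name and the 0.8 cutoff are all hypothetical, chosen only for illustration. A real product may handle misreads quite differently.

```python
import difflib

# Hypothetical words as OCR "read" them from one page; two contain typical misreads.
ocr_words = ["lnvoice", "smith", "february", "2O12"]

def find_indexed_words(query, indexed_words, cutoff=0.8):
    """Return indexed words that are close to the query, best match first."""
    return difflib.get_close_matches(query.lower(), indexed_words, n=3, cutoff=cutoff)

# The user types the word correctly; the index holds a slightly wrong reading.
matches = find_indexed_words("Invoice", ocr_words)
print(matches)
```

Even though the scanner’s “I” was read as a lowercase “l,” the typed word still finds its page.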

At a speed of about one second per page, OCR software can “read” scanned pages and attach to each page-image a “hidden” page of adequately translated words. Those sets of words, even the imperfect translations, are themselves indexed. The indices tie the words, phrases and number-strings to the specific page – or pages – on which they appear: word – page – document – folder – directory – drive.
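For the curious, the word-to-page tie can be sketched in a few lines of Python. This is only an illustration of the general idea (often called an inverted index), not the way any particular product stores its data. The sample pages, file paths and function names are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the OCR output and split it into words and number-strings."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each token to the set of (document, page number) pairs where it appears."""
    index = defaultdict(set)
    for (document, page_number), text in pages.items():
        for token in tokenize(text):
            index[token].add((document, page_number))
    return index

def search(index, query):
    """Return the pages that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

# Hypothetical OCR output, keyed by (document path, page number).
pages = {
    ("Smith Inc/invoice_0001.pdf", 1): "Smith, Inc. Invoice 10457 February 2012",
    ("Smith Inc/invoice_0001.pdf", 2): "Terms: net 30 days",
    ("Jones LLC/letter_0007.pdf", 1): "Dear Mr. Jones, regarding invoice 20991",
}

index = build_index(pages)
print(search(index, "10457"))    # the one page carrying that invoice number
print(search(index, "invoice"))  # every page that mentions an invoice
```

Because the document path is stored with each page number, one lookup leads from the word to the page, the document and the folder it lives in.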

You type in the “key words” or number-string and up pops the document, or even the single page inside that document that you need to see – in seconds. OCR-based indexing can be very sophisticated or relatively simple. In a simple form, your OCR indexing software “watches” the directories and folders where scanned images are “saved” as part of the scanning process. When a document is added or augmented, the OCR and indexing database swing into action, reading all the new or changed pages, indexing their “read” characters and words and updating the database for the filing system. As soon as that is done, a user can find one or more of those pages with a pertinent keyword or character string.
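The “watching” can be pictured with a small Python sketch. It assumes the simplest possible approach: look through the folders every so often and compare each file’s modification time with the one remembered from the last look. Commercial software may rely on notifications from the operating system instead. The function name and the sample file are hypothetical.

```python
import os
import tempfile

def find_new_or_changed(root, seen):
    """Return files under root that are new or modified since the last call.

    `seen` maps file path -> last modification time and is updated in place.
    """
    changed = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            modified = os.path.getmtime(path)
            if seen.get(path) != modified:
                seen[path] = modified
                changed.append(path)
    return changed

# Demonstration in a throwaway folder standing in for the scan directory.
with tempfile.TemporaryDirectory() as root:
    scan_path = os.path.join(root, "scan_0001.txt")
    with open(scan_path, "w") as handle:
        handle.write("page one")

    seen = {}
    first_pass = find_new_or_changed(root, seen)   # the new file is reported
    second_pass = find_new_or_changed(root, seen)  # nothing has changed

    # Simulate the scanner adding a page by moving the modification time forward.
    os.utime(scan_path, (0, os.path.getmtime(scan_path) + 60))
    third_pass = find_new_or_changed(root, seen)   # the changed file is reported again

print(len(first_pass), len(second_pass), len(third_pass))
```

Each file the function reports would then be handed to the OCR step and its words added to the index.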

Contact us to try OCR on your files.


Finding All Those Pages

We’ve used indexes (also called indices, nowadays) since we’ve had writing. As soon as we made separate rooms for different grains, or put fences between fields, we kept lists of which was which. Chapters in books, page numbering, wooden pigeon-hole sorting cabinets, street addresses: they’re all types of indexes. Even the numbers on sports uniforms are used to link players’ names with their positions on the field, their lockers in the clubhouse, their paychecks and so on. Indexes.

When you start scanning pages and saving them in a computer, the software “names” each page or batch of pages (called a document) by date, time and job number, or with a number that “increments” by one over the previous page scanned that day or the previous page stored in that folder/directory. By itself, that string of information is not particularly helpful for retrieving the image of the page or pages you need. So, we create two simple computer indexes: document naming conventions, and logical folder/directory names.
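As a small illustration, here is one way those two indexes could be put together in Python. The pattern used (document type, date, number) and the simplified folder name are assumptions made for the example, not a required convention. Any consistent scheme would serve.

```python
from datetime import date
from pathlib import PurePosixPath

def document_path(root, customer, doc_type, doc_date, number):
    """Build folder and file names that carry the details a person would search by."""
    filename = f"{doc_type}_{doc_date.isoformat()}_{number}.pdf"
    return PurePosixPath(root) / customer / filename

# Hypothetical invoice: the customer becomes the folder, the rest becomes the file name.
path = document_path("scans", "Smith Inc", "Invoice", date(2012, 2, 14), "10457")
print(path)  # scans/Smith Inc/Invoice_2012-02-14_10457.pdf
```

Writing the date as year-month-day means that sorting the file names alphabetically also sorts them by date.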

If you are looking for Smith, Inc. invoices, you would go to the Smith, Inc. folder in your computer filing system. If there are hundreds of pages filed in that folder, you would look for those you named “Invoice,” together with a date and invoice number you may have added. Those are the pieces of information you would know when you went looking, aren’t they?

Think of your paper-filled file cabinets that you hope to eliminate with scanned documents (“electronic files”). If you were going to find a Smith, Inc. invoice from February of 2012, you would know which cabinet to look in, which drawer to roll out and then, by those little plastic tabs, which Pendaflex folder to look inside. That’s because everything was labeled or named as you filed it for just this purpose. Those bits of knowledge and labeling are all “indices.”

But part of the value of electronic files is saving time. When you go looking in your computer for only the February 2012 invoices to display on screen, you want it to happen quickly. You don’t want to, figuratively, walk to the file cabinet, check the labeling of the cabinet and drawers, pull one open, paw through the hanging folders and finally open the manila folder that has invoices in it. No, what you really want – and can obtain from electronic filing – is for Smith, Inc.’s “Feb, 2012, Inv” PAGES to appear on screen, or in a simple list on screen from which you can choose, when you enter the quoted string above or click on a couple of identifiers in a drop-down list.
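Here is a rough Python sketch of that kind of lookup. It assumes files were named as type, date and number inside a folder per customer. The file names, folder names and function name are invented for the example.

```python
from pathlib import PurePosixPath

def find_documents(paths, customer, doc_type, year, month):
    """Return paths in the customer's folder whose name matches type, year and month."""
    prefix = f"{doc_type}_{year:04d}-{month:02d}-"
    matches = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.parent.name == customer and path.name.startswith(prefix):
            matches.append(raw)
    return sorted(matches)

# Hypothetical contents of the electronic filing system.
stored = [
    "scans/Smith Inc/Invoice_2012-02-03_10441.pdf",
    "scans/Smith Inc/Invoice_2012-02-14_10457.pdf",
    "scans/Smith Inc/Invoice_2012-03-01_10502.pdf",
    "scans/Smith Inc/Letter_2012-02-20_0007.pdf",
    "scans/Jones LLC/Invoice_2012-02-09_20991.pdf",
]

feb_invoices = find_documents(stored, "Smith Inc", "Invoice", 2012, 2)
for name in feb_invoices:
    print(name)
```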

Or, even better, maybe you could look at only the ONE invoice you care about at the moment, by typing in the invoice number, which would be unique to that page among the thousands of scanned pages. That would be cool.

And, it’s possible without your naming every single scanned page separately. It’s all thanks to “OCR.”

Read more about OCR.