PDF OCR Capabilities... | Jax Beach Technology Services

By donmc, 9 July, 2008

From: Planet PDF

A couple of key points here. First, this discussion applies only to Acrobat, not to Reader. Second, prior to Acrobat 6, Adobe allowed you to perform "paper capture" with Acrobat only up to 50 pages. If you have Acrobat 4 or 5, you've got a 50-page limit (although, of course, there are ways to work around it). I think that Adobe still offers the Capture Server product for large-scale scanning and OCR work. It's meant for use in a high-volume production environment, such as at a litigation support vendor. In my experience, in government at least, people were leery of using it because you paid by the page. That is, you could buy a 100,000-page license, and then you had to fill 'er up again for the next 100,000. Acrobat 6 Professional allows you to "capture" or OCR large documents without buying the separate server, but it is still not truly a substitute for industrial-strength tools in a production environment. It is, however, capable of a surprising level of automation, and as far as I can tell, its character recognition isn't dumbed down.
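To make the pay-by-the-page model concrete, here is a small Python sketch of the arithmetic. The 100,000-page block size is simply the figure used in the example above, the function name `license_blocks_needed` is hypothetical, and no actual Adobe pricing or licensing terms are implied.

```python
def license_blocks_needed(total_pages, block_size=100_000):
    """Return how many page-count licenses are needed to cover total_pages.

    block_size is an assumption taken from the 100,000-page example;
    it is not a documented Adobe licensing figure.
    """
    if total_pages < 0 or block_size <= 0:
        raise ValueError("total_pages must be >= 0 and block_size must be > 0")
    # Ceiling division with integers only: a partly used block still
    # has to be bought in full.
    return -(-total_pages // block_size)
```

Under these assumptions, a 250,000-page production would need three blocks, because the last 50,000 pages still require a full license.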

So here you are with a big old TIFF file. Or, if you are like me and occasionally have opposing counsel who just wants to jerk your chain, a PDF file that was produced in "image only" format from MS Word and contains no text.

In Acrobat 6, go to File > Create PDF > From File and select the TIFF file that you want to convert. That brings your image into the PDF format, but still doesn't make it word-searchable.

Note: You can also choose "From Multiple Files" if you want to do a batch.
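If you batch regularly, it can help to gather the TIFF files before pointing "From Multiple Files" at them. The following is a minimal Python sketch, assuming the scans sit in a single folder and use a .tif or .tiff extension. The function name `collect_tiffs` is hypothetical, and nothing here talks to Acrobat itself; it only builds the list of files you would then select by hand.

```python
from pathlib import Path


def collect_tiffs(folder):
    """Return the TIFF files directly inside `folder`, sorted by name."""
    tiff_suffixes = {".tif", ".tiff"}
    found = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in tiff_suffixes
    ]
    # Sort case-insensitively so the batch is added in a predictable order.
    return sorted(found, key=lambda p: p.name.lower())
```

Sorting by name is a design choice for scans that are numbered in sequence; if your files are named differently, a sort by modification time may suit you better.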

Now, go to Document > Paper Capture > Start Capture. The dialog that comes up gives you some choices. You can do a page, all pages, or a range (which might be a good choice if you have, say, a few pages of text followed by lots of charts). Be sure to click the "Edit" button to see the other things you can do, like select English as the recognized language. The PDF Output Style choice you probably want is "Searchable Image (Exact)." As a rule, I wouldn't downsample the image, even though doing so might reduce the size of the resulting file.
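The page-range choice is easy to get wrong on a long document, so it can be worth writing the range down and checking it before you start. The sketch below is only an illustration in Python: `parse_page_range` is a hypothetical helper, and the "1-3,7" syntax is an assumption for the example, not the format that Acrobat's dialog accepts.

```python
def parse_page_range(spec, page_count):
    """Expand a spec such as "1-3,7" into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        # Reject ranges that fall outside the document or run backwards.
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"bad range {part!r} for a {page_count}-page document")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

For a 10-page document, `parse_page_range("1-3,7", 10)` returns `[1, 2, 3, 7]`, which would cover a few pages of text and skip the charts in between.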

Click OK, and the OCR engine will start up. If you are running a normal Windows box with moderate memory and processor speed, pretty much every other process will choke while Acrobat reads the document and converts the pictures of letters into text letters. If it's a heavily formatted, 1,000-page document, go have lunch or save it for the end of the day, because this is going to take a while. Adobe does provide a process window that keeps you apprised of events.

When it's done, don't forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file...)