Like most academics, I rely increasingly on online resources and there are now millions of pages of old journals, magazines, newspapers, etc. accessible online. In most cases, the files are stored in PDF (Adobe Acrobat) format, and optical character recognition (OCR) has been used to turn the scanned images into searchable text. Much time is saved, but a couple of problems plague me (and, I assume, many other users of these sites). Firstly, the OCR text is seldom proof-read and can be of very low quality (as a rule with OCR, the lower the quality of the original image, the lower the accuracy of the final text – and a lot of the resources I use have texts scanned from old printed materials or from microfilms of old printed materials). Secondly, in many cases when you download a PDF file to read or refer to later, the OCR text isn’t part of the file – all you get is the image.
I use Adobe Acrobat Pro to create, edit and read PDF files (I know it’s expensive, but the education pricing makes it affordable). It has its own built-in OCR capability, but it doesn’t seem very accurate and there’s no easy way to edit the images that comprise the PDF (for example, to improve their contrast and thus the accuracy of the OCR). So, recently I’ve gone back to a program that I’ve not used in years, ABBYY FineReader 12 (again, not cheap, but having bought it in the past, I was able to buy an upgrade at the educational price).
As with most decent software, there’s a free demo version so that you can test it (and it runs on Windows and Mac). After playing with it for a few days, I’ve found it’s extremely accurate, even with old, fuzzy texts, and it has several nice features that Acrobat Pro lacks. Firstly, it has a built-in image editor, so if your scanned image has dark edges, or other marks that confuse the OCR, you can delete them before you start. You can also eliminate big white borders (a pain when you’re trying to view your PDF at “page width” and want the text nice and readable). Even more usefully, FineReader lets you adjust settings in great detail; you can, for example, boost the contrast in the image, edit the levels (to eliminate a grey background), or deskew the image – and then choose whether to apply the edit to all pages, the current page, or a selection. FineReader can also pre-process all or some of the images automatically and – unlike Acrobat Pro – you have quite a lot of control over what it does when pre-processing (you can select options such as “reduce noise” or “whiten background”, for example). The result, in my tests so far, is close to 100% accuracy for most of the PDFs I’ve converted (and of course you have the option to verify and correct the text before saving, if you want to). And it’s fast: on my PC, a 20-page PDF file is converted in under 10 seconds.
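For anyone curious what “editing the levels” amounts to, here is a rough sketch in Python. It is not how FineReader does it internally (I have no idea what it does under the hood); it simply assumes a greyscale page stored as rows of 8-bit values and stretches them so that a chosen “black point” becomes pure black and a chosen “white point” becomes pure white. The function name `adjust_levels`, the sample `page` and the two thresholds are all made up for illustration.

```python
def adjust_levels(pixels, black_point, white_point):
    """Linearly stretch 8-bit grey values.

    Values at or below black_point become 0, values at or above
    white_point become 255, and everything in between is scaled.
    """
    if not 0 <= black_point < white_point <= 255:
        raise ValueError("need 0 <= black_point < white_point <= 255")
    span = white_point - black_point
    result = []
    for row in pixels:
        new_row = []
        for value in row:
            scaled = (value - black_point) * 255 // span
            # Clamp anything pushed outside the 0-255 range.
            new_row.append(max(0, min(255, scaled)))
        result.append(new_row)
    return result


# Hypothetical scrap of a scanned page: grey paper (about 200)
# with some dark ink (about 55-60).
page = [
    [200, 205, 60, 198],
    [202, 55, 58, 207],
]

cleaned = adjust_levels(page, black_point=50, white_point=190)
for row in cleaned:
    print(row)
```

With the white point set below the grey of the paper, the background is pushed to pure white while the ink stays dark, which is roughly the effect you want before handing a page to any OCR engine.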
And, of course, FineReader can do all the normal stuff an OCR package does, like scan pages directly to the format of your choice (Word, Excel, PDF, etc.).
So, if you need to do this kind of thing with documents, I recommend this very highly; over the course of the book I’m currently researching, I expect it to save me hundreds of hours.
UPDATE [15 September 2016]
I recently hit a problem with FineReader: whenever I tried to start the program I got an error that said “ABBYY licensing service is unavailable. The RPC server is unavailable.” I contacted ABBYY’s online help and after a couple of very quick emails they were able to solve the problem (you open Windows Services, select FineReader, and change the startup type from “Automatic” to “Automatic (delayed start)”). I was very impressed with the speed and efficiency with which ABBYY’s technicians resolved the issue. If you’re facing the same problem, there is more information on their website.