Shortly after midnight on September 26, 1983, Lieutenant Colonel Stanislav Petrov stood between the analytical pronouncements of Cosmos 1382 in the Soviet Oko early-warning system and a global nuclear war that might well have annihilated civilization. Soviet leaders, fearing a preemptive strike, had built a system to warn of incoming attack: its infrared sensors watched for the hot exhaust plume of a launched missile. When sunlight reflecting off high-altitude clouds fooled it into reporting first one and then five incoming missiles, it was the duty officer, trained as an engineer, who recognized the alarm for the mistake it was. For that we can all be thankful.
Algorithms often process clean input data well enough to produce a useful answer with little or no rework. It’s when we follow our data blindly that we get into trouble. That is the very real danger of an algorithm-driven world. It shouldn’t completely dissuade us from using algorithms to help us make decisions, but given the potential for poor input data (whether from inconsistent measurement or from incomplete knowledge of the assumptions behind the input’s state), it’s best to keep a human in the loop in critical systems.
Recently, I was investigating death rates by state for different types of cancer. Two tables in Cancer Facts and Figures 2015 had just the right data to put together – new cases (p5) and deaths (p6). All I needed to do was cut and paste the data into Excel and make a new sheet to divide deaths by new cases. How hard could it be?
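The spreadsheet step itself is just division. A minimal sketch in Python of what that derived sheet computes (the state names and figures below are made-up placeholders, not numbers from the report):

```python
# Compute a crude deaths-to-new-cases ratio per state, mirroring the
# "divide deaths by new cases" column the Excel sheet would hold.
# The figures here are illustrative placeholders, not data from the report.
new_cases = {"StateA": 1000, "StateB": 2500}
deaths = {"StateA": 350, "StateB": 900}

death_rate = {state: deaths[state] / new_cases[state] for state in new_cases}
print(death_rate)  # {'StateA': 0.35, 'StateB': 0.36}
```

Trivial, as long as the two tables arrive in the spreadsheet intact. That was the hard part.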
Not so fast. Cut and paste from the PDF didn’t work. So, I printed a page of the original document, scanned it and saved the resulting image as a PNG file. Abbyy FineReader Sprint 8.0 converted this image into a nearly perfect Excel spreadsheet with only a couple of cells to fix. That result was good enough, but there was still the nagging issue of the printing and scanning. Why waste the paper and ink when you could just read it from the page electronically?
With a Mac, I used Grab to capture a TIFF image of the page. The attempt to convert this image with Abbyy created a nonsense spreadsheet. I saved the TIFF image in the PNG format and converted that instead. Now the rows and columns were captured properly, but roughly 10% of the 561 cells (51 rows × 11 columns) contained errors. This kicked off an analytical odyssey typical of those pursued by scientific investigators working in a new area with bad data. You know it’s going to end badly, but hope drives you forward.
The scanned version recognized earlier had been captured with a text setting that sharpened the image and boosted its contrast. I replicated this with the color adjustment tool in Preview (again, all with bundled Mac tools). With contrast increased to 75%, the error count fell to 40; at 100%, it fell further to 30. Sharpening at 100% was roughly equivalent. With a combination of contrast at 25% and sharpening at 75%, I was down to 11 errors.
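This kind of pre-processing can also be scripted rather than done slider by slider. A sketch assuming the Pillow imaging library is available; the enhancement factors are illustrative guesses, since Pillow's factors do not map one-to-one onto Preview's percentage sliders:

```python
# Sketch: approximate Preview's contrast/sharpen adjustments in code
# before handing an image to an OCR engine. Requires Pillow (pip install Pillow).
from PIL import Image, ImageEnhance

def preprocess_for_ocr(in_path, out_path, contrast=1.25, sharpness=1.75):
    img = Image.open(in_path).convert("L")  # grayscale often helps OCR
    # Factors > 1.0 increase the effect; 1.0 leaves the image unchanged.
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Sharpness(img).enhance(sharpness)
    img.save(out_path, dpi=(300, 300))  # embed 300 dpi metadata for the OCR engine
```

Scripting it would have made the trial-and-error over contrast and sharpen settings far faster to sweep.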
Still, this was disappointing compared with the nearly perfect result from the version that had been printed and scanned. Meanwhile, Abbyy had been calling attention to the problem with every attempt to read the file, persistently displaying a warning: “The image resolution is too small, please re-scan with the resolution greater than 150 dpi.” Now was the time to listen. The scanned image had been captured at 150 dpi; the TIFF generated by Grab was 72 dpi. The answer was clear.
A quick scan of the Internet suggested that best practice for scanning resolution is closer to 300 dpi (Abbyy, Fujitsu, Univ of Illinois). Could I get a 300 dpi image from the PDF file? It turns out that the Save As function in Preview can save a selected page of a PDF as a PNG image. An input box lets you enter any numeric value for the resolution without complaint. Nice, but no matter what you enter, the image is saved at 150 dpi.
Fortunately, that was good enough. I had a worksheet with death rates by state for selected cancers, and the exercise served its purpose as a lesson for this article. But that brings us back to the threat of nuclear annihilation instigated by a closed-loop monitoring system. When you build a mission-critical system, can you trust it enough to run without human intervention? What if there were a threshold that was good enough, and we were 99.999% sure that it told the truth as we knew it?
Let’s see… Maybe I want to scan tables from multiple PDF files. Could I use my method unattended? Look, there is a package called Xpdf with an extractor called pdftoppm that renders pages at 300 dpi. It runs from the command line. I could raise some venture money for a big data / analytics business, scour the Internet for PDF files, and start extracting. What could possibly go wrong?
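The unattended extraction step might start something like this; a sketch that builds the pdftoppm invocation and runs it only if the tool is installed (the PDF filename is a placeholder, not a real file):

```python
import os
import shutil
import subprocess

def pdftoppm_cmd(pdf_path, out_prefix, dpi=300):
    # pdftoppm (from Xpdf, and also shipped with Poppler) renders each page
    # of a PDF as a raster image; -r sets the output resolution in dpi.
    return ["pdftoppm", "-r", str(dpi), pdf_path, out_prefix]

cmd = pdftoppm_cmd("report.pdf", "page")  # "report.pdf" is a placeholder name
# Run only when the tool is actually installed and the input exists.
if shutil.which("pdftoppm") and os.path.exists("report.pdf"):
    subprocess.run(cmd, check=True)
```

From there the rendered pages would still have to survive OCR, which, as the 72 dpi detour showed, is exactly where the trouble starts.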