OCR / Text Recognition and Recovery Problem

I am working on a research project that deals with American military casualties during WWII. Specifically, I am attempting to construct a count of casualties for each service at the county level. There are two sources of data here, each presenting their own challenges.

1. Army and Air Force data. The National Archives hosts lists of Army and Air Force servicemen killed in action by state and county. There are .gif images of the report available online. Here is a sample for several counties in Texas.

I DO NOT need to recover the names or any other information. I simply need to count the number of names (each on its own line, and listed in groups of five) under each County. There are hundreds of these images (50 states - 30-100 for each state).

I have been unable to find an OCR program that can tackle this problem adequately. How would you suggest I approach this challenge? (I have some programming expertise in Python and Java, but would prefer to use any off-the-shelf solutions that may exist).

2. Navy and Marine Corps data. This data is organized differently. Each state has long lists of casualties with the address of their next of kin. Here is a sample for Texas again. For these images, I need to BOTH count the number of dead and recover their hometown, which is typically the last word in each entry. I can then match these hometowns to counties and merge with database 1.

Again, the usual OCR programs have proved inadequate. Any help on this (admittedly more difficult) problem would be very much appreciated.

Thank you in advance, experts!

Topic text-mining dataset data-cleaning processing

Category Data Science


Both data sets call for OCR followed by some post-processing, but with a more specialized tool than a generic low-quality or open-source OCR program. Essentially, the harder the problem, the more capable the tools needed to solve it.

There will be two major stages in this task: digitizing the data (image to text, i.e. OCR) and processing the data (performing the actual count). Look at them separately in order to select the best method for each stage.

The main challenges these images pose for generic OCR are:

a) The images have low resolution. For example, image #1 has a resolution of about 72 dpi. Text of this quality should be scanned at 300 to 400 dpi, but re-scanning or controlling the scan resolution is not an option here. One alternative is to clean and enlarge the images with image pre-processing tools. This is what a snippet of the original image #1 looks like after adaptive binarization, zoomed to 300%. Each character clearly has too few pixels, so characters can easily be misread.

[Image: snippet of image #1 after adaptive binarization, zoomed to 300%]

b) The GIF format used in #1 is not supported by many OCR applications. The images need to be batch-converted to a different format, such as PNG or TIFF.

c) Backgrounds and bleed-through (shadows of the text on the other side of the paper) are visible in these scans. Good binarization is needed to remove both without removing vital parts of the actual characters.
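The batch conversion mentioned in item b) could be scripted roughly as follows. This is only a sketch: it assumes ImageMagick's convert command is installed and on the PATH, and the function names and the directory names in the usage note are placeholders, not anything from the National Archives site.

```python
import subprocess
from pathlib import Path


def build_convert_commands(gif_paths, dst_dir, ext=".png"):
    """Return one `convert SRC DST` argument list per input GIF."""
    dst_dir = Path(dst_dir)
    commands = []
    for src in gif_paths:
        # Keep the original file name, change only the extension.
        target = dst_dir / (Path(src).stem + ext)
        commands.append(["convert", str(src), str(target)])
    return commands


def convert_all(src_dir, dst_dir):
    """Convert every *.gif in src_dir; raises if a conversion fails."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    gifs = sorted(Path(src_dir).glob("*.gif"))
    for argv in build_convert_commands(gifs, dst_dir):
        subprocess.run(argv, check=True)
```

Calling convert_all("scans", "png") would then convert every GIF in a (hypothetical) scans folder. Building the argument lists separately makes it easy to print and inspect the commands before running hundreds of them.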

After applying specific pre-processing for the items listed above and then using a high-quality OCR system, such as the www.ocr-it.com API, the best possible results can be achieved. The result is far from perfect, but it is as accurate as a modern OCR engine can get on these images.

[Image: OCR result for image #1]

Luckily for this project, the data only needs to be counted, so the OCR output contains everything the second stage needs for reliable post-processing. Unlike more basic OCR engines, the OCR provided by the www.ocr-it.com API, which I used to produce the above recognition, returns formatted text layout, preserving line breaks and the overall format structure. This makes text post-processing easier. The OCR-IT API is free to develop with and very inexpensive to use per page, so it may be an economical but powerful solution for this project.

A simple algorithm can then count the number of lines, producing the count needed for the research.
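As a rough illustration of such a line count, the sketch below groups the lines of one column of OCR text by county. It assumes (this is not confirmed by the OCR output shown here) that the text has already been reduced to a single column, that every name sits on its own line, and that a county heading is any line containing the word COUNTY. The function name is made up for this example.

```python
def count_lines_per_county(column_text):
    """Count non-blank lines under each county heading in one column."""
    counts = {}
    # Names that appear before the first heading belong to a county
    # that started on the previous page.
    current = "(continued)"
    for raw in column_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator between groups of five
        if "COUNTY" in line:
            # Keep only the text before the word COUNTY as the key,
            # which drops OCR noise after the heading.
            current = line[: line.index("COUNTY")].strip()
            counts.setdefault(current, 0)
        else:
            counts[current] = counts.get(current, 0) + 1
    return counts
```

Misread headings (for example COUNTV instead of COUNTY) would be counted as names under the previous county, so the headings found are worth checking against a list of known county names for the state.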

The above describes a two-stage approach: getting the best possible OCR result, and then using an applicable method to process the data for the required task.

But wait, there is more…

There is a second option to use an even more specialized OCR application called FlexiCapture with FlexiLayout technology. This powerful and intelligent data capture technology has built-in high-accuracy OCR, and it has a powerful rules and data analytics engine to perform very specialized user-defined chains of actions and tasks.

The implementation of this method using FlexiCapture with FlexiLayout takes the following logical steps.

First, full page OCR is performed and all objects are extracted, including characters, noise, black horizontal and vertical lines, white gaps, and objects (which could be pictures, logos, handwriting, etc.). This produces objects upon which we can apply our search criteria.

Next, the following constraints are applied as post-OCR analysis and search criteria: separate the image into three vertical columns and run the following logic per column; use each line start as an individual count; skip header, footer, and indented lines (county names); assume each name has at least three characters; find every name recursively from top to bottom in every column; and exclude previously found lines.

While the above logic sounds complex to set up, the actual setup takes just a few minutes and requires minimal work in the user interface (UI). No coding or programming is necessary. The following search elements and criteria have been created.

[Image: FlexiLayout search elements and criteria]

RepeatingGroup consisting of a CharacterString search object.

This setup produces the following search result for the first column of data:

[Image: search result for the first column of data]

As the last step, FlexiCapture is instructed to return the total number of elements found that fit our search criteria, effectively producing the necessary data for the research task.
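For readers without access to FlexiCapture, the same counting rules can be approximated in a few lines of Python. This is not how FlexiCapture works internally; it is a minimal sketch that assumes the page text has already been split into one string per column, that the OCR output preserves leading spaces, and that county names are the only indented lines. The function name and the min_letters parameter are invented for the example.

```python
def count_names(columns, min_letters=3):
    """Apply the skip rules to each column; return per-column counts."""
    per_column = []
    for text in columns:
        count = 0
        # Each line is visited exactly once, top to bottom, so a line
        # that was already counted cannot be found again.
        for raw in text.splitlines():
            if not raw.strip():
                continue  # blank line
            if raw[0].isspace() or "COUNTY" in raw:
                continue  # indented heading or county name
            letters = sum(ch.isalpha() for ch in raw)
            if letters < min_letters:
                continue  # too short to be a name: noise or page number
            count += 1
        per_column.append(count)
    return per_column
```

The total for the page is then sum(count_names(columns)). Header and footer lines would need an additional rule, for example dropping the first and last lines of each page, depending on what the OCR returns for them.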

There are other logic alternatives that can be set up in FlexiCapture, such as finding the number of white spaces between lines, or searching for the fixed-length, fixed-placement 3-letter combinations at the end of every column line.

In conclusion, there are several ways (which is always nice) to accomplish this task with relatively little effort and high quality, but success depends on the quality of the tools used and on knowing how to use them.

If you believe some of these tools and processes can be beneficial to your project, please contact me directly. I specialize in these workflows. Ilya @ WiseTREND. My company may be able to help with the setup or guidance. We have participated in various research initiatives, some through donations to a good cause.


The reason your OCR program is not recognizing anything is that the font size is very small. If you resize your image to increase its size, you will obtain better results. For example, using ImageMagick to apply a fixed threshold to remove the background on your first image and increase its size:

convert -density 500 -threshold 40% 29-1891a.gif -resize 250% output.tiff

After this, tesseract does a reasonable job:

tesseract output.tiff output test.config

I passed a configuration file, test.config, to tesseract in order to restrict the permitted characters. This way, we don't get stray Unicode characters in the text.

test.config

tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-

This is the result:

HOWARD COUNTY 10011607

NALL JOHN E 0-686196 1 LT F05 DALE MARVIN E 30037267 PVT DON SANSJNG DAVID 10505969 PFC
NIXSON EONOND R 10077072 CPL KIA DANIEL NILBORN N 30637064 PVT FOB SAUNDER6 NEWTON L 0-401709 CAPT
NOBLES STEPNEN E 30502791 PVT 0N0 DAVIS IR D 30433764 PVT 0N0 SNELTON CHARLES R 14001742 VT
OSBURN SILLY F 02008518 2 LT DNB BAVIS JINKIE L 18214208 5 SC KIA SHELTON CHARLIE 33423953 7335
PELTON J L 30105473 5 SC RNA 0 DEAN CHARLES 02066314 3 LT 0N0 SMERAN CARLTON A JR 0-510441 2 LT
GUEZADA FRANK 5 10014939 PFC KIA IDIXON JOHN N 6202114 PVT 00R SIBLEV JESSE 30037241 PFC
RAMIREZ JOSE T 30347914 PPC KIA DREADIN RAYMOND 6950964 PVT KIA SLATER JERONE E JR 0-605266 1 LT
OSE THO S 0 30345100 PPC NTA - DUNN GARLAND J 30037202 PVT 0N0 SMITH 00 SE 30509104 PPC
R035 ORREN C 16015376 PVT ONE 1 ELDER GERALD P 0-387706 2 LT KIA SNITN JASPER 7 33037355 7 SC
RUTLEDCE CARL R 38107116 PVTELPNAA JEARHER MARVIN R 30279097 CPL DON SMITN JERALO D 10005000 CPL
SCUDDAY BERNIE L 0-603906 1 LT KIA - FINDLEY PAUL A 50536010 PFC OMB SMITH NELDON A 5550735 PFC
SNITH ROBERT L 10015524 507 ONE- ELINC ROY T I 0-724730 CAPT F05 S THEY 4295431 5 5
SNITH TRAVIS L 30670530 PVT IDUN FORD HERRELL E 0-665672 CAPT E00 STEPN NSON JESSIE P 10006200 PVT
SNEED ROY A 13076927 S 80 ONE GARRETT TOMMY 30220694 PFC KIA SULLIVAN 31 01525125 2 L
SOUTN CARL 0 JR 30343379 PVT OMB CATLOR R T 20012665 SOT KIA SNINDELL VERL O 30037252 7005
STEVENS JANESCO 10015557 PVT DNR CLASSCOCN CNARLES J 6571000 5 SO ONO TAYLOR ARTHUR V 33531479 PFC
STENA D JO 0-723036 2 LT POL 7 00 A 10007027 PVT DOM ITOHN E N A JR 30035251 5 SO
UTTON ARVI 6360431 AV C 0N6 COSSETT JANES N 30436055 CPL KIAI ALKER J NEE H 30117597 SOT
IALBOTT CHARLES 9 38342083 PFC KIA GRESN J 18126923 5 SO KIA NALKER RALPH L 203126071PPC
TUCKER JAMES 30848763 PVT MIA CRIFFIS WILLIAM J 6270119 PFC 0N0 NALLACE JESSE A 30043713 T SC
-TUCKER STERLING P 33341143 RPCI KIA CROSS ELERY C 0-390713 3 LT DNR NARREN CHARLES D 30110253 5 50
NAOSNORTN PAUL P 30345079 5 SC ONE HAKNONBS ROBERT M 38433940 3 SC KIA MASHINCTON HENRY 30299100 PFC
NALNER JAN S H JR 0-696023 2 LT P0 HANDLSY JAMES J JR 0-417955 1 LT ONE EEMS NINPRED E 20017930 CPL
NEBB GLEN 30343104 PVT -K1 7 NANET FARRELL 8 30012597 SCT KIA NNITE DE NIS 10124415 PFC
NRAT JAMES H 30067743 TE05 KIA NARCIE PRANCNARD 6370053 1 SC NIA NMITE MARVIN J 30430974 PFC
NRIOHT NAILAND 0 30609570 PFC KIA NARKEY VEWCEN 30424190 PVT KIA NOODARD BILLY E 510217472 T 5
HARRIS PRBS 38111537 PFC OKIA NRICNT EILL 0-725355 CAPT
I HARRISO DUKE N JR 18055432 AV 0 ONE YOST TNURNAN R 6273153 PVT
HENDRIX JOHN N JR 10217563 PVT 0N8
HICKERSDN JACK 04431540 1 LT 9N8
HUDSPETH COUNTY 3536305354 A CHEW 55555 W
9 0776 A -
- JOHNSTON LONNIE O 0-754906 3 LT KIA H INSON COUNTY
ONES GEORBE N 01173315 1 LT ONE UTCH
JUMPER ISAAC H 30605960 PVT KIA I
A9515 HAN C -7 7 3 LT N5 LONG OSCAR D 30812573 5 SC KIA ALEXANDER BOYD A 0-690739 2 LT
EARDNERNE3SEPN H 3L5E3355 3 L7 335 LVTLE JOHN E 30433946 PVT KIA EALDNIN JANES H 0-407114 CAPT
NORALES ALFREDO L 18015539 3 CG P05 MACK HULET 00057345 1 LT NIA EICCERSTAPP C N 34530001 PC
54N1952 LOCAS N 39441440 PFC K14 NAJORS TRUSTT J 00410046 2 LT KIA BRITTON JAMES H 18036279 8 SC
ROBLES VICENTE 30570733 PFC 0N5 MASON DICE 30203437 CPL NIA EULLARD CAR 5 30607623 PVT
SANCNE2 ANGEL R 38310341 PVT ONE MASON HALTER P 0-671673 3 LT NIA C IN ROBE T E 33105799 TECS
VALLES RENICIO 0 30441430 PIC DON NASSEV JESSE D 30435424 PVT KIA COHA K LL03 N 10077256 PVT
NC CLENDON J H 30117530 PPC ONE EVANS L 0 37392406 TECS
NC HNORTER C R 30300391 PVT KIA FIRLEY MARVIN L 38572402 PVT
4 HILLNAN ODEAN R T-000345 FL 0 DNB IPIELDS JAKES 35711546 PVT
HIL10N JACKSON 7 13030591 PFC 933 ORADDY ROY L 39112920 PFC
HUNT COUNTY HOORE ARLON D 6295620 9 SC KIA GRANT 30572401 PVT
MORRISON DURHARD D 0-519411 3 LT KIA HANNA EVERETT T 10104243 S SC
- LNULLINS GERALD D 6295622 T 50 NIA HANSARD SAMUEL 2 04754619 1 LT
NEAL HOMER M 30609200 PVT KIA THARVEY JOHN B 0-602116 2 LT
ALANIS VICENTE 30894642 VT KIA MEAL RAVN 39 9 1759550 3 NECOAL MARTIN J 6396940 PFC
ALLEN TRUHAN L 0-123936 1 LT KIA NELSON 1 0 30431073 PPET ONE HILL RAYMONO 8 30050142 PVT
ALLEY NILBUR K 01703993 2 LT KIA NELSON TRAVIS C 38204649 PVT DOH HOP JACK H D 30342393 PFC
BENCH CLARENCE A 30003600 PFC KIA INEWLAND OTIS T T-000346 FL 0 F051 HDFF ROBERT C 30345133 T 50
BENNETT EVERETT N 0-692130 LT 0N5 NICNOLSON DALE- 10005042 PVT KIA JONES JANES 0 30343505 PVT
I I
A NIXON LOYD 5041674 TECS KIA KAPPELMN MC 0-360687 CAPT
BLACKHELL E C 38357321 307 KIA PARKER N 33130N V 52959524 3 L7 KIA KECANS TIM JR 38342898 FC
ERITT BASIL JR 36002307 PFC K14 PATTERSON THOMAS H 18136913 SCT 9N5 KENNINER EARL 0 30401408 SGT
BROHN SHOE 8 38634396 PFC KIA PERRI JA 5 0-562333 1 LT KIA LANTRON EDWARD L 19190302 5 SO
BURNS VIRCIL P 38035701 PRC KIA PETTICREH FRED 0 16215700 S 50 NIA LESNER ROLLIE H 30050735 AV C
BUTLER GEORGE A 37259433 1 SC KIA PHILLIPS C L JR 0-431139 CAPT KIA NC CARTY TOURNAN P 30711042 PFC
CAMERON VANDSLL C 30634570 PPCI KIA PILCRTN CLYDE 30115026 V NIA NC GLENDON JACK 30330395 PFC
CARTER LEO RD K 30119959 AV 0 ONE PO D EUCENE N 4 30431074 PRC KIA NC NINNET K E 30340151 PVT
CARTER OILLIAH F 38685535 PFC 0N8 PRESLEY NILLIAN H 30037547 PVT 0N5 NC QUEEN JAMES Y 02055059 2 LT
CREEK LOTD 6379416 A SC DNB PRICE PREDRICK P 30409331 CPL 100W NOTEN GEORGE N 30304695 PFC
CLARK JOHN 18317907 CPL KIA PURCELL SAMUEL N 30017905 PVT 0N5 PANNELL NOLENIA 6227623 1 SC
COLLINS RAYMOND N 30137040 PFC OOH RAILIPF NARDEN A 30037006 CPL 00H PIERCE FELIX 0 30335390 5 SC
CREAKER SAH BL N 18006101 PVT KIA HAYNES NILLIAN T 7-954739 PL 0 9N5 PIETZSCH JAI E 0-426961 2 LT
CRIDER NARLAN 0 10154794 SCT KIA REED JDRN L 30529414 PVT POL RTER NALLACE N JR 38193560 CPL
DALE CHARLEY 8 I 38049500 3 SC KIA ROI NILLIS 5 01393956 1 LT KIA PRESCOTT DAVID L JR 0-707191 2 LT
05 231



D KZXXN D
NNNNH 2
XXLA-A- Q

2
33

UR R
2055
35

As you can see, there are many mistakes, but the results are at least reasonable. In any case, you don't want to OCR the whole page at once: even with perfect recognition, some of your lines would be placed under the wrong county, because tesseract reads your document as one big paragraph. I recommend using the vertical lines to segment your image into three parts and then preprocessing each of them. You can even try concatenating the parts vertically and performing OCR on them as a single page. This also applies to the second image.
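One way to find the vertical lines is a projection profile: a pixel column that is black over almost the full page height is treated as a ruling line, and the page is cut between those lines. The sketch below is only an illustration in plain Python. It assumes the thresholded image has already been loaded into a list of rows with 1 for black and 0 for white (loading the pixels with an imaging library is left out), and that the scan is not noticeably skewed. The function names and the min_fill threshold are my own choices.

```python
def find_vertical_rules(pixels, min_fill=0.8):
    """Return (first_x, last_x) runs of columns that are mostly black."""
    height = len(pixels)
    width = len(pixels[0])
    runs = []
    for x in range(width):
        black = sum(row[x] for row in pixels)
        if black < min_fill * height:
            continue  # ordinary text column, not a ruling line
        if runs and x == runs[-1][1] + 1:
            runs[-1][1] = x  # extend the current run
        else:
            runs.append([x, x])
    return [tuple(run) for run in runs]


def split_columns(pixels, min_fill=0.8):
    """Cut the image into the parts that lie between the ruling lines."""
    width = len(pixels[0])
    bounds = []
    start = 0
    for first_x, last_x in find_vertical_rules(pixels, min_fill):
        if first_x > start:
            bounds.append((start, first_x))
        start = last_x + 1
    if start < width:
        bounds.append((start, width))
    return [[row[x0:x1] for row in pixels] for x0, x1 in bounds]
```

On a page with two ruling lines this returns three sub-images, which can be saved and passed to tesseract one at a time. If the lines are faint or broken in some scans, min_fill would need to be lowered.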

By the way, the resolution of your image is not great, so if you can get a better image, that is going to make a big difference.


In what way have the usual OCR programs proved inadequate? Do you have some example output that you find you can't work with?

I can see how the columns complicate things.

I'd say for data set 1: OCR the images, then read the files line by line and match on, for instance, a sequence of at least five digits. That gives you 0, 1, 2 or 3 matches per row. You may miss a couple when the OCR recognizes a digit as a letter, for instance, but I expect this to work reasonably well. How precise do you have to be?
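A minimal sketch of that idea, using lines from the tesseract output posted in the other answer. The cap of three matches per row is my assumption, based on the three-column page layout; the function names are made up.

```python
import re

# A serial number is taken to be any run of five or more digits.
SERIAL = re.compile(r"\d{5,}")


def count_serials(line, max_per_row=3):
    """Count serial-number-like digit runs in one OCR line."""
    return min(len(SERIAL.findall(line)), max_per_row)


def count_page(ocr_text):
    """Sum the per-line counts over a whole page of OCR output."""
    return sum(count_serials(line) for line in ocr_text.splitlines())
```

Noise can still produce false positives. For example, the garbled heading line "HUDSPETH COUNTY 3536305354 A CHEW 55555 W" contains two digit runs that are not serial numbers, so skipping lines that contain COUNTY may be worthwhile.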

Data set 2 seems more difficult. Maybe counting can be done by matching on capitalized sequences followed by a comma. The place name is very tricky. Once again, do you have some OCR output we can look at?
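For the place names, one option is to take the last word of each entry, as the question suggests, and match it fuzzily against a list of known towns so that small OCR errors are tolerated. The sketch below uses difflib from the Python standard library. Everything about the input is an assumption: the entries are taken to be already separated into one string each, the town-to-county table would have to come from a gazetteer, and the sample entries in the usage note are invented, since no OCR output for data set 2 has been posted.

```python
import difflib
from collections import Counter


def match_hometown(entry, town_to_county, cutoff=0.8):
    """Return (town, county) for the last word of an entry, or None."""
    words = entry.split()
    if not words:
        return None
    candidate = words[-1].strip(".,;:").upper()
    # get_close_matches returns the best candidates whose similarity
    # ratio is at least `cutoff`, best match first.
    hits = difflib.get_close_matches(
        candidate, list(town_to_county), n=1, cutoff=cutoff
    )
    if not hits:
        return None
    return hits[0], town_to_county[hits[0]]


def tally_counties(entries, town_to_county):
    """Count entries per county; unmatched entries are kept visible."""
    tally = Counter()
    for entry in entries:
        hit = match_hometown(entry, town_to_county)
        tally[hit[1] if hit else "UNMATCHED"] += 1
    return tally
```

With a table such as {"HOUSTON": "Harris", "AUSTIN": "Travis"}, an invented entry ending in "HOUST0N" (zero instead of the letter O) is still matched to Houston. Towns with names of more than one word, such as San Antonio, are not handled by the last-word rule and would need to be checked separately, as would town names that occur in more than one county.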
