How to segment old digitized newspapers into articles
I'm working on a large corpus of French daily newspapers from the 19th century that have been digitized; the data come as raw OCR text files (one text file per day). In terms of size, one year of issues is around 350,000 words.
What I'm trying to achieve is to detect the different articles that make up a newspaper issue. An article can be two or three lines long or much longer, there is no systematic typographic division between articles, and each file contains a lot of OCR errors. I should also mention that I don't have access to any other OCR data, such as document layout in XML.
I've tried the TextTiling algorithm (the NLTK implementation), but the results were not really conclusive.
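For reference, here is roughly how I called it, a minimal sketch using NLTK's `TextTilingTokenizer` with French stopwords; the file name and the `w`/`k` values are just examples, not anything tuned:

    from nltk.corpus import stopwords
    from nltk.tokenize.texttiling import TextTilingTokenizer

    # Hypothetical path to one day's OCR output
    with open("issue_1860-01-15.txt", encoding="utf-8") as f:
        raw = f.read()

    # TextTiling expects paragraph breaks (blank lines) in the input,
    # which raw OCR output often lacks or gets wrong.
    tokenizer = TextTilingTokenizer(w=20, k=10,
                                    stopwords=stopwords.words("french"))
    segments = tokenizer.tokenize(raw)
    print(len(segments))

One possible reason for the poor results is that TextTiling relies on lexical cohesion between blocks of pseudo-sentences, and OCR errors plus very short articles may break that assumption.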
Before diving deeper on my own, I was wondering whether some of you might have hints about a task like this one: train a machine learning model, try other algorithms?
Topic ocr text-mining nlp
Category Data Science