How to segment old digitized newspapers into articles

I'm working on a large corpus of French daily newspapers from the 19th century that have been digitized, where the data are raw OCR text files (one text file per day). In terms of size, one year of issues is around 350,000 words.

What I'm trying to achieve is to detect the different articles that make up a newspaper issue. An article can be two or three lines long or much longer, there is no systematic typographic division between articles, and each file contains a lot of OCR errors. I should also mention that I don't have access to other OCR data such as document layout in XML.

I've tried the TextTiling algorithm (the NLTK implementation), but the results were not really conclusive.
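For reference, this is roughly how I ran it (a minimal sketch, not tuned: the file name, the 10-line pseudo-paragraph chunking and the w/k values are just placeholders). As far as I remember, the NLTK tokenizer snaps its boundaries to blank-line paragraph breaks, which my raw OCR dumps mostly lack, so I inserted artificial breaks first:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")  # needed for the French stopword list below

def add_pseudo_paragraphs(raw_text, lines_per_block=10):
    """Re-chunk raw OCR lines into blank-line-separated blocks (block size is arbitrary)."""
    lines = [l for l in raw_text.splitlines() if l.strip()]
    blocks = [" ".join(lines[i:i + lines_per_block])
              for i in range(0, len(lines), lines_per_block)]
    return "\n\n".join(blocks)

with open("newspaper_issue.txt", encoding="utf-8") as f:  # hypothetical file name
    raw = f.read()

tt = TextTilingTokenizer(
    w=20,                                  # pseudo-sentence size (tunable)
    k=10,                                  # block comparison size (tunable)
    stopwords=stopwords.words("french"),
)
segments = tt.tokenize(add_pseudo_paragraphs(raw))
print(f"{len(segments)} segments found")
```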

Before diving deeper by myself, I was wondering whether some of you would have hints about a task like this one: train a machine learning model, try other algorithms?

Topic: ocr, text-mining, nlp

Category: Data Science


As far as I know, topic segmentation is not a particularly easy task even with clean data, so it's likely to be challenging with noisy OCR of old French.

It's not exactly the same problem, so I'm not sure how useful this is, but you might want to look into using stylistic features to help the model detect the changes between articles; a rough sketch of what I mean follows below. There has been a fair amount of work on style change detection as part of the PAN shared task series (the task has run for three years, and results and papers from previous editions are available).
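Just to illustrate the idea (this is not PAN code, only a sketch): compute a few cheap stylistic features over sliding windows of lines and flag positions where the window before and the window after differ sharply. The features, window size, distance measure and threshold here are all placeholders you would need to tune or replace with a learned model:

```python
import re
import string
import numpy as np

def style_features(text):
    """A few cheap stylistic indicators for a block of text (placeholder feature set)."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return np.array([
        sum(len(w) for w in words) / n_words,                 # mean word length
        len(set(w.lower() for w in words)) / n_words,         # type-token ratio
        sum(c in string.punctuation for c in text) / chars,   # punctuation rate
        sum(c.isupper() for c in text) / chars,               # uppercase rate
        sum(c.isdigit() for c in text) / chars,               # digit rate
    ])

def candidate_boundaries(lines, window=10, threshold=1.5):
    """Return line indices where the style before and after the line differs strongly."""
    feats = []
    for i in range(window, len(lines) - window):
        before = style_features(" ".join(lines[i - window:i]))
        after = style_features(" ".join(lines[i:i + window]))
        feats.append((i, np.linalg.norm(before - after)))
    if not feats:
        return []
    scores = np.array([s for _, s in feats])
    # z-score the distances and keep clear outliers (threshold is arbitrary)
    z = (scores - scores.mean()) / (scores.std() + 1e-9)
    return [i for (i, _), zi in zip(feats, z) if zi > threshold]

with open("newspaper_issue.txt", encoding="utf-8") as f:  # hypothetical file name
    lines = [l.rstrip("\n") for l in f if l.strip()]

for idx in candidate_boundaries(lines):
    print(idx, lines[idx][:60])
```

In practice you would probably want to combine such features with lexical cues (the TextTiling-style vocabulary shift) rather than rely on either alone.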

Hope this helps.
