Best practice for the number of manual annotations needed to build a criminal-detection model from news articles?

We have a corpus of 7 million news articles that we want to classify as crime or non-crime, and then identify the criminals mentioned by applying NER after manually annotating criminals and crimes. For a model that identifies criminals, how many annotated articles must we train/build our model on? Is there an industry best practice for this count? Is there a better way to arrive at the size of the training (annotated) dataset than random guessing? Are there any best-practice resources that anyone can point to? Thanks in advance!

Tags: annotation, data-science-model, nlp, machine-learning

Category: Data Science


It always depends on the specific problem. The amount of data needed depends on a number of factors, including (but not limited to) the complexity of the problem, the number of features, the quality of the training data, and the ratio of the training classes (i.e. class imbalance).

If you have no idea where to start, sometimes the best thing is simply to experiment. For example, I would build a model with as much data as is both readily available and reasonably easy to handle in a cross-validation framework. From there, you can test the effect of increasing the training-set size versus tuning the hyperparameters. While you're not necessarily trying to find the point of diminishing returns (it's usually best to err on the side of too much data rather than not enough), this kind of exercise at least gives you a general idea of the complexity of your problem.
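As a rough illustration, here is a minimal learning-curve sketch with scikit-learn. The variable names `texts` and `labels`, the TF-IDF/logistic-regression pipeline, and the train-size grid are assumptions made for the sake of example, not anything prescribed by the question.

```python
# Sketch of a learning-curve experiment to gauge how much annotated data helps.
# Assumptions: annotated articles are in `texts` (list of str) and binary labels
# in `labels` (1 = crime, 0 = non-crime); these names are placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

model = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Train on 10%, 25%, 50%, 75%, and 100% of whatever is annotated so far and
# watch how the cross-validated F1 score changes with training-set size.
sizes, train_scores, val_scores = learning_curve(
    model, texts, labels,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
    cv=5, scoring="f1",
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>6} annotated articles -> mean CV F1 = {score:.3f}")
```

If the validation score is still climbing steeply at the largest training size, that is a signal that more annotation is likely to pay off; if it has flattened, your effort is probably better spent on features or model choice.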

A lot of people have their own rules of thumb or metrics, and there is plenty of reading material out there. For example, there is the Vapnik–Chervonenkis dimension and the one-in-ten rule (though I think the computer-vision community's rule of thumb is closer to one in a thousand, which again illustrates how variable these rules of thumb are). A back-of-the-envelope estimate based on the one-in-ten rule is sketched below.
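The numbers in this sketch (feature count, events per feature, positive-class rate) are illustrative assumptions, not figures from the question; the point is only to show how such a rule of thumb turns into a concrete annotation target.

```python
# Back-of-the-envelope estimate using the one-in-ten rule:
# roughly ten positive examples (crime articles) per predictor/feature.
# All numbers below are illustrative assumptions, not measurements.
n_features = 300          # e.g. after aggressive feature selection
events_per_feature = 10   # the "one in ten" rule of thumb
positive_rate = 0.05      # assume ~5% of articles are crime-related

required_positives = n_features * events_per_feature       # 3,000 crime articles
required_annotations = required_positives / positive_rate  # ~60,000 labelled articles

print(f"~{required_positives:,} positive examples, "
      f"~{required_annotations:,.0f} total annotations to label")
```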
