How can I do text classification when the training data consists of .txt files in categorized folders?

I have 200 unique *.txt files in each folder:

Each file is the initial text of a lawsuit, separated into legal areas (folders) of public advocacy.

I would like to create training data to predict the legal area of new lawsuits.

Last year I tried PHP-ML, but it consumed too much memory, so I would like to migrate to Python.

I started the code, loading each text file into a JSON-like structure, but I don't know the next steps:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import os

# Raw string so the backslashes are not treated as escape sequences
path = r'C:\wamp64\www\machine_learning\webroot\iniciais'

files = {}

for directory in os.listdir(path):
    if os.path.isdir(os.path.join(path, directory)):
        files[directory] = []
        full_path = os.path.join(path, directory)
        for filename in os.listdir(full_path):
            full_filename = os.path.join(full_path, filename)
            if full_filename.endswith('.txt'):
                with open(full_filename, 'r', encoding='cp437') as f:
                    files[directory].append(f.read())

Thanks in advance

Topic text-classification python machine-learning

Category Data Science


Scikit-learn's sklearn.datasets.load_files is a function to "Load text files with categories as subfolder names".
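A minimal sketch of how that function is used. The tiny throwaway corpus built in a temp directory is only there so the example runs anywhere; point `load_files` at your own `iniciais` folder instead, and the category names (`civil`, `criminal`) are made up for illustration:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny stand-in corpus: one subfolder per category, .txt files inside.
base_dir = tempfile.mkdtemp()
for category, text in [("civil", "contract dispute"), ("criminal", "theft charge")]:
    os.makedirs(os.path.join(base_dir, category))
    with open(os.path.join(base_dir, category, "case1.txt"), "w", encoding="utf-8") as f:
        f.write(text)

# Folder names become the class labels; file contents become the documents.
dataset = load_files(base_dir, encoding="utf-8")
print(dataset.target_names)   # the category (folder) names
print(len(dataset.data))      # one entry per .txt file
```

`dataset.data` and `dataset.target` then plug directly into any scikit-learn vectorizer and classifier.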


Assuming that your folders are your classes, you can match each document with its corresponding label.
Then, for every document:
1. Normalize the text, i.e. remove stop words (unless they carry meaning) and stem and/or lemmatize it (unless that doesn't make sense for your corpus).
2. Vectorize the documents; you can choose TF-IDF, BOW, word embeddings, etc.
3. Depending on the representation, train an MLP (in the case of BOW/TF-IDF) or an LSTM (in the case of word embeddings).
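Steps 2 and 3 can be sketched like this with scikit-learn. The toy texts and labels stand in for the lawsuit files and their folder names; stop-word removal from step 1 can be folded in via the vectorizer's `stop_words` parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy corpus standing in for the lawsuit texts; labels are the folder names.
texts = ["contract breach damages", "contract payment dispute",
         "theft burglary charge", "assault criminal charge"]
labels = ["civil", "civil", "criminal", "criminal"]

# Step 2: vectorize with TF-IDF.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 3: train an MLP on the vectors.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, labels)

# For a new document, reuse the fitted vocabulary: transform(), not fit_transform().
new_doc = ["burglary and theft"]
print(clf.predict(vectorizer.transform(new_doc)))
```

The key point for prediction time is `vectorizer.transform`: it maps the new document onto the training vocabulary instead of building a new one.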

When you have a new document, you need to repeat the procedure using the vocabulary you built from the training set.

I had a similar use case, and BOW with a multi-layer perceptron was enough; accuracy was above 95%. However, the documents were quite different for each category, and I removed the most frequent words because they were too common.

Another solution is to perform topic modeling on the documents, bind those topics to the categories, and then train a simple classifier (an MLP or SVM will work).
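A rough sketch of that alternative, using LDA as the topic model (the toy texts and the choice of two topics are illustrative assumptions, not part of the original suggestion):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

texts = ["contract breach damages", "contract payment dispute",
         "theft burglary charge", "assault criminal charge"]
labels = ["civil", "civil", "criminal", "criminal"]

# Topic modeling: LDA turns each document into a topic-proportion vector.
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)

# Those low-dimensional topic vectors then feed a simple classifier.
clf = LogisticRegression().fit(topic_features, labels)
print(clf.predict(topic_features))
```

With real data you would pick `n_components` by inspecting topic coherence, and a linear SVM would slot in where `LogisticRegression` is used here.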
