How can I do text classification when the training data consists of .txt files in categorized folders?

I have 200 unique *.txt files in each folder:

Each file is the initial text of a lawsuit, separated into legal areas (folders) of public advocacy.

I would like to create training data to predict the legal area of new lawsuits.

Last year I tried PHP-ML, but it consumed too much memory, so I would like to migrate to Python.

I started the code, loading each text file into a JSON-like structure, but I don't know the next steps:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import os

# Raw string so the backslashes are not treated as escape sequences
path = r'C:\wamp64\www\machine_learning\webroot\iniciais'

files = {}

for directory in os.listdir(path):
    if os.path.isdir(os.path.join(path, directory)):
        files[directory] = []
        full_path = os.path.join(path, directory)
        for filename in os.listdir(full_path):
            full_filename = os.path.join(full_path, filename)
            if full_filename.endswith('.txt'):
                with open(full_filename, 'r', encoding='cp437') as f:
                    files[directory].append(f.read())

Thanks in advance

Topic text-classification python machine-learning

Category Data Science


Scikit-learn's sklearn.datasets.load_files is a function to "Load text files with categories as subfolder names".
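A minimal sketch of how that function is used. The tiny throwaway corpus built in a temp directory is only there so the example runs anywhere; point `load_files` at your own `iniciais` folder instead, and the category names (`civil`, `criminal`) are made up for illustration:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny stand-in corpus: one subfolder per category, .txt files inside.
base_dir = tempfile.mkdtemp()
for category, text in [("civil", "contract dispute"), ("criminal", "theft charge")]:
    os.makedirs(os.path.join(base_dir, category))
    with open(os.path.join(base_dir, category, "case1.txt"), "w", encoding="utf-8") as f:
        f.write(text)

# Folder names become the class labels; file contents become the documents.
dataset = load_files(base_dir, encoding="utf-8")
print(dataset.target_names)   # the category (folder) names
print(len(dataset.data))      # one entry per .txt file
```

`dataset.data` and `dataset.target` then plug directly into any scikit-learn vectorizer and classifier.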


Assuming that your folders are your classes, you can match each document with its corresponding label.
Then, for every document:
1. Normalize the text, i.e. remove stop words (unless they carry meaning) and stem and/or lemmatize it (unless that doesn't make sense for your corpus).
2. Vectorize the documents; you can choose TF-IDF, BOW, word embeddings, etc.
3. Depending on the representation, train an MLP (in the case of BOW/TF-IDF) or an LSTM (in the case of word embeddings).
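Steps 2 and 3 can be sketched like this with scikit-learn. The toy texts and labels stand in for the lawsuit files and their folder names; stop-word removal from step 1 can be folded in via the vectorizer's `stop_words` parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Toy corpus standing in for the lawsuit texts; labels are the folder names.
texts = ["contract breach damages", "contract payment dispute",
         "theft burglary charge", "assault criminal charge"]
labels = ["civil", "civil", "criminal", "criminal"]

# Step 2: vectorize with TF-IDF.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Step 3: train an MLP on the vectors.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, labels)

# For a new document, reuse the fitted vocabulary: transform(), not fit_transform().
new_doc = ["burglary and theft"]
print(clf.predict(vectorizer.transform(new_doc)))
```

The key point for prediction time is `vectorizer.transform`: it maps the new document onto the training vocabulary instead of building a new one.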

When you have a new document, you need to repeat the procedure using the vocabulary you built from the training set.

I had a similar use case, and BOW with a multi-layer perceptron was enough; accuracy was above 95%. However, the documents were quite different for each category, and I removed the most frequent words because they were too common.

Another solution is to perform topic modeling on the documents, bind those topics to the categories, and then train a simple classifier (an MLP or SVM will work).
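A rough sketch of that alternative, using LDA as the topic model (the toy texts and the choice of two topics are illustrative assumptions, not part of the original suggestion):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

texts = ["contract breach damages", "contract payment dispute",
         "theft burglary charge", "assault criminal charge"]
labels = ["civil", "civil", "criminal", "criminal"]

# Topic modeling: LDA turns each document into a topic-proportion vector.
counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)

# Those low-dimensional topic vectors then feed a simple classifier.
clf = LogisticRegression().fit(topic_features, labels)
print(clf.predict(topic_features))
```

With real data you would pick `n_components` by inspecting topic coherence, and a linear SVM would slot in where `LogisticRegression` is used here.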
