How do I train a text classifier when the training data is *.txt files stored in categorized folders?
I have 200 unique *.txt files in each folder.
Each file is the initial text of a lawsuit, and the folders separate the lawsuits by legal area of public advocacy.
I would like to build training data from these files so I can predict the legal area of new lawsuits.
Last year I tried PHP-ML, but it consumed too much memory, so I would like to migrate to Python.
I started the code below, which loads each text file into a JSON-like
structure, but I don't know the next steps:
import os

import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Raw string so the Windows backslashes are not treated as escapes
path = r'C:\wamp64\www\machine_learning\webroot\iniciais'

files = {}
for directory in os.listdir(path):
    full_path = os.path.join(path, directory)
    if os.path.isdir(full_path):
        files[directory] = []
        for filename in os.listdir(full_path):
            full_filename = os.path.join(full_path, filename)
            if full_filename.endswith('.txt'):
                with open(full_filename, 'r', encoding='cp437') as f:
                    # read() keeps each document as one string
                    files[directory].append(f.read())
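One possible continuation (a sketch, not the only approach): flatten the `files` dict into parallel text/label lists, then fit a scikit-learn pipeline of a TfidfVectorizer plus a linear classifier, holding out part of the data to measure accuracy. The variable names and the toy stand-in data below are my own, used so the snippet runs on its own; in practice `files` would be the dict built above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Stand-in for the dict built above: {legal_area: [document, ...]}
files = {
    'consumer': ['refund contract product defect'] * 5,
    'housing':  ['eviction rent landlord lease'] * 5,
}

# Flatten folders-as-labels into parallel lists
texts, labels = [], []
for area, docs in files.items():
    texts.extend(docs)
    labels.extend([area] * len(docs))

# Hold out 20% of the documents to estimate accuracy
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

model = Pipeline([
    ('tfidf', TfidfVectorizer()),                # bag-of-words weighted by TF-IDF
    ('clf', LogisticRegression(max_iter=1000)),  # linear classifier
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))               # accuracy on held-out files

# Predict the legal area of a new lawsuit text
print(model.predict(['landlord lease dispute'])[0])
```

Incidentally, scikit-learn also ships `sklearn.datasets.load_files`, which reads exactly this folders-as-labels layout (one subfolder per category) and returns the documents and targets directly, so the manual loading loop can be replaced entirely if preferred.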
Thanks in advance
Topic text-classification python machine-learning
Category Data Science