How to use ontologies for text classification?

I am new to machine learning and want to classify sentences using ontologies (taxonomies/knowledge graphs) and supervised learning methods (I have an annotated training dataset).

My question is: how do I use the ontology for this task? Is the following method correct?

I will first perform tokenization, stemming, and stop word removal (pre-processing). Then I will look up each term in the ontology and, for every term found, add its related hierarchy to an array or vector for each document. Finally, I will train supervised classifiers on those vectors.

Please let me know if this method is correct or if there are steps that I am missing here.

Thanks! :)

Topic knowledge-graph text-classification classification

Category Data Science


Since no one answered my question, I'll describe what I found here for others.

As described in the question, you pre-process the text, look up its terms in the ontology, and replace each matched term with its whole branch from the ontology. After that, you train your model.
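To make the expansion step concrete, here is a minimal sketch in plain Python. The `PARENT` dictionary is a made-up toy ontology (child term mapped to parent term), and `ancestors` and `expand` are hypothetical helper names; a real ontology would normally be queried through SPARQL or a similar tool rather than a hand-written dictionary. In this sketch each matched token is kept and followed by its ancestors, which is one possible reading of "replace with the whole branch".

```python
# Toy ontology: each term maps to its parent (hypothetical data, not a real ontology).
PARENT = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "sparrow": "bird",
    "bird": "animal",
}


def ancestors(term, parent=PARENT):
    """Return the ancestors of `term`, nearest first (empty if term is unknown)."""
    chain = []
    seen = {term}
    while term in parent:
        term = parent[term]
        if term in seen:  # guard against cycles in a malformed ontology
            break
        seen.add(term)
        chain.append(term)
    return chain


def expand(tokens, parent=PARENT):
    """Keep every token and append the ancestors of those found in the ontology."""
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(ancestors(tok, parent))
    return expanded


tokens = "my poodle chased a sparrow".split()
print(expand(tokens))
# ['my', 'poodle', 'dog', 'mammal', 'animal', 'chased', 'a', 'sparrow', 'bird', 'animal']
```

The expanded token list can then be joined back into a string or counted directly to build the feature vector for each document.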

Notes:

  • Since you have the ontology with all the relevant terms, you can leave the stopwords in while pre-processing. If you do remove them, stopword removal can be performed via lists or using sklearn or spaCy. In addition, spaCy performs lemmatization, which can also be done via ontologies (more accurate, but probably slower).

  • Use SPARQL for working with ontologies and pattern matching. spaCy does pattern matching too, as do Lucene and several other tools; however, I haven't used them with ontologies. Similarity algorithms can also be used.

  • I found Random Forests to work best among traditional supervised methods, though neural networks might produce better results. Keep in mind that you can also classify text without ontologies, using alternative methods such as word embeddings and topic models. spaCy also seems to be a good tool for text classification, and it has nice documentation.
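As a rough illustration of the training step, the sketch below feeds ontology-expanded sentences into a bag-of-words vectorizer and a Random Forest using scikit-learn. The sentences, the labels `"pets"` and `"wildlife"`, and the hyperparameters are all made up for the example; they are assumptions and not the data or settings I actually used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical sentences after ontology expansion: each matched term is
# followed by its ancestors from the ontology.
docs = [
    "poodle dog mammal animal barked",
    "cat mammal animal slept",
    "sparrow bird animal sang",
    "eagle bird animal soared",
]
labels = ["pets", "pets", "wildlife", "wildlife"]  # made-up annotations

# Bag-of-words counts feed a Random Forest; stopwords are left in, and
# random_state is fixed only to make repeated runs comparable.
model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(docs, labels)

# Unseen sentences must be expanded with the same ontology before prediction.
predictions = model.predict(["dog mammal animal ran", "owl bird animal hooted"])
print(predictions)
```

Because the ancestor terms (`mammal`, `bird`) appear in both training and unseen sentences, the classifier can generalize to terms it never saw, such as `owl`. With a real dataset you would evaluate on a held-out split rather than on a handful of examples.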
