How to split natural language script into segments?

I have a bunch of .txt and .srt files extracted from a MOOC website; they are the transcripts of the videos. I would like to segment each transcript into parts such that each part falls into one of the following categories:
MainConceptDescription -> explanation of the main concept(s)
SubConceptDescription -> explanation of a subconcept related to the main concept
Methodology / Technique -> what one should do to achieve something
Summary -> summary of the discussed material or of the whole course
Application -> practical advice for applying the concept
Example -> an example of the concept

Now, for the first two I think I should try applying Latent Dirichlet Allocation (LDA) to extract the topics. Another idea was to take words from the resource name and search for them in the text. A third idea was to read some of the resources, manually build a dictionary for each category, and then create regex patterns and search for them in the text.

But the latter seems too crude, so now I am not sure what to do. I've seen similar work on research papers; however, research papers have their own specific expressions that are more or less constant across most papers, and that is not the case with my video transcripts, which are 100% spoken natural language. Do you have any ideas on how I can approach this? I do have a list of keywords that roughly signals whether an example follows or a concept is being explained, but I am collecting it manually, which is definitely not what I want to do for 563 files that may well become many more.
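To make the dictionary/regex idea concrete, here is roughly what I have in mind as a baseline (the cue phrases below are made-up placeholders, not my actual keyword list):

```python
import re

# Hypothetical cue phrases per category -- in practice these would
# come from the manually collected keyword list mentioned above.
CUES = {
    "Example": [r"\bfor example\b", r"\bfor instance\b", r"\blet's look at\b"],
    "Summary": [r"\bto sum up\b", r"\bin summary\b", r"\bto recap\b"],
    "Methodology": [r"\bin order to\b", r"\bthe next step\b", r"\bfirst,\s"],
}

PATTERNS = {cat: re.compile("|".join(pats), re.IGNORECASE)
            for cat, pats in CUES.items()}

def label_sentence(sentence):
    """Return the first category whose cue pattern matches, else None."""
    for cat, pat in PATTERNS.items():
        if pat.search(sentence):
            return cat
    return None

def label_script(text):
    """Naive sentence split plus cue labelling; sentences with no cue
    inherit the previous label, so segments extend until the next cue."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    labels, current = [], None
    for s in sentences:
        hit = label_sentence(s)
        if hit:
            current = hit
        labels.append((s, current))
    return labels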

Furthermore, I'd like to connect the topics found to ontologies in order to enrich the metadata about each file. I have no idea how to approach this either. Any advice would be VERY appreciated.

Forgive me if my explanations make no sense; I am not too familiar with the terminology, so if you also explain the terminology you use, I'd appreciate that. And please advise on algorithms I can try.

Topics: lda, topic-model, python, processing, data-mining

Category: Data Science


I haven't seen anything quite like this before, but it seems feasible. You need an ontology to separate the main concept into its subconcepts, and a classifier to distinguish between your broader categories: description, methodology, summary, application, and example. To train the classifier, I would manually label some transcripts at the paragraph level. If you don't have paragraph-segmented text, smooth the classification probabilities over the sentences so that neighbouring sentences share the same label, or use a paragraph segmentation model as in "Automatic Paragraph Segmentation with Lexical and Prosodic Features".
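The smoothing step can be sketched in a few lines. Here the per-sentence probability dicts stand in for whatever your real classifier outputs; averaging each sentence's distribution with its neighbours before taking the argmax suppresses isolated misclassifications:

```python
def smooth_labels(probs, window=1):
    """Average each sentence's class probabilities with its neighbours
    (window sentences on each side), then take the argmax, so that
    neighbouring sentences tend to share a label.

    probs: list of dicts {category: probability}, one per sentence.
    """
    n = len(probs)
    labels = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        avg = {cat: sum(p[cat] for p in probs[lo:hi]) / (hi - lo)
               for cat in probs[i]}
        labels.append(max(avg, key=avg.get))
    return labels
```

For example, a single noisy sentence sandwiched between two confidently labelled ones gets pulled back to the surrounding label.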

The classifier could be a CRF or an RNN. The modern way to induce the ontology would be through word embeddings; cf., e.g., "Learning Semantic Hierarchies via Word Embeddings". Formerly I would have recommended a hierarchical topic model such as hLDA; cf., e.g., "Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes" or "Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling".
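As for linking discovered topics to an existing ontology, the core of the embedding approach is just nearest-neighbour search in vector space. A toy sketch with made-up 3-d vectors (real ones would come from a trained word2vec/GloVe model, and the concept names are invented):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors; in practice, each ontology concept would be
# embedded with the same model used for the topic terms.
ontology_vecs = {
    "classification": [0.9, 0.1, 0.0],
    "regression":     [0.1, 0.9, 0.0],
    "clustering":     [0.0, 0.1, 0.9],
}

def link_to_ontology(topic_vec, concept_vecs):
    """Return the ontology concept whose vector is most similar."""
    return max(concept_vecs, key=lambda c: cosine(topic_vec, concept_vecs[c]))
```

A topic vector close to the "classification" direction would be linked to that concept; in a real pipeline you would also threshold the similarity so that off-ontology topics stay unlinked.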

Welcome to the site and good luck!
