Chunker/shallow parser for spoken language

Question

errantlinguist

October 12, 2017, 20:57

I'm trying to extract NPs from transcribed spoken text, such as

um it's the bl- it's the blue one in the right no left hand corner

which contains fillers (e.g. "um") and disfluencies (e.g. "bl-", "right no left hand corner") that are not commonly seen in written text. Ideally, I'd like to get something like the three sequences "it", "the blue one" and "the left hand corner" (or at the very least "the right no left hand corner").

I'm currently using Stanford CoreNLP's pre-trained shift-reduce parser with a beam size of 4 (englishSR.beam.ser.gz) and the bidirectional dependency network POS tagger (english-bidirectional-distsim.tagger), after filtering out fillers and duplicated tokens (e.g. "uh it's it's that one it's that one"). This performs okay but seems to fail a lot more often than I'd expect. Are there any widely available chunkers or (shallow) parsers tailored specifically to spoken English as opposed to written English? The language the chunker/parser is written in is irrelevant (i.e. it needn't have a Java API). I've also tried Stanford CoreNLP's caseless models, which actually seem to perform a bit worse (though I haven't done any rigorous comparisons).
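
For concreteness, here is a minimal Python sketch of the kind of filtering described above (dropping fillers and collapsing duplicated tokens) before the text is handed to a tagger or parser. The filler list, the rule that a trailing hyphen marks a truncated word, and the names clean_utterance and collapse_repeats are illustrative assumptions, not the exact preprocessing used. It makes no attempt to resolve repairs such as "right no left".

```python
# Assumed filler inventory; extend it for your own transcripts.
FILLERS = {"um", "uh", "uhm", "er", "erm"}


def collapse_repeats(tokens, max_ngram=4):
    """Collapse immediately repeated n-grams, trying longer n-grams first."""
    out = list(tokens)
    for n in range(max_ngram, 0, -1):
        i = 0
        while i + 2 * n <= len(out):
            if out[i:i + n] == out[i + n:i + 2 * n]:
                # Drop the second copy and re-check the same position.
                del out[i + n:i + 2 * n]
            else:
                i += 1
    return out


def clean_utterance(text, max_ngram=4):
    """Lowercase, split on whitespace, drop fillers and truncated words,
    then collapse immediate repetitions."""
    tokens = text.lower().split()
    # Assumption: a trailing hyphen marks a truncated word such as "bl-".
    tokens = [t for t in tokens if t not in FILLERS and not t.endswith("-")]
    return collapse_repeats(tokens, max_ngram)


if __name__ == "__main__":
    utterance = "um it's the bl- it's the blue one in the right no left hand corner"
    print(" ".join(clean_utterance(utterance)))
```

With this filtering, the example utterance still contains "the right no left hand corner", so the self-correction has to be handled by the chunker or by a separate disfluency-detection step.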

Topic stanford-nlp preprocessing parsing nlp

Category Data Science
