Attention to multiple areas of the same sentence

Let's consider the sentences below:
"Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too"
"Can company1 buy company2? What will be their total turnover then?"
"Coronavirus was originated in china. After that it is spreading all over the world. To prevent it everyone has to take care of cleanliness and prefer vegetarians."

In all of the above sentences you can see there are multiple questions or utterances, sometimes separated by "and", sometimes by a question mark, and sometimes by just a full stop.
A rule-based separation of these sentences fails in many cases. I want to split these sentences into individual intents.

One approach I am considering is using an attention mechanism over different parts of the sentence. I can't use gensim or similar sentence embeddings, as I don't have clear sentence boundaries here.
Can someone say whether an attention-based approach will work? If yes, any similar code they can point to would be helpful, as I haven't coded this before.

If any other approach can solve this problem better, please suggest it.

Topic: attention-mechanism, nlp

Category: Data Science


For the sentences you provided, the NLTK sentence tokeniser works just fine wherever the intents are separated by punctuation.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # fetch the tokeniser's pretrained model (needed once)

sentences = [
    'Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too',
    'Can company1 buy company2? What will be their total turnover then?',
    'Coronavirus was originated in china. After that it is spreading all over the world. To prevent it everyone has to take care of cleanliness and prefer vegetarians.',
]

# split each input into its punctuation-delimited sentences
for sent in sentences:
    print(sent_tokenize(sent))

out:
['Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too']
['Can company1 buy company2?', 'What will be their total turnover then?']
['Coronavirus was originated in china.', 'After that it is spreading all over the world.', 'To prevent it everyone has to take care of cleanliness and prefer vegetarians.']
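
Notice that the first input stays as one string: sent_tokenize only splits on sentence-final punctuation, so clauses joined by "and" are left intact. If you also want to cut those, a dependency-parse heuristic is one option. Below is a minimal sketch using spaCy (my own illustration, not from your question; the splitting rule and the en_core_web_sm model are assumptions), which cuts before a coordinating conjunction that links two verbal heads:

import spacy

# assumes `python -m spacy download en_core_web_sm` has been run
nlp = spacy.load('en_core_web_sm')

def split_clauses(text):
    # Heuristic: cut before a coordinating conjunction ('and', 'but', ...)
    # whose head is a verb, i.e. one joining two clauses rather than two nouns.
    doc = nlp(text)
    pieces, start = [], 0
    for tok in doc:
        if tok.dep_ == 'cc' and tok.head.pos_ in ('VERB', 'AUX'):
            pieces.append(doc[start:tok.i].text)
            start = tok.i + 1
    pieces.append(doc[start:].text)
    return [p for p in pieces if p]

print(split_clauses('Datascience exchange is a wonderful platform to get answers to datascience related queries and it helps to learn various concepts too'))

Whether the cut lands in the right place depends entirely on the parser's output, so treat this as a starting point rather than a robust splitter.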

It is true that these rule-based tokenisers do suffer from several issues (see: https://github.com/nltk/nltk/issues/494 ). If you specifically want a sequential model for sentence splitting, then an LSTM with an attention mechanism would surely be a good choice; you can find many repositories on GitHub which implement the attention mechanism of the original paper 'Attention Is All You Need' (among others: https://github.com/kaushalshetty/Positional-Encoding ).
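
To make that concrete, here is a minimal sketch (PyTorch; the model, toy data, and label scheme are all illustrative assumptions, not code from the linked repositories) that frames intent splitting as token-level sequence labelling: a BiLSTM encoder with a simple additive self-attention layer predicts, for each token, whether an intent boundary follows it.

import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # additive attention score per token
        self.out = nn.Linear(4 * hidden, 2)    # [token state; context] -> boundary / no boundary

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))                     # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)      # attention over the sequence
        context = (weights * h).sum(dim=1, keepdim=True)  # (batch, 1, 2*hidden)
        context = context.expand(-1, h.size(1), -1)       # broadcast context to every token
        return self.out(torch.cat([h, context], dim=-1))  # (batch, seq, 2)

# Toy batch: made-up token ids for a two-intent utterance, with label 1
# on the token that ends each intent and 0 elsewhere.
x = torch.tensor([[4, 11, 7, 12, 2, 5, 9, 13, 14, 15, 8, 2]])
y = torch.tensor([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])

model = BoundaryTagger(vocab_size=20)
logits = model(x)                                         # (1, 12, 2)
loss = nn.CrossEntropyLoss()(logits.view(-1, 2), y.view(-1))
loss.backward()
print(logits.shape, loss.item())

At inference time you would take the argmax per token and cut the utterance after every token labelled 1; training of course needs real annotated boundaries rather than this toy batch.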
