Remove special characters from a list or string

Input_String is the text corpus of a Jane Austen book.

The output should be: ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

But I am getting this output: ['to', 'be,', 'or', 'not', 'to', 'be:', 'that', 'is', 'the', 'question!']

Topic nltk ipython python machine-learning

Category Data Science


Regular expressions can be used to create a simple tokenizer and normalizer:

from __future__ import annotations
import re

def tokens(text: str) -> list[str]:
    """List all the word tokens in a text."""
    # Raw string so that \w reaches the regex engine unchanged
    return re.findall(r'\w+', text.lower())

assert tokens("To be, or not to be, that is the question:") == ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

Otherwise, use an established library like spaCy to generate a list of tokens.
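One case where the simple `\w+` tokenizer falls short on a real corpus is contractions and possessives, which it splits at the apostrophe. The sketch below shows that behaviour next to a hypothetical variant, `tokens_keep_apostrophes`, which is my own illustration and not part of the answer above. It assumes ASCII letters and straight apostrophes only.

```python
import re

def tokens(text):
    """Simple tokenizer: runs of word characters, lower-cased."""
    return re.findall(r'\w+', text.lower())

def tokens_keep_apostrophes(text):
    """Hypothetical variant: keep apostrophes that sit between letters."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

sample = "Emma's father didn't go."

split_result = tokens(sample)
kept_result = tokens_keep_apostrophes(sample)

print(split_result)  # ['emma', 's', 'father', 'didn', 't', 'go']
print(kept_result)   # ["emma's", 'father', "didn't", 'go']
```

Which behaviour is preferable depends on the downstream task. A library tokenizer applies its own rules for such cases.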


You can simply use Python's regular expression library, re. It will look something like this:

import re

def text2word(text):
    '''Convert a string of words to a list, removing all special characters'''
    result = re.findall(r'\w+', text.lower())
    return result

You can print the result to the console to see what the function returns. For example:

string = " To be or not to be: that is the question!"
print(text2word(string))

Output: ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']


I would use regex. I can only say what would work for the input sentence I can deduce from your desired output:

s = 'to be, or not to be: that is the question!'

I simply remove all characters that are not letters (upper or lower case) or spaces.

import re

pattern = r'[^A-Za-z ]'
regex = re.compile(pattern)

result = regex.sub('', s).split(' ')

print(result)

['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
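One caveat with `split(' ')`: if removing a character leaves two spaces in a row (for example around a free-standing dash), the result contains empty strings. A minimal sketch, using a made-up sentence `s2` that is not from the question, compares `split(' ')` with `split()`:

```python
import re

regex = re.compile(r'[^A-Za-z ]')

# Hypothetical input: the dash is surrounded by spaces.
s2 = 'to be - or not to be!'
cleaned = regex.sub('', s2)      # 'to be  or not to be' (two spaces after 'be')

with_empty = cleaned.split(' ')  # splits on every single space
without_empty = cleaned.split()  # splits on runs of whitespace

print(with_empty)     # ['to', 'be', '', 'or', 'not', 'to', 'be']
print(without_empty)  # ['to', 'be', 'or', 'not', 'to', 'be']
```

Calling `split()` with no argument is usually the safer choice when the text has irregular spacing.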


Edit

Based on the update comment from the OP, my answer can be adjusted to work on each of the words via simple iteration over the sentences:

cleaned_sentences = []    # will become a list of lists

# sentences is the OP's list of tokenized sentences
for sentence in sentences:
    temp = [regex.sub('', word) for word in sentence]
    cleaned_sentences.append(temp)

This uses regex as defined up above.
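As a self-contained sketch of that loop: the `sentences` list here is invented for illustration (the OP's real data would be tokenized sentences from the corpus), and the extra filter that drops tokens left empty after cleaning is my own addition.

```python
import re

regex = re.compile(r'[^A-Za-z ]')

# Hypothetical stand-in for the OP's data: a list of tokenized sentences.
sentences = [
    ['to', 'be,', 'or', 'not', 'to', 'be:'],
    ['that', 'is', 'the', 'question', '!'],
]

cleaned_sentences = []
for sentence in sentences:
    temp = [regex.sub('', word) for word in sentence]
    # Drop tokens that consisted only of punctuation and are now empty.
    cleaned_sentences.append([word for word in temp if word])

print(cleaned_sentences)
# [['to', 'be', 'or', 'not', 'to', 'be'], ['that', 'is', 'the', 'question']]
```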


Here is another option, though it is likely to be somewhat slower than the other answers.

import string
s = 'to be, or not to be: that is the question!'
punct_set = set(string.punctuation)  # save the punctuation characters in a set
s = ''.join(ch for ch in s if ch not in punct_set)  # keep every character that is not punctuation

Another way is to use the translate method. In Python 3, a dictionary should be passed to the method. Mapping a character to None removes it.

import string
s = 'to be, or not to be: that is the question!'
translation = dict.fromkeys(map(ord, string.punctuation), None)  # dictionary of punctuation to be removed
no_punct_s = s.translate(translation)
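Neither snippet splits the string into words yet. A short sketch that finishes the job with `str.split()` and builds the same table with `str.maketrans`, whose three-argument form maps every character of the third argument to None:

```python
import string

s = 'to be, or not to be: that is the question!'

# Three-argument form: characters in the third argument are deleted.
table = str.maketrans('', '', string.punctuation)
words = s.translate(table).split()

print(words)
# ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```

The input here is already lower case. For mixed-case text, call `.lower()` before splitting.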

This can be achieved with regular expressions:

import re
modified_string = re.sub(r'\W+', '', input_string)  # on individual tokens
modified_string = re.sub(r'[^a-zA-Z0-9_\s]+', '', input_string)  # on the sentence itself; the pattern also keeps whitespace

For ASCII text, \W is equivalent to [^a-zA-Z0-9_], so everything except digits, letters and _ is removed (replaced by the empty string).
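To make the difference between the two patterns concrete, here is a small sketch with made-up inputs (`token` and `sentence` are illustrative names). It also shows that `\W+` applied to a whole sentence deletes the spaces too, which is why the second pattern keeps `\s`.

```python
import re

token = 'be,'
sentence = 'to be, or not to be: that is the question!'

# On a single token, \W+ strips the trailing punctuation.
clean_token = re.sub(r'\W+', '', token)

# On a whole sentence, \W+ also removes the spaces and fuses the words.
fused = re.sub(r'\W+', '', sentence)

# Keeping whitespace (\s) in the allowed set preserves word boundaries.
clean_sentence = re.sub(r'[^a-zA-Z0-9_\s]+', '', sentence)

print(clean_token)     # be
print(fused)           # tobeornottobethatisthequestion
print(clean_sentence)  # to be or not to be that is the question
```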
