Unable to resolve TypeError when using Tokenizer.tokenize from NLTK

I want to tokenize text data but cannot proceed because of a type error, and I do not know how to rectify it. To give some context: the columns 'Resolution code', 'Resolution Note', 'Description' and 'Shortdescription' all contain English text. Here is the code that I have written:
#Removal of Stop words:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
dfclean_imp_netc=pd.DataFrame()
…
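A common cause of this TypeError is that one of the text columns contains NaN or other non-string cells, which RegexpTokenizer.tokenize cannot handle. Below is a minimal defensive sketch, assuming the data sits in a DataFrame df with the columns listed above (the helper name is made up here):

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

def clean_tokens(series: pd.Series) -> pd.Series:
    # Fill NaN and force everything to str so tokenize() never receives a float
    return series.fillna('').astype(str).apply(
        lambda text: [w for w in tokenizer.tokenize(text.lower()) if w not in stop_words]
    )

# Hypothetical usage on one of the columns mentioned above:
# dfclean_imp_netc['Description'] = clean_tokens(df['Description'])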
Category: Data Science

BertTokenizer on custom data returns same index for all tokens

I'm trying to train a BERT tokenizer on a custom dataset, but when running tokenizer.tokenize on sample data it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary 88 tokens long. After saving this and reusing it in tensorflow_text.BertTokenizer, I get [88] for all tokens of the two provided test sentences. Fully reproducible example code:
import tensorflow as tf
import tensorflow_text
from pathlib import …
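Getting the same id for every token, equal to the vocabulary size (88), usually means every piece is falling into the out-of-vocabulary bucket, often because of how the vocabulary is saved and handed to BertTokenizer. A minimal sketch, assuming vocab is the token list returned by bert_vocab_from_dataset (the tokens shown are placeholders):

import tensorflow_text as tf_text

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world"]  # placeholder tokens

# The vocab file must contain exactly one token per line, nothing else;
# otherwise lookups fail and every piece maps to the same OOV bucket.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

tokenizer = tf_text.BertTokenizer("vocab.txt", lower_case=True)
print(tokenizer.tokenize(["hello world"]))  # ids should now differ per token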
Category: Data Science

How to perform tokenization for tweets in XLNet?

X_train has only one column that contains all tweets.
xlnet_model = 'xlnet-large-cased'
xlnet_tokenizer = XLNetTokenizer.from_pretrained(xlnet_model)
def get_inputs(tweets, tokenizer, max_len=120):
    """ Gets tensors from text using the tokenizer provided"""
    inps = [tokenizer.encode_plus(t, max_length=max_len, pad_to_max_length=True, add_special_tokens=True) for t in tweets]
    inp_tok = np.array([a['input_ids'] for a in inps])
    ids = np.array([a['attention_mask'] for a in inps])
    segments = np.array([a['token_type_ids'] for a in inps])
    return inp_tok, ids, segments
inp_tok, ids, segments = get_inputs(X_train, xlnet_tokenizer)
AttributeError: 'NoneType' object has no attribute 'encode_plus'
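The traceback shows that the tokenizer object itself is None, i.e. from_pretrained did not return a usable tokenizer; a commonly reported cause is a missing sentencepiece backend. A minimal sketch of the check plus a current-style encode_plus call (assuming the tweets are passed as a plain list of strings):

from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
if xlnet_tokenizer is None:
    # Commonly reported fix: pip install sentencepiece, then reload
    raise RuntimeError("XLNet tokenizer failed to load")

enc = xlnet_tokenizer.encode_plus(
    "a sample tweet",
    max_length=120,
    padding="max_length",        # pad_to_max_length=True is deprecated
    truncation=True,
    add_special_tokens=True,
)
print(len(enc["input_ids"]))     # 120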
Category: Data Science

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer was this post, which still doesn't say what tokenizer it uses. If I knew what tokenizer the API used, then I could count how many tokens are in my prompt before I submit …
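For reference, the original GPT-3 models use the byte-level BPE tokenizer inherited from GPT-2, exposed as the r50k_base encoding in OpenAI's tiktoken library (later instruct models use p50k_base). A minimal counting sketch:

import tiktoken

enc = tiktoken.get_encoding("r50k_base")   # BPE used by the original GPT-3 models
prompt = "How many tokens will this prompt cost me?"
print(len(enc.encode(prompt)))             # number of tokens the API will see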
Category: Data Science

Why does my char level Keras tokenizer add spaces when converting sequences to texts?

I create a tokenizer with
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True, ...)
tokenizer.fit_on_texts(...)
But when I convert sequences of tokens back to texts, the result contains a space after each character (except the last one):
test_text = 'this is a test'
seq = tokenizer.texts_to_sequences([test_text])
r = tokenizer.sequences_to_texts(seq)[0]
assert(r == ''.join([c + ' ' for c in test_text])[:-1])
Is there a way to avoid these added spaces? Am I missing some configuration parameter?
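sequences_to_texts always joins tokens with a single space, regardless of char_level, so the spaces either have to be stripped afterwards or the ids mapped back through index_word directly. A minimal sketch of the second option:

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True)
tokenizer.fit_on_texts(['this is a test'])

seq = tokenizer.texts_to_sequences(['this is a test'])
# Rebuild the string from index_word instead of calling sequences_to_texts
decoded = ''.join(tokenizer.index_word[i] for i in seq[0])
assert decoded == 'this is a test'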
Category: Data Science

How to deal with "Ergänzungsstriche" (suspended hyphens) and "Bindestriche" (regular hyphens) in German NLP?

Problem: In German, the phrase "Haupt- und Nebensatz" has exactly the same meaning as "Hauptsatz und Nebensatz". However, when transforming both phrases using e.g. spaCy's de_core_news_sm pipeline, the cosine similarity of the resulting token vectors differs significantly:
token1      token2      similarity
Haupt-      Hauptsatz   0.07
und         und         0.67
Nebensatz   Nebensatz   0.87
Code to reproduce:
import spacy
import numpy as np
def calc_cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
nlp = spacy.load("de_core_news_sm")
doc1 = nlp("Hauptsatz und Nebensatz")
doc2 = nlp("Haupt- und Nebensatz")
…
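One thing worth checking first: de_core_news_sm ships no static word vectors, so token.vector falls back to context-sensitive tensors, and similarities for rare or truncated forms such as "Haupt-" can be erratic. A minimal comparison sketch, assuming the medium model de_core_news_md (which does include word vectors) is installed:

import spacy

nlp = spacy.load("de_core_news_md")
doc1 = nlp("Hauptsatz und Nebensatz")
doc2 = nlp("Haupt- und Nebensatz")
for t1, t2 in zip(doc1, doc2):
    print(t1.text, t2.text, round(t1.similarity(t2), 2))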
Category: Data Science

Is it good practice to remove the numeric values from the text data during preprocessing?

I'm doing preprocessing on a text dataset. It contains certain numeric values, such as dates (1st July), years (2019), tentative values (3-5 years / 10+ advantages), unique values (room no 31 / user rank 45) and percentages (100%). Is it recommended to discard these numeric values before creating a vectorizer (BoW / TF-IDF) for any model (classification / regression) development? Any quick help on this is much appreciated. Thank you.
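One hedged alternative to dropping the numbers outright is to map them to placeholder tokens, so the vectorizer keeps the coarse signal (a year, a percentage) without blowing up the vocabulary with unique values. A deliberately crude sketch (patterns and placeholder names are made up):

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize_numbers(text: str) -> str:
    text = re.sub(r"\d+%", " numpct ", text)   # 100% -> numpct
    text = re.sub(r"\d+", " num ", text)       # 2019, 31, 3-5 -> num
    return text

docs = ["room no 31 booked on 1st July 2019 at 100% occupancy"]
vec = TfidfVectorizer(preprocessor=normalize_numbers)
vec.fit(docs)
print(vec.get_feature_names_out())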
Category: Data Science

Adding a new token to a transformer model without breaking tokenization of subwords

I'm running an experiment investigating the internal structure of large pre-trained models (BERT and RoBERTa, to be specific). Part of this experiment involves fine-tuning the models on a made-up new word in a specific sentential context and observing its predictions for that novel word in other contexts post-tuning. Because I am just trying to teach it a new word, we freeze the embeddings for the other words during fine-tuning so that only the weights for the new word are updated. …
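For context, the usual Hugging Face recipe is to register the made-up word as an added token and resize the embedding matrix; added tokens are matched before WordPiece/BPE splitting, so the tokenization of existing subwords is untouched. A minimal sketch (the novel word 'blicket' is a placeholder):

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["blicket"])              # hypothetical novel word
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix by one row

print(tokenizer.tokenize("a blicket sat on the mat"))  # 'blicket' stays a single token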
Category: Data Science

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I try to get word embeddings for a sentence using Bio_ClinicalBERT, a sentence of 8 words gives me 11 token ids (plus start and end tokens) because "embeddings" is an out-of-vocabulary word/token that is split into em, bed, ding, s. I would like to know whether there are any aggregation strategies that make sense apart from taking the mean of these vectors.
from transformers import AutoTokenizer, AutoModel
# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
…
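Besides the mean, common choices are summing, max-pooling, or just keeping the first subtoken's vector; the mechanics are the same either way. A minimal pooling sketch using the fast tokenizer's word_ids() mapping (the example sentence is a placeholder):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

enc = tokenizer("the embeddings look fine", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]        # (num_pieces, hidden_size)

# word_ids() maps each word-piece position back to the word it came from
pieces_per_word = {}
for pos, wid in enumerate(enc.word_ids()):
    if wid is not None:                               # skip [CLS] / [SEP]
        pieces_per_word.setdefault(wid, []).append(hidden[pos])

word_vectors = [torch.stack(v).mean(dim=0) for _, v in sorted(pieces_per_word.items())]
print(len(word_vectors))                              # one vector per original word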
Category: Data Science

How to calculate semantic similarity between video captions?

I intend to calculate the accuracy of a generated caption by comparing it to a number of reference sentences. For example, the captions for one video are as follows. These captions are for the same video only; however, the reference sentences have been broken down with respect to different segments of the video. Reference sentences (R): A man is walking along while pushing his bicycle. He tries to balance himself by taking support from a pole. Then he falls on the …
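A hedged sketch of one common setup: embed the generated caption and each reference with a sentence-embedding model and take the maximum (or mean) cosine similarity, alongside the classic caption metrics (BLEU, METEOR, CIDEr). The candidate caption below is a placeholder:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

candidate = "A man falls while pushing his bicycle."          # hypothetical model output
references = [
    "A man is walking along while pushing his bicycle.",
    "He tries to balance himself by taking support from a pole.",
]

cand_emb = model.encode(candidate, convert_to_tensor=True)
ref_embs = model.encode(references, convert_to_tensor=True)
scores = util.cos_sim(cand_emb, ref_embs)                      # shape (1, len(references))
print(float(scores.max()))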
Category: Data Science

Does ValueError: 'rat' is not in list mean the word does not exist in the tokenizer?

Does this error mean that the word doesn't exist in the tokenizer?
return sent.split(" ").index(word)
ValueError: 'rat' is not in list
The code looks like this:
def sentences():
    for sent in sentences:
        token = tokenizer.tokenize(sent)
        for i in token:
            idx = get_word_idx(sent, i)
def get_word_idx(sent: str, word: str):
    return sent.split(" ").index(word)
Splitting the sentence returns ['long', 'restaurant', 'table', 'with', 'rattan', 'rounded', 'back', 'chairs'], and I think 'rattan' is the problem here.
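Most likely yes, in a sense: the error comes from looking up a word-piece produced by the subword tokenizer (something like 'rat' coming out of 'rattan') inside the whitespace-split sentence, where that piece never occurs as a whole word. A small sketch of the mismatch and a safer loop, assuming a WordPiece-style tokenizer such as BERT's:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sent = "long restaurant table with rattan rounded back chairs"

print(tokenizer.tokenize("rattan"))   # likely split into pieces such as 'rat', '##tan'

# Iterating over the original words avoids looking up sub-word pieces:
words = sent.split(" ")
for word in words:
    idx = words.index(word)
    print(word, idx)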
Category: Data Science

Any way to make NER tagging with float (2.0) work when inferencing with str (2)?

One of the NER attributes is tagged with floats (3.0, 2.0, ...), while the text files I am trying to run inference on contain the string format (3, 2, ...). The spaCy model I used can't pick up the numbers interchangeably with or without the (.0) tail. Does anyone have any idea how to solve this issue? Many thanks!
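One workable approach, sketched below under the assumption that the mismatch is purely in the surface form, is to normalise the numbers on both sides (training annotations and inference text) so that '3.0' and '3' look identical to the model:

import re

def normalize_numeric(text: str) -> str:
    # "3.0" -> "3"; leaves genuinely fractional values such as "3.5" untouched
    return re.sub(r"\b(\d+)\.0\b", r"\1", text)

print(normalize_numeric("severity 3.0 reported"))   # severity 3 reported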
Category: Data Science

How does Keras Tokenizer choose tokens given a sentence?

I tried to find the answer to this question but couldn't find anything, so I am asking here: how does the Keras Tokenizer choose tokens given a sentence of words? To be more precise about what I want to know, consider this simple example:
# Import module
from keras.preprocessing.text import Tokenizer
# define a document
doc = ['The cat sat on the mat']
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the document
tokenizer.fit_on_texts(doc)
encoded_doc = tokenizer.texts_to_sequences(doc)
print('word_index : …
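For reference, the default Tokenizer lowercases the text, strips the characters listed in its filters argument, splits on whitespace, and then assigns indices by descending word frequency (the most frequent word gets index 1). A minimal sketch of what the example above produces:

from tensorflow.keras.preprocessing.text import Tokenizer

doc = ['The cat sat on the mat']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(doc)

print(tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}   ('the' occurs twice, so it ranks first)
print(tokenizer.texts_to_sequences(doc))
# [[1, 2, 3, 4, 1, 5]]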
Category: Data Science

How to precompute one sequence in a sequence-pair task when using BERT?

BERT uses separator tokens ([SEP]) to input two sequences for a sequence-pair task. If I understand the BERT architecture correctly, attention is applied to all inputs thus coupling the two sequences right from the start. Now, consider a sequence-pair task in which one of the sequences is constant and known from the start. E.g. Answering multiple unknown questions about a known context. To me it seems that there could be a computational advantage if one would precompute (part of) the …
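A small experiment makes the coupling concrete: because self-attention mixes the two segments from the first layer onwards, the hidden states of the "constant" sequence change as soon as a second sequence is appended, so only the embedding layer could be precomputed naively. A hedged demonstration sketch:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

context = "The quick brown fox jumps over the lazy dog."
question = "What does the fox jump over?"

with torch.no_grad():
    ctx_alone = model(**tok(context, return_tensors="pt")).last_hidden_state
    ctx_pair = model(**tok(context, question, return_tensors="pt")).last_hidden_state

n = ctx_alone.shape[1] - 1   # [CLS] + context tokens, excluding the final [SEP]
print(torch.allclose(ctx_alone[0, :n], ctx_pair[0, :n], atol=1e-4))   # False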
Category: Data Science

Tokenizer returning incorrect values and losing a lot of data

(Cross-posted from the main Stack Overflow.) This is a weird situation, so I hope I can explain it correctly. My partner and I are working on an ML project where we create a model that predicts whether a Reddit comment is sarcastic or not (data set for reference). We have created our model based on the training data CSV (all seems good), and now want to test it on the testing data CSV. To do so we have split the …
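Without the full code this is only a guess, but one frequent cause of "losing a lot of data" at test time is a Keras Tokenizer fitted only on the training CSV: any test word it has never seen is silently dropped unless an oov_token is configured. A minimal sketch of the difference (the sample texts are placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["yeah right great idea"]                  # hypothetical training comment
test_texts = ["totally great idea obviously"]            # hypothetical test comment

tok = Tokenizer(oov_token="<OOV>")   # without oov_token, unseen words simply disappear
tok.fit_on_texts(train_texts)
print(tok.texts_to_sequences(test_texts))   # unseen words map to the <OOV> index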
Category: Data Science

When should tokenization be done, and does my output still need tokenization after stemming?

I am working on a sentiment analysis project with various customer reviews, and I am trying to clean those reviews. The first thing I did was remove special characters, white spaces and numbers from the text. Next I removed stop words (this, that, have, etc.). After that I did stemming (removing -ing, -ed, -y, etc.). Below is my output. What I want to know is whether tokenization is still needed here, because my output after stemming looks like tokenization …
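For what it's worth, tokenization normally comes before stopword removal and stemming, since both operate on individual tokens; if the stemmed output is already a list of tokens, no further tokenization is needed. A minimal sketch of the usual ordering:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

review = "I have loved the amazing battery timing of this phone"
tokens = word_tokenize(review.lower())                                 # 1. tokenize
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]    # 2. remove stop words
stems = [stemmer.stem(t) for t in tokens]                              # 3. stem
print(stems)   # e.g. ['love', 'amaz', 'batteri', 'time', 'phone']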
Category: Data Science

Training NMT models for noisy social media roman text

I am trying to train an NMT model where the source side is roman text of Asian languages from social media and the target side is English. Note that since roman text is not native to Asia, the romanizations people use to type on the Internet are very personal and hence a bit noisy, but easily intelligible to native speakers. The following is an example of writing a Hindi sentence in different ways: Vaise bhi mere paas jo bhi hai …
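A hedged sketch of one standard ingredient: subword segmentation (e.g. SentencePiece BPE) absorbs a lot of spelling variation in romanized social-media text, since variants like "vaise"/"waise" still share most of their pieces. The corpus path and hyperparameters below are placeholders:

import sentencepiece as spm

# Assumes a plain-text file with one romanized source sentence per line
spm.SentencePieceTrainer.train(
    input="corpus.rom.txt",
    model_prefix="rom_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0,
)
sp = spm.SentencePieceProcessor(model_file="rom_bpe.model")
print(sp.encode("Vaise bhi mere paas jo bhi hai", out_type=str))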
Category: Data Science

Create sequences of non-dictionary words

I have a few word sequences:
recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait
getuid,recvfrom,writev,getuid,epoll_pwait,getuid
Now I want to tokenize them and then turn them into sequences to feed into the model. For standard text I would do something like this:
### Create sequence
vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)
But my data contains non-dictionary words and also some repeating words. How do I convert this data into sequences?
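The Keras Tokenizer does not care whether its tokens are dictionary words; it only splits on the given separator and indexes by frequency, and repeated words simply reuse the same index. A minimal sketch, clearing filters so that underscores in names like epoll_pwait survive and splitting on commas:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = [
    "recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait",
    "getuid,recvfrom,writev,getuid,epoll_pwait,getuid",
]
tokenizer = Tokenizer(num_words=20000, split=",", filters="", lower=True)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=50)
print(tokenizer.word_index)
print(sequences)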
Category: Data Science

Tensorflow text tokenizer incorrect tokenization

I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output: [[1, 7, 8, 9]]
Word index:
print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on "." in this case, even though I am giving the split = …
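A likely explanation: the Tokenizer's default filters string contains '.', and filtered characters are replaced by the split character before splitting, so "AB.CDEFGHIJKLMNOPQRSTUVWXYZ" is broken into two tokens regardless of the split argument. Dropping '.' from filters keeps it whole; a minimal sketch:

from tensorflow.keras.preprocessing.text import Tokenizer

default_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(num_words=200, split=" ",
                      filters=default_filters.replace(".", ""))
tokenizer.fit_on_texts(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"])
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# 'ab.cdefghijklmnopqrstuvwxyz' now stays a single token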
Category: Data Science
