Unable to resolve TypeError when using Tokenizer.tokenize from NLTK

I want to tokenize text data but cannot proceed because of a type error, and I do not know how to rectify it. To give some context: the columns 'Resolution code', 'Resolution Note', 'Description' and 'Shortdescription' all contain English text. Here is the code that I have written:
#Removal of Stop words:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
dfclean_imp_netc=pd.DataFrame()
…
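A common cause of this TypeError is that one of the text columns contains NaN or other non-string cells, which RegexpTokenizer.tokenize cannot handle. Below is a minimal defensive sketch, assuming the data sits in a DataFrame df with the columns listed above (the helper name is made up here):

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))

def clean_tokens(series: pd.Series) -> pd.Series:
    # Fill NaN and force everything to str so tokenize() never receives a float
    return series.fillna('').astype(str).apply(
        lambda text: [w for w in tokenizer.tokenize(text.lower()) if w not in stop_words]
    )

# Hypothetical usage on one of the columns mentioned above:
# dfclean_imp_netc['Description'] = clean_tokens(df['Description'])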
Category: Data Science

BertTokenizer on custom data returns same index for all tokens

I'm trying to train a BERT tokenizer on a custom dataset, but when running tokenizer.tokenize on sample data it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary 88 tokens long. After saving this and reusing it in tensorflow_text.BertTokenizer, I get [88] for all tokens of the two provided test sentences. Fully reproducible example code:
import tensorflow as tf
import tensorflow_text
from pathlib import …
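Getting the same id for every token, equal to the vocabulary size (88), usually means every piece is falling into the out-of-vocabulary bucket, often because of how the vocabulary is saved and handed to BertTokenizer. A minimal sketch, assuming vocab is the token list returned by bert_vocab_from_dataset (the tokens shown are placeholders):

import tensorflow_text as tf_text

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world"]  # placeholder tokens

# The vocab file must contain exactly one token per line, nothing else;
# otherwise lookups fail and every piece maps to the same OOV bucket.
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

tokenizer = tf_text.BertTokenizer("vocab.txt", lower_case=True)
print(tokenizer.tokenize(["hello world"]))  # ids should now differ per token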
Category: Data Science

How to perform tokenization for tweets in XLNet?

X_train has only one column that contains all tweets.
xlnet_model = 'xlnet-large-cased'
xlnet_tokenizer = XLNetTokenizer.from_pretrained(xlnet_model)
def get_inputs(tweets, tokenizer, max_len=120):
    """ Gets tensors from text using the tokenizer provided"""
    inps = [tokenizer.encode_plus(t, max_length=max_len, pad_to_max_length=True, add_special_tokens=True) for t in tweets]
    inp_tok = np.array([a['input_ids'] for a in inps])
    ids = np.array([a['attention_mask'] for a in inps])
    segments = np.array([a['token_type_ids'] for a in inps])
    return inp_tok, ids, segments
inp_tok, ids, segments = get_inputs(X_train, xlnet_tokenizer)
AttributeError: 'NoneType' object has no attribute 'encode_plus'
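The traceback shows that the tokenizer object itself is None, i.e. from_pretrained did not return a usable tokenizer; a commonly reported cause is a missing sentencepiece backend. A minimal sketch of the check plus a current-style encode_plus call (assuming the tweets are passed as a plain list of strings):

from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
if xlnet_tokenizer is None:
    # Commonly reported fix: pip install sentencepiece, then reload
    raise RuntimeError("XLNet tokenizer failed to load")

enc = xlnet_tokenizer.encode_plus(
    "a sample tweet",
    max_length=120,
    padding="max_length",        # pad_to_max_length=True is deprecated
    truncation=True,
    add_special_tokens=True,
)
print(len(enc["input_ids"]))     # 120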
Category: Data Science

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer was this post, which still doesn't say what tokenizer it uses. If I knew what tokenizer the API used, then I could count how many tokens are in my prompt before I submit …
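For reference, the original GPT-3 models use the byte-level BPE tokenizer inherited from GPT-2, exposed as the r50k_base encoding in OpenAI's tiktoken library (later instruct models use p50k_base). A minimal counting sketch:

import tiktoken

enc = tiktoken.get_encoding("r50k_base")   # BPE used by the original GPT-3 models
prompt = "How many tokens will this prompt cost me?"
print(len(enc.encode(prompt)))             # number of tokens the API will see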
Category: Data Science

Why does my char level Keras tokenizer add spaces when converting sequences to texts?

I create a tokenizer with
import tensorflow as tf
tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True, ...)
tokenizer.fit_on_texts(...)
But when I convert sequences of tokens back to texts, the result contains a space after each character (except the last one):
test_text = 'this is a test'
seq = tokenizer.texts_to_sequences([test_text])
r = tokenizer.sequences_to_texts(seq)[0]
assert(r == ''.join([c + ' ' for c in test_text])[:-1])
Is there a way to avoid these added spaces? Am I missing some configuration parameter?
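sequences_to_texts always joins tokens with a single space, regardless of char_level, so the spaces either have to be stripped afterwards or the ids mapped back through index_word directly. A minimal sketch of the second option:

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(split='', char_level=True)
tokenizer.fit_on_texts(['this is a test'])

seq = tokenizer.texts_to_sequences(['this is a test'])
# Rebuild the string from index_word instead of calling sequences_to_texts
decoded = ''.join(tokenizer.index_word[i] for i in seq[0])
assert decoded == 'this is a test'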
Category: Data Science

How to deal with "Ergänzungsstriche" (suspended hyphens) and "Bindestriche" (regular hyphens) in German NLP?

Problem: In German, the phrase "Haupt- und Nebensatz" has exactly the same meaning as "Hauptsatz und Nebensatz". However, when transforming both phrases using e.g. spaCy's de_core_news_sm pipeline, the cosine similarity of the resulting token vectors differs significantly:
token1      token2      similarity
Haupt-      Hauptsatz   0.07
und         und         0.67
Nebensatz   Nebensatz   0.87
Code to reproduce:
import spacy
import numpy as np
def calc_cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
nlp = spacy.load("de_core_news_sm")
doc1 = nlp("Hauptsatz und Nebensatz")
doc2 = nlp("Haupt- und Nebensatz")
…
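One thing worth checking first: de_core_news_sm ships no static word vectors, so token.vector falls back to context-sensitive tensors, and similarities for rare or truncated forms such as "Haupt-" can be erratic. A minimal comparison sketch, assuming the medium model de_core_news_md (which does include word vectors) is installed:

import spacy

nlp = spacy.load("de_core_news_md")
doc1 = nlp("Hauptsatz und Nebensatz")
doc2 = nlp("Haupt- und Nebensatz")
for t1, t2 in zip(doc1, doc2):
    print(t1.text, t2.text, round(t1.similarity(t2), 2))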
Category: Data Science

Is it good practice to remove the numeric values from the text data during preprocessing?

I'm doing preprocessing on a text dataset. It contains certain numeric values, such as dates (1st July), years (2019), tentative values (3-5 years / 10+ advantages), unique values (room no 31 / user rank 45) and percentages (100%). Is it recommended to discard these numeric values before creating a vectorizer (BoW / TF-IDF) for any model (classification / regression) development? Any quick help on this is much appreciated. Thank you.
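One hedged alternative to dropping the numbers outright is to map them to placeholder tokens, so the vectorizer keeps the coarse signal (a year, a percentage) without blowing up the vocabulary with unique values. A deliberately crude sketch (patterns and placeholder names are made up):

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def normalize_numbers(text: str) -> str:
    text = re.sub(r"\d+%", " numpct ", text)   # 100% -> numpct
    text = re.sub(r"\d+", " num ", text)       # 2019, 31, 3-5 -> num
    return text

docs = ["room no 31 booked on 1st July 2019 at 100% occupancy"]
vec = TfidfVectorizer(preprocessor=normalize_numbers)
vec.fit(docs)
print(vec.get_feature_names_out())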
Category: Data Science

Adding a new token to a transformer model without breaking tokenization of subwords

I'm running an experiment investigating the internal structure of large pre-trained models (BERT and RoBERTa, to be specific). Part of this experiment involves fine-tuning the models on a made-up new word in a specific sentential context and observing its predictions for that novel word in other contexts post-tuning. Because I am just trying to teach it a new word, we freeze the embeddings for the other words during fine-tuning so that only the weights for the new word are updated. …
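For context, the usual Hugging Face recipe is to register the made-up word as an added token and resize the embedding matrix; added tokens are matched before WordPiece/BPE splitting, so the tokenization of existing subwords is untouched. A minimal sketch (the novel word 'blicket' is a placeholder):

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

tokenizer.add_tokens(["blicket"])              # hypothetical novel word
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix by one row

print(tokenizer.tokenize("a blicket sat on the mat"))  # 'blicket' stays a single token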
Category: Data Science

How do I get word embeddings for out-of-vocabulary words using a transformer model?

When I try to get word embeddings for a sentence using Bio_ClinicalBERT, a sentence of 8 words gives me 11 token ids (plus start and end tokens) because "embeddings" is an out-of-vocabulary word/token that is split into em, bed, ding, s. I would like to know whether there are any aggregation strategies that make sense apart from taking the mean of these vectors.
from transformers import AutoTokenizer, AutoModel
# download and load model
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
…
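Besides the mean, common choices are summing, max-pooling, or just keeping the first subtoken's vector; the mechanics are the same either way. A minimal pooling sketch using the fast tokenizer's word_ids() mapping (the example sentence is a placeholder):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

enc = tokenizer("the embeddings look fine", return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]        # (num_pieces, hidden_size)

# word_ids() maps each word-piece position back to the word it came from
pieces_per_word = {}
for pos, wid in enumerate(enc.word_ids()):
    if wid is not None:                               # skip [CLS] / [SEP]
        pieces_per_word.setdefault(wid, []).append(hidden[pos])

word_vectors = [torch.stack(v).mean(dim=0) for _, v in sorted(pieces_per_word.items())]
print(len(word_vectors))                              # one vector per original word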
Category: Data Science

How to calculate semantic similarity between video captions?

I intend to calculate the accuracy of a generated caption by comparing it to a number of reference sentences. For example, the captions for one video are as follows. These captions are for the same video only; however, the reference sentences have been broken down with respect to different segments of the video. Reference sentences (R): A man is walking along while pushing his bicycle. He tries to balance himself by taking support from a pole. Then he falls on the …
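A hedged sketch of one common setup: embed the generated caption and each reference with a sentence-embedding model and take the maximum (or mean) cosine similarity, alongside the classic caption metrics (BLEU, METEOR, CIDEr). The candidate caption below is a placeholder:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

candidate = "A man falls while pushing his bicycle."          # hypothetical model output
references = [
    "A man is walking along while pushing his bicycle.",
    "He tries to balance himself by taking support from a pole.",
]

cand_emb = model.encode(candidate, convert_to_tensor=True)
ref_embs = model.encode(references, convert_to_tensor=True)
scores = util.cos_sim(cand_emb, ref_embs)                      # shape (1, len(references))
print(float(scores.max()))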
Category: Data Science

Does ValueError: 'rat' is not in list mean the word does not exist in the tokenizer?

Does this error mean that the word doesn't exist in the tokenizer?
return sent.split(" ").index(word)
ValueError: 'rat' is not in list
The code looks like this:
def sentences():
    for sent in sentences:
        token = tokenizer.tokenize(sent)
        for i in token:
            idx = get_word_idx(sent, i)
def get_word_idx(sent: str, word: str):
    return sent.split(" ").index(word)
Splitting the sentence returns ['long', 'restaurant', 'table', 'with', 'rattan', 'rounded', 'back', 'chairs'], and I think 'rattan' is the problem here.
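Most likely yes, in a sense: the error comes from looking up a word-piece produced by the subword tokenizer (something like 'rat' coming out of 'rattan') inside the whitespace-split sentence, where that piece never occurs as a whole word. A small sketch of the mismatch and a safer loop, assuming a WordPiece-style tokenizer such as BERT's:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sent = "long restaurant table with rattan rounded back chairs"

print(tokenizer.tokenize("rattan"))   # likely split into pieces such as 'rat', '##tan'

# Iterating over the original words avoids looking up sub-word pieces:
words = sent.split(" ")
for word in words:
    idx = words.index(word)
    print(word, idx)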
Category: Data Science

Any way to make NER tagging with float (2.0) work when inferencing with str (2)?

One of the NER attributes is tagged with floats (3.0, 2.0, ...), while the text files I am trying to run inference on contain the string format (3, 2, ...). The spaCy model I used can't pick up the numbers interchangeably with or without the (.0) tail. Does anyone have any idea how to solve this issue? Many thanks!
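One workable approach, sketched below under the assumption that the mismatch is purely in the surface form, is to normalise the numbers on both sides (training annotations and inference text) so that '3.0' and '3' look identical to the model:

import re

def normalize_numeric(text: str) -> str:
    # "3.0" -> "3"; leaves genuinely fractional values such as "3.5" untouched
    return re.sub(r"\b(\d+)\.0\b", r"\1", text)

print(normalize_numeric("severity 3.0 reported"))   # severity 3 reported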
Category: Data Science

How does Keras Tokenizer choose tokens given a sentence?

I tried to find the answer to this question but couldn't find anything, so I am asking here: how does the Keras Tokenizer choose tokens given a sentence of words? To be more precise about what I want to know, consider this simple example:
# Import module
from keras.preprocessing.text import Tokenizer
# define a document
doc = ['The cat sat on the mat']
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the document
tokenizer.fit_on_texts(doc)
encoded_doc = tokenizer.texts_to_sequences(doc)
print('word_index : …
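For reference, the default Tokenizer lowercases the text, strips the characters listed in its filters argument, splits on whitespace, and then assigns indices by descending word frequency (the most frequent word gets index 1). A minimal sketch of what the example above produces:

from tensorflow.keras.preprocessing.text import Tokenizer

doc = ['The cat sat on the mat']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(doc)

print(tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}   ('the' occurs twice, so it ranks first)
print(tokenizer.texts_to_sequences(doc))
# [[1, 2, 3, 4, 1, 5]]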
Category: Data Science

How to precompute one sequence in a sequence-pair task when using BERT?

BERT uses separator tokens ([SEP]) to input two sequences for a sequence-pair task. If I understand the BERT architecture correctly, attention is applied to all inputs thus coupling the two sequences right from the start. Now, consider a sequence-pair task in which one of the sequences is constant and known from the start. E.g. Answering multiple unknown questions about a known context. To me it seems that there could be a computational advantage if one would precompute (part of) the …
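A small experiment makes the coupling concrete: because self-attention mixes the two segments from the first layer onwards, the hidden states of the "constant" sequence change as soon as a second sequence is appended, so only the embedding layer could be precomputed naively. A hedged demonstration sketch:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

context = "The quick brown fox jumps over the lazy dog."
question = "What does the fox jump over?"

with torch.no_grad():
    ctx_alone = model(**tok(context, return_tensors="pt")).last_hidden_state
    ctx_pair = model(**tok(context, question, return_tensors="pt")).last_hidden_state

n = ctx_alone.shape[1] - 1   # [CLS] + context tokens, excluding the final [SEP]
print(torch.allclose(ctx_alone[0, :n], ctx_pair[0, :n], atol=1e-4))   # False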
Category: Data Science

Tokenizer returning incorrect values and losing a lot of data

(Cross-posted from the main Stack Overflow.) This is a weird situation, so I hope I can explain it correctly. My partner and I are working on an ML project where we create a model that predicts whether a Reddit comment is sarcastic or not (data set for reference). We have created our model based on the training data CSV (all seems good), and now want to test it on the testing data CSV. To do so we have split the …
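Without the full code this is only a guess, but one frequent cause of "losing a lot of data" at test time is a Keras Tokenizer fitted only on the training CSV: any test word it has never seen is silently dropped unless an oov_token is configured. A minimal sketch of the difference (the sample texts are placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer

train_texts = ["yeah right great idea"]                  # hypothetical training comment
test_texts = ["totally great idea obviously"]            # hypothetical test comment

tok = Tokenizer(oov_token="<OOV>")   # without oov_token, unseen words simply disappear
tok.fit_on_texts(train_texts)
print(tok.texts_to_sequences(test_texts))   # unseen words map to the <OOV> index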
Category: Data Science

When should tokenization be done, and does my output still need tokenization after stemming?

I am working on a sentiment analysis project with various customer reviews, and I am trying to clean those reviews. The first thing I did was remove special characters, white spaces and numbers from the text. Next I removed stop words (this, that, have, etc.). After that I did stemming (removing -ing, -ed, -y, etc.). Below is my output. What I want to know is whether tokenization is still needed here, because my output after stemming looks like tokenization …
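For what it's worth, tokenization normally comes before stopword removal and stemming, since both operate on individual tokens; if the stemmed output is already a list of tokens, no further tokenization is needed. A minimal sketch of the usual ordering:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

review = "I have loved the amazing battery timing of this phone"
tokens = word_tokenize(review.lower())                                 # 1. tokenize
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]    # 2. remove stop words
stems = [stemmer.stem(t) for t in tokens]                              # 3. stem
print(stems)   # e.g. ['love', 'amaz', 'batteri', 'time', 'phone']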
Category: Data Science

Training NMT models for noisy social media roman text

I am trying to train an NMT model where the source side is roman text of Asian languages from social media and the target side is English. Note that since roman text is not native to Asia, the romanizations people use to type on the Internet are very personal and hence a bit noisy, but easily intelligible to native speakers. The following is an example of writing a Hindi sentence in different ways: Vaise bhi mere paas jo bhi hai …
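A hedged sketch of one standard ingredient: subword segmentation (e.g. SentencePiece BPE) absorbs a lot of spelling variation in romanized social-media text, since variants like "vaise"/"waise" still share most of their pieces. The corpus path and hyperparameters below are placeholders:

import sentencepiece as spm

# Assumes a plain-text file with one romanized source sentence per line
spm.SentencePieceTrainer.train(
    input="corpus.rom.txt",
    model_prefix="rom_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=1.0,
)
sp = spm.SentencePieceProcessor(model_file="rom_bpe.model")
print(sp.encode("Vaise bhi mere paas jo bhi hai", out_type=str))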
Category: Data Science

Create sequences of non-dictionary words

I have a few word sequences:
recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait
getuid,recvfrom,writev,getuid,epoll_pwait,getuid
Now I want to tokenize them and then turn them into sequences to feed into the model. For standard text I would do something like this:
### Create sequence
vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)
But my data contains non-dictionary words and also some repeating words. How do I convert this data into sequences?
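The Keras Tokenizer does not care whether its tokens are dictionary words; it only splits on the given separator and indexes by frequency, and repeated words simply reuse the same index. A minimal sketch, clearing filters so that underscores in names like epoll_pwait survive and splitting on commas:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = [
    "recvfrom,sendto,epoll_pwait,recvfrom,sendto,epoll_pwait",
    "getuid,recvfrom,writev,getuid,epoll_pwait,getuid",
]
tokenizer = Tokenizer(num_words=20000, split=",", filters="", lower=True)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
data = pad_sequences(sequences, maxlen=50)
print(tokenizer.word_index)
print(sequences)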
Category: Data Science

Tensorflow text tokenizer incorrect tokenization

I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output: [[1, 7, 8, 9]]
Word index:
print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on "." in this case, even though I am giving the split = …
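A likely explanation: the Tokenizer's default filters string contains '.', and filtered characters are replaced by the split character before splitting, so "AB.CDEFGHIJKLMNOPQRSTUVWXYZ" is broken into two tokens regardless of the split argument. Dropping '.' from filters keeps it whole; a minimal sketch:

from tensorflow.keras.preprocessing.text import Tokenizer

default_filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
tokenizer = Tokenizer(num_words=200, split=" ",
                      filters=default_filters.replace(".", ""))
tokenizer.fit_on_texts(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"])
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# 'ab.cdefghijklmnopqrstuvwxyz' now stays a single token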
Category: Data Science
