BertTokenizer on custom data returns same index for all tokens
I'm trying to train a BERT tokenizer on a custom dataset, but when I run tokenizer.tokenize on sample data it returns the same index for every token, which is clearly not what is expected. Running bert_vocab_from_dataset on the sample dataset below returns a vocabulary that is 88 tokens long. After saving it and reusing it in tensorflow_text.BertTokenizer, I get [88] for every token of the two test sentences.
Fully reproducible example code:
import tensorflow as tf
import tensorflow_text
from pathlib import Path
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset
Path('BERT_Tokenizer').mkdir(parents=True, exist_ok=True)
bert_vocab_args = dict(
    vocab_size=8000,
    reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"],
    bert_tokenizer_params=dict(lower_case=True),
)
vocab = bert_vocab_from_dataset.bert_vocab_from_dataset(
    tf.data.Dataset.from_tensor_slices(sample_text),
    **bert_vocab_args)
print(f"Vocabulary is {len(vocab)} tokens long.")
# Save vocabulary
with open("BERT_Tokenizer/vocab.txt", "w") as f:
    for token in vocab:
        print(token.encode('utf-8'), file=f)
tokenizer = tensorflow_text.BertTokenizer(
    vocab_lookup_table="BERT_Tokenizer/vocab.txt",
)
tokens = tokenizer.tokenize([
'hello my name is amy how are you doing today',
'i live right in downtown quite close to the port'
])
tokens
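One thing that may be worth checking is what the save loop actually writes to vocab.txt. Below is a minimal sketch that needs no TensorFlow. The short token list is a hypothetical stand-in for the real 88-entry vocabulary, and the dictionary lookup is only a simplified model (an assumption, not the library's code) of a table that gives every unmatched entry a single out-of-vocabulary id equal to the vocabulary size.

```python
import io

# Hypothetical stand-in for the learned vocabulary (the real one has 88 entries).
vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "hello", "my"]

# Same save loop as in the example code, written to an in-memory buffer
# instead of BERT_Tokenizer/vocab.txt so the result can be inspected directly.
buffer = io.StringIO()
for token in vocab:
    print(token.encode('utf-8'), file=buffer)

# Each line of the buffer corresponds to one line of the vocabulary file.
written = buffer.getvalue().splitlines()
print(written[:3])

# Simplified lookup model (assumption): one vocabulary entry per line, and
# anything that is not found maps to a single id equal to the table size.
table = {line: index for index, line in enumerate(written)}
oov_id = len(table)
ids = [table.get(word, oov_id) for word in ["hello", "my", "[UNK]"]]
print(ids)
```

If the written lines come out as the text of a bytes literal (for example b'[PAD]') rather than the bare token, no lookup can match, and in this model every word ends up with the same id, equal to the vocabulary size. That would be consistent with seeing [88] everywhere for an 88-token vocabulary.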
Data:
sample_text = ['sorry but i dont see where there are any more legal threat towards wik a i never made any the truth is here the matter is look at what this guy did with my pic after seeimng it on wik he ha made me a target',
'interaction ban you are the one who posted this doe that make you the admin who imposed the ban or would it be someone else i am asking because i am curious what the proper mechanism is if any for reporting possible violation of interaction ban i asked at an and they told me to ask the admin who imposed the ban ← what is up doc carrots→ by my recollection that rfc wa re closed by someone else apparently when they reverted the closure i had done they neglected to place their own name on that page you linked to regardless afaik there is no restriction that place one admin solely a responsible for enforcing any user sanction and afaik that is part of what wp an is used for after all there is always another admin feel free to link to this response on wp an at your discretion they basically sent me your way so i am just going to let it simmer on the back burner for a while longer ← what is up doc carrots→ baseball a a fan of baseball i thought you might be interested in this link that i ran across talk awesome of course the nature of the game ha changed the current record for save were never approached by the old timer and that is the other side of the same coin ← what is up doc carrots→ richard iii now is the winter of our discontent made glorious summer by this sun of york and all the cloud that lour would upon our house in the deep bosom of the ocean buried now are our brow bound with victorious wreath our bruised arm hung up for monument our stern alarum changed to merry meeting our dreadful march to delightful measure grim visaged war hath smooth would his wrinkled front and now instead of mounting barded steed to fright the soul of fearful adversary he caper nimbly in a lady is chamber to the lascivious pleasing of a lute but i that am not shaped for sportive trick nor made to court an amorous looking glass i that am rudely stamp would and want love is majesty to strut before a wanton ambling nymph i that am curtail would of this fair proportion cheated of feature by dissembling nature deformed unfinish would sent before my time into this breathing world scarce half made up and that so lamely and unfashionable that dog bark at me a i halt by them why i in this weak piping time of peace have no delight to pas away the time unless to spy my shadow in the sun and descant on mine own deformity and therefore since i cannot prove a lover to entertain these fair well spoken day i am determined to prove a villain may \u200e richard ii this royal throne of king this scepter would isle this earth of majesty this seat of mar this other eden demi paradise this fortress built by nature for herself against infection and the hand of war this happy breed of men this little world this precious stone set in the silver sea which serf it in the office of a wall or a a moat defensive to a house against the envy of le happier land this blessed plot this earth this realm this england this nurse this teeming womb of royal king fear would by their breed and famous by their birth renowned for their deed a far from home for christian service and true chivalry a is the sepulchre in stubborn jewry of the world is ransom blessed mary is son this land of such dear soul this dear dear land dear for her reputation through the world is now leased out i die pronouncing it like to a tenement or pelting farm',
'ha that wa funny but i recall sinebot or it predecessor used to wait a while before butting in perhaps that wa not intentional behaviour but i recall having a few minute to finish and sign and having done so a few time if you can make it do so that would be good p s yeah i could leave this on your user page but really why not just direct people to leave feedback here you are monitoring it after all no reply needed just some feedback',
'no someone buggered it all up with those stupid redirections the player name is walter gordon neilson not gordon neilson and willie neilson what a crock of sh i thought nickname were extra their proper name should never be overuled redirected and his brother is robert thomson neilson not robert neilson i am gonna get all those redirections reversed and get those biography rewritten correctly',
'redirect talk all you need is luv',
'the ccp ha no legitimacy the party is full of coward liar thief and murderer the day that corrupt soulless organization is destroyed will be a glorious day for humankind',
'combine with diy article this article should be combined with the diy article',
'william gaillard accepted rr wa looking at that page myself to report the fella for vandalism he is a manu fan he wa removing cited source i will steer clear of the article',
'multiple version of released from diversity of source',
'you are insulting me and criticizing all of my edits you are the one who need to be blocked from life']
Topic bert transformer tokenization preprocessing nlp
Category Data Science