I am trying to develop an NLP model which takes an input like "you have high levels of cholesterol" (this will be a tag) and has to output something like "you have high levels of cholesterol; you need to have a low-salt diet that emphasizes fruits, vegetables and whole grains; limit the amount of animal fats and use good fats in moderation" (this will be the suggestion; it is an example suggestion from a doctor). So, now when I was researching …
So I'm doing a project on the COVID-19 Tweets dataset from IEEE DataPort, and I plan to analyse the tweets over the period from March 2020 to date. The thing is, there are more than 300 CSVs, each with millions of rows. I need to hydrate all of these tweets before I can go and filter through them. Hydrating just one CSV alone took more than two hours today. I wanted to know if …
I have a labelled dataset of relevant and non-relevant documents, for which I built a boolean document-term matrix. I am trying to develop an algorithm which, given this input, would create a text-based boolean search rule that identifies a subset of the data, favouring sensitivity first and then specificity. I'd like to know of published literature on the topic. I did an initial search but couldn't find anything related. I'd be glad if you could point me to …
In the field of text classification, it is common to use Conv1D filters running over word embeddings and then to get a single value on the output for each filter using GlobalMaxPooling1D. As I understand the process, the convolutional filter is a matrix whose size is $$\text{size of filter matrix} = \text{embedding dim}\times\text{width of the filter}$$ The filter matrix is then applied to the input embeddings (multiplied element by element), which produces a matrix of the same size …
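The filter-and-pooling arithmetic described above can be sketched in plain NumPy (the sequence length, embedding dimension, filter width, and random values are illustrative assumptions, not taken from the question):

```python
import numpy as np

# Assumed illustrative sizes: 6 tokens, 4-dim embeddings, filter width 3.
seq_len, embed_dim, width = 6, 4, 3

rng = np.random.default_rng(0)
E = rng.standard_normal((seq_len, embed_dim))  # word embeddings, one row per token
W = rng.standard_normal((width, embed_dim))    # one Conv1D filter matrix

# Slide the filter over the sequence: at each position, multiply the
# (width x embed_dim) window element by element with W and sum.
feature_map = np.array([(E[i:i + width] * W).sum()
                        for i in range(seq_len - width + 1)])

# GlobalMaxPooling1D then keeps a single value per filter.
pooled = feature_map.max()
print(feature_map.shape, pooled)  # one value per valid position, then one scalar
```

So each filter yields a feature map of length `seq_len - width + 1`, and global max pooling collapses that to one number per filter.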
I would like to extract bank names from a given text, like Wells Fargo, Chase, … Is there a Python library for this? I know there are entity taggers in spaCy and Flair, but they only identify the entity type (org/person).
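One simple baseline is gazetteer matching against a known list of bank names (the list below is a hypothetical starter; a real solution would load a much fuller one, or attach such a list to spaCy's rule-based matching):

```python
import re

# Hypothetical gazetteer; in practice, load a fuller list of bank names.
BANKS = ["Wells Fargo", "Chase", "Bank of America", "Citibank"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, BANKS)) + r")\b",
                     re.IGNORECASE)

def extract_banks(text):
    """Return bank names found in text, normalized to the gazetteer form."""
    canonical = {b.lower(): b for b in BANKS}
    return [canonical[m.group(0).lower()] for m in pattern.finditer(text)]

print(extract_banks("I moved my account from wells Fargo to Chase."))
# -> ['Wells Fargo', 'Chase']
```

Case-insensitive matching handles inputs like "wells Fargo", but this approach only finds names that are in the list, unlike a learned tagger.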
How do I implement multiple filters for checking whether a data cell is in a range? Suppose I have two lists of numbers: range_1 = [70, 15, 5, 7, 3, 7, 8, 3, 2, 63] and range_2 = [50, 56, 80, 61, 83, 87, 13, 58, 43, 24, 84, 54, 64, 36, 48]. I want to check whether any column values exist within these two lists. Any suggestion would be appreciated.
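Assuming the data lives in a pandas DataFrame (the frame below is made up for illustration), the two-list membership check can be done with `isin`:

```python
import pandas as pd

range_1 = [70, 15, 5, 7, 3, 7, 8, 3, 2, 63]
range_2 = [50, 56, 80, 61, 83, 87, 13, 58, 43, 24, 84, 54, 64, 36, 48]

# Hypothetical example frame.
df = pd.DataFrame({"a": [70, 99, 100], "b": [1, 56, 4]})

# Combine both lists into one membership set.
allowed = set(range_1) | set(range_2)

# Boolean mask: True where a cell's value appears in either list.
mask = df.isin(allowed)

# Keep only the rows where any column matches.
rows_with_match = df[mask.any(axis=1)]
print(rows_with_match)
```

`mask.any(axis=1)` flags rows with at least one matching cell; use `mask.all(axis=1)` instead if every column must match.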
I am trying to draw a process flow (like a template) to be followed on text analysis projects. So far, I've come up with this:
Text Analytics Steps
- Data Collection: acquire data; convert data into plain text; remove duplicate entries
- Text Parsing and Extracting Features: tokenization; parsing; remove HTML characters; decode complex symbols to UTF-8; spell check; apostrophe look-up; remove punctuation marks; remove expressions/emojis; split attached words; slang look-up; remove URLs; lemmatization/stemming (normalization of tokens) …
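A few of the parsing steps above can be sketched as one cleaning function (the ordering and the regexes are assumptions; a real pipeline would typically use a library such as NLTK or spaCy for tokenization and lemmatization):

```python
import html
import re

def clean_text(text):
    """Minimal sketch of some of the parsing steps, in an assumed order."""
    text = html.unescape(text)                 # decode HTML entities (&amp; -> &)
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s']", " ", text)      # strip punctuation / emojis
    return text.lower().split()                # whitespace tokenization

print(clean_text("Check <b>this</b> out: https://example.com &amp; enjoy!!"))
# -> ['check', 'this', 'out', 'enjoy']
```

Order matters here: entities must be decoded before tag removal, and URLs must be removed before punctuation stripping would mangle them.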
I am searching for an automated method (ideally a Python package) that produces a score to assess the credibility of a given text (e.g. from a webpage). I am not searching for: text complexity assessments (i.e. how long sentences are and how many difficult words are used), such as Flesch Reading Ease, the SMOG index, Flesch-Kincaid Grade Level, the Coleman-Liau index, the Automated Readability Index, the Dale-Chall readability score, a difficult-words index, the Linsear Write formula, or the Gunning fog index; text coherence (i.e. …
I need to tokenize a corpus of abstracts from an international conference. The abstracts are usually in American English but sometimes in British English. Consequently, I get 2 tokens for "organization" and "organisation", or "color" and "colour". Examples: https://en.oxforddictionaries.com/spelling/british-and-spelling Do you know a (Python) library converting British English to American English (or vice versa)? I would be happy with that … (but I am French and my English is not so good.) Thanks.
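Failing a ready-made library, one workable sketch is a lookup table of British-to-American spellings applied before tokenization (the table below is a hypothetical starter; a real solution needs a much fuller word list, and this version lowercases mapped words, so capitalized variants would need extra handling):

```python
import re

# Hypothetical starter table; extend with a full UK->US spelling list.
UK_TO_US = {
    "organisation": "organization",
    "colour": "color",
    "analyse": "analyze",
}

def americanize(text):
    """Replace known British spellings with their American forms."""
    def repl(match):
        word = match.group(0)
        return UK_TO_US.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, text)

print(americanize("The organisation changed its colour scheme."))
# -> "The organization changed its color scheme."
```

Running this normalization before tokenization means "organisation" and "organization" collapse into a single token.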
Let's say I have the following text: "Is that another kitten playing in the shoes in the top right?" I would like my code to extract "kitten" from that text. Is there any list of animal names readily available?
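Given some animal word list, extraction reduces to set membership over the tokens (the small set below is a stand-in; a larger list could, for example, be derived from WordNet's hyponyms of "animal" via NLTK):

```python
# Hypothetical small gazetteer; swap in a full animal-name list in practice.
ANIMALS = {"kitten", "cat", "dog", "puppy", "horse"}

def find_animals(text):
    """Return the animal names appearing in the text, lowercased."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w in ANIMALS]

print(find_animals("Is that another kitten playing in the shoes in the top right?"))
# -> ['kitten']
```

This only handles single-word names; multi-word names (e.g. "polar bear") would need phrase matching instead of per-token lookup.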
We've got a list of approximately 18,000 product names (they're from 80-90 sources, so there are quite a few that are similar but not duplicates; these were picked as DISTINCT from a table). Unfortunately, there are different ways of expressing these names, and we have to try to normalize the dataset so we present our users with more meaningful names. For example, a list like this: Canon EOS 5D Mark III Canon EOS 5D mk III Canon EOS 5DMK3 Canon EF 70-200mm …
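One baseline for this kind of normalization is fuzzy string grouping: canonicalize each name a little, then greedily cluster names whose similarity exceeds a threshold (the "mark" -> "mk" rule and the 0.8 threshold are assumptions to illustrate the idea, not a general solution):

```python
from difflib import SequenceMatcher

names = ["Canon EOS 5D Mark III", "Canon EOS 5D mk III", "Canon EOS 5DMK3"]

def normalize(s):
    # Collapse case/spacing and one assumed abbreviation rule.
    return s.lower().replace(" ", "").replace("mark", "mk")

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Greedy grouping: put each name into the first group whose representative
# is similar enough; otherwise start a new group. Threshold is tunable.
groups = []
for name in names:
    for group in groups:
        if similarity(name, group[0]) > 0.8:
            group.append(name)
            break
    else:
        groups.append([name])

print(groups)  # all three variants land in one group
```

At 18,000 names, pairwise comparison is still feasible, but blocking on a key (e.g. the first token, here "Canon") keeps it fast; dedicated libraries for fuzzy matching or record linkage would be the next step up.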