Predicting Disease Drugs

I have a dataset in the format:

Keywords                                                         Disease/Drugs
bradycardia, insomnia, hypotension, hearinglos...                 NSAIDS Poisoning
vomiting, nausea, diarrhea, seizure, edema, an...                 NSAIDS Poisoning

pancreatitis, gi, symptoms, restlessness, leuk...                 Chronic abacavir use (Nucleoside Analog Revers..
ards, apnea, hepatotoxicity, dyspnea, pulmonar...                 Chronic stavudine and didanosine use (Nucleosi...
    

There are many more rows, all in this format.

I converted the data above by splitting the Keywords column on commas and exploding it, so that each keyword gets its own row:

Keywords                          Disease/Drugs
bradycardia                        NSAIDS Poisoning
insomnia                           NSAIDS Poisoning

pancreatitis                       Chronic stavudine and didanosine use (Nucleosi...

I then built a prediction system using DecisionTreeClassifier, after encoding the input column Keywords.

Also, I found the top 10 predictions using:

p_probability = model.predict_proba([[t]])
best_n = np.argsort(p_probability, axis=1)[:,-10:]   

When I input a single symptom such as bradycardia, it shows the 10 best predictions.

When I input a list of 5 symptoms, it shows 50 predictions (10 per symptom).

Since the symptoms in a list can share diseases/drugs, I want a system that, given a list of any number of symptoms, shows only the 10 best predictions overall.

Topic classifier decision-trees scikit-learn python predictive-modeling

Category Data Science


You mentioned that 5 symptoms give you 50 disease predictions.

In your use case, one symptom can match many diseases once you get the predictions from your ML algorithm (naive Bayes or decision tree).

To always get the top 10 predictions no matter how many symptoms are given as input, use np.unique to get the frequency count of each unique prediction, then use np.argsort to sort by frequency count and take the top 10.

Based on your code:


p_probability = model.predict_proba([[t]])
best_n = np.argsort(p_probability, axis=1)[:,-10:]   

Assume p_probability for 5 symptoms has been turned into an array of 50 predicted labels (predict_proba itself returns probabilities, so the labels have to be looked up through model.classes_), in the format below:

     array(['NSAIDS Poisoning', 'Chronic stavudine and didanosine use', ..., 'NSAIDS Poisoning', 'NSAIDS Poisoning', 'Chronic abacavir use'])

Here 'Chronic abacavir use' is the 50th prediction.


import numpy as np

# disease_prediction_list holds all predicted disease labels
# (assumed here to be stored in p_probability, as described above)
disease_prediction_list = p_probability

# with return_counts=True, np.unique returns two arrays:
# the unique values and the frequency count of each one
unique_value, freq_count = np.unique(disease_prediction_list, return_counts=True)

# indices that sort the frequency counts in descending order
freq_count_sort_index = np.argsort(-freq_count)

# predictions sorted by frequency count, most frequent first
frequency_sorted_prediction = unique_value[freq_count_sort_index]

# top 10 predictions: the first 10 entries, because the sort is descending
top_10_prediction = frequency_sorted_prediction[:10]
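
As a self-contained sketch of the same frequency-count idea, the snippet below starts from a probability matrix with one row per input symptom, maps the best column indices back to labels, and then counts them. The label names, the probability values, and the helper name top_n_by_frequency are all made up for illustration; with a real model, classes would be model.classes_ and the matrix would come from predict_proba.

```python
import numpy as np

# Hypothetical class labels, standing in for model.classes_
classes = np.array(["disease_a", "disease_b", "disease_c", "disease_d"])

# Hypothetical predict_proba output: one row per input symptom
p_probability = np.array([
    [0.1, 0.6, 0.3, 0.0],
    [0.5, 0.4, 0.0, 0.1],
    [0.0, 0.7, 0.2, 0.1],
])

def top_n_by_frequency(p_probability, classes, per_symptom=2, n=2):
    """Return the n labels that appear most often among the
    per_symptom most probable labels of each row."""
    # column indices of the most probable classes for each symptom
    best = np.argsort(p_probability, axis=1)[:, -per_symptom:]
    # map indices to labels and flatten into one list of predictions
    labels = classes[best].ravel()
    unique_value, freq_count = np.unique(labels, return_counts=True)
    # sort by descending frequency and keep the first n
    order = np.argsort(-freq_count, kind="stable")
    return unique_value[order][:n].tolist()

result = top_n_by_frequency(p_probability, classes)
```

With per_symptom=10 and n=10 this would give 10 predictions overall regardless of how many symptoms are passed in. Note that counting ignores how confident each individual prediction was.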



I guess this is a special NLP problem where you basically deal with "translating" $x$ into $y$. Thus, you could look at "sequence-to-sequence" learning (neural translation model), where you try to "translate" a keyword or a set of keywords into a drug:

bradycardia -> NSAIDS Poisoning

There are a number of useful resources on this from Keras/TensorFlow.


You should prepare your training data differently. By exploding the keywords into separate rows, you lose the information about which symptoms occur together for a Disease/Drug.

For example, a patient with nausea + insomnia -> Sleep disorder, whereas a patient with nausea + diarrhea -> Food poisoning.

Given your dataset, you should one-hot encode the keywords (one binary column per keyword, one row per record) and use them as features to train your model.


Then, given a new patient's list of symptoms, encode it with the same one-hot columns that were used for training.

Finally, predict with your model, same as before:

p_probability = model.predict_proba([[t]])
best_n = np.argsort(p_probability, axis=1)[:,-10:] 

By this method:

  1. You can input a list of symptoms to your model
  2. You get top 10 results for each prediction
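
A minimal sketch of this approach, assuming scikit-learn's MultiLabelBinarizer is used for the encoding (the answer does not name a specific encoder). The three training rows are toy data built from the examples in this thread, and the helper top_n is a hypothetical name.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

# Toy training data: one list of keywords per record (not exploded)
keyword_lists = [
    ["nausea", "insomnia"],
    ["nausea", "diarrhea"],
    ["bradycardia", "insomnia", "hypotension"],
]
labels = ["Sleep disorder", "Food poisoning", "NSAIDS Poisoning"]

# One binary column per keyword, one row per record
encoder = MultiLabelBinarizer()
X = encoder.fit_transform(keyword_lists)

model = DecisionTreeClassifier(random_state=0).fit(X, labels)

def top_n(symptoms, n=10):
    """Return up to n (label, probability) pairs for one list of symptoms."""
    # drop keywords the encoder never saw during training
    known = set(encoder.classes_)
    row = encoder.transform([[s for s in symptoms if s in known]])
    proba = model.predict_proba(row)[0]
    # indices of the n highest probabilities, best first
    order = np.argsort(proba)[::-1][:n]
    return [(model.classes_[i], float(proba[i])) for i in order]

predictions = top_n(["nausea", "insomnia"])
```

Because the whole symptom list becomes a single feature row, predict_proba returns one row of probabilities, so the top 10 is taken once instead of once per symptom.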

I understand that you have the keywords of a sickness and the drug that was given to that patient. Given the state of the question, I sincerely recommend that you start without any ML and just do some basic statistics.

If you want to see the top 10 drugs for bradycardia, probably the best approach is a frequency count of how often each drug appears with it. That way you should be able to find the most frequent drugs for treating bradycardia.
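
Such a frequency count could look like the sketch below, which uses only the standard library. The (keyword, drug) pairs and the drug names are placeholders, not real data, and most_frequent_labels is a hypothetical helper name.

```python
from collections import Counter

# Placeholder exploded records: (keyword, disease_or_drug) pairs
records = [
    ("bradycardia", "drug_a"),
    ("bradycardia", "drug_a"),
    ("bradycardia", "drug_b"),
    ("insomnia", "drug_a"),
]

def most_frequent_labels(records, keyword, n=10):
    """Count how often each label occurs with the keyword, most frequent first."""
    counts = Counter(label for kw, label in records if kw == keyword)
    return counts.most_common(n)

top_for_bradycardia = most_frequent_labels(records, "bradycardia")
```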

Given a list of symptoms, find the closest previous case and the drug that was offered for it. Treat it as a query: if a past case has the same symptoms as your new patient, you might want to recommend the same drug. Then rank the remaining past cases by how close they are to your query.
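
One possible way to rank past cases by closeness is sketched below. The answer does not say how to measure closeness; Jaccard similarity between symptom sets is an assumption made here, and the past cases are toy examples taken from this thread. The function names are hypothetical.

```python
def jaccard(a, b):
    """Share of symptoms two cases have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy past cases: (symptom list, disease or drug)
past_cases = [
    (["nausea", "insomnia"], "Sleep disorder"),
    (["nausea", "diarrhea"], "Food poisoning"),
    (["bradycardia", "insomnia", "hypotension"], "NSAIDS Poisoning"),
]

def rank_past_cases(symptoms, past_cases, n=10):
    """Return up to n (similarity, label) pairs, closest case first."""
    scored = [(jaccard(symptoms, kws), label) for kws, label in past_cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]

ranking = rank_past_cases(["nausea", "diarrhea", "vomiting"], past_cases)
```

An exact match scores 1.0, so a past case with the same symptoms as the new patient always ranks first.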

That works without ML. With ML, you need to clean your dataset properly and then build a ranking system. To start, I recommend a point-wise ranking system.

Still, I believe you should first try without ML.
