Remove all characters following a certain character in a column of a dataset

I have a data set like the following, in which the first column contains the groupings. However, some are labelled slightly differently, and I need to remove all characters following the punctuation used (bracket, semicolon, comma).

groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex")

I would like all of these to appear just as Group1 (the same as the first two), i.e. to remove all characters in the string following the punctuation. I then need …
Category: Data Science
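
The question is posed in R, but the underlying idea is one regular expression that cuts each string at the first bracket, semicolon or comma. A minimal sketch in Python, for illustration only: the function name `strip_after_punctuation` is hypothetical, and the delimiter set is assumed to be just the three characters named in the question.

```python
import re

def strip_after_punctuation(label):
    # Split once at the first "(", ";" or "," and keep the part before it.
    return re.split(r"[(;,]", label, maxsplit=1)[0]

groups = ["Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex"]
cleaned = [strip_after_punctuation(g) for g in groups]
print(cleaned)
```

The same character class should carry over to R, for example with sub("[(;,].*", "", groups).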

How to interpret a specific feature importance?

Apologies for a very case-specific question. I have a dataset of genes, on which I am using machine learning to predict whether a gene causes a disease. One of my features is a beta value (the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use it. I condense the beta values from the variant level to the gene level, so a gene is left …
Category: Data Science

Sequence Embedding using embedding layer: how does the network architecture influence it?

I want to obtain a dense vector representation of protein sequences so that I can meaningfully represent them in an embedding space. We can consider them as sequences of letters, in particular there are 21 unique symbols which are the amino acids (for example: MNTQILVFIACVLIEAKGDKICL). My approach is to use a sequence embedding that can be learned as a part of a deep learning model (built with Python using Keras libraries), that is a classifier (supervised) neural network which I …
Category: Data Science

Understanding how Long Short-Term Memory works in classification of sequences of symbols

I want to use an LSTM neural network to classify protein sequences according to their host species. For example, I have these sequences of letters (a toy example, just to understand):

MNTQILVFIACVLIEAKGDKICL belongs to human
AKGDKICLMNTQILVFIACVLIE belongs to human
MNTQAKGDKICLILVFIACVLIE belongs to dog

The sequences differ only in the position of the subsequence AKGDKICL, and my network should learn to recognize this. Is an LSTM network able to do this? I am trying to focus on the …
Category: Data Science

012 matrix with multiple alleles

Edit: I see now that there is a bioinformatics site on StackExchange. I am not familiar with the technical way to move my post there, but it fits much better there. I am using VcfTools to parse VCF files, and I can use it to generate a 012 matrix. This matrix is 2D, with shape (number of individuals, number of SNPs). Each cell in the matrix holds the number of occurrences of the alternative allele for the specific …
Category: Data Science
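
To make the 012 encoding concrete, here is a small sketch that turns a VCF-style genotype string into a count of non-reference alleles. This is not VcfTools' own code: the function name `alt_allele_count` is hypothetical, and using -1 for missing calls and counting every non-zero allele index as "alternative" are assumptions made for illustration.

```python
import re

def alt_allele_count(genotype):
    """Count non-reference alleles in a genotype string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", genotype)
    if any(a == "." for a in alleles):
        return -1  # assumed marker for a missing call
    # Any allele index other than "0" is treated as an alternative allele.
    return sum(1 for a in alleles if a != "0")

row = [alt_allele_count(gt) for gt in ["0/0", "0/1", "1|1", "./.", "1/2"]]
print(row)
```

Under this counting rule a multi-allelic call such as 1/2 collapses to 2, so the identity of the alternative alleles is lost. A separate count per alternative allele would be needed to keep it.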

What are some strategies to deal with label sparsity when training a protein function prediction model?

The protein function prediction task requires you to take a sequence of amino acids (think words in a sentence, but with a vocabulary of only 20 words) and output the functions that the protein can perform. There are around 30 thousand labels for protein function, and these labels are not mutually exclusive, so protein function prediction is essentially a huge multitask binary prediction problem. Now the catch is that some labels are very common and others are very rare, and overall a protein is …
Category: Data Science

Sequence to Sequence learning applied to list of numbers

I am looking to apply ML methods to genetic data. My goal is to predict which rare (generally de novo) mutations a person has based on which non-rare (generally inherited) mutations they have. I have worked on this mutation data before, and stored the mutation data as one-hot vectors: a person X can have mutation Y zero times, once on chromatid A, once on chromatid B, or once on each chromatid. This is represented as {'0|0', '0|1', '1|0', '1|1'}. The target data …
Category: Data Science
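
The one-hot representation described above can be sketched as follows. The ordering of the four states in `GENOTYPES` and the helper name `one_hot` are assumptions; the question only lists the states as a set.

```python
# Assumed fixed ordering of the four phased genotype states.
GENOTYPES = ["0|0", "0|1", "1|0", "1|1"]

def one_hot(genotype):
    """Return a length-4 vector with a single 1 at the genotype's index."""
    vec = [0] * len(GENOTYPES)
    vec[GENOTYPES.index(genotype)] = 1
    return vec

# One person, three mutation sites (made-up values for illustration).
person = ["0|0", "1|0", "1|1"]
encoded = [one_hot(g) for g in person]
print(encoded)
```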

Best model for Antimicrobial Resistance (AMR) prediction?

Some classes of problem are best solved by a specific class of machine learning model, due to the structure of the data (e.g. CNNs for computer vision tasks). Prediction of bacterial resistance/susceptibility to antimicrobials (from genotypic data) using machine learning methods is a problem that has started receiving interest in recent years. The following paper (from 2017) analysed the then-current literature and found that: To date, there has not been a consensus about the optimal machine learning model to …
Category: Data Science

How to Implement Biological Neuron Activations in Artificial Neural Networks

In artificial neural networks, activation functions are used for neurons, e.g. the sigmoid activation, σ(x) = 1 / (1 + e^(-x)), which can be implemented in Python as:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

How can we implement a biological activation function, such as the Hodgkin-Huxley model, whose mathematical form is: [equation not shown] Where: Cm: capacitance; Vm: membrane potential. As mentioned on the Wikipedia page, the typical Hodgkin–Huxley model treats each component of an excitable cell as an electrical element (as shown in the figure). …
Category: Data Science
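
Unlike the sigmoid, the Hodgkin-Huxley model is a system of differential equations in time, so it is usually simulated step by step, not evaluated as a static function. Below is a minimal forward-Euler sketch. The parameter values and rate functions are the commonly quoted textbook ones (resting potential near -65 mV) and are assumptions, not taken from the question; the names `rates` and `simulate_hh` are hypothetical.

```python
import math

# Commonly quoted textbook parameters (assumed, not from the question).
C_M = 1.0                               # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3       # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387   # reversal potentials, mV

def rates(v):
    """Opening/closing rates (1/ms) for gates m, h, n at voltage v (mV).
    The m and n expressions are singular at exactly v = -40 and v = -55;
    this sketch ignores that corner case."""
    a_m = 0.1 * (v + 40.0) / (1.0 - math.exp(-(v + 40.0) / 10.0))
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * (v + 55.0) / (1.0 - math.exp(-(v + 55.0) / 10.0))
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return (a_m, b_m), (a_h, b_h), (a_n, b_n)

def simulate_hh(i_ext, t_max=50.0, dt=0.01, v0=-65.0):
    """Integrate the membrane voltage under a constant input current
    i_ext (uA/cm^2) and return the voltage trace in mV."""
    v = v0
    # Start each gate at its steady-state value for the initial voltage.
    (a_m, b_m), (a_h, b_h), (a_n, b_n) = rates(v)
    m = a_m / (a_m + b_m)
    h = a_h / (a_h + b_h)
    n = a_n / (a_n + b_n)
    trace = []
    for _ in range(int(t_max / dt)):
        (a_m, b_m), (a_h, b_h), (a_n, b_n) = rates(v)
        i_na = G_NA * m ** 3 * h * (v - E_NA)   # sodium current
        i_k = G_K * n ** 4 * (v - E_K)          # potassium current
        i_l = G_L * (v - E_L)                   # leak current
        v += dt * (i_ext - i_na - i_k - i_l) / C_M
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        trace.append(v)
    return trace

spiking = simulate_hh(10.0)   # sustained input current
resting = simulate_hh(0.0)    # no input
print(max(spiking) > 0.0, max(resting) < -50.0)
```

One way to use such a neuron inside a network would be to summarise the trace, for instance as a spike count or firing rate for a given input. That is a design choice and not part of the model itself.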

Remove part of string in R

I have a table in R with just two columns and many rows. Each element is a string that contains some characters and some numbers. I need the numeric part of each element. How can I extract it? For example:

  INTERACTOR_A INTERACTOR_B
1 ce7380       ce6058
2 ce7380       ce13812
3 ce7382       ce7382
4 ce7382       ce5255
5 ce7382       ce1103
6 ce7388       ce523
7 ce7388       ce8534

Thanks
Category: Data Science
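
The question concerns R, but the core step is a regular expression that deletes every non-digit character. A minimal Python sketch for illustration; the function name `numeric_part` is hypothetical, and returning the digits as an integer is an assumption about the desired output.

```python
import re

def numeric_part(identifier):
    # Remove every character that is not a digit, then convert to int.
    return int(re.sub(r"\D", "", identifier))

interactor_a = ["ce7380", "ce7380", "ce7382"]
numbers = [numeric_part(x) for x in interactor_a]
print(numbers)
```

In R, the same pattern idea should work with gsub("[^0-9]", "", x), applied to each column.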

How do I solve a "TypeError: __array__() takes 1 positional argument but 2 were given" Keras error?

I am trying to build a multi-input CNN using Keras/Tensorflow. I have 5000 'smile' training inputs, which are 1D arrays (shape = (100,)) with a maximum length of 100. I have 5000 'protein' training inputs, which are also 1D arrays (shape = (1500,)) with a maximum length of 1500. I have the following data types and shapes:

#type(test_protein) #numpy.ndarray of <class 'numpy.ndarray'> #int32
#type(val_protein) #numpy.ndarray of <class 'numpy.ndarray'> #int32
#type(train_protein) #numpy.ndarray of <class 'numpy.ndarray'> #int32
#type(train_smile) #numpy.ndarray of <class 'numpy.ndarray'> #int32 …
Category: Data Science

Bioinformatics add-on

I would like to ask why there are no Volcano plot or MA plot widgets in the bioinformatics add-on. I can see a Volcano plot in the YouTube video tutorial (two years old, version 3.3), but in the latest version the Volcano plot functionality has disappeared. My question is why, and where has it gone?
Category: Data Science

What is important for Pharmaceutical companies to answer with Big Data Analysis?

I am a data scientist with some biological background (genetics). I have been asked to give a talk for our customers from the pharmaceutical industry, showing them how they can benefit from Big Data tools such as Spark to get value out of their data. I should build the talk around an example. Does anyone know what the best example for the pharmaceutical industry would be? I mean, which data or challenges would be best to tackle? Thanks
Category: Data Science

How to decide on using xgboost with imputation or without it and keeping missing values?

I have a large genetic dataset on which I am using xgboost to score the most likely disease-causing genes, giving each gene a likelihood score between 0 and 1. I try to avoid features with a lot of missing data, but this can be hard for genetic data; in the worst case, roughly half of the values in a feature column are missing. Currently I run my xgboost model in 2 versions, one with …
Category: Data Science

What kind of research can be done with genomic data?

It is well known that science has given us large amounts of freely accessible data, such as https://www.1000genomes.org and https://www.ncbi.nlm.nih.gov/genbank. How can we play around with the data and apply data science/machine learning to it? What could be some ideas? My own ideas:

Biological data visualisation
Gene prediction using hidden Markov models

Any more?
Category: Data Science

How to make a classification problem into a regression problem?

I have data describing genes, each of which gets 1 of 4 labels. I use this to train models to predict labels for other, unlabelled genes. I have a huge class imbalance, with 10k genes in 1 label and 50-100 genes in the other 3 labels. Due to this imbalance, I'm trying to change my labels into numeric values so that a model predicts a score rather than a label, to reduce bias. Currently from my 4 labels (of most likely, likely, possible, …
Category: Data Science

FRR and FAR calculation for a multiclass biometric face recognition system

I am implementing a face recognition system using facenet and an SVC ML algorithm. I have 20 or more classes and I'm getting 98% accuracy. I'm trying to calculate the FAR, the FRR and the EER. I'm assuming that the threshold is the prediction probability; is that correct? This is the code for calculating the FP, FN, TP and TN:

FP = matrics.sum(axis=0) - np.diag(matrics)
FN = matrics.sum(axis=1) - np.diag(matrics)
TP = np.diag(matrics)
TN = matrics.sum() - (FP + FN + TP) …
Category: Data Science
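
Given per-class counts like those above, one common one-vs-rest reading is FAR = FP / (FP + TN) and FRR = FN / (FN + TP). The sketch below applies that reading to a small made-up confusion matrix. It assumes rows are true classes and columns are predicted classes, and it does not cover the threshold sweep needed for an EER.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows: true class, columns: predicted).
matrics = np.array([
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
])

FP = matrics.sum(axis=0) - np.diag(matrics)   # predicted as class k, but not k
FN = matrics.sum(axis=1) - np.diag(matrics)   # truly class k, but missed
TP = np.diag(matrics)
TN = matrics.sum() - (FP + FN + TP)

far = FP / (FP + TN)   # false acceptance rate per class
frr = FN / (FN + TP)   # false rejection rate per class
print(far, frr)
```

The EER is the operating point where FAR and FRR are equal, so finding it would mean recomputing these rates over a range of decision thresholds on the classifier's scores.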

Clustering analysis for observations with lists as data

So I have several samples analyzed for their chemical composition. After data analysis, for each sample, I have a list of the compounds found and their corresponding relative abundances. Some compounds are unique, but most are found in most samples. I want to do a clustering analysis based on these lists of compounds. How do I go about this? Specifically, how do I vectorize my dataset, since each observation is actually an array with both numerical (abundance) and categorical (compound label) variables?
Category: Data Science
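
One possible way to vectorize such data is to give each compound its own column, filled with the sample's relative abundance, or 0 when the compound is absent. A minimal sketch under that assumption; the sample names, compound names, abundances and the helper `abundance_matrix` are all hypothetical.

```python
def abundance_matrix(samples):
    """Turn {sample: {compound: abundance}} into fixed-length vectors."""
    # One column per compound seen in any sample, in sorted order.
    compounds = sorted({c for profile in samples.values() for c in profile})
    rows = {
        name: [profile.get(c, 0.0) for c in compounds]
        for name, profile in samples.items()
    }
    return compounds, rows

# Made-up data for illustration.
samples = {
    "sample_A": {"caffeine": 0.7, "glucose": 0.3},
    "sample_B": {"glucose": 0.5, "vanillin": 0.5},
}
compounds, rows = abundance_matrix(samples)
print(compounds)
print(rows)
```

Each row is then a purely numerical vector of equal length, which is the input form most clustering methods expect.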

How to compare genetic profiles or vcf files in Python?

I have hundreds of VCF files, where each file contains the genome profile for a tissue. A portion of a VCF file is as follows: I can read each VCF file into a dataframe, so there would be hundreds of dataframes. Each VCF file/dataframe contains hundreds of columns and 40-50 thousand rows. I want to see the difference in the ALT column for each profile (VCF file/dataframe) on the CHROM, POS, ID and REF columns. What would be the best way …
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.