bioinformatics

Remove all characters following a certain character in a column of a dataset

NewtoR

June 5, 2022, 02:05

I have a data set like the following, in which the first column contains the groupings. However, some are labelled slightly differently. I need to remove all characters following the punctuation used (bracket, semicolon, comma): groups <- c("Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex"). I would like all of these to be presented just as Group1 (so they would all appear the same as the first two), i.e. to remove all characters in the string following the punctuation. I then need …

Topic: regex bioinformatics r

Category: Data Science
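
For the question above (dropping everything from the first bracket, semicolon or comma onward), a minimal sketch follows. The question itself is about R; Python is used here only for illustration, the helper name `strip_after_punctuation` is hypothetical, and the pattern assumes the only delimiters are `(`, `;` and `,`.

```python
import re

# Delimiters assumed from the question: opening bracket, semicolon, comma.
# The pattern matches the first delimiter and everything that follows it.
_AFTER_DELIMITER = re.compile(r"[(;,].*")


def strip_after_punctuation(label: str) -> str:
    """Return the label with the first delimiter and all later text removed."""
    return _AFTER_DELIMITER.sub("", label)


groups = ["Group1", "Group1", "Group1;Group1", "Group1(subset)", "Group1,ex"]
cleaned = [strip_after_punctuation(g) for g in groups]
print(cleaned)
```

The same character-class pattern should carry over to R's `sub()`, although that has not been checked against the asker's actual data.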

How to interpret a specific feature importance?

DN1

June 4, 2022, 09:07

Apologies for a very case specific question. I have a dataset of genes, with which I am using machine learning to predict if a gene causes a disease. One of the features I have is a beta value (which is the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use this feature. I condense the beta values from the variant level to the gene level, so a gene is left …

Topic: bioinformatics feature-selection machine-learning

Category: Data Science

Sequence Embedding using embedding layer: how does the network architecture influence it?

HelpNeederStudent

February 16, 2022, 17:37

I want to obtain a dense vector representation of protein sequences so that I can meaningfully represent them in an embedding space. We can consider them as sequences of letters, in particular there are 21 unique symbols which are the amino acids (for example: MNTQILVFIACVLIEAKGDKICL). My approach is to use a sequence embedding that can be learned as a part of a deep learning model (built with Python using Keras libraries), that is a classifier (supervised) neural network which I …

Topic: embeddings bioinformatics sequence deep-learning nlp

Category: Data Science

Understanding how Long Short-Term Memory works in classification of sequences of symbols

HelpNeederStudent

February 7, 2022, 04:10

I want to use an LSTM neural network to classify protein sequences according to the host species. For example, I have these sequences of letters (toy example, just to understand): MNTQILVFIACVLIEAKGDKICL belongs to human; AKGDKICLMNTQILVFIACVLIE belongs to human; MNTQAKGDKICLILVFIACVLIE belongs to dog. The sequences differ only in the position of the subsequence AKGDKICL, and my network should learn to recognize this. Is an LSTM network able to do this? I am trying to focus on the …

Topic: text-classification lstm bioinformatics neural-network nlp

Category: Data Science

012 matrix with multiple alleles

Shaq

December 26, 2021, 11:30

Edit: I see now that there is a bioinformatics site on StackExchange. I am not familiar with the technical way to move my post there, but it fits there much better. I am using VcfTools to parse VCF files. I can use it to generate a 012 matrix. This matrix is 2D, with shape (number of individuals, number of SNPs). Each cell in the matrix holds the number of occurrences of the alternative allele for the specific …

Topic: bioinformatics

Category: Data Science

What are some strategies to deal with label sparsity when training a protein function prediction model?

luongminh97

November 28, 2021, 00:56

The protein function prediction task requires you to take a sequence of amino acids (think words in a sentence, but with a vocabulary of only 20 words) and output the functions that protein can take. There are around 30 thousand labels for protein function, and these labels are not mutually exclusive, so protein function prediction is essentially a huge binary prediction multitask. Now the catch is some labels are very common, and others are very rare, and overall a protein is …

Topic: bioinformatics class-imbalance

Category: Data Science

Sequence to Sequence learning applied to list of numbers

Whitehot

August 23, 2021, 12:30

I am looking to apply ML methods to genetic data. My goal is to predict which rare (generally de novo) mutations a person has based on which non-rare (generally inherited) mutations they have. I have worked on this mutation data before, and stored the mutation data as one-hot vectors: a person X can have mutation Y zero times, once on chromatid A, once on chromatid B, or once on each chromatid. This is represented as {'0|0', '0|1', '1|0', '1|1'}. The target data …

Topic: sequence-to-sequence bioinformatics categorical-data machine-learning

Category: Data Science
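
The one-hot representation described in the question above can be sketched as follows. This is only an illustration: the ordering of the four states and the helper name `one_hot` are assumptions, not the asker's actual encoding.

```python
# The four phased genotype states listed in the question.
# The ordering below is an arbitrary choice made for this sketch.
GENOTYPES = ("0|0", "0|1", "1|0", "1|1")
INDEX = {genotype: i for i, genotype in enumerate(GENOTYPES)}


def one_hot(genotype: str) -> list[int]:
    """Encode one genotype string as a length-4 one-hot vector."""
    vector = [0] * len(GENOTYPES)
    vector[INDEX[genotype]] = 1  # raises KeyError for unknown states
    return vector


# Hypothetical person with three mutation sites.
person = ["0|0", "1|0", "1|1"]
encoded = [one_hot(g) for g in person]
print(encoded)
```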

Best model for Antimicrobial Resistance (AMR) prediction?

iacob

August 20, 2021, 11:41

Some classes of problem are best solved by a specific class of machine learning model, due to the structure of the data (e.g. CNNs for computer vision tasks). Prediction of bacterial resistance/susceptibility to antimicrobials (from genotypic data) using machine learning methods is a problem that has started receiving interest in recent years. The following paper (from 2017) analysed the then-current literature and found that: To date, there has not been a consensus about the optimal machine learning model to …

Topic: data-science-model model-selection bioinformatics supervised-learning machine-learning

Category: Data Science

How to Implement Biological Neuron Activations in Artificial Neural Networks

Larry

July 31, 2021, 23:51

In artificial neural networks, activation functions are used for neurons, e.g. the sigmoid activation, which can be implemented in code (in Python) as: def sigmoid(x): return 1 / (1 + math.exp(-x)) How can we implement a biological activation function, such as the Hodgkin-Huxley model, whose mathematical form is: Where: Cm: Capacitance; Vm: Membrane potential. As mentioned on the Wikipedia page, The typical Hodgkin–Huxley model treats each component of an excitable cell as an electrical element (as shown in the figure). …

Topic: activation-function implementation bioinformatics neural-network python

Category: Data Science
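
The sigmoid snippet quoted in the question above omits its import, and `math.exp(-x)` overflows for large negative `x`. A self-contained sketch that avoids this is shown below; the branch on the sign of `x` is an addition for this example, not part of the original post, and it does not attempt the Hodgkin-Huxley model.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        # exp(-x) is at most 1 here, so this cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form exp(x) / (1 + exp(x)).
    e = math.exp(x)
    return e / (1.0 + e)


print(sigmoid(0.0), sigmoid(2.0), sigmoid(-2.0))
```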

Remove part of string in R

mahtab

July 1, 2021, 20:53

I have a table in R with just two columns and many rows. Each element is a string that contains some characters and some numbers. I need only the number part of each element. How can I extract it? For example:

  INTERACTOR_A INTERACTOR_B
1 ce7380       ce6058
2 ce7380       ce13812
3 ce7382       ce7382
4 ce7382       ce5255
5 ce7382       ce1103
6 ce7388       ce523
7 ce7388       ce8534

Thanks

Topic: bioinformatics dataset r

Category: Data Science
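
One way to pull out the numeric part of identifiers like those in the question above is sketched here. The question concerns R; Python is used only for illustration, the helper `numeric_part` is hypothetical, and it assumes each identifier contains a single run of digits.

```python
import re


def numeric_part(identifier: str) -> int:
    """Return the first run of digits in the identifier as an integer."""
    match = re.search(r"\d+", identifier)
    if match is None:
        raise ValueError(f"no digits found in {identifier!r}")
    return int(match.group())


# First three rows of the example table from the question.
rows = [("ce7380", "ce6058"), ("ce7380", "ce13812"), ("ce7382", "ce7382")]
numbers = [(numeric_part(a), numeric_part(b)) for a, b in rows]
print(numbers)
```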

How do I solve a "TypeError: array() takes 1 positional argument but 2 were given" Keras error?

Gabriel

June 21, 2021, 11:07

I am trying to build a multi-input CNN using Keras/TensorFlow. I have 5000 'smile' training inputs, which are 1D arrays (shape = (100,)) with a maximum length of 100. I have 5000 'protein' training inputs, which are also 1D arrays (shape = (1500,)) with a maximum length of 1500. I have the following data types and shapes: #type(test_protein)#numpy.ndarray of <class 'numpy.ndarray'> #int32 #type(val_protein)#numpy.ndarray of <class 'numpy.ndarray'> #int32 #type(train_protein)#numpy.ndarray of <class 'numpy.ndarray'> #int32 #type(train_smile)#numpy.ndarray of <class 'numpy.ndarray'> #int32 …

Topic: cnn keras tensorflow bioinformatics neural-network

Category: Data Science

Bioinformatics widget

Fatty

December 13, 2020, 03:35

Why is there no Volcano-plot or MA-plot in the bioinformatics widget?

Topic: orange bioinformatics

Category: Data Science

Bioinformatics add-on

PePr

December 13, 2020, 01:50

I would like to ask why there are no Volcano-plot or MA-plot widgets in the bioinformatics add-on. I can see them in the YouTube video tutorial (from two years ago, version 3.3), but in the latest version the Volcano-plot functionality has disappeared. My question is: why, and where has it gone?

Topic: orange3 orange bioinformatics

Category: Data Science

What is important for Pharmaceutical companies to answer with Big Data Analysis?

Rebecca

June 22, 2020, 11:24

I am a data scientist, and I have some biological background (genetics). I have been asked to give a talk for our customers from the pharmaceutical industry. I should show them how they can benefit from big data tools such as Spark to get value out of their data, and I should do this with an example. Does anyone know what the best example for the pharmaceutical industry would be? I mean, what data or challenges would be best to tackle? Thanks

Topic: bioinformatics apache-spark dataset bigdata

Category: Data Science

How to decide on using xgboost with imputation or without it and keeping missing values?

DN1

June 16, 2020, 16:56

I have a large genetic dataset that I am using xgboost on to score the most likely disease-causing genes, giving each gene a likelihood score between 0 and 1. I try to avoid features with a lot of missing data, but this can be hard for genetic data; the largest amount of missingness I have for a feature is roughly half of the values in a feature column. Currently I run my xgboost model in 2 versions, one with …

Topic: missing-data xgboost bioinformatics regression machine-learning

Category: Data Science

What kind of research can be done with genomic data?

SmallChess

April 16, 2020, 07:36

It is well known that science has given us large amounts of freely accessible data, such as https://www.1000genomes.org and https://www.ncbi.nlm.nih.gov/genbank. How can we play around with the data and apply data science/machine learning to it? What could be some ideas? My own ideas: biological data visualisation; gene prediction using a hidden Markov model. Any more?

Topic: data bioinformatics classification machine-learning

Category: Data Science

How to make a classification problem into a regression problem?

DN1

March 27, 2020, 13:23

I have data describing genes, each of which gets 1 of 4 labels. I use this to train models to predict/label other unlabelled genes. I have a huge class imbalance, with 10k genes in 1 label and 50-100 genes in each of the other 3 labels. Due to this imbalance, I'm trying to change my labels into numeric values so that a model predicts a score rather than a label, to reduce bias. Currently from my 4 labels (of most likely, likely, possible, …

Topic: bioinformatics regression classification machine-learning

Category: Data Science

Calculating FRR and FAR for a multiclass biometric face recognition system

Mustafa Azzurri

January 9, 2020, 21:02

I am implementing a face recognition system using FaceNet and an SVC ML algorithm. I have 20 or more classes and I'm getting 98% accuracy. I'm trying to calculate the FAR, the FRR and the EER. I'm assuming that the threshold is the prediction probability; is that correct? This is the code for calculating the FP, FN, TP and TN: FP = matrics.sum(axis=0) - np.diag(matrics) FN = matrics.sum(axis=1) - np.diag(matrics) TP = np.diag(matrics) TN = matrics.sum() - (FP + FN + TP) …

Topic: bioinformatics evaluation machine-learning

Category: Data Science
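
The per-class counts in the question above can be sketched without NumPy as follows. This is an illustration only: it assumes the confusion matrix has true classes as rows and predicted classes as columns, and it takes FAR as FP / (FP + TN) and FRR as FN / (FN + TP) in a one-vs-rest sense, which the excerpt does not confirm. The function names and the example matrix are hypothetical, and thresholds and the EER are not addressed.

```python
def per_class_counts(matrix):
    """Return (TP, FP, FN, TN) lists for a square confusion matrix.

    Assumes matrix[true_class][predicted_class].
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = [matrix[k][k] for k in range(n)]
    # Column sum minus the diagonal: predicted as k but actually another class.
    fp = [sum(matrix[r][k] for r in range(n)) - tp[k] for k in range(n)]
    # Row sum minus the diagonal: actually k but predicted as another class.
    fn = [sum(matrix[k]) - tp[k] for k in range(n)]
    tn = [total - (fp[k] + fn[k] + tp[k]) for k in range(n)]
    return tp, fp, fn, tn


def far_frr(matrix):
    """Per-class FAR and FRR under the assumed one-vs-rest definitions."""
    tp, fp, fn, tn = per_class_counts(matrix)
    far = [f / (f + t) if (f + t) else 0.0 for f, t in zip(fp, tn)]
    frr = [f / (f + t) if (f + t) else 0.0 for f, t in zip(fn, tp)]
    return far, frr


# Made-up 3-class confusion matrix, for illustration only.
confusion = [[5, 1, 0], [2, 6, 1], [0, 1, 4]]
tp, fp, fn, tn = per_class_counts(confusion)
far, frr = far_frr(confusion)
print(fp, fn, tn, far, frr)
```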

Clustering analysis for observations with lists as data

quarksome

August 14, 2019, 15:17

I have several samples analyzed for their chemical composition. After data analysis, for each sample, I have a list of the compounds found and their corresponding relative abundances. Some compounds are unique, but most are found in most samples. I want to do a clustering analysis based on these lists of compounds. How do I go about this? Specifically, how do I vectorize my dataset, since each observation is actually an array with both numerical (abundance) and categorical (compound label) variables?

Topic: bioinformatics clustering

Category: Data Science

How to compare genetic profiles or vcf files in Python?

studentcoder

June 27, 2019, 23:39

I have hundreds of VCF files, each of which contains the genome profile for a tissue. A portion of a VCF file is as follows: I can read each VCF file into a dataframe, so there would be hundreds of dataframes. Each VCF file/dataframe contains hundreds of columns and 40-50 thousand rows. I want to see the difference in the ALT column for each profile (VCF file/dataframe) on the CHROM, POS, ID and REF columns. What would be the best way …

Topic: dataframe bioinformatics pandas python

Category: Data Science
