I have a background in GIS and am just learning data science using programming languages. Specifically, I am focusing on learning Python to complement my boss's knowledge of R (we are a new department and there are just 3 of us). In my previous research, I always used ArcGIS coupled with data manipulation of CSVs in Excel. My boss says they don't use CSVs because they don't maintain metadata, and most of their files are …
I have 10 datasets, each with the same variables (e.g., age and income) but different numbers of observations. Let us now consider a categorical variable $X$ that can only take values $0$ and $1$ per dataset, meaning that it keeps the same value for all observations. For 5 datasets, $X=0$; for the other 5, $X=1$. How do I create a regression model for a variable of these datasets (e.g., age) that takes into account this "meta-variable" $X$? A simple solution …
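One way to sketch the pooled approach (a minimal example assuming Python with pandas/NumPy, and made-up stand-ins for the 10 datasets): concatenate everything and add $X$ as an indicator column, then regress on it alongside the observation-level variables. The coefficient on $X$ then captures the dataset-level shift.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the 10 datasets: each has the same variables
# but a different number of observations, and a constant meta-variable X
# (0 for the first five datasets, 1 for the rest).
datasets = []
for i in range(10):
    n = int(rng.integers(50, 100))
    age = rng.uniform(20, 60, n)
    income = 1000 + 50 * age + rng.normal(0, 100, n)
    df = pd.DataFrame({"age": age, "income": income})
    df["X"] = 0 if i < 5 else 1          # the per-dataset "meta-variable"
    datasets.append(df)

pooled = pd.concat(datasets, ignore_index=True)

# Fit income ~ intercept + age + X by ordinary least squares.
A = np.column_stack([np.ones(len(pooled)), pooled["age"], pooled["X"]])
coef, *_ = np.linalg.lstsq(A, pooled["income"].to_numpy(), rcond=None)
print(coef)  # [intercept, age slope, X effect]
```

With real data, each `DataFrame` would come from a file instead of being simulated; an interaction term (`age * X`) could be added the same way if the slope is suspected to differ between the two groups.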
There are 4 datasets (all in CSV format), each with a uniqueID column by which each record can be identified. The image and text datasets are dense (they need to be converted to ndarrays). Can someone suggest how to use all 4 of these datasets for building a regression model? This is how the metadata file looks, with some input features and the target variable (views):

uniqueID  ad_blocked  embed  duration  language  hour  views
1         True        True   68        3         10    244
2         False       True   90        1         …
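A possible sketch (Python/pandas, with tiny hypothetical stand-ins for the files; in practice each would come from `pd.read_csv(...)`): merge everything on uniqueID so each record carries all of its input features plus the target.

```python
import pandas as pd

# Miniature stand-in for the metadata file; columns follow the question.
meta = pd.DataFrame({
    "uniqueID": [1, 2],
    "ad_blocked": [True, False],
    "embed": [True, True],
    "duration": [68, 90],
    "language": [3, 1],
    "hour": [10, 21],
    "views": [244, 102],
})

# Dense image/text features would be ndarrays; here one fake feature
# column per modality, keyed by the same uniqueID.
image_feats = pd.DataFrame({"uniqueID": [1, 2], "img_f0": [0.12, 0.87]})
text_feats = pd.DataFrame({"uniqueID": [1, 2], "txt_f0": [0.45, 0.33]})

# Join on uniqueID so every row has all features and the target.
full = meta.merge(image_feats, on="uniqueID").merge(text_feats, on="uniqueID")

X = full.drop(columns=["uniqueID", "views"]).to_numpy()
y = full["views"].to_numpy()
print(X.shape, y.shape)
```

The resulting `X` and `y` can be fed to any regressor; with real image/text arrays, one column per extracted feature replaces the single fake column used here.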
I want to know what metadata is and what is meant by meta-features. When I google "meta-features", what I get is a feature-selection tool called "Meta-Feature". What is the function of feature-selection tools? Also, what I want is the definition and meaning of meta-features.
I have a convolutional neural network and would like to include some metadata. My metadata is in multiple CSV files that correspond to each class, and it contains a bunch of geometric properties (about 8 numerical measurements), specifically revolving around size and volume, that would help classify similar-looking images that have varying size and volume. I am currently using Keras to build my models. What I am unsure of is where and how to add metadata into …
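For what it's worth, here is a minimal sketch of the usual pattern with the Keras functional API (input shapes, layer sizes, and the class count are all hypothetical): the image branch is convolutional, the ~8 metadata measurements enter as a second input, and the two are concatenated before the dense classifier head.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 4                                      # hypothetical
img_in = keras.Input(shape=(64, 64, 1), name="image")
meta_in = keras.Input(shape=(8,), name="metadata")   # 8 geometric measurements

# Image branch.
x = layers.Conv2D(16, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Concatenate the learned image features with the raw metadata,
# then classify on the combined vector.
merged = layers.concatenate([x, meta_in])
h = layers.Dense(32, activation="relu")(merged)
out = layers.Dense(num_classes, activation="softmax")(h)

model = keras.Model(inputs=[img_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# One dummy forward pass to confirm the shapes line up.
pred = model.predict([np.zeros((2, 64, 64, 1)), np.zeros((2, 8))], verbose=0)
print(pred.shape)  # (2, 4)
```

The metadata would normally be standardized first, since the 8 measurements are on very different scales from the learned image features.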
I'm very new to R and trying to run a multi-level meta-analysis using pre-calculated effect sizes. The data file can be accessed via this link: testrunfile. The script I used as a first step to fit the model was: res <- rma.mv(yi = es_r, v = var, data = testrun, method = "REML", level = 95, digits = 7, slab = ref, random = ~ 1 | samp_id) But I keep getting this error: Error in verbose > 2 …
Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document: I saw the company's manager last day. To be able to extract information from it, it must be annotated with additional data to be less ambiguous. The process of finding such meta-data is not in question, so assume it is done manually. The question is how to store these data in a …
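One common storage option is standoff annotation: the raw text is kept unchanged and the semantic metadata lives in separate records that point back into it by character offsets. A minimal JSON sketch (the labels and offsets here are hypothetical, chosen for this example sentence):

```python
import json

doc = "I saw the company's manager last day."

# Standoff annotations: character offsets into the unchanged text,
# plus whatever semantic labels were assigned (manually, per the question).
annotations = [
    {"start": 10, "end": 27, "type": "PERSON_ROLE", "note": "company's manager"},
    {"start": 28, "end": 36, "type": "TIME", "note": "relative time expression"},
]

record = {"text": doc, "annotations": annotations}
serialized = json.dumps(record)

# Round-trip and recover the annotated span from the offsets.
loaded = json.loads(serialized)
span = loaded["text"][annotations[0]["start"]:annotations[0]["end"]]
print(span)
```

The advantage of standoff (over inline markup such as XML tags embedded in the text) is that annotations can overlap and the original document stays byte-for-byte intact.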
Given a social network, I want to perform community detection and compare the result to known node metadata, such as gender, age, etc. to see if certain communities are largely composed of "similar" people. I have seen this done before in visualizations like this: (image from https://arxiv.org/pdf/0809.0690.pdf) where each circle represents a community and the coloring of the circle shows the breakdown of some attribute (e.g. nationality) within that community. Does anyone know what tool can be used to create …
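As a rough sketch of the analysis step (assuming networkx is acceptable; the built-in "club" attribute stands in for gender, nationality, etc.): detect communities, then tabulate the node attribute within each one. The per-community counts are exactly what the pie-chart nodes in such visualizations encode.

```python
from collections import Counter

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in social network whose nodes carry a categorical attribute.
G = nx.karate_club_graph()

communities = greedy_modularity_communities(G)
for i, comm in enumerate(communities):
    # Breakdown of the attribute within this community.
    breakdown = Counter(G.nodes[n]["club"] for n in comm)
    print(f"community {i}: {dict(breakdown)}")
```

The resulting breakdowns can then be drawn as pie markers at each community's position with matplotlib, which is one way the linked style of figure is produced.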
I am using a library called MFE to generate meta-features. However, I am working right now with several files, and I notice that I am using only 1 core of my machine and it is taking too much time. I have been trying to use some libraries I saw in another question: library(iterators) library(foreach) library(doParallel). I tried that approach, but I could not get it working ='(. I would just like to run this snippet on all my cores so I …
I am curating a large quantity of data from different sensors. If I know that a particular sensor was broken or poorly calibrated for a particular time range, what would be a useful way of annotating the data to make it clear that the data are of poor quality and / or have known errors? I am thinking a set of key:value pairs (like quality:error, description:'sensor was broken') that I can store in json, yaml, image header (e.g. exif) metadata …
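A minimal sketch of the JSON variant (all field names here are hypothetical): store the flags as a list of records alongside the data, keyed by sensor and time range, so a consumer can look them up and mask any measurements that fall in a flagged range.

```python
import json

# Hypothetical quality-flag record for one sensor and time range,
# stored alongside (not inside) the data itself.
flag = {
    "sensor_id": "temp_07",
    "start": "2023-04-01T00:00:00Z",
    "end": "2023-04-03T12:00:00Z",
    "quality": "error",
    "description": "sensor was broken",
}

with open("quality_flags.json", "w") as f:
    json.dump([flag], f, indent=2)

# A downstream consumer reads the flags back and can filter accordingly.
with open("quality_flags.json") as f:
    flags = json.load(f)
print(flags[0]["quality"], flags[0]["description"])
```

Keeping the flags in a sidecar file like this (rather than editing the raw measurements) preserves the original data while still making the known problems machine-readable.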
How do you understand a dataset when no metadata is given (no details about the attributes in the dataset)? It is difficult to comprehend the attribute names, as only the short forms are given. I have been told that 'pm2.5' is the target variable. How do I work out which independent variables will affect this target variable?
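As a first pass when no metadata exists, summary statistics hint at what each column might be (its range, units, sparsity), and each column's correlation with the known target ranks candidate predictors. A sketch with made-up column names and simulated values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical abbreviated columns of the kind found in air-quality data.
df = pd.DataFrame({
    "TEMP": rng.normal(15, 8, 200),
    "PRES": rng.normal(1015, 5, 200),
    "Iws": rng.exponential(10, 200),
})
# Simulated target that actually depends on TEMP.
df["pm2.5"] = 80 - 2.0 * df["TEMP"] + rng.normal(0, 10, 200)

print(df.describe())                     # ranges/units hint at what a column is
corr = df.corr()["pm2.5"].drop("pm2.5")  # linear association with the target
print(corr.sort_values())
```

Correlation only captures linear effects, so it is a screening step, not a conclusion; scatter plots against the target and a simple model's feature importances are natural follow-ups.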
If I have a lot of data points describing the price of a used car, how would I find the market value of the car (assuming that the price points in the data set are the only determinant used, and the basis of determination will be the frequency of each price data point for that particular car [the higher the frequency, the better])? A count of absolute value recurrences will not work, as I want to bucket numbers that are similar (less …
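One simple way to bucket similar numbers is a histogram: take the centre of the densest bin as the market value, with the bin width controlling how "similar" two prices must be to count together. A sketch with fabricated prices:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical price points for one car model: most cluster near 9000,
# plus some scattered outliers across a wide range.
prices = np.concatenate([
    rng.normal(9000, 300, 80),
    rng.uniform(5000, 15000, 20),
])

# Bucket similar prices together; the bin count (width) is a judgment call.
counts, edges = np.histogram(prices, bins=30)
best = np.argmax(counts)
market_value = (edges[best] + edges[best + 1]) / 2
print(round(market_value))
```

A kernel density estimate (e.g. `scipy.stats.gaussian_kde`) is the smoother version of the same idea: its peak gives a modal price without hard bin edges.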
As part of my thesis I've done some experiments that have resulted in a reasonable amount of time-series data (motion-capture + eye movements). I have a way of storing and organizing all of this data, but it's made me wonder whether there are best practices out there for this sort of task. I'll describe what I've got, and maybe that will help provide some recommendations. So, I have an experiment that requires subjects to use their vision and move their …
The formula for the information given by an event occurring with probability $p$ is: $I = -\log_2 p$. This formula gives the bits of information needed to know the outcome of the event. It captures the intuition that the information needed to know the outcome of an event with probability 1 is 0, as we already know the outcome of the event. So shouldn't the formula give the information as 0 for an event with probability 0, as we know the …
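A quick numeric check of the formula (using Python's `math.log2`), which also shows that as $p \to 0$ the value of $-\log_2 p$ grows without bound rather than returning to 0:

```python
import math

# I = -log2(p): bits needed to learn the outcome of an event of probability p.
print(-math.log2(1.0))    # certain event: 0 bits
print(-math.log2(0.5))    # fair coin flip: 1 bit
print(-math.log2(0.25))   # one of four equally likely outcomes: 2 bits

# As p shrinks toward 0, the information diverges instead of vanishing.
print(-math.log2(1e-9))
```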
I am trying to start a meta-analysis for which I want to extract some 16S-based information from public databases. Moreover, I want to relate this information with any metadata found in the associated studies (everything from environmental variables to sequencing details). For this, I realized some databases are available, like NCBI-Nucleotide, NCBI-SRA and EMBL-EBI-ENA, but I am not sure which one to use or whether I can use them all. How can I filter only whole 16S sequences? Or …
I have data coming from a source system that is pipe-delimited. Pipe was selected over comma since it was believed no pipes appeared in any field, while it was known that commas do occur. After ingesting this data into Hive, however, it has been discovered that, rarely, a field does in fact contain a pipe character. Due to a constraint we are unable to regenerate from source to escape the delimiter or change delimiters in the usual way. However, we …
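One workaround sketch for locating the rare bad rows (the column count and sample lines here are fabricated): a clean row with N columns contains exactly N−1 pipes, so any line with more pipes must contain an embedded one and can be flagged for repair before or during cleanup.

```python
# With 4 expected columns, a clean pipe-delimited line has exactly 3 pipes.
expected_cols = 4
lines = [
    "id|name|city|score",
    "1|alice|london|9",
    "2|bo|b|paris|7",      # embedded pipe inside the name field
]

# Flag every line whose pipe count does not match the expected column count.
bad = [ln for ln in lines if ln.count("|") != expected_cols - 1]
print(bad)
```

The same field-count check can be expressed in HiveQL over the raw text (e.g. comparing the length of a split against the expected column count) to route malformed rows into a quarantine table for manual fixing.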