I have a set of tables containing a few thousand entries and a few tens of columns of machine status values from production. The entries are of mixed types such as string, float, or timestamp. Each table is pre-labeled with a certain failure mode (e.g. valve setting jump, the problem with inlet A, etc.). This could be due to a jump in the mean values of some columns or a special correlation between several columns. This is what I refer to as a …
I have the following data frame:

      Date.POSIXct         Date       WeekDay DayCategory Hour Holidays value
    1 2018-05-01 00:00:00  2018-05-01 MA      MA-MI-JU       0        0    30
    2 2018-05-01 01:00:00  2018-05-01 MA      MA-MI-JU       1        0    80
    3 2018-05-01 02:00:00  2018-05-01 MA      MA-MI-JU       2        0    42
    4 2018-05-01 03:00:00  2018-05-01 MA      MA-MI-JU       3        0    90
    5 2018-05-01 04:00:00  2018-05-01 MA      MA-MI-JU       4        0    95
    6 2018-05-01 05:00:00  2018-05-01 MA      MA-MI-JU       5        0     5

DayCategory groups days of the week in the following way: Mondays go to …
I would like to export tables of the following result of a repeated-measures ANOVA. Here is the function in which the ANOVA test has been implemented:

    fAddANOVA = function(data) data %>%
      ezANOVA(dv = .(value), wid = .(ID), within = .(COND)) %>%
      as_tibble()

And here are the commands to explore the ANOVA statistics:

    aov_stats <- df_join %>%
      group_by(signals) %>%
      mutate(ANOVA = map(data, ~fAddANOVA(.x))) %>%
      dplyr::select(., -data) %>%
      unnest(ANOVA)

    > aov_stats
    # A tibble: 12 x 4
    # Groups: signals [12]
    signals ANOVA$Effect $DFn $DFd $F …
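For reference, a minimal sketch of one way to export such a tibble, assuming aov_stats is structured as printed above (the readr/knitr calls are illustrative, not the asker's code):

    library(readr)
    library(knitr)

    # If the ANOVA column prints as packed (ANOVA$Effect etc.),
    # flatten it first so every statistic gets its own plain column.
    aov_flat <- tidyr::unpack(aov_stats, cols = ANOVA)

    # Plain CSV, one row per signal/effect.
    write_csv(aov_flat, "aov_stats.csv")

    # Or a simple HTML table for a report.
    kable(aov_flat, format = "html")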
I am trying to create a table like the one following. I have one product line and one dimension, "zones", and I want to add a few columns: a total for pieces on a line, a fixed value per product (stock mini, i.e. minimum stock), a calculated field stock mini - total, and an icon to tell me whether stock mini - total is greater or less than 0. After struggling I managed to create my columns, but I found no solution to add a …
I just started learning data science and am having a problem when generating a dataset. Dataset:

    covid_data = pd.read_csv(r"C:\Users\Test\OneDrive\Desktop\Project_test\data.csv")

For some reason, when I try to create a new dataset, it creates an additional column "cases)" and adds NaN values automatically. It happens randomly: it works for a while, and when I restarted my Jupyter notebook it happened again. Any idea how to prevent this issue? I obtained the dataset from https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv/data.csv Screenshot of the data.csv file:
I am fairly new to R and recently came across the data.table package, which is roughly ten times faster than dplyr in various operations. I have a Shiny app based on Covid data that is getting heavier each day; I am already not impressed with the loading time and expect it to get slower with each passing day. In the Shiny app I have provided several input options to the user, hence the reactive elements and filter & …
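As a point of comparison, a minimal sketch of the same filter written both ways (the data, column names, and input values are made up for illustration; in a real app, input would be Shiny's input object inside a reactive):

    library(data.table)
    library(dplyr)

    # Stand-in for Shiny's input object, just for this sketch
    input <- list(country = "DE", date_from = as.Date("2021-01-01"))

    covid_df <- data.frame(
      country = c("DE", "FR", "DE"),
      date    = as.Date(c("2021-01-02", "2021-01-03", "2020-12-30")),
      cases   = c(10, 20, 30)
    )

    # dplyr version, as it might appear inside a reactive()
    filtered_df <- covid_df %>%
      dplyr::filter(country == input$country, date >= input$date_from)

    # data.table translation of the same filter
    covid_dt <- as.data.table(covid_df)
    filtered_dt <- covid_dt[country == input$country & date >= input$date_from]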
I have multiple data frames with the same column names. I want to write them together to an Excel sheet, stacked vertically on top of each other, and between each there will be a text occupying a row. This is what I have in mind. I tried the pandas.ExcelWriter() method, but each dataframe overwrites the previous frame in the sheet instead of appending. Note that I still need multiple sheets for different dataframes, but also multiple dataframes on each sheet. Is …
So I have this table above. I'm trying to aggregate the occupations so that the table results in: I've tried using df.groupby(['Occupation']) but I get an error. All I know is that my final step would be to set the index to "Occupation". But I still don't know how to group via entries in the single Occupation column here. Also, what would this type of final table be called? I know it's not called a multi-index table because there is …
Here is the GitHub link to the most recent data.table benchmark. The data.table benchmarks have not been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but would consider switching if pandas can beat data.table.
I face a problem which I'd like to solve without any programming, and I am looking for software to do this. I have a dataset, for example (brand-id, brand-name, product-class-name):

    0, Audi, economy business premium;
    1, Rolls Royce, luxury;
    2, Seat, economy;
    3, Tesla, business premium;

And I'd like to automatically process this dataset, resulting in an additional table that classifies the parameters in column 3, like (product-class-id, product-class-name, brand-id):

    0, economy, 0 2;
    1, business, 0 3;
    2, premium, 0 …
I have a silly question. Below is the output of a logistic regression analysis I did. I notice that when I switch the order of the arguments I pass to the table function in R, it also switches the false positive and false negative values, but it does not switch the location of the Female and Male rows and columns. This seems like it could really affect the interpretation if the false positives/negatives can change …
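A minimal sketch of what swapping the arguments does (made-up vectors; the dnn argument labels the dimensions so rows and columns cannot be misread):

    # Made-up predictions and ground truth for illustration
    actual    <- factor(c("Female", "Male", "Male", "Female", "Male"))
    predicted <- factor(c("Female", "Male", "Female", "Female", "Male"))

    # The first argument becomes the rows, the second the columns
    table(predicted, actual, dnn = c("Predicted", "Actual"))
    table(actual, predicted, dnn = c("Actual", "Predicted"))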
I am trying to do simple calculations in R when no raw data is available, only grouped data with frequencies. This is the case when I have a large number of records in a database, say a large SQL table, and for given reasons I GROUP BY and COUNT to aggregate instead of downloading the original table for analysis in R. As I understand it, one could say in R that I'm talking about data in a table format. To …
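For instance, a minimal sketch of computing a mean and variance from such grouped counts (the column names are assumptions):

    # Grouped data as it might come back from SELECT value, COUNT(*) ... GROUP BY value
    grouped <- data.frame(value = c(1, 2, 5), n = c(10, 30, 5))

    # Weighted mean directly from the frequencies
    m <- weighted.mean(grouped$value, grouped$n)

    # Or expand back to pseudo-raw data when a function needs individual records
    raw <- rep(grouped$value, times = grouped$n)
    mean(raw); var(raw)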
This is something I can't achieve with the reshape2 library for R. I have the following data:

       zone code literal
    1:    A   14    bicl
    2:    B   14    bicl
    3:    B   24   calso
    4:    A   51    mara
    5:    B   51    mara
    6:    A  125     gan
    7:    A  143    carc
    8:    B  143    carc

i.e. each zone has 4 codes with its corresponding literal. I would like to transform it to a dataset with one column for each of the four codes …
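One possible approach, assuming the goal is one code/literal pair per column within each zone: data.table's dcast accepts multiple value.var columns, which reshape2's dcast does not. A sketch:

    library(data.table)

    dt <- data.table(
      zone    = c("A", "B", "B", "A", "B", "A", "A", "B"),
      code    = c(14, 14, 24, 51, 51, 125, 143, 143),
      literal = c("bicl", "bicl", "calso", "mara", "mara", "gan", "carc", "carc")
    )

    # rowid(zone) numbers the codes within each zone (1..4),
    # then both columns are spread wide in one call
    wide <- dcast(dt, zone ~ rowid(zone), value.var = c("code", "literal"))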
I have this data set which consists of ISO 3166 alpha-2 codes for countries, for example DE, AD, AE, etc. They are coded as factor variables in R and there are about 173 observations. Now, because there are too many and this would just overwhelm a boxplot, I want to make a contingency table with other variables by condensing the codes and creating shorter categories (also coded as factors), for example DE, RE, ED, FR -> Europe; CA, US -> …
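A minimal sketch of one way to collapse the factor levels into regions (forcats::fct_collapse is one option; base R levels<- works too; the vectors here are made up):

    library(forcats)

    # Made-up factor with a few ISO 3166 alpha-2 codes
    country <- factor(c("DE", "FR", "CA", "US", "DE"))

    # Map individual codes onto coarser region levels
    region <- fct_collapse(country,
      Europe       = c("DE", "FR"),
      NorthAmerica = c("CA", "US")
    )

    table(region)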
I know that I can read a very large CSV file in much faster with fread from the data.table library than with read.csv, which reads a file in as a data.frame. However, dplyr can only perform operations on a data.frame. My questions are: Why was dplyr built to work with the slower of the two data structures? When working with big data, is it good practice to read in as a data.table and then convert to a data.frame to perform dplyr operations? Is there …
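For reference, a minimal sketch of the read-then-convert pattern the question describes (the file and column names are placeholders):

    library(data.table)
    library(dplyr)

    dt <- fread("big_file.csv")   # fast read; returns a data.table

    # A data.table also inherits from data.frame, but it can be
    # converted explicitly if plain data.frame semantics are wanted:
    df <- as.data.frame(dt)

    result <- df %>%
      filter(!is.na(value)) %>%
      group_by(group) %>%
      summarise(mean_value = mean(value))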
I have a table of features and labels where each row has a timestamp. The labels are categorical, and they come in batches where one label repeats several times. Batches with the same label do not have a specific order. The number of repetitions of the same label in one batch is always the same. In the example below, every three rows have the same label. I would like to get a new table where Var 1 and Var 2 …
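Assuming the goal is one row per batch with the variables aggregated (the names and the use of the mean are assumptions), a sketch using data.table's rleid to number consecutive runs of the same label:

    library(data.table)

    # Made-up example: batches of three consecutive rows share a label
    dt <- data.table(
      label = rep(c("a", "b", "a"), each = 3),
      var1  = 1:9,
      var2  = 9:1
    )

    # rleid() gives each consecutive run of identical labels its own id,
    # so repeated label values in different batches stay separate
    dt[, batch := rleid(label)]
    summary_dt <- dt[, .(label = label[1], var1 = mean(var1), var2 = mean(var2)),
                     by = batch]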
How do you view, in pandas, the elements of a CSV table with many columns (>25) where the names of some columns are more than 10 characters long? I have 5000 rows and 32 columns, and the labels of some columns are more than 10 characters. How can I see them and work with the different columns? Excel does not work: all of the items come out sloppy. Access is OK but could not detect the long labels of the items. What is your …
Given: a monthly percentage (%) metric has to be calculated by dividing a column ('Numerator') from one table by a column ('Denominator') from another table, both filtered by month, as in the example below:

Table 1:

    Date_1     Numerator
    01-Jan-19  5
    05-Feb-19  4
    04-Apr-19  1
    07-May-19  3
    11-Jun-19  5
    22-Jun-19  4
    25-Jul-19  5
    31-Aug-19  1
    03-Sep-19  4
    25-Oct-19  5

Table 2:

    Date_2     Denominator
    03-Jan-19  7
    05-Jan-19  9
    16-Feb-19  8
    22-Feb-19  7
    04-Mar-19  10
    18-Mar-19  8
    24-Apr-19  8
    25-Apr-19  8
    01-May-19  …
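The question does not name a tool, so this is only a sketch of the computation itself in R: sum each table by month, then join on the month before dividing (the data here are tiny stand-ins for the tables above):

    # Tiny stand-ins for Table 1 and Table 2
    t1 <- data.frame(Date_1 = as.Date(c("2019-01-01", "2019-02-05")),
                     Numerator = c(5, 4))
    t2 <- data.frame(Date_2 = as.Date(c("2019-01-03", "2019-01-05", "2019-02-16")),
                     Denominator = c(7, 9, 8))

    # Sum each table by calendar month
    num <- aggregate(Numerator ~ format(Date_1, "%Y-%m"), data = t1, FUN = sum)
    den <- aggregate(Denominator ~ format(Date_2, "%Y-%m"), data = t2, FUN = sum)
    names(num)[1] <- "month"; names(den)[1] <- "month"

    # Join on month and compute the percentage metric
    monthly <- merge(num, den, by = "month")
    monthly$pct <- 100 * monthly$Numerator / monthly$Denominator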
What are the most effective bread-and-butter in-memory open-source tabular data frameworks today? I have been working with tabular data for years with an in-house solution that integrates well with Excel but falls short of many other expectations. I would like (if possible/true) to demonstrate that our solution has fallen behind the times. In other words, assume an SQL-like platform is responsible for persistence of a data set, but cycle-intensive calculations need to be performed on that dataset (e.g. …