Converting data format

I'm trying to use the recent COVID-19 data from the Italian Civil Protection site, but it uses a timestamp format that, as a novice, I find troublesome to plot on a graph. This is how the data is presented: [1] 2020-02-24T18:00:00 2020-02-25T18:00:00 2020-02-26T18:00:00 2020-02-27T18:00:00 2020-02-28T18:00:00 2020-02-29T18:00:00 and I would like to show the dates as DD-MM, without the time and the year. How can I do it?
Category: Data Science
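
The output in the question looks like it comes from R, so the following is only a minimal Python sketch of the same idea: parse each ISO 8601 timestamp and reformat it as DD-MM. The name `to_day_month` is hypothetical, and the sample values are copied from the question.

```python
from datetime import datetime

# Sample timestamps copied from the question.
raw = ["2020-02-24T18:00:00", "2020-02-25T18:00:00", "2020-02-26T18:00:00"]


def to_day_month(stamp: str) -> str:
    """Parse an ISO 8601 timestamp and return only day and month as DD-MM."""
    return datetime.fromisoformat(stamp).strftime("%d-%m")


labels = [to_day_month(s) for s in raw]
print(labels)
```

The resulting strings are labels only. For plotting it is usually safer to keep the parsed datetime values on the axis and apply the DD-MM format to the tick labels, so the points stay in chronological order.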

R code making 1 column into multiple columns with their unique ID

Currently stuck on a data wrangling question in R. So far I've tried variations of this code using the tidyverse package (columns 5 and 6 here were the rating and the user): df[,5:6] %>% pivot_wider(names_from = question, values_from = rating, names_sep = ".") %>% unnest(cols = everything()) -> df_reformat Each column will be the question ID and the rows are the scores for each user, ideally clustered by group. Data structure needed: repID user Customer question 1 Customer question 2 .... Customer …
Category: Data Science

Storing Large dataset for processing and analysis of data

I am new to data engineering and wanted to know the best way to store more than 3000 GB of data for further processing and analysis. I am specifically looking for open source resources. I have explored many data formats for storage. The dataset that I want to store is heart rate pulse data generated by a sensor.
Category: Data Science

Python: convert variables into correct format for DataFrame

I have 3 variables that I would like to use to build my dataset, but since they are in a weird shape/format, I have had no success so far. I'm quite new to this and really appreciate any help!! The 3 variables I have are: print(newspaper) ['Bolero'] ['Schweizer Illustrierte Style'] ['Bolero'] print(title) ['Schönheit und Tragik'] ['magie pur'] ['Das sind unsere Favoriten'] print(pubDate) ['2007-01-01'] ['2007-01-01'] ['2007-01-01'] It seems like all the variables are lists of lists, but I'm not quite sure. …
Category: Data Science
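
A minimal sketch, assuming each variable really is a list of one-element lists (the question itself is not sure about this). The helper `flatten` and the row layout are hypothetical. The idea is to flatten each variable and zip them into one record per article.

```python
# Assumed shape: each variable is a list of one-element lists.
newspaper = [["Bolero"], ["Schweizer Illustrierte Style"], ["Bolero"]]
title = [["Schönheit und Tragik"], ["magie pur"], ["Das sind unsere Favoriten"]]
pub_date = [["2007-01-01"], ["2007-01-01"], ["2007-01-01"]]


def flatten(nested):
    """Turn [[a], [b], [c]] into [a, b, c]."""
    return [item for inner in nested for item in inner]


# One dictionary per article, with one key per future column.
rows = [
    {"newspaper": n, "title": t, "pubDate": d}
    for n, t, d in zip(flatten(newspaper), flatten(title), flatten(pub_date))
]
print(rows[0])
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to get one column per key.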

Running a query in R after establishing dbconnect

I cannot figure out what is wrong with the following statement. The connection to the DWH is established, but the query statement in R does not seem to work and gives the following error: LR=dbGetQuery(con, "select id as ID, date_c."Professional_Status" as Prof_Status, case when talk_sec >= 5 then 1 else 0 end as Established_Connection from id_collect as id_c left join date_conncet as date_c on id_c.date=date_c.date where date::date = '2018-01-19' and country = 'IT' and type = 'shop' and …
Category: Data Science

How to drop the previous rows of a database based on a matching value in a column?

So I am currently trying to sort through a data frame containing attribute classes and values of teams. However, my data has multiple rows of different classes and values of the same Team ID/Attribute ID. I was wondering if there was a faster way to get just the last row of each of the same Team IDs/Attribute IDs.
Category: Data Science
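
A minimal sketch of keeping only the last row per Team ID/Attribute ID pair, assuming the rows are already in the order that defines "last". The field names and values are hypothetical, since the question does not show its data.

```python
# Hypothetical rows; field names and values are made up for illustration.
rows = [
    {"team_id": 1, "attribute_id": 10, "value": 0.2},
    {"team_id": 1, "attribute_id": 10, "value": 0.5},
    {"team_id": 2, "attribute_id": 10, "value": 0.9},
    {"team_id": 1, "attribute_id": 11, "value": 0.1},
    {"team_id": 2, "attribute_id": 10, "value": 0.7},
]

# Later rows overwrite earlier ones with the same key, so only the
# last row for each (team_id, attribute_id) pair survives.
latest = {}
for row in rows:
    latest[(row["team_id"], row["attribute_id"])] = row

last_rows = list(latest.values())
print(last_rows)
```

This is a single pass over the data. With pandas, the equivalent would be `df.drop_duplicates(subset=["team_id", "attribute_id"], keep="last")`, again with the column names being assumptions.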

Date time conversion in a CSV column

I am new to data science. I am attempting to write a program using regression techniques, and all of my values are numerical, except for the date and time (UTC), which are written in this format: HH:MM:SS MM/DD/YY. The date and time are a part of a CSV file and I do not know how to alter the column. I have looked around for how to convert this to a numerical value, but all the results put the date before …
Category: Data Science
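
A minimal sketch of one way to turn the `HH:MM:SS MM/DD/YY` strings into a number: parse them with an explicit format and convert to seconds since the Unix epoch. The function name `to_epoch_seconds` and the sample value are hypothetical. The question states that the times are UTC, which the code assumes.

```python
from datetime import datetime, timezone


def to_epoch_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS MM/DD/YY' (assumed UTC) to seconds since 1970-01-01."""
    parsed = datetime.strptime(stamp, "%H:%M:%S %m/%d/%y")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


# Hypothetical sample value in the format described in the question.
print(to_epoch_seconds("13:45:10 03/21/20"))
```

Epoch seconds are large numbers, so for regression it may help to subtract the earliest timestamp or to derive features such as hour of day instead.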

Advantage of a treebank in XML format

Which treebanks are based on an XML format? What is the advantage of an XML format for a treebank? I think it may affect annotation and querying of the treebank. For example, LASSY, Alpino and TIGER are in XML format.
Category: Data Science

Is there any way to analyze the format of text strings?

I have a lot of data which basically consists of alphanumeric text on individual lines which can vary in length and contain delimiters. Since there are many thousands of lines of text, I'm looking to see whether there is an automated way to determine the different formats of text. A sample of which is: 90665013-163 90731046-103 90840069-009 90847069-009 90880046-103 90889046-103 90897-051 9089744-103 9089844-103 90901-46909 90901-lep 9091046-103 9091046-909 90764046-1037 can10043E can90065-op016 9094344-103 90669j4-4438718 90666ie79 90664046-103 90710-077 004-919 4A1900935 can90064-op016 can90066-E016 9094544-103 …
Category: Data Science
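
One common approach, sketched here as an assumption rather than as an answer given in the source, is to reduce every string to a "shape" by replacing each digit with `9` and each letter with `A`, then count how often each shape occurs. The function name `signature` is hypothetical, and the samples are a subset of those in the question.

```python
from collections import Counter


def signature(text: str) -> str:
    """Map digits to '9' and letters to 'A', keeping delimiters unchanged."""
    shape = []
    for ch in text:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A")
        else:
            shape.append(ch)
    return "".join(shape)


# A few of the sample strings from the question.
samples = ["90665013-163", "90731046-103", "90897-051", "90901-lep", "can10043E", "004-919"]

counts = Counter(signature(s) for s in samples)
for shape, n in counts.most_common():
    print(shape, n)
```

Rare shapes in the resulting table are good candidates for malformed entries, and frequent ones suggest the main formats in the data.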

How to store efficiently very large sparse 3D matrices

To train a CNN, I have stacked arrays of images over observations [observations x width x length]. The dataset is very sparse ($95\%$). What would be an efficient way of storing these matrices in terms of format (e.g. pickle, parquet) and structure (e.g. scipy.sparse.csr_matrix, List of Lists)?
Category: Data Science

What is the most used format to save data with type information

I am exporting data from an SQL database and importing it into R. This is a two-step process since I first (automatically) download the data to a hard drive and then import the file with R. Currently, I am using CSV files to save the data. Everybody supports CSV. But CSV does not support type information. This sometimes makes it cumbersome to load a CSV file because I must check all the column types. This seems unnecessary because the …
Category: Data Science

Connecting Infusionsoft data to Google data studio

I want to create a Google Data Studio dashboard from Infusionsoft data. The main problem is the connectors: there are multiple tools that provide direct connectors, but they are paid solutions like Klipfolio, Clicdata, Grow etc. If a direct connection is not possible, I want to use some combination of Google Sheets and Zapier or other free tools to create a data flow that can be constantly refreshed with data coming in from "infusionsoft" to "Google data studio" …
Category: Data Science

ValueError: could not convert string to float: '���'

I have a (2M, 23) dimensional numpy array X. It has a dtype of <U26, i.e. unicode string of 26 characters. array([['143347', '1325', '28.19148936', ..., '61', '0', '0'], ['50905', '0', '0', ..., '110', '0', '0'], ['143899', '1325', '28.80434783', ..., '61', '0', '0'], ..., ['85', '0', '0', ..., '1980', '0', '0'], ['233', '54', '27', ..., '-1', '0', '0'], ['���', '�', '�����', ..., '�', '��', '���']], dtype='<U26') When I convert it to a float datatype, using X_f = X.astype(float) I get the …
Category: Data Science
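
A minimal sketch of locating the entries that cannot be parsed before converting the whole array. It uses plain Python lists rather than NumPy to stay self-contained, and the helper name `to_float_or_nan` is hypothetical. The unreadable cell from the question is represented by the Unicode replacement character `\ufffd`.

```python
import math


def to_float_or_nan(text: str) -> float:
    """Return float(text), or NaN when the text is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return math.nan


# The first three values are from the question; the last stands in for a garbled cell.
row = ["143347", "1325", "28.19148936", "\ufffd\ufffd\ufffd"]

converted = [to_float_or_nan(cell) for cell in row]
bad_positions = [i for i, value in enumerate(converted) if math.isnan(value)]
print(bad_positions)
```

Knowing the bad positions makes it possible to decide whether to drop those rows or fix the decoding upstream. In the array shown in the question, the garbled values all sit in the final row.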

Best file format for transfer of EHR data

I am working on a clinical trial where we have several sites sending us EHR data. The sites are currently sending the data in Excel files. I have a feeling someone's opening the files because 3 of the files have 64,999 rows exactly, and Excel 2007 cuts off at 65,000. I am working in Python, but I am trying to prevent the people at the local sites from opening the files in Excel. What's the best format for the files …
Category: Data Science

Containing multicomponent data in rows or columns

I have been working with DNA sequences and compiled a table with features from those sequences. I have a column called Trimer, which contains strings. For some DNA sequences there is one trimer of interest, so that column contains one 3-character string (e.g. "ATG"). For other rows in the table, there are 2 or 3 trimers of interest, so the Trimer column has multiple strings in it (e.g. "ATT, CTG, GAT"). All trimers from one sequence should …
Category: Data Science
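
The question is cut off before it states the desired layout, so the following is only a sketch of one possible option: splitting the comma-separated Trimer field into one row per trimer. The `sequence_id` field and the helper `explode_trimers` are hypothetical. The trimer strings are the examples from the question.

```python
# Hypothetical table: one record per DNA sequence.
records = [
    {"sequence_id": "seq1", "trimer": "ATG"},
    {"sequence_id": "seq2", "trimer": "ATT, CTG, GAT"},
]


def explode_trimers(rows):
    """Return one row per (sequence, trimer) pair."""
    long_rows = []
    for row in rows:
        for trimer in row["trimer"].split(","):
            long_rows.append(
                {"sequence_id": row["sequence_id"], "trimer": trimer.strip()}
            )
    return long_rows


long_rows = explode_trimers(records)
print(long_rows)
```

A long layout like this keeps each cell atomic, which tends to make filtering and counting by trimer simpler than parsing strings inside a single column.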

Getting stock data in a disciplined manner from Yahoo Finance

I used the code below to download stock data from Yahoo Finance: import yfinance as yf import datetime stocks = ["AXISBANK.NS", "HDFCBANK.NS", "ICICIBANK.NS" ,"INDUSINDBK.NS", "KOTAKBANK.NS", "SBIN.NS", "YESBANK.NS"] start = datetime.datetime(2018,1,1) end = datetime.datetime(2019,7,17) data = yf.download(stocks, start=start, end=end) data I get the data in the manner below: I saved the data using pandas: import pandas as pd df = pd.DataFrame(data) # saving the dataframe df.to_csv('BANKING STOCK.csv') I got the data in this format: But I want my data in this …
Category: Data Science

.h5 file format does not close properly

import h5py #added hf = h5py.File('../images.h5', 'w') #added hf.close() #added h5_file = tables.open_file("images.h5", mode="w") I also tried: h5py.File.close(hf). The error that pops up in both cases is: ValueError: The file 'restricted_images.h5' is already opened. Please close it before reopening in write mode. I've also tried: if isinstance(obj, h5py.File): # Just HDF5 files obj.close() Although In[]: hf gives Out[]: <Closed HDF5 file>, the file is still not closed.
Category: Data Science

Labeling data as having an error?

I am curating a large quantity of data from different sensors. If I know that a particular sensor was broken or poorly calibrated for a particular time range, what would be a useful way of annotating the data to make it clear that the data are of poor quality and / or have known errors? I am thinking a set of key:value pairs (like quality:error, description:'sensor was broken') that I can store in json, yaml, image header (e.g. exif) metadata …
Category: Data Science
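
A minimal sketch of the key:value idea from the question, stored as JSON alongside the data. The field names `sensor_id`, `start` and `end`, the dates, and the helper `quality_at` are hypothetical. Only `quality` and `description` come from the question.

```python
import json
from datetime import datetime

# One annotation per known-bad time range; sensor name and dates are made up.
annotations = [
    {
        "sensor_id": "sensor-01",
        "start": "2020-01-01T00:00:00",
        "end": "2020-01-15T00:00:00",
        "quality": "error",
        "description": "sensor was broken",
    }
]


def quality_at(sensor_id, timestamp, notes):
    """Return the quality flag covering a reading, or 'ok' if none applies."""
    moment = datetime.fromisoformat(timestamp)
    for note in notes:
        start = datetime.fromisoformat(note["start"])
        end = datetime.fromisoformat(note["end"])
        if note["sensor_id"] == sensor_id and start <= moment < end:
            return note["quality"]
    return "ok"


# Round-trip through JSON to show the annotations survive serialization.
serialized = json.dumps(annotations, indent=2)
restored = json.loads(serialized)
print(quality_at("sensor-01", "2020-01-10T12:00:00", restored))
```

Keeping the annotations in a separate file leaves the raw readings untouched, and the same lookup can be applied at analysis time to filter or down-weight flagged ranges.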

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.