So I have a dataset spread over multiple, ever-growing Excel files, all of which look like:

    email                 order_ID   order_date
    [email protected]       1234       23-Mar-2021
    [email protected]       1235       23-Mar-2021
    [email protected]       1236       23-Mar-2021
    [email protected]       1237       24-Mar-2021
    [email protected]       1238       28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; emails are traded for user_IDs for anonymity, and returning customers are marked for further analyses):

    user_ID   order_ID   order_date    is_returning?
    1         1234       23-Mar-2021   0
    2         1235       23-Mar-2021   0
    2         1236       23-Mar-2021   1
    1         …
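Something along these lines may be what the transformation step looks like: a minimal pandas sketch, assuming the Excel files are read into one DataFrame with the columns shown above (the file names are placeholders).

    import pandas as pd

    # Hypothetical list of the ever-growing Excel files
    files = ["orders_1.xlsx", "orders_2.xlsx"]
    orders = pd.concat([pd.read_excel(f) for f in files], ignore_index=True)

    # Parse the dates so ordering by time works
    orders["order_date"] = pd.to_datetime(orders["order_date"], format="%d-%b-%Y")

    # Replace each email with a stable anonymous user_ID (1-based, by first appearance)
    orders["user_ID"] = pd.factorize(orders["email"])[0] + 1

    # Mark every order after a user's first one as returning
    orders = orders.sort_values("order_date")
    orders["is_returning"] = orders.groupby("user_ID").cumcount().gt(0).astype(int)

    # Public dataset: drop the email column entirely
    public_orders = orders[["user_ID", "order_ID", "order_date", "is_returning"]]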
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used a Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this from CMD, so I save my Python code in a text file and save it as test.py. Then I run that Python file in CMD using the python test.py command; below is the screenshot: So my task previously worked, but after 3 …
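For reference, a standalone test.py usually has to build its own SparkSession explicitly (in a notebook this is sometimes already set up). A minimal sketch, with the input file and options as placeholders since the actual code is only visible in the screenshots:

    # test.py -- minimal standalone sketch; file path and options are assumptions
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("import-test").getOrCreate()

    # Hypothetical input; replace with whatever the notebook was reading
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()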
I am new to data engineering and wanted to know: what is the best way to store more than 3000 GB of data for further processing and analysis? I am specifically looking for open-source resources. I have explored many data formats for storage. The dataset that I want to store is heart rate pulse data generated by a sensor.
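One common open-source pattern at that scale is columnar, compressed storage such as Parquet, partitioned by a field that is filtered on often. A rough pyspark sketch, with the paths and the partition column as assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pulse-ingest").getOrCreate()

    # Hypothetical raw exports from the sensor
    pulse = spark.read.csv("raw/pulse/*.csv", header=True, inferSchema=True)

    # Columnar, compressed, splittable storage; partition by a column used in filters
    (pulse
     .write
     .partitionBy("date")        # assumes a 'date' column exists
     .mode("append")
     .parquet("warehouse/pulse"))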
I have a pipeline which outputs model scores to S3. I need to partition the data by model_type and date. Which of the layouts below is the most efficient way to partition the data?

    s3://bucket/data/model_type=foo/dt=YYYY-MM-DD/a.csv
    s3://bucket/data/dt=YYYY-MM-DD/model_type=foo/a.csv
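For what it's worth, the directory order is simply the order in which the partition columns are listed at write time; a small pyspark sketch producing the first layout (bucket and input paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    scores = spark.read.parquet("s3://bucket/staging/scores/")  # hypothetical input

    # ("model_type", "dt") gives .../model_type=foo/dt=YYYY-MM-DD/;
    # swapping the arguments gives the second layout
    (scores
     .write
     .partitionBy("model_type", "dt")
     .mode("overwrite")
     .csv("s3://bucket/data/"))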
I would like to sort a multi-index pandas dataframe by a column, but I do not want the entire dataframe to be sorted at once; rather, I would like to sort within one of the index levels. Here is an example of what I mean. Below is a multi-index dataframe:

    first  second
    bar    one       0.361041
           two       0.476720
    baz    one       0.565781
           two       0.848519
    foo    one       0.405524
           two       0.882497
    qux    one       0.488229
           two       0.303862

What I want to do is to …
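If "sort within each value of the first index level" is the goal, a groupby over that level with a per-group sort is one way to sketch it (the column name value is made up here, since the example above only shows numbers):

    import numpy as np
    import pandas as pd

    # Reconstruct a frame shaped like the example (values are illustrative)
    index = pd.MultiIndex.from_product(
        [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
    )
    df = pd.DataFrame({"value": np.random.rand(8)}, index=index)

    # Sort by 'value' inside each 'first' group, keeping the groups themselves in place
    result = (
        df.groupby(level="first", group_keys=False, sort=False)
          .apply(lambda g: g.sort_values("value"))
    )
    print(result)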
I am working on a machine learning practice problem from https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#ProblemStatement. I want to replace the null values in the column 'Item_Weight', and for that I am using the mean values given by a pivot_table where I calculate the mean of 'Item_Weight' grouped by the column 'Item_Identifier' of the dataset.

    item_weight_mean = ds.pivot_table(values='Item_Weight', columns='Item_Identifier')
    loc2 = ds['Item_Weight'].isnull()
    ds.loc[loc2, 'Item_Weight'] = ds.loc[loc2, 'Item_Identifier'].apply(lambda x: item_weight_mean[x])

I am getting an error for this code:

    (key) -> 2902 indexer = self.columns.get_loc(key) …
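One way to sidestep the get_loc lookup error is to build a plain Series of per-identifier means and map it onto only the missing rows; a hedged sketch, with the CSV file name as a placeholder:

    import pandas as pd

    ds = pd.read_csv("Train.csv")  # placeholder file name; use the contest's training file

    # Series indexed by Item_Identifier -> mean Item_Weight (ignores NaNs)
    item_weight_mean = ds.groupby("Item_Identifier")["Item_Weight"].mean()

    # Fill only the rows where Item_Weight is missing
    loc2 = ds["Item_Weight"].isnull()
    ds.loc[loc2, "Item_Weight"] = ds.loc[loc2, "Item_Identifier"].map(item_weight_mean)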
I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that are sometimes 800 columns wide (600 with a ton of N/As) and which have no or almost no documentation. This is my first job so I don't know …
I am trying to come up with a data pipeline architecture. The data I deal with is event logs for labs (requested, failed, succeeded, etc.) with timestamps and some customer info, for several different customers. Eventually I want that data dumped into a dashboard, for both external and internal use. What's the best way to approach this: event-driven or batch-driven ETL? We don't care much about real-time processing, and the data is rather small.
I'm working in a company that has two legacy data warehouses, which have evolved into unmaintainable monoliths over time. Therefore, they are in dire need of reform. I'm investigating a reform of the current data architecture into an architecture that is more in line with the principles of a data mesh, as advocated in this influential article by Zhamak Dehghani (probably well-known material to data professionals here). The first data warehouse, say DWH-A, mainly consists of data coming directly …
I have a dataset where there are categorical features as well as numeric features, and I have to perform one-hot encoding, normalization, and feature selection on it. In what order should I perform these steps on my data? I am new to data science, so please explain the logic behind it in layman's terms too. Thank you.
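As a rough illustration of one common ordering (encode and scale first, then select features on the resulting matrix), here is a scikit-learn Pipeline sketch; the column names and the final estimator are made up:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    categorical = ["colour", "city"]   # hypothetical categorical columns
    numeric = ["age", "income"]        # hypothetical numeric columns

    preprocess = ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numeric),
    ])

    model = Pipeline([
        ("preprocess", preprocess),                 # one-hot encoding + normalization
        ("select", SelectKBest(f_classif, k=10)),   # feature selection on the encoded/scaled matrix
        ("clf", LogisticRegression(max_iter=1000)), # placeholder estimator
    ])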
I have a dataset on which I have to perform feature selection using a correlation-based feature selection process (using scikit-learn). Can anyone please show me how to do it, with a small example using the sklearn library? I have read sklearn's documentation about feature selection (https://scikit-learn.org/stable/modules/feature_selection.html), but I am not able to wrap my head around it. Please explain the steps in layman's terms, if possible. Thank you.
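scikit-learn does not ship a class literally called "correlation-based feature selection", but SelectKBest with f_regression scores each feature with a correlation-based F-test against the target, which is one way to read the request; a small self-contained sketch on a built-in dataset:

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectKBest, f_regression

    X, y = load_diabetes(return_X_y=True)

    # Keep the 5 features whose correlation-based score with y is highest
    selector = SelectKBest(score_func=f_regression, k=5)
    X_selected = selector.fit_transform(X, y)

    print("kept feature indices:", selector.get_support(indices=True))
    print("shape before/after:", X.shape, X_selected.shape)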
I have a relatively complex query which runs against a database and contains multiple join statements, lead/lag functions, a subquery, etc. These tables are available as individual files in my object store. I am trying to run a Spark job to perform the same query. Is it advisable to try and convert the SQL query into Spark SQL (which I was able to do by making a few changes), or is it better to use the DataFrame API to reconstruct the query and …
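To make the trade-off concrete, both routes run through the same engine once the files are registered; a sketch with table, column, and path names as placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

    orders = spark.read.parquet("s3://bucket/orders/")   # hypothetical path
    orders.createOrReplaceTempView("orders")

    # Option 1: keep the SQL mostly as-is
    sql_result = spark.sql("""
        SELECT customer_id,
               amount,
               LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount
        FROM orders
    """)

    # Option 2: the equivalent DataFrame API
    w = Window.partitionBy("customer_id").orderBy("order_date")
    df_result = orders.select(
        "customer_id", "amount", F.lag("amount").over(w).alias("prev_amount")
    )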
My team is exploring options to create a robust "analytics" capability that is well-suited for our large quantities of sensor test data. I'd appreciate any suggestions for technologies that would perform well for my use case. About my data:

- For each test, we process binary recordings into flat files for each end-user (maybe 5 to 15 files per test, for hundreds of tests per year)
- Each file contains time-series data for 100 to 1000 parameters
- Parameter sample rates are anywhere …
We are building an ML pipeline on AWS, which will obviously require some heavy-compute components, including preprocessing and batch training. Most of the pipeline is on Lambda, but Lambda is known to have a limit on how long a job can run (~15 min). Thus, for the longer-running jobs like batch training of ML models, we will need(?) to use something like EC2 instances. For example, a Lambda function could be invoked and then kick off an EC2 instance …
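As one possible shape for that hand-off, a Lambda handler can launch an EC2 instance whose user-data script pulls and runs the training job; everything here (AMI, instance type, script, bucket) is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Placeholder bootstrap script: fetch the training code, run it, then shut down
        user_data = """#!/bin/bash
        aws s3 cp s3://my-bucket/train.py /tmp/train.py
        python3 /tmp/train.py
        shutdown -h now
        """
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder AMI
            InstanceType="m5.xlarge",          # placeholder instance type
            MinCount=1,
            MaxCount=1,
            UserData=user_data,
            InstanceInitiatedShutdownBehavior="terminate",
        )
        return {"status": "training instance launched"}

Managed services such as AWS Batch or SageMaker training jobs cover the same "longer than 15 minutes" gap without managing instances directly, and may be worth comparing against the raw EC2 route.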
What I want to achieve: transfer data files that arrive hourly from several different feeds onto an SFTP file server located on a Compute Engine VM, and get them into BigQuery with real-time updates, effectively and cost-efficiently. Context: the software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option. It does, however, support exporting data to an SFTP server, which is not available …
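Whatever moves the files around, the final hop can be a plain load job run from the VM; a sketch of just that step using the google-cloud-bigquery client, with project, dataset, table, and file path as placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Hypothetical hourly drop from one of the feeds
    with open("/sftp/feeds/export_latest.csv", "rb") as f:
        job = client.load_table_from_file(
            f, "my-project.my_dataset.feed_events", job_config=job_config
        )

    job.result()  # wait for the load to finish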
I have a 500 MB model which I am committing to Git. That is really bad practice, since with newer model versions the repository will become huge. It will also slow down all builds for deployments. I thought of using another repository that contains all the models and then fetching them at runtime. Does anybody know a cleaner approach or alternative?
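One clean pattern is to keep only code in Git and pull the weights from object storage (or a model registry) at runtime; a small sketch, with bucket, key, and paths made up:

    import os
    import boto3

    MODEL_BUCKET = "my-models"                 # placeholder bucket
    MODEL_KEY = "churn-model/v3/model.pkl"     # placeholder, versioned key
    LOCAL_PATH = "/tmp/model.pkl"

    def ensure_model():
        # Download the weights once per container/host, then reuse the local copy
        if not os.path.exists(LOCAL_PATH):
            boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
        return LOCAL_PATH

Git LFS or DVC are also commonly used when large binaries should still be versioned alongside the code rather than stored in the repository itself.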
I'm predicting the hours that will be worked for building tasks. Due to the overall low sample size, I've stacked multiple related tasks together into a single model. (There may be 100 total samples in a single model, with each task having 10 to 20 samples individually.) An example would be: how long will it take a worker to complete each task associated with installing 2 different sizes of pipe in a hospital? There are many tasks associated with installing a …
A lot of the time, after merging two pandas dataframes, I end up with NaNs in the new dataframe. That's just the way it is, because one CSV does not have all the IDs that the other has (two dataframes of different sizes, for example). Those NaNs were not present before; it's just the nature of the left join in pandas to mark that missing data as NaN. So some rows have NaN values in some columns. My question …
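For completeness, the usual options after such a join are to fill the NaNs with a constant or a statistic, or to drop the incomplete rows; a tiny sketch with made-up columns:

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    right = pd.DataFrame({"id": [1, 3], "score": [10.0, 30.0]})

    merged = left.merge(right, on="id", how="left")   # id 2 gets NaN for 'score'

    merged["score"] = merged["score"].fillna(0)                          # fill with a constant
    # merged["score"] = merged["score"].fillna(merged["score"].mean())  # or with a statistic
    # merged = merged.dropna(subset=["score"])                          # or drop incomplete rows
    print(merged)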
I'm making a side project where I collect geospatial data by web scraping and from the OSM API. I've started with a simple Java application; however, I would like to turn it into a data flow, purely for learning purposes. Unfortunately, my knowledge about the tools, and especially about connecting them, is, well, low. What is my goal? As a final result, I want to visualize the scraped geospatial points on a map, with the roads connecting them (from OSM). Current flow: in a standalone Java application …
I would like to better understand what a good Data Engineer must know, or what they do. Job descriptions mostly list the tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, on what principles is Data Engineering based, and what is the result of Data Engineering? Is it creating some data structures? If so, what might these structures be? Are there standards or best practices?