So I have a dataset spread over multiple, ever-growing Excel files, all of which look like:

    email                 order_ID   order_date
    [email protected]       1234       23-Mar-2021
    [email protected]       1235       23-Mar-2021
    [email protected]       1236       23-Mar-2021
    [email protected]       1237       24-Mar-2021
    [email protected]       1238       28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; emails are traded for user_IDs for anonymity, and returning customers are marked for further analyses):

    user_ID   order_ID   order_date    is_returning?
    1         1234       23-Mar-2021   0
    2         1235       23-Mar-2021   0
    2         1236       23-Mar-2021   1
    1         …
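Something along these lines may be what the transformation step looks like: a minimal pandas sketch, assuming the Excel files are read into one DataFrame with the columns shown above (the file names are placeholders).

    import pandas as pd

    # Hypothetical list of the ever-growing Excel files
    files = ["orders_1.xlsx", "orders_2.xlsx"]
    orders = pd.concat([pd.read_excel(f) for f in files], ignore_index=True)

    # Parse the dates so ordering by time works
    orders["order_date"] = pd.to_datetime(orders["order_date"], format="%d-%b-%Y")

    # Replace each email with a stable anonymous user_ID (1-based, by first appearance)
    orders["user_ID"] = pd.factorize(orders["email"])[0] + 1

    # Mark every order after a user's first one as returning
    orders = orders.sort_values("order_date")
    orders["is_returning"] = orders.groupby("user_ID").cumcount().gt(0).astype(int)

    # Public dataset: drop the email column entirely
    public_orders = orders[["user_ID", "order_ID", "order_date", "is_returning"]]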
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used a Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this from CMD, so I save my Python code in a text file and save it as test.py. Then I run that Python file in CMD using the python test.py command; below is the screenshot: So my task previously worked, but after 3 …
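For reference, a standalone test.py usually has to build its own SparkSession explicitly (in a notebook this is sometimes already set up). A minimal sketch, with the input file and options as placeholders since the actual code is only visible in the screenshots:

    # test.py -- minimal standalone sketch; file path and options are assumptions
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("import-test").getOrCreate()

    # Hypothetical input; replace with whatever the notebook was reading
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.show(5)

    spark.stop()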
I am new to data engineering and wanted to know: what is the best way to store more than 3000 GB of data for further processing and analysis? I am specifically looking for open-source resources. I have explored many data formats for storage. The dataset that I want to store is heart rate pulse data generated by a sensor.
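One common open-source pattern at that scale is columnar, compressed storage such as Parquet, partitioned by a field that is filtered on often. A rough pyspark sketch, with the paths and the partition column as assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pulse-ingest").getOrCreate()

    # Hypothetical raw exports from the sensor
    pulse = spark.read.csv("raw/pulse/*.csv", header=True, inferSchema=True)

    # Columnar, compressed, splittable storage; partition by a column used in filters
    (pulse
     .write
     .partitionBy("date")        # assumes a 'date' column exists
     .mode("append")
     .parquet("warehouse/pulse"))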
I have a pipeline which outputs model scores to S3. I need to partition the data by model_type and date. Which of the layouts below is the most efficient way to partition the data?

    s3://bucket/data/model_type=foo/dt=YYYY-MM-DD/a.csv
    s3://bucket/data/dt=YYYY-MM-DD/model_type=foo/a.csv
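For what it's worth, the directory order is simply the order in which the partition columns are listed at write time; a small pyspark sketch producing the first layout (bucket and input paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    scores = spark.read.parquet("s3://bucket/staging/scores/")  # hypothetical input

    # ("model_type", "dt") gives .../model_type=foo/dt=YYYY-MM-DD/;
    # swapping the arguments gives the second layout
    (scores
     .write
     .partitionBy("model_type", "dt")
     .mode("overwrite")
     .csv("s3://bucket/data/"))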
I would like to sort a multi-index pandas dataframe by a column, but I do not want the entire dataframe to be sorted at once; rather, I would like to sort within one of the index levels. Here is an example of what I mean. Below is a multi-index dataframe:

    first  second
    bar    one       0.361041
           two       0.476720
    baz    one       0.565781
           two       0.848519
    foo    one       0.405524
           two       0.882497
    qux    one       0.488229
           two       0.303862

What I want to do is to …
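If "sort within each value of the first index level" is the goal, a groupby over that level with a per-group sort is one way to sketch it (the column name value is made up here, since the example above only shows numbers):

    import numpy as np
    import pandas as pd

    # Reconstruct a frame shaped like the example (values are illustrative)
    index = pd.MultiIndex.from_product(
        [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
    )
    df = pd.DataFrame({"value": np.random.rand(8)}, index=index)

    # Sort by 'value' inside each 'first' group, keeping the groups themselves in place
    result = (
        df.groupby(level="first", group_keys=False, sort=False)
          .apply(lambda g: g.sort_values("value"))
    )
    print(result)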
I am working on a machine learning practice problem from https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/#ProblemStatement. I want to replace the null values in the column 'Item_Weight', and for that I am using the mean values given by a pivot_table where I calculate the mean of 'Item_Weight' grouped by the column 'Item_Identifier' of the dataset.

    item_weight_mean = ds.pivot_table(values='Item_Weight', columns='Item_Identifier')
    loc2 = ds['Item_Weight'].isnull()
    ds.loc[loc2, 'Item_Weight'] = ds.loc[loc2, 'Item_Identifier'].apply(lambda x: item_weight_mean[x])

I am getting an error for this code:

    (key) -> 2902 indexer = self.columns.get_loc(key) …
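One way to sidestep the get_loc lookup error is to build a plain Series of per-identifier means and map it onto only the missing rows; a hedged sketch, with the CSV file name as a placeholder:

    import pandas as pd

    ds = pd.read_csv("Train.csv")  # placeholder file name; use the contest's training file

    # Series indexed by Item_Identifier -> mean Item_Weight (ignores NaNs)
    item_weight_mean = ds.groupby("Item_Identifier")["Item_Weight"].mean()

    # Fill only the rows where Item_Weight is missing
    loc2 = ds["Item_Weight"].isnull()
    ds.loc[loc2, "Item_Weight"] = ds.loc[loc2, "Item_Identifier"].map(item_weight_mean)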
I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that are sometimes 800 columns wide (600 with a ton of N/As) and which have no or almost no documentation. This is my first job so I don't know …
I am trying to come up with a data pipeline architecture. The data I deal with is event logs for labs (requested, failed, succeeded, etc.) with timestamps and some customer info, for several different customers. Eventually I want that data dumped into a dashboard, for both external and internal use. What's the best way to approach this: event-driven or batch-driven ETL? We don't care much about real-time processing, and the data is rather small.
I'm working in a company that has two legacy data warehouses, which have evolved into unmaintainable monoliths over time. Therefore, they are in dire need of reform. I'm investigating a reform of the current data architecture into an architecture that is more in line with the principles of a data mesh, as advocated in this influential article by Zhamak Dehghani (probably well-known material to data professionals here). The first data warehouse, say DWH-A, mainly consists of data coming directly …
I have a dataset where there are categorical features as well as numeric features, and I have to perform one-hot encoding, normalization, and feature selection on it. In what order should I perform these steps on my data? I am new to data science, so please explain the logic behind it in layman's terms too. Thank you.
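As a rough illustration of one common ordering (encode and scale first, then select features on the resulting matrix), here is a scikit-learn Pipeline sketch; the column names and the final estimator are made up:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    categorical = ["colour", "city"]   # hypothetical categorical columns
    numeric = ["age", "income"]        # hypothetical numeric columns

    preprocess = ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numeric),
    ])

    model = Pipeline([
        ("preprocess", preprocess),                 # one-hot encoding + normalization
        ("select", SelectKBest(f_classif, k=10)),   # feature selection on the encoded/scaled matrix
        ("clf", LogisticRegression(max_iter=1000)), # placeholder estimator
    ])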
I have a dataset on which I have to perform feature selection using a correlation-based feature selection process (using scikit-learn). Can anyone please show me how to do it, with a small example using the sklearn library? I have read sklearn's documentation about feature selection (https://scikit-learn.org/stable/modules/feature_selection.html), but I am not able to wrap my head around it. Please explain the steps in layman's terms, if possible. Thank you.
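scikit-learn does not ship a class literally called "correlation-based feature selection", but SelectKBest with f_regression scores each feature with a correlation-based F-test against the target, which is one way to read the request; a small self-contained sketch on a built-in dataset:

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectKBest, f_regression

    X, y = load_diabetes(return_X_y=True)

    # Keep the 5 features whose correlation-based score with y is highest
    selector = SelectKBest(score_func=f_regression, k=5)
    X_selected = selector.fit_transform(X, y)

    print("kept feature indices:", selector.get_support(indices=True))
    print("shape before/after:", X.shape, X_selected.shape)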
I have a relatively complex query which runs against a database and contains multiple join statements, lead/lag functions, a subquery, etc. These tables are available as individual files in my object store. I am trying to run a Spark job to perform the same query. Is it advisable to try and convert the SQL query into Spark SQL (which I was able to do by making a few changes), or is it better to use the DataFrame API to reconstruct the query and …
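To make the trade-off concrete, both routes run through the same engine once the files are registered; a sketch with table, column, and path names as placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("sql-vs-df").getOrCreate()

    orders = spark.read.parquet("s3://bucket/orders/")   # hypothetical path
    orders.createOrReplaceTempView("orders")

    # Option 1: keep the SQL mostly as-is
    sql_result = spark.sql("""
        SELECT customer_id,
               amount,
               LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount
        FROM orders
    """)

    # Option 2: the equivalent DataFrame API
    w = Window.partitionBy("customer_id").orderBy("order_date")
    df_result = orders.select(
        "customer_id", "amount", F.lag("amount").over(w).alias("prev_amount")
    )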
My team is exploring options to create a robust "analytics" capability that is well-suited for our large quantities of sensor test data. I'd appreciate any suggestions for technologies that would perform well for my use case. About my data:

- For each test, we process binary recordings into flat files for each end-user (maybe 5 to 15 files per test, for hundreds of tests per year)
- Each file contains time-series data for 100 to 1000 parameters
- Parameter sample rates are anywhere …
We are building an ML pipeline on AWS, which will obviously require some heavy-compute components, including preprocessing and batch training. Most of the pipeline is on Lambda, but Lambda is known to have a limit on how long a job can run (~15 min). Thus, for the longer-running jobs like batch training of ML models, we will need(?) to use something like EC2 instances. For example, a Lambda function could be invoked and then kick off an EC2 instance …
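As one possible shape for that hand-off, a Lambda handler can launch an EC2 instance whose user-data script pulls and runs the training job; everything here (AMI, instance type, script, bucket) is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")

    def handler(event, context):
        # Placeholder bootstrap script: fetch the training code, run it, then shut down
        user_data = """#!/bin/bash
        aws s3 cp s3://my-bucket/train.py /tmp/train.py
        python3 /tmp/train.py
        shutdown -h now
        """
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder AMI
            InstanceType="m5.xlarge",          # placeholder instance type
            MinCount=1,
            MaxCount=1,
            UserData=user_data,
            InstanceInitiatedShutdownBehavior="terminate",
        )
        return {"status": "training instance launched"}

Managed services such as AWS Batch or SageMaker training jobs cover the same "longer than 15 minutes" gap without managing instances directly, and may be worth comparing against the raw EC2 route.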
What I want to achieve: transfer data files that arrive hourly from several different feeds onto an SFTP file server located on a Compute Engine VM, and get them into BigQuery with real-time updates, effectively and cost-efficiently. Context: the software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option. It does, however, support exporting data to an SFTP server, which is not available …
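Whatever moves the files around, the final hop can be a plain load job run from the VM; a sketch of just that step using the google-cloud-bigquery client, with project, dataset, table, and file path as placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    # Hypothetical hourly drop from one of the feeds
    with open("/sftp/feeds/export_latest.csv", "rb") as f:
        job = client.load_table_from_file(
            f, "my-project.my_dataset.feed_events", job_config=job_config
        )

    job.result()  # wait for the load to finish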
I have a 500 MB model which I am committing to Git. That is really bad practice, since with newer model versions the repository will become huge. It will also slow down all builds for deployments. I thought of using another repository that contains all the models and then fetching them at runtime. Does anybody know a cleaner approach or alternative?
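One clean pattern is to keep only code in Git and pull the weights from object storage (or a model registry) at runtime; a small sketch, with bucket, key, and paths made up:

    import os
    import boto3

    MODEL_BUCKET = "my-models"                 # placeholder bucket
    MODEL_KEY = "churn-model/v3/model.pkl"     # placeholder, versioned key
    LOCAL_PATH = "/tmp/model.pkl"

    def ensure_model():
        # Download the weights once per container/host, then reuse the local copy
        if not os.path.exists(LOCAL_PATH):
            boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
        return LOCAL_PATH

Git LFS or DVC are also commonly used when large binaries should still be versioned alongside the code rather than stored in the repository itself.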
I'm predicting the hours that will be worked for building tasks. Due to the overall low sample size, I've stacked multiple related tasks together into a single model. (There may be 100 total samples in a single model, with each task having 10 to 20 samples individually.) An example would be: how long will it take a worker to complete each task associated with installing 2 different sizes of pipe in a hospital? There are many tasks associated with installing a …
A lot of the time, after merging two pandas dataframes, I end up with NaNs in the new dataframe. That's just the way it is, because one CSV does not have all the IDs that the other has (two dataframes of different sizes, for example). Those NaNs were not present before; it's just the nature of the left join in pandas to mark that missing data as NaN. So some rows have NaN values in some columns. My question …
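For completeness, the usual options after such a join are to fill the NaNs with a constant or a statistic, or to drop the incomplete rows; a tiny sketch with made-up columns:

    import pandas as pd

    left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    right = pd.DataFrame({"id": [1, 3], "score": [10.0, 30.0]})

    merged = left.merge(right, on="id", how="left")   # id 2 gets NaN for 'score'

    merged["score"] = merged["score"].fillna(0)                          # fill with a constant
    # merged["score"] = merged["score"].fillna(merged["score"].mean())  # or with a statistic
    # merged = merged.dropna(subset=["score"])                          # or drop incomplete rows
    print(merged)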
I'm making a side project where I collect geospatial data by web scraping and from the OSM API. I've started with a simple Java application; however, I would like to turn it into a data flow, purely for learning purposes. Unfortunately, my knowledge about the tools, and especially about connecting them, is, well, low. What is my goal? As a final result, I want to visualize the scraped geospatial points on a map, with the roads connecting them (from OSM). Current flow: in a standalone Java application …
I would like to better understand what a good Data Engineer must know, or what they do. Job descriptions mostly list the tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, on what principles is Data Engineering based, and what is the result of Data Engineering? Is it creating some data structures? If so, what might these structures be? Are there standards or best practices?