So I have a dataset spread over multiple and ever-growing Excel files, all of which look like:

    email                order_ID  order_date
    [email protected]    1234      23-Mar-2021
    [email protected]    1235      23-Mar-2021
    [email protected]    1236      23-Mar-2021
    [email protected]    1237      24-Mar-2021
    [email protected]    1238      28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; trading emails for user_IDs for anonymity and marking returning customers for further analyses):

    user_ID  order_ID  order_date   is_returning?
    1        1234      23-Mar-2021  0
    2        1235      23-Mar-2021  0
    2        1236      23-Mar-2021  1
    1        …
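A minimal pandas sketch of that first step, assuming the exports sit in one folder, the columns match the sample above, and the file path and date format below are guesses:

    import glob
    import pandas as pd

    # Combine every Excel export in the folder (path is a placeholder).
    frames = [pd.read_excel(path) for path in glob.glob("exports/*.xlsx")]
    df = pd.concat(frames, ignore_index=True)

    # Parse the dates so ordering by them is chronological, not alphabetical.
    df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%b-%Y")

    # Anonymise: assign a stable integer user_ID per email, in order of first appearance.
    df["user_ID"] = pd.factorize(df["email"])[0] + 1

    # Mark every order after a user's first one as returning.
    df = df.sort_values(["user_ID", "order_date", "order_ID"])
    df["is_returning?"] = (df.groupby("user_ID").cumcount() > 0).astype(int)

    # Public Orders dataset: drop the email column entirely.
    orders = df[["user_ID", "order_ID", "order_date", "is_returning?"]]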
I'm a Data Analyst at a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time figuring out where to find the data and how to pull it instead of analyzing it. I have to pull from tables that are sometimes 800 columns wide (600 of them full of N/As) and that have little or no documentation. This is my first job, so I don't know …
I am trying to come up with a data pipeline architecture. The data I deal with is event logs for labs requested, failed, succeeded, etc., with timestamps and some customer info, for several different customers. Eventually I want that data dumped into a dashboard, for both external and internal use. What's the best way to approach it: event-driven or batch-driven ETL? We don't care much about real-time processing, and the data is rather small.
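For the batch-driven route, a minimal sketch of the kind of job that could run on a schedule (cron, Cloud Scheduler, etc.); the table and column names are made up for illustration:

    import sqlite3
    import pandas as pd

    # Hypothetical source: a lab_events table with customer_id, status, ts columns.
    conn = sqlite3.connect("labs.db")
    events = pd.read_sql_query("SELECT customer_id, status, ts FROM lab_events", conn)

    # Aggregate per customer per day: counts of requested / failed / succeeded events.
    events["day"] = pd.to_datetime(events["ts"]).dt.date
    summary = (
        events.groupby(["customer_id", "day", "status"])
        .size()
        .unstack(fill_value=0)
        .reset_index()
    )

    # Overwrite the table the dashboard reads; fine for small data with no real-time needs.
    summary.to_sql("lab_daily_summary", conn, if_exists="replace", index=False)
    conn.close()

For small data with no real-time requirement, running something like this hourly or daily is usually simpler and cheaper to operate than an event-driven setup.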
In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit tests, integration tests, user acceptance tests) to secure the stability of the software. An ETL pipeline, as a piece of code, should also go through these testing steps to build a healthy system. However, due to the nature of the ETL process, traditional testing techniques may not be directly applicable. Is there any reference or guideline specifically focused on testing …
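One common approach is to isolate each transformation as a pure function and unit-test it like any other code; a small pytest-style sketch, where the transform and its rules are invented for illustration:

    import pandas as pd

    def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
        """Example transform: drop rows with no order_id and parse the dates."""
        out = df.dropna(subset=["order_id"]).copy()
        out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
        return out

    def test_clean_orders_drops_missing_ids():
        raw = pd.DataFrame(
            {"order_id": [1, None], "order_date": ["2021-03-23", "2021-03-24"]}
        )
        cleaned = clean_orders(raw)
        assert len(cleaned) == 1
        assert pd.api.types.is_datetime64_any_dtype(cleaned["order_date"])

Integration and acceptance stages can then follow the same pattern against small fixture datasets and a staging copy of the target tables.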
I'm trying to figure out the best and most efficient method of handling ETL operations for big data. My question is this: say I have a table that is ~50 GB in size. In order to effectively transfer the data in this table from one source to another, specifically using PySpark, do I need to have more than 50 GB of RAM? Thanks for your help.
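In general, no: Spark reads and writes the table in partitions rather than collecting it in one place, so the cluster does not need 50 GB of RAM for a 50 GB table (though more memory means fewer spills to disk). A hedged PySpark sketch of a partitioned JDBC copy, with placeholder connection details and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("table_copy").getOrCreate()

    # Read the source table in parallel chunks; each executor only holds the
    # partitions it is currently working on, never the whole 50 GB.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://source-host:5432/db")   # placeholder
        .option("dbtable", "public.big_table")                    # placeholder
        .option("user", "etl_user")
        .option("password", "secret")
        .option("partitionColumn", "id")   # any roughly uniform numeric column
        .option("lowerBound", 1)
        .option("upperBound", 50_000_000)
        .option("numPartitions", 64)
        .load()
    )

    # Write straight back out; nothing forces the full dataset onto the driver.
    df.write.mode("overwrite").parquet("s3a://target-bucket/big_table/")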
What I want to achieve: transfer hourly data files that arrive on an SFTP file server (located on a Compute Engine VM) from several different feeds into BigQuery, with real-time updates, effectively and cost-efficiently. Context: the software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option. It does, however, support exporting data to an SFTP server. Which is not available …
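One inexpensive pattern, assuming the Compute Engine VM hosting the SFTP server can also run a small cron job: load each newly arrived file straight into BigQuery with the client library (project, dataset, table, and paths below are placeholders):

    import glob
    import os
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.feeds.raw_events"   # placeholder

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    # Load every file the feeds have dropped since the last run, then move it aside
    # so the next cron run does not load it again.
    for path in glob.glob("/sftp/incoming/*.csv"):
        with open(path, "rb") as f:
            client.load_table_from_file(f, table_id, job_config=job_config).result()
        os.rename(path, path.replace("/incoming/", "/processed/"))

Batch load jobs themselves are not billed in BigQuery (storage and queries are), which is why this tends to come out cheaper than streaming inserts when hourly freshness is enough.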
I'm updating a legacy ~ETL; at its base it exports some tables of the prod DB to S3, and the export contains a query. The export process generates a csv file using the following logic:

    import sh

    res = sh.sed(
        sh.mysql(
            '-u', settings_dict['USER'],
            '--password={0}'.format(settings_dict['PASSWORD']),
            '-D', settings_dict['NAME'],
            '-h', settings_dict['HOST'],
            '--port={0}'.format(settings_dict['PORT']),
            '--batch', '--quick', '--max_allowed_packet=512M',
            '-e', '{0}'.format(query)
        ),
        r's/"/\\"/g;s/\t/","/g;s/^/"/;s/$/"/;s/\n//g',
        _out=filename
    )

The mid-term solution with more traction is AWS Glue, but if I could have a similar function to generate parquet files instead of csv files …
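A possible interim replacement for that function, assuming pandas, SQLAlchemy, PyMySQL, and pyarrow are acceptable dependencies: run the same query, stream the result set in chunks, and write a parquet file instead of the escaped csv:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from sqlalchemy import create_engine

    def export_to_parquet(settings_dict, query, filename, chunksize=100_000):
        url = "mysql+pymysql://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}".format(**settings_dict)
        engine = create_engine(url)

        writer = None
        # Stream the result set so the whole table never sits in memory at once.
        # Assumes the chunks come back with a consistent schema; columns that are
        # all-NULL in some chunks may need explicit dtypes.
        for chunk in pd.read_sql(query, engine, chunksize=chunksize):
            table = pa.Table.from_pandas(chunk, preserve_index=False)
            if writer is None:
                writer = pq.ParquetWriter(filename, table.schema)
            writer.write_table(table)
        if writer is not None:
            writer.close()

The resulting file can then be uploaded to S3 the same way the csv is today.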
I would like to better understand what a good Data Engineer must know and what they do. Job descriptions mostly list the tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, what principles is Data Engineering based on, and what is the result of Data Engineering? Is it creating some data structures? If so, what might these structures be? Are there standards or best practices?
I have a PDF file (an admission application). I want to read/search the PDF, extract terms with similar meaning, and then convert this data into a DataFrame to save as an xlsm file. HELP!
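A rough starting point, assuming pypdf and pandas are available. "Similar meaning" is reduced here to a hand-made synonym list, which is an assumption (real semantic matching would need an NLP library), and the output is a plain .xlsx, since writing .xlsm needs a VBA-aware tool:

    import pandas as pd
    from pypdf import PdfReader

    # Hypothetical synonym groups to search for; extend as needed.
    TERMS = {
        "tuition": ["tuition", "fee", "fees", "cost"],
        "deadline": ["deadline", "due date", "closing date"],
    }

    reader = PdfReader("admission_application.pdf")
    rows = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        for concept, variants in TERMS.items():
            for variant in variants:
                if variant in text:
                    rows.append({"page": page_number, "concept": concept, "match": variant})

    df = pd.DataFrame(rows)
    df.to_excel("extracted_terms.xlsx", index=False)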
I want to integrate Dagster into an ongoing Django project. Dagster runs outside the Django context, so there is no way to directly access the Django ORM without calling django.setup() somewhere. I did it in the __init__ of my app, but this is not acceptable because it breaks app execution (like runserver).
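One pattern that sidesteps the __init__ problem, sketched under assumptions about the layout: do the setup inside the Dagster code location module itself (module, settings, and model names below are hypothetical), guarded so it is a no-op when Django is already configured:

    # dagster_pipeline/definitions.py  (hypothetical module loaded only by Dagster)
    import os

    import django
    from django.apps import apps

    # Configure Django only when Dagster loads this module, never from the app's
    # own __init__, so runserver and friends are unaffected.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # assumption
    if not apps.ready:
        django.setup()

    from dagster import job, op

    @op
    def count_users():
        # The ORM is importable now that django.setup() has run.
        from myapp.models import User  # hypothetical model
        return User.objects.count()

    @job
    def user_stats_job():
        count_users()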
I have a school Big Data project where basically the teacher is going to give us a large amount of text documents (from the Project Gutenberg data set) and he wants us to output the document where a "keyword" is most relevant. He also wants us to divide the project into 3 parts: (1) data acquisition, preprocessing (cleaning, transform, join, etc.), and loading (the ETL process); (2) data processing; (3) a user-friendly application. I need to define what technologies or methods I'm …
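"Which document is a keyword most relevant to" is essentially a TF-IDF ranking problem; a small sketch with scikit-learn, assuming the cleaned books already sit in a folder as plain .txt files (the folder name is an assumption):

    import glob
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer

    paths = sorted(glob.glob("gutenberg_clean/*.txt"))   # assumed output of the ETL step
    docs = [Path(p).read_text(encoding="utf-8", errors="ignore") for p in paths]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)               # rows = documents, cols = terms

    def most_relevant_document(keyword: str) -> str:
        """Return the path of the document with the highest TF-IDF score for keyword."""
        vocab = vectorizer.vocabulary_
        if keyword.lower() not in vocab:
            return "keyword not found in any document"
        column = tfidf[:, vocab[keyword.lower()]].toarray().ravel()
        return paths[column.argmax()]

    print(most_relevant_document("whale"))

The same scoring can be reimplemented on top of Hadoop or Spark if the course requires a distributed stack; the formula does not change.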
I want to use R or Python to query big structured SQL-type data, but they are very slow compared to SAS. I tried using R and Python to return a 1.3 million record Oracle ODBC pass-through query. The query took 8-15 seconds in SAS, 20-30 seconds in Python, and 50-70 seconds in R. Does anyone know why? R Packages Used: first I used the RODBC package in R to query the Oracle database. Then I tried the ROracle package, …
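One thing worth ruling out before blaming the languages themselves is the fetch/array size, which defaults to small values in some drivers and dominates round-trip time on million-row results. A hedged Python sketch with cx_Oracle, with placeholder connection details and query:

    import cx_Oracle
    import pandas as pd

    conn = cx_Oracle.connect("user", "password", "dbhost:1521/service")  # placeholders

    cursor = conn.cursor()
    cursor.arraysize = 10_000      # fetch 10k rows per round trip instead of the default
    cursor.prefetchrows = 10_000
    cursor.execute("SELECT * FROM big_table")   # placeholder query

    df = pd.DataFrame(cursor.fetchall(), columns=[d[0] for d in cursor.description])

The RODBC and ROracle equivalents expose similar knobs (e.g. rows_at_time / bulk_read), which may account for part of the gap against SAS.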
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data? I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be …
Imagine that I have a field called date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this:

    Data_ID = FOREACH File GENERATE
        CONCAT((chararray)SUBSTRING(Date,0,4),
               (chararray)SUBSTRING(Date,6,2),
               (chararray)SUBSTRING(Date,9,2));

But I'm getting a list of nulls... Does anyone know what I'm doing wrong? Thanks!
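A likely cause, offered as a guess from Pig's built-in string functions: SUBSTRING(str, startIndex, stopIndex) takes a start index and a stop index (stop exclusive), not a start and a length, so (Date,6,2) and (Date,9,2) have stop < start and evaluate to null. Keeping the same approach, the month and day slices would be (5,7) and (8,10), with CONCAT nested in case the Pig version in use only accepts two arguments:

    -- 'File' and 'Date' as in the original script
    Data_ID = FOREACH File GENERATE
        CONCAT(CONCAT(SUBSTRING(Date,0,4), SUBSTRING(Date,5,7)),
               SUBSTRING(Date,8,10));

If an actual numeric field is needed rather than a chararray, the result can then be cast to a numeric type in a further step.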