So I have a dataset spread over multiple and ever-growing Excel files, all of which look like:

    email                order_ID  order_date
    [email protected]    1234      23-Mar-2021
    [email protected]    1235      23-Mar-2021
    [email protected]    1236      23-Mar-2021
    [email protected]    1237      24-Mar-2021
    [email protected]    1238      28-Mar-2021

The end goal is to have two distinct datasets. The first one is Orders (public, for analysis; trading emails for user_IDs for anonymity and marking returning customers for further analyses):

    user_ID  order_ID  order_date   is_returning?
    1        1234      23-Mar-2021  0
    2        1235      23-Mar-2021  0
    2        1236      23-Mar-2021  1
    1        …
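A minimal pandas sketch of that first step, assuming the exports sit in one folder, the columns match the sample above, and the file path and date format below are guesses:

    import glob
    import pandas as pd

    # Combine every Excel export in the folder (path is a placeholder).
    frames = [pd.read_excel(path) for path in glob.glob("exports/*.xlsx")]
    df = pd.concat(frames, ignore_index=True)

    # Parse the dates so ordering by them is chronological, not alphabetical.
    df["order_date"] = pd.to_datetime(df["order_date"], format="%d-%b-%Y")

    # Anonymise: assign a stable integer user_ID per email, in order of first appearance.
    df["user_ID"] = pd.factorize(df["email"])[0] + 1

    # Mark every order after a user's first one as returning.
    df = df.sort_values(["user_ID", "order_date", "order_ID"])
    df["is_returning?"] = (df.groupby("user_ID").cumcount() > 0).astype(int)

    # Public Orders dataset: drop the email column entirely.
    orders = df[["user_ID", "order_ID", "order_date", "is_returning?"]]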
I'm a Data Analyst at a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time figuring out where to find the data and how to pull it instead of analyzing it. I have to pull from tables that are sometimes 800 columns wide (600 of them full of N/As) and that have little or no documentation. This is my first job, so I don't know …
I am trying to come up with a data pipeline architecture. The data I deal with is event logs for labs requested, failed, succeeded, etc., with timestamps and some customer info, for several different customers. Eventually I want that data dumped into a dashboard, for both external and internal use. What's the best way to approach it: event-driven or batch-driven ETL? We don't care much about real-time processing, and the data is rather small.
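For the batch-driven route, a minimal sketch of the kind of job that could run on a schedule (cron, Cloud Scheduler, etc.); the table and column names are made up for illustration:

    import sqlite3
    import pandas as pd

    # Hypothetical source: a lab_events table with customer_id, status, ts columns.
    conn = sqlite3.connect("labs.db")
    events = pd.read_sql_query("SELECT customer_id, status, ts FROM lab_events", conn)

    # Aggregate per customer per day: counts of requested / failed / succeeded events.
    events["day"] = pd.to_datetime(events["ts"]).dt.date
    summary = (
        events.groupby(["customer_id", "day", "status"])
        .size()
        .unstack(fill_value=0)
        .reset_index()
    )

    # Overwrite the table the dashboard reads; fine for small data with no real-time needs.
    summary.to_sql("lab_daily_summary", conn, if_exists="replace", index=False)
    conn.close()

For small data with no real-time requirement, running something like this hourly or daily is usually simpler and cheaper to operate than an event-driven setup.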
In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit tests, integration tests, user acceptance tests) to secure the stability of the software. An ETL pipeline, as a piece of code, should also go through these testing steps to build a healthy system. However, due to the nature of the ETL process, traditional testing techniques may not be directly applicable. Is there any reference or guideline specifically focused on testing …
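One common approach is to isolate each transformation as a pure function and unit-test it like any other code; a small pytest-style sketch, where the transform and its rules are invented for illustration:

    import pandas as pd

    def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
        """Example transform: drop rows with no order_id and parse the dates."""
        out = df.dropna(subset=["order_id"]).copy()
        out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
        return out

    def test_clean_orders_drops_missing_ids():
        raw = pd.DataFrame(
            {"order_id": [1, None], "order_date": ["2021-03-23", "2021-03-24"]}
        )
        cleaned = clean_orders(raw)
        assert len(cleaned) == 1
        assert pd.api.types.is_datetime64_any_dtype(cleaned["order_date"])

Integration and acceptance stages can then follow the same pattern against small fixture datasets and a staging copy of the target tables.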
I'm trying to figure out the best and most efficient method of handling ETL operations for big data. My question is this: say I have a table that is ~50 GB in size. In order to effectively transfer the data in this table from one source to another, specifically using PySpark, do I need to have more than 50 GB of RAM? Thanks for your help.
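In general, no: Spark reads and writes the table in partitions rather than collecting it in one place, so the cluster does not need 50 GB of RAM for a 50 GB table (though more memory means fewer spills to disk). A hedged PySpark sketch of a partitioned JDBC copy, with placeholder connection details and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("table_copy").getOrCreate()

    # Read the source table in parallel chunks; each executor only holds the
    # partitions it is currently working on, never the whole 50 GB.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://source-host:5432/db")   # placeholder
        .option("dbtable", "public.big_table")                    # placeholder
        .option("user", "etl_user")
        .option("password", "secret")
        .option("partitionColumn", "id")   # any roughly uniform numeric column
        .option("lowerBound", 1)
        .option("upperBound", 50_000_000)
        .option("numPartitions", 64)
        .load()
    )

    # Write straight back out; nothing forces the full dataset onto the driver.
    df.write.mode("overwrite").parquet("s3a://target-bucket/big_table/")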
What I want to achieve: transfer hourly data files that arrive on an SFTP file server (located on a Compute Engine VM) from several different feeds into BigQuery, with real-time updates, effectively and cost-efficiently. Context: the software I am trying to import data from is old legacy software and does not support direct exports to the cloud, so a direct connection from the software to the cloud isn't an option. It does, however, support exporting data to an SFTP server. Which is not available …
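One inexpensive pattern, assuming the Compute Engine VM hosting the SFTP server can also run a small cron job: load each newly arrived file straight into BigQuery with the client library (project, dataset, table, and paths below are placeholders):

    import glob
    import os
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.feeds.raw_events"   # placeholder

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )

    # Load every file the feeds have dropped since the last run, then move it aside
    # so the next cron run does not load it again.
    for path in glob.glob("/sftp/incoming/*.csv"):
        with open(path, "rb") as f:
            client.load_table_from_file(f, table_id, job_config=job_config).result()
        os.rename(path, path.replace("/incoming/", "/processed/"))

Batch load jobs themselves are not billed in BigQuery (storage and queries are), which is why this tends to come out cheaper than streaming inserts when hourly freshness is enough.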
I'm updating a legacy ~ETL; at its base it exports some tables of the prod DB to S3, and the export contains a query. The export process generates a csv file using the following logic:

    import sh

    res = sh.sed(
        sh.mysql(
            '-u', settings_dict['USER'],
            '--password={0}'.format(settings_dict['PASSWORD']),
            '-D', settings_dict['NAME'],
            '-h', settings_dict['HOST'],
            '--port={0}'.format(settings_dict['PORT']),
            '--batch', '--quick', '--max_allowed_packet=512M',
            '-e', '{0}'.format(query)
        ),
        r's/"/\\"/g;s/\t/","/g;s/^/"/;s/$/"/;s/\n//g',
        _out=filename
    )

The mid-term solution with more traction is AWS Glue, but if I could have a similar function to generate parquet files instead of csv files …
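A possible interim replacement for that function, assuming pandas, SQLAlchemy, PyMySQL, and pyarrow are acceptable dependencies: run the same query, stream the result set in chunks, and write a parquet file instead of the escaped csv:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from sqlalchemy import create_engine

    def export_to_parquet(settings_dict, query, filename, chunksize=100_000):
        url = "mysql+pymysql://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}".format(**settings_dict)
        engine = create_engine(url)

        writer = None
        # Stream the result set so the whole table never sits in memory at once.
        # Assumes the chunks come back with a consistent schema; columns that are
        # all-NULL in some chunks may need explicit dtypes.
        for chunk in pd.read_sql(query, engine, chunksize=chunksize):
            table = pa.Table.from_pandas(chunk, preserve_index=False)
            if writer is None:
                writer = pq.ParquetWriter(filename, table.schema)
            writer.write_table(table)
        if writer is not None:
            writer.close()

The resulting file can then be uploaded to S3 the same way the csv is today.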
I would like to better understand what a good Data Engineer must know and what they do. Job descriptions mostly list the tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, what principles is Data Engineering based on, and what is the result of Data Engineering? Is it creating some data structures? If so, what might these structures be? Are there standards or best practices?
I have a PDF file (an admission application). I want to read/search the PDF, extract terms with similar meaning, and then convert this data into a DataFrame to save as an xlsm file. HELP!
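A rough starting point, assuming pypdf and pandas are available. "Similar meaning" is reduced here to a hand-made synonym list, which is an assumption (real semantic matching would need an NLP library), and the output is a plain .xlsx, since writing .xlsm needs a VBA-aware tool:

    import pandas as pd
    from pypdf import PdfReader

    # Hypothetical synonym groups to search for; extend as needed.
    TERMS = {
        "tuition": ["tuition", "fee", "fees", "cost"],
        "deadline": ["deadline", "due date", "closing date"],
    }

    reader = PdfReader("admission_application.pdf")
    rows = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").lower()
        for concept, variants in TERMS.items():
            for variant in variants:
                if variant in text:
                    rows.append({"page": page_number, "concept": concept, "match": variant})

    df = pd.DataFrame(rows)
    df.to_excel("extracted_terms.xlsx", index=False)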
I want to integrate Dagster into an ongoing Django project. Dagster runs outside the Django context, so there is no way to directly access the Django ORM without calling django.setup() somewhere. I did it in the __init__ of my app, but this is not acceptable because it breaks app execution (like runserver).
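One pattern that sidesteps the __init__ problem, sketched under assumptions about the layout: do the setup inside the Dagster code location module itself (module, settings, and model names below are hypothetical), guarded so it is a no-op when Django is already configured:

    # dagster_pipeline/definitions.py  (hypothetical module loaded only by Dagster)
    import os

    import django
    from django.apps import apps

    # Configure Django only when Dagster loads this module, never from the app's
    # own __init__, so runserver and friends are unaffected.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")  # assumption
    if not apps.ready:
        django.setup()

    from dagster import job, op

    @op
    def count_users():
        # The ORM is importable now that django.setup() has run.
        from myapp.models import User  # hypothetical model
        return User.objects.count()

    @job
    def user_stats_job():
        count_users()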
I have a school Big Data project where basically the teacher is going to give us a large amount of text documents (from the Project Gutenberg data set) and he wants us to output the document where a "keyword" is most relevant. He also wants us to divide the project into 3 parts: (1) data acquisition, preprocessing (cleaning, transform, join, etc.), and loading (the ETL process); (2) data processing; (3) a user-friendly application. I need to define what technologies or methods I'm …
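"Which document is a keyword most relevant to" is essentially a TF-IDF ranking problem; a small sketch with scikit-learn, assuming the cleaned books already sit in a folder as plain .txt files (the folder name is an assumption):

    import glob
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer

    paths = sorted(glob.glob("gutenberg_clean/*.txt"))   # assumed output of the ETL step
    docs = [Path(p).read_text(encoding="utf-8", errors="ignore") for p in paths]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)               # rows = documents, cols = terms

    def most_relevant_document(keyword: str) -> str:
        """Return the path of the document with the highest TF-IDF score for keyword."""
        vocab = vectorizer.vocabulary_
        if keyword.lower() not in vocab:
            return "keyword not found in any document"
        column = tfidf[:, vocab[keyword.lower()]].toarray().ravel()
        return paths[column.argmax()]

    print(most_relevant_document("whale"))

The same scoring can be reimplemented on top of Hadoop or Spark if the course requires a distributed stack; the formula does not change.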
I want to use R or Python to query big structured SQL-type data, but they are very slow compared to SAS. I tried using R and Python to return a 1.3 million record Oracle ODBC pass-through query. The query took 8-15 seconds in SAS, 20-30 seconds in Python, and 50-70 seconds in R. Does anyone know why? R Packages Used: first I used the RODBC package in R to query the Oracle database. Then I tried the ROracle package, …
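One thing worth ruling out before blaming the languages themselves is the fetch/array size, which defaults to small values in some drivers and dominates round-trip time on million-row results. A hedged Python sketch with cx_Oracle, with placeholder connection details and query:

    import cx_Oracle
    import pandas as pd

    conn = cx_Oracle.connect("user", "password", "dbhost:1521/service")  # placeholders

    cursor = conn.cursor()
    cursor.arraysize = 10_000      # fetch 10k rows per round trip instead of the default
    cursor.prefetchrows = 10_000
    cursor.execute("SELECT * FROM big_table")   # placeholder query

    df = pd.DataFrame(cursor.fetchall(), columns=[d[0] for d in cursor.description])

The RODBC and ROracle equivalents expose similar knobs (e.g. rows_at_time / bulk_read), which may account for part of the gap against SAS.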
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data? I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be …
Imagine that I have a field called date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this:

    Data_ID = FOREACH File GENERATE
        CONCAT((chararray)SUBSTRING(Date,0,4),
               (chararray)SUBSTRING(Date,6,2),
               (chararray)SUBSTRING(Date,9,2));

But I'm getting a list of nulls... Does anyone know what I'm doing wrong? Thanks!
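A likely cause, offered as a guess from Pig's built-in string functions: SUBSTRING(str, startIndex, stopIndex) takes a start index and a stop index (stop exclusive), not a start and a length, so (Date,6,2) and (Date,9,2) have stop < start and evaluate to null. Keeping the same approach, the month and day slices would be (5,7) and (8,10), with CONCAT nested in case the Pig version in use only accepts two arguments:

    -- 'File' and 'Date' as in the original script
    Data_ID = FOREACH File GENERATE
        CONCAT(CONCAT(SUBSTRING(Date,0,4), SUBSTRING(Date,5,7)),
               SUBSTRING(Date,8,10));

If an actual numeric field is needed rather than a chararray, the result can then be cast to a numeric type in a further step.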