How to create a Parquet file from a query to a MySQL table

I'm updating a legacy ETL-style job. At its core it exports some tables of the production DB to S3, and each export is defined by a query. The export process generates a CSV file using the following logic:

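import sh  # third-party "sh" package; exposes CLI programs as Python functions

# Passing the mysql command as the first argument pipes its output into sed.
res = sh.sed(
    sh.mysql(
        '-u', settings_dict['USER'],
        '--password={0}'.format(settings_dict['PASSWORD']),
        '-D', settings_dict['NAME'],
        '-h', settings_dict['HOST'],
        '--port={0}'.format(settings_dict['PORT']),
        '--batch',   # tab-separated output, one row per line
        '--quick',   # stream rows instead of buffering the whole result set
        '--max_allowed_packet=512M',
        '-e', '{0}'.format(query)
    ),
    # Escape double quotes, turn tabs into "," and wrap each line in quotes.
    r's/"/\\"/g;s/\t/","/g;s/^/"/;s/$/"/;s/\n//g',
    _out=filename
)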

The mid-term solution with the most traction is AWS Glue, but a similar function that generates Parquet files instead of CSV files would bring much-needed short-term gains.

Topic etl csv python

Category Data Science


I can think of a few ways besides using Apache Spark.


It seems that there is no direct way to do it other than through Spark / PySpark. As long as that holds true, the answer is on SO: https://stackoverflow.com/questions/27718382/how-to-work-with-mysql-and-apache-spark
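For the Spark route, a rough sketch could look like the following. It assumes PySpark 2.4 or later (for the JDBC "query" option) and that the MySQL Connector/J jar is on Spark's classpath. The function names and the driver class are assumptions, not something taken from the linked answer.

```python
def jdbc_options(settings_dict, query):
    """Build Spark JDBC reader options from the question's settings_dict."""
    return {
        "url": "jdbc:mysql://{HOST}:{PORT}/{NAME}".format(**settings_dict),
        "user": settings_dict["USER"],
        "password": settings_dict["PASSWORD"],
        # Assumption: MySQL Connector/J 8 or later.
        "driver": "com.mysql.cj.jdbc.Driver",
        # Requires Spark 2.4+; the query should not end with a semicolon.
        "query": query,
    }


def export_with_spark(settings_dict, query, path):
    """Run `query` against MySQL over JDBC and write the result as Parquet."""
    # Assumption: pyspark is installed; imported lazily so jdbc_options
    # can be used and tested without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mysql-to-parquet").getOrCreate()
    df = (
        spark.read.format("jdbc")
        .options(**jdbc_options(settings_dict, query))
        .load()
    )
    df.write.mode("overwrite").parquet(path)
```

The path can be a local directory. Writing straight to S3 would additionally need the Hadoop S3 connector configured, which is outside this sketch.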
