Which cloud platform should I use to maximize my impact as a data scientist?

I am looking to pick up the knowledge and software skills to move towards becoming an end-to-end deep learning engineer. By this I mean handling the following on my own:

  1. preprocess big data at low latency
  2. design/train deep learning models on massive data
  3. deploy models to serve predictions at massive scale
  4. stream/preprocess incoming data to update models in real time

Which cloud platform would you choose to do this?

  • GCP: lets me do the above with minimal effort (serverless model hosting, model versioning, etc.); however, it ties me to TensorFlow (I'm an MXNet fan). It also looks like I need to pick up Apache Beam for distributed data preprocessing (see the sketch after this list)...
  • AWS: maximum flexibility, but it seems far less clean. It appears better suited to a team of five experts wanting to achieve the above.
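For concreteness, here is a rough sketch of what the Beam side might look like with the Python SDK (the GCS paths, record format, and feature scaling are made-up placeholders; the same pipeline can run locally with the DirectRunner or on Dataflow by switching the runner option):

    # Minimal Apache Beam preprocessing sketch (Python SDK).
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_record(line):
        # Hypothetical raw record: one JSON object per line.
        record = json.loads(line)
        return {"id": record["id"], "features": record["features"]}

    def scale_features(example):
        # Toy normalization standing in for real preprocessing.
        total = sum(example["features"]) or 1.0
        example["features"] = [f / total for f in example["features"]]
        return example

    if __name__ == "__main__":
        options = PipelineOptions()  # e.g. pass --runner=DataflowRunner on the command line
        with beam.Pipeline(options=options) as p:
            (p
             | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.jsonl")
             | "Parse" >> beam.Map(parse_record)
             | "Scale" >> beam.Map(scale_features)
             | "Write" >> beam.io.WriteToText("gs://my-bucket/processed/part"))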

What software would you choose?

  • I'm essentially looking to pick up the minimum number of tools for the maximum impact.
  • I currently spend most of my time using Python + MXNet + EC2 and am comfortable with (2).

Tags: google-prediction-api, preprocessing, cloud-computing, deep-learning, distributed

Category: Data Science


It has been my experience that transitioning from local modeling to large-scale distributed programming is a lot more work than most data scientists realize, and leaves little room for anything BUT data engineering, similar to what @Emre said above.

If you're developing the infrastructure yourself (say, Spark) on GCP or AWS VMs, installing and maintaining it is a LOT of work. This is doubly true if you're running a multi-tenant system and/or supporting production jobs. You will constantly be solving 'why didn't my job run?' or 'why does my job take 14 days to run?' problems.
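To make that concrete, here is a toy PySpark job of the kind you would end up tuning (the S3 paths and columns are hypothetical); decisions like partition counts and shuffles are exactly where the 'why is my job slow?' questions come from, and they are on you:

    # Illustrative PySpark preprocessing/aggregation job.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("preprocess").getOrCreate()

    df = spark.read.parquet("s3://my-bucket/raw/")   # hypothetical input
    df = df.repartition(200, "user_id")              # too few or too many partitions -> slow jobs

    agg = (df.groupBy("user_id")
             .agg(F.count("*").alias("events"),
                  F.avg("value").alias("mean_value")))

    agg.write.mode("overwrite").parquet("s3://my-bucket/features/")
    spark.stop()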

If you're using the data science infrastructure built into those platforms (Redshift, Athena, Elasticsearch, etc.), you can save some time, but it is still EXTREMELY non-trivial to manage. There is a reason why, year after year, data scientists' favorite tools include Databricks and the like -- managing these things is a pain, and it requires a completely different skillset than actual modeling.
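For a sense of what 'using the built-in infrastructure' looks like day to day, here is a rough boto3 sketch against Athena (the database, table, and bucket names are placeholders). Notice how much of it is plumbing -- IAM permissions, a results bucket, polling -- rather than modeling:

    # Hedged sketch: querying Athena from Python with boto3.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    resp = athena.start_query_execution(
        QueryString="SELECT user_id, COUNT(*) AS events FROM clicks GROUP BY user_id",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = resp["QueryExecutionId"]

    # Poll until the query finishes; real code needs timeouts and error handling.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    print(query_id, state)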

All that being said, I have two suggestions. First, AWS is more mature and has more community answers to help you overcome hurdles than GCP does. You will run into issues with its IAM system (a necessary evil of responsible big data engineering) and with the quirks of its various products (Lambda, for instance, will only run scripts that finish within a few minutes), but overall, everything you're trying to do has already been done and documented by someone else. It is the better choice, IMO.
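As a rough illustration of where Lambda does fit, here is a minimal prediction handler (the model files, handler wiring, and event shape are hypothetical, and you would still have to squeeze the MXNet dependency into the deployment package). Anything heavier than fast inference runs into the execution-time cap:

    # Sketch of a Lambda handler serving MXNet predictions.
    import json
    import mxnet as mx

    # Load once at cold start so warm invocations reuse the model.
    model = mx.gluon.nn.SymbolBlock.imports(
        "model-symbol.json", ["data"], "model-0000.params")

    def lambda_handler(event, context):
        features = json.loads(event["body"])["features"]   # assumed API Gateway proxy event
        x = mx.nd.array([features])
        prediction = model(x).asnumpy().tolist()[0]
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}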

However, I'd urge you to take a closer look at a managed platform like Qubole or Databricks (I work for neither). I set up and maintained a Qubole/AWS environment for over a year. I learned a ton about data engineering, architecture, elastic Spark infrastructure, and all the nuances, pitfalls, and limitations of distributed computing that the rote documentation doesn't tell you about, yet I was still able to maintain a functioning system. My use cases didn't include deep learning, but they could have with a few configuration changes.

Then, once you have taken your lessons from these environments that work, you can architect, deploy, and support your own big data infrastructure to your heart's content (which will then be your full-time job). Hope that helps.
