Pandas on AWS SageMaker

Hello everyone, I have a question.

We want to start working with AWS SageMaker.

I understand that I can open a Jupyter notebook and work as if it were on my own computer.

But I know that pandas runs on a single node.

On my machine, for example, I have 64 GB of memory, and that is the limit for pandas because it is not parallel. AWS, however, is parallel, so how does pandas work with that?

Topic sagemaker aws pandas machine-learning

Category Data Science


In my opinion, your question has two parts:

  • How to run a processing job on a big AWS SageMaker instance with more than 64 GB of memory?
  • How to run pandas in parallel?

How to run a job on a big AWS SageMaker instance?

When you open SageMaker Studio, you work by default on an ml.t3.medium instance, which is very small and cheap. This default instance is not designed to run big processing jobs. It is meant for exploring small datasets, managing SageMaker, running notebooks, and so on.

To run a big job on SageMaker, you need to set up a processing job that runs on a separate, remote instance. You pay only for the time the job actually runs on that instance. Processing jobs can run pandas, sklearn, or PySpark code.
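As a rough illustration of what the pandas side of such a job could look like, here is a minimal sketch of a script body. The function name `summarize`, the output file name `summary.csv`, and the two `/opt/ml/processing/...` paths are assumptions for the example: the paths are the ones commonly used in AWS examples, but the real ones depend on how you configure the job's inputs and outputs.

```python
from pathlib import Path

import pandas as pd

# Assumed container paths; the real ones are whatever the job is configured with.
DEFAULT_INPUT_DIR = "/opt/ml/processing/input"
DEFAULT_OUTPUT_DIR = "/opt/ml/processing/output"


def summarize(input_dir: str, output_dir: str) -> Path:
    """Read every CSV in input_dir, compute summary statistics, save them to output_dir."""
    csv_files = sorted(Path(input_dir).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"no CSV files found in {input_dir}")

    # Ordinary single-node pandas: all rows are loaded into this instance's RAM,
    # so the instance type chosen for the job sets the memory limit.
    df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
    summary = df.describe()

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "summary.csv"
    summary.to_csv(out_path)
    return out_path
```

Inside the job, the script would simply call `summarize(DEFAULT_INPUT_DIR, DEFAULT_OUTPUT_DIR)`. Submitting the script is done separately, for example with the processing classes of the SageMaker Python SDK such as `SKLearnProcessor`, where you also pick the instance type. That part is not shown here because it needs an AWS account and role.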

How to run pandas in parallel?

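Pandas does not parallelize its own operations: it works on a single machine, with the whole DataFrame held in that machine's memory. This has nothing to do with AWS SageMaker. If you need parallelism, you will have to look at parallel alternatives to pandas, for example PySpark, which processing jobs can also run.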
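If the goal is only to use all the cores of one large instance, one option that stays within pandas is to split a DataFrame into pieces and process them in separate processes with the standard library. The sketch below assumes the work is independent per row. The names `split_frame`, `add_total`, and `parallel_apply`, and the `price` / `quantity` columns, are hypothetical and only there for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int) -> list:
    """Split df row-wise into n_parts contiguous pieces of near-equal size."""
    bounds = [len(df) * i // n_parts for i in range(n_parts + 1)]
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def add_total(piece: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-row work. Defined at module level so it can be pickled."""
    piece = piece.copy()
    piece["total"] = piece["price"] * piece["quantity"]
    return piece


def parallel_apply(df, func, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply func to each piece of df in a pool of workers and concatenate the results.

    executor_cls can be swapped (e.g. for ThreadPoolExecutor) when experimenting.
    """
    pieces = split_frame(df, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        # map() returns results in the same order as the input pieces.
        results = list(pool.map(func, pieces))
    return pd.concat(results)
```

When calling `parallel_apply(df, add_total)` from a script, put the call under `if __name__ == "__main__":` so that worker processes can import the module safely. Note that this only spreads CPU work across cores. Each piece is copied to its worker, so it does not raise the memory limit, and it does not spread work across several machines.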
