Ways to speed up Python code for data science purposes

Although it might sound like a purely technical question, I would like to know which approaches you usually try when you need to speed up typical data science workloads (assuming that data retrieval is not a bottleneck and that the data fits in memory). Some options are listed below, but I would welcome feedback on any others:

  • good practices such as using NumPy for numeric operations instead of plain Python loops whenever possible (see the first sketch after this list)
  • other good practices such as using 'apply', 'applymap', etc. instead of explicit loops when applying functions to the elements of lists, DataFrames, and so on
  • Numba applied to native Python loops over NumPy arrays (second sketch below)
  • multiprocessing with the multiprocessing library, sized to the available logical cores (third sketch below)
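
For the first two points, here is a minimal sketch contrasting a plain Python loop with the vectorized NumPy equivalent. The array size and the toy sum-of-squares computation are only illustrative:

```python
import numpy as np

x = np.random.default_rng(0).random(1_000_000)

# Loop version: one Python-level operation per element.
def loop_sum_of_squares(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

# Vectorized version: the whole computation runs in compiled NumPy code.
def vectorized_sum_of_squares(values):
    return float(np.sum(values * values))

assert np.isclose(loop_sum_of_squares(x), vectorized_sum_of_squares(x))
```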
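For Numba, a small sketch of what I have in mind: decorating a loop-heavy function over NumPy arrays with @njit so it is compiled on first call (assumes numba is installed; the pairwise-distance example is just a placeholder for whatever loop you need):

```python
import numpy as np
from numba import njit

@njit
def pairwise_l2(points):
    # Plain nested loops: Numba compiles these to machine code on first call.
    n = points.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            out[i, j] = d ** 0.5
    return out

pts = np.random.default_rng(1).random((200, 3))
dists = pairwise_l2(pts)  # first call includes compilation time
```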
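And for multiprocessing, a sketch that sizes a Pool to the logical core count; the worker function is a stand-in for any CPU-bound step:

```python
import multiprocessing as mp

def expensive_task(n):
    # Placeholder for a CPU-bound function (e.g. a per-chunk computation).
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workloads = [2_000_000] * 8
    # One worker per logical core; pool.map distributes the workloads.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        results = pool.map(expensive_task, workloads)
    print(results)
```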

This is motivated by the fact that, since we mainly use Python with all its advantages, we do not want to switch to other languages such as Scala or Julia unless there is no alternative.

Topic: python, efficiency, scalability

Category: Data Science


Things I care about a lot:

  • list comprehensions instead of explicit loops (see the first sketch after this list)

  • use apply with lambda functions when you are forced to iterate operations over pandas DataFrames (also in the first sketch below)

  • use the @tf.function decorator on top of TensorFlow functions to speed up computation (second sketch below)

  • push as much work as possible into SQL when importing data from databases, to avoid redoing the same operations in Python (third sketch below)
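
A small sketch of the first two points (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.5, 7.2], "qty": [3, 1, 12]})

# List comprehension instead of building a list in an explicit loop.
labels = [f"item_{i}" for i in range(len(df))]

# apply + lambda for a row-wise operation with no obvious vectorized shortcut.
df["revenue_band"] = df.apply(
    lambda row: "high" if row["price"] * row["qty"] > 50 else "low", axis=1
)
```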
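For @tf.function, a minimal sketch (assumes TensorFlow is installed; the toy function is only illustrative). The decorator traces the Python function into a TensorFlow graph so repeated calls avoid Python overhead:

```python
import tensorflow as tf

@tf.function  # traces the Python function into a TensorFlow graph
def normalized_dot(a, b):
    return tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b))

a = tf.random.normal([10_000])
b = tf.random.normal([10_000])
print(normalized_dot(a, b))
```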
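And for the SQL point, a sketch of letting the database filter and aggregate before pandas ever sees the data. The connection string, table, and column names here are hypothetical; adjust them for your own database:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and schema; replace with your own.
engine = create_engine("postgresql://user:password@localhost:5432/shop")

# The database does the filtering and aggregation; pandas only receives the result.
query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    WHERE order_date >= '2020-01-01'
    GROUP BY customer_id
"""
df = pd.read_sql(query, engine)
```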
