Data Science Tools Using Scala

I know that Spark is fully integrated with Scala. Its use case is specifically large data sets. Which other tools have good Scala support? Is Scala best suited to larger data sets, or is it also suited to smaller ones?



ScalaNLP is a suite of machine learning and numerical computing libraries with support for common natural language processing tasks.

Here is a newly updated list of Scala libraries for data science.


Scala is suited to both large and small data science applications. Consider DynaML if you are interested in trying a machine learning library that integrates well with Apache Spark. It is still in its infancy in terms of the number of models offered, but it makes up for that with a broad and flexible machine learning API.

To see a sample use case, consider the following (there are more where that came from):

  1. System Identification - Abott Power Plant

Disclaimer: I am the author of DynaML


Judging from presentations by Martin Odersky, the creator of Scala, the language is especially well suited to building highly scalable systems because it combines functional programming constructs with object orientation and a flexible syntax. It is also useful for developing small systems and for rapid prototyping, because it takes fewer lines of code than some other languages and has an interactive mode (the REPL) for fast feedback. One notable Scala framework is Akka, which uses the actor model of concurrent computation. Many of Odersky's presentations are on YouTube, and there is a list of tools implemented with Scala on wiki.scala-lang.org.

An implicit point is that tools and frameworks written in Scala inherently integrate with Scala and usually offer a Scala API. APIs for other languages may then be added, beginning with Java, since Scala already interoperates with Java and in fact depends critically on it. If a tool or framework is not written in Scala, it is unlikely to offer any support for Scala. That is why, in answer to your question, I have pointed to tools and frameworks written in Scala, of which Spark is one example. Scala currently has a minor share of the market, but its adoption is growing, and the rapid growth of Spark will boost it further. I use Scala because Spark's Scala API is richer than its Java and Python APIs.

The main reason I prefer Scala in general is that it is much more expressive than Java: it allows and facilitates the use of functions as objects and values while retaining object-oriented modularity. This enables the development of complex, correct programs with far less code than Java, which I had previously preferred for its widespread use, clarity, and excellent documentation.
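As a rough illustration of functions used as values alongside object-oriented modularity, here is a minimal sketch in plain Scala with no external libraries. The names `Transform`, `elementwise`, `square`, and `pipeline` are hypothetical and made up for this example; they are not part of Spark or any other tool mentioned here.

```scala
object FunctionsAsValues {
  // Object-oriented side: each transform is a named, reusable unit.
  trait Transform {
    def name: String
    def apply(xs: Seq[Double]): Seq[Double]

    // Chain this transform with another one, producing a new Transform.
    def andThen(next: Transform): Transform = {
      val self = this
      new Transform {
        val name = s"${self.name} -> ${next.name}"
        def apply(xs: Seq[Double]): Seq[Double] = next(self(xs))
      }
    }
  }

  // Functional side: wrap an ordinary function value in a Transform.
  def elementwise(label: String)(f: Double => Double): Transform =
    new Transform {
      val name = label
      def apply(xs: Seq[Double]): Seq[Double] = xs.map(f)
    }

  // A function stored in a value and passed around like any other object.
  val square: Double => Double = x => x * x

  // Square every element, then halve it.
  val pipeline: Transform =
    elementwise("square")(square).andThen(elementwise("halve")(_ / 2))

  def main(args: Array[String]): Unit = {
    println(pipeline.name)                // prints "square -> halve"
    println(pipeline(Seq(1.0, 2.0, 3.0))) // prints "List(0.5, 2.0, 4.5)"
  }
}
```

A comparable Java version would typically need more ceremony for the same behaviour, which is the kind of difference in code volume described above.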


Re: size of data

The short answer

Scala works for both small and large data, but its creation and development were motivated by the need for something scalable. The name Scala is a blend of “scalable” and “language”.

The long answer

Scala is a functional programming language that runs on the JVM. The 'functional' part is a fundamental difference in the language that makes you think differently about programming. If you like that way of thinking, it lets you work quickly with small data. Whether you like it or not, functional languages are fundamentally easier to scale massively, because they favor immutable data and side-effect-free functions, which are easier to run in parallel. The JVM part is also important because the JVM is basically everywhere and, thus, Scala code can run basically everywhere. (Note that there are plenty of other languages that run on the JVM and plenty of other functional programming languages, and languages other than Scala do appear in both lists.)

This talk gives a good overview of the motivation behind Scala.
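To make the point about functional style and scaling a little more concrete, here is a small sketch using only the Scala standard library, with no Spark involved. The object `WordCount` and its methods are hypothetical names chosen for this example. The idea is that the per-line work is a pure function and the merge step is associative, so partial results computed on separate chunks can be combined afterwards; that is roughly the property distributed frameworks rely on, although this sketch runs on a single machine.

```scala
object WordCount {
  // Pure function: the result depends only on the input line.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Count the words in a single line.
  def countLine(line: String): Map[String, Int] =
    tokenize(line).groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Merge two partial counts. The operation is associative, so partial
  // results can be combined in any grouping.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0) + n)
    }

  // Map each line to its counts, then reduce the counts with merge.
  def count(lines: Seq[String]): Map[String, Int] =
    lines.map(countLine).foldLeft(Map.empty[String, Int])(merge)

  def main(args: Array[String]): Unit = {
    val lines = Seq("Spark and Scala", "Scala and Akka", "Play and Scala")
    val (left, right) = lines.splitAt(1)
    // Counting chunks separately and merging matches counting everything at once.
    println(count(lines) == merge(count(left), count(right)))
    println(count(lines))
  }
}
```

The same map-then-merge shape works unchanged on a three-line `Seq`, which is why the style is comfortable for small data as well.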

Re: other tools that have good Scala support:

As you mentioned, Spark (distributed batch processing that handles iterative algorithms better than its counterpart) is a big one. With Spark come its libraries MLlib for machine learning and GraphX for graphs. As mentioned by Erik Allik and Tris Nefzger, Akka and Factorie exist. There is also Play.

Generally, I can't tell whether there is a specific use case you're digging for (if so, make that part of your question), or whether you just want a survey of big data tools, happen to know Scala a bit, and want to start there.
