Scala vs Java if you're NOT going to use Spark?

I'm facing some indecision when choosing how to allocate my scarce learning time for the next few months between Scala and Java.

I would like help objectively understanding the practical tradeoffs.

The reason I am interested in Java is that I think some of my production, frequently refreshed, forecasts and analyses at work would run much faster in Java (compared to R or Python) and by becoming more proficient in Java I would enable myself to work on interesting side projects, such as non-Data-Science apps I'd like to develop. Currently I've taken a couple of Java courses, but I need much more education and practice to master it.

The reason I started considering learning Scala is very similar -- I figured that the statistical/ML orientation would make it good for my work as a Data Scientist, and since it is based on Java I would be getting practice at work that helps me with my side-interest in Java, even though there are a few major differences such as functional vs. imperative and traits vs. interfaces.

It seems like a lot of the advantages of Scala revolve around integration with Spark. I am thinking that this should be the tipping point for my decision, because my team is not currently using Spark and I don't have a good enough reason to request it. However, I thought I should ask here so that I don't waste too much time if Scala is still a better choice.

For the purposes of this question please ignore alternatives such as Python, R, Julia, etc (I've eliminated those from consideration for other reasons, such as already being sufficiently familiar with them for my use cases).



The two languages have pretty similar benefits, since Scala can call Java libraries. So Java machine learning packages like Weka (http://www.cs.waikato.ac.nz/ml/weka/) can, in theory, be easily used from Scala.
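To make the interop point concrete, here is a minimal sketch that drives plain JDK classes from Scala. It deliberately uses java.util instead of Weka so that it runs without extra dependencies; the object and method names are made up for illustration, and the scala.jdk.CollectionConverters import assumes Scala 2.13 or later.

```scala
import java.util.{ArrayList, Collections}
import scala.jdk.CollectionConverters._

// Hypothetical helper, for illustration only.
object JavaInteropSketch {
  // Fill a java.util.ArrayList, sort it with Java's own Collections.sort,
  // then convert the result back into an immutable Scala List.
  def sortedWithJava(values: Seq[Double]): List[Double] = {
    val javaList = new ArrayList[java.lang.Double]()
    values.foreach(v => javaList.add(Double.box(v))) // box explicitly for the Java API
    Collections.sort(javaList)
    javaList.asScala.toList.map(_.doubleValue)
  }
}
```

A third-party Java library would be called the same way once its jar is on the classpath.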

There are minor pros and cons to each, however:

  • Java is a language that most software engineers with 5+ years of experience understand. If you go to a big financial institution and need to hand off your app to a legacy team, they'll likely know how to support a Java app but not a Scala app (at least where I work).
  • Java is stable and doesn't often change, but Scala seems to change rapidly. This can make it more difficult to maintain in the long term
  • Poorly written Java code might be verbose, but poorly written Scala code can be completely unintelligible (ex: what does ::: mean?). The creator of Python optimized Python for readability; it sometimes feels like the creator of Scala did the opposite
  • You can write Scala code much faster than you can write Java code. Scala is ideal for temporary prototype code because you can see your idea come to life faster than you can with Java
  • Spark is much easier to work with in Scala than in Java. The Spark machine learning libraries are decent enough that you might not need a different machine learning library like Weka. I've seen people build Scala models in Spark even on small datasets; you don't need a huge dataset to use Spark.
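For anyone wondering about the ::: mentioned in the list above: on List it is concatenation, and like every Scala operator whose name ends in a colon it binds to its right-hand operand. A small sketch, with names chosen only for illustration:

```scala
object ListOperatorsSketch {
  val left: List[Int] = List(1, 2)
  val right: List[Int] = List(3, 4)

  // ::: concatenates two lists; the call is actually right.:::(left).
  val joined: List[Int] = left ::: right

  // :: prepends a single element to a list.
  val prepended: List[Int] = 0 :: joined

  // ++ is the more general (and more readable) way to concatenate.
  val sameAsJoined: List[Int] = left ++ right
}
```

Which arguably supports the readability complaint: nothing in the symbol itself tells you any of this.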

Summary: Go with Scala. Most data science work is prototyping, and Scala will help you work through prototypes faster. Spark ML is probably good enough for your needs, and Scala is much better for Spark than Java.


The other side of the coin:

I don't have extensive experience with Scala; I have written approximately 10,000 lines of Scala code. Keep in mind, though, that Scala code is often much shorter than its Java equivalent, so that is roughly comparable to 40,000 lines of Java.

In short, I don't like Scala at all. I love its goals and its ideas, but for production use I consider the implementation of those ideas too bold and sometimes even foolish. My belief is that Scala code is hard to write in a robust and crystal clear fashion. There are too many concepts, which often overlap through their side effects, and that can make your life really hard if you are not an expert. There is an opinion that one can avoid those problems by simply not using those constructs. I reject this opinion: to avoid the problematic features you would first have to know that you do not completely understand them, which itself requires expert knowledge. My belief is that the Scala language implementation is suited to a university, where new concepts should be tested, which is exactly what happens with Scala. There are many examples; I will give a few reasons that support what I say:

  • Java generics are incomplete, mostly because of type erasure; everybody agrees on that. Scala "solved" that problem in its public API in at least three ways: Manifest, TypeTag, and ClassTag. In my opinion this means that the problem is simply not solved or, if it is solved, that some of the solutions are wrong.
  • Converters are an example of a feature which I consider problematic: one simply replaces the strong connection between types with something soft that is very easy to miss. I started to use a third-party library where some courageous guy had used plenty of them, and my code started to behave weirdly: I typed something wrong which happened to be covered by a converter, and I spent hours debugging.
  • Parameters with default values: one can define methods whose parameters have default values. This is useful mostly because, combined with named parameters, you avoid writing many methods... until the moment you discover that you cannot have two overloaded methods (same method name) that both define a default-valued parameter.
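To illustrate the converter complaint, here is a minimal sketch. It assumes that "converters" refers to Scala's implicit conversions; the Meters type and the conversion itself are hypothetical stand-ins for what a third-party library might define.

```scala
import scala.language.implicitConversions

object ImplicitConversionSketch {
  // A wrapper type meant to keep raw numbers and lengths apart.
  final case class Meters(value: Double)

  // Hypothetical library-provided conversion: any Double silently becomes Meters.
  implicit def doubleToMeters(d: Double): Meters = Meters(d)

  def describe(length: Meters): String = s"${length.value} m"

  // Passing a bare Double (a typo, or a value in the wrong unit) still compiles,
  // because the compiler inserts doubleToMeters(3.5) behind the scenes.
  val silentlyAccepted: String = describe(3.5)
}
```

Without the implicit def, the last line would be a compile error, which is exactly the protection that gets lost.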

Now, I don't want to start a flame war, far from it. Scala has many wonderful ideas; Java is old and lacks many things. But for production I would always choose Java.

Later edit:

For more details on Scala type erasure there is a short review here.
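As a rough illustration of what the ClassTag mechanism buys you (a sketch of my own, not taken from the linked review; the method names are made up): a context-bound ClassTag carries the runtime class of T, which allows creating an Array[T] and pattern matching on T despite erasure.

```scala
import scala.reflect.ClassTag

object ErasureSketch {
  // Creating an Array[T] needs the runtime class of T; without the
  // ClassTag context bound this method does not compile.
  def fill[T: ClassTag](n: Int, value: T): Array[T] = Array.fill(n)(value)

  // With a ClassTag in scope, the type test `case t: T` is checked at
  // runtime instead of being erased.
  def onlyOfType[T: ClassTag](items: List[Any]): List[T] =
    items.collect { case t: T => t }
}
```

For example, ErasureSketch.onlyOfType[String](List(1, "a", 2.0, "b")) keeps only the two strings.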

To clarify what I stated about parameters with default values, here is a hypothetical example (so objections like "Array and List both implement some traversable interface" do not count):

class Plot {
  def histogram(x: Array[Double], bins: Int = 30): String = "test1"
  def histogram(x: List[Double], bins: Int = 30): String = "test2"
}

The above code does not compile: error: in class Plot, multiple overloaded alternatives of method histogram define default arguments.

type-erasure-manifest-and-typetag/

Another later edit: I received an edit proposal, and since I do not know how to message the contributor I will answer here. His idea is that the code is not right and that it should look like the following.

def histogram(x: Array[Double], bins: Int = 30): String = {
 var r = "test1"
 return r;
}

As far as I know this is the same as above, since my version is a shortcut. The point is that the code does not compile because two overloaded methods with the same name both define a parameter with a default value, not because the methods are declared incorrectly. Obviously, as in many other languages, one can find other ways to work around this problem.
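One such workaround, sketched here under the assumption that you want to keep both overloads: let only one alternative declare the default and spell out the other by hand. DefaultBins and the returned strings are made up for the example.

```scala
object PlotWorkaroundSketch {
  val DefaultBins = 30

  class Plot {
    // Only this alternative declares a default argument.
    def histogram(x: Array[Double], bins: Int = DefaultBins): String = s"array:$bins"

    // The List variant gets an explicit one-argument overload instead of a default.
    def histogram(x: List[Double], bins: Int): String = s"list:$bins"
    def histogram(x: List[Double]): String = histogram(x, DefaultBins)
  }
}
```

It compiles, but it is the hand-written boilerplate that default arguments were supposed to remove.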

The whole idea is that, to me, Scala looks like too generous a language, with many ways of doing many things, which as a side effect creates a complexity burden. In that context, features like the one I mentioned make things even harder. Obviously I do not own the truth, so I take the liberty of avoiding the language's complexity and concentrating my effort on solving real problems, and anybody else is free to solve their problems with this language. With all due respect.


This is a bit off topic for this SE, or maybe opinion-based, but I work in this field and I'd recommend Scala.

No, I would not characterize Scala as a "stats-oriented" Java. I'd describe it as what you get if you asked 3 people to design "Java 11" and then used all of their ideas at once.

Java 8 remains great, but Scala fully embraces just about all the good ideas from languages that you'd want, like type safety and closures and functional paradigms, and a more elaborate types/generics system, down to more syntactic sugar conveniences like case classes and lazy vals. Plus you get to run everything in the same JVM and interoperate with any Java libraries.
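For a taste of the sugar mentioned above, a small sketch of a case class and a lazy val. The Forecast and SeriesStats names are invented for the example, and the counter exists only to show that the lazy body runs once.

```scala
object SugarSketch {
  // A case class gets equals, hashCode, toString and copy for free.
  final case class Forecast(series: String, horizonDays: Int)

  final class SeriesStats(data: Seq[Double]) {
    var computations: Int = 0

    // The body runs on first access only; the result is cached afterwards.
    lazy val mean: Double = {
      computations += 1
      data.sum / data.size
    }
  }

  val weekly: Forecast = Forecast("sales", 7)
  val monthly: Forecast = weekly.copy(horizonDays = 30) // structural copy with one field changed
}
```

The rough Java 8 equivalent of Forecast alone is a class with a constructor, getters, equals, hashCode and toString written or generated by hand.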

The price is complexity: understanding all of Scala is a lot more difficult than understanding all of Java. Some bits of Scala were maybe a bridge too far, and the tooling ecosystem isn't great IMHO in comparison to the standard Java tools, although you can, for example, use Maven instead of SBT. Still, you can mostly avoid the parts of Scala that are complex if you don't need them. Designing Scala libraries takes a lot of skill and know-how; just developing run-of-the-mill code in Scala does not.

From a productivity perspective, once you're used to Scala, you'll be more productive. If you're an experienced Java dev, I actually think you'll appreciate Scala. I have not liked any of the other JVM languages, which tend to be sort of one-issue languages that change lots of stuff for marginal gains.

Certainly, for analytics, the existence of Spark argues for Scala. You'll use it eventually.
