Hive / Impala best practice code structuring
Coming from a DWH-background I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forego subqueries in some cases.
It made me wonder if there are similar SQL-dialect specific differences between what is used in Hadoop SQL and what you would use in a DWH-setting. So I would like to extend this question so that people can mention what they noticed as differences between Hadoop and DWH in when structuring their queries. I noticed there was very little reference to this topic for Hadoop.
There are some normal principles that apply to Hadoop like mentioned here: https://streever.atlassian.net/wiki/spaces/HADOOP/pages/3211279/Hive+SQL+Best+Practices
It would be nice to get a few of your best practices for working with Hadoop. E.g. You write your queries as neutral as possible so that it works in both Hive and Impala avoiding using language-specific functions such as left (Impala only)
The example I came across was that a group by
-query worked in two different ways. In classic RDBMS I know that only one of both would work.
Here is an example (columns are displayed in different order then in group by
):
Classic SQL (Oracle, SQL Server, etc)
select t2.b, t2.a
from (select t1.a,t1.b from table1 as t1
group by t1.a, t1.b) t2
Hive SQL
select t1.b, t1.a
from table1 as t1
group by t1.a, t1.b
Notice that everything is combined in 1 single query Hadoop. The classic SQL snippet first does the group by
in a subquery before the data is displayed in the right order.
I would be curious to know if you came across other subtleties like this one.
I know, based on my experience, that it is good practice to respect the following guidelines while coding:
Keep your code clean (hence avoid sub-queries if you can do without)
Consider what impact changes to settings might have. E.g.
yarn.nodemanager.resource.memory-mb=24576
is great to use, but what happens if you are not allowed to change the node memory size. In an automated job it is not necessarily good practice. Look at the general Hadoop settings, e.g. file size.Avoid using functions that are specific to a program. E.g.
select left(Hello world, 3)
is useful in Impala but in Hive it has to be rewritten toselect left(Hello world, 1,4)
. This can lead to problems when later on down the line it will run in a different program.
Topic hive apache-hadoop
Category Data Science