How to store a subset of columns from a CSV file?

I need to create a table in Hive (or Impala) by reading from a CSV file (named file.csv). The problem is that this CSV file could have a different number of columns each time I read it. The only thing I am sure of is that it will always have three columns called A, B, and C. For example, the first CSV I get could be (the first row is the header): ------------------------ | X | Y | A | …
Topic: hive pyspark sql
Category: Data Science
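A minimal sketch of the column-selection step for the CSV question above, assuming the header row is always present and that A, B, and C can appear in any position. The function name `select_columns` and the sample data are hypothetical (the original example table is truncated), and this only illustrates picking columns by name before loading them into a table; it is not a Hive or Impala statement.

```python
import csv
import io

# Columns the question says are always present.
REQUIRED = ["A", "B", "C"]


def select_columns(csv_text, wanted=REQUIRED):
    """Return only the wanted columns, looked up by header name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [name for name in wanted if name not in header]
    if missing:
        raise ValueError("missing required columns: %s" % missing)
    # Keep the wanted columns in a fixed order, whatever the file layout is.
    return [[row[name] for name in wanted] for row in reader]


# Hypothetical sample: two extra columns (X, Y) before A, B, C.
sample = "X,Y,A,B,C\n1,2,3,4,5\n6,7,8,9,10\n"
rows = select_columns(sample)
```

Because the lookup is by name, a later file with a different number or order of columns yields the same three-column output, which can then be written to a fixed-schema table.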

Mapreduce jobs not working in hive

I was trying to execute a hive query: select name, count(*) from amazon where review != NULL group by name ; Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1564815666993_0001, Tracking URL = http://aamir-VirtualBox:8088/proxy/application_1564815666993_0001/ Kill Command = …
Category: Data Science

Hive / Impala best practice code structuring

Coming from a DWH-background I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forego subqueries in some cases. It made me wonder if there are similar SQL-dialect specific differences between what is used in Hadoop SQL and what you would use in a DWH-setting. So I would like to extend this question so that people can mention what they noticed as differences between Hadoop …
Category: Data Science

Hive Bring Table to Local Driver for Fast Debugging on CLI

I have a large Hive table on HDFS. Every time I query it, it runs a MapReduce job, which is slow. For debugging my code on the CLI, I want a fast query. Is it possible to sample rows of the table and bring them to the local machine? Then the Hive queries would run only on the local machine, without MapReduce. If yes, how? PS I know that MapReduce runs for some commands only. I have nested …
Topic: hive sql
Category: Data Science

cannot access hive from spark

I am trying to install a hadoop + spark + hive cluster. I am using hadoop 3.1.2, spark 2.4.5 (scala 2.11 prebuilt with user-provided hadoop) and hive 2.3.3 (also tried 3.1.2 with the exact same results). All downloaded from their websites. I can run spark apps (as yarn client) with no issues, I can run hive queries directly (beeline) or via pyhive with no issues (I tried both hive-on-mr and hive-on-spark, both working fine, jobs are created by yarn and …
Category: Data Science

Can I access data on one Spark cluster from another?

What if I want to sample a table over hive on Spark-cluster-1, but I'm logged in on Spark-cluster-2? Connecting to jdbc:hive2://spark.cluster.1:10000/default;principal=hive/[email protected];ssl=true This call returns error: "Error: Could not open client transport with JDBC Uri:" when I issue the call from spark.cluster.2 using this call: hive -e "select * FROM database.tablename where rand() <= 0.0001 order by rand() limit 10" What are the limitations to do this? I should be able to read a table even if I'm not logged-in to …
Category: Data Science

Hive query to get all rows where a particular column value lies in a particular percentile

I am trying to filter the rows of a Hive table named id_counts based on percentile values. Let's consider the following table.

+------+----------+
| id   | quantity |
+------+----------+
| a01  | 234      |
| a02  | 345      |
| a03  | 23       |
+------+----------+

Now let's say I want to get the rows whose quantity is in the 90th percentile. What query should I use? I tried the following: select * from id_counts having quantity>= percentile(quantity, 0.9); But it …
Topic: hive
Category: Data Science
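A rough sketch of the logic behind the percentile question above: compute the threshold once over the whole column, then filter the rows against it. In SQL terms this usually means computing the aggregate in a subquery and joining it back, rather than comparing a row value to an aggregate directly. The helper names are hypothetical, and the linear-interpolation definition used here is an assumption; Hive's percentile() may not match it exactly in every case.

```python
import math


def percentile(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def rows_at_or_above(rows, p):
    """Keep the (id, quantity) rows whose quantity reaches the p-th percentile."""
    threshold = percentile([quantity for _, quantity in rows], p)
    return [(row_id, quantity) for row_id, quantity in rows if quantity >= threshold]


# The table from the question.
id_counts = [("a01", 234), ("a02", 345), ("a03", 23)]
top_rows = rows_at_or_above(id_counts, 0.9)
```

The point of the two-step shape is that the threshold is a single value for the whole table, so it has to be computed before any row can be tested against it.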

COUNT on External Table in HIVE

I have been trying out the EXTERNAL table concept in Hive: CREATE EXTERNAL TABLE IF NOT EXISTS MovieData (id INT, title STRING,releasedate date, videodate date, URL STRING,unknown TINYINT, Action TINYINT, Adventure TINYINT, Animation TINYINT,Children TINYINT, Comedy TINYINT, Crime TINYINT, Documentary TINYINT, Drama TINYINT, Fantasy TINYINT, Film-Noir TINYINT, Horror TINYINT, Musical TINYINT, Mystery TINYINT, Romance TINYINT, Sci-Fi TINYINT, Thriller TINYINT, War TINYINT, Western TINYINT) COMMENT 'This is a list of movies and its genre' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' …
Topic: hive
Category: Data Science

How to disable the query in beeline results

I am seeing strange behavior from the Hive client beeline. The output file with the query results also contains the query itself at the beginning and at the end. Is there any option to disable this behavior? I can't see such an option in the beeline -help output: -bash-4.2$ beeline -help Usage: java org.apache.hive.cli.beeline.BeeLine -u <database url> the JDBC URL to connect to -n <username> the username to connect as -p <password> the password to connect as -d <driver class> the driver class to …
Category: Data Science

getting error:-Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

I am trying to connect to Hive from Java but I am getting an error. I searched Google but did not find a helpful solution. I have also added all the jars. The code is: package mypackage; import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import java.sql.Statement; import java.sql.DriverManager; public class HiveJdbcClient { private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver"; public static void main(String[] args) throws SQLException, ClassNotFoundException { Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); try { Class.forName(driver); } catch (ClassNotFoundException e) { e.printStackTrace(); System.exit(1); } Connection connect = DriverManager.getConnection("jdbc:hive://master:10000 /default", "", …
Category: Data Science

Find outliers in Hive - SemanticException

I'm trying to find some outliers on my database using HIVE and I'm using Standard Deviation technique. My query is: SELECT ID FROM data WHERE ID < (AVG(ID) + STDDEV(ID)) AND ID > (AVG(ID) - STDDEV(ID)); When I run this code I'm getting the following error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:12 Not yet supported place for UDAF 'AVG' How to solve this problem? Many thanks!
Category: Data Science
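A small sketch of the logic in the outlier question above, assuming the intent of the query as written: compute the mean and standard deviation once over the whole column, then compare each ID against those two values. Aggregates such as AVG cannot be used directly inside WHERE, so in SQL the usual shape is to compute them in a subquery and join the result to the table. The function name and sample IDs are hypothetical, and the use of the population standard deviation is an assumption about what STDDEV returns in Hive.

```python
from statistics import mean, pstdev


def within_one_std(values):
    """Keep values strictly inside (mean - stddev, mean + stddev)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    return [v for v in values if mu - sigma < v < mu + sigma]


# Hypothetical IDs: 100 lies far from the rest.
ids = [1, 2, 3, 4, 100]
kept = within_one_std(ids)
```

Note that the condition in the question keeps the values inside one standard deviation of the mean; to list the outliers themselves, the comparison would be inverted.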

How to extract a column that has the highest value within row in Hive?

I have a table, more or less in the following format:

col1 col2 col3 ... col100
val1 val2 val3 ... val100

where val* are doubles. Is there a way to extract, for each row, the column that holds the highest value in that row in Hive? For example, for a table like

col1 col2 col3
2    4    5
8    1    2

I would get

col3
col1
Topic: hive sql
Category: Data Science
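A minimal sketch of the per-row logic asked for in the question above, using its own three-column example. Each row is modeled as a mapping from column name to value, and the function name `top_column` is hypothetical. This shows the intended result only; it is not a Hive query.

```python
def top_column(row):
    """Return the name of the column holding the largest value in the row."""
    # On ties, max() returns the first column in the row's own order.
    return max(row, key=row.get)


# The example table from the question.
table = [
    {"col1": 2, "col2": 4, "col3": 5},
    {"col1": 8, "col2": 1, "col3": 2},
]
result = [top_column(row) for row in table]
```

With 100 columns the same idea applies unchanged, since the comparison runs over whatever columns the row contains.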

How to run 2 executions in 1 step in Hive?

I am wondering if there is a way to run 2 executions in 1 step in Hive. For example: SELECT * FROM TABLE1 SELECT * FROM TABLE2 ; I would like to do this in one window, without having to open 2 Hive windows to execute each line separately. Can it be done in HUE?
Topic: hive
Category: Data Science
