hive - Geeks Mental

How to store a subset of columns from a CSV file?

Francesco Pegoraro

May 6, 2022 19:01

I need to create a table in Hive (or Impala) by reading from a CSV file (named file.csv). The problem is that this CSV file could have a different number of columns each time I read it. The only thing I am sure of is that it will always have three columns called A, B, and C. For example, the first CSV I get could be (the first row is the header): ------------------------ | X | Y | A | …

Topic: hive pyspark sql

Category: Data Science

MapReduce jobs not working in Hive

Aamir Ahmad Ansari

August 23, 2020 13:24

I was trying to execute a hive query: select name, count(*) from amazon where review != NULL group by name ; Number of reduce tasks not specified. Estimated from input data size: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1564815666993_0001, Tracking URL = http://aamir-VirtualBox:8088/proxy/application_1564815666993_0001/ Kill Command = …

Topic: hive apache-hadoop bigdata

Category: Data Science

Hive / Impala best practice code structuring

Gerardsson

July 25, 2020 00:03

Coming from a DWH-background I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forego subqueries in some cases. It made me wonder if there are similar SQL-dialect specific differences between what is used in Hadoop SQL and what you would use in a DWH-setting. So I would like to extend this question so that people can mention what they noticed as differences between Hadoop …

Topic: hive apache-hadoop

Category: Data Science

Hive Bring Table to Local Driver for Fast Debugging on CLI

solora

June 19, 2020 19:59

I have a large Hive table on HDFS. Every time I query it, it runs a map reduce job, which is slow. For debugging my code on the CLI, I want a fast query. Is it possible to sample rows of the table and bring them to the local machine? Then the Hive queries would run only on the local machine, without map reduce. If yes, how? PS I know that map reduce runs for some commands only. I have nested …

Topic: hive sql

Category: Data Science

Cannot access Hive from Spark

user3044083

March 30, 2020 23:53

I am trying to install a hadoop + spark + hive cluster. I am using hadoop 3.1.2, spark 2.4.5 (scala 2.11 prebuilt with user-provided hadoop) and hive 2.3.3 (also tried 3.1.2 with the exact same results). All downloaded from their websites. I can run spark apps (as yarn client) with no issues, I can run hive queries directly (beeline) or via pyhive with no issues (I tried both hive-on-mr and hive-on-spark, both working fine, jobs are created by yarn and …

Topic: hive pyspark apache-spark apache-hadoop

Category: Data Science

Can I access data on one Spark cluster from another?

sAguinaga

January 8, 2020 15:59

What if I want to sample a table over hive on Spark-cluster-1, but I'm logged in on Spark-cluster-2? Connecting to jdbc:hive2://spark.cluster.1:10000/default;principal=hive/[email protected];ssl=true This call returns error: "Error: Could not open client transport with JDBC Uri:" when I issue the call from spark.cluster.2 using this call: hive -e "select * FROM database.tablename where rand() <= 0.0001 order by rand() limit 10" What are the limitations to do this? I should be able to read a table even if I'm not logged-in to …

Topic: hive apache-spark

Category: Data Science

Hive query to get all rows where a particular column value lies in a particular percentile

Heisenbug

August 23, 2019 19:24

I am trying to filter the rows of a Hive table named id_counts based on percentile values. Let's consider the following table.

+------+----------+
| id   | quantity |
+------+----------+
| a01  | 234      |
| a02  | 345      |
| a03  | 23       |
+------+----------+

Now let's say I want to get the rows whose quantity is in the 90th percentile. What query should I use? I tried the following: select * from id_counts having quantity>= percentile(quantity, 0.9); But it …

Topic: hive

Category: Data Science

COUNT on External Table in Hive

Joby

May 24, 2018 10:01

I have been trying out the EXTERNAL table concept in Hive: CREATE EXTERNAL TABLE IF NOT EXISTS MovieData (id INT, title STRING,releasedate date, videodate date, URL STRING,unknown TINYINT, Action TINYINT, Adventure TINYINT, Animation TINYINT,Children TINYINT, Comedy TINYINT, Crime TINYINT, Documentary TINYINT, Drama TINYINT, Fantasy TINYINT, Film-Noir TINYINT, Horror TINYINT, Musical TINYINT, Mystery TINYINT, Romance TINYINT, Sci-Fi TINYINT, Thriller TINYINT, War TINYINT, Western TINYINT) COMMENT 'This is a list of movies and its genre' ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' …

Topic: hive

Category: Data Science

How to remove the query from beeline results

Marcin Kosiński

January 7, 2017 16:35

I am encountering strange behavior in the Hive client beeline. In the output file with the query results, the query itself also appears at the beginning and at the end. Is there any option to disable this behavior? I can't see such an option in beeline -help: -bash-4.2$ beeline -help Usage: java org.apache.hive.cli.beeline.BeeLine -u <database url> the JDBC URL to connect to -n <username> the username to connect as -p <password> the password to connect as -d <driver class> the driver class to …

Topic: hive apache-hadoop

Category: Data Science

Getting error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

Jaishree Rout

November 29, 2016 12:39

I am trying to connect to Hive from Java but am getting an error. I searched Google but did not find any helpful solution. I have also added all the jars. The code is: package mypackage; import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import java.sql.Statement; import java.sql.DriverManager; public class HiveJdbcClient { private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver"; public static void main(String[] args) throws SQLException, ClassNotFoundException { Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); try { Class.forName(driver); } catch (ClassNotFoundException e) { e.printStackTrace(); System.exit(1); } Connection connect = DriverManager.getConnection("jdbc:hive://master:10000 /default", "", …

Topic: hive java apache-hadoop

Category: Data Science

Find outliers in Hive - SemanticException

SaCvP

September 3, 2016 19:50

I'm trying to find some outliers in my database using Hive, and I'm using the standard deviation technique. My query is: SELECT ID FROM data WHERE ID < (AVG(ID) + STDDEV(ID)) AND ID > (AVG(ID) - STDDEV(ID)); When I run this code I get the following error: Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 3:12 Not yet supported place for UDAF 'AVG' How can I solve this problem? Many thanks!

Topic: hive statistics data-cleaning

Category: Data Science
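
For the outlier question above, one plausible reading of the error is that an aggregate such as AVG cannot be evaluated inside a row-level WHERE filter, so the aggregates have to be computed first and then compared against each row. The following is a minimal Python sketch of that two-step logic, not a Hive query. The function name `within_one_stddev` and the sample values are hypothetical, and using the population standard deviation is an assumption about what STDDEV computes.

```python
from statistics import mean, pstdev


def within_one_stddev(values):
    """Keep the values that lie strictly within one standard deviation of the mean."""
    # Step 1: compute the aggregates once over the whole column.
    avg = mean(values)
    sd = pstdev(values)  # population standard deviation (assumption)
    # Step 2: compare every row against the precomputed scalars,
    # mirroring the two conditions of the WHERE clause in the question.
    return [v for v in values if avg - sd < v < avg + sd]


# Hypothetical sample column.
ids = [1, 2, 3, 4, 100]
print(within_one_stddev(ids))
```

As written in the question, the condition keeps the values inside the band; inverting the comparison would return the outliers instead.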

How to extract the column that has the highest value within a row in Hive?

Marcin Kosiński

November 18, 2015 00:17

I have a table, more or less in the following format:

col1 col2 col3 ... col100
val1 val2 val3 ... val100

where val* are doubles. Is there a way to find, for each row, which column holds the highest value in that row in Hive? For example, for a table like

col1 col2 col3
2    4    5
8    1    2

I would get

col3
col1

Topic: hive sql

Category: Data Science
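
The expected result in the question above amounts to a per-row argmax over the column names. As a minimal sketch of that logic in Python (not Hive syntax), assuming a hypothetical helper `argmax_column` and the three-column example from the question:

```python
def argmax_column(columns, row):
    """Return the name of the column holding the largest value in the row."""
    # Pick the index of the maximum value; on ties the first column wins.
    best = max(range(len(row)), key=lambda i: row[i])
    return columns[best]


# Example table taken from the question.
columns = ["col1", "col2", "col3"]
rows = [[2, 4, 5], [8, 1, 2]]
print([argmax_column(columns, row) for row in rows])  # ['col3', 'col1']
```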

How to paste string and int from map to an array in hive?

Marcin Kosiński

June 1, 2015 14:37

I am trying to paste a string and an int from a map in Hive into an array. For now, a record looks like this: {"string1":1,"string2":1,"string3":15} Is there a way to convert it to an array like this: ["string1:1","string2:1","string3:15"]

Topic: hive

Category: Data Science
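
The conversion asked for above is a join of each map key with its value. A minimal Python sketch of the intended transformation (the function name `map_to_pairs` is hypothetical, and this illustrates the logic rather than a Hive expression):

```python
def map_to_pairs(record):
    """Turn a map of string -> int into a list of "key:value" strings."""
    return [f"{key}:{value}" for key, value in record.items()]


# Record taken from the question.
record = {"string1": 1, "string2": 1, "string3": 15}
print(map_to_pairs(record))  # ['string1:1', 'string2:1', 'string3:15']
```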

Hive: How to calculate the Kendall coefficient of correlation of a pair of numeric columns in the group?

Marcin Kosiński

April 28, 2015 21:46

In this wiki page there is a function corr() that calculates the Pearson coefficient of correlation. My question is: is there any function in Hive that calculates the Kendall coefficient of correlation of a pair of numeric columns in the group?

Topic: hive correlation apache-hadoop

Category: Data Science

How to perform 2 executions in 1 step in Hive?

Marcin Kosiński

February 2, 2015 17:00

I am wondering if there is a way to perform 2 executions in 1 step in Hive. For example: SELECT * FROM TABLE1 SELECT * FROM TABLE2 ; I would like to do this in one window, without having to open 2 Hive windows to execute each line separately. Can it be done in HUE?

Topic: hive

Category: Data Science
