Cannot access Hive from Spark
I am trying to set up a Hadoop + Spark + Hive cluster. I am using Hadoop 3.1.2, Spark 2.4.5 (Scala 2.11, prebuilt with user-provided Hadoop), and Hive 2.3.3 (I also tried Hive 3.1.2, with the exact same results). All were downloaded from their websites.
I can run Spark apps (as YARN client) with no issues, and I can run Hive queries directly (beeline) or via PyHive with no issues (I tried both Hive-on-MR and Hive-on-Spark; both work fine, with the jobs created by YARN and executed successfully). The only thing I cannot do is access Hive tables from Spark.
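For example, this kind of PyHive check runs with no problems (host, credentials, and table are from my setup):

from pyhive import hive

# plain HiveServer2 connection, no Spark involved
conn = hive.Connection(host="hive", port=10000, username="hive", database="default")
cursor = conn.cursor()
cursor.execute("SELECT * FROM bank_csv LIMIT 5")
print(cursor.fetchall())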
Spark 2.4.x prebuilt with user-provided Hadoop is not built with Hive support, so I downloaded the required jars from Maven (spark-hive, hive-jdbc, hive-service, thrift, ...) and put them on the classpath. I have also compiled Spark 2.4.5 with -Phive, with the exact same results as with the Maven jars.
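For reference, this is roughly how I expose the extra jars in spark-defaults.conf (the directory is illustrative; the equivalent --jars flag should behave the same):

spark.driver.extraClassPath /opt/spark/extra-jars/*
spark.executor.extraClassPath /opt/spark/extra-jars/*

With the jars in place, I tried two approaches: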
- Access Hive directly:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-hive").enableHiveSupport().getOrCreate()
spark.sql("show tables").show()
With all the required jars on the classpath, the Spark session is created successfully, but calling show() on the query gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:204)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:285)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
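From what I can tell, HIVE_STATS_JDBC_TIMEOUT was removed from HiveConf in Hive 2.x, so this looks like Spark picking up Hive 2.x classes where it expects its bundled older client, although I don't see why given my settings. If it helps, this is the kind of py4j probe I can run to see which jar HiveConf is actually loaded from (probably not the cleanest way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# ask the JVM which jar provides HiveConf, to spot version shadowing
cls = spark._jvm.java.lang.Class.forName("org.apache.hadoop.hive.conf.HiveConf")
print(cls.getProtectionDomain().getCodeSource().getLocation().toString())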
I am trying to access Hive as an external catalog, with these settings:
spark.sql.catalogImplementation hive
spark.sql.warehouse.dir /user/hive/warehouse
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /lib/hive/*
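To rule out these settings being silently ignored, I double-check them from a live session like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conf-check").enableHiveSupport().getOrCreate()
# print the effective values of the catalog-related settings
for key in ("spark.sql.catalogImplementation",
            "spark.sql.warehouse.dir",
            "spark.sql.hive.metastore.version",
            "spark.sql.hive.metastore.jars"):
    print(key, "=", spark.conf.get(key))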
- Access Hive via JDBC:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark-hive-jdbc").getOrCreate()
df = (spark.read.format("jdbc")
    .option("url", "jdbc:hive2://hive:10000/default")
    .option("driver", "org.apache.hive.jdbc.HiveDriver")
    .option("user", "hive")
    .option("password", "hive")
    .option("fetchsize", "512")
    .option("dbtable", "bank_csv")
    .load())
df.show(10)
Since the hive-jdbc connector seems to be missing from the prebuilt Hive distribution, I downloaded that jar from Maven as well.
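For reference, I launch with the driver jar passed explicitly (the path and version are just my setup; its dependencies are already on the classpath from the earlier step):

pyspark --jars /opt/spark/extra-jars/hive-jdbc-2.3.3.jar

With the driver in place, the connection succeeds and the query runs, but every field comes back populated with its own column name: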
df.show(10)
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
only showing top 10 rows
If the table has columns with non-string types, the query instead fails with errors about not being able to convert the data (since only strings come back). As you can see, I am already using the fetchsize option, as suggested here: https://stackoverflow.com/questions/42843241/spark-jdbc-returning-dataframe-only-with-column-names, but it makes no difference. If I had to guess, Spark's generic JDBC dialect wraps column names in double quotes, Hive parses double-quoted tokens as string literals, and the generated query effectively becomes select "bank_csv.age", "bank_csv.job", ... from bank_csv, returning the literal names for every row; but I don't know how to override that from pyspark.
Any ideas what's going on?
Topic: hive pyspark apache-spark apache-hadoop
Category: Data Science