Cannot access Hive from Spark

I am trying to install a Hadoop + Spark + Hive cluster. I am using Hadoop 3.1.2, Spark 2.4.5 (Scala 2.11, prebuilt with user-provided Hadoop) and Hive 2.3.3 (I also tried 3.1.2, with the exact same results). All were downloaded from their official websites.

I can run Spark apps (as a YARN client) with no issues, and I can run Hive queries directly (beeline) or via PyHive with no issues (I tried both Hive-on-MR and Hive-on-Spark; both work fine, and the jobs are created by YARN and executed successfully). The only thing I cannot do is access Hive tables from Spark.

Spark 2.4.x prebuilt with user-provided Hadoop is not built with Hive support, so I downloaded the required jars from Maven (spark-hive, hive-jdbc, hive-service, thrift, ...) and put them on the classpath. I have also compiled Spark 2.4.5 with -Phive, but that gives the exact same results as the Maven jars.
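In case it matters, here is a sketch of how the jars end up on the classpath (the /opt/hive-jars directory is a placeholder for wherever the Maven downloads were saved), set in spark-defaults.conf alongside the catalog settings shown further down:

    spark.driver.extraClassPath     /opt/hive-jars/*
    spark.executor.extraClassPath   /opt/hive-jars/*

With those jars in place, I tried two ways: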

  1. Access Hive directly:
    from pyspark.sql import SparkSession

    # Hive support must be enabled for the built-in Hive catalog to work
    spark = SparkSession.builder.appName("pyspark-hive").enableHiveSupport().getOrCreate()
    spark.sql("show tables").show()

After all the required jars are on the classpath, the Spark session is created successfully, but calling "show" on a query gives this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o41.sql.
: java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
    at org.apache.spark.sql.hive.HiveUtils$.formatTimeVarsForHiveClient(HiveUtils.scala:204)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:285)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
    at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)

I am trying to access Hive as an external catalog, with these settings:

spark.sql.catalogImplementation     hive
spark.sql.warehouse.dir             /user/hive/warehouse
spark.sql.hive.metastore.version    2.3.0
spark.sql.hive.metastore.jars       /lib/hive/*
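For completeness, the same settings could equivalently be passed when the session is built instead of through spark-defaults.conf; a minimal sketch, with the values copied from above:

    from pyspark.sql import SparkSession

    # enableHiveSupport() is what sets spark.sql.catalogImplementation=hive
    spark = (SparkSession.builder.appName("pyspark-hive")
             .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
             .config("spark.sql.hive.metastore.version", "2.3.0")
             .config("spark.sql.hive.metastore.jars", "/lib/hive/*")
             .enableHiveSupport()
             .getOrCreate())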
  2. Access Hive via JDBC:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-hive-jdbc").getOrCreate()
    # Read the Hive table through HiveServer2 with the generic JDBC source
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:hive2://hive:10000/default")
          .option("driver", "org.apache.hive.jdbc.HiveDriver")
          .option("user", "hive").option("password", "hive")
          .option("fetchsize", "512").option("dbtable", "bank_csv")
          .load())
    df.show(10)

Since the hive-jdbc connector seems to be missing from the prebuilt Hive distribution, I downloaded that jar from Maven as well. After that, the query runs, but every field is populated with its own column name:

 df.show(10)
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
|bank_csv.age|bank_csv.job|bank_csv.marital|bank_csv.education|bank_csv.default|bank_csv.balance|bank_csv.housing|bank_csv.loan|bank_csv.contact|bank_csv.day|bank_csv.month|bank_csv.duration|bank_csv.campaign|bank_csv.pdays|bank_csv.poutcome|bank_csv.y|
+------------+------------+----------------+------------------+----------------+----------------+----------------+-------------+----------------+------------+--------------+-----------------+-----------------+--------------+-----------------+----------+
only showing top 10 rows

If the table has columns defined as non-string types, it fails with errors about not being able to convert the data (since it only ever receives strings). As you can see, I am already using the fetchsize option, as suggested here: https://stackoverflow.com/questions/42843241/spark-jdbc-returning-dataframe-only-with-column-names
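For what it's worth, reading the same table via PyHive (which, as mentioned above, works fine for me) returns real rows, so the data and HiveServer2 themselves look healthy. A minimal sketch, with the connection details mirroring the jdbc:hive2:// URL above:

    from pyhive import hive

    # same host/port/user/database as in the JDBC URL used with Spark
    conn = hive.Connection(host="hive", port=10000, username="hive", database="default")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM bank_csv LIMIT 3")
    print(cursor.fetchall())  # prints actual values, not the column names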

Any ideas what's going on?
