COUNT on External Table in HIVE

I have been experimenting with the EXTERNAL table concept in Hive.

CREATE EXTERNAL TABLE IF NOT EXISTS MovieData
(id INT, title STRING, releasedate DATE, videodate DATE,
URL STRING, unknown TINYINT, Action TINYINT, Adventure TINYINT,
Animation TINYINT, Children TINYINT, Comedy TINYINT, Crime TINYINT,
Documentary TINYINT, Drama TINYINT, Fantasy TINYINT,
`Film-Noir` TINYINT, Horror TINYINT, Musical TINYINT,
Mystery TINYINT, Romance TINYINT, `Sci-Fi` TINYINT,
Thriller TINYINT, War TINYINT, Western TINYINT)
COMMENT 'This is a list of movies and their genres'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

I created the table with the statement above and then used a LOAD statement to populate it.

LOAD DATA LOCAL INPATH '/home/ubuntu/MovieLens.txt' INTO TABLE MovieData;

Later, I DROP the table in Hive, recreate it, and LOAD the data again. But when I then run a COUNT on the table, I get double the number of rows that are present in the file I loaded.

I have read in a few articles that dropping an EXTERNAL table deletes only the schema from the Hive metastore, not the data itself.

Can you please advise why Hive behaves this way?



In Hive, if you DROP a table and that table is an external table, only the table definition (the metadata) is deleted.

When you create a table, Hive creates a directory in Hadoop at the path you specify. If you do not give a path, it uses the default warehouse directory (e.g. /user/hive/warehouse/table_name) and creates a directory named after the table. When you load data into an external table from a text file, the data goes into the table and into that Hadoop directory. If you then drop the table, only the table definition is removed; because it is an external table, Hive does not delete the data from Hadoop.
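To see where a table's data actually lives, you can inspect its storage details (table name MovieData from the question; the location shown here is only an illustration of the typical default):

```sql
-- Shows the table's storage details, including its HDFS location
DESCRIBE FORMATTED MovieData;
-- Look for a line like:
--   Location: hdfs://namenode:8020/user/hive/warehouse/moviedata
```

Any files sitting in that directory, old or new, are counted when you query the table.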

The next time, after the drop, when you recreate the table and reload the data, the new file is stored alongside the old one in the same Hadoop path, so the table sees both copies. That is why you are getting double the count.

If you want to avoid this, you have to delete the files from Hadoop manually with hadoop fs -rm -r /path/to/directory (the older hadoop fs -rmr form is deprecated). Note that TRUNCATE TABLE is not allowed on external tables, so truncating before the drop only works for managed tables.
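As a sketch, assuming the table used the default warehouse directory (adjust the path to whatever LOCATION your table actually uses), the cleanup before reloading could look like:

```shell
# Remove the leftover data files so the reload starts from an empty directory
hadoop fs -rm -r /user/hive/warehouse/moviedata
```

After this, recreating the table and running the LOAD again gives a count that matches the file.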


In the Hadoop framework there are multiple ways to analyze data, and the choice depends on your use case, expertise, and preference. Hive EXTERNAL tables are designed so that other programmers can share the same data location from other data processing models such as Pig, MapReduce programs, or Spark, without affecting each other's work.

In the case of an external table, you will not lose the data if you accidentally drop the table: as you already know, the drop only removes the metadata (the schema), and the data remains untouched.

This method is useful when there is already legacy data in HDFS on which the user wants to put some metadata, so that the data can be queried and manipulated using Hive, or when you are loading data into HDFS from other ETL tools.
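For example, a minimal sketch of putting a Hive schema on top of data that already exists in HDFS (the path /data/movielens and the two-column schema are hypothetical):

```sql
-- Point an external table at existing files; no LOAD is needed,
-- the data is queryable as soon as the table is created
CREATE EXTERNAL TABLE IF NOT EXISTS MovieData_ext (id INT, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/data/movielens';
```

Dropping MovieData_ext later would leave the files under /data/movielens untouched.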

Since an EXTERNAL table does not delete the data, loading the file again is what gives you the inflated count. If you are on your own for all operations (load, analysis, drop, etc.), Hive also supports internal (managed) tables. If you want the data to be deleted when you drop the table, use an internal table: just remove the EXTERNAL keyword from your query, and dropping the table will delete the data as well.
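A minimal sketch of the managed-table variant (columns abbreviated for brevity; the schema otherwise matches the question's DDL):

```sql
-- Without the EXTERNAL keyword this is a managed (internal) table
CREATE TABLE IF NOT EXISTS MovieData (id INT, title STRING /* ... */)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Dropping it removes the metadata AND the files under the warehouse directory,
-- so a recreate-and-reload starts from an empty directory
DROP TABLE MovieData;
```

With this table type, the drop-recreate-reload cycle in the question would never double the count.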
