How to store subset of columns from a csv file?

I need to create a table in hive (or Impala) by reading from a csv file (named file.csv), the problem is that this csv file could have a different number of columns each time I read it. The only thing I am sure of is that it will always have three columns called A, B, and C.

For example, the first csv I get could be (the first row is the header):

 ------------------------
|  X | Y | A | Z | B | C |
 ------------------------
|  1 | 2 | 3 | 4 | 5 | 6 | 

and the second:

 ------------
|  C | A | B | 
 -------------
|  1 | 2 | 3 | 

And I need to store this in a table, maybe an external table. Something like this:

CREATE EXTERNAL TABLE file (A STRING, B STRING, C STRING)
AS
SELECT A, B, C
USING HEADER
LOCATION 'input/loading/';

That obviously does not work. Any ideas?

Topic hive pyspark sql

Category Data Science


Hive's RegexSerDe which is often used to process logs could be used in your use case. You could use the regex to extract the first 3 columns from a line

CREATE EXTERNAL TABLE file (A STRING, B STRING, C STRING)
AS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES 
      ('input.regex'='^(\\w+)\\t(\\w+)\\t(\\w+)(.*)')
LOCATION 'input/loading/'

The regex included is using pattern groups (\w+) to extract the first 3 columns and the last group (.*) is anything else which may be in the line

See more details here Community Article and Official Documentation

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.