How to store subset of columns from a csv file?

Question

How to store subset of columns from a csv file?

Francesco Pegoraro

2022年5月6日 19:01

I need to create a table in hive (or Impala) by reading from a csv file (named file.csv), the problem is that this csv file could have a different number of columns each time I read it. The only thing I am sure of is that it will always have three columns called A, B, and C.

For example, the first csv I get could be (the first row is the header):

 ------------------------
|  X | Y | A | Z | B | C |
 ------------------------
|  1 | 2 | 3 | 4 | 5 | 6 |

and the second:

 ------------
|  C | A | B | 
 -------------
|  1 | 2 | 3 |

And I need to store this in a table, maybe an external table. Something like this:

CREATE EXTERNAL TABLE file (A STRING, B STRING, C STRING)
AS
SELECT A, B, C
USING HEADER
LOCATION 'input/loading/';

That obviously does not work. Any ideas?

Topic hive pyspark sql

Category Data Science

ggordon · Accepted Answer · 2020年3月31日 00:02

Hive's RegexSerDe which is often used to process logs could be used in your use case. You could use the regex to extract the first 3 columns from a line

CREATE EXTERNAL TABLE file (A STRING, B STRING, C STRING)
AS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
    WITH SERDEPROPERTIES 
      ('input.regex'='^(\\w+)\\t(\\w+)\\t(\\w+)(.*)')
LOCATION 'input/loading/'

The regex included is using pattern groups (\w+) to extract the first 3 columns and the last group (.*) is anything else which may be in the line

See more details here Community Article and Official Documentation

How to store subset of columns from a csv file?

About