File converter: from CSV to HDF5

Can anyone recommend a command-line tool for converting a large CSV file into HDF5 format?



  • 1st approach: write the DataFrame with df.to_hdf, then append more data with append=True:
import numpy as np
import pandas as pd

#filename = '/tmp/test.hdf5'
filename = r'D:\test.hdf5'    # raw string, so '\t' is not read as a tab

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['C1', 'C2'])
print(df)
#    C1  C2
# 0   0   1
# 1   2   3
# 2   4   5
# 3   6   7
# 4   8   9

# Save to HDF5
df.to_hdf(filename, key='data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['C1', 'C2'])
df2.to_hdf(filename, key='data', append=True, format='table')

print(pd.read_hdf(filename, 'data'))

  • 2nd approach: you could append to an HDFStore instead of calling df.to_hdf:
import numpy as np
import pandas as pd

#filename = '/tmp/test.hdf5'
filename = r'D:\test.hdf5'    # raw string, so '\t' is not read as a tab
store = pd.HDFStore(filename)

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['C1', 'C2'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
  • 3rd approach: read the CSV with the chunksize parameter of pd.read_csv and append each chunk to the HDF5 file, as described in another answer.
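
For the chunked approach, a minimal sketch could look like the one below. It is not taken from the linked answer; the function name csv_to_hdf5, its parameters, and the default chunk size are all assumptions made for illustration. It reads the CSV piece by piece with pd.read_csv(..., chunksize=...) and appends each piece to one table in an HDFStore, so the whole file never has to fit in memory. HDFStore requires the optional PyTables package.

```python
import pandas as pd


def csv_to_hdf5(csv_path, h5_path, key="data", chunksize=100_000):
    """Stream a CSV file into one HDF5 table, chunk by chunk.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    rows = 0
    # mode="w" starts a fresh file; HDFStore needs PyTables ("tables") installed.
    with pd.HDFStore(h5_path, mode="w") as store:
        # With chunksize set, read_csv yields DataFrames of at most that many rows.
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            # append() stores data in 'table' format, so every chunk
            # extends the same dataset under `key`.
            store.append(key, chunk)
            rows += len(chunk)
    return rows
```

It could be called as csv_to_hdf5('big.csv', 'big.h5'), where both paths are placeholders. Two caveats: every chunk must have the same column dtypes, so passing an explicit dtype to read_csv may be needed, and string columns may need min_itemsize in store.append if later chunks contain longer strings than the first one.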

Personally, I like the 1st and 2nd approaches.
