RAM crashes when converting a large XML file to a DataFrame
I have written the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB, but for anything larger the session runs out of RAM and crashes (13 GB of RAM on Google Colab). The same happens when I run it locally in a Jupyter Notebook (laptop with 4 GB of RAM). Is there a way to optimize the code?
Code
    # Libraries
    import pandas as pd
    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

    # Function to convert an XML file to a pandas DataFrame
    def xml2df(file_path):
        # Parse the XML file and obtain the root
        tree = ET.parse(file_path)
        root = tree.getroot()

        dict_list = []
        for _, elem in ET.iterparse(file_path, events=("end",)):
            if elem.tag == "row":
                dict_list.append(elem.attrib)  # parse all attributes
                elem.clear()

        df = pd.DataFrame(dict_list)
        return df
Part of an XML File ('Badges.xml')
    <badges>
      <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
      <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
      <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
      <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
      <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
      <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
      <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
This conversion is needed so that I can perform further data analysis.
I have asked this on Stack Overflow (Link), but the answers did not solve my problem. I hope to find a solution here.
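One direction I am considering is sketched below. It is not a tested fix for files of this size. My function calls `ET.parse`, which builds the whole tree in memory before `iterparse` even starts, and the tree built by `iterparse` also keeps growing because the cleared `row` elements stay attached to the root. The sketch drops `ET.parse`, clears the root after each row, and yields the data in fixed-size DataFrame chunks. The names `iter_rows`, `iter_chunks`, and `chunk_size` are my own placeholders, and the default of 100,000 rows per chunk is an arbitrary assumption.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def iter_rows(file_path, tag="row"):
    """Yield one attribute dict per matching element without keeping the tree."""
    context = ET.iterparse(file_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield dict(elem.attrib)  # copy, so clearing cannot affect the result
            root.clear()  # detach processed children so the tree stays small


def iter_chunks(file_path, tag="row", chunk_size=100_000):
    """Yield DataFrames of at most chunk_size rows each (hypothetical helper)."""
    buffer = []
    for row in iter_rows(file_path, tag):
        buffer.append(row)
        if len(buffer) >= chunk_size:
            yield pd.DataFrame(buffer)
            buffer = []
    if buffer:  # remaining rows that did not fill a whole chunk
        yield pd.DataFrame(buffer)


# Possible usage (assumes Badges.xml exists): append each chunk to a CSV file
# so that only one chunk is held in memory at a time.
# for i, chunk in enumerate(iter_chunks("Badges.xml")):
#     chunk.to_csv("badges.csv", mode="w" if i == 0 else "a",
#                  header=(i == 0), index=False)
```

Joining all chunks with `pd.concat(..., ignore_index=True)` would give a single DataFrame again, but that may still exceed the available RAM, since every attribute value is stored as a Python string. Processing or saving each chunk separately avoids that.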
Topic dataframe python-3.x parsing pandas python
Category Data Science