RAM crashed for XML to DataFrame conversion function

I wrote the following function, which converts an XML file to a DataFrame. It works well for files smaller than 1 GB, but for anything larger the session runs out of RAM and crashes (13 GB of RAM on Google Colab). The same happens if I run it locally in a Jupyter Notebook (laptop with 4 GB of RAM). Is there a way to optimize the code?

Code

#Libraries
import pandas as pd
import xml.etree.ElementTree as ET

#Function to convert XML file to Pandas Dataframe    
def xml2df(file_path):

  #Parsing XML File and obtaining root
  tree = ET.parse(file_path)
  root = tree.getroot()

  dict_list = []

  for _, elem in ET.iterparse(file_path, events=('end',)):
      if elem.tag == 'row':
        dict_list.append(elem.attrib)      # PARSE ALL ATTRIBUTES
        elem.clear()

  df = pd.DataFrame(dict_list)
  return df

Part of an XML File ('Badges.xml')

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />

This conversion is needed so that I can perform further data analysis.

I have also asked this on Stack Overflow, but the answers there did not solve my problem. I hope to find a solution here.

Topic dataframe python-3.x parsing pandas python

Category Data Science


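import dask
import dask.bag as db
import dask.dataframe as dd
from dask.dot import dot_graph
from dask.diagnostics import ProgressBar

# Use the multiprocessing scheduler (dask.set_options was removed from Dask;
# dask.config.set is its replacement)
dask.config.set(scheduler='processes')

# Read the file lazily as a bag of text lines instead of loading it all at once
tags_xml = db.read_text('data/Tags.xml', encoding='utf-8')

# Peek at the first 10 lines
tags_xml.take(10)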
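
Each element of tags_xml is one line of text, so the next step is to turn a line into a record. Below is a minimal sketch of such a per-line parser. The name parse_row_line is hypothetical, and it assumes every row sits on its own line as a self-closing <row ... /> element, as in the sample from the question.

```python
import xml.etree.ElementTree as ET


def parse_row_line(line):
    """Return the attributes of one '<row ... />' line as a dict, or None.

    Assumption: each row is a complete, self-closing element on a single line.
    """
    line = line.strip()
    if not line.startswith("<row"):
        # XML declaration, <badges>, </badges> and blank lines are skipped
        return None
    # Parse just this one small element and copy its attributes
    return dict(ET.fromstring(line).attrib)
```

With the bag above, something along the lines of tags_xml.map(parse_row_line).filter(lambda r: r is not None).to_dataframe() should then give a lazy Dask DataFrame. All attribute values arrive as strings, so numeric and date columns would still need converting afterwards.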

Refer to the "Dask with XML" tutorial for the complete walkthrough.
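
If adding Dask is not an option, the standard library parser alone may be enough. The function in the question calls ET.parse first, which builds the whole tree in memory before iterparse even starts, and the root element keeps a reference to every row that has been parsed. The sketch below avoids both and hands the rows back in batches. The name iter_row_batches and the batch_size parameter are assumptions made for this illustration, not part of the original code.

```python
import xml.etree.ElementTree as ET


def iter_row_batches(file_path, tag="row", batch_size=100_000):
    """Yield lists of attribute dicts, with at most batch_size rows per list."""
    batch = []
    # Request "start" events too, so the very first event hands us the root
    context = ET.iterparse(file_path, events=("start", "end"))
    _, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # Copy the attributes before the element is discarded
            batch.append(dict(elem.attrib))
            # Clearing the root drops the finished children it still references
            root.clear()
            if len(batch) >= batch_size:
                yield batch
                batch = []

    # Hand back whatever is left after the last full batch
    if batch:
        yield batch
```

Each batch can be passed to pd.DataFrame(batch) and then aggregated or appended to a file on disk. Concatenating every batch into one DataFrame would still need enough RAM for the full table, so this helps most when the analysis can be done piece by piece or after converting the columns to compact dtypes.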
