How to read html tables under multiple headers and combine them in a single pandas dataframe?
I have an html file like this:
h1Group 1/h1
table
tr
tdCol1/td
tdCol2/td
tdCol3/td
/tr
tr
tdValA/td
tdValB/td
tdValC/td
/tr
/table
h1Group 2/h1
table
tr
tdCol1/td
tdCol2/td
tdCol3/td
/tr
tr
tdValP/td
tdValQ/td
tdValR/td
/tr
/table
I want to read it into Pandas as if it had this structure:
table
tr
tdCaption/td
tdCol1/td
tdCol2/td
tdCol3/td
/tr
tr
tdGroup 1/td
tdValA/td
tdValB/td
tdValC/td
/tr
tr
tdGroup 2/td
tdValP/td
tdValQ/td
tdValR/td
/tr
/table
I can do it easily with PowerQuery the language for PowerBI:
let
Source = Web.Page(File.Contents("multiple_tables.html")),
#"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3"}, {"Col1", "Col2", "Col3"}),
#"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Caption] "Document") and ([Col1] "Col1"))
in
#"Filtered Rows"
Is there a way to achieve this effect in less than 10 lines of code using Python/Pandas plus some html parser opensource library? Or should I resign myself to writing lower level code to handle this?
Category Data Science