How to read html tables under multiple headers and combine them in a single pandas dataframe?

Question

How to read html tables under multiple headers and combine them in a single pandas dataframe?

Luxspes

2019年8月21日 15:22

I have an html file like this:

h1Group 1/h1
table
  tr
    tdCol1/td
    tdCol2/td
    tdCol3/td
  /tr
  tr
    tdValA/td
    tdValB/td
    tdValC/td
  /tr
/table
h1Group 2/h1
table
  tr
    tdCol1/td
    tdCol2/td
    tdCol3/td
  /tr
  tr
    tdValP/td
    tdValQ/td
    tdValR/td
  /tr
/table

I want to read it into Pandas as if it had this structure:

table
 tr
   tdCaption/td
   tdCol1/td
   tdCol2/td
   tdCol3/td
 /tr
 tr
   tdGroup 1/td
   tdValA/td
   tdValB/td
   tdValC/td
 /tr
 tr
   tdGroup 2/td
   tdValP/td
   tdValQ/td
   tdValR/td
 /tr
/table

I can do it easily with PowerQuery the language for PowerBI:

let
    Source = Web.Page(File.Contents("multiple_tables.html")),
    #"Expanded Data" = Table.ExpandTableColumn(Source, "Data", {"Column1", "Column2", "Column3"}, {"Col1", "Col2", "Col3"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Data", each ([Caption]  "Document") and ([Col1]  "Col1"))
in
    #"Filtered Rows"

Is there a way to achieve this effect in less than 10 lines of code using Python/Pandas plus some html parser opensource library? Or should I resign myself to writing lower level code to handle this?

Topic powerbi pandas python

Category Data Science

Stephen Rauch · Accepted Answer · 2018年8月1日 03:31

With a bit of parsing via BeautifulSoup, we can get a pandas.Dataframe using pandas.read_html() like:

Code:

def get_tables(source):
    elems = iter(BeautifulSoup(source, 'lxml').find_all(['table', 'h1']))
    df = pd.DataFrame(
        pd.read_html(str(next(elems)), header=0)[0].iloc[0].rename(h1.text)
        for h1 in elems)
    df.index.names = ['Caption']
    return df

Test Code:

import pandas as pd
from bs4 import BeautifulSoup

with open('test.html', 'r') as f:
    print(get_tables(f))

Results:

         Col1  Col2  Col3
Caption                  
Group 1  ValA  ValB  ValC
Group 2  ValP  ValQ  ValR

How to read html tables under multiple headers and combine them in a single pandas dataframe?

Code:

Test Code:

Results:

About