How do I extract text from a PDF so that a question-answering model can be run on the same document?

Question

Arijit Das

May 21, 2022, 02:01

To illustrate the question in the title:

Suppose you have a PDF document that was scanned from a hard copy, and there is a fixed set of questions to answer from the document itself. For example, the document contains a land contract, and the fixed questions are "Who is the seller?" and "What is the price of the asset?". The document probably states each answer two or three times, so for a human this is a simple task.

How to automate this?

Topic cnn computer-vision deep-learning nlp machine-learning

Category Data Science

Musakkhir Sayyed · Accepted Answer · October 6, 2018, 12:20

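You can use PyPDF2 to extract text from the PDF. Note that this reads only the PDF's embedded text layer; a purely scanned document holds page images, so it needs OCR before any text can be extracted.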

import PyPDF2

# PyPDF2 >= 3.0 API; older releases used PdfFileReader / getPage / extractText.
with open('sample.pdf', 'rb') as pdf_file, open('sample_output.txt', 'w', encoding='utf-8') as text_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page_number, page in enumerate(reader.pages, start=1):
        print('Page No - ' + str(page_number))
        # Image-only (scanned) pages have no text layer, so fall back to ''.
        page_content = page.extract_text() or ''
        text_file.write(page_content)
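
Once the text is in sample_output.txt, the question-answering step still has to find where each answer lives. Below is a minimal sketch of one possible first stage, not a full QA model: it splits the text into overlapping passages and picks the passage that shares the most words with the question. The function names (tokenize, split_into_passages, best_passage), the window and stride sizes, and the small stopword set are all assumptions made for illustration. Plain keyword overlap is used here as a stand-in for a real retriever; the selected passage would then be handed to an extractive QA model of your choice.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = frozenset({"the", "is", "of", "what", "who", "a", "an"})


def tokenize(text):
    """Lowercase the text and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_passages(text, window=50, stride=25):
    """Split text into overlapping passages of `window` words, stepping by `stride`."""
    words = text.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # the last window already reaches the end of the text
    return passages


def best_passage(question, passages):
    """Return the passage sharing the most non-stopword tokens with the question.

    Returns None when there are no passages. Ties go to the earliest passage.
    """
    query = set(tokenize(question)) - STOPWORDS
    return max(
        passages,
        key=lambda passage: len(query & set(tokenize(passage))),
        default=None,
    )


if __name__ == "__main__":
    # Made-up contract text standing in for the contents of sample_output.txt.
    contract = (
        "This agreement is made on the first day of March. "
        "The seller of the land is John Smith. The buyer is Mary Jones. "
        "The price of the asset is 50000 dollars."
    )
    chunks = split_into_passages(contract, window=8, stride=4)
    for question in ("Who is the seller?", "What is price of the asset?"):
        print(question, "->", best_passage(question, chunks))
```

Because the set of questions is fixed, the same loop can be run once per document. Overlapping windows are used so that an answer falling on a passage boundary still appears whole in at least one passage.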
