Parse PDF into JSON or XML

I want to create a neural net that can extract specific words from a PDF document into JSON or XML. For example, assume I have a PDF containing information about countries, and I want to extract each country's name and population to obtain something like this:

<countries>
  <country>
    <name>France</name>
    <population>70m</population>
  </country>
  ...
</countries>

Should I build a neural net and train it myself? If so, can you recommend a good tutorial to follow? Or is there an already trained one that I can use?


Unless your goal is specifically to build a neural net, this can be solved in a much simpler way. For country names, for example, you can simply check the text against a list of known country names, and handle the other fields similarly. At most, some standard NLP could give you what you want. A neural net is probably overkill here.
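To illustrate the list-lookup idea, here is a minimal sketch in Python. It assumes the PDF text has already been extracted into a plain string by some other tool, that each country appears on the same line as its population, and that populations are written with a suffix such as "70m". The names KNOWN_COUNTRIES, POPULATION_RE and extract_countries are hypothetical, chosen only for this example.

```python
import re

# Hypothetical, deliberately tiny lookup list; a real one would cover every country.
KNOWN_COUNTRIES = ["France", "Germany", "Japan"]

# Assumed population format: a number followed by an m, k or bn suffix, e.g. "70m".
POPULATION_RE = re.compile(r"\b\d+(?:\.\d+)?(?:m|k|bn)\b", re.IGNORECASE)


def extract_countries(text):
    """Return one {"name", "population"} dict per line that mentions a known country."""
    records = []
    for line in text.splitlines():
        for name in KNOWN_COUNTRIES:
            # Whole-word match so that a name is not found inside a longer word.
            if re.search(r"\b" + re.escape(name) + r"\b", line):
                match = POPULATION_RE.search(line)
                records.append({
                    "name": name,
                    # None when no population-like token is found on the line.
                    "population": match.group(0) if match else None,
                })
    return records


sample_text = "France has a population of 70m.\nNothing relevant here.\nJapan: 125m people"
records = extract_countries(sample_text)
```

If the PDFs are less regular than this (tables, names and figures on different lines, several numbers per sentence), the matching rules would need to be adapted, which is where an NLP library might start to pay off.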

If a neural net is compulsory, you would get a better answer by specifying more details. Are you looking for a fixed set of fields? What kind of text content do the PDFs contain?

Also, in case you were expecting a neural net to give you JSON directly as output: that will not be the case. You would have to convert the network's output to JSON yourself, but that conversion is trivial.
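As a rough sketch of that conversion step, assume the extraction stage (neural net or not) yields a list of dictionaries with "name" and "population" keys. That shape is an assumption made for this example, as are the function names to_json and to_xml. Python's standard library can then produce either format:

```python
import json
import xml.etree.ElementTree as ET


def to_json(records):
    """Serialize the records as a JSON object with a top-level "countries" list."""
    return json.dumps({"countries": records}, indent=2)


def to_xml(records):
    """Serialize the records as <countries><country>...</country></countries>."""
    root = ET.Element("countries")
    for record in records:
        country = ET.SubElement(root, "country")
        ET.SubElement(country, "name").text = record["name"]
        ET.SubElement(country, "population").text = record["population"]
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Hypothetical extraction result, matching the example in the question.
example_records = [{"name": "France", "population": "70m"}]
json_output = to_json(example_records)
xml_output = to_xml(example_records)
```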

I know I have not fully answered your question, but I hope this gives you some direction.
