Does building a corpus make sense for a documentation project?
I have zero experience in data science or machine learning, so I am not able to determine whether building a corpus applies to the problem I am trying to solve.
I am trying to build a reference site for cloud technologies such as AWS and Google Cloud.
I was able to build structured data and identify the primary entities within a single ecosystem using standard web scraping and SQL queries.
But I want a mechanism that can autonomously identify entities, the information relevant to each entity, and the other entities it has relationships with.
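For illustration, here is the kind of output I am imagining, using NLTK's generic named-entity chunker. As far as I can tell it only knows broad labels like PERSON and ORGANIZATION, not cloud-specific entities like EC2 or EBS, so presumably it would need custom training:

```python
import nltk

# One-time downloads for the tokenizer, tagger, and chunker used below
# (resource names vary slightly between NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "An EC2 instance in AWS can be attached to an EBS volume."

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)  # generic NER: PERSON, ORGANIZATION, GPE, ...

# Print whatever named-entity chunks the off-the-shelf model finds.
for subtree in tree.subtrees(filter=lambda t: t.label() != 'S'):
    print(subtree.label(), ' '.join(word for word, tag in subtree))
```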
Given that documentation within a specific ecosystem follows a consistent style, can I use a few entities as training documents and then have a model classify the information I described above?
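What I have in mind is roughly the following supervised setup. This is a minimal sketch with hypothetical hand-labeled snippets standing in for real documentation pages:

```python
import nltk

# Hypothetical hand-labeled snippets standing in for real documentation pages.
train_docs = [
    ("launch an instance from an amazon machine image", "ec2"),
    ("attach an elastic ip address to your instance", "ec2"),
    ("create a bucket and upload an object", "s3"),
    ("configure lifecycle rules for objects in a bucket", "s3"),
]

def features(text):
    # Simple bag-of-words features; real pages would need better preprocessing.
    return {word: True for word in text.split()}

train_set = [(features(text), label) for text, label in train_docs]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("stop and start your instance")))  # -> 'ec2'
```

Is that the right general shape, even if the features and classifier would need to be more sophisticated?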
Is the starting point for this to build a corpus? I tried out NLTK's categorized corpus builder.
Is it fine to include a specific document in multiple categories? For example, the documentation for an AWS instance could be in the category ec2 and also in a general "computing unit" category (see the sketch below).
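From what I can tell, the cat_map argument of NLTK's CategorizedPlaintextCorpusReader allows exactly this kind of multi-category mapping. A minimal sketch, assuming hypothetical file names under a corpus/ directory:

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Assumes a directory of plain-text pages scraped from the docs, e.g.
#   corpus/aws_ec2_instances.txt, corpus/aws_s3_buckets.txt, ...
# cat_map lets a single file belong to several categories at once.
cat_map = {
    "aws_ec2_instances.txt": ["ec2", "compute"],
    "aws_s3_buckets.txt": ["s3", "storage"],
}

reader = CategorizedPlaintextCorpusReader(
    root="corpus",
    fileids=r".*\.txt",
    cat_map=cat_map,
)

print(reader.categories())                   # ['compute', 'ec2', 's3', 'storage']
print(reader.fileids(categories="compute"))  # ['aws_ec2_instances.txt']
print(reader.words("aws_ec2_instances.txt")[:10])
```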
Finally, does the problem I am trying to solve fit into the general NLP/ML space at all?