What ETL technique should I use for text documents with Hadoop?
I have a school Big Data project where the teacher is going to give us a large number of text documents (from the Project Gutenberg data set), and he wants us to output the document where a given "keyword" is most relevant. He also wants us to divide the project into 3 parts:
- Data acquisition, preprocessing (cleaning, transformation, joining, etc.), and loading: the ETL process.
- Data processing.
- A user-friendly application.
I need to decide which technologies or methods I'm going to use for each part of the project, but I have no idea what to do for the ETL part, since the documents will be written in plain, legible English (they are books). I would appreciate any information you can give me on this, as well as on the other parts of the project.
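To make the goal concrete, here is a minimal local sketch of the "keyword relevance" scoring I understand the processing part to require (plain TF-IDF in Python). The naive tokenizer, the directory layout, and running it locally instead of as a MapReduce job are all my own simplifying assumptions, not requirements from the teacher:

```python
# Minimal local prototype of the "keyword relevance" scoring (TF-IDF).
# A sketch to validate the logic before porting it to Hadoop/MapReduce.
import math
import os
import re
import sys
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase alphabetic words only. Real ETL would
    # also strip Gutenberg boilerplate headers/footers, stopwords, etc.
    return re.findall(r"[a-z]+", text.lower())

def build_index(doc_dir):
    # tf: term frequency per document; df: document frequency per term.
    tf, df = {}, Counter()
    for name in os.listdir(doc_dir):
        path = os.path.join(doc_dir, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            counts = Counter(tokenize(f.read()))
        tf[name] = counts
        df.update(counts.keys())  # each term counted once per document
    return tf, df

def most_relevant(keyword, tf, df):
    # TF-IDF: raw term frequency weighted by inverse document frequency.
    n_docs = len(tf)
    idf = math.log(n_docs / (1 + df[keyword]))
    scores = {doc: counts[keyword] * idf for doc, counts in tf.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    keyword, doc_dir = sys.argv[1], sys.argv[2]
    tf, df = build_index(doc_dir)
    print(most_relevant(keyword.lower(), tf, df))
```

Usage would be something like `python relevance.py whale ./books`. My understanding is that in Hadoop the TF part maps onto a per-document word-count job and the IDF part onto a second aggregation job, but I may be wrong about the best way to structure that.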
Thanks a million for reading.
Topic: etl, dataset, data-cleaning, apache-hadoop, bigdata
Category: Data Science