What ETL technique should I use for text documents using Hadoop?

I have a school Big Data project where the teacher is going to give us a large number of text documents (from the Project Gutenberg data set), and he wants us to output the document in which a given "keyword" is most relevant. He also wants us to divide the project into three parts:

  • Data acquisition, preprocessing (cleaning, transforming, joining, etc.), and loading: the ETL process.
  • Data processing.
  • User-friendly application.

I need to define what technologies or methods I'm going to use for each part of the project, but I have no idea what to do for the ETL part, since the documents will be written in plain, legible English (they are books). I would appreciate any information you can give me on this, and also about the other parts of the project.

Thanks a million for reading.

Topic: etl, dataset, data-cleaning, apache-hadoop, bigdata

Category: Data Science


I would suggest a data fabric. That would meet your needs for data acquisition, preprocessing, data quality, master data management, etc.

Given I work for Talend, I would suggest our data fabric. =)

Here’s a case study with the Panama Papers. https://www.talend.com/blog/2017/01/17/talend-data-masters-2016-icij-decoded-panama-papers-talend/

The general approach in that case study, combining a data fabric with analytics tools, applies regardless of which data fabric you use.
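For the analytics side specifically, "the document where a keyword is most relevant" is usually scored with something like TF-IDF, whatever tool produces the cleaned text. Here is a minimal standalone Python sketch of that scoring; it is my own illustration, not something from Talend or the case study, and the document names are made up:

```python
# Hypothetical TF-IDF scoring for a single keyword across documents.
# "docs" maps a document name to its token list, i.e. the output of the ETL step.
import math
from collections import Counter

def tf_idf_scores(keyword, docs):
    """Return {document name: TF-IDF score} for one keyword."""
    keyword = keyword.lower()
    # Document frequency: how many documents contain the keyword at all.
    df = sum(1 for tokens in docs.values() if keyword in tokens)
    if df == 0:
        return {}
    idf = math.log(len(docs) / df)
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)[keyword] / len(tokens)  # term frequency in this document
        scores[name] = tf * idf
    return scores

# Toy usage: the highest-scoring document is the "most relevant" one.
docs = {
    "moby_dick.txt": "call me ishmael the whale the whale".split(),
    "frankenstein.txt": "it was on a dreary night of november".split(),
}
scores = tf_idf_scores("whale", docs)
print(max(scores, key=scores.get))  # -> moby_dick.txt
```

At Gutenberg scale you would produce the same counts with a distributed job rather than in-memory Counters, but the scoring formula stays the same.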

You can find the trial and open versions of Talend at https://www.talend.com/download/.

Edit: Here’s another example, which implements a backend and UI showing the ebooks of the Gutenberg project. It allows importing the whole Gutenberg index using a Camel route. https://github.com/Talend/tesb-rt-se/tree/master/examples/tesb/ebook
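If you decide to do the ETL step directly on Hadoop instead, the cleaning part can be as small as a Streaming mapper that lowercases and tokenizes each book. This is a rough sketch of my own, not taken from the example above; the word/document key layout and the regex are assumptions, and the Gutenberg header/footer boilerplate would still need stripping:

```python
#!/usr/bin/env python3
# clean_mapper.py - hypothetical Hadoop Streaming mapper for the cleaning step.
# Reads raw Gutenberg text from stdin and emits "word<TAB>document<TAB>1" lines
# that a reducer can sum into per-document term counts.
import os
import re
import sys

# Hadoop Streaming exposes the current input file through this environment
# variable (older Hadoop versions call it map_input_file).
doc_id = os.path.basename(os.environ.get("mapreduce_map_input_file", "unknown"))

WORD_RE = re.compile(r"[a-z']+")  # keep only lowercase alphabetic tokens

for line in sys.stdin:
    for word in WORD_RE.findall(line.lower()):
        print(f"{word}\t{doc_id}\t1")
```

You would run it with the hadoop-streaming jar, pointing -mapper at this script and -reducer at a small summing script; the resulting per-document counts then feed a relevance calculation like the TF-IDF sketch above.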
