What are common problems around Hadoop storage?

I've been asked to lead a program to understand why our Hadoop storage is constantly near capacity. What questions should I ask?

  1. How old is the data (data age)?
  2. How large is each dataset (data size)? A sketch for measuring both age and size per directory follows this list.
  3. What is the housekeeping/retention schedule?
  4. How do we identify the different types of compression used by different applications?
  5. How can we identify where the duplicate data sources are?
  6. Are jobs designated for edge nodes actually running only on edge nodes?
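
To put numbers behind the first two questions, something like the following Python sketch can summarize how much data sits under each top-level directory and how much of it has not been modified in months. It shells out to the standard `hdfs dfs -ls -R` command; the root path `/data` and the 90-day cutoff are assumptions to adjust for your cluster.

    # Sketch only: summarize data size and age per top-level HDFS directory.
    # Assumes the `hdfs` CLI is on PATH; the root path /data and the 90-day
    # cutoff are assumptions to adjust for your cluster.
    import subprocess
    from collections import defaultdict
    from datetime import datetime, timedelta

    ROOT = "/data"  # hypothetical warehouse root

    def list_files(root):
        """Yield (size_bytes, modified, path) for every file under root."""
        out = subprocess.run(
            ["hdfs", "dfs", "-ls", "-R", root],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            parts = line.split(None, 7)
            if len(parts) < 8 or parts[0].startswith("d"):
                continue  # skip the "Found N items" header and directories
            size = int(parts[4])
            modified = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
            yield size, modified, parts[7]

    def summarize(root, stale_days=90):
        """Print total bytes and bytes untouched for stale_days, per top-level dir."""
        cutoff = datetime.now() - timedelta(days=stale_days)
        totals, stale = defaultdict(int), defaultdict(int)
        for size, modified, path in list_files(root):
            top = "/".join(path.split("/")[:3])  # e.g. /data/app1
            totals[top] += size
            if modified < cutoff:
                stale[top] += size
        for top in sorted(totals, key=totals.get, reverse=True):
            print(f"{top:40s} {totals[top] / 1e9:10.1f} GB total,"
                  f" {stale[top] / 1e9:10.1f} GB older than {stale_days} days")

    if __name__ == "__main__":
        summarize(ROOT)

Directories that are large but mostly stale are the first candidates for archiving or retention policies.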

Topic: apache-hadoop

Category: Data Science


You can ask the following questions:

  1. Hadoop cluster size (number of nodes/machines allocated for data storage)
  2. Hardware configuration (storage capacity) of each node/machine
  3. Block replication factor used for fault tolerance (a sketch for checking capacity and replication follows this list)
  4. Compression technique used for file storage
  5. Data archiving process
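
Several of these can be checked directly from the command line rather than by interviewing teams. The sketch below assumes the `hdfs` CLI is available and the account is allowed to run `hdfs dfsadmin -report`; the sample file path is hypothetical. It reports how much of the configured capacity is in use and the replication factor of a given file, which together show how much raw disk a logical dataset really consumes.

    # Sketch only: report cluster capacity usage and the replication factor of a
    # file. Assumes the `hdfs` CLI is on PATH and the account is allowed to run
    # `hdfs dfsadmin -report`; the sample file path below is hypothetical.
    import re
    import subprocess

    def cluster_capacity():
        """Return the cluster-wide counters from `hdfs dfsadmin -report` in bytes."""
        out = subprocess.run(
            ["hdfs", "dfsadmin", "-report"],
            capture_output=True, text=True, check=True,
        ).stdout
        summary = {}
        for line in out.splitlines():
            m = re.match(r"(Configured Capacity|DFS Used|DFS Remaining):\s+(\d+)", line)
            if m:
                # The cluster-wide summary comes before the per-datanode sections,
                # so keep only the first occurrence of each counter.
                summary.setdefault(m.group(1), int(m.group(2)))
        return summary

    def replication_factor(path):
        """Replication factor of a single file (directories report 0)."""
        out = subprocess.run(
            ["hdfs", "dfs", "-stat", "%r", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip())

    if __name__ == "__main__":
        cap = cluster_capacity()
        used_pct = 100.0 * cap["DFS Used"] / cap["Configured Capacity"]
        print(f"DFS used: {used_pct:.1f}% of configured capacity")
        # Hypothetical file; replace with one of your own large datasets.
        print("Replication of /data/app1/part-00000:",
              replication_factor("/data/app1/part-00000"))

Because every block is stored "replication factor" times, a 10 TB dataset at the default factor of 3 consumes roughly 30 TB of raw disk before compression is even considered.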
