Hadoop is an open-source framework that can be downloaded and installed directly for use. Hadoop is designed for storing and processing huge data sets on a cluster of commodity hardware; it is not recommended for small data sets.
A cluster is a set of machines connected in a single LAN.
HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware, using a streaming access pattern.
Streaming access pattern:
Write once, read any number of times, but do not try to change the contents of a file once it has been stored in HDFS.
In a typical local file system the block size is 4 KB; if you store a 2 KB file in a 4 KB block, the remaining 2 KB is wasted. The HDFS block size is 64 MB (or 128 MB in some configurations). If you store a 34 MB file, it occupies only 34 MB; the remaining 30 MB of the block boundary is not wasted and remains free for other data. This is why HDFS is specially suited to storing huge data sets.
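The block arithmetic above can be sketched in a few lines. This is an illustrative simulation, not Hadoop code; the function name and the MB units are assumptions for the example.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes of the HDFS-style blocks a file occupies.
    The last block only takes as much space as the remaining data,
    so no space inside the block boundary is wasted."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(remaining, block_size_mb))
        remaining -= block_size_mb
    return blocks

# A 34 MB file occupies a single 34 MB block:
print(split_into_blocks(34))    # [34]
# A 150 MB file spans three blocks: 64 + 64 + 22 MB:
print(split_into_blocks(150))   # [64, 64, 22]
```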
Suppose a client has a machine and wants to store and process some data. The intention of Hadoop is to maintain a cluster with many machines and spread the data across them. The client does not pick the machines (DataNodes) itself, because it is outside the cluster; it first contacts the NameNode, and the NameNode decides how the data is split into blocks and which DataNodes will hold them. Because these are commodity machines, any of them may go down at any time. HDFS overcomes this problem with a default replication factor of 3: each block of a file is stored on 3 different DataNodes, so two copies act as backups and data loss is avoided.
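The 3-replica idea can be sketched as follows. Real HDFS placement is rack-aware; this round-robin pick is a simplified stand-in, and the node names are made up for the example.

```python
def place_replicas(block_index, datanodes, replication=3):
    """Pick `replication` distinct DataNodes for one block.
    Simplified round-robin placement: start at an offset derived
    from the block index and take the next `replication` nodes."""
    n = len(datanodes)
    return [datanodes[(block_index + i) % n] for i in range(replication)]

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))  # ['dn1', 'dn2', 'dn3']
print(place_replicas(3, nodes))  # ['dn4', 'dn1', 'dn2']
```

Each block ends up on 3 different machines, so the failure of any one machine leaves two live copies.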
If any DataNode fails to send its block report in time, the NameNode assumes that the DataNode may be dead, removes that node's entries from its metadata, and arranges for the affected blocks to be stored on another machine. This happens within a fraction of a second. Without the metadata there is no use of Hadoop, and the entire cluster becomes inaccessible; because HDFS cannot work without the NameNode, the NameNode is called a single point of failure.
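The re-replication decision can be sketched like this. It is a minimal simulation of the NameNode's bookkeeping, assuming a simple dict of block-to-DataNode metadata; the names are illustrative.

```python
def find_under_replicated(block_map, live_nodes, replication=3):
    """Given NameNode-style metadata {block: [datanodes]}, drop
    entries for dead nodes and report how many extra replicas
    each affected block now needs."""
    todo = {}
    for block, nodes in block_map.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < replication:
            todo[block] = replication - len(alive)
    return todo

meta = {"blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"]}
# dn4 missed its block report and is declared dead:
print(find_under_replicated(meta, live_nodes={"dn1", "dn2", "dn3"}))
# {'blk_2': 1} -> blk_2 must be copied to one more DataNode
```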
When a client writes a program and submits it as a job, the JobTracker comes into the picture and takes over the request. The JobTracker does not itself know what data is stored on which DataNode, because there is no direct communication between the JobTracker and the DataNodes. However, all the master services can talk to each other, so the JobTracker asks the NameNode. The JobTracker's role is to assign tasks to TaskTrackers; when a TaskTracker receives its piece of the job and processes it, that processing is called a map task.
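The assignment step can be sketched as a data-locality preference: the JobTracker tries to run the map task on a node that already holds a replica of the block, so the data does not have to move. This is a simplified illustration, not the JobTracker's actual scheduler, and the node names are assumptions.

```python
def assign_task(block_replicas, free_trackers):
    """Prefer a free TaskTracker on a node that already holds a
    replica of the block (data-local); otherwise fall back to the
    first free tracker and ship the data to it."""
    for node in block_replicas:
        if node in free_trackers:
            return node
    return free_trackers[0]

# The block lives on dn2 and dn3; dn3 has a free TaskTracker:
print(assign_task(["dn2", "dn3"], ["dn1", "dn3"]))  # dn3
# No replica-holding node is free, so any free tracker is used:
print(assign_task(["dn2", "dn3"], ["dn1", "dn4"]))  # dn1
```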
The TaskTracker sends a heartbeat to the JobTracker every 3 seconds. If the TaskTracker stops sending heartbeats, the JobTracker waits for 10 heartbeat intervals, i.e. 30 seconds. If no heartbeat arrives within that time, the JobTracker concludes that the TaskTracker is either working very slowly or dead. In that case the JobTracker gives the job to another TaskTracker on a DataNode holding the same data, and the process continues.
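The timeout rule above (3 s heartbeat interval, 10 missed beats allowed) can be sketched directly. A minimal illustration; the function and parameter names are assumptions, not Hadoop API.

```python
def is_tracker_lost(last_heartbeat, now, interval=3, missed_allowed=10):
    """JobTracker-style check: a TaskTracker heartbeats every
    `interval` seconds; once more than `missed_allowed` beats
    (3 s * 10 = 30 s) have been missed, it is treated as slow
    or dead and its task is reassigned."""
    return (now - last_heartbeat) > interval * missed_allowed

print(is_tracker_lost(last_heartbeat=100, now=115))  # False (15 s elapsed)
print(is_tracker_lost(last_heartbeat=100, now=131))  # True  (31 s elapsed)
```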