Hadoop: What It Is & How to Get Started
Hadoop! What is this all about? Everywhere we go, people talk about Hadoop and Big Data. Well, I have worked on Hadoop, so let me share my experience with it. Hadoop is an open-source platform that stores and manages large amounts of data cheaply and efficiently, which is why it is so closely associated with Big Data. Hadoop is managed by the Apache Software Foundation and has proven highly effective at its job.
So, what exactly does it do? Hadoop stores huge data sets across distributed clusters of servers and then runs "distributed" analysis applications on each cluster. It keeps your Big Data applications running smoothly even when individual servers or clusters fail. Another reason for its efficiency is that it lets your applications run without shuttling huge volumes of data across the network.
According to Apache’s definition of Hadoop:
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.”
Looking deeper into the product, what stands out is its incredibly flexible architecture. Hadoop is completely modular, which means almost any of its components can be swapped out for a different software tool.
To keep things simple, think of Hadoop as being made up of two main components: a data processing framework and a distributed file system for data storage.
The distributed file system is an array of storage clusters that holds the actual data. By default, Hadoop uses the Hadoop Distributed File System (HDFS), but it is flexible enough to work with other file systems.
Hadoop Distributed File System (HDFS)
Now, what is HDFS? It is more or less like a big bucket where you dump all your data, and it sits there until you want to do something with it, say, export it, capture it elsewhere, or run an analysis on it.

The data processing framework, on the other hand, is the tool that works on and with that data. By default this is MapReduce, a Java-based system. MapReduce tends to get more attention than HDFS for two big reasons:
- It actually gets your data processed.
- It is both challenging and exciting to work with.
If you have a little technical background, you will know that in traditional relational database systems data is analyzed with SQL queries, while non-relational databases use NoSQL. Hadoop, however, is not a database at all: data can be stored in it and pulled back out, but there are no queries involved. It is more of a data warehouse, and it needs MapReduce to actually process the data.
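To make that concrete, here is a minimal sketch of storing data in HDFS and pulling it back out with Hadoop's Java FileSystem API, no query language involved. The NameNode address and file path are made up for illustration; on a real cluster they would come from your site configuration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBucketDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Hypothetical path used for illustration.
            Path file = new Path("/user/demo/events.txt");

            // "Dump" some raw data into the bucket.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("first event\nsecond event\n".getBytes(StandardCharsets.UTF_8));
            }

            // Pull it back out later, e.g. for export or analysis.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

Notice that you simply write files and read them back; any analysis on top of that data is MapReduce's job.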
Data Processing Framework & MapReduce
MapReduce pulls out the information as needed, which gives data enthusiasts a lot of power and flexibility, but also a lot of complexity. There are tools that make this easier: the Hadoop ecosystem includes Hive, an Apache project that converts query language into MapReduce jobs. Because MapReduce runs jobs one at a time as batch processes, Hadoop tends to be used more as a data warehousing tool than a data analysis tool.

Yet another element that makes Hadoop unique is that its functions are distributed rather than centralized, as they are in typical traditional databases.
Having these functions sit on every machine in a Hadoop cluster brings two benefits. First, it adds redundancy to the system. Second, the processing is brought to the machine where the data is already stored, rather than shipping the data somewhere else, which speeds up information retrieval.
Robust and efficient are the two adjectives usually attached to Hadoop, and they describe the system as a whole. Here is how the process works: MapReduce uses two components, a JobTracker that sits on the Hadoop master node and TaskTrackers that sit on every node in the network. The JobTracker handles the mapping: it divides a computing job into defined pieces and ships those pieces out to the TaskTrackers on the machines where the relevant data is stored. Once a piece has run against its subset of the data, the result is "reduced" back to the central node of the Hadoop cluster and combined with the results from all the other machines.
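Here is the classic word-count example as a minimal sketch of that flow, written against Hadoop's Java MapReduce API: map tasks run against whichever slice of the input they are handed, and the reduce step pulls the per-word counts back together. The class names and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each task sees one slice of the input and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: the counts for each word are pulled back together and summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```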
HDFS is distributed in a similar fashion. A single NameNode tracks where data lives across the cluster of servers, known as DataNodes, which store the data in blocks (128 MB each by default). HDFS duplicates these blocks and spreads the copies around so that each block is replicated on multiple nodes across the cluster.
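If you are curious where your blocks actually ended up, the same Java FileSystem API can ask the NameNode for a report. This is a small sketch; the file path is hypothetical, and the block size and replication factor it prints are whatever your cluster is configured with.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            // Hypothetical file written earlier.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/events.txt"));

            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());

            // Each block is reported along with the DataNodes holding a replica of it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on " + String.join(", ", block.getHosts()));
            }
        }
    }
}
```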
This distribution style gives Hadoop yet another advantage. Because storage and processing live on the same servers, every new machine added to the cluster brings both additional disk space and additional processing power.
How is Hadoop Different from Past Techniques?
- Hadoop can handle data in a very fluid way.
- Hadoop has a simplified programming model.
- Hadoop is easy to administer.
- Hadoop is agile.
Why use Apache Hadoop?
- It’s cost effective.
- It’s fault-tolerant.
- It’s flexible.
- It’s scalable.
Customize Your Hadoop
As I have mentioned, HDFS and MapReduce are just the default components, and a user is not obliged to stick with them. For instance, Amazon Web Services has adapted its own S3 file system for Hadoop, while DataStax’s Brisk offers a replacement for MapReduce.

Getting Started with Hadoop
UFaber.com has some excellent courses on Hadoop. You can start with their free course, which offers some of the most extensive study tutorials for Hadoop training out there. The course covers:
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Processing of XML messages
- Web crawling and/or text processing
- A practical, hands-on approach: do real projects and measure their industry impact
- An industry-centric perspective on Hadoop
- Hands-on practice with advanced MapReduce techniques
- Last but not least, learning from an expert who has contributed to the Hadoop project itself