Friday, December 26, 2014

What is Hadoop?


Apache Hadoop is an open-source Java framework, created by Doug Cutting and Michael J. Cafarella, for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo!.

Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. It relies on an active community of contributors from all over the world for its success.

Hadoop is basically two things: a distributed file system (HDFS), which constitutes Hadoop's storage layer, and a distributed computation framework (MapReduce), which constitutes its processing layer.
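To get a feel for the storage layer, here is a minimal sketch that writes a small file into HDFS through Hadoop's Java FileSystem API (the path and file contents are made up for illustration). Under the hood, HDFS splits the file into blocks and replicates them across the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                  // handle to the cluster's file system
        Path file = new Path("/user/demo/hello.txt");          // hypothetical HDFS path
        try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if it already exists
            out.writeBytes("Hello HDFS\n");                    // HDFS chunks this into blocks and replicates them
        }
        fs.close();
    }
}

The same API reads files back with fs.open(file), so client code never needs to know which machines actually hold the blocks.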

You should go for Hadoop when your data is very large and your needs are offline and batch-oriented; Hadoop is not suitable for real-time processing. You set up Hadoop on a group of commodity machines connected over a network (called a cluster). You then store huge amounts of data in HDFS and process that data by writing MapReduce programs (or jobs). Being distributed, HDFS spreads the data across all the machines in the cluster, and MapReduce processes this scattered data locally by sending the computation to each machine, so you don't have to relocate this gigantic amount of data.
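The classic word count is essentially the "hello world" of MapReduce and gives a good picture of what such a job looks like: the map step emits a (word, 1) pair for every token it sees, and the reduce step sums those counts per word. This sketch uses Hadoop's standard mapreduce API; the input and output paths are passed on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs on each node, close to the data blocks it reads
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1) for every token
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a jar and launch it with something like hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are placeholders), then read the results out of the output directory in HDFS.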


Today, Hadoop and Big Data have almost become synonymous. Over time, Hadoop has evolved into one big herd of tools, such as Pig, Hive, HBase, ZooKeeper, Sqoop, Flume, Oozie, etc., each meant to serve a different purpose. Glued together, they give you a power-packed combo.



Four main characteristics

Scalable

New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.

Cost effective

Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.

Flexible

Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

Fault tolerant

When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
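On the storage side, this resilience comes from block replication: HDFS keeps several copies of every block (three by default) on different nodes, and failed tasks are simply re-run on a node holding another copy. As a small illustration (the file path is hypothetical), a file's replication factor can be inspected and changed through the same Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt");   // hypothetical HDFS path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());
        fs.setReplication(file, (short) 3);             // ask HDFS to keep 3 copies of each block
        fs.close();
    }
}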
