The difference between the two is fundamental, yet many people, particularly beginners, confuse them.
Big data is a set of data so large that it cannot be stored or processed by a traditional database system or by a single conventional machine. That is the problem. Hadoop, on the other hand, is a solution to it.
Now let’s define Big Data and Hadoop individually so we can understand the difference better.
The term “big data” is sometimes used as an umbrella term for the entire ecosystem, and that is precisely where the confusion begins. So let’s define Big Data in the simplest possible manner:
It is a set of data so vast and complex that it can neither be processed by a traditional data processing system nor stored in a traditional database.
Big Data typically exhibits the following 3 properties:
- Volume: The data set is so large that a single machine cannot store or process it.
- Velocity: The data arrives at high speed and must be ingested and processed quickly.
- Variety: The data comes in multiple formats: structured, semi-structured, and completely unstructured.
Hadoop is based on Google's MapReduce programming model; it was implemented as an open-source alternative to Google's proprietary MapReduce system and is used for processing Big Data.
In layman’s terms, Hadoop is a framework that breaks an application down into many small parts. These parts run in parallel on the nodes of a cluster of systems, and the framework aggregates the partial results into a single final result. This is what makes it possible to process big data on a cluster of ordinary machines.
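The split-process-aggregate idea above can be sketched in plain Python. This is a toy word-count job, not Hadoop's actual API: the function names and the in-memory "cluster" of string chunks are illustrative assumptions, but the map, shuffle, and reduce phases mirror what the framework does across real nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped):
    """Shuffle: group all emitted values by key across every map output."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# The "cluster": in real Hadoop, each chunk would be processed on a
# different node; here the chunks are just processed one after another.
chunks = ["big data is big", "hadoop processes big data"]
mapped = [map_phase(chunk) for chunk in chunks]
result = reduce_phase(shuffle_phase(mapped))
print(result)  # word counts aggregated across both chunks
```

Because each map call only sees its own chunk, the work parallelizes naturally; the shuffle and reduce steps are where the partial results come back together.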
Since its release, however, "Hadoop" has mostly been used as an umbrella term for an entire ecosystem of frameworks and applications for storing, processing, and analyzing big data.
The current Hadoop ecosystem includes Hadoop MapReduce, the Hadoop kernel, and the Hadoop Distributed File System, accompanied by a number of related projects such as Apache Storm, Hive, Spark, and Pig.
Two primary components of Hadoop are:
- HDFS: The Hadoop Distributed File System, the open-source counterpart of the Google File System. It stores big data distributed across the machines of a cluster.
- MapReduce: The framework used to process the data stored in HDFS.
Big data is a concept: a huge amount of data and the challenge of handling it. Hadoop is a framework for handling that data, and it is only one of many in the ecosystem capable of doing so.
Big Data is a complex asset open to many interpretations, whereas Hadoop is software built to accomplish a well-defined set of goals.
Big Data, being just a collection of data, can come in many formats, while in Hadoop different code must be written to handle each format, whether structured, semi-structured, or completely unstructured.
Hadoop is an open-source framework maintained and developed by a global community of users. It comes with core components such as MapReduce and HDFS, accompanied by supporting components such as Pig, Hive, and others.
By analogy, Hadoop is the processing machine and big data is the raw material fed into it to produce meaningful results.