Data from various sources, such as email, images, social media feeds, PDF files, and JSON files, is stored in a data lake (repository), typically built on the Hadoop Distributed File System (HDFS). Because the data is preserved in its native (unstructured) format, provenance is well maintained and the forced transformation into structured data required by data warehousing is avoided. Several types of data analytics can be performed directly on the data stored in a data lake.
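As a minimal sketch of how raw files can be landed in an HDFS-backed data lake without transformation, the Java snippet below uses the Hadoop FileSystem API to copy a local JSON file into a "raw" directory as-is. The NameNode address, local file name, and /datalake/raw/json/ path are illustrative placeholders, not prescribed by Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DataLakeIngest {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Copy a raw JSON file into a landing area of the lake in its
        // native format, without imposing a structured schema.
        Path localJson = new Path("file:///tmp/events-2017-11-17.json");
        Path lakeDir   = new Path("/datalake/raw/json/");
        fs.mkdirs(lakeDir);
        fs.copyFromLocalFile(localJson, lakeDir);

        fs.close();
    }
}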
 
Apache Hadoop is an open-source software library designed for the distributed processing of large data sets (files) across clusters of servers. Hadoop uses standard programming languages and runs on commodity hardware. HDFS is designed to scale from a single server to a large cluster, with computing power and storage on each server; adding servers, however, also increases the probability of failure. HDFS is therefore designed to detect faults and recover quickly and automatically at the application layer to attain high availability.
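To illustrate how HDFS spreads data across the servers of a cluster, the sketch below (again with a placeholder NameNode address and file path) reads a file's replication factor and lists the DataNodes holding each block. Keeping several replicas of every block is what allows HDFS to recover automatically when an individual server fails.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        // Inspect a file already stored in the lake (placeholder path).
        Path file = new Path("/datalake/raw/json/events-2017-11-17.json");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());

        // Each block is replicated on several DataNodes, so the loss of
        // one server does not make the file unavailable.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on hosts: " + String.join(", ", block.getHosts()));
        }

        fs.close();
    }
}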

Reference: Apache Hadoop (https://hadoop.apache.org/)


Last Revised on: November 17th, 2017