Bharath Tech Update: MapReduce

Thursday, August 29, 2013

MapReduce Data Processing Pattern

MapReduce is

A data processing approach for highly parallelizable datasets.
Implemented on a cluster, with many nodes working in parallel on different parts of the data.

Data is divided into small chunks that are then distributed across all data nodes in the cluster for parallel processing.
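To make the chunking idea concrete, here is a minimal single-machine sketch. It is only an analogy: threads stand in for cluster nodes, and the names `split_into_chunks` and `process_chunk` are made up for this illustration, not part of Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(records, chunk_size):
    """Divide a list of records into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process_chunk(chunk):
    """Hypothetical per-chunk work: here, simply sum the numbers in the chunk."""
    return sum(chunk)

records = list(range(1, 11))             # the numbers 1..10
chunks = split_into_chunks(records, 4)   # three chunks: 4 + 4 + 2 records

# Threads stand in for data nodes; each one works on its own chunk.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))

# The partial results are combined into the final answer.
total = sum(partial_sums)
print(chunks, partial_sums, total)
```

Each chunk is processed independently of the others, which is what lets the work spread across as many nodes as the cluster has.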

MapReduce requires writing two functions:

a mapper
a reducer

MapReduce functions can be implemented in Java, C++, C#, Python, JavaScript (to script Pig), etc.

These functions accept data as input and return transformed data as output. The functions are called repeatedly on subsets of the data; the mapper's output is aggregated (grouped by key) and then sent to the reducer. The JobTracker coordinates jobs across the cluster.
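The mapper/reducer flow described above can be sketched with the classic word count. This is a plain-Python simulation under simplifying assumptions (everything runs in one process, and `run_job` is a hypothetical helper that plays the role of the framework), not code that runs on Hadoop as-is.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Combine all counts collected for one word into a single total."""
    return (word, sum(counts))

def run_job(lines):
    """Hypothetical driver standing in for the framework."""
    grouped = defaultdict(list)
    # Map phase: the mapper is called once per input record.
    for line in lines:
        for key, value in mapper(line):
            # Aggregation step: mapper output is grouped by key.
            grouped[key].append(value)
    # Reduce phase: the reducer is called once per key.
    return dict(reducer(key, values) for key, values in grouped.items())

result = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(result)
```

Note that the mapper and reducer never see the whole dataset, only one record or one key's values at a time. That restriction is what allows the framework to call them in parallel on different nodes.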

The limiting factor for MapReduce is the size of the cluster.

Hadoop

Implements MapReduce as a batch-processing system.

Optimized for flexible and efficient processing of huge amounts of data, not for response time.

The Hadoop ecosystem includes higher-level abstractions beyond MapReduce.

Hive provides a SQL-like query language, HiveQL. When we submit a HiveQL query, Hive generates MapReduce functions and runs them behind the scenes to carry out the requested query.
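As a rough illustration of how a query can be expressed as map and reduce steps, the sketch below hand-translates a simple filtered GROUP BY. The `page_views` table, its columns, and the sample rows are all hypothetical, and this is a conceptual picture only, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical query, shown only for illustration.
QUERY = """
SELECT user_id, COUNT(*) AS views
FROM page_views
WHERE url LIKE '/blog%'
GROUP BY user_id
"""

# Hypothetical rows of the page_views table: (user_id, url)
page_views = [
    ("u1", "/blog/hadoop"),
    ("u2", "/about"),
    ("u1", "/blog/hive"),
    ("u3", "/blog/pig"),
    ("u2", "/blog/hive"),
]

def mapper(row):
    """Apply the WHERE filter and emit the GROUP BY column as the key."""
    user_id, url = row
    if url.startswith("/blog"):   # WHERE url LIKE '/blog%'
        yield (user_id, 1)        # key = GROUP BY column

def reducer(user_id, ones):
    """Compute COUNT(*) for one group."""
    return (user_id, sum(ones))

# Group mapper output by key, then reduce each group.
grouped = defaultdict(list)
for row in page_views:
    for key, value in mapper(row):
        grouped[key].append(value)

views = dict(reducer(key, values) for key, values in grouped.items())
print(views)
```

The point of the abstraction is that the query author writes only the few lines of SQL-like text and never has to write the mapper or reducer by hand.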

Pig is another query abstraction, with a data flow language known as Pig Latin. Pig also generates MapReduce functions and runs them behind the scenes to implement the higher-level operations described in Pig Latin.

Mahout is a machine learning abstraction.

Sqoop is a relational database connector.

Use Hadoop to get the right data subset and shape it into the desired form, then use BI tools to finish the analytical processing.

Wednesday, April 17, 2013

MapReduce Design Patterns

Summarization patterns: get a top-level view by summarizing and grouping data
Filtering patterns: view data subsets such as records generated from one user
Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
Join patterns: analyze different datasets together to discover interesting relationships
Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
Input and output patterns: customize the way you use Hadoop to load or store data
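
Three of the patterns listed above (filtering, summarization, and join) can be sketched in a few lines of plain Python. The sample records, user names, and helper functions here are all hypothetical and run in a single process; they only show the shape of each pattern, not a Hadoop implementation.

```python
from collections import defaultdict

# Hypothetical input records: (user_id, comment_length)
comments = [("u1", 40), ("u2", 15), ("u1", 90), ("u3", 5), ("u2", 25)]
# Hypothetical second dataset for the join: (user_id, display_name)
users = [("u1", "Asha"), ("u2", "Ravi")]

def group_by_key(pairs):
    """Stand-in for the framework's grouping step: collect values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Filtering pattern: a map-only job that keeps the records of one user.
def filter_mapper(record, wanted_user="u1"):
    if record[0] == wanted_user:
        yield record

filtered = [out for rec in comments for out in filter_mapper(rec)]

# Summarization pattern: count, minimum and maximum of a value per user.
def summarize_reducer(user_id, lengths):
    stats = {"count": len(lengths), "min": min(lengths), "max": max(lengths)}
    return (user_id, stats)

summary = dict(summarize_reducer(k, v) for k, v in group_by_key(comments).items())

# Join pattern (reduce-side inner join): tag each record with its source,
# group by user_id, then pair the two sides inside the reducer.
def join_mapper(record, source):
    user_id, payload = record
    yield (user_id, (source, payload))

def join_reducer(user_id, tagged):
    names = [p for s, p in tagged if s == "users"]
    lengths = [p for s, p in tagged if s == "comments"]
    return [(user_id, name, length) for name in names for length in lengths]

tagged_pairs = [out for rec in users for out in join_mapper(rec, "users")]
tagged_pairs += [out for rec in comments for out in join_mapper(rec, "comments")]
joined = [row for k, v in group_by_key(tagged_pairs).items() for row in join_reducer(k, v)]

print(filtered, summary, joined, sep="\n")
```

User u3 has comments but no entry in `users`, so it drops out of the inner join. A metapattern would chain steps like these, for example filtering first and then summarizing the filtered output.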