Bharath Tech Update

MapReduce is

A data processing approach for highly parallelizable datasets.
It runs on a cluster, with many nodes working in parallel on different parts of the data.

Data is divided into small chunks that are then distributed across all data nodes in the cluster for parallel processing.
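
As a rough illustration of the chunking step, here is a minimal Python sketch. The names split_into_chunks, process_chunk and process_in_parallel are hypothetical, and a thread pool on a single machine stands in for the data nodes; a real cluster would ship each chunk to a different machine.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Divide the input into consecutive chunks of at most chunk_size records."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """Stand-in for the work one node does on its chunk (assumed: a simple sum)."""
    return sum(chunk)


def process_in_parallel(records, chunk_size=3, workers=4):
    """Process every chunk concurrently, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)
```

Because each chunk is processed independently, adding workers (nodes) lets more chunks be handled at the same time.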

MapReduce requires writing two functions:

a mapper
a reducer

MapReduce functions can be implemented in Java, C++, C#, Python, JavaScript (to script Pig), etc.

These functions accept data as input and return transformed data as output. They are called repeatedly with subsets of the data; the output of the mapper is aggregated and then sent to the reducer. In Hadoop, the JobTracker coordinates jobs across the cluster.
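
The mapper and reducer flow described above can be sketched with a word count, the usual introductory example. This is a single-process simulation, not Hadoop code: the shuffle and run_job helpers are hypothetical stand-ins for the aggregation and coordination the framework performs between the two phases.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(mapped_pairs):
    """Group mapper output by key, as the framework does before reducing."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Call the mapper on each line, group the output, then call the reducer per key."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())
```

Calling run_job(["the cat", "The dog"]) counts "the" twice and the other words once. On a real cluster, the mapper calls would run on many nodes at once, each over its own chunk of lines.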

The limiting factor for MapReduce is the size of the cluster.

Hadoop

Implements MapReduce as a batch-processing system.

Optimized for flexible and efficient processing of huge amounts of data, not for response time.

The Hadoop ecosystem includes higher-level abstractions beyond MapReduce.

Hive provides a SQL-like query language, HiveQL. When we submit a HiveQL query, Hive generates MapReduce functions and runs them behind the scenes to carry out the requested query.
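
To make the idea concrete, consider a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept (the table and column names are made up for illustration). The Python sketch below shows, conceptually, how a grouped count can be expressed as a map step and a reduce step. It is not the code Hive actually generates, and map_row, reduce_group and count_by_dept are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter


def map_row(row):
    """Map step: emit (grouping key, 1) for each input row."""
    return row["dept"], 1


def reduce_group(dept, counts):
    """Reduce step: total the counts emitted for one grouping key."""
    return dept, sum(counts)


def count_by_dept(rows):
    """Sort mapped pairs by key, group them, and reduce each group."""
    pairs = sorted(map(map_row, rows), key=itemgetter(0))
    return dict(
        reduce_group(dept, (count for _, count in group))
        for dept, group in groupby(pairs, key=itemgetter(0))
    )
```

The sort followed by groupby plays the role of the grouping the framework performs between the map and reduce phases.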

Pig is another query abstraction, with a data flow language known as Pig Latin. Pig also generates MapReduce functions and runs them behind the scenes to implement the higher-level operations described in Pig Latin.

Mahout is a machine learning abstraction.

Sqoop is a relational database connector.

Use Hadoop to get the right data subset and shape it into the desired form, then use BI tools to finish the analytical processing.

Thursday, August 29, 2013

MapReduce Data Processing Pattern