Thursday, August 29, 2013

MapReduce Data Processing Pattern

MapReduce is
  • A data processing approach for highly parallelizable datasets.
  • Implemented on a cluster, with many nodes working in parallel on different parts of the data.
Data is divided into small chunks that are then distributed across the data nodes in the cluster for parallel processing.

MapReduce requires writing two functions:
  • a mapper
  • a reducer
MapReduce functions can be implemented in Java, C++, C#, Python, JavaScript (to script Pig), etc.

These functions accept data as input and return transformed data as output. They are called repeatedly with subsets of the data; the output of the mapper is aggregated and then sent to the reducer. The JobTracker coordinates jobs across the cluster.
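The mapper/reducer flow described above can be sketched in plain Python as a word count. This is only a single-process illustration, not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the sample chunks are made up for the example, and the `shuffle` step stands in for the aggregation the framework performs between the two phases.

```python
from collections import defaultdict
from itertools import chain


def mapper(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group mapper output by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def run_job(chunks):
    """Map every chunk, group the pairs by key, then reduce each group."""
    mapped = chain.from_iterable(mapper(chunk) for chunk in chunks)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


# Hypothetical input, already split into chunks as a cluster would do.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_job(chunks)
print(word_counts)
```

On a real cluster each chunk would be mapped on a different node; here the chunks are simply processed one after another.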

The limiting factor for MapReduce is the size of the cluster.

Hadoop 

Implements MapReduce as a batch-processing system.
Optimized for flexible and efficient processing of huge amounts of data, not for response time.

The Hadoop ecosystem includes higher-level abstractions beyond MapReduce.

Hive provides a SQL-like query language, HiveQL. When we submit a HiveQL query, Hive generates MapReduce functions and runs them behind the scenes to carry out the requested query.
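To get a feel for how a query can turn into map and reduce steps, here is a conceptual Python sketch of an aggregate such as `SELECT region, SUM(amount) FROM sales GROUP BY region`. The `sales` rows, the column names and the function names are all hypothetical, and this is not the code Hive actually generates; it only shows the general shape of the idea.

```python
from collections import defaultdict

# Hypothetical rows of a "sales" table: (region, amount).
sales = [("east", 100), ("west", 250), ("east", 50), ("north", 75), ("west", 25)]


def map_row(row):
    """Map step: emit the GROUP BY column as key and the summed column as value."""
    region, amount = row
    return region, amount


def reduce_group(region, amounts):
    """Reduce step: apply the aggregate (SUM) to all values of one key."""
    return region, sum(amounts)


def group_by_sum(rows):
    """Group the mapped pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return dict(reduce_group(key, values) for key, values in groups.items())


totals = group_by_sum(sales)
print(totals)
```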

Pig is another query abstraction, with a data flow language known as Pig Latin. Pig also generates MapReduce functions and runs them behind the scenes to implement the higher-level operations described in Pig Latin.

Mahout is a machine learning abstraction.

Sqoop is a relational database connector.

Use Hadoop to get the right data subset and shape it into the desired form, then use BI tools to finish the analytical processing.

MongoDB Install on Windows 7 64-Bit

MongoDB requires a data folder to store its files. 
The default location for the MongoDB data directory is \data\db


Note
You may specify an alternate path for \data\db with the dbpath setting for mongod.exe, as in the following example:
C:\mongodb\bin\mongod.exe --dbpath d:\test\mongodb\data
If your path includes spaces, enclose the entire path in double quotation marks, for example:

C:\mongodb\bin\mongod.exe --dbpath "d:\test\mongo db data"

mongod.exe --dbpath "C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data"
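When the mongod command is built from a script, the quoting rule above can be left to Python. The sketch below uses the standard library's `subprocess.list2cmdline`, which applies the Windows (Microsoft C runtime) quoting rules; the helper name `mongod_command` is made up for this example, and the paths are the ones from the note above.

```python
import subprocess


def mongod_command(mongod_exe, dbpath):
    """Build the Windows command line that starts mongod with a data directory.

    Arguments containing spaces are wrapped in double quotes automatically.
    """
    return subprocess.list2cmdline([mongod_exe, "--dbpath", dbpath])


# Path with spaces: the dbpath argument gets quoted.
command = mongod_command(r"C:\mongodb\bin\mongod.exe", r"d:\test\mongo db data")
print(command)

# Path without spaces: no quotes are added.
plain_command = mongod_command(r"C:\mongodb\bin\mongod.exe", r"d:\test\mongodb\data")
print(plain_command)
```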


C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\bin>mongod.exe --dbpath "C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data"
Sat Jun 01 17:16:41.671 [initandlisten] MongoDB starting : pid=9196 port=27017 dbpath=C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data 64-bit host=BHARATH-XPS-PC
Sat Jun 01 17:16:41.674 [initandlisten] db version v2.4.4-rc0
Sat Jun 01 17:16:41.674 [initandlisten] git version: f25c410a9c4a88de36c82797e82e306be2274d40
Sat Jun 01 17:16:41.674 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
Sat Jun 01 17:16:41.675 [initandlisten] allocator: system
Sat Jun 01 17:16:41.676 [initandlisten] options: { dbpath: "C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data" }
Sat Jun 01 17:16:41.694 [initandlisten] journal dir=C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\journal
Sat Jun 01 17:16:41.696 [initandlisten] recover : no journal files present, no recovery needed
Sat Jun 01 17:16:41.885 [FileAllocator] allocating new datafile C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\local.ns, filling with zeroes...
Sat Jun 01 17:16:41.887 [FileAllocator] creating directory C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\_tmp
Sat Jun 01 17:16:42.101 [FileAllocator] done allocating datafile C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\local.ns, size: 16MB,  took 0.2 secs
Sat Jun 01 17:16:42.103 [FileAllocator] allocating new datafile C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\local.0, filling with zeroes...
Sat Jun 01 17:16:42.705 [FileAllocator] done allocating datafile C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\data\local.0, size: 64MB,  took 0.583 secs
Sat Jun 01 17:16:42.709 [initandlisten] command local.$cmd command: { create: "startup_log", size: 10485760, capped: true } ntoreturn:1 keyUpdates:0  reslen:37 824ms
Sat Jun 01 17:16:43.057 [websvr] admin web console waiting for connections on port 28017
Sat Jun 01 17:16:43.234 [initandlisten] waiting for connections on port 27017



C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\bin>mongo.exe
MongoDB shell version: 2.4.4-rc0
connecting to: test
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
        http://docs.mongodb.org/
Questions? Try the support group
        http://groups.google.com/group/mongodb-user
>
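Once mongod reports that it is waiting for connections on port 27017, a script can confirm the server is reachable before trying to use it. The following is a minimal sketch using only the Python standard library; the function name `is_listening` is made up for this example, and it only checks that something accepts TCP connections on the port, not that it is really MongoDB.

```python
import socket


def is_listening(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    print("mongod reachable:", is_listening())
```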

MongoDB as a Windows Service


Set log path
echo logpath="C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\log\mongo.log" > "C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0\mongod.cfg"
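The same config file can be written from a script. This is a rough sketch under a few assumptions: the helper name `write_mongod_cfg` is made up, it writes the same single `logpath=` line as the echo command above (quotes included), and it also creates the `log` folder first, on the assumption that mongod does not create it.

```python
from pathlib import Path


def write_mongod_cfg(install_dir):
    """Create <install_dir>/log and write <install_dir>/mongod.cfg with a logpath line."""
    install_dir = Path(install_dir)
    log_dir = install_dir / "log"
    # Assumption: the log folder has to exist before the service starts.
    log_dir.mkdir(parents=True, exist_ok=True)
    cfg_path = install_dir / "mongod.cfg"
    # Same line the echo command writes: logpath="<install_dir>\log\mongo.log"
    cfg_path.write_text('logpath="{}"\n'.format(log_dir / "mongo.log"))
    return cfg_path


if __name__ == "__main__":
    # Install folder taken from the commands above; adjust to your own setup.
    print(write_mongod_cfg(r"C:\Program Files (x86)\mongodb-win32-x86_64-2.4.4-rc0"))
```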

Running the MongoDB server (i.e. “mongod.exe”)
  • To run the MongoDB service:
    net start MongoDB
  • To stop the MongoDB service:
    net stop MongoDB
  • To remove the MongoDB service:
    C:\mongodb\bin\mongod.exe --remove

Installing MongoDB on Windows
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-windows/

Video on building a sample web application with MongoDB
http://www.10gen.com/presentations/building-web-applications-mongodb-introduction

Fiddler Hook for Visual Studio HTTP traffic

HTTP traffic from .NET applications can be routed through Fiddler by adding a default proxy to the Machine.config file. Fiddler listens on 127.0.0.1:8888 by default.

Configure Fiddler

The following configuration settings are required:

<configuration>
  <system.net>
    <defaultProxy>
      <proxy
              usesystemdefault="False"
              bypassonlocal="True" 
              proxyaddress="http://127.0.0.1:8888"              
              />
    </defaultProxy>
  </system.net>
</configuration>
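After editing Machine.config it is easy to lose track of which proxy address is actually configured. As a small sanity check, the sketch below parses a config fragment with Python's standard XML parser and returns the `proxyaddress` of the default proxy. The function name `default_proxy_address` and the `SAMPLE_CONFIG` string are made up for this example; a real Machine.config contains many more sections.

```python
import xml.etree.ElementTree as ET

# Fragment with the same settings as the configuration above.
SAMPLE_CONFIG = """
<configuration>
  <system.net>
    <defaultProxy>
      <proxy usesystemdefault="False"
             bypassonlocal="True"
             proxyaddress="http://127.0.0.1:8888" />
    </defaultProxy>
  </system.net>
</configuration>
"""


def default_proxy_address(config_xml):
    """Return the proxyaddress of the first <defaultProxy>/<proxy>, or None."""
    root = ET.fromstring(config_xml.strip())
    for default_proxy in root.iter("defaultProxy"):
        proxy = default_proxy.find("proxy")
        if proxy is not None:
            return proxy.get("proxyaddress")
    return None


address = default_proxy_address(SAMPLE_CONFIG)
print(address)
```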