Showing posts with label Hadoop. Show all posts

Thursday, May 15, 2014

Hadoop Hive External vs Internal Table

Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set, or if you are iterating through various possible schemas.
  • You want to use a custom storage location such as ASV (Azure Storage Vault).
  • Hive should not own the data or control its settings, directories, etc.; another program or process will handle those things.
  • You are not creating the table based on an existing table (AS SELECT).

Use INTERNAL tables when:

  • The data is temporary.
  • You want Hive to completely manage the lifecycle of the table and data.
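The distinction shows up directly in the DDL. A minimal sketch (the table name, columns, and storage path below are illustrative, not from a real schema):

```sql
-- Managed (internal) table: Hive owns the data.
-- DROP TABLE deletes both the metadata and the underlying files.
CREATE TABLE page_views_internal (
  user_id  STRING,
  url      STRING,
  ts       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive owns only the metadata.
-- DROP TABLE removes the table definition; the files under
-- /data/page_views remain untouched.
CREATE EXTERNAL TABLE page_views_external (
  user_id  STRING,
  url      STRING,
  ts       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';
```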

Wednesday, May 14, 2014

Hadoop Quick Info

Apache Hadoop– an open-source software framework that allows you to cheaply store and process vast amounts of structured and unstructured data.

Flume– A service for collecting, aggregating, and moving large amounts of log and event data into Hadoop.

HBase- A scalable, distributed, column-oriented data store that runs on top of HDFS.

HDFS– an acronym for "Hadoop Distributed File System"

Hive- A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It allows you to query data using a SQL-like language called HiveQL (HQL).

HiveQL (HQL)- A SQL-like query language for Hadoop, used to execute MapReduce jobs on data in HDFS.

JobTracker– the service within Hadoop which distributes MapReduce tasks to specific nodes in the cluster.

NameNode– the core of the HDFS file system. The NameNode maintains a record of all files stored on the Hadoop cluster.

Oozie - workflow scheduler system to manage Apache Hadoop jobs.

Pig– a high-level programming language for creating MapReduce programs used within Hadoop.

Sqoop– a tool for transferring data between Hadoop and relational databases.

YARN– a resource manager for Hadoop 2. YARN is short for "Yet Another Resource Negotiator".

ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Thursday, August 29, 2013

MapReduce Data Processing Pattern

MapReduce is
  • a data-processing approach for highly parallelizable datasets.
  • implemented as a cluster, with many nodes working in parallel on different parts of the data.

Data is divided into small chunks that are then distributed across all data nodes in the cluster for parallel processing.

MapReduce requires writing two functions:
  • a mapper
  • a reducer
MapReduce functions can be implemented in Java, C++, C#, Python, JavaScript (to script Pig), etc.

These functions accept data as input and then return transformed data as output. Functions are called repeatedly, with subsets of data, with the output of the mapper being aggregated and then sent to the reducer. JobTracker coordinates jobs across the cluster.
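The map/shuffle/reduce flow described above can be sketched in plain Python. This is a toy word count, not Hadoop code: in a real Hadoop Streaming job the mapper and reducer would read stdin and write stdout, and the framework would do the shuffle across nodes; here everything runs in one process so the pattern is easy to see.

```python
# Minimal word-count sketch of the MapReduce pattern.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (word, 1) pair for every word in the input record.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce: sum the counts for one distinct key.
    return (word, sum(counts))

def run_job(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: group pairs by key, as the framework would.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reducer(word, (c for _, c in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(run_job(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In Hadoop the same two functions run on many nodes at once, with the JobTracker coordinating which node processes which chunk.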

The limiting factor for MapReduce is the size of the cluster.

Hadoop 

Implements MapReduce as a batch-processing system.
It is optimized for flexible and efficient processing of huge amounts of data, not for response time.

Hadoop ecosystem includes higher level of abstractions beyond MapReduce.

Hive provides a SQL-like query language. When we submit a HiveQL query, Hive generates MapReduce jobs behind the scenes to carry out the requested query.
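For example, a simple aggregate like the following (the table and column names are illustrative) is compiled into a map phase and a reduce phase without the user ever writing MapReduce code:

```sql
-- Count page views per URL. Hive translates the GROUP BY into
-- a map phase (emit url, 1) and a reduce phase (sum the counts).
SELECT url, COUNT(*) AS views
FROM   page_views
GROUP  BY url;
```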

Pig is another query abstraction, with a data-flow language known as Pig Latin. Pig also generates MapReduce functions behind the scenes to implement the higher-level operations described in Pig Latin.

Mahout is a machine learning library built on top of Hadoop.

Sqoop is a relational database connector.

Use Hadoop to extract the right data subset and shape it into the desired form, then use BI tools to finish the analytical processing.

Wednesday, May 29, 2013

Microsoft HDInsight Public Preview - Setting up Hadoop cluster on Windows Azure

 Windows Azure HDInsight public preview announced on https://HadoopOnAzure.com

I was an existing HadoopOnAzure.com beta user, so I signed in with my already registered Live account.

Requesting the creation of a new cluster

Sample request form for cluster creation on HDInsight

After clicking the "Request Cluster" button at the bottom-right corner of the web interface

Hadoop Cluster creation in progress

Hadoop cluster created and in Running status.

After clicking the "Go to Cluster" link under the Cluster URL section in the screenshot above, we see the cluster information screen below.

Configure Ports (ODBC server)

ODBC ports enabled

Sample HDInsight Hadoop cluster NameNode Remote Desktop (RDP) connection file content

Connecting remotely to the NameNode desktop through RDP

Sample My Computer / Explorer view on the NameNode of the Hadoop cluster just created.

Local user "bphdinsight" created with the credentials we provided under the cluster login section when submitting the cluster creation request.

Remote Desktop view

Sample preview of the component builds and versions installed on the Hadoop cluster

Wednesday, April 17, 2013

MapReduce Design Patterns


  • Summarization patterns: get a top-level view by summarizing and grouping data
  • Filtering patterns: view data subsets such as records generated from one user
  • Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
  • Join patterns: analyze different datasets together to discover interesting relationships
  • Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
  • Input and output patterns: customize the way you use Hadoop to load or store data 

Monday, October 8, 2012

Preview Apache™ Hadoop™-based Services for Windows Azure


In this blog post, I will document my preview of "Apache™ Hadoop™-based Services for Windows Azure".

After my first look, I feel the following are some of the best things Microsoft has done for admins and developers in this Hadoop offering on Windows Azure:
1. Metro-style user experience for creating a Hadoop cluster on Windows Azure.
2. Very easy to configure and manage the Hadoop cluster and its nodes.
3. Job history management and viewing of the job execution history is seamless.
4. Deploying a Map/Reduce job is simple (perhaps only for basic job routines?).
5. Development of Map/Reduce code in JavaScript.
6. Web console to execute and manage Map/Reduce jobs.
7. Viewing Map/Reduce job results in the web console, graph output, etc.

So much to document and explore in this new BigData approach from Microsoft....

Ok Let's get started...

You can start from https://www.hadooponazure.com/

Once you have logged in using your Windows Live credentials, you see the "Request a new Hadoop Cluster" screen on the Home page.

Provide all required information and finally click the "Request Cluster" button in the right navigation.




Hadoop Cluster allocation in progress...

Hadoop Cluster Nodes - Allocation in progress...


Hadoop - Manage Cluster screen

Log in using MSTSC remote access to view the created node in the Hadoop cluster...



Hadoop Cluster - Map/Reduce job administration



Summary screen of the Hadoop cluster just created...

Hadoop Cluster - Configure Ports 
- FTP
- ODBC Server

By default, these ports are in Closed status.



Hadoop Cluster - Job execution History

You can download the Client Utilities for Microsoft's Apache Hadoop-based services, which help with querying data in Hive using the ODBC driver or Excel.


Hadoop Cluster - Summary Screen (Metro-Style web interface)



Hadoop - Release Cluster




Hadoop Cluster - Release in progress...




The test Hadoop cluster created above is now released...

Thanks again to Microsoft for enabling developer communities to preview Hadoop on Windows Azure.

If you are also interested in this and need more information, refer to this link:
How-To and FAQ Guide for Apache™ Hadoop™-based Services for Windows Azure