Showing posts with label HDInsight. Show all posts

Wednesday, May 29, 2013

Microsoft HDInsight Public Preview - Setting up Hadoop cluster on Windows Azure

The Windows Azure HDInsight public preview was announced on https://HadoopOnAzure.com

1-Home-2013-03-25_2353

I was an existing HadoopOnAzure.com beta user, so I signed in with my already registered Live account

2-SignIn-2013-03-25_2355

Requesting the creation of a new cluster

3-Create-Cluster-2013-03-25_2355

Sample request form for cluster creation on HDInsight
4-Request-CreateCluster-2013-03-25_2356

After clicking the "Request Cluster" button in the bottom-right corner of the web interface

5-Creating-Cluster

Hadoop Cluster creation in progress

6-Deploying-Cluster

Hadoop cluster created and in Running status.

7-Hadoop-Cluster-Created

After clicking the "Go to Cluster" link under the Cluster URL section in the screenshot above, we see the cluster information screen below.

8-Cluster-Look

Configure Ports (ODBC server)

9-ODBC-Ports-Configure

Configure Ports (ODBC server) - port enabled

10-ODBC-Ports-Enabled

Sample Remote Desktop (RDP) connection file content for the HDInsight Hadoop cluster NameNode

11-Remote-Desktop-config
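
An .rdp file is plain text with one `name:type:value` setting per line. As a rough sketch only (this is not the file from the screenshot), the snippet below builds a minimal file with the two standard settings `full address` and `username`. The host name used here is a hypothetical placeholder, and the helper name `build_rdp_file` is my own.

```python
def build_rdp_file(host: str, username: str) -> str:
    """Return minimal .rdp file content for the given host and user."""
    settings = [
        f"full address:s:{host}",   # machine that Remote Desktop connects to
        f"username:s:{username}",   # account name pre-filled at the login prompt
    ]
    # Windows tools expect CRLF line endings in text files like this one.
    return "\r\n".join(settings) + "\r\n"


if __name__ == "__main__":
    # "example-cluster.example.com" is a placeholder, not a real cluster address.
    print(build_rdp_file("example-cluster.example.com", "bphdinsight"))
```

Saving the returned text to a file with the .rdp extension and opening it with mstsc should start a connection to the given host.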

Connecting remotely to the NameNode desktop through RDP

12-MSTSC-Connecting

Sample My Computer (Explorer) view on the NameNode of the Hadoop cluster created.

13-RemoteDesktop-DataNode-View-3

The local user "bphdinsight" is created from the credentials we provided under the cluster login section of the request form for Hadoop cluster creation.

13-RemoteDesktop-View-1

Remote Desktop view
13-Remote-Desktop-View-2

Sample preview of the components and build versions installed on the Hadoop cluster

14-Components-And-Versions

Google BigQuery Vs HDInsight - Comparison


Comparison: Google BigQuery and Windows Azure HDInsight

Pricing
Google BigQuery:
- BigQuery uses a columnar data structure, which means that for a given query, you are only charged for data processed in each column, not the entire table.
- Note: The first 100 GB of data processed per month is at no charge.
- Only 2 pricing components (query processing, storage)
Windows Azure HDInsight:
- Priced based on the configuration of the Hadoop cluster and the storage configuration.

Storage Options
Google BigQuery:
- Data can be loaded directly into tables in a BigQuery project.
- Note: The recommendation is to load data files first to Google Cloud Storage and then load the data into BigQuery tables.
- Max 4 GB per file
- Max 100 GB per load
- Max 1000 files per load
Windows Azure HDInsight:
- HDInsight provides two options for storing data:
  • Windows Azure Blob Storage and
  • Hadoop Distributed File System (HDFS)

Data Formats
Google BigQuery:
- BigQuery supports two schema types:
  • A flat schema in CSV or newline-delimited JSON format.
  • A nested/repeated schema in newline-delimited JSON format.
Windows Azure HDInsight:
- Supports unstructured data formats

Performance
Google BigQuery:
- Very fast response to the submitted query
Windows Azure HDInsight:
- Slow to respond (waited from several minutes to hours for jobs to complete and provide the required output)

Best Practices
Google BigQuery:
- BigQuery Data Strategies and Best Practices
Windows Azure HDInsight:
- Big Data Solutions on Windows Azure
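
The BigQuery figures in the comparison above (4 GB per file, 100 GB and 1000 files per load, first 100 GB processed per month free, charges only for the columns a query touches) can be turned into a quick planning check. This is a minimal sketch under my own assumptions: 1 GB is treated as 1024^3 bytes, the limits are the ones quoted in this post and may have changed since, and the function names `check_load` and `billable_gb` are hypothetical helpers, not part of any Google API.

```python
GB = 1024 ** 3  # assumption: 1 GB = 1024^3 bytes

# Limits as quoted in the comparison above; they may no longer be current.
MAX_FILE_BYTES = 4 * GB
MAX_LOAD_BYTES = 100 * GB
MAX_FILES_PER_LOAD = 1000
FREE_GB_PER_MONTH = 100


def check_load(file_sizes):
    """Return a list of limit violations for one load (empty list if it fits)."""
    problems = []
    if len(file_sizes) > MAX_FILES_PER_LOAD:
        problems.append("more than 1000 files in one load")
    if sum(file_sizes) > MAX_LOAD_BYTES:
        problems.append("total size exceeds 100 GB per load")
    for index, size in enumerate(file_sizes):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index} exceeds 4 GB")
    return problems


def billable_gb(column_gb, queried_columns, processed_so_far_gb=0.0):
    """GB charged for one query: only the queried columns count,
    and whatever is left of the monthly free tier is subtracted first."""
    processed = sum(column_gb[name] for name in queried_columns)
    free_left = max(0.0, FREE_GB_PER_MONTH - processed_so_far_gb)
    return max(0.0, processed - free_left)


if __name__ == "__main__":
    print(check_load([1 * GB, 5 * GB]))
    sizes = {"user_id": 30.0, "payload": 70.0, "timestamp": 50.0}
    # Query touches two of three columns; 90 GB were already processed this month.
    print(billable_gb(sizes, ["user_id", "timestamp"], processed_so_far_gb=90.0))
```

The second function illustrates the pricing point: the 70 GB "payload" column is never charged because the query does not read it.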

Tuesday, March 26, 2013

Windows Azure HDInsight now available for public preview

When I accessed https://www.hadooponazure.com, I saw an announcement from Microsoft that HDInsight is now integrated with WindowsAzure.com and available for public preview.


Windows Azure HDInsight - How to check the available Hadoop components and their versions

Windows Azure HDInsight is a Hadoop distribution based on the Hortonworks Data Platform 1.1.0.

One way to check the available components and their versions is to log in to the cluster over an RDP connection and go to the "C:\apps\dist\" directory.
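
If Python happens to be available on the node, the same check can be scripted instead of browsing the folder by hand. The sketch below assumes that each folder under "C:\apps\dist\" is named `<component>-<version>` (for example `hive-0.9.0`); that naming pattern is my assumption, and folders that do not follow it are simply skipped. The function names are hypothetical helpers.

```python
import os
import re

# Assumption: folders are named "<component>-<version>", e.g. "hive-0.9.0".
DIR_PATTERN = re.compile(r"^(?P<name>.+?)-(?P<version>\d+(?:\.\d+)*.*)$")


def parse_component_dirs(dir_names):
    """Map component name to version for every folder name matching the pattern."""
    components = {}
    for dir_name in dir_names:
        match = DIR_PATTERN.match(dir_name)
        if match:
            components[match.group("name")] = match.group("version")
    return components


def list_components(dist_path=r"C:\apps\dist"):
    """Read the sub-folders of dist_path and parse them into name/version pairs."""
    names = [
        entry for entry in os.listdir(dist_path)
        if os.path.isdir(os.path.join(dist_path, entry))
    ]
    return parse_component_dirs(names)


if __name__ == "__main__":
    # Only meaningful when run on the cluster node itself.
    if os.path.isdir(r"C:\apps\dist"):
        for name, version in sorted(list_components().items()):
            print(f"{name}: {version}")
```

Keeping the parsing separate from the directory listing makes it easy to try the pattern on a handful of folder names before running it on the cluster.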


Windows Azure HDInsight provides Hadoop services through the following components:

Apache Hadoop 1.0.3
Apache Hive 0.9.0
Apache Pig 0.9.3
Apache Sqoop 1.4.2
SQL Server JDBC Driver 3.0
Apache Oozie 3.2.0
Apache HCatalog 0.4.1
Apache Templeton 0.1.4