Thursday, May 15, 2014

Hadoop Hive External vs Internal Table

Hive tables can be created as EXTERNAL or INTERNAL (Hive's own documentation calls internal tables "managed" tables). The choice affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • You want to use a custom location such as ASV (Azure blob storage).
  • Hive should not own the data or control settings, directories, etc., because another program or process handles those things.
  • You are not creating the table from an existing table with CREATE TABLE ... AS SELECT (CTAS), which Hive does not allow for external tables.

Use INTERNAL tables when:

  • The data is temporary.
  • You want Hive to completely manage the lifecycle of the table and data.
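
The difference between the two kinds of table shows up in the DDL as a single keyword, plus (typically) a LOCATION clause. The following is a minimal sketch in Python that only builds the two HiveQL statements as strings; it does not connect to Hive. The helper name create_table_ddl, the table names web_logs and tmp_web_logs, the columns, and the path /data/web_logs are all hypothetical and chosen for illustration.

```python
def create_table_ddl(name, columns, external=False, location=None):
    """Build a HiveQL CREATE TABLE statement as a string (illustrative helper)."""
    # EXTERNAL tells Hive it does not own the data files.
    keyword = "CREATE EXTERNAL TABLE" if external else "CREATE TABLE"
    cols = ", ".join(f"{col} {typ}" for col, typ in columns)
    ddl = f"{keyword} {name} ({cols})"
    # LOCATION points the table at an existing directory of data files.
    if location is not None:
        ddl += f" LOCATION '{location}'"
    return ddl + ";"


# Hypothetical schema and path, for illustration only.
columns = [("ts", "STRING"), ("url", "STRING")]

# External table: DROP TABLE removes only the metadata; the files stay in place.
external_ddl = create_table_ddl(
    "web_logs", columns, external=True, location="/data/web_logs"
)

# Internal (managed) table: DROP TABLE removes the metadata and the data.
managed_ddl = create_table_ddl("tmp_web_logs", columns)

print(external_ddl)
print(managed_ddl)
```

Either statement could then be run from the Hive shell or any Hive client. Row format and storage clauses are omitted here, so Hive's defaults would apply.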

Wednesday, May 14, 2014

Hadoop Quick Info

Apache Hadoop– An open-source software framework that lets you cheaply store and process vast amounts of structured and unstructured data.

Flume– A service for collecting, aggregating, and moving large amounts of log and event data into Hadoop.

HBase- A scalable, distributed, column-oriented data store that runs on top of HDFS.

HDFS– Short for "Hadoop Distributed File System", the file system Hadoop uses to store data across the nodes of a cluster.

Hive- A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It allows you to query data using a SQL-like language called HiveQL (HQL).

HiveQL (HQL)- A SQL-like query language for Hive. Hive translates HiveQL queries into MapReduce jobs that run over data stored in HDFS.

JobTracker– the service within Hadoop which distributes MapReduce tasks to specific nodes in the cluster.

NameNode– the core of the HDFS file system. The NameNode maintains a record of all files stored on the Hadoop cluster.

Oozie - A workflow scheduler system for managing Apache Hadoop jobs.

Pig– a high-level platform for creating MapReduce programs used within Hadoop; its scripting language is called Pig Latin. An introduction to Pig.

Sqoop– a tool for transferring data between Hadoop and relational databases.

YARN– a resource manager for Hadoop 2. YARN is short for "Yet Another Resource Negotiator". Introduction to YARN on the Apache Hadoop website.

ZooKeeper - Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.