cloudcon-hive

This repo contains data set and queries I use in my presentations on SQL-on-Hive (i.e. Impala and hive) at various conferences. This started off as a repo that was use in my presentation at CloudCon in San Francisco, so the name of the repo reflects that but now this repo has morphed into a single repository that contains my dataset for demos and such at various different presentations on Hive and Impala.

Files in the repository

2008.tar.gz: Flight delay dataset from 2008.
airports.csv: Dataset linking airport codes to their full names. More details in Introduction section.
README.md: This file.

Datasets

There are 2 datasets in the repo.

a) The first dataset contains on-time flight performance data from 2008, originally released by Research and Innovative Technology Administration (RITA). The source of this dataset is http://stat-computing.org/dataexpo/2009/the-data.html. The dataset

b) The second dataset contains listing of various airport codes in continental US, Puerto Rico and US Virgin Islands. The source of this dataset is http://www.world-airport-codes.com/ The data was scraped from this website and then cleansed to be in its present CSV form.

Setup

You will need a box with Hadoop and Hive installed. Easiest way to get it to install one of the Demo VMs or use packages available from Apache Bigtop. Cloudera Demo VMs are available from Cloudera's website. You can learn more about Apache Bigtop and install integration test Apache Hadoop and Hive by going to the project's main page and the project's wiki.

Git clone this repo, untar dataset and launch hive:


git clone git://github.com/markgrover/cloudcon-hive.git
tar -xzvf cloudcon-hive/2008.tar.gz
hive

Hive commands

You can create tables for the datasets in Hive. These tables can directly be used in Impala, since Hive and Impala share metadata. Of course, you can create these tables directly in Impala itself too, in case you don't want to use Hive. We will only show you commands for creating tables in Hive only though.

On hive shell: Create hive table, flight_data:


CREATE TABLE flight_data(
   year INT,
   month INT,
   day INT,
   day_of_week INT,
   dep_time INT,
   crs_dep_time INT,
   arr_time INT,
   crs_arr_time INT,
   unique_carrier STRING,
   flight_num INT,
   tail_num STRING,
   actual_elapsed_time INT,
   crs_elapsed_time INT,
   air_time INT,
   arr_delay INT,
   dep_delay INT,
   origin STRING,
   dest STRING,
   distance INT,
   taxi_in INT,
   taxi_out INT,
   cancelled INT,
   cancellation_code STRING,
   diverted INT,
   carrier_delay STRING,
   weather_delay STRING,
   nas_delay STRING,
   security_delay STRING,
   late_aircraft_delay STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Load the data into the table:


LOAD DATA LOCAL INPATH '2008.csv' OVERWRITE INTO TABLE flight_data;

Ensure the table got created and loaded fine:


SHOW TABLES;
SELECT
   *
FROM
   flight_data
LIMIT 10;

Query the table. Find average arrival delay for all flights departing SFO in January:


SELECT
   avg(arr_delay)
FROM
   flight_data
WHERE
   month=1
   AND origin='SFO';

On hive shell: create the airports table


CREATE TABLE airports(
   name STRING,
   country STRING,
   area_code INT,
   code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Load data into airports table:


LOAD DATA LOCAL INPATH 'cloudcon-hive/airports.csv' OVERWRITE INTO TABLE airports;

On hive shell, list some rows from the airports table:


SELECT
   *
FROM
   airports
LIMIT 10

On hive shell: run a join query to find the average delay in January 2008 for each airport and to print out the airport's name:


SELECT
   name,
   AVG(arr_delay)
FROM
   flight_data f
   INNER JOIN airports a
   ON (f.origin=a.code)
WHERE
   month=1
GROUP BY
   name;

Impala commands

Now, we will run the same queries on Impala to compare its performance with Hive. No need to re-create these tables in Impala since Hive and Impala share the same metastore. You can directly access them in Impala.


bash> impala-shell
impala> connect localhost;
impala> show tables;

On Impala shell, query the table. Find average arrival delay for all flights departing SFO in January:


SELECT
   avg(arr_delay)
FROM
   flight_data
WHERE
   month=1
   AND origin='SFO';

On Impala shell: run a join query to find the average delay in January 2008 for each airport and to print out the airport's name:


SELECT
   name,
   AVG(arr_delay)
FROM
   flight_data f
   INNER JOIN airports a
   ON (f.origin=a.code)
WHERE
   month=1
GROUP BY
   name;

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
2008.tar.gz		2008.tar.gz
Grover-Hive Presentation.pdf		Grover-Hive Presentation.pdf
README.md		README.md
airports.csv		airports.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cloudcon-hive

Files in the repository

Datasets

Setup

Hive commands

Impala commands

About

Releases

Packages

vijay-balakrishnan/cloudcon-hive

Folders and files

Latest commit

History

Repository files navigation

cloudcon-hive

Files in the repository

Datasets

Setup

Hive commands

Impala commands

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages