DATA ENGINEERING ON YELP DATASET USING HADOOP & HIVE

Big Data Definition

Big Data is the collection of huge semi-structured and unstructured datasets generated by a heterogeneous group of high-performance sources, ranging from social networks to scientific computing applications.

  • Evolution of data - from GB to PB to ZB (gigabytes to petabytes to zettabytes)
  • Sources of Big Data - traditional databases, social media, cloud, IoT
  • Characteristics of Big Data - Volume, Velocity, Variety, Veracity
  • Some industry applications of Big Data - retail, financial services, healthcare, manufacturing
  • Types of data - structured, semi-structured, unstructured
  • Job roles in the Big Data domain - Big Data consultant, Big Data architect, Big Data engineer, data analyst

Hadoop - HDFS & MR

  • Hadoop - an open-source, Java-based framework that manages data storage and processing for big data applications in a distributed computing environment.
  • Features of Hadoop - flexibility, scalability, fault tolerance, cost-effectiveness.
  • Hadoop vs RDBMS - Hadoop does not support real-time data processing (OLTP); it is designed for large-scale batch processing workloads (OLAP), whereas an RDBMS supports real-time data processing.
  • Hadoop has two main components: HDFS (storage) and YARN (resource management).

Components of HDFS

  • NameNode
  • DataNode
  • Secondary NameNode
  • Standby NameNode

YARN - Yet Another Resource Negotiator

  • YARN also follows the Master-Slave architecture.
  • Master - ResourceManager
  • Slave - NodeManager

Key aspects of HDFS

  • Replication in HDFS - the default replication factor is 3.
  • Splitting of data - the default block size in HDFS is 128 MB (for example, a 1 GB file is split into eight 128 MB blocks, and with replication factor 3 it occupies 3 GB of raw storage).
  • File operations - reading from a file, writing to a file.

MAP-REDUCE

  • Map - a data-parallel model is used in the Map phase; each mapper processes its share of the input independently.
  • Reduce - an inverse-tree parallel model is used in the Reduce phase to aggregate the intermediate results (a Hive example of both phases follows below).

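Because Hive (introduced next) compiles its queries into MapReduce jobs on the cluster, a simple aggregation is a convenient way to see both phases. The query below is only a hedged sketch: the yelp_review table and its columns are the ones assumed later in this document, and EXPLAIN is used just to print the generated map and reduce stages.

    -- Sketch only: table/column names are the assumed Yelp tables defined later.
    -- Map phase: scan the table's HDFS blocks in parallel and emit partial
    -- counts per user_id; Reduce phase: merge the partial counts per key.
    SELECT user_id, COUNT(*) AS reviews_written
    FROM   yelp_review
    GROUP  BY user_id;

    -- EXPLAIN prints the map/reduce stages Hive generates for the query.
    EXPLAIN SELECT user_id, COUNT(*) FROM yelp_review GROUP BY user_id;
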
HIVE

  • Hive is data warehouse software built on top of Hadoop that lets data be queried with SQL-like statements (HiveQL).

  • Hive vs RDBMS - Hive: write once, read many times; RDBMS: read and write many times.

  • Use of the Hive Metastore - it stores all the information (metadata) about Hive tables in a central repository.

  • Hive data types - primitive data types and complex data types.

  • Primitive data types - STRING, INT, FLOAT, DOUBLE, TIMESTAMP, BINARY.

  • Complex data types - STRUCT (object), MAP (key-value), ARRAY (indexed); see the sketch after this list.

  • Internal vs external tables

  • External table - used when the data should remain stored on HDFS even after the table is dropped, because Hive does not delete data stored outside of the Hive warehouse.

  • Internal table - the data is owned by Hive, so when the table is dropped the data stored in it is deleted along with the metadata.

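A minimal sketch tying the points above together (the table names, columns and HDFS location are illustrative assumptions, not the session's actual DDL): an external table that uses the three complex types, a query that accesses them, and a managed (internal) table for contrast.

    -- Illustrative only; names and the HDFS path are assumptions.
    -- External table: DROP TABLE removes only the metadata, the files stay on HDFS.
    CREATE EXTERNAL TABLE business_demo (
        business_id STRING,
        name        STRING,
        stars       DOUBLE,
        hours       MAP<STRING, STRING>,                            -- MAP: key-value
        categories  ARRAY<STRING>,                                  -- ARRAY: indexed
        addr        STRUCT<city:STRING, state:STRING, zip:STRING>   -- STRUCT: object
    )
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '|'
        MAP KEYS TERMINATED BY ':'
    LOCATION '/user/hive/external/business_demo';

    -- Accessing the complex types.
    SELECT name, hours['Monday'], categories[0], addr.city
    FROM   business_demo;

    -- Internal (managed) table: DROP TABLE deletes both metadata and data.
    CREATE TABLE business_demo_managed (
        business_id STRING,
        name        STRING,
        stars       DOUBLE
    );
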
YELP Dataset Overview

  • The Challenge Dataset: 2.7M reviews and 649K tips by users for 86K businesses; 566K business attributes (e.g. hours, parking availability, ambience); a social network of 687K users for a total of 4.2M social edges; aggregated check-ins over time for each of the 86K businesses; 200,000 pictures from the included businesses.

  • Interesting areas of research - cultural trends, location mining and urban planning, seasonal trends, inferring categories, natural language processing (NLP), changepoints and events, social graph mining.

  • Datasets - domain, business, review, user, check-in, tip, photo.

  • The user and review datasets are the ones considered for this session.

YELP Data modeling - Table creations

  • Structure of user.json
  • Structure of review.json (a hedged DDL sketch for both tables follows below)

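The actual DDL lives in yelp_hive_queries.hql; the statements below are only a hedged sketch of what it looks like. The column lists are abbreviated from the Yelp JSON schemas, and the use of the HCatalog JsonSerDe and the HDFS paths are assumptions.

    -- Sketch only; the authoritative statements are in yelp_hive_queries.hql.
    -- Assumes user.json and review.json were copied to the HDFS paths below.
    CREATE EXTERNAL TABLE yelp_user (
        user_id       STRING,
        name          STRING,
        review_count  INT,
        yelping_since STRING,
        average_stars DOUBLE
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/hive/yelp/user';

    CREATE EXTERNAL TABLE yelp_review (
        review_id   STRING,
        user_id     STRING,
        business_id STRING,
        stars       DOUBLE,
        `date`      STRING,
        text        STRING,
        useful      INT,
        funny       INT,
        cool        INT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/hive/yelp/review';
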
Basic Queries in Hive and few UDF explanation

  • Query to give the total number of records, the total number of unique users, and the min and avg star values
  • Finding the average number of reviews each reviewer has given
  • Average stars given by each reviewer
  • Join statement - joining the user and review tables (query sketches follow below)

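Hedged sketches of the queries described above, written against the yelp_user / yelp_review tables assumed earlier:

    -- Total records, unique users, min and avg star values,
    -- and the average number of reviews per reviewer.
    SELECT COUNT(*)                           AS total_reviews,
           COUNT(DISTINCT user_id)            AS unique_users,
           MIN(stars)                         AS min_stars,
           AVG(stars)                         AS avg_stars,
           COUNT(*) / COUNT(DISTINCT user_id) AS avg_reviews_per_user
    FROM   yelp_review;

    -- Average stars given by each reviewer.
    SELECT user_id,
           AVG(stars) AS avg_stars_given
    FROM   yelp_review
    GROUP  BY user_id;

    -- Join statement: joining the user and review tables.
    SELECT u.name, r.stars, r.`date`
    FROM   yelp_user   u
    JOIN   yelp_review r ON u.user_id = r.user_id;
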
Hive Table Partitioning

  • Partitions are used to make queries faster by dividing the tables into smaller parts using partition key columns.
  • Types of Partitioning - Static and Dynamic Partitioning
  • STATIC PARTITIONING - input data files are inserted individually into a named partition of the table.
  • DYNAMIC PARTITIONING - a single insert loads the partitioned table and Hive derives the partition value for each row; dynamic partitioning usually loads the data from a non-partitioned table.
  • Property set - set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; (a sketch of both flavours follows below)

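A hedged sketch of both flavours, assuming a review table partitioned by year; the partition column and the substr() on the date string are illustrative assumptions, not the session's actual scheme.

    -- Partitioned target table (illustrative).
    CREATE TABLE review_part (
        review_id STRING,
        user_id   STRING,
        stars     DOUBLE
    )
    PARTITIONED BY (yr STRING);

    -- Static partitioning: the partition value is spelled out in each insert.
    INSERT OVERWRITE TABLE review_part PARTITION (yr = '2017')
    SELECT review_id, user_id, stars
    FROM   yelp_review
    WHERE  substr(`date`, 1, 4) = '2017';

    -- Dynamic partitioning: a single insert; Hive derives yr per row from the
    -- last column of the SELECT, reading from the non-partitioned table.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE review_part PARTITION (yr)
    SELECT review_id, user_id, stars, substr(`date`, 1, 4) AS yr
    FROM   yelp_review;
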
Bucketing in Hive tables

  • Bucketing also divides the data into smaller, more manageable parts, based on the hash value of a chosen column.
  • Property set - set hive.enforce.bucketing=true; (a sketch follows below)

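A minimal sketch, assuming the yelp_user table from earlier; the bucketing column and the bucket count are arbitrary choices for illustration.

    SET hive.enforce.bucketing=true;

    -- Rows are assigned to one of 8 buckets by hash(user_id) % 8.
    CREATE TABLE user_bucketed (
        user_id      STRING,
        name         STRING,
        review_count INT
    )
    CLUSTERED BY (user_id) INTO 8 BUCKETS;

    INSERT OVERWRITE TABLE user_bucketed
    SELECT user_id, name, review_count
    FROM   yelp_user;
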
File formats in Hive

  • ORC (Optimized Row Columnar) - good compression ratio; avoids unwanted data seeks by storing data in stripes (collections of column data). Uses the concept of column-major order; contents are stored in binary form.

  • Syntax - stored as orc
             location '<<Location of the file(table) to be created>>'
             tblproperties("orc.compress"="SNAPPY")

  • Loading the table into an Amazon S3 bucket (the bucket should already be created, and the ORC file should NOT already be present).

  • Parquet - built for complex data types; all the metadata is written at the end of the Parquet file, which separates the metadata from the column data.

  • Syntax - stored as parquet
             location '<<Location of the file(table) to be created>>'
             tblproperties("parquet.compression"="SNAPPY")
             (full CREATE TABLE sketches for both formats follow below)

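Full CREATE TABLE sketches following the syntax notes above. The table names, the column subset and the S3 locations are assumptions; replace the bucket placeholder with a bucket that already exists.

    -- ORC table with SNAPPY compression (illustrative location).
    CREATE TABLE review_orc (
        review_id STRING,
        user_id   STRING,
        stars     DOUBLE
    )
    STORED AS ORC
    LOCATION 's3://<<your-bucket>>/yelp/review_orc/'
    TBLPROPERTIES ("orc.compress" = "SNAPPY");

    INSERT OVERWRITE TABLE review_orc
    SELECT review_id, user_id, stars FROM yelp_review;

    -- Parquet table with SNAPPY compression (illustrative location).
    CREATE TABLE review_parquet (
        review_id STRING,
        user_id   STRING,
        stars     DOUBLE
    )
    STORED AS PARQUET
    LOCATION 's3://<<your-bucket>>/yelp/review_parquet/'
    TBLPROPERTIES ("parquet.compression" = "SNAPPY");

    INSERT OVERWRITE TABLE review_parquet
    SELECT review_id, user_id, stars FROM yelp_review;
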
COMPLEX QUERIES - for statistical analysis

  • UDF - user-defined functions: one row in, one value out (examples: date objects, timestamp to date)
  • UDAF - user-defined aggregate functions: a bunch of rows as input, one value as result
  • UDTF - user-defined table-generating functions: take one row as input and give multiple rows as output
  • A few examples: explode, ngrams, sentences, lower, size, etc. (sketches follow below)

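Short hedged examples of the three function classes, again against the assumed yelp_review table:

    -- UDF: one row in, one value out (timestamp/date handling, lowercasing).
    SELECT to_date(`date`), lower(text)
    FROM   yelp_review
    LIMIT  5;

    -- UDAF: many rows in, one value out (top 10 bigrams across all review text).
    SELECT ngrams(sentences(lower(text)), 2, 10)
    FROM   yelp_review;

    -- UDTF: one row in, many rows out (explode the words of the first sentence).
    SELECT word
    FROM   yelp_review
    LATERAL VIEW explode(sentences(lower(text))[0]) w AS word
    LIMIT  20;
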
Code Description

File Name: yelp_hive_queries.hql
File Description: This Hive file contains the table creation and query statements.
                  Execute the queries in the same order.

Steps to Run

There are two ways to get to the Hive prompt:

  • From Windows OS or Linux systems
  • In AWS, create a (Hadoop, Hive) cluster on EMR with 3 m4.xlarge instances.
  • On Windows, use PuTTY to connect to the command-line interface and WinSCP to transfer files from the local machine to HDFS.
  • Use the public IP allocated to the cluster to connect with PuTTY and WinSCP.

Useful Reference Links

Subqueries in SELECT clause help document:

Subqueries in FROM clause help document:

Joins

Built-in functions (UDF, UDAF, UDTF)