Big Data is the collection of huge datasets of semi-structured and unstructured data, generated by the high-performance heterogeneous group of devices ranging from social networks to scientific computing applications.
- Evolution of Data - Gb to Pb to Zb
- Sources of Big Data - Traditional Database's,Social medium,Cloud,IoT
- Characteristics of Big Data - Volume,Velocity,Variety,Veracity
- Some industry applications of big data - Retail, Financial Services, Healthcare, Manufacturing
- Types of data - structured,semi-structured,unstructured.
- Job Roles in Big Data Domain - Big data consultant,Architect ,Big data engineers,Data Analyst
- Hadoop - Hadoop is an open source,Java based programming framework that manages data processing and storage for big data applications in a distributed computing system.
- Features of Hadoop - Flexibility,Scalability,Fault-proof,Cost-effective
- Hadoop vs RDBMS - Hadoop doesn't support real time data processing (OLTP),
-
it is designed to support large scale batch processing work loads (OLAP) where as RDBMS supports real-time data processing.
- Hadoop has two main components HDFS and the YARN.
- NAME NODE
- DATA NODE
- Secondary NameNode
- STANDBY Namenode
- YARN also follows the Master-Slave architecture.
- Master - ResourceManager
- Slave - NodeManager
- REPLICATION IN HDFS - In HDFS,Default replication factor is 3
- SPLITTING OF DATA - In HDFS,Default block size is 128MB.
- File Operations - Reading from in file , Write to a file
- MAP - Data parallel model is used in MAP phase.
- Reduce - Inverse-tree parallel model for aggregating results in Reduce phase
-
Hive is a data warehouse software
-
Hive vs RDBMS - Hive - Write once, Read many times ; RDBMS - Read and Write many times.
-
Uses of Hive-Metastore - It stores all the information about Hive tables, in a central repository.
-
HIVE Data Types - Primitive Data type , Complex Data type
-
Primitive Data type - String,Int,float,double,Timestamp,Binary
-
Complex Data type – STRUCT(Object),MAP(Key-Value),ARRAY(Indices)
-
Internal vs External tables
-
External table - External tables is used when data to remain stored on the HDFS even after dropping tables
-
because Hive does not delete the data stored outside (of the Hive database).
-
Internal table - In internal table, the data is temporary. So, when the Hive table is dropped, the data stored
-
in the internal table is deleted along with the metadata.
-
The Challange Dataset: 2.7M reviews and 649K tips by users for 86K business,566K business attributes,,
-
(e.g) hours,parking,availability,ambience. Social network of 687K users for a
-
total of 4.2M social edges Agregated check-ins over time for each of the 86K businesses
-
200,000 pictures from the included businesses
-
Intrested areas of research - Cultural Trends,Location Mining and Urban panning,Seasonal Trends
-
Infer Categories,Natural language ProcessingNLP,Changepoints and Events,Social Graph Mining
-
Dataset's - Domain,Business,Review,User,check-in,tip,photo
-
User and Review dataset considered for this session.
- Structure of user.Json
- Structure of review.Json
- Query to give the total number of records, total number of unique user, min and avg star values
- Finding avg review each reviwer given
- Average stars given by reviewer
- Join Statement - Joining user and review table's
- Partitions are used to make queries faster by dividing the tables into smaller parts using partition key columns.
- Types of Partitioning - Static and Dynamic Partitioning
- STATIC PARTIONING - Insert input data files individually into a partition table is Static Partition.
- DYNAMIC PARTIONING - Single insert to partition table is known as dynamic partition.
-
Usually dynamic partition load the data from non partitioned table
- Property set - set hive.exec.dynamic.partition=true;
-
set hive.exec.dynamic.partition.mode = nonstrict;
- Bucketing also divides the data into smaller and more manageable parts based on hash value of a column value.
- Property set - set hive.enforce.bucketing=true;
-
ORC - Optimized Row Columnar - good compression ratio, avoid unwanted data seek using strips((collection of column data) to store data.
-
Uses the concept of column major order. Contents are stored as Binary.
-
Syntax - stored as orc
-
location '<<Location of the file(table) to be created>>'
-
tblproperties("orc.compress"="SNAPPY")
-
Loading table in Amazon S3 Bucket (Bucket should be created, ORC file name should NOT be present already)
-
Parquet - Built for complecx data types,All the meatadata written at the end of parquet file.
-
Seperates meta data from column data.
-
Syntax - stored as parquet
-
location '<<Location of the file(table) to be created>>'
-
tblproperties("parquet.compress"="SNAPPY")
- UDF - User Defined Functions (example : Date objects ,timestamp to date)
- UDAF - aggregate function - bunch of rows as input and one value result
- UDTF - User Defined table generating function - take one row as input and give multiple outputs
- Few examples are : explode,ngrams,sentences,lower,size etc
File Name : yelp_hive_queries.hql
File Description : This Hive file contains the table creation and query statements.
Execute the queries in the same order.
- From Windows OS or Linux systems
- In AWS, Create (Hadoop,Hive) Cluster with capacity of m4.xLarge on EMR with 3 instances.
- In case of windows , use Putty to connect to Command line Interface and use WinScp to transfer files from Local Machine to HDFS.
- Uses the public IP allocated to connect using Putty and WinScp.