Merge pull request linkedin#20 from linkedin/intro_bigdata_cleanup
Fixed typos in intro & big_data
kalyanceg committed Nov 25, 2020
2 parents f54baf4 + 41f7512 commit c494b08
Showing 4 changed files with 39 additions and 40 deletions.
28 changes: 14 additions & 14 deletions courses/big_data/evolution.md
@@ -5,14 +5,14 @@
# Architecture of Hadoop

1. **HDFS**
1. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
2. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
3. HDFS is part of the [Apache Hadoop Core project](https://github.com/apache/hadoop).

![HDFS Architecture](images/hdfs_architecture.png)

1. NameNode: is the arbitrator and central repository of file namespace in the cluster. The NameNode executes the operations such as opening, closing, and renaming files and directories.
2. DataNode: manages the storage attached to the node on which it runs. It is responsible for serving all the read and write requests. It performs operations on instructions from the NameNode, such as block creation, deletion, and replication.
3. Client: responsible for getting the required metadata from the NameNode and then communicating with the DataNodes for reads and writes (a small command-line sketch of client interaction with HDFS and YARN follows this architecture section). </br></br></br>

2. **YARN**
@@ -25,29 +25,29 @@
2. Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:
3. Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, which means that it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as Capacity Scheduler and Fair Scheduler to partition the cluster resources.
4. Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.
5. Node Manager: It takes care of individual nodes in the Hadoop cluster and manages applications and workflows on that particular node. Its primary job is to keep up with the Resource Manager. It monitors resource usage, performs log management, and kills containers based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
6. Application Master: An application is a single job submitted to a framework. The Application Master is responsible for negotiating resources with the Resource Manager, tracking the status, and monitoring the progress of a single application. The Application Master asks the Node Manager to launch the container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends a health report to the Resource Manager from time to time.
7. Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are launched with a Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, dependencies, etc. </br></br>
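To make these roles a bit more concrete, here is a minimal sketch of how a client typically drives HDFS and YARN from the command line, wrapped in Python. It is an illustration only: it assumes a configured Hadoop client with `hdfs` and `yarn` on the PATH, and the directory and file names are made up.

```python
import subprocess

def run(cmd):
    # Run a Hadoop CLI command and return its output. Under the hood the CLI
    # talks to the NameNode for metadata and to DataNodes for the block data.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# HDFS: namespace operations go to the NameNode; the actual bytes are
# streamed to/from DataNodes. All of that is hidden behind `hdfs dfs`.
run(["hdfs", "dfs", "-mkdir", "-p", "/user/demo"])        # metadata operation
run(["hdfs", "dfs", "-put", "words.txt", "/user/demo/"])  # data write
print(run(["hdfs", "dfs", "-ls", "/user/demo"]))          # metadata read

# YARN: list the applications known to the Resource Manager; each one is
# driven by its own Application Master running in a container on some node.
print(run(["yarn", "application", "-list"]))
```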


# MapReduce framework

![MapReduce Framework](images/map_reduce.jpg)

1. The term MapReduce represents two separate and distinct tasks that Hadoop programs perform: the Map job and the Reduce job. Map jobs take data sets as input and process them to produce key-value pairs. The Reduce job takes the output of the Map job, i.e. the key-value pairs, and aggregates them to produce the desired result.
2. Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. MapReduce splits the input data set into a number of parts and runs a program on all the parts in parallel.
3. The word-count example below demonstrates the usage of the MapReduce framework; a minimal runnable sketch follows the figure:

![Word Count Example](images/mapreduce_example.jpg)
</br></br>
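To make the phases in the figure concrete, here is a small, self-contained Python sketch that simulates word count locally. The input lines are illustrative; in a real Hadoop job the grouping (shuffle/sort) between the map and reduce phases is performed by the framework across the cluster, not by an in-process dictionary.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.strip().split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: aggregate all counts emitted for the same key.
    return word, sum(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]  # illustrative input

# Shuffle/sort phase: group intermediate pairs by key (done by Hadoop itself).
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

results = dict(reducer(word, counts) for word, counts in grouped.items())
print(results)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```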

# Other tooling around Hadoop

1. [**Hive**](https://hive.apache.org/)
1. Uses a language called HQL, which is very similar to SQL. It gives non-programmers the ability to query and analyze data in Hadoop, and is basically an abstraction layer on top of MapReduce.
2. Example HQL query:
1. _SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);_
3. The equivalent in MySQL:
1. _SELECT pet.name, comment FROM pet, event WHERE pet.name = event.name;_
2. [**Pig**](https://pig.apache.org/)
1. Uses a scripting language called Pig Latin, which is more workflow-driven. You don't need to be an expert Java programmer, but you do need a few coding skills. It is also an abstraction layer on top of MapReduce.
@@ -66,7 +66,7 @@
3. [**Spark**](https://spark.apache.org/)
1. Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster’s memory and query it repeatedly, making it well suited to machine learning algorithms.
4. [**Presto**](https://prestodb.io/)
1. Presto is a high-performance, distributed SQL query engine for Big Data.
2. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, and MongoDB.
3. Example presto query:
```mysql
@@ -80,4 +80,4 @@

1. In order to transport data over the network or to store it on persistent storage, we use the process of translating data structures or object state into binary or textual form. We call this process serialization.
2. Avro data is stored in a container file (a .avro file) and its schema (the .avsc file) is stored with the data file.
3. Apache Hive provides support to store a table as Avro and can also query data in this serialization format (a small write/read sketch follows).
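A minimal write/read sketch, assuming the third-party `fastavro` package is installed; the schema, records, and file name are illustrative and not part of the course material.

```python
from fastavro import parse_schema, reader, writer

# Illustrative schema; in practice it would live in a separate .avsc file.
schema = parse_schema({
    "namespace": "example.avro",
    "type": "record",
    "name": "TrafficEvent",
    "fields": [
        {"name": "signal_id", "type": "string"},
        {"name": "state", "type": "string"},
    ],
})

records = [
    {"signal_id": "sig-001", "state": "green"},
    {"signal_id": "sig-002", "state": "red"},
]

# The container file (.avro) stores the schema alongside the data.
with open("events.avro", "wb") as out:
    writer(out, schema, records)

with open("events.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```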
9 changes: 4 additions & 5 deletions courses/big_data/intro.md
@@ -7,7 +7,7 @@

## What to expect from this course

This course covers the basics of Big Data and how it has evolved to become what it is today. We will take a look at a few realistic scenarios where Big Data would be a perfect fit. An interesting assignment on designing a Big Data system is followed by understanding the architecture of Hadoop and the tooling around it.

## What is not covered under this course

@@ -32,7 +32,7 @@ Writing programs to draw analytics from data.

# Overview of Big Data

1. Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool; rather, it has become a complete subject involving various tools, techniques, and frameworks.
2. Big Data could consist of
1. Structured data
2. Unstructured data
@@ -50,9 +50,8 @@ Writing programs to draw analytics from data.
1. Take the example of the traffic lights problem.
1. There are more than 300,000 traffic lights in the US as of 2018.
2. Let us assume that we placed a device on each of them to collect metrics and send it to a central metrics collection system.
3. If each of the IoT devices sends 10 events per minute, we have 300000x10x60x24 = 432x10^7 events per day.
4. How would you go about processing that and telling me how many of the signals were “green” at 10:45 am on a particular day?
2. Consider the next example on Unified Payments Interface (UPI) transactions:
1. We had about 1.15 billion UPI transactions in the month of October 2019 in India.
2. If we try to extrapolate this data to about a year and try to find out some common payments that were happening through a particular UPI ID, how do you suggest we go about that? (A quick back-of-the-envelope calculation follows this list.)
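A quick back-of-the-envelope check of the numbers above, as a small Python sketch (the yearly UPI figure assumes October 2019 volumes are typical):

```python
traffic_lights = 300_000   # approximate number of US traffic lights (2018)
events_per_minute = 10     # events emitted by each IoT device

events_per_day = traffic_lights * events_per_minute * 60 * 24
print(f"{events_per_day:,} events/day")  # 4,320,000,000 = 432 x 10^7

# Extrapolating the UPI example to a full year.
upi_per_month = 1_150_000_000
print(f"{upi_per_month * 12:,} UPI transactions/year")  # 13,800,000,000
```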

6 changes: 3 additions & 3 deletions courses/big_data/tasks.md
@@ -1,8 +1,8 @@
# Tasks and conclusion

## Post-training tasks:

1. Try setting up your own 3-node Hadoop cluster.
1. A VM-based solution can be found [here](http://hortonworks.com/wp-content/uploads/2015/04/Import_on_VBox_4_07_2015.pdf)
2. Write a simple Spark/MR job of your choice and understand how to generate analytics from data.
1. Sample dataset can be found [here](https://grouplens.org/datasets/movielens/)
@@ -11,4 +11,4 @@
1. [Hadoop documentation](http://hadoop.apache.org/docs/current/)
2. [HDFS Architecture](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html)
3. [YARN Architecture](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
4. [Google GFS paper](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/035fc972c796d33122033a0614bc94cff1527999.pdf)
36 changes: 18 additions & 18 deletions courses/index.md
@@ -2,26 +2,26 @@

<img src="img/sos.png" width=200 >

In early 2019, we started visiting campuses to recruit the brightest minds to ensure that LinkedIn, and all the services that it is composed of, are always available for everyone. This function at LinkedIn falls within the purview of the Site Reliability Engineering team. Site Reliability Engineers (SREs) are software engineers who specialize in reliability; they apply the principles of computer science and engineering to the design and development of computer systems, generally large distributed ones.

As we continued on this journey, we started getting a lot of questions from these campuses about what exactly the site engineering role entails and how someone could learn the skills and disciplines involved to become a successful site engineer. Fast forward a few months: some of these campus students joined LinkedIn, either as interns or as full-time engineers, to become part of the Site Engineering team, and we also had a few lateral hires who joined our organization from non-traditional SRE backgrounds. That's when a few of us got together and started to think about how we could onboard new graduate engineers to the site engineering team.

There is a vast amount of resources scattered throughout the web on what the roles and responsibilities of SREs are, how to monitor site health, how to handle incidents, how to maintain SLOs/SLIs, etc. But there are very few resources guiding someone on the basic skill sets one has to acquire as a beginner. Because of the lack of these resources, we felt that individuals have a tough time getting into open positions in the industry. We created the School of SRE as a starting point for anyone wanting to build their career in the role of SRE.

In this course we are focusing on building strong foundational skills. The course is structured to provide more real-life examples and to show how learning each of the topics can play a bigger role in your day-to-day SRE life. Currently, we are covering the following topics under the School of SRE:

- Fundamentals Series
- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/)
- [Git](https://linkedin.github.io/school-of-sre/git/git-basics/)
- [Linux Networking](https://linkedin.github.io/school-of-sre/linux_networking/intro/)
- [Python and Web](https://linkedin.github.io/school-of-sre/python_web/intro/)
- Data
- Relational databases (MySQL)
- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Security](https://linkedin.github.io/school-of-sre/security/intro/)

We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets. Every module has added references which can guide further learning. Our hope is that by going through these modules you will be able to build the essential skills required of a Site Reliability Engineer.

At LinkedIn, we are using this curriculum to onboard our non-traditional hires and new college grads to the SRE role. We have had multiple rounds of successful onboarding with the new members and helped them become productive in a very short period of time. This motivated us to open-source this content to help other organizations onboard new engineers to the role and to help individuals get into the role. We realize that the initial content we created is just a starting point, and we hope that the community can help in the journey of refining and extending the contents.
