Hadoop-In-Action-Introductory-Patent-Dataset-Analysis

A basic introductory example of hadoops mapreduce libraries to load and analyse large datasets in this case a US patent dataset sourced from cite75_99.txt , This is an implementation and extension of Ideas from the book Hadoop in Action by Chuck Lam.

Idea

While small datasets can be handelled by python scripts directly , large datasets ( in the order of 50+TB ) cannot be directly loaded in to memory and processed , MapReduce algorithms were initially developed with this problem in mind . Hadoop abstracts out the details of the distributed systems and takes complete care of handling them . This project is a simple example where we load a 250 MB odd text file , to which we apply a basic map reduce function to come up with a frequency distribution plot for the given patent dataset .

After extraction of the frequency distribution data , using this small python snippet we have obtained the following output.

CODE

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("data",sep ='\t',header = None)
plt.yscale('log')
plt.xscale('log')
plt.xlabel("Number of Citations")
plt.ylabel("Number of Patents")

plt.plot(data[1],marker = '.',color='darkslateblue')
plt.show()

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
finalOutput		finalOutput
src/main/java		src/main/java
.gitignore		.gitignore
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hadoop-In-Action-Introductory-Patent-Dataset-Analysis

Idea

CODE

OUTPUT

About

Releases

Packages

Languages

jayantakumar/Hadoop-In-Action-Introductory-Patent-Dataset-Analysis

Folders and files

Latest commit

History

Repository files navigation

Hadoop-In-Action-Introductory-Patent-Dataset-Analysis

Idea

CODE

OUTPUT

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages