
spider.io

A Python/Amazon EMR based crawler for fetching and analyzing large lists of URLs.

Prerequisites

You should have the s3cmd and elastic-mapreduce command-line utilities installed on your box and configured to work with your Amazon AWS account.

Pick a name for the S3 bucket to which the input data will be uploaded and in which the results will be stored.
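For reference, a typical one-time setup might look like this; the bucket name is a placeholder, and the exact prompts depend on your s3cmd and elastic-mapreduce versions:

```bash
# Interactive setup: prompts for your AWS access key and secret
s3cmd --configure

# Sanity check that the EMR CLI can reach your account
# (it reads credentials from its credentials.json)
elastic-mapreduce --list

# Create the bucket that will hold the input data and the results
# ("spider-io-example" is a placeholder; pick your own name)
s3cmd mb s3://spider-io-example
```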

Preparing data

Download the archive of the 1M most popular sites from Alexa and put it into the 'data' folder without any modifications.
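Historically this archive was published as top-1m.csv.zip; assuming that address is still served, the download amounts to:

```bash
mkdir -p data
# Alexa used to publish the list at this URL; adjust it if the
# archive has moved
wget -P data http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
```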

Take the 'bugs.js' file (you know where to get it) and put it into the same 'data' folder.

Run 1-prepare.sh. The script will prepare the data and upload it to the bucket.
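After the script finishes, you can sanity-check the upload with s3cmd (using the placeholder bucket name from the prerequisites):

```bash
# Optional: list what 1-prepare.sh uploaded to the bucket
s3cmd ls --recursive s3://spider-io-example/
```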

Running Crawler

Run 2-run-job-flow.sh. The script starts the crawl as an Amazon EMR job flow.
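The script wraps the elastic-mapreduce CLI; a hand-rolled streaming job flow would look roughly like the sketch below. The paths, mapper/reducer names, and instance settings are illustrative guesses, not the script's actual arguments:

```bash
# Hypothetical streaming job flow; all S3 paths and script names
# below are placeholders, not taken from 2-run-job-flow.sh
elastic-mapreduce --create --stream \
  --name "spider.io crawl" \
  --input s3://spider-io-example/input \
  --output s3://spider-io-example/output \
  --mapper s3://spider-io-example/code/mapper.py \
  --reducer s3://spider-io-example/code/reducer.py \
  --num-instances 4 \
  --instance-type m1.small
```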

Getting Results

Run 3-get-results.sh. The script will download the result files into the 'output' folder.
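Under the hood this amounts to pulling the job's output prefix from S3; a manual equivalent with the placeholder bucket would be:

```bash
mkdir -p output
# EMR streaming writes its results under the job's output prefix
s3cmd sync s3://spider-io-example/output/ output/
```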

Last Step

Run 4-analyze.py. The script takes the files from the 'output' folder, joins them into a single 'result.json' file in the current directory, and prints some summary statistics at the end.
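Since EMR streaming output arrives as plain part-* files, the join step is conceptually a concatenation; a rough manual stand-in for it (which skips the statistics that 4-analyze.py prints) might be:

```bash
# Assumes the downloaded files follow Hadoop's part-* naming;
# 4-analyze.py is the authoritative version of this step
cat output/part-* > result.json
```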
