pyCrawler

Threaded Python web crawler with a MongoDB backend.

The file crawlerLinks.py grabs a random link from a given collection, crawls that page for new HTTP links, and inserts each link that is not already in the collection as a new document.
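
A minimal sketch of that step, assuming a local MongoDB instance and documents shaped like {"url": "<http link>"}; the database, collection, and field names here are illustrative, not taken from the repository:

```python
# Sketch only: grab one random link, fetch it, and insert any unseen links.
import re
import urllib.request

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
links = client["crawler"]["links"]  # assumed database/collection names

def crawl_once():
    # Pick one random link document from the collection.
    doc = next(links.aggregate([{"$sample": {"size": 1}}]), None)
    if doc is None:
        return
    try:
        with urllib.request.urlopen(doc["url"], timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return
    # Find http(s) links on the page and insert only the ones not seen before.
    for url in set(re.findall(r'href="(https?://[^"]+)"', html)):
        if links.count_documents({"url": url}, limit=1) == 0:
            links.insert_one({"url": url})

if __name__ == "__main__":
    crawl_once()
```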

The other file, crawlerMeta.py, grabs a random link from a collection and extracts some metadata from the page using Beautiful Soup 4. The metadata is saved together with the crawled link. The number of threads used can easily be changed in the scripts.
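
A sketch of the metadata step under the same assumptions as above; the NUM_THREADS constant and the extracted fields are illustrative, showing the kind of page metadata Beautiful Soup 4 makes easy to pull out:

```python
# Sketch only: extract <title> and meta description, store them on the link.
import threading
import urllib.request

from bs4 import BeautifulSoup
from pymongo import MongoClient

NUM_THREADS = 4  # adjust the number of worker threads here

links = MongoClient("mongodb://localhost:27017/")["crawler"]["links"]

def worker():
    # Pick one random link document and fetch the page.
    doc = next(links.aggregate([{"$sample": {"size": 1}}]), None)
    if doc is None:
        return
    try:
        with urllib.request.urlopen(doc["url"], timeout=10) as resp:
            soup = BeautifulSoup(resp.read(), "html.parser")
    except Exception:
        return
    description = soup.find("meta", attrs={"name": "description"})
    meta = {
        "title": soup.title.string if soup.title else None,
        "description": description.get("content") if description else None,
    }
    # Save the extracted metadata on the same document as the crawled link.
    links.update_one({"_id": doc["_id"]}, {"$set": {"meta": meta}})

if __name__ == "__main__":
    threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```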
