lpradovera/anemone

Anemone

Anemone is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

See anemone.rubyforge.org for more information.

Features

  • Multi-threaded design for high performance

  • Tracks 301 HTTP redirects to understand a page’s aliases

  • Built-in BFS algorithm for determining page depth

  • Allows exclusion of URLs based on regular expressions

  • Choose the links to follow on each page with focus_crawl()

  • HTTPS support

  • Records response time for each page

  • CLI program can list all pages in a domain, calculate page depths, and more

  • Obeys robots.txt

  • In-memory or persistent storage of pages during crawl, using TokyoCabinet or PStore
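
To illustrate two of the features above (breadth-first page depth and regular-expression URL exclusion), here is a minimal standalone sketch in plain Ruby. It is not Anemone's own code and does not use Anemone's API: the `LINKS` graph, the `page_depths` method, and its `skip:` parameter are hypothetical names invented for this example, and the link graph is hard-coded rather than fetched over HTTP.

```ruby
# Hypothetical in-memory link graph standing in for crawled pages.
# Each key is a URL path; each value lists the links found on that page.
LINKS = {
  '/'            => ['/about', '/blog', '/admin'],
  '/about'       => ['/', '/team'],
  '/blog'        => ['/blog/post-1', '/about'],
  '/blog/post-1' => ['/'],
  '/team'        => [],
  '/admin'       => ['/admin/secret']
}.freeze

# Breadth-first traversal from `start`. Returns a hash mapping each
# reachable URL to its depth (minimum number of link hops from `start`).
# URLs matching any pattern in `skip` are never visited.
def page_depths(links, start, skip: [])
  depths = { start => 0 }
  queue = [start]
  until queue.empty?
    url = queue.shift
    links.fetch(url, []).each do |link|
      next if depths.key?(link)                           # already seen
      next if skip.any? { |pattern| pattern.match?(link) } # excluded
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end

# Exclude everything under /admin (assumed pattern for illustration).
DEPTHS = page_depths(LINKS, '/', skip: [/\A\/admin/])
DEPTHS.each { |url, depth| puts "#{depth}  #{url}" }
```

Because the queue is processed first-in, first-out, a page is always assigned the depth of the shortest path that reaches it, which is why breadth-first search suits depth calculation. Excluding a URL also prevents the crawl from reaching pages that are linked only from it.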

Examples

See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.

Requirements

  • nokogiri

  • robots

Development

To test and develop this gem, additional requirements are:

  • rspec

  • fakeweb

  • tokyocabinet

The tokyocabinet gem requires the Tokyo Cabinet library to be installed on your system.
