A very simple web scraper chassis. Claw takes a web page (or list of pages), scrapes some info from those pages, then dumps the results to a JSON or CSV file.
Accepts parameters
- a page url, or array of URLs
- a selection to scrape
- fields to pull out from within that section
- an output folder
- number of seconds to delay
var claw = require('claw');
var claw_config = {
pages : [
'http://www.bing.com/search?q=hello',
'http://www.bing.com/search?q=goodbye'
],
selector : 'h3',
fields : {
"text" : "$(sel).find('a').text()",
"href" : "$(sel).find('a').attr('href')"
},
delay : 3
}
claw.init(claw_config);
Each page gets saved to a separate output file.
Here's what the exported JSON looks like:
[
{
"text": "HELLO! Online: celeb & royal news, magazine, babies, weddings",
"href": "http://www.hellomagazine.com/"
},
{
"text": "Hello - Wikipedia, the free encyclopedia",
"href": "http://en.wikipedia.org/wiki/Hello"
},
{
"text": "Hello | Define Hello at Dictionary.com",
"href": "http://dictionary.reference.com/browse/hello"
}
]
and here's the CSV:
text,href
"HELLO! Online: celeb & royal news, magazine, babies, weddings, É","http://www.hellomagazine.com/"
"Hello - Wikipedia, the free encyclopedia","http://en.wikipedia.org/wiki/Hello"
"Hello | Define Hello at Dictionary.com","http://dictionary.reference.com/browse/hello"
Claw can import its configuration from a JSON file:
claw('sample1.json');
The file looks like this:
{
"pages" : [
"http://www.bing.com/search?q=hello",
"http://www.bing.com/search?q=goodbye"
],
"selector" : "h3",
"fields" : {
"href" : "$(sel).find('a').attr('href')"
},
"delay" : 5
}
You can also use claw from the command line. First, install it globally:
npm install -g claw
Then run it in the same folder as your config file:
claw sample1.json
This will create a folder called sample1, with your results.
Claw can grab its page list from a JSON file that is a list of urls (or an object with .href properties). Instead of an array, just set "pages" to a file name and path.
claw_config.pages = "pages.json";
Questions? Ideas? Hit me up on twitter - @dylanized