Skrap

Easily scrape web pages by providing JSON recipes

This project is maintained by nickdima


Skrap is a command-line utility and node.js module for easily scraping web pages by providing JSON recipes.

Getting Started

Install the module with: npm install skrap

Use it from the command line

skrap recipe.json param1=value param2=value ... [options]
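The param=value arguments on the command line appear to play the same role as the parameter object passed to the node.js API. As a rough illustration only, such arguments could be turned into an object like this. The parseParams helper is hypothetical and is not taken from skrap's source; the real CLI may parse its arguments differently.

```javascript
// Hypothetical helper, not part of skrap: turns ["movie=spider-man", ...]
// into { movie: "spider-man", ... }.
function parseParams(args) {
  var params = {};
  args.forEach(function (arg) {
    var eq = arg.indexOf('=');
    if (eq > 0) {
      // Split on the first "=" only, so values may contain "=" themselves.
      params[arg.slice(0, eq)] = arg.slice(eq + 1);
    }
  });
  return params;
}

var params = parseParams(['movie=spider-man', 'lang=en']);
console.log(params);
```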

Use it in node.js

var skrap = require('skrap');
var recipePath = "./recipe.json";

skrap(recipePath, {param: 'value'}, function(data) {
    console.log(data);
});

Documentation

Create a recipe

A recipe is just a JSON file that contains rules for scraping a web page. Here's a simple example:

{
    "url" : "http://www.imdb.com/find?q=${movie}&s=tt&ttype=ft&ref_=fn_ft",
    "collections" : [{
        "name" : "movies",
        "query": "$('table.findList tr')",
        "fields": {
            "title" : "find('td.result_text a').text()",
            "year" : "find('td.result_text').text().match(/\\((\\d{4})\\)/)[1]",
            "poster" : "find('td.primary_photo img').attr('src')"
        }
    }]
}
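The ${movie} placeholder in the url is filled in from the parameter of the same name. The following is a minimal sketch of that kind of substitution, assuming a plain name-for-value replacement. The fillUrl helper is hypothetical and the URL encoding step is an assumption; skrap's own implementation may differ.

```javascript
// Hypothetical helper, not part of skrap's API: replaces each ${name}
// placeholder with the matching parameter value.
function fillUrl(template, params) {
  return template.replace(/\$\{(\w+)\}/g, function (match, name) {
    // Leave unknown placeholders untouched so missing params are easy to spot.
    return Object.prototype.hasOwnProperty.call(params, name)
      ? encodeURIComponent(params[name])
      : match;
  });
}

var url = fillUrl(
  'http://www.imdb.com/find?q=${movie}&s=tt&ttype=ft&ref_=fn_ft',
  { movie: 'spider-man' }
);
console.log(url);
```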

The recipe uses CSS selectors to target the pieces of data that need to be scraped. Skrap depends on the cheerio node.js module for querying the DOM. Cheerio's selector implementation is nearly identical to jQuery's, so the API is very similar.
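The year field in the recipe doubles every backslash because the expression is stored inside a JSON string. The sketch below uses only plain JavaScript (no skrap or cheerio) to show what the expression looks like once the JSON is parsed, and what the regular expression extracts. The sample text 'Spider-Man (2002)' is an assumption made for illustration.

```javascript
// The recipe text exactly as it would appear in a .json file
// (String.raw keeps the backslashes untouched).
var recipeText = String.raw`{ "year": "find('td.result_text').text().match(/\\((\\d{4})\\)/)[1]" }`;
var recipe = JSON.parse(recipeText);
console.log(recipe.year); // the backslashes are now single: /\((\d{4})\)/

// The same regular expression applied to an assumed sample string.
var sampleText = 'Spider-Man (2002)';
var year = sampleText.match(/\((\d{4})\)/)[1];
console.log(year); // 2002

// When no four-digit year in parentheses is present, match() returns null,
// so indexing the result with [1] would throw a TypeError.
var noYear = 'Spider-Man'.match(/\((\d{4})\)/);
console.log(noYear); // null
```

Rows without a year would therefore need a guard in the expression if they are expected in the results.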

Breakdown of a simple recipe file

Running skrap with the above example and passing the parameter movie=spider-man generates a JSON file.

Page crawling and advanced options

Here's a more complex example:

{
    "url" : "http://www.imdb.com/find?q=${movie}&s=tt&ttype=ft&ref_=fn_ft",
    "headers": {
        "Accept-Language": "en-US,en;q=0.8,it;q=0.6,ro;q=0.4"
    },
    "collections" : [{
        "name" : "movies",
        "query": "$('table.findList tr')",
        "fields": {
            "title" : "find('td.result_text a').text()",
            "year" : "find('td.result_text').text().match(/\\((\\d{4})\\)/)[1]",
            "poster" : "find('td.primary_photo img').attr('src')",
            "details": {
                "url" : "find('td.result_text a').attr('href').replace('/','http://www.imdb.com/')",
                "group": false,
                "fields": {
                    "rating": "$('#overview-top .star-box-giga-star').text().trim()",
                    "duration": "$('#overview-top time').text().trim()"
                }
            }
        }
    }]
}

Optional fields:

Page crawling

Skrap has basic support for page crawling, one level deep. It works by providing an object with crawling instructions for a field name, instead of just a selector.

When you need to crawl a page for just a single piece of data, there is also a simplified syntax:

"rating": {
    "url" : "find('td.result_text a').attr('href').replace('/','http://www.imdb.com/')",
    "query": "$('#overview-top .star-box-giga-star').text().trim()"
}
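Both crawling examples build the URL of the page to crawl with replace('/', 'http://www.imdb.com/'). When the pattern is a string, JavaScript's replace only swaps the first match, so only the leading slash of the href is replaced. A short plain-JavaScript illustration follows; the sample href values are assumptions.

```javascript
// Assumed sample of a root-relative link taken from a result row.
var href = '/title/tt0145487/?ref_=fn_ft_tt_1';

// A string pattern replaces the first "/" only; later slashes stay as they are.
var absolute = href.replace('/', 'http://www.imdb.com/');
console.log(absolute); // http://www.imdb.com/title/tt0145487/?ref_=fn_ft_tt_1

// Caveat: if the href does not start with "/", the first slash found
// is somewhere in the middle and the result is not a valid URL.
var broken = 'title/tt0145487/'.replace('/', 'http://www.imdb.com/');
console.log(broken); // titlehttp://www.imdb.com/tt0145487/
```

The trick therefore relies on the site emitting root-relative links, which is worth checking before reusing it in a recipe for another site.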

Examples

See the /examples folder

Contributing

In lieu of a formal style guide, take care to maintain the existing coding style.

License

Copyright (c) 2014 Nick Dima
Licensed under the MIT license.