XSpider

Note: This project is an open-source externalization of iberobyte's past crawling strategy.

Concurrent web spider that crawls its way through a list of website sources, extracts relevant articles, and indexes the entries as JSON documents by execution date, for long-term storage and web consumption.

Requirements

  • Scala 2.12.10
  • Mill

Usage

  1. mill compile
  2. mill assembly
  3. java -cp out/spider/assembly/dest/out.jar spider.Main > out.log

Alternatively, simply execute ./run.sh in your terminal. The output will be stored in an article_${unixtime}.json file.
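The README does not document the JSON schema itself; as a rough sketch, a stored entry might look like the following (all field names and values here are assumptions, not the actual schema):

```json
{
  "articles": [
    {
      "url": "https://example.com/some-article",
      "title": "Example headline",
      "country": "PR",
      "crawledAt": 1581292800
    }
  ]
}
```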

Code Generation

XSpider uses code generation techniques to accelerate development. Specifically, the script command ./sc delegates control to other script commands.
For example, ./sc scraper ScraperElNuevoDia will invoke the ./scraper command with ScraperElNuevoDia as its argument. The output is a set of files generated based on your input. In this case, the following new files will be created:

  • spider/scraper/ScraperElNuevoDia.scala
  • spider/scraper/ScraperElNuevoDiaTests.scala
  • spider/test/resources/ElNuevoDia.html

The generated files contain comments indicating where to make modifications. For example:
/**
 * This file is partially generated. Only make modifications between
 * the BEGIN MANUAL SECTION and END MANUAL SECTION designators.
 * 
 * This file was generated by the ./sc scraper script command.
 */
package scraper

import collection.JavaConverters._
import org.jsoup.nodes.Document
import org.jsoup.nodes.Element

import spider.CountryType._
import crawler.CrawlerTypes.URL
import crawler.Article

object ScraperElNuevoDia extends TScraper {

  def apply(
    siteURL: URL,
    country: CountryType,
    html: String
  ): Seq[Article] = {
    val doc = parseHTMLDocument(html)
    /** BEGIN MANUAL SECTION */
    /** END MANUAL SECTION */
  }
}

Developers can quickly build new scrapers by employing this framework.
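For illustration, the manual section of a generated scraper might be filled in with jsoup selectors along these lines (the CSS selectors and the Article field names shown are assumptions; the real ones depend on the site's markup and on the crawler.Article definition):

```scala
    /** BEGIN MANUAL SECTION */
    // Hypothetical selectors for the site's article listing markup.
    doc
      .select("article.headline")
      .asScala
      .map { el =>
        // "abs:href" resolves relative links against the document's base URL.
        // The Article fields used here are assumed, not taken from the repo.
        Article(
          url = el.select("a").attr("abs:href"),
          title = el.select("h2").text(),
          country = country
        )
      }
      .toSeq
    /** END MANUAL SECTION */
```

The generated test file and the saved HTML fixture under spider/test/resources can then be used to verify the selectors against a real snapshot of the page.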

Testing

mill spider.test

LICENSE

MIT