forked from apache/nutch

A domain-specific Web crawler for planetary defense based on Apache Nutch, funded by NASA


Yongyao/nutch

 
 


Planetary defense (PD) Web crawler

Most open-source Web crawlers (e.g. Apache Nutch) handle focused crawling by relying on a keyword or document list compiled by subject matter experts, together with similarity measures and classifiers such as cosine similarity and the Naïve Bayes classifier. This work extends Nutch with a semi-supervised method for creating the keyword list, and it considers both text content and hyperlink structure. It was developed in the Planetary Defense Framework Gateway project, a NASA-funded effort to develop a cyberinfrastructure for scientific collaboration across different organizations. Please refer to the slides here for more detail.
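To make the keyword-list approach concrete, here is a minimal sketch of scoring a page against a keyword list with cosine similarity over raw term frequencies. It is an illustration only and is not taken from this repository or from the Nutch API: the class name `RelevanceSketch`, its methods, and the sample keyword list are all hypothetical, and the hyperlink-structure part of the method is not covered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not part of Nutch or of this project's code.
public class RelevanceSketch {

    // Build a term-frequency vector: lower-case the text and split on
    // anything that is not a letter or a digit.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two term-frequency vectors.
    // Returns 0.0 when either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        // Assumed sample keyword list, made up for this example.
        Map<String, Integer> keywords =
                termFrequencies("asteroid impact near-earth object deflection");
        Map<String, Integer> page =
                termFrequencies("New survey tracks near-Earth asteroid candidates for deflection tests");
        System.out.println("relevance = " + cosine(keywords, page));
    }
}
```

In a focused crawler, a score like this would typically be compared against a threshold to decide whether a page's outlinks are worth following; the threshold and how it combines with link-based evidence are design choices this sketch leaves open.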

Apache Nutch

For the latest information about Nutch, please visit our website at:

http://nutch.apache.org

and our wiki, at:

http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

http://wiki.apache.org/nutch/NutchTutorial

Languages

  • Java 96.4%
  • HTML 2.6%
  • Other 1.0%