This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Simple Web Crawler required for an old interview


turneand/apjt-web-crawler

Introduction

Sample web crawler program

How to build and run:

To build on the command line:

   mvn clean install

To execute, locate the directory containing "apjt-core-1.0-SNAPSHOT.jar" and run:

   java -jar apjt-core-1.0-SNAPSHOT.jar << URL >>

Where << URL >> is the URL to crawl. You should then get output that eventually looks like:

   INTERNAL - http://localhost:54410/index.html
      img - http://localhost:54410/image2.jpg
      INTERNAL - http://localhost:54410/file1.html
      img - http://localhost:54410/image1.jpg
      img - http://localhost:54410/image3.jpg
      INTERNAL - http://localhost:54410/nested/file2.html
        a - http://www.google.co.uk
        img - http://localhost:54410/nested/nestedimg.jpg
      INTERNAL - ERROR[404] - http://localhost:54410/file3.html

INTERNAL indicates an internal link relative to the crawled site, "a" indicates an external link, "img" indicates an image, etc. (other tags containing a "src" attribute are also supported).
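The INTERNAL-versus-external decision can be sketched with `java.net.URI` resolution: links that resolve to the same host as the page they appeared on are internal, everything else is external. The `classify` helper below is a hypothetical illustration, not the repository's actual implementation (which the notes later describe as brittle):

```java
import java.net.URI;

public class LinkClassifier {
    // Classify a link relative to the page it appeared on: links resolving
    // to the same host are treated as INTERNAL, everything else as external.
    // (Hypothetical helper, not the repository's code.)
    static String classify(String baseUrl, String link) {
        URI base = URI.create(baseUrl);
        URI resolved = base.resolve(link); // relative links inherit the base host
        String baseHost = base.getHost();
        return baseHost != null && baseHost.equals(resolved.getHost())
                ? "INTERNAL" : "external";
    }

    public static void main(String[] args) {
        String base = "http://localhost:54410/index.html";
        System.out.println(classify(base, "file1.html"));              // INTERNAL
        System.out.println(classify(base, "http://www.google.co.uk")); // external
    }
}
```

A host comparison like this still misclassifies some cases (e.g. `www.example.com` vs `example.com`), which is one reason the real distinction needs more care.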

The question has three main components:

  1. Write a crawler - I believe this is the main functionality being requested, so although third-party solutions are available, I opted to implement a basic version myself
  2. Write a webpage parser - parsing HTML is surprisingly difficult, so I used the third-party "jsoup" library for this functionality
  3. Print out results - simple output to the console has been implemented due to time constraints, rather than a graphical model
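The console-output component amounts to a depth-first walk that indents each resource under the page it was found on, producing a tree like the sample above. A minimal stdlib-only sketch, where the `Node` record and its field names are illustrative rather than the repository's actual model:

```java
import java.util.List;

public class SiteMapPrinter {
    // Illustrative tree node: a label ("INTERNAL", "a", "img"), a URL,
    // and the resources found on that page.
    record Node(String tag, String url, List<Node> children) {}

    // Render the tree with two-space indentation per nesting level.
    static String render(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append(node.tag())
          .append(" - ").append(node.url()).append('\n');
        for (Node child : node.children()) {
            sb.append(render(child, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node page = new Node("INTERNAL", "http://localhost:54410/index.html", List.of(
                new Node("img", "http://localhost:54410/image2.jpg", List.of()),
                new Node("INTERNAL", "http://localhost:54410/file1.html", List.of())));
        System.out.print(render(page, 0));
    }
}
```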

Time spent:

  1. Approximately 15 minutes was spent on "prep": reading the question, choosing an HTML parser, etc.
  2. Over the following days a little over two hours was spent on the actual coding, with additional time spent writing this documentation
  3. I already had a "template" Maven project that runs a build, and only needed to add the shade plugin, so not much time was spent there

Missed functionality due to time constraints and trade-offs:

  1. Needs additional tests around the parsing of the HTML pages
  2. The launcher needs more functionality that can be tested
  3. The distinction between internal and external links needs better handling, as the current support is brittle
  4. Should support a maximum nesting depth, rather than a maximum number of pages, as the output would then be more logical
  5. Should allow maxPages and/or maxDepth to be passed in as parameters
  6. Handling of pages should be executed in parallel - the code has mostly been written to support parallelisation, but the launcher does not use it
  7. Rendering of the "site map" should take a better form: graphical output, an XML sitemap, etc.
  8. Should act like a better-behaved robot
  9. Input to the program should be validated
  10. Documentation should be improved
  11. Unit tests are a little heavyweight because a Jetty server is wired in as a TestRule (though it doesn't add too much overhead). I implemented something similar recently for CXF testing, so I knew how easy it is. It means the app is tested in a more production-like way, and the main reason for it was to simplify the jsoup code and wiring by avoiding too many conditional parsing types
  12. jsoup was used for the HTML parsing. This was the first time I had come across jsoup, so the API usage is mainly based on their examples - for a real app I would prefer to spend more time understanding that area
  13. The model implemented separates parsing and rendering, so it holds more resources in memory, but it is implemented such that a page hopefully won't be parsed more than once
  14. A simple in-memory cache of the pages was used, rather than a store that could spill to disk, so it would not scale well
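Item 6 (parallel handling of pages) could take the shape of a breadth-first crawl where each level's pages are fetched concurrently and a thread-safe visited set prevents re-parsing. This is a stdlib-only sketch under assumed names: the `fetchLinks` function and `crawl` signature are hypothetical, not the repository's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.*;
import java.util.function.Function;

public class ParallelCrawler {
    // Breadth-first crawl: each level's pages are fetched in parallel, and a
    // concurrent set guards against parsing the same URL twice.
    static Set<String> crawl(String start, int maxDepth,
                             Function<String, List<String>> fetchLinks) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            visited.add(start);
            List<String> frontier = List.of(start);
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // Submit all fetches for this level at once.
                List<Future<List<String>>> futures = frontier.stream()
                        .map(url -> pool.submit(() -> fetchLinks.apply(url)))
                        .toList();
                List<String> next = new ArrayList<>();
                for (Future<List<String>> f : futures) {
                    for (String link : f.get()) {
                        // Set.add is atomic: returns true only the first time.
                        if (visited.add(link)) next.add(link);
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

A depth-bounded loop like this also addresses items 4 and 5, since `maxDepth` is an explicit parameter rather than a page-count cap.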
