zhijia/HPar

HPar

HPar is a prototype of a data-parallel HTML5 parser, built on the popular HTML parser Jsoup. Using speculative parallelization, HPar can parse even a single HTML file in parallel.
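To give a rough feel for what "data parallel" means for a single HTML string, here is a minimal, self-contained sketch. It is not HPar's algorithm and does not use HPar or Jsoup. The class `ChunkingSketch` and its methods are hypothetical names invented for this illustration. The sketch cuts the input at `<` characters, lets each thread process one chunk (here it only counts `<` characters), and merges the per-chunk results. It assumes `numThreads >= 1` and ignores the hard cases: a `<` may sit inside a comment, script, or attribute value, so a real parser cannot trust such cuts blindly. That uncertainty is where speculation would come in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of HPar or Jsoup.
class ChunkingSketch {

    // Split html into at most numChunks pieces. Each cut is moved forward
    // to the next '<', so every chunk after the first starts at a '<'.
    static List<String> splitAtTagStarts(String html, int numChunks) {
        List<String> chunks = new ArrayList<>();
        int n = html.length();
        int start = 0;
        for (int i = 1; i < numChunks && start < n; i++) {
            // Ideal even cut position, but always past the current start.
            int target = Math.max(start + 1, (int) ((long) n * i / numChunks));
            int cut = html.indexOf('<', target);
            if (cut < 0) {
                break; // no further '<' to cut at
            }
            chunks.add(html.substring(start, cut));
            start = cut;
        }
        chunks.add(html.substring(start)); // remainder of the input
        return chunks;
    }

    // Stand-in for per-chunk work: count '<' characters in one chunk.
    static int countTagOpeners(String chunk) {
        int count = 0;
        for (int i = 0; i < chunk.length(); i++) {
            if (chunk.charAt(i) == '<') {
                count++;
            }
        }
        return count;
    }

    // Process the chunks on a thread pool and merge the results by summing.
    static int parallelCount(String html, int numThreads) {
        List<String> chunks = splitAtTagStarts(html, numThreads);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                Callable<Integer> task = () -> countTagOpeners(chunk);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>one</p><p>two</p></body></html>";
        System.out.println(splitAtTagStarts(html, 4));
        // The merged parallel result should match the sequential one.
        System.out.println(parallelCount(html, 4) == countTagOpeners(html));
    }
}
```

Counting is trivially mergeable, which keeps the sketch short. Building a DOM tree is not: each chunk's meaning depends on the parser state left by the chunks before it, so merging the partial results is much harder than summing.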

Fig. Speedup on MacBook Pro with a Quad-Core CPU

Getting Started:

./compile.sh
./run.sh      # output is /test/output.html

How To Use:

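// html is the HTML input to parse; numThreads is the number of parser threads
ParallelParser pparser = new ParallelParser(html, numThreads);
doc = pparser.parse();   // doc holds the resulting DOM tree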

Note: The prototype is still under development. Although it passes the included test set with up to 8 threads, the current version does not guarantee that the resulting DOM tree is always identical to the one produced by the sequential version. Contributions that make the project more robust are welcome.

Publication:

Paper:

HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing

Reference:

Zhao, Z., Bebenita, M., Herman, D., Sun, J., & Shen, X. (2013). HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing. ACM Transactions on Architecture and Code Optimization (TACO), 10(4), 44.

@article{zhao2013hpar,
  title={HPar: A practical parallel parser for HTML--taming HTML complexities for parallel parsing},
  author={Zhao, Zhijia and Bebenita, Michael and Herman, Dave and Sun, Jianhua and Shen, Xipeng},
  journal={ACM Transactions on Architecture and Code Optimization (TACO)},
  volume={10},
  number={4},
  pages={44},
  year={2013},
  publisher={ACM}
}
