This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Add all various improvements in scala. #94

Open
wants to merge 422 commits into
base: master
from

Conversation

raisercostin

Merged various forks to bring as many as possible of the improvements made to goose into the main trunk.

skyshard and others added 30 commits May 17, 2013 18:31
(cherry picked from commit 6c7f98523a0cee3e08d506f656850c8e29974602)
Conflicts:
	pom.xml
	src/main/scala/com/gravity/goose/network/HtmlFetcher.scala
	src/main/scala/com/gravity/goose/text/StopWords.scala
… not found properly inside the for() for crawler… leaving for later)
…?) doesn't actually do extraction in this case...
warrd and others added 30 commits October 13, 2014 18:17
Goose uses a HashSet for iterating topNode candidates.
But HashSet doesn't guarantee iteration order, so when two candidates have
the same score, the choice is effectively arbitrary. This is not acceptable.
Now, by using LinkedHashSet, we make sure that in case of a tie we choose
the first tag that was found in the DOM tree.
Using LinkedHashSet to avoid inconsistency
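
The tie-breaking behaviour described in the commit message above can be illustrated with a small, self-contained sketch. It is not Goose's actual code: `Candidate`, `pickTop`, and the sample tags and scores are hypothetical names invented for the illustration, and it uses `scala.collection.mutable.LinkedHashSet` on the assumption that any insertion-ordered set (such as `java.util.LinkedHashSet`) would behave the same way for this purpose.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a scored DOM node; not Goose's real type.
final case class Candidate(tag: String, score: Int)

object TieBreakSketch {
  // Returns the highest-scoring candidate. Only a strictly greater score
  // replaces the current best, so on a tie the candidate seen first wins.
  // The result is therefore deterministic only if iteration order is.
  def pickTop(candidates: Iterable[Candidate]): Option[Candidate] =
    candidates.foldLeft(Option.empty[Candidate]) {
      case (None, c)       => Some(c)
      case (Some(best), c) => if (c.score > best.score) Some(c) else Some(best)
    }

  def main(args: Array[String]): Unit = {
    // Candidates listed in the order they were (hypothetically) found in the DOM.
    val inDomOrder = Seq(
      Candidate("div#intro", 10),
      Candidate("div#body", 42),
      Candidate("div#sidebar", 42), // same score as div#body
      Candidate("div#footer", 3)
    )

    // LinkedHashSet iterates in insertion order, so the tie between
    // div#body and div#sidebar always resolves to div#body.
    val ordered = mutable.LinkedHashSet.empty[Candidate]
    ordered ++= inDomOrder

    println(pickTop(ordered))
  }
}
```

With a plain hash set, the same `pickTop` could return either of the two tied candidates depending on how the elements happen to hash, which is the inconsistency the commit is meant to remove.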
Accept cookies from web sites which put all the cookies into one request
header.
Conflicts:
	build.sbt
	src/main/scala/com/gravity/goose/Configuration.scala
Conflicts:
	README.md
	build.sbt
	pom.xml
	src/main/scala/com/gravity/goose/Article.scala
	src/main/scala/com/gravity/goose/Configuration.scala
	src/main/scala/com/gravity/goose/opengraph/OpenGraphData.scala
	src/test/scala/com/gravity/goose/GooseTest.scala
Conflicts:
	pom.xml
	src/main/scala/com/gravity/goose/Article.scala
	src/main/scala/com/gravity/goose/Configuration.scala
	src/main/scala/com/gravity/goose/Crawler.scala
	src/main/scala/com/gravity/goose/images/ImageExtractor.scala
	src/main/scala/com/gravity/goose/images/StandardImageExtractor.scala
	src/main/scala/com/gravity/goose/images/UpgradedImageIExtractor.scala
	src/main/scala/com/gravity/goose/network/HtmlFetcher.scala
	src/test/scala/com/gravity/goose/TestUtils.scala
	src/test/scala/com/gravity/goose/TextExtractionsTest.scala