This repository has been archived by the owner on Oct 30, 2018. It is now read-only.

Add all various improvements in scala. #94

Open
wants to merge 422 commits into
base: master
422 commits
40b80b9
less logging
skyshard May 18, 2013
7636f6b
remove root build.sbt to try to get project compile to work
skyshard May 18, 2013
1bb28e9
Revert "remove root build.sbt to try to get project compile to work"
skyshard May 18, 2013
a4416bd
fix spelling error and mission quote
eldilibra May 22, 2013
975b91f
Add image width and height
May 23, 2013
5bafcce
Merge branch 'sonatype_hosting' into deploy
May 23, 2013
ee411d4
Merge branch 'http_proxy' into deploy
May 23, 2013
047f2aa
Change repository location
May 30, 2013
4992f86
Commit version 2.1.23
Jun 25, 2013
950ebcb
Merge https://github.com/chechu/goose into chechu
skyshard Jul 17, 2013
cc22a49
Revert "Added the Acceso Maven repository. Using the Lingpipe languag…
skyshard Jul 17, 2013
40f3a9e
handling gzipped responses, content type detection failure
skyshard Jul 17, 2013
15e4921
fixing gizmodo, mashable
skyshard Jul 17, 2013
b9d7a20
getting tests running (and failing)
skyshard Jul 18, 2013
f4882a6
link extraction from content node
skyshard Jul 18, 2013
4b91e2f
stricter small paragraph removal… monitor how well this does
skyshard Jul 18, 2013
aa514fd
referer and cookie support for html fetching
skyshard Jul 18, 2013
6cd9fc2
add document in trace for easier debug
skyshard Jul 18, 2013
0849a97
stricter non-removal of .scrollable class
skyshard Jul 18, 2013
6b9814a
better description, canonical link extractors
skyshard Jul 18, 2013
39d9032
guess date from url if the extractor fails, logic from snacktory: htt…
skyshard Jul 18, 2013
8b544bd
langid stuff todo later…
skyshard Jul 18, 2013
802551d
remove junit dep
skyshard Jul 18, 2013
407893a
ignore # href
skyshard Jul 18, 2013
91cbc82
don't drop all lists in content immediately, score them first (and ch…
skyshard Jul 19, 2013
2567920
throw more errors (couldn't get it to throw parsing errors or content…
skyshard Jul 20, 2013
bc163be
changed link extraction to give url -> text map
skyshard Jul 23, 2013
435ab83
don't crash on documents without body (eg. a page with just frame set…
skyshard Jul 25, 2013
a5c9d28
don't keep alive forever if value isn't specified, just leave the con…
skyshard Jul 25, 2013
c7a22da
remove paragraphs that are completely links as well
skyshard Jul 25, 2013
070d528
fix canonical urls to get the absolute href
skyshard Jul 25, 2013
ff8613f
fix other canonical link types to get the abs href, set the base uri …
skyshard Jul 25, 2013
5fc462d
shorter connection timeout for images
skyshard Jul 26, 2013
ed41e37
ignore javascript links
skyshard Jul 26, 2013
a505e33
ignore huffingtonpost related slideshow
skyshard Jul 31, 2013
2b020f2
Add .idea to ignore whitelist.
andhapp Aug 3, 2013
68cbf29
Updated pom.xml
andhapp Aug 3, 2013
270b33e
We do want stalecheck to make sure the connection is always fresh.
andhapp Aug 3, 2013
b3cc655
Modifying the CLI interface to work on a file containing a list
bsravanin Oct 23, 2013
74f1e4d
Changing the CLI interface to process a file containing a list of
bsravanin Oct 23, 2013
c8d3b08
Merge branch 'master' of github.com:bsravanin/goose
bsravanin Oct 29, 2013
00087ec
Merge branch 'master' of https://github.com/skyshard/goose
Oct 31, 2013
bcf6dcb
Fix versions for deployment
Nov 1, 2013
bec521a
Remove idle connection monitor - clojure AOT hangs when the thread is…
Nov 1, 2013
977648c
Upgrade versions, remove unnecessary dependencies, and issue new release
Nov 1, 2013
e70f4f2
Cassandra persistence.
bsravanin Nov 8, 2013
145d5e7
Further simplifying the CLI interface.
bsravanin Nov 22, 2013
3321da1
Merge branch 'http_proxy'
Nov 25, 2013
2a4f4ea
Merge branch 'fix_image_sizes'
Nov 25, 2013
f2f30cf
[maven-release-plugin] prepare release goose-2.1.23
Nov 25, 2013
d90bc2b
[maven-release-plugin] prepare release goose-2.1.23
Nov 25, 2013
f2bd9db
Revert "[maven-release-plugin] prepare release goose-2.1.23"
Nov 25, 2013
cb48ccf
Revert "[maven-release-plugin] prepare release goose-2.1.23"
Nov 25, 2013
5679837
Update scm
Nov 25, 2013
dcaed2f
[maven-release-plugin] prepare release goose-2.1.23
Nov 25, 2013
08e0b39
[maven-release-plugin] prepare for next development iteration
Nov 25, 2013
2b7e0a1
After modifications to the url_versions schema.
bsravanin Nov 26, 2013
67bde0c
bump to Scala 2.10.0
Jan 8, 2014
fe17534
scala.actors.Future* no longer in Scala 2.10
Jan 8, 2014
870b3ce
add zeebox Nexus config
Jan 8, 2014
d5a97b6
Changes to compile on 2.10.3
ViniciusMiana Jan 24, 2014
5ca6941
Added java class
nebm51 Feb 2, 2014
34256e2
Added functions for ContentExtractor class
nebm51 Feb 2, 2014
52d3b0a
Added support classes
nebm51 Feb 3, 2014
47b3d45
Added support classes
nebm51 Feb 3, 2014
a9e2b77
Main functions implemented
nebm51 Feb 3, 2014
be3c478
Added statistics functions
nebm51 Feb 6, 2014
ad729de
Do not use absolute path for ImageMagick tools
ilpianista Feb 21, 2014
7037baf
Escalate exception in Image extraction
fabiofumarola Mar 11, 2014
ac33778
Migrate project on to sbt. Remove maven files. Upgrade to scala 2.10.
jasongoodwin Mar 13, 2014
79d2193
updated to use scala 2.10.3
darthbear Mar 16, 2014
2502b89
Remove PSD
CharlesGuillot Mar 19, 2014
ff5f52c
Updated pom.xml
pvdlg Mar 22, 2014
9d5635c
Updated README
pvdlg Mar 22, 2014
f5c56c7
Fixed badlink JUnit Test
pvdlg Mar 22, 2014
82aab5a
Fixed a javadoc tag
pvdlg Mar 22, 2014
1f0f5f2
Moved static html under test/resources
pvdlg Mar 22, 2014
f9dd1a1
pom.xml update to generate javadoc jar
pvdlg Mar 22, 2014
0aa93a5
Add image width and height
pvdlg Mar 22, 2014
97566a1
Set stalecheck to true
pvdlg Mar 22, 2014
25cd2ec
minot typo
pvdlg Mar 22, 2014
fc61d2e
Updated Maven build dependencies
pvdlg Mar 22, 2014
7a11532
Add http proxy support from java prop or environment variable
pvdlg Mar 22, 2014
75ff308
Minor typos.
pvdlg Mar 22, 2014
f84b754
Fixed spelling error
pvdlg Mar 22, 2014
c6c25f6
Escalate exception in Image extraction + formatting
pvdlg Mar 22, 2014
de60c3c
Updated README
darthbear Mar 22, 2014
9fcafc3
update .gitignore
robbypond Apr 3, 2014
33169ae
add opengraph tags
robbypond Apr 3, 2014
9defe03
add sbt support
Apr 11, 2014
9536f8a
update pom: scala version and junit dir
Apr 14, 2014
cd8d323
add chinese support. use mmseg4j
Apr 15, 2014
c9de660
modifying link structure
skyshard Apr 18, 2014
17a5298
update imagemagick path
Apr 22, 2014
3801d98
should use Nil for empty lists
skyshard Apr 24, 2014
1eac43e
Moved the bean properties to constructor parameters.
david-cliqz May 22, 2014
9b3f734
Added the rest of the parameters to the constructor, too. This resolv…
david-cliqz May 22, 2014
b2c29f5
Fixes #1.
david-cliqz May 22, 2014
b1aeb84
Fixes #3.
david-cliqz May 22, 2014
5e53041
Fixes #4.
david-cliqz May 22, 2014
f3b871b
Resolves #5.
david-cliqz May 22, 2014
7397fac
Replaces tabs with spaces. Fixes #7.
david-cliqz May 26, 2014
09d1618
Added the main class attribute. Fixes #6.
david-cliqz May 26, 2014
5f3bc13
use scala 2.10.4
darthbear Jun 11, 2014
170d76f
Added a check to doTitleSplit(); fixes #8.
david-cliqz Jun 16, 2014
413d7c9
Added the stopword lists from python-goose. Part of #9.
david-cliqz Jun 16, 2014
69b0271
Removed accidentally added debugging messages.
david-cliqz Jun 16, 2014
a0c320a
bumped version to 2.1.23. Thanks @eliasah
darthbear Jun 17, 2014
4d4824a
Upgraded jsoup to the latest version. Also bumped the version number up.
david-cliqz Jun 18, 2014
d13f0fc
Upgraded jsoup to the latest version. Also bumped the version number up.
david-cliqz Jun 18, 2014
476414d
Merge branch 'master' of github.com:david-cliqz/goose
david-cliqz Jun 18, 2014
e0f568a
Added a lang parameter to getStopWords().
david-cliqz Jun 20, 2014
f3ae1d0
Added the lang parameter to the interface. All we need now is to connect
david-cliqz Jun 20, 2014
b34e1d9
Fully wired lang in.
david-cliqz Jun 20, 2014
8383fa7
added ko stopword
theodoreLee Jul 7, 2014
79cb607
updated build.sbt
theodoreLee Jul 7, 2014
5ff32a8
Merge remote-tracking branch 'remotes/kkme/master'
raisercostin Jul 7, 2014
fe56ab1
- fix build
raisercostin Jul 7, 2014
688d310
- upgrade to languages
raisercostin Jul 7, 2014
44e886b
- add eclipse resources to classpath
raisercostin Jul 7, 2014
d1034d0
- partially integrate assembly (but works with scala 2.9 only)
raisercostin Jul 7, 2014
723fffe
updated build.sbt
theodoreLee Jul 7, 2014
1a2d7d2
Too old. Is migrated in scala.
raisercostin Jul 7, 2014
f1eb1ed
- removed
raisercostin Jul 7, 2014
b76fc08
Merge remote-tracking branch 'remotes/david/master'
raisercostin Jul 7, 2014
d4f6b9f
- integrate
raisercostin Jul 7, 2014
f4b7594
Merge remote-tracking branch 'remotes/vinicius/master'
raisercostin Jul 7, 2014
c8b4bde
- remove java folders when generated by sbteclipse
raisercostin Jul 7, 2014
626f4b2
Merge remote-tracking branch 'remotes/kunalmodi/master'
raisercostin Jul 7, 2014
219d446
Merge remote-tracking branch 'remotes/vinicius/deploy'
raisercostin Jul 7, 2014
381defa
Merge remote-tracking branch 'remotes/kkme/test'
raisercostin Jul 7, 2014
0049a21
Merge branch 'master' of https://github.com/vanduynslagerp/goose
raisercostin Jul 7, 2014
2918726
- add sbt-dependency-graph plugin
raisercostin Jul 7, 2014
da2f7e6
Added domain property to an article.
theodoreLee Jul 8, 2014
8fc3ad4
Merge remote-tracking branch 'remotes/marcosinger/master'
raisercostin Jul 8, 2014
5a9a8a4
Merge remote-tracking branch 'remotes/theodore/master'
raisercostin Jul 8, 2014
042c79d
Merge remote-tracking branch 'aurality/master'
raisercostin Jul 8, 2014
71ccc9b
Merge remote-tracking branch 'remotes/nator/master'
raisercostin Jul 8, 2014
05c87be
Merge remote-tracking branch 'remotes/chimpler/master'
raisercostin Jul 8, 2014
6514fa6
Merge remote-tracking branch 'remotes/jaytaylor/master'
raisercostin Jul 8, 2014
8806b9b
Merge remote-tracking branch 'remotes/FaKod/master'
raisercostin Jul 8, 2014
72ab356
Merge remote-tracking branch 'remotes/skyshard/master'
raisercostin Jul 8, 2014
88988bf
Merge remote-tracking branch 'remotes/skyshard/langid'
raisercostin Jul 8, 2014
dfcca5a
- fix all again
raisercostin Jul 8, 2014
680092b
Add SBT.
jordanburke Jul 9, 2014
9fda322
Changed the version to 2.2.0. Removed pom.xml
jordanburke Jul 9, 2014
5527254
Language is transfered for each url and not in Goose configuration
raisercostin Jul 9, 2014
26c51ed
Merge remote-tracking branch 'remotes/sapienapps/master'
raisercostin Jul 9, 2014
874d366
Merge remote-tracking branch 'remotes/rob/master'
raisercostin Jul 9, 2014
f36c2a3
configure stopwords
raisercostin Jul 9, 2014
5b33850
Merge remote-tracking branch 'remotes/and/determine-encoding'
raisercostin Jul 9, 2014
5c2ce84
Merge remote-tracking branch 'remotes/bsravanin/master'
raisercostin Jul 10, 2014
2495bba
integrate cassandra
raisercostin Jul 10, 2014
4a797d6
- create different main
raisercostin Jul 10, 2014
d335ef7
- comment out cassandra as it introduces too many dependencies to be …
raisercostin Jul 10, 2014
96dd7be
Merge remote-tracking branch 'remotes/qu1j0t3/add_sbt_config'
raisercostin Jul 10, 2014
f1fbb08
Merge remote-tracking branch 'remotes/zeebox/master'
raisercostin Jul 10, 2014
957c8e5
Merge remote-tracking branch 'remotes/dr3s/master'
raisercostin Jul 10, 2014
8ec785d
Merge remote-tracking branch 'remotes/SentiOne/master'
raisercostin Jul 10, 2014
644f55c
Merge remote-tracking branch 'remotes/devender/master'
raisercostin Jul 10, 2014
ad01f40
Merge remote-tracking branch 'remotes/eldilibra/minor-cleanup'
raisercostin Jul 10, 2014
e48f2dd
- fix debug
raisercostin Jul 10, 2014
dcd405d
Merge remote-tracking branch 'remotes/qu1j0t3/master'
raisercostin Jul 10, 2014
e80fa85
Merge remote-tracking branch 'remotes/andhapp/fix-no-response-http-ex…
raisercostin Jul 10, 2014
d51e117
Merge remote-tracking branch 'remotes/ilpianista/patch-1'
raisercostin Jul 10, 2014
f180913
Merge remote-tracking branch 'remotes/fabiofumarola/patch-1'
raisercostin Jul 10, 2014
6e0708f
Merge remote-tracking branch 'remotes/jasongoodwin/master'
raisercostin Jul 10, 2014
71d4172
Merge remote-tracking branch 'remotes/CharlesGuillot/master'
raisercostin Jul 10, 2014
86f43b7
Merge remote-tracking branch 'remotes/robert-blumen/all-java-images'
raisercostin Jul 10, 2014
2d06183
- relative path to imageMagick executable
raisercostin Jul 10, 2014
c1ea29e
Merge remote-tracking branch 'remotes/mneedham/master'
raisercostin Jul 10, 2014
bad88e1
Merge remote-tracking branch 'remotes/amir343/master'
raisercostin Jul 10, 2014
1e524e3
Merge remote-tracking branch 'remotes/coryhacking/master'
raisercostin Jul 10, 2014
2403b6b
Merge remote-tracking branch 'remotes/johnteslade/master'
raisercostin Jul 10, 2014
0eff76e
Merge remote-tracking branch 'remotes/pshken/master'
raisercostin Jul 10, 2014
c53dd33
- more details about mavn and imageMagick prerequisites
raisercostin Jul 10, 2014
9eb5779
Merge remote-tracking branch 'remotes/tomazk/master'
raisercostin Jul 10, 2014
6c8e5ba
Merge remote-tracking branch 'remotes/dhepper/master'
raisercostin Jul 10, 2014
21d410b
Merge remote-tracking branch 'remotes/andrewlin12/master'
raisercostin Jul 10, 2014
44969b7
add a main that exports a service on http://localhost:8890?url=http:/…
raisercostin Jul 10, 2014
c35a9e0
- fix JsonUtil serializer to handle scala collections
raisercostin Jul 10, 2014
ab5df7b
Create JsonMain to print a json from article.
raisercostin Jul 10, 2014
b45f409
Merge remote-tracking branch 'remotes/umars/master'
raisercostin Jul 10, 2014
b5c46ac
Merge remote-tracking branch 'remotes/umars/gh-pages'
raisercostin Jul 10, 2014
88aa61f
Merge remote-tracking branch 'remotes/AAAI/master'
raisercostin Jul 10, 2014
52bd6ac
Merge remote-tracking branch 'remotes/amalinovskiy/master'
raisercostin Jul 10, 2014
b56602d
- enable gae if needed
raisercostin Jul 11, 2014
2b184b8
Merge remote-tracking branch 'remotes/nebm51/master'
raisercostin Jul 11, 2014
e384379
Merge remote-tracking branch 'remotes/nebm51/JavaPort'
raisercostin Jul 11, 2014
0a9517e
- java changes are too old
raisercostin Jul 11, 2014
604ba7b
- prepare release
raisercostin Jul 11, 2014
28a8745
- add release details
raisercostin Jul 11, 2014
7754973
fix script
raisercostin Jul 11, 2014
43ab516
Update README.md
raisercostin Jul 11, 2014
d8e8d6d
- configuration as case class
raisercostin Jul 13, 2014
bd6945e
Merge branch 'master' of https://github.com/raisercostin/goose
raisercostin Jul 13, 2014
9bfcca3
- pass unique sub-folders to svn import as workaround for MKCOL failu…
raisercostin Jul 13, 2014
914f759
2.2.2-SNAPSHOT
raisercostin Jul 13, 2014
d64675c
- comment out prints
raisercostin Aug 8, 2014
ef5da5e
- remove node cloning
raisercostin Aug 10, 2014
be2c01f
- update distribution management
raisercostin Aug 10, 2014
404423e
- update distribution management
raisercostin Aug 10, 2014
c4a939e
- mvn pom fixes
raisercostin Aug 10, 2014
70f7a25
ignore parent
raisercostin Aug 10, 2014
e60aa50
[maven-release-plugin] prepare release goose-2.2.2
raisercostin Aug 10, 2014
0a1c2df
[maven-release-plugin] rollback the release of goose-2.2.2
raisercostin Aug 10, 2014
b9c2c4c
- add git
raisercostin Aug 10, 2014
117a7da
[maven-release-plugin] prepare release goose-2.2.2
raisercostin Aug 10, 2014
a923cbe
[maven-release-plugin] prepare for next development iteration
raisercostin Aug 10, 2014
895c6fa
- follow scala release name convention
raisercostin Aug 10, 2014
a405f04
- describe release process
raisercostin Aug 10, 2014
d03f7bd
- fix pom groupId and artifactId
raisercostin Aug 10, 2014
9d24433
[maven-release-plugin] prepare release goose_2.10-2.2.3
raisercostin Aug 10, 2014
8d19186
[maven-release-plugin] prepare for next development iteration
raisercostin Aug 10, 2014
dbd26ee
upgrade scala
jasongoodwin Sep 7, 2014
28365f9
checkin org in build sbt
jasongoodwin Sep 9, 2014
c256c40
Working with SBT and Scala 2.11.2
warrd Oct 11, 2014
fcb3cdf
Merge branch 'patch-1' of git://github.com/ilpianista/goose
warrd Oct 11, 2014
aa8300a
Merge branch 'master' of git://github.com/robbypond/goose
warrd Oct 11, 2014
6464af9
Update readme
warrd Oct 11, 2014
78a71d6
Sonatype repo management
warrd Oct 13, 2014
f49218c
Extract open graph timestamps, tags and section
warrd Oct 14, 2014
8b709f6
Use open graph data if available to parse publish date
warrd Oct 14, 2014
5523fda
Remove unused StandardImageExtractor
Oct 27, 2014
94f0c49
Allow extraction of all the images in the order found in the HTML
Oct 27, 2014
34bd3e1
Add test for getting all images
Oct 27, 2014
6c8fa77
Update surefire dependency to 2.17
Oct 27, 2014
95d3413
Allow settings imagemagick binaries using env vars
Oct 27, 2014
636eb99
Rename AllImages and TextExtractions tests so they always run
Oct 27, 2014
5cc6b00
Using LinkedHashSet to avoid inconsistency
onilton Oct 28, 2014
cd88d94
Merge pull request #1 from oniltonmaciel/contentextractor-improvements
raisercostin Oct 28, 2014
a0c0cc8
Accept single cookie header
ivgiuliani Oct 28, 2014
714ed21
Check stale connections
ivgiuliani Oct 28, 2014
abe2f98
Re raise unknown exceptions
ivgiuliani Oct 28, 2014
4da8643
Merge remote-tracking branch 'remotes/jasongoodwin/master'
raisercostin Oct 28, 2014
8f1e2e7
Merge branch 'master' of https://github.com/raisercostin/goose
raisercostin Oct 28, 2014
b0ad84a
- upgraded to scala 2.11 (and some dependencies accordingly)
raisercostin Oct 28, 2014
b0e7d7c
Merge remote-tracking branch 'remotes/warrd/master'
raisercostin Oct 28, 2014
2557c08
- fix compilation errors
raisercostin Oct 28, 2014
63fbbf9
Merge remote-tracking branch 'remotes/pickl-it/master'
raisercostin Oct 29, 2014
1e03976
- fix compilation errors after merge
raisercostin Oct 29, 2014
d031421
- upgrade to scala 2.11.2
raisercostin Oct 29, 2014
83d03a2
- upgrade pom.xml
raisercostin Oct 29, 2014
dd4a430
[maven-release-plugin] prepare release goose_2.11-2.2.4
raisercostin Oct 29, 2014
8974466
[maven-release-plugin] prepare for next development iteration
raisercostin Oct 29, 2014
e7fa1d8
distribution moved to bintray
Oct 14, 2016
4ae111b
autopublish to bintray
Oct 14, 2016
837a766
added pom files to release for scala 2.10 and 2.11
Feb 5, 2017
179e1a0
prepare 2.2.8
raisercostin Feb 26, 2017
2163457
update deploy procedure via mvn
raisercostin Feb 26, 2017
Fully wired lang in.
david-cliqz committed Jun 20, 2014
commit b34e1d9d9cbc86570420335dad0eb44ece962a9e
9 changes: 4 additions & 5 deletions src/main/scala/com/gravity/goose/Crawler.scala
@@ -58,6 +58,7 @@ class Crawler(config: Configuration) {
parseCandidate <- URLHelper.getCleanedUrl(crawlCandidate.url)
rawHtml <- getHTML(crawlCandidate, parseCandidate)
doc <- getDocument(parseCandidate.url.toString, rawHtml)
+ lang = crawlCandidate.lang
} {
trace("Crawling url: " + parseCandidate.url)

@@ -82,7 +83,7 @@ class Crawler(config: Configuration) {
// before we do any calcs on the body itself let's clean up the document
article.doc = docCleaner.clean(article)

- extractor.calculateBestNodeBasedOnClustering(article) match {
+ extractor.calculateBestNodeBasedOnClustering(article, lang) match {
case Some(node: Element) => {
article.topNode = node
article.movies = extractor.extractVideos(article.topNode)
@@ -102,12 +103,10 @@ class Crawler(config: Configuration) {
}
}
}
- article.topNode = extractor.postExtractionCleanup(article.topNode)
+ article.topNode = extractor.postExtractionCleanup(article.topNode, lang)




- article.cleanedArticleText = outputFormatter.getFormattedText(article.topNode)
+ article.cleanedArticleText = outputFormatter.getFormattedText(article.topNode, lang)
}
case _ => trace("NO ARTICLE FOUND")
}
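The added `lang = crawlCandidate.lang` line is a value definition inside the `for` comprehension, so the language code travels with each crawl candidate and is in scope for every later step. The pattern can be sketched as follows; `CrawlCandidate`, `cleanUrl`, and `describe` are simplified, hypothetical stand-ins chosen for illustration, not the goose classes.

```scala
object LangThreadingSketch {
  // Hypothetical, reduced version of a crawl candidate: just a URL and a language code.
  final case class CrawlCandidate(url: String, lang: String)

  // Stand-in for a URL-cleaning step that can fail (None for blank or null input).
  def cleanUrl(url: String): Option[String] =
    Option(url).map(_.trim).filter(_.nonEmpty)

  def describe(candidate: CrawlCandidate): Option[String] =
    for {
      cleaned <- cleanUrl(candidate.url) // generator: stops here if cleaning fails
      lang = candidate.lang              // value definition: visible to all later steps
    } yield s"$cleaned [$lang]"
}
```

If any generator yields `None`, the body never runs, so `lang` is only read for candidates that survived the earlier steps.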
33 changes: 18 additions & 15 deletions src/main/scala/com/gravity/goose/extractors/ContentExtractor.scala
@@ -191,7 +191,8 @@ trait ContentExtractor {
* @return
*/

- def calculateBestNodeBasedOnClustering(article: Article): Option[Element] = {
+ def calculateBestNodeBasedOnClustering(article: Article,
+ lang:String): Option[Element] = {
trace(logPrefix + "Starting to calculate TopNode")
val doc = article.doc
var topNode: Element = null
@@ -203,7 +204,7 @@ trait ContentExtractor {
val nodesWithText = mutable.Buffer[Element]()
for (node <- nodesToCheck) {
val nodeText: String = node.text
- val wordStats: WordStats = StopWords.getStopWordCount(nodeText)
+ val wordStats: WordStats = StopWords.getStopWordCount(nodeText, lang)
val highLinkDensity: Boolean = isHighLinkDensity(node)
if (wordStats.getStopWordCount > 2 && !highLinkDensity) {
nodesWithText.add(node)
@@ -217,7 +218,7 @@ trait ContentExtractor {

for (node <- nodesWithText) {
var boostScore: Float = 0
- if (isOkToBoost(node)) {
+ if (isOkToBoost(node, lang)) {
if (cnt >= 0) {
boostScore = ((1.0 / startingBoost) * 50).asInstanceOf[Float]
startingBoost += 1
@@ -237,7 +238,7 @@ trait ContentExtractor {
trace(logPrefix + "Location Boost Score: " + boostScore + " on interation: " + i + "' id='" + node.parent.id + "' class='" + node.parent.attr("class"))

val nodeText: String = node.text
- val wordStats: WordStats = StopWords.getStopWordCount(nodeText)
+ val wordStats: WordStats = StopWords.getStopWordCount(nodeText, lang)
val upscore: Int = (wordStats.getStopWordCount + boostScore).asInstanceOf[Int]
updateScore(node.parent, upscore)
updateScore(node.parent.parent, upscore / 2)
@@ -292,7 +293,7 @@ trait ContentExtractor {
* @param node
* @return
*/
- private def isOkToBoost(node: Element): Boolean = {
+ private def isOkToBoost(node: Element, lang: String): Boolean = {
val para = "p"
var stepsAway: Int = 0
val minimumStopWordCount = 5
@@ -306,7 +307,7 @@ trait ContentExtractor {
return false
}
val paraText: String = currentNode.text
- val wordStats: WordStats = StopWords.getStopWordCount(paraText)
+ val wordStats: WordStats = StopWords.getStopWordCount(paraText, lang)
if (wordStats.getStopWordCount > minimumStopWordCount) {
trace(logPrefix + "We're gonna boost this node, seems contenty")
return true
@@ -491,10 +492,10 @@ trait ContentExtractor {
* @param targetNode
* @return
*/
- def postExtractionCleanup(targetNode: Element): Element = {
+ def postExtractionCleanup(targetNode: Element, lang: String): Element = {

trace(logPrefix + "Starting cleanup Node")
- val node = addSiblings(targetNode)
+ val node = addSiblings(targetNode, lang)
for {
e <- node.children
if (e.tagName != "p")
@@ -534,7 +535,9 @@ trait ContentExtractor {
* @param currentSibling
* @return
*/
- def getSiblingContent(currentSibling: Element, baselineScoreForSiblingParagraphs: Int): Option[String] = {
+ def getSiblingContent(currentSibling: Element,
+ baselineScoreForSiblingParagraphs: Int,
+ lang: String): Option[String] = {

if (currentSibling.tagName == "p" && currentSibling.text.length() > 0) {
Some(currentSibling.outerHtml)
@@ -549,7 +552,7 @@ trait ContentExtractor {
Some((for {
firstParagraph <- potentialParagraphs
if (firstParagraph.text.length() > 0)
- wordStats: WordStats = StopWords.getStopWordCount(firstParagraph.text)
+ wordStats: WordStats = StopWords.getStopWordCount(firstParagraph.text, lang)
paragraphScore: Int = wordStats.getStopWordCount
siblingBaseLineScore: Double = .30
if ((baselineScoreForSiblingParagraphs * siblingBaseLineScore).toDouble < paragraphScore)
@@ -578,14 +581,14 @@ trait ContentExtractor {
b
}

- private def addSiblings(topNode: Element): Element = {
+ private def addSiblings(topNode: Element, lang: String): Element = {

trace(logPrefix + "Starting to add siblings")

- val baselineScoreForSiblingParagraphs: Int = getBaselineScoreForSiblings(topNode)
+ val baselineScoreForSiblingParagraphs: Int = getBaselineScoreForSiblings(topNode, lang)
val results = walkSiblings(topNode) {
currentNode => {
- getSiblingContent(currentNode, baselineScoreForSiblingParagraphs)
+ getSiblingContent(currentNode, baselineScoreForSiblingParagraphs, lang)

}
}.reverse.flatMap(itm => itm)
@@ -602,15 +605,15 @@ trait ContentExtractor {
* @param topNode
* @return
*/
- private def getBaselineScoreForSiblings(topNode: Element): Int = {
+ private def getBaselineScoreForSiblings(topNode: Element, lang: String): Int = {
var base: Int = 100000
var numberOfParagraphs: Int = 0
var scoreOfParagraphs: Int = 0
val nodesToCheck: Elements = topNode.getElementsByTag("p")

for (node <- nodesToCheck) {
val nodeText: String = node.text
- val wordStats: WordStats = StopWords.getStopWordCount(nodeText)
+ val wordStats: WordStats = StopWords.getStopWordCount(nodeText, lang)
val highLinkDensity: Boolean = isHighLinkDensity(node)
if (wordStats.getStopWordCount > 2 && !highLinkDensity) {
numberOfParagraphs += 1;
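Every call site in this file now passes `lang` to `StopWords.getStopWordCount`, and the `getStopWordCount > 2` checks decide whether a node counts as content. The sketch below illustrates why the language matters for that check. It is an assumption-laden toy: the tiny word lists, the `WordStats` case class, and the `looksLikeContent` helper are invented here for illustration and do not reproduce the goose implementation or its stop-word lists.

```scala
object StopWordSketch {
  // Illustrative stop-word lists only; real lists would be far larger.
  private val stopWordsByLang: Map[String, Set[String]] = Map(
    "en" -> Set("the", "a", "of", "and", "to", "in", "is", "it"),
    "de" -> Set("der", "die", "das", "und", "zu", "in", "ist", "es")
  )

  // Hypothetical stand-in for the WordStats type used in the diff.
  final case class WordStats(stopWordCount: Int, wordCount: Int)

  def getStopWordCount(text: String, lang: String): WordStats = {
    // Split on anything that is not a letter, so punctuation is dropped.
    val words = text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty)
    // Unknown language codes fall back to an empty list (an assumption of this sketch).
    val stopWords = stopWordsByLang.getOrElse(lang, Set.empty[String])
    WordStats(words.count(stopWords.contains), words.length)
  }

  // Mirrors the `> 2` threshold from the diff; link density is ignored here.
  def looksLikeContent(text: String, lang: String): Boolean =
    getStopWordCount(text, lang).stopWordCount > 2
}
```

With these lists, a German sentence such as "Der Hund ist in dem Haus und es regnet." passes the threshold when scored as `"de"` but not when scored as `"en"`, which is the kind of misclassification a per-call language parameter avoids.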
@@ -49,11 +49,11 @@ trait OutputFormatter {
* @param topNode the top most node to format
* @return the prepared Element
*/
- @Deprecated def getFormattedElement(topNode: Element): Element = {
+ @Deprecated def getFormattedElement(topNode: Element, lang: String): Element = {
removeNodesWithNegativeScores(topNode)
convertLinksToText(topNode)
replaceTagsWithText(topNode)
- removeParagraphsWithFewWords(topNode)
+ removeParagraphsWithFewWords(topNode, lang)
topNode
}

@@ -62,11 +62,11 @@ trait OutputFormatter {
* @param topNode the top most node to format
* @return a formatted string with all HTML removed
*/
- def getFormattedText(topNode: Element): String = {
+ def getFormattedText(topNode: Element, lang: String): String = {
removeNodesWithNegativeScores(topNode)
convertLinksToText(topNode)
replaceTagsWithText(topNode)
- removeParagraphsWithFewWords(topNode)
+ removeParagraphsWithFewWords(topNode, lang)
convertToText(topNode)
}

@@ -173,7 +173,7 @@ trait OutputFormatter {
/**
* remove paragraphs that have less than x number of words, would indicate that it's some sort of link
*/
- private def removeParagraphsWithFewWords(topNode: Element) {
+ private def removeParagraphsWithFewWords(topNode: Element, lang: String) {
if (topNode != null) {
if (logger.isDebugEnabled) {
logger.debug("removeParagraphsWithFewWords starting...")
@@ -183,7 +183,7 @@ trait OutputFormatter {

for (el <- allNodes) {
try {
- val stopWords = StopWords.getStopWordCount(el.text)
+ val stopWords = StopWords.getStopWordCount(el.text, lang)
if (stopWords.getStopWordCount < 3 && el.getElementsByTag("object").size == 0 && el.getElementsByTag("embed").size == 0) {
logger.debug("removeParagraphsWithFewWords - swcnt: %d removing text: %s".format(stopWords.getStopWordCount, el.text()))
el.remove()
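The formatter drops any element whose text has fewer than three stop words in the given language, unless it contains an `object` or `embed` tag. A rough model of that filter on plain strings is sketched below. The `StopWordCounter` type and `keepContentParagraphs` function are hypothetical names introduced for this example, and the `object`/`embed` exemption is deliberately left out.

```scala
object ParagraphFilterSketch {
  // Hypothetical counter: (text, lang) => number of stop words found for that language.
  type StopWordCounter = (String, String) => Int

  // Keeps paragraphs with at least three stop words, the complement of the
  // `getStopWordCount < 3` removal condition shown in the diff.
  def keepContentParagraphs(paragraphs: Seq[String], lang: String)(
      count: StopWordCounter): Seq[String] =
    paragraphs.filter(p => count(p, lang) >= 3)
}
```

Passing the counter as a function keeps the sketch independent of any particular stop-word source; in the diff that role is played by `StopWords.getStopWordCount(el.text, lang)`.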