URIs can contain invalid characters if they are built by the website using article titles or similar. That works when clicked in a browser, but it results in an invalid link when crawled by Anemone.

I added URI.escape to avoid that. Tested on about 20 websites without issues.
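
A minimal sketch of the failure and the fix described above, runnable outside Anemone. `URI.escape` was removed in Ruby 3.0, so this sketch assumes `URI::RFC2396_Parser#escape` as a stand-in for it; `parse_href` and the example URL are hypothetical and not part of Anemone.

```ruby
require 'uri'

# Hypothetical helper, not Anemone code: returns a URI object,
# or nil when the href cannot be parsed.
def parse_href(href, escape: true)
  # Stand-in for the removed URI.escape (assumption of this sketch).
  href = URI::RFC2396_Parser.new.escape(href) if escape
  URI(href)
rescue URI::InvalidURIError
  nil
end

# An href built from an article title, with a raw space in the path.
parse_href('http://example.com/my article', escape: false)  # => nil
parse_href('http://example.com/my article').to_s            # => "http://example.com/my%20article"
```

Without escaping, the link is silently skipped, which matches the `rescue next` in `links`; with escaping, the space becomes `%20` and the URI parses.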
lpradovera committed Apr 23, 2011
1 parent a7895d0 commit ef4f23a
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion lib/anemone/page.rb
@@ -62,7 +62,7 @@ def links
       doc.search("//a[@href]").each do |a|
         u = a['href']
         next if u.nil? or u.empty?
-        abs = to_absolute(URI(u)) rescue next
+        abs = to_absolute(URI(URI.escape(u))) rescue next
         @links << abs if in_domain?(abs)
       end
       @links.uniq!

1 comment on commit ef4f23a

@yatish27

But this does not crawl www.toolsberry.com/%23. It converted # to %23, and hence it gives a 404.
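
One possible way around the `#` problem, sketched under the assumption that the original href ended in `#`: escape the part before the first `#` and the fragment separately, then rejoin them, so the delimiter itself is never percent-encoded. `escape_keeping_fragment` is a hypothetical helper, not part of Anemone, and `URI::RFC2396_Parser#escape` again stands in for the removed `URI.escape`.

```ruby
require 'uri'

# Hypothetical helper, not Anemone code: escape an href while keeping
# the first "#" as the fragment delimiter instead of encoding it as %23.
def escape_keeping_fragment(href)
  parser = URI::RFC2396_Parser.new   # stand-in for the removed URI.escape
  base, fragment = href.split('#', 2)
  escaped = parser.escape(base.to_s)
  # fragment is nil when the href contains no "#" at all.
  fragment ? [escaped, parser.escape(fragment)].join('#') : escaped
end

# Plain escaping encodes the delimiter, which is the reported problem.
URI::RFC2396_Parser.new.escape('http://www.toolsberry.com/#')  # => "http://www.toolsberry.com/%23"

# The split-and-rejoin variant leaves it in place.
escape_keeping_fragment('http://www.toolsberry.com/#')         # => "http://www.toolsberry.com/#"
escape_keeping_fragment('http://example.com/my page#sec 1')    # => "http://example.com/my%20page#sec%201"
```

This sketch only addresses `#`. An href that is already percent-encoded would still be encoded a second time, because `%` is itself escaped.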
