
Tweak wording for duplicated content #359

Merged
merged 1 commit into from Apr 5, 2018
Tweak wording for duplicated content
hadley committed Mar 28, 2018
commit 8e06b71f4d66057b3f69254c89d76f1ab259bfc7
README.md: 4 changes (1 addition & 3 deletions)
@@ -263,9 +263,7 @@ Default is `0`.

#### Duplicated content

-It could happen that the crawled website returned duplicated data. Most of the time, this is because the crawled pages got the same urls with two different schemes.
-
-If we have URLs like `http://website.com/page` and `http://website.com/page/` (notice the second one ending with `/`), the scrapper will consider them as different. This can be fixed by adding a regex to the `stop_urls` in the `config.json`:
+It could happen that the crawled website contains duplicated data. Most of the time this is because the same page was crawled from different URLs. If we have URLs like `http://website.com/page` and `http://website.com/page/` (notice the second one ending with `/`), the scraper will consider them as different. This can be fixed by adding a regex to the `stop_urls` in the `config.json`:

```json
"stop_urls": [
```
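The rest of the `stop_urls` array lies outside this hunk, so the exact pattern is not shown in the diff. As a rough sketch of the fix the paragraph describes, assuming the duplicate is the trailing-slash variant of a page, the entry might look something like this (the URL and regex below are illustrative, not taken from the repository):

```json
"stop_urls": [
  "https://website.com/page/$"
]
```

Any URL matching the regex (here, the version ending in `/`) is skipped by the scraper, so only one variant of the page gets indexed.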