
Website Scraper Plugin Only Scanning Homepage #17

Open
ashmilhussain opened this issue Sep 26, 2024 · 11 comments

@ashmilhussain
Contributor

The website scraper plugin currently only scans the homepage. Modify the plugin to scrape either a single page or its subpages based on a flag.
This flag should control whether the crawler scans just the homepage or includes subpages as well, providing more flexibility in how pages are crawled.

Use flag: is_scan_child
'true': scrape both the main page and subpages
'false': scrape only the main page
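
A rough sketch of the intended behaviour, assuming the plugin's existing Website wrapper simply gains the flag (the example URL is a placeholder):

# Hypothetical usage once the flag exists; example.com is a placeholder.
Website("https://example.com", is_scan_child=True).fetch_data()   # scrape the main page and its subpages
Website("https://example.com", is_scan_child=False).fetch_data()  # scrape only the main page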

ashmilhussain added the "good first issue" label on Sep 26, 2024
@YashBaviskar1

Hey there! I would love to try to solve this issue.
Maybe I will try to recursively scan the page whenever the flag is true. Thank you!

@ashmilhussain
Contributor Author

Hey @YashBaviskar1,

Assigning this issue to you, happy coding!

If any assistance is required, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw

@agberoz
Collaborator

agberoz commented Oct 3, 2024

@YashBaviskar1 any updates on this?

@YashBaviskar1

Hey there! Yes, I was looking through the codebase; from what I understand, I have to change url_reader.py.
I was figuring out how to make def load(self) in class UrlReader(DocsReader) call itself recursively. Is this the right approach?

Sorry for the delay

@agberoz
Collaborator

agberoz commented Oct 4, 2024

@YashBaviskar1, recursion might work, but it's risky for web scrapers: nested or circular links can lead to infinite loops or stack overflows.

I would also suggest using a data structure like a queue. Here's how it can work (a rough sketch follows the list):

  1. Start with the base URL.
  2. Add child URLs to the queue as you process each page.
  3. Ensure that each URL is only visited once (you can use a set to track visited URLs).
  4. Process each URL iteratively from the queue.
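
A minimal sketch of that loop, assuming requests and BeautifulSoup are available; the crawl function name, return shape, and example URL are placeholders, not the plugin's API:

# Illustrative breadth-first crawl; names here are placeholders, not the plugin's API.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(base_url):
    url_queue = [base_url]      # 1. start with the base URL
    visited = set()             # 3. track visited URLs so each is processed only once
    pages = []
    while url_queue:            # 4. process URLs iteratively from the queue
        url = url_queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append((url, soup))
        for a in soup.find_all('a', href=True):    # 2. add child URLs as each page is processed
            child_url = urljoin(url, a['href'])
            if child_url not in visited:
                url_queue.append(child_url)
    return pages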

@YashBaviskar1

Yup, thank you! I will try to do what you have said; it is pretty intuitive.

@YashBaviskar1

YashBaviskar1 commented Oct 8, 2024

@ashmilhussain @agberoz, hello there! I tried to make the changes you suggested, and this is what I have come up with:

  • In class Website I added the is_scan_child flag in __init__ and changed fetch_data accordingly:
class Website:
    """
    Website class for interacting with website data.
    """
    def __init__(self, website_url: str, is_scan_child: bool = False):
        self.connection = {}
        self.is_scan_child = is_scan_child  # default is False
        self.params = {
            'url': website_url,
        }

    def fetch_data(self):
        base_reader = UrlReader({
            "type": "url",
            "path": [self.params.get('url')],
            "is_scan_child": self.is_scan_child
        })
        data = base_reader.load()
        return data
  • Now in class UrlReader I did as you instructed: I initialized a url_queue queue and a visited_url set, and used find_all to collect the href links and append them to the queue when is_scan_child is true. The loop iterates through the entire queue, while the set keeps track of visited links.
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)  # stand-in for the project's logger


class UrlReader:
    def __init__(self, source):
        self.source = source
        self.visited_url = set()

    def load(self):
        out = []
        is_scan_child = self.source.get("is_scan_child", False)
        url_queue = []  # queue of URLs still to be crawled
        if "path" in self.source:
            urls = self.source["path"]
            for url in urls:
                url_queue.append(url)
            while url_queue:
                url = url_queue.pop(0)
                if url in self.visited_url:
                    continue
                self.visited_url.add(url)
                try:
                    response = requests.get(url)
                    if response.status_code == 200:
                        soup = BeautifulSoup(response.content, 'html.parser')
                        if is_scan_child:
                            # queue up child links found on this page
                            for a in soup.find_all('a', href=True):
                                absolute_url = urljoin(url, a['href'])
                                if absolute_url not in self.visited_url:
                                    url_queue.append(absolute_url)
                        tag = soup.body
                        text = ''.join(list(tag.strings)[:-1])
                        metadata = {
                            "path": url
                        }
                        out.append({"content": str(text), "metadata": metadata})
                    else:
                        logger.critical(f"Failed to retrieve content, status code: {response.status_code}")
                except Exception as e:
                    logger.error(e)
        return out

Since I thought we are mostly dealing with subpages, I added this line (note the new import too):

from urllib.parse import urljoin

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
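
For context, urljoin resolves relative hrefs against the current page's URL, which is the common case for subpage links (example.com is a placeholder):

from urllib.parse import urljoin

urljoin("https://example.com/docs/intro", "setup.html")            # 'https://example.com/docs/setup.html'
urljoin("https://example.com/docs/intro", "/pricing")              # 'https://example.com/pricing'
urljoin("https://example.com/docs/intro", "https://other.org/x")   # absolute hrefs pass through unchanged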

@ashmilhussain
Contributor Author

@agberoz check this

@agberoz
Collaborator

agberoz commented Oct 11, 2024

@YashBaviskar1 It's looking good!

A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains.

@YashBaviskar1

YashBaviskar1 commented Oct 12, 2024

@agberoz
Sure, we can do that using .netloc to extract and compare the domain of the URLs:

from urllib.parse import urlparse

# base domain of the first (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URL:

if is_scan_child:
    for a in soup.find_all('a', href=True):
        absolute_url = urljoin(url, a['href'])
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)

If the absolute_url is from a different domain, it will not be added to the url_queue, and hence the crawler will stay on the base domain.
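
For reference, this is what the netloc comparison sees (placeholder URLs); note that an exact match also excludes subdomains such as blog.example.com:

from urllib.parse import urlparse

urlparse("https://example.com/docs/page").netloc    # 'example.com' -> kept
urlparse("https://other.org/docs/page").netloc      # 'other.org' -> skipped
urlparse("https://blog.example.com/post").netloc    # 'blog.example.com' -> also skipped by an exact match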

@YashBaviskar1

OK, I will soon send a PR if everything works! Thank you.
