
Website Scraper Plugin Only Scanning Homepage #17

Open
ashmilhussain opened this issue Sep 26, 2024 · 11 comments

@ashmilhussain
Contributor

The website scraper plugin currently only scans the homepage. Modify the plugin to scrape either a single page or its subpages based on a flag.
This flag should control whether the crawler scans just the homepage or includes subpages as well, providing more flexibility in how pages are crawled.

Use flag: is_scan_child
'true': scrape both the main page and subpages
'false': scrape only the main page
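
A rough sketch of the intended behaviour, assuming the plugin's existing Website wrapper simply gains the flag (the example URL is a placeholder):

# Hypothetical usage once the flag exists; example.com is a placeholder.
Website("https://example.com", is_scan_child=True).fetch_data()   # scrape the main page and its subpages
Website("https://example.com", is_scan_child=False).fetch_data()  # scrape only the main page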

ashmilhussain added the "good first issue" label on Sep 26, 2024
@YashBaviskar1

Hey there! I would love to try to solve this issue.
Maybe I will try to recursively scan the page whenever the flag is true. Thank you!

@ashmilhussain
Contributor Author

Hey @YashBaviskar1,

Assigning this issue to you, happy coding!

If any assistance is required, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw

@agberoz
Collaborator

agberoz commented Oct 3, 2024

@YashBaviskar1 any updates on this?

@YashBaviskar1

Hey there! Yes, I was looking through the codebase; from what I understand, I have to change url_reader.py.
I was figuring out how to make def load(self) in class UrlReader(DocsReader) call itself recursively. Is this the right approach?

Sorry for the delay

@agberoz
Collaborator

agberoz commented Oct 4, 2024

@YashBaviskar1, recursion might work, but it's risky for web scrapers: nested or circular links can lead to infinite loops or stack overflows.

I would also suggest using a data structure like a queue. Here's how it can work (a rough sketch follows the list):

  1. Start with the base URL.
  2. Add child URLs to the queue as you process each page.
  3. Ensure that each URL is only visited once (you can use a set to track visited URLs).
  4. Process each URL iteratively from the queue.
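
A minimal sketch of that loop, assuming requests and BeautifulSoup are available; the crawl function name, return shape, and example URL are placeholders, not the plugin's API:

# Illustrative breadth-first crawl; names here are placeholders, not the plugin's API.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(base_url):
    url_queue = [base_url]      # 1. start with the base URL
    visited = set()             # 3. track visited URLs so each is processed only once
    pages = []
    while url_queue:            # 4. process URLs iteratively from the queue
        url = url_queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, 'html.parser')
        pages.append((url, soup))
        for a in soup.find_all('a', href=True):    # 2. add child URLs as each page is processed
            child_url = urljoin(url, a['href'])
            if child_url not in visited:
                url_queue.append(child_url)
    return pages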

@YashBaviskar1

Yup, thank you! I will try to do what you have said; it is pretty intuitive.

@YashBaviskar1

YashBaviskar1 commented Oct 8, 2024

@ashmilhussain @agberoz, hello there! I tried to make the changes you suggested, and this is what I have come up with:

  • In class Website I added the is_scan_child flag in __init__ and changed fetch_data accordingly:
class Website:
    """
    Website class for interacting with website data.
    """
    def __init__(self, website_url: str, is_scan_child: bool = False):
        self.connection = {}
        self.is_scan_child = is_scan_child  # default is False
        self.params = {
            'url': website_url,
        }

    def fetch_data(self):
        base_reader = UrlReader({
            "type": "url",
            "path": [self.params.get('url')],
            "is_scan_child": self.is_scan_child
        })
        data = base_reader.load()
        return data
  • Now in class UrlReader I did as you instructed: I initialized a url_queue queue and a visited_url set, and used find_all to collect the href links and append them to the queue when is_scan_child is true. The loop iterates through the entire queue, while the set keeps track of visited links.
import logging
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)  # stand-in for the project's logger


class UrlReader:
    def __init__(self, source):
        self.source = source
        self.visited_url = set()

    def load(self):
        out = []
        is_scan_child = self.source.get("is_scan_child", False)
        url_queue = []  # queue of URLs still to be crawled
        if "path" in self.source:
            urls = self.source["path"]
            for url in urls:
                url_queue.append(url)
            while url_queue:
                url = url_queue.pop(0)
                if url in self.visited_url:
                    continue
                self.visited_url.add(url)
                try:
                    response = requests.get(url)
                    if response.status_code == 200:
                        soup = BeautifulSoup(response.content, 'html.parser')
                        if is_scan_child:
                            # queue up child links found on this page
                            for a in soup.find_all('a', href=True):
                                absolute_url = urljoin(url, a['href'])
                                if absolute_url not in self.visited_url:
                                    url_queue.append(absolute_url)
                        tag = soup.body
                        text = ''.join(list(tag.strings)[:-1])
                        metadata = {
                            "path": url
                        }
                        out.append({"content": str(text), "metadata": metadata})
                    else:
                        logger.critical(f"Failed to retrieve content, status code: {response.status_code}")
                except Exception as e:
                    logger.error(e)
        return out

Since I thought we are mostly dealing with subpages, I added this line (note the new import too):

from urllib.parse import urljoin

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
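
For context, urljoin resolves relative hrefs against the current page's URL, which is the common case for subpage links (example.com is a placeholder):

from urllib.parse import urljoin

urljoin("https://example.com/docs/intro", "setup.html")            # 'https://example.com/docs/setup.html'
urljoin("https://example.com/docs/intro", "/pricing")              # 'https://example.com/pricing'
urljoin("https://example.com/docs/intro", "https://other.org/x")   # absolute hrefs pass through unchanged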

@ashmilhussain
Contributor Author

@agberoz check this

@agberoz
Collaborator

agberoz commented Oct 11, 2024

@YashBaviskar1 It's looking good!

A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains.

@YashBaviskar1

YashBaviskar1 commented Oct 12, 2024

@agberoz
Sure, we can do that using .netloc to extract and compare the domain of the URLs:

from urllib.parse import urlparse

# base domain of the first (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URL:

if is_scan_child:
    for a in soup.find_all('a', href=True):
        absolute_url = urljoin(url, a['href'])
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)

If the absolute_url is from a different domain, it will not be added to the url_queue, and hence the crawler will stay on the base domain.
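
For reference, this is what the netloc comparison sees (placeholder URLs); note that an exact match also excludes subdomains such as blog.example.com:

from urllib.parse import urlparse

urlparse("https://example.com/docs/page").netloc    # 'example.com' -> kept
urlparse("https://other.org/docs/page").netloc      # 'other.org' -> skipped
urlparse("https://blog.example.com/post").netloc    # 'blog.example.com' -> also skipped by an exact match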

@YashBaviskar1

OK, I will soon send a PR if everything works! Thank you.
