Website Scraper Plugin Only Scanning Homepage #17
Comments
Hey there! I would love to try to solve this issue.
Hey @YashBaviskar1, assigning this issue to you, happy coding! If any assistance is required, meet our team here: https://join.slack.com/t/theailounge/shared_invite/zt-2ogkrruyf-FPOHuPr5hdqXl34bDWjHjw
@YashBaviskar1 any updates on this?
Hey there! Yes, I was looking through the code base; from what I understand, I have to change url_reader.py. Sorry for the delay.
@YashBaviskar1, Recursion might work, but it's risky for web scrapers due to infinite loops or stack overflow from nested or circular links. I would also suggest using a data structure like a queue. Here's how it can work:
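A minimal sketch of the queue-based (breadth-first) crawl being suggested, assuming requests and beautifulsoup4 are available; the function and variable names here are illustrative, not the plugin's actual API:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, scan_children=False):
    visited = set()              # URLs already fetched, so circular links stop here
    queue = deque([start_url])   # FIFO queue of URLs still to fetch
    pages = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.content, "html.parser")
        pages.append({"url": url, "text": soup.get_text()})
        if scan_children:
            for a in soup.find_all("a", href=True):
                child = urljoin(url, a["href"])  # resolve relative links
                if child not in visited:
                    queue.append(child)
    return pages

Because the queue is processed iteratively, there is no recursion depth to worry about, and the visited set guarantees each URL is fetched at most once.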
Yup, thank you! I will try to do what you have said, it is pretty intuitive.
@ashmilhussain @agberoz, hello there! I tried to make the amendments you suggested, and this is what I have come up with:
import logging
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

logger = logging.getLogger(__name__)  # stands in for the module's existing logger


class Website:
    """
    Website class for interacting with website data.
    """

    def __init__(self, website_url: str, is_scan_child: bool = False):
        self.connection = {}
        self.is_scan_child = is_scan_child  # default is False: scan only the given page
        self.params = {
            'url': website_url,
        }

    def fetch_data(self):
        base_reader = UrlReader({
            "type": "url",
            "path": [self.params.get('url')],
            "is_scan_child": self.is_scan_child
        })
        data = base_reader.load()
        return data


class UrlReader:
    def __init__(self, source):
        self.source = source
        self.visited_url = set()  # URLs already fetched, so circular links are not re-crawled

    def load(self):
        out = []
        is_scan_child = self.source.get("is_scan_child", False)
        url_queue = []  # queue capturing sub-URLs still to be fetched
        if "path" in self.source:
            urls = self.source["path"]
            for url in urls:
                url_queue.append(url)
        while url_queue:
            url = url_queue.pop(0)
            if url in self.visited_url:
                continue
            self.visited_url.add(url)
            try:
                response = requests.get(url)
                if response.status_code == 200:
                    soup = BeautifulSoup(response.content, 'html.parser')
                    if is_scan_child:
                        # queue every link on the page, resolved to an absolute URL
                        for a in soup.find_all('a', href=True):
                            absolute_url = urljoin(url, a['href'])
                            if absolute_url not in self.visited_url:
                                url_queue.append(absolute_url)
                        # print(f"Sub URLs are: {url_queue}")
                    tag = soup.body
                    text = ''.join(list(tag.strings)[:-1])
                    metadata = {
                        "path": url
                    }
                    out.append({"content": str(text), "metadata": metadata})
                else:
                    logger.critical(f"Failed to retrieve content, status code: {response.status_code}")
            except Exception as e:
                logger.error(e)
        return out

Since I thought we are mostly dealing with subpages, I added this line (note the new import too, from urllib.parse import urljoin):

absolute_url = urljoin(url, a['href'])

I do not know if it is necessary or not.
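For reference, urljoin resolves an href relative to the page it was found on and leaves already-absolute URLs untouched; a small standalone illustration (not part of the plugin code):

from urllib.parse import urljoin

# A relative href is resolved against the current page's URL
urljoin("https://example.com/docs/index.html", "guide.html")
# -> 'https://example.com/docs/guide.html'

# An href that is already absolute comes back unchanged
urljoin("https://example.com/docs/index.html", "https://other.example.org/page")
# -> 'https://other.example.org/page'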
@agberoz check this.
@YashBaviskar1 It's looking good! A suggestion: When adding child URLs to the queue, please ensure they start with the base URL to prevent the crawler from navigating to other domains. |
@agberoz

from urllib.parse import urlparse

# Base domain of the FIRST (base) URL
base_domain = urlparse(urls[0]).netloc

and then add this condition when appending the child URL:

if is_scan_child:
    for a in soup.find_all('a', href=True):
        absolute_url = urljoin(url, a['href'])
        if absolute_url not in self.visited_url and urlparse(absolute_url).netloc == base_domain:
            url_queue.append(absolute_url)

If the child URL's domain does not match the base domain, it is skipped instead of being added to the queue.
Ok, I will soon send a PR if everything works! Thank you.
The website scraper plugin currently scans only the homepage. Modify the plugin to scrape either a single page or subpages, based on a flag.
This flag should control whether the crawler scans just the homepage or includes subpages as well, providing more flexibility in how pages are crawled.
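For illustration, a sketch of how such a flag could be exercised through the Website class discussed in the comments above (the is_scan_child name comes from that discussion and is not a final API):

# Default: scan only the given page
website = Website("https://example.com")
single_page = website.fetch_data()

# With the flag: also crawl same-domain subpages
website = Website("https://example.com", is_scan_child=True)
all_pages = website.fetch_data()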