
Scraping n levels deep #112

Closed
rawmean opened this issue Apr 30, 2024 · 5 comments

Comments

@rawmean

rawmean commented Apr 30, 2024

Is your feature request related to a problem? Please describe.
I'd like to scrape a website n-levels deep.

Describe the solution you'd like
For example, given url = example.com, the scraper should also follow the links on example.com and scrape those pages too

Describe alternatives you've considered
I can use BeautifulSoup to download the pages and then feed them to this library

@PeriniM
Collaborator

PeriniM commented Apr 30, 2024

Hey @rawmean, we will add it to the to-do list for feature requests! It would be interesting to create a new graph for this, maybe calling it CrawlerGraph or DeepScraperGraph

@PeriniM PeriniM reopened this Apr 30, 2024
@ScrapeGraphAI ScrapeGraphAI deleted a comment from VinciGit00 Apr 30, 2024
@mayurdb
Contributor

mayurdb commented May 7, 2024

I'll try to take a stab at it.
This is what I'm thinking:
Input: URL

  1. FetchNode
  2. ParseNode
  3. RAGNode
  4. SearchLinkNode -> Get all the links on the page
  5. (new) LinkFilterNode -> Filter the links, keeping only the potentially relevant ones
  6. (new) RepeaterNode -> Executes the graph from the child node onwards, once for each input link, in parallel
  7. FetchNode
  8. ParseNode
  9. RAGNode
  10. (new) ContainsAnswerNode -> A new node type that can tell if the current content contains the answer
  11. (new) ConditionalNode -> A new node with two children: if the parent returns true, pick child 1, else pick child 2
    12a. GenerateAnswerNode
    12b. Go to step 4 for the next level of depth

Let me know if this looks reasonable, or if you have another plan or a better alternative in mind
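The control flow of the steps above can be sketched in plain Python. Everything here is a hypothetical stand-in (the fake `SITE` dict and the stub functions are not the real ScrapeGraphAI nodes); it only illustrates the repeat/contains-answer/conditional loop:

```python
# Pages are faked as a dict: url -> (content, outgoing links).
SITE = {
    "example.com":        ("landing page",     ["example.com/a", "example.com/b"]),
    "example.com/a":      ("nothing here",     ["example.com/a/deep"]),
    "example.com/b":      ("still nothing",    []),
    "example.com/a/deep": ("the answer is 42", []),
}

def fetch_and_parse(url):                # stand-in for FetchNode + ParseNode
    return SITE.get(url, ("", []))

def contains_answer(content, question):  # stand-in for ContainsAnswerNode
    return question in content

def deep_scrape(url, question, max_depth):
    """Breadth-first version of the proposed loop: stop at the first page
    whose content answers the question, or give up after max_depth levels."""
    frontier = [url]
    for _ in range(max_depth + 1):
        next_frontier = []
        for page in frontier:                       # RepeaterNode: one run per link
            content, links = fetch_and_parse(page)
            if contains_answer(content, question):  # ConditionalNode, branch 12a
                return page, content                # GenerateAnswerNode would run here
            next_frontier.extend(links)             # branch 12b: next depth level
        frontier = next_frontier
    return None, None

url, content = deep_scrape("example.com", "answer", max_depth=2)
# url -> "example.com/a/deep"
```

A real implementation would also need to deduplicate already-visited URLs so that cyclic links do not cause infinite re-fetching.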

@VinciGit00
Collaborator

Yeah, please contact me through email (mvincig11@gmail.com)

@ChrisDelClea

Sounds really interesting.

@davideuler

I am looking for this feature too. There are two use cases:
1. Loop through several path levels of a website to extract information from all item pages, e.g. all shop item information, or all rental house prices and locations. In this case, I can specify which paths will be processed using regular expressions.
2. Loop through all pages of a small website. It behaves like a crawler such as Nutch, while I can specify what to extract from each page: there is a prompt to match the target page, and a prompt to get data/files from that page. Sometimes I need to crawl all videos/images matching a specified condition for the website.
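The path-restriction part of use case 1 can be sketched with the standard `re` module. The patterns and link list here are made-up examples, not anything from ScrapeGraphAI:

```python
import re

# Hypothetical user-supplied path patterns: only item pages and rental
# pages should be crawled; everything else is skipped.
ALLOWED_PATTERNS = [
    re.compile(r"^/shop/items/\d+$"),
    re.compile(r"^/rentals/[\w-]+$"),
]

def should_process(path: str) -> bool:
    """Return True if the path matches any allowed pattern."""
    return any(p.match(path) for p in ALLOWED_PATTERNS)

links = ["/shop/items/42", "/about", "/rentals/lake-house", "/shop/cart"]
item_pages = [link for link in links if should_process(link)]
# item_pages -> ["/shop/items/42", "/rentals/lake-house"]
```

A filter like this would slot naturally into the LinkFilterNode proposed above, alongside an LLM-based relevance check.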
