[FEATURE REQUEST] Spider mode for feroxbuster #407
Does a single word in the word list and extract links do what you're looking for?
It does; that was essentially the alternative method I mentioned. An option like this would really just be syntactic sugar to make it easier to express that type of behavior, but I can also completely understand not wanting to pollute the interface with extra options that aren't really necessary (especially if it's non-trivial to implement).
That's what I thought you meant, just wanted to be sure. I'm not against adding it. I think the tool has grown a bit beyond the original "simple" tagline, lol. I don't know of anyone who used it in this particular way, but that may be because it wasn't intuitive to invoke the behavior. Implementation-wise, I don't think it would be much beyond adding the flag. We'd need to handle the check for an empty wordlist, and after that just kinda see what the code does... lol. If it handles it mostly gracefully, that'd be basically it. I've been thinking about adding a 'bag of observed words' kind of thing. Similar to
For sure. Maybe it makes sense to stick to the "brute force" approach, just wanted to throw this out there as a possible enhancement. I've had some trouble finding a good spidering tool, and feroxbuster may be better than existing tools I've looked at even for that.
That's a neat idea. Would there be any filtering logic, or would it simply fetch everything that looks like a word and add it (I wonder if doing things like splitting existing links into words might generate good results)? It's also interesting to think about how a dynamically changing word list might work in practice. I imagine you could do an initial fetch and then have all requests share the same list for the duration of the scan, or are you thinking words would incrementally get added during scanning?
One idea would be to borrow from NLP. Each word would be filtered first against a set of stop words (the, is, am, was, etc.). After that, it'd be added to a structure that keeps track of its frequency in the page. We could then filter based on some pre-chosen TF-IDF (how important a word is in relation to the document) cutoff value. The other approach would be, just like you said, to simply add any not-previously-seen word to the wordlist.
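To make the idea concrete, here is a minimal sketch of the stop-word-plus-frequency-cutoff filtering described above. It is not feroxbuster code; the function name, the stop-word set, and the cutoff value are all illustrative. For simplicity it uses a plain term-frequency threshold within one page rather than a full TF-IDF score across a corpus.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical "bag of observed words" filter: drop stop words, count term
/// frequency within the page, and keep only words whose relative frequency
/// clears a cutoff. (A real TF-IDF score would also weight by how rare the
/// word is across all pages seen so far.)
fn extract_candidate_words(page_text: &str, cutoff: f64) -> Vec<String> {
    let stop_words: HashSet<&str> =
        ["the", "is", "am", "was", "a", "an", "and", "of", "to"].into();

    let mut counts: HashMap<String, usize> = HashMap::new();
    let mut total = 0usize;

    for word in page_text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        let w = word.to_lowercase();
        if stop_words.contains(w.as_str()) {
            continue; // filtered before it ever reaches the frequency table
        }
        total += 1;
        *counts.entry(w).or_insert(0) += 1;
    }

    let mut kept: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n as f64 / total as f64 >= cutoff)
        .map(|(w, _)| w)
        .collect();
    kept.sort();
    kept
}

fn main() {
    // "admin" dominates this toy page, so it survives the 0.5 cutoff while
    // the one-off words ("panel", "login", "for") are discarded.
    let page = "the admin panel is the admin login for the admin";
    println!("{:?}", extract_candidate_words(page, 0.5));
}
```

Splitting extracted links on non-alphanumeric characters, as discussed earlier in the thread, would feed path segments through the same filter for free.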
Very similar logic is already in extract links. It seems to work out pretty well in the way it's used now. I suspect it would still be useful.
I fear it'd be a non-trivial amount of work compared to how the wordlist works now.
I'm not entirely sure how you meant this, but the way I interpret it makes it sound like it'd be limited. Ideally, extracted words would be tried on every new directory and extracted from every new page, updating any future directory scans (basically, "words would incrementally get added during scanning" is how I think it ought to behave).
I think you're right that it would be pretty limited to populate it only once at the start of the scan. I guess I was mostly wondering how complex it would be to have a mutable word list that gets shared / updated by several concurrent processes that are all making requests, but maybe that's not a big deal depending on how it's implemented.
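The shared, growable wordlist is a fairly standard shared-state pattern in Rust. Here is a small sketch, assuming nothing about feroxbuster's internals: workers hold an `Arc<Mutex<HashSet<String>>>`, and any word a worker discovers becomes visible to every later directory scan. The function and word names are made up for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawn a few mock "scan workers" that each discover one new word and add
/// it to a wordlist shared behind Arc<Mutex<..>>. Returns the final size of
/// the list (2 seed words + 4 discovered words).
fn run_workers() -> usize {
    let wordlist = Arc::new(Mutex::new(HashSet::from([
        "admin".to_string(),
        "login".to_string(),
    ])));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let words = Arc::clone(&wordlist);
            thread::spawn(move || {
                // Stand-in for a word extracted from a fetched page.
                let discovered = format!("page{}", i);
                words.lock().unwrap().insert(discovered);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }

    let n = wordlist.lock().unwrap().len();
    n
}

fn main() {
    println!("wordlist now has {} entries", run_workers());
}
```

In an async runtime the shape is the same with `tokio`'s lock types; the main design question is lock contention, which a read-heavy workload could reduce with `RwLock` instead of `Mutex`.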
Sorry, I got a bit sidetracked. I looked at some other tools I've used in the past: I haven't looked at either one any time recently. Did you try either of these? It looks like hakrawler stripped out a lot of its initial functionality.
Thanks! I wasn't aware of these and will take a look.
I was mostly asking to see how they compared to you running feroxbuster for wordlist-less crawling.
Looks like feroxbuster gives about the same number of results as Photon and hakrawler based on a quick check. I did notice, though, that hakrawler was a lot faster than both feroxbuster and Photon, so maybe there are some opportunities to optimize the scan for feroxbuster. Here was the ferox command (single-slash.txt contains only the line "/"); it took 10-15 seconds to run on my computer:

~ $ feroxbuster -u https://www.yahoo.com -w single-slash.txt --extract-links
___ ___ __ __ __ __ __ ___
|__ |__ |__) |__) | / ` / \ \_/ | | \ |__
| |___ | \ | \ | \__, \__/ / \ | |__/ |___
by Ben "epi" Risher 🤓 ver: 2.4.0
───────────────────────────┬──────────────────────
🎯 Target Url │ https://www.yahoo.com
🚀 Threads │ 50
📖 Wordlist │ single-slash.txt
👌 Status Codes │ [200, 204, 301, 302, 307, 308, 401, 403, 405, 500]
💥 Timeout (secs) │ 7
🦡 User-Agent │ feroxbuster/2.4.0
💉 Config File │ /home/dsaxton/.config/feroxbuster/ferox-config.toml
🔎 Extract Links │ true
🔃 Recursion Depth │ 4
───────────────────────────┴──────────────────────
🏁 Press [ENTER] to use the Scan Cancel Menu™
──────────────────────────────────────────────────
200 864l 15813w 0c https://www.yahoo.com/news/m-frosted-flakes-man-kevin-143000649.html
200 1l 171w 16605c https://www.yahoo.com/lib/metro/g/myy/rapidworker_1_2_0.0.40.js
500 1l 2w 28c https://www.yahoo.com/tdv2_fp/api/resource/NotificationHistory.getHistory
WLD 2l 5w 0c Got 403 for https://www.yahoo.com/lib/metro/21cd0d935f464b8aaece4f992787fcd0 (url length: 32)
200 1l 1w 42c https://www.yahoo.com/px.gif
302 1l 14w 260c https://www.yahoo.com/photo?psize=24X24&fallback_url=https%3A%2F%2Fs.yimg.com%2Fdh%2Fap%2Fsocial%2Fprofile%2Fprofile_a24.png&alphatar_photo=true&format=image
200 6l 19w 153c https://www.yahoo.com/p.gif?beaconType=darlaFetcherBeacon&
302 1l 14w 201c https://www.yahoo.com/finance/news/kyle-rittenhouse-ipad-pinch-to-zoom-lawyers-claim-142110207.html
200 1l 13w 158c https://www.yahoo.com/lib/metro/g/myy/advertisement_0.0.20.js
302 1l 14w 215c https://www.yahoo.com/sports/mike-zimmer-says-vikings-player-hospitalized-due-to-covid-19-symptoms-211029008.html
200 68l 110w 1856c https://www.yahoo.com/manifest_desktop_us.json
WLD 143l 380w 4471c Got 403 for https://www.yahoo.com/ws/v3/mailboxes/6c24c0c9766e4978911daf4dc0efde85 (url length: 32)
WLD - - - Wildcard response is dynamic; auto-filtering (4462 + url length) responses; toggle this behavior by using --dont-filter
WLD 143l 380w 4535c Got 403 for https://www.yahoo.com/ws/v3/mailboxes/5d7e967730c54f48b0866fb6591a9fa0c90183519e2d4fb099591cda9de10910cbe18806988d414b9a3f54d69677e177 (url length: 96)
200 1813l 20212w 0c https://www.yahoo.com/
[####################] - 14s 704/704 0s found:14 errors:0
[####################] - 13s 1/1 0/s https://www.yahoo.com
[####################] - 12s 1/1 0/s https://www.yahoo.com/fpjs/
[####################] - 12s 1/1 0/s https://www.yahoo.com/myjs/
[####################] - 12s 1/1 0/s https://www.yahoo.com/
[####################] - 0s 1/1 1/s https://www.yahoo.com/lifestyle/
[####################] - 0s 2/1 6/s https://www.yahoo.com/lib/metro/
[####################] - 0s 1/1 1/s https://www.yahoo.com/lib/metro/g/
[####################] - 0s 1/1 1/s https://www.yahoo.com/plus/mail/
[####################] - 0s 3/1 8/s https://www.yahoo.com/ws/v3/mailboxes/

This command finishes in a couple of seconds:

echo https://www.yahoo.com | hakrawler
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@epi052 Maybe this is out of scope for feroxbuster, so we can close. FWIW, I played around with creating a web crawler in Rust that seems to give some reasonable results: https://github.com/dsaxton/wrake. It could definitely be improved a lot, though; e.g., I think it's throwing out a lot of valid links, and it could possibly benefit from using async/await.
@dsaxton So, I've been going back and forth on this one. I'm of the mind that I don't want ferox creeping off into other realms of related stuff. One of the original reasons I wanted to write it was to have a single tool that did a single thing really well. As far as using ferox as a crawler goes, we could document that it's possible using your workaround, but that it's not necessarily intended to act as a crawler, and there are likely better options (maybe summarizing what you've found in your testing of other tools and what you've learned writing wrake?). That's ultimately where I've landed on this: I'd like to keep ferox solely as a directory brute-forcing tool. If you don't want to add any related documentation, I'm absolutely OK with that; just re-close this ticket and folks can find it via search if it ever makes sense. If you do feel like writing it up, the docs live at https://github.com/epi052/feroxbuster-docs now. Thanks again!
I took a look at wrake, and yes, at a quick glance, async/await would still give you a lot more perf than what you're currently getting with just rayon.
@epi052 Thanks, I'll look into adding something to the docs soonish. I agree, though, after thinking a bit more, that it's good to keep the features focused on brute forcing, so we could say it's possible to use feroxbuster for crawling, but probably not optimal if that's the user's primary goal.
Put up a PR in the docs repo, so closing this.
Is your feature request related to a problem? Please describe.
It would be interesting if feroxbuster had a "spider mode," which would really just be the --extract-links behavior without using a word list. This would make for a nice option whenever a user wants to get a quick map of a site without also spraying the server with a lot of requests that are likely to fail.

Describe the solution you'd like
One approach could be something like feroxbuster -u https://example.com --spider, which only requests the root path and then recursively fetches based on links that are found. This would pretty much just be an alias that activates functionality feroxbuster already has, but in a more expressive and user-friendly way.

Describe alternatives you've considered
I've only tried using a very small dummy word list along with --extract-links, but maybe there is a simpler way I haven't thought of.