[FEATURE REQUEST] Spider mode for feroxbuster #407

Closed
dsaxton opened this issue Oct 31, 2021 · 16 comments
Labels
enhancement New feature or request

Comments

@dsaxton
Contributor

dsaxton commented Oct 31, 2021

Is your feature request related to a problem? Please describe.

It would be interesting if feroxbuster had a "spider mode," which would really just be the --extract-links behavior without using a word list. This would make for a nice option if a user ever wants to get a quick map of a site without also spraying the server with a lot of requests that are likely to fail.

Describe the solution you'd like

One approach could be something like feroxbuster -u https://example.com --spider which only requests the root path and then recursively fetches based on links that are found. This would pretty much just be an alias that activates functionality that feroxbuster already has, but in a more expressive and user-friendly way.

Describe alternatives you've considered

I've only tried using a very small dummy word list along with --extract-links, but maybe there is a simpler way I haven't thought of.

@dsaxton dsaxton added the enhancement New feature or request label Oct 31, 2021
@epi052
Owner

epi052 commented Nov 1, 2021

Does a single word in the word list and extract links do what you're looking for ?

@dsaxton
Contributor Author

dsaxton commented Nov 1, 2021

Does a single word in the word list and extract links do what you're looking for ?

It does, that was essentially the alternative method I mentioned. An option like this would really just be syntactic sugar to make it easier to express that type of behavior, but I can also completely understand not wanting to pollute the interface with extra options that aren't really necessary (especially if it's non-trivial to implement).

@epi052
Owner

epi052 commented Nov 2, 2021

It does, that was essentially the alternative method I mentioned

That's what I thought you meant, just wanted to be sure. I'm not against adding it. I think the tool has grown a bit beyond the original "simple" tagline, lol. I don't know of anyone that used it in this particular way, but that may be because it wasn't intuitive to invoke the behavior.

Implementation-wise, I don't think it would be much beyond adding the flag. Would need to handle the check for an empty wordlist, and after that just kinda see what the code does... lol. If it handles it mostly gracefully, that'd be basically it.

I've been thinking about adding a 'bag of observed words' kind of thing. Similar to --extract-links except it would be --extract-words and would then be added into the wordlist (think cewl+feroxbuster). turbo intruder added it recently, and i think there's another scanner that does it also. It's not the same thing as spidering, but this issue made me think of it and want to at least put it in writing somewhere, lol.

@dsaxton
Contributor Author

dsaxton commented Nov 2, 2021

I think the tool has grown a bit beyond the original "simple" tagline, lol. I don't know of anyone that used it in this particular way, but that may be because it wasn't intuitive to invoke the behavior.

For sure. Maybe it makes sense to stick to the "brute force" approach, just wanted to throw this out there as a possible enhancement. I've had some trouble finding a good spidering tool, and feroxbuster may be better than existing tools I've looked at even for that.

I've been thinking about adding a 'bag of observed words' kind of thing. Similar to --extract-links except it would be --extract-words and would then be added into the wordlist (think cewl+feroxbuster).

That's a neat idea, would there be any filtering logic, or simply fetch everything that looks like a word and add it (I wonder if doing things like splitting existing links into words might generate good results)?

It's also interesting to think about how a dynamically changing word list might work in practice. I imagine it could do an initial fetch and then have all requests share the same list for the duration of the scan, or are you thinking words would incrementally get added during scanning?

@epi052
Owner

epi052 commented Nov 2, 2021

would there be any filtering logic, or simply fetch everything that looks like a word and add it

one idea would be to borrow from NLP. Each word would first be filtered against a set of stop words (the, is, am, was, etc.). After that, it'd be added to a structure that keeps track of its frequency in the page. Words could then be filtered out based on some pre-chosen TF-IDF cutoff value (TF-IDF scores how important a word is to one document relative to the rest of the collection).

the other approach would be, just like you said, to simply add any not previously seen word to the wordlist.
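
The stop-word and TF-IDF idea could look roughly like the sketch below. It is only an illustration and not feroxbuster code: the `STOP_WORDS` list, the function names, and the cutoff value are all hypothetical, and "pages" here are plain strings rather than HTTP responses.

```rust
use std::collections::HashMap;

// Hypothetical, deliberately tiny stop-word list; a real one would be larger.
const STOP_WORDS: [&str; 8] = ["the", "is", "am", "was", "a", "an", "and", "of"];

/// Split text into lowercase alphanumeric tokens and drop stop words.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.iter().any(|s| *s == t.as_str()))
        .collect()
}

/// Score every word of `pages[index]` with TF-IDF computed over all `pages`.
fn tf_idf(pages: &[&str], index: usize) -> HashMap<String, f64> {
    let docs: Vec<Vec<String>> = pages.iter().map(|p| tokenize(p)).collect();
    let doc = &docs[index];

    // Term counts for the page being scored.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in doc {
        *counts.entry(word.as_str()).or_insert(0) += 1;
    }

    let total_docs = docs.len() as f64;
    counts
        .into_iter()
        .map(|(word, count)| {
            let tf = count as f64 / doc.len() as f64;
            // Number of pages containing the word; at least 1 (this page).
            let df = docs
                .iter()
                .filter(|d| d.iter().any(|x| x.as_str() == word))
                .count() as f64;
            let idf = (total_docs / df).ln();
            (word.to_string(), tf * idf)
        })
        .collect()
}

/// Keep only words whose score reaches `cutoff`, sorted alphabetically.
fn candidate_words(pages: &[&str], index: usize, cutoff: f64) -> Vec<String> {
    let mut words: Vec<String> = tf_idf(pages, index)
        .into_iter()
        .filter(|(_, score)| *score >= cutoff)
        .map(|(word, _)| word)
        .collect();
    words.sort();
    words
}
```

With the three sample pages "the admin login is here", "the login page", and "the backup archive", a cutoff of 0.2 on the first page keeps "admin" and "here" but drops "login", because "login" also appears on the second page.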

I wonder if doing things like splitting existing links into words might generate good results

very similar logic is already in extract links. It seems to work out pretty well in the way it's used now. I suspect it would still be useful.

    /// Iterate over a given path, return a list of every sub-path found
    ///
    /// example: `path` contains a link fragment `homepage/assets/img/icons/handshake.svg`
    /// the following fragments would be returned:
    ///   - homepage/assets/img/icons/handshake.svg
    ///   - homepage/assets/img/icons/
    ///   - homepage/assets/img/
    ///   - homepage/assets/
    ///   - homepage/
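
A minimal sketch of the behavior that doc comment describes is shown here. The function name `sub_paths` and its exact handling of slashes are assumptions for illustration; this is not the implementation from the feroxbuster source.

```rust
/// Return `path` followed by each of its parent directories, longest first.
/// Hypothetical helper written to match the doc comment's example output.
fn sub_paths(path: &str) -> Vec<String> {
    let parts: Vec<&str> = path.split('/').filter(|p| !p.is_empty()).collect();
    if parts.is_empty() {
        return Vec::new();
    }

    let mut out = Vec::with_capacity(parts.len());
    // The full path comes first, without a trailing slash (it may be a file).
    out.push(parts.join("/"));
    // Then every ancestor directory, each with a trailing slash.
    for end in (1..parts.len()).rev() {
        out.push(format!("{}/", parts[..end].join("/")));
    }
    out
}
```

Calling `sub_paths("homepage/assets/img/icons/handshake.svg")` yields the five fragments listed in the comment, in the same order.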

Also interesting to think how a dynamically changing word list might work in practice

I fear it'd be a non-trivial amount of work compared to how the wordlist works now.

an initial fetch and then have all requests share the same list for the duration of the scan

I'm not entirely sure how you meant this, but the way I interpret it makes it sound like it'd be limited. Ideally, extracted words would be tried on every new directory, and extracted from every new page, updating any future directory scans (basically the "words would incrementally get added during scanning" is how i think it ought to behave)

@dsaxton
Contributor Author

dsaxton commented Nov 2, 2021

I'm not entirely sure how you meant this, but the way I interpret it makes it sound like it'd be limited. Ideally, extracted words would be tried on every new directory, and extracted from every new page, updating any future directory scans (basically the "words would incrementally get added during scanning" is how i think it ought to behave)

I think you're right it would be pretty limited to populate it only once at the start of the scan. I guess I was mostly wondering how complex it would be to have a mutable word list that gets shared / updated by several concurrent processes that are all making requests, but maybe that's not a big deal depending on how it's implemented.
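
One way to picture a mutable word list shared by concurrent workers is the sketch below. It is an assumption-heavy illustration: it uses plain `std::thread` and a `Mutex` rather than the async runtime feroxbuster actually runs on, and `SharedWordlist`, `add`, `since`, and `collect_concurrently` are hypothetical names.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Default)]
struct Inner {
    seen: HashSet<String>, // fast duplicate check
    ordered: Vec<String>,  // insertion order, so scans can resume from an index
}

/// A word list that many workers can read from and append to.
#[derive(Clone, Default)]
struct SharedWordlist {
    inner: Arc<Mutex<Inner>>,
}

impl SharedWordlist {
    /// Add a word; returns true only if it was not seen before.
    fn add(&self, word: &str) -> bool {
        let mut inner = self.inner.lock().unwrap();
        if inner.seen.insert(word.to_string()) {
            inner.ordered.push(word.to_string());
            true
        } else {
            false
        }
    }

    /// Words added at or after position `from`, so a directory scan can
    /// pick up only what is new since it last looked.
    fn since(&self, from: usize) -> Vec<String> {
        let inner = self.inner.lock().unwrap();
        let start = from.min(inner.ordered.len());
        inner.ordered[start..].to_vec()
    }
}

/// Simulate several workers, each contributing the words it "extracted".
fn collect_concurrently(batches: Vec<Vec<&'static str>>) -> Vec<String> {
    let list = SharedWordlist::default();
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let list = list.clone();
            thread::spawn(move || {
                for word in batch {
                    list.add(word);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let mut words = list.since(0);
    words.sort(); // thread scheduling makes insertion order nondeterministic
    words
}
```

Keeping insertion order alongside the set is the design choice that supports incremental updates: a scan that has already tried the first N words can ask for `since(N)` instead of re-reading the whole list.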

@epi052
Owner

epi052 commented Nov 10, 2021

Sorry, I got a bit sidetracked. I looked at some other tools I've used in the past:

* https://github.com/s0md3v/Photon

* https://github.com/hakluke/hakrawler

I haven't looked at either one any time recently. Did you try either of these? It looks like hakrawler stripped out a lot of its initial functionality.

@dsaxton
Contributor Author

dsaxton commented Nov 10, 2021

Sorry, I got a bit sidetracked. I looked at some other tools I've used in the past:

* https://github.com/s0md3v/Photon

* https://github.com/hakluke/hakrawler

I haven't looked at either one any time recently. Did you try either of these? It looks like hakrawler stripped out a lot of its initial functionality.

Thanks! I wasn't aware of these and will take a look.

@epi052
Owner

epi052 commented Nov 10, 2021

i was mostly asking to see how they compared to you running feroxbuster for wordlist-less crawling.

@dsaxton
Contributor Author

dsaxton commented Nov 11, 2021

i was mostly asking to see how they compared to you running feroxbuster for wordlist-less crawling.

Looks like feroxbuster gives about the same number of results as Photon and hakrawler based on a quick check. I did notice though that hakrawler was a lot faster than both feroxbuster and Photon, so maybe there are some opportunities to optimize the scan for feroxbuster. Here was the ferox command (single-slash.txt contains only the line "/") and it took 10-15 seconds to run on my computer:

~ $ feroxbuster -u https://www.yahoo.com -w single-slash.txt --extract-links

 ___  ___  __   __     __      __         __   ___
|__  |__  |__) |__) | /  `    /  \ \_/ | |  \ |__
|    |___ |  \ |  \ | \__,    \__/ / \ | |__/ |___
by Ben "epi" Risher 🤓                 ver: 2.4.0
───────────────────────────┬──────────────────────
 🎯  Target Url            │ https://www.yahoo.com
 🚀  Threads               │ 50
 📖  Wordlist              │ single-slash.txt
 👌  Status Codes          │ [200, 204, 301, 302, 307, 308, 401, 403, 405, 500]
 💥  Timeout (secs)        │ 7
 🦡  User-Agent            │ feroxbuster/2.4.0
 💉  Config File           │ /home/dsaxton/.config/feroxbuster/ferox-config.toml
 🔎  Extract Links         │ true
 🔃  Recursion Depth       │ 4
───────────────────────────┴──────────────────────
 🏁  Press [ENTER] to use the Scan Cancel Menu™
──────────────────────────────────────────────────
200      864l    15813w        0c https://www.yahoo.com/news/m-frosted-flakes-man-kevin-143000649.html
200        1l      171w    16605c https://www.yahoo.com/lib/metro/g/myy/rapidworker_1_2_0.0.40.js
500        1l        2w       28c https://www.yahoo.com/tdv2_fp/api/resource/NotificationHistory.getHistory
WLD        2l        5w        0c Got 403 for https://www.yahoo.com/lib/metro/21cd0d935f464b8aaece4f992787fcd0 (url length: 32)
200        1l        1w       42c https://www.yahoo.com/px.gif
302        1l       14w      260c https://www.yahoo.com/photo?psize=24X24&fallback_url=https%3A%2F%2Fs.yimg.com%2Fdh%2Fap%2Fsocial%2Fprofile%2Fprofile_a24.png&alphatar_photo=true&format=image
200        6l       19w      153c https://www.yahoo.com/p.gif?beaconType=darlaFetcherBeacon&
302        1l       14w      201c https://www.yahoo.com/finance/news/kyle-rittenhouse-ipad-pinch-to-zoom-lawyers-claim-142110207.html
200        1l       13w      158c https://www.yahoo.com/lib/metro/g/myy/advertisement_0.0.20.js
302        1l       14w      215c https://www.yahoo.com/sports/mike-zimmer-says-vikings-player-hospitalized-due-to-covid-19-symptoms-211029008.html
200       68l      110w     1856c https://www.yahoo.com/manifest_desktop_us.json
WLD      143l      380w     4471c Got 403 for https://www.yahoo.com/ws/v3/mailboxes/6c24c0c9766e4978911daf4dc0efde85 (url length: 32)
WLD         -         -         - Wildcard response is dynamic; auto-filtering (4462 + url length) responses; toggle this behavior by using --dont-filter
WLD      143l      380w     4535c Got 403 for https://www.yahoo.com/ws/v3/mailboxes/5d7e967730c54f48b0866fb6591a9fa0c90183519e2d4fb099591cda9de10910cbe18806988d414b9a3f54d69677e177 (url length: 96)
200     1813l    20212w        0c https://www.yahoo.com/
[####################] - 14s      704/704     0s      found:14      errors:0      
[####################] - 13s        1/1       0/s     https://www.yahoo.com
[####################] - 12s        1/1       0/s     https://www.yahoo.com/fpjs/
[####################] - 12s        1/1       0/s     https://www.yahoo.com/myjs/
[####################] - 12s        1/1       0/s     https://www.yahoo.com/
[####################] - 0s         1/1       1/s     https://www.yahoo.com/lifestyle/
[####################] - 0s         2/1       6/s     https://www.yahoo.com/lib/metro/
[####################] - 0s         1/1       1/s     https://www.yahoo.com/lib/metro/g/
[####################] - 0s         1/1       1/s     https://www.yahoo.com/plus/mail/
[####################] - 0s         3/1       8/s     https://www.yahoo.com/ws/v3/mailboxes/

This command finishes in a couple seconds:

echo https://www.yahoo.com | hakrawler

@stale

stale bot commented Nov 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 25, 2021
@dsaxton
Contributor Author

dsaxton commented Nov 25, 2021

@epi052 Maybe this is out of the scope for feroxbuster so we can close. FWIW I played around with creating a web crawler in Rust that seems to give some reasonable results: https://github.com/dsaxton/wrake.

It could definitely be improved a lot though; e.g., I think it's throwing out a lot of valid links, and could possibly benefit from using async / await.

@dsaxton dsaxton closed this as completed Nov 25, 2021
@epi052
Owner

epi052 commented Nov 26, 2021

@dsaxton So, I've been going back and forth on this one.

I'm of the mind that I don't want ferox creeping off into other realms of related stuff. One of the original reasons I wanted to write it was to have a single tool, that did a single thing really well.

As far as using ferox as a crawler, we could document that it's possible using your workaround, but that it's not necessarily intended to act as a crawler, and there are likely better options (maybe summarizing what you've found in your testing of other tools and what you've learned writing wrake?)

That's ultimately where I've landed on this. I'd like to keep ferox solely as a directory brute-forcing tool.

If you don't want to add any related documentation, I'm absolutely ok with that, just re-close this ticket and folks can find it via search if it ever makes sense.

If you do feel like writing it up, the docs live @ https://github.com/epi052/feroxbuster-docs now.

Thanks again!

@epi052 epi052 reopened this Nov 26, 2021
@stale stale bot removed the stale label Nov 26, 2021
@epi052
Owner

epi052 commented Nov 26, 2021

I took a look at wrake, and yes, at a quick glance, async / await would still give you a lot more perf than what you're currently getting with just rayon.

@dsaxton
Contributor Author

dsaxton commented Nov 29, 2021

@epi052 Thanks, I'll look into adding something to the docs soonish. I agree though after thinking a bit more that it's good to keep the features focused on brute forcing, so we could say it's possible to use feroxbuster for crawling, but probably not optimal if that's the user's primary goal.

@dsaxton
Contributor Author

dsaxton commented Dec 9, 2021

Put up a PR in the docs repo so closing this

@dsaxton dsaxton closed this as completed Dec 9, 2021