
requested additions to documentation #4757

Open
david-cortes opened this issue Oct 30, 2021 · 1 comment

@david-cortes (Contributor)

The LightGBM docs are missing many key details that users would expect from software documentation, such as the effects of parameters and the methodology used by the software.

For example, take the parameters docs here: https://lightgbm.readthedocs.io/en/latest/Parameters.html
There is a section called "Core Parameters". From what I've found in some comments here on GitHub, some of those parameters (such as num_threads) can be passed at both training and prediction time, while others apply only to training. The docs do not explain which parameters are shared and which are training-only.

If I take a look at the Python and R docs, I see that some of the parameters are also function arguments, and I'm left wondering why some are arguments while others aren't, and what happens if there is a clash between them: if I pass both, which one prevails? If I leave an argument at its default and also pass that parameter in a dict/list, which one prevails then?
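To make the ambiguity concrete, here is a minimal sketch of the kind of clash in question. The resolution rule shown (params dict wins over keyword arguments) is an assumption for illustration only, not LightGBM's documented behavior; which rule actually applies is exactly what the docs should state.

```python
# Hypothetical sketch of the parameter clash described above.
# ASSUMPTION: the params dict takes precedence over keyword arguments.
# This is NOT confirmed LightGBM behavior -- it is the undocumented detail.

def resolve(keyword_args, params):
    """Merge explicit keyword arguments with a params dict,
    letting the params dict override on any key clash."""
    merged = dict(keyword_args)
    merged.update(params)  # params dict wins under this assumed rule
    return merged

# The user passes num_threads both ways -- which one prevails?
effective = resolve({"num_threads": 4}, {"num_threads": 1})
print(effective["num_threads"])  # -> 1 under this assumed rule
```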

Beyond topics like those, if I look at the parameters themselves without reading the source code or the GitHub issue tracker, I would wonder, among other things:

  • Why do I keep seeing a message about infinite gain while training?
  • Is there any difference if I pass data as float64 compared to float32? Any difference passing it as a data frame vs. a contiguous array? Does it support arrays that are memoryviews of a larger array in NumPy? Any difference in passing them as row-major vs. column-major?
  • Are the data bins/histograms constructed by simple range binning, by percentiles, or by something else? Will bins be determined correctly for distributions with small values, regardless of what I pass for max_bin?
  • What happens if some bin has too few observations? Do they get merged into the bin that is smaller or larger, or do they get split evenly?
  • What happens with missing values in the bins? What if the bin for missing values has a size below min_data_in_bin?
  • Do categorical variables need to be encoded as contiguous integers, or can the numbering have gaps?
  • How are non-one-hot splits on categorical variables determined? Does it generate per-node category statistics, or does it determine category statistics beforehand from all the data and always keep to those?
  • What happens if an observation at prediction time reaches a node that splits on a categorical variable, and this new observation has a category/value for that variable which none of the observations that reached that node during training had?
  • What happens if I introduce a new categorical value which no observation had during training at all?
  • What happens if there are missing values in the prediction data but not in the training data?
  • Are missing values in categorical features treated the same as in numerical features?
  • If I use the CV functions, do the bins get generated from all the data or only from the training folds?
  • If I am using pandas DataFrames with Categorical dtype, when I pass data for predictions, does it need to be encoded with the same categories as the data for training, or does lightgbm re-encode them for me?
  • Is changing parameters of a fitted model supposed to work? (e.g. model.set_params(num_threads=1) is a very natural thing to want to do in the scikit-learn interface, and given that it'd work in most other software, I'd expect there to be an easily findable warning about it and a suggestion to pass it as prediction parameter).
  • If I use the linear trees option, is the data used as-is or is it taken imprecisely from the bins used for tree splits?
  • Which kind of functionality is or isn't thread-safe? E.g. can I use the scikit-learn interfaces wrapped inside joblib with a shared memory model?
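Several of the data-layout questions above (float64 vs. float32, row- vs. column-major, memoryviews of a larger array) can be made concrete with plain NumPy. Whether LightGBM silently copies or converts in each of these cases is exactly what the docs should state; this sketch only shows how the cases differ on the NumPy side, before any library sees the data.

```python
import numpy as np

# A column slice of a larger array is a non-contiguous view, not a copy.
base = np.arange(12, dtype=np.float64).reshape(3, 4)
view = base[:, 1:3]
print(np.shares_memory(view, base))   # True: a memoryview of the larger array
print(view.flags["C_CONTIGUOUS"])     # False: rows are not adjacent in memory

# A library that requires contiguous float32 input would have to copy here;
# whether LightGBM does so silently (and at what memory cost) is the
# undocumented detail this issue asks about.
dense = np.ascontiguousarray(view, dtype=np.float32)
print(dense.flags["C_CONTIGUOUS"])    # True: now a fresh contiguous copy
print(dense.dtype)                    # float32

# Column-major (Fortran-order) data is yet another layout a user might pass.
fortran = np.asfortranarray(base)
print(fortran.flags["F_CONTIGUOUS"])  # True
```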

Currently, many of these answers can be found fairly easily with a Google search across GitHub issues, but I'd expect technical docs to be stand-alone resources that go into detail about these kinds of things.

To be clear, this issue is not about getting these answers on this GitHub page, but about having these and similar matters explained in the docs.

@jameslamb jameslamb added the doc label Oct 31, 2021
@jameslamb (Collaborator)

Thanks for enumerating a specific list of topics you'd like to see explained in LightGBM's documentation. Maintainers here will work on adding them.

@jameslamb changed the title from "Docs fall very short on details" to "requested additions to documentation" on Nov 1, 2021