Julia: fix missing ( and ) bug #147
Do you want to reintroduce |
Thank you for the fix. I will add |
Great - thank you for doing these benchmarks. They push the whole data science ecosystem forward greatly. And now we know that we have work to do on joins 😄. In general we were in the process of stabilizing the API for DataFrames.jl, and starting with release 0.21 the effort will shift towards performance in particular. Here I have a small question (I have not checked other frameworks): are all your tests single-threaded, or do you allow multi-threading, and if so, how many cores are made available? |
Multithreading is allowed, and multi-GPU as well (though it is not used yet; that is blocked by missing documentation and/or support in cudf #116). Everything a single-node machine offers can be used. |
@jangorecki As we're working towards implementing some multithreading, could you help us define an acceptable strategy to choose the number of threads to use with DataFrames.jl? What do other implementations do: always use all available threads, or choose the optimal number depending on the operation? Would it be OK to specify explicitly the number of threads to use for each operation, or do you want a single global setting? Thanks! |
@nalimilan Sure. In general it is risky to use all available threads by default because users might be working in a shared environment. |
This is exactly our understanding as well, which is why @nalimilan asked the question. In Julia, for now we plan to allow the user to specify explicitly how many threads should be used "per operation", with no threading as the default (so using multiple threads is opt-in).
Thank you. So I understand that your approach here is to mimic a "production usage" scenario, where an administrator is allowed to tune the number of threads used (this is in line with the design we are adopting in DataFrames.jl, where the user is asked to explicitly specify the number of threads that should be used). Again - thank you for such a great commitment to maintaining these benchmarks. It is a great thing to have for a project like DataFrames.jl, which has no resources of its own and relies solely on volunteer work. |
Thanks. Indeed, in general adding many threads doesn't give a big speedup. Hopefully it shouldn't be slower (except in pathological cases where there are only 1-5 rows per group, but the benchmarks don't do that currently), so for benchmarking purposes it could be workable to just use as many threads as possible. For real use it matters to choose a tradeoff between speed and CPU consumption. In practical terms, Julia is probably a bit different from e.g. R, as the user chooses the maximum number of threads on startup. So that's a first way of affecting the number of threads DataFrames.jl will use, and here we could set it to 40 if that's indeed the number of cores (I think some IDEs do that automatically), or 20 (that shouldn't make a big difference). Then would it be OK to pass |
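For readers following along, here is a minimal sketch of the two levels discussed above: the process-wide maximum fixed when Julia starts (via `julia --threads N` or the `JULIA_NUM_THREADS` environment variable), and an opt-in, per-operation setting that defaults to no threading. The function `chunked_sum` and its `ntasks` keyword are hypothetical names made up for illustration; they are not part of the DataFrames.jl API and do not reflect how it is or will be implemented.

```julia
# Start Julia with e.g. `julia --threads 4` or `JULIA_NUM_THREADS=4 julia`.
# The maximum number of threads is fixed at startup and can be inspected with
# Threads.nthreads().

# Hypothetical helper (not DataFrames.jl API): sums a vector, optionally
# splitting the work into `ntasks` chunks that are scheduled as tasks.
# The default ntasks = 1 means parallelism is opt-in per call.
function chunked_sum(xs::AbstractVector{<:Real}; ntasks::Integer = 1)
    isempty(xs) && return zero(eltype(xs))
    # Never create more chunks than there are elements.
    ntasks = clamp(ntasks, 1, length(xs))
    chunks = Iterators.partition(xs, cld(length(xs), ntasks))
    # Each chunk is summed in its own task; the scheduler runs the tasks on
    # at most Threads.nthreads() threads.
    tasks = [Threads.@spawn sum(c) for c in chunks]
    return sum(fetch.(tasks))
end

println("threads available: ", Threads.nthreads())
println(chunked_sum(collect(1:100)))            # single task (default)
println(chunked_sum(collect(1:100); ntasks = 4)) # opt-in, four tasks
```

With this shape, the startup setting acts as an upper bound chosen by whoever launches the process, while each call decides how much of it to use.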
@bkamins I think you are correct about memory bound operations. @nalimilan having |