Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Live updates and sorting for R data explorer #287

Merged
merged 43 commits into from
Apr 3, 2024

Conversation

jmcphers
Copy link
Contributor

@jmcphers jmcphers commented Apr 1, 2024

This change adds live updates and sorting for the data explorer backend for R.

Addresses posit-dev/positron#2159 (sorting portion)
Addresses posit-dev/positron#2386
Addresses posit-dev/positron#2333

Most of this is accomplished by having the data explorer's backend maintain knowledge of the object it's looking at (in the form of a name/environment binding), and the current sorting state (in the form of a set of sorting keys and sorted row indices).

Includes integration tests for the new functionality.

Copy link
Contributor

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I think it just needs some changes for tibble support.

crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
crates/harp/src/object.rs Show resolved Hide resolved
crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
crates/ark/tests/data_explorer.rs Outdated Show resolved Hide resolved
crates/ark/tests/data_explorer.rs Outdated Show resolved Hide resolved
@jmcphers jmcphers force-pushed the feature/data-explorer-sorting branch from 709783a to bf01f7a Compare April 2, 2024 21:40
@jmcphers jmcphers requested a review from lionel- April 2, 2024 23:23
@jmcphers
Copy link
Contributor Author

jmcphers commented Apr 2, 2024

@lionel- thanks for the detailed review! I think I've addressed all your comments, LMK if anything still looks off.

@DavisVaughan
Copy link
Contributor

I'll also take a look tomorrow morning

Copy link
Contributor

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I've just pushed a couple of commits that add tests for tibble data frames. I've confirmed that the previous way of subsetting columns causes a failure with this new test, but the new way succeeds.

@DavisVaughan
Copy link
Contributor

DavisVaughan commented Apr 3, 2024

I was expecting this to live update, it does in RStudio, am I doing it wrong?

library(nycflights13)
x <- flights
View(x)
x <- data.frame(a = 1:10)
Screen.Recording.2024-04-03.at.9.42.25.AM.mov

Oh, the env isn't passed through in the View() case. I think that will be important to do. I know that's the primary way my wife pulls up the data viewer in rstudio.

#[harp::register]
pub unsafe extern "C" fn ps_view_data_frame(x: SEXP, title: SEXP) -> anyhow::Result<SEXP> {
    let x = RObject::new(x);

    let title = RObject::new(title);
    let title = unwrap!(String::try_from(title), Err(_) => "".to_string());

    let main = RMain::get();
    let comm_manager_tx = main.get_comm_manager_tx().clone();

    RDataExplorer::start(title, x, None, comm_manager_tx)?;

    Ok(R_NilValue)
}

#' @export
.ps.null_count <- function(col) {
# Include NA and NaN values in the null count
is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_null <- function(x) { is.na(x) | is.null(x) | is.nan(x) }
is_null <- function(x) { is.na(x) | is.null(x) }

NaNs are included in is.na()

crates/harp/src/object.rs Outdated Show resolved Hide resolved
Comment on lines 43 to 47
/// - `column_index` - The index of the column to extract. Passed directly to R,
/// so uses 1-based indexing.
/// - `kind` - The kind of table `x` is (matrix or data frame).
///
pub fn tbl_get_column(x: SEXP, column_index: i32, kind: TableKind) -> anyhow::Result<RObject> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lionel- and I typically use a convention where we do the 0 vs 1 based indexing at the FFI boundary. This typically makes the code cleaner and easier to understand in the long run as it avoids a proliferation of + 1 or - 1 everywhere. So I'd expect column_index to be 0 based here and the conversion to happen internally as you build the call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that. Done in 1c05603.

Comment on lines 141 to 143
// Generate an intial set of row indices that are just the
// row numbers
let row_indices: Vec<i32> = (1..=shape.num_rows).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to manually check that (1..shape.num_rows) gives an empty vector when num_rows = 0! That's nice rust behavior here, but I was a little surprised by it! I'm not sure if there is something we can do to make it clearer that we do indeed nicely handle the 0 row case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unnecessarily verbose, but maybe this is clearer? 609d14a

crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
@@ -1,20 +1,25 @@
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature request - RStudio has this behavior where if you update the binding to something it can't show but then reupdate the binding to a data frame again, then it picks up where it left off just fine.

That's a pretty nice behavior for interactive work, where its definitely possible you can temporarily mask that binding with something that isn't a data frame, but then you say "oops let me undo that" and the data viewer picks right back up.

library(nycflights13)
library(dplyr)

df <- flights

# cool!
df <- data.frame(x = 1:5)

# disconnect
df <- 3

# was hoping this would reupdate my existing window, like rstudio
df <- flights

(this may be a bad example because the rstudio viewer can indeed show 3, but try with like environment() or something and this still works)

Screen.Recording.2024-04-03.at.10.30.37.AM.mov

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great idea, but would be kind of tough to draw with the current set of stencils. Currently, the "disconnect" state tears down everything -- the thread that services RPCs, the underlying objects, the UI, etc. There's no concept of "reconnect" for a comm; you can open a new one, of course, but that's it.

Probably the way to do this would be for the comm to include some sort of unique key (separate from the comm ID). In the reopen case we'd open a new comm with a matching unique key and the UI could attach that to an existing explorer tab if there is one.

At any rate we don't even have a "disconnected" state yet so that probably has to happen first: posit-dev/positron#2516

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to ☝️ to make sure we keep track of this idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

crates/ark/src/data_explorer/r_data_explorer.rs Outdated Show resolved Hide resolved
crates/ark/src/modules/positron/r_data_explorer.R Outdated Show resolved Hide resolved
}
// Add the sort order per column
order.param("decreasing", RObject::try_from(decreasing)?);
order.param("method", RObject::from("radix"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Tricky:

  • On one hand, radix is nice because it gives
    • Consistent C locale ordering
    • decreasing per column
    • It is fast
  • On the other hand
    • People probably want to see sorting in their own locale, it affects how capitalization is handled and how special characters like ñ in Spanish or Ø in Danish are sorted (C locale doesn't put them in the "right place", typically they unexpectedly end up after z).

RStudio seems to use base::order() without radix because it seems to respect the current locale (which is nice), but it doesn't seem to have to deal with the length 1 decreasing restriction because only 1 sort key seems to be allowed at a time.

Note how when I specify C locale, the Abc is no longer grouped with abc as one may expect. This is the kind of stuff we lose in the C locale. It gets more annoying in other languages too when it affects actual characters and not just capitalization.

Screen.Recording.2024-04-03.at.10.49.45.AM.mov

The fully generic way we do in dplyr is to use vctrs:::vec_order_radix(), which allows for length >1 decreasing and also has a chr_proxy_collate argument that transforms a character vector into a secondary character vector that can be sorted in the C locale but the result with be as if you sorted in the user locale. We use stringi with something like stringi::stri_sort_key(col, locale = "en") for this but adding both of these deps to ark is kind of a big deal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added radix on @lionel- 's recommendation. And you're right, RStudio does not support sorting on more than one column so it doesn't have to deal with some of these problems!

We can probably start with radix for now and see what feedback people give us. We might be able to get to 95% by using radix sometimes (as I suspect most sorting happens on numeric values) and only dropping it when one of the sort columns is a character.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've promoted this to an issue here: posit-dev/positron#2647

@jmcphers
Copy link
Contributor Author

jmcphers commented Apr 3, 2024

Oh, the env isn't passed through in the View() case. I think that will be important to do. I know that's the primary way my wife pulls up the data viewer in rstudio.

@DavisVaughan Not hard to do! Implemented now.

@jmcphers jmcphers merged commit 3f96567 into main Apr 3, 2024
1 check passed
@posit-dev posit-dev deleted a comment from jmcphers Apr 5, 2024
@DavisVaughan DavisVaughan deleted the feature/data-explorer-sorting branch October 14, 2024 12:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants