diff --git a/DESCRIPTION b/DESCRIPTION index 3767de6..8443ba6 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: git2rdata Title: Store and Retrieve Data.frames in a Git Repository -Version: 0.0.2.9000 +Version: 0.0.3 Authors@R: c( person( "Thierry", "Onkelinx", role = c("aut", "cre"), diff --git a/NEWS.md b/NEWS.md index 16300ec..d4da48d 100644 --- a/NEWS.md +++ b/NEWS.md @@ -5,7 +5,7 @@ git2rdata 0.0.2 (2019-02-26) * metadata is added as a list to the objects rather than in YAML format. * the [yaml](https://cran.r-project.org/package=yaml) package is used to store the metadata list in YAML format. - * `write_vc()` now used the 'strict' argument instead of 'override' + * `write_vc()` now uses the 'strict' argument instead of 'override' * the functionality `rm_data()` is split into `rm_data()` and `prune_meta()` (#9) ### NEW FEATURES @@ -30,9 +30,9 @@ git2rdata 0.0.2 (2019-02-26) * each helpfile contains a working example (#11) * README updated (#12) * Updated the rationale with links to the vignettes - * `git2rdata` has an hexsticker logo + * `git2rdata` has a hexsticker logo * A DOI is added - * The installation instructions uses `remotes` and build the vignettes + * The installation instructions use `remotes` and build the vignettes * `auto_commit()` was removed because of limited extra functionality over `git2r::commit()` * dataframes are read and written by base R functions instead of `readr` functions diff --git a/R/list_data.R b/R/list_data.R index eafafa6..99db4e8 100644 --- a/R/list_data.R +++ b/R/list_data.R @@ -1,4 +1,4 @@ -#' list available data objects in the repository +#' List available data files #' @param root the `root` of the repository. Either a path or a `git-repository` #' @param path relative `path` from the `root`. 
Defaults to the `root` #' @inheritParams base::list.files diff --git a/R/read_vc.R b/R/read_vc.R index d770f1d..40cc340 100644 --- a/R/read_vc.R +++ b/R/read_vc.R @@ -1,4 +1,6 @@ -#' Read a \code{data.frame} from a repository +#' Read a \code{data.frame} +#' +#' Note that the dataframe has to be written with `write_vc()` before it can be read with `read_vc()`. #' @inheritParams write_vc #' @return The \code{data.frame} with the file names and hashes as attributes #' @rdname read_vc diff --git a/R/write_vc.R b/R/write_vc.R index 1669beb..90df947 100644 --- a/R/write_vc.R +++ b/R/write_vc.R @@ -1,4 +1,4 @@ -#' Write a \code{data.frame} to a git repository +#' Write a \code{data.frame} #' #' This will create two files. The `".tsv"` file contains the raw data. #' The `".yml"` contains the meta data on the columns in YAML format. diff --git a/README.md b/README.md index eff80ea..08c246b 100644 --- a/README.md +++ b/README.md @@ -10,19 +10,19 @@ [![DOI](https://zenodo.org/badge/147685405.svg)](https://zenodo.org/badge/latestdoi/147685405) ## Rationale -The `git2rdata` package writes and reads dataframes as plain text files. Important information is stored in a metadata file. +The `git2rdata` package is an R package for writing and reading dataframes as plain text files. Important information is stored in a metadata file. -1. Storing metadata allows to maintain variables classes. By default, the data is optimized for file storage prior to writing. This make the data less human readable and can be turned off. Details on the implementation are available on the [plain text](https://inbo.github.io/git2rdata/articles/plain_text.html) vignette. -1. Storing metadata also allows to minimize row base [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)). This is a useful feature when storing data as plain text files under version control. 
Details on this part of the implementation is available at the [version control](https://inbo.github.io/git2rdata/articles/version_control.html) vignette. Although `git2rdata` was envisioned with a [git](https://git-scm.com/) workflow in mind, it can also be used in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/). -1. `git2rdata` is intended to facility a reproducible and traceable workflow. A toy example is given in the [workflow](https://inbo.github.io/git2rdata/articles/workflow.html) vignette. -1. The [efficiency](https://inbo.github.io/git2rdata/articles/efficiency.html) vignette gives some insight on the efficiency in terms of file storage, git repository size and speed for writing and reading. +1. Storing metadata allows to maintain the classes of variables. By default, the data is optimized for file storage prior to writing. This makes the data less human readable and can be turned off. Details on the implementation are available in the [plain text](https://inbo.github.io/git2rdata/articles/plain_text.html) vignette. +1. Storing metadata also allows to minimize row based [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)). This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in the [version control](https://inbo.github.io/git2rdata/articles/version_control.html) vignette. Although `git2rdata` was envisioned with a [git](https://git-scm.com/) workflow in mind, it can also be used in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/). +1. `git2rdata` is intended to facilitate a reproducible and traceable workflow. 
A toy example is given in the [workflow](https://inbo.github.io/git2rdata/articles/workflow.html) vignette. +1. The [efficiency](https://inbo.github.io/git2rdata/articles/efficiency.html) vignette provides some insight into the efficiency in terms of file storage, git repository size and speed for writing and reading. ## Installation Install the development version ```r -# installation requires the "remotes" packages +# installation requires the "remotes" package # install.package("remotes") # install with vignettes (recommended) @@ -38,7 +38,7 @@ remotes::install_github("inbo/git2rdata")) ## Main usage -Dataframes are stored using `write_vc()` and retrieved with `read_vc()`. Both share the arguments `root` and `file`. Root refers to a base location where the dataframe should be stored. It can either point to a local directory or a local git repository. `file` is the file name to use and can include a path relative to `root`. Make sure the relative path stays within `root`. +Dataframes are stored using `write_vc()` and retrieved with `read_vc()`. Both functions share the arguments `root` and `file`. `root` refers to a base location where the dataframe should be stored. It can either point to a local directory or a local git repository. `file` is the file name to use and can include a path relative to `root`. Make sure the relative path stays within `root`. 
```r library(git2rdata) @@ -55,7 +55,7 @@ Please use the output of `citation("git2rdata")` ## Folder structure - `R`: The source scripts of the [R](https://cran.r-project.org/) functions with documentation in [Roxygen](https://github.com/klutometis/roxygen) format -- `man`: The help file in [Rd](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Rd-format) format +- `man`: The help files in [Rd](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Rd-format) format - `testthat`: R scripts with unit tests using the [testthat](http://testthat.r-lib.org/) framework - `vignettes`: source code for the vignettes describing the package - `man-roxygen`: templates for documentation in Roxygen format @@ -78,4 +78,4 @@ git2rdata ## Contributions -Contribution to `git2rdata` are welcome. Please read our [Contributing guidelines](.github/CONTRIBUTING.md) first. The `git2rdata` project is released with a [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms. +Contributions to `git2rdata` are welcome. Please read our [Contributing guidelines](.github/CONTRIBUTING.md) first. The `git2rdata` project is released with a [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms. 
diff --git a/codemeta.json b/codemeta.json index cafd7a9..a599ad9 100644 --- a/codemeta.json +++ b/codemeta.json @@ -10,7 +10,7 @@ "codeRepository": "https://github.com/inbo/git2rdata", "issueTracker": "https://github.com/inbo/git2rdata/issues", "license": "https://spdx.org/licenses/GPL-3.0", - "version": "0.0.2.9000", + "version": "0.0.3", "programmingLanguage": { "@type": "ComputerLanguage", "name": "R", @@ -163,7 +163,7 @@ } ], "readme": "https://github.com/inbo/git2rdata/blob/master/README.md", - "fileSize": "334.831KB", + "fileSize": "336.015KB", "contIntegration": [ "https://travis-ci.org/inbo/git2rdata", "https://ci.appveyor.com/project/ThierryO/git2rdata/branch/master", diff --git a/man/list_data.Rd b/man/list_data.Rd index 68cac46..5e7c31e 100644 --- a/man/list_data.Rd +++ b/man/list_data.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/list_data.R \name{list_data} \alias{list_data} -\title{list available data objects in the repository} +\title{List available data files} \usage{ list_data(root = ".", path = ".", recursive = TRUE) } @@ -17,7 +17,7 @@ list_data(root = ".", path = ".", recursive = TRUE) a character vector is dataframe names, including their relative path } \description{ -list available data objects in the repository +List available data files } \examples{ ## on file system diff --git a/man/read_vc.Rd b/man/read_vc.Rd index 6d81af0..4a72f3c 100644 --- a/man/read_vc.Rd +++ b/man/read_vc.Rd @@ -2,7 +2,7 @@ % Please edit documentation in R/read_vc.R \name{read_vc} \alias{read_vc} -\title{Read a \code{data.frame} from a repository} +\title{Read a \code{data.frame}} \usage{ read_vc(file, root = ".") } @@ -17,7 +17,7 @@ Defaults to the current working directory (".").} The \code{data.frame} with the file names and hashes as attributes } \description{ -Read a \code{data.frame} from a repository +Note that the dataframe has to be written with \code{write_vc()} before it can be read with \code{read_vc()}. 
} \examples{ ## on file system diff --git a/man/write_vc.Rd b/man/write_vc.Rd index 51b63f3..eb58764 100644 --- a/man/write_vc.Rd +++ b/man/write_vc.Rd @@ -3,7 +3,7 @@ \name{write_vc} \alias{write_vc} \alias{write_vc.git_repository} -\title{Write a \code{data.frame} to a git repository} +\title{Write a \code{data.frame}} \usage{ write_vc(x, file, root = ".", sorting, strict = TRUE, optimize = TRUE, na = "NA", ...) diff --git a/pkgdown/extra.css b/pkgdown/extra.css index 16d9c9d..fb1e7e6 100644 --- a/pkgdown/extra.css +++ b/pkgdown/extra.css @@ -5,6 +5,10 @@ body { font-family: FlandersArtSans-Light, Verdana, Arial, sans-serif; } +.row{ + background-color: #ffffff; +} + a { color: #c04384; } diff --git a/tests/testthat/test_d_recent_commit.R b/tests/testthat/test_d_recent_commit.R index 066e882..3de15e2 100644 --- a/tests/testthat/test_d_recent_commit.R +++ b/tests/testthat/test_d_recent_commit.R @@ -1,8 +1,8 @@ context("recent_commit") -# currently odb_blobs() can't handle subsecond commits -# when TRUE Sys.sleep(1.1) is added before each commit -subsecond <- TRUE +# git timings don't handle subsecond changes +# therefore Sys.sleep(subsecond) is added before each commit +subsecond <- 1.2 root <- tempfile(pattern = "git2rdata-recent") dir.create(root) @@ -25,14 +25,14 @@ write_vc( test_data[5:6, ], file = "test1", root = root, stage = TRUE, sorting = "test_Date" ) -if (subsecond) Sys.sleep(1.1) +Sys.sleep(subsecond) commit_3 <- commit(root, "update first file") write_vc( test_data[7:8, ], file = "test3", root = root, stage = TRUE, sorting = "test_Date" ) -if (subsecond) Sys.sleep(1.1) +Sys.sleep(subsecond) commit_4 <- commit(root, "add third file") write_vc( @@ -70,8 +70,13 @@ write_vc( sorting = "test_Date" ) commit_7 <- commit(root, "second subsecond") +write_vc( + test_data[15:16, ], file = "subsecond", root = root, stage = TRUE, + sorting = "test_Date" +) +commit_8 <- commit(root, "third subsecond") expect_warning( output <- recent_commit(file = "subsecond", 
root, data = TRUE), "Multiple commits within the same second" ) -expect_true(all(output$commit %in% c(commit_6$sha, commit_7$sha))) +expect_true(all(output$commit %in% c(commit_6$sha, commit_7$sha, commit_8$sha))) diff --git a/vignettes/efficiency.Rmd b/vignettes/efficiency.Rmd index 385e2ca..dfb8c04 100644 --- a/vignettes/efficiency.Rmd +++ b/vignettes/efficiency.Rmd @@ -151,7 +151,7 @@ str(airbag) ### On a file system -We start by writing the dataset as is with `write.table()`, `saveRDS()`, `write_vc()` and `write_vc()` without storage optimization. Note that `write_vc()` uses optimization by default. Since `write_vc()` creates two files for each data set, we take their combinated file size into account. +We start by writing the dataset as is with `write.table()`, `saveRDS()`, `write_vc()` and `write_vc()` without storage optimization. Note that `write_vc()` uses optimization by default. Since `write_vc()` creates two files for each data set, we take their combined file size into account. ```{r set_tmp_dir} library(git2rdata) @@ -173,7 +173,7 @@ fn <- write_vc(airbag, "airbag_verbose", root, sorting = "X", optimize = FALSE) verbose_size <- sum(file.size(file.path(root, fn))) ``` -Since the data is highly compressable, `saveRDS()` yields the smallest file at the cost of having a binary file format. Both `write_vc()` formats yield smaller files than `write.table()`. Partly because `write_vc()` doesn't store row names and only uses quotes when needed. The difference between the optimized and verbose version of `write_vc()` is, in this case, solely due to the way factors are stored in the data (tsv) file. The optimized version stores the indices of the factor whereas the verbose version stores the levels. For example: `airbag$dvcat` has 5 levels with fairly short levels (on average 5 character), however storing the index requires only 1 character. Resulting in more compact files. 
+Since the data is highly compressible, `saveRDS()` yields the smallest file at the cost of having a binary file format. Both `write_vc()` formats yield smaller files than `write.table()`, partly because `write_vc()` doesn't store row names and only uses quotes when needed. The difference between the optimized and verbose version of `write_vc()` is, in this case, solely due to the way factors are stored in the data (tsv) file. The optimized version stores the indices of the factor whereas the verbose version stores the levels. For example: `airbag$dvcat` has 5 levels with fairly short labels (on average 5 characters), whereas storing the index requires only 1 character, resulting in more compact files. ```{r table_file_size, echo = FALSE} kable( @@ -188,7 +188,7 @@ kable( ) ``` -The reduction in file size when storing in factors depends on the length of the labels, the number of levels and the number of observations. The figure below illustrates the huge gain as soon as the level labels contain a few characters. The gain is less pronounces when the factor has a large number of levels. The optimization fails only in the extreme cases with very short factor labels and a high number of labels. +The reduction in file size when storing in factors depends on the length of the labels, the number of levels and the number of observations. The figure below illustrates the huge gain as soon as the level labels contain a few characters. The gain is less pronounced when the factor has a large number of levels. The optimization fails only in extreme cases with very short factor labels and a high number of labels.
```{r factor_label_length, echo = FALSE, fig.cap = "Effect of the label length on the efficiency of storing factor optimized, assuming 1000 observations", warning = FALSE} ratio <- function(label_length = 1:20, n_levels = 9, n_obs = 1000) { @@ -272,9 +272,9 @@ ggplot(f_ratio, aes(x = observations, y = ratio, colour = levels)) + ### In git repositories -Here we will simulate how much space the data requires when the history is stored in git repository. We will create a git repository for each method and store several subsets of the same data. Each commit contains a new version of the data. Each version is a random sample containing 90% of the observations of the `airbag` data. Two consecutive versions of the subset will have about 90% of the observations in common. 10% of the observations will be replaced by other observations. +Here we will simulate how much space the data requires when the history is stored in a git repository. We will create a git repository for each method and store several subsets of the same data. Each commit contains a new version of the data. Each version is a random sample containing 90% of the observations of the `airbag` data. Two consecutive versions of the subset will have about 90% of the observations in common. 10% of the observations will be replaced by other observations. -After writing each version, we commit the file, perform garbage collection (`git gc`) on the git repository to minimize it size and then calculate the size of the git history (`git count-objects -v`). +After writing each version, we commit the file, perform garbage collection (`git gc`) on the git repository to minimize its size and then calculate the size of the git history (`git count-objects -v`). 
```{r git_size, eval = system.file("efficiency", "git_size.rds", package = "git2rdata") == ""} library(git2r) @@ -342,7 +342,7 @@ if (system.file("efficiency", "git_size.rds", package = "git2rdata") == "") { Each version of the data has on purpose a random order of observations and variables. This is what would happen in a worst case scenario as it would generate the largest posibble diff. We also test `write.table()` with a stable ordering of the observations and variables. -The randomised `write.table()` yields the largest git repository, converging to about 6.5 times the size of a git repository based on the sorted `write.table()`. `saveRDS()` yields a 25% reduction in repostory size compared to the randomised `write.table()`, but still is almost 5 times larger than the sorted `write.table()`. Note that the gain of storing binary files in a git repository is much smaller than the gain in individual file size because the git repository will be compressed too. The optimized `write_vc()` starts at 83% converges toward 72%, the verbose version starts at 90% and converges towards 105%. There is a clear gain when using `write_vc()` with optimization in terms of storage size and the availability of metadata. The verbose option of `write_vc()` lack the gain in terms of storage size but still has the metadata advantange. +The randomised `write.table()` yields the largest git repository, converging to about 6.5 times the size of a git repository based on the sorted `write.table()`. `saveRDS()` yields a 25% reduction in repository size compared to the randomised `write.table()`, but is still almost 5 times larger than the sorted `write.table()`. Note that the gain of storing binary files in a git repository is much smaller than the gain in individual file size because the git repository will be compressed too. The optimized `write_vc()` starts at 83% and converges towards 72%; the verbose version starts at 90% and converges towards 105%.
There is a clear gain when using `write_vc()` with optimization in terms of storage size and the availability of metadata. The verbose option of `write_vc()` lacks the gain in terms of storage size but still has the metadata advantage. ```{r plot_git_size, echo = FALSE, fig.cap = "Size of the git history using the different storage methods."} rs <- lapply( @@ -350,34 +350,35 @@ rs <- lapply( function(x) { if (x == "saveRDS") { fun <- "saveRDS" - method = "default" + optimized = "yes" } else if (x == "write_vc.optimized") { fun <- "write_vc" - method = "default" + optimized = "yes" } else if (x == "write_vc.verbose") { fun <- "write_vc" - method = "verbose" + optimized = "no" } else if (x == "write.table") { fun <- "write.table" - method = "default" + optimized = "no" } else if (x == "write.table.sorted") { fun <- "write.table" - method = "sorted" + optimized = "yes" } data.frame(commit = seq_along(repo_size[x, ]), size = repo_size[x, ], rel_size = repo_size[x, ] / repo_size["write.table.sorted", ], - fun = fun, method = method,stringsAsFactors = FALSE) + fun = fun, optimized = optimized, stringsAsFactors = FALSE) } ) rs <- do.call(rbind, rs) -ggplot(rs, aes(x = commit, y = size / 2^10, colour = fun, linetype = method)) + +rs$optimized <- factor(rs$optimized, levels = c("yes", "no")) +ggplot(rs, aes(x = commit, y = size / 2^10, colour = fun, linetype = optimized)) + geom_line() + scale_y_continuous("repo size (in MiB)") + scale_colour_manual("function", values = inbo_colours) ``` ```{r plot_rel_git_size, echo = FALSE, fig.cap = "Relative size of the git repository when compared to write.table()."} -ggplot(rs, aes(x = commit, y = rel_size, colour = fun, linetype = method)) + +ggplot(rs, aes(x = commit, y = rel_size, colour = fun, linetype = optimized)) + geom_line() + scale_y_continuous("size relative to sorted write.table()", breaks = 0:10) + scale_colour_manual("function", values = inbo_colours) @@ -445,7 +446,7 @@ if (system.file("efficiency", "read_timings.rds", 
package = "git2rdata") == "") } ``` -The timings on reading the data is is more extreme story. Reading the binary format takes about 8% of the time needed to read the standard plain text format using `read.table()`. `read_vc()` takes about 65% (optimized) and 79% (verbose) of the time needed by `read.table()`, which at first seems strange because `read_vc()` calls `read.table()` to read the files and has some extra work to convert the variables to the correct data type. The main difference is that `read_vc()` knows the required data type _a priori_ and passes this info to `read.table()`. Otherwise, `read.table()` has to guess the correct data type from the file. +The timings for reading the data are another story. Reading the binary format takes about 8% of the time needed to read the standard plain text format using `read.table()`. `read_vc()` takes about 65% (optimized) and 79% (verbose) of the time needed by `read.table()`, which at first seems strange because `read_vc()` calls `read.table()` to read the files and has some extra work to convert the variables to the correct data type. The main difference is that `read_vc()` knows the required data type _a priori_ and passes this info to `read.table()`. Otherwise, `read.table()` has to guess the correct data type from the file.
```{r plot_read_timings, echo = FALSE, fig.cap = "Boxplots for the read timings for the different methods."} mb$expr <- factor( diff --git a/vignettes/plain_text.Rmd b/vignettes/plain_text.Rmd index 0ef101d..53c7577 100644 --- a/vignettes/plain_text.Rmd +++ b/vignettes/plain_text.Rmd @@ -1,9 +1,9 @@ --- -title: "Storing dataframes as plain text files" +title: "Getting started" author: "Thierry Onkelinx" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Storing dataframes as plain text files} + %\VignetteIndexEntry{Getting started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -24,13 +24,13 @@ This vignette motivates why we wrote `git2rdata` and illustrates how you can use ### Maintaining variable classes -R has several options to store dataframes as plain text files from R. Base R has `write.table()` and its companions like `write.csv()` .Some other options are `readr::write_delim()`, `readr::write_csv()` and `readr::write_tsv()`. Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between `character` and `factor` is lost in translation. `read.table()` converts by default all strings to factors, `readr::read_csv()` keeps by default all strings as character. The factor levels are another thing which is lost. The functions determines the factor levels based on the observed levels in the plain text file. Hence factor levels without observations will disappear. The order of the factor levels is also determined by the available levels in the plain text file, which can be different from the original order. +R has several options to store dataframes as plain text files from R. Base R has `write.table()` and its companions like `write.csv()` .Some other options are `readr::write_delim()`, `readr::write_csv()` and `readr::write_tsv()`. 
Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between `character` and `factor` is lost in translation. `read.table()` converts by default all strings to factors, `readr::read_csv()` keeps by default all strings as character. The factor levels are another thing which is lost. These functions determine factor levels based on the observed levels in the plain text file. Hence factor levels without observations will disappear. The order of the factor levels is also determined by the available levels in the plain text file, which can be different from the original order. The `write_vc()` and `read_vc()` functions from `git2rdata` keep track of the class of each variable and, in case of a factor, also of the factor levels and their order. Hence this function pair preserves the information content of the dataframe. The `vc` suffix stands for **v**ersion **c**ontrol as these functions use their full capacity in combination with a version control system. ### Optimizing file storage -Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. The default option of `write_vc()` is to space optimize the data prior to writing. Since we use a tab delimited file format, we can omit quotes around character variables. This saves 2 byte per row for each character variable. Quotes are added automatically in the exptional cases when they are needed, e.g. to store a string that contains tab or newline characters. In such case, only those strings are quotes when it is strictly necessary. +Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. The default option of `write_vc()` is to minimize file size as much as possible prior to writing. Since we use a tab delimited file format, we can omit quotes around character variables. 
This saves 2 bytes per row for each character variable. Quotes are added automatically in the exceptional cases when they are needed, e.g. to store a string that contains tab or newline characters. In such cases, quotes are only used in row-variable combinations where the exception occurs. Since we store the class of each variable, further file size reductions can be achieved by following rules: @@ -43,9 +43,9 @@ Storing the factors, POSIXct and Date as their index, makes them less user reada ### Optimized for version control -Another main goal of of `git2rdata` is to optimise the storage of the plain text files under version control. `write_vc()` and `read_vc()` has methods for interacting with [git](https://git-scm.com/) repositories using the `git2r` framework. Users who want to use git without `git2r` or use a different version control system (e.g. [Subversion](https://subversion.apache.org/), [Mercurial](https://www.mercurial-scm.org/)), still can use `git2rdata` to write the files to disk and uses their prefered workflow on version control. +Another main goal of `git2rdata` is to optimise the storage of the plain text files under version control. `write_vc()` and `read_vc()` have methods for interacting with [git](https://git-scm.com/) repositories using the `git2r` framework. Users who want to use git without `git2r` or use a different version control system (e.g. [Subversion](https://subversion.apache.org/), [Mercurial](https://www.mercurial-scm.org/)) can still use `git2rdata` to write the files to disk and use their preferred workflow for version control. -Hence, `write_vc()` will always perform checks to look for changes which potentially lead to large diffs. More details on this in a dedicated vignette. Some problem will always yield a warning. Other problems will yield by default an error. The user can turn these errors into warnings by setting the `strict = FALSE` argument.
+Hence, `write_vc()` will always perform checks to look for changes which potentially lead to large diffs. More details on this in the [version control](https://inbo.github.io/git2rdata/articles/version_control.html) vignette. Some problems will always yield a warning. Other problems will yield an error by default. The user can turn these errors into warnings by setting the `strict = FALSE` argument. As this vignette ignores the part on version control, we will always use `write_vc(strict = FALSE)` and hide the warnings to improve the readability. @@ -87,7 +87,7 @@ library(git2rdata) write_vc(x = x, file = "first_test", root = path, strict = FALSE) ``` -`write_vc()` returns a vector of relative paths to the raw data and metadata files. The hashes of these files are uses as names of the vector. We can have a look at both files. We'll only display the first 10 rows of the raw data. Notice that the YAML format of the metadata has the benefit of being both human and machine readable. +`write_vc()` returns a vector of relative paths to the raw data and metadata files. The hashes of these files are used as names of the vector. We can have a look at both files. We'll only display the first 10 rows of the raw data. Notice that the YAML format of the metadata has the benefit of being both human and machine readable. ```{r manual_data} print_file <- function(file, root, n = -1) { @@ -128,9 +128,11 @@ y2 <- read_vc(file = "verbose", root = path) all.equal(x, y2, check.attributes = FALSE) ``` +As `read_vc()` requires the meta data, it can only read dataframes which were stored by `write_vc()`. + ## Missing values -`write_vc()` has an `na` arguments which specifies the string which is used to indicate missing values. Because we avoid using quotes, this string must be different from any character value in the data. This include factor labels when the data is stored verbose. This is checked and will always return an error, even with `strict = FALSE`. 
+`write_vc()` has an `na` argument which specifies the string which is used to indicate missing values. Because we avoid using quotes, this string must be different from any character value in the data. This includes factor labels when the data is stored verbose. This is checked and will always return an error, even with `strict = FALSE`. ```{r echo = FALSE, results = "hide"} stopifnot("X" %in% x$x, "b" %in% x$y) @@ -150,7 +152,7 @@ print_file("custom_na.tsv", path, 10) print_file("custom_na.yml", path, 4) ``` -The default string for missing values is `"NA"`. We recommend to keep this default, as long as the dataset permits it. A first good alternative is an empty string (`""`). If that won't work either, you'll have to use your imagination. Try to keep is short, clear and robust^[robust in the sense that you won't need to change it later]. +The default string for missing values is `"NA"`. We recommend to keep this default, as long as the dataset permits it. A first good alternative is an empty string (`""`). If that won't work either, you'll have to use your imagination. Try to keep it short, clear and robust^[robust in the sense that you won't need to change it later]. 
```{r empty_na} write_vc(x, "custom_na", path, strict = FALSE, na = "") diff --git a/vignettes/version_control.Rmd b/vignettes/version_control.Rmd index 7417e3e..3c08d9e 100644 --- a/vignettes/version_control.Rmd +++ b/vignettes/version_control.Rmd @@ -1,9 +1,9 @@ --- -title: "Storing dataframes under version control" +title: "Optimizing storage for version control" author: "Thierry Onkelinx" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Storing dataframes under version control} + %\VignetteIndexEntry{Optimizing storage for version control} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -58,7 +58,7 @@ Version control systems like [git](https://git-scm.com/), [subversion](https://s ## Sorting observations -Version control systems often track changes on plain text files based on row based differences. In layman's terms it only records which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated is the minimal example below. +Version control systems often track changes in plain text files based on row based differences. In layman's terms, they only record which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated in the minimal example below. Original version @@ -131,7 +131,7 @@ fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE) fn <- write_vc(x, "row_order", root, sorting = c("y", "x"), strict = FALSE) ``` -Once the sorting is defined we may omit it when writing new versions. The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found. +Once the sorting is defined we may omit the `sorting` argument when writing new versions.
The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found. ```{r update_sorted} print_file <- function(file, root, n = -1) { diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd index fa92c0e..ea67b0a 100644 --- a/vignettes/workflow.Rmd +++ b/vignettes/workflow.Rmd @@ -1,9 +1,9 @@ --- -title: "Potential workflows for working with dataframes under version control" +title: "Suggested workflow for storing a variable set of dataframes under version control" author: "Thierry Onkelinx" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Potential workflows for working with dataframes under version control} + %\VignetteIndexEntry{Suggested workflow for storing a variable set of dataframes under version control} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{git2r} @@ -20,11 +20,11 @@ set.seed(20120225) ## Introduction -This vignette describes a potential workflow for storing dataframes. This time we use a `git2r::repository()` object as the root. This adds git functionality to `write_vc()` and `read_vc()`. +This vignette describes a suggested workflow for storing dataframes. This time we use a `git2r::repository()` object as the root. This adds git functionality to `write_vc()` and `read_vc()`, provided by the [`git2r`](https://cran.r-project.org/package=git2r) package. This makes it possible to pull, stage, commit and push from within R. The rationale behind this workflow is that we have read-only access to a database containing the raw data. The database is beyond our control. Observations in the database can be added, removed or updated without our knowledge. These changes cannot be traced in the database. -The database defines a variable number of groups (e.g.
once every year). In order to make the analyses reproducible, we want to store the relevant data in a git repository. +The database defines a variable number of dataframes (e.g. species can be added or removed). We have defined a standard analysis which should run for each group. We want to repeat the analyses with some predefined frequency (e.g. once every year). In order to make the analyses reproducible, we want to store the relevant data in a git repository. ## Setup @@ -84,9 +84,9 @@ generate_data <- function(x, n = rpois(1, 10)) { ### First commit -Suppose that we have two groups at the first point in time. We read the data for these group from the database. We also store them in a list called `content` to be reused in the [next section](#automated-workflow-for-storing-dataframes). +Suppose that we have two groups at the first point in time. We read the data for these groups from the database. We also store them in a list called `content` to be reused in the [next section](#automated-workflow-for-storing-dataframes). -Then we connect to the git repository using `repository()`. Note that this assumes that `path` is an existing git repository. Now we can write each group to a dedicated data file in the repository. When the `root` argument of `write_vc()` is a `git_repository`, then it gains two additional arguments: `stage` and `force`. Setting `stage = TRUE`, will automatically stage the files written by `write_vc()`. +Then we connect to the git repository using `repository()`. Note that this assumes that `path` is an existing git repository. Now we can write each group to a dedicated data file in the repository. If the `root` argument of `write_vc()` is a `git_repository`, it gains two additional arguments: `stage` and `force`. Setting `stage = TRUE` will automatically stage the files written by `write_vc()`.
```{r store_data_1} A <- generate_data() @@ -133,7 +133,7 @@ cm <- commit(repo, message = "Second commit") ### Third commit -During the third point in time, group A is removed, group B unchanged and group C updated. So we remove group A and write the two other groups. We use `add = TRUE` to stage the unstaged removal of group A. Since group C was force into the history, `.gitignore` is overruled for these two files. +During the third point in time, group A is removed, group B unchanged and group C updated. So we remove group A and write the two other groups. We use `all = TRUE` to stage the unstaged removal of group A. Since group C was forced into the history, `.gitignore` is overruled for these two files. ```{r store_data_3} C <- generate_data(C) @@ -149,12 +149,13 @@ status(repo) ## Automated workflow for storing dataframes -The list `content` contains the relevant data at the different points in time. We create a -custom function to store the data in an automated way. In pratice we will run this function each time we want to make a snapshot of the data. In this example we emulate that by applying it to each element of `content`. +To mimic a changing dataset we reuse the list `content` created above. This contains the relevant data at the different points in time. We create a custom function to store the data in an automated way. In practice we will run this function each time we want to make a snapshot of the data. In this example we emulate that by applying it to each element of `content`. We start by pulling the remote repository to make sure that our local repository has the latest version. Then we want to write the dataframe for each group. But how do we detect which groups are no longer present? A straightforward workaround for this problem is to first remove all data files. Then write all currently existing dataframes to the repository. Since we only removed the data files, any preexisting metadata is still available.
After writing all existing dataframes we only are left with cleaning dangling metadata files. To make this process more convenient we created `rm_data()` and `prune_meta()`. `prune_meta()` will remove any `.yml` file without matching `.tsv` file. `rm_data()` removes by default all `.tsv` files with associated `.yml` file. When applied on a `git_repository` object, there is an extra fail-safe because then it will only remove unmodified files. _Caveat_: when applied on a path, it will remove _all_ data files, without warning. Even when the path points to a git repository. So use `rm_data()` and `prune_meta()` with care. -The last steps in the function consists of committing the changes and push them to the remote repository. We had to add a `Sys.sleep(1)` to avoid commits within the same second. This should not be needed in a real-life situation. +The last steps in the function consist of committing the changes and pushing them to the remote repository. We had to add a `Sys.sleep(2)` to avoid commits within the same second. This should not be needed in a real-life situation. + +Please note that the function below is intended as a template. In practice, step 3 would contain user-defined functions to create the relevant dataframes and store them using `write_vc()`. ```{r automated_flow} store_data <- function(df, repo) { @@ -162,7 +163,7 @@ store_data <- function(df, repo) { pull(repo) # step 2: remove all exisiting data files rm_data(repo, path = ".", type = "all", stage = TRUE) - # step 3: write all current data + # step 3: create and write all relevant dataframes lapply( names(df), function(i) { @@ -176,7 +177,7 @@ store_data <- function(df, repo) { commit(repo, "Scripted commit from git2rdata", session = TRUE) # step 6: update the remote repository push(repo) - # avoid subsecond commits + # avoid subsecond commits, only needed in this toy example Sys.sleep(2) } ```