The following vignette complements this page: Recommendations for Using summarytools With Rmarkdown
summarytools is an R package providing tools to neatly and quickly summarize data. It can also make R a little easier to learn and use. Four functions are at the core of the package:
- freq(): frequency tables with proportions, cumulative proportions and missing data information.
- ctable(): cross-tabulations between two factors or any discrete data, with total, row or column proportions, as well as marginal totals.
- descr(): descriptive (univariate) statistics for numerical vectors.
- dfSummary(): extensive data frame summaries that facilitate data cleaning and firsthand evaluation.
Emphasis has been put on both what is presented and how it is presented, so that the package can serve as both a data exploration tool and a reporting tool. It can be used on its own for minimal reports, or along with larger sets of tools such as RStudio’s tools for rmarkdown and knitr.
Building on the strengths of pander and htmltools, the outputs produced by summarytools can be:
- Displayed in plain text in the R console (default behaviour)
- Used in Rmarkdown documents and knitted along with other text and R output
- Written to html files that fire up in RStudio’s Viewer pane or in your system’s default browser
- Written to plain text files / Rmarkdown text files
Some people have successfully included some of the package’s functions in shiny apps, too!
summarytools' dataframe summaries are now part of radiant, an outstanding Shiny App for Business analytics that I highly recommend.
To benefit from all the latest fixes, install it from GitHub:
install.packages("devtools")
library(devtools)
install_github('dcomtois/summarytools')
To install the most recent version available on CRAN:
install.packages("summarytools")
For enthusiastic users willing to contribute to summarytools’ development, I encourage you to go for the development version, which is the most up-to-date, but also a work in progress. Bugs may show up, but if you report them I can generally fix them quickly.
install.packages("devtools")
library(devtools)
install_github('dcomtois/summarytools', ref='dev-current')
You can see the source code and documentation on the official R site here.
The freq() function generates a table of frequencies with counts and proportions. Since this page uses markdown rendering, we’ll set style = 'rmarkdown' to take advantage of it.
library(summarytools)
freq(iris$Species, style = "rmarkdown")
Variable: iris$Species
Type: Factor (unordered)
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | | | 0.00 | 100.00 |
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
If we are not concerned with missing data, we can set report.nas = FALSE:
freq(iris$Species, report.nas = FALSE, style = "rmarkdown", omit.headings = TRUE)
| | Freq | % | % Cum. |
---|---|---|---|
setosa | 50 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 |
Total | 150 | 100.00 | 100.00 |
We could furthermore omit the Totals row by setting totals = FALSE.
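For readers curious about what the columns of the table above contain, the counts and percentages can be reproduced with base R alone. This is only a sketch of the arithmetic, not how freq() is implemented; the name freq_sketch is made up for the illustration.

```r
# Base-R sketch of the counts and percentages shown above.
# `freq_sketch` is a hypothetical name, not part of summarytools.
counts <- table(iris$Species)                 # NAs are dropped by default
pct    <- 100 * as.vector(counts) / sum(counts)

freq_sketch <- data.frame(
  Freq      = as.vector(counts),
  Pct       = round(pct, 2),
  Pct.Cum   = round(cumsum(pct), 2),
  row.names = names(counts)
)
print(freq_sketch)
```

Dropping the Totals row then simply amounts to not appending a summary line to this data frame.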
We’ll now use a sample data frame called tobacco, which is included in the package. We want to cross-tabulate the two categorical variables smoker and diseased. By default, ctable() gives row proportions, so no additional argument is needed here.
Since markdown has no support (yet) for multi-line headings, we’ll show an image of the resulting html table.
with(tobacco, view(ctable(smoker, diseased)))
Notice that instead of ctable(tobacco$smoker, tobacco$diseased, ...), we used the with() function, making the syntax less redundant.
It is possible to display column, total, or no proportions at all. We can also omit the marginal totals to have a simple 2 x 2 table.
with(tobacco,
print(ctable(smoker, diseased, prop = 'n', totals = FALSE),
omit.headings = TRUE, method = 'render'))
smoker \ diseased | Yes | No |
---|---|---|
Yes | 125 | 173 |
No | 99 | 603 |
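To make the difference between row, column and total proportions concrete, here is a small base-R sketch built from the counts in the table above. It uses prop.table() and addmargins() rather than ctable(), so it only illustrates the arithmetic; the object names are hypothetical.

```r
# Counts copied from the 2 x 2 table above (rows: smoker, columns: diseased).
tab <- as.table(matrix(c(125, 99, 173, 603), nrow = 2,
                       dimnames = list(smoker   = c("Yes", "No"),
                                       diseased = c("Yes", "No"))))

row_props   <- prop.table(tab, margin = 1)  # each row sums to 1
col_props   <- prop.table(tab, margin = 2)  # each column sums to 1
total_props <- prop.table(tab)              # the whole table sums to 1

round(100 * row_props, 1)   # row percentages
addmargins(tab)             # counts with marginal totals
```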
The descr() function generates common central tendency statistics and measures of dispersion for numerical data. It can handle single vectors as well as data frames, in which case it just ignores non-numerical columns (and displays a message to that effect).
descr(iris, style = "rmarkdown")
## Non-numerical variable(s) ignored: Species
Data Frame: iris
N: 150
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|---|
Mean | 5.84 | 3.06 | 3.76 | 1.20 |
Std.Dev | 0.83 | 0.44 | 1.77 | 0.76 |
Min | 4.30 | 2.00 | 1.00 | 0.10 |
Q1 | 5.10 | 2.80 | 1.60 | 0.30 |
Median | 5.80 | 3.00 | 4.35 | 1.30 |
Q3 | 6.40 | 3.30 | 5.10 | 1.80 |
Max | 7.90 | 4.40 | 6.90 | 2.50 |
MAD | 1.04 | 0.44 | 1.85 | 1.04 |
IQR | 1.30 | 0.50 | 3.50 | 1.50 |
CV | 0.14 | 0.14 | 0.47 | 0.64 |
Skewness | 0.31 | 0.31 | -0.27 | -0.10 |
SE.Skewness | 0.20 | 0.20 | 0.20 | 0.20 |
Kurtosis | -0.61 | 0.14 | -1.42 | -1.36 |
N.Valid | 150.00 | 150.00 | 150.00 | 150.00 |
Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 |
If your eyes/brain prefer seeing things the other way around, just use transpose = TRUE. Here, we also select only the statistics we wish to see, and specify omit.headings = TRUE to avoid reprinting the same information as above.
descr(iris, stats = c("mean", "sd", "min", "med", "max"), transpose = TRUE,
omit.headings = TRUE, style = "rmarkdown")
## Non-numerical variable(s) ignored: Species
| | Mean | Std.Dev | Min | Median | Max |
---|---|---|---|---|---|
Sepal.Length | 5.84 | 0.83 | 4.30 | 5.80 | 7.90 |
Sepal.Width | 3.06 | 0.44 | 2.00 | 3.00 | 4.40 |
Petal.Length | 3.76 | 1.77 | 1.00 | 4.35 | 6.90 |
Petal.Width | 1.20 | 0.76 | 0.10 | 1.30 | 2.50 |
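As a cross-check of the figures above, the same five statistics can be computed with base R. This is a minimal sketch of the calculations, not descr()'s own code, and stats_sketch is a hypothetical name.

```r
# Base-R sketch of a few of the statistics reported above.
num_cols <- iris[sapply(iris, is.numeric)]   # keeps the four numeric columns

stats_sketch <- sapply(num_cols, function(x) {
  c(Mean = mean(x), Std.Dev = sd(x), Min = min(x),
    Median = median(x), Max = max(x))
})

round(stats_sketch, 2)      # statistics in rows, variables in columns
round(t(stats_sketch), 2)   # variables in rows, as in the transposed table
```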
dfSummary() collects information about all variables in a data frame and displays it in a single, legible table.
With the following tiny bit of code, we’ll generate a summary report for the iris data frame and have it displayed in RStudio’s Viewer pane:
# Load the package
library(summarytools)
# Generate the summary
view(dfSummary(iris))
It is also possible to use dfSummary() in Rmarkdown documents. In this next example, note that due to rmarkdown compatibility issues, histograms are not shown. We’re working on this. Further down, we’ll see how to use html rendering to work around this problem.
dfSummary(tobacco, plain.ascii = FALSE, style = "grid")
tobacco
N: 1000
No | Variable | Stats / Values | Freqs (% of Valid) | Text Graph | Valid | Missing |
---|---|---|---|---|---|---|
1 | gender [factor] | 1. F 2. M | 489 (50.0%) 489 (50.0%) | IIIIIIIIIIIIIIII IIIIIIIIIIIIIIII | 978 (97.8%) | 22 (2.2%) |
2 | age [numeric] | mean (sd) : 49.6 (18.29) min < med < max : 18 < 50 < 80 IQR (CV) : 32 (0.37) | 63 distinct values | | 975 (97.5%) | 25 (2.5%) |
3 | age.gr [factor] | 1. 18-34 2. 35-50 3. 51-70 4. 71 + | 258 (26.5%) 241 (24.7%) 317 (32.5%) 159 (16.3%) | IIIIIIIIIIIII IIIIIIIIIIII IIIIIIIIIIIIIIII IIIIIIII | 975 (97.5%) | 25 (2.5%) |
4 | BMI [numeric] | mean (sd) : 25.73 (4.49) min < med < max : 8.83 < 25.62 < 39.44 IQR (CV) : 5.72 (0.17) | 974 distinct values | | 974 (97.4%) | 26 (2.6%) |
5 | smoker [factor] | 1. Yes 2. No | 298 (29.8%) 702 (70.2%) | IIIIII IIIIIIIIIIIIIIII | 1000 (100%) | 0 (0%) |
6 | cigs.per.day [numeric] | mean (sd) : 6.78 (11.88) min < med < max : 0 < 0 < 40 IQR (CV) : 11 (1.75) | 37 distinct values | | 965 (96.5%) | 35 (3.5%) |
7 | diseased [factor] | 1. Yes 2. No | 224 (22.4%) 776 (77.6%) | IIII IIIIIIIIIIIIIIII | 1000 (100%) | 0 (0%) |
8 | disease [character] | 1. Hypertension 2. Cancer 3. Cholesterol 4. Heart 5. Pulmonary 6. Musculoskeletal 7. Diabetes 8. Hearing 9. Digestive 10. Hypotension [ 3 others ] | 36 (16.2%) 34 (15.3%) 21 ( 9.5%) 20 ( 9.0%) 20 ( 9.0%) 19 ( 8.6%) 14 ( 6.3%) 14 ( 6.3%) 12 ( 5.4%) 11 ( 5.0%) 21 ( 9.4%) | IIIIIIIIIIIIIIII IIIIIIIIIIIIIII IIIIIIIII IIIIIIII IIIIIIII IIIIIIII IIIIII IIIIII IIIII IIII IIIIIIIII | 222 (22.2%) | 778 (77.8%) |
9 | samp.wgts [numeric] | mean (sd) : 1 (0.08) min < med < max : 0.86 < 1.04 < 1.06 IQR (CV) : 0.19 (0.08) | 0.86!: 267 (26.7%) 1.04!: 249 (24.9%) 1.05!: 324 (32.4%) 1.06!: 160 (16.0%) ! rounded | IIIIIIIIIIIII IIIIIIIIIIII IIIIIIIIIIIIIIII IIIIIII | 1000 (100%) | 0 (0%) |
summarytools has a generic print method, print.summarytools(). By default, its method argument is set to 'pander'. One of the ways in which view() is useful is that we can use it to easily display html outputs in RStudio’s Viewer. In this case, the view() function simply acts as a wrapper around the generic print() function, specifying method = 'viewer' for us. When used outside RStudio, the method falls back on 'browser' and the report is fired up in the system’s default browser.
With freq() and descr(), you can use R’s base function by() to show statistics split by a grouping (categorical) variable. R’s by() function returns a list containing as many summarytools objects as there are categories in our grouping variable.
To properly display the content of that list, we use the view() function. Using print(), while technically possible, will not give results that are as satisfactory.
Using the iris data frame, we will display descriptive statistics broken down by Species.
# First save the results
iris_stats_by_species <- by(data = iris,
INDICES = iris$Species,
FUN = descr, stats = c("mean", "sd", "min", "med", "max"),
transpose = TRUE)
# Then use view(), like so:
view(iris_stats_by_species, method = "pander", style = "rmarkdown")
Data Frame: iris
Group: Species = setosa
N: 50
| | Mean | Std.Dev | Min | Median | Max |
---|---|---|---|---|---|
Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 |
Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 |
Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 |
Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 |
Group: Species = versicolor
N: 50
| | Mean | Std.Dev | Min | Median | Max |
---|---|---|---|---|---|
Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 |
Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 |
Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 |
Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 |
Group: Species = virginica
N: 50
| | Mean | Std.Dev | Min | Median | Max |
---|---|---|---|---|---|
Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 |
Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 |
Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 |
Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 |
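The list-like structure that by() returns can be seen with a plain base-R function in place of descr(). The sketch below uses colMeans() purely for illustration; means_by_species is a hypothetical name.

```r
# by() splits the data frame on the grouping variable and applies the
# function to each piece; the result behaves like a list with one
# element per group.
means_by_species <- by(iris[, 1:4], iris$Species, colMeans)

length(means_by_species)                  # one element per species
round(means_by_species[["setosa"]], 2)    # group means for setosa
```

When descr() is used as the function, each element is a summarytools object instead of a numeric vector, which is why a dedicated display function is helpful.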
To see an html version of these results, we’d simply do this (results not shown):
view(iris_stats_by_species)
Instead of showing several tables having only one column each, the view() function will assemble the results into a single table:
BMI_by_age <- with(tobacco,
by(BMI, age.gr, descr,
stats = c("mean", "sd", "min", "med", "max")))
view(BMI_by_age, "pander", style = "rmarkdown")
Variable: tobacco$BMI by age.gr
| | 18-34 | 35-50 | 51-70 | 71 + |
---|---|---|---|---|
Mean | 23.84 | 25.11 | 26.91 | 27.45 |
Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 |
Min | 8.83 | 10.35 | 9.01 | 16.36 |
Median | 24.04 | 25.11 | 26.77 | 27.52 |
Max | 34.84 | 39.44 | 39.21 | 38.37 |
The transposed version looks like this:
BMI_by_age <- with(tobacco,
by(BMI, age.gr, descr, transpose = TRUE,
stats = c("mean", "sd", "min", "med", "max")))
view(BMI_by_age, "pander", style = "rmarkdown", omit.headings = TRUE)
| | Mean | Std.Dev | Min | Median | Max |
---|---|---|---|---|---|
18-34 | 23.84 | 4.23 | 8.83 | 24.04 | 34.84 |
35-50 | 25.11 | 4.34 | 10.35 | 25.11 | 39.44 |
51-70 | 26.91 | 4.26 | 9.01 | 26.77 | 39.21 |
71 + | 27.45 | 4.37 | 16.36 | 27.52 | 38.37 |
As is the case for by(), the view() function is essential for making results nice and tidy when freq() is applied to several variables at once with lapply().
tobacco_subset <- tobacco[ ,c("gender", "age.gr", "smoker")]
freq_tables <- lapply(tobacco_subset, freq)
view(freq_tables, footnote = NA, file = 'freq-tables.html')
As we have seen, summarytools can generate both text (including rmarkdown) and html results. Both can be used in Rmarkdown, according to your preferences. The vignette mentioned at the top of this page is dedicated to showing examples, but if you’re in a hurry, here are a few tips to get started:
- Always set the knitr chunk option results = 'asis'. You can do this on a chunk-by-chunk basis, but here is how to do it globally:
knitr::opts_chunk$set(echo = TRUE, results = 'asis')
Refer to this page for more on knitr’s options.
- To get better results when using html (with method = 'render'), set up your .Rmd document so it includes summarytools’ css.
# ---
# title: "RMarkdown using summarytools"
# output:
# html_document:
# css: C:/R/win-library/3.4/summarytools/includes/stylesheets/summarytools.css
# ---
# ```{r, results='asis'}
# library(summarytools)
# freq(tobacco$smoker, style='rmarkdown')
#
# print(dfSummary(tobacco, style = 'grid', plain.ascii = FALSE, graph.magnif = 0.85),
# method = 'render', omit.headings = TRUE)
# ```
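The css path in the YAML header above is specific to one Windows installation. As a hedged alternative, base R's system.file() can look up the stylesheet in whatever library the package is installed in. The relative path used here is taken from the example above and is an assumption for other versions of summarytools.

```r
# Locate the stylesheet inside the installed package.
# The sub-path "includes/stylesheets/summarytools.css" is copied from the
# YAML example above; it may differ in other versions of the package.
css_path <- system.file("includes", "stylesheets", "summarytools.css",
                        package = "summarytools")

if (nzchar(css_path)) {
  message("summarytools css found at: ", css_path)
} else {
  message("Stylesheet not found; check that summarytools is installed.")
}
```

The printed path can then be pasted into the css: entry of the document's YAML header.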
The console will always tell you the location of the temporary html file that is created in the process. However, you can specify the name and location of that file explicitly if you need to reuse it later on:
view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
Based on the file extension you provide (.html vs others), summarytools will use the appropriate method; there is no need to specify the method argument.
There is also a logical append argument for adding content to existing reports, both text/Rmarkdown and html. This is useful if you want to quickly include several statistical tables in a single file. It is a fast alternative to creating an .Rmd document if you don’t need the extra content that the latter allows.
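A minimal sketch of how two tables might be collected into one text file follows. It assumes summarytools is installed and relies only on the file, append and style arguments described on this page; the file name is arbitrary and the exact behaviour may vary between package versions.

```r
# Sketch: write two tables into a single Markdown file.
# Assumes summarytools is installed; the argument names (file, append,
# style) are the ones described in this document.
if (requireNamespace("summarytools", quietly = TRUE)) {
  library(summarytools)
  report_file <- file.path(tempdir(), "iris-report.md")

  # The first call creates (or overwrites) the file...
  print(freq(iris$Species, style = "rmarkdown"), file = report_file)
  # ...and the second one adds to it instead of replacing it.
  print(descr(iris, style = "rmarkdown"), file = report_file, append = TRUE)

  length(readLines(report_file))   # number of lines written so far
}
```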
Version 0.8.3 introduced the following set of global options:
- round.digits = 2
- plain.ascii = TRUE
- omit.headings = FALSE (if using in a markdown document or a shiny app, setting this to TRUE might be preferable)
- footnote = 'default' (set to empty string or NA to omit footnote)
- display.labels = TRUE
- freq.totals = TRUE
- freq.display.nas = TRUE
- ctable.totals = TRUE
- ctable.prop = 'r' (display row proportions by default)
- descr.stats = 'all'
- descr.transpose = FALSE
- bootstrap.css = TRUE (if using in a markdown document or a shiny app, setting this to FALSE might be preferable)
- custom.css = NA
- escape.pipe = FALSE
st_options() # display all global options' values
st_options('round.digits') # display only one option
st_options('omit.headings', TRUE) # change an option's value
st_options('footnote', NA) # Turn off the footnote on all outputs.
# This option was used prior to generating
# the present document.
When a summarytools object is stored, its formatting attributes are stored with it. However, you can override most of them when using the print() and view() functions.
age_stats <- freq(tobacco$age.gr) # age_stats contains a regular output for freq
# including headings, NA counts, and Totals
print(age_stats, style = "rmarkdown", report.nas = FALSE,
totals = FALSE, omit.headings = TRUE)
| | Freq | % | % Cum. |
---|---|---|---|
18-34 | 258 | 26.46 | 26.46 |
35-50 | 241 | 24.72 | 51.18 |
51-70 | 317 | 32.51 | 83.69 |
71 + | 159 | 16.31 | 100.00 |
Note that the omitted attributes are still part of the age_stats object.
- Options overridden explicitly with print() or view() have precedence
- Options specified as explicit arguments to freq() / ctable() / descr() / dfSummary() come second
- Global options, which can be set with st_options(), come third
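The order of precedence listed above can be pictured as a lookup through three lists, stopping at the first one that defines the option. The helper below, resolve_option(), is purely hypothetical and is not how summarytools is implemented; it only illustrates the rule.

```r
# Hypothetical helper illustrating the precedence order described above.
# This is NOT summarytools' actual implementation.
resolve_option <- function(name, print_args = list(),
                           function_args = list(), global_opts = list()) {
  for (src in list(print_args, function_args, global_opts)) {
    if (!is.null(src[[name]])) {
      return(src[[name]])   # the first source that defines the option wins
    }
  }
  NULL                      # option not set anywhere
}

global_opts   <- list(totals = TRUE, round.digits = 2)  # as if set globally
function_args <- list(round.digits = 3)                 # as if passed to freq()
print_args    <- list(totals = FALSE)                   # as if passed to print()

resolve_option("totals", print_args, function_args, global_opts)
resolve_option("round.digits", print_args, function_args, global_opts)
```

Here totals is taken from the print() level and round.digits from the function level, while anything left unset falls through to the global list.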
Version 0.8 of summarytools uses RStudio’s htmltools package and version 4 of Bootstrap’s cascading stylesheets.
It is possible to include your own css if you wish to customize the look of the output tables. See the documentation for the package’s print.summarytools() function for details, but here is a quick example to give you the gist of it.
Say you need to make the font size really, really small. For this, you create a CSS file - let’s call it “custom.css” - containing the following class:
.table-condensed {
font-size: 8px;
}
Then, to apply it to a summarytools object and display it in your browser:
view(dfSummary(tobacco), custom.css = 'path/to/custom.css',
table.classes = 'table-condensed')
To display a table that is smaller, but not quite that small, you can use the provided css class st-small.
To include summarytools functions in shiny apps, it is recommended that you:
- set bootstrap.css to FALSE to avoid interacting with the app’s layout
- adjust the size of the graphs in dfSummary()
- omit headings
print(dfSummary(somedata, graph.magnif = 0.8),
method = 'render',
omit.headings = TRUE,
bootstrap.css = FALSE)
The package comes with no guarantees. It is a work in progress and feedback / feature requests are welcome. Just send me an email (dominic.comtois (at) gmail.com), or open an Issue if you find a bug.
Also, the package grew significantly larger, and maintaining it all by myself is time consuming. If you would like to contribute, please get in touch, I’d greatly appreciate the help.