📚 New vignette, with "native" dataset.

ampatzia · Aug 21, 2017 · 469e882 · 469e882
1 parent bd3214e
commit 469e882
Show file tree

Hide file tree

Showing 6 changed files with 126 additions and 38 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -18,7 +18,8 @@ Imports: magrittr,
          gridExtra,
          stringi,
          stringr,
-         readr
+         readr,
+         dendextend
 RoxygenNote: 6.0.1
 Suggests: knitr,
     rmarkdown,

diff --git a/R/main.R b/R/main.R
@@ -1006,7 +1006,10 @@ pm_cluster <- function(fluidity_list,method="ward.D",genome_names){
                         fill = 0) #make matrix diagonal to convert to distance
   dist_matrix <- dist_matrix[, -1]
   clust_res <- stats::hclust(stats::as.dist(dist_matrix), method = method)
-  if(!missing(genome_names)){clust_res$labels <- genome_names$Organism}
+  if(class(genome_names)== "data.frame"){clust_res$labels <- genome_names$Organism}
+  if(class(genome_names)== "character"){clust_res$labels <- genome_names}
+  if(!missing(genome_names)){clust_res$labels <- paste0("gen_",seq(1,nrow(fluidity_list$genome_fluidity),1))}
+
   return(clust_res)
 }
 
@@ -1023,8 +1026,6 @@ pm_cluster <- function(fluidity_list,method="ward.D",genome_names){
 #' Bacillus thuringiensis. Genome names are contained in the bac_stre_names
 #' dataset.
 #'
-#' @docType data
-#'
 #' @usage data(bac_stre)
 #'
 #'
@@ -1044,12 +1045,9 @@ pm_cluster <- function(fluidity_list,method="ward.D",genome_names){
 #' This dataset contains the names of the genomes inside the bac_stre dataset.
 #'
 #'
-#' @docType data
-#'
 #' @usage data(bac_stre_names)
 #'
-#'
-#' @keywords datasets
+
 #'
 #' @references Mpatziakas A, Psomopoulos FE, Moysiadis T and Sgardelis S. Computing pangenome statistics in R
 #'  F1000Research 2017, 6(ISCB Comm J):1529 (poster) (doi: 10.7490/f1000research.1114765.1)

diff --git a/README.md b/README.md
@@ -12,3 +12,8 @@ If package *devtools* is present, simply run:
 library(devtools)
 install_github("ampatzia/PasaR")
 ```
+
+# Relevant publications
+
+Mpatziakas A, Psomopoulos FE, Moysiadis T and Sgardelis S. Computing pangenome statistics in R. F1000Research 2017, 6(ISCB Comm J):1529 (poster) ([doi: 10.7490/f1000research.1114765.1](http://dx.doi.org/10.7490/f1000research.1114765.1))
+
diff --git a/man/bac_stre.Rd b/man/bac_stre.Rd
diff --git a/man/bac_stre_names.Rd b/man/bac_stre_names.Rd
diff --git a/vignettes/pasaR-vignette.Rmd b/vignettes/pasaR-vignette.Rmd
@@ -10,88 +10,95 @@ vignette: >
 ---
 
 
-```{r, message= FALSE, warning= FALSE}
+```{r load_libs, message= FALSE, warning= FALSE}
 library(pasaR) 
 library(knitr)
-library(micropan)
+library(dendextend)
 knitr::opts_chunk$set(dev = 'png')
 
 ```
 
-```{r, echo=FALSE}
-data( "Mpneumoniae.blast.panmat" )
-panm<- Mpneumoniae.blast.panmat
-panm<-as.data.frame(panm)
-rm(Mpneumoniae.blast.panmat)
+```{r load_data, echo=FALSE}
+data( "bac_stre" )
+data( "bac_stre_names" )
 
 ```
 
 # Introduction
 
- In this vignette we will showcase the package functions on the Mpneumoniae dataset offered on package **micropan**, obtained by 'r data( "Mpneumoniae.blast.panmat" )'.
+ In this vignette the package functions will be showcased on a dataset containing
+ seven (81) strains of the following bacteria : twelve (12) of Streptococcus pneumoniae,
+ thirteen (13) of Streptococcus Pyogenes, thirty nine (39) of Bacillus cereus and 
+ seventeen (17) of Bacillus thuringiensis.
 
 # Genome summary
 
-A summary of the genome participation to the clusters allows for an initial overview of the data
+A summary of the genome participation to the clusters allows for an initial overview of the data, something not so useful in cases of bigger datasets.
 
-```{r, echo=FALSE}
+```{r make_summary_table, echo=FALSE}
 
-kable(panm_summary(panm))
+kable(panm_summary(bac_stre))
 
 ```
 
 
 # Exploring the data
 
 Another way of exploring the genome data is throught the use of plots. Three types of graphics are offered. User can  plot  Genome - Cluster participation, with  'r pm_plot()':
-```{r,echo=FALSE}
+```{r plot1,echo=FALSE}
 
-pm_plot(panm,plot_type = "bar")
+pm_plot(bac_stre,plot_type = "bar")
 
 
 ```
 
 Cluster participation - Genomes, with 'r cp_plot()':
 
-```{r,echo=FALSE}
+```{r plot2,echo=FALSE}
 
-cp_plot(panm,plot_type = "bar")
+cp_plot(bac_stre,plot_type = "bar")
 
 ```
 
 
 Genes - Cluster participation, with 'r gp_plot()':
 
-```{r,echo=FALSE}
+```{r plot3,echo=FALSE}
 
-gp_plot(panm,plot_type = "bar",collapsed = TRUE)
+gp_plot(bac_stre,plot_type = "bar",collapsed = TRUE)
 
 ```
 
+These plots can be be conviniently grouped by  'r grid_plot()':
 
+```{r plot4,echo=FALSE}
+
+grid_plot(bac_stre)
+
+```
 
 # Heaps Law fitting
-To decribe the fit of the data , two estimated parameters, intercept and decay parameter a. If a<1 then the pangenome is considered to be open. In this case we see that the pangenome is closed, meaning that some point adding new Pneumoniae strains to our data will not add new genes.
+To decribe the fit of the data , two estimated parameters, intercept and decay parameter a. If a<1, like this case then the pangenome is considered to be open, meaning that at any point adding new  strains to our data will add new genes.
 
-```{r,echo= FALSE}
-pm_heaps(panm,n_perm=1000)
+```{r heaps,echo= FALSE}
+pm_heaps(bac_stre,n_perm=50)
 
 ```
 This is an optimized version of the heaps() function from package micropan.
 
 # Binomial mixture fitting
 
-We can estimate the pangenome size and the pangenome core size through binomial mixtures as proposed by Snipen & Liland & implemented in package **micropan**. One must search the model with the optimal number of components. This is determined comparing the Bayesian Information criterion scores of the models and chosing the one with the smaller score. In our case the pangenome is composed by three components.  
-```{r, echo=FALSE}
+We can estimate the pangenome size and the pangenome core size through binomial mixtures as proposed by Snipen & Liland & implemented in package **micropan**. One must search the model with the optimal number of components. This is determined comparing the Bayesian Information criterion scores of the models and chosing the one with the smaller score. In our case the optimal pangenome composition seems to be three components.  
+```{r binom, echo=FALSE}
 
-pm_binom(panm, K.range = 3 )
+pm_binom(bac_stre, K.range = 9 )
 ```
 
 # Chao population estimator
 
-Another way to estimate the pangenome size  using the Chao lower bound estimator. The unbiased version of the estimator is provided, so results will vary in comparison with the results of function 'r chao()' **micropan** 
-```{r}
-pm_chao(panm)
+Another way to estimate the pangenome size  using the Chao lower bound estimator. The unbiased version of the estimator is provided, so results will vary in comparison with the results of function 'r chao()' **micropan** . 
+```{r chao , echo=FALSE}
+pm_chao(bac_stre)
 
 ```
 
@@ -100,16 +107,43 @@ pm_chao(panm)
 
 Another measure of interest is fluidity takes values in [0,1], with 1 denoting no common genes. Function **pm_fluidity()** computes fluidity with sampling, as implemented on package micropan, optimized for speed.
 
-```{r, echo= FALSE}
+```{r fluid_samp, echo= FALSE}
 
-pm_fluidity(panm, n.sim=1000)
+pm_fluidity(bac_stre, n.sim=100)
 
 ```
 
 # Fluidity without sampling
 
 The function **pm_fluidity_all** computes the exact value of fluidity, the standard deviation, the values for every pair and a mean value for each genome paired to all other genomes.
 
-```{r, echo=FALSE}
-pm_fluidity_all(panm)
+```{r fluid_all, echo=FALSE}
+pm_fluidity_all(bac_stre)
+```
+
+
+# Clustering with fluidity
+
+We can use the fluidity results as a distance measure and  'r cluster_number()' provides a proposal for the number of clusters found in dataset through hierarchical clustering.
+
+```{r fluid_clust, echo=FALSE}
+f_vals<-pm_fluidity_all(bac_stre)
+kable(cluster_number(f_vals))
+```
+ Gap  statistic proposes 2 clusters:
+```{r fluid_plot1 , echo=FALSE}
+res<-pm_cluster(f_vals,genome_names = bac_stre_names)
+
+dend<-as.dendrogram(res)
+dend <- set(dend, "labels_cex",0.8) 
+dend<-color_labels(dend, k = 2)
+plot(dend)
+
+```
+
+Dunn index proposed 3 clusters:
+```{r fluid_plot2, echo=FALSE}
+dend<-color_labels(dend, k = 2)
+plot(dend)
+
 ```