Chapter 1 About this document

In this document you will find indications on how to use various R libraries and functions for estimating true disease prevalence and accuracy of diagnostic tests in a Bayesian framework, along with various exercises. We suppose that you do have some basic R skills. If you have not worked with R before or feel a bit rusty, here are some resources to help you to prepare:

Chapters 1, 2, and 4 of the CodeCademy “Learn R” course will provide a good overview of the basic concepts required for this workshop.
If you are familiar with R and want to do some further reading, Hadley Wickham’s “R for Data Science” is a great resource.

Remember, there are often many different ways to conduct a given piece of work in R. Throughout this document, we tried to stick with the simpler approaches (e.g., the shortest code, the minimal number of R libraries).

1.1 Software and libraries used

To conduct the exercises from this book, we will need to install three software (R, RStudio, and JAGS) and a few R packages. The installation steps are described below. These instructions are reproduced in parts from R-project.org, Rstudio.com, and various R packages’ manuals.

1.1.1 R download

R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

R can be downloaded here. You can choose in the list a CRAN mirror close to your institution; for instance, in Canada we use the mirror from Simon Fraser University, Burnaby.

We used the 4.4.1 version of R to develop this book.

1.1.2 RStudio download

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Workbench.

A free RStudio Desktop Open Source License can be downloaded here.

We used the 2024.04.2+764 version of RStudio to develop this book.

1.1.3 R libraries

To install a given R library, open RStudio, click Packages, choose load package, choose the package you need (e.g., epiR).

For the workshop, the following R packages will be used:

1.1.3.1 R2jags

Version 0.8-5 R2jags manual The R2JAGS package provides convenient functions to call rjags (see next) and JAGS (a software specialized in Bayesian analyses; see below) from R. It automatically writes the data and scripts in a format readable by JAGS. After the JAGS process has finished, it is possible to read the resulting data into R using the facilities of the coda and mcmcplots packages for further analyses of the output.

1.1.3.2 rjags

Version 4-15 rjags manual The rjags package provides an interface from R to the JAGS library for Bayesian data analysis. JAGS uses Markov Chain Monte Carlo (MCMC) to generate a sequence of dependent samples from the posterior distribution of the parameters.

1.1.3.3 coda

Version 0.19-4.1 coda manual
The coda package provides functions for summarizing and plotting the output from MCMC simulations, as well as diagnostic tests of convergence to the equilibrium distribution of the Markov chain.

1.1.3.4 mcmcplots

Version 0.4.3 mcmcplots manual
The mcmcplots package provides a function (mcmcplot) that produces common MCMC diagnostic plots in an html file that can be viewed from a web browser. When viewed in a web browser, hundreds of MCMC plots can be viewed efficiently by scrolling through the output as if it were any typical web page.

1.1.3.5 epiR

Version 2.0.75 epiR manual
The epiR package contains tools for the analysis of epidemiological and surveillance data. For the workshop we will mainly use epiR functions that can return shape parameters for different distributions, based on expert elicitation, to produce and visualize prior distributions.

1.1.4 JAGS

JAGS is a free software package for performing Bayesian inference Using Gibbs Sampling. JAGS stands for “Just Another Gibbs Sampler”. It is a program for analysis of Bayesian models using MCMC simulation.

You will have to install JAGS on your computer for the R2JAGS library to work. JAGS can be downloaded here.

Click on the “Download latest version JAGS-4.3.1.html” green button;
Wait a few seconds;
Open up the JAGS-4.3.1.html file that was downloaded;
Pick the appropriate installation depending on the version of R that is installed on your computer;
- If you have installed R version 4.2.0 or later, then JAGS-4.3.1 can be installed.

1.2 Some notation

Throughout the document, you will find examples of R code along with comments. The R code used always appear in grey boxes (see the following example). This is the code that you will be able to use for your own analyses. Lines that are preceded by a # sign are comments, they are skipped when R reads them. Following each grey box with R code, another grey box with results from the analysis is presented.

For instance, this is a R code where I am simply asking to show main descriptive statistics for the speed variable of the cars dataset (note that this dataset is already part of R).

#This is a comment. R will ignore this line

#The summary() function can be use to ask for a summary of various R objects. For a variable (here the variable speed), it shows descriptive statistics.
summary(cars$speed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    15.0    15.4    19.0    25.0

Throughout this document we will use:
- Italics for datasets or variables. For instance, the speed variable of the dataset cars.
- Shaded boxes for R libraries (e.g., episensr) and functions (e.g., summary()).

In R we can first call a given library and then use the functions related to this library or we can type the name of the library followed by :: and then the function. For instance the two following pieces of code are equivalent:

library(ggplot2)
ggplot(data=cars, mapping=(aes(x=speed)))+
  geom_histogram()

##OR##

ggplot2::ggplot(data=cars, mapping=(aes(x=speed)))+
  geom_histogram()

The latter may improve reproducibility, but at the expense of longer codes. Throughout the document, we will always first call the library and then run the functions to keep codes short.

One last thing, when using a given function, it is not mandatory to name all the arguments, as long as they are presented in the sequence expected by this function. For instance, the ggplot() function that we used in the previous chunk of code is expecting to see first a dataset (data=) and then a mapping attribute (mapping=) and, within that mapping attribute a x variable (x=). We could shorten the code by omitting all of these. The two following pieces of code are, therefore, equivalent:

library(ggplot2)
ggplot(data=cars, mapping=(aes(x=speed)))+
  geom_histogram()

##OR##  

library(ggplot2)
ggplot(cars, (aes(speed)))+
  geom_histogram()

Throughout the document, however, we will use the longer code with all the arguments being named. Since you are learning these new functions, it would be quite a challenge to use the shorter code right from the start. But you could certainly adopt the shorter codes later on.

LET’S GET STARTED!

Bayesian latent class models for prevalence estimation and diagnostic test evaluation