Bayesian latent class models for prevalence estimation and diagnostic test evaluation
2024-10-11
Chapter 1 About this document
In this document you will find guidance on how to use various R libraries and functions for estimating true disease prevalence and the accuracy of diagnostic tests in a Bayesian framework, along with various exercises. We assume that you have some basic R skills. If you have not worked with R before or feel a bit rusty, here are some resources to help you prepare:
Chapters 1, 2, and 4 of the CodeCademy “Learn R” course will provide a good overview of the basic concepts required for this workshop.
If you are familiar with R and want to do some further reading, Hadley Wickham’s “R for Data Science” is a great resource.
Remember, there are often many different ways to conduct a given piece of work in R. Throughout this document, we tried to stick with the simpler approaches (e.g., the shortest code, the minimal number of R libraries).
1.1 Software and libraries used
To complete the exercises in this book, we will need to install three pieces of software (R, RStudio, and JAGS) and a few R packages. The installation steps are described below. These instructions are reproduced in part from R-project.org, Rstudio.com, and various R packages’ manuals.
1.1.1 R download
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
R can be downloaded here. You can choose from the list a CRAN mirror close to your institution; for instance, in Canada we use the mirror from Simon Fraser University, Burnaby.
We used the 4.4.1 version of R to develop this book.
1.1.2 RStudio download
RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Workbench.
A free RStudio Desktop Open Source License can be downloaded here.
We used the 2024.04.2+764 version of RStudio to develop this book.
1.1.3 R libraries
To install a given R library, open RStudio, click the Packages tab, click Install, and type the name of the package you need (e.g., epiR).
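The same installation can also be done from the R console. The following is only a minimal sketch: the object names workshop_pkgs and missing_pkgs are ours, and the package names are those listed in this section. The install.packages() line is left commented out so that nothing is downloaded until you decide to run it.

```r
# Packages used in this workshop (names taken from the list in this section)
workshop_pkgs <- c("R2jags", "rjags", "coda", "mcmcplots", "epiR")

# Compare against the packages already present in your R library;
# nothing is downloaded or loaded at this step
already_there <- workshop_pkgs %in% rownames(installed.packages())
missing_pkgs <- workshop_pkgs[!already_there]

# Show which packages still need to be installed (may be empty)
print(missing_pkgs)

# Uncomment the next line to install the missing packages from CRAN
# install.packages(missing_pkgs)
```

Note that rjags and R2jags will install without JAGS, but they will only load once JAGS itself is installed on your computer.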
For the workshop, the following R packages will be used:
1.1.3.1 R2jags
Version 0.8-5 R2jags manual
The R2jags package provides convenient functions to call rjags (see next) and JAGS (software specialized in Bayesian analyses; see below) from R. It automatically writes the data and scripts in a format readable by JAGS. After the JAGS process has finished, the resulting data can be read into R using the facilities of the coda and mcmcplots packages for further analysis of the output.
1.1.3.2 rjags
Version 4-15 rjags manual
The rjags package provides an interface from R to the JAGS library for Bayesian data analysis. JAGS uses Markov Chain Monte Carlo (MCMC) to generate a sequence of dependent samples from the posterior distribution of the parameters.
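To give a feel for what “a sequence of dependent samples from the posterior distribution” means, here is a small sketch written in base R only. It is not what JAGS does internally: we swapped in a simple random-walk Metropolis sampler for illustration, and the data (10 positives out of 50), the Beta(1, 1) prior, and all object names are hypothetical.

```r
# Hypothetical data: 10 test-positive individuals out of 50 sampled
y <- 10
n <- 50

# Log posterior of a prevalence p with a Beta(1, 1) prior (up to a constant)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(y, size = n, prob = p, log = TRUE) + dbeta(p, 1, 1, log = TRUE)
}

set.seed(123)
n_iter <- 20000
draws <- numeric(n_iter)
current <- 0.5

for (i in seq_len(n_iter)) {
  # Propose a value close to the current one
  proposal <- current + rnorm(1, mean = 0, sd = 0.1)
  # Accept the proposal with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(current)) {
    current <- proposal
  }
  draws[i] <- current
}

# Discard the first 1000 iterations as burn-in
kept <- draws[-(1:1000)]

# Posterior mean from the chain, to compare with the exact value below
mean(kept)
(1 + y) / (2 + n)

# Lag-1 autocorrelation: successive samples are dependent
lag1 <- cor(kept[-1], kept[-length(kept)])
lag1
```

With this simple model the exact posterior is a Beta(1 + y, 1 + n - y) distribution, so the chain can be checked against a known answer. For the latent class models of this workshop no such closed form is available, which is why we rely on JAGS.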
1.1.3.3 coda
Version 0.19-4.1 coda manual
The coda package provides functions for summarizing and plotting the output from MCMC simulations, as well as diagnostic tests of convergence to the equilibrium distribution of the Markov chain.
1.1.3.4 mcmcplots
Version 0.4.3 mcmcplots manual
The mcmcplots package provides a function (mcmcplot) that produces common MCMC diagnostic plots in an HTML file that can be viewed from a web browser. When viewed in a web browser, hundreds of MCMC plots can be viewed efficiently by scrolling through the output as if it were any typical web page.
1.1.3.5 epiR
Version 2.0.75 epiR manual
The epiR package contains tools for the analysis of epidemiological and surveillance data. For the workshop we will mainly use epiR functions that can return shape parameters for different distributions, based on expert elicitation, to produce and visualize prior distributions.
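To illustrate what “returning shape parameters based on expert elicitation” involves, here is a sketch in base R only. It does not use epiR and is not how epiR computes its results; the function beta_from_mode() and the elicited values (most likely value 0.85, 95% sure the value is greater than 0.70) are hypothetical. The idea is to search for the Beta distribution whose mode is the expert’s best guess and whose lower tail matches the stated certainty.

```r
# Find Beta(shape1, shape2) with a given mode such that we are `conf` sure
# that the parameter is greater than `x` (assumes x < mode)
beta_from_mode <- function(mode, x, conf = 0.95) {
  # For any k > 0, Beta(1 + mode * k, 1 + (1 - mode) * k) has its mode at `mode`;
  # a larger k gives a narrower distribution
  tail_gap <- function(k) {
    pbeta(x, 1 + mode * k, 1 + (1 - mode) * k) - (1 - conf)
  }
  # Search for the k that leaves exactly (1 - conf) probability below x
  k <- uniroot(tail_gap, interval = c(1e-6, 1e4), tol = 1e-10)$root
  c(shape1 = 1 + mode * k, shape2 = 1 + (1 - mode) * k)
}

shapes <- beta_from_mode(mode = 0.85, x = 0.70, conf = 0.95)
shapes

# Check: probability of being below 0.70 under the resulting distribution
pbeta(0.70, shapes[["shape1"]], shapes[["shape2"]])

# Uncomment to visualize the resulting prior distribution
# curve(dbeta(x, shapes[["shape1"]], shapes[["shape2"]]), from = 0, to = 1)
```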
1.1.4 JAGS
JAGS is a free software package for performing Bayesian inference using Gibbs sampling. JAGS stands for “Just Another Gibbs Sampler”. It is a program for analysis of Bayesian models using MCMC simulation.
You will have to install JAGS on your computer for the R2jags library to work. JAGS can be downloaded here.
- Click on the “Download latest version JAGS-4.3.1.html” green button;
- Wait a few seconds;
- Open up the JAGS-4.3.1.html file that was downloaded;
- Pick the appropriate installation depending on the version of R that is installed on your computer;
  - If you have installed R version 4.2.0 or later, then JAGS-4.3.1 can be installed.
  - If you have installed
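If you are unsure which version of R is installed on your computer, you can ask R directly. The short sketch below uses only base R; the object names r_version and r_is_recent are ours, and the 4.2.0 threshold is the one mentioned in the steps above.

```r
# Version of R currently running
r_version <- getRversion()
print(r_version)

# TRUE if this is R 4.2.0 or later
r_is_recent <- r_version >= "4.2.0"

if (r_is_recent) {
  message("R is version 4.2.0 or later: JAGS-4.3.1 can be installed.")
} else {
  message("R is older than 4.2.0: check the download page for the matching installer.")
}
```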
1.2 Some notation
Throughout the document, you will find examples of R code along with comments. The R code always appears in grey boxes (see the following example). This is the code that you will be able to use for your own analyses. Lines that are preceded by a # sign are comments; R skips them when it reads the code. Following each grey box with R code, another grey box with results from the analysis is presented.
For instance, this is R code in which we simply ask for the main descriptive statistics of the speed variable of the cars dataset (note that this dataset is already part of R).
#This is a comment. R will ignore this line
#The summary() function can be used to ask for a summary of various R objects. For a variable (here the variable speed), it shows descriptive statistics.
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
Throughout this document we will use:
- Italics for datasets or variables. For instance, the speed variable of the dataset cars.
- Shaded boxes for R libraries (e.g., episensr) and functions (e.g., summary()).
In R we can first call a given library and then use the functions related to this library, or we can type the name of the library followed by :: and then the function. For instance, the two following pieces of code are equivalent:
library(ggplot2)
ggplot(data=cars, mapping=aes(x=speed))+
  geom_histogram()
##OR##
#Without library(), every function from the package needs the ggplot2:: prefix
ggplot2::ggplot(data=cars, mapping=ggplot2::aes(x=speed))+
  ggplot2::geom_histogram()
The latter may improve reproducibility, but at the expense of longer code. Throughout the document, we will always first call the library and then run the functions to keep the code short.
One last thing: when using a given function, it is not mandatory to name all the arguments, as long as they are presented in the sequence expected by this function. For instance, the ggplot() function that we used in the previous chunk of code expects first a dataset (data=) and then a mapping argument (mapping=) and, within that mapping argument, an x variable (x=). We could shorten the code by omitting all of these names. The two following pieces of code are, therefore, equivalent:
library(ggplot2)
ggplot(data=cars, mapping=aes(x=speed))+
  geom_histogram()
##OR##
library(ggplot2)
ggplot(cars, aes(speed))+
  geom_histogram()
Throughout the document, however, we will use the longer code with all the arguments named. Since you are learning these new functions, it would be quite a challenge to use the shorter code right from the start. But you could certainly adopt the shorter code later on.
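The same principle can be tried with a base R function that does not require any library. The sketch below uses seq(), whose first three arguments are from=, to=, and by=; the object names are ours and only serve the illustration.

```r
# All arguments named
named <- seq(from=2, to=10, by=2)

# Same arguments, unnamed, in the sequence expected by seq()
unnamed <- seq(2, 10, 2)

# Named arguments can be given in any order
reordered <- seq(by=2, to=10, from=2)

# The three calls return the same result
identical(named, unnamed)
identical(named, reordered)

# Unnamed arguments in the wrong sequence are misread: here R would take
# from=10, to=2, by=2 and stop with an error (caught by try() so the script continues)
swapped <- try(seq(10, 2, 2), silent=TRUE)
class(swapped)
```

This is one reason for naming the arguments while you are learning: the code still works when the arguments are not in the expected sequence.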
LET’S GET STARTED!