You can install the stable version of greta from CRAN, or the development version from GitHub using devtools, and then load the package.
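The three steps can be sketched as follows. The GitHub repository name greta-dev/greta is an assumption, since the text above names only GitHub and devtools.

```r
# stable release from CRAN
install.packages("greta")

# development version from GitHub
# (assumed repository: greta-dev/greta; requires the devtools package)
devtools::install_github("greta-dev/greta")

# load the package
library(greta)
```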
Before you can fit models with greta, you will also need a working installation of Google’s TensorFlow Python package (version >= 2.0.0) and the tensorflow-probability Python package (version >= 0.8.0). If these Python modules aren’t yet installed when greta is used, it suggests using install_greta_deps() to install them, and we recommend doing so. For more detail on installation, see the vignette “installation”.
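A minimal sketch of that step, calling install_greta_deps() with its default arguments (any further options are covered in the installation vignette and are not assumed here):

```r
library(greta)

# install the Python dependencies (TensorFlow and tensorflow-probability)
install_greta_deps()
```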
greta’s plotting functionality depends on the DiagrammeR package. Because DiagrammeR depends on the igraph package, which contains a large amount of code that needs to be compiled, DiagrammeR often takes a long time to install. So, DiagrammeR isn’t installed automatically with greta. If you want to plot greta models, you can install igraph and DiagrammeR from CRAN.
igraph can be difficult to install on some machines, due to dependencies on the XML R package and the libxml2 and gfortran software libraries. There are workarounds to these issues, but if you can’t get it to work, you can still use greta for everything except plotting models.
greta lets us build statistical models interactively in R, and then sample from them by MCMC. We build greta models with greta array objects, which behave much like R’s array, matrix and vector objects for numeric data. Like those numeric data objects, greta arrays can be manipulated with functions and mathematical operators to create new greta arrays.
The key difference between greta arrays and numeric data objects is that when you do something to a greta array, greta doesn’t calculate the values of the new greta array. Instead, it just remembers what operation to do, and works out the size and shape of the result.
For example, we can create a greta array z representing some data (a 3x3 matrix of 1s):
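One way to do this, sketched with greta’s ones() helper (described later in this vignette):

```r
library(greta)

# a 3x3 data greta array filled with ones
(z <- ones(3, 3))
```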
We can then create a new greta array z2 by doing something to z:
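A small sketch of such an operation; the particular expression is only an illustration:

```r
library(greta)

z <- ones(3, 3)

# an operation greta array: its shape is known, its values are not yet computed
(z2 <- z + z ^ 2)
```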
greta knows the result must also be a 3x3 matrix, but it doesn’t try to calculate the results. Instead it treats the new greta array z2 like a placeholder, for which it will calculate the results later.
Because greta only creates placeholders when we’re building our model, we can construct models using greta arrays that represent unknown variables. For example, if we create a new greta array a representing some unknown values, we can still manipulate it as though it were data:
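A sketch of this, using variable() (introduced later in this vignette) to create the unknown values:

```r
library(greta)

# a 3x3 greta array of unknown values
(a <- variable(dim = c(3, 3)))

# manipulate it exactly as we would data
(a2 <- a + a ^ 2)
```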
This allows us to create a wide range of models, as in general-purpose modelling languages such as BUGS and Stan. But unlike those languages, we build greta models interactively in R, so we get feedback immediately if there’s a mistake such as a misspelled variable name or a greta array with the wrong shape.
greta also lets us declare that a greta array follows a probability distribution, allowing us to train models using observed data, and to define prior distributions over our parameters, for Bayesian analyses.
The rest of this vignette walks through an example of fitting a model using greta. If you’d like to see examples of some common models fitted in greta and with equivalent BUGS and Stan code, take a look at Example models. If you’d like more technical details about how greta works under the hood, check out Technical details.
The rest of the vignette explains step-by-step how to build, visualise and fit a model with greta. We’ll be stepping through a model for linear regression between two of the variables in the iris dataset, which is included with base R. The model is Bayesian (we specify priors over the variables), though it is also possible to do frequentist (no priors) inference in greta, using variable() instead of a probability distribution to create random variables.
Here’s the whole script to specify and fit the model:
library(greta)
# data
x <- as_data(iris$Petal.Length)
y <- as_data(iris$Sepal.Length)
# variables and priors
int <- normal(0, 1)
coef <- normal(0, 3)
sd <- student(3, 0, 1, truncation = c(0, Inf))
# operations
mean <- int + coef * x
# likelihood
distribution(y) <- normal(mean, sd)
# defining the model
m <- model(int, coef, sd)
# plotting
plot(m)
# sampling
draws <- mcmc(m, n_samples = 1000)
The first section of the script (under the # data comment) takes the iris data (which is automatically loaded) and converts the two columns we care about into greta arrays.
The greta function as_data() converts other R objects into greta arrays. In this case it’s converting numeric vectors (the two columns of the iris dataframe) into greta arrays. as_data() can also convert matrices and R arrays with numeric, integer or logical (TRUE or FALSE) values into greta arrays. It can also convert dataframes to greta arrays, so long as all elements are either numeric, integer or logical.
E.g. we can convert the first 5 rows and 4 columns of the iris dataframe, and print the result:
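For instance:

```r
library(greta)

# first 5 rows and 4 (numeric) columns of iris, as a 5x4 greta array
as_data(iris[1:5, 1:4])
```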
Whenever as_data() converts logical values to greta arrays, it converts them to 1s (for TRUE) and 0s (for FALSE). E.g. if we first convert the last column of iris from a factor into a logical vector, we can see this:
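A short sketch; the four row indices are picked arbitrarily for illustration, so that both TRUE and FALSE appear:

```r
library(greta)

# logical vector: is each of four selected flowers a setosa?
(is_setosa <- iris$Species[c(1, 41, 81, 121)] == "setosa")

# the TRUEs become 1s and the FALSEs become 0s
as_data(is_setosa)
```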
You can also see from this example that greta arrays always consider a vector as either a column vector (the default) or a row vector, and greta arrays always have at least two dimensions:
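Checking the dimensions of a converted vector makes this concrete (the vector itself is the same illustrative selection of rows):

```r
library(greta)

is_setosa <- iris$Species[c(1, 41, 81, 121)] == "setosa"

# a length-4 vector becomes a 4x1 column vector
dim(as_data(is_setosa))
```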
For many models, we don’t have to explicitly convert data to greta arrays; the R objects will be converted automatically when we do an operation on them. That’s handy when we want to use constants in our model, because it saves us manually converting numbers each time. However, it’s good practice to explicitly convert your data to a greta array using as_data(). This has two advantages: it lets greta work out the names of your data greta arrays (e.g. x and y in our example), which it can use when plotting the model; and as_data() will check your data for missing (NA) or non-finite (Inf or -Inf) values, which will break the model.
greta also provides some convenience functions to generate fixed numeric values. ones() and zeros() create greta arrays filled with either ones or zeros, and with the specified dimensions:
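For example, with dimensions chosen only for illustration:

```r
library(greta)

# a 1x3 row vector of ones
ones(1, 3)

# a 2x2 matrix of zeros
zeros(2, 2)
```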
The greta_array() function generalises this to let you create greta arrays with any values, in the same way as R’s array() function:
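Two illustrative calls, assuming greta_array() takes the same data and dim arguments as array():

```r
library(greta)

# a 2x2 greta array filled with pi
greta_array(pi, dim = c(2, 2))

# values are recycled to fill the array, as with array()
greta_array(0:1, dim = c(3, 3))
```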
greta_array() is just a thin wrapper around array(), provided for convenience. A command like greta_array(<some arguments>) has exactly the same effect as as_data(array(<some arguments>)).
The second section of the script (under the # variables and priors comment) creates three greta arrays to represent the parameters in our model.
Each of these is a variable greta array, and each is assumed a priori (before fitting to data) to follow a different probability distribution. In other words, these are prior distributions over variables, which we need to specify to make this a full Bayesian analysis. Before going through how to specify variables with probability distributions, it will be clearer to first demonstrate the alternative: variables without probability distributions.
If we were carrying out a frequentist analysis of this model, we could create variable greta arrays (values we want to learn) without probability distributions using the variable() function. E.g. in a frequentist version of the model we could create int with:
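That is, a call with all arguments left at their defaults:

```r
library(greta)

# an unconstrained scalar variable with no prior
(int <- variable())
```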
variable() has three arguments. The first two arguments determine the constraints on this parameter; we left them at their default setting of lower = -Inf, upper = Inf, meaning the variable can take any value on the real number line. The third argument gives the dimensions of the greta array to return; in this case we left it at its default value of 1x1 (a scalar).
We can create a variable constrained between two values by specifying lower and upper. So we could have created the positive variable sd (the standard deviation of the likelihood) with:
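A sketch, leaving upper at its default of Inf:

```r
library(greta)

# a scalar variable constrained to be positive
(sd <- variable(lower = 0))
```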
If we had instead wanted a 2x3 array of positive variables we could have created it like this:
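For example, using the dim argument described above:

```r
library(greta)

# a 2x3 array of positive variables
variable(lower = 0, dim = c(2, 3))
```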
In our example script, when we created the variables int, coef and sd, we simultaneously stated the prior distributions for them using some of greta’s probability distribution functions. You can see a list of the currently available distributions in the ?greta::distributions helpfile. Each of these distribution functions takes as arguments the distribution’s parameters (either as numbers or other greta arrays), as well as the dimensions of the resulting greta array. As before, we left the dimensions arguments at their default value to create scalar greta arrays.
Both int and coef were given zero-mean normal distributions, which are a common choice of prior for unconstrained variables in Bayesian analyses. For the strictly positive parameter sd, we chose a slightly unconventional option, a positive-truncated (non-standard) Student’s t distribution, which we create using greta’s built-in support for truncated distributions.
Some of greta’s probability distributions (those that are continuous and univariate) can be specified as truncated distributions. By modifying the truncation argument, we can state that the resulting distribution should be truncated between the two truncation bounds. So to create a standard normal distribution truncated between -1 and 1 we can do:
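This uses the same truncation argument as the prior on sd in the script:

```r
library(greta)

# standard normal, truncated to the interval [-1, 1]
(z <- normal(0, 1, truncation = c(-1, 1)))
```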
greta will account for this truncation when calculating the density of this distribution, rescaling it to be a valid probability distribution. We can only truncate to within the support of the distribution; e.g. greta will throw an error if we try to truncate a lognormal distribution (which must be positive) to have a lower bound of -1.
The third section of the script (under the # operations comment) uses mathematical operators to combine some of our parameters and data, to calculate the predicted sepal lengths for a given parameter set.
Because int and coef are both scalars, the resulting greta array mean has the same dimensions as x, a column vector with 150 elements:
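We can check this with dim(), repeating the relevant lines of the script so the snippet runs on its own:

```r
library(greta)

x <- as_data(iris$Petal.Length)
int <- normal(0, 1)
coef <- normal(0, 3)
mean <- int + coef * x

# 150 rows, 1 column
dim(mean)
```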
greta arrays can be manipulated using R’s standard arithmetic, logical and relational operators, including +, * and many others. The ?greta::operators helpfile lists the operators that are implemented for greta arrays. You can also use a lot of common R functions for numeric data, such as sum(), log() and others. The available functions are listed in the ?greta::functions helpfile. All of these mathematical manipulations of greta arrays produce ‘operation’ greta arrays.
We can use R’s extract and replace syntax (using [) on greta arrays, just like with R’s vectors, matrices and arrays. E.g. to extract elements from mean we can do:
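For instance, taking the first three elements (the script lines are repeated so the snippet is self-contained):

```r
library(greta)

x <- as_data(iris$Petal.Length)
int <- normal(0, 1)
coef <- normal(0, 3)
mean <- int + coef * x

# the first three elements, returned as a 3x1 greta array
mean[1:3]
```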
We can assign values from one greta array to another too. For example, if we wanted to create a matrix that has random normal variables in the first column, but zeros everywhere else, we could do:
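A sketch of that assignment; the 4x3 size is chosen only for illustration:

```r
library(greta)

# start with a 4x3 matrix of zeros
z <- zeros(4, 3)

# overwrite the first column with four normal random variables
z[, 1] <- normal(0, 1, dim = 4)
z
```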
R’s subset operator [ has an argument drop, which determines whether to reduce the number of dimensions of an array or matrix when the object has only one element in that dimension. By default, drop = TRUE for R objects, so matrices are automatically converted into vectors (which have dimension NULL) if you take out a single column:
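This is plain base R behaviour, shown here on a small illustrative matrix:

```r
z <- matrix(1, nrow = 2, ncol = 2)

# default drop = TRUE: the column becomes a vector, so dim() is NULL
dim(z[, 1])

# drop = FALSE keeps it as a 2x1 matrix
dim(z[, 1, drop = FALSE])
```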
greta arrays must always have at least two dimensions, so greta always acts as though drop = FALSE:
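The same extraction on a greta array (z_greta is just an illustrative name):

```r
library(greta)

z_greta <- as_data(matrix(1, nrow = 2, ncol = 2))

# still two-dimensional: 2 rows, 1 column
dim(z_greta[, 1])
```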
We can write our own functions for greta arrays, using the existing operators and functions. For example, we could define the inverse hyperbolic tangent function for greta arrays like this:
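A minimal sketch, built only from log() and arithmetic operators. The function name atanh_greta is hypothetical, chosen here to avoid masking base R’s atanh():

```r
library(greta)

# inverse hyperbolic tangent: atanh(z) = (log(1 + z) - log(1 - z)) / 2
atanh_greta <- function(z) {
  (log(1 + z) - log(1 - z)) / 2
}

# apply it to a data greta array with values inside (-1, 1)
z_greta <- as_data(matrix(0.5, nrow = 2, ncol = 2))
atanh_greta(z_greta)
```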
So far, we have created greta arrays representing the variables in our model (with prior distributions) and created other greta arrays from them and some fixed, independent data. To perform statistical inference on this model, we also need to link it to some observed dependent data. By comparing the sepal lengths predicted under different parameter values with the observed sepal lengths, we can estimate the most plausible values of those parameters. We do that by defining a likelihood for the observed sepal length data y.
By defining a likelihood over observed data, we are stating that these observed data are actually a random sample from some probability distribution, and we’re trying to work out the parameters of that distribution. In greta we do that with the distribution() assignment function, as in the # likelihood line of the script.
With the syntax distribution(<lhs>) <- <rhs>, we are stating that the data greta array on the left <lhs> has the same distribution as the greta array on the right <rhs>. In this case, we’re temporarily creating a random variable with a normal distribution (with parameters determined by the greta arrays mean and sd), but then stating that values of that distribution have been observed (y). Here both <lhs> (y) and <rhs> are column vectors with the same dimensions, so each element in y has a normal distribution with the corresponding parameters.
Now that all of the greta arrays making up the model have been created, we need to combine them and set up the model so that we can sample from it, using model(), as in the # defining the model line of the script.
model() returns a ‘greta model’ object, which combines all of the greta arrays that make up the model. We can pass greta arrays as arguments to model() to flag them as the parameters we’re interested in. When sampling from the model with mcmc(), those will be the greta arrays for which samples will be returned. Alternatively, we can run model() without passing any greta arrays, in which case all of the greta arrays (except for data greta arrays) in the working environment will be set as the targets for sampling instead.
greta provides a plot function for greta models to help you visualise and check the model before sampling from it.
The greta arrays in your workspace that are used in the model are all represented as nodes (shapes) with names. These are either data (squares; x and y), variables (large circles; int, coef, sd) or the results of operations (small circles; mean). The operations used to create the operation greta arrays are printed on the arrows from the arrays they were made from. There are also nodes for greta arrays that were implicitly defined in our model. The data nodes (squares) with numbers are the parameters used to define the prior distributions, and there’s also an intermediate operation node (small circle), which was the result of multiplying coef and x (before adding int to create mean).
A legend for the plot is given in the ?greta::model helpfile.
The fourth type of node (diamonds) represents probability distributions. These have greta arrays as parameters (linked via solid lines), and have other greta arrays as values (linked via dashed lines). Distributions calculate the probability density of the values, given the parameters and their distribution type.
For example, a plot of just the prior distribution over coef (defined as coef <- normal(0, 3)) shows the parameters as data leading into the normal distribution, and a dashed arrow leading out to the distribution’s value, the variable coef.
It’s the same for the model likelihood, but this time the distribution’s parameters are a variable (sd) and the result of an operation (mean), and the distribution’s value is given by data (the observed sepal lengths y).
When defining the model, greta combines all of the distributions together to define the joint density of the model, a measure of how ‘good’ (or how probable if we’re being Bayesian) a particular candidate set of values for the variables in the model is.
Now we have a greta model that will give us the joint density for a candidate set of values, so we can use that to carry out inference on the model. We do that with a Markov chain Monte Carlo (MCMC) algorithm, sampling values of the parameters we’re interested in with the mcmc() function, as in the # sampling line of the script.
Here we’re using 1000 steps of the static Hamiltonian Monte Carlo (HMC) algorithm on each of 4 separate chains, giving us 4000 samples. HMC uses the gradients of the joint density to efficiently explore the set of parameters. By default, greta also spends 1000 iterations on each chain ‘warming up’ (tuning the sampler parameters) and ‘burning in’ (moving to the area of highest probability) the sampler.
draws is a greta_mcmc_list object, which inherits from the coda R package’s mcmc.list object. So we can use functions from coda, or one of the many other MCMC software packages that use this format, to plot and summarise the MCMC samples.
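As a sketch of what that looks like in practice, the snippet below refits the model from the script and passes draws to a few coda functions. Which summaries are useful is a judgment call; these three are only examples.

```r
library(greta)

# same model as in the script above
x <- as_data(iris$Petal.Length)
y <- as_data(iris$Sepal.Length)
int <- normal(0, 1)
coef <- normal(0, 3)
sd <- student(3, 0, 1, truncation = c(0, Inf))
mean <- int + coef * x
distribution(y) <- normal(mean, sd)
m <- model(int, coef, sd)
draws <- mcmc(m, n_samples = 1000)

# posterior means, standard deviations and quantiles
summary(draws)

# effective sample size and Gelman-Rubin convergence diagnostic
coda::effectiveSize(draws)
coda::gelman.diag(draws)
```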
The bayesplot package makes some nice plots of the MCMC chain and the parameter estimates:
library(bayesplot)
# set theme to avoid issues with fonts
ggplot2::theme_set(ggplot2::theme_bw())
mcmc_trace(draws, facet_args = list(nrow = 3, ncol = 1))
mcmc_intervals(draws)
If your model is taking a long time in the sampling phase and you want to take a look at the samples, you can stop the sampler (e.g. using the stop button in RStudio) and then retrieve the samples drawn so far by using stashed_samples(). Note that this won’t return anything if you stop the model during the warmup phase (since those aren’t valid posterior samples) or if the sampler completed successfully.
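A sketch of that workflow. It assumes that "won’t return anything" shows up in R as a NULL value; the name draws_so_far is only illustrative.

```r
library(greta)

# run this after interrupting a long mcmc() call during the sampling phase
draws_so_far <- stashed_samples()

# assumed: NULL when nothing was stashed (stopped in warmup, or run completed)
if (!is.null(draws_so_far)) {
  summary(draws_so_far)
}
```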
greta’s default sampler is (static) Hamiltonian Monte Carlo. The sampler will automatically tune itself during the warmup phase, to make it as efficient as possible. If the chain looks like it’s moving too slowly, or if you are getting a lot of messages about proposals being rejected, the first thing to try is increasing the length of the warmup period from its default of 1000 iterations (via the warmup argument). If you’re still getting a lot of rejected samples, it’s often a good idea to manually set the initial values for the sampler (via the initial_values argument). This is often the case when you have lots of data; the more information there is, the more extreme the log probability, and the higher the risk of numerical problems.
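Both adjustments can be sketched on the model from the script. The warmup length and the starting values below are arbitrary illustrations rather than recommendations, and the initial values are built with greta’s initials() helper.

```r
library(greta)

# same model as in the script above
x <- as_data(iris$Petal.Length)
y <- as_data(iris$Sepal.Length)
int <- normal(0, 1)
coef <- normal(0, 3)
sd <- student(3, 0, 1, truncation = c(0, Inf))
mean <- int + coef * x
distribution(y) <- normal(mean, sd)
m <- model(int, coef, sd)

draws <- mcmc(
  m,
  n_samples = 1000,
  warmup = 2000,  # twice the default of 1000
  # illustrative starting values; sd must be positive
  initial_values = initials(int = 0, coef = 0, sd = 1)
)
```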
A downside of HMC is that it can’t be used to sample discrete variables (e.g. integers), so we can’t specify a model with a discrete distribution (e.g. Poisson or binomial), unless it’s in the likelihood. If we try to build such a model, greta will give us an error when we run model(). Future versions of greta will implement a wider range of MCMC samplers, including some for discrete variables.