Performs Bayesian multiple linear regression of Y on
X; that is, this function fits the regression model \(Y = \sum_l
X b_{l=1}^L + e\), where elements of e are i.i.d. normal with
zero mean and variance residual_variance
, and
\(\sum_{l=1}^L b_l\) is a vector of length p representing the
effects to be estimated. The “susie assumption” is that each
\(b_l\) has exactly one non-zero element. The prior on the
non-zero element is normal with zero mean and variance var(Y)
* scaled_prior_variance
. The model is fitted using the
“Iterative Bayesian Stepwise Selection” (IBSS) algorithm.
See also susie_trendfilter
for applying susie to
non-parametric regression, particularly changepoint problems.
susie(
X,
Y,
L = min(10, ncol(X)),
scaled_prior_variance = 0.2,
residual_variance = NULL,
prior_weights = NULL,
null_weight = NULL,
standardize = TRUE,
intercept = TRUE,
estimate_residual_variance = TRUE,
estimate_prior_variance = TRUE,
estimate_prior_method = c("optim", "EM", "simple"),
check_null_threshold = 0,
prior_tol = 1e-09,
residual_variance_upperbound = Inf,
s_init = NULL,
coverage = 0.95,
min_abs_corr = 0.5,
compute_univariate_zscore = FALSE,
na.rm = FALSE,
max_iter = 100,
tol = 0.001,
verbose = FALSE,
track_fit = FALSE,
residual_variance_lowerbound = var(drop(Y))/10000,
refine = FALSE
)
susie_suff_stat(
bhat,
shat,
R,
n,
var_y,
XtX,
Xty,
yty,
maf = NULL,
maf_thresh = 0,
L = 10,
scaled_prior_variance = 0.2,
residual_variance = NULL,
estimate_residual_variance = TRUE,
estimate_prior_variance = TRUE,
estimate_prior_method = c("optim", "EM", "simple"),
check_null_threshold = 0,
prior_tol = 1e-09,
r_tol = 1e-08,
prior_weights = NULL,
null_weight = NULL,
standardize = TRUE,
max_iter = 100,
s_init = NULL,
intercept_value = 0,
coverage = 0.95,
min_abs_corr = 0.5,
tol = 0.001,
verbose = FALSE,
track_fit = FALSE,
check_input = FALSE,
refine = FALSE
)
An n by p matrix of covariates.
The observed responses, a vector of length n.
Number of components (nonzero coefficients) in the susie regression model. If L is larger than the number of covariates, p, L is set to p.
The scaled prior variance. This is
either a scalar or a vector of length L
. The prior variance
of each non-zero element of b is set to var(Y) *
scaled_prior_variance
. If estimate_prior_variance = TRUE
,
this provides initial estimates of the prior variances.
Variance of the residual. If
estimate_residual_variance = TRUE
, this value provides the
initial estimate of the residual variance. By default, it is
var(Y)
.
A vector of length p, in which each entry gives the prior probability that corresponding column of X has a nonzero effect on the outcome, Y.
Prior probability of no effect (a number between 0 and 1, and cannot be exactly 1).
If standardize = TRUE
, standardize the
columns of X (or XtX and Xty) to unit variance prior to
fitting. Note that scaled_prior_variance
specifies the prior
on the coefficients of X after standardization (if it is
performed). If you do not standardize, you may need to think more
carefully about specifying scaled_prior_variance
. Whatever
your choice, the coefficients returned by coef
are given for
X
on the original input scale. Any column of X
that
has zero variance is not standardized.
If intercept = TRUE
, the intercept is
fitted; it intercept = FALSE
, the intercept is set to
zero. Setting intercept = FALSE
is generally not
recommended.
If
estimate_residual_variance = TRUE
, the residual variance is
estimated, using residual_variance
as an initial value. If
estimate_residual_variance = FALSE
, the residual variance is
fixed to the value supplied by residual_variance
.
If estimate_prior_variance =
TRUE
, the prior variance is estimated (this is a separate
parameter for each of the L effects). If provided,
scaled_prior_variance
is then used as an initial value for
the optimization. When estimate_prior_variance = FALSE
, the
prior variance for each of the L effects is determined by the
value supplied to scaled_prior_variance
.
The method used for estimating prior
variance. When estimate_prior_method = "simple"
is used, the
likelihood at the specified prior variance is compared to the
likelihood at a variance of zero, and the setting with the larger
likelihood is retained.
When the prior variance is estimated,
compare the estimate with the null, and set the prior variance to
zero unless the log-likelihood using the estimate is larger by this
threshold amount. For example, if you set
check_null_threshold = 0.1
, this will "nudge" the estimate
towards zero when the difference in log-likelihoods is small. A
note of caution that setting this to a value greater than zero may
lead the IBSS fitting procedure to occasionally decrease the ELBO.
When the prior variance is estimated, compare the
estimated value to prior_tol
at the end of the computation,
and exclude a single effect from PIP computation if the estimated
prior variance is smaller than this tolerance value.
Upper limit on the estimated
residual variance. It is only relevant when
estimate_residual_variance = TRUE
.
A previous susie fit with which to initialize.
A number between 0 and 1 specifying the “coverage” of the estimated confidence sets.
Minimum absolute correlation allowed in a credible set. The default, 0.5, corresponds to a squared correlation of 0.25, which is a commonly used threshold for genotype data in genetic studies.
If compute_univariate_zscore
= TRUE
, the univariate regression z-scores are outputted for each
variable.
Drop any missing values in Y from both X and Y.
Maximum number of IBSS iterations to perform.
A small, non-negative number specifying the convergence
tolerance for the IBSS fitting procedure. The fitting procedure
will halt when the difference in the variational lower bound, or
“ELBO” (the objective function to be maximized), is
less than tol
.
If verbose = TRUE
, the algorithm's progress,
and a summary of the optimization settings, are printed to the
console.
If track_fit = TRUE
, trace
is also returned containing detailed information about the
estimates at each iteration of the IBSS fitting procedure.
Lower limit on the estimated
residual variance. It is only relevant when
estimate_residual_variance = TRUE
.
If refine = TRUE
, we use a procedure to help
SuSiE get out of local optimum.
A p-vector of estimated effects.
A p-vector of standard errors.
A p by p symmetric, positive semidefinite matrix. It
can be \(X'X\), the covariance matrix \(X'X/(n-1)\), or a
correlation matrix. It should be estimated from the same samples
used to compute bhat
and shat
. Using an out-of-sample
matrix may produce unreliable results.
The sample size.
The sample variance of y, defined as \(y'y/(n-1)\).
When the sample variance cannot be provided, the coefficients
(returned from coef
) are computed on the "standardized" X, y
scale.
A p by p matrix \(X'X\) in which the columns of X are centered to have mean zero.
A p-vector \(X'y\) in which y and the columns of X are centered to have mean zero.
A scalar \(y'y\) in which y is centered to have mean zero.
Minor allele frequency; to be used along with
maf_thresh
to filter input summary statistics.
Variants having a minor allele frequency smaller than this threshold are not used.
Tolerance level for eigenvalue check of positive semidefinite matrix of R.
The intercept. (The intercept cannot be
estimated from centered summary data.) This setting will be used by
coef
to assign an intercept value, mainly for consistency
with susie
. Set to NULL
if you want coef
not
to include an intercept term (and so only a p-vector is returned).
If check_input = TRUE
,
susie_suff_stat
performs additional checks on XtX
and
Xty
. The checks are: (1) check that XtX
is positive
semidefinite; (2) check that Xty
is in the space spanned by
the non-zero eigenvectors of XtX
.
A "susie"
object with some or all of the following
elements:
An L by p matrix of posterior inclusion probabilites.
An L by p matrix of posterior means, conditional on inclusion.
An L by p matrix of posterior second moments, conditional on inclusion.
A vector of length n, equal to X %*% colSums(alpha
* mu)
.
log-Bayes Factor for each single effect.
log-Bayes Factor for each variable and single effect.
Intercept (fixed or estimated).
Residual variance (fixed or estimated).
Prior variance of the non-zero elements of b, equal to
scaled_prior_variance * var(Y)
.
The value of the variational lower bound, or “ELBO” (objective function to be maximized), achieved at each iteration of the IBSS fitting procedure.
Vector of length n containing the fitted values of the outcome.
Credible sets estimated from model fit; see
susie_get_cs
for details.
A vector of length p giving the (marginal) posterior inclusion probabilities for all p covariates.
A vector of univariate z-scores.
Number of IBSS iterations that were performed.
TRUE
or FALSE
indicating whether
the IBSS converged to a solution within the chosen tolerance
level.
susie_suff_stat
returns also outputs:
A p-vector of t(X)
times the fitted values,
X %*% colSums(alpha*mu)
.
susie_suff_stat
performs sum of single-effect
linear regression with summary statistics. The required summary
data are either: bhat
, shat
, the p by p symmetric,
positive semidefinite correlation (or covariance) matrix R
,
the sample size n
, and the variance of y; or the p by p
matrix \(X'X\), the p-vector \(X'y\), the sum of squares
\(y'y\), and the sample size n
. The summary statistics
should come from the same individuals. Both the columns of X and
the vector y should be centered to have mean zero before computing
these summary statistics; you may also want to scale each column of
X and y to have variance 1 (see examples).
G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B https://doi.org/10.1101/501114.
# susie example.
set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
res1 = susie(X,y,L = 10)
#> Error in susie(X, y, L = 10): could not find function "susie"
plot(beta,coef(res1)[-1])
#> Error in coef(res1): object 'res1' not found
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
#> Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
plot(y,predict(res1))
#> Error in predict(res1): object 'res1' not found
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
#> Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet
# susie_suff_stat example.
input_ss = compute_ss(X,y,standardize = TRUE)
#> Error in compute_ss(X, y, standardize = TRUE): could not find function "compute_ss"
res2 = with(input_ss,
susie_suff_stat(XtX = XtX,Xty = Xty,yty = yty,n = n,L = 10))
#> Error in with(input_ss, susie_suff_stat(XtX = XtX, Xty = Xty, yty = yty, n = n, L = 10)): object 'input_ss' not found
plot(coef(res1)[-1],coef(res2)[-1])
#> Error in coef(res1): object 'res1' not found
abline(a = 0,b = 1,col = "skyblue",lty = "dashed")
#> Error in int_abline(a = a, b = b, h = h, v = v, untf = untf, ...): plot.new has not been called yet