By [Hopf and Marks, NBT, 2017]
Motivation: pairwise correlation measures such as mutual information cannot distinguish direct from indirect coupling. A global model such as a Markov random field (MRF) can: e.g. if the model has energy terms for A-B and B-C, these explain away the observed A-C correlation, so no direct A-C term is needed.
Model: given a sequence alignment, treat all sequences as samples from a stationary distribution. The probability of a sequence is determined by its fitness (or energy). The energy is modeled with an MRF: single-AA (site) terms plus pairwise AA interaction terms.
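Written out, the model has the usual Potts form. The symbols below (sigma for a sequence, h_i for site terms, J_ij for pairwise terms, Z for the normalizing constant) are notation assumed here, not taken from these notes, and the sign convention assumes that higher energy means higher probability (fitter sequence):

```latex
P(\sigma) = \frac{1}{Z} \exp\big(E(\sigma)\big), \qquad
E(\sigma) = \sum_{i} h_i(\sigma_i) + \sum_{i<j} J_{ij}(\sigma_i, \sigma_j)
```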
Accommodating indels: treat indels as an observed process, so a gap at a position contributes nothing to the energy. Accounting for sequence correlation due to phylogeny: sequence weighting. The weight of each sequence is proportional to the inverse of the number of similar sequences in the alignment.
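A minimal sketch of the weighting idea. The function name, the 0.8 identity threshold, and the toy alignment are illustrative assumptions, and identity is computed naively over all columns (gaps included), which a real pipeline might handle differently:

```python
def sequence_weights(alignment, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of similar sequences, itself included)."""
    def identity(a, b):
        # Fraction of aligned columns where the two sequences agree.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    weights = []
    for s in alignment:
        n_similar = sum(identity(s, t) >= identity_threshold for t in alignment)
        weights.append(1.0 / n_similar)
    return weights

# Toy alignment (hypothetical): two identical sequences share their weight.
alignment = ["ACDE", "ACDE", "ACDF", "WYWY"]
weights = sequence_weights(alignment)   # [0.5, 0.5, 1.0, 1.0]
effective_n = sum(weights)              # 3.0 effective sequences
```

The sum of the weights acts as an effective number of sequences, so a cluster of near-duplicates counts roughly once.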
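Inference: (1) Regularization: L2 penalty on the parameters. (2) Use the site-factored pseudolikelihood approximation, which replaces the full likelihood with a product of per-site conditional likelihoods and so avoids computing the global partition function.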
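The objective can be sketched as follows: a weighted negative log-pseudolikelihood plus an L2 penalty. This is a toy illustration under stated assumptions, not the paper's implementation: the parameter layout (h[i][a], J[(i, j)][(a, b)] for i < j), the function names, and the penalty strengths lam_h and lam_J are all hypothetical, and gap handling is left out for brevity.

```python
import math

def zero_params(length, alphabet):
    """All-zero site terms h[i][a] and pairwise terms J[(i, j)][(a, b)], i < j."""
    h = [{a: 0.0 for a in alphabet} for _ in range(length)]
    J = {(i, j): {(a, b): 0.0 for a in alphabet for b in alphabet}
         for i in range(length) for j in range(i + 1, length)}
    return h, J

def neg_log_pseudolikelihood(seqs, weights, h, J, alphabet, lam_h=0.0, lam_J=0.0):
    length = len(h)

    def coupling(i, a, j, b):
        # J is stored only for i < j, so swap the arguments when needed.
        return J[(i, j)][(a, b)] if i < j else J[(j, i)][(b, a)]

    total = 0.0
    for seq, w in zip(seqs, weights):
        for i in range(length):
            # Score of every candidate symbol at site i, conditioning on
            # the observed symbols at all other sites.
            scores = {
                a: h[i][a] + sum(coupling(i, a, j, seq[j])
                                 for j in range(length) if j != i)
                for a in alphabet
            }
            # Per-site normalizer: a sum over the alphabet only,
            # not over all possible sequences.
            log_norm = math.log(sum(math.exp(s) for s in scores.values()))
            total -= w * (scores[seq[i]] - log_norm)

    # L2 penalty on all parameters.
    penalty = lam_h * sum(v * v for site in h for v in site.values())
    penalty += lam_J * sum(v * v for pair in J.values() for v in pair.values())
    return total + penalty

# With all-zero parameters every site conditional is uniform over the alphabet.
alphabet = "AB"
h, J = zero_params(2, alphabet)
baseline = neg_log_pseudolikelihood(["AB", "BA"], [1.0, 1.0], h, J, alphabet)
```

In practice this objective would be handed to a gradient-based optimizer. The point of the sketch is only that each site needs a normalizer over the alphabet, never over the full sequence space.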
Assessing deleterious effects of mutations: the change in fitness (energy) between the mutant and the wild-type sequence.
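A small sketch of the mutation score as an energy difference. The toy parameters, the two-letter alphabet, and the function names are made up for illustration, and the sign convention assumes that higher energy means fitter, so a negative difference reads as deleterious. Gapped positions are skipped so that they contribute nothing to the energy.

```python
GAP = "-"

def energy(seq, h, J):
    """Sum of site terms and pairwise terms over non-gap positions."""
    sites = [i for i, a in enumerate(seq) if a != GAP]
    e = sum(h[i][seq[i]] for i in sites)
    e += sum(J[(i, j)][(seq[i], seq[j])] for i in sites for j in sites if i < j)
    return e

def delta_energy(wild_type, position, new_aa, h, J):
    """Energy of the single-substitution mutant minus energy of the wild type."""
    mutant = wild_type[:position] + new_aa + wild_type[position + 1:]
    return energy(mutant, h, J) - energy(wild_type, h, J)

# Hypothetical parameters for a length-2 sequence over the alphabet {A, B}.
h = [{"A": 0.5, "B": -0.5}, {"A": 0.0, "B": 0.0}]
J = {(0, 1): {("A", "A"): 1.0, ("A", "B"): 0.0,
              ("B", "A"): 0.0, ("B", "B"): 1.0}}

# A -> B at position 0 loses both the site preference and the A-A coupling.
effect = delta_energy("AA", 0, "B", h, J)   # -2.0
```

Because the pairwise terms enter the difference, the same substitution can score differently in different sequence backgrounds, which a site-independent model cannot capture.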
Agreement with experimental fitness: correlation between the predicted effects of all AA substitutions and experimentally measured fitness.
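One way to compute such an agreement score is a rank correlation. The notes only say "correlation", so the choice of Spearman here is an assumption, and the numbers below are invented for illustration:

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted effects and measured fitness for four mutants.
predicted = [-2.0, -0.5, -1.0, 0.1]
measured = [0.1, 0.8, 0.5, 1.0]
rho = spearman(predicted, measured)   # same ordering in both lists
```

A rank-based measure only asks that the model orders mutants the same way as the experiment, which is convenient when the two quantities live on different, possibly nonlinearly related scales.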
Assessing human disease variants: performance is similar to PolyPhen-2 (PPH2) and SIFT. Note that EVmutation is unsupervised: it is fit to the alignment alone, without labeled variants.
The same model can be applied to protein interactions. The search space is larger, so it may be important to incorporate priors on contact positions. Can we also use it to fit experimental fitness data?
Lesson: use an MRF to account for indirect correlations.