How to do structure-informed adaptation detection on protein phylogenies

There are plenty of phylogenetics tutorials out there, but I haven't encountered one that dealt with the idiosyncracies of short-sequence, deeply-diverged proteins. Below are suggestions for constructing phylogenies and reconstructing ancestors for this type of data, as well as a guide to detecting adaptative events in your phylogeny using reconstructed ancestral sequences and the crystal structure of your protein. The tutorial also includes some notes on using PyMol, a very good open-source protein visualization program. Suggestions and questions are much appreciated!

You will need:

A fasta file of sequences ready for alignment (you have given them names you like and won’t regret, like the genus and species of the organism they came from). Make sure you have good taxon sampling! I.e. you include sequences from all taxa known to have the protein, and none are overrepresented in a way that might bias your results.
A PDB file of your protein’s structure. You can do a little analysis without this, but it’ll help you identify interesting regions. If there are many crystals of your protein, choose one where subunits are maximally in contact, or choose several.

Align your sequences with a program like Mafft, clean with something like Trim-al. Guidance calculates alignment support scores.
Make a phylogeny. Play around with PhyML, RAxML, IQTree, and MrBayes, and calculate approximate likelihood ratios and bootstrap support. Understand the logic behind the program you use, and its strengths and weaknesses. The Phylogenetics Handbook is a great, and freely available, resource.
Reconstruct ancestors! I like Ziheng Yang's program PAML. PAML gives Bayesian posterior probabilities for each site in each ancestral sequence, and although these are really worth exploring ( to determine the robustness of results to probable deviations from the most likely ancestor), for simplicity this tutorial will deal only with the most likely ancestral sequence at each node.
There are 2 main types of uncertainty in ASR: the alignment and the ancestral sequence itself. When setting up a spreadsheet to analyze evolutionary rate, I like to make a column for each type of uncertainty at a site. At the bottom of this protocol is an example of the spreadsheets I construct for these analyses.
1. Use a program like Guidance to determine support for sites and sequences in the alignment. Guidance calls sites with confidence<0.93 and sequences with confidence<0.6 “unreliable.” I wouldn’t suggest removing unreliable (very divergent or fast evolving, respectively) sequences or sites in ASR, but it is a good idea to keep track of uncertainty so you know how meaningful a given conclusion is (e.g. if you find that the C-terminus of your protein appears to be evolving very quickly, but that part of the alignment is unreliable by guidance or ZORRO standards, then the heterogeneity at those sites might be an effect of misalignment, not mutation rate).
2. Quantify the reliability of your reconstruction at each site of your ancestral sequence (the deepest node in your tree, or subtree-of-interest).
Identify the interesting regions in your protein and make a column for each in your spreadsheet. These can be gleaned from the literature and will differ for each molecule, but in general the N-terminus and C-terminus (30 residues or so) are worth taking a look at, as are any contact surfaces (where subunits touch, where other proteins dock, where your protein touches a membrane if it does), channels, or binding pockets. Other things to look out for are secondary structures (helices, sheets, loops) whose functional importance has been demonstrated or hypothesized in the literature. Classical biochemistry approaches can miss interesting bits of proteins and so your search should not be limited to identified regions. Look at your protein in a viewer like PyMol for ideas. Better yet, look at your ancestor and a modern crystal: are there domains that differ between the two?
Identify the sites within your regions of interest. In the region’s column in your spreadsheet, each site will get a binary value for inclusion in the region (1=site is in that region, 0=site is not in that region). I use Pymol for most of these. To analyze a subunit in an oligomeric protein, for example, I would do the following:
1. Split your protein into subunits by selecting chains and copying them to objects. For example, for analysis of the D subunit of the nitrogenase holoenzyme (which has 2 D subunits, 2 K subunits, and 2 docked homodimeric reductases), I make 6 subunit objects in PyMol: D1, D2, K1, K2, Reductase1, and Reductase2. I also make objects for all cofactors and ligands.
2. For subunit contact surfaces, type sele br. D1 within 5 of K1 into the PyMol command line to select residues in the D subunit that could be interacting with the K subunit by VDW forces. Copy these residues to an object if you like (I like to save a PyMol workspace with objects of all of my regions of interest, for reference and editing). Regardless, identify these residues in the protein sequence by turning on sequence view in PyMol and putting a 1 in your spreadsheet column (“Contact surface with K1 subunit” in the spreadsheet below, for this example) for each residue in the region. Put a 0 for every residue not in the region.
3. For a cofactor or ligand, I use a smaller distance (sele br. D1 within 2.5 of FeMoco). You could also create 2 separate objects called small_binding_pocket and large_binding_pocket or something, with residues at 2 and 4 angstroms respectively, to analyze separately those in covalent or H-bond interactions with the cofactor or ligand and those near but not bonded to the cofactor or ligand. Fill in your spreadsheet columns.
4. If you have channels or caves without ligands or helices of interest, either select them manually or install a PyMol plugin that identifies that sort of thing.
Calculate the diversity score at each site. This will be a value between 0 and 2 that captures the variance of a given site in your alignment.

Analyzing evolutionary rates along a branch

Hanon McShea

Contact