Looking through the tree produced in my last post, I noticed that several interesting sequences were missing from the tree. There are also less sequences in the tree than I get if I search for “Nostoc rbcX” in Entrez. Turns out that this is because blast+ limits the number of results returned to 500 by default. In retrospect, the fact that I ended up with exactly 500 sequences should have been a red flag. Fortunately, blast+ includes an option to exclude sequences from the results by GI number.
First I will move all of the old files to a separate folder and make a new folder for the new files:
mkdir 130429 130502
mv Nostoc_rbcX* 130429/
Extract GIs from sequence headers and write to a file:
grep “>” 130429/Nostoc_rbcX.fa | perl -p -e ‘s/>gi|(d+)|.*/$1/’ >130429/Nostoc_rbcX_gis.txt
Repeat blast search using the -negative_gilist option:
blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-20 -outfmt ‘6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore’ -negative_gilist 130429/Nostoc_rbcX_gis.txt -out 130502/Nostoc_rbcX.bl
This returns 430 additional results, but the sequence similarity starts to drop off rather quickly, with only 19 sequences with an Evalue of less than 1e-180. The top hits below this value are all Anabaena sequences, so it is probably safe to use this as a cutoff for now.
Repeat blast search with a more stringent evalue cutoff:
blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-180 -outfmt ‘6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore’ -negative_gilist 130429/Nostoc_rbcX_gis.txt -out 130502/Nostoc_rbcX.bl
Obtaining the sequences, trimming, reverse-complementing, and renaming proceeds as described previously.
Before adding the new sequences to the alignment, it will be helpful to remove the redundant sequences so that the groupings do have to be recalculated in MetaPIGA:
cat 130429/Nostoc_rbcX_metadata.txt | Scripts/RemoveRedundant.pl 130429/Nostoc_rbcX_aln.fa >Nostoc_rbcX_nr_aln.fa
Add new sequences to the alignment:
mafft –add 130502/Nostoc_rbcX_filtered.fa Nostoc_rbcX_nr_aln.fa > Nostoc_rbcX_aln.fa
After applying automated filtering in MetaPIGA and building a tree as described previously, I got a tree that is very different from the one produced last time. Comparing the alignment files from the two analyses, 288 ambiguously aligned positions that were trimmed previously were left in this time. MetaPIGA uses the “gappyout” method from TrimAl for automated alignment trimming. This method appears to be highly sensitive to which sequences are included. In order to keep things consistent when I add new taxa, I am going to manually apply an exclusion set. Because the coordinates of the exclusion set may change as the dimensions of the alignment change, I selected a sequence without any insertions to act a reference. I then used the gap positions in this sequence as an exclusion set for the “-select” method in the command line version of TrimAl.
Remove redundant sequences:
Open Nostoc_rbcX_aln.fa in MetaPIGA, then select Dataset->”Check for Ambiguous sequences.” Copy resultant groupings to 130502/Nostoc_rbcX_groups.txt and save the modified dataset as Nostoc_rbcX.nex
Trim ambiguous regions:
Scripts/GetExcluded.pl Nostoc_rbcX_aln.fa |pbcopy
trimal -in Nostoc_rbcX.nex -phylip -select `pbpaste` >Nostoc_rbcX.phy
This uses the very cool pbcopy/pbcopy commands to pipe output from the first command to the clipboard so it can be pasted into the command line for the second command. These commands are OSX specific, but there are similar commands available in other flavours of unix.
phyml -i Nostoc_rbcX.phy
mv Nostoc_rbcX.phy_phyml_tree.txt Nostoc_rbcX.nwk
Next, I need to integrate the new groups of redundant sequences with the old set. This information also needs to be added to the Nostoc_rbcX_metadata.txt file. I started by parsing the fasta headers as previously to extract host information, then modified the AddGroups.pl script to compare the new list of redundant sequence groups to the previous list and to add new sequences to the old groups if necessary:
Scripts/ParseHost.pl < 130502/Nostoc_rbcX.fa > 130502/Nostoc_rbcX_metadata.txt
cat 130429/Nostoc_rbcX_metadata.txt 130502/Nostoc_rbcX_metadata.txt | Scripts/AddGroup.pl Nostoc_rbcX.nwk 130502/Nostoc_rbcX_groups.txt > Nostoc_rbcX_metadata.txt
Lastly, I added the host info from the metadata file to the tree and visualize with FigTree:
Scripts/AddHost.pl Nostoc_rbcX_metadata.txt < Nostoc_rbcX.nwk >Nostoc_rbcX_host.nwk
Which produces this tree. Because this is a more conservative exclusion set, the tree looks quite different from the one I produced previously, but if I reanalyse that dataset using this exclusion set, I get a very similar topology.
A few quick observations about this tree for now:
The most basal lichen photobiont sequence (DQ185264 from Peltigera didactyla) is from a culture and may represent an epiphyte or accessory symbiont, rather than the true photobiont.
The next branch includes all sequenced photobionts of Peltigera malacea, Leptogium lichenoides and Sticta hypochra, each of which forms a more-or-less distinct cluster:
The much-discussed Nephroma guild is not recovered as monophyletic but forms a paraphyletic grade at the base of the core photobiont lineage (I suspect that this may be an artifact of some kind):
There is one Stereocaulon photobiont, which is on an extremely long branch, that groups with four divergent Peltigera neopolydactyla photobionts, though there are other P. neopolydactyla photobionts throughout the tree:
There are plenty of other interesting observations to be made about this tree. More soon.