Thanks to some amazing contributions and discussions in the comment section of my previous post on this article (special thanks to ‘qq’), we’ve now managed to identify the majority of the White Labs strains in the set of brewing strains from the Gallone et al. 2016 paper (or at least to make good estimates).
In addition to the genotype data, the Gallone et al. 2016 paper contains a huge amount of phenotypic data on these strains, and this information could be very useful for (home)brewers as well now that we know which strains are which (at least most of them). I’ve compiled the following spreadsheet (with the help of ‘qq’) containing all the identifications and useful phenotypic data (e.g. POF, flocculation, sugar use, various tolerances, and ester concentrations). See the ‘Notes’ sheet for more information. Press the icon in the lower right corner to open the spreadsheet in a new window (recommended). Please feel free to leave feedback or suggestions for improvement if you find something unclear or missing!
Updated family trees
I also put together the following dendrograms displaying the White Labs codes for those interested in how they look in the family tree (click to enlarge). As an added bonus, if anyone is wondering where the lager yeasts fit in on the tree, I’ve expanded it with the S. cerevisiae sub-genomes of two Frohberg strains (‘A15’ (VTT-A63015) and W34/70; names in purple) and two Saaz strains (CBS1503 and CBS1538; names in blue). We’ve sequenced A15 ourselves, while the other sequence reads were pulled from Okuno et al. 2016.
The following dendrograms are also available:
‘qq’ also put together the following dendrograms (Beer 1 and Beer 2 groups separately) from Figure 1 in the original study:
There are some really interesting observations that can be made, e.g.
- WLP540, the ‘Rochefort’ strain, appears to be of British origin (POF- as well), and not related to other Trappist strains.
- The group of ‘US’ strains appears to contain one English and one Belgian strain. The Belgian one could be WLP515.
- WLP800 and WLP320 appear closely related, and their phenotypes seem very similar. Anyone want to try making a lager with WLP320 American Hefeweizen?
- Not related to the White Labs strains at all, but two strains used for industrial lager production (Beer039 and Beer040) appear to be closely related to various Belgian Ale strains in the Beer2 group.
- As can be seen, the lager yeasts branch off early in the ‘Beer 1’ clade, together with the Hefeweizen strains. This is similar to the result obtained by Gonçalves et al. 2016.
I’ve also recreated the heatmap from Figure 3 in the original publication, now with the White Labs codes in place of the strain codes. Strains (rows) that are grouped closely together are phenotypically similar. You can see a nice division between the ‘Beer 1’ strains and the rest of the strains. Click to enlarge.
Furthermore, I performed Principal Component Analysis (PCA) on the phenotype dataset, and below you can find the scores and loadings plots for ‘PC1 vs PC2’ and ‘PC1 vs PC3’. These first three principal components explained about 40% of the variation in the dataset. In layman’s terms, you can interpret the plots in the following way:
- Strains close to each other on the ‘scores’ plot (top) are phenotypically similar.
- By looking at the corresponding location in the ‘loadings’ plot (bottom), you can find what phenotypic traits are strongly associated with those particular strains.
- The larger the magnitude of a trait’s loading (i.e. the further it lies from the origin of the ‘loadings’ plot), the more it contributes to these principal components.
- For example, at the top of the ‘scores’ plot for ‘PC1 vs PC2’ we find ‘wine018’, ‘wine017’ and ‘WLP050’. Looking at the ‘loadings’ plot, we see that these strains are strongly associated with high production of ethyl, isoamyl, isobutyl and phenylethyl acetate, as well as isoamyl alcohol. These particular traits have a large effect on PC2.
- Use these plots e.g. to find similar strains (if you want to substitute one yeast for another for example), or strains associated with particular traits.
PC1 vs PC2
PC1 vs PC3
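For those who want to try something similar on the spreadsheet data, here is a minimal sketch of how scores and loadings can be computed. It is not necessarily how the plots above were produced: the autoscaling step, the random toy data, and the strain and trait names (`strain_A`, `trait_1`, ...) are all assumptions made for illustration.

```python
import numpy as np

def pca(X, n_components=3):
    """PCA via SVD on a (strains x traits) matrix.

    Assumption: every trait is autoscaled (mean 0, unit variance) first,
    so that traits measured in different units are comparable.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # one row per strain
    loadings = Vt[:n_components].T                    # one row per trait
    explained = (S ** 2) / np.sum(S ** 2)             # fraction of variance per PC
    return scores, loadings, explained[:n_components]

def nearest_strains(scores, names, query, k=2):
    """Return the k strains closest to `query` in score space."""
    i = names.index(query)
    dist = np.linalg.norm(scores - scores[i], axis=1)
    order = [j for j in np.argsort(dist) if j != i][:k]
    return [names[j] for j in order]

# Hypothetical toy data: 6 strains x 4 traits of random numbers.
rng = np.random.default_rng(0)
names = ["strain_A", "strain_B", "strain_C", "strain_D", "strain_E", "strain_F"]
traits = ["trait_1", "trait_2", "trait_3", "trait_4"]
X = rng.normal(size=(len(names), len(traits)))

scores, loadings, explained = pca(X, n_components=3)
similar = nearest_strains(scores, names, "strain_A", k=2)
print("Variance explained by PC1-PC3:", explained.round(2))
print("Closest to strain_A:", similar)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives a ‘PC1 vs PC2’ scores plot, and the corresponding columns of `loadings` give the loadings plot. Distances in score space are only a rough guide to similarity, since they ignore the variation not captured by the first few components.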
Presence of STA1 in the strains
Following the recent news about a potential S. cerevisiae var. diastaticus contamination in White Labs yeast, I also did a BLAST search for STA1 (GenBank: X02649.1), the gene encoding an extracellular glucoamylase, in each of the assemblies. Interestingly, there were full hits in multiple strains (partial matches in many strains as well). The strains are:
- Beer059 (probably WLP026)
- Beer085 (very likely WLP570)
- Beer086 (probably WLP585)
- Beer091 (probably a WLP strain)
- Beer092 (probably a WLP strain)
- Wine019 (probably a WLP strain)
I also assembled the reads from the WLP570 sample in the 1002 yeast genomes project, and STA1 was present in the resulting assembly as well. While these strains appear to contain STA1, it is unclear if any of them are diastatic (this needs to be tested). Interestingly, all the STA1 sequences contain a ‘T’ insertion at position 2406, close to the stop codon (positions 2427-2429). With the insertion, the reading frame shifts and the original stop codon is no longer read in frame, so the amino acid sequence is extended by 39 amino acids (and these amino acids are homologous to the C-terminus of the SGA1-encoded intracellular glucoamylase). All this data is in the Excel spreadsheet as well!
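To illustrate why a single inserted base near the end of a gene can lengthen the protein, here is a small sketch. The sequence is a made-up toy example, not the real STA1 sequence, and the positions are chosen only to keep it short; the helper names `first_stop` and `insert_base` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_stop(seq, start=0):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

def insert_base(seq, pos, base):
    """Insert `base` so that it occupies 1-based position `pos`."""
    return seq[:pos - 1] + base + seq[pos - 1:]

# Toy ORF (assumption): ATG GCT GAA CTG TAA, followed by downstream sequence.
original = "ATGGCTGAACTGTAAGCTGACC"
mutant = insert_base(original, 8, "T")   # one-base insertion before the stop

stop_original = first_stop(original)     # index 12 (TAA)
stop_mutant = first_stop(mutant)         # index 18 (TGA, in the shifted frame)
extra_codons = (stop_mutant - stop_original) // 3

print("Codons before stop, original:", stop_original // 3)  # 4
print("Codons before stop, mutant:  ", stop_mutant // 3)    # 6
print("Extension in amino acids:    ", extra_codons)        # 2
```

The same logic applied to the real sequences is what gives the 39 amino acid extension mentioned above: after the insertion, translation runs past the old stop codon until the next stop codon in the shifted frame.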
In addition to these identifications, the comment section of the previous post contains a lot of interesting discussion on topics such as the history and spread of different strains, and whether interspecies hybrids have been included in the study. Please have a look! This information would definitely be worth a post of its own, but I’m sure someone else would be better at putting together such a post than me.
Hopefully White Labs will now be willing to unblind their own strains. It would make this data much more valuable, and I’m not really sure what they have to gain from keeping it secret.
Briefly, for those wondering how the tree was constructed: the Illumina reads from the lager strains were aligned to a concatenated reference genome of S. cerevisiae and S. eubayanus. FreeBayes was used to call variants. A consensus sequence for each strain was created using BCFtools and the FreeBayes VCFs. Only the S. cerevisiae sub-genome was retained for subsequent analysis. In addition, regions with coverage below 5× were excluded. kSNP3 was used for SNP detection and phylogenetic analysis on the set of assemblies from Gallone et al. 2016 and the four consensus sequences. The resulting SNP matrix was fed into IQ-TREE to produce a maximum likelihood tree with 1000 ultrafast bootstrap replicates. While the tree is very similar to the one in the Gallone et al. 2016 paper, there are some minor differences resulting from the different methodology (for example, they only looked at SNPs located in the coding sequences of genes, while this approach looks at the whole genome).
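As a rough illustration of the coverage filter, the sketch below masks low-coverage positions in a consensus sequence with ‘N’. This is only one possible way to exclude such regions and is an assumption, not a description of the exact commands used; the function name `mask_low_coverage` and the toy sequence and depths are hypothetical. In practice, per-base depths could come from a tool such as `samtools depth`.

```python
def mask_low_coverage(sequence, depths, min_depth=5, mask_char="N"):
    """Replace every base whose read depth is below `min_depth` with `mask_char`."""
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    return "".join(
        base if depth >= min_depth else mask_char
        for base, depth in zip(sequence, depths)
    )

# Toy consensus and per-base read depths (made-up numbers).
consensus = "ACGTACGT"
depths = [10, 7, 4, 0, 5, 12, 3, 9]

masked = mask_low_coverage(consensus, depths)
print(masked)  # ACNNACNT
```

Masked positions carry no sequence information, so they cannot contribute spurious SNPs to the downstream phylogenetic analysis.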
My previous post contains information and discussion about how the White Labs strain identification was done.
DISCLAIMER: These are only guesses based on a range of evidence.