CAZymes (Carbohydrate-Active enZymes) are among the most important enzymes for the bioenergy and agricultural industries because they are responsible for the synthesis, degradation, and modification of all carbohydrates on Earth (Cantarel, et al., 2009). The CAZy database classifies CAZymes, based on sequence similarity, into 356 protein families in six classes: 97 glycosyltransferase (GT, EC 2.4.-.-) families, 129 glycoside hydrolase (GH, EC 3.2.1.-) families, 23 polysaccharide lyase (PL, EC 4.2.2.-) families, 16 carbohydrate esterase (CE) families, 13 auxiliary activity (AA) families, and 78 carbohydrate binding module (CBM) families. In general, GTs build complex carbohydrate polymers, while GHs, AAs, CEs, and PLs break down and modify carbohydrates. CBMs, as their name indicates, are structural modules that recognize and bind different carbohydrates.
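As a quick consistency check on the figures above, the per-class family counts can be tallied in a few lines of Python. This is only an illustrative sketch: the counts are copied from the text, while the dictionary and function names are our own and do not come from CAZy or dbCAN.

```python
# Family counts per CAZyme class, copied from the text above.
# The names FAMILY_COUNTS, total_families and largest_class are illustrative only.
FAMILY_COUNTS = {
    "GT": 97,   # glycosyltransferases
    "GH": 129,  # glycoside hydrolases
    "PL": 23,   # polysaccharide lyases
    "CE": 16,   # carbohydrate esterases
    "AA": 13,   # auxiliary activities
    "CBM": 78,  # carbohydrate binding modules
}


def total_families(counts):
    """Return the total number of families across all classes."""
    return sum(counts.values())


def largest_class(counts):
    """Return the class abbreviation with the most families."""
    return max(counts, key=counts.get)


if __name__ == "__main__":
    print("classes:", len(FAMILY_COUNTS))
    print("total families:", total_families(FAMILY_COUNTS))
    print("largest class:", largest_class(FAMILY_COUNTS))
```

The six counts sum to the 356 families quoted above, and GH is the largest class.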

Different families within a class share little or no sequence similarity, yet may have distant structural or evolutionary relationships. One interesting and unaddressed question is how the families of each CAZyme class are evolutionarily related. To answer this question, we downloaded the pHMMs (profile hidden Markov models) of the 356 CAZyme families from the dbCAN database (Yin, et al., 2012) and built pHMM phylogenies for each of the six classes.

The GH class (Henrissat, 1991) is the largest CAZyme class, including 129 families as of July 15th, 2016 (six previous families were later removed but their names remain in the CAZy nomenclature system). Between class and family, the CAZy database also defines an intermediate level, the clan, to group families that share a similar 3D structural fold and conserved catalytic residues (Henrissat and Bairoch, 1996). So far, 52 GH families have been classified into 14 clans named GH-A to GH-N. GH-A is the largest clan, containing 19 families, followed by GH-E, containing 4 families. All other clans include either two or three families. No phylogeny describing the relationships among the different GH clans and families has been reported.

In the pHMM phylogeny of the GH class, built using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method of Phylip (Felsenstein, J. 2005), the 19 families of GH-A all cluster together. The GH-D, GH-H, and GH-K clans, which share the (β/α)8 fold with GH-A, all cluster with GH-A in a larger clade. Similarly, the GH-G, GH-L, and GH-M clans share the (α/α)6 fold and all cluster in one clade. GH-F and GH-J share the 5-fold β-propeller and cluster together. The two β-jelly roll clans, GH-B and GH-C, cluster with the three GH-I families, with the exception of GH7. The 6-fold β-propeller GH-E clan has three families that cluster with the 5-fold β-propeller clans, while its fourth family clusters with the GH-N clan (β-helix fold). The pHMM phylogeny also allows clan assignments to be proposed for many of the CAZy-unclassified families.
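To make the UPGMA procedure named above concrete, the following is a minimal, generic Python sketch of the algorithm. It is not Phylip's implementation. The four labels and the pairwise distances are made-up toy values, and because this section does not state how distances between pHMMs were computed, the distance matrix is simply assumed as input.

```python
def upgma(names, dist):
    """Cluster labelled items with UPGMA (average linkage).

    names: list of leaf labels.
    dist:  symmetric matrix (list of lists) of pairwise distances.
    Returns (tree, heights): a nested-tuple rooted tree and the node
    height (half the merge distance) of each merge, in merge order.
    """
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        # Pick the closest pair of active clusters
        (a, b), merge_dist = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the mean over all leaf pairs
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every pair that involved the two merged clusters
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        heights.append(merge_dist / 2)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, heights


# Toy input: hypothetical labels and distances, not real pHMM data
TOY_NAMES = ["A", "B", "C", "D"]
TOY_DIST = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]

if __name__ == "__main__":
    tree, heights = upgma(TOY_NAMES, TOY_DIST)
    print(tree)
    print(heights)
```

In the toy input, A and B are closest and merge first, C joins them next, and D joins last. UPGMA assumes a constant rate of change along all branches, so the resulting tree is rooted and ultrametric.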

Overall, the GH pHMM phylogeny reveals evolutionary patterns in agreement with the structural fold-based clan classification, suggesting that, at least for GH enzymes, structural conservation signifies evolutionary relatedness.

The GT class (Coutinho, et al., 2003) contains 97 families as of July 15th, 2016 (three previous families were removed but their names remain in the CAZy nomenclature system). On the CAZy website, 47 of the 97 families are assigned to three clans: GT-A, GT-B, and GT-C. According to (Lairson, et al., 2008), the three clans were defined based on three major structural folds determined by 3D structures or by bioinformatics prediction: (i) GT-A families, named by (Bourne and Henrissat, 2001), contain "two closely abutting β/α/β Rossmann domains"; (ii) GT-B families, also named by (Bourne and Henrissat, 2001), contain "two β/α/β Rossmann domains that face each other and are linked flexibly"; and (iii) GT-C families, named by (Liu and Mushegian, 2003) using bioinformatics sequence and structure analyses, contain "a predicted protein topology for transmembrane glycosyltransferases that is not experimentally verified". Although other bioinformatics studies, e.g., (Kikuchi, et al., 2003), compared GT families and proposed additional GT clans, the three-clan classification has been widely adopted.

However, none of the previous analyses built a phylogeny of the different GT families. The pHMM phylogeny shown in Figure S2, built using the UPGMA method of Phylip (Felsenstein, J. 2005), generally supports the three-clan classification: three major clades each correspond to one of the three clans. Interestingly, GT63 falls into the GT-A clan instead of GT-B, suggesting that the structural fold-based classification does not always agree with the evolutionary classification (pHMM tree). In addition, GT78 and GT84, together with six CAZy-unclassified families (black branches in Figure S2: GT32, GT44, GT88, GT73, GT29, GT42), do not cluster within any of the three clans, suggesting that these families cannot be placed in a clan on evolutionary grounds. Nevertheless, with the pHMM phylogeny, most CAZy-unclassified families can be assigned to one of the three clans.
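The kind of check described above, whether the families of a clan form a single clade and which unclassified families fall inside it, can be sketched as follows. This is an illustrative Python example under stated assumptions: the tree is represented as nested tuples, the labels (a1, b1, u1, and so on) are hypothetical placeholders rather than real GT families, and the function names are our own.

```python
def leaf_set(tree):
    """Return the set of leaf labels under a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def smallest_clade(tree, members):
    """Return the leaves of the smallest subtree containing all members."""
    members = frozenset(members)
    if not members <= leaf_set(tree):
        raise ValueError("some members are not in the tree")
    if not isinstance(tree, str):
        # Descend while a single child still contains every member
        for child in tree:
            if members <= leaf_set(child):
                return smallest_clade(child, members)
    return leaf_set(tree)


def is_clade(tree, members):
    """True if the members form a clade with no other leaves inside it."""
    return smallest_clade(tree, members) == frozenset(members)


# Hypothetical rooted tree: a1-a3 and b1-b2 stand for families of two
# clans, and u1 stands for a family without a clan assignment.
TOY_TREE = ((("a1", "a2"), "a3"), ("b1", ("b2", "u1")))

if __name__ == "__main__":
    print(is_clade(TOY_TREE, {"a1", "a2", "a3"}))
    print(is_clade(TOY_TREE, {"b1", "b2"}))
    print(sorted(smallest_clade(TOY_TREE, {"b1", "b2"}) - {"b1", "b2"}))
```

In this toy tree the three "a" labels form a clean clade, while the smallest clade containing the two "b" labels also contains u1. By the reasoning used in the text, u1 would therefore be a candidate for assignment to the "b" clan.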

The AA class (Levasseur, et al., 2013) now contains four families of lytic polysaccharide monooxygenases (LPMOs) and nine families of ligninolytic enzymes. No clans have been defined for AA families in the CAZy database. In our pHMM phylogeny, the four LPMO families are clearly separated from the nine ligninolytic enzyme families. In addition, the phylogeny reveals close evolutionary relationships within the AA9-AA11, AA10-AA13, AA4-AA7, AA3-AA8, and AA5-AA12 pairs.

The PL class (Lombard, et al., 2010) now contains 23 families (PL19 was renamed as GH91). Thirteen of the 23 families have known structural folds according to the CAZy database. The parallel β-helix fold is found in four PL families, which all cluster together in the pHMM phylogeny. The β-jelly roll fold is found in two PL families, which also cluster together in the phylogeny. Each of the other folds is found in just one PL family. Of the 10 families that do not have known structural folds, (i) PL12, PL15, PL17, and PL21 form a monophyletic clade; (ii) PL22 and PL23 form a cluster with PL8 [(α/α)6 barrel + anti-parallel β-sheet]; (iii) PL24 clusters with PL11 (β-propeller); (iv) PL13 clusters with the two β-jelly roll families (PL7 and PL18); and (v) PL14 and PL20 cluster together.

The CE class contains 16 families in the CAZy database. Thirteen of them have known structural folds. Eight families have the (α/β/α)-sandwich fold and fall into three clades in the pHMM phylogeny. The other five families each have a different fold.

The CBM class (Boraston, et al., 2004) contains 78 families so far. According to the CAZy website and (Guillen, et al., 2010), 39 families have been classified into seven structural folds. Among them, the β-sandwich fold is found in 31 of the 39 families. In our pHMM phylogeny, these 31 families form a large clade that also includes most CAZy-unclassified families, reinforcing the observation that β-sandwich is the dominant fold in CBMs. Interestingly, the two β-trefoil families cluster within this β-sandwich clade but with different β-sandwich families, suggesting convergent evolution toward the same β-trefoil fold. On the CAZy website, CBM5, CBM12, and CBM73 are indicated to be distantly related, and in our pHMM phylogeny these three families cluster together. The other four folds correspond to nine families and are clearly evolutionarily distant from the dominant β-sandwich clade.

The CBM class is the most rapidly growing CAZyme class, with 16 new families added since July 2012. Except for CBM73, all of these new families cluster within the β-sandwich clade. The most recent additions, families CBM75 to CBM80, were made by (Venditto, et al., 2016). Four of them (CBM76, CBM78, CBM79, and CBM80) cluster with CBM31.