[Figure 1: dbCAN-sub subfamily HMMdb. Family labels recovered from the figure span AA1 to AA17, CBM1 to CBM88, CE1 to CE19, GH1 to GH171, GT1 to GT114, and PL1 to PL42; some family numbers are absent in every class except AA.]

Unique features of dbCAN-sub: dbCAN-sub is the first comprehensive CAZyme subfamily HMM database (including CBMs) developed to enable substrate annotation for CAZymes. The subfamily HMMdb (Figure 1) is derived from 25,487 CAZyme subfamilies classified by eCAMI (enzyme Classification And Motif Identification), a k-mer based tool that we published in 2020 for classifying enzyme families into subfamilies using a bipartite network algorithm (1). eCAMI was integrated into our popular dbCAN2 meta server in 2021 to replace Hotpep (2), following a recent evaluation of CAZyme annotation tools by an independent group (3). Like CUPP, eCAMI can assign proteins to subfamilies with EC numbers (colored curves in Figure 1). However, both CUPP and eCAMI place high demands on CPU and memory. eCAMI can annotate not only the catalytic enzyme domains but also the carbohydrate-binding module (CBM) domains. A very recent paper found that eCAMI tends to produce more granular subfamilies than CUPP (4), and thus yields a higher percentage of subfamilies with a single EC number, allowing more specific substrate inference.

dbCAN-sub uses HMMs instead of k-mer peptides for subfamily assignment. HMMs offer three advantages: (i) significantly lower memory use; (ii) parallel computing to reduce CPU time; (iii) statistical significance (E-values) and domain positions reported by the HMMER search. In other words, to address the computing cost, we converted each eCAMI subfamily into an HMM built from the dbCAN domain sequence alignment of that subfamily.
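To illustrate advantage (iii), the following is a minimal sketch of reading per-domain hits from a HMMER `--domtblout` table, assuming the search was run with `hmmscan` so that the target column holds the subfamily HMM and the query column holds the protein. The subfamily name `GH5_e1`, the protein IDs, the sample rows, and the `1e-15` cutoff are all hypothetical placeholders, not values taken from dbCAN-sub.

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    subfamily: str    # target name: the subfamily HMM (hmmscan layout)
    protein: str      # query name: the searched protein
    i_evalue: float   # independent (per-domain) E-value
    ali_from: int     # domain start on the protein
    ali_to: int       # domain end on the protein


def parse_domtblout(text: str, max_evalue: float = 1e-15) -> List[DomainHit]:
    """Return domain hits whose independent E-value is <= max_evalue.

    max_evalue is an illustrative cutoff, not a documented dbCAN-sub setting.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete domtblout row
        hit = DomainHit(
            subfamily=fields[0],
            protein=fields[3],
            i_evalue=float(fields[12]),
            ali_from=int(fields[17]),
            ali_to=int(fields[18]),
        )
        if hit.i_evalue <= max_evalue:
            hits.append(hit)
    return hits


# Hypothetical rows in domtblout column order (made up for illustration).
SAMPLE = """\
# target name  accession  tlen  query name  accession  qlen ...
GH5_e1 - 250 protA - 400 1e-50 170.0 0.1 1 1 2e-52 3e-50 168.5 0.1 1 250 30 280 28 282 0.95 -
GH5_e2 - 240 protA - 400 0.3 10.0 0.1 1 1 0.2 0.5 9.5 0.1 5 60 300 350 298 352 0.70 -
"""

hits = parse_domtblout(SAMPLE)
```

Only the first row passes the cutoff here, giving one hit with its E-value and its start and end positions on the protein.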

More importantly, dbCAN-sub enables carbohydrate substrate annotation through a manually curated mapping table that links CAZyme subfamilies, characterized CAZymes, EC numbers, and glycan substrates. We constructed this mapping table by curating the CAZy family webpages that list experimentally characterized proteins (e.g., GH5). Most of these webpages contain external links from EC numbers to the Enzyme database and from characterized protein IDs to the PubMed pages of the biochemical reference papers. In most cases, we could obtain the substrate information for a subfamily by reading the paper abstracts or the EC descriptions associated with the experimentally characterized proteins in that subfamily. For all CBM families and some enzyme families, we were able to extract the substrate information from the CAZy webpages without EC numbers.
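The lookup that such a mapping table supports can be sketched as follows. This is only an illustration under stated assumptions: the row layout (subfamily, EC number, substrate), the subfamily IDs `GH5_e1`, `GH5_e2`, and `CBM_x`, and the rows themselves are hypothetical and are not copied from the curated dbCAN-sub table. The EC-to-substrate pairs use the standard meanings of EC 3.2.1.4 (cellulase) and EC 3.2.1.78 (mannan endo-1,4-beta-mannosidase).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (subfamily, EC number or None, substrate); hypothetical example rows.
Row = Tuple[str, Optional[str], str]

MAPPING_ROWS: List[Row] = [
    ("GH5_e1", "3.2.1.4", "cellulose"),
    ("GH5_e1", "3.2.1.78", "beta-mannan"),
    ("GH5_e2", "3.2.1.4", "cellulose"),
    ("CBM_x", None, "cellulose"),  # CBM rows carry no EC number
]


def build_index(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Group curated substrates by subfamily ID."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for subfamily, _ec, substrate in rows:
        index[subfamily].add(substrate)
    return index


def infer_substrates(subfamily: str, index: Dict[str, Set[str]]) -> List[str]:
    """Return the sorted substrates of a subfamily, or [] if none are curated."""
    return sorted(index.get(subfamily, ()))


index = build_index(MAPPING_ROWS)
```

In this sketch a subfamily with a single EC number resolves to one substrate, while a subfamily with several EC numbers returns all of their substrates.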

We have built a webpage for each CAZyme subfamily that shows users what data the subfamily HMM was built upon: (i) a summary table with various counts of CAZy proteins, including download links to the FASTA sequences; (ii) a substrate table with EC numbers and curated substrates from CAZy webpages and the literature; (iii) a member protein table with all CAZy protein IDs and their subfamily assignments in the CAZy and CUPP databases (if they exist). All these tables, the dbCAN-sub HMMs, the sequence alignments, and the FASTA sequences can be downloaded from the dbCAN-sub website.

Lastly, the dbCAN-sub subfamily HMMdb is integrated into our popular dbCAN2 meta server and the standalone run_dbcan program to allow glycan substrate annotation of user-submitted (meta)genomes.

Future update: We plan to update dbCAN-sub annually as new sequences and families are added to the CAZy database. New subfamilies will be created if the new CAZy sequences are more similar to sequences that eCAMI previously left unclassified, or to each other, than to existing subfamilies. The dbCAN-sub database will be a new addition to our popular dbCAN family tool suite (dbCAN2, dbCAN-seq, dbCAN-PUL, eCAMI), which focuses on CAZyme bioinformatics and carbohydrate metabolism.
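The subfamily-creation rule stated above can be expressed as a small decision function. This is a sketch only: the text does not specify the similarity measure, so the three scores are assumed to be precomputed numbers on a common scale, and the function name, the argument names, and the tie-breaking choice (a tie keeps the existing subfamily) are hypothetical.

```python
def needs_new_subfamily(
    sim_existing: float,
    sim_unclassified: float,
    sim_new: float,
) -> bool:
    """Decide whether a new CAZy sequence should seed a new subfamily.

    sim_existing:     best similarity to any existing subfamily
    sim_unclassified: best similarity to a previously unclassified sequence
    sim_new:          best similarity to another newly added sequence

    All three scores are assumed to be on the same scale.
    """
    return max(sim_unclassified, sim_new) > sim_existing
```

For example, a sequence scoring 0.4 against existing subfamilies and 0.7 against a previously unclassified sequence would be a candidate for a new subfamily under this sketch.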