logo
logo
help
Help page

You are here: Home > HELP

* Browse page     click to see this sample page.

    The navigation contains Home, Download, CAZyme Gene Cluster, Metadata, Links, Help and About Us.
  • 1.1 Download
  • The CAZyme sequence and annotation data are available in the download page.
    There are three functions in this page (red boxes in below picture):
    1. You can put key words (GCF ID [NCBI genome assembly ID], species name and taxid) to search the download table
    2. You can choose to show different numbers of entries
    3. There are six properties shown for each genome including the fraction of CAZyme in the genome
    Click the GCF ID, you can download the tar.gz file; this tar ball contains two files:
    1. GCF_ID.fasta contiains the all of the CAZymes sequence of this GCF_ID.
    2. GCF_ID.txt contains the following tab-separated properties of each CAZymes:
    (GCF_ID, CAZyme_ID, Product, RefSeq_ID, Start, End, Strand, CAZyme_domains, Molecular_weight, Isoelectric_point, TMHMM_num, LipoP, Predicted_EC, MetaCyc, SignalP_cleavage_site).

  • 1.2 CAZyme Gene Cluster
  • This page shows the CAZyme Gene Clusters (CGCs) that we predicted from all the bacterial genomes. CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter gene (predicted by searching against the TCDB) and one transcription factor/TF gene (predicted by searching against the collectf DB, the RegulonDB, and the DBTBS. The rational is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates.
    The CGC table shows the number of CAZymes, the number of Transporters and the number of TFs. Click on the GCF ID, one can open its CGC page.


  • 1.2.1 CAZyme Gene Cluster - GCF_ID (CGC_GCF)
  • CGC_GCF page contains two parts: (i) the first part is Jbrowser of this CGC, the yellow highlights the location range of this CGC; (ii) the second part is a table showing all the genes in the range above, CGC signature genes are highlighted with yellow color.

    Clicking one gene in the highlighted region of Jbrowser will open a pop window with its primary data, attributes, fasta sequences to download and some subfeatures.
    This function comes with Jbrowse, and is also found in the protein page.


  • 1.3 Metadata
  • Metadata are the macroscopical terms describing different biological properties. Here we downloaded all the bacteria's species metadata from JGI's Integrated Microbial Genomes (IMG) database. We then extracted genomes having at least one metadata property in our dbCAN-seq database. The properties are disease, ecosystem, ecosystem category, ecosystem type, habitat, metabolism, motility, oxygen_requirement, ph, phenotype, salinity, sample body site, sample body subsite, specific ecosystem, temperature range. All these features could be selected from a pull-down menu. When one feature is selected (e.g., disease in the red box), a bar chart will be displayed with y axis showing the different diseases and the bar height showing the number of genomes/GCFs in that disease. The bars are sorted according to height. Clicking one bar will lead to a new page with another bar chart showing the genomes/GCFs in that bar (below).


  • 1.3.1 Metadata-property
  • This bar chart is sorted according to height too. Each bar represents one GCF/genome, with the height proportional to the number of CAZymes found in the genome. Clicking one of them, we will go to the genome page.

* Search Engine    

    We support keyword search of 11different data fields:
    1. CAZyme ID (e.g., NP_212393.2 );
    2. GCF_ID (e.g., GCF_000005825.2);
    3. Tax_ID (e.g., 398511);
    4. Species_Name (e.g.,Bacillus pseudofirmus OF4);
    5. CAZyme_domain (e.g., CE4);
    6. Pdb_hit (e.g., similarity = 20%,1LZL_A);
    7. Swissprot_hit (e.g.,similarity = 20%,ETHA_MYCTU or P9WNF9 or sp|P9WNF9|ETHA_MYCTU);
    8. Cdd_ID (e.g., similarity = 20%,COG2072);
    9. CAZyme_hit (e.g., similarity = 20%,AHF23796.1);
    10. MetaCyc (e.g., 3.2.1.8-RXN);
    11. Predited_EC (e.g., 3.1.1.23);
    From 1 to 5 and 10 to 11, one can type the keyword to search.


    From 6 to 9, one has to specify a sequence identity value in addition to a keyword. For example, if choose to search pdb_hit, on the left an identity value, e.g., 20%, has to be selected, followed by a pdb ID. The result will be a list of proteins in the database that share > 20% identity to the pdb protein.


    Our search Engine is very fast, because we use sphinx to make different page. When that page is on, the search results on that page will be obtained rather than all of the search results obtained no matter the page is on or not.

    Take Cdd_id(e.g.,similarity = 30%,pfam07859) for example, this is page 2 of the search result of cdd_id query pfam07859. When we click "The Next Page/The Previous Page", then we will get related result of this page.

* Browse page     click here to this sample page.

  • 3 Browse by Taxonomy
  • 3.1 
  • This page shows the 31 phyla of bacteria. Choose one of phylum,the six levels of classification (phylum, class, lineage_order, family, genus, species) are shown when choose Taxonomic Group.

  • 3.2  Browse by Family
  • Genome Bar version shows the sorted number of CAZymes of each taxnomy|gcf_id, where all of the numbers are sorted. Choose one of them, we go to the "Browse by Genome" page.

  • 3.3  Genomes Bar
  • Genomes bar shows the sorted number of CAZymes of each taxnomy|gcf_id, where all of the numbers are sorted. Choose one of them, we go to the "Browse by Genome" page.

  • 3.4 metadata
  • If the genome has metadata in our database, the page will show the metadata of all lot of properties.
  • 3.5 Browse by Genome
  • This page has two tabs(Browse by Family, CAZyme list).
    The CAZymes number of each family of this Genome are caculated and shown in the bracket behind each family name.

    CAZyme lists (sample)shows the detailed information of CAZyme.
    1.Red rectangle 1 links to the protein page (sample).
    2.Red rectangle 2 links to the NCBI database(sample).
    3.Red rectangle 2 links to the NCBI taxnomy page(sample).

* Protein page     click here to this sample page.

  • 4 Browse by Family
  • Basic flow when browse by family:
    - Select one family from Home page


    - This is a family with # of proteins < 1000, which is shown as a bar graph


    - This is a family with # of proteins > 1000, which is shown as a table


    - This is the protein list page, showing the protein IDs of this family of this genome


* Protein page     , click here to this sample page.

  • 5.1 Basic Information
  • The basic information descibes the property of CAZyme and its genome. All of them have the outer link to related databse.
  • 5.2 Genomic Context
  • Jbrowse use the GFF3 file to label the gene location of one genome.
  • 5.3 Full sequence
  • We support the full sequence of protein and the download link.
  • 5.4 Enzyme prediction
  • Ensemble Enzyme Prediction Pipeline annotates protein sequences with Enzyme Function classes comprised of full, four-part Enzyme Commission numbers and MetaCyc reaction identifiers.
  • 5.5 CAZyme Signature Domains
  • CAZyme Signature Domains These are the family cazyme domains, the sequence was labled by red rectangles, which are different doamins.
  • 5.6 CDD search
  • RPS-BLASTwas run with full-length CAZyme protein sequences as query and the NCBI CDD database (hyperlink) as the database. CDD is a protein annotation resource that contains well annotated sequence models. E-value < 1e-2 was used to keep the CDD domain match.
  • 5.7 CAZyme Hits
  • We use the Diamond search against DBCAN CAZyme Database instead of blast because of its fast search speed.
  • 5.8 PDB hits
  • Protein Data Bank protein sequence database was downloaded. Diamond was run with full-length proteins as query to search for homologous protein matches. Value <10 (E-value< 1e-5)was used to keep significant match. If there is a significant PDB match, that means the browsed protein has a close homolog with 3D structure solved. Orthologous groups.
  • 5.9 Swissprot hits
  • Swiss-Prot database was downloaded. Diamond was run with full-length proteins as query to search for homologous protein matches. Value <10 (E-value< 1e-5)was used to keep significant match. If there is a significant PDB match, that means the browsed protein has a close homolog with 3D structure solved. Orthologous groups.
  • 5.10 Signalp annotations
  • Swiss-Prot database was downloaded. Signal peptide was predicted using SignalP. Both Gram-negative bacteria and Gram-positive bacteria are selected and their predicted results are offerd on the page(Gram-positve-SP and Gram-negative-SP).
  • 5.11 TMHMM
  • Full-length sequences were taken to run TMHMMto predict the transmembrance regions.
  • 5.12 PPSpred
  • The secondary structure prediction was predicted by PPSpred.
    We use different color(red - coil, green - helix, black - beta) to represent the different location.
  • 5.13 lipoP
  • The lipoproteins prediction was predicted by lipoP.