dbCAN-seq: a database of CAZyme sequence and annotation

Help page

Index

CAZy

* Browse page click to see this sample page.

1 Navigation bar

1.1 Download

Introduction:
The download page has a searchable table listing all 5,329 genomes. Each row of the table corresponds to one genome and provides a download link to a compressed tarball. The tarball contains a FASTA sequence file of all the CAZymes in the genome and a tab-separated file with all the annotation and location data. There is also a link to download the data for all genomes at once.

There are four features on this page (red boxes in the picture below):
1. You can enter keywords (GCF ID [NCBI genome assembly ID], species name, or taxid) to search the download table.
2. You can choose how many entries to show.
3. Six properties are shown for each genome, including the fraction of CAZymes in the genome.
4. You can download a single file containing the data for all genomes.

Click a GCF ID to download the tar.gz file. This tarball contains two files:
1. GCF_ID.fasta contains the sequences of all the CAZymes of this GCF_ID.
2. GCF_ID.txt contains the following tab-separated properties of each CAZyme:
(GCF_ID, CAZyme_ID, Product, RefSeq_ID, Start, End, Strand, CAZyme_domains, Molecular_weight, Isoelectric_point, TMHMM_num, LipoP, Predicted_EC, MetaCyc, SignalP_cleavage_site).
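As an illustration of how the annotation file could be read, here is a minimal Python sketch. It assumes the columns appear in exactly the order listed above, that Start and End are integers, and that the file has no header row unless `has_header=True` is passed; none of this is confirmed by the help page. The function name `parse_annotations` and the sample row are invented for the example and are not real dbCAN-seq data.

```python
import csv
import io

# Column order as listed in the help text (assumed to match the file).
COLUMNS = [
    "GCF_ID", "CAZyme_ID", "Product", "RefSeq_ID", "Start", "End", "Strand",
    "CAZyme_domains", "Molecular_weight", "Isoelectric_point", "TMHMM_num",
    "LipoP", "Predicted_EC", "MetaCyc", "SignalP_cleavage_site",
]


def parse_annotations(handle, has_header=False):
    """Yield one dict per CAZyme from a tab-separated annotation file."""
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader):
        if has_header and line_no == 0:
            continue  # skip the header row if the file has one
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"line {line_no + 1}: expected {len(COLUMNS)} fields")
        record = dict(zip(COLUMNS, row))
        # Assumption: genomic coordinates are plain integers.
        record["Start"] = int(record["Start"])
        record["End"] = int(record["End"])
        yield record


# Invented sample row, for illustration only.
sample_row = [
    "GCF_EXAMPLE", "CAZYME_0001", "hypothetical protein", "NC_EXAMPLE",
    "100", "1300", "+", "CE4", "45000.0", "5.5", "0", "CYT", "-", "-", "-",
]
records = list(parse_annotations(io.StringIO("\t".join(sample_row) + "\n")))
gene_length = records[0]["End"] - records[0]["Start"] + 1
```

For a real download, the same function could be given a file opened with `open("GCF_ID.txt", newline="")` after extracting the tarball.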

1.2 CAZyme Gene Cluster

Introduction:
CGCs are defined as genomic regions containing at least one CAZyme gene, one transporter/TC gene (predicted by searching against the TCDB), and one transcription factor/TF gene (predicted by searching against CollecTF, RegulonDB, and DBTBS). The rationale is that CAZymes often work together with each other and with other important genes (e.g. TFs, sugar transporters) to synergistically degrade or synthesize various highly complex carbohydrates.

The CGC page is designed as a tool page. When a user opens the "CAZyme Gene Cluster" page, a table will be seen with all the 5,329 genomes. The table actually has 7,841 rows as one genome can have multiple RefSeq IDs (chromosomes and plasmids).

Clicking on a genome opens a new page (parameter page), where two parameters need to be set: (i) distance and (ii) signature gene classes (the CAZyme, TC, and TF classes described above). After the Calculate button is clicked, the CGC-Finder Python program is called to identify CGCs in the genome. The program runs very fast: processing one genome takes a few milliseconds, and processing all 5,329 genomes together takes < 1 minute.
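To make the two parameters concrete, here is a minimal Python sketch of one way such a cluster search could work. It is not the CGC-Finder source code. It assumes that genes are given in genomic order for a single RefSeq sequence, and that "distance" means the maximum number of non-signature genes allowed between two consecutive signature genes; the function name `find_cgcs` and the parameter names `max_gap` and `required` are hypothetical.

```python
def find_cgcs(genes, max_gap=2, required=("CAZyme", "TC", "TF")):
    """Return clusters of genes that contain every required signature class.

    genes: list of (gene_id, gene_class) tuples in genomic order, where
    gene_class is None for non-signature genes.
    """
    clusters = []

    def close(start, last):
        # Keep the run only if every required class is present in it.
        run = genes[start:last + 1]
        classes = {cls for _, cls in run if cls is not None}
        if set(required) <= classes:
            clusters.append(run)

    start = last = None  # first and most recent signature gene of the run
    for i, (_, cls) in enumerate(genes):
        if cls is None:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            last = i  # close enough: extend the current run
        else:
            close(start, last)
            start = last = i  # too far away: start a new run
    if start is not None:
        close(start, last)
    return clusters


# Invented toy genome, for illustration only.
toy_genes = [
    ("g1", None), ("g2", "CAZyme"), ("g3", None), ("g4", "TC"), ("g5", "TF"),
    ("g6", None), ("g7", None), ("g8", None), ("g9", "CAZyme"),
]
toy_clusters = find_cgcs(toy_genes, max_gap=2)
```

In this toy input, g9 is separated from g5 by three non-signature genes, so it starts a new run that lacks a TC and a TF gene and is discarded. Passing a different `required` tuple corresponds to choosing different signature gene classes on the parameter page.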

The CGC-Finder result will be printed as a table (CGC genome page) below the parameter selection section. Each row in the table is one CGC with different statistics about the CGC, such as the numbers of the three signature genes and the number of all genes (including non-signature ones).

Clicking on the CGC_no opens a separate page (individual CGC page) showing: (i) a graphical representation of the genomic location of the CGC in JBrowse; (ii) at the bottom, a table with detailed information about all genes in the CGC, such as the genomic location, the functional description, whether the gene is a signature gene, and the supporting evidence. All the data tables can be downloaded by clicking on the download link above each table.

1.3 Metadata

Integrated Microbial Genomes (IMG) database

1.3.1 Metadata-property

* Search Engine

2 Search Engine

NP_212393.2

GCF_000005825.2

398511

Bacillus pseudofirmus OF4

CE4

similarity = 20%,1LZL_A

similarity = 20%,COG2072

similarity = 20%,AHF23796.1

3.2.1.8-RXN

3.1.1.23

Our search engine is very fast because we use Sphinx to split the results into pages. Only the results for the page currently being viewed are retrieved, rather than all search results being fetched regardless of which page is displayed.

Take a Cdd_id query (e.g., similarity = 30%, pfam07859) as an example: this is page 2 of the search results for the Cdd_id query pfam07859. Clicking "The Next Page" or "The Previous Page" retrieves the results for that page.
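The per-page retrieval described above can be sketched in Python as follows. This is only an illustration of the offset/limit idea, not the site's Sphinx configuration or code. The names `page_window`, `fetch_page`, and `toy_search`, the page size of 10, and the hit list are all assumptions made for the example.

```python
def page_window(page, per_page=10):
    """Translate a 1-based page number into an (offset, limit) pair."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * per_page, per_page


def fetch_page(search, query, page, per_page=10):
    """Ask the backend only for the hits that belong to the requested page."""
    offset, limit = page_window(page, per_page)
    return search(query, offset, limit)


# Stand-in backend: a real deployment would pass offset and limit to the
# search engine instead of slicing an in-memory list.
ALL_HITS = [f"hit_{i}" for i in range(25)]  # invented data


def toy_search(query, offset, limit):
    return ALL_HITS[offset:offset + limit]


page_two = fetch_page(toy_search, "pfam07859", page=2)
page_three = fetch_page(toy_search, "pfam07859", page=3)
```

Clicking "The Next Page" would then correspond to calling `fetch_page` again with `page + 1`, so the cost of each request depends on the page size and not on the total number of hits.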