Database for Polyphenol Utilized Proteins from gut microbiota


Introduction to dbPUP

dbPUP is a database of polyphenol utilization proteins (PUPs) that have been experimentally characterized to metabolize polyphenol substrates. Polyphenols are one of the largest groups of secondary metabolites in plants and are present in a wide range of fruits, vegetables, and cereals, as well as in plant-based food products. In recent years, growing evidence has shown that consumption of polyphenol-rich foods can lower the risk of various human diseases such as cardiovascular diseases, cancers, and metabolic syndromes.

The core of dbPUP consists of 60 proteins that have been biochemically characterized with one or more specific polyphenol substrates. We collected these 60 proteins and their associated metadata by manually curating hundreds of peer-reviewed papers from PubMed and by searching hundreds of polyphenolic substrates in the BRENDA database. We classified these biochemically characterized PUPs into sequence families according to the conserved Pfam domains that they share. We then expanded the PUP families by including sequence homologs from the UniProt database and from the Unified Human Gastrointestinal Protein (UHGP) catalog.

We organized the PUP seeds and homologs in dbPUP into a hierarchical classification: enzyme class (Enzyme database) -> protein family (Pfam domain) -> protein subfamily (sequence similarity network).


The 60 characterized PUP proteins (called seeds) are classified by EC number and functional annotation into 6 classes: five enzyme classes plus an unclassified class (UCs) that holds two proteins. Two further enzyme classes (SRs and TRs) are listed for completeness but currently contain no PUP seeds.

Oxidation/Reduction Reactions (ORs): transfer of H and O atoms or electrons from one substance to another; includes 23 PUP seeds.

Functional Group Transfer Reactions (FRs): transfer of a functional group from one substance to another; includes 7 PUP seeds.

Hydrolysis Reactions (HRs): breaking of a chemical bond to divide a large molecule into two smaller ones; includes 25 PUP seeds.

Non-hydrolytic Cleaving Reactions (NCRs): non-hydrolytic addition of groups to, or removal of groups from, substrates; includes 1 PUP seed.

Isomerization Reactions (IRs): rearrangement of the atoms within a molecule, so that the product has exactly the same atoms as the substrate; includes 2 PUP seeds.

Synthesis Reactions (SRs): joining of two large molecules by forming a new chemical bond; no PUP seeds.

Translocation Reactions (TRs): assisted movement of another molecule, usually across a cell membrane; no PUP seeds.

Unclassified (UCs): used only for experimentally validated proteins without a significant Pfam family; includes 2 PUP seeds.
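
As a quick consistency check, the per-class seed counts listed above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary name and the helper function are our own and are not part of dbPUP.

```python
# Seed counts per class, copied from the list above.
# The variable and function names are illustrative, not part of dbPUP.
SEEDS_PER_CLASS = {
    "OR": 23,
    "FR": 7,
    "HR": 25,
    "NCR": 1,
    "IR": 2,
    "SR": 0,
    "TR": 0,
    "UC": 2,
}


def populated_classes(counts):
    """Return the class abbreviations that contain at least one seed."""
    return [name for name, n in counts.items() if n > 0]


total_seeds = sum(SEEDS_PER_CLASS.values())
print(total_seeds)                         # 60
print(populated_classes(SEEDS_PER_CLASS))  # ['OR', 'FR', 'HR', 'NCR', 'IR', 'UC']
```

The counts add up to the 60 seeds, spread over the 6 classes that currently have members.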


In addition to the 60 PUP seeds, dbPUP also contains PUP sequence homologs from UniProt and UHGP. To identify PUP homologs, we first identified conserved Pfam domains in the 60 PUP seeds. According to the shared Pfam domains, PUP seeds of the same enzyme class were further classified into different families: OR class (9 families), FR class (4 families), HR class (8 families), NCR class (1 family), IR class (2 families), and UC class (2 families).
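
The family assignment described above can be pictured as grouping seeds by enzyme class and by their set of Pfam domains. The sketch below is a minimal illustration under stated assumptions: the seed IDs and domain names are hypothetical placeholders, and treating a domain combination as an unordered set is our simplification, not necessarily the rule used by dbPUP.

```python
from collections import defaultdict


def group_into_families(seeds):
    """Map (enzyme class, Pfam signature) -> list of seed IDs.

    Each seed is a tuple (seed_id, enzyme_class, list_of_pfam_domains).
    Assumption: a domain combination is treated as an unordered set.
    """
    families = defaultdict(list)
    for seed_id, enzyme_class, domains in seeds:
        signature = tuple(sorted(set(domains)))
        families[(enzyme_class, signature)].append(seed_id)
    return dict(families)


# Hypothetical seeds; the IDs and domain names are placeholders.
seeds = [
    ("seed_1", "OR", ["PF_A"]),
    ("seed_2", "OR", ["PF_A"]),
    ("seed_3", "OR", ["PF_B", "PF_C"]),
    ("seed_4", "HR", ["PF_A"]),
]

families = group_into_families(seeds)
for (enzyme_class, signature), members in families.items():
    print(enzyme_class, signature, members)
```

In this toy input, seeds 1 and 2 fall into one OR family, seed 3 forms a second OR family defined by a domain combination, and seed 4 forms an HR family even though it shares a domain with an OR family, because families are defined within an enzyme class.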


As each family has its own signature Pfam domain or domain combination (see each family page, e.g. OR families), PUP sequence homologs were then identified by searching the UniProt (Swiss-Prot and TrEMBL) databases using HMMER, or PSI-BLAST for the two UC families. A phylogenetic tree was constructed with FastTree 2.1.11 for the Swiss-Prot homologs in each family. Because TrEMBL is very large and consists of computationally annotated, unreviewed entries, the TrEMBL homologs were further filtered by PSI-BLAST, using the PUP seeds of the family as queries with a uniform threshold (E-value 0.001 and 5 iterations).
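
The E-value filtering step can be sketched as a small post-processing script. The example below assumes that the search results are available in the standard 12-column BLAST tabular format (`-outfmt 6`) and that a hit is kept when its E-value is at or below the cutoff; both are our assumptions for illustration, and the function name is hypothetical. It is not the pipeline actually used to build dbPUP.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=1e-3):
    """Return the set of subject IDs with a hit at or below max_evalue.

    Assumption: inclusive cutoff (<=). Lines that do not have exactly
    12 tab-separated fields (blank lines, comments, status messages)
    are skipped.
    """
    kept = set()
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if len(row) != len(BLAST_COLUMNS):
            continue
        record = dict(zip(BLAST_COLUMNS, row))
        if float(record["evalue"]) <= max_evalue:
            kept.add(record["sseqid"])
    return kept


# Two made-up hits: one strong, one weak.
example = (
    "seed1\thomA\t45.0\t200\t110\t0\t1\t200\t1\t200\t1e-50\t180\n"
    "seed1\thomB\t25.0\t80\t60\t0\t1\t80\t1\t80\t0.5\t30\n"
)
print(sorted(filter_hits(example)))  # ['homA']
```

Only `homA` passes, because the E-value of `homB` (0.5) is above the 0.001 cutoff.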

We have followed the same procedure to collect PUP homologs from the UHGP database.


To further classify PUP homologs from UniProt, a sequence similarity network (SSN) analysis was performed for each family, starting from an all-versus-all BLASTP search. The BLASTP result was used as input for Cytoscape to group PUP seeds and homologs into clusters based on a given similarity threshold. The resulting networks were visualized in Cytoscape using the yFiles organic layout. Clusters with 10 or more sequences were considered subfamilies.
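
The clustering idea behind the SSN step can be illustrated without Cytoscape. The sketch below keeps BLASTP hits that pass a threshold as network edges, finds connected components with a union-find structure, and reports components with at least 10 members. It is a stand-in under stated assumptions: the E-value threshold of 1e-30 is a made-up example (the threshold used for dbPUP is not stated here), and plain connected components may differ from the clusters obtained in Cytoscape.

```python
from collections import defaultdict


def build_edges(hits, max_evalue):
    """Keep (query, subject) pairs whose E-value passes the threshold."""
    return [(q, s) for q, s, e in hits if q != s and e <= max_evalue]


def ssn_clusters(nodes, edges, min_size=10):
    """Connected components via union-find, keeping those with >= min_size members."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return [g for g in groups.values() if len(g) >= min_size]


# Toy network: a chain of 12 sequences, a trio, and one weak link between them.
hits = [(f"a{i}", f"a{i + 1}", 1e-50) for i in range(11)]
hits += [("b0", "b1", 1e-40), ("b1", "b2", 1e-40)]
hits += [("a11", "b0", 1.0)]  # too weak, removed by the threshold

nodes = {name for q, s, _ in hits for name in (q, s)}
edges = build_edges(hits, max_evalue=1e-30)  # hypothetical threshold
subfamilies = ssn_clusters(nodes, edges, min_size=10)
print(len(subfamilies), len(subfamilies[0]))  # 1 12
```

The 12-member chain qualifies as a subfamily, while the trio stays below the 10-sequence minimum and is not reported.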

Physically linked PUP gene clusters (PGCs)

The UHGP protein data are derived from the Unified Human Gastrointestinal Genomes (UHGG) collection, which contains over 200,000 nonredundant genomes from the human gut microbiome. By locating the UHGP homologs of PUPs in these genomes, we identified physically linked PUP gene clusters (PGCs) in the gut microbiome, which are potentially involved in polyphenol utilization in the human gut. The concept of PGCs is the same as that of polysaccharide utilization loci (PULs) or CAZyme gene clusters (CGCs) for carbohydrate utilization. The hypothesis is that, for more efficient polyphenol utilization, PUP-encoding genes might be clustered with each other and with other genes in microbial genomes, forming operons or physically linked gene clusters for coordinated gene expression.

The Pfam search yielded a nonredundant catalog of 51,157 UHGP sequence homologs of PUP seeds, which is presented in tables (e.g., uhgp/Africa) together with the UHGP metadata. Using a simple algorithm, described on the UHGP page, to locate the PUP homologs in the genomes, we identified a total of 1,074 PGCs in the UHGG catalog, with PGC sizes ranging from 2 kb to 11 kb.
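
The algorithm itself is described on the UHGP page and is not reproduced here. Purely to illustrate the general idea of physically linked genes, the sketch below groups PUP homologs that lie on the same contig and are separated by at most a few intervening genes. Every detail is an assumption made for illustration: the `Gene` record, the `max_spacer` and `min_pups` parameters, and their default values are hypothetical and should not be read as the dbPUP method.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Gene:
    contig: str
    index: int     # ordinal position of the gene along the contig
    is_pup: bool   # True if the gene product is a PUP homolog


def find_clusters(genes, max_spacer=2, min_pups=2):
    """Group PUP genes on one contig separated by <= max_spacer other genes.

    Hypothetical rule for illustration only; not the dbPUP algorithm.
    """
    clusters = []
    ordered = sorted(genes, key=lambda g: (g.contig, g.index))
    for _, contig_genes in groupby(ordered, key=lambda g: g.contig):
        pups = [g for g in contig_genes if g.is_pup]
        current = []
        for gene in pups:
            # Close the running cluster if too many genes lie in between.
            if current and gene.index - current[-1].index - 1 > max_spacer:
                if len(current) >= min_pups:
                    clusters.append(current)
                current = []
            current.append(gene)
        if len(current) >= min_pups:
            clusters.append(current)
    return clusters


# Toy genome: contig c1 has PUP homologs at positions 1, 2, 4 and 9;
# contig c2 has a single PUP homolog.
genes = [Gene("c1", i, i in {1, 2, 4, 9}) for i in range(10)]
genes += [Gene("c2", i, i == 0) for i in range(3)]

clusters = find_clusters(genes)
print([[g.index for g in c] for c in clusters])  # [[1, 2, 4]]
```

In the toy input, the genes at positions 1, 2 and 4 form one cluster, while the isolated PUP homologs at position 9 of `c1` and on `c2` are not reported.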

Copyright © 2022 University of Nebraska-Lincoln | Dr. Yin's Lab at UNL
Handcoded by Pengxiang Zhang & Yinchao He.