
Release 1.5
(2010-04-21)
Help and Documentation
Documentation
• Getting started
• Advanced queries
• Core databases
• Modules
FAQ
• What is and what is not PROFESS?
• Who will be interested in PROFESS?
• How to query PROFESS?
• How to download the data?
• More...
Database Statistics
Function
| • eggNOG Clusters | 224,847 |
| • Enzyme Classes | 5,093 |
| • GO Terms | 27,606 |
| • Ligands | 9,330 |
| • PFAM | 10,340 |
| • Protein Interactions | 67,113 |
Evolution
| • Essential Genes | 6,099 |
Structure
| • CATH Classes | 2,178 |
| • Protein Structures | 56,699 |
| • Structure Comparisons | 401,967 |
Sequence
| • Protein Sequences | 10,891,633 |
Disease
| • Pancreatic cancer | 2,013 |
Other
| • Taxonomy | 558,282 |
Development Team
Contact: Thomas Triplet
Database Design and Programming
• Peter Revesz
• Thomas Triplet
Genomics
• Mark A. Griep
• Robert Powers
• Matthew D. Shortridge
• Jaime Stark
Getting Started | Adv. Queries | Core Databases | Modules
Like most search engines, PROFESS lets the user type keywords to retrieve data. However, the PROFESSor goes well beyond this simple approach.
Just type a keyword, selenocysteine for example. By default, the PROFESSor will suggest entries from all core databases. Suggestions help the user refine the query, but selecting one is optional.
For example, when searching for selenocysteine, the PROFESSor will retrieve COG clusters, Enzyme Classes, ligands, and terms from the Gene Ontology.


You may also restrict a query to a particular database of your choice by prefixing the keyword with [KEY], where KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):
ALL, COG, EC, GO, LIGAND, PDB, PFAM

The key ALL allows the user to search in all core databases. It is implicitly used when no key is given. Hence, it will mainly be useful when using multiple keywords (see below).
For example, the query [COG] selenocysteine will retrieve COG clusters related to selenocysteines.

Alternatively, you can type the ID commonly used to refer to an item. For example, you can type the PDB_ID of a protein structure or an E.C. number. When searching for a ligand, you can also type the formula of the compound.

Note that a search for cysteine, for example, will not return selenocysteine entries unless no cysteine entry can be found.
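One way to read the matching rule above is "exact word matches first, partial matches only as a fallback". The sketch below illustrates that reading on a small in-memory list. The function name `search` and the sample entries are hypothetical, and this is not a description of how PROFESS itself is implemented.

```python
def search(keyword, entries):
    """Return entries containing the keyword as a whole word.

    If no entry contains it as a whole word, fall back to entries
    that merely contain the keyword as a substring.
    """
    kw = keyword.lower()
    exact = [e for e in entries if kw in e.lower().split()]
    if exact:
        return exact
    return [e for e in entries if kw in e.lower()]


# Hypothetical entry descriptions, for illustration only.
entries = ["cysteine synthase", "selenocysteine lyase"]

print(search("cysteine", entries))                   # ['cysteine synthase']
print(search("cysteine", ["selenocysteine lyase"]))  # ['selenocysteine lyase']
```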
PROFESS may also be queried using several keywords from several databases combined with boolean logic. For examples of advanced queries, see below.
Using regular-expression notation, the general syntax for queries is defined as:
([OR]{0,1} [KEY]{0,1} keywords ([OR] keywords)*)+
KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):
ALL, COG, EC, GO, LIGAND, PDB, PFAM
By default, all keywords after a [KEY] are treated as a single string for the query. This behavior can be altered by prefixing the keywords with [OR]. The wildcard characters % (any number n of characters, with n ≥ 0) and _ (exactly one character) may be used in a query. A logical AND is performed between different keys.
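The two wildcards behave like the % and _ wildcards of SQL LIKE patterns. As a rough illustration, the sketch below translates such a pattern into a Python regular expression. The helper name `wildcard_to_regex` is hypothetical, and the sketch assumes the pattern must match the whole value, which may differ from how PROFESS applies wildcards internally.

```python
import re


def wildcard_to_regex(pattern):
    """Translate % and _ wildcards into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any number of characters, including none
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return re.compile("".join(parts))


ec_class_4 = wildcard_to_regex("4.%")
print(bool(ec_class_4.fullmatch("4.2.1.1")))  # True
print(bool(ec_class_4.fullmatch("14.2.1.1")))  # False
```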

Note that, in general, the following 3 queries will return different results:
| • [COG] lyase primase | returns clusters matching the string "lyase primase", |
| • [COG] lyase [OR] primase | returns clusters matching either the string "lyase" or the string "primase", |
| • [COG] lyase [COG] primase | returns clusters matching both strings "lyase" and "primase". |
| • [COG] lyase | Returns clusters matching the string "lyase", |
| • [COG] 15 [OR] 520 [OR] 369 | Returns COG clusters 15, 520 and 369, |
| • [COG] 15 [COG] 520 [COG] 369 | Returns an empty set because no COG cluster number can be 15, 520 and 369 at the same time, |
| • [EC] selenocysteine [OR] selenium | Returns COG clusters such that one or more proteins is related to "selenocysteine" or "selenium" in the Enzyme Classification, |
| • [COG] lyase [EC] selenocysteine [OR] selenium | Returns COG clusters containing "lyase" proteins and such that one or more proteins is related to "selenocysteine" or "selenium" in the Enzyme Classification, |
| • [EC] 4.% [LIGAND] C12 H17 | Returns COG clusters with one or more proteins with an E.C. number starting with "4." and with one or more proteins that bind a ligand containing "C12 H17", that is, ligands with 12 atoms of carbon and 17 atoms of hydrogen. |
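To make the grouping rules in these examples concrete, here is a minimal sketch of a parser for the query syntax. It splits a query into groups that are combined with AND, where each group holds one or more alternatives combined with OR. The function `parse_query` and its return shape are hypothetical and not part of PROFESS. An [OR] placed directly before a [KEY] is simply skipped here, because the meaning of that form is not spelled out above.

```python
import re

KEYS = {"ALL", "COG", "EC", "GO", "LIGAND", "PDB", "PFAM"}
TAG = re.compile(r"\[([A-Za-z]+)\]")


def parse_query(query):
    """Parse a query into a list of (key, alternatives) groups.

    Groups are meant to be ANDed together; the alternatives inside
    one group are meant to be ORed.
    """
    # With a capturing group, split() yields: text, tag, text, tag, text, ...
    parts = TAG.split(query)
    groups = []
    leading = parts[0].strip()
    if leading:
        groups.append(("ALL", [leading]))  # no key given: implicit ALL
    for tag, text in zip(parts[1::2], parts[2::2]):
        tag, text = tag.upper(), text.strip()
        if tag == "OR":
            if text:
                if not groups:
                    groups.append(("ALL", []))
                groups[-1][1].append(text)  # extend the current group
        elif tag in KEYS:
            groups.append((tag, [text] if text else []))
        else:
            raise ValueError(f"unknown key: [{tag}]")
    return groups


print(parse_query("[COG] lyase primase"))
# [('COG', ['lyase primase'])]
print(parse_query("[COG] lyase [OR] primase"))
# [('COG', ['lyase', 'primase'])]
print(parse_query("[COG] lyase [COG] primase"))
# [('COG', ['lyase']), ('COG', ['primase'])]
```

The three printed results mirror the three queries that return different results: one string, two alternatives in one group, and two groups that must both match.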
Coming soon...
The following databases are integrated in PROFESS. This list will grow based on user feedback.

Note that the last update and version indicate the last update of the database in PROFESS. Although we will frequently update PROFESS, it may not correspond to the latest version of the database.
CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
Last update: | 04/24/2009 |
Version: | 3.2.0 |
Website: |
Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
Last update: | 09/11/2003 |
Website: |
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e., derived from the original COG/KOG categories).
Last update: | 11/09/2009 |
Version: | 2.0 |
Website: |
Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the reactions they catalyse.
Last update: | 05/03/2009 |
Version: | 13 |
Website: |
Essential genes are those indispensable for the survival of an organism, and therefore are considered a foundation of life. DEG hosts records of currently available essential genes among a wide range of organisms.
Last update: | 05/03/2009 |
Version: | 5.2 |
Website: |
The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually by expert curators and automatically using computational approaches that utilize the knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.
Last update: | 10/14/2008 |
Website: |
The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.
Last update: | 05/03/2009 |
Website: |
KEGG LIGAND contains our knowledge on the universe of chemical substances and reactions that are relevant to life. It is a composite database consisting of COMPOUND, DRUG, GLYCAN, REACTION, RPAIR, and ENZYME databases, whose entries are identified by C, D, G, R, RP, and EC numbers, respectively.
Last update: | 01/09/2009 |
Website: |
The PCOD is a manually curated database of proteins from various proteomics and genomics studies that are potentially associated with pancreatic cancer.
Last update: | 04/22/2010 |
Proteomics Sources: | Yamada, et al.; Journal of Proteomics & Bioinformatics (2009) |
Genomics Sources: |
The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans.
Last update: | 04/28/2009 |
Website: |
The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
Last update: | 07/01/2008 |
Version: | 23.0 |
Website: |
Protein–protein interactions play key roles in protein function and the structural organization of a cell. A thorough description of these interactions should facilitate elucidation of cellular activities, targeted-drug design, and whole cell engineering. A large-scale comprehensive pull-down assay was performed using a His-tagged Escherichia coli ORF clone library.
Last update: | 01/27/2006 |
Website: |
The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
Due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages.
Last update: | 06/01/2009 |
Website: |
Organisms are classified in a hierarchical tree structure. The UniProtKB-Taxonomy database contains every node (taxon) of the tree.
Last update: | 04/12/2009 |
Website: |
This module aggregates and displays in a single table the structures from the Protein Data Bank within the current eggNOG cluster, along with taxonomic data from UniProtKB and associated functions from the Enzyme Classification, the Gene Ontology, KEGG-Ligands, and PFAM. The table may be downloaded in CSV format.
Since: | v1.0 |
Data Sources: |
This module displays details about ligands that bind a protein belonging to the NOG cluster. Buffers, detergents, ions and solvents are listed separately to give the user quicker access to the most relevant ligands. The table may be downloaded in CSV format.
Since: | v1.0 |
Data Sources: |
This module aggregates protein interactions from different sources within the current eggNOG cluster.
Since: | v1.0 |
Data Sources: |
This module aggregates and generates statistics using data from the Enzyme Classification, the Gene Ontology, and PFAM. For each of the three classifications, we compute the number of proteins within each class (within the current COG cluster) and represent the protein distribution as pie charts. This allows the user to quickly differentiate relevant classes from outliers. Classes are sorted by decreasing number of proteins. The darker the color in the pie chart, the higher the number of proteins.
The aggregate data may be downloaded as:
• CSV format, by clicking on the icon in the title for each of the three classifications.
• PNG images in high-definition (1000x300), by clicking on the thumbnail.
Since: | v1.0 |
Data Sources: |
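The per-class counts behind the pie charts described above can be sketched as follows. This is only an illustration of the counting and sorting steps. The function name `class_distribution`, the input shape (protein ID mapped to a list of class labels), and the sample IDs are all hypothetical.

```python
from collections import Counter


def class_distribution(protein_classes):
    """Count proteins per class and sort by decreasing number of proteins."""
    counts = Counter(
        cls for classes in protein_classes.values() for cls in classes
    )
    # Sort by decreasing count; break ties by class label for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical proteins of one cluster with their E.C. numbers.
cluster = {
    "protein_1": ["4.1.1.1"],
    "protein_2": ["4.1.1.1"],
    "protein_3": ["2.7.7.7"],
}
print(class_distribution(cluster))  # [('4.1.1.1', 2), ('2.7.7.7', 1)]
```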
This module shows Essential Genes in E. coli from DEG within the current COG cluster.
Genes are displayed with corresponding protein structures from the PDB (see the Sequence Similarities module for more details about the gene/structure association).
Since: | v1.0 |
Data Sources: |
This module shows the unrooted phylogenetic tree generated using protein chain sequences from the PDB.
First, the sequences were aligned using ClustalW2.
Second, the tree was computed by ClustalW2 using the multiple sequence alignment as a guide.
The final image was generated using DrawTree from the PHYLIP package.
The phylogenetic tree may be downloaded as:
• PHYLIP format, by clicking on the icon in the title of the module.
• PNG images in high-definition (3000*3000), by clicking on the thumbnail.
Since: | v1.0 |
Data Sources: |
This module shows the unrooted phylogenetic tree generated using protein structures from the PDB.
First, the structures were aligned using MAMMOTH-mult, a tool for multiple structure alignments.
Second, the tree was computed by ClustalW2 using the multiple structure alignment as a guide.
The final image was generated using DrawTree from the PHYLIP package.
The phylogenetic tree may be downloaded in PHYLIP format (by clicking on the icon in the title) or as PNG images in high-definition (3000*3000) by clicking on the thumbnail.
Since: | v1.1 |
Data Sources: |
This module aggregates in a single table data from CATH, the Protein Data Bank, SCOP and the UniProtKB Taxonomy. Note that due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages. The table may be downloaded in CSV format.
Since: | v1.0 |
Data Sources: |
The pairwise structure comparison tool DaliLite was used to measure the backbone structure similarity of proteins within each orthologous cluster defined by the eggNOG database. All-against-all pairwise structural comparisons within and between the Proteobacteria and Firmicutes were carried out for all COGs where both phyla were represented by a minimum of two organisms. The Dali Z-scores were normalized to calculate a Fractional Structure Similarity (FSS) score:
$\mathrm{FSS} = Z_{AB} / Z_{AA}$
where $Z_{AB}$ is the Dali Z-score when protein B is compared to protein A and $Z_{AA}$ is the Z-score when protein A is compared to itself. Thus, $Z_{AA}$ represents the maximum Z-score that can be achieved for perfect similarity. FSS provides a simple quantitative measure of how far the two proteins have diverged in their structures.
Since: | v1.0 |
Data Sources: |
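The normalization of Dali Z-scores into a Fractional Structure Similarity score described above amounts to a single division. The sketch below shows it with made-up Z-scores. The function name and the guard against a non-positive self-comparison score are assumptions added for illustration.

```python
def fractional_structure_similarity(z_ab, z_aa):
    """Normalize the Z-score of B against A by A's self-comparison Z-score."""
    if z_aa <= 0:
        raise ValueError("self-comparison Z-score must be positive")
    return z_ab / z_aa


# Made-up Z-scores: B scores 15.0 against A, A scores 30.0 against itself.
print(fractional_structure_similarity(15.0, 30.0))  # 0.5
```

A protein compared with itself gives a score of 1, and lower scores indicate greater structural divergence.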
This module lists protein chain sequences within the eggNOG cluster. Sequences may be downloaded in FASTA format.
Since: | v1.0 |
Data Sources: | eggNOG, TAXON, PDB, SWISS-PROT, TrEMBL |
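Downloaded FASTA files can be read with a few lines of code. The sketch below is a minimal reader for the standard FASTA layout, where each record starts with a ">" header line followed by one or more sequence lines. The function name `read_fasta` and the sample records are hypothetical, and the exact header layout used in PROFESS downloads is not assumed.

```python
def read_fasta(lines):
    """Return a dict mapping each FASTA header to its full sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(chunks) for h, chunks in records.items()}


# Made-up records, for illustration only.
sample = [">chain_A", "MKTAYIAK", "QRQISFVK", ">chain_B", "MSLNE"]
print(read_fasta(sample))
# {'chain_A': 'MKTAYIAKQRQISFVK', 'chain_B': 'MSLNE'}
```

To read a downloaded file, pass an open file object, for example `read_fasta(open(path))`, where `path` is wherever the file was saved.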