PROFESS

Release 1.5
(2013-08-25)







Database Statistics

Function

eggNOG Clusters 224,847
Enzyme Classes 5,093
GO Terms 27,606
Ligands 9,330
PFAM 10,340
Protein Interactions 67,113

Evolution

Essential Genes 6,099

Structure

CATH Classes 2,178
Protein Structures 56,699
Structure Comparisons 401,967

Sequence

Protein Sequences 10,891,633

Disease

Pancreatic cancer 2,013

Other

Taxonomy 558,282
Last update: 2013-08-25

Development Team

Contact:    Thomas Triplet


Database Design and Programming
Peter Revesz
Thomas Triplet


Genomics
Mark A. Griep
Robert Powers
Matthew D. Shortridge
Jaime Stark

Documentation







Getting Started


With the queries


As with most search engines, the user types keywords to retrieve data. However, the PROFESSor goes well beyond this simple approach.

Just type a keyword, for example selenocysteine. By default, the PROFESSor suggests entries from all core databases. Suggestions help the user refine the query, but selecting one is optional.

For example, when searching for selenocysteine, the PROFESSor will retrieve COG clusters, Enzyme Classes, ligands, and terms from the Gene Ontology.




With the Interface





Advanced Queries


Restricting Search to a Given Database


You may also restrict a query to a particular database of your choice by prefixing the keyword with [KEY], where KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):

ALL, COG, EC, GO, LIGAND, PDB, PFAM


The key ALL allows the user to search in all core databases. It is implicitly used when no key is given. Hence, it will mainly be useful when using multiple keywords (see below).


For example, the query [COG] selenocysteine will retrieve COG clusters related to selenocysteines.



Alternatively, you can type the ID commonly used to refer to an item. For example, you can type the PDB_ID of a protein structure or an E.C. number. When searching for a ligand, you can also type the formula of the compound.

Note that a search for cysteine, for example, will not return selenocysteine entries unless no cysteine entry can be found.
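
One way to picture this behaviour is a two-pass lookup: whole-word matches first, then substring matches only as a fallback. The sketch below is illustrative only. The function name `search_names` and the whole-word rule are assumptions, not a description of how PROFESS is implemented.

```python
import re


def search_names(keyword: str, names: list[str]) -> list[str]:
    """Return whole-word matches, or substring matches if there are none."""
    # Assumption: a "direct" hit means the keyword appears as a whole word.
    word = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    exact = [name for name in names if word.search(name)]
    if exact:
        return exact
    # Fallback: accept the keyword anywhere, even inside a longer word.
    lowered = keyword.lower()
    return [name for name in names if lowered in name.lower()]
```

With the made-up entries "cysteine synthase" and "selenocysteine lyase", a search for cysteine returns only the first. If the first entry is absent, the selenocysteine entry is returned instead.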



Multiple Search


PROFESS may also be queried with multiple keywords from several databases, combined using boolean logic. For examples of advanced queries, see below.

Using regular-expression notation, the general syntax for queries is defined as:

([OR]{0,1} [KEY]{0,1} keywords ([OR] keywords)*)+


KEY depends on the database and may be one of the following (note that this list will grow with the number of core databases):

ALL, COG, EC, GO, LIGAND, PDB, PFAM


By default, all keywords after a [KEY] are considered as a unique string for the query. This behavior can be altered by prefixing the keywords with [OR]. The wildcard characters % (any number n of characters, with n ≥ 0) and _ (exactly one character) may be used in a query. A logical AND is performed between different keys.
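
The two wildcards behave like those of SQL LIKE patterns. As a minimal sketch of their semantics, the hypothetical helper below translates a pattern into a Python regular expression. It assumes case-insensitive matching against the whole value, which may differ from what PROFESS actually does.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern using % and _ into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any number of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def matches(pattern: str, value: str) -> bool:
    """Assumption: the whole value must match, as with SQL LIKE."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None
```

Under these assumptions, `matches("4.%", "4.1.1.1")` is true, while `matches("4.%", "14.1")` is false because the value does not start with "4.".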


Note that, in general, the following 3 queries will return different results:

[COG] lyase primase
Returns clusters matching the string "lyase primase",

[COG] lyase [OR] primase
Returns clusters matching either the string "lyase" or the string "primase",

[COG] lyase [COG] primase
Returns clusters matching both strings "lyase" and "primase".
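
The difference between the three queries above becomes clearer once each query is split into groups: one group per [KEY], alternatives inside a group separated by [OR], and a logical AND between groups. The parser below is a sketch under those rules only. The name `parse_query` is hypothetical, and an [OR] that directly precedes a [KEY] is ignored because its meaning is not spelled out here.

```python
import re

KEYS = {"ALL", "COG", "EC", "GO", "LIGAND", "PDB", "PFAM"}


def parse_query(query: str) -> list[tuple[str, list[str]]]:
    """Split a query into (key, alternatives) groups.

    Groups are meant to be combined with AND; the alternatives
    inside one group are meant to be combined with OR.
    """
    # With a capturing group, re.split yields: text, tag, text, tag, text, ...
    tokens = re.split(r"\[(\w+)\]", query)
    groups: list[tuple[str, list[str]]] = []

    leading = tokens[0].strip()
    if leading:
        groups.append(("ALL", [leading]))  # ALL is implicit when no key is given

    for tag, text in zip(tokens[1::2], tokens[2::2]):
        tag, text = tag.upper(), text.strip()
        if tag == "OR":
            if not groups:
                groups.append(("ALL", []))
            if text:
                groups[-1][1].append(text)  # one more alternative in this group
        elif tag in KEYS:
            groups.append((tag, [text] if text else []))  # start a new group
        else:
            raise ValueError(f"unknown key: [{tag}]")
    return groups
```

For example, `parse_query("[COG] lyase [OR] primase")` yields a single COG group with two alternatives, whereas `parse_query("[COG] lyase [COG] primase")` yields two COG groups that must both match.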



Examples of queries


[COG] lyase
Returns clusters matching the string "lyase",
 
[COG] 15 [OR] 520 [OR] 369
Returns COG clusters 15, 520 and 369,
 
[COG] 15 [COG] 520 [COG] 369
Returns an empty set because no COG cluster number can be 15, 520 and 369 at the same time,
 
[EC] selenocysteine [OR] selenium
Returns COG clusters in which at least one protein is related to "selenocysteine" or "selenium" in the Enzyme Classification,
 
[COG] lyase [EC] selenocysteine [OR] selenium
Returns COG clusters containing "lyase" proteins and in which at least one protein is related to "selenocysteine" or "selenium" in the Enzyme Classification,
 
[EC] 4.% [LIGAND] C12 H17
Returns COG clusters with one or more proteins with an E.C. number starting with "4." and with one or more proteins that bind a ligand containing "C12 H17", that is, ligands with 12 atoms of carbon and 17 atoms of hydrogen.



Query Assistant


Coming soon...





Core Databases


The following databases are integrated in PROFESS. This list will grow based on user feedback.


Note that the last update and version refer to the copy of the database integrated in PROFESS. Although we update PROFESS frequently, this copy may not correspond to the latest version of the database.



CATH Database


CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
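
A CATH classification is conventionally written as four dot-separated numbers, one per level (for example 3.40.50.300). The sketch below assumes that notation. The `CathCode` type and both helper functions are hypothetical names used for illustration and are not part of PROFESS.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int        # Class (C)
    architecture: int  # Architecture (A)
    topology: int      # Topology (T)
    superfamily: int   # Homologous superfamily (H)


def parse_cath_code(code: str) -> CathCode:
    """Parse a dotted four-level code such as '3.40.50.300'."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels (C.A.T.H), got: {code!r}")
    return CathCode(*(int(part) for part in parts))


def same_level(a: CathCode, b: CathCode, depth: int) -> bool:
    """True if two codes agree on the first `depth` levels (1 = C ... 4 = H)."""
    return a[:depth] == b[:depth]
```

Two domains that share the first three numbers have the same topology but may belong to different homologous superfamilies.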

Last update: 

04/24/2009

Version: 

3.2.0

Website: 

http://www.cathdb.info/



Clusters of Orthologous Groups (COG) of proteins database (included in eggNOG)


Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Last update: 

09/11/2003

Website: 

http://www.ncbi.nlm.nih.gov/COG/



Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (eggNOG) of proteins database


eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e., those derived from the original COG/KOG categories).

Last update: 

11/09/2009

Version: 

2.0

Website: 

http://eggnog.embl.de/



Enzyme Classification


Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the reactions they catalyse.

Last update: 

05/03/2009

Version: 

13

Website: 

http://www.chem.qmul.ac.uk/iubmb/enzyme/



Database of Essential Genes


Essential genes are those indispensable for the survival of an organism, and therefore are considered a foundation of life. DEG hosts records of currently available essential genes among a wide range of organisms.

Last update: 

05/03/2009

Version: 

5.2

Website: 

http://www.essentialgene.org/



Database of Interacting Proteins


The DIP database catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually, by expert curators, and automatically, using computational approaches that utilize the knowledge about the protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.

Last update: 

10/14/2008

Website: 

http://dip.doe-mbi.ucla.edu/dip/Main.cgi



Gene Ontology


The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

Last update: 

05/03/2009

Website: 

http://www.geneontology.org/



Kyoto Encyclopedia of Genes and Genomes (KEGG) - Ligands


KEGG LIGAND contains our knowledge on the universe of chemical substances and reactions that are relevant to life. It is a composite database consisting of COMPOUND, DRUG, GLYCAN, REACTION, RPAIR, and ENZYME databases, whose entries are identified by C, D, G, R, RP, and EC numbers, respectively.

Last update: 

01/09/2009

Website: 

http://www.genome.jp/kegg/ligand.html



Pancreatic Cell 'omics' Data (PCOD)


The PCOD is a manually curated database of proteins from various proteomics and genomics studies that are potentially associated with pancreatic cancer.

Last update: 

04/22/2010

Proteomics Sources: 

Yamada, et al.; Journal of Proteomics & Bioinformatics (2009)

 

Crnogorac-Jurcevic, et al.; Gastroenterology (2005)

 

Chen, et al.; Gastroenterology (2005)

 

Grutzmann, et al.; Oncogene (2005)

 

Shen, et al.; Cancer Research (2004)

Genomics Sources: 

Jones, et al.; Science (2008)



Protein DataBank (PDB)


The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans.

Last update: 

04/28/2009

Website: 

http://www.rcsb.org



Protein Families (PFAM) database


The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

Last update: 

07/01/2008

Version: 

23.0

Website: 

http://pfam.sanger.ac.uk/



Protein/Protein interactions in E. coli


Protein–protein interactions play key roles in protein function and the structural organization of a cell. A thorough description of these interactions should facilitate elucidation of cellular activities, targeted-drug design, and whole cell engineering. A large-scale comprehensive pull-down assay was performed using a His-tagged Escherichia coli ORF clone library.

Last update: 

01/27/2006

Website: 

http://genome.cshlp.org/content/16/5/686.abstract



Structural Classification of Proteins (SCOP)


The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known. As such, it provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.
Due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages.

Last update: 

06/01/2009

Website: 

http://scop.mrc-lmb.cam.ac.uk/scop/index.html



UniProt Knowledgebase - Taxonomy


Organisms are classified in a hierarchical tree structure. The UniProtKB-Taxonomy database contains every node (taxon) of the tree.

Last update: 

04/12/2009

Website: 

http://www.uniprot.org/taxonomy/





Modules for Protein Functions


Functions


This module aggregates and displays, in a single table, structures from the Protein Data Bank within the current eggNOG cluster, along with taxonomic data from UniProtKB and associated functions from the Enzyme Classification, the Gene Ontology, KEGG-Ligands, and PFAM. The table may be downloaded in CSV format.

Since: 

v1.0

Data Sources: 

eggNOG, EC, GO, KEGG-LIGAND, TAXON, PDB, PFAM



Ligands


This module displays details about ligands that bind a protein belonging to the NOG cluster. Buffers, detergents, ions and solvents are listed separately to give the user quicker access to the most relevant ligands. The table may be downloaded in CSV format.

Since: 

v1.0

Data Sources: 

eggNOG, KEGG-LIGAND, PDB



Protein Interactions


This module aggregates protein interactions from different sources within the current eggNOG cluster.

Since: 

v1.0

Data Sources: 

DIP, eggNOG, PIN



Summary


This module aggregates and generates statistics using data from the Enzyme Classification, the Gene Ontology, and PFAM. For each of the three classifications, we compute the number of proteins within each class (within the current COG cluster) and represent the protein distribution as pie charts. This allows the user to quickly differentiate relevant classes from outliers. Classes are sorted by decreasing number of proteins. The darker the color in the pie chart, the higher the number of proteins.

The aggregate data may be downloaded as:
     • CSV format, by clicking on DL in the title for each of the three classifications.
     • PNG images in high-definition (1000x300), by clicking on the thumbnail.
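
The per-class counting and sorting described for this module can be sketched as follows. This is an illustration only: the function name `class_distribution` and the input layout (pairs of protein ID and class label) are assumptions, not the format PROFESS uses internally.

```python
from collections import Counter


def class_distribution(assignments):
    """Count proteins per class and sort classes by decreasing count.

    `assignments` is an iterable of (protein_id, class_label) pairs.
    A protein listed twice under the same class is counted once.
    """
    unique_pairs = set(assignments)
    counts = Counter(label for _, label in unique_pairs)
    # Largest classes first; ties are broken by label for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Classes at the top of the returned list correspond to the largest pie-chart slices, and classes with a single protein at the bottom are the likely outliers.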

Since: 

v1.0

Data Sources: 

eggNOG, EC, GO, PDB, PFAM





Modules for Protein Evolution


Essential Genes


This module shows Essential Genes in E. coli from DEG within the current COG cluster. Genes are displayed with corresponding protein structures from the PDB (see module Sequence Similarities for more details about the association Gene/Structure).

Since: 

v1.0

Data Sources: 

eggNOG, DEG, TAXON, PDB



Sequence-based Phylogenetic Trees


This module shows the unrooted phylogenetic tree generated using protein chain sequences from the PDB.

First, the sequences were aligned using ClustalW2. Second, the tree was computed by ClustalW2, using the multiple sequence alignment as a guide. The final image was generated using DrawTree from the PHYLIP package.

The phylogenetic tree may be downloaded as:
     • PHYLIP format, by clicking on DL in the title of the module.
     • PNG images in high-definition (3000x3000), by clicking on the thumbnail.

Since: 

v1.0

Data Sources: 

eggNOG, PDB



Structure-based Phylogenetic Trees


This module shows the unrooted phylogenetic tree generated using protein structures from the PDB.

First, the structures were aligned using MAMMOTH-mult, a tool for multiple structure alignments. Second, the tree was computed by ClustalW2, using the multiple structure alignment as a guide. The final image was generated using DrawTree from the PHYLIP package.

The phylogenetic tree may be downloaded in PHYLIP format (by clicking on DL in the title) or as PNG images in high-definition (3000x3000) by clicking on the thumbnail.

Since: 

v1.1

Data Sources: 

eggNOG, PDB





Modules for Protein Structures


Structures


This module aggregates, in a single table, data from CATH, the Protein Data Bank, SCOP and the UniProtKB Taxonomy. Note that due to copyright issues, we only provide links to retrieve data from the SCOP website rather than reproducing SCOP data on our pages. The table may be downloaded in CSV format.

Since: 

v1.0

Data Sources: 

CATH, eggNOG, TAXON, PDB, SCOP



Structure Comparisons


The pairwise structure comparison tool DaliLite was used to measure the backbone structure similarity of proteins within each orthologous cluster defined by the eggNOG database. All-against-all pairwise structural comparisons within and between the Proteobacteria and Firmicutes were carried out for all COGs where both phyla were represented by a minimum of two organisms. The Dali Z-scores were normalized to calculate a Fractional Structure Similarity (FSS) score:

FSS = Z_{AB} / Z_{AA}


where Z_{AB} is the Dali Z-score when protein B is compared to protein A and Z_{AA} is the Z-score when protein A is compared to itself. Thus, Z_{AA} represents the maximum Z-score that can be achieved for perfect similarity. FSS provides a simple quantitative measure of how far the two proteins have diverged in their structures.
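
As a worked illustration of the normalization, the short function below computes FSS from two Z-scores. The function name and the example Z-scores are made up for this sketch and are not taken from PROFESS data.

```python
def fractional_structure_similarity(z_ab: float, z_aa: float) -> float:
    """Return FSS = Z_AB / Z_AA.

    z_ab: Dali Z-score of protein B compared to protein A.
    z_aa: Dali Z-score of protein A compared to itself (the maximum).
    """
    if z_aa <= 0:
        raise ValueError("the self-comparison Z-score must be positive")
    return z_ab / z_aa


# Hypothetical values: a pair scoring 12.5 against a self-score of 25.0
# retains half of the maximum achievable similarity.
example_fss = fractional_structure_similarity(12.5, 25.0)
```

An FSS close to 1 indicates nearly identical backbone structures, while lower values indicate greater structural divergence.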



Since: 

v1.0

Data Sources: 

eggNOG, TAXON, PDB





Modules for Protein Sequences


Sequences


This module lists protein chain sequences within the eggNOG cluster. Sequences may be downloaded in FASTA format.
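
A downloaded FASTA file can be read with a few lines of code. The parser below is a minimal sketch that relies only on the general FASTA convention (a header line starting with ">" followed by one or more sequence lines). The function name `parse_fasta` is hypothetical, and the exact header layout of PROFESS downloads is not assumed.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[header] += line  # sequences may span several lines
    return records
```

Sequences that are wrapped over several lines are joined into a single string per header.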

Since: 

v1.0

Data Sources: 

eggNOG, TAXON, PDB, SWISS-PROT, TrEMBL