UniProt Non-redundant Reference Cluster (UniRef) Databases

Swiss Institute of Bioinformatics (SIB)
European Bioinformatics Institute (EMBL-EBI)
Protein Information Resource (PIR)

UniProt is mainly supported by the National Institutes of Health (NIH) grant 2 U01 HG. Additional support for the EBI's involvement in UniProt comes from the European Commission contract FELICS (021902) and from the NIH grant 5 P41 HG. UniProtKB/Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the NIH grants for NIAID proteomic resource (HHSN C) and grid enablement (NCI-caBIG-ICR), and National Science Foundation grants for protein ontology (ITR ) and BioTagger (IIS ).

Source sequences
● UniProtKB sequences
● UniProtKB isoform sequences
● Selected UniParc sequences from the ENSEMBL, RefSeq and PDB databases

Production steps
● String comparison: identifying sub-fragments and identical sequences
● CD-HIT computation: clustering UniRef100 representative sequences at the 90% level
● CD-HIT computation: clustering UniRef90 representative sequences at the 50% level
● Generating data files for distribution
● UniRef release

UniRef100
Identical sequences and sub-fragments with 11 or more residues are placed into a single record.

UniRef90
Members of related UniRef100 clusters at the 90% level form a UniRef90 cluster. The representative is selected based on the quality of the entry, name, organism and sequence length. Title and identifier are derived from the representative sequence.

UniRef50
Members of related UniRef90 clusters at the 50% level form a UniRef50 cluster. The representative is selected based on the quality of the entry, name, organism and sequence length. Title and identifier are derived from the representative sequence.

The UniProt Reference Clusters (UniRef) databases, UniRef100, UniRef90 and UniRef50, are automatically generated from the UniProt Knowledgebase and selected UniParc records.
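The UniRef100 rule stated above (identical sequences and sub-fragments with 11 or more residues share a single record) can be illustrated with a small Python sketch. This is a naive illustration under stated assumptions, not the production string-comparison step: the function name, the choice of the longest sequence as the record's representative, and the handling of sequences shorter than 11 residues (merged only when exactly identical) are all hypothetical.

```python
MIN_FRAGMENT_LENGTH = 11  # threshold quoted in the text


def cluster_identical_and_fragments(sequences):
    """Group identical sequences and sub-fragments under the longest sequence.

    sequences: dict mapping accession -> amino-acid string.
    Returns a dict mapping representative accession -> list of member accessions.
    """
    # Process longest first so a containing sequence is seen before its fragments.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = {}
    for accession, seq in ordered:
        for rep in clusters:
            rep_seq = sequences[rep]
            identical = seq == rep_seq
            # Assumption: the 11-residue minimum applies to sub-fragments only.
            fragment = len(seq) >= MIN_FRAGMENT_LENGTH and seq in rep_seq
            if identical or fragment:
                clusters[rep].append(accession)
                break
        else:
            # No existing record contains this sequence: start a new one.
            clusters[accession] = [accession]
    return clusters


# Toy input (accessions are made up for illustration).
example = {
    "A": "MSTAVLENPGLGRKLSDFGQ",  # full-length sequence
    "B": "MSTAVLENPGLGRKLSDFGQ",  # identical to A
    "C": "ENPGLGRKLSD",           # 11-residue sub-fragment of A
    "D": "ENPGL",                 # sub-fragment, but shorter than 11 residues
    "E": "WWWWWWWWWWWW",          # unrelated sequence
}
records = cluster_identical_and_fragments(example)
```

The pairwise scan is quadratic in the number of sequences and is only meant to make the rule concrete; a real release over millions of sequences would need an indexed comparison.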
The databases provide complete coverage of sequence space while hiding redundant sequences from view. The non-redundancy allows faster sequence similarity searches by using UniRef90 and UniRef50.

UniRef90: 40% size reduction
UniRef50: 65% size reduction

FASTA file

>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
FPRTIQELDRFANQILSYGAELDADHPGFKDPVYRARRKQFADIAYNYRHGQPIPRVEYM
EEEKKTWGTVFKTLKSLYKTHACYEYNHIFPLLEKYCGFHEDNIPQLEDVSQFLQTCTGF
RLRPVAGLLSSRDFLGGLAFRVFHCTQYIRHGSKPMYTPEPDICHELLGHVPLFSDRSFA
QFSQEIGLASLGAPDEYIEKLATIYWFTVEFGLCKQGDSIKAYGAGLLSSFGELQYCLSE
KPKLLPLELEKTAIQNYTVTEFQPLYYVAESFNDAKEKVRNFAATIPRPFSVRYDPYTQR
IEVLDNTQQLKILADSINSEIGILCSALQKIK

XML file

The same cluster appears in the XML file as a UniRef90 element carrying the cluster name (Phenylalanine-4-hydroxylase related cluster) and the same representative sequence.

UniRef Usages
● Speeding up similarity searches
● Reducing bias in homology searches by providing a more even sampling of sequence space
● Using the clusters for family classification
● Using the clusters to annotate EST and other sequence databases
● Using the clusters to check the consistency of UniProtKB annotations
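For readers who want to work with the FASTA distribution, the sketch below reads records in the layout of the UniRef90_P00439 example shown in this section and splits each header into the cluster level, the representative accession, and the title. It assumes the header has exactly the form shown here (an identifier, a space, then the title); headers in actual release files may carry extra fields, and the function names and dictionary keys are hypothetical.

```python
def _build_record(header, chunks):
    """Split a header like 'UniRef90_P00439 Some title' into its parts."""
    cluster_id, _, title = header.partition(" ")
    level, _, accession = cluster_id.partition("_")
    return {
        "id": cluster_id,
        "level": level,               # e.g. UniRef100, UniRef90 or UniRef50
        "representative": accession,  # accession the identifier is derived from
        "title": title,
        "sequence": "".join(chunks),  # sequence lines joined without whitespace
    }


def parse_uniref_fasta(text):
    """Return a list of record dicts from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_build_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_build_record(header, chunks))
    return records


# First two sequence lines of the example record, for brevity.
sample = """>UniRef90_P00439 Phenylalanine-4-hydroxylase related cluster
MSTAVLENPGLGRKLSDFGQETSYIEDNCNQNGAISLIFSLKEEVGALAKVLRLFEENDV
NLTHIESRPSRLKKDEYEFFTHLDKRSLPALTNIIKILRHDIGATVHELSRDKKKDTVPW
"""
records = parse_uniref_fasta(sample)
```

Because the title and identifier are derived from the representative sequence, the accession recovered from the identifier points back to the representative entry of the cluster.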