After a genome is annotated, its genes can generally be grouped into three types: known genes, which have been functionally characterized; conserved hypothetical genes, which are conserved in many organisms; and hypothetical genes, which are not found in other organisms. Generally, about thirty percent of the genes in a newly sequenced genome are annotated poorly or even wrongly [1]. These poorly characterized hypothetical genes might therefore play an important role in our understanding of life and biology [2]. Because so little is known about them, it is a great challenge for scientists to select a research target worth years of work. Here we designed a scoring method that assigns each candidate protein a score based on its rankings under several criteria, and applied it to the hypothetical proteins of two major oral pathogens, Porphyromonas gingivalis W83 and Streptococcus mutans UA159.
Materials and methods
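The scoring idea described above (rank every candidate under each criterion, then combine the ranks into one score) can be illustrated with a minimal sketch. The exact formula is not given here, so everything specific in this sketch is an assumption: the rank-sum combination, the equal weighting, the choice of criteria, the direction in which each criterion is treated as favorable, and all function and variable names. The four sample rows are copied from the Porphyromonas gingivalis W83 table.

```python
from typing import Dict

# Hypothetical criteria: True means a larger value is assumed to be more favorable.
# These directions are illustrative assumptions, not taken from the analysis.
CRITERIA: Dict[str, bool] = {
    "cluster_size": True,
    "domains": True,
    "solubility": True,
    "disorder_pct": False,
    "species": True,
    "tfbs": True,
}

# Sample rows copied from the P. gingivalis W83 table (LANL ID -> column values).
PROTEINS: Dict[str, Dict[str, float]] = {
    "PG1479": {"cluster_size": 5, "domains": 4, "solubility": 48.7,
               "disorder_pct": 21.4, "species": 34, "tfbs": 45},
    "PG1750": {"cluster_size": 7, "domains": 1, "solubility": 13.8,
               "disorder_pct": 4.2, "species": 33, "tfbs": 43},
    "PG0232": {"cluster_size": 7, "domains": 2, "solubility": 10.7,
               "disorder_pct": 14.8, "species": 115, "tfbs": 36},
    "PG0058": {"cluster_size": 11, "domains": 2, "solubility": 6.8,
               "disorder_pct": 9.9, "species": 87, "tfbs": 33},
}


def rank_positions(values: Dict[str, float], higher_is_better: bool) -> Dict[str, float]:
    """Return a 1-based rank per protein; tied values share their average rank."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of proteins tied with position i.
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_scores(proteins: Dict[str, Dict[str, float]],
                    criteria: Dict[str, bool]) -> Dict[str, float]:
    """Sum each protein's rank over all criteria (equal weights; lower is better)."""
    scores = {name: 0.0 for name in proteins}
    for criterion, higher_is_better in criteria.items():
        column = {name: row[criterion] for name, row in proteins.items()}
        for name, rank in rank_positions(column, higher_is_better).items():
            scores[name] += rank
    return scores


def ranked(proteins: Dict[str, Dict[str, float]],
           criteria: Dict[str, bool]) -> list:
    """Order proteins from best (lowest rank sum) to worst; names break ties."""
    scores = rank_sum_scores(proteins, criteria)
    return sorted(scores, key=lambda name: (scores[name], name))


if __name__ == "__main__":
    totals = rank_sum_scores(PROTEINS, CRITERIA)
    for name in ranked(PROTEINS, CRITERIA):
        print(name, totals[name])
```

Because only four rows and six columns are used, the order this sketch produces is not expected to match the published ranking. A weighted sum or a different tie rule would be an equally plausible reading of the method.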
Porphyromonas gingivalis W83 top-20 [Full ranking table].
Rank | GI | LANL ID | TIGR ID | Size (aa) | Gene Cluster Size | Domain (#) | Solubility (%) | Disordered Region Length (aa) | Disordered Region Length Percentage (%) | Bacteria Strain (#) | Bacteria Species (#) | Bacteria Family (#) | TFBS (#) | Definition |
1 | 34541330 | PG1479 | PG_1694 | 364 | 5 | 4 | 48.7 | 78 | 21.4 | 40 | 34 | 19 | 45 | conserved hypothetical protein |
2 | 34541597 | PG1750 | PG_2004 | 330 | 7 | 1 | 13.8 | 14 | 4.2 | 40 | 33 | 16 | 43 | conserved membrane protein |
3 | 34540102 | PG0232 | PG_0257 | 481 | 7 | 2 | 10.7 | 71 | 14.8 | 138 | 115 | 51 | 36 | ABC transporter subunit |
4 | 34539940 | PG0058 | PG_0069 | 504 | 11 | 2 | 6.8 | 50 | 9.9 | 101 | 87 | 36 | 33 | conserved hypothetical protein (possible sugar kinase) |
5 | 34540161 | PG0301 | PG_0325 | 209 | 8 | 2 | 41.9 | 31 | 14.8 | 50 | 43 | 28 | 32 | conserved hypothetical protein (possible serine cycle enzyme, formiminotransferase cyclodeaminase) |
6 | 34541437 | PG1588 | PG_1817 | 271 | 13 | 3 | 41 | 15 | 5.5 | 25 | 23 | 15 | 24 | cytochrome c-type synthesis protein (cytochrome c biogenesis protein) |
7 | 34539921 | PG0042 | PG_0048 | 367 | 4 | 2 | 28 | 68 | 18.5 | 31 | 30 | 44 | 37 | conserved GTP-binding protein |
8 | 34541242 | PG1390 | PG_1591 | 365 | 1 | 2 | 41 | 56 | 15.3 | 27 | 27 | 21 | 47 | conserved hypothetical protein |
9 | 34540104 | PG0235 | PG_0259 | 447 | 7 | 1 | 22.8 | 65 | 14.5 | 47 | 44 | 24 | 29 | conserved hypothetical protein (possible ABC transporter related membrane protein) |
10 | 34540684 | PG0826 | PG_0927 | 138 | 13 | 2 | 58.9 | 34 | 24.6 | 104 | 93 | 41 | 31 | conserved hypothetical protein |
11 | 34540774 | PG0920 | PG_1033 | 248 | 7 | 2 | 5.3 | 9 | 3.6 | 94 | 90 | 44 | 36 | conserved hypothetical protein/possible ABC element with MSD domain |
12 | 34541294 | PG1443 | PG_1653 | 407 | 7 | 7 | 9.4 | 48 | 11.8 | 33 | 24 | 10 | 34 | conserved hypothetical protein |
13 | 34540453 | PG0590 | PG_0652 | 100 | 11 | 1 | 46.5 | 0 | 0 | 19 | 17 | 14 | 30 | conserved hypothetical protein |
14 | 34540561 | PG0699 | PG_0778 | 239 | 8 | 1 | 3.8 | 33 | 13.8 | 87 | 75 | 35 | 43 | possible glycoprotein endopeptidase |
15 | 34540098 | PG0228 | PG_0253 | 152 | 7 | 1 | 46.4 | 45 | 29.6 | 31 | 31 | 21 | 43 | conserved hypothetical protein |
16 | 34541477 | PG1627 | PG_1868 | 173 | 5 | 1 | 50.7 | 0 | 0 | 21 | 16 | 10 | 32 | conserved hypothetical protein |
17 | 34540740 | PG0890 | PG_0996 | 302 | 8 | 7 | 12.1 | 110 | 36.4 | 26 | 23 | 41 | 47 | conserved hypothetical protein |
18 | 34540228 | PG0368 | PG_0401 | 513 | 7 | 3 | 15 | 156 | 30.4 | 106 | 93 | 37 | 32 | conserved hypothetical protein (possible CTP synthase) |
19 | 34541630 | PG1784 | PG_2043 | 364 | 3 | 2 | 17.9 | 32 | 8.8 | 64 | 50 | 25 | 31 | conserved hypothetical protein |
20 | 34540629 | PG0769 | PG_0859 | 312 | 7 | 3 | 20.1 | 61 | 19.6 | 16 | 14 | 11 | 35 | conserved hypothetical protein |
Streptococcus mutans UA159 top-20
Rank | GI | LANL ID | TIGR ID | Size (aa) | Gene Cluster Size | Domain (#) | Solubility (%) | Disordered Region Length (aa) | Disordered Region Length Percentage (%) | Bacteria Strain (#) | Bacteria Species (#) | Bacteria Family (#) | TFBS (#) | Definition |
1 | 24379036 | SMu0505 | NTL02SM0504 | 86 | 17 | 1 | 51.4 | 0 | 0 | 79 | 71 | 26 | 58 | conserved hypothetical protein |
2 | 24380160 | SMu1634 | NTL02SM1628 | 247 | 20 | 2 | 64.6 | 34 | 13.8 | 56 | 44 | 12 | 57 | conserved hypothetical protein; possible methyltransferase |
3 | 24379999 | SMu1473 | NTL02SM1467 | 164 | 11 | 2 | 95.3 | 41 | 25 | 92 | 84 | 33 | 58 | conserved hypothetical protein |
4 | 24378760 | SMu0228 | NTL02SM0228 | 471 | 10 | 1 | 20.8 | 82 | 17.4 | 134 | 126 | 66 | 57 | ABC transporter permease |
5 | 24378552 | SMu0020 | NTL02SM0020 | 391 | 18 | 1 | 21.4 | 12 | 3.1 | 84 | 72 | 24 | 56 | aspartate or aromatic amino acid aminotransferase |
6 | 24379271 | SMu0740 | NTL02SM0739 | 378 | 8 | 3 | 25.9 | 24 | 6.3 | 36 | 30 | 17 | 59 | aminotransferase |
7 | 24380037 | SMu1511 | NTL02SM1505 | 288 | 14 | 1 | 16.3 | 23 | 8 | 116 | 93 | 37 | 57 | conserved hypothetical protein, tetrapyrrole methylase family |
8 | 24378835 | SMu0303 | NTL02SM0303 | 271 | 10 | 2 | 55.1 | 26 | 9.6 | 99 | 78 | 28 | 55 | inner membrane protein |
9 | 24378920 | SMu0388 | NTL02SM0388 | 275 | 4 | 4 | 38.5 | 11 | 4 | 61 | 46 | 14 | 58 | conserved hypothetical protein, Cof family |
10 | 24379860 | SMu1332 | NTL02SM1328 | 262 | 9 | 2 | 20.8 | 38 | 14.5 | 82 | 64 | 28 | 58 | conserved hypothetical protein, NIF3-related |
11 | 24380470 | SMu1942 | NTL02SM1938 | 657 | 9 | 2 | 27.5 | 127 | 19.3 | 44 | 36 | 8 | 60 | conserved hypothetical protein (DHH family protein) |
12 | 24380093 | SMu1567 | NTL02SM1561 | 82 | 10 | 1 | 80.8 | 0 | 0 | 40 | 39 | 12 | 58 | conserved hypothetical protein |
13 | 24379851 | SMu1323 | NTL02SM1319 | 327 | 7 | 10 | 91.9 | 25 | 7.6 | 23 | 15 | 4 | 59 | conserved hypothetical protein, tetratricopeptide repeat family |
14 | 24380155 | SMu1629 | NTL02SM1623 | 238 | 20 | 1 | 49.5 | 90 | 37.8 | 100 | 87 | 30 | 57 | conserved hypothetical protein |
15 | 24378740 | SMu0208 | NTL02SM0208 | 555 | 7 | 1 | 32 | 100 | 18 | 75 | 60 | 21 | 57 | conserved hypothetical protein (possible kinase) |
16 | 24378693 | SMu0161 | NTL02SM0161 | 200 | 12 | 1 | 29.9 | 15 | 7.5 | 21 | 19 | 30 | 59 | conserved hypothetical protein (possible oxidoreductase) |
17 | 24379602 | SMu1073 | NTL02SM1070 | 246 | 6 | 1 | 58.2 | 0 | 0 | 35 | 31 | 20 | 56 | conserved hypothetical protein |
18 | 24379206 | SMu0675 | NTL02SM0674 | 273 | 4 | 5 | 40.8 | 54 | 19.8 | 65 | 49 | 14 | 58 | hydrolase, haloacid dehalogenase-like family |
19 | 24379498 | SMu0969 | NTL02SM0966 | 110 | 10 | 7 | 78.2 | 32 | 29.1 | 63 | 54 | 19 | 58 | DNA-binding protein |
20 | 24379348 | SMu0817 | NTL02SM0816 | 92 | 16 | 1 | 60.2 | 13 | 14.1 | 31 | 27 | 20 | 59 | conserved hypothetical protein |
In this analysis, we provide ranked lists of the hypothetical proteins in Porphyromonas gingivalis W83 and Streptococcus mutans UA159, both oral pathogens of great interest. The accompanying hyperlinks provide some evidence of the functions these hypothetical proteins might possess. This analysis is not an end but a starting point for further experiments, and one day the annotations of these hypothetical proteins might be completed.
References