Materials and Methods
IS elements are predicted in the following ten oral pathogens
- Actinobacillus actinomycetemcomitans HK1651
- Fusobacterium nucleatum ATCC 25586
- Prevotella intermedia 17
- Streptococcus mutans UA159
- Streptococcus sanguinis SK36
- Streptococcus gordonii Challis substr. CH1
- Porphyromonas gingivalis W83
- Tannerella forsythensis ATCC 43037
- Treponema denticola ATCC 35405
- Actinomyces naeslundii MG1
Whole-genome blastx against the ISfinder database
The ISfinder database records known IS elements. It contains the DNA and protein sequences of these elements, which are classified into families. The whole genome sequences are searched against the protein sequences (ISPEP) with blastx. To obtain as many results as possible, no cut-off is set at this step, because truncated or degenerate transposases may not produce good matches against ISPEP. All high-scoring pairs (HSPs) found in this step are taken as the initial hit regions, that is, the candidate regions.
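As an illustration of how the initial hit regions could be collected, the sketch below parses blastx results into genome coordinates. It assumes the standard 12-column BLAST tabular output (query, subject, percent identity, length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the output format actually used in this study is not stated, and the names `Hsp` and `parse_tabular_hsps` are hypothetical.

```python
from typing import List, NamedTuple


class Hsp(NamedTuple):
    subject: str   # ISPEP protein identifier
    start: int     # 1-based start on the genome (start <= end)
    end: int       # 1-based end on the genome
    strand: str    # "+" or "-"
    evalue: float


def parse_tabular_hsps(text: str) -> List[Hsp]:
    """Parse 12-column BLAST tabular lines into genome-coordinate HSPs.

    No E-value filter is applied, mirroring the 'no cut-off' choice
    described for the first blastx step.
    """
    hsps = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        qstart, qend = int(cols[6]), int(cols[7])
        # In blastx output, a reverse-strand hit has query start > query end.
        strand = "+" if qstart <= qend else "-"
        hsps.append(Hsp(cols[1], min(qstart, qend), max(qstart, qend),
                        strand, float(cols[10])))
    return hsps
```

Every parsed HSP, regardless of its E-value, would then be treated as a candidate region for the following steps.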
Finding IRs with the Inverted Repeats Finder (IRF)
Two of the most significant features of IS elements are the transposase-coding genes they carry and the inverted repeats (IRs) at both termini. An IS element can contain more than one open reading frame (ORF). Since the boundaries of the IS elements are not known at this step and IS elements are about 2.5 kb long, the initial hit regions are extended by 1 kb on each side. The sequences of the initial hit regions together with their flanking regions are then used as input for IRF. This program considers repeat features such as the percentages of matches and indels and the GC content, and from these features it calculates a score that measures the quality of the IRs. Here, the score threshold is set lower than the IRF authors' recommendation because IRs may degenerate through evolution and become less significant. However, what we are looking for is the basic intact structure of an IS element. Initial hit regions in which IRs are found are considered to contain possible IS elements, which lie between the left IR (IRL) and the right IR (IRR).
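The flank extension can be sketched as follows. This is a minimal example that assumes 1-based inclusive coordinates and treats the genome as linear, so the extended region is clipped at the sequence ends; how the original pipeline handled hits near the ends of a sequence is not stated. The function name `extend_region` is hypothetical.

```python
def extend_region(start, end, genome_length, flank=1000):
    """Extend a hit region by `flank` bases on each side.

    Coordinates are 1-based and inclusive. The result is clipped to
    [1, genome_length] (assumption: linear sequence, no wrap-around).
    """
    return max(1, start - flank), min(genome_length, end + flank)
```

For example, a hit at positions 5000 to 6000 would give the input region 4000 to 7000 for IRF.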
Second blastx against the ISPEP database
Because the flanking region is 1000 base pairs on each side, the region within the IRs may not contain the transposase, that is, the ISPEP hit. The second blastx against the ISPEP database examines the regions within the IRs. In this step, regions within IRs that have blastx hits against ISPEP are the possible IS elements. The E-value cut-off is set to 0.0001 because the boundaries of the predicted IS elements are now known. Therefore, each predicted element contains IRs at its termini and at least one blastx ISPEP hit within the IRs.
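A minimal sketch of this filter is given below. It assumes that the second blastx results have already been grouped per IR-bounded candidate as lists of `(ispep_id, evalue)` pairs; that data layout and the name `keep_supported_candidates` are illustrative assumptions, and whether the cut-off is inclusive is also an assumption.

```python
def keep_supported_candidates(candidates, evalue_cutoff=1e-4):
    """Keep candidates with at least one ISPEP hit at or below the cut-off.

    `candidates` maps a candidate id to a list of (ispep_id, evalue)
    pairs from the second blastx run on the region between the IRs.
    """
    kept = {}
    for candidate_id, hits in candidates.items():
        good = [h for h in hits if h[1] <= evalue_cutoff]
        if good:  # at least one supporting ISPEP hit within the IRs
            kept[candidate_id] = good
    return kept
```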
Blastx against the nr protein database
Even if a predicted IS element contains at least one blastx ISPEP hit, the hit could be a random match or part of a fusion protein. The regions within the IRs are therefore also searched with blastx against the nr protein database to check the second ISPEP hits. Regions without any transposase hit are deleted from the candidate IS elements. Otherwise, the blastx results serve only as a reference for checking the second ISPEP hits. No cut-off is set (the E-value threshold is 10), all results are preserved, and the numbers of hits described as "transposase", "hypothetical" and others are recorded. Thus, even a region with only one non-significant hit against the nr database is preserved for further investigation.
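The tally of hit descriptions could look like the sketch below. It assumes case-insensitive keyword matching on the description line of each nr hit and counts a description containing both keywords as "transposase"; both choices, and the name `tally_nr_hits`, are assumptions made for illustration.

```python
def tally_nr_hits(descriptions):
    """Count nr hit descriptions by keyword.

    Returns a dict with the keys 'transposase', 'hypothetical' and
    'other'. 'transposase' takes precedence when both keywords occur.
    """
    counts = {"transposase": 0, "hypothetical": 0, "other": 0}
    for description in descriptions:
        text = description.lower()
        if "transposase" in text:
            counts["transposase"] += 1
        elif "hypothetical" in text:
            counts["hypothetical"] += 1
        else:
            counts["other"] += 1
    return counts
```

A region whose tally shows zero "transposase" hits would be the one removed from the candidate list.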
Gene annotations are included
Existing annotations are not used as a criterion for finding transposases. They are considered a reference for checking the quality of the results. In addition, the annotations provide information about the ORFs. If annotation exists for a record, the second ISPEP blastx hits are checked to see whether the hit regions are located in ORFs. Only hit regions located in ORFs are recorded as predicted IS element records.
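The ORF check can be sketched as a simple containment test. The example assumes that "located in an ORF" means the hit lies entirely inside the ORF; a partial-overlap criterion would be an equally plausible reading. The name `hit_in_orf` is hypothetical.

```python
def hit_in_orf(hit, orfs):
    """Return True if the hit lies entirely inside at least one ORF.

    `hit` is a (start, end) pair and `orfs` is a list of (start, end)
    pairs, all in the same genome coordinates with start <= end.
    """
    hit_start, hit_end = hit
    return any(orf_start <= hit_start and hit_end <= orf_end
               for orf_start, orf_end in orfs)
```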
Refinement of the results
The purpose of refinement is to eliminate redundant records. For example, the same IRs may be found for slightly different initial hit regions. In that case, the region with the better E-value is preserved and the worse one is deleted. If two predicted IS elements contain the same ORFs, the one whose IRs have the higher IRF score is recorded.
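The two refinement rules can be sketched as two successive de-duplication passes. The record fields (`ir`, `evalue`, `orfs`, `irf_score`), the order of the passes and the name `refine` are assumptions made for illustration; the source does not specify them.

```python
def refine(records):
    """Remove redundant records using the two rules described above.

    Each record is a dict with:
      'ir'        : hashable IR coordinates, e.g. (irl_start, irr_end)
      'evalue'    : E-value of the ISPEP hit
      'orfs'      : iterable of ORF identifiers inside the element
      'irf_score' : score reported by IRF for the IR pair
    """
    # Rule 1: same IRs -> keep the record with the better (lower) E-value.
    best_by_ir = {}
    for rec in records:
        key = rec["ir"]
        if key not in best_by_ir or rec["evalue"] < best_by_ir[key]["evalue"]:
            best_by_ir[key] = rec

    # Rule 2: same ORF set -> keep the record with the higher IRF score.
    best_by_orfs = {}
    for rec in best_by_ir.values():
        key = frozenset(rec["orfs"])
        if (key not in best_by_orfs
                or rec["irf_score"] > best_by_orfs[key]["irf_score"]):
            best_by_orfs[key] = rec
    return list(best_by_orfs.values())
```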
Data curation
After the refinement, there are still many records in nearly the same regions but with different sets of ORFs or with overlapping hits from the second ISPEP blastx. Therefore, before manual curation, these records have to be combined into regions as the final refinement result. Each region is defined by the overlapping hits of the second ISPEP blastx on the target genome. That is, if second ISPEP blastx hits on the genome overlap with other hits, the overlapping records are combined into one region, which runs from the smallest hit start to the largest hit end. The records within a region are the records of predicted IS elements that contain terminal IRs. In our results below, you first select the region of interest, and the detailed records in this region are then shown.
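Combining overlapping hits into regions is an interval-merging problem, which can be sketched as below. The example assumes inclusive integer coordinates and treats hits that share at least one position as overlapping; the name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(hits):
    """Merge overlapping (start, end) hits into regions.

    Each region spans from the smallest start to the largest end of
    the hits it combines. Coordinates are inclusive, start <= end.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the current region: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]
```

For example, hits at 100 to 500 and 400 to 900 are combined into one region from 100 to 900, while a hit at 1500 to 1800 forms a separate region.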
The manual curation step is relatively simple within each region. The smaller the distances between the IRs and the ORF, the higher the chance that the record is a true IS element. In each region, all records with distances below 100 base pairs are kept and the others are deleted. If there are no such records, all results are preserved. The other task in curation is to check whether the annotation is a transposase. If the annotation indicates another function, the blastx results against the ISPEP and nr databases are examined to see whether the hit is a non-significant match or a fusion protein.
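The distance rule can be sketched as follows. The example assumes that both gaps (IRL to ORF start, and ORF end to IRR) must be below 100 base pairs for a record to be kept; the source does not say whether one or both distances are tested. The field names and the function name `filter_by_ir_orf_distance` are hypothetical.

```python
def filter_by_ir_orf_distance(records, max_distance=100):
    """Keep records whose IR-to-ORF gaps are both below `max_distance`.

    Each record is a dict with 'irl_end', 'orf_start', 'orf_end' and
    'irr_start'. If no record in the region passes, all records are
    returned unchanged, as described above.
    """
    def gaps(rec):
        return (rec["orf_start"] - rec["irl_end"],
                rec["irr_start"] - rec["orf_end"])

    close = [rec for rec in records if max(gaps(rec)) < max_distance]
    return close if close else list(records)
```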
For records without annotation, if a region contains records both with and without annotation, the records without annotation are deleted. If all the records in a region are without annotation, all of them are preserved. However, if such records lie within ORFs, they should be deleted because they are partial matches to those ORFs. Regions in which all records are without annotation and are located in intergenic regions are interesting to investigate. They may result from incorrect gene prediction or from the degeneration of IS elements, especially when the record has long hits against ISPEP.
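The rules for records without annotation can be sketched for a single region as below. The fields `annotation` (a string, or `None` when absent) and `inside_orf` (whether the record lies within an annotated ORF), as well as the name `curate_unannotated`, are assumptions introduced for this example.

```python
def curate_unannotated(records):
    """Apply the no-annotation rules to the records of one region.

    If any record is annotated, only annotated records are kept.
    Otherwise all records are kept except those lying within an ORF,
    which are treated as partial matches to that ORF.
    """
    annotated = [rec for rec in records if rec["annotation"] is not None]
    if annotated:
        return annotated
    return [rec for rec in records if not rec["inside_orf"]]
```

The records that survive the second branch are the intergenic, unannotated ones singled out above as worth further investigation.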