
TRAP - Tandem Repeats Analysis Program

Satellite DNA survey of five genomes

 
Electronic Supplementary Material accompanying the paper entitled "TRAP: automated classification, quantification, and annotation of tandemly repeated sequences" by Sobreira, T.J.P., Durham, A.M. and Gruber, A.


Introduction

    Tandem repeat loci are composed of clusters of tandemly repeated sequences with varying copy numbers. They are classified according to the period size of the repeat unit: microsatellites or simple sequence repeats (SSR) for periods of 1 to 6 bp, minisatellites for periods of 11 to 100 bp, and satellites for periods longer than 100 bp (Armour et al., 1999; Chambers and MacAvoy, 2000). However, this classification is not standardized and, depending on the author, the period size values defining each class can differ slightly (Chambers and MacAvoy, 2000). Microsatellite variability is mostly assumed to arise from slippage events during replication, whereas minisatellite heterogeneity is ascribed to both slippage events and unequal crossing-over or unequal sister chromatid exchange during meiosis. Repeats with period sizes of 7 to 10 bp are not assigned to any particular class because their mutational mechanism is not yet known (Chambers and MacAvoy, 2000).
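The class boundaries quoted above can be written as a small lookup. This is only an illustrative sketch: the function name `classify_period` and the label "unassigned" for the 7 to 10 bp range are assumptions made here, and the code is not taken from TRAP itself.

```python
def classify_period(period_bp: int) -> str:
    """Map a repeat period size (bp) to the class ranges quoted in the text.

    Hypothetical helper for illustration; not part of TRAP.
    """
    if period_bp < 1:
        raise ValueError("period size must be a positive integer")
    if period_bp <= 6:
        return "microsatellite"   # 1 to 6 bp (SSR)
    if period_bp <= 10:
        return "unassigned"       # 7 to 10 bp: no class assigned in the text
    if period_bp <= 100:
        return "minisatellite"    # 11 to 100 bp
    return "satellite"            # longer than 100 bp


for period in (2, 7, 33, 171):
    print(period, classify_period(period))
```

Because the boundaries differ between authors, keeping them in one function makes it easy to swap in another convention.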

Methods

    As a first step towards standardizing tandem repeat analysis, we submitted some local sequences to TRF version 4.00 (Benson, 1999), testing different match/mismatch/indel alignment parameters, (2,3,5), (2,5,7) and (2,7,7), while keeping the remaining parameters at (80,10,25,1000). For a fuller explanation of these parameters, the reader is referred to Benson (1999). The TRF results were then submitted to several TRAP runs, each with a different cutoff for the minimum accepted percentage of identity between adjacent repeat units overall (the -id parameter of TRAP). Using cutoffs varying from 50 to 100%, we observed that with the alignment parameters (2,5,7) most of the repeat loci presented at least 70% identity across repeat units (data not shown). We used a minimum alignment score of 25, which ensures that a repeat locus is reported only if it presents a minimum copy number of 13 repeat units for homopolymers, 7 for dinucleotides, 5 for trinucleotides, 4 for tetranucleotides, 3 for penta- and hexanucleotides, or 2 for heptanucleotides and longer motifs. For degenerate repeats, mismatch and indel penalties are subtracted from the overall alignment score, so that, depending on the period size of the repeat motif, longer stretches must occur for the program to report them. In view of these preliminary results, we concluded that the TRF parameter set (2,5,7,80,10,25,1000) provides an acceptable stringency for starting a TRAP analysis, and we adopted it in all subsequent analyses. To test TRAP on real-life examples, we analyzed the satellite content of the following genomes: Escherichia coli, Saccharomyces cerevisiae, Plasmodium falciparum, Caenorhabditis elegans and Drosophila melanogaster. Sequences, data sources, accession codes and a complete set of results are also available on this site.
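The minimum copy numbers listed above follow from the match weight of 2 and the minimum score of 25. The sketch below reproduces them under a deliberately simplified scoring model, assumed here for illustration: a perfect repeat in which every base contributes the match weight, with a floor of two copies for longer motifs as stated in the text. The function name `min_copy_number` is hypothetical and the code is not TRF's actual scoring routine.

```python
import math


def min_copy_number(period_bp: int, min_score: int = 25,
                    match_weight: int = 2, min_copies: int = 2) -> int:
    """Smallest copy number of a perfect repeat that reaches min_score.

    Simplified model (an assumption): score = match_weight * repeat length in bp.
    """
    # Shortest perfect stretch whose score reaches the threshold: ceil(25 / 2) = 13 bp.
    min_length_bp = math.ceil(min_score / match_weight)
    # Whole copies needed to span that length, never fewer than min_copies.
    return max(min_copies, math.ceil(min_length_bp / period_bp))


for period in range(1, 8):
    print(period, min_copy_number(period))
```

Under this model a degenerate repeat needs extra matching bases to offset each mismatch or indel penalty, which is why longer stretches are required before such loci are reported.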

Results and Discussion

    Table 1 shows the results obtained for the five analyzed genomes, using TRAP identity cutoffs varying from 70 to 100%. E. coli presents a very low tandemly repeated DNA content, as is usually observed in prokaryotes (Hancock, 2002). Comparing TRAP's results with repeat surveys reported in the literature, we observed good agreement. Tóth et al. (2000) calculated the content of perfect SSRs with period sizes of 1-6 bp for several genomes. These authors reported an overall tandem repeat content per megabase of 3,004 bp for S. cerevisiae (0.3%) and 2,139 bp for C. elegans (0.2%). In our analysis, using an identity of 100% between adjacent repeats, TRAP determined overall repeat contents of 0.4% and 0.5%, respectively. The differences between our results and those reported by Tóth et al. (2000) can be ascribed to the different repeat finder programs used by the two groups, as well as to the criteria used for defining a repeat locus. In another survey, Karaoglu et al. (2004) reported for S. cerevisiae an occurrence of 3,618 repeat loci for repeats of 10 bp or longer. This result is in good agreement with the TRAP analysis, which yielded 3,697 loci for perfect repeats (see Table 1). In the case of Plasmodium falciparum, Su and Wootton (2004) reported a microsatellite frequency of at least one locus per kb, mostly (TA)n or (T or A)n, with a typical n value of 10-30 bp (Ferdig and Su, 2000). Considering a genome complexity of circa 23 Mbp, this estimate predicts an overall microsatellite content varying from 1% to 6% (23,000 loci x period size of 1-2 bp x n value of 10-30 bp). In agreement with these data, and considering only perfect repeats, TRAP determined a microsatellite content of 3% (see Table 1). The strikingly high repetitive content of the P. falciparum genome is even more pronounced if we consider degeneracies.
    The overall repeat content, considering all satellite DNA classes, varies from 3.9% (for perfect repeats) to 37.8% (70% identity between adjacent repeats overall). Here, most of the repeats are in the microsatellite class. Still, the telomeric repeat AAACCCT, at 80% overall identity, accounts for a total of 29,021 bp, corresponding to only 0.13% of the P. falciparum genome. The extreme compositional bias of the P. falciparum genome, in excess of 80% A+T content (Gardner et al., 2002), could be one of the causes of the high occurrence of low complexity regions comprising tandemly repeated sequences. In fact, the eight most prevalent microsatellite motifs are composed exclusively of A and T bases, and account for a total repeat content of 16.9% at 70% identity and 5.3% at 90% identity (see Table 1). A high repetitive content was also found in the amoeba Dictyostelium discoideum, an organism with an (A+T)-rich (77.57%) genome and an overall tandem repeat content >11% (Eichinger et al., 2005).
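The back-of-the-envelope estimate for P. falciparum (one locus per kb, period 1-2 bp, n of 10-30, genome of circa 23 Mbp) and the telomeric repeat percentage can be checked with a few lines of arithmetic. The sketch treats n as a copy number, as the multiplication in the text implies. The function and variable names are chosen here for illustration only.

```python
def content_fraction(n_loci: int, period_bp: int, copies: int, genome_bp: int) -> float:
    """Fraction of the genome covered by n_loci loci of period_bp * copies bases each."""
    return n_loci * period_bp * copies / genome_bp


GENOME_BP = 23_000_000            # circa 23 Mbp, as quoted in the text
N_LOCI = GENOME_BP // 1_000       # at least one locus per kb -> 23,000 loci

low = content_fraction(N_LOCI, period_bp=1, copies=10, genome_bp=GENOME_BP)
high = content_fraction(N_LOCI, period_bp=2, copies=30, genome_bp=GENOME_BP)
telomeric = 29_021 / GENOME_BP    # AAACCCT repeats at 80% identity

print(f"predicted microsatellite content: {low:.0%} to {high:.0%}")
print(f"telomeric repeat content: {telomeric:.2%}")
```

The two bounds bracket the 3% microsatellite content that TRAP reports for perfect repeats.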

    One of TRAP's characteristics is the ability to select and quantify degenerate repeats according to different percentages of matches between adjacent repeats overall, a definition created by Benson (1999). This means that the repeat content can be calculated in a flexible manner and adjusted to various identity criteria. Indeed, when we calculated the overall repeat content of P. falciparum, markedly different results were obtained with different identity cutoffs. The microsatellite (repeat unit length of 1-6 bp) content varied from 3.0% for perfect repeats up to 16.9% for repeats presenting at least 70% of matches (Table 1). Similar results were obtained for minisatellite sequences (repeat period of 11-100 bp), where content values varied from 0.2% to 13.2% under the same identity values. On a smaller scale, this was also observed for other organisms such as C. elegans and D. melanogaster. Most reports in the literature evaluate the repeat content based only on perfect repeats, thus excluding degenerate repeats from the calculation and possibly underestimating the total tandem repeat content. Since there is no universal standard for the definition of tandem repeats in terms of minimum copy number and extent of divergence, any census should be made using more than a single set of criteria. By permitting flexibility in the criteria used for repeat definition, TRAP can generate more comprehensive and comparative surveys.
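The effect of the identity cutoff on the reported repeat content can be illustrated with a minimal sketch. It assumes each locus is described by its length and its percentage of matches between adjacent repeats, and that loci do not overlap. The `Locus` record, the `repeat_content` function and the toy values below are invented for illustration; they are neither TRAP's data model nor results from Table 1.

```python
from typing import Iterable, NamedTuple


class Locus(NamedTuple):
    period_bp: int          # repeat unit length
    length_bp: int          # total span of the locus
    percent_matches: float  # matches between adjacent repeats overall


def repeat_content(loci: Iterable[Locus], genome_bp: int, min_identity: float) -> float:
    """Percentage of the genome covered by loci at or above the identity cutoff."""
    covered = sum(l.length_bp for l in loci if l.percent_matches >= min_identity)
    return 100.0 * covered / genome_bp


# Toy loci in a made-up 10,000 bp sequence (illustrative values only).
toy_loci = [
    Locus(period_bp=2, length_bp=100, percent_matches=100.0),
    Locus(period_bp=2, length_bp=300, percent_matches=85.0),
    Locus(period_bp=15, length_bp=600, percent_matches=72.0),
]

for cutoff in (100, 90, 80, 70):
    print(cutoff, repeat_content(toy_loci, genome_bp=10_000, min_identity=cutoff))
```

Even in this toy case the content rises tenfold between the strictest and the most permissive cutoff, which is the kind of spread that a census restricted to perfect repeats would miss.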

References

  • Armour,J.A.L., et al. (1999) Microsatellites and mutation processes in tandemly repetitive DNA. In: Goldstein,D. and Schlotterer,C. (eds), Microsatellites: Evolution and Applications, Oxford University Press, Oxford, pp. 24-29.
  • Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573-580.
  • Chambers,G.K. and MacAvoy,E.S. (2000) Microsatellites: consensus and controversy. Comp. Biochem. Physiol. B., 126, 455-476.
  • Eichinger,L., et al. (2005) The genome of the social amoeba Dictyostelium discoideum. Nature, 435, 43-57.
  • Ferdig,M.T. and Su,X.Z. (2000) Microsatellite markers and genetic mapping in Plasmodium falciparum. Parasitol. Today, 16, 307-312.
  • Gardner,M.J., et al. (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature, 419, 498-511.
  • Hancock,J.M. (2002) Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects. Genetica, 115, 93-103.
  • Su,X.Z. and Wootton,J.C. (2004) Genetic mapping in the human malaria parasite Plasmodium falciparum. Mol. Microbiol., 53, 1573-1582.
  • Tóth,G., et al. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res., 10, 967-981.




 © 2005 - Arthur Gruber, Alan M. Durham & Tiago J.P. Sobreira