
Overview
Chaos Game Representation (CGR) is an iterative mapping technique that constructs a two-dimensional representation of a genomic sequence (Jeffrey, 1990). CGRs have conventionally been used to visualize large nucleotide sequences. Beyond visualization, they can be used to compare DNA sequences, construct cladograms, and address biological problems such as HIV subtyping (Pandit and Sinha, 2010). CGAT is an integrated web server for comparing multiple whole genome sequences using a CGR-based approach. Its main strengths are that it handles large DNA sequences (e.g., a Neighbor Joining tree of the whole genome sequences of 11 viruses, each ~150 kbp long, was built in 2 min 30 sec), classifies very similar sequences effectively (e.g., sub-subtypes of HIV-1), and, compared to standard Maximum Likelihood methods, takes less time and delivers better-resolved classification trees. It classifies sequences on the basis of both inter-species and intra-species variation at a low computational cost. Because it analyses whole genome variation with an alignment-free and scale-invariant method, the resulting trees can be used to interpret similarity between multiple whole genome sequences.
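The iterative mapping behind a CGR can be illustrated with a minimal Python sketch. It is not the server's code: the function name cgr_points is hypothetical, the corner assignment (A, C, G, T at the four corners of the unit square) is a commonly used convention assumed here, and skipping ambiguous bases is likewise an assumption. Each base moves the current point halfway towards its corner, starting from the centre of the square.

```python
# Assumed corner layout of the unit square (a common CGR convention,
# not confirmed as the one used by the web server).
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence):
    """Return the list of CGR coordinates, one per recognised base."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Assumption: ambiguous symbols such as N are skipped.
            continue
        # Move halfway from the current point towards the base's corner.
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Plotting the returned points for a long sequence gives the familiar CGR image; the same list of coordinates could be written to a text file for an external plotting tool.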
Input, Output and Processing
The required input is a set of two or more genome sequences in FASTA format, which can either be pasted into the webpage or uploaded as a text file. The size limit for each input sequence is 10 MB. The user must also supply a “word length”; the frequencies of all words of that length in the sequences are calculated. The user can optionally specify an out-group for construction of the Neighbor Joining (NJ) tree using Phylip (Felsenstein, 2005). Sequences are converted to coordinates for plotting as CGRs. The CGR for each sequence can be viewed in the browser, and the image can be downloaded in PNG format. The coordinates can also be downloaded, so that the CGRs can be visualized in any graph plotting software such as GNUPLOT. The CGRs are then used to calculate the frequencies of words at the user-specified word length. The frequencies of all words in each sequence can be downloaded as a text file for additional processing. Based on these word frequencies, the pairwise Euclidean distance is calculated between all pairs of sequences in the input data. The resulting distance matrix can be viewed in the browser or downloaded as a text file. It is automatically used as input for Phylip, and the NJ tree built from it can be downloaded in Newick format. The web server also displays the tree in the browser using the software Notung (version 2.6) (Durand et al., 2006). The web server assigns a unique job ID to each submission and displays it along with the link to the results. The results for each submission can be accessed later (for up to 48 hours) using the job ID.
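The word-frequency and distance steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the server's implementation: the function names are hypothetical, words are counted directly with a sliding window (which is expected to correspond to counting CGR points per grid cell at the chosen word length), counts are normalised to relative frequencies, and words containing symbols other than A, C, G, T are ignored.

```python
from itertools import product
from math import sqrt


def word_frequencies(sequence, k):
    """Relative frequency of every DNA word of length k, in a fixed order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = sequence.upper()
    total = 0
    # Slide a window of length k over the sequence (overlapping words).
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: words with ambiguous bases are skipped
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    # Assumption: frequencies are normalised so sequences of different
    # lengths are comparable.
    return [counts[w] / total for w in words]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(sequences, k):
    """Pairwise Euclidean distances between word-frequency profiles."""
    profiles = [word_frequencies(s, k) for s in sequences]
    return [[euclidean(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    demo = ["ACGTACGTAC", "ACGTTTGTAC", "GGGGCCCCGG"]
    for row in distance_matrix(demo, 2):
        print(" ".join(f"{d:.4f}" for d in row))
```

A matrix of this kind is symmetric with zeros on the diagonal, which is the form a distance-based tree-building program such as Phylip's Neighbor Joining expects once sequence names are attached.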
References
1. Jeffrey HJ: Chaos game representation of gene structure. Nucleic Acids Res 1990, 18:2163–2170.
2. Pandit A and Sinha S: Using genomic signatures for HIV-1 sub-typing. BMC Bioinformatics 2010, 11(Suppl 1):S26.
3. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. Department of Genome Sciences, University of Washington, Seattle, 2005.
4. Durand D, Halldorsson BV and Vernot B: A hybrid micro-macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology 2006, 13(2):320–335.
5. Goldman N: Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res 1993, 21:2487–2491.
6. Almeida JS, Carriço JA, Maretzek A, Noble PA and Fletcher M: Analysis of genomic sequences by chaos game representation. Bioinformatics 2001, 17:429–437.
7. Wang Y, Hill K, Singh S and Kari L: The spectrum of genomic signatures: from di-nucleotides to chaos game representation. Gene 2005, 346:173–185.
8. Pandit A et al.: Multifractal analysis of HIV-1 genomes. Molecular Phylogenetics and Evolution 2012, 62:756–763.