Efficient coding of DNA

Authors

  • Boštjan MUROVEC Univ. of Ljubljana, Fac. of Electrical Engineering, Laboratory for Computer Integrated Manufacturing, Tržaška cesta 25, SI-1000 Ljubljana, Slovenia
  • Blaž STRES Univ. of Ljubljana, Biotechnical Fac., Dept. of Animal Science, Groblje 3, SI-1230 Domžale, Slovenia

DOI:

https://doi.org/10.14720/aas.2008.92.2.15147

Keywords:

molecular genetics, bioinformatics, DNA sequences, coding

Abstract

Microcomputers have become ubiquitous tools for DNA research and analysis. Before DNA sequences can be fed into computer programs, they need to be suitably coded, which is usually done in the widely accepted FASTA format. Under this scheme, a DNA sequence is represented as an ASCII string of the four nucleotide characters A, G, C and T, possibly extended with additional codes for degenerate sites and with a character code for FASTA blanks when dealing with aligned DNA sequences. The FASTA representation is intuitive for biologists, and it eases program development because developers can draw on a myriad of available libraries for working with ASCII strings. Despite these advantages, the FASTA format has certain drawbacks. The first is inefficient substring searching, especially in the presence of degenerate codes. The second is inefficient storage of FASTA blank characters, since each such character occupies one byte of memory; an excessive number of blanks also slows substring searching. To address these drawbacks, we propose an alternative coding of DNA sequences that enables faster substring searching and efficient storage of FASTA blanks, so that a larger set of DNA sequences can be held in a computer's working memory and processed faster.
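
The abstract does not spell out the proposed coding, so the following is only a minimal sketch of one way the two stated goals could be approached, not the authors' scheme. It assumes that each nucleotide is stored as a 4-bit mask (A=1, C=2, G=4, T=8), that a degenerate IUPAC code is the bitwise OR of the bases it stands for, and that blanks (written here as "-") are kept out of the symbol stream as (position, run length) pairs. The names IUPAC_MASK, encode, decode and find_all are hypothetical.

```python
# Illustrative sketch only; the bit layout and gap representation are assumptions,
# not the coding described in the paper.
IUPAC_MASK = {
    "A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000,
    "R": 0b0101,  # A or G
    "Y": 0b1010,  # C or T
    "S": 0b0110,  # C or G
    "W": 0b1001,  # A or T
    "K": 0b1100,  # G or T
    "M": 0b0011,  # A or C
    "B": 0b1110,  # C, G or T
    "D": 0b1101,  # A, G or T
    "H": 0b1011,  # A, C or T
    "V": 0b0111,  # A, C or G
    "N": 0b1111,  # any base
}
MASK_IUPAC = {mask: code for code, mask in IUPAC_MASK.items()}


def encode(aligned, gap="-"):
    """Split an aligned sequence into nucleotide masks and runs of blanks.

    Returns (codes, gap_runs), where gap_runs holds (ungapped_position,
    run_length) pairs, so a long run of blanks costs one pair, not one byte each.
    """
    codes = bytearray()
    gap_runs = []
    for ch in aligned.upper():
        if ch == gap:
            if gap_runs and gap_runs[-1][0] == len(codes):
                gap_runs[-1][1] += 1  # extend the current run of blanks
            else:
                gap_runs.append([len(codes), 1])  # start a new run
        else:
            codes.append(IUPAC_MASK[ch])
    return bytes(codes), [tuple(run) for run in gap_runs]


def decode(codes, gap_runs, gap="-"):
    """Rebuild the aligned string from masks and blank runs."""
    runs = dict(gap_runs)
    out = []
    for i, mask in enumerate(codes):
        out.append(gap * runs.get(i, 0))
        out.append(MASK_IUPAC[mask])
    out.append(gap * runs.get(len(codes), 0))  # trailing blanks, if any
    return "".join(out)


def find_all(codes, pattern):
    """Return start positions where pattern is compatible with codes.

    Two symbols are compatible when their masks share at least one base,
    so degenerate codes need no special-case comparison logic.
    """
    m = len(pattern)
    return [
        i
        for i in range(len(codes) - m + 1)
        if all(codes[i + j] & pattern[j] for j in range(m))
    ]


codes, gap_runs = encode("AC--GT-AYGT")
pattern, _ = encode("CGT")
hits = find_all(codes, pattern)
```

In this sketch the search runs over the ungapped symbols only, so the number of blanks does not affect its cost, and the degenerate code Y at ungapped position 5 matches C through a single AND. Each mask still occupies a full byte here; packing two 4-bit masks per byte would be a natural next step but is left out for brevity.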

Published

20. 12. 2008

Issue

Section

Original Scientific Article

How to Cite

MUROVEC, B., & STRES, B. (2008). Efficient coding of DNA. Acta Agriculturae Slovenica, 92(2), 151–162. https://doi.org/10.14720/aas.2008.92.2.15147