Clustering of Short Read Sequences for de novo Transcriptome Assembly

Document Type: Original Research Papers


1 Department of Algorithms and Computation, University of Tehran, Tehran, Iran

2 Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran

3 National Telecom Research Center, Tehran, Iran

4 National Institute of Genetic Engineering and Biotechnology (NIGEB), Tehran 14155-6346, Iran


Transcriptome analysis is important in many biological studies, and the vast amount of whole
transcriptome sequencing data now available calls for algorithms that can assemble it. In this
study, we propose an algorithm for
transcriptome assembly in the absence of a reference genome. First, contiguous sequences
(contigs) are generated using de Bruijn graphs with different k-mer lengths. Then, the resulting
mixture of sequences is merged to form the final sequences. Lastly, the contiguous sequences
are clustered and the isoform groups are reported. The proposed algorithm is capable of
generating long contiguous sequences and accurately clustering them into isoform groups. To
evaluate our algorithm, we applied it to a simulated RNA-seq dataset of the rat transcriptome and a
real RNA-seq experiment on the Loricaria gr. cataphracta transcriptome. The correctness of the
assembled contigs was more than 95%, and our algorithm was able to reconstruct over 70% of
the transcripts at more than 80% of the transcripts’ lengths. This study demonstrates that
applying a sophisticated merging method improves transcriptome assembly. The source code is
available from the corresponding author upon request by email.
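
The three stages named in the abstract (de Bruijn contigs at several k-mer lengths, merging, and clustering) can be illustrated with a minimal sketch. It is not the authors' implementation, which is not given here. The function names are hypothetical, the merging step is replaced by a simple stand-in that discards contigs contained in longer ones, and the clustering step is replaced by a stand-in that groups contigs sharing at least one substring of length w. Reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def unitigs(graph):
    """Spell every maximal non-branching path as a contig.

    Isolated cycles are skipped to keep the sketch short.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indeg[s] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(n):
        # Exactly one incoming and one outgoing edge.
        return indeg[n] == 1 and len(graph.get(n, ())) == 1

    contigs = []
    for node in nodes:
        if is_simple(node):
            continue  # paths start only at branching or terminal nodes
        for succ in graph.get(node, ()):
            path, cur = node + succ[-1], succ
            while is_simple(cur):
                cur = next(iter(graph[cur]))
                path += cur[-1]
            contigs.append(path)
    return contigs


def assemble_multi_k(reads, ks):
    """Pool contigs over several k values, then drop contained contigs.

    Assumption: containment filtering stands in for the paper's merging method.
    """
    pool = set()
    for k in ks:
        pool.update(unitigs(build_de_bruijn(reads, k)))
    kept = []
    for c in sorted(pool, key=len, reverse=True):
        if not any(c in longer for longer in kept):
            kept.append(c)
    return kept


def cluster_by_shared_kmers(contigs, w):
    """Group contigs that share a length-w substring (union-find).

    Assumption: a stand-in for the paper's isoform clustering.
    """
    parent = list(range(len(contigs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}
    for idx, c in enumerate(contigs):
        for i in range(len(c) - w + 1):
            kmer = c[i:i + w]
            if kmer in owner:
                parent[find(idx)] = find(owner[kmer])
            else:
                owner[kmer] = idx
    groups = defaultdict(list)
    for idx, c in enumerate(contigs):
        groups[find(idx)].append(c)
    return sorted(sorted(g) for g in groups.values())


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
contigs = assemble_multi_k(reads, ks=[4, 5])
clusters = cluster_by_shared_kmers(["ATGGCGT", "GCGTGCA", "TTTTTTT"], w=4)
```

In this toy input the three overlapping reads come from a single sequence with no repeated 3-mers or 4-mers, so both k values yield the same single contig and the containment filter keeps one copy. Real data would also need error correction and reverse-complement handling, which the sketch omits.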

