Rosalind: Python solutions to common problems in Bioinformatics

Rosalind Franklin was an outstanding, successful researcher in several fields of knowledge. Specifically, she made great contributions to the understanding of the fine molecular structures of DNA, RNA, viruses, coal and graphite.

Franklin is best known for her X-ray diffraction images of DNA, which led to the discovery of the DNA double helix. Her images confirming the helical structure of DNA were shown to Watson without her approval or knowledge, and they gave rise to Crick and Watson’s 1953 hypothesis regarding the structure of DNA. Her work was published third in the series of three DNA articles in Nature, led by the paper of Watson and Crick, which only hinted at her contribution to their hypothesis.

Rosalind Franklin is often associated only with this infamous event, which is really unfair considering how great a scientist (and role model) she was. The Rosalind Project is just a small tribute to her importance. Rosalind is a platform for learning bioinformatics through computational problems of varying difficulty, drawn from real challenges in molecular biology.

Created by the University of California at San Diego and Saint Petersburg Academic University, and inspired by Project Euler, Rosalind aims to inspire a new generation of bioinformatics students to develop vital programming skills at their own pace in a unique environment.

On this page you will find my personal solutions to the problems, using only Python. They have been part of my process of learning Python, so the code can be messy or unoptimized (sorry in advance). Every solution comes with short explanations, so it can serve as a Python learning experience, just as it was for me. I’ve also tried to include simple, brief explanations of some biology concepts that can be tricky to understand.

I strongly encourage you to try and solve the problems without looking at any solution. Even if at first sight a problem feels overwhelming, give it a try and come back a few days later. You will be surprised at all the things you are capable of doing and you will learn even more.

And just remember, if everything goes wrong, here you’ll find some help. :)

This project is still a work in progress. Make sure to come back frequently for the latest updates!


| ID | Title | Python Solution |
|----|-------|-----------------|
| DNA | Counting DNA Nucleotides | Rosalind_DNA |
| RNA | Transcribing DNA into RNA | Rosalind_RNA |
| REVC | Complementing a Strand of DNA | Rosalind_REVC |
| FIB | Rabbits and Recurrence Relations | Rosalind_FIB |
| GC | Computing GC Content | Rosalind_GC |
| HAMM | Counting Point Mutations | Rosalind_HAMM |
| IPRB | Mendel’s First Law | Rosalind_IPRB |
| PROT | Translating RNA into Protein | Rosalind_PROT |
| SUBS | Finding a Motif in DNA | Rosalind_SUBS |
| CONS | Consensus and Profile | Rosalind_CONS |
| FIBD | Mortal Fibonacci Rabbits | Rosalind_FIBD |
| IEV | Calculating Expected Offspring | Rosalind_IEV |
| LIA | Independent Alleles | Rosalind_LIA |
| GRPH | Overlap Graphs | Rosalind_GRPH |
| LCSM | Finding a Shared Motif | Rosalind_LCSM |
| MPRT | Finding a Protein Motif | Rosalind_MPRT |
| MRNA | Inferring mRNA from Protein | Rosalind_MRNA |
| ORF | Open Reading Frames | Rosalind_ORF |
| PERM | Enumerating Gene Orders | Rosalind_PERM |
| PRTM | Calculating Protein Mass | Rosalind_PRTM |
| REVP | Locating Restriction Sites | Rosalind_REVP |
| SPLC | RNA Splicing | Rosalind_SPLC |
| PROB | Introduction to Random Strings | Rosalind_PROB |
| LEXF | Enumerating k-mers Lexicographically | Rosalind_LEXF |
| LGIS | Longest Increasing Subsequence | Rosalind_LGIS |
| LONG | Genome Assembly as Shortest Superstring | Rosalind_LONG |
| PMCH | Perfect Matchings and RNA Secondary Structures | Rosalind_PMCH |
| PPER | Partial Permutations | Rosalind_PPER |
| SIGN | Enumerating Oriented Gene Orderings | Rosalind_SIGN |
| SSEQ | Finding a Spliced Motif | |
| TRAN | Transitions and Transversions | |
| TREE | Completing a Tree | |
| CAT | Catalan Numbers and RNA Secondary Structures | |
| CORR | Error Correction in Reads | |
| INOD | Counting Phylogenetic Ancestors | |
| KMER | k-Mer Composition | |
| KMP | Speeding Up Motif Finding | |
| LCSQ | Finding a Shared Spliced Motif | |
| LEXV | Ordering Strings of Varying Length Lexicographically | |
| MMCH | Maximum Matchings and RNA Secondary Structures | |
| PDST | Creating a Distance Matrix | |
| REAR | Reversal Distance | |
| RSTR | Matching Random Motifs | |
| SSET | Counting Subsets | |
| ASPC | Introduction to Alternative Splicing | |
| EDIT | Edit Distance | |
| EVAL | Expected Number of Restriction Sites | |
| MOTZ | Motzkin Numbers and RNA Secondary Structures | |
| NWCK | Distances in Trees | |
| SCSP | Interleaving Two Motifs | |
| SETO | Introduction to Set Operations | |
| SORT | Sorting by Reversals | |
| SPEC | Inferring Protein from Spectrum | |
| TRIE | Introduction to Pattern Matching | |
| CONV | Comparing Spectra with the Spectral Convolution | |
| CTBL | Creating a Character Table | |
| DBRU | Constructing a De Bruijn Graph | |
| EDTA | Edit Distance Alignment | |
| FULL | Inferring Peptide from Full Spectrum | |
| INDC | Independent Segregation of Chromosomes | |
| ITWV | Finding Disjoint Motifs in a Gene | |
| LREP | Finding the Longest Multiple Repeat | |
| NKEW | Newick Format with Edge Weights | |
| RNAS | Wobble Bonding and RNA Secondary Structures | |
| AFRQ | Counting Disease Carriers | |
| CSTR | Creating a Character Table from Genetic Strings | |
| CTEA | Counting Optimal Alignments | |
| CUNR | Counting Unrooted Binary Trees | |
| GLOB | Global Alignment with Scoring Matrix | |
| PCOV | Genome Assembly with Perfect Coverage | |
| PRSM | Matching a Spectrum to a Protein | |
| QRT | Quartets | |
| SGRA | Using the Spectrum Graph to Infer Peptides | |
| SUFF | Encoding Suffix Trees | |
| CHBP | Character-Based Phylogeny | |
| CNTQ | Counting Quartets | |
| EUBT | Enumerating Unrooted Binary Trees | |
| GASM | Genome Assembly Using Reads | |
| GCON | Global Alignment with Constant Gap Penalty | |
| LING | Linguistic Complexity of a Genome | |
| LOCA | Local Alignment with Scoring Matrix | |
| MEND | Inferring Genotype from a Pedigree | |
| MGAP | Maximizing the Gap Symbols of an Optimal Alignment | |
| MREP | Identifying Maximal Repeats | |
| MULT | Multiple Alignment | |
| PDPL | Creating a Restriction Map | |
| ROOT | Counting Rooted Binary Trees | |
| SEXL | Sex-Linked Inheritance | |
| SPTD | Phylogeny Comparison with Split Distance | |
| WFMD | The Wright-Fisher Model of Genetic Drift | |
| ALPH | Alignment-Based Phylogeny | |
| ASMQ | Assessing Assembly Quality with N50 and N75 | |
| CSET | Fixing an Inconsistent Character Set | |
| EBIN | Wright-Fisher’s Expected Behavior | |
| FOUN | The Founder Effect and Genetic Drift | |
| GAFF | Global Alignment with Scoring Matrix and Affine Gap Penalty | |
| GREP | Genome Assembly with Perfect Coverage and Repeats | |
| OAP | Overlap Alignment | |
| QRTD | Quartet Distance | |
| SIMS | Finding a Motif with Modifications | |
| SMGB | Semiglobal Alignment | |
| KSIM | Finding All Similar Motifs | |
| LAFF | Local Alignment with Affine Gap Penalty | |
| OSYM | Isolating Symbols in Alignments | |
| RSUB | Identifying Reversing Substitutions | |

Thank you for your time and attention. Feel free to reach out if you have any questions or inquiries. I’d love to hear from you!

David Boo

Where biology, informatics & statistics intersect
