Rosalind: Python solutions to common problems in Bioinformatics

Saturday, January 23, 2021 - 11 mins

Rosalind Franklin was an outstanding, successful researcher in several fields of knowledge. Specifically, she made great contributions to the understanding of the fine molecular structures of DNA, RNA, viruses, coal and graphite.

Franklin is best known for her work on the X-ray diffraction images of DNA, which led to the discovery of the DNA double helix. Her X-ray diffraction images confirming the helical structure of DNA were shown to Watson without her approval or knowledge, and gave birth to Crick and Watson’s 1953 hypothesis regarding the structure of DNA. Her work was published third in the series of three DNA articles in Nature, led by the paper of Watson and Crick, which only hinted at her contribution to their hypothesis.

Rosalind Franklin is often associated only with this infamous event, and that is really unfair considering how great a scientist (and role model) she was. The Rosalind Project is just a small tribute to her importance. Rosalind is a platform for learning bioinformatics through computational problems of varying difficulty that are drawn from real challenges in molecular biology.

Created by the University of California at San Diego and Saint Petersburg Academic University, and inspired by Project Euler, Rosalind tries to inspire a new generation of bioinformatics students to develop vital programming skills at their own pace in a unique environment.

On this page you will find my personal solutions to the problems, using only Python. They have been part of my process of learning Python, so the code can be messy or not optimized (sorry in advance). Every solution comes with small explanations, so it can serve as a Python learning experience, just as it was for me. Also, I’ve tried to include simple, brief explanations of some biology concepts that can be tricky to understand.

I strongly encourage you to try and solve the problems without looking at any solution. Even if at first sight a problem feels overwhelming, give it a try and come back a few days later. You will be surprised at all the things you are capable of doing and you will learn even more.

And just remember, if everything goes wrong, here you’ll find some help. :)

This project is still a work in progress. Make sure to come back frequently for the latest updates!

ID	Title	Python Solution
DNA	Counting DNA Nucleotides	Rosalind_DNA
RNA	Transcribing DNA into RNA	Rosalind_RNA
REVC	Complementing a Strand of DNA	Rosalind_REVC
FIB	Rabbits and Recurrence Relations	Rosalind_FIB
GC	Computing GC Content	Rosalind_GC
HAMM	Counting Point Mutations	Rosalind_HAMM
IPRB	Mendel’s First Law	Rosalind_IPRB
PROT	Translating RNA into Protein	Rosalind_PROT
SUBS	Finding a Motif in DNA	Rosalind_SUBS
CONS	Consensus and Profile	Rosalind_CONS
FIBD	Mortal Fibonacci Rabbits	Rosalind_FIBD
IEV	Calculating Expected Offspring	Rosalind_IEV
LIA	Independent Alleles	Rosalind_LIA
GRPH	Overlap Graphs	Rosalind_GRPH
LCSM	Finding a Shared Motif	Rosalind_LCSM
MPRT	Finding a Protein Motif	Rosalind_MPRT
MRNA	Inferring mRNA from Protein	Rosalind_MRNA
ORF	Open Reading Frames	Rosalind_ORF
PERM	Enumerating Gene Orders	Rosalind_PERM
PRTM	Calculating Protein Mass	Rosalind_PRTM
REVP	Locating Restriction Sites	Rosalind_REVP
SPLC	RNA Splicing	Rosalind_SPLC
PROB	Introduction to Random Strings	Rosalind_PROB
LEXF	Enumerating k-mers Lexicographically	Rosalind_LEXF
LGIS	Longest Increasing Subsequence	Rosalind_LGIS
LONG	Genome Assembly as Shortest Superstring	Rosalind_LONG
PMCH	Perfect Matchings and RNA Secondary Structures	Rosalind_PMCH
PPER	Partial Permutations	Rosalind_PPER
SIGN	Enumerating Oriented Gene Orderings	Rosalind_SIGN
SSEQ	Finding a Spliced Motif
TRAN	Transitions and Transversions
TREE	Completing a Tree
CAT	Catalan Numbers and RNA Secondary Structures
CORR	Error Correction in Reads
INOD	Counting Phylogenetic Ancestors
KMER	k-Mer Composition
KMP	Speeding Up Motif Finding
LCSQ	Finding a Shared Spliced Motif
LEXV	Ordering Strings of Varying Length Lexicographically
MMCH	Maximum Matchings and RNA Secondary Structures
PDST	Creating a Distance Matrix
REAR	Reversal Distance
RSTR	Matching Random Motifs
SSET	Counting Subsets
ASPC	Introduction to Alternative Splicing
EDIT	Edit Distance
EVAL	Expected Number of Restriction Sites
MOTZ	Motzkin Numbers and RNA Secondary Structures
NWCK	Distances in Trees
SCSP	Interleaving Two Motifs
SETO	Introduction to Set Operations
SORT	Sorting by Reversals
SPEC	Inferring Protein from Spectrum
TRIE	Introduction to Pattern Matching
CONV	Comparing Spectra with the Spectral Convolution
CTBL	Creating a Character Table
DBRU	Constructing a De Bruijn Graph
EDTA	Edit Distance Alignment
FULL	Inferring Peptide from Full Spectrum
INDC	Independent Segregation of Chromosomes
ITWV	Finding Disjoint Motifs in a Gene
LREP	Finding the Longest Multiple Repeat
NKEW	Newick Format with Edge Weights
RNAS	Wobble Bonding and RNA Secondary Structures
AFRQ	Counting Disease Carriers
CSTR	Creating a Character Table from Genetic Strings
CTEA	Counting Optimal Alignments
CUNR	Counting Unrooted Binary Trees
GLOB	Global Alignment with Scoring Matrix
PCOV	Genome Assembly with Perfect Coverage
PRSM	Matching a Spectrum to a Protein
QRT	Quartets
SGRA	Using the Spectrum Graph to Infer Peptides
SUFF	Encoding Suffix Trees
CHBP	Character-Based Phylogeny
CNTQ	Counting Quartets
EUBT	Enumerating Unrooted Binary Trees
GASM	Genome Assembly Using Reads
GCON	Global Alignment with Constant Gap Penalty
LING	Linguistic Complexity of a Genome
LOCA	Local Alignment with Scoring Matrix
MEND	Inferring Genotype from a Pedigree
MGAP	Maximizing the Gap Symbols of an Optimal Alignment
MREP	Identifying Maximal Repeats
MULT	Multiple Alignment
PDPL	Creating a Restriction Map
ROOT	Counting Rooted Binary Trees
SEXL	Sex-Linked Inheritance
SPTD	Phylogeny Comparison with Split Distance
WFMD	The Wright-Fisher Model of Genetic Drift
ALPH	Alignment-Based Phylogeny
ASMQ	Assessing Assembly Quality with N50 and N75
CSET	Fixing an Inconsistent Character Set
EBIN	Wright-Fisher’s Expected Behavior
FOUN	The Founder Effect and Genetic Drift
GAFF	Global Alignment with Scoring Matrix and Affine Gap Penalty
GREP	Genome Assembly with Perfect Coverage and Repeats
OAP	Overlap Alignment
QRTD	Quartet Distance
SIMS	Finding a Motif with Modifications
SMGB	Semiglobal Alignment
KSIM	Finding All Similar Motifs
LAFF	Local Alignment with Affine Gap Penalty
OSYM	Isolating Symbols in Alignments
RSUB	Identifying Reversing Substitutions
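
To give a feel for the kind of task in the table, the first problem (DNA) asks for the number of times each of the four nucleotides A, C, G and T appears in a DNA string. Below is a minimal sketch of one way to approach it; it may differ from the linked Rosalind_DNA solution, and the function name `count_nucleotides` and the short example string are just illustrative choices.

```python
from collections import Counter


def count_nucleotides(dna: str) -> tuple:
    """Return the counts of A, C, G and T (in that order) in a DNA string."""
    # Counter tallies every character in one pass; missing keys count as 0.
    counts = Counter(dna.upper())
    return counts["A"], counts["C"], counts["G"], counts["T"]


# Short made-up example string (not an official Rosalind dataset).
print(*count_nucleotides("AGCTTTTCA"))  # prints: 2 2 1 4
```

Using `Counter` avoids calling `str.count` four times, although for strings this short either approach is fine.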

Thank you for your time and attention. Feel free to reach out if you have any questions or inquiries. I’d love to hear from you!

David Boo

Where biology, informatics & statistics intersect
