Biopython Tutorial (I): from beginner to advanced.

Saturday. January 09, 2021 - 13 mins

Biopython is a Python library for reading and writing many common biological data formats and on this post we will be explaining some basic and advanced concepts and tricks. It is a extremely powerful library which provides a wide range of functions and can be used from working with sequences to evolutionary genomics or structural proteomics. And many more!

This is the first part in a series of posts in which we will be covering these contents. I hope you enjoy and learn something new today.

1. First Steps

What is Biopython
Installing and upgrading

2. Working with Sequences

Creating a sequence
Parsing a sequence file
Counting sequence length & number of occurrences of a nucleotide
Calculating GC-content
Calculating molecular weight
Get the reverse complement of a sequence, transcription & translation
Slicing a sequence
Concatenating sequences
Finding the starting index of a subsequence
Writing sequences to a file
Converting a FASTQ file to FASTA file & other formats

3. NCBI Entrez databases

General Guidelines
Accessing Pubmed, Medline, Genbank & others

4. BLAST

Running a web BLAST
Parsing a BLAST output

5. Multiple Sequence Alignment

Pairwise sequence alignment
Multiple sequence alignment: creating alignment using different algorithms: ClustalW, MUSCLE…
Reading a MSA

6. Phylogenetics

Constructing a phylogenetic tree
Modifying an existing tree

7. Sequence motif analysis

8. PDB: 3D structure protein analysis

Count atoms in a PDB structure

1. First Steps

Our first step is installing Biopython. The easiest way is to Biopython is using pip package such as follows. (Please note that Biopython also requires NumPy package, which will be installed automatically if you install Biopython with pip).

If you already have Biopython installed you can upgrade it using pip aswell.

pip install biopython
pip install --upgrade biopython

You can test if Biopython is properly installed by executing:

import Bio

If no error messages appear, then everything is working and you’re all set.

2. Working with sequences

Creating a sequence

The easiest way to create a sequence is using the Biopython Seq object. The bio.seq module is really useful and we will make use of it constantly. You can check the Wiki page for more information and advice. Here is a small example.

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print("Sequence is", seq_example)
>>> Sequence is AGTACACTGGT

Parsing a sequence file

Biopython’s SeqIO (Sequence Input/Output) can be used to read sequence files. The main function is Bio.SeqIO.parse(), which requires a filename and format and returns a SeqRecord iterator. This is an example of parsing a FASTA file.

from Bio import SeqIO
for record in SeqIO.parse("example_sequence.fasta", "fasta"):
    print(record.seq)

record.id will return the identifier of the sequence. record.seq will return the sequence itself. record.description will return the sequence description.

Counting sequence length & number of occurrences of a nucleotide

Counting sequence length is really easy using len(). Using our previous example:

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print("Sequence is", seq_example)
>>> Sequence is AGTACACTGGT
print("The length of the sequence is", len(seq_example))
>>> The length of the sequence is 11

You can count numer of occurrences of a certain nucleotide as follows: (Make sure to create/load sequence as we’ve done earlier)

print(seq_example.count("C"))
>>> 2

Calculating GC-content

GC content is a measure of the percentage of guanine (G) or cytosine (C) related to the total nitrogenous bases (A+T on DNA or A+U on RNA). GC content is important because it indicates a higher melting temperature and also key factor in bacteria taxonomy. Biopython has a Bio.SeqUtils package that simplifies this task. Feel free to check wiki for more information.

from Bio.SeqUtils import GC
from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print(GC(seq_example))
>>> 45.45

Calculating molecular weight

Again, molecular weight can be easily calculated using Bio.SeqUtils as follows:

from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print(molecular_weight(seq_example))
>>> 3436.19

Get the reverse complement of a sequence, transcription & translation

Using Biopython these three processes ar really straightforward. To get the reverse complement use the single function reverse_complement().

print("Reverse complement:", seq_example.reverse_complement())
>>> The reverse complement if the sequence is ACCAGTGTACT

Transcription is also a single function, transcribe()

print("Transcription:", seq_example.transcribe())
>>> Transcription: AGUACACUGGU

Translation is again another function, translate(). Please note that if your sequence is not suitable for correct translation you will recieve a warning from Biopython (our example gets this warning). This happens if your sequence is not a multiple of three, which makes partial codons on translation (we’ll come back to this later). You can have more information about this biological process here

print("Translation:", seq_example.translate())
>>> Translation: STL

Slicing a sequence

With Biopython you can slice effectively sequences and extract any part of it.

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print(seq_example[4:9])
>>> CACTG

It follows the usual indexing conventions for Python strings, with the first element of the sequence numbered 0. When you do a slice the first item is included (i.e. 4) and the last is excluded (9 in this case). We can get the third codon positions of this DNA sequence:

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print(seq_example[0::3])
>>> AACG

Concatenating sequences

You can add any two Seq objects together

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
seq_add = Seq("TTTT")
print(seq_example + seq_add)
>>> AGTACACTGGTTTTT

Biopython Seq also has a .join method which can be really useful

from Bio.Seq import Seq
fragments = [Seq("AGTA"), Seq("CACT"), Seq("GGT")]
filler = Seq("G"*10)
print(filler.join(fragments))
>>> AGTAGGGGGGGGGGCACTGGGGGGGGGGGGT

Find the starting index of a subsequence

Easiest way to find starting index of a subsequence is using find()

from Bio.Seq import Seq
seq_example = Seq("AGTACACTGGT")
print("CAC index:", seq_example.find("CAC"))
>>> CAC index: 4

Writing sequences to a file

SeqIO (Sequence Input/Output) can also be used to write sequences to files. Our main function is SeqIO.write() and it allows you to save sequences as well as id or descriptions (See below for instructions on how to enter Genbank and other NCBI tools). Here is an example:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)

rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)

rec3 = SeqRecord(
    Seq(
        "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
        "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
        "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
        "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
        "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
        "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
        "TGEGLEWGVLFGFGPGLTVETVVLHSVAT",
    ),
    id="gi|13925890|gb|AAK49457.1|",
    description="chalcone synthase [Nicotiana tabacum]",
)

my_records = [rec1, rec2, rec3]
SeqIO.write(my_records, "seq_examples.faa", "fasta")

Output will be a fasta file with your sequences and records in a fasta file.

Converting a FASTQ file to FASTA file & other formats

We need to convert DNA data file formats in certain applications. You can use SeqIO.write() but the easiest way to perform this common task is just SeqIO.convert() . See this example changing GenBank format to FASTA:

from Bio import SeqIO
count = SeqIO.convert("ls_orchid.gbk", "genbank", "orchid_converted.fasta", "fasta")
print("Converted %i records" % count)

Same concept, converting from FASTQ to FASTA as follows.

from Bio import SeqIO
count = SeqIO.convert("ls_orchid.fastq", "fastq", "orchid_converted.fasta", "fasta")
print("Converted %i records" % count)

See help function to have a complete list of conversions you can do:

from Bio import SeqIO
help(SeqIO.convert)

3. NCBI Entrez databases

General Guidelines

Entrez is a data retrieval system that provides users access to NCBI’s databases such as PubMed, GenBank, GEO, and many others. Using Biopython’s module Bio.Entrez allows you to access Entrez and search & download records from within a Python script. You must provide an email to access Entrez - NCBI can block anonymous requests. Also, Bio.Entrez might not be the best option to download huge amounts of data as it can clog the servers.

from Bio import Entrez
Entrez.email = "davidboo@example.com"

Accessing PubMed, Medline, Genbank & others

To search any of these databases, we use Bio.Entrez.esearch() module, which requires some parameters:

handle = Entrez.esearch(db="value", term="keywords", retmax=100)

Where db are databases such as Pubmed, nucleotide, protein, snp, omim, unigene… keywords are what you are interested in looking at (organisms, Human, Biopython, urea…) and retmax are the number of identifies returned.

Let’s do an example accessing PubMed. We will be searching in PubMed for 20 publications that include breast cancer in their title.

from Bio import Entrez
Entrez.email = "davidboo@example.com"
handle = Entrez.esearch(db="pubmed", term="breast[title] AND cancer[title]", retmax=20, rettype="fasta", retmode="text")
records = Entrez.read(handle)
print(records["IdList"])

If you want to retrieve a full record from Entrez, then you should use Bio.Entrez.efetch()

from Bio import Entrez
Entrez.email = "davidboo@example.com"
handle = Entrez.efetch(db="nucleotide", id="KY398842.1", rettype="gb", retmode="text")
print(handle.read())

If you fetch the record in one of the formats accepted by Bio.SeqIO, you could directly parse it into a SeqRecord:

from Bio import SeqIO
from Bio import Entrez
Entrez.email = "davidboo@example.com"
handle = Entrez.efetch(db="nucleotide", id="KY398842.1", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(record.id)
>>> KY398842.1
print(record.name)
>>> KY398842
print(record.description)
>>> Rabbit hemorrhagic disease virus isolate RHDV BZ/2015 capsid structural protein VP60 gene, partial cds.
record.seq
>>> Seq('ATGGAGGGCAAAGCCCGCACAGCGCCGCAAGGCGAAGCAGCAGGCACTGCTAC...TGA')

That’s all for today! I hope you got an idea on how to use basic Biopython for sequence works.

Thank you for reading. Feel free to check out our next post for more information on Biopython and many other things.

Let me know if you have any feedback or issues, I’d love to hear from you!

David Boo

Where biology, informatics & statistics intersect

Biopython Tutorial (I): from beginner to advanced.

1. First Steps

2. Working with Sequences

3. NCBI Entrez databases

4. BLAST

5. Multiple Sequence Alignment

6. Phylogenetics

7. Sequence motif analysis

8. PDB: 3D structure protein analysis

1. First Steps

2. Working with sequences

3. NCBI Entrez databases

Related Posts

David Boo