Python

How to measure DNA similarity with Python and Dynamic Programming

*Note, if you want to skip the background / alignment calculations and go straight to where the code begins, just click here.

Dynamic Programming and DNA

Dynamic programming has many uses, including identifying the similarity between two different strands of DNA or RNA, protein alignment, and in various other applications in bioinformatics (in addition to many other fields). For anyone less familiar, dynamic programming is a coding paradigm that solves recursive problems by breaking them down into sub-problems using some type of data structure to store the sub-problem results. In this way, recursive problems (like the Fibonacci sequence for example) can be programmed much more efficiently because dynamic programming allows you to avoid duplicate (and hence, wasteful) calculations in your code. Click here to read more about dynamic programming.

Let’s dive into the main topic of this post by implementing an algorithm to measure similarity between two strands of DNA.

Motivation

Measuring the similarity between two different sequences of DNA is very useful because it can help tell us how closely related (or not) those sequences of DNA and their sources are (e.g. comparing the DNA of two different species, or two different genes). This helps to give insight into the structure and function of a newly identified gene or DNA sequence.

Background

For our similarity example, we will measure the global alignment between two strands of DNA, and then show how this can be done more generally with dynamic programming. “Global” means we will use the entirety of each sequence of DNA. “Alignment” means we are trying to align the sequences of DNA such that some score is maximized based upon how many matches / mismatches / gaps we have between the strands. By aligning the DNA sequences pairwise, we can create gaps in between nucleotides to create better alignments. We’ll get more into that in a moment. There are other ways of assessing DNA similarity, and I may cover some of them in a future article, but for now we’ll stick with global alignment.

DNA is made of four types of nucleotides – adenine, guanine, cytosine, and thymine. These are typically abbreviated by the letters A, G, C, and T, respectively. Programmatically, DNA can be represented as a string of characters, where each character must be one of A, G, C, or T. Suppose, then, that we have the two sequences of DNA as seen below.


Sequence 1 ==> G T C C A T A C A
Sequence 2 ==> T C A T A T C A G

How to measure the similarity between DNA strands

We need a metric to use for computing the global alignment between DNA strands. We’re going to use a scoring method (see below) in conjunction with the Needlman-Wunsch algorithm to figure out the optimal global alignment. An example of performing non-optimal global alignment on our sequences is the following:


Sequence 1 ==> G T C C A T A C A
Sequence 2 ==> T C A T A T C A G

In the above case we are merely lining up the sequences of DNA pairwise, and highlighting the matches between the sequences. In this way, we have two matches (hightlighted) and seven mismatches.

Let’s define the score we are trying to maximize as the following:

Add 1 for each match between the sequences
Subtract 1 for each mismatch
Subtract 1 for each gap i.e. insertion / deletion (shown by “-“)

The exact scoring metric (adding one for match, penalizing one for mismatch / gap) may vary between different scoring systems, but we’ll keep with our defined scoring methodology for the purposes of this post.

By this measure, our global alignment above gives us a score of 2 matches minus 7 mismatches = -5. We can improve this score if we take into account that each particular strand may have insertions or deletions i.e. we may shift the nucleotides (or characters) in each particular sequence to form a better alignment, provided we don’t actually reorder either sequence. In other words, we can shift nucleotides in a sequence as long as the order of the nucelotides doesn’t change. If that isn’t clear, the example below shows how we can achieve the maximum global alignment score by a modified alignment.


Sequence 1 ==> G T C C A T A - C A -
Sequence 2 ==> - T C - A T A T C A G

The highlighted letters (i.e. nucelotides) show where there is a match between the DNA strands. Here we have seven matches, and four mismatches.

Our score is 7 – 4 = 3. This is much better than our previous score of -5! Note how the order of the nucleotides (i.e. characters in the “DNA string”) did not change in either sequence; we merely created gaps which represent insertions / deletions. The reason doing this makes sense biologically is that insertions or deletions of nucleotides are common forms of mutations. For example, two strands of DNA may be related in function, but mutations caused extra nucleotides to be inserted (or deleted) from one or both sequences.

Now, to come up with a more generalized way of finding a maximum score solution, let’s construct a table like below so that one of the sequences runs across the header, while the other runs across the rows.

G T C C A T A C A
0 -1 -2 -3 -4 -5 -6 -7 -8 -9
T -1
C -2
A -3
T -4
A -5
T -6
C -7
A -8
G -9

Each particular cell’s value will represent the max score achieved by pairing each strand of DNA up until that many rows and columns. For example, the -3 in the top row (disregarding the header row) is the value -3 because that is the scored achieved by pairing a gap (“-“) with the first three nucelotides of the header DNA sequence, GTC.

Likewise, we get each of the values -1, …, -9 by pairing each successive subsequence of the header column to the gap, “-”

-, G ==> -1
-, GT ==> -2
-, GTC ==> -3,
-, GTCC ==> -4
-, GTCCA ==> -5
-, GTCCAT ==> -6
-, GTCCATA ==> -7
-, GTCCATAC ==> -8
-, GTCCATACA ==> -9

In each subsequence, every nucleotide is compared to “-“. Since no nucleotide in any of the subsequences matches “-“, the score of each matched to “-” is just -1 times the length of the subsequence.

Filling in another row and column gives us this:

G T C C A T A C A
0 -1 -2 -3 -4 -5 -6 -7 -8 -9
T -1 -1 0 -1 -2 -3 -4 -5 -6 -7
C -2 -2
A -3 -3
T -4 -4
A -5 -5
T -6 -6
C -7 -7
A -8 -8
G -9 -7

The calculations for the first entries of the new row can be seen below:



G ====> G ====> -1 = -1
T       T

G T ====> G T ====> -1 + 1 = 0
T         - T


G T C ====> G T C ====> -1 + 1 - 1 = -1
T           - T -


G T C C ====> G T C C ====> -1 + 1 - 1 - 1 = -2
T             - T - -


G T C C A ====> G T C C A ====> -1 + 1 - 1 - 1 - 1 = -3
T               - T - - -


G T C C A T ====> G T C C A T ====> -1 + 1 - 1 - 1 - 1 - 1 = -4
T                 - T - - - -


G T C C A T A ====> G T C C A T A ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 = -5
T                   - T - - - - - 




G T C C A T A C ====> G T C C A T A C ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 - 1 = -6
T                     - T - - - - - -



G T C C A T A C A ====> G T C C A T A C A ====> -1 + 1 - 1 - 1 - 1 - 1 - 1 - 1 - 1 = -7
T                       - T - - - - - - -


Finishing the rest of the table

Now we finish the rest of the table:

G T C C A T A C A
0 -1 -2 -3 -4 -5 -6 -7 -8 -9
T -1 -1 0 -1 -2 -3 -4 -5 -6 -7
C -2 -2 -1 1 0 -1 -2 -3 -4 -5
A -3 -3 -2 0 0 1 -1 -2 -3 -4
T -4 -4 -3 -1 -1 0 2 1 0 -1
A -5 -5 -4 -2 -1 0 1 2 1 0
T -6 -6 -5 -3 -3 0 1 2 1 1
C -7 -7 -6 -4 -2 -2 0 1 3 2
A -8 -8 -7 -5 -3 0 -1 1 2 4
G -9 -7 -8 -6 -4 -1 -1 1 1 3


The diagonal entries are calculated like this:



G ====> G ====> -1 = -1
T       T


G T ====> G T - ====> -1 + 1 - 1 = -1
T C       - T -


G T C ====> G T C - ====> -1 + 1 + 1 - 1 = 0
T C A       - T C A


G T C C ====> G T C C - ====> -1 + 1 + 1 - 1 - 1 = -1
T C A T       - T C A T


G T C C A ====> G T C C - A ====> -1 + 1 + 1 - 1 - 1 + 1 = 0
T C A T A       - T C A T A


G T C C A T ====> G T C C - A T ====> -1 + 1 + 1 - 1 - 1 + 1 + 1 = 1
T C A T A T       - T C A T A T


G T C C A T A ====> G T C C A T A - - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 - 1 = 1
T C A T A T C       - T C - A T A T C


G T C C A T A C ====> G T C C A T A - C - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 + 1 - 1 = 2
T C A T A T C A       - T C - A T A T C A


G T C C A T A C A ====> G T C C A T A - C A - ====> -1 + 1 + 1 - 1 + 1 + 1 + 1 - 1 + 1 + 1 - 1 = 3
T C A T A T C A G       - T C - A T A T C A G


Python implementation

Now – how do we actually implement this in Python? First, let’s initialize our table, which we’ll represent as a NumPy array. We’ll also define a function GET_SCORE that will test if two input nucleotides are equal or not – returning a reward (default 1) if they are and returning a penalty (default -1) if they are not.


import numpy as np

X = "GTCCATACA"
Y = "TCATATCAG"


def GET_SCORE(n1, n2, penalty = -1, reward = 1):
    
    if n1 == n2:
        return reward
    else:
        return penalty


# initialize score matrix
score_matrix = np.ndarray((len(X) + 1, len(Y) + 1))
 
for i in range(len(X) + 1):
 score_matrix[i, 0] = penalty * i

for j in range(len(Y) + 1):
 score_matrix[0, j] = penalty * j


Calculating the value of each cell

Next, we need to set the values of each of the remaining cells. The value of each cell is recursively based upon 1) the cells to the immediate left, above, and diagonally left; 2) whether there’s a match or gap between the nucleotides being compared (one from each sequence).

More specifically, the value of the cell in the i,j position of score_matrix (i.e. score_matrix[i, j]) is calculated by taking the max of the following:


score_matrix[i – 1, j – 1] + GET_SCORE(X[i – 1], Y[j – 1])
score_matrix[i, j – 1] + penalty
score_matrix[i – 1, j] + penalty

The indexes can be a bit confusing here, so let’s pause for a moment.

  • Remember our matrix is (len(X) + 1) x (len(Y) + 1)
  • X and Y represent the two respective DNA sequences.
  • Also, recall that Python is zero-indexed. This means that X[i – 1] refers to the last element of X if i equals the length of X, for example.


    Let’s look at a specific example. Suppose i = 4 and j = 4. Then:

    score_matrix[i, j] = score_matrix[4, 4]

    This equals the max of:


    score_matrix[3, 3] + GET_SCORE(X[3], Y[3])
    score_matrix[4, 3] + penalty
    score_matrix[3, 4] + penalty

    This is equivalent to calculating the score value for the below pairwise subsequences:

    
    G T C C
    T C A T
    
    

    Effectively, finding the score value for the above subsequences is found by first finding the score value for each of the following:

    
    G T C
    T C A
    
    
    G T C C
    T C A
    
    
    G T C
    T C A T
    

    For the first of the these, we just need to check what adding the final nucleotide to each respective subsequence does. For the second and third of these, we know that adding a gap to the 3-element subsequence in each case is the only possibility we’re considering.

    In other words, the reason why the score formula is defined this way is because it effectively checks the possible ways the first i – 1 nucleotides of the first DNA sequence (X) matched against the first j – 1 nucleotides of the second DNA sequence (Y) can be modified to maximize the score of aligning the first i nucleotides of X against the first j nucleotides of Y.

    score_matrix[3, 3] + GET_SCORE(X[3], Y[3]) = 0 – 1 = -1
    score_matrix[4, 3] + penalty = -1 – 1 = -2
    score_matrix[3, 4] + penalty = -1 – 1 = -2

    The max of (-1, -2, -2) is -1. Thus, score_matrix[4, 4] = -1.

    We can use the logic above to write the below Python code for filling out the rest of the score matrix:

    
    # define each cell in the matrix by as the max score possible in that stage
    for i in range(1, len(X) + 1):
     for j in range(1, len(Y) + 1):
      match = score_matrix[i - 1, j - 1] + GET_SCORE(X[i - 1], Y[j - 1], penalty, reward)
      delete = score_matrix[i -1, j] + penalty
      insert = score_matrix[i, j - 1] + penalty
      
      score_matrix[i, j] = max([match, delete, insert])
    
    
    

    We can add this code into a function, which we’ll call global_alignment. The remaining code in this function figures out the actual alignment of the DNA sequences (continued below).

    
    def GET_SCORE(n1, n2, penalty = -1, reward = 1):
        
        if n1 == n2:
            return reward
        else:
            return penalty
    
    
    def global_alignment(X, Y, penalty = -1, reward = 1):
        
        # initialize score matrix
        score_matrix = np.ndarray((len(X) + 1, len(Y) + 1))
         
        for i in range(len(X) + 1):
            score_matrix[i, 0] = penalty * i
        
        for j in range(len(Y) + 1):
            score_matrix[0, j] = penalty * j
            
        
        # define each cell in the matrix by as the max score possible in that stage
        for i in range(1, len(X) + 1):
            for j in range(1, len(Y) + 1):
                match = score_matrix[i - 1, j - 1] + GET_SCORE(X[i - 1], Y[j - 1], penalty, reward)
                delete = score_matrix[i -1, j] + penalty
                insert = score_matrix[i, j - 1] + penalty
                
                score_matrix[i, j] = max([match, delete, insert])
                
        
        i = len(X)
        j = len(Y)
        
        align_X = ""
        align_Y = ""
        
        while i > 0 or j > 0:
            
            current_score = score_matrix[i, j]
            left_score = score_matrix[i - 1, j]
            
            
            if i > 0 and j > 0 and X[i - 1] == Y[j - 1]:
                align_X = X[i - 1] + align_X
                align_Y = Y[j - 1] + align_Y
                i = i - 1
                j = j - 1
            
            elif i > 0 and current_score == left_score + penalty:
                align_X = X[i - 1] + align_X
                align_Y = "-" + align_Y
                i = i - 1
                
            else:
                align_X = "-" + align_X
                align_Y = Y[j - 1] + align_Y
                j = j - 1
    
    
        return align_X, align_Y
    
    
    

    Thus, calling

    
    global_alignment(X, Y)
    
    

    returns (‘GTCCATA-CA-‘, ‘-T-CATATCAG’)

    Breaking down the Dynamic Programming code

    Now, let’s break down the while loop logic in the code above. This code recursively builds up the aligned DNA sequences starting in reverse of each original sequence. The results below show the values of align_X and align_Y for each iteration in the loop.

    
    
    - 
    G
    
    A - 
    A G
    
    
    C A - 
    C A G
    
    - C A - 
    T C A G
    
    A - C A - 
    A T C A G
    
    
    T A - C A - 
    T A T C A G
    
    
    A T A - C A - 
    A T A T C A G
    
    
    C A T A - C A - 
    C A T A T C A G
    
    
    C C A T A - C A - 
    - C A T A T C A G
    
    
    T C C A T A - C A - 
    T - C A T A T C A G
    
    
    G T C C A T A - C A - 
    - T - C A T A T C A G
    
    
    
    

    To start, we define align_X and align_Y to be empty strings:

    
    i = len(X)
    j = len(Y)
        
    align_X = ""
    align_Y = ""
    
    

    Within the while loop we’ll define the following for ease of use:

    
    current_score = score_matrix[i, j]
    left_score = score_matrix[i - 1, j]
    
    

    Then we do the if / else piece of code. First, we check if the current (i – 1)th character of X equals the (j – 1)th character of Y. If these characters (i.e. nucleotides) are equal, then we append the nucleotide to align_X and to align_Y.

    
            if i > 0 and j > 0 and X[i - 1] == Y[j - 1]:
                align_X = X[i - 1] + align_X
                align_Y = Y[j - 1] + align_Y
                i = i - 1
                j = j - 1
    
    

    Else, if the score for the current values of i and j in the loop are equal to left_score plus the penalty, then we know there must be a gap in align_Y. This is because left_score is referring to the score in the cell immediately to the left of the current i,j cell.

    
            elif i > 0 and current_score == left_score + penalty:
                align_X = X[i - 1] + align_X
                align_Y = "-" + align_Y
                i = i - 1
                
    

    Otherwise, there must be a gap in align_X, so we append a “-“ to align_X.

    
    
            else:
                align_X = "-" + align_X
                align_Y = Y[j - 1] + align_Y
                j = j - 1
    
    


    That’s it for this post! Thanks for reading.

    Please check out my other Python articles by clicking here or keep up with my latest posts by following my blog on Twitter!

  • Andrew Treadway

    Recent Posts

    Software Engineering for Data Scientists (New book!)

    Very excited to announce the early-access preview (MEAP) of my upcoming book, Software Engineering for…

    2 years ago

    How to stop long-running code in Python

    Ever had long-running code that you don't know when it's going to finish running? If…

    3 years ago

    Faster alternatives to pandas

    Background If you've done any type of data analysis in Python, chances are you've probably…

    3 years ago

    Automated EDA with Python

    In this post, we will investigate the pandas_profiling and sweetviz packages, which can be used…

    3 years ago

    How to plot XGBoost trees in R

    In this post, we're going to cover how to plot XGBoost trees in R. XGBoost…

    4 years ago

    Python collections tutorial

    In this post, we'll discuss the underrated Python collections package, which is part of the…

    4 years ago