CAG repeat expansion : Huntington's disease

Introduction


  • We trace back to study more details in Huntington's disease. This disease is first discovered by C. O. Waters and is deeply described by George Huntington. Huntington's disease is different from the above diseases, sickle-cell anaemia and LNS, and is characterized by an expansion of short repeats in DNA sequence .

  • The major landscape was that the responsible gene was identified through molecular techniques in 1993. The research based on the families affected by this disease. It was found that the responsible gene encodes 3144 amino acids, named huntington. Even today we could not fully understand what the protein function is, but we found that a region where the glutamine is repeated for many times. The following figure is the conceptual repeated amino acid sequences in a specific Huntington's disease.

    1. The first sequence is normal, but the second and the third sequence is more likely to give rise to Huntington's disease.
    2. In the second and the third sequence there are longer poly(Q) inserted in a region and this may give rise to huntington's disease.
  • The length of poly-gutamine region is uncertain among human individuals. Normally the length of poly(Q) is less than 28 glutamine residues. But in rare case the number increases to more than 40 it gives rise to Huntington's disease . The exact mechanism is still unknown. At mRNA level, the ploy(Q) region is result from the codon being repeated many times . Such a repeat region is prone to mutational processes that change the number of repeats. It is believed that triplet expansion occurs by slippage during DNA replication. A newly synthesized strand will loop out during DNA replication while base-pairing is maintained. The following figure is the slippage during DNA replication.

    1. Replication slippage occurs at repetitive sequences, "CAG" triplets.
    2. In backward slippage, the newly synthesized strand loops out and as a result the new DNA strand contains more "CAG" triplets than the template strand.
    3. In forward slippage, the template strand loops out and as a result the new DNA strand contains less "CAG" triplets.

Perl programming


  • Regular expression in Perl programming
# find continuous "CAG" sequences between 6 to 10 times repeats
m/(CAG){6,10}/

# default variables stored in the memories after executing regular expressions
$1, $2, $3 ... 

# == ()()() => the content is related with relevant expressions in parentheses
# refer to the following example
$1, $2, $3 ...   
`
use strict;

my $seq = "ATCGATCG";
my $seq_count = ($seq =~ s/CG/CG/g);
print "$seq_count\n";

my $testSeq = "GAACT";
$testSeq =~ m/([AG]{3})CT/;
print $1."\n";

my $testSeq2 = "GAACGT";
$testSeq2 =~ m/([AG]{3})(.*)(.)$/;
print "$1 $2 $3\n";

my $testSeq3 = "GCCCCATC";
# 用括號來記錄於 $1,$2,$3 等
$testSeq3 =~ m/(C{1})/; 
print "$1\n";
$testSeq3 =~ m/(C{1,})/;
print "$1\n";
  • the arguments as a list structure passed into a sub-function(refer to the following example)
use strict;

my $dna = "ATCGATCG";
my $rev = revcomp($dna);
print "$rev\n";

# subfunction 不用提前宣告
sub revcomp {
    my $str = $_[0]; # 傳進來的第一個參數,以 0 為開始 index
    $str = reverse($str);
    $str =~ tr/ATCG/TAGC/;
    return $str;
}

my $res = testFun(1,10);
print "res: $res\n";

sub testFun {
    my $a = $_[0];
    my $b = $_[1]; # 傳進來第二個參數,以 1 為 index
    print $a+$b."\n";
    return $a+$b;
}


retvParas(1,'a',"12ab");

sub retvParas {
    print join(",",@_)."\n";
    my $len = scalar(@_);
    print "$len\n";
}
#!/usr/bin/perl -w

use strict;

# mRNA-seq data (124 MB)
# all human refseq
my $fin_name = "refseq_human.txt";

# file handler
open(fin,$fin_name) or die("Input data is in error.\n");

my $id = "";
my $seq = "";

# subfunction
# input : $id, $seq
# output : 找出 CAG 重複超過六次的 RNA seq
sub findCAGRepeat {
    my ($seqId, $seqData) = @_;
    # 用 substitution 代替 match 並配合 global 真正找出所有 CAG 重複數量
    if ($seqData =~ s/((CAG){6,})//g) {
        my $len = length($1);
        print "$id\trepeat length: $len\n";
    }
}

foreach my $line () {
    # 一行讀入,並移除 '\n'
    chomp($line);

    # 遇序序列資訊
    if ( $line =~ m/>/) {

        # 第一次遇到序列資訊處理
        if ($id ne '') {
            findCAGRepeat($id,$seq);
        }

        # initial variables
        $id = substr($line,0,20);
        $seq = "";
    }
    else {
        # 不斷將 RNA-seqence 串起來
        $seq .= $line;
    }
}
close(fin);

# 最後一筆序列仍需要再進行檢查
findCAGRepeat($id,$seq);

Reference


  1. Ingram, V.M. (1956) A specific chemical difference between the globins of normal guman and sickle-cell anaemia haemoglobin. Nature 178(4537), 792-794
  2. Ingram, V.M. (1957) Gene mutations in human haemoglobin: the chemical difference between normal and sickle cell haemoglobin. Nature 180(4581), 326-328
  3. JHU,KAI-LIN (2005) Lesch-Nyhan syndrome. http://www.genes-at-taiwan.com.tw/genehelp/database/disease/Lesch_Nyhan_syndrome_940801.htm
  4. Sculley, et al. (1992) A review of the molecular basis of hypoxanthine-guanine phosphoribosyltransferase(HPRT) deficiency. Hum Genet 90(3), 195-207
  5. Nyhan et al. (1997) The recognition of Lesch-Nythan syndrome as an inborn error of purine metabolism. J Inherit Metab Dis 20(2), 171-178
  6. Genomics and Bioinformatics。Tore Samuelsson。Cambridge university press。July 2012。ISBN: 9781107401242。

results matching ""

    No results matching ""