学习到目前为止,我们主要是解决化学信息学的问题。我们感兴趣于如何用机器学习技术来预测分子的化学物征。本教程,我们将移动一点点,看如何用经典的计算机科学技术和机器学习来解决生物信息学的问题。
为此,我们要使用免费的biopython库进行基础的生物信息学工作。本教程的很多材料都来自官方教程[Biopython tutorial]http://biopython.org/DIST/docs/tutorial/Tutorial.html).
我们强烈推荐你在学习完本教程你阅读这些官方教程。
In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3489 100 3489 0 0 47148 0 --:--:-- --:--:-- --:--:-- 47148
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added omnia to channels
added conda-forge to channels
done
conda packages installation finished!
# conda environments:
#
base * /root/miniconda
In [2]:
!pip install --pre deepchem
import deepchem
deepchem.__version__
Collecting deepchem
Downloading https://files.pythonhosted.org/packages/b5/d7/3ba15ec6f676ef4d93855d01e40cba75e231339e7d9ea403a2f53cabbab0/deepchem-2.4.0rc1.dev20200805054153.tar.gz (351kB)
|████████████████████████████████| 358kB 4.7MB/s
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)
Building wheels for collected packages: deepchem
Building wheel for deepchem (setup.py) ... done
Created wheel for deepchem: filename=deepchem-2.4.0rc1.dev20200805145043-cp36-none-any.whl size=438623 sha256=b76201fc01bf910a8490d4ed5cc195b109d08f019ce7afc25cdf254c62c4eab3
Stored in directory: /root/.cache/pip/wheels/41/0f/fe/5f2659dc8e26624863654100f689d8f36cae7c872d2b310394
Successfully built deepchem
Installing collected packages: deepchem
Successfully installed deepchem-2.4.0rc1.dev20200805145043
Out[2]:
'2.4.0-rc1.dev'
We'll use pip to install biopython
In [3]:
!pip install biopython
Collecting biopython
Downloading https://files.pythonhosted.org/packages/a8/66/134dbd5f885fc71493c61b6cf04c9ea08082da28da5ed07709b02857cbd0/biopython-1.77-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
|████████████████████████████████| 2.3MB 4.5MB/s
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.18.5)
Installing collected packages: biopython
Successfully installed biopython-1.77
In [4]:
import Bio
Bio.__version__
Out[4]:
'1.77'
In [5]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACATTG")
my_seq
Out[5]:
Seq('AGTACACATTG')
In [6]:
my_seq.complement()
Out[6]:
Seq('TCATGTGTAAC')
In [7]:
my_seq.reverse_complement()
Out[7]:
Seq('CAATGTGTACT')
解析序列记录
我们将从
Biopython
教程下载
fasta
样本文件进行一些练习。
这个文件是一个序列(lady slipper orcid genes)的击中。
In [8]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
--2020-08-05 14:50:55-- https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76480 (75K) [text/plain]
Saving to: ‘ls_orchid.fasta’
ls_orchid.fasta 100%[===================>] 74.69K --.-KB/s in 0.01s
2020-08-05 14:50:55 (4.97 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]
我们来看一下文件内容是如何的。
In [9]:
from Bio import SeqIO
for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):
print(seq_record.id)
print(repr(seq_record.seq))
print(len(seq_record))
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA', SingleLetterAlphabet())
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC', SingleLetterAlphabet())
740
gi|2765648|emb|Z78523.1|CHZ78523
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG', SingleLetterAlphabet())
709
gi|2765647|emb|Z78522.1|CMZ78522
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG', SingleLetterAlphabet())
700
gi|2765646|emb|Z78521.1|CCZ78521
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC', SingleLetterAlphabet())
726
gi|2765645|emb|Z78520.1|CSZ78520
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TTT', SingleLetterAlphabet())
753
gi|2765644|emb|Z78519.1|CPZ78519
Seq('ATATGATCGAGTGAATCTGGTGGACTTGTGGTTACTCAGCTCGCCATAGGCTTT...TTA', SingleLetterAlphabet())
699
gi|2765643|emb|Z78518.1|CRZ78518
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATAGTAG...TCC', SingleLetterAlphabet())
658
gi|2765642|emb|Z78517.1|CFZ78517
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...AGC', SingleLetterAlphabet())
752
gi|2765641|emb|Z78516.1|CPZ78516
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAT...TAA', SingleLetterAlphabet())
726
gi|2765640|emb|Z78515.1|MXZ78515
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGCTGAGACCGTAG...AGC', SingleLetterAlphabet())
765
gi|2765639|emb|Z78514.1|PSZ78514
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...CTA', SingleLetterAlphabet())
755
gi|2765638|emb|Z78513.1|PBZ78513
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GAG', SingleLetterAlphabet())
742
gi|2765637|emb|Z78512.1|PWZ78512
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...AGC', SingleLetterAlphabet())
762
gi|2765636|emb|Z78511.1|PEZ78511
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCCCCA...GGA', SingleLetterAlphabet())
745
gi|2765635|emb|Z78510.1|PCZ78510
Seq('CTAACCAGGGTTCCGAGGTGACCTTCGGGAGGATTCCTTTTTAAGCCCCCGAAA...TTA', SingleLetterAlphabet())
750
gi|2765634|emb|Z78509.1|PPZ78509
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GGA', SingleLetterAlphabet())
731
gi|2765633|emb|Z78508.1|PLZ78508
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TGA', SingleLetterAlphabet())
741
gi|2765632|emb|Z78507.1|PLZ78507
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCCCCA...TGA', SingleLetterAlphabet())
740
gi|2765631|emb|Z78506.1|PLZ78506
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...TGA', SingleLetterAlphabet())
727
gi|2765630|emb|Z78505.1|PSZ78505
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TTT', SingleLetterAlphabet())
711
gi|2765629|emb|Z78504.1|PKZ78504
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCGCAA...TAA', SingleLetterAlphabet())
743
gi|2765628|emb|Z78503.1|PCZ78503
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCCTTGTTGAGACCGCCA...TAA', SingleLetterAlphabet())
727
gi|2765627|emb|Z78502.1|PBZ78502
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGACCGCCA...CGC', SingleLetterAlphabet())
757
gi|2765626|emb|Z78501.1|PCZ78501
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...AGA', SingleLetterAlphabet())
770
gi|2765625|emb|Z78500.1|PWZ78500
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGCTCATTGTTGAGACCGCAA...AAG', SingleLetterAlphabet())
767
gi|2765624|emb|Z78499.1|PMZ78499
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAGGGATCATTGTTGAGATCGCAT...ACC', SingleLetterAlphabet())
759
gi|2765623|emb|Z78498.1|PMZ78498
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAAGGTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
750
gi|2765622|emb|Z78497.1|PDZ78497
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
788
gi|2765621|emb|Z78496.1|PAZ78496
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())
774
gi|2765620|emb|Z78495.1|PEZ78495
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GTG', SingleLetterAlphabet())
789
gi|2765619|emb|Z78494.1|PNZ78494
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGGTCGCAT...AAG', SingleLetterAlphabet())
688
gi|2765618|emb|Z78493.1|PGZ78493
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...CCC', SingleLetterAlphabet())
719
gi|2765617|emb|Z78492.1|PBZ78492
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...ATA', SingleLetterAlphabet())
743
gi|2765616|emb|Z78491.1|PCZ78491
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())
737
gi|2765615|emb|Z78490.1|PFZ78490
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
728
gi|2765614|emb|Z78489.1|PDZ78489
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGC', SingleLetterAlphabet())
740
gi|2765613|emb|Z78488.1|PTZ78488
Seq('CTGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGCAATAATTGATCGA...GCT', SingleLetterAlphabet())
696
gi|2765612|emb|Z78487.1|PHZ78487
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())
732
gi|2765611|emb|Z78486.1|PBZ78486
Seq('CGTCACGAGGTTTCCGTAGGTGAATCTGCGGGAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
731
gi|2765610|emb|Z78485.1|PHZ78485
Seq('CTGAACCTGGTGTCCGAAGGTGAATCTGCGGATGGATCATTGTTGAGATATCAT...GTA', SingleLetterAlphabet())
735
gi|2765609|emb|Z78484.1|PCZ78484
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGGGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())
720
gi|2765608|emb|Z78483.1|PVZ78483
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
740
gi|2765607|emb|Z78482.1|PEZ78482
Seq('TCTACTGCAGTGACCGAGATTTGCCATCGAGCCTCCTGGGAGCTTTCTTGCTGG...GCA', SingleLetterAlphabet())
629
gi|2765606|emb|Z78481.1|PIZ78481
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
572
gi|2765605|emb|Z78480.1|PGZ78480
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
587
gi|2765604|emb|Z78479.1|PPZ78479
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGT', SingleLetterAlphabet())
700
gi|2765603|emb|Z78478.1|PVZ78478
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCAGTGTTGAGATCACAT...GGC', SingleLetterAlphabet())
636
gi|2765602|emb|Z78477.1|PVZ78477
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
716
gi|2765601|emb|Z78476.1|PGZ78476
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
592
gi|2765600|emb|Z78475.1|PSZ78475
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', SingleLetterAlphabet())
716
gi|2765599|emb|Z78474.1|PKZ78474
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGT...CTT', SingleLetterAlphabet())
733
gi|2765598|emb|Z78473.1|PSZ78473
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
626
gi|2765597|emb|Z78472.1|PLZ78472
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
737
gi|2765596|emb|Z78471.1|PDZ78471
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
740
gi|2765595|emb|Z78470.1|PPZ78470
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())
574
gi|2765594|emb|Z78469.1|PHZ78469
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())
594
gi|2765593|emb|Z78468.1|PAZ78468
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...GTT', SingleLetterAlphabet())
610
gi|2765592|emb|Z78467.1|PSZ78467
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())
730
gi|2765591|emb|Z78466.1|PPZ78466
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
641
gi|2765590|emb|Z78465.1|PRZ78465
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
702
gi|2765589|emb|Z78464.1|PGZ78464
Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC', SingleLetterAlphabet())
733
gi|2765588|emb|Z78463.1|PGZ78463
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGTTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
738
gi|2765587|emb|Z78462.1|PSZ78462
Seq('CGTCACGAGGTCTCCGGATGTGACCCTGCGGAAGGATCATTGTTGAGATCACAT...CAT', SingleLetterAlphabet())
736
gi|2765586|emb|Z78461.1|PWZ78461
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())
732
gi|2765585|emb|Z78460.1|PCZ78460
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TTA', SingleLetterAlphabet())
745
gi|2765584|emb|Z78459.1|PDZ78459
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())
744
gi|2765583|emb|Z78458.1|PHZ78458
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTG', SingleLetterAlphabet())
738
gi|2765582|emb|Z78457.1|PCZ78457
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GAG', SingleLetterAlphabet())
739
gi|2765581|emb|Z78456.1|PTZ78456
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
740
gi|2765580|emb|Z78455.1|PJZ78455
Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA', SingleLetterAlphabet())
745
gi|2765579|emb|Z78454.1|PFZ78454
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AAC', SingleLetterAlphabet())
695
gi|2765578|emb|Z78453.1|PSZ78453
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
745
gi|2765577|emb|Z78452.1|PBZ78452
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
743
gi|2765576|emb|Z78451.1|PHZ78451
Seq('CGTAACAAGGTTTCCGTAGGTGTACCTCCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
730
gi|2765575|emb|Z78450.1|PPZ78450
Seq('GGAAGGATCATTGCTGATATCACATAATAATTGATCGAGTTAAGCTGGAGGATC...GAG', SingleLetterAlphabet())
706
gi|2765574|emb|Z78449.1|PMZ78449
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())
744
gi|2765573|emb|Z78448.1|PAZ78448
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
742
gi|2765572|emb|Z78447.1|PVZ78447
Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())
694
gi|2765571|emb|Z78446.1|PAZ78446
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())
712
gi|2765570|emb|Z78445.1|PUZ78445
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGT', SingleLetterAlphabet())
715
gi|2765569|emb|Z78444.1|PAZ78444
Seq('CGTAACAAGGTTTCCGTAGGGTGAACTGCGGAAGGATCATTGTTGAGATCACAT...ATT', SingleLetterAlphabet())
688
gi|2765568|emb|Z78443.1|PLZ78443
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())
784
gi|2765567|emb|Z78442.1|PBZ78442
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGT...AGT', SingleLetterAlphabet())
721
gi|2765566|emb|Z78441.1|PSZ78441
Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG', SingleLetterAlphabet())
703
gi|2765565|emb|Z78440.1|PPZ78440
Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())
744
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
序列对象
Biopython基础设施的很大一部分是处理序列的工具。这些可能是DNA,RNA,氨基酸序列或者是更特异的结构。要告诉Biopython处理哪种类型的序列,你可以用字母指明。
In [10]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("ACAGTAGAC", IUPAC.unambiguous_dna)
my_seq
Out[10]:
Seq('ACAGTAGAC', IUPACUnambiguousDNA())
In [11]:
my_seq.alphabet
Out[11]:
IUPACUnambiguousDNA()
如果我们想编码蛋白质序列,我们可以很容易的完成。
In [12]:
my_prot = Seq("AAAAA", IUPAC.protein) # Alanine pentapeptide
my_prot
Out[12]:
Seq('AAAAA', IUPACProtein())
In [13]:
my_prot.alphabet
Out[13]:
IUPACProtein()
我们可以取很长的序列并用字串索引。
In [14]:
print(len(my_prot))
5
In [15]:
my_prot[0]
Out[15]:
'A'
你可以对序列使用切片标记以获得子序列。
In [16]:
my_prot[0:3]
Out[16]:
Seq('AAA', IUPACProtein())
如果序列类型相同你可以进行联接。
In [17]:
my_prot + my_prot
Out[17]:
Seq('AAAAAAAAAA', IUPACProtein())
但这样会失败。
In [18]:
my_prot + my_seq
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-0b91ff3c1125> in <module>()
----> 1 my_prot + my_seq
/usr/local/lib/python3.6/dist-packages/Bio/Seq.py in __add__(self, other)
335 if not Alphabet._check_type_compatible([self.alphabet, other.alphabet]):
336 raise TypeError(
--> 337 f"Incompatible alphabets {self.alphabet!r} and {other.alphabet!r}"
338 )
339 # They should be the same sequence type (or one of them is generic)
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
转录
转录是DNA序列转移到RNA信使的过程。记住这是生物学中心法则的一部分,其中DNA产生信使RNA,信使RNA产生蛋白质。下面是借用Khan学院的课程的很好的表示。
注意上图中,DNA是两股,上一个被称为编码股,下一个为模板股。模板股用于真实的转移到信使RNA的转录过程,但是在生物信息学中,它通常与编码股一起工作。现在我们来看一下如何用Biopython进行转录计算。
In [19]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
coding_dna = Seq("ATGATCTCGTAA", IUPAC.unambiguous_dna)
coding_dna
Out[19]:
Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())
In [20]:
template_dna = coding_dna.reverse_complement()
template_dna
Out[20]:
Seq('TTACGAGATCAT', IUPACUnambiguousDNA())
注意这个序列匹配下图。你可能很困惑为什么template_dna
序列是逆过来展示的。原因是为了方便,模板股从反向读取。
我们来看一下如何转录coding_dna
股到信使
RNA
。这要将
U
变换为
T
并变换字母。
In [21]:
messenger_rna = coding_dna.transcribe()
messenger_rna
Out[21]:
Seq('AUGAUCUCGUAA', IUPACUnambiguousRNA())
我们进行逆转录,来从信使RNA恢复原始的编码股。
In [22]:
messenger_rna.back_transcribe()
Out[22]:
Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())
翻译
翻译是转录的下一步,信使RNA转换到蛋白质序列。变里有一张漂亮的图(from Wikipedia#/media/File:Ribosome_mRNA_translation_en.svg),展示基本过程序。
注意3个核苷酸对应一个新的氨基酸增加到蛋白链上。编码一个氨基酸的3个核苷酸称为"codon."。我们可以对使信RNA使用translate()
方法进行转换。
messenger_rna.translate()
翻译也可以直接从编码序列DNA进行。
In [23]:
coding_dna.translate()
Out[23]:
Seq('MIS*', HasStopCodon(IUPACProtein(), '*'))
我们现在考虑一下长的有更有趣的结构的基因序列。
In [24]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna.translate()
Out[24]:
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
上面的序列中,'*'代表stop codon。stop codon是由3种核苷酸组成的序列,它关闭蛋白质。在DNA中,stop codons为 'TGA'',‘TAA', 'TAG'。注意最后的序列有多个stop codons。
也要可能运行到第一个stop codon并停止。
In [25]:
coding_dna.translate(to_stop=True)
Out[25]:
Seq('MAIVMGR', IUPACProtein())
这里我们要介绍一下术语。一个完整的编码序列CDS是一个信使RNA的核苷酸序列,它形成整个codons(即,序列的长度乘3),以"start codon"开始,以"stop codon"结束。
一个"start codon"基本上与"stop codon"相反,通常是"AUG",但也可能不同(特别是处理一些细菌DNA时)。
我们来看一下如何翻译细菌信使RNA的CDS。
In [26]:
from Bio.Alphabet import generic_dna
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
"GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
"AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
"TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
"AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
generic_dna)
# We specify a "table" to use a different translation table for bacterial proteins
gene.translate(table="Bacterial")
Out[26]:
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
In [27]:
gene.translate(table="Bacterial", to_stop=True)
Out[27]:
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
处理标记序列
有时候能够使用标记序更是很有用的,当有很多标记时,如GenBank或EMBL文件。对于这些目的,我们要使用SeqRecord
类。
In [28]:
from Bio.SeqRecord import SeqRecord
help(SeqRecord)
Help on class SeqRecord in module Bio.SeqRecord:
class SeqRecord(builtins.object)
| A SeqRecord object holds a sequence and information about it.
|
| Main attributes:
| - id - Identifier such as a locus tag (string)
| - seq - The sequence itself (Seq object or similar)
|
| Additional attributes:
| - name - Sequence name, e.g. gene name (string)
| - description - Additional text (string)
| - dbxrefs - List of database cross references (list of strings)
| - features - Any (sub)features defined (list of SeqFeature objects)
| - annotations - Further information about the whole sequence (dictionary).
| Most entries are strings, or lists of strings.
| - letter_annotations - Per letter/symbol annotation (restricted
| dictionary). This holds Python sequences (lists, strings
| or tuples) whose length matches that of the sequence.
| A typical use would be to hold a list of integers
| representing sequencing quality scores, or a string
| representing the secondary structure.
|
| You will typically use Bio.SeqIO to read in sequences from files as
| SeqRecord objects. However, you may want to create your own SeqRecord
| objects directly (see the __init__ method for further details):
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Alphabet import IUPAC
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
| ... IUPAC.protein),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| >>> print(record)
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
|
| If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO
| for this. For the special case where you want the SeqRecord turned into
| a string in a particular file format there is a format method which uses
| Bio.SeqIO internally:
|
| >>> print(record.format("fasta"))
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| You can also do things like slicing a SeqRecord, checking its length, etc
|
| >>> len(record)
| 44
| >>> edited = record[:10] + record[11:]
| >>> print(edited.seq)
| MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| >>> print(record.seq)
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
|
| Methods defined here:
|
| __add__(self, other)
| Add another sequence or string to this sequence.
|
| The other sequence can be a SeqRecord object, a Seq object (or
| similar, e.g. a MutableSeq) or a plain Python string. If you add
| a plain string or a Seq (like) object, the new SeqRecord will simply
| have this appended to the existing data. However, any per letter
| annotation will be lost:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
|
| >>> new = record + "ACT"
| >>> print("%s %s" % (new.id, new.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT
| >>> print(list(new.letter_annotations))
| []
|
| The new record will attempt to combine the annotation, but for any
| ambiguities (e.g. different names) it defaults to omitting that
| annotation.
|
| >>> from Bio import SeqIO
| >>> with open("GenBank/pBAD30.gb") as handle:
| ... plasmid = SeqIO.read(handle, "gb")
| >>> print("%s %i" % (plasmid.id, len(plasmid)))
| pBAD30 4923
|
| Now let's cut the plasmid into two pieces, and join them back up the
| other way round (i.e. shift the starting point on this plasmid, have
| a look at the annotated features in the original file to see why this
| particular split point might make sense):
|
| >>> left = plasmid[:3765]
| >>> right = plasmid[3765:]
| >>> new = right + left
| >>> print("%s %i" % (new.id, len(new)))
| pBAD30 4923
| >>> str(new.seq) == str(right.seq + left.seq)
| True
| >>> len(new.features) == len(left.features) + len(right.features)
| True
|
| When we add the left and right SeqRecord objects, their annotation
| is all consistent, so it is all conserved in the new SeqRecord:
|
| >>> new.id == left.id == right.id == plasmid.id
| True
| >>> new.name == left.name == right.name == plasmid.name
| True
| >>> new.description == plasmid.description
| True
| >>> new.annotations == left.annotations == right.annotations
| True
| >>> new.letter_annotations == plasmid.letter_annotations
| True
| >>> new.dbxrefs == left.dbxrefs == right.dbxrefs
| True
|
| However, we should point out that when we sliced the SeqRecord,
| any annotations dictionary or dbxrefs list entries were lost.
| You can explicitly copy them like this:
|
| >>> new.annotations = plasmid.annotations.copy()
| >>> new.dbxrefs = plasmid.dbxrefs[:]
|
| __bool__(self)
| Boolean value of an instance of this class (True).
|
| This behaviour is for backwards compatibility, since until the
| __len__ method was added, a SeqRecord always evaluated as True.
|
| Note that in comparison, a Seq object will evaluate to False if it
| has a zero length sequence.
|
| WARNING: The SeqRecord may in future evaluate to False when its
| sequence is of zero length (in order to better match the Seq
| object behaviour)!
|
| __contains__(self, char)
| Implement the 'in' keyword, searches the sequence.
|
| e.g.
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> "GAATTC" in record
| False
| >>> "AAA" in record
| True
|
| This essentially acts as a proxy for using "in" on the sequence:
|
| >>> "GAATTC" in record.seq
| False
| >>> "AAA" in record.seq
| True
|
| Note that you can also use Seq objects as the query,
|
| >>> from Bio.Seq import Seq
| >>> from Bio.Alphabet import generic_dna
| >>> Seq("AAA") in record
| True
| >>> Seq("AAA", generic_dna) in record
| True
|
| See also the Seq object's __contains__ method.
|
| __eq__(self, other)
| Define the equal-to operand (not implemented).
|
| __format__(self, format_spec)
| Return the record as a string in the specified file format.
|
| This method supports the Python format() function and f-strings.
| The format_spec should be a lower case string supported by
| Bio.SeqIO as a text output file format. Requesting a binary file
| format raises a ValueError. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| ...
| >>> format(record, "fasta")
| '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
| >>> print(f"Here is {record.id} in FASTA format:\n{record:fasta}")
| Here is YP_025292.1 in FASTA format:
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| See also the SeqRecord's format() method.
|
| __ge__(self, other)
| Define the greater-than-or-equal-to operand (not implemented).
|
| __getitem__(self, index)
| Return a sub-sequence or an individual letter.
|
| Slicing, e.g. my_record[5:10], returns a new SeqRecord for
| that sub-sequence with some annotation preserved as follows:
|
| * The name, id and description are kept as-is.
| * Any per-letter-annotations are sliced to match the requested
| sub-sequence.
| * Unless a stride is used, all those features which fall fully
| within the subsequence are included (with their locations
| adjusted accordingly). If you want to preserve any truncated
| features (e.g. GenBank/EMBL source features), you must
| explicitly add them to the new SeqRecord yourself.
| * The annotations dictionary and the dbxrefs list are not used
| for the new SeqRecord, as in general they may not apply to the
| subsequence. If you want to preserve them, you must explicitly
| copy them to the new SeqRecord yourself.
|
| Using an integer index, e.g. my_record[5] is shorthand for
| extracting that letter from the sequence, my_record.seq[5].
|
| For example, consider this short protein and its secondary
| structure as encoded by the PDB (e.g. H for alpha helices),
| plus a simple feature for its histidine self phosphorylation
| site:
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.SeqFeature import SeqFeature, FeatureLocation
| >>> from Bio.Alphabet import IUPAC
| >>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"
| ... "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR",
| ... IUPAC.protein),
| ... id="1JOY", name="EnvZ",
| ... description="Homodimeric domain of EnvZ from E. coli")
| >>> rec.letter_annotations["secondary_structure"] = " S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT "
| >>> rec.features.append(SeqFeature(FeatureLocation(20, 21),
| ... type = "Site"))
|
| Now let's have a quick look at the full record,
|
| >>> print(rec)
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())
| >>> rec.letter_annotations["secondary_structure"]
| ' S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT '
| >>> print(rec.features[0].location)
| [20:21]
|
| Now let's take a sub sequence, here chosen as the first (fractured)
| alpha helix which includes the histidine phosphorylation site:
|
| >>> sub = rec[11:41]
| >>> print(sub)
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD', IUPACProtein())
| >>> sub.letter_annotations["secondary_structure"]
| 'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH'
| >>> print(sub.features[0].location)
| [9:10]
|
| You can also of course omit the start or end values, for
| example to get the first ten letters only:
|
| >>> print(rec[:10])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 0
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLAD', IUPACProtein())
|
| Or for the last ten letters:
|
| >>> print(rec[-10:])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 0
| Per letter annotation for: secondary_structure
| Seq('IIEQFIDYLR', IUPACProtein())
|
| If you omit both, then you get a copy of the original record (although
| lacking the annotations and dbxrefs):
|
| >>> print(rec[:])
| ID: 1JOY
| Name: EnvZ
| Description: Homodimeric domain of EnvZ from E. coli
| Number of features: 1
| Per letter annotation for: secondary_structure
| Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())
|
| Finally, indexing with a simple integer is shorthand for pulling out
| that letter from the sequence directly:
|
| >>> rec[5]
| 'K'
| >>> rec.seq[5]
| 'K'
|
| __gt__(self, other)
| Define the greater-than operand (not implemented).
|
| __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)
| Create a SeqRecord.
|
| Arguments:
| - seq - Sequence, required (Seq, MutableSeq or UnknownSeq)
| - id - Sequence identifier, recommended (string)
| - name - Sequence name, optional (string)
| - description - Sequence description, optional (string)
| - dbxrefs - Database cross references, optional (list of strings)
| - features - Any (sub)features, optional (list of SeqFeature objects)
| - annotations - Dictionary of annotations for the whole sequence
| - letter_annotations - Dictionary of per-letter-annotations, values
| should be strings, list or tuples of the same length as the full
| sequence.
|
| You will typically use Bio.SeqIO to read in sequences from files as
| SeqRecord objects. However, you may want to create your own SeqRecord
| objects directly.
|
| Note that while an id is optional, we strongly recommend you supply a
| unique id string for each record. This is especially important
| if you wish to write your sequences to a file.
|
| If you don't have the actual sequence, but you do know its length,
| then using the UnknownSeq object from Bio.Seq is appropriate.
|
| You can create a 'blank' SeqRecord object, and then populate the
| attributes later.
|
| __iter__(self)
| Iterate over the letters in the sequence.
|
| For example, using Bio.SeqIO to read in a protein FASTA file:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/loveliesbleeding.pro", "fasta")
| >>> for amino in record:
| ... print(amino)
| ... if amino == "L": break
| X
| A
| G
| L
| >>> print(record.seq[3])
| L
|
| This is just a shortcut for iterating over the sequence directly:
|
| >>> for amino in record.seq:
| ... print(amino)
| ... if amino == "L": break
| X
| A
| G
| L
| >>> print(record.seq[3])
| L
|
| Note that this does not facilitate iteration together with any
| per-letter-annotation. However, you can achieve that using the
| python zip function on the record (or its sequence) and the relevant
| per-letter-annotation:
|
| >>> from Bio import SeqIO
| >>> rec = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (rec.id, rec.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(rec.letter_annotations))
| ['solexa_quality']
| >>> for nuc, qual in zip(rec, rec.letter_annotations["solexa_quality"]):
| ... if qual > 35:
| ... print("%s %i" % (nuc, qual))
| A 40
| C 39
| G 38
| T 37
| A 36
|
| You may agree that using zip(rec.seq, ...) is more explicit than using
| zip(rec, ...) as shown above.
|
| __le___(self, other)
| Define the less-than-or-equal-to operand (not implemented).
|
| __len__(self)
| Return the length of the sequence.
|
| For example, using Bio.SeqIO to read in a FASTA nucleotide file:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> len(record)
| 309
| >>> len(record.seq)
| 309
|
| __lt__(self, other)
| Define the less-than operand (not implemented).
|
| __ne__(self, other)
| Define the not-equal-to operand (not implemented).
|
| __radd__(self, other)
| Add another sequence or string to this sequence (from the left).
|
| This method handles adding a Seq object (or similar, e.g. MutableSeq)
| or a plain Python string (on the left) to a SeqRecord (on the right).
| See the __add__ method for more details, but for example:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
|
| >>> new = "ACT" + record
| >>> print("%s %s" % (new.id, new.seq))
| slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(new.letter_annotations))
| []
|
| __repr__(self)
| Return a concise summary of the record for debugging (string).
|
| The python built in function repr works by calling the object's ___repr__
| method. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Alphabet import generic_protein
| >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"
| ... +"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"
| ... +"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"
| ... +"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF",
| ... generic_protein),
| ... id="NP_418483.1", name="b4059",
| ... description="ssDNA-binding protein",
| ... dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])
| >>> print(repr(rec))
| SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
|
| At the python prompt you can also use this shorthand:
|
| >>> rec
| SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])
|
| Note that long sequences are shown truncated. Also note that any
| annotations, letter_annotations and features are not shown (as they
| would lead to a very long string).
|
| __str__(self)
| Return a human readable summary of the record and its annotation (string).
|
| The python built in function str works by calling the object's ___str__
| method. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Alphabet import IUPAC
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
| ... IUPAC.protein),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein, small")
| >>> print(str(record))
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein, small
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
|
| In this example you don't actually need to call str explicity, as the
| print command does this automatically:
|
| >>> print(record)
| ID: YP_025292.1
| Name: HokC
| Description: toxic membrane protein, small
| Number of features: 0
| Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
|
| Note that long sequences are shown truncated.
|
| format(self, format)
| Return the record as a string in the specified file format.
|
| The format should be a lower case string supported as an output
| format by Bio.SeqIO, which is used to turn the SeqRecord into a
| string. e.g.
|
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Alphabet import IUPAC
| >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
| ... IUPAC.protein),
| ... id="YP_025292.1", name="HokC",
| ... description="toxic membrane protein")
| >>> record.format("fasta")
| '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'
| >>> print(record.format("fasta"))
| >YP_025292.1 toxic membrane protein
| MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
| <BLANKLINE>
|
| The Python print function automatically appends a new line, meaning
| in this example a blank line is shown. If you look at the string
| representation you can see there is a trailing new line (shown as
| slash n) which is important when writing to a file or if
| concatenating multiple sequence strings together.
|
| Note that this method will NOT work on every possible file format
| supported by Bio.SeqIO (e.g. some are for multiple sequences only,
| and binary formats are not supported).
|
| lower(self)
| Return a copy of the record with a lower case sequence.
|
| All the annotation is preserved unchanged. e.g.
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Fasta/aster.pro", "fasta")
| >>> print(record.format("fasta"))
| >gi|3298468|dbj|BAA31520.1| SAMIPF
| GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG
| VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI
| <BLANKLINE>
| >>> print(record.lower().format("fasta"))
| >gi|3298468|dbj|BAA31520.1| SAMIPF
| gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg
| vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani
| <BLANKLINE>
|
| To take a more annotation rich example,
|
| >>> from Bio import SeqIO
| >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")
| >>> len(old.features)
| 3
| >>> new = old.lower()
| >>> len(old.features) == len(new.features)
| True
| >>> old.annotations["organism"] == new.annotations["organism"]
| True
| >>> old.dbxrefs == new.dbxrefs
| True
|
| reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_annotations=True, dbxrefs=False)
| Return new SeqRecord with reverse complement sequence.
|
| By default the new record does NOT preserve the sequence identifier,
| name, description, general annotation or database cross-references -
| these are unlikely to apply to the reversed sequence.
|
| You can specify the returned record's id, name and description as
| strings, or True to keep that of the parent, or False for a default.
|
| You can specify the returned record's features with a list of
| SeqFeature objects, or True to keep that of the parent, or False to
| omit them. The default is to keep the original features (with the
| strand and locations adjusted).
|
| You can also specify both the returned record's annotations and
| letter_annotations as dictionaries, True to keep that of the parent,
| or False to omit them. The default is to keep the original
| annotations (with the letter annotations reversed).
|
| To show what happens to the pre-letter annotations, consider an
| example Solexa variant FASTQ file with a single entry, which we'll
| read in as a SeqRecord:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
| >>> print(record.letter_annotations["solexa_quality"])
| [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| Now take the reverse complement, here we explicitly give a new
| identifier (the old identifier with a suffix):
|
| >>> rc_record = record.reverse_complement(id=record.id + "_rc")
| >>> print("%s %s" % (rc_record.id, rc_record.seq))
| slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
|
| Notice that the per-letter-annotations have also been reversed,
| although this may not be appropriate for all cases.
|
| >>> print(rc_record.letter_annotations["solexa_quality"])
| [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
|
| Now for the features, we need a different example. Parsing a GenBank
| file is probably the easiest way to get an nice example with features
| in it...
|
| >>> from Bio import SeqIO
| >>> with open("GenBank/pBAD30.gb") as handle:
| ... plasmid = SeqIO.read(handle, "gb")
| >>> print("%s %i" % (plasmid.id, len(plasmid)))
| pBAD30 4923
| >>> plasmid.seq
| Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG', IUPACAmbiguousDNA())
| >>> len(plasmid.features)
| 13
|
| Now, let's take the reverse complement of this whole plasmid:
|
| >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")
| >>> print("%s %i" % (rc_plasmid.id, len(rc_plasmid)))
| pBAD30_rc 4923
| >>> rc_plasmid.seq
| Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC', IUPACAmbiguousDNA())
| >>> len(rc_plasmid.features)
| 13
|
| Let's compare the first CDS feature - it has gone from being the
| second feature (index 1) to the second last feature (index -2), its
| strand has changed, and the location switched round.
|
| >>> print(plasmid.features[1])
| type: CDS
| location: [1081:1960](-)
| qualifiers:
| Key: label, Value: ['araC']
| Key: note, Value: ['araC regulator of the arabinose BAD promoter']
| Key: vntifkey, Value: ['4']
| <BLANKLINE>
| >>> print(rc_plasmid.features[-2])
| type: CDS
| location: [2963:3842](+)
| qualifiers:
| Key: label, Value: ['araC']
| Key: note, Value: ['araC regulator of the arabinose BAD promoter']
| Key: vntifkey, Value: ['4']
| <BLANKLINE>
|
| You can check this new location, based on the length of the plasmid:
|
| >>> len(plasmid) - 1081
| 3842
| >>> len(plasmid) - 1960
| 2963
|
| Note that if the SeqFeature annotation includes any strand specific
| information (e.g. base changes for a SNP), this information is not
| amended, and would need correction after the reverse complement.
|
| Note trying to reverse complement a protein SeqRecord raises an
| exception:
|
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Seq import Seq
| >>> from Bio.Alphabet import IUPAC
| >>> protein_rec = SeqRecord(Seq("MAIVMGR", IUPAC.protein), id="Test")
| >>> protein_rec.reverse_complement()
| Traceback (most recent call last):
| ...
| ValueError: Proteins do not have complements!
|
| Also note you can reverse complement a SeqRecord using a MutableSeq:
|
| >>> from Bio.SeqRecord import SeqRecord
| >>> from Bio.Seq import MutableSeq
| >>> from Bio.Alphabet import generic_dna
| >>> rec = SeqRecord(MutableSeq("ACGT", generic_dna), id="Test")
| >>> rec.seq[0] = "T"
| >>> print("%s %s" % (rec.id, rec.seq))
| Test TCGT
| >>> rc = rec.reverse_complement(id=True)
| >>> print("%s %s" % (rc.id, rc.seq))
| Test ACGA
|
| translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False, description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False)
| Return new SeqRecord with translated sequence.
|
| This calls the record's .seq.translate() method (which describes
| the translation related arguments, like table for the genetic code),
|
| By default the new record does NOT preserve the sequence identifier,
| name, description, general annotation or database cross-references -
| these are unlikely to apply to the translated sequence.
|
| You can specify the returned record's id, name and description as
| strings, or True to keep that of the parent, or False for a default.
|
| You can specify the returned record's features with a list of
| SeqFeature objects, or False (default) to omit them.
|
| You can also specify both the returned record's annotations and
| letter_annotations as dictionaries, True to keep that of the parent
| (annotations only), or False (default) to omit them.
|
| e.g. Loading a FASTA gene and translating it,
|
| >>> from Bio import SeqIO
| >>> gene_record = SeqIO.read("Fasta/sweetpea.nu", "fasta")
| >>> print(gene_record.format("fasta"))
| >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds
| CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA
| AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG
| ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT
| TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA
| GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG
| AGGTTTGTT
| <BLANKLINE>
|
| And now translating the record, specifying the new ID and description:
|
| >>> protein_record = gene_record.translate(table=11,
| ... id="phya",
| ... description="translation")
| >>> print(protein_record.format("fasta"))
| >phya translation
| QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA
| SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV
| <BLANKLINE>
|
| upper(self)
| Return a copy of the record with an upper case sequence.
|
| All the annotation is preserved unchanged. e.g.
|
| >>> from Bio.Alphabet import generic_dna
| >>> from Bio.Seq import Seq
| >>> from Bio.SeqRecord import SeqRecord
| >>> record = SeqRecord(Seq("acgtACGT", generic_dna), id="Test",
| ... description = "Made up for this example")
| >>> record.letter_annotations["phred_quality"] = [1, 2, 3, 4, 5, 6, 7, 8]
| >>> print(record.upper().format("fastq"))
| @Test Made up for this example
| ACGTACGT
| +
| "#$%&'()
| <BLANKLINE>
|
| Naturally, there is a matching lower method:
|
| >>> print(record.lower().format("fastq"))
| @Test Made up for this example
| acgtacgt
| +
| "#$%&'()
| <BLANKLINE>
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| letter_annotations
| Dictionary of per-letter-annotation for the sequence.
|
| For example, this can hold quality scores used in FASTQ or QUAL files.
| Consider this example using Bio.SeqIO to read in an example Solexa
| variant FASTQ file as a SeqRecord:
|
| >>> from Bio import SeqIO
| >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")
| >>> print("%s %s" % (record.id, record.seq))
| slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
| >>> print(list(record.letter_annotations))
| ['solexa_quality']
| >>> print(record.letter_annotations["solexa_quality"])
| [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| The letter_annotations get sliced automatically if you slice the
| parent SeqRecord, for example taking the last ten bases:
|
| >>> sub_record = record[-10:]
| >>> print("%s %s" % (sub_record.id, sub_record.seq))
| slxa_0001_1_0001_01 ACGTNNNNNN
| >>> print(sub_record.letter_annotations["solexa_quality"])
| [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
|
| Any python sequence (i.e. list, tuple or string) can be recorded in
| the SeqRecord's letter_annotations dictionary as long as the length
| matches that of the SeqRecord's sequence. e.g.
|
| >>> len(sub_record.letter_annotations)
| 1
| >>> sub_record.letter_annotations["dummy"] = "abcdefghij"
| >>> len(sub_record.letter_annotations)
| 2
|
| You can delete entries from the letter_annotations dictionary as usual:
|
| >>> del sub_record.letter_annotations["solexa_quality"]
| >>> sub_record.letter_annotations
| {'dummy': 'abcdefghij'}
|
| You can completely clear the dictionary easily as follows:
|
| >>> sub_record.letter_annotations = {}
| >>> sub_record.letter_annotations
| {}
|
| Note that if replacing the record's sequence with a sequence of a
| different length you must first clear the letter_annotations dict.
|
| seq
| The sequence itself, as a Seq or MutableSeq object.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __hash__ = None
我们来写一些代码调用SeqRecord
并看一下有什么结果。
In [29]:
from Bio.SeqRecord import SeqRecord
simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)
In [30]:
simple_seq_r.id = "AC12345"
simple_seq_r.description = "Made up sequence"
print(simple_seq_r.id)
print(simple_seq_r.description)
AC12345
Made up sequence
我们看一下如何用SeqRecord
来解析大的
fasta文件。我们从biopython网站下载一个文件。
In [31]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
--2020-08-05 14:52:05-- https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9853 (9.6K) [text/plain]
Saving to: ‘NC_005816.fna’
NC_005816.fna 100%[===================>] 9.62K --.-KB/s in 0s
2020-08-05 14:52:05 (50.1 MB/s) - ‘NC_005816.fna’ saved [9853/9853]
In [32]:
from Bio import SeqIO
record = SeqIO.read("NC_005816.fna", "fasta")
record
Out[32]:
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
注意
SeqRecord
对象有很多的标记。我们仔细看一下。
In [33]:
record.id
Out[33]:
'gi|45478711|ref|NC_005816.1|'
In [34]:
record.name
Out[34]:
'gi|45478711|ref|NC_005816.1|'
In [35]:
record.description
Out[35]:
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
现在看一下相同的序列,下载自GenBank。我们先从biopython教程网站下载宿文件。
In [36]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb
--2020-08-05 14:52:19-- https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31838 (31K) [text/plain]
Saving to: ‘NC_005816.gb’
NC_005816.gb 100%[===================>] 31.09K --.-KB/s in 0.008s
2020-08-05 14:52:20 (3.80 MB/s) - ‘NC_005816.gb’ saved [31838/31838]
In [37]:
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
record
Out[37]:
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])
SeqIO对象
以后继续。
下载全文请到www.data-vision.net,技术联系电话13712566524