DeepChem Tutorial 21: Bioinformatics

Other · 2021-03-08 01:58:24

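So far in this tutorial series we have mostly worked on cheminformatics problems: we were interested in using machine learning to predict the chemical properties of molecules. In this tutorial we shift focus slightly and look at how classical computer science techniques and machine learning can be applied to problems in bioinformatics.

To do this, we will use the free Biopython library for basic bioinformatics work. Much of the material in this tutorial comes from the official [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

We strongly recommend reading that official tutorial after you finish this one.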

In [1]:

!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py

import conda_installer

conda_installer.install()

!/root/miniconda/bin/conda info -e

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 3489 100 3489 0 0 47148 0 --:--:-- --:--:-- --:--:-- 47148

add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH

python version: 3.6.9

fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

done

installing miniconda to /root/miniconda

done

installing rdkit, openmm, pdbfixer

added omnia to channels

added conda-forge to channels

done

conda packages installation finished!

# conda environments:

base * /root/miniconda

In [2]:

!pip install --pre deepchem

import deepchem

deepchem.__version__

Collecting deepchem

Downloading https://files.pythonhosted.org/packages/b5/d7/3ba15ec6f676ef4d93855d01e40cba75e231339e7d9ea403a2f53cabbab0/deepchem-2.4.0rc1.dev20200805054153.tar.gz (351kB)

|████████████████████████████████| 358kB 4.7MB/s

Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.16.0)

Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.18.5)

Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.0.5)

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from deepchem) (0.22.2.post1)

Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from deepchem) (1.4.1)

Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2018.9)

Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.6/dist-packages (from pandas->deepchem) (2.8.1)

Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/dist-packages (from python-dateutil>=2.6.1->pandas->deepchem) (1.15.0)

Building wheels for collected packages: deepchem

Building wheel for deepchem (setup.py) ... done

Created wheel for deepchem: filename=deepchem-2.4.0rc1.dev20200805145043-cp36-none-any.whl size=438623 sha256=b76201fc01bf910a8490d4ed5cc195b109d08f019ce7afc25cdf254c62c4eab3

Stored in directory: /root/.cache/pip/wheels/41/0f/fe/5f2659dc8e26624863654100f689d8f36cae7c872d2b310394

Successfully built deepchem

Installing collected packages: deepchem

Successfully installed deepchem-2.4.0rc1.dev20200805145043

Out[2]:

'2.4.0-rc1.dev'

We'll use pip to install biopython

In [3]:

!pip install biopython

Collecting biopython

Downloading https://files.pythonhosted.org/packages/a8/66/134dbd5f885fc71493c61b6cf04c9ea08082da28da5ed07709b02857cbd0/biopython-1.77-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)

|████████████████████████████████| 2.3MB 4.5MB/s

Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.18.5)

Installing collected packages: biopython

Successfully installed biopython-1.77

In [4]:

import Bio

Bio.__version__

Out[4]:

'1.77'

In [5]:

from Bio.Seq import Seq

my_seq = Seq("AGTACACATTG")

my_seq

Out[5]:

Seq('AGTACACATTG')

In [6]:

my_seq.complement()

Out[6]:

Seq('TCATGTGTAAC')

In [7]:

my_seq.reverse_complement()

Out[7]:

Seq('CAATGTGTACT')
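
To make clear what the two calls above compute, here is a minimal pure-Python sketch. It is not Biopython's implementation: the helper names `complement` and `reverse_complement` and the lookup table `_PAIR` are illustrative assumptions, and only the four unambiguous DNA letters are handled.

```python
# Watson-Crick pairing for unambiguous DNA letters
# (illustrative helper, not Biopython's own code).
_PAIR = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner, keeping the original order."""
    return dna.translate(_PAIR)


def reverse_complement(dna: str) -> str:
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTACACATTG"))          # TCATGTGTAAC
print(reverse_complement("AGTACACATTG"))  # CAATGTGTACT
```

The printed strings agree with the Out[6] and Out[7] results above.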

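Parsing sequence records

We will download a sample FASTA file from the Biopython tutorial to use in some exercises. This file is a set of hits for a sequence (lady slipper orchid genes).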

In [8]:

!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

--2020-08-05 14:50:55--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...

Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 76480 (75K) [text/plain]

Saving to: ‘ls_orchid.fasta’

ls_orchid.fasta 100%[===================>] 74.69K --.-KB/s in 0.01s

2020-08-05 14:50:55 (4.97 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]

Let's take a look at what the contents of this file look like.

In [9]:

from Bio import SeqIO

for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):

    print(seq_record.id)

    print(repr(seq_record.seq))

    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())

740

gi|2765657|emb|Z78532.1|CCZ78532

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())

753

gi|2765656|emb|Z78531.1|CFZ78531

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())

748

gi|2765655|emb|Z78530.1|CMZ78530

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())

744

gi|2765654|emb|Z78529.1|CLZ78529

Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())

733

gi|2765652|emb|Z78527.1|CYZ78527

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())

718

gi|2765651|emb|Z78526.1|CGZ78526

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())

gi|2765650|emb|Z78525.1|CAZ78525

Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA', SingleLetterAlphabet())

gi|2765649|emb|Z78524.1|CFZ78524

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC', SingleLetterAlphabet())

gi|2765648|emb|Z78523.1|CHZ78523

Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG', SingleLetterAlphabet())

gi|2765647|emb|Z78522.1|CMZ78522

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG', SingleLetterAlphabet())

gi|2765646|emb|Z78521.1|CCZ78521

Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC', SingleLetterAlphabet())

gi|2765645|emb|Z78520.1|CSZ78520

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TTT', SingleLetterAlphabet())

gi|2765644|emb|Z78519.1|CPZ78519

Seq('ATATGATCGAGTGAATCTGGTGGACTTGTGGTTACTCAGCTCGCCATAGGCTTT...TTA', SingleLetterAlphabet())

gi|2765643|emb|Z78518.1|CRZ78518

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATAGTAG...TCC', SingleLetterAlphabet())

gi|2765642|emb|Z78517.1|CFZ78517

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...AGC', SingleLetterAlphabet())

gi|2765641|emb|Z78516.1|CPZ78516

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAT...TAA', SingleLetterAlphabet())

gi|2765640|emb|Z78515.1|MXZ78515

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGCTGAGACCGTAG...AGC', SingleLetterAlphabet())

gi|2765639|emb|Z78514.1|PSZ78514

Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...CTA', SingleLetterAlphabet())

gi|2765638|emb|Z78513.1|PBZ78513

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GAG', SingleLetterAlphabet())

gi|2765637|emb|Z78512.1|PWZ78512

Seq('CGTAACAAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAAGCCCCCA...AGC', SingleLetterAlphabet())

gi|2765636|emb|Z78511.1|PEZ78511

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCCCCA...GGA', SingleLetterAlphabet())

gi|2765635|emb|Z78510.1|PCZ78510

Seq('CTAACCAGGGTTCCGAGGTGACCTTCGGGAGGATTCCTTTTTAAGCCCCCGAAA...TTA', SingleLetterAlphabet())

gi|2765634|emb|Z78509.1|PPZ78509

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...GGA', SingleLetterAlphabet())

gi|2765633|emb|Z78508.1|PLZ78508

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TGA', SingleLetterAlphabet())

gi|2765632|emb|Z78507.1|PLZ78507

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCCCCA...TGA', SingleLetterAlphabet())

gi|2765631|emb|Z78506.1|PLZ78506

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...TGA', SingleLetterAlphabet())

gi|2765630|emb|Z78505.1|PSZ78505

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCCA...TTT', SingleLetterAlphabet())

gi|2765629|emb|Z78504.1|PKZ78504

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTTCGGAAGGATCATTGTTGAGACCGCAA...TAA', SingleLetterAlphabet())

gi|2765628|emb|Z78503.1|PCZ78503

Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCCTTGTTGAGACCGCCA...TAA', SingleLetterAlphabet())

gi|2765627|emb|Z78502.1|PBZ78502

Seq('CGTAACCAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGACCGCCA...CGC', SingleLetterAlphabet())

gi|2765626|emb|Z78501.1|PCZ78501

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACCGCAA...AGA', SingleLetterAlphabet())

gi|2765625|emb|Z78500.1|PWZ78500

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGCTCATTGTTGAGACCGCAA...AAG', SingleLetterAlphabet())

gi|2765624|emb|Z78499.1|PMZ78499

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAGGGATCATTGTTGAGATCGCAT...ACC', SingleLetterAlphabet())

gi|2765623|emb|Z78498.1|PMZ78498

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAAGGTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765622|emb|Z78497.1|PDZ78497

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765621|emb|Z78496.1|PAZ78496

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())

gi|2765620|emb|Z78495.1|PEZ78495

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GTG', SingleLetterAlphabet())

gi|2765619|emb|Z78494.1|PNZ78494

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGGTCGCAT...AAG', SingleLetterAlphabet())

gi|2765618|emb|Z78493.1|PGZ78493

Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...CCC', SingleLetterAlphabet())

gi|2765617|emb|Z78492.1|PBZ78492

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...ATA', SingleLetterAlphabet())

gi|2765616|emb|Z78491.1|PCZ78491

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...AGC', SingleLetterAlphabet())

gi|2765615|emb|Z78490.1|PFZ78490

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())

gi|2765614|emb|Z78489.1|PDZ78489

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGC', SingleLetterAlphabet())

gi|2765613|emb|Z78488.1|PTZ78488

Seq('CTGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGCAATAATTGATCGA...GCT', SingleLetterAlphabet())

gi|2765612|emb|Z78487.1|PHZ78487

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())

gi|2765611|emb|Z78486.1|PBZ78486

Seq('CGTCACGAGGTTTCCGTAGGTGAATCTGCGGGAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())

gi|2765610|emb|Z78485.1|PHZ78485

Seq('CTGAACCTGGTGTCCGAAGGTGAATCTGCGGATGGATCATTGTTGAGATATCAT...GTA', SingleLetterAlphabet())

gi|2765609|emb|Z78484.1|PCZ78484

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGGGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())

gi|2765608|emb|Z78483.1|PVZ78483

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())

gi|2765607|emb|Z78482.1|PEZ78482

Seq('TCTACTGCAGTGACCGAGATTTGCCATCGAGCCTCCTGGGAGCTTTCTTGCTGG...GCA', SingleLetterAlphabet())

gi|2765606|emb|Z78481.1|PIZ78481

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())

gi|2765605|emb|Z78480.1|PGZ78480

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())

gi|2765604|emb|Z78479.1|PPZ78479

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGT', SingleLetterAlphabet())

gi|2765603|emb|Z78478.1|PVZ78478

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCAGTGTTGAGATCACAT...GGC', SingleLetterAlphabet())

gi|2765602|emb|Z78477.1|PVZ78477

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())

gi|2765601|emb|Z78476.1|PGZ78476

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())

gi|2765600|emb|Z78475.1|PSZ78475

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', SingleLetterAlphabet())

gi|2765599|emb|Z78474.1|PKZ78474

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACGT...CTT', SingleLetterAlphabet())

gi|2765598|emb|Z78473.1|PSZ78473

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())

gi|2765597|emb|Z78472.1|PLZ78472

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765596|emb|Z78471.1|PDZ78471

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765595|emb|Z78470.1|PPZ78470

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())

gi|2765594|emb|Z78469.1|PHZ78469

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GTT', SingleLetterAlphabet())

gi|2765593|emb|Z78468.1|PAZ78468

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCGCAT...GTT', SingleLetterAlphabet())

gi|2765592|emb|Z78467.1|PSZ78467

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGA', SingleLetterAlphabet())

gi|2765591|emb|Z78466.1|PPZ78466

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())

gi|2765590|emb|Z78465.1|PRZ78465

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())

gi|2765589|emb|Z78464.1|PGZ78464

Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC', SingleLetterAlphabet())

gi|2765588|emb|Z78463.1|PGZ78463

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGTTCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765587|emb|Z78462.1|PSZ78462

Seq('CGTCACGAGGTCTCCGGATGTGACCCTGCGGAAGGATCATTGTTGAGATCACAT...CAT', SingleLetterAlphabet())

gi|2765586|emb|Z78461.1|PWZ78461

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TAA', SingleLetterAlphabet())

gi|2765585|emb|Z78460.1|PCZ78460

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...TTA', SingleLetterAlphabet())

gi|2765584|emb|Z78459.1|PDZ78459

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTT', SingleLetterAlphabet())

gi|2765583|emb|Z78458.1|PHZ78458

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TTG', SingleLetterAlphabet())

gi|2765582|emb|Z78457.1|PCZ78457

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...GAG', SingleLetterAlphabet())

gi|2765581|emb|Z78456.1|PTZ78456

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765580|emb|Z78455.1|PJZ78455

Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA', SingleLetterAlphabet())

gi|2765579|emb|Z78454.1|PFZ78454

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AAC', SingleLetterAlphabet())

gi|2765578|emb|Z78453.1|PSZ78453

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())

gi|2765577|emb|Z78452.1|PBZ78452

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())

gi|2765576|emb|Z78451.1|PHZ78451

Seq('CGTAACAAGGTTTCCGTAGGTGTACCTCCGGAAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765575|emb|Z78450.1|PPZ78450

Seq('GGAAGGATCATTGCTGATATCACATAATAATTGATCGAGTTAAGCTGGAGGATC...GAG', SingleLetterAlphabet())

gi|2765574|emb|Z78449.1|PMZ78449

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGC', SingleLetterAlphabet())

gi|2765573|emb|Z78448.1|PAZ78448

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())

gi|2765572|emb|Z78447.1|PVZ78447

Seq('CGTAACAAGGATTCCGTAGGTGAACCTGCGGGAGGATCATTGTTGAGATCACAT...AGC', SingleLetterAlphabet())

gi|2765571|emb|Z78446.1|PAZ78446

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTCCGGAAGGATCATTGTTGAGATCACAT...CCC', SingleLetterAlphabet())

gi|2765570|emb|Z78445.1|PUZ78445

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...TGT', SingleLetterAlphabet())

gi|2765569|emb|Z78444.1|PAZ78444

Seq('CGTAACAAGGTTTCCGTAGGGTGAACTGCGGAAGGATCATTGTTGAGATCACAT...ATT', SingleLetterAlphabet())

gi|2765568|emb|Z78443.1|PLZ78443

Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...AGG', SingleLetterAlphabet())

gi|2765567|emb|Z78442.1|PBZ78442

Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACATAATAATTGATCGAGT...AGT', SingleLetterAlphabet())

gi|2765566|emb|Z78441.1|PSZ78441

Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG', SingleLetterAlphabet())

gi|2765565|emb|Z78440.1|PPZ78440

Seq('CGTAACAAGGTTTCCGTAGGTGGACCTCCGGGAGGATCATTGTTGAGATCACAT...GCA', SingleLetterAlphabet())

gi|2765564|emb|Z78439.1|PBZ78439

Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
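
For intuition about what a FASTA parser does, the following is a simplified sketch of reading (identifier, sequence) pairs from FASTA text. It is an assumption-laden stand-in rather than how `SeqIO.parse` is implemented: the function name `read_fasta` is hypothetical, the two-record input is made up for illustration, and real-world details such as comments or malformed headers are ignored.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up two-record example (not taken from ls_orchid.fasta).
example = io.StringIO(">seq1 first demo record\nACGT\nACG\n>seq2\nTTGA\n")
for identifier, sequence in read_fasta(example):
    print(identifier, len(sequence))
```

Like `SeqIO.parse`, the sketch is a generator, so records are produced one at a time instead of loading the whole file into memory.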

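Sequence objects

A large part of the Biopython infrastructure consists of tools for handling sequences. These may be DNA, RNA, or amino acid sequences, or even more exotic constructs. To tell Biopython which type of sequence it is dealing with, you can specify an alphabet explicitly.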

In [10]:

from Bio.Seq import Seq

from Bio.Alphabet import IUPAC

my_seq = Seq("ACAGTAGAC", IUPAC.unambiguous_dna)

my_seq

Out[10]:

Seq('ACAGTAGAC', IUPACUnambiguousDNA())

In [11]:

my_seq.alphabet

Out[11]:

IUPACUnambiguousDNA()

If we want to encode a protein sequence, we can do that just as easily.

In [12]:

my_prot = Seq("AAAAA", IUPAC.protein) # Alanine pentapeptide

my_prot

Out[12]:

Seq('AAAAA', IUPACProtein())

In [13]:

my_prot.alphabet

Out[13]:

IUPACProtein()

We can take the length of a sequence and index into it just as we would with a string.

In [14]:

print(len(my_prot))

In [15]:

my_prot[0]

Out[15]:

'A'

You can use slice notation on a sequence to get a subsequence.

In [16]:

my_prot[0:3]

Out[16]:

Seq('AAA', IUPACProtein())

You can concatenate sequences if their types match.

In [17]:

my_prot + my_prot

Out[17]:

Seq('AAAAAAAAAA', IUPACProtein())

Concatenating sequences with incompatible alphabets, however, fails.

In [18]:

my_prot + my_seq

---------------------------------------------------------------------------

TypeError Traceback (most recent call last)

<ipython-input-18-0b91ff3c1125> in <module>()

----> 1 my_prot + my_seq

/usr/local/lib/python3.6/dist-packages/Bio/Seq.py in __add__(self, other)

    335             if not Alphabet._check_type_compatible([self.alphabet, other.alphabet]):

    336                 raise TypeError(

--> 337                     f"Incompatible alphabets {self.alphabet!r} and {other.alphabet!r}"

    338                 )

    339             # They should be the same sequence type (or one of them is generic)

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
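
The idea behind this error can be sketched in a few lines: a sequence carries a tag saying which letters it may contain, and concatenation refuses to mix tags. The class `TypedSeq` and the two letter sets below are hypothetical names for illustration only and do not reproduce Biopython's alphabet classes. Also note that `Bio.Alphabet` was removed in Biopython 1.78, so the cells in this tutorial that import it rely on the 1.77 release installed above.

```python
DNA_LETTERS = frozenset("ACGT")
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")


class TypedSeq:
    """Hypothetical stand-in for a sequence that remembers its alphabet."""

    def __init__(self, letters: str, alphabet: frozenset):
        unknown = set(letters) - alphabet
        if unknown:
            raise ValueError(f"letters not in alphabet: {sorted(unknown)}")
        self.letters = letters
        self.alphabet = alphabet

    def __add__(self, other: "TypedSeq") -> "TypedSeq":
        # Mixing, say, protein and DNA is almost always a bug, so refuse it.
        if self.alphabet != other.alphabet:
            raise TypeError("Incompatible alphabets")
        return TypedSeq(self.letters + other.letters, self.alphabet)

    def __repr__(self) -> str:
        return f"TypedSeq({self.letters!r})"


prot = TypedSeq("AAAAA", PROTEIN_LETTERS)
dna = TypedSeq("ACAGTAGAC", DNA_LETTERS)
print(prot + prot)
try:
    prot + dna
except TypeError as err:
    print("TypeError:", err)
```

The string "AAAAA" is valid in both letter sets, which is exactly why the explicit tag is needed: the letters alone cannot tell an alanine pentapeptide from a run of adenines.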

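Transcription

Transcription is the process by which a DNA sequence is converted into messenger RNA. Recall that this is part of the central dogma of biology, in which DNA gives rise to messenger RNA and messenger RNA gives rise to proteins. The figure, borrowed from a Khan Academy lesson, gives a nice depiction of this process.

Note that in the figure DNA is double-stranded: the top strand is called the coding strand and the bottom strand the template strand. The template strand is the one actually used in transcription to produce messenger RNA, but in bioinformatics it is more common to work with the coding strand. Let's now see how to carry out transcription computationally with Biopython.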

In [19]:

from Bio.Seq import Seq

from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGATCTCGTAA", IUPAC.unambiguous_dna)

coding_dna

Out[19]:

Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())

In [20]:

template_dna = coding_dna.reverse_complement()

template_dna

Out[20]:

Seq('TTACGAGATCAT', IUPACUnambiguousDNA())

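Note that these sequences match the ones in the figure. You may wonder why the template_dna sequence is shown reversed. The reason is that, by convention, the template strand is read in the reverse direction.

Now let's see how to transcribe the coding_dna strand into messenger RNA. This simply replaces T with U and changes the alphabet.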

In [21]:

messenger_rna = coding_dna.transcribe()

messenger_rna

Out[21]:

Seq('AUGAUCUCGUAA', IUPACUnambiguousRNA())

We can also perform back-transcription to recover the original coding strand from the messenger RNA.

In [22]:

messenger_rna.back_transcribe()

Out[22]:

Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())
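
As a rough illustration of the relationships described in this section, the sketch below works on plain Python strings. The three helper names are assumptions made for this example, not Biopython functions, and only uppercase unambiguous letters are handled.

```python
def transcribe(coding_dna: str) -> str:
    """The mRNA reads like the coding strand with T replaced by U."""
    return coding_dna.replace("T", "U")


def back_transcribe(messenger_rna: str) -> str:
    """Recover the coding strand by replacing U with T."""
    return messenger_rna.replace("U", "T")


# Maps each template-strand DNA base to the RNA base that pairs with it.
_DNA_TO_PAIRED_RNA = str.maketrans("ACGT", "UGCA")


def transcribe_from_template(template_dna: str) -> str:
    """Build the mRNA from a template strand written 5'->3'.

    The template is read in reverse, and each base is paired with its
    RNA partner, which yields the same mRNA as transcribe(coding_dna).
    """
    return template_dna[::-1].translate(_DNA_TO_PAIRED_RNA)


print(transcribe("ATGATCTCGTAA"))                # AUGAUCUCGUAA
print(back_transcribe("AUGAUCUCGUAA"))           # ATGATCTCGTAA
print(transcribe_from_template("TTACGAGATCAT"))  # AUGAUCUCGUAA
```

The last call shows why working with the coding strand is convenient: going through the template strand gives the same messenger RNA, with more steps.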

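Translation

Translation is the step that follows transcription, in which messenger RNA is converted into a protein sequence. Here is a nice diagram (from Wikipedia#/media/File:Ribosome_mRNA_translation_en.svg) that shows the basic process.

Note that every 3 nucleotides correspond to one new amino acid added to the growing protein chain. A group of 3 nucleotides that codes for an amino acid is called a "codon". We can call the translate() method on the messenger RNA to perform this conversion.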

messenger_rna.translate()

Translation can also be performed directly from the coding-strand DNA.

In [23]:

coding_dna.translate()

Out[23]:

Seq('MIS*', HasStopCodon(IUPACProtein(), '*'))

Let's now consider a longer gene sequence with a more interesting structure.

In [24]:

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)

coding_dna.translate()

Out[24]:

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

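In the sequence above, '*' denotes a stop codon. A stop codon is a sequence of 3 nucleotides that terminates the protein. In DNA, the stop codons are 'TGA', 'TAA', and 'TAG'. Note that the translated sequence above contains more than one stop codon.

It is also possible to translate only up to the first stop codon and halt there.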

In [25]:

coding_dna.translate(to_stop=True)

Out[25]:

Seq('MAIVMGR', IUPACProtein())
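
To show what translation does codon by codon, here is a small sketch built on the standard genetic code. It is a simplified illustration, not Biopython's `translate()`: the function name, the `CODON_TABLE` dictionary, and the choice to silently drop a trailing partial codon are all assumptions of this example, and ambiguous letters are not supported.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str, to_stop: bool = False) -> str:
    """Translate DNA or mRNA into one-letter amino acids ('*' marks a stop)."""
    dna = nucleotides.upper().replace("U", "T")
    protein = []
    # Step through whole codons only; a trailing partial codon is ignored.
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(gene))                # MAIVMGR*KGAR*
print(translate(gene, to_stop=True))  # MAIVMGR
```

Both printed strings match the Out[24] and Out[25] results above.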

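Some terminology is needed here. A complete coding sequence (CDS) is a nucleotide sequence of messenger RNA that consists of a whole number of codons (that is, its length is a multiple of 3), begins with a "start codon", and ends with a "stop codon".

A "start codon" is essentially the opposite of a "stop codon". It is usually "AUG", but it can differ (particularly when working with bacterial DNA).

Let's see how to translate the CDS of a bacterial messenger RNA.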

In [26]:

from Bio.Alphabet import generic_dna

gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \

           "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \

           "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \

           "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \

           "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",

           generic_dna)

# We specify a "table" to use a different translation table for bacterial proteins

gene.translate(table="Bacterial")

Out[26]:

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [27]:

gene.translate(table="Bacterial", to_stop=True)

Out[27]:

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
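
The definition of a complete CDS given earlier can be turned into a quick sanity check. The sketch below is illustrative only: `cds_problems` is a hypothetical helper, and `ASSUMED_START_CODONS` is a deliberately small, assumed subset of start codons rather than the full list of any particular translation table. Biopython's own `translate()` also accepts a `cds=True` argument that performs this kind of validation.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: an illustrative subset of start codons; real tables list more.
ASSUMED_START_CODONS = {"ATG", "GTG", "TTG"}


def cds_problems(dna: str, start_codons=ASSUMED_START_CODONS) -> List[str]:
    """Return the reasons `dna` is not a complete CDS (empty list if none)."""
    dna = dna.upper()
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into whole codons, ignoring any trailing partial codon.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains an internal stop codon")
    return problems


print(cds_problems("GTGAAAAAGATGTAA"))
print(cds_problems("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
```

The first call returns an empty list, while the second reports the internal stop codon that showed up as the first '*' in the earlier translation of that sequence.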

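Working with annotated sequences

It is often useful to work with sequences that carry annotations, especially richly annotated ones such as those in GenBank or EMBL files. For this purpose we use the SeqRecord class.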
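
Before looking at the real class, it may help to see the basic idea of bundling a sequence with its metadata. The dataclass below is a toy stand-in with assumed names (`MiniRecord`, `to_fasta`); it covers only a few of the fields that SeqRecord offers and is not how Biopython implements it. The example values are those used in the SeqRecord help text.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MiniRecord:
    """Toy record: a sequence string plus identifying metadata."""

    seq: str
    id: str = "<unknown id>"
    description: str = ""
    annotations: Dict[str, str] = field(default_factory=dict)

    def __len__(self) -> int:
        return len(self.seq)

    def to_fasta(self, width: int = 60) -> str:
        """Render the record as FASTA text, wrapping the sequence lines."""
        header = f">{self.id} {self.description}".rstrip()
        lines = [self.seq[i:i + width] for i in range(0, len(self.seq), width)]
        return "\n".join([header, *lines]) + "\n"


record = MiniRecord(
    seq="MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
    id="YP_025292.1",
    description="toxic membrane protein",
)
print(len(record))
print(record.to_fasta())
```

Keeping the identifier, description, and annotations next to the sequence means they travel together when records are filtered, sliced, or written back to a file.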

In [28]:

from Bio.SeqRecord import SeqRecord

help(SeqRecord)

Help on class SeqRecord in module Bio.SeqRecord:

class SeqRecord(builtins.object)

 |  A SeqRecord object holds a sequence and information about it.

 |  Main attributes:

 |   - id          - Identifier such as a locus tag (string)

 |   - seq         - The sequence itself (Seq object or similar)

 |  Additional attributes:

 |   - name        - Sequence name, e.g. gene name (string)

 |   - description - Additional text (string)

 |   - dbxrefs     - List of database cross references (list of strings)

 |   - features    - Any (sub)features defined (list of SeqFeature objects)

 |   - annotations - Further information about the whole sequence (dictionary).

 |     Most entries are strings, or lists of strings.

 |   - letter_annotations - Per letter/symbol annotation (restricted

 |     dictionary). This holds Python sequences (lists, strings

 |     or tuples) whose length matches that of the sequence.

 |     A typical use would be to hold a list of integers

 |     representing sequencing quality scores, or a string

 |     representing the secondary structure.

 |  You will typically use Bio.SeqIO to read in sequences from files as

 |  SeqRecord objects.  However, you may want to create your own SeqRecord

 |  objects directly (see the __init__ method for further details):

 |  >>> from Bio.Seq import Seq

 |  >>> from Bio.SeqRecord import SeqRecord

 |  >>> from Bio.Alphabet import IUPAC

 |  >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",

 |  ...                         IUPAC.protein),

 |  ...                    id="YP_025292.1", name="HokC",

 |  ...                    description="toxic membrane protein")

 |  >>> print(record)

 |  ID: YP_025292.1

 |  Name: HokC

 |  Description: toxic membrane protein

 |  Number of features: 0

 |  Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

 |  If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO

 |  for this.  For the special case where you want the SeqRecord turned into

 |  a string in a particular file format there is a format method which uses

 |  Bio.SeqIO internally:

 |  >>> print(record.format("fasta"))

 |  >YP_025292.1 toxic membrane protein

 |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

 |  <BLANKLINE>

 |  You can also do things like slicing a SeqRecord, checking its length, etc

 |  >>> len(record)

 |  44

 |  >>> edited = record[:10] + record[11:]

 |  >>> print(edited.seq)

 |  MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

 |  >>> print(record.seq)

 |  MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

 |  Methods defined here:

 |  __add__(self, other)

 |      Add another sequence or string to this sequence.

 |      The other sequence can be a SeqRecord object, a Seq object (or

 |      similar, e.g. a MutableSeq) or a plain Python string. If you add

 |      a plain string or a Seq (like) object, the new SeqRecord will simply

 |      have this appended to the existing data. However, any per letter

 |      annotation will be lost:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")

 |      >>> print("%s %s" % (record.id, record.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(record.letter_annotations))

 |      ['solexa_quality']

 |      >>> new = record + "ACT"

 |      >>> print("%s %s" % (new.id, new.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT

 |      >>> print(list(new.letter_annotations))

 |      []

 |      The new record will attempt to combine the annotation, but for any

 |      ambiguities (e.g. different names) it defaults to omitting that

 |      annotation.

 |      >>> from Bio import SeqIO

 |      >>> with open("GenBank/pBAD30.gb") as handle:

 |      ...     plasmid = SeqIO.read(handle, "gb")

 |      >>> print("%s %i" % (plasmid.id, len(plasmid)))

 |      pBAD30 4923

 |      Now let's cut the plasmid into two pieces, and join them back up the

 |      other way round (i.e. shift the starting point on this plasmid, have

 |      a look at the annotated features in the original file to see why this

 |      particular split point might make sense):

 |      >>> left = plasmid[:3765]

 |      >>> right = plasmid[3765:]

 |      >>> new = right + left

 |      >>> print("%s %i" % (new.id, len(new)))

 |      pBAD30 4923

 |      >>> str(new.seq) == str(right.seq + left.seq)

 |      True

 |      >>> len(new.features) == len(left.features) + len(right.features)

 |      True

 |      When we add the left and right SeqRecord objects, their annotation

 |      is all consistent, so it is all conserved in the new SeqRecord:

 |      >>> new.id == left.id == right.id == plasmid.id

 |      True

 |      >>> new.name == left.name == right.name == plasmid.name

 |      True

 |      >>> new.description == plasmid.description

 |      True

 |      >>> new.annotations == left.annotations == right.annotations

 |      True

 |      >>> new.letter_annotations == plasmid.letter_annotations

 |      True

 |      >>> new.dbxrefs == left.dbxrefs == right.dbxrefs

 |      True

 |      However, we should point out that when we sliced the SeqRecord,

 |      any annotations dictionary or dbxrefs list entries were lost.

 |      You can explicitly copy them like this:

 |      >>> new.annotations = plasmid.annotations.copy()

 |      >>> new.dbxrefs = plasmid.dbxrefs[:]

 |  __bool__(self)

 |      Boolean value of an instance of this class (True).

 |      This behaviour is for backwards compatibility, since until the

 |      __len__ method was added, a SeqRecord always evaluated as True.

 |      Note that in comparison, a Seq object will evaluate to False if it

 |      has a zero length sequence.

 |      WARNING: The SeqRecord may in future evaluate to False when its

 |      sequence is of zero length (in order to better match the Seq

 |      object behaviour)!

 |  __contains__(self, char)

 |      Implement the 'in' keyword, searches the sequence.

 |      e.g.

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")

 |      >>> "GAATTC" in record

 |      False

 |      >>> "AAA" in record

 |      True

 |      This essentially acts as a proxy for using "in" on the sequence:

 |      >>> "GAATTC" in record.seq

 |      False

 |      >>> "AAA" in record.seq

 |      True

 |      Note that you can also use Seq objects as the query,

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.Alphabet import generic_dna

 |      >>> Seq("AAA") in record

 |      True

 |      >>> Seq("AAA", generic_dna) in record

 |      True

 |      See also the Seq object's __contains__ method.

 |  __eq__(self, other)

 |      Define the equal-to operand (not implemented).

 |  __format__(self, format_spec)

 |      Return the record as a string in the specified file format.

 |      This method supports the Python format() function and f-strings.

 |      The format_spec should be a lower case string supported by

 |      Bio.SeqIO as a text output file format. Requesting a binary file

 |      format raises a ValueError. e.g.

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),

 |      ...                    id="YP_025292.1", name="HokC",

 |      ...                    description="toxic membrane protein")

 |      ...

 |      >>> format(record, "fasta")

 |      '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'

 |      >>> print(f"Here is {record.id} in FASTA format:\n{record:fasta}")

 |      Here is YP_025292.1 in FASTA format:

 |      >YP_025292.1 toxic membrane protein

 |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

 |      <BLANKLINE>

 |      See also the SeqRecord's format() method.

 |  __ge__(self, other)

 |      Define the greater-than-or-equal-to operand (not implemented).

 |  __getitem__(self, index)

 |      Return a sub-sequence or an individual letter.

 |      Slicing, e.g. my_record[5:10], returns a new SeqRecord for

 |      that sub-sequence with some annotation preserved as follows:

 |      * The name, id and description are kept as-is.

 |      * Any per-letter-annotations are sliced to match the requested

 |        sub-sequence.

 |      * Unless a stride is used, all those features which fall fully

 |        within the subsequence are included (with their locations

 |        adjusted accordingly). If you want to preserve any truncated

 |        features (e.g. GenBank/EMBL source features), you must

 |        explicitly add them to the new SeqRecord yourself.

 |      * The annotations dictionary and the dbxrefs list are not used

 |        for the new SeqRecord, as in general they may not apply to the

 |        subsequence. If you want to preserve them, you must explicitly

 |        copy them to the new SeqRecord yourself.

 |      Using an integer index, e.g. my_record[5] is shorthand for

 |      extracting that letter from the sequence, my_record.seq[5].

 |      For example, consider this short protein and its secondary

 |      structure as encoded by the PDB (e.g. H for alpha helices),

 |      plus a simple feature for its histidine self phosphorylation

 |      site:

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.SeqFeature import SeqFeature, FeatureLocation

 |      >>> from Bio.Alphabet import IUPAC

 |      >>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT"

 |      ...                     "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR",

 |      ...                     IUPAC.protein),

 |      ...                 id="1JOY", name="EnvZ",

 |      ...                 description="Homodimeric domain of EnvZ from E. coli")

 |      >>> rec.letter_annotations["secondary_structure"] = "  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  "

 |      >>> rec.features.append(SeqFeature(FeatureLocation(20, 21),

 |      ...                     type = "Site"))

 |      Now let's have a quick look at the full record,

 |      >>> print(rec)

 |      ID: 1JOY

 |      Name: EnvZ

 |      Description: Homodimeric domain of EnvZ from E. coli

 |      Number of features: 1

 |      Per letter annotation for: secondary_structure

 |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())

 |      >>> rec.letter_annotations["secondary_structure"]

 |      '  S  SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT  '

 |      >>> print(rec.features[0].location)

 |      [20:21]

 |      Now let's take a sub sequence, here chosen as the first (fractured)

 |      alpha helix which includes the histidine phosphorylation site:

 |      >>> sub = rec[11:41]

 |      >>> print(sub)

 |      ID: 1JOY

 |      Name: EnvZ

 |      Description: Homodimeric domain of EnvZ from E. coli

 |      Number of features: 1

 |      Per letter annotation for: secondary_structure

 |      Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD', IUPACProtein())

 |      >>> sub.letter_annotations["secondary_structure"]

 |      'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH'

 |      >>> print(sub.features[0].location)

 |      [9:10]

 |      You can also of course omit the start or end values, for

 |      example to get the first ten letters only:

 |      >>> print(rec[:10])

 |      ID: 1JOY

 |      Name: EnvZ

 |      Description: Homodimeric domain of EnvZ from E. coli

 |      Number of features: 0

 |      Per letter annotation for: secondary_structure

 |      Seq('MAAGVKQLAD', IUPACProtein())

 |      Or for the last ten letters:

 |      >>> print(rec[-10:])

 |      ID: 1JOY

 |      Name: EnvZ

 |      Description: Homodimeric domain of EnvZ from E. coli

 |      Number of features: 0

 |      Per letter annotation for: secondary_structure

 |      Seq('IIEQFIDYLR', IUPACProtein())

 |      If you omit both, then you get a copy of the original record (although

 |      lacking the annotations and dbxrefs):

 |      >>> print(rec[:])

 |      ID: 1JOY

 |      Name: EnvZ

 |      Description: Homodimeric domain of EnvZ from E. coli

 |      Number of features: 1

 |      Per letter annotation for: secondary_structure

 |      Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR', IUPACProtein())

 |      Finally, indexing with a simple integer is shorthand for pulling out

 |      that letter from the sequence directly:

 |      >>> rec[5]

 |      'K'

 |      >>> rec.seq[5]

 |      'K'

 |  __gt__(self, other)

 |      Define the greater-than operand (not implemented).

 |  __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)

 |      Create a SeqRecord.

 |      Arguments:

 |       - seq         - Sequence, required (Seq, MutableSeq or UnknownSeq)

 |       - id          - Sequence identifier, recommended (string)

 |       - name        - Sequence name, optional (string)

 |       - description - Sequence description, optional (string)

 |       - dbxrefs     - Database cross references, optional (list of strings)

 |       - features    - Any (sub)features, optional (list of SeqFeature objects)

 |       - annotations - Dictionary of annotations for the whole sequence

 |       - letter_annotations - Dictionary of per-letter-annotations, values

 |         should be strings, list or tuples of the same length as the full

 |         sequence.

 |      You will typically use Bio.SeqIO to read in sequences from files as

 |      SeqRecord objects.  However, you may want to create your own SeqRecord

 |      objects directly.

 |      Note that while an id is optional, we strongly recommend you supply a

 |      unique id string for each record.  This is especially important

 |      if you wish to write your sequences to a file.

 |      If you don't have the actual sequence, but you do know its length,

 |      then using the UnknownSeq object from Bio.Seq is appropriate.

 |      You can create a 'blank' SeqRecord object, and then populate the

 |      attributes later.

 |  __iter__(self)

 |      Iterate over the letters in the sequence.

 |      For example, using Bio.SeqIO to read in a protein FASTA file:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Fasta/loveliesbleeding.pro", "fasta")

 |      >>> for amino in record:

 |      ...     print(amino)

 |      ...     if amino == "L": break

 |      X

 |      A

 |      G

 |      L

 |      >>> print(record.seq[3])

 |      L

 |      This is just a shortcut for iterating over the sequence directly:

 |      >>> for amino in record.seq:

 |      ...     print(amino)

 |      ...     if amino == "L": break

 |      X

 |      A

 |      G

 |      L

 |      >>> print(record.seq[3])

 |      L

 |      Note that this does not facilitate iteration together with any

 |      per-letter-annotation.  However, you can achieve that using the

 |      python zip function on the record (or its sequence) and the relevant

 |      per-letter-annotation:

 |      >>> from Bio import SeqIO

 |      >>> rec = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")

 |      >>> print("%s %s" % (rec.id, rec.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(rec.letter_annotations))

 |      ['solexa_quality']

 |      >>> for nuc, qual in zip(rec, rec.letter_annotations["solexa_quality"]):

 |      ...     if qual > 35:

 |      ...         print("%s %i" % (nuc, qual))

 |      A 40

 |      C 39

 |      G 38

 |      T 37

 |      A 36

 |      You may agree that using zip(rec.seq, ...) is more explicit than using

 |      zip(rec, ...) as shown above.

 |  __le___(self, other)

 |      Define the less-than-or-equal-to operand (not implemented).

 |  __len__(self)

 |      Return the length of the sequence.

 |      For example, using Bio.SeqIO to read in a FASTA nucleotide file:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta")

 |      >>> len(record)

 |      309

 |      >>> len(record.seq)

 |      309

 |  __lt__(self, other)

 |      Define the less-than operand (not implemented).

 |  __ne__(self, other)

 |      Define the not-equal-to operand (not implemented).

 |  __radd__(self, other)

 |      Add another sequence or string to this sequence (from the left).

 |      This method handles adding a Seq object (or similar, e.g. MutableSeq)

 |      or a plain Python string (on the left) to a SeqRecord (on the right).

 |      See the __add__ method for more details, but for example:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")

 |      >>> print("%s %s" % (record.id, record.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(record.letter_annotations))

 |      ['solexa_quality']

 |      >>> new = "ACT" + record

 |      >>> print("%s %s" % (new.id, new.seq))

 |      slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(new.letter_annotations))

 |      []

 |  __repr__(self)

 |      Return a concise summary of the record for debugging (string).

 |      The python built in function repr works by calling the object's ___repr__

 |      method.  e.g.

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.Alphabet import generic_protein

 |      >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT"

 |      ...                    +"GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ"

 |      ...                    +"SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ"

 |      ...                    +"PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF",

 |      ...                    generic_protein),

 |      ...                 id="NP_418483.1", name="b4059",

 |      ...                 description="ssDNA-binding protein",

 |      ...                 dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"])

 |      >>> print(repr(rec))

 |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])

 |      At the python prompt you can also use this shorthand:

 |      >>> rec

 |      SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF', ProteinAlphabet()), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570'])

 |      Note that long sequences are shown truncated. Also note that any

 |      annotations, letter_annotations and features are not shown (as they

 |      would lead to a very long string).

 |  __str__(self)

 |      Return a human readable summary of the record and its annotation (string).

 |      The python built in function str works by calling the object's ___str__

 |      method.  e.g.

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.Alphabet import IUPAC

 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",

 |      ...                         IUPAC.protein),

 |      ...                    id="YP_025292.1", name="HokC",

 |      ...                    description="toxic membrane protein, small")

 |      >>> print(str(record))

 |      ID: YP_025292.1

 |      Name: HokC

 |      Description: toxic membrane protein, small

 |      Number of features: 0

 |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

 |      In this example you don't actually need to call str explicity, as the

 |      print command does this automatically:

 |      >>> print(record)

 |      ID: YP_025292.1

 |      Name: HokC

 |      Description: toxic membrane protein, small

 |      Number of features: 0

 |      Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

 |      Note that long sequences are shown truncated.

 |  format(self, format)

 |      Return the record as a string in the specified file format.

 |      The format should be a lower case string supported as an output

 |      format by Bio.SeqIO, which is used to turn the SeqRecord into a

 |      string.  e.g.

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.Alphabet import IUPAC

 |      >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",

 |      ...                         IUPAC.protein),

 |      ...                    id="YP_025292.1", name="HokC",

 |      ...                    description="toxic membrane protein")

 |      >>> record.format("fasta")

 |      '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n'

 |      >>> print(record.format("fasta"))

 |      >YP_025292.1 toxic membrane protein

 |      MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF

 |      <BLANKLINE>

 |      The Python print function automatically appends a new line, meaning

 |      in this example a blank line is shown.  If you look at the string

 |      representation you can see there is a trailing new line (shown as

 |      slash n) which is important when writing to a file or if

 |      concatenating multiple sequence strings together.

 |      Note that this method will NOT work on every possible file format

 |      supported by Bio.SeqIO (e.g. some are for multiple sequences only,

 |      and binary formats are not supported).

 |  lower(self)

 |      Return a copy of the record with a lower case sequence.

 |      All the annotation is preserved unchanged. e.g.

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Fasta/aster.pro", "fasta")

 |      >>> print(record.format("fasta"))

 |      >gi|3298468|dbj|BAA31520.1| SAMIPF

 |      GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG

 |      VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI

 |      <BLANKLINE>

 |      >>> print(record.lower().format("fasta"))

 |      >gi|3298468|dbj|BAA31520.1| SAMIPF

 |      gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg

 |      vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani

 |      <BLANKLINE>

 |      To take a more annotation rich example,

 |      >>> from Bio import SeqIO

 |      >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl")

 |      >>> len(old.features)

 |      3

 |      >>> new = old.lower()

 |      >>> len(old.features) == len(new.features)

 |      True

 |      >>> old.annotations["organism"] == new.annotations["organism"]

 |      True

 |      >>> old.dbxrefs == new.dbxrefs

 |      True

 |  reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_annotations=True, dbxrefs=False)

 |      Return new SeqRecord with reverse complement sequence.

 |      By default the new record does NOT preserve the sequence identifier,

 |      name, description, general annotation or database cross-references -

 |      these are unlikely to apply to the reversed sequence.

 |      You can specify the returned record's id, name and description as

 |      strings, or True to keep that of the parent, or False for a default.

 |      You can specify the returned record's features with a list of

 |      SeqFeature objects, or True to keep that of the parent, or False to

 |      omit them. The default is to keep the original features (with the

 |      strand and locations adjusted).

 |      You can also specify both the returned record's annotations and

 |      letter_annotations as dictionaries, True to keep that of the parent,

 |      or False to omit them. The default is to keep the original

 |      annotations (with the letter annotations reversed).

 |      To show what happens to the pre-letter annotations, consider an

 |      example Solexa variant FASTQ file with a single entry, which we'll

 |      read in as a SeqRecord:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")

 |      >>> print("%s %s" % (record.id, record.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(record.letter_annotations))

 |      ['solexa_quality']

 |      >>> print(record.letter_annotations["solexa_quality"])

 |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

 |      Now take the reverse complement, here we explicitly give a new

 |      identifier (the old identifier with a suffix):

 |      >>> rc_record = record.reverse_complement(id=record.id + "_rc")

 |      >>> print("%s %s" % (rc_record.id, rc_record.seq))

 |      slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT

 |      Notice that the per-letter-annotations have also been reversed,

 |      although this may not be appropriate for all cases.

 |      >>> print(rc_record.letter_annotations["solexa_quality"])

 |      [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]

 |      Now for the features, we need a different example. Parsing a GenBank

 |      file is probably the easiest way to get an nice example with features

 |      in it...

 |      >>> from Bio import SeqIO

 |      >>> with open("GenBank/pBAD30.gb") as handle:

 |      ...     plasmid = SeqIO.read(handle, "gb")

 |      >>> print("%s %i" % (plasmid.id, len(plasmid)))

 |      pBAD30 4923

 |      >>> plasmid.seq

 |      Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG', IUPACAmbiguousDNA())

 |      >>> len(plasmid.features)

 |      13

 |      Now, let's take the reverse complement of this whole plasmid:

 |      >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc")

 |      >>> print("%s %i" % (rc_plasmid.id, len(rc_plasmid)))

 |      pBAD30_rc 4923

 |      >>> rc_plasmid.seq

 |      Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC', IUPACAmbiguousDNA())

 |      >>> len(rc_plasmid.features)

 |      13

 |      Let's compare the first CDS feature - it has gone from being the

 |      second feature (index 1) to the second last feature (index -2), its

 |      strand has changed, and the location switched round.

 |      >>> print(plasmid.features[1])

 |      type: CDS

 |      location: [1081:1960](-)

 |      qualifiers:

 |          Key: label, Value: ['araC']

 |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']

 |          Key: vntifkey, Value: ['4']

 |      <BLANKLINE>

 |      >>> print(rc_plasmid.features[-2])

 |      type: CDS

 |      location: [2963:3842](+)

 |      qualifiers:

 |          Key: label, Value: ['araC']

 |          Key: note, Value: ['araC regulator of the arabinose BAD promoter']

 |          Key: vntifkey, Value: ['4']

 |      <BLANKLINE>

 |      You can check this new location, based on the length of the plasmid:

 |      >>> len(plasmid) - 1081

 |      3842

 |      >>> len(plasmid) - 1960

 |      2963

 |      Note that if the SeqFeature annotation includes any strand specific

 |      information (e.g. base changes for a SNP), this information is not

 |      amended, and would need correction after the reverse complement.

 |      Note trying to reverse complement a protein SeqRecord raises an

 |      exception:

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.Alphabet import IUPAC

 |      >>> protein_rec = SeqRecord(Seq("MAIVMGR", IUPAC.protein), id="Test")

 |      >>> protein_rec.reverse_complement()

 |      Traceback (most recent call last):

 |         ...

 |      ValueError: Proteins do not have complements!

 |      Also note you can reverse complement a SeqRecord using a MutableSeq:

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> from Bio.Seq import MutableSeq

 |      >>> from Bio.Alphabet import generic_dna

 |      >>> rec = SeqRecord(MutableSeq("ACGT", generic_dna), id="Test")

 |      >>> rec.seq[0] = "T"

 |      >>> print("%s %s" % (rec.id, rec.seq))

 |      Test TCGT

 |      >>> rc = rec.reverse_complement(id=True)

 |      >>> print("%s %s" % (rc.id, rc.seq))

 |      Test ACGA

 |  translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False, description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False)

 |      Return new SeqRecord with translated sequence.

 |      This calls the record's .seq.translate() method (which describes

 |      the translation related arguments, like table for the genetic code),

 |      By default the new record does NOT preserve the sequence identifier,

 |      name, description, general annotation or database cross-references -

 |      these are unlikely to apply to the translated sequence.

 |      You can specify the returned record's id, name and description as

 |      strings, or True to keep that of the parent, or False for a default.

 |      You can specify the returned record's features with a list of

 |      SeqFeature objects, or False (default) to omit them.

 |      You can also specify both the returned record's annotations and

 |      letter_annotations as dictionaries, True to keep that of the parent

 |      (annotations only), or False (default) to omit them.

 |      e.g. Loading a FASTA gene and translating it,

 |      >>> from Bio import SeqIO

 |      >>> gene_record = SeqIO.read("Fasta/sweetpea.nu", "fasta")

 |      >>> print(gene_record.format("fasta"))

 |      >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds

 |      CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA

 |      AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG

 |      ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT

 |      TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA

 |      GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG

 |      AGGTTTGTT

 |      <BLANKLINE>

 |      And now translating the record, specifying the new ID and description:

 |      >>> protein_record = gene_record.translate(table=11,

 |      ...                                        id="phya",

 |      ...                                        description="translation")

 |      >>> print(protein_record.format("fasta"))

 |      >phya translation

 |      QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA

 |      SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV

 |      <BLANKLINE>

 |  upper(self)

 |      Return a copy of the record with an upper case sequence.

 |      All the annotation is preserved unchanged. e.g.

 |      >>> from Bio.Alphabet import generic_dna

 |      >>> from Bio.Seq import Seq

 |      >>> from Bio.SeqRecord import SeqRecord

 |      >>> record = SeqRecord(Seq("acgtACGT", generic_dna), id="Test",

 |      ...                    description = "Made up for this example")

 |      >>> record.letter_annotations["phred_quality"] = [1, 2, 3, 4, 5, 6, 7, 8]

 |      >>> print(record.upper().format("fastq"))

 |      @Test Made up for this example

 |      ACGTACGT

 |      +

 |      "#$%&'()

 |      <BLANKLINE>

 |      Naturally, there is a matching lower method:

 |      >>> print(record.lower().format("fastq"))

 |      @Test Made up for this example

 |      acgtacgt

 |      +

 |      "#$%&'()

 |      <BLANKLINE>

 |  ----------------------------------------------------------------------

 |  Data descriptors defined here:

 |  __dict__

 |      dictionary for instance variables (if defined)

 |  __weakref__

 |      list of weak references to the object (if defined)

 |  letter_annotations

 |      Dictionary of per-letter-annotation for the sequence.

 |      For example, this can hold quality scores used in FASTQ or QUAL files.

 |      Consider this example using Bio.SeqIO to read in an example Solexa

 |      variant FASTQ file as a SeqRecord:

 |      >>> from Bio import SeqIO

 |      >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa")

 |      >>> print("%s %s" % (record.id, record.seq))

 |      slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN

 |      >>> print(list(record.letter_annotations))

 |      ['solexa_quality']

 |      >>> print(record.letter_annotations["solexa_quality"])

 |      [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

 |      The letter_annotations get sliced automatically if you slice the

 |      parent SeqRecord, for example taking the last ten bases:

 |      >>> sub_record = record[-10:]

 |      >>> print("%s %s" % (sub_record.id, sub_record.seq))

 |      slxa_0001_1_0001_01 ACGTNNNNNN

 |      >>> print(sub_record.letter_annotations["solexa_quality"])

 |      [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]

 |      Any python sequence (i.e. list, tuple or string) can be recorded in

 |      the SeqRecord's letter_annotations dictionary as long as the length

 |      matches that of the SeqRecord's sequence.  e.g.

 |      >>> len(sub_record.letter_annotations)

 |      1

 |      >>> sub_record.letter_annotations["dummy"] = "abcdefghij"

 |      >>> len(sub_record.letter_annotations)

 |      2

 |      You can delete entries from the letter_annotations dictionary as usual:

 |      >>> del sub_record.letter_annotations["solexa_quality"]

 |      >>> sub_record.letter_annotations

 |      {'dummy': 'abcdefghij'}

 |      You can completely clear the dictionary easily as follows:

 |      >>> sub_record.letter_annotations = {}

 |      >>> sub_record.letter_annotations

 |      {}

 |      Note that if replacing the record's sequence with a sequence of a

 |      different length you must first clear the letter_annotations dict.

 |  seq

 |      The sequence itself, as a Seq or MutableSeq object.

 |  ----------------------------------------------------------------------

 |  Data and other attributes defined here:

 |  __hash__ = None

Let's write some code that uses SeqRecord and see what we get.

In [29]:

from Bio.Seq import Seq

from Bio.SeqRecord import SeqRecord

simple_seq = Seq("GATC")

simple_seq_r = SeqRecord(simple_seq)

In [30]:

simple_seq_r.id = "AC12345"

simple_seq_r.description = "Made up sequence"

print(simple_seq_r.id)

print(simple_seq_r.description)

AC12345

Made up sequence
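The help text above also describes slicing, integer indexing, per-letter annotations and format(). The sketch below tries these on a toy record. The sequence and quality values are made up for illustration, and the code deliberately avoids Bio.Alphabet, which was removed in Biopython 1.78, so it should run on both older and newer releases.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy record: the id and description reuse the values from the cell above,
# the sequence itself is made up for this example.
rec = SeqRecord(Seq("GATCGATC"), id="AC12345", description="Made up sequence")

# Per-letter annotations must have the same length as the sequence.
rec.letter_annotations["phred_quality"] = [10, 20, 30, 40, 40, 30, 20, 10]

# Slicing returns a new SeqRecord; per-letter annotations are sliced with it.
sub = rec[2:6]
sub_seq = str(sub.seq)
sub_quals = sub.letter_annotations["phred_quality"]

# An integer index is shorthand for indexing the underlying sequence.
first_letter = rec[0]

# format("fasta") returns the record as a FASTA-formatted string.
fasta_text = rec.format("fasta")
print(fasta_text)
```

Note that the FASTA string ends with a newline, as the format() documentation above points out, so print shows an extra blank line.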

Now let's see how to use SeqRecord to parse a large FASTA file. We'll download a file hosted on the Biopython site.

In [31]:

!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna

--2020-08-05 14:52:05--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna

Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...

Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 9853 (9.6K) [text/plain]

Saving to: ‘NC_005816.fna’

NC_005816.fna       100%[===================>]   9.62K  --.-KB/s    in 0s

2020-08-05 14:52:05 (50.1 MB/s) - ‘NC_005816.fna’ saved [9853/9853]

In [32]:

from Bio import SeqIO

record = SeqIO.read("NC_005816.fna", "fasta")

record

Out[32]:

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
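SeqIO.read expects the file to hold exactly one record, which is the case for NC_005816.fna. For files with several entries, SeqIO.parse is the usual choice. The following is a minimal sketch using a made-up two-record FASTA string held in memory (the names seq1 and seq2 are hypothetical), so it does not depend on any download.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text, kept in memory for illustration only.
fasta_text = """>seq1 first made-up sequence
GATCGATC
>seq2 second made-up sequence
TTAACC
"""

# SeqIO.parse yields one SeqRecord per FASTA entry.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in records:
    print(rec.id, len(rec), rec.description)

# SeqIO.read raises ValueError when the handle holds more than one record.
try:
    SeqIO.read(StringIO(fasta_text), "fasta")
    read_failed = False
except ValueError:
    read_failed = True
```

For genuinely large files it is usually better to loop over SeqIO.parse directly rather than wrapping it in list(), so that only one record is held in memory at a time.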

Note that the SeqRecord object carries several annotations. Let's take a closer look.

In [33]:

record.id

Out[33]:

'gi|45478711|ref|NC_005816.1|'

In [34]:

record.name

Out[34]:

'gi|45478711|ref|NC_005816.1|'

In [35]:

record.description

Out[35]:

'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
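The three outputs above suggest how the FASTA header is mapped onto the record: id and name hold the first whitespace-delimited word, while description keeps the whole header line. The pure-Python sketch below mimics that mapping. It is an illustration only, not Biopython's actual parser; split_fasta_header is a hypothetical helper, and the header string is reconstructed from the description shown above.

```python
def split_fasta_header(header_line):
    """Split a FASTA header line into (id, description), as observed above."""
    title = header_line.lstrip(">").strip()
    # The identifier is the first whitespace-delimited word of the title.
    record_id = title.split(None, 1)[0] if title else ""
    # The description keeps the full title line, identifier included.
    return record_id, title


# Header reconstructed from record.description above (assumption).
header = (">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
          "str. 91001 plasmid pPCP1, complete sequence")
record_id, description = split_fasta_header(header)

# In this legacy NCBI-style identifier the fields are separated by "|";
# the last non-empty field is the versioned accession.
accession = record_id.strip("|").split("|")[-1]

print(record_id)
print(accession)
```

The extracted accession, NC_005816.1, is the identifier to look for when the same sequence is loaded from other sources.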

Now let's look at the same sequence, this time downloaded from GenBank. As before, we first download the file hosted on the Biopython site.

In [36]:

!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb

--2020-08-05 14:52:19--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb

Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...

Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 31838 (31K) [text/plain]

Saving to: ‘NC_005816.gb’

NC_005816.gb        100%[===================>]  31.09K  --.-KB/s    in 0.008s

2020-08-05 14:52:20 (3.80 MB/s) - ‘NC_005816.gb’ saved [31838/31838]

In [37]:

from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")

record

Out[37]:

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])