Handling multi-line text in Python: a simple way to read a multi-line FASTA file

When working with FASTA sequences, you often find a single sequence wrapped across several lines, as shown here:

$ cat test.fasta
>test_1
TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCT
TTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGG
GGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAA
CCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTA
GATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAA
CAGG
>test_3
TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACT
TTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTG
CCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAG
TCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAAT
TTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCT
GAGGTGCGAAAGCGTGGGGAGCAAACAGG
>test_4
TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCT
GTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGG
GGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGC
TCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATG
CGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAG
CAAACGGG
>test_5
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCT
TTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGG
GGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAA
CCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTA
GATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAA
CAGG
>test_6
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTC
AGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCC
AAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCC
CGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGA
TATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACA
GG

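My simple approach is: [read the whole file at once --> split it into a list on the ">" delimiter --> join the lines of each list item back into one string]. The code is as follows: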

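# Read the whole file, then split it into one chunk per record on ">".
with open("test.fasta") as seq_file:
    seq_list = seq_file.read().split(">")

for seq in seq_list:
    if seq.strip():  # skip the empty chunk before the first ">"
        lines = seq.splitlines()
        seq_name = lines[0]          # header line, without the leading ">"
        seq_fa = "".join(lines[1:])  # wrapped sequence lines joined into one
        print(">" + seq_name + "\n" + seq_fa)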

The printed result is:

>test_1
TGGGGAATCTTGGACAATGGGGGCAACCCTGATCCAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCAGCTGGGAAGATAATGACGGTACCAGCAGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGATTGTTAAGTCGGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTCCGATACTGGCAATCTTGAGATCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>test_3
TGGGGAATATTGGACAATGGGGGCAACCCTGATCCAGCAATGCCGCGTGTGTGAAGAAGGCCTGCGGGTTGTAAAGCACTTTCAGTAGAGAAGAAATGCCCATGGTTAATACCCGTGGGTCTTGACGTAACCTACAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGGTCAGTCGGATGTGAAAGCCCTAGGCTCAACCTGGGAATGGCATTCGATACTGCCTGACTAGAGTATGGTAGAGGGAAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCAGTGGCGAAGGCGACTTCCTGGGCCAATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>test_4
TGGGGAATTTTGGGCAATGGGCGAAAGCCTGACCCAGCAACGCCGCGTGGAGGATGAAGGCCCTCGGGTCGTAAACTCCTGTCCTAGGGGAAGAAAAAAATGACGGTACCCTTGGAGGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGGGGAGCGGTGTTCGGAATTACTGGGCGTAAAGGGCGCGCAGGCGGCCTGGGAAGTCTTGGGTGAAAGCCCCCAGCTCAACTGGGGAATGGCCTGAGAAACCACTAGGCTGGAGTGCTGGAGAGGGAAGCGGAATTCCCGGTGGAGCGGTGAAATGCGTAGATATCGGGAGGAACACCAGAGGCGAAGGCGGCTTCCTGGACAGACACTGACGCTGAGGCGCGAAAGCTAGGGGAGCAAACGGG
>test_5
TGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCACCGGTGAAGATAATGACGGTAACCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGATTTACTGGGCGTAAAGCGCACGTAGGCGGACTATTAAGTCAGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTTTGATACTGGTAGTCTTGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>test_6
GGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACACCGCGTGCGGGATGAAGGCCCTCGGGTTGTAAACCGCTTTCAGGAGGGACGAAAATGACGGTACCTCCAGAAGAAGGCCCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGCCAAACGTTGTCCGGATTTATTGGGCGTAAAGGGCTCGTAGGCGGTTCAACAAGTCGATCGTGAAAGCCCGGGGCTCAACCCCGGGACGCCGGTCGAAACTGTTGTGACTAGGGTCCGGTAGAGGTGAGTGGAATTCTCGGTGTAGCGGTGGAATGCGCAGATATCGAGAGGAACACCAGTTGCGAAGGCGGCTCACTGGGCCGGTACCGACGCTAAGGAGCGAAAGCGTGGGGAGCAAACAGG

Reposted from www.cnblogs.com/xlij1205/p/10504418.html