[Data Usage] Using the ready-made SNPs from the 3K rice database

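We often daydream about publishing a high-impact paper from data that already exists. Fairy tales like that really do happen every day, but many of us do not know how to take the first step. So let's start with how to use the rice SNP database.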

http://snp-seek.irri.org/

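This is the 3K rice variation database, which hosts ready-made SNP calls. Because the dataset is so large, the site maintainers distribute the SNPs in PLINK binary format. That is understandable: genome-wide SNPs for 3025 rice accessions are not a small amount of data however you count them.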

The PLINK format consists of the following three files:

base_filtered_v0.7.bed.gz
base_filtered_v0.7.bim.gz
base_filtered_v0.7.fam.gz
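
To get a feel for why the binary format is so compact, here is a minimal Python sketch of how genotypes are packed in a PLINK 1 variant-major .bed file: 2 bits per sample, 4 samples per byte, after a 3-byte header. The function names (bytes_per_variant, decode_variant, read_bed) are made up for this illustration and are not part of PLINK or SNP-Seek, and the sketch assumes the .bed file has already been decompressed.

```python
import math

# 2-bit genotype codes in a PLINK 1 .bed file (variant-major mode),
# expressed as the number of A1 alleles; None marks a missing call.
BED_CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def bytes_per_variant(n_samples):
    """Each sample takes 2 bits, so one variant needs ceil(n / 4) bytes."""
    return math.ceil(n_samples / 4)


def decode_variant(block, n_samples):
    """Decode one variant's packed bytes into a list of A1 allele counts."""
    genotypes = []
    for i in range(n_samples):
        byte = block[i // 4]
        code = (byte >> (2 * (i % 4))) & 0b11  # low-order bits come first
        genotypes.append(BED_CODE_TO_A1_COUNT[code])
    return genotypes


def read_bed(path, n_samples):
    """Yield one genotype list per variant from an uncompressed .bed file."""
    size = bytes_per_variant(n_samples)
    with open(path, "rb") as handle:
        if handle.read(3) != BED_MAGIC:
            raise ValueError("not a variant-major PLINK 1 .bed file")
        while True:
            block = handle.read(size)
            if not block:
                break
            if len(block) < size:
                raise ValueError("truncated .bed file")
            yield decode_variant(block, n_samples)
```

With the 3025 samples mentioned above, bytes_per_variant(3025) gives 757 bytes per SNP. The sample count passed to read_bed should be the number of lines in the .fam file, and the variant order follows the .bim file.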

Once the three files have been decompressed (for example with gunzip), PLINK's "--recode" option can convert them to VCF format:

--recode [output format] <01 | 12> <tab | tabx | spacex | bgz | gen-gz>
         <include-alt> <omit-nonmale-y>
  Create a new text fileset with all filters applied.  The following output
  formats are supported:
  * '23': 23andMe 4-column format.  This can only be used on a single
    sample's data (--keep may be handy), and does not support multicharacter
    allele codes.
  * 'A': Sample-major additive (0/1/2) coding, suitable for loading from R.
    If you need uncounted alleles to be named in the header line, add the
    'include-alt' modifier.
  * 'AD': Sample-major additive (0/1/2) + dominant (het=1/hom=0) coding.
    Also supports 'include-alt'.
  * 'A-transpose': Variant-major 0/1/2.
  * 'beagle': Unphased per-autosome .dat and .map files, readable by early
    BEAGLE versions.
  * 'beagle-nomap': Single .beagle.dat file.
  * 'bimbam': Regular BIMBAM format.
  * 'bimbam-1chr': BIMBAM format, with a two-column .pos.txt file.  Does not
    support multiple chromosomes.
  * 'fastphase': Per-chromosome fastPHASE files, with
    .chr-[chr #].recode.phase.inp filename extensions.
  * 'fastphase-1chr': Single .recode.phase.inp file.  Does not support
    multiple chromosomes.
  * 'HV': Per-chromosome Haploview files, with .chr-[chr #][.ped + .info]
    filename extensions.
  * 'HV-1chr': Single Haploview .ped + .info file pair.  Does not support
    multiple chromosomes.
  * 'lgen': PLINK 1 long-format (.lgen + .fam + .map), loadable with --lfile.
  * 'lgen-ref': .lgen + .fam + .map + .ref, loadable with --lfile +
     --reference.
  * 'list': Single genotype-based list, up to 4 lines per variant.  To omit
    nonmale genotypes on the Y chromosome, add the 'omit-nonmale-y' modifier.
  * 'rlist': .rlist + .fam + .map fileset, where the .rlist file is a
    genotype-based list which omits the most common genotype for each
    variant.  Also supports 'omit-nonmale-y'.
  * 'oxford': Oxford-format .gen + .sample.  With the 'gen-gz' modifier, the
    .gen file is gzipped.
  * 'ped': PLINK 1 sample-major (.ped + .map), loadable with --file.
  * 'compound-genotypes': Same as 'ped', except that the space between each
    pair of same-variant allele codes is removed.
  * 'structure': Structure-format.
  * 'transpose': PLINK 1 variant-major (.tped + .tfam), loadable with
    --tfile.
  * 'vcf', 'vcf-fid', 'vcf-iid': VCFv4.2.  'vcf-fid' and 'vcf-iid' cause
    family IDs or within-family IDs respectively to be used for the sample
    IDs in the last header row, while 'vcf' merges both IDs and puts an
    underscore between them.  If the 'bgz' modifier is added, the VCF file is
    block-gzipped.
    The A2 allele is saved as the reference and normally flagged as not based
    on a real reference genome (INFO:PR).  When it is important for reference
    alleles to be correct, you'll also want to include --a2-allele and
    --real-ref-alleles in your command.
  In addition,
  * The '12' modifier causes A1 (usually minor) alleles to be coded as '1'
    and A2 alleles to be coded as '2', while '01' maps A1 -> 0 and A2 -> 1.
  * The 'tab' modifier makes the output mostly tab-delimited instead of
    mostly space-delimited.  'tabx' and 'spacex' force all tabs and all
    spaces, respectively.

plink --bfile <prefix> --recode vcf-iid --out ./<out-prefix>

This converts the information in the .bed fileset into a usable VCF.
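
The whole procedure can also be scripted. The following is a rough Python sketch, not an official workflow: it decompresses the three .gz files (on the assumption that PLINK 1.9's --bfile expects uncompressed .bed/.bim/.fam) and then runs the same --recode vcf-iid command shown above. The helper names gunzip_fileset, build_recode_command and recode_to_vcf are invented for this example, and a plink executable is assumed to be on the PATH.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def gunzip_fileset(prefix):
    """Decompress <prefix>.bed.gz/.bim.gz/.fam.gz next to the originals."""
    for ext in ("bed", "bim", "fam"):
        src = Path(f"{prefix}.{ext}.gz")
        dst = Path(f"{prefix}.{ext}")
        if dst.exists():
            continue  # already decompressed, nothing to do
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)


def build_recode_command(prefix, out_prefix, plink="plink"):
    """Assemble the PLINK call that writes <out_prefix>.vcf."""
    return [plink, "--bfile", str(prefix),
            "--recode", "vcf-iid", "--out", str(out_prefix)]


def recode_to_vcf(prefix, out_prefix):
    """Decompress the fileset, then run PLINK (assumed to be on the PATH)."""
    gunzip_fileset(prefix)
    subprocess.run(build_recode_command(prefix, out_prefix), check=True)
```

For the files listed earlier, the call would be recode_to_vcf("base_filtered_v0.7", "base_filtered_v0.7"). As the help text notes, the A2 allele is written as REF by default, so add --a2-allele and --real-ref-alleles to the command if the reference alleles need to be correct.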
