GEO data mining (3): basic knowledge of chips

High-throughput, whole-genome DNA chips have become a widely used tool in biology. The volume of data that chip experiments generate keeps growing, and different analysis methods can lead to different conclusions, so the choice of analysis plays a key role.

Purpose of gene chip analysis

  • Gene chip analysis uses bioinformatics methods to find, in the chip data, key genes that may play a role in a biological effect, to identify specific patterns, and to annotate each gene, so as to uncover hidden biological processes and extract their biological or functional significance.

  • Depending on its purpose, a chip may carry tens, hundreds, or even hundreds of thousands of different sequences. The DNA fragments arranged in a matrix on the chip are usually called probes, and the sample RNA is called the target.

Principle of Gene Chip

In a basic chip experiment, the mRNA sample is first reverse transcribed into cDNA (and fluorescently labeled in the process), then mixed with the nucleic acid probes on the chip. cDNA that is complementary to a probe hybridizes and binds to the chip, and sample that does not hybridize is washed off.

When the chip is scanned with a fluorescence scanner, a fluorescent spot appears at each position where the probe has bound complementary nucleic acid from the sample. The position identifies the gene, and the fluorescence intensity indicates the level of that mRNA in the original sample. Chip technology is used not only to measure gene expression but also to detect single nucleotide polymorphisms.

Chip technology method

There are two basic methods in chip technology: the single-dye (one-color) technique and the double-dye (two-color) technique.

Single dye technology

  • In the single-dye technique, each sample is fluorescently labeled and hybridized on its own chip. It is currently the most widely used method. Because each chip carries a single sample, multiple chips are easy to compare.
  • The resulting chip data is single-channel signal data. Data produced this way varies considerably, so experiments need to be replicated to reduce error.
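
As a rough illustration of why replicates help with single-channel data, the sketch below averages log2 intensities for each probe across replicate chips and reports the spread. The probe names and intensity values are made up for the example, and real pipelines normally normalize the chips before summarizing them.

```python
import math
import statistics


def summarize_replicates(intensities):
    """Return (mean, standard deviation) of log2 intensities for one probe.

    `intensities` holds one positive raw intensity per replicate chip.
    """
    if any(x <= 0 for x in intensities):
        raise ValueError("intensities must be positive before taking log2")
    logs = [math.log2(x) for x in intensities]
    mean = statistics.mean(logs)
    # A single replicate gives no estimate of spread, so report 0.0.
    sd = statistics.stdev(logs) if len(logs) > 1 else 0.0
    return mean, sd


# Hypothetical intensities for two probes measured on three replicate chips.
replicates = {
    "probe_1": [1024.0, 2048.0, 512.0],
    "probe_2": [256.0, 256.0, 256.0],
}
summary = {probe: summarize_replicates(values) for probe, values in replicates.items()}

for probe, (mean, sd) in summary.items():
    print(f"{probe}: mean log2 = {mean:.2f}, sd = {sd:.2f}")
```

Working on the log2 scale is a common choice because intensity noise tends to grow with the signal; the standard deviation gives a quick sense of how much the replicates disagree.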

Double staining technique

  • In the double-dye technique, two samples carrying different fluorescent labels are hybridized to the same chip. It is used to detect differences in gene expression between two conditions, such as diseased tissue and normal tissue (often several normal tissue samples are mixed together as a "pool" sample), or a treatment group and a control group. The cDNA of one sample is labeled with Cy5 (a dye shown in red), and the other sample is labeled with Cy3 (a dye shown in green). The two labeled samples are then mixed and compete for hybridization to the probes on the chip.
  • The resulting chip data is dual-channel signal data. It allows a direct comparison between the two samples, helps reduce data variability, improves the accuracy of differential expression analysis between groups, and saves cost by using fewer chips. However, because the pairing of samples is fixed by the experimental design, the results cannot easily be compared with other samples.
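
The direct comparison that dual-channel data allows is usually expressed as a log ratio per spot. The sketch below computes the conventional M value (log2 of Cy5 over Cy3) and A value (average log2 intensity). The gene names and intensities are invented, and the inputs are assumed to be background-corrected already.

```python
import math


def ma_values(cy5, cy3):
    """Return (M, A) for one spot.

    M = log2(Cy5) - log2(Cy3): the log ratio between the two samples.
    A = mean of log2(Cy5) and log2(Cy3): the overall spot intensity.
    """
    if cy5 <= 0 or cy3 <= 0:
        raise ValueError("intensities must be positive before taking log2")
    log_red = math.log2(cy5)
    log_green = math.log2(cy3)
    return log_red - log_green, 0.5 * (log_red + log_green)


# Hypothetical (Cy5, Cy3) intensities, e.g. treatment in red and control in green.
spots = {
    "gene_a": (4000.0, 1000.0),  # about four times higher in the Cy5 sample
    "gene_b": (500.0, 500.0),    # unchanged
    "gene_c": (250.0, 1000.0),   # about four times lower in the Cy5 sample
}
ma = {name: ma_values(red, green) for name, (red, green) in spots.items()}

for name, (m, a) in ma.items():
    print(f"{name}: M = {m:+.2f}, A = {a:.2f}")
```

A positive M means the Cy5-labeled sample is higher and a negative M means the Cy3-labeled sample is higher. In practice the two dyes behave differently, so a dye-bias normalization step usually comes before M is interpreted.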

Chip company

Currently, the chips on the market mainly come from three companies: Affymetrix, Agilent and Illumina.

Gene chip analysis tool

Gene chip analysis does not demand much from hardware, and an ordinary computer can run it. If you are processing a large amount of data, however, adding memory is recommended. In general, a machine with 16 GB of memory and an i7 processor can run essentially all of the analyses quickly. Many analysis tools exist for gene chips, each with advantages and disadvantages. The following three are recommended, listed in order of increasing difficulty.

Tool | Advantages | Disadvantages
---- | ---------- | -------------
GeneSpring | Interactive graphical interface, foolproof operation, powerful, with more than 4,400 high-level references, the gold standard for expression profile data analysis | Paid commercial software, cumbersome to operate, limited functionality. Like SPSS, suitable for users with no background
BRB-ArrayTools | Excel-based analysis tool that calls R packages automatically, powerful, highly extensible, simple to operate, free to use | Highly specialized, strict about input format, reports an error on any discrepancy. Suitable for users with some professional background
R-Bioconductor | R language, an analysis tool every student must learn, powerful statistical analysis and plotting, a collection of almost all the latest analysis algorithms and toolkits, free to download and use | Requires some programming ability

For the Bioconductor packages, I will cover how to use the lumi package to process the chip data.
The Bioconductor packages are the most convenient way to do the processing; this tutorial walks through it: https://bioconductor.org/packages/release/data/experiment/vignettes/BeadArrayUseCases/inst/doc/BeadArrayUseCases.pdf
The data processing workflow is also described in an article published in PLOS Computational Biology: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002276
BMC also has an article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4486126/ Their team built a web-based tool that accepts the raw data of an Illumina chip directly and runs a full set of analyses: http://www.arrayanalysis.org/

Data download

Data from different laboratories and experiments are generally difficult to compare and integrate. Scientists therefore formed an alliance (the MGED Society) to standardize the output and annotation of chip data and to promote data sharing and the creation of unified databases.

The resulting standard is called MIAME, and leading journals generally accept only chip data papers that follow it. NCBI's GEO and EBI's ArrayExpress are currently the largest public databases for storing and publishing MIAME-compliant chip data.
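
When downloading from GEO by accession, it helps to know where a series lives on the server. The sketch below builds the directory URL for a GSE accession. It assumes GEO's layout in which series are grouped into folders named by replacing the last three digits of the accession number with "nnn"; the helper name and the accession used are only for illustration, and the layout should be checked against GEO's own download documentation.

```python
import re

# Assumed base location of GEO series directories.
GEO_SERIES_BASE = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def geo_series_dir(accession):
    """Return the assumed directory URL for a GEO series accession like 'GSE1234'."""
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # Replace the last three digits with 'nnn' to get the grouping folder,
    # e.g. GSE1234 -> GSE1nnn and GSE123 -> GSEnnn.
    stub = "GSE" + digits[:-3] + "nnn"
    return f"{GEO_SERIES_BASE}/{stub}/{accession}/"


print(geo_series_dir("GSE1234"))
```

The series directory typically contains subfolders for the series matrix files and for supplementary raw data. Exact file names vary (for example when a series spans several platforms), so the sketch stops at the directory rather than guessing a file name.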

Illumina's bead series expression chip

Of course, the most familiar expression chips are the Affymetrix series, and the analysis routine is very simple: the affy package in R can produce the expression matrix directly from the CEL files using the RMA or MAS5 method. The chips from Illumina are slightly different. Their raw data comes at three levels, and what you usually obtain is the processed data, from which a series of statistical steps is still needed to extract the expression matrix.
http://www.bio-info-trainee.com/1937.html
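
To give a feel for what goes into turning raw intensities into an expression matrix, the sketch below implements quantile normalization, one of the steps that RMA-style processing applies across arrays. It is a minimal pure-Python illustration with made-up numbers, not the affy or lumi implementation, and it handles tied values naively.

```python
def quantile_normalize(columns):
    """Force every array (column) to share the same intensity distribution.

    `columns` is a list of equal-length lists, one list per array, where
    position i in every list refers to the same probe.
    """
    n_probes = len(columns[0])
    if any(len(col) != n_probes for col in columns):
        raise ValueError("all arrays must measure the same number of probes")

    # Reference distribution: the mean across arrays of the values at each rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n_probes)
    ]

    normalized = []
    for col in columns:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical log2 intensities for three probes on two arrays.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization each array contains the same set of values, and only the ranking of probes within each array is preserved. That is what makes intensities comparable across chips before differential expression analysis.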

In fact, the most important part of chip data processing is doing QC and obtaining the expression matrix. The downstream differential expression analysis and functional enrichment analysis are much the same whichever chip is used.
Original link: Chip basic knowledge check-in
http://www.biotrainee.com/thread-992-1-1.html
(Source: Shengxin Skill Tree)
