More than 330 common variants loci have been identified associated with systemic lupus erythematosus (SLE), but their causal mechanisms are rarely well-explained. Many GWAS downstream analysis ideas were proposed, including fine-mapping, colocalization, Mendelian randomization causal inference, and multi-omics integrated analysis. But it is easy to get confused as there are many statisics hypotheses and caveat before using these tools. In this article, I will present my GWAS downstream analysis pipeline using a SLE signal as example.
Table 1. Number of cases and controls in 7 datasets.
| SNP array | WGS | ||||||
| Dataset | HK | GZ | AH | JN | Thai | HK | JN |
| n_cases | 1,604 | 1,014 | 877 | 512 | 835 | 1,532 | 195 |
| n_ctrls | 3,324 | 4,122 | 1,712 | 994 | 2,995 | 1,1357 | 369 |
The number of cases and controls of each dataset were shown in Table 1. Genome-wide association studies were frist conducted to each dataset separately, following a meta-analysis of 7 cohorts. As shown in Fig 1, a locus around GADPH-IFFO1-CHD4 in Chr12 was identified signficiantly associated with SLE. THe top SNP rs56848735 is located in the 5’UTR region of IFFO1 gene. We can see there are many variants have high LD with the top SNP rs56848735. Traditional fine-mapping algorithms like FINEMAP and SuSiE performs bad at this situation as there are many SNPs with high LD and low P values. Functionally-informed fine-mapping (PolyFun) was utilized to identify the candidate causal variants. The results from PolyFun shows there are two SNPs rs56848735 and rs1639123 contains highest posterior inclusion probability (PIP). THe rs1639123 is located in the intronic region that close to the 5’UTR of CHD4 gene.
To further identify the genes that affected by this locus, we downloaded the whole-blood eQTL data from GTEx and eQTLGen databases. Table 2 shows the eQTL P-values of rs56848735 and rs1639123 for genes in this region. CHD4 gene is significant in both GTEx and eQTLGen databases for both candidate causal variants.
Table 2. eQTL P-values of rs56848735 and rs1639123 for genes around this locus.
| CD27 | TAPBPL | VAMP1 | ZNF384 | GPR162 | CHD4 | LPAR5 | ||
| rs56848735 | GTEx | - | - | - | - | - | 1.4e-5 | - |
| eQTLGen | 5.3e-7 | 2.8e-8 | 3.2e-33 | 3.2e-10 | 5.6e-9 | 4.5e-24 | 5.6e-200 | |
| rs1639123 | GTEx | - | - | - | - | - | 6.2e-7 | - |
| eQTLGen | 8.8e-7 | 6.7e-16 | 5.8e-12 | 9.4e-9 | 1.6e-5 | 8.8e-28 | 2.4e-211 |
We next performed colocalization study between eQTL and GWAS summary statistics in this region. CHD4 gene is colocalized with GWAS signal in both GTEx and eQTLGen databases. The locus compare plots show the GWAS and eQTL signals are consistent in the cluster of high LD SNPs. However, it looks have some LD inconsistence beteween GWAS and eQTL, because the GTEx and eQTLGen data are mainly based on European population, but our GWAS analysis is based on East Aisn population. The LD situation will be much better if we use a East Asian whole-blood eQTL database.
The colocalization analysis shows this locus possibly regulate the expression of CHD4 gene. To further validate this hypothesis, we downloaded the a multiome dataset from 10X genomics. It is a scATAC-seq and scRNA-seq dataset that contains 3,000 cells from PBMC. The scATAC-seq data shows there are open chromatin peaks in both candidate causal variants. It means they are located in the regulatory region, and transcription factor could bind to these two peaks region and regulate gene expression. To identify the gene regulated by them, we tested the association between gene expression ~ peaks fragment by running hurdle negative binomial regression using Open4Gene. The result shows CHD4’s expression is significantly associated with the scATAC peak fragment counts of rs1639123. It suggests the rs1639123 is possibly the causal variant in this locus, and it regulates the expression of CHD4 gene.
We are collecting and merging larger multiome dataset to further replicate our finding here. The manuscript is under preparating, and will be available soon.