GWAS visualization (1): Summary table

본 강의는 GWAS 결과를 보기 좋게 정리하는 것이 목표이며 세부 내용은 다음과 같다.

Summary table: GWAS, Pruning, Clumping, Gene info
Visualization: QQ/Manhattan/Volcano/Regional plot
Automation

Setting

필요한 소프트웨서 혹은 패키지를 소개하겠다.

PLINK: http://zzz.bwh.harvard.edu/plink/download.shtml 를 참고하여 설치.
R packages
- knitr, rmarkdown : HTML 문서 만들기
- DT : Interactive 테이블 작성
- data.table : 빠른 파일 읽기 & 데이터 처리
- glue : 문자열 작업
- ggplot2, manhattanly, ggiraph : 시각화

library(knitr);library(data.table);library(glue);library(DT)
dir="/home/jinseob2kim/Dropbox/example"
setwd(dir)
getwd()

## [1] "/home/jinseob2kim/Dropbox/example"

실습 데이터는 bed 포맷으로 되어있으며 phenotype은 중성지방(TG)과 그것을 이분형 변수로 만든 high_tg을 사용한다. glue를 사용해서 parameter와 문자를 자연스럽게 붙여 표현할 수 있다.

bfile=glue("{dir}/Nong_1787")
bfile

## /home/jinseob2kim/Dropbox/example/Nong_1787

phfile=glue("{dir}/Nong_1787_tgnew.pheno")
phfile

## /home/jinseob2kim/Dropbox/example/Nong_1787_tgnew.pheno

GWAS using PLINK

연속변수: TG

PLINK의 –linear 옵션을 사용하며, glue를 활용해 실행문을 만들고 system함수로 리눅스 커맨드를 R에서 실행할 수 있다.

phname="tg"
outLinear=glue("{dir}/LinearTG")
system(glue("plink --noweb --bfile {bfile} --pheno {phfile} --pheno-name {phname} --linear --covar {phfile} --covar-name age,sex --out {outLinear}"))

결과는 –out에 지정한 파일이름 + assoc.linear 확장자로 저장된다. SNP의 효과는 TEST 열이 ADD인 항목들이다. data.table패키지의 fread 명령어로 빠르고 간단하게 결과파일을 읽을 수 있으며 실습 편의상 top 100개만 보겠다. DT 패키지의 datatable함수로 interactive table을 만들 수 있다.

## See result
outLinear=glue("{dir}/LinearTG")
out.linear=fread(glue("{outLinear}.assoc.linear"))
#kable(head(out.linear),caption="assoc.linear 파일",align="c")
datatable(out.linear[TEST=="ADD"][order(P)][1:100],caption="assoc.linear 파일",rownames=F)

이분형 변수: high_tg

–logistic 옵션을 사용하고 –pheno-name에 affected status(1)를 추가로 지정한다. 마찬가지로 SNP의 효과는 TEST 열이 ADD인 항목들이다.

phname2="high_tg"
outLogistic=glue("{dir}/LogisticTG")
system(glue("plink --noweb --bfile {bfile} --pheno {phfile} --pheno-name {phname2} --1 --logistic --covar {phfile} --covar-name age,sex --out {outLogistic}"))

결과는 assoc.logistic 확장자로 저장된다. 실습 편의상 top 100개만 보겠다.

outLogistic=glue("{dir}/LogisticTG")
out.logistic=fread(glue("{outLogistic}.assoc.logistic"))
#kable(head(out.logistic),caption="assoc.logistic 파일",align="c")
datatable(out.logistic[TEST=="ADD"][order(P)][1:100],caption="assoc.logistic 파일",rownames=F)

LD Pruning

LD Equilibrium인 SNP들을 얻고 싶을 때 수행하며 VIF(Variance Inflation Factor) 또는 pairwise genotype correlation(\(r^2\))을 이용한다.

VIF 이용

테스트할 SNP을 50개 SNP으로 회귀분석한 후 \(R^2\)값을 이용해 VIF를 계산한다. 그 다음 테스트는 지금 SNP으로부터 5번째 뒤에 있는 SNP이다

\[VIF = \dfrac{1}{1-R^2}\]

VIF가 2 이상이면 해당 SNP을 제외.
- VIF=1 : completely independent
- 2~10 : 흔히 쓰는 cut-off

Pairwise correlation 이용

인접한 50개 SNP들끼리의 상관계수의 제곱(\(r^2\))를 모두 구한다(\(\binom{50}{2}\)개).
\(r^2 \ge 0.5\) 인 쌍은 제외.
Shift the window 5 SNPs forward and repeat the procedure.

실행은 아래 코드와 같이 –indep 혹은 –indep-pairwise 옵션을 사용하고 저장할 파일이름을 –out 옵션에 적으면 된다. 실행시간이 오래걸린다.

## out file to save
outPrune=glue("{dir}/outPrune")                                                                 

## VIF method
#system(glue("plink --noweb --bfile {bfile} ---indep 50 5 2 --out {outPrune}"))

## R^2 method
system(glue("plink --noweb --bfile {bfile} --indep-pairwise 50 5 0.5 --out {outPrune}"))

포함된 SNP은 prune.in파일에, 제거된 SNP은 prune.out 파일에 저장된다. 처음 100개씩만 보겠다.

## See result
pin=fread(glue("{dir}/outPrune.prune.in"),header=F)
pout=fread(glue("{dir}/outPrune.prune.out"),header=F)

datatable(pin[1:100],caption="prune.in 파일",rownames=F)

nrow(pin)

## [1] 117510

datatable(pout[1:100],caption="prune.out 파일",rownames=F)

nrow(pout)

## [1] 21515

#kable(head(pin),caption="prune.in file",align="c")
#kable(head(pout),caption="prune.out file",align="c")

Load example: GWAS result- SNPs & Prune.in

이번에는 앞서 수행한 pruning과 GWAS 결과를 조합하여 (1) keeping, (2) SNP들의 결과만 저장하려고 한다. 물론 pruning할 때 keeping SNP들만의 데이터를 따로 저장할 수도 있으며 그렇다면 GWAS결과에서 SNP들의 정보만 취하면 된다. 먼저 아래와 같이 함수를 만들겠다.

## Function
ReadPlinkGWAS=function(res.gwas, res.prune.in){
  res.snp= res.gwas[TEST=="ADD"]
  setkey(res.snp,SNP)
  setkey(res.prune.in,V1)
  out=res.snp[res.prune.in]     ## Y[X]: Left inner join of X, Y
  return(out[order(P)])         ## order by P
}

앞서 수행한 pruning과 GWAS결과를 위 함수에 적용하면 아래와 같다. 마찬가지로 실습 편의상 top 100개의 결과만 보겠다.

# Read pruning & GWAS results#
res.gwas= out.linear
res.prune.in = pin

## setwd(dir) 
## res.gwas = fread("LinearTG.assoc.linear")
## res.prune.in = fread("outPrune.prune.in",header=F)

res.prune.snp = ReadPlinkGWAS(res.gwas, res.prune.in)
datatable(res.prune.snp[order(P)][1:100],rownames=F,caption = "Results of Keeping SNPs")

SNP갯수를 비교하면 아래와 같이 prune.in의 SNP갯수로 맞춰졌음을 확인할 수 있다.

nrow(res.gwas[TEST == "ADD"])      ## gwas

## [1] 139025

nrow(res.prune.in)                 ## prune.in

## [1] 117510

nrow(res.prune.snp)                ## final

## [1] 117510

Clumping

Clumping은 GWAS 분석한 결과에서 유의한 SNP들 중 비슷한 정보를 갖는 것들을 걸러내는 방법이며 비슷한 정보의 기준은 physical distance, LD이다. –clump 옵션을 이용하며 세부 parameter는 아래와 같이 컨트롤 한다.

–clump-p1 0.0001: Significance threshold for index SNPs
–clump-p2 0.01: Secondary significance threshold for clumped SNPs
–clump-r2 0.50: LD threshold for clumping
–clump-kb 250: Physical distance threshold for clumping

아래는 TG의 결과를 이용하여 clumping을 수행하고 결과를 살펴보는 코드이다. –clump-range는 gene정보를 추가로 제공하는 옵션이며 glist-hg18 파일은 http://zzz.bwh.harvard.edu/plink/dist/glist-hg18에서 미리 다운받았다.

outLinear=glue("{dir}/LinearTG")
system(glue("plink --noweb --bfile {bfile} --clump {outLinear}.assoc.linear --out {outLinear} --clump-range {dir}/glist-hg18"))

Clump결과는 .clumped 파일에, gene정보는 .clumped.ranges 파일에 저장되며 실행시간이 오래걸린다.

out.clumped = fread(glue("{outLinear}.clumped")) 
out.clumped.range = fread(glue("{outLinear}.clumped.ranges"))

datatable(out.clumped[order(P)][1:100],caption="Clumping")

datatable(out.clumped.range[order(P)][1:100],caption="Gene info")

Align SNP to gene

위에서 다운받은 glist-hg18파일을 이용하여 각 SNP에 gene이름을 매칭해 보겠다. 이 gene 정보는 시각화에서 활용될 것이다.

glist_hg18=fread(glue("{dir}/glist-hg18"))
setnames(glist_hg18, paste("V",1:4,sep=""),c("chr","start","end","symbol"))
setkey(glist_hg18,chr,start,end)

datatable(glist_hg18[1:100],caption="Gene list: hg18",rownames=F)

Merge

foverlaps함수를 이용해서

glist_hg18의 gene, chromosome, start, end 정보와
GWAS 결과의 chromosome, BP정보

를 합치면 SNP에 해당하는 gene을 매칭할 수 있다. 마찬가지로 결과는 top 100개만 보겠다.

## GWAS result: make new variable for foverlaps - chr, start, end 
res.gwas[,":="(chr=as.character(CHR), start=BP, end=BP)]

## Run 
res.gwas.addgene = foverlaps(res.gwas[TEST=="ADD"], glist_hg18,by.x=c("chr","start","end"), type="within", mult="first")

## See
datatable(res.gwas.addgene[,c(5:13,4)][order(P)][1:100],rownames=F,caption = "Add gene information")

## Save only plink + gene info
fwrite(res.gwas.addgene[,c(5:13,4)],"ResGWASAddgene.csv")

리눅스 자체 명령어 이용

마지막으로 리눅스 배치 파일을 이용한 실행과 비교해보겠다. 결과를 p값에 대해 sorting해서 저장하는 것을 awk 명령어로 수행하겠다. awk는 패턴검색 및 처리 언어로 아래의 3가지 문법이 있다.

awk ‘pattern’
awk ‘{action}’
awk ‘pattern {action}’

실습은 GWAS 결과파일인 LinearTG.assoc.linear을 갖고

변수명이 있는 1행, TEST가 ADD인 행들 저장.
p값(9열)에 대해 숫자(general numeric) sorting
파일 저장.
처음 6행 보기: cat 명령어 사용

을 수행하겠다.

cd /home/secondmath/Dropbox/example

awk 'NR==1||$5="ADD" {print $0|"sort --g --key=9"}' LinearTG.assoc.linear > sort.txt

cat sort.txt | head -6

## bash: line 0: cd: /home/secondmath/Dropbox/example: No such file or directory
## awk: cannot open LinearTG.assoc.linear (No such file or directory)