TCGA2STAT enables users to easily download TCGA data directly into a format ready for statistical analysis in the R environment. The package imports and processes both molecular profiles and clinical data for more than 30 cancer types profiled with different high-throughput platforms (Appendix A), including microarray, next generation sequencing, methylation array, SNP array, and array-CGH. The package automatically combines molecular profiles and clinical data into a data matrix ready for supervised analysis. Further, the package provides users a simple function to merge data from multiple molecular profiles or platforms for integrated statistical analysis. Most importantly, all of the package functionality is available from one simple and user-friendly interface that does not require any domain-specific knowledge to utilize.
The data imported by this package is the version-stamped standardized data sets hosted and maintained by the Broad Institute GDAC Firehose. All imported data will be the latest version available from Firehose; typically, this is the aggregated Level 3 data from all sample batches and public clinical data as available from TCGA Data Portal.
This package depends on XML (Lang 2013) to parse the contents and paths while importing data from Firehose. Additionally, it requires CNTools (Zhang) to process the segmented CNV or CNA data into gene-level data.
This package relies on the system’s capability to decompress the downloaded data which is a tar. Thus, users running R in Windows system that does not support tar
will receive an error message.
The TCGA2STAT package and the underlying code in the package are distributed under the GPL2.0 license. Users of the package are free to use and redistribute the package but editing is prohibited. The TCGA2STAT package is available from The Comprehensive R Archive Network (CRAN) and the package’s Github.
Usage of this package to import the TCGA data and employ the data in downstream analysis constitutes to agreement to TCGA data usage policy.
The TCGA2STAT package is available from The Comprehensive R Archive Network (CRAN).
To install TCGA2STAT from the package archive file obtained from the package’s Github:
> install.packages("TCGA2STAT_1.0.tar.gz", repos = NULL, type = "source")
This package is intended to be used in Unix and Mac OS, where tar
archive utility is supported natively. In Windows, if tar
is not supported, users will be given an error message stating Error: TAR is not installed in the system. Data unzip failed.
To enable tar
support in Windows, users could take the following steps:
Install Cygwin
In R, set the system environment to path for tar
and gzip
, for example:
> Sys.setenv(TAR="C:/cygwin64/bin/tar", R_GZIPCMD="C:/cygwin64/bin/gzip")
TCGA2STAT enables users to download any type of data via one single, uniform interface: the getTCGA
function. The type of data to be downloaded can be specified via three parameters:
disease
- acronym for the cancer typedata.type
- they type of omics-profilestype
- the specific type of measurement for the omics-profileFor example, the following command will return the RPKM values of genes profiled using RNA-Sequencing from ovarian cancer patients.
> rnaseq.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM")
Upon running, a message will be appear indicating that the RNAseq data is being imported. And, upon successful execution of the command, the number of genes imported will be given in another message; for example:
RNAseq data will be imported! This may take some time!
19990 genes have been imported!
The returned object rnaseq.ov
will be a list containing three elements:
dat
- The omics-profiles data matrix with genes in rows and patients in columns; in this example, this is a matrix of RPKM values.clinical
- Clinical data matrix; in this example, NULL is returned as the clinical data is not specified (default).merged.dat
- If clinical data is imported, a data matrix with both omics-profiles and clinical covariates (overall survival by default); patients are in the rows with the sample ID in the first column, followed by clinical covariates, and then the omics-profiles.As an example, the internal structure of the object returned from the above command is as follows:
> str(rnaseq.ov)
List of 3
$ dat : num [1:19990, 1:299] 0.0539 0 0.4204 3.6155 5.3981 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:19990] "AADACL3" "AADACL4" "ABCA4" "ABCB10" ...
.. ..$ : chr [1:299] "TCGA-04-1348-01A-01R-1565-13-2" "TCGA-04-1357-01A-01R-1565-13-2" "TCGA-04-1362-01A-01R-1565-13-2" "TCGA-04-1364-01A-01R-1565-13-2" ...
$ clinical : NULL
$ merged.dat: NULL
The data matrix of the omics profiles returned are of dimension gene x patients; the row names are gene symbols and the column names are samples ID (BCR code):
> head(rnaseq.ov$dat[, 1:3])
TCGA-04-1348-01A-01R-1565-13-2 TCGA-04-1357-01A-01R-1565-13-2 TCGA-04-1362-01A-01R-1565-13-2
AADACL3 0.0539 0.2341 0.0573
AADACL4 0.0000 0.0000 0.0000
ABCA4 0.4204 0.1647 0.2120
ABCB10 3.6155 7.1614 3.1549
ABCD3 5.3981 3.8884 7.2949
ABL2 1.6614 2.4809 1.7997
To obtain default clinical data together with the omics profiles, specify clinical=TRUE
. Consider the example below:
# Get the RPKM of genes along with all clinical data, and combine the RPKM with overall survival (OS)
> rnaseq_os.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM", clinical=TRUE)
# Look at the merged RPKM-OS data matrix
> dim(rnaseq_os.ov$merged.dat)
[1] 295 19993
> head(rnaseq_os.ov$merged.dat[,1:5])
bcr status OS AADACL3 AADACL4
1 TCGA-04-1348 1 1483 0.0539 0.0000
2 TCGA-04-1357 0 NA 0.2341 0.0000
3 TCGA-04-1362 1 1348 0.0573 0.0000
4 TCGA-04-1364 1 1024 0.0000 0.0242
5 TCGA-04-1365 0 2329 0.0177 0.0091
6 TCGA-04-1514 1 1720 0.0019 0.0000
To customize the clinical variables, specify cvars
to the desired clinical variable. Consider the example below:
# Get the RPKM gene expression profiles along with all clinical data, and combine the expression with age
> rnaseq_age.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM", clinical=TRUE, cvars="yearstobirth")
# Look at the merged RPKM-age data matrix
> head(rnaseq_age.ov$merged.dat[,1:5])
bcr YEARSTOBIRTH AADACL3 AADACL4 ABCA4
1 TCGA-04-1348 44 0.0539 0.0000 0.4204
2 TCGA-04-1357 52 0.2341 0.0000 0.1647
3 TCGA-04-1362 59 0.0573 0.0000 0.2120
4 TCGA-04-1364 61 0.0000 0.0242 0.0560
5 TCGA-04-1365 87 0.0177 0.0091 0.0757
6 TCGA-04-1514 45 0.0019 0.0000 0.1713
The list of acronyms for all available cancer types is given in Appendix A at the end of this vignette. A summary of all values permitted for both data.type
and type
is given in Appendix B. A list of values for cvar
and their brief descriptions is given in Appendix C. However, since the clinical variables vary from one cancer type to another, the list in Appendix C is not the complete list. Users interested in a specific clinical covariate for the particular cancer type should check the availability of the clinical covariate from the column-name of clinical
data matrix in the list returned by getTCGA
function.
In the subsections below, we will give examples of how to download each type of omics data along with a short description of the data obtained.
The two major platforms used to profile gene expression in TCGA cancer patients are microarray and RNA-Sequencing. Our package allows users to download data for both platforms if data is available.
Gene expression data profiled by microarray are available in different forms depending on the type of microarray employed. Specifically, three forms are available:
type="G450"
)type="U133"
)type="Huex"
)The data imported from these platforms are level-3 gene-level expression data. Please also take note that only Agilent G450 is available for every cancer type; Affymetrix U133A and Human Exon ST Array are only available for certain cancers.
Agilent 244K Custom Gene Expression G4502A
# Get Agilent G450 expression for ovarian cancer patients
> exp.ov <- getTCGA(disease="OV", data.type="mRNA_Array", type="G450")
# Look at the data
> dim(exp.ov$dat)
[1] 17814 597
> head(exp.ov$dat[,1:3])
TCGA-01-0630-11A-01R-0363-07 TCGA-01-0631-11A-01R-0363-07 TCGA-01-0633-11A-01R-0363-07
ELMO2 0.092000 0.0081875 -0.0953125
CREB3L1 0.469500 0.1473333 0.0235000
RPS11 0.965600 0.7915000 1.2695000
PNMA1 0.815400 0.8960000 0.8314000
MMP2 -0.969125 -0.0452500 -0.4230000
C10orf90 -1.887200 -1.9418000 -1.9832000
>
This gene expression data from Agilent G450 should have ~17814 genes. The expression profiles are lowess-normalized log-ratio values that usually range from -10 to 10, as shown in the following histogram (Fig.1):
Fig.1. Histogram of Agilent G450 expression data.
Affymetrix Human Genome U133A 2.0 Array
# Get Affymetrix Human Genome U133A expression for ovarian cancer patients
> u133a.ov <- getTCGA(disease="OV", data.type="mRNA_Array", type="U133")
# Look at the data
> dim(u133a.ov$dat)
[1] 12042 593
> head(u133a.ov$dat[,1:3])
TCGA-01-0628-11A-01R-0362-01 TCGA-01-0630-11A-01R-0362-01 TCGA-01-0631-11A-01R-0362-01
AACS 5.772263 6.428424 5.574237
FSTL1 8.220414 8.496560 8.710241
ELMO2 4.933350 5.430913 4.705522
CREB3L1 3.622720 3.870238 3.514235
RPS11 9.970795 10.387306 10.124355
PNMA1 7.959072 8.356616 8.132725
>
This data of ~12042 genes from Affymetrix HG-U133A is preprocessed and normalized using the RMA method, which results in profiles usually ranging from 2 to 14 as shown in the histogram below (Fig.2):
Fig.2. Histogram of Affymetrix HG-U133A expression data.
Affymetrix Human Exon 1.0 ST Array
Affymetrix Human Exon 1.0 ST Array is another common platform to profile gene level expression. Unlike HG-U133A where probes were designed to cover the entire genome, probes in Human Exon 1.0 ST Array are designed to cover exonic regions of the genes.
# Get Affymetrix Human Exon 1.0 ST Array expression for ovarian cancer patients
> huex.ov <- getTCGA(disease="OV", data.type="mRNA_Array", type="Huex")
# Look at the data
> dim(huex.ov$dat)
[1] 18632 594
> head(huex.ov$dat[,1:3])
TCGA-01-0628-11A-01R-0361-03 TCGA-01-0630-11A-01R-0361-03 TCGA-01-0631-11A-01R-0361-03
C9orf152 5.684061 5.740597 5.379095
ELMO2 6.940721 6.955458 6.859219
RPS11 12.112295 11.603771 11.435075
CREB3L1 6.141857 6.470617 6.821231
PNMA1 9.069748 9.193657 8.966602
MMP2 7.993179 8.551214 9.651369
This level-3 data is quantile-normalized, consisting of ~18632 genes with values range from 2 to 14 as shown in the following histogram (Fig.3):
Fig.3. Histogram of Affymetrix Human Exon 1.0 ST Array data.
TCGA also provides gene expression data measured via RNA-Sequencing technology, especially via Illumina Genome Analyzer (GA) or Illumina HiSeq 2000. For either platform, the data is available in two forms: RNA-Seq (preprocessed using the first pipeline), and RNA-SeqV2 (preprocessed using version two analysis).
RNA-Seq
Our package allows the user to download RNA-Seq data from both Illumina GA and Illumina HiSeq, depending on data availability. If gene expression is profiled via both platforms, only the data from HiSeq will be imported. In addition, users have the following options for which type of data to import: raw-counts (type="count"
) or the normalized counts (type="RPKM"
).
# Get raw count, RNA-Seq data for ovarian cancer patients
> counts.ov <- getTCGA(disease="OV", data.type="RNASeq")
# or equivalently
> counts.ov <- getTCGA(disease="OV", data.type="RNASeq", type="count")
# Look at the data:
> dim(counts.ov$dat)
[1] 19990 299
> head(counts.ov$dat[,1:3])
TCGA-04-1348-01A-01R-1565-13 TCGA-04-1357-01A-01R-1565-13 TCGA-04-1362-01A-01R-1565-13
AADACL3 36 64 30
AADACL4 0 0 0
ABCA4 575 92 226
ABCB10 2308 1864 1572
ABCD3 4890 1437 5160
ABL2 3591 2187 3038
# Get normalized counts, from RNA-Seq for ovarian cancer patients
> rnaseq.ov <- getTCGA(disease="OV", data.type="RNASeq", type="RPKM")
# Look at the data:
> dim(rnaseq.ov$dat)
[1] 19990 299
> head(rnaseq.ov$dat[,1:3])
TCGA-04-1348-01A-01R-1565-13-2 TCGA-04-1357-01A-01R-1565-13-2 TCGA-04-1362-01A-01R-1565-13-2
AADACL3 0.0539 0.2341 0.0573
AADACL4 0.0000 0.0000 0.0000
ABCA4 0.4204 0.1647 0.2120
ABCB10 3.6155 7.1614 3.1549
ABCD3 5.3981 3.8884 7.2949
ABL2 1.6614 2.4809 1.7997
RNA-SeqV2
For RNA-Sequencing data from the second analysis pipeline (RNASeqV2), our package provides importation of data from Illumina HiSeq only as this is the most common profiling method. The values returned are the RSEM values for the genes, which usually contains 20501 genes, with RSEM values ranging from 0 to 10^6.
# Get RSEM values, RNA-SeqV2 data for ovarian cancer patients
rsem.ov <- getTCGA(disease="OV", data.type="RNASeq2")
# Look at the data
> dim(rsem.ov$dat)
[1] 20501 307
> head(rsem.ov$dat[,1:3])
TCGA-04-1348-01A-01R-1565-13 TCGA-04-1357-01A-01R-1565-13 TCGA-04-1362-01A-01R-1565-13
A1BG 66.4695 65.5664 41.6412
A1CF 0.0000 0.0000 0.3310
A2BP1 0.2689 0.6510 4.3025
A2LD1 221.5219 141.2826 265.8161
A2ML1 7.5289 54.6875 5.6263
A2M 5899.8279 9384.4401 3350.4207
Similarly to gene expression, expression profiles of micro-RNAs (miRNA) are available from both the microarray platforms and the sequencing platforms. However, while the miRNA-Seq data is available for all cancer types, miRNA-array data is only available for three cancer types: GBM, GBMLGG, and OV.
The miRNA expression data available for download with our package is the level-3 Distance Weighted Discrimination (DWD) batch adjusted miRNA expression profiled from Agilent 8 x 15K Human miRNA-specific microarray (H-miRNA_8x15K) or Agilent Human miRNA Microarray Rel12.0 (H-miRNA_8x15Kv2 - for OV only).
# Get miRNA expression data via microarray for ovarian cancer patients
mir.ov <- getTCGA(disease="OV", data.type="miRNA_Array")
# Look at the data:
> str(mir.ov)
List of 3
$ dat : num [1:799, 1:594] 4.67 4.58 4.62 4.75 4.55 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:799] "ebv-miR-BART1-3p" "ebv-miR-BART1-5p" "ebv-miR-BART10" "ebv-miR-BART10*" ...
.. ..$ : chr [1:594] "TCGA-01-0628-11A-01T-0364-07" "TCGA-01-0630-11A-01T-0364-07" "TCGA-01-0631-11A-01T-0364-07" "TCGA-01-0633-11A-01T-0364-07" ...
$ clinical : NULL
$ merged.dat: NULL
> head(mir.ov$dat[,1:3])
TCGA-01-0628-11A-01T-0364-07 TCGA-01-0630-11A-01T-0364-07 TCGA-01-0631-11A-01T-0364-07
ebv-miR-BART1-3p 4.668551 4.640936 5.170681
ebv-miR-BART1-5p 4.579950 4.696028 4.855175
ebv-miR-BART10 4.618877 4.684807 4.742815
ebv-miR-BART10* 4.747920 4.730715 4.734904
ebv-miR-BART11-3p 4.553927 4.648465 4.605977
ebv-miR-BART11-5p 4.698162 4.642823 4.711533
miRNA expression was profiled for most cancer types using Illumina HiSeq 2000 miRNA Sequencing. Similar to RNA-Seq, two types of miRNA-Seq data can be downloaded: raw counts (type="count"
) or RPMMM (type="rpmmm"
). Examples below show how to obtain each type of miRNA-Seq data:
# Get miRNA expression read counts for ovarian cancer patients
mirseq.ov <- getTCGA(disease="OV", data.type="miRNASeq")
# or equivalently
mirseq.ov <- getTCGA(disease="OV", data.type="miRNASeq", type="count")
#
# Get miRNA expression RPMMM values for ovarian cancer patients
mirseqn.ov <- getTCGA(disease="OV", data.type="miRNASeq", type="rpmmm")
> str(mirseq.ov)
List of 3
$ dat : int [1:705, 1:461] 124097 247488 124173 523763 133064 39686 31230 47 11523 1375 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:705] "hsa-let-7a-1" "hsa-let-7a-2" "hsa-let-7a-3" "hsa-let-7b" ...
.. ..$ : chr [1:461] "TCGA-04-1331-01A-01R-1569-13" "TCGA-04-1332-01A-01R-1564-13" "TCGA-04-1336-01A-01R-1564-13" "TCGA-04-1337-01A-01R-1564-13" ...
$ clinical : NULL
$ merged.dat: NULL
> head(mirseq.ov$dat[,1:3])
TCGA-04-1331-01A-01R-1569-13 TCGA-04-1332-01A-01R-1564-13 TCGA-04-1336-01A-01R-1564-13
hsa-let-7a-1 124097 9682 3037
hsa-let-7a-2 247488 19634 6274
hsa-let-7a-3 124173 9825 3132
hsa-let-7b 523763 88962 66321
hsa-let-7c 133064 22926 5265
hsa-let-7d 39686 8955 14718
> head(mirseqn.ov$dat[,1:3])
TCGA-04-1331-01A-01R-1569-13-1 TCGA-04-1332-01A-01R-1564-13-1 TCGA-04-1336-01A-01R-1564-13-1
hsa-let-7a-1 30713.962 11269.88 3953.866
hsa-let-7a-2 61253.190 22854.04 8168.112
hsa-let-7a-3 30732.772 11436.33 4077.546
hsa-let-7b 129631.151 103552.07 86343.214
hsa-let-7c 32933.291 26685.94 6854.496
hsa-let-7d 9822.271 10423.65 19161.343
The level-3 files of called mutations from the Broad Firehose are saved in Mutation Annotation Format (MAF). Each MAF file lists mutations found for the particular patient. Since each patient may have different mutations, the number of genes in each MAF file vary from patient to patient. Hence, the MAF files are not in a format ready for statistical analysis. To solve this, our package aggregates the obtained MAF files into a matrix of dimension gene x patient, with a value of one in cell(i,j) if a mutation is found in gene-i from patient-j, and a zero otherwise.
Our package permits two options for the mutation data:
type="somatic"
) which is the default. This will only return those mutations with “somatic”" as the mutation-status and not with “silent” as the variant-classification in the original MAF files.type="all"
).Here are some example on downloading somatic mutation data:
# Get somatic non-silent mutations for ovarian cancer patients
mut.ov <- getTCGA(disease="OV", data.type="Mutation")
# or equivalently
mut.ov <- getTCGA(disease="OV", data.type="Mutation", type="somatic")
> str(mut.ov)
List of 3
$ dat : num [1:5821, 1:232] 1 1 1 1 1 1 1 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5821] "PPP2R5A" "TBC1D4" "ICAM1" "CPS1" ...
.. ..$ : chr [1:232] "TCGA-24-1471" "TCGA-13-0899" "TCGA-23-1021" "TCGA-04-1347" ...
$ clinical : NULL
$ merged.dat: NULL
> head(mut.ov$dat[,1:6])
TCGA-24-1471 TCGA-13-0899 TCGA-23-1021 TCGA-04-1347 TCGA-10-0930 TCGA-23-1027
PPP2R5A 1 0 0 0 0 0
TBC1D4 1 0 0 0 0 0
ICAM1 1 0 0 0 0 0
CPS1 1 0 0 0 0 0
EP300 1 0 0 0 0 0
PEBP4 1 0 0 0 0 0
Note that in the example below, the number of genes is much larger if one is importing all mutations; but the data matrix is considerably sparser.
# Get all mutations called for ovarian cancer patients
> allmut.ov <- getTCGA(disease="OV", data.type="Mutation", type="all")
> str(allmut.ov)
List of 3
$ dat : num [1:10060, 1:316] 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10060] "HRNR" "LHX9" "PPP2R5A" "ZP4" ...
.. ..$ : chr [1:316] "TCGA-24-1471" "TCGA-13-0899" "TCGA-23-1021" "TCGA-04-1347" ...
$ clinical : NULL
$ merged.dat: NULL
DNA methylation profiles were obtained either via Illumina Infinium HumanMethylation27 BeadChip or the HumanMethylation450 BeadChip. The former platform probes for ~27,000 CpG sites (thus commonly known 27K) and the latter probes for ~450,000 CpG sites. Our package allows users to import either methylation data when available, via type="27K"
(default) or type="450K"
. For example:
# Get 27K methylation profiles for ovarian cancer patients
methyl.ov <- getTCGA(disease="OV", data.type="Methylation", type="27K")
# or equivalently
methyl.ov <- getTCGA(disease="OV", data.type="Methylation")
# Get 450K methylation profiles for ovarian cancer patients
methyl45.ov <- getTCGA(disease="OV", data.type="Methylation", type="450K")
Note that the matrix of the methylation profiles returned is NOT aggregated at the gene level. Each row in the data matrix (dat
) is a probe from the methylation assay, which represents a CpG site. Since genes often contain more than one CpG site and each CpG site can differ significantly in the methylation level, gene level aggregation is less desirable. Hence, our package returns the probe-level data.
# Look at the data
> str(methyl.ov)
List of 4
$ dat : num [1:27578, 1:612] 0.7994 0.339 0.0281 0.6012 NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:27578] "cg00000292" "cg00002426" "cg00003994" "cg00005847" ...
.. ..$ : chr [1:612] "TCGA-01-0628-11A-01D-0383-05" "TCGA-01-0630-11A-01D-0383-05" "TCGA-01-0631-11A-01D-0383-05" "TCGA-01-0633-11A-01D-0383-05" ...
$ cpgs :'data.frame': 27578 obs. of 3 variables:
..$ Gene_Symbol : chr [1:27578] "ATP2A1" "SLMAP" "MEOX2" "HOXD3" ...
..$ Chromosome : chr [1:27578] "16" "3" "7" "2" ...
..$ Genomic_Coordinate: int [1:27578] 28890100 57743543 15725862 177029073 148822837 93862594 93813777 11980953 89290921 0 ...
$ clinical : NULL
$ merged.dat: NULL
> head(methyl.ov$dat[,1:3])
TCGA-01-0628-11A-01D-0383-05 TCGA-01-0630-11A-01D-0383-05 TCGA-01-0631-11A-01D-0383-05
cg00000292 0.79940858 0.62039417 0.91148841
cg00002426 0.33900444 0.18030460 0.29385049
cg00003994 0.02811930 0.03607298 0.06159123
cg00005847 0.60116497 0.64955777 0.47394908
cg00006414 NA NA NA
cg00007981 0.01881682 0.01803597 0.02904295
> head(methyl.ov$cpgs)
Gene_Symbol Chromosome Genomic_Coordinate
cg00000292 ATP2A1 16 28890100
cg00002426 SLMAP 3 57743543
cg00003994 MEOX2 7 15725862
cg00005847 HOXD3 2 177029073
cg00006414 ZNF425;ZNF398 7 148822837
cg00007981 PANX1 11 93862594
As shown above, the returned list for methylation data has one additional element, cpgs
, which contains the probe’s genomic information. Each row of cpgs
is a CpG site and contains gene symbols and genomic coordinates in columns; rows in cpgs
are of the same number and order as the methylation profiles data dat
.
Values in the methylation data are the beta values. They range from zero to one (as shown in the histogram in Fig.4 below), with values greater than 0.5 representing hypermethylation, values less than 0.5 representing hypomethylation.
Fig.4. Histogram of methylation data.
Two major high-throughput platforms used to profile copy number changes are SNP-arrays and array comparative genomic hybridization (aCGH). Our package permits users to download both of these data types, although only SNP-array data is available for most cancer types. The copy number changes downloaded with our package are aggregated at the gene level; specifically, we merge the level-3 segments provided from TCGA (with reference genome build hg19) at the gene level. Values in the dat
matrix of the returned list object are the log-ratio of copy number between samples to reference samples. Thus, a positive log-ratio indicates a copy number gain and negative value indicates a DNA copy number loss.
Two forms of copy number changes profiled using SNP-arrays (Affymetrix Genome-Wide Human SNP Array 6.0) are available: data.type="CNA_SNP"
gives somatic copy number alterations (CNAs), which are somatic changes to chromosome structure that result in gains or losses in copies of sections of DNA; data.type="CNV_SNP"
gives copy number variants (CNVs), which are the copy number ratios from somatic cells minus the ratios from germlines. Examples of these are given below:
# Get SNP array CNA for ovarian cancer patients
cnasnp <- getTCGA(disease="OV", data.type="CNA_SNP")
# or
# Get SNP array CNV for ovarian cancer patients
cnvsnp <- getTCGA(disease="OV", data.type="CNV_SNP")
# Look at the data
> head(cnvsnp$dat[,1:3])
TCGA-01-0628-11A-01D-0356-01 TCGA-01-0630-11A-01D-0356-01 TCGA-01-0631-11A-01D-0356-01
A1BG -0.0043 0.0002 -0.0797
A2M 0.0028 -0.0003 0.1000
A2MP1 0.0028 -0.0003 0.1000
NAT1 0.0047 0.0050 -0.0045
NAT2 0.0047 0.0050 -0.0045
SERPINA3 0.0024 0.0031 0.0501
Copy number changes profiled via array-CGH are measured with two different types of arrays: Agilent Human Genome CGH Microarray 244A (type="244A"
) and Agilent Human Genome CGH Custom Microarray 2x415K (type="415K"
). Unlike CNA/CNV profiled with SNP-arrays that are available for most cancer types, aCGH-profiled CNA data are available for only a few cancer types and most of them only for Custom Array 2x415. Examples of how to obtain this data are given below:
# Get aCGH CNA for ovarian cancer patients
cnacgh <- getTCGA(disease="OV", data.type="CNA_CGH")
# or equivalently
cnacgh <- getTCGA(disease="OV", data.type="CNA_CGH", type="415K")
# Look at the data
> names(cnacgh)
[1] "dat" "clinical" "merged.dat"
> head(cnacgh$dat[,1:3])
TCGA-04-1353-01A-01D-1046-02 TCGA-04-1353-11B-01D-1046-02 TCGA-04-1369-01A-02D-1046-02
A1BG -0.7638 -0.2282 -0.5277
A2M -0.0284 -0.0774 0.0367
A2MP1 -0.0284 -0.0774 0.0367
NAT1 -0.8137 -0.1329 -1.1487
NAT2 -0.8137 -0.1329 -1.1487
SERPINA3 -0.2948 -0.0776 -0.1730
While merging the segments of copy number changes at gene-level, we filter out chromosome Y by default, which means the above three commands are equivalent to the following:
cnasnp <- getTCGA(disease="OV", data.type="CNA_SNP", filter="Y")
cnvsnp <- getTCGA(disease="OV", data.type="CNV_SNP", filter="Y")
cnacgh <- getTCGA(disease="OV", data.type="CNA_CGH", type="415K", filter="Y")
To include all genes, give an empty string to filter
, for example:
cnacgh.gbm <- getTCGA(disease="GBM", data.type="CNA_CGH", type="415K", filter="")
Since there are certain gene symbols shared by two chromosomal locations in the sex chromosomes, we suffixed these genes by the chromosome to eliminate ambiguities. For example, in the cnacgh.gbm$dat
object obtained from the above command, “SLC25A6_X” and “SLC25A6_Y” represent SLC25A6 genes in chromosome X and SLC25A6 in chromosome Y respectively.
Similarly, by providing a vector of characters with more than one chromosome to filter
, the function will filter out any genes belonging to the specified chromosomes. For example, to get CNA autosomal chromosomes only (excluding sex chromosomes):
cnacgh <- getTCGA(disease="OV", data.type="CNA_CGH", filter=c("X", "Y"))
Our package includes a function named OMICSBind
that allows users to combine two matrices of omics-profiles for the same set of patients into a single object. In order to combine more than two types of omics-profiles together, users can call the OMICSBind
function multiple times. In the example below, we show how to combine the RNA-Seq data from the second pipeline analysis, methylation profiles from 27K, and mutation data for ovarian cancer patients:
# Get the RNA-Seq, methylation, and mutation profiles for OV cancer patients
seq <- getTCGA(disease="OV", data.type="RNASeq2")
meth <- getTCGA(disease="OV", data.type="Methylation", type="27K")
mut <- getTCGA(disease="OV", data.type="Mutation", type="all")
# Now, merge the three OMICs-data into one R object
# step 1: merge RNA-Seq and mutation data
m1 <- OMICSBind(dat1 = seq$dat, dat2 = mut$dat)
# step 2: further concatenate the methylation data to the merged data-object
m2 <- OMICSBind(dat1 = m1$merged.data, dat2 = meth$dat)
# Look at the data
> dim(seq$dat)
[1] 20501 307
> dim(mut$dat)
[1] 10060 316
> dim(meth$dat)
[1] 27578 612
> dim(m1$merged.data)
[1] 30561 185
> dim(m2$merged.data)
[1] 58139 185
As shown above, the final merged data matrix m2$merged.data
has 185 samples. These are the tumor samples of patients common in RNA-Seq, mutation, and methylation data. On the other hand, m2$merged.data
has 58139 rows, which is the combination of 20501 genes from RNA-Seq, 10060 genes from mutation, and 30561 CpG sites from methylation data. In order to distinguish gene symbols from different omics-profile, we affix the gene names with d1 and d2 in the merged data. Specifically, d1 is affixed to gene names of the first omics-data input to OMICSBind
function, d2 is affixed to the names of the second input object:
> str(m1)
List of 3
$ merged.data: num [1:30561, 1:185] 66.469 0 0.269 221.522 7.529 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:30561] "d1.A1BG" "d1.A1CF" "d1.A2BP1" "d1.A2LD1" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
$ X : num [1:20501, 1:185] 66.469 0 0.269 221.522 7.529 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "d1.A1BG" "d1.A1CF" "d1.A2BP1" "d1.A2LD1" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
$ Y : num [1:10060, 1:185] 0 0 0 0 0 0 0 0 0 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10060] "d2.HRNR" "d2.LHX9" "d2.PPP2R5A" "d2.ZP4" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
> str(m2)
List of 3
$ merged.data: num [1:58139, 1:185] 66.469 0 0.269 221.522 7.529 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:58139] "d1.d1.A1BG" "d1.d1.A1CF" "d1.d1.A2BP1" "d1.d1.A2LD1" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
$ X : num [1:30561, 1:185] 66.469 0 0.269 221.522 7.529 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:30561] "d1.d1.A1BG" "d1.d1.A1CF" "d1.d1.A2BP1" "d1.d1.A2LD1" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
$ Y : num [1:27578, 1:185] 0.7991 0.0968 0.0218 0.8795 NA ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:27578] "d2.cg00000292" "d2.cg00002426" "d2.cg00003994" "d2.cg00005847" ...
.. ..$ : chr [1:185] "TCGA-04-1348" "TCGA-04-1357" "TCGA-04-1362" "TCGA-04-1364" ...
The TCGA OMICs profiles downloaded for a cancer type often include samples of different types, such as from tumor tissue, normal tissue, or other controls. The type of sample is encoded in the sample’s BCR code. Users who are interested in studying only tumor samples, for example, would need to parse the codes to extract only these samples. To spare users the difficulties of decoding the sample ID (BCR code) to extract particular types of samples, our package includes a function, SampleSplit
to accomplish this task in a user-friendly manner. Specifically, SampleSplit
will take the omics data matrix as input and separate the matrix according to the sample types. The object returned is a list of three matrices corresponding to the omics profiles of primary solid tumor, recurrent solid tumor, and normal tissues/blood samples.
# Get the RNA-Seq V2 profiles for LUSC cancer patients
lusc.rnaseq2 <- getTCGA(disease="LUSC", data.type="RNASeq2")
# Split the OMICs data by sample types
lusc.rnaseq2.bytype <- SampleSplit(lusc.rnaseq2$dat)
# Look at the data
> str(lusc.rnaseq2.bytype)
List of 3
$ primary.tumor : num [1:20501, 1:501] 742 0 0 170 128 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "A1BG" "A1CF" "A2BP1" "A2LD1" ...
.. ..$ : chr [1:501] "TCGA-18-3406-01A-01R-0980-07" "TCGA-18-3407-01A-01R-0980-07" "TCGA-18-3408-01A-01R-0980-07" "TCGA-18-3409-01A-01R-0980-07" ...
$ recurrent.tumor: num[1:20501, 0 ]
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "A1BG" "A1CF" "A2BP1" "A2LD1" ...
.. ..$ : NULL
$ normal : num [1:20501, 1:51] 49.49 0 1.04 105.22 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "A1BG" "A1CF" "A2BP1" "A2LD1" ...
.. ..$ : chr [1:51] "TCGA-22-4593-11A-01R-1820-07" "TCGA-22-4609-11A-01R-2125-07" "TCGA-22-5471-11A-01R-1635-07" "TCGA-22-5472-11A-11R-1635-07" ...
Our package also includes a utility function, TumorNormalMatch
to prepare the downloaded omics-profiles for pairwise analysis between tumor tissues and normal tissues/blood samples in a user-friendly manner. In the example below, we show how to first get the RNA-Seq data of analysis pipeline 2 from LUSC patients and then split the data into tumor tissues and normal tissues/blood samples for the same set of patients.
# Get RNA-SeqV2 data for LUSC patients
> lusc.rnaseq2 <- getTCGA(disease="LUSC", data.type="RNASeq2")
# tumor-normal matched profiles
> lusc.rnaseq2.tum.norm <- TumorNormalMatch(lusc.rnaseq2$dat)
This command will return a list of two omics-profile data matrices with dimension of gene x patient. Both matrices have the same dimension and order for both columns (patient) and rows (gene). Thus, enabling users to make direct pairwise comparisons.
# Look at the data
> str(lusc.rnaseq2.tum.norm)
List of 2
$ primary.tumor: num [1:20501, 1:51] 94.67 0 3.46 38.73 209.32 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "A1BG" "A1CF" "A2BP1" "A2LD1" ...
.. ..$ : chr [1:51] "TCGA-22-4593" "TCGA-22-4609" "TCGA-22-5471" "TCGA-22-5472" ...
$ normal : num [1:20501, 1:51] 49.49 0 1.04 105.22 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:20501] "A1BG" "A1CF" "A2BP1" "A2LD1" ...
.. ..$ : chr [1:51] "TCGA-22-4593" "TCGA-22-4609" "TCGA-22-5471" "TCGA-22-5472" ...
The major objective of developing TCGA2STAT
package is to provide statisticians with a set of user friendly tools to obtain and prepare omics-profiles from TCGA cancer patients for statistical analysis. In this example, we show how TCGA2STAT
can enable easy and user-friendly integrated analysis of two types of TCGA omics profiles using canonical correlation analysis.
First, we import methylation profiles and RNA-Seq gene expression data for LUSC patients via the getTCGA
function from our package:
# Get data
> lusc.methyl <- getTCGA(disease="LUSC", data.type="Methylation")
> lusc.rnaseq2 <- getTCGA(disease="LUSC", data.type="RNASeq2", clinical=TRUE)
To reduce computation time, we only include CpG sites and genes that are at the top 1% of variance across patients:
# Filter data
> met.var <- apply(lusc.methyl$dat, 1, var)
> met.data <- subset(lusc.methyl$dat, met.var >= quantile(met.var, 0.99, na.rm=T) & !is.na(met.var))
> rnaseq2.var <- apply(log10(1+lusc.rnaseq2$dat), 1, var)
> rnaseq.data <- subset(log10(1+lusc.rnaseq2$dat), rnaseq2.var >= quantile(rnaseq2.var, 0.99, na.rm=T) & !is.na(rnaseq2.var))
This will give us 234 CpG sites and 206 genes. Since canonical correlation analysis requires two different types of features measured on the same set of samples, we use the OMICSBind
function in the package to get R objects with methylation profiles and gene expression profiles for the same set of LUSC patients:
> met.rnaseq2 <- OMICSBind(dat1 = rnaseq.data, dat2= met.data)
# Look at the data
> str(met.rnaseq2)
List of 3
$ merged.data: num [1:440, 1:130] 2.112 1.654 0.252 2.874 1.391 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:440] "d1.A2ML1" "d1.ADAM23" "d1.ADCY8" "d1.ADH1B" ...
.. ..$ : chr [1:130] "TCGA-18-3406" "TCGA-18-3407" "TCGA-18-3408" "TCGA-18-3409" ...
$ X : num [1:206, 1:130] 2.112 1.654 0.252 2.874 1.391 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:206] "d1.A2ML1" "d1.ADAM23" "d1.ADCY8" "d1.ADH1B" ...
.. ..$ : chr [1:130] "TCGA-18-3406" "TCGA-18-3407" "TCGA-18-3408" "TCGA-18-3409" ...
$ Y : num [1:234, 1:130] 0.7561 0.6541 0.642 0.024 0.0258 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:234] "d2.cg00061629" "d2.cg00202711" "d2.cg00332153" "d2.cg00463577" ...
.. ..$ : chr [1:130] "TCGA-18-3406" "TCGA-18-3407" "TCGA-18-3408" "TCGA-18-3409" ...
Now that the data is ready, we carry out canonical correlation analysis using the CCA
package. We apply regularized canonical correlation analysis as the number of features for both sets of omics-profiles are larger than the number of samples.
library(CCA)
# run regularized-cca
> lusc.cc <- rcc(t(met.rnaseq2$X), t(met.rnaseq2$Y), 0.75025, 0.5005)
# compute the canonical loadings
> lusc.cc2 <- comput(t(met.rnaseq2$X), t(met.rnaseq2$Y), lusc.cc)
We can visualize the results of the analysis. In order to generate plots shown in our paper, we wrote plotting functions that can be found here: script. First, we visualize how the gene expression profiles correlate with the methylation profiles:
> plotNice2(lusc.cc2, d1=1, d2=2, XY="X", cex=0.7)
Fig.5. Canonical loadings of genes.
(Note - fill in XXX) From the plot in Figure 5, we can see that gene XIST is highly correlated with the set of CpG sites. Similarly, we can see how methylation profiles correlate with the gene expression profiles:
> plotNice2(lusc.cc2, d1=1, d2=2, XY="Y", cex=0.7)
Fig.6. Canonical loadings of CpG sites.
With canonical correlation analysis, we can also investigate patterns among patients by projecting onto the canonical loadings. Here, we are interested in determining whether the estimated canonical loadings separate patients into distinct groups or cancer subtypes. First, we project onto the first two canonical loadings and visualize the patients below:
> plt.indiv(lusc.cc, d1=1, d2=2, ind.names=rep("+", nrow(t(met.rnaseq2$X))))
Fig.7. Patient clusters.
In Figure 7, we can see that patients are well-clustered into two groups. We re-plot this with two different colors in order to clearly see the two groups:
score <- lusc.cc2$xscores[,1:2]
# Parameters were estimated based on visual inspection
grp1 <- rownames(score)[(score[,1]<0 & score[,2]<0) | (score[,1]>0 & score[,2]< -1.5)]
grp2 <- setdiff(rownames(score), grp1)
grp <- ifelse(rownames(score) %in% grp1, 1, 2)
colors = colors()[c(153,220)]
plotNice2.indiv(lusc.cc2, d1=1, d2=2, cols=colors[grp], cex=2.5,pchs=c(20,20)[grp])
Fig.8. Nicer plot of patient clusters as in Fig.7.
Next, we investigate whether the two patient groups found by our analysis are clinically meaningful by seeing whether the groups exhibit differing survival outcomes. To this end, we perform a Kaplan-Meier analysis on the overall patient survival time to determine whether the survival probabilities between the two identified patient subgroups are significantly different:
# Get survival data from the imported RNASeq2 object
os.score <- lusc.rnaseq2$merged.dat[,1:3]
# prepare the data to make sure they match the two clusters from above
os <- os.score[,-1]
rownames(os) <- os.score[,1]
os <- os[rownames(lusc.cc2$xscores),]
plotdat <- data.frame(grp, os)
plotdat <- plotdat[order(plotdat[,1]),]
# KM-plot
mypamr.plotsurvival(plotdat[,1], plotdat[,3]/365, plotdat[,2], c(colors()[c(153,220)]), lty=c(2,1), lwd=c(2,2))
Fig.9. KM-plot of patients’ overall survival.
From the survival plot in Figure 9, we can conclude that the canonical loadings of gene expression and methylation profiles separate patients into two clinically distinct subgroups: the ‘black group’ is the low-risk group and the ‘gray group’ is the high-risk group. The difference in overall survival between these two groups is border-line statistically significant (log-rank p-value < 0.06).
Summary of cancer types and omics-profiles. Get File
Values permitted for data.type
and type
in getTCGA
function. Get File
10/09/2015 TCGA2STAT
10/19/2015 TCGA2STAT_1.2
07/14/2016 TCGA2STAT_1.3.1 (beta)
Source Code: Source Archive
User Guide: Manual in PDF
Lang, Duncan Temple. 2013. XML: Tools for Parsing and Generating XML Within R and S-Plus. http://CRAN.R-project.org/package=XML.
Zhang, Jianhua. CNTools: Convert Segment Data into a Region by Sample Matrix to Allow for Other High Level Computational Analyses.