Cyber-T Help

Background

The Cyber-T Bayesian statistical framework was developed by Pierre Baldi and Tony Long [1]. The web interface was first implemented by Harry Mangalam. Cyber-T has received valuable input from many UCI biologists, in particular from G. Wesley Hatfield, Denis Heck, She-pin Hung, Michelle Riehle and Suzanne Sandmeyer. The framework has been extended and maintained in Baldi's group by Yann Pecout, Laurence Richard and Suman Sundaresh, and currently by Matt Kayala and Julien Neel.

For bug reports or help, contact the Cyber-T Helpdesk.



Downloading and installing Cyber-T / hdarray (R code)

***NEW: Newly written hdarray (called bayesreg) now available for download***
***NEW: One-way regularized ANOVA (called bayesAnova) now available for download***

Cyber-T and/or the hdarray library (including hdarray, bayesreg, bayesAnova) is free for academic, non-commercial, research use only. The appropriate references [1,2] should be cited when used for academic/research purposes. For commercial licenses, please contact Pierre Baldi.

Download

To work with the high density array (hdarray) library written in R directly, you need to have R installed. There are installation packages available for most platforms and source code that will compile with gcc, all available via your nearest node of the Comprehensive R Archive Network.

Assuming you have R installed on your system, just 'source' the hdarray file using either the command line interface or the menu.

What is in the high density array library (hdarray)?


Using the hdarray code in R

Depending on the platform on which you run R, use the menu options to 'source' the hdarray file. This loads all the hdarray functions into R. On Windows or Unix-like systems, you can also type "source(filename)" at the command line. You may also want to use the menu option to change the R working directory to the directory that contains your data. Let's assume a simple text file 'Expt1.txt' as follows:

    C1,C2,E1,E2
gene1,10,8,7,9
gene2,6,5,7,6
gene3,4,5,8,9

The following commands demonstrate how to read a data file and invoke the various functions to perform t-tests or Bayes t-tests on either paired or control+experiment data. The R command line prompt looks like ">"; type the commands shown below after it.

Read the data file which, in this case, is comma separated. If the file is tab separated, omit the 'sep' argument. Do not type the ">". If the data file is not in your working directory, you need to specify the full file path.
Tab delimited file:
> dataFile<-read.table("Expt1.txt")
Comma delimited file:
> dataFile<-read.table("Expt1.txt",sep=",")
Space delimited file:
> dataFile<-read.table("Expt1.txt",sep=" ")

Display the first row to make sure the file has been read correctly
> dataFile[1,]

      C1 C2 E1 E2
gene1 10  8  7  9	

As you can see, the file has been read correctly. Since the first line contains only the column names (one fewer field than the data rows), it was read as the header. Note that there is no header name for the gene name column, hence the gene name column is used to label the rows. If you don't have column names, it works just fine too: R will create its own column names.

Display the dimensions of the dataset. The first number is the number of rows, the second is the number of columns.
> dim(dataFile)

[1] 3 4
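
The reading steps above can also be tried without creating 'Expt1.txt' by hand. The following is a minimal sketch that writes the example data to a temporary file (the name 'exptFile' is only an illustration) and reads it back with base R's read.table; it does not require hdarray.

```r
# Write the example data to a temporary file (a stand-in for 'Expt1.txt').
exptFile <- tempfile(fileext = ".txt")
writeLines(c("C1,C2,E1,E2",
             "gene1,10,8,7,9",
             "gene2,6,5,7,6",
             "gene3,4,5,8,9"), exptFile)

# The header row has one fewer field than the data rows, so read.table
# treats it as column names and uses the first column as row names.
dataFile <- read.table(exptFile, sep = ",")

print(dataFile[1, ])        # first row
print(dim(dataFile))        # number of rows, number of columns
print(rownames(dataFile))   # gene names taken from the unnamed first column
```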

Now that the data has been read, we can proceed to apply the various functions. The details of the function names are given in the earlier section. Let's apply the standard t-test, i.e. 'doitall'.
> doitall(dataFile,1,2,3,4,4,0.25,2,1,1,1,0)

These are the details of the standard t-test function call. The details of the other functions follow.

Standard t-test
doitall(h, cs, ce, es, ee, end, experror, minrep, post, betafit, colPPDE, corr)
h = data frame (e.g. dataFile in our earlier example)
cs/es = start column of control/expt (numbers start at 1, NOT zero)
ce/ee = end column of control/expt
end = last column of data
experror = used for Bonferroni (default is 0.25)
minrep = minimum number of replicates
post = 0 means don't do PPDE analysis, 1 means do PPDE
betafit = beta value
colPPDE = 0 means use 'lnp' (p-value on log transformed data) for PPDE and 1 means use 'p'
corr = 0 means don't perform correlation analysis

Standard paired
doitall.pair (h, cs, ce, end, experror, minrep)
cs and ce = start and end of ratio columns

Bayes t-test
pierre(h, cs, ce, es, ee, end, experror, winsize, conf, minrep, post, betafit, colPPDE, corr)
winsize = window size
conf = confidence value

Bayes paired
pierre.pair(h, cs, ce, end, experror, winsize, conf, minrep, totalexpresscol)
totalexpresscol = column number of the calculated paired expression column

That's it! When you apply any of the commands above, the data file gets processed and the output files (allgene.txt, ROC.txt, mix.txt and temp.ps) are saved in your working directory. 'allgene.txt' contains all the analysis output data, 'temp.ps' contains supporting graphs, 'mix.txt' contains the mixture model parameters and 'ROC.txt' contains the x- and y-coordinates for the ROC plot in temp.ps. See the Cyber-T results section for details.


File Format

The format expected is essentially the output of a spreadsheet file in ASCII or plain text with the values delimited by tabs, commas, or spaces. If your data are in Excel or a similar spreadsheet program, they should be saved as text files and delimited by tabs, commas, or spaces. Whatever delimiter you choose, make sure your gene names do not contain that character.

Comments can be inserted anywhere in the text as long as the comment lines are prefixed by a '#', because the underlying R code ignores lines beginning with #. Column headings must be removed unless they are prefixed by a # sign or specifically excluded by identifying the row where the data starts. Note that a # sign in the middle of a line that is not itself prefixed with a # will result in errors. For example, please ensure that none of the gene names (in the label column) contain a #.

Missing values are coded as 'NA' (not 'na'; case matters) and values below background are coded as zero ("0").

Here's an example file:

# here's a comment
# and another, and the line below is also a comment which is OK to insert
# Label Column 1 Column 2 Column3
# the real data starts just after this line 
GH01040 0.81888287 0.98072154 1.16866872 
GH01059 0.87158715 0.9766095 1.05333957
# comments can be interspersed as long as the line begins with a '#'
GH01066 0.8881064 0.45129639 0.79107254 
GH01085 0.53412245 0.84194019 0.95338764
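
How comment lines and 'NA' codes behave can be checked in plain R. The sketch below uses base R's read.table with its default settings (whitespace delimiter, '#' as the comment character, 'NA' as the missing-value code); the 'NA' entry is a hypothetical addition to the example data, and the web interface may apply further checks that are not shown here.

```r
# A few lines in the expected format; the NA in the second data row is hypothetical.
fileLines <- c("# a comment line",
               "GH01040 0.81888287 0.98072154 1.16866872",
               "# comments can be interspersed",
               "GH01059 0.87158715 NA 1.05333957",
               "GH01066 0.8881064 0.45129639 0.79107254")

# Comment lines are skipped and 'NA' is read as a missing value.
good <- read.table(text = fileLines, row.names = 1)
print(good)
print(is.na(good["GH01059", "V2"]))

# Lowercase 'na' is not recognized as missing, so the column stops being numeric.
bad <- read.table(text = sub("NA", "na", fileLines), row.names = 1)
print(is.numeric(bad$V2))
```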


Some Reasons for Unsuccessful File Upload

If you can see the file in your browser window, you should be able to upload it. However, some operating systems will refuse to let an already open file be manipulated by another application. If this is the case, you should get an error message when you try to upload it, perhaps even an interpretable one. Generally, you would have to close the file in the other application. Erroneous processing of the file could be due to a number of things.

Minimum Non-zero Replicates Required

This value is the minimum number of valid replicates required to do the t-test. It must always be >= 2 and <= the smaller of the number of Control or Experimental data columns.

For example, if the user was comparing 5 Control replicates to 5 Experimental replicates and only wanted to perform the t-test if there were at least 3 non-zero observations, this value would be set to 3.

The higher the number of non-zero replicates required, the greater the number of measurements considered in the analysis and the more accurate the variance estimate.
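
As an illustration of this setting, the sketch below counts valid replicates per gene and flags the genes that would be tested. The helper 'countValid' and the data are hypothetical and not part of hdarray; the sketch assumes that a value is valid when it is neither NA nor 0, and that the minimum must be met in both the Control and the Experimental group.

```r
# Hypothetical helper: number of valid replicates per gene (row).
# A value counts as valid when it is neither NA (missing) nor 0 (below background).
countValid <- function(m) rowSums(!is.na(m) & m > 0)

# Made-up data: 5 Control and 5 Experimental replicates for two genes.
ctrl <- rbind(geneA = c(12, 15, 0, 11, 14),
              geneB = c(0, 0, NA, 9, 0))
expt <- rbind(geneA = c(20, NA, 22, 19, 25),
              geneB = c(8, 7, 9, 0, 6))

minrep <- 3
testable <- countValid(ctrl) >= minrep & countValid(expt) >= minrep
print(countValid(ctrl))
print(countValid(expt))
print(testable)
```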


Posterior Probability of Differential Expression

To interpret the results of a high dimensional DNA array experiment it is necessary to determine the global false positive and negative levels inherent in the data set being analyzed. We have implemented a mixture-model based method described by Allison et al. [5] for the computation of the global false positive and negative levels inherent in a DNA microarray experiment [2,3]. The basic idea is to consider the p-values as a new data set and to build a probabilistic model for these new data. When control data sets are compared to one another (i.e. no differential gene expression) it is easy to see that the p-values ought to have a uniform distribution between zero and one. In contrast, when data sets from different genotypes or treatment conditions are compared to one another, a non-uniform distribution will be observed in which p-values will tend to cluster more closely to zero than one.

Distribution of the p-values from the lrp+ vs. lrp- data from Hung et al. [3]
The p-values, based on a regularized t-test distribution, of the 2,758 genes (lrp+ vs. lrp-) expressed at a value above background in all replicate experiments, grouped into 100 bins and plotted against the number of genes in each bin. The dotted line indicates the uniform distribution of p-values under conditions of no differential expression. The fitted model (dashed curve) is a mixture of a beta and the uniform distribution (dotted line).

That is, there will be a subset of differentially expressed genes with "significant" p-values. The computational method of Allison [5] is used to model this mixture of uniform and non-uniform distributions to determine the probability, PPDE(p) ranging from 0 to 1, that any gene at any given p-value is differentially expressed; that is, that it is a member of the non-uniform (differentially expressed) rather than the uniform (not differentially expressed) distribution. With this method, we can estimate the rates of false positives and false negatives as well as true positives and true negatives at any given p-value threshold, PPDE(<p). In other words, we can obtain a posterior probability of differential expression PPDE(p) value for each gene measurement and a PPDE(<p) value at any given p-value threshold based on the experiment-wide global false positive level and the p-value exhibited by that gene [2,3]. It should also be emphasized that this information allows us to infer the genome-wide number of genes that are differentially expressed; that is, the fraction of genes in the non-uniform distribution (differentially expressed) and the fraction of genes in the uniform distribution (not differentially expressed).
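
The mixture idea can be sketched in a few lines of base R. This is not the Cyber-T implementation: the p-values are simulated, the names 'negLogLik' and 'ppde' are illustrative, and the fit uses a generic maximum-likelihood search with optim rather than the procedure of [5]. It only shows how a uniform-plus-beta mixture yields a PPDE(p) value for each p-value.

```r
set.seed(1)
# Simulated p-values (hypothetical): 800 unchanged genes, 200 changed genes.
p <- c(runif(800), rbeta(200, 0.2, 1))

# Negative log-likelihood of the mixture
#   lambda0 * Uniform(0,1) + (1 - lambda0) * Beta(r, s).
# Parameters are transformed so the optimizer works on an unconstrained scale.
negLogLik <- function(par) {
  lambda0 <- plogis(par[1]); r <- exp(par[2]); s <- exp(par[3])
  -sum(log(lambda0 + (1 - lambda0) * dbeta(p, r, s)))
}
fit <- optim(c(0, log(0.5), 0), negLogLik)
lambda0 <- plogis(fit$par[1]); r <- exp(fit$par[2]); s <- exp(fit$par[3])

# Posterior probability that a gene with p-value x belongs to the beta
# (differentially expressed) component rather than the uniform one.
ppde <- function(x) {
  changed <- (1 - lambda0) * dbeta(x, r, s)
  changed / (lambda0 + changed)
}
print(c(lambda0 = lambda0, r = r, s = s))
print(ppde(c(0.0001, 0.05, 0.5)))
```

Here lambda0 plays the role of the estimated fraction of genes in the uniform (not differentially expressed) component, and 1 - lambda0 the fraction in the non-uniform component.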


Parameters for the Bayesian Standard Deviation Estimation (Optional)

In calculating the Bayesian estimate of the standard deviation there are two different parameters that the user must set. These two parameters relate to setting the Bayesian estimate of variance derived from the observed population. The t-test is used to detect significant differences between the means of two groups relative to the observed variance within groups. In a perfect world, all microarray experiments would be highly replicated within each experimental treatment. Such replication would allow accurate estimates of the variance within experimental treatments to be obtained and the t-test would then perform well. Microarray experiments are expensive and time consuming to carry out, and there is the possibility that both control and experimental tissues will be limiting. As a result, the level of replication within experimental treatments is often low. This results in poor estimates of within-treatment variance and a correspondingly poor performance of the t-test itself. This problem can be addressed using a Bayesian statistical approach that incorporates prior information in the estimation of within-gene, within-treatment variances.

Although in terms of strong inference there is really no substitute for proper experimental replication, in the case of microarrays many parallel pseudo-replicated experiments are carried out on any given microarray (i.e., one experiment for each gene). In particular, it is possible to use estimates of within-treatment variance from a number of genes of similar expression level to stabilize estimates of variance for any given gene. More precisely, the variance within any given treatment is estimated by the weighted average of a prior estimate of the variance for that gene (obtained from a local weighted average of the variance of other genes) and the experimental estimate of the variance for that gene. This weighting factor is controlled by the experimenter and will depend on how confident the experimenter is that the background variance of a closely related set of genes approximates the variance of the gene under consideration.

An important property of this Bayesian approach to consider is that in the two limiting cases of complete confidence in the prior and zero confidence in the prior, the Bayesian approach is equivalent to simply looking at fold change and to testing differences between treatments using the simple t-test, respectively. In the Bayesian approach the weight given to the within-gene variance estimate is a function of the number of observations contributing to that value. This leads to the desirable property of the Bayesian approach converging to the t-test as the experimenter carries out additional replications and thus becomes more confident of the observed estimate of within-treatment variance for any given gene.

1. Sliding Window Size

Indicates how wide the window surrounding the point under consideration should be. This sample of the data provides an estimate of the average variability of gene expression for those genes that show a similar expression level. It is important to estimate this average from a window wide enough to be accurate, but not so wide that it averages in genes with very different average expression levels. A sliding window of 101 genes has been shown to be quite accurate when analyzing 2000 or more genes; with only 1000 genes, a window of 51 genes may work better.

2. Bayes Confidence Estimate Value

This is a number from 0 to infinity that indicates the weight given to the Bayesian prior estimate of within-treatment variance. Larger weights indicate greater confidence in the Bayesian prior; smaller weights indicate more confidence in the experimentally observed variance. We have observed reasonable performance with this parameter set equal to approximately 3 times the number of experimental observations, when the number of experimental observations is small (approximately 4 or less). If the number of replicate experimental observations is large, then the Confidence value can be lowered to be equal to the number of replicates (or even less).
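
To make the roles of the two parameters concrete, here is a simplified sketch in base R. The function 'regularizedVar' is hypothetical and is not the hdarray code: it assumes the prior for each gene is the plain mean variance of its neighbours in expression rank, and it combines prior and observed variance with weights conf and (n - 1). The exact estimator used by Cyber-T is the one described in [1] and may differ in detail.

```r
# Hypothetical helper, not the hdarray implementation.
# means, vars: per-gene mean and variance within one treatment
# n: number of replicates; winsize: sliding window size (odd number)
# conf: weight given to the prior (the Bayes confidence value)
regularizedVar <- function(means, vars, n, winsize = 101, conf = 10) {
  ord  <- order(means)                 # rank genes by expression level
  v    <- vars[ord]
  half <- (winsize - 1) %/% 2
  # Prior for each gene: mean variance of its neighbours in expression rank.
  prior <- sapply(seq_along(v), function(i) {
    mean(v[max(1, i - half):min(length(v), i + half)])
  })
  # Weighted average of the prior and the observed variance.
  post <- (conf * prior + (n - 1) * v) / (conf + n - 1)
  post[order(ord)]                     # return to the original gene order
}

# Simulated data (hypothetical): 200 genes, 3 replicates each.
set.seed(2)
x <- matrix(rnorm(600), nrow = 200) + (1:200) / 20
means <- rowMeans(x)
vars  <- apply(x, 1, var)

shrunk <- regularizedVar(means, vars, n = 3, winsize = 51, conf = 9)
print(summary(vars))     # raw variances: widely scattered with 3 replicates
print(summary(shrunk))   # regularized variances: pulled toward the local average
```

In this sketch, conf = 0 returns the observed variances unchanged (the simple t-test case), while a very large conf makes every gene's variance equal to its local prior.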


Paired Expression Value Estimate

The Ratio or Paired approach pairs a control value with an experimental value i.e. C1 with E1, C2 with E2, etc. The C+E approach does no such pairing. The difference is that if samples really are paired, the variance in the denominator can be much smaller (and hence power greater). The paired t-test evaluates whether the mean difference between paired samples is significantly different from zero, whereas the non-paired t-test tests if the mean of group 1 is different from the mean of group 2. Let's do a "classic" example.

Below is the body weight of a group of people before and after some diet regime was applied:

PERSON      BEFORE          AFTER      DIFFERENCE

Joe           200            180          -20
Sue           120            100          -20
George        175            155          -20

(Note: People here are like replicate experiments and Before and After are like Control versus Experimental.) A paired t-test shows the diet is very effective: everyone loses 20 lb with no variance (that is, the mean difference is significantly different from zero). On the other hand, if we take (Mean of the Afters) - (Mean of the Befores), the difference is still 20 lb, but the variance within treatments is large, so it is impossible to say the difference is significant.

Basically, when observations are "paired", the paired approach is the correct and most powerful approach; when observations are not paired, we have to use the C+E approach.
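
The diet example can be reproduced with base R as a quick check of the two views. This sketch uses only the numbers in the table above and R's built-in t.test, not the hdarray functions.

```r
before <- c(Joe = 200, Sue = 120, George = 175)
after  <- c(Joe = 180, Sue = 100, George = 155)

# Paired view: every difference is -20, so their standard deviation is 0.
d <- after - before
print(c(meanDiff = mean(d), sdDiff = sd(d)))

# Unpaired (C+E) view: the same 20 lb difference in means, but a large
# within-group spread, so the two-sample t-test is far from significant.
unpaired <- t.test(after, before)
print(unpaired$p.value)
```

Because the differences in this toy example have exactly zero variance, the paired t statistic is unbounded, and calling t.test with paired = TRUE on these numbers is expected to stop with a "data are essentially constant" error; real paired data will have some spread in the differences.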

Since Paired (Ratio) data lack the absolute expression data needed to calculate the Bayesian estimate of variance, you'll have to provide an estimate if you want to use the Bayesian calculation. Assume the following experiment in which the data below are the background-corrected scan values:

     Column that you have to calculate-----------------+
                                                       |
                                                       v
Gene      Con1   Con2   Con3 |  Exp1   Exp2   Exp3   Expr Est
-------  -----  -----  ----- | -----  -----  -----   -------
YAL001C     61     26     31 |    45     47     50   -32.73
YAL002C    156    166    122 |   108    181    250   -24.73
YAL002W     94     63    108 |   145    113     86   -27.57
YAL003W    809   1358   1234 |  1108   1110   1098   -13.05
YAL003W   5325   3142   4720 |  4271   4198   4127   -4.98
YAL004W     61      1     47 |    34     45     75   -35.49
YAL005C   5234      1   3099 |  6291   3874   8443   -12.45
-------  -----   ----  ----- | -----  -----  ----- 
Sum      11740   4757   9361 | 12002   9568  14129 

The extra column of values, calculated FOR EACH GENE, is the sum of the natural logs of each measurement divided by its column sum. For YAL001C:
ln(61/11740) + ln(26/4757) + ln(31/9361) + ln(45/12002) + ln(47/9568) + ln(50/14129) = -32.73
If there are missing values, then divide each of these sums by the number of valid control and experiment values for the corresponding gene (6 in the example above). Dividing is not needed if all genes have the same number of valid control and experiment measurements. The resultant column marked 'Expr Est' is then used for the computation of the Bayesian estimated variance.
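
The calculation can be sketched in base R as follows. The variable names are illustrative; the numbers are the scan values from the table above, and no missing values are assumed, so no division by the number of measurements is done.

```r
# Background-corrected scan values from the table (rows are genes, in table order).
vals <- rbind(c(  61,   26,   31,   45,   47,   50),
              c( 156,  166,  122,  108,  181,  250),
              c(  94,   63,  108,  145,  113,   86),
              c( 809, 1358, 1234, 1108, 1110, 1098),
              c(5325, 3142, 4720, 4271, 4198, 4127),
              c(  61,    1,   47,   34,   45,   75),
              c(5234,    1, 3099, 6291, 3874, 8443))
colnames(vals) <- c("Con1", "Con2", "Con3", "Exp1", "Exp2", "Exp3")

totals   <- colSums(vals)                  # the 'Sum' row of the table
fraction <- sweep(vals, 2, totals, "/")    # each value divided by its column sum
exprEst  <- rowSums(log(fraction))         # sum of natural logs, per gene

print(totals)
print(round(exprEst, 2))
```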


Gene Expression Array Analysis Results

Allgenes.txt: Column headings and what they mean

All log transformations performed by Cyber-T refer to the natural log (ln) transformation.

Control + Experiment Analysis Output

Paired Data Analysis Output

CyberT.ps/CyberT.pdf: Graphical output

Mix.txt and ROC.txt: Mixture model parameters

These files contain the output of the mixture model parameters from running the PPDE module. 'Mix.txt' contains the estimates of the mixture model and 'ROC.txt' contains the x and y-coordinates of the ROC plot in 'CyberT.ps/CyberT.pdf'.


Frequently Asked Questions, Errors and Troubleshooting


References and articles

  1. P. Baldi and A.D. Long, "A Bayesian Framework for the Analysis of Microarray Expression Data: Regularized t-Test and Statistical Inferences of Gene Changes", Bioinformatics, 17, 6, 509-519, (2001).
  2. P. Baldi and G. Wesley Hatfield, "DNA Microarrays and Gene Expression : From Experiments to Data Analysis and Modeling", Cambridge University Press (2002).
  3. S.P. Hung, P. Baldi and G.W. Hatfield, J Biol Chem 277, 40309-40323 (2002).
  4. G. Wesley Hatfield, S. Hung, and P. Baldi. "Differential Analysis of DNA Microarray Gene Expression Data", Molecular Microbiology, 47:871-877 (2002).
  5. D.B. Allison, G.L. Gadbury, M. Heo, J.R. Fernández, C.K. Lee, T.A. Prolla, R. Weindruch, "A mixture model approach for the analysis of microarray gene expression data", Computational Statistics & Data Analysis, 39:1-20 (2002).
  6. Institute of Genomics and Bioinformatics
  7. R's (distributed) home, the Comprehensive R Archive Network (CRAN).
  8. P. Spector, "An Introduction to S and S-PLUS", Duxbury Press (1994).
  9. W.N. Venables and B.D. Ripley, "Modern Applied Statistics with S-PLUS" (Second Edition), Springer (1997).
  10. S.E. Choe, M. Boutros, A.M. Michelson, G.M. Church, M.S. Halfon, "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset", Genome Biology, 6:R16 (2005).

