Frequently Asked Questions
What kinds of input does AutoSOME accept?
AutoSOME accepts three kinds of numerical input:
1) Table with mandatory row names (column 1) followed by numerical data (see Table below). Every row must be the same length (i.e. contain the same number of values). Column names are optional; however, if they are used, every column must be labeled.
Two commonly-used microarray formats:
2) PCL = Pre CLuster format as used by the Cluster software [Eisen et al. (1998) PNAS 95:14863]. For an example, see
, or refer to the Cluster/TreeView User Manual available
. The first two columns are reserved for gene annotation (column 1 = row identifiers, column 2 = gene names or annotation). Column 3 is optional, is called GWEIGHT, and specifies how to weight each gene when computing gene-gene similarity. AutoSOME reads this column but ignores it, since AutoSOME does not cluster genes using a similarity matrix. Row 1 is mandatory and provides the column names, including the name of each array (e.g. ID, NAME, GWEIGHT, array1, array2, etc.). Row 2 is optional, is called EWEIGHT, and specifies how each array is weighted when computing array-array similarity. In contrast to GWEIGHT, AutoSOME does use EWEIGHT when constructing a distance matrix for transcriptome clustering (fuzzy cluster networks option).
3) Gene Expression Omnibus (GEO) Series Matrix. This format contains all chipset expression data as well as user-supplied annotation in a spreadsheet style. AutoSOME can automatically extract expression information content and column names from a raw series matrix file. This allows for rapid analysis of any GEO dataset by simply downloading the archive containing the series matrix text file, unzipping it, and loading it into AutoSOME. Try a real Series Matrix File available
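The rules for the plain table format (row names in column 1, numerical data afterwards, equal row lengths, all-or-nothing column labels) can be checked before loading a file. The following is only a minimal sketch and not AutoSOME's own parser. It assumes a tab-delimited file, assumes that a first row containing any non-numeric value is a header row, and reads "every column must be labeled" as including column 1. The names is_number and check_table are hypothetical.

```python
def is_number(text):
    """Return True if text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_table(lines, delimiter="\t"):
    """Validate a simple table; return (has_header, values_per_row).

    Assumptions (not taken from AutoSOME itself): tab-delimited fields,
    and a first row with any non-numeric value is treated as column names.
    """
    rows = [ln.rstrip("\n").split(delimiter) for ln in lines if ln.strip()]
    if not rows:
        raise ValueError("table is empty")

    has_header = not all(is_number(v) for v in rows[0][1:])
    data = rows[1:] if has_header else rows
    if not data:
        raise ValueError("table has no data rows")

    width = len(data[0])
    if width < 2:
        raise ValueError("each row needs a name and at least one value")

    if has_header:
        header = rows[0]
        # If column names are used, every column must be labeled.
        if len(header) != width or any(not name.strip() for name in header):
            raise ValueError("every column must be labeled")

    for number, row in enumerate(data, start=1):
        if len(row) != width:
            raise ValueError(f"data row {number} has a different length")
        if not row[0].strip():
            raise ValueError(f"data row {number} has no row name")
        if not all(is_number(v) for v in row[1:]):
            raise ValueError(f"data row {number} has a non-numeric value")

    return has_header, width - 1
```

For example, check_table(open("mydata.txt")) would raise a ValueError naming the first offending row, where mydata.txt is a placeholder file name.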
Why is AutoSOME hanging or crashing?
AutoSOME has not been internationalized yet. If English is not the default language on your computer, AutoSOME requires the Java Virtual Machine argument "-Duser.language=en". This argument is set automatically by the batch files included in the AutoSOME download.
Another common cause of AutoSOME instability is an Out of Memory Error. The Web Start version allocates 1 GB of RAM to AutoSOME, which may not be enough for an extremely large dataset (e.g. 60K rows, 2000 columns). On 32-bit systems, the maximum amount of memory that can be allocated is ~1.6 GB. Memory usage is not significantly affected by the number of ensemble runs unless the number is very high, e.g. 1000 runs. To liberate memory in this case, we have provided an option to write ensemble runs to disk (see
). If invoked, a temporary folder will be created in the current working directory to save intermediate AutoSOME runs.
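A rough back-of-the-envelope calculation shows why the example dataset above is close to the 1 GB limit. This sketch assumes each value is held as an 8-byte double and counts only a single copy of the matrix. The FAQ does not say how AutoSOME stores data internally, so treat the result as a lower bound. The name matrix_bytes is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def matrix_bytes(rows, cols, bytes_per_value=8):
    """Bytes needed for one dense copy of a rows x cols numeric matrix.

    bytes_per_value=8 assumes double precision (an assumption, not a
    documented AutoSOME detail).
    """
    return rows * cols * bytes_per_value


needed = matrix_bytes(60_000, 2_000)
print(f"{needed / GIB:.2f} GiB")  # prints 0.89 GiB
```

Any working copies made during normalization or clustering would add to this figure.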
Although we have tried to implement extensive error-checking for input datasets and parameters, it remains possible that we missed something. Send us an email describing your input parameters and input dataset (include the dataset itself if possible) and we will make every attempt to fix the problem.
AutoSOME is taking too long to finish. How can I speed it up?
For simple exploratory analysis on a very large dataset, try reducing the number of ensemble runs to 10 or so. Increase this number when ready for a final clustering run (see below). The AutoSOME GUI also has three settings for performance that can be selected using the
. By default, AutoSOME will run in Normal mode. This setting has less cluster resolving power than Precision mode, but is ~4X faster. In addition, our empirical experiments indicate that the two settings often result in comparable performance. Speed mode has the least precision, but is very fast and may be desired for a first pass. Changing the AutoSOME Mode will modify two parameters:
, both found in
. An additional parameter not affected by AutoSOME Mode is the SOM maximum Grid Length field, also located in Algorithm Settings. By default, the SOM node lattice will be bounded by a 30 x 30 grid. Reducing the boundary size can greatly decrease AutoSOME running time. For example, on datasets with at least 5,000 rows, the grid boundary is automatically reduced to 20 x 20. For a square SOM with 400 nodes, this means that AutoSOME could still, in principle, identify hundreds of clusters in roughly half the time needed to process a 30 x 30 grid. Finally, if nothing else, let AutoSOME use all available processors for faster running time (this is the default setting).
How many ensemble runs are needed for stable output?
Extensive ensemble stability tests suggest that datasets with well-separated clusters (like the example benchmark datasets) gain little in cluster stability or accuracy after 50-100 ensemble runs. Indeed, 25-50 runs will often be sufficient. For highly noisy datasets, like microarrays, the greatest gains in stability are usually reached by 100-200 iterations. For transcriptome clustering, the number of clustered samples is typically small, and a potentially large number of ensemble iterations (e.g. 500) can be executed in practical time.
How does the P-value threshold affect clustering results?
A major step of the AutoSOME method involves partitioning a graph containing all input data points into a set of data clusters. The p-value threshold allows the data graph to be cut into statistically significant clusters based on a simulated null hypothesis of random data points. The smaller the p-value, the tighter (and smaller) the resulting clusters. The default threshold of p < 0.1 has been extensively benchmarked and yields consistently good accuracy on a diversity of clustering problems. Lower the p-value threshold for increasingly challenging datasets.
How does normalization affect my data and when should I use it?
i) Log2 Scaling: Logarithmic scaling is routinely used for microarray datasets to amplify small fold changes in gene expression, and is completely reversible. All other implemented input adjustment methods irreversibly change the input to make it more suitable for analysis.
ii) Unit variance: forces all columns to have zero mean and a standard deviation of one, and is commonly used when there is no a priori reason to treat any column differently from any other.
iii) Range [0,x]: Alternatively, data in all columns can be normalized to share lowest and highest values (0,x) by specifying an upper bound x.
iv) Median Center Rows/Arrays: For microarray analysis, centering each gene (row) and/or array (column) by subtracting the median value of the row/column eliminates amplitude shifts to highlight the most prominent patterns in the expression dataset.
v) Sum of Squares=1 Rows/Arrays: This normalization procedure smooths microarray datasets by forcing the sum of squares of the expression values in each row/column of the dataset to equal 1.
For gene expression datasets, at the very least, we suggest applying unit variance normalization of arrays and applying log2 scaling in cases where expression values span several orders of magnitude. Median-centering of genes to eliminate amplitude shifts is highly recommended for gene co-expression clustering (the SOM component of AutoSOME uses Euclidean distance and, unlike Pearson's correlation, is sensitive to amplitude differences).
Considerable smoothing can be achieved by setting the sum of squares = 1 across all genes and arrays. This tends to partition the data into larger clusters that trail off into genes with minimal differential expression. These minimal variance genes can be easily filtered out by adjusting cluster confidence (see the gene co-expression tutorial on confidence filtering). We recommend conducting pilot runs (using 20-50 ensemble iterations) with and without sum of squares normalization to see what works best.
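The adjustments described in this answer can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not AutoSOME's code. It assumes the data are a list of rows (genes) with one value per array, and it uses the population standard deviation for unit variance, which is an assumption since the FAQ does not say which estimator is used. All function names are hypothetical.

```python
import math
from statistics import mean, median, pstdev


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def log2_scale(matrix):
    """i) Log2 scaling; every value must be positive."""
    return [[math.log2(v) for v in row] for row in matrix]


def unit_variance_columns(matrix):
    """ii) Give each column zero mean and a standard deviation of one.

    Assumes no column is constant (a zero deviation would divide by zero).
    """
    scaled = []
    for col in transpose(matrix):
        mu, sd = mean(col), pstdev(col)
        scaled.append([(v - mu) / sd for v in col])
    return transpose(scaled)


def range_scale_columns(matrix, x=1.0):
    """iii) Rescale each column to the range [0, x]."""
    scaled = []
    for col in transpose(matrix):
        lo, hi = min(col), max(col)
        scaled.append([x * (v - lo) / (hi - lo) for v in col])
    return transpose(scaled)


def median_center_rows(matrix):
    """iv) Subtract each row's median from its values."""
    centered = []
    for row in matrix:
        mid = median(row)
        centered.append([v - mid for v in row])
    return centered


def unit_sum_of_squares_rows(matrix):
    """v) Scale each row so that its squared values sum to 1."""
    scaled = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        scaled.append([v / norm for v in row])
    return scaled


# Two genes with the same pattern at different amplitudes.
data = [[1.0, 2.0, 4.0], [8.0, 16.0, 32.0]]
centered = median_center_rows(log2_scale(data))
print(centered)  # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
```

After log2 scaling and median centering, the two rows become identical, which is the amplitude-shift removal that matters when Euclidean distance is used. To apply the row functions to arrays instead, transpose the matrix first and transpose the result back.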
Which distance metric should I use when making fuzzy cluster networks?
In its current implementation, AutoSOME can only create fuzzy cluster networks from clustering column (array) vectors. For microarrays, this amounts to clustering transcriptome profiles. Since unfiltered transcriptomes are potentially enormous, AutoSOME automatically performs an All-against-All comparison of all column vectors. This results in a similarity matrix that is used for clustering. Three common metrics for calculating
are provided as a user-adjustable parameter. Euclidean distance is chosen by default because it gave the best results during empirical testing on a variety of microarray datasets with previously known classes of cell lines. Euclidean distance is magnitude-sensitive and results in a distance matrix where similar transcriptomes have smaller distances between them. AutoSOME also implements Pearson's correlation and Uncentered correlation. Both correlation metrics have a maximum of 1 (completely correlated) and are widely used for hierarchical clustering. Pearson's correlation is insensitive to amplitude shifts. This means that two data vectors with similar shapes but different magnitudes can still be highly correlated. Uncentered correlation is more like Euclidean distance in that different magnitudes are penalized. For a good review of distance metrics used in bioinformatics, see D'haeseleer (2005) Nat. Biotechnol. 23:1499.
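The difference between the three metrics is easiest to see on two vectors with the same shape but different magnitudes. The sketch below uses the standard textbook definitions. Uncentered correlation is taken here to be Pearson's formula with both means fixed at zero, as in the Cluster software. Whether AutoSOME computes it in exactly this way is an assumption. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def uncentered(a, b):
    """Uncentered correlation: Pearson's formula with both means set to 0."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def pearson(a, b):
    """Pearson's correlation: center each vector, then correlate."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return uncentered([x - mean_a for x in a], [y - mean_b for y in b])


# Same shape, shifted upward by 10 units.
a = [1.0, 2.0, 3.0]
b = [11.0, 12.0, 13.0]

print(euclidean(a, b))   # about 17.32: the shift counts as distance
print(pearson(a, b))     # 1.0: the shift is ignored
print(uncentered(a, b))  # about 0.95: the shift is penalized
```

Note that the Euclidean result is a distance while the other two are similarities, so they rank pairs in opposite directions.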
How do I open a saved clustering result?
AutoSOME writes three main files to disk after clustering, two of which are HTML files. The third, a text file (e.g. AutoSOME_yeastData_E50_Pval0.1.txt), contains the information needed to revisit your clustering results. Go to File>Open AutoSOME Results