Because the AtGenExpress dataset is so large, I have been trying to filter out genes that are not expressed in any of the samples. To this end, I have used the mas5calls function from the affy package in R, which is part of Bioconductor. This function runs the Wilcoxon signed-rank-based presence/absence detection algorithm on the microarray data. So in addition to an Excel file of gene expression data (rows are genes, columns are experimental samples), you end up with a second spreadsheet of the same size in which each expression value is replaced by one of the letters P, M, or A: P = present, M = marginal, A = absent (indicating whether or not the gene was detected on the microarray).
My first thought was to use these calls to filter out genes. Counting the Ps, Ms, and As along each row (gene), I formed the ratio R = P/(P+M+A). Sorting the genes from largest to smallest R then gives the genes that were detected most frequently. Genes with R = 0 were never called present in any sample, so I am obviously not interested in them and removed them from consideration.
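As a rough illustration of the ratio described above, here is a minimal sketch in Python (the actual work was done in R and spreadsheets). The gene names and calls are made-up toy data, and presence_ratio is a hypothetical helper, not part of affy.

```python
from collections import Counter

def presence_ratio(calls):
    """Return R = P / (P + M + A) for one gene's detection calls."""
    counts = Counter(calls)
    total = counts["P"] + counts["M"] + counts["A"]
    return counts["P"] / total if total else 0.0

# Hypothetical toy data: one list of P/M/A calls per gene, one call per sample.
calls_by_gene = {
    "gene_a": ["P", "P", "M", "A"],
    "gene_b": ["A", "A", "A", "A"],
    "gene_c": ["P", "P", "P", "P"],
}

ratios = {gene: presence_ratio(calls) for gene, calls in calls_by_gene.items()}

# Drop genes that were never called present, then sort from largest to smallest R.
ranked = sorted(
    ((gene, r) for gene, r in ratios.items() if r > 0),
    key=lambda item: item[1],
    reverse=True,
)
```

Here gene_b has R = 0 and drops out, while the other two genes are ranked by how often they were called present.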
There are other factors to consider, though. For one, a gene that is highly expressed in only one or two samples could play an important role in the genetic response of Arabidopsis to a stimulus, yet it would have a low R value. If I filter by R alone, I only catch the genes that are expressed across most or all experiments. So I also need to consider the standard deviation of each gene.
I am also interested in the L2 norm value of each gene. Viewing each gene as a vector containing expression values for each sample, I calculate the L2 norm. This is also a way to rank genes.
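To make the two per-gene quantities concrete, here is a small Python sketch with made-up expression values. It assumes the sample standard deviation (dividing by n - 1); the post does not say which variant was used, so treat that as my assumption.

```python
import math
import statistics

def l2_norm(values):
    """Euclidean (L2) norm of a gene's expression vector."""
    return math.sqrt(sum(v * v for v in values))

def sample_sd(values):
    """Sample standard deviation (n - 1 denominator); an assumption, see above."""
    return statistics.stdev(values)

# Hypothetical toy genes: one steadily expressed, one expressed in a single sample.
steady_gene = [3, 3, 3, 3]
spike_gene = [0, 0, 0, 6]

steady_metrics = (l2_norm(steady_gene), sample_sd(steady_gene))
spike_metrics = (l2_norm(spike_gene), sample_sd(spike_gene))
```

In this toy case both genes have the same L2 norm, but only the gene expressed in a single sample has a nonzero standard deviation, which is why the two metrics can rank genes differently.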
So there are in fact three ways to rank genes in terms of importance (by L2 norms, standard deviations, and their R values). I am working on a way to link them together into one metric. Below are scatterplots of the various interactions.

Conclusions: The roughly linear relationship that appears to exist between the L2 norm and R was somewhat unexpected. The mean of the L2 norms over all genes is 208.5677, so the genes concentrated near the lower-left corner of this graph have low L2 norms as well as low R values. Higher R values also appear to go with higher L2 values.
There are a few genes with L2 values above the mean and R values below 0.1 (suggesting genes that are highly expressed in a small number of samples), but only a very small number. In fact, I think I could safely delete all genes with an R value less than 0.5.
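One way to apply such a cutoff without silently losing the few low-R, high-L2 genes is sketched below in Python. The gene names and values are invented, the mean L2 norm is computed from this toy data rather than taken from the real dataset, and split_by_ratio is a hypothetical helper.

```python
def split_by_ratio(metrics, r_cutoff=0.5):
    """Split genes into those kept (R >= cutoff) and those that would be
    dropped despite having an above-average L2 norm."""
    mean_l2 = sum(m["l2"] for m in metrics.values()) / len(metrics)
    kept = sorted(g for g, m in metrics.items() if m["R"] >= r_cutoff)
    flagged = sorted(
        g for g, m in metrics.items()
        if m["R"] < r_cutoff and m["l2"] > mean_l2
    )
    return kept, flagged

# Hypothetical toy metrics: R value and L2 norm per gene.
metrics = {
    "gene_a": {"R": 0.90, "l2": 300.0},
    "gene_b": {"R": 0.05, "l2": 400.0},
    "gene_c": {"R": 0.02, "l2": 50.0},
    "gene_d": {"R": 0.70, "l2": 250.0},
}

kept, flagged = split_by_ratio(metrics)
```

The flagged list gives the genes that the R cutoff would remove even though their L2 norm is above average, so they can be inspected by hand before deletion.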

Conclusions: The mean of all the standard deviations is 1.1074, and this plot seems to bear that out. What is interesting about this plot is its similarity to the previous one. There appear to be two large groupings of genes in the dataset, one with high R values and one with very low R values, with a few genes scattered between the two groups.

Conclusions: This is the most interesting plot. As I mentioned previously, the mean of the standard deviations is 1.1074, and the mean of the L2 norms is 208.5677. There appears to be a large cluster of genes that are highly expressed across a few experiments and have a standard deviation below the mean for all genes.
Below are the histograms for the three metrics I have looked at:

Future Course of Action: I will be posting the results of running ARACNE shortly (within a day or so).