Welcome to the read depth calculator

Choosing an appropriate read depth for an mRNA-seq measurement ahead of time can be difficult. We therefore derived a quantitative mathematical framework that shows how the biological information contained in an mRNA-seq dataset is corrupted as measurement noise increases. In our accompanying work, we show that mRNA-seq users can extract useful transcriptional programs at a fraction of conventional sequencing depths, an approach known as "shallow mRNA-seq". With such drastic decreases in sequencing depth, users can multiplex more samples into a single lane. Our results show that shallow mRNA-seq users can multiplex 100- to 1000-fold more experiments within a single Illumina sequencing lane and still accurately identify the leading principal components. Our paper contains more details on the analysis and on how the read depth calculator can best be used to design experiments.

How to use the read depth calculator

The read depth calculator can be used for experimental design or for predicting error levels in experimental data that have already been collected. For experimental design, this tool can guide you in choosing how many samples to sequence and at what read depth ("Solve for depth"). If the read depth is already known, you can use this tool to predict the accuracy of your principal components ("Solve for accuracy").

Filling in the read depth calculator fields

The read depth calculator uses the following relationship for the principal component error: \| \textbf{pc}_i - \hat{\textbf{pc}}_i \|_2 = \frac{\kappa}{\lambda_i n N} . Here \kappa is a species-specific constant, \lambda_i is the principal value of interest, n is the number of samples and N is the number of reads. Each field of the calculator requests information about a different parameter.

1. Species drop-down menu - Choose your species so that the appropriate constant parameter ( \kappa ) is used.

2. Mode selection - Choose between "Solve for accuracy", if you know the read depth and want to assess how accurate a given principal component is, and "Solve for depth", if you want to determine a read depth that will provide your desired level of accuracy in a given principal component.

3. Number of reads - Enter the number of reads ( N ) only when the "Solve for accuracy" mode is selected.

4. Accuracy - Enter the desired level of accuracy in terms of the percentage difference between the true principal component and the principal component from shallow mRNA-seq. 0% accuracy equates to the maximum Euclidean distance between two unit-length principal components, ( \sqrt{2} ). This option is only available when the "Solve for depth" mode is selected.

5. Number of samples - Enter the number of distinct samples ( n ) you plan to include in the analysis. A sample can be either a single-cell sample or a bulk sample.

6. Principal value of interest - Enter the principal value of the normalized read count matrix corresponding to your principal component of interest ( \lambda_i ). This value is unlikely to be known ahead of time. You can use similar datasets for reference, or refer to the later sections of this page to see a distribution of values and learn what a typical value may look like.
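
The two calculator modes can be sketched as a direct evaluation and inversion of the relationship above. This is only an illustrative sketch, not the code behind the webtool: the function names are hypothetical, the value of \kappa must be supplied by the caller because this page does not list it, and the linear mapping from error to percent accuracy (0% at a distance of \sqrt{2}, 100% at a distance of 0) is an assumption based on the description of the Accuracy field.

```python
import math

# Distance that the text treats as 0% accuracy for unit-length principal components.
MAX_PC_DISTANCE = math.sqrt(2)


def pc_error(kappa, principal_value, n_samples, n_reads):
    """Principal component error as stated on this page: kappa / (lambda_i * n * N)."""
    return kappa / (principal_value * n_samples * n_reads)


def accuracy_percent(error):
    """ASSUMED linear mapping: error 0 -> 100%, error sqrt(2) or more -> 0%."""
    return 100.0 * (1.0 - min(error, MAX_PC_DISTANCE) / MAX_PC_DISTANCE)


def solve_for_accuracy(kappa, principal_value, n_samples, n_reads):
    """'Solve for accuracy' mode: the read depth N is known."""
    return accuracy_percent(pc_error(kappa, principal_value, n_samples, n_reads))


def solve_for_depth(kappa, principal_value, n_samples, target_accuracy_percent):
    """'Solve for depth' mode: invert the relationship to obtain N."""
    if not 0.0 <= target_accuracy_percent < 100.0:
        raise ValueError("target accuracy must be in [0, 100)")
    target_error = MAX_PC_DISTANCE * (1.0 - target_accuracy_percent / 100.0)
    return kappa / (principal_value * n_samples * target_error)


if __name__ == "__main__":
    # All numbers below are placeholders chosen for illustration only.
    kappa, principal_value, n_samples = 1.0e4, 2.0, 100
    depth = solve_for_depth(kappa, principal_value, n_samples, 80.0)
    print("reads needed:", depth)
    print("accuracy at that depth:",
          solve_for_accuracy(kappa, principal_value, n_samples, depth))
```

Because the error is inversely proportional to the product of n and N in this relationship, doubling the number of samples has the same effect on the predicted error as doubling the number of reads.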

Estimating principal values

Because the principal values of your data are unlikely to be known before profiling gene expression, we provide guidelines to help you estimate them. We surveyed over 350 published gene expression datasets (containing 23,000 gene expression profiles) with varying experimental designs to estimate the distribution of principal values. We compiled 226 human gene expression datasets and 106 mouse gene expression datasets as a reference. We generated these distributions from microarray data because a large number of such datasets is available. Because the principal values are intrinsic to the biological system and experimental design, the use of microarray data should not be problematic. Furthermore, analysis of a much smaller set of mRNA-seq datasets is consistent with these results. Only datasets with at least 20 experimental conditions or time points were considered so as not to artificially impact the dimensionality of the data.

Without prior knowledge of the principal values of your data, we suggest assuming that they are near the average of those observed in our analysis. Alternatively, you could use a more conservative estimate by assuming that your principal values fall close to the 20th percentile. Here, we show the performance for both of those values. While PCA on more structured datasets (average principal values) outperforms PCA under the conservative estimate, both show that accurate transcriptional programs can be extracted at low depth. Additionally, both show similar gains in accuracy when the number of samples is increased.
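
As a rough illustration of the two estimation strategies, the sketch below takes a list of reference principal values, computes their average and their 20th percentile, and evaluates the error relationship from this page for each at several sample numbers. The reference values, the value of \kappa and the read count are all made-up placeholders, not values from our survey, and the function names are hypothetical.

```python
from statistics import mean, quantiles

KAPPA = 1.0      # placeholder constant; the real species-specific value is not given here
N_READS = 1.0e4  # placeholder read count

# Made-up reference principal values standing in for a survey of similar datasets.
reference_principal_values = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0]


def pc_error(kappa, principal_value, n_samples, n_reads):
    """Principal component error as stated on this page: kappa / (lambda_i * n * N)."""
    return kappa / (principal_value * n_samples * n_reads)


# Strategy 1: assume principal values near the average of the reference set.
average_value = mean(reference_principal_values)

# Strategy 2: conservative estimate; the first of the four quintile cut points
# returned by quantiles(..., n=5) is the 20th percentile.
conservative_value = quantiles(reference_principal_values, n=5)[0]


def error_by_sample_number(principal_value, sample_numbers):
    """Predicted error for each sample number at a fixed read count."""
    return {n: pc_error(KAPPA, principal_value, n, N_READS) for n in sample_numbers}


if __name__ == "__main__":
    for label, value in (("average", average_value),
                         ("conservative", conservative_value)):
        print(label, value, error_by_sample_number(value, [50, 100, 200]))
```

Under this relationship, the conservative estimate always predicts a larger error than the average estimate at the same depth, but both errors shrink by the same factor as the number of samples grows.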

Technical detail and further information

This webtool is brand new, so we will be updating it continuously. E-mail comments, suggestions or questions to Graham Heimberg (graham dot heimberg at ucsf dot edu), Raj Bhatnagar (Rajat dot Bhatnagar at ucsf dot edu) or Matt Thomson (Matthew dot Thomson at ucsf dot edu). More details can also be found in our paper.