Comparison of surface accessibility in multiple samples

Salvador Martínez de Bartolomé edited this page Dec 20, 2018 · 4 revisions

The program is implemented in the class edu.scripps.yates.pcq.quantsite.QuantSiteOutputComparator.
Given the jar file QuantSiteoutputcomparator.jar, the command to run it is:

java -jar QuantSitecomparator.jar -f /home/salvador/file_with_paths -out output -RInf 1000 -number_sigmas 2

An explanation of the parameters:

 java -jar QuantSiteoutputComparator.jar
with the following parameters:
 -f,--input_file <arg>             [MANDATORY] Full path to a file containing pairs (separated by TAB) of sample names and full path to the
                                   peptideNodeTable of a PCQ run to compare
 -md,--minimum_discoveries <arg>   [OPTIONAL] minimum number of discoveries (significantly different between two samples) required for a quantified
                                   site to be in the output files. If not provided, there will be no minimum number, although no quant sites without
                                   any significantly different site between 2 samples will be reported in the Excel output file.
 -ns,--number_sigmas <arg>         [OPTIONAL] number of sigmas that will be used to decide whether an INFINITY ratio is significantly different than a
                                   FINITE ratio.
                                   If R1=POSITIVE_INFINITY and R2 < avg_distribution + ns*sigma_distribution_of_ratios, then R2 is significantly
                                   different.
                                   If R1=NEGATIVE_INFINITY and R2 > avg_distribution + ns*sigma_distribution_of_ratios, then R2 is significantly
                                   different.
 -out,--output_file_name <arg>     [MANDATORY] Output file name that will be created in the current folder
 -pvc,--pvalue_correction <arg>    [OPTIONAL] p-value correction method to apply. Valid values are: BH,BY,BONFERRONI,HOCHBERG,HOLM,HOMMEL. If not
                                   provided, the method will be BY (Reference: Yoav Benjamini, Daniel Yekutieli, "The control of the false discovery
                                   rate in multiple testing under dependency", Ann. Statist., Vol. 29, No. 4 (2001), pp. 1165-1188,
                                   DOI:10.1214/aos/1013699998 JSTOR:2674075)
 -qvt,--qvalue_threshold <arg>     [OPTIONAL] q-value threshold to apply to the corrected p-values. A value between 0 and 1 is permitted. If not
                                   provided, a threshold of 0.05 will be applied.
 -RInf,--replace_infinity <arg>    [OPTIONAL] -RInf replaces +/- Infinity with a user defined (+/-) value in the output summary table file

Detailed method explanation:

Let's say that we have 10 PCQ runs that come from the analysis of the surface accessibility of 10 different samples.

Quantitative ratios for each quantified site are read from the output of the individual 10 PCQ runs. The actual values that are read are the mean and the standard deviation, that is, the peptide node ratio and the associated standard deviation of the individual ratios measured for each quantified site.

Then, for each quantified site, a 10x10 matrix is built in which each cell stores the result of a t-test comparing the ratios of that site between one pair of samples.

The t-test is performed as a two-sample t-test between the individual ratio measurements of the site in one sample and the individual ratio measurements of the same site in the other sample. In practice, the test is calculated from the mean and standard deviation of each set of measurements rather than from the individual values.
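As an illustration of a t-test calculated from summary values, here is a minimal sketch in Python. It assumes the Welch (unequal variance) form of the two-sample test and assumes that the number of individual measurements per site is available alongside the mean and standard deviation; the page does not state which variant the program uses, and the function name `welch_t_from_stats` is hypothetical. The sketch returns the t statistic and the degrees of freedom; the p-value would then be obtained from the t distribution.

```python
import math


def welch_t_from_stats(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for a two-sample Welch t-test from summary statistics.

    Assumption: Welch's unequal-variance form. Returns (nan, nan) when a
    sample has fewer than two measurements or when both variances are zero.
    """
    if n1 < 2 or n2 < 2:
        return (math.nan, math.nan)
    var_of_mean1 = sd1 ** 2 / n1  # squared standard error of the first mean
    var_of_mean2 = sd2 ** 2 / n2  # squared standard error of the second mean
    total = var_of_mean1 + var_of_mean2
    if total == 0:
        return (math.nan, math.nan)
    t = (mean1 - mean2) / math.sqrt(total)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = total ** 2 / (
        var_of_mean1 ** 2 / (n1 - 1) + var_of_mean2 ** 2 / (n2 - 1)
    )
    return (t, df)
```

For example, two sites with means 1.0 and 0.0, standard deviation 1.0 and 10 measurements each give a t statistic of about 2.236 with 18 degrees of freedom.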

A peptide node ratio is infinite when, by the majority rule, most of the individual ratio measurements in the quant site's peptide node are infinities of the same sign. In those cases the t-test is replaced by the following rules:

  • If we have infinities in both samples:
    • significant if the signs of the infinities are opposite
    • non-significant if the sign of the infinities is the same
  • If only one sample is infinity, we calculate the mean and standard deviation of the whole ratio distribution (in this case -5.027 and 1.96, respectively), and
    • If the infinity is positive and the non-infinity is less than the total average + 2 * total stdev, it is significant
    • If the infinity is negative and the non-infinity is greater than the total average + 2 * total stdev, it is significant.
  • Otherwise it is not significant.
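The rules above can be sketched as a small Python function. This is only an illustration that follows the rules exactly as written on this page, including the use of the same threshold (total average + number of sigmas * total stdev) for both the positive and the negative infinity case. The function name `infinity_rule` and its parameter names are hypothetical and are not taken from the program.

```python
import math


def infinity_rule(r1, r2, total_mean, total_sd, n_sigmas=2.0):
    """Apply the infinity rules to two peptide node ratios.

    Returns True (significant), False (not significant), or None when
    neither ratio is infinite and the regular t-test applies instead.
    """
    inf1, inf2 = math.isinf(r1), math.isinf(r2)
    if not inf1 and not inf2:
        return None
    if inf1 and inf2:
        # Significant only when the two infinities have opposite signs
        return (r1 > 0) != (r2 > 0)
    infinite, finite = (r1, r2) if inf1 else (r2, r1)
    # Same threshold for both signs, as stated in the text
    threshold = total_mean + n_sigmas * total_sd
    if infinite > 0:
        return finite < threshold
    return finite > threshold
```

With the values quoted above (-5.027 and 1.96) and 2 sigmas, the threshold is -1.107, so a positive infinity compared against a finite ratio of -3.0 is significant, while the same infinity compared against 0.5 is not.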

Any significant result coming from the infinity rules is stored as “p-value=0.0” in the t-test matrix; however, these 0.0 p-values are excluded from the subsequent p-value correction.

If either of the two peptide node ratios being compared is based on a single measurement, or if either of them is NaN, the matrix value will be NaN.

Once all the matrices are filled, we correct for multiple hypothesis testing using the BY method (*) in the following way:

  • The p-value correction is performed over all the p-values obtained by comparing all the quant sites between two samples. In this way, we perform one p-value correction for each of the 45 pairwise comparisons between the 10 samples (45 corrections in total).
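The correction step can be sketched in Python as follows. The sketch implements the standard Benjamini-Yekutieli adjustment and applies it to the p-values of one sample pair, leaving NaN values and the 0.0 values from the infinity rules out of the correction. The function names `by_adjust` and `correct_pair`, the sample labels, and the choice to detect infinity-rule results by testing for exactly 0.0 are assumptions made for illustration; the program may track them differently.

```python
import math
from itertools import combinations


def by_adjust(pvalues):
    """Benjamini-Yekutieli adjusted p-values, in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted


def correct_pair(p_by_site):
    """Correct the p-values of all quant sites for one pair of samples.

    NaN values and 0.0 values (assumed here to mark infinity-rule results)
    are passed through unchanged and do not take part in the correction.
    """
    eligible = {
        site: p for site, p in p_by_site.items()
        if not math.isnan(p) and p != 0.0
    }
    corrected = dict(p_by_site)
    corrected.update(zip(eligible.keys(), by_adjust(list(eligible.values()))))
    return corrected


# 10 samples give 45 unordered pairs, hence 45 separate corrections
samples = [f"S{i}" for i in range(1, 11)]  # hypothetical sample names
pairs = list(combinations(samples, 2))
```

Each of the 45 entries in `pairs` would get its own call to `correct_pair`, with a dictionary holding the p-value of every quant site for that pair.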

Then, we count the number of discoveries (**) per quantified site, that is, the number of times that a quantified site is significantly different between two samples. The program will output the distribution of number of discoveries, that is, how many quant sites have N discoveries, how many quant sites have N-1 discoveries…how many quant sites have 1 discovery.
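A minimal Python sketch of this counting step is shown below. It assumes the corrected values are held in a dictionary keyed by quant site and then by sample pair, which is a hypothetical layout chosen for the example, and it treats the 0.0 values from the infinity rules as discoveries because they fall below the threshold.

```python
import math
from collections import Counter


def count_discoveries(q_by_site, threshold=0.05):
    """Count, per quant site, the sample pairs with a value below threshold.

    q_by_site maps site -> {(sample_a, sample_b): corrected p-value}.
    NaN entries are never counted.
    """
    return {
        site: sum(
            1 for q in by_pair.values()
            if not math.isnan(q) and q < threshold
        )
        for site, by_pair in q_by_site.items()
    }


def discovery_distribution(counts):
    """Map number of discoveries -> number of quant sites with that count."""
    return Counter(counts.values())


def sort_sites_by_discoveries(counts):
    """Site names sorted by descending number of discoveries."""
    return sorted(counts, key=lambda site: counts[site], reverse=True)
```

The sorted list corresponds to the order in which the quantified sites appear in the output files, by descending number of discoveries.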

The program outputs:

  • A single TSV file with the values used to calculate the p-values for each quantified site in each sample (output_comparison.tsv)
  • A single text file with the matrices of p-values for each of the quantified sites (output_comparison_QVALUES_matrixes.txt). The quantified sites are sorted by descending number of discoveries.
  • A single Excel file with the matrices of p-values for each of the quantified sites, with each quantified site in a different sheet. The quantified sites are sorted by descending number of discoveries.

(*) Yoav Benjamini, Daniel Yekutieli, "The control of the false discovery rate in multiple testing under dependency", Ann. Statist., Vol. 29, No. 4 (2001), pp. 1165-1188, DOI:10.1214/aos/1013699998 JSTOR:2674075

(**) A discovery means a p-value below the threshold of 0.05 in a comparison of ratios between two samples.