series/quantiles: source-based quantile support
Old behaviour:
==============

creativity-series previously removed non-finite values, and sorted the
rest, producing something like:

t,1st,2nd,3rd,...
2,10,11,12
3,3

(where t=0 and t=1 have no finite values, t=3 has only one finite value,
etc.).  This discards any association with the files they come from,
however, which means any sort of source-based confidence exclusion is
impossible.

creativity-series-quantiles then used these pre-sorted values to produce
another .csv of quantile values.

creativity-series-graphs accepted either one: given a series file, it
calculated quantiles on the fly; given a quantiles file, it used the
pre-calculated quantiles.

New behaviour:
==============

creativity-series now just generates something like:

t,source1,source2,source3
0,nan,nan,nan
1,nan,nan,nan
2,12,10,11
3,nan,3,nan

which creativity-series-quantiles and creativity-series-graphs now
understand: they do the nan trimming and sorting themselves, then
calculate quantiles.
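
As a rough illustration of the trim-sort-quantile step described above, here is a
minimal C++ sketch. The helper names finite_sorted and sample_quantile are
hypothetical, and linear interpolation between order statistics is an assumed
convention; the commit message does not say which quantile definition the
programs actually use.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: keep only the finite values of one period's row, sorted ascending.
std::vector<double> finite_sorted(const std::vector<double> &row) {
    std::vector<double> out;
    for (double v : row)
        if (std::isfinite(v)) out.push_back(v);
    std::sort(out.begin(), out.end());
    return out;
}

// Hypothetical helper: sample quantile q in [0,1] of a sorted, non-empty vector.
// Assumption: linear interpolation between adjacent order statistics.
double sample_quantile(const std::vector<double> &sorted, double q) {
    double pos = q * (sorted.size() - 1);
    std::size_t lo = static_cast<std::size_t>(std::floor(pos));
    std::size_t hi = std::min(lo + 1, sorted.size() - 1);
    return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
}
```

Under these assumptions, a row such as 12,nan,10,11 is reduced to 10,11,12
before any quantile is taken, so the nan entry has no effect on the result.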

creativity-series-graphs also gains an entirely new ability, with flag
--source-confidence (but only when given a creativity-series file): to
select a confidence region by excluding (1-x)% of source files, then
plotting the minimum and maximum values remaining after removal of the
most non-median values.  This is done by calculating (p-0.5)^2 for the
inverse quantile p for each time period, then summing these up across
time periods: the source files with the largest scores are excluded.
(see --help for the gory details).
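
The scoring rule above can be sketched roughly as follows. This is a simplified
illustration, not the code in this commit: source_scores and keep_sources are
hypothetical names, and the sketch assumes every value is finite and distinct
within a period and that there are at least two sources (the --help text
describes special handling for ties and non-finite values, which is omitted
here).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: data[t][s] is the value of source s in period t.
// Assumes finite, distinct values per period and at least two sources.
std::vector<double> source_scores(const std::vector<std::vector<double>> &data) {
    if (data.empty()) return {};
    const std::size_t n = data[0].size();
    std::vector<double> score(n, 0.0);
    for (const auto &row : data) {
        // order[r] is the index of the source holding the r-th smallest value this period
        std::vector<std::size_t> order(n);
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](std::size_t a, std::size_t b) { return row[a] < row[b]; });
        for (std::size_t r = 0; r < n; r++) {
            double p = static_cast<double>(r) / (n - 1); // 0 = smallest, 1 = largest
            score[order[r]] += (p - 0.5) * (p - 0.5);
        }
    }
    return score;
}

// Indices of the sources kept after dropping the `drop` highest-scoring ones.
std::vector<std::size_t> keep_sources(const std::vector<double> &score, std::size_t drop) {
    std::vector<std::size_t> idx(score.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) { return score[a] < score[b]; });
    idx.resize(idx.size() - std::min(drop, idx.size()));
    return idx;
}
```

A source that sits at the median in every period scores 0 and is kept longest;
a source that is always the minimum or maximum adds 0.25 per period and is
dropped first.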
jagerman committed Feb 3, 2016
1 parent 58cbdbe commit 99de1bb
Showing 8 changed files with 338 additions and 179 deletions.
62 changes: 56 additions & 6 deletions creativity/cmdargs/SeriesGraphs.cpp
@@ -36,6 +36,8 @@ void SeriesGraphs::addOptions() {
("legend-position,L", value(legend_position_input), "The general position of the graph legend, one of 'none', 'inside', 'right', 'left', 'top', 'bottom' where 'none' suppresses the legend, 'inside' puts it inside the graph, and the rest put it outside on the indicated side of the graph.")
("legend-x", range<0,1>(legend_rel_x), "Relative x position (for --legend-position={inside,top,bottom}): 0 is left-most, 1 is right-most.")
("legend-y", range<0,1>(legend_rel_y), "Relative y position (for --legend-position={inside,left,right}): 0 is top-most, 1 is bottom-most.")
("time-confidence", "If given, calculate confidence intervals per-time period.")
("source-confidence", "If given, calculate confidence intervals by excluding the most extreme source files. See --help for details.")
;

po::options_description input_desc("Input files");
@@ -61,6 +63,13 @@ void SeriesGraphs::postParse(boost::program_options::variables_map &vars) {
throw po::invalid_option_value("--t-max value must be larger than --t-min value");
if (std::isfinite(graph_min_value) and std::isfinite(graph_max_value) and graph_max_value <= graph_min_value)
throw po::invalid_option_value("--y-max value must be larger than --y-min value");
if (vars.count("time-confidence")) {
if (vars.count("source-confidence")) throw po::invalid_option_value("--time-confidence and --source-confidence are mutually exclusive");
per_t_confidence = true;
}
if (vars.count("source-confidence")) {
per_t_confidence = false;
}
if (vars.count("same-t-scale") or (graph_min_t >= 0 and graph_max_t >= 0))
same_horizontal_scale = true;
if (vars.count("different-y-scale")) {
@@ -82,12 +91,53 @@ std::string SeriesGraphs::usage() const {
}

std::string SeriesGraphs::help() const {
return CmdArgs::help() + "Input files:\n" +
" FILE [FILE ...] One or more series input files from which to\n" +
" calculate quantiles. At least one must\n" +
" be given.\n\n" +
"This program takes files as produced by the creativity-series program as inputs\n" +
"and retains the existing sample quantiles for each period in the series files.\n\n";
return CmdArgs::help() + u8R"HELP(Input files:
FILE [FILE ...] One or more series or quantile input
files from which to graph series. At
least one must be given.
This program takes files as produced by the creativity-series or
creativity-series-quantiles programs as inputs and retains the existing sample
quantiles for each period in the series files.
The plot behaviour has two modes. The default mode, --time-confidence, and the
only mode available when using quantiles input files, plots by calculating the
requested sample quantile as recorded in the quantiles file (which must already
exist in the file) or calculated from the finite values in the series file. For
example, for period t=20, a 95% confidence interval will result in the
confidence band at t=20 throwing away the top and bottom 2.5% of simulations
observed in period t=20.
The other mode, --source-confidence, available only when using series files,
applies the requested confidence interval(s) to the source files. In this mode,
source file data is included only for the 95% (or whatever level) of source
files that have the "smallest" quantiles. Specifically, each file is assigned a
score of:
∑ (pₜ - 0.5)²
t
where pₜ is the source file's inverse sample quantile at time t (that is, 0 for
the smallest observation, 1 for the largest, and 0.5 for the median
observation). Note that when a period t has NaN and infinite values, these
values are calculated with pₜ=0, and the smallest and largest finite values for
period t receive values of pₜ = #/2 and 1-#/2, where # is the number of
non-finite values in period t. In the case of duplicate values, the pₜ value is
the most favourable value (for example, for the data 1 1 2 3 4, both
observations with value 1 have a pₜ value of 0.25, not 0.).
Once the score has been calculated across all observations for all data files,
the x% of data files with the highest score are excluded; the minimum and
maximum of the remaining (finite) values then form the confidence interval end
points.
Note that this procedure is guaranteed to produce confidence intervals at least
as wide as the --time-confidence intervals, and almost always wider. Also note
that, when drawing a median, the actual line is the path of the series which
received the lowest aggregate score, *not* the median of each time period's
values.
)HELP";
}

std::string SeriesGraphs::versionSuffix() const {
3 changes: 3 additions & 0 deletions creativity/cmdargs/SeriesGraphs.hpp
@@ -29,6 +29,9 @@ class SeriesGraphs : public CmdArgs {
*/
std::string levels = "median,0.5,0.9,0.95";

/// If true, we're in per-t confidence mode. If false, we're in per-source file confidence mode.
bool per_t_confidence = true;

/** The input files (series or quantiles) to plot. Plot axes scales will be identical
* across files, so typically this should be called with series files for the same (or at
* least directly comparable) variables.
3 changes: 1 addition & 2 deletions creativity/data/quantiles.cpp
@@ -4,8 +4,7 @@
namespace creativity { namespace data {

const std::regex
quantile_field_regex("m(?:in|ax|edian)|q\\d*[1-9]"),
ordinal_regex("(?:\\d*[02-9])?(?:1st|2nd|3rd)|\\d*1[123]th|\\d*[04-9]th");
quantile_field_regex("m(?:in|ax|edian)|q\\d*[1-9]");

std::string quantile_field(double quantile) {
if (quantile == 0) return "min";
Expand Down
3 changes: 0 additions & 3 deletions creativity/data/quantiles.hpp
@@ -24,8 +24,5 @@ std::string quantile_field(double quantile);
*/
extern const std::regex quantile_field_regex;

/** A std::regex object that matches ordinal values as produced by creativity-series. */
extern const std::regex ordinal_regex;


}}
18 changes: 10 additions & 8 deletions data.cpp
@@ -24,19 +24,21 @@ using namespace eris;
// When outputting CSV values, we want exact precision, but don't necessarily need the full
// max_digits10 digits to get there. For example, 0.1 with max_digits10 becomes 0.10000000000000001
// but 0.1 also converts to the numerically identical value. 0.10000000000000002, on the other
// hand, needs every decimal digit of precision.
// hand, needs every decimal digit of precision, because it isn't equal to the double value 0.1
//
// This function tries first at the requested precision less two, then the requested precision
// less one, then the requested precision itself; the first (and thus shortest) result that is
// numerically identical to the given double value is returned.
std::string double_str(double d, unsigned precision) {
double round_trip;
for (unsigned prec : {precision-2, precision-1}) {
std::stringstream ss;
ss.precision(prec);
ss << d;
ss >> round_trip;
if (round_trip == d) { ss << d; return ss.str(); }
if (std::isfinite(d)) {
double round_trip;
for (unsigned prec : {precision-2, precision-1}) {
std::stringstream ss;
ss.precision(prec);
ss << d;
ss >> round_trip;
if (round_trip == d) { ss << d; return ss.str(); }
}
}
std::stringstream ss;
ss.precision(precision);
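
The hunk above is cut off, so as a stand-alone illustration of the same
shortest-round-trip idea, here is a hedged sketch. The name shortest_round_trip
is hypothetical and this is not the committed function: it uses a separate
input stream to parse each candidate, and it adds a precision > 2 guard (an
assumption of this sketch) so the unsigned subtraction cannot wrap.

```cpp
#include <cmath>
#include <initializer_list>
#include <sstream>
#include <string>

// Hypothetical stand-alone variant: return the shortest of three candidate
// precisions whose text parses back to exactly the same double.
std::string shortest_round_trip(double d, unsigned precision) {
    // Non-finite values can never pass the == round-trip test (nan != nan),
    // so they skip straight to the full-precision fallback.
    if (std::isfinite(d) && precision > 2) {
        for (unsigned prec : {precision - 2, precision - 1}) {
            std::ostringstream out;
            out.precision(prec);
            out << d;
            // Parse the candidate text with a separate stream so `out` is only written to.
            std::istringstream in(out.str());
            double round_trip;
            if (in >> round_trip && round_trip == d) return out.str();
        }
    }
    std::ostringstream out;
    out.precision(precision);
    out << d;
    return out.str();
}
```

With precision 17 (max_digits10 for an IEEE double), 0.1 is tried at 15 digits
first and comes back as the short form, while a value that needs all 17 digits
falls through to the final full-precision output.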