
Correlation

David Megginson edited this page May 2, 2018 · 3 revisions

The #valid_correlation test in HXL schemas allows you to test whether one value is always aligned with another one. The test is one-way: it will check that the values of the columns matched by the tag patterns in #valid_correlation always correspond to the same value in the column matched by the tag pattern in #valid_tag, but not the other way around. A couple of examples will help make it clear.

1:1 correspondences

Consider the following data:

#adm1+name      #adm1+code
Coast Province  XX01
Cast Province   XX01
Coast Province  X01
Coast Province  XX01
Coast Province  XX01

You could use the following rule to test for correlation from the first column to the second:

{
    "#valid_tag":"#adm1+name",
    "#valid_correlation":"#adm1+code",
    "#description":"Province name does not correspond with P-code"
}

With this rule, the HXL validation engine would detect the error "Cast Province" => "XX01" (and suggest the correction "Coast Province"), but would not detect the error "Coast Province" => "X01".
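The one-way behaviour described above can be sketched in plain Python. This is only an illustration, not the HXL validation engine's actual code: the function name `find_correlation_errors` is hypothetical, and treating the most common value as the expected one (and as the suggested correction) is an assumption made for this sketch.

```python
from collections import Counter, defaultdict

def find_correlation_errors(rows, tag, correlation):
    """One-way check: for each value of the `correlation` column,
    the `tag` column is expected to hold a single value.
    Assumption: the most common `tag` value is taken as the expected one."""
    # Count the tag values seen for each correlation key.
    seen = defaultdict(Counter)
    for row in rows:
        seen[row[correlation]][row[tag]] += 1

    errors = []
    for row in rows:
        expected = seen[row[correlation]].most_common(1)[0][0]
        if row[tag] != expected:
            # (bad value, correlation key, suggested correction)
            errors.append((row[tag], row[correlation], expected))
    return errors

rows = [
    {"#adm1+name": "Coast Province", "#adm1+code": "XX01"},
    {"#adm1+name": "Cast Province", "#adm1+code": "XX01"},
    {"#adm1+name": "Coast Province", "#adm1+code": "X01"},
    {"#adm1+name": "Coast Province", "#adm1+code": "XX01"},
    {"#adm1+name": "Coast Province", "#adm1+code": "XX01"},
]

errors = find_correlation_errors(rows, "#adm1+name", "#adm1+code")
print(errors)
```

In this sketch the code "X01" appears only once, so there is nothing to compare its name against, which is why the misspelled P-code goes unnoticed in this direction.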

The following rule tries the correlation the other way:

{
    "#valid_tag":"#adm1+code",
    "#valid_correlation":"#adm1+name",
    "#description":"Province P-code does not correspond with name"
}

With this rule, the HXL validation engine would detect the error "X01" => "Coast Province" (and suggest the correction "XX01"), but would not detect the error "XX01" => "Cast Province". When you require a two-way correspondence, you need to add two reciprocal correlation rules:

[
    {
        "#valid_tag":"#adm1+name",
        "#valid_correlation":"#adm1+code",
        "#description":"Province name does not correspond with P-code"
    },
    {
        "#valid_tag":"#adm1+code",
        "#valid_correlation":"#adm1+name",
        "#description":"Province P-code does not correspond with name"
    }
]
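As a rough illustration of how the two reciprocal rules above cover both errors, the sketch below reads the same JSON and applies each rule to the sample data in turn. The helper `check_rule` is hypothetical and the majority-vote heuristic is an assumption of this sketch; the real validation engine may pick its expected values differently.

```python
import json
from collections import Counter, defaultdict

schema = json.loads("""
[
    {
        "#valid_tag": "#adm1+name",
        "#valid_correlation": "#adm1+code",
        "#description": "Province name does not correspond with P-code"
    },
    {
        "#valid_tag": "#adm1+code",
        "#valid_correlation": "#adm1+name",
        "#description": "Province P-code does not correspond with name"
    }
]
""")

# Sample data as (name, code) pairs, converted to one dict per row.
pairs = [
    ("Coast Province", "XX01"),
    ("Cast Province", "XX01"),
    ("Coast Province", "X01"),
    ("Coast Province", "XX01"),
    ("Coast Province", "XX01"),
]
rows = [{"#adm1+name": name, "#adm1+code": code} for name, code in pairs]

def check_rule(rows, rule):
    """Apply one rule in its own direction only (hypothetical helper)."""
    tag, key = rule["#valid_tag"], rule["#valid_correlation"]
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[key]][row[tag]] += 1
    report = []
    for row in rows:
        # Assumption: the most common value for this key is the expected one.
        expected = counts[row[key]].most_common(1)[0][0]
        if row[tag] != expected:
            report.append((rule["#description"], row[tag], expected))
    return report

report = [item for rule in schema for item in check_rule(rows, rule)]
for description, value, suggestion in report:
    print(f"{description}: {value!r} (suggest {suggestion!r})")
```

Each rule contributes exactly one finding here: the first catches the misspelled name, the second the misspelled P-code.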

1:many correspondences

The reason for the extra complexity above is that correlations aren't always reciprocal. For example, "Prefecture A" and "Prefecture B" might both be admin2 levels inside the same province. Consider the following:

#adm1+name      #adm2+name
Coast Province  Prefecture A
XXX             Prefecture A
Coast Province  Prefecture B
Coast Province  Prefecture B
Coast Province  Prefecture A

Both "Prefecture A" and "Prefecture B" are acceptable values for #adm2+name when #adm1+name is "Coast Province" (the province can have multiple prefectures), but only "Coast Province" is an acceptable value for #adm1+name when #adm2+name is "Prefecture A" (the prefecture is always in the same province). This rule will detect the one-way correspondences:

{
    "#valid_tag":"#adm1+name",
    "#valid_correlation":"#adm2+name",
    "#description":"Wrong province for prefecture"
}

It will correctly identify "XXX" as an error (and suggest "Coast Province" as a correction), but will not complain that both "Prefecture A" and "Prefecture B" are in the same province.
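To see why only one direction makes sense for this data, the sketch below runs a simple one-way check both ways. It is a simplified model, not the HXL engine itself: the function `one_way_errors` is hypothetical, and it assumes the most common value per key is the expected one (ties go to the value seen first).

```python
from collections import Counter, defaultdict

def one_way_errors(rows, tag, correlation):
    """Flag rows whose `tag` value is not the most common one
    for their `correlation` value (assumed heuristic)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[correlation]][row[tag]] += 1
    errors = []
    for row in rows:
        expected = counts[row[correlation]].most_common(1)[0][0]
        if row[tag] != expected:
            errors.append((row[tag], row[correlation], expected))
    return errors

pairs = [
    ("Coast Province", "Prefecture A"),
    ("XXX", "Prefecture A"),
    ("Coast Province", "Prefecture B"),
    ("Coast Province", "Prefecture B"),
    ("Coast Province", "Prefecture A"),
]
rows = [{"#adm1+name": adm1, "#adm2+name": adm2} for adm1, adm2 in pairs]

# Direction used by the rule above: each prefecture implies one province.
forward = one_way_errors(rows, "#adm1+name", "#adm2+name")

# Reciprocal direction: each province would have to imply one prefecture,
# which is false for 1:many data, so valid rows get flagged.
reverse = one_way_errors(rows, "#adm2+name", "#adm1+name")

print(forward)
print(reverse)
```

Under these assumptions the forward check reports only the "XXX" row, while the reverse check flags perfectly valid "Prefecture B" rows, which is why a reciprocal rule is left out for 1:many relationships.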
