This corpus is used to add an extra feature to the table extraction tool: for each cell text, the probability that it is labelled as meta-data in a large corpus of tables.
To compute it, we applied an unsupervised annotation method to a corpus of 145,533,822 tables:
- First, we iterate over the corpus, annotating every cell in the first row or column as meta-data and every other cell as data. Using this heuristic, we build a dictionary where the key is the text of a cell and the value is the likelihood of that cell being meta-data.
- Then, we iterate over the corpus again, but this time we use the previously computed dictionary to average the likelihood over each whole row or column. If the average is higher than 0.5, we consider every cell in that row or column to be meta-data, and data otherwise. We then rebuild the dictionary using the new meta-data/data occurrences.
- We repeat the previous step until no significant changes are produced.
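
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to build the corpus: a table is assumed to be a list of rows of cell strings, the likelihood is assumed to be the fraction of a text's occurrences labelled as meta-data, a cell is assumed to be meta-data if either its row or its column exceeds the threshold, and the stopping rule (`tol`, `max_iters`) is a hypothetical stand-in for "no significant changes". All function names are illustrative.

```python
from collections import defaultdict


def first_row_col_labels(table):
    # Initial heuristic: cells in the first row or first column are meta-data.
    return [[r == 0 or c == 0 for c in range(len(row))]
            for r, row in enumerate(table)]


def build_dictionary(tables, labels):
    # Assumed estimate: share of a text's occurrences labelled as meta-data.
    meta, total = defaultdict(int), defaultdict(int)
    for table, table_labels in zip(tables, labels):
        for row, row_labels in zip(table, table_labels):
            for text, is_meta in zip(row, row_labels):
                total[text] += 1
                meta[text] += int(is_meta)
    return {text: meta[text] / total[text] for text in total}


def relabel(table, likelihood, threshold=0.5):
    def average(cells):
        if not cells:
            return 0.0
        return sum(likelihood.get(c, 0.0) for c in cells) / len(cells)

    n_cols = max((len(row) for row in table), default=0)
    meta_rows = [average(row) > threshold for row in table]
    meta_cols = [average([row[c] for row in table if c < len(row)]) > threshold
                 for c in range(n_cols)]
    # Assumption: a cell is meta-data if its row OR its column qualifies.
    return [[meta_rows[r] or meta_cols[c] for c in range(len(row))]
            for r, row in enumerate(table)]


def annotate(tables, max_iters=20, tol=1e-3):
    # Step 1: first-row/first-column heuristic and initial dictionary.
    labels = [first_row_col_labels(t) for t in tables]
    likelihood = build_dictionary(tables, labels)
    # Steps 2 and 3: relabel with the dictionary, rebuild it, repeat.
    for _ in range(max_iters):
        labels = [relabel(t, likelihood) for t in tables]
        updated = build_dictionary(tables, labels)
        change = max((abs(updated[k] - likelihood[k]) for k in updated),
                     default=0.0)
        likelihood = updated
        if change < tol:  # hypothetical convergence test
            break
    return likelihood


tables = [
    [["name", "age"], ["alice", "30"], ["bob", "25"]],
    [["name", "age"], ["carol", "41"]],
]
likelihood = annotate(tables)
print(likelihood["name"], likelihood["30"])
```

On a corpus of the size described here, the dictionary and both passes would need to be streamed or distributed rather than held in memory as in this sketch.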
While this simple method is not very effective on its own, its output can be used as another feature of the table extraction tool.