Gaps in documentation #12

lukavdplas · 2024-03-05T10:42:44Z

I've tried to document the package as best I can in #11, but sometimes my own understanding falls short.

Here are some questions I had that the documentation should still answer:

How does the external_file option in the XML extractor work? Can we include an example?
The same goes for the secondary_tag option in the XML extractor. How can it be used?
I understand how to use the ExternalFile extractor, but not why. Since the file has to be specified during metadata extraction, why not just read the file at that stage? Can we add an example where this would be useful?

Lastly, the FilterAttribute extractor (a subclass of XML) sounds straightforward, but is there a difference between these two extractors?

extractor_1 = XML({'foo': 'bar'})
extractor_2 = FilterAttribute({'attribute': 'foo', 'value': 'bar'})

If so, what is it?


BeritJanssen · 2024-07-03T13:25:57Z

The external_file argument was introduced for the sake of the dutchnewspapers corpora by some green programmer in the distant past. This corpus has an .xml file for every newspaper, containing the ids of every article, and then an .xml file for each article in the newspaper. It's not unlikely that there is a more elegant way to achieve this: retrieving the information for a document from one file, and other fine-grained information (url, category, ocr confidence) from another.
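To make the two-file layout concrete, here is a rough sketch using only Python's standard xml.etree.ElementTree. It is not how the package implements external_file. The element names, attribute names, sample data and the read_documents helper are all invented for illustration. It only shows the general idea of taking ids and fine-grained metadata from a per-newspaper index file and the text from a per-article file.

```python
import xml.etree.ElementTree as ET

# Hypothetical index file for one newspaper: lists article ids plus
# fine-grained metadata. All element and attribute names are assumptions.
INDEX_XML = """
<issue>
  <article id="a1" url="https://example.org/a1" category="news" ocr="0.93"/>
  <article id="a2" url="https://example.org/a2" category="advert" ocr="0.71"/>
</issue>
""".strip()

# Hypothetical per-article files, keyed by article id. In a real corpus
# these would be separate .xml files on disk.
ARTICLE_XML = {
    "a1": "<text><title>Flood in the harbour</title><p>Water rose overnight.</p></text>",
    "a2": "<text><title>Bicycles for sale</title><p>Enquire within.</p></text>",
}


def read_documents(index_xml, article_files):
    """Yield one dict per article, merging fields from both sources."""
    index = ET.fromstring(index_xml)
    for entry in index.iter("article"):
        article_id = entry.get("id")
        # Main content comes from the article's own file.
        article = ET.fromstring(article_files[article_id])
        yield {
            "id": article_id,
            "title": article.findtext("title"),
            "content": article.findtext("p"),
            # Fine-grained metadata comes from the newspaper-level index.
            "url": entry.get("url"),
            "category": entry.get("category"),
            "ocr_confidence": float(entry.get("ocr")),
        }


documents = list(read_documents(INDEX_XML, ARTICLE_XML))
```

Each resulting dict combines fields from both files, which is the situation the external_file argument was meant to handle.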

lukavdplas · 2024-10-22T11:06:42Z

This can be closed; #18 also improved the documentation for the XML extractor.

lukavdplas added the "help wanted" and "documentation" labels Mar 5, 2024

lukavdplas closed this as completed Oct 22, 2024
