I've tried to document the package as best I can in #11, but sometimes my own understanding falls short.
Here are some questions I had that should still be answered in the documentation:
- How does the `external_file` option in the `XML` extractor work? Can we include an example?
- Same question for the `secondary_tag` option in the `XML` extractor. How can this be used?
- I understand how to use the `ExternalFile` extractor, but not why. Since the file has to be specified during metadata extraction, why not just read the file at that stage? Can we add an example where this would be useful?
- Lastly, the `FilterAttribute` extractor (a subclass of `XML`) sounds straightforward, but is there a difference between these two extractors? If so, what is it?
The `external_file` argument was introduced for the sake of the dutchnewspapers corpora by some green programmer in the distant past. This corpus has one .xml file for every newspaper, containing the ids of every article, and then a separate .xml file for each article in the newspaper. It's not unlikely that there is a more elegant way to achieve the same thing: retrieve the main information for a document from one file, and other fine-grained information (url, category, ocr confidence) from another.
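To make the two-file layout concrete, here is a minimal sketch of the idea using only Python's standard library. It does not use this package's extractor API. The tag names, attribute names, and the `extract_document` helper are all hypothetical and only illustrate combining a per-article file with a per-newspaper file.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-newspaper file: lists every article id plus fine-grained metadata.
ISSUE_XML = """
<issue>
  <article id="a1"><url>http://example.org/a1</url><category>news</category></article>
  <article id="a2"><url>http://example.org/a2</url><category>advert</category></article>
</issue>
""".strip()

# Hypothetical per-article file: holds the main content of one document.
ARTICLE_XML = """
<article id="a2"><title>Example title</title><text>Example body text.</text></article>
""".strip()


def extract_document(article_xml, issue_xml):
    """Combine fields from the article file with metadata from the issue file."""
    article = ET.fromstring(article_xml)
    issue = ET.fromstring(issue_xml)
    article_id = article.get("id")

    # Main information comes from the article's own file.
    doc = {
        "id": article_id,
        "title": article.findtext("title"),
        "content": article.findtext("text"),
    }

    # Fine-grained metadata comes from the matching entry in the issue file.
    for entry in issue.iter("article"):
        if entry.get("id") == article_id:
            doc["url"] = entry.findtext("url")
            doc["category"] = entry.findtext("category")
            break
    return doc


document = extract_document(ARTICLE_XML, ISSUE_XML)
```

In this sketch the article id is the join key between the two files, which is presumably the role the ids in the per-newspaper file play in the corpus described above.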