(feat,fix) save/export scrapper buffers #146

myshevchuk · 2020-12-09T22:33:45Z

remap save-buffer to new function orb-pdf-scrapper-save
remap write-file to new function orb-pdf-scrapper-save-as
These functions handle "fileless" Scrapper buffers correctly
fix a bug where the buffer would get master xml file name after cancelling
switch to Scrapper buffer when pressing 'n' in prevent-concurrent dialog

myshevchuk · 2020-12-09T22:39:41Z

This PR introduces the commands orb-pdf-scrapper-save and orb-pdf-scrapper-save-as. The former saves the current progress in a Scrapper buffer to the corresponding temp file. It shadows Emacs' save-buffer in Scrapper buffers, for those (including myself) who press C-x C-s on every occasion. The latter (C-x C-w) shadows write-file, making it possible to save the buffer to a new location without breaking the Scrapper session.
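The remapping described above can be sketched with Emacs' standard `[remap COMMAND]` key syntax. The keymap name `my/scrapper-demo-map` is hypothetical and only serves as an illustration; the keymap ORB PDF Scrapper actually uses is not shown in this thread.

```emacs-lisp
;; A minimal sketch, not ORB's actual code.
;; `my/scrapper-demo-map' is a hypothetical keymap used for illustration.
(defvar my/scrapper-demo-map
  (let ((map (make-sparse-keymap)))
    ;; Whatever keys invoke `save-buffer' (C-x C-s by default) will run
    ;; `orb-pdf-scrapper-save' while this keymap is active.
    (define-key map [remap save-buffer] #'orb-pdf-scrapper-save)
    ;; Likewise for `write-file' (C-x C-w by default).
    (define-key map [remap write-file] #'orb-pdf-scrapper-save-as)
    map)
  "Hypothetical keymap shadowing the save commands in Scrapper buffers.")
```

Remapping by command rather than by key means the shadowing follows the user's own bindings for save-buffer and write-file, whatever they are.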

@j-steinbach this PR has been merged into develop, you are welcome to test it.

j-steinbach · 2020-12-10T21:59:57Z

It "works". I ran the PDF Scrapper, and saved the "Text mode" and "BibTeX mode" buffers to a different file.

What I am missing is the ability to insert the content into the buffer where I started the PDF Scrapper process.

myshevchuk · 2020-12-11T11:18:47Z

Yeah, I remember. Just haven't had enough time.

myshevchuk · 2020-12-11T11:33:11Z

So for now I plan to introduce two user options, orb-pdf-scrapper-export-text-references and orb-pdf-scrapper-export-bibtex-references. Their allowed values will be heading and file. In the first case, references will be saved under a separate heading in the original Org roam buffer. In the second case, they will be saved into a separate file. Perhaps another variable will be needed to control the location of the exported files.

The variable orb-pdf-scrapper-export-bibtex-references will be allowed to have a string value, in which case the BibTeX references will be appended (or prepended) to an existing .bib file, except for those whose keys already match entries in that file.

myshevchuk · 2020-12-21T22:11:13Z

@j-steinbach So, with the latest commit a feature was added to automatically export the scrap(p)ed data upon hitting C-c C-c in the Org-mode buffer.

The user option orb-pdf-scrapper-export-options (the name may change before merging) controls the automatic export.

This variable is an association list of the form

(TYPE . ((LOCATION TARGET PROPERTIES)))

TYPE is one of the symbols org, bib or txt.
LOCATION is one of the symbols headline or file.
TARGET is a string describing the target. When LOCATION is headline, TARGET will be the name of a headline under which to put the corresponding data (org, bib or txt). When LOCATION is file, TARGET will be the name of a directory or a file where the data will be saved (see below).
PROPERTIES is a property list for additional control.

Example:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "temp.org"
               :placement append))
        (txt
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "txt")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "directory/"))
        (bib
         (file "my-library.bib"
               :placement prepend))))

The first element of the list controls export of the Org data (TYPE = org). There are two locations declared for the Org export: headline and file. So the Org data will be exported both to the headline named "References (extracted by ORB PDF Scrapper)" in the buffer of origin and to a file named "temp.org".

The headline will be supplied with a property drawer with the following properties, according to the :property-drawer property:

- "PDF_SCRAPPER_TYPE" with the value "org"
- "PDF_SCRAPPER_SOURCE" with the name of the PDF file the references were extracted from
- "PDF_SCRAPPER_DATE" with the date and time when the references were extracted

It is possible to add arbitrary properties with arbitrary values. Furthermore, the value of a property can be a function (as a symbol), and this function will be called to calculate a property.

Currently, this property stuff is a prototype. It's not very useful yet, but may become more useful in the future.

Now the file LOCATION. The TARGET of the file location can be an existing directory, in which case a new file will be created, named after the note's citation key plus the corresponding extension (org, bib or txt), and the extracted data will be put there. If the TARGET is not an existing directory, it is assumed to be a file. If the file does not exist yet, it will be created. If the file exists, it will be used. The :placement property controls where to put the extracted data: at the beginning of the file (symbol prepend) or at the end (symbol append). The TARGET can be an absolute path or a path relative to the directory of the note of origin.
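The resolution rule for the file location can be sketched as follows. This is only an illustration of the rule as described in this comment, not ORB's implementation; the function name `my/orb-export-target-file` and its arguments are hypothetical.

```emacs-lisp
;; A minimal sketch of the rule described above, not ORB's actual code.
;; `my/orb-export-target-file' is a hypothetical helper.
(defun my/orb-export-target-file (target note-dir citekey type)
  "Return the file that the export TARGET would resolve to.
TARGET is an absolute path or a path relative to NOTE-DIR.
CITEKEY is the note's citation key, TYPE is one of the symbols
org, bib or txt."
  ;; An absolute TARGET is returned unchanged by `expand-file-name';
  ;; a relative one is resolved against NOTE-DIR.
  (let ((path (expand-file-name target note-dir)))
    (if (file-directory-p path)
        ;; Existing directory: the file is named CITEKEY.TYPE inside it.
        (expand-file-name (format "%s.%s" citekey type) path)
      ;; Anything else is treated as a file, existing or not.
      path)))
```

For instance, `(my/orb-export-target-file "temp.org" "/notes/" "Doe2021" 'org)` resolves to a file in the note's directory, provided no directory of that name exists there.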

In the above example, a file temp.org in the note's directory will be used as the target. If it does not exist, it will be created; if it exists, the Org data will be put at the end of the file.

The second element of the list, the txt TYPE, declares that the text data should be put both under a headline in the note's buffer and into a new file (say, Doe2021.txt) in a directory named "directory" that resides within the note's directory.

The third element of the list, the bib TYPE, declares that the bibtex data should be prepended to a file my-library.bib.

This is a breaking change. Because all elements of the list are optional, no data will be exported at all if the variable orb-pdf-scrapper-export-options is nil (it is not by default). To emulate the previous behaviour of Orb PDF Scrapper, this variable must have the following value:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"))))

Although it seems to work for me, this feature is only a preview, and any suggestions are welcome. I will merge it after cleaning up and documenting, and probably some bug hunting.

myshevchuk · 2020-12-21T22:18:02Z

Probably LOCATION should be called TARGET and vice versa.

j-steinbach · 2020-12-22T23:08:25Z

It's a lot to wrap your head around, but overall it looks fine. Two things:

What if I want to save in another file under a specific heading? At the moment it only prepends/appends the data.

Wouldn't it make more sense to specify the file (TARGET ❓) as either original or path, where original is the file I started the PDF Scrapper process and path is a file of my choice. And then I can select where in the document I want to save the data. Either a specific heading, or prepend or append.

It likely makes sense to give the user an ability to do something after the extraction. I am too newb for it, but I think it is called hooks?

For example, I would like to look-up the extracted cite-keys and -if they don't already exist- create a new file for each of them.

At the moment it doesn't make that much sense (as I still need to manually fix the reference keys to match my Zotero database and bibtex file), but if that feature also gets rolling, the path to automation 🚂 and astra ⭐ is open!

j-steinbach · 2020-12-22T23:10:43Z

Oh, if everything is optional, wouldn't it make sense to be able to change/define the headline "References (extracted by ORB PDF Scrapper)"?

myshevchuk · 2020-12-23T15:14:14Z

> What if I want to save in another file under a specific heading? At the moment it only prepends/appends the data.
>
> Wouldn't it make more sense to specify the file (TARGET ❓) as either original or path, where original is the file I started the PDF Scrapper process and path is a file of my choice. And then I can select where in the document I want to save the data. Either a specific heading, or prepend or append.

Fully agree. I would keep the existing structure of the list, though. The identifier file could be renamed to path. The list would then look something like this:

(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (path "temp.org"
               :placement (headline "References (extracted by ORB PDF Scrapper)")))))

So the top-level headline target would imply the buffer of origin, as it does now. The inferior headline target within the path target would specify the heading in that file. The headline placement directive would only take effect when the target file is an Org-mode file.

> It likely makes sense to give the user an ability to do something after the extraction. I am too newb for it, but I think it is called hooks?
>
> For example, I would like to look-up the extracted cite-keys and -if they don't already exist- create a new file for each of them.

It's possible to provide some hooks. There are several points in the overall process, where these hooks can be called.

1. Reference extraction
2. Conversion to BibTeX
3. Conversion to Org
4. Export
5. Additionally, the whole Scrapper process can be such a point.

Each point can have two associated hooks - before and after. For your purpose, I'd suggest either 4 or 5, e.g. orb-pdf-scrapper-after-export-hook or orb-pdf-scrapper-end-process-hook.
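A hook of this kind would be used like any other Emacs hook. The sketch below assumes the hook name proposed above; at the time of this comment the hook does not exist yet, so the `defvar` merely stands in for it, and `my/orb-after-export-notify` is a hypothetical function.

```emacs-lisp
;; A sketch only: the hook name is the one proposed in this thread and
;; may not exist, or may be named differently, in a released version.
(defvar orb-pdf-scrapper-after-export-hook nil
  "Hypothetical hook run after ORB PDF Scrapper has exported its data.")

(defun my/orb-after-export-notify ()
  "Hypothetical hook function: report that the export step has finished."
  (message "ORB PDF Scrapper: export finished"))

;; Register the function; it would run every time the hook is run.
(add-hook 'orb-pdf-scrapper-after-export-hook #'my/orb-after-export-notify)
```

A function that creates missing notes for the extracted cite keys would be registered the same way, once it is known how the hook exposes those keys.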

> Oh, if everything is optional, wouldn't it make sense to be able to change/define the headline "References (extracted by ORB PDF Scrapper)"?

Sure, this string is not hardcoded, it's exposed as a part of a user option - just change "References (extracted by ORB PDF Scrapper)" to whatever you like.

myshevchuk · 2021-01-10T22:32:46Z

I've heavily refactored and hopefully optimized the export code and added a target heading export option as requested. Please check the latest commit in this branch.

Consult the docstring of orb-pdf-scrapper-export-options for a description of the available options. The following example should demonstrate what's possible:

(setq orb-pdf-scrapper-export-options
      '((org  ;; <= TYPE 
         ;;  Export to a heading in the buffer of origin
         (heading "References (extracted by ORB PDF Scrapper)".   
         ;; ^             ^
         ;; TARGET     LOCATION
                     ;; PROPERTIES
                     ;;    v
                     :property-drawer ("PDF_SCRAPPER_TYPE"
                                       "PDF_SCRAPPER_SOURCE"
                                       "PDF_SCRAPPER_DATE")))
        (txt
         ;; Export to a file "references.org"
         (path "references.org"                           
               ;; under a heading "New references"                     
               :placement                                                            
               (heading "New references"
                        :property-drawer ("PDF_SCRAPPER_TYPE"
                                          "PDF_SCRAPPER_SOURCE"
                                          "PDF_SCRAPPER_DATE")
                        ;; Put the new heading in front of other headings
                        :placement prepend)))               
        (bib
         ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
         (path "/path/to/references-dir/"                            
               :placement prepend
              ;; Include only the references that are not in the target file
              ;; *and* the file(s) specified in bibtex-completion-bibliography
               :filter-bib-entries bibtex-completion-bibliography))))

I will merge it into master after adding a proper Customize definition and updating the README.

I'd also like to add a little more flexibility in heading and file names by allowing wildcards à la orb-templates. I will also probably elaborate the filtering options slightly. But these things will most probably follow after this branch has been merged into master.

myshevchuk · 2021-01-10T22:35:21Z

Please also make sure to back up your bib files if they are important to you and if you are going to automatically export the extracted entries to master bib files!

fix #151

myshevchuk · 2021-01-10T23:20:09Z

Fix for #151 goes here.

j-steinbach · 2021-01-15T19:45:43Z

Sorry, but I think I am heavily confusing myself here. Do you want me to test this? And if yes, when do you want me to test this? In scrapper-save or when you put it into master? Your comment "Fix for #151 goes here" is throwing me off.

Also, as I have a separate Zotero database which updates my "Emacs .bib" file, it should be safe for me to corrupt that Emacs .bib file, as I can rebuild it from Zotero. Or is there any danger I don't know about?

myshevchuk · 2021-01-16T12:25:46Z

Sorry for the confusion. Although I would appreciate any feedback, this was not a testing request. I had an impression you could be interested in an early adoption of the new functionality into your workflow. The message was: it is usable now but it will take some time until the changes make it into the master, there may also be bugs.

Regarding fix #151, it was more of a memo for myself (as are many other comments here). Fix for #151 is available in this branch and will be available in master after this branch has been merged. There will be no separate fix for the current master.

> Also, as I have a separate Zotero database which updates my "Emacs .bib" file, it should be safe for me to corrupt that Emacs .bib file, as I can rebuild it from Zotero. Or is there any danger I don't know about?

No, it should be fine.

j-steinbach · 2021-01-16T17:53:43Z

Ok, good to know. You are fine, I just had a heavy case of tunnel-vision coupled with stress and busyness. Also I am not used to "cooperating" on github, so yeah...

I am definitely interested in this feature (in the early adopter sense), but I will wait. (As I am a bit scared of switching branches and getting everything to work again. I need my system to work flawlessly atm..) But as soon as I get it going, you have to brace yourself, as feedback is coming :)

j-steinbach · 2021-01-20T14:43:30Z

Ok, I got around to "installing" the "scrapper-save" branch, but I am having problems configuring it (again!)...

This is my "literate config" block. I appended the orb-pdf-scrapper-export-options. Now it says invalid read syntax: ". in wrong context" when I evaluate it (C-c C-c).

There are also a few lines commented out; those don't get recognized by K (describe-function?). As I went around "testing" new and unfinished features, I don't know if they are actually valid functions/variables anymore. I didn't have the time to get through the official documentation and re-configure everything yet again.

#+BEGIN_SRC emacs-lisp :tangle yes
(setq
 orb-pdf-scrapper-refsection-headings '((parent "References")
                                        (in-roam "In Org Roam database" list)
                                        (in-bib "In BibTeX file" list)
                                        (valid "Valid citation keys" list)
                                        (invalid "Invalid citation keys" list))
 orb-pdf-scrapper-table-export-fields '("key" "author" "date")
 orb-autokey-titlewords-ignore '("A" "An" "On" "The" "Eine?" "Der" "Die" "Das" "[^[:upper:]].*" ".*[^[:upper:][:lower:]0-9].*")
 orb-pdf-scrapper-group-references nil
 ;; orb-pdf-scrapper-citekey-format
 orb-pdf-scrapper-set-fields '(("author" orb-pdf-scrapper--invalidate-nil-value)
                               ("title" orb-pdf-scrapper--invalidate-nil-value)
                               ("date" orb-pdf-scrapper--invalidate-nil-value))
 orb-pdf-scrapper-list-style "[%s] "
 ;; orb-pdf-scrapper-reference-numbers "citation-number"
 ;; orb-pdf-scrapper-export-text-references "heading"
 ;; orb-pdf-scrapper-export-bibtex-references "heading"
 orb-pdf-scrapper-export-options
 '((org  ;; <= TYPE
    ;;  Export to a heading in the buffer of origin
    (heading "References (extracted by ORB PDF Scrapper)".
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (txt
    ;; Export to a file "references.org"
    (path "references.org"
          ;; under a heading "New references"
          :placement
          (heading "New references"
                   :property-drawer ("PDF_SCRAPPER_TYPE"
                                     "PDF_SCRAPPER_SOURCE"
                                     "PDF_SCRAPPER_DATE")
                   ;; Put the new heading in front of other headings
                   :placement prepend)))
   (bib
    ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
    (path "/path/to/references-dir/"
          :placement prepend
          ;; Include only the references that are not in the target file
          ;; *and* the file(s) specified in bibtex-completion-bibliography
          :filter-bib-entries bibtex-completion-bibliography)))
 )
#+END_SRC

My packages.el. I went through a few doom sync, doom sync -u and doom upgrade again..

(package! org-roam-bibtex
  :recipe
  (:host github
   :repo "org-roam/org-roam-bibtex"
   :branch "scrapper-save"
))

j-steinbach · 2021-01-20T14:52:44Z

Is it possible that there is a typo in (heading "References (extracted by ORB PDF Scrapper)".?

Also now everything gets recognized. I think the trick is either restarting Emacs multiple times or doom compile (which you might have mentioned before)..

j-steinbach · 2021-01-20T14:54:13Z

E: NVM, the functions don't get recognized as valid ORB-functions (" is a variable without a source file."). They just get recognized because I declared them..

myshevchuk · 2021-01-21T09:03:47Z

> Is it possible that there is a typo in (heading "References (extracted by ORB PDF Scrapper)".

Yes, that must have been my typo or some sort of autocompletion in my OS.

> E: NVM, the functions don't get recognized as valid ORB-functions (" is a variable without a source file."). They just get recognized because I declared them..

Until you actually run an ORB PDF Scrapper process. This module is loaded lazily, i.e. it is not loaded together with main ORB functionality but rather after the first call to orb-pdf-scrapper-run or orb-note-actions-scrap-pdf (via orb-note-actions). After the first run, you'll be able to see all the variables and their docstrings.
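To inspect the options without running a Scrapper process first, the module can also be loaded by hand. The snippet assumes that the module provides a feature named `orb-pdf-scrapper`; that name is an assumption and is not confirmed in this thread.

```emacs-lisp
;; Assumption: the lazily loaded module provides the feature
;; `orb-pdf-scrapper'.  Loading it makes its variables and docstrings
;; visible to the help commands.
(require 'orb-pdf-scrapper)

;; Show the docstring of the export option discussed in this thread.
(describe-variable 'orb-pdf-scrapper-export-options)
```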

j-steinbach · 2021-01-23T17:17:15Z

Ok, I have been having some "fun" with the whole process. So far I only want to insert all three result buffers into my document. This is my scrapper config (the rest is above):

orb-pdf-scrapper-export-options
 '((org  ;; <= TYPE
    ;;  Export to a heading in the buffer of origin
    (heading "Org-References (extracted by ORB PDF Scrapper)"
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (txt
    (heading "Text-References (extracted by ORB PDF Scrapper)"
             ;; ^             ^
             ;; TARGET     LOCATION
             ;; PROPERTIES
             ;;    v
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   (bib
    ;; Export to a file in an existing directory.  The file name will be CITEKEY.bib
    (heading "Bib-References (extracted by ORB PDF Scrapper)"
             ;; Include only the references that are not in the target file
             ;; *and* the file(s) specified in bibtex-completion-bibliography
             :filter-bib-entries bibtex-completion-bibliography
             :property-drawer ("PDF_SCRAPPER_TYPE"
                               "PDF_SCRAPPER_SOURCE"
                               "PDF_SCRAPPER_DATE")))
   ))

I have two windows open: on the left my note file, on the right something else (my config.org).

I start the process in the left window, in the note file.

When I finish the process "text" and "org" get inserted into my note file. "Bib" is missing.

The process also doesn't close. I get the message "wrong type argument: stringp, nil".

Now the left window shows the process buffer (where I can press C-c C-c as many times as I want to insert more "org" and "text" headings into my note file (without throwing any error)) and the right window shows me my note file.

(I can reproduce the above. If I only use a single window, a new window gets created https://imgur.com/a/WP4bvb2)

Have fun! 👿

I am not sure if :filter-bib-entries bibtex-completion-bibliography works in the heading process. I would like it to, i.e. show me which keys/items are already in my .bib file.

For my workflow/setup I don't like to directly insert stuff into the .bib file, as this circumvents Zotero. (which is my single source-of-truth) - but I didn't yet check if I can import .bib files into Zotero, so maybe I can "export" the scrapper buffer to a temporary .bib file and insert that file into Zotero.

I also don't understand what the property-drawer does. Do I need it?

myshevchuk · 2021-01-23T21:33:24Z

> The process also doesn't close. I get the message "wrong type argument: stringp, nil".

That's very strange. I could not reproduce the error by copy-pasting your configuration from the two above posts. All three headings are created and the process finishes successfully. Since in your case the process fails at inserting the bib heading, it must have something to do with it. Could you please run it once again with the debugger on, toggle-debug-on-error, and provide a backtrace? Also, what is the value of your bibtex-completion-bibliography?

> I am not sure if :filter-bib-entries bibtex-completion-bibliography works in the heading process. I would like it to; i.e. show me which keys/items are already in my .bib file

It does, but not in the way you expect. Only the entries that are not in the bibliography file(s) specified in :filter-bib-entries VARIABLE-OR-FILE-OR-LIST-OF-FILES are inserted. The difference between exporting to a heading and exporting to a bib file is that in the former case only :filter-bib-entries ... is taken into account, while in the latter case the target file is also checked, in addition to any :filter-bib-entries file or files. If the target bib file is among the files specified in :filter-bib-entries, then there is no difference.
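The filtering idea can be sketched on plain lists of citation keys. This is not ORB's implementation; `my/orb-partition-keys` is a hypothetical helper, and reading the known keys from the bibliography files is left out.

```emacs-lisp
(require 'seq)

;; A minimal sketch of the filtering idea, not ORB's implementation.
;; `my/orb-partition-keys' is a hypothetical helper.
(defun my/orb-partition-keys (extracted-keys known-keys)
  "Split EXTRACTED-KEYS into new and filtered keys.
KNOWN-KEYS are the keys already present in the bibliography files.
Return a cons cell (NEW . FILTERED)."
  (cons
   ;; Keys that are not yet in the bibliography: these get exported.
   (seq-difference extracted-keys known-keys)
   ;; Keys that already exist: these are the ones filtered out.
   (seq-intersection extracted-keys known-keys)))
```

Returning the filtered keys alongside the new ones keeps enough information around for a report of what was dropped.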

So currently the keys are silently filtered, but you don't know which ones (unless you configured Org heading export groups, in which case keys in the in-roam and in-bib groups are likely to be the filtered keys). It would be fairly easy to provide a rudimentary report option for the filtered keys. I can think of a simple echo area message, or perhaps when exporting to a heading, putting the filtered keys separately under that heading, e.g.:

Echo area message

ORB PDF Scrapper filtered the following keys on bib export: key1, key2, key3

BibTeX entries exported separately under the same heading
a) As full BibTeX entries:

* Bib-References (extracted by ORB PDF Scrapper)

#+name: filtered-entries
#+begin_src bibtex
@article{1961-CRV-607,
  citation-number = {1},
  author = {Lichtenthaler, R.W.},
  title = {N/A},
  date = {1961},
  volume = {61},
  pages = {607},
  journal = {Chem. Rev}
}
...
#+end_src

#+name: new-entries
#+begin_src bibtex
@article{2007-T-10549,
  citation-number = {2},
  author = {Pomeisl, K. and Kvíčala, J. and Paleta, O. and Klásek, A. and Kafka, S. and Kubelka, V. and Havlíček, J. and Čejka, J.},
  title = {N/A},
  date = {2007},
  volume = {63},
  pages = {10549},
  journal = {Tetrahedron}
}
...
#+end_src

b) As citation keys

* Bib-References (extracted by ORB PDF Scrapper)

#+name: filtered-entries
- 1961-CRV-607
- ...

#+name: new-entries
#+begin_src bibtex
@article{2007-T-10549,
  citation-number = {2},
  author = {Pomeisl, K. and Kvíčala, J. and Paleta, O. and Klásek, A. and Kafka, S. and Kubelka, V. and Havlíček, J. and Čejka, J.},
  title = {N/A},
  date = {2007},
  volume = {63},
  pages = {10549},
  journal = {Tetrahedron}
}
...
#+end_src

Or maybe you can come up with some other style/option?

> but I didn't yet check if I can import .bib files into Zotero

That's possible as far as I remember

> also don't understand what the property-drawer does. Do I need it?

It holds some meta information about the extracted data, such as when and from what source the data were extracted. It's currently not very useful and is a sort of placeholder for distant future features. I can vaguely envision manipulating the data under headings created by ORB PDF Scrapper, and a property drawer would greatly help to locate the target headline. But as I said, currently it's not particularly useful for you if you don't see how you can use it :) It's not required for export and can be safely omitted altogether in the orb-pdf-scrapper-export-options.

myshevchuk · 2021-01-23T21:47:49Z

By the way, are you still using the native-comp branch of Emacs?

j-steinbach · 2021-01-23T22:51:29Z

Yes, I think so. (Is there a command to check the version?)

j-steinbach · 2021-01-23T22:58:03Z

bibtex-completion-bibliography

bibtex-completion-bibliography is a variable defined in
bibtex-completion.el.

Value
"/home/jst/Gedankenwelt/Wissenschaft/zotero.bib"

Original Value
nil

debug-on-error:


Debugger entered--Lisp error: (wrong-type-argument stringp nil)
  bibtex-valid-entry()
  bibtex-skip-to-valid-entry()
  bibtex-map-entries(#f(compiled-function (key beg end) #<bytecode 0x13c14d8a396c07e8>))
  orb-pdf-scrapper--export-insert-temp-data(bib (:filter-bib-entries bibtex-completion-bibliography :property-drawer ("PDF_SCRAPPER_TYPE" "PDF_SCRAPPER_SOURCE" "PDF_SCRAPPER_DATE")))
  orb-pdf-scrapper--export-to-heading(bib "Bib-References (extracted by ORB PDF Scrapper)" (:filter-bib-entries bibtex-completion-bibliography :property-drawer ("PDF_SCRAPPER_TYPE" "PDF_SCRAPPER_SOURCE" "PDF_SCRAPPER_DATE")))
  orb-pdf-scrapper--export(bib)
  orb-pdf-scrapper--checkout()
  orb-pdf-scrapper-dispatcher()
  funcall-interactively(orb-pdf-scrapper-dispatcher)
  command-execute(orb-pdf-scrapper-dispatcher)

j-steinbach · 2021-01-23T23:00:51Z

(I will read and comment on the other bib-related stuff after/if we (actually you 👼) fix the error. I need to fiddle with everything a bit more to form an opinion)

the global value of `bibtex-dialect` may be uninitialized leading to an error

myshevchuk · 2021-01-24T09:53:10Z

The error should be fixed now. It was caused by an uninitialized global value of bibtex-dialect in your setup, which in itself is perfectly fine. The offending function orb-pdf-scrapper--export-insert-temp-data now sets the dialect locally while the file is being parsed.
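The kind of fix described here can be sketched with the standard bibtex.el functions. This is not the actual body of orb-pdf-scrapper--export-insert-temp-data; `my/bibtex-keys-in-string` is a hypothetical helper, and the choice of the biblatex dialect is an assumption made for the example.

```emacs-lisp
(require 'bibtex)

;; A minimal sketch, not the actual body of
;; `orb-pdf-scrapper--export-insert-temp-data'.
;; `my/bibtex-keys-in-string' is a hypothetical helper.
(defun my/bibtex-keys-in-string (bibtex-text)
  "Return the citation keys of the entries in BIBTEX-TEXT."
  (let (keys)
    (with-temp-buffer
      (insert bibtex-text)
      (bibtex-mode)
      ;; Set the dialect buffer-locally (second argument non-nil) so the
      ;; parser works even if the global `bibtex-dialect' was never
      ;; initialized.  The dialect chosen here is an assumption.
      (bibtex-set-dialect 'biblatex t)
      ;; Collect the key of every valid entry in the buffer.
      (bibtex-map-entries
       (lambda (key _beg _end)
         (push key keys))))
    (nreverse keys)))
```

Without the local dialect the regular expressions used by bibtex-map-entries may be nil, which matches the stringp error in the backtrace above.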

myshevchuk · 2021-01-24T11:04:18Z

> also don't understand what the property-drawer does. Do I need it?
>
> It holds some meta information about the extracted data, such as when and from what source the data were extracted. It's currently not very useful and is a sort of placeholder for distant future features. I can vaguely envision manipulating the data under headings created by ORB PDF Scrapper, and a property drawer would greatly help to locate the target headline.

A specific example where property drawers will be useful is a possible implementation of feature #142 you requested earlier. If you'd like to restart the process from the data that had been exported to a heading, ORB would need some metadata to locate that heading. A heading name is an unreliable identifier because it is likely to be changed by the user. A property drawer (which will be hard-coded by default should this feature be implemented) holding something like :PDF_SCRAPPER_TYPE: txt is an excellent anchor to bring ORB to the data.
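Locating a heading by its property rather than by its title can be sketched with Org's `org-find-property`. The wrapper `my/orb-find-exported-heading` is hypothetical and not part of ORB; it only illustrates why the drawer makes a more stable anchor than the heading text.

```emacs-lisp
(require 'org)

;; A sketch of the idea only; `my/orb-find-exported-heading' is
;; hypothetical and is not part of ORB.
(defun my/orb-find-exported-heading (type)
  "Return the position of the first heading whose PDF_SCRAPPER_TYPE is TYPE.
TYPE is a string such as \"txt\".  Return nil if there is no such
heading in the current Org buffer."
  ;; The lookup goes by property value, so it keeps working after the
  ;; user has renamed the heading.
  (org-find-property "PDF_SCRAPPER_TYPE" type))
```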

j-steinbach · 2021-01-24T16:06:40Z

It works. Awesome!

myshevchuk · 2021-01-24T17:12:18Z

> Yes, I think so. (Is there a command to check the version?)

I believe checking the value of system-configuration-features can give you a clue. Look for something like NATIVE_COMP there.
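A small check along these lines could look as follows. The helper name `my/native-comp-status` is hypothetical; `native-comp-available-p` is guarded with `fboundp` because it is only defined in builds that know about native compilation.

```emacs-lisp
;; `my/native-comp-status' is a hypothetical helper name.
(defun my/native-comp-status ()
  "Return a short string describing native compilation support."
  (cond
   ;; Guarded call: the function is missing in builds that predate
   ;; native compilation.
   ((and (fboundp 'native-comp-available-p)
         (native-comp-available-p))
    "native compilation available")
   ;; Fallback: look for the build flag in the configuration string.
   ((string-match-p "NATIVE_COMP" system-configuration-features)
    "built with NATIVE_COMP, but native compilation is not available")
   (t "no native compilation")))
```

Evaluating `(my/native-comp-status)` with M-: prints the result in the echo area.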

I don't use native-comp myself - tried it a couple of times but it was too raw for my daily use. I'll wait until it makes it into Emacs stable. My impression was that this feature may complicate upgrading Emacs packages. On the other hand, Doom supports it and people are using it, so maybe it's not that bad.

j-steinbach · 2021-02-03T14:58:36Z

Sorry that it took me so long to reply back, but I have a one-track mind and a deadline approaching.. 🚂

Now let me wrap my head around everything we talked about again. I do this by means of re-iteration.

This is what I have:

- a scientific paper I want to extract references from
- a database of references, likely managed with an external tool (Zotero, papis, ...), in the form of one or more .bib files

This is what I want:

- insert all the extracted references into the note associated with the paper
- insert all the extracted references into my .bib file(s)

This is how I currently work:

I create a new note and begin the ORB PDF Scrapper process:

Text mode

I manually edit the extracted references in the "Text buffer". Those references get automatically turned into BibLaTeX entries.

BibLaTeX mode

I have a bunch of automatically created .bib references from the previous step. I have to manually clean and edit them.
- This sometimes means I have to go back to the text mode buffer and edit stuff there, too. 🔁
I automatically generate a reference key for each entry.
- Some of the keys get generated "wrong". There is information missing/wrong in a previous step. 🔁
- Some of the keys don't fit the scheme I use in my reference manager (Zotero).
  a. I manually fix them. 🔁
  b. I ignore them 👍 (and fix them in a later step)
- I paste the generated keys into my global Zotero .bib file
  - I check for duplicate keys
  - I check for "bad" keys
  - I check for missing keys (by cross-referencing the associated PDF)
  - I manually fix all those in the BibLaTeX buffer 🔁

Org mode

I have a list of generated cite: keys that can be used with org-bibtex. If a key is missing in my global .bib file(s) I can see that, as "good" keys are underlined.
- I go through each of the "bad" keys and fix them manually in Zotero. More work. 🔁
I finish the process.
- The list of references gets inserted into my note
- The references are inside my .bib file

🎯 What I want: Keep the number of 🔁 as small as possible, as they are a major waste of time, not very fun and have a high possibility of creating mistakes (manually updating keys is never a good idea).

With this feature/branch, the workflow changes as follows:

BibLaTeX mode

I have a bunch of automatically created .bib references from the previous step. I have to manually clean and edit them.
- This sometimes means I have to go back to the text mode buffer and edit stuff there, too. 🔁
I automatically generate a reference key for each entry.
- Some of the keys get generated "wrong". There is information missing/wrong in a previous step. 🔁
- Some of the keys don't fit the scheme I use in my reference manager (Zotero).
  ~~a. I manually fix them.~~
  b. I ignore them
- ~~I paste the generated keys into my global Zotero .bib file~~

Org mode

I have a list of generated cite: keys that can be used with org-bibtex. If a key is missing in my global .bib file I can see that, as "good" keys are underlined.
- ~~I go through each of the "bad" keys and fix them manually in Zotero. More work.~~
I finish the process.
- The list of cite:keys gets inserted into my note
- 🆕 The "original" text gets inserted into my note. I can use it to restart the Scrapper process later on, without losing all my work from the text mode.
- 🆕 The BibLaTeX mode entries get inserted into my note. It contains only entries that are not yet found in my global .bib file. These references are NOT yet inside my global .bib file.
  - I copy & paste them into Zotero. This sometimes generates new/different keys. As I use Zotero and the global .bib file as my single source of truth, I have to manually fix the keys in my note file again 🔁
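
The "only entries that are not yet found in my global .bib file" step above could be pictured roughly as follows. This is a minimal sketch and not ORB's actual implementation: the names `my/bib-keys-in-string` and `my/new-entries` are hypothetical, entries are assumed to be simple (KEY . ENTRY-TEXT) pairs, and keys are compared case-sensitively.

```elisp
(require 'seq)

(defun my/bib-keys-in-string (bib-text)
  "Return the list of entry keys found in BIB-TEXT (hypothetical helper).
Only a plain \"@type{key,\" header at the start of a line is recognized."
  (let ((keys '())
        (start 0))
    (while (string-match
            "^[ \t]*@[[:alpha:]]+[ \t]*{[ \t]*\\([^,[:space:]]+\\)[ \t]*,"
            bib-text start)
      (push (match-string 1 bib-text) keys)
      (setq start (match-end 0)))
    (nreverse keys)))

(defun my/new-entries (entries bib-text)
  "Return those ENTRIES whose keys do not occur in BIB-TEXT.
ENTRIES is an alist of (KEY . ENTRY-TEXT) pairs."
  (let ((known (my/bib-keys-in-string bib-text)))
    ;; Keep an entry only when its key is not already known.
    (seq-remove (lambda (entry) (member (car entry) known))
                entries)))
```

Keys that Zotero later renames would still slip through such a filter, which is the remaining 🔁 described above.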

As you can see, this is an improvement on two fronts (🍰):

- I can "save" and restart my progress.
- I don't insert duplicate references into Zotero. This means there are fewer keys I have to cross-reference.

I can still see at least two issues remaining:

- "Blind" generation of keys in the original BibLaTeX mode. I am blindly generating keys without being able to cross-reference my global .bib file. This is what I was talking about in PDF Scrapper: Look-up existing .bib files during "BibTeX mode" #144.
- Having different naming schemes between Zotero and ORB. See Orb-autokey title source #147.

I think it boils down to the following: I don't want to deal with my keys after leaving the BibLaTeX mode ever again. At the moment I still have to deal with them after the PDF Scrapper process is finished.

Overall, I believe that PDF Scrapper exists to solve two issues:

- Extracting references from a paper. This works very well, almost flawlessly, with some minor hiccups.
- Merging those references into an existing bibliographic database and creating usable cite keys. At the moment this still involves a high amount of manual labor and should be improved further.

This was longer than I intended, and might be missing some features, but I think this captures my motivation behind using ORB and my workflow pretty well. 😅 I hope this gives a glimpse into what I take this feature for and provides a basis for future discussions on my end.

Or maybe you can come up with some other style/option?

Nothing has come up so far, will tell you if it does :)

myshevchuk · 2021-02-18T10:00:36Z

Hi, sorry for the long silence. Thanks for your great feedback and for taking the time to write it! My primary computer is still in for service. I will write a detailed response to your post above as soon as I get it back.

myshevchuk · 2021-02-23T22:55:25Z

Hi, I've finally got my laptop back after almost three weeks in service and will now move towards merging this pull request into master, hopefully in a couple of days. I feel the export functionality, although quite basic, has allowed you to improve your workflow, and from the technical point of view it is more or less polished enough for merging. I think only the README section is missing.

As a general remark about repetitions, it's impossible to avoid them all, that is to fully automate the process. The main reason, and from a programmer's point of view the unknown variable, is the initial set of text references generated by Anystyle and presented by ORB in the text buffer.

I sketched the first prototype of what would later become ORB PDF Scrapper in an hour or so. It consisted of several dozen lines and was actually a fully automated pipeline through Anystyle's text and BibTeX outputs, feeding the latter to a simple Elisp function to produce org references directly. But it became immediately obvious that such a thing was unusable because there was no way to correct errors, generate citation keys in the desired format and so on. It's not Anystyle's fault either. That program is doing an amazing job considering how different the initial input can be: good and bad PDF files, different page layouts, and hundreds of different citation styles.

=> There is little that can be done on ORB's side to improve the extraction of text references from a PDF. It's possible to train a custom Finder model (anystyle command line only, ORB does not implement an interface yet). The default one is reasonably good though. Also, Anystyle is not very well documented in that respect and I'm not fluent enough in Ruby to figure it out myself in a reasonable amount of time. Automatically checking text references in Elisp is also not an option apart from the offered basic sanitize text command, because it would mean re-implementing parts of Anystyle or even the external programs such as pdftotext that Anystyle relies on. So, I'm afraid we are stuck with editing poorly extracted text references by hand in the foreseeable future. Also, since we are humans and always make mistakes, we are stuck with having to correct them now and then.

So, the next thing I did was implementing the current stepwise modal system, where the user (basically me I guess at that time) could edit the intermediate text and BibTeX input to their liking and go back to the previous mode in case some things were missing. It turned out for the better, because the BibTeX data are valuable and it's good to be able to have a look at them and possibly store them in the process. Unfortunately, switching between buffers was designed one-way, i.e. you lose the "forward" progress when you go back to the previous mode. I could not clearly see how I could quickly implement a mapping between text and BibTeX, so that if you go back to text mode and edit one reference, you would not need to re-generate all the BibTeX entries, just one entry for that reference.

=> Such a feature would not necessarily decrease the amount of manual editing, but it would definitely make the overall experience more pleasant, or rather psychologically less boring (oh no, I have to re-generate all the data once again). I'm interested in implementing the text <-> BibTeX (<-> org) mapping from the programmer's point of view, but due to the lack of time it's currently not at the top of the priority list. Hopefully, the export functionality brought in by this pull request partially addresses the issue.

The basic idea and my goal at the time I started with the PDF Scrapper was to extract references from a PDF file associated with an org-roam note and put them in that note as org-ref citation keys, so that org-roam automatically connects the current note with any existing notes tagged with the respective keys (#+ROAM_KEY). Once I had that basic functionality, I spent some time writing the autokey feature because the one shipped with Emacs' BibTeX-mode simply did not suit my needs. I also implemented the interface to the anystyle train command, which drastically improves text -> BibTeX conversion by creating a custom Parser model. The default Parser model shipped with Anystyle was performing poorly with citation styles I usually find in chemistry journals.

=> If you haven't used it yet and are having trouble with author/journal/etc. fields not being recognized correctly, I urge you to try training a custom Parser model. I'm also keen on improving the existing rather basic autokey functionality, e.g. #147.

Now to your comments.

As you can see, this is an improvement on two fronts (🍰):

Great! I'm glad it's working for you and thanks once again for your invaluable input.

I can still see at least two issues remaining:
• "Blind" generation of keys in the original BibLaTeX mode. I am blindly generating keys without being able to cross-reference my global .bib file. This is what I was talking about in #144.
• Having different naming-schemes between Zotero and ORB. See #147.

I think it boils down to the following: I don't want to deal with my keys after leaving the BibLaTeX mode ever again. At the moment I still have to deal with them after the PDF Scrapper process is finished.

I will have to finish the current pull request. Then I'm going to split the README file into a short README and a general manual. Documenting is part of programming, and there have been many changes to ORB over the past months. I'd also like to update the CHANGELOG and prepare a new release. It should not take too long though. Also tell me which feature is more urgent for you, #144 or #147. I'll start working on that one first, maybe simultaneously with the manual, but not simultaneously on both features.

Overall, I believe that PDF Scrapper exists to solve two issues:

• Extracting references from a paper. This works very well, almost flawlessly, with some minor hiccups.
• Merging those references into an existing bibliographic database and creating usable cite keys. At the moment this still involves a high amount of manual labor and should be improved further.

Well, as I wrote above the initial idea was simpler, but I don't mind putting it that way :)

j-steinbach · 2021-02-24T13:05:43Z

Thank you for the write-up - it is very interesting to see how a project evolves :) And congrats on getting your laptop back!

So, I'm afraid we are stuck with editing poorly extracted text references by hand in the foreseeable future. Also, since we are humans and always make mistakes, we are stuck with having to correct them now and then.

I think that it is OK to have manual parts. But they should be non-repeating, i.e. when I am done with them, then I never have to do them again. And based on your other paragraphs, I think you agree :)

I could not clearly see how I could quickly implement a mapping between text and BibTeX, so that if you go back to text mode and edit one reference, you would not need to re-generate all the BibTeX entries, just one entry for that reference.

There are probably thousands of things wrong with my assumption, but why not just treat each line in the "text buffer" as a table row?

text | bibtex | citekey

If the text cell changes, then update the bibtex and citekey cells.

The basic idea and my goal at the time I started with the PDF Scrapper was to extract references from a PDF file associated with an org-roam note and put them in that note as org-ref citation keys, so that org-roam automatically connects the current note with any existing notes tagged with the respective keys (#+ROAM_KEY).

Smart! And similar to my use-case, except that I go one step further and also try to insert the keys into my annotations, so that only the "usable" references remain. Emphasis on "try", as that is a lot of work.

If you haven't used it yet and are having trouble with author/journal/etc. fields not being recognized correctly, I urge you to try training a custom Parser model.

Yay! Another rabbit-hole to burrow down into! :)

I believe that #147 is more impactful (and annoying), so I would go with that. Also I believe that #144 kinda requires #147, as there is not much to look up if the keys differ.

Good idea with the documentation. I believe that I also have some "organically grown" ORB parameters in my config - I believe that updating the docs gives everyone a "fresh start", i.e. I can clean my config and you can focus on new features :)

I got another quick idea: Is it possible to show the "Scrapper modals" side-by-side? (vertical windows)

Text-mode: PDF | Text

BibTeX-mode: Text | BibTeX

Org-modal: BibTeX | Org

Basically we see the buffer we just came from. This makes it possible to look up and fix mistakes in previous buffers.
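
As a rough illustration of that layout, here is a minimal sketch using only standard Emacs window commands. The function `my/show-side-by-side` is hypothetical and not part of ORB; how Scrapper would pick the two buffers is left open.

```elisp
(defun my/show-side-by-side (previous-buffer current-buffer)
  "Show PREVIOUS-BUFFER on the left and CURRENT-BUFFER on the right.
Hypothetical helper, not an ORB command.  Both arguments may be
buffers or buffer names."
  (delete-other-windows)
  (switch-to-buffer previous-buffer)
  ;; `split-window-right' returns the new right-hand window.
  (select-window (split-window-right))
  (switch-to-buffer current-buffer))
```

For BibTeX mode this would be called with the text buffer as PREVIOUS-BUFFER and the BibTeX buffer as CURRENT-BUFFER.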

(BRB requesting)

myshevchuk · 2021-02-25T21:48:19Z

There are probably thousands of things wrong with my assumption, but why not just treat each line in the "text buffer" as a table row?

text | bibtex | citekey

If the text cell changes, then update the bibtex and citekey cells.

Basically yes. I anticipate more problems on the BibTeX end. A BibTeX entry is not a single line, but a multiline entry with several fields. What if the user has edited the entry and added, removed or changed fields? How should such changes be merged with the changes coming from the updated text entry? Simply overwriting the entry should be easy. Also, BibTeX entries can span a varying number of lines, therefore some sort of position tracking must be used. Ok, that must also be easy. There will definitely be other issues, such as what to do when a BibTeX entry was deleted, so some assumptions about the user behavior will have to be made.

also try to insert the keys into my annotations, so that only the "usable" references remain.

Could you please elaborate? What sort of annotations, maybe ORB could assist in automation?

I believe that I also have some "organically grown" ORB parameters in my config - I believe that updating the docs gives everyone a "fresh start", i.e. I can clean my config and you can focus on new features :)

Right, I have many times encountered a situation where a new user takes config examples from some half-a-year-old blog post, where the information is completely outdated, and the package therefore only throws errors. Here I blame the lack of examples in the README, which prompts people to look for them somewhere else. But even users who faithfully followed the official README several months ago can still experience problems because the package has changed since then. Good documentation would also mean for me that ORB is getting ready to advance to version 1.0.

j-steinbach · 2021-02-26T01:05:45Z

How should such changes be merged with the changes coming from the updated text entry?

Not sure if I understand this correctly, but I would say that the user always has the "last word", i.e. can overwrite everything.

And I also would say that there is just one direction.

text -> bibtex -> org

If a user changes a bibtex cell, then the changes bubble up to the org cell. Text remains untouched.
If a user changes a text cell, then the changes bubble up to the bibtex cell. It changes, which bubbles up to the org cell.
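
The one-way rule could be sketched like this. It is only an illustration under stated assumptions: the struct `my/ref-row` and both setter functions are hypothetical, and the two converters are passed in as arguments because the real text -> BibTeX step runs through Anystyle.

```elisp
(require 'cl-lib)

;; One row per reference; all names here are hypothetical.
(cl-defstruct my/ref-row text bibtex org)

(defun my/ref-row-set-bibtex (row new-bibtex bibtex->org)
  "Store NEW-BIBTEX in ROW and regenerate the org cell with BIBTEX->ORG.
The text cell is left untouched."
  (setf (my/ref-row-bibtex row) new-bibtex
        (my/ref-row-org row) (funcall bibtex->org new-bibtex))
  row)

(defun my/ref-row-set-text (row new-text text->bibtex bibtex->org)
  "Store NEW-TEXT in ROW, then regenerate the bibtex and org cells.
Any manual edits to the old bibtex cell are overwritten."
  (setf (my/ref-row-text row) new-text)
  (my/ref-row-set-bibtex row (funcall text->bibtex new-text) bibtex->org))
```

Only the edited row is regenerated, so the other references keep their BibTeX edits. Overwriting the bibtex cell on a text change is the simplest policy. It does not try to merge manual field edits.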

also try to insert the keys into my annotations, so that only the "usable" references remain.

Could you please elaborate? What sort of annotations, maybe ORB could assist in automation?

It's actually pretty simple. Let's say that I annotated a paper and extracted its contents with org-noter. In my notes I now have some extracted annotations.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer [1] took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset (1960) sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

In this case there are two citations hidden in there (note: usually they follow the same citation style):

unknown printer [1]
Letraset (1960)

Then I use the ORB-Scrapper to extract the references from the paper. This gives me all references from the paper:

References

cite:printer1558jam
cite:letraset1960lorem
cite:random9000dude
cite:dont7777care
cite:haxor1337lulz

Now I go and insert those references into my extracted annotations:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer (cite:printer1558jam) took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of cite:letraset1960lorem sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Those are the citations I actually care about - they get linked directly to their source in my Zettelkasten, very similar to how you are doing it.

Finally I delete the references section. If a reference is not cited, then it clearly is not interesting for me.

There are a few problems I am encountering:

- sanitizing the extracted text (newlines, whitespaces)
- normalizing the unicode characters (sióćà -> sioca). I have to do this because my cite-keys are "ASCII clean", and I use them to start my regex calls.
- citation-style-specific issues. Example: [1], [2], [3] vs. [1-3].
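
For the unicode item, one possible approach is to decompose the string and drop the combining marks. This is a minimal sketch: `my/ascii-fold` is a hypothetical name, while `ucs-normalize-NFD-string` comes from the `ucs-normalize` library bundled with Emacs.

```elisp
(require 'ucs-normalize)

(defun my/ascii-fold (string)
  "Return STRING with combining diacritics removed (hypothetical helper).
Characters without a canonical decomposition, such as \"ø\" or \"ß\",
are left unchanged."
  (replace-regexp-in-string
   "[\u0300-\u036F]" ""   ; strip the combining diacritical marks
   (ucs-normalize-NFD-string string)))
```

With this, `(my/ascii-fold "sióćà")` returns "sioca".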

(I have a half-finished blog-post about my approach, but I didn't get to finish it, as I am supposed to write my thesis... oh well.)

I actually just yesterday started learning some elisp to maybe automate this - I already have some macros that kinda work, but they are horrifying to manage, so yeah, here I am, again not working on my thesis... 🥇

The following takes cite:letraset1960lorem and tries to replace all letraset (1960), Letraset et al (1960), Letraset's (1960), ... It works, but I am stumbling across weird edge cases everywhere. At the moment I just assume that no author wrote two papers in the same year.

At least writing it as a function lets me fix the problems easily. My macros constantly "ring the bell". 💢

(require 'cl-lib) ; provides `cl-incf'

(defun citekey-replace-author-and-year ()
  "Replace \"Author (YEAR)\" mentions with the cite key on the current line.
The line at point must contain a key of the form cite:authorYEARtitle.
The buffer is searched backward from that line."
  (interactive) ; allow this to be user-callable
  (let ((regexp (rx ":"
                    (group (zero-or-more
                            (any letter "-")))
                    (group (one-or-more digit))
                    (one-or-more letter)))
        (citekey (buffer-substring-no-properties
                  (line-beginning-position) (line-end-position)))
        (number-matches 0)
        author year searchstring)
    (unless (string-match regexp citekey)
      (user-error "No cite:authorYEARtitle key on this line"))
    (setq author (match-string 1 citekey)
          year (match-string 2 citekey))
    ;; `rx' is expanded at macro-expansion time and cannot reliably see
    ;; local variables, so the search regexp is built with `rx-to-string'.
    (setq searchstring
          (rx-to-string `(seq ,author
                              (any space punctuation)
                              (*? not-newline)
                              (zero-or-one "(")
                              ,year
                              (zero-or-one ")"))
                        t))
    (message "Searching for: %s" searchstring)
    (save-excursion
      (beginning-of-line) ; start the search before the key line
      ;; No bound; return nil instead of signaling when nothing matches.
      (while (re-search-backward searchstring nil t)
        ;; FIXEDCASE and LITERAL: insert the key exactly as written.
        (replace-match citekey t t)
        (cl-incf number-matches)))
    (message "Found and replaced %d matches" number-matches)))

For the "number" style, it is way simpler. Except if it is written as [1-3]. Good luck solving that with macros.
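
One way to take the [1-3] case out of the picture is to expand ranges before doing any replacement. This is a minimal sketch with a hypothetical function name. It only handles a single range per bracket and accepts either a hyphen or an en dash.

```elisp
(defun my/expand-citation-ranges (string)
  "Rewrite numeric ranges like \"[1-3]\" in STRING as \"[1], [2], [3]\".
Hypothetical helper; descending ranges such as \"[3-1]\" are left alone."
  (replace-regexp-in-string
   "\\[\\([0-9]+\\)[-–]\\([0-9]+\\)\\]"
   (lambda (match)
     ;; The match data refer to MATCH, so the groups are read from it.
     (let ((from (string-to-number (match-string 1 match)))
           (to (string-to-number (match-string 2 match))))
       (if (> from to)
           match
         (mapconcat (lambda (n) (format "[%d]" n))
                    (number-sequence from to)
                    ", "))))
   string t t))
```

After this step every numeric citation has the plain [N] form, so the simple "number" style replacement applies to all of them.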

Ideally I could detect all references in my annotation body and then cross-reference them with my bibtex. Something like highlighting with occur, then group the same authors, then select a key to replace them with?

This also fixes the problem of references with pre-print release-years. I stumbled across a few references that say "Author 200X", but my key is "Author 200Y". In this case my regex would find nothing.

Yeah, I went a bit off-trail, but I hope the base idea is clear.

myshevchuk · 2021-02-26T07:35:07Z

It's actually pretty simple. Let's say that I annotated a paper and extracted its contents with org-noter. In my notes I now have some extracted annotations.

Very cool, but indeed it requires a lot of work. After you have finished your thesis (and I mine) we can think about how PDF Scrapper could provide rudimentary support for this, provided you'll still need PDF Scrapper. I envision that PDF Scrapper could become a bridge between PDF-tools, Anystyle and reference management software, a facility to do all references-related stuff.

j-steinbach · 2021-02-26T14:35:38Z

Sure thing! PDF Scrapper is definitely positioned that way. And I agree to hold our horses and focus on the matters at hand.

myshevchuk · 2021-03-02T08:05:42Z

This branch is now in the master!

myshevchuk force-pushed the scrapper-save branch from 1b83f1f to 7f5ad7b Compare December 12, 2020 11:05

myshevchuk force-pushed the scrapper-save branch from 7f5ad7b to 7525add Compare December 20, 2020 16:57

myshevchuk added 2 commits December 21, 2020 22:16

(feat) orb-pdf-scrapper-export-options

b179e9b

(int) uncomment a region

4f7061f

orb-pdf-scrapper export working beta

b18bdf9

(fix) consistent formatting between headings

e099ef0

fix #151

(fix) set bibtex dialect before mapping entries

bc5d069

the global value of `bibtex-dialect` may be uninitialized leading to an error

j-steinbach mentioned this pull request Feb 24, 2021

PDF-Scrapper: Show Scrapper modals side-by-side #171

Open

(doc) update README ORB PDF Scrapper section

84fb4bd

myshevchuk force-pushed the scrapper-save branch from 88bc2a8 to 84fb4bd Compare March 2, 2021 08:04

myshevchuk merged commit 068d9c2 into master Mar 2, 2021

myshevchuk deleted the scrapper-save branch March 12, 2021 23:39

myshevchuk mentioned this pull request Mar 15, 2021

PDF Scrapper: Save/Export the "Text mode" and "BibTeX mode" buffers #140

Closed
