(feat,fix) save/export scrapper buffers #146
This PR introduces commands @j-steinbach this PR has been merged into |
It "works". I ran the PDF Scrapper, and saved the "Text mode" and "BibTeX mode" buffers to a different file. What I am missing is the ability to insert the content into the buffer where I started the PDF Scrapper process. |
Yeah, I remember. Just haven't had enough time. |
So for now I plan to introduce two user options The variable |
1b83f1f to 7f5ad7b
- remap `save-buffer` to new function `orb-pdf-scrapper-save`
- remap `write-file` to new function `orb-pdf-scrapper-save-as`
  These functions handle "fileless" Scrapper buffers correctly
- fix a bug where the buffer would get master xml file name after cancelling
- switch to Scrapper buffer when pressing 'n' in prevent-concurrent dialog
7f5ad7b to 7525add
@j-steinbach So, with the latest commit a feature was added to automatically export the scrap(p)ed data upon hitting The user option This variable is an association list of the form (TYPE . ((LOCATION TARGET PROPERTIES))
Example:
(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "temp.org"
               :placement append))
        (txt
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "txt")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (file "directory/"))
        (bib
         (file "my-library.bib"
               :placement prepend))))
The first element of the list controls export of the Org data (TYPE = The headline will be supplied with a property drawer with the following properties, according to the
Currently, this property stuff is a prototype. It's not very useful yet, but may become more useful in the future. Now the In the above example, a file The second element of the list, the The third element of the list, the This is a breaking change. Because all elements of the list are optional, no data will be exported at all if the variable
(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"))))
Although it seems to work for me, this feature is only a preview, and any suggestions are welcome. I will merge it after cleaning up and documenting, and probably some bug hunting. |
Probably LOCATION should be called TARGET and vice versa. |
It's a lot to wrap your head around, but overall it looks fine. Two things:
- What if I want to save in another file under a specific heading? At the moment it only prepends/appends the data. Wouldn't it make more sense to specify the file (TARGET ❓) as either
- It likely makes sense to give the user an ability to do something after the extraction. I am too newb for it, but I think it is called hooks? For example, I would like to look up the extracted cite-keys and, if they don't already exist, create a new file for each of them. At the moment it doesn't make that much sense (as I still need to manually fix the reference keys to match my Zotero database and bibtex file), but if that feature also gets rolling, the path to automation 🚂 and astra ⭐ is open! |
Oh, if everything is optional, wouldn't it make sense to be able to change/define the headline "References (extracted by ORB PDF Scrapper)"? |
Fully agree. I would keep the existing structure of the list, though. The identifier
(setq orb-pdf-scrapper-export-options
      '((org
         (headline "References (extracted by ORB PDF Scrapper)"
                   :property-drawer (("PDF_SCRAPPER_TYPE" . "org")
                                     ("PDF_SCRAPPER_SOURCE")
                                     ("PDF_SCRAPPER_DATE")))
         (path "temp.org"
               :placement (headline "References (extracted by ORB PDF Scrapper)")))))
So the top-level
It's possible to provide some hooks. There are several points in the overall process, where these hooks can be called.
Each point can have two associated hooks -
Sure, this string is not hardcoded, it's exposed as a part of a user option - just change "References (extracted by ORB PDF Scrapper)" to whatever you like. |
I've heavily refactored and hopefully optimized the export code and added a target heading export option as requested. Please check the latest commit in this branch. Consult the docstring of
(setq orb-pdf-scrapper-export-options
      '((org ;; <= TYPE
         ;; Export to a heading in the buffer of origin
         (heading "References (extracted by ORB PDF Scrapper)"
         ;; ^      ^
         ;; TARGET LOCATION
         ;;         PROPERTIES
         ;;         v
                  :property-drawer ("PDF_SCRAPPER_TYPE"
                                    "PDF_SCRAPPER_SOURCE"
                                    "PDF_SCRAPPER_DATE")))
        (txt
         ;; Export to a file "references.org"
         (path "references.org"
               ;; under a heading "New references"
               :placement
               (heading "New references"
                        :property-drawer ("PDF_SCRAPPER_TYPE"
                                          "PDF_SCRAPPER_SOURCE"
                                          "PDF_SCRAPPER_DATE")
                        ;; Put the new heading in front of other headings
                        :placement prepend)))
        (bib
         ;; Export to a file in an existing directory. The file name will be CITEKEY.bib
         (path "/path/to/references-dir/"
               :placement prepend
               ;; Include only the references that are not in the target file
               ;; *and* the file(s) specified in bibtex-completion-bibliography
               :filter-bib-entries bibtex-completion-bibliography))))
I will merge it into master after adding a proper Customize definition and updating the README. I'd also like to add a little bit more flexibility in heading and file names by allowing for wildcards à la |
Please also make sure to backup your |
Fix for #151 goes here. |
Sorry, but I think I am heavily confusing myself here. Do you want me to test this? And if yes, when do you want me to test this? In scrapper-save or when you put it into master? Your comment "Fix for #151 goes here" is throwing me off. Also, as I have a separate Zotero database which updates my "Emacs .bib" file, it should be safe for me to corrupt that Emacs .bib file, as I can rebuild it from Zotero. Or is there any danger I don't know about? |
Sorry for the confusion. Although I would appreciate any feedback, this was not a testing request. I had an impression you could be interested in an early adoption of the new functionality into your workflow. The message was: it is usable now but it will take some time until the changes make it into the master, there may also be bugs. Regarding fix #151, it was more of a memo for myself (as are many other comments here). Fix for #151 is available in this branch and will be available in master after this branch has been merged. There will be no separate fix for the current master.
No, it should be fine. |
Ok, good to know. You are fine, I just had a heavy case of tunnel-vision coupled with stress and busyness. Also I am not used to "cooperating" on github, so yeah... I am definitely interested in this feature (in the early adopter sense), but I will wait. (As I am a bit scared of switching branches and getting everything to work again. I need my system to work flawlessly atm..) But as soon as I get it going, you have to brace yourself, as feedback is coming :) |
Ok, I got around to "installing" the "scrapper-save" branch, but I am having problems configuring it (again!)... This is my "literate config" block. I appended the orb-pdf-scrapper-options. Now it says There are also a few lines commented out; those don't get recognized by
My packages.el. I went through a few doom sync, doom sync -u and doom upgrade again..
|
Is it possible that there is a typo in Also now everything gets recognized. I think the trick is either restarting Emacs multiple times or doom compile (which you might have mentioned before).. |
E: NVM, the functions don't get recognized as valid ORB-functions (" is a variable without a source file."). They just get recognized because I declared them.. |
Yes, that must have been my typo or some sort of autocompletion in my OS.
Until you actually run an ORB PDF Scrapper process. This module is loaded lazily, i.e. it is not loaded together with main ORB functionality but rather after the first call to |
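Because of this lazy loading, Scrapper-specific settings can be deferred until the module is actually loaded. A minimal sketch, assuming the module provides a feature named `orb-pdf-scrapper` (the feature name is an assumption, not confirmed in this thread) and reusing the export option discussed above:

```elisp
;; Assumption: the Scrapper module provides the feature `orb-pdf-scrapper'.
(with-eval-after-load 'orb-pdf-scrapper
  ;; This body runs once, right after the module has been loaded.
  (setq orb-pdf-scrapper-export-options
        '((org
           (heading "References (extracted by ORB PDF Scrapper)")))))
```

Until the module is loaded, help commands may report its variables as having no source file, which matches the behaviour described above.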
Ok, I have been having some "fun" with the whole process. So far I only want to insert all three result buffer into my document. This is my scrapper config (the rest is above):
I have two windows open: on the left my note file, on the right something else (my config.org). I start the process in the left window, in the note file. When I finish the process "text" and "org" get inserted into my note file. "Bib" is missing. The process also doesn't close. I get the message "wrong type argument: stringp, nil". Now the left window shows the process buffer (where I can press C-c C-c as many times as I want to insert more "org" and "text" headings into my note file (without throwing any error)) and the right window shows me my note file. (I can reproduce the above. If I only use a single window, a new window gets created https://imgur.com/a/WP4bvb2) Have fun! 👿 I am not sure if For my workflow/setup I don't like to directly insert stuff into the .bib file, as this circumvents Zotero (which is my single source of truth), but I haven't yet checked whether I can import .bib files into Zotero, so maybe I can "export" the scrapper buffer to a temporary .bib file and import that file into Zotero. I also don't understand what the property-drawer does. Do I need it? |
That's very strange. I could not reproduce the error by copy-pasting your configuration from the two above posts. All three headings are created and the process finishes successfully. Since in your case the process fails at inserting the bib heading, it must have something to do with it. Could you please run it once again with the debugger on,
It does, but not in the way you expect. Inserted are only the entries that are not in your bibliography file(s) specified in So currently the keys are silently filtered but you don't know which (unless you configured Org heading export groups, in which case keys in the
b) As citation keys
Or maybe you can come up with some other style/option?
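One way to make the filtering visible is to keep the skipped entries instead of dropping them silently. The following is only a sketch under stated assumptions: `my/filter-new-entries` is a hypothetical helper, not part of ORB, and entries are modelled as plain (KEY . ENTRY-TEXT) pairs rather than ORB's internal representation.

```elisp
(defun my/filter-new-entries (entries known-keys)
  "Split ENTRIES into new and already-known ones.
ENTRIES is a list of (KEY . ENTRY-TEXT) pairs.  KNOWN-KEYS is a list
of citation keys already present in the user's bibliography.
Return a cons cell (NEW . SKIPPED), both in their original order."
  (let (new skipped)
    (dolist (entry entries)
      (if (member (car entry) known-keys)
          (push entry skipped)
        (push entry new)))
    (cons (nreverse new) (nreverse skipped))))
```

The SKIPPED half could then be reported to the user in whatever style is preferred, for example as a list of citation keys.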
That's possible as far as I remember
It holds some meta information about the extracted data like when and from what source were the data extracted. It's currently not very useful and is a sort of a placeholder for distant future features. I can vaguely envision manipulating the data under headings created by ORB PDF Scrapper, and a property drawer would greatly help to locate the target headline. But as I said, currently it's not particularly useful for you if you don't see how you can use it :) It's not required for export and can be safely omitted altogether in the |
By the way, are you still using the |
Yes, I think so. (Is there a command to check the version?) |
bibtex-completion-bibliography
debug-on-error:
|
(I will read and comment on the other bib-related stuff after/if we (actually you 👼) fix the error. I need to fiddle with everything a bit more to form an opinion) |
the global value of `bibtex-dialect` may be uninitialized leading to an error
The error should be fixed now. It was caused by uninitialized global value of |
A specific example where property drawers will be useful is a possible implementation of feature #142 you requested earlier. If you'd like to restart the process from the data that had been exported to a heading, ORB would need some metadata to locate that heading. A heading name is an unreliable identifier because it is likely to be changed by the user. A property drawer (which will be hard-coded by default should this feature be implemented) holding something like |
It works. Awesome! |
I believe checking the value of I don't use native-comp myself - tried it a couple of times but it was too raw for my daily use. I'll wait until it makes it into Emacs stable. My impression was that this feature may complicate upgrading Emacs packages. On the other hand, Doom supports it and people are using it, so maybe it's not that bad. |
Sorry that it took me so long to reply back, but I have a one-track mind and a deadline approaching.. 🚂 Now let me wrap my head around everything we talked about again. I do this by means of re-iteration. This is what I have:
This is what I want:
This is how I currently work: I create a new note and begin the ORB PDF Scrapper process: Text mode
BibLaTeX mode
Org mode
🎯 What I want: Keep the number of 🔁 as small as possible, as they are a major waste of time, not very fun and have a high possibility of creating mistakes (manually updating keys is never a good idea). With this feature/branch, the workflow changes as follows: BibLaTeX mode
Org mode
As you can see, this is an improvement on two fronts. (:cake: ):
I can still see at least two issues remaining:
I think it boils down to the following: I don't want to deal with my keys after leaving the BibLaTeX mode ever again. At the moment I still have to deal with them after the PDF Scrapper process is finished. Overall, I believe that PDF Scrapper exists to solve two issues:
This was longer than I intended, and might be missing some features, but I think this captures my motivation behind using ORB and my workflow pretty well. 😅 I hope this gives a glimpse into what I take this feature for and provides a basis for future discussions on my end.
Nothing came up so far yet, will tell if it does :) |
Hi, sorry for a long silence. Thanks for your great feedback and for taking your time to write it! My primary computer is still in the service. I will write a detailed response to your above post as soon as I get it back. |
Hi, I've finally got my laptop back after almost three weeks in the service and will now move towards merging this pull request into master, hopefully in a couple of days. I feel the export functionality, although quite basic, allowed you to improve your workflow and from the technical point of view is more or less polished for merging. I think only the README section is missing.
As a general remark about repetitions, it's impossible to avoid them all, that is to fully automate the process. The main reason, and from a programmer's point of view the unknown variable, is the initial set of text references generated by Anystyle and presented by ORB in the text buffer. I sketched the first prototype of what would later become ORB PDF Scrapper in an hour or so. It consisted of several dozen lines and was actually a fully automated piping through Anystyle text and BibTeX outputs, feeding the latter to a simple Elisp function to produce org references directly. But it became immediately obvious that such a thing was unusable because there was no way to correct errors, generate citation keys in the desired format and so on. It's not Anystyle's fault either. That program is doing an amazing job considering how different the initial input can be - starting from good and bad PDF files, different page layouts, and ending with hundreds of different citation styles.
=> There is little that can be done on ORB's side to improve the extraction of text references from a PDF. It's possible to train a custom Finder model (
So, the next thing I did was implementing the current stepwise modal system, where the user (basically me I guess at that time) could edit the intermediate text and BibTeX input to their liking and go back to the previous mode in case some things were missing. It turned out for the better, because the BibTeX data are valuable and it's good to be able to have a look at them and possibly store them in the process.
Unfortunately, switching between buffers was designed one-way, i.e. you lose the "forward" progress when you go back to the previous mode. I could not clearly see how I could quickly implement a mapping between text and BibTeX, so that if you go back to text mode and edit one reference, you would not need to re-generate all the BibTeX entries, just one entry for that reference.
=> Such a feature would not necessarily decrease the amount of manual editing, but it would definitely make the overall experience more pleasant, or rather psychologically less boring - oh no, I have to re-generate all the data once again. I'm interested in implementing the The basic idea and my goal at the time I started with the PDF Scrapper was to extract references from a PDF file associated with an
=> If you haven't used it yet and are having trouble with author/journal/etc. fields not being recognized correctly, I urge you to try training a custom Parser model. I'm also keen on improving the existing rather basic autokey functionality, e.g. #147 Now to your comments.
Great! I'm glad it's working for you and thanks once again for your invaluable input.
I will have to finish the current pull request. Then I'm going to split the README file into a short README and a general manual. Documenting is part of programming, and there have been many changes to ORB over the past months. I'd also like to update the CHANGELOG and prepare a new release. It should not take too long though. Also tell me what feature is more urgent for you, #144 or #147, I'll start working on that one earlier, maybe simultaneously with the manual, but not simultaneously on both features.
Well, as I wrote above the initial idea was simpler, but I don't mind putting it that way :) |
Thank you for the write-up - it is very interesting to see how a project evolves :) And congrats on getting your laptop back!
I think that it is OK to have manual parts. But they should be non-repeating, i.e. when I am done with them, then I never have to do them again. And based on your other paragraphs, I think you agree :)
There are probably thousands of things wrong with my assumption, but why not just treat each line in the "text buffer" as a table row? text | bibtex | citekey If the text cell changes, then update the bibtex and citekey cells.
Smart! And similar to my use-case, except that I go one step further and also try to insert the keys into my annotations, so that only the "usable" references remain. Emphasis on "try", as that is lots of work..
Yay! Another rabbit-hole to burrow down into! :) I believe that #147 is more impactful (and annoying), so I would go with that. Also I believe that #144 kinda requires #147, as there is not much to look up if the keys differ. Good idea with the documentation. I believe that I also have some "organically grown" ORB parameters in my config - I believe that updating the docs gives everyone a "fresh start", i.e. I can clean my config and you can focus on new features :) I got another quick idea: Is it possible to show the "Scrapper modals" side-by-side? (vertical windows) Text-mode: PDF | Text BibTeX-mode: Text | BibTeX Org-modal: BibTeX | Org Basically we see the buffer where we last came from. This allows to look-up and fix mistakes in previous buffers. (BRB requesting) |
Basically yes. I anticipate more problems on the BibTeX end. A BibTeX entry is not a single line, but a multiline entry with several fields. What if the user has edited the entry, added, removed or changed fields. How should such changes be merged with the changes coming from the updated text entry? Simply overwriting the entry should be easy. Also, BibTeX entries can span across varying number of lines, therefore some sort of position tracking must be used. Ok, that must also be easy. There will definitely be other issues such as what to do when a BibTeX was deleted, so some assumptions about the user behavior will have to be made.
Could you please elaborate? What sort of annotations, maybe ORB could assist in automation?
Right, I have many times encountered a situation where a new user would use config examples from some half-a-year-old blog post, where the information is absolutely outdated and the package would therefore only throw errors. Here I blame the lack of examples in the README prompting people to look for them somewhere else. But even users who faithfully followed the official README several months ago can still experience problems because the package has changed since then. Good documentation would also mean for me that ORB is getting ready to advance to version 1.0. |
Not sure if I understand this correctly, but I would say that the user always has the "last word", i.e. can overwrite everything. And I also would say that there is just one direction. text -> bibtex -> org If a user changes a bibtex cell, then the changes bubble up to the org cell. Text remains untouched.
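The one-directional "row" idea (text -> bibtex -> org) could be modelled roughly as follows. This is a minimal sketch, assuming one record per reference; `my/ref-row` and `my/ref-row-edit` are hypothetical names that do not exist in ORB, and the actual regeneration of stale cells is left out.

```elisp
(require 'cl-lib)

;; One "table row" per reference: the three cells plus a list of
;; cells that must be regenerated because an upstream cell changed.
(cl-defstruct my/ref-row
  text bibtex org
  (stale nil))

(defun my/ref-row-edit (row cell value)
  "Set CELL of ROW to VALUE and mark downstream cells as stale.
CELL is one of the symbols `text', `bibtex' or `org'.  Changes only
propagate forward, so editing `bibtex' never touches `text'."
  (pcase cell
    ('text   (setf (my/ref-row-text row) value
                   (my/ref-row-stale row) '(bibtex org)))
    ('bibtex (setf (my/ref-row-bibtex row) value
                   (my/ref-row-stale row) '(org)))
    ('org    (setf (my/ref-row-org row) value
                   (my/ref-row-stale row) nil)))
  row)
```

With such a record, going back to text mode and editing one reference would only mark that row's BibTeX and Org cells for regeneration, while a manual BibTeX edit takes precedence and only invalidates the Org cell.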
It's actually pretty simple. Let's say that I annotated a paper and extracted its contents with org-noter. In my notes I now have some extracted annotations.
In this case there are two citations hidden in there (note: usually they follow the same citation style):
Then I use the ORB-Scrapper to extract the references from the paper. This gives me all references from the paper: References
Now I go and insert those references into my extracted annotations:
Those are the citations I actually care about - they get linked directly to their source in my Zettelkasten, very similar to how you are doing it. Finally I delete the references section. If a reference is not cited, then it clearly is not interesting for me. There are a few problems I am encountering:
(I have a half-finished blog-post about my approach, but I didn't get to finish it, as I am supposed to write my thesis... oh well.) I actually just yesterday started learning some elisp to maybe automate this - I already have some macros that kinda work, but they are horrifying to manage, so yeah, here I am, again not working on my thesis... 🥇 The following takes At least writing it as a function lets me fix the problems easily. My macros constantly "ring the bell". 💢
(require 'cl-lib) ; for `cl-incf'

(defun citekey-replace-author-and-year ()
  "Replace \"Author ... YEAR\" mentions before point with this line's cite key.
The line at point must hold a key of the form cite:authorYEARtitle."
  (interactive) ; allow this to be user-callable
  (let ((regexp (rx ":"
                    (group (zero-or-more (any letter "-")))
                    (group (one-or-more digit))
                    (one-or-more letter)))
        (citekey (buffer-substring (line-beginning-position) (line-end-position)))
        (number-matches 0)
        author year searchstring)
    (unless (string-match regexp citekey)
      (user-error "No cite:authorYEARtitle key on this line"))
    (setq author (match-string 1 citekey)
          year (match-string 2 citekey))
    (message "%s %s" author year)
    ;; `literal' (Emacs 27+) inserts the run-time value of a variable.
    ;; `(eval ...)' is evaluated at macro-expansion time, when AUTHOR
    ;; and YEAR are not bound yet.
    (setq searchstring
          (rx (literal author)
              (any space punctuation)
              (*? not-newline)
              (zero-or-one "(")
              (literal year)
              (zero-or-one ")")))
    (save-excursion
      ;; No bound; return nil instead of signalling an error on no match.
      (while (re-search-backward searchstring nil t)
        ;; FIXEDCASE and LITERAL: insert the key exactly as written.
        (replace-match citekey t t)
        (cl-incf number-matches)))
    (message "Found and replaced %s matches" number-matches)))
For the "number" style, it is way simpler. Except if it is written as [1-3]. Good luck solving that with macros. Ideally I could detect all references in my annotation body and then cross-reference them with my bibtex. Something like highlighting with This also fixes the problem of references with pre-print release-years. I stumbled across a few references that say "Author 200X", but my key is "Author 200Y". In this case my regex would find nothing. Yeah, I went a bit off-trail, but I hope the base idea is clear. |
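For the numbered style, a range such as [1-3] can be expanded into individual reference numbers before any replacement is attempted. A minimal sketch; `my/expand-citation-numbers` is a hypothetical helper, and it assumes citations are written as comma-separated numbers or ASCII-hyphen ranges inside square brackets:

```elisp
(defun my/expand-citation-numbers (citation)
  "Return the list of reference numbers named by CITATION.
CITATION is a string such as \"[1-3]\" or \"[2, 5-7]\"."
  (let ((inner (replace-regexp-in-string "[][]" "" citation))
        numbers)
    ;; Split on commas, dropping empty parts and surrounding whitespace.
    (dolist (part (split-string inner "," t "[ \t]+"))
      (if (string-match "\\`\\([0-9]+\\)-\\([0-9]+\\)\\'" part)
          ;; A range "FROM-TO": add every number in between.
          (let ((from (string-to-number (match-string 1 part)))
                (to (string-to-number (match-string 2 part))))
            (setq numbers (append numbers (number-sequence from to))))
        ;; A single number.
        (setq numbers (append numbers (list (string-to-number part))))))
    numbers))
```

Each resulting number could then be mapped to the cite key of the corresponding entry in the extracted reference list.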
Very cool, but indeed it requires a lot of work. After you have finished your thesis (and I mine) we can think about how PDF Scrapper could provide rudimentary support for this, provided you still need PDF Scrapper. I envision that PDF Scrapper could become a bridge between PDF-tools, Anystyle and reference management software, a facility to do all references-related stuff. |
Sure thing! PDF Scrapper is definitely positioned that way. And I agree to hold the horses and focus on the matters on hand. |
88bc2a8 to 84fb4bd
This branch is now in the master! |