Is it possible to support images? #51

Open
dmitrym0 opened this issue Jul 5, 2022 · 5 comments

@dmitrym0

dmitrym0 commented Jul 5, 2022

Howdy @alphapapa, thanks for another amazing package!

I would love to download images as well. For instance, this article works just fine with eww-readable, and a couple of its images are critical to understanding the context.

Looking at the org-web-tools code, it appears that images are not fetched at all and therefore cannot be displayed. Pandoc support may be the other potential pitfall.

Am I on the right track, or are there other issues for supporting images that I'm not seeing?

@alphapapa
Owner

Hi,

Thanks for the kind words. I'm glad it's useful to you.

Which command are you using? If you use org-web-tools-archive, you can have a gzip archive of a page, and wget can download the images if you choose.

The archive.is support doesn't seem to work anymore, and it was never officially supported, so I don't know if it would be possible to fix that. But for many Web pages (i.e. ones that don't require JavaScript to render), wget does a fine job of archiving the page and its content.

@dmitrym0
Author

dmitrym0 commented Jul 6, 2022

I'm using org-web-tools-insert-web-page-as-entry.

You've given me an idea though. If I can use wget to download a reasonable facsimile of the web page, I can then convert it to an epub and use nov.el to read it.

It's hard to beat the convenience of org-web-tools though for inserting the content of an article at point!

@alphapapa
Owner

I see. Yes, theoretically images could be downloaded to a directory, and they could be inserted into the Org content. Maybe a better way to handle that would be to have Wget download them and make the archive, then extract the archive and use Pandoc to convert it to Org content.
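A minimal sketch of the download-and-archive half of that idea. This is not how org-web-tools does it internally; `archive_page`, the `page` subdirectory, and the `page.tar.gz` name are hypothetical, while the wget and tar options are standard ones.

```shell
#!/bin/sh
# Sketch: mirror one page with its requisites into DEST/page, then
# pack that directory into DEST/page.tar.gz.
# archive_page and the file layout are illustrative assumptions.
archive_page() (
    url=$1
    dest=$2
    mkdir -p "$dest/page" || exit 1
    # wget can exit non-zero when a single image fails to download,
    # so report the problem and archive whatever was saved.
    wget --quiet --page-requisites --convert-links --adjust-extension \
         --span-hosts --no-directories \
         --directory-prefix="$dest/page" "$url" \
        || echo "wget reported errors; archiving what was saved" >&2
    tar -czf "$dest/page.tar.gz" -C "$dest" page
)
```

Calling `archive_page 'https://example.com/article' ./archive` would leave both the unpacked mirror and the tarball under `./archive`, so the Pandoc step can run on the unpacked copy.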

@dmitrym0
Author

dmitrym0 commented Jul 7, 2022

I may have discovered an interesting alternative to eww-readable: Readability.js, Mozilla's reader-mode implementation. It understands inline images as well.

There's a web instance, readability-bot, that accepts a URL and renders the contents with Readability.js.

For a test workflow:

  1. Invoke readability bot with the URL you're interested in (in a temp directory):
wget -nd -H -e robots=off --convert-links -E -p \
'https://readability-bot.vercel.app/api/readability?url=https%3A%2F%2Fwww.nytimes.com%2Fwirecutter%2Freviews%2Fbest-robot-vacuum'

This generates a nice, flat mirror of the page in readable mode. Note that I'm using index.html below to refer to the root HTML file I'm interested in, but I haven't yet found how to convince wget to write to that name.
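One possible workaround is to copy the root page to index.html after the download finishes. The sketch below assumes the root page is the only file in the mirror directory whose name starts with `readability` (the last path component of the readability-bot URL); `copy_root_html` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: copy the root page of a flat wget mirror to index.html.
# Assumption: the root page is the only file whose name starts with
# "readability"; adjust the pattern if your mirror differs.
copy_root_html() (
    dir=${1:-.}
    cd "$dir" || exit 1
    for candidate in readability*; do
        # An unmatched glob stays literal, so check the file exists.
        [ -f "$candidate" ] || continue
        cp -- "$candidate" index.html
        exit 0
    done
    echo "no readability* file found in $dir" >&2
    exit 1
)
```

Copying rather than renaming keeps the original file name in place, in case any converted link in the mirror points at it.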

  2. Invoke pandoc to generate an org file:
pandoc -f html -o output.org index.html

The org file now has embedded images as well, though it's not yet clear how best to manage all the assets.
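One way to keep the assets together might be pandoc's `--extract-media` option, which collects the media a document references into a directory and rewrites the links to point there. A minimal sketch, assuming the mirror directory already contains index.html; `html_to_org` and the `media` directory name are hypothetical.

```shell
#!/bin/sh
# Sketch: convert the mirrored page to Org and gather its images
# under ./media next to the output file.
# html_to_org and the "media" name are illustrative assumptions.
html_to_org() (
    mirror_dir=$1
    # Run inside the mirror so relative image paths resolve.
    cd "$mirror_dir" || exit 1
    pandoc -f html -t org --extract-media=media -o output.org index.html
)
```

With this layout, moving output.org together with its media directory should keep the image links intact, since they stay relative.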

Alternatively, invoke pandoc to generate an epub:

pandoc -f html -t epub3 -o output.epub index.html

This has the benefit of being a fully self-contained document that can be read with nov.el and marked up with org-noter.

I had no idea pandoc supports epub generation. What an amazing piece of software.

Anyway, this is getting too long. Any thoughts on the best way to manage images when converting to org?

@johnhamelink

johnhamelink commented Mar 9, 2023

Hi folks,

I wanted to build an archive of a blog I often refer to (gnuplotting.org) in an org-mode format.

I ended up writing this bash script:

#!/usr/bin/env bash

# This script retrieves the sitemap.xml file from the gnuplotting.org website,
# extracts a list of post names from it, and downloads the necessary data to display
# the site locally. It then converts the downloaded HTML content to org-mode format,
# allowing offline reading of the gnuplotting.org blog.

fqdn="www.gnuplotting.org"
url="http://${fqdn}"
sitemap="${url}/wp-sitemap-posts-post-1.xml"
post_slug_regex="<loc>http\:\/\/www\.gnuplotting\.org\/\K.+?(?=\/</loc>)"
content_selector="div#main"

useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36."

# Retrieve the sitemap.xml file, extract the post names from it into
# an array.
mapfile -t posts < <(wget -qO- "${sitemap}" | grep -Po "${post_slug_regex}")

# For each post, crawl all the content we need to display it locally.
for post in "${posts[@]}"; do
    echo "Downloading ${post}"
    wget --quiet --convert-links --page-requisites --continue \
         --domains ${fqdn} --no-parent --level 5 \
         --user-agent="${useragent}" \
         -e robots=off --random-wait --restrict-file-names=unix \
         --max-redirect=0 --trust-server-names \
         "${url}/${post}/"
done

# wget saved everything under a directory named after the host.
cd "${fqdn}" || exit 1

echo "Building org-mode posts..."
for post in "${posts[@]}"; do
    #  Extract the main content from each post with htmlq, then
    #  convert to org with pandoc.
    htmlq -B "${content_selector}" < "${post}/index.html" | pandoc -f html -t org -o "${post}/${post}.org"
done

In my case, the website I was scraping was nice and simple. For more complex pages, rdrview might be useful to further cut down on cruft.

Once the files were in org-mode, I merged them all together through an index.org file, using #+INCLUDE: directives and org-org-export-to-org.
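The index.org file can be generated rather than written by hand. A minimal sketch, assuming the `<post>/<post>.org` layout produced by the script above; `write_index` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: write an index.org with one #+INCLUDE: line per converted
# post. Assumes each post lives at <post>/<post>.org under SITE_DIR.
write_index() (
    site_dir=$1
    cd "$site_dir" || exit 1
    for org in */*.org; do
        # Skip the literal pattern when no post has been converted.
        [ -f "$org" ] || continue
        printf '#+INCLUDE: "%s"\n' "$org"
    done > index.org
)
```

Running `write_index www.gnuplotting.org` and then exporting index.org with org-org-export-to-org would produce the single merged file.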

Lastly, I used macros to clean up the output, such as unfilling all paragraphs of text, positioning captions correctly below figures, and renaming the source language from "prettyprint" to "gnuplot".
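The source-language rename could also be done outside Emacs. The sketch below swaps in sed for the macros described above and only covers that one cleanup step; `relabel_src_blocks` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: relabel Org source blocks tagged "prettyprint" as "gnuplot".
# Reads Org text on stdin and writes the result to stdout.
# Handles both lower-case and upper-case block keywords.
relabel_src_blocks() {
    sed -E 's/^([[:space:]]*#\+(begin_src|BEGIN_SRC))[[:space:]]+prettyprint/\1 gnuplot/'
}
```

For example, `relabel_src_blocks < post.org > post.fixed.org` leaves ordinary text containing the word "prettyprint" untouched, because the pattern only matches at the start of a source block.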

alphapapa added this to the Future milestone on Dec 20, 2023