Indent before HTML block elements causes indent in Markdown output #98

Closed
chrispy-snps opened this issue Nov 26, 2023 · 2 comments · Fixed by #120

Comments

@chrispy-snps
Collaborator

chrispy-snps commented Nov 26, 2023

In our HTML, block elements are indented:

<html>
  <body>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      sed do eiusmod tempor incididunt ut labore et dolore magna
      aliqua. Ut enim ad minim veniam, quis nostrud exercitation
      ullamco laboris nisi ut aliquip ex ea commodo consequat.
    </p>
  </body>
</html>

When HTML with indented block elements is converted, the indent causes incorrect formatting in the output.

Converting this indented <p> element:

from markdownify import markdownify as md

print(repr(md("""\
  <p>This is
     some text.</p>
""")))

produces this:

' This is\n some text.\n\n\n'
 ^       ^^^

It happens for non-<p> elements too. Converting these indented <h1> elements with the UNDERLINED and ATX heading formats:

print(repr(md("""\
    <h1>Title</h1>
""")))

print(repr(md("""\
    <h1>Title</h1>
""", heading_style="ATX")))

produces this:

' Title\n=====\n\n\n'
 ^

' # Title\n\n\n'
 ^

As a workaround, we iterate over every text node inside each text-containing block element (<p>, <entry>, <li>, etc.) and replace newlines with spaces, but this is expensive on large document sets.
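
For readers who want to try something similar, here is a minimal sketch of that kind of preprocessing pass. It is not the code we actually run: it uses only the standard library's html.parser instead of a DOM traversal, and the names BLOCK_TAGS, NewlineFlattener, and flatten_newlines are hypothetical. It assumes every listed element has an explicit closing tag, and it makes no attempt to protect <pre> content nested inside those elements.

```python
import re
from html.parser import HTMLParser

# Hypothetical set of text-containing block elements; extend as needed.
BLOCK_TAGS = {"p", "li", "entry"}


class NewlineFlattener(HTMLParser):
    """Re-emit HTML, replacing newlines in text inside BLOCK_TAGS with spaces."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # how many BLOCK_TAGS elements are currently open

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            # Collapse each newline and the whitespace around it into one space.
            data = re.sub(r"\s*\n\s*", " ", data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def flatten_newlines(html: str) -> str:
    """Return html with newlines inside BLOCK_TAGS text replaced by spaces."""
    parser = NewlineFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The result of flatten_newlines(html) would then be passed to md(). Whitespace outside the listed elements, including any indent before the opening tag, is left untouched by this sketch.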

Possibly related to #31.

chrispy-snps changed the title from "Indent in <p> causes indent in Markdown output" to "Indent before HTML block elements causes indent in Markdown output" on Nov 26, 2023
@chrispy-snps
Collaborator Author

This seems to be a duplicate of issue #96.

@mirabilos

or rather #88 perhaps
