Duplicate Product Format String #61

Open
ian-pvd opened this issue Dec 31, 2021 · 4 comments

Comments

@ian-pvd

ian-pvd commented Dec 31, 2021

When using getAlbumProducts, some URLs return duplicated strings for the format prop.

For example:

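const bandcamp = require('bandcamp-scraper');

bandcamp.getAlbumProducts('https://bandcamp.prspct.nl/album/the-hardcore-party-ep', function (error, albumProducts) {
    if (error) return console.error(error); // surface request/parse failures
    console.log(albumProducts);
});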

This consistently returns "Digital AlbumDigital Album" as the format. I'm not sure how this is happening, since the .buyItemPackageTitle element only contains this text once.

This seems to happen to certain URLs consistently, ex:

I'm using a random URL out of a set of 1000 for debugging in my app, and I'm seeing this ~5% of the time.

It also seems to happen to the name prop for some URLs, and I'm also seeing the string "Full Digital Discography" doubled.

@ian-pvd
Author

ian-pvd commented Dec 31, 2021

Depending on your needs, you could just pull it from the JSON part of the page, example:

.albumRelease[0].musicReleaseFormat

I'm seeing this in the application/ld+json tag in the page markup, but where do I find it in the scraper results? I'm not seeing it in the AlbumInfo response. If there's a way to avoid making multiple scraper requests for the Album Info and then also the digital product price, that'd be really helpful.
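
For reference, a minimal sketch of reading that value straight from the page HTML, assuming the HTML is already in hand and contains a single application/ld+json script tag. extractReleaseFormats is a hypothetical helper, not part of bandcamp-scraper, and the sample markup is made up for illustration:

```javascript
// Hypothetical helper (not a bandcamp-scraper API): pulls the JSON-LD block
// out of raw album page HTML and returns each release's musicReleaseFormat.
function extractReleaseFormats(html) {
  // Assumption: one <script type="application/ld+json"> tag per page.
  const match = html.match(
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (!match) return [];

  let data;
  try {
    data = JSON.parse(match[1]);
  } catch (err) {
    return []; // malformed JSON-LD: treat it as "no data"
  }

  // Mirrors the .albumRelease[0].musicReleaseFormat path, but for every release.
  const releases = data && Array.isArray(data.albumRelease) ? data.albumRelease : [];
  return releases
    .map((release) => release.musicReleaseFormat)
    .filter((format) => typeof format === 'string');
}

// Made-up sample markup; real pages carry many more fields.
const sampleHtml = `
<script type="application/ld+json">
  {"@type": "MusicAlbum", "albumRelease": [{"musicReleaseFormat": "DigitalFormat"}]}
</script>`;

console.log(extractReleaseFormats(sampleHtml));
```

This only helps if the raw HTML is available to the caller; whether the scraper exposes it alongside the AlbumInfo response is exactly the open question above.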

Plus, it's trivial to use startsWith to still get a positive match on "Digital AlbumDigital Album" instead of strictly equal to, but I figured this response from the scraper deserved a bug report at least.
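
A slightly stricter alternative to startsWith, as a sketch: collapse a value only when it is exactly the same text repeated twice. collapseDoubled is a hypothetical helper name, and it would also shorten a title that is legitimately doubled, so it is a workaround rather than a fix:

```javascript
// Hypothetical workaround: if a string is exactly two copies of the same
// text, return one copy; otherwise return the string unchanged.
function collapseDoubled(text) {
  const half = text.length / 2;
  const isDoubled =
    text.length > 0 &&
    Number.isInteger(half) &&
    text.slice(0, half) === text.slice(half);
  return isDoubled ? text.slice(0, half) : text;
}

console.log(collapseDoubled('Digital AlbumDigital Album')); // 'Digital Album'
console.log(collapseDoubled('Full Digital DiscographyFull Digital Discography')); // 'Full Digital Discography'
console.log(collapseDoubled('Digital Album')); // 'Digital Album' (unchanged)
```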

@ian-pvd
Author

ian-pvd commented Dec 31, 2021

Further debugging seems to show that the releases where this is occurring actually do have two .buyItemPackageTitle spans inside the release list item.

Markup for a result without the issue:

<li class="buyItem digital">
    <h3 class="hd">
        <button class='download-link buy-link' type="button">
            <span class="buyItemPackageTitle primaryText">Digital Album</span>
        </button>
        <div class="digitaldescription secondaryText">  Streaming + Download </div>
    </h3>
    ...
</li>

Markup returned for a result with the duplicate text issue:

<li class="buyItem digital">
    <h3 class="hd">
        <button class='download-link buy-link' type="button">
            <span class="buyItemPackageTitle primaryText">Digital Album</span>
        </button>
        <span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
        <div class="digitaldescription secondaryText">  Streaming + Download </div>
    </h3>
    ...
</li>

This is from a dump of the html variable returned by the get function and passed into the parser function here: https://github.com/masterT/bandcamp-scraper/blob/master/lib/index.js#L58

First, I don't own this. Second, how would the scraper know that if the request is being made from node? Seems like a weird edge case, but I am seeing this behavior consistently on specific URLs.

Either way, I assume this is the cause of the duplicated text. I'm going to try to debug this further, but I wanted to post this as a correction to my initial report, which said the text only appeared once in the markup.
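
To illustrate that assumption, here is a dependency-free sketch. jQuery-style .text() (which cheerio follows) returns the combined text of every matched element, so a selector that matches both spans would yield the doubled string, while reading only the first match would not. titleTexts is a hypothetical regex stand-in for the selector match, used only so the demo runs without cheerio; how the scraper actually reads the value has not been confirmed here:

```javascript
// Trimmed-down version of the markup from an affected release.
const html = `
<h3 class="hd">
  <button class="download-link buy-link" type="button">
    <span class="buyItemPackageTitle primaryText">Digital Album</span>
  </button>
  <span class="buyItemPackageTitle primaryText you-own-this">Digital Album</span>
</h3>`;

// Hypothetical stand-in for matching ".buyItemPackageTitle": collects the text
// of every span whose class list contains buyItemPackageTitle (demo only).
function titleTexts(markup) {
  const pattern = /<span class="[^"]*\bbuyItemPackageTitle\b[^"]*">([^<]*)<\/span>/g;
  return Array.from(markup.matchAll(pattern), (m) => m[1].trim());
}

const texts = titleTexts(html);
const allMatches = texts.join(''); // what .text() over all matches would give
const firstMatch = texts[0] || ''; // what reading only the first match would give

console.log(allMatches); // 'Digital AlbumDigital Album'
console.log(firstMatch); // 'Digital Album'
```

If that is what is happening, restricting the read to the first match (for example with .first() in cheerio) would presumably avoid the duplication.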

Also, I'm not sure what's happening with this line const $ = cheerio.load(html), but by the time I dump the data variable defined here, the duplicate text is present:

{
  products: [
    {
      imageUrls: [],
      name: 'Digital AlbumDigital Album',
      nameFallback: '',
      format: 'Digital AlbumDigital Album',
      formatFallback: '',
      priceInCents: 350,
      currency: 'EUR',
      offerMore: true,
      soldOut: false,
      nameYourPrice: false,
      description: 'Includes unlimited streaming via the free Bandcamp app, plus high-quality download in MP3, FLAC and more.'
    }
  ]
}


2 participants
@ian-pvd and others