[Bug]: Browser context becomes contaminated after failed scrapes, returning "no results" pages #501

saipavankumar-muppalaneni opened this issue Jan 20, 2025 · 0 comments

crawl4ai version

latest

Expected Behavior

  • After a failed scrape, subsequent requests should work normally
  • Or the crawler should automatically reset the browser context when contamination is detected

Description
After a failed scrape attempt, the crawler consistently returns "no results" pages for subsequent requests, even for previously successful queries. This appears to be a browser context contamination issue similar to FireCrawl #884.
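For reference, this is roughly how I spot the contaminated responses on my side (a simplified sketch; the marker strings and the looks_contaminated helper are just illustrative names, not crawl4ai APIs):

from bs4 import BeautifulSoup

# Illustrative markers only; the exact wording varies per search engine.
NO_RESULTS_MARKERS = [
    "check your search",
    "no results found",
    "did not match any documents",
]

def looks_contaminated(html: str) -> bool:
    """Heuristic: does this page look like the empty 'no results' responses?"""
    text = BeautifulSoup(html or "", "lxml").get_text(" ", strip=True).lower()
    return any(marker in text for marker in NO_RESULTS_MARKERS)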

Code Example

Current Behavior

  • Initial scrapes work correctly
  • After a failure, all subsequent requests return "no results" pages
  • Browser context cleanup and delays don't resolve the issue
  • Success rate drops to ~0% after initial failure

The code loops through batches of URLs. When one URL in a batch fails, the subsequent batches return "check your search spelling / no results" pages for search scrapes. I even tried reinitializing the crawler between batches and everything Claude suggested, from timeouts to all kinds of other settings. It's as if I have to rerun the whole project to make it work again. A rough sketch of the per-batch reinitialization I tried is below.
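Simplified, that reinitialization attempt looked roughly like this (batches and the run_batches name are placeholders; the configs are the ones shown under "Inputs Causing the Bug"):

import asyncio
from crawl4ai import AsyncWebCrawler

async def run_batches(batches, browser_config, run_config):
    collected = []
    for batch in batches:
        # Brand-new crawler (and browser context) for every batch.
        # Later batches still come back as "no results" pages once
        # a single URL has failed.
        crawler = AsyncWebCrawler(config=browser_config)
        async with crawler:
            for url in batch:
                result = await crawler.arun(url=url, config=run_config)
                if result.success:
                    collected.append(result.html)
        await asyncio.sleep(1)  # extra settle time between batches didn't help either
    return collected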

Is this reproducible?

Yes

Inputs Causing the Bug

from crawl4ai import BrowserConfig, CrawlerRunConfig

browser_conf = BrowserConfig(
    browser_type="chromium",
    headless=True,
    proxy_config={
        "server": f"http://{PROXYHOST}:{PROXYPORT}",
        "username": PROXYUSERNAME,
        "password": PROXYPASSWORD
    },
    viewport_width=1920,
    viewport_height=1080,
    verbose=True,  # set to True to help debug
    user_agent="random",
    text_mode=True,
    light_mode=True
)

# Create run configuration
run_config = CrawlerRunConfig(
    word_count_threshold=10,
    exclude_external_links=True,
    remove_overlay_elements=True,
    excluded_tags=['header', 'footer', 'nav'],
    process_iframes=True,
    cache_mode="ENABLED",
    page_timeout=8000
)

Steps to Reproduce

1. Initialize the crawler with the proxy configuration
2. Perform a successful search query
3. Encounter a failed request
4. Subsequent requests return "no results" pages regardless of the query
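A condensed, self-contained version of those steps (the proxy values and query strings are placeholders; this is a sketch, not my full script):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_conf = BrowserConfig(
        browser_type="chromium",
        headless=True,
        proxy_config={"server": "http://PROXYHOST:PROXYPORT",
                      "username": "PROXYUSERNAME",
                      "password": "PROXYPASSWORD"},
        user_agent="random",
    )
    run_config = CrawlerRunConfig(word_count_threshold=10, page_timeout=8000)

    queries = ["first query", "second query", "third query"]
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        for q in queries:
            url = f'https://www.bing.com/search?q={q.replace(" ", "+")}'
            result = await crawler.arun(url=url, config=run_config)
            # Step 2 succeeds; once step 3 fails, every later result.html
            # is a "no results" page regardless of the query.
            print(url, result.success, len(result.html or ""))

asyncio.run(main())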

Code snippets

# Module-level imports this method relies on in the full script:
import asyncio
import logging
from typing import Any, Dict, List, Union

from bs4 import BeautifulSoup
from crawl4ai import AsyncWebCrawler

logger = logging.getLogger(__name__)

async def scrape_urls(self, urls: Union[str, List[str]], search_type: str = "website") -> List[Dict[str, Any]]:
        """
        Scrape URLs one at a time
        """
        if isinstance(urls, str):
            urls = [urls]

        # Return early if no URLs provided
        if not urls:
            logger.warning("No URLs provided to scrape")
            return []

        # Format URLs based on search type
        formatted_urls = []
        for url in urls:
            if search_type == "google":
                formatted_urls.append(f'https://www.google.com/search?q={url.replace(" ", "+")}')
            elif search_type == "duckduckgoimages":
                formatted_urls.append(f'https://duckduckgo.com/?q={url.replace(" ", "+")}&iax=images&ia=images')
            elif search_type == "duckduckgosearch":
                formatted_urls.append(f'https://duckduckgo.com/?q={url.replace(" ", "+")}&ia=web')
            elif search_type == "bing_videos":
                formatted_urls.append(f'https://www.bing.com/videos/search?q={url.replace(" ", "+")}')
            elif search_type == "bing_search":
                formatted_urls.append(f'https://www.bing.com/search?q={url.replace(" ", "+")}')
            else:  # website
                formatted_urls.append(url)

        crawler = None
        try:
            # Initialize crawler
            self.browser_config.ignore_https_errors = True
            crawler = AsyncWebCrawler(config=self.browser_config)
            
            async with crawler:
                results = []
                for formatted_url in formatted_urls:
                    try:
                        # Scrape single URL
                        result = await crawler.arun(url=formatted_url, config=self.run_config)
                        
                        if not result.success:
                            logger.error(f"Failed to scrape {formatted_url}: {result.error_message}")
                            results.append(None)
                            continue

                        # Process the result
                        soup = BeautifulSoup(result.html, 'lxml')
                        if search_type == "bing_search":
                            processed = self._process_bing_search_results(soup, formatted_url)
                            results.extend(processed if processed else [])
                        else:  # website
                            processed = self._process_website_content(soup, formatted_url)
                            results.append(processed if processed else None)

                        # Small delay between requests
                        await asyncio.sleep(0.5)

                    except Exception as e:
                        logger.error(f"Error processing {formatted_url}: {str(e)}")
                        results.append(None)

                # Filter out None values only for search types
                if search_type in ["google", "duckduckgosearch", "bing_search", "bing_videos", "duckduckgoimages"]:
                    results = [r for r in results if r is not None]

                return results

        except Exception as e:
            logger.error(f"Error during scraping: {str(e)}")
            return []

        finally:
            if crawler:
                await crawler.close()
                await asyncio.sleep(1)  # Wait for cleanup
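This is also roughly what the "automatically reset the browser context when contamination is detected" behaviour from Expected Behavior could look like from the caller's side, using the looks_contaminated sketch from the Description section (a hypothetical wrapper reusing the names from the method above, not a crawl4ai API):

async def arun_with_reset(self, url: str, max_retries: int = 1):
    """Sketch: retry with a completely new crawler if the page looks contaminated."""
    result = None
    for attempt in range(max_retries + 1):
        crawler = AsyncWebCrawler(config=self.browser_config)
        async with crawler:
            result = await crawler.arun(url=url, config=self.run_config)
        if result.success and not looks_contaminated(result.html):
            return result
        logger.warning(f"Contaminated or failed result for {url} (attempt {attempt + 1})")
    return result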

OS

linux

Python version

3.11

Browser

chromium

Browser version

latest

Error logs & Screenshots (if applicable)

[CrawlResult(url='httpswww.bing.com.txt
