Expected Behavior
After a failed scrape, subsequent requests should work normally.
Or the crawler should automatically reset the browser context when contamination is detected (rough sketch below).
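What I mean by "reset", roughly, is something like the following: throw away the whole crawler (and with it the browser context) after a failure and rebuild it before retrying. This is only an illustration of the expected behavior, not an existing crawl4ai API; the helper name and retry flow are made up.

from crawl4ai import AsyncWebCrawler

# Sketch only: recover from a failed scrape by discarding the browser context
# and retrying once with a completely fresh crawler.
async def scrape_with_reset(url, browser_conf, run_conf):
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(url=url, config=run_conf)
    if result.success:
        return result
    # The old context was torn down when the `async with` block above exited,
    # so this retry starts from a clean browser context.
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        return await crawler.arun(url=url, config=run_conf)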
Description
After a failed scrape attempt, the crawler consistently returns "no results" pages for subsequent requests, even for previously successful queries. This appears to be a browser context contamination issue similar to FireCrawl #884.
Code Example
Current Behavior
Initial scrapes work correctly
After a failure, all subsequent requests return "no results" pages
Browser context cleanup and delays don't resolve the issue
Success rate drops to ~0% after initial failure
The code loops through batches of URLs. When one URL in a batch fails, every subsequent batch returns a "check your search spelling" / "no results" page for search scrapes. I even tried reinitializing the crawler between batches, plus everything Claude suggested, from timeouts to all kinds of other settings (see the sketch below). It's as if I have to rerun the whole project to make it work again.
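The per-batch reinitialization I tried looked roughly like this (a simplified sketch, not the exact project code; the batch contents and the two config objects are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler

# Simplified sketch of the per-batch reinitialization that did not help:
# a brand-new AsyncWebCrawler (and browser) is created for every batch.
async def run_batches(batches, browser_conf, run_conf):
    all_results = []
    for batch in batches:
        async with AsyncWebCrawler(config=browser_conf) as crawler:
            for url in batch:
                result = await crawler.arun(url=url, config=run_conf)
                all_results.append(result)
        await asyncio.sleep(1)  # pause between batches; did not change the outcome
    return all_results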
Is this reproducible?
Yes
Inputs Causing the Bug
from crawl4ai import BrowserConfig, CrawlerRunConfig

browser_conf = BrowserConfig(
    browser_type="chromium",
    headless=True,
    proxy_config={
        "server": f"http://{PROXYHOST}:{PROXYPORT}",
        "username": PROXYUSERNAME,
        "password": PROXYPASSWORD
    },
    viewport_width=1920,
    viewport_height=1080,
    verbose=True,  # Set this to True to help debug
    user_agent="random",
    text_mode=True,
    light_mode=True
)

# Create run configuration
run_config = CrawlerRunConfig(
    word_count_threshold=10,
    exclude_external_links=True,
    remove_overlay_elements=True,
    excluded_tags=['header', 'footer', 'nav'],
    process_iframes=True,
    cache_mode="ENABLED",
    page_timeout=8000
)
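As an aside, I believe newer crawl4ai versions take cache_mode as the CacheMode enum rather than a string; if a cached "no results" page were somehow being replayed, a cache-bypassing run config would look something like this (my understanding of the API, not verified against the exact version in use):

from crawl4ai import CrawlerRunConfig, CacheMode  # CacheMode assumed to expose ENABLED / BYPASS

run_config_nocache = CrawlerRunConfig(
    word_count_threshold=10,
    exclude_external_links=True,
    remove_overlay_elements=True,
    excluded_tags=['header', 'footer', 'nav'],
    process_iframes=True,
    cache_mode=CacheMode.BYPASS,  # rule out cached pages being returned after the failure
    page_timeout=8000
)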
Steps to Reproduce
1. Initialize crawler with proxy configuration
2. Perform successful search query
3. Encounter a failed request
4. Subsequent requests return "no results" pages regardless of query
Code snippets
async def scrape_urls(self, urls: Union[str, List[str]], search_type: str = "website") -> List[Dict[str, Any]]:
    """ Scrape URLs one at a time """
    if isinstance(urls, str):
        urls = [urls]

    # Return early if no URLs provided
    if not urls:
        logger.warning("No URLs provided to scrape")
        return []

    # Format URLs based on search type
    formatted_urls = []
    for url in urls:
        if search_type == "google":
            formatted_urls.append(f'https://www.google.com/search?q={url.replace(" ", "+")}')
        elif search_type == "duckduckgoimages":
            formatted_urls.append(f'https://duckduckgo.com/?q={url.replace(" ", "+")}&iax=images&ia=images')
        elif search_type == "duckduckgosearch":
            formatted_urls.append(f'https://duckduckgo.com/?q={url.replace(" ", "+")}&ia=web')
        elif search_type == "bing_videos":
            formatted_urls.append(f'https://www.bing.com/videos/search?q={url.replace(" ", "+")}')
        elif search_type == "bing_search":
            formatted_urls.append(f'https://www.bing.com/search?q={url.replace(" ", "+")}')
        else:  # website
            formatted_urls.append(url)

    crawler = None
    try:
        # Initialize crawler
        self.browser_config.ignore_https_errors = True
        crawler = AsyncWebCrawler(config=self.browser_config)
        async with crawler:
            results = []
            for formatted_url in formatted_urls:
                try:
                    # Scrape single URL
                    result = await crawler.arun(url=formatted_url, config=self.run_config)
                    if not result.success:
                        logger.error(f"Failed to scrape {formatted_url}: {result.error_message}")
                        results.append(None)
                        continue

                    # Process the result
                    soup = BeautifulSoup(result.html, 'lxml')
                    if search_type == "bing_search":
                        processed = self._process_bing_search_results(soup, formatted_url)
                        results.extend(processed if processed else [])
                    else:  # website
                        processed = self._process_website_content(soup, formatted_url)
                        results.append(processed if processed else None)

                    # Small delay between requests
                    await asyncio.sleep(0.5)
                except Exception as e:
                    logger.error(f"Error processing {formatted_url}: {str(e)}")
                    results.append(None)

            # Filter out None values only for search types
            if search_type in ["google", "duckduckgosearch", "bing_search", "bing_videos", "duckduckgoimages"]:
                results = [r for r in results if r is not None]
            return results
    except Exception as e:
        logger.error(f"Error during scraping: {str(e)}")
        return []
    finally:
        if crawler:
            await crawler.close()
            await asyncio.sleep(1)  # Wait for cleanup
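For context, this method lives on a small scraper wrapper class (not shown above) and is called roughly like this; the variable name and queries are placeholders:

# Hypothetical call site; "scraper" is an instance of the wrapper class holding scrape_urls
results = await scraper.scrape_urls(
    ["some search query", "another search query"],  # placeholder queries
    search_type="bing_search",
)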
crawl4ai version
latest
OS
linux
Python version
3.11
Browser
chromium
Browser version
latest
Error logs & Screenshots (if applicable)
[CrawlResult(url='httpswww.bing.com.txt