
fix(core): use try/catch in .evaluate() to avoid errors #52

Merged: 5 commits, Feb 5, 2024
11 changes: 10 additions & 1 deletion packages/gpt-scraper-core/src/crawler.ts
@@ -33,7 +33,7 @@
addFormats(validator);
validator.compile(schema);
return schema;
} catch (e: any) {

Lint warning (GitHub Actions / lint) on line 36 of packages/gpt-scraper-core/src/crawler.ts: Unexpected any. Specify a different type
log.error(`Schema is not valid: ${e.message}`, { error: e });
await Actor.fail('Schema is not valid. Go to Actor run log, '
+ 'where you can find error details or disable "Use JSON schema to format answer" option.');
@@ -92,7 +92,16 @@
}
},
],

postNavigationHooks: [
async ({ page }) => {
// see https://github.com/apify/crawlee/issues/2314
// will solve client-side redirects through meta tags
await page.waitForSelector('body', {
state: 'attached',
timeout: 60_000,
});
},
],
async requestHandler({ request, page, enqueueLinks, closeCookieModals }) {
const { depth = 0 } = request.userData;
const state = await crawler.useState<State>(DEFAULT_STATE);
@@ -217,7 +226,7 @@
answer = answerResult.answer;
jsonAnswer = answerResult.jsonAnswer;
model.updateApiCallUsage(answerResult.usage);
} catch (error: any) {

Lint warning (GitHub Actions / lint) on line 229 of packages/gpt-scraper-core/src/crawler.ts: Unexpected any. Specify a different type
if (error instanceof OpenaiAPIErrorToExitActor) {
throw await Actor.fail(error.message);
}
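The lint warnings in this file all flag `catch (e: any)`. A minimal sketch of how such handlers can satisfy the `no-explicit-any` rule by catching `unknown` and narrowing; `compileSchema` here is a hypothetical stand-in for the `validator.compile(schema)` path in crawler.ts, not the Actor's real code:

```typescript
// Sketch only: `compileSchema` is a made-up stand-in for the schema
// validation step shown in the diff above.
function compileSchema(schema: unknown): unknown {
    try {
        if (typeof schema !== 'object' || schema === null) {
            throw new Error('schema must be an object');
        }
        return schema;
    } catch (e: unknown) {
        // Narrowing `unknown` instead of annotating `any` satisfies the
        // `no-explicit-any` lint rule flagged by the CI annotations.
        const message = e instanceof Error ? e.message : String(e);
        throw new Error(`Schema is not valid: ${message}`);
    }
}
```

With `useUnknownInCatchVariables` enabled in tsconfig, the `: unknown` annotation is the default and can be dropped entirely.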
3 changes: 1 addition & 2 deletions packages/gpt-scraper-core/src/models/openai.ts
@@ -3,8 +3,7 @@
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { OpenAI } from 'langchain/llms/openai';
import { LLMResult } from 'langchain/schema';
-import { REPETITIVE_PROMPT_ERROR_MESSAGE } from '../errors.js';
-import { NonRetryableOpenaiAPIError, OpenaiAPIError, OpenaiAPIErrorToExitActor, RateLimitedError } from '../errors.js';
+import { NonRetryableOpenaiAPIError, OpenaiAPIError, OpenaiAPIErrorToExitActor, RateLimitedError, REPETITIVE_PROMPT_ERROR_MESSAGE } from '../errors.js';
import { tryToParseJsonFromString } from '../processors.js';
import { ProcessInstructionsOptions } from '../types/model.js';
import { OpenAIModelSettings } from '../types/models.js';
@@ -19,7 +18,7 @@
*
* see OpenAI errors documentation: https://platform.openai.com/docs/guides/error-codes/api-errors
*/
const wrapInOpenaiError = (error: any): OpenaiAPIError => {

Lint warning (GitHub Actions / lint) on line 21 of packages/gpt-scraper-core/src/models/openai.ts: Unexpected any. Specify a different type
const errorMessage = error.error?.message || error.code || error.message;

// The error structure is completely different for insufficient quota errors. We need to handle it separately.
@@ -79,7 +78,7 @@
for (let retry = 1; retry < MAX_GPT_RETRIES + 1; retry++) {
try {
return await this.processInstructions(options);
} catch (error: any) {

Lint warning (GitHub Actions / lint) on line 81 of packages/gpt-scraper-core/src/models/openai.ts: Unexpected any. Specify a different type
const wrappedError = wrapInOpenaiError(error);

if (wrappedError instanceof NonRetryableOpenaiAPIError) throw wrappedError;
@@ -121,7 +120,7 @@
* Parses the function arguments from the OpenAI LLM's output.
*/
private parseFunctionArguments = (functionOutput: LLMResult): string | null => {
const generations = functionOutput.generations as any;

Lint warning (GitHub Actions / lint) on line 123 of packages/gpt-scraper-core/src/models/openai.ts: Unexpected any. Specify a different type
const firstFunction = generations?.[0]?.[0]?.message;
const functionArguments = firstFunction?.lc_kwargs?.additional_kwargs?.function_call?.arguments;

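The retry loop shown in the `processInstructions` hunk above can be sketched as a small generic helper. This is a hedged sketch under simplified names: `NonRetryableError` and `RetryableError` stand in for the real `NonRetryableOpenaiAPIError` / `OpenaiAPIError` classes, and `withRetries` is a hypothetical helper, not the Actor's actual method:

```typescript
const MAX_GPT_RETRIES = 3;

// Simplified stand-ins for NonRetryableOpenaiAPIError / OpenaiAPIError.
class NonRetryableError extends Error {}
class RetryableError extends Error {}

// Wrap each raw error; rethrow immediately if it is non-retryable,
// otherwise remember it and try again, up to MAX_GPT_RETRIES attempts.
async function withRetries<T>(
    fn: () => Promise<T>,
    wrapError: (e: unknown) => Error,
): Promise<T> {
    let lastError: Error = new Error('no attempts made');
    for (let retry = 1; retry <= MAX_GPT_RETRIES; retry++) {
        try {
            return await fn();
        } catch (error: unknown) {
            const wrapped = wrapError(error);
            if (wrapped instanceof NonRetryableError) throw wrapped;
            lastError = wrapped;
        }
    }
    throw lastError;
}
```

The key design point, visible in the diff as well: errors are normalized through a wrapper (`wrapInOpenaiError`) before the retry decision, so the loop only has to inspect the wrapped type.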
10 changes: 8 additions & 2 deletions packages/gpt-scraper-core/src/processors.ts
@@ -15,7 +15,13 @@ export const shrinkHtml = async (html: string, page: Page, removeElementsCssSele
if (removeSelector) {
const elements = doc.querySelectorAll(removeSelector);
for (const element of elements) {
-element.remove();
+// there have been some cases when the page's own scripts cause errors, and running
+// this line makes them resurface, so we wrap it in try/catch
+try {
+    element.remove();
+} catch (err) {
+    /* ignore */
+}
}
}
return doc.documentElement.outerHTML;
@@ -34,7 +40,7 @@ export const htmlToMarkdown = (html: string) => {
return htmlToMarkdownProcessor.turndown(html);
};

-const chunkText = (text:string, maxLength: number) => {
+const chunkText = (text: string, maxLength: number) => {
const numChunks = Math.ceil(text.length / maxLength);
const chunks = new Array(numChunks);

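The diff cuts off after the first two lines of `chunkText`'s body. A plausible completion for illustration only, assuming the straightforward fixed-size split; the real loop body is not shown in this diff:

```typescript
// Split `text` into consecutive chunks of at most `maxLength` characters.
const chunkText = (text: string, maxLength: number): string[] => {
    const numChunks = Math.ceil(text.length / maxLength);
    const chunks: string[] = new Array(numChunks);

    for (let i = 0, offset = 0; i < numChunks; i++, offset += maxLength) {
        chunks[i] = text.substring(offset, offset + maxLength);
    }
    return chunks;
};
```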
6 changes: 5 additions & 1 deletion shared/CHANGELOG.md
@@ -1,5 +1,9 @@
This changelog tracks updates to both GPT Scraper and Extended GPT Scraper actors.

### 2023-01-31
*Fixes*
- Fixed a bug where the scraper would fail on some sites that contain erroneous JavaScript.

### 2023-01-26
*Fixes*
- Fixed "max pages per run" not working correctly on specific websites.
@@ -31,4 +35,4 @@ This changelog tracks updates to both GPT Scraper and Extended GPT Scraper actor

*Changes*
- Use LangChain to connect to GPT models. This means some error messages are different.
- The default model `temperature` is now set to `0` instead of `1`. This should improve the reliability of scraping. While this is technically a breaking change, it should mostly behave as an improvement, so we don't consider it necessary to release a separate version.