Experiment with contact details extraction and, if good, make a miniactor integration #43

Open
metalwarrior665 opened this issue Jan 8, 2024 · 10 comments

Comments

@metalwarrior665
Contributor

Compare if we could improve https://apify.com/vdrmota/contact-info-scraper

Basically, we need to figure out the best prompts, and if it does more than the contact scraper, we can release it as a miniactor.

@metalwarrior665 metalwarrior665 added the enhancement New feature or improving/enhancing the existing ones for end users label Jan 8, 2024
@jancurn
Contributor

jancurn commented Jan 8, 2024

Just a note: if you guys do a miniactor for these sub-usecases, please let's use metamorph and make it open source, so that we can use it for marketing and show others how easy it is to do this.
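For readers unfamiliar with the pattern, here is a rough sketch of what a metamorph-based miniactor could look like: it accepts a simplified input, fills in a fixed prompt, and hands the run over to the generic scraper. The Actor ID, the input field names, the prompt text and the model name are all placeholders invented for illustration, not the values used in this repo.

```typescript
// Hypothetical simplified input exposed by the miniactor.
interface MiniactorInput {
    startUrls: string[];
}

// Hypothetical input of the generic GPT scraper (field names are assumptions).
interface GptScraperInput {
    startUrls: { url: string }[];
    instructions: string;
    model: string;
}

// Placeholder ID, not the real Actor name.
const TARGET_ACTOR_ID = "<username>/<gpt-scraper-actor>";

// Placeholder prompt; finding a good one is the point of this issue.
const CONTACT_PROMPT =
    "Extract all emails and phone numbers from the page and match them with names where possible.";

// Maps the simplified input to the full input of the target Actor.
function buildMetamorphCall(input: MiniactorInput): { targetActorId: string; input: GptScraperInput } {
    return {
        targetActorId: TARGET_ACTOR_ID,
        input: {
            startUrls: input.startUrls.map((url) => ({ url })),
            instructions: CONTACT_PROMPT,
            model: "placeholder-model",
        },
    };
}

// Inside the Actor's main function, with the Apify SDK, this would be roughly:
//   const call = buildMetamorphCall(miniactorInput);
//   await Actor.metamorph(call.targetActorId, call.input);
```

Keeping the mapping in a pure function means it can be unit-tested without starting an Actor run.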

@foxt451
Contributor

foxt451 commented Jan 11, 2024

I guess I'll try to look into it a bit

@foxt451 foxt451 self-assigned this Jan 11, 2024
@metalwarrior665
Contributor Author

Sounds good, let's not write any code yet, just play with it. We need to collect some pages for testing. The ideal case for GPT would be to match emails, phones, etc. with names. We had a student project that tried to do this based on HTML element proximity but that's super tricky.
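To illustrate why proximity matching is tricky, here is a deliberately naive sketch that pairs each contact with the nearest name by character offset in plain text. It is a stand-in for the HTML-element-proximity idea, not the student project's code, and it assumes the list of names is already known, which is the hard part in practice.

```typescript
interface Located {
    value: string;
    index: number; // character offset of the first occurrence in the text
}

interface Pairing {
    contact: string;
    name: string | null;
}

// Finds the first occurrence of each value; values absent from the text are dropped.
function locate(text: string, values: string[]): Located[] {
    return values
        .map((value) => ({ value, index: text.indexOf(value) }))
        .filter((located) => located.index >= 0);
}

// Pairs every contact with the name whose offset is closest to it.
function pairByProximity(text: string, names: string[], contacts: string[]): Pairing[] {
    const locatedNames = locate(text, names);
    return locate(text, contacts).map((contact) => {
        let best: Located | null = null;
        for (const candidate of locatedNames) {
            if (best === null || Math.abs(candidate.index - contact.index) < Math.abs(best.index - contact.index)) {
                best = candidate;
            }
        }
        return { contact: contact.value, name: best ? best.value : null };
    });
}
```

A heuristic like this breaks as soon as the layout puts a name far from its contact, for example in tables or multi-column pages.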

@foxt451
Contributor

foxt451 commented Jan 11, 2024

🙂 we would probably need some quite clever prompt

[image]

@metalwarrior665
Contributor Author

Ouch :D

@jancurn
Contributor

jancurn commented Jan 11, 2024

Wow, maybe we'll need to use another model :)

@foxt451
Contributor

foxt451 commented Jan 13, 2024

So, I haven't started tinkering with the model settings yet, but I've set up basic boilerplate, which includes the miniactor code with metamorph and tests. The tests list a set of URLs with the contacts expected to be found on them and check whether those contacts are returned when using the model settings, schema and prompt exported from @packages/contact-scraper. They also check that there are no hallucinations (and, well, they all fail for now).
To tinker with it at this point, go to @packages/contact-scraper, change the exported model name, the model settings, the prompt or the schema, and rerun npm test until it succeeds.
The model/prompt settings themselves would not be available on the miniactor. And here is the draft miniactor itself, built from the https://github.com/apify-projects/store-gpt-scraper/tree/feat/contact-details branch.
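As a rough idea of what such a check could look like (not the actual test code from the branch), the sketch below compares the returned contacts with the expected ones and flags possible hallucinations. It assumes a hallucination means "a returned value that does not appear verbatim in the page text"; the function and type names are made up for illustration.

```typescript
interface ContactCheck {
    missing: string[]; // expected contacts the model did not return
    hallucinated: string[]; // returned contacts not found in the page text
}

// Case-insensitive comparison with surrounding whitespace ignored.
function normalize(value: string): string {
    return value.trim().toLowerCase();
}

function checkContacts(expected: string[], returned: string[], pageText: string): ContactCheck {
    const returnedSet = new Set(returned.map(normalize));
    const text = pageText.toLowerCase();
    return {
        missing: expected.filter((contact) => !returnedSet.has(normalize(contact))),
        // Assumption: anything not present verbatim on the page counts as a hallucination.
        hallucinated: returned.filter((contact) => !text.includes(normalize(contact))),
    };
}
```

A test for one URL would then pass only when both arrays are empty.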

@foxt451 foxt451 removed their assignment Jan 19, 2024
@metalwarrior665
Contributor Author

I created a PR here https://github.com/apify-projects/store-gpt-scraper/pull/50/files.

I don't think this implementation offers any advantage over the current Contact Details Scraper, since all of these things can be regexed. The next step that we are missing and that GPT could manage (although I don't believe it is powerful enough) is linking the contacts together with names. This is something we tried as a student project based on HTML proximity, but we didn't have a good generic name regex. I would just play with it a bit, and we can release it even if it is not great.
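For context on the "can be regexed" point, a minimal sketch of regex-based extraction follows. The patterns are intentionally loose illustrations, not the ones used by the Contact Details Scraper, and the phone pattern in particular will produce false positives on arbitrary digit sequences.

```typescript
// Loose illustrative patterns; real-world scrapers use stricter ones.
const EMAIL_REGEX = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;
const PHONE_REGEX = /\+?\d[\d\s().-]{6,}\d/g;

interface ExtractedContacts {
    emails: string[];
    phones: string[];
}

// Returns unique matches in order of first appearance.
function uniqueMatches(text: string, regex: RegExp): string[] {
    return Array.from(new Set(text.match(regex) ?? []));
}

function extractContacts(text: string): ExtractedContacts {
    return {
        emails: uniqueMatches(text, EMAIL_REGEX),
        phones: uniqueMatches(text, PHONE_REGEX),
    };
}
```

What this kind of extraction cannot do is say whose email or phone it found, which is the gap a language model might fill.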

@foxt451
Contributor

foxt451 commented Jan 26, 2024

I guess this actor/repo is not even needed then (and the issue belongs in the contact details repo instead)? All the HTML/contacts are extracted in the contact-details scraper, and GPT API requests could be sent directly from that actor. There is no need for this (GPT) actor's crawling capabilities (on the other hand, one could say that the contact details scraper is not needed, and we could use crawlee's social module here).

@foxt451
Contributor

foxt451 commented Jan 30, 2024

So, I've closed that old PR as it seemed to go way in the wrong direction, and opened another one, #51. What do you think, @jancurn @metalwarrior665?
