
scoring pairs is much slower after training than after loading settings file #977

Open
fgregg opened this issue Mar 1, 2022 · 4 comments

fgregg commented Mar 1, 2022

this is going to be a pain to debug, i think.


To reproduce:

Get code for this linking project: https://github.com/labordata/fmcs-f7/tree/37e6e805ceb6ec8dee7844fbe7f45b71609066ad

make update_raw
rm link.csv
make link.csv

this will train dedupe and then do the scoring and clustering. the scoring and clustering will be very slow.

rm link.csv
make link.csv

this will use the settings file created in the previous run, and scoring and clustering will be much faster.
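
(for context, the two runs roughly map onto the two standard dedupe code paths sketched below. this is a minimal sketch with stand-in field definitions, record dicts, and file names, not the actual fmcs-f7 code, and it assumes the RecordLink / StaticRecordLink API.)

    import dedupe

    # stand-in field definition and toy record dicts; the real ones are built
    # by the fmcs-f7 scripts
    fields = [{"field": "name", "type": "String"}]
    data_1 = {1: {"name": "acme staffing"}, 2: {"name": "beta corp"}}
    data_2 = {"a": {"name": "acme staffing inc"}, "b": {"name": "beta corporation"}}

    # first run (slow case): label, train, write a settings file, then score/cluster
    linker = dedupe.RecordLink(fields)
    linker.prepare_training(data_1, data_2)
    dedupe.console_label(linker)  # interactive labeling
    linker.train()
    with open("learned_settings", "wb") as sf:
        linker.write_settings(sf)
    links = linker.join(data_1, data_2, threshold=0.5)  # very slow here

    # second run (fast case): load the settings file written by the first run
    with open("learned_settings", "rb") as sf:
        static_linker = dedupe.StaticRecordLink(sf)
    links = static_linker.join(data_1, data_2, threshold=0.5)  # much faster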

caligoig commented Mar 3, 2022

I can confirm the same is happening with my own data. Once the settings file created in a previous run is loaded, scoring and clustering is much faster.

adamzev commented Jun 2, 2022

I played with this a bit. It seems the difference in runtime starts in the fillQueue function; however, as far as I could tell, the inputs to that function were the same both times. Much more memory was in use when doing the training and onward (according to psutil.Process(os.getpid()).memory_info().rss), so that could have something to do with the performance difference.

Changing the chunk_size parameter of fillQueue from 20,000 to 1,000 seemed to greatly improve performance on the run that had just done the training, and slightly improve performance on the run that loaded the settings file.
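
For reference, the memory reading above can be taken with something like this (psutil is the only extra dependency; where exactly to sample around the scoring step is up to you):

    import os
    import psutil

    def rss_mb():
        # resident set size of the current process, in megabytes
        return psutil.Process(os.getpid()).memory_info().rss / 1e6

    print(f"RSS before scoring: {rss_mb():.0f} MB")
    # ... run the scoring / clustering step here ...
    print(f"RSS after scoring:  {rss_mb():.0f} MB")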

In order to get this to run on my computer, I reduced the data size by adding:

    data_d = readData(input_file)
    # keep only the first 3,000 records to shrink the dataset
    data_d = {k: data_d[k] for k in list(data_d)[:3000]}

fgregg commented Jun 2, 2022

thanks for this!

fgregg commented Jun 2, 2022

this makes me think that the data model is not getting cleaned up (related to #956). I would have thought the fixes for that would have addressed this too, but maybe not.
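
(one thing that might be worth trying, purely as a guess and not a confirmed fix: dedupe's active learners expose a cleanup_training() method that drops the training pairs, so calling it between train() and scoring might show whether leftover training state is what's eating the time and memory.)

    # guess at a workaround, not a confirmed fix: drop training state before scoring
    linker.train()
    linker.cleanup_training()  # frees the labeled / candidate training pairs

    with open("learned_settings", "wb") as sf:
        linker.write_settings(sf)

    links = linker.join(data_1, data_2, threshold=0.5)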
