
scoring pairs is much slower after training than after loading settings file #977

Open
fgregg opened this issue Mar 1, 2022 · 4 comments

fgregg commented Mar 1, 2022

this is going to be a pain to debug, i think.


To reproduce:

Get code for this linking project: https://github.com/labordata/fmcs-f7/tree/37e6e805ceb6ec8dee7844fbe7f45b71609066ad

make update_raw
rm link.csv
make link.csv

this will train dedupe and then do the scoring and clustering. the scoring and clustering will be very slow.

rm link.csv
make link.csv

this will use the settings file created in the previous run, and scoring and clustering will be much faster.
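
(for context, the two runs roughly map onto the two standard dedupe code paths sketched below. this is a minimal sketch with stand-in field definitions, record dicts, and file names, not the actual fmcs-f7 code, and it assumes the RecordLink / StaticRecordLink API.)

    import dedupe

    # stand-in field definition and toy record dicts; the real ones are built
    # by the fmcs-f7 scripts
    fields = [{"field": "name", "type": "String"}]
    data_1 = {1: {"name": "acme staffing"}, 2: {"name": "beta corp"}}
    data_2 = {"a": {"name": "acme staffing inc"}, "b": {"name": "beta corporation"}}

    # first run (slow case): label, train, write a settings file, then score/cluster
    linker = dedupe.RecordLink(fields)
    linker.prepare_training(data_1, data_2)
    dedupe.console_label(linker)  # interactive labeling
    linker.train()
    with open("learned_settings", "wb") as sf:
        linker.write_settings(sf)
    links = linker.join(data_1, data_2, threshold=0.5)  # very slow here

    # second run (fast case): load the settings file written by the first run
    with open("learned_settings", "rb") as sf:
        static_linker = dedupe.StaticRecordLink(sf)
    links = static_linker.join(data_1, data_2, threshold=0.5)  # much faster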

caligoig commented Mar 3, 2022

I can confirm the same is happening with my own data. Once the settings file created in a previous run is loaded, scoring and clustering is much faster.

adamzev commented Jun 2, 2022

I played with this a bit. It seems the difference in runtime starts in the fillQueue function; however, as far as I could tell, the inputs to that function were the same both times. Much more memory was in use when doing the training and onward (according to psutil.Process(os.getpid()).memory_info().rss), so that could have something to do with the performance difference.

Changing the chunk_size parameter of fillQueue from 20,000 to 1,000 seemed to greatly improve performance on the run that had just done the training, and slightly improve performance on the run that loaded the settings file.
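
For reference, the memory reading above can be taken with something like this (psutil is the only extra dependency; where exactly to sample around the scoring step is up to you):

    import os
    import psutil

    def rss_mb():
        # resident set size of the current process, in megabytes
        return psutil.Process(os.getpid()).memory_info().rss / 1e6

    print(f"RSS before scoring: {rss_mb():.0f} MB")
    # ... run the scoring / clustering step here ...
    print(f"RSS after scoring:  {rss_mb():.0f} MB")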

In order to get this to run on my computer, I reduced the data size by adding:

    data_d = readData(input_file)
    # keep only the first 3,000 records to shrink the dataset
    data_d = {k: data_d[k] for k in list(data_d)[:3000]}

fgregg commented Jun 2, 2022

thanks for this!

fgregg commented Jun 2, 2022

this makes me think that the data model is not getting cleaned up (related to #956). I would have thought the fixes for that would have addressed this too, but maybe not.
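
(one thing that might be worth trying, purely as a guess and not a confirmed fix: dedupe's active learners expose a cleanup_training() method that drops the training pairs, so calling it between train() and scoring might show whether leftover training state is what's eating the time and memory.)

    # guess at a workaround, not a confirmed fix: drop training state before scoring
    linker.train()
    linker.cleanup_training()  # frees the labeled / candidate training pairs

    with open("learned_settings", "wb") as sf:
        linker.write_settings(sf)

    links = linker.join(data_1, data_2, threshold=0.5)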
