-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
extend edges method #2651
extend edges method #2651
Conversation
c00d375
to
66ca353
Compare
@hfr1tz3 I think you need to add
should do it. |
Nice fix! That looks like an annoying one to find. I have a suggestion; see my comment. |
TODO:
|
Reminder: once c code is edited, to test changes locally you've got to run |
811453d
to
0301bee
Compare
OK - I think this is ready to go. FYI, we loop over 'edges in' and 'edges out', and it seems like perhaps we should be building the trees there (and thus using the new C tree iteration stuff), but I think not, actually, here's why: the way the algorithm works is it looks for single edges Hm, so this does mean that this is O(k^3) where |
I've had a good think about how to use the trees to make the algorithm not O(k^3), and it's possible but not obvious (to me yet). Unless it becomes obvious I'd like to wait to see if it's an problem in practice. And, I don't know what's up with codecov? The actions say they've uploaded the results, e.g. to https://app.codecov.io/github/tskit-dev/tskit/commit/8973c7b86ada7a19decd9f88031ea045dfd95e6a but at that link it says that
|
I'm not seeing the codecov report for some reason, but here it is for this branch: TODO:
|
Note: to run the code on this branch locally (in, say, a different project):
This should replace tskit with this one; to make sure you're using the right version you can first uninstall tskit, so if it didn't work then it'll be obvious. |
TODO:
|
I am trying to run new empirical tests with our edge extend, but I am getting an error for some reason. |
Can you post the code that produces that error, and/or the information above the error that says at whichi point the error occurs? |
In last_truth on our num_edges repository cell 3, I installed our extend edges method and tried to run the code. However, I think I may be just calling the function wrong. I will look into it more tomorrow |
I have - after quite a bit of staring at things - modified the algorithm to use the new "tree position" method of iterating over trees. (The staring did result in better code and more robust tests, however.) However, I need some input here - probably from @jeromekelleher? So, right now (a) the python implementation in test_topology.py uses "tree position", but (b) the c version in The |
I see where you're coming from @petrelharp, but I don't see the point of making this an in-place TableCollection method. Why not a TreeSequence method that returns a new TreeSequence? You need everything you need for the tables to be a valid TreeSequence anyway. This is never going to be as performance-critical as (say) simplify, so the extra copy incurred isn't going to amount to anything. I think that would simplify things a bit? The implementation of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't got my head around the algorithm fully, but it's much easier to understand now!
I vote for ditching the tsk_table_collection
version - surely this is a premature optimisation?
c/tskit/trees.c
Outdated
remove_unextended(&edges_in_head, &edges_in_tail); | ||
remove_unextended(&edges_out_head, &edges_out_tail); | ||
|
||
for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += sign_int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use
for (tj = tree_pos.out.start; tj != tree_pos.out.stop; tj += tree_pos.direction) {
}
which is partly the intention of having direction
done like it is.
I totally agree that the performance here isn't critical. But I do think that there's a conceptual hole here: how do we write a method that modifies tables in place and uses the trees to do so? (I mean, there's no technical barrier to that, but that's not how the API is set up.) A good example of something where that's exactly what we need to do is |
It seems like the way to do this (eg in |
I think the number of things we need to do with the tables in place that require iterating over the trees will be pretty small, going forward. So, I'd vote we just leave stuff like We don't have to depend on the tree sequence for the tree_position thing --- we're really only using the breakpoints array, which we only really need for long-range jumping. But, I think the abstraction is getting very leaky when we do things on both the tree sequence and table collection, just for the sake of avoiding a copy, or creating a |
Seems like then if we were to write something like |
Never mind, since we wouldn't be able to do the no-copy thing in python really anyhow. |
Okay - this is ready for review, @jeromekelleher. valgrind reports a leak in the C tests related to the |
I've gone over this and refactored things a bit. The memory management at the C level was really quite tricky, but hopefully it's settled now. Some high-level stuff:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments. I'm not sure what the right pattern will be in order to update the mutation table too, but I guess that'll take some figuring out first.
tsk_id_t e, e1, e2, e_in; | ||
tsk_blkalloc_t edge_list_heap; | ||
double *near_side, *far_side; | ||
edge_list_t *edges_in_head, *edges_in_tail; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I was starting this from scratch, I would't bother with linked lists and do something like this instead:
tsk_bool_t extended = calloc(num_edges, sizeof(*extended));
tsk_id_t *edges_in = mallco(num_edges, sizeof(*edges_in));
tsk_id_t *edges_out = mallco(num_edges, sizeof(*edges_out));
tsk_size_t num_edge_in, num_edges_out;
and keep track of the lists like that. Repacking to get rid of unextended edges will be slightly trickier, but there's less faddling around with linked lists, memory allocation and so on, and it would almost certainly be more efficient.
But like I say - you might not want to get into this, since it's working.
One thing that might not be immediately obvious - I made a new file |
Oh riiiight, MUTATIONS. Thanks very much; I'll have a go at this. |
Ah, I see about the memory management. Makes sense. |
b3de586
to
ebb79ac
Compare
TODO: put mutations on the edges; proposed method:
|
90139ee
to
cbf2dda
Compare
Okay! Assuming I can get the CI errors to go away, I think this is ready to go:
Edit: anyhow know what's going on with the doc build? |
40419ff
to
75f4b10
Compare
This is ready for review: @jeromekelleher? @benjeffery? |
Codecov Report
@@ Coverage Diff @@
## main #2651 +/- ##
==========================================
- Coverage 89.75% 89.69% -0.07%
==========================================
Files 30 30
Lines 29902 30160 +258
Branches 5803 5860 +57
==========================================
+ Hits 26840 27053 +213
- Misses 1755 1778 +23
- Partials 1307 1329 +22
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
|
Darn it, looks like we need more test coverage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm not seeing much you can do to update coverage.
Two minor comments:
- We should probably do some checks on max_iter in the C code. This will actually make it easier to bump up test coverage in, e.g., the Python C API
- Do we really want to error on unspecified mutation times? It would probably be simpler to "min" impute if unknown, and that's what people will end up doing most of the time anyway?
Edit - whoops, just saw your point about impute mutation times above. Ignore comment 2 then!
# their time; requires mutation times be nonmissing and the mutation times | ||
# be >= their nodes' times. | ||
|
||
assert np.all(~tskit.is_unknown_time(mutations.time)), "times must be known" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use ts.impute_mutation_times
on the input ts instead of the hard requirement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could, but as noted above let's leave this to the end user (we have a note about this in the docstring)
Note: tskit-dev/tsdate#317 depends on this. |
I think this is ready to go, except for the doc build failure. Codecov says that I'm not testing the
line, but I totally am - not sure what's up with that? Edit: Well, I added a C test for this; should be okay now. |
8fc98b1
to
b9b0f16
Compare
Note: I've also done like tskit-dev/msprime#1395 in this PR to get it to pass the docs build. |
Okay! I think this is ready to merge! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just spotted on mistake.
I think we can squash down to one commit (or 3 max, if you really want to keep my changes?)
37f5387
to
ae44e89
Compare
Okay! I fixed that problem, added another C test that codecov pointed out, rebased & squashed; I think we're ready to go! |
Here we (me and @hfr1tz3 and @avabamf) have implemented the "extend_edges" and related things in tskit.
Here's the description of what this does (copied from the draft docstring):