Use implicit gradient `Op`s #1275
Comments
@rlouf, did you ask about this approach (i.e. using rewrites for gradients) in a Discussion or comment in Gitter? If so, let's link that here so we can highlight more potential connections, use-cases, considerations, etc.
This Aeppl PR highlighted one of the mentioned disadvantages of eager gradient construction. A similar issue arises for the gradient of the Softmax: #679
I remember it was mentioned during the last Aesara meeting, and if I find other discussions I'll link them here.
Thanks; these are exactly what we needed to build a larger case for this change. At this point, this seems like the only way to move some shape-dependent/critical logic to compile-time, where the relevant information is actually available. The underlying idea is that it should be possible for all `Op`s to defer such logic to rewrite time; the primary thing preventing that from happening is the errant designs employed by the broadcasting-related `Op`s.
Before I forget, we can also use the gradient methods offered by our transpilation target languages in this case. For example, we could convert an un-expanded/inlined gradient `Op` directly into a call to the target language's own differentiation machinery (e.g. JAX's `jax.grad`).
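Purely as an illustration of that idea (and not something the Aesara JAX backend currently does): if the cost sub-graph feeding an un-expanded gradient `Op` has already been transpiled to a JAX callable, the backend itself could supply the derivative via `jax.grad`, with no explicit Aesara gradient graph at all. The `cost_fn` below is a hypothetical stand-in for such a transpiled sub-graph:

```python
import jax
import jax.numpy as jnp


def cost_fn(x):
    # Stand-in for the JAX-transpiled cost sub-graph.
    return jnp.sum(x**2)


# The target language supplies the derivative; Aesara never builds an explicit
# gradient graph for it.
grad_fn = jax.grad(cost_fn)
print(grad_fn(jnp.array([1.0, 2.0, 3.0])))  # [2. 4. 6.]
```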
We can delay the construction of explicit gradient graphs (i.e. use of `Op.grad` and the like) by employing implicit gradient `Op`s that are later replaced with explicit sub-graphs (e.g. similar to how `OpFromGraph`s can be "in-lined"). The approach would look as follows:
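Whatever originally followed here (the concrete steps or example) did not survive the page extraction. As a stand-in, here is a minimal sketch of the idea, assuming current Aesara APIs (`Op`, `Apply`, `aesara.gradient.grad`, `node_rewriter`); the `Grad` placeholder class and the `expand_grad` rewrite are hypothetical names, not the author's actual implementation:

```python
from aesara.gradient import grad
from aesara.graph.basic import Apply
from aesara.graph.op import Op
from aesara.graph.rewriting.basic import node_rewriter


class Grad(Op):
    """Hypothetical implicit gradient `Op`: a placeholder for d(cost)/d(wrt)."""

    def make_node(self, cost, wrt):
        # The output stands in for the eventual gradient, so it has `wrt`'s type.
        return Apply(self, [cost, wrt], [wrt.type()])

    def perform(self, node, inputs, output_storage):
        # Never evaluated: the placeholder must be expanded during rewriting.
        raise NotImplementedError("`Grad` must be expanded before compilation")


@node_rewriter([Grad])
def expand_grad(fgraph, node):
    # Replace the placeholder with the explicit gradient sub-graph.  Because this
    # runs at rewrite time, compile/rewrite-time information (e.g. `ShapeFeature`s)
    # is available when the `Op.grad`/`Op.L_op` implementations are finally called.
    cost, wrt = node.inputs
    return [grad(cost, wrt)]
```

User code would then build `Grad()(cost, wrt)` instead of calling `aesara.grad` eagerly, and the registered rewrite would "in-line" the real gradient graph later, much like `OpFromGraph` in-lining.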
This also has the effect of enabling rewrites on gradient expressions and of providing more shape information to our gradient implementations.
For instance, this could be used to remove shape inference responsibilities and requirements from some `Op.make_node` and `Op.grad` implementations (e.g. `Elemwise.L_op`) by allowing access to `ShapeFeature`s and other compile/rewrite-time-only information. Simply put, this is probably the easiest (and even best) way to guarantee that symbolic gradient implementations will always have the most `Type` and shape information available, and all without wastefully cloning shape graphs and re-performing rewrites (e.g. constant folding) on them.

This approach was proposed in Theano/Theano#4452 and might also help with #682. As mentioned in the latter, we need to think carefully about when we make implicit gradient `Op`s explicit (i.e. "expand" them). Depending on exactly which rewrites are applied and when, the resulting gradient graphs could be quite different and have distinct and possibly unexpected numerical properties.

To keep things simple, we can expand implicit `Op`s right after the first pass of basic canonicalizations, so that shape inference/`ShapeFeature` is useful and other rewrites (e.g. specializations) won't get in the way. If this approach helps with #682, then great, but, if not, I don't know if we should get into the details of further delayed or staged expansions just yet. Regardless, we'll have the machinery available to do that whenever we want.

N.B. Performing expansions in this way still changes our gradient results so that they're dependent on our canonicalizations. In some ways, this relationship sounds good, since it seems to imply that the set of graphs we would be dealing with from then on would be more "regular". Over time, we could converge on a more concentrated and effective set of stabilizing rewrites for the exact kinds of gradients that our implementations and canonicalizations tend to produce, because we would have to deal less with the particulars of "random" user-formulated graphs.
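To make the `ShapeFeature` and expansion-timing points concrete, here is a hedged variant of the earlier `expand_grad` sketch (with `Grad` being the hypothetical placeholder `Op` from above). The `fgraph.shape_feature`/`shape_of` access reflects how `ShapeFeature` attaches itself to a graph, while the registration position and tags are assumptions that would have to be matched to Aesara's actual rewrite database so the expansion runs right after the first canonicalization pass:

```python
from aesara.compile.mode import optdb
from aesara.gradient import grad
from aesara.graph.rewriting.basic import in2out, node_rewriter


@node_rewriter([Grad])  # `Grad` is the hypothetical placeholder `Op` from above
def expand_grad(fgraph, node):
    cost, wrt = node.inputs
    # Rewrite-time-only information: when a `ShapeFeature` is attached to the
    # graph, its already-canonicalized (and constant-folded) shape graphs can be
    # consulted instead of re-deriving shapes inside `Op.grad` implementations.
    shape_feature = getattr(fgraph, "shape_feature", None)
    if shape_feature is not None:
        wrt_shape = shape_feature.shape_of.get(wrt)  # illustrative only
    return [grad(cost, wrt)]


# Expand right after the first pass of basic canonicalizations; the position
# value below is an assumption, not Aesara's actual database layout.
optdb.register(
    "expand_implicit_gradients",
    in2out(expand_grad),
    "fast_run",
    "fast_compile",
    position=0.11,
)
```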