Possible incorrect definition of RDF? #1243

mphoward · 2024-04-26T01:47:46Z

mphoward
Apr 26, 2024
Collaborator

Hi All,

I am opening this as a discussion rather than an issue because I'm not sure how to create a reproducer for this, but I was reading the RDF code to explain something to someone, and I noticed that the query points and points might be used backward here:

freud/cpp/density/RDF.cc

Lines 60 to 69 in 8353049

    
    // Define prefactors with appropriate types to simplify and speed later code.
    auto const nqp = static_cast<float>(m_n_query_points);
    float number_density = nqp / m_box.getVolume();
    if (m_norm_mode == NormalizationMode::finite_size)
    {
        number_density *= static_cast<float>(m_n_query_points - 1) / static_cast<float>(m_n_query_points);
    }
    auto np = static_cast<float>(m_n_points);
    auto nf = static_cast<float>(m_frame_counter);
    float prefactor = float(1.0) / (np * number_density * nf);

In an RDF calculation, I would typically expect the number density to be computed as that of an ideal gas of potential neighbor points (which I think should be the np points), then the average would be taken over all origins (which I think are the nqp query points). Apologies in advance if I am confusing the meaning of points and query points here!

If points and query points are reversed in the normalization, I don't think it would actually cause an error in typical use cases because np and number_density are multiplied together, so it doesn't really matter which one is divided by volume. The reversal could be a problem for the finite-size renormalization, which I think should be using (np-1)/np instead of (nqp-1)/nqp, but this option is usually only used when points and query points are the same anyway.
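The cancellation described above can be checked numerically. This is a minimal Python sketch that mirrors the quoted prefactor arithmetic; it is not freud code. The function name `rdf_prefactor`, the `swapped` flag, and the system sizes are all hypothetical, chosen only for illustration.

```python
import math


def rdf_prefactor(n_points, n_query_points, volume, n_frames,
                  finite_size=False, swapped=False):
    """Mirror the quoted prefactor arithmetic (illustrative, not freud's API).

    swapped=False: density from query points, average over points (as quoted).
    swapped=True:  density from points, average over query points.
    """
    if swapped:
        n_density, n_average = n_points, n_query_points
    else:
        n_density, n_average = n_query_points, n_points

    number_density = n_density / volume
    if finite_size:
        # Finite-size factor applied to whichever count defines the density.
        number_density *= (n_density - 1) / n_density
    return 1.0 / (n_average * number_density * n_frames)


# Hypothetical system: 100 points, 10 query points, volume 1000, 1 frame.
default_current = rdf_prefactor(100, 10, 1000.0, 1)
default_swapped = rdf_prefactor(100, 10, 1000.0, 1, swapped=True)
finite_current = rdf_prefactor(100, 10, 1000.0, 1, finite_size=True)
finite_swapped = rdf_prefactor(100, 10, 1000.0, 1, finite_size=True, swapped=True)

# The default prefactors agree; the finite-size ones do not.
print(math.isclose(default_current, default_swapped))
print(math.isclose(finite_current, finite_swapped))
```

Without the finite-size factor, both orderings reduce to V / (np * nqp * nf), so the choice is invisible. With it, the current ordering gives V / (np * (nqp - 1) * nf) and the swapped one gives V / (nqp * (np - 1) * nf), which coincide only when np equals nqp.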

Just wanted to bring it to y'all's attention to have a look at. Thanks!

tommy-waltmann · 2024-04-26T17:04:41Z

tommy-waltmann
Apr 26, 2024
Maintainer

Hi @mphoward ,

As a little blast from the past, have a look at the discussion on #396 from ~5 years ago about the normalization on the RDF and why freud has to do it the way it does. If that doesn't clear it up, then we can keep discussing it here.

3 replies

mphoward Apr 26, 2024
Collaborator Author

Yes, I remember our discussion on that issue! My question isn't actually about how the normalization is done, but rather if the formulas used might have the two variables backwards. Specifically, why is the number density computed using the query points rather than the points? From that issue, for example, the density uses the variable m_n_p in the old code:

freud/cpp/density/RDF.cc

Line 107 in b483a83

float ndens = float(m_n_p) / m_box.getVolume();

I would think that maps to m_n_points in the new code and not m_n_query_points. But, maybe I have the concepts of points and query points backwards.

tommy-waltmann Apr 26, 2024
Maintainer

After doing a little more searching, that code was changed in response to issue #1037 and fixed on PR #1038

mphoward Apr 26, 2024
Collaborator Author

Yeah, I remember that too and that fix makes sense since you want to be normalizing by number of origins and number of frames for the cumulative distribution.

The lines I’m referring to are for the RDF itself and would be a similar error (normalizing by points rather than query points). I think there is a fortunate cancellation of error for the RDF, though, so I’m not sure if you could actually trigger a bug with it in most normal use cases.

joaander · 2024-04-29T16:12:39Z

joaander
Apr 29, 2024
Maintainer

@mphoward see section 2.1 of this paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3085256/ for a general definition of RDF normalization. Note that the N*(N-1) normalization is only valid when points and query_points are the same set. Freud is currently correct in that case, even if the variable name "number_density" is misleading.

What you are implicitly asking for is more general support for the N_pairs normalization which we decided is out of scope for freud proper in all those previous issues/pull requests that Tommy mentioned. It is well defined what freud computes, so users can renormalize appropriately when they know the number of unique pairs between their query_points and points.

If we were to swap query points and points in this normalization, then freud would still be correct for finite size normalization when query_points is the same as points. It would differ for other cases.

For example, if I had N points and chose 1 of those points as a query point, the normalization should be 1/(N-1), which would be correct if we swapped the two as you propose (as there are N-1 pairs of unique distance calculations). However, there are many other examples where this is not correct: e.g. choose 2 points as query points, or 1 query point from the set of points and 1 query point from outside that set. In these cases, neither ordering is correct as the computation of N_pairs is much less trivial.

To reiterate: in these cases, freud users should not use finite size normalization and should instead apply the appropriate correction factor (N_points * N_query_points / N_unique_pairs) when needed. Freud's data model is not aware of the identity of points, so it cannot evaluate the number of unique pairs between query_points and points.

Some users may be relying on the current behavior of finite size normalization and renormalizing by (N_points * (N_query_points-1) / N_unique_pairs), so I wouldn't want to change it.
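The two correction factors mentioned above can be written as one small helper. This is only a sketch: `pair_correction` is a hypothetical name, the g(r) values are made up, and the caller is assumed to know N_unique_pairs for their own system, since freud cannot determine it.

```python
import math


def pair_correction(n_points, n_query_points, n_unique_pairs, finite_size=False):
    """Factor to multiply freud's g(r) by to obtain an N_pairs normalization.

    finite_size=False: N_points * N_query_points / N_unique_pairs
    finite_size=True:  N_points * (N_query_points - 1) / N_unique_pairs
    """
    if n_unique_pairs <= 0:
        raise ValueError("n_unique_pairs must be positive")
    n_q = n_query_points - 1 if finite_size else n_query_points
    return n_points * n_q / n_unique_pairs


# Example from the text: N = 50 points, 1 of them used as the query point,
# so there are N - 1 = 49 unique pairs (default normalization mode).
factor = pair_correction(50, 1, 49)

# Made-up g(r) values standing in for a default-mode freud result.
g_freud = [0.0, 0.49, 0.98]
g_corrected = [factor * g for g in g_freud]

# When query_points and points are the same set, N_unique_pairs = N * (N - 1)
# and the finite-size factor is 1, i.e. no correction is needed.
same_set = pair_correction(50, 50, 50 * 49, finite_size=True)
print(math.isclose(same_set, 1.0))
```

The factor multiplies the computed g(r) because freud's default prefactor divides by N_points * N_query_points where the pair-based definition divides by N_unique_pairs.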

I think the documentation already makes this behavior clear by stating:

For correct normalization behavior when using normalization_mode='finite_size', the points must be the same as the query_points and exclude_ii must be set to False.

If you think the documentation can be improved in this regard, feel free to make a pull request.

4 replies

mphoward Apr 29, 2024
Collaborator Author

@joaander Sorry, I think maybe my title was not clear on this discussion post. The definition I'm referring to is strictly in the variables defined in the C++ code. I am 100% fine with the way freud chooses to define the finite size normalization. It is clear when that can be used, and I don't think there is any error in the code for the documented use cases.

I opened this discussion because I found the C++ code confusing as written because points and query points are backwards in the implementation compared to how the RDF would normally be calculated (in my experience). For example, this is a paraphrase of the pseudocode for normalizing g(r) from Frenkel and Smit:

number_density = num_particles / volume
num_ideal = bin_volume * number_density
g = counts / (num_frames * num_particles * num_ideal)

The multicomponent extension (correlation of type j around type i) would be:

number_density_j = num_particles_j / volume
num_ideal_j = bin_volume * number_density_j
g_ij = counts / (num_frames * num_particles_i * num_ideal_j)

As I understand the notation of query points and points, the equivalent would be to make type i the query points and type j the points, so this is the same as:

number_density = num_points / volume
num_ideal = bin_volume * number_density
g = counts / (num_frames * num_query_points * num_ideal)

However, the equivalent implementation in the code currently is:

number_density = num_query_points / volume
num_ideal = bin_volume * number_density
g = counts / (num_frames * num_points * num_ideal)

This is equivalent to my pseudocode, but I think it is confusing because num_frames * num_points and num_ideal aren't actually referring to their physical variables. Put another way, the RDF should give the correlations in how points are distributed around query_points, so the ideal gas used for normalization should be of points.
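As a runnable version of the pseudocode above, here is a minimal Python sketch of the points / query points form. It assumes a 3D system with spherical-shell bins; the function names and the numbers are hypothetical and this is not freud's implementation.

```python
import math


def shell_volumes(bin_edges):
    """Volumes of the 3D spherical shells between consecutive bin edges."""
    return [
        4.0 / 3.0 * math.pi * (hi**3 - lo**3)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:])
    ]


def normalize_rdf(counts, bin_edges, n_query_points, n_points, volume, n_frames):
    """Normalize pair counts into g(r) using an ideal gas of the points."""
    number_density = n_points / volume  # density of potential neighbors
    g = []
    for count, bin_volume in zip(counts, shell_volumes(bin_edges)):
        num_ideal = bin_volume * number_density  # ideal-gas neighbors per origin
        g.append(count / (n_frames * n_query_points * num_ideal))
    return g


# Hypothetical setup: 5 query points, 20 points, volume 1000, 2 frames.
edges = [0.0, 1.0, 2.0, 3.0]
n_q, n_p, vol, n_f = 5, 20, 1000.0, 2

# Counts an ideal gas would produce on average; these should give g = 1.
ideal_counts = [n_f * n_q * bv * n_p / vol for bv in shell_volumes(edges)]
g = normalize_rdf(ideal_counts, edges, n_q, n_p, vol, n_f)
print(g)
```

Here each intermediate keeps its physical meaning: `number_density` is the density of the points, and the average is taken over the query points and frames.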

I wasn't sure I was interpreting points and query points right though, so I wanted to ask before I opened a tiny PR to revise these variable definitions in the C++ code. I would swap where num_points and num_query_points are used in calculating the number density / default normalization, but leave the finite size correction in terms of num_query_points since that is the documented convention.

joaander Apr 29, 2024
Maintainer

I appreciate your attention to detail, but would prefer not to spend hours and many pages of text debating every 3 lines of code. Yes, naming variables is hard - but our time in this world is finite. If this bit of code is important to you, then feel free to submit a pull request that cleans it up. I would strongly prefer a version that does away with number_density entirely and computes the two special cases of N_pairs as outlined in the VMD paper I referenced. This leaves open the possibility to expand RDF to finite normalization in more complex cases in the future should someone desire it (e.g. manual specification of N_pairs, a new data model, ...).

And yes, in the language of freud, points are a sea of positions and query_points are the points in space about which you measure quantities of that sea. For example, in a local density estimation points would be the particle positions and query_points would be voxel positions.

mphoward Apr 29, 2024
Collaborator Author

OK, can do when I get a chance. I figured I would bring it to your attention since we ran into a normalization issue using freud's RDF in a non-standard use case (filtering bond exclusions), and I had to read the source code & check against equations to understand the mistake we were making. Hence seeing the issue but wanting to ask about it first before just changing things.

joaander Apr 30, 2024
Maintainer

Yes, the docs could use an example on how to correct the normalization for the general case.
