Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly specify DB connection when we need to, and don't when we don't need to #8224

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

chris48s
Copy link
Member

@chris48s chris48s commented Feb 4, 2025

Refs https://app.asana.com/0/1204880927741389/1208838937595665/f

This PR is a bit of a mashup of several different things. There are 3 core things I'm doing in this PR.

Prior to #8221 our "solution" to this problem was to pepper .using() and using= across scripts we thought we might want to run via SSH. #8221 should make this less necessary (but not completely unnecessary - we'll come on to this) so one of the things I'm doing in this PR is removing a lot of that stuff from places where we had done it explicitly and relying on the DB router to route queries to the right DB.

Then that said.. Django's DB routers are a somewhat leaky abstraction. They apply to ORM operations, but that doesn't mean they apply to all database I/O. Specifically there are 2 things we need to worry about:

connection.cursor

This one is kind of obvious if you think about it. The DB router expects a model class as one of the inputs so anything where you're running raw SQL doesn't make sense in that world. There is also no way to enforce allow_relation if we're working with raw SQL so if you think about it, it is obvious that the DB routers won't work if you're using raw SQL.
There's a thread about this on the django community forum https://forum.djangoproject.com/t/allow-database-router-to-override-default-connection/34529
Anyway, I can't see this changing soon. Also, we use raw SQL in several places in this application. Usually for valid reasons. So we do need to think about this.
In this PR, I've broadly done 2 things:

  • Where it is possible to remove raw SQL and replace with ORM calls I have done that. Replacing raw SQL where we don't strictly need it puts us back on the happy path and allows the DB routers to do their thing.
  • Where it is not possible to remove raw SQL (either because the ORM doesn't implement stuff we need or for performance reasons) I've made sure we are explicitly specifying the DB connection when using .cursor() to execute SQL because if we don't explicitly specify it, django just uses the default DB connection.

transaction.atomic

This one is perhaps less obvious.
Regardless of what your database routers say, transaction atomic blocks use the default DB connection unless you explicitly specify it with using=.

https://github.com/django/django/blob/5a2c1bc07d126ce32efaa157e712a8f3a7457b74/django/db/transaction.py#L182-L183
https://github.com/django/django/blob/5a2c1bc07d126ce32efaa157e712a8f3a7457b74/django/db/transaction.py#L18-L25

In our case, because the default DB is the replica rather than the primary in production this means that when we write code like this:

@transaction.atomic
def do_the_things():
    this = This(foo='bar')
    that = That(foo='bar')
    this.save()
    that.save()

The transaction applies to database IO over the replica connection but the writes we are doing are happening on the primary.
So it looks like that code is safely wrapped in a transaction, but it isn't really when we run it in prod 😱

Maybe this is telling us we should actually make the primary DB the default connection? (I accept there are tradeoffs either way round). For now I have chosen to just explicitly pass a using= to every transaction.atomic so that transactions actually working as intended·

Anyway, the point is: If we're using raw SQL or transactions it is up to us to manually specify what connection we're using or django will just use the default and it doesn't matter what we put in DB routers.

What I haven't attempted to do in this PR is prevent us from writing wrong code in future. That is another issue/PR.

The scope of this PR is only to make sure all the code we have already written is talking to the correct DB and remove code that is no longer needed following #8221

I have attempted to split this down into commits which are easy to digest one at a time.

we still need to specify the connection on
- raw SQL
- transactions
but db routers will handle the ORM calls now
We can write these queries using the ORM
so lets just do that
this file uses both, so I did this in its own commit for clarity
I don't think this was doing anything useful
@chris48s chris48s marked this pull request as draft February 4, 2025 16:34
Comment on lines +35 to +36
def get_principal_db_connection():
return connections[get_principal_db_name()]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of notes on this.

  1. I added this function as discussed, but given we have to pass the connection name in as a string when we use transaction.atomic() I'm not sure it is actually that useful. It might be more consistent to just use the string name only for both purposes.
  2. To make this PR something we can review independently of the commits in Only serve from replica when there is a request in scope #8221 I've slightly falsified the commit history here. We'll need to resolve a conflict with that PR if we keep the function. When resolved, the "real" diff for this file will be:
diff --git a/polling_stations/db_routers.py b/polling_stations/db_routers.py
index 0a03ff41..dd41e3ca 100644
--- a/polling_stations/db_routers.py
+++ b/polling_stations/db_routers.py
@@ -3,6 +3,7 @@ import os
 from django.conf import settings
 from django.db import DEFAULT_DB_ALIAS
 from django_middleware_global_request import get_request
+from django.db import connections
 
 
 PRIMARY = settings.PRINCIPAL_DB_NAME
@@ -54,3 +55,7 @@ def get_principal_db_name():
     if PRIMARY in settings.DATABASES:
         return PRIMARY
     return REPLICA
+
+
+def get_principal_db_connection():
+    return connections[get_principal_db_name()]

Comment on lines -222 to 223
@transaction.atomic
def send_error_email(self):
subject = "File upload failed"
message = f"File upload failure: {self}. Please investigate further."
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell from reading this function, this transaction wasn't really doing anything so I just removed it.

Comment on lines -119 to 122
cursor = connection.cursor()
cursor = get_principal_db_connection().cursor()
cursor.execute(
"""
SELECT a.uprn, d.polling_station_id
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially the 2 SQL statements in this file could be rewritten using the ORM. I think you could probably replace
get_uprns_by_district() with either

def get_uprns_by_district(self):
    districts = PollingDistrict.objects.filter(
        council_id=self.council_id, area__contains=OuterRef("location")
    ).values("polling_station_id")[:-1]

    return (
        Address.objects.filter(
            uprntocouncil__lad=self.gss_code,
        )
        .annotate(polling_station_id=Subquery(districts))
        .values_list("uprn", "polling_station_id")
    )

OR

def get_uprns_by_district(self):
    districts = PollingDistrict.objects.filter(
        council_id=self.council_id, area__contains=OuterRef("location")
    ).values("polling_station_id")

    return (
        Address.objects.filter(
            uprntocouncil__lad=self.gss_code,
        )
        .annotate(polling_station_id=Subquery(districts))
        .values_list("uprn", "polling_station_id")
    )

but I couldn't quite get my head round which and there's about 10 years of thinking about edge cases that I don't want to untangle here. I settled for just specifying the connection and sticking with raw SQL for these two.

@chris48s
Copy link
Member Author

chris48s commented Feb 4, 2025

I wanted to flag a couple of places where I explicitly chose not to change anything:

  1. There are a handful of places in the councils app where we have explicit .using() and using= calls which I haven't unpicked in this PR. I think I'm going to leave those to query with Will when he is available. Leaving them how they are doesn't break anything.
  2. I've left the uses of connection.cursor() in https://github.com/DemocracyClub/UK-Polling-Stations/blob/master/polling_stations/apps/dashboard/views.py as they are. We don't enable these view in prod and reading from the replica is fine anyway because we're not reading something we might have just written, so.. meh

@chris48s
Copy link
Member Author

chris48s commented Feb 4, 2025

Refs DemocracyClub/uk-geo-utils#37

@chris48s chris48s marked this pull request as ready for review February 4, 2025 17:03
@chris48s chris48s requested a review from symroe February 10, 2025 10:06
@coveralls
Copy link

Coverage Status

coverage: 71.98% (+0.2%) from 71.811%
when pulling bb20cda on rawsql-20250204
into b379d74 on master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants