Explicitly specify DB connection when we need to, and don't when we don't need to #8224

chris48s · 2025-02-04T16:32:35Z

Refs https://app.asana.com/0/1204880927741389/1208838937595665/f

This PR is a bit of a mashup of several different things. There are 3 core things I'm doing in this PR.

Prior to #8221 our "solution" to this problem was to pepper .using() and using= across scripts we thought we might want to run via SSH. #8221 should make this less necessary (but not completely unnecessary - we'll come on to this) so one of the things I'm doing in this PR is removing a lot of that stuff from places where we had done it explicitly and relying on the DB router to route queries to the right DB.

Then that said.. Django's DB routers are a somewhat leaky abstraction. They apply to ORM operations, but that doesn't mean they apply to all database I/O. Specifically there are 2 things we need to worry about:

connection.cursor

This one is kind of obvious if you think about it. The DB router expects a model class as one of the inputs so anything where you're running raw SQL doesn't make sense in that world. There is also no way to enforce allow_relation if we're working with raw SQL so if you think about it, it is obvious that the DB routers won't work if you're using raw SQL.
There's a thread about this on the django community forum https://forum.djangoproject.com/t/allow-database-router-to-override-default-connection/34529
Anyway, I can't see this changing soon. Also, we use raw SQL in several places in this application. Usually for valid reasons. So we do need to think about this.
In this PR, I've broadly done 2 things:

Where it is possible to remove raw SQL and replace with ORM calls I have done that. Replacing raw SQL where we don't strictly need it puts us back on the happy path and allows the DB routers to do their thing.
Where it is not possible to remove raw SQL (either because the ORM doesn't implement stuff we need or for performance reasons) I've made sure we are explicitly specifying the DB connection when using .cursor() to execute SQL because if we don't explicitly specify it, django just uses the default DB connection.

transaction.atomic

This one is perhaps less obvious.
Regardless of what your database routers say, transaction atomic blocks use the default DB connection unless you explicitly specify it with using=.

https://github.com/django/django/blob/5a2c1bc07d126ce32efaa157e712a8f3a7457b74/django/db/transaction.py#L182-L183
https://github.com/django/django/blob/5a2c1bc07d126ce32efaa157e712a8f3a7457b74/django/db/transaction.py#L18-L25

In our case, because the default DB is the replica rather than the primary in production this means that when we write code like this:

@transaction.atomic
def do_the_things():
    this = This(foo='bar')
    that = That(foo='bar')
    this.save()
    that.save()

The transaction applies to database IO over the replica connection but the writes we are doing are happening on the primary.
So it looks like that code is safely wrapped in a transaction, but it isn't really when we run it in prod 😱

Maybe this is telling us we should actually make the primary DB the default connection? (I accept there are tradeoffs either way round). For now I have chosen to just explicitly pass a using= to every transaction.atomic so that transactions actually working as intended·

Anyway, the point is: If we're using raw SQL or transactions it is up to us to manually specify what connection we're using or django will just use the default and it doesn't matter what we put in DB routers.

What I haven't attempted to do in this PR is prevent us from writing wrong code in future. That is another issue/PR.

The scope of this PR is only to make sure all the code we have already written is talking to the correct DB and remove code that is no longer needed following #8221

I have attempted to split this down into commits which are easy to digest one at a time.

we still need to specify the connection on - raw SQL - transactions but db routers will handle the ORM calls now

We can write these queries using the ORM so lets just do that

this file uses both, so I did this in its own commit for clarity

I don't think this was doing anything useful

chris48s · 2025-02-04T17:01:00Z

polling_stations/db_routers.py

+def get_principal_db_connection():
+    return connections[get_principal_db_name()]


A couple of notes on this.

I added this function as discussed, but given we have to pass the connection name in as a string when we use transaction.atomic() I'm not sure it is actually that useful. It might be more consistent to just use the string name only for both purposes.

To make this PR something we can review independently of the commits in Only serve from replica when there is a request in scope #8221 I've slightly falsified the commit history here. We'll need to resolve a conflict with that PR if we keep the function. When resolved, the "real" diff for this file will be:

diff --git a/polling_stations/db_routers.py b/polling_stations/db_routers.py index 0a03ff41..dd41e3ca 100644 --- a/polling_stations/db_routers.py +++ b/polling_stations/db_routers.py @@ -3,6 +3,7 @@ import os from django.conf import settings from django.db import DEFAULT_DB_ALIAS from django_middleware_global_request import get_request +from django.db import connections PRIMARY = settings.PRINCIPAL_DB_NAME @@ -54,3 +55,7 @@ def get_principal_db_name(): if PRIMARY in settings.DATABASES: return PRIMARY return REPLICA + + +def get_principal_db_connection(): + return connections[get_principal_db_name()]

chris48s · 2025-02-04T17:01:16Z

polling_stations/apps/file_uploads/models.py

-    @transaction.atomic
    def send_error_email(self):
        subject = "File upload failed"
        message = f"File upload failure: {self}. Please investigate further."


As far as I can tell from reading this function, this transaction wasn't really doing anything so I just removed it.

chris48s · 2025-02-04T17:02:08Z

polling_stations/apps/data_importers/data_types.py

-        cursor = connection.cursor()
+        cursor = get_principal_db_connection().cursor()
        cursor.execute(
            """
                SELECT a.uprn, d.polling_station_id


Potentially the 2 SQL statements in this file could be rewritten using the ORM. I think you could probably replace
get_uprns_by_district() with either

def get_uprns_by_district(self): districts = PollingDistrict.objects.filter( council_id=self.council_id, area__contains=OuterRef("location") ).values("polling_station_id")[:-1] return ( Address.objects.filter( uprntocouncil__lad=self.gss_code, ) .annotate(polling_station_id=Subquery(districts)) .values_list("uprn", "polling_station_id") )

OR

def get_uprns_by_district(self): districts = PollingDistrict.objects.filter( council_id=self.council_id, area__contains=OuterRef("location") ).values("polling_station_id") return ( Address.objects.filter( uprntocouncil__lad=self.gss_code, ) .annotate(polling_station_id=Subquery(districts)) .values_list("uprn", "polling_station_id") )

but I couldn't quite get my head round which and there's about 10 years of thinking about edge cases that I don't want to untangle here. I settled for just specifying the connection and sticking with raw SQL for these two.

chris48s · 2025-02-04T17:02:49Z

I wanted to flag a couple of places where I explicitly chose not to change anything:

There are a handful of places in the councils app where we have explicit .using() and using= calls which I haven't unpicked in this PR. I think I'm going to leave those to query with Will when he is available. Leaving them how they are doesn't break anything.
I've left the uses of connection.cursor() in https://github.com/DemocracyClub/UK-Polling-Stations/blob/master/polling_stations/apps/dashboard/views.py as they are. We don't enable these view in prod and reading from the replica is fine anyway because we're not reading something we might have just written, so.. meh

chris48s · 2025-02-04T17:03:01Z

Refs DemocracyClub/uk-geo-utils#37

coveralls · 2025-02-10T15:35:31Z

coverage: 71.98% (+0.2%) from 71.811%
when pulling bb20cda on rawsql-20250204
into b379d74 on master.

chris48s added 7 commits February 4, 2025 16:25

add get_principal_db_connection helper to db_routers

0d8e8f8

remove unnecessary using statements from import_eoni

e6edd91

we still need to specify the connection on - raw SQL - transactions but db routers will handle the ORM calls now

remove raw SQL in data quality report

e5a106f

We can write these queries using the ORM so lets just do that

explicitly specify the connection when using raw sql

ba63f8e

explicitly specify the connection when using transactions

853e793

explicitly specify the connection when using raw sql and transaction

0217bea

this file uses both, so I did this in its own commit for clarity

remove unnecessary transaction

0cfff3d

I don't think this was doing anything useful

chris48s marked this pull request as draft February 4, 2025 16:34

chris48s commented Feb 4, 2025

View reviewed changes

chris48s mentioned this pull request Feb 4, 2025

use correct DB connection for transaction DemocracyClub/uk-geo-utils#37

Merged

chris48s marked this pull request as ready for review February 4, 2025 17:03

chris48s requested a review from symroe February 10, 2025 10:06

upgrade uk-geo-utils==0.17.1

bb20cda

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explicitly specify DB connection when we need to, and don't when we don't need to #8224

Explicitly specify DB connection when we need to, and don't when we don't need to #8224

chris48s commented Feb 4, 2025 •

edited

Loading

chris48s Feb 4, 2025

chris48s Feb 4, 2025

chris48s Feb 4, 2025

chris48s commented Feb 4, 2025 •

edited

Loading

chris48s commented Feb 4, 2025

coveralls commented Feb 10, 2025

		def get_principal_db_connection():
		return connections[get_principal_db_name()]

Explicitly specify DB connection when we need to, and don't when we don't need to #8224

Are you sure you want to change the base?

Explicitly specify DB connection when we need to, and don't when we don't need to #8224

Conversation

chris48s commented Feb 4, 2025 • edited Loading

chris48s Feb 4, 2025

Choose a reason for hiding this comment

chris48s Feb 4, 2025

Choose a reason for hiding this comment

chris48s Feb 4, 2025

Choose a reason for hiding this comment

chris48s commented Feb 4, 2025 • edited Loading

chris48s commented Feb 4, 2025

coveralls commented Feb 10, 2025

chris48s commented Feb 4, 2025 •

edited

Loading

chris48s commented Feb 4, 2025 •

edited

Loading