
Demokratis-ch/demokratis-ml



Consultation procedures for the people

Demokratis.ch | Slack | [email protected] | πŸ€— demokratis

Demokratis is released under the MIT License.

πŸš€ What's Demokratis?

Demokratis.ch makes it easier to participate in Swiss consultation procedures and thereby influence the legislative process at the federal and cantonal levels.

About Demokratis

The consultation procedure is a fundamental but lesser-known part of Swiss democracy. While the procedure is in theory open to everyone, the barriers to participation are rather high. Demokratis.ch is an accessible, user-friendly web platform that makes it easy to explore, contribute to, and monitor consultation procedures in Switzerland.

Demokratis is developed and run as a civil society initiative and you are most welcome to join us!

About machine learning at Demokratis

We use machine learning to process and understand the legal text drafts (Vorlagen) that are the subject of the consultation procedure, as well as to process related documents such as reports and letters accompanying the drafts.

The machine learning stack runs separately from the main Demokratis.ch website. The outputs of the models are always reviewed by a human before being displayed on the website.

How to contribute

As a community-driven project in its early stages, we welcome your feedback and contributions! We're excited to collaborate with the civic tech, open data, and data science communities to improve consultation processes for all.

Join us on Slack in the #ml channel to say hello, ask questions, and discuss our data and models.

The challenges of understanding legal text with machine learning are complex. If you have experience in NLP or ML, we’d love your input! We can’t do this alone and appreciate any help or insights you can offer.

Tooling and code quality

  • We use uv to manage dependencies. After cloning the repository, run uv sync --dev to install all dependencies.
  • To ensure code quality and enforce a common standard, we use ruff and pre-commit to format code and eliminate common issues. To make sure pre-commit runs all checks automatically when you commit, install the git hooks with uv run pre-commit install.
  • We've started out with a fairly strict ruff configuration. We expect to loosen some rules when they become too bothersome; a research project cannot be bound by the same rules as a big production app. Still, it's much easier to start with strict rules and gradually soften them than to go the other way.
  • All code must be auto-formatted by ruff before being accepted into the repository. pre-commit hooks (or your code editor) will do that for you. To invoke the formatter manually, run uv run ruff format your_file.py. It works on Jupyter notebooks, too.

What data we use

We obtain information about federal and cantonal consultations through APIs and website scraping. For each consultation (Vernehmlassung) we typically collect a number of documents of various types:

  • The proposed law change (draft, "Vorlage", "Entwurf", ...)
  • A report explaining the proposed change ("ErlΓ€uternder Bericht")
  • Accompanying letters, questionnaires, synoptic tables, etc.

The documents are almost always just PDFs. We also get some metadata for the consultation itself, e.g. its title, starting and ending dates, and perhaps a short description.

See the Pandera schemata in demokratis_ml/data/schemata.py for a complete specification of the data we have on consultations and their documents.
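
For illustration, here is a minimal sketch of validating a dataframe against such a schema; the column names below are simplified stand-ins, not our actual schema:

    import pandas as pd
    import pandera as pa

    # Illustrative only: the real schemata in demokratis_ml/data/schemata.py
    # define many more columns and checks.
    schema = pa.DataFrameSchema(
        {
            "consultation_id": pa.Column(int),
            "document_type": pa.Column(str, nullable=True),
            "document_content": pa.Column(str),
        }
    )

    df = pd.DataFrame(
        {
            "consultation_id": [1],
            "document_type": ["DRAFT"],
            "document_content": ["Plain text extracted from the PDF"],
        }
    )
    schema.validate(df)  # raises a SchemaError if the dataframe does not conform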

Data acquisition and preprocessing

We use data from two main sources: federal consultations, ingested via the Fedlex API, and cantonal consultations, collected from cantonal websites.

Document and consultation data is ingested from these sources into the Demokratis web platform running at Demokratis.ch. The web platform is our main source of truth. In addition to making the data available to end users, it also runs an admin interface that we use for manual review and correction of our database of consultations and their documents.

To transform the web platform data into a dataset for training models, we run a Prefect pipeline: demokratis_ml/pipelines/preprocess_consultation_documents.py. The result of this pipeline is a Parquet file conforming to the above-mentioned dataframe schema.
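
As a rough illustration, a Prefect pipeline of this shape looks like the following; the task names and steps here are hypothetical, and the real flow does considerably more (text extraction, validation against the schema, and so on):

    import pandas as pd
    from prefect import flow, task

    @task
    def load_platform_data() -> pd.DataFrame:
        # Stand-in for fetching consultation documents from the web platform.
        return pd.DataFrame({"document_id": [1], "content": ["..."]})

    @task
    def write_parquet(df: pd.DataFrame, path: str) -> None:
        df.to_parquet(path)

    @flow
    def preprocess_consultation_documents() -> None:
        df = load_platform_data()
        write_parquet(df, "consultation-documents.parquet")

    if __name__ == "__main__":
        preprocess_consultation_documents()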

Our data is public

Our preprocessed dataset is automatically published to HuggingFace and you can download it directly from πŸ€— demokratis/consultation-documents. Don't hesitate to talk to us on Slack #ml if you have any questions about the data!
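
For example, you can fetch the Parquet file with huggingface_hub; the filename below is a guess, so check the dataset page for the actual file listing:

    import pandas as pd
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="demokratis/consultation-documents",
        filename="consultation-documents.parquet",  # hypothetical filename
        repo_type="dataset",
    )
    df = pd.read_parquet(path)
    print(df.columns)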

Our models and open ML problems

Current status

Problem | Public dataset? | Initial research | Proof-of-concept model | Deployed in production
I. Classifying consultation topics | ✅ | ✅ | ✅ | ❌
II. Extracting structure from documents | 🟠 (*) | ✅ | ❌ | ❌
III. Classifying document types | ✅ | ✅ | ✅ | ❌

(*) We haven't published our copies of the source PDFs, but our public dataset does include links to the original files hosted by the cantons and the federal government.

I. Classifying consultation topics

We need to classify each new consultation into one or more topics (such as agriculture, energy, health, ...) so that users can easily filter and browse consultations in their areas of interest. Users can also subscribe to email notifications about new consultations on their selected topics.

Our datasets

To label our dataset, we used a combination of weak pattern-matching rules, manual labelling, and Open Parl Data. You can see the full list of our topics in demokratis_ml/data/schemata.py:CONSULTATION_TOPICS.

Our models

To increase the breadth of input for the models, we first classify individual documents, even though all documents of a given consultation naturally share the same topics. To then predict the topics of the consultation itself, we let the document-level outputs "vote" on the final set of topics. This approach has proven effective because it gives the model more data to learn from; classifying consultations directly would force us to pick a limited number of documents, truncate and concatenate them heavily, and so on. (A sketch of the voting step follows below.)
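
A minimal sketch of what such an aggregation could look like; the averaging and threshold here are assumptions, and the actual voting scheme in our pipeline may weight documents differently:

    import numpy as np

    # Hypothetical aggregation: average the per-document topic probabilities
    # and keep every topic above a threshold.
    def aggregate_topics(doc_probas: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """doc_probas has shape (n_documents, n_topics)."""
        consultation_probas = doc_probas.mean(axis=0)
        return consultation_probas >= threshold

    votes = np.array([[0.9, 0.2, 0.7], [0.8, 0.4, 0.3]])
    print(aggregate_topics(votes))  # [ True False  True]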

We disregard documents of type RECIPIENT_LIST and SYNOPTIC_TABLE because we have not found their signals useful. There may be more room for improvement in document selection; this, however, also depends on problem III, classifying document types (below).

We currently have two models with good results:

  1. A "traditional" approach: we embed the documents using OpenAI's text-embedding-3-large model and classify those vectors directly with the simple sklearn pipeline of
    make_pipeline(
        StandardScaler(),
        MultiOutputClassifier(LogisticRegression()),
    )
    
    We found that OpenAI embeddings work better than jina-embeddings-v2-base-de, which in turn works better than general-purpose sentence transformer models.
  2. Fine-tuning a domain-specific language model from the πŸ€— joelniklaus/legallms collection, usually joelniklaus/legal-swiss-roberta-large. These pre-trained models were introduced in the paper MultiLegalPile: A 689GB Multilingual Legal Corpus. (A loading sketch follows below.)
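
A minimal sketch of loading such a model for multi-label topic classification with transformers; the num_labels value is a placeholder for the size of CONSULTATION_TOPICS, and the training loop is omitted:

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "joelniklaus/legal-swiss-roberta-large"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=20,  # placeholder: use len(CONSULTATION_TOPICS)
        problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
    )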

Current results

In general:

  • The LegalLM model performs better than the "traditional" sklearn classifier. (Compute requirements are several orders of magnitude higher, though.)
  • Cantonal consultations are more difficult to classify than the federal ones. Merging the two datasets helps.

Current sample-weighted F1 scores:

Model | Cantonal consultations | Federal consultations | Cantonal+federal
text-embedding-3-large + LogisticRegression | 0.73 | 0.85 | 0.78
joelniklaus/legal-swiss-roberta-large | ? | 0.92 | 0.82

[Figure: colour-coded example classification report for the "traditional" model on cantonal+federal consultations.]

II. Extracting structure from documents

Note

Latest work on this problem: PR!4 is trying to use LlamaParse to convert PDFs to Markdown.

An important goal of Demokratis is to make it easy for people and organisations to provide feedback (statements, Stellungnahmen) on consultations. To facilitate writing comments or suggesting edits on long, complex legal documents, we need to break them apart into sections, paragraphs, lists, footnotes, etc. Since all the consultation documents we can currently access are PDFs, it is surprisingly hard to extract machine-readable structure from them!

We are still researching possible solutions to this problem. For shorter documents, the most workable solution seems to be prompting GPT-4o to analyse a whole uploaded PDF file and emit the extracted structure as JSON. It may be possible to make this work for longer documents too, with careful chunking. In initial tests, GPT-4o performed better at this task than Gemini 1.5 Pro. See our starting prompt for GPT-4o here, along with sample input and output. (A simplified sketch follows below.)
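
A rough sketch of this approach; the instruction below is a stand-in for our real prompt, and we simplify by sending extracted plain text instead of uploading the PDF itself:

    from openai import OpenAI

    document_text = "Art. 1 ..."  # plain text of a (short) consultation document
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "Segment the document into sections, paragraphs, "
                "lists, and footnotes. Respond with a JSON object.",
            },
            {"role": "user", "content": document_text},
        ],
    )
    structure_json = response.choices[0].message.content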

The services typically used for extracting PDFs – AWS Textract, Azure Document AI, Adobe Document Services – do not seem reliable at detecting PDF layouts. In particular, they do not consistently differentiate between headers, paragraphs, lists, or even footnotes. The open-source project surya performs similarly to these cloud services and can easily be run locally. Another option we have not tried yet is using a LayoutLM model.

III. Classifying document types

Note

Latest work on this problem: PR!8.

Each consultation consists of several documents: usually around 5, but sometimes as many as 20 or more. For each document, we're interested in the role it plays in the consultation: is it the actual draft of the proposed change? Is it an accompanying letter or report? (You can see the full list of document types in demokratis_ml/data/schemata.py:DOCUMENT_TYPES.)

[Figure: frequency of document types in the cantonal dataset.]

For federal consultations, we automatically get this label from the Fedlex API. However, cantonal documents do not have roles (types) assigned, so we need to train a model.

Our datasets

We labelled part of the cantonal dataset manually and through weak rules on file names (e.g. labelling files called 'Adressatenliste.pdf' as RECIPIENT_LIST; a sketch of such a rule follows below). We also used the entire federal dataset because it comes already labelled.
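
A minimal sketch of this kind of filename rule; the second pattern is a hypothetical example, not one of our actual rules:

    import re

    # Illustrative weak-labelling rules; the real set is more extensive.
    FILENAME_RULES = {
        r"adressatenliste": "RECIPIENT_LIST",
        r"synoptische.*tabelle": "SYNOPTIC_TABLE",  # hypothetical pattern
    }

    def weak_label(filename: str) -> str | None:
        for pattern, document_type in FILENAME_RULES.items():
            if re.search(pattern, filename.lower()):
                return document_type
        return None  # no rule matched; leave for manual labelling

    assert weak_label("Adressatenliste.pdf") == "RECIPIENT_LIST"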

Our model

We embed the documents using OpenAI's text-embedding-3-large model and classify those vectors directly with the simple sklearn pipeline make_pipeline(StandardScaler(), LogisticRegression()). A self-contained sketch follows below.
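
The sketch uses random stand-in data; the real inputs are precomputed 3072-dimensional text-embedding-3-large vectors and document-type labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Stand-in data: random vectors instead of real document embeddings.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3072))
    y = rng.choice(["DRAFT", "LETTER", "REPORT"], size=100)

    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X, y)
    print(clf.predict(X[:5]))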

Current results

When we simplify the task by labelling unknown document types as VARIOUS_TEXT in the training set, we currently get weighted precision, recall, and F1 around 0.87.
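
Roughly, the simplification amounts to mapping missing labels to VARIOUS_TEXT before training:

    # Documents with no known type are treated as VARIOUS_TEXT for training.
    labels = ["DRAFT", None, "LETTER", None]
    simplified = [lbl if lbl is not None else "VARIOUS_TEXT" for lbl in labels]
    print(simplified)  # ['DRAFT', 'VARIOUS_TEXT', 'LETTER', 'VARIOUS_TEXT']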

Detailed classification report:

Class | Precision | Recall | F1-score | Support
DRAFT | 0.86 | 0.82 | 0.84 | 360
FINAL_REPORT | 0.96 | 0.94 | 0.95 | 160
LETTER | 0.97 | 0.97 | 0.97 | 324
OPINION | 0.96 | 0.99 | 0.98 | 83
RECIPIENT_LIST | 0.97 | 0.97 | 0.97 | 207
REPORT | 0.86 | 0.86 | 0.86 | 296
RESPONSE_FORM | 0.00 | 0.00 | 0.00 | 3
SURVEY | 0.74 | 0.83 | 0.78 | 24
SYNOPTIC_TABLE | 0.75 | 0.68 | 0.71 | 66
VARIOUS_TEXT | 0.72 | 0.75 | 0.74 | 375
Accuracy | | | 0.87 | 1898
Macro avg | 0.78 | 0.78 | 0.78 | 1898
Weighted avg | 0.87 | 0.87 | 0.87 | 1898
