A text webscraping tool for U.S. state legislature websites, with options for speech-to-text generated transcripts and public-facing example dashboards that include basic text analysis on specific policy areas.
Current coverage includes Nevada and Washington, with California in the design phase.
View dashboard demonstrations here.
The mission of StateLegiscraper is to make accessible text corpora of political, social, and scholarly significance that can build greater public transparency and academic knowledge about public policymaking and state-level politics.
In recent years, a number of controversial bills and policy proposals have emerged in state legislatures, and media attention has increasingly focused on state legislative politics. But beyond recent news, public oversight of the policymaking process is an important cornerstone of democratic nations. As the current U.S. political climate has increasingly shifted national politics to the state level, state legislatures are key policy venues to watch.
However, each of the 50 state legislatures has a vastly different website and public documentation protocol. A systematic examination of national trends at the state level is therefore difficult to execute due to challenges in navigating, accessing, and processing relevant data. While projects such as LegiScan, Civic Eagle, and Open States have APIs that provide data about bills and representatives across all 50 states, there is currently no open source option that scrapes and processes written and spoken transcripts of state legislature committee hearings and floor speeches for research purposes and public review.
.
├── data
│ └── dashboard
├── doc
├── examples
├── statelegiscraper
│ ├── assets
│ ├── states
│ ├── test
│ └── dashboard_helper.py
├── LICENSE
├── README.md
├── app.py
└── environment.yml
The `statelegiscraper` directory includes a `states` module, unit tests in `test`, and a `dashboard_helper` function script. Data relevant to the dashboard are included in the `data` directory. The `examples` directory provides example Jupyter notebooks that can help new users learn how StateLegiscraper organizes scraping and processing. A Plotly Dash dashboard can run locally through the `app.py` file (see the Dashboard section below for details).
StateLegiscraper is installed using the command line and is best used with a virtual environment due to its dependencies.
- Open your choice of terminal (e.g., Terminal (MacOS) or Ubuntu 20.04 LTS (Windows))
- Clone the repository using
git clone https://github.com/ka-chang/StateLegiscraper.git
- Change to the StateLegiscraper directory using
cd StateLegiscraper
- Set up a new virtual environment with all necessary packages and their dependencies using
conda env create -f environment.yml
- Activate the statelegiscraper virtual environment with
conda activate statelegiscraper
- Deactivate the statelegiscraper virtual environment when you are finished with
conda deactivate
StateLegiscraper's webscraping tool uses a Python-based web browser automation tool, Selenium. This requires a specific browser and browser driver to work properly. The package is built using Google Chrome.
- Python = 3.9
- Google Chrome
- Chrome Driver
To check your installed Chrome version and to download the appropriate Chrome Driver, follow these instructions:
- Open Google Chrome
- At the top right corner of the browser, click the settings tab (three vertical dots ⋮)
- Navigate down to Help > About Google Chrome
- Your Google Chrome version is listed at the top of the page.
- Find the Chrome Driver that corresponds to your version and save it to your local drive. We recommend saving it within the cloned repository directory `statelegiscraper/assets` for organizational purposes.
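Matching the driver to the browser usually comes down to comparing major version numbers. The sketch below is not part of StateLegiscraper; the helper names `major_version` and `versions_match` are hypothetical, and the version strings in the usage note are made-up examples. It assumes both version strings use the four-part `major.minor.build.patch` form.

```python
import re


def major_version(version_text: str) -> int:
    """Return the major version number found in a version string.

    Accepts text such as "Google Chrome 96.0.4664.110" or a bare
    "96.0.4664.45" (four dot-separated numbers are assumed).
    """
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"No version number found in: {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """Return True when Chrome and Chrome Driver share a major version."""
    return major_version(chrome_version) == major_version(driver_version)
```

For instance, `versions_match("Google Chrome 96.0.4664.110", "ChromeDriver 96.0.4664.45")` compares 96 with 96, while a driver from a different major release would fail the check.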
StateLegiscraper uses an open-source speech-to-text engine called DeepSpeech to convert audio files to text transcripts. StateLegiscraper's audio_to_text functions depend on DeepSpeech's acoustic models, which must be downloaded before those functions can run. You can read more about DeepSpeech's acoustic models in their release notes published here.
To download DeepSpeech's v.0.9.3 models and v.0.9.3 model scorer, follow these instructions in your terminal of choice:
- Navigate to the assets folder in the statelegiscraper package using
cd statelegiscraper/assets
- Download DeepSpeech's v.0.9.3 models into the assets directory using
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
- Download DeepSpeech's v.0.9.3 model scorer into the assets directory using
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
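Before running any speech-to-text step, it can help to confirm that both downloads actually landed in the assets directory. The following is a minimal sketch and not part of the package: `missing_model_files` is a hypothetical helper, and the file names are taken from the download URLs above.

```python
from pathlib import Path

# File names as they appear at the end of the two download URLs.
MODEL_FILES = (
    "deepspeech-0.9.3-models.pbmm",
    "deepspeech-0.9.3-models.scorer",
)


def missing_model_files(assets_dir):
    """Return the names of expected model files that are absent or empty."""
    assets = Path(assets_dir)
    missing = []
    for name in MODEL_FILES:
        path = assets / name
        # A zero-byte file usually means the download failed part-way.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_model_files("statelegiscraper/assets")` from the repository root returns an empty list when both files are present and non-empty.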
StateLegiscraper contains U.S. state-specific modules that each contain two classes of functions: a Scrape class and a Process class.
- The Scrape class bundles functions that scrape U.S. state legislature websites for individual committee hearing and floor speech PDF / audio / video transcript links. Users export this raw data to their local drive or a mounted cloud drive.
- The Process class bundles functions that clean and format the raw scraped data into Python objects appropriate for use with popular NLP packages (e.g., nltk, spaCy). Scraped PDF files will be converted to dictionary objects, while audio and video files will use DeepSpeech, an open-source speech-to-text engine, to generate a text transcript of selected meetings. These transcripts can be used as dictionary objects or exported as a JSON file.
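To illustrate the last step, exporting a transcript dictionary to JSON, here is a sketch using only the standard library. It does not use StateLegiscraper's own Process functions; the helpers `export_transcripts` and `load_transcripts` are hypothetical, and the dictionary layout (a meeting identifier mapped to transcript text) is an assumption made for the example.

```python
import json
from pathlib import Path


def export_transcripts(transcripts, json_path):
    """Write a {meeting_id: transcript_text} dictionary to a JSON file."""
    path = Path(json_path)
    # Create the output folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        json.dump(transcripts, handle, ensure_ascii=False, indent=2)
    return path


def load_transcripts(json_path):
    """Read a transcript dictionary back from a JSON file."""
    with Path(json_path).open("r", encoding="utf-8") as handle:
        return json.load(handle)
```

A dictionary written with `export_transcripts` can be read back unchanged with `load_transcripts`, which keeps the processed text reusable across analysis sessions.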
Example Jupyter notebooks are provided in the examples directory that walk new users through StateLegiscraper's scrape and process functions, including expected behavior from Selenium and file management strategies.
StateLegiscraper also includes a series of public-facing dashboards built on the scraped state legislature data. These dashboards inform interested users about high-level narrative trends within a specific state and/or policy area.
- COVID-19 Narrative Trends in Nevada's Health and Human Services and Finance Committees (2021)
To run the dashboard, ensure you have cloned the StateLegiscraper repository and are located in the root directory. Type `python app.py` in your terminal and the dashboards will open in a separate browser.
Researchers can gather raw data for nuanced, tailored analysis, while members of the public can engage with our text analysis dashboards to capture high-level trends in the political discourse at the state legislature. Read detailed user stories here.
The ambition of StateLegiscraper is to one day cover and maintain all 50 state legislature websites. If you'd like to request a state, build a dashboard, or suggest a feature to extend the functionality of StateLegiscraper, please feel free to raise an issue.
If you would like to report a bug or issue, please submit a detailed report at this link.
If you'd like to expand StateLegiscraper to other states, use the data to add to our dashboard options, or add additional features to the tool, please fork the repository, add your contribution, and generate a pull request. The complete contributing guide can be found at this link. This project operates under the Contributor Code of Conduct.
Many thanks to Dr. David Beck and Anant Mittal from the University of Washington for their support, guidance, and feedback in the development of this package.
StateLegiscraper logo adapted from Icon8 icons.