Replication package for "Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models"
This repository contains all the data and code to replicate the results presented in our paper. If you're interested in reproducing the results, jump directly to 4. 🔄 Reproduce.
Explore categorized blog posts that highlight the role of FMs and SE in real-world practices.
Check the posts_index folder for detailed indexes of relevant blog posts:
- FM4SE.md: Blogs on Foundation Models for Software Engineering (FM4SE).
- SE4FM.md: Blogs on Software Engineering for Foundation Models (SE4FM).
Prompts used for the FM/LLM jury can be found in the prompts folder:
- SEFM_Area.txt: Classifies blog posts into SE-FM areas.
- FM4SE.txt: Focuses on activities related to FM4SE.
- SE4FM.txt: Focuses on activities related to SE4FM.
The data folder contains all the datasets used in our study, including:
- company_blogs.json: A JSON file containing blog sites from various companies.
- collected_blog_posts.csv: A CSV containing 4,463 blog posts with key metadata:
  - id: Unique identifier of the blog post
  - title: Title of the blog post
  - link: URL of the blog post
  - company: The company that published the blog post
  - snippet: The snippet of the blog post (provided by Google Search)
  - area: Classification area (FM4SE, SE4FM, or Others)
- FM4SE_activities.csv: Contains 155 FM4SE blog posts, with details about:
  - activity: FM4SE activity
  - tasks: Tasks related to the FM4SE activity
  - Other columns: Same as in collected_blog_posts.csv
- SE4FM_activities.csv: Contains 997 SE4FM blog posts, with details about:
  - activity: SE4FM activity
  - tasks: Tasks associated with the SE4FM activity
  - Other columns: Same as in collected_blog_posts.csv
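For a quick look at the data, the sketch below tallies posts per classification area using only the Python standard library. The `area` column and its values come from the column list above; the function name `count_by_area` and the inline sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter


def count_by_area(csv_file):
    """Count blog posts per classification area in a CSV with an 'area' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["area"] for row in reader)


# Hypothetical sample rows that mimic the documented columns (not real data).
SAMPLE = """id,title,link,company,snippet,area
1,Post A,https://example.com/a,ExampleCo,snippet a,FM4SE
2,Post B,https://example.com/b,ExampleCo,snippet b,SE4FM
3,Post C,https://example.com/c,OtherCo,snippet c,SE4FM
4,Post D,https://example.com/d,OtherCo,snippet d,Others
"""

sample_counts = count_by_area(io.StringIO(SAMPLE))
print(sample_counts)
```

To run it against the real dataset, pass an open file handle instead of the sample, for example `open("data/collected_blog_posts.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is an assumption about how the file was saved.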
To replicate the results of our study, follow these steps using the provided Docker image.
You can build the Docker image from the included Dockerfile:
docker build -t fmse_blogs .
Alternatively, if you have the pre-built Docker image file (fmse_blogs_image.tgz), load it into your Docker environment:
docker load -i fmse_blogs_image.tgz
Once the image is built or loaded, run the container to see the results as follows:
docker run -it -v "${PWD}/output":"/app/output" fmse_blogs
This repository can also be reused to collect and analyze blog posts from other companies. Follow these steps to adapt the code:
- Update Blog Sources: Modify data/company_blogs.json to include blog URLs for the companies you wish to analyze.
- Search Blog Posts: Run the script scripts/search_blogs.py to search for blog posts. Note: You will need to configure the following environment variables for Google Search API:
  - GOOGLE_SEARCH_API_KEY
  - GOOGLE_SEARCH_ENGINE_ID
- Download Blog Posts: Use scripts/download_blogs.py to fetch the blog posts based on search results.
- Analyze Blog Posts: Apply your models to analyze the blog posts using the prompts provided in the prompts folder.
- Generate Reports: Use scripts/report_results.py to generate results and insights from the analysis.
- Create Index Files: Run scripts/generate_mds.py to create Markdown indexes for the blog posts.
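Before running the search step, it can help to confirm that both environment variables are set. The sketch below checks them and builds a request URL for the public Google Custom Search JSON API, which takes `key`, `cx`, and `q` as query parameters. Whether scripts/search_blogs.py calls this exact endpoint is an assumption; the function name `build_search_url` and the example query are hypothetical.

```python
import os
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

REQUIRED_VARS = ("GOOGLE_SEARCH_API_KEY", "GOOGLE_SEARCH_ENGINE_ID")


def build_search_url(query, env=os.environ):
    """Build a Custom Search request URL from the two required variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    params = {
        "key": env["GOOGLE_SEARCH_API_KEY"],   # API key
        "cx": env["GOOGLE_SEARCH_ENGINE_ID"],  # search engine ID
        "q": query,                            # search terms
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Calling `build_search_url("site:example.com foundation model")` raises a `RuntimeError` naming any variable that is unset, which surfaces configuration problems before any request is sent.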
If you only wish to reuse the FM/LLM Jury, you can directly integrate our module located in src/jury.
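To illustrate the general idea of a jury, the sketch below aggregates labels from several models with a strict majority vote. This is not the interface of src/jury, and the actual module may aggregate votes differently; the function name `jury_verdict` and the example labels are hypothetical.

```python
from collections import Counter


def jury_verdict(votes):
    """Return the label chosen by a strict majority of jurors, or None if there is none."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of the votes so that ties yield no verdict.
    return label if count > len(votes) / 2 else None


# Hypothetical labels returned by three jurors for one blog post.
print(jury_verdict(["SE4FM", "SE4FM", "Others"]))  # prints SE4FM
```

Returning `None` when no label wins a majority leaves it to the caller to decide how to handle disagreements, for example by flagging the post for manual review.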
If you find this replication package useful, please cite our paper using the following BibTeX entry:
@misc{li_fmjury_2024,
title={Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models},
author={Hao Li and Cor-Paul Bezemer and Ahmed E. Hassan},
year={2024},
eprint={2410.09012},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2410.09012},
}