Scraping folder contains the scraping scripts and the scraped question data:
metaculus.py
scrapes all binary, resolved questions (see the API sketch below this list).
metaculus_aibqi.py
scrapes all resolved questions from the aibq{i} tournament.
metaculus_aibq3_wd.py
scrapes all resolved questions along with their background information.
metaculus_data_binary_resolved_all.json
contains ALL binary, resolved questions from Metaculus.
metaculus_data_binary_resolved_May24.json
contains ALL binary, resolved questions from Metaculus after May 1, 2024.
metaculus_data_aibq3_wd.json
contains all binary, resolved questions from the Metaculus Q3 LLM Benchmarking tournament.
metaculus_data_aibq4_wd.json
contains all binary, resolved questions from the Metaculus Q4 LLM Benchmarking tournament.
classification.py
gets Gemini to categorize the questions.
metaculus_data_aibq3_categories.csv
has all the categories.
metaculus_data_path.json
exists because Claude Sonnet 3.5 (old) errored out on the first resolution.
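A minimal sketch of how the scraping could work, assuming the public Metaculus API at https://www.metaculus.com/api2/questions/ with forecast_type/status query parameters and cursor-style pagination; the actual endpoint, parameters, and response fields used in metaculus.py may differ.

```python
import json
import requests

API_URL = "https://www.metaculus.com/api2/questions/"  # assumed public endpoint

def fetch_binary_resolved(limit=100):
    """Page through the Metaculus API and collect binary, resolved questions."""
    questions = []
    url = API_URL
    params = {"forecast_type": "binary", "status": "resolved", "limit": limit}  # assumed params
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        data = resp.json()
        questions.extend(data.get("results", []))
        url = data.get("next")   # assumed "next" pagination link
        params = None            # "next" already carries the query string
    return questions

if __name__ == "__main__":
    with open("metaculus_data_binary_resolved_all.json", "w") as f:
        json.dump(fetch_binary_resolved(), f, indent=2)
```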
Graphs folder contains visualizations
newspipeline.py
takes each question's query and open_date and retrieves the relevant articles published before that date from AskNews.
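A sketch of that per-question loop. The field names (id, title, open_time) are assumptions about the scraped JSON, and search_articles_before is a stand-in for the actual AskNews client call, whose parameters are not assumed here.

```python
import json
from datetime import datetime

def search_articles_before(query: str, end_date: datetime, n_articles: int = 10) -> list:
    """Stand-in for the AskNews call: return up to n_articles published before end_date."""
    raise NotImplementedError("wire up the AskNews client here")

def build_news_file(questions_path: str = "metaculus_data_aibq3_wd.json",
                    out_path: str = "aibq3_news.json") -> None:
    with open(questions_path) as f:
        questions = json.load(f)

    news = {}
    for q in questions:
        # Assumed field names; the scraped JSON may label these differently.
        open_date = datetime.fromisoformat(q["open_time"].replace("Z", "+00:00"))
        # Only keep articles the model could have seen before the question opened.
        news[q["id"]] = search_articles_before(q["title"], open_date)

    with open(out_path, "w") as f:
        json.dump(news, f, indent=2)
```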
models.py
contains all the code for calling the models; the prompts are located in prompts.py.
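As an illustration only (not the repo's actual wrapper), a unified call could look like this using the OpenAI and Anthropic Python SDKs; the routing rule and default settings are assumptions.

```python
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_model(model: str, system_prompt: str, user_prompt: str, temperature: float = 1.0) -> str:
    """Route a single prompt to the right provider and return the text completion."""
    if model.startswith("claude"):
        resp = anthropic_client.messages.create(
            model=model,
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
            temperature=temperature,
        )
        return resp.content[0].text
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content
```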
directprediction.py
asks the LLM to make a prediction after reading the articles from aibq3_news.json.
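The run labels gpt0-gpt4 and claude0-claude4 in the results below suggest five samples per model per question. A sketch of that loop; the helper names imported from prompts.py (SYSTEM_PROMPT, build_forecast_prompt) and the output filename pattern are assumptions.

```python
import json

from models import call_model                               # wrapper as sketched above
from prompts import SYSTEM_PROMPT, build_forecast_prompt    # assumed names

NUM_RUNS = 5  # one sample per run label (gpt0..gpt4 / claude0..claude4)

def run_direct_predictions(model: str, questions_path: str, news_path: str) -> None:
    with open(questions_path) as f:
        questions = json.load(f)
    with open(news_path) as f:
        news = json.load(f)

    predictions = {}
    for q in questions:
        articles = news.get(str(q["id"]), [])
        user_prompt = build_forecast_prompt(q, articles)
        predictions[q["id"]] = [
            call_model(model, SYSTEM_PROMPT, user_prompt) for _ in range(NUM_RUNS)
        ]

    with open(f"aibq3_predictions_{model}.json", "w") as f:
        json.dump(predictions, f, indent=2)
```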
extract_probabilities.py
gets the probabilities from aibq3_predictions_{model name}.json and writes them to a .csv.
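A hedged sketch of the extraction step, assuming the prompts ask the model to end with a line like "Probability: 0.63"; the real answer format and the output filename may differ.

```python
import csv
import json
import re

# Assumed answer format: a trailing "Probability: <number>", decimal or percentage.
PROB_RE = re.compile(r"Probability:\s*([0-9]*\.?[0-9]+)\s*%?", re.IGNORECASE)

def extract_probability(text: str):
    match = PROB_RE.search(text)
    if not match:
        return None
    value = float(match.group(1))
    return value / 100 if value > 1 else value  # normalize percentages to [0, 1]

def predictions_to_csv(model: str) -> None:
    with open(f"aibq3_predictions_{model}.json") as f:
        predictions = json.load(f)
    with open(f"aibq3_probabilities_{model}.csv", "w", newline="") as f:  # assumed filename
        writer = csv.writer(f)
        writer.writerow(["question_id"] + [f"{model}{i}" for i in range(5)])
        for qid, runs in predictions.items():
            writer.writerow([qid] + [extract_probability(r) for r in runs])
```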
extract_probabilities_conf.py
gets the probabilities from aibq3_predictions_conf_{model name}.json and writes them to a .csv.
Narrativeprediction.py
asks the LLM to write a script of a conversation between Tetlock and Silver, set the day after the question's scheduled close date, in which they state the probability they had calculated before the event.
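An illustrative template for that narrative framing; the actual prompt lives in prompts.py and likely differs in wording and placeholders.

```python
# Hypothetical template only; placeholder names are assumptions.
NARRATIVE_PROMPT = """\
Today is {day_after_close}. Write a short script of a conversation between
Philip Tetlock and Nate Silver, who are discussing the question below now that
it has resolved. In the dialogue, have them state the probability they had
assigned to YES before the event, as a single number between 0 and 1.

Question: {title}
Background: {background}
Resolution criteria: {resolution_criteria}
"""
```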
Brier_score.py
calculates Brier scores for the .csv that contains BOTH GPT-4o and Sonnet 3.5 (old).
Brier_score_singlemodel.py
calculates Brier scores for a .csv that contains only one model (currently GPT-4o).
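For reference, the Brier score of a binary forecast is the squared difference between the predicted probability and the 0/1 outcome, averaged over questions, and the reported standard error is the standard error of that per-question mean. A sketch under an assumed CSV layout (a 0/1 "resolution" column plus one column per run); the filename and column names are hypothetical.

```python
import numpy as np
import pandas as pd

def brier(p: np.ndarray, y: np.ndarray):
    """Return (Brier score, standard error) for probabilities p against 0/1 outcomes y."""
    per_question = (p - y) ** 2
    return per_question.mean(), per_question.std(ddof=1) / np.sqrt(len(per_question))

# Hypothetical layout: one row per question, "resolution" column, one column per run.
df = pd.read_csv("aibq3_probabilities_gpt-4o.csv")
y = df["resolution"].to_numpy()
runs = df[[f"gpt{i}" for i in range(5)]].to_numpy()

for i in range(5):
    print(f"gpt{i}:", brier(runs[:, i], y))

# Ensembles: aggregate the five runs per question first, then score the aggregate.
print("Median ensemble:", brier(np.median(runs, axis=1), y))
print("Mean ensemble:  ", brier(runs.mean(axis=1), y))
```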
Individual Model Brier Scores (score, standard error):
gpt0: 0.2104, 0.0132
gpt1: 0.1989, 0.0130
gpt2: 0.1985, 0.0127
gpt3: 0.1951, 0.0118
gpt4: 0.2010, 0.0125
claude0: 0.1923, 0.0127
claude1: 0.2015, 0.0130
claude2: 0.2100, 0.0135
claude3: 0.1985, 0.0129
claude4: 0.1992, 0.0129

Ensemble Brier Scores (score, standard error):
GPT median: 0.1968, 0.0122
Claude median: 0.1947, 0.0125
GPT mean: 0.1924, 0.0115
Claude mean: 0.1910, 0.0119
Combined mean: 0.1912

Statistical Test Results:
Paired t-test (GPT vs Claude): t-statistic: 0.1078, p-value: 0.9141
The difference between GPT and Claude scores is not statistically significant (p >= 0.05).
Effect size (Cohen's d): 0.0021. The effect size is small.
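The significance test and effect size above can be reproduced with scipy; a minimal sketch, assuming arrays of per-question Brier contributions (squared errors) for each model family as inputs.

```python
import numpy as np
from scipy import stats

def compare_models(gpt_sq_errors: np.ndarray, claude_sq_errors: np.ndarray):
    """Paired comparison of per-question squared errors from two model families."""
    t_stat, p_value = stats.ttest_rel(gpt_sq_errors, claude_sq_errors)
    diff = gpt_sq_errors - claude_sq_errors
    cohens_d = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired samples
    return t_stat, p_value, cohens_d
```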
Individual Model Brier Scores (score, standard error):
claude0: 0.1860, 0.0127
claude1: 0.1798, 0.0127
claude2: 0.1919, 0.0131
claude3: 0.1826, 0.0127
claude4: 0.1851, 0.0128
Median: 0.1810, 0.0126
Mean: 0.1785, 0.0120

Cross-Model Ensemble Brier Scores (4o and Sonnet-new):
Median of all predictions: 0.1791
Mean of all predictions: 0.1792
Individual Model Brier Scores (score, standard error):
claude0: 0.2531, 0.0145
claude1: 0.2491, 0.0136
claude2: 0.2601, 0.0142
claude3: 0.2507, 0.0134
claude4: 0.2448, 0.0138
Median: 0.2476, 0.0135
Mean: 0.2415, 0.0128

Individual Model Brier Scores (score, standard error):
gpt0: 0.2186, 0.0133
gpt1: 0.2141, 0.0134
gpt2: 0.2166, 0.0130
gpt3: 0.2095, 0.0129
gpt4: 0.2122, 0.0133
Median: 0.2100, 0.0126
Mean: 0.2046, 0.0120
Individual Model Brier Scores (score, standard error):
geminiflash20: 0.2351, 0.0192
geminiflash21: 0.2249, 0.0179
geminiflash22: 0.2338, 0.0206
geminiflash23: 0.2216, 0.0203
geminiflash24: 0.2439, 0.0208

Ensemble Brier Scores (score, standard error):
Median Ensemble: 0.2227, 0.0189
Mean Ensemble: 0.2194, 0.0173
Individual Model Brier Scores (score, standard error):
deepseekv30: 0.2086, 0.0130
deepseekv31: 0.2178, 0.0130
deepseekv32: 0.2292, 0.0142
deepseekv33: 0.2108, 0.0132
deepseekv34: 0.2180, 0.0132

Ensemble Brier Scores (score, standard error):
Median Ensemble: 0.2116, 0.0127
Mean Ensemble: 0.2071, 0.0121
Average Brier Scores (score, standard error):
gpt0: 0.2099, 0.0143
gpt1: 0.2005, 0.0138
gpt2: 0.2117, 0.0143
gpt3: 0.1887, 0.0136
gpt4: 0.2097, 0.0147
claude0: 0.1987, 0.0132
claude1: 0.2074, 0.0137
claude2: 0.2038, 0.0135
claude3: 0.2119, 0.0133
claude4: 0.2013, 0.0135

Ensemble Brier Scores (score, standard error):
GPT median: 0.1950, 0.0132
Claude median: 0.1897, 0.0123
GPT mean: 0.1901, 0.0124
Claude mean: 0.1891, 0.0117

Statistical Test Results:
Paired t-test (GPT vs Claude): t-statistic: -0.0984, p-value: 0.9216
The difference between GPT and Claude scores is not statistically significant (p >= 0.05).
Effect size (Cohen's d): -0.0022. The effect size is small.
Average Brier Scores:
gpt0: 0.2694
gpt1: 0.2819
gpt2: 0.2439
gpt3: 0.2644
gpt4: 0.2805
GPT median: 0.2430

Individual Model Brier Scores (score, standard error):
gpt0: 0.2651, 0.0187
gpt1: 0.2718, 0.0191
gpt2: 0.2252, 0.0179
gpt3: 0.2552, 0.0184
gpt4: 0.2766, 0.0188
Median: 0.2313, 0.0173
Mean: 0.2080, 0.0133

Individual Model Brier Scores (score, standard error):
gpt0: 0.2302, 0.0104
gpt1: 0.2255, 0.0097
gpt2: 0.2161, 0.0096
gpt3: 0.2177, 0.0100
gpt4: 0.2212, 0.0101
Median: 0.2053, 0.0088
Mean: 0.2087, 0.0079

Average Brier Scores (sonnet-new):
Individual Model Brier Scores (score, standard error):
claude0: 0.2245, 0.0135
claude1: 0.2219, 0.0135
claude2: 0.2463, 0.0175
claude3: 0.2329, 0.0136
claude4: 0.2264, 0.0133
Median: 0.2197, 0.0130
Mean: 0.2180, 0.0120

Average Brier Scores (sonnet-new) (only ran 3 times by mistake):
claude0: 0.2177
claude1: 0.2278
claude2: 0.2348
Median: 0.2156
Mean: 0.2090
category,Claude Sonnet (new),Claude Haiku,Claude Sonnet (old),GPT-4o,Narrative Claude,Narrative 4o,Conf 4o
Arts & Recreation,0.1657,0.1686,0.1584,0.1683,0.1686,0.2076,0.1633
Economics & Business,0.2223,0.2767,0.2258,0.2149,0.2819,0.2429,0.2289
Environment & Energy,0.2073,0.338,0.2334,0.2845,0.2695,0.2568,0.3087
Healthcare & Biology,0.1697,0.2087,0.1769,0.1658,0.1989,0.1868,0.181
Politics & Governance,0.1624,0.2324,0.1807,0.1699,0.2283,0.1919,0.19
Science & Tech,0.2047,0.3134,0.2425,0.2623,0.1994,0.2503,0.2744
Sports,0.1554,0.2086,0.1714,0.1925,0.2075,0.2677,0.1844
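A per-category table like the one above can be produced with a groupby over per-question Brier scores joined against metaculus_data_aibq3_categories.csv; the scores filename and the question_id/category column names are assumptions.

```python
import pandas as pd

# Hypothetical filename for per-question Brier scores; the category file is the one
# listed in the Scraping folder. Both are assumed to share a question_id column.
scores = pd.read_csv("aibq3_brier_per_question.csv")
categories = pd.read_csv("metaculus_data_aibq3_categories.csv")

merged = scores.merge(categories, on="question_id")
per_category = merged.groupby("category").mean(numeric_only=True).round(4)
print(per_category.to_csv())
```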
Individual Model Brier Scores:
gpt0: 0.1213
gpt1: 0.3114
gpt2: 0.2421
gpt3: 0.2142
gpt4: 0.2813

Ensemble Brier Scores:
Median Ensemble: 0.2625
Mean Ensemble: 0.2097

claude0: 0.2179
claude1: 0.0845
claude2: 0.2255
claude3: 0.1563
claude4: 0.2629

Ensemble Brier Scores:
Median Ensemble: 0.1566
Mean Ensemble: 0.1748