The goal of this project is to execute and analyze LLM-aided translations of project-sized code. We predominantly considered translation from C++ (the source language) to TypeScript (the target language). Assuming that LLMs are already capable of translating individual functions at high quality, the focus lies on the characteristics that emerge from project-sized code, such as the dependency structure, and on how to successfully carry them over into the translated code.
./approaches
: location of the transpilation approaches. Each approach is a Python file whose name starts with "alltrans"
./benchmarks/benchmark.py
: script which executes a benchmarking run. The benchmarks and approaches to be used are specified in this file.
./benchmarks/benchmark_source
: folder containing the C++ sources of each benchmark
./benchmarks/benchmarks_target_typescript
: folder containing the workspace for transpilation as well as the final transpilations
./benchmarks/default_runner.py
: script which performs the benchmarking for each benchmark (transpilation, compilation, execution, testing)
A prebuilt Docker image is provided. Pull it using:
> docker pull 28081997/alltrans:v1.0
If you would rather build a Docker image yourself, run the following inside your customized alltrans project:
> docker build -t alltrans .
In your current directory, place a .env file containing the following API keys:
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
WEIGHTS_AND_BIASES_API_KEY=...
Start the container using:
> docker run -it --env-file .env 28081997/alltrans:v1.0
To execute the benchmarks in the Docker container or in the local environment:
> cd benchmarks
> python3 benchmark.py
See benchmark.py for configuration of the selected benchmarks and transpilation approaches.
To execute the transpilation result of benchmark 15:
> cd benchmarks_target_typescript/15_level/transpilation_task
> node src/snake-master/main.js
The Benchmark component is the central part of the project; it orchestrates the execution of the specified benchmarks and transpilation approaches. For each benchmark and transpilation approach, this component sets up the necessary environment and starts a Transpilation Runner, which performs the concrete transpilation steps for the given benchmark and approach.
In order to set up the environment for a transpilation, the Benchmark component creates a folder in benchmarks_target_typescript with the same name as the selected benchmark (e.g. 15_level). Into this folder, the default_runner.py is copied, which contains the concrete instructions for executing the transpilation. The component also creates a folder called transpilation_task, into which it copies the C++ source code of the selected benchmark under a folder called cpp_source. Inside this folder, the selected transpilation approach is copied with the filename alltrans.py.
During transpilation, a src folder is created that contains the transpilation results. Each time a benchmark is transpiled, all content of transpilation_task is deleted and the src folder of the previous transpilation is copied to a src_history folder, so it can be recovered in case of accidental deletion.
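As an illustration, here is a minimal sketch of this reset step. The function name, the timestamped history naming and the exact folder locations are assumptions for illustration; the actual implementation in benchmark.py differs in its details.

import shutil
import time
from pathlib import Path

def reset_transpilation_task(task_dir: Path, history_dir: Path) -> None:
    # Sketch: archive the previous src folder, then clear the task folder.
    src_dir = task_dir / "src"
    if src_dir.exists():
        # keep a timestamped copy of the previous result so it can be recovered
        history_dir.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src_dir, history_dir / f"src_{int(time.time())}")
    if task_dir.exists():
        shutil.rmtree(task_dir)   # delete all content of transpilation_task
    task_dir.mkdir(parents=True)  # recreate an empty task folder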
Here is an example of the final folder structure of a benchmark after transpilation:
benchmarks_target_typescript
|- 15_level // name of the benchmark
|- default_runner.py // transpilation runner
|- transpilation_task
|- cpp_source
| |- <C++ files> // all source files of the benchmark
| |- utils // utils of our project
| |- alltrans.py // selected transpilation approach
|- src // folder containing the transpilation result
|- <TypeScript files> // all target files of the benchmark
|- <JavaScript files> // all compiled files of the benchmark
|- src_history // history of all src folders
Current working directory is the root of the project.
> cd benchmarks
> python3 ./benchmark.py
The benchmarking is implemented in the benchmark.py file. At the beginning of the file, one can configure a list of benchmarks to execute as well as a list of transpilation approaches to exclude.
Most of the implementation contains logic for copying and deleting files in order to set up the transpilation environment.
To execute the transpilation, all selected benchmarks and all selected approaches are iterated over, and for each combination python3 default_runner.py {args} is invoked as a shell command:
from subprocess import Popen

for benchmark_name in selected_benchmark_names:
    ...
    for approach_name in selected_approach_names:
        ...
        # start the runner for this benchmark/approach combination
        process = Popen(
            [
                "python3",
                "runner.py" if custom_runner_exists else "default_runner.py",
                benchmark_name,
                approach_name,
                str(EVALUATION_RUN),
            ],
            cwd=target_path,
        )
        process.wait()
        ...
The Transpilation Runner contains all logic to perform the following tasks in order:
- Set up an npm environment
- Call the selected transpilation approach (alltrans.py)
- Compile the transpiled files using tsc
- Execute the compiled files
- Generate and execute unit tests for the compiled files
The runner additionally logs all results to Weights & Biases.
In order to check whether the final JavaScript files execute and behave correctly, multiple approaches come into play:
If the source code requires known input in order to execute its computations, an alltrans_config.json file can be created inside the benchmark folder. This JSON file can contain a configuration for an expect script for each of the source files. This way, the input and the expected output can be specified, enabling an automated execution of the transpiled file. For more details on how to construct such a configuration file, take a look at benchmarks 2-5.
If no configuration exists, by default every JavaScript file is executed using Node.js without any additional arguments. However, as some software requires non-deterministic input (such as games), correct execution cannot be guaranteed for all benchmarks and is therefore not turned on by default. A manual evaluation of correct execution is recommended.
At the current stage of development, npm is initialized inside the transpilation_task folder. Hence, all node commands for executing the transpilation should be run from this directory. Automatic installation of npm dependencies is only performed by the V4 transpilation approach; in the other approaches, this has to be done manually.
Here is an example of executing the runner. Remember that, for this to work, the transpilation environment must already have been created by a previous benchmarking run using the benchmark.py component.
Current working directory is the root of the project.
> cd benchmarks/benchmarks_target_typescript/15_level
> python3 default_runner.py alltrans_context 15_level False
In default_runner.py, each of the previously described steps (environment setup, transpilation, compilation, execution, testing) is executed in order. In between each step, the result is sent to Weights & Biases.
def run(self):
    ...
    print("######### STARTING RUNNER #########")
    self.setup_project()
    if self.evaluation_run is not True:  # only transpile if not evaluation run
        transpilation_result = self.transpile()
        if not transpilation_result:
            self.wandb_run.log({self.benchmark_name: 0})
            quit()
        print(" > Transpilation successful")
    compilation_result = self.compile()
    if not compilation_result:
        self.wandb_run.log({self.benchmark_name: 1})
        quit()
    print(" > Compilation successful")
    ...
The central part of the project is the transpilation. We developed multiple incremental approaches which aim to transpile C++ code to TypeScript.
The problem of transpiling small code snippets has long been solved by small-sized LLMs. The challenge remains to scale this functionality to very complex code, large code files or even whole programming projects. We observed a multitude of challenges in this context:
- Complex code often relies on a variety of external libraries. These libraries cannot be expected to be available in the target language.
- Large code files have to fit into the context window of the LLM. Otherwise, the code has to be split up.
- Querying an LLM to generate large code files often leads to the LLM "skipping steps" by inserting placeholders.
- Code projects typically contain a hierarchy of local dependencies which has to be considered in order to give the LLM all relevant context.
- With increasing context and problem complexity, the LLM struggles to solve all problems at once.
- The transpilation result of the LLM has to be checked: Can it compile? Can it execute? Does the program behave correctly?
- When iterating on complex approaches, it is not easy to quickly identify the effect of a new feature on the overall transpilation quality.
Regarding challenges 2, 3 and 5: We limited ourselves to projects and files that do not exceed the token limit of current industry-standard LLMs (OpenAI ChatGPT 3.5 Turbo, Anthropic Claude Sonnet 3.5). We believe that the context size of LLMs will drastically increase in the near future, eventually making techniques that aim to iteratively transpile single files (splitting, etc.) obsolete. We observe that current LLMs struggle to sufficiently transpile large files, leaving out necessary code pieces. However, we believe these problems can be better addressed by Chain-of-Thought approaches using prompting techniques or specially trained LLMs (OpenAI ChatGPT o1).
In the following sections we present four iterations of our transpilation software, each iteration solving additional challenges.
The first iteration of the transpilation component is kept simple and is useful as a baseline:
- Iterate through all C++ files
- For each file: Instruct an LLM to transpile the C++ file to TypeScript while adding the C++ code as context.
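A minimal sketch of this baseline loop is shown below. Here, query_llm is a hypothetical stand-in for the LLM call; the actual implementation lives in the approaches/ folder and differs in its details.

from pathlib import Path

def transpile_v1(cpp_source_dir: str, target_src_dir: str, query_llm) -> None:
    # Baseline: transpile every C++ file in isolation, one LLM call per file.
    for cpp_file in Path(cpp_source_dir).rglob("*.cpp"):
        prompt = (
            "Transpile the following C++ file to TypeScript. "
            "Answer with the complete TypeScript file only.\n\n"
            + cpp_file.read_text()
        )
        typescript_code = query_llm(prompt)  # hypothetical LLM wrapper
        out_file = Path(target_src_dir) / cpp_file.with_suffix(".ts").name
        out_file.parent.mkdir(parents=True, exist_ok=True)
        out_file.write_text(typescript_code)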
The second iteration takes a big step forward by taking the dependency structure of the project into account. A dependency graph is created based on the C++ sources. Using this graph, the transpilation is performed bottom-up: first, leaves which have no local dependencies are transpiled and removed from the graph. Then, the newly resulting leaves are transpiled. This process is repeated until all files have been transpiled.
For each C++ file, its corresponding header file (imported by the C++ file and having the same name) is always added to the context during transpilation. In addition, all transpilation results (previously transpiled TypeScript files) that the C++ file imports are added to the context.
To further enable the LLM to correctly import local dependencies, a tree view of the folder and file structure of the source code is added to the context.
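A condensed sketch of this bottom-up loop, assuming the dependency-graph dictionary described in the DependencyGraph.py section below and the same hypothetical query_llm helper as in the baseline sketch:

def transpile_v2(dependency_graph, file_tree, query_llm):
    # dependency_graph: dict mapping C++ file path -> list of local dependency paths
    graph = {path: list(deps) for path, deps in dependency_graph.items()}
    transpiled = {}  # C++ path -> TypeScript result
    while graph:
        # leaves: files whose local dependencies have all been transpiled already
        leaves = [path for path, deps in graph.items() if not deps]
        if not leaves:
            raise ValueError("dependency cycle detected")  # handled separately, see Utils
        for path in leaves:
            context = file_tree + "\n\n" + "\n\n".join(
                transpiled[dep] for dep in dependency_graph[path] if dep in transpiled
            )
            transpiled[path] = query_llm(
                "Transpile this C++ file to TypeScript.\n\nContext:\n" + context
                + "\n\nC++ source:\n" + open(path).read()
            )
            del graph[path]
            for deps in graph.values():  # remove the transpiled file from all dependency lists
                if path in deps:
                    deps.remove(path)
    return transpiled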
The third approach leverages prompting techniques to further improve the quality of the transpilation:
Coordinator/Worker Approach Instead of prompting an LLM to directly transpile a complex piece of code, two separate queries are made. First, the LLM is instructed to give detailed instructions on how to perform the transpilation, based on the context information (described in V2). It is asked to especially consider possibilities to map external dependencies from the source language to the target language. In a second query, the LLM is given only the C++ source as context together with the instructions generated by the previous query.
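A minimal sketch of the two-query pattern (the prompts are illustrative, not the exact ones used; query_llm is the same hypothetical helper as above):

def transpile_v3(cpp_source, context, query_llm):
    # Query 1 (coordinator): derive transpilation instructions from the full context.
    instructions = query_llm(
        "You are planning a C++ to TypeScript transpilation. Based on the following "
        "context, write detailed instructions for the transpilation. Pay special "
        "attention to how external C++ dependencies can be mapped to TypeScript/npm "
        "equivalents.\n\n" + context
    )
    # Query 2 (worker): transpile using only the C++ source and the condensed plan.
    return query_llm(
        "Transpile the following C++ file to TypeScript. Follow these instructions:\n"
        + instructions + "\n\nC++ source:\n" + cpp_source
    )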
This approach has several advantages:
- LLMs perform better on tasks that enable them to answer in natural language instead of a predefined format. Enabling the LLM to generate the complex transpilation instructions in natural language is therefore beneficial. [1]
- Limiting the context size during the transpilation to only the C++ source and the condensed instructions decreases transpilation complexity and might improve the quality of the final result.
Error Correction + CoT After performing the transpilation, the LLM is again instructed to find any possible errors in the transpilation in a Chain-of-Thought pattern and to answer with the corrected code. This can lead the LLM to find some obvious errors that previously went unnoticed.
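A sketch of this self-correction pass, again using the hypothetical query_llm helper; the actual prompt wording differs:

def self_correct(transpiled_code, query_llm):
    # Ask the LLM to review its own output step by step and return corrected code.
    return query_llm(
        "Review the following TypeScript code for errors. First reason step by step "
        "about possible problems (syntax, imports, missing pieces), then answer with "
        "the corrected code only.\n\n" + transpiled_code
    )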
This approach adds agentic behaviour to the component, enabling it to iteratively compile the project, read possible errors, read project files, write project files, install new npm dependencies and instruct an LLM to fix programming errors.
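A simplified sketch of such a repair loop is given below. Here, apply_fix is a caller-provided hypothetical helper that writes the proposed fix to disk, and the real approach additionally lets the LLM read files and install npm dependencies.

import subprocess

def repair_loop_v4(project_dir, query_llm, apply_fix, max_iterations=5):
    for _ in range(max_iterations):
        result = subprocess.run(
            ["npx", "tsc"], cwd=project_dir, capture_output=True, text=True
        )
        if result.returncode == 0:
            return True  # the project compiles, stop repairing
        # Feed the compiler output back to the LLM and ask for a fix.
        fix = query_llm(
            "The TypeScript compilation failed with the following errors. "
            "Propose a corrected version of the affected file.\n\n"
            + result.stdout + result.stderr
        )
        apply_fix(project_dir, fix)  # hypothetical helper that applies the fix
    return False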
We developed a set of utility classes for common tasks in our project.
The FileUtils.py contains various methods for interacting with the filesystem. These range from standard CRUD operations to listing folders/files and more specialized string operations.
The DependencyGraph.py is a more complicated piece of software. Its task is to generate a graph which structures the interconnections of local C++ files according to their import statements. The resulting graph should then be traversable, giving an order to the transpilation process. This section aims to give an introduction to the code to enable further improvements:
Each C++ file can contain a set of import statements for local and external files. A graph can be constructed as a Python dictionary, where the key is the path to a C++ file and the value is a list of files, each of which is a dependency of that C++ file. We remove all dependencies which are not local. This graph can then be used to easily retrieve all dependencies given the path to a C++ file.
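A minimal sketch of how such a dictionary could be built (the include-matching here is simplified compared to the actual DependencyGraph.py):

import re
from pathlib import Path

INCLUDE_PATTERN = re.compile(r'#include\s*"(.+?)"')  # local includes use quotes

def build_dependency_graph(cpp_source_dir):
    source_dir = Path(cpp_source_dir)
    files = list(source_dir.rglob("*.cpp")) + list(source_dir.rglob("*.h"))
    local_names = {f.name for f in files}
    graph = {}
    for file in files:
        includes = INCLUDE_PATTERN.findall(file.read_text())
        # keep only dependencies that exist locally in the project
        graph[str(file)] = [inc for inc in includes if Path(inc).name in local_names]
    return graph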
A copy of this graph also enables a traversal by iteratively transpiling files which have an empty list stored in the graph dictionary. Upon transpilation, the corresponding key of the file is removed from the dictionary and all occurrences in the dependency lists are deleted. However, further improvements of the graph can be made regarding the traversal:
- Header files very often correspond to a single C++ file (Animal.cpp and Animal.h). It is therefore advisable to transpile both the source and the header file into a single TypeScript file. In order for this to work, a pairing between header and source files is made first. A pairing is made if a C++ file imports a header file and both files have the same name. Header files which are paired are then completely removed from the dependency graph (keys as well as entries in dependency lists). During transpilation it is ensured that, every time a C++ file with a paired header file is transpiled, the header file is added to the context, resulting in a transpilation into one TypeScript file. Because the header was removed from the dependency graph, no separate TypeScript file is created for it.
- Dependency cycles can occur when two C++ files import each other's paired header file. In a scenario where paired header files are "merged" with their C++ file during transpilation, this leads to both C++ files depending on each other. To mitigate this problem, such cycles are detected, the affected header files are merged and added as a dependency to both C++ files. As a result, both C++ files can be transpiled individually, each with the merged header file as context.
The CompilationUtils.py provides a wrapper for executing TypeScript compilation commands from Python.
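A minimal sketch of such a wrapper, assuming tsc is available through npx in the initialized npm environment (the actual CompilationUtils.py may differ):

import subprocess

def compile_typescript(project_dir):
    # Returns (success, combined compiler output).
    result = subprocess.run(
        ["npx", "tsc"], cwd=project_dir, capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr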
The testing framework for AllTrans validates that the transpiled TypeScript code behaves identically to the original C++ code by comparing the input-output pairs generated from both versions. The process is automated using multiple scripts that guide the flow from C++ test generation to JavaScript validation.
After transpilation is complete, the process of testing begins in the benchmark.py file. The testing involves three main scripts located in the utils folder: unit_test_generator.py, run_and_parse_tests.py and test_translated_code.py.
The script first points to unit_test_generator.py, which automatically generates unit tests for the C++ source files.
These tests include:
- Input parameters.
- Expected output.
Once the C++ test cases are generated, the flow moves to run_and_parse_tests.py.
This script:
- Reads the generated C++ tests.
- Compiles and runs the tests.
- Captures the inputs and outputs from the test execution.
- Combines the captured inputs and outputs into a JSON file inside the src/test_data folder, named <benchmark-c++-file-base-name>_test_data.json, e.g. bench2_test_data.json.
After the test data is captured, test_translated_code.py takes over.
This script:
- Mechanically passes the captured inputs from the JSON file to the transpiled JavaScript files.
- Compiles and runs the JavaScript code.
- Captures the outputs generated by the JavaScript code.
- Compares the JavaScript outputs with the original C++ outputs stored in the JSON file.
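As an illustration of this comparison step, here is a sketch under the assumption that the JSON file holds a list of cases with "input" and "expected_output" fields; the real field names and structure used by test_translated_code.py may differ.

import json
import subprocess
from pathlib import Path

def check_transpiled_file(js_file, test_data_file):
    test_cases = json.loads(Path(test_data_file).read_text())
    for case in test_cases:
        # Pass the recorded input to the transpiled file via stdin.
        result = subprocess.run(
            ["node", js_file], input=case["input"], capture_output=True, text=True
        )
        # Compare the JavaScript output with the recorded C++ output.
        if result.stdout.strip() != case["expected_output"].strip():
            return False
    return True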
There are two key aspects to consider:
- Successful compilation of the JavaScript code using the provided JSON inputs.
- The test is considered a pass if the outputs match, indicating that the transpilation was accurate and correct.
Here is an example of the final folder structure of a benchmark after testing:
benchmarks_target_typescript
|- 2_level // name of the benchmark
|- default_runner.py // transpilation runner
|- transpilation_task
|- cpp_source
| |- <C++ files> // all source files of the benchmark
| |- utils // utils of our project
| |- alltrans.py // selected transpilation approach
|- src // folder containing the transpilation result
|- test_data // folder containing test_data.json files
|- <test_data.json files> // files containing test data
|- <TypeScript files> // all target files of the benchmark
|- <JavaScript files> // all compiled files of the benchmark
|- unit_tests //folder containing source code tests
|- <C++ test files> // all source_test.cpp files for the benchmark
|- src_history // history of all src folders
python3 benchmark.py
python3 unit_test_generator.py <benchmark_source_code_directory> <benchmark_unit_test_directory>
python3 run_and_parse_tests.py <benchmark_unit_test_directory> <benchmark_test_data_json_directory>
python3 test_translated_code.py <benchmark_target_code_directory> <benchmark_test_data_json_directory>
Current benchmarks include single-file and multi-file projects, such as math tools, libraries and console games. _level describes the folder name under which the benchmark is stored. Up until benchmark "5", the projects are single files. MT1 and all following benchmarks are C++ multi-file projects.
Benchmark | LOC | Description | _level |
---|---|---|---|
0 | 10 | Hello World Program | 0 |
1 | 20 | Vector Addition | 1 |
2 | 58 | Vector Manipulation | 2 |
3.1 | 13 | Array Sum | 3 |
3.2 | 25 | Temperature converter | 3 |
3.3 | 54 | Geometric Areas | 3 |
3.4 | 47 | Calculator | 3 |
4.0 | 61 | Vector Manipulation | 4 |
4.1 | 154 | CG Solver | 4 |
4.2 | 210 | Graph Metrics | 4 |
5 | 210 | Fast Matrix Manipulation | 5 |
MT1 | 68 | Multiple File Projects | 9 |
MT2 | 164 | Multiple File Projects | 11 |
Prime numbers | 363 | Prime number program | 20 |
Snake | 473 | A terminal game | 15 |
Sorting Algorithm | 683 | Sorting algorithms: selection, bubble... | 18 |
Encryption Tool | 700 | Advanced Encryption Standard tool | 16 |
MathTools | 1067 | Math tools for computer graphics | 21 |
Console Calculator | 1277 | A calculator for the console | 17 |
Pac Man | 1515 | A game with an external game engine | 14 |
Graph Structures | 1609 | Graph Structures | 6 |
JSON Parser | 2697 | A parser for JSON files | 19 |
jet | 7926 | jet is a programming language | 7 |
Signal Estimator | 17907 | Measures different characteristics of an audio signal that is looped back from output to input | 13 |
ProAlgos C++ | 25624 | Implementation of different algorithms like searching, sorting, dynamic programming | 12 |
This section shows the evaluation results of the previously described transpilation approaches on the previously described benchmarks. The experiments were performed using Anthropic Claude Sonnet 3.5 as the LLM.
Benchmark | Transpilation | Automated Repair | Compilation | Execution | Test (manual) | Test (automated) |
---|---|---|---|---|---|---|
0 | Successful | --- | Successful | Successful | Correct | Successful |
1 | Successful | --- | Successful | Successful | Correct | --- |
2 | Successful | --- | Successful | Not successful | --- | --- |
3 | Successful | --- | Successful* | Not successful | --- | --- |
4 | Successful | --- | Not successful | --- | --- | --- |
5 | Successful | --- | Not successful | --- | --- | --- |
9 | Successful | --- | Not successful | --- | --- | --- |
10 | Successful | --- | Not successful | --- | --- | --- |
Problems
- Benchmark 2: Undefined function "prompt()" used.
- Benchmark 3: Undefined function "prompt()" used. Each file had to be compiled manually, as multiple main functions exist, which leads to an "overload" error in TypeScript when compiling all files together.
- Benchmark 4: Importing from unavailable "Math" module.
- Benchmark 5: Implementation of a "print" function leads to an overload error.
- Benchmark 9: Failed to export a module
- Benchmark 10: Used wrong import paths for local files
Observations
Approach V1 can be used as a baseline. One can see that it quickly runs into problems for more complex files. Transpilation of projects is not possible because of missing context.
Benchmark | Transpilation | Automated Repair | Compilation | Execution | Test (manual) | Test (automated) |
---|---|---|---|---|---|---|
0 | Successful | --- | Successful | Successful | Correct | Successful |
1 | Successful | --- | Successful | Successful | Correct | Successful |
2 | Successful | --- | Successful | Not successful | --- | Execution Error |
3 | Successful | --- | Successful* | Not successful | --- | --- |
4 | Successful | --- | Not successful | --- | --- | --- |
5 | Successful | --- | Not successful | --- | --- | --- |
9 | Successful | --- | Successful | Successful | Correct | --- |
10 | Successful | --- | Successful | Successful | Correct | --- |
11 | Successful | --- | Successful | Successful | Correct | --- |
14 | Successful | --- | Not successful | --- | --- | --- |
15 | Successful | --- | Not successful | --- | --- | --- |
Problems
- Benchmark 2: Undefined function "prompt()" used.
- Benchmark 3: Undefined function "prompt()" used. Each file had to be compiled manually, as multiple main functions exist, which leads to an "overload" error in TypeScript when compiling all files together.
- Benchmark 4: Importing from unavailable "Math" module.
- Benchmark 5: Implementation of a "print" function leads to an overload error.
- Benchmark 14: Multitude of syntax, import and incomplete implementation errors
- Benchmark 15: Multitude of syntax, import and incomplete implementation errors
Observations Approach V2 gains the capability to correctly transpile small projects with local dependencies. However, it fails for more complex projects.
Benchmark | Transpilation | Automated Repair | Compilation | Execution | Test (manual) | Test (automated) |
---|---|---|---|---|---|---|
0 | Successful | --- | Successful | Successful | Correct | Successful |
1 | Successful | --- | Successful | Successful | Correct | Successful |
2 | Successful | --- | Successful | Successful | Correct | Successful |
3 | Successful | --- | Successful | Successful | Correct | Partially Successful |
4 | Successful | --- | Successful | Successful | Partially correct | --- |
5 | Successful | --- | Successful | Successful | Not correct | --- |
9 | Successful | --- | Successful | Successful | Correct | --- |
10 | Successful | --- | Successful | Successful | Correct | --- |
11 | Successful | --- | Successful | Successful | Correct | --- |
14 | Successful | --- | Not Successful | Not Successful | Not correct | --- |
15 | Successful | --- | Not Successful | Not Successful | Not correct | --- |
Problems
- Benchmark 4 + 5 : Complex matrix operations => Difficult input format / parsing of cmd args
- Benchmark 14 + 15: Multitude of syntax, import and incomplete implementation errors
Observations Approach V3 is capable of eliminating smaller programming and syntax errors for the smaller benchmarks but still lacks the capability to transpile larger projects.
Benchmark | Transpilation | Automated Repair | Compilation | Execution | Test (manual) | Test (automated) |
---|---|---|---|---|---|---|
0 | Successful | Not required | Successful | Successful | Correct | Successful |
1 | Successful | Not required | Successful | Successful | Correct | Successful |
2 | Successful | Not required | Successful | Successful | Correct | Successful |
3 | Successful | Not required | Successful | Successful | Correct | Successful |
4 | Successful | Not required | Successful | Successful | Partially correct | Partially Successful |
5 | Successful | Not Required | Successful | Successful | Not correct | Not Successful |
9 | Successful | Not required | Successful | Successful | Correct | Successful |
10 | Successful | Not required | Successful | Successful | Correct | Successful |
11 | Successful | Not required | Successful | Successful | Correct | Successful |
14 | Successful | Not successful | Not Successful | Not Successful | Not correct | Not Successful |
15 | Successful | Required | Successful | Successful | Correct | Compilation Error |
16 | Not successful | --- | --- | --- | --- | --- |
17 | Successful | Required | Successful | Successful | Not correct | --- |
18 | Successful | Required | Successful | Successful | Partially correct | --- |
19 | Successful | Not successful | --- | --- | --- | --- |
20 | Successful | Required | Successful | Successful | Correct | --- |
21 | Successful | Not successful | --- | --- | --- | --- |
Problems
- Benchmark 4 + 5: Complex matrix operations => Difficult input format / parsing of cmd args
- Benchmark 14: Not able to sufficiently replace the external graphics library
- Benchmark 16: Response blocked by Anthropic's security policy
- Benchmark 17: Skipped steps during transpilation (placeholders / mocks used)
- Benchmark 18: Some sorting algorithms not working
- Benchmark 19: Many errors in initial transpilation, too expensive to correct all
- Benchmark 21: Many errors in initial transpilation, too expensive to correct all
Observation
The agentic repair logic is capable of repairing more significant issues in the transpilation leading to two complex projects (16 + 20) eventually being correctly transpiled.
It is important to mention that the transpilation of project benchmarks behaves differently on every try, meaning that the result is not always the same and does not always work. However, for the benchmarks we marked as "Successful", transpilations worked most of the time.
Below you can see the output of the execution of the transpiled version of benchmark 15, which is a game of Snake.
Score: 11
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . O O O O O < . . . . . .
. . . . . . . . O * . . . . . . . . . .
. . . . . . . . O . . . . . . . . . . .
. . . . . . . O O . . . . . . . . . . .
. . . . . . . O . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
[1] Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen (2024). Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models, arXiv: https://arxiv.org/abs/2408.02442v1
[2] Sun, Q., Chen, N., Wang, J., Li, X., & Gao, M. (2024). TransCoder: Towards unified transferable code representation learning inspired by human skills: https://doi.org/10.48550/arXiv.2306.07285