- Observability of LLM applications: Exploration and practice from the perspective of trace[None][2024]
- Triton Inference Server[None][2024]
- TensorFlow: A system for large-scale machine learning[Proc. USENIX OSDI][2016]
- Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve[arXiv][2024]
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills[arXiv][2023]
- GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings[Proc. 3rd Workshop for Natural Language Processing Open Source Software][2023]
- Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
- Semantic parsing on Freebase from question-answer pairs[Proc. EMNLP][2013]
- Clipper: A low-latency online prediction serving system[Proc. USENIX NSDI][2017]
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness[Proc. NeurIPS][2022]
- BERT: Pre-training of deep bidirectional transformers for language understanding[Proc. NAACL][2019]
- Retrieval-augmented generation for large language models: A survey[arXiv][2023]
- Prompt Cache: Modular attention reuse for low-latency inference[arXiv][2023]
- Musketeer: All for one, one for all in data processing systems[Proc. ACM EuroSys][2015]
- Serving DNNs like Clockwork: Performance predictability from the bottom up[Proc. USENIX OSDI][2020]
- FlashDecoding++: Faster large language model inference on GPUs[Proc. Machine Learning and Systems][2023]
- Data Interpreter: An LLM agent for data science[arXiv][2024]
- MetaGPT: Meta programming for a multi-agent collaborative framework[arXiv][2023]
- Inference without interference: Disaggregate LLM inference for mixed downstream workloads[arXiv][2024]
- A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
- Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
- Dryad: Distributed data-parallel programs from sequential building blocks[Proc. ACM EuroSys][2007]
- Query expansion by prompting large language models[arXiv][2023]
- Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
- RAGCache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
- DSPy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
- An LLM compiler for parallel function calling[arXiv][2023]
- Efficient memory management for large language model serving with PagedAttention[Proc. ACM SOSP][2023]
- Retrieval-augmented generation for knowledge-intensive NLP tasks[Proc. NeurIPS][2020]
- AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc. USENIX OSDI][2023]
- Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache[arXiv][2024]
- Parrot: Efficient serving of LLM-based applications with semantic variable[Proc. USENIX OSDI][2024]
- TruthfulQA: Measuring how models mimic human falsehoods[arXiv][2021]
- Optimizing LLM queries in relational workloads[arXiv][2024]
- Online speculative decoding[arXiv][2023]
- RA-ISF: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
- Self-Refine: Iterative refinement with self-feedback[Proc. NeurIPS][2023]
- SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc. ACM ASPLOS][2024]
- SpotServe: Serving generative large language models on preemptible instances[Proc. ACM ASPLOS][2024]
- Ray: A distributed framework for emerging AI applications[Proc. USENIX OSDI][2018]
- Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
- Splitwise: Efficient generative LLM inference using phase splitting[arXiv][2023]
- HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face[Proc. NeurIPS][2023]
- Fairness in serving large language models[arXiv][2023]
- Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for LLMs[arXiv][2024]
- Gemma 2: Improving open language models at a practical size[arXiv][2024]
- LLaMA: Open and efficient foundation language models[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Attention is all you need[Proc. NeurIPS][2017]
- LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
- AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework[arXiv][2023]
- SmoothQuant: Accurate and efficient post-training quantization for large language models[Proc. ICML][2023]
- C-Pack: Packaged resources to advance general Chinese embedding[arXiv][2023]
- HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc. EMNLP][2018]
- Orca: A distributed serving system for Transformer-based generative models[Proc. USENIX OSDI][2022]
- Apache Spark: A unified engine for big data processing[Communications of the ACM][2016]
- SHEPHERD: Serving DNNs in the wild[Proc. USENIX NSDI][2023]
- Efficiently programming large language models using SGLang[arXiv][2023]
- DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
- On optimal caching and model multiplexing for large model inference[arXiv][2023]
-