- Observability of LLM applications: Exploration and practice from the perspective of trace[None][2024]
- Triton Inference Server[None][2024]
- TensorFlow: A system for large-scale machine learning[Proc. USENIX OSDI][2016]
- Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve[arXiv][2024]
- Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills[arXiv][2023]
- GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings[Proc. 3rd Workshop for Natural Language Processing Open Source Software][2023]
- Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
- Semantic parsing on Freebase from question-answer pairs[Proc. EMNLP][2013]
- Clipper: A low-latency online prediction serving system[Proc. USENIX NSDI][2017]
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness[Proc. NeurIPS][2022]
- BERT: Pre-training of deep bidirectional transformers for language understanding[Proc. NAACL][2019]
- Retrieval-augmented generation for large language models: A survey[arXiv][2023]
- Prompt Cache: Modular attention reuse for low-latency inference[arXiv][2023]
- Musketeer: All for one, one for all in data processing systems[Proc. ACM EuroSys][2015]
- Serving DNNs like Clockwork: Performance predictability from the bottom up[Proc. USENIX OSDI][2020]
- FlashDecoding++: Faster large language model inference on GPUs[Proc. Machine Learning and Systems][2023]
- Data Interpreter: An LLM agent for data science[arXiv][2024]
- MetaGPT: Meta programming for a multi-agent collaborative framework[arXiv][2023]
- Inference without interference: Disaggregate LLM inference for mixed downstream workloads[arXiv][2024]
- A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
- Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
- Dryad: Distributed data-parallel programs from sequential building blocks[Proc. ACM EuroSys][2007]
- Query expansion by prompting large language models[arXiv][2023]
- Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
- RAGCache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
- DSPy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
- An LLM compiler for parallel function calling[arXiv][2023]
- Efficient memory management for large language model serving with PagedAttention[Proc. ACM SOSP][2023]
- Retrieval-augmented generation for knowledge-intensive NLP tasks[Proc. NeurIPS][2020]
- AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc. USENIX OSDI][2023]
- Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache[arXiv][2024]
- Parrot: Efficient serving of LLM-based applications with semantic variable[Proc. USENIX OSDI][2024]
- TruthfulQA: Measuring how models mimic human falsehoods[arXiv][2021]
- Optimizing LLM queries in relational workloads[arXiv][2024]
- Online speculative decoding[arXiv][2023]
- RA-ISF: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
- Self-Refine: Iterative refinement with self-feedback[Proc. NeurIPS][2023]
- SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc. ACM ASPLOS][2024]
- SpotServe: Serving generative large language models on preemptible instances[Proc. ACM ASPLOS][2024]
- Ray: A distributed framework for emerging AI applications[Proc. USENIX OSDI][2018]
- Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
- Splitwise: Efficient generative LLM inference using phase splitting[arXiv][2023]
- HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face[Proc. NeurIPS][2023]
- Fairness in serving large language models[arXiv][2023]
- Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for LLMs[arXiv][2024]
- Gemma 2: Improving open language models at a practical size[arXiv][2024]
- LLaMA: Open and efficient foundation language models[arXiv][2023]
- Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
- Attention is all you need[Proc. NeurIPS][2017]
- LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
- AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework[arXiv][2023]
- SmoothQuant: Accurate and efficient post-training quantization for large language models[Proc. ICML][2023]
- C-Pack: Packaged resources to advance general Chinese embedding[arXiv][2023]
- HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc. EMNLP][2018]
- Orca: A distributed serving system for Transformer-based generative models[Proc. USENIX OSDI][2022]
- Apache Spark: A unified engine for big data processing[Communications of the ACM][2016]
- SHEPHERD: Serving DNNs in the wild[Proc. USENIX NSDI][2023]
- Efficiently programming large language models using SGLang[arXiv][2023]
- DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
- On optimal caching and model multiplexing for large model inference[arXiv][2023]
-