Awesome-ML-System-CoDesign

LLM Inference/Serving

  • Observability of LLM applications: Exploration and practice from the perspective of trace[None][2024]
  • Triton Inference Server[None][2024]
  • TensorFlow: A system for large-scale machine learning[Proc. USENIX OSDI][2016]
  • Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve[arXiv][2024]
  • Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills[arXiv][2023]
  • GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings[Proc. 3rd Workshop for Natural Language Processing Open Source Software][2023]
  • Legion: Expressing locality and independence with logical regions[Proc. IEEE SC][2012]
  • Semantic parsing on Freebase from question-answer pairs[Proc. EMNLP][2013]
  • Clipper: A low-latency online prediction serving system[Proc. USENIX NSDI][2017]
  • FlashAttention: Fast and memory-efficient exact attention with IO-awareness[Proc. NeurIPS][2022]
  • BERT: Pre-training of deep bidirectional transformers for language understanding[Proc. NAACL][2019]
  • Retrieval-augmented generation for large language models: A survey[arXiv][2023]
  • Prompt Cache: Modular attention reuse for low-latency inference[arXiv][2023]
  • Musketeer: All for one, one for all in data processing systems[Proc. ACM EuroSys][2015]
  • Serving DNNs like Clockwork: Performance predictability from the bottom up[Proc. USENIX OSDI][2020]
  • FlashDecoding++: Faster large language model inference on GPUs[Proc. MLSys][2023]
  • Data Interpreter: An LLM agent for data science[arXiv][2024]
  • MetaGPT: Meta programming for a multi-agent collaborative framework[arXiv][2023]
  • Inference without interference: Disaggregate LLM inference for mixed downstream workloads[arXiv][2024]
  • A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions[arXiv][2023]
  • Tool calling: Enhancing medication consultation via retrieval-augmented large language models[arXiv][2024]
  • Dryad: Distributed data-parallel programs from sequential building blocks[Proc. ACM EuroSys][2007]
  • Query expansion by prompting large language models[arXiv][2023]
  • Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity[arXiv][2024]
  • RAGCache: Efficient knowledge caching for retrieval-augmented generation[arXiv][2024]
  • DSPy: Compiling declarative language model calls into self-improving pipelines[arXiv][2023]
  • An LLM compiler for parallel function calling[arXiv][2023]
  • Efficient memory management for large language model serving with PagedAttention[Proc. ACM SOSP][2023] (KV-cache paging sketched after this list)
  • Retrieval-augmented generation for knowledge-intensive NLP tasks[Proc. NeurIPS][2020]
  • AlpaServe: Statistical multiplexing with model parallelism for deep learning serving[Proc. USENIX OSDI][2023]
  • Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache[arXiv][2024]
  • Parrot: Efficient serving of LLM-based applications with semantic variable[Proc. USENIX OSDI][2024]
  • TruthfulQA: Measuring how models mimic human falsehoods[arXiv][2021]
  • Optimizing LLM queries in relational workloads[arXiv][2024]
  • Online speculative decoding[arXiv][2023]
  • RA-ISF: Learning to answer and understand from retrieval augmentation via iterative self-feedback[arXiv][2024]
  • Self-Refine: Iterative refinement with self-feedback[Proc. NeurIPS][2023]
  • SpecInfer: Accelerating large language model serving with tree-based speculative inference and verification[Proc. ACM ASPLOS][2024] (draft-then-verify loop sketched after this list)
  • SpotServe: Serving generative large language models on preemptible instances[Proc. ACM ASPLOS][2024]
  • Ray: A distributed framework for emerging AI applications[Proc. USENIX OSDI][2018]
  • Lossless acceleration of large language model via adaptive n-gram parallel decoding[arXiv][2024]
  • Splitwise: Efficient generative LLM inference using phase splitting[arXiv][2023]
  • HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face[Proc. NeurIPS][2023]
  • Fairness in serving large language models[arXiv][2023]
  • Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for LLMs[arXiv][2024]
  • Gemma 2: Improving open language models at a practical size[arXiv][2024]
  • LLaMA: Open and efficient foundation language models[arXiv][2023]
  • Llama 2: Open foundation and fine-tuned chat models[arXiv][2023]
  • Attention Is All You Need[Proc. NeurIPS][2017]
  • LoongServe: Efficiently serving long-context large language models with elastic sequence parallelism[arXiv][2024]
  • AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework[arXiv][2023]
  • SmoothQuant: Accurate and efficient post-training quantization for large language models[Proc. ICML][2023]
  • C-Pack: Packaged resources to advance general Chinese embedding[arXiv][2023]
  • HotpotQA: A dataset for diverse, explainable multi-hop question answering[Proc. EMNLP][2018]
  • Orca: A distributed serving system for Transformer-based generative models[Proc. USENIX OSDI][2022]
  • Apache Spark: A unified engine for big data processing[Communications of the ACM][2016]
  • SHEPHERD: Serving DNNs in the wild[Proc. USENIX NSDI][2023]
  • Efficiently programming large language models using SGLang[arXiv][2023]
  • DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving[arXiv][2024]
  • On optimal caching and model multiplexing for large model inference[arXiv][2023]
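
Since the PagedAttention entry above names the idea only by title, here is a minimal Python sketch of it. The `PagedKVCache` class and its method names are toy constructions of my own, not vLLM's actual API: KV tensors live in fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so cache memory is allocated on demand rather than reserved per request up front.

```python
# Toy sketch of paged KV-cache allocation (my simplification, not vLLM's API).
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative value)

class PagedKVCache:
    """Block-table allocator: maps each sequence's logical token
    positions to dynamically allocated fixed-size physical blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids

    def slot_for(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) holding the KV entry for token `pos`
        (assumed appended in order), allocating a block at each boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # first token of a new logical block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: evict or preempt a request")
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):          # a 20-token sequence occupies ceil(20/16) = 2 blocks
    block, offset = cache.slot_for(seq_id=0, pos=pos)
cache.release(seq_id=0)
```

As with virtual memory, the indirection means a sequence's blocks need not be contiguous, which is what cuts fragmentation and over-reservation; the paper additionally describes copy-on-write sharing of blocks across sequences, omitted here.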
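
Several entries above (Online speculative decoding, SpecInfer, adaptive n-gram parallel decoding) build on draft-then-verify decoding. The sketch below is again my own simplification with a toy `ToyModel` stand-in, not any of these systems' interfaces: a cheap draft model proposes k tokens and the target model keeps each with probability min(1, p_target / p_draft). The full algorithm also resamples the rejected position from an adjusted distribution (which makes the output match the target model exactly); this sketch omits that step.

```python
import random

class ToyModel:
    """Stand-in language model: a fixed distribution over a tiny vocabulary.
    Real systems pair an actual small draft model with a large target model."""
    def __init__(self, probs):
        self.probs = probs  # token -> probability

    def sample(self, _ctx):
        toks, ps = zip(*self.probs.items())
        tok = random.choices(toks, weights=ps)[0]
        return tok, self.probs[tok]

    def prob(self, _ctx, tok):
        return self.probs[tok]

def speculative_step(draft, target, prefix, k=4):
    """Propose k draft tokens, then accept a prefix of them via rejection
    sampling against the target model; on rejection, stop (a real system
    would resample that position from an adjusted distribution)."""
    ctx = list(prefix)
    proposals = []
    for _ in range(k):
        tok, p_draft = draft.sample(ctx)
        proposals.append((tok, p_draft))
        ctx.append(tok)

    out = list(prefix)
    for tok, p_draft in proposals:
        p_target = target.prob(out, tok)
        if random.random() < min(1.0, p_target / p_draft):
            out.append(tok)   # accepted: target agrees often enough
        else:
            break             # rejected: discard this and all later drafts
    return out

draft = ToyModel({"a": 0.7, "b": 0.3})
target = ToyModel({"a": 0.5, "b": 0.5})
print(speculative_step(draft, target, prefix=["<s>"], k=4))
```

The win comes from verification: the target model scores all k proposals in one forward pass, so each accepted draft token costs roughly 1/k of a normal decode step. SpecInfer generalizes the single draft sequence to a token tree verified in parallel.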

LLM-based Agent Inference/Serving

LLM Training
