"COCONUT: Redefining Reasoning in Large Language Models" by Kaushik Rajan in Generative AI (Dec 17, 2024). Revolutionizing reasoning in large language models through latent space.
"Exploring Medusa and Multi-Token Prediction" by Matthew Gunton in TDS Archive (Jul 10, 2024). A detailed look at the paper "MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads".
"Torch Compile: 2x Faster Llama 3.2 with Low Effort" by Benjamin Marie in TDS Archive (Nov 13, 2024). But it will depend on your GPU.
"Crazy Challenge: Run Llama 405B on an 8GB VRAM GPU" by Gavin Li in AI Advances (Aug 1, 2024). I'm taking on the challenge of running the Llama 3.1 405B model on a GPU with only 8GB of VRAM.
"Stop Guessing! Here's How Much GPU Memory You REALLY Need for LLMs!" by Muhammad Saad Uddin in AI Advances (Sep 20, 2024). Techniques to calculate and reduce the memory footprint in LLM serving.
"My LLM's outputs got 1000% better with this simple trick." by Nikhil Anand in AI Advances (Dec 2, 2024). I wish I had known this trick sooner.
"Reduce LLM Footprint with OpenVINO™ Toolkit Weight Compression" by OpenVINO™ toolkit in OpenVINO-toolkit (Jul 2, 2024). Create lean LLMs using weight compression with the OpenVINO™ toolkit. Reduce LLM size, memory footprint, and GPU requirements.
"Best LLM Inference Engine? TensorRT vs vLLM vs LMDeploy vs MLC-LLM" by Zain ul Abideen (Jul 6, 2024). Benchmarking various LLM inference engines.
"Apple MLX vs Llama.cpp vs Hugging Face Candle Rust for Lightning-Fast LLMs Locally" by Zain ul Abideen (Jan 31, 2024). Experimenting with Mistral-7B and Phi-2 to find the fastest inference/generation speed across libraries.
"How to Run 70B LLMs on a Single 4GB GPU" by Simone Tedeschi in Generative AI (Jan 21, 2024). Have you ever dreamed of using the state-of-the-art large language models (LLMs) for your natural language processing (NLP) tasks, but felt…
"Run Llama 2 70B on Your GPU with ExLlamaV2" by Benjamin Marie in TDS Archive (Sep 29, 2023). Finding the optimal mixed-precision quantization for your hardware.