A Guide to All Open-Source Large Language Models (LLMs)

Luv Bansal
11 min read · Aug 23, 2023


A Comprehensive Guide to All Open-Source Foundational Large Language Models (LLMs) and Their Comparisons

Photo by Google DeepMind on Unsplash

It seems everyone is caught up in the latest trend: large language models (LLMs). Demand for these data-hungry models keeps growing, and from GPT-3 to Megatron, the push toward larger and more capable systems continues. So whether you’re new to language processing or a seasoned practitioner, here’s a rundown of the open-source LLMs that have emerged. Get ready to dive deep into the world of technology!

In this blog post, I will provide detailed information about the widely known and currently trending open-source foundational language models released up to August 2023. I will also evaluate these models and offer some comparisons between them.

List of Open-Source Large Language Models, sorted by release date:

Meta’s Llama-2

The Llama 2 release introduces a family of pretrained and fine-tuned LLMs ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including training on 40% more tokens, a much longer context length (4k tokens), and grouped-query attention for fast inference of the 70B model!

The fine-tuned models (Llama 2-Chat) have been optimized for dialogue applications using Reinforcement Learning from Human Feedback (RLHF).

Params: 7B to 70B parameters (7B, 13B, 70B)
Context size: 4096
Trained on: 2 trillion tokens
Release Date: July, 2023
License: Custom
Github: Source Code
Paper: Llama 2: Open Foundation and Fine-Tuned Chat Models
Try it: HuggingChat
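To go beyond the hosted demo, here is a minimal sketch of running the 7B chat variant locally with Hugging Face transformers. It assumes you have been granted access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint and have a GPU with enough memory; the prompt formatting and generation settings are illustrative.

```python
# Minimal sketch: generating a reply with Llama 2-Chat via Hugging Face transformers.
# Assumes access to the gated meta-llama/Llama-2-7b-chat-hf repo and a capable GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Llama 2-Chat was tuned with an [INST] ... [/INST] prompt format.
prompt = "[INST] Explain grouped-query attention in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```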

Mosaic’s MPT-30B

MPT-30B is a powerful addition to the MosaicML Foundation Series of open-source models designed for artificial intelligence and machine learning tasks. It surpasses the quality of GPT-3 on various evaluation metrics while using significantly less training compute, and it introduces two fine-tuned variants, MPT-30B-Instruct and MPT-30B-Chat, specializing in instruction following and multi-turn conversations respectively. Notably, MPT-30B models are trained with an 8k-token context window, enabling efficient handling of longer contexts, and their pre-training data mixture gives them strong coding abilities.

Despite being trained as a general conversational model, MPT-30B-Chat is also surprisingly good at programming and scores 37.2% on HumanEval, placing it above almost all open-source models other than WizardCoder. On coding tasks, MPT-30B models outperform LLaMA-30B and Falcon-40B by a wide margin, and even outperform many purpose-built coding models such as StarCoder.

Params: 30B
License: Licensed for commercial use, except MPT-30B-Chat (non-commercial CC-BY-NC-SA-4.0)
Context size: 8192
Trained on: 1 Trillion tokens
Release Date: June, 2023
Paper/blog: MPT-30B: Raising the bar for open-source foundation models
Try it: Playground

XGen (by Salesforce)

The XGen project has introduced a series of 7B language models, including XGen-7B-4K-base and XGen-7B-8K-base, trained on up to 1.5T tokens with dense attention up to 8K sequence length. Additionally, instruction-tuned versions (XGen-7B-{4K,8K}-inst) were fine-tuned on public domain instructional data.

These models showcase competitive or superior results compared to state-of-the-art LLMs of similar size, such as MPT, Falcon, LLaMA, Redpajama, and OpenLLaMA, particularly excelling in long sequence modeling tasks. The XGen-7B models demonstrate robust performance in both text and code tasks, bridging the gap between these domains. Notably, the 8K-seq models outperform their 2K- and 4K-seq counterparts, showing the significance of larger context.

Params: 7B
License: Apache 2.0; XGen-7B-{4K,8K}-inst released for research purposes only
Context size: 8192
Trained on: 1.5 Trillion tokens
Release Date: June, 2023
GitHub: Source Code
Paper/blog: XGen: A 7B LLM Trained on 8K Input Sequence Length
Try it: Playground
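As a rough illustration, here is how the 8K-context base model might be loaded with transformers. The Salesforce/xgen-7b-8k-base repo name comes from the release; the trust_remote_code flag is needed because XGen distributes its tokenizer as custom code, which is an assumption based on the model card.

```python
# Minimal sketch: loading the 8K-context XGen base model.
# Assumes the Salesforce/xgen-7b-8k-base checkpoint; trust_remote_code=True is
# used because the tokenizer ships as custom code in that repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Long-context models are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```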

Mosaic’s MPT-7B

Mosaic released a series of MPT-7B models with three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, the last of which uses a context length of 65k tokens!

MPT-7B is a decoder-style transformer with 6.7B parameters. It was trained from scratch on 1 trillion tokens of text and code, carefully curated by MosaicML’s data team. The base model includes FlashAttention for fast training and inference, as well as ALiBi, which allows finetuning and extrapolation beyond its 2048-token training context.

Params: 6.7B
License: Licensed for commercial use
Context size: 2048 (65k tokens for MPT-7B-StoryWriter-65k+)
Trained on: 1 Trillion tokens
Release Date: May, 2023
Github: Source Code
Paper: MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs
Try it: Playground
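The ALiBi point above is what makes the StoryWriter variant possible: positions are encoded as attention biases, so the usable context can be raised at load time. A minimal sketch, assuming the mosaicml/mpt-7b repo's custom config exposes a max_seq_len field (as the MosaicML model cards describe) and that trust_remote_code is acceptable to you:

```python
# Minimal sketch: loading MPT-7B and extending its context past the 2048-token
# training length via ALiBi. Assumes the custom MPT config exposes `max_seq_len`
# (per the MosaicML model cards).
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 4096  # extrapolate beyond the 2048-token training context

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype=torch.bfloat16,
    trust_remote_code=True, device_map="auto"
)
```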

TII’s Falcon

Falcon is a recently introduced large language model (LLM) developed by the Technology Innovation Institute (TII). The family includes Falcon-40B, a 40-billion-parameter autoregressive decoder-only model trained to predict the next token, along with Falcon-7B (7 billion parameters) and the ready-to-use Falcon-40B-Instruct and Falcon-7B-Instruct models.

Falcon-40B has demonstrated better performance than GPT-3 with only 75% of the training compute budget and reduced inference cost. Falcon-40B was trained on 1 trillion tokens and Falcon-7B on 1.5 trillion tokens using RefinedWeb, a web dataset that underwent careful filtering and deduplication, which contributes to their strong performance. Falcon LLM is open-source under the Apache License 2.0, aiming to promote collaboration and innovation in AI.

Note: Falcon does not generate meaningful code.

Params: 7B, 40B
License: Apache License 2.0
Context size: 2048
Trained on: Falcon-7B: 1.5 Trillion tokens; Falcon-40B: 1 Trillion tokens
Release Date: May, 2023
Paper/blog: Falcon LLM
Try it: Playground

OpenLLaMA

OpenLLaMA is an open-source reproduction of Meta’s LLaMA language model with some modifications, incorporating memory-efficient attention from Xformers, stable embedding from BLOOM, and shared input-output embeddings from PaLM. The model is pre-trained on both Chinese and English, which gives it better performance on Chinese language tasks. Variants include 3B, 7B, and 13B models trained on 1 trillion tokens.

OpenLLaMA v2 models outperform the v1 models and are trained on diverse datasets, including Falcon RefinedWeb, StarCoder, Wikipedia, arXiv, and more.

Note: OpenLLaMA does not generate meaningful code.

Params: 3B, 7B, 13B
License: Apache License 2.0
Context size: 2048
Trained on: 1 Trillion tokens
Release Date: May, 2023
Github: Source Code
Try it: Playground

Together’s RedPajama-INCITE

The RedPajama project, dedicated to advancing open-source AI models, has introduced the RedPajama-INCITE family, encompassing 3B and 7B parameter models, as well as instruction-tuned and chat versions with the same architecture as the popular Pythia model suite. These models leverage a base dataset inspired by LLaMA, showcasing exceptional performance in contextual understanding tasks. Notably, the 3B model stands out for its speed and accessibility, while the instruction-tuned models excel in downstream applications like few-shot tasks and summarization. Impressively, the 7B model, still in training, surpasses Pythia 7B, highlighting the impact of a larger dataset.

The RedPajama project is built on top of the 1.2-trillion-token RedPajama dataset. The RedPajama initiative’s collaborative nature and scalability demonstrate the potential for future AI advancements. Despite the ongoing quality refinement, these models have already exhibited promising results in benchmarks such as HELM and lm-evaluation-harness.

Params: 3B, 7B
License: Licensed for commercial use
Context size: 2048
Trained on: RedPajama dataset (1.2 Trillion tokens)
Release Date: May, 2023
Github: Source Code
Paper: 3B and 7B RedPajama-INCITE family of models including base, instruction-tuned & chat models
Try it: Playground
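The underlying 1.2-trillion-token corpus is itself published on the Hugging Face Hub, so you can inspect what these models were trained on. A minimal sketch with the datasets library; the togethercomputer/RedPajama-Data-1T-Sample repo name (a small slice of the full corpus) is an assumption based on Together's releases.

```python
# Minimal sketch: peeking at the RedPajama pretraining corpus with `datasets`.
# The "-Sample" repo is a small slice of the full 1.2T-token dataset.
from datasets import load_dataset

sample = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(sample[0]["text"][:500])  # each record carries raw text plus source metadata
```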

FastChat-T5

FastChat-T5 is an open-source chatbot trained by fine-tuning Flan-T5-XL (3B parameters) on user-shared conversations collected from ShareGPT. It is based on an encoder-decoder transformer architecture and can autoregressively generate responses to users’ inputs. It outperforms Dolly-V2 while using 4x fewer parameters.

Params: 3B
License: Apache License 2.0
Context size: 512
Trained on: 1 Trillion tokens
Release Date: April, 2023
Github: Source Code
Try it: Playground
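Because FastChat-T5 is an encoder-decoder model, it loads through the seq2seq class rather than a causal-LM class. A minimal sketch, assuming the lmsys/fastchat-t5-3b-v1.0 checkpoint on the Hugging Face Hub:

```python
# Minimal sketch: running the encoder-decoder FastChat-T5 chatbot.
# Assumes the lmsys/fastchat-t5-3b-v1.0 checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "lmsys/fastchat-t5-3b-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)  # T5 sentencepiece tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is an encoder-decoder language model?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```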

StabilityAI’s StableLM-Alpha

Stability AI has introduced StableLM, an open-source language model available in parameter sizes ranging from 3 billion to 7 billion, with larger models to come. StableLM builds on previous open-source language model efforts like GPT-J and GPT-NeoX, boasting high performance in text and code generation tasks despite its smaller size. The models are trained on an experimental dataset three times larger than The Pile, containing 1.5 trillion tokens, which enhances their conversational and coding capabilities.

Additionally, fine-tuned research models are offered, utilizing datasets like Alpaca, GPT4All, Dolly, ShareGPT, and HH. These fine-tuned models are intended for research purposes and are released under the noncommercial CC BY-NC-SA 4.0 license.

Params: 3B, 7B
License: CC BY-SA-4.0 license (for commercial or research purposes)
Context size: 4096
Trained on: Experimental dataset of 1.5 Trillion tokens (roughly 3x the size of The Pile)
Release Date: April, 2023
Github: Source Code
Paper: StableLM Suite of Language Models
Try it: Playground

Google’s PaLM

PaLM 2 is an AI model from Google that is ranked among the best large language models of 2023. Google has focused on commonsense reasoning, formal logic, mathematics, and advanced coding in 20+ languages with the PaLM 2 model. It’s being said that the largest PaLM 2 model was trained with 540 billion parameters and has a maximum context length of 4096 tokens.

Google has announced four models based on PaLM 2 in different sizes (Gecko, Otter, Bison, and Unicorn). Of these, Bison is currently available; it scored 6.40 on the MT-Bench test, whereas GPT-4 scored a whopping 8.99 points.

Dolly by Databricks

Databricks’ dolly-v2 is an instruction-following large language model trained on the Databricks machine learning platform and licensed for commercial use. It is derived from EleutherAI’s Pythia-12b and fine-tuned on a ~15K-record instruction corpus generated by Databricks employees, covering capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Dolly-v2-12b is not a state-of-the-art model; it underperforms dolly-v1-6b on some evaluation benchmarks, which might be due to the composition and size of the underlying fine-tuning datasets. Even so, it exhibits surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.

Params: 3B, 7B, 12B
License: MIT
Context size: 2048
Release Date: April, 2023
Github: Source Code
Paper: Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM
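Per the Databricks model card, dolly-v2 ships a custom instruction-following pipeline that is pulled in with trust_remote_code, so you can prompt it with plain instructions rather than a hand-built template. A rough sketch (the smaller dolly-v2-3b checkpoint works the same way if 12B is too heavy for your hardware):

```python
# Minimal sketch: prompting Dolly through its custom instruction pipeline.
# trust_remote_code=True loads Databricks' pipeline code from the
# databricks/dolly-v2-12b repo, as described in its model card.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
print(generate_text("Explain the difference between open QA and closed QA."))
```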

EleutherAI’s Pythia

Pythia is a suite of 16 LLMs, all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. EleutherAI trained and released a suite of 8 model sizes on 2 different datasets: The Pile, as well as The Pile with deduplication applied.

All 8 model sizes are trained on the exact same data, in the exact same order. Each model saw 299,892,736,000 ~= 299.9B tokens during training.

Params: 70M to 12B
License: Apache 2.0
Context size: 2048
Trained on: 299.9B tokens
Release Date: April, 2023
Github: Source Code
Paper: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
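Because every Pythia size publishes intermediate training checkpoints as Hub revisions, you can load a specific training step and study how behavior changes over training. A minimal sketch following the pattern in the Pythia model cards (checkpoint tags such as step3000 are taken from those cards):

```python
# Minimal sketch: loading an intermediate Pythia checkpoint by revision.
# The Pythia model cards describe checkpoints tagged step0, step1000, ...,
# up to the final step, which enables training-dynamics analysis.
from transformers import AutoTokenizer, GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped", revision="step3000"
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped", revision="step3000"
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```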

Open Assistant (Pythia family)

OpenAssistant Conversations is a corpus of human-generated, human-annotated assistant-style conversation data that has been used to develop large language models (LLMs) for a variety of language tasks.

The first iteration English supervised-fine-tuning (SFT) model of the Open-Assistant project is based on Pythia 12B and was fine-tuned on ~22k human demonstrations of assistant conversations collected through the https://open-assistant.io/ human feedback app.

Params: 12B
License: Apache 2.0
Context size: 2048
Trained on: 299.9B tokens
Release Date: March, 2023
Paper: OpenAssistant Conversations — Democratizing Large Language Model Alignment
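The SFT models expect the Open-Assistant turn tokens in the prompt. A minimal sketch, assuming the OpenAssistant/oasst-sft-1-pythia-12b checkpoint and the <|prompter|>/<|assistant|> prompt format described in its model card:

```python
# Minimal sketch: prompting the Open-Assistant SFT model with its turn tokens.
# Checkpoint name and prompt format are taken from the OpenAssistant model card
# for oasst-sft-1-pythia-12b; adjust if you use a later SFT release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenAssistant/oasst-sft-1-pythia-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "<|prompter|>What is RLHF, in one sentence?<|endoftext|><|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```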

EleutherAI’s GPT-J

GPT-J is a 6-billion-parameter open-source English autoregressive language model trained on The Pile. The model was trained on 400B tokens from The Pile, a dataset of roughly 800GB of text. Its zero-shot performance is roughly on par with a comparably sized GPT-3 model, and it closes the performance gap to GPT-3 more than the GPT-Neo models do.

GPT-J-6B was trained on an English-language only dataset, and is thus not suitable for translation or generating text in other languages.

Params: 6B
License: Apache 2.0
Context size: 2048
Trained on: 400B tokens
Release Date: June, 2021
Github: Source Code
Paper: GPT-J-6B: 6B JAX-Based Transformer

EleutherAI’s GPT-NeoX

GPT-NeoX is a family of large-scale autoregressive language models that includes several variants, such as GPT-NeoX-1.6B, GPT-NeoX-3.2B, and GPT-NeoX-20B. These models are trained on massive datasets containing billions of tokens, primarily The Pile, depending on the specific variant and the source of the data. They use a context window of up to 2048 tokens, which allows them to capture long-range dependencies in text.

Evaluation results show that GPT-NeoX models perform well on a variety of benchmarks, including SuperGLUE and LAMBADA, and compare favorably with similarly sized GPT-3 models. In comparative studies, GPT-NeoX models have been shown to achieve higher accuracy and better generalization than models like BERT and RoBERTa.

Params: 1.2B, 3.2B, 20B
License: Apache 2.0
Context size: 2048
Trained on: 400B tokens
Release Date: Feb, 2022
Github: Source Code
Paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model

BigScience’s BLOOM

The BLOOM family consists of several variants of the original BLOOM model, each with different sizes and capabilities. The original BLOOM model, an open-access multilingual large language model (LLM), has 176 billion parameters and was trained on a dataset of 1.61 terabytes of text spanning 46 natural languages and 13 programming languages. The model uses a context window of 2048 tokens, which allows it to capture long-range dependencies in the input text.

In terms of evaluation, BLOOM has been tested in zero-shot and few-shot settings on a range of tasks, including SuperGLUE, machine translation, summarization, and code generation. The results show that BLOOM outperforms many existing language models in these settings. However, the model’s performance is still not perfect, and there is room for improvement.

Params: 176B
License: OpenRAIL-M v1
Context size: 2048
Trained on: 1.61 terabytes of text
Release Date: Nov, 2022
Paper: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
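BLOOM’s multilingual behavior is easy to sample without the full 176B weights, since the BigScience organization also publishes much smaller siblings. A minimal sketch using bigscience/bloom-560m purely for illustration (the flagship weights live at bigscience/bloom and need multi-GPU hardware):

```python
# Minimal sketch: multilingual generation with a small BLOOM-family checkpoint.
# bigscience/bloom-560m is used only for illustration; the 176B flagship at
# bigscience/bloom requires far more memory than a single consumer GPU offers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# BLOOM was trained on 46 natural languages, so non-English prompts work too.
inputs = tokenizer("La capitale de la France est", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```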

The Growing World of Large Language Models

The world of Large Language Models is huge and always growing. Every new model keeps pushing the limits of what we can do. The open-source nature of the LLMs mentioned in this blog shows how the AI community works together and also sets the path for more innovations in the future.

Whether you’re a seasoned researcher, a budding AI enthusiast, or someone curious about the potential of these models, there’s no better time to dive in and explore the vast possibilities they offer.

In my upcoming blog posts, I’ll discuss topics like Closed Source Large Language Models (LLMs) and compare them with open-source models. I’ll also delve into Open Source LLMs for coding, provide in-depth blogs on each type of LLM, explain how to use open-source Large Language Models both online and on your own computer, and much more.

References

Open LLMs
