xgen-7b-8k-instruct

A powerful 7-billion-parameter LLM trained on sequences of up to 8K tokens and fine-tuned on instructional data, enabling robust long-sequence modeling and a broad range of NLP applications.

Notes

Introduction

The XGEN-7B-8K-instruct model is a large language model (LLM) trained on sequences of up to 8K tokens for up to 1.5 trillion tokens, making it suitable for long-sequence tasks such as text summarization, dialogue generation, and code generation. It is part of the XGen series of LLMs. The model has been fine-tuned on public-domain instructional data, equipping it for a wide range of language understanding and generation tasks.

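The model can be run locally with standard open-source tooling. Below is a minimal loading sketch using the Hugging Face transformers library; the checkpoint name Salesforce/xgen-7b-8k-inst and the trust_remote_code flag (the XGen release ships a custom tokenizer) reflect the public Salesforce release and are assumptions rather than details stated on this page.

```python
# Minimal sketch: loading XGen-7B-8K-instruct with Hugging Face transformers.
# The checkpoint name and trust_remote_code flag are assumptions based on the
# public Salesforce release, not details documented on this page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/xgen-7b-8k-inst"  # assumed Hugging Face checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
```
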
XGEN-7B-8K-instruct Model

  • Model Size: 7 billion parameters
  • Context Length: Up to 8,192 tokens (a quick length check is sketched after this list)
  • Training Data: Pre-trained on a two-stage data mixture, including Wikipedia-like documents from Common Crawl and Wikipedia, as well as code data from StarCoder.
  • Fine-tuning Data: Public-domain instructional data, including databricks-dolly-15k, oasst1, Baize, and GPT-related datasets.

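Because the context window tops out at 8,192 tokens, it is worth checking prompt length before calling the model. A small sketch, reusing the tokenizer loaded above (the reserved output budget is an illustrative choice):

```python
# Sanity check: confirm a prompt fits in the 8,192-token context window while
# leaving room for the tokens the model will generate. Reuses the tokenizer
# from the loading snippet above; the prompt text is only a placeholder.
MAX_CONTEXT = 8192
RESERVED_FOR_OUTPUT = 256  # illustrative budget for generated tokens

prompt = "<your long prompt here>"
n_prompt_tokens = len(tokenizer(prompt)["input_ids"])
if n_prompt_tokens > MAX_CONTEXT - RESERVED_FOR_OUTPUT:
    raise ValueError(
        f"Prompt is {n_prompt_tokens} tokens; at most "
        f"{MAX_CONTEXT - RESERVED_FOR_OUTPUT} fit alongside the reserved output budget."
    )
```
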
Use Cases

The XGEN-7B-8K-instruct model can be utilized in various natural language processing (NLP) tasks, including but not limited to:

  • Text summarization: Generating concise summaries of long documents (see the sketch after this list).
  • Code generation: Generating code from natural language instructions (docstrings).
  • Dialogue understanding and summarization: Generating summaries of long dialogues.
  • Long-form question answering (QA): Answering questions based on lengthy source documents.
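
To make the summarization use case concrete, here is a hedged sketch of a single generation call with the model and tokenizer loaded earlier. The "### Human:" prompt framing mirrors the style used in the public XGen demo code, and the sampling parameters are arbitrary illustrative choices; neither is documented on this page.

```python
# Sketch of long-document summarization with the model/tokenizer loaded above.
# The prompt framing and sampling parameters are illustrative assumptions.
long_document = "<paste the long document to summarize here>"

prompt = (
    "### Human: Please summarize the following document in a few sentences.\n\n"
    f"{long_document}\n###"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=256,   # length budget for the summary
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
# Decode only the newly generated tokens, not the echoed prompt.
summary = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary)
```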

Dataset Information

The model was pre-trained on a vast corpus consisting of Wikipedia-like documents from Common Crawl and Wikipedia data covering 22 languages.

The model is trained using a two-stage training strategy with different data mixtures:

  1. First stage (1.37T tokens): Includes the top 20% of Wikipedia-like documents from Common Crawl and Wikipedia data covering 22 languages.
  2. Second stage (110B tokens): Mixes code data from StarCoder with the data from the first stage.

The fine-tuning data includes instructional datasets such as databricks-dolly-15k, oasst1, Baize, and GPT-related datasets.

Evaluation

The XGEN-7B-8K-instruct model shows promising performance across various evaluation tasks:

  • It achieves comparable or superior results on standard NLP benchmarks when compared to other state-of-the-art open-source LLMs of similar size.
  • The targeted evaluation on long sequence modeling benchmarks demonstrates the benefits of the 8K-sequence models over the 2K- and 4K-sequence models.
  • The model performs well both on text tasks, such as Measuring Massive Multitask Language Understanding (MMLU) and QA, and on code tasks, such as HumanEval code generation.
  • It outperforms other baselines on long-form dialogue generation, text summarization, and QA tasks.

Limitations

Despite efforts to address risks of bias, toxicity, and hallucinations during pre-training and fine-tuning, XGEN-7B-8K-instruct is not completely free from such limitations. As with other LLMs, it may produce biased or inappropriate outputs in certain contexts. Additionally, the model's performance may be limited on particularly complex or challenging tasks.

  • ID
    xgen-7b-8k-instruct
  • Model Type ID
    Text To Text
  • Input Type
    text
  • Output Type
    text
  • Description
    A powerful 7-billion-parameter LLM trained on sequences of up to 8K tokens and fine-tuned on instructional data, enabling robust long-sequence modeling and a broad range of NLP applications.
  • Last Updated
    Oct 17, 2024
  • Privacy
    PUBLIC