StarCoder2-15B

StarCoder2-15B is a 15-billion-parameter code generation model trained on 600+ programming languages, optimized for code-related tasks with a large context window and robust performance.

Input

Prompt:

  • Max tokens: The maximum number of tokens to generate. Shorter token lengths provide faster performance.
  • Temperature: A decimal number that determines the degree of randomness in the response.
  • Top-p: An alternative to sampling with temperature, where the model considers only the tokens comprising the top_p probability mass.
  • Top-k: Limits the model's predictions to the k most probable tokens at each step of generation.
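These parameters correspond to standard sampling controls. As a rough illustration, the minimal sketch below shows how they might map onto a Hugging Face transformers generation call; the library choice, the bigcode/starcoder2-15b checkpoint name, and the argument names are assumptions about a typical local setup rather than part of this page's interface.

```python
# Minimal sketch: mapping the parameters above onto a transformers
# generation call. Assumes local access to the bigcode/starcoder2-15b
# checkpoint and a GPU with sufficient memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,   # maximum number of tokens to generate
    do_sample=True,
    temperature=0.2,      # degree of randomness in the response
    top_p=0.95,           # nucleus sampling over the top_p probability mass
    top_k=50,             # restrict sampling to the k most probable tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```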

Output

Submit a prompt for a response.

Notes

Introduction

StarCoder2-15B is a 15-billion parameter model designed for code generation and comprehension, trained by the BigCode project. It is part of the next generation of Large Language Models for Code (Code LLMs) and has been trained on a diverse set of programming languages and other related datasets.

StarCoder2-15B

StarCoder2-15B is a 15-billion-parameter model trained on 600+ programming languages from The Stack v2, with opt-out requests excluded. It uses Grouped Query Attention and a 16,384-token context window with a sliding window attention of 4,096 tokens, and was trained on 4+ trillion tokens using the Fill-in-the-Middle objective. Training was carried out with the NVIDIA NeMo™ Framework on the NVIDIA Eos Supercomputer built with NVIDIA DGX H100 systems. Key specifications:

  • Parameters: 15 billion
  • Context Window: 16,384 tokens
  • Sliding Window Attention: 4,096 tokens
  • Training Objective: Fill-in-the-Middle
  • Training Tokens: 4+ trillion tokens

This model employs Grouped Query Attention to efficiently handle its large context window, providing robust performance in generating and understanding code across a wide variety of languages and applications.
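Because the model was trained with the Fill-in-the-Middle objective, it can complete a gap between a given prefix and suffix rather than only continuing text left to right. The sketch below reuses the model and tokenizer loaded in the earlier example; the special-token names (<fim_prefix>, <fim_suffix>, <fim_middle>) follow the StarCoder family's convention and should be verified against the released tokenizer before relying on them.

```python
# Rough sketch of Fill-in-the-Middle (FIM) prompting. The special-token
# names are an assumption based on the StarCoder family convention;
# confirm them against the bigcode/starcoder2-15b tokenizer.
prefix = "def average(numbers):\n    "
suffix = "\n    return total / len(numbers)\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)

# The tokens generated after the prompt are the proposed "middle",
# e.g. something along the lines of: total = sum(numbers)
middle = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(middle, skip_special_tokens=True))
```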

Use Cases

Intended Use

StarCoder2-15B is designed for various programming and coding tasks, making it an ideal tool for software developers and researchers. Potential use cases include:

  • Code Completion: Assisting developers by predicting the next part of the code.
  • Code Generation: Generating code snippets based on given prompts or partial code.

Limitations

The model is not intended for instruction-based tasks, such as "Write a function that computes the square root." It performs best when given partial code or specific programming tasks rather than general commands.
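As an illustration of this distinction, the sketch below (again reusing the model and tokenizer loaded earlier) contrasts an instruction-style prompt with a code-shaped prompt for the same task; it is a sketch of the recommended prompt shape, not a guarantee of specific model behavior.

```python
# Sketch of prompt style for a base (non-instruction-tuned) code model.

# Less suitable: a bare natural-language instruction.
instruction_prompt = "Write a function that computes the square root."

# Better: partial code (signature plus docstring) for the model to complete.
code_prompt = (
    "def sqrt(x: float) -> float:\n"
    '    """Compute the square root of x using Newton\'s method."""\n'
)

inputs = tokenizer(code_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```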

Evaluation and Benchmark Results

StarCoder2-15B has demonstrated superior performance across various code-related benchmarks:

  • Outperforms other models of similar size, such as CodeLlama-13B, and matches or exceeds the performance of larger models like CodeLlama-34B.
  • Excels in low-resource programming languages and tasks requiring mathematical reasoning and code execution understanding.
  • While DeepSeekCoder-33B leads on high-resource-language code completion, StarCoder2-15B surpasses it on benchmarks involving low-resource languages and complex reasoning.

Dataset

Training Data

StarCoder2-15B was trained on The Stack v2, a dataset built in collaboration with Software Heritage. The Stack v2 includes code from 619 programming languages and other high-quality sources such as GitHub issues, pull requests, Kaggle notebooks, and code documentation. The training dataset is approximately 4 times larger than its predecessor, The Stack v1, and includes extensive filtering and deduplication processes to ensure high quality and relevance.

Data Processing

The training data was carefully curated and processed to remove low-quality code, redact personally identifiable information (PII), and handle opt-out requests from developers. This ensures that the model is trained on high-quality and ethically sourced data.

Advantages

  • Wide Language Coverage: Trained on over 600 programming languages, providing extensive support for various coding environments.
  • Large Context Window: Supports a context window of 16,384 tokens, enabling it to handle large codebases and complex coding tasks.
  • Robust Performance: Demonstrates strong performance on a range of code-related benchmarks, outperforming many comparable and larger models.
  • Open Access: Model weights are available under an OpenRAIL license, promoting transparency and accessibility.

Limitations

  • Not Instruction-Based: The model does not perform well with instruction-based prompts and requires more specific programming tasks for optimal performance.
  • Ethical Considerations: Despite efforts to clean and filter the training data, there may still be ethical concerns related to the inclusion of certain datasets.

Model Information

  • Model Type ID: Text To Text
  • Input Type: text
  • Output Type: text
  • Description: StarCoder2-15B is a 15-billion-parameter code generation model trained on 600+ programming languages, optimized for code-related tasks with a large context window and robust performance.
  • Last Updated: Jun 24, 2024
  • Privacy: PUBLIC