StarCoder2-15B is a 15-billion-parameter code generation model trained on 600+ programming languages, optimized for code-related tasks with a large context window and robust performance.
max_tokens: The maximum number of tokens to generate. Shorter limits return responses faster.
temperature: A decimal value that controls the degree of randomness in the response.
top_p: An alternative to temperature sampling; the model considers only the smallest set of tokens whose cumulative probability mass reaches top_p.
top_k: Limits the model's predictions to the k most probable tokens at each generation step.
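As a rough illustration, the sketch below maps these controls onto the standard sampling arguments used by Hugging Face transformers; the values shown are placeholders rather than recommended defaults, and a hosted API may expose differently named parameters.

```python
# Illustrative mapping of the controls above onto common sampling arguments.
# Values are placeholders, not recommended defaults.
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=256,  # maximum number of tokens to generate; smaller is faster
    do_sample=True,      # enable sampling so temperature/top_p/top_k take effect
    temperature=0.2,     # degree of randomness in the response
    top_p=0.95,          # nucleus sampling: keep tokens within this probability mass
    top_k=50,            # limit predictions to the k most probable tokens per step
)
```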
Notes
Introduction
StarCoder2-15B is a 15-billion parameter model designed for code generation and comprehension, trained by the BigCode project. It is part of the next generation of Large Language Models for Code (Code LLMs) and has been trained on a diverse set of programming languages and other related datasets.
This model employs Grouped Query Attention to efficiently handle its large context window, providing robust performance in generating and understanding code across a wide variety of languages and applications.
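For intuition, here is a minimal, self-contained sketch of grouped-query attention in PyTorch: several query heads share a single key/value head, which shrinks the key/value cache that must be held in memory over a long context. The head counts and dimensions are illustrative, not StarCoder2-15B's actual configuration, and causal masking is omitted for brevity.

```python
# Minimal grouped-query attention (GQA) sketch. Several query heads share one
# key/value head, reducing the KV cache needed for long contexts.
# Dimensions are illustrative; causal masking is omitted for brevity.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_q_heads // n_kv_heads
    # Repeat each key/value head so every group of query heads can attend to it.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 16, 64)   # 8 query heads
k = torch.randn(1, 2, 16, 64)   # only 2 key/value heads
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```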
Use Cases
Intended Use
StarCoder2-15B is designed for various programming and coding tasks, making it an ideal tool for software developers and researchers. Potential use cases include:
Code Completion: Assisting developers by predicting the next part of the code.
Code Generation: Generating code snippets based on given prompts or partial code (a minimal usage sketch follows below).
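Below is a minimal code-completion sketch using the publicly released bigcode/starcoder2-15b checkpoint with Hugging Face transformers; it assumes enough GPU memory for the 15B weights (or a quantized variant) and is meant only to illustrate the prompt-and-continue workflow.

```python
# Minimal code-completion sketch: give the model partial code and let it
# predict the continuation. Assumes the bigcode/starcoder2-15b checkpoint
# and enough GPU memory (or a quantized variant) for the 15B parameters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-15b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "def quicksort(arr):\n    "  # partial code, not a natural-language instruction
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.2, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```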
Limitations
The model is not intended for instruction-based tasks, such as "Write a function that computes the square root." It performs best when given partial code or specific programming tasks rather than general commands.
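To make the distinction concrete, the snippet below contrasts an instruction-style prompt with the completion-style and fill-in-the-middle prompts the model handles better; the FIM special-token names follow the earlier StarCoder family and should be verified against the StarCoder2 tokenizer.

```python
# Prompting style matters: supply partial code rather than an instruction.

# Less suitable: a natural-language command.
instruction_prompt = "Write a function that computes the square root."

# Better: partial code that the model simply continues.
completion_prompt = "def square_root(x: float) -> float:\n    "

# Fill-in-the-middle prompt for completing an insertion point. Token names
# follow the StarCoder family; verify them against the StarCoder2 tokenizer.
fim_prompt = (
    "<fim_prefix>def square_root(x: float) -> float:\n    "
    "<fim_suffix>\n    return result\n<fim_middle>"
)
```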
Evaluation and Benchmark Results
StarCoder2-15B has demonstrated superior performance across various code-related benchmarks:
Outperforms other models of similar size, such as CodeLlama-13B, and matches or exceeds the performance of larger models like CodeLlama-34B.
Excels in low-resource programming languages and tasks requiring mathematical reasoning and code execution understanding.
Although DeepSeekCoder-33B remains the strongest model for code completion in high-resource languages, StarCoder2-15B surpasses it on benchmarks involving low-resource languages and complex reasoning.
Dataset
Training Data
StarCoder2-15B was trained on The Stack v2, a dataset built in collaboration with Software Heritage. The Stack v2 includes code from 619 programming languages and other high-quality sources such as GitHub issues, pull requests, Kaggle notebooks, and code documentation. The training dataset is approximately 4 times larger than its predecessor, The Stack v1, and includes extensive filtering and deduplication processes to ensure high quality and relevance.
Data Processing
The training data was carefully curated and processed to remove low-quality code, redact personally identifiable information (PII), and handle opt-out requests from developers. This ensures that the model is trained on high-quality and ethically sourced data.
Advantages
Wide Language Coverage: Trained on over 600 programming languages, providing extensive support for various coding environments.
Large Context Window: Supports a context window of 16,384 tokens, enabling it to handle large codebases and complex coding tasks.
Robust Performance: Demonstrates strong performance on a range of code-related benchmarks, outperforming many comparable and larger models.
Open Access: Model weights are available under an OpenRAIL license, promoting transparency and accessibility.
Limitations
Not Instruction-Based: The model does not perform well with instruction-based prompts and requires more specific programming tasks for optimal performance.
Ethical Considerations: Despite efforts to clean and filter the training data, there may still be ethical concerns related to the inclusion of certain datasets.
ID
Model Type ID: Text To Text
Input Type: text
Output Type: text
Description: StarCoder2-15B is a 15-billion-parameter code generation model trained on 600+ programming languages, optimized for code-related tasks with a large context window and robust performance.