got-ocr-2_0

The OCR-2.0 model (GOT) is a versatile and efficient optical character recognition system designed to handle diverse tasks, including text, formulas, and charts, through a unified end-to-end architecture.


Notes

Introduction

The OCR-2.0 model is an advanced Optical Character Recognition (OCR) system designed to accurately convert text in digital images or scanned documents into machine-readable formats. It builds on the strengths of previous OCR technologies, incorporating deep learning techniques to improve accuracy and expand functionality. GOT integrates a unified end-to-end architecture for processing a wide range of optical characters, including plain text, mathematical formulas, charts, and more.

OCR-2.0

OCR-2.0 is a conceptual framework that expands the capabilities of traditional OCR systems by incorporating a broader range of optical signals and enhancing the model's ability to process complex data types. The GOT model is built on this framework, using an encoder-decoder architecture that allows for high compression rates and long context lengths. The model consists of three main components: an image encoder, a linear layer for channel mapping, and a language decoder. This architecture enables GOT to handle a variety of OCR tasks, including scene text recognition, document OCR, and more general optical character processing.
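The three components described above map naturally onto code. Below is a minimal, illustrative sketch of that encoder → linear projection → decoder pipeline in PyTorch. The layer sizes, the toy patchifying encoder, and the small transformer decoder are placeholder assumptions for illustration, not the published GOT architecture or weights.

# A minimal sketch of the three-stage pipeline described above.
# All module sizes and the toy encoder/decoder are illustrative
# placeholders, not the actual GOT implementation.
import torch
import torch.nn as nn

class GOTSketch(nn.Module):
    def __init__(self, enc_dim=768, dec_dim=1024, vocab_size=32000):
        super().__init__()
        # 1) Image encoder: compresses the image into a short sequence
        #    of visual tokens (a single conv "patchify" stands in here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, enc_dim, kernel_size=16, stride=16),  # patch embedding
            nn.Flatten(2),                                     # (B, enc_dim, N)
        )
        # 2) Linear layer: maps encoder channels to the decoder's width.
        self.proj = nn.Linear(enc_dim, dec_dim)
        # 3) Language decoder: autoregressively emits text tokens
        #    conditioned on the projected visual tokens.
        layer = nn.TransformerDecoderLayer(d_model=dec_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.lm_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, image, token_ids):
        vis = self.encoder(image).transpose(1, 2)  # (B, N, enc_dim)
        memory = self.proj(vis)                    # (B, N, dec_dim)
        tgt = self.embed(token_ids)                # (B, T, dec_dim)
        hidden = self.decoder(tgt, memory)
        return self.lm_head(hidden)                # next-token logits

logits = GOTSketch()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])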

Running OCR-2.0 with an API

You can run the OCR-2.0 model using Clarifai's Python SDK.

Export your PAT as an environment variable. Then, import and initialize the API Client.

Find your PAT in your security settings.

export CLARIFAI_PAT={your personal access token}

from clarifai.client.model import Model

# URL of the image to transcribe (a Clarifai sample image; replace with your own).
image_url = "https://samples.clarifai.com/metro-north.jpg"

# Model Predict
model_prediction = Model("https://clarifai.com/stepfun-ai/ocr/models/got-ocr-2_0").predict_by_url(image_url, "image")

# The recognized text is returned in the first output's raw text field.
print(model_prediction.outputs[0].data.text.raw)
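If your image is stored locally rather than hosted at a URL, the same Model object in the Python SDK also supports file-based prediction. A short sketch, using a hypothetical file path:

from clarifai.client.model import Model

# Predict from a local file (hypothetical path) instead of a URL.
model = Model("https://clarifai.com/stepfun-ai/ocr/models/got-ocr-2_0")
model_prediction = model.predict_by_filepath("scanned_page.png", "image")
print(model_prediction.outputs[0].data.text.raw)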

You can also call the OCR-2.0 API using other Clarifai client libraries, such as Java, cURL, NodeJS, and PHP.

Use Cases

OCR-2.0 can be effectively applied in various scenarios, including but not limited to:

  • Document OCR: Extracting text from scanned documents and images (see the batch-processing sketch after this list).
  • Scene Text OCR: Recognizing text in natural scenes, such as street signs and advertisements.
  • Handwritten Text Recognition: Processing handwritten notes and documents in various languages.
  • Mathematical and Molecular Formula Recognition: Identifying and interpreting complex formulas in academic and scientific contexts.
  • Chart and Graph Interpretation: Extracting data from visual representations like charts and graphs.
  • License Plate Recognition: Automating vehicle identification in security systems or toll collection.
  • Text Extraction for Machine Translation: Providing a foundation for translating text found in images for travelers or researchers.
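As a concrete example of the document-OCR use case, the snippet below loops the SDK call from the previous section over several pages and prints the transcription of each. The page URLs are hypothetical placeholders.

# A minimal batch-OCR sketch for the document use case above.
# The page URLs are hypothetical placeholders.
from clarifai.client.model import Model

model = Model("https://clarifai.com/stepfun-ai/ocr/models/got-ocr-2_0")
pages = [
    "https://example.com/scans/page-1.png",
    "https://example.com/scans/page-2.png",
]
for url in pages:
    prediction = model.predict_by_url(url, "image")
    text = prediction.outputs[0].data.text.raw
    print(f"--- {url} ---\n{text}\n")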

Evaluation and Benchmark Results

To evaluate the performance of OCR-2.0, several standard datasets were used, including MNIST for handwritten digit recognition and ICDAR for text in natural images. The model reported the following results:

  • MNIST: Achieved an accuracy of 99.5% on handwritten digit recognition tasks.
  • ICDAR: Recorded a significant F1-score improvement over previous models, reaching 95.2% in extracting text from complex layouts.

The evaluation metrics also included precision, recall, and processing speed, with OCR-2.0 delivering competitive results across various parameters, making it suitable for real-time applications.
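For readers unfamiliar with how precision, recall, and F1 apply to OCR, here is one common way such scores are computed at the character level. This is a hedged sketch of the general technique, not necessarily the exact evaluation protocol behind the figures quoted above.

# Character-level precision/recall/F1 for OCR output: one common
# formulation, not necessarily the protocol behind the scores above.
from collections import Counter

def ocr_prf(predicted: str, reference: str):
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())           # matched characters
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(ocr_prf("HelloWorld", "Hello World!"))  # (1.0, 0.833..., 0.909...)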

Dataset

The training dataset for OCR-2.0 consisted of a diverse collection of images that represented a wide range of text styles and formats. The dataset included:

  • Handwritten and printed text examples sourced from public repositories.
  • Images with varying levels of noise, background complexity, and occlusion to improve the model's robustness.
  • Multilingual text data to enable comprehensive recognition capabilities across different languages and scripts.

In total, over 500,000 unique images were used, ensuring a well-rounded representation of potential use cases.

Advantages

  • High Accuracy: Leverages advanced neural network architectures to improve text recognition rates significantly.
  • Versatility: Capable of processing a variety of text formats, including cursive handwriting and non-Latin scripts.
  • Real-time Processing: Optimized for speed, allowing for quick recognition suitable for applications requiring immediate feedback.
  • Accessibility Features: Designed with inclusive functionalities that cater to users with disabilities.
  • Broad Applicability: Suitable for multiple domains, from administrative tasks to creative industries.

Limitations

  • Dependency on Image Quality: Performance can degrade significantly with poor-quality images or extreme lighting conditions.
  • Language Support: Currently, the model primarily supports English and Chinese, which may limit its applicability in multilingual contexts.
  • Complex Geometries: While the model can handle basic geometric shapes, more complex geometries may still pose challenges.
  • Limited Contextual Understanding: As an OCR model, it lacks the capability to understand context, leading to potential misrecognition when encountering ambiguous terms or phrases.
  • Training Data Bias: If the dataset disproportionately represents certain fonts, languages, or styles, this can affect recognition accuracy for underrepresented text types.
  • ID: got-ocr-2_0
  • Model Type ID: Image To Text
  • Input Type: image
  • Output Type: text
  • Last Updated: Oct 18, 2024
  • Privacy: PUBLIC