Workflow: OCR English-to-Spanish

Introduction

The optical character recognition (OCR) multilingual workflow detects English blocks of text in an input image using an OCR model, and outputs a Spanish translation of each text block using a natural language processing (NLP) model.

Signal Flow

  1. An input image is fed into an OCR model.

  2. The output of the OCR model is fed as input to an NLP model.

  3. The output is a list of the detected bounding boxes, each containing a block of text, together with the translated block of text. A minimal end-to-end sketch follows this list.
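As an illustration, here is a minimal sketch of this signal flow using the paddleocr and transformers Python packages. The file name input.png is a hypothetical example, and exact result shapes may vary between paddleocr releases.

```python
# Minimal sketch of the OCR -> translation signal flow, assuming the
# paddleocr and transformers packages are installed. "input.png" is a
# hypothetical example image.
from paddleocr import PaddleOCR
from transformers import pipeline

ocr = PaddleOCR(lang="en")  # loads the English PP-OCR model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

results = ocr.ocr("input.png")
for box, (text, confidence) in results[0]:
    spanish = translator(text)[0]["translation_text"]
    print(box, text, "->", spanish)
```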

Limitations

The OCR model detects a bounding box around each line of text, so each line is fed into the translation model individually. If you need the content of several bounding boxes to be analyzed together, you can insert an aggregator between the OCR model and the NLP model, as sketched below.
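For illustration, a sketch of such an aggregation step, reusing the ocr and translator objects from the earlier sketch; joining lines with spaces is an assumption about how an aggregator might combine blocks.

```python
# Hypothetical aggregation step: join all detected lines into a single
# block so the translation model sees complete sentences, then translate
# the combined text once.
results = ocr.ocr("input.png")
full_text = " ".join(text for _box, (text, _conf) in results[0])
spanish = translator(full_text)[0]["translation_text"]
print(spanish)
```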

Irregular capitalization and punctuation in the text fed to the Helsinki-NLP model may result in grammatically erroneous translations. If you are using this model in a workflow and find grammar issues, you can try using aggregators to minimize errors.

PaddleOCR

Introduction

PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and apply them into practice. The information in this summary is taken from their GitHub.

Release PP-OCRv3: with comparable speed, accuracy on Chinese scenes improves by 5% over PP-OCRv2, accuracy on English scenes improves by 11%, and the average recognition accuracy of the 80-language multilingual models improves by more than 5%.

Features

PaddleOCR supports a variety of cutting-edge OCR algorithms, and on that basis develops the industrial-grade featured models/solutions PP-OCR and PP-Structure, covering the whole pipeline of data production, model training, compression, inference, and deployment.

PP-OCR Series Model List

The model used in this workflow is the English ultra-lightweight PP-OCRv3 model (13.4M), shown in the second row of the table below.

Model introduction | Model name | Recommended scene | Detection model | Direction classifier | Recognition model
Chinese and English ultra-lightweight PP-OCRv3 model (16.2M) | ch_PP-OCRv3_xx | Mobile & Server | inference model / trained model | inference model / trained model | inference model / trained model
English ultra-lightweight PP-OCRv3 model (13.4M) | en_PP-OCRv3_xx | Mobile & Server | inference model / trained model | inference model / trained model | inference model / trained model
Chinese and English ultra-lightweight PP-OCRv2 model (11.6M) | ch_PP-OCRv2_xx | Mobile & Server | inference model / trained model | inference model / trained model | inference model / trained model
Chinese and English ultra-lightweight PP-OCR model (9.4M) | ch_ppocr_mobile_v2.0_xx | Mobile & Server | inference model / trained model | inference model / trained model | inference model / trained model
Chinese and English general PP-OCR model (143.4M) | ch_ppocr_server_v2.0_xx | Server | inference model / trained model | inference model / trained model | inference model / trained model
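A short sketch of selecting this model in the paddleocr Python package; the ocr_version argument is available in recent paddleocr releases, so treat it as an assumption to verify against your installed version.

```python
from paddleocr import PaddleOCR

# lang="en" selects the English recognition model; ocr_version pins the
# en_PP-OCRv3 detection/recognition weights listed in the table above.
ocr = PaddleOCR(lang="en", ocr_version="PP-OCRv3")
```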

The PaddleOCR README shows example results for:

  • PP-OCRv3 English model
  • PP-OCRv3 Chinese model
  • PP-OCRv3 Multilingual model


Helsinki-NLP - English to Spanish

Introduction

The Helsinki-NLP models are used to translate text from one language to another. The model takes a block of text as its input and outputs the translated block of text. This particular model takes English text as its input and outputs Spanish text.
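As an illustration, a minimal sketch of loading this model through the Hugging Face transformers library; Helsinki-NLP/opus-mt-en-es is the standard English-to-Spanish OPUS-MT checkpoint, and the sample sentence is a made-up example.

```python
# Minimal English-to-Spanish translation sketch with transformers.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize, generate the translation, and decode it back to text.
inputs = tokenizer(["The weather is nice today."], return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```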

Limitations

Irregular capitalization and punctuation in the input may result in grammatically erroneous translations. If you are using this model in a workflow and find grammar issues, you can try using aggregators to minimize errors.

More Info

Paper

Natural language processing for similar languages, varieties, and dialects: A survey

Authors: Marcos Zampieri, Preslav Nakov, Yves Scherrer

Abstract

There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.

Risks, Limitations, and Biases

CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.

There has been significant research exploring bias and fairness issues with language models. Some important papers in this field include:

  • Societal Biases in Language Generation: Progress and Challenges

    • Authors: Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
    • Abstract: Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.
  • On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

    • Authors: Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
    • Abstract: The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Benchmarks

The following benchmarks are for the opus-2021-02-19 weights.

testset | BLEU | chr-F | #sent | #words | BP
newssyscomb2009-engspa.eng.spa | 31.3 | 0.583 | 502 | 12506 | 0.990
news-test2008-engspa.eng.spa | 29.6 | 0.564 | 2051 | 52596 | 1.000
newstest2009-engspa.eng.spa | 30.2 | 0.578 | 2525 | 68114 | 1.000
newstest2010-engspa.eng.spa | 36.9 | 0.620 | 2489 | 65522 | 1.000
newstest2011-engspa.eng.spa | 38.3 | 0.620 | 3003 | 79476 | 0.984
newstest2012-engspa.eng.spa | 39.1 | 0.626 | 3003 | 79006 | 0.969
newstest2013-engspa.eng.spa | 35.1 | 0.598 | 3000 | 70528 | 0.960
Tatoeba-test.eng.spa | 55.1 | 0.721 | 10000 | 77311 | 0.978
tico19-test.eng-spa | 50.4 | 0.727 | 2100 | 66591 | 0.959
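For context on the metric columns, a hedged sketch of computing BLEU (including the brevity penalty, BP) and chr-F with the sacrebleu package over made-up hypothesis/reference lists; the benchmark scores above come from the OPUS-MT evaluation, not from this snippet.

```python
# Sketch: scoring translations with sacrebleu. BLEU includes a brevity
# penalty (BP); chr-F is a character n-gram F-score.
import sacrebleu

hypotheses = ["Hola, ¿cómo estás?"]            # model outputs (made-up)
references = [["Hola, ¿cómo te encuentras?"]]  # reference translations (made-up)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(bleu.score, bleu.bp, chrf.score)
```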
  • Workflow ID
    ocr-english-to-spanish
  • Last Updated
    Aug 19, 2022
  • Privacy
    PUBLIC