The optical character recognition (OCR) multilingual workflow detects English blocks of text in an input image using an OCR model, then translates each text block into Spanish using a natural language processing (NLP) model.
The output is a list of the detected bounding boxes, each containing a block of text, along with the Spanish translation of that block.
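To make that output concrete, here is a minimal sketch of how per-block results might be assembled. The field names and coordinate format are illustrative only, not the workflow's actual response schema:

```python
# Illustrative only: the real workflow response fields may differ.
def assemble_results(ocr_blocks, translations):
    """Pair each detected text block with its translation.

    ocr_blocks:   list of (bounding_box, english_text) tuples
    translations: list of Spanish strings, one per block, same order
    """
    return [
        {"bounding_box": box, "text": text, "translation": spanish}
        for (box, text), spanish in zip(ocr_blocks, translations)
    ]

blocks = [((0.10, 0.20, 0.90, 0.28), "Hello world")]
print(assemble_results(blocks, ["Hola mundo"]))
```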
Limitations
The OCR model detects a bounding box around each line of text, so each line is fed into the translation model individually. If you need the content of several bounding boxes to be analyzed together, you can insert an aggregator between the OCR model and the NLP model.
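A minimal sketch of such an aggregator, assuming the OCR stage yields one string per detected line (the function name is illustrative, not part of the workflow API):

```python
def aggregate_lines(lines, separator=" "):
    """Join per-line OCR outputs into a single block of text so the
    translation model sees sentences with their full context instead
    of isolated fragments."""
    return separator.join(line.strip() for line in lines if line.strip())

lines = ["The quick brown fox", "jumps over", "the lazy dog."]
print(aggregate_lines(lines))  # -> "The quick brown fox jumps over the lazy dog."
```

The aggregated string can then be sent to the translation model in a single request.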
Inconsistent capitalization and punctuation in the input may cause the Helsinki NLP model to produce grammatically incorrect translations. If you use this model in a workflow and encounter grammar issues, try using aggregators to minimize errors.
PaddleOCR
Introduction
PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools that help users train better models and apply them in practice. The information in this summary is taken from the PaddleOCR GitHub repository.
Release PP-OCRv3: at comparable speed, accuracy on Chinese scenes improves by 5% over PP-OCRv2, accuracy on English scenes improves by 11%, and the average recognition accuracy of the 80-language multilingual models improves by more than 5%.
Features
PaddleOCR supports a variety of cutting-edge OCR algorithms, and on this basis develops the industrial featured models/solutions PP-OCR and PP-Structure, covering the whole pipeline of data production, model training, compression, inference, and deployment.
PP-OCR Series Model List
This workflow uses the English ultra-lightweight PP-OCRv3 model (13.4M) from the PP-OCR series model list, which also includes:

- Chinese and English ultra-lightweight PP-OCRv3 model (16.2M)
- PP-OCRv3 English model
- PP-OCRv3 Chinese model
- PP-OCRv3 Multilingual model

Each entry in the upstream model list specifies a model name, recommended scene, detection model, direction classifier, and recognition model. For structural document analysis models, please refer to the PP-Structure models.
Helsinki-NLP - English to Spanish
Introduction
The Helsinki-NLP models translate text from one language to another: the model takes a block of text as its input and outputs the translated block of text. This particular model takes English text as its input and outputs Spanish text.
Limitations
Inconsistent capitalization and punctuation may cause grammatically incorrect translations. If you use this model in a workflow and encounter grammar issues, try using aggregators to minimize errors.
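One low-effort mitigation, sketched below, is to normalize casing and stray punctuation before text reaches the translation model. This is a heuristic preprocessing step, not part of the model itself, and may lowercase proper nouns:

```python
import re

def normalize_block(text):
    """Collapse whitespace, deduplicate punctuation runs, and apply
    sentence-style capitalization before translation."""
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    text = re.sub(r"([!?.,])\1+", r"\1", text)  # "!!!" -> "!"
    if text and not text.endswith((".", "!", "?")):
        text += "."                             # ensure terminal punctuation
    return text[:1].upper() + text[1:].lower() if text else text

print(normalize_block("HELLO   WORLD!!!"))  # -> "Hello world!"
```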
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Risks, Limitations, and Biases
CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.
There has been significant research exploring bias and fairness issues with language models. Some important papers in this field include:
Societal Biases in Language Generation: Progress and Challenges
Abstract: Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Authors: Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
Abstract: The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
Benchmarks
The following benchmarks are for the opus-2021-02-19 weights.
| testset | BLEU | chr-F | #sent | #words | BP |
|---|---|---|---|---|---|
| newssyscomb2009-engspa.eng.spa | 31.3 | 0.583 | 502 | 12506 | 0.990 |
| news-test2008-engspa.eng.spa | 29.6 | 0.564 | 2051 | 52596 | 1.000 |
| newstest2009-engspa.eng.spa | 30.2 | 0.578 | 2525 | 68114 | 1.000 |
| newstest2010-engspa.eng.spa | 36.9 | 0.620 | 2489 | 65522 | 1.000 |
| newstest2011-engspa.eng.spa | 38.3 | 0.620 | 3003 | 79476 | 0.984 |
| newstest2012-engspa.eng.spa | 39.1 | 0.626 | 3003 | 79006 | 0.969 |
| newstest2013-engspa.eng.spa | 35.1 | 0.598 | 3000 | 70528 | 0.960 |
| Tatoeba-test.eng.spa | 55.1 | 0.721 | 10000 | 77311 | 0.978 |
| tico19-test.eng-spa | 50.4 | 0.727 | 2100 | 66591 | 0.959 |
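The BP column is BLEU's brevity penalty: it equals 1.0 when the system output is at least as long as the reference and decays exponentially otherwise, so values just below 1.0 (e.g. 0.959 on tico19) indicate slightly short translations. The standard formula can be computed as:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 if the candidate corpus is at least as
    long as the reference, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(round(brevity_penalty(95, 100), 3))  # -> 0.949
```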
Workflow ID: ocr-english-to-spanish
Description: --
Last Updated: Aug 19, 2022
Privacy: PUBLIC