Helsinki-NLP - Japanese to English
The Helsinki-NLP models are used to translate text from one language to another. As such, the model takes a block text as its input, and outputs the translated block of text. This particular model takes in Japanese text as it's input and outputs English text.
Authors: Marcos Zampieri, Preslav Nakov, Yves Scherrer
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Risks, Limitations, and Biases
CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.
There has been significant research exploring bias and fairness issues with language models. Some important papers in this field include:
- Authors: Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
- Abstract: Technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. While techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. Language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. To better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. Motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. By further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.
- Authors: Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
- Abstract: The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
The following benchmarks are for the opus-2020-01-09 weights.
- Data set: Opus
- Model: Transformer-Align
- Source Language(s): jap (Japanese)
- Target Language(s): en (English)
- Pre-processing: Normalization + SentencePiece (spm32k, spm32k)
- Download original weights: opus-2020-01-09.zip
- Test set translations: opus-2020-01-09.test.txt
- Test set scores: opus-2020-01-09.eval.txt
- Model Type IDText To Text
- Last UpdatedAug 01, 2022
- Use Case