Melody Arnaud
arXiv (Cornell University), 2022
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 444 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
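As a rough illustration of how a programmatic benchmark of this kind is typically consumed, the Python sketch below scores a callable model on a JSON task of input/target pairs with an exact-match metric. The task layout and the model_fn interface are assumptions made for illustration, not BIG-bench's actual API.

import json

def exact_match_score(model_fn, task_path):
    """Score a callable model on a BIG-bench-style JSON task.

    Assumes the task file holds {"examples": [{"input": ..., "target": ...}, ...]},
    mirroring the input/target task structure described in the abstract.
    """
    with open(task_path) as f:
        task = json.load(f)
    examples = task["examples"]
    hits = sum(
        model_fn(ex["input"]).strip() == ex["target"].strip()
        for ex in examples
    )
    return hits / len(examples)

# Usage: exact_match_score(lambda prompt: my_model.generate(prompt), "task.json")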
Related papers
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Susannah Young
ArXiv, 2021
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and the model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.
Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
Adina Williams
2021
We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which – despite their importance to practitioners – have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences, placing more or less weight on a particular axis.
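The abstract does not give the Dynascore formula, but a minimal sketch of the idea, combining per-model metrics under user-chosen weights, might look like the following. The metric names, the normalization convention, and the weighted-sum aggregation are illustrative assumptions, not Dynabench's actual computation.

def aggregate_score(metrics, weights):
    """Combine normalized metrics into one utility-style score.

    metrics: dict mapping axis name -> value already normalized to [0, 1],
             with higher meaning better (costs like memory inverted upstream).
    weights: dict mapping axis name -> user-chosen non-negative weight.
    """
    total_weight = sum(weights.values())
    return sum(metrics[axis] * w for axis, w in weights.items()) / total_weight

# A user who cares mostly about robustness can re-rank models accordingly:
model_a = {"accuracy": 0.91, "throughput": 0.40, "memory": 0.75, "robustness": 0.55}
prefs = {"accuracy": 1.0, "throughput": 0.5, "memory": 0.5, "robustness": 2.0}
print(aggregate_score(model_a, prefs))  # 0.64625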
PaLM: Scaling Language Modeling with Pathways
Shivani Agrawal
arXiv (Cornell University), 2022
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
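Few-shot learning in the sense used here means conditioning the model on a handful of worked examples placed in the prompt, with no weight updates. A minimal sketch follows; the demonstration pairs and the Q/A separator format are arbitrary assumptions for illustration.

def build_few_shot_prompt(examples, query):
    """Concatenate (input, output) demonstrations ahead of the new query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

demos = [("2 + 2", "4"), ("7 * 6", "42")]
prompt = build_few_shot_prompt(demos, "9 - 3")
# The model completes the final "A:" line; no gradient updates are involved.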
Mind the Gap: Assessing Temporal Generalization in Neural Language Models
Tayfun Terzi
arXiv (Cornell University), 2021
Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about and how we talk about it change over time. This inherent dynamic nature of language contrasts with the current static language modelling paradigm, which trains and evaluates models on utterances from overlapping time periods. Despite impressive recent progress, we demonstrate that Transformer-XL language models perform worse in the realistic setup of predicting future utterances from beyond their training period, and that model performance becomes increasingly worse with time. We find that, while increasing model size alone (a key driver behind recent progress) does not solve this problem, having models that continually update their knowledge with new information can indeed mitigate this performance degradation over time. Hence, given the compilation of ever-larger language modelling datasets, combined with the growing list of language-model-based NLP applications that require up-to-date factual knowledge about the world, we argue that now is the right time to rethink the static way in which we currently train and evaluate our language models, and to develop adaptive language models that can remain up-to-date with respect to our ever-changing and non-stationary world. We publicly release our dynamic, streaming language modelling benchmarks for WMT and ARXIV at https://github.com/deepmind/deepmind-research/tree/master/pitfalls_static_language_models to facilitate language model evaluation that takes temporal dynamics into account.
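A hedged sketch of the evaluation setup the paper argues for: score one fixed model on test shards bucketed by date, so temporal degradation shows up as rising perplexity on shards past the training cutoff. The nll_fn interface and the shard layout are assumptions for illustration, not the paper's released benchmark code.

import math

def perplexity_by_period(nll_fn, shards):
    """Compute per-period perplexity from summed negative log-likelihoods.

    nll_fn: callable returning (total_nll_in_nats, token_count) for a list of texts.
    shards: dict mapping period label (e.g. "2019-Q4") -> list of test documents.
    """
    results = {}
    for period, docs in sorted(shards.items()):
        nll, n_tokens = nll_fn(docs)
        results[period] = math.exp(nll / n_tokens)
    return results

# Plotting the results across periods after the training cutoff makes the
# temporal degradation described above directly visible.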
BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite
Jianfeng Zhan
arXiv (Cornell University), 2018
Several fundamental changes in technology indicate that domain-specific hardware and software co-design is the only path left. In this context, the architecture, system, data management, and machine learning communities are paying greater attention to innovative big data and AI algorithms, architectures, and systems. Unfortunately, the complexity, diversity, frequently-changing workloads, and rapid evolution of big data and AI systems raise great challenges. First, the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload is not scalable, and may even be impossible, for big data and AI benchmarking. Second, it is prohibitively expensive to tailor the architecture to the characteristics of one or more applications, or even a domain of applications. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs, each class of which we call a data motif. On the basis of our previous work identifying eight data motifs that take up most of the run time of a wide variety of big data and AI workloads, we propose a scalable benchmarking methodology that uses combinations of one or more data motifs to represent the diversity of big data and AI workloads. Following this methodology, we present a unified big data and AI benchmark suite, BigDataBench 4.0, publicly available from http://prof.ict.ac.cn/BigDataBench. This unified benchmark suite sheds new light on domain-specific hardware and software co-design: tailoring the system and architecture to the characteristics of the unified eight data motifs rather than to one or more applications case by case. Also, for the first time, we comprehensively characterize CPU pipeline efficiency using the benchmarks of seven workload types in BigDataBench 4.0, in addition to traditional benchmarks like SPEC CPU, PARSEC, and HPCC, in a hierarchical manner, drilling down on five levels using Top-Down analysis from an architecture perspective. In addition, we evaluate the micro-architectural performance of AI benchmarks on GPUs.
Large Language Models: Machine Learning Down to Earth
Paulo Leocadio
Brainz Magazine, 2023
The business marketplace has been flooded with waves of technology trends that periodically surface, appear in every other sales pitch from technology vendors, and build up as the ultimate necessity in the minds of many CIOs. There is a variety of examples to mention: downsizing, rightsizing, outsourcing, offshoring, the consumerization of IT, and the adoption of Cloud technologies, among others. And like virtual tsunamis, these trends arrive with impact, sometimes disrupting, and even influencing the performance or changing the perception of a corporation (or government) in the eyes of the consumer. It is no different today: tailgating Cloud adoption and the spread of a Big Data mindset to solve problems and build solutions come Artificial Intelligence (AI) and Machine Learning (ML). Vendors quickly packaged “AI” offerings, often pre-existing call-center-related scripts and chatbots; the door to the CIOs' IT teams was opened, and the offerings began to mature beyond call centers and diverse medical and microbiological uses. University and academic AI-related research was reinforced by highly specialized and well-funded dedicated groups in companies like Google, Facebook, and IBM. Smaller organizations, collectives, and startups are also receiving funding and investments. OpenAI became an important presence under many aspects and circumstances. Anyone can interact online and in real time with its chatbot, obtaining real-life value as a work tool in many areas. When asked “please explain what Large Language Models are?”, the OpenAI chatbot responded (OpenAI, 2023): “Large Language Models (LLMs) are a type of artificial intelligence (AI) model that is designed to understand and generate human-like language. These models are trained on massive amounts of text data, such as books, articles, and web pages, and use advanced algorithms to learn the patterns and structures of language. LLMs can perform a wide range of natural language processing (NLP) tasks, such as language translation, sentiment analysis, text classification, and question-answering. They can also generate human-like text, such as short stories, news articles, and even poetry. Some examples of LLMs include GPT-3 (Generative Pre-trained Transformer 3), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-to-Text Transfer Transformer). These models have achieved significant breakthroughs in NLP and are widely used in industry and academia for various applications.” (grammar errors found in the AI-generated text were left unchanged on purpose).
SCALE: Scaling up the Complexity for Advanced Language Model Evaluation
Veton Matoshi
arXiv (Cornell University), 2023
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for more challenging benchmarks to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual federal legal system. Despite recent advances, efficiently processing long documents for intense review/analysis tasks remains an open challenge for LLMs. Also, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution's value, considering most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing the state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), ...
ChatGPT: Large models, expert skills and academia
Oana Ichim
ChatGPT is not a sudden creation enabled by advancements in Artificial Intelligence. It has a genesis and comes with quite an impressive number of occurrences and contingencies, all of which allow a clearer understanding of its functioning and a better assessment of its added value. As the name suggests, generative AI produces or generates text, images, music, speech, code, or video. Behind this concept lie machine-learning techniques that have evolved over the past decade, allowing us to explore large corpora of data and produce a specific output from them. ChatGPT is a type of large language model (LLM) that uses deep learning to generate human-like text. GPT stands for Generative Pre-trained Transformer: 'generative' (G) because it can generate new text based on the input received, 'pre-trained' (P) because it is trained on a large corpus of text data before being fine-tuned for specific tasks, and 'transformer' (T) because it uses a transformer-based neural network architecture to process input text and generate output text. ChatGPT is a large language model specialized for conversational interactions, (still) available as a free demo. It was released by OpenAI, a research lab running on a commercial model, with Microsoft as the main investor. Large models are trained on massive datasets drawn from books, articles, and websites; ChatGPT is one such LLM, trained through an innovative technique in state-of-the-art LLMs: reinforcement learning from human feedback (RLHF). The 'non-scalable' aspects and costs, the human component, are yet to be disclosed. The model has accomplished all kinds of impressive tasks, including providing feedback on code, writing poetry, explaining technical concepts in different tones, generating prompts for generative AI models, and going on philosophical rants. However, the model is also prone to the kinds of errors that similar LLMs have made; the most frequently cited are references to non-existent papers and books, misinterpretations of intuitive physics, and failures at compositionality. OpenAI acknowledges that its models can still generate toxic and biased outputs. All of the above discloses the background against which potential and futile expectations regarding ChatGPT may arise, because it is against this background that one can determine what ChatGPT can and cannot do.
NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain
Sridevi Wagle
arXiv (Cornell University), 2023
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Vũ Chiến
arXiv (Cornell University), 2023
As language models grow ever larger, the need for large-scale, high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) (BigScience Workshop, 2022) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as to stimulate research around this large multilingual corpus.
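As a toy illustration of the kind of bookkeeping a multilingual corpus of this scale requires, the sketch below totals per-language byte shares from a manifest of (language, size) records; the manifest format is an assumption for illustration, not ROOTS's actual tooling.

from collections import defaultdict

def language_shares(manifest):
    """manifest: iterable of (language_code, size_in_bytes) records."""
    totals = defaultdict(int)
    for lang, size in manifest:
        totals[lang] += size
    grand_total = sum(totals.values())
    return {lang: size / grand_total for lang, size in sorted(totals.items())}

records = [("en", 500_000), ("fr", 200_000), ("vi", 120_000), ("en", 300_000)]
print(language_shares(records))  # {'en': 0.714..., 'fr': 0.178..., 'vi': 0.107...}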