Unitxt Enables Researchers and Practitioners to Easily Create, Share, and Reuse Data Pipelines for Generative Language Models
Paper Title: Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
Journal Name: arXiv (submitted to NAACL demo track)
Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
What problem does this paper solve?
In the domain of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. However, existing tools for textual data preparation and evaluation are either too rigid, too complex, or too specific to address this need.
What approach does this paper use?
The paper presents Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners.
These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Unitxt also provides a user-friendly interface for creating and modifying components, as well as a command-line tool for running experiments.
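To make the idea of modular, catalog-backed components concrete, here is a minimal sketch in plain Python. It is not the Unitxt API: every name in it (`CATALOG`, `register`, `build_pipeline`, the entry names, and the chat-style markers) is a hypothetical chosen for illustration. It only shows the general pattern the paper describes, where a task prompt and a model-specific format are separate components that are looked up by name and composed into one processing flow.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout; this is an illustration, not the Unitxt API.
Record = Dict[str, str]
Step = Callable[[Record], Record]

# A toy "catalog": shareable processing steps stored under string names.
CATALOG: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Store a step in the catalog under a shareable name."""
    def decorator(step: Step) -> Step:
        CATALOG[name] = step
        return step
    return decorator


@register("templates.summarize")
def summarize_template(record: Record) -> Record:
    # Task prompt: turn the raw text field into an instruction plus input.
    prompt = f"Summarize the following text.\n{record['text']}"
    return {**record, "source": prompt}


@register("formats.chat")
def chat_format(record: Record) -> Record:
    # Model-specific format: wrap the prompt in (made-up) chat markers.
    return {**record, "source": f"<user>\n{record['source']}\n<assistant>\n"}


def build_pipeline(names: List[str]) -> Step:
    """Compose catalog entries, looked up by name, into one processing flow."""
    steps = [CATALOG[name] for name in names]

    def run(record: Record) -> Record:
        for step in steps:
            record = step(record)
        return record

    return run


pipeline = build_pipeline(["templates.summarize", "formats.chat"])
result = pipeline({"text": "Unitxt is a library.", "summary": "A library."})
print(result["source"])
```

Because each step is registered independently, swapping the model format or the task prompt means changing one name in the list passed to `build_pipeline`, which is the kind of reuse a shared catalog is meant to enable.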
What are the impacts of this approach on AI research?
Unitxt is a community-driven platform that lets users build, share, and improve their pipelines collaboratively. It aims to facilitate the development and evaluation of generative language models and to promote the reproducibility and comparability of NLP research. It also supports a wide range of generative tasks, such as text summarization, text generation, and text rewriting, which makes it a versatile tool for generative NLP.
Summary of the research
Unitxt is a new library for customizable textual data preparation and evaluation for generative language models. It offers a modular and flexible way to create and share data pipelines, together with a centralized catalog of reusable components. It integrates with popular libraries and tools such as HuggingFace and LM-eval-harness, and provides a user-friendly interface and a command-line tool. As a community-driven platform, it aims to advance generative NLP research and practice.