Papers
arxiv:2406.03202

ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction

Published on Jun 5, 2024
Authors:
,
,

Abstract

A framework with Subject Selector, Grammar Selector, Prompt Manager, and Evaluator improves LLMs for generating grammatical error correction data using a new dataset, ChatLang-8.

We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2406.03202
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.03202 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2406.03202 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.03202 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.