Large-Scale Text Data Deduplication#

Background#

Deduplicating training corpora is essential when training large language models (LLMs). Xpark provides high-performance deduplication capabilities, including the industry-first distributed implementations described below.

Dedup Operations#

  • MinHash Dedup: A fuzzy deduplication method based on MinHash LSH, suitable for near-duplicate detection at scale.

  • Exact Substring Dedup: A distributed exact substring deduplication method that identifies and removes documents sharing long common substrings. For more details, refer to the paper Deduplicating Training Data Makes Language Models Better.
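To build intuition for the MinHash idea behind fuzzy dedup, here is a minimal, self-contained sketch — not Xpark's implementation, and `shingles`/`minhash_signature` are illustrative names, not Xpark APIs — that estimates the Jaccard similarity of two documents from their MinHash signatures:

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """k-character shingles of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items: set, num_perm: int = 128) -> list:
    """One seeded hash per slot; keep the minimum hash value per slot.
    (Seeding by string prefix is a simplification of true permutations.)"""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b((str(seed) + s).encode(), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching slots estimates Jaccard similarity of the sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over the lazy dog")
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

Because near-duplicate documents share most of their shingles, their signatures agree in most slots — this is what MinHash LSH exploits to bucket likely duplicates without all-pairs comparison.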

Why Use Xpark#

  • MinHash Dedup: deduplicates 8.5× faster than Data-Juicer (v1.4.3)

  • Exact Substring Dedup: Xpark is the first in the industry to provide a distributed implementation of this operator; other open-source projects support only single-machine execution. It deduplicates 1.2 TB of data in 2,813 seconds on a 3-node cluster.

Text Fuzzy Dedup#

The following example demonstrates how to perform fuzzy deduplication using Xpark:

from xpark.dataset import TextFuzzyDedup, read_parquet
from xpark.dataset.expressions import col

ds = read_parquet("/path/to/parquets/", dynamic_uid="uid")
ds.filter(
    TextFuzzyDedup().with_column(uid_column=col("uid"), text_column=col("text"))
).drop_columns("uid").write_parquet("/path/to/output/deduped_parquets/")

For additional parameters supported by Text Fuzzy Dedup, see the documentation of xpark.dataset.filters.dedup.text_fuzzy_dedup.TextFuzzyDedup.

Benchmark Results#

Xpark significantly improves deduplication efficiency through several optimizations:

  • a C++-based BTS implementation

  • unique UID retrieval based on Parquet metadata (eliminating one extra data read pass)

  • a Bloom-filter-based pre-filter to skip redundant documents
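As a rough illustration of the Bloom-filter optimization above, here is a minimal standalone Bloom filter — an assumed structure for illustration only; Xpark's internal filter may differ. Membership tests can return false positives but never false negatives, so items reported as unseen can safely skip a lookup in the full dedup index:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions via blake2b personalization.
        for i in range(self.k):
            h = hashlib.blake2b(
                item.encode(), digest_size=8, person=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely unseen".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

In a dedup pipeline, a document whose key fails `might_contain` is guaranteed new and can bypass the expensive duplicate check entirely.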

The benchmark was conducted under the following environment:

Benchmark parameters:

  • shingling window size: 5

  • lowercase: True

  • jaccard threshold: 0.7

  • minhash num perm: 256

  • minhash seed: 42
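With these parameters, a MinHash LSH implementation typically splits the 256-slot signature into b bands of r rows and buckets documents by band hash; the midpoint of the resulting S-curve is roughly (1/b)^(1/r). The snippet below illustrates the common banding heuristic — not necessarily Xpark's exact scheme — picking the split whose midpoint is closest to the 0.7 Jaccard threshold:

```python
# Choose bands b and rows r with b * r = 256 so the LSH S-curve
# midpoint (1/b) ** (1/r) lands closest to the 0.7 Jaccard threshold.
num_perm, threshold = 256, 0.7
b, r = min(
    ((cand, num_perm // cand) for cand in range(1, num_perm + 1) if num_perm % cand == 0),
    key=lambda br: abs((1 / br[0]) ** (1 / br[1]) - threshold),
)
midpoint = (1 / b) ** (1 / r)
```

For 256 permutations this selects 32 bands of 8 rows, with a midpoint near 0.65 — pairs above the 0.7 threshold are very likely to collide in at least one band.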

| Dedup Method | Remaining Rows | Time Taken (seconds) | Peak Memory Usage (GB) |
| --- | --- | --- | --- |
| Data-Juicer (space tokenizer) | 9,070,657 | 2,927.93 | 116.7 |
| Xpark (space tokenizer) | 9,151,558 | 341.24 | 71.2 |
| Xpark (CJK tokenizer) | 9,151,602 | 330.62 | 78.0 |

Text Exact Substring Dedup#

By design, exact substring dedup is a three-stage process that produces multiple outputs. For details on the operator implementation and parameters, refer to xpark.dataset.filters.dedup.text_exact_substring_dedup.TextExactSubstringDedup.

The following example demonstrates how to perform exact substring deduplication using Xpark:

from xpark.dataset import read_parquet
from xpark.dataset.filters.dedup.text_exact_substring_dedup import TextExactSubstringDedup
from xpark.dataset.expressions import col

ds = read_parquet("/path/to/parquets/", dynamic_uid="uid")
dedup_op = TextExactSubstringDedup(min_match_tokens=64).with_column(uid=col("uid"), text=col("text"))

# Get documents that have no duplicate substrings with any other documents
ds.filter(dedup_op).drop_columns("uid").write_parquet("/path/to/output/unique_data")

# For potentially duplicated documents, deduplicate using suffix array
ds_connected_deduped = dedup_op.collect_sa_deduped_dataset()
ds_connected_deduped.drop_columns("uid").write_parquet("/path/to/output/deduped_by_sa_data")
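The core idea behind exact substring dedup, following the referenced paper, is a suffix array over the concatenated corpus: any sufficiently long substring occurring in two documents shows up as a long common prefix of two adjacent suffixes. Below is a naive, quadratic, single-machine sketch of that idea (character-level, for illustration only — Xpark's distributed three-stage operator is organized very differently at scale):

```python
def find_shared_substrings(docs: list, min_len: int) -> set:
    """Return substrings of length >= min_len shared by two different docs.
    Naive O(n^2 log n) construction — illustration only."""
    sep = "\x00"  # separator assumed absent from the text
    corpus = sep.join(docs)
    # Record which document owns each character position.
    owner, did = [], 0
    for ch in corpus:
        owner.append(did)
        if ch == sep:
            did += 1
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    hits = set()
    for a, b in zip(sa, sa[1:]):
        # Common prefix of two adjacent suffixes, not crossing separators.
        l = 0
        while (a + l < len(corpus) and b + l < len(corpus)
               and corpus[a + l] == corpus[b + l] and corpus[a + l] != sep):
            l += 1
        if l >= min_len and owner[a] != owner[b]:
            hits.add(corpus[a:a + l])
    return hits
```

For example, with `min_len=6`, `find_shared_substrings(["abcdefgh123", "xyzabcdefgh", "qqqq"], 6)` reports `"abcdefgh"` (plus its long suffixes) as duplicated across the first two documents. The real operator works at token granularity (`min_match_tokens`) and distributes both suffix-array construction and the scan.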

Benchmark Results#

The benchmark was conducted on Tencent EMR with 2 to 3 nodes of the following specifications:

Benchmark parameters:

  • min match tokens: 200

  • cjk: True
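The cjk flag implies token counting that handles CJK text specially. As a purely hypothetical sketch of such a tokenizer — Xpark's actual tokenization is not documented here — one common approach counts each CJK character as its own token and splits the remaining text on whitespace:

```python
import re

CJK = r"[\u4e00-\u9fff]"  # basic CJK Unified Ideographs block (simplified view)

def count_tokens(text: str, cjk: bool = True) -> int:
    """Hypothetical token count: with cjk=True, each CJK character is one
    token and the rest is split on whitespace; otherwise whitespace only."""
    if not cjk:
        return len(text.split())
    cjk_chars = re.findall(CJK, text)
    rest = re.sub(CJK, " ", text)
    return len(cjk_chars) + len(rest.split())
```

Under this scheme a mixed string like "hello 世界" counts as three tokens with cjk=True but only two with cjk=False, which is why the flag matters when enforcing a token-length threshold such as min match tokens.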

| Dataset Size | Number of Nodes | Total Lines | Total Time Cost (s) | Peak Memory (GB) |
| --- | --- | --- | --- | --- |
| 489.09 GB (400 files) | 2 | 226,404,291 | 1,102.78 | 937 |
| 1,210.24 GB (1000 files) | 3 | 566,030,018 | 2,813.54 | 1,780.12 |