xpark.dataset.TextFuzzyDedup#
- class xpark.dataset.TextFuzzyDedup(tokenize_regex_pattern: str | None = None, shingling_window_size: int = 5, lowercase: bool = True, cjk: bool = True, jaccard_threshold: float = 0.7, minhash_num_perm: int = 256, minhash_seed: int = 42, lsh_num_bands: int | None = None, lsh_num_rows_per_band: int | None = None, union_find_parallel_num: int | None = None, union_batch_size: int = 256, union_balancing_batch_size: int = 1)[source]#
Text fuzzy deduplication using MinHashLSH.
This class performs fuzzy deduplication on text data by:

1. Shingling: breaking text into overlapping token sequences (shingles)
2. MinHash: computing a compact signature for each text
3. LSH (Locality-Sensitive Hashing): efficiently finding candidate pairs of similar texts
4. Union-Find: grouping duplicates using a distributed union-find algorithm
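The first two steps above can be sketched in plain Python. This is an illustrative toy, not the library's implementation; it mirrors the class defaults (token shingles of 5, 256 permutations, seed 42) and simulates independent hash permutations by XOR-ing one stable hash with per-permutation masks.

```python
import hashlib
import random

def shingles(text: str, window: int = 5, lowercase: bool = True) -> set:
    """Break text into overlapping token windows (shingles)."""
    if lowercase:
        text = text.lower()
    tokens = text.split()  # whitespace tokenization (the default path)
    n = max(1, len(tokens) - window + 1)
    return {" ".join(tokens[i:i + window]) for i in range(n)}

def _stable_hash(s: str, mask: int) -> int:
    # Deterministic 64-bit hash; XOR with a per-permutation mask
    # stands in for an independent hash function.
    h = int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")
    return h ^ mask

def minhash_signature(shingle_set: set, num_perm: int = 256, seed: int = 42) -> list:
    """Keep the minimum hash per simulated permutation."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(_stable_hash(s, m) for s in shingle_set) for m in masks]

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over a lazy dog"))
# The fraction of matching signature slots estimates the Jaccard
# similarity of the two shingle sets.
estimate = sum(x == y for x, y in zip(a, b)) / len(a)
```

LSH then buckets these signatures so that only texts sharing a bucket are compared, and union-find merges the resulting duplicate pairs into groups.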
- Parameters:
tokenize_regex_pattern – Tokenize by splitting on whitespace if None; otherwise tokenize with the given regex
shingling_window_size – Size of each shingle (in tokens for token-level tokenization, in characters for character-level splitting)
lowercase – Whether to convert text to lowercase before processing
cjk – Whether to use CJK-aware splitting; only available when tokenize_regex_pattern is None
jaccard_threshold – Jaccard similarity threshold for duplicates
minhash_num_perm – Number of permutations for MinHash
minhash_seed – Seed for MinHash random number generation
lsh_num_bands – Number of bands for LSH (computed if None)
lsh_num_rows_per_band – Number of rows per band (computed if None)
union_find_parallel_num – Number of parallel union-find instances to use
union_batch_size – Batch size for union finding
union_balancing_batch_size – BTS batch size for union finding
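When lsh_num_bands and lsh_num_rows_per_band are left as None, they are derived from minhash_num_perm and jaccard_threshold. A common heuristic (shown here as a sketch; the library's exact rule may differ) is to try every factorization b * r = num_perm and pick the one whose S-curve midpoint (1/b)**(1/r) sits closest to the target threshold:

```python
def pick_bands(num_perm: int = 256, threshold: float = 0.7) -> tuple:
    """Choose (bands, rows_per_band) with bands * rows == num_perm."""
    best = None
    for rows in range(1, num_perm + 1):
        if num_perm % rows:
            continue
        bands = num_perm // rows
        # Approximate Jaccard value at which collision probability
        # crosses 1/2 for this (bands, rows) pair.
        midpoint = (1.0 / bands) ** (1.0 / rows)
        score = abs(midpoint - threshold)
        if best is None or score < best[0]:
            best = (score, bands, rows)
    return best[1], best[2]

def collision_probability(s: float, bands: int, rows: int) -> float:
    # Probability that a pair with Jaccard similarity s shares
    # at least one LSH band.
    return 1.0 - (1.0 - s ** rows) ** bands
```

Pairs well above the threshold collide in some band with high probability, while pairs well below it almost never do, which is what makes the candidate search efficient.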
Example
>>> from xpark.dataset import read_parquet, TextFuzzyDedup
>>> from xpark.dataset.expressions import col
>>>
>>> # Read parquet files with `dynamic_uid` to generate a unique ID column
>>> ds = read_parquet("/data/fineweb-edu-sample-10BT", dynamic_uid="uid")
>>>
>>> # Apply fuzzy deduplication and save results
>>> ds.filter(
...     TextFuzzyDedup().with_column(uid=col("uid"), text=col("text"))
... ).drop_columns("uid").write_parquet("/data/dedup-fineweb-edu-sample-10BT")
Methods
with_column(uid, text): Apply the TextFuzzyDedup filter with specified columns.
- with_column(uid: ColumnExpr, text: ColumnExpr) → DedupOp[source]#
Apply the TextFuzzyDedup filter with specified columns.
- Parameters:
uid – Column expression for unique identifiers. Should be integer type. Use dynamic_uid when reading data to auto-generate unique IDs.
text – Column expression for text content to deduplicate.
- Returns:
A deduplication operation that can be used with filter().
- Return type:
DedupOp
Example
>>> from xpark.dataset import read_parquet, TextFuzzyDedup
>>> from xpark.dataset.expressions import col
>>> ds = read_parquet("data/", dynamic_uid="uid")
>>> ds.filter(TextFuzzyDedup().with_column(uid=col("uid"), text=col("text")))