xpark.dataset.TextFuzzyDedup#

class xpark.dataset.TextFuzzyDedup(tokenize_regex_pattern: str | None = None, shingling_window_size: int = 5, lowercase: bool = True, cjk: bool = True, jaccard_threshold: float = 0.7, minhash_num_perm: int = 256, minhash_seed: int = 42, lsh_num_bands: int | None = None, lsh_num_rows_per_band: int | None = None, union_find_parallel_num: int | None = None, union_batch_size: int = 256, union_balancing_batch_size: int = 1)[source]#

Text fuzzy deduplication using MinHashLSH.

This class performs fuzzy deduplication on text data in four steps:

  1. Shingling: break text into overlapping token sequences (shingles)
  2. MinHash: compute a compact signature for each text
  3. LSH (Locality-Sensitive Hashing): efficiently find candidate pairs of similar texts
  4. Union-Find: group duplicates using a distributed union-find algorithm
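The first two steps can be sketched in plain Python. This is a toy illustration, not the library's implementation: `shingles`, `minhash_signature`, and the whitespace tokenizer are hypothetical stand-ins that mirror the `shingling_window_size`, `minhash_num_perm`, and `minhash_seed` parameters.

```python
import random

def shingles(text, window=5, lowercase=True):
    """Step 1 (shingling): break text into overlapping token windows."""
    if lowercase:
        text = text.lower()
    tokens = text.split()  # assumption: plain whitespace tokenization
    return {" ".join(tokens[i:i + window])
            for i in range(max(1, len(tokens) - window + 1))}

def minhash_signature(shingle_set, num_perm=256, seed=42):
    """Step 2 (MinHash): keep the minimum hash per salted permutation."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

sig_a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(shingles("a quick brown fox leaps over a lazy dog"))
# The fraction of matching signature slots estimates Jaccard similarity;
# step 3 (LSH) exploits this to find candidate pairs without comparing all pairs.
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two texts whose estimated similarity exceeds `jaccard_threshold` end up in the same group during the union-find step, and all but one member of each group are dropped.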

Parameters:
  • tokenize_regex_pattern – Tokenization pattern. If None, text is split on whitespace; otherwise the regex is applied to tokenize the text.

  • shingling_window_size – Size of each shingle (measured in tokens for token-level tokenization, in characters for character-level tokenization)

  • lowercase – Whether to convert text to lowercase before processing

  • cjk – Whether to apply CJK-aware splitting; only available when tokenize_regex_pattern is None

  • jaccard_threshold – Jaccard similarity threshold above which two texts are considered duplicates

  • minhash_num_perm – Number of permutations for MinHash

  • minhash_seed – Seed for random number generation (optional)

  • lsh_num_bands – Number of bands for LSH (computed if None)

  • lsh_num_rows_per_band – Number of rows per band (computed if None)

  • union_find_parallel_num – Number of union-find instances to run in parallel

  • union_batch_size – Batch size for union finding

  • union_balancing_batch_size – Batch size for the BTS balancing step of union finding
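When lsh_num_bands and lsh_num_rows_per_band are left as None, a common heuristic is to choose the band/row split whose LSH S-curve threshold, (1/b)**(1/r), is closest to jaccard_threshold. The sketch below shows that heuristic; it is an assumption for illustration, and the defaults actually computed by the library may differ.

```python
def choose_bands(num_perm=256, threshold=0.7):
    """Pick (num_bands, rows_per_band) with num_bands * rows_per_band == num_perm
    whose approximate LSH threshold (1/b)**(1/r) is closest to the target.
    A common heuristic, not necessarily the library's rule."""
    best = None
    for b in range(1, num_perm + 1):
        if num_perm % b:  # bands must evenly divide the signature length
            continue
        r = num_perm // b
        err = abs((1.0 / b) ** (1.0 / r) - threshold)
        if best is None or err < best[0]:
            best = (err, b, r)
    _, b, r = best
    return b, r
```

Fewer bands (longer rows) push the threshold up, favoring precision; more bands push it down, favoring recall.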

Example

>>> from xpark.dataset import read_parquet, TextFuzzyDedup
>>> from xpark.dataset.expressions import col
>>>
>>> # Read parquet files with `dynamic_uid` to generate a unique ID column
>>> ds = read_parquet("/data/fineweb-edu-sample-10BT", dynamic_uid="uid")
>>>
>>> # Apply fuzzy deduplication and save results
>>> ds.filter(
...     TextFuzzyDedup().with_column(uid=col("uid"), text=col("text"))
... ).drop_columns("uid").write_parquet("/data/dedup-fineweb-edu-sample-10BT")

Methods

with_column(uid, text)

Apply the TextFuzzyDedup filter with specified columns.

with_column(uid: ColumnExpr, text: ColumnExpr) DedupOp[source]#

Apply the TextFuzzyDedup filter with specified columns.

Parameters:
  • uid – Column expression for unique identifiers. Should be integer type. Use dynamic_uid when reading data to auto-generate unique IDs.

  • text – Column expression for text content to deduplicate.

Returns:

A deduplication operation that can be used with filter().

Return type:

DedupOp

Example

>>> from xpark.dataset import read_parquet, TextFuzzyDedup
>>> from xpark.dataset.expressions import col
>>> ds = read_parquet("data/", dynamic_uid="uid")
>>> ds.filter(TextFuzzyDedup().with_column(uid=col("uid"), text=col("text")))