xpark.dataset.Dataset.write_parquet#
- Dataset.write_parquet(path: str, *, partition_cols: List[str] | None = None, filesystem: pyarrow.fs.FileSystem | None = None, try_create_dir: bool = True, arrow_open_stream_args: Dict[str, Any] | None = None, filename_provider: FilenameProvider | None = None, arrow_parquet_args_fn: Callable[[], Dict[str, Any]] | None = None, min_rows_per_file: int | None = None, max_rows_per_file: int | None = None, ray_remote_args: Dict[str, Any] = None, concurrency: int | None = None, num_rows_per_file: int | None = None, mode: SaveMode = SaveMode.APPEND, **arrow_parquet_args) None[source]#
Writes the
Datasetto parquet files under the providedpath.The number of files is determined by the number of blocks in the dataset. To control the number of number of blocks, call
repartition().If pyarrow can’t represent your data, this method errors.
By default, the format of the output files is
{uuid}_{block_idx}.parquet, whereuuidis a unique id for the dataset. To modify this behavior, implement a customFilenameProviderand pass it in as thefilename_providerargument.Note
This operation will trigger execution of the lazy transformations performed on this dataset.
Examples
>>> import xpark >>> ds = xpark.dataset.from_range(100) >>> ds.write_parquet("local:///tmp/data/")
Time complexity: O(dataset size / parallelism)
- Parameters:
path – The path to the destination root directory, where parquet files are written to.
partition_cols – Column names by which to partition the dataset. Files are writted in Hive partition style.
filesystem – The pyarrow filesystem implementation to write to. These filesystems are specified in the pyarrow docs. Specify this if you need to provide specific configurations to the filesystem. By default, the filesystem is automatically selected based on the scheme of the paths. For example, if the path begins with
s3://, theS3FileSystemis used.try_create_dir – If
True, attempts to create all directories in the destination path. Does nothing if all directories already exist. Defaults toTrue.arrow_open_stream_args – kwargs passed to pyarrow.fs.FileSystem.open_output_stream, which is used when opening the file to write to.
filename_provider – A
FilenameProviderimplementation. Use this parameter to customize what your filenames look like. The filename is expected to be templatized with {i} to ensure unique filenames when writing multiple files. If it’s not templatized, Xpark Dataset will add {i} to the filename to ensure compatibility with the pyarrow write_dataset.arrow_parquet_args_fn – Callable that returns a dictionary of write arguments that are provided to pyarrow.parquet.ParquetWriter() when writing each block to a file. Overrides any duplicate keys from
arrow_parquet_args. Use this argument instead ofarrow_parquet_argsif any of your write arguments can’t pickled, or if you’d like to lazily resolve the write arguments for each dataset block. See the note below for more details.min_rows_per_file – [Experimental] The target minimum number of rows to write to each file. If
None, Xpark Dataset writes a system-chosen number of rows to each file. If the number of rows per block is larger than the specified value, Xpark Dataset writes the number of rows per block to each file. The specified value is a hint, not a strict limit. Xpark Dataset might write more or fewer rows to each file.max_rows_per_file – [Experimental] The target maximum number of rows to write to each file. If
None, Xpark Dataset writes a system-chosen number of rows to each file. If the number of rows per block is smaller than the specified value, Xpark Dataset writes the number of rows per block to each file. The specified value is a hint, not a strict limit. Xpark Dataset might write more or fewer rows to each file. If bothmin_rows_per_fileandmax_rows_per_fileare specified,max_rows_per_filetakes precedence when they cannot both be satisfied.ray_remote_args – Kwargs passed to
ray.remote()in the write tasks.concurrency – The maximum number of Xpark tasks to run concurrently. Set this to control number of tasks to run concurrently. This doesn’t change the total number of tasks run. By default, concurrency is dynamically decided based on the available resources.
num_rows_per_file – [Deprecated] Use min_rows_per_file instead.
arrow_parquet_args –
Options to pass to pyarrow.parquet.ParquetWriter(), which is used to write out each block to a file. See arrow_parquet_args_fn for more detail.
mode – Determines how to handle existing files. Valid modes are “overwrite”, “error”, “ignore”, “append”. Defaults to “append”. NOTE: This method isn’t atomic. “Overwrite” first deletes all the data before writing to path.
Note
When using arrow_parquet_args or arrow_parquet_args_fn to pass extra options to pyarrow, there are some special cases:
partitioning_flavor: if it’s not provided, default is “hive” in Xpark Dataset. Otherwise, it follows pyarrow’s behavior: None for pyarrow’s DirectoryPartitioning, “hive” for HivePartitioning, and “filename” for FilenamePartitioning. See pyarrow.dataset.partitioning <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html>_.
row_group_size: if provided, it’s passed to pyarrow.parquet.ParquetWriter.write_table().