API Reference

User Functions

`available_compressions`()	Return a list of the implemented compressions.
`available_protocols`()	Return a list of the implemented protocols.
`filesystem`(protocol, **storage_options)	Instantiate filesystems for given protocol and arguments
`fuse.run`(fs, path, mount_point[, ...])	Mount stuff in a local directory
`generic.rsync`(source, destination[, ...])	Sync files between two directory trees
`get_filesystem_class`(protocol)	Fetch named protocol implementation from the registry
`get_mapper`([url, check, create, ...])	Create key-value interface for given URL and options
`gui.FileSelector`([url, filters, ignore, kwargs])	Panel-based graphical file selector widget
`open`(urlpath[, mode, compression, encoding, ...])	Given a path or paths, return one `OpenFile` object.
`open_files`(urlpath[, mode, compression, ...])	Given a path or paths, return a list of `OpenFile` objects.
`open_local`(url[, mode])	Open file(s) which can be resolved to local

fsspec.available_compressions()[source]: Return a list of the implemented compressions.

fsspec.available_protocols()[source]

Return a list of the implemented protocols.

Note that any given protocol may require extra packages to be importable.

fsspec.filesystem(protocol, **storage_options)[source]

Instantiate filesystems for given protocol and arguments

storage_options are specific to the protocol being chosen, and are passed directly to the class.

fsspec.fuse.run(fs, path, mount_point, foreground=True, threads=False, ready_file=False, ops_class=<class 'fsspec.fuse.FUSEr'>)[source]

Mount stuff in a local directory

This uses fusepy to make it appear as if a given path on an fsspec instance is in fact resident within the local file-system.

This requires that fusepy by installed, and that FUSE be available on the system (typically requiring a package to be installed with apt, yum, brew, etc.).

Parameters:

fs: file-system instance: From one of the compatible implementations
path: str: Location on that file-system to regard as the root directory to mount. Note that you typically should include the terminating “/” character.
mount_point: str: An empty directory on the local file-system where the contents of the remote path will appear.
foreground: bool: Whether or not calling this function will block. Operation will typically be more stable if True.
threads: bool: Whether or not to create threads when responding to file operations within the mounter directory. Operation will typically be more stable if False.
ready_file: bool: Whether the FUSE process is ready. The .fuse_ready file will exist in the mount_point directory if True. Debugging purpose.
ops_class: FUSEr or Subclass of FUSEr: To override the default behavior of FUSEr. For Example, logging to file.

fsspec.generic.rsync(source, destination, delete_missing=False, source_field='size', dest_field='size', update_cond='different', inst_kwargs=None, fs=None, **kwargs)[source]

Sync files between two directory trees

(experimental)

Parameters:

source: str: Root of the directory tree to take files from. This must be a directory, but do not include any terminating “/” character
destination: str: Root path to copy into. The contents of this location should be identical to the contents of source when done. This will be made a directory, and the terminal “/” should not be included.
delete_missing: bool: If there are paths in the destination that don’t exist in the source and this is True, delete them. Otherwise, leave them alone.
source_field: str | callable: If update_field is “different”, this is the key in the info of source files to consider for difference. Maybe a function of the info dict.
dest_field: str | callable: If update_field is “different”, this is the key in the info of destination files to consider for difference. May be a function of the info dict.
update_cond: “different”|”always”|”never”: If “always”, every file is copied, regardless of whether it exists in the destination. If “never”, files that exist in the destination are not copied again. If “different” (default), only copy if the info fields given by source_field and dest_field (usually “size”) are different. Other comparisons may be added in the future.
inst_kwargs: dict|None: If fs is None, use this set of keyword arguments to make a GenericFileSystem instance
fs: GenericFileSystem|None: Instance to use if explicitly given. The instance defines how to to make downstream file system instances from paths.

Returns:

dict of the copy operations that were performed, {source: destination}

fsspec.get_filesystem_class(protocol)[source]

Fetch named protocol implementation from the registry

The dict known_implementations maps protocol names to the locations of classes implementing the corresponding file-system. When used for the first time, appropriate imports will happen and the class will be placed in the registry. All subsequent calls will fetch directly from the registry.

Some protocol implementations require additional dependencies, and so the import may fail. In this case, the string in the “err” field of the known_implementations will be given as the error message.

fsspec.get_mapper(url='', check=False, create=False, missing_exceptions=None, alternate_root=None, **kwargs)[source]

Create key-value interface for given URL and options

The URL will be of the form “protocol://location” and point to the root of the mapper required. All keys will be file-names below this location, and their values the contents of each key.

Also accepts compound URLs like zip::s3://bucket/file.zip , see fsspec.open.

Parameters:

url: str: Root URL of mapping
check: bool: Whether to attempt to read from the location before instantiation, to check that the mapping does exist
create: bool: Whether to make the directory corresponding to the root before instantiating
missing_exceptions: None or tuple: If given, these exception types will be regarded as missing keys and return KeyError when trying to read data. By default, you get (FileNotFoundError, IsADirectoryError, NotADirectoryError)
alternate_root: None or str: In cases of complex URLs, the parser may fail to pick the correct part for the mapper root, so this arg can override

Returns:

FSMap instance, the dict-like key-value store.

class fsspec.gui.FileSelector(url=None, filters=None, ignore=None, kwargs=None)[source]

Panel-based graphical file selector widget

Instances of this widget are interactive and can be displayed in jupyter by having them as the output of a cell, or in a separate browser tab using .show().

property fs: Current filesystem instance

open_file(mode='rb', compression=None, encoding=None)[source]

Create OpenFile instance for the currently selected item

For example, in a notebook you might do something like

[ ]: sel = FileSelector(); sel

# user selects their file

[ ]: with sel.open_file('rb') as f:
...      out = f.read()

Parameters:

mode: str (optional): Open mode for the file.
compression: str (optional): The interact with the file as compressed. Set to ‘infer’ to guess compression from the file ending
encoding: str (optional): If using text mode, use this encoding; defaults to UTF8.

property storage_options: Value of the kwargs box as a dictionary

property urlpath: URL of currently selected item

fsspec.open(urlpath, mode='rb', compression=None, encoding='utf8', errors=None, protocol=None, newline=None, expand=None, **kwargs)[source]

Given a path or paths, return one OpenFile object.

Parameters:

urlpath: string or list: Absolute or relative filepath. Prefix with a protocol like s3:// to read from alternative filesystems. Should not include glob character(s).
mode: ‘rb’, ‘wt’, etc.
compression: string or None: If given, open file using compression codec. Can either be a compression name (a key in fsspec.compression.compr) or “infer” to guess the compression from the filename suffix.
encoding: str: For text mode only
errors: None or str: Passed to TextIOWrapper in text mode
protocol: str or None: If given, overrides the protocol found in the URL.
newline: bytes or None: Used for line terminator in text mode. If None, uses system default; if blank, uses no translation.
expand: bool or None: Whether to regard file paths containing special glob characters as needing expansion (finding the first match) or absolute. Setting False allows using paths which do embed such characters. If None (default), this argument takes its value from the DEFAULT_EXPAND module variable, which takes its initial value from the “open_expand” config value at startup, which will be False if not set.
**kwargs: dict: Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

Returns:

OpenFile object.

Notes

For a full list of the available protocols and the implementations that they map across to see the latest online documentation:

For implementations built into fsspec see https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
For implementations in separate packages see https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations

Examples

>>> openfile = open('2015-01-01.csv')  
>>> openfile = open(
...     's3://bucket/2015-01-01.csv.gz', compression='gzip'
... )  
>>> with openfile as f:
...     df = pd.read_csv(f)  
...

fsspec.open_files(urlpath, mode='rb', compression=None, encoding='utf8', errors=None, name_function=None, num=1, protocol=None, newline=None, auto_mkdir=True, expand=True, **kwargs)[source]

Given a path or paths, return a list of OpenFile objects.

For writing, a str path must contain the “*” character, which will be filled in by increasing numbers, e.g., “part*” -> “part1”, “part2” if num=2.

For either reading or writing, can instead provide explicit list of paths.

Parameters:

urlpath: string or list: Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
mode: ‘rb’, ‘wt’, etc.
compression: string or None: If given, open file using compression codec. Can either be a compression name (a key in fsspec.compression.compr) or “infer” to guess the compression from the filename suffix.
encoding: str: For text mode only
errors: None or str: Passed to TextIOWrapper in text mode
name_function: function or None: if opening a set of files for writing, those files do not yet exist, so we need to generate their names by formatting the urlpath for each sequence number
num: int [1]: if writing mode, number of files we expect to create (passed to name+function)
protocol: str or None: If given, overrides the protocol found in the URL.
newline: bytes or None: Used for line terminator in text mode. If None, uses system default; if blank, uses no translation.
auto_mkdir: bool (True): If in write mode, this will ensure the target directory exists before writing, by calling fs.mkdirs(exist_ok=True).
expand: bool
**kwargs: dict: Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

Returns:

An OpenFiles instance, which is a list of OpenFile objects that can
be used as a single context

Notes

For a full list of the available protocols and the implementations that they map across to see the latest online documentation:

For implementations built into fsspec see https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
For implementations in separate packages see https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations

Examples

>>> files = open_files('2015-*-*.csv')  
>>> files = open_files(
...     's3://bucket/2015-*-*.csv.gz', compression='gzip'
... )  

fsspec.open_local(url: str | list[str] | Path | list[Path], mode: str = 'rb', **storage_options: dict) → str | list[str][source]

Open file(s) which can be resolved to local

For files which either are local, or get downloaded upon open (e.g., by file caching)

Parameters:

url: str or list(str)
mode: str: Must be read mode
storage_options:: passed on to FS for or used by open_files (e.g., compression)

Base Classes

`archive.AbstractArchiveFileSystem`(*args, ...)	A generic superclass for implementing Archive-based filesystems.
`asyn.AsyncFileSystem`(args, *kwargs)	Async file operations, default implementations
`callbacks.Callback`([size, value, hooks])	Base class and interface for callback mechanism
`callbacks.DotPrinterCallback`([chr_to_print])	Simple example Callback implementation
`callbacks.NoOpCallback`([size, value, hooks])	This implementation of Callback does exactly nothing
`callbacks.TqdmCallback`([tqdm_kwargs])	A callback to display a progress bar using tqdm
`core.BaseCache`(blocksize, fetcher, size)	Pass-though cache: doesn't keep anything, calls every time
`core.OpenFile`(fs, path[, mode, compression, ...])	File-like object to be used in a context
`core.OpenFiles`(*args[, mode, fs])	List of OpenFile instances
`core.get_fs_token_paths`(urlpath[, mode, ...])	Filesystem, deterministic token, and paths from a urlpath and options.
`core.url_to_fs`(url, **kwargs)	Turn fully-qualified and potentially chained URL into filesystem instance
`dircache.DirCache`([use_listings_cache, ...])	Caching of directory listings, in a structure like.
`FSMap`(root, fs[, check, create, ...])	Wrap a FileSystem instance as a mutable wrapping.
`generic.GenericFileSystem`(args, *kwargs)	Wrapper over all other FS types
`registry.register_implementation`(name, cls)	Add implementation class to the registry
`spec.AbstractBufferedFile`(fs, path[, mode, ...])	Convenient class to derive from to provide buffering
`spec.AbstractFileSystem`(args, *kwargs)	An abstract super-class for pythonic file-systems
`spec.Transaction`(fs, **kwargs)	Filesystem transaction write context

class fsspec.archive.AbstractArchiveFileSystem(*args, **kwargs)[source]

A generic superclass for implementing Archive-based filesystems.

Currently, it is shared amongst ZipFileSystem, LibArchiveFileSystem and TarFileSystem.

info(path, **kwargs)[source]

Give details of entry at path

Returns a single dictionary, with exactly the same information as ls would with detail=True.

The default implementation calls ls and could be overridden by a shortcut. kwargs are passed on to `ls().

Some file systems might not be able to measure the file’s size, in which case, the returned dict will include 'size': None.

Returns:

dict with keys: name (full path in the FS), size (in bytes), type (file,
directory, or something else) and other FS-specific keys.

ls(path, detail=True, **kwargs)[source]

List objects at path.

This should include subdirectories and files at that location. The difference between a file and a directory must be clear when details are requested.

The specific keys, or perhaps a FileInfo class, or similar, is TBD, but must be consistent across implementations. Must include:

full path to the entry (without protocol)
size of the entry, in bytes. If the value cannot be determined, will be None.
type of entry, “file”, “directory” or other

Additional information may be present, appropriate to the file-system, e.g., generation, checksum, etc.

May use refresh=True|False to allow use of self._ls_from_cache to check for a saved listing and avoid calling the backend. This would be common where listing may be expensive.

Parameters:

path: str
detail: bool: if True, gives a list of dictionaries, where each is the same as the result of info(path). If False, gives a list of paths (str).
kwargs: may have additional backend-specific options, such as version: information

Returns:

List of strings if detail is False, or list of directory information
dicts if detail is True.

ukey(path)[source]: Hash of file properties, to tell if it has changed

class fsspec.callbacks.Callback(size=None, value=0, hooks=None, **kwargs)[source]

Base class and interface for callback mechanism

This class can be used directly for monitoring file transfers by providing callback=Callback(hooks=...) (see the hooks argument, below), or subclassed for more specialised behaviour.

Parameters:

size: int (optional): Nominal quantity for the value that corresponds to a complete transfer, e.g., total number of tiles or total number of bytes
value: int (0): Starting internal counter value
hooks: dict or None: A dict of named functions to be called on each update. The signature of these must be f(size, value, **kwargs)

absolute_update(value)[source]

Set the internal value state

Triggers call()

Parameters:

value: int

classmethod as_callback(maybe_callback=None)[source]

Transform callback=… into Callback instance

For the special value of None, return the global instance of NoOpCallback. This is an alternative to including callback=DEFAULT_CALLBACK directly in a method signature.

branch(path_1, path_2, kwargs)[source]

Set callbacks for child transfers

If this callback is operating at a higher level, e.g., put, which may trigger transfers that can also be monitored. The passed kwargs are to be mutated to add callback=, if this class supports branching to children.

Parameters:

path_1: str: Child’s source path
path_2: str: Child’s destination path
kwargs: dict: arguments passed to child method, e.g., put_file.

Returns:

branch_coro(fn)[source]: Wraps a coroutine, and pass a new child callback to it.

branched(path_1, path_2, **kwargs)[source]

Return callback for child transfers

If this callback is operating at a higher level, e.g., put, which may trigger transfers that can also be monitored. The function returns a callback that has to be passed to the child method, e.g., put_file, as callback= argument.

The implementation uses callback.branch for compatibility. When implementing callbacks, it is recommended to override this function instead of branch and avoid calling super().branched(...).

Prefer using this function over branch.

Parameters:

path_1: str: Child’s source path
path_2: str: Child’s destination path
**kwargs:: Arbitrary keyword arguments

Returns:

callback: Callback: A callback instance to be passed to the child method

call(hook_name=None, **kwargs)[source]

Execute hook(s) with current state

Each function is passed the internal size and current value

Parameters:

hook_name: str or None: If given, execute on this hook
kwargs: passed on to (all) hook(s)

close()[source]: Close callback.

relative_update(inc=1)[source]

Delta increment the internal counter

Triggers call()

Parameters:

inc: int

set_size(size)[source]

Set the internal maximum size attribute

Usually called if not initially set at instantiation. Note that this triggers a call().

Parameters:

size: int

wrap(iterable)[source]

Wrap an iterable to call relative_update on each iterations

Parameters:

iterable: Iterable: The iterable that is being wrapped

class fsspec.callbacks.DotPrinterCallback(chr_to_print='#', **kwargs)[source]

Simple example Callback implementation

Almost identical to Callback with a hook that prints a char; here we demonstrate how the outer layer may print “#” and the inner layer “.”

branch(path_1, path_2, kwargs)[source]: Mutate kwargs to add new instance with different print char

call(**kwargs)[source]: Just outputs a character

class fsspec.callbacks.NoOpCallback(size=None, value=0, hooks=None, **kwargs)[source]

This implementation of Callback does exactly nothing

call(*args, **kwargs)[source]

Execute hook(s) with current state

Each function is passed the internal size and current value

Parameters:

hook_name: str or None: If given, execute on this hook
kwargs: passed on to (all) hook(s)

class fsspec.callbacks.TqdmCallback(tqdm_kwargs=None, *args, **kwargs)[source]

A callback to display a progress bar using tqdm

Parameters:

tqdm_kwargsdict, (optional): Any argument accepted by the tqdm constructor. See the tqdm doc. Will be forwarded to tqdm_cls.
tqdm_cls: (optional): subclass of tqdm.tqdm. If not passed, it will default to tqdm.tqdm.

Examples

>>> import fsspec
>>> from fsspec.callbacks import TqdmCallback
>>> fs = fsspec.filesystem("memory")
>>> path2distant_data = "/your-path"
>>> fs.upload(
        ".",
        path2distant_data,
        recursive=True,
        callback=TqdmCallback(),
    )

You can forward args to tqdm using the tqdm_kwargs parameter.

>>> fs.upload(
        ".",
        path2distant_data,
        recursive=True,
        callback=TqdmCallback(tqdm_kwargs={"desc": "Your tqdm description"}),
    )

You can also customize the progress bar by passing a subclass of tqdm.

class TqdmFormat(tqdm):
    '''Provides a `total_time` format parameter'''
    @property
    def format_dict(self):
        d = super().format_dict
        total_time = d["elapsed"] * (d["total"] or 0) / max(d["n"], 1)
        d.update(total_time=self.format_interval(total_time) + " in total")
        return d

>>> with TqdmCallback(
        tqdm_kwargs={
            "desc": "desc",
            "bar_format": "{total_time}: {percentage:.0f}%|{bar}{r_bar}",
        },
        tqdm_cls=TqdmFormat,
    ) as callback:
        fs.upload(".", path2distant_data, recursive=True, callback=callback)

call(*args, **kwargs)[source]

Execute hook(s) with current state

Each function is passed the internal size and current value

Parameters:

hook_name: str or None: If given, execute on this hook
kwargs: passed on to (all) hook(s)

close()[source]: Close callback.

class fsspec.core.BaseCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int)[source]

Pass-though cache: doesn’t keep anything, calls every time

Acts as base class for other cachers

Parameters:

blocksize: int: How far to read ahead in numbers of bytes
fetcher: func: Function of the form f(start, end) which gets bytes from remote as specified
size: int: How big this file is

class fsspec.core.OpenFile(fs, path, mode='rb', compression=None, encoding=None, errors=None, newline=None)[source]

File-like object to be used in a context

Can layer (buffered) text-mode and compression over any file-system, which are typically binary-only.

These instances are safe to serialize, as the low-level file object is not created until invoked using with.

Parameters:

fs: FileSystem: The file system to use for opening the file. Should be a subclass or duck-type with fsspec.spec.AbstractFileSystem
path: str: Location to open
mode: str like ‘rb’, optional: Mode of the opened file
compression: str or None, optional: Compression to apply
encoding: str or None, optional: The encoding to use if opened in text mode.
errors: str or None, optional: How to handle encoding errors if opened in text mode.
newline: None or str: Passed to TextIOWrapper in text mode, how to handle line endings.
autoopen: bool: If True, calls open() immediately. Mostly used by pickle
pos: int: If given and autoopen is True, seek to this location immediately

close()[source]: Close all encapsulated file objects

open()[source]

Materialise this as a real open file without context

The OpenFile object should be explicitly closed to avoid enclosed file instances persisting. You must, therefore, keep a reference to the OpenFile during the life of the file-like it generates.

class fsspec.core.OpenFiles(*args, mode='rb', fs=None)[source]

List of OpenFile instances

Can be used in a single context, which opens and closes all of the contained files. Normal list access to get the elements works as normal.

A special case is made for caching filesystems - the files will be down/uploaded together at the start or end of the context, and this may happen concurrently, if the target filesystem supports it.

fsspec.core.get_fs_token_paths(urlpath, mode='rb', num=1, name_function=None, storage_options=None, protocol=None, expand=True)[source]

Filesystem, deterministic token, and paths from a urlpath and options.

Parameters:

urlpath: string or iterable: Absolute or relative filepath, URL (may include protocols like s3://), or globstring pointing to data.
mode: str, optional: Mode in which to open files.
num: int, optional: If opening in writing mode, number of files we expect to create.
name_function: callable, optional: If opening in writing mode, this callable is used to generate path names. Names are generated for each partition by urlpath.replace('*', name_function(partition_index)).
storage_options: dict, optional: Additional keywords to pass to the filesystem class.
protocol: str or None: To override the protocol specifier in the URL
expand: bool: Expand string paths for writing, assuming the path is a directory

fsspec.core.url_to_fs(url, **kwargs)[source]

Turn fully-qualified and potentially chained URL into filesystem instance

Parameters:

urlstr: The fsspec-compatible URL
**kwargs: dict: Extra options that make sense to a particular storage connection, e.g. host, port, username, password, etc.

Returns:

filesystemFileSystem: The new filesystem discovered from url and created with **kwargs.
urlpathstr: The file-systems-specific URL for url.

class fsspec.dircache.DirCache(use_listings_cache=True, listings_expiry_time=None, max_paths=None, **kwargs)[source]

Caching of directory listings, in a structure like:

{"path0": [
    {"name": "path0/file0",
     "size": 123,
     "type": "file",
     ...
    },
    {"name": "path0/file1",
    },
    ...
    ],
 "path1": [...]
}

Parameters to this class control listing expiry or indeed turn caching off

__init__(use_listings_cache=True, listings_expiry_time=None, max_paths=None, **kwargs)[source]

Parameters:

use_listings_cache: bool: If False, this cache never returns items, but always reports KeyError, and setting items has no effect
listings_expiry_time: int or float (optional): Time in seconds that a listing is considered valid. If None, listings do not expire.
max_paths: int (optional): The number of most recent listings that are considered valid; ‘recent’ refers to when the entry was set.

class fsspec.FSMap(root, fs, check=False, create=False, missing_exceptions=None)[source]

Wrap a FileSystem instance as a mutable wrapping.

The keys of the mapping become files under the given root, and the values (which must be bytes) the contents of those files.

Parameters:

root: string: prefix for all the files
fs: FileSystem instance
check: bool (=True): performs a touch at the location, to check for write access.

Examples

>>> fs = FileSystem(**parameters)  
>>> d = FSMap('my-data/path/', fs)  
or, more likely
>>> d = fs.get_mapper('my-data/path/')

>>> d['loc1'] = b'Hello World'  
>>> list(d.keys())  
['loc1']
>>> d['loc1']  
b'Hello World'

clear()[source]: Remove all keys below root - empties out mapping

delitems(keys)[source]: Remove multiple keys from the store

property dirfs: dirfs instance that can be used with the same keys as the mapper

getitems(keys, on_error='raise')[source]

Fetch multiple items from the store

If the backend is async-able, this might proceed concurrently

Parameters:

keys: list(str): They keys to be fetched
on_error“raise”, “omit”, “return”: If raise, an underlying exception will be raised (converted to KeyError if the type is in self.missing_exceptions); if omit, keys with exception will simply not be included in the output; if “return”, all keys are included in the output, but the value will be bytes or an exception instance.

Returns:

dict(key, bytes|exception)

pop(key, default=None)[source]: Pop data

setitems(values_dict)[source]

Set the values of multiple items in the store

Parameters:

values_dict: dict(str, bytes)

class fsspec.generic.GenericFileSystem(*args, **kwargs)[source]

Wrapper over all other FS types

<experimental!>

This implementation is a single unified interface to be able to run FS operations over generic URLs, and dispatch to the specific implementations using the URL protocol prefix.

Note: instances of this FS are always async, even if you never use it with any async backend.

fsspec.registry.register_implementation(name, cls, clobber=False, errtxt=None)[source]

Add implementation class to the registry

Parameters:

name: str: Protocol name to associate with the class
cls: class or str: if a class: fsspec-compliant implementation class (normally inherits from fsspec.AbstractFileSystem, gets added straight to the registry. If a str, the full path to an implementation class like package.module.class, which gets added to known_implementations, so the import is deferred until the filesystem is actually used.
clobber: bool (optional): Whether to overwrite a protocol with the same name; if False, will raise instead.
errtxt: str (optional): If given, then a failure to import the given class will result in this text being given.

class fsspec.spec.AbstractBufferedFile(fs, path, mode='rb', block_size='default', autocommit=True, cache_type='readahead', cache_options=None, size=None, **kwargs)[source]

Convenient class to derive from to provide buffering

In the case that the backend does not provide a pythonic file-like object already, this class contains much of the logic to build one. The only methods that need to be overridden are _upload_chunk, _initiate_upload and _fetch_range.

close()[source]

Close file

Finalizes writes, discards cache

commit()[source]: Move from temp to final destination

discard()[source]: Throw away temporary file

flush(force=False)[source]

Write buffered data to backend store.

Writes the current buffer, if it is larger than the block-size, or if the file is being closed.

Parameters:

force: bool: When closing, write the last block even if it is smaller than blocks are allowed to be. Disallows further writing to this file.

info()[source]: File information about this path

read(length=-1)[source]

Return data from cache, or fetch pieces as necessary

Parameters:

length: int (-1): Number of bytes to read; if <0, all remaining bytes.

readable()[source]: Whether opened for reading

readinto(b)[source]

mirrors builtin file’s readinto method

https://docs.python.org/3/library/io.html#io.RawIOBase.readinto

readline()[source]

Read until and including the first occurrence of newline character

Note that, because of character encoding, this is not necessarily a true line ending.

readlines()[source]: Return all data, split by the newline character, including the newline character

readuntil(char=b'\n', blocks=None)[source]

Return data between current position and first occurrence of char

char is included in the output, except if the end of the tile is encountered first.

Parameters:

char: bytes: Thing to find
blocks: None or int: How much to read in each go. Defaults to file blocksize - which may mean a new read on every call.

seek(loc, whence=0)[source]

Set current file location

Parameters:

loc: int: byte location
whence: {0, 1, 2}: from start of file, current location or end of file, resp.

seekable()[source]: Whether is seekable (only in read mode)

tell()[source]: Current file location

writable()[source]: Whether opened for writing

write(data)[source]

Write data to buffer.

Buffer only sent on flush() or if buffer is greater than or equal to blocksize.

Parameters:

data: bytes: Set of bytes to be written.

class fsspec.spec.AbstractFileSystem(*args, **kwargs)[source]

An abstract super-class for pythonic file-systems

Implementations are expected to be compatible with or, better, subclass from here.

cat(path, recursive=False, on_error='raise', **kwargs)[source]

Fetch (potentially multiple) paths’ contents

Parameters:

recursive: bool: If True, assume the path(s) are directories, and get all the contained files
on_error“raise”, “omit”, “return”: If raise, an underlying exception will be raised (converted to KeyError if the type is in self.missing_exceptions); if omit, keys with exception will simply not be included in the output; if “return”, all keys are included in the output, but the value will be bytes or an exception instance.
kwargs: passed to cat_file

Returns:

dict of {path: contents} if there are multiple paths
or the path has been otherwise expanded

cat_file(path, start=None, end=None, **kwargs)[source]

Get the content of a file

Parameters:

path: URL of file on this filesystems
start, end: int: Bytes limits of the read. If negative, backwards from end, like usual python slices. Either can be None for start or end of file, respectively
kwargs: passed to ``open()``.

cat_ranges(paths, starts, ends, max_gap=None, on_error='return', **kwargs)[source]

Get the contents of byte ranges from one or more files

Parameters:

paths: list: A list of of filepaths on this filesystems
starts, ends: int or list: Bytes limits of the read. If using a single int, the same value will be used to read all the specified files.

checksum(path)[source]

Unique value for current version of file

If the checksum is the same from one moment to another, the contents are guaranteed to be the same. If the checksum changes, the contents might have changed.

This should normally be overridden; default will probably capture creation/modification timestamp (which would be good) or maybe access timestamp (which would be bad)

classmethod clear_instance_cache()[source]

Clear the cache of filesystem instances.

Notes

Unless overridden by setting the cachable class attribute to False, the filesystem class stores a reference to newly created instances. This prevents Python’s normal rules around garbage collection from working, since the instances refcount will not drop to zero until clear_instance_cache is called.

copy(path1, path2, recursive=False, maxdepth=None, on_error=None, **kwargs)[source]

Copy within two locations in the filesystem

on_error“raise”, “ignore”: If raise, any not-found exceptions will be raised; if ignore any not-found exceptions will cause the path to be skipped; defaults to raise unless recursive is true, where the default is ignore

cp(path1, path2, **kwargs)[source]: Alias of AbstractFileSystem.copy.

created(path)[source]: Return the created timestamp of a file as a datetime.datetime

classmethod current()[source]

Return the most recently instantiated FileSystem

If no instance has been created, then create one with defaults

delete(path, recursive=False, maxdepth=None)[source]: Alias of AbstractFileSystem.rm.

disk_usage(path, total=True, maxdepth=None, **kwargs)[source]: Alias of AbstractFileSystem.du.

download(rpath, lpath, recursive=False, **kwargs)[source]: Alias of AbstractFileSystem.get.

du(path, total=True, maxdepth=None, withdirs=False, **kwargs)[source]

Space used by files and optionally directories within a path

Directory size does not include the size of its contents.

Parameters:

path: str
total: bool: Whether to sum all the file sizes
maxdepth: int or None: Maximum number of directory levels to descend, None for unlimited.
withdirs: bool: Whether to include directory paths in the output.
kwargs: passed to ``find``

Returns:

Dict of {path: size} if total=False, or int otherwise, where numbers
refer to bytes used.

end_transaction()[source]: Finish write transaction, non-context version

exists(path, **kwargs)[source]: Is there a file at the given path

expand_path(path, recursive=False, maxdepth=None, **kwargs)[source]

Turn one or more globs or directories into a list of all matching paths to files or directories.

kwargs are passed to glob or find, which may in turn call ls

find(path, maxdepth=None, withdirs=False, detail=False, **kwargs)[source]

List all files below path.

Like posix find command without conditions

Parameters:

pathstr
maxdepth: int or None: If not None, the maximum number of levels to descend
withdirs: bool: Whether to include directory paths in the output. This is True when used by glob, but users usually only want files.
kwargs are passed to ``ls``.

static from_dict(dct: dict[str, Any]) → AbstractFileSystem[source]

Recreate a filesystem instance from dictionary representation.

See .to_dict() for the expected structure of the input.

Parameters:

dct: Dict[str, Any]

Returns:

file system instance, not necessarily of this particular class.

Warning

This can import arbitrary modules (as determined by the cls key). Make sure you haven’t installed any modules that may execute malicious code at import time.

static from_json(blob: str) → AbstractFileSystem[source]

Recreate a filesystem instance from JSON representation.

See .to_json() for the expected structure of the input.

Parameters:

blob: str

Returns:

file system instance, not necessarily of this particular class.

Warning

This can import arbitrary modules (as determined by the cls key). Make sure you haven’t installed any modules that may execute malicious code at import time.

property fsid: Persistent filesystem id that can be used to compare filesystems across sessions.

get(rpath, lpath, recursive=False, callback=<fsspec.callbacks.NoOpCallback object>, maxdepth=None, **kwargs)[source]

Copy file(s) to local.

Copies a specific file or tree of files (if recursive=True). If lpath ends with a “/”, it will be assumed to be a directory, and target files will go within. Can submit a list of paths, which may be glob-patterns and will be expanded.

Calls get_file for each source.

get_file(rpath, lpath, callback=<fsspec.callbacks.NoOpCallback object>, outfile=None, **kwargs)[source]: Copy single remote file to local

get_mapper(root='', check=False, create=False, missing_exceptions=None)[source]

Create key/value store based on this file-system

Makes a MutableMapping interface to the FS at the given root path. See fsspec.mapping.FSMap for further details.

glob(path, maxdepth=None, **kwargs)[source]

Find files by glob-matching.

Pattern matching capabilities for finding files that match the given pattern.

Parameters:

path: str: The glob pattern to match against
maxdepth: int or None: Maximum depth for '**' patterns. Applied on the first '**' found. Must be at least 1 if provided.
kwargs:: Additional arguments passed to find (e.g., detail=True)

Returns:

List of matched paths, or dict of paths and their info if detail=True

Notes

Supported patterns: - ‘*’: Matches any sequence of characters within a single directory level - '**': Matches any number of directory levels (must be an entire path component) - ‘?’: Matches exactly one character - ‘[abc]’: Matches any character in the set - ‘[a-z]’: Matches any character in the range - ‘[!abc]’: Matches any character NOT in the set

Special behaviors: - If the path ends with ‘/’, only folders are returned - Consecutive ‘*’ characters are compressed into a single ‘*’ - Empty brackets ‘[]’ never match anything - Negated empty brackets ‘[!]’ match any single character - Special characters in character classes are escaped properly

Limitations: - '**' must be a complete path component (e.g., 'a/**/b', not 'a**b') - No brace expansion (‘{a,b}.txt’) - No extended glob patterns (‘+(pattern)’, ‘!(pattern)’)

head(path, size=1024)[source]: Get the first size bytes from file

info(path, **kwargs)[source]

Give details of entry at path

Returns a single dictionary, with exactly the same information as ls would with detail=True.

The default implementation calls ls and could be overridden by a shortcut. kwargs are passed on to `ls().

Some file systems might not be able to measure the file’s size, in which case, the returned dict will include 'size': None.

Returns:

dict with keys: name (full path in the FS), size (in bytes), type (file,
directory, or something else) and other FS-specific keys.

invalidate_cache(path=None)[source]

Discard any cached directory information

Parameters:

path: string or None: If None, clear all listings cached else listings at or under given path.

isdir(path)[source]: Is this entry directory-like?

isfile(path)[source]: Is this entry file-like?

lexists(path, **kwargs)[source]: If there is a file at the given path (including broken links)

listdir(path, detail=True, **kwargs)[source]: Alias of AbstractFileSystem.ls.

ls(path, detail=True, **kwargs)[source]

List objects at path.

This should include subdirectories and files at that location. The difference between a file and a directory must be clear when details are requested.

The specific keys, or perhaps a FileInfo class, or similar, is TBD, but must be consistent across implementations. Must include:

full path to the entry (without protocol)
size of the entry, in bytes. If the value cannot be determined, will be None.
type of entry, “file”, “directory” or other

Additional information may be present, appropriate to the file-system, e.g., generation, checksum, etc.

May use refresh=True|False to allow use of self._ls_from_cache to check for a saved listing and avoid calling the backend. This would be common where listing may be expensive.

Parameters:

path: str
detail: bool: if True, gives a list of dictionaries, where each is the same as the result of info(path). If False, gives a list of paths (str).
kwargs: may have additional backend-specific options, such as version: information

Returns:

List of strings if detail is False, or list of directory information
dicts if detail is True.

makedir(path, create_parents=True, **kwargs)[source]: Alias of AbstractFileSystem.mkdir.

makedirs(path, exist_ok=False)[source]

Recursively make directories

Creates directory at path and any intervening required directories. Raises exception if, for instance, the path already exists but is a file.

Parameters:

path: str: leaf directory name
exist_ok: bool (False): If False, will error if the target already exists

mkdir(path, create_parents=True, **kwargs)[source]

Create directory entry at path

For systems that don’t have true directories, may create an for this instance only and not touch the real filesystem

Parameters:

path: str: location
create_parents: bool: if True, this is equivalent to makedirs
kwargs:: may be permissions, etc.

mkdirs(path, exist_ok=False)[source]: Alias of AbstractFileSystem.makedirs.

modified(path)[source]: Return the modified timestamp of a file as a datetime.datetime

move(path1, path2, **kwargs)[source]: Alias of AbstractFileSystem.mv.

mv(path1, path2, recursive=False, maxdepth=None, **kwargs)[source]: Move file(s) from one location to another

open(path, mode='rb', block_size=None, cache_options=None, compression=None, **kwargs)[source]

Return a file-like object from the filesystem

The resultant instance must function correctly in a context with block.

Parameters:

path: str: Target file
mode: str like ‘rb’, ‘w’: See builtin open() Mode “x” (exclusive write) may be implemented by the backend. Even if it is, whether it is checked up front or on commit, and whether it is atomic is implementation-dependent.
block_size: int: Some indication of buffering - this is a value in bytes
cache_optionsdict, optional: Extra arguments to pass through to the cache.
compression: string or None: If given, open file using compression codec. Can either be a compression name (a key in fsspec.compression.compr) or “infer” to guess the compression from the filename suffix.
encoding, errors, newline: passed on to TextIOWrapper for text mode

pipe(path, value=None, **kwargs)[source]

Put value into path

(counterpart to cat)

Parameters:

path: string or dict(str, bytes): If a string, a single remote location to put value bytes; if a dict, a mapping of {path: bytesvalue}.
value: bytes, optional: If using a single path, these are the bytes to put there. Ignored if path is a dict

pipe_file(path, value, mode='overwrite', **kwargs)[source]: Set the bytes of given file

put(lpath, rpath, recursive=False, callback=<fsspec.callbacks.NoOpCallback object>, maxdepth=None, **kwargs)[source]

Copy file(s) from local.

Copies a specific file or tree of files (if recursive=True). If rpath ends with a “/”, it will be assumed to be a directory, and target files will go within.

Calls put_file for each source.

put_file(lpath, rpath, callback=<fsspec.callbacks.NoOpCallback object>, mode='overwrite', **kwargs)[source]: Copy single file to remote

read_block(fn, offset, length, delimiter=None)[source]

Read a block of bytes from

Starting at offset of the file, read length bytes. If delimiter is set then we ensure that the read starts and stops at delimiter boundaries that follow the locations offset and offset + length. If offset is zero then we start at zero. The bytestring returned WILL include the end delimiter string.

If offset+length is beyond the eof, reads to eof.

Parameters:

fn: string: Path to filename
offset: int: Byte offset to start read
length: int: Number of bytes to read. If None, read to end.
delimiter: bytes (optional): Ensure reading starts and stops at delimiter bytestring

Built-in Implementations

`implementations.arrow.ArrowFSWrapper`(*args, ...)	FSSpec-compatible wrapper of pyarrow.fs.FileSystem.
`implementations.arrow.HadoopFileSystem`(...)	A wrapper on top of the pyarrow.fs.HadoopFileSystem to connect it's interface with fsspec
`implementations.cached.CachingFileSystem`(...)	Locally caching filesystem, layer over any other FS
`implementations.cached.SimpleCacheFileSystem`(...)	Caches whole remote files on first access
`implementations.cached.WholeFileCacheFileSystem`(...)	Caches whole remote files on first access
`implementations.dask.DaskWorkerFileSystem`(...)	View files accessible to a worker as any other remote file-system
`implementations.data.DataFileSystem`(*args, ...)	A handy decoder for data-URLs
`implementations.dbfs.DatabricksFileSystem`(...)	Get access to the Databricks filesystem implementation over HTTP.
`implementations.dirfs.DirFileSystem`(*args, ...)	Directory prefix filesystem
`implementations.ftp.FTPFileSystem`(*args, ...)	A filesystem over classic FTP
`implementations.gist.GistFileSystem`(*args, ...)	Interface to files in a single GitHub Gist.
`implementations.git.GitFileSystem`(*args, ...)	Browse the files of a local git repo at any hash/tag/branch
`implementations.github.GithubFileSystem`(...)	Interface to files in github
`implementations.http.HTTPFileSystem`(*args, ...)	Simple File-System for fetching data via HTTP(S)
`implementations.jupyter.JupyterFileSystem`(...)	View of the files as seen by a Jupyter server (notebook or lab)
`implementations.libarchive.LibArchiveFileSystem`(...)	Compressed archives as a file-system (read-only)
`implementations.local.LocalFileSystem`(*args, ...)	Interface to files on local storage
`implementations.memory.MemoryFileSystem`(...)	A filesystem based on a dict of BytesIO objects
`implementations.reference.ReferenceFileSystem`(...)	View byte ranges of some other file as a file system Initial version: single file system target, which must support async, and must allow start and end args in _cat_file.
`implementations.reference.LazyReferenceMapper`(root)	This interface can be used to read/write references from Parquet stores.
`implementations.sftp.SFTPFileSystem`(*args, ...)	Files over SFTP/SSH
`implementations.smb.SMBFileSystem`(*args, ...)	Allow reading and writing to Windows and Samba network shares.
`implementations.tar.TarFileSystem`(*args, ...)	Compressed Tar archives as a file-system (read-only)
`implementations.webhdfs.WebHDFS`(args, *kwargs)	Interface to HDFS over HTTP using the WebHDFS API.
`implementations.zip.ZipFileSystem`(*args, ...)	Read/Write contents of ZIP archive as a file-system

class fsspec.implementations.arrow.ArrowFSWrapper(*args, **kwargs)[source]

FSSpec-compatible wrapper of pyarrow.fs.FileSystem.

Parameters:

fspyarrow.fs.FileSystem

__init__(fs, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.arrow.HadoopFileSystem(*args, **kwargs)[source]

A wrapper on top of the pyarrow.fs.HadoopFileSystem to connect it’s interface with fsspec

__init__(host='default', port=0, user=None, kerb_ticket=None, replication=3, extra_conf=None, **kwargs)[source]

Parameters:

host: str: Hostname, IP or “default” to try to read from Hadoop config
port: int: Port to connect on, or default from Hadoop config if 0
user: str or None: If given, connect as this username
kerb_ticket: str or None: If given, use this ticket for authentication
replication: int: set replication factor of file for write operations. default value is 3.
extra_conf: None or dict: Passed on to HadoopFileSystem

class fsspec.implementations.cached.CachingFileSystem(*args, **kwargs)[source]

Locally caching filesystem, layer over any other FS

This class implements chunk-wise local storage of remote files, for quick access after the initial download. The files are stored in a given directory with hashes of URLs for the filenames. If no directory is given, a temporary one is used, which should be cleaned up by the OS after the process ends. The files themselves are sparse (as implemented in MMapCache), so only the data which is accessed takes up space.

Restrictions:

the block-size must be the same for each access of a given file, unless all blocks of the file have already been read
caching can only be applied to file-systems which produce files derived from fsspec.spec.AbstractBufferedFile ; LocalFileSystem is also allowed, for testing

__init__(target_protocol=None, cache_storage='TMP', cache_check=10, check_files=False, expiry_time=604800, target_options=None, fs=None, same_names: bool | None = None, compression=None, cache_mapper: AbstractCacheMapper | None = None, **kwargs)[source]

Parameters:

target_protocol: str (optional): Target filesystem protocol. Provide either this or fs.
cache_storage: str or list(str): Location to store files. If “TMP”, this is a temporary directory, and will be cleaned up by the OS when this process ends (or later). If a list, each location will be tried in the order given, but only the last will be considered writable.
cache_check: int: Number of seconds between reload of cache metadata
check_files: bool: Whether to explicitly see if the UID of the remote file matches the stored one before using. Warning: some file systems such as HTTP cannot reliably give a unique hash of the contents of some path, so be sure to set this option to False.
expiry_time: int: The time in seconds after which a local copy is considered useless. Set to falsy to prevent expiry. The default is equivalent to one week.
target_options: dict or None: Passed to the instantiation of the FS, if fs is None.
fs: filesystem instance: The target filesystem to run against. Provide this or protocol.
same_names: bool (optional): By default, target URLs are hashed using a HashCacheMapper so that files from different backends with the same basename do not conflict. If this argument is true, a BasenameCacheMapper is used instead. Other cache mapper options are available by using the cache_mapper keyword argument. Only one of this and cache_mapper should be specified.
compression: str (optional): To decompress on download. Can be ‘infer’ (guess from the URL name), one of the entries in fsspec.compression.compr, or None for no decompression.
cache_mapper: AbstractCacheMapper (optional): The object use to map from original filenames to cached filenames. Only one of this and same_names should be specified.

class fsspec.implementations.cached.SimpleCacheFileSystem(*args, **kwargs)[source]

Caches whole remote files on first access

This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This implementation only copies whole files, and does not keep any metadata about the download time or file details. It is therefore safer to use in multi-threaded/concurrent situations.

This is the only of the caching filesystems that supports write: you will be given a real local open file, and upon close and commit, it will be uploaded to the target filesystem; the writability or the target URL is not checked until that time.

__init__(**kwargs)[source]

Parameters:

target_protocol: str (optional): Target filesystem protocol. Provide either this or fs.
cache_storage: str or list(str): Location to store files. If “TMP”, this is a temporary directory, and will be cleaned up by the OS when this process ends (or later). If a list, each location will be tried in the order given, but only the last will be considered writable.
cache_check: int: Number of seconds between reload of cache metadata
check_files: bool: Whether to explicitly see if the UID of the remote file matches the stored one before using. Warning: some file systems such as HTTP cannot reliably give a unique hash of the contents of some path, so be sure to set this option to False.
expiry_time: int: The time in seconds after which a local copy is considered useless. Set to falsy to prevent expiry. The default is equivalent to one week.
target_options: dict or None: Passed to the instantiation of the FS, if fs is None.
fs: filesystem instance: The target filesystem to run against. Provide this or protocol.
same_names: bool (optional): By default, target URLs are hashed using a HashCacheMapper so that files from different backends with the same basename do not conflict. If this argument is true, a BasenameCacheMapper is used instead. Other cache mapper options are available by using the cache_mapper keyword argument. Only one of this and cache_mapper should be specified.
compression: str (optional): To decompress on download. Can be ‘infer’ (guess from the URL name), one of the entries in fsspec.compression.compr, or None for no decompression.
cache_mapper: AbstractCacheMapper (optional): The object use to map from original filenames to cached filenames. Only one of this and same_names should be specified.

class fsspec.implementations.cached.WholeFileCacheFileSystem(*args, **kwargs)[source]

Caches whole remote files on first access

This class is intended as a layer over any other file system, and will make a local copy of each file accessed, so that all subsequent reads are local. This is similar to CachingFileSystem, but without the block-wise functionality and so can work even when sparse files are not allowed. See its docstring for definition of the init arguments.

The class still needs access to the remote store for listing files, and may refresh cached files.

__init__(target_protocol=None, cache_storage='TMP', cache_check=10, check_files=False, expiry_time=604800, target_options=None, fs=None, same_names: bool | None = None, compression=None, cache_mapper: AbstractCacheMapper | None = None, **kwargs)

Parameters:

target_protocol: str (optional): Target filesystem protocol. Provide either this or fs.
cache_storage: str or list(str): Location to store files. If “TMP”, this is a temporary directory, and will be cleaned up by the OS when this process ends (or later). If a list, each location will be tried in the order given, but only the last will be considered writable.
cache_check: int: Number of seconds between reload of cache metadata
check_files: bool: Whether to explicitly see if the UID of the remote file matches the stored one before using. Warning: some file systems such as HTTP cannot reliably give a unique hash of the contents of some path, so be sure to set this option to False.
expiry_time: int: The time in seconds after which a local copy is considered useless. Set to falsy to prevent expiry. The default is equivalent to one week.
target_options: dict or None: Passed to the instantiation of the FS, if fs is None.
fs: filesystem instance: The target filesystem to run against. Provide this or protocol.
same_names: bool (optional): By default, target URLs are hashed using a HashCacheMapper so that files from different backends with the same basename do not conflict. If this argument is true, a BasenameCacheMapper is used instead. Other cache mapper options are available by using the cache_mapper keyword argument. Only one of this and cache_mapper should be specified.
compression: str (optional): To decompress on download. Can be ‘infer’ (guess from the URL name), one of the entries in fsspec.compression.compr, or None for no decompression.
cache_mapper: AbstractCacheMapper (optional): The object use to map from original filenames to cached filenames. Only one of this and same_names should be specified.

class fsspec.implementations.dask.DaskWorkerFileSystem(*args, **kwargs)[source]

View files accessible to a worker as any other remote file-system

When instances are run on the worker, uses the real filesystem. When run on the client, they call the worker to provide information or data.

Warning this implementation is experimental, and read-only for now.

__init__(target_protocol=None, target_options=None, fs=None, client=None, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.data.DataFileSystem(*args, **kwargs)[source]

A handy decoder for data-URLs

__init__(**kwargs)[source]: No parameters for this filesystem

class fsspec.implementations.dbfs.DatabricksFileSystem(*args, **kwargs)[source]

Get access to the Databricks filesystem implementation over HTTP. Can be used inside and outside of a databricks cluster.

__init__(instance, token, **kwargs)[source]

Create a new DatabricksFileSystem.

Parameters:

instance: str: The instance URL of the databricks cluster. For example for an Azure databricks cluster, this has the form adb-<some-number>.<two digits>.azuredatabricks.net.
token: str: Your personal token. Find out more here: https://docs.databricks.com/dev-tools/api/latest/authentication.html

class fsspec.implementations.dirfs.DirFileSystem(*args, **kwargs)[source]

Directory prefix filesystem

The DirFileSystem is a filesystem-wrapper. It assumes every path it is dealing with is relative to the path. After performing the necessary paths operation it delegates everything to the wrapped filesystem.

__init__(path=None, fs=None, fo=None, target_protocol=None, target_options=None, **storage_options)[source]

Parameters:

path: str: Path to the directory.
fs: AbstractFileSystem: An instantiated filesystem to wrap.
target_protocol, target_options:: if fs is none, construct it from these
fo: str: Alternate for path; do not provide both

class fsspec.implementations.ftp.FTPFileSystem(*args, **kwargs)[source]

A filesystem over classic FTP

__init__(host, port=21, username=None, password=None, acct=None, block_size=None, tempdir=None, timeout=30, encoding='utf-8', tls=False, **kwargs)[source]

You can use _get_kwargs_from_urls to get some kwargs from a reasonable FTP url.

Authentication will be anonymous if username/password are not given.

Parameters:

host: str: The remote server name/ip to connect to
port: int: Port to connect with
username: str or None: If authenticating, the user’s identifier
password: str of None: User’s password on the server, if using
acct: str or None: Some servers also need an “account” string for auth
block_size: int or None: If given, the read-ahead or write buffer size.
tempdir: str: Directory on remote to put temporary files when in a transaction
timeout: int: Timeout of the ftp connection in seconds
encoding: str: Encoding to use for directories and filenames in FTP connection
tls: bool: Use FTP-TLS, by default False

class fsspec.implementations.gist.GistFileSystem(*args, **kwargs)[source]

Interface to files in a single GitHub Gist.

Provides read-only access to a gist’s files. Gists do not contain subdirectories, so file listing is straightforward.

Parameters:

gist_id: str: The ID of the gist you want to access (the long hex value from the URL).
filenames: list[str] (optional): If provided, only make a file system representing these files, and do not fetch the list of all files for this gist.
sha: str (optional): If provided, fetch a particular revision of the gist. If omitted, the latest revision is used.
username: str (optional): GitHub username for authentication.
token: str (optional): GitHub personal access token (required if username is given), or.
timeout: (float, float) or float, optional: Connect and read timeouts for requests (default 60s each).
kwargs: dict: Stored on self.request_kw and passed to requests.get when fetching Gist metadata or reading (“opening”) a file.

__init__(gist_id, filenames=None, sha=None, username=None, token=None, timeout=None, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.git.GitFileSystem(*args, **kwargs)[source]

Browse the files of a local git repo at any hash/tag/branch

(experimental backend)

__init__(path=None, fo=None, ref=None, **kwargs)[source]

Parameters:

path: str (optional): Local location of the repo (uses current directory if not given). May be deprecated in favour of fo. When used with a higher level function such as fsspec.open(), may be of the form “git://[path-to-repo[:]][ref@]path/to/file” (but the actual file path should not contain “@” or “:”).
fo: str (optional): Same as path, but passed as part of a chained URL. This one takes precedence if both are given.
ref: str (optional): Reference to work with, could be a hash, tag or branch name. Defaults to current working tree. Note that ls and open also take hash, so this becomes the default for those operations
kwargs

class fsspec.implementations.github.GithubFileSystem(*args, **kwargs)[source]

Interface to files in github

An instance of this class provides the files residing within a remote github repository. You may specify a point in the repos history, by SHA, branch or tag (default is current master).

For files less than 1 MB in size, file content is returned directly in a MemoryFile. For larger files, or for files tracked by git-lfs, file content is returned as an HTTPFile wrapping the download_url provided by the GitHub API.

When using fsspec.open, allows URIs of the form:

“github://path/file”, in which case you must specify org, repo and may specify sha in the extra args
‘github://org:repo@/precip/catalog.yml’, where the org and repo are part of the URI
‘github://org:repo@sha/precip/catalog.yml’, where the sha is also included

sha can be the full or abbreviated hex of the commit you want to fetch from, or a branch or tag name (so long as it doesn’t contain special characters like “/”, “?”, which would have to be HTTP-encoded).

For authorised access, you must provide username and token, which can be made at https://github.com/settings/tokens

__init__(org, repo, sha=None, username=None, token=None, timeout=None, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.http.HTTPFileSystem(*args, **kwargs)[source]

Simple File-System for fetching data via HTTP(S)

ls() is implemented by loading the parent page and doing a regex match on the result. If simple_link=True, anything of the form “http(s)://server.com/stuff?thing=other”; otherwise only links within HTML href tags will be used.

__init__(simple_links=True, block_size=None, same_scheme=True, size_policy=None, cache_type='bytes', cache_options=None, asynchronous=False, loop=None, client_kwargs=None, get_client=<function get_client>, encoded=False, **storage_options)[source]

NB: if this is called async, you must await set_client

Parameters:

block_size: int: Blocks to read bytes; if 0, will default to raw requests file-like objects instead of HTTPFile instances
simple_links: bool: If True, will consider both HTML <a> tags and anything that looks like a URL; if False, will consider only the former.
same_scheme: True: When doing ls/glob, if this is True, only consider paths that have http/https matching the input URLs.
size_policy: this argument is deprecated
client_kwargs: dict: Passed to aiohttp.ClientSession, see https://docs.aiohttp.org/en/stable/client_reference.html For example, {'auth': aiohttp.BasicAuth('user', 'pass')}
get_client: Callable[…, aiohttp.ClientSession]: A callable, which takes keyword arguments and constructs an aiohttp.ClientSession. Its state will be managed by the HTTPFileSystem class.
storage_options: key-value: Any other parameters passed on to requests
cache_type, cache_options: defaults used in open()

class fsspec.implementations.jupyter.JupyterFileSystem(*args, **kwargs)[source]

View of the files as seen by a Jupyter server (notebook or lab)

__init__(url, tok=None, **kwargs)[source]

Parameters:

urlstr: Base URL of the server, like “http://127.0.0.1:8888”. May include token in the string, which is given by the process when starting up
tokstr: If the token is obtained separately, can be given here
kwargs

class fsspec.implementations.libarchive.LibArchiveFileSystem(*args, **kwargs)[source]

Compressed archives as a file-system (read-only)

Supports the following formats: tar, pax , cpio, ISO9660, zip, mtree, shar, ar, raw, xar, lha/lzh, rar Microsoft CAB, 7-Zip, WARC

See the libarchive documentation for further restrictions. https://www.libarchive.org/

Keeps file object open while instance lives. It only works in seekable file-like objects. In case the filesystem does not support this kind of file object, it is recommended to cache locally.

This class is pickleable, but not necessarily thread-safe (depends on the platform). See libarchive documentation for details.

__init__(fo='', mode='r', target_protocol=None, target_options=None, block_size=5242880, **kwargs)[source]

Parameters:

fo: str or file-like: Contains ZIP, and must exist. If a str, will fetch file using open_files(), which must return one file exactly.
mode: str: Currently, only ‘r’ accepted
target_protocol: str (optional): If fo is a string, this value can be used to override the FS protocol inferred from a URL
target_options: dict (optional): Kwargs passed when instantiating the target FS, if fo is a string.

class fsspec.implementations.local.LocalFileSystem(*args, **kwargs)[source]

Interface to files on local storage

Parameters:

auto_mkdir: bool: Whether, when opening a file, the directory containing it should be created (if it doesn’t already exist). This is assumed by pyarrow code.

__init__(auto_mkdir=False, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.memory.MemoryFileSystem(*args, **kwargs)[source]

A filesystem based on a dict of BytesIO objects

This is a global filesystem so instances of this class all point to the same in memory filesystem.

__init__(*args, **storage_options)

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.reference.ReferenceFileSystem(*args, **kwargs)[source]

View byte ranges of some other file as a file system Initial version: single file system target, which must support async, and must allow start and end args in _cat_file. Later versions may allow multiple arbitrary URLs for the targets. This FileSystem is read-only. It is designed to be used with async targets (for now). We do not get original file details from the target FS. Configuration is by passing a dict of references at init, or a URL to a JSON file containing the same; this dict can also contain concrete data for some set of paths. Reference dict format: {path0: bytes_data, path1: (target_url, offset, size)} https://github.com/fsspec/kerchunk/blob/main/README.md

__init__(fo, target=None, ref_storage_args=None, target_protocol=None, target_options=None, remote_protocol=None, remote_options=None, fs=None, template_overrides=None, simple_templates=True, max_gap=64000, max_block=256000000, cache_size=128, **kwargs)[source]

Parameters:

fodict or str

The set of references to use for this instance, with a structure as above. If str referencing a JSON file, will use fsspec.open, in conjunction with target_options and target_protocol to open and parse JSON at this location. If a directory, then assume references are a set of parquet files to be loaded lazily.

targetstr

For any references having target_url as None, this is the default file target to use

ref_storage_argsdict

If references is a str, use these kwargs for loading the JSON file. Deprecated: use target_options instead.

target_protocolstr

Used for loading the reference file, if it is a path. If None, protocol will be derived from the given path

target_optionsdict

Extra FS options for loading the reference file fo, if given as a path

remote_protocolstr

The protocol of the filesystem on which the references will be evaluated (unless fs is provided). If not given, will be derived from the first URL that has a protocol in the templates or in the references, in that order.

remote_optionsdict

kwargs to go with remote_protocol

fsAbstractFileSystem | dict(str, (AbstractFileSystem | dict))

Directly provide a file system(s):

a single filesystem instance
a dict of protocol:filesystem, where each value is either a filesystem instance, or a dict of kwargs that can be used to create in instance for the given protocol

If this is given, remote_options and remote_protocol are ignored.

template_overridesdict

Swap out any templates in the references file with these - useful for testing.

simple_templates: bool

Whether templates can be processed with simple replace (True) or if jinja is needed (False, much slower). All reference sets produced by kerchunk are simple in this sense, but the spec allows for complex.

max_gap, max_block: int

For merging multiple concurrent requests to the same remote file. Neighboring byte ranges will only be merged when their inter-range gap is <= max_gap. Default is 64KB. Set to 0 to only merge when it requires no extra bytes. Pass a negative number to disable merging, appropriate for local target files. Neighboring byte ranges will only be merged when the size of the aggregated range is <= max_block. Default is 256MB.

cache_sizeint

Maximum size of LRU cache, where cache_size*record_size denotes the total number of references that can be loaded in memory at once. Only used for lazily loaded references.

kwargspassed to parent class

class fsspec.implementations.reference.LazyReferenceMapper(root, fs=None, out_root=None, cache_size=128, categorical_threshold=10, engine: Literal['fastparquet', 'pyarrow'] = 'fastparquet')[source]

This interface can be used to read/write references from Parquet stores. It is not intended for other types of references. It can be used with Kerchunk’s MultiZarrToZarr method to combine references into a parquet store. Examples of this use-case can be found here: https://fsspec.github.io/kerchunk/advanced.html?highlight=parquet#parquet-storage

__init__(root, fs=None, out_root=None, cache_size=128, categorical_threshold=10, engine: Literal['fastparquet', 'pyarrow'] = 'fastparquet')[source]

This instance will be writable, storing changes in memory until full partitions are accumulated or .flush() is called.

To create an empty lazy store, use .create()

Parameters:

rootstr: Root of parquet store
fsfsspec.AbstractFileSystem: fsspec filesystem object, default is local filesystem.
cache_sizeint, default=128: Maximum size of LRU cache, where cache_size*record_size denotes the total number of references that can be loaded in memory at once.
categorical_thresholdint: Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number. (default 10)
engine: Literal[“fastparquet”,”pyarrow”]: Engine choice for reading parquet files. (default is “fastparquet”)

class fsspec.implementations.sftp.SFTPFileSystem(*args, **kwargs)[source]

Files over SFTP/SSH

Peer-to-peer filesystem over SSH using paramiko.

Note: if using this with the open or open_files, with full URLs, there is no way to tell if a path is relative, so all paths are assumed to be absolute.

__init__(host, **ssh_kwargs)[source]

Parameters:

host: str: Hostname or IP as a string
temppath: str: Location on the server to put files, when within a transaction
ssh_kwargs: dict: Parameters passed on to connection. See details in https://docs.paramiko.org/en/3.3/api/client.html#paramiko.client.SSHClient.connect May include port, username, password…

class fsspec.implementations.smb.SMBFileSystem(*args, **kwargs)[source]

Allow reading and writing to Windows and Samba network shares.

When using fsspec.open() for getting a file-like object the URI should be specified as this format: smb://workgroup;user:password@server:port/share/folder/file.csv.

Example:

>>> import fsspec
>>> with fsspec.open(
...     'smb://myuser:mypassword@myserver.com/' 'share/folder/file.csv'
... ) as smbfile:
...     df = pd.read_csv(smbfile, sep='|', header=None)

Note that you need to pass in a valid hostname or IP address for the host component of the URL. Do not use the Windows/NetBIOS machine name for the host component.

The first component of the path in the URL points to the name of the shared folder. Subsequent path components will point to the directory/folder/file.

The URL components workgroup , user, password and port may be optional.

Note

For working this source require smbprotocol to be installed, e.g.:

$ pip install smbprotocol
# or
# pip install smbprotocol[kerberos]

Note: if using this with the open or open_files, with full URLs, there is no way to tell if a path is relative, so all paths are assumed to be absolute.

__init__(host, port=None, username=None, password=None, timeout=60, encrypt=None, share_access=None, register_session_retries=4, register_session_retry_wait=1, register_session_retry_factor=10, auto_mkdir=False, **kwargs)[source]

You can use _get_kwargs_from_urls to get some kwargs from a reasonable SMB url.

Authentication will be anonymous or integrated if username/password are not given.

Parameters:

host: str

The remote server name/ip to connect to

port: int or None

Port to connect with. Usually 445, sometimes 139.

username: str or None

Username to connect with. Required if Kerberos auth is not being used.

password: str or None

User’s password on the server, if using username

timeout: int

Connection timeout in seconds

encrypt: bool

Whether to force encryption or not, once this has been set to True the session cannot be changed back to False.

share_access: str or None

Specifies the default access applied to file open operations performed with this file system object. This affects whether other processes can concurrently open a handle to the same file.

None (the default): exclusively locks the file until closed.
‘r’: Allow other handles to be opened with read access.
‘w’: Allow other handles to be opened with write access.
‘d’: Allow other handles to be opened with delete access.

register_session_retries: int

Number of retries to register a session with the server. Retries are not performed for authentication errors, as they are considered as invalid credentials and not network issues. If set to negative value, no register attempts will be performed.

register_session_retry_wait: int

Time in seconds to wait between each retry. Number must be non-negative.

register_session_retry_factor: int

Base factor for the wait time between each retry. The wait time is calculated using exponential function. For factor=1 all wait times will be equal to register_session_retry_wait. For any number of retries, the last wait time will be equal to register_session_retry_wait and for retries>1 the first wait time will be equal to register_session_retry_wait / factor. Number must be equal to or greater than 1. Optimal factor is 10.

auto_mkdir: bool

Whether, when opening a file, the directory containing it should be created (if it doesn’t already exist). This is assumed by pyarrow and zarr-python code.

class fsspec.implementations.tar.TarFileSystem(*args, **kwargs)[source]

Compressed Tar archives as a file-system (read-only)

Supports the following formats: tar.gz, tar.bz2, tar.xz

__init__(fo='', index_store=None, target_options=None, target_protocol=None, compression=None, **kwargs)[source]

Create and configure file-system instance

Instances may be cachable, so if similar enough arguments are seen a new instance is not required. The token attribute exists to allow implementations to cache instances if they wish.

A reasonable default should be provided if there are no arguments.

Subclasses should call this method.

Parameters:

use_listings_cache, listings_expiry_time, max_paths:: passed to DirCache, if the implementation supports directory listing caching. Pass use_listings_cache=False to disable such caching.
skip_instance_cache: bool: If this is a cachable implementation, pass True here to force creating a new instance even if a matching instance exists, and prevent storing this instance.
asynchronous: bool
loop: asyncio-compatible IOLoop or None

class fsspec.implementations.webhdfs.WebHDFS(*args, **kwargs)[source]

Interface to HDFS over HTTP using the WebHDFS API. Supports also HttpFS gateways.

Four auth mechanisms are supported:

insecure: no auth is done, and the user is assumed to be whoever they: say they are (parameter user), or a predefined value such as “dr.who” if not given
spnego: when kerberos authentication is enabled, auth is negotiated by: requests_kerberos https://github.com/requests/requests-kerberos . This establishes a session based on existing kinit login and/or specified principal/password; parameters are passed with kerb_kwargs
token: uses an existing Hadoop delegation token from another secured: service. Indeed, this client can also generate such tokens when not insecure. Note that tokens expire, but can be renewed (by a previously specified user) and may allow for proxying.
basic-auth: used when both parameter user and parameter password: are provided.

__init__(host, port=50070, kerberos=False, token=None, user=None, password=None, proxy_to=None, kerb_kwargs=None, data_proxy=None, use_https=False, session_cert=None, session_verify=True, **kwargs)[source]

Parameters:

host: str: Name-node address
port: int: Port for webHDFS
kerberos: bool: Whether to authenticate with kerberos for this connection
token: str or None: If given, use this token on every call to authenticate. A user and user-proxy may be encoded in the token and should not be also given
user: str or None: If given, assert the user name to connect with
password: str or None: If given, assert the password to use for basic auth. If password is provided, user must be provided also
proxy_to: str or None: If given, the user has the authority to proxy, and this value is the user in who’s name actions are taken
kerb_kwargs: dict: Any extra arguments for HTTPKerberosAuth, see https://github.com/requests/requests-kerberos/blob/master/requests_kerberos/kerberos_.py
data_proxy: dict, callable or None: If given, map data-node addresses. This can be necessary if the HDFS cluster is behind a proxy, running on Docker or otherwise has a mismatch between the host-names given by the name-node and the address by which to refer to them from the client. If a dict, maps host names host->data_proxy[host]; if a callable, full URLs are passed, and function must conform to url->data_proxy(url).
use_https: bool: Whether to connect to the Name-node using HTTPS instead of HTTP
session_cert: str or Tuple[str, str] or None: Path to a certificate file, or tuple of (cert, key) files to use for the requests.Session
session_verify: str, bool or None: Path to a certificate file to use for verifying the requests.Session.
kwargs

class fsspec.implementations.zip.ZipFileSystem(*args, **kwargs)[source]

Read/Write contents of ZIP archive as a file-system

Keeps file object open while instance lives.

This class is pickleable, but not necessarily thread-safe

__init__(fo='', mode='r', target_protocol=None, target_options=None, compression=0, allowZip64=True, compresslevel=None, **kwargs)[source]

Parameters:

fo: str or file-like: Contains ZIP, and must exist. If a str, will fetch file using open_files(), which must return one file exactly.
mode: str: Accept: “r”, “w”, “a”
target_protocol: str (optional): If fo is a string, this value can be used to override the FS protocol inferred from a URL
target_options: dict (optional): Kwargs passed when instantiating the target FS, if fo is a string.
compression, allowZip64, compresslevel: passed to ZipFile: Only relevant when creating a ZIP

Other Known Implementations

Note that most of these projects are hosted outside of the fsspec organisation. Please read their documentation carefully before using any particular package.

abfs for Azure Blob service, with protocol “abfs://”
adl for Azure DataLake storage, with protocol “adl://”
alluxiofs to access fsspec implemented filesystem with Alluxio distributed cache
boxfs for access to Box file storage, with protocol “box://”
csvbase for access to csvbase.com hosted CSV files, with protocol “csvbase://”
dropbox for access to dropbox shares, with protocol “dropbox://”
dvc to access DVC/Git repository as a filesystem
fsspec-encrypted for transparent encryption on top of other fsspec filesystems.
gcsfs for Google Cloud Storage, with protocol “gs://” or “gcs://”
gdrive to access Google Drive and shares (experimental)
git to access Git repositories
huggingface_hub to access the Hugging Face Hub filesystem, with protocol “hf://”
hdfs-native to access Hadoop filesystem, with protocol “hdfs://”
httpfs-sync to access HTTP(s) files in a synchronous manner to offer an alternative to the aiohttp-based implementation.
ipfsspec for the InterPlanetary File System (IPFS), with protocol “ipfs://”
irods for access to iRODS servers, with protocol “irods://”
lakefs for lakeFS data lakes, with protocol “lakefs://”
morefs for OverlayFileSystem, DictFileSystem, and others
msgraphfs for Microsoft storage (ie Sharepoint) using the drive API through Microsoft Graph, with protocol “msgd://”
obstore: zero-dependency access to Amazon S3, Google Cloud Storage, and Azure Blob Storage using the underlying Rust object_store library, with protocols “s3://”, “gs://”, and “abfs://”.
ocifs for access to Oracle Cloud Object Storage, with protocol “oci://”
ocilake for OCI Data Lake storage
ossfs for Alibaba Cloud (Aliyun) Object Storage System (OSS)
p9fs for 9P (Plan 9 Filesystem Protocol) servers
PyAthena for S3 access to Amazon Athena, with protocol “s3://” or “s3a://”
fsspec-pydantic: Pydantic models to represent file and directory types
PyDrive2 for Google Drive access
fsspec-python: A chained filesystem that connects to Python’s import system, to allow for importing from an fsspec backend
fsspec-proxy for “pyscript:” URLs via a proxy server
s3fs for Amazon S3 and other compatible stores, with protocol “s3://”
sshfs for access to SSH servers, with protocol “ssh://” or “sftp://”
swiftspec for OpenStack SWIFT, with protocol “swift://”
tosfs for ByteDance volcano engine Tinder Object Storage (TOS)
fsspec-union: A chained filesystem that unions multiple fsspec backends as a read-through cache
wandbfs to access Wandb run data (experimental)
wandbfsspec to access Weights & Biases (experimental)
webdav4 for WebDAV, with protocol “webdav://” or “dav://”
xrootd for xrootd, with protocol “root://”

Read Buffering

`caching.BlockCache`(blocksize, fetcher, size)	Cache holding memory as a set of blocks.
`caching.BytesCache`(blocksize, fetcher, size)	Cache which holds data in a in-memory bytes object
`caching.MMapCache`(blocksize, fetcher, size)	memory-mapped sparse file cache
`caching.ReadAheadCache`(blocksize, fetcher, size)	Cache which reads only when we get beyond a block of data
`caching.FirstChunkCache`(blocksize, fetcher, size)	Caches the first block of a file only
`caching.BackgroundBlockCache`(blocksize, ...)	Cache holding memory as a set of blocks with pre-loading of the next block in the background.

class fsspec.caching.BlockCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int, maxblocks: int = 32)[source]

Cache holding memory as a set of blocks.

Requests are only ever made blocksize at a time, and are stored in an LRU cache. The least recently accessed block is discarded when more than maxblocks are stored.

Parameters:

blocksizeint: The number of bytes to store in each block. Requests are only ever made for blocksize, so this should balance the overhead of making a request against the granularity of the blocks.
fetcherCallable
sizeint: The total size of the file being cached.
maxblocksint: The maximum number of blocks to cache for. The maximum memory use for this cache is then blocksize * maxblocks.

cache_info()[source]

The statistics on the block cache.

Returns:

NamedTuple: Returned directly from the LRU Cache used internally.

class fsspec.caching.BytesCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int, trim: bool = True)[source]

Cache which holds data in a in-memory bytes object

Implements read-ahead by the block size, for semi-random reads progressing through the file.

Parameters:

trim: bool: As we read more data, whether to discard the start of the buffer when we are more than a blocksize ahead of it.

class fsspec.caching.MMapCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int, location: str | None = None, blocks: set[int] | None = None, multi_fetcher: Callable[[list[int, int]], bytes] | None = None)[source]

memory-mapped sparse file cache

Opens temporary file, which is filled blocks-wise when data is requested. Ensure there is enough disc space in the temporary location.

This cache method might only work on posix

Parameters:

blocksize: int: How far to read ahead in numbers of bytes
fetcher: Fetcher: Function of the form f(start, end) which gets bytes from remote as specified
size: int: How big this file is
location: str: Where to create the temporary file. If None, a temporary file is created using tempfile.TemporaryFile().
blocks: set[int]: Set of block numbers that have already been fetched. If None, an empty set is created.
multi_fetcher: MultiFetcher: Function of the form f([(start, end)]) which gets bytes from remote as specified. This function is used to fetch multiple blocks at once. If not specified, the fetcher function is used instead.

class fsspec.caching.ReadAheadCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int)[source]

Cache which reads only when we get beyond a block of data

This is a much simpler version of BytesCache, and does not attempt to fill holes in the cache or keep fragments alive. It is best suited to many small reads in a sequential order (e.g., reading lines from a file).

class fsspec.caching.FirstChunkCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int)[source]

Caches the first block of a file only

This may be useful for file types where the metadata is stored in the header, but is randomly accessed.

class fsspec.caching.BackgroundBlockCache(blocksize: int, fetcher: Callable[[int, int], bytes], size: int, maxblocks: int = 32)[source]

Cache holding memory as a set of blocks with pre-loading of the next block in the background.

Requests are only ever made blocksize at a time, and are stored in an LRU cache. The least recently accessed block is discarded when more than maxblocks are stored. If the next block is not in cache, it is loaded in a separate thread in non-blocking way.

Parameters:

blocksizeint: The number of bytes to store in each block. Requests are only ever made for blocksize, so this should balance the overhead of making a request against the granularity of the blocks.
fetcherCallable
sizeint: The total size of the file being cached.
maxblocksint: The maximum number of blocks to cache for. The maximum memory use for this cache is then blocksize * maxblocks.

cache_info() → CacheInfo[source]

The statistics on the block cache.

Returns:

NamedTuple: Returned directly from the LRU Cache used internally.

Utilities

utils.read_block(f, offset, length[, ...])

Read a block of bytes from a file

fsspec.utils.read_block(f: IO[bytes], offset: int, length: int | None, delimiter: bytes | None = None, split_before: bool = False) → bytes[source]

Read a block of bytes from a file

Parameters:

f: File: Open file
offset: int: Byte offset to start read
length: int: Number of bytes to read, read through end of file if None
delimiter: bytes (optional): Ensure reading starts and stops at delimiter bytestring
split_before: bool (optional): Start/stop read before delimiter bytestring.
If using the ``delimiter=`` keyword argument we ensure that the read
starts and stops at delimiter boundaries that follow the locations
``offset`` and ``offset + length``. If ``offset`` is zero then we
start at zero, regardless of delimiter. The bytestring returned WILL
include the terminating delimiter string.

Examples

>>> from io import BytesIO  
>>> f = BytesIO(b'Alice, 100\nBob, 200\nCharlie, 300')  
>>> read_block(f, 0, 13)  
b'Alice, 100\nBo'

>>> read_block(f, 0, 13, delimiter=b'\n')  
b'Alice, 100\nBob, 200\n'

>>> read_block(f, 10, 10, delimiter=b'\n')  
b'Bob, 200\nCharlie, 300'