Base classes

General and WaveformDataset

class AbstractBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]

Bases: ABC

This class is the base class for all benchmark datasets. It adds functionality to automatically download the dataset to the SeisBench cache. Downloads can either be from the SeisBench repository if the dataset is available there and in the right format, or from another source, which will usually require some form of conversion. Furthermore, it adds annotations for citation and license.

This is an abstract class for any type of benchmark dataset. To implement a specific type, like a dataset for waveforms, this class will be subclassed. See BenchmarkDataset for an example. In the subclass, it’s enough to set the _files parameter telling the downloader which files constitute the dataset. Each file can use the placeholder $CHUNK to be replaced with the chunk name. Each file needs to be present for each chunk.
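As a rough illustration of the $CHUNK placeholder, the sketch below expands a list of file name templates for a set of chunks. The helper expand_chunk_files and the chunk names are hypothetical and not part of SeisBench. The file name patterns are borrowed from DASDataset, and the real downloader may format chunk names differently.

```python
from string import Template


def expand_chunk_files(file_templates, chunks):
    """Return {chunk: [file names]} with $CHUNK replaced by the chunk name.

    Hypothetical helper for illustration only; not part of SeisBench.
    """
    return {
        chunk: [Template(t).substitute(CHUNK=chunk) for t in file_templates]
        for chunk in chunks
    }


# Every file template is expanded once per chunk, mirroring the rule that
# each file needs to be present for each chunk.
files = expand_chunk_files(
    ["metadata_$CHUNK.parquet", "records_$CHUNK.hdf5"],
    chunks=["2019", "2020"],  # assumed chunk names
)
print(files["2019"])
```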

Parameters:
  • chunks (list[str] | None) – List of chunks to download

  • citation (str | None) – Citation for the dataset. Should be set in the inheriting class.

  • license (str | None) – License associated with the dataset. Should be set in the inheriting class.

  • force (bool) – Passed to callback_if_uncached()

  • wait_for_file (bool) – Passed to callback_if_uncached()

  • repository_lookup (bool) – Whether the dataset should be searched for in the remote repository or the download function should be used directly. Should be set in the inheriting class. Only needs to be set to true if the dataset is available in a repository, e.g., the SeisBench repository, for direct download.

  • compile_from_source (bool) – If true, allows compiling the dataset from source. However, if a precompiled version is found either in the local cache or in the remote repository, it will be used instead.

  • download_kwargs (dict[str, Any]) – Dict of arguments passed to the download_dataset function, in case the dataset is loaded from scratch.

  • kwargs – Keyword arguments passed to WaveformDataset

add_chunk_to_download_args(chunk, kwargs)[source]

classmethod available_chunks(force=False, wait_for_file=False)[source]

Returns a list of available chunks. Queries both the local cache and the remote root.

property citation

The suggested citation for this dataset

property license

The license attached to this dataset

property name

Name of the dataset. For BenchmarkDatasets, always matches the class name.

property path

Path to the dataset location in the SeisBench cache

class BenchmarkDataset(*args, **kwargs)[source]

Bases: WaveformBenchmarkDataset

This class is only kept as an alias to ensure downward compatibility. Use WaveformBenchmarkDataset instead.

class Bucketer[source]

Bases: ABC

This is the abstract bucketer class that needs to be provided to the WaveformDataWriter. It offers one public function, get_bucket(), to assign a bucket to each trace.

abstractmethod get_bucket(metadata, waveform)[source]

Calculates the bucket for the trace given its metadata and waveforms

Parameters:
  • metadata – Metadata as given to the WaveformDataWriter.

  • waveform – Waveforms as given to the WaveformDataWriter.

Returns:

A hashable object denoting the bucket this sample belongs to.

class EventParameters[source]

Bases: TypedDict

index: NotRequired[int]
source_depth_km: NotRequired[float]
source_depth_uncertainty_km: NotRequired[float]
source_focal_mechanism_n_azimuth: NotRequired[float]
source_focal_mechanism_n_length: NotRequired[float]
source_focal_mechanism_n_plunge: NotRequired[float]
source_focal_mechanism_p_azimuth: NotRequired[float]
source_focal_mechanism_p_length: NotRequired[float]
source_focal_mechanism_p_plunge: NotRequired[float]
source_focal_mechanism_t_azimuth: NotRequired[float]
source_focal_mechanism_t_length: NotRequired[float]
source_focal_mechanism_t_plunge: NotRequired[float]
source_id: NotRequired[str]
source_latitude_deg: NotRequired[float]
source_latitude_uncertainty_km: NotRequired[float]
source_longitude_deg: NotRequired[float]
source_longitude_uncertainty_km: NotRequired[float]
source_magnitude: NotRequired[float]
source_magnitude_author: NotRequired[str]
source_magnitude_type: NotRequired[str]
source_magnitude_uncertainty: NotRequired[float]
source_origin_time: NotRequired[str]
source_origin_uncertainty_sec: NotRequired[float]
split: NotRequired[Literal['train', 'dev', 'test']]

class GeometricBucketer(minbucket=100, factor=1.2, splits=True, track_channels=True, axis=-1)[source]

Bases: Bucketer

A simple bucketer that uses the length of the traces and, optionally, the assigned split to determine buckets. Only the length along one fixed axis is taken into account. Bucket edges are created with geometric spacing above a minimum bucket: the first bucket is [0, minbucket), the second [minbucket, minbucket * factor), and so on. There is no maximum bucket. This bucketer ensures that the overhead from padding is at most factor - 1, as long as only a few traces with length < minbucket exist. The overhead can be reduced significantly further by passing the input traces ordered by their length.

Parameters:
  • minbucket (int) – Upper limit of the lowest bucket and start of the geometric spacing.

  • factor (float) – Factor for the geometric spacing.

  • splits (bool) – If true, returns separate buckets for each split. Defaults to true. If no split is defined in the metadata, this parameter is ignored.

  • track_channels (bool) – If true, uses the shape of the input waveform along all axes except the one defined in axis to determine the bucket. Only traces agreeing in all dimensions except the given axis will be assigned to the same bucket.

  • axis (int) – Axis to take into account for determining the length of the trace.
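The geometric spacing described above can be sketched in a few lines. This is a minimal illustration, not the SeisBench implementation: the function geometric_bucket and the layout of the returned key are assumptions, and only the bucket edges follow the description.

```python
def geometric_bucket(shape, split=None, minbucket=100, factor=1.2, axis=-1):
    """Return a hashable bucket key for a trace with the given shape.

    Illustrative only; the key layout (split, other dimensions, index)
    is an assumption and not the format used by SeisBench.
    """
    length = shape[axis]
    index = 0          # bucket 0 covers [0, minbucket)
    upper = minbucket  # exclusive upper edge of the current bucket
    while length >= upper:
        upper *= factor  # next edge: minbucket * factor ** index
        index += 1
    # Mimics track_channels: all dimensions except the length axis.
    length_axis = axis % len(shape)
    other_dims = tuple(d for i, d in enumerate(shape) if i != length_axis)
    return (split, other_dims, index)


short = geometric_bucket((3, 50))
first = geometric_bucket((3, 119))
second = geometric_bucket((3, 121))
print(short, first, second)
```

Within any bucket above the first, the longest trace is less than factor times the shortest one, which is where the padding bound of factor - 1 comes from.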

get_bucket(metadata, waveform)[source]

Calculates the bucket for the trace given its metadata and waveforms

Parameters:
  • metadata – Metadata as given to the WaveformDataWriter.

  • waveform – Waveforms as given to the WaveformDataWriter.

Returns:

A hashable object denoting the bucket this sample belongs to.

class LoadingContext(chunks, waveform_paths)[source]

Bases: object

The LoadingContext is a dict of pointers to the hdf5 files for the chunks. It is an easy way to manage opening and closing of file pointers when required.

class MultiWaveformDataset(datasets)[source]

Bases: object

A MultiWaveformDataset is an ordered collection of WaveformDataset. It exposes mostly the same API as a single WaveformDataset.

The constructor checks for compatibility of dimension_order, component_order and sampling_rate. The caching strategy of each contained dataset is left unmodified, but a warning is issued if different caching schemes are found.

Parameters:

datasets – List of WaveformDataset. The constructor will create a copy of each dataset using the WaveformDataset.copy() method.

property cache

Get or set cache strategy

property component_order

Get or set component order

property datasets

Datasets contained in MultiWaveformDataset.

dev()

Convenience method for get_split(“dev”).

Returns:

Development dataset

property dimension_order

Get or set dimension order for output

filter(mask, inplace=True)[source]

Filters dataset, similar to WaveformDataset.filter().

Parameters:
  • mask (masked-array) – Boolean mask to apply to metadata.

  • inplace (bool) – If true, filter inplace.

Returns:

None if inplace=True, otherwise the filtered dataset.

get_group_idx_from_params(params)

Returns the index of the group identified by the params.

Parameters:

params – The parameters identifying the group. For a single grouping parameter, this argument will be a single value. Otherwise this argument needs to be a tuple of keys.

Returns:

Index of the group

Return type:

int

get_group_samples(idx, **kwargs)

Returns the waveforms and metadata for each member of a group. For details see get_sample().

Parameters:
  • idx (int) – Group index

  • kwargs – Kwargs passed to get_sample()

Returns:

List of waveforms, list of metadata dicts

get_group_size(idx)

Returns the number of samples in a group

Parameters:

idx (int) – Group index

Returns:

Size of the group

Return type:

int

get_group_waveforms(idx, **kwargs)

Returns the waveforms for each member of a group. For details see get_sample().

Parameters:
  • idx (int) – Group index

  • kwargs – Kwargs passed to get_sample()

Returns:

List of waveforms

get_idx_from_trace_name(trace_name, chunk=None, dataset=None)

Returns the index of a trace with given trace_name, chunk and dataset. Chunk and dataset parameters are optional, but might be necessary to uniquely identify traces for chunked datasets or for MultiWaveformDataset. The method will issue a warning the first time a non-uniquely identifiable trace is requested. If no matching key is found, a KeyError is raised.

Parameters:
  • trace_name (str) – Trace name as in metadata[“trace_name”]

  • chunk (None) – Trace chunk as in metadata[“trace_chunk”]. If None this key will be ignored.

  • dataset (None) – Trace dataset as in metadata[“trace_dataset”]. Only for MultiWaveformDataset. If None this key will be ignored.

Returns:

Index of the sample

get_sample(idx, *args, **kwargs)[source]

Wraps WaveformDataset.get_sample()

Parameters:
  • idx – Index of the sample

  • args – passed to parent function

  • kwargs – passed to parent function

Returns:

Return value of parent function

get_split(split)

Returns a dataset with the requested split.

Parameters:

split – Split name to return. Usually one of “train”, “dev”, “test”

Returns:

Dataset filtered to the requested split.

get_waveforms(idx=None, mask=None, **kwargs)[source]

Collects waveforms and returns them as an array.

Parameters:
  • idx (int, list[int]) – Idx or list of idx to obtain waveforms for

  • mask (np.ndarray[bool]) – Binary mask on the metadata, indicating which traces should be returned. Can not be used jointly with idx.

  • kwargs – Passed to WaveformDataset.get_waveforms()

Returns:

Waveform array with dimensions ordered according to dimension_order e.g. default ‘NCW’ (number of traces, number of components, record samples). If the number of record samples varies between different entries, all entries are padded to the maximum length.

Return type:

np.ndarray

property grouping

The grouping parameters for the dataset. Grouping allows joint access to the metadata and waveforms of a set of traces that share a common metadata parameter. For example, this can be used to access all waveforms belonging to one event and to build event-based models. Setting the grouping parameter defines the output of groups and the associated methods. grouping can be either a single string or a list of strings. Each string must be a column in the metadata. By default, the grouping is None.
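To make the idea of grouping concrete, the following sketch groups row indices of a small metadata table by one or several columns. It is a plain-Python stand-in, not SeisBench code: build_groups and the example rows are hypothetical. It follows the convention that a single grouping parameter yields single-value keys, while several parameters yield tuples.

```python
from collections import defaultdict


def build_groups(rows, grouping):
    """Map each group key to the list of row indices sharing that key.

    Hypothetical helper; rows is a list of metadata dicts.
    """
    keys = [grouping] if isinstance(grouping, str) else list(grouping)
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        values = tuple(row[k] for k in keys)
        # Single grouping parameter: plain value. Otherwise: tuple of values.
        key = values[0] if len(keys) == 1 else values
        groups[key].append(idx)
    return dict(groups)


rows = [
    {"source_id": "ev1", "station_code": "A"},
    {"source_id": "ev2", "station_code": "A"},
    {"source_id": "ev1", "station_code": "B"},
]
by_event = build_groups(rows, "source_id")
by_event_and_station = build_groups(rows, ["source_id", "station_code"])
print(by_event)
```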

property groups

The list of groups as defined by the grouping or None if grouping is None.

property metadata

Metadata of the dataset as pandas DataFrame.

property metadata_cache

property missing_components

Get or set strategy for missing components

plot_map(res='110m', connections=False, **kwargs)

Plots the dataset onto a map using the Mercator projection. Requires a cartopy installation.

Parameters:
  • res (str, optional) – Resolution for cartopy features, defaults to 110m.

  • connections (bool, optional) – If true, plots lines connecting sources and stations. Defaults to false.

  • kwargs – Plotting kwargs that will be passed to matplotlib plot. Args need to be prefixed with sta_, ev_ and conn_ to address stations, events or connections.

Returns:

A figure handle for the created figure.

preload_waveforms(*args, **kwargs)[source]

Calls WaveformDataset.preload_waveforms() for all member datasets with the provided arguments.

region_filter(domain, lat_col, lon_col, inplace=True)

Filtering of dataset based on predefined region or geometry. See also convenience functions region_filter_[source|receiver].

Parameters:
  • domain (obspy.clients.fdsn.mass_downloader.domain) – The domain filter

  • lat_col (str) – Name of latitude coordinate column

  • lon_col (str) – Name of longitude coordinate column

  • inplace (bool) – Inplace filtering, default to true. See also filter().

Returns:

None if inplace=True, otherwise the filtered dataset.

region_filter_receiver(domain, inplace=True)

Convenience method for region filtering by receiver location.

region_filter_source(domain, inplace=True)

Convenience method for region filtering by source location.

property sampling_rate

Get or set sampling rate for output

test()

Convenience method for get_split(“test”).

Returns:

Test dataset

train()

Convenience method for get_split(“train”).

Returns:

Training dataset

train_dev_test()

Convenience method for returning training, development and test set. Equal to:

>>> self.train(), self.dev(), self.test()

Returns:

Training dataset, development dataset, test dataset

class TraceParameters[source]

Bases: TypedDict

path_back_azimuth_deg: NotRequired[float]
station_code: NotRequired[str]
station_elevation_m: NotRequired[float]
station_latitude_deg: NotRequired[float]
station_location_code: NotRequired[str]
station_longitude_deg: NotRequired[float]
station_network_code: NotRequired[str]
trace_channel: NotRequired[str]
trace_completeness: NotRequired[float]
trace_component_order: NotRequired[str]
trace_has_spikes: NotRequired[bool]
trace_name: NotRequired[str]
trace_sampling_rate_hz: NotRequired[float]
trace_start_time: NotRequired[str]

class WaveformBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]

Bases: AbstractBenchmarkDataset, WaveformDataset, ABC

This class is the base class for benchmark waveform datasets. For the functionality, see the superclasses.

class WaveformDataWriter(metadata_path, waveforms_path)[source]

Bases: object

The WaveformDataWriter for writing datasets in SeisBench format.

To improve reading performance when using the datasets, the writer groups traces into blocks and writes them into joint arrays in the hdf5 file. The exact behaviour is controlled by the bucketer and the bucket_size. For details see their documentation. This packing is necessary because of limitations in hdf5 performance: when many small datasets are read from an hdf5 file, the overhead of the hdf5 structure dominates the read times.

Parameters:
  • metadata_path (str or Path) – Path to write the metadata file to

  • waveforms_path (str or Path) – Path to write the waveforms file to

Returns:

None

add_trace(metadata, waveform)[source]

Adds a trace to the writer. This does not imply that the trace is immediately written to disk, as the writer might wait to fill a bucket. The writer ensures that the order of traces in the metadata is identical to the order of calls to add_trace.

Parameters:
  • metadata (dict[str, any]) – Metadata of the trace

  • waveform (np.ndarray) – Waveform of the trace

Returns:

None

property bucket_size

The maximum size of a bucket. Once adding another trace would overload the bucket, the bucket is written to disk. Defaults to 1024.

Returns:

Bucket size

property bucketer

The currently used bucketer, which sorts traces into buckets. If the bucketer is None, no buckets are used and all traces are written separately. By default uses the GeometricBucketer with default parameters. Please check that this suits your needs. In particular, make sure that the default axis matches your sample axis.

Returns:

Returns the current bucketer.

flush_hdf5()[source]

Writes out all traces currently in the cache to the hdf5 file. Should be called if no more traces for the existing buckets will be added, e.g., after finishing a split. Does not write the metadata to csv.
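The interplay of add_trace, bucket_size and flushing can be pictured with a toy writer that keeps everything in memory. This is only a sketch under stated assumptions: BufferedBucketWriter, its attributes and its flush method are hypothetical stand-ins, and appending to a list stands in for writing a block to the hdf5 file.

```python
class BufferedBucketWriter:
    """Toy stand-in that collects traces per bucket and emits full buckets."""

    def __init__(self, bucket_size=1024, bucketer=None):
        self.bucket_size = bucket_size
        self.bucketer = bucketer  # callable (metadata, waveform) -> hashable
        self.pending = {}         # bucket key -> traces not yet written
        self.blocks = []          # stands in for blocks in the hdf5 file
        self.metadata = []        # kept in the order of add_trace calls

    def add_trace(self, metadata, waveform):
        self.metadata.append(metadata)
        if self.bucketer is None:
            # Without a bucketer every trace is written separately.
            self.blocks.append([waveform])
            return
        bucket = self.pending.setdefault(self.bucketer(metadata, waveform), [])
        if len(bucket) >= self.bucket_size:
            # Adding another trace would overfill the bucket: write it first.
            self.blocks.append(list(bucket))
            bucket.clear()
        bucket.append(waveform)

    def flush(self):
        """Write all partially filled buckets."""
        for bucket in self.pending.values():
            if bucket:
                self.blocks.append(list(bucket))
        self.pending.clear()


writer = BufferedBucketWriter(bucket_size=2, bucketer=lambda m, w: len(w) // 10)
for i in range(5):
    writer.add_trace({"trace_name": f"trace{i}"}, [0.0] * 5)
blocks_before_flush = len(writer.blocks)
writer.flush()
print(blocks_before_flush, [len(block) for block in writer.blocks])
```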

set_total(n)[source]

Set the total number of traces to write. Only used for correct progress calculation

Parameters:

n (int) – Number of traces

Returns:

None

class WaveformDataset(path=None, name=None, dimension_order=None, component_order=None, sampling_rate=None, cache=None, chunks=None, missing_components='pad', metadata_cache=False, resample_zerophase=False, **kwargs)[source]

Bases: object

This class is the base class for waveform datasets.

A key consideration should be how the cache is used. If sufficient memory is available to keep the full data set in memory, activating the cache will yield strong performance gains. For details on the cache strategies, see the documentation of the cache parameter.

Parameters:
  • path (Path | str) – Path to dataset.

  • name (str | None) – Dataset name, default is None.

  • dimension_order (str | None) – Dimension order e.g. ‘CHW’, if not specified will be assumed from config file, defaults to None.

  • component_order (str | None) – Component order e.g. ‘ZNE’, if not specified will be assumed from config file, defaults to None.

  • sampling_rate (float | None) – Common sampling rate of waveforms in dataset, sampling rate can also be specified as a metadata column if not common across dataset.

  • cache (Literal['full', 'trace', None]) –

    Defines the behaviour of the waveform cache. Provides three options:

    • “full”: When a trace is queried, the full block containing the trace is loaded into the cache and stored in memory. This causes the highest memory consumption, but also best performance when using large parts of the dataset.

    • “trace”: When a trace is queried, only the trace itself is loaded and stored in memory. This is particularly useful when only a subset of traces is queried, but these are queried multiple times. In this case the performance of this strategy might outperform “full”.

    • None: When a trace is queried, it is always loaded from disk. This mode leads to low memory consumption but high IO load. It is most likely not usable for model training.

    Note that for datasets without blocks, i.e., each trace in a single array in the hdf5 file, the strategies “full” and “trace” are identical. The default cache strategy is None.

    Use preload_waveforms() to populate the cache. Preloading the waveforms is often much faster than loading them during later application, as preloading can use sequential access. Note that it is recommended to always first filter a dataset and then preload to reduce unnecessary reads and memory consumption.

  • chunks (list[str]) – Specify particular chunks to load. If None, loads all chunks. Defaults to None.

  • missing_components (Literal['pad', 'copy', 'ignore']) –

    Strategy to deal with missing components. Options are:

    • “pad”: Fill with zeros.

    • “copy”: Fill with values from first existing traces.

    • “ignore”: Order all existing components in the requested order, but ignore missing ones. This will raise an error if traces with different numbers of components are requested together.

  • metadata_cache (bool) – If true, metadata is cached in a lookup table. This significantly speeds up access to metadata and thereby access to samples. On the downside, this requires storing two copies of the metadata in memory. The second copy usually consumes more memory due to the less space-efficient format. Runtime differences are particularly big for large datasets.

  • resample_zerophase (bool) – If True, resampling in data loading uses a zerophase filter for antialiasing. Otherwise, uses a causal filter. See the documentation of scipy.signal.decimate for details.

  • kwargs
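The three missing_components strategies listed above can be sketched with plain lists. This is an illustration, not the SeisBench implementation: order_components is a hypothetical helper, and "first existing" is interpreted here as the first component stored for the trace, which is an assumption.

```python
def order_components(available, component_order, missing_components="pad"):
    """Arrange component traces in the requested order.

    available maps a component letter to its samples (list of floats).
    Hypothetical helper for illustration only.
    """
    if not available:
        raise ValueError("at least one component is required")
    first_existing = next(iter(available.values()))
    ordered = []
    for comp in component_order:
        if comp in available:
            ordered.append(list(available[comp]))
        elif missing_components == "pad":
            ordered.append([0.0] * len(first_existing))  # fill with zeros
        elif missing_components == "copy":
            ordered.append(list(first_existing))         # reuse existing data
        elif missing_components == "ignore":
            continue                                     # drop the component
        else:
            raise ValueError(f"unknown strategy: {missing_components!r}")
    return ordered


available = {"Z": [1.0, 2.0], "N": [3.0, 4.0]}  # the E component is missing
padded = order_components(available, "ZNE", "pad")
copied = order_components(available, "ZNE", "copy")
ignored = order_components(available, "ZNE", "ignore")
print(padded, copied, ignored)
```

With "ignore", traces can end up with different numbers of components, which is why requesting such traces together fails.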

static available_chunks(path)[source]

Determines the chunks of the dataset in the given path.

Parameters:

path – Dataset path

Returns:

List of chunks

property cache

Get or set the cache strategy of the dataset. For possible strategies see the constructor. Note that changing cache strategies will not cause a cache eviction.
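The difference between the cache strategies described for the constructor can be illustrated with a toy block store that counts read operations. BlockStore and its attributes are hypothetical and greatly simplified; in particular, a real hdf5 read of a single trace would only touch part of a block.

```python
class BlockStore:
    """Toy block-structured storage with the three cache strategies."""

    def __init__(self, blocks, cache=None):
        self.blocks = blocks  # block name -> list of traces ("on disk")
        self.cache = cache    # "full", "trace" or None
        self._cache = {}      # cached blocks or cached single traces
        self.disk_reads = 0   # number of simulated read operations

    def _read_block(self, block):
        self.disk_reads += 1
        return self.blocks[block]

    def get_trace(self, block, pos):
        if self.cache == "full":
            # Cache the whole block the first time any of its traces is used.
            if block not in self._cache:
                self._cache[block] = self._read_block(block)
            return self._cache[block][pos]
        if self.cache == "trace":
            # Cache only the requested trace.
            if (block, pos) not in self._cache:
                self._cache[(block, pos)] = self._read_block(block)[pos]
            return self._cache[(block, pos)]
        # No cache: every query goes to disk.
        return self._read_block(block)[pos]


def count_reads(cache):
    store = BlockStore({"block0": [[0.1, 0.2], [0.3, 0.4]]}, cache=cache)
    for pos in (0, 1, 0):
        store.get_trace("block0", pos)
    return store.disk_reads


reads = {cache: count_reads(cache) for cache in ("full", "trace", None)}
print(reads)
```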

property chunks

Returns a list of chunks. If dataset is not chunked, returns an empty list.

property component_order

Get or set order of components in the output.

copy()[source]

Create a copy of the data set. All attributes are copied by value, except waveform cache entries. The cache entries are copied by reference, as the waveforms will take up most of the memory. This should be fine for most use cases, because the cache entries should anyhow never be modified. Note that the cache dict itself is not shared, such that cache evictions and inserts in one of the data sets do not affect the other one.

Returns:

Copy of the dataset

property data_format

Data format dictionary, describing the data format of the stored dataset. Note that this does not necessarily equal the output data format of get_waveforms(). To query the output format, use the relevant class properties.

dev()[source]

Convenience method for get_split(“dev”).

Returns:

Development dataset

property dimension_order

Get or set the order of the dimension in the output.

filter(mask, inplace=True)[source]

Filters dataset, e.g. by distance/magnitude/…, using a binary mask. Default behaviour is to perform inplace filtering, directly changing the metadata and waveforms to only keep the results of the masking query. Setting inplace equal to false will return a filtered copy of the data set. For details on the copy operation see copy().

Parameters:
  • mask (boolean array) – Boolean mask to apply to metadata.

  • inplace (bool) – If true, filter inplace.

Example usage:

dataset.filter(dataset["p_status"] == "manual")

Returns:

None if inplace=True, otherwise the filtered dataset.

get_event_sample_indices(event_id)[source]

Returns the indices of all samples associated with a given event. Requires that source_id is part of the metadata.

Parameters:

event_id – Event identifier

Returns:

List of indices associated with the event

Return type:

list[int]

get_event_source_id(idx)[source]

Gets the source_id of an event. Note that the idx refers to the integer index of the event in order of appearance in the metadata after removing duplicates.

Parameters:

idx (int) – Event index

Return type:

str

Returns:

Source ID of the event
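The event index used here, i.e. the order of first appearance of each source_id after removing duplicates, can be reproduced with plain Python. The source_id values and the helper event_sample_indices below are made up for illustration and only mirror the described behaviour.

```python
# source_id column of a small, made-up metadata table
source_ids = ["ev_b", "ev_a", "ev_b", "ev_c", "ev_a"]

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# which gives the events in order of appearance without duplicates.
unique_events = list(dict.fromkeys(source_ids))


def event_sample_indices(source_ids, event_id):
    """Indices of all samples that belong to the given event."""
    return [i for i, sid in enumerate(source_ids) if sid == event_id]


n_events = len(unique_events)
second_event = unique_events[1]
print(n_events, second_event, event_sample_indices(source_ids, second_event))
```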

get_group_idx_from_params(params)[source]

Returns the index of the group identified by the params.

Parameters:

params – The parameters identifying the group. For a single grouping parameter, this argument will be a single value. Otherwise this argument needs to be a tuple of keys.

Returns:

Index of the group

Return type:

int

get_group_samples(idx, **kwargs)[source]

Returns the waveforms and metadata for each member of a group. For details see get_sample().

Parameters:
  • idx (int) – Group index

  • kwargs – Kwargs passed to get_sample()

Returns:

List of waveforms, list of metadata dicts

get_group_size(idx)[source]

Returns the number of samples in a group

Parameters:

idx (int) – Group index

Returns:

Size of the group

Return type:

int

get_group_waveforms(idx, **kwargs)[source]

Returns the waveforms for each member of a group. For details see get_sample().

Parameters:
  • idx (int) – Group index

  • kwargs – Kwargs passed to get_sample()

Returns:

List of waveforms

get_idx_from_trace_name(trace_name, chunk=None, dataset=None)[source]

Returns the index of a trace with given trace_name, chunk and dataset. Chunk and dataset parameters are optional, but might be necessary to uniquely identify traces for chunked datasets or for MultiWaveformDataset. The method will issue a warning the first time a non-uniquely identifiable trace is requested. If no matching key is found, a KeyError is raised.

Parameters:
  • trace_name (str) – Trace name as in metadata[“trace_name”]

  • chunk (None) – Trace chunk as in metadata[“trace_chunk”]. If None this key will be ignored.

  • dataset (None) – Trace dataset as in metadata[“trace_dataset”]. Only for MultiWaveformDataset. If None this key will be ignored.

Returns:

Index of the sample

get_sample(idx, sampling_rate=None)[source]

Returns both the waveforms and the metadata of a trace. Adjusts all metadata entries with sampling-rate-dependent values to the correct sampling rate, e.g., p_pick_samples will still point to the right sample after this operation, even if the trace was resampled.

Hint

When decimating data, a low-pass filter needs to be applied to avoid aliasing. To control whether this filter is causal or zerophase, the class attribute zerophase_resample can be used.

Parameters:
  • idx – Idx of sample to return

  • sampling_rate – Target sampling rate, overwrites sampling rate for dataset.

Returns:

Tuple with the waveforms and the metadata of the sample.
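The metadata adjustment amounts to scaling sample indices by the ratio of target to source sampling rate. The sketch below shows this arithmetic under stated assumptions: rescale_sample_metadata is a hypothetical helper, and the keys to rescale are passed in explicitly, because the rule SeisBench uses to identify them is not described here.

```python
def rescale_sample_metadata(metadata, source_rate, target_rate, sample_keys):
    """Return a copy of metadata with sample indices scaled to target_rate.

    Hypothetical helper; sample_keys lists the entries measured in samples.
    """
    factor = target_rate / source_rate
    adjusted = dict(metadata)  # leave the input untouched
    for key in sample_keys:
        if adjusted.get(key) is not None:
            adjusted[key] = adjusted[key] * factor
    return adjusted


metadata = {"p_pick_samples": 1000, "station_code": "ABC"}
adjusted = rescale_sample_metadata(
    metadata, source_rate=100.0, target_rate=50.0, sample_keys=["p_pick_samples"]
)
print(adjusted)
```

A pick at sample 1000 of a 100 Hz trace lies 10 s after the trace start, which corresponds to sample 500 at 50 Hz.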

get_split(split)[source]

Returns a dataset with the requested split.

Parameters:

split – Split name to return. Usually one of “train”, “dev”, “test”

Returns:

Dataset filtered to the requested split.

get_waveforms(idx=None, mask=None, sampling_rate=None)[source]

Collects waveforms and returns them as an array.

Parameters:
  • idx (int, list[int]) – Idx or list of idx to obtain waveforms for

  • mask (np.ndarray[bool]) – Binary mask on the metadata, indicating which traces should be returned. Can not be used jointly with idx.

  • sampling_rate (float) – Target sampling rate, overwrites sampling rate for dataset

Returns:

Waveform array with dimensions ordered according to dimension_order e.g. default ‘NCW’ (number of traces, number of components, record samples). If the number of record samples varies between different entries, all entries are padded to the maximum length.

Return type:

np.ndarray
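The padding to a common length can be pictured with nested lists in 'NCW' order. This is a sketch only: pad_to_max_length is a hypothetical helper, and padding with zeros at the end of each trace is an assumption, since the fill value and position are not specified above.

```python
def pad_to_max_length(traces, fill=0.0):
    """Pad every component of every trace to the longest record length.

    traces is a nested list in 'NCW' order: traces, components, samples.
    The fill value and trailing position are assumptions for illustration.
    """
    max_len = max(len(component) for trace in traces for component in trace)
    return [
        [list(component) + [fill] * (max_len - len(component)) for component in trace]
        for trace in traces
    ]


traces = [
    [[1.0, 2.0, 3.0]],  # one component with three samples
    [[4.0]],            # one component with a single sample
]
padded = pad_to_max_length(traces)
print(padded)
```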

property grouping

The grouping parameters for the dataset. These parameters are used to determine the groups and for the associated methods. grouping can be either a single string or a list of strings. Each string must be a column in the metadata. By default, the grouping is None.

property groups

The list of groups as defined by the grouping or None if grouping is None.

property metadata

Metadata of the dataset as pandas DataFrame.

property metadata_cache

property missing_components

Get or set strategy to handle missing components. For options, see the constructor.

n_events()[source]

Returns the number of unique events in the dataset. Requires that source_id is part of the metadata.

Returns:

Number of unique events

Return type:

int

property name

Name of the dataset (immutable)

property path

Path of the dataset (immutable)

plot_map(res='110m', connections=False, **kwargs)[source]

Plots the dataset onto a map using the Mercator projection. Requires a cartopy installation.

Parameters:
  • res (str, optional) – Resolution for cartopy features, defaults to 110m.

  • connections (bool, optional) – If true, plots lines connecting sources and stations. Defaults to false.

  • kwargs – Plotting kwargs that will be passed to matplotlib plot. Args need to be prefixed with sta_, ev_ and conn_ to address stations, events or connections.

Returns:

A figure handle for the created figure.

preload_waveforms(pbar=False)[source]

Loads waveform data from hdf5 file into cache. Fails if caching strategy is None.

Parameters:

pbar – If true, shows progress bar. Defaults to False.

region_filter(domain, lat_col, lon_col, inplace=True)[source]

Filtering of dataset based on predefined region or geometry. See also convenience functions region_filter_[source|receiver].

Parameters:
  • domain (obspy.clients.fdsn.mass_downloader.domain) – The domain filter

  • lat_col (str) – Name of latitude coordinate column

  • lon_col (str) – Name of longitude coordinate column

  • inplace (bool) – Inplace filtering, default to true. See also filter().

Returns:

None if inplace=True, otherwise the filtered dataset.

region_filter_receiver(domain, inplace=True)[source]

Convenience method for region filtering by receiver location.

region_filter_source(domain, inplace=True)[source]

Convenience method for region filtering by source location.

test()[source]

Convenience method for get_split(“test”).

Returns:

Test dataset

train()[source]

Convenience method for get_split(“train”).

Returns:

Training dataset

train_dev_test()[source]

Convenience method for returning training, development and test set. Equal to:

>>> self.train(), self.dev(), self.test()

Returns:

Training dataset, development dataset, test dataset

DASDataset

class DASBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]

Bases: AbstractBenchmarkDataset, DASDataset, ABC

This class is the base class for benchmark DAS datasets. For the functionality, see the superclasses.

class DASDataWriter(path, chunk='', metadata_path=None, data_path=None, data_type=<class 'numpy.float32'>, strict=True)[source]

Bases: object

This class allows writing DAS datasets in SeisBench format. It only writes a single chunk. To write multiple chunks, use multiple data writers with different chunk arguments but identical path.

Parameters:
  • path (Path | str) – Path to write the chunk to

  • chunk (str) – Chunk identifier

  • metadata_path (Path | str | None) – Overwrite for the metadata path. If provided, writes the metadata here instead of the default location. The chunk key will be ignored in this case. Unless integrated into complex workflows, this parameter should not be used.

  • data_path (Path | str | None) – Same as metadata_path, but for the data file.

  • data_type (type[floating] | type[integer]) – Data type of the data. Defaults to float32.

  • strict (bool) – If true, raise an error if the metadata does not contain the key fields. Otherwise, only raise a warning.

add_record(metadata, data, annotations)[source]

Add a record to the dataset. While the data and annotations will immediately be written to disk, the metadata will be stored in memory and written to disk when the dataset is closed.

Parameters:
  • metadata (dict[str, Any]) – Metadata of the record. There are no mandatory fields, but warnings will be issued if typical key fields are missing.

  • data (ndarray) – Data of the record. The data needs to be a 2D array (time, channel).

  • annotations (dict[str, ndarray]) – Annotations of the record. Each annotation consists of a 1D array with the same length as the number of channels. The entries are in samples along the time axis. For example, an annotation called "P" indicates the indices of the P wave arrival at each channel. NaN values are allowed. Annotations can differ between the records.

Return type:

None
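The expected layout of data and annotations can be checked with a small helper. This is a sketch using nested lists instead of arrays: check_record is hypothetical, and it only verifies the documented shape relation, namely one annotation entry per channel with NaN marking channels without a value.

```python
import math


def check_record(data, annotations):
    """Validate shapes and count annotated channels per annotation.

    data is a list of time samples, each a list with one value per channel,
    i.e. it has the layout (time, channel). Hypothetical helper.
    """
    n_channels = len(data[0]) if data else 0
    if any(len(row) != n_channels for row in data):
        raise ValueError("data must be a rectangular (time, channel) array")
    counts = {}
    for name, values in annotations.items():
        if len(values) != n_channels:
            raise ValueError(f"annotation {name!r} needs one entry per channel")
        # NaN entries mark channels without this annotation.
        counts[name] = sum(1 for v in values if not math.isnan(v))
    return counts


data = [[0.0, 0.1, 0.2] for _ in range(4)]      # 4 time samples, 3 channels
annotations = {"P": [1.0, float("nan"), 2.5]}   # P arrival in samples per channel
counts = check_record(data, annotations)
print(counts)
```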

property data_path: Path
property metadata_path: Path

class DASDataset(path=None, chunks=None)[source]

Bases: object

DATA_FILE = 'records_$CHUNK.hdf5'
METADATA_FILE = 'metadata_$CHUNK.parquet'

static available_chunks(path)[source]

Determines the chunks of the dataset in the given path. If available, parses the chunks file. Otherwise, scans the dataset for metadata and records files.

Parameters:

path (Path) – Dataset path

Return type:

list[str]

Returns:

List of chunks

property chunks: list[str]

copy()[source]

Create a copy of the data set. All attributes are copied by value.

Return type:

DASDataset

dev(inplace=False)[source]

Convenience method for get_split(“dev”).

Return type:

DASDataset | None

Returns:

Development dataset

filter(mask, inplace=True)[source]

Filters dataset, e.g. by distance/magnitude/…, using a binary mask. Default behaviour is to perform inplace filtering. Setting inplace equal to false will return a filtered copy of the data set.

Parameters:
  • mask (ndarray) – Boolean mask to apply to metadata.

  • inplace (bool) – If true, filter inplace.

Return type:

DASDataset | None

Example usage:

dataset.filter(dataset.metadata["record_sampling_rate_hz"] > 100)

get_sample(idx, record_virtual=True, annotations_virtual=False)[source]

Load the sample with the given index. Use the record_virtual and annotations_virtual arguments to control whether the record and annotations are loaded into memory or only pointers are returned. By default, the record will not be loaded into memory, while the annotations will be loaded into memory.

Parameters:
  • idx (int) – Index of the sample to load

  • record_virtual (bool) – If true, the record is returned as a virtual array. Otherwise, the record is loaded into memory.

  • annotations_virtual (bool) – If true, the annotations are returned as virtual arrays. Otherwise, the annotations are loaded into memory.

Return type:

tuple[dict[str, Any], ndarray | Dataset, dict[str, ndarray | Dataset]]

get_split(split, inplace=False)[source]

Returns a dataset with the requested split.

Parameters:

split (str) – Split name to return. Usually one of “train”, “dev”, “test”

Return type:

DASDataset | None

Returns:

Dataset filtered to the requested split.

property metadata: DataFrame

property path: Path

Path of the dataset

test(inplace=False)[source]

Convenience method for get_split(“test”).

Return type:

DASDataset | None

Returns:

Test dataset

train(inplace=False)[source]

Convenience method for get_split(“train”).

Return type:

DASDataset | None

Returns:

Training dataset

train_dev_test()[source]

Convenience method for returning training, development and test set. Equal to:

>>> self.train(), self.dev(), self.test()

Return type:

tuple[DASDataset, DASDataset, DASDataset]

Returns:

Training dataset, development dataset, test dataset

class MultiDASDataset(datasets)[source]

Bases: object

This class is a wrapper for multiple DAS datasets. It allows combining multiple datasets into a single dataset. It has mostly the same API as DASDataset.

property datasets

dev(inplace=False)

Convenience method for get_split(“dev”).

Return type:

DASDataset | None

Returns:

Development dataset

filter(mask, inplace=True)[source]

Filters dataset, similar to WaveformDataset.filter().

Parameters:
  • mask (ndarray) – Boolean mask to apply to metadata.

  • inplace (bool) – If true, filter inplace.

Return type:

MultiDASDataset | None

get_sample(idx, *args, **kwargs)[source]

get_split(split, inplace=False)

Returns a dataset with the requested split.

Parameters:

split (str) – Split name to return. Usually one of “train”, “dev”, “test”

Return type:

DASDataset | None

Returns:

Dataset filtered to the requested split.

property metadata

test(inplace=False)

Convenience method for get_split(“test”).

Return type:

DASDataset | None

Returns:

Test dataset

train(inplace=False)

Convenience method for get_split(“train”).

Return type:

DASDataset | None

Returns:

Training dataset

train_dev_test()

Convenience method for returning training, development and test set. Equal to:

>>> self.train(), self.dev(), self.test()

Return type:

tuple[DASDataset, DASDataset, DASDataset]

Returns:

Training dataset, development dataset, test dataset

class RandomDASDataset(**kwargs)[source]

Bases: DASBenchmarkDataset

This is a purely random dataset for testing purposes. It does not contain any actual data and should only be used for unit tests.

classmethod available_chunks(force=False, wait_for_file=False)[source]

Returns a list of available chunks. Queries both the local cache and the remote root.