Base classes
General and WaveformDataset
- class AbstractBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]
Bases: ABC
This class is the base class for all benchmark datasets. It adds functionality to automatically download the dataset to the SeisBench cache. Downloads can come either from the SeisBench repository, if the dataset is available there in the right format, or from another source, which will usually require some form of conversion. Furthermore, it adds annotations for citation and license.
This is an abstract class for any type of benchmark dataset. To implement a specific type, such as a dataset for waveforms, subclass this class. See BenchmarkDataset for an example. In the subclass, it is enough to set the _files parameter, which tells the downloader which files constitute the dataset. Each file can use the placeholder $CHUNK, which is replaced with the chunk name. Each file needs to be present for each chunk.
- Parameters:
chunks (list[str] | None) – List of chunks to download.
citation (str | None) – Citation for the dataset. Should be set in the inheriting class.
license (str | None) – License associated with the dataset. Should be set in the inheriting class.
force (bool) – Passed to callback_if_uncached().
wait_for_file (bool) – Passed to callback_if_uncached().
repository_lookup (bool) – Whether the dataset should be searched for in the remote repository or the download function should be used directly. Should be set in the inheriting class. Only needs to be set to true if the dataset is available in a repository, e.g., the SeisBench repository, for direct download.
compile_from_source (bool) – If true, allows compiling the dataset from source. However, if a precompiled version is found either in the local cache or in the remote repository, it will be used instead.
download_kwargs (dict[str, Any]) – Dict of arguments passed to the download_dataset function, in case the dataset is loaded from scratch.
kwargs – Keyword arguments passed to WaveformDataset
- classmethod available_chunks(force=False, wait_for_file=False)[source]
Returns a list of available chunks. Queries both the local cache and the remote root.
- property citation
The suggested citation for this dataset
- property license
The license attached to this dataset
- property name
Name of the dataset. For BenchmarkDatasets, always matches the class name.
- property path
Path to the dataset location in the SeisBench cache
- class BenchmarkDataset(*args, **kwargs)[source]
Bases: WaveformBenchmarkDataset
This class is only kept as an alias to ensure backward compatibility. Use WaveformBenchmarkDataset instead.
- class Bucketer[source]
Bases: ABC
This is the abstract bucketer class that needs to be provided to the WaveformDataWriter. It offers one public function, get_bucket(), to assign a bucket to each trace.
- abstractmethod get_bucket(metadata, waveform)[source]
Calculates the bucket for the trace given its metadata and waveforms
- Parameters:
metadata – Metadata as given to the WaveformDataWriter.
waveform – Waveforms as given to the WaveformDataWriter.
- Returns:
A hashable object denoting the bucket this sample belongs to.
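As a rough illustration of the interface described above, the following sketch implements a custom bucketer that groups traces by split and number of channels. To keep the snippet runnable without SeisBench, it defines a local stand-in for the abstract class. The name ChannelCountBucketer and the assumption that a waveform is a nested sequence shaped (channels, samples) are hypothetical and not part of the library.

```python
from abc import ABC, abstractmethod


class Bucketer(ABC):
    """Local stand-in mirroring the documented interface (not the SeisBench class)."""

    @abstractmethod
    def get_bucket(self, metadata, waveform):
        """Return a hashable object denoting the bucket of the trace."""


class ChannelCountBucketer(Bucketer):
    """Hypothetical bucketer: one bucket per (split, number of channels)."""

    def get_bucket(self, metadata, waveform):
        # Assumption: the waveform is a nested sequence shaped (channels, samples).
        n_channels = len(waveform)
        # metadata.get returns None when the metadata carries no split.
        return (metadata.get("split"), n_channels)


bucketer = ChannelCountBucketer()
three_component = [[0.0] * 10, [0.0] * 10, [0.0] * 10]
single_component = [[0.0] * 10]
bucket_a = bucketer.get_bucket({"split": "train"}, three_component)
bucket_b = bucketer.get_bucket({}, single_component)
print(bucket_a, bucket_b)
```

Returning a tuple keeps the bucket hashable, which is the only requirement the interface states for the return value.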
- class EventParameters[source]
Bases: TypedDict
- index: NotRequired[int]
- source_depth_km: NotRequired[float]
- source_depth_uncertainty_km: NotRequired[float]
- source_focal_mechanism_n_azimuth: NotRequired[float]
- source_focal_mechanism_n_length: NotRequired[float]
- source_focal_mechanism_n_plunge: NotRequired[float]
- source_focal_mechanism_p_azimuth: NotRequired[float]
- source_focal_mechanism_p_length: NotRequired[float]
- source_focal_mechanism_p_plunge: NotRequired[float]
- source_focal_mechanism_t_azimuth: NotRequired[float]
- source_focal_mechanism_t_length: NotRequired[float]
- source_focal_mechanism_t_plunge: NotRequired[float]
- source_id: NotRequired[str]
- source_latitude_deg: NotRequired[float]
- source_latitude_uncertainty_km: NotRequired[float]
- source_longitude_deg: NotRequired[float]
- source_longitude_uncertainty_km: NotRequired[float]
- source_magnitude: NotRequired[float]
- source_magnitude_author: NotRequired[str]
- source_magnitude_type: NotRequired[str]
- source_magnitude_uncertainty: NotRequired[float]
- source_origin_time: NotRequired[str]
- source_origin_uncertainty_sec: NotRequired[float]
- split: NotRequired[Literal['train', 'dev', 'test']]
- class GeometricBucketer(minbucket=100, factor=1.2, splits=True, track_channels=True, axis=-1)[source]
Bases: Bucketer
A simple bucketer that uses the length of the traces and optionally the assigned split to determine buckets. Only takes into account the length along one fixed axis. Bucket edges are created with a geometric spacing above a minimum bucket. The first bucket is [0, minbucket), the second one [minbucket, minbucket * factor) and so on. There is no maximum bucket. This bucketer ensures that the overhead from padding is at most factor - 1, as long as only a few traces with length < minbucket exist. Note that the overhead can be reduced significantly further by passing the input traces ordered by their length.
- Parameters:
minbucket (int) – Upper limit of the lowest bucket and start of the geometric spacing.
factor (float) – Factor for the geometric spacing.
splits (bool) – If true, returns separate buckets for each split. Defaults to true. If no split is defined in the metadata, this parameter is ignored.
track_channels (bool) – If true, uses the shape of the input waveform along all axes except the one defined in axis to determine the bucket. Only traces agreeing in all dimensions except the given axis will be assigned to the same bucket.
axis (int) – Axis to take into account for determining the length of the trace.
- get_bucket(metadata, waveform)[source]
Calculates the bucket for the trace given its metadata and waveforms
- Parameters:
metadata – Metadata as given to the WaveformDataWriter.
waveform – Waveforms as given to the WaveformDataWriter.
- Returns:
A hashable object denoting the bucket this sample belongs to.
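The geometric spacing of the bucket edges can be sketched in a few lines. This is a minimal illustration of the edges as described in the class documentation, not the SeisBench implementation; the function name geometric_bucket_index is hypothetical, and splits and channel tracking are left out.

```python
def geometric_bucket_index(length, minbucket=100, factor=1.2):
    """Return the index of the geometric bucket containing `length`.

    Bucket 0 is [0, minbucket); bucket k >= 1 is
    [minbucket * factor**(k - 1), minbucket * factor**k).
    """
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    if length < minbucket:
        return 0
    index = 1
    upper_edge = minbucket * factor  # upper edge of bucket 1
    while length >= upper_edge:
        upper_edge *= factor
        index += 1
    return index


indices = {n: geometric_bucket_index(n) for n in (50, 100, 119, 130, 150)}
print(indices)
```

Within any bucket above the first, the longest trace is less than factor times as long as the shortest one, which is where the bound of factor - 1 on the padding overhead comes from.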
- class LoadingContext(chunks, waveform_paths)[source]
Bases: object
The LoadingContext is a dict of pointers to the hdf5 files for the chunks. It is an easy way to manage opening and closing of file pointers when required.
- class MultiWaveformDataset(datasets)[source]
Bases: object
A MultiWaveformDataset is an ordered collection of WaveformDataset objects. It exposes mostly the same API as a single WaveformDataset.
The constructor checks for compatibility of dimension_order, component_order and sampling_rate. The caching strategy of each contained dataset is left unmodified, but a warning is issued if different caching schemes are found.
- Parameters:
datasets – List of WaveformDataset objects. The constructor will create a copy of each dataset using the WaveformDataset.copy() method.
- property cache
Get or set cache strategy
- property component_order
Get or set component order
- property datasets
Datasets contained in MultiWaveformDataset.
- dev()
Convenience method for get_split(“dev”).
- Returns:
Development dataset
- property dimension_order
Get or set dimension order for output
- filter(mask, inplace=True)[source]
Filters dataset, similar to WaveformDataset.filter().
- Parameters:
mask (masked-array) – Boolean mask to apply to metadata.
inplace (bool) – If true, filter inplace.
- Returns:
None if inplace=True, otherwise the filtered dataset.
- get_group_idx_from_params(params)
Returns the index of the group identified by the params.
- Parameters:
params – The parameters identifying the group. For a single grouping parameter, this argument will be a single value. Otherwise this argument needs to be a tuple of keys.
- Returns:
Index of the group
- Return type:
int
- get_group_samples(idx, **kwargs)
Returns the waveforms and metadata for each member of a group. For details see get_sample().
- Parameters:
idx (int) – Group index
kwargs – Kwargs passed to get_sample()
- Returns:
List of waveforms, list of metadata dicts
- get_group_size(idx)
Returns the number of samples in a group
- Parameters:
idx (int) – Group index
- Returns:
Size of the group
- Return type:
int
- get_group_waveforms(idx, **kwargs)
Returns the waveforms for each member of a group. For details see get_sample().
- Parameters:
idx (int) – Group index
kwargs – Kwargs passed to get_sample()
- Returns:
List of waveforms
- get_idx_from_trace_name(trace_name, chunk=None, dataset=None)
Returns the index of a trace with given trace_name, chunk and dataset. Chunk and dataset parameters are optional, but might be necessary to uniquely identify traces for chunked datasets or for MultiWaveformDataset. The method will issue a warning the first time a non-uniquely identifiable trace is requested. If no matching key is found, a KeyError is raised.
- Parameters:
trace_name (str) – Trace name as in metadata[“trace_name”]
chunk (None) – Trace chunk as in metadata[“trace_chunk”]. If None this key will be ignored.
dataset (None) – Trace dataset as in metadata[“trace_dataset”]. Only for MultiWaveformDataset. If None this key will be ignored.
- Returns:
Index of the sample
- get_sample(idx, *args, **kwargs)[source]
Wraps WaveformDataset.get_sample()
- Parameters:
idx – Index of the sample
args – passed to parent function
kwargs – passed to parent function
- Returns:
Return value of parent function
- get_split(split)
Returns a dataset with the requested split.
- Parameters:
split – Split name to return. Usually one of “train”, “dev”, “test”
- Returns:
Dataset filtered to the requested split.
- get_waveforms(idx=None, mask=None, **kwargs)[source]
Collects waveforms and returns them as an array.
- Parameters:
idx (int, list[int]) – Idx or list of idx to obtain waveforms for
mask (np.ndarray[bool]) – Binary mask on the metadata, indicating which traces should be returned. Can not be used jointly with idx.
kwargs – Passed to WaveformDataset.get_waveforms()
- Returns:
Waveform array with dimensions ordered according to dimension_order e.g. default ‘NCW’ (number of traces, number of components, record samples). If the number of record samples varies between different entries, all entries are padded to the maximum length.
- Return type:
np.ndarray
- property grouping
The grouping parameters for the dataset. Grouping allows joint access to the metadata and waveforms of a set of traces that share a common metadata parameter. This can, for example, be used to access all waveforms belonging to one event and to build event-based models. Setting the grouping parameter defines the output of groups and the associated methods. grouping can be either a single string or a list of strings. Each string must be a column in the metadata. By default, the grouping is None.
- property metadata
Metadata of the dataset as pandas DataFrame.
- property metadata_cache
- property missing_components
Get or set strategy for missing components
- plot_map(res='110m', connections=False, **kwargs)
Plots the dataset onto a map using the Mercator projection. Requires a cartopy installation.
- Parameters:
res (str, optional) – Resolution for cartopy features, defaults to 110m.
connections (bool, optional) – If true, plots lines connecting sources and stations. Defaults to false.
kwargs – Plotting kwargs that will be passed to matplotlib plot. Args need to be prefixed with sta_, ev_ and conn_ to address stations, events or connections.
- Returns:
A figure handle for the created figure.
- preload_waveforms(*args, **kwargs)[source]
Calls WaveformDataset.preload_waveforms() for all member datasets with the provided arguments.
- region_filter(domain, lat_col, lon_col, inplace=True)
Filtering of dataset based on predefined region or geometry. See also convenience functions region_filter_[source|receiver].
- Parameters:
domain (obspy.clients.fdsn.mass_downloader.domain) – The domain filter
lat_col (str) – Name of latitude coordinate column
lon_col (str) – Name of longitude coordinate column
inplace (bool) – Inplace filtering, defaults to true. See also filter().
- Returns:
None if inplace=True, otherwise the filtered dataset.
- region_filter_receiver(domain, inplace=True)
Convenience method for region filtering by receiver location.
- region_filter_source(domain, inplace=True)
Convenience method for region filtering by source location.
- property sampling_rate
Get or set sampling rate for output
- test()
Convenience method for get_split(“test”).
- Returns:
Test dataset
- train()
Convenience method for get_split(“train”).
- Returns:
Training dataset
- train_dev_test()
Convenience method for returning training, development and test set. Equal to:
>>> self.train(), self.dev(), self.test()
- Returns:
Training dataset, development dataset, test dataset
- class TraceParameters[source]
Bases: TypedDict
- path_back_azimuth_deg: NotRequired[float]
- station_code: NotRequired[str]
- station_elevation_m: NotRequired[float]
- station_latitude_deg: NotRequired[float]
- station_location_code: NotRequired[str]
- station_longitude_deg: NotRequired[float]
- station_network_code: NotRequired[str]
- trace_channel: NotRequired[str]
- trace_completeness: NotRequired[float]
- trace_component_order: NotRequired[str]
- trace_has_spikes: NotRequired[bool]
- trace_name: NotRequired[str]
- trace_sampling_rate_hz: NotRequired[float]
- trace_start_time: NotRequired[str]
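The keys of EventParameters and TraceParameters can be combined into one metadata row per trace. The sketch below assumes, as the shared naming scheme suggests, that both dictionaries are simply merged; all values are made up for illustration, and the key sets are only a subset of the documented fields.

```python
# Subset of the documented EventParameters and TraceParameters keys.
EVENT_KEYS = {
    "source_id", "source_origin_time", "source_latitude_deg",
    "source_longitude_deg", "source_depth_km", "source_magnitude", "split",
}
TRACE_KEYS = {
    "station_network_code", "station_code", "trace_channel",
    "trace_sampling_rate_hz", "trace_start_time", "trace_name",
}

# Made-up example values, for illustration only.
event_params = {
    "source_id": "ev0001",
    "source_origin_time": "2020-01-01T00:00:00.000000Z",
    "source_latitude_deg": 10.0,
    "source_longitude_deg": 20.0,
    "source_depth_km": 5.0,
    "source_magnitude": 3.2,
    "split": "train",
}
trace_params = {
    "station_network_code": "XX",
    "station_code": "STA1",
    "trace_channel": "HH",
    "trace_sampling_rate_hz": 100.0,
    "trace_start_time": "2020-01-01T00:00:05.000000Z",
}

# All fields are NotRequired, so only the known values are set.
metadata = {**event_params, **trace_params}
unknown = set(metadata) - EVENT_KEYS - TRACE_KEYS
print(sorted(unknown))
```

Checking for unknown keys is a cheap way to catch misspelled field names before writing a dataset.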
- class WaveformBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]
Bases: AbstractBenchmarkDataset, WaveformDataset, ABC
This class is the base class for benchmark waveform datasets. For the functionality, see the superclasses.
- class WaveformDataWriter(metadata_path, waveforms_path)[source]
Bases: object
The WaveformDataWriter writes datasets in SeisBench format.
To improve reading performance when using the datasets, the writer groups traces into blocks and writes them into joint arrays in the hdf5 file. The exact behaviour is controlled by the bucketer and the bucket_size. For details see their documentation. This packing is necessary due to limitations in hdf5 performance: when reading many small datasets from an hdf5 file, the overhead of the hdf5 structure dominates the read times.
- Parameters:
metadata_path (str or Path) – Path to write the metadata file to
waveforms_path (str or Path) – Path to write the waveforms file to
- Returns:
None
- add_trace(metadata, waveform)[source]
Adds a trace to the writer. This does not imply that the trace is immediately written to disk, as the writer might wait to fill a bucket. The writer ensures that the order of traces in the metadata is identical to the order of calls to add_trace.
- Parameters:
metadata (dict[str, any]) – Metadata of the trace
waveform (np.ndarray) – Waveform of the trace
- Returns:
None
- property bucket_size
The maximum size of a bucket. Once adding another trace would overload the bucket, the bucket is written to disk. Defaults to 1024.
- Returns:
Bucket size
- property bucketer
The currently used bucketer, which sorts traces into buckets. If the bucketer is None, no buckets are used and all traces are written separately. By default uses the GeometricBucketer with default parameters. Please check that this suits your needs. In particular, make sure that the default axis matches your sample axis.
- Returns:
Returns the current bucketer.
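The interplay of bucketer and bucket_size can be pictured with a toy model. The sketch below is not the SeisBench writer: BufferedBucketWriter, its close method and the flushed list (which stands in for the hdf5 file) are hypothetical names, and metadata handling is omitted.

```python
class BufferedBucketWriter:
    """Toy model of bucketed writing; `flushed` stands in for the hdf5 file."""

    def __init__(self, bucket_fn, bucket_size=1024):
        self.bucket_fn = bucket_fn
        self.bucket_size = bucket_size
        self.pending = {}  # bucket key -> list of buffered traces
        self.flushed = []  # (bucket key, traces) blocks "written to disk"

    def add_trace(self, metadata, waveform):
        key = self.bucket_fn(metadata, waveform)
        traces = self.pending.setdefault(key, [])
        # Write the block out once one more trace would exceed the bucket size.
        if len(traces) + 1 > self.bucket_size:
            self.flushed.append((key, traces))
            traces = self.pending[key] = []
        traces.append(waveform)

    def close(self):
        # Remaining partially filled buckets are written at the end.
        for key, traces in self.pending.items():
            if traces:
                self.flushed.append((key, traces))
        self.pending = {}


# Bucket by trace length in steps of 100 samples, at most 2 traces per block.
writer = BufferedBucketWriter(lambda meta, wf: len(wf) // 100, bucket_size=2)
for n_samples in (120, 130, 140, 250):
    writer.add_trace({}, [0.0] * n_samples)
writer.close()
block_sizes = [len(traces) for _, traces in writer.flushed]
print(block_sizes)
```

The first three traces share a bucket, so the third one triggers a write of the first two; the remaining buffers are only written when the toy writer is closed.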
- class WaveformDataset(path=None, name=None, dimension_order=None, component_order=None, sampling_rate=None, cache=None, chunks=None, missing_components='pad', metadata_cache=False, resample_zerophase=False, **kwargs)[source]
Bases: object
This class is the base class for waveform datasets.
A key consideration should be how the cache is used. If sufficient memory is available to keep the full dataset in memory, activating the cache will yield strong performance gains. For details on the cache strategies, see the documentation of the cache parameter.
- Parameters:
path (Path | str) – Path to dataset.
name (str | None) – Dataset name, default is None.
dimension_order (str | None) – Dimension order e.g. ‘CHW’, if not specified will be assumed from config file, defaults to None.
component_order (str | None) – Component order e.g. ‘ZNE’, if not specified will be assumed from config file, defaults to None.
sampling_rate (float | None) – Common sampling rate of waveforms in dataset, sampling rate can also be specified as a metadata column if not common across dataset.
cache (Literal['full', 'trace', None]) – Defines the behaviour of the waveform cache. Provides three options:
“full”: When a trace is queried, the full block containing the trace is loaded into the cache and stored in memory. This causes the highest memory consumption, but also best performance when using large parts of the dataset.
“trace”: When a trace is queried, only the trace itself is loaded and stored in memory. This is particularly useful when only a subset of traces is queried, but these are queried multiple times. In this case, this strategy might outperform “full”.
None: When a trace is queried, it is always loaded from disk. This mode leads to low memory consumption but high IO load. It is most likely not usable for model training.
Note that for datasets without blocks, i.e., each trace in a single array in the hdf5 file, the strategies “full” and “trace” are identical. The default cache strategy is None.
Use preload_waveforms() to populate the cache. Preloading the waveforms is often much faster than loading them during later application, as preloading can use sequential access. Note that it is recommended to always first filter a dataset and then preload to reduce unnecessary reads and memory consumption.
chunks (list[str]) – Specify particular chunks to load. If None, loads all chunks. Defaults to None.
missing_components (Literal['pad', 'copy', 'ignore']) – Strategy to deal with missing components. Options are:
“pad”: Fill with zeros.
“copy”: Fill with values from first existing traces.
“ignore”: Order all existing components in the requested order, but ignore missing ones. This will raise an error if traces with different numbers of components are requested together.
metadata_cache (bool) – If true, metadata is cached in a lookup table. This significantly speeds up access to metadata and thereby access to samples. On the downside, this requires storing two copies of the metadata in memory. The second copy usually consumes more memory due to the less space-efficient format. Runtime differences are particularly big for large datasets.
resample_zerophase (bool) – If True, resampling in data loading uses a zerophase filter for antialiasing. Otherwise, uses a causal filter. See the documentation of scipy.signal.decimate for details.
kwargs
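The trade-off between the three cache strategies can be illustrated with a toy model that counts disk reads. ToyBlockStore is a hypothetical stand-in written for this sketch; it is not how SeisBench implements its cache, and it ignores memory limits entirely.

```python
class ToyBlockStore:
    """Toy model of blocked storage with the three documented cache modes."""

    def __init__(self, blocks, cache=None):
        self.blocks = blocks  # block name -> list of traces ("on disk")
        self.cache = cache    # "full", "trace" or None
        self.cached = {}      # what is currently held in memory
        self.disk_reads = 0

    def get(self, block, pos):
        key = block if self.cache == "full" else (block, pos)
        if self.cache is not None and key in self.cached:
            hit = self.cached[key]
            return hit[pos] if self.cache == "full" else hit
        self.disk_reads += 1
        if self.cache == "full":
            self.cached[block] = self.blocks[block]  # keep the whole block
        elif self.cache == "trace":
            self.cached[key] = self.blocks[block][pos]  # keep only this trace
        return self.blocks[block][pos]


blocks = {"b0": ["t0", "t1", "t2"]}
reads = {}
for mode in ("full", "trace", None):
    store = ToyBlockStore(blocks, cache=mode)
    for pos in (0, 1, 0):
        store.get("b0", pos)
    reads[mode] = store.disk_reads
print(reads)
```

In this access pattern, "full" needs one read because the whole block is kept, "trace" needs one read per distinct trace, and None reads from disk on every query.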
- static available_chunks(path)[source]
Determines the chunks of the dataset in the given path.
- Parameters:
path – Dataset path
- Returns:
List of chunks
- property cache
Get or set the cache strategy of the dataset. For possible strategies see the constructor. Note that changing cache strategies will not cause a cache eviction.
- property chunks
Returns a list of chunks. If dataset is not chunked, returns an empty list.
- property component_order
Get or set order of components in the output.
- copy()[source]
Create a copy of the data set. All attributes are copied by value, except waveform cache entries. The cache entries are copied by reference, as the waveforms will take up most of the memory. This should be fine for most use cases, because the cache entries should anyhow never be modified. Note that the cache dict itself is not shared, such that cache evictions and inserts in one of the data sets do not affect the other one.
- Returns:
Copy of the dataset
- property data_format
Data format dictionary, describing the data format of the stored dataset. Note that this does not necessarily equal the output data format of get_waveforms(). To query the output format, use the relevant class properties.
- property dimension_order
Get or set the order of the dimension in the output.
- filter(mask, inplace=True)[source]
Filters dataset, e.g. by distance/magnitude/…, using a binary mask. Default behaviour is to perform inplace filtering, directly changing the metadata and waveforms to only keep the results of the masking query. Setting inplace equal to false will return a filtered copy of the dataset. For details on the copy operation see copy().
- Parameters:
mask (boolean array) – Boolean mask to apply to metadata.
inplace (bool) – If true, filter inplace.
Example usage:
dataset.filter(dataset["p_status"] == "manual")
- Returns:
None if inplace=True, otherwise the filtered dataset.
- get_event_sample_indices(event_id)[source]
Returns the indices of all samples associated with a given event. Requires that source_id is part of the metadata.
- Parameters:
event_id – Event identifier
- Returns:
List of indices associated with the event
- Return type:
list[int]
- get_event_source_id(idx)[source]
Gets the source_id of an event. Note that idx refers to the integer index of the event in order of appearance in the metadata after removing duplicates.
- Parameters:
idx (int) – Event index
- Return type:
str
- Returns:
Source ID of the event
- get_group_idx_from_params(params)[source]
Returns the index of the group identified by the params.
- Parameters:
params – The parameters identifying the group. For a single grouping parameter, this argument will be a single value. Otherwise this argument needs to be a tuple of keys.
- Returns:
Index of the group
- Return type:
int
- get_group_samples(idx, **kwargs)[source]
Returns the waveforms and metadata for each member of a group. For details see get_sample().
- Parameters:
idx (int) – Group index
kwargs – Kwargs passed to get_sample()
- Returns:
List of waveforms, list of metadata dicts
- get_group_size(idx)[source]
Returns the number of samples in a group
- Parameters:
idx (int) – Group index
- Returns:
Size of the group
- Return type:
int
- get_group_waveforms(idx, **kwargs)[source]
Returns the waveforms for each member of a group. For details see get_sample().
- Parameters:
idx (int) – Group index
kwargs – Kwargs passed to get_sample()
- Returns:
List of waveforms
- get_idx_from_trace_name(trace_name, chunk=None, dataset=None)[source]
Returns the index of a trace with given trace_name, chunk and dataset. Chunk and dataset parameters are optional, but might be necessary to uniquely identify traces for chunked datasets or for MultiWaveformDataset. The method will issue a warning the first time a non-uniquely identifiable trace is requested. If no matching key is found, a KeyError is raised.
- Parameters:
trace_name (str) – Trace name as in metadata[“trace_name”]
chunk (None) – Trace chunk as in metadata[“trace_chunk”]. If None this key will be ignored.
dataset (None) – Trace dataset as in metadata[“trace_dataset”]. Only for MultiWaveformDataset. If None this key will be ignored.
- Returns:
Index of the sample
- get_sample(idx, sampling_rate=None)[source]
Returns both the waveforms and the metadata of a trace. Adjusts all metadata entries with sampling-rate-dependent values to the correct sampling rate, e.g., p_pick_samples will still point to the right sample after this operation, even if the trace was resampled.
Hint
When decimating data, a low-pass filter needs to be applied to avoid aliasing. To control whether this filter is causal or zerophase, the class attribute zerophase_resample can be used.
- Parameters:
idx – Idx of sample to return
sampling_rate – Target sampling rate, overwrites sampling rate for dataset.
- Returns:
Tuple with the waveforms and the metadata of the sample.
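The adjustment of sample-based metadata under resampling can be sketched as a proportional rescaling. This is an assumption about the arithmetic, not a statement about how SeisBench rounds or selects columns; rescale_sample_index is a hypothetical helper, and p_pick_samples is the example key named in the description above.

```python
def rescale_sample_index(sample, source_rate, target_rate):
    """Map a sample index from one sampling rate to another."""
    return sample * target_rate / source_rate


# Made-up metadata row: a pick at sample 1000 of a 100 Hz trace (10 s).
metadata = {"trace_sampling_rate_hz": 100.0, "p_pick_samples": 1000}
target_rate = 50.0

adjusted = dict(metadata)  # leave the stored metadata untouched
adjusted["p_pick_samples"] = rescale_sample_index(
    metadata["p_pick_samples"], metadata["trace_sampling_rate_hz"], target_rate
)
adjusted["trace_sampling_rate_hz"] = target_rate
print(adjusted)
```

The pick stays at the same time offset of 10 s: sample 1000 at 100 Hz corresponds to sample 500 at 50 Hz.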
- get_split(split)[source]
Returns a dataset with the requested split.
- Parameters:
split – Split name to return. Usually one of “train”, “dev”, “test”
- Returns:
Dataset filtered to the requested split.
- get_waveforms(idx=None, mask=None, sampling_rate=None)[source]
Collects waveforms and returns them as an array.
- Parameters:
idx (int, list[int]) – Idx or list of idx to obtain waveforms for
mask (np.ndarray[bool]) – Binary mask on the metadata, indicating which traces should be returned. Can not be used jointly with idx.
sampling_rate (float) – Target sampling rate, overwrites sampling rate for dataset
- Returns:
Waveform array with dimensions ordered according to dimension_order e.g. default ‘NCW’ (number of traces, number of components, record samples). If the number of record samples varies between different entries, all entries are padded to the maximum length.
- Return type:
np.ndarray
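The padding behaviour for traces of different length can be illustrated with plain nested lists. The sketch assumes zero padding at the end of each trace, which the description above does not specify; pad_to_common_length is a hypothetical helper and the layout follows the ‘NCW’ order.

```python
def pad_to_common_length(traces, fill=0.0):
    """Pad traces shaped (components, samples) to the longest sample count."""
    max_len = max(len(component) for trace in traces for component in trace)
    return [
        [component + [fill] * (max_len - len(component)) for component in trace]
        for trace in traces
    ]


short_trace = [[1.0] * 4 for _ in range(3)]  # 3 components, 4 samples
long_trace = [[2.0] * 6 for _ in range(3)]   # 3 components, 6 samples
batch = pad_to_common_length([short_trace, long_trace])
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)
```

The resulting batch has one common sample dimension, so it can be stacked into a single array.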
- property grouping
The grouping parameters for the dataset. These parameters are used to determine the groups and for the associated methods. grouping can be either a single string or a list of strings. Each string must be a column in the metadata. By default, the grouping is None.
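Grouping by one or several metadata columns can be pictured as follows. The sketch works on a list of dicts instead of the pandas DataFrame used by the dataset, and build_groups is a hypothetical helper. It mirrors the convention stated for get_group_idx_from_params: a single value identifies a group for one grouping column, a tuple for several.

```python
from collections import defaultdict


def build_groups(rows, grouping):
    """Map group parameters to the list of row indices in that group."""
    keys = [grouping] if isinstance(grouping, str) else list(grouping)
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        params = tuple(row[key] for key in keys)
        # Single grouping column: plain value; several columns: tuple.
        groups[params[0] if len(keys) == 1 else params].append(idx)
    return dict(groups)


# Made-up metadata rows, for illustration only.
rows = [
    {"source_id": "ev1", "station_network_code": "XX"},
    {"source_id": "ev2", "station_network_code": "XX"},
    {"source_id": "ev1", "station_network_code": "YY"},
]
by_event = build_groups(rows, "source_id")
by_event_and_network = build_groups(rows, ["source_id", "station_network_code"])
print(by_event)
print(by_event_and_network)
```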
- property metadata
Metadata of the dataset as pandas DataFrame.
- property metadata_cache
- property missing_components
Get or set strategy to handle missing components. For options, see the constructor.
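The three strategies for missing components can be sketched on a single trace. The helper order_components is hypothetical, and the reading of "copy" as "reuse the first stored component" is an assumption; the constructor documentation only says that values from the first existing traces are used.

```python
def order_components(available, requested, strategy="pad"):
    """Arrange components in the requested order.

    available: dict mapping component letter -> list of samples.
    """
    if strategy not in ("pad", "copy", "ignore"):
        raise ValueError(f"Unknown strategy: {strategy}")
    first = next(iter(available.values()))  # first stored component
    ordered = []
    for component in requested:
        if component in available:
            ordered.append(list(available[component]))
        elif strategy == "pad":
            ordered.append([0.0] * len(first))  # fill with zeros
        elif strategy == "copy":
            ordered.append(list(first))  # reuse an existing component
        # "ignore": skip the missing component entirely
    return ordered


available = {"Z": [1.0, 2.0]}
padded = order_components(available, "ZNE", "pad")
copied = order_components(available, "ZNE", "copy")
ignored = order_components(available, "ZNE", "ignore")
print(padded, copied, ignored)
```

With "ignore" the number of output components depends on the trace, which is why requesting traces with different numbers of components together raises an error in the dataset.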
- n_events()[source]
Returns the number of unique events in the dataset. Requires that source_id is part of the metadata.
- Returns:
Number of unique events
- Return type:
int
- property name
Name of the dataset (immutable)
- property path
Path of the dataset (immutable)
- plot_map(res='110m', connections=False, **kwargs)[source]
Plots the dataset onto a map using the Mercator projection. Requires a cartopy installation.
- Parameters:
res (str, optional) – Resolution for cartopy features, defaults to 110m.
connections (bool, optional) – If true, plots lines connecting sources and stations. Defaults to false.
kwargs – Plotting kwargs that will be passed to matplotlib plot. Args need to be prefixed with sta_, ev_ and conn_ to address stations, events or connections.
- Returns:
A figure handle for the created figure.
- preload_waveforms(pbar=False)[source]
Loads waveform data from hdf5 file into cache. Fails if caching strategy is None.
- Parameters:
pbar – If true, shows progress bar. Defaults to False.
- region_filter(domain, lat_col, lon_col, inplace=True)[source]
Filtering of dataset based on predefined region or geometry. See also convenience functions region_filter_[source|receiver].
- Parameters:
domain (obspy.clients.fdsn.mass_downloader.domain) – The domain filter
lat_col (str) – Name of latitude coordinate column
lon_col (str) – Name of longitude coordinate column
inplace (bool) – Inplace filtering, defaults to true. See also filter().
- Returns:
None if inplace=True, otherwise the filtered dataset.
- region_filter_receiver(domain, inplace=True)[source]
Convenience method for region filtering by receiver location.
DASDataset
- class DASBenchmarkDataset(chunks=None, citation=None, license=None, force=False, wait_for_file=False, repository_lookup=False, compile_from_source=False, download_kwargs=None, **kwargs)[source]
Bases: AbstractBenchmarkDataset, DASDataset, ABC
This class is the base class for benchmark DAS datasets. For the functionality, see the superclasses.
- class DASDataWriter(path, chunk='', metadata_path=None, data_path=None, data_type=<class 'numpy.float32'>, strict=True)[source]
Bases: object
This class allows writing DAS datasets in SeisBench format. It only writes a single chunk. To write multiple chunks, use multiple data writers with different chunk arguments but identical path.
- Parameters:
path (Path | str) – Path to write the chunk to
chunk (str) – Chunk identifier
metadata_path (Path | str | None) – Override for the metadata path. If provided, writes the metadata here instead of the default location. The chunk key will be ignored in this case. Unless integrated into complex workflows, this parameter should not be used.
data_path (Path | str | None) – Same as metadata_path but for the data file.
data_type (type[floating] | type[integer]) – Data type of the data. Defaults to float32.
strict (bool) – If true, raise an error if the metadata does not contain the key fields. Otherwise, only raise a warning.
- add_record(metadata, data, annotations)[source]
Add a record to the dataset. While the data and annotations will immediately be written to disk, the metadata will be stored in memory and written to disk when the dataset is closed.
- Parameters:
metadata (dict[str, Any]) – Metadata of the record. There are no mandatory fields, but warnings will be issued if typical key fields are missing.
data (ndarray) – Data of the record. The data needs to be a 2D array (time, channel).
annotations (dict[str, ndarray]) – Annotations of the record. Each annotation consists of a 1D array with the same length as the number of channels. The entries are in samples along the time axis. For example, an annotation called "P" indicates the indices of the P wave arrival at each channel. NaN values are allowed. Annotations can differ between the records.
- Return type:
None
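The shape requirements on data and annotations can be checked before handing a record to the writer. The sketch below uses nested lists instead of NumPy arrays so it runs without dependencies; count_picks is a hypothetical helper and not part of SeisBench.

```python
import math


def count_picks(data, annotations):
    """Validate annotation lengths and count channels with a pick.

    data: nested list shaped (time, channel).
    annotations: dict mapping annotation name -> per-channel sample index,
    with NaN marking channels without a pick.
    """
    n_channels = len(data[0])
    if any(len(row) != n_channels for row in data):
        raise ValueError("data must be a rectangular (time, channel) array")
    counts = {}
    for name, picks in annotations.items():
        if len(picks) != n_channels:
            raise ValueError(f"annotation {name!r} needs one entry per channel")
        counts[name] = sum(not math.isnan(pick) for pick in picks)
    return counts


data = [[0.0] * 3 for _ in range(5)]  # 5 time samples, 3 channels
annotations = {"P": [1.0, math.nan, 2.5]}
pick_counts = count_picks(data, annotations)
print(pick_counts)
```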
- property data_path: Path
- property metadata_path: Path
- class DASDataset(path=None, chunks=None)[source]
Bases: object
- DATA_FILE = 'records_$CHUNK.hdf5'
- METADATA_FILE = 'metadata_$CHUNK.parquet'
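The $CHUNK placeholder in these file name patterns is replaced with the chunk name. A minimal sketch, assuming plain string replacement (the mechanism used internally is not described here) and made-up chunk names; chunk_files is a hypothetical helper.

```python
DATA_FILE = "records_$CHUNK.hdf5"
METADATA_FILE = "metadata_$CHUNK.parquet"


def chunk_files(chunk):
    """Return the (metadata, data) file names for one chunk."""
    return (
        METADATA_FILE.replace("$CHUNK", chunk),
        DATA_FILE.replace("$CHUNK", chunk),
    )


# Made-up chunk names, for illustration only.
files = {chunk: chunk_files(chunk) for chunk in ("2020", "2021")}
print(files)
```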
- static available_chunks(path)[source]
Determines the chunks of the dataset in the given path. If available, parses the chunks file. Otherwise, scans the dataset for metadata and records files.
- Parameters:
path (Path) – Dataset path
- Return type:
list[str]
- Returns:
List of chunks
- property chunks: list[str]
- dev(inplace=False)[source]
Convenience method for get_split(“dev”).
- Return type:
DASDataset | None
- Returns:
Development dataset
- filter(mask, inplace=True)[source]
Filters dataset, e.g. by distance/magnitude/…, using a binary mask. Default behaviour is to perform inplace filtering. Setting inplace equal to false will return a filtered copy of the data set.
- Parameters:
mask (ndarray) – Boolean mask to apply to metadata.
inplace (bool) – If true, filter inplace.
- Return type:
DASDataset|None
Example usage:
dataset.filter(dataset.metadata["record_sampling_rate_hz"] > 100)
- get_sample(idx, record_virtual=True, annotations_virtual=False)[source]
Load the sample with the given index. Use the record_virtual and annotations_virtual arguments to control whether the record and annotations are loaded into memory or only pointers are returned. By default, the record will not be loaded into memory, while the annotations will be loaded into memory.
- Parameters:
idx (int) – Index of the sample to load
record_virtual (bool) – If true, the record is returned as a virtual array. Otherwise, the record is loaded into memory.
annotations_virtual (bool) – If true, the annotations are returned as virtual arrays. Otherwise, the annotations are loaded into memory.
- Return type:
tuple[dict[str,Any],ndarray|Dataset,dict[str,ndarray|Dataset]]
- get_split(split, inplace=False)[source]
Returns a dataset with the requested split.
- Parameters:
split (str) – Split name to return. Usually one of “train”, “dev”, “test”
- Return type:
DASDataset | None
- Returns:
Dataset filtered to the requested split.
- property metadata: DataFrame
- property path: Path
Path of the dataset
- test(inplace=False)[source]
Convenience method for get_split(“test”).
- Return type:
DASDataset | None
- Returns:
Test dataset
- train(inplace=False)[source]
Convenience method for get_split(“train”).
- Return type:
DASDataset | None
- Returns:
Training dataset
- train_dev_test()[source]
Convenience method for returning training, development and test set. Equal to:
>>> self.train(), self.dev(), self.test()
- Return type:
tuple[DASDataset, DASDataset, DASDataset]
- Returns:
Training dataset, development dataset, test dataset
- class MultiDASDataset(datasets)[source]
Bases: object
This class is a wrapper for multiple DAS datasets. It allows combining multiple datasets into a single dataset. It has mostly the same API as DASDataset.
- property datasets
- dev(inplace=False)
Convenience method for get_split(“dev”).
- Return type:
DASDataset | None
- Returns:
Development dataset
- filter(mask, inplace=True)[source]
Filters dataset, similar to WaveformDataset.filter().
- Parameters:
mask (ndarray) – Boolean mask to apply to metadata.
inplace (bool) – If true, filter inplace.
- Return type:
MultiDASDataset|None
- get_split(split, inplace=False)
Returns a dataset with the requested split.
- Parameters:
split (str) – Split name to return. Usually one of “train”, “dev”, “test”
- Return type:
DASDataset | None
- Returns:
Dataset filtered to the requested split.
- property metadata
- test(inplace=False)
Convenience method for get_split(“test”).
- Return type:
DASDataset | None
- Returns:
Test dataset
- train(inplace=False)
Convenience method for get_split(“train”).
- Return type:
DASDataset | None
- Returns:
Training dataset
- train_dev_test()
Convenience method for returning training, development and test set. Equal to:
>>> self.train(), self.dev(), self.test()
- Return type:
tuple[DASDataset, DASDataset, DASDataset]
- Returns:
Training dataset, development dataset, test dataset
- class RandomDASDataset(**kwargs)[source]
Bases: DASBenchmarkDataset
This is a purely random dataset for testing purposes. It does not contain any actual data and should only be used for unit tests.