Benchmark Datasets
SeisBench provides download access to a suite of publicly available seismic waveform datasets for training machine learning algorithms. Each dataset is summarized below, along with its citation.
ETHZ
The ETHZ benchmark dataset contains regional seismicity recorded on publicly available networks
throughout the Switzerland region. For more information see: SED website.
The dataset contains 36,743 waveform examples.
Warning
Dataset size: waveforms.hdf5 ~22 GB, metadata.csv ~13 MB.
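Given the download size quoted above, it can be worth checking free disk space before fetching the dataset. The following is a minimal sketch using only the Python standard library; the helper name `has_room_for`, the `safety_factor` margin, and the checked path are assumptions made for illustration and are not part of SeisBench.

```python
import shutil

# Approximate ETHZ sizes quoted above, converted to gigabytes.
ETHZ_WAVEFORMS_GB = 22.0
ETHZ_METADATA_GB = 13.0 / 1000.0  # ~13 MB


def has_room_for(required_gb, path=".", safety_factor=1.1):
    """Return True if `path` has at least required_gb * safety_factor free.

    `path` is a hypothetical location; point it at the directory where
    the dataset will actually be stored.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb * safety_factor


required = ETHZ_WAVEFORMS_GB + ETHZ_METADATA_GB
if has_room_for(required):
    print(f"Enough free space for ~{required:.1f} GB")
else:
    print(f"Less than ~{required:.1f} GB (plus margin) free; choose another location")
```

The same check applies to the other datasets on this page by substituting their quoted sizes.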
Citation
Each individual network has its own DOI. From publicly available data:
GEOFON
As part of its effort to measure and characterize relevant seismicity worldwide in real time, the GEOFON earthquake monitoring service acquires and analyses waveform data from over 800 globally distributed seismic stations. Besides automatic processing, manual analysis, especially onset re-picking, is performed routinely whenever necessary to improve the location quality. Usually, only a few picks are manually re-picked. However, to collect reference picks for improving automatic picking, the P arrivals of some events are comprehensively re-picked by an experienced analyst, irrespective of the presence or quality of automatic picks. For local and near-regional events, S onsets have also been picked, and for a small fraction both Pn and Pg are included. For teleseismic events, almost no S onsets have been picked. Depth phases have been picked occasionally but not comprehensively. In total, there are ~275,000 waveform examples. The magnitudes of the events in this dataset range from about 2 to 9, with the bulk of the manually picked events being intermediate to large events (M 5-7). Regional events with smaller magnitudes are mostly in Europe and northern Chile. The time range covers 2009 to 2013.
Warning
Dataset size: waveforms.hdf5 ~25.8 GB, metadata.csv ~99 MB.
Citation
Citation information will be added.
INSTANCE
The INSTANCE benchmark dataset is a dataset of signals compiled by the Istituto Nazionale di Geofisica e Vulcanologia
(INGV). It contains ~1.2 million three-component (3C) waveform traces, which record ~50,000 earthquakes and include ~130,000 noise traces.
Event magnitudes range from 0 to 6.5.
The dataset is split for ease of use into noise examples (InstanceNoise),
waveform examples in counts (InstanceCounts), and waveform examples in
ground motion units (InstanceGM). A combined dataset containing the noise examples
and waveform examples as counts is also available (InstanceCountsCombined).
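The split described above can be summarized as a small lookup. This is only a sketch: the function `instance_class_name` and its parameters are hypothetical helpers invented for illustration, while the returned strings are the class names listed above.

```python
def instance_class_name(units="counts", include_noise=False, noise_only=False):
    """Map a desired INSTANCE variant to the class name listed above.

    This helper is illustrative and not part of SeisBench.
    """
    if noise_only:
        return "InstanceNoise"
    if units == "counts":
        # The combined dataset is only listed for waveforms in counts.
        return "InstanceCountsCombined" if include_noise else "InstanceCounts"
    if units == "ground_motion":
        if include_noise:
            raise ValueError("no combined dataset in ground motion units is listed")
        return "InstanceGM"
    raise ValueError(f"unknown units: {units!r}")


print(instance_class_name())                          # InstanceCounts
print(instance_class_name(include_noise=True))        # InstanceCountsCombined
print(instance_class_name(units="ground_motion"))     # InstanceGM
print(instance_class_name(noise_only=True))           # InstanceNoise
```

With SeisBench installed, the returned name could then be looked up on the `seisbench.data` module, for example with `getattr`.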
Warning
Dataset size:
waveforms (counts) ~160 GB
waveforms (ground motion units) ~310 GB
Citation
Michelini, A., Cianetti, S., Gaviano, S., Giunchi, C., Jozinović, D., & Lauciani, V. (2021). INSTANCE - The Italian Seismic Dataset For Machine Learning. Istituto Nazionale di Geofisica e Vulcanologia (INGV).
Iquique
The Iquique dataset contains 13,400 examples of picked arrivals from
the aftershock sequence following the Mw=8.1 Iquique earthquake that occurred in northern Chile in 2014. All stations
are 100 Hz, 3-component stations. The waveforms contain examples of earthquakes only.
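Because all stations are sampled at 100 Hz, a pick given as a sample index converts to a time offset by dividing by 100. The sketch below illustrates this arithmetic; the function names and the example sample indices are hypothetical and not taken from the dataset.

```python
SAMPLING_RATE_HZ = 100.0  # stated above for all Iquique stations


def sample_to_seconds(sample_index, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert a sample index into seconds from the start of the trace."""
    return sample_index / sampling_rate_hz


def seconds_to_sample(seconds, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert seconds from the start of the trace into the nearest sample index."""
    return round(seconds * sampling_rate_hz)


# Hypothetical P and S pick positions, in samples.
p_sample, s_sample = 1000, 1850
s_minus_p = sample_to_seconds(s_sample) - sample_to_seconds(p_sample)
print(f"S-P time: {s_minus_p:.2f} s")
```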
Warning
Dataset size: waveforms.hdf5 ~5 GB, metadata.csv ~2.6 MB.
Citation
Woollam, J., Rietbrock, A., Bueno, A. and De Angelis, S., 2019. Convolutional neural network for seismic phase classification, performance demonstration over a local seismic network. Seismological Research Letters, 90(2A), pp.491-502. https://doi.org/10.1785/0220180312
ISC-EHB Depth Phases
The ISC_EHB_DepthPhases dataset contains 44,106 events and 174,436 traces.
It contains traces with depth phase readings (pP, sP and pwP) from the ISC-EHB bulletin.
Additional picks are annotated on the traces where the bulletin contains them.
Citation
Münchmeyer, J., Saul, J., Tilmann, F. (2023). Learning the Deep and the Shallow: Deep‐Learning‐Based Depth Phase Picking and Earthquake Depth Estimation. Seismological Research Letters. https://doi.org/10.1785/0220230187
LENDB
The LENDB dataset is a published benchmark dataset (see citation below) of local
earthquakes recorded across a global set of 3-component seismic stations. The entire dataset comprises ~1.25 million
waveform examples, recorded on 1487 individual 3-component stations. There are ~305,000 local earthquake examples and
~618,000 noise examples. For more information regarding the benchmark dataset, please refer to the original reference
below.
Warning
Dataset size: waveforms.hdf5 ~20 GB, metadata.csv ~218 MB.
Citation
Magrini, Fabrizio, Jozinović, Dario, Cammarano, Fabio, Michelini, Alberto, & Boschi, Lapo. (2020). LEN-DB - Local earthquakes detection: a benchmark dataset of 3-component seismograms built on a global scale.
LFE stack datasets
SeisBench contains three datasets of stacked low-frequency earthquake (LFE) waveforms:
Cascadia (Canada/USA), 1817 stacks: LFEStacksCascadiaBostock2015
Guerrero (Mexico), 11200 stacks: LFEStacksMexicoFrank2014
San Andreas fault (USA), 2306 stacks: LFEStacksSanAndreasShelly2017
Note that in addition to the regular pick columns, the datasets contain predicted arrival times in the trace_*_predicted_arrival_sample columns.
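The `trace_*_predicted_arrival_sample` pattern can be matched against metadata column names to separate predicted arrivals from regular picks. The sketch below uses only the standard library; the concrete column names in `columns` are hypothetical examples chosen to fit the pattern, and the helper `predicted_arrival_columns` is not part of SeisBench.

```python
from fnmatch import fnmatchcase

# Hypothetical metadata column names, for illustration only.
columns = [
    "trace_name",
    "trace_P_arrival_sample",
    "trace_S_arrival_sample",
    "trace_P_predicted_arrival_sample",
    "trace_S_predicted_arrival_sample",
]


def predicted_arrival_columns(column_names):
    """Return the names matching the predicted-arrival pattern quoted above."""
    pattern = "trace_*_predicted_arrival_sample"
    return [name for name in column_names if fnmatchcase(name, pattern)]


print(predicted_arrival_columns(columns))
```

The same filter could be applied to the column index of a loaded metadata table.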
Citation
Münchmeyer, J., Giffard-Roisin, S., Malfante, M., Frank, W., Poli, P., Marsan, D., Socquet A. (2024). Deep learning detects uncataloged low-frequency earthquakes across regions. Seismica.
MLAAPDE
The MLAAPDE dataset is a global, mostly teleseismic dataset with detailed phase
annotations. It contains 1.9 million phase labels. Most labels are P arrivals, with some labels for more detailed
phases.
Citation
Cole, H. M., Yeck, W. L., & Benz, H. M. (2023). MLAAPDE: A Machine Learning Dataset for Determining Global Earthquake Source Parameters. Seismological Research Letters, 94(5), 2489-2499. https://doi.org/10.1785/0220230021
Cole H. M. and W. L. Yeck, 2022, Global Earthquake Machine Learning Dataset: Machine Learning Asset Aggregation of the PDE (MLAAPDE): U.S. Geological Survey data release. https://doi.org/10.5066/P96FABIB
NEIC
The National Earthquake Information Center (NEIC) benchmark dataset comprises ~1.3 million seismic phase arrivals with global source-station paths. Because trace start times and station information are missing for this dataset, it is stored in the SeisBench format without this normally required information.
Warning
The NEIC dataset has been superseded by the more comprehensive MLAAPDE dataset. Unless you are aiming for an exact comparison to previous work, we recommend using the MLAAPDE dataset instead. MLAAPDE is larger and contains more comprehensive metadata.
Citation
Yeck, W. L., Patton, J. M., Ross, Z. E., Hayes, G. P., Guy, M. R., Ambruz, N. B., Shelly, D. R., Benz, H. M., Earle, P. S., (2021) Leveraging Deep Learning in Global 24/7 Real-Time Earthquake Monitoring at the National Earthquake Information Center.
OBS
The ocean-bottom seismometer (OBS) benchmark dataset (OBS) comprises ~110,000 seismic waveforms with ~150,000 manually
labeled phase arrivals. The data comprises 15 deployments with a total of 355 stations across different tectonic
settings.
Citation
Bornstein, T., Lange, D., Münchmeyer, J., Woollam, J., Rietbrock, A., Barcheck, G., Grevemeyer, I., Tilmann, F. (2023). PickBlue: Seismic phase picking for ocean bottom seismometers with deep learning. Earth and Space Science.
OBST2024
The OBST dataset (OBST2024) comprises ~60,000 seismic waveforms
from ocean-bottom seismometers (OBS). These are split into ~35,000 earthquake waveforms and ~25,000 noise waveforms.
For each earthquake waveform, P and S arrival times have been annotated.
The data comprises 11 deployments across different tectonic settings.
Citation
Niksejel, A. and Zhang, M. (2024). OBSTransformer: a deep-learning seismic phase picker for OBS data using automated labelling and transfer learning. Geophysical Journal International.
PNW
PNW is an ML-ready curated dataset covering a wide range of sources from the Pacific Northwest (PNW). It is made up of several separate datasets:
PNW contains waveforms from earthquakes and explosions (ComCat events) recorded on velocity channels (EH, HH and BH).
PNWAccelerometers contains waveforms from earthquakes and explosions (ComCat events) recorded on accelerometers (EN).
PNWNoise contains noise waveforms.
PNWExotic contains exotic event waveforms (surface event, thunder quake, sonic boom, etc.).
For more information see: PNW-ML.
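For earthquake and explosion waveforms, the channel codes listed above determine which PNW dataset a trace belongs to. The following sketch encodes that rule; the function `pnw_dataset_for_channel` is a hypothetical helper, and it deliberately does not cover PNWNoise or PNWExotic, which are distinguished by event type rather than by channel.

```python
# Channel code prefixes as listed above.
VELOCITY_CODES = {"EH", "HH", "BH"}
ACCELEROMETER_CODES = {"EN"}


def pnw_dataset_for_channel(channel):
    """Return the PNW dataset name for an earthquake/explosion trace.

    `channel` is a channel code such as "HHZ"; only its first two
    characters are inspected. Illustrative helper, not part of SeisBench.
    """
    prefix = channel[:2].upper()
    if prefix in VELOCITY_CODES:
        return "PNW"
    if prefix in ACCELEROMETER_CODES:
        return "PNWAccelerometers"
    raise ValueError(f"channel {channel!r} is not covered by the list above")


for code in ("HHZ", "EHN", "BHE", "ENZ"):
    print(code, "->", pnw_dataset_for_channel(code))
```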
Citation
Ni, Y., Hutko, A., Skene, F., Denolle, M., Malone, S., Bodin, P., Hartog, R., & Wright, A. (2023). Curated Pacific Northwest AI-ready Seismic Dataset. Seismica, 2(1).
SCEDC
The SCEDC benchmark dataset contains all publicly available, manually picked recordings
of seismic events in the Southern California Seismic Network from 2000 to 2020.
It contains ~8,100,000 waveform examples.
Warning
Dataset size: waveforms.hdf5 ~660 GB, metadata.csv ~2.2 GB.
STEAD
The STEAD dataset is a published benchmark dataset (see citation below) of local seismic signals,
both earthquake and non-earthquake, along with noise examples. In total there are ~1.2 million time series, of which ~100,000
are noise examples and the remainder contain seismic arrivals. The dataset contains 450,000 earthquakes.
Warning
Dataset size: waveforms.hdf5 ~70 GB, metadata.csv 200 MB.
Citation
Mousavi, S. M., Sheng, Y., Zhu, W., Beroza G.C., (2019). STanford EArthquake Dataset (STEAD): A Global Data Set of Seismic Signals for AI, IEEE Access.
TXED
The TXED dataset is a benchmark dataset of local seismic signals in the state of Texas.
In total there are ~500,000 time series encompassing 20,000 earthquakes (~300,000 traces) and noise traces (~200,000 traces).
Warning
Dataset size: waveforms.hdf5 ~70 GB, metadata.csv 120 MB.
Citation
Chen, Y., A. Savvaidis, O. M. Saad, G.-C. D. Huang, D. Siervo, V. O’Sullivan, C. McCabe, B. Uku, P. Fleck, G. Burke, N. L. Alvarez, J. Domino, and I. Grigoratos, TXED: the Texas earthquake dataset for AI, Seismological Research Letters, vol. 1, no. 1, 2024. https://doi.org/10.1785/0220230327