Benchmark Datasets

SeisBench provides download access to a suite of publicly available seismic waveform datasets for training machine learning algorithms. An overview of the contents of each dataset is given below, along with the corresponding citation.
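Because several of the datasets are large, it can help to check free disk space before starting a download. The following is a minimal sketch using only the Python standard library. The sizes are the approximate waveform file sizes quoted in the warnings on this page; the function name `has_room`, the 10% safety margin, and the conversion of 1 GB to 1e9 bytes are assumptions of this sketch and not part of SeisBench.

```python
import shutil

# Approximate waveform file sizes in GB, copied from the warnings on this page.
DATASET_SIZE_GB = {
    "ETHZ": 22,
    "GEOFON": 25.8,
    "Iquique": 5,
    "LENDB": 20,
    "SCEDC": 660,
    "STEAD": 70,
    "TXED": 70,
}


def has_room(name, path=".", margin=1.1, free_bytes=None):
    """Return True if `path` has space for dataset `name` plus a safety margin.

    `margin` and the 1 GB = 1e9 bytes conversion are assumptions of this sketch.
    `free_bytes` overrides the measured free space, which is useful for dry runs.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    needed_bytes = DATASET_SIZE_GB[name] * 1e9 * margin
    return free_bytes >= needed_bytes


if __name__ == "__main__":
    # Report, for the current directory, which datasets would fit.
    for name in DATASET_SIZE_GB:
        status = "ok" if has_room(name) else "not enough free space"
        print(f"{name}: {status}")
```

Pass the directory where the data will actually be stored as `path`; the current directory is only a placeholder here.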

ETHZ

../_images/ethz_mapplot.png

The ETHZ benchmark dataset contains regional seismicity recorded on publicly available networks throughout the Switzerland region. For more information see: SED website.

The dataset contains 36,743 waveform examples.

Warning

Dataset size: waveforms.hdf5 ~22 GB, metadata.csv ~13 MB.

GEOFON

../_images/geofon_mapplot.png

As part of its effort to measure and characterize relevant seismicity worldwide in real time, the GEOFON earthquake monitoring service acquires and analyses waveform data from over 800 globally distributed seismic stations. Besides automatic processing, manual analysis, especially onset re-picking, is performed routinely whenever necessary to improve the location quality. Usually only a few picks are manually re-picked. However, in order to collect reference picks to improve automatic picking, the P arrivals of some events are comprehensively re-picked by an experienced analyst, irrespective of the presence or quality of automatic picks. For local and near-regional events, S onsets have also been picked, and for a small fraction both Pn and Pg are included. For teleseismic events, almost no S onsets have been picked. Depth phases have been picked occasionally but not comprehensively. In total, there are ~275,000 waveform examples. The magnitudes of the events in this dataset range from about 2 to 9, with the bulk of the manually picked events being intermediate to large events (M 5-7). Regional events with smaller magnitudes are mostly in Europe and northern Chile. The time range covers 2009 to 2013.

Warning

Dataset size: waveforms.hdf5 ~25.8 GB, metadata.csv ~99 MB.

Citation

Citation information will be added.

INSTANCE

../_images/instance_mapplot.png

The INSTANCE benchmark dataset is a dataset of signals compiled by the Istituto Nazionale di Geofisica e Vulcanologia (INGV). It contains ~1.2 million 3C waveform traces, which record ~50,000 earthquakes and include ~130,000 noise traces. Event magnitudes range from 0 to 6.5. For ease of use, the dataset is split into noise examples (InstanceNoise), waveform examples in counts (InstanceCounts), and waveform examples in ground motion units (InstanceGM). A combined dataset containing the noise examples and the waveform examples in counts is also available (InstanceCountsCombined).

Warning

Dataset size:

  • waveforms (counts) ~160 GB

  • waveforms (ground motion units) ~310 GB

Citation

Michelini, A., Cianetti, S., Gaviano, S., Giunchi, C., Jozinović, D., & Lauciani, V. (2021). INSTANCE - The Italian Seismic Dataset For Machine Learning. Istituto Nazionale di Geofisica e Vulcanologia (INGV).

https://doi.org/10.13127/INSTANCE

Iquique

../_images/iquique_mapplot.png

The Iquique dataset contains 13,400 examples of picked arrivals from the aftershock sequence following the Mw 8.1 Iquique earthquake, which occurred in northern Chile in 2014. All stations are 3-component stations sampled at 100 Hz. The waveforms contain examples of earthquakes only.
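Since every station in this dataset samples at 100 Hz, a pick expressed as a sample index converts to a time offset from the trace start by dividing by 100. The sketch below illustrates the conversion in both directions. It assumes picks are stored as sample indices relative to the trace start; the function names and the example values are hypothetical.

```python
# Sampling rate stated for all Iquique stations.
SAMPLING_RATE_HZ = 100.0


def sample_to_seconds(sample_index, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert a sample index into seconds after the trace start."""
    return sample_index / sampling_rate_hz


def seconds_to_sample(seconds, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert seconds after the trace start into the nearest sample index."""
    return int(round(seconds * sampling_rate_hz))


if __name__ == "__main__":
    # Hypothetical pick at sample 250 of a trace.
    print(sample_to_seconds(250))
    print(seconds_to_sample(2.5))
```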

Warning

Dataset size: waveforms.hdf5 ~5 GB, metadata.csv ~2.6 MB.

Citation

Woollam, J., Rietbrock, A., Bueno, A. and De Angelis, S., 2019. Convolutional neural network for seismic phase classification, performance demonstration over a local seismic network. Seismological Research Letters, 90(2A), pp.491-502. https://doi.org/10.1785/0220180312

ISC-EHB Depth Phases

../_images/isc_ehb_mapplot.png

The ISC_EHB_DepthPhases dataset contains 44,106 events and 174,436 traces. It contains traces with depth phase readings (pP, sP and pwP) from the ISC-EHB bulletin. Additional picks have been annotated on the traces, if they were contained in the bulletin.

Citation

Münchmeyer, J., Saul, J., Tilmann, F. (2023). Learning the Deep and the Shallow: Deep‐Learning‐Based Depth Phase Picking and Earthquake Depth Estimation. Seismological Research Letters. https://doi.org/10.1785/0220230187

LENDB

../_images/lendb_mapplot.png

The LENDB dataset is a published benchmark dataset (see citation below) of local earthquakes recorded across a global set of 3-component seismic stations. The entire dataset comprises ~1.25 million waveform examples, recorded on 1487 individual 3-component stations. There are ~305,000 local earthquake examples and ~618,000 noise examples. For more information regarding the benchmark dataset, please refer to the original reference below.

Warning

Dataset size: waveforms.hdf5 ~20 GB, metadata.csv ~218 MB.

Citation

Magrini, Fabrizio, Jozinović, Dario, Cammarano, Fabio, Michelini, Alberto, & Boschi, Lapo. (2020). LEN-DB - Local earthquakes detection: a benchmark dataset of 3-component seismograms built on a global scale.

LFE stack datasets

../_images/lfe_stacks_mapplot.png

SeisBench contains three datasets with stacked waveforms of low-frequency earthquakes (LFEs).

Note that in addition to the regular pick columns, the datasets contain predicted arrival times in the trace_*_predicted_arrival_sample column.
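The predicted-arrival columns can be picked out of the metadata by matching column names against the `trace_*_predicted_arrival_sample` pattern. The sketch below does this with the standard-library `fnmatch` module on a plain list of names. The column names in the example are hypothetical placeholders, and the helper name `predicted_arrival_columns` is an assumption of this sketch.

```python
from fnmatch import fnmatchcase

# Column name pattern quoted in the note above.
PREDICTED_PATTERN = "trace_*_predicted_arrival_sample"


def predicted_arrival_columns(columns, pattern=PREDICTED_PATTERN):
    """Return the column names that match the predicted-arrival pattern."""
    return [name for name in columns if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    example_columns = [
        "trace_name",
        "trace_p_arrival_sample",
        "trace_p_predicted_arrival_sample",
        "trace_s_predicted_arrival_sample",
    ]
    print(predicted_arrival_columns(example_columns))
```

The same function can be applied to the column names of a metadata table, for example by passing a list of its column labels.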

Citation

Münchmeyer, J., Giffard-Roisin, S., Malfante, M., Frank, W., Poli, P., Marsan, D., Socquet A. (2024). Deep learning detects uncataloged low-frequency earthquakes across regions. Seismica.

MLAAPDE

../_images/mlaapde_mapplot.png

The MLAAPDE dataset is a global, mostly teleseismic dataset with detailed phase annotations. It contains 1.9 million phase labels. Most labels are P arrivals, with some labels for more detailed phases.

Citation

Cole, H. M., Yeck, W. L., & Benz, H. M. (2023). MLAAPDE: A Machine Learning Dataset for Determining Global Earthquake Source Parameters. Seismological Research Letters, 94(5), 2489-2499. https://doi.org/10.1785/0220230021

Cole H. M. and W. L. Yeck, 2022, Global Earthquake Machine Learning Dataset: Machine Learning Asset Aggregation of the PDE (MLAAPDE): U.S. Geological Survey data release. https://doi.org/10.5066/P96FABIB

NEIC

The National Earthquake Information Center (NEIC) benchmark dataset comprises ~1.3 million seismic phase arrivals with global source-station paths. Because the trace start times and station information are missing for this dataset, it is stored in the SeisBench format without this normally required information.

Warning

The NEIC dataset has been superseded by the more comprehensive MLAAPDE dataset. Unless you are aiming for exact comparison to previous work, we recommend using the MLAAPDE dataset instead. This dataset is larger and contains more comprehensive metadata.

Citation

Yeck, W. L., Patton, J. M., Ross, Z. E., Hayes, G. P., Guy, M. R., Ambruz, N. B., Shelly, D. R., Benz, H. M., Earle, P. S., (2021) Leveraging Deep Learning in Global 24/7 Real-Time Earthquake Monitoring at the National Earthquake Information Center.

https://doi.org/10.1785/0220200178

OBS

../_images/obs_mapplot.png

The ocean-bottom seismometer (OBS) benchmark dataset (OBS) comprises ~110,000 seismic waveforms with ~150,000 manually labeled phase arrivals. The data comprises 15 deployments with a total of 355 stations across different tectonic settings.

Citation

Bornstein, T., Lange, D., Münchmeyer, J., Woollam, J., Rietbrock, A., Barcheck, G., Grevemeyer, I., Tilmann, F. (2023). PickBlue: Seismic phase picking for ocean bottom seismometers with deep learning. Earth and Space Science.

http://doi.org/10.1029/2023EA003332

OBST2024

../_images/obst2024_mapplot_small.jpeg

The OBST dataset (OBST2024) comprises ~60,000 seismic waveforms from ocean-bottom seismometers (OBS). These are split into ~35,000 earthquake waveforms and ~25,000 noise waveforms. For each earthquake waveform, P and S arrival times have been annotated. The data comprises 11 deployments across different tectonic settings.

Citation

Niksejel, A. and Zhang, M. (2024). OBSTransformer: a deep-learning seismic phase picker for OBS data using automated labelling and transfer learning. Geophysical Journal International.

https://doi.org/10.1093/gji/ggae049

PNW

../_images/pnw_mapplot.png

An ML-ready curated dataset covering a wide range of sources from the Pacific Northwest (PNW). The PNW dataset is made up of several separate datasets:

  • PNW contains waveforms from earthquakes and explosions (comcat events) recorded on velocity channels (EH, HH and BH).

  • PNWAccelerometers contains waveforms from earthquakes and explosions (comcat events) recorded on accelerometers (EN).

  • PNWNoise contains noise waveforms.

  • PNWExotic contains waveforms of exotic events (surface events, thunder quakes, sonic booms, etc.).
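For earthquake and explosion waveforms, the channel codes above determine which PNW subset a recording belongs to. The following sketch encodes that rule as a lookup on the first two letters of a channel code. The function name and the three-letter example codes (such as "HHZ") are hypothetical; only the two-letter prefixes and the dataset names come from the description above. The rule does not cover PNWNoise or PNWExotic, which are distinguished by event type and not by channel.

```python
# Two-letter channel prefixes and the PNW subset they are described as belonging to.
CHANNEL_PREFIX_TO_SUBSET = {
    "EH": "PNW",                # velocity channels
    "HH": "PNW",
    "BH": "PNW",
    "EN": "PNWAccelerometers",  # accelerometer channels
}


def pnw_subset_for_channel(channel_code):
    """Return the PNW subset name for a channel code, or None if it is not listed."""
    prefix = channel_code[:2].upper()
    return CHANNEL_PREFIX_TO_SUBSET.get(prefix)


if __name__ == "__main__":
    # Hypothetical three-letter channel codes.
    for code in ("HHZ", "ENZ", "LHZ"):
        print(code, "->", pnw_subset_for_channel(code))
```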

For more information see: PNW-ML.

Citation

Ni, Y., Hutko, A., Skene, F., Denolle, M., Malone, S., Bodin, P., Hartog, R., & Wright, A. (2023). Curated Pacific Northwest AI-ready Seismic Dataset. Seismica, 2(1).

https://doi.org/10.26443/seismica.v2i1.368

SCEDC

../_images/scedc_mapplot.png

The SCEDC benchmark dataset contains all publicly available recordings of seismic events in the Southern California Seismic Network that were manually picked between 2000 and 2020. It contains ~8,100,000 waveform examples.

Warning

Dataset size: waveforms.hdf5 ~660 GB, metadata.csv ~2.2 GB.

Citation

SCEDC (2013): Southern California Earthquake Center.

https://doi.org/10.7909/C3WD3xH1

STEAD

../_images/stead_mapplot.png

The STEAD dataset is a published benchmark dataset (see citation below) of local seismic signals, both earthquake and non-earthquake, along with noise examples. In total there are ~1.2 million time series, of which ~100,000 are noise examples and the remainder contain seismic arrivals. The dataset contains 450,000 earthquakes.

Warning

Dataset size: waveforms.hdf5 ~70 GB, metadata.csv ~200 MB.

Citation

Mousavi, S. M., Sheng, Y., Zhu, W., Beroza G.C., (2019). STanford EArthquake Dataset (STEAD): A Global Data Set of Seismic Signals for AI, IEEE Access.

https://doi.org/10.1109/ACCESS.2019.2947848

TXED

../_images/txed_mapplot.png

The TXED dataset is a benchmark dataset of local seismic signals in the state of Texas. In total there are ~500,000 time series encompassing 20,000 earthquakes (~300,000 traces) and noise traces (~200,000 traces).

Warning

Dataset size: waveforms.hdf5 ~70 GB, metadata.csv ~120 MB.

Citation

Chen, Y., Savvaidis, A., Saad, O. M., Huang, G.-C. D., Siervo, D., O’Sullivan, V., McCabe, C., Uku, B., Fleck, P., Burke, G., Alvarez, N. L., Domino, J., Grigoratos, I. (2024). TXED: The Texas Earthquake Dataset for AI. Seismological Research Letters.

https://doi.org/10.1785/0220230327