Project Statement
Progress in High-Performance Computing in general, and in High-Performance Graph Processing in particular, depends heavily on the availability of publicly accessible, relevant, and realistic datasets.
We (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available real-world graph datasets with up to 2.5 trillion edges, that is, 6.6 times larger than the largest graph published recently.
The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types.
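To give a feel for the scale behind these figures, the following is a rough back-of-the-envelope sketch in Python. It uses only the rounded numbers quoted above and assumes that each protein sequence corresponds to one vertex and that "6.6 times" is a plain ratio of edge counts; the function names are illustrative and are not part of any MS-BioGraphs tooling.

```python
# Rounded figures taken from the project statement above.
NUM_SEQUENCES = 1_700_000_000        # protein sequences that are matched
NUM_EDGES = 2_500_000_000_000        # edges in the largest graph
SIZE_RATIO = 6.6                     # stated ratio to the previous largest graph


def unordered_pairs(n: int) -> int:
    """Number of distinct sequence pairs in a naive all-to-all comparison."""
    return n * (n - 1) // 2


def average_edges_per_vertex(edges: int, vertices: int) -> float:
    """Average number of edges per vertex (assumes one vertex per sequence)."""
    return edges / vertices


def implied_previous_edges(edges: int, ratio: float) -> float:
    """Edge count of the previous largest graph implied by the stated ratio."""
    return edges / ratio


if __name__ == "__main__":
    print("candidate pairs:", unordered_pairs(NUM_SEQUENCES))
    print("avg edges/vertex:", average_edges_per_vertex(NUM_EDGES, NUM_SEQUENCES))
    print("implied previous largest:", implied_previous_edges(NUM_EDGES, SIZE_RATIO))
```

The pair count only illustrates why all-to-all matching at this scale is an HPC problem; it does not describe how the actual alignment pipeline prunes or schedules comparisons.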
Project Steps
(i) Creating and validating the MS-BioGraphs by designing and engineering parallel and distributed algorithms and data structures to optimize performance and cluster utilization,
(ii) Extending the WebGraph framework with parallel compression for MS-BioGraphs as edge-weighted graphs, and
(iii) Analyzing the structural characteristics of MS-BioGraphs by comparing them with other real-world graphs.
Datasets on IEEE DataPort
DOI: https://doi.org/10.21227/gmd9-1534
Validation & Sample Code
Please visit the MS-BioGraphs Validation post.
The ParaGrapher library (https://blogs.qub.ac.uk/DIPSA/ParaGrapher/) may be used to access MS-BioGraphs in C/C++.
Datasets, Source Code, and Publications
- Minimum Spanning Forest of MS-BioGraphs

- MS-BioGraphs on IEEE DataPort

- ParaGrapher Source Code For WebGraph Types

- On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets - BigData'23 (Short Paper)

- Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs - IISWC'23 (Poster)

- MS-BioGraphs: Sequence Similarity Graph Datasets

- MS-BioGraphs MS

- MS-BioGraphs MSA500

- MS-BioGraphs MS200

- MS-BioGraphs MSA200

- MS-BioGraphs MS50

- MS-BioGraphs MSA50

- MS-BioGraphs MSA10

- MS-BioGraphs MS1

- MS-BioGraphs Validation

Project Members
- Mohsen Koohi Esfahani
- Sebastiano Vigna, Università degli Studi di Milano
- Paolo Boldi, Università degli Studi di Milano
- Hans Vandierendonck
- Peter Kilpatrick
Naming
The name of each graph starts with the two characters M and S, the initials of Metaclust (the source dataset) and Sequence similarity (the real-world domain of the graph), respectively. The names of the directed subgraphs have a third character, A, which indicates that the graph is asymmetric. The name of each subgraph ends with up to 3 digits that give the size of the subgraph relative to the MS graph, multiplied by a thousand. For example, MSA500 is a directed subgraph about half the size of MS.
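The naming rule can be sketched as a small Python parser. This is a minimal illustration written for this page, not code from the MS-BioGraphs or ParaGrapher sources; `GraphName` and `parse_graph_name` are hypothetical names. Because the rule above only describes the direction marker for subgraphs, the sketch leaves the direction of the full MS graph undetermined.

```python
import re
from typing import NamedTuple, Optional


class GraphName(NamedTuple):
    """Information encoded in an MS-BioGraphs name (hypothetical helper type)."""
    asymmetric: Optional[bool]   # None when the name does not encode direction
    relative_size: float         # size relative to the full MS graph


# "MS", optional "A" for asymmetric, optional 1 to 3 digits (relative size x 1000).
_NAME_PATTERN = re.compile(r"MS(A?)(\d{1,3})?")


def parse_graph_name(name: str) -> GraphName:
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not an MS-BioGraphs style name: {name!r}")
    marker, digits = match.groups()
    if digits is None:
        if marker:
            # "MSA" without digits is not covered by the naming rule.
            raise ValueError(f"subgraph name without size digits: {name!r}")
        # The full graph: relative size 1, direction not encoded in the name.
        return GraphName(asymmetric=None, relative_size=1.0)
    return GraphName(asymmetric=(marker == "A"), relative_size=int(digits) / 1000)


if __name__ == "__main__":
    for graph in ["MS", "MSA500", "MS200", "MSA200", "MS50", "MSA50", "MSA10", "MS1"]:
        print(graph, parse_graph_name(graph))
```

The relative size recovered this way is only the rounded value encoded in the name; the exact vertex and edge counts are given in the dataset descriptions.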
Grants and Funding
- Kelvin-2 supercomputer (UKRI EPSRC grant EP/T022175/1)
- PhD scholarship from The Department for the Economy, Northern Ireland and QUB
- Energy Efficient Transprecision Techniques for Linear system Solvers
- SERICS project (PE00000014) under the NRRP MUR program funded by the EU - NGEU
License
The datasets are published under the CC BY-NC-SA license.
QUB IDF: 2223-052
Last update: Sep. 13th, 2024
Acknowledgements
We are grateful to
- IEEE DataPort
- Sean McKeever, Head of IT, EEECS, QUB and his team
- Ian Overton, School of Medicine, Dentistry and Biomedical Sciences, QUB
- Vaughan Purnell, Head of NI-HPC and his team
- Jesus Martinez-del-Rincon and SPRC committee, EEECS, QUB
- Martin Frith, University of Tokyo
- Ariful Azad, Indiana University
- Eurcom, France
- Unsplash, Pixabay, and Plotly