Luca Cappelletti

Luca Cappelletti · 2026-04-02T09:42:18.820Z

At Earth Metabolome Initiative I work with both extremes: networks with billions of edges, where asymptotic complexity is often a valid mental shortcut, and many molecular graphs, most with fewer than 50 nodes, where constant factors dominate. Using Criterion, I benchmarked four maximum matching algorithms implemented in our geometric-traits Rust crate: Blossom (1965), Gabow (1976), Micali-Vazirani (1980), and Hopcroft-Karp (1973). Gabow 1976, O(V^3), wins 184 of 289 configurations and is the fastest algorithm up to V ~ 500. Micali-Vazirani, O(E sqrt(V)), overtakes it only on sparse graphs past V ~ 1,000. On molecular-sized inputs the cubic algorithm with smaller constants is consistently faster than the sub-cubic alternative. Raw benchmarks are on Zenodo (doi:10.5281/zenodo.19164092); code and full report are at https://lnkd.in/d6bdWR_z.
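
The trade-off between asymptotic complexity and constant factors can be sketched with a toy cost model. This is only an illustration, not the benchmark: the function names are hypothetical, and the constants (1 for the cubic model, 10,000 for the O(E sqrt(V)) model) are invented rather than measured. The sketch simply finds the first power-of-two V at which the sub-cubic model becomes cheaper on a sparse graph.

```rust
/// Toy cost of a cubic algorithm: `c * V^3`. `c` is an assumed constant.
fn cubic_cost(v: f64, c: f64) -> f64 {
    c * v.powi(3)
}

/// Toy cost of an O(E sqrt(V)) algorithm: `c * E * sqrt(V)`.
fn e_sqrt_v_cost(v: f64, e: f64, c: f64) -> f64 {
    c * e * v.sqrt()
}

/// First power-of-two V (up to `max_v`) at which the sub-cubic model is
/// cheaper, for a graph with `avg_degree * V / 2` edges.
fn crossover(c_cubic: f64, c_sub: f64, avg_degree: f64, max_v: u64) -> Option<u64> {
    let mut v = 2u64;
    while v <= max_v {
        let vf = v as f64;
        let e = avg_degree * vf / 2.0;
        if e_sqrt_v_cost(vf, e, c_sub) < cubic_cost(vf, c_cubic) {
            return Some(v);
        }
        v *= 2;
    }
    None
}

fn main() {
    // Invented constants: the sub-cubic model pays 10,000x more per unit of work.
    let (c_cubic, c_sub) = (1.0, 10_000.0);
    // Sparse graph with average degree 2, so E = V.
    match crossover(c_cubic, c_sub, 2.0, 1 << 20) {
        Some(v) => println!("sub-cubic model wins from V = {v}"),
        None => println!("cubic model wins over the whole range"),
    }
}
```

With these made-up constants the cubic model stays cheaper for every V below a few hundred, which is the regime molecular graphs live in; real crossover points depend on the implementation and must be measured.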

Milan, Lombardy, Italy
2,161 followers · 500+ connections


Université de Fribourg - Universität Freiburg

Università degli Studi di Milano

Personal website

About

Software engineer with 7+ years of experience and currently Postdoctoral Researcher at…

Activity

First PubChem-wide topological run: I was prepared for an all-nighter execution, but it finished before I was done brushing my teeth. Open sourced on GitHub…

Shared by Luca Cappelletti

I have long-running experiments that periodically publish results on Zenodo, so instead of re-implementing the same requests yet again, I wrote…

Shared by Luca Cappelletti

Experience

Université de Fribourg - Universität Freiburg

Fribourg, Fribourg, Switzerland

Berkeley, California, United States

Milan Area, Italy

Berkeley, California, United States

Milan Area, Italy

Milan Area, Italy

Education

Università degli Studi di Milano

2019 - 2023

I focused on developing and deploying scalable ML solutions to complex real-world problems during my PhD in Computer Science at the University of Milan. Some of my notable projects include:

• My thesis involved creating ML algorithms for large-scale graph data analysis with applications in precision medicine
• Neural networks for genomic gap-filling
• Chest X-ray pipelines for normalization and bias detection
• ML for solvency prediction
2017 - 2018
2013 - 2017

Licenses and certifications

Cambridge English: Advanced (CAE) (A grade - C2)

Cambridge Assessment English

Issued: Oct 2017

Credential ID 0059710474

IGCSE ICT

University of Cambridge

Issued: Nov 2012

Credential ID 0039028104

ECDL

Associazione Italiana per l'Informatica e il Calcolo Automatico

Issued: May 2011

Credential ID IT1759523

Publications

GRAPE for fast and scalable graph processing and random-walk-based embedding

Nature Publishing Group US · 1 June 2023

Graph representation learning methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge- and node-label prediction performance. GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources. Standardized interfaces allow a seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of graph-representation-learning methods, therefore also positioning GRAPE as a software resource that performs a fair comparison between methods and libraries for graph processing and embedding.

Billion-scale Detection of Isomorphic Nodes

IEEE · 15 May 2023

This paper presents an algorithm for detecting attributed high-degree node isomorphism. High-degree isomorphic nodes seldom happen by chance and often represent duplicated entities or data processing errors. By definition, isomorphic nodes are topologically indistinguishable and can be problematic in graph ML tasks. The algorithm employs a parallel, “degree-bounded” approach that fingerprints each node’s local properties through a hash, which constrains the search to nodes within hash-defined buckets, thus minimising the number of comparisons. This method scales on graphs with billions of nodes and edges. Finally, we provide isomorphic node oddities identified in real-world data.

Semi-automatic Column Type Inference for CSV Table Understanding

Springer International Publishing · 11 January 2021

Spreadsheets are often used as a simple way of representing tabular data. However, since they do not impose any restriction on their table structures and contents, their automatic processing and their integration with other information sources are particularly hard problems to solve. Many table understanding approaches have been proposed for extracting data from tables and transforming them into meaningful information. However, they require some regularities in the table contents.

KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response

Elsevier Patterns · 9 November 2020

Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically for different tasks; the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework also can be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.

Explainable Machine Learning for Early Assessment of COVID-19 Risk Prediction in Emergency Departments

IEEE Access · 26 October 2020

We describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train a RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.

Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks - A Case Study on Genome Gap-Filling

MDPI - Computers · 11 May 2020

Missing data imputation has been a hot topic in the past decade, and many state-of-the-art works have been presented to propose novel, interesting solutions that have been applied in a variety of fields. In the past decade, the successful results achieved by deep learning techniques have opened the way to their application for solving difficult problems where human skill is not able to provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is high dimensional data belonging to datasets with huge cardinality and describing complex problems. Precisely, they often need critical parameters to be manually set or exploit complex architecture and/or training phases that make their computational load impracticable. In this paper, after clustering the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests on genome sequences show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.

Bayesian Optimization Improves Tissue-Specific Prediction of Active Regulatory Regions with Deep Neural Networks

International Work-Conference on Bioinformatics and Biomedical Engineering · 6 April 2020

The annotation and characterization of tissue-specific cis-regulatory elements (CREs) in non-coding DNA represents an open challenge in computational genomics. Several prior works show that machine learning methods, using epigenetic or spectral features directly extracted from DNA sequences, can predict active promoters and enhancers in specific tissues or cell lines. In particular, very recently deep-learning techniques obtained state-of-the-art results in this challenging computational task. In this study, we provide additional evidence that Feed Forward Neural Networks (FFNN) trained on epigenetic data and one-dimensional convolutional neural networks (CNN) trained on DNA sequence data can successfully predict active regulatory regions in different cell lines. We show that model selection by means of Bayesian optimization applied to both FFNN and CNN models can significantly improve deep neural network performance, by automatically finding models that best fit the data. Further, we show that techniques applied to balance active and non-active regulatory regions in the human genome in training and test data may lead to over-optimistic or poor predictions. We recommend to use actual imbalanced data that was not used to train the models for evaluating their generalization performance.

On the Quality of Classification Models for Inferring ABAC Policies from Access Logs

IEEE International Conference on Big Data (Big Data) · 12 December 2019

The attribute-based access control (ABAC) model has been gaining popularity in recent years because of its advantages in granularity, flexibility, and usability. A few approaches based on association rule mining have been proposed for the automatic generation of ABAC policies from access logs. Their aim is the identification of policies that do not overfit the training data, are not too general and thus do not disclose sensitive resources to everyone, and are interpretable by humans. The large ABAC privilege space, along with the sparsity and unbalanced distribution of the available logs, makes the solution of this task particularly complex, and current approaches have different limitations. In this paper, we compare different symbolic and non-symbolic machine learning (ML) techniques for inferring ABAC policies and discuss their pros and cons. Based on experimental results on a toy dataset and on a real dataset, we argue that the best technique depends on the characteristics of the considered data. When the data are highly separable according to PCA and t-SNE decomposition, the quality of the obtained ABAC policies is higher and the policies are also easily interpretable. By contrast, when this property does not hold, the quality of the obtained policies is low; in this case, non-symbolic ML techniques show better results than the symbolic ones.

A neural model for the prediction of pathogenic genomic variants in Mendelian diseases

Proc. of the International Conference on Advances in Signal Processing and Artificial Intelligence (ASPAI 2019, Barcelona, Spain), Pages 34-38. 2019

The detection of pathogenic genomic variants associated with genetic or cancer diseases represents an open problem in the context of Genomic Medicine. In particular, the detection of mutations in the non-coding regions of human genome represents a particularly challenging machine learning problem, since the number of neutral variants largely outnumber the pathogenic ones, thus resulting in highly imbalanced classification problems. We applied neural networks to the detection of pathogenic regulatory genomic variants in Mendelian diseases and we showed that leveraging imbalance-aware techniques and deep learning algorithms, we can obtain state-of-the-art results, using a less complex model than those proposed in the literature for this challenging prediction task.

Training neural networks with balanced mini-batch to improve the prediction of pathogenic genomic variants in Mendelian diseases

Sensors & Transducers, 234 (6), Pages 16-21, 2019

Known pathogenic variants associated with genetic Mendelian diseases represent a tiny minority of the overall genetic variation that characterizes the human genome. In this context classical imbalance-aware machine learning methods are unable to distinguish pathogenic from benign variants, since they are severely biased toward the majority (benign) class. Recent works based on ensemble and hyper-ensemble methods showed that by adopting sampling techniques we can significantly improve performance on this challenging task. Inspired by these findings and by recent successful applications of deep learning to Precision Medicine, we propose two learning techniques for neural networks designed to assure a certain balancing between pathogenic and benign variants during the training phase, or to assure that with high probability at least one pathogenic variant is included in the training mini-batch set of examples. The experimental prediction of non-coding mutations associated with Mendelian diseases shows the effectiveness of these proposed neural network training approaches.


Courses

Data Mining and Business Intelligence course, held at NextInt

Projects

🍇 GRAPE: Rust/Python Graph Representation Learning library for Predictions and Evaluations

Jan 2020

Developed at AnacletoLAB, University of Milan, in collaboration with the Robinson Lab at Jackson Laboratory for Genomic Medicine and BBOP at Lawrence Berkeley National Laboratory, GRAPE stands as a state-of-the-art graph processing and embedding library. Crafted in Rust and Python, GRAPE boasts scalability for handling large-scale graphs efficiently. Its core components, Ensmallen and Embiggen, ensure streamlined graph processing and advanced representation learning, making it an indispensable tool for tackling complex data challenges.

TwitchTimer

Jul 2016 - Oct 2019
A website integrated with Twitch and StreamLabs, built from the ground up to help streamers automate the monetization rewards process.

Languages

English

Full professional proficiency
Italian

Native or bilingual proficiency

Recommendations received

8 people have written a recommendation for Luca



Recommendations received from: J. Harry Caufield, Sara Bonfitto
