“Luca is an absolutely phenomenal software engineer and Graph Wizard. We've worked together on several different projects, all involving working with knowledge graphs of varying sizes, and Luca has consistently found efficient solutions for data analysis problems. Those problems are usually "this graph is too big and our software takes too long to digest it" -- Luca laughs at these problems and finds ways to make large graph analysis doable. He has great intuition and notable research insights. Overall, just fantastic to work with.”
Luca Cappelletti
Milan, Lombardy, Italy
2,161 followers
500+ connections
About
Software engineer with 7+ years of experience and currently Postdoctoral Researcher at…
Activity
-
First PubChem-wide topological run, was prepared for an all-nighter execution, finished before I was done brushing my teeth. Open sourced on GitHub…
Shared by Luca Cappelletti
-
I have long-running experiments that periodically publish results on Zenodo, so instead of re-implementing the same requests yet again, I wrote…
Shared by Luca Cappelletti
-
At Earth Metabolome Initiative I work with both extremes: networks with billions of edges, where asymptotic complexity is often a valid mental…
Shared by Luca Cappelletti
Experience
Education
-
Università degli Studi di Milano
-
I focused on developing and deploying scalable ML solutions to complex real-world problems during my PhD in Computer Science at the University of Milan. Some of my notable projects include:
• My thesis involved creating ML algorithms for large-scale graph data analysis with applications in precision medicine
• Neural networks for genomic gap-filling
• Chest X-ray pipelines for normalization and bias detection
• ML for solvency prediction
Licenses and certifications
Publications
-
GRAPE for fast and scalable graph processing and random-walk-based embedding
Nature Publishing Group US
Graph representation learning methods opened new avenues for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are beyond the capabilities of current methods and software implementations. We present GRAPE (Graph Representation Learning, Prediction and Evaluation), a software resource for graph processing and embedding that is able to scale with big graphs by using specialized and smart data structures, algorithms, and a fast parallel implementation of random-walk-based methods. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as competitive edge- and node-label prediction performance. GRAPE comprises approximately 1.7 million well-documented lines of Python and Rust code and provides 69 node-embedding methods, 25 inference models, a collection of efficient graph-processing utilities, and over 80,000 graphs from the literature and other sources. Standardized interfaces allow a seamless integration of third-party libraries, while ready-to-use and modular pipelines permit an easy-to-use evaluation of graph-representation-learning methods, therefore also positioning GRAPE as a software resource that performs a fair comparison between methods and libraries for graph processing and embedding.
-
Billion-scale Detection of Isomorphic Nodes
IEEE
This paper presents an algorithm for detecting attributed high-degree node isomorphism. High-degree isomorphic nodes seldom happen by chance and often represent duplicated entities or data processing errors. By definition, isomorphic nodes are topologically indistinguishable and can be problematic in graph ML tasks. The algorithm employs a parallel, “degree-bounded” approach that fingerprints each node’s local properties through a hash, which constrains the search to nodes within hash-defined buckets, thus minimising the number of comparisons. This method scales on graphs with billions of nodes and edges. Finally, we provide isomorphic node oddities identified in real-world data.
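The bucketing idea described in this abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the fingerprint formula, and the toy data are assumptions made for illustration, and "isomorphic" is simplified here to "same node type and exactly the same neighbour set".

```python
from collections import defaultdict


def isomorphic_node_groups(adjacency, node_types, min_degree=2):
    """Group nodes sharing the same type and the same neighbour set.

    Hypothetical helper: `adjacency` maps each node to a set of neighbours,
    `node_types` maps each node to a type label.
    """
    buckets = defaultdict(list)
    for node, neighbours in adjacency.items():
        if len(neighbours) < min_degree:
            continue  # degree bound: ignore low-degree nodes
        # Cheap, order-independent fingerprint of the node's local properties.
        fingerprint = hash(
            (node_types[node], len(neighbours), sum(hash(n) for n in neighbours))
        )
        buckets[fingerprint].append(node)

    groups = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue  # a node alone in its bucket has no possible match
        # Exact comparison only inside a bucket, to rule out hash collisions.
        exact = defaultdict(list)
        for node in candidates:
            exact[(node_types[node], frozenset(adjacency[node]))].append(node)
        groups.extend(sorted(g) for g in exact.values() if len(g) > 1)
    return sorted(groups)


# Toy data (made up for this example).
adjacency = {
    "a": {"x", "y", "z"},
    "b": {"x", "y", "z"},
    "c": {"x", "y"},
    "d": {"x", "y", "z"},
}
node_types = {"a": "gene", "b": "gene", "c": "gene", "d": "protein"}
groups = isomorphic_node_groups(adjacency, node_types, min_degree=3)
print(groups)  # [['a', 'b']]
```

Because exact comparisons happen only within a bucket, the number of pairwise checks depends on bucket sizes rather than on the total number of nodes.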
-
Semi-automatic Column Type Inference for CSV Table Understanding
Springer International Publishing
Spreadsheets are often used as a simple way for representing tabular data. However, since they do not impose any restriction on their table structures and contents, their automatic processing and the integration with other information sources are particularly hard problems to solve. Many table understanding approaches have been proposed for extracting data from tables and transforming them in meaningful information. However, they require some regularities on the table contents.
-
KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response
Elsevier Patterns
Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically for different tasks; the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework also can be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.
-
Explainable Machine Learning for Early Assessment of COVID-19 Risk Prediction in Emergency Departments
IEEE Access
We describe a computerized system primarily aimed at extracting the most relevant radiological, clinical, and laboratory variables for improving patient risk prediction, and secondarily at presenting an explainable machine learning system, which may provide simple decision criteria to be used by clinicians as a support for assessing patient risk. To achieve robust and reliable variable selection, Boruta and Random Forest (RF) are combined in a 10-fold cross-validation scheme to produce a variable importance estimate not biased by the presence of surrogates. The most important variables are then selected to train a RF classifier, whose rules may be extracted, simplified, and pruned to finally build an associative tree, particularly appealing for its simplicity. Results show that the radiological score automatically computed through a neural network is highly correlated with the score computed by radiologists, and that laboratory variables, together with the number of comorbidities, aid risk prediction. The prediction performance of our approach was compared to that of generalized linear models and shown to be effective and robust. The proposed machine learning-based computational system can be easily deployed and used in emergency departments for rapid and accurate risk prediction in COVID-19 patients.
-
Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks - A Case Study on Genome Gap-Filling
MDPI - Computers
Missing data imputation has been a hot topic in the past decade, and many state-of-the-art works have been presented to propose novel, interesting solutions that have been applied in a variety of fields. In the past decade, the successful results achieved by deep learning techniques have opened the way to their application for solving difficult problems where human skill is not able to provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is high dimensional data belonging to datasets with huge cardinality and describing complex problems. Precisely, they often need critical parameters to be manually set or exploit complex architecture and/or training phases that make their computational load impracticable. In this paper, after clustering the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests on genome sequences show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.
-
Bayesian Optimization Improves Tissue-Specific Prediction of Active Regulatory Regions with Deep Neural Networks
International Work-Conference on Bioinformatics and Biomedical Engineering
The annotation and characterization of tissue-specific cis-regulatory elements (CREs) in non-coding DNA represents an open challenge in computational genomics. Several prior works show that machine learning methods, using epigenetic or spectral features directly extracted from DNA sequences, can predict active promoters and enhancers in specific tissues or cell lines. In particular, very recently deep-learning techniques obtained state-of-the-art results in this challenging computational task. In this study, we provide additional evidence that Feed Forward Neural Networks (FFNN) trained on epigenetic data and one-dimensional convolutional neural networks (CNN) trained on DNA sequence data can successfully predict active regulatory regions in different cell lines. We show that model selection by means of Bayesian optimization applied to both FFNN and CNN models can significantly improve deep neural network performance, by automatically finding models that best fit the data. Further, we show that techniques applied to balance active and non-active regulatory regions in the human genome in training and test data may lead to over-optimistic or poor predictions. We recommend to use actual imbalanced data that was not used to train the models for evaluating their generalization performance.
-
On the Quality of Classification Models for Inferring ABAC Policies from Access Logs
IEEE International Conference on Big Data (Big Data)
The attribute-based access control (ABAC) model has been gaining popularity in recent years because of its advantages in granularity, flexibility, and usability. Few approaches based on association rules mining have been proposed for the automatic generation of ABAC policies from access logs. Their aim is the identification of policies that do not overfit the training data, are not too general and thus do not disclose sensitive resources to everyone, and are interpretable by humans. The large ABAC privilege space, along with the sparsity and unbalanced distribution of the available logs, makes this task particularly complex, and current approaches have different limitations. In this paper, we compare different symbolic and non-symbolic machine learning (ML) techniques for inferring ABAC policies and discuss their pros and cons. Based on experimental results on a toy dataset and on a real dataset, we argue that the best technique depends on the characteristics of the considered data. When the data are highly separable according to PCA and t-SNE decomposition, the quality of the obtained ABAC policies is higher and the policies are also easily interpretable. By contrast, when this property does not hold, the quality of the obtained policies is low; in this case, non-symbolic ML techniques show better results than the symbolic ones.
-
A neural model for the prediction of pathogenic genomic variants in Mendelian diseases
Proc. of the International Conference on Advances in Signal Processing and Artificial Intelligence (ASPAI 2019, Barcelona, Spain), Pages 34-38.
The detection of pathogenic genomic variants associated with genetic or cancer diseases represents an open problem in the context of Genomic Medicine. In particular, the detection of mutations in the non-coding regions of the human genome represents a particularly challenging machine learning problem, since neutral variants largely outnumber the pathogenic ones, thus resulting in highly imbalanced classification problems. We applied neural networks to the detection of pathogenic regulatory genomic variants in Mendelian diseases and showed that, by leveraging imbalance-aware techniques and deep learning algorithms, we can obtain state-of-the-art results using a less complex model than those proposed in the literature for this challenging prediction task.
-
Training neural networks with balanced mini-batch to improve the prediction of pathogenic genomic variants in Mendelian diseases
Sensors & Transducers, 234 (6), Pages 16-21
Known pathogenic variants associated with genetic Mendelian diseases represent a tiny minority of the overall genetic variation that characterizes the human genome. In this context classical imbalance-aware machine learning methods are unable to distinguish pathogenic from benign variants, since they are severely biased toward the majority (benign) class. Recent works based on ensemble and hyper-ensemble methods showed that by adopting sampling techniques we can significantly improve performance on this challenging task. Inspired by these findings and by recent successful applications of deep learning to Precision Medicine, we propose two learning techniques for neural networks designed to assure a certain balancing between pathogenic and benign variants during the training phase, or to assure that with high probability at least one pathogenic variant is included in the training mini-batch set of examples. The experimental prediction of non-coding mutations associated with Mendelian diseases shows the effectiveness of these proposed neural network training approaches.
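As a rough illustration of the first idea in this abstract (keeping a fixed balance between rare and common examples in every training mini-batch), here is a minimal sketch. It is not the paper's method: the function name, its parameters, and the choice to sample positives with replacement are assumptions made for this example.

```python
import random


def balanced_mini_batches(labels, batch_size, positives_per_batch, n_batches, seed=0):
    """Yield lists of example indices with a fixed number of positives each.

    Hypothetical helper: `labels` is a sequence of 0 (benign) / 1 (pathogenic).
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(n_batches):
        # Positives are rare, so they are drawn with replacement (assumption).
        batch = rng.choices(positives, k=positives_per_batch)
        # Negatives are plentiful, so they are drawn without replacement.
        batch += rng.sample(negatives, batch_size - positives_per_batch)
        rng.shuffle(batch)
        yield batch


# Toy labels (made up): 3 positives among 100 examples.
labels = [1] * 3 + [0] * 97
batches = list(balanced_mini_batches(labels, batch_size=16, positives_per_batch=4, n_batches=5))
```

Each yielded batch contains exactly `positives_per_batch` positive indices, so no gradient step is computed on benign examples alone.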
Courses
-
Data Mining and Business Intelligence course, held at NextInt
-
Projects
-
🍇 GRAPE: Rust/Python Graph Representation Learning library for Predictions and Evaluations
Developed at AnacletoLAB, University of Milan, in collaboration with the Robinson Lab at Jackson Laboratory for Genomic Medicine and BBOP at Lawrence Berkeley National Laboratory, GRAPE stands as a state-of-the-art graph processing and embedding library. Crafted in Rust and Python, GRAPE boasts scalability for handling large-scale graphs efficiently. Its core components, Ensmallen and Embiggen, ensure streamlined graph processing and advanced representation learning, making it an indispensable tool for tackling complex data challenges.
-
TwitchTimer
-
A Twitch- and StreamLabs-integrated website, built from the ground up, that helps streamers automate the monetization rewards process.
Languages
-
English
Full professional proficiency
-
Italian
Native or bilingual proficiency
Recommendations received
8 people have written a recommendation for Luca