GitHub - catebell/Tumor-Type-Classification: Data preprocessing and training of a Deep Learning graph architecture for the classification of tumor types, with focus on the most impactful genomic and clinical differences through model explainability. Use cases tested: Lung cancer (LUAD / LUSC) and Kidney cancer (KIRC / KICH / KIRP). Project for the 'AI for Bioinformatics' course. · GitHub

click here to download our original dataset retrieved from GCD Data Portal
click here to download the three folders [train/val/test] of our processed graphs Dataset

Architecture

Figure 1. High-level overview of the project architecture.

Explainability

model_analysis_functions.py provides functions for the interpretation of the learned models decisional processes and what features they focus on:

Graph branch attention scores (get_gene_attention_weights()) retrieves the attention scores given to each gene by the GAT-GNN layers, with higher scores to the genes that the model learned to monitor more closely.

2026-04-09 11:56:00,592 - INFO - Genes with attention importance = 1.000:
2026-04-09 11:56:00,737 - INFO - ENSG00000128422: 1.0000   ['PC2', 'PC2', 'PC2', 'PC2', 'Pc2', '39.1', 'CK-17', 'K17', 'KRT17', 'PC2', 'PCHC1']
2026-04-09 11:56:00,741 - INFO - ENSG00000119147: 1.0000   ['C2orf40', 'ECRG4']
2026-04-09 11:56:00,742 - INFO - ENSG00000119632: 1.0000   ['FAM14A', 'IFI27L2', 'ISG12B', 'TLH29']
2026-04-09 11:56:00,857 - INFO - ENSG00000124107: 1.0000   ['ALP', 'ALK1', 'BLPI', 'HUSI', 'HUSI-I', 'SLPI', 'WAP4', 'WFDC4', 'ALP']
2026-04-09 11:56:02,203 - INFO - ENSG00000162733: 1.0000   ['DDR2', 'MIG20a', 'NTRKR3', 'TYRO10', 'WRCN']
...

Saliency (get_gene_saliency()) computes which genes influence the model's decision by having more relevance in the gradient computation.

2026-04-09 17:10:59,946 - INFO - Top 100 Genes saliency:
2026-04-09 17:10:59,950 - INFO - ENSG00000185201: 1.0000   ['1-8D', 'DSPA2c', 'IFITM2']
2026-04-09 17:10:59,951 - INFO - ENSG00000205420: 0.6538   ['CK-6C', 'CK-6E', 'K6C', 'KRT6C', 'PC3', 'CK-6C', 'CK-6E', 'CK6A', 'CK6C', 'CK6D', 'K6A', 'K6C', 'K6D', 'KRT6A', 'KRT6C', 'KRT6D', 'PC3']
2026-04-09 17:10:59,952 - INFO - ENSG00000011600: 0.5574   ['DAP12', 'KARAP', 'PLOSL', 'PLOSL1', 'TYROBP']
2026-04-09 17:10:59,953 - INFO - ENSG00000173599: 0.4210   ['PC', 'PC', 'PC', 'PC', 'PCB']
2026-04-09 17:10:59,954 - INFO - ENSG00000019582: 0.3975   ['CD74', 'CLIP', 'DHLAG', 'HLADG', 'Ia-GAMMA', 'CLIP', 'II', 'II', 'P33', 'p33']
2026-04-09 17:10:59,955 - INFO - ENSG00000186395: 0.3811   ['EHK', 'BCIE', 'BIE', 'CK10', 'K10', 'KPP', 'KRT10', 'EHK2']
2026-04-09 17:10:59,957 - INFO - ENSG00000171401: 0.3503   ['CK13', 'K13', 'KRT13', 'WSN2', 'K13']
2026-04-09 17:10:59,958 - INFO - ENSG00000186832: 0.3351   ['CK16', 'FNEPPK', 'K16', 'K1CP', 'KRT16', 'KRT16A']
2026-04-09 17:10:59,959 - INFO - ENSG00000186081: 0.3284   ['CK5', 'DDD', 'DDD1', 'EBS2', 'K5', 'KRT5', 'KRT5A']
...

→ Boxplots are employed in the visualization of the selected genes features values through the test patients dataset:

Expression (RNA) values	Other biomarkers

CNV values	Methylation values

Clinical importance (explain_clinical_importance()) explains which clinical features, if any, most influence the prediction accuracy.

2026-04-01 15:12:34,071 - INFO - Clinical Features importance:
2026-04-01 15:12:34,071 - INFO - age_at_index: 0.0070
2026-04-01 15:12:34,071 - INFO - country_of_residence_at_enrollment: 0.0070
2026-04-01 15:12:34,071 - INFO - gender: 0.0070
2026-04-01 15:12:34,071 - INFO - ajcc_pathologic_n: 0.0070
2026-04-01 15:12:34,072 - INFO - tissue_or_organ_of_origin: 0.0070
2026-04-01 15:12:34,072 - INFO - ethnicity: 0.0000
2026-04-01 15:12:34,072 - INFO - race: 0.0000
...

How to execute

Get the data

The dataset must be downloaded from the desired source and saved in a folder named original_dataset/. For example, our data (click here to download) was originally in this form:

original_dataset/
    clinical/
        clinical.tsv
        exposure.tsv
        LUAD_LUSC_metadata.json : file mapping, necessary to map exposure and clinical to CNV,RNA and methylation data (different file_id)
    CNV/
        722 patients folders
    methylation/
        758 patients folders
    RNA/
        757 patients folders

Extract the correctly formatted files in a new files/ folder by executing files_extraction_and_mapping.py. This will be our new reference folder:

files/
    clinical/
        file_case_mapping.tsv
        omics_files.tsv
    CNV/
        extracted patients .tsv files
    methylation/
        extracted patients .txt files
    RNA/
        extracted patients .tsv files

We downloaded from the STRING database the following files, used later on to retrieve genes properties and build the graphs based on their codified proteins (click to start the download):

→ Put them in a new STRING_downloaded_files/ folder and run STRING_files_to_tsv.py: the first function creates STRING_downloaded_files/9606.protein.aliases.gene.tsv, the second one creates STRING_downloaded_files/gene_ids_mapped.tsv.

⚠️ The execution of the first function can take a few hours.

We need also a methylation manifest for the preprocessing of methylation data; we downloaded from the relative Illumina support page the one relative to the Illumina “450 K array” technology (click here to start the download).

→ Put it in methylation_manifests/originals_downloaded/ and run methylation_manifest_to_tsv.py to extract only the needed information correctly formatted.

Preprocessing

Run preprocessing_clinical_features_to_file.py to obtain the following files:

files/
    clinical/
        features_considered.tsv  ←
        features_encoded.tsv  ←
        file_case_mapping.tsv
        omics_files.tsv

Run train_test_patients_split.py to assign to each patient (case_id) a label [train, val, test]:

files/
    clinical/
        features_considered.tsv
        features_encoded.tsv
        file_case_mapping.tsv
        omics_files.tsv
        patient_split_cleaned.csv  ←

Run graph_classification.py to create the patients graphs Dataset (if the folders do not already exist). You will then see three new folders:

data_graphs_processed_test/
data_graphs_processed_train/
data_graphs_processed_validation/

⚠️ The first execution can take a few hours. It will not start again unless the folders get deleted or renamed.

Click here to download the three folders [train/val/test] of our processed graphs Dataset

Try with different tumor classes

If needed, this model can be adapted to classify different tumor types (given the same kind of biological data).

In preprocessing_clinical_features_to_file.py change this mapping:

# for project.project_id, remap tumor class to 0-1
mapping = {
    'TCGA-LUAD': 0,
    'TCGA-LUSC': 1
}
features_df['project.project_id'] = features_df['project.project_id'].map(mapping)

If you want to use the default models/MultiModalGNN and consider both graphs and clinical features:

Change the content of preprocessing_clinical_features_to_file.py with respect to the clinical data you have.

Change the number of clinical features and/or the number of classes in every model initialization:

model = MultiModalGNN(num_node_features=5, num_edge_features=3, clinical_input_dim=53, hidden_channels=64, num_classes=2).to(device)

If you don't want to consider the clinical features, use only models/GAT:

Change the model initialization in graph_classification.py and/or other files where it is needed:

→ #model = GAT(num_node_features=5, num_edge_features=3, num_classes=2, hidden_channels=64).to(device)
#model = MLP(num_patient_features=53, num_classes=2).to(device)
→ model = MultiModalGNN(num_node_features=5, num_edge_features=3, clinical_input_dim=53, hidden_channels=64, num_classes=2).to(device)

→ model = GAT(num_node_features=5, num_edge_features=3, num_classes=2, hidden_channels=64).to(device)
#model = MLP(num_patient_features=53, num_classes=2).to(device)
→ #model = MultiModalGNN(num_node_features=5, num_edge_features=3, clinical_input_dim=53, hidden_channels=64, num_classes=2).to(device)

When the graph Dataset is created, class labels are automatically stored, so no need to modify PatientGraphDataset.py:

data.y = torch.tensor([self.labels_dict[case_id]], dtype=torch.long)

Name		Name	Last commit message	Last commit date
Latest commit History 110 Commits
.idea		.idea
images		images
models		models
utility_but_useless_scripts		utility_but_useless_scripts
PatientGraphDataset.py		PatientGraphDataset.py
README.md		README.md
STRING_files_to_tsv.py		STRING_files_to_tsv.py
config.py		config.py
files_extraction_and_mapping.py		files_extraction_and_mapping.py
graph_classification_grid_search.py		graph_classification_grid_search.py
k_folds_graph_classification.py		k_folds_graph_classification.py
main.py		main.py
methylation_manifest_to_tsv.py		methylation_manifest_to_tsv.py
model_analysis_functions.py		model_analysis_functions.py
model_comparison.py		model_comparison.py
montecarlo_graph_classification.py		montecarlo_graph_classification.py
preprocessing_CNV_to_df.py		preprocessing_CNV_to_df.py
preprocessing_RNA_to_df.py		preprocessing_RNA_to_df.py
preprocessing_clinical_features_to_file.py		preprocessing_clinical_features_to_file.py
preprocessing_methylation_to_df.py		preprocessing_methylation_to_df.py
train_test_val_patients_split.py		train_test_val_patients_split.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Architecture

Explainability

How to execute

Get the data

Preprocessing

Click here to download the three folders [train/val/test] of our processed graphs Dataset

Try with different tumor classes

If you want to use the default models/MultiModalGNN and consider both graphs and clinical features:

If you don't want to consider the clinical features, use only models/GAT:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Architecture

Explainability

How to execute

Get the data

Preprocessing

Click here to download the three folders [train/val/test] of our processed graphs Dataset

Try with different tumor classes

If you want to use the default models/MultiModalGNN and consider both graphs and clinical features:

If you don't want to consider the clinical features, use only models/GAT:

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages