In a recent study published in Biology Methods and Protocols, researchers developed binary and multiclass machine learning models to distinguish cancer from non-cancerous tissue samples.
Background
Cancer, a major global health concern, is influenced by age, environmental toxins, and lifestyle choices. Early detection is critical for effective treatment and survival. The intricate nature of cancer and its interactions with the tissue microenvironment and immune system make it difficult to develop interventions.
Metastatic malignancies account for most cancer-related deaths because they are diagnosed at a late stage. Early detection and diagnosis, paired with modern therapies, have a significant influence on cancer treatment and survival. Computational approaches can aid early detection, diagnosis, and screening by analyzing complex neoplastic methylation patterns.
About the study
In the present study, researchers used machine learning and microarray-based methylation analysis to categorize 13 cancer types and their associated normal tissues.
The researchers obtained methylome microarray data from The Cancer Genome Atlas (TCGA) via the GDC data portal and examined 13 human cancer types that had at least 15 non-cancer samples. They also analyzed data from independent studies to evaluate the model.
During data preprocessing, they removed potentially noisy probes and those with more than 5.0% missing values, retaining probes mapping to autosomal and sex chromosomes. For the multiclass model, they created features by intersecting the features of the cancer types with those of a non-cancer class built from pooled non-cancer samples from all tissue types.
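The missing-value filter described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (a dictionary of per-sample values keyed by probe ID), the function names, and the treatment of `None` and NaN as "missing" are hypothetical and are not taken from the study's code.

```python
import math


def missing_fraction(values):
    """Return the fraction of entries that are None or NaN."""
    if not values:
        raise ValueError("a probe needs at least one sample value")
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def filter_probes(probe_values, max_missing=0.05):
    """Keep probes whose missing fraction does not exceed max_missing.

    probe_values: dict mapping a probe ID to a list of per-sample values.
    The 5% default mirrors the threshold quoted in the text; the data
    layout itself is an assumption made for this sketch.
    """
    return {
        probe: values
        for probe, values in probe_values.items()
        if missing_fraction(values) <= max_missing
    }
```

With 20 samples per probe, a probe with two missing values (10%) would be dropped, while a probe with none would be kept.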
While preprocessing the datasets, the researchers used unmethylated and methylated counts with TCGA data features to derive beta values. They used binary and multiclass machine-learning models to distinguish between cancerous and normal tissues. Each binary model evaluated a single tissue type, distinguishing cancer from non-cancer samples, whereas the multiclass models used all 13 tissue types together with the non-cancer data.
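For readers unfamiliar with beta values, the sketch below shows how one is commonly computed from methylated and unmethylated signal: the methylated share of the total signal, with a small offset in the denominator to stabilize low-intensity probes. The offset of 100 is a widely used convention for methylation arrays and is an assumption here; the article does not state which formula or offset the researchers applied.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Return the beta value for one probe in one sample.

    beta = M / (M + U + offset), which lies between 0 (unmethylated)
    and 1 (fully methylated). The offset default is an assumed
    convention, not a value reported in the study.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    total = methylated + unmethylated + offset
    if total <= 0:
        raise ValueError("total signal must be positive")
    return methylated / total
```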
They divided the input data into training and testing datasets, with the test datasets comprising 25% of the samples. They used two baseline classification methods: logistic regression and support vector machines (SVMs).
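A 75/25 split like the one described can be illustrated as follows. The sketch splits each class separately so that cancer and non-cancer samples keep roughly the same proportions in both sets; that per-class handling, the fixed random seed, and the function name are assumptions made for illustration, since the article only reports the 25% test fraction.

```python
import random
from collections import defaultdict


def split_train_test(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) for a list of class labels.

    Each class is shuffled and split on its own so that class
    proportions are roughly preserved (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

The returned index lists can then be used to select rows of the beta-value matrix for fitting the baseline classifiers.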
The researchers developed an XGBoost model using gradient-boosted decision trees, with 450 estimators, a maximum depth of 10, and a learning rate of 0.2. They built EMethylNET, a multiclass feed-forward neural network, using as input the 3,388 features with significance values above zero.
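The reported hyperparameters and the feature selection step can be written down as a short sketch. It assumes that the "significance values" correspond to the feature importance scores of the fitted XGBoost model, which the article does not confirm. The keyword names follow the scikit-learn style wrapper in the `xgboost` package, and the helper names are hypothetical.

```python
# Hyperparameters quoted in the text, using the keyword names of
# xgboost's scikit-learn wrapper (XGBClassifier).
XGB_PARAMS = {"n_estimators": 450, "max_depth": 10, "learning_rate": 0.2}


def select_informative_features(importances):
    """Return the indices of features whose importance is above zero."""
    return [index for index, score in enumerate(importances) if score > 0]


def build_classifier():
    """Construct the classifier with the quoted hyperparameters.

    xgboost is imported lazily so the rest of the sketch runs even
    when the package is not installed.
    """
    from xgboost import XGBClassifier
    return XGBClassifier(**XGB_PARAMS)
```

After fitting, something like `select_informative_features(model.feature_importances_)` would give the reduced feature set that a downstream network could take as input.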
They created pan-cancer methylome models combining Molecular Mechanisms of Cancer pathways with Pathways in Cancer (Human) from the Ingenuity Pathway Analysis (IPA) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. They denoted genes linked with multiclass methylation features as blue nodes, or as purple nodes if they were listed as cancer genes in OncoKB or the COSMIC Cancer Gene Census.
The researchers compared long non-coding ribonucleic acids (lncRNAs) with known cancer lncRNAs using two cancer lncRNA databases, Lnc2Cancer 3.0 and CRlncRNA, and the Cancer LncRNA Census (CLC). Following gene normalization, they divided the data into stratified training and test sets and used three Cox proportional hazards regression models to estimate the hazard on the test set.
Results
The model classified 13 cancer types and non-cancer tissues based on deoxyribonucleic acid (DNA) methylomes with 98% accuracy. The methylation-related genomic sites identified by the classifier were linked to cancer-related pathways, networks, and genes, offering insight into the epigenomic regulation of carcinogenesis.
The multiclass classification approach performed better than the binary classification of DNA methylation in individual tumors and normal tissues. The multiclass logistic regression model achieved an average Matthews correlation coefficient (MCC) score of 0.96; however, its performance varied by cancer type.
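The Matthews correlation coefficient mentioned above has a standard multiclass generalization that can be computed directly from true and predicted labels. The sketch below implements that general formula in plain Python; the function name is hypothetical, and the article does not say how the researchers averaged or computed their scores.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Return the multiclass Matthews correlation coefficient.

    The value is 1.0 for perfect predictions and 0.0 when the
    coefficient is undefined (for example, a constant prediction).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    n_samples = len(y_true)
    n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)

    # Covariance-style terms of the multiclass formula.
    cov_true_pred = n_correct * n_samples - sum(
        true_counts[k] * pred_counts[k] for k in classes
    )
    cov_true = n_samples ** 2 - sum(n * n for n in true_counts.values())
    cov_pred = n_samples ** 2 - sum(n * n for n in pred_counts.values())

    denominator = math.sqrt(cov_true * cov_pred)
    return cov_true_pred / denominator if denominator else 0.0
```

Unlike plain accuracy, this score accounts for class imbalance, which matters when some cancer types have far fewer samples than others.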
The experiments assessed 13 genes, four of which overlapped with the multiclass genes. The team noted the enrichment of pathways associated with cancer hallmarks, including cancer pathways, metabolic pathways, and signal transduction pathways. Several cancer-related pathways contained multiclass genes, which fell into categories covering particular cancer types, cell death and survival, tissue microenvironment, signaling, metabolism, and the immune system.
The study showed that features from the XGBoost models could detect cancer when used as input to EMethylNET, a multiclass deep neural network. However, there were two exceptions to the models' performance: the independent colon cancer (COAD) dataset and the independent head and neck squamous cell carcinoma (HNSC) dataset. On test set data, EMethylNET performed similarly to or better than related cancer classification research.
The study showed that XGBoost models can classify distinct cancer types based on DNA methylation data. The researchers also created the EMethylNET model, which generalized to most independent datasets.
Genetic mapping revealed genes with functional features and pathways associated with carcinogenesis. This technology can identify hundreds of cancers, with potential expansion to DNA methylation datasets from cell-free DNA for early diagnosis using liquid biopsy procedures. A practical use of this technology is to screen for specific cancers of unidentified origin that current machine-learning models may not be able to detect.