Introduction
Genomics sits at the intersection of biology and data science, and it has become central to cancer research. This research has contributed to scholars’ understanding of cancer. The field is characterized by the study of the function, structure, mapping, and evolution of genomes.2 This report analyzes cancer laboratory results in the form of cancer microarray datasets, with a focus on Medulloblastoma. The aim of focusing on Medulloblastoma is to demonstrate an understanding of the bioinformatics procedures used in gene expression profiling, subgroup discovery,1 and validation of the identified subgroups through a machine-learning classifier.
The workshop on “Microarray Analysis and Application to Cancer” gave insight into the complexity of the Medulloblastoma transcriptomic microarray dataset, which contains distinct subgroups. The “Gene set identification and annotation” workshop followed and covered how to identify genes whose expression varies. As a student, this workshop helped me understand the molecular characteristics of each subgroup. The third workshop, “Designing and validating machine learning classifiers”, covered building a classifier from the identified genes and evaluating it on an external dataset.
This report evaluates the biological and scientific procedures involved in gene expression analysis,3 subgroup discovery, and classifier identification. To fulfil our learning objectives, it focuses on demonstrating comprehension of the bioinformatics analysis of cancer datasets, and it includes the results from different practicals and workshops. Together, these comprehensively describe a microarray collection of cancer transcripts. The report begins with this introduction, followed by a section on subgroup discovery,4 a third section on the differentially expressed genes, a section on the classifiers and their application in the real world, and finally the conclusion.
Subgroup Discovery
There are several ways to use the data to show that it contains four groups rather than five. One is to create a frequency distribution table, which divides a large dataset into manageable parts and makes the information for the four groups accessible and easy to comprehend. The table used here is a grouped frequency distribution table, which lets the researcher place the data into distinct groups regardless of how the data were gathered. The range of the data can also be read easily from such a table.
There are several ways to explain the data. One is statistical analysis, which lets a researcher represent the data visually and whose observations help illustrate the data being analyzed. A second is sociological experiments, which aid in describing situations, making inferences about the data, and drawing the necessary conclusions. Finally, there is data computation, which applies the necessary procedures to comprehend the relevant data.
There are several ways to conclude that the data contain four groups rather than three or five. The conclusion involves understanding the different subtypes and how they are allocated in the common t-SNE visualization. Across the 1501 samples, the subtypes produced by the additional analyses were split, leading to the final consensus technique based on t-SNE. Overall, the conclusion was derived from the groupings, in which the NMF, t-SNE, and SNF cohorts for groups 3 and 5 differed from those for group 4.
Differentially Expressed Genes
It is essential to understand the patterns of gene expression in cancer genomics. Beyond identifying the subgroups, the molecular nature of Medulloblastoma can only be described clearly by examining the Differentially Expressed Genes (DEGs), together with a deep understanding of the biological mechanisms behind this aggressive cancer.8 A complex pattern of gene activity characterizes the distinct stages of Medulloblastoma, each marking a significant step in cancer development. This section of the report proceeds through these stages, identifying and interpreting the Differentially Expressed Genes and how they arise.7
The expression profile changes continuously as the disease proceeds from stage one to stage four. During this progression, some genes become more prominent than others,11 and it is at this point that the genes associated with malignancy can be noticed. The molecular map that results from identifying the Differentially Expressed Genes at each stage guides us through the intricate environment of cellular functions and signalling pathways.7 This part of the report therefore explains how the disease develops at the genetic level.
One of the most effective ways of classifying the Differentially Expressed Genes according to their biological functions and molecular roles is Gene Ontology (GO).6 The classification summarizes the molecular processes in which the genes take part, offering a possible explanation for the processes that result in the development of Medulloblastoma.19 It links the gene expression patterns to the functional behaviour of the cancer,10 so that we can identify the connection between molecular changes and biological functions.9 The Differentially Expressed Genes also provide a basis for identifying potential targets for treatment, which is essential for developing targeted medicines specific to each subgroup.
In the volcano plot above, the Differentially Expressed Genes are shown in red.
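A volcano plot highlights genes that combine a large fold change with a small p-value. As a minimal sketch of how such genes might be flagged, the Python below (the practicals themselves used R) applies the Benjamini-Hochberg false discovery rate adjustment and then two cut-offs. The gene names, values, and the thresholds of |log2 fold change| >= 1 and adjusted p < 0.05 are assumptions chosen for illustration, not the settings or results of this report.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def flag_degs(genes, log2_fc, p_values, fc_cutoff=1.0, alpha=0.05):
    """Flag genes passing both the fold-change and adjusted p-value cut-offs.

    fc_cutoff and alpha are illustrative defaults, not values from the report.
    """
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, q in zip(genes, log2_fc, adjusted)
        if abs(fc) >= fc_cutoff and q < alpha
    ]


# Invented toy values for four hypothetical genes.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
log2_fc = [2.5, -1.8, 0.3, 1.2]
p_values = [0.001, 0.004, 0.03, 0.20]

adjusted = benjamini_hochberg(p_values)
flagged = flag_degs(genes, log2_fc, p_values)
print(flagged)
```

In this toy input, gene_c fails the fold-change cut-off and gene_d fails the adjusted p-value cut-off, so only the genes that would sit in the upper left and upper right corners of a volcano plot are flagged.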
Knowledge of the Differentially Expressed Genes plays a vital role in understanding the molecular biology that allows cancer to develop. In this context, under-expression and over-expression of genes give essential information about the development of the cancer, its stage, and the possible medicines with which to begin treatment. Group 1 genes are related to morphogenesis and give insight into the stage of tumour development. Changes in cellular development become noticeable when these genes are examined, and the changes in morphology help researchers conclude that the cells are abnormal.
Group 4 cells, in turn, show considerable spontaneous activity in cancer development, with higher cell differentiation, including neuronal migration, inside the cancer microenvironment. The different gene expression patterns in the groups reflect the various stages of the cancer. Group 1 cells are typical of the first stages of tumour development, whereas Group 4 cells show how cells behave in the later stages of malignant growth.
Genes that are underexpressed or overexpressed play critical roles in molecular functions, and their dysregulation is connected to a broad group of diseases. Understanding underexpressed and overexpressed genes has become a key component of cancer research that helps researchers learn about tumour biology, and this knowledge is essential for understanding how to treat tumours associated with abnormal genes. WNT, SHH, Group 3, and Group 4 are the well-known subgroups of Medulloblastoma, a malignant brain tumour that mostly affects children.
Classifier and External Dataset Application
Cancer research is a complex field that requires detailed knowledge to classify cells suspected of undergoing malignant change. At this point in the research, machine learning classifiers serve as a valuable tool for identifying the complex patterns in the Medulloblastoma transcriptome microarray dataset.17 To better understand and classify the Medulloblastoma subgroups, we explored the creation, use, and verification of a classifier built around the Differentially Expressed Genes.5 The classifier acts as a computational detective, trained to identify slight differences in gene expression that might not be visible in a typical identification process. It emerges from the challenging field of bioinformatics, where algorithms are refined to identify the patterns that correspond to the subgroups discussed above within a specific dataset. The classifier’s effectiveness depends on its capacity to distinguish the subgroups reliably, providing dependable information for clinicians to decide on the cancer stage.
The classifier is tested on an external dataset, which widens the scope beyond the initial dataset. The most crucial stage of the process is confirming the classifier’s robustness and how accurately it assigns the Medulloblastoma subgroups in several genomic environments.12 In this validation process, the report examines whether the classifier performs as expected in a different context and investigates its strengths and drawbacks. The report examines the classifier’s sensitivity, specificity, and accuracy to give a comprehensive understanding of its usefulness and functionality.14
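Sensitivity, specificity, and accuracy can all be derived from a confusion matrix. The following is a minimal Python sketch (the practicals used R) of that calculation for a four-class problem. The counts in the matrix and the function name `per_class_metrics` are invented for illustration and are not results from this report; rows are assumed to hold the true subgroup and columns the predicted subgroup.

```python
def per_class_metrics(matrix, labels):
    """Compute overall accuracy plus per-class sensitivity and specificity.

    matrix[i][j] is the number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]                          # true positives
        fn = sum(matrix[k]) - tp                   # missed members of class k
        fp = sum(row[k] for row in matrix) - tp    # others predicted as k
        tn = total - tp - fn - fp                  # everything else
        metrics[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return correct / total, metrics


# Invented counts: some Grp3 and Grp4 samples are confused with each other.
labels = ["WNT", "SHH", "Grp3", "Grp4"]
matrix = [
    [8, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 9, 3],
    [0, 0, 2, 18],
]

accuracy, metrics = per_class_metrics(matrix, labels)
print(f"accuracy = {accuracy:.3f}")
for label in labels:
    print(label, metrics[label])
```

Reporting the metrics per subgroup, rather than a single accuracy figure, makes it visible when errors are concentrated in one pair of classes.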
Picking a suitable machine learning classifier is essential for precise, accurate, and dependable predictions regarding cancer diagnosis from the datasets. We will consider how likely it is that the chosen classifier is dependable. We build on the fact that the classifier was initially trained on datasets that included patient clinical history, patient demographics, and gene expression levels. We then test the classifier on other datasets from people who are undergoing diagnosis. The chosen classifier has to show resilience by accommodating different fluctuations in the data. Its effectiveness is established when it achieves a high level of efficiency, accuracy, and precision on the new dataset.
When using R for machine learning, the most refined classifier is obtained at the best cost, 0.03125 (2^-5). The most efficient way to improve the classifier is to control this cost parameter. Assessing the classifier’s performance on the data from group 2 shows that all the counts fall on the diagonal of the confusion matrix, meaning that the group 1 data were classified with a high level of accuracy. However, a result this clean for group 1 also points to possible overfitting. To guard against overfitting, cross-validation is applied, which gives a more reliable estimate of the classifier’s performance.
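The idea of choosing a cost value by cross-validation can be sketched as follows. This is a minimal illustration in Python rather than the R used in the practicals, and it assumes the classifier is a linear support vector machine, which the report does not state. The toy data, the grid of costs from 2^-5 to 2^5, the fold scheme, and all function names are assumptions made for the example.

```python
def kfold_indices(n, k):
    """Split range(n) into k interleaved folds (deterministic, no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def train_linear_svm(X, y, cost, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + cost * sum of hinge losses."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample lies inside the margin or is misclassified
                for j, xj in enumerate(xi):
                    grad_w[j] -= cost * yi * xj
                grad_b -= cost * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def cross_validated_accuracy(X, y, cost, k=5):
    """Average held-out accuracy over k folds for one cost value."""
    scores = []
    for fold in kfold_indices(len(y), k):
        held_out = set(fold)
        X_train = [x for i, x in enumerate(X) if i not in held_out]
        y_train = [v for i, v in enumerate(y) if i not in held_out]
        model = train_linear_svm(X_train, y_train, cost)
        hits = sum(1 for i in fold if predict(model, X[i]) == y[i])
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)


# Candidate costs as powers of two; the first entry is 2^-5 = 0.03125.
cost_grid = [2.0 ** e for e in range(-5, 6)]

# Invented two-class toy data with labels -1 and +1.
X = [[-2.0, -1.5], [-1.5, -2.0], [-1.0, -1.2], [-2.2, -0.8], [-1.8, -1.1],
     [2.0, 1.5], [1.5, 2.0], [1.0, 1.2], [2.2, 0.8], [1.8, 1.1]]
y = [-1] * 5 + [1] * 5

cv_scores = {c: cross_validated_accuracy(X, y, c) for c in cost_grid}
best_cost = max(cost_grid, key=lambda c: cv_scores[c])
print(best_cost, cv_scores[best_cost])
```

Because every score is measured on samples that were held out during training, a cost that merely memorizes the training data gains no advantage, which is why cross-validation is a common check against overfitting.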
A classifier’s dependability and generalizability are evaluated by applying it to fresh, unseen data. This study used a separate dataset with 160 metagene-specific probes to assess the classifier’s effectiveness. The goal was to determine whether the classifier could identify the four subgroups in this test dataset. Referred to as the “test predictor,” the classifier was used to predict the categorization of the new dataset, which was generated on the Affymetrix Human Genome U133 Plus 2.0 array.
The paper analysed 76 pediatric Medulloblastoma samples on the same array and found four expression classes, so the question is whether our classifier can correctly identify these four classes. The answer emerges from the confusion matrix, which shows exactly where any misclassification occurs. If the classifier predicts G4, the sample is most likely G4. If it predicts G3, the sample is most likely G3, although it could also be G4, since these two groups may overlap. Computing Cohen’s kappa gives a score of 0.8958036 (about 89.6%). Comparing our findings with theirs, the heat map shows where the misclassifications lie.
Five mismatches are shown on the heat map above.
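Cohen’s kappa measures how far the agreement between predicted and reference labels exceeds the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement computed from the row and column totals. The Python sketch below (the practicals used R) applies this formula to an invented confusion matrix; the counts and the function name `cohens_kappa` are assumptions for illustration and do not reproduce the score reported here.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows true, columns predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    p_observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of matching row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(n)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Invented counts for four subgroups (WNT, SHH, Grp3, Grp4).
matrix = [
    [8, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 9, 3],
    [0, 0, 2, 18],
]

kappa = cohens_kappa(matrix)
print(f"kappa = {kappa:.4f}")
```

Kappa is lower than plain accuracy whenever some agreement would be expected by chance alone, so the two figures should be reported under their own names.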
Principal component analysis indicates whether the predicted groups fall in the expected locations. There are instances where the green and yellow dots are misclassified. Misclassification might result from experimental issues such as contamination. If the tumour is in a transitional stage between stages 1 and 2, it may be misclassified; the biopsy may also have been dissected incorrectly, and the results may be messy if the tumour is half stage 1 and half stage 2.21
Conclusion
Assessing the relative contributions and interactions of the diverse groups is vital. The consensus subgroups, clinical factors, and novel subtypes are essential for understanding the associated molecular features, such as whole-chromosome and focal cytogenetic aberrations. Careful observation of the different subtypes helps distinguish between them and identify any malignancy from the gathered data. The report showed the genetic activity behind Medulloblastoma and how the cells coordinate in its development. A machine learning classifier is essential for specifying the right medicine and thereby easing the burden of this disease, and confirming its effectiveness on an external dataset is equally essential. Analysing the Medulloblastoma genomic environment opens up a body of knowledge and research that helps relevant stakeholders learn more about the disease. As evidenced above, R, a statistical analysis language, is a powerful tool for sorting and identifying the genetic differences underlying the cells being studied. It makes complex techniques and statistical analysis more accessible, and it makes it easier to identify our four subgroups: WNT, SHH, Group 3 (Grp3), and Group 4 (Grp4).
References
- Sharma T, Schwalbe EC, Williamson D, Sill M, Hovestadt V, Mynarek M, et al. Second-generation molecular subgrouping of Medulloblastoma: An international meta-analysis of Group 3 and Group 4 subtypes. Acta Neuropathologica. 2019 [cited 2024 Jan 1]. https://doi.org/10.1007/s00401-019-02020-0
- Temple LK, McLeod RS, Gallinger S, Wright JG. Defining disease in the Genomics Era. Science. 2001;293(5531):807–8. https://doi.org/10.1126/science.1062938
- Lovén J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, et al. Revisiting global gene expression analysis. Cell. 2012;151(3):476–82. https://doi.org/10.1016/j.cell.2012.10.012
- Atzmueller M. Subgroup discovery. WIREs Data Mining and Knowledge Discovery. 2015;5(1):35–49. https://doi.org/10.1002/widm.1144
- Rathi KS, Arif S, Koptyra M, Naqvi AS, Taylor DM, Storm PB, et al. A transcriptome-based classifier to determine molecular subtypes in Medulloblastoma. PLOS Computational Biology. 2020;16(10). https://doi.org/10.1371/journal.pcbi.1008263
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: Tool for the unification of biology. Nature Genetics. 2000;25(1):25–9. https://doi.org/10.1038/75556
- Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003;19(3):368–75. https://doi.org/10.1093/bioinformatics/btf877
- Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: An R package for identifying differentially expressed genes from RNA-Seq Data. Bioinformatics. 2009;26(1):136–8. https://doi.org/10.1093/bioinformatics/btp612
- Duncan RG, Reiser BJ. Reasoning across ontologically distinct levels: Students’ understandings of molecular genetics. Journal of Research in Science Teaching. 2007;44(7):938–59. https://doi.org/10.1002/tea.20186
- McDermaid A, Monier B, Zhao J, Liu B, Ma Q. Interpretation of differential gene expression results of RNA-Seq Data: Review and Integration. Briefings in Bioinformatics. 2018;20(6):2044–54. https://doi.org/10.1093/bib/bby067
- Conway T, Schoolnik GK. Microarray expression profiling: Capturing a genome‐wide portrait of the transcriptome. Molecular Microbiology. 2003;47(4):879–89. https://doi.org/10.1046/j.1365-2958.2003.03338.x
- Northcott PA, Robinson GW, Kratz CP, Mabbott DJ, Pomeroy SL, Clifford SC, et al. Medulloblastoma. Nature Reviews Disease Primers. 2019;5(1). https://doi.org/10.1038/s41572-019-0063-6
- Lauffenburger DA, Horwitz AF. Cell migration: A physically integrated molecular process. Cell. 1996;84(3):359–69. https://doi.org/10.1016/S0092-8674(00)81280-5
- Azar AT, El-Metwally SM. Decision tree classifiers for Automated Medical Diagnosis. Neural Computing and Applications. 2012;23(7–8):2387–403. https://doi.org/10.1007/s00521-012-1196-7
- Gamberoni G, Storari S, Volinia S. Finding biological process modifications in cancer tissues by mining gene expression correlations. BMC Bioinformatics. 2006;7(1). https://doi.org/10.1186/1471-2105-7-6
- Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems. 1987;2(1–3):37–52. https://doi.org/10.1016/0169-7439(87)80084-9
- Min S, Lee B, Yoon S. Deep Learning in Bioinformatics. Briefings in Bioinformatics. 2016. https://doi.org/10.1093/bib/bbw068
- Gururangan S, Schroeder K. Molecular variants and mutations in Medulloblastoma. Pharmacogenomics and Personalized Medicine. 2014;43. https://doi.org/10.2147/pgpm.s38698
- Gao F, Wang W, Tan M, Zhu L, Zhang Y, Fessler E, et al. DeepCC: A novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis. 2019;8(9). https://doi.org/10.1038/s41389-019-0157-8
- Munquad S, Si T, Mallik S, Das AB, Zhao Z. A deep learning–based framework for supporting the clinical diagnosis of glioblastoma subtypes. Frontiers in Genetics. 2022;13. https://doi.org/10.3389/fgene.2022.855420