Sudip Mandal, Goutam Saha and Rajat K. Pal
Biological databases related to medical science, containing pathological, radiological and genetic information about patients, are undergoing tremendous growth, beyond our capacity to analyze them. Yet such analysis can reveal new findings about the cause and subsequent treatment of a disease. Here, the genetic information of lung adenocarcinoma, in the form of a microarray dataset covering five different stages of the disease, is investigated. Rough Set Theory (RST) is used in the analysis with the aim of effectively extracting biologically relevant information, since RST works well in an environment heavy with inconsistent, ambiguous or missing data and provides efficient algorithms for finding hidden patterns in data. The investigation is carried out on a publicly available microarray dataset obtained from the GEO profiles at the National Centre for Biotechnology Information (NCBI) website. Cross-validation of the generated rule sets shows 100% accuracy. To extract the hidden biological dependencies between the responsible genes, a Decision Tree is then applied to each pair of consecutive stages of cancer development, in order to identify the main culprit genes driving progression from one stage to the next, which may in turn inform drug design. The analysis reveals that the hybrid Rough-Decision Tree approach is able to extract hidden relationships among the various genes that play an important role in causing the disease, and is also able to provide a unique rule set for automated medical diagnosis. Finally, the functions of the identified genes are studied and validated using the Gene Ontology website DAVID, which clearly shows the direct or indirect relation of these genes to cancer. This study highlights the usefulness and efficiency of RST and Decision Trees in the disease diagnosis process, their potential use in inductive learning, and their value as an aid for building more biologically significant expert systems in the medical sciences.
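As a rough illustration of why Rough Set Theory tolerates inconsistent data, the following minimal sketch computes the lower and upper approximations of a single disease stage from a discretized expression table. It is not the authors' implementation: the gene names, expression levels and stage labels are invented toy values, and the helper functions `indiscernibility_classes` and `approximations` are hypothetical names introduced here.

```python
from collections import defaultdict

def indiscernibility_classes(rows, attrs):
    """Group sample indices that share identical values on the chosen attributes."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(idx)
    return list(classes.values())

def approximations(rows, attrs, target):
    """Return the (lower, upper) approximation of the index set `target`."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(rows, attrs):
        if block <= target:   # every sample in the block belongs to the target
            lower |= block
        if block & target:    # at least one sample in the block belongs to it
            upper |= block
    return lower, upper

# Toy, made-up discretized expression table (not data from the study).
samples = [
    {"gene_a": "high", "gene_b": "low",  "stage": "I"},
    {"gene_a": "high", "gene_b": "low",  "stage": "II"},  # conflicts with sample 0
    {"gene_a": "low",  "gene_b": "low",  "stage": "I"},
    {"gene_a": "low",  "gene_b": "high", "stage": "II"},
]

stage_one = {i for i, s in enumerate(samples) if s["stage"] == "I"}
lower, upper = approximations(samples, ["gene_a", "gene_b"], stage_one)
print("certainly stage I:", sorted(lower))
print("possibly stage I:", sorted(upper))
```

Samples 0 and 1 have identical expression profiles but different stages, so they fall outside the lower approximation and inside the upper one. Certain rules would be drawn from the lower approximation only, while the gap between the two sets marks the samples the chosen genes cannot resolve.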