Using Apache Spark and Random Forest Algorithm to Implement Breast Cancer Risk Prediction Analysis

Authors: Lizhi Miao* (1. College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, China; 2. Department of Geography and Geographic Information Science, University of Illinois at Urbana–Champaign); Mei-Po Kwan (Department of Geography and Geographic Information Science, University of Illinois at Urbana–Champaign); Jiyao Diao (College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China); Donglai Jiao (College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, China)
Topics: Medical and Health Geography
Keywords: Apache Spark, random forest model, disease prediction, machine learning, intelligent health, big data analysis
Session Type: Paper
Day: 4/3/2019
Start / End Time: 9:55 AM / 11:35 AM
Room: Wilson A, Marriott, Mezzanine Level
Presentation File: No File Uploaded


Modern medical science is moving toward intelligent health. Against this background, and in order to improve the detection and prediction of breast cancer risk, this paper uses a random forest model, which integrates the results of multiple weak classifiers (decision trees), to estimate the incidence of disease. The model is trained with the pipeline learning method, and the resulting pipeline model is also used for pathogenic factor analysis and result prediction. In addition, the study identifies the influencing factors with higher weight using the Pearson product-moment and Spearman's rank correlation coefficients, which should improve the monitoring of breast cancer risk. The results show that the Perimeter, Texture and Concave points factors have a strong influence on the pathogenesis of breast cancer. The prediction accuracy of the model trained with the pipeline method reaches 99.04%, which provides a useful reference for identifying breast cancer risk.
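The two correlation measures and the way a random forest combines its trees can be illustrated with a minimal sketch. This is plain Python rather than the authors' Spark implementation, which the abstract does not show. The function names and the toy perimeter values are hypothetical, and the majority vote is only one common way to combine tree outputs.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data: a perimeter-like feature and a 0/1 diagnosis.
    perimeter = [78.0, 85.5, 97.2, 110.4, 132.9]
    diagnosis = [0, 0, 1, 1, 1]
    print("Pearson :", pearson(perimeter, diagnosis))
    print("Spearman:", spearman(perimeter, diagnosis))
    print("Vote    :", majority_vote([1, 0, 1]))
```

Features whose coefficient against the diagnosis label has a larger absolute value are the ones such an analysis would flag as more influential. Spearman's coefficient depends only on rank order, so it also captures monotonic relationships that are not linear.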
