Comparative Analysis for Heart Disease Prediction

Heart disease has become one of the leading causes of death nationwide. The best means of prevention is an early-warning system that can detect early symptoms and thereby save more lives. Research in data mining has recently gained considerable attention and has been applied in many fields, including medicine. Data mining techniques can help researchers predict the probability of heart disease among susceptible patients. In prior studies, several researchers have sought the best possible technique for building a heart disease prediction model. This study compares the different algorithms used to predict heart disease. Its results will help develop an understanding of the recent methodologies used in heart disease prediction models. This paper presents an analysis of significant data mining techniques that can be used to develop highly accurate and efficient prediction models, which will help doctors reduce the number of deaths caused by heart disease.
Keywords: Classifiers, Heart Disease Analysis.


I. INTRODUCTION
Data mining is the process of finding previously unknown patterns and trends in existing data and using the extracted information to build predictive, usable models. Data mining has not historically been widely used in the medical sciences, but this is changing rapidly. Today, the healthcare industry generates a large volume of complex information about hospitals, patients, electronic patient records, disease prognosis and diagnosis, and medical devices. This huge amount of data needs to be mined and filtered in order to extract useful information [1].
Despite immense technological development in the healthcare sector, developing countries, and indeed countries all around the world, still need to provide quality healthcare services at a cost their populations can afford. Although such countries have seen large-scale development in healthcare facilities, there is still a huge demand for making these facilities affordable [2].

(a) Knowledge discovery in medical databases
Data mining is a very important step in knowledge discovery. Over the past few years, it has attracted a great deal of interest in the field of data science. The process of knowledge discovery is iterative and comprises data cleaning, data integration, data selection, pattern recognition, and finally knowledge presentation [3].

(b) Heart Disease
The heart is the blood-pumping organ that supplies oxygen and other nutrients to all tissues of the human body. Abnormal heart activity can harm other organs of the body, e.g. the brain and kidneys, and cessation of cardiovascular activity can result in the immediate death of an individual [4].
There is a large body of research on cardiovascular diseases and their diagnosis, and it has produced several different methodologies for treating such diseases. An overview of this research is given below [5]: Milan Kumari applied data mining classification techniques, namely RIPPER, decision tree, artificial neural network (ANN) and support vector machine (SVM), to explore a cardiovascular disease dataset. SVM predicted cardiovascular disease (CVD) with the lowest error rate and the highest accuracy [6].
Colombet et al. [8] assessed the use of CART and artificial neural networks (ANN) for predicting heart disease in individuals. Nidhi Bhatla and Kiran Jyoti used 15 attributes in their study to predict coronary disease [7]. G. Parthiban et al. published an approach that they termed "Chances of diabetic patient getting heart disease". The researchers verified its accuracy by applying a Naïve Bayes classifier, which yields an optimal prediction model from a minimal training set [8]. Jyoti Soni et al. [9] performed a large number of experiments to predict heart disease on a heart disease dataset. The results showed that Decision Tree achieved the highest accuracy. However, they found that Bayesian classification had accuracy comparable to that of the decision tree method. They observed that other predictive methods, such as KNN and neural-network-based classification, did not perform well on this dataset.
M. Anbarasi et al. [10] used a Genetic Algorithm to determine the attributes that contribute most to the diagnosis of cardiac disease. This work indirectly reduced the number of tests a patient needs to take to determine the presence of heart disease. The performance of Decision Tree after incorporating feature subset selection was found to be quite remarkable.
Robert Detrano's experiments classified heart disease with an accuracy of nearly 77% using a discriminant function derived from logistic regression [10]. Zheng Yao used a novel model called R-C4.5 and was able to improve the performance of attribute selection and partitioning models. The experiments showed that the rules produced by R-C4.5 could provide healthcare experts with clear and useful information about heart disease [11]. Resul Das [12] then presented a methodology that used SAS-based software for the diagnosis of heart disease. This system was based on a neural network technique.
Another fairly recent method is associative classification, which integrates association rule mining and classification into a single prediction model and achieves high accuracy. Associative classifiers are particularly suitable for applications in which the prediction model must be as accurate as possible [13].
This paper describes the accuracy of different classifiers in classifying a heart disease dataset. It is organized as follows: Section II describes the method used for performing the simulation. Section III presents the results in the form of graphs. Section IV concludes the paper.

II. MATERIAL AND METHOD
Due to certain resource constraints, this paper presents an analysis of a number of data mining techniques that may support healthcare professionals in performing accurate analysis of heart disease.
In this research, we used four classifiers for the prediction of heart disease, using Weka version 3.6. Weka is a commonly used data mining tool. The initial dataset comprised 303 patient records with 14 attributes. An attribute selection algorithm was applied to the dataset as a preprocessing step. After attribute selection, records with missing values were identified and deleted from the dataset, leaving 296 records. These 296 records were then subjected to four data mining techniques, namely RIPPER, Decision Tree, Artificial Neural Networks (ANNs) and Support Vector Machine (SVM).
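The record-deletion step described above can be sketched as follows. This is a minimal illustration in plain Python, not the Weka procedure used in the study; the function name and the sample rows are assumptions made for illustration, and "?" is used as the missing-value marker because that is how Weka's ARFF format encodes missing values.

```python
# Marker for a missing value ("?" is the ARFF convention).
MISSING = "?"

def drop_incomplete(records):
    """Return only the records that contain no missing values."""
    return [r for r in records if MISSING not in r]

# Toy rows invented for illustration: age, sex, cholesterol, class.
rows = [
    ["63", "1", "233", "sick"],
    ["67", "1", "?", "buff"],
    ["41", "0", "204", "buff"],
]
complete = drop_incomplete(rows)
print(len(rows), "records before,", len(complete), "after")
```

Deleting incomplete records is the simplest option; substituting values instead would keep all records at the cost of introducing estimated data.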
A confusion matrix was derived in order to calculate the sensitivity, specificity and accuracy of the results. The following formulae were used, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively:
Sensitivity = TP / (TP + FN)    (1)
Specificity = TN / (TN + FP)    (2)
Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)
True Positive rate = TP / (TP + FN)    (4)
False Positive rate = FP / (FP + TN)    (5)
A Receiver Operating Characteristic (ROC) space depicts the trade-off between the true positive rate and the false positive rate.
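As a worked illustration of the sensitivity, specificity, accuracy and false positive rate formulae above, the following Python sketch computes them from confusion-matrix counts. The function name is hypothetical, and the counts are invented solely so that they total 296 records; they are not results from this study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # same quantity as the true positive rate
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fp_rate": fp / (fp + tn),      # equals 1 - specificity
    }

# Hypothetical counts (120 + 20 + 140 + 16 = 296); not results from the study.
m = confusion_metrics(tp=120, fp=20, tn=140, fn=16)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that the true positive rate is the same quantity as sensitivity, and the false positive rate is the complement of specificity, so a point in ROC space is fully determined by sensitivity and specificity.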

III. RESULTS AND DISCUSSION
Table 1 shows the sensitivity, specificity, accuracy, true positive rate and false positive rate of the different classification techniques. According to the results in Fig. 1, SVM was found to be the best predictor of heart disease among the data mining models [14].
According to the research of Nidhi Bhatla, a prediction system for heart disease was developed using 15 attributes. Weka was the tool used for the experiment. First, missing values were identified in the dataset and substituted with appropriate values using the ReplaceMissingValues filter. The researcher then used Decision Tree, Naïve Bayes and Neural Network classifiers to measure prediction accuracy on the dataset.
Table 2 presents the outcomes of this study and shows that the neural network has superior accuracy compared with the other data mining techniques. Figure 2 shows the accuracy of the different data mining techniques in heart disease prediction in graphical form. The Neural Network based method is found to be the best classification technique compared with the other two methods [4]. It is important to note that, except for the sex and family heredity attributes, all attributes have numeric values. For sex, we indicate "M" or "F" for male and female patients, respectively. For the family heredity attribute, we use values such as "Father", "Mother" or "Both". Where the patient has no record of diabetes in previous generations, the attribute value in the table is left empty.
Table 3 shows the probability of a diabetic patient having heart disease, obtained by applying the Naïve Bayes classifier. This technique produces an optimal prediction model using a minimal training set. Figure 3 presents the results graphically and shows that the Naïve Bayes method is well suited to diabetic patients, for whom the probability of developing heart disease is increased [5].
The Intelligent Heart Disease Prediction System (IHDPS) is another notable prediction system that uses three commonly used data mining techniques, namely Decision Trees, Naïve Bayes and Neural Network. IHDPS is a web-based prediction system that has been observed to be user-friendly and scalable. Unlike traditional prediction systems, IHDPS can answer "what if" questions. The initial data set had 909 records and 15 attributes. These were split into two data sets of nearly equal size: the training set had 455 records and the testing set had 454 records. According to the results for the above-mentioned techniques, the Decision Tree method proved most effective with 89% accuracy, followed by Naïve Bayes (86.5% accuracy) and Neural Network (85.53% accuracy) [6].
4 shows that the Decision Tree method performs better than Naïve Bayes and ANN when it incorporates feature subset selection, although at the cost of a high model construction time. However, it must be noted that Naïve Bayes performs consistently before and after attribute reduction, with the same model construction time. Classification via clustering, on the other hand, performs poorly compared with the other two methods [7].
Research by Asha Rajkumar suggests that data classification based on machine learning algorithms results in higher accuracy. In this research, Tanagra was used for data classification and the data was evaluated using 10-fold cross-validation, after which the results were compared. The training data had 3000 instances with 14 distinct attributes. The instances in the dataset were the results of different types of tests performed on patients to predict the occurrence of heart disease. The dataset was divided into two parts such that 70% of the data was used for training and 30% for testing.
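The two evaluation schemes mentioned above, a 70/30 hold-out split and 10-fold cross-validation, can be sketched as follows. This is an illustrative sketch using only the Python standard library; the function names and the random seed are assumptions, and it does not reproduce Tanagra's own procedure.

```python
import random

def holdout_split(n, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 3000 instances, as in the study described above.
train, test = holdout_split(3000)
print("hold-out sizes:", len(train), len(test))
folds = list(k_fold_indices(3000, k=10))
print("number of folds:", len(folds))
```

In cross-validation every instance is used for testing exactly once, so the averaged accuracy depends less on one particular split than a single hold-out estimate does.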
Table 5 contains the performance values of the different classification algorithms. The algorithms were compared, and it was observed that the Naive Bayes (NB) algorithm performs better than the other two methods. This is primarily because it takes only a few milliseconds to calculate the accuracy [18]. According to the values in Table 6, the accuracy was calculated using three main attributes, namely left ventricular hypertrophy, normal and stress abnormal. Performance was determined by comparing accuracy, and the Naive Bayes algorithm was observed to perform better [15].
Sellapan et al. used an original data set with 909 records and 13 attributes. To simplify the data set, the attributes were categorized for all models. Using a Genetic Algorithm with feature subset selection, the number of attributes was reduced to six. The reduced data set was then given to three classification models, i.e. Naïve Bayes, Decision Tree and Classification via Clustering. K-fold cross-validation was used as the test method. The attribute to be predicted was the class identifier, with the value "buff" indicating no cardiac disease and the value "sick" indicating the presence of cardiac disease [16].
A Genetic Algorithm uses the methodology of natural evolution to find a solution to a given problem in a very large search space. The search process in the genetic algorithm begins with zero attributes and a randomly generated initial population. Following the natural principle of survival of the fittest, each new population is generated from the fittest individuals of the current generation, so that the offspring inherit the best traits of their parents. Offspring are produced by applying the genetic operators, i.e. crossover and mutation. The process of creating subsequent generations continues until it evolves a population P in which every individual satisfies the fitness criteria. Starting from an initial population of 20 individuals, generations were created up to the twentieth generation, with a crossover probability of 0.6 and a mutation probability of 0.033. The genetic search short-listed six attributes out of thirteen.
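The genetic search described above can be sketched in Python as follows. The population size, number of generations, and crossover and mutation probabilities are those reported in the text; everything else is an assumption made for illustration, including the random initial population, tournament selection, single-point crossover and, in particular, the placeholder fitness function with its hypothetical set of "informative" attribute indices. A real study would score each attribute subset with a classifier or a subset evaluator instead.

```python
import random

N_ATTRS, POP_SIZE, GENERATIONS = 13, 20, 20  # values reported in the text
P_CROSSOVER, P_MUTATION = 0.6, 0.033         # values reported in the text

# Placeholder (assumption): hypothetical indices of useful attributes.
INFORMATIVE = {2, 7, 8, 9, 11, 12}

def fitness(mask):
    """Toy fitness: reward informative attributes, penalise the rest."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.5 * (sum(mask) - hits)

def tournament(pop, rng):
    """Pick the fitter of two randomly chosen individuals."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def evolve(seed=0):
    rng = random.Random(seed)
    # Each individual is a bit mask: 1 means the attribute is selected.
    pop = [[rng.randint(0, 1) for _ in range(N_ATTRS)] for _ in range(POP_SIZE)]
    best = max(pop, key=fitness)
    for _ in range(GENERATIONS):
        nxt = []
        while len(nxt) < POP_SIZE:
            p1, p2 = tournament(pop, rng), tournament(pop, rng)
            c1, c2 = p1[:], p2[:]
            if rng.random() < P_CROSSOVER:      # single-point crossover
                cut = rng.randint(1, N_ATTRS - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):              # bit-flip mutation
                for i in range(N_ATTRS):
                    if rng.random() < P_MUTATION:
                        child[i] = 1 - child[i]
                nxt.append(child)
        pop = nxt[:POP_SIZE]
        best = max(pop + [best], key=fitness)   # remember the best so far
    return best

best_mask = evolve()
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected attribute indices:", selected)
```

Because the fitness function here is only a stand-in, the indices it selects carry no clinical meaning; the sketch shows the mechanics of selection, crossover and mutation only.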
Another approach to heart attack prediction, presented in 2009, discusses the extraction of significant patterns. In this approach, the data warehouse is first preprocessed to make it suitable for the mining process. The preprocessed data is then grouped using the K-means clustering algorithm to extract the relevant data, and the MAFIA algorithm is used to mine frequent patterns relevant to heart disease, which are then ranked by significance. A neural network is trained on the selected patterns to enable efficient prediction of heart attack among susceptible patients. A multi-layer perceptron neural network is used, with back-propagation as the training algorithm.

IV. CONCLUSION
The discussion above reviewed the results of different research papers on the prediction of heart disease using data mining techniques, their classifiers and extensions of those classifiers. The study pointed out the different classifiers that show good results. Different papers used different classifiers, algorithms or techniques, such as Support Vector Machine (SVM), Neural Networks (NN), Naïve Bayes (NB) and its extensions, different versions of Decision Trees (DT), K-Nearest Neighbor (K-NN), Artificial Neural Networks (ANN), Multi-Layer Perceptron (MLP), Genetic Algorithms and Feature Subset Selection. However, NB, DT and SVM produced more accurate results than the other methods.
In future, we will extend this work by applying a probabilistic approach to these three classifiers, in the hope of obtaining more effective results for the prediction of heart disease.

Fig. 1. Graphical representation of data mining models with TP and FP rate

TABLE 1. COMPARISON OF DATA MINING MODELS

TABLE 2. COMPARISON OF DATA MINING TECHNIQUES

TABLE 3. RESULTS OF CLASSIFIED INSTANCES WITH DIFFERENT EXPERIMENTS

Columns: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area, Incorrectly Classified Instances
Fig. 3. Graphical representation of results

TABLE 5. PERFORMANCE STUDY OF ALGORITHM

TABLE 6. ALGORITHMS PERFORMANCE ACCORDING TO RECALL AND PRECISION