The two files that you are given contain data about a large number of patients (approximately 20000).
Both files have an id for each patient that uniquely identifies the patient, so patient 34573 in the first file is the same patient as 34573 in the second. The problem is that the id numbers are not in order and a couple of patients are missing from each file (though not the same patients). There are also some missing values that need to be estimated.
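Since the id numbers are not sorted and a few patients are absent from each file, a small sketch may help show how R's merge handles both issues. The data frames and their values here are made-up stand-ins, not the real files, and the column names other than id are only illustrative.

```r
# Toy stand-ins for the two files (values are invented for illustration).
demo  <- data.frame(id = c(3, 1, 4, 2), age   = c(40, 25, 61, 33))
tests <- data.frame(id = c(2, 5, 1, 3), testA = c(55, 70, 12, 90))

# merge() matches rows on the key, so neither input has to be sorted.
both <- merge(demo, tests, by = "id")              # keeps ids found in both files
full <- merge(demo, tests, by = "id", all = TRUE)  # keeps every id, NA where a file lacks it

# ids that appear in one file but not the other
only_in_demo  <- setdiff(demo$id, tests$id)
only_in_tests <- setdiff(tests$id, demo$id)
```

Whether to keep only the patients present in both files or to keep everyone and estimate the gaps is a decision for the report; the sketch only shows how to see the difference.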
The first file contains some demographic data and simple measurements (height, weight, etc.).
variable    type      values
id                    unique key
age         numeric   1-80
ethnic      factor    0=white, 1=black, 2=hispanic, 3=nativeAmerican, 4=asian, 5=other
income      factor    0 = 0 to 30,000, 1 = 30,001 to 60,000, 2 = 60,001 to 90,000, 3 = 90,001 to 120,000, 4 = above 120,000
marital     factor    0=single, 1=married
occGroup    factor    10 occupational groups
gender      factor    0=female, 1=male
weight      numeric   90 - 300
height      numeric   40 - 80
heartRate   numeric   30 - 110
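The integer codes listed for the factor variables can be turned into labelled R factors, and the documented numeric ranges can be used to flag suspicious values. The following is a minimal sketch on invented vectors; the object names are hypothetical and the actual file layout is an assumption.

```r
# Invented codes standing in for the ethnic and gender columns.
ethnic_code <- c(0, 2, 5, 1, 4)
gender_code <- c(1, 0, 0, 1, 1)

# Map the documented codes to readable labels.
ethnic <- factor(ethnic_code, levels = 0:5,
                 labels = c("white", "black", "hispanic",
                            "nativeAmerican", "asian", "other"))
gender <- factor(gender_code, levels = 0:1, labels = c("female", "male"))

# Flag numeric values outside the documented range (weight: 90 to 300).
weight    <- c(150, 95, 20, 301)
weight_ok <- weight >= 90 & weight <= 300
```

A code that is not among the listed levels becomes NA in the resulting factor, which makes invalid entries easy to count with is.na().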
The second file contains the results of 5 medical tests, where the results range from 0 to 100. It also has an indicator of whether or not the patient has a disease (0=yes, 1=no).
variable    type      values
id                    unique key
testA       numeric   0 to 100
testB       numeric   0 to 100
testC       numeric   0 to 100
testD       numeric   0 to 100
testE       numeric   0 to 100
disease     factor    0=yes, 1=no

Procedure

Follow the steps below to process the data files:
1. do some preliminary analysis and clean up any outliers or missing data
2. merge the two data sets using the id field (R has a command called merge that will be helpful)
3. test all 5 classifiers that we used this semester (trees, naive Bayes, KNN, SVM, ANN) to find the best technique for classifying the data. Use disease as the class and 10-fold cross validation to verify the results. Use accuracy to measure the quality of the classification.
4. analyze the data further (plotting, clustering, statistics) to support your claim of which classifier is best. You may need to analyze just parts of the data by selecting certain values of variables.
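The 10-fold cross validation in step 3 can be sketched as a plain loop. This is only an illustration under stated assumptions: the data are synthetic, the tree is fitted with the rpart package (one possible choice for the "trees" classifier, not necessarily the one used in class), and all object names are hypothetical.

```r
library(rpart)  # recommended package included with standard R installations

set.seed(1)
n <- 200
# Synthetic stand-in data; the real merged data frame would replace this.
d <- data.frame(testA = runif(n, 0, 100), testB = runif(n, 0, 100))
d$disease <- factor(ifelse(d$testA + rnorm(n, sd = 10) > 50, 1, 0))

k    <- 10
fold <- sample(rep(1:k, length.out = n))  # random fold label for every row

acc <- numeric(k)
for (i in 1:k) {
  train <- d[fold != i, ]   # nine folds for fitting
  test  <- d[fold == i, ]   # one held-out fold for scoring
  fit   <- rpart(disease ~ ., data = train, method = "class")
  pred  <- predict(fit, newdata = test, type = "class")
  acc[i] <- mean(as.character(pred) == as.character(test$disease))
}
mean_acc <- mean(acc)  # cross-validated accuracy estimate
```

Reusing the same fold vector for every classifier keeps the comparison fair, because each method is then scored on exactly the same held-out rows.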
Hand in
• a maximum 4 page report that discusses the decisions you made in the procedure above and the results that you received. Also, most importantly, include the conclusions you reached and the support for the classifier that was chosen as best. Use charts and plots to help illustrate. In writing your report, try to use the ideas in the book by Knaflic (referenced above).
• the R commands and scripts that you used to do the analysis above.
Rubric (hours are approximate – some students will take longer than others) :
                                             hours   points
cleaning, merging and preliminary analysis   3-5     7
classification                               4-7     7
analysis                                     4-7     11
total                                                25