Project title: Logistic Regression model in R - academic research paper based on live data analysis

Project deadline: 3 days.

Project goals:

0. Spend some time talking with me (the project owner) to get a better understanding of the model

1. Connect to my public server, where R-server is installed, over its public IP

- you will get all the info needed to connect to the server

- the data is in a MySQL table on the same server

- R can already read data from the MySQL database table (tested)

2. Write an R script on the R-server that creates a Logistic Regression model

- connect to the MySQL server from the R-server and fetch the data from the table (I have tested this myself and it works)

- construct the R script to create a Logistic Regression model based on the data from the table

- train the model

- Interpret the results of our logistic regression model in detail using simple layman's terms

- test the model and assess the predictive ability of the model

- create some graphs and interpret them (describe them)

- write a conclusion
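
The steps in goal 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the deliverable: the data is synthetic (generated in the script instead of read from the MySQL table), only two of the predictor columns are used, and the 70/30 train/test split and the 0.5 cutoff are arbitrary choices.

```r
# Minimal sketch of goal 2 on SYNTHETIC data (not the real drive table).
set.seed(42)

# Simulate rows with two of the raw SMART columns named in this brief.
n <- 2000
drives <- data.frame(
  SMART_5_raw   = rpois(n, lambda = 0.3),
  SMART_187_raw = rpois(n, lambda = 0.2)
)

# Assumed relationship, for illustration only:
# higher raw counts raise the odds of failure.
log_odds <- -3 + 1.2 * drives$SMART_5_raw + 0.9 * drives$SMART_187_raw
drives$failure <- rbinom(n, size = 1, prob = plogis(log_odds))

# Split into training and test sets (70/30 is an arbitrary choice).
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- drives[train_idx, ]
test  <- drives[-train_idx, ]

# Train a binomial logistic regression model.
fit <- glm(failure ~ SMART_5_raw + SMART_187_raw,
           data = train, family = binomial(link = "logit"))
print(summary(fit))

# Odds ratios: multiplicative change in the odds of failure
# for a one-unit increase in each predictor.
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Assess predictive ability on the held-out rows (0.5 cutoff is an assumption).
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)
confusion <- table(actual = test$failure, predicted = test_pred)
accuracy  <- mean(test_pred == test$failure)
print(confusion)
print(accuracy)
```

Because failures are rare in this kind of data, plain accuracy can look good even for a weak model, so the confusion table is probably the more informative output to discuss in the report.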

3. Deliverables:

- Deliver the R script code that generated the result table

- Write a Word document report with the premise, model description, model validation, created graphs and conclusion

4. Required tools: Internet connection and latest Firefox/Chrome browser

5. Quality of work goals: Academic paper level required, with detailed explanations of the created model backed up by high-quality academic references (books) available online.

Milestone-1. Connect to the server, write the R script, generate a model.

Milestone-2. Describe every line of the R script code, backed up with references to the manual and links to examples.

Milestone-3. Detailed report with a final conclusion about the premise. (The premise must be confirmed and the model should be confirmed valid.)

Submission deadline: Sunday, April [login to view URL] 2018. 15:00 CET.

Model type: binomial logistic regression

Data domain description:
We have a MySQL database table (one large table with millions of records).
The table structure is:
- date: the date the sample data was collected
- serial_number: the serial number of the disk drive unit
- model: the model of the drive
- capacity_bytes: capacity of the disk drive in bytes (not relevant)
- failure: a flag showing whether the drive has failed (died), with value 1 if it has, or value 0 if it has not. If the drive has failed (died), then we can train the model with the params from this specific unit (serial_number)
The rest of the columns are values that are "predictors" of the "failure" event:
- SMART_5_raw,
- SMART_5_normalized,
- SMART_187_raw,
- SMART_187_normalized,
- SMART_188_raw,
- SMART_188_normalized,
- SMART_197_raw,
- SMART_197_normalized,
- SMART_198_raw,
- SMART_198_normalized
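
A query for these columns could be assembled along the following lines. This is a sketch under assumptions: the table name `drive_stats` is hypothetical (the brief does not state it), `build_query` and `fetch_drive_data` are illustrative helper names, and the fetch helper assumes the DBI package plus a MySQL driver are installed and that a connection is already open. `capacity_bytes` is left out because it is marked as not relevant.

```r
# Column names are taken from the table description in this brief.
id_cols   <- c("date", "serial_number", "model", "failure")
smart_ids <- c(5, 187, 188, 197, 198)

# Build SMART_<id>_raw / SMART_<id>_normalized pairs in table order.
smart_cols <- paste0("SMART_", rep(smart_ids, each = 2),
                     c("_raw", "_normalized"))

# Hypothetical helper: assemble a plain SELECT statement.
build_query <- function(table_name, columns) {
  sprintf("SELECT %s FROM %s", paste(columns, collapse = ", "), table_name)
}

# "drive_stats" is a HYPOTHETICAL table name used only for illustration.
query <- build_query("drive_stats", c(id_cols, smart_cols))
print(query)

# Hypothetical helper: run the query over an already-open DBI connection.
# It is defined here but not called, since it needs the live server.
fetch_drive_data <- function(con, sql = query) {
  DBI::dbGetQuery(con, sql)
}
```

With millions of records, it may be worth adding a `WHERE` or `LIMIT` clause while developing, so that the full table is only pulled once the script is stable.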

Hypothesis to prove:
One can use SMART parameters to predict the failure of a disk drive unit!
If the RAW value is greater than zero for any of the SMART stats listed above, this is treated as a warning sign of impending drive failure.

Data to validate the hypothesis:

Operational drives with one or more of our five SMART stats greater than zero – 4.2% (not failed)
Failed drives with one or more of our five SMART stats greater than zero – 76.7% (failed)

That means that 23.3% of failed drives showed no warning from the SMART stats we record. Are these stats useful?
Conclusion to achieve: SMART params are a sign of impending drive failure 76.7% of the time.
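
The percentages above could be recomputed from the table roughly like this. The sketch uses a tiny made-up data frame with one row per drive, so its output (75% and 25%) has nothing to do with the real figures; only the column names come from this brief. The last lines simply check the arithmetic behind the 23.3% figure.

```r
# Tiny SYNTHETIC example, one row per drive; these values are made up.
drives <- data.frame(
  failure       = c(1, 1, 1, 1, 0, 0, 0, 0),
  SMART_5_raw   = c(3, 0, 0, 0, 0, 0, 1, 0),
  SMART_187_raw = c(0, 2, 0, 0, 0, 0, 0, 0),
  SMART_188_raw = c(0, 0, 0, 0, 0, 0, 0, 0),
  SMART_197_raw = c(0, 0, 8, 0, 0, 0, 0, 0),
  SMART_198_raw = c(0, 0, 0, 0, 0, 0, 0, 0)
)

# Flag drives where one or more of the five raw SMART stats is above zero.
raw_cols <- grep("_raw$", names(drives), value = TRUE)
drives$any_warning <- rowSums(drives[raw_cols] > 0) >= 1

# Share of drives with a warning, by failure status (in percent).
warning_share <- 100 * tapply(drives$any_warning, drives$failure, mean)
print(warning_share)

# Failed drives that showed no warning are the complement.
failed_without_warning <- 100 - warning_share[["1"]]
print(failed_without_warning)

# The same complement for the figure stated in this brief.
stated_without_warning <- 100 - 76.7
print(stated_without_warning)
```

The real table holds one row per drive per date, so the rows would presumably need to be reduced to one record per `serial_number` (for example the last one) before computing these shares.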


