
[Statistical Analysis with R] Data Manipulation

by JANIMUN 2021. 2. 16.

[Important steps of a data analysis]

1. Import dataset from an external file (e.g. xls, txt, SPSS file)

2. Import check: check if dataset has been read correctly

3. Save dataset as R dataset (.Rdata), e.g. dat_raw.Rdata 

4. Data check: check whether data are correct or missing, then e.g. remove participants/variables or decide on imputation. Save the corrected dataset as a new dataset, e.g. dat_corrected.Rdata

5. Transform variables, compute new variables, and/or select a subset for the final analysis. Save this again as a new dataset, e.g. dat_final.Rdata, and use it in all further steps

6. Descriptive statistics to describe the main characteristics of the study sample

7. Main analyses

8. Secondary analyses

9. Sensitivity analyses
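Steps 1 to 3 can be sketched in base R roughly as follows. This is only a minimal sketch: the CSV content, the column names, and the temporary file paths are made up for illustration, and a real project would point read.csv() (or another import function) at its own file.

```r
# Hypothetical example data written to a temporary CSV file,
# standing in for the external file of step 1.
raw_file <- tempfile(fileext = ".csv")
writeLines(c("id,age,weight,height",
             "1,34,70.5,1.75",
             "2,-1,82.0,1.80",
             "3,51,NA,1.62"),
           raw_file)

# Step 1: import the dataset
dat_raw <- read.csv(raw_file)

# Step 2: import check (dimensions, variable types, first rows)
dim(dat_raw)
str(dat_raw)
head(dat_raw)

# Step 3: save the untouched import as an R dataset
rdata_file <- file.path(tempdir(), "dat_raw.Rdata")
save(dat_raw, file = rdata_file)

# A later session can restore the object under its original name with load()
rm(dat_raw)
load(rdata_file)
```

Keeping the raw import in its own file means the corrected and final datasets of steps 4 and 5 can always be rebuilt from it.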

 

Data check & Manipulation (steps 4 & 5)

 

Ways to check:

  • Important logical operators: == (equal), & (and), | (or), ! (not)

  • In combination with operators and functions that compare/evaluate values (>, <, >=, <=, is.na()) and further specific functions, e.g. for evaluating strings, many questions can be answered

  • The number of times such an evaluation is TRUE can then be displayed using the table() function

  • e.g. Does anyone have an age smaller than 0? table(age < 0)

  • e.g. How many missing values does the variable age have? table(is.na(age))

  • e.g. How many people have a BMI of 0? table(BMI == 0)

  • e.g. How many people have an insulin level of 0? table(insulin == 0)

  • e.g. Are the people with a BMI of 0 the same people as those with an insulin level of 0? table((BMI == 0) & (insulin == 0))
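The checks above can be tried on a few made-up values. The vectors below are invented for illustration and are not taken from any real dataset; the useNA argument and sum() are shown as optional variations.

```r
# Made-up values for illustration only
age     <- c(25, -3, 40, NA, 61)
BMI     <- c(22.1, 0, 30.4, 0, 27.8)
insulin <- c(85, 0, 0, 0, 120)

table(age < 0)                    # by default the NA is left out of the count
table(age < 0, useNA = "ifany")   # also shows how many comparisons were NA
table(is.na(age))                 # number of missing values
table(BMI == 0)
table(insulin == 0)
table((BMI == 0) & (insulin == 0))

# sum() counts the TRUE values directly
n_both_zero <- sum(BMI == 0 & insulin == 0)
```

Note that table() silently drops NA unless useNA is set, so a separate is.na() check is worth doing for every variable.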

Transform:

  • Change the variable type using the functions as.numeric(), as.character(), as.factor(), as.numeric(as.character()), as.Date()

  • Create a new variable through a mathematical operation, e.g.:

    • compute BMI from height and weight: dat$BMI <- dat$weight / (dat$height^2)

    • standardize variables with the scale() function: dat$BMI_z <- scale(dat$BMI)

  • Remove/add/replace values of a variable with the [ ] operator:

    • dat$BMI[1] <- 20

    • dat$BMI[dat$BMI < 0] <- NA
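The transformations above, put together on a small made-up data frame. The column score is a hypothetical factor added only to show why as.numeric(as.character()) is listed as its own conversion; weight is assumed to be in kg and height in m.

```r
# Made-up data frame; weight, height and BMI follow the examples above
dat <- data.frame(weight = c(70, 85, 60),
                  height = c(1.75, 1.80, 1.60),
                  score  = factor(c("10", "20", "10")))

# A factor stores integer codes, so as.numeric() alone returns the codes;
# going through as.character() recovers the values written in the labels
codes  <- as.numeric(dat$score)
values <- as.numeric(as.character(dat$score))

# New variable from a mathematical operation
dat$BMI <- dat$weight / (dat$height^2)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
dat$BMI_z <- as.numeric(scale(dat$BMI))

# Replace values by position or by condition
dat$BMI[1] <- 20
dat$BMI[dat$BMI < 0] <- NA
```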

 

 

 


  • The same ideas as for transforming variables apply to data frames (columns of a data frame = variables = vectors!)

  • Select a subset of a data frame to filter variables/observations, or add columns/rows. This can be done using the [ , ] operator, the data.frame() function, and others, e.g.:

    • dat[dat$Age != 0, ]

    • dat_female <- dat[dat$Gender == "F", ]

    • dat_final <- data.frame(ID = dat_female$PatientId, Age = dat_female$Age, NoShow = dat_female$`No-show`)

    • subset() function

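  • Missing values?

    • is.na() finds them, but a plain comparison such as a$mpg == 21 returns NA where mpg is missing, so subsetting with it alone doesn't always work well!

    • a[a$mpg == 21 & !is.na(a$mpg), ]  ;; try it like this!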
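A small sketch of the subsetting and missing-value points above. The data frame is made up; its column names (PatientId, Age, Gender, No-show) only mirror the ones used in the examples, and the value 58 is an arbitrary choice.

```r
# Made-up appointment data
dat <- data.frame(PatientId = 1:5,
                  Age       = c(0, 34, NA, 58, 21),
                  Gender    = c("F", "M", "F", "F", "M"))
dat[["No-show"]] <- c("No", "Yes", "No", "Yes", "No")

# Rows are selected before the comma, columns after it
dat_female <- dat[dat$Gender == "F", ]

# Backticks are needed because the hyphen makes "No-show" a non-syntactic name
dat_final <- data.frame(ID     = dat_female$PatientId,
                        Age    = dat_female$Age,
                        NoShow = dat_female$`No-show`)

# A comparison with NA yields NA, and [ ] then returns an extra row full of NAs
with_na_row <- dat[dat$Age == 58, ]

# Adding !is.na() keeps only the rows where the comparison is really TRUE
clean <- dat[dat$Age == 58 & !is.na(dat$Age), ]

# subset() drops rows whose condition is NA on its own
same <- subset(dat, Age == 58)
```

Here with_na_row has two rows (one of them all NA), while clean and same each contain only the patient aged 58.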

 

Tidyverse

  • In addition to the "classical" R programming that we have mostly used so far, there are many newer packages and functions that introduce new objects, structures, and ways to program

  • Many are collected in the tidyverse; an example pipeline:

    • Pima_diabetes %>%
        dplyr::select(Pregnancies, BMI) %>%
        dplyr::filter(Pregnancies > 10) %>%
        dplyr::summarize(av_BMI_highP = mean(BMI), n = dplyr::n())
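To see what each step of the pipeline does, here is a rough base-R equivalent that needs no extra packages. The small data frame is a made-up stand-in for Pima_diabetes (the Glucose column and all values are invented), so the numbers mean nothing beyond illustration.

```r
# Made-up stand-in for the Pima_diabetes data set
Pima_diabetes <- data.frame(Pregnancies = c(2, 11, 13, 0, 12),
                            BMI         = c(24.0, 30.0, 36.0, 28.5, 33.0),
                            Glucose     = c(90, 140, 155, 100, 130))

# select(): keep only the two columns of interest
sel <- Pima_diabetes[, c("Pregnancies", "BMI")]

# filter(): keep the rows with more than 10 pregnancies
high <- sel[sel$Pregnancies > 10, ]

# summarize(): one row with the mean BMI and the number of rows
result <- data.frame(av_BMI_highP = mean(high$BMI), n = nrow(high))
result
```

One difference to keep in mind: dplyr::filter() drops rows where the condition is NA, whereas the [ , ] operator keeps them as NA rows unless !is.na() is added.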

 

Dates and times are tricky in R, read more!

  • See R_2c_dates_and_times_in_R.pdf

  • Do exercise 4 in R 2 exercises.Rmd.
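As a small taste of why dates need care, here is a minimal base-R sketch using as.Date(). The date strings are invented for illustration; the format codes (%d, %m, %Y) are the standard ones documented under ?strptime.

```r
# Text that is not in ISO form (YYYY-MM-DD) needs an explicit format
visit <- as.Date("16.02.2021", format = "%d.%m.%Y")
start <- as.Date("2021-01-01")

# Subtracting two dates gives a difftime in days; as.numeric() extracts the number
days_since_start <- as.numeric(visit - start)

# A string that does not match the format becomes NA instead of raising an error
bad <- as.Date("02/16/2021", format = "%d.%m.%Y")

# format() turns a date back into text
visit_text <- format(visit, "%Y/%m/%d")
```

Because a mismatched format silently produces NA, running table(is.na()) on every converted date column is a sensible import check.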

     
