[Statistical Analysis with R] Data Manipulation

Important steps of a data analysis

1. Import dataset from an external file (e.g. xls, txt, SPSS file)

2. Import check: check if dataset has been read correctly

3. Save dataset as R dataset (.Rdata), e.g. dat_raw.Rdata

4. Data check: check whether values are correct or missing, and e.g. remove participants/variables or decide on imputation. Save the corrected dataset as a new dataset, e.g. dat_corrected.Rdata

5. Transform variables, compute new variables, and/or select a subset for the final analysis. Save this again as a new dataset, e.g. dat_final.Rdata, and use it in all further steps

6. Descriptive analysis to describe the main characteristics of the study sample

7. Main analyses

8. Secondary analyses

9. Sensitivity analyses
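
Steps 1 to 3 can be sketched roughly as follows. This is only a minimal illustration: the toy CSV file and its column names (id, age, weight, height) are made up here so that the code runs on its own, and a real import would use your own file and possibly another reader function.

```r
# Toy CSV file, written to a temporary location (contents are invented)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,age,weight,height",
             "1,34,70.5,1.75",
             "2,NA,82.0,1.80",
             "3,51,64.2,1.62"), csv_file)

# 1. Import the dataset from an external file
dat_raw <- read.csv(csv_file)

# 2. Import check: dimensions, variable types, first rows
dim(dat_raw)
str(dat_raw)
head(dat_raw)

# 3. Save as R dataset; load() restores it under the same object name
rdata_file <- file.path(tempdir(), "dat_raw.Rdata")
save(dat_raw, file = rdata_file)
rm(dat_raw)
load(rdata_file)
```

Keeping dat_raw.Rdata untouched and saving every later stage under a new name makes it possible to go back to the raw import at any time.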

Data check & Manipulation (steps 4 & 5)

Ways to check:

Important logical operators: & | ! (and, or, not)
In combination with comparison operators (== != > < >= <=), is.na(), and further specific functions, e.g. to evaluate strings, many questions about the data can be answered
The number of times such an evaluation is TRUE can then be displayed using the table() function
e.g. Does anyone have an age smaller than 0?: table(age < 0)
e.g. How many missing values does the variable age have?: table(is.na(age))
e.g. How many people have a BMI of 0?: table(BMI == 0)
e.g. How many people have an insulin level of 0?: table(insulin == 0)
e.g. Are the people with BMI 0 the same people as those with insulin 0?: table((BMI == 0) & (insulin == 0))
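
The checks above can be tried on a small toy data frame. The values below are invented for illustration, and the variables are accessed as dat$age etc. on the assumption that they live in a data frame called dat.

```r
# Toy data with a negative age, a missing age, and some zero values
dat <- data.frame(
  age     = c(25, -1, NA, 40, 33),
  BMI     = c(22.5, 0, 31.0, 0, 27.3),
  insulin = c(80, 0, 0, 95, NA)
)

table(dat$age < 0)                       # rows with NA are dropped by default
table(dat$age < 0, useNA = "ifany")      # also show how many comparisons are NA
table(is.na(dat$age))                    # number of missing ages
table(dat$BMI == 0)
table(dat$BMI == 0 & dat$insulin == 0)   # both zero in the same row

# sum() counts the TRUE values directly
n_both_zero <- sum(dat$BMI == 0 & dat$insulin == 0, na.rm = TRUE)
```

Note that table() silently ignores NA results unless useNA is set, so a check can look clean even though some values could not be evaluated.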

Transform:

Change the variable type using the functions as.numeric(), as.character(), as.factor(), as.numeric(as.character()) (to convert a factor that holds numbers), as.Date()
Create a new variable through a mathematical operation, e.g.
- compute BMI from height (in m) and weight (in kg): dat$BMI <- dat$weight / (dat$height^2)
- standardize variables with the scale() function: dat$BMI_z <- scale(dat$BMI)
Remove/add/replace values of a variable with the [ ] operator:
- dat$BMI[1] <- 20
- dat$BMI[dat$BMI < 0] <- NA
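
A short sketch of these transformations on invented data. The column score and the negative weight in the last row are assumptions added only to show the factor conversion and the replacement of implausible values; wrapping scale() in as.numeric() is also a choice made here, since scale() returns a one-column matrix.

```r
# Toy data: weight in kg, height in m, and a factor that holds numbers
dat <- data.frame(
  weight = c(70, 82, 64, -64),                 # last value is a data entry error
  height = c(1.75, 1.80, 1.62, 1.70),
  score  = factor(c("10", "20", "5", "20"))
)

# Factor to numeric: as.numeric() alone returns the internal level codes
as.numeric(dat$score)
dat$score_num <- as.numeric(as.character(dat$score))

# New variable from a mathematical operation
dat$BMI <- dat$weight / (dat$height^2)

# Replace values with the [ ] operator
dat$BMI[1] <- 20               # overwrite a single value
dat$BMI[dat$BMI < 0] <- NA     # set implausible values to missing

# Standardize (mean 0, sd 1); missing values are left out of mean and sd
dat$BMI_z <- as.numeric(scale(dat$BMI))
```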

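Select subset:

The same ideas apply as for transforming variables (columns of a data frame = variables = vectors!)
Select a subset of a data frame to filter variables/observations, or add columns/rows. This can be done using the [ , ] operator, the data.frame() function, and others, e.g.:
- dat[dat$Age != 0, ] (the comma is needed: rows are selected before it, columns after it)
- dat_female <- dat[dat$Gender == "F", ]
- dat_final <- data.frame(ID = dat_female$PatientId, Age = dat_female$Age, NoShow = dat_female$`No-show`) (a column name containing a hyphen needs backticks; read.csv() by default renames such a column to No.show)
- subset() function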
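
The subsetting calls can be tried out as follows. The data frame is a made-up stand-in with the column names used above; check.names = FALSE is set here only so that the hyphenated name No-show survives.

```r
# Toy data frame with the column names used in the examples above
dat <- data.frame(
  PatientId = 1:5,
  Gender    = c("F", "M", "F", "F", "M"),
  Age       = c(34, 0, 58, 0, 41),
  `No-show` = c("No", "Yes", "No", "No", "Yes"),
  check.names = FALSE
)

# Filter rows: condition before the comma, all columns after it
dat_nonzero <- dat[dat$Age != 0, ]
dat_female  <- dat[dat$Gender == "F", ]

# Build a new data frame from selected (and renamed) columns
dat_final <- data.frame(ID     = dat_female$PatientId,
                        Age    = dat_female$Age,
                        NoShow = dat_female$`No-show`)

# subset() combines row filter and column selection in one call
dat_sub <- subset(dat, Gender == "F" & Age != 0, select = c(PatientId, Age))
```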
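Missing values?
- A comparison returns NA for a missing value, so filtering with a[a$mpg == 21, ] alone doesn't always work well: it also returns a row of NAs for every missing mpg
- a[a$mpg == 21 & !is.na(a$mpg), ] ;; try it like this, combined with is.na()!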
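
The effect of missing values on filtering can be seen in a small example. The data frame a below is invented (it only borrows the column names mpg and cyl); which() is shown as one further option that was not part of the notes above.

```r
# Toy data with one missing mpg value
a <- data.frame(mpg = c(21, 21, NA, 21.4, 18.7),
                cyl = c(6, 6, 4, 6, 8))

a$mpg == 21                                # the third element is NA, not FALSE

with_na    <- a[a$mpg == 21, ]                    # includes a row made of NAs
without_na <- a[a$mpg == 21 & !is.na(a$mpg), ]    # only the true matches
via_which  <- a[which(a$mpg == 21), ]             # which() drops NA as well
```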

Tidyverse

In addition to the "classical" R programming that we have mostly used so far, there are many newer packages and functions that introduce new objects, structures, and ways of programming
Many of them are collected in the tidyverse; an example:

library(dplyr)  # provides the %>% pipe

Pima_diabetes %>%
  dplyr::select(Pregnancies, BMI) %>%
  dplyr::filter(Pregnancies > 10) %>%
  dplyr::summarize(av_BMI_highP = mean(BMI), n = n())
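
For comparison, the same steps (select two columns, keep rows with more than 10 pregnancies, compute mean BMI and count) might look like this in classical base R. The small Pima_diabetes data frame here is a made-up stand-in so that the sketch runs without the real dataset.

```r
# Invented stand-in for the Pima_diabetes data
Pima_diabetes <- data.frame(
  Pregnancies = c(2, 11, 13, 0, 12),
  BMI         = c(26.1, 33.4, 30.2, 22.8, 35.0)
)

# select + filter in one [ , ] call: rows before the comma, columns after
high_p <- Pima_diabetes[Pima_diabetes$Pregnancies > 10, c("Pregnancies", "BMI")]

# summarize: one row with the mean BMI and the number of observations
result <- data.frame(av_BMI_highP = mean(high_p$BMI), n = nrow(high_p))
result
```

The pipe version reads top to bottom as a sequence of steps, while the base R version needs an intermediate object; which one is clearer is partly a matter of taste.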

Dates and times are tricky in R, read more!

See R_2c_dates_and_times_in_R.pdf
Do exercise 4 in R 2 exercises.Rmd.
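
As a small taste of why dates need care, here is a minimal sketch with as.Date(). The date strings and the day.month.year input format are assumptions chosen for illustration.

```r
# Dates stored as text in day.month.year order
d_chr <- c("02.03.2021", "16.02.2021")

# The format string must describe how the text is written
d <- as.Date(d_chr, format = "%d.%m.%Y")
class(d)

# Date arithmetic: difference in days
days_between <- as.numeric(difftime(d[1], d[2], units = "days"))

# Output in ISO format (year-month-day)
iso <- format(d, "%Y-%m-%d")

# A format that does not match the text gives NA instead of an error
wrong <- as.Date("02.03.2021", format = "%Y-%m-%d")
```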
