r - subset based on date in a reference table -
i have table1 follows.
studentid date1 lunch 23433 2014-08-26 yes 233989 2014-08-18 no 909978 2014-08-06 no 777492 2014-08-11 yes 3987387 2014-08-26 no
i have table, table2 follows
id studentid date2 result_nm 1 777492 2012.06.10 0.1 2 777492 2013.12.06 2.0 3 777492 2014.08.30 0.6 4 23433 2011.08.26 3.0 5 23433 2015.04.06 3.0 6 233989 2011.05.14 0.003 7 233989 2014.09.14 0.05 8 909978 2004-09-12 0.2 9 909978 2005-05-10 0.23 10 909978 2015-01-02 2.4 11 3987387 2014-10-06 3.5 12 3987387 2014-08-26 1.17
i want retain observations table2 dataset date2 values less date1 values each studentid. in other words should contain these rows.
id studentid date2 result_nm 1 777492 2012.06.10 0.1 2 777492 2013.12.06 2.0 4 23433 2011.08.26 3.0 6 233989 2014.09.14 0.05 8 909978 2004-09-12 0.2 9 909978 2005-05-10 0.23 12 3987387 2014-08-26 1.17
observation 3 excluded because date1 value studentid 777492 2014-08-11 , value less 2014.08.30, observations 5,7,10,11 on. have used subset before little more challenging , need help.
we can change 'date' columns 'date' class using ymd
lubridate
. can take multiple formats (.
, -
). join 2 dataset (left_join
) 'studentid', remove rows using filter
, select
specific columns
library(lubridate) library(dplyr) df2$date2 <- ymd(df2$date2) df1$date1 <- ymd(df1$date1) left_join(df2, df1, by='studentid') %>% filter(date2 <=date1) %>% select(1:4) # id studentid date2 result_nm #1 1 777492 2012-06-10 0.100 #2 2 777492 2013-12-06 2.000 #3 4 23433 2011-08-26 3.000 #4 6 233989 2011-05-14 0.003 #5 8 909978 2004-09-12 0.200 #6 9 909978 2005-05-10 0.230 #7 12 3987387 2014-08-26 1.170
or can use data.table
. here convert 'df2' 'data.frame' 'data.table' (setdt
), set key 'studentid' (setkey(..., studentid)
), join
subset of 'df1' ('studentid', 'date1'), filter output dataset based on condition (.sd[date2 <= date1]
) grouped 'key' variable. more info .eachi
here
library(data.table) setkey(setdt(df2),studentid)[df1[1:2], .sd[date2<=date1],by=.eachi][order(id)] # studentid id date2 result_nm #1: 777492 1 2012-06-10 0.100 #2: 777492 2 2013-12-06 2.000 #3: 23433 4 2011-08-26 3.000 #4: 233989 6 2011-05-14 0.003 #5: 909978 8 2004-09-12 0.200 #6: 909978 9 2005-05-10 0.230 #7: 3987387 12 2014-08-26 1.170
note: 'dates' changed 'date' class before join
.
data
df1 <- structure(list(studentid = c(23433l, 233989l, 909978l, 777492l, 3987387l), date1 = c("2014-08-26", "2014-08-18", "2014-08-06", "2014-08-11", "2014-08-26"), lunch = c("yes", "no", "no", "yes", "no")), .names = c("studentid", "date1", "lunch"), class = "data.frame", row.names = c(na, -5l)) df2 <- structure(list(id = 1:12, studentid = c(777492l, 777492l, 777492l, 23433l, 23433l, 233989l, 233989l, 909978l, 909978l, 909978l, 3987387l, 3987387l), date2 = c("2012.06.10", "2013.12.06", "2014.08.30", "2011.08.26", "2015.04.06", "2011.05.14", "2014.09.14", "2004-09-12", "2005-05-10", "2015-01-02", "2014-10-06", "2014-08-26"), result_nm = c(0.1, 2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)), .names = c("id", "studentid", "date2", "result_nm"), class = "data.frame", row.names = c(na, -12l))