r - subset based on date in a reference table -


i have table1 follows.

   studentid        date1         lunch    23433            2014-08-26    yes    233989           2014-08-18    no    909978           2014-08-06    no    777492           2014-08-11    yes    3987387          2014-08-26    no 

i have table, table2 follows

id  studentid        date2        result_nm 1   777492           2012.06.10   0.1 2   777492           2013.12.06   2.0 3   777492           2014.08.30   0.6 4   23433            2011.08.26   3.0 5   23433            2015.04.06   3.0 6   233989           2011.05.14   0.003 7   233989           2014.09.14   0.05 8   909978           2004-09-12   0.2 9   909978           2005-05-10   0.23 10  909978           2015-01-02   2.4 11  3987387          2014-10-06   3.5 12  3987387          2014-08-26   1.17 

i want retain observations table2 dataset date2 values less date1 values each studentid. in other words should contain these rows.

  id  studentid        date2         result_nm   1   777492           2012.06.10    0.1   2   777492           2013.12.06    2.0   4   23433            2011.08.26    3.0   6   233989           2014.09.14    0.05   8   909978           2004-09-12    0.2   9   909978           2005-05-10    0.23   12  3987387          2014-08-26    1.17 

observation 3 excluded because date1 value studentid 777492 2014-08-11 , value less 2014.08.30, observations 5,7,10,11 on. have used subset before little more challenging , need help.

we can change 'date' columns 'date' class using ymd lubridate. can take multiple formats (., -). join 2 dataset (left_join) 'studentid', remove rows using filter , select specific columns

library(lubridate)  library(dplyr) df2$date2 <- ymd(df2$date2) df1$date1 <- ymd(df1$date1)  left_join(df2, df1, by='studentid') %>%                       filter(date2 <=date1) %>%                       select(1:4) #    id studentid      date2 result_nm #1  1    777492 2012-06-10     0.100 #2  2    777492 2013-12-06     2.000 #3  4     23433 2011-08-26     3.000 #4  6    233989 2011-05-14     0.003 #5  8    909978 2004-09-12     0.200 #6  9    909978 2005-05-10     0.230 #7 12   3987387 2014-08-26     1.170 

or can use data.table. here convert 'df2' 'data.frame' 'data.table' (setdt), set key 'studentid' (setkey(..., studentid)), join subset of 'df1' ('studentid', 'date1'), filter output dataset based on condition (.sd[date2 <= date1]) grouped 'key' variable. more info .eachi here

library(data.table) setkey(setdt(df2),studentid)[df1[1:2], .sd[date2<=date1],by=.eachi][order(id)] #   studentid id      date2 result_nm #1:    777492  1 2012-06-10     0.100 #2:    777492  2 2013-12-06     2.000 #3:     23433  4 2011-08-26     3.000 #4:    233989  6 2011-05-14     0.003 #5:    909978  8 2004-09-12     0.200 #6:    909978  9 2005-05-10     0.230 #7:   3987387 12 2014-08-26     1.170 

note: 'dates' changed 'date' class before join.

data

df1 <-  structure(list(studentid = c(23433l, 233989l, 909978l, 777492l,  3987387l), date1 = c("2014-08-26", "2014-08-18", "2014-08-06",  "2014-08-11", "2014-08-26"), lunch = c("yes", "no", "no", "yes",  "no")), .names = c("studentid", "date1", "lunch"),  class = "data.frame", row.names = c(na, -5l))  df2 <-  structure(list(id = 1:12, studentid = c(777492l, 777492l,  777492l,  23433l, 23433l, 233989l, 233989l, 909978l, 909978l, 909978l,  3987387l, 3987387l), date2 = c("2012.06.10", "2013.12.06",  "2014.08.30",  "2011.08.26", "2015.04.06", "2011.05.14", "2014.09.14", "2004-09-12",  "2005-05-10", "2015-01-02", "2014-10-06", "2014-08-26"),  result_nm = c(0.1,  2, 0.6, 3, 3, 0.003, 0.05, 0.2, 0.23, 2.4, 3.5, 1.17)), .names = c("id",  "studentid", "date2", "result_nm"), class = "data.frame",  row.names = c(na, -12l)) 

Popular posts from this blog