tibble has unique values, and if not exclude the non-unique values from
the tibble before joining the datasets. Assign the result to an object
called
.
Use count()
and select only the counts that are greater
than one.
nrow(crop)
## [1] 3415
nrow(crop2)
## [1] 3416
Indeed: the number of rows are not identical. Thus in one (or both)
of the joined tables there probably are non-unique keys. Thus, let’s
check in which table these non-unique keys are:
nrow(crop %>% left_join(site, by = "IDSite")) # 3415 thus this is okay
## [1] 3415
nrow(crop %>% left_join(irrigation, by = "IDCrop")) # 3416 thus irrigation has a duplicate key!
## [1] 3416
Therefore: let’s filter out the non-unique key from
irrigation
, by first counting the number of records for
each key, filtering the duplicates, and using the anti_join
to remove them from the irrigation
dataset:
count_ir <- irrigation %>%
count(IDCrop) %>%
filter(n > 1)
irrigation2 <- anti_join(irrigation, count_ir, by = "IDCrop")
crop2 <- crop %>%
left_join(site, by = "IDSite") %>%
left_join(irrigation2, by = "IDCrop")
nrow(crop)
## [1] 3415
nrow(crop2)
## [1] 3415
Now, crop
and crop2
have the same number of
rows!
Note that the same can be accomplished in different ways! For
example, the above can be written in a single pipe, e.g.:
crop2 <- crop %>%
left_join(site, by = "IDSite") %>%
left_join(anti_join(irrigation,
irrigation %>%
count(IDCrop) %>%
filter(n > 1),
by = "IDCrop"),
by = "IDCrop")