# Detecting a sequence by group and computing a new variable for the subset

I need to detect a sequence by group in a `data.frame` and compute a new variable for it.

Consider the following `data.frame`:

``````
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
                  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
                  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
                  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D",
                              "A", "B", "A", "A", "A,B,C", "D", "D"),
                  stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D",
                            "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D

I am interested in computing a measure for each `ID` that follows this sequence of `count` values:

``````
- count == 1
- count > 1
- count == 1
``````

In the example this is true for:

``````
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
``````
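To make the pattern concrete, here is a small base-R sketch (the function name `flag_pattern` is mine, just for illustration) that flags the final row of each `1, >1, 1` run of `count` within a single `ID`:

```r
# Flag positions i where count[i-2] == 1, count[i-1] > 1 and count[i] == 1,
# i.e. the last row of a qualifying three-row sequence.
flag_pattern <- function(count) {
  n <- length(count)
  if (n < 3) return(logical(n))
  c(FALSE, FALSE,
    count[3:n] == 1 & count[2:(n - 1)] > 1 & count[1:(n - 2)] == 1)
}

flag_pattern(c(2, 1, 3, 1, 1, 2, 3))  # counts of ID == 1: TRUE at position 4
```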

For these IDs and rows, I need to compute a measure called `new` that takes the value of `product` from the last row of the sequence, if that product appears in the second row of the sequence and NOT in the `stock` of the first row of the sequence.

The desired outcome is shown below:

``````> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3
3  3    2    3    4   D
``````
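Reading "in the second row" as "in the second row's `stock`", and assuming the last row's `product` is a single item (as in the sample), the rule for one detected three-row window can be sketched in base R like this (the helper names are mine):

```r
# Does a single product p appear in a comma-separated stock string s?
in_stock <- function(p, s) p %in% strsplit(s, ",")[[1]]

# Given count/product/stock of one three-row window, return `new`.
new_for_window <- function(count, product, stock) {
  ok_counts <- count[1] == 1 && count[2] > 1 && count[3] == 1
  ok_new <- in_stock(product[3], stock[2]) && !in_stock(product[3], stock[1])
  if (ok_counts && ok_new) product[3] else ""
}

# rows 2-4 of ID == 1: "C" is in "A,B,C" but not in "A,B"  -> "C"
new_for_window(c(1, 3, 1), c("B", "C", "C"), c("A,B", "A,B,C", "A,B,C"))
# rows 8-10 of ID == 2: "A" is already in the first row's stock -> ""
new_for_window(c(1, 2, 1), c("A", "B", "A"), c("A", "A,B", "A,B"))
```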

Note:

1. In the sequence detected for an `ID`, no new products are added to the `stock`.
2. The original data contains many IDs that do not have any qualifying sequences.
3. Some `ID`s have multiple qualifying sequences; all of them should be recorded.
4. `count` is always 1 or greater.
5. The original data holds millions of `ID`s with up to 1,500 sequences each.

How would you write an efficient piece of code to get this output?

• Can the count ever be less than 1? – C. Braun Apr 11 at 18:59
• @C.Braun No. The count is always 1 or greater. – wake_wake Apr 11 at 19:01

Here's a `data.table` option:

``````library(data.table)

char_cols <- c("product", "stock")

# in case product/stock are factors, convert them to character
setDT(df1)[, (char_cols) := lapply(.SD, as.character), .SDcols = char_cols]

# condition 1: count pattern 1, >1, 1 (flags the third row of each run)
df1[, c1 := (count == 1) &
            (shift(count) > 1) &
            (shift(count, 2L) == 1),
    by = ID]

# regex pattern: "A,B,C" -> "(A|B|C)"
df1[, pat := paste0("(", gsub(",", "|", product), ")")]

# condition 2: the last row's product appears in the previous row's product
# but not in the stock two rows back
df1[, c2 := mapply(grepl, pat, shift(product)) &
            !mapply(grepl, pat, shift(stock, 2L)),
    by = ID]

df1[(c1), new := ifelse(c2, product, "")]      # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)]  # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)]        # result
``````
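The `pat` column above turns a comma-separated `product` such as `"A,B,C"` into the alternation pattern `"(A|B|C)"`, so a single `grepl()` call can test whether any of those items occurs in another string; a quick check of that idea:

```r
# "A,B,C" -> "(A|B|C)": any of A, B, C matched by one grepl() call
pat <- paste0("(", gsub(",", "|", "A,B,C"), ")")
pat                # "(A|B|C)"
grepl(pat, "A,B")  # TRUE
grepl(pat, "D")    # FALSE
```

This treats the product names as regex literals, which is safe here because they are single characters; names containing regex metacharacters would need escaping.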
• `setDT` converts `data.frame` to `data.table`. You can convert back using `setDF`. – M-M Apr 12 at 21:49

Here's another approach using the `tidyverse`; however, I think `lag` and `lead` have made this solution a bit time-consuming. I included comments within the code to make it more legible.

But I spent enough time on it to post it anyway.

``````library(tidyverse)

df1 %>%
  group_by(ID) %>%

  # find the rows with count > 1 whose neighbours
  # (the row before and the row after) both have count == 1
  mutate(test = (count > 1 &
                 c(F, lag(count == 1)[-1]) &
                 c(lead(count == 1)[-n()], F))) %>%

  # mark the whole three-row chunk that meets the desired condition,
  # so we can filter on it
  mutate(test2 = test | c(F, lag(test)[-1]) | c(lead(test)[-n()], F)) %>%

  filter(test2) %>% ungroup() %>%

  # group each three occurrences, in case an ID has multiple sequences
  group_by(G = trunc(3:(n() + 2) / 3)) %>% group_by(ID, G) %>%

  # create the new column with string-extraction techniques
  # (assuming product/stock are characters)
  mutate(new = str_remove_all(
    as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
    stock[1])) %>%

  # select the desired columns and add `times` for long-to-wide conversion
  select(ID, G, seqs, new) %>% mutate(times = 1:n()) %>% ungroup() %>%

  # long to wide conversion using tidyr (part of the tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>% spread(col, value) %>%

  # put the columns in the desired order
  select(-G, -new, new) %>% as.data.frame()

#   ID seqs_1 seqs_2 seqs_3 new
# 1  1      2      3      4   C
# 2  2      1      2      3
# 3  3      2      3      4   D
``````
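One detail worth unpacking: `G = trunc(3:(n() + 2) / 3)` labels every consecutive block of three filtered rows with the same group id, which is what separates multiple qualifying sequences within one `ID`. In base R the same labelling can be written either way:

```r
n <- 9  # e.g. nine filtered rows = three sequences
trunc(3:(n + 2) / 3)     # 1 1 1 2 2 2 3 3 3
ceiling(seq_len(n) / 3)  # the same labelling, perhaps more idiomatic
```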