How to find the count of consecutive same string values in a pandas dataframe?

Assume that we have the following pandas dataframe:

df = pd.DataFrame({'col1':['AG','CT','CT','GT','CT', 'AG','AG','AG'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

input:
 col1 col2  start
0  AG  TCT   1000
1  CT  ACA   2000
2  CT  TCA   3000
3  GT  TCA   4000
4  CT  GCT   5000
5  AG  ACT   6000
6  AG  CTG  10000
7  AG  ATG  20000
8  CA  TCT  10000
9  CT  ACA   2000
10 CT  TCA   3000
11 CT  TCA   4000

What I want to get is the number of consecutive values in col1 and length of these consecutive values and the difference between the last element's start and first element's start:

output:
 type length  diff
0  CT  2   1000
1  AG  3   14000
2  CT  3   2000

Topic dataframe pandas

Category Data Science


Break col1 into sub-groups of consecutive strings. Extract first and last entry per sub-group.

Something like this:

df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'], 'start':[1000,2000,3000,4000,5000,6000,10000,20000]})

df['subgroup'] = (df['col1'] != df['col1'].shift(1)).cumsum()


  col1 col2  start  subgroup
0  A>G  TCT   1000         1
1  C>T  ACA   2000         2
2  C>T  TCA   3000         2
3  G>T  TCA   4000         3
4  C>T  GCT   5000         4
5  A>G  ACT   6000         5
6  A>G  CTG  10000         5
7  A>G  ATG  20000         5


df.groupby('subgroup',as_index=False).apply(lambda x: (x['col1'].head(1),
                                                        x.shape[0],
                                                        x['start'].iloc[-1] - x['start'].iloc[0]))

0        ([A>G], 1, 0)
1     ([C>T], 2, 1000)
2        ([G>T], 1, 0)
3        ([C>T], 1, 0)
4    ([A>G], 3, 14000)

Tweak as needed.

UPDATE: for pandas 1.1+ replace the last part with:

def func(x):
    result = {"type":x['col1'].head(1).values[0], "length": x.shape[0], "diff": x['start'].iloc[-1] - x['start'].iloc[0]}
    return pd.Series(result, name="index")
df.groupby('subgroup',as_index=False).apply(func)

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.