Counting indexes in pandas

I feel like this is a rudimentary question but I'm very new to this and just haven't been able to crack it / find the answer.

Ultimately what I'm trying to do here is to count unique values on a certain column and then determine which of those unique values have more than one unique value in a matching column.

So for this data, what I am trying to determine is "who" has "more than one receipt" for all purchases, then determine the same information based on each product category.

My approach so far:

We have a dataset like this:

receipt,name,etc,category
1,george,xxx,fish
1,george,xxx,cat
2,george,xxx,fish
3,bill,xxx,fish
3,bill,xxx,dog
4,jill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,dog
6,george,xxx,fish

So then I can do this:

df.set_index(['name','receipt'])

And get the more interesting

                etc category
name   receipt
george 1        xxx     fish
       1        xxx      cat
       2        xxx     fish
bill   3        xxx     fish
       3        xxx      dog
jill   4        xxx      cat
bill   5        xxx      cat
       5        xxx      cat
       5        xxx      dog
george 6        xxx     fish

At this point it feels to me like the data is easy to work with, but I haven't figured it out.

One thing that is interesting to me is that if I sort the data by name before indexing it, the data displays grouped by name. In both cases the index is the same, so I don't know how to play with the representation of the data after indexing.

It is easy to find the data by category using

 orders.loc[orders['category'] == 'fish']
                etc category
name   receipt
george 1        xxx     fish
       2        xxx     fish
bill   3        xxx     fish
george 6        xxx     fish

But what I can't figure out is how to tell pandas "Find me the list of names that have more than one receipt".

Smaller questions:

  • What is the "pandas way" to get the length of the names part of the index? I'm supposing I could just turn the name column into a set and get the length of that. But I'm curious about indexes.

Edit / Update

Thanks for those answers! Here is clarification on what I am looking for:

I'm trying to find "repeat customers": people with more than one receipt.

So my set of all customers would be:

names: ['george','bill','jill'], ratio: 1.0

My repeat customers:

names: ['george','bill'], ratio 0.66

All 'fish' customers:

names: ['george','bill'], ratio: 0.666

My repeat 'fish' customers:

names: ['george'], ratio: 0.333

I think the examples given look helpful, but feel free to add anything.

Topic pandas indexing python

Category Data Science


It's not quite clear what exactly are you trying to achieve (it would be helpful to understand your goals if you would post desired / expected data sets)...

But i'll try to guess ;)

Data:

In [100]: df
Out[100]:
   receipt    name  etc category
0        1  george  xxx     fish
1        1  george  xxx      cat
2        2  george  xxx     fish
3        3    bill  xxx     fish
4        3    bill  xxx      dog
5        4    jill  xxx      cat
6        5    bill  xxx      cat
7        5    bill  xxx      cat
8        5    bill  xxx      dog
9        6  george  xxx     fish

Setting virtual column count, showing # of rows grouped by name and filtering (querying) it using .query() method:

In [101]: (df.assign(count=df.groupby('name').receipt.transform('size'))
     ...:    .query("category in ['dog','cat'] and count > 1"))
     ...:
Out[101]:
   receipt    name  etc category  count
1        1  george  xxx      cat      4
4        3    bill  xxx      dog      5
6        5    bill  xxx      cat      5
7        5    bill  xxx      cat      5
8        5    bill  xxx      dog      5

or you can group by several columns and filter resulting groups:

In [102]: df.groupby(['name','category']).filter(lambda x: len(x) > 2)
Out[102]:
   receipt    name  etc category
0        1  george  xxx     fish
2        2  george  xxx     fish
9        6  george  xxx     fish

I think maybe you are looking for:

receipts_by_name_x_cat = df.groupby(['name','category']).count()

Or, if you just want the total across all categories:

receipts_by_name = df.groupby(['name']).count()

Then, you can search those who have more than one:

receipts_by_name[receipts_by_name['receipt']>1]

And, you can find the length of an index by typing:

len(df.index.get_level_values(0))

Assuming the name was the first index column (otherwise substitute 1, 2, etc.)

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.