Counting indexes in pandas
I feel like this is a rudimentary question but I'm very new to this and just haven't been able to crack it / find the answer.
Ultimately what I'm trying to do here is to count unique values on a certain column and then determine which of those unique values have more than one unique value in a matching column.
So for this data, what I am trying to determine is "who" has "more than one receipt" for all purchases, then determine the same information based on each product category.
My approach so far:
We have a dataset like this:
receipt,name,etc,category
1,george,xxx,fish
1,george,xxx,cat
2,george,xxx,fish
3,bill,xxx,fish
3,bill,xxx,dog
4,jill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,cat
5,bill,xxx,dog
6,george,xxx,fish
So then I can do this:
df.set_index(['name','receipt'])
And get the more interesting
etc category
name receipt
george 1 xxx fish
1 xxx cat
2 xxx fish
bill 3 xxx fish
3 xxx dog
jill 4 xxx cat
bill 5 xxx cat
5 xxx cat
5 xxx dog
george 6 xxx fish
At this point it feels to me like the data is easy to work with, but I haven't figured it out.
One thing that is interesting to me is that if I sort the data by name before indexing it, the data displays grouped by name. In both cases the index is the same, so I don't know how to play with the representation of the data after indexing.
It is easy to find the data by category using
orders.loc[orders['category'] == 'fish']
etc category
name receipt
george 1 xxx fish
2 xxx fish
bill 3 xxx fish
george 6 xxx fish
But what I can't figure out is how to tell pandas "Find me the list of names that have more than one receipt".
Smaller questions:
- What is the "pandas way" to get the length of the names part of the index? I'm supposing I could just turn the
name
column into a set and get the length of that. But I'm curious about indexes.
Edit / Update
Thanks for those answers! Here is clarification on what I am looking for:
I'm trying to find "repeat customers": people with more than one receipt.
So my set of all customers would be:
names: ['george','bill','jill'], ratio: 1.0
My repeat customers:
names: ['george','bill'], ratio 0.66
All 'fish' customers:
names: ['george','bill'], ratio: 0.666
My repeat 'fish' customers:
names: ['george'], ratio: 0.333
I think the examples given look helpful, but feel free to add anything.
Category Data Science