Why keep the vocabulary and posting list separate in a search engine?
I am taking a class in information retrieval. We learned that the index of a search engine has (possibly among other things):
- A vocabulary mapping terms to their statistics (frequency, type, ...) and
- A posting list mapping terms to the documents where they occur (with or without positions, fields, ...)
These are separate data structures. I understand why this information is needed and what it is used for. But I don't understand why we want to keep them separate. Why can't we have one data structure that maps terms to both statistics and documents?
My current guess is that the vocabulary is much smaller than the posting list, so it can be kept in memory. We could then use the statistics to drop query terms that are unlikely to be useful, or to look for misspellings in the query, without having to touch the large posting list.
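To make my guess concrete, here is a minimal Python sketch of the layout I have in mind. It is only an illustration under my own assumptions, not how any particular engine does it: the names `build_index`, `read_postings`, `prune_query` and the `max_df_ratio` threshold are hypothetical, and an `io.BytesIO` buffer stands in for the large posting file on disk.

```python
import io
import struct


def build_index(docs):
    """Build a small in-memory vocabulary and a separate postings byte stream."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)

    vocabulary = {}                 # term -> (document frequency, byte offset)
    postings_file = io.BytesIO()    # assumption: stands in for a large file on disk
    for term in sorted(postings):
        doc_ids = postings[term]
        vocabulary[term] = (len(doc_ids), postings_file.tell())
        # Store each document id as a 4-byte little-endian unsigned integer.
        postings_file.write(struct.pack(f"<{len(doc_ids)}I", *doc_ids))
    return vocabulary, postings_file


def read_postings(vocabulary, postings_file, term):
    """Fetch one term's postings: the only step that touches the postings file."""
    df, offset = vocabulary[term]
    postings_file.seek(offset)
    return list(struct.unpack(f"<{df}I", postings_file.read(4 * df)))


def prune_query(vocabulary, query_terms, num_docs, max_df_ratio=0.8):
    """Drop unknown or very common terms using only the in-memory vocabulary."""
    kept = []
    for term in query_terms:
        entry = vocabulary.get(term)
        if entry is None:
            continue                # unknown term, possibly a misspelling
        if entry[0] / num_docs > max_df_ratio:
            continue                # occurs in almost every document
        kept.append(term)
    return kept


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
vocabulary, postings_file = build_index(docs)
useful_terms = prune_query(vocabulary, ["the", "cat", "unicorn"], len(docs))
results = {t: read_postings(vocabulary, postings_file, t) for t in useful_terms}
```

In this sketch, `prune_query` never reads the postings buffer. Only the terms that survive pruning trigger a seek and a read, which is the saving I suspect the separation is meant to provide.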
Is this correct or is there another reason to keep vocabulary and posting list separate?
Topic indexing information-retrieval search
Category Data Science