Why keep vocabulary and posting list separate in a search engine

I am taking a class in information retrieval. We learned that the index of a search engine has (possibly among other things):

  1. A vocabulary mapping terms to their statistics (frequency, type, ...) and
  2. A posting list mapping terms to the documents where they appear (with or without positions, fields, ...)

These are separate data structures. I understand why this information is needed and what it is used for. But I don't understand why we want to keep the two separate. Why can't we have one data structure that maps terms to both their statistics and their documents?

My current thinking is that the vocabulary is much smaller and could therefore be kept in main memory. We could then use the statistics to drop query terms that are unlikely to be useful, or to detect misspellings in the query, without ever touching the large posting lists.

Is this correct or is there another reason to keep vocabulary and posting list separate?
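The intuition in the question can be sketched in a few lines of Python. This is a toy illustration, not a real engine: the layout (an in-memory vocabulary mapping each term to its document frequency plus a byte offset into a separate on-disk postings file) and all names are my own assumptions.

```python
import json
import os
import tempfile

# Toy collection: doc id -> text
docs = {
    1: "the cat sat",
    2: "the dog sat",
}

# Build raw posting lists: term -> sorted list of doc ids
postings = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        postings.setdefault(term, []).append(doc_id)

# Write postings to disk; keep only a small vocabulary in memory.
# vocab: term -> (document frequency, byte offset, byte length)
vocab = {}
postings_path = os.path.join(tempfile.gettempdir(), "postings.bin")
with open(postings_path, "wb") as f:
    for term, doc_ids in sorted(postings.items()):
        data = json.dumps(sorted(doc_ids)).encode()
        vocab[term] = (len(doc_ids), f.tell(), len(data))
        f.write(data)

def lookup(term):
    """Consult only the small in-memory vocabulary first;
    read from disk only when the term actually exists."""
    if term not in vocab:  # e.g. a misspelled query term
        return []
    df, offset, length = vocab[term]
    with open(postings_path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))

print(lookup("sat"))   # [1, 2]
print(lookup("catt"))  # [] -- rejected without touching the postings file
```

The point of the split is visible in `lookup`: a missing or low-value term is handled entirely from the in-memory vocabulary, and the expensive disk read happens only for terms that survive that check.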

Topics: indexing, information-retrieval, search

Category Data Science


There are several reasons (performance, design, storage, compression, ease of evaluating the data structures). The main reason is that these structures have been verified in practice; that said, you can always build your own data structure and demonstrate a new approach.

Even Google published a paper verifying that its design works well. If you do want to try your own data structure and design, I suggest you pick a test collection of an appropriate size, measure precision and recall on it, and go from there.

Another reason is that the two kinds of information have different requirements with respect to performance, compression, storage, and hardware.

The cost per gigabyte is very different for RAM and for disk. When you run many servers, cost reduction pushes you toward cheap storage and modest hardware requirements.

This becomes obvious when you have a collection of terabytes of data spanning many languages or countries (Google has a paper about its cluster architecture).

Think of a simple case: a 50 GB collection is small, but how much does a server with 60 GB of RAM cost compared to a server with 50 GB of disk?

The answer is clear, and it is reflected in the data structures: B-trees work well in secondary memory, while simple hash tables are good and fast in main memory.
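The contrast between the two structures can be sketched with Python's standard library. This is only an illustration of the trade-off, with invented data: a sorted array stands in for a B-tree-like, range-friendly vocabulary, while a plain dict stands in for the hash table.

```python
import bisect

# Sorted-array / B-tree style: terms kept in order,
# which makes binary search and prefix (wildcard) scans cheap.
terms = sorted(["cat", "dog", "sat", "the"])

i = bisect.bisect_left(terms, "sa")
prefix_matches = [t for t in terms[i:] if t.startswith("sa")]
print(prefix_matches)  # ['sat']

# Hash-table style: constant-time exact lookup,
# but no notion of order, so no cheap prefix or range scans.
vocab_hash = {t: idx for idx, t in enumerate(terms)}
print("dog" in vocab_hash)  # True
```

The same difference holds at scale: an ordered structure pays off when the vocabulary lives on disk or must answer range queries, while a hash table wins for exact lookups of an in-memory vocabulary.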

But you decide.

A good classic reference is chapter 5 of Managing Gigabytes (https://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?s=books&ie=UTF8&qid=1474990622&sr=1-1&keywords=managing+gigabytes).

That chapter includes a table comparing results for many data structures and types of memory.

Google's paper shows one way to verify your implementation and design: http://infolab.stanford.edu/~backrub/google.html

Also think about clusters and job scheduling: http://static.googleusercontent.com/media/research.google.com/pt-BR//pubs/archive/43438.pdf
