Using Information from the Rest of a Sequence to Predict the Label for Any One Item
I have a dictionary of variable-length sequences:
# Last 10 characters of each file name, paired with the length of its sequence
[(file_name[-10:], len(tag_is_header_list))
 for file_name, tag_is_header_list in HEADER_PATTERN_DICT.items()]
[('37bd1.html', 25),
('0bcce.html', 40),
('90364.html', 28),
('8f9c7.html', 24),
('d12d4.html', 73),
('46837.html', 37),
('adb92.html', 53),
('0a1e7.html', 69),
('da077.html', 43),
('9366a.html', 21),
('6ae4d.html', 37),
('f62ee.html', 19),
('73aee.html', 33),
('e090a.html', 35),
('8b093.html', 44)]
Each item in a sequence is a (tag, is_header) pair, where the label says whether or not the item is a subject heading. Here is the shortest sequence:
# Pick the shortest sequence in the dictionary
min(HEADER_PATTERN_DICT.values(), key=len)
[(None, True),
('div', False),
('div', False),
(None, True),
(None, False),
('li', False),
('li', False),
('li', False),
(None, False),
(None, False),
('li', False),
('li', False),
('li', False),
(None, True),
(None, True),
('li', False),
('li', False),
('li', False),
('div', False)]
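Every item in a sequence is an instance whose label should be predicted, and I would like the prediction for any one item to use information from the rest of its sequence. What is the best way to vectorize these variable-length sequences so that a model can be trained to predict each item's label?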
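
To make the question concrete, here is a minimal sketch of one possible vectorization: a fixed-size context window around each item. This is only an illustration under assumptions, not a claim about the best method. The names `window_features`, `build_dataset`, and `radius`, the `"<pad>"` marker, and the toy `example.html` dictionary are all hypothetical. Only the neighbours' tags are used as features, because the neighbours' labels would not be known at prediction time.

```python
from typing import Dict, List, Optional, Tuple

# One item is a (tag, is_header) pair, as in HEADER_PATTERN_DICT.
Item = Tuple[Optional[str], bool]


def window_features(sequence: List[Item], index: int, radius: int = 2) -> Dict[str, object]:
    """Describe one item by its own tag and the tags of its neighbours."""
    features: Dict[str, object] = {
        # Relative position puts sequences of different lengths on one scale.
        "relative_position": index / max(len(sequence) - 1, 1),
    }
    for offset in range(-radius, radius + 1):
        neighbour = index + offset
        if 0 <= neighbour < len(sequence):
            tag = sequence[neighbour][0]
            features[f"tag[{offset:+d}]"] = "None" if tag is None else tag
        else:
            # Positions before the start or after the end of the sequence.
            features[f"tag[{offset:+d}]"] = "<pad>"
    return features


def build_dataset(pattern_dict: Dict[str, List[Item]], radius: int = 2):
    """Flatten all sequences into per-item feature dicts, labels, and file groups."""
    rows, labels, groups = [], [], []
    for file_name, sequence in pattern_dict.items():
        for index, (_, is_header) in enumerate(sequence):
            rows.append(window_features(sequence, index, radius))
            labels.append(is_header)
            groups.append(file_name)  # lets a split keep each file together
    return rows, labels, groups


# Toy stand-in for HEADER_PATTERN_DICT (hypothetical file name).
example = {"example.html": [(None, True), ("div", False), ("div", False), (None, True)]}
rows, labels, groups = build_dataset(example)
print(rows[0])
print(labels)
```

Every item then has a feature dictionary with the same keys regardless of sequence length, so the rows could be one-hot encoded (for example with scikit-learn's `DictVectorizer`) and passed to an ordinary classifier, with `groups` used to keep all items of one file on the same side of a train/test split. The window only sees `radius` neighbours on each side; a sequence labeller such as a CRF or a bidirectional recurrent network would be the alternative if the whole sequence should inform each prediction.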