Here's what you should do:
Prepare your dataset: Follow instructions similar to those described in the paper to preprocess your dataset. This will be your major task; once it is done, you only have to fine-tune the model. If you don't have a dataset, you can use the dataset used in this research paper, which can be downloaded from here.
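As a concrete starting point, here is a minimal sketch of the preprocessing, assuming the loader expects parallel train.src / train.tgt files with one WordPiece-tokenized example per line (this matches the preprocessed datasets linked in the readme, but verify the exact file names and format in biunilm/seq2seq_loader.py):
from pytorch_pretrained_bert.tokenization import BertTokenizer

# do_lower_case must match the model variant you plan to fine-tune (cased here)
tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

def write_tokenized(pairs, src_path, tgt_path):
    # pairs: iterable of (source_text, target_text) raw strings
    with open(src_path, 'w', encoding='utf-8') as f_src, \
         open(tgt_path, 'w', encoding='utf-8') as f_tgt:
        for source_text, target_text in pairs:
            f_src.write(' '.join(tokenizer.tokenize(source_text)) + '\n')
            f_tgt.write(' '.join(tokenizer.tokenize(target_text)) + '\n')

# example: write_tokenized(my_pairs, 'train.src', 'train.tgt')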
Download the pre-trained model, or start from the provided fine-tuned model checkpoint (from the link). You will have to check which version of the model works best for your dataset. If you select a model fine-tuned for the summarization task and your dataset is similar to the CNN/DailyMail [37] or Gigaword [36] datasets, you can skip fine-tuning.
Fine-tune the model: In this step, you will use the command given in the readme of the GitHub repository. Note that some parameters must be set according to the language model you downloaded in the previous step, and you can change the number of epochs in the following command based on the size of your dataset. Also note that this step requires a GPU: the repository readme recommends 2 or 4 V100-32G GPU cards for fine-tuning the model (see the note after the command if your GPUs are smaller or fewer).
DATA_DIR=/{path_of_preprocessed_dataset}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_RECOVER_PATH=/{path_of_pre-trained_model}/unilmv1-large-cased.bin
export PYTORCH_PRETRAINED_BERT_CACHE=/{tmp_folder}/bert-cased-pretrained-cache
export CUDA_VISIBLE_DEVICES=0,1,2,3
python biunilm/run_seq2seq.py --do_train --fp16 --amp --num_workers 0 \
--bert_model bert-large-cased --new_segment_ids --tokenized_input \
--data_dir ${DATA_DIR} \
--output_dir ${OUTPUT_DIR}/bert_save \
--log_dir ${OUTPUT_DIR}/bert_log \
--model_recover_path ${MODEL_RECOVER_PATH} \
--max_seq_length 192 --max_position_embeddings 192 \
--trunc_seg a --always_truncate_tail --max_len_a 0 --max_len_b 64 \
--mask_prob 0.7 --max_pred 48 \
--train_batch_size 128 --gradient_accumulation_steps 1 \
--learning_rate 0.00003 --warmup_proportion 0.1 --label_smoothing 0.1 \
--num_train_epochs 30
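If your GPUs are smaller or fewer than the readme recommends, you can usually keep the same effective batch size by increasing gradient accumulation. In pytorch_pretrained_bert-style training scripts, --train_batch_size is normally divided by --gradient_accumulation_steps for each forward pass; verify that run_seq2seq.py behaves this way before relying on it. For example, replacing the corresponding line above with the following processes a quarter of the batch per forward pass while keeping the effective batch size at 128:
--train_batch_size 128 --gradient_accumulation_steps 4 \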
Evaluate your model: Use biunilm/decode_seq2seq.py to decode (i.e., predict outputs for the evaluation dataset) and use the provided evaluation script to score the trained model.
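For reference, the decoding call looks roughly like the sketch below, adapted from the summarization example in the repository readme. Treat the flag values and the checkpoint naming pattern as placeholders, and confirm the exact flag names in the readme or in decode_seq2seq.py before running:
EVAL_SPLIT=test
MODEL_RECOVER_PATH=${OUTPUT_DIR}/bert_save/model.{epoch_number}.bin
python biunilm/decode_seq2seq.py --fp16 --amp --bert_model bert-large-cased \
  --new_segment_ids --mode s2s --tokenized_input \
  --input_file ${DATA_DIR}/${EVAL_SPLIT}.src --split ${EVAL_SPLIT} \
  --model_recover_path ${MODEL_RECOVER_PATH} \
  --max_seq_length 192 --max_tgt_length 64 \
  --batch_size 64 --beam_size 5 --length_penalty 0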
Use the trained model: To use this model to make a prediction, you can simply write your own Python code to:
- load the PyTorch pre-trained model using the pytorch_pretrained_bert library, as done in the decode_seq2seq.py file
- tokenize your input (see the sketch after this list)
- predict the output and detokenize it
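For the tokenization step, a minimal sketch using the same pytorch_pretrained_bert tokenizer (assuming the cased large model; do_lower_case has to match the model you downloaded):
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)
src_tokens = tokenizer.tokenize("Your input text goes here.")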
Here is the logic which you can use:
# load the fine-tuned weights into the decoder (see decode_seq2seq.py for the full argument list)
model = BertForSeq2SeqDecoder.from_pretrained(long_list_of_arguments)
# collate a list of preprocessed instances into padded input tensors
batch = seq2seq_loader.batch_list_to_batch_tensors(input_batch)
input_ids, token_type_ids, position_ids, input_mask, mask_qkv, task_idx = batch
# run the decoder; traces holds the generated token ids
traces = model(input_ids, token_type_ids, position_ids, input_mask, task_idx=task_idx, mask_qkv=mask_qkv)
Note that this is not the complete logic; it just shows how the GitHub repository code loads the saved model and uses it to make predictions. Use traces to convert the predicted ids back to tokens and detokenize the output tokens (as done in the code here). The detokenization step is necessary because the input sequence was split into subword units by WordPiece.
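As a rough sketch of that post-processing, assuming traces is (or contains) a tensor of predicted token ids as in the non-beam-search path of decode_seq2seq.py (with beam search it is a dict and you need to pull the predicted sequence out of it first):
from pytorch_pretrained_bert.tokenization import BertTokenizer

# same tokenizer that was used to prepare the input
tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

# take the first example in the batch and map ids back to WordPiece tokens
output_ids = traces[0].tolist()
output_tokens = tokenizer.convert_ids_to_tokens(output_ids)

# stop at the first sentinel token, then merge "##" sub-tokens into whole words
cleaned = []
for tok in output_tokens:
    if tok in ('[SEP]', '[PAD]'):
        break
    cleaned.append(tok)
summary = ' '.join(cleaned).replace(' ##', '')
print(summary)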
For reference, here is the code that loads the pre-trained model. You can go through the decoding loop to understand the logic and adapt it to your case. I hope this helps.