How to deal with "Ergänzungsstrichen" and "Bindestrichen" in German NLP?

Problem

In German, the phrase "Haupt- und Nebensatz" means exactly the same as "Hauptsatz und Nebensatz". However, when processing both phrases with e.g. spaCy's de_core_news_sm pipeline, the cosine similarity of the resulting token vectors differs significantly:

token1      token2      similarity
Haupt-      Hauptsatz   0.07
und         und         0.67
Nebensatz   Nebensatz   0.87

Code to reproduce

import spacy
import numpy as np


def calc_cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


nlp = spacy.load("de_core_news_sm")
doc1 = nlp("Hauptsatz und Nebensatz")
doc2 = nlp("Haupt- und Nebensatz")
for token1, token2 in zip(doc1, doc2):
    similarity = calc_cosine_similarity(token1.vector, token2.vector)
    print(f"{token1.text}: {similarity}")
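
As a quick diagnostic (not part of the original question), it can also help to look at how the pipeline tokenizes the truncated phrase and whether each token carries a vector at all; this sketch reuses the nlp object loaded above.

# Inspect the tokenization of the truncated phrase and per-token vector information.
for token in nlp("Haupt- und Nebensatz"):
    print(token.text, token.has_vector, token.is_oov, token.vector_norm)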

My research for a solution

This Bachelor's thesis states on page 5:

A distinction is made between phrases with a complementing hyphen (Ergänzungsstrich), as in "main and subordinate clause", and those with a hyphen (Bindestrich), as in "price-performance ratio". The former are split into several tokens, the latter form a single token. (translated from the original German)

This sounds as if a preprocessing solution should be readily available. However, so far I have not been able to find one, e.g. in https://github.com/adbar/German-NLP, though I might have overlooked something. A minimal rule-based sketch of such a preprocessing step follows below.
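
As a rough illustration of that preprocessing idea, the following sketch expands the truncated compound before vectorization. The function name expand_ergaenzungsstrich and the COMPOUND_HEADS lookup are my own placeholders; in practice the omitted head would have to come from a proper German compound splitter rather than a hand-written dictionary.

import re

# Hypothetical stand-in for a real compound splitter: maps a full compound
# to the head that the truncated first coordinate omits.
COMPOUND_HEADS = {"Nebensatz": "satz"}

def expand_ergaenzungsstrich(text):
    """Rewrite e.g. 'Haupt- und Nebensatz' as 'Hauptsatz und Nebensatz'."""
    pattern = re.compile(r"(\w+)-\s+(und|oder)\s+(\w+)")

    def repl(match):
        prefix, conj, full = match.groups()
        head = COMPOUND_HEADS.get(full)
        if head is None:
            return match.group(0)  # unknown compound: leave the phrase untouched
        return f"{prefix}{head} {conj} {full}"

    return pattern.sub(repl, text)

print(expand_ergaenzungsstrich("Haupt- und Nebensatz"))
# expected output: Hauptsatz und Nebensatz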



Answer

Tested bert-base-german-cased from Hugging Face today. The similarities still differ, but are much closer, and the tokenizer splits the words in the desired way. This might already work for my use case, so I am marking the question as answered.

token1   token2   cos-similarity
[CLS]    [CLS]    0.982
Haupt    Haupt    0.933
-        ##satz   0.824
und      und      0.967
Neben    Neben    0.951
##satz   ##satz   0.958
[SEP]    [SEP]    0.977

Code to reproduce:

from transformers import AutoTokenizer, AutoModel
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModel.from_pretrained("bert-base-german-cased")

def calc_cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def calc_bert_embeddings(text):
    """Returns the tokens and a (N, 768) array of embeddings for the N tokens."""
    tokens = tokenizer(text, return_tensors="pt")
    output = model(**tokens)
    return tokens.tokens(), output["last_hidden_state"][0, :, :].detach().numpy()

tokens1, vec1 = calc_bert_embeddings("Haupt- und Nebensatz")
tokens2, vec2 = calc_bert_embeddings("Hauptsatz und Nebensatz")
for idx in range(len(tokens1)):
    similarity = calc_cosine_similarity(vec1[idx, :], vec2[idx, :])
    print(f"{tokens1[idx]};{tokens2[idx]};{similarity:.3f}")
