Artificially increasing frequency weight of word ending characters in word building
I have a database of letter pair bigrams. For example:
+-------+--------+-----------+
| first | second | frequency |
+-------+--------+-----------+
| gs    | so     | 1         |
| gs    | sp     | 2         |
| gs    | sr     | 1         |
| gs    | ss     | 3         |
| gs    | st     | 7         |
| gt    | th     | 2         |
| gt    | to     | 10        |
| gu    | u      | 2         |
| Gu    | ua     | 23        |
| Gu    | ud     | 4         |
| gu    | ue     | 49        |
| Gu    | ui     | 27        |
| Gu    | ul     | 15        |
| gu    | um     | 4         |
+-------+--------+-----------+
The way I am using this: I choose a "first" value, which is a character pair, and then look at which letters are most likely to follow it. The pairs overlap, so the second character of the first pair is always the first character of the second pair, which lets me continue the chain from the second pair. The frequency is how often I have found that pair in my dataset. A minimal sketch of this chaining, outside the database, is shown below.
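Here is that sketch using a hard-coded dictionary in place of the table above (the pairs and frequencies are made up purely for illustration):

from random import choices

# Toy transition table: each overlapping pair maps to candidate next pairs and their frequencies.
transitions = {
    " g": {"gu": 5, "go": 3},
    "gu": {"ue": 49, "ui": 27, "ua": 23, "u ": 2},
    "ue": {"e ": 10, "es": 5},
    "ui": {"id": 4, "i ": 6},
    "ua": {"ar": 8, "a ": 2},
    "es": {"s ": 7},
    "id": {"d ": 3},
    "ar": {"r ": 5},
    "go": {"o ": 4},
}

current_pair = " g"            # a pair starting with a space marks the start of a word
word = [current_pair[1]]

while current_pair in transitions:
    candidates = transitions[current_pair]
    # Frequency-weighted random pick of the next overlapping pair
    current_pair = choices(list(candidates), weights=list(candidates.values()))[0]
    word.append(current_pair[1])
    if current_pair[1] == " ":  # a trailing space marks the end of the word
        break

print("".join(word).strip())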
I am building words using Markov chains and the above data. The base issue I am tackling is that some words can end up unrealistically long despite my attempts to mitigate length, e.g. "Quakey Dit: Courdinning-Exanagolexer" and "Zwele Bulay orpirlastacival". The first one contains a word that is 24 characters long! Side note: I know those words are complete nonsense, but sometimes something good comes of it.
The work-in-progress but functioning code I am using to build these is below. To keep the post length down (and hopefully your attention up!) I am excluding my table definition code as well as my load-from-JSON function, which just loads my MariaDB connection string.
from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from random import choices
from bggdb import TitleLetterPairBigram
from toolkit import get_config
# Load configuration
config_file_name = 'config.json'
config_options = get_config(config_file_name)
# Initialize database session
sa_engine = create_engine(config_options['db_url'], pool_recycle=3600)
session = Session(bind=sa_engine)
minimum_title_length = 15
tokens = []
letter_count_threshold = 7
increase_space_percentage_factor = 0.1
letter_count_threshold_passed = 0
start_of_word_to_ignore = [" " + character for character in "("]
# Get the first letter for this title build
current_pair = choices([row.first for row in session.query(TitleLetterPairBigram.first).filter(TitleLetterPairBigram.first.like(" %")).all()])[0]
tokens.append(current_pair[1])
while True:
    # Get the selection of potential next pairs
    next_tokens = session.query(TitleLetterPairBigram).filter(TitleLetterPairBigram.first == current_pair, TitleLetterPairBigram.first.notin_(start_of_word_to_ignore)).all()
    # Ensure we got a result
    if len(next_tokens) > 0:
        # Check the flags and metrics for skewing the frequencies in favour of different outcomes.
        title_thus_far = "".join(tokens)
        if len(title_thus_far[title_thus_far.rfind(" ") + 1:]) >= letter_count_threshold:
            # Figure out the total frequency of all potential tokens
            total_bigram_frequency = sum(single_bigram.frequency for single_bigram in next_tokens)
            # The word is getting long. Start biasing towards ending the word.
            letter_count_threshold_passed += 1
            print("Total bigrams:", total_bigram_frequency, "Bias Value:", (total_bigram_frequency * increase_space_percentage_factor * letter_count_threshold_passed))
            for single_bigram in next_tokens:
                if single_bigram.second[0] == " ":
                    single_bigram.frequency = single_bigram.frequency + (total_bigram_frequency * increase_space_percentage_factor * letter_count_threshold_passed)
        # Build two tuples of equal length: candidate pairs and their weights
        pairs_with_frequencies = tuple(zip(*[[t.second, t.frequency] for t in next_tokens]))
        # Get the next pair via a frequency-weighted random choice
        current_pair = choices(pairs_with_frequencies[0], weights=pairs_with_frequencies[1])[0]
    else:
        # This word is done and there is no continuation. Satisfy the loop condition.
        break
    # Add the current letter, from the pair, to the list
    tokens.append(current_pair[1:])
    # Check if we have finished a word. Clear flags where appropriate and see if we are done with the title yet.
    if current_pair[1] == " ":
        # Reset any flags and counters
        letter_count_threshold_passed = 0
        # Check if we have reached the minimum title length.
        if len(tokens) >= minimum_title_length:
            break

print("".join(tokens))
The whole point of my question is that I want an opinion on my word-ending logic. The way it stands: once the current word is longer than 7 characters, I start to bolster the frequency count of the space-ending pairs. For every further character added to the word that is not a space, I increase the frequency multiplier for those ending pairs. This should still allow words longer than 7 characters while decreasing the chance of super long ones.
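For reference, the weighting step from the loop can be expressed as a standalone function. This is just a sketch of the same additive bias the loop applies; the bias_word_endings name and the plain (second, frequency) tuples (instead of ORM rows) are only for illustration:

def bias_word_endings(candidates, factor, overshoot):
    """Boost the weight of the pairs the loop above treats as word-ending.

    candidates: list of (second_pair, frequency) tuples
    factor: increase_space_percentage_factor from the script
    overshoot: letter_count_threshold_passed, i.e. characters past the threshold
    """
    total = sum(freq for _, freq in candidates)
    bias = total * factor * overshoot
    # Same condition as the loop: boost pairs whose first character is a space
    return [(pair, freq + bias if pair[0] == " " else freq) for pair, freq in candidates]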
I am not sure whether my logic is actually working the way I describe. Since this is based on random choice, I can't simply go back and try the same run again.
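One workaround I have considered, assuming the weighting itself is what needs checking, is to fix the random seed and simulate the single weighted draw many times with a known candidate set, then count how often a word-ending pair comes out. The candidates and numbers below are invented for the example:

from random import choices, seed

seed(42)  # fixed seed so the experiment is repeatable

# Toy candidate set for a word that is 3 characters past the threshold;
# " e" stands in for a pair the bias condition treats as word-ending.
candidates = [("ue", 49), ("ui", 27), ("ua", 23), (" e", 2)]
factor = 0.1
overshoot = 3

total = sum(freq for _, freq in candidates)
bias = total * factor * overshoot
weighted = [(pair, freq + bias if pair[0] == " " else freq) for pair, freq in candidates]

pairs, weights = zip(*weighted)
draws = 10_000
endings = sum(1 for _ in range(draws) if choices(pairs, weights=weights)[0][0] == " ")
print(f"word-ending pair chosen in {endings / draws:.1%} of {draws} draws")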
I plan on expanding this logic to finding closing braces, quotes, etc. in another bigram-esque project I am working on.
Topic ngrams markov-process python machine-learning
Category Data Science