9.4 Generate interesting text with n-grams


You can generate surprisingly interesting text by simply counting up pairs or triples of words. When you see that the word “love” is often followed by words like “you,” “it,” and “Python,” you know what word to print next whenever you see the word “love.” That’s what ChatGPT does when you type a question: it uses what is called a “language model” (a next-word predictor) to decide what to say, one word at a time.
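
Here is a tiny sketch of that idea in the REPL. The follower counts for “love” are made up for illustration; later in this section you will build real counts from real text.

>>> # Made-up counts of the words that were seen following the word "love"
>>> followers = {'you': 3, 'it': 2, 'Python': 1}
>>> # The most likely next word is the one with the highest count
>>> max(followers, key=followers.get)
'you'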

To build a chatbot like this you will need several functions:

  • A tokenizer to split text into words and punctuation
  • An ngram collector that can gather up words that occur together
  • An accumulator to gather up words that follow after others in a sequence
  • A generator to pick a random word from a list of likely words
  • A way to guess a random word when you need to follow a word that you’ve never seen before

Don’t worry, you do not need to write all of these functions. I will give you most of the code; you just need to fill in one of the functions, to show you are paying attention. I will give you all the code you need to build a chatbot that can talk about whatever text you ‘train’ it with.

Assignment

For this assignment, your job is to write a function that can collect pairs of objects from a sequence (list). Your function should be called collect_ngrams and should work like this:

>>> collect_ngrams(["I'm", 'a', 'Python', 'programmer', '.'])
[
    ("I'm", 'a'),
    ('a', 'Python'),
    ('Python', 'programmer'),
    ('programmer', '.')
]

You are NOT allowed to use ord(), str.translate(), zip(), or re (regular expressions), or anything else you haven’t learned in this course so far. You do not even have to tokenize the text; you just need to figure out a way to pair up tokens together. This is a common Python job interview question and coding challenge, so you will find a lot of advanced answers online. But you need to solve this problem without any new advanced Python keywords or functions.

Build a chatbot (language model)

Why would you ever want to write a Python program to collect pairs of words? Well, it turns out that if you can do this on a lot of text, you can build a language model that can seem intelligent. You won’t have enough time (or computers) to build ChatGPT, but your chatbot will be able to do surprisingly fun things, just based on a few articles from Wikipedia. But first you need to understand how it all fits together.

N-grams

Even though ChatGPT predicts only one word at a time, it needs more than one word of context to decide what to say. You can’t build a language model by reading only one word at a time. You need to see pairs, triples, and quadruples of words that go together. Pairs of words are called “bigrams,” triplets are called “trigrams,” and above that we just call them “4-grams,” “5-grams,” or “N-grams.” The term “n-gram” is used for any sequence of “N” words that appeared together in a passage of text. You can call them whatever you like. You just need to be able to count them with Python to build impressive Python programs that almost seem intelligent. N-gram language models are so useful that for decades they were the main approach for building the most popular language-processing AI systems.
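
For example, here are the trigrams (3-grams) of the example tokens from the assignment, assuming your collect_ngrams() function also accepts an optional n argument (the assignment only requires pairs, so this is just a preview):

>>> collect_ngrams(["I'm", 'a', 'Python', 'programmer', '.'], n=3)
[("I'm", 'a', 'Python'), ('a', 'Python', 'programmer'), ('Python', 'programmer', '.')]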

N-gram statistics are so useful that Google went to the trouble of counting up all the n-grams it could find by scanning millions of physical documents and books. It’s what made their search so awesome, before they started enshittifying it with deceptive ads and click-bait around 2005. When Larry Page and his girlfriend were scanning and reproducing all those books, they were doing it illegally. And that turned out to be a good thing for you and me, when a judge in California ordered them to make all of those n-grams free and available to everyone!

Nowadays, if you ever want to know the most popular combination of English words for saying something, you can just enter a few alternatives into the English N-gram Viewer to compare them. You can even use it to follow trends in programming languages and career choices. Check out the n-gram counts for “java programmer” vs “Python programmer” over the past few years!

Splitting text into words

Before you can count n-grams, you need to break up a string into individual “grams” (1-grams), usually words and punctuation. Grams are also called “tokens,” which is a more general term for words, pieces of words, or even single letters and punctuation marks. When you are having a conversation with ChatGPT, it isn’t talking in words, it’s talking in tokens. If your Python program uses tokens, you can have it spell proper nouns and other words that it may not have ever seen before. It can even invent new words. But to keep things simple, you can just count up words and punctuation.

Say you want to split the sentence “I’m a Python programmer, and I love it!” The Python .split() method gets you close:

>>> "I'm a Python programmer, and I love it!".split()
["I'm", 'a', 'Python', 'programmer,', 'and', 'I', 'love', 'it!']

The .split() method got most of the words correct, but it looks like there’s a problem with the spelling of the words “programmer” and “it”. They both contain punctuation. That wouldn’t be a problem for a human, but it’s a big problem for a Python program that needs to count up unique words.

Tokenizing text

For this exercise you are going to keep things simple: you will have tokens, or “grams,” for all the English words you can find in a text document. You are also going to have tokens for the punctuation, so your chatbot can punctuate sentences correctly and sound smart. So whenever you hear me talking about grams or tokens, I’m talking about both words and punctuation marks.

To split text into tokens (to “tokenize” it), you can use the function below. It keeps words and punctuation marks as separate tokens, and it throws away the whitespace and any characters it doesn’t recognize.

>>> import re

>>> def tokenize_text(text):
...     return re.findall(r'[-\'a-zA-Z0-9]+|[!-/:-@\[-`{-~]', text)

Notice that import re at the beginning? That is probably new to you. It is for working with “regular expressions,” which are patterns in text that you want to extract. You don’t need to worry about regular expressions until you take a more advanced Python course. Just reuse this tokenize_text() function in your own code whenever you need to split English text into tokens or words.

Try running this tokenize_text() function on the sentence about Python programming:

>>> text = "I'm a Python programmer, and I love it!"
>>> tokens = tokenize_text(text)
>>> tokens
["I'm", 'a', 'Python', 'programmer', ',', 'and', 'I', 'love', 'it', '!']

That’s just what you need, both words and punctuation pulled out. Now you just need to pull together pairs of words to create 2-grams.

Unicode and ASCII

Another problem is that text on the web includes a lot of Unicode typographical characters, like curly or slanted quotes (“ and ”) and apostrophes (’). Unicode is a way to represent letters and symbols from any human language. You can even type pictograms (Japanese and Chinese characters) in Unicode! But all these options can get confusing for your n-gram language model.

To avoid having to deal with a huge, confusing vocabulary for your n-gram models, it’s best if you consolidate all the Unicode apostrophes into a single ASCII character. The ASCII alphabet has only 26 letters, and even if you include punctuation, digits, and capitalization, there are only about 95 printable ASCII characters that your n-gram model will need to process. ASCII is just the simplified set of characters that you see on a typical English laptop keyboard or old-fashioned typewriter.
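
You can see the problem for yourself in the REPL. These two strings look like the same word to a human, but not to a Python program that is counting unique tokens:

>>> "it's" == "it’s"
False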

Here’s a function you can use to simplify the text you are using for your chatbot. It takes those curly quotes and apostrophes that you see on the web and turns them into plain ASCII punctuation characters that are easier to process.

def unicode_to_ascii(text):
    # Map each Unicode quote, apostrophe, or dash to its plain ASCII equivalent
    translation_dict = dict([
        (ord(u), ord(a)) for u, a in zip(
            "‘’´“”–-",    # Unicode characters often found in web text
            "'''\"\"--")  # the ASCII characters to replace them with
    ])
    return text.translate(translation_dict)

There are a bunch of built-in Python functions here that you haven’t seen before.

  • zip() - joins sequences together, element by element, like a zipper joining the two sides
  • ord() - gives the integer “code point” (0-127 for ASCII) that represents a character
  • str.translate() - replaces one set of characters with another

You will probably use the zip() and ord() functions a lot during your Python career. You will learn more about them later in this course. The str.translate() method is a bit rarer. You would normally use the str.replace() method whenever you need to replace parts of a string. But str.translate() can do many simultaneous character replacements super-fast.
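
If you are curious, here is a quick peek at what each of these does on its own (just a sketch; the strings and replacements are made-up examples):

>>> list(zip('abc', 'xyz'))   # pair up the characters, like a zipper
[('a', 'x'), ('b', 'y'), ('c', 'z')]
>>> ord('A')                  # the integer that represents the character 'A'
65
>>> 'stop'.translate({ord('o'): ord('0')})   # replace every 'o' with a zero
'st0p'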

When you use this function to straighten out the curly apostrophe in “I’m” and turn long Unicode dashes (–) into short ASCII hyphens, this is what you get:

>>> text = "I’m a Python programmer – and I love it!"
>>> unicode_to_ascii(text)
"I'm a Python programmer - and I love it!"

You can probably hardly tell the difference. But your Python programs will thank you if you use this function before you try to process text. It’s similar to calling lower() on the user commands in your text adventure program. It makes it easier for your Python program to recognize the correct words from your user.

Pairing and counting tokens

Once you have created your tokens and converted the Unicode punctuation to ASCII, you are finally ready to create your n-grams and count their statistics. For your assignment, you were not allowed to use advanced techniques. But to build a chatbot that can compete with ChatGPT, I’ll give you two advanced Python implementations you can use to help you quickly process a lot of text.

Your collect_ngrams function had the following doctest requirement.

>>> collect_ngrams(["I'm", 'a', 'Python', 'programmer', '.'])
[
    ("I'm", 'a'),
    ('a', 'Python'),
    ('Python', 'programmer'),
    ('programmer', '.')
]

Here’s the first implementation; it should work well for you in most situations.

def collect_ngrams(tokens, n=2):
    # Slide a window of length n along the tokens, collecting each window as a tuple
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
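
Once you have a list of n-grams, counting them up is easy. Here is one way (a sketch using collections.Counter, a dictionary specialized for counting, with a made-up example sentence):

>>> from collections import Counter
>>> tokens = tokenize_text("I love it and I love Python !")
>>> counts = Counter(collect_ngrams(tokens))
>>> counts.most_common(1)
[(('I', 'love'), 2)]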


And here’s a second, more advanced implementation. It strips out punctuation entirely and can build n-grams of any length (the Stack Overflow answers in the references below explain similar approaches):

import re


def tokenize(text, ngrams=1):
    # Replace punctuation characters with spaces
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    # Slide a window of length `ngrams` along the tokens
    return [tuple(tokens[i:i + ngrams]) for i in range(len(tokens) - ngrams + 1)]
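
To finish the chatbot described at the start of this section, you still need the accumulator (to gather up the words that follow each word) and the generator (to pick a likely next word at random). Here is a minimal sketch of both, built on top of collect_ngrams(); the names accumulate_followers() and generate_text() are my own, not part of the assignment code:

import random


def accumulate_followers(tokens):
    # For each token, gather a list of all the tokens that followed it
    followers = {}
    for first, second in collect_ngrams(tokens, n=2):
        followers.setdefault(first, []).append(second)
    return followers


def generate_text(followers, word, num_words=10):
    # Starting from `word`, repeatedly pick a random follower of the latest word
    words = [word]
    for _ in range(num_words):
        likely_words = followers.get(words[-1])
        if not likely_words:
            # A word we have never seen before, so guess from all the known words
            likely_words = list(followers.keys())
        words.append(random.choice(likely_words))
    return ' '.join(words)

Try “training” it on the tokens from a Wikipedia article and then calling generate_text(followers, 'Python') to see what your chatbot has to say.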

References

  • [regex tokenizer](https://stackoverflow.com/a/367292/623735)
  • [Python n-grams](https://stackoverflow.com/a/17532044/623735)