10.4 N-grams
N-grams with a for loop
N-grams are pairs, triples, and even quadruplets of words (or grams). N-grams are what you use to teach chatbots how to chat like humans. You can generate surprisingly interesting text, including answers to your questions, if you build a next-word predictor by collecting pairs, triples, or longer groupings of words (n-grams) from the text you want your bot to imitate. A text generator like this is called a language model. A language model is a function that takes in some text and tries to predict the next word. That's what ChatGPT does when you type a question: it uses a "language model" (next-word predictor) to decide what to say, one word at a time.
Here are some other confusing words you probably want to understand. You can call words tokens when you want to include punctuation marks and other things that aren't really words. A token is a bite-size chunk of text, usually a word or punctuation mark. But a token can be any string, such as a piece of a word, or even an individual character. If you really want to go crazy, you can even create tokens for the notes in a musical score, or the steps in a dance move you are working on. A gram is the particular kind of token that you are using in your n-gram generator or n-gram tokenizer. Basically, gram and token mean the same thing. So an n-gram tokenizer can be used to generate almost anything, even a TikTok video!
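Here's a quick way to see the difference between word tokens and character tokens, using only str.split() and list() (the sentence is made up just for illustration):

>>> 'I love Python'.split()
['I', 'love', 'Python']
>>> list('love')
['l', 'o', 'v', 'e']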
An n-gram is just a pair, triple, or longer list of N tokens (grams) that you want to keep track of.
An n-gram tokenizer is a function that takes in a sequence of strings, such as a list of the words in a sentence, and returns a sequence of n-grams. It does the reading for you when you are teaching a chatbot like ChatGPT to read and pretend to understand text.
This n-gram tokenizer approach to reading text is surprisingly similar to how you understand text yourself. When you read, you see that the word "love" is often followed by words like "you," "it," and "Python." The n-gram language model in your brain can think of all these options whenever you see the word "love" and decide which one should probably come next in the sentence you are speaking. It happens so quickly, you probably don't even notice it.
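Here's a minimal sketch of that idea, using nothing but a for loop to collect the words that follow "love" in a tiny made-up corpus (a real language model would count followers across millions of sentences):

>>> words = 'I love you and I love Python'.split()
>>> followers = []
>>> for i in range(len(words) - 1):
...     if words[i] == 'love':
...         followers.append(words[i + 1])
...
>>> followers
['you', 'Python']

Once you have collected the followers, "predicting" the next word is just a matter of picking the most common one.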
Assignment
For this assignment, your job is to write a function that can collect pairs of objects from a sequence (list).
You don't have to worry about the tokenizing part (splitting text into individual tokens).
Your job is the 2-gram part: pairing up tokens that have already been split apart.
Your function will always take a sequence of tokens (words, characters, or strings) and should return pairs of those tokens.
Your function should be called collect_ngrams and should work like this:
>>> collect_ngrams(["I'm", 'a', 'Python', 'programmer', '.'])
[
("I'm", 'a'),
('a', 'Python'),
('Python', 'programmer'),
('programmer', '.')
]
You are NOT allowed to use ord(), str.translate(), zip(), or re (regular expressions), or anything else you haven't learned in this course so far.
You do not even have to tokenize the text; you just need to figure out a way to pair up the tokens.
This is a common Python job interview question and coding challenge.
So you will find a lot of advanced answers online.
But you need to solve this problem without any new advanced Python keywords or functions.
Character 2-grams (+1 bonus)
You will get one bonus point if you can get your function to work for each of the following test cases.
If you have written your function well, it should work on any sequence of strings, not just a sequence of words and punctuation (tokens).
Remember, tokens can be any strings in a sequence, and a string is just a sequence of 1-character strings.
A for loop can iterate through a string's characters the same way it iterates through a list of strings.
So make sure your function can handle any sequence of strings, including a bare string.
If you like, you can convert a str to a list of strs using the list type: list('Hello') => ['H', 'e', 'l', 'l', 'o'].
Type names like list, tuple, and str are callable just like functions, so you can run one on an object of another type to coerce it into a new type.
So your function should work on strings as well as lists and tuples.
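For example, here are a few conversions between sequence types that you can try in a REPL:

>>> list('Hi')
['H', 'i']
>>> tuple('Hi')
('H', 'i')
>>> tuple(['H', 'i'])
('H', 'i')

Armed with that, here is the first bonus test case: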
>>> collect_ngrams(["I'm a Python programmer."])
[
('I', "'"),
("'", 'm'),
('m', ' '),
(' ', 'a'),
('a', ' '),
(' ', 'P'),
...
('r', '.')
]
Joined 2-grams (+1 bonus)
If you want to get fancy, when you detect that your tokens are characters, you can return the pairs as 2-character strings rather than 2-tuples of 1-character strings.
The join method is tricky to use correctly.
For this particular assignment you could use str.join('', ngram_tuple) or the shortcut ''.join(ngram_tuple).
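For example, both spellings glue a 2-tuple of characters into the same 2-character string:

>>> ''.join(('I', "'"))
"I'"
>>> str.join('', ("'", 'm'))
"'m"

With join doing the gluing, your function's output should look like this: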
>>> collect_ngrams("I'm a Python programmer.")
[
"I'",
"'m",
'm ',
' a',
'a ',
' P',
...
'r.'
]
Keyword arguments (+1 bonus)
You can get another bonus point if your function can take a keyword argument (sometimes called a "kwarg") for n, so it will work for 3-grams, 4-grams, and even 100-grams.
You should use the variable name n for your kwarg.
And the default value for your kwarg should be n=2.
Your first argument should be called tokens, so that you remember to tokenize your text (split it into tokens) if you want your grams to be something other than the characters found in the input string.
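Here's a sketch of what the signature might look like (just the scaffolding, with a hypothetical docstring; the for-loop body is your job to fill in):

>>> def collect_ngrams(tokens, n=2):
...     """Collect n-grams (pairs by default) from a sequence of tokens."""
...     # your for-loop logic goes here
...

Called with n=3 on a string, your function should return character 3-grams like this: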
>>> collect_ngrams("I'm a Python programmer.", n=3)
[
"I'm",
"'m ",
'm a',
' a ',
'a P',
' Py',
...
'er.'
]
Trick or treat
Once you understand how n-gram tokenizers work, you will be much better at chatting with ChatGPT and other language models that are trying to pretend to be smart. You'll know some of the "tricks" behind their magic. And you might be able to build some treats with n-gram models that give you something sweeter than just pretending to talk like a human. N-gram language models are useful tools for real AI that can help you find things in a database, in your personal notes, or on the web.
If this was a difficult assignment, or if you feel like your function ran really slowly, next week you'll see the "answer" to this problem, one that runs lickety-split and works perfectly on huge datasets, like all of Wikipedia.