Tokenize String using Natural Language Processing in SwiftUI

DevTechie Inc

Jul 8, 2023

Starting with iOS 12, Apple introduced a number of APIs that bring natural language processing natively to its platforms.

Today, we will explore NLTokenizer, which splits text into the units we ask for by leveraging the power of natural language processing.

Tokenizing a string

Tokenizing a string simply means separating it into semantic units, such as words or sentences, so that it can be analyzed for various use cases. We may want to divide a string into units to find the names, addresses, or locations mentioned in the text, or to judge whether the overall sentiment of the text is positive or negative.

If the text is written in English, we may opt to split the string on a separator such as a space. This approach does not work for languages such as Chinese or Japanese, to name a few, where spaces are not used to separate words. For these kinds of use cases, we can leverage the power of tokenization from natural language processing.
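To see the problem concretely, here is a minimal sketch that uses only the Swift standard library. The Japanese sample is the proverb used later in this article, and the variable names are chosen just for illustration.

```swift
// Splitting on spaces works tolerably for English...
let english = "Checkout DevTechie for more on iOS development content."
let englishPieces = english.split(separator: " ").map(String.init)
print(englishPieces.count)   // 8 (the last piece is "content." with the period still attached)

// ...but a Japanese sentence has no spaces, so nothing gets split.
let japanese = "井の中の蛙大海を知らず"
let japanesePieces = japanese.split(separator: " ").map(String.init)
print(japanesePieces.count)  // 1 (the whole sentence comes back as a single piece)
```

Even in the English case the separator approach leaves punctuation glued to the last word, which is another reason to prefer a real tokenizer.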

NLTokenizer

NLTokenizer creates individual units from natural language text. This class is defined in the NaturalLanguage framework, and we start by creating an instance of it.

The NLTokenizer initializer takes the linguistic unit to tokenize by: word, sentence, paragraph, or document, as declared in NLTokenUnit.

let tokenizer = NLTokenizer(unit: .word)

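Once the instance is created, we assign it the string to tokenize. We keep the string in a constant named text so that we can refer to its indices in the next step.

let text = "Checkout DevTechie for more on iOS development content."
tokenizer.string = text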

The enumerateTokens(in:using:) method provides the ranges of the tokens in the string based on the tokenization unit.

tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, attributes in
    print(String(text[range]))
    return true
}
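The three steps above can be wrapped into a small reusable function. This is only a sketch: the function name tokens(in:unit:) is a hypothetical helper of our own, not part of the NaturalLanguage framework.

```swift
import NaturalLanguage

// Hypothetical helper (not part of NaturalLanguage): returns the tokens of `text`
// as plain strings for the requested unit.
func tokens(in text: String, unit: NLTokenUnit) -> [String] {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text

    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        result.append(String(text[range]))
        return true // keep enumerating until the end of the range
    }
    return result
}

let words = tokens(in: "Checkout DevTechie for more on iOS development content.", unit: .word)
print(words)
```

Collecting into a local array and returning it keeps the tokenizer out of the calling code, which is handy when the same logic is needed from more than one view.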

Here is a complete example.

import SwiftUI
import NaturalLanguage

struct DevTechieNLTokenizerExample: View {
    @State private var text = "Checkout DevTechie for more on iOS development content. DevTechie helps you learn by building examples."
    @State private var tokenized = [String]()
    
    var body: some View {
        VStack {
            TextEditor(text: $text)
                .overlay(RoundedRectangle(cornerRadius: 20).stroke(Color.gray.gradient, lineWidth: 2))
            Button("Tokenize") {
                let tokenizer = NLTokenizer(unit: .word)
                tokenized = []
                tokenizer.string = text
                tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, attributes in
                    tokenized.append(String(text[range]))
                    return true
                }
            }
            List(tokenized, id: \.self) { token in
                Text(token)
            }
        }
        .padding()
    }
}

Build and run

Let’s change the tokenization unit to sentence.

struct DevTechieNLTokenizerExample: View {
    @State private var text = "Checkout DevTechie for more on iOS development content. DevTechie helps you learn by building examples."
    @State private var tokenized = [String]()
    
    var body: some View {
        VStack {
            TextEditor(text: $text)
                .overlay(RoundedRectangle(cornerRadius: 20).stroke(Color.gray.gradient, lineWidth: 2))
            Button("Tokenize") {
                let tokenizer = NLTokenizer(unit: .sentence)
                tokenized = []
                tokenizer.string = text
                tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, attributes in
                    tokenized.append(String(text[range]))
                    return true
                }
            }
            List(tokenized, id: \.self) { token in
                Text(token)
            }
        }
        .padding()
    }
}

Notice that the text is now broken down into complete sentences, and all we had to do was change the unit type. Experiment with the other unit types to see different results.
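One quick way to experiment outside of a view is to count the tokens each unit produces. The sketch below assumes the same sample text and uses tokens(for:), which returns all token ranges at once instead of calling a closure. The units and counts names are our own.

```swift
import NaturalLanguage

let text = "Checkout DevTechie for more on iOS development content. DevTechie helps you learn by building examples."

// Pair each unit with a label so the printed output is readable.
let units: [(name: String, unit: NLTokenUnit)] = [
    ("word", .word), ("sentence", .sentence), ("paragraph", .paragraph), ("document", .document)
]

var counts: [String: Int] = [:]
for (name, unit) in units {
    let tokenizer = NLTokenizer(unit: unit)
    tokenizer.string = text
    // Ask for every token range in the whole string for this unit.
    let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
    counts[name] = ranges.count
    print("\(name): \(ranges.count) token(s)")
}
```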

Notice that we return true from the enumerateTokens closure. The return value tells the enumerator whether it should continue to the next token. If we return false at any point, the enumeration stops.
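As an illustration of stopping early, the sketch below collects only the first few words of a string. The function name firstWords(in:limit:) is a hypothetical helper, not a framework API.

```swift
import NaturalLanguage

// Hypothetical helper: collects at most `limit` word tokens, then stops enumerating.
func firstWords(in text: String, limit: Int) -> [String] {
    guard limit > 0 else { return [] }

    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text

    var result: [String] = []
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        result.append(String(text[range]))
        // Returning false once we have enough tokens ends the enumeration.
        return result.count < limit
    }
    return result
}

let preview = firstWords(in: "Checkout DevTechie for more on iOS development content.", limit: 3)
print(preview)
```

This can save work on long documents when only a preview of the text is needed.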

Language Support in NLTokenizer

As we saw earlier, NLTokenizer is intelligent enough to identify and tokenize text without us specifying the language. This is true for the most part, but because some languages don’t separate the words in a sentence the same way English does, the results may differ. To help the tokenizer, we can set the language using the setLanguage method.

We could detect the dominant language and set it on the tokenizer, but to keep things simple, we will set the language directly by passing one of the NLLanguage constants.

import SwiftUI
import NaturalLanguage

struct DevTechieNLTokenizerExample: View {
    @State private var text = "井の中の蛙大海を知らず"
    @State private var tokenized = [String]()
    
    var body: some View {
        VStack {
            TextEditor(text: $text)
                .overlay(RoundedRectangle(cornerRadius: 20).stroke(Color.gray.gradient, lineWidth: 2))
            Button("Tokenize") {
                let tokenizer = NLTokenizer(unit: .word)
                tokenized = []
                tokenizer.string = text
                tokenizer.setLanguage(.japanese)
                tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, attributes in
                    tokenized.append(String(text[range]))
                    return true
                }
            }
            List(tokenized, id: \.self) { token in
                Text(token)
            }
        }
        .padding()
    }
}
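For completeness, the dominant-language approach mentioned earlier could be sketched as follows. It relies on NLLanguageRecognizer.dominantLanguage(for:), which returns nil when no language can be determined. Very short inputs may be harder to identify, so treat the detected language as a best guess.

```swift
import NaturalLanguage

let text = "井の中の蛙大海を知らず"
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

// Only set a language when the recognizer found one; otherwise the
// tokenizer keeps its default behavior.
if let language = NLLanguageRecognizer.dominantLanguage(for: text) {
    tokenizer.setLanguage(language)
    print("Detected language: \(language.rawValue)")
}

// Collect every word token in the string as a plain String.
let ranges = tokenizer.tokens(for: text.startIndex..<text.endIndex)
let detectedTokens = ranges.map { String(text[$0]) }
print(detectedTokens)
```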