Sequencing
Sequencing is required to turn sentences into data. Sequencing maps each word in a sentence to a number taken from a given Tokenizer's word index, producing an array of integers for each sentence.
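A minimal sketch of sequencing in plain Python, mimicking the behavior of a Keras-style Tokenizer (the function names here are illustrative, not the library's API):

```python
def build_word_index(sentences):
    """Assign each unique word an integer, starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(sentences, word_index):
    """Turn each sentence into an array of integers; unknown words are skipped."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

corpus = ["I love my dog", "I love my cat"]
index = build_word_index(corpus)
sequences = texts_to_sequences(corpus, index)
print(index)      # {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(sequences)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```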
Handling words out of vocabulary

When sequencing sentences that contain words not present in the Tokenizer's vocabulary, the tokenizer will, unless otherwise specified, omit those words from the final output integer array.
Libraries like TensorFlow and Keras provide OOV (out-of-vocabulary) tokens, which replace unknown words with a dedicated integer value. This reduces the meaning of a sentence but does not remove it completely.
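A sketch of OOV handling, mirroring what Keras does when a Tokenizer is given an `oov_token`: the OOV token takes index 1, and unknown words map to it instead of being dropped (again, these helper functions are illustrative):

```python
def build_word_index(sentences, oov_token="<OOV>"):
    """Build a word index, reserving index 1 for the OOV token."""
    word_index = {oov_token: 1}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(sentences, word_index, oov_token="<OOV>"):
    """Map words to integers; unknown words get the OOV index instead of vanishing."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in s.lower().split()]
            for s in sentences]

index = build_word_index(["I love my dog"])
# "really" is out of vocabulary, so it becomes 1 rather than disappearing:
print(texts_to_sequences(["I really love my dog"], index))
# [[2, 1, 3, 4, 5]]
```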
Handling sentences of varying length

Padding can be used to handle such cases.
Padding techniques find the sentence with the largest number of words; sentences with fewer words are padded with some arbitrary value, e.g. 0, until they match the length of the longest sentence in the corpus.
Padding can be done at the front (pre) or at the back (post).
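A sketch of pre- and post-padding with 0, assuming ragged lists of integers as input (a simplified stand-in for a library padding function):

```python
def pad_sequences(sequences, padding="pre", value=0):
    """Pad every sequence with `value` to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        pad = [value] * (max_len - len(s))
        padded.append(pad + s if padding == "pre" else s + pad)
    return padded

seqs = [[1, 2, 3, 4], [1, 2], [5]]
print(pad_sequences(seqs))                  # pre: [[1, 2, 3, 4], [0, 0, 1, 2], [0, 0, 0, 5]]
print(pad_sequences(seqs, padding="post"))  # post: [[1, 2, 3, 4], [1, 2, 0, 0], [5, 0, 0, 0]]
```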
Padding functions also provide a max_length parameter, which truncates sequences that exceed that length; truncation can also be done from the front or the back. Generally, when max_length is specified, truncation is done from the same end as padding so that meaningful data isn't lost.
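Extending the sketch above with a max_length parameter and a truncation side, loosely modeled on the `maxlen`/`truncating` options of Keras' `pad_sequences` (the implementation here is an illustrative assumption, not the library code):

```python
def pad_sequences(sequences, max_length=None, padding="pre",
                  truncating="pre", value=0):
    """Pad to max_length (or the longest sequence), truncating longer ones."""
    max_len = max_length or max(len(s) for s in sequences)
    out = []
    for s in sequences:
        # Drop elements from the front ("pre") or the back ("post")
        # when a sequence is longer than max_len.
        if len(s) > max_len:
            s = s[-max_len:] if truncating == "pre" else s[:max_len]
        pad = [value] * (max_len - len(s))
        out.append(pad + s if padding == "pre" else s + pad)
    return out

seqs = [[1, 2, 3, 4, 5], [6, 7]]
print(pad_sequences(seqs, max_length=3, truncating="post"))  # [[1, 2, 3], [0, 6, 7]]
print(pad_sequences(seqs, max_length=3, truncating="pre"))   # [[3, 4, 5], [0, 6, 7]]
```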