# Scikit-learn 5

Preprocessing and feature extraction for text

## 5 Feature extraction


If you have followed the course until now, you know how to load data sets, how to fit a machine learning model and how to evaluate it automatically. So far, we assumed that we have a matrix `X` that represents our training observations in a suitable format. The same applies to the values in `y`.

In reality, we might have some data, but not in a format that is suitable for machine learning. This notebook describes ways to go from the raw data to reasonable representations in `X` and `y`. This process is called **feature extraction** (well, for the `X` part at least).

# 5.1 Extracting features from text (beware: spooky!)

Naturally, we are interested in extracting useful features from _text_. Rather than one of the prepared data sets that ship with `scikit-learn`, we use data from a recent Kaggle challenge. Go to the Kaggle page:

https://www.kaggle.com/c/spooky-author-identification

download the training data, the test data and the sample submission, and unzip the files. In the following, I assume you have placed those files in the same folder as the notebook. As usual, we read the data into memory with pandas:

In [3]:
# TODO: import pandas and load the .csv in a data frame
import pandas as pd
df = pd.read_csv("train.csv", index_col=0)
df.head()

| id | text | author |
|---|---|---|
| id26305 | This process, however, afforded me no means of... | EAP |
| id17569 | It never once occurred to me that the fumbling... | HPL |
| id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| id27763 | How lovely is spring As we looked from Windsor... | MWS |
| id12958 | Finding nothing else, not even gold, the Super... | HPL |


Also as usual, we check the shapes of objects and find the target classes:

In [5]:
# TODO: check shapes
X = df["text"]
y = df["author"]

X.shape, y.shape

# TODO: what are the different response classes?
set(y)

{'EAP', 'HPL', 'MWS'}

As you can see, the only feature we currently have to describe samples is one string of text. That's not an optimal representation. For one thing, it should be possible for words (or characters) to contribute individually to predictions.

Second, strings are [_nominal_](https://en.wikipedia.org/wiki/Level_of_measurement) values: most classifiers require features to be numbers and, consequently, observations to be lists of numbers. Note: having a large number of features, i.e. describing each training example with a long list of numbers, is not a problem in itself.

## Normalization

Before we even start converting strings to numbers, there are some sources of variation that we would like to eliminate right away. Most of those procedures can be described as **normalization**.

Please note: none of the methods described here is universally helpful. Depending on your task, they can also be detrimental because they remove important task-specific information.

**Casing**

For instance, the fact that the first word in a sentence (say, "How") is capitalized means that "how" and "How" will be regarded as different words, which does not make sense. There are different variants of casing (e.g. lowercasing or truecasing), but most of the time, lowercasing makes sense.

Lowercasing is a predefined method of Python `str` objects, and is also easy for pandas frames:

In [0]:
df_lower = df.copy()
df_lower.text = df.text.str.lower()
df_lower.head()

In `scikit-learn`, lowercasing comes as an option for transformer classes like `sklearn.feature_extraction.text.CountVectorizer`, see below.

**Accents, diacritics, punctuation, whitespace**

Similarly, different spellings of the same word lead to entirely different entries in the vocabulary if not taken care of. Accents or diacritics on characters are important here: people omit them simply because they are unfamiliar or cannot find them on the keyboard. Another frequent source of confusion is unusual punctuation or whitespace symbols that have their own Unicode representation.

Here is a [standard approach](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) to remove all of them, which is converting them to ASCII characters:

In [0]:
import unidecode # if you do not have this module yet: conda install unidecode
s = "Rasenmäher"
s_normalized = unidecode.unidecode(s)
s_normalized

As with casing, in `scikit-learn` there is rarely a need to do this by hand, as it is implemented in a number of classes, such as `sklearn.feature_extraction.text.CountVectorizer`, see below.
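As a small preview of those options, the sketch below uses two made-up German toy strings (an assumption for illustration, not part of the spooky data) that differ only in casing and in one accent. With `strip_accents="unicode"` and the default `lowercase=True`, both should end up with the same vocabulary entries:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, made up for illustration: same words, different casing and accents
accent_docs = ["Der Rasenmäher ist laut", "der Rasenmaher ist LAUT"]

# strip_accents="unicode" removes accents, lowercase=True (the default) lowercases
accent_vectorizer = CountVectorizer(strip_accents="unicode", lowercase=True)
accent_counts = accent_vectorizer.fit_transform(accent_docs)

# both spellings collapse into a single vocabulary entry
print(sorted(accent_vectorizer.vocabulary_))
print(accent_counts.toarray())
```

Whether collapsing such variants is desirable depends on the task, as noted above.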

**Tokenization**

In raw strings of text, whitespace characters do not always indicate boundaries between tokens. **Tokenization simply means making sure that whitespace characters indicate token boundaries**. In a raw string like

    "Preemptive strikes and the war on Iraq: a critique of Bush administration"
    
splitting at whitespace will result in `"Iraq:"` as one of the words. Here are two real-world examples of efficient tokenizers: the [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) and the [cutter tokenizer](https://gitlab.cl.uzh.ch/graen/cutter).

In `scikit-learn`, we usually have a different way to go about this. Rather than modifying the raw string to insert whitespace characters, we define what exactly should be regarded as a token. A token can be anything that can be described with a regular expression.
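To illustrate, here is a minimal sketch that applies this idea to the example string above. `CountVectorizer` takes the regular expression as its `token_pattern` parameter, and `build_analyzer()` returns the function that turns a raw string into tokens. The custom pattern is only one possible choice, picked here to keep single-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw = "Preemptive strikes and the war on Iraq: a critique of Bush administration"

# default pattern: a token is a run of two or more word characters
default_tokens = CountVectorizer().build_analyzer()(raw)

# custom pattern (an illustrative choice): also keep single-character tokens
custom_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
custom_tokens = custom_vectorizer.build_analyzer()(raw)

print(default_tokens)
print(custom_tokens)
```

In both cases the colon after "Iraq" is not part of any token, because it does not match the pattern. The default pattern additionally drops the one-letter word "a".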

**Stopword removal**

Finally, some of the words might need to be removed altogether. In a collection of texts, some words will occur in all of the documents, and their distribution in the documents will be about the same. Words like that are commonly called **stopwords**.

Words that occur in all examples with roughly the same frequency do not help to tell examples apart, so we would not want to waste resources to compute and store them. In Python, you can simply compile a list of stopwords and remove them from your input strings or ignore them. `scikit-learn` lets you use a built-in stopword list or supply your own, see below.
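As a rough sketch of both variants, the example below fits two vectorizers on made-up toy sentences: one with the English stopword list built into `scikit-learn` (`stop_words="english"`) and one with a hand-written list. The custom list is arbitrary and only serves as an illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, made up for illustration
stop_docs = ["the raven sat on the bust", "the creature fled into the night"]

# variant 1: built-in English stopword list
builtin_stop = CountVectorizer(stop_words="english").fit(stop_docs)

# variant 2: your own list of words to ignore (arbitrary example)
custom_stop = CountVectorizer(stop_words=["the", "raven"]).fit(stop_docs)

print(sorted(builtin_stop.vocabulary_))
print(sorted(custom_stop.vocabulary_))
```

Keep in mind the warning from above: for a task like author identification, function words can carry stylistic information, so removing them is not automatically a good idea.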

# 5.2 Extracting features that are order-agnostic

We now turn to actual feature extraction, which means turning normalized strings into lists of numbers. Language is inherently _sequential_: there is a natural order to the "items" in a text. Yet, ignoring the order (or, to put it another way, ignoring the fact that a string is a sequence) is simple and effective for many tasks.

Feature extraction methods that are **order-agnostic** essentially see a string as a _bag_ of items, for instance a bag of words or bag of characters.

## Token counts feature vectors

A straightforward method is to split a string into tokens, represent each example by a vector of the size of the token vocabulary, and in each vector dimension indicate how many times that token occurred in that example. This procedure is called **count vectorization**.

**Write code that performs count vectorization on a list of strings**:

In [0]:
X = ["This is a toy example", "This is another toy example .",
     "toy toy toy"]

# build the vocabulary: every distinct whitespace-separated token,
# sorted so that the vector dimensions have a fixed, reproducible order
vocab = sorted(set(" ".join(X).split(" ")))
vocab

In [0]:
X_vectorized = []
for x in X:
    tokens = x.split(" ")
    # one dimension per vocabulary entry: how often does it occur in this example?
    x_vectorized = [tokens.count(v) for v in vocab]
    X_vectorized.append(x_vectorized)

X_vectorized

### Use an sklearn class to perform count vectorization

Using the toy example from above:

In [0]:
import numpy as np

# toy example
from sklearn.feature_extraction.text import CountVectorizer

X = ["This is a toy example", "This is another toy example .",
     "toy toy toy"]

count_vectorizer = CountVectorizer()
print(count_vectorizer)

X_vectorized = count_vectorizer.fit_transform(X)
X_vectorized

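The result is a sparse matrix, which can be converted to a regular (dense) NumPy array with `toarray()`. The vocabulary the vectorizer has learned is available as the attribute `vocabulary_`, a dictionary that maps each token to its column index. Note that with default settings, `CountVectorizer` lowercases the text and only keeps tokens of at least two alphanumeric characters, so `"a"` and `"."` do not appear in the vocabulary: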

In [0]:
print(X_vectorized.toarray())
print(count_vectorizer.vocabulary_.items())

Now vectorizing our real spooky data:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

# perhaps after normalizing the cells of df
X = df.text
y = df.author

count_vectorizer = CountVectorizer()
print(count_vectorizer)

X_vectorized = count_vectorizer.fit_transform(X)
X_vectorized

A `CountVectorizer` has many useful options for normalization, such as `lowercase`, `stop_words`, `strip_accents`, `tokenizer` and so on. Read the documentation to fine-tune the vectorizer behaviour. The result of the transformation is stored in a new format, a `scipy` compressed sparse row matrix. It is an efficient data structure because we expect most of the vector entries to be zero and hence, the matrix will be **sparse**.
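To get a feeling for sparsity, the following sketch counts how many entries of a vectorized toy corpus are actually non-zero. The three sentences are made up for illustration. `nnz` is the number of stored (non-zero) entries of a `scipy` sparse matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, made up for illustration
sparse_docs = ["the raven", "a dark and stormy night", "the night was dark"]

sparse_counts = CountVectorizer().fit_transform(sparse_docs)

n_rows, n_cols = sparse_counts.shape
# fraction of entries that are non-zero
density = sparse_counts.nnz / (n_rows * n_cols)

print(type(sparse_counts))
print(sparse_counts.shape, sparse_counts.nnz)
print(f"density: {density:.2f}")
```

Even in this tiny example, fewer than half of the entries are non-zero. The same computation can be run on the vectorized spooky data, where the vocabulary is far larger than any single sentence.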

**Question for you: Why will most entries in the vectors be zero?**

After fitting a transformer and transforming with it, both its attributes and the result can be inspected more closely:

In [0]:
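# vocabulary: maps each token to its column index
# (the number of entries equals the size of the vectors)
print(list(count_vectorizer.vocabulary_.items())[:10])

# convert the first few rows to a usual numpy array
# (converting the full matrix would need a lot of memory)
X_vectorized[:5].toarray()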

As you can see, we did not change the overall structure of `X`: it is still two-dimensional, with one row per training example and one column per feature. But we managed to represent all strings as numbers in a meaningful way.

## More sophisticated TF/IDF token count feature vectors

Simple count vectorization gives equal weight to all features (remember, features are tokens in the vocabulary in this case). But actually, it is not true that all token features are equally informative. You might want to use adaptive weighting like **term frequency / inverse document frequency (TF/IDF)** vectorization.

**Question for you: What is TF/IDF and how does it transform counts? If you cannot recall, please look it up.**

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
print(tfidf_vectorizer)

X_vectorized = tfidf_vectorizer.fit_transform(X)
# show only the first few rows as a dense array
X_vectorized[:5].toarray()

**Question for you: The feature vectors have the same number of dimensions as before, but their entries are now much smaller than the raw counts (between 0 and 1). Why?**
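To make the weighting concrete, here is a small sketch that recomputes TF/IDF by hand on a made-up toy corpus and compares the result to `TfidfVectorizer`. It assumes the default settings of `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`), for which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and every row is scaled to unit length afterwards. The variable names are chosen for this example only:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy documents, made up for illustration
tfidf_docs = ["toy toy toy", "this is a toy example", "this is another example"]

# term frequencies: plain token counts
tf = CountVectorizer().fit_transform(tfidf_docs).toarray()

# document frequency: in how many documents does each token occur?
n_docs = tf.shape[0]
doc_freq = (tf > 0).sum(axis=0)

# smoothed inverse document frequency (default setting of TfidfVectorizer)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight the counts, then scale each row to unit Euclidean length
weighted = tf * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference_tfidf = TfidfVectorizer().fit_transform(tfidf_docs).toarray()
print(np.allclose(manual_tfidf, reference_tfidf))
```

Tokens that occur in many documents get a smaller `idf` and therefore contribute less, and the final normalization keeps all entries between 0 and 1.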

## Extracting features that incorporate order

Strings are ordered sequences, and perhaps features should keep information about ordering. A simple way of introducing this into count vectors is by counting **n-grams** instead of single tokens or characters. Such a model could be accurately described as a **bag of n-grams** model. There will be information about order encoded in the features, but it will be limited to the size of n-grams chosen by the user.

In [0]:
ngram_count_vectorizer = CountVectorizer(ngram_range=(1, 3))
print(ngram_count_vectorizer)

X_vectorized = ngram_count_vectorizer.fit_transform(X)
# dict views cannot be sliced directly, so convert to a list first
print(list(ngram_count_vectorizer.vocabulary_.items())[:10])
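To see which n-grams are actually generated, it can help to look at a single short string. The sketch below uses `build_analyzer()` to list word n-grams and, with `analyzer="char"`, character n-grams. The input strings are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# word unigrams and bigrams
word_ngrams = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

# character bigrams
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()

print(word_ngrams("the old house"))
print(char_ngrams("toy"))
```

Note how quickly the number of features grows: three words already produce five word n-grams for `ngram_range=(1, 2)`.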

# 5.3 Transforming the labels

Virtually all algorithms require that the labels (responses, target values) are also encoded as numbers. But even if a specific algorithm does not, it is good practice to represent classes with integers, as we have seen in a number of `scikit-learn` toy datasets.

In [0]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

y_encoded = label_encoder.fit_transform(y)
print(y_encoded[:10])

print(label_encoder.classes_)

A label encoder can also reverse the transformation, from numbers back to text labels, with a method called `inverse_transform`.
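A minimal sketch of the round trip, using a short hand-written list of labels instead of the full `y` (the variable names are chosen for this example only):

```python
from sklearn.preprocessing import LabelEncoder

# a few labels, written by hand for illustration
toy_labels = ["EAP", "HPL", "MWS", "EAP"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# map the integers back to the original text labels
toy_decoded = toy_encoder.inverse_transform(toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

This is handy after prediction, when a model returns integers and you want to report author names.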

If the target values need to be binary indicator vectors instead of integers (one column per class in multi-class problems, for instance), use `LabelBinarizer`:

In [0]:
from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()

y_binarized = label_binarizer.fit_transform(y)
print(y_binarized[:10])

print(label_binarizer.classes_)

**In most cases, `scikit-learn` takes care of label binarization automatically, so there is no need to transform labels in this way manually.**
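As an illustration of this point, the sketch below fits a classifier directly on string labels. The four sentences and their labels are made up, and `LogisticRegression` is just one possible classifier chosen for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy training data, made up for illustration
toy_texts = [
    "the raven and the pit",
    "the old ones beneath the sea",
    "the creature and its maker",
    "a raven at the chamber door",
]
toy_authors = ["EAP", "HPL", "MWS", "EAP"]

toy_features = CountVectorizer().fit_transform(toy_texts)

# string labels are passed in as they are, without manual encoding
toy_clf = LogisticRegression().fit(toy_features, toy_authors)

print(toy_clf.classes_)
print(toy_clf.predict(toy_features[:1]))
```

The classifier stores the original labels in `classes_` and returns predictions as strings, so no manual decoding step is needed.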