Lyrics Classifier

Python, ML

Source

Description

This project was for a machine learning applications course. It was completed within a week. The project goal was the same for everyone: given a dataset of real music lyrics, classify them into one of three genres (Rock, Hip Hop, and Pop).

Source

Process

I needed to build a classifier, but I lacked domain knowledge in lyric composition. There may well be beautiful patterns that help sort lyrics into genres, but I did not know any of them, so it was hard to extract useful features from the lyrics by hand. Simple quantities like lyric length and number of verses vary a lot, so counting them would not be enough. With only a week to work on this, it would have been difficult to gain a deeper knowledge of lyrical composition.

The challenge here was in processing the lyrics. Each lyric is a (long) text feature, and non-numerical data like this is tricky to convert into the numerical features that most classification methods expect. That's when I began to wonder whether I really needed to do that conversion by hand. Surely there was something already out there that handles text features easily.
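To make the difficulty concrete, here is a minimal sketch of one hand-rolled conversion, a bag-of-words count, using only the standard library. The sample lyrics and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are made up for illustration; this is not what the project ended up using.

```python
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(text, vocabulary):
    """Return one count per vocabulary word, in column-index order."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical toy lyrics, not taken from the course dataset
lyrics = ["Baby, baby, oh!", "Rock the night away", "Oh the night"]
vocab = build_vocabulary(lyrics)
vectors = [vectorize(text, vocab) for text in lyrics]
```

Even this toy version needs decisions about tokenization, punctuation, and vocabulary size, and on full-length lyrics the vectors become very wide and sparse.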

While exploring some ML libraries, I came across CatBoost. CatBoost uses gradient boosting on decision trees and was especially interesting for its focus on working with categorical data. After digging a little deeper into the CatBoost docs (which was also a challenge), I found that CatBoost also has built-in support for text features.

SETUP:


import pandas as pd 
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score 
from catboost import Pool, CatBoostClassifier

TRAINING:


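# Name the column that holds raw text so CatBoost processes it as a text feature
text_features = ['Lyric']
train_dataset = Pool(data=train_X,
                     label=train_y,
                     text_features=text_features)
# 100 boosting iterations of depth-3 trees with a multiclass loss
model = CatBoostClassifier(iterations=100,
                           learning_rate=1,
                           depth=3,
                           loss_function='MultiClass')
model.fit(train_dataset)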

Result

With CatBoost, the model reached ~68% accuracy at predicting the correct genre for held-out lyrics.

TESTING:


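# Predict genres for the holdout lyrics, with the label column removed
pred = model.predict(holdout_set.drop('Genre', axis=1))
# Fraction of holdout lyrics whose predicted genre matches the true genre
estimated_accuracy = accuracy_score(holdout_set['Genre'], pred)
print(estimated_accuracy)
# Save the accuracy as a single value in ea.csv
pd.Series(estimated_accuracy).to_csv('ea.csv', index=False, header=False)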

OUTPUT:

0.6794
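
A single accuracy number does not show which genres get confused with each other. As a rough sketch of how that could be checked, the snippet below tallies (true, predicted) pairs and per-genre recall with the standard library. The label lists are made-up placeholders, not the project's predictions, and the helper names are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true genre, predicted genre) pair occurs."""
    return Counter(zip(y_true, y_pred))

def per_class_recall(y_true, y_pred):
    """For each genre, the fraction of its lyrics that were predicted correctly."""
    totals = Counter(y_true)
    pairs = confusion_counts(y_true, y_pred)
    return {label: pairs[(label, label)] / totals[label] for label in totals}

# Placeholder labels for illustration only
y_true = ["Rock", "Rock", "Pop", "Pop", "Hip Hop", "Hip Hop"]
y_pred = ["Rock", "Pop", "Pop", "Pop", "Hip Hop", "Rock"]

table = confusion_counts(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
```

With the real data, `y_true` would presumably be `holdout_set['Genre']` and `y_pred` the model's predictions, flattened to one dimension if they come back as a column.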