Machine Learning Experiment With a Limited Amount of Rap Data


A. Background

Because I am taking a Machine Learning class, I wanted to build a model that can generate rap lyrics. For that, I needed lyrics from one particular rapper. I picked Eminem because of the article “Eminem Has the Largest Vocabulary in the Music Industry, According to Study.”

B. Data

So, I modified existing code to get Eminem’s lyrics. I first made a text file listing the songs, then ran the code over that list in a for loop.
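The loop over the song list can be sketched in Python as follows. This is only an illustration, not the code used in this post: the function name `download_lyrics` is hypothetical, and `fetch` stands in for whatever code actually retrieves the lyrics for one song title.

```python
from pathlib import Path
from typing import Callable, List


def download_lyrics(song_list: Path, out_dir: Path,
                    fetch: Callable[[str], str]) -> List[Path]:
    """Fetch lyrics for every title in song_list, one text file per song.

    `fetch` is a placeholder (an assumption): any function that takes a
    song title and returns its lyrics as a string.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    titles = song_list.read_text(encoding="utf-8").splitlines()
    for index, title in enumerate(titles):
        title = title.strip()
        if not title:
            continue  # skip blank lines in the song list
        lyrics = fetch(title)
        # Name files by line number so odd characters in titles do not matter.
        target = out_dir / f"{index:04d}.txt"
        target.write_text(lyrics, encoding="utf-8")
        written.append(target)
    return written
```

Passing the fetcher in as an argument keeps the loop independent of any particular lyrics site.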


After this, I combined all the text files into one file with the cat command.
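For readers who prefer to stay in Python, the combining step could look roughly like this. It is a sketch under assumptions: the function name `combine_text_files` is made up for illustration, and unlike plain cat it adds a newline after any file that does not end with one, so the last line of one song does not run into the first line of the next.

```python
from pathlib import Path


def combine_text_files(src_dir: Path, combined: Path) -> int:
    """Concatenate every .txt file in src_dir into one file, like `cat *.txt`.

    Returns the number of files that were combined.
    """
    # Collect the inputs first and leave out the output file itself.
    parts = sorted(p for p in src_dir.glob("*.txt")
                   if p.resolve() != combined.resolve())
    with combined.open("w", encoding="utf-8") as out:
        for part in parts:
            text = part.read_text(encoding="utf-8")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep songs from running together
    return len(parts)
```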



The combined file contained 199,800 words and 26,000 lines of lyrics.
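Counts like these can be checked with a few lines of Python. The helper below is a hypothetical sketch, not the tool used for the numbers above. It counts whitespace-separated words, and it counts a final line even when that line has no trailing newline, so its line count can be one higher than what `wc -l` reports.

```python
from typing import Tuple


def count_words_and_lines(text: str) -> Tuple[int, int]:
    """Return (word count, line count) for a block of lyrics."""
    words = len(text.split())       # words are split on any whitespace
    lines = len(text.splitlines())  # an unterminated last line still counts
    return words, lines
```

To use it on the combined file, read the file into a string first and pass that string to the function.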



C. TensorFlow

Next, I trained char-rnn-tensorflow on this data, which gave me my first machine-generated Eminem lyrics. I posted a picture of the output on Facebook, and one of my friends insisted that this was not Eminem. The reason was the N-word: Eminem never uses it. I realized that it came from the lyrics of the featured rappers on his songs, which were part of my data.


D. Over-Fitting

Next, I wanted to test over-fitting. A car crash test uses only one to four cars to estimate how the car will behave. If the car were a Lamborghini, they would probably use just one. I wanted to apply this kind of small-sample thinking to machine learning. Since the model cannot recreate Eminem from a limited amount of data, I wanted to see what happens when the same lyrics appear multiple times in the training data.


In terms of the analogy: since the model cannot reproduce a Lamborghini from a single example, I give it hundreds of copies of the exact same Lamborghini. The model will then be able to reproduce the Lamborghini, but at that point we might not be able to call it reproduction, because it has only memorized the one car.

So, I trained TensorFlow on 10 copies of the same lyrics and sampled new lyrics from it. In the output, I can see that the sentence patterns are better than with a single copy.
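Building the 10-copy training text is a small step. Here is a minimal sketch, assuming the combined lyrics are already loaded as a string; the function name `repeat_corpus` is hypothetical.

```python
def repeat_corpus(text: str, copies: int) -> str:
    """Return the lyrics repeated `copies` times, each copy ending in a newline."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if text and not text.endswith("\n"):
        text += "\n"  # separate the end of one copy from the start of the next
    return text * copies
```

Calling `repeat_corpus(lyrics, 10)` gives a text ten times as long with no new information in it, which is exactly why it pushes the model toward memorizing the lyrics.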



After this test, I realized that over-fitting made the lyrics somewhat boring and less creative. So, I decided to stick with the original single-copy data. In other words, the over-fitted model could draw the Lamborghini, but it could not creatively design a new sports car.