The Simpsons: In Their Own Words

Background

As part of my M.S. program at USC, I took a Data Mining course where we covered the text mining technique Term Frequency-Inverse Document Frequency (TF-IDF). I found it to be pretty neat, and since we didn't spend much time on it during lecture or implement it in an assignment, I thought it'd be fun to do something on my own. But what corpus of documents would I use?

My favorite TV show of all time, hands down, is The Simpsons. I started watching it as a young kid in the early-to-mid '90s and continued through its golden years. I was (am?) one of those people who references the jokes and scenes whenever possible, whether appropriate or not. In some ways it's been a big part of my life, so of course I'd do some hacky side project involving it.

I also had an upcoming gig with the USC Annenberg Norman Lear Center to create a searchable TV show and movie script database that would require me to run some good ol' web scraping. I had not tried that before, so I figured this project would be a good way to kill two birds with one stone. So, I now had the following goals:

And with that, I got started!

The Puzzle Pieces

I'd been using Python for all my school assignments and other ML-related studying and side projects, so I wanted to continue with it. With some quick searching I came upon the Beautiful Soup Python library for scraping data off the interwebs. For TF-IDF, I opted to use Scikit-Learn's implementation. As you'll see later, I also began playing around with NLTK.

As for the dataset, I opted to go with episode scripts from Springfield! Springfield! (SS) as the corpus of documents. I also downloaded summaries from TV Calendar for comparison purposes.

The Scrapes of Wrath

Using Beautiful Soup to download scripts from SS was a breeze, thanks to the library's ease of use and the site's hierarchical structure. I was able to systematically go through each season and get the episode text, performing some basic text cleaning (normalizing whitespace, removing weird characters) and metadata extraction (season number, episode number, episode title) along the way. This data was stored in a dictionary during runtime and pickled for later analysis.
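To make the cleaning and pickling steps concrete, here's a minimal sketch using only the standard library. It is not the actual scraping code: the function names, the record layout, and the sample string are all assumptions for illustration, and a real run would feed in text pulled out of the page with Beautiful Soup.

```python
import pickle
import re
import unicodedata


def clean_text(raw):
    """Normalize unicode, drop non-printable characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep printable characters and whitespace; drop control characters and similar debris.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def make_record(season, episode, title, raw_script):
    """Bundle the cleaned script with its metadata (hypothetical layout)."""
    return {
        "season": season,
        "episode": episode,
        "title": clean_text(title),
        "text": clean_text(raw_script),
    }


# Made-up input standing in for scraped page text.
raw = "Line one.\u00a0 Line\ttwo.\r\n\r\nLine\x00 three."
scripts = {}
record = make_record(1, 1, "Example Episode", raw)
scripts[(record["season"], record["episode"])] = record

# Pickle the dictionary so later analysis can reload it without re-scraping.
blob = pickle.dumps(scripts)
restored = pickle.loads(blob)
print(restored[(1, 1)]["text"])
```

Keying the dictionary by (season, episode) is just one convenient choice; it makes it easy to look up a single episode later.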

Scraping source code

Term Frequency, Meet Inverse Document Frequency

Now that I had the document corpus, I could start the analysis of scripts to find the most important words from each episode in relation to the other episodes, and see how well those important words could describe or summarize the episode, at least for Simpsons nerds.
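For anyone who hasn't seen TF-IDF before, here's a rough pure-Python sketch of the idea: a term scores high for an episode when it shows up a lot in that episode but in few others. This is the textbook tf * ln(N / df) weighting, not Scikit-Learn's implementation (which by default smooths the IDF and L2-normalizes each vector). The tokenizer, the choice of unigrams plus bigrams, and the toy documents are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, max_n=2):
    """Lowercase, tokenize on letters/apostrophes, and return all 1..max_n-grams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, max_n=2):
    """Return {doc_id: {term: score}} using tf * ln(N / df)."""
    counts = {doc_id: Counter(ngrams(text, max_n)) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())
    n_docs = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in term_counts.items()
        }
    return scores


def top_terms(term_scores, k=5):
    """Highest-scoring terms, with ties broken alphabetically for stable output."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy documents (made up, not real script text).
docs = {
    "a": "the dental plan the dental plan lisa needs braces",
    "b": "the lemon tree the lemon tree in shelbyville",
    "c": "the chili and the coyote and the soul mate",
}
scores = tfidf(docs)
print(top_terms(scores["a"], 3))
```

Note that a word like "the", which appears in every toy document, gets a score of zero, which is exactly why TF-IDF surfaces episode-specific words.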

Overall, I thought it did a pretty durn good job! The table below highlights the results for a few favorites, and the results in their entirety can be found here.

Season Episode
3 Title: Homer at the Bat

Summary: The Springfield Nuclear Power Plant's softball team goes on a season long winning streak thanks to Homer's "Wunderbat." But with the pennant and a $1 million bet on the line Mr. Burns brings in 9 ringers from the professional baseball ranks and a disappointed Homer has to sit the bench.

Top Words: ['softball', 'strawberry', 'scioscia', 'bat', 'sax']

4 Title: A Streetcar Named Marge

Summary: Marge gets a taste of the acting bug and decides to volunteer at the Springfield Community Center. She is cast as Blanche DuBois in a musical version of A Streetcar Named Desire directed by the flamboyant Llewellyn Sinclair. Meanwhile, Maggie squares off with her strict new daycare provider.

Top Words: ['blanche', 'stella', 'stanley', 'new orleans', 'orleans']

4 Title: Last Exit to Springfield

Summary: Homer finds himself filling in for the Springfield Nuclear Power Plant's union leader when it comes time to negotiate their new contract with Mr. Burns. Homer is a tough negotiator, despite not knowing the first thing about union organizing, and forces Burns to accept the union's demands on the condition that Homer be removed as leader.

Top Words: ['dental plan', 'dental', 'braces', 'lisa needs', 'plan']

6 Title: Homer Badman

Summary: Homer and Marge attend a candy convention and have to find a babysitter for the kids. After the convention, Homer gives the babysitter a ride home. He notices that there is a very rare gummy stuck to her bottom, so he reaches out and grabs it. Homer is accused of sexual harassment and the whole town is against him until Willie saves the day.

Top Words: ['gummi', 'venus', 'candy', 'babysitter', 'grabbed']

6 Title: Lemon of Troy

Summary: The children of Springfield wage war on Shelbyville, after their beloved town lemon tree comes up missing. The fathers of Springfield take Ned's RV to search for their boys.

Top Words: ['shelbyville', 'lemon', 'tree', 'numerals', 'roman numerals']

8 Title: El Viaje de Nuestro Jomer (The Mysterious Voyage of Homer)

Summary: Remembering last year, Marge tries to hide the big annual chili cook-off from Homer. When he figures it out, she makes him promise not to drink any beer. Homer is known as the dude with the fireproof stomach and Chief Wiggum brews up some chili with Guatemalan insanity peppers, and it burns the hell out of Homer's mouth. He decides to put wax in his mouth, so he can eat the peppers whole. After eating a few, he begins hallucinating and Homer runs off into the sunset and experiences a strange journey. The rest of the family leaves without him after he embarrasses them. On his journey, Homer meets a talking coyote who tells him to find his soul mate. He wakes up on a golf course and begins his search. Marge is upset with him, making him think it is not her. Homer ends up at a lighthouse, where Marge eventually finds him. Because she found him, he figures out that she IS his soul mate. A ship crashes at the lighthouse, leaving short shorts everywhere.

Top Words: ['soul mate', 'mate', 'chili', 'soul', 'coyote']



TF-IDF source code

Episode Similarities

Since I now had vector representations of the episodes, another simple thing I could try was comparing them to find the ones most similar to each other with respect to words in the scripts. Two natural similarity measures I tried out were Jaccard Similarity and Cosine Similarity.
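As a quick refresher on the two measures, here's a minimal sketch. Jaccard compares which terms two episodes share (intersection over union), while cosine compares the weighted vectors. The dict-based vector format, the helper names, and the toy numbers are assumptions for illustration; only the "top_n" name comes from the notebook.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_pairs(vectors, measure, top_n=10):
    """Score every pair of episodes and return the top_n most similar pairs."""
    scored = [
        (a, b, measure(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(scored, key=lambda item: -item[2])[:top_n]


# Toy term-weight vectors (made-up numbers, not real TF-IDF output).
vectors = {
    "ep1": {"krusty": 0.5, "clown": 0.5},
    "ep2": {"krusty": 0.4, "clown": 0.3, "fink": 0.2},
    "ep3": {"lemon": 0.7, "tree": 0.3},
}
print(top_pairs(vectors, jaccard, top_n=1))
print(top_pairs(vectors, cosine, top_n=1))
```

Because Jaccard ignores the weights entirely, two episodes that reuse the same dialog score very high on it regardless of which words TF-IDF considers important.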

My initial guess was that episodes in the same or nearby seasons would be more similar to each other due to factors like which writers were on the show, which characters were featured, and the overall themes and tone of the show. I did see some of that in the results, but was surprised by how far apart (in terms of air dates) some of the similar episodes were. One interesting thing to point out is that in the Jaccard Similarities I kept seeing episode 3 from season 6; it turns out that this is "Another Simpsons Clip Show", which features clips of previous episodes and so shares most of its dialog with them. This is an instance of Jaccard Similarity being a decent measure of plagiarism.

I show the top 10 similar episodes by each similarity measure below, and the top 25 can be seen in the source code notebook results. It can be easily updated to show more based on the "top_n" variable.

Rank 1
Jaccard Similarity: S16 E1: Treehouse of Horror XV; S16 E2: All's Fair in Oven War
Cosine Similarity: S13 E10: Half-Decent Proposal; S15 E14: The Ziff Who Came to Dinner
Comments: S13 E10 and S15 E14 both prominently feature the insufferable Artie Ziff.

Rank 2
Jaccard Similarity: S4 E12: Marge vs. the Monorail; S9 E11: All Singing, All Dancing
Cosine Similarity: S3 E20: Colonel Homer; S19 E16: Papa Don't Leech
Comments: S9 E11 is a musical clips show, so includes clips from the classic monorail song. S19 E16 brings back Lurleen Lumpkin, who was introduced in the Colonel Homer episode.

Rank 3
Jaccard Similarity: S1 E9: Life on the Fast Lane; S6 E3: Another Simpsons Clip Show
Cosine Similarity: S13 E18: I Am Furious (Yellow); S22 E14: Angry Dad: The Movie
Comments: Instance of the clips show. S13 E18 and S22 E14 are both about a comic character Angry Dad.

Rank 4
Jaccard Similarity: S5 E9: The Last Temptation of Homer; S6 E3: Another Simpsons Clip Show
Cosine Similarity: S2 E9: Itchy & Scratchy & Marge; S7 E18: The Day the Violence Died
Comments: Instance of the clips show. Both S2 E9 and S7 E18 are about the Itchy & Scratchy Show.

Rank 5
Jaccard Similarity: S5 E8: Boy-Scoutz 'n the Hood; S9 E11: Realty Bites
Cosine Similarity: S2 E21: Three Men and a Comic Book; S7 E2: Radioactive Man
Comments: Both S2 E21 and S7 E2 feature the comic book character Radioactive Man...up and at them!!

Rank 6
Jaccard Similarity: S10 E5: When You Dish Upon a Star; S13 E17: Gump Roast
Cosine Similarity: S6 E15: Homie the Clown; S7 E15: Bart the Fink
Comments: Gump Roast is a clips show that refers back to the time Homer met Alec Baldwin and Kim Basinger. S6 E15 and S7 E15 have plots heavily involving Krusty.

Rank 7
Jaccard Similarity: S22 E12: Homer the Father; S22 E20: Homer Scissorhands
Cosine Similarity: S25 E15: The War of Art; S29 E12: Homer is Where the Art Isn't
Comments: S22 E12 and S22 E20 are modern Homer episodes. S25 E15 and S29 E12 are both about paintings/art.

Rank 8
Jaccard Similarity: S7 E1: Who Shot Mr. Burns? (Part 2); S7 E10: The Simpsons 138th Episode Spectacular
Cosine Similarity: S13 E12: The Lastest Gun in the West; S17 E22: Marge and Homer Turn a Couple Play
Comments: An instance of a clips show.

Rank 9
Jaccard Similarity: S4 E15: I Love Lisa; S6 E3: Another Simpsons Clip Show
Cosine Similarity: S9 E18: This Little Wiggy; S19 E10: E Pluribus Wiggum
Comments: Instance of a clips show. Wiggum episodes!

Rank 10
Jaccard Similarity: S23 E13: The Daughter Also Rises; S24 E10: A Test Before Trying
Cosine Similarity: S1 E12: Krusty Gets Busted; S6 E15: Homie the Clown
Comments: S1 E12 and S6 E15 prominently feature Krusty.


Similarities source code

Sentiment Analysis - Sneak Peek

One popular application of NLP is sentiment analysis, which scores a piece of text as positive or negative, here on a scale of -1.0 to 1.0. I've mostly seen this done in areas such as social media posts and product reviews, but I came upon an interesting paper that classified stories into emotional arcs and wanted to apply similar techniques to Simpsons episodes. Using the NLTK library, I sliced up the scripts with a "sliding window" approach similar to what's described in the paper, retrieved a sentiment score for each window, then plotted the scores with matplotlib. The plots below are for the episode "The Boy Who Knew Too Much" from Season 5.
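Here's a minimal sketch of the sliding-window idea. The window size, the step, and especially the scorer are assumptions: the toy lexicon below is a hypothetical stand-in so the example runs on its own, and a real run would swap in an NLTK-based scorer (NLTK's VADER analyzer is one option, though that's a guess at what fits here rather than a description of the notebook).

```python
def sliding_windows(tokens, size, step):
    """Return overlapping token windows of length `size`, advancing `step` tokens each time.

    Trailing tokens that don't fill a whole window are left out.
    """
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, step)]


# Hypothetical toy lexicon standing in for a real sentiment model.
TOY_LEXICON = {
    "great": 1.0, "happy": 1.0, "escape": 0.5,
    "boring": -0.5, "sad": -1.0, "detention": -1.0,
}


def toy_score(window):
    """Average lexicon value over the window, which keeps the result in [-1.0, 1.0]."""
    if not window:
        return 0.0
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in window) / len(window)


def sentiment_arc(text, size=4, step=2, scorer=toy_score):
    """Score each window of the text in order, giving one point per window."""
    tokens = text.lower().split()
    return [scorer(window) for window in sliding_windows(tokens, size, step)]


arc = sentiment_arc("boring boring sad sad happy great great happy")
print(arc)
```

Overlapping windows (a step smaller than the window size) mean neighboring points share text, which already makes the curve less jumpy than scoring disjoint chunks.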

First, we have the plot of raw sentiment scores for each window of text:

Figure: Episode sentiment scores (raw)

Then, I tried smoothing it out to get a better sense of the plot progression:

Figure: Episode sentiment scores (smoothed)
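One simple way to smooth a series like this is a centered moving average. The sketch below assumes that technique and a hypothetical `radius` parameter; it is an illustration of smoothing in general, not necessarily the method behind the plot above.

```python
def moving_average(values, radius=1):
    """Centered moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Made-up raw scores with a single spike.
raw_scores = [0.0, 3.0, 0.0]
print(moving_average(raw_scores, radius=1))
```

A larger radius flattens more of the noise but also blurs short emotional swings, so it's worth trying a few values against an episode you know well.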

The preliminary results don't look too bad! Episode summary from the dataset we used before:
We begin with Bart and Lisa heading to school. Bart is bored in class, so he forges a note from Marge to cut class and goes to see "Boobarama." Meanwhile, Skinner is trying to track him down after becoming suspicious of the note. Bart narrowly escapes his grasp by stowing away in Quimby's nephew's car. At a Quimby party, the nephew gets in an argument with a waiter. Bart witnesses the waiter clumsily injuring himself. No one else sees it, so the waiter claims Quimby's nephew assaulted him. The trial is a media event, but Bart cannot come forward or Skinner will give him detention. The jury consists of Homer, Skinner, Apu, et cetera. Homer is an incompetent juror and wants to be sequestered at the city's nicest hotel. After watching "MacArnagle," Bart decides to come forward to the Judge. Bart's testimony frees the innocent Quimby, but gets him four months of detention.

So in our plot above, the initial low point is Bart lamenting that he must waste a beautiful day at school; the score then climbs as he reaps the benefits of playing hooky. Things go sour as Skinner chases him down, then rise as he escapes Skinner's grasp and crashes a party of local elites. The later low point is the jury nearing a decision to falsely imprison Freddy, but we end on a high note as the right decision is made and Bart gets away with "only" four months of detention.

Sentiment analysis source code

Moving Forward

As I complete my current NLP course, I hope to gain more insights that could further flesh out this text mining exercise. One improvement that can be made to the TF-IDF step is the removal of duplicate or too-similar terms. For instance, we see top words that repeat, such as "meow meow", "meow meow meow", and "meow" in the episode "The Boy Who Knew Too Much". For the sentiment analysis, I would like to plot out more episodes, refine how the smoothing is done, better annotate the high and low points, and map the results to the six plot types outlined in the aforementioned paper.
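One possible way to do that de-duplication, sketched under my own assumptions: walk the ranked terms from best to worst and skip any term whose words are contained in (or contain) a term already kept. The function names and the containment rule are hypothetical; there are plenty of other ways to define "too similar".

```python
def contains(long_words, short_words):
    """True if short_words appears as a contiguous run inside long_words."""
    span = len(short_words)
    return any(
        long_words[i:i + span] == short_words
        for i in range(len(long_words) - span + 1)
    )


def drop_overlapping_terms(ranked_terms):
    """Keep the highest-ranked term from each group of overlapping n-grams."""
    kept = []
    for term in ranked_terms:
        words = term.split()
        redundant = False
        for other in kept:
            # Compare the shorter word list against the longer one.
            short, long_ = sorted((words, other.split()), key=len)
            if contains(long_, short):
                redundant = True
                break
        if not redundant:
            kept.append(term)
    return kept


print(drop_overlapping_terms(["meow meow", "meow meow meow", "meow"]))
print(drop_overlapping_terms(["dental plan", "dental", "braces", "lisa needs", "plan"]))
```

The trade-off is that the filter is order-dependent: whichever overlapping term ranks highest wins, so "dental plan" survives while "dental" and "plan" are dropped.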

All in all, this was a fun exercise and great learning experience!

Acknowledgements

Springfield! Springfield! - Source of Simpsons episode scripts
TV Calendar - Source of Simpsons episode summaries
Font Meme - Generated title text here
Pinterest - Source of cloud background
The emotional arcs of stories are dominated by six basic shapes - Paper that inspired the sentiment analysis of Simpsons scripts