Building a Model for Sentiment Analysis Using Natural Language Processing Part I¶

Goal¶

In this post, I will be building and evaluating a text classification model for sentiment analysis. In part I, I will learn from the data by loading all the data into the computer's internal memory while in part II, I will be exploring out-of-core learning learning techniques to allow learning on large data sets when there are memory limitations.

A note on the data¶

The training data used for this analysis is the 1.6 million English language Stanford twitter corpus that has been automatically annotated for negative and positive sentiment using emoticons. The test data is also from the Stanford group and consists of tweets that have been manually annotated with 177 reflecting negative sentiment and 182 reflecting positive sentiment. The data sets can be found here: https://help.sentiment140.com/for-students

import numpy as np
import pandas as pd
import nltk
import re #for regex
import cPickle as pickle


from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.util import ngrams
from nltk.classify.scikitlearn import SklearnClassifier


from nltk import NaiveBayesClassifier

In the loadData function, we will load the data from csv to a Pandas dataframe do some shuffling so as to get a mix of all sentiments, and remove unecessary columns. We will then pickle the data to disk to allow for faster loading.

def loadData ():
    """
    Load the cvs files with the training and test data
    """
    header = ['polarity', 'tweet_id', 'date','query', 'user', 'tweet']
    
    df_train = pd.read_csv('/home/Code/SentimentAnalysis/Twitter/\
    Data/training.1600000.processed.noemoticon.csv',
                 header=None)
    df_test = pd.read_csv('/home/Code/SentimentAnalysis/Twitter/\
    Data/testdata.manual.2009.06.14.csv',
                 header=None)

    df_train.columns = header
    df_test.columns  = header

    #Shuffle the rows so that you get a mix of pos, neutral, and neg sentiments
    df_train = df_train.sample(frac=1).reset_index(drop=True)
    df_test  = df_test.sample(frac=1).reset_index(drop=True)

    #Drop unnecessary columns
    df_train.drop(['tweet_id','date','query','user'], axis=1, inplace=True)
    df_test.drop(['tweet_id','date','query','user'], axis=1, inplace=True)

    #Pickle the data frames
    df_train.to_pickle('/Data/df_training.pkl')
    df_test.to_pickle('/Data/df_test.pkl')
    
    print "Finished loading and pickling data"

The preProcessTweet function takes a tweet and does some pre-processing on the tweet

def preProcessTweet(tweet):
    """
    Function to pre-process the tweet
    """
    #str(tweet.encode('utf-8')) 
    str(tweet)
    
    #Replace all words preceded by '@' with 'USER_NAME'
    tweet = re.sub(r'@[^\s]+', 'USER_NAME', tweet)
    
    #Replace all url's with 'URL'
    tweet = re.sub(r'www.[^\s]+ | http[^\s]+',' URL ', tweet)
    
    #Replace all hashtags with the word
    tweet = tweet.strip('#')
    
    #Replace words with long repeated characters with the shorter form
    tweet = re.sub(r'(.)\1{2,}', r'\1', tweet)
    
    #Remove any extra white space
    tweet = re.sub(r'[\s]+', ' ', tweet)
    
    return tweet

In preProcessData, we take the previoulsy pickled dataframes, pre-process each tweet in the dataframe, and pickle the now pre-processed dataframes for faster loading

def preProcessData():
    """
    Obtained the pickled data and pre-process.
    The pre-processed data is then pickled 
    """
    df_train = pd.read_pickle('/Data/df_training.pkl')
    df_test = pd.read_pickle('/Data/df_test.pkl')
    
    #Pre-process the data
    df_train['tweet'] = df_train['tweet'].apply(preProcessTweet)
    df_test['tweet'] = df_test['tweet'].apply(preProcessTweet)
    
    #Pickle pre-processed data frames
    df_train.to_pickle('/Data/df_training_preprocessed.pkl')
    df_test.to_pickle('/Data/df_test_preprocessed.pkl')

    print "Training and test data is now pre-processed"

The feature_Extractor functions takes a tweet as argument and extracts the features in the tweet.

def feature_Extractor(tweet):
    """
    Takes a tweet and extracts its features
    """
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

getFeatureVector will take a tweet as argument and tokenize each tweet

def getFeatureVector(tweet):
    """
    The function takes a tweet and does some processing
    to remove stopwords, remove punctuation, lemmatize/stem
    and reject any words that are non-alpha. Depending on the 
    flag selected, it will return a unigram, bigram, or a
    mix of the two. It returns a list with the filtered n-grams
    """
    
    flag = 3 #1 for unigram; 2 for bigram; 3 for mix
    
    #tokenize the tweet and convert each token to lower case
    #tokens = [token.lower() for token in word_tokenize(tweet)]
    tokens = [token.lower() for token in word_tokenize(tweet.decode('latin-1'))]

    punctuations = ["'", ":", ",", "-", ".", "!", "(", ")", "?", '"', ";"]
    stopWords = stopwords.words('english')
    stopWords.append("#")
    stopWords.append("%")
    stopWords = set(stopWords)
    lemmatizer = WordNetLemmatizer()
    
    #Remove stopwords, punctuation, 'url', and 'user_name'
    filteredTokens = []
    featureVector = []
    for token in tokens:
        if (token in punctuations or token in stopWords):
               continue
        elif (token == 'url' or token == 'user_name'):
            continue
        elif token.isalpha()== False: #reject non-alpha tokens
            continue
        else:
            #Normalize the tokens, either by stemming or lemmatization
            #I might also have to tag the tokens with Parts of Speech
            #<lemmatize words>
            token = lemmatizer.lemmatize(token)
            
            #This is the feature vector for each tweet
            filteredTokens.append(token)
            if flag == 1:
                #unigrams
                featureVector = filteredTokens
            elif flag == 2:
                #bigrams 
                featureVector = list(nltk.bigrams(filteredTokens))
                if featureVector != []: #ensure it is not an empty list
                    #Convert the tuple of bigrams to a string
                    featureVector = [' '.join(bigram) for bigram in featureVector]
            else:
                #mixgrams
                featureVector = list(nltk.everygrams(filteredTokens, max_len=2))
                if featureVector != []:
                    #Convert any tuple of n-grams to a string
                    temp = []
                    for everygram in featureVector:
                        if type(everygram) == tuple:
                            everygram = ' '.join(everygram)
                        temp.append(everygram)
                    featureVector = temp
                                         
    return featureVector

def getFeatures(df):
    """
    This function obtains features from a data set using a Bag of Words or Bag of n-grams approach
    """
    tweets = []
    allWords = []
    
    #Set flag for unigram (Bag of words) or n-gram(Bag of n-grams)
    #Flag = 2 # 1 is unigram; 2 is bigram; any otehr is a mixed bag of unigram and bigram (everygram)
    
    for row in df.itertuples():
        polarity = row[1]
        tweet = row[2]
        
        #Obtain the feature vector for each tweet
        featureVector = getFeatureVector(tweet)
        
        #tweets is a list containing tuples of filtered n-grams
        #and their respective sentiments
        tweets.append((featureVector, polarity))
        
        #Get list of all words/n-grams from all the tweets
        allWords.extend(featureVector)
            
    #Return dict with the frequency distribution of each word/n-gram
    wordDist = nltk.FreqDist(allWords)
    
    #Get a list of the features with each word/n-gram in the dist as a feature
    featureList = wordDist.keys()
        
    return featureList, tweets

def generateTrainFeatureList(num):
    
    #Load pre-processed pickled data frame
    df = pd.read_pickle('/Data/df_training_preprocessed.pkl')
    df = df[:num]
    
    #Extract the data set
    featureList, tweets = getFeatures(df)
    
    #Pickle the feature list and tweets
    num = num/1000
    pickle.dump(featureList, open('featureList_train_{0}k.pkl'.format(num), 'wb'))
    pickle.dump(tweets, open('tweets_train_{0}k.pkl'.format(num), 'wb')) 
    
    print "Pickle of train feature list and {0}k tweets successful".format(num)

def generateTestFeatureList():
    
    #Load pre-processed pickled data frame
    df = pd.read_pickle('/Data/df_test_preprocessed.pkl')
    
    #Drop neutral polarity rows in test data (I don't have a neutral class in my training set)
    df = df[df.polarity !=2]
    
    #Extract the data set
    featureList, tweets = getFeatures(df)
    
    #Pickle the feature list and tweets
    pickle.dump(featureList, open('featureList_test.pkl', 'wb'))
    pickle.dump(tweets, open('tweets_test.pkl', 'wb')) 
    
    print "Pickle of test feature list and tweets successful"

Pre-process the data¶

preProcessData()

Generate train feature list and tweets list. generateFeatureList takes the number of tweets to train on as an argument¶

generateTrainFeatureList(10000)

Pickle of train feature list and 10k tweets successful

Generate test feature list and tweets list.¶

generateTestFeatureList()

Pickle of test feature list and tweets successful

Classify¶

Naive Bayes Classifier¶

Obtain the training set¶

featureList = pickle.load(open('featureList_train_10k.pkl', 'rb')) 
tweets = pickle.load(open('tweets_train_10k.pkl', 'rb')) 
training_set = nltk.classify.apply_features(feature_Extractor, tweets)

Train and pickle the Naive Bayes classifier¶

NBclassifier = nltk.NaiveBayesClassifier.train(training_set)
pickle.dump(NBclassifier, open('NBclassifier_10K.pkl', 'wb'))

Here is a quick test of the Naive Bayes classifier¶

testtweet = "I love turtles"
processedTweet = preProcessTweet(testtweet)
feature_vec = getFeatureVector(processedTweet)
features =  feature_Extractor(feature_vec)

if NBclassifier.classify(features)==0:
    print "Negative"
else:
    print "Positive"

Positive

Obtain the test set¶

featureList = pickle.load(open('featureList_test.pkl', 'rb')) 
tweets = pickle.load(open('tweets_test.pkl', 'rb')) 
testing_set = nltk.classify.apply_features(feature_Extractor, tweets)

Load and evaluate the pickled Naive Bayes classifier¶

f = open('NBclassifier_10K.pkl', 'rb')
NBclassifier = pickle.load(f)
f.close()

accuracy = nltk.classify.accuracy(NBclassifier, testing_set )*100
print("Classification accuracy is %.2f %%:" % accuracy)

Classification accuracy is 74.02 %:

Show the most valuable words¶

NBclassifier.show_most_informative_features(20)

Most Informative Features
           contains(sad) = True                0 : 4      =     24.8 : 1.0
          contains(died) = True                0 : 4      =     17.1 : 1.0
       contains(welcome) = True                4 : 0      =     15.5 : 1.0
       contains(anymore) = True                0 : 4      =     13.5 : 1.0
          contains(poor) = True                0 : 4      =     13.4 : 1.0
          contains(sick) = True                0 : 4      =     13.2 : 1.0
       contains(ca find) = True                0 : 4      =     13.1 : 1.0
          contains(hurt) = True                0 : 4      =     12.1 : 1.0
         contains(shame) = True                0 : 4      =     11.7 : 1.0
      contains(headache) = True                0 : 4      =     11.6 : 1.0
         contains(upset) = True                0 : 4      =     10.4 : 1.0
        contains(lonely) = True                0 : 4      =     10.4 : 1.0
    contains(wish could) = True                0 : 4      =     10.2 : 1.0
           contains(ugh) = True                0 : 4      =      9.8 : 1.0
        contains(throat) = True                0 : 4      =      9.7 : 1.0
            contains(bo) = True                0 : 4      =      9.7 : 1.0
    contains(sorry hear) = True                0 : 4      =      9.7 : 1.0
           contains(wtf) = True                0 : 4      =      9.7 : 1.0
    contains(first time) = True                4 : 0      =      9.6 : 1.0
       contains(go away) = True                0 : 4      =      9.1 : 1.0

Note, the last column shows the number of times a certain feature appears in one class as compared to the other. So, for example, "sad" appears 24.8 more times in a negative tweet as compared to a positive one

Final words on this analysis¶

In this analysis, I found that I was limited on the size of the dataset I could use for training due to the need to load the full dataset to memory. In the next part, I will explore out-of-core techniques for text classification.</strong></font></p> </div> </div> </div> </div> </div> </body> </html>

Alex Mwai

Sentiment Analysis Model Using NLP: Part I