Pre-Processing Techniques

Pre-processing is the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We evaluate the pre-processing techniques on their resulting classification accuracy and on the number of features they produce. We find that techniques like lemmatization, removing numbers, and replacing contractions improve accuracy, while others, like removing punctuation, do not. We have used the following techniques for pre-processing:

1. Removing unicode

This technique removes all non-ASCII characters and stray escape sequences such as “\u002c” and “\x06”. Cleaning out these unused symbols reduces the noise present in the dataset and helps classification accuracy.
Eg: “\u006OMG !! ! @stellargirl I loved my Kindle” is converted to “OMG !! ! @stellargirl I loved my Kindle”
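A minimal Python sketch of how this step might look (the function name and regex are illustrative assumptions, not necessarily the code used here):

import re

def remove_unicode(text):
    # Strip literal escape sequences such as "\u002c" or "\x06" left in the raw text
    text = re.sub(r"\\[ux][0-9a-fA-F]+", "", text)
    # Drop any remaining non-ASCII characters
    return text.encode("ascii", "ignore").decode("ascii")

print(remove_unicode("\\u006OMG !! ! @stellargirl I loved my Kindle"))
# OMG !! ! @stellargirl I loved my Kindle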

2. Replacing user names and urls

This technique replaces usernames with AT_USER and URLs with URL. The words that appear inside a username or URL may look positive or negative and can mislead the classifier, so we map every username and every URL to a single common token, AT_USER and URL respectively, to improve accuracy.
Eg: “OMG !! ! @stellargirl I loved my Kindle” is converted to “OMG !! ! AT_USER I loved my Kindle”
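One possible implementation with regular expressions (the patterns and the sample link are illustrative assumptions, not necessarily the exact ones used here):

import re

def replace_users_and_urls(text):
    # Replace every @username mention with the common token AT_USER
    text = re.sub(r"@\w+", "AT_USER", text)
    # Replace web links with the common token URL
    text = re.sub(r"(https?://\S+|www\.\S+)", "URL", text)
    return text

print(replace_users_and_urls("OMG !! ! @stellargirl I loved my Kindle http://bit.ly/example"))
# OMG !! ! AT_USER I loved my Kindle URL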

3. Replacing abbreviations

This technique replaces abbreviations with their full forms. Since many people use informal shorthand, this step is important: if abbreviations are left in place, the classifier may fail to recognise them or may interpret them the wrong way, which hurts accuracy. Expanding them is therefore an important pre-processing step for getting accurate output.
Eg: “OMG !! ! AT_USER I loved my Kindle” is converted to “Oh My God !! ! AT_USER I loved my Kindle”
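A sketch of a dictionary-based expansion, assuming a small hand-made slang map (a real list would be much longer than the three entries shown):

import re

# Illustrative slang dictionary; a real mapping contains far more entries
ABBREVIATIONS = {
    "omg": "Oh My God",
    "lol": "laughing out loud",
    "idk": "I do not know",
}

def expand_abbreviations(text):
    pattern = r"\b(" + "|".join(ABBREVIATIONS) + r")\b"
    # Look up each abbreviated token (case-insensitively) and substitute its full form
    return re.sub(pattern, lambda m: ABBREVIATIONS[m.group(0).lower()], text, flags=re.IGNORECASE)

print(expand_abbreviations("OMG !! ! AT_USER I loved my Kindle"))
# Oh My God !! ! AT_USER I loved my Kindle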

4. Replacing contractions

This technique replaces contractions with their expanded forms, e.g. won’t → will not, shouldn’t → should not, isn’t → is not. This step is important because the classifier would not otherwise recognise these contractions as negative, which would noticeably reduce classification accuracy, so replacing them is necessary.
Eg: “AT_USER You’ll love your Kindle2.” is converted to “AT_USER you shall / you will love your Kindle2”.
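The same dictionary idea works for contractions; a minimal sketch, assuming a small hand-made map rather than a full contraction list:

import re

# Illustrative contraction map; a complete list covers many more forms
CONTRACTIONS = {
    "won't": "will not",
    "shouldn't": "should not",
    "isn't": "is not",
    "you'll": "you will",
}

def replace_contractions(text):
    pattern = re.compile(r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(replace_contractions("AT_USER You'll love your Kindle2."))
# AT_USER you will love your Kindle2.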

5. Removing Numbers

This technique removes all numbers present in the dataset. Numeric values in tweets are of no use for classifying a sentiment as positive or negative, so they are removed in this pre-processing step.
Eg: “AT_USER you shall / you will love your Kindle2.” is converted to “AT_USER you shall / you will love your Kindle”.
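Removing digits is a one-line regex; a sketch (the function name is illustrative):

import re

def remove_numbers(text):
    # Delete every digit, so "Kindle2" becomes "Kindle"
    return re.sub(r"\d+", "", text)

print(remove_numbers("AT_USER you shall / you will love your Kindle2."))
# AT_USER you shall / you will love your Kindle.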

6. Replacing multiple punctuations

This technique replaces a run of repeated punctuation with “multi(punctuation name)”. Repeated punctuation usually signals emphasis and can affect classification, so before punctuation is removed altogether (step 8), such runs are converted into multi(punctuation) tokens that survive the later step.
Eg: “Oh My God !! ! AT_USER I loved my Kindle.” is converted to
“Oh My God multiExclamation ! AT_USER I loved my Kindle.”
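A sketch using one regex per mark (only multiExclamation appears in the example above; the multiQuestion and multiStop token names are assumptions in the same spirit):

import re

def replace_multi_punct(text):
    # Collapse a run of two or more identical marks into a descriptive token
    text = re.sub(r"!{2,}", "multiExclamation", text)
    text = re.sub(r"\?{2,}", "multiQuestion", text)
    text = re.sub(r"\.{2,}", "multiStop", text)
    return text

print(replace_multi_punct("Oh My God !! ! AT_USER I loved my Kindle."))
# Oh My God multiExclamation ! AT_USER I loved my Kindle.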

7. Replacing negations

Dealing with negations (like “not good”) is a critical step in sentiment analysis. A negation word can influence the tone of all the words around it, and ignoring negations is one of the main causes of misclassification. In this phase, every negative construct (can’t, don’t, isn’t, never, etc.) is rewritten so that the explicit token “not” appears in the text. This technique allows the classifier model to be enriched with many negation bigram constructs that would otherwise be excluded due to their low frequency.
Eg. The sentence “This movie isn’t good for family” will be changed to new
sentence “This movie is not good for family”.
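A minimal sketch of this normalisation, assuming a hand-made map of negative constructs (a real list would be longer):

import re

# Each negative construct is rewritten so that the explicit cue word "not" appears
NEGATIONS = {
    "isn't": "is not", "aren't": "are not", "wasn't": "was not",
    "don't": "do not", "doesn't": "does not", "didn't": "did not",
    "can't": "can not", "couldn't": "could not", "won't": "will not",
    "never": "not",
}

def replace_negations(text):
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in NEGATIONS) + r")\b",
                         flags=re.IGNORECASE)
    return pattern.sub(lambda m: NEGATIONS[m.group(0).lower()], text)

print(replace_negations("This movie isn't good for family"))
# This movie is not good for family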

8. Removing punctuation

In this pre-processing technique, we remove punctuation marks such as “,”, “.”, “!”, “:”, and “;”. The main idea is that only words are needed to train the machine for prediction, so punctuation marks are not required. For some sentiments removing punctuation increases prediction accuracy, but for others it decreases it: an exclamation mark, for example, may signal an intense positive or negative sentiment, and dropping it loses that information.
Eg. The sentence “i don’t understand i really don’t. this course feels wrong, hospital radio isn’t right, and i’m not happy.” will be changed to “i dont understand i really dont this course feels wrong hospital radio isnt right and im not happy”.
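A common way to do this in Python is str.translate with the standard string.punctuation set (a sketch, not necessarily the exact code used here):

import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("i don't understand i really don't. this course feels wrong, "
                         "hospital radio isn't right, and i'm not happy."))
# i dont understand i really dont this course feels wrong hospital radio isnt right and im not happy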

9. Lowercasing

This technique reliably improves accuracy. All characters are converted to lowercase letters. In reviews, people do not write sentences consistently: some characters are uppercase and some lowercase, which affects both learning and prediction. We therefore first convert every sentence in both the test and train datasets to lowercase.
Eg. The sentence “I spilled milk all up in my Macbook.” will be changed to “i spilled milk all up in my macbook.”.
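This step is a single built-in call in Python:

def to_lowercase(text):
    # str.lower handles the whole conversion
    return text.lower()

print(to_lowercase("I spilled milk all up in my Macbook."))
# i spilled milk all up in my macbook.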

10. Removing stop words

Stop words are function words that occur in sentences with high frequency, such as “it”, “this”, and “the”. They are needless for sentiment analysis because they carry no useful information for either learning or prediction. Removing stop words does not increase accuracy by itself, but it improves storage management, since sentences without stop words require less space to store. The stop word list is not predefined and can be changed by removing words from it or adding more to it.
Eg. Sentence “It’s time you changed direction! This is the answer! It’ll blow your
socks off!” will be changed to “time changed direction ! is answer ! blow socks off!”.
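A sketch using NLTK's English stop word list (NLTK is an assumption here; as noted above, the list itself can be customised):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the one-off downloads: nltk.download("stopwords") and the punkt tokenizer models
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    # Keep only the tokens that are not in the stop word list
    tokens = word_tokenize(text)
    return " ".join(t for t in tokens if t.lower() not in STOP_WORDS)

print(remove_stopwords("It's time you changed direction! This is the answer!"))
# exact output depends on the stop word list in use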

11. Stemming

Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. Stemming helps to obtain the root forms of derived words.
Consider the examples:
Eg. Playing, Plays and Played can be stemmed to “Play” as this is the root form.
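A sketch using NLTK's Porter stemmer, one common choice (the specific stemmer is an assumption here):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Each inflected form maps to the same stem "play"
for word in ["Playing", "Plays", "Played"]:
    print(word, "->", stemmer.stem(word))
# Playing -> play
# Plays -> play
# Played -> play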
