This week I decided to explore what machine learning techniques can do for my practice project. I received a lot of thoughtful feedback from classmates (thank you Sam, Danny, Vincent, Minkyung), particularly Danco’s suggestion to use labeled training data to figure out which words are characteristic of which shops. I haven’t taken a machine learning course yet, so I used a simple Naive Bayes classifier, following chapter 6 of the NLTK book. Since fashion products are widely sold by online stores on Instagram, I use machine learning here to classify posts as fashion-related or non-fashion. The other reason for choosing two categories is that the NLTK book provides examples of two-category classification (male vs. female and positive vs. negative reviews). I wish I could classify several categories, but I haven’t figured it out and I’m running out of time.
A few weeks back I collected Instagram posts from 4 major cities (Jakarta, Bandung, Jogja and Bali) through online shop city hashtags. I chose these cities because they have the highest numbers of social media users in Indonesia (Jakarta and Bandung are among the 10 most active Twitter cities in the world; I didn’t find equivalent Instagram data, though). The goal of this exercise is to classify online store posts from these cities and compare the number of fashion vs. non-fashion posts.
First of all, we need to find stores that sell only fashion or only non-fashion products, to use as the “gold standard” for training. According to Wikipedia, the term “fashion” covers fairly broad categories, including clothing, footwear, accessories, makeup, even body piercing. In the last assignment I made a visualization of the accounts with the most followers, so I picked from that list 5 top accounts that sell only clothing, makeup, accessories or skin care to represent the fashion category. Then I chose a top account that specializes in selling cake to represent the non-fashion category. According to a friend (she’s an avid cupcake seller on Instagram), cake is also frequently sold on Instagram besides fashion, so I chose it. I crawled all the posts of both the fashion and non-fashion accounts; these are later used to train the Naive Bayes classifier.
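The labeling step can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the account names, the captions, and the helper names `label_posts` and `post_features` are made up for illustration. The `contain(word)` feature naming matches the feature names shown in the classifier output further down.

```python
import re

# Hypothetical account names; the real crawl covered five fashion
# accounts and one cake account.
FASHION_ACCOUNTS = {"shop_clothing", "shop_makeup"}
NON_FASHION_ACCOUNTS = {"shop_cake"}


def label_posts(posts):
    """Label each (account, caption) pair by the account that posted it."""
    labeled = []
    for account, caption in posts:
        if account in FASHION_ACCOUNTS:
            labeled.append((caption, "fas"))
        elif account in NON_FASHION_ACCOUNTS:
            labeled.append((caption, "non"))
        # Posts from any other account are skipped: they have no gold label.
    return labeled


def post_features(caption):
    """Turn a caption into binary 'contain(word)' features."""
    words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    return {"contain({})".format(word): True for word in words}


# Invented example posts
posts = [
    ("shop_clothing", "Dress baru, ready stock!"),
    ("shop_cake", "Cake enak untuk family"),
    ("random_user", "Sunset di Bali"),
]
labeled = label_posts(posts)
featuresets = [(post_features(caption), label) for caption, label in labeled]
```

Each `(features, label)` pair in `featuresets` is in the format that NLTK's `NaiveBayesClassifier.train` expects.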
Building the Naive Bayes Classifier
In the end, there are 10,171 crawled posts covering both the fashion and non-fashion categories. After shuffling, the posts are split 90% into a training set and the remaining 10% into a test set for the classifier. In my experiment, the 90:10 split gave the best accuracy:
- Training:testing (50:50) gives accuracy of 69.37%
- Training:testing (70:30) gives accuracy of 70.25%
- Training:testing (90:10) gives accuracy of 71.81%
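To get a feel for these numbers, the sketch below shuffles and splits 10,171 placeholder items at the three ratios and estimates how much sampling noise to expect on each accuracy figure. It is an illustration under assumptions, not the notebook's code: `shuffle_and_split` and `accuracy_standard_error` are hypothetical helpers, the seed is arbitrary, and the binomial standard error assumes the test posts behave like independent draws.

```python
import math
import random

N_POSTS = 10171  # total labeled posts, from the text


def shuffle_and_split(items, train_fraction, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test posts."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# (train fraction, reported accuracy) pairs from the experiment above
experiments = [(0.5, 0.6937), (0.7, 0.7025), (0.9, 0.7181)]

placeholder_posts = list(range(N_POSTS))  # stand-ins for the real posts
results = []
for train_fraction, accuracy in experiments:
    train, test = shuffle_and_split(placeholder_posts, train_fraction)
    se = accuracy_standard_error(accuracy, len(test))
    results.append((train_fraction, len(train), len(test), se))
    print("{:.0%} train: {} train / {} test, accuracy {:.2%} +/- {:.2%}".format(
        train_fraction, len(train), len(test), accuracy, se))
```

With roughly 1,000 test posts the standard error comes out at a little over one percentage point, so the gaps between the three splits are of a similar size to the noise of a single run. Averaging over several shuffles would give a steadier comparison.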
Since only 10% is used for the test set, the test cases might be biased. But 10% still seems reasonable, since it leaves around 1,000 of the 10,171 posts for testing. The most informative features of the classifier are quite interesting to look at.
(non: non-fashion, fas: fashion category)
- contain(cake) = True non : fas = 120.8 : 1.0
- contain(enak) = True non : fas = 46.5 : 1.0
- contain(kosmetik) = True fas : non = 45.9 : 1.0
- contain(family) = True non : fas = 45.0 : 1.0
- contain(cuci) = True fas : non = 40.3 : 1.0
“Cake”, “enak” (delicious) and “family” come out as the most informative words. Online stores use the word “delicious” a lot to market their food (cake), and cake is probably marketed to the family segment, for birthdays or just for enjoying time together with family. In the fashion category, the most informative words are “kosmetik” (cosmetics) and “cuci” (clean), the latter perhaps related to hair/pimple cleansing or skincare products.
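The ratios in the list above compare how often a word appears in posts of one category versus the other. A rough sketch of that calculation follows, assuming simple add-0.5 smoothing so that a word never seen in one category does not cause a division by zero. NLTK's exact estimator may differ in detail, `likelihood_ratio` is a hypothetical helper, and the counts are invented.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio P(word present | A) / P(word present | B) with additive smoothing.

    The feature is binary (present or absent), so there are 2 possible values,
    which is why the denominator adds 2 * smoothing.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Invented counts: the word appears in 300 of 1,000 non-fashion posts
# and in 20 of 9,000 fashion posts.
ratio = likelihood_ratio(300, 1000, 20, 9000)
print("non : fas = {:.1f} : 1.0".format(ratio))
```

Read this way, a ratio of 120.8 : 1.0 for `contain(cake)` means the word shows up in non-fashion training posts at roughly 120 times the rate it shows up in fashion training posts.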
Question: How does the number of fashion-related posts differ from the number of non-fashion posts?
Now let’s use the model to classify the posts we have. See the result in the visualization below:
The number of non-fashion posts is very small compared to fashion posts in these cities, which makes them not visible in the histogram above. The detailed post counts are as follows:
- Jakarta (30,022 fashion vs 139 non-fashion posts)
- Bandung (30,306 fashion vs 98 non-fashion posts)
- Jogja (36,897 fashion vs 113 non-fashion posts)
- Bali (29,985 fashion vs 148 non-fashion posts)
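To put these counts in proportion, the short calculation below turns them into the share of posts classified as non-fashion per city. The counts are copied from the list above; `non_fashion_share` is just an illustrative helper name.

```python
# (fashion, non-fashion) post counts per city, copied from the list above
counts = {
    "Jakarta": (30022, 139),
    "Bandung": (30306, 98),
    "Jogja": (36897, 113),
    "Bali": (29985, 148),
}


def non_fashion_share(fashion, non_fashion):
    """Fraction of classified posts that were labeled non-fashion."""
    return non_fashion / (fashion + non_fashion)


shares = {city: non_fashion_share(*pair) for city, pair in counts.items()}
total_fashion = sum(f for f, _ in counts.values())
total_non = sum(n for _, n in counts.values())
overall = non_fashion_share(total_fashion, total_non)

for city, share in shares.items():
    print("{}: {:.2%} non-fashion".format(city, share))
print("Overall: {:.2%} non-fashion".format(overall))
```

In every city the share is below half a percent, which is why the non-fashion bars are effectively invisible on a linear axis. A log-scaled axis would be one way to show both categories in the same chart.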
While it might be true that non-fashion posts are rare, the number is extremely small. It does not seem reasonable, and might be because:
- The Naive Bayes classifier I built is not good enough; I should try to improve the model or try a different machine learning technique
- The training data for the non-fashion label is too specific (in this case: cake), and non-fashion posts are not found under onlineshop hashtags that often, as confirmed by my friend (she said only premium cake, which is a bit costly, is sold on Instagram, while onlineshop hashtags are for selling cheap stuff)
- The test cases used to build the model are biased
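One way to tell these explanations apart would be to look at per-category results on the held-out test set instead of a single accuracy number. The sketch below is hypothetical: the gold and predicted labels are invented, and `confusion_counts` and `recall` are illustrative helpers, not part of the original notebook.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


def recall(counts, label):
    """Share of posts with the given gold label that were predicted correctly."""
    total = sum(n for (g, _), n in counts.items() if g == label)
    return counts[(label, label)] / total if total else 0.0


# Invented labels for ten test posts
gold = ["fas", "fas", "fas", "fas", "fas", "fas", "non", "non", "non", "non"]
predicted = ["fas", "fas", "fas", "fas", "fas", "fas", "fas", "fas", "fas", "non"]

counts = confusion_counts(gold, predicted)
for label in ("fas", "non"):
    print("recall({}) = {:.2f}".format(label, recall(counts, label)))
```

If non-fashion recall on the real test set turned out to be low, that would point to the classifier itself. If it were high, the narrow cake-only training data would be the more likely cause.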
What I learned from this playground
- I had a hard time exploring this machine learning technique because I don’t have the basics, especially on how to prepare the right data for the training/test sets and how to choose the attributes. Perhaps I should take this course (https://www.coursera.org/course/ml), which Danny suggested would be useful.
- I wish I could classify posts into several categories (which posts are footwear, which are clothing, etc.) instead of two (fashion vs. non-fashion), but the NLTK book examples I followed only cover Naive Bayes with two categories.
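For what it's worth, NLTK's `NaiveBayesClassifier` itself accepts any number of distinct labels in the training set, so a multi-category version could look roughly like the sketch below. The captions and category names are invented for illustration, `post_features` is a hypothetical helper, and a real version would need gold-standard accounts for each category.

```python
import nltk


def post_features(caption):
    """Binary 'contain(word)' features for one caption."""
    return {"contain({})".format(w): True for w in set(caption.lower().split())}


# Invented captions and category names, for illustration only
labeled = [
    ("dress batik murah", "clothing"),
    ("kemeja dress ready", "clothing"),
    ("sepatu sneakers ori", "footwear"),
    ("sandal sepatu wanita", "footwear"),
    ("lipstik kosmetik matte", "makeup"),
    ("bedak kosmetik ori", "makeup"),
]
train_set = [(post_features(text), label) for text, label in labeled]

# The same train() call used for two categories works with three labels.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(sorted(classifier.labels()))
print(classifier.classify(post_features("sepatu sneakers murah")))
```

The training call is the same as in the two-category case; only the set of labels in the training data changes.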
The script I wrote for this project is available as an IPython notebook here: http://nbviewer.ipython.org/github/girikuncoro/shopinstagram2/blob/master/machine-learning/InstaPracticeML.ipynb