NLP for classification

Roy Shpringer's DS place

Udemy assignment NLP - By Roy Shpringer

from pathlib import Path
import os 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\roysh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
import warnings

loading table

path_root = Path().resolve()
path = path_root / Path("udemy_development_task.csv")
df = pd.read_csv(path, index_col=[0])
title description category longDescription
0 Python for Beginners Learn Python programming from scratch with han... Development **Why Python ?**\n\n * Python is one of the w...
1 Design Patterns in Python Learn the Design Patterns in a practical way u... Development Learning Design Pattern is a voracious learnin...
2 Unity Mobile C# Developer Course Create and deploy games for Android & iOS usin... Development Build 3 simple mobile games using the free Uni...
3 Django | Build a Smart Chatbot Using AI Learn Django By Building Chatbot Using AI Development **This courses will teach you How to Build a C...
4 Flutter Augmented Reality Course - Build 10+ A... Learn Google's Flutter ARCore & Become AR Deve... Development In this course you will learn how to develope ...
... ... ... ... ...
5656 What the FICO 2.0: The Essential Guide to Cred... Your Complete Guide to Fixing Bad Credit, Buil... Finance & Accounting Trying to understand credit can be somewhat co...
5657 Manual Bookkeeping Level 2 - update manual ledgers, prepare a pro... Finance & Accounting Manual bookkeeping covers the material equival...
5658 CorelDRAW for Beginners: Graphic Design in Cor... Learn how to design in Corel DRAW with these e... Design **Start creating professional graphic design i...
5659 SEO WordPress Masterclass: The Best Google Ran... Learn Website Search Engine Optimization With ... Marketing **Learn the most effective SEO Wordpress strat...
5660 Learn Thai for Beginners: The Ultimate 105-Les... You learn Thai minutes into your first lesson.... Teaching & Academics Are you ready to start speaking, writing and u...

14151 rows × 4 columns

  • we have some duplicated indices, so we need to re-index the table:
df= df.reset_index(drop=True)

check nulls

title               0
description         5
category           48
longDescription     0
dtype: int64
(df['longDescription'] == '\n\n').sum()
  • we have only a small number of nulls, and most of them are in the target column. we will drop the rows with nulls, since relabeling them from the textual features is not straightforward and the gain would be very small
df= df.dropna()


how many categories?

print("number of categories: " ,df['category'].nunique())
print("\ncategories:\n\n" )
pd.DataFrame(df['category'].unique(), columns=['category name'])
number of categories:  13


category name
0 Development
1 Teaching & Academics
2 Business
3 IT & Software
4 Personal Development
5 Finance & Accounting
6 Music
7 Design
8 Marketing
9 Photography & Video
10 Lifestyle
11 Office Productivity
12 Health & Fitness

category distribution

we can see below that "Development" is by far the most popular category: about 60% of the courses belong to it. "Business" is in 2nd place, far behind, with a share of under 8%. This means that all the non-development categories combined reach about a 40% share of the courses. Also, some of the categories with a more humanistic nature (like "Music", "Photography & Video" and "Health & Fitness") are really rare, each with a share below 1%

cats = pd.DataFrame(df['category'].value_counts(normalize=True).reset_index()).rename(columns= {'index':'category','category':'%'})

cats['%'] = (cats['%'].round(7)*100).astype('str').str[:4]+'%'

category %
0 Development 60.1%
1 Business 7.68%
2 IT & Software 5.72%
3 Personal Development 5.65%
4 Teaching & Academics 4.87%
5 Design 4.66%
6 Marketing 3.99%
7 Finance & Accounting 2.56%
8 Office Productivity 2.18%
9 Lifestyle 0.97%
10 Photography & Video 0.56%
11 Music 0.49%
12 Health & Fitness 0.42%
fig= plt.figure(figsize=(10,10))

plt.pie(df['category'].value_counts().values, labels=df['category'].value_counts().index, autopct='%1.1f%%', pctdistance=1.05, labeldistance=1.15);

plt.pie(df['category'].value_counts().values, autopct=lambda x: '{:.0f}'.format(x*df['category'].value_counts().sum()/100));

plt.title ("course categories distribution")
Text(0.5, 1.0, 'course categories distribution')

longest text

the longest description is 164 characters long, and the longest longDescription is 32,103 characters long

df['lengh_long_desc'] = df['longDescription'].map(lambda x: len(str(x))) 

df['lengh_short_desc'] = df['description'].map(lambda x: len(str(x))) 
print("lengh=",len(df.loc[df['lengh_short_desc'].idxmax()]['description']),'\n\n',  df.loc[df['lengh_short_desc'].idxmax()]['description'])
lengh= 164 

 Learn how to develop software in Behaviour Driven Development (BDD) using Specflow -  part of the Cucumber software family of tools for software testing automation.
print("lengh=",len(df.loc[df['lengh_long_desc'].idxmax()]['longDescription']),'\n\n',  df.loc[df['lengh_long_desc'].idxmax()]['longDescription'])
text length distribution by category

fig = plt.figure(figsize=(11,9)) 
ax1 = plt.subplot(1,2,1) 
# subplot(1, 2, 1): number of rows, number of columns, and index of this subplot
ax2 = plt.subplot(1,2,2)

sns.boxplot(x="lengh_short_desc", y="category", data= df, ax=ax2)

sns.boxplot(x=df["lengh_long_desc"].clip(0,7500), y="category", data= df, ax=ax1)



we can see that the Development category has slightly longer long-descriptions, with many outliers, and longer short-descriptions: 25% of its courses have a short description of over 125 characters. (note for long_desc: I've clipped the values at 7,500 characters in order to visualize the boxplots properly)

longest text in each category :

df.groupby('category').agg(max_long_desc= ('lengh_long_desc','max')).sort_values(by='max_long_desc',  ascending=False)
Development 32103
Personal Development 20021
Design 16801
Marketing 16435
Business 16359
IT & Software 15610
Finance & Accounting 14129
Lifestyle 9156
Teaching & Academics 8757
Office Productivity 8555
Health & Fitness 8546
Photography & Video 8444
Music 8306

how many numbers?

we see in the plot below that humanistic courses contain fewer numbers, which might help differentiate them from development courses

df['num_count']= df.iloc[:,3].apply(lambda x: len(re.findall(r'(\d\.\d+|\d+)', x)))  # float OR integer (raw string avoids invalid-escape warnings)
fig = plt.figure(figsize=(12,8)) 

sns.boxplot(x=df["num_count"].clip(0,100), y="category", data= df)

<AxesSubplot:xlabel='num_count', ylabel='category'>
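
To make the counting rule concrete, here is a small sketch on a made-up sentence. The sample text and the alternative pattern are illustrative assumptions, not part of the notebook. It suggests that the pattern used above only treats a number as a float when a single digit precedes the dot, so a value like 12.5 is counted as two numbers.

```python
import re

# Pattern used in the notebook: a float with ONE leading digit, or an integer.
NOTEBOOK_PATTERN = r"(\d\.\d+|\d+)"
# Hypothetical alternative that allows several digits before the dot.
ALT_PATTERN = r"(\d+\.\d+|\d+)"

sample = "Build 3 apps in 12.5 hours with Python 3.8"  # made-up text

notebook_hits = re.findall(NOTEBOOK_PATTERN, sample)
alt_hits = re.findall(ALT_PATTERN, sample)

print(notebook_hits, len(notebook_hits))  # ['3', '12', '5', '3.8'] 4
print(alt_hits, len(alt_hits))            # ['3', '12.5', '3.8'] 3
```

For a rough "how many numbers" feature the difference is probably minor, but it is worth knowing when reading the absolute counts.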

how many non-alphanumeric characters?

we have more in the humanistic courses, but overall the separation between the categories is small

fig = plt.figure(figsize=(12,8)) 

df['num_non_alphanum']= df.iloc[:,3].apply(lambda x: len(re.findall("[^0-9A-Za-z ]", x)))  # non-alpha numeric or spaces

sns.boxplot(x=df['num_non_alphanum'], y="category", data= df)

<AxesSubplot:xlabel='num_non_alphanum', ylabel='category'>
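
A quick sketch of what this character class actually counts, on a made-up snippet written in the style of the markdown-formatted long descriptions (the sample string is an assumption). Markdown asterisks and newline characters are counted too, so the feature partly measures formatting rather than only punctuation.

```python
import re

NON_ALNUM = r"[^0-9A-Za-z ]"  # same character class as in the notebook

# Made-up snippet; real long descriptions contain similar markdown markup.
sample = "**Why Python?**\n\n * It's easy"

hits = re.findall(NON_ALNUM, sample)
print(len(hits))         # 9
print(hits.count("*"))   # 5 asterisks from the markdown markup
print(hits.count("\n"))  # 2 newline characters
```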


from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin, BaseEstimator
from nltk.stem import PorterStemmer

creating the target and uniting all text columns to one column

we combine all the non-development categories into one category, since the task is to correctly label the development courses. This simplifies the problem, since labeling a very unbalanced multiclass dataset is a much harder task for a model.

df.loc[df['category']=='Development', 'is_dev'] = True
df.loc[df['category']!='Development', 'is_dev'] = False

in order to keep the number of dimensions small, I prefer uniting all the textual columns into one column

df['text'] = df['title']+' '+df['description']+' '+df['longDescription']
Index(['title', 'description', 'category', 'longDescription',
       'lengh_long_desc', 'lengh_short_desc', 'num_count', 'num_non_alphanum',
       'is_dev', 'text'],

cleaning text

  • remove end-of-line marks (\n\n)
  • replace URLs with the generic "url"
  • remove stopwords (using nltk's English stopwords)
  • remove punctuation inside words (- and ')
  • remove non-alphanumeric characters and lowercase the text
  • replace numbers above 9 with 999
  • replace two or more consecutive spaces with a single space
def remove_end_line (x):
    return  x.str.replace("\n\n"," ")

def remove_url(x):
    # regex=True is needed since pandas 2.0, where str.replace defaults to literal matching
    return x.str.replace(r"https*\S+", "url", regex=True)

def remove_puncs_inside_words(x):
    return x.str.replace(r"['\-]", "", regex=True) 

def remove_non_alphanomeric_and_lower (x):
    return x.str.replace(r"[^0-9A-Za-z ']", " ", regex=True).str.lower()

def replace_high_numbers (x):   #higher than 9 will be replaced with 999
    try:
        y= int(x) > 9
    except ValueError:   # not a number: keep the token unchanged
        return x
    if y :
        return '999'
    else :
        return x

def remove_stop_words (x):   
    stopwords_dict = {word: 1 for word in stopwords.words("english")}  
    return ' '.join( [y for y in x.split() if y not in stopwords_dict])  #--> dict lookup is much faster than searching a list

def replace_overspaces(x) :
    return x.str.replace(r"\s{2,}", " ", regex=True)
df[['text']]= df[['text']].apply(remove_end_line)\
                            .applymap(lambda x:  ' '.join(replace_high_numbers(x) for x in x.split()))\
title description category longDescription lengh_long_desc lengh_short_desc num_count num_non_alphanum is_dev text
0 Python for Beginners Learn Python programming from scratch with han... Development **Why Python ?**\n\n * Python is one of the w... 1552 74 0 103 True python beginners learn python programming scra...
1 Design Patterns in Python Learn the Design Patterns in a practical way u... Development Learning Design Pattern is a voracious learnin... 1919 57 0 80 True design patterns python learn design patterns p...
2 Unity Mobile C# Developer Course Create and deploy games for Android & iOS usin... Development Build 3 simple mobile games using the free Uni... 1734 56 1 106 True unity mobile c developer course create deploy ...
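
As a rough illustration of the cleaning steps, the sketch below applies them to a single made-up string using only the standard library. The step order follows the bullet list above, which is an assumption because the full chaining is not shown, and the tiny stopword set is a stand-in for nltk's English list.

```python
import re

# Tiny stand-in for nltk's English stopword list (assumption, illustration only).
STOPWORDS = {"the", "to", "in", "and", "a", "is", "of", "with"}

def clean_text(text: str) -> str:
    text = text.replace("\n\n", " ")                    # remove end-of-line marks
    text = re.sub(r"https*\S+", "url", text)            # replace URLs with "url"
    text = " ".join(w for w in text.split() if w not in STOPWORDS)  # drop stopwords
    text = re.sub(r"['\-]", "", text)                   # drop punctuation inside words
    text = re.sub(r"[^0-9A-Za-z ]", " ", text).lower()  # keep alphanumerics, lowercase
    text = " ".join("999" if w.isdigit() and int(w) > 9 else w
                    for w in text.split())              # numbers above 9 -> 999
    return re.sub(r"\s{2,}", " ", text).strip()         # collapse repeated spaces

# Made-up sample text
sample = "Learn real-time apps in 30 days!\n\nThe docs: https://example.com/x and 3 demos"
print(clean_text(sample))
# learn realtime apps 999 days the docs url 3 demos
```

In this ordering a capitalised "The" survives, because stopwords are matched before lowercasing. If that is not intended, lowercasing first would be one way to avoid it.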

analysis after cleaning : creating a wordcloud for each category

  • development courses:

wordcloud = WordCloud(stopwords=['course','learn','this','the','in','it','you','use']).generate(df[df['is_dev']==1]['text'].sum())

plt.imshow(wordcloud, interpolation='bilinear')
  • non-development courses:

wordcloud = WordCloud(stopwords=['course','learn','this','the','thi','in','it','you','use']).generate(df[df['is_dev']==0]['text'].sum())

plt.imshow(wordcloud, interpolation='bilinear')

it seems that words like 'application', 'project', 'python' and 'machine learning' are distinctly more frequent in the development courses (large only in the upper image). also, the word 'business' seems to be very important mainly to the non-development courses

creating a dataframe with only the needed features:

df_m = df[['text','lengh_long_desc', 'lengh_short_desc', 'num_count','is_dev']]

Splitting to train/test

X = df_m.drop(['is_dev'],axis=1) 
y = df_m['is_dev']    

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Creating a pipeline for a random-forest model:

the pipeline steps are:

  • stemming with porter stemmer
  • fitting a tf/idf transformer only on the text column (with a column transformer) -- this will create our vocabulary
  • using grid search to cross-validate the model and tune the hyperparameters
  • fitting with the best estimator of the random-forest algorithm

Stemming/ Lemma

while lemmatization is better at maintaining the contextual meaning of a word, we will use stemming since it is much faster. stemming reduces words (mainly verbs, and the endings of nouns) to their root form, and therefore decreases the dimensionality significantly

class TextStemmer(TransformerMixin, BaseEstimator):    
    def __init__(self):
        super().__init__()
        self.stemmer = PorterStemmer()  # attribute name is assumed
    def fit (self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()  # avoid modifying the caller's dataframe
        X['text_stem']= X['text'].map(lambda y: ' '.join(self.stemmer.stem(z) for z in y.split()))
        X= X.drop('text', axis=1)
        return X


we will use tf/idf to generate the vocabulary

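tf/idf in the sklearn package is defined (with smooth_idf=False) as:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$$

$$\text{idf}(t) = \log\left(\frac{n}{\text{df}(t)}\right) + 1$$

(tf = term frequency, idf = inverse document frequency, df(t) = number of documents containing the token t, n = total number of documents; with the default smooth_idf=True, sklearn uses log((1 + n) / (1 + df(t))) + 1 instead)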

tf/idf adds a weighting to the count vector of each document by multiplying the count of a token (tf) by a term that reflects how rare or frequent the token is in the entire corpus (idf):

idf is defined here as the logarithm of the total number of documents divided by the number of documents the token appears in, plus 1.

if the ratio is big - meaning the token appears in only a few documents of the corpus - the term will be well above 1, and will give a boost to the count.

if the ratio is small - meaning the token appears in most documents of the corpus - the term will be close to 1 (its minimum, reached when the token appears in every document), so the count gets little or no boost.
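
A small numeric sketch of the idf term, assuming a made-up corpus of 1,000 documents. The function name and the numbers are illustrative; the smoothed variant follows scikit-learn's documented smooth_idf=True formula.

```python
import math

def idf(n_docs: int, doc_freq: int, smooth: bool = False) -> float:
    """Inverse document frequency with the '+1' offset."""
    if smooth:  # scikit-learn's default behaviour (smooth_idf=True)
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n_docs = 1000  # made-up corpus size

rare = idf(n_docs, doc_freq=5)       # token appears in only 5 documents
common = idf(n_docs, doc_freq=1000)  # token appears in every document

# With norm=None the tf-idf value is simply count * idf.
tf = 3
print(tf * rare, tf * common)
```

The rare token gets a weight of roughly 6.3 per occurrence, while the token that appears everywhere keeps a weight of exactly 1, so its tf-idf value equals its raw count.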

creating all the transformers:

stem_text = TextStemmer()

tfidf = TfidfVectorizer(analyzer = 'word' ,token_pattern= r"(?u)\b\w+\b" ,ngram_range=(1,2), min_df= 0.005, max_df= 0.99 ,norm=None) 

rfc= RandomForestClassifier(max_depth=9, n_estimators=100, random_state= 42)

text_tfidf = ColumnTransformer(transformers= [('tfidf', tfidf, 'text_stem')], remainder= 'passthrough', sparse_threshold=0 )

tf/idf parameters:

  • we are using n-grams since some words get a different meaning in combination with a second word. we also decrease the size of the vocabulary by using min_df

  • the token_pattern allows words that are one character long, like "C" or "R", which are important programming languages that I anticipate will appear many times in the text
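
The effect of these two settings can be sketched with the standard library alone. The sample text is made up, and the bigram construction is a simplified stand-in for what the vectorizer does internally.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default: tokens of 2+ characters
NOTEBOOK_PATTERN = r"(?u)\b\w+\b"   # pattern used above: tokens of 1+ characters

text = "learn c and r program"  # made-up, already cleaned text

default_tokens = re.findall(DEFAULT_PATTERN, text)
tokens = re.findall(NOTEBOOK_PATTERN, text)

# ngram_range=(1, 2): unigrams plus bigrams of adjacent tokens
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(default_tokens)    # ['learn', 'and', 'program']
print(tokens + bigrams)
```

With the default pattern the tokens "c" and "r" would be dropped; with the pattern used here they are kept and also take part in bigrams such as "learn c".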

final_model = Pipeline(steps=[('st', stem_text),
                              ('ct', text_tfidf),
                              ('rfc', rfc)
                             ])
params_store= final_model.get_params()

param_search = {
  'ct__tfidf__min_df' :[0.005, 0.1],
  'rfc__max_depth' : [6, 10, 12],
  'rfc__n_estimators' :[150]
}


using the pipeline:

gsearch = GridSearchCV(estimator= final_model,  cv=3,  
                      param_grid= param_search, verbose=2 )
gsearch.fit(X_train, y_train.astype('bool'))
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  39.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.7s remaining:    0.0s
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  39.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  40.1s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  41.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  42.1s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  42.2s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.9s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  37.0s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  39.3s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  39.7s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.2s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.6s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.4s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.4s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 12.2min finished
             estimator=Pipeline(steps=[('st', TextStemmer()),
             param_grid={'ct__tfidf__min_df': [0.005, 0.1],
                         'rfc__max_depth': [6, 10, 12],
                         'rfc__n_estimators': [150]},
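
The "6 candidates, 18 fits" in the log follows directly from the grid. A minimal sketch with the standard library, reusing the same parameter names and values:

```python
from itertools import product

param_search = {
    "ct__tfidf__min_df": [0.005, 0.1],
    "rfc__max_depth": [6, 10, 12],
    "rfc__n_estimators": [150],
}
cv_folds = 3

# Every combination of the listed values is one candidate.
candidates = [dict(zip(param_search, values))
              for values in product(*param_search.values())]

total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 6 18
```

At roughly 40 seconds per fit this matches the 12 minutes reported. With the default refit=True, GridSearchCV then fits the best candidate once more on the full training set, which is not part of the 18 logged fits.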

The Vocabulary

fitting tf-idf creates a dictionary of all the unique tokens in our train documents, so every token gets a unique index number

in the next stage it counts the occurrences of each token in every document and in this way creates a feature vector.

in the last stage it adds the idf weight as described above
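
The first two stages can be sketched on a made-up three-document corpus. This is a simplified stand-in for the vectorizer, assuming whitespace tokenization and unigrams only.

```python
from collections import Counter

docs = ["learn python program", "python web app", "learn web design"]  # made-up corpus

# Stage 1: every unique token gets an index (scikit-learn assigns them in sorted order).
unique_tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# Stage 2: count the occurrences of each token in every document.
def count_vector(doc: str) -> list[int]:
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

The third stage would multiply each column by its idf weight. The dictionary fitted on the real training data has the same shape: stemmed unigrams and bigrams mapped to column indices.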

{'mobil': 3192,
 'applic': 384,
 'manual': 3102,
 'test': 4589,
 'io': 2578,
 'bug': 647,
 'track': 4811,
 'debug': 1369,
 'realtim': 3895,
 'process': 3697,
 'agil': 222,
 'methodolog': 3166,
 'develop': 1445,
 'hand': 2163,
 'devic': 1489,
 'function': 1987,
 'usabl': 4924,
 'consist': 977,
 'autom': 470,
 'type': 4859,
 'either': 1625,
 'come': 886,
 'instal': 2527,
 'softwar': 4279,
 'distribut': 1524,
 'platform': 3591,
 'growth': 2138,
 'past': 3530,
 'year': 5367,
 'a': 108,
 'studi': 4458,
 'conduct': 965,
 'group': 2135,
 'predict': 3657,
 'gener': 2025,
 '4': 47,
 '2': 16,
 'billion': 595,
 'revenu': 3991,
 '999': 68,
 '7': 62,
 'u': 4863,
 's': 4033,
 'smartphon': 4263,
 'app': 356,
 'download': 1558,
 'thi': 4675,
 'cours': 1031,
 'go': 2079,
 'cover': 1225,
 'follow': 1897,
 'approach': 402,
 '1': 3,
 'hardwar': 2180,
 'the': 4612,
 'includ': 2477,
 'intern': 2556,
 'screen': 4083,
 'size': 4229,
 'resolut': 3964,
 'space': 4311,
 'memori': 3157,
 'camera': 731,
 'radio': 3845,
 'etc': 1710,
 'sometim': 4297,
 'refer': 3916,
 'as': 430,
 'simpl': 4207,
 'work': 5303,
 'it': 2598,
 'call': 728,
 'differenti': 1499,
 'earlier': 1589,
 'method': 3164,
 'even': 1713,
 'basic': 509,
 'differ': 1496,
 'import': 2439,
 'understand': 4879,
 'nativ': 3276,
 'creat': 1243,
 'use': 4926,
 'like': 2946,
 'tablet': 4518,
 'b': 490,
 'web': 5166,
 'serversid': 4152,
 'access': 153,
 'websit': 5187,
 'browser': 644,
 'chrome': 800,
 'connect': 973,
 'network': 3312,
 'c': 716,
 'hybrid': 2316,
 'combin': 884,
 'they': 4672,
 'run': 4029,
 'offlin': 3414,
 'written': 5358,
 'technolog': 4578,
 'html5': 2309,
 'css': 1305,
 'mobil applic': 3194,
 'io applic': 2582,
 'mobil devic': 3196,
 'test autom': 4590,
 'past year': 3532,
 '999 7': 71,
 'thi cours': 4682,
 'cours go': 1117,
 'go cover': 2086,
 'cover follow': 1233,
 'applic work': 399,
 'applic creat': 387,
 'creat use': 1282,
 'web app': 5168,
 'use differ': 4944,
 'use web': 5006,
 'web technolog': 5183,
 'technolog like': 4579,
 'learn': 2778,
 'program': 3717,
 'beginn': 551,
 'advanc': 193,
 'scratch': 4077,
 'best': 574,
 'exampl': 1752,
 'purpos': 3797,
 'languag': 2740,
 'at': 449,
 't': 4511,
 'lab': 2732,
 'variou': 5031,
 'game': 2002,
 'object': 3393,
 'orient': 3467,
 'teach': 4555,
 'everyth': 1739,
 'start': 4354,
 'oper': 3452,
 'concept': 945,
 'topic': 4800,
 'everi': 1724,
 'lesson': 2915,
 'explain': 1786,
 'detail': 1440,
 'code': 842,
 'those': 4715,
 'want': 5098,
 'strong': 4431,
 'knowledg': 2715,
 'take': 4523,
 'divid': 1528,
 'three': 4724,
 'part': 3514,
 'first': 1877,
 'second': 4094,
 'third': 4708,
 'learn c': 2795,
 'c program': 724,
 'program beginn': 3720,
 'basic advanc': 510,
 'advanc learn': 198,
 'program languag': 3733,
 'languag c': 2741,
 'c develop': 720,
 'languag use': 2749,
 'use variou': 5003,
 'softwar develop': 4281,
 'object orient': 3396,
 'languag thi': 2748,
 'cours teach': 1194,
 'teach everyth': 4560,
 'start basic': 4355,
 'it cover': 2601,
 'cover topic': 1236,
 'advanc topic': 201,
 'explain detail': 1788,
 'want learn': 5108,
 'learn program': 2864,
 'take cours': 4530,
 'cours thi': 1199,
 'cours divid': 1088,
 'first learn': 1880,
 'learn basic': 2790,
 'learn object': 2855,
 'get': 2027,
 'power': 3630,
 'framework': 1938,
 'python': 3803,
 'that': 4606,
 'easi': 1594,
 'with': 5282,
 'grow': 2136,
 'skill': 4231,
 'gap': 2020,
 'need': 3284,
 'talent': 4543,
 'greater': 2127,
 'ever': 1720,
 'befor': 545,
 'ground': 2133,
 'build': 648,
 'launch': 2769,
 'career': 749,
 'entrepreneur': 1691,
 'make': 3045,
 'say': 4057,
 'give': 2069,
 'fundament': 1991,
 'well': 5204,
 'handson': 2167,
 'experi': 1770,
 'requir': 3960,
 'success': 4473,
 'turn': 4849,
 'comput': 938,
 'modern': 3204,
 'machin': 3030,
 'next': 3341,
 'move': 3241,
 'beyond': 587,
 'static': 4387,
 'dynam': 1580,
 'we': 5140,
 'won': 5295,
 'stop': 4414,
 'there': 4664,
 'll': 2972,
 'also': 261,
 'implement': 2437,
 'full': 1971,
 'authent': 465,
 'system': 4505,
 'final': 1862,
 'extend': 1799,
 'integr': 2543,
 'thirdparti': 4711,
 'api': 353,
 'when': 5241,
 'finish': 1872,
 'fulli': 1979,
 'equip': 1697,
 'custom': 1316,
 '0': 0,
 'latest': 2766,
 'version': 5046,
 'avail': 475,
 'provid': 3786,
 'relev': 3940,
 'inform': 2510,
 'content': 991,
 'legaci': 2909,
 'user': 5010,
 'about': 140,
 'author': 466,
 'sinc': 4218,
 'discov': 1517,
 'way': 5127,
 'he': 2188,
 'interest': 2550,
 'appli': 381,
 'scienc': 4070,
 'address': 185,
 'problem': 3692,
 'parallel': 3510,
 'domain': 1545,
 'get start': 2062,
 'web framework': 5174,
 'easi learn': 1596,
 'learn use': 2890,
 'build app': 652,
 'use skill': 4991,
 'web develop': 5173,
 'make web': 3071,
 'web applic': 5169,
 'applic develop': 388,
 'it thi': 2623,
 'cours give': 1116,
 'fundament concept': 1992,
 'handson experi': 2169,
 'experi requir': 1774,
 'build web': 684,
 'well start': 5215,
 'websit develop': 5190,
 'app we': 375,
 'won t': 5296,
 'we ll': 5151,
 'll also': 2974,
 'also cover': 264,
 'learn integr': 2834,
 'finish cours': 1873,
 'cours fulli': 1113,
 'build custom': 662,
 'app thi': 373,
 'cours use': 1207,
 'latest version': 2768,
 'relev inform': 3941,
 'about author': 141,
 'easi way': 1600,
 'way learn': 5134,
 'learn web': 2893,
 'develop he': 1461,
 'comput scienc': 941,
 'path': 3533,
 'realworld': 3896,
 'solut': 4289,
 'modular': 3210,
 'one': 3423,
 'effici': 1621,
 'seen': 4120,
 'increas': 2488,
 'rate': 3855,
 'adopt': 191,
 'mainli': 3041,
 'lightweight': 2945,
 'display': 1520,
 'great': 2122,
 'robust': 4015,
 'perform': 3559,
 'varieti': 5030,
 'open': 3448,
 'sourc': 4307,
 'reliabl': 3943,
 'often': 3415,
 'googl': 2104,
 'deriv': 1408,
 'addit': 182,
 'featur': 1837,
 'collect': 878,
 'safeti': 4041,
 'capabl': 740,
 'builtin': 690,
 'larg': 2754,
 'standard': 4351,
 'librari': 2930,
 'if': 2404,
 'foundat': 1932,
 'improv': 2450,
 'packt': 3497,
 'video': 5050,
 'seri': 4140,
 'individu': 2502,
 'product': 3704,
 'put': 3800,
 'togeth': 4776,
 'logic': 2988,
 'stepwis': 4408,
 'manner': 3100,
 'highlight': 2254,
 'are': 409,
 'strategi': 4424,
 'design': 1412,
 'pattern': 3537,
 'deal': 1368,
 'storag': 4415,
 'data': 1327,
 'mysql': 3270,
 'let': 2917,
 'quick': 3832,
 'look': 2999,
 'journey': 2670,
 'tutori': 4851,
 'leav': 2901,
 'off': 3408,
 'you': 5378,
 'immedi': 2433,
 'practic': 3634,
 'offer': 3409,
 'avoid': 478,
 'common': 899,
 'mistak': 3189,
 'new': 3317,
 'initi': 2515,
 'upon': 4919,
 'i': 2317,
 'o': 3392,
 'file': 1855,
 'command': 893,
 'line': 2957,
 'tool': 4784,
 'error': 1699,
 'handl': 2166,
 'help': 2208,
 'structur': 4436,
 'log': 2987,
 'context': 998,
 'packag': 3496,
 'databas': 1350,
 'nosql': 3368,
 'mongodb': 3219,
 'across': 168,
 'microservic': 3170,
 'further': 1994,
 'explor': 1793,
 'interact': 2548,
 'via': 5049,
 'demonstr': 1400,
 'tune': 4847,
 'lastli': 2763,
 'reactiv': 3864,
 'serverless': 4151,
 'tip': 4754,
 'trick': 4838,
 'by': 709,
 'end': 1651,
 'abl': 124,
 'bridg': 636,
 'meet': 3153,
 'your': 5427,
 'expert': 1782,
 'esteem': 1707,
 'ensur': 1680,
 'smooth': 4264,
 'receiv': 3901,
 'master': 3115,
 'degre': 1389,
 'institut': 2534,
 'mine': 3182,
 'high': 2243,
 'largescal': 2756,
 'current': 1312,
 'lead': 2774,
 'team': 4569,
 'refin': 3917,
 'focus': 1895,
 'emphasi': 1641,
 'continu': 999,
 'deliveri': 1396,
 'publish': 3791,
 'number': 3388,
 'paper': 3508,
 'sever': 4161,
 'area': 413,
 'passion': 3526,
 'share': 4165,
 'idea': 2400,
 'other': 3472,
 'huge': 2313,
 'fan': 1822,
 'backend': 495,
 'learn path': 2858,
 'one power': 3432,
 'languag it': 2743,
 'easi use': 1599,
 'open sourc': 3450,
 'make easi': 3055,
 'build simpl': 678,
 'if interest': 2409,
 'improv perform': 2451,
 'go learn': 2089,
 'path packt': 3535,
 'packt s': 3498,
 's video': 4037,
 'video learn': 5059,
 'path seri': 3536,
 'seri individu': 4141,
 'individu video': 2503,
 'video product': 5062,
 'product put': 3705,
 'put togeth': 3801,
 'togeth logic': 4777,
 'logic stepwis': 2989,
 'stepwis manner': 4409,
 'manner video': 3101,
 'video build': 5052,
 'build skill': 679,
 'skill learn': 4239,
 'learn video': 2892,
 'video it': 5058,
 'it the': 2622,
 'the highlight': 4624,
 'highlight learn': 2255,
 'path are': 3534,
 'design pattern': 1426,
 'applic use': 397,
 'use advanc': 4928,
 'let s': 2920,
 's take': 4036,
 'take quick': 4535,
 'quick look': 3834,
 'look learn': 3004,
 'learn journey': 2838,
 'thi learn': 4689,
 'advanc concept': 194,
 'i o': 2364,
 'file system': 1857,
 'command line': 894,
 'error handl': 1700,
 'you also': 5380,
 'also learn': 274,
 'use mysql': 4973,
 'come across': 888,
 'you learn': 5397,
 'tip trick': 4755,
 'by end': 711,
 'end learn': 1654,
 'basic understand': 524,
 'go use': 2093,
 'advanc featur': 197,
 'meet your': 3154,
 'your expert': 5429,
 'expert we': 1784,
 'combin best': 885,
 'best work': 580,
 'work follow': 5309,
 'follow esteem': 1901,
 'esteem author': 1708,
 'author ensur': 467,
 'ensur learn': 1681,
 'journey smooth': 2671,
 'he work': 2192,
 'high perform': 2246,
 'he current': 2190,
 'best practic': 577,
 'autom test': 472,
 'he passion': 2191,
 'share knowledg': 4166,
 'he also': 2189,
 'map': 3104,
 'studio': 4459,
 'js': 2674,
 'wide': 5266,
 'survey': 4495,
 'know': 2699,
 'find': 1868,
 'format': 1923,
 'style': 4463,
 'interfac': 2552,
 'truli': 4844,
 'respons': 3969,
 'complex': 926,
 'assum': 446,
 'littl': 2969,
 'walk': 5094,
 'step': 4391,
 'youll': 5422,
 'big': 589,
 'beauti': 531,
 'time': 4736,
 'modern web': 3206,
 'applic it': 390,
 'cover everyth': 1232,
 'everyth need': 1742,
 'need know': 3293,
 'cours assum': 1053,
 'knowledg program': 2724,
 'walk step': 5095,
 'youll learn': 5425,
 'learn creat': 2801,
 'differ way': 1498,
 'user interact': 5015,
 'let get': 2918,
 'pro': 3690,
 'becom': 534,
 'tester': 4596,
 'award': 483,
 'win': 5273,
 'profession': 3708,
 'udemi': 4865,
 'seller': 4129,
 'materi': 3123,
 'last': 2758,
 'updat': 4912,
 'novemb': 3379,
 'over': 3484,
 '000': 1,
 'student': 4442,
 'enrol': 1673,
 'worldwid': 5345,
 'commun': 902,
 'still': 4411,
 'count': 1027,
 'anoth': 334,
 'popular': 3614,
 'us': 4923,
 'showcas': 4195,
 'just': 2682,
 'kept': 2687,
 'intro': 2564,
 'free': 1945,
 'preview': 3674,
 'conveni': 1006,
 'pleas': 3598,
 'feel': 1843,
 'drive': 1570,
 'lose': 3009,
 'opportun': 3455,
 'previous': 3678,
 'known': 2728,
 'cost': 1024,
 'fortun': 1926,
 'market': 3108,
 'leader': 2775,
 'industri': 2504,
 'nowaday': 3386,
 'mani': 3086,
 'came': 730,
 'play': 3594,
 'control': 1003,
 'better': 582,
 'suitabl': 4481,
 'peopl': 3548,
 'background': 497,
 'support': 4490,
 'script': 4085,
 'howev': 2304,
 'difficult': 1500,
 'endtoend': 1659,
 'train': 4816,
 'essenti': 1703,
 'gain': 1999,
 'competit': 911,
 'advantag': 202,
 'today': 4768,
 'commit': 898,
 'uniqu': 4894,
 'deliv': 1395,
 'onlin': 3441,
 'in': 2454,
 'aspect': 437,
 'level': 2924,
 'treat': 4830,
 'singl': 4222,
 'thoroughli': 4714,
 'brush': 645,
 'specif': 4320,
 'entir': 1687,
 'overview': 3488,
 'variabl': 5027,
 'output': 3481,
 'valu': 5023,
 'descript': 1410,
 'environ': 1693,
 'read': 3865,
 'write': 5353,
 'excel': 1756,
 'driven': 1571,
 'keyword': 2692,
 'becom expert': 537,
 'cours udemi': 1203,
 'sinc 999': 4219,
 '999 cours': 77,
 'cours materi': 1150,
 'last updat': 2761,
 'over 999': 3485,
 '999 000': 69,
 '000 student': 2,
 'student enrol': 4449,
 'first time': 1884,
 'like cours': 2947,
 'basic cours': 514,
 'cours video': 1208,
 'pleas feel': 3599,
 'feel free': 1846,
 'it if': 2609,
 'if want': 2418,
 'want becom': 5100,
 'becom master': 539,
 'autom tool': 473,
 'learn experi': 2818,
 'in cours': 2459,
 'cours cover': 1078,
 'cover import': 1235,
 'import aspect': 2440,
 'advanc level': 199,
 'explain everi': 1789,
 'everi singl': 1731,
 'it great': 2606,
 'entir cours': 1688,
 'cover basic': 1228,
 'topic includ': 4803,
 'read write': 3868,
 'data driven': 1335,
 'how': 2286,
 'to': 4758,
 'and': 309,
 'wordpress': 5299,
 'sale': 4044,
 'funnel': 1993,
 'easili': 1604,
 'land': 2737,
 'page': 3499,
 'hello': 2205,
 'welcom': 5202,
 'stun': 4462,
 'whole': 5263,
 'convert': 1009,
 'servic': 4153,
 'client': 829,
 'right': 4002,
 'place': 3586,
 'usual': 5018,
 'thing': 4698,
 'lot': 3012,
 'up': 4910,
 'kind': 2696,
 'minimum': 3185,
 'plu': 3604,
 'wait': 5091,
 'top': 4797,
 'might': 3176,
 'exactli': 1747,
 'expect': 1768,
 'busi': 693,
 'charg': 785,
 'simpli': 4213,
 'flexibl': 1889,
 'fast': 1829,
 'pay': 3539,
 'them': 4649,
 'alway': 289,
 'ad': 175,
 'weekli': 5200,
 'basi': 508,
 'anyth': 346,
 'els': 1632,
 'so': 4267,
 'for': 1907,
 'decid': 1372,
 'day': 1360,
 'money': 3213,
 'back': 491,
 'question': 3823,
 'ask': 433,
 'risk': 4010,
 'involv': 2577,
 'how to': 2299,
 'to use': 4767,
 'to creat': 4761,
 'learn how': 2826,
 'easili creat': 1605,
 'land page': 2738,
 'hello welcom': 2206,
 'creat stun': 1280,
 'right place': 4006,
 'lot time': 3017,
 'web design': 5172,
 'onlin busi': 3442,
 'want abl': 5099,
 'cours you': 1222,
 'updat new': 4916,
 'new featur': 3322,
 'you need': 5403,
 'abl creat': 127,
 'creat great': 1267,
 'wait for': 5092,
 'for enrol': 1910,
 'inform cours': 2511,
 '999 day': 79,
 'get money': 2053,
 'money back': 3214,
 'back question': 494,
 'question ask': 3825,
 'interview': 2560,
 'prepar': 3663,
 'save': 4052,
 'architectur': 408,
 'fastest': 1833,
 'world': 5336,
 'compani': 903,
 'amazon': 294,
 'netflix': 3311,
 'base': 504,
 'achiev': 164,
 'goal': 2094,
 'field': 1852,
 'engin': 1663,
 'may': 3136,
 'salari': 4043,
 'similar': 4206,
 'qualif': 3816,
 'without': 5289,
 'benefit': 571,
 'case': 754,
 'what': 5223,
 'biggest': 593,
 'me': 3140,
 'demand': 1398,
 'higher': 2249,
 'job': 2661,
 'good': 2100,
 'theoret': 4660,
 'but': 701,
 'rang': 3849,
 'secur': 4112,
 'attend': 454,
 'spend': 4325,
 'search': 4091,
 'internet': 2557,
 'alreadi': 258,
 'compil': 913,
 'list': 2966,
 'answer': 335,
 'ye': 5365,
 'view': 5067,
 'watch': 5123,
 'begin': 547,
 'onc': 3421,
 'tri': 4834,
 'word': 5298,
 'mark': 3107,
 'could': 1025,
 'yourself': 5434,
 'then': 4655,
 'pass': 3525,
 'after': 210,
 'face': 1807,
 'technic': 4571,
 'contain': 989,
 'architect': 407,
 'difficulti': 1501,
 'vari': 5026,
 'experienc': 1779,
 'happen': 2171,
 'chang': 778,
 'futur': 1996,
 'from': 1960,
 'keep': 2685,
 'our': 3475,
 'aim': 228,
 'sampl': 4046,
 '3': 32,
 'role': 4017,
 '5': 54,
 'is': 2590,
 'tailor': 4522,
 'templat': 4582,
 'organ': 3464,
 '6': 59,
 'disadvantag': 1512,
 'characterist': 784,
 '8': 65,
 '9': 67,
 'point': 3606,
 'rememb': 3945,
 'prefer': 3659,
 'synchron': 4503,
 'asynchron': 448,
 'orchestr': 3462,
 'issu': 2597,
 'rest': 3973,
 'http': 2311,
 'can': 733,
 'state': 4384,
 'extens': 1800,
 'semant': 4130,
 'buy': 705,
 'commerci': 896,
 'whi': 5249,
 'break': 633,
 'per': 3553,
 'host': 2273,
 'model': 3201,
 'mock': 3198,
 'consum': 986,
 'contract': 1001,
 'separ': 4138,
 'deploy': 1405,
 'releas': 3939,
 'mean': 3142,
 'failur': 1815,
 'monitor': 3220,
 'multipl': 3258,
 'id': 2398,
 'certif': 770,
 'key': 2689,
 'public': 3790,
 'confus': 972,
 'consid': 975,
 'law': 2770,
 'circuit': 803,
 'scale': 4062,
 'queri': 3821,
 'cach': 725,
 'discoveri': 1518,
 'document': 1538,
 'scenario': 4064,
 'major': 3044,
 'principl': 3683,
 'interview question': 2561,
 'cours learn': 1142,
 'learn everyth': 2814,
 'save time': 4055,
 'fastest grow': 1834,
 'big compani': 590,
 'compani like': 904,
 'cours design': 1083,
 'design help': 1422,
 'help achiev': 2209,
 'achiev goal': 165,
 'softwar engin': 4282,
 'softwar design': 4280,
 'design develop': 1418,
 'develop i': 1462,
 'i explain': 2339,
 'import concept': 2441,
 'use case': 4936,
 'cours what': 1215,
 'benefit cours': 572,
 'cours abl': 1037,
 'job interview': 2662,
 'it good': 2605,
 'topic cover': 4802,
 'cover cours': 1229,
 'cours we': 1212,
 'we cover': 5145,
 'wide rang': 5267,
 'rang topic': 3850,
 'topic cours': 4801,
 'how cours': 2289,
 'cours help': 1123,
 'spend time': 4326,
 'ye cours': 5366,
 'best way': 579,
 'watch cours': 5124,
 'cours begin': 1058,
 'begin end': 549,
 'answer question': 336,
 'go cours': 2085,
 'cours 999': 1035,
 '999 time': 102,
 'well prepar': 5214,
 'question cours': 3826,
 'cours contain': 1073,
 'level the': 2927,
 'what happen': 5229,
 'time time': 4749,
 'cours follow': 1109,
 'follow 1': 1898,
 '1 what': 13,
 '999 what': 106,
 'what differ': 5226,
 'continu integr': 1000,
 'differ type': 1497,
 '999 how': 84,
 'whi use': 5255,
 'compani use': 905,
 'use api': 4930,
 '999 in': 86,
 'maintain': 3042,
 'crossplatform': 1298,
 'coder': 873,
 'divers': 1527,
 'excit': 1761,
 'or': 3459,
 'old': 3417,
 'vital': 5079,
 'figur': 1854,
 'guess': 2144,
 'clean': 818,
 'reason': 3900,
 'exist': 1766,
 'indepth': 2496,
 'perfect': 3554,
 'complet': 914,
 'guid': 2146,
 'project': 3746,
 'all': 237,
 've': 5035,
 'along': 253,
 'each': 1584,
 'section': 4099,
 'dedic': 1379,
 'math': 3127,
 'input': 2519,
 'statement': 4385,
 'loop': 3007,
 'string': 4430,
 'array': 423,
 'record': 3908,
 'date': 1357,
 'procedur': 3695,
 'eas': 1593,
 'set': 4157,
 'progress': 3745,
 'around': 419,
 'intent': 2547,
 'encourag': 1649,
 'highlevel': 2251,
 'compat': 908,
 'syntax': 4504,
 'design build': 1414,
 'beginn level': 559,
 'what s': 5236,
 'way get': 5131,
 'start program': 4372,
 'program it': 3730,
 'it s': 2619,
 'way help': 5132,
 'help find': 2214,
 'what wait': 5238,
 'learn take': 2879,
 'take your': 4541,
 'next level': 3343,
 'write code': 5354,
 'applic learn': 391,
 'learn best': 2792,

Validation report

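We can see that the model is stable, since the standard deviation of the test score across the different CV splits is small (std_test_score stays below 0.008 for every parameter combination).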

cv_report= pd.DataFrame(gsearch.cv_results_) # gives accuracy score 

mean_fit_time std_fit_time mean_score_time std_score_time param_ct__tfidf__min_df param_rfc__max_depth param_rfc__n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 27.904208 0.280141 11.990902 0.157760 0.005 6 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.838865 0.843644 0.854711 0.845740 0.006636 6
1 29.845166 0.293449 12.166636 0.168616 0.005 10 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.887660 0.887911 0.895289 0.890287 0.003539 2
2 30.715490 0.126816 12.016137 0.231241 0.005 12 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.893901 0.892168 0.900681 0.895583 0.003673 1
3 26.274902 1.073447 12.405668 0.293781 0.1 6 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 6... 0.873475 0.855562 0.866345 0.865127 0.007364 5
4 27.804441 0.275139 12.626807 0.171780 0.1 10 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... 0.884539 0.870318 0.875993 0.876950 0.005845 4
5 27.847812 0.187276 12.579125 0.211614 0.1 12 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... 0.885957 0.875142 0.878263 0.879788 0.004545 3
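The stability claim can be checked by recomputing the mean and spread from the per-split scores. The sketch below does this by hand for the best-ranked row of the table above (index 2). The variable names are illustrative, and the population standard deviation is used on the assumption that this is how `std_test_score` is computed.

```python
from statistics import mean, pstdev

# split0/1/2_test_score of the best-ranked row (index 2) in the table above
split_scores = [0.893901, 0.892168, 0.900681]

mean_score = mean(split_scores)
std_score = pstdev(split_scores)  # population std (divides by n, not n - 1)

# relative spread: a small value means the folds agree with each other
relative_spread = std_score / mean_score

print(f"mean={mean_score:.6f} std={std_score:.6f} relative spread={relative_spread:.2%}")
```

The recomputed values agree with the `mean_test_score` and `std_test_score` columns of that row to the precision shown in the table.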

Predicting with a random-forest best estimator

RandomForestClassifier is an ensemble bootstrap-aggregation (bagging) algorithm: it creates a number of decision tree classifiers, where each classifier is fit on only part of the data, chosen at random. Rows are sampled uniformly with replacement for each tree, and a random subset of the columns is considered at each split.

Like in a regular decision tree, the reduction in impurity is the criterion for choosing which feature to split on in each tree.

The end result is the majority vote each sample receives from the classifiers.

This "wisdom of crowds" approach improves the stability and accuracy of the decision making.
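The two ingredients described above, bootstrap sampling of rows and a majority vote over the trees, can be sketched with NumPy alone. This is only an illustration under simplifying assumptions: the helper names `bootstrap_indices` and `majority_vote` are made up for this example, and the per-tree votes are hard-coded instead of coming from fitted trees.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_rows, rng):
    """Row indices drawn uniformly with replacement (one bootstrap sample)."""
    return rng.integers(0, n_rows, size=n_rows)

def majority_vote(votes):
    """votes has shape (n_trees, n_samples) with 0/1 entries; returns one label per sample."""
    return (votes.mean(axis=0) > 0.5).astype(int)

# each tree sees a resampled copy of the rows; some rows repeat, some are left out
sample = bootstrap_indices(1000, rng)
unique_fraction = len(np.unique(sample)) / 1000

# hard-coded votes of three hypothetical trees on three samples
votes = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1]])
majority = majority_vote(votes)

print(unique_fraction, majority)
```

Note that scikit-learn's own forest combines the trees by averaging their predicted class probabilities rather than counting hard votes, so the vote above is a simplification of what `predict` does.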


preds= gsearch.best_estimator_.predict(X_test)

final_results= pd.DataFrame(classification_report(y_test.astype('bool'), preds, output_dict=True))

conf_matrix= confusion_matrix(y_test.astype('bool') ,preds)


final_results.rename(columns= {'False': 'non-Development', 'True':'Development'})
non-Development Development accuracy macro avg weighted avg
precision 0.922407 0.889478 0.901277 0.905943 0.902696
recall 0.823322 0.953555 0.901277 0.888438 0.901277
f1-score 0.870052 0.920403 0.901277 0.895227 0.900191
support 1415.000000 2110.000000 0.901277 3525.000000 3525.000000

sns.heatmap(pd.DataFrame(conf_matrix), annot=True, fmt='d')

plt.title("confusion matrix")
Text(0.5, 1.0, 'confusion matrix')
  • TP = 2012
  • TN = 1165
  • FP = 250
  • FN = 98
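The figures in the classification report can be reproduced from these four counts. The sketch below is plain arithmetic; only the variable names are introduced here.

```python
# counts read off the confusion matrix above (Development is the positive class)
TP, TN, FP, FN = 2012, 1165, 250, 98

precision_dev = TP / (TP + FP)       # of everything predicted Development, how much is right
recall_dev = TP / (TP + FN)          # of all true Development courses, how many were found
precision_non_dev = TN / (TN + FN)   # same two metrics with non-Development as the target
recall_non_dev = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(f"Development:     precision={precision_dev:.6f} recall={recall_dev:.6f}")
print(f"non-Development: precision={precision_non_dev:.6f} recall={recall_non_dev:.6f}")
print(f"accuracy={accuracy:.6f}")
```

The 250 false positives are what pull the non-Development recall down to about 0.82, while the small number of false negatives keeps the Development recall above 0.95.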

We get very good performance from our model, both precision-wise and recall-wise, when looking at the Development label (precision 0.89, recall 0.95).

The performance is slightly worse for the 0, or non-Development, label (f1-score 0.87 versus 0.92), mainly because of its lower recall (0.82).

After checking some false positives, we see that courses in the IT-Software category are harder to separate from the Development category, since many words are common to the two. It may therefore be wise to merge these two categories into one.
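One way to run such a check is to count the false positives per original category. The sketch below uses a handful of made-up rows purely for illustration: the rows, the `pred` field and the counts are hypothetical and are not taken from the notebook's data.

```python
from collections import Counter

# hypothetical rows: original category, true label (is_dev) and model prediction
rows = [
    {"category": "IT-Software", "is_dev": 0, "pred": 1},
    {"category": "IT-Software", "is_dev": 0, "pred": 1},
    {"category": "Marketing", "is_dev": 0, "pred": 0},
    {"category": "Design", "is_dev": 0, "pred": 1},
    {"category": "Development", "is_dev": 1, "pred": 1},
    {"category": "IT-Software", "is_dev": 0, "pred": 0},
]

# a false positive is a non-Development course that was predicted as Development
false_positives = [r for r in rows if r["is_dev"] == 0 and r["pred"] == 1]
fp_by_category = Counter(r["category"] for r in false_positives)

print(fp_by_category.most_common())
```

On the real test set, the same tally would show which categories contribute most of the false positives.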

feature importances

The impurity-based feature importances method tends to inflate the weights of continuous (high-cardinality) features in a biased way, but since all of our features are continuous, it seems appropriate enough for a "big picture" estimate:

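best = gsearch.best_estimator_
# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn versions
importances = pd.Series(best.named_steps["rfc"].feature_importances_,
                        index=best.steps[1][1].get_feature_names())
pd.DataFrame(importances.sort_values(ascending=False), columns=['importance']).head(15)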
tfidf__code 0.059199
tfidf__web 0.030740
tfidf__develop 0.027360
tfidf__applic 0.020564
tfidf__data 0.019096
tfidf__app 0.017708
tfidf__program 0.016756
tfidf__build 0.016654
tfidf__languag 0.016126
tfidf__javascript 0.015336
tfidf__java 0.014110
tfidf__program languag 0.013793
tfidf__api 0.012328
tfidf__python 0.011386
tfidf__web develop 0.011220

We see that all programming-related words (like code or java) are very important for classifying "development" courses, which makes sense.
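The importances listed above are impurity-based: a feature gains importance each time a split on it reduces the impurity of a node. As a minimal sketch of that idea, the code below computes the Gini impurity decrease of a single split on a tiny made-up node; the function names and the labels are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children

# toy node: 3 Development (1) and 3 non-Development (0) courses, split on some word feature
parent = [1, 1, 1, 0, 0, 0]
left = [1, 1, 1, 0]   # e.g. courses where the word appears
right = [0, 0]        # e.g. courses where it does not

decrease = impurity_decrease(parent, left, right)
print(decrease)
```

A word such as "code" ends up with a high importance because splits on it produce large decreases like this across many trees.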

PCA and plotting a 3d scatter plot

In order to check the assumption about the false positives, we will try to plot all the courses in a way that reflects their differences, meaning that courses with similar content should also be close together in the scatter plot.

PCA is a way to linearly transform a high-dimensional space into a much smaller hidden representation that captures as much as possible of the variance between samples.

Here we will use a 3-dimensional PCA transformer, so that we can plot the resulting vector for each course in a 3D scatter plot.
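To make the linear transformation concrete, here is a minimal NumPy sketch of PCA through the singular value decomposition. It is not the scikit-learn implementation: `pca_project` is a name made up for this example, and the input is random data rather than the TF-IDF matrix.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto the top principal directions."""
    X_centered = X - X.mean(axis=0)              # PCA works on mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]               # orthonormal directions of largest variance
    projected = X_centered @ components.T        # coordinates of each row in the new space
    explained_ratio = (S ** 2)[:n_components] / (S ** 2).sum()
    return projected, components, explained_ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                    # 50 toy "documents" with 10 features each

projected, components, ratio = pca_project(X, n_components=3)
print(projected.shape, ratio)
```

The `explained_ratio` values show how much of the total variance the three plotted axes retain, which is worth checking before reading too much into a 3D plot.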

X_train_vect= tfidf.fit_transform(X_train['text_stem'])

train = pd.DataFrame(X_train_vect.todense(), columns=tfidf.get_feature_names())#.iloc[0,:].sort_values()
from sklearn.decomposition import PCA

pca = PCA(3)
df_3d= pd.concat([pd.DataFrame(pca.fit_transform(train)), pd.DataFrame(df.loc[X_train.index]['category'].reset_index(drop=True)), pd.DataFrame(df.loc[X_train.index]['title'].reset_index(drop=True))], axis=1)
0 1 2 category title
0 -4.124536 4.492683 0.195532 Development Mobile Application Manual Testing - IOS Applic...
1 -9.513578 3.785385 -2.615611 Development Learn C++ Programming for beginners from basi...
2 -4.230656 3.113482 -2.179557 Development Learning Flask
3 6.002980 -4.374147 -11.042514 Development LEARNING PATH: Go: Real-World Go Solutions for...
4 -12.164592 8.173161 -0.145709 Development Interactive maps with Mapbox!
... ... ... ... ... ...
10568 -4.203248 0.577658 -1.131525 Development Introduction to C Programming for the Raspberr...
10569 -8.885592 3.946822 4.935661 Marketing How to Create a Marketing Video for Your Busin...
10570 -10.786600 6.954830 -1.057528 Development Using JSON In Unreal Engine 4 - C++
10571 8.539465 -13.127536 -30.164514 Development Building Recommender Systems with Machine Lear...
10572 -6.286109 1.255941 -3.256990 Development C and C++ Programming : Step-by-Step Tutorial

10573 rows × 5 columns

df_3d.columns= ['1','2','3' ,'cat','title']
import plotly.express as px
fig = px.scatter_3d(df_3d, x='1', y='2', z='3',color='cat',size_max=10, hover_data=['title'], title = "3d scatter-plot of PCA results")


We can see a clustering pattern for the categories and a plane of separation between business (orange), the more humanistic courses (red, turquoise) and development courses (dark blue), while IT courses (purple) show no separation from development courses.

In fact, a nice example of the validity and usability of our model is that most courses in the upper-left corner of the image above (blue and purple) share the same content, the Spring framework, so there is apparently even a sub-clustering pattern by course content.

Using LSTM Neural Network For comparison

For the sake of comparison and completeness, we will also use an LSTM neural network model.

We will regard this model as a black box and won't elaborate on its mechanism,

but suffice it to say that the LSTM's power lies in its ability to create contextual connections between words in the text, something that is mostly lacking in the TF-IDF approach.

This is done by sequentially embedding the words into vectors in the first layer and feeding them to "smart" memory neurons in the inner layer, which can combine past information with the new information coming in, changing the state of the vector, or "forget" states that turn out to be less useful during training.
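For readers who want a peek inside the black box, the sketch below implements a single LSTM time step in NumPy with the standard gate equations. The weights are random, the sizes are tiny and chosen only for illustration, and the gate ordering inside the stacked matrices is a convention picked for this example rather than anything taken from Keras.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: how much new information to write
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: how much of the old state to keep
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values built from the new input
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: how much of the state to expose
    c_t = f * c_prev + i * g                # combine past information with the new input
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4                # toy sizes for illustration
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
sequence = rng.normal(size=(5, input_dim))  # five embedded "words"
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h)
```

The final `h` is the vector a classifier layer would read after the last word, which is how word order and context can influence the prediction.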

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer
from keras.layers import LSTM
from keras.layers import SimpleRNN

here we are only using the 'text' column after the cleaning procedure

ps = PorterStemmer()

df_m['text_stem'] =  df_m['text'].map(lambda x: ' '.join(ps.stem(y) for y in x.split()))
X = df_m[['text_stem']]  
y = df_m['is_dev']     

# split training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

t = Tokenizer(2500)
#3046 max words --> leaves 2500 most frequent words
y_train = np.where(y_train == True, 1, 0)
# creates the mapping dictionary from words to integers
t.fit_on_texts(X_train['text_stem'])

vocab_size = len(t.word_index) + 1

# integer-encode the training and test documents
encoded_docs = t.texts_to_sequences(X_train['text_stem'])
encoded_docs_test = t.texts_to_sequences(X_test['text_stem'])

# adds zeros at the start so that all sequences have the same length
encoded_docs_padded= pad_sequences(encoded_docs , padding='pre')

len_row= len(encoded_docs_padded[1])

encoded_docs_padded_test= pad_sequences(encoded_docs_test ,maxlen=len_row,  padding='pre')
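The encoding steps above can be illustrated without Keras. The sketch below is a simplified pure-Python stand-in, not the Keras implementation: the helper names are made up, and it assumes that words are indexed by descending frequency starting at 1, that only words with an index below `num_words` are kept, and that unseen words are dropped.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; the most frequent word gets 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and those ranked num_words or later."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros (or keep only the last maxlen tokens) so all rows share one length."""
    return [[0] * max(0, maxlen - len(s)) + s[-maxlen:] for s in sequences]

train_texts = ["learn python code", "learn web design", "python code code"]
test_texts = ["learn java code"]          # "java" never appears in the training texts

word_index = build_word_index(train_texts)
train_seq = texts_to_sequences(train_texts, word_index, num_words=4)
test_seq = texts_to_sequences(test_texts, word_index, num_words=4)

maxlen = max(len(s) for s in train_seq)   # the test set reuses the training length
train_padded = pad_pre(train_seq, maxlen)
test_padded = pad_pre(test_seq, maxlen)

print(word_index)
print(train_padded, test_padded)
```

As in the notebook code, the vocabulary and the padded length come from the training texts only, and the test texts are encoded with them.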

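The Keras tokenizer creates a vocabulary; the dictionary below lists each stemmed word with its number of occurrences in the training text: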

import json
{'mobil': 2353,
 'applic': 9933,
 'manual': 389,
 'test': 6477,
 'io': 2440,
 'bug': 224,
 'track': 709,
 'debug': 394,
 'realtim': 434,
 'process': 4330,
 'agil': 592,
 'methodolog': 306,
 'develop': 17518,
 'hand': 1156,
 'held': 52,
 'devic': 989,
 'function': 4658,
 'usabl': 78,
 'consist': 555,
 'autom': 2514,
 'type': 2606,
 'either': 419,
 'come': 2404,
 'preinstal': 5,
 'instal': 2151,
 'softwar': 4359,
 'distribut': 658,
 'platform': 2348,
 'wit': 44,
 'phenomen': 17,
 'growth': 391,
 'past': 612,
 'year': 3474,
 'a': 5152,
 'studi': 1264,
 'conduct': 209,
 'yanke': 2,
 'group': 986,
 'predict': 539,
 'gener': 2263,
 '4': 3529,
 '2': 5395,
 'billion': 177,
 'revenu': 256,
 '999': 22011,
 '7': 1775,
 'u': 199,
 's': 4457,
 'smartphon': 139,
 'app': 10038,
 'download': 1687,
 'thi': 18552,
 'cours': 57846,
 'go': 5768,
 'cover': 5707,
 'follow': 3868,
 'approach': 1649,
 '1': 4667,
 'hardwar': 252,
 'the': 15948,
 'includ': 5927,
 'intern': 603,
 'processor': 67,
 'screen': 746,
 'size': 401,
 'resolut': 111,
 'space': 457,
 'memori': 514,
 'camera': 356,
 'radio': 77,
 'bluetooth': 15,
 'wifi': 48,
 'etc': 1295,
 'sometim': 242,
 'refer': 733,
 'as': 1802,
 'simpl': 3366,
 'work': 10927,
 'it': 10336,
 'call': 1265,
 'differenti': 165,
 'earlier': 116,
 'method': 2481,
 'even': 3159,
 'basic': 7544,
 'differ': 4439,
 'import': 3319,
 'understand': 8620,
 'nativ': 685,
 'creat': 16205,
 'use': 30943,
 'like': 6918,
 'tablet': 156,
 'b': 334,
 'web': 9349,
 'serversid': 133,
 'access': 2489,
 'websit': 6120,
 'browser': 637,
 'chrome': 142,
 'firefox': 23,
 'connect': 1485,
 'network': 2710,
 'wireless': 62,
 'c': 4272,
 'hybrid': 165,
 'combin': 768,
 'they': 640,
 'run': 2578,
 'offlin': 138,
 'written': 640,
 'technolog': 2828,
 'html5': 930,
 'css': 2310,
 'learn': 40085,
 'program': 11865,
 'beginn': 4583,
 'advanc': 4523,
 'scratch': 2529,
 'best': 4505,
 'exampl': 3774,
 'purpos': 639,
 'languag': 6402,
 'bjarn': 5,
 'stroustrup': 5,
 'at': 1079,
 't': 2007,
 'bell': 19,
 'lab': 539,
 'variou': 1709,
 'game': 8023,
 'object': 2863,
 'orient': 762,
 'teach': 5939,
 'everyth': 2840,
 'start': 9846,
 'oper': 2049,
 'concept': 5356,
 'topic': 3624,
 'everi': 3307,
 'lesson': 1968,
 'explain': 2639,
 'detail': 2111,
 'code': 10566,
 'those': 177,
 'want': 7517,
 'strong': 661,
 'knowledg': 4674,
 'take': 8249,
 'divid': 298,
 'three': 772,
 'part': 3064,
 'first': 4249,
 'second': 874,
 'third': 389,
 'flask': 183,
 'get': 12936,
 'power': 3724,
 'framework': 3927,
 'python': 6248,
 'that': 2758,
 'easi': 3941,
 'with': 3218,
 'grow': 1179,
 'skill': 6782,
 'gap': 159,
 'need': 8582,
 'talent': 190,
 'greater': 195,
 'ever': 1205,
 'befor': 634,
 'ground': 370,
 'build': 13049,
 'minimalist': 16,
 'easytolearn': 15,
 'launch': 538,
 'career': 2139,
 'entrepreneur': 630,
 'microframework': 5,
 'make': 11394,
 'say': 1170,
 'give': 3527,
 'fundament': 2258,
 'well': 5677,
 'handson': 1175,
 'experi': 4830,
 'requir': 2747,
 'success': 2892,
 'turn': 691,
 'comput': 2895,
 'modern': 1161,
 'machin': 3447,
 'next': 2270,
 'move': 1838,
 'beyond': 360,
 'static': 396,
 'databaseback': 3,
 'dynam': 1193,
 'we': 6982,
 'won': 125,
 'stop': 514,
 'there': 2555,
 'll': 3814,
 'also': 8825,
 'implement': 3352,
 'full': 2213,
 'authent': 686,
 'system': 4593,
 'final': 1871,
 'extend': 344,
 'integr': 1982,
 'thirdparti': 85,
 'api': 3290,
 'when': 975,
 'finish': 1021,
 'fulli': 1060,
 'equip': 336,
 'custom': 3262,
 '0': 803,
 'latest': 1073,
 'version': 1717,
 'avail': 1762,
 'provid': 4089,
 'relev': 468,
 'inform': 2846,
 'content': 4193,
 'legaci': 101,
 'user': 4048,
 'about': 1344,
 'author': 1769,
 'lalith': 2,
 'polepeddi': 1,
 'sinc': 1050,
 'discov': 1000,
 'way': 6688,
 'he': 2635,
 'tut': 5,
 'techpro': 1,
 'asid': 36,
 'interest': 1709,
 'appli': 2492,
 'scienc': 1944,
 'address': 457,
 'problem': 2605,
 'parallel': 295,
 'domain': 689,
 'biolog': 43,
 'path': 1312,
 'realworld': 766,
 'solut': 1939,
 'gopher': 2,
 'modular': 129,
 'testabl': 46,
 'one': 6805,
 'effici': 1150,
 'highlyperform': 2,
 'seen': 368,
 'increas': 1366,
 'rate': 658,
 'adopt': 223,
 'mainli': 152,
 'lightweight': 97,
 'display': 642,
 'great': 3288,
 'robust': 335,
 'perform': 2431,
 'varieti': 528,
 'open': 1363,
 'sourc': 2156,
 'reliabl': 251,
 'often': 641,
 'golang': 135,
 'googl': 2488,
 'deriv': 116,
 'addit': 1406,
 'featur': 3921,
 'garbag': 38,
 'collect': 772,
 'safeti': 139,
 'dynamictyp': 2,
 'capabl': 592,
 'builtin': 208,
 'larg': 731,
 'standard': 922,
 'librari': 1729,
 'if': 5273,
 'foundat': 1253,
 'improv': 2169,
 'packt': 208,
 'video': 7566,
 'seri': 1301,
 'individu': 618,
 'product': 3835,
 'put': 1502,
 'togeth': 1405,
 'logic': 948,
 'stepwis': 125,
 'manner': 549,
 'highlight': 304,
 'are': 1837,
 'encod': 60,
 'strategi': 1906,
 'design': 9905,
 'pattern': 1661,
 'deal': 755,
 'storag': 423,
 'data': 13379,
 'mysql': 1246,
 'let': 1956,
 'quick': 940,
 'look': 4764,
 'journey': 1096,
 'tutori': 2050,
 'leav': 377,
 'off': 116,
 'you': 20630,
 'immedi': 521,
 'practic': 6732,
 'offer': 1512,
 'avoid': 609,
 'common': 1111,
 'mistak': 382,
 'new': 6359,
 'initi': 425,
 'upon': 450,
 'i': 19543,
 'o': 180,
 'file': 2925,
 'command': 1081,
 'line': 1129,
 'tool': 5328,
 'error': 701,
 'handl': 1115,
 'help': 8440,
 'structur': 2830,
 'log': 346,
 'context': 319,
 'packag': 869,
 'databas': 4027,
 'nosql': 180,
 'mongodb': 572,
 'across': 609,
 'microservic': 511,
 'further': 241,
 'explor': 1553,
 'interact': 1927,
 'commandlin': 60,
 'via': 543,
 'demonstr': 757,
 'tune': 162,
 'lastli': 117,
 'reactiv': 246,
 'serverless': 372,
 'tip': 1203,
 'trick': 673,
 'by': 2676,
 'end': 4017,
 'abl': 3891,
 'bridg': 113,
 'meet': 711,
 'your': 3350,
 'expert': 1778,
 'esteem': 215,
 'ensur': 763,
 'smooth': 280,
 'aaron': 24,
 'torr': 2,
 'receiv': 902,
 'master': 3775,
 'degre': 298,
 'mexico': 13,
 'institut': 198,
 'mine': 411,
 'high': 1373,
 'largescal': 60,
 'current': 1298,
 'lead': 1202,
 'team': 1311,
 'refin': 88,
 'focus': 993,
 'emphasi': 83,
 'continu': 1281,
 'deliveri': 344,
 'publish': 1113,
 'number': 1240,
 'paper': 287,
 'sever': 1106,
 'patent': 45,
 'area': 960,
 'passion': 552,
 'share': 1575,
 'idea': 1639,
 'other': 1423,
 'huge': 593,
 'fan': 229,
 'backend': 644,
 'map': 1052,
 'mapbox': 21,
 'studio': 1141,
 'gl': 13,
 'js': 1785,
 'wide': 636,
 'survey': 105,
 'know': 6316,
 'find': 2976,
 'format': 964,
 'style': 1286,
 'interfac': 1577,
 'truli': 349,
 'respons': 1615,
 'complex': 1718,
 'assum': 346,
 'littl': 824,
 'geograph': 36,
 'walk': 954,
 'step': 6350,
 'youll': 1504,
 'big': 1536,
 'beauti': 745,
 'time': 7685,
 'pro': 872,
 'qtp': 52,
 'uft': 119,
 'becom': 3834,
 'tester': 306,
 'award': 203,
 'win': 296,
 'hp': 87,
 'profession': 3727,
 'udemi': 2437,
 'seller': 149,
 'materi': 1862,
 'last': 892,
 'updat': 2648,
 'novemb': 79,
 '27th': 10,
 'over': 818,
 '000': 1117,
 'student': 5802,
 'enrol': 2064,
 'worldwid': 228,
 'commun': 1960,
 'still': 910,
 'count': 210,
 'anoth': 829,
 'popular': 1971,
 'us': 1604,
 'showcas': 117,
 'just': 525,
 'kept': 93,
 'intro': 268,
 'free': 3388,
 'preview': 346,
 'conveni': 105,
 'pleas': 999,
 'feel': 1943,
 'drive': 519,
 'lose': 433,
 'opportun': 978,
 'unifi': 58,
 'previous': 131,
 'known': 363,
 'cost': 853,
 'fortun': 184,
 'market': 4132,
 'leader': 326,
 'industri': 1574,
 'nowaday': 119,
 'mani': 4163,
 'lowcost': 62,
 'came': 143,
 'play': 997,
 'control': 2367,
 'better': 2374,
 'suitabl': 315,
 'peopl': 3481,
 'nonprogram': 5,
 'background': 720,
 'support': 2113,
 'vb': 60,
 'script': 1528,
 'howev': 723,
 'difficult': 574,
 'endtoend': 85,
 'train': 3793,
 'essenti': 1420,
 'gain': 1453,
 'competit': 437,
 'advantag': 665,
 'today': 2285,
 'qaevers': 7,
 'commit': 260,
 'uniqu': 923,
 'deliv': 737,
 'onlin': 3286,
 'in': 9226,
 'aspect': 908,
 'level': 3771,
 'treat': 93,
 'freshman': 8,
 'singl': 1197,
 'thoroughli': 130,
 'brush': 139,
 'specif': 1334,
 'entir': 790,
 'overview': 1189,
 'checkpoint': 26,
 'parameter': 23,
 'variabl': 1085,
 'output': 551,
 'valu': 1379,
 'descript': 414,
 'environ': 1724,
 'read': 1664,
 'write': 4464,
 'excel': 2542,
 'driven': 410,
 'keyword': 398,
 'how': 8658,
 'to': 4275,
 'elementor': 148,
 'and': 4596,
 'wordpress': 2536,
 'sale': 1436,
 'funnel': 196,
 'easili': 1491,
 'land': 485,
 'page': 3024,
 'hello': 258,
 'welcom': 1034,
 'stun': 230,
 'whole': 584,
 'convert': 496,
 'servic': 3428,
 'client': 1624,
 'right': 3354,
 'place': 1217,
 'usual': 230,
 'outsoruc': 1,
 'thing': 2782,
 'lot': 2582,
 'up': 771,
 'kind': 658,
 'minimum': 126,
 'plu': 470,
 'wait': 932,
 'top': 1403,
 'might': 688,
 'exactli': 888,
 'expect': 686,
 'busi': 6762,
 'charg': 274,
 'simpli': 736,
 'flexibl': 402,
 'fast': 1267,
 'pay': 667,
 'dime': 15,
 'them': 1258,
 'alway': 1283,
 'ad': 2588,
 'weekli': 74,
 'basi': 308,
 'anyth': 669,
 'els': 526,
 'so': 2860,
 'for': 3456,
 'ps': 43,
 'decid': 400,
 'day': 2773,
 'money': 2872,
 'back': 1982,
 'question': 4230,
 'ask': 1517,
 'risk': 792,
 'involv': 600,
 'whatsoev': 31,
 'interview': 1732,
 'prepar': 1564,
 'save': 1226,
 'architectur': 1270,
 'fastest': 202,
 'world': 4317,
 'compani': 2525,
 'amazon': 936,
 'netflix': 111,
 'base': 2605,
 'achiev': 1056,
 'goal': 1341,
 'field': 1113,
 'engin': 2948,
 'may': 1412,
 'salari': 294,
 'similar': 445,
 'qualif': 78,
 'without': 2274,
 'benefit': 1202,
 'case': 1183,
 'what': 5681,
 'biggest': 260,
 'me': 759,
 'demand': 710,
 'higher': 347,
 'job': 2648,
 'good': 2709,
 'theoret': 285,
 'but': 1424,
 'rang': 666,
 'secur': 2318,
 'pact': 3,
 'bulkhead': 3,
 'attend': 150,
 'spend': 748,
 'search': 1485,
 'internet': 891,
 'alreadi': 1429,
 'compil': 367,
 'list': 1923,
 'answer': 1753,
 'ye': 612,
 'view': 1350,
 'watch': 1380,
 'begin': 1714,
 'onc': 895,
 'tri': 1509,
 'word': 1082,
 'mark': 270,
 'could': 1017,
 'yourself': 731,
 'then': 1631,
 'pass': 733,
 'after': 1806,
 'face': 702,
 'technic': 1096,
 'contain': 1434,
 'fresher': 58,
 'architect': 540,
 'difficulti': 168,
 'vari': 111,
 'experienc': 640,
 'happen': 510,
 'chang': 2406,
 'futur': 1343,
 'from': 1533,
 'keep': 1365,
 'our': 659,
 'aim': 533,
 'sampl': 603,
 '3': 4319,
 'role': 669,
 'soa': 11,
 '5': 3754,
 'is': 1561,
 'tailor': 77,
 'templat': 1234,
 'organ': 1360,
 '6': 1860,
 'disadvantag': 63,
 'decompos': 4,
 'monolith': 34,
 'characterist': 99,
 '8': 1555,
 'bound': 46,
 '9': 1051,
 'point': 1355,
 'rememb': 476,
 'prefer': 252,
 'synchron': 71,
 'asynchron': 177,
 'orchestr': 110,
 'choreographi': 2,
 'issu': 720,
 'rest': 1651,
 'http': 367,
 'can': 797,
 'state': 796,
 'extens': 697,
 'dri': 47,
 'semant': 90,
 'buy': 897,
 'commerci': 225,
 'shelf': 14,
 'whi': 1981,
 'break': 560,
 'ubiquit': 12,
 'per': 390,
 'host': 999,
 'model': 3927,
 'mike': 60,
 'cohn': 1,
 'pyramid': 21,
 'mock': 124,
 'stub': 12,
 'erad': 3,
 'nondetermin': 1,
 'consum': 349,
 'contract': 325,
 'cdc': 4,
 'separ': 346,
 'deploy': 1791,
 'releas': 599,
 'canari': 6,
 'mean': 1649,
 'repair': 69,
 'mttr': 1,
 'failur': 159,
 'mtbf': 1,
 'crossfunct': 13,
 'monitor': 448,
 'multipl': 1439,
 'correl': 68,
 'id': 198,
 'certif': 1832,
 'key': 1483,
 'public': 657,
 'confus': 300,
 'deputi': 3,
 'consid': 466,
 'conway': 1,
 'law': 302,
 'circuit': 162,
 'breaker': 12,
 'idempot': 1,
 'scale': 552,
 'queri': 1218,
 'segreg': 18,
 'cqr': 7,
 'cach': 183,
 'cap': 42,
 'theorem': 24,
 'discoveri': 80,
 'document': 1179,
 'scenario': 443,
 'major': 694,
 'principl': 992,
 'pascal': 35,
 'maintain': 660,
 'crossplatform': 197,
 'coder': 139,
 'divers': 97,
 'excit': 721,
 'or': 1067,
 'old': 338,
 'vital': 117,
 'figur': 329,
 'bewild': 5,
 'guess': 139,
 'clean': 444,
 'feet': 32,
 'reason': 861,
 'exist': 781,
 'indepth': 530,
 'perfect': 1081,
 'complet': 7162,
 'guid': 3058,
 'project': 8473,
 'all': 2649,
 '500mb': 1,
 've': 779,
 'along': 1974,
 'each': 747,
 'section': 4350,
 'dedic': 315,
 'math': 521,
 'input': 536,
 'statement': 989,
 'loop': 880,
 'string': 493,
 'array': 766,
 'record': 827,
 'date': 461,
 'procedur': 377,
 'eas': 407,
 'set': 3742,
 'progress': 691,
 'oldest': 14,
 'around': 1298,
 'intent': 137,
 'encourag': 293,
 'highlevel': 90,
 'imper': 34,
 'precursor': 8,
 'compat': 161,
 'syntax': 562,
 'microsoft': 1847,
 'own': 652,
 'pace': 566,
 'of': 1091,
 'intricaci': 16,
 'instructor': 2330,
 'ms': 167,
 'captur': 241,
 'actual': 1165,
 'desktop': 662,
 'verbal': 41,
 'do': 2732,
 'reduc': 470,
 'instruct': 747,
 'tour': 83,
 'brand': 821,
 'show': 3722,
 'add': 2318,
 'task': 1334,
 'resourc': 1629,
 'crop': 28,
 'gantt': 30,
 'chart': 834,
 'behav': 62,
 'person': 2241,
 'timelin': 90,
 'macro': 178,
 'within': 1717,
 'repetit': 129,
 'manag': 5750,
 'allow': 2031,
 'alongsid': 122,
 'titl': 322,
 'matter': 647,
 'weapon': 112,
 'sfx': 10,
 'less': 836,
 'hour': 2933,
 'daw': 15,
 'sound': 790,
 'effect': 2760,
 'layer': 434,
 'never': 1216,
 'heard': 218,
 'georgek': 2,
 'music': 603,
 'compos': 222,
 'sonic': 6,
 'specialist': 165,
 'portfolio': 550,
 'my': 1103,
 'limit': 566,
 'elder': 24,
 'scroll': 153,
 'skywind': 2,
 'darkfal': 2,
 'rise': 139,
 'agon': 3,
 'bulletrag': 2,
 'coca': 5,
 'cola': 9,
 'mystor': 2,
 'more': 3339,
 'throughout': 808,
 'obstacl': 108,
 'importantli': 233,
 'overcam': 4,
 'moreov': 107,
 'lectur': 3520,
 'mentor': 165,
 'aspir': 198,
 'craft': 246,
 'explos': 43,
 'digit': 1035,
 'audio': 536,
 'workstat': 29,
 'automat': 419,
 'rifl': 3,
 'handgun': 1,
 'ultim': 642,
 'destruct': 21,
 'furthermor': 95,
 'period': 249,
 'guidanc': 250,
 'forum': 255,
 'nich': 275,
 'stuck': 301,
 'struggl': 368,
 'goto': 46,
 'special': 853,
 'doe': 200,
 'factori': 135,
 'class': 2803,
 'relationship': 793,
 'inherit': 290,
 'much': 3179,
 'see': 3627,
 'abil': 795,
 'properli': 322,
 'techniqu': 3402,
 'creation': 845,
 'now': 2278,
 'storylin': 49,
 'elearn': 113,
 'wonder': 570,
 'prebuilt': 22,
 'inde': 63,
 'mayb': 382,
 'budget': 337,
 'purchas': 675,
 'vendor': 132,
 'perhap': 156,
 'quit': 394,
 'said': 232,
 'agre': 48,
 'possibl': 1507,
 'ive': 1136,
 'broken': 175,
 'board': 376,
 'review': 1314,
 'layout': 598,
 'introduc': 1184,
 'trigger': 224,
 'condit': 600,
 'fanci': 67,
 'intermedi': 681,
 'while': 537,
 'bit': 460,
 'quicker': 73,
 'previou': 509,
 'articul': 68,
 'no': 1717,
 'worri': 506,
 'choos': 715,
 'dive': 753,
 'q': 408,
 'built': 967,
 'acceler': 165,
 'minichalleng': 4,
 'replic': 140,
 'real': 3643,
 'encount': 150,
 'realiz': 255,
 'warn': 57,
 'amount': 530,
 'ill': 990,
 'think': 1632,
 'forward': 781,
 'join': 1720,
 'hd': 262,
 'webmast': 22,
 'than': 79,
 'week': 591,
 'clearli': 286,
 'precis': 168,
 'thank': 1233,
 'pino': 1,
 'amato': 1,
 'veri': 523,
 'useful': 16,
 'especi': 438,
 'maria': 5,
 'kastani': 1,
 'm': 499,
 'thumb': 20,
 'jonathan': 36,
 'nichol': 1,
 'essential': 1,
 'be': 1005,
 'powerus': 9,
 'confid': 1888,
 'minut': 953,
 'away': 585,
 'depth': 430,
 'optim': 1024,
 'visual': 2071,
 'pdf': 410,
 'mp3': 43,
 'anytim': 133,
 'everywher': 156,
 'bonu': 691,
 'premium': 148,
 'theme': 989,
 'wp': 43,
 'social': 1384,
 'press': 127,
 'complimentari': 14,
 'highli': 916,
 'customiz': 49,
 'ideal': 366,
 'technich': 1,
 'prior': 564,
 'term': 674,
 'everyon': 728,
 'probabl': 515,
 'almost': 627,
 'true': 492,
 'blog': 1010,
 'absolut': 974,
 'dozen': 215,
 'plugin': 1044,
 'sort': 591,
 'amaz': 1485,
 'imagin': 454,
 'membership': 109,
 'site': 1782,
 'regist': 271,
 'r': 1595,
 'post': 999,
 'pictur': 359,
 'comment': 338,
 'edit': 1027,
 'tag': 427,
 'widget': 233,
 'transfer': 243,
 'ustom': 1,
 'appear': 233,

using keras framework to build the layers in the LSTM model :

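model = Sequential()

model.add(Embedding(vocab_size,128, input_length=len_row))
# per the model summary below: Dropout, LSTM with 64 units, Dropout (the dropout rates are not shown there)
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))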

training the model (for 10 epochs) :

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 32

history = model.fit(encoded_docs_padded, y_train, batch_size=batch_size,
                    epochs=10, validation_split=0.1, verbose=1)

Model: "sequential_1"
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 2929, 128)         4389504   
 dropout_2 (Dropout)         (None, 2929, 128)         0         
 lstm_1 (LSTM)               (None, 64)                49408     
 dropout_3 (Dropout)         (None, 64)                0         
 dense_1 (Dense)             (None, 1)                 65        
Total params: 4,438,977
Trainable params: 4,438,977
Non-trainable params: 0
Epoch 1/10
120/120 [==============================] - 121s 994ms/step - loss: 0.4385 - accuracy: 0.8010 - val_loss: 0.4034 - val_accuracy: 0.8374
Epoch 2/10
120/120 [==============================] - 118s 988ms/step - loss: 0.3185 - accuracy: 0.8935 - val_loss: 0.2724 - val_accuracy: 0.9140
Epoch 3/10
120/120 [==============================] - 118s 984ms/step - loss: 0.2503 - accuracy: 0.9230 - val_loss: 0.3063 - val_accuracy: 0.8922
Epoch 4/10
120/120 [==============================] - 119s 989ms/step - loss: 0.3097 - accuracy: 0.9010 - val_loss: 0.2989 - val_accuracy: 0.9074
Epoch 5/10
120/120 [==============================] - 118s 987ms/step - loss: 0.2752 - accuracy: 0.9060 - val_loss: 0.3076 - val_accuracy: 0.8932
Epoch 6/10
120/120 [==============================] - 119s 990ms/step - loss: 0.2360 - accuracy: 0.9253 - val_loss: 0.2760 - val_accuracy: 0.9130
Epoch 7/10
120/120 [==============================] - 119s 989ms/step - loss: 0.2258 - accuracy: 0.9255 - val_loss: 0.3169 - val_accuracy: 0.9140
Epoch 8/10
120/120 [==============================] - 119s 990ms/step - loss: 0.2051 - accuracy: 0.9358 - val_loss: 0.2580 - val_accuracy: 0.9130
Epoch 9/10
120/120 [==============================] - 119s 996ms/step - loss: 0.1910 - accuracy: 0.9398 - val_loss: 0.2910 - val_accuracy: 0.9187
Epoch 10/10
120/120 [==============================] - 122s 1s/step - loss: 0.2089 - accuracy: 0.9301 - val_loss: 0.2725 - val_accuracy: 0.9074
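The parameter counts in the model summary above can be reproduced by hand, which is a handy sanity check on the architecture. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers; the variable names are illustrative, and the layer sizes are read off the summary.

```python
embedding_dim = 128          # width of the embedding output in the summary
lstm_units = 64              # width of the LSTM output in the summary
embedding_params = 4389504   # "Param #" of the embedding layer in the summary

# the embedding stores one 128-dimensional vector per vocabulary entry
vocab_size = embedding_params // embedding_dim

# an LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# the final dense layer maps 64 values to 1 output, plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the size of the vocabulary is what dominates the size of this model.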

predictions = model.predict(encoded_docs_padded_test)
final_table = pd.DataFrame({'preds':np.where(predictions>=0.5, 1, 0).reshape(-1),'true':np.where(y_test.to_numpy()==True, 1, 0)})


from sklearn.metrics import classification_report

pd.DataFrame(classification_report(final_table['true'], final_table['preds'], output_dict=True)).rename(columns= {'0': 'non-Development', '1':'Development'})
non-Development Development accuracy macro avg weighted avg
precision 0.890691 0.927860 0.912908 0.909275 0.912939
recall 0.892580 0.926540 0.912908 0.909560 0.912908
f1-score 0.891634 0.927199 0.912908 0.909417 0.912923
support 1415.000000 2110.000000 0.912908 3525.000000 3525.000000

We get slightly better results with the neural network approach (accuracy 0.913 versus 0.901 for the random forest).

The main improvement is in the '0', or non-Development, label: its recall rises from 0.82 to 0.89, so the contextual approach does add to the model's ability to classify it correctly.

However, since a neural network model is hard to interpret, there are certainly benefits to using both approaches.
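To make the comparison concrete, the per-label numbers from the two classification reports above can be put side by side. The sketch below copies those values into plain dictionaries (the structure and names are chosen only for this example) and prints the change from the random forest to the LSTM.

```python
# values copied from the two classification reports above
rf = {
    "non-Development": {"precision": 0.922407, "recall": 0.823322, "f1-score": 0.870052},
    "Development": {"precision": 0.889478, "recall": 0.953555, "f1-score": 0.920403},
}
lstm = {
    "non-Development": {"precision": 0.890691, "recall": 0.892580, "f1-score": 0.891634},
    "Development": {"precision": 0.927860, "recall": 0.926540, "f1-score": 0.927199},
}

# positive delta means the LSTM scores higher than the random forest
delta = {
    label: {metric: lstm[label][metric] - rf[label][metric] for metric in rf[label]}
    for label in rf
}

for label, metrics in delta.items():
    for metric, change in metrics.items():
        print(f"{label:16s} {metric:9s} {change:+.4f}")
```

The gain is not uniform: the LSTM improves non-Development recall and both f1-scores, but gives up some non-Development precision and some Development recall compared with the random forest.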
