NLP for classification

Roy Shpringer's DS place

Udemy assignment NLP - By Roy Shpringer

from pathlib import Path
import os 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\roysh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
import warnings
warnings.filterwarnings('ignore')

loading table

path_root = Path().resolve()
path = path_root / Path("udemy_development_task.csv")
df = pd.read_csv(path, index_col=[0])
df
title description category longDescription
0 Python for Beginners Learn Python programming from scratch with han... Development **Why Python ?**\n\n * Python is one of the w...
1 Design Patterns in Python Learn the Design Patterns in a practical way u... Development Learning Design Pattern is a voracious learnin...
2 Unity Mobile C# Developer Course Create and deploy games for Android & iOS usin... Development Build 3 simple mobile games using the free Uni...
3 Django | Build a Smart Chatbot Using AI Learn Django By Building Chatbot Using AI Development **This courses will teach you How to Build a C...
4 Flutter Augmented Reality Course - Build 10+ A... Learn Google's Flutter ARCore & Become AR Deve... Development In this course you will learn how to develope ...
... ... ... ... ...
5656 What the FICO 2.0: The Essential Guide to Cred... Your Complete Guide to Fixing Bad Credit, Buil... Finance & Accounting Trying to understand credit can be somewhat co...
5657 Manual Bookkeeping Level 2 - update manual ledgers, prepare a pro... Finance & Accounting Manual bookkeeping covers the material equival...
5658 CorelDRAW for Beginners: Graphic Design in Cor... Learn how to design in Corel DRAW with these e... Design **Start creating professional graphic design i...
5659 SEO WordPress Masterclass: The Best Google Ran... Learn Website Search Engine Optimization With ... Marketing **Learn the most effective SEO Wordpress strat...
5660 Learn Thai for Beginners: The Ultimate 105-Les... You learn Thai minutes into your first lesson.... Teaching & Academics Are you ready to start speaking, writing and u...

14151 rows × 4 columns

df.index.nunique()
13159
  • we have some duplicated indices, so we need to re-index the table:
df= df.reset_index(drop=True)
df.index.nunique()
14151
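
As a minimal sketch of the re-indexing step above, the toy frame below (hypothetical data, not the course table) repeats one index label, counts the repeats with `Index.duplicated`, and then rebuilds a unique index with `reset_index(drop=True)`. The names `toy` and `toy_reset` are illustrative only.

```python
import pandas as pd

# Toy frame with a repeated index label (hypothetical stand-in for the course table).
toy = pd.DataFrame({"title": ["a", "b", "c", "d"]}, index=[0, 1, 1, 2])

n_unique_before = toy.index.nunique()              # distinct labels, fewer than rows
n_duplicated = int(toy.index.duplicated().sum())   # rows whose label already appeared

# drop=True discards the old labels instead of keeping them as a column.
toy_reset = toy.reset_index(drop=True)
n_unique_after = toy_reset.index.nunique()

print(n_unique_before, n_duplicated, n_unique_after)
```

With `drop=True` the old, partly repeated labels are thrown away, which matches the call used above; without it they would be kept as an extra column.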

check nulls

df.isna().sum()
title               0
description         5
category           48
longDescription     0
dtype: int64
(df['longDescription'] == '\n\n').sum()
4
  • We have a very small number of nulls, and most of them are in the target column. We drop the rows with nulls, since relabeling from the textual features is not a straightforward task and the gain would be very small.
df= df.dropna()
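
Note that `dropna()` only removes true missing values, so the 4 rows whose `longDescription` is just `'\n\n'` stay in the table. If one wanted to drop those as well, one possible approach (an assumption, not something this notebook does) is to turn whitespace-only strings into missing values first. The toy frame and the names `blank` and `cleaned` below are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: one clean, one missing description,
# one missing category, one whitespace-only long description.
toy = pd.DataFrame({
    "description": ["Learn Python", None, "Learn guitar", "Learn Excel"],
    "category": ["Development", "Development", None, "Office Productivity"],
    "longDescription": ["Full text", "Full text", "Full text", "\n\n"],
})

# True where the long description contains nothing but whitespace.
blank = toy["longDescription"].str.strip().eq("")

# mask() replaces the flagged entries with a missing value, so dropna() catches them.
cleaned = toy.copy()
cleaned["longDescription"] = toy["longDescription"].mask(blank)

rows_plain = len(toy.dropna())        # blank row survives
rows_strict = len(cleaned.dropna())   # blank row is dropped too
print(rows_plain, rows_strict)
```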

EDA

how many categories?

print("number of categories: " ,df['category'].nunique())
print("\ncategories:\n\n" )
pd.DataFrame(df['category'].unique(), columns=['category name'])
number of categories:  13

categories:


category name
0 Development
1 Teaching & Academics
2 Business
3 IT & Software
4 Personal Development
5 Finance & Accounting
6 Music
7 Design
8 Marketing
9 Photography & Video
10 Lifestyle
11 Office Productivity
12 Health & Fitness

category distribution

We can see below that "Development" is by far the most popular category: about 60% of the courses belong to it. "Business" is in second place and far behind, with a share of only about 7.7%. All the non-development categories combined therefore account for roughly 40% of the courses. Also, some of the categories with a more humanistic nature (like "Music", "Photography & Video", "Health & Fitness") are really rare, each with less than 1% of the courses.

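# Share of each category. The rename assumes pandas < 2.0, where reset_index()
# names the two columns 'index' and 'category'.
cats = pd.DataFrame(df['category'].value_counts(normalize=True).reset_index()).rename(columns= {'index':'category','category':'%'})

# Turn the fraction into a percentage string, keeping only its first 4 characters (e.g. '60.1', '7.68')
cats['%'] = (cats['%'].round(7)*100).astype('str').str[:4]+'%'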

cats
category %
0 Development 60.1%
1 Business 7.68%
2 IT & Software 5.72%
3 Personal Development 5.65%
4 Teaching & Academics 4.87%
5 Design 4.66%
6 Marketing 3.99%
7 Finance & Accounting 2.56%
8 Office Productivity 2.18%
9 Lifestyle 0.97%
10 Photography & Video 0.56%
11 Music 0.49%
12 Health & Fitness 0.42%
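
The table above mixes one and two decimals because the `.str[:4]` slice keeps the first four characters of each number. As a hedged alternative sketch (not the notebook's own code), the shares can be formatted with a fixed number of decimals, and the frame can be built with explicit column names so it does not depend on how `reset_index()` names columns in a given pandas version. The `labels` series is toy data and `table` is a hypothetical name.

```python
import pandas as pd

# Toy category labels (hypothetical), standing in for df['category'].
labels = pd.Series(["Development"] * 6 + ["Business"] * 3 + ["Music"], name="category")

# Fractions per category, sorted from most to least frequent.
shares = labels.value_counts(normalize=True)

# Explicit column names avoid relying on reset_index() naming conventions.
table = pd.DataFrame({
    "category": shares.index.tolist(),
    "%": [f"{value * 100:.2f}%" for value in shares.tolist()],
})
print(table)
```

Formatting with `:.2f` rounds each share to two decimals instead of truncating it.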
fig = plt.figure(figsize=(10,10))

# First pie: category labels with their percentage shares, placed outside the wedges
plt.pie(df['category'].value_counts().values, labels=df['category'].value_counts().index, autopct='%1.1f%%', pctdistance=1.05, labeldistance=1.15);

# Second pie on the same axes: absolute course counts, printed inside the wedges
plt.pie(df['category'].value_counts().values, autopct=lambda x: '{:.0f}'.format(x*df['category'].value_counts().sum()/100));

plt.title("course categories distribution")
Text(0.5, 1.0, 'course categories distribution')

longest text

The longest description is 164 characters long, and the longest longDescription is 32,103 characters long.

df['lengh_long_desc'] = df['longDescription'].map(lambda x: len(str(x))) 

df['lengh_short_desc'] = df['description'].map(lambda x: len(str(x))) 
print("length=",len(df.loc[df['lengh_short_desc'].idxmax()]['description']),'\n\n',  df.loc[df['lengh_short_desc'].idxmax()]['description'])
length= 164 

 Learn how to develop software in Behaviour Driven Development (BDD) using Specflow -  part of the Cucumber software family of tools for software testing automation.
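
The lookup of the longest text can also be sketched with the vectorized `Series.str.len` accessor instead of `map` with a lambda. This is only an illustrative variant on toy data; `toy`, `lengths`, `longest_label` and `longest_text` are hypothetical names. One difference worth keeping in mind: `str.len` returns a missing value for a missing entry, whereas `len(str(x))` counts the characters of the string `'nan'`; after `dropna()` the two agree.

```python
import pandas as pd

# Toy descriptions (hypothetical), standing in for df['description'].
toy = pd.DataFrame({"description": ["short", "a bit longer", "mid"]})

# Character count per row, computed without a Python-level lambda.
lengths = toy["description"].str.len()

# idxmax() returns the index label of the longest entry; .loc fetches the text.
longest_label = lengths.idxmax()
longest_text = toy.loc[longest_label, "description"]

print("length=", len(longest_text), "\n\n", longest_text)
```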
print("length=",len(df.loc[df['lengh_long_desc'].idxmax()]['longDescription']),'\n\n',  df.loc[df['lengh_long_desc'].idxmax()]['longDescription'])
length= 32103 

 **What is this course about:**

In this course You will learn Hands on Devops Technology Concepts.

We will Cover:

  * **Docker**
  *  **Jenkins**
  *  **GIT**
  *  **Maven**

  

**What will you learn from this lecture:**

In particularly, you will learn:

  * Containerize a web-based application with a micro-service approach and automate it using Dockerfile.
  * Design multi-container applications and automate the workflow using Compose.
  * Scale Docker workflow with Docker Swarm, orchestrate and deploy a large-scale application across multiple hosts in the cloud.
  * Best practices of working with Docker software in the field.
  * In-depth knowledge about Docker software and confidence to help your company or your own project to apply the right Docker deployment workflow and continuously deliver better software.
  * Invaluable DevOps skills such as setting up continuous integration pipelines.

**************************************************

**FAQ 1:**

**DevOps Engineering Jobs and Career Opputunities:-**  
Engineering is a trending course from past few years ove the world. Every year
there are many engineering graduates coming out from each part of country . Be
it Chennai Or Kashmir, from north to south. Process of manufacturing engineers
is continuing at a fast increasing rate. But jobs in engineering are very
less. There is a strong need of quality engineers.For an IT job, there is
fight from all section of Engineering. Be it computer engineer, civil engineer
or electronic engineer. If you go for online job search, latest job trend is
DevOps. DevOps is an abbreviation for its two words. Dev implies to
development and Ops stand for operation. DevOps offers various types of job
opportunities for you, like engineering project manager, development
engineering manager, automation engineers and many more various types of best
jobs. Let's have a closer look at how DevOps is a better career choice for
you:

  
**Packaging:-**  
DevOps is awesome if you love to explore and play with variety of Technology
and processes. In my opinion the first thing to consider is the Packaging of
IT that the tech teams used to provide the organisations services. The
maleable the packaging the easier it is to keep everything standardized and
reusable. If you are are comfortable working with configuration management
systems and developing some imaging systems such as docker you will like
DevOps. Closer look to the recent trends tells us the amount of new
technologies that are being released into the market is growing exponentially.
In DevOps no technology is beyond limits and you find yourself constantly
working with integrated and automating different Technologies. In DevOps your
goal is to create machines as machine manageable data objects that are
completely completely hands off on the production. The goal is to to allow
programs written by different teams to efficiently automate as much as
possible.

  
**Scaling:-**  
You will definitely like DevOps if reusability is your passion. In my opinion
the biggest factor in the successful tech organisations of the future will be
their ability to scale rapidly while being able to deflate when not needed to
minimise costs in downtime.  
If the Application is reliable ,zippy and meet their needs, customers don't
care about the tech behind it. They simply want speed.  
Scalability is a hard thing to achieve and most would rather not have to worry
about it, which is self explanatory about the growth scalability as a service
offering.  
Now, Ask yourself. Do you want to jump from mobile to AI? DevOps will allow
you. Do you want to play with that new SaaS service that is in trend these
days? DevOps will let you do that.  
DevOps is all about being the glue that holds everything and everyone
together, and if you ask me, that is what makes it so exciting. The
possibilities are beyond limits and the technologies are always growing and
evolving at an unexplanatory and unimaginable speed. And if you don’t focus on
DevOps, you will still somehow have to manage infrastructure as a developer.

**Q. What is the need for DevOps?**

As per me, this answer should start by explaining the general market trend.
Instead of releasing big sets of features, companies see if small features can
be transported to their customers via a series of release trains. This is very
much advantageous like quick feedback from customers, better software quality,
etc. which in turn takes the company to high customer satisfaction. To achieve
this, companies are required to:

Increase frequency of deployment  
Lower the New releases failure rate  
Shorten their lead time between fixes  
DevOps lets you achieve seamless software delivery and fulfills all above
requirements. You can give examples of companies like Amazon, Etsy, and Google
who have welcomed DevOps to achieve levels of performance that were
unimaginable even five years ago.  
Q. Explain your understanding and expertise on both the software development
side and the technical operations side of an organization you’ve worked for in
the past.

DevOps engineers always work in a 24*7 critical business online environment.
In my previous job, I was very much adaptable to on-call duties and was able
to take up real-time, live-system responsibilities. I was successful in
automated processes to support continuous software deployments. I have pretty
good experiences with public as well as private clouds, DevOps tools like CHEF
or PUPPET, scripting and automation with languages like PYTHON and PHP, and a
background in AGILE

  
**Q. What is Git?**

I will suggest that you attempt this question by first explaining about the
architecture of Git.

Git is a form of Distributed Version Control system (DVCS). It lets you track
changes to a file and allows you to revert to any specific change.  
Its distributed architecture makes it more advantageous over other Version
Control Systems (VCS) like SVN. Another major advantage of Git is that it does
not rely on a central server to store all the versions of a project’s files.
Instead of that, every developer gets “clones” the copy of a repository.
“Local repository” has the full history of the project on its hard drive so
that when there is a problem like a server outage, you need your teammate’s
local Git repository for recovery.  
There is a central cloud repository as well where developers can commit
changes and share it with other teammates where all collaborators are
committing changes “Remote repository"

  
**Q. In Git how do you revert a commit that has already been pushed and made
public?**

There are two possible answers to the above question so make sure that you
include both because any of the below options can be used depending on the
situation's demand:

Remove the bad file in a new commit and push the file to the remote
repository. This is the most common and natural way to fix a bug or an error.
Once you have included necessary changes to the file, commit it to the remote
repository. For that purpose I will use the command  
git commit -m “commit message"  
Now, Create a new commit that will undo all the changes that were made in the
bad Commit. To do so I will be using the command  
git revert <name of bad commit>

  
**Q. How is DevOps different from Agile / SDLC?**

I would suggest you go through the below explanation:

Agile is a set of values and principles about how to develop a software. For
an instance: if you have some idea about something and you want to turn that
idea into a working software the Agile values and principles can be used as a
way to do that. But, that software might only be working on a developer’s
laptop or within a test environment. You need a way to easily, quickly and
repeatably move that software into production infrastructure, in a simple and
safe way. To do that DevOps tools and techniques are required.

In a nutshell, Agile software development methodology keeps its focus on the
development of software but, on the other hand, DevOps is responsible for
development as well as the deployment of the software in the safest and
reliable possible way.

Now remember, keep this thing in mind, you have included DevOps tools in the
previous answer so be prepared to answer some questions related to that. They
might be thrown at you.

  
**Q. Which are the top DevOps tools? Which tools have you worked on?**

Few of The most famous DevOps tools are mentioned below:

Git: Version Control System tool  
Jenkins: Continuous Integration tool  
Selenium: Continuous Testing tool  
Puppet, Chef, Ansible: Configuration Management and Deployment tools  
Nagios: Continuous Monitoring tool  
Docker: Containerization tool  
You can also include any other tool if you want, but make sure you use the
above tools in your answer.  
The second part of the answer could have two possibilities:

If you have enough experience with all the above-mentioned tools then you may
mention that I have worked on all these tools for developing good quality
software and deploying that software easily, frequently, and reliably.  
If you have experience with only with few of the above tools then name those
tools and say that I have specialization in these tools and have an overview
of the rest of the tools.

  
**Q. How do all these tools work together?**

The code is developed by the developers and its source code is managed by
Version Control System tools like Git etc.  
Developers transmit this code to the Git repository and any transformations
made in the code is committed to this Repository.  
Jenkins extracts this code from the repository using the Git plugin and
creates it using tools like Ant or Maven.  
Configuration management tools, puppet, deploy & provisions testing
environment and after that Jenkins releases the code in the test environment
on which testing is done using tools like selenium.  
After the code gets tested, Jenkins sends it for deployment on the production
server (even the production server is provisioned & maintained by tools like
the puppet).  
After its deployment, It is continuously monitored by tools like Nagios.  
Docker containers provide the testing environment to test the build features.

**Q. What is Version control?**  
I guess this is the easiest question you could face in the interview. My take
is to first define Version control. It is a system that keeps records of
changes to a file or set of files over a period of time so that they can be
recalled after specific versions later. Version control systems consist of a
centrally shared repository where teammates can commit changes to a file or
set of file. Then you might mention the uses of version control.

Version control allows you to:

Restore back files to a previous state.  
Restore back the entire back to a previous state.  
Compare changes over a period of time.  
The issue was introduced by whom and when.

  
**Q. What are the benefits of using version control?**

The following advantages of version control are suggested to be used:

Version Control System (VCS), allows all the team members to work freely over
any file at any point of time. VCS later allows you to merge all the changes
into a common version.  
All the past versions and variants are nicely and systematically encapsulated
inside the CVS. Whenever you need it, you may request any version of software
at any time and you can have a snapshot of the complete project right away.  
Each time you have an updated version of your project, VCS requires you to
provide a short info about what was changed. Also, you can see what exactly
was altered in the file’s content. This gives you the privilege to know who
has made what altered the project.  
A distributed VCS like Git provides all the team members about the complete
history of the project so if there is a breakdown in the central server, you
may use any of your teammate’s local Git repository.

  
**Q. Describe branching strategies you have used.?**

This question tests your branching experience so tell them about how you have
used branching in your past jobs and what purpose does it serves, you can
refer the below points:

Feature branching:  
A feature branch model holds all of the changes for a particular feature
inside of a branch. When the feature is completely tested and validated by the
automated tests, the branch is then added to the master.  
Task branching:  
In this model, each task is implemented over its own branch with the task key
included inside the branch name. It is easy to notice which code implements
which task, just search for the task key in the branch name.  
Release branching:  
Once the developed branch acquires enough features for a release, you can get
that branch cloned to form a Release branch. Making this branch starts the
further release cycle, so no extra features can be added after this point,
only bug fixes, documentation generation, and other release-oriented tasks
should get on this branch. Once it is ready to be shipped, the release gets
merged into master and tagged with a version number. In addition, it should be
merged back inside develop branch, which might have progressed since the
release was initiated.  
At the end, tell them that branching strategies vary from one organization to
another, so I am familiar with basic branching operations like delete, merge,
checking out a branch etc.

**  
Q. What is meant by Continuous Integration?**

It is advised to begin this answer by giving a short definition of Continuous
Integration (CI). Continuous Integration is a development practice that needs
developers to integrate code into a shared repository many times a day. Each
check-in gets verified by an automated build, allowing teams to detect
problems early.  
I would suggest you explain how you have implemented it in your previous job.

  
**Q. Explain how you can move or copy Jenkins from one server to another?**

I could have achieved this task by copying the jobs directory directly from
the old server to the new one. There are many ways to do that; They are
mentioned below:  
You can:

Moving a job from one installation of Jenkins to another by simply copying and
pasting the corresponding job directory.  
Create a copy of an existing job by making a clone of a job directory by a
different name.  
Rename an existing job by renaming a directory. Notice that if you change a
job name, then you will need to change any other job that tries to call the
renamed job.

**  
Q. Explain how can you create a backup and copy files in Jenkins?**

The question has a direct answer. To create a backup, all you need to do is to
back up your JENKINS_HOME directory at regular intervals of time. JENKINS_HOME
directory contains all of your build jobs configurations, slave node
configurations, and build history. For generating a backup of your Jenkins
setup, simply copy its directory. You may also copy a job directory for
cloning or replicate a job or rename the directory.

**Q. How will you secure Jenkins?**  
The most common way of securing Jenkins is given below. But if you have any
other way of doing it, you may go with it, but make sure you are correct:  
Make sure that the global security is on.  
Make sure that Jenkins is integrated with “my company’s” user directory using
the appropriate plugin.  
Make sure that matrix/Project matrix is enabled for getting the fine tune
access.  
Automate the setting rights/privileges process in Jenkins with custom version
controlled script.  
Bound the physical access to Jenkins data/folders.  
Run security audits on same over a period of time.

**Q. What is Continuous Testing?**  
It is advised to follow the under mentioned explanation:  
“Continuous Testing is the process of executing automated tests as a part of
the software delivery pipeline to produce immediate feedback over the business
risks associated with the latest build. In this method, each build gets tested
continuously, allowing Development teams to get fast feedbacks so that as to
prevent those problems from progressing to the successive stage of Software
delivery life-cycle. Continuous Testing speeds up a developer’s workflow
dramatically as there’s no need to manually rebuild the project and re-run all
of the tests after making changes.”

**Q. What is Automation Testing?**  
Automation testing or Test Automation is a process of automating the manual
process for testing the application/system under test. The Process involves
the use of separate testing tools which allows you to create test scripts
which can be executed repeatedly and doesn’t require any sort of manual
intervention.

  
**Q. What are the benefits of Automation Testing?**  
Some of the many advantages of Automation Testing are mentioned below.
Including these points in your answer and adding your own experience of how
Continuous Testing helped you previous in your previous job, will make an
impressive and impacting answer:  
Supports execution of repeated test cases  
Aids in testing a large test matrix  
Enables parallel execution  
Encourages unattended execution  
Improves accuracy thereby reducing human-generated errors  
Saves time and money

**Q. What is the difference between Assert and Verify commands in Selenium?**  
The basic difference between Assert and Verify command is given below:  
Assert command checks if the given condition is boolean true or boolean false.
For an instance, say, we assert whether the given element is present on the
web page or not. If the condition results to be true, then the program control
will execute the next test step. But, if the condition results in false, the
execution would be terminated and no further test would be executed.  
Verify command also performs check whether the given condition is true or
false. Irrespective of the condition being true or false, the program
execution doesn’t stop i.e. if the verification process fails, it would not
stop the execution and all the test steps will be executed.

  
**Q. How can be a browser launched using WebDriver?**  
The following syntax could possibly be used to launch Browser:  
“WebDriver driver = new FirefoxDriver();”  
“WebDriver driver = new ChromeDriver();”  
“WebDriver driver = new InternetExplorerDriver();”

**Q. What are the goals of Configuration management processes?**  
The basic purpose of Configuration Management (CM) is to ensure if the product
is integral or system throughout its life-cycle by making t0he development or
deployment process controllable and repeatable, thus creating a higher quality
product or system. The Configuration Management process allows orderly
management of system information and system changes for purposes such as to:  
Revise capability,  
Improve performance,  
Reliability or maintainability,  
Extend life,  
Reduce cost,  
Reduce risk and  
Liability, or correct defects.

**Q. What is the difference between an Asset and a Configuration Item?**  
As per me, first of all, Asset should be explained. It has a financial value
along with a depreciation rate attached to it. IT assets are just a sub-set.
Everything and anything that holds a cost and the organization uses it for the
calculation of its asset value and related benefits in the calculation of tax
falls under Asset Management, and such item is called an asset.  
On the other hand, Configuration Item may or may not have financial values
assigned to it. Also, there will not be any depreciation linked to it. Thus,
its life will not depend on its financial value but will depend on the time
till that item becomes obsolete for the organization.  
Now examples can be given that can showcase the similarity and differences
between both:  
1) Similarity:  
Server – It is both an asset as well as a CI.  
2) Difference:  
Building – It is an asset but not a CI.  
Document – It is a CI but not an asset

  
**Q . What is Chef?**  
Start the answer with the definition of Chef. The Chef is one of the powerful
automation platforms that turns infrastructures into code. A chef is a tool
for which scripts are written that are used to automate processes. What kind
of processes? Any process that is related to IT.  
Now the architecture of Chef can be explained, it consists of:  
Chef Server: The Chef Server is the central store of infrastructure’s
configuration data. The Chef Server stores the data necessary to configure the
nodes and provides search. ChefServer is a powerful tool that lets you to
dynamically drive node configuration based on data.  
Chef Node: Node is any host that gets configured using Chef-client. Chef-
client runs on nodes. ChefNode contacts the Chef Server for the information
necessary to configure the node. Now, since a Node is just a machine that runs
the Chef-client software, nodes may be sometimes referred to as “clients”.  
Chef Workstation: A Chef Workstation is a host used to modify cookbooks and
other confrontational data.

  
**Q2. What is Nagios?**  
This question can be answered by first mentioning that Nagios is one of the
monitoring tools used for Continuous monitoring of systems, applications,
services, and business processes etc in DevOps culture. If a failure occurs,
Nagios alerts technical staff about the problem, that allows them to begin
remedial processes before outages affect business processes, end-users, or
customers. With Nagios, you need not explain why an unseen infrastructure
outage affects your organization's bottom line.  
Now once you defined what is Nagios, you can mention various things that can
be achieved using Nagios.  
By using Nagios you can:  
Plan for infrastructure upgrades before outdated systems cause failures.  
Response to the issues at problem’s first sign.  
Automatically fix detected problems.  
Coordinate easily with technical team responses.  
Ensure that your organization’s SLAs are being met.  
Monitor your entire infrastructure and business processes.  
Nagios runs on a server, usually as a daemon or service. Nagios runs plugins
residing on the same server over a period of time. They make contact to hosts
or servers on your network or over the internet. One can see the status
information using the web interface. Nagios also sends email or SMS
notifications if something happens.  
The Nagios daemon acts like a scheduler that executes certain scripts at
certain moments. It then saves the results of those scripts and will run other
scripts if these results change.

*****************************************************************************************************

**DevOps Job Description**  
Demand for people with DevOps skills is growing at a fast and steady rate
because businesses are getting great results from DevOps. Organizations using
DevOps practices are surprisingly high-functioning: -

  
They can deploy code up to 30 times more frequently than their competitors,
and 50 percent lesser of their deployments fail.  
With all this goodness, you would be thinking that there must be lots of
DevOps engineers out there. However, just 18% of survey respondents in the
survey said someone in their organization actually held this title. Why is
that? Partly, it is because defining what a DevOps engineers can do is still
in flux. Although, That is not stopping companies from hiring for DevOps
skills. On LinkedIn, people's mentioning of DevOps as a skill has seen a rise
of 50 percent over the past few years. A survey revealed the same trend:  
Half of about 4,000-plus respondents (in more than 90 countries) said their
companies are considering DevOps skills while hiring.

  
**What are DevOps skills?**  
The survey identified the top three skill areas for DevOps staff:

**Coding or scripting  
Process re-engineering  
Communicating and collaborating with others**  
The above-mentioned skills point to a growing recognition, that software isn’t
written in the old stereotypical way anymore. Where software was written from
scratch using a highly complex and lengthy process. Also, creating new
products is now a matter of selecting open source components and binding them
together with code. The complexity of today’s software lies less in the
programming, and more in ensuring that the new software works over a diverse
set of operating systems. Making it platform independent right away. Same way,
testing and deployment are now done at a much more frequency. That is, they
can be more often— if developers start communicating more early and regularly
with the operations team, and also if, operations people bring their knowledge
of the production environment to design of testing and staging environment.

  
**What is a DevOps engineer, anyway? And should anyone hire them?**  
There’s no formal cliched career track for kickstarting your career as a
DevOps engineer. They are maybe developers who get interested in deployment
and network operations, they might be sysadmins who have an affinity for
scripting and coding. Whatever world they are from, these are people who have
pushed themselves out of their comfort zone of their defined areas of
competence and who have a more holistic view of their technical environments.  
DevOps engineers are a quite elite group, so it’s not astonishing that we
found a smaller number of companies creating that title. Kelsey Hightower,
head of operations at Puppet Labs, described these people as the “Special
Forces” in an organization. “The DevOps engineer encapsulates depth of
knowledge and years of hands-on experience,” Kelsey says, “You’re battle
tested. This person blends the skills of the business analyst with the
technical chops to build the solution - plus they know the business well, and
can look at how any issue affects the entire company.  
So, in a nutshell, DevOps provides you lots of career opportunities and
companies are ACTUALLY hiring DevOps engineers.

******************************************************************************

**Object-Oriented Programming:-**  
Object-Oriented Programming or commonly called OOPs is  
There are 5 basic concepts of OOPs. Let's have a closer look at each of them.

  
**1\. Abstraction**  
This is the property of OOPs which refers to the act of representing only the
essential details and hiding  
the background data. Consider a car as your object. You are told that if you
apply the brakes, the vehicle  
stops. The background details, like the mechanism how the fluid runs through,
the brake shoes stopping  
the wheel, etc. are hidden from you. This is what abstraction is. Abstraction
is the advantage that you  
get from Object Oriented Programming over Procedural Oriented Programming.

  
**2\. Encapsulation**  
The process of binding characteristics and behavior in a single unit is simply
known as  
Let's get back to our previous example of a car. In a car, we have a steering
that helps to change the direction, we have brakes to stop the car, we have a
music system to listen to music, etc. These all units are capsuled (or
ENcapsuled) under a single unit called CAR. Like objects, each unit has its
own  
characteristics as well as behavior.  
It is a common observation that a class encapsulates objects of the similar
kind under a single unit.

  
**3\. Modularity**  
Modularity is the feature of Object Oriented Programming that allows us to
break a bigger problem in  
smaller chunks and assemble it together, later. For an instance, during the
manufacturing of a car, parts  
are constructed separately. Like there is a unit that makes the engine, a unit
makes the outer body, a  
unit makes the interior, etc. Later on, all the parts are assembled at one
place. This way, a big problem is divided into small chunks and handled
easily.  
In Object Oriented Programming, Modularity is implemented by functions.

  
**4\. Inheritance**  
Inheritance is the capability of a class to inherit the properties of some
other class. For an example,  
consider CAR as a class. Now let's take TOYOTA, NISSAN, SWIFT, HYUNDAI, etc.
as some other class.  
These classes will have them some individual properties but they will inherit
some of their properties  
from the class CAR. Like moving on applying accelerator, stopping when brakes
are applied, etc.  
The inheriting class is called the subclass whereas the inherited class is
called base class. In the above  
example, CAR is the base class and others are a subclass.

  
**5\. Polymorphism**

The act of existing in more than one form  
Lets again get back to our example of cars. Consider a class called HYUNDAI.
The HYUNDAI class has an  
object i10. Now there can be many cars with the name i10, but they have a
unique identification. (  
either by their registration number or engine number, we are not concerned
here about that)  
In an Object Oriented Programming language, there can be many functions with
the same name but  
they should be of different parameters.  
So now you know, the 5 pillars of Object Oriented Programming.

_**Happy coding!  
**_

_*********************************************************************************  
**_

_****_

**DevOps For Dummies- A Wiley Brand**  
is an IBM limited edition written by Sanjeev Sharma  
and Bernie Coyne. Earlier it was written only by Sanjeev Sharma alone, but in
the latest third edition,  
Bernie Coyne co-authored the book. This is a book for the people interested in
DevOps. It takes you  
from beginner to advanced level. The book is available in the form of
electronic media i.e. e-book. The  
free of cost book comes from IBM.  
Go to the link above and fill in your details, and you will get the download
link of your copy.  
Let's take a look at the book's features:  
 **Cover Page:-**  
It is often said, don't judge a book by its cover. But we humans are very much
stubborn  
and the cover matters the most for the readers, as it lures them towards
itself. The cover page for  
DevOps for dummies is a mixture of Black, blue and yellow color; with an
animated geeky face outline.  
At the top, IBM logo resides with its full dignity. The middle right half of
the page covers the main  
outlines of the book:  
  

  * **The business needs and value of DevOps.**
  *  **DevOps capabilities and adoption path.**
  *  **How Cloud accelerates DevOps.**

**Table Of Content:-**  
Next, as we turn over the "virtual pages" comes the table of content. This
gives  
an overview of what you are going to learn from this book. There are chapter
names with their  
subtopics under them. The chapter names are as follows:-

  
**1.What is DevOps?  
2.Looking at DevOps capabilities.  
3.Adopting DevOps.  
4.Looking at how cloud accelerates DevOps.  
5.Using DevOps to solve new challenges.  
6.Making DevOps work: IBM's Story.  
7.Ten DevOps myths.**

  
**Introduction:-**  
Next, comes in the introduction part.  
In the first line, the meaning of DevOps with its expanded form of Development
and Operations is  
explained. Everyone talks about it, but not everyone is familiar with it. In a
nutshell, DevOps is an  
approach based on lean and agile principles in which business owners and the
development, operations,  
and quality assurance departments collaborate to deliver software in a
continuous manner. The further  
lines tell about the IBM's broad and holistic view towards DevOps.  
The book tells what a true DevOps approach includes:  
Lines of business, practitioners, executives, partners, suppliers, and so on.  
 **About the book:-**  
The about the book section gives an overview of the book.  
The book takes a business-centric approach to DevOps. Today’s rapidly
advancing world makes DevOps essential to all enterprises that should be agile
and lean enough to respond rapidly to the changes such as customer demands,
market conditions, competitive pressures, or regulatory requirements.  
It is assumed that, if you are reading this book, you’ve heard about DevOps
but want to understand  
what it means and how your company can gain business benefits from it. This
book is targeted for  
executives, decision-makers, and practitioners who are new to the DevOps,
seeking info about the  
approach, who want to go through the hype surrounding the concept to reach t

_****_


text length distribution by category

fig = plt.figure(figsize=(11,9)) 
ax1 = plt.subplot(1,2,1) 
# subplot(1, 2, 1): the arguments are the number of rows, the number of columns, and the panel index
ax2 = plt.subplot(1,2,2)


sns.boxplot(x="lengh_short_desc", y="category", data= df, ax=ax2)


sns.boxplot(x=df["lengh_long_desc"].clip(0,7500), y="category", data= df, ax=ax1)


plt.tight_layout()

ax1.grid()
ax2.grid()


plt.show()

We can see that the Development category has slightly longer long descriptions, with many outliers, and longer short descriptions: about 25% of its courses have a short description of more than 125 characters. (Note for long_desc: I clipped the values at the 7500 mark in order to visualize the boxplots properly.)

longest text in each category :

df.groupby('category').agg(max_long_desc= ('lengh_long_desc','max')).sort_values(by='max_long_desc',  ascending=False)
max_long_desc
category
Development 32103
Personal Development 20021
Design 16801
Marketing 16435
Business 16359
IT & Software 15610
Finance & Accounting 14129
Lifestyle 9156
Teaching & Academics 8757
Office Productivity 8555
Health & Fitness 8546
Photography & Video 8444
Music 8306

how many numbers?

The plot below shows that humanistic courses contain fewer numbers, which might help differentiate them from development courses.

df['num_count']= df.iloc[:,3].apply(lambda x: len(re.findall(r'(\d\.\d+|\d+)', x)))  # float OR integer (raw string avoids invalid-escape warnings)
fig = plt.figure(figsize=(12,8)) 

sns.boxplot(x=df["num_count"].clip(0,100), y="category", data= df)

<AxesSubplot:xlabel='num_count', ylabel='category'>
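
To make the counting rule concrete, here is a small standalone sketch of the same pattern applied to made-up strings (the sample text and the `count_numbers` helper are illustrative assumptions, not part of the dataset). It also shows that a multi-digit decimal such as "12.5" is counted as two matches, because the first alternative only allows a single digit before the dot.

```python
import re

# Same pattern as in the cell above, written as a raw string.
NUM_PATTERN = re.compile(r"(\d\.\d+|\d+)")

def count_numbers(text):
    """Return how many integer/decimal matches the pattern finds in text."""
    return len(NUM_PATTERN.findall(text))

sample = "Build 3 apps in 10 days with Python 3.8"  # hypothetical description
print(NUM_PATTERN.findall(sample))   # the individual matches
print(count_numbers(sample))         # the value stored in num_count
print(NUM_PATTERN.findall("12.5"))   # multi-digit decimal is split in two
```

If decimals with several leading digits should count once, a pattern such as `\d+\.\d+|\d+` would be one possible alternative; that would change the num_count values, so it is not applied here.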

how many non-alphanumeric characters?

Humanistic courses tend to have more of them, although the separation between categories is small overall.

fig = plt.figure(figsize=(12,8)) 

df['num_non_alphanum']= df.iloc[:,3].apply(lambda x: len(re.findall("[^0-9A-Za-z ]", x)))  # non-alpha numeric or spaces

sns.boxplot(x=df['num_non_alphanum'], y="category", data= df)

<AxesSubplot:xlabel='num_non_alphanum', ylabel='category'>

Preprocessing


from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin, BaseEstimator
from nltk.stem import PorterStemmer

creating the target and uniting all text columns to one column

We combine all the non-development categories into one category, since the task is to correctly label the development courses. This simplifies the problem, because labeling a very unbalanced multiclass dataset is a much harder task for a model.

df.loc[df['category']=='Development', 'is_dev'] = True
df.loc[df['category']!='Development', 'is_dev'] = False

In order to keep the number of dimensions small, I prefer to unite all the textual columns into one column.

df['text'] = df['title']+' '+df['description']+' '+df['longDescription']
df.columns
Index(['title', 'description', 'category', 'longDescription',
       'lengh_long_desc', 'lengh_short_desc', 'num_count', 'num_non_alphanum',
       'is_dev', 'text'],
      dtype='object')

cleaning text

  • remove end-of-line marks (\n\n)
  • replace URLs with the generic "url"
  • remove stopwords (using nltk's English stopwords)
  • remove punctuation inside words (- and ')
  • remove non-alphanumeric characters and lowercase
  • replace numbers above 9 with 999
  • replace 2 or more consecutive spaces with a single space
def remove_end_line(x):
    return x.str.replace("\n\n", " ", regex=False)

def remove_url(x):
    # regex=True is passed explicitly: newer pandas versions default to literal matching
    return x.str.replace(r"https*\S+", "url", regex=True)

def remove_puncs_inside_words(x):
    return x.str.replace(r"['\-]", "", regex=True)

def remove_non_alphanomeric_and_lower(x):
    return x.str.replace(r"[^0-9A-Za-z ']", " ", regex=True).str.lower()


def replace_high_numbers(x):   # integers higher than 9 are replaced with 999
    try:
        y = int(x) > 9
    except ValueError:
        return x
    if y:
        return '999'
    else:
        return x

def remove_stop_words(x):
    stopwords_dict = {word: 1 for word in stopwords.words("english")}
    return ' '.join([y for y in x.split() if y not in stopwords_dict])  # --> dict lookup speeds this up tremendously



def replace_overspaces(x):
    return x.str.replace(r"\s{2,}", " ", regex=True)
df[['text']]= df[['text']].apply(remove_end_line)\
                            .apply(remove_url)\
                            .applymap(remove_stop_words)\
                            .apply(remove_puncs_inside_words)\
                            .apply(remove_non_alphanomeric_and_lower)\
                            .applymap(lambda x:  ' '.join(replace_high_numbers(x) for x in x.split()))\
                            .apply(replace_overspaces)
                                                  
df.head(3)
title description category longDescription lengh_long_desc lengh_short_desc num_count num_non_alphanum is_dev text
0 Python for Beginners Learn Python programming from scratch with han... Development **Why Python ?**\n\n * Python is one of the w... 1552 74 0 103 True python beginners learn python programming scra...
1 Design Patterns in Python Learn the Design Patterns in a practical way u... Development Learning Design Pattern is a voracious learnin... 1919 57 0 80 True design patterns python learn design patterns p...
2 Unity Mobile C# Developer Course Create and deploy games for Android & iOS usin... Development Build 3 simple mobile games using the free Uni... 1734 56 1 106 True unity mobile c developer course create deploy ...
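
The cleaning steps above can be sketched on a single string with only the standard library. This is an illustration under stated assumptions, not the notebook's pandas code: `STOP_WORDS` is a tiny stand-in for nltk's English list, `clean_text` is a hypothetical helper, and the sample sentence is made up. The steps run in the same order as in the chained call.

```python
import re

# Tiny stand-in for nltk's English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "in", "and", "to", "with", "for"}

def clean_text(text):
    text = text.replace("\n\n", " ")                  # remove end-of-line marks
    text = re.sub(r"https*\S+", "url", text)          # replace URLs with "url"
    # Stop-word removal happens before lowercasing, so it is case-sensitive.
    text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    text = re.sub(r"['\-]", "", text)                 # drop - and ' inside words
    text = re.sub(r"[^0-9A-Za-z ']", " ", text).lower()
    # Replace integers above 9 with the placeholder 999.
    text = " ".join("999" if w.isdigit() and int(w) > 9 else w
                    for w in text.split())
    return re.sub(r"\s{2,}", " ", text)               # collapse repeated spaces

sample = "Learn C# in 30 days!\n\nThe hands-on course: https://example.com"
print(clean_text(sample))
```

In this sketch the capitalized "The" survives because the stop-word list is lowercase and lowercasing comes later. That ordering may explain why tokens such as 'the' and 'thi' still show up in the fitted vocabulary further down.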

analysis after cleaning : creating a wordcloud for each category

  • development courses:

wordcloud = WordCloud(stopwords=['course','learn','this','the','in','it','you','use']).generate(df[df['is_dev']==1]['text'].sum())

plt.figure(figsize=(11,11))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
  • non-development courses:

plt.figure(figsize=(11,11))
wordcloud = WordCloud(stopwords=['course','learn','this','the','thi','in','it','you','use']).generate(df[df['is_dev']==0]['text'].sum())

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

It seems that words like 'application', 'project', 'python' and 'machine learning' are distinctly more frequent in the development courses (they appear large only in the upper image), while the word 'business' seems to be important mainly for the non-development courses.


creating a dataframe with only the needed features:

df_m = df[['text','lengh_long_desc', 'lengh_short_desc', 'num_count','is_dev']]

Splitting to train/test

X = df_m.drop(['is_dev'],axis=1) 
y = df_m['is_dev']    

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

Creating a pipeline for a random-forest model:

the pipeline steps are:

  • stemming with the Porter stemmer
  • fitting a tf/idf transformer only on the text column (with a column transformer) -- this will create our vocabulary
  • using grid search to cross-validate the model and tune the hyperparameters
  • fitting the random-forest algorithm with the best estimator

Stemming/ Lemma

While lemmatization is better at preserving the contextual meaning of a word, we will use stemming since it is much faster. Stemming reduces words (mainly verbs and the endings of nouns) to their root form, and therefore decreases the dimensionality significantly.

class TextStemmer(TransformerMixin, BaseEstimator):    
    
    def __init__(self):
        super().__init__()
        self.ps = PorterStemmer()
    
    def fit (self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()  # work on a copy so the caller's DataFrame is not modified
        X['text_stem']= X['text'].map(lambda y: ' '.join(self.ps.stem(z) for z in y.split()))
        X= X.drop('text', axis=1)
        return X
        
    

TF/IDF

we will use tf/idf to generate the vocabulary

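With smooth_idf=False, the sklearn package defines tf/idf as shown below (the default, smooth_idf=True, adds 1 to both n and df(t) inside the logarithm):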

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t)$$

$$\text{idf}(t) = \log\left(\frac{n}{\text{df}(t)}\right) + 1$$

(tf= term frequency, idf = inverse document frequency)

tf/idf adds a weighting to the count vector of each token in a document by multiplying the count of a token (tf) by a term that reflects how rare or frequent the token is in the entire corpus (idf):

idf is defined here as the logarithm of the total number of documents divided by the number of documents in which the token appears, plus 1.

If the ratio is big, meaning the token appears in only a few documents of the corpus, the term will be much bigger than 1 and will give a boost to the count.

If the ratio is small, meaning the token appears in most documents of the corpus, the term will be close to 1 (it never drops below 1), so the count is left almost unchanged.
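
As a quick numeric check of the idf formula above, the sketch below evaluates it for a rare token, a frequent token, and a token that appears in every document. The corpus size, document frequencies, and term count are made-up numbers chosen only for illustration, and `idf` is a hypothetical helper implementing the unsmoothed formula.

```python
import math

def idf(n_docs, doc_freq):
    """Unsmoothed idf: ln(n / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

n_docs = 1000                      # hypothetical corpus size
rare = idf(n_docs, 5)              # token found in 5 documents
common = idf(n_docs, 900)          # token found in 900 documents
everywhere = idf(n_docs, n_docs)   # token found in every document

tf = 3                             # token occurs 3 times in one document
print(round(rare, 3), round(common, 3), everywhere)
print(round(tf * rare, 3), round(tf * common, 3))
```

Since df(t) can never exceed n, the logarithm is never negative and the idf term is at least 1: rare tokens are boosted strongly, while very common tokens keep roughly their raw count.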


creating all the transformers:

stem_text = TextStemmer()

tfidf = TfidfVectorizer(analyzer = 'word' ,token_pattern= r"(?u)\b\w+\b" ,ngram_range=(1,2), min_df= 0.005, max_df= 0.99 ,norm=None) 

rfc= RandomForestClassifier(max_depth=9, n_estimators=100, random_state= 42)

text_tfidf = ColumnTransformer(transformers= [('tfidf', tfidf, 'text_stem')], remainder= 'passthrough', sparse_threshold=0 )

tf/idf parameters:

  • we use n-grams since some words take on a different meaning in combination with another word; we also reduce the size of the vocabulary by using min_df

  • the token_pattern allows tokens that are one character long, like "C" or "R", which are important programming languages that I anticipate will appear many times in the text
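
The effect of the token pattern can be checked directly with the `re` module. The sketch below compares scikit-learn's default pattern, which requires at least two word characters, with the pattern used here; the sample sentence is made up for illustration.

```python
import re

text = "learn c and r programming with python 3"  # hypothetical cleaned text

default_pattern = r"(?u)\b\w\w+\b"  # scikit-learn default: tokens of 2+ characters
custom_pattern = r"(?u)\b\w+\b"     # pattern used in this notebook: 1+ characters

default_tokens = re.findall(default_pattern, text)
custom_tokens = re.findall(custom_pattern, text)

print(default_tokens)
print(custom_tokens)
```

With the default pattern, "c", "r" and "3" are silently dropped. The custom pattern keeps them, at the cost of also keeping other single-character tokens.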

final_model = Pipeline(steps=[('st', stem_text),
                              ('ct', text_tfidf),
                              ('rfc', rfc)
                        ])
params_store= final_model.get_params()


param_search = {
  'ct__tfidf__min_df' :[0.005, 0.1],
  'rfc__max_depth' : [6, 10, 12],
  'rfc__n_estimators' :[150]
}

#params_store

using the pipeline:


gsearch = GridSearchCV(estimator= final_model,  cv=3,  
                      param_grid= param_search, verbose=2 ) 

gsearch.fit(X_train, y_train.astype('bool'))
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  39.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   39.7s remaining:    0.0s
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  39.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=6, rfc__n_estimators=150, total=  40.1s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  41.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  42.1s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=10, rfc__n_estimators=150, total=  42.2s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.8s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.9s
[CV] ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150 
[CV]  ct__tfidf__min_df=0.005, rfc__max_depth=12, rfc__n_estimators=150, total=  42.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  37.0s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  39.3s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150 ..
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=6, rfc__n_estimators=150, total=  39.7s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.2s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=10, rfc__n_estimators=150, total=  40.6s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.4s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.5s
[CV] ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150 .
[CV]  ct__tfidf__min_df=0.1, rfc__max_depth=12, rfc__n_estimators=150, total=  40.4s
[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed: 12.2min finished
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('st', TextStemmer()),
                                       ('ct',
                                        ColumnTransformer(remainder='passthrough',
                                                          sparse_threshold=0,
                                                          transformers=[('tfidf',
                                                                         TfidfVectorizer(max_df=0.99,
                                                                                         min_df=0.005,
                                                                                         ngram_range=(1,
                                                                                                      2),
                                                                                         norm=None,
                                                                                         token_pattern='(?u)\\b\\w+\\b'),
                                                                         'text_stem')])),
                                       ('rfc',
                                        RandomForestClassifier(max_depth=9,
                                                               random_state=42))]),
             param_grid={'ct__tfidf__min_df': [0.005, 0.1],
                         'rfc__max_depth': [6, 10, 12],
                         'rfc__n_estimators': [150]},
             verbose=2)
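
The "6 candidates, totalling 18 fits" line in the log follows directly from the parameter grid. A minimal sketch of the enumeration, using only the standard library (the `candidates` list is built here for illustration and is not how GridSearchCV stores its candidates internally):

```python
from itertools import product

param_search = {
    "ct__tfidf__min_df": [0.005, 0.1],
    "rfc__max_depth": [6, 10, 12],
    "rfc__n_estimators": [150],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate.
names = list(param_search)
candidates = [dict(zip(names, values))
              for values in product(*param_search.values())]

print(len(candidates))              # number of candidates
print(len(candidates) * cv_folds)   # number of cross-validation fits
for candidate in candidates[:2]:
    print(candidate)
```

With the default refit=True, GridSearchCV additionally refits the best candidate once on the whole training set after the cross-validation fits.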

The Vocabulary

Fitting tf-idf creates a dictionary of all the unique tokens in our training documents, so every token gets a unique index number.

in the next stage it counts the occurrences of each token in every document and in this way creates a feature vector.

in the last stage it adds the idf weight as described above
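
The first two stages can be sketched in plain Python on a toy corpus. This is a simplified illustration, not scikit-learn's implementation: the three documents and the `ngrams` and `count_vector` helpers are assumptions made for the example. Indices are assigned in sorted order of the terms, which appears consistent with the vocabulary printed below (for example '0' maps to 0 and '000' to 1).

```python
from collections import Counter

docs = ["learn python program", "learn web develop", "python web app"]

def ngrams(tokens, n_max=2):
    """Yield unigrams and bigrams as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Stage 1: map each distinct n-gram to a column index (sorted order).
terms = sorted({g for d in docs for g in ngrams(d.split())})
vocabulary = {term: idx for idx, term in enumerate(terms)}

# Stage 2: count occurrences per document to form a feature vector.
def count_vector(doc):
    counts = Counter(ngrams(doc.split()))
    return [counts.get(term, 0) for term in terms]

print(vocabulary)
print(count_vector(docs[0]))
```

The third stage would multiply each column of these count vectors by the idf weight of its term.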

gsearch.best_estimator_.named_steps['ct'].named_transformers_['tfidf'].vocabulary_
{'mobil': 3192,
 'applic': 384,
 'manual': 3102,
 'test': 4589,
 'io': 2578,
 'bug': 647,
 'track': 4811,
 'debug': 1369,
 'realtim': 3895,
 'process': 3697,
 'agil': 222,
 'methodolog': 3166,
 'develop': 1445,
 'hand': 2163,
 'devic': 1489,
 'function': 1987,
 'usabl': 4924,
 'consist': 977,
 'autom': 470,
 'type': 4859,
 'either': 1625,
 'come': 886,
 'instal': 2527,
 'softwar': 4279,
 'distribut': 1524,
 'platform': 3591,
 'growth': 2138,
 'past': 3530,
 'year': 5367,
 'a': 108,
 'studi': 4458,
 'conduct': 965,
 'group': 2135,
 'predict': 3657,
 'gener': 2025,
 '4': 47,
 '2': 16,
 'billion': 595,
 'revenu': 3991,
 '999': 68,
 '7': 62,
 'u': 4863,
 's': 4033,
 'smartphon': 4263,
 'app': 356,
 'download': 1558,
 'thi': 4675,
 'cours': 1031,
 'go': 2079,
 'cover': 1225,
 'follow': 1897,
 'approach': 402,
 '1': 3,
 'hardwar': 2180,
 'the': 4612,
 'includ': 2477,
 'intern': 2556,
 'screen': 4083,
 'size': 4229,
 'resolut': 3964,
 'space': 4311,
 'memori': 3157,
 'camera': 731,
 'radio': 3845,
 'etc': 1710,
 'sometim': 4297,
 'refer': 3916,
 'as': 430,
 'simpl': 4207,
 'work': 5303,
 'it': 2598,
 'call': 728,
 'differenti': 1499,
 'earlier': 1589,
 'method': 3164,
 'even': 1713,
 'basic': 509,
 'differ': 1496,
 'import': 2439,
 'understand': 4879,
 'nativ': 3276,
 'creat': 1243,
 'use': 4926,
 'like': 2946,
 'tablet': 4518,
 'b': 490,
 'web': 5166,
 'serversid': 4152,
 'access': 153,
 'websit': 5187,
 'browser': 644,
 'chrome': 800,
 'connect': 973,
 'network': 3312,
 'c': 716,
 'hybrid': 2316,
 'combin': 884,
 'they': 4672,
 'run': 4029,
 'offlin': 3414,
 'written': 5358,
 'technolog': 4578,
 'html5': 2309,
 'css': 1305,
 'mobil applic': 3194,
 'io applic': 2582,
 'mobil devic': 3196,
 'test autom': 4590,
 'past year': 3532,
 '999 7': 71,
 'thi cours': 4682,
 'cours go': 1117,
 'go cover': 2086,
 'cover follow': 1233,
 'applic work': 399,
 'applic creat': 387,
 'creat use': 1282,
 'web app': 5168,
 'use differ': 4944,
 'use web': 5006,
 'web technolog': 5183,
 'technolog like': 4579,
 'learn': 2778,
 'program': 3717,
 'beginn': 551,
 'advanc': 193,
 'scratch': 4077,
 'best': 574,
 'exampl': 1752,
 'purpos': 3797,
 'languag': 2740,
 'at': 449,
 't': 4511,
 'lab': 2732,
 'variou': 5031,
 'game': 2002,
 'object': 3393,
 'orient': 3467,
 'teach': 4555,
 'everyth': 1739,
 'start': 4354,
 'oper': 3452,
 'concept': 945,
 'topic': 4800,
 'everi': 1724,
 'lesson': 2915,
 'explain': 1786,
 'detail': 1440,
 'code': 842,
 'those': 4715,
 'want': 5098,
 'strong': 4431,
 'knowledg': 2715,
 'take': 4523,
 'divid': 1528,
 'three': 4724,
 'part': 3514,
 'first': 1877,
 'second': 4094,
 'third': 4708,
 'learn c': 2795,
 'c program': 724,
 'program beginn': 3720,
 'basic advanc': 510,
 'advanc learn': 198,
 'program languag': 3733,
 'languag c': 2741,
 'c develop': 720,
 'languag use': 2749,
 'use variou': 5003,
 'softwar develop': 4281,
 'object orient': 3396,
 'languag thi': 2748,
 'cours teach': 1194,
 'teach everyth': 4560,
 'start basic': 4355,
 'it cover': 2601,
 'cover topic': 1236,
 'advanc topic': 201,
 'explain detail': 1788,
 'want learn': 5108,
 'learn program': 2864,
 'take cours': 4530,
 'cours thi': 1199,
 'cours divid': 1088,
 'first learn': 1880,
 'learn basic': 2790,
 'learn object': 2855,
 'get': 2027,
 'power': 3630,
 'framework': 1938,
 'python': 3803,
 'that': 4606,
 'easi': 1594,
 'with': 5282,
 'grow': 2136,
 'skill': 4231,
 'gap': 2020,
 'need': 3284,
 'talent': 4543,
 'greater': 2127,
 'ever': 1720,
 'befor': 545,
 'ground': 2133,
 'build': 648,
 'launch': 2769,
 'career': 749,
 'entrepreneur': 1691,
 'make': 3045,
 'say': 4057,
 'give': 2069,
 'fundament': 1991,
 'well': 5204,
 'handson': 2167,
 'experi': 1770,
 'requir': 3960,
 'success': 4473,
 'turn': 4849,
 'comput': 938,
 'modern': 3204,
 'machin': 3030,
 'next': 3341,
 'move': 3241,
 'beyond': 587,
 'static': 4387,
 'dynam': 1580,
 'we': 5140,
 'won': 5295,
 'stop': 4414,
 'there': 4664,
 'll': 2972,
 'also': 261,
 'implement': 2437,
 'full': 1971,
 'authent': 465,
 'system': 4505,
 'final': 1862,
 'extend': 1799,
 'integr': 2543,
 'thirdparti': 4711,
 'api': 353,
 'when': 5241,
 'finish': 1872,
 'fulli': 1979,
 'equip': 1697,
 'custom': 1316,
 '0': 0,
 'latest': 2766,
 'version': 5046,
 'avail': 475,
 'provid': 3786,
 'relev': 3940,
 'inform': 2510,
 'content': 991,
 'legaci': 2909,
 'user': 5010,
 'about': 140,
 'author': 466,
 'sinc': 4218,
 'discov': 1517,
 'way': 5127,
 'he': 2188,
 'interest': 2550,
 'appli': 381,
 'scienc': 4070,
 'address': 185,
 'problem': 3692,
 'parallel': 3510,
 'domain': 1545,
 'get start': 2062,
 'web framework': 5174,
 'easi learn': 1596,
 'learn use': 2890,
 'build app': 652,
 'use skill': 4991,
 'web develop': 5173,
 'make web': 3071,
 'web applic': 5169,
 'applic develop': 388,
 'it thi': 2623,
 'cours give': 1116,
 'fundament concept': 1992,
 'handson experi': 2169,
 'experi requir': 1774,
 'build web': 684,
 'well start': 5215,
 'websit develop': 5190,
 'app we': 375,
 'won t': 5296,
 'we ll': 5151,
 'll also': 2974,
 'also cover': 264,
 'learn integr': 2834,
 'finish cours': 1873,
 'cours fulli': 1113,
 'build custom': 662,
 'app thi': 373,
 'cours use': 1207,
 'latest version': 2768,
 'relev inform': 3941,
 'about author': 141,
 'easi way': 1600,
 'way learn': 5134,
 'learn web': 2893,
 'develop he': 1461,
 'comput scienc': 941,
 'path': 3533,
 'realworld': 3896,
 'solut': 4289,
 'modular': 3210,
 'one': 3423,
 'effici': 1621,
 'seen': 4120,
 'increas': 2488,
 'rate': 3855,
 'adopt': 191,
 'mainli': 3041,
 'lightweight': 2945,
 'display': 1520,
 'great': 2122,
 'robust': 4015,
 'perform': 3559,
 'varieti': 5030,
 'open': 3448,
 'sourc': 4307,
 'reliabl': 3943,
 'often': 3415,
 'googl': 2104,
 'deriv': 1408,
 'addit': 182,
 'featur': 1837,
 'collect': 878,
 'safeti': 4041,
 'capabl': 740,
 'builtin': 690,
 'larg': 2754,
 'standard': 4351,
 'librari': 2930,
 'if': 2404,
 'foundat': 1932,
 'improv': 2450,
 'packt': 3497,
 'video': 5050,
 'seri': 4140,
 'individu': 2502,
 'product': 3704,
 'put': 3800,
 'togeth': 4776,
 'logic': 2988,
 'stepwis': 4408,
 'manner': 3100,
 'highlight': 2254,
 'are': 409,
 'strategi': 4424,
 'design': 1412,
 'pattern': 3537,
 'deal': 1368,
 'storag': 4415,
 'data': 1327,
 'mysql': 3270,
 'let': 2917,
 'quick': 3832,
 'look': 2999,
 'journey': 2670,
 'tutori': 4851,
 'leav': 2901,
 'off': 3408,
 'you': 5378,
 'immedi': 2433,
 'practic': 3634,
 'offer': 3409,
 'avoid': 478,
 'common': 899,
 'mistak': 3189,
 'new': 3317,
 'initi': 2515,
 'upon': 4919,
 'i': 2317,
 'o': 3392,
 'file': 1855,
 'command': 893,
 'line': 2957,
 'tool': 4784,
 'error': 1699,
 'handl': 2166,
 'help': 2208,
 'structur': 4436,
 'log': 2987,
 'context': 998,
 'packag': 3496,
 'databas': 1350,
 'nosql': 3368,
 'mongodb': 3219,
 'across': 168,
 'microservic': 3170,
 'further': 1994,
 'explor': 1793,
 'interact': 2548,
 'via': 5049,
 'demonstr': 1400,
 'tune': 4847,
 'lastli': 2763,
 'reactiv': 3864,
 'serverless': 4151,
 'tip': 4754,
 'trick': 4838,
 'by': 709,
 'end': 1651,
 'abl': 124,
 'bridg': 636,
 'meet': 3153,
 'your': 5427,
 'expert': 1782,
 'esteem': 1707,
 'ensur': 1680,
 'smooth': 4264,
 'receiv': 3901,
 'master': 3115,
 'degre': 1389,
 'institut': 2534,
 'mine': 3182,
 'high': 2243,
 'largescal': 2756,
 'current': 1312,
 'lead': 2774,
 'team': 4569,
 'refin': 3917,
 'focus': 1895,
 'emphasi': 1641,
 'continu': 999,
 'deliveri': 1396,
 'publish': 3791,
 'number': 3388,
 'paper': 3508,
 'sever': 4161,
 'area': 413,
 'passion': 3526,
 'share': 4165,
 'idea': 2400,
 'other': 3472,
 'huge': 2313,
 'fan': 1822,
 'backend': 495,
 'learn path': 2858,
 'one power': 3432,
 'languag it': 2743,
 'easi use': 1599,
 'open sourc': 3450,
 'make easi': 3055,
 'build simpl': 678,
 'if interest': 2409,
 'improv perform': 2451,
 'go learn': 2089,
 'path packt': 3535,
 'packt s': 3498,
 's video': 4037,
 'video learn': 5059,
 'path seri': 3536,
 'seri individu': 4141,
 'individu video': 2503,
 'video product': 5062,
 'product put': 3705,
 'put togeth': 3801,
 'togeth logic': 4777,
 'logic stepwis': 2989,
 'stepwis manner': 4409,
 'manner video': 3101,
 'video build': 5052,
 'build skill': 679,
 'skill learn': 4239,
 'learn video': 2892,
 'video it': 5058,
 'it the': 2622,
 'the highlight': 4624,
 'highlight learn': 2255,
 'path are': 3534,
 'design pattern': 1426,
 'applic use': 397,
 'use advanc': 4928,
 'let s': 2920,
 's take': 4036,
 'take quick': 4535,
 'quick look': 3834,
 'look learn': 3004,
 'learn journey': 2838,
 'thi learn': 4689,
 'advanc concept': 194,
 'i o': 2364,
 'file system': 1857,
 'command line': 894,
 'error handl': 1700,
 'you also': 5380,
 'also learn': 274,
 'use mysql': 4973,
 'come across': 888,
 'you learn': 5397,
 'tip trick': 4755,
 'by end': 711,
 'end learn': 1654,
 'basic understand': 524,
 'go use': 2093,
 'advanc featur': 197,
 'meet your': 3154,
 'your expert': 5429,
 'expert we': 1784,
 'combin best': 885,
 'best work': 580,
 'work follow': 5309,
 'follow esteem': 1901,
 'esteem author': 1708,
 'author ensur': 467,
 'ensur learn': 1681,
 'journey smooth': 2671,
 'he work': 2192,
 'high perform': 2246,
 'he current': 2190,
 'best practic': 577,
 'autom test': 472,
 'he passion': 2191,
 'share knowledg': 4166,
 'he also': 2189,
 'map': 3104,
 'studio': 4459,
 'js': 2674,
 'wide': 5266,
 'survey': 4495,
 'know': 2699,
 'find': 1868,
 'format': 1923,
 'style': 4463,
 'interfac': 2552,
 'truli': 4844,
 'respons': 3969,
 'complex': 926,
 'assum': 446,
 'littl': 2969,
 'walk': 5094,
 'step': 4391,
 'youll': 5422,
 'big': 589,
 'beauti': 531,
 'time': 4736,
 'modern web': 3206,
 'applic it': 390,
 'cover everyth': 1232,
 'everyth need': 1742,
 'need know': 3293,
 'cours assum': 1053,
 'knowledg program': 2724,
 'walk step': 5095,
 'youll learn': 5425,
 'learn creat': 2801,
 'differ way': 1498,
 'user interact': 5015,
 'let get': 2918,
 'pro': 3690,
 'becom': 534,
 'tester': 4596,
 'award': 483,
 'win': 5273,
 'profession': 3708,
 'udemi': 4865,
 'seller': 4129,
 'materi': 3123,
 'last': 2758,
 'updat': 4912,
 'novemb': 3379,
 'over': 3484,
 '000': 1,
 'student': 4442,
 'enrol': 1673,
 'worldwid': 5345,
 'commun': 902,
 'still': 4411,
 'count': 1027,
 'anoth': 334,
 'popular': 3614,
 'us': 4923,
 'showcas': 4195,
 'just': 2682,
 'kept': 2687,
 'intro': 2564,
 'free': 1945,
 'preview': 3674,
 'conveni': 1006,
 'pleas': 3598,
 'feel': 1843,
 'drive': 1570,
 'lose': 3009,
 'opportun': 3455,
 'previous': 3678,
 'known': 2728,
 'cost': 1024,
 'fortun': 1926,
 'market': 3108,
 'leader': 2775,
 'industri': 2504,
 'nowaday': 3386,
 'mani': 3086,
 'came': 730,
 'play': 3594,
 'control': 1003,
 'better': 582,
 'suitabl': 4481,
 'peopl': 3548,
 'background': 497,
 'support': 4490,
 'script': 4085,
 'howev': 2304,
 'difficult': 1500,
 'endtoend': 1659,
 'train': 4816,
 'essenti': 1703,
 'gain': 1999,
 'competit': 911,
 'advantag': 202,
 'today': 4768,
 'commit': 898,
 'uniqu': 4894,
 'deliv': 1395,
 'onlin': 3441,
 'in': 2454,
 'aspect': 437,
 'level': 2924,
 'treat': 4830,
 'singl': 4222,
 'thoroughli': 4714,
 'brush': 645,
 'specif': 4320,
 'entir': 1687,
 'overview': 3488,
 'variabl': 5027,
 'output': 3481,
 'valu': 5023,
 'descript': 1410,
 'environ': 1693,
 'read': 3865,
 'write': 5353,
 'excel': 1756,
 'driven': 1571,
 'keyword': 2692,
 'becom expert': 537,
 'cours udemi': 1203,
 'sinc 999': 4219,
 '999 cours': 77,
 'cours materi': 1150,
 'last updat': 2761,
 'over 999': 3485,
 '999 000': 69,
 '000 student': 2,
 'student enrol': 4449,
 'first time': 1884,
 'like cours': 2947,
 'basic cours': 514,
 'cours video': 1208,
 'pleas feel': 3599,
 'feel free': 1846,
 'it if': 2609,
 'if want': 2418,
 'want becom': 5100,
 'becom master': 539,
 'autom tool': 473,
 'learn experi': 2818,
 'in cours': 2459,
 'cours cover': 1078,
 'cover import': 1235,
 'import aspect': 2440,
 'advanc level': 199,
 'explain everi': 1789,
 'everi singl': 1731,
 'it great': 2606,
 'entir cours': 1688,
 'cover basic': 1228,
 'topic includ': 4803,
 'read write': 3868,
 'data driven': 1335,
 'how': 2286,
 'to': 4758,
 'and': 309,
 'wordpress': 5299,
 'sale': 4044,
 'funnel': 1993,
 'easili': 1604,
 'land': 2737,
 'page': 3499,
 'hello': 2205,
 'welcom': 5202,
 'stun': 4462,
 'whole': 5263,
 'convert': 1009,
 'servic': 4153,
 'client': 829,
 'right': 4002,
 'place': 3586,
 'usual': 5018,
 'thing': 4698,
 'lot': 3012,
 'up': 4910,
 'kind': 2696,
 'minimum': 3185,
 'plu': 3604,
 'wait': 5091,
 'top': 4797,
 'might': 3176,
 'exactli': 1747,
 'expect': 1768,
 'busi': 693,
 'charg': 785,
 'simpli': 4213,
 'flexibl': 1889,
 'fast': 1829,
 'pay': 3539,
 'them': 4649,
 'alway': 289,
 'ad': 175,
 'weekli': 5200,
 'basi': 508,
 'anyth': 346,
 'els': 1632,
 'so': 4267,
 'for': 1907,
 'decid': 1372,
 'day': 1360,
 'money': 3213,
 'back': 491,
 'question': 3823,
 'ask': 433,
 'risk': 4010,
 'involv': 2577,
 'how to': 2299,
 'to use': 4767,
 'to creat': 4761,
 'learn how': 2826,
 'easili creat': 1605,
 'land page': 2738,
 'hello welcom': 2206,
 'creat stun': 1280,
 'right place': 4006,
 'lot time': 3017,
 'web design': 5172,
 'onlin busi': 3442,
 'want abl': 5099,
 'cours you': 1222,
 'updat new': 4916,
 'new featur': 3322,
 'you need': 5403,
 'abl creat': 127,
 'creat great': 1267,
 'wait for': 5092,
 'for enrol': 1910,
 'inform cours': 2511,
 '999 day': 79,
 'get money': 2053,
 'money back': 3214,
 'back question': 494,
 'question ask': 3825,
 'interview': 2560,
 'prepar': 3663,
 'save': 4052,
 'architectur': 408,
 'fastest': 1833,
 'world': 5336,
 'compani': 903,
 'amazon': 294,
 'netflix': 3311,
 'base': 504,
 'achiev': 164,
 'goal': 2094,
 'field': 1852,
 'engin': 1663,
 'may': 3136,
 'salari': 4043,
 'similar': 4206,
 'qualif': 3816,
 'without': 5289,
 'benefit': 571,
 'case': 754,
 'what': 5223,
 'biggest': 593,
 'me': 3140,
 'demand': 1398,
 'higher': 2249,
 'job': 2661,
 'good': 2100,
 'theoret': 4660,
 'but': 701,
 'rang': 3849,
 'secur': 4112,
 'attend': 454,
 'spend': 4325,
 'search': 4091,
 'internet': 2557,
 'alreadi': 258,
 'compil': 913,
 'list': 2966,
 'answer': 335,
 'ye': 5365,
 'view': 5067,
 'watch': 5123,
 'begin': 547,
 'onc': 3421,
 'tri': 4834,
 'word': 5298,
 'mark': 3107,
 'could': 1025,
 'yourself': 5434,
 'then': 4655,
 'pass': 3525,
 'after': 210,
 'face': 1807,
 'technic': 4571,
 'contain': 989,
 'architect': 407,
 'difficulti': 1501,
 'vari': 5026,
 'experienc': 1779,
 'happen': 2171,
 'chang': 778,
 'futur': 1996,
 'from': 1960,
 'keep': 2685,
 'our': 3475,
 'aim': 228,
 'sampl': 4046,
 '3': 32,
 'role': 4017,
 '5': 54,
 'is': 2590,
 'tailor': 4522,
 'templat': 4582,
 'organ': 3464,
 '6': 59,
 'disadvantag': 1512,
 'characterist': 784,
 '8': 65,
 '9': 67,
 'point': 3606,
 'rememb': 3945,
 'prefer': 3659,
 'synchron': 4503,
 'asynchron': 448,
 'orchestr': 3462,
 'issu': 2597,
 'rest': 3973,
 'http': 2311,
 'can': 733,
 'state': 4384,
 'extens': 1800,
 'semant': 4130,
 'buy': 705,
 'commerci': 896,
 'whi': 5249,
 'break': 633,
 'per': 3553,
 'host': 2273,
 'model': 3201,
 'mock': 3198,
 'consum': 986,
 'contract': 1001,
 'separ': 4138,
 'deploy': 1405,
 'releas': 3939,
 'mean': 3142,
 'failur': 1815,
 'monitor': 3220,
 'multipl': 3258,
 'id': 2398,
 'certif': 770,
 'key': 2689,
 'public': 3790,
 'confus': 972,
 'consid': 975,
 'law': 2770,
 'circuit': 803,
 'scale': 4062,
 'queri': 3821,
 'cach': 725,
 'discoveri': 1518,
 'document': 1538,
 'scenario': 4064,
 'major': 3044,
 'principl': 3683,
 'interview question': 2561,
 'cours learn': 1142,
 'learn everyth': 2814,
 'save time': 4055,
 'fastest grow': 1834,
 'big compani': 590,
 'compani like': 904,
 'cours design': 1083,
 'design help': 1422,
 'help achiev': 2209,
 'achiev goal': 165,
 'softwar engin': 4282,
 'softwar design': 4280,
 'design develop': 1418,
 'develop i': 1462,
 'i explain': 2339,
 'import concept': 2441,
 'use case': 4936,
 'cours what': 1215,
 'benefit cours': 572,
 'cours abl': 1037,
 'job interview': 2662,
 'it good': 2605,
 'topic cover': 4802,
 'cover cours': 1229,
 'cours we': 1212,
 'we cover': 5145,
 'wide rang': 5267,
 'rang topic': 3850,
 'topic cours': 4801,
 'how cours': 2289,
 'cours help': 1123,
 'spend time': 4326,
 'ye cours': 5366,
 'best way': 579,
 'watch cours': 5124,
 'cours begin': 1058,
 'begin end': 549,
 'answer question': 336,
 'go cours': 2085,
 'cours 999': 1035,
 '999 time': 102,
 'well prepar': 5214,
 'question cours': 3826,
 'cours contain': 1073,
 'level the': 2927,
 'what happen': 5229,
 'time time': 4749,
 'cours follow': 1109,
 'follow 1': 1898,
 '1 what': 13,
 '999 what': 106,
 'what differ': 5226,
 'continu integr': 1000,
 'differ type': 1497,
 '999 how': 84,
 'whi use': 5255,
 'compani use': 905,
 'use api': 4930,
 '999 in': 86,
 'maintain': 3042,
 'crossplatform': 1298,
 'coder': 873,
 'divers': 1527,
 'excit': 1761,
 'or': 3459,
 'old': 3417,
 'vital': 5079,
 'figur': 1854,
 'guess': 2144,
 'clean': 818,
 'reason': 3900,
 'exist': 1766,
 'indepth': 2496,
 'perfect': 3554,
 'complet': 914,
 'guid': 2146,
 'project': 3746,
 'all': 237,
 've': 5035,
 'along': 253,
 'each': 1584,
 'section': 4099,
 'dedic': 1379,
 'math': 3127,
 'input': 2519,
 'statement': 4385,
 'loop': 3007,
 'string': 4430,
 'array': 423,
 'record': 3908,
 'date': 1357,
 'procedur': 3695,
 'eas': 1593,
 'set': 4157,
 'progress': 3745,
 'around': 419,
 'intent': 2547,
 'encourag': 1649,
 'highlevel': 2251,
 'compat': 908,
 'syntax': 4504,
 'design build': 1414,
 'beginn level': 559,
 'what s': 5236,
 'way get': 5131,
 'start program': 4372,
 'program it': 3730,
 'it s': 2619,
 'way help': 5132,
 'help find': 2214,
 'what wait': 5238,
 'learn take': 2879,
 'take your': 4541,
 'next level': 3343,
 'write code': 5354,
 'applic learn': 391,
 'learn best': 2792,
 ...}

Validation report

We can see that the model is stable, since the standard deviation of the score across the CV splits is small (std_test_score is below 0.01 for every parameter combination).


cv_report= pd.DataFrame(gsearch.cv_results_) # gives accuracy score 

cv_report
mean_fit_time std_fit_time mean_score_time std_score_time param_ct__tfidf__min_df param_rfc__max_depth param_rfc__n_estimators params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 27.904208 0.280141 11.990902 0.157760 0.005 6 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.838865 0.843644 0.854711 0.845740 0.006636 6
1 29.845166 0.293449 12.166636 0.168616 0.005 10 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.887660 0.887911 0.895289 0.890287 0.003539 2
2 30.715490 0.126816 12.016137 0.231241 0.005 12 150 {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... 0.893901 0.892168 0.900681 0.895583 0.003673 1
3 26.274902 1.073447 12.405668 0.293781 0.1 6 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 6... 0.873475 0.855562 0.866345 0.865127 0.007364 5
4 27.804441 0.275139 12.626807 0.171780 0.1 10 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... 0.884539 0.870318 0.875993 0.876950 0.005845 4
5 27.847812 0.187276 12.579125 0.211614 0.1 12 150 {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... 0.885957 0.875142 0.878263 0.879788 0.004545 3
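
As a quick check of the stability claim, here is a minimal sketch that recomputes the mean and standard deviation per parameter combination from the split scores in the table above. The dictionary name `split_scores` is only illustrative; the numbers are copied from the cv_report output.

```python
import statistics

# Per-split accuracy copied from the cv_report table (rows 0-5)
split_scores = {
    0: [0.838865, 0.843644, 0.854711],
    1: [0.887660, 0.887911, 0.895289],
    2: [0.893901, 0.892168, 0.900681],
    3: [0.873475, 0.855562, 0.866345],
    4: [0.884539, 0.870318, 0.875993],
    5: [0.885957, 0.875142, 0.878263],
}

summary = {}
for row, scores in split_scores.items():
    mean = statistics.fmean(scores)
    # Population standard deviation: this reproduces std_test_score in the table
    std = statistics.pstdev(scores)
    summary[row] = (mean, std, std / mean)

# Row with the highest mean accuracy across the three splits
best_row = max(summary, key=lambda r: summary[r][0])

for row, (mean, std, rel) in summary.items():
    print(f"row {row}: mean={mean:.6f} std={std:.6f} std/mean={rel:.2%}")
print("best row:", best_row)
```

The spread is under 1% of the mean for every row, which is what the stability statement relies on.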

Predicting with a random-forest best estimator

RandomForestClassifier is an ensemble bootstrap-aggregation (bagging) algorithm: it builds a number of decision-tree classifiers, where each

classifier is fit on a random part of the data: rows are drawn uniformly with replacement, and only a random subset of the columns is considered.

As in a regular decision tree, each split in each tree is chosen by the reduction in impurity that splitting on a feature gives.

The end result is the majority vote that each sample receives from the classifiers.

This "wisdom of crowds" approach improves the stability and accuracy of the decision making.
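
To make the idea concrete, here is a small pure-Python sketch of bagging with majority voting. It is a simplification and not what scikit-learn does internally: each "tree" is a one-split stump, the feature subset is drawn once per stump rather than at every split, and all names (`fit_forest`, `predict_forest`, the toy data) are made up for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(X)), k=len(X))        # rows: uniform, with replacement
        feats = rng.sample(range(len(X[0])), n_features)  # random subset of the columns
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Toy data: class 0 sits near the origin, class 1 near (1, 1)
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25],
     [0.8, 0.9], [0.9, 0.8], [0.7, 0.7], [0.85, 0.75]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.05]), predict_forest(forest, [0.95, 0.95]))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote over many of them is much harder to fool, which is the point of the ensemble.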

Random_forest_diagram_complete.png



preds= gsearch.best_estimator_.predict(X_test)


final_results= pd.DataFrame(classification_report(y_test.astype('bool'), preds, output_dict=True))


conf_matrix= confusion_matrix(y_test.astype('bool') ,preds)


Results

final_results.rename(columns= {'False': 'non-Development', 'True':'Development'})
non-Development Development accuracy macro avg weighted avg
precision 0.922407 0.889478 0.901277 0.905943 0.902696
recall 0.823322 0.953555 0.901277 0.888438 0.901277
f1-score 0.870052 0.920403 0.901277 0.895227 0.900191
support 1415.000000 2110.000000 0.901277 3525.000000 3525.000000

sns.heatmap(pd.DataFrame(conf_matrix), annot=True, fmt='d', cmap=plt.cm.Blues, 
            cbar=False)

plt.title("confusion matrix")
Text(0.5, 1.0, 'confusion matrix')
  • TP = 2012
  • TN = 1165
  • FP = 250
  • FN = 98
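
The four counts above are enough to reproduce the numbers in the classification report. A short sketch, treating "Development" as the positive class (variable names are only illustrative):

```python
# Counts read off the confusion matrix above (Development = positive class)
tp, tn, fp, fn = 2012, 1165, 250, 98

# Development label
precision_dev = tp / (tp + fp)
recall_dev = tp / (tp + fn)
f1_dev = 2 * precision_dev * recall_dev / (precision_dev + recall_dev)

# non-Development label: the roles of the counts are swapped
precision_non = tn / (tn + fn)
recall_non = tn / (tn + fp)
f1_non = 2 * precision_non * recall_non / (precision_non + recall_non)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Development:     precision={precision_dev:.6f} recall={recall_dev:.6f} f1={f1_dev:.6f}")
print(f"non-Development: precision={precision_non:.6f} recall={recall_non:.6f} f1={f1_non:.6f}")
print(f"accuracy={accuracy:.6f}")
```

The 250 false positives are what pull the non-Development recall down to about 0.82.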

We get very good performance from our model, both precision-wise and recall-wise, when looking at the Development label.

The performance is slightly worse for the 0 (non-Development) label, mainly in recall (0.82).

After checking some false positives, we see that courses in the IT-Software category are harder to separate from the Development category, since many words are common to those 2 categories. So it may be wise to combine those 2 into the same category.

feature importances

The feature importances method has a tendency to inflate the weights of continuous features in a biased way, but since all of our features are continuous, it seems appropriate enough for a "big picture" estimation:

rfc = gsearch.best_estimator_.named_steps["rfc"]
feature_names = gsearch.best_estimator_.steps[1][1].get_feature_names()
pd.DataFrame(pd.Series(rfc.feature_importances_, index=feature_names).sort_values(ascending=False), columns=['importance']).head(15)
importance
tfidf__code 0.059199
tfidf__web 0.030740
tfidf__develop 0.027360
tfidf__applic 0.020564
tfidf__data 0.019096
tfidf__app 0.017708
tfidf__program 0.016756
tfidf__build 0.016654
tfidf__languag 0.016126
tfidf__javascript 0.015336
tfidf__java 0.014110
tfidf__program languag 0.013793
tfidf__api 0.012328
tfidf__python 0.011386
tfidf__web develop 0.011220

We see that the top features are all programming-related words (like code or java), which makes sense for classifying "Development" courses.

PCA and plotting a 3d scatter plot

In order to check the assumption about the false positives, we will try to plot all the courses in a way that reflects their differences, meaning that courses with similar content should also be close in the scatter plot.

PCA is a way to linearly transform a high-dimensional space into a much smaller hidden representation that captures as much of the variance between samples as possible.

Here we will use a 3-dimensional PCA transformer, so that we can plot the resulting vector for each course in a 3D scatter plot.
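
For intuition, here is a minimal NumPy sketch of what a 3-component PCA does: center the data, take the SVD, and project onto the first three right singular vectors. The random matrix is only a hypothetical stand-in for the dense TF-IDF matrix, so the explained-variance numbers it prints say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the dense TF-IDF matrix: 200 documents x 50 features
X = rng.random((200, 50))

# 1. Center every feature (PCA works on deviations from the mean)
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Keep the 3 directions with the largest variance and project onto them
components = Vt[:3]
X_3d = X_centered @ components.T

# Share of the total variance captured by each kept component
explained_ratio = (S[:3] ** 2) / np.sum(S ** 2)

print(X_3d.shape)
print(explained_ratio, explained_ratio.sum())
```

The sum of `explained_ratio` is a useful sanity check before reading too much into a 3D plot: if it is low, distances in the plot only partly reflect distances in the original space.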

X_train_vect= tfidf.fit_transform(X_train['text_stem'])

train = pd.DataFrame(X_train_vect.todense(), columns=tfidf.get_feature_names())#.iloc[0,:].sort_values()
from sklearn.decomposition import PCA

pca = PCA(3)
df_3d= pd.concat([pd.DataFrame(pca.fit_transform(train)), pd.DataFrame(df.loc[X_train.index]['category'].reset_index(drop=True)), pd.DataFrame(df.loc[X_train.index]['title'].reset_index(drop=True))], axis=1)
df_3d
0 1 2 category title
0 -4.124536 4.492683 0.195532 Development Mobile Application Manual Testing - IOS Applic...
1 -9.513578 3.785385 -2.615611 Development Learn C++ Programming for beginners from basi...
2 -4.230656 3.113482 -2.179557 Development Learning Flask
3 6.002980 -4.374147 -11.042514 Development LEARNING PATH: Go: Real-World Go Solutions for...
4 -12.164592 8.173161 -0.145709 Development Interactive maps with Mapbox!
... ... ... ... ... ...
10568 -4.203248 0.577658 -1.131525 Development Introduction to C Programming for the Raspberr...
10569 -8.885592 3.946822 4.935661 Marketing How to Create a Marketing Video for Your Busin...
10570 -10.786600 6.954830 -1.057528 Development Using JSON In Unreal Engine 4 - C++
10571 8.539465 -13.127536 -30.164514 Development Building Recommender Systems with Machine Lear...
10572 -6.286109 1.255941 -3.256990 Development C and C++ Programming : Step-by-Step Tutorial

10573 rows × 5 columns

df_3d.columns= ['1','2','3' ,'cat','title']
import plotly.express as px
fig = px.scatter_3d(df_3d, x='1', y='2', z='3',color='cat',size_max=10, hover_data=['title'], title = "3d scatter-plot of PCA results")
fig.update_traces(marker=dict(size=4)) 
fig.show()

Capture4.png

We can see a clustering pattern for the categories and a plane of separation between business (orange), the more humanistic courses (red, turquoise) and development courses (dark blue), while IT courses (purple) show no separation from development courses.

A nice example of the validity and usability of our model is that most courses in the upper-left corner of the image above (blue and purple) share the same content, the Spring framework, so there is apparently even a sub-clustering pattern by course content.


Using LSTM Neural Network For comparison


For the sake of comparison and completeness, we will also use an LSTM NN model.

We will regard this model as a black box and won't elaborate on its mechanism,

but suffice it to say that the LSTM's power lies in its ability to create a contextual connection between words in the text, something that is mostly lacking in the TF-IDF approach.

This is done by sequentially embedding the words into vectors in the first layer, and by "smart" memory neurons in the inner layer that can combine past information with the new information coming in, changing the state of the vector, or "forget" states that are less useful, as training proceeds.
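
For readers who do want a peek inside the box, here is a rough NumPy sketch of a single LSTM time step using the standard gate equations. It is an illustration only, not the Keras implementation; the sizes, the random weights and the name `lstm_step` are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*units, input_dim), U: (4*units, units), b: (4*units,)
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new info to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much old memory to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values built from the new input
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much memory to expose
    c_t = f * c_prev + i * g              # combine past memory with new information
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 8, 4                   # toy sizes, much smaller than the real model
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# A "sentence" of 5 already-embedded words, processed one word at a time
sequence = rng.normal(size=(5, input_dim))
h = np.zeros(units)
c = np.zeros(units)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)
```

The final `h` is what a classifier layer would see: a fixed-size summary of the whole word sequence, which is where the word-order context comes from.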

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer
from keras.layers import LSTM
from keras.layers import SimpleRNN

here we are only using the 'text' column after the cleaning procedure


ps = PorterStemmer()

df_m['text_stem'] =  df_m['text'].map(lambda x: ' '.join(ps.stem(y) for y in x.split()))
X = df_m[['text_stem']]  
y = df_m['is_dev']     

# split training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

t = Tokenizer(2500)
#3046 max words --> leaves 2500 most frequent words
y_train = np.where(y_train == True, 1, 0)
# creates a mapping dictionary from words to integers
t.fit_on_texts(X_train['text_stem'])

vocab_size = len(t.word_index) + 1

# integer-encode the train and test documents using the mapping
encoded_docs = t.texts_to_sequences(X_train['text_stem'])
encoded_docs_test = t.texts_to_sequences(X_test['text_stem'])

# adds zeros at the start so that all sequences have the same length
encoded_docs_padded= pad_sequences(encoded_docs , padding='pre')

len_row= len(encoded_docs_padded[1])

encoded_docs_padded_test= pad_sequences(encoded_docs_test ,maxlen=len_row,  padding='pre')
len(encoded_docs_padded[1])
2929
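
The encode-and-pad step can be mimicked in a few lines of plain Python. This is a simplified stand-in for illustration, not the Keras code: the function names and the three toy sentences are made up, and ties between equally frequent words are broken alphabetically here.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}  # 0 is kept for padding

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace words by their indices; unknown words are dropped."""
    seqs = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]  # keep only the most frequent words
        seqs.append(seq)
    return seqs

def pad_pre(seqs, maxlen=None):
    """Left-pad with zeros (and cut from the front) so all rows have length maxlen."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        if len(s) > maxlen:
            s = s[len(s) - maxlen:]
        padded.append([0] * (maxlen - len(s)) + s)
    return padded

train_texts = ["learn python code", "learn web code", "learn market"]
test_texts = ["learn java code fast now"]

word_index = fit_word_index(train_texts)
padded_train = pad_pre(texts_to_sequences(train_texts, word_index))
# The test set reuses the training vocabulary and the training row length
padded_test = pad_pre(texts_to_sequences(test_texts, word_index), maxlen=len(padded_train[0]))

print(word_index)
print(padded_train)
print(padded_test)
```

Note that words seen only in the test set ("java", "fast", "now") simply disappear, which is why the vocabulary is fit on the training texts only.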

The Keras tokenizer builds a vocabulary; these are the word counts it collected:

import json
json.loads(t.get_config()['word_counts'])
{'mobil': 2353,
 'applic': 9933,
 'manual': 389,
 'test': 6477,
 'io': 2440,
 'bug': 224,
 'track': 709,
 'debug': 394,
 'realtim': 434,
 'process': 4330,
 'agil': 592,
 'methodolog': 306,
 'develop': 17518,
 'hand': 1156,
 'held': 52,
 'devic': 989,
 'function': 4658,
 'usabl': 78,
 'consist': 555,
 'autom': 2514,
 'type': 2606,
 'either': 419,
 'come': 2404,
 'preinstal': 5,
 'instal': 2151,
 'softwar': 4359,
 'distribut': 658,
 'platform': 2348,
 'wit': 44,
 'phenomen': 17,
 'growth': 391,
 'past': 612,
 'year': 3474,
 'a': 5152,
 'studi': 1264,
 'conduct': 209,
 'yanke': 2,
 'group': 986,
 'predict': 539,
 'gener': 2263,
 '4': 3529,
 '2': 5395,
 'billion': 177,
 'revenu': 256,
 '999': 22011,
 '7': 1775,
 'u': 199,
 's': 4457,
 'smartphon': 139,
 'app': 10038,
 'download': 1687,
 'thi': 18552,
 'cours': 57846,
 'go': 5768,
 'cover': 5707,
 'follow': 3868,
 'approach': 1649,
 '1': 4667,
 'hardwar': 252,
 'the': 15948,
 'includ': 5927,
 'intern': 603,
 'processor': 67,
 'screen': 746,
 'size': 401,
 'resolut': 111,
 'space': 457,
 'memori': 514,
 'camera': 356,
 'radio': 77,
 'bluetooth': 15,
 'wifi': 48,
 'etc': 1295,
 'sometim': 242,
 'refer': 733,
 'as': 1802,
 'simpl': 3366,
 'work': 10927,
 'it': 10336,
 'call': 1265,
 'differenti': 165,
 'earlier': 116,
 'method': 2481,
 'even': 3159,
 'basic': 7544,
 'differ': 4439,
 'import': 3319,
 'understand': 8620,
 'nativ': 685,
 'creat': 16205,
 'use': 30943,
 'like': 6918,
 'tablet': 156,
 'b': 334,
 'web': 9349,
 'serversid': 133,
 'access': 2489,
 'websit': 6120,
 'browser': 637,
 'chrome': 142,
 'firefox': 23,
 'connect': 1485,
 'network': 2710,
 'wireless': 62,
 'c': 4272,
 'hybrid': 165,
 'combin': 768,
 'they': 640,
 'run': 2578,
 'offlin': 138,
 'written': 640,
 'technolog': 2828,
 'html5': 930,
 'css': 2310,
 'learn': 40085,
 'program': 11865,
 'beginn': 4583,
 'advanc': 4523,
 'scratch': 2529,
 'best': 4505,
 'exampl': 3774,
 'purpos': 639,
 'languag': 6402,
 'bjarn': 5,
 'stroustrup': 5,
 'at': 1079,
 't': 2007,
 'bell': 19,
 'lab': 539,
 'variou': 1709,
 'game': 8023,
 'object': 2863,
 'orient': 762,
 'teach': 5939,
 'everyth': 2840,
 'start': 9846,
 'oper': 2049,
 'concept': 5356,
 'topic': 3624,
 'everi': 3307,
 'lesson': 1968,
 'explain': 2639,
 'detail': 2111,
 'code': 10566,
 'those': 177,
 'want': 7517,
 'strong': 661,
 'knowledg': 4674,
 'take': 8249,
 'divid': 298,
 'three': 772,
 'part': 3064,
 'first': 4249,
 'second': 874,
 'third': 389,
 'flask': 183,
 'get': 12936,
 'power': 3724,
 'framework': 3927,
 'python': 6248,
 'that': 2758,
 'easi': 3941,
 'with': 3218,
 'grow': 1179,
 'skill': 6782,
 'gap': 159,
 'need': 8582,
 'talent': 190,
 'greater': 195,
 'ever': 1205,
 'befor': 634,
 'ground': 370,
 'build': 13049,
 'minimalist': 16,
 'easytolearn': 15,
 'launch': 538,
 'career': 2139,
 'entrepreneur': 630,
 'microframework': 5,
 'make': 11394,
 'say': 1170,
 'give': 3527,
 'fundament': 2258,
 'well': 5677,
 'handson': 1175,
 'experi': 4830,
 'requir': 2747,
 'success': 2892,
 'turn': 691,
 'comput': 2895,
 'modern': 1161,
 'machin': 3447,
 'next': 2270,
 'move': 1838,
 'beyond': 360,
 'static': 396,
 'databaseback': 3,
 'dynam': 1193,
 'we': 6982,
 'won': 125,
 'stop': 514,
 'there': 2555,
 'll': 3814,
 'also': 8825,
 'implement': 3352,
 'full': 2213,
 'authent': 686,
 'system': 4593,
 'final': 1871,
 'extend': 344,
 'integr': 1982,
 'thirdparti': 85,
 'api': 3290,
 'when': 975,
 'finish': 1021,
 'fulli': 1060,
 'equip': 336,
 'custom': 3262,
 '0': 803,
 'latest': 1073,
 'version': 1717,
 'avail': 1762,
 'provid': 4089,
 'relev': 468,
 'inform': 2846,
 'content': 4193,
 'legaci': 101,
 'user': 4048,
 'about': 1344,
 'author': 1769,
 'lalith': 2,
 'polepeddi': 1,
 'sinc': 1050,
 'discov': 1000,
 'way': 6688,
 'he': 2635,
 'tut': 5,
 'techpro': 1,
 'asid': 36,
 'interest': 1709,
 'appli': 2492,
 'scienc': 1944,
 'address': 457,
 'problem': 2605,
 'parallel': 295,
 'domain': 689,
 'biolog': 43,
 'path': 1312,
 'realworld': 766,
 'solut': 1939,
 'gopher': 2,
 'modular': 129,
 'testabl': 46,
 'one': 6805,
 'effici': 1150,
 'highlyperform': 2,
 'seen': 368,
 'increas': 1366,
 'rate': 658,
 'adopt': 223,
 'mainli': 152,
 'lightweight': 97,
 'display': 642,
 'great': 3288,
 'robust': 335,
 'perform': 2431,
 'varieti': 528,
 'open': 1363,
 'sourc': 2156,
 'reliabl': 251,
 'often': 641,
 'golang': 135,
 'googl': 2488,
 'deriv': 116,
 'addit': 1406,
 'featur': 3921,
 'garbag': 38,
 'collect': 772,
 'safeti': 139,
 'dynamictyp': 2,
 'capabl': 592,
 'builtin': 208,
 'larg': 731,
 'standard': 922,
 'librari': 1729,
 'if': 5273,
 'foundat': 1253,
 'improv': 2169,
 'packt': 208,
 'video': 7566,
 'seri': 1301,
 'individu': 618,
 'product': 3835,
 'put': 1502,
 'togeth': 1405,
 'logic': 948,
 'stepwis': 125,
 'manner': 549,
 'highlight': 304,
 'are': 1837,
 'encod': 60,
 'strategi': 1906,
 'design': 9905,
 'pattern': 1661,
 'deal': 755,
 'storag': 423,
 'data': 13379,
 'mysql': 1246,
 'let': 1956,
 'quick': 940,
 'look': 4764,
 'journey': 1096,
 'tutori': 2050,
 'leav': 377,
 'off': 116,
 'you': 20630,
 'immedi': 521,
 'practic': 6732,
 'offer': 1512,
 'avoid': 609,
 'common': 1111,
 'mistak': 382,
 'new': 6359,
 'initi': 425,
 'upon': 450,
 'i': 19543,
 'o': 180,
 'file': 2925,
 'command': 1081,
 'line': 1129,
 'tool': 5328,
 'error': 701,
 'handl': 1115,
 'help': 8440,
 'structur': 2830,
 'log': 346,
 'context': 319,
 'packag': 869,
 'databas': 4027,
 'nosql': 180,
 'mongodb': 572,
 'across': 609,
 'microservic': 511,
 'further': 241,
 'explor': 1553,
 'interact': 1927,
 'commandlin': 60,
 'via': 543,
 'demonstr': 757,
 'tune': 162,
 'lastli': 117,
 'reactiv': 246,
 'serverless': 372,
 'tip': 1203,
 'trick': 673,
 'by': 2676,
 'end': 4017,
 'abl': 3891,
 'bridg': 113,
 'meet': 711,
 'your': 3350,
 'expert': 1778,
 'esteem': 215,
 'ensur': 763,
 'smooth': 280,
 'aaron': 24,
 'torr': 2,
 'receiv': 902,
 'master': 3775,
 'degre': 298,
 'mexico': 13,
 'institut': 198,
 'mine': 411,
 'high': 1373,
 'largescal': 60,
 'current': 1298,
 'lead': 1202,
 'team': 1311,
 'refin': 88,
 'focus': 993,
 'emphasi': 83,
 'continu': 1281,
 'deliveri': 344,
 'publish': 1113,
 'number': 1240,
 'paper': 287,
 'sever': 1106,
 'patent': 45,
 'area': 960,
 'passion': 552,
 'share': 1575,
 'idea': 1639,
 'other': 1423,
 'huge': 593,
 'fan': 229,
 'backend': 644,
 'map': 1052,
 'mapbox': 21,
 'studio': 1141,
 'gl': 13,
 'js': 1785,
 'wide': 636,
 'survey': 105,
 'know': 6316,
 'find': 2976,
 'format': 964,
 'style': 1286,
 'interfac': 1577,
 'truli': 349,
 'respons': 1615,
 'complex': 1718,
 'assum': 346,
 'littl': 824,
 'geograph': 36,
 'walk': 954,
 'step': 6350,
 'youll': 1504,
 'big': 1536,
 'beauti': 745,
 'time': 7685,
 'pro': 872,
 'qtp': 52,
 'uft': 119,
 'becom': 3834,
 'tester': 306,
 'award': 203,
 'win': 296,
 'hp': 87,
 'profession': 3727,
 'udemi': 2437,
 'seller': 149,
 'materi': 1862,
 'last': 892,
 'updat': 2648,
 'novemb': 79,
 '27th': 10,
 'over': 818,
 '000': 1117,
 'student': 5802,
 'enrol': 2064,
 'worldwid': 228,
 'commun': 1960,
 'still': 910,
 'count': 210,
 'anoth': 829,
 'popular': 1971,
 'us': 1604,
 'showcas': 117,
 'just': 525,
 'kept': 93,
 'intro': 268,
 'free': 3388,
 'preview': 346,
 'conveni': 105,
 'pleas': 999,
 'feel': 1943,
 'drive': 519,
 'lose': 433,
 'opportun': 978,
 'unifi': 58,
 'previous': 131,
 'known': 363,
 'cost': 853,
 'fortun': 184,
 'market': 4132,
 'leader': 326,
 'industri': 1574,
 'nowaday': 119,
 'mani': 4163,
 'lowcost': 62,
 'came': 143,
 'play': 997,
 'control': 2367,
 'better': 2374,
 'suitabl': 315,
 'peopl': 3481,
 'nonprogram': 5,
 'background': 720,
 'support': 2113,
 'vb': 60,
 'script': 1528,
 'howev': 723,
 'difficult': 574,
 'endtoend': 85,
 'train': 3793,
 'essenti': 1420,
 'gain': 1453,
 'competit': 437,
 'advantag': 665,
 'today': 2285,
 'qaevers': 7,
 'commit': 260,
 'uniqu': 923,
 'deliv': 737,
 'onlin': 3286,
 'in': 9226,
 'aspect': 908,
 'level': 3771,
 'treat': 93,
 'freshman': 8,
 'singl': 1197,
 'thoroughli': 130,
 'brush': 139,
 'specif': 1334,
 'entir': 790,
 'overview': 1189,
 'checkpoint': 26,
 'parameter': 23,
 'variabl': 1085,
 'output': 551,
 'valu': 1379,
 'descript': 414,
 'environ': 1724,
 'read': 1664,
 'write': 4464,
 'excel': 2542,
 'driven': 410,
 'keyword': 398,
 'how': 8658,
 'to': 4275,
 'elementor': 148,
 'and': 4596,
 'wordpress': 2536,
 'sale': 1436,
 'funnel': 196,
 'easili': 1491,
 'land': 485,
 'page': 3024,
 'hello': 258,
 'welcom': 1034,
 'stun': 230,
 'whole': 584,
 'convert': 496,
 'servic': 3428,
 'client': 1624,
 'right': 3354,
 'place': 1217,
 'usual': 230,
 'outsoruc': 1,
 'thing': 2782,
 'lot': 2582,
 'up': 771,
 'kind': 658,
 'minimum': 126,
 'plu': 470,
 'wait': 932,
 'top': 1403,
 'might': 688,
 'exactli': 888,
 'expect': 686,
 'busi': 6762,
 'charg': 274,
 'simpli': 736,
 'flexibl': 402,
 'fast': 1267,
 'pay': 667,
 'dime': 15,
 'them': 1258,
 'alway': 1283,
 'ad': 2588,
 'weekli': 74,
 'basi': 308,
 'anyth': 669,
 'els': 526,
 'so': 2860,
 'for': 3456,
 'ps': 43,
 'decid': 400,
 'day': 2773,
 'money': 2872,
 'back': 1982,
 'question': 4230,
 'ask': 1517,
 'risk': 792,
 'involv': 600,
 'whatsoev': 31,
 'interview': 1732,
 'prepar': 1564,
 'save': 1226,
 'architectur': 1270,
 'fastest': 202,
 'world': 4317,
 'compani': 2525,
 'amazon': 936,
 'netflix': 111,
 'base': 2605,
 'achiev': 1056,
 'goal': 1341,
 'field': 1113,
 'engin': 2948,
 'may': 1412,
 'salari': 294,
 'similar': 445,
 'qualif': 78,
 'without': 2274,
 'benefit': 1202,
 'case': 1183,
 'what': 5681,
 'biggest': 260,
 'me': 759,
 'demand': 710,
 'higher': 347,
 'job': 2648,
 'good': 2709,
 'theoret': 285,
 'but': 1424,
 'rang': 666,
 'secur': 2318,
 'pact': 3,
 'bulkhead': 3,
 'attend': 150,
 'spend': 748,
 'search': 1485,
 'internet': 891,
 'alreadi': 1429,
 'compil': 367,
 'list': 1923,
 'answer': 1753,
 'ye': 612,
 'view': 1350,
 'watch': 1380,
 'begin': 1714,
 'onc': 895,
 'tri': 1509,
 'word': 1082,
 'mark': 270,
 'could': 1017,
 'yourself': 731,
 'then': 1631,
 'pass': 733,
 'after': 1806,
 'face': 702,
 'technic': 1096,
 'contain': 1434,
 'fresher': 58,
 'architect': 540,
 'difficulti': 168,
 'vari': 111,
 'experienc': 640,
 'happen': 510,
 'chang': 2406,
 'futur': 1343,
 'from': 1533,
 'keep': 1365,
 'our': 659,
 'aim': 533,
 'sampl': 603,
 '3': 4319,
 'role': 669,
 'soa': 11,
 '5': 3754,
 'is': 1561,
 'tailor': 77,
 'templat': 1234,
 'organ': 1360,
 '6': 1860,
 'disadvantag': 63,
 'decompos': 4,
 'monolith': 34,
 'characterist': 99,
 '8': 1555,
 'bound': 46,
 '9': 1051,
 'point': 1355,
 'rememb': 476,
 'prefer': 252,
 'synchron': 71,
 'asynchron': 177,
 'orchestr': 110,
 'choreographi': 2,
 'issu': 720,
 'rest': 1651,
 'http': 367,
 'can': 797,
 'state': 796,
 'extens': 697,
 'dri': 47,
 'semant': 90,
 'buy': 897,
 'commerci': 225,
 'shelf': 14,
 'whi': 1981,
 'break': 560,
 'ubiquit': 12,
 'per': 390,
 'host': 999,
 'model': 3927,
 'mike': 60,
 'cohn': 1,
 'pyramid': 21,
 'mock': 124,
 'stub': 12,
 'erad': 3,
 'nondetermin': 1,
 'consum': 349,
 'contract': 325,
 'cdc': 4,
 'separ': 346,
 'deploy': 1791,
 'releas': 599,
 'canari': 6,
 'mean': 1649,
 'repair': 69,
 'mttr': 1,
 'failur': 159,
 'mtbf': 1,
 'crossfunct': 13,
 'monitor': 448,
 'multipl': 1439,
 'correl': 68,
 'id': 198,
 'certif': 1832,
 'key': 1483,
 'public': 657,
 'confus': 300,
 'deputi': 3,
 'consid': 466,
 'conway': 1,
 'law': 302,
 'circuit': 162,
 'breaker': 12,
 'idempot': 1,
 'scale': 552,
 'queri': 1218,
 'segreg': 18,
 'cqr': 7,
 'cach': 183,
 'cap': 42,
 'theorem': 24,
 'discoveri': 80,
 'document': 1179,
 'scenario': 443,
 'major': 694,
 'principl': 992,
 'pascal': 35,
 'maintain': 660,
 'crossplatform': 197,
 'coder': 139,
 'divers': 97,
 'excit': 721,
 'or': 1067,
 'old': 338,
 'vital': 117,
 'figur': 329,
 'bewild': 5,
 'guess': 139,
 'clean': 444,
 'feet': 32,
 'reason': 861,
 'exist': 781,
 'indepth': 530,
 'perfect': 1081,
 'complet': 7162,
 'guid': 3058,
 'project': 8473,
 'all': 2649,
 '500mb': 1,
 've': 779,
 'along': 1974,
 'each': 747,
 'section': 4350,
 'dedic': 315,
 'math': 521,
 'input': 536,
 'statement': 989,
 'loop': 880,
 'string': 493,
 'array': 766,
 'record': 827,
 'date': 461,
 'procedur': 377,
 'eas': 407,
 'set': 3742,
 'progress': 691,
 'oldest': 14,
 'around': 1298,
 'intent': 137,
 'encourag': 293,
 'highlevel': 90,
 'imper': 34,
 'precursor': 8,
 'compat': 161,
 'syntax': 562,
 'microsoft': 1847,
 'own': 652,
 'pace': 566,
 'of': 1091,
 'intricaci': 16,
 'instructor': 2330,
 'ms': 167,
 'captur': 241,
 'actual': 1165,
 'desktop': 662,
 'verbal': 41,
 'do': 2732,
 'reduc': 470,
 'instruct': 747,
 'tour': 83,
 'brand': 821,
 'show': 3722,
 'add': 2318,
 'task': 1334,
 'resourc': 1629,
 'crop': 28,
 'gantt': 30,
 'chart': 834,
 'behav': 62,
 'person': 2241,
 'timelin': 90,
 'macro': 178,
 'within': 1717,
 'repetit': 129,
 'manag': 5750,
 'allow': 2031,
 'alongsid': 122,
 'titl': 322,
 'matter': 647,
 'weapon': 112,
 'sfx': 10,
 'less': 836,
 'hour': 2933,
 'daw': 15,
 'sound': 790,
 'effect': 2760,
 'layer': 434,
 'never': 1216,
 'heard': 218,
 'georgek': 2,
 'music': 603,
 'compos': 222,
 'sonic': 6,
 'specialist': 165,
 'portfolio': 550,
 'my': 1103,
 'limit': 566,
 'elder': 24,
 'scroll': 153,
 'skywind': 2,
 'darkfal': 2,
 'rise': 139,
 'agon': 3,
 'bulletrag': 2,
 'coca': 5,
 'cola': 9,
 'mystor': 2,
 'more': 3339,
 'throughout': 808,
 'obstacl': 108,
 'importantli': 233,
 'overcam': 4,
 'moreov': 107,
 'lectur': 3520,
 'mentor': 165,
 'aspir': 198,
 'craft': 246,
 'explos': 43,
 'digit': 1035,
 'audio': 536,
 'workstat': 29,
 'automat': 419,
 'rifl': 3,
 'handgun': 1,
 'ultim': 642,
 'destruct': 21,
 'furthermor': 95,
 'period': 249,
 'guidanc': 250,
 'forum': 255,
 'nich': 275,
 'stuck': 301,
 'struggl': 368,
 'goto': 46,
 'special': 853,
 'doe': 200,
 'factori': 135,
 'class': 2803,
 'relationship': 793,
 'inherit': 290,
 'much': 3179,
 'see': 3627,
 'abil': 795,
 'properli': 322,
 'techniqu': 3402,
 'creation': 845,
 'now': 2278,
 'storylin': 49,
 'elearn': 113,
 'wonder': 570,
 'prebuilt': 22,
 'inde': 63,
 'mayb': 382,
 'budget': 337,
 'purchas': 675,
 'vendor': 132,
 'perhap': 156,
 'quit': 394,
 'said': 232,
 'agre': 48,
 'possibl': 1507,
 'ive': 1136,
 'broken': 175,
 'board': 376,
 'review': 1314,
 'layout': 598,
 'introduc': 1184,
 'trigger': 224,
 'condit': 600,
 'fanci': 67,
 'intermedi': 681,
 'while': 537,
 'bit': 460,
 'quicker': 73,
 'previou': 509,
 'articul': 68,
 'no': 1717,
 'worri': 506,
 'choos': 715,
 'dive': 753,
 'q': 408,
 'built': 967,
 'acceler': 165,
 'minichalleng': 4,
 'replic': 140,
 'real': 3643,
 'encount': 150,
 'realiz': 255,
 'warn': 57,
 'amount': 530,
 'ill': 990,
 'think': 1632,
 'forward': 781,
 'join': 1720,
 'hd': 262,
 'webmast': 22,
 'than': 79,
 'week': 591,
 'clearli': 286,
 'precis': 168,
 'thank': 1233,
 'pino': 1,
 'amato': 1,
 'veri': 523,
 'useful': 16,
 'especi': 438,
 'maria': 5,
 'kastani': 1,
 'm': 499,
 'thumb': 20,
 'jonathan': 36,
 'nichol': 1,
 'essential': 1,
 'be': 1005,
 'powerus': 9,
 'confid': 1888,
 'minut': 953,
 'away': 585,
 'depth': 430,
 'optim': 1024,
 'visual': 2071,
 'pdf': 410,
 'mp3': 43,
 'anytim': 133,
 'everywher': 156,
 'bonu': 691,
 'premium': 148,
 'theme': 989,
 'wp': 43,
 'social': 1384,
 'press': 127,
 'complimentari': 14,
 'highli': 916,
 'customiz': 49,
 'ideal': 366,
 'technich': 1,
 'prior': 564,
 'term': 674,
 'everyon': 728,
 'probabl': 515,
 'almost': 627,
 'true': 492,
 'blog': 1010,
 'absolut': 974,
 'dozen': 215,
 'plugin': 1044,
 'sort': 591,
 'amaz': 1485,
 'imagin': 454,
 'membership': 109,
 'site': 1782,
 'regist': 271,
 'r': 1595,
 'post': 999,
 'pictur': 359,
 'comment': 338,
 'edit': 1027,
 'tag': 427,
 'widget': 233,
 'transfer': 243,
 'ustom': 1,
 'appear': 233,
 ...}

Using the Keras framework to build the layers of the LSTM model:

model = Sequential()

model.add(Embedding(vocab_size,128, input_length=len_row))
model.add(Dropout(0.2))
model.add(LSTM(64))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
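
The parameter counts that `model.summary()` reports for this architecture can be checked by hand. In the sketch below, `vocab_size` is not printed anywhere in the notebook; it is inferred from the reported embedding parameter count, so treat it as an assumption.

```python
embedding_dim = 128
lstm_units = 64

# Embedding: one 128-dim vector per vocabulary entry
embedding_params_reported = 4_389_504          # taken from the model summary
vocab_size = embedding_params_reported // embedding_dim   # inferred, not printed in the notebook

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus a bias
dense_params = lstm_units * 1 + 1

total = vocab_size * embedding_dim + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total)
```

Almost all of the trainable parameters sit in the embedding layer, so the vocabulary size is the main knob for shrinking this model.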

Training the model (for 10 epochs):


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())


batch_size = 32

history = model.fit(encoded_docs_padded, y_train, batch_size =batch_size, 
                   epochs = 10,  validation_split=0.1, verbose = 1,
                   steps_per_epoch=120,
                )


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, 2929, 128)         4389504   
                                                                 
 dropout_2 (Dropout)         (None, 2929, 128)         0         
                                                                 
 lstm_1 (LSTM)               (None, 64)                49408     
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 4,438,977
Trainable params: 4,438,977
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
120/120 [==============================] - 121s 994ms/step - loss: 0.4385 - accuracy: 0.8010 - val_loss: 0.4034 - val_accuracy: 0.8374
Epoch 2/10
120/120 [==============================] - 118s 988ms/step - loss: 0.3185 - accuracy: 0.8935 - val_loss: 0.2724 - val_accuracy: 0.9140
Epoch 3/10
120/120 [==============================] - 118s 984ms/step - loss: 0.2503 - accuracy: 0.9230 - val_loss: 0.3063 - val_accuracy: 0.8922
Epoch 4/10
120/120 [==============================] - 119s 989ms/step - loss: 0.3097 - accuracy: 0.9010 - val_loss: 0.2989 - val_accuracy: 0.9074
Epoch 5/10
120/120 [==============================] - 118s 987ms/step - loss: 0.2752 - accuracy: 0.9060 - val_loss: 0.3076 - val_accuracy: 0.8932
Epoch 6/10
120/120 [==============================] - 119s 990ms/step - loss: 0.2360 - accuracy: 0.9253 - val_loss: 0.2760 - val_accuracy: 0.9130
Epoch 7/10
120/120 [==============================] - 119s 989ms/step - loss: 0.2258 - accuracy: 0.9255 - val_loss: 0.3169 - val_accuracy: 0.9140
Epoch 8/10
120/120 [==============================] - 119s 990ms/step - loss: 0.2051 - accuracy: 0.9358 - val_loss: 0.2580 - val_accuracy: 0.9130
Epoch 9/10
120/120 [==============================] - 119s 996ms/step - loss: 0.1910 - accuracy: 0.9398 - val_loss: 0.2910 - val_accuracy: 0.9187
Epoch 10/10
120/120 [==============================] - 122s 1s/step - loss: 0.2089 - accuracy: 0.9301 - val_loss: 0.2725 - val_accuracy: 0.9074
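
The "Param #" column in the summary above can be checked by hand. The sketch below is plain arithmetic, not part of the original notebook: it takes the embedding size (128) and LSTM width (64) from the model definition, infers `vocab_size` from the reported embedding parameter count, and uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias).

```python
# Reproduce the "Param #" column of the model summary by hand.
embedding_dim = 128            # second argument of Embedding(...)
lstm_units = 64                # argument of LSTM(...)
embedding_params = 4_389_504   # value reported in the summary

# The embedding stores one vector per vocabulary entry,
# so vocab_size can be inferred from the reported count.
vocab_size = embedding_params // embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total_params)
```

Almost all of the trainable parameters sit in the embedding layer, which is worth keeping in mind when weighing vocabulary size against training time.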

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['accuracy','val_accuracy'])
[Figure: training and validation accuracy per epoch]
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss','val_loss'])
[Figure: training and validation loss per epoch]
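
The validation curves fluctuate from epoch to epoch, so the last epoch is not necessarily the best one. As a minimal sketch (not part of the original notebook), the values below are copied from the training log and searched for the best epoch. In a real run, Keras' `EarlyStopping` callback with `restore_best_weights=True` could serve a similar purpose; that is a suggestion, not something the notebook does.

```python
# Validation metrics copied from the training log (epochs 1..10).
val_loss = [0.4034, 0.2724, 0.3063, 0.2989, 0.3076,
            0.2760, 0.3169, 0.2580, 0.2910, 0.2725]
val_accuracy = [0.8374, 0.9140, 0.8922, 0.9074, 0.8932,
                0.9130, 0.9140, 0.9130, 0.9187, 0.9074]

# Epoch numbers are 1-based, list indices are 0-based.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1

print("lowest val_loss at epoch", best_loss_epoch)
print("highest val_accuracy at epoch", best_acc_epoch)
```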
predictions = model.predict(encoded_docs_padded_test)
final_table = pd.DataFrame({'preds':np.where(predictions>=0.5, 1, 0).reshape(-1),'true':np.where(y_test.to_numpy()==True, 1, 0)})

results

from sklearn.metrics import classification_report

pd.DataFrame(classification_report(final_table['true'], final_table['preds'], output_dict=True)).rename(columns= {'0': 'non-Development', '1':'Development'})
           non-Development  Development  accuracy    macro avg  weighted avg
precision         0.890691     0.927860  0.912908     0.909275      0.912939
recall            0.892580     0.926540  0.912908     0.909560      0.912908
f1-score          0.891634     0.927199  0.912908     0.909417      0.912923
support        1415.000000  2110.000000  0.912908  3525.000000   3525.000000
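
To make the report easier to read, the sketch below rebuilds the per-class metrics from raw counts. The counts are not printed in the notebook; they are inferred here as recall × support (an assumption of this example), and the precision, F1 and accuracy values then follow from their usual definitions.

```python
# Supports and recalls as reported in the classification report.
support_non_dev, support_dev = 1415, 2110
recall_non_dev, recall_dev = 0.892580, 0.926540

# Inferred counts (assumption: recall * support, rounded to whole samples).
correct_non_dev = round(recall_non_dev * support_non_dev)  # true negatives
correct_dev = round(recall_dev * support_dev)              # true positives
wrong_non_dev = support_non_dev - correct_non_dev  # non-Development predicted as Development
wrong_dev = support_dev - correct_dev              # Development predicted as non-Development

# Precision: correct predictions of a class / all predictions of that class.
precision_non_dev = correct_non_dev / (correct_non_dev + wrong_dev)
precision_dev = correct_dev / (correct_dev + wrong_non_dev)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_non_dev = f1(precision_non_dev, correct_non_dev / support_non_dev)
f1_dev = f1(precision_dev, correct_dev / support_dev)
accuracy = (correct_non_dev + correct_dev) / (support_non_dev + support_dev)

print(round(precision_non_dev, 6), round(precision_dev, 6))
print(round(f1_non_dev, 6), round(f1_dev, 6), round(accuracy, 6))
```

The "macro avg" column is the unweighted mean of the two class columns, while "weighted avg" weights each class by its support.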

We get slightly better results with the neural network approach.

The main improvement is in the '0' (non-Development) label, which suggests that the contextual approach adds to the model's ability to classify it correctly.

However, since a neural network model is hard to interpret, there are certainly benefits to using both approaches.
