from pathlib import Path
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\roysh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
import warnings
warnings.filterwarnings('ignore')
loading table
path_root = Path().resolve()
path = path_root / Path("udemy_development_task.csv")
df = pd.read_csv(path, index_col=[0])
df
| | title | description | category | longDescription |
|---|---|---|---|---|
0 | Python for Beginners | Learn Python programming from scratch with han... | Development | **Why Python ?**\n\n * Python is one of the w... |
1 | Design Patterns in Python | Learn the Design Patterns in a practical way u... | Development | Learning Design Pattern is a voracious learnin... |
2 | Unity Mobile C# Developer Course | Create and deploy games for Android & iOS usin... | Development | Build 3 simple mobile games using the free Uni... |
3 | Django | Build a Smart Chatbot Using AI | Learn Django By Building Chatbot Using AI | Development | **This courses will teach you How to Build a C... |
4 | Flutter Augmented Reality Course - Build 10+ A... | Learn Google's Flutter ARCore & Become AR Deve... | Development | In this course you will learn how to develope ... |
... | ... | ... | ... | ... |
5656 | What the FICO 2.0: The Essential Guide to Cred... | Your Complete Guide to Fixing Bad Credit, Buil... | Finance & Accounting | Trying to understand credit can be somewhat co... |
5657 | Manual Bookkeeping | Level 2 - update manual ledgers, prepare a pro... | Finance & Accounting | Manual bookkeeping covers the material equival... |
5658 | CorelDRAW for Beginners: Graphic Design in Cor... | Learn how to design in Corel DRAW with these e... | Design | **Start creating professional graphic design i... |
5659 | SEO WordPress Masterclass: The Best Google Ran... | Learn Website Search Engine Optimization With ... | Marketing | **Learn the most effective SEO Wordpress strat... |
5660 | Learn Thai for Beginners: The Ultimate 105-Les... | You learn Thai minutes into your first lesson.... | Teaching & Academics | Are you ready to start speaking, writing and u... |
14151 rows × 4 columns
df.index.nunique()
13159
df = df.reset_index(drop=True)
df.index.nunique()
14151
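The gap between 14,151 rows and 13,159 unique labels means some index labels repeat, which is why the index is reset. A minimal sketch (on a toy frame, not the real data) of how to list the duplicated labels first:

```python
import pandas as pd

# toy frame with a repeated index label, mimicking the duplicate-index situation above
toy = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 0, 1])

# index.duplicated() marks every repeat after the first occurrence
dup_labels = toy.index[toy.index.duplicated()].unique()
print(list(dup_labels))  # → [0]
```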
check nulls
df.isna().sum()
title                0
description          5
category            48
longDescription      0
dtype: int64
(df['longDescription'] == '\n\n').sum()
4
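The equality check above only catches long descriptions that are exactly `'\n\n'`; a more general test for whitespace-only text uses `str.strip()`. A sketch on a toy series (the example values are made up):

```python
import pandas as pd

s = pd.Series(['real text', '\n\n', '   ', ''])

# strip() removes all leading/trailing whitespace, so newlines-only,
# spaces-only, and empty strings all collapse to ''
blank = s.str.strip() == ''
print(blank.sum())  # → 3
```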
df = df.dropna()
how many categories?
print("number of categories:", df['category'].nunique())
print("\ncategories:\n")
pd.DataFrame(df['category'].unique(), columns=['category name'])
number of categories: 13

categories:
| | category name |
|---|---|
0 | Development |
1 | Teaching & Academics |
2 | Business |
3 | IT & Software |
4 | Personal Development |
5 | Finance & Accounting |
6 | Music |
7 | Design |
8 | Marketing |
9 | Photography & Video |
10 | Lifestyle |
11 | Office Productivity |
12 | Health & Fitness |
category distribution
we can see below that "Development" is by far the most popular category, accounting for about 60% of the courses. "Business" is a distant second with only about a 7.7% share, so all the non-development categories combined make up the remaining ~40%. Also, the categories of a more humanistic nature (like "Music", "Photography & Video", and "Health & Fitness") are quite rare.
cats = df['category'].value_counts(normalize=True).reset_index()
cats.columns = ['category', '%']
cats['%'] = (cats['%'] * 100).round(2).astype(str) + '%'
cats
| | category | % |
|---|---|---|
0 | Development | 60.1% |
1 | Business | 7.68% |
2 | IT & Software | 5.72% |
3 | Personal Development | 5.65% |
4 | Teaching & Academics | 4.87% |
5 | Design | 4.66% |
6 | Marketing | 3.99% |
7 | Finance & Accounting | 2.56% |
8 | Office Productivity | 2.18% |
9 | Lifestyle | 0.97% |
10 | Photography & Video | 0.56% |
11 | Music | 0.49% |
12 | Health & Fitness | 0.42% |
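The share table above is just `value_counts(normalize=True)` plus percent formatting; the same pattern on a toy series (category names reused for illustration, counts made up):

```python
import pandas as pd

s = pd.Series(['Development'] * 6 + ['Business'] * 3 + ['Design'])

# normalize=True returns fractions of the total; format them as percent strings
shares = (s.value_counts(normalize=True) * 100).round(2).astype(str) + '%'
print(shares['Development'])  # → 60.0%
print(shares['Business'])     # → 30.0%
```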
counts = df['category'].value_counts()
fig = plt.figure(figsize=(10, 10))
plt.pie(counts.values, labels=counts.index,
        autopct=lambda p: '{:.1f}%\n({:.0f})'.format(p, p * counts.sum() / 100),
        pctdistance=0.8, labeldistance=1.1);
plt.title("course categories distribution")
Text(0.5, 1.0, 'course categories distribution')
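The `autopct` lambda in the pie call maps each wedge's percentage back to an absolute count: matplotlib passes the wedge's percentage, and multiplying by the total and dividing by 100 recovers the count. The round trip can be checked in isolation with made-up counts:

```python
# made-up category counts, summing to 100 for easy reading
counts = [60, 25, 15]
total = sum(counts)

# pct is what matplotlib would pass to autopct; pct * total / 100 undoes it
recovered = [round(c / total * 100 * total / 100) for c in counts]
print(recovered)  # → [60, 25, 15]
```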
longest text
the longest description is 164 characters long, and the longest longDescription is 32,103 characters long
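An equivalent, vectorised way to compute these lengths is `Series.str.len()`; a minimal sketch on a toy frame (column contents are made up, and this assumes no missing values, unlike the `map(lambda x: len(str(x)))` approach used below, which tolerates NaN):

```python
import pandas as pd

toy = pd.DataFrame({'description': ['short', 'a much longer description']})

# str.len() gives per-row string length without a Python-level lambda
toy['len'] = toy['description'].str.len()
longest = toy.loc[toy['len'].idxmax(), 'description']
print(longest)  # → a much longer description
```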
df['lengh_long_desc'] = df['longDescription'].map(lambda x: len(str(x)))
df['lengh_short_desc'] = df['description'].map(lambda x: len(str(x)))
print("lengh=",len(df.loc[df['lengh_short_desc'].idxmax()]['description']),'\n\n', df.loc[df['lengh_short_desc'].idxmax()]['description'])
lengh= 164 Learn how to develop software in Behaviour Driven Development (BDD) using Specflow - part of the Cucumber software family of tools for software testing automation.
print("lengh=",len(df.loc[df['lengh_long_desc'].idxmax()]['longDescription']),'\n\n', df.loc[df['lengh_long_desc'].idxmax()]['longDescription'])
lengh= 32103 **What is this course about:** In this course You will learn Hands on Devops Technology Concepts. We will Cover: * **Docker** * **Jenkins** * **GIT** * **Maven** **What will you learn from this lecture:** In particularly, you will learn: * Containerize a web-based application with a micro-service approach and automate it using Dockerfile. * Design multi-container applications and automate the workflow using Compose. * Scale Docker workflow with Docker Swarm, orchestrate and deploy a large-scale application across multiple hosts in the cloud. * Best practices of working with Docker software in the field. * In-depth knowledge about Docker software and confidence to help your company or your own project to apply the right Docker deployment workflow and continuously deliver better software. * Invaluable DevOps skills such as setting up continuous integration pipelines. ************************************************** **FAQ 1:** **DevOps Engineering Jobs and Career Opputunities:-** Engineering is a trending course from past few years ove the world. Every year there are many engineering graduates coming out from each part of country . Be it Chennai Or Kashmir, from north to south. Process of manufacturing engineers is continuing at a fast increasing rate. But jobs in engineering are very less. There is a strong need of quality engineers.For an IT job, there is fight from all section of Engineering. Be it computer engineer, civil engineer or electronic engineer. If you go for online job search, latest job trend is DevOps. DevOps is an abbreviation for its two words. Dev implies to development and Ops stand for operation. DevOps offers various types of job opportunities for you, like engineering project manager, development engineering manager, automation engineers and many more various types of best jobs. 
Let's have a closer look at how DevOps is a better career choice for you: **Packaging:-** DevOps is awesome if you love to explore and play with variety of Technology and processes. In my opinion the first thing to consider is the Packaging of IT that the tech teams used to provide the organisations services. The maleable the packaging the easier it is to keep everything standardized and reusable. If you are are comfortable working with configuration management systems and developing some imaging systems such as docker you will like DevOps. Closer look to the recent trends tells us the amount of new technologies that are being released into the market is growing exponentially. In DevOps no technology is beyond limits and you find yourself constantly working with integrated and automating different Technologies. In DevOps your goal is to create machines as machine manageable data objects that are completely completely hands off on the production. The goal is to to allow programs written by different teams to efficiently automate as much as possible. **Scaling:-** You will definitely like DevOps if reusability is your passion. In my opinion the biggest factor in the successful tech organisations of the future will be their ability to scale rapidly while being able to deflate when not needed to minimise costs in downtime. If the Application is reliable ,zippy and meet their needs, customers don't care about the tech behind it. They simply want speed. Scalability is a hard thing to achieve and most would rather not have to worry about it, which is self explanatory about the growth scalability as a service offering. Now, Ask yourself. Do you want to jump from mobile to AI? DevOps will allow you. Do you want to play with that new SaaS service that is in trend these days? DevOps will let you do that. DevOps is all about being the glue that holds everything and everyone together, and if you ask me, that is what makes it so exciting. 
The possibilities are beyond limits and the technologies are always growing and evolving at an unexplanatory and unimaginable speed. And if you don’t focus on DevOps, you will still somehow have to manage infrastructure as a developer. **Q. What is the need for DevOps?** As per me, this answer should start by explaining the general market trend. Instead of releasing big sets of features, companies see if small features can be transported to their customers via a series of release trains. This is very much advantageous like quick feedback from customers, better software quality, etc. which in turn takes the company to high customer satisfaction. To achieve this, companies are required to: Increase frequency of deployment Lower the New releases failure rate Shorten their lead time between fixes DevOps lets you achieve seamless software delivery and fulfills all above requirements. You can give examples of companies like Amazon, Etsy, and Google who have welcomed DevOps to achieve levels of performance that were unimaginable even five years ago. Q. Explain your understanding and expertise on both the software development side and the technical operations side of an organization you’ve worked for in the past. DevOps engineers always work in a 24*7 critical business online environment. In my previous job, I was very much adaptable to on-call duties and was able to take up real-time, live-system responsibilities. I was successful in automated processes to support continuous software deployments. I have pretty good experiences with public as well as private clouds, DevOps tools like CHEF or PUPPET, scripting and automation with languages like PYTHON and PHP, and a background in AGILE **Q. What is Git?** I will suggest that you attempt this question by first explaining about the architecture of Git. Git is a form of Distributed Version Control system (DVCS). It lets you track changes to a file and allows you to revert to any specific change. 
Its distributed architecture makes it more advantageous over other Version Control Systems (VCS) like SVN. Another major advantage of Git is that it does not rely on a central server to store all the versions of a project’s files. Instead of that, every developer gets “clones” the copy of a repository. “Local repository” has the full history of the project on its hard drive so that when there is a problem like a server outage, you need your teammate’s local Git repository for recovery. There is a central cloud repository as well where developers can commit changes and share it with other teammates where all collaborators are committing changes “Remote repository" **Q. In Git how do you revert a commit that has already been pushed and made public?** There are two possible answers to the above question so make sure that you include both because any of the below options can be used depending on the situation's demand: Remove the bad file in a new commit and push the file to the remote repository. This is the most common and natural way to fix a bug or an error. Once you have included necessary changes to the file, commit it to the remote repository. For that purpose I will use the command git commit -m “commit message" Now, Create a new commit that will undo all the changes that were made in the bad Commit. To do so I will be using the command git revert <name of bad commit> **Q. How is DevOps different from Agile / SDLC?** I would suggest you go through the below explanation: Agile is a set of values and principles about how to develop a software. For an instance: if you have some idea about something and you want to turn that idea into a working software the Agile values and principles can be used as a way to do that. But, that software might only be working on a developer’s laptop or within a test environment. You need a way to easily, quickly and repeatably move that software into production infrastructure, in a simple and safe way. 
To do that DevOps tools and techniques are required. In a nutshell, Agile software development methodology keeps its focus on the development of software but, on the other hand, DevOps is responsible for development as well as the deployment of the software in the safest and reliable possible way. Now remember, keep this thing in mind, you have included DevOps tools in the previous answer so be prepared to answer some questions related to that. They might be thrown at you. **Q. Which are the top DevOps tools? Which tools have you worked on?** Few of The most famous DevOps tools are mentioned below: Git: Version Control System tool Jenkins: Continuous Integration tool Selenium: Continuous Testing tool Puppet, Chef, Ansible: Configuration Management and Deployment tools Nagios: Continuous Monitoring tool Docker: Containerization tool You can also include any other tool if you want, but make sure you use the above tools in your answer. The second part of the answer could have two possibilities: If you have enough experience with all the above-mentioned tools then you may mention that I have worked on all these tools for developing good quality software and deploying that software easily, frequently, and reliably. If you have experience with only with few of the above tools then name those tools and say that I have specialization in these tools and have an overview of the rest of the tools. **Q. How do all these tools work together?** The code is developed by the developers and its source code is managed by Version Control System tools like Git etc. Developers transmit this code to the Git repository and any transformations made in the code is committed to this Repository. Jenkins extracts this code from the repository using the Git plugin and creates it using tools like Ant or Maven. 
Configuration management tools, puppet, deploy & provisions testing environment and after that Jenkins releases the code in the test environment on which testing is done using tools like selenium. After the code gets tested, Jenkins sends it for deployment on the production server (even the production server is provisioned & maintained by tools like the puppet). After its deployment, It is continuously monitored by tools like Nagios. Docker containers provide the testing environment to test the build features. **Q. What is Version control?** I guess this is the easiest question you could face in the interview. My take is to first define Version control. It is a system that keeps records of changes to a file or set of files over a period of time so that they can be recalled after specific versions later. Version control systems consist of a centrally shared repository where teammates can commit changes to a file or set of file. Then you might mention the uses of version control. Version control allows you to: Restore back files to a previous state. Restore back the entire back to a previous state. Compare changes over a period of time. The issue was introduced by whom and when. **Q. What are the benefits of using version control?** The following advantages of version control are suggested to be used: Version Control System (VCS), allows all the team members to work freely over any file at any point of time. VCS later allows you to merge all the changes into a common version. All the past versions and variants are nicely and systematically encapsulated inside the CVS. Whenever you need it, you may request any version of software at any time and you can have a snapshot of the complete project right away. Each time you have an updated version of your project, VCS requires you to provide a short info about what was changed. Also, you can see what exactly was altered in the file’s content. This gives you the privilege to know who has made what altered the project. 
A distributed VCS like Git provides all the team members about the complete history of the project so if there is a breakdown in the central server, you may use any of your teammate’s local Git repository. **Q. Describe branching strategies you have used.?** This question tests your branching experience so tell them about how you have used branching in your past jobs and what purpose does it serves, you can refer the below points: Feature branching: A feature branch model holds all of the changes for a particular feature inside of a branch. When the feature is completely tested and validated by the automated tests, the branch is then added to the master. Task branching: In this model, each task is implemented over its own branch with the task key included inside the branch name. It is easy to notice which code implements which task, just search for the task key in the branch name. Release branching: Once the developed branch acquires enough features for a release, you can get that branch cloned to form a Release branch. Making this branch starts the further release cycle, so no extra features can be added after this point, only bug fixes, documentation generation, and other release-oriented tasks should get on this branch. Once it is ready to be shipped, the release gets merged into master and tagged with a version number. In addition, it should be merged back inside develop branch, which might have progressed since the release was initiated. At the end, tell them that branching strategies vary from one organization to another, so I am familiar with basic branching operations like delete, merge, checking out a branch etc. ** Q. What is meant by Continuous Integration?** It is advised to begin this answer by giving a short definition of Continuous Integration (CI). Continuous Integration is a development practice that needs developers to integrate code into a shared repository many times a day. 
Each check-in gets verified by an automated build, allowing teams to detect problems early. I would suggest you explain how you have implemented it in your previous job. **Q. Explain how you can move or copy Jenkins from one server to another?** I could have achieved this task by copying the jobs directory directly from the old server to the new one. There are many ways to do that; They are mentioned below: You can: Moving a job from one installation of Jenkins to another by simply copying and pasting the corresponding job directory. Create a copy of an existing job by making a clone of a job directory by a different name. Rename an existing job by renaming a directory. Notice that if you change a job name, then you will need to change any other job that tries to call the renamed job. ** Q. Explain how can you create a backup and copy files in Jenkins?** The question has a direct answer. To create a backup, all you need to do is to back up your JENKINS_HOME directory at regular intervals of time. JENKINS_HOME directory contains all of your build jobs configurations, slave node configurations, and build history. For generating a backup of your Jenkins setup, simply copy its directory. You may also copy a job directory for cloning or replicate a job or rename the directory. **Q. How will you secure Jenkins?** The most common way of securing Jenkins is given below. But if you have any other way of doing it, you may go with it, but make sure you are correct: Make sure that the global security is on. Make sure that Jenkins is integrated with “my company’s” user directory using the appropriate plugin. Make sure that matrix/Project matrix is enabled for getting the fine tune access. Automate the setting rights/privileges process in Jenkins with custom version controlled script. Bound the physical access to Jenkins data/folders. Run security audits on same over a period of time. **Q. 
What is Continuous Testing?** It is advised to follow the under mentioned explanation: “Continuous Testing is the process of executing automated tests as a part of the software delivery pipeline to produce immediate feedback over the business risks associated with the latest build. In this method, each build gets tested continuously, allowing Development teams to get fast feedbacks so that as to prevent those problems from progressing to the successive stage of Software delivery life-cycle. Continuous Testing speeds up a developer’s workflow dramatically as there’s no need to manually rebuild the project and re-run all of the tests after making changes.” **Q. What is Automation Testing?** Automation testing or Test Automation is a process of automating the manual process for testing the application/system under test. The Process involves the use of separate testing tools which allows you to create test scripts which can be executed repeatedly and doesn’t require any sort of manual intervention. **Q. What are the benefits of Automation Testing?** Some of the many advantages of Automation Testing are mentioned below. Including these points in your answer and adding your own experience of how Continuous Testing helped you previous in your previous job, will make an impressive and impacting answer: Supports execution of repeated test cases Aids in testing a large test matrix Enables parallel execution Encourages unattended execution Improves accuracy thereby reducing human-generated errors Saves time and money **Q. What is the difference between Assert and Verify commands in Selenium?** The basic difference between Assert and Verify command is given below: Assert command checks if the given condition is boolean true or boolean false. For an instance, say, we assert whether the given element is present on the web page or not. If the condition results to be true, then the program control will execute the next test step. 
But, if the condition results in false, the execution would be terminated and no further test would be executed. Verify command also performs check whether the given condition is true or false. Irrespective of the condition being true or false, the program execution doesn’t stop i.e. if the verification process fails, it would not stop the execution and all the test steps will be executed. **Q. How can be a browser launched using WebDriver?** The following syntax could possibly be used to launch Browser: “WebDriver driver = new FirefoxDriver();” “WebDriver driver = new ChromeDriver();” “WebDriver driver = new InternetExplorerDriver();” **Q. What are the goals of Configuration management processes?** The basic purpose of Configuration Management (CM) is to ensure if the product is integral or system throughout its life-cycle by making t0he development or deployment process controllable and repeatable, thus creating a higher quality product or system. The Configuration Management process allows orderly management of system information and system changes for purposes such as to: Revise capability, Improve performance, Reliability or maintainability, Extend life, Reduce cost, Reduce risk and Liability, or correct defects. **Q. What is the difference between an Asset and a Configuration Item?** As per me, first of all, Asset should be explained. It has a financial value along with a depreciation rate attached to it. IT assets are just a sub-set. Everything and anything that holds a cost and the organization uses it for the calculation of its asset value and related benefits in the calculation of tax falls under Asset Management, and such item is called an asset. On the other hand, Configuration Item may or may not have financial values assigned to it. Also, there will not be any depreciation linked to it. Thus, its life will not depend on its financial value but will depend on the time till that item becomes obsolete for the organization. 
Now examples can be given that can showcase the similarity and differences between both: 1) Similarity: Server – It is both an asset as well as a CI. 2) Difference: Building – It is an asset but not a CI. Document – It is a CI but not an asset **Q . What is Chef?** Start the answer with the definition of Chef. The Chef is one of the powerful automation platforms that turns infrastructures into code. A chef is a tool for which scripts are written that are used to automate processes. What kind of processes? Any process that is related to IT. Now the architecture of Chef can be explained, it consists of: Chef Server: The Chef Server is the central store of infrastructure’s configuration data. The Chef Server stores the data necessary to configure the nodes and provides search. ChefServer is a powerful tool that lets you to dynamically drive node configuration based on data. Chef Node: Node is any host that gets configured using Chef-client. Chef- client runs on nodes. ChefNode contacts the Chef Server for the information necessary to configure the node. Now, since a Node is just a machine that runs the Chef-client software, nodes may be sometimes referred to as “clients”. Chef Workstation: A Chef Workstation is a host used to modify cookbooks and other confrontational data. **Q2. What is Nagios?** This question can be answered by first mentioning that Nagios is one of the monitoring tools used for Continuous monitoring of systems, applications, services, and business processes etc in DevOps culture. If a failure occurs, Nagios alerts technical staff about the problem, that allows them to begin remedial processes before outages affect business processes, end-users, or customers. With Nagios, you need not explain why an unseen infrastructure outage affects your organization's bottom line. Now once you defined what is Nagios, you can mention various things that can be achieved using Nagios. 
By using Nagios you can: Plan for infrastructure upgrades before outdated systems cause failures. Response to the issues at problem’s first sign. Automatically fix detected problems. Coordinate easily with technical team responses. Ensure that your organization’s SLAs are being met. Monitor your entire infrastructure and business processes. Nagios runs on a server, usually as a daemon or service. Nagios runs plugins residing on the same server over a period of time. They make contact to hosts or servers on your network or over the internet. One can see the status information using the web interface. Nagios also sends email or SMS notifications if something happens. The Nagios daemon acts like a scheduler that executes certain scripts at certain moments. It then saves the results of those scripts and will run other scripts if these results change. ***************************************************************************************************** **DevOps Job Description** Demand for people with DevOps skills is growing at a fast and steady rate because businesses are getting great results from DevOps. Organizations using DevOps practices are surprisingly high-functioning: - They can deploy code up to 30 times more frequently than their competitors, and 50 percent lesser of their deployments fail. With all this goodness, you would be thinking that there must be lots of DevOps engineers out there. However, just 18% of survey respondents in the survey said someone in their organization actually held this title. Why is that? Partly, it is because defining what a DevOps engineers can do is still in flux. Although, That is not stopping companies from hiring for DevOps skills. On LinkedIn, people's mentioning of DevOps as a skill has seen a rise of 50 percent over the past few years. A survey revealed the same trend: Half of about 4,000-plus respondents (in more than 90 countries) said their companies are considering DevOps skills while hiring. 
**What are DevOps skills?** The survey identified the top three skill areas for DevOps staff: **Coding or scripting Process re-engineering Communicating and collaborating with others** The above-mentioned skills point to a growing recognition, that software isn’t written in the old stereotypical way anymore. Where software was written from scratch using a highly complex and lengthy process. Also, creating new products is now a matter of selecting open source components and binding them together with code. The complexity of today’s software lies less in the programming, and more in ensuring that the new software works over a diverse set of operating systems. Making it platform independent right away. Same way, testing and deployment are now done at a much more frequency. That is, they can be more often— if developers start communicating more early and regularly with the operations team, and also if, operations people bring their knowledge of the production environment to design of testing and staging environment. **What is a DevOps engineer, anyway? And should anyone hire them?** There’s no formal cliched career track for kickstarting your career as a DevOps engineer. They are maybe developers who get interested in deployment and network operations, they might be sysadmins who have an affinity for scripting and coding. Whatever world they are from, these are people who have pushed themselves out of their comfort zone of their defined areas of competence and who have a more holistic view of their technical environments. DevOps engineers are a quite elite group, so it’s not astonishing that we found a smaller number of companies creating that title. Kelsey Hightower, head of operations at Puppet Labs, described these people as the “Special Forces” in an organization. “The DevOps engineer encapsulates depth of knowledge and years of hands-on experience,” Kelsey says, “You’re battle tested. 
This person blends the skills of the business analyst with the technical chops to build the solution - plus they know the business well, and can look at how any issue affects the entire company. So, in a nutshell, DevOps provides you lots of career opportunities and companies are ACTUALLY hiring DevOps engineers. ****************************************************************************** **Object-Oriented Programming:-** Object-Oriented Programming or commonly called OOPs is There are 5 basic concepts of OOPs. Let's have a closer look at each of them. **1\. Abstraction** This is the property of OOPs which refers to the act of representing only the essential details and hiding the background data. Consider a car as your object. You are told that if you apply the brakes, the vehicle stops. The background details, like the mechanism how the fluid runs through, the brake shoes stopping the wheel, etc. are hidden from you. This is what abstraction is. Abstraction is the advantage that you get from Object Oriented Programming over Procedural Oriented Programming. **2\. Encapsulation** The process of binding characteristics and behavior in a single unit is simply known as Let's get back to our previous example of a car. In a car, we have a steering that helps to change the direction, we have brakes to stop the car, we have a music system to listen to music, etc. These all units are capsuled (or ENcapsuled) under a single unit called CAR. Like objects, each unit has its own characteristics as well as behavior. It is a common observation that a class encapsulates objects of the similar kind under a single unit. **3\. Modularity** Modularity is the feature of Object Oriented Programming that allows us to break a bigger problem in smaller chunks and assemble it together, later. For an instance, during the manufacturing of a car, parts are constructed separately. Like there is a unit that makes the engine, a unit makes the outer body, a unit makes the interior, etc. 
Later on, all the parts are assembled at one place. This way, a big problem is divided into small chunks and handled easily. In Object Oriented Programming, Modularity is implemented by functions. **4\. Inheritance** Inheritance is the capability of a class to inherit the properties of some other class. For an example, consider CAR as a class. Now let's take TOYOTA, NISSAN, SWIFT, HYUNDAI, etc. as some other class. These classes will have them some individual properties but they will inherit some of their properties from the class CAR. Like moving on applying accelerator, stopping when brakes are applied, etc. The inheriting class is called the subclass whereas the inherited class is called base class. In the above example, CAR is the base class and others are a subclass. **5\. Polymorphism** The act of existing in more than one form Lets again get back to our example of cars. Consider a class called HYUNDAI. The HYUNDAI class has an object i10. Now there can be many cars with the name i10, but they have a unique identification. ( either by their registration number or engine number, we are not concerned here about that) In an Object Oriented Programming language, there can be many functions with the same name but they should be of different parameters. So now you know, the 5 pillars of Object Oriented Programming. _**Happy coding! **_ _********************************************************************************* **_ _****_ **DevOps For Dummies- A Wiley Brand** is an IBM limited edition written by Sanjeev Sharma and Bernie Coyne. Earlier it was written only by Sanjeev Sharma alone, but in the latest third edition, Bernie Coyne co-authored the book. This is a book for the people interested in DevOps. It takes you from beginner to advanced level. The book is available in the form of electronic media i.e. e-book. The free of cost book comes from IBM. Go to the link above and fill in your details, and you will get the download link of your copy. 
Let's take a look at the book's features: **Cover Page:-** It is often said, don't judge a book by its cover. But we humans are very much stubborn and the cover matters the most for the readers, as it lures them towards itself. The cover page for DevOps for dummies is a mixture of Black, blue and yellow color; with an animated geeky face outline. At the top, IBM logo resides with its full dignity. The middle right half of the page covers the main outlines of the book: * **The business needs and value of DevOps.** * **DevOps capabilities and adoption path.** * **How Cloud accelerates DevOps.** **Table Of Content:-** Next, as we turn over the "virtual pages" comes the table of content. This gives an overview of what you are going to learn from this book. There are chapter names with their subtopics under them. The chapter names are as follows:- **1.What is DevOps? 2.Looking at DevOps capabilities. 3.Adopting DevOps. 4.Looking at how cloud accelerates DevOps. 5.Using DevOps to solve new challenges. 6.Making DevOps work: IBM's Story. 7.Ten DevOps myths.** **Introduction:-** Next, comes in the introduction part. In the first line, the meaning of DevOps with its expanded form of Development and Operations is explained. Everyone talks about it, but not everyone is familiar with it. In a nutshell, DevOps is an approach based on lean and agile principles in which business owners and the development, operations, and quality assurance departments collaborate to deliver software in a continuous manner. The further lines tell about the IBM's broad and holistic view towards DevOps. The book tells what a true DevOps approach includes: Lines of business, practitioners, executives, partners, suppliers, and so on. **About the book:-** The about the book section gives an overview of the book. The book takes a business-centric approach to DevOps. 
Today’s rapidly advancing world makes DevOps essential to all enterprises that should be agile and lean enough to respond rapidly to the changes such as customer demands, market conditions, competitive pressures, or regulatory requirements. It is assumed that, if you are reading this book, you’ve heard about DevOps but want to understand what it means and how your company can gain business benefits from it. This book is targeted for executives, decision-makers, and practitioners who are new to the DevOps, seeking info about the approach, who want to go through the hype surrounding the concept to reach t _****_
fig = plt.figure(figsize=(11,9))
ax1 = plt.subplot(1,2,1)
# subplot(1,2,1): number of rows, number of columns, and axes index respectively
ax2 = plt.subplot(1,2,2)
sns.boxplot(x="lengh_short_desc", y="category", data= df, ax=ax2)
sns.boxplot(x=df["lengh_long_desc"].clip(0,7500), y="category", data= df, ax=ax1)
plt.tight_layout()
ax1.grid()
ax2.grid()
plt.show()
we can see that the Development category has slightly longer long-descriptions (with many outliers) and longer short-descriptions: a quarter of its courses have short descriptions over 125 characters long. (note for long_desc: I've clipped the x-axis at 7500 to keep the boxplots readable)
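The `lengh_short_desc` / `lengh_long_desc` columns used above aren't created in this section; presumably they were derived earlier with `str.len()`. A minimal sketch on a toy frame (the rows and values here are illustrative, not from the real dataset):

```python
import pandas as pd

# toy stand-in rows (hypothetical, not from the real dataset)
demo = pd.DataFrame({
    "description": ["Learn Python from scratch", "Fix your credit"],
    "longDescription": ["**Why Python?**\n\nPython is popular.", "Credit can be confusing."],
})

# character-length features, as presumably computed for the real df
demo["lengh_short_desc"] = demo["description"].str.len()
demo["lengh_long_desc"] = demo["longDescription"].str.len()
```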
df.groupby('category').agg(max_long_desc= ('lengh_long_desc','max')).sort_values(by='max_long_desc', ascending=False)
| category | max_long_desc |
|---|---|
| Development | 32103 |
| Personal Development | 20021 |
| Design | 16801 |
| Marketing | 16435 |
| Business | 16359 |
| IT & Software | 15610 |
| Finance & Accounting | 14129 |
| Lifestyle | 9156 |
| Teaching & Academics | 8757 |
| Office Productivity | 8555 |
| Health & Fitness | 8546 |
| Photography & Video | 8444 |
| Music | 8306 |
how many numbers?
we see in the plot below that humanistic courses contain fewer numbers, which might help differentiate them from development courses
df['num_count'] = df.iloc[:,3].apply(lambda x: len(re.findall(r'\d+\.\d+|\d+', x)))  # matches floats or integers
fig = plt.figure(figsize=(12,8))
sns.boxplot(x=df["num_count"].clip(0,100), y="category", data= df)
<AxesSubplot:xlabel='num_count', ylabel='category'>
how many non-alphanumeric characters?
humanistic courses have more of them, though the separation between categories is small overall
fig = plt.figure(figsize=(12,8))
df['num_non_alphanum'] = df.iloc[:,3].apply(lambda x: len(re.findall(r"[^0-9A-Za-z ]", x)))  # characters that are neither alphanumeric nor spaces
sns.boxplot(x=df['num_non_alphanum'], y="category", data= df)
<AxesSubplot:xlabel='num_non_alphanum', ylabel='category'>
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.base import TransformerMixin, BaseEstimator
from nltk.stem import PorterStemmer
creating the target and uniting all text columns into one column
we combine all the non-Development categories into a single class, since the task is to correctly label the Development courses. This simplifies the problem: labeling a heavily imbalanced multiclass dataset is a much harder task for a model.
df['is_dev'] = df['category'] == 'Development'
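Since the argument rests on class imbalance, it is worth checking the actual ratio with `value_counts`; a minimal sketch on a toy series (the values here are illustrative, the real check would use `df['category']`):

```python
import pandas as pd

# toy stand-in for df['category'] (illustrative values)
cat = pd.Series(["Development", "Design", "Business", "Development"])
is_dev = cat == "Development"                  # boolean target
balance = is_dev.value_counts(normalize=True)  # fraction of each class
```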
in order to keep the number of dimensions small, I prefer uniting all the textual columns into one column
df['text'] = df['title']+' '+df['description']+' '+df['longDescription']
df.columns
Index(['title', 'description', 'category', 'longDescription', 'lengh_long_desc', 'lengh_short_desc', 'num_count', 'num_non_alphanum', 'is_dev', 'text'], dtype='object')
cleaning text
def remove_end_line(x):
    return x.str.replace("\n\n", " ")
def remove_url(x):
    return x.str.replace(r"https*\S+", "url", regex=True)
def remove_puncs_inside_words(x):
    return x.str.replace(r"['\-]", "", regex=True)
def remove_non_alphanumeric_and_lower(x):
    return x.str.replace(r"[^0-9A-Za-z ']", " ", regex=True).str.lower()
def replace_high_numbers(x):  # integers higher than 9 will be replaced with '999'
    try:
        y = int(x) > 9
    except ValueError:
        return x
    return '999' if y else x
def remove_stop_words(x):
    stopwords_set = set(stopwords.words("english"))
    return ' '.join(w for w in x.split() if w not in stopwords_set)  # --> set lookup speeds this up tremendously
def replace_overspaces(x):
    return x.str.replace(r"\s{2,}", " ", regex=True)
df[['text']] = df[['text']].apply(remove_end_line)\
                           .apply(remove_url)\
                           .applymap(remove_stop_words)\
                           .apply(remove_puncs_inside_words)\
                           .apply(remove_non_alphanumeric_and_lower)\
                           .applymap(lambda s: ' '.join(replace_high_numbers(tok) for tok in s.split()))\
                           .apply(replace_overspaces)
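To see the cleaning chain end to end, here is a plain-`re` sketch of the same steps applied to a single (made-up) string; the stop-word step is omitted to keep the example self-contained, while the URL-masking and number-capping rules mirror the functions above:

```python
import re

raw = "Learn Python!!  Visit https://example.com\n\nOver 100 lessons & 25 projects"
s = raw.replace("\n\n", " ")                   # drop paragraph breaks
s = re.sub(r"https*\S+", "url", s)             # mask URLs with a placeholder token
s = re.sub(r"[^0-9A-Za-z ']", " ", s).lower()  # strip punctuation, lowercase
s = " ".join("999" if tok.isdigit() and int(tok) > 9 else tok
             for tok in s.split())             # cap numbers above 9 at '999'
```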
df.head(3)
| | title | description | category | longDescription | lengh_long_desc | lengh_short_desc | num_count | num_non_alphanum | is_dev | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Python for Beginners | Learn Python programming from scratch with han... | Development | **Why Python ?**\n\n * Python is one of the w... | 1552 | 74 | 0 | 103 | True | python beginners learn python programming scra... |
| 1 | Design Patterns in Python | Learn the Design Patterns in a practical way u... | Development | Learning Design Pattern is a voracious learnin... | 1919 | 57 | 0 | 80 | True | design patterns python learn design patterns p... |
| 2 | Unity Mobile C# Developer Course | Create and deploy games for Android & iOS usin... | Development | Build 3 simple mobile games using the free Uni... | 1734 | 56 | 1 | 106 | True | unity mobile c developer course create deploy ... |
# ' '.join avoids gluing words together at document boundaries (unlike .sum(), which concatenates without a separator)
wordcloud = WordCloud(stopwords=['course','learn','this','the','in','it','you','use']).generate(' '.join(df[df['is_dev']==1]['text']))
plt.figure(figsize=(11,11))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
plt.figure(figsize=(11,11))
wordcloud = WordCloud(stopwords=['course','learn','this','the','thi','in','it','you','use']).generate(' '.join(df[df['is_dev']==0]['text']))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
it seems that words like 'application', 'project', 'python' and 'machine learning' are uniquely more frequent in the Development courses (large only in the upper image), while the word 'business' seems important mainly to the non-Development courses
df_m = df[['text','lengh_long_desc', 'lengh_short_desc', 'num_count','is_dev']]
Splitting into train/test
X = df_m.drop(['is_dev'],axis=1)
y = df_m['is_dev']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
the pipeline steps are:
Stemming / Lemmatization
while lemmatization would better preserve the contextual meaning of a word, we will use stemming since it is much faster. Stemming transforms words (mainly verbs and noun endings) to their root form, which decreases the dimensionality significantly.
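A quick illustration of what nltk's `PorterStemmer` does to a few words (note the stems need not be dictionary words):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "studies", "developer", "applications"]
# several surface forms collapse to short stems, shrinking the vocabulary
stems = [ps.stem(w) for w in words]
```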
class TextStemmer(TransformerMixin, BaseEstimator):
    def __init__(self):
        super().__init__()
        self.ps = PorterStemmer()
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's frame
        X['text_stem'] = X['text'].map(lambda doc: ' '.join(self.ps.stem(w) for w in doc.split()))
        return X.drop('text', axis=1)
TF/IDF
we will use tf/idf to generate the vocabulary
tf/idf in the sklearn package is defined as follows:
(tf = term frequency, idf = inverse document frequency)
tf/idf adds a weighting sensibility to a counting vector of each token in a document by multiplying the count of a token (tf) by a term that reflects how rare or frequent the token is in the entire corpus (idf):
idf is defined as the logarithm of the total number of documents divided by the number of documents the token appears in (sklearn additionally smooths this by adding 1 to both counts and to the result).
if the ratio is big - meaning the token scarcely appears in the documents of the corpus - the idf term is large and boosts the count.
if the ratio is small - meaning the token appears frequently in the documents of the corpus - the idf term is small and shrinks the count's weight.
stem_text = TextStemmer()
tfidf = TfidfVectorizer(analyzer = 'word' ,token_pattern= r"(?u)\b\w+\b" ,ngram_range=(1,2), min_df= 0.005, max_df= 0.99 ,norm=None)
rfc= RandomForestClassifier(max_depth=9, n_estimators=100, random_state= 42)
text_tfidf = ColumnTransformer(transformers= [('tfidf', tfidf, 'text_stem')], remainder= 'passthrough', sparse_threshold=0 )
we are using n-grams since some word pairs carry a different meaning than their individual parts; we also shrink the vocabulary by using min_df
the custom token_pattern keeps tokens that are one character long, like "C" or "R", which are important programming languages that I anticipate appearing many times in the text
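The effect of the custom token_pattern can be checked directly: sklearn's default pattern `(?u)\b\w\w+\b` requires at least two word characters, so "c" and "r" would vanish from the vocabulary. A quick sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["learn c and r programming"]
default_vec = CountVectorizer().fit(docs)  # default pattern drops 1-char tokens
custom_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)  # keeps them
```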
final_model = Pipeline(steps=[('st', stem_text),
('ct', text_tfidf),
('rfc', rfc)
])
params_store= final_model.get_params()
param_search = {
'ct__tfidf__min_df' :[0.005, 0.1],
'rfc__max_depth' : [6, 10, 12],
'rfc__n_estimators' :[150]
}
gsearch = GridSearchCV(estimator= final_model, cv=3,
param_grid= param_search, verbose=2 )
gsearch.fit(X_train, y_train.astype('bool'))
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] per-fold fit times ranged from ~37s to ~43s (verbose log trimmed)
[Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 12.2min finished
GridSearchCV(cv=3, estimator=Pipeline(steps=[('st', TextStemmer()), ('ct', ColumnTransformer(remainder='passthrough', sparse_threshold=0, transformers=[('tfidf', TfidfVectorizer(max_df=0.99, min_df=0.005, ngram_range=(1, 2), norm=None, token_pattern='(?u)\\b\\w+\\b'), 'text_stem')])), ('rfc', RandomForestClassifier(max_depth=9, random_state=42))]), param_grid={'ct__tfidf__min_df': [0.005, 0.1], 'rfc__max_depth': [6, 10, 12], 'rfc__n_estimators': [150]}, verbose=2)
The Vocabulary
fitting tf-idf builds a vocabulary of all the unique tokens in our training documents, assigning each token a unique index
in the next stage it counts the occurrences of each token in every document, producing a feature vector
in the last stage it applies the idf weight as described above
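The chained indexing in the next cell (`steps[1][1].transformers_[0][1]`) works, but the named accessors read better. A minimal sketch on a toy frame (column names match the pipeline above; the data is illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({"text_stem": ["python code", "web design"], "num_count": [1, 2]})
ct = ColumnTransformer([("tfidf", TfidfVectorizer(), "text_stem")],
                       remainder="passthrough")
ct.fit(X)
vocab = ct.named_transformers_["tfidf"].vocabulary_  # same dict, looked up by name
```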
gsearch.best_estimator_.steps[1][1].transformers_[0][1].vocabulary_
{'mobil': 3192, 'applic': 384, 'manual': 3102, 'test': 4589, 'io': 2578, 'bug': 647, 'track': 4811, 'debug': 1369, 'realtim': 3895, 'process': 3697, 'agil': 222, 'methodolog': 3166, 'develop': 1445, 'hand': 2163, 'devic': 1489, 'function': 1987, ...}
Validation report
we can see that the model is stable, since the standard deviation between the different CV splits is small:
cv_report = pd.DataFrame(gsearch.cv_results_)  # one row per parameter combination, with accuracy scores
cv_report
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_ct__tfidf__min_df | param_rfc__max_depth | param_rfc__n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27.904208 | 0.280141 | 11.990902 | 0.157760 | 0.005 | 6 | 150 | {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... | 0.838865 | 0.843644 | 0.854711 | 0.845740 | 0.006636 | 6 |
1 | 29.845166 | 0.293449 | 12.166636 | 0.168616 | 0.005 | 10 | 150 | {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... | 0.887660 | 0.887911 | 0.895289 | 0.890287 | 0.003539 | 2 |
2 | 30.715490 | 0.126816 | 12.016137 | 0.231241 | 0.005 | 12 | 150 | {'ct__tfidf__min_df': 0.005, 'rfc__max_depth':... | 0.893901 | 0.892168 | 0.900681 | 0.895583 | 0.003673 | 1 |
3 | 26.274902 | 1.073447 | 12.405668 | 0.293781 | 0.1 | 6 | 150 | {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 6... | 0.873475 | 0.855562 | 0.866345 | 0.865127 | 0.007364 | 5 |
4 | 27.804441 | 0.275139 | 12.626807 | 0.171780 | 0.1 | 10 | 150 | {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... | 0.884539 | 0.870318 | 0.875993 | 0.876950 | 0.005845 | 4 |
5 | 27.847812 | 0.187276 | 12.579125 | 0.211614 | 0.1 | 12 | 150 | {'ct__tfidf__min_df': 0.1, 'rfc__max_depth': 1... | 0.885957 | 0.875142 | 0.878263 | 0.879788 | 0.004545 | 3 |
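as a side note, the stability claim can be checked numerically from a `cv_results_`-style frame by looking at the spread of scores relative to their mean. a minimal sketch with made-up round-offs (not the exact notebook values):

```python
import pandas as pd

# toy frame mimicking a few rows of GridSearchCV.cv_results_
# (scores are illustrative, not the exact notebook values)
toy = pd.DataFrame({
    "mean_test_score": [0.8457, 0.8903, 0.8956],
    "std_test_score":  [0.0066, 0.0035, 0.0037],
})

# coefficient of variation across CV splits, per parameter combination
rel_spread = toy["std_test_score"] / toy["mean_test_score"]
print((rel_spread < 0.01).all())  # True: every config varies < 1% across splits
```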
Predicting with the random-forest best estimator
RandomForestClassifier is an ensemble bootstrap-aggregation ("bagging") algorithm: it creates a number of decision-tree classifiers, where each
classifier fits only on a random part of the data: rows are drawn uniformly with replacement, and only a random subset of the columns is considered.
Like in a regular decision tree, the reduction in impurity is the criterion for splitting on a feature in each tree.
The end result is the majority vote each sample receives from the classifiers.
This way of using the "wisdom of crowds" improves the stability and accuracy of the decision making.
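the bagging-and-voting idea can be sketched in a few lines of plain Python (`bootstrap_sample` and `majority_vote` are made-up helper names for illustration, not scikit-learn API):

```python
import random
from collections import Counter

random.seed(0)

def bootstrap_sample(rows):
    # draw len(rows) rows uniformly WITH replacement (the "bagging" step)
    return [random.choice(rows) for _ in range(len(rows))]

def majority_vote(votes):
    # the ensemble predicts the class that most of its trees voted for
    return Counter(votes).most_common(1)[0][0]

# each element stands for one tree's prediction for a single sample
tree_votes = [1, 1, 0, 1, 0]
print(majority_vote(tree_votes))  # 1: three of five trees say "Development"
```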
preds = gsearch.best_estimator_.predict(X_test)
final_results = pd.DataFrame(classification_report(y_test.astype('bool'), preds, output_dict=True))
conf_matrix = confusion_matrix(y_test.astype('bool'), preds)
Results
final_results.rename(columns= {'False': 'non-Development', 'True':'Development'})
non-Development | Development | accuracy | macro avg | weighted avg | |
---|---|---|---|---|---|
precision | 0.922407 | 0.889478 | 0.901277 | 0.905943 | 0.902696 |
recall | 0.823322 | 0.953555 | 0.901277 | 0.888438 | 0.901277 |
f1-score | 0.870052 | 0.920403 | 0.901277 | 0.895227 | 0.900191 |
support | 1415.000000 | 2110.000000 | 0.901277 | 3525.000000 | 3525.000000 |
sns.heatmap(pd.DataFrame(conf_matrix), annot=True, fmt='d', cmap=plt.cm.Blues,
cbar=False)
plt.title("confusion matrix")
Text(0.5, 1.0, 'confusion matrix')
we get very good performance from our model, both precision-wise and recall-wise, when looking at the Development label.
the performance is slightly worse for the 0, or non-Development, label.
after checking some false positives, we see that courses in the IT-Software category are harder to separate from the Development category, since many words are common to these 2 categories. so it may be wise to combine these 2 into the same category
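one way to act on that observation is to fold both categories into the positive label before splitting. a hedged sketch (the category strings here are assumptions; the dataset's exact spellings may differ):

```python
import pandas as pd

# mini-frame standing in for the notebook's df; real category spellings may differ
toy = pd.DataFrame({"category": ["Development", "IT & Software", "Design"]})

# treat both hard-to-separate categories as the positive class
toy["is_dev"] = toy["category"].isin(["Development", "IT & Software"])
print(toy["is_dev"].tolist())  # [True, True, False]
```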
feature importances
the feature_importances_ method has a tendency to inflate the weights of continuous features in a biased way, but since all of our features are continuous ones, it seems appropriate enough for a "big picture" estimation:
importances = pd.Series(
    gsearch.best_estimator_.named_steps["rfc"].feature_importances_,
    index=gsearch.best_estimator_.steps[1][1].get_feature_names(),
).sort_values(ascending=False)
pd.DataFrame(importances, columns=['importance']).head(15)
importance | |
---|---|
tfidf__code | 0.059199 |
tfidf__web | 0.030740 |
tfidf__develop | 0.027360 |
tfidf__applic | 0.020564 |
tfidf__data | 0.019096 |
tfidf__app | 0.017708 |
tfidf__program | 0.016756 |
tfidf__build | 0.016654 |
tfidf__languag | 0.016126 |
tfidf__javascript | 0.015336 |
tfidf__java | 0.014110 |
tfidf__program languag | 0.013793 |
tfidf__api | 0.012328 |
tfidf__python | 0.011386 |
tfidf__web develop | 0.011220 |
we see that all programming-related words (like code or java) are very important for the classification of "development" courses, which is very logical.
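a common cross-check for impurity-based importances is permutation importance: shuffle one feature column and measure how much the score drops. a toy NumPy sketch of the idea (`perm_importance`, `score_fn` and the data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_importance(score_fn, X, y, col):
    # importance = score drop after shuffling a single feature column
    base = score_fn(X, y)
    Xp = X.copy()
    rng.shuffle(Xp[:, col])
    return base - score_fn(Xp, y)

# toy "model": predicts 1 whenever feature 0 exceeds 0.5; feature 1 is unused noise
def score_fn(X, y):
    return float(((X[:, 0] > 0.5).astype(int) == y).mean())

X = np.array([[0.0, 9.0], [0.1, 2.0], [0.9, 7.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])

# feature 1 is never used, so shuffling it cannot change the score
print(perm_importance(score_fn, X, y, 1))  # 0.0
```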
In order to check the assumption about the false positives, we will try to plot all the courses in a way that reflects their differences, meaning courses with similar content should also be close in the scatter plot
PCA is a way to linearly transform a high-dimensional space into a much smaller hidden representation that captures the majority of the variance between samples.
here we will use a 3-dimensional PCA transformer, so we will be able to plot the resulting vector for each course in a 3d scatter plot
X_train_vect= tfidf.fit_transform(X_train['text_stem'])
train = pd.DataFrame(X_train_vect.todense(), columns=tfidf.get_feature_names())
from sklearn.decomposition import PCA
pca = PCA(3)
df_3d = pd.concat(
    [
        pd.DataFrame(pca.fit_transform(train)),
        pd.DataFrame(df.loc[X_train.index]['category'].reset_index(drop=True)),
        pd.DataFrame(df.loc[X_train.index]['title'].reset_index(drop=True)),
    ],
    axis=1,
)
df_3d
0 | 1 | 2 | category | title | |
---|---|---|---|---|---|
0 | -4.124536 | 4.492683 | 0.195532 | Development | Mobile Application Manual Testing - IOS Applic... |
1 | -9.513578 | 3.785385 | -2.615611 | Development | Learn C++ Programming for beginners from basi... |
2 | -4.230656 | 3.113482 | -2.179557 | Development | Learning Flask |
3 | 6.002980 | -4.374147 | -11.042514 | Development | LEARNING PATH: Go: Real-World Go Solutions for... |
4 | -12.164592 | 8.173161 | -0.145709 | Development | Interactive maps with Mapbox! |
... | ... | ... | ... | ... | ... |
10568 | -4.203248 | 0.577658 | -1.131525 | Development | Introduction to C Programming for the Raspberr... |
10569 | -8.885592 | 3.946822 | 4.935661 | Marketing | How to Create a Marketing Video for Your Busin... |
10570 | -10.786600 | 6.954830 | -1.057528 | Development | Using JSON In Unreal Engine 4 - C++ |
10571 | 8.539465 | -13.127536 | -30.164514 | Development | Building Recommender Systems with Machine Lear... |
10572 | -6.286109 | 1.255941 | -3.256990 | Development | C and C++ Programming : Step-by-Step Tutorial |
10573 rows × 5 columns
df_3d.columns= ['1','2','3' ,'cat','title']
import plotly.express as px
fig = px.scatter_3d(df_3d, x='1', y='2', z='3',color='cat',size_max=10, hover_data=['title'], title = "3d scatter-plot of PCA results")
fig.update_traces(marker=dict(size=4))
fig.show()
we can see a clustering pattern for the categories and a plane of separation between business (orange), the more humanistic courses (red, turquoise) and development courses (dark blue), while IT courses (purple) show no separation from development courses.
actually, a nice example of the validity and usability of our model is that most courses in the upper-left corner of the plot above (blue and purple) share the same content, the Spring framework, so it is apparent that there is even a sub-clustering pattern by course content
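a quick sanity check for how much structure a low-dimensional PCA retains is the explained-variance ratio (for the fitted transformer above that would be `pca.explained_variance_ratio_`). the same quantity can be computed from scratch with an SVD on toy data:

```python
import numpy as np

# toy data where one direction carries almost all the variance
X = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]])

Xc = X - X.mean(axis=0)                  # PCA centers each feature first
s = np.linalg.svd(Xc, compute_uv=False)  # singular values of the centered data
ratio = s**2 / (s**2).sum()              # explained-variance ratio per component
print(ratio[0] > 0.95)                   # True: the first axis dominates
```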
for the sake of comparison and completeness, we will also use an LSTM NN model.
we will regard this model as a black box and won't elaborate on its mechanism,
but suffice it to say that the LSTM's power lies in its ability to create a contextual connection between words in the text, something that is mostly lacking in the TF-IDF approach.
this is done by sequentially embedding the words into vectors in the first layer, and by "smart" memory neurons in the inner layer that can combine past information with the new info coming in, changing the state of the vector, or "forgetting" states that turn out to be less effective as training proceeds
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer
from keras.layers import LSTM
from keras.layers import SimpleRNN
here we are only using the 'text' column after the cleaning procedure
ps = PorterStemmer()
df_m['text_stem'] = df_m['text'].map(lambda x: ' '.join(ps.stem(y) for y in x.split()))
X = df_m[['text_stem']]
y = df_m['is_dev']
# split training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
t = Tokenizer(2500)
# the full vocabulary has ~3046 words --> keep only the 2500 most frequent
y_train = np.where(y_train == True, 1, 0)
# fit_on_texts creates the mapping dictionary from words to integers
t.fit_on_texts(X_train['text_stem'])
vocab_size = len(t.word_index) + 1
# integer-encode the documents with that mapping
encoded_docs = t.texts_to_sequences(X_train['text_stem'])
encoded_docs_test = t.texts_to_sequences(X_test['text_stem'])
# prepend zeros so all sequences get the same length
encoded_docs_padded = pad_sequences(encoded_docs, padding='pre')
len_row = len(encoded_docs_padded[1])
encoded_docs_padded_test = pad_sequences(encoded_docs_test, maxlen=len_row, padding='pre')
len(encoded_docs_padded[1])
2929
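what `pad_sequences(padding='pre')` does can be shown with a tiny pure-Python stand-in (`pad_pre` is a hypothetical helper written for illustration, not a Keras function):

```python
def pad_pre(seqs, maxlen=None):
    # zero-pad at the FRONT to a common length; over-long sequences keep their tail
    maxlen = maxlen or max(len(s) for s in seqs)
    return [[0] * (maxlen - len(s)) + list(s)[-maxlen:] for s in seqs]

print(pad_pre([[5, 3], [7, 1, 4, 2]]))  # [[0, 0, 5, 3], [7, 1, 4, 2]]
```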
keras tokenizer creates a vocabulary:
import json
json.loads(t.get_config()['word_counts'])
{'mobil': 2353, 'applic': 9933, 'manual': 389, 'test': 6477, 'io': 2440, 'bug': 224, 'track': 709, 'debug': 394, 'realtim': 434, 'process': 4330, 'agil': 592, 'methodolog': 306, 'develop': 17518, 'hand': 1156, 'held': 52, 'devic': 989, 'function': 4658, 'usabl': 78, 'consist': 555, 'autom': 2514, 'type': 2606, 'either': 419, 'come': 2404, 'preinstal': 5, 'instal': 2151, 'softwar': 4359, 'distribut': 658, 'platform': 2348, 'wit': 44, 'phenomen': 17, 'growth': 391, 'past': 612, 'year': 3474, 'a': 5152, 'studi': 1264, 'conduct': 209, 'yanke': 2, 'group': 986, 'predict': 539, 'gener': 2263, '4': 3529, '2': 5395, 'billion': 177, 'revenu': 256, '999': 22011, '7': 1775, 'u': 199, 's': 4457, 'smartphon': 139, 'app': 10038, 'download': 1687, 'thi': 18552, 'cours': 57846, 'go': 5768, 'cover': 5707, 'follow': 3868, 'approach': 1649, '1': 4667, 'hardwar': 252, 'the': 15948, 'includ': 5927, 'intern': 603, 'processor': 67, 'screen': 746, 'size': 401, 'resolut': 111, 'space': 457, 'memori': 514, 'camera': 356, 'radio': 77, 'bluetooth': 15, 'wifi': 48, 'etc': 1295, 'sometim': 242, 'refer': 733, 'as': 1802, 'simpl': 3366, 'work': 10927, 'it': 10336, 'call': 1265, 'differenti': 165, 'earlier': 116, 'method': 2481, 'even': 3159, 'basic': 7544, 'differ': 4439, 'import': 3319, 'understand': 8620, 'nativ': 685, 'creat': 16205, 'use': 30943, 'like': 6918, 'tablet': 156, 'b': 334, 'web': 9349, 'serversid': 133, 'access': 2489, 'websit': 6120, 'browser': 637, 'chrome': 142, 'firefox': 23, 'connect': 1485, 'network': 2710, 'wireless': 62, 'c': 4272, 'hybrid': 165, 'combin': 768, 'they': 640, 'run': 2578, 'offlin': 138, 'written': 640, 'technolog': 2828, 'html5': 930, 'css': 2310, 'learn': 40085, 'program': 11865, 'beginn': 4583, 'advanc': 4523, 'scratch': 2529, 'best': 4505, 'exampl': 3774, 'purpos': 639, 'languag': 6402, 'bjarn': 5, 'stroustrup': 5, 'at': 1079, 't': 2007, 'bell': 19, 'lab': 539, 'variou': 1709, 'game': 8023, 'object': 2863, 'orient': 762, 'teach': 5939, 'everyth': 2840, 
'start': 9846, 'oper': 2049, 'concept': 5356, 'topic': 3624, 'everi': 3307, 'lesson': 1968, 'explain': 2639, 'detail': 2111, 'code': 10566, 'those': 177, 'want': 7517, 'strong': 661, 'knowledg': 4674, 'take': 8249, 'divid': 298, 'three': 772, 'part': 3064, 'first': 4249, 'second': 874, 'third': 389, 'flask': 183, 'get': 12936, 'power': 3724, 'framework': 3927, 'python': 6248, 'that': 2758, 'easi': 3941, 'with': 3218, 'grow': 1179, 'skill': 6782, 'gap': 159, 'need': 8582, 'talent': 190, 'greater': 195, 'ever': 1205, 'befor': 634, 'ground': 370, 'build': 13049, 'minimalist': 16, 'easytolearn': 15, 'launch': 538, 'career': 2139, 'entrepreneur': 630, 'microframework': 5, 'make': 11394, 'say': 1170, 'give': 3527, 'fundament': 2258, 'well': 5677, 'handson': 1175, 'experi': 4830, 'requir': 2747, 'success': 2892, 'turn': 691, 'comput': 2895, 'modern': 1161, 'machin': 3447, 'next': 2270, 'move': 1838, 'beyond': 360, 'static': 396, 'databaseback': 3, 'dynam': 1193, 'we': 6982, 'won': 125, 'stop': 514, 'there': 2555, 'll': 3814, 'also': 8825, 'implement': 3352, 'full': 2213, 'authent': 686, 'system': 4593, 'final': 1871, 'extend': 344, 'integr': 1982, 'thirdparti': 85, 'api': 3290, 'when': 975, 'finish': 1021, 'fulli': 1060, 'equip': 336, 'custom': 3262, '0': 803, 'latest': 1073, 'version': 1717, 'avail': 1762, 'provid': 4089, 'relev': 468, 'inform': 2846, 'content': 4193, 'legaci': 101, 'user': 4048, 'about': 1344, 'author': 1769, 'lalith': 2, 'polepeddi': 1, 'sinc': 1050, 'discov': 1000, 'way': 6688, 'he': 2635, 'tut': 5, 'techpro': 1, 'asid': 36, 'interest': 1709, 'appli': 2492, 'scienc': 1944, 'address': 457, 'problem': 2605, 'parallel': 295, 'domain': 689, 'biolog': 43, 'path': 1312, 'realworld': 766, 'solut': 1939, 'gopher': 2, 'modular': 129, 'testabl': 46, 'one': 6805, 'effici': 1150, 'highlyperform': 2, 'seen': 368, 'increas': 1366, 'rate': 658, 'adopt': 223, 'mainli': 152, 'lightweight': 97, 'display': 642, 'great': 3288, 'robust': 335, 'perform': 2431, 'varieti': 
528, 'open': 1363, 'sourc': 2156, 'reliabl': 251, 'often': 641, 'golang': 135, 'googl': 2488, 'deriv': 116, 'addit': 1406, 'featur': 3921, 'garbag': 38, 'collect': 772, 'safeti': 139, 'dynamictyp': 2, 'capabl': 592, 'builtin': 208, 'larg': 731, 'standard': 922, 'librari': 1729, 'if': 5273, 'foundat': 1253, 'improv': 2169, 'packt': 208, 'video': 7566, 'seri': 1301, 'individu': 618, 'product': 3835, 'put': 1502, 'togeth': 1405, 'logic': 948, 'stepwis': 125, 'manner': 549, 'highlight': 304, 'are': 1837, 'encod': 60, 'strategi': 1906, 'design': 9905, 'pattern': 1661, 'deal': 755, 'storag': 423, 'data': 13379, 'mysql': 1246, 'let': 1956, 'quick': 940, 'look': 4764, 'journey': 1096, 'tutori': 2050, 'leav': 377, 'off': 116, 'you': 20630, 'immedi': 521, 'practic': 6732, 'offer': 1512, 'avoid': 609, 'common': 1111, 'mistak': 382, 'new': 6359, 'initi': 425, 'upon': 450, 'i': 19543, 'o': 180, 'file': 2925, 'command': 1081, 'line': 1129, 'tool': 5328, 'error': 701, 'handl': 1115, 'help': 8440, 'structur': 2830, 'log': 346, 'context': 319, 'packag': 869, 'databas': 4027, 'nosql': 180, 'mongodb': 572, 'across': 609, 'microservic': 511, 'further': 241, 'explor': 1553, 'interact': 1927, 'commandlin': 60, 'via': 543, 'demonstr': 757, 'tune': 162, 'lastli': 117, 'reactiv': 246, 'serverless': 372, 'tip': 1203, 'trick': 673, 'by': 2676, 'end': 4017, 'abl': 3891, 'bridg': 113, 'meet': 711, 'your': 3350, 'expert': 1778, 'esteem': 215, 'ensur': 763, 'smooth': 280, 'aaron': 24, 'torr': 2, 'receiv': 902, 'master': 3775, 'degre': 298, 'mexico': 13, 'institut': 198, 'mine': 411, 'high': 1373, 'largescal': 60, 'current': 1298, 'lead': 1202, 'team': 1311, 'refin': 88, 'focus': 993, 'emphasi': 83, 'continu': 1281, 'deliveri': 344, 'publish': 1113, 'number': 1240, 'paper': 287, 'sever': 1106, 'patent': 45, 'area': 960, 'passion': 552, 'share': 1575, 'idea': 1639, 'other': 1423, 'huge': 593, 'fan': 229, 'backend': 644, 'map': 1052, 'mapbox': 21, 'studio': 1141, 'gl': 13, 'js': 1785, 'wide': 636, 
'survey': 105, 'know': 6316, 'find': 2976, 'format': 964, 'style': 1286, 'interfac': 1577, 'truli': 349, 'respons': 1615, 'complex': 1718, 'assum': 346, 'littl': 824, 'geograph': 36, 'walk': 954, 'step': 6350, 'youll': 1504, 'big': 1536, 'beauti': 745, 'time': 7685, 'pro': 872, 'qtp': 52, 'uft': 119, 'becom': 3834, 'tester': 306, 'award': 203, 'win': 296, 'hp': 87, 'profession': 3727, 'udemi': 2437, 'seller': 149, 'materi': 1862, 'last': 892, 'updat': 2648, 'novemb': 79, '27th': 10, 'over': 818, '000': 1117, 'student': 5802, 'enrol': 2064, 'worldwid': 228, 'commun': 1960, 'still': 910, 'count': 210, 'anoth': 829, 'popular': 1971, 'us': 1604, 'showcas': 117, 'just': 525, 'kept': 93, 'intro': 268, 'free': 3388, 'preview': 346, 'conveni': 105, 'pleas': 999, 'feel': 1943, 'drive': 519, 'lose': 433, 'opportun': 978, 'unifi': 58, 'previous': 131, 'known': 363, 'cost': 853, 'fortun': 184, 'market': 4132, 'leader': 326, 'industri': 1574, 'nowaday': 119, 'mani': 4163, 'lowcost': 62, 'came': 143, 'play': 997, 'control': 2367, 'better': 2374, 'suitabl': 315, 'peopl': 3481, 'nonprogram': 5, 'background': 720, 'support': 2113, 'vb': 60, 'script': 1528, 'howev': 723, 'difficult': 574, 'endtoend': 85, 'train': 3793, 'essenti': 1420, 'gain': 1453, 'competit': 437, 'advantag': 665, 'today': 2285, 'qaevers': 7, 'commit': 260, 'uniqu': 923, 'deliv': 737, 'onlin': 3286, 'in': 9226, 'aspect': 908, 'level': 3771, 'treat': 93, 'freshman': 8, 'singl': 1197, 'thoroughli': 130, 'brush': 139, 'specif': 1334, 'entir': 790, 'overview': 1189, 'checkpoint': 26, 'parameter': 23, 'variabl': 1085, 'output': 551, 'valu': 1379, 'descript': 414, 'environ': 1724, 'read': 1664, 'write': 4464, 'excel': 2542, 'driven': 410, 'keyword': 398, 'how': 8658, 'to': 4275, 'elementor': 148, 'and': 4596, 'wordpress': 2536, 'sale': 1436, 'funnel': 196, 'easili': 1491, 'land': 485, 'page': 3024, 'hello': 258, 'welcom': 1034, 'stun': 230, 'whole': 584, 'convert': 496, 'servic': 3428, 'client': 1624, 'right': 3354, 
'place': 1217, 'usual': 230, 'outsoruc': 1, 'thing': 2782, 'lot': 2582, 'up': 771, 'kind': 658, 'minimum': 126, 'plu': 470, 'wait': 932, 'top': 1403, 'might': 688, 'exactli': 888, 'expect': 686, 'busi': 6762, 'charg': 274, 'simpli': 736, 'flexibl': 402, 'fast': 1267, 'pay': 667, 'dime': 15, 'them': 1258, 'alway': 1283, 'ad': 2588, 'weekli': 74, 'basi': 308, 'anyth': 669, 'els': 526, 'so': 2860, 'for': 3456, 'ps': 43, 'decid': 400, 'day': 2773, 'money': 2872, 'back': 1982, 'question': 4230, 'ask': 1517, 'risk': 792, 'involv': 600, 'whatsoev': 31, 'interview': 1732, 'prepar': 1564, 'save': 1226, 'architectur': 1270, 'fastest': 202, 'world': 4317, 'compani': 2525, 'amazon': 936, 'netflix': 111, 'base': 2605, 'achiev': 1056, 'goal': 1341, 'field': 1113, 'engin': 2948, 'may': 1412, 'salari': 294, 'similar': 445, 'qualif': 78, 'without': 2274, 'benefit': 1202, 'case': 1183, 'what': 5681, 'biggest': 260, 'me': 759, 'demand': 710, 'higher': 347, 'job': 2648, 'good': 2709, 'theoret': 285, 'but': 1424, 'rang': 666, 'secur': 2318, 'pact': 3, 'bulkhead': 3, 'attend': 150, 'spend': 748, 'search': 1485, 'internet': 891, 'alreadi': 1429, 'compil': 367, 'list': 1923, 'answer': 1753, 'ye': 612, 'view': 1350, 'watch': 1380, 'begin': 1714, 'onc': 895, 'tri': 1509, 'word': 1082, 'mark': 270, 'could': 1017, 'yourself': 731, 'then': 1631, 'pass': 733, 'after': 1806, 'face': 702, 'technic': 1096, 'contain': 1434, 'fresher': 58, 'architect': 540, 'difficulti': 168, 'vari': 111, 'experienc': 640, 'happen': 510, 'chang': 2406, 'futur': 1343, 'from': 1533, 'keep': 1365, 'our': 659, 'aim': 533, 'sampl': 603, '3': 4319, 'role': 669, 'soa': 11, '5': 3754, 'is': 1561, 'tailor': 77, 'templat': 1234, 'organ': 1360, '6': 1860, 'disadvantag': 63, 'decompos': 4, 'monolith': 34, 'characterist': 99, '8': 1555, 'bound': 46, '9': 1051, 'point': 1355, 'rememb': 476, 'prefer': 252, 'synchron': 71, 'asynchron': 177, 'orchestr': 110, 'choreographi': 2, 'issu': 720, 'rest': 1651, 'http': 367, 'can': 797, 
'state': 796, 'extens': 697, 'dri': 47, 'semant': 90, 'buy': 897, 'commerci': 225, 'shelf': 14, 'whi': 1981, 'break': 560, 'ubiquit': 12, 'per': 390, 'host': 999, 'model': 3927, 'mike': 60, 'cohn': 1, 'pyramid': 21, 'mock': 124, 'stub': 12, 'erad': 3, 'nondetermin': 1, 'consum': 349, 'contract': 325, 'cdc': 4, 'separ': 346, 'deploy': 1791, 'releas': 599, 'canari': 6, 'mean': 1649, 'repair': 69, 'mttr': 1, 'failur': 159, 'mtbf': 1, 'crossfunct': 13, 'monitor': 448, 'multipl': 1439, 'correl': 68, 'id': 198, 'certif': 1832, 'key': 1483, 'public': 657, 'confus': 300, 'deputi': 3, 'consid': 466, 'conway': 1, 'law': 302, 'circuit': 162, 'breaker': 12, 'idempot': 1, 'scale': 552, 'queri': 1218, 'segreg': 18, 'cqr': 7, 'cach': 183, 'cap': 42, 'theorem': 24, 'discoveri': 80, 'document': 1179, 'scenario': 443, 'major': 694, 'principl': 992, 'pascal': 35, 'maintain': 660, 'crossplatform': 197, 'coder': 139, 'divers': 97, 'excit': 721, 'or': 1067, 'old': 338, 'vital': 117, 'figur': 329, 'bewild': 5, 'guess': 139, 'clean': 444, 'feet': 32, 'reason': 861, 'exist': 781, 'indepth': 530, 'perfect': 1081, 'complet': 7162, 'guid': 3058, 'project': 8473, 'all': 2649, '500mb': 1, 've': 779, 'along': 1974, 'each': 747, 'section': 4350, 'dedic': 315, 'math': 521, 'input': 536, 'statement': 989, 'loop': 880, 'string': 493, 'array': 766, 'record': 827, 'date': 461, 'procedur': 377, 'eas': 407, 'set': 3742, 'progress': 691, 'oldest': 14, 'around': 1298, 'intent': 137, 'encourag': 293, 'highlevel': 90, 'imper': 34, 'precursor': 8, 'compat': 161, 'syntax': 562, 'microsoft': 1847, 'own': 652, 'pace': 566, 'of': 1091, 'intricaci': 16, 'instructor': 2330, 'ms': 167, 'captur': 241, 'actual': 1165, 'desktop': 662, 'verbal': 41, 'do': 2732, 'reduc': 470, 'instruct': 747, 'tour': 83, 'brand': 821, 'show': 3722, 'add': 2318, 'task': 1334, 'resourc': 1629, 'crop': 28, 'gantt': 30, 'chart': 834, 'behav': 62, 'person': 2241, 'timelin': 90, 'macro': 178, 'within': 1717, 'repetit': 129, 'manag': 5750, 
'allow': 2031, 'alongsid': 122, 'titl': 322, 'matter': 647, 'weapon': 112, 'sfx': 10, 'less': 836, 'hour': 2933, 'daw': 15, 'sound': 790, 'effect': 2760, 'layer': 434, 'never': 1216, 'heard': 218, 'georgek': 2, 'music': 603, 'compos': 222, 'sonic': 6, 'specialist': 165, 'portfolio': 550, 'my': 1103, 'limit': 566, 'elder': 24, 'scroll': 153, 'skywind': 2, 'darkfal': 2, 'rise': 139, 'agon': 3, 'bulletrag': 2, 'coca': 5, 'cola': 9, 'mystor': 2, 'more': 3339, 'throughout': 808, 'obstacl': 108, 'importantli': 233, 'overcam': 4, 'moreov': 107, 'lectur': 3520, 'mentor': 165, 'aspir': 198, 'craft': 246, 'explos': 43, 'digit': 1035, 'audio': 536, 'workstat': 29, 'automat': 419, 'rifl': 3, 'handgun': 1, 'ultim': 642, 'destruct': 21, 'furthermor': 95, 'period': 249, 'guidanc': 250, 'forum': 255, 'nich': 275, 'stuck': 301, 'struggl': 368, 'goto': 46, 'special': 853, 'doe': 200, 'factori': 135, 'class': 2803, 'relationship': 793, 'inherit': 290, 'much': 3179, 'see': 3627, 'abil': 795, 'properli': 322, 'techniqu': 3402, 'creation': 845, 'now': 2278, 'storylin': 49, 'elearn': 113, 'wonder': 570, 'prebuilt': 22, 'inde': 63, 'mayb': 382, 'budget': 337, 'purchas': 675, 'vendor': 132, 'perhap': 156, 'quit': 394, 'said': 232, 'agre': 48, 'possibl': 1507, 'ive': 1136, 'broken': 175, 'board': 376, 'review': 1314, 'layout': 598, 'introduc': 1184, 'trigger': 224, 'condit': 600, 'fanci': 67, 'intermedi': 681, 'while': 537, 'bit': 460, 'quicker': 73, 'previou': 509, 'articul': 68, 'no': 1717, 'worri': 506, 'choos': 715, 'dive': 753, 'q': 408, 'built': 967, 'acceler': 165, 'minichalleng': 4, 'replic': 140, 'real': 3643, 'encount': 150, 'realiz': 255, 'warn': 57, 'amount': 530, 'ill': 990, 'think': 1632, 'forward': 781, 'join': 1720, 'hd': 262, 'webmast': 22, 'than': 79, 'week': 591, 'clearli': 286, 'precis': 168, 'thank': 1233, 'pino': 1, 'amato': 1, 'veri': 523, 'useful': 16, 'especi': 438, 'maria': 5, 'kastani': 1, 'm': 499, 'thumb': 20, 'jonathan': 36, 'nichol': 1, 'essential': 1, 'be': 
1005, 'powerus': 9, 'confid': 1888, 'minut': 953, 'away': 585, 'depth': 430, 'optim': 1024, 'visual': 2071, 'pdf': 410, 'mp3': 43, 'anytim': 133, 'everywher': 156, 'bonu': 691, 'premium': 148, 'theme': 989, 'wp': 43, 'social': 1384, 'press': 127, 'complimentari': 14, 'highli': 916, 'customiz': 49, 'ideal': 366, 'technich': 1, 'prior': 564, 'term': 674, 'everyon': 728, 'probabl': 515, 'almost': 627, 'true': 492, 'blog': 1010, 'absolut': 974, 'dozen': 215, 'plugin': 1044, 'sort': 591, 'amaz': 1485, 'imagin': 454, 'membership': 109, 'site': 1782, 'regist': 271, 'r': 1595, 'post': 999, 'pictur': 359, 'comment': 338, 'edit': 1027, 'tag': 427, 'widget': 233, 'transfer': 243, 'ustom': 1, 'appear': 233, ...}
using the keras framework to build the layers of the LSTM model:
model = Sequential()
model.add(Embedding(vocab_size,128, input_length=len_row))
model.add(Dropout(0.2))
model.add(LSTM(64))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
training the model (for 10 epochs):
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
batch_size = 32
history = model.fit(encoded_docs_padded, y_train, batch_size =batch_size,
epochs = 10, validation_split=0.1, verbose = 1,
steps_per_epoch=120,
)
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 2929, 128) 4389504 dropout_2 (Dropout) (None, 2929, 128) 0 lstm_1 (LSTM) (None, 64) 49408 dropout_3 (Dropout) (None, 64) 0 dense_1 (Dense) (None, 1) 65 ================================================================= Total params: 4,438,977 Trainable params: 4,438,977 Non-trainable params: 0 _________________________________________________________________ None Epoch 1/10 120/120 [==============================] - 121s 994ms/step - loss: 0.4385 - accuracy: 0.8010 - val_loss: 0.4034 - val_accuracy: 0.8374 Epoch 2/10 120/120 [==============================] - 118s 988ms/step - loss: 0.3185 - accuracy: 0.8935 - val_loss: 0.2724 - val_accuracy: 0.9140 Epoch 3/10 120/120 [==============================] - 118s 984ms/step - loss: 0.2503 - accuracy: 0.9230 - val_loss: 0.3063 - val_accuracy: 0.8922 Epoch 4/10 120/120 [==============================] - 119s 989ms/step - loss: 0.3097 - accuracy: 0.9010 - val_loss: 0.2989 - val_accuracy: 0.9074 Epoch 5/10 120/120 [==============================] - 118s 987ms/step - loss: 0.2752 - accuracy: 0.9060 - val_loss: 0.3076 - val_accuracy: 0.8932 Epoch 6/10 120/120 [==============================] - 119s 990ms/step - loss: 0.2360 - accuracy: 0.9253 - val_loss: 0.2760 - val_accuracy: 0.9130 Epoch 7/10 120/120 [==============================] - 119s 989ms/step - loss: 0.2258 - accuracy: 0.9255 - val_loss: 0.3169 - val_accuracy: 0.9140 Epoch 8/10 120/120 [==============================] - 119s 990ms/step - loss: 0.2051 - accuracy: 0.9358 - val_loss: 0.2580 - val_accuracy: 0.9130 Epoch 9/10 120/120 [==============================] - 119s 996ms/step - loss: 0.1910 - accuracy: 0.9398 - val_loss: 0.2910 - val_accuracy: 0.9187 Epoch 10/10 120/120 [==============================] - 122s 1s/step - loss: 0.2089 - accuracy: 0.9301 
- val_loss: 0.2725 - val_accuracy: 0.9074
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['accuracy','val_accuracy'])
<matplotlib.legend.Legend at 0x2079347c6a0>
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss','val_loss'])
<matplotlib.legend.Legend at 0x207960d72e0>
predictions = model.predict(encoded_docs_padded_test)
final_table = pd.DataFrame({'preds':np.where(predictions>=0.5, 1, 0).reshape(-1),'true':np.where(y_test.to_numpy()==True, 1, 0)})
from sklearn.metrics import classification_report
pd.DataFrame(classification_report(final_table['true'], final_table['preds'], output_dict=True)).rename(columns= {'0': 'non-Development', '1':'Development'})
non-Development | Development | accuracy | macro avg | weighted avg | |
---|---|---|---|---|---|
precision | 0.890691 | 0.927860 | 0.912908 | 0.909275 | 0.912939 |
recall | 0.892580 | 0.926540 | 0.912908 | 0.909560 | 0.912908 |
f1-score | 0.891634 | 0.927199 | 0.912908 | 0.909417 | 0.912923 |
support | 1415.000000 | 2110.000000 | 0.912908 | 3525.000000 | 3525.000000 |
we get slightly better results with the neural-network approach:
the main improvement is in the '0', or non-development, label results, so the contextual approach does add to the model's ability to classify it correctly.
but, since a neural network model is hard to interpret, there are certainly benefits to using both approaches.