How I learned NLP and What I learned


I started my data science journey back in 2017. I did my first natural language processing (NLP) project during my Master's program, using a dataset from a Kaggle competition. The project was titled "Determination of Toxic Comments and Analysis for the Minimization of Unintentional Model Bias". Its primary goal was to design supervised learning and deep learning models that perform binary classification to separate toxic from non-toxic comments on a social network. My course instructor was really happy with the outcome and gave me a LinkedIn recommendation.

In 2018, when I started my MS program, I knew I wanted to be a data scientist. During my MS, I took three data science courses (Data Analytics for Electrical Engineering, Deep Learning, and Advanced Data Analytics), two math courses (Linear Algebra and Random Process), and two programming courses: Data Structures and Algorithm, and Advanced Software Engineering (believe me, this course was super hard!).

I have always been a supporter of online education, and to accomplish my goal I started taking online courses on DataCamp on January 9, 2018. Three years later, in 2021, after spending more than 600 hours on this learning platform, I have completed 71 courses, including 4,400 exercises. I have followed four career paths: 1. Python Programmer, 2. Data Analyst, 3. Data Scientist, 4. Machine Learning Scientist in Python. You can check out my DataCamp profile here. Recently, I completed another awesome skill track, Natural Language Processing in Python. The track consists of 6 courses.

Course no 1: Introduction to Natural Language Processing in Python

Instructor: Katharine Jarmul, Founder, kjamistan

Used toolbox: re, NLTK, Gensim, polyglot, spaCy, matplotlib, scikit-learn

Topics covered:

  1. Regular expressions, tokenization, lemmatization, and stemming (shortening words to their root stems)
  2. Removing stopwords, punctuation, or unwanted tokens
  3. Bag-of-words, count vectorizer, TF-IDF
  4. Topic identifications, Text classifications
  5. Named-Entity Recognition (NER): how to identify the who, what, and where of your texts using pre-trained models on English and non-English text?
  6. How to build your own fake news classifier?
  7. Can we predict the movie genre based on the words used in the plot summary?
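
To make the first three topics concrete, here is a minimal sketch of regex tokenization, stopword removal, and a bag-of-words count. It uses only the standard library rather than NLTK or scikit-learn, and the tiny stopword list and the sample sentence are my own illustrative assumptions, not material from the course.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "it"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Count the tokens that remain after stopword removal."""
    return Counter(tok for tok in tokenize(text) if tok not in STOPWORDS)

plot = "The detective is in the city, and the city is full of secrets."
bow = bag_of_words(plot)
print(bow.most_common(1))  # [('city', 2)]
```

A classifier such as the fake news or movie genre model would take counts like these (or their TF-IDF weighted version) as input features.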

Course no 2: Sentiment Analysis in Python

Instructor: Violeta Misheva, Data Scientist

Used toolbox: textblob, wordcloud, NLTK, pandas, scikit-learn, langdetect, matplotlib

Topics covered:

  1. Types and approaches of Sentiment Analysis
  2. How to build a word cloud?
  3. Which of your products are bestsellers and, most of all, why?
  4. How to capture the context with bag-of-words (BOW)?
  5. N-grams with the Count Vectorizer, tokenization, handling punctuation
  6. How to detect the language of the given text?
  7. Regular expressions, Tokenization, Lemmatization, removing stopwords, TF-IDF
  8. Logistic Regression, regularization, confusion matrix

Datasets: IMDB movie reviews, Amazon product reviews, Tweets of US airline passengers
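
A small sketch may help show why n-grams capture context that single words miss. It is written with the standard library only, as a stand-in for scikit-learn's CountVectorizer, and the `ngrams` helper and the sample review are hypothetical examples of mine.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Return the n-grams of the text as space-joined strings."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "The flight was not good, and the food was not good either."
unigram_counts = Counter(ngrams(review, 1))
bigram_counts = Counter(ngrams(review, 2))

# Unigrams only see the word "good"; bigrams show that it is always "not good".
print(unigram_counts["good"])      # 2
print(bigram_counts["not good"])   # 2
```

With unigrams alone, this review looks positive because of the word "good"; the bigram counts preserve the negation.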

Course no 3: Building Chatbots in Python

Instructor: Alan Nichol, Co-founder and CTO of Rasa

Used toolbox: re, spaCy, scikit-learn, rasa NLU

Messaging and voice-controlled devices are the next big platforms, and conversational computing has a big role to play in creating engaging augmented and virtual reality experiences.

Topics covered:

  1. How to create conversational software, such as a chatbot or virtual assistant, that can schedule a meeting, book a flight, or search for a restaurant?
  2. How to use a regular expression to match messages against known patterns, extract key phrases, and transform sentences grammatically?
  3. How to recognize intent and entities?
  4. Word vectors, cosine similarity, support vector machines
  5. Natural Language Understanding for intent recognition and entity extraction
  6. How to deal with the typos during a conversation with the chatbot?
  7. How to query a database using natural language?
  8. Incremental slot filling and negation
  9. Statefulness, build a chatbot that helps users order coffee
  10. Asking questions and queuing answers

Datasets: 1. The ATIS (Airline Travel Information System) dataset, which contains thousands of sentences from real people interacting with a flight booking system. 2. A hotels database.
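
The regular-expression approach to intents and entities can be sketched in a few lines. This is a simplified illustration using only the `re` module, not the Rasa NLU pipeline from the course; the intent names, the patterns, and the sample message are all hypothetical.

```python
import re

# Hypothetical intents and patterns, chosen only for illustration.
INTENT_PATTERNS = {
    "greet": re.compile(r"\b(hello|hi|hey)\b", re.IGNORECASE),
    "book_flight": re.compile(r"\b(book|fly|flight)\b", re.IGNORECASE),
    "goodbye": re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE),
}

# Capture capitalized words that follow "from" or "to" as crude entities.
ROUTE_PATTERN = re.compile(r"\b(from|to)\s+([A-Z][a-z]+)")

def match_intent(message):
    """Return the first intent whose pattern matches, else 'default'."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(message):
            return intent
    return "default"

def extract_route(message):
    """Return a dict such as {'from': 'Boston', 'to': 'Denver'}."""
    return {role: city for role, city in ROUTE_PATTERN.findall(message)}

message = "I want to book a flight from Boston to Denver"
print(match_intent(message), extract_route(message))
```

Rules like these are brittle (they miss typos and unseen phrasings), which is the motivation for moving on to word vectors and a trained intent classifier.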

Course no 4: Advanced NLP with spaCy

Instructor: Ines Montani, spaCy core developer and co-founder of Explosion AI

Used toolbox: spaCy

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? What companies and products are mentioned? Which texts are similar to each other?

Topics covered:

  1. Text processing: Part-of-speech tags, Syntactic dependencies, Named entities, rule-based matching, pattern matching
  2. Data Structures: Vocab, Lexemes, String Store, Doc, Span, Token
  3. Find semantic similarities using word vectors
  4. Write custom pipeline components with extension attributes
  5. How to use custom attributes to add your own metadata to the documents, spans, and tokens?
  6. Scale-up spaCy pipelines and make them fast
  7. Create training data for spaCy statistical models
  8. Train and update spaCy neural network models with new data

Course no 5: Spoken Language Processing in Python

Instructor: Daniel Bourke, Machine Learning Engineer, and YouTube content creator

Used toolbox: NumPy, speech_recognition, PyDub, NLTK, spaCy, matplotlib, scikit-learn

We learn to speak far before we learn to read. Even in the digital age, our main method of communication is speech. Spoken Language Processing with Python will help you load, transform and transcribe audio files.

Topics covered:

  1. Converting audio files into sound waves and comparing them visually
  2. Transcribing speech: audio to text conversion
  3. Preparing and manipulating audio files: changing attributes such as frame rate, number of channels, file format, etc.
  4. Building a spoken language processing pipeline: sentiment analysis, named entity recognition, and text classification
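
To show what attributes such as frame rate and number of channels look like in practice, here is a rough sketch that writes a short tone to an in-memory WAV file and reads its attributes back. It uses Python's built-in `wave` module instead of PyDub, and the `make_tone` and `describe` helpers and their default values are my own assumptions for illustration.

```python
import io
import math
import struct
import wave

def make_tone(frame_rate=16000, seconds=0.5, freq=440.0):
    """Return the bytes of a mono, 16-bit WAV file containing a sine tone."""
    n_frames = int(frame_rate * seconds)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / frame_rate))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit audio
        wav.setframerate(frame_rate)  # samples per second
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return buffer.getvalue()

def describe(wav_bytes):
    """Read back the basic attributes of a WAV file held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "frame_rate": wav.getframerate(),
            "n_frames": wav.getnframes(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

info = describe(make_tone())
print(info)
```

The duration falls out of the other two numbers: frames divided by frame rate. Changing the frame rate or channel count of a real recording changes how the same bytes are interpreted, which is why these attributes matter before transcription.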

Course no 6: Feature Engineering for NLP in Python

Instructor: Rounak Banik, Data Scientist at Fractal Analytics

Used toolbox: pandas, textatistic, spaCy, scikit-learn

Topics covered:

  1. How to compute the readability scores of a book/article?
  2. How to find the number of hashtags and mentions in a tweet?

Tokenization and Lemmatization:

  3. How to perform text cleaning, such as removing unnecessary whitespace, punctuation, special characters, and stopwords?
  4. How to perform part-of-speech (POS) tagging and named entity recognition (NER)?

Bag of words (BoW) and Vectorization:

  5. How to perform sentiment analysis on movie reviews using n-gram modeling and a Naive Bayes classifier?
  6. How to compute TF-IDF weights and the cosine similarity score between two vectors?
  7. Building a movie and a TED Talk recommender
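
The TF-IDF and cosine similarity step behind a recommender can be sketched without any libraries. This is a simplified illustration: the weighting formula below is one common TF-IDF variant and differs from the defaults of scikit-learn's TfidfVectorizer, and the three sample titles are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = tf * (log(N / df) + 1), one common variant (an assumption here);
    scikit-learn applies different smoothing and normalization by default.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log(n_docs / doc_freq[term]) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

talks = [
    "the future of space travel",
    "space travel and the future of rockets",
    "how to bake sourdough bread",
]
vecs = tfidf_vectors(talks)
scores = [cosine_similarity(vecs[0], v) for v in vecs[1:]]
print(scores)
```

A recommender built this way simply returns the items with the highest similarity score to the one the user picked; here the second talk scores well above the third, which shares no words with the first.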

I hope this post helps you organize your thoughts and build your own NLP curriculum. Happy learning!

Data Scientist at IDARE in the lone star state.