I started my data science journey back in 2017, when I did my first natural language processing project during my Master's program, using a dataset from a Kaggle competition. The project was titled "Determination of Toxic Comments and Analysis for the Minimization of Unintentional Model Bias." Its primary goal was to design a supervised deep learning model that performs binary classification to separate toxic from non-toxic comments on a social network. My course instructor was really happy with the outcome and gave me a LinkedIn recommendation.
In 2018, when I started my MS program, I knew I wanted to be a Data Scientist. During my MS, I took three data science courses (Data Analytics for Electrical Engineering, Deep Learning, and Advanced Data Analytics), two math courses (Linear Algebra and Random Process), and two programming courses (Data Structures and Algorithms, and Advanced Software Engineering; believe me, this course was super hard!).
I have always been a supporter of online education, and to accomplish my goal I started taking online courses on datacamp.com on January 9, 2018. Three years later, in 2021, after spending more than 600 hours on the platform, I have completed 71 courses, including 4,400 exercises. I have followed four career tracks: 1. Python Programmer, 2. Data Analyst, 3. Data Scientist, and 4. Machine Learning Scientist in Python. You can check out my DataCamp profile here. Recently, I completed another awesome skill track, Natural Language Processing in Python. The track consists of 6 courses.
Course no 1: Introduction to Natural Language Processing in Python
Instructor: Katharine Jarmul, Founder, kjamistan
Used toolbox: re, NLTK, Gensim, polyglot, spaCy, matplotlib, scikit-learn
- Regular expressions, tokenization, and lemmatization: shortening words to their root stems
- Removing stopwords, punctuation, or unwanted tokens
- Bag-of-words, count vectorizer, TF-IDF
- Topic identifications, Text classifications
- Named-Entity Recognition (NER): how to identify the who, what, and where of your texts using pre-trained models on English and non-English text?
- How to build your own fake news classifier?
- Can we predict the movie genre based on the words used in the plot summary?
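The bag-of-words and TF-IDF ideas from this course can be sketched in a few lines of scikit-learn (the toy corpus below is my own illustration, not from the course):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great and the plot was great",
    "the movie was terrible",
]

# Bag-of-words: raw token counts, one row per document
bow = CountVectorizer()
counts = bow.fit_transform(corpus)
print(sorted(bow.vocabulary_))  # the learned vocabulary
print(counts.toarray())

# TF-IDF: down-weights terms that appear in every document (like "the")
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
print(weights.shape)
```

Both vectorizers produce the same document-term shape; only the weighting of each term differs.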
Course no 2: Sentiment Analysis in Python
Instructor: Violeta Misheva, Data Scientist
Used toolbox: textblob, wordcloud, NLTK, pandas, scikit-learn, langdetect, matplotlib
- Types and approaches of Sentiment Analysis
- How to build a word cloud?
- Which of your products are bestsellers, and why?
- How to capture the context with bag-of-words (BOW)?
- N-grams with the CountVectorizer, tokenization, dealing with punctuation
- How to detect the language of the given text?
- Regular expressions, Tokenization, Lemmatization, removing stopwords, TF-IDF
- Logistic Regression, regularization, confusion matrix
Datasets: IMDB movie reviews, Amazon product reviews, Tweets of US airline passengers
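The classification pieces of this course (bag-of-words features, logistic regression with regularization, and a confusion matrix) fit together roughly like this; the six hand-written reviews below are my own stand-in for the real IMDB/Amazon/Twitter datasets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

reviews = [
    "loved this movie, great acting",
    "great plot and a great cast",
    "what a wonderful, great film",
    "terrible movie, awful acting",
    "boring plot, awful cast",
    "awful, terrible, boring film",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Bag-of-words features
vec = CountVectorizer()
X = vec.fit_transform(reviews)

# C is the inverse regularization strength: smaller C = stronger regularization
clf = LogisticRegression(C=1.0)
clf.fit(X, labels)

preds = clf.predict(X)
print(confusion_matrix(labels, preds))
```

In practice you would evaluate on a held-out test split rather than the training data, as the course does.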
Course no 3: Building Chatbots in Python
Instructor: Alan Nichol, Co-founder and CTO of Rasa
Used toolbox: re, spaCy, scikit-learn, rasa NLU
Messaging and voice-controlled devices are the next big platforms, and conversational computing has a big role to play in creating engaging augmented and virtual reality experiences.
- How to create conversational software, such as a chatbot or virtual assistant, to schedule a meeting, book a flight, or search for a restaurant?
- How to use a regular expression to match messages against known patterns, extract key phrases, and transform sentences grammatically?
- How to recognize intent and entities?
- Word vectors, cosine similarity, support vector machines
- Natural Language Understanding for intent recognition and entity extraction
- How to deal with the typos during a conversation with the chatbot?
- How to access a database using natural language?
- Incremental slot filling and negation
- Statefulness, build a chatbot that helps users order coffee
- Asking questions and queuing answers
Dataset: 1. ATIS (Airline Travel Information System) dataset, which contains thousands of sentences from real people interacting with a flight booking system. 2. Hotels database
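The regex-based intent matching covered early in the course can be sketched with nothing but the standard library; the intents and patterns below are hypothetical examples of mine, not the course's or rasa's:

```python
import re

# Hypothetical intent patterns: each intent is recognized by a regex
intent_patterns = {
    "greet": re.compile(r"\b(hello|hi|hey)\b", re.IGNORECASE),
    "book_flight": re.compile(r"\b(book|find)\b.*\bflight\b", re.IGNORECASE),
    "find_restaurant": re.compile(r"\brestaurant\b", re.IGNORECASE),
}

def match_intent(message):
    """Return the first intent whose pattern matches, else 'default'."""
    for intent, pattern in intent_patterns.items():
        if pattern.search(message):
            return intent
    return "default"

print(match_intent("Hey there!"))
print(match_intent("Can you book me a flight to NYC"))
print(match_intent("What's the weather?"))
```

Real systems like rasa NLU replace these hand-written patterns with trained classifiers, which is exactly the progression the course follows.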
Course no 4: Advanced NLP with spaCy
Instructor: Ines Montani, spaCy core developer and co-founder of Explosion AI
Used toolbox: spaCy
If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? What companies and products are mentioned? Which texts are similar to each other?
- Text processing: Part-of-speech tags, Syntactic dependencies, Named entities, rule-based matching, pattern matching
- Data Structures: Vocab, Lexemes, String Store, Doc, Span, Token
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- How to use custom attributes to add your own metadata to the documents, spans, and tokens?
- Scale-up spaCy pipelines and make them fast
- Create training data for spaCy statistical models
- Train and update spaCy neural network models with new data
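The rule-based matching idea can be sketched without spaCy itself: a pattern is a list of per-token predicates slid over a token sequence. This is only a plain-Python sketch of the concept; spaCy's actual Matcher works on Doc objects with richer attributes such as LOWER, POS, and ENT_TYPE:

```python
# Minimal token-pattern matcher -- a sketch of the idea behind spaCy's
# Matcher, not spaCy's implementation.
def match(pattern, tokens):
    """Yield (start, end) spans where every predicate in `pattern` matches."""
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if all(pred(tok) for pred, tok in zip(pattern, tokens[i:i + n])):
            yield (i, i + n)

# "iphone" (case-insensitive) followed by a digit token, analogous to
# [{"LOWER": "iphone"}, {"IS_DIGIT": True}] in spaCy
pattern = [lambda t: t.lower() == "iphone", lambda t: t.isdigit()]

tokens = "I compared the iPhone 11 with the iPhone 12".split()
spans = list(match(pattern, tokens))
print([" ".join(tokens[s:e]) for s, e in spans])  # ['iPhone 11', 'iPhone 12']
```

Matching on token attributes rather than raw strings is what makes this approach more robust than plain regular expressions.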
Course no 5: Spoken Language Processing in Python
Instructor: Daniel Bourke, Machine Learning Engineer, and YouTube content creator
Used toolbox: NumPy, speech_recognition, PyDub, NLTK, spaCy, matplotlib, scikit-learn
We learn to speak far before we learn to read. Even in the digital age, our main method of communication is speech. Spoken Language Processing with Python will help you load, transform and transcribe audio files.
- Converted audio files into sound waves and compared them visually
- Transcribed speech: Audio to text conversion
- Prepared and manipulated audio files: changed audio file attributes such as frame rate, number of channels, and file format
- Built a spoken language processing pipeline: sentiment analysis, named entity recognition, and text classification.
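At its core, a sound wave is just an array of amplitude samples. As a small NumPy sketch, here is a synthetic one-second 440 Hz tone standing in for the .wav files the course loads (the frame rate is my own choice, not from the course):

```python
import numpy as np

frame_rate = 16000  # samples per second, a common rate for speech audio
t = np.linspace(0, 1, frame_rate, endpoint=False)
sound_wave = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz sine wave

duration = len(sound_wave) / frame_rate
print(f"{len(sound_wave)} frames, {duration:.1f} s, peak amplitude {sound_wave.max():.2f}")
```

Once audio is in this array form, comparing files visually is a matter of plotting the arrays with matplotlib.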
Course no 6: Feature Engineering for NLP in Python
Instructor: Rounak Banik, Data Scientist at Fractal Analytics
Used toolbox: pandas, textatistic, spaCy, scikit-learn
1. How to compute the readability scores of a book/article?
2. How to find the number of hashtags and mentions in a tweet?
Tokenization and Lemmatization:
3. How to perform text cleaning such as removing unnecessary whitespaces, punctuations, special characters, and stopwords?
4. How to perform part-of-speech (POS) tagging, and named entity recognition (NER)?
Bag of words (BoW) and Vectorization:
5. How to perform sentiment analysis on movie reviews using n-gram modeling and Naive Bayes classifier?
6. How to compute TF-IDF weights and the cosine similarity score between two vectors?
7. Building a movie and a TED Talk recommender
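Readability scores like the Flesch reading ease are simple formulas over word, sentence, and syllable counts. Here is a rough stdlib-only sketch of the formula; the vowel-group syllable heuristic is my own simplification, and real tools like textatistic count syllables more accurately:

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch score: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```

Short sentences of short words score high (easy); long, polysyllabic prose scores low.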
I hope this blog helps you organize your thoughts to build your own NLP curriculum. Happy learning!