Coursat.ai Dr. Ahmad ElSallab

Transformer networks are the current trend in deep learning. Transformer models have taken the world of NLP by storm since 2017 and have become the mainstream model in almost all NLP tasks. Transformers in CV lagged behind at first, but they have started to take over since 2020.

We will start by introducing attention and transformer networks. Since transformers were first introduced in NLP, they are easier to describe with an NLP example first. From there, we will understand the pros and cons of this architecture. We will also discuss the importance of unsupervised or semi-supervised pre-training for transformer architectures, briefly covering large-scale language models (LLMs) such as BERT and GPT.

This will pave the way to introduce transformers in CV. Here we will try to extend the attention idea into the 2D spatial domain of the image. We will discuss how convolution can be generalized using self-attention, within the encoder-decoder meta-architecture. We will see how this generic architecture is almost the same for images as for text, which makes transformers a generic function approximator. We will discuss channel and spatial attention, and local vs. global attention, among other topics.
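
As a rough illustration of the self-attention idea mentioned above, the following minimal sketch computes scaled dot-product self-attention over a short list of token vectors in plain Python. It is not taken from the course material: the function names `softmax` and `self_attention` are hypothetical, and the queries, keys and values are simply the inputs themselves, whereas a real transformer layer would first apply learned linear projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    tokens: list of n vectors, each a list of d floats.
    Assumption for illustration: Q = K = V = tokens (no learned weights).
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    # Each query is compared with every key, then normalized with softmax.
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]
    # Each output is the attention-weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every token attends to every other token, so the same function would apply to image patches once each patch has been flattened into a vector. This global comparison is what distinguishes it from the fixed local window of a convolution.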

In the next three modules, we will discuss the specific networks that solve the big problems in CV: classification, object detection and segmentation. We will discuss Vision Transformer (ViT) from Google, Shifted Window Transformer (SWIN) from Microsoft, Detection Transformer (DETR) from Facebook AI Research, Segmentation Transformer (SETR) and many others. Then we will discuss the application of transformers in video processing, through Spatio-Temporal Transformers applied to Moving Object Detection, along with a multi-task learning setup.
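
To make the link between images and token sequences concrete, here is a minimal sketch of the patch-splitting step that ViT-style models use to turn an image into tokens. It is an illustration under simplifying assumptions rather than the course's code: the helper name `image_to_patches` is hypothetical, the image is a single-channel list of lists, and the linear projection, position embeddings and class token that follow in a full model are omitted.

```python
def image_to_patches(image, patch):
    """Split an H x W image into non-overlapping patch x patch tiles.

    image: list of H rows, each a list of W pixel values (one channel).
    Returns the flattened tiles in row-major order, one token per tile.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile row by row into a single vector.
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# A 4 x 4 toy image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = image_to_patches(image, patch=2)
```

The 4 x 4 toy image yields four tokens of four values each. A sequence model can then process them much like the words of a sentence.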

Finally, we will show how those pre-trained architectures can be easily applied in practice with the Hugging Face library, through its Pipeline interface.
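
A minimal sketch of what using the Pipeline interface might look like is shown below. The checkpoint names are illustrative choices from the Hugging Face Hub and are not named in the course description, and the helpers `build_vision_pipeline`, `top_label` and `classify` are hypothetical. Running a pipeline assumes that the `transformers` package, a deep learning backend and Pillow are installed, and it downloads the model weights on first use.

```python
# Pipeline task names paired with example checkpoints.
# Assumption: these checkpoints are illustrative, not the course's choices.
EXAMPLE_CHECKPOINTS = {
    "image-classification": "google/vit-base-patch16-224",
    "object-detection": "facebook/detr-resnet-50",
}


def build_vision_pipeline(task):
    """Create a pre-trained pipeline for one of the tasks listed above."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline

    return pipeline(task, model=EXAMPLE_CHECKPOINTS[task])


def top_label(predictions):
    """Return the highest-scoring label from image-classification output.

    The image-classification pipeline returns a list of dictionaries,
    each with a "label" and a "score" entry.
    """
    return max(predictions, key=lambda p: p["score"])["label"]


def classify(image_path):
    """Classify a local image file (the path is supplied by the caller)."""
    classifier = build_vision_pipeline("image-classification")
    return top_label(classifier(image_path))
```

Calling `classify` with the path to a local image returns a single label string. Passing `"object-detection"` to `build_vision_pipeline` gives a DETR-based detector instead, which returns bounding boxes along with labels and scores.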

What's inside

Learning objectives

  • What are transformer networks?
  • State-of-the-art architectures for CV applications such as image classification, semantic segmentation, object detection and video processing
  • Practical application of SoTA architectures like ViT, DETR and SWIN with Hugging Face vision transformers
  • Attention mechanisms as a general deep learning idea
  • Inductive bias and the landscape of DL models in terms of modeling assumptions
  • Transformer applications in NLP and machine translation
  • Transformers in computer vision
  • Different types of attention in computer vision

Syllabus

Introduction
Overview of Transformer Networks
The Rise of Transformers
Inductive Bias in Deep Neural Network Models

Career center

Learners who complete Transformers in Computer Vision - English version will develop knowledge and skills that may be useful to these careers:
Computer Vision Engineer
A Computer Vision Engineer designs, develops, and deploys systems that enable computers to "see" and interpret visual data from images and videos. This role involves creating algorithms for tasks like object detection, image classification, and semantic segmentation, which are directly addressed by the course. Learners of this course will gain deep expertise in Transformer Networks, particularly their application in computer vision. The course's focus on state-of-the-art architectures such as ViT, SWIN, and DETR, alongside practical application using the Hugging Face library, provides a robust foundation for building high-performance vision systems. Understanding concepts like spatial attention and encoder-decoder designs is critical for innovating in this field. This course can help prepare you for the cutting edge of computer vision technology.
Deep Learning Engineer
A Deep Learning Engineer specializes in designing, training, and deploying advanced neural network models to solve complex problems across various domains. This course provides comprehensive knowledge of Transformer Networks, a pivotal architecture in modern deep learning, making it highly relevant for a future Deep Learning Engineer. It delves into the underlying attention mechanisms, encoder-decoder designs, and pre-training strategies that are fundamental to state-of-the-art models. By exploring transformers in both NLP and extensively in computer vision for tasks like image classification, object detection, and video processing, you will build a versatile skill set. The practical application with Huggingface also helps prepare you to implement and optimize these powerful models in real-world scenarios.
Machine Learning Engineer
A Machine Learning Engineer builds and implements predictive models and intelligent systems to automate tasks and extract insights from data. While this is a broad field, the course's deep dive into Transformer Networks, a cutting-edge machine learning architecture, is invaluable for this role. You will learn about their fundamental attention mechanisms, how they function as generic function approximators, and their significant impact across computer vision. The course's coverage of specific applications like image classification, object detection, and video processing using models such as ViT and DETR demonstrates practical model deployment. This foundational understanding of advanced deep learning architectures can help you to develop sophisticated machine learning solutions.
Research Scientist Computer Vision
A Research Scientist Computer Vision professional conducts studies and experiments to advance the state of the art in visual perception technology, often requiring an advanced degree. This course is exceptionally relevant, as it focuses entirely on Transformer Networks in Computer Vision, which represent a significant frontier in the field. Learners will gain expertise in the theoretical underpinnings of attention mechanisms and their practical application in models like ViT, DETR, and SWIN for critical tasks such as image classification, object detection, and semantic segmentation. The exploration of spatio-temporal transformers for video processing and multi-task learning provides a robust toolkit for designing novel vision solutions and publishing impactful research.
AI Scientist
An AI Scientist typically combines research and development, creating innovative AI solutions and advancing the field through scientific inquiry. This role often requires an advanced degree. The course's comprehensive coverage of Transformer Networks, detailing their architectural nuances and practical applications across computer vision, directly supports the work of an AI Scientist. You will explore cutting-edge models like ViT, SWIN, and DETR, understanding their design principles and performance characteristics in classification, detection, and segmentation tasks. The discussion of unsupervised pre-training and multi-task learning further enhances the learner's ability to design robust and efficient AI systems. This course helps build a sophisticated understanding of advanced AI paradigms.
Artificial Intelligence Researcher
An Artificial Intelligence Researcher pushes the boundaries of AI, developing new algorithms, models, and theoretical frameworks. This role often requires an advanced degree. The course's in-depth exploration of Transformer Networks, from their core attention mechanisms to their generalization across domains, is highly beneficial for an aspiring Artificial Intelligence Researcher. You will delve into inductive bias, the pros and cons of these architectures, and advanced concepts like local versus global attention. This foundational knowledge is crucial for understanding the current landscape of deep learning models and for contributing to future advancements in areas such as computer vision and natural language processing. The course helps build a strong analytical framework for research endeavors.
Applied Machine Learning Scientist
An Applied Machine Learning Scientist bridges research and deployment, taking advanced machine learning techniques and applying them to solve specific real-world problems. This course is highly relevant, focusing on Transformer Networks, which are at the forefront of applied deep learning. You will learn how these architectures, initially successful in NLP, are now revolutionizing computer vision across tasks like image classification, object detection, and segmentation. The course delves into practical models such as ViT, DETR, and SWIN, and crucially, provides hands-on experience with the Huggingface library. This blend of theoretical understanding and practical application helps prepare you to implement state-of-the-art solutions effectively in an Applied Machine Learning Scientist role.
Data Scientist Machine Learning Focus
A Data Scientist Machine Learning Focus professional uses advanced statistical and machine learning methods to analyze complex datasets, build predictive models, and extract actionable insights. This course is relevant for those aiming to leverage cutting-edge deep learning architectures in their data science toolkit. By exploring Transformer Networks, you will gain a deep understanding of sophisticated models that excel in tasks ranging from image analysis to natural language processing. The ability to apply state-of-the-art architectures like ViT and DETR, and to integrate them using tools like Huggingface, can significantly enhance a data scientist's capacity to handle unstructured data, particularly visual data, and develop powerful, data-driven solutions.
Autonomous Driving Engineer
An Autonomous Driving Engineer develops the perception, planning, and control systems for self-driving vehicles, where accurate and real-time computer vision is paramount. This course can be highly valuable as it immerses learners in Transformer Networks, particularly their application to computer vision tasks like object detection, semantic segmentation, and video processing. Understanding models such as DETR for precise object localization and Spatio-Temporal Transformers for moving object detection is directly applicable to interpreting complex road environments. The course's emphasis on generic function approximators and advanced attention mechanisms helps build a foundation for developing robust perception systems critical for safe and reliable autonomous driving.
Software Developer - Machine Learning
A Software Developer Machine Learning builds and integrates machine learning capabilities into software applications and platforms. This course is highly pertinent, providing in-depth knowledge of Transformer Networks, which are becoming standard components in many ML-driven applications. You will learn about key architectures in computer vision, such as ViT for image classification and DETR for object detection, essential for developing robust visual features in software. The practical focus on using the Huggingface library and its Pipeline interface directly translates into skills for efficient model integration and deployment. This course can help you write cleaner, more efficient code for machine learning components, ensuring successful application builds.
Robotics Software Engineer
A Robotics Software Engineer designs and implements the intelligence and control systems for robots, often relying on advanced perception capabilities. This course may be helpful as it explores Transformer Networks, a powerful deep learning architecture with increasing relevance in robotics for tasks like object recognition, scene understanding, and navigation. The detailed coverage of computer vision applications, including object detection, semantic segmentation, and spatio-temporal transformers for moving object analysis, directly supports the development of sophisticated robotic perception. Understanding how to leverage models like DETR for accurate object detection and apply advanced attention mechanisms can help you build more intelligent and adaptable robotic systems.
Natural Language Processing Engineer
A Natural Language Processing Engineer develops systems that understand, interpret, and generate human language. This course may be useful because while its primary focus is computer vision, it extensively leverages the foundational principles of Transformer Networks, which originated and revolutionized NLP. You will learn about attention mechanisms, encoder-decoder architectures, and unsupervised pre-training, all critical concepts for NLP models like BERT and GPT, which are briefly discussed. Understanding how these generic function approximators operate and their pros and cons can transfer directly to building advanced language models, even as you pivot from the course's computer vision examples. This course helps provide a strong grasp of the core transformer architecture.
MLOps Engineer
An MLOps Engineer focuses on the operational aspects of machine learning, ensuring that models are reliably deployed, monitored, and maintained in production. This course may be helpful as a strong understanding of the underlying model architectures is crucial for effective MLOps. By learning about Transformer Networks, their pros and cons, and specific models like ViT and DETR, you will be better equipped to optimize their deployment, anticipate potential issues, and manage their lifecycle. Practical exposure to the Hugging Face library and its Pipeline interface helps build an understanding of how these complex models are packaged and consumed, which is invaluable for an MLOps Engineer.
Bioinformatics Research Scientist
A Bioinformatics Research Scientist analyzes complex biological and medical data, often requiring an advanced degree. This course may be useful for those applying advanced computational methods to visual biological data, such as medical images or microscopy. The knowledge of Transformer Networks and their application in computer vision for tasks like image classification and semantic segmentation can be directly adapted to analyze biological images for disease diagnosis, cell segmentation, or genetic pattern recognition. Understanding attention mechanisms and how to apply state-of-the-art architectures can help you develop novel analytical tools in bioinformatics, even though the course examples are not directly biological.
Quantitative Researcher
A Quantitative Researcher, often in finance, develops complex mathematical and statistical models to predict market movements or evaluate investment strategies. This role typically requires an advanced degree. This course may be helpful for a Quantitative Researcher exploring unconventional data sources or advanced modeling techniques. While not directly finance-related, the deep understanding of Transformer Networks as generic function approximators, capable of identifying complex patterns in high-dimensional data, can be highly transferable. The architectural insights into attention mechanisms and unsupervised pre-training can inform novel approaches to time-series analysis or the processing of alternative data, even if the course focuses on computer vision.

Reading list

We've selected 23 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Transformers in Computer Vision - English version.
Is exceptionally relevant as it contains dedicated chapters on Vision Transformers (ViT) and Object Detection with DETR. It serves as a practical textbook for industry professionals looking to implement state-of-the-art vision models. The text adds significant depth to the course by providing hands-on coding examples that mirror the syllabus objectives.
Is the definitive guide to the Hugging Face library, which is a central component of the course's practical modules. It provides deep insights into the architecture of the original Transformer and its evolution into models like BERT and GPT. It is a vital reference for students looking to move from theory to implementation using the Pipeline interface.
This is a highly practical reference tool focused on the Hugging Face ecosystem, directly supporting the course's final module. It contains recipes for image classification and segmentation tasks using pre-trained models. Industry professionals will find this book particularly useful for rapid prototyping and deployment.
Explores the inner workings of various Transformer architectures, including their application in both NLP and Computer Vision. It provides a detailed look at self-attention matrix equations and multi-head attention, which are core topics in the course syllabus. It is highly valuable as additional reading for those wanting to master the scaling laws of these models.
This interactive textbook is commonly used by academic institutions and includes specific sections on the Vision Transformer (ViT). It provides a blend of math and code, making it a perfect companion for the course's module on image classification. It adds breadth by showing how Transformers compare to traditional Convolutional Neural Networks.
A very recent and authoritative textbook that covers the theoretical advancements in deep learning including attention and transformers. It is ideal for students seeking a rigorous academic treatment of the subject matter beyond what is covered in the video lectures. It serves as a high-level reference for the mathematical generalization of self-attention.
Focuses on solving real-world vision problems and includes discussions on modern architectures that have replaced older CNN-only approaches. It is a useful reference for the 'Transformers in Object Detection' and 'Semantic Segmentation' modules. It adds breadth by discussing data pipelines and model deployment.
This comprehensive text offers a broad overview of computer vision techniques, from classical methods to modern deep learning approaches. It is a useful reference tool for understanding how Transformers generalize classical ideas like spatial and channel attention. The second edition includes updated sections on modern architectures relevant to the course.
Focusing specifically on computer vision, this book helps bridge the gap between traditional CNNs and modern attention-based systems. It is a useful reference for understanding the transition from YOLO-style object detection to DETR. The book is particularly helpful for providing the background knowledge needed to appreciate the 'Rise of Transformers' discussed in the course.
This practical guide covers a wide range of computer vision tasks and includes a focus on transfer learning with pre-trained models. It aligns well with the course's objective of applying SoTA architectures like ViT and SWIN. It is particularly useful for those looking to implement the Hugging Face pipeline demo.
Serves as an excellent practical introduction to the broader field of machine learning and deep learning. It provides the necessary prerequisite knowledge for students who may be new to neural network design patterns. The latest edition includes specific chapters on Attention mechanisms and the Transformer architecture.
Is designed for professionals who want to apply AI to real-world problems quickly and efficiently. It covers the use of pre-trained models and the Hugging Face library, directly supporting the course's goal of practical application. It serves as a gentle introduction to the 'Huggingface Vision Transformers' module.
Since Transformers originated in NLP, this book provides the essential historical context and architectural theory required by the course syllabus. It covers BERT and GPT in detail, aligning with the course's introductory modules on unsupervised pre-training. It is more valuable as additional reading to understand the 'Attention is All You Need' paper.
Provides a robust foundation in PyTorch, which is the preferred framework for many Transformer implementations. It covers the mechanics of deep learning that serve as prerequisites for the more advanced CV topics in the course. It adds depth by explaining the underlying theory of training large-scale models.
Widely considered the 'bible' of deep learning, this book provides the essential theoretical background for understanding neural networks and optimization. It is an indispensable prerequisite for understanding the mathematical foundations of attention mechanisms. While it predates the ViT era, its coverage of foundational concepts like inductive bias is unparalleled.
Offers a concise look at the Transformer-based LLMs like BERT and GPT mentioned in the syllabus. Understanding these models is essential for grasping why Transformers were eventually adapted for Computer Vision. It provides helpful background knowledge on unsupervised pre-training and scaling.
While focused on generative models, this book contains extensive sections on how Transformers are used for creative tasks in both text and vision. It provides a unique perspective on the 'Spatio-Temporal Transformers' discussed in the course's video processing module. It is valuable as additional reading for students interested in the broader applications of attention.
Fundamental reference for learning the PyTorch framework used to build and train the architectures discussed in the course. It provides the prerequisite knowledge for handling image data and tensor operations. Its clear explanations of the computational graph are vital for understanding how Transformers are optimized.
This text focuses on the practical application and deployment of vision models, which complements the course's use of Gradio and Hugging Face. It provides industry-relevant context for why one might choose a specific Transformer architecture based on performance constraints. It is more valuable as additional reading for the implementation phase.
Uses a highly visual style to explain complex concepts like attention and neural network layers. It is excellent for students who find the mathematical equations of self-attention difficult to visualize. It provides a helpful conceptual background for the 'Attention General DL Idea' section of the syllabus.
Provides a structured walkthrough of vision tasks from classification to detection. It is particularly useful for students to review 'Object Detection methods' before diving into the DETR module. It serves as a bridge between foundational CNN knowledge and the advanced Transformer topics covered in the course.
Covers advanced topics including attention mechanisms and their implementation in Keras/TensorFlow. It provides an alternative framework perspective to the PyTorch-heavy resources, making the course more accessible to a wider audience. It is a solid reference for understanding the encoder-decoder meta-architecture.

© 2016 - 2025 OpenCourser