Introducing Multimodal Llama 3.2 from Coursera

Join our new short course, Introducing Multimodal Llama 3.2, and learn from Amit Sangani, Senior Director of AI Partner Engineering at Meta, to learn all about the latest additions to the Llama models 3.1 and 3.2, from custom tool calling to multimodality and the new Llama stack.

Open models are a key building block of AI and a key enabler of AI research. With Meta’s family of open models, anyone can download, customize, fine-tune, or build new applications on top of them, allowing AI innovation. The Llama model family now ranges from 1B model parameters to its 405B foundation model, allowing for diverse use cases and applications.

In this course, you’ll learn about the new vision capabilities that Llama 3.2 brings to the Llama family. You’ll learn how to leverage this along with tool-calling, and Llama Stack, which is an open-source orchestration layer for building on top of the Llama family of models.

In detail, you’ll:

1. Learn about the new models, how they were trained, their features, and how they fit into the Llama family.

2. Understand how to do multimodal prompting with Llama and work on advanced image reasoning use cases such as understanding errors on a car dashboard, adding up the total of three restaurant receipts, grading written math homework, and many more.

3. Learn different roles—system, user, assistant, ipython—in the Llama 3.1 and 3.2 family and the prompt format that identifies those roles.

4. Understand how Llama uses the tiktoken tokenizer, and how it has expanded to a 128k vocabulary size that improves encoding efficiency and enables support for seven non-English languages.

5. Learn how to prompt Llama to call both built-in and custom tools with examples for web search and solving math equations.

6. Learn about ‘Llama Stack API’, which is a standardized interface for canonical toolchain components like fine-tuning or synthetic data generation to customize Llama models and build agentic applications.

Start building exciting applications on Llama!

What's inside

Syllabus

Traffic lights

Read about what's good

what should give you pause

and possible dealbreakers

Taught by a Senior Director of AI Partner Engineering at Meta, which lends credibility and practical insights into the Llama 3.2 model

Explores the Llama Stack API, which is a standardized interface for toolchain components, enabling customization and agentic application development

Examines multimodal prompting with Llama, which allows for advanced image reasoning use cases like error detection and data extraction

Covers the tiktoken tokenizer and its expanded vocabulary, which enhances encoding efficiency and broadens language support

Requires familiarity with AI concepts and model training, which may not be suitable for absolute beginners

Focuses on Llama 3.2, which may become outdated as newer models are released, potentially limiting the long-term applicability of the learned skills

Reviews summary

Llama 3.2: multimodality and tool calling

According to learners, this course offers a solid introduction to the new features of Llama 3.1 and 3.2, particularly focusing on multimodal capabilities and tool calling. Students highlight the practical examples, such as analyzing images like car dashboards and receipts, as being highly helpful and clear. While the course is generally concise and well-explained, some learners felt the pace was a bit rushed in places and noted that the section on Llama Stack could be more in-depth. Several reviews also expressed a desire for more advanced coding examples to aid practical implementation.

Mostly clear but some areas brief.

"Very concise and to the point. Clear explanations."

"The concepts are interesting, but the pace felt a bit rushed. Some explanations, like Llama Stack, could have been more in-depth."

"Decent overview, but definitely just an introduction... Good for beginners wanting a taste."

Explanation of tool calling is useful.

"Good overview of Llama 3.1/3.2 updates. The section on tool calling was particularly useful."

"Provides a good foundation on Llama 3.1 and 3.2 updates. The tool calling examples were useful."

"I appreciated the real-world examples, like those for tool calling."

Multimodal use cases are practical.

"Excellent introduction to Llama 3.2's new features, especially the multimodal capabilities. The examples for image reasoning were very practical and easy to follow."

"Fantastic short course! The hands-on examples for multimodal tasks (receipts, car dashboard) were incredibly helpful and demonstrated the power of Llama 3.2 well."

"Really impressed with the multimodal examples. Made it very clear how to apply Llama 3.2's vision capabilities. Highly recommended for anyone interested in leveraging these features."

Requests for more hands-on code.

"I wish there were more advanced coding examples beyond the basics shown."

"However, the lack of significant coding exercises makes it less practical for developers looking to implement these features right away. More code labs would be beneficial."

"As mentioned by others, more detailed examples for practical integration would be a great addition."

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Introducing Multimodal Llama 3.2 with these activities:

Review Foundational AI Concepts

Show steps

Reviewing foundational AI concepts will help you better understand the architecture and training methodologies behind Llama 3.2.

Browse courses on Large Language Models

Show steps

Review the basics of neural networks and deep learning.
Study the transformer architecture and attention mechanisms.
Familiarize yourself with key concepts like tokenization and embedding.

Read 'Attention is All You Need'

Show steps

Reading the original Transformer paper will provide a deeper understanding of the architecture used in Llama 3.2.

View Putting Knowledge to Work on Amazon

Show steps

Download and read the 'Attention is All You Need' paper.
Focus on understanding the multi-head attention mechanism.
Take notes on the key innovations presented in the paper.

Experiment with Multimodal Prompting

Show steps

Practicing multimodal prompting will help you master the techniques taught in the course and develop your ability to create effective prompts for Llama 3.2.

Show steps

Gather a collection of images and text prompts.
Experiment with different prompt formats and styles.
Analyze the model's responses and refine your prompts accordingly.

Four other activities

Expand to see all activities and additional details

Show all seven activities

Document Your Multimodal Experiments

Show steps

Creating a blog post or documentation on your multimodal experiments will solidify your understanding and allow you to share your findings with others.

Show steps

Choose a platform to document your experiments (e.g., blog, GitHub).
Describe your experimental setup, prompts, and results.
Share your documentation with the community and solicit feedback.

Read 'Generative AI with LLMs'

Show steps

Reading this book will provide a broader understanding of generative AI and how Llama 3.2 fits into the larger landscape.

View Putting Knowledge to Work on Amazon

Show steps

Obtain a copy of 'Generative AI with LLMs'.
Read the chapters relevant to prompt engineering and tool calling.
Apply the techniques learned in the book to your Llama 3.2 projects.

Build a Tool-Calling Application

Show steps

Building a tool-calling application will allow you to apply your knowledge of Llama Stack API and create a practical application that leverages the model's capabilities.

Show steps

Define the functionality of your tool-calling application.
Implement the necessary tools and integrate them with Llama 3.2.
Test and refine your application to ensure it functions correctly.

Contribute to Llama Stack

Show steps

Contributing to Llama Stack will provide valuable experience working with a real-world open-source project and deepen your understanding of the Llama ecosystem.

Browse courses on Open Source

Show steps

Explore the Llama Stack repository on GitHub.
Identify areas where you can contribute (e.g., bug fixes, documentation).
Submit a pull request with your changes and participate in the review process.

Career center

Learners who complete Introducing Multimodal Llama 3.2 will develop knowledge and skills that may be useful to these careers:

Artificial Intelligence Engineer

An Artificial Intelligence Engineer develops and implements AI models. This course, focused on the Llama family of models, helps build a foundation in understanding and utilizing multimodal models like Llama 3.2. The course provides practical skills, including tool-calling and use of Llama Stack, valuable for building and customizing AI applications. This education directly translates to the ability to develop and deploy more sophisticated AI systems. Furthermore, it introduces essential roles and prompt formats, expanding the skills of Artificial Intelligence Engineers and their adaptability with AI tools.

See salaries and explore the career path for Artificial Intelligence Engineer

Machine Learning Engineer

A Machine Learning Engineer specializes in building and deploying machine learning models. This course, which delves into Meta's Llama models, is highly relevant for anyone looking to work with cutting-edge language models. In this course, you will learn about the vision capabilities of Llama 3.2, how it leverages tool-calling, and Llama Stack for model orchestration. The course teaches you about multimodal prompting, including advanced image reasoning use cases. This directly improves your ability to implement complex machine learning solutions, improving practical understanding of model training, features, and applications. Furthermore, understanding model tokenizers and vocabulary size directly benefit Machine Learning Engineers when optimizing model performance.

See salaries and explore the career path for Machine Learning Engineer

Natural Language Processing Engineer

A Natural Language Processing Engineer builds systems that enable computers to understand, process, and generate human language. This course provides essential skills for a Natural Language Processing Engineer, especially with its focus on the Llama family of models including the newest multimodal prompting techniques. The course will teach you how to leverage the vision capabilities of Llama 3.2 along with tool-calling and Llama Stack. Understanding the tiktoken tokenizer and its expanded vocabulary for non-English languages directly aids Natural Language Processing Engineers in creating models that can handle diverse datasets and build more robust systems for natural language processing. Multimodal prompt engineering is critical for developing sophisticated systems that understand both text and images.

See salaries and explore the career path for Natural Language Processing Engineer

Computer Vision Engineer

A Computer Vision Engineer develops systems that allow computers to ‘see’ and interpret images and videos. The course’s focus on the vision capabilities of Llama 3.2 and multimodal prompting is directly relevant for a Computer Vision Engineer. The image reasoning use cases covered, such as understanding errors on a car dashboard or grading written math homework, are a microcosm of what computer vision engineers routinely work on. This course will teach you how to leverage tool-calling and the Llama Stack API which enhances the ability to build specialized vision applications. The specific understanding of how Llama handles multimodal input and processes images directly contributes to the capabilities of a Computer Vision Engineer.

See salaries and explore the career path for Computer Vision Engineer

AI Research Scientist

An AI Research Scientist conducts cutting-edge research to advance the field of artificial intelligence. This course focused on the Llama family of models may be helpful for anyone wishing to work with one of the most advanced open source AI models. Gaining familiarity with the Llama models, their training, features, and fitting within the family is useful for research and development. You will learn advanced image reasoning use cases, multimodal prompting, and tool calling. Your understanding of the tokenizers and expanded vocabulary directly contributes to research in model efficiency and language adaptability, expanding the scope of an AI Research Scientist.

See salaries and explore the career path for AI Research Scientist

Data Scientist

A Data Scientist analyzes data to extract meaningful insights and support decision-making. This course helps Data Scientists who need to work with AI models beyond just standard statistical models. Learning about the multimodal applications of Llama 3.2, including image reasoning and tool-calling, is directly applicable to the increasingly complex datasets that Data Scientists encounter. This course provides a strong way to improve how you customize and apply artificial intelligence in a data-driven context, which helps you build powerful models that can take both text and images as inputs. Furthermore, understanding the Llama stack can help a data scientist better integrate machine learning into pipelines.

See salaries and explore the career path for Data Scientist

Robotics Engineer

A Robotics Engineer designs, builds, and programs robots. This course will be useful to those who would like to control or program robots with multimodal models. Understanding how to use Llama 3.2’s vision capabilities, tool-calling, and Llama Stack is a great way to combine AI with automated systems. The image reasoning use cases taught in this course, like understanding car dashboard errors, directly relate to robotic applications in real-world scenarios. A robotics engineer who takes this course will be better prepared to integrate advanced AI models into their work. Further, knowledge of the different roles and prompt formats in Llama models is a great help to those developing AI commands for robots.

See salaries and explore the career path for Robotics Engineer

Software Developer

A Software Developer designs, writes, and tests code for various applications. A Software Developer who wants to develop AI software, apps, and tools may find this course useful. Learning about models, tool calling, and the Llama stack provides developers with the necessary knowledge to implement complex AI features into their applications. This course provides practical skills, such as multimodal prompting and using the Llama Stack API. Furthermore, knowledge of models and their features will help software developers integrate AI more efficiently.

See salaries and explore the career path for Software Developer

AI Product Manager

An AI Product Manager defines the strategy, roadmap, and feature set of AI products. This course can be helpful for those seeking to understand the technical aspects of AI models. Learning about the Llama family of models, including the vision capabilities of Llama 3.2, is useful for product strategy. Knowledge of tool-calling and the Llama Stack API helps an AI Product Manager understand the possibilities of new features or new products. Furthermore, knowing the different roles in the prompt format can be very helpful in planning user interface design for AI applications.

See salaries and explore the career path for AI Product Manager

Chatbot Developer

A Chatbot Developer creates conversational agents that interact with users via text or voice. This course that introduces the Llama family of models may be useful for those looking to develop advanced chatbots. Learning about multimodal prompting with Llama and the different roles that are available in prompt formatting directly impacts the quality of chatbot interactions. Learning the Llama Stack will help you build custom tools. Furthermore, the expanded vocabulary and non-English language support of the models are very helpful for developers that want to create globalized chatbots.

See salaries and explore the career path for Chatbot Developer

Computer Scientist

A Computer Scientist works on the theoretical foundations of information and computation. This course may be helpful for computer scientists who want to explore the practical applications of AI. Understanding the architecture of the Llama family of models, including the training methodology, is a fundamental step in exploring AI concepts. This course also introduces you to tool-calling and the Llama Stack API. Furthermore, studying the tokenizer and the expanded vocabulary directly assists the exploration of natural language processing. Familiarity with the practical applications of these models are useful for a computer scientist.

See salaries and explore the career path for Computer Scientist

Research Engineer

A Research Engineer applies scientific and engineering principles to solve complex problems and develop new technologies. This course, focused on the Llama family of models, may be helpful for research engineers looking for practical applications of AI. This course gives you a deeper understanding of model training, features, and how to leverage tool-calling, and Llama Stack. You will also learn how to work with multimodal prompting, providing you with new methods to research applications of the technology. Learning how the models work behind the scene can provide a research engineer with new avenues for innovation and implementation.

See salaries and explore the career path for Research Engineer

AI Consultant

An AI Consultant advises organizations on how to implement AI. This course may be useful for AI consultants to learn about practical applications of the newest models. Learning about the Llama family of models, including their vision capabilities, and how to use multimodal prompting will help you develop AI solutions for clients. This course introduces tool-calling and the Llama Stack API, which can help an AI consultant to develop a plan. The knowledge of model training, features, and roles can help an AI consultant to speak intelligently on the subject.

See salaries and explore the career path for AI Consultant

Technical Writer

A Technical Writer creates documentation. An individual who wants to develop documentation for AI products may find this course useful when working on documents for a product that utilizes Meta’s Llama family of models. Understanding the different roles in the prompt format and the various features of the model can be useful when writing technical material. Understanding the multimodal capabilities of Llama 3.2, including image processing, is a good benefit for a technical writer. Furthermore, knowing how tool-calling and Llama Stack work can improve the ability to make accurate and impactful technical documentation.

See salaries and explore the career path for Technical Writer

Data Analyst

A Data Analyst works with data to identify trends and patterns. This course may be useful for Data Analysts who want to work with new kinds of modeling techniques. Learning about the Llama family of models, and the way it does multimodal prompting, can introduce new ways of analysis. This course will teach you about Llama Stack, which can help you integrate models into pipelines. The understanding of tokenizers can improve model output. This may open new doors into different analytic techniques.

See salaries and explore the career path for Data Analyst

Introducing Multimodal Llama 3.2

Here's a deal for you

What's inside

Syllabus

Traffic lights

Save this course

Reviews summary

Llama 3.2: multimodality and tool calling

Activities

Career center

Reading list

Share

Similar courses