Module 1: Foundations
This module serves as the introduction to the course content and the course Jupyter server, where you will run your analytics scripts. First, you will read about specific examples of how analytics is being employed by accounting firms. Next, you will learn about the capabilities of the course Jupyter server and how to create, edit, and run notebooks on it. After this, you will learn how to write Markdown-formatted documents, an easy way to quickly produce formatted text, including descriptive text inside a course notebook. Finally, you will begin learning about Python, the programming language used in this course for data analytics.
Module 2: Introduction to Python
This module focuses on the basic features of the Python programming language that underlie most data analytics scripts. First, you will read about why accounting students should learn to write computer programs. Second, you will learn about basic data structures commonly used in Python programs. Third, you will learn how to write Python functions, which can be called repeatedly, and how to use them effectively in your own programs. Finally, you will learn how to control the execution of your Python program by using conditional statements and looping constructs. At the conclusion of this module, you will be able to write Python scripts to perform basic data analytic tasks.
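As a brief preview, the following minimal sketch (not taken from the course materials; the account names and amounts are illustrative) combines the constructs covered in this module: a list, a dictionary, a reusable function, a loop, and a conditional statement.

```python
# Basic data structures: a list (ordered) and a dictionary (key-value)
revenues = [1200.0, 950.5, 1870.25]
account = {"name": "Office Supplies", "balance": 430.75}

def total(values):
    """Return the sum of a sequence of numbers."""
    running_sum = 0.0
    for value in values:  # looping construct
        running_sum += value
    return running_sum

# Conditional statement controlling program flow
if total(revenues) > 4000:
    print("Revenues exceed the threshold.")
else:
    print("Revenues are below the threshold.")
```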
Module 3: Introduction to Data Analysis
This module introduces fundamental concepts in data analysis. First, you will read a report from the Association of Accountants and Financial Professionals in Business that explores Big Data in accountancy. Second, you will learn about the Unix file system, which underlies the operating systems used for most big data processing (as well as Linux and macOS desktops and many mobile phones). Third, you will learn how to read and write data to a file from within a Python program. Finally, you will learn about the Pandas Python module, which simplifies many challenging data analysis tasks and includes the DataFrame, a structure that programmatically mimics many features of a traditional spreadsheet.
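To preview how file handling and Pandas fit together, here is a minimal sketch (the file name accounts.csv and its contents are hypothetical examples, not course data):

```python
import pandas as pd

# Write a small comma-separated file from within a Python program
with open("accounts.csv", "w") as fout:
    fout.write("account,balance\n")
    fout.write("Cash,1500.00\n")
    fout.write("Receivables,2300.50\n")

# Read the file back into a DataFrame, which behaves much like a spreadsheet
df = pd.read_csv("accounts.csv")
print(df)
print(df["balance"].sum())
```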
Module 4: Statistical Data Analysis
This module introduces fundamental concepts in statistical data analysis. First, you will read about how to perform many basic Excel tasks by using the Pandas module in Python. Second, you will learn about the Numpy module, which provides support for fast numerical operations within Python. This module will focus on using Numpy with one-dimensional data (i.e., vectors or 1-D arrays); a later module will explore using Numpy for higher-dimensional data. Third, you will learn about descriptive statistics, which characterize a data set by using a few specific measurements. Finally, you will learn about advanced functionality within the Pandas module, including masking, grouping, stacking, and pivot tables.
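The following minimal sketch (with illustrative transaction amounts) previews a one-dimensional Numpy array, simple descriptive statistics, and Pandas masking and grouping:

```python
import numpy as np
import pandas as pd

# A one-dimensional Numpy array (a vector) of invoice amounts
amounts = np.array([120.0, 85.5, 240.0, 60.25, 310.0])

# Descriptive statistics that characterize the data set
print(amounts.mean(), amounts.std(), np.median(amounts))

# Grouping: summarize amounts by expense category
df = pd.DataFrame({
    "category": ["travel", "supplies", "travel", "supplies", "travel"],
    "amount": amounts,
})
print(df.groupby("category")["amount"].sum())

# Masking: select only the large transactions
print(df[df["amount"] > 100])
```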
Module 5: Introduction to Visualization
This module introduces visualization as an important tool for exploring and understanding data. First, the basic components of visualizations are introduced, with an emphasis on how they can be used to convey information; you will also learn how to identify and avoid ways that a visualization can mislead or confuse a viewer. Next, you will learn more about conveying information visually, including the use of form, color, and location. Third, you will learn how to create a simple visualization (a basic line plot) in Python, which will cover creating and displaying a visualization within a notebook, annotating a plot, and improving the visual aesthetics of a plot by using the Seaborn module. Finally, you will learn how to explore a one-dimensional data set by using rug plots, box plots, and histograms.
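As a preview, this minimal sketch (assuming the Matplotlib and Seaborn packages are installed; the data are synthetic) creates an annotated line plot and a histogram:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()  # improve the default plot aesthetics with Seaborn

# A basic, annotated line plot
x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A simple annotated line plot")
ax.legend()
plt.show()

# Exploring a one-dimensional data set with a histogram
data = np.random.normal(size=500)
sns.histplot(data)
plt.show()
```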
Module 6: Introduction to Probability
In this module, you will learn the basics of probability and how it relates to statistical data analysis. First, you will learn about the basic concepts of probability, including random variables, the calculation of simple probabilities, and several theoretical distributions that commonly occur in discussions of probability. Next, you will learn about conditional probability and Bayes' theorem. Third, you will learn to calculate probabilities and to apply Bayes' theorem directly by using Python. Finally, you will learn to work with both empirical and theoretical distributions in Python, and how to model an empirical data set by using a theoretical distribution.
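For a concrete preview, the sketch below applies Bayes' theorem to a hypothetical fraud-screening scenario (all probabilities are invented for illustration) and fits a theoretical normal distribution to a sample by using SciPy:

```python
from scipy import stats

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_fraud = 0.01                 # prior probability of fraud (illustrative)
p_flag_given_fraud = 0.90      # P(flagged | fraud)
p_flag_given_ok = 0.05         # P(flagged | not fraud)

p_flag = (p_flag_given_fraud * p_fraud
          + p_flag_given_ok * (1 - p_fraud))
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")

# Model an empirical sample with a theoretical (normal) distribution
sample = stats.norm(loc=5, scale=2).rvs(size=1000, random_state=42)
mu, sigma = stats.norm.fit(sample)
print(mu, sigma)
```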
Module 7: Exploring Two-Dimensional Data
This module extends what you have learned in previous modules to the visual and analytic exploration of two-dimensional data. First, you will learn how to make two-dimensional scatter plots in Python and how they can be used to graphically identify correlations and outlier points. Second, you will learn how to work with two-dimensional data by using the Numpy module, including a discussion of analytically quantifying correlations in data. Third, you will read about statistical issues that can impair understanding of multi-dimensional data, which will help you avoid them in the future. Finally, you will learn about ordinary linear regression and how this technique can be used to model the relationship between two variables.
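The following minimal sketch (with synthetic data) previews a scatter plot, a correlation measurement, and a simple linear regression fit:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic two-dimensional data: a linear trend plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=100)

# Scatter plot to visually inspect the relationship
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Analytically quantify the correlation
print(np.corrcoef(x, y)[0, 1])

# Ordinary linear regression: fit y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```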
Module 8: Introduction to Density Estimation
Often, as part of exploratory data analysis, a histogram is used to understand how data are distributed, and in fact this technique can be used to compute a probability mass function (or PMF) from a data set, as was shown in an earlier module. However, the binning approach has issues, including a dependence on the number and width of the bins used to compute the histogram. One approach to overcoming these issues is to fit a function to the binned data, which is known as parametric estimation. Alternatively, we can construct an approximation to the data by employing non-parametric density estimation. The most commonly used non-parametric technique is kernel density estimation (or KDE). In this module, you will learn about density estimation and specifically how to employ KDE. One often overlooked aspect of density estimation is the model representation that is generated for the data, which can be used to emulate new data. This concept is demonstrated by applying density estimation to images of handwritten digits and sampling from the resulting model.
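As a preview, this minimal sketch (with a synthetic bimodal sample) builds a KDE with SciPy, evaluates the estimated density, and draws new, emulated points from the resulting model:

```python
import numpy as np
from scipy import stats

# A bimodal sample that a poorly chosen bin width would misrepresent
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 300),
                         rng.normal(2, 1.0, 300)])

# Kernel density estimation: a non-parametric density model
kde = stats.gaussian_kde(sample)

# Evaluate the estimated density on a grid of points
grid = np.linspace(-5, 6, 200)
density = kde(grid)
print(density.max())

# The KDE is also a generative model: sample new data from it
new_points = kde.resample(10, seed=1)
print(new_points.shape)  # (1, 10): one dimension, ten emulated points
```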