Curtis Huttenhower, John Quackenbush, Lorenzo Trippa, and Christine Choirat

Today, the principles and techniques of reproducible research are more important than ever, across diverse disciplines from astrophysics to political science. No one wants to do research that can’t be reproduced. Thus, this course is for anyone who is doing data-intensive research. While many of us come from a biomedical background, this course is for a broad audience of data scientists.

To meet the needs of the scientific community, this course will examine the fundamentals of methods and tools for reproducible research. Led by experienced faculty from the Harvard T.H. Chan School of Public Health, you will participate in six modules that will include several case studies that illustrate the significant impact of reproducible research methods on scientific discovery.

This course will appeal to students and professionals in biostatistics, computational biology, bioinformatics, and data science. The course content will blend video lectures, case studies, peer-to-peer engagements, and use of computational tools and platforms (such as R/RStudio and Git/GitHub), culminating in the presentation of a final reproducible research project.

We’ll cover Fundamentals of Reproducible Science; Case Studies; Data Provenance; Statistical Methods for Reproducible Science; Computational Tools for Reproducible Science; and Reproducible Reporting Science. These concepts are intended to translate to fields throughout the data sciences: physical and life sciences, applied mathematics and statistics, and computing.

Consider this course a survey of best practices: we’d like to make you aware of pitfalls in reproducible data science, some past failure and success stories, and tools and design patterns that might help make it all easier. But ultimately it’ll be up to you to take the skills you learn from this course and create your own environment in which you can easily carry out reproducible research, and to encourage and integrate with similar environments for your collaborators and colleagues. We look forward to seeing you in this course and to the research you do in the future!

What's inside

Learning objectives

  • Understand a series of concepts, thought patterns, analysis paradigms, and computational and statistical tools that together support data science and reproducible research
  • Fundamentals of reproducible science, using case studies that illustrate various practices
  • Key elements for ensuring data provenance and reproducible experimental design
  • Statistical methods for reproducible data analysis
  • Computational tools for reproducible data analysis and version control (Git/GitHub, Emacs/RStudio/Spyder), reproducible data (data repositories/Dataverse), reproducible dynamic report generation (R Markdown/R Notebook/Jupyter/Pandoc), and workflows
  • How to develop new methods and tools for reproducible research and reporting
  • How to write your own reproducible paper

Syllabus

Module 1: Introduction to Reproducible Science
Module 2: Fundamentals of Reproducible Science
Definitions and Concepts
Factors affecting reproducibility
Module 3: Case Studies in Reproducible Research
Module 4: Data Provenance
Project Design
Journal Requirements
Repositories
Privacy and Security
Module 5: Computational Tools for Reproducible Science
R and RStudio
Python, Git, and GitHub
Creating a repository
Data sources
Dynamic report generation
Workflows
Module 6: An optional deeper dive into Statistical Methods for Reproducible Science
Prediction Models
Coefficient of determination
Brier score
Area Under the Curve (AUC)
Concordance in survival analysis
Cross-validation
Bootstrap
Simulations
Clustering
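
As a rough illustration of two Module 6 topics, the sketch below computes a Brier score (the mean squared difference between predicted probabilities and observed 0/1 outcomes) and a percentile bootstrap interval for it. It is not taken from the course materials: the function names, the toy data, and the choice of a percentile interval are all assumptions made for this example. The fixed seed is what makes the resampling repeatable from run to run.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bootstrap_interval(y_true, y_prob, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score (hypothetical helper).

    A dedicated random.Random(seed) instance keeps the resampling
    reproducible without touching the global random state.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample observation indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            brier_score([y_true[i] for i in idx], [y_prob[i] for i in idx])
        )
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data, invented purely for illustration.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]

score = brier_score(outcomes, predicted)
interval = bootstrap_interval(outcomes, predicted, seed=42)
print(f"Brier score: {score:.3f}")
print(f"95% bootstrap interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

Calling bootstrap_interval again with the same seed returns the same interval, which is the property a reader re-running the analysis relies on.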

Good to know

Know what's good, what to watch for, and possible dealbreakers
Develops fundamentals and foundational skills for reproducible science in data science
Develops skills in the statistical methods and computational tools used for reproducible science in data science
Teaches foundational methods and tools for reproducible research in data science
Develops skills, knowledge, and tools that are core to current industry practices in data science
Taught by instructors with extensive experience and recognition in the field, including Curtis Huttenhower, John Quackenbush, Lorenzo Trippa, and Christine Choirat

Activities

Be better prepared before your course. Deepen your understanding during and after it. Supplement your coursework and achieve mastery of the topics covered in Principles, Statistical and Computational Tools for Reproducible Data Science with these activities:
Course Materials Compilation
Strengthens your grasp of course concepts by organizing and reviewing key resources.
  • Gather lecture notes, slides, handouts, and assignments.
  • Organize the materials into a logical structure, such as by topic or module.
  • Review the materials regularly to reinforce your understanding.
  • Create summaries or mind maps of key concepts for quick reference.
Statistics Review
Sharpen your statistical skills, ensuring you have a solid foundation for the course's data analysis and modeling components.
  • Review basic statistical concepts, such as descriptive statistics, probability, and inference.
  • Practice solving statistical problems using real-world datasets.
  • Take practice quizzes or mock tests to assess your understanding.
  • Consider using online resources or textbooks for additional support.
Study Group Discussions
Fosters collaboration, improves understanding through peer-to-peer learning, and provides support for completing assignments and projects.
  • Form a study group with classmates.
  • Meet regularly to discuss course material, work on assignments together, or practice solving problems.
  • Take turns leading discussions and presenting your findings.
  • Use online collaboration tools for effective communication outside of meetings.
RStudio Tutorials
Builds proficiency in R's syntax and packages, essential tools for reproducible data science.
  • Go through the RStudio tutorials on data manipulation, visualization, and statistical modeling.
  • Complete the exercises at the end of each tutorial.
  • Apply what you've learned to your own data analysis projects.
Read The Pragmatic Programmer
Provides context and best practices for data science and programming in general, including reusable software, testing, and debugging.
  • Read the book in full.
  • Complete all of the exercises.
  • Briefly summarize your key takeaways and favorite parts of the book.
Data Science Blog Post
Enhances your communication skills and reinforces your understanding of reproducible data science by sharing your knowledge with others.
  • Choose a topic related to data science or reproducible research that you're passionate about.
  • Conduct thorough research to gather relevant information.
  • Write a well-structured and engaging blog post, explaining the concepts clearly.
  • Edit and proofread your post carefully before publishing it.
  • Share your post on social media and engage with readers in the comments section.
Kaggle Competitions
Boosts your data science skills by exposing you to real-world datasets and competitions.
  • Choose a Kaggle competition related to your interests.
  • Download the competition data and explore it.
  • Develop and train your machine learning model.
  • Submit your model and check your score.
  • Analyze your results to identify areas for improvement.
Personal Data Science Project
Provides hands-on experience in applying your reproducible data science skills to a project of your choice.
  • Define your project goals and objectives.
  • Gather and clean your data from various sources.
  • Explore your data using statistical techniques and visualizations.
  • Build and evaluate machine learning models to solve your problem.
  • Communicate your findings and insights in a clear and concise report.
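
One way to keep a personal project like this reproducible is to record, alongside the results, what went into each run. The sketch below is a minimal example under stated assumptions, not a method prescribed by the course: the manifest fields, the build_manifest helper, and the inline sample data are all hypothetical. It stores a checksum of the input data, the random seed, and the Python version so that a later run can be checked against the original.

```python
import hashlib
import json
import platform
import random


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw input data."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(data: bytes, seed: int) -> dict:
    """Collect the details needed to repeat a run (hypothetical helper)."""
    return {
        "data_sha256": sha256_of_bytes(data),
        "seed": seed,
        "python_version": platform.python_version(),
    }


# Inline sample data stands in for a real input file in this sketch.
raw = b"id,value\n1,3.5\n2,4.1\n"
manifest = build_manifest(raw, seed=2024)

# Seed the random number generator from the manifest so the recorded
# seed and the seed actually used cannot drift apart.
random.seed(manifest["seed"])

# Serialize with sorted keys so the manifest diffs cleanly under version control.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Committing a manifest like this next to the report lets collaborators confirm they are working from the same data and settings before comparing results.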

Career center

Learners who complete Principles, Statistical and Computational Tools for Reproducible Data Science will develop knowledge and skills that may be useful to these careers:
Data Scientist
To be successful as a Data Scientist, one must be familiar with using computational and statistical tools to analyze and interpret data. This course provides learners with a comprehensive foundation in these tools, while also establishing the importance of reproducibility. These concepts are crucial for a Data Scientist, as they allow for the accurate and reliable analysis of data, which is essential for making informed decisions. Additionally, a Data Scientist must be able to clearly communicate their findings. This course places a strong emphasis on reproducible reporting, and as such, learners will be well-equipped with the skills necessary for communicating their research effectively.
Statistician
Statisticians play a vital role in analyzing and interpreting data. This course provides a thorough overview of the statistical methods used in reproducible data science, including prediction models, cross-validation, and bootstrapping. These methods are essential for Statisticians, as they allow for the accurate and reliable analysis of data, which is essential for making informed decisions. Furthermore, Statisticians must be able to clearly communicate their findings. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their research.
Computational Biologist
Computational Biologists use computational and statistical tools to analyze and interpret biological data. This course provides a solid foundation in these tools, while also highlighting the importance of reproducibility. These concepts are essential for Computational Biologists, as they allow for the accurate and reliable analysis of data, which is critical for making informed decisions in the field of biology. Additionally, Computational Biologists must be able to effectively communicate their findings. This course places a strong emphasis on reproducible reporting, ensuring that learners develop the skills necessary for communicating their research clearly and concisely.
Bioinformatician
Bioinformaticians use computational tools to analyze and interpret biological data. This course provides a comprehensive overview of the computational tools used in reproducible data science, including Git, GitHub, RStudio, and Python. These tools are essential for Bioinformaticians, as they allow for the accurate and reliable analysis of data, which is critical for making informed decisions in the field of biology. Furthermore, Bioinformaticians must be able to clearly communicate their findings. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their research.
Data Analyst
Data Analysts use computational and statistical tools to analyze and interpret data. This course provides a solid foundation in these tools, while also emphasizing the importance of reproducibility. These concepts are crucial for Data Analysts, as they allow for the accurate and reliable analysis of data, which is essential for making informed decisions. Additionally, Data Analysts must be able to effectively communicate their findings. This course places a strong emphasis on reproducible reporting, ensuring that learners develop the skills necessary for communicating their research clearly and concisely.
Research Scientist
Research Scientists use computational and statistical tools to analyze and interpret data. This course provides a comprehensive overview of the computational and statistical tools used in reproducible data science, including R, Python, and Git. These tools are essential for Research Scientists, as they allow for the accurate and reliable analysis of data, which is critical for making informed decisions in various fields of research. Furthermore, Research Scientists must be able to clearly communicate their findings. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their research.
Software Engineer
Software Engineers use computational tools to design, develop, and maintain software systems. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Software Engineers, as they allow for the efficient and reliable development of software systems. Additionally, Software Engineers must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Machine Learning Engineer
Machine Learning Engineers use computational and statistical tools to develop and deploy machine learning models. This course provides a comprehensive overview of the computational and statistical tools used in reproducible data science, including R, Python, and Git. These tools are essential for Machine Learning Engineers, as they allow for the accurate and reliable development and deployment of machine learning models. Furthermore, Machine Learning Engineers must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Data Engineer
Data Engineers use computational tools to design, build, and maintain data pipelines. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Data Engineers, as they allow for the efficient and reliable development and maintenance of data pipelines. Additionally, Data Engineers must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Database Administrator
Database Administrators use computational tools to design, build, and maintain databases. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Database Administrators, as they allow for the efficient and reliable development and maintenance of databases. Additionally, Database Administrators must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Systems Administrator
Systems Administrators use computational tools to design, build, and maintain computer systems. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Systems Administrators, as they allow for the efficient and reliable development and maintenance of computer systems. Additionally, Systems Administrators must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Network Administrator
Network Administrators use computational tools to design, build, and maintain computer networks. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Network Administrators, as they allow for the efficient and reliable development and maintenance of computer networks. Additionally, Network Administrators must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Security Analyst
Security Analysts use computational tools to design, build, and maintain computer security systems. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Security Analysts, as they allow for the efficient and reliable development and maintenance of computer security systems. Additionally, Security Analysts must be able to clearly communicate their design decisions and code. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Forensic Analyst
Forensic Analysts use computational tools to analyze and interpret digital evidence. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Forensic Analysts, as they allow for the efficient and reliable analysis and interpretation of digital evidence. Additionally, Forensic Analysts must be able to clearly communicate their findings. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.
Penetration Tester
Penetration Testers use computational tools to test the security of computer systems and networks. This course provides a solid foundation in computational tools, including Git, GitHub, and Python. These tools are essential for Penetration Testers, as they allow for the efficient and reliable testing of the security of computer systems and networks. Additionally, Penetration Testers must be able to clearly communicate their findings. This course emphasizes reproducible reporting, ensuring that learners develop the skills necessary for effectively communicating their work.

Reading list

We've selected 13 books that we think will supplement your learning. Use these to develop background knowledge, enrich your coursework, and gain a deeper understanding of the topics covered in Principles, Statistical and Computational Tools for Reproducible Data Science.
A comprehensive introduction to Bayesian statistics, covering Bayesian modeling, Markov chain Monte Carlo (MCMC) methods, and hierarchical models. Provides a valuable resource for learners interested in Bayesian approaches to statistical inference.
A comprehensive textbook on predictive modeling, covering model evaluation, variable selection, and model deployment. Provides a solid foundation for learners interested in building and deploying predictive models.
A practical guide to R programming, covering data manipulation, visualization, statistical modeling, and reproducible research. Essential for learners who want to use R for data analysis.
A comprehensive guide to data science with Python, covering data manipulation, visualization, machine learning, and deep learning. Provides practical examples and exercises to help learners master data science techniques in Python.
A comprehensive textbook on regression and multilevel/hierarchical models, including statistical methods for reproducible data analysis. Provides in-depth coverage of advanced statistical concepts.
Provides an overview of causal inference, including graphical models, structural equation modeling, and counterfactual reasoning. Helpful for learners interested in understanding the principles of causal inference.
A comprehensive textbook that covers fundamental concepts of statistical learning, including prediction models, coefficients of determination, and cross-validation. Essential reading for students interested in statistical methods for reproducible data analysis.
Provides a practical introduction to machine learning with R, including supervised and unsupervised learning algorithms. Offers hands-on examples and exercises to help learners gain proficiency in machine learning techniques.
A comprehensive guide to data manipulation with R, covering data import, cleaning, transformation, and visualization. Essential for learners who need to master data manipulation skills for their data science projects.
A widely-used textbook for introductory statistics and machine learning, covering supervised and unsupervised learning methods. Provides a solid foundation for learners new to these topics.
A comprehensive textbook on deep learning, covering neural networks, convolutional neural networks, and recurrent neural networks. Provides an up-to-date overview of deep learning techniques and their applications.
Provides a comprehensive overview of machine learning algorithms, including supervised and unsupervised learning methods. Offers a theoretical foundation for understanding the strengths and weaknesses of different algorithms.
Provides a non-technical overview of data science, including data mining techniques and data-analytic thinking. Helpful for learners seeking a broader understanding of the field.

© 2016 - 2024 OpenCourser