Getting and Cleaning Data

Data Science

Before you can work with data you have to get some. This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.
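
The description does not spell out what "tidy" means. As a rough illustration, assuming the common convention of one row per observation and one column per variable, the sketch below reshapes a small wide table into that form. The table, the column names (`subject`, `week1`, `week2`, `steps`) and the helper `to_tidy` are hypothetical, and Python is used only for illustration; the course itself works in R.

```python
# Hypothetical messy table: one row per subject, one column per week.
wide_rows = [
    {"subject": "A", "week1": 5, "week2": 7},
    {"subject": "B", "week1": 3, "week2": 4},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Return one record per (id, variable) pair instead of one column per variable."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every output row
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, id_key="subject", var_name="week", value_name="steps")

for record in tidy_rows:
    print(record)
```

Each output record now holds a single measurement, so filtering, grouping, or merging by `week` no longer requires knowing the original column names in advance.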

OpenCourser is an affiliate partner of Coursera and may earn a commission when you buy through our links.

Rating 4.2 based on 1,077 ratings
Length 5 weeks
Starts Jun 19 (49 weeks ago)
Cost $49
From Johns Hopkins University via Coursera
Instructors Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD
Download Videos On all desktop and mobile devices
Language English
Subjects Data Science, Mathematics
Tags Data Science, Data Analysis, Probability And Statistics

What people are saying

course project

The course project seemed a little funky, especially creating the codebook for an already existing set of data, but was a useful teaching aid.

I struggled with the Course Project Assignment because I didn't understand what I was supposed to do exactly.

The Course Project was daunting at first, but I reviewed my notes over and over again, tried reading from the site where the raw data was made available and constructed images of what the TIDY data should look like.

So honestly, I feel like I'm being ripped off a bit here. I did really enjoy the course project though.

Excellent. Course Project was absolutely brilliant!

So when I took on the course project, I was struggling to work out what I should do.

However, the course project only involves reading data from several txt files and combining them into a single R dataset.

So you can do the course project without understanding anything covered in weeks 1 and 2 of the course.

Could be more interactive and could have more detailed instructions for the Course Project.

Last task for the Course Project uses a function that is not covered in previous lessons, which I think is not OK.

Well structured. Great coverage of useful R libraries for retrieving and tidying up data, and a difficult but valuable course project.

Materials (video and written) only partially prepare the student for the quizzes and course projects.

The last course in this specialization had me second guessing myself on whether or not I could do this (based on how difficult the quizzes/course project were).

The course project was still VERY difficult for this beginning specialization compared to what information is given in the lectures (spent over 16 hours on it and still didn't meet ALL criteria).

learned a lot

Learned a lot.

I learned a lot, but my usually happy & grateful attitude was sorely challenged by the fact that so many facts in the videos and obvious course material were, well, wrong.

I learned a lot!

Good intro to R. Yes, very difficult, but I learned a lot. The course material needs an update.

It is a topic which is very often underestimated, and we all need to learn to get more productive at it, as most of the time is spent on it in the "real world". Thanks. Learned a lot. This course covers the essentials. Explanation could be more elaborate, like the earlier courses. Great course to take!

Have learned a lot and discovered powerful tools and approaches.

Very useful. I learned a lot.

Learned a lot about cleaning data. Quite disappointed with the 'Getting data' part because of the lack of explanation (I had to consult extra sources to understand it), but satisfied with the 'Cleaning data' part.

Very practical course. I learned a lot from this course and can see its practical applications.

Learned a lot, manipulating real data sets to produce "tidy" data sets, suitable for analysis.

I learned a lot from it!

A thorough course on how to structure and clean dirty data before making analyses on the data - very practical course in R. Learned a lot!

Good course, learned a lot especially through the quizzes and the course project, but slides/presentations could be more engaging.

I worked hard, and I learned a lot.

getting and cleaning data

Getting and cleaning data is the third course in the first wave of Johns Hopkins’s data science specialization track on Coursera.

Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including databases, specifically SQL, XML, JSON, and HDF5), and from the web using APIs and web scraping.

Nice course. Excellent course. It gave me an overview of how to get and clean data. The course is best suited for anyone who is a novice in data science. A good course. Great course, very useful and insightful; challenging final project.

Before taking "Getting and Cleaning Data", I had no prior R programming experience aside from completing the R programming course in the data science specialization on Coursera.

I feel that the Getting and Cleaning Data course is highly important, since this is where 80% of the work in data science is done.

Excellent information on getting and cleaning data.

Gets you through the basics and beyond in getting and cleaning data from diverse sources.

Brief introduction to getting and cleaning data.

Excellent course, which brings you to the next level as a Data Scientist. Getting and Cleaning Data principles can be used in a lot of situations.

It should provide a high-level approach to "getting and cleaning data"....and how it fits into the high-level roadmap of what Data Scientists do.

There are quite a few issues with the final assignment of the Getting and Cleaning Data course.

Good. Hi all, the course provides interesting insight into getting and cleaning data.

Thanks for this wonderful session on Getting and Cleaning Data.

tidy data set

It gives an idea of how to prepare a tidy data set.

Nice and challenging exercises. Best course ever. Great course, I've learned a lot about analyzing data sets and creating tidy data sets.

I learnt how to get the data from the web sources other than reading files of various formats, manipulate and group the data, and how to prepare a tidy data set for future analysis.

This is a terrific course on obtaining data from various sources and then cleaning the raw data obtained to form useful tidy data sets.

How do you develop a plan to build a tidy data set (i.e.

Learning how to create a tidy data set was one of the perks of this course. The code for the final assignment is peer reviewed, which doesn't make sense.

In weeks two and three, the course presents a list of data formats and how to read them into R. I would have preferred a better description of why tidy data sets are considered tidy, one that included some side-by-side comparisons and the downstream effects of untidy data.

The course has valuable content, but there is not enough emphasis on how to create a tidy data set.

You kind of learn what a tidy data set is (although the definition is vague), but you would need to see examples of messy data sets and how to convert them to tidy data sets.

The principles of a tidy data set might seem like common sense, but in practice it's more challenging than you might think.

I highly recommend taking this course even if you think you know what a tidy data set is.

peer review

However, the peer reviewed assignments are quite tricky and an excellent opportunity for learning.

For me the instructions for the peer review exercises were not very clear. I have learnt a lot.

Interestingly enough, when looking at the solutions during the peer reviews, they seem to have found way easier solutions than I had.

What concerns me is that they all had similar issues, and are all doing peer review of each other - this means that there is no one that can make sure that their answers are really tidy...

Peer reviewed assessment with students who are unsure of the correct answers = unsure if solution is correct.

The lecture material was high level, and didn't seem to be a good preparation for the quizzes. The description for the final project was not very detailed, and the grading rubric likewise was not very specific for peer review.

The course project is a total letdown, uninteresting and badly worded, leading to total chaos in peer reviewing.

idea what they want for the project and the discussion forum is clogged with people asking for peer reviews.

I really liked this course and believe that my work, although seemingly noob-ish, will get much better as I see others' work from the peer review and the examples noted in the lessons.

The instructions are quite open to interpretation, which means that the final grade which you get via peer review is always going to be debatable.

discussion forums

All the help is in the discussion forums already anyway, so I'm not sure why they need more Mentors.

I spent way too much time on the exams and projects because I believe not enough information was given (I had to spend a lot of time searching through discussion forums, Stack Overflow, help files, etc., and while that is useful experience, it was a much bigger time commitment than expected from the course description). Not worth the effort.

The instructors didn't respond to questions on the discussion forums about quiz items, the majority of assessment items seem to be available on Google and 50% of the peer reviewed assessment I checked used plagiarized solutions.

To make the situation even worse, THE TEACHERS SHOW ABSOLUTELY NO SUPPORT on the discussion forums.

My issues in dealing with these errors would have been alleviated had the professor advocated use of the discussion forums as a good place to discuss issues with the quizzes, errors, software, and more.

As I found in the Maps Coursera course I took, the discussion forums are a great place to supplement your learning from the course.

Overall I learned:
- methods for good variable naming
- lots about R programming
- manipulating data and reading various forms (csv, xml, sql, etc.) into R
- merging data
- keeping just the data you need, deleting what you don't need
- dealing with null values

While the course had frequent frustrating moments, I would say that I did learn a lot, but in order for the course to be more effective, the lectures need to be drastically re-tooled and the discussion forums need to be used to their full potential.

), but it certainly led to a lot of confusion in the interpretation of the instructions, a sentiment reflected in the discussion forums.

I had to struggle a bit on the project, but the discussion forums and a little online search were able to help me get to the right answers.

trying to figure

but the assignment instructions are not clear. A lot of time was wasted trying to figure out which data is which and what we are interested in.

The lesson plans and project were very vague and too much time was spent trying to figure out what was even being asked.

Instead, you'll spend much of your time scurrying around, trying to figure out why what you see on the lecture notes isn't working on your computer.

Either you will spend a MUCH larger time than advertised trying to figure things out on your own (which will give other benefits if you are new to the topic or R, if you have the patience), or you go on to the discussion forums because invariably others have the exact same confusion.

johns hopkins

I've completed 12 MOOCs, 2 bachelor's degrees, and several graduate courses at Stanford, so that is a distinction earned by Johns Hopkins U from among a very wide field.

It is great to know how the cleaning process is performed in R. Another brilliant course from Johns Hopkins University in the data science specialisation.

Thanks Johns Hopkins.

This is the 3rd course in the "Data Science" track, and it continues the tradition that we have come to expect of the Johns Hopkins' Stats/DataSci courses, namely, that of being half assed and almost entirely useless.

This course was just as bad as the rest of the courses in the "Data Science" track through Johns Hopkins University.

This is my third course completed in the Data Science Specialization offered by Johns Hopkins.

Thanks Coursera and Johns Hopkins University for making this happen!

figure out

It's up to you to figure out how to get the assignment done. Google and StackOverflow are your instructors. Really!

Really enjoyed the final project as it challenged you to, with minimal guidance, think through what a tidy dataset really means, and figure out how to make that happen with the dataset you are provided.

Overall very good but the class does not teach enough to make the course work easy to figure out.

Yes, I know you're supposed to do research to help figure out problems, and I have.

As a matter of fact, I have taken other courses on data wrangling to be able to figure out this problem.

regular expressions

Week 3 introduces subsetting and reshaping data and tools like dplyr, and week 4 introduces working with text strings and regular expressions.

Good way to get introduced to the tidyverse packages and to importing and prepping datasets before they can be used for exploratory analysis and modelling. Could have gone a bit more in depth on how to deal with dates and regular expressions.

Also, one of the worst explanations of regular expressions I've ever heard.

Week 4 involves string manipulation, regular expressions and working with dates.

A lot of this is covered in Roger Peng's ebooks "R Programming for Data Science" and "Mastering Software Development in R" (both are freely available; google them). Assessments: The only assessments in the course are 4 quizzes, each of which involves about 5 short programming exercises, and a final project which only involves topics from weeks 3 and 4 (specifically subsetting data, sorting data, reshaping data, and working with regular expressions).

The course also discusses subsetting data, adding variables, merging data, regular expressions and working with dates.

This is really R part 2, getting into file/API handling, data frames, regular expressions etc.

real world

Great course. I like the specialization quite a bit as it contains real world data and difficult enough exercises.

I learned a lot of skills I need for my job and university. Thank you very much; this is a very good course. Very useful to get hands-on experience in data science to solve real world problems!

But I think some important issues in the real world are not discussed enough here, like how to treat missing values and how to deal with messily formatted data.

They do not even bother to provide any useful information (god knows why, maybe they're trying to mimic "real world conditions", but in the real world you can interact with users...

The final project pretty fairly replicated what happens in the real world when you are given a disgustingly awful looking data set and are asked to do something with it.

well explained

Great course, everything is well explained and the exercises are challenging enough to really understand what you are studying.

Good. It taught a lot of tools and introductions. Not well explained....

It's a good course, with the right amount of exercises and well explained classes.

Honestly, I wanted to give a full 5 rating to this course, because the content of the lecture is well explained.

Although I understand that the lecturer did not want to explain every data type, some concepts were not very well explained (e.g.

Well explained.


An overview of related careers and their average salaries in the US.

Graphic Designer/Book Cover Designer $37k

Veterinary/Processing/Animal Care Technician Also Enrichment Coordinator $40k

Cover Rep. $47k

Assistant Cover Editor $54k

Supervisor Concurrent Review Nurse and also Case management $60k

Seasonal Cover Art Designer $67k

Cover Designer $73k

Cover Producer $81k

Senior Cover Story Editor $84k

Recruiter (Also held role of District Lease Analyst ) $86k
