MODULE 3 – COLLECTING & CLEANING DATA
Identifying Your Data
Data visualizations and infographics generally start with a question. Something needs to be seen, and in order to do that we must give it physical dimension. Physical dimension requires measurement, a way to represent an idea or fact in quantifiable terms. In short, if someone asks you to create a visualization, you first need to identify the data, and the types of data, that will give it shape. You may need to take their initial question and break it down into smaller questions to find what you need. This process is called the operationalization of data. The data will operate as the measure, shape and form of your answer.
In their book Making Data Visual, Danyel Fisher and Miriah Meyer define operationalization as the process of reducing a complex set of factors into a single metric, and of identifying tasks to be performed over the dataset that are a reasonable approximation of the high-level question of interest.
Proxies are an important concept to understand as you start to identify the data that you will need to answer the high-level question. For example, if the question is “What are the top ten best movies of all time?”, proxies for the abstract term “best” may be “ratings”, “awards given”, “critics’ praise” or “box office sales”. Proxies can be further categorized based on their relationship and role within the data:
- Objects: Things or events that exist in the real world, such as movies, theaters, stores, media discs or streams.
- Measures: Quantitative outcomes of objects, for example the ratings of a movie, the sales of discs or the number of streams.
- Groupings or partitions: Attributes or characteristics that separate the data items into groups. These could be regions, genres or years.
- Actions: Words that describe what is being done with or to the data, such as compare, identify or characterize. Actions guide the process of choosing the appropriate visualizations.
Depending on your high-level question and the complexity of the project, operationalization of data may take several rounds of investigation and exploration of proxies before you determine how best to answer the question. Danyel Fisher and Miriah Meyer provide a method for breaking down the abstract into tasks that address the unknowns:
- Identify the components, or what is needed, for each task.
- Look for anything that cannot be addressed by your dataset.
- For any ambiguous or unclear components, define a proxy.
- If there are no ambiguous components, then the task is considered actionable and you can begin working on your visualization.
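As a rough illustration of the proxy idea above, here is a small Python sketch that answers “which movies are best?” using two different proxies. The movie titles, numbers, field names and the top_n helper are all made up for illustration; they are not from the book or from any real dataset.

```python
# Hypothetical records: each dict is an object (a movie) with two measures
# (rating, box office) and one grouping (genre). All values are invented.
movies = [
    {"title": "Movie A", "genre": "drama", "rating": 8.1, "box_office_musd": 120},
    {"title": "Movie B", "genre": "comedy", "rating": 7.4, "box_office_musd": 310},
    {"title": "Movie C", "genre": "drama", "rating": 9.0, "box_office_musd": 45},
    {"title": "Movie D", "genre": "comedy", "rating": 6.8, "box_office_musd": 95},
]


def top_n(records, proxy, n):
    """Action: rank the objects by the chosen proxy measure, highest first."""
    return sorted(records, key=lambda r: r[proxy], reverse=True)[:n]


# The same abstract question gives different answers under different proxies.
best_by_rating = [m["title"] for m in top_n(movies, "rating", 2)]
best_by_sales = [m["title"] for m in top_n(movies, "box_office_musd", 2)]
print(best_by_rating)
print(best_by_sales)
```

The point of the sketch is that the choice of proxy changes the answer, which is why it is worth settling on the proxy before starting on the visualization.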
Understanding Your Data
Human beings are complex and have many facets or dimensions, and the same is true of data. Facebook collects hundreds of dimensions (often referred to as data points or attributes in other fields) on a single user: age, time of activity, location, likes, dislikes, posts, interests, facial features, employer, relationships, marital status, children, friends, etc. It is amazing to consider how much people will share about themselves to conveniently interact with others. Once again, in this context, measures are the quantitative values assigned to these dimensions. I do not want to get too far into data science and repeat everything from our readings, but we should be able to provide brief descriptions of the types of data:
Continuous (Interval and Ratio) Data
- Ratio data are values with a true zero point, like inches or feet, that can be added, subtracted and compared as ratios (“twice as long”)
- Interval data are values with no true zero point, such as temperature in degrees Celsius or Fahrenheit, or pH. Differences between them are meaningful, but they cannot be meaningfully added together or compared as ratios
Ordinal Data – Values that are ordered but cannot be meaningfully added or subtracted, like rankings
Categorical Data – Values without order, such as colors on a color wheel or compass directions
Temporal Data – Times, Duration
Geographical Data – Location, Spatial
Relational Data – Hierarchy, network diagram, family tree
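The difference between ratio and interval data is easier to see with numbers. The sketch below is only an illustration with invented values: it shows that a ratio is meaningful for lengths, but for Celsius temperatures it is an artifact of where zero happens to sit. The helper name celsius_to_kelvin is my own.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return c + 273.15


# Ratio data: lengths have a true zero, so "twice as long" is meaningful.
short_in, long_in = 12.0, 24.0
length_ratio = long_in / short_in  # same ratio whether measured in inches or feet

# Interval data: 20 degrees C is not "twice as hot" as 10 degrees C.
cool_c, warm_c = 10.0, 20.0
naive_ratio = warm_c / cool_c  # looks like 2, but depends on where 0 C sits
true_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences, however, are meaningful on an interval scale.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(length_ratio, naive_ratio, round(true_ratio, 3))
print(difference_c, round(difference_k, 6))
```

On the Kelvin scale the two temperatures differ by only a few percent, while the ten-degree difference is the same on both scales.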
Different types of data can be translated or transformed into other types of data. The most common transformations include “Categorical to ordinal” and vice versa, “Continuous to ordinal” and vice versa, “Reducing cardinality for categorical data”, “Drilldowns”, “Rollups” and “Pivots”. You will need to analyze your data and decide what format is best suited for your visualization. The image below is a quick and easy reference depicting some of the common visualization forms used once you have determined your how and your action.
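To make a few of the transformations named above more concrete, here is a minimal Python sketch of three of them: continuous to ordinal, reducing cardinality, and a rollup. The records, the band cut-offs and the function names are assumptions I made up for illustration, not a standard recipe.

```python
from collections import Counter, defaultdict

# Hypothetical records; all values are invented for illustration.
rows = [
    {"genre": "drama", "year": 2018, "rating": 8.1},
    {"genre": "drama", "year": 2019, "rating": 6.2},
    {"genre": "comedy", "year": 2019, "rating": 7.4},
    {"genre": "horror", "year": 2019, "rating": 4.9},
    {"genre": "western", "year": 2018, "rating": 7.0},
]


def rating_band(rating):
    """Continuous to ordinal: bin a 0-10 rating into ordered bands.

    The cut-offs (7.5 and 5.0) are arbitrary choices for this example.
    """
    if rating >= 7.5:
        return "high"
    if rating >= 5.0:
        return "medium"
    return "low"


def reduce_cardinality(values, keep):
    """Keep the `keep` most common categories and lump the rest into 'other'."""
    common = {v for v, _ in Counter(values).most_common(keep)}
    return [v if v in common else "other" for v in values]


def rollup_mean(records, group_key, measure):
    """Rollup: average a measure within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[measure])
    return {g: sum(vals) / len(vals) for g, vals in groups.items()}


bands = [rating_band(r["rating"]) for r in rows]
genres = reduce_cardinality([r["genre"] for r in rows], keep=1)
mean_by_year = rollup_mean(rows, "year", "rating")

print(bands)
print(genres)
print(mean_by_year)
```

Each transformation trades detail for readability, so it is worth keeping the original columns around in case the visualization needs a drilldown later.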
Cleaning Your Data
You are halfway there. You know how you are going to answer the question, and you know the type of data you need… but you must be confident in the accuracy of your data. How clean and accurate is your data?
Be methodical. When it comes to data, being thorough is a virtue. Here are some guiding principles as you ensure the integrity of your visualization:
- Remove Unwanted Observations
- Remove duplicates
- Remove irrelevant observations or data that does not address the question, focus on what matters.
- Fix Structural Errors such as column typos, mislabeled columns, inconsistent capitalization or formatting
- Filter Unwanted Outliers – Look for values that do not make any sense and verify them. They could be a valid anomaly or a mistake. If they do not pass muster, remove them, as they may skew your results.
- Handle Missing Data – You can either drop the affected observations completely or impute values based on other observations. Put on your detective hat. Either way, develop a mechanism for marking any values you have modified or substituted.
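The principles above can be sketched in a few lines of Python. This is only a minimal example under my own assumptions: the rows are invented, the plausible range passed to the hypothetical clean function stands in for the manual verification of outliers described above, and the median is just one of several ways to fill a missing value.

```python
from statistics import median

# Hypothetical raw rows; all values are invented. None marks a missing value.
raw = [
    {"city": "Washington", "visitors": 120},
    {"city": "washington ", "visitors": 120},  # duplicate once normalized
    {"city": "Baltimore", "visitors": None},   # missing value
    {"city": "Richmond", "visitors": 95},
    {"city": "Annapolis", "visitors": 110},
    {"city": "Arlington", "visitors": 9999},   # suspicious outlier
]


def clean(rows, low, high):
    """Apply the cleaning steps in order and flag any substituted values."""
    # Fix structural errors: trim whitespace and normalize capitalization.
    normalized = [
        {"city": r["city"].strip().title(), "visitors": r["visitors"]}
        for r in rows
    ]

    # Remove duplicates, keeping the first occurrence of each row.
    seen, unique = set(), []
    for r in normalized:
        key = (r["city"], r["visitors"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # Filter outliers: [low, high] is an assumed plausible range. In practice,
    # verify each suspicious value before dropping it.
    in_range = [
        r for r in unique
        if r["visitors"] is None or low <= r["visitors"] <= high
    ]

    # Handle missing data: fill with the median of the known values and
    # mark the row so the substitution stays visible.
    fill = median(r["visitors"] for r in in_range if r["visitors"] is not None)
    cleaned = []
    for r in in_range:
        imputed = r["visitors"] is None
        cleaned.append({
            "city": r["city"],
            "visitors": fill if imputed else r["visitors"],
            "imputed": imputed,
        })
    return cleaned


cleaned = clean(raw, low=0, high=1000)
for row in cleaned:
    print(row)
```

The extra “imputed” field is the marking mechanism: anyone reading the cleaned data can see which values were substituted rather than observed.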
Three Data Sets
Our assignment this week was to gather three data sets: one self-created, another with geographical data, and a third with data from a reliable public data source.
Geographical Data – Washington DC Public Art
I thought it would be nice to have a handy map of all the free public art in downtown DC. There are many roundabouts and statues scattered around where I work, and it would be cool to have a mobile app that provided locations and information on these when you are close by. It is a great walking city. I thought I would start with Ward 2, which includes Dupont Circle, and get to know all the places I could visit nearby. I found a list on Wikipedia. There are a few null values, but it gives me plenty to do on my lunch breaks. The Wikipedia list includes spatial data in the form of GPS coordinates (latitude and longitude), and I think I will be able to use a KML file from DC.gov to create an interactive map. The sources are listed here, and the data is on the first tab of my Excel worksheet download.
List of public art in Washington, D.C., Ward 2 – Wikipedia. (2019, May 22). Retrieved from https://en.wikipedia.org/wiki/List_of_public_art_in_Washington,_D.C.,_Ward_2
DC Wards – Map Overlay – DC Ward Map Overlay (KML) – Code For DC – Open Data Portal. (2019, June 05). Retrieved from http://data.codefordc.org/dataset/dc-wards-map-overlay/resource/be5280da-5a51-4be3-94fd-f94545ca1e15
Reliable Public Data Source – Baseball Stats
I decided to pull the current Major League Baseball standings from Baseball-Reference.com as an example of ordinal data. They are on sheet two of the workbook.
Exporting Data | Sports-Reference.com. (2019, June 05). Retrieved from https://www.sports-reference.com/blog/2016/11/exporting-data
2019 Major League Baseball Standings & Expanded Standings | Baseball-Reference.com. (2019, June 05). Retrieved from https://www.baseball-reference.com/leagues/MLB/2019-standings.shtml
Self-Created Data Source – Tucker & Dale Vs. Evil Death Board
And finally, I gathered all the data from one of my favorite movies about miscommunication. This dark comedy is about affable hillbillies Tucker and Dale, who go on a retreat to their dilapidated mountain cabin “vacation home”, where they are mistaken for murderers by a group of preppy college students. I collected character data, cause of death, time of death and other stats for a possible “Death Board”. See Sheet 3 of the workbook for all the grim details. Learn more about the movie at:
Tucker and Dale vs Evil (2010) – IMDb. (2019, June 09). Retrieved from https://www.imdb.com/title/tt1465522
Data Sets Shared via Google Sheets:
References and Resources for Further Exploration
Tableau Tutorials, Sample Data and Data Resources. (2014, May 01). Retrieved from https://public.tableau.com/en-us/s/resources
Bailey, B. (2018). Data Cleaning 101. Towards Data Science. Retrieved from https://towardsdatascience.com/data-cleaning-101-948d22a92e4
Chapter 3: Data Cleaning Steps and Techniques – Data Science Primer. (2019, June 09). Retrieved from https://elitedatascience.com/data-cleaning
Fisher, D., & Meyer, M. (2018). Making Data Visual: A Practical Guide to Using Visualization for Insight. O’Reilly Media. Retrieved from https://www.amazon.com/Making-Data-Visual-Practical-Visualization/dp/1491928468