Data Management & Exploratory Analysis

Note: The code for these projects can be found in the “misc-data” repository on my GitHub. While these projects don’t belong to any particular research study, they helped me develop my skills in working with large datasets.

Project 1: This project used the same dataset as the predictive model study. Here, we sought to determine whether types of runners could be identified from running biomechanics using a linear discriminant analysis in R. Types of runners were defined by the industry partner and were based on running goals and/or volume of running performed. Biomechanical data came from studies in which subjects ran on a force-instrumented treadmill in a modified sock liner.
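Although the actual analysis was a multi-group LDA in R, the core idea can be sketched in pure Python for the two-class, two-feature case. The runner groups, feature names, and values below are made-up illustrations, not study data.

```python
# Minimal sketch of Fisher's linear discriminant for two classes and two
# features, in pure Python. The real analysis used LDA in R with more
# groups and biomechanical variables; all data here are illustrative.

def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter(rows, mu):
    # 2x2 within-class scatter: sum of (x - mu)(x - mu)^T over rows
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    # w = Sw^-1 (mu_a - mu_b), with Sw the pooled within-class scatter
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]

def classify(x, w, threshold=0.0):
    # Project onto w and split at the threshold (here the midpoint of the
    # projected class means, which is 0 for these symmetric toy groups)
    return "A" if w[0] * x[0] + w[1] * x[1] > threshold else "B"

# Hypothetical standardized features (e.g., cadence, ground contact time)
performance = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3]]
recreational = [[0.2, 1.0], [0.1, 1.2], [0.3, 0.9]]
w = fisher_direction(performance, recreational)
```

In practice R’s LDA handles more than two groups and returns posterior probabilities, but the discriminant direction above is the same underlying construction.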

Project 2: This project involved the creation of a new data table from 3 existing datasets. The first dataset was an Excel file with data combined across multiple prototype shoe-testing studies. Examples of information contained in each row were: shoe tested, study name (usually the name of a prototype shoe followed by its generation and iteration), season & year (e.g. Fall 2020), subject name, subject characteristics (body mass, height, age), mechanical properties of the shoe tested, subject perception of the shoe tested, and other survey data. The second dataset was a set of folders containing MATLAB files. File names were constructed so that the study name, season & year, and subject initials were embedded within each. Each file contained biomechanical data across multiple shoes for one subject in a single study. The third dataset was an Excel file with more biomechanical data, but for a control condition in each study. Each row contained the following: subject initials, study name, season & year, and the biomechanical data. I was responsible for merging all 3 datasets into a master Excel file. I wrote MATLAB code to do the following (more detail can be found in the code):

  1. Locate the first dataset and load it into MATLAB.
  2. Create a few new columns from existing ones: combine “prototype name” and “prototype generation” into a single column; pull the first letters from the “first name” and “last name” columns to create subject initials; combine “season” and “year” into a single column.
  3. Loop through the folders containing the MATLAB files of biomechanical data. From the file names, extract the following: prototype shoe name + generation, prototype iteration, season + year, and subject initials.
  4. Load in the data and pull out names of shoes tested and their associated biomechanical variables of interest. This created a table containing information from Step 3 and biomechanical data.
  5. Due to irregularities and typos, quite a few names of tested shoes and prototype shoes had to be corrected in the MATLAB files. This was achieved by checking every shoe name against a master list of shoe names from the first dataset. If a name was not found, the incorrect name was replaced with the correct one, either selected from the master list or entered manually.
  6. A combined table was then created by merging the first dataset and the table of biomechanical data using tested shoe name, subject initials, prototype shoe name + generation, and season + year.
  7. More code was then written to combine the above table with the third dataset, once again using tested shoe name, subject initials, prototype shoe name + generation, and season + year.
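The name-correction and key-based merging in steps 5–7 can be sketched as follows. The original code was MATLAB; here is a simplified Python version where the field names, shoe names, and row contents are hypothetical stand-ins, and fuzzy matching (stdlib `difflib`) stands in for the interactive selection from the master list.

```python
# Simplified sketch of steps 5-7: correct misspelled shoe names against a
# master list, then left-join datasets on a composite key. All names and
# fields below are hypothetical examples, not the actual study data.
import difflib

def correct_shoe_name(name, master):
    """Step 5: replace a misspelled shoe name with its closest master-list
    match (in the real workflow, unmatched names were fixed manually)."""
    if name in master:
        return name
    match = difflib.get_close_matches(name, master, n=1, cutoff=0.6)
    return match[0] if match else name

def make_key(row):
    """Steps 2-3: the composite key used for the merges in steps 6-7."""
    return (row["shoe"], row["initials"], row["proto_gen"], row["season_year"])

def merge(primary_rows, secondary_rows):
    """Steps 6-7: left-join secondary columns onto primary rows by key."""
    lookup = {make_key(r): r for r in secondary_rows}
    merged = []
    for row in primary_rows:
        extra = lookup.get(make_key(row), {})
        merged.append({**row,
                       **{k: v for k, v in extra.items() if k not in row}})
    return merged
```

The same `merge` call covers both step 6 (first dataset + biomechanical table) and step 7 (adding the control-condition dataset), since both joins use the same composite key.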

Project 3: The overall goal of this project was to use GPS watch data to answer research questions for our industry partner. GPS watch data was collected from 1000+ runners over an extended period as they went on their routine runs and completed challenges set forth by the company – for a total of 150,000+ unique runs. Because the project coincided with the end of my postdoctoral fellowship, I was only able to complete an exploratory data analysis of running patterns. To do this, I also needed to consolidate data across multiple datasets within a SQLite database. Unless otherwise noted, work was done in Python using Jupyter Notebooks. Some functions I performed during my EDA:

  1. Used Python and SQLite3 to create CSV files from each table within a SQLite database so the data could be used in R, Excel, or Python – performed in Spyder.
  2. Created custom functions to: parse out times of day for use in histograms, parse out runner IDs from strings, and define runner groups from answers to a survey question.
  3. Merged data from datasets using unique identifiers, such as runner ID and task/challenge ID.
  4. Created histograms to check frequency of runs.
  5. Created histograms to visually check for outliers or nonsense data (e.g. value of 0 when 0 is not possible) in variables of interest – cleaned data when necessary.
  6. Ran simple linear regressions to test for relationships among variables of interest. Subsequently generated histograms of Pearson correlation coefficients, using the mean and 95% confidence interval to confirm the presence of relationships.
  7. Derived a categorical variable from a continuous variable, which was then used to run ANOVAs with Tukey post-hoc tests.
  8. Used R to run linear mixed models to determine whether specific running metrics were influenced by the time of day of a run.
  9. Used R to run a linear discriminant analysis to determine whether types of runners could be identified from running metrics.
  10. Used R to run a principal components analysis to see whether sub-groups could be distinguished within a given type of runner.
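The first step above — exporting every table of a SQLite database to CSV so it could be used in R, Excel, or Python — can be sketched with the standard `sqlite3` and `csv` modules. The database path and output directory are placeholders, not the project's actual file layout.

```python
# Sketch of step 1: dump each table in a SQLite database to its own CSV
# file. Paths are placeholders for the real database and output folder.
import csv
import sqlite3

def export_tables_to_csv(db_path, out_dir):
    con = sqlite3.connect(db_path)
    try:
        cur = con.cursor()
        # List all user tables in the database
        tables = [r[0] for r in cur.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for table in tables:
            cur.execute(f'SELECT * FROM "{table}"')
            headers = [d[0] for d in cur.description]  # column names
            with open(f"{out_dir}/{table}.csv", "w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(headers)
                writer.writerows(cur.fetchall())
    finally:
        con.close()
```

One CSV per table keeps the downstream tools agnostic: R and Excel read the files directly, and Python can reload them without a database connection.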
