Introduction
As our experience with the novel coronavirus (SARS-CoV-2) has grown, scientists and clinicians have developed a better understanding of how the disease spreads. Several studies have examined modes of transmission, and epidemiologists have published articles analyzing risk factors inherent in everyday activities. The MyDataHelps digital health platform leverages this information to help users make informed choices about the activities they participate in.
Activity transmission risk
The app focuses on activity-specific risk of viral transmission. For example, is going to the bar riskier than going to the park, all else being equal? How much riskier, and why?
Deliberately excluded are factors specific to the person or the community, including:
- Individual health conditions that might make someone more prone to poor outcomes from viral illness.
- Prevalence of coronavirus within the community.
- Prior exposure to coronavirus, which may convey some degree of immunity for an unknown period of time.
- Use of substances such as alcohol, which may reduce inhibitions and engender riskier behaviors.
- Mask-wearing and other mitigation practices by other participants.
While these factors are important to an individual’s overall risk, conflating them with the activity can undermine the user’s appreciation of activity-based risks. To an individual in a hot spot with a compromised immune system, almost everything might pose significant risk; there is nevertheless value in understanding which activities are riskier than others.
By focusing on activity-specific risk, we empower users. They cannot control whether their community is a hot spot or whether other people will wear masks, but they can control which activities they choose to participate in and be aware of the risks they are taking. According to epidemiologist Emily Landon(20), tools like this can give people “a much better idea about how much risk is associated with the things that they’re going to do.”
App workflow
The app prompts the user to select an activity, such as going to the bar or grocery shopping, and then displays a risk profile for that activity. This profile includes:
- An overall activity transmission risk score of Low, Medium Low, Medium, High, or Very High.
- A summary of characteristics that influenced the risk score, such as crowd size and location. These factors are discussed in more detail in the following section.
The user can adjust the activity characteristics and see the transmission risk score change in real time. This helps the user understand how the various conditions affect risk.
Scoring methodology and risk factors
To determine the transmission risk scoring system and activity characteristics, we first reviewed a number of articles in which panels of expert epidemiologists rated the risks of everyday activities, as well as the CDC’s general activity guidance.(1, 2, 3, 4, 5, 6, 7)
Two of the articles, from Michigan(1) and Texas(2), provided numerical scores for approximately forty different activities. These scores served as our quantitative data for training and validating our scoring algorithm. The numerical analysis is discussed in detail in the following section.
In addition to the numerical scores, the experts provided a narrative discussing what made certain activities riskier than others. For example, on stadiums the Michigan panel stated, “…sports stadiums have crowding and alcohol. People are also likely to cheer, yell and sing, among other noises, which also makes the spread easier.”(1) On indoor restaurants, infectious disease expert Elizabeth Connick said, “I think the biggest risk is being in a closed space and breathing the same air that other people are breathing, and also not wearing masks.”(3)
In reviewing these articles, common themes emerged. The Japanese Ministry of Health based its public information campaign around avoiding the “Three C’s”(8):
- Closed Spaces (with poor ventilation)
- Crowded Places (with many people nearby)
- Close Contact Settings (such as close-range conversations)
The Michigan panel cited similar factors, stating, “…whether it’s inside or outside; proximity to others; exposure time; likelihood of compliance; and personal risk level.”(1)
Dr. William Miller, an epidemiologist at Ohio State University, concurred: “We can think of transmission risk with a simple phrase: time, space, people, place.”(4) Another study from a researcher at the University of Denver compared the risk of holiday gatherings based on the number of people, the size of the room, and whether the gathering was inside or outside.(21)
Based on these consistent themes, we selected four characteristics for our activity rankings:
- Location – Indoor venues, especially small ones with poor ventilation, carry a higher risk of exposure than outdoor ones.
- Crowd size – More people means more opportunities for exposure.
- Close contact – Being in close proximity (an absence of social distancing) increases the chance of exposure.
- Duration – The longer the activity, the greater the chance of exposure.
In addition to these four primary characteristics, two special situations were mentioned repeatedly in the articles: respiratory droplets and shared items.
The Michigan panel explained the role of respiratory droplets: “When people talk loud or sing, it potentially emits more of the virus into the environment, further increasing the risk level.”(1) This conclusion is reinforced by studies examining coronavirus outbreaks in a restaurant in Wuhan, China(9), a call center in South Korea(10), a choir in Washington state(11), and a summer camp in Georgia(12). There is a growing body of evidence to suggest caution for these kinds of activities.(13, 14)
The other special situation involved sharing items. While many environments may contain shared touch surfaces like doorknobs, experts in several articles cited certain kinds of shared serving utensils or equipment as a particular risk. For example, the Texas Medical Association ranks eating at a restaurant as a “7,” but eating at a buffet as an “8.”(2) The CDC guidance advises people to consider whether they will need to, “share any items, equipment, or tools with other people.”(6) The Michigan panel emphasized the need to wipe down shared gym equipment before use.(1)
Our app considers activities that involve shared items, shouting, loud talking, singing, exertion, or aerosolizing medical/dental procedures as being at a higher risk than other activities with equivalent primary characteristics.
Taking all these factors into consideration, the app generates an activity-specific risk score from 1 to 9. This score is presented to the user as a descriptive rating according to the following table:
Risk Score | Risk Title |
---|---|
9 | Very High |
7-8 | High |
5-6 | Medium |
3-4 | Medium Low |
1-2 | Low |
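The mapping in the table above can be sketched as a small lookup function. This is an illustrative sketch only: the function name `risk_title` is an assumption, not the app's actual code.

```python
def risk_title(score: int) -> str:
    """Map a 1-9 risk score to the descriptive rating from the table above."""
    if not 1 <= score <= 9:
        raise ValueError(f"risk score must be between 1 and 9, got {score}")
    if score == 9:
        return "Very High"
    if score >= 7:
        return "High"
    if score >= 5:
        return "Medium"
    if score >= 3:
        return "Medium Low"
    return "Low"


# Scores 3 and 4 fall into the same category.
print(risk_title(3), "/", risk_title(4))
```

Because neighboring scores share a category, a one-point difference between two ratings often leaves the displayed rating unchanged.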
There was some disagreement in the scores between the Michigan and Texas panels, and even within the panels themselves. For example, the Michigan article stated, “There were varying opinions on the safety of flying in an airplane during a pandemic – two experts called it medium risk, one said it’s low risk and the other [said] it’s high risk.”(1) Michigan gave libraries a “3” while Texas gave them a “4.”(1, 2)
Some of this disagreement can be attributed to different assumptions about the characteristics of the activity (different kinds of flights, for instance, or whether “flight” includes time spent in the airport or just the time spent on the aircraft), and in many cases it does not affect the final results (a “3” and a “4” are both categorized as a “Medium Low” risk in the app). Nonetheless, we stress that these are inherently subjective numbers where even experts in epidemiology have differing opinions.
Algorithm development
Starting with the forty activities rated by the Michigan(1) and Texas(2) panels, we assigned ratings in each of the four primary characteristics (location, duration, crowd size, and close contact), the two special situations (shared items and respiratory droplets), and the expected risk score (based on the expert ratings).
We then developed approximately twenty of our own activity ratings to add more granular scenarios or to fill in gaps. For example, the panel data listed “Air Travel” as a single category, but we split this into multiple scenarios for “short non-stop flight”, “multi-stop layover”, and “international flight.” Neither panel included yoga classes or book clubs, so we defined those activities ourselves.
A linear regression model was developed using the Python statistics package statsmodels. Discrete values were assigned to categorical predictors, where lower numbers were less risky and higher numbers were more risky. The assignments used were:
Predictor | Scoring |
---|---|
Close Contact | 1 = Yes (no social distancing is performed; participants count as “close contacts” by the CDC definitions: <6 ft, >= 15 minutes); 0 = No |
Crowd Size | 3 = Big Crowd (more than 25 people), e.g., a crowded bar, big party, concert; 2 = Medium Crowd (11-25 people); 1 = Small Group (5-10 people); 0 = Individuals (less than 5 people) |
Duration | 2 = Long (more than 2 hours); 1 = Short (1-2 hours); 0 = Quick (less than 1 hour) |
Location | 2 = Indoors (Small), e.g., a home, restaurant, bar, or small business; 1 = Indoors (Large); 0 = Outdoors |
Respiratory Droplets | 1 = Yes 0 = No |
Shared Items | 1 = Yes 0 = No |
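The encoding in the table can be sketched as a lookup from category labels to discrete codes. The dictionary layout, the key names, and the example activity below are illustrative assumptions. Only the numeric codes come from the table.

```python
# Discrete codes copied from the predictor table. The key names and the
# dictionary structure are hypothetical, not the app's data model.
PREDICTOR_CODES = {
    "close_contact": {"no": 0, "yes": 1},
    "crowd_size": {"individuals": 0, "small_group": 1, "medium_crowd": 2, "big_crowd": 3},
    "duration": {"quick": 0, "short": 1, "long": 2},
    "location": {"outdoors": 0, "indoors_large": 1, "indoors_small": 2},
    "respiratory_droplets": {"no": 0, "yes": 1},
    "shared_items": {"no": 0, "yes": 1},
}


def encode_activity(activity: dict[str, str]) -> list[int]:
    """Return the six predictor codes in the key order of PREDICTOR_CODES."""
    return [PREDICTOR_CODES[name][activity[name]] for name in PREDICTOR_CODES]


# Hypothetical characterization of a crowded bar, for illustration only.
crowded_bar = {
    "close_contact": "yes",
    "crowd_size": "big_crowd",
    "duration": "long",
    "location": "indoors_small",
    "respiratory_droplets": "yes",
    "shared_items": "no",
}
print(encode_activity(crowded_bar))  # [1, 3, 2, 2, 1, 0]
```

Each activity then becomes one row of six small integers, which is the form a linear regression routine expects.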
The data set (62 observations) was split 70%/30% into training and test sets. The training set was used to develop the algorithm. The test set was then used to validate the algorithm’s results against the expected scores.
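A 70%/30% split of this kind can be sketched with the standard library alone. The source does not say how the split was randomized or rounded, so the seed, the rounding rule, and the function name `split_train_test` are assumptions.

```python
import random


def split_train_test(observations, train_fraction=0.7, seed=0):
    """Shuffle a copy of the observations and split it into train and test lists."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_fraction)  # assumption: round down
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs standing in for the 62 rated activities.
activity_ids = list(range(62))
train, test = split_train_test(activity_ids)
print(len(train), len(test))  # 43 19
```

Under these assumptions the test set holds 19 activities. The training rows would then be passed to a regression routine such as the ordinary least squares model in statsmodels.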
In our validation, the mean absolute error (the average absolute difference between the algorithm’s score and the expert score) was 0.5. For 95% of the test activities, the algorithm-assigned risk category (Low/Medium Low/Medium/High/Very High) matched the category based on the experts’ rating. The sole outlier, “Big Backyard Party”, was expected to be “High” (risk score 7) but was categorized as “Medium” (risk score 6). Overall, the algorithm’s results closely matched the expert scores.
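The two validation measures described above can be sketched as follows. The scores are made-up placeholders, not the study's data, and rounding a continuous prediction to the nearest score from 1 to 9 is an assumption about how model output maps to a category.

```python
def risk_category(score: float) -> str:
    """Round to the nearest score in 1-9 (an assumption), then map to a category."""
    rounded = min(9, max(1, round(score)))
    if rounded == 9:
        return "Very High"
    if rounded >= 7:
        return "High"
    if rounded >= 5:
        return "Medium"
    if rounded >= 3:
        return "Medium Low"
    return "Low"


def mean_absolute_error(predicted, expected):
    """Average absolute difference between predicted and expected scores."""
    return sum(abs(p - e) for p, e in zip(predicted, expected)) / len(expected)


def category_agreement(predicted, expected):
    """Fraction of activities whose predicted and expected categories match."""
    matches = sum(risk_category(p) == risk_category(e) for p, e in zip(predicted, expected))
    return matches / len(expected)


# Placeholder values chosen only to exercise the functions.
expected = [2, 4, 7, 9]
predicted = [2.4, 3.6, 6.4, 8.6]
print(mean_absolute_error(predicted, expected))
print(category_agreement(predicted, expected))
```

In this toy data the third activity mirrors the outlier pattern: a prediction of 6.4 against an expected 7 is numerically close, yet it crosses the boundary between “Medium” and “High”.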
The final algorithm had an adjusted R-squared value of 0.910, indicating that the predictors account for roughly 91% of the variance in the expected risk scores. All six predictors are statistically significant (p < 0.01):
Predictor | P > \|t\| |
---|---|
Close Contact | 0.000 |
Crowd Size | 0.000 |
Duration | 0.000 |
Location | 0.000 |
Respiratory Droplets | 0.000 |
Shared Items | 0.003 |
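For reference, a six-predictor linear model and the adjusted R-squared statistic can be written as below. The symbol names are illustrative, and the intercept term is an assumption, since the source does not state whether one was fitted.

```latex
% Assumed model form: intercept plus the six discretely coded predictors
\hat{y} = \beta_0
        + \beta_1 x_{\text{contact}}
        + \beta_2 x_{\text{crowd}}
        + \beta_3 x_{\text{duration}}
        + \beta_4 x_{\text{location}}
        + \beta_5 x_{\text{droplets}}
        + \beta_6 x_{\text{shared}}

% Adjusted R-squared for n training observations and p predictors (p = 6 here)
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

The adjustment penalizes each added predictor, which makes it a reasonable basis for comparing models with different numbers of predictors.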
We also explored the use of continuous predictors, such as a crowd size ranging from 1 to 100 people, and contact duration in hours. The continuous predictors did not perform as well as the discretely assigned predictors, and would often generate risk predictions that were greater than the maximum risk value of 9. Additionally, continuous predictors decreased usability of the app; it is easier for a user to select from a list of ranges than to guess specifically how many people will be in attendance.
A correlation analysis showed a moderate relationship (Pearson correlation coefficient = 0.4) among close contact, crowd size, and location, so we explored a new combined predictor called “crowd density” to encompass these factors. Crowd density did not perform as well as the individual predictors.
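A Pearson coefficient between two coded predictors can be computed as in this minimal sketch. The two columns are hypothetical values, not the study's data, and the function name `pearson` is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical coded columns for six activities, for illustration only.
crowd_size = [0, 1, 2, 3, 3, 1]
location = [0, 0, 1, 2, 2, 1]
print(round(pearson(crowd_size, location), 2))
```

A coefficient near 0 means the two predictors vary independently. Values approaching 1 would suggest that they carry overlapping information.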
We also explored reduced models with fewer predictors; the adjusted R-squared decreased in every case. The original model, using the six predictors, gave the best performance of all alternatives considered.
Further validation
The expert scores used to drive the algorithm were all subjective opinions, and no formal studies to date have explored a scoring system for activities such as the one used here. However, data from several sources helps us to validate these ratings.
A study of urban mobile phone data examined the effects of mobility on virus spread.(18) The researchers’ model predicted that “a small minority of ‘superspreader’ POIs (points of interest) account for a large majority of infections.” Some of the locations with the highest predicted impact on infections in their model were restaurants/cafes, fitness centers, and churches.
In Pennsylvania, a contact tracing report from the Allegheny County Health Department cited bars, restaurants, parties, gyms, weddings, and funerals as among the activities most responsible for coronavirus cases.(15) Louisiana’s contact tracing dashboard similarly highlighted bars, restaurants, assembly lines, and casinos as generating high numbers of cases.(16) Reports from the White House Coronavirus Task Force, cited by the Washington Post(19), point to house parties and other small-scale gatherings as a source of coronavirus clusters.
The activities identified as significant drivers of infection in each of these reports are also highlighted as “High” or “Very High” risk by our app. Although further study is warranted, this data lends credence to the expert analysis on which our algorithm is based.
Location-based incidence warning
Although community prevalence is excluded from the activity risk assessment numbers, the app does provide a separate, optional assessment of COVID-19 prevalence in the activity location (specified by the user as a postal code). This assessment, sourced from the website covidactnow.org, gives an alert level (from “Low” to “Critical”) based on a number of factors, including daily new cases and positive test rate.(21)
Conclusions
Our app provides advice about the transmission risk of everyday activities, with high correlation to the ratings given by expert epidemiologists. Although this data is currently highly subjective, it is our hope that ongoing studies and contact tracing metrics will provide additional data to refine the algorithm. Our goal is to provide a tool at the user’s fingertips to help them make good decisions about what activities they engage in, and potentially reduce the spread of COVID-19.