This quick start guide walks you through your first steps with Syntegra's Data API.

We provide the code for each step in both Python and R, though the outputs shown come from Python. You can of course implement these steps in other languages such as Java, Go, or SAS.

1. Get your API key

If you don't already have an api-key, sign up here. Your api-key is required for every API call you make.

Here's a sample of what an api-key looks like:

api-key: 5f50767a2a3208f111ee395f8e24e5c69fb236891
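Rather than hard-coding the key into your scripts, you may prefer to read it from an environment variable. The variable name `SYNTEGRA_API_KEY` below is our own convention, not something the API requires:

```python
import os

# Read the key from the environment (falling back to an empty string here
# so the snippet runs even when the variable is unset).
api_key = os.environ.get("SYNTEGRA_API_KEY", "")

# Every request carries the key in the 'api-key' header.
auth_info = {"api-key": api_key}
```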

2. List datasets

Now that you have your api-key, you can use the API to explore which datasets you have access to:

import requests
import pandas as pd

auth_info = {"api-key": "5f50767a2a3208f111ee395f8e24e5c69fb236891"}
api_url = "https://api.syntegra.io/v1"

r = requests.get(f'{api_url}/datasets', headers=auth_info)
pd.DataFrame(r.json()['Contents'])
library(httr)
library(jsonlite)

api_key <- '5f50767a2a3208f111ee395f8e24e5c69fb236891'
url <- "https://api.syntegra.io/v1/datasets"
response <- VERB("GET", url, add_headers('api-key' = api_key), content_type("application/octet-stream"),
                accept("application/json"))
fromJSON(content(response, "text"))

We provide the API key for authentication in the request header, and we use the /datasets endpoint to list all available datasets. The response is parsed as JSON; to make it easier to read, we format the output as a pandas DataFrame.


As you can see, each dataset has a name, a schema, and a description text. Furthermore, the 'available' column indicates whether the dataset is available to use under your subscription.
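If you only want the datasets your subscription actually covers, you can filter on that column. A minimal sketch, using made-up records that mirror the columns described above (the names and descriptions are illustrative, not real datasets):

```python
import pandas as pd

# Hypothetical /datasets contents, mirroring the columns described above;
# the dataset names and descriptions here are made up for illustration.
datasets = pd.DataFrame([
    {"name": "EHR_TUVA", "schema": "EHR_TUVA_SCHEMA",
     "description": "synthetic EHR sample", "available": True},
    {"name": "CLAIMS_DEMO", "schema": "CLAIMS_SCHEMA",
     "description": "synthetic claims sample", "available": False},
])

# Keep only the datasets you can use under your subscription.
usable = datasets[datasets["available"]]
print(usable["name"].tolist())  # → ['EHR_TUVA']
```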

Let's pick the EHR_TUVA dataset for this guide. We now want to understand how this dataset is structured, so let's use the API to look at its schema:

r = requests.get(f'{api_url}/schema/EHR_TUVA_SCHEMA', headers=auth_info)
pd.DataFrame(r.json()['Contents'])
url <- "https://api.syntegra.io/v1/schema/EHR_TUVA_SCHEMA"
response <- VERB("GET", url, add_headers('api-key' = api_key), content_type("application/octet-stream"),
                accept("application/json"))
fromJSON(content(response, "text"))


We can see that this schema has 9 primary tables (ALLERGY, CONDITION, ENCOUNTER, LAB, MEDICATION, OBSERVATION, PATIENT, PROCEDURE, and VITAL_SIGN) and 2 support tables (LOCATION and PRACTITIONER). This is consistent with what we expect from the TUVA core common data model.

You can dig further into individual fields by looking at "columns", which lists the columns of each table.
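As a sketch of how you might pull those out (the exact shape of the schema response may differ; here we assume each table entry carries a "name" and a "columns" list, as described above):

```python
# Hypothetical shape of the schema response: we assume each table entry
# carries a 'name' and a 'columns' list, per the description above.
schema_contents = [
    {"name": "PATIENT", "columns": ["PATIENT_ID", "GENDER", "BIRTH_DATE", "STATE"]},
    {"name": "CONDITION", "columns": ["PATIENT_ID", "CODE", "DESCRIPTION"]},
]

# Build a lookup from table name to its list of columns.
columns_by_table = {t["name"]: t["columns"] for t in schema_contents}
print(columns_by_table["CONDITION"])  # → ['PATIENT_ID', 'CODE', 'DESCRIPTION']
```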

3. Explore dataset tables

Now let's take a look at one of those tables. The /dataset/query endpoint is a quick and easy way to do just that: it uses the Syntegra Query Syntax to define a query of interest on one of the tables and returns the first 10 rows matching that query.

As an example, let's look at the condition table:

import json

dataset = 'EHR_TUVA'
ssq = {"table": "condition", "column": "code", "operator": "LIKE", "value": "%"}

r = requests.post(f'{api_url}/dataset/{dataset}/query', headers=auth_info, data=json.dumps(ssq))

print(r.json()['Result'])
pd.DataFrame(r.json()['Contents'])
url <- "https://api.syntegra.io/v1/dataset/EHR_TUVA/query"

payload <- "{\"table\":\"condition\",\"column\":\"code\",\"operator\":\"LIKE\",\"value\":\"%\"}"
encode <- "json"
response <- VERB("POST", url, body = payload, add_headers('api-key' = api_key), content_type("application/json"), accept("application/json"), encode = encode)
fromJSON(content(response, "text"))

Which results in the message:

10 rows provided (out of 2005777 total) of the table CONDITION for query {'table': 'CONDITION', 'column': 'CODE', 'operator': 'LIKE', 'value': '%'} on dataset EHR_TUVA*

Followed by the data itself as a pandas DataFrame.

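The LIKE '%' pattern above matches every row; by changing the value you can narrow the query. For example, a payload restricted to codes starting with J44 (the ICD-10 family for COPD; the pattern itself is our own illustration, not from the guide):

```python
import json

# Same Syntegra Query Syntax shape as above, but restricted to codes
# starting with J44 (the ICD-10 family for COPD); illustrative only.
ssq = {"table": "condition", "column": "code", "operator": "LIKE", "value": "J44%"}

# As before, the payload travels as a JSON-encoded request body.
payload = json.dumps(ssq)
print(payload)
```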

4. Concepts and Cohorts

One of the most useful capabilities of the API is the use of medical concepts and patient cohorts to download only the data you need for your research or modeling. The API comes with a pre-built set of concepts and cohorts that continues to grow over time, such as:

  • Diabetes
  • Liver disease
  • Pregnancy
  • Myocardial infarction
  • Sepsis

and many more.

You can add your own concepts and cohorts to fit your research needs, and importantly concepts and cohorts are stored within your API workspace and can be reused.

For the purposes of this guide, we will take a look at patients who have COPD but are not smokers. We have two pre-defined concepts, copd and non-smoker, that we can use for this purpose. First, let's see all the concepts we have defined:

r = requests.get(f'{api_url}/concepts/', headers=auth_info)
pd.DataFrame(r.json()['Contents'])
url <- "https://api.syntegra.io/v1/concepts"
response <- VERB("GET", url, add_headers('api-key' = api_key), content_type("application/octet-stream"),
                accept("application/json"))
fromJSON(content(response, "text"))


Note that the output may change as more concepts are defined. Each concept includes an id, a name, its status (public concepts are shared with all API users; private ones are restricted to your API key), a definition of the concept, and the SQL query used internally to implement it.

For our purposes, we would like to use two concepts:

  1. copd (id=44)
  2. non-smoker (id=51)

Using these we can now create a cohort as follows:

cohort = {"schema": "EHR_TUVA_SCHEMA", "name": "copd_without_smoking",
          "definition": ["copd", "and", "non-smoker"], "private": True}
r = requests.post(f'{api_url}/cohorts/', headers=auth_info, data = json.dumps(cohort))
print(r.json()['Result'])
url <- "https://api.syntegra.io/v1/cohorts"
payload <- "{\"definition\":[\"ehr_copd\",\"AND\",\"ehr_non_smoker\"],\"name\":\"copd_without_smoking\",\"schema\":\"EHR_TUVA_SCHEMA\",\"private\":true}"
encode <- "json"
response <- VERB("POST", url, body = payload, add_headers('api-key' = api_key), content_type("application/json"), accept("application/json"), encode = encode)
content(response, "text")

and we get:

Cohort 'copd_without_smoking' with id 8 has been created

5. Request Data

Now we get to the final stage - we want to download the full data associated with our cohort of patients, i.e. all the patients in our dataset who have been diagnosed with COPD but are non-smokers.

🚧

NOTE:

Data exports are limited to the first 50,000 patients of a dataset within a given cohort.

For this we use the /dataset/{dataset_name}/data endpoint and specify the cohort as a parameter:

r = requests.get(f'{api_url}/dataset/EHR_TUVA/data?cohort=8', headers=auth_info)
print(r.json())
url <- "https://api.syntegra.io/v1/dataset/EHR_TUVA/data"
queryString <- list(
  cohort = "8",
  fhir = "false"
)
response <- VERB("GET", url, add_headers('api-key' = api_key), query = queryString, content_type("application/octet-stream"), accept("application/json"))
fromJSON(content(response, "text"))

We get the following response:

{'Result': 'Data from dataset: EHR_TUVA, cohort: 8 is being processed, please check back later', 'Contents': []}

This is great. Now the API is processing all the data and selecting the synthetic patient records that fit our cohort - in this case, non-smoking patients with COPD. This can take a few minutes, so the API returns a message indicating that the results are not ready yet.
We now reissue the same GET request every few minutes until the data is ready, at which point the API responds with the actual data outputs:
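The wait-and-retry pattern can be sketched as a small polling loop. The request here is stubbed out with a fake fetch() function (our own stand-in, so the sketch runs offline); in practice each iteration would issue the same GET request against the /data endpoint:

```python
import time

# Stand-in for the real requests.get(...).json() call, so the sketch runs
# offline: it reports "processing" twice, then returns the data links.
_responses = iter([
    {"Result": "... is being processed, please check back later", "Contents": []},
    {"Result": "... is being processed, please check back later", "Contents": []},
    {"Result": "... is now available, links are valid for 24 hours",
     "Contents": [{"PATIENT.csv.gz": "https://example.com/PATIENT.csv.gz"}]},
])

def fetch():
    return next(_responses)

# Poll until the 'Contents' list is non-empty.
while True:
    body = fetch()
    if body["Contents"]:
        break
    time.sleep(0.01)  # leave a few minutes between real requests

links = body["Contents"][0]
print(sorted(links))  # → ['PATIENT.csv.gz']
```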

r = requests.get(f'{api_url}/dataset/EHR_TUVA/data?cohort=8', headers=auth_info)
print(r.json())
url <- "https://api.syntegra.io/v1/dataset/EHR_TUVA/data"
queryString <- list(
  cohort = "8",
  fhir = "false"
)
response <- VERB("GET", url, add_headers('api-key' = api_key), query = queryString, content_type("application/octet-stream"), accept("application/json"))
fromJSON(content(response, "text"))

The response after just 1 minute is:

{'Result': 'Data from dataset: EHR_TUVA, cohort: 8 is now available, links are valid for 24 hours', 
'Contents': [{
'ALLERGY.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/ALLERGY.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=3JKBcMWpP%2BJxEDjzp2uwj5RBZbg%3D&Expires=1659240118', 
'CONDITION.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/CONDITION.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=yp6KHT4%2B%2Ff0QBrEIILSKuRq8AmE%3D&Expires=1659240118', 
'ENCOUNTER.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/ENCOUNTER.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=Bkr5LKQVRgk7Q89%2BiNk3dzIrhmU%3D&Expires=1659240118', 
'LAB.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/LAB.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=T5OBjyRdtjBEg1blYvs5deuMOEg%3D&Expires=1659240118', 
'LOCATION.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/LOCATION.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=Q7bmmYG%2BoO1Hlm5PZD6ASc3Q5U0%3D&Expires=1659240118', 
'MEDICATION.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/MEDICATION.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=z86DiLIY1Oei8ZAGgi0YP4JtJnw%3D&Expires=1659240118', 
'OBSERVATION.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/OBSERVATION.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=GeOKHiED7epgPKRbq0Ywzsf8koI%3D&Expires=1659240118', 
'PATIENT.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/PATIENT.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=jn0mxx57I%2FD83LxWF6XYEtZ%2BSLo%3D&Expires=1659240118', 
'PRACTITIONER.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/PRACTITIONER.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=jG9ZzDKhYmklWToj4uhU6K5akVw%3D&Expires=1659240118', 
'PROCEDURE.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/PROCEDURE.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=oBfAZg%2BKRJ1FaQfmZalkP6wa6fA%3D&Expires=1659240118', 
'VITAL_SIGN.csv.gz': 'https://syntegra-api.s3.amazonaws.com/data/EHR_TUVA/8/VITAL_SIGN.csv.gz?AWSAccessKeyId=AKIA52XAC2J36IA53DVC&Signature=y323R4MOEI4muRIkdzVPbHtMjGA%3D&Expires=1659240118'
}]
}

🚧

S3 Links

All S3 links are valid for 24 hours after they are provided. After that time you can make the same request and immediately get a set of new S3 links for access.

For Python, this dictionary maps each table in the dataset to an S3 link to the content of that table for our desired patient population. It can be viewed more easily as a pandas DataFrame using the following code:

df = pd.DataFrame.from_records(r.json()['Contents']).T
df.columns = ['s3_link']
df

6. Download data

Let's download the conditions table and the patients table into our local environment:

condition_df = pd.read_csv(df.loc['CONDITION.csv.gz', 's3_link'], compression='gzip')
patient_df = pd.read_csv(df.loc['PATIENT.csv.gz', 's3_link'], compression='gzip')
tbl <- fromJSON(content(response, "text"))

s3url <- tbl$Contents[["CONDITION.csv.gz"]]
txt <- readLines(gzcon(url(s3url)))
condition_tbl <- read.csv(textConnection(txt), header=TRUE)

s3url <- tbl$Contents[["PATIENT.csv.gz"]]
txt <- readLines(gzcon(url(s3url)))
patient_tbl <- read.csv(textConnection(txt), header=TRUE)
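Rather than picking tables one at a time, you can loop over every link. A Python sketch, shown here against an illustrative subset of the links (the derived table names come from stripping the file suffix; the final read_csv step is the same as above):

```python
# Illustrative subset of the links mapping returned by the /data endpoint.
links = {
    "CONDITION.csv.gz": "https://example.com/CONDITION.csv.gz",
    "PATIENT.csv.gz": "https://example.com/PATIENT.csv.gz",
}

# Strip the '.csv.gz' suffix to recover each table name.
tables = {name.removesuffix(".csv.gz"): u for name, u in links.items()}
print(sorted(tables))  # → ['CONDITION', 'PATIENT']

# Each table could then be loaded in one pass, e.g.:
#   dfs = {t: pd.read_csv(u, compression='gzip') for t, u in tables.items()}
```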

A few rows from the conditions table can be displayed:

condition_df.head(5)
head(condition_tbl, 5)

We see that in this example the first 5 rows include 4 individual patient IDs: 109719, 154847, 158273, and 141066. The conditions shown here don't reflect COPD, but that is to be expected, since patients may have more than one condition. Let's verify this with a specific patient - we'll look at ALL conditions for the first patient, 109719:

condition_df[condition_df.PATIENT_ID==109719].DESCRIPTION.unique()
unique(condition_tbl[condition_tbl$PATIENT_ID == 109719, ]$DESCRIPTION)

We see COPD as the third condition listed here. Even though we also see "personal history of nicotine dependence", that ICD-10 code (Z87.891) is not included in our non-smoker concept (which focuses on F17.x codes).
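You can double-check that reasoning across the whole cohort by scanning the condition codes directly. The toy rows below stand in for condition_df (assuming a CODE column, as in our query earlier):

```python
import pandas as pd

# Toy stand-in for condition_df, with the CODE column used in our query above.
condition_df = pd.DataFrame({
    "PATIENT_ID": [109719, 109719, 154847],
    "CODE": ["J44.9", "Z87.891", "E11.9"],
})

# F17.x is the nicotine-dependence family the non-smoker concept excludes;
# Z87.891 (history of nicotine dependence) is deliberately not matched.
smoking_rows = condition_df[condition_df["CODE"].str.startswith("F17")]
print(len(smoking_rows))  # → 0
```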

We can now perform any kind of analysis on any of the tables in our cohort. We will finish with the next section, which demonstrates a simple analysis of our selected cohort of non-smoking COPD patients.

7. Social Determinants of Health Analysis

Let's try to understand the characteristics of this population (COPD patients who are not diagnosed as smokers) in terms of social determinants of health, including gender, age, and geography.

We first plot the gender distribution:

patient_df.GENDER.value_counts()
table(patient_tbl$GENDER)

Which results in:
male      5894
female    4074
Name: GENDER, dtype: int64

So we see there are more patients identified as male than female in our population.
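The same counts can be read as shares of the cohort; a quick sketch against the numbers printed above:

```python
import pandas as pd

# Rebuild the counts printed above and normalize them to shares of the cohort.
counts = pd.Series({"male": 5894, "female": 4074}, name="GENDER")
shares = (counts / counts.sum()).round(3)
print(shares.to_dict())  # → {'male': 0.591, 'female': 0.409}
```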
Let's next look at age.

patient_df.BIRTH_DATE.map(lambda x: (pd.to_datetime('2022-07-31') - pd.to_datetime(x)).days/365.25).hist(bins=30)
require(lubridate)
birth_date <- as.Date(patient_tbl$BIRTH_DATE)
x_date   <- as.Date("2022-08-01")
hist(trunc((birth_date %--% x_date) / years(1)))

Finally, we want to understand the distribution of patients by state:

import plotly.express as px

gdf = patient_df['STATE'].value_counts().reset_index(drop=False)
gdf.columns = ['state_code', 'count']

fig = px.choropleth(gdf,
                    locations='state_code', 
                    color='count',
                    color_continuous_scale=px.colors.sequential.OrRd,
                    locationmode="USA-states", 
                    range_color=(1, 5000),
                    scope="usa",
                    labels={'count':'patients'}, 
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
library(dplyr)
library(ggplot2)
library(maps)

tbl <- patient_tbl
tbl$region <- tolower(state.name[match(tbl$STATE, state.abb)])
tbl$region <- factor(tbl$region)

cnt_tbl <- tbl %>% group_by(region) %>% summarise(cnt = n())
MainStates <- map_data("state")
merged <- inner_join(MainStates, cnt_tbl, by = "region")

ggplot() + 
  geom_polygon( data=merged, aes(x=long, y=lat, group=group, fill=cnt),
                color="black", size=0.2)

Here we use the STATE column of the patient table together with the plotly choropleth function to plot a map of the US with states color-coded by the number of non-smoking COPD patients.
