Row-based file formats, also known as Entity-attribute-value data (EAV), organize data by record, keeping all the data associated with a record in the same row. Row based files are the traditional way of organizing data.
In this article
1. Data Structure
Row-based formats compatible with Harmoni require three data sets that are stored in tabular form in your media of choice, such as text files, SQL databases, AWS, Azure, GCP, NoSQL databases, to name a few.
Harmoni can directly connect to data in this structure via direct connections to a tabular based application - refer to Supported Data sources.
The three data sets required are:
a. ID table - contains the entity/respondent IDs
b. Dictionary table - defines the attributes
c. Data table - contains the data values
a. ID Table
The respondent ID table contains one field, which is a unique ID for every entity/respondent in the data. It is important that the data is sorted.
b. Dictionary Table
A tabular form of the dictionary. The order of items in the dictionary determines the order that the questions and answers appear when the dataset is loaded into Harmoni.
The dictionary has 3-5 fields:
- Question Label
- Answer Label
- Question Type
- tv - text variable
- sr - single response categorical response
- mr - multiple response categorical response
- nv - numerical value
- wt - weight variable
- img - image variable
- coord - hotspot variable
- Item/Question Label Description (optional)
- Value/Answer Label Description (optional)
The last two fields in the dictionary are optional.
When there are no answer labels, i.e., for those questions with no pre-coded answers, e.g., tv, nv, wt, img, the answer label should be null or empty.
Following is a sample table illustrating a dictionary with all possible question types. In this example, question types with numerical answers (e.g., exact age), text answers (e.g., email address), weight (numerical value but marked as weight), single-response questions with pre-coded answers, and multiple-response questions with pre-coded answers, are coded as nv, tv, wt, sr, and mr, respectively. Any user-chosen codes are acceptable, so long as the codes are unique.
The dictionary table needs to be sorted by Question Label then Answer Label in the preferred order.
| Question Label | Answer Label | Question Type | Question Label Description (optional) | Answer Label Description (optional) |
|---|---|---|---|---|
| Respondent ID (text) | tv | |||
| Q11 | Male | sr | Gender | Males |
| Q11 | Female | sr | Females | |
| Marital Status | Married | sr | ||
| Marital Status | Widowed | sr | ||
| Marital Status | Divorced | sr | ||
| Marital Status | Separated | sr | ||
| Marital Status | Never married | sr | ||
| Weight | wt | Main Weight | ||
| Exact Age | nv | |||
| News Sources | Internet | mr | Sources Used | |
| News Sources | Newspapers | mr | Sources Used | |
| News Sources | News magazines | mr | Sources Used | |
| News Sources | Television | mr | Sources Used | |
| News Sources | Radio | mr | Sources Used | |
| Uploaded Image | <url> | img | ||
| Hotspots | {“image”:”<url>”,”clicks”:[{x:14,y:12},{x:21,y:32}]} | coord |
Using the two optional columns adjusts the display label for the items in Harmoni. The compulsory columns continue to populate the source label.
For Q11 in the table above, here is an example of the axis and element labels with and without the optional fields.
c. Data Table
The Data table contains the answers to each question asked of a respondent in an Entity-Attribute-Value (EAV) structure.
The table should be sorted by respondent ID then by question label.
Included in the Data are the following fields:
- Respondent ID (the Entity)
- Questions Label (the Attribute)
- Answer (the Value)
For example:
| Respondent ID | Question Label | Answer |
|---|---|---|
| 0 | Respondent Id (text) | 0 |
| 0 | Gender | Male |
| 0 | Marital Status | Divorced |
| 1 | Respondent id (text) | 1 |
| 1 | Gender | Male |
| 1 | Marital Status | Never married |
So as long as the respondent, dictionary, and data tables can be obtained, the actual data can be stored in any format the user chooses.
2. Data Creation
Typically, these 3 datasets will be populated using SQL commands. The following examples illustrate some sample SQL commands to obtain the respondent count, dictionary table, and data table.
1) caseIDs (respondent IDs)
SELECT distinct CaseIdCol FROM OurDataTable ORDER BY CaseIdCol;
2) Dictionary
SELECT DISTINCT ItemLabelCol, ValueLabelCol, ItemTypeCol, ItemLabelColDesc, ValueLabelColDesc FROM OurDictionaryTable order by ItemLabelColOrder, ElementLabelColOrder;
3) Data
SELECT DISTINCT CaseIdCol, ItemLabelCol, ValueLabelCol FROM OurDataTable ORDER BY CaseIdCol, ItemLabelCol;
3. Relationships
This section focuses on relationships between datasets including linking and multi-level linking.
- The CaseId table identifies the unit of count of a dataset.
- The CaseIdCol value in the first field of the Data Table is used to identify each unique record in the Data and matches back to the CaseIDs table.
- If you want to use the CaseIdCol values as a link, they must also exist in the Data and Dictionary tables. It is these entries that create the variables used for linking.
- Links define the relationships between datasets. The link may be to a different unit of count (multi-level) or it may be to another dataset with the same unit of count (concatenated).
- Datasets at the same level, but with different units of count are referred to as siblings, while datasets at the same level and with the same unit of count are concatenated.
- The relationships determine the levels. The root level, by definition, has no parent; all other levels must have a parent.
- Datasets that link into the root level are referred to as children, and their children are referred to as grand-children, etc. For example, a survey of respondents may capture data on store visits, and purchases made on those visits. Each Store Visit is a child of a Respondent and the parent of a Purchase, while each Purchase is a child of a Store and the grand-child of a Respondent.
- The theory is the same for any dataset relationship; the caseIDs table identifies the unique records in the dataset, and variables in the Data table provide the linking variables.
Example of multi-level structure with siblings and concatenations:
- Respondents is the master dataset
- Each dataset has its own unique CaseIds table
- Each dataset includes variables for each of the ids required for linking
- The Respondent Weights dataset concatenates to Respondents using Respondent id
- The Visits dataset links to Respondents using Respondent id
- The Purchases dataset links to Visits using Visit id
- The Transport dataset links to Visits using Visit id
Respondents
The master root level dataset (Respondents) has a caseIDs table that identifies each Respondent in the Data
| Respondent ID |
|---|
| 0 |
| 1 |
The caseIDs are represented in the Respondent ID variable of the Data table to provide a variable for other data sets to link into.
| Respondent ID | Question Label | Answer |
|---|---|---|
| 0 | Respondent ID (text) | 0 |
| 1 | Respondent ID (text) | 1 |
| 0 | Gender | Male |
| 1 | Gender | Male |
| 0 | Marital Status | Divorced |
| 1 | Marital Status | Never married |
The Dictionary table may or may not include labels, depending on whether the data table contains codes or labels.
| Question Label | Answer Label | Answer Type | Question Label Description (optional) | Answer Label Description (optional) |
|---|---|---|---|---|
| Respondent ID | tv | |||
| Gender | Male | sr | ||
| Gender | Female | sr | ||
| Q3 | 1 | sr | Marital Status | Married |
| Q3 | 2 | sr | Marital Status | Widowed |
| Q3 | 3 | sr | Marital Status | Divorced |
Respondent Weights
The dataset for the Weights that are concatenated into the Respondents has its own caseIDs table that also identifies each Respondent in the Data that has a weight,
| Respondent ID |
|---|
| 0 |
| 1 |
The Respondent ID variable in the Weights Data table identifies the Respondent each Weight record is concatenated to.
| Respondent ID | Question Label | Answer |
|---|---|---|
| 0 | Respondent ID | 0 |
| 1 | Respondent ID | 1 |
| 0 | Weight | 0.87654321 |
| 1 | Weight | 0.12345678 |
The Dictionary table may or may not include labels, depending on whether the data table contains codes or labels.
| Question Label | Answer Label | Answer Type |
|---|---|---|
| Respondent ID | tv | |
| Weight | wt |
Visits
The dataset for the Visits at the next level down (child) a caseIDs table that identifies each Visit in the Data.
| Visit ID |
|---|
| 11 |
| 12 |
| 13 |
In the Data Table, the caseIDs for the visit level are represented in the Visit ID variable of the Data table so any potential children can link into it. The inclusion of a Respondent ID variable provides the link to its parent
| Visit ID | Question Label | Answer |
|---|---|---|
| 11 | Respondent ID | 0 |
| 12 | Respondent ID | 0 |
| 13 | Respondent ID | 1 |
| 11 | Visit ID | 11 |
| 12 | Visit ID | 12 |
| 13 | Visit ID | 13 |
| 11 | Location | Supermarket |
| 12 | Location | Convenience Store |
| 13 | Location | Supermarket |
The Dictionary table may or may not include labels, depending on whether the data table contains codes or labels.
| Question Label | Answer Label | Answer Type |
|---|---|---|
| Respondent ID | tv | |
| Visit ID | tv | |
| Location | Supermarket | sr |
| Location | Convenience Store | sr |
| Location | Supermarket | sr |
Purchases
The Purchases dataset contains details of each item purchased during the Visit by a Respondent. The Purchase level is a grandchild of Respondents, and has a CaseIDs table identifying each unique Purchase made in a Visit by a Respondent
| Purchase ID |
|---|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
This grandchild Data links to the child-level Visit data via the Visit ID and an optional Purchase ID field may be included for any great-grandchildren to link into. These are all in the Data Table.
| Purchase ID | Question Label | Answer |
|---|---|---|
| 0 | Visit ID | 11 |
| 1 | Visit ID | 11 |
| 2 | Visit ID | 12 |
| 3 | Visit ID | 12 |
| 4 | Visit ID | 12 |
| 5 | Visit ID | 13 |
| 0 | Purchase ID | 0 |
| 1 | Purchase ID | 1 |
| 2 | Purchase ID | 2 |
| 3 | Purchase ID | 3 |
| 4 | Purchase ID | 4 |
| 5 | Purchase ID | 5 |
| 0 | Beverage | 1 |
| 1 | Beverage | 2 |
| 2 | Beverage | 1 |
| 3 | Beverage | 3 |
| 4 | Beverage | 6 |
| 5 | Beverage | 4 |
The Dictionary table may or may not include labels, depending on whether the data table contains codes or labels.
| Question Label | Answer Label | Answer Type | Question Label Description (optional) | Answer Label Description (optional) |
|---|---|---|---|---|
| Visit ID | tv | |||
| Purchase ID | tv | |||
| Beverage | 1 | sr | Coffee | |
| Beverage | 2 | sr | Tea | |
| Beverage | 3 | sr | Energy Drink | |
| Beverage | 4 | sr | Flavored Milk | |
| Beverage | 5 | sr | Sports Drink | |
| Beverage | 6 | sr | Bottled Water |
Transport
Transport is at the same level as Purchases, i.e. they are siblings. It includes variables relating to all the transport options used to achieve a Visit by a Respondent, this links into the Visit data, and is at the same level as the Purchases.
The Transport grandchild level has a CaseIDs table identifying each unique Respondent-Visit-Transport:
| Transport ID |
|---|
| 100 |
| 101 |
| 102 |
The transport data links into the Visit dataset using the Visit ID variable, It has no children to link into it, so the unique Transport ID that exists in the CaseIDs table doesn’t need to exist as a variable with responses in the Data Table:
| Transport ID | Question Label | Answer |
|---|---|---|
| 100 | Visit ID | 11 |
| 101 | Visit ID | 12 |
| 102 | Visit ID | 13 |
| 100 | Transport method | Bus |
| 101 | Transport method | Walk |
| 102 | Transport method | Car |
| 100 | Travel Companions | Family |
| 101 | Travel Companions | None |
| 102 | Travel Companions | Friend |
| 102 | Travel Companions | Family |
The Dictionary table may or may not include labels, depending on whether the data table contains codes or labels.
| Question Label | Answer Label | Answer Type |
|---|---|---|
| Visit ID | tv | |
| Visit ID | tv | |
| Location | Supermarket | sr |
| Location | Convenience Store | sr |
| Location | Supermarket | sr |
| Travel Companions | None | mr |
| Travel Companions | Friend | mr |
| Travel Companions | Family | mr |
| Travel Companions | Other | mr |
4. General Consideration
- The approach is the same for all linking, i.e., the ID needs to exist as a variable in both the parent and child Data to be used as a link.
- Concatenating datasets with the same counting level is currently only available at the root level in Harmoni
- Any ID that does not exist in the CaseIDs table is ignored, even if it appears in the data table.
- Any variable not found in the dictionary table will be ignored, even if it exists in the data table.
- Any variable listed in the dictionary table that has no corresponding response in the data table will be output with a count of zero.
- There is no reasonable limit on the number of levels that can be linked together
- There is no reasonable limit on the number of sibling datasets that can link into the same parent.
- Img and Coord variable types are not currently supported by the FCF2X converter
Where to from here?
Learn more about Harmoni