Data schema definition
What is a Data Schema
A Data Schema is a formal description that outlines the structure of the data within a Dataset.
It defines how data is formatted and stored, and includes other key information about the statistical and analytical capabilities of the Dataset.
The Command Center expresses a Data Schema Definition as a JSON document that partially adopts the JSON Schema Specification.
Example
Consider the following Dataset:
| timestamp_iso | item_code | valid_flasks_produced | shift |
|---|---|---|---|
| 2020-01-05T06:00:00 | 5078 | 102 | 1st Shift |
| 2020-01-06T06:00:00 | 5661 | 108 | 1st Shift |
| 2020-01-07T06:00:00 | 5078 | 79 | 1st Shift |
| 2020-01-08T06:00:00 | 5800 | 126 | 1st Shift |
| ... | ... | ... | ... |
The following Data Schema Definition describes the content of the above Dataset:
{
"type": "object",
"properties": {
"timestamp_iso": {
"type": "string",
"cc_def": {
"type": "timestamp"
}
},
"item_code": {
"type": "integer",
"cc_def": {
"type": "categorical",
"scope": "nominal"
}
},
"valid_flasks_produced": {
"type": "integer",
"cc_def": {
"type": "numerical",
"scope": "continuous"
}
},
"shift": {
"type": "string",
"cc_def": {
"type": "categorical",
"scope": "nominal"
}
}
}
}
Let's explore how to write the Data Schema.
Create a Data Schema Definition
To initialize the Data Schema definition, you need to start with the following basic definition.
{
"type": "object"
}
This is a basic JSON definition with the keyword type that defines the type of the JSON structure.
However, it is still not enough to describe the Dataset Schema.
Define properties
The properties keyword needs to be defined to describe the structure of the Dataset. When you define properties, you create an object where each property represents a field/column of your Dataset.
1. Add the properties keyword
Add the properties validation keyword to the Data Schema:
{
"type": "object",
"properties": {}
}
2. Define the fields of your Dataset
Add a keyword named field_name for each column that you want to define in your Dataset. If the data is stored in a CSV file, field_name corresponds to the column name defined in the first row of the file. In our case:
{
"type": "object",
"properties": {
"timestamp_iso": {},
"item_code": {},
"valid_flasks_produced": {},
"shift": {}
}
}
3. Validation Rules
The field name must exactly match the name of the column in the CSV file.
Validation rules for the field name:
- The field name must start with an alphanumeric character
- Letters in the field name must be lowercase
- The field name must not start with an underscore character (_)
- Spaces and special characters are not allowed
- The only exceptions are the dash and the underscore ( - , _ ), which are allowed in the field name
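The rules above can be checked before uploading a file. The following is a minimal sketch in Python, not part of the Command Center itself: the function names are hypothetical, and it assumes that digits are allowed anywhere in the name, since the rules only mention an alphanumeric first character.

```python
import re

# Assumption: digits are allowed anywhere in the name, because the rules
# require an alphanumeric first character. Letters must be lowercase, and
# dash and underscore are the only other characters accepted.
FIELD_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]*")


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` satisfies the field name rules listed above."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


def invalid_field_names(header: list[str]) -> list[str]:
    """Return the column names of a CSV header that break the rules."""
    return [name for name in header if not is_valid_field_name(name)]


# Column names of the example Dataset used in this page.
header = ["timestamp_iso", "item_code", "valid_flasks_produced", "shift"]
print(invalid_field_names(header))  # []
```

Names such as `Item Code` (uppercase letters and a space) or `_shift` (leading underscore) would be reported by this check.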
4. Field Data Type annotation
For each property, add the data type annotation using the type keyword. The type keyword defines what kind of data is expected for the field.
{
"type": "object",
"properties": {
"timestamp_iso": {
"type": "string"
},
"item_code": {
"type": "integer"
},
"valid_flasks_produced": {
"type": "integer"
},
"shift": {
"type": "string"
}
}
}
The type keyword accepts the following values:
- string: for string values (this type should be used for fields that represent the timestamp of the Dataset)
- integer: for integer values
- numeric: for float or decimal numbers
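When a Dataset has many columns, a first guess of the type annotation can be derived from the raw values. The sketch below is only an illustration under stated assumptions: `infer_type` is a hypothetical helper, it works on the string values read from a CSV column, and the result should still be reviewed by hand.

```python
def infer_type(values: list[str]) -> str:
    """Guess the `type` annotation for a column from its raw CSV strings."""

    def all_parse(cast) -> bool:
        # True only if every value can be converted with `cast`.
        for value in values:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "numeric"
    # Timestamps and free text fall through to "string".
    return "string"


print(infer_type(["5078", "5661", "5078"]))    # integer
print(infer_type(["2020-01-05T06:00:00"]))     # string
print(infer_type(["1st Shift", "1st Shift"]))  # string
```

This only concerns the type keyword. A column such as item_code is inferred as integer even though it represents a code rather than a quantity.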
5. Statistical definition, the cc_def keyword
The Data Type annotation is not enough to describe the statistical/analytical capability of each field available in the Dataset.
To complete this information, the cc_def keyword needs to be defined for each field property.
{
"type": "object",
"properties": {
"timestamp_iso": {
"type": "string",
"cc_def": {}
},
"item_code": {
"type": "integer",
"cc_def": {}
},
"valid_flasks_produced": {
"type": "integer",
"cc_def": {}
},
"shift": {
"type": "string",
"cc_def": {}
}
}
}
Let's explore the keywords available within the cc_def keyword.
6. Statistical type definition
The type in the cc_def keyword defines the data characteristics for statistical and analytical purposes of the field.
The cc_def > type keyword expects one of the following values:
- timestamp: for the field that contains the timestamp information of the record. A Dataset should have one field defined with this keyword value.
- categorical: for fields that contain categorical data, also known as qualitative data. It represents characteristics or attributes that can be sorted into groups or categories, but not in a numerical order. Categorical data is often used for classification. It can be represented as text or symbols and is not used for mathematical calculations in its raw form.
- numerical: for fields that contain numerical data, also known as quantitative data. It represents measurable quantities and is expressed in numbers. It involves counting or measuring attributes of a population. Numerical data is suitable for mathematical calculations and statistical analysis. It can be further described using measures like mean, median, and standard deviation.
{
"type": "object",
"properties": {
"timestamp_iso": {
"type": "string",
"cc_def": {
"type": "timestamp"
}
},
"item_code": {
"type": "integer",
"cc_def": {
"type": "categorical"
}
},
"valid_flasks_produced": {
"type": "integer",
"cc_def": {
"type": "numerical"
}
},
"shift": {
"type": "string",
"cc_def": {
"type": "categorical"
}
}
}
}
To complete the statistical/analytical capability of the Dataset fields, one more keyword needs to be defined.
7. The statistical scope definition
The scope keyword, inside cc_def, defines the data scope of the field for statistical and analytical purposes. The allowed values of cc_def > scope depend on the value of cc_def > type:
For categorical types:
- nominal: these are categories with no inherent order. For example, colors (red, blue, green) or brands (Nike, Adidas, Puma).
- ordinal (not implemented): these categories have a specific order or ranking, but the intervals between the ranks are not necessarily equal. For example, satisfaction ratings (satisfied, neutral, dissatisfied).
- binary (not implemented): this is a special case of categorical data where there are only two possible categories (boolean fields).
- cyclical (not implemented): for categorical data that has a cyclical nature, where the categories reach an end and then "restart" (e.g. days of the week).
For numerical types:
- discrete (not implemented): this data can only take certain values, like integers. It often represents countable items, such as the number of students in a class.
- continuous: this data can take any value within a range and is often measured. Examples include height, weight, or temperature.
timestamp types don't need scope definition.
{
"type": "object",
"properties": {
"timestamp_iso": {
"type": "string",
"cc_def": {
"type": "timestamp"
}
},
"item_code": {
"type": "integer",
"cc_def": {
"type": "categorical",
"scope": "nominal"
}
},
"valid_flasks_produced": {
"type": "integer",
"cc_def": {
"type": "numerical",
"scope": "continuous"
}
},
"shift": {
"type": "string",
"cc_def": {
"type": "categorical",
"scope": "nominal"
}
}
}
}
The data schema, as defined, is complete and can be used for the Creation of a Dataset.
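Before submitting a Data Schema, it can help to check it against the rules described in this page. The following is a hedged sketch, not the Command Center's own validation: `check_schema` is a hypothetical helper, it assumes that exactly one timestamp field is expected, and it assumes that scopes marked "not implemented" should be rejected.

```python
import json

ALLOWED_TYPES = {"string", "integer", "numeric"}
# Assumption: only the scopes not marked "not implemented" are accepted.
# timestamp fields take no scope, so their set is empty.
IMPLEMENTED_SCOPES = {
    "timestamp": set(),
    "categorical": {"nominal"},
    "numerical": {"continuous"},
}


def check_schema(schema: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if schema.get("type") != "object":
        problems.append('top-level "type" must be "object"')
    timestamp_fields = 0
    for name, spec in schema.get("properties", {}).items():
        if spec.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {spec.get('type')!r}")
        cc_def = spec.get("cc_def", {})
        cc_type = cc_def.get("type")
        if cc_type not in IMPLEMENTED_SCOPES:
            problems.append(f"{name}: unsupported cc_def type {cc_type!r}")
        elif cc_type == "timestamp":
            timestamp_fields += 1
        elif cc_def.get("scope") not in IMPLEMENTED_SCOPES[cc_type]:
            problems.append(f"{name}: unsupported scope {cc_def.get('scope')!r}")
    # Assumption: "should have one field" is read as exactly one.
    if timestamp_fields != 1:
        problems.append("exactly one timestamp field is expected")
    return problems


schema = json.loads("""
{
  "type": "object",
  "properties": {
    "timestamp_iso": {"type": "string", "cc_def": {"type": "timestamp"}},
    "shift": {"type": "string",
              "cc_def": {"type": "categorical", "scope": "nominal"}}
  }
}
""")
print(check_schema(schema))  # []
```

Returning a list of messages instead of raising on the first error lets all problems in a schema be fixed in one pass.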