Schema Validation
There are quite a number of popular schema validation libraries available, but I will just stick to Pydantic, for the reasons that it is popular, touted to be the fastest, and used natively in FastAPI.
Config
The config file is one of the most frequently changed file in the repository, and for that reason, it is important to include in our unit-tests.
The below shows an example using a config yaml file & pydantic, with a custom validator.
# config.yml
modeldir: model
modelname: nameofname
format: dataframe # graph or dataframe
import pytest
import yaml
from pydantic import BaseModel, ValidationError, validator
class configSchema(BaseModel):
modeldir: str
modelname: str
format: str
@validator('format')
def format_list(cls, v):
if v not in ["dataframe", "graph"]:
raise ValueError("must be either 'dataframe or 'graph'")
return v
def test_config():
"""test for all key-values in config file"""
cf = yaml.safe_load(open("foldername/config.yml"))
try:
out = configSchema(**cf)
print(out)
except ValidationError as e:
pytest.raises(e)
API Request
Pydantic can also be used in Flask to validate all incoming requests. To validate the request schema before passing to the code, we can write a decorator function.
import logging
from functools import wraps
from typing import List, Optional
from flask import Flask, abort, jsonify, make_response, request, logging as flog
from pydantic import BaseModel, ValidationError, confloat, conint, validator
app = Flask(__name__)
class _weightage(BaseModel):
recommender: str
weight: confloat(ge=0, le=1)
@validator("recommender")
def recommender_list(cls,v):
if v not in recommender_list:
raise ValueError("Recommender is not part of {}".format(str(recommender_list)))
class RequestSchema(BaseModel):
productSKU: List[str]
storeId: str
weightage: List[_weightage]
customerId: Optional[str]
preceding_time_window: Optional[str]
resultSize: Optional[conint(ge=1, le=50)] = 20
@validator("weightage")
def weight_sum(cls, v):
sumw = 0
for i in range(len(v)):
weight = v[i].weight
sumw = weight + sumw
if sumw != 1:
raise ValueError("sum of weightage is not equals to 1")
def validate_request(requestschema):
"""decorator to validate request schema"""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
try:
requestschema(**request.json)
except ValidationError as e:
app.logger.error('{} - {}'.format(e, [422]))
err_json = json.loads(e.json())
abort(make_response(jsonify(err_json), 422))
return func(*args, **kwargs)
return wrapper
return decorator
@app.route("/recommendation", methods=["POST"])
@validate_request(RequestSchema)
def fusion_api():
req_content = request.json
# do something
return predicted_results
JSON
While pydantic works fine for JSON validation and we can fine-tune to test at quite a grandular level, it can be hard to grasp at start. Using something like pytest-schema
(pip install pytest-schema
) makes writing test cases much easier in pytest
.
from pytest_schema import schema
my_schema = {
"key1": int
"key2": float
"key3": {
"key4": str,
"key5": str
}
}
response = {
"key1": 111
"key2": 0.1
"key3": {
"key4": "test",
"key5": "test
}
}
def test_schema():
assert schema(my_schema) == response
Pandas
Ok, I take back what I said on Pydantic. For validating pandas dataframes, it is slightly difficult to use that library, hence we will use a library heavily inspired by Pydantic, called pandera.
from pandera import DataFrameSchema, Column, Check
schema = DataFrameSchema({
"antecedents": Column("category"),
"consequents": Column(object),
"antecedent support": Column("float32"),
"consequent support": Column("float32"),
"support": Column("float32", checks=Check(lambda x: 0 <= x <= 1, \
element_wise=True, \
error="range checker [0, 1]")),
"confidence": Column("float32", checks=Check(lambda x: 0 <= x <= 1, \
element_wise=True,
error="range checker [0, 1]")),
"lift": Column("float32"),
"leverage": Column("float32"),
"conviction": Column("float32"),
"storeid": Column("category"),
})
try:
validated_df = schema(df, lazy=True)
except Exception as e:
print(e)
The validated_df
is the same dataframe that can be used to continue coding. If we are validating in pytest, we can add a try, except to catch the error in a graceful way.
The argument lazy=True
should added to ensure that it captures all errors before giving a validation error report as shown below.
A total of 1 schema errors were found.
Error Counts
------------
- schema_component_check: 1
Schema Error Summary
--------------------
failure_cases n_failure_cases
schema_context column check
Column consequents pandas_dtype('float64') [object] 1
Usage Tip
---------
Directly inspect all errors by catching the exception:
try:
schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
err.failure_cases # dataframe of schema errors
err.data # invalid dataframe