August 03, 2016
Model management
By: Chris Kang

As Labs worked with Accenture teams on client problems in the big data and data science spaces, it became clear that we needed a way for our consultants to quickly train and deploy machine learning models at big data scale. Furthermore, our clients represented a vast range of industries, and we knew that the types of models would vary across the client teams.

To address these concerns, we set out to build a machine learning platform at Accenture Labs. Our goal was a platform in which data scientists, business analysts, or data analysts could quickly iterate on machine learning models: training an algorithm with different parameters, comparing the resulting models in a champion-challenger view, and promoting the champion to a production environment for quick deployment. We also wanted the ability to train different types of models, such as a facial recognizer in OpenCV, a Python classifier written with scikit-learn, or a scalable clustering model that runs in Apache Spark.

We aren’t the only ones who arrived at the need for a machine learning platform. Facebook recently released information on its internal machine learning platform, FBLearner Flow. While the implementation differs from ours, its goals are remarkably similar: reusable machine learning algorithms, automated steps for training a model, and the ability to view and share results, experiments, and models with others. It is effectively the centralized machine learning platform that Facebook engineers use to build and tune models quickly.

One of the major differences, however, resides in the way users write code to train and deploy models. With Facebook, models are created from workflows that consist of operators. Each operator is a defined unit of action, such as training a model, calculating metrics, or predicting labels for an unknown data set. These operators are strung together in a decorated Python method; the decorator lets the platform verify the inputs and outputs of that workflow.
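Facebook has not published FBLearner Flow's full API, so the following is only a rough sketch of that workflow-and-operator shape. The workflow decorator and the operators here are hypothetical stand-ins written for illustration, not Facebook's actual interface.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def workflow(func):
    # Hypothetical decorator: a real platform would validate the
    # workflow's declared inputs and outputs here before running it.
    def wrapper(**inputs):
        outputs = func(**inputs)
        assert isinstance(outputs, dict), "workflow must return a dict"
        return outputs
    return wrapper

# "Operators": defined units of action the workflow strings together.
def split_dataset(X, y, ratio):
    return train_test_split(X, y, train_size=ratio, random_state=0)

def train_decision_tree(X_train, y_train, max_depth):
    return DecisionTreeClassifier(max_depth=max_depth).fit(X_train, y_train)

def compute_accuracy(model, X_test, y_test):
    return model.score(X_test, y_test)

@workflow
def train_and_evaluate(max_depth=3):
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = split_dataset(X, y, ratio=0.8)
    model = train_decision_tree(X_train, y_train, max_depth)
    return {"accuracy": compute_accuracy(model, X_test, y_test)}

print(train_and_evaluate(max_depth=3))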

With our platform, we have templates instead of workflows. A template is code that builds a model in a machine learning environment. For example, I can write Scala code that trains a logistic regression model with Apache Spark, or Python code that trains a Naïve Bayes model with scikit-learn. A template looks very similar to a Facebook workflow. Here is an example:

from pyhive import presto  # assumes the PyHive Presto client; the post does not name the driver
import numpy as np
from sklearn.naive_bayes import GaussianNB

def train(file_json):
    # Execute the query against the table named in the input contract
    cursor = presto.connect("xx.xxx.xxx.xx").cursor()
    cursor.execute("select * from " + file_json["input"]["catalog"] + "."
                   + file_json["input"]["schema"] + "."
                   + file_json["input"]["table"])

    # Map each column name to its position in the result rows
    index = {d[0]: i for i, d in enumerate(cursor.description)}

    # Build dataset, X shape = (n_samples, n_features), y shape = (n_samples,)
    X, y = [], []
    for data in cursor:
        X.append([data[index[c]] for c in file_json["schema"]["cols"]])
        y.append(data[index[file_json["schema"]["target"]]])
    features = np.array(X)
    target = np.array(y)

    # Train a Naive Bayes model and hand it back to the caller
    model = GaussianNB().fit(features, target)
    return model
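The train method above stops once the model is fit. To be usable by the platform, a template also has to persist its result where the platform can retrieve it. Here is a minimal sketch of that step, assuming pickle serialization and an S3-compatible object store accessed through boto3; both are assumptions for illustration, not confirmed details of our implementation.

import pickle
import boto3  # assumption: an S3-compatible object store

def save_model(model, file_json):
    # Serialize the trained model and write it to the bucket and key
    # named in the template's output contract.
    boto3.client("s3").put_object(
        Bucket=file_json["output"]["bucket"],
        Key=file_json["output"]["key"],
        Body=pickle.dumps(model),
    )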

The template uses an input and output contract (the file_json argument in the code) to train the machine learning model correctly. To give a taste of what this execution call might look like, here is a sample payload in the API call to train a model in our platform.

POST 192.168.99.100:3000/v1/models/54139e94-c799-45a4-8ddd-5564b20ed2fb/train

{
    "runtime": "spark",
    "entry": "KMeansTraining",
    "schema": {"cols": ["avgmeasuredtime", "mediantime", "vehiclecount"]},
    "parameters": {
        "clusters": 2
    },
    "input": {
        "catalog": "cassandra",
        "schema": "demo",
        "table": "traffic_data",
        "type": "presto"
    },
    "output": {
        "bucket": "example",
        "key": "1e33b127-438a-4c4c-af86-57276aebf589"
    },
    "update": "http://192.168.99.100:3000/v1/models/1e33b127-438a-4c4c-af86-57276aebf589"
}
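To show it end to end, here is how a client might submit that request over HTTP. This is only a sketch using Python's requests library, with the host, model ID, and payload taken from the example above.

import requests

payload = {
    "runtime": "spark",
    "entry": "KMeansTraining",
    "schema": {"cols": ["avgmeasuredtime", "mediantime", "vehiclecount"]},
    "parameters": {"clusters": 2},
    "input": {"catalog": "cassandra", "schema": "demo",
              "table": "traffic_data", "type": "presto"},
    "output": {"bucket": "example",
               "key": "1e33b127-438a-4c4c-af86-57276aebf589"},
    "update": "http://192.168.99.100:3000/v1/models/1e33b127-438a-4c4c-af86-57276aebf589",
}

response = requests.post(
    "http://192.168.99.100:3000/v1/models/"
    "54139e94-c799-45a4-8ddd-5564b20ed2fb/train",
    json=payload,
)
response.raise_for_status()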

This API call obtains the associated template code and runs it in the correct execution environment with the input and output JSON. The abstraction over the underlying execution environments allows us to extend the platform to support many types of models, such as those built with OpenCV, Spark, Python, or R. This extensibility is important to us because, given the wide array of industries we work with, we can’t anticipate what models our client teams will build.
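As a rough sketch of how that dispatch could work, a registry can map each runtime value in the request to an executor that knows how to launch template code in that environment. The executor functions below are hypothetical placeholders, not our actual implementation.

def run_spark(template_code, request):
    # Placeholder: e.g., package the template and submit it via spark-submit.
    raise NotImplementedError

def run_python(template_code, request):
    # Placeholder: e.g., run the template in a Python worker process.
    raise NotImplementedError

EXECUTORS = {
    "spark": run_spark,
    "python": run_python,
    # Extensible: register executors for new runtimes (R, OpenCV, ...) here.
}

def dispatch(template_code, request):
    # Route the request to the executor named in its "runtime" field.
    try:
        executor = EXECUTORS[request["runtime"]]
    except KeyError:
        raise ValueError("unsupported runtime: %s" % request["runtime"])
    return executor(template_code, request)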

In the Facebook approach, operators are clearly defined, and any new machine learning algorithm is coded as an operator within a workflow, which makes workflows easier to maintain and test. Our approach is more flexible: any code can execute as long as it behaves correctly with respect to the input and output contract. That makes the platform easier to extend to other types of models, but harder to verify and maintain.

There are other distinguishing features of our platform that we are excited about, such as deploying models into streams for real-time analytics. I discussed these features, along with a technical deep dive, at the Hadoop Summit in San Jose; you can watch the talk here: https://www.youtube.com/watch?v=hVu_k_0T5v0
