Reply by industrial_ai_dev_rachel
Yeah, you got the flow right. For LLM hosting we actually use both: Azure OpenAI for most queries, because it has better data residency guarantees and HIPAA compliance (we're in pharma), plus a local Llama model running on a GPU server for offline capability. The local model isn't as good at understanding complex queries, but it works fine for basic stuff. Here's a rough example of how we structure the LLM prompt:
system_prompt = """You are an industrial SCADA assistant. You can help operators query process data.

Available functions:
- get_tag_value(tag_name, timestamp): Get a specific tag value at a time
- get_tag_history(tag_name, start_time, end_time): Get historical data over a range
- get_active_alarms(plant_area): Get current alarms for an area
- get_alarm_history(start_time, end_time, severity): Get past alarms

IMPORTANT:
- Only query tags the user has permission to access
- Always specify time ranges; never query more than 7 days at once
- Tag names must match exactly: "FT-101", not "flow transmitter 101"
"""
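Telling the LLM about the limits isn't enough on its own; a guardrail layer has to re-check every proposed call before it touches the historian. Here's a minimal sketch of what that validation could look like (the function and parameter names are illustrative, not our actual code):

```python
from datetime import datetime, timedelta

MAX_RANGE = timedelta(days=7)  # mirrors the 7-day rule in the system prompt

def validate_history_call(user_tags, tag_name, start_time, end_time):
    """Check an LLM-proposed get_tag_history call before dispatching it.

    user_tags: set of exact tag names this operator may read.
    start_time / end_time: ISO 8601 strings, as the LLM emits them.
    Returns (ok, reason).
    """
    # Permission check: tag names must be exact, so set membership is enough
    if tag_name not in user_tags:
        return False, f"no permission for tag {tag_name}"
    start = datetime.fromisoformat(start_time)
    end = datetime.fromisoformat(end_time)
    if end <= start:
        return False, "end_time must be after start_time"
    # Enforce the range limit regardless of what the prompt said
    if end - start > MAX_RANGE:
        return False, "range exceeds 7-day limit"
    return True, "ok"
```

The point is that the prompt rules are advisory; the wrapper is what actually makes a bad call impossible.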
The trick is constraining the LLM enough that it can't do anything dangerous while still being useful. We spent a lot of time building a good tag name lookup system because operators say things like "the main feed pump" but the actual tag is "P-3301_SPEED_PV".