Using AI and Control Theory to Clean Up Datasets

One major problem with machine learning, and with the output it produces, is the dataset. So many models have so much junk in them because the data wasn’t cleaned or processed correctly before training, or because data was included that was unnecessary, contradictory, redundant, or otherwise unhelpful or misaligned for training. LLMs generally get trained on so much data from so many sources that it’s practically impossible to assume it was all ingested perfectly. There are tools to clean up data, for sure: tools that transform one format into another, remove redundancies, and so on. But using them is usually a lot of mostly manual work, and for big models that work can’t realistically be done by people.

I was thinking of chaotic systems, and Control Theory specifically, as a potential solution not only for cleaning up anomalous data but also for creating new datasets and iterating through data as it changes. I doubt this is a novel idea (it’s Chaos Monkey-ish, and close to what quants do), but it’s interesting to think that you could apply a general framework around data to process anomalies and create datasets that either manage anomalies or remove outliers (or both). I think this could be built into a real-time workflow that is constantly massaging and tuning inputs and recording the metadata of outputs. The metrics generated could be dynamic “baselines” that change with the input, and/or they could drive a feedback loop where the output is re-processed after going through some workflow as the model “learns”. Using Attention would be pretty tricky, considering that all of the input “matters” in one way or another. Applying a weight to each input on the way in would be mechanically easy; figuring out what those weights should be would not. The idea is sort of like committing to chaos: you know your data is going to change, always, so just own it. Even if it’s static relative to something else, it’s still eventually going to change and be an element of that dynamical system. #ml #llm #chaos #controltheory
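To make the dynamic “baseline” and feedback loop a little more concrete, here is a minimal sketch of one way it could look. It is my own illustration and not an established method: the baseline is an exponentially weighted running mean and variance, the deviation from the baseline plays the role of the error signal, and accepted values feed back into the baseline. The names `AdaptiveBaseline` and `clean_stream`, and the `alpha`, `threshold`, and `warmup` values, are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveBaseline:
    """Exponentially weighted running mean/variance used as a moving baseline."""

    alpha: float = 0.1      # hypothetical gain: higher values adapt faster
    threshold: float = 3.0  # hypothetical cutoff, in standard deviations
    warmup: int = 5         # samples accepted unconditionally while the baseline forms
    mean: float = 0.0
    var: float = 0.0
    count: int = 0

    def observe(self, x: float) -> dict:
        """Score x against the current baseline, then feed it back if accepted."""
        error = x - self.mean                  # deviation from baseline (error signal)
        std = self.var ** 0.5
        z = abs(error) / std if std > 0 else 0.0
        is_anomaly = self.count >= self.warmup and z > self.threshold

        if not is_anomaly:
            # Feedback step: only accepted values move the baseline.
            if self.count == 0:
                self.mean = x
            else:
                self.mean += self.alpha * error
                self.var = (1 - self.alpha) * (self.var + self.alpha * error * error)
            self.count += 1

        # Metadata recorded for every input, whether kept or not.
        return {"value": x, "baseline": self.mean, "z": z, "is_anomaly": is_anomaly}


def clean_stream(values, baseline):
    """Return (kept_values, metadata_log) for an iterable of numbers."""
    kept, log = [], []
    for v in values:
        record = baseline.observe(v)
        log.append(record)
        if not record["is_anomaly"]:
            kept.append(v)
    return kept, log
```

Calling `clean_stream(readings, AdaptiveBaseline())` on a list of numbers returns the values that were kept plus a per-input metadata log. One caveat with this particular design: because rejected values never update the baseline, a genuine, lasting shift in the data would keep getting flagged, so a real workflow would need some rule for re-admitting persistent “anomalies”.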
