

PATENT DROP: How to train your AI
A look at how Nvidia, Microsoft and Google are whipping AI into shape.
Happy Thursday and welcome to Patent Drop!
Today’s newsletter is about how the AI sausage gets made. Nvidia wants to take the work out of creating datasets, Microsoft wants to pick out the bad apples in AI training data, and Google wants to make sure your AI training systems are up to par.
Let's get into it, folks.
Nvidia’s synthetic data farm
AI models are voraciously data-hungry. Nvidia wants to satisfy their hunger.
The chip company wants to patent a system for generating “synthetic datasets for training neural networks.” It essentially uses a generative AI model to synthesize datasets that can be used in training a machine learning model for specific visual tasks, such as autonomous driving, robotics or facial recognition.
Feeding sample visual data to the generative model creates synthetic datasets that are more representative of authentic ones. “The generative model therefore serves as an aid in bridging the content gap that previously existed between synthetic data and real-world data,” Nvidia said in its filing.
The machine learning model is trained using the dataset, and is validated against a “real-world validation dataset,” a.k.a. an authentic one. Depending on how well the synthetic dataset trains the machine learning model, that outcome is used for “fine-tuning the generative model for making more synthetic datasets.”
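The feedback loop the filing describes can be sketched in a few lines of Python. This is purely illustrative: the toy generator and scoring functions below are hypothetical stand-ins, not Nvidia's actual system, but they show the generate, train, validate, fine-tune cycle:

```python
# Illustrative sketch of the feedback loop: generate a synthetic dataset,
# train a task model on it, validate against real data, then use the
# validation score to fine-tune the generator.
def synthetic_data_loop(generator, train_fn, evaluate_fn,
                        real_validation_set, rounds=3, samples_per_round=1000):
    scores = []
    for _ in range(rounds):
        synthetic_set = generator.sample(samples_per_round)   # 1. generate data
        task_model = train_fn(synthetic_set)                  # 2. train on it
        score = evaluate_fn(task_model, real_validation_set)  # 3. validate on real data
        scores.append(score)
        generator.fine_tune(feedback=score)                   # 4. refine the generator
    return scores

# Toy stand-ins so the loop can run end to end; a real system would plug in
# a generative image model and a vision task model instead.
class DemoGenerator:
    def __init__(self):
        self.realism = 0.2  # how closely samples resemble real data

    def sample(self, n):
        return [self.realism] * n

    def fine_tune(self, feedback):
        # A poor validation score pushes the generator toward more realism.
        self.realism = min(1.0, self.realism + (1.0 - feedback) * 0.4)

def demo_train(dataset):
    return sum(dataset) / len(dataset)  # the "model" is just the sample mean

def demo_eval(model, validation_set):
    target = sum(validation_set) / len(validation_set)
    return 1.0 - abs(model - target)    # 1.0 means it matches the real data
```

In this toy version, each round's validation score feeds back into the generator, so later synthetic datasets score progressively better against the real-world validation set.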
While synthetic data is already being used to solve the “laborious, costly, and time consuming task” of data collection for visual AI systems, Nvidia said conventional methods sometimes require experts to create “virtual worlds” to harvest synthetic data, which can be resource-consuming and not accurately mimic real-world scenes.
Having access to loads of synthetic data can make training AI a far more accessible task, said Kevin Gordon, co-founder of AI consulting and development firm Velora Labs. Massive datasets for training take tons of time and resources, and for small companies or individual developers, this cost often isn’t practical.
“This possibility of using a neural network system to just generate almost infinite content is really appealing,” said Gordon. “Especially for visual tasks where you really do need a lot of data in general … that can be really hard to capture and aggregate.”
Another benefit of synthetic data: preserving privacy. These datasets don’t entirely eliminate the use of real-world data (which can be connected to real-world people), as the AI model that creates them is trained on authentic data. However, extracting any authentic data from an AI model trained on synthetic data is significantly challenging, said Gordon.
“At the very least, it abstracts data,” said Gordon. “That level of decoupling can help with privacy. I wouldn't say it solves it completely, but it does do a really good job of obfuscation.”
Nvidia certainly isn’t the first company to consider synthetic data, said Gordon. Plenty of companies have been working on using synthetic data to solve what Gordon calls “the data problem.”
At the Conference on Computer Vision and Pattern Recognition in late June, about 1 in 4 companies were data-focused, many of them synthetic data providers, Gordon told me. And because Nvidia’s patent is filled with broad strokes and wide-reaching claims, actually securing it may be a difficult task, he said.
One thing that sets Nvidia’s patent apart is its mention of robotic systems. Generating synthetic data for robotics is a particularly difficult task compared to data collection for a large language model, said Gordon. And given that Nvidia has a substantial robotics division, this tech could complement that work.
At the end of the day, though, Nvidia's biggest moneymaker is its chips. Helpful software — or anything else — is just icing on the cake. “Really, they're interested in getting people to want to use their software so that they buy their chips,” Gordon said. “Having this as part of their solutions for more cheaply developed AI systems … This can help them maintain the top spot as the number one AI chip provider.”
Microsoft’s data detective
An AI model is only as good as the data it’s trained on. Microsoft wants to make its data as neat as a pin.
The company is seeking to patent a system for improving machine learning models by “detecting and removing inaccurate training data.” Microsoft’s system essentially works by picking out the outliers. Ironically, it relies on a machine learning model trained to evaluate training data and produce a “prediction confidence level” indicating whether a sample is erroneous and deviates too far from the rest of the dataset.
Microsoft’s tech specifically intends to improve machine learning-based classification, which is widely used across industries, including cybersecurity, logistics, autonomous driving and consumer tech, the company noted. Its system aims to weed out data that’s been inaccurately categorized or labeled due to human error, machine bugs or conflicts.
“As a result, ML model accuracy may be improved by training on a more accurate revised training set,” Microsoft noted.
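A minimal sketch of that idea, assuming a classifier that reports per-class probabilities (all names here are hypothetical, not Microsoft's implementation): samples whose recorded label conflicts with a high-confidence prediction get flagged as likely mislabeled.

```python
def filter_training_set(samples, predict_proba, confidence_threshold=0.9):
    """Split samples into kept vs. likely-mislabeled using a model's
    prediction confidence, a sketch of confidence-based data cleaning."""
    kept, removed = [], []
    for features, label in samples:
        probs = predict_proba(features)          # e.g. {"even": 0.95, "odd": 0.05}
        predicted = max(probs, key=probs.get)
        if predicted != label and probs[predicted] >= confidence_threshold:
            removed.append((features, label))    # confident disagreement: outlier
        else:
            kept.append((features, label))
    return kept, removed

def demo_proba(x):
    """Toy parity 'classifier' used only to exercise the filter."""
    return {"even": 0.95, "odd": 0.05} if x % 2 == 0 else {"even": 0.05, "odd": 0.95}
```

The surviving samples form the “more accurate revised training set” the filing mentions; the flagged ones can be dropped or sent for relabeling.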
Don’t want to dig up datasets yourself? Microsoft is working on that, too.
The company filed a patent application for a “machine-learning training service” using synthetic data. Microsoft’s filing details what it calls “synthetic data as a service,” which provides a machine learning training system that “allows customers to configure, generate, access, manage, and process synthetic data training datasets for machine learning.”
Microsoft’s system creates training datasets by taking synthetic data assets, such as 3D models and scenes, and altering their “intrinsic” and “extrinsic” parameters: things like location, orientation and focal length when viewing the 3D scene. The system also takes the manual labor of developing datasets, such as labeling, tagging and updating, off developers’ hands.
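As a rough sketch of what varying those parameters might look like, here is a hypothetical generator of camera configurations for rendering one 3D scene from many viewpoints. The parameter names and ranges are illustrative, not taken from the filing:

```python
import random

def generate_camera_variations(n, seed=0):
    """Sketch: vary extrinsic (position, orientation) and intrinsic
    (focal length) camera parameters to yield many views of one 3D scene."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated exactly
    variations = []
    for _ in range(n):
        variations.append({
            # Extrinsic parameters: where the camera sits and where it points.
            "position": [rng.uniform(-5.0, 5.0) for _ in range(3)],
            "yaw_deg": rng.uniform(0.0, 360.0),
            "pitch_deg": rng.uniform(-30.0, 30.0),
            # Intrinsic parameter: lens focal length in millimeters.
            "focal_length_mm": rng.uniform(18.0, 85.0),
        })
    return variations
```

Each configuration would drive one render of the scene, and because the renderer knows exactly what is in view, labels and tags come for free rather than by hand.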
Ensuring the quality of a dataset can become “quite a bottleneck” in developing AI, said Gordon. With the wellspring of data that Microsoft can access, finding a way to efficiently organize it could speed up the entire process. That said, Microsoft’s tech to train and detect inaccurate data would more likely be put to use internally, rather than sold as a service, Gordon said.
The company’s “synthetic data as a service,” on the other hand, could help enable tons of small startups and individual developers to work on their own AI models much more easily, Gordon said. This is doubly true for those that are just dabbling in AI and don’t have “hordes of people helping with the data side.”
“It might not be the tool that's really aimed towards big companies,” said Gordon.
But similar to Nvidia’s work with synthetic data, securing this patent may prove difficult. There are plenty of companies, big and small, innovating and competing in the synthetic data space.
“There are a lot of people in big organizations with the legal power behind it that are also working on (synthetic data),” said Gordon. “I can see these being quite tough unless (Microsoft) can attach it to something a little more concrete that they can narrow the scope on.”
Google compares and contrasts
Data isn’t the only thing that makes an AI model work. Google wants to make sure its chips are up to snuff, too.
The tech firm is seeking to patent a system for “debugging correctness issues” when training machine learning models. Google defines a “correctness issue” as basically a failure in training execution, or when the outcome is “not deemed to be acceptable for a particular context.” These correctness issues can stem from the “configuration” of the computing system performing the training.
Google’s system trains two machine learning models using two different computing systems and compares them to one another. The system uses “shared training operations” on each, meaning that the only difference in training is the computing systems themselves.
Google’s system then comes up with a “similarity measure” between the two models, determined by comparing each model’s outputs. The similarity measure is then used to compare how well different computing systems train models and to identify what needs to be debugged, Google noted.
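A minimal sketch of that comparison, with hypothetical names throughout: train the same model configuration on two systems, then score how closely their outputs agree.

```python
def similarity_measure(outputs_a, outputs_b):
    """Return 1.0 when two models' outputs match exactly, lower as they
    diverge. A simple stand-in for the filing's 'similarity measure'."""
    diffs = [abs(a - b) for a, b in zip(outputs_a, outputs_b)]
    return 1.0 - sum(diffs) / len(diffs)

def compare_systems(train_on, model_config, dataset, system_a, system_b):
    """Train the same model with shared training operations on two computing
    systems; a low similarity score points at a correctness issue."""
    model_a = train_on(system_a, model_config, dataset)
    model_b = train_on(system_b, model_config, dataset)
    inputs = [x for x, _ in dataset]
    return similarity_measure([model_a(x) for x in inputs],
                              [model_b(x) for x in inputs])

def fake_train(system, config, dataset):
    """Toy trainer: pretend 'candidate' hardware adds a tiny numeric offset."""
    noise = 0.0 if system == "reference" else 0.01
    return lambda x: x * config["scale"] + noise
```

Because the training operations are shared, any gap in the similarity score can be pinned on the computing systems themselves rather than the model or data.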
Think of it this way: Imagine two cars taking a road trip from San Francisco to Los Angeles, stopping at the same spots for gas and hitting the same traffic. But one car is a brand-new Maserati and the other is, well, a 2002 Ford Fiesta. You can probably guess which car will get there first.
It’s sometimes easy to tell when a computing system isn’t doing its job properly, but when training a neural network, there can sometimes be what Gordon calls “fuzziness.”
Neural networks are not “super-precise instruments,” Gordon said, so their parameters and conditions can shift to a certain extent without breaking accuracy. Google’s system may aim to figure out how different AI training hardware impacts that breaking point.
(Going back to the car analogy, Google’s system may be trying to figure out how good a Fiesta needs to be to compete with the Maserati. While it may not be able to get from San Francisco to L.A. at exactly the same time, the difference may be negligible.)
If this patent is related to Google’s internal hardware effort, it could provide a major benefit to the company’s chip business. While Google isn’t a name-brand chipmaker, if it can build hardware that can “tolerate fuzziness in computation” without compromising accuracy, that opens the door for it to create a “100-times more power efficient chip.”
“If they can solve the kind of fuzziness of it, they can have something that's really competitive compared to what Nvidia, or really anybody else, offers,” said Gordon.
Google has been trying to go after Nvidia’s dominance in the AI hardware space. In April, the company released a research paper claiming that its Tensor Processing Units, which power more than 90% of its AI training as part of its supercomputers, are faster and more energy efficient than Nvidia’s comparable A100 chip.
While Google does not sell its TPUs outright, the hardware is a major piece of the company’s AI work. Making these chips even more efficient could be part of its plan to remain a top player in the AI arms race.
Extra Drops
Some other fun patents we wanted to share.
Uber wants to figure out your ETA more accurately. The company is seeking to patent a system for “arrival time prediction” using deep learning.
Do you like to drive through certain neighborhoods in your commute? Ford has something for that. The auto manufacturer filed to patent a method for “personalized route prediction” which uses AI to predict which routes you like best based on “recent mobility behaviors.”
EBay wants to predict what’s going to sell the best. The ecommerce company filed to patent a system for “image-based popularity prediction,” which gives the photos in your listing an “image quality score” to predict if it’ll sell.
What else is new?
E.U. regulators opened an antitrust probe into Microsoft’s bundling of Teams with other Office products, alleging the practice may give the app “a distribution advantage.”
Meta posted stronger-than-expected earnings on Wednesday and saw its share prices jump, reflecting a rebound in the digital advertising market and good results from its “year of efficiency” cost cutting measures.
Rage against the machine? AI-generated songs call into question how computers help make music — but it’s a debate that has been decades in the making. Check out the history of machine generation in music.*
*Partner
Have any comments, tips or suggestions? Drop us a line! Email at admin@patentdrop.xyz or shoot us a DM on Twitter @patentdrop.