- cross-posted to:
- programming@programming.dev
cross-posted from: https://programming.dev/post/8121843
~n (@nblr@chaos.social) writes:
This is fine…
“We observed that participants who had access to the AI assistant were more likely to introduce security vulnerabilities for the majority of programming tasks, yet were also more likely to rate their insecure answers as secure compared to those in our control group.”
[Do Users Write More Insecure Code with AI Assistants?](https://arxiv.org/abs/2211.03622)
This is just an extension of the larger issue of people not understanding how AI works, and trusting it too much.
AI is and has always been about exchanging accuracy for speed. It excels in cases where slow, methodical work is not given sufficient time already, because the accuracy is already low(er) as a result (e.g. overworked doctors examining CT scans).
But it should never be treated as the final word on something; it’s the first ~70%.
I feel like I’ve been screaming this for so long and you’re someone who gets it. AI stuff right now is pretty neat. I’ll use it to get jumping off points and new ideas on how to build something.
I would never ever push something written by it to production without scrutinizing the hell out of it.
Didn’t it turn out that the CT scan analysis thing was just the model figuring out the rough age of the machine, because older machines tend to be in poorer places with more cancer and are more likely to only be used on serious illnesses?
If taking into account the older machines results in better healthcare, that seems like a great thing to be discovered as a result of the use of machine learning.
Your summary sounds like it may be inaccurate, but it’s interesting enough for me to want to know more.
I believe it was from a study on detecting tuberculosis, but unfortunately Google hasn’t been very helpful for me.
The problem is that people in poorer areas being more at risk from TB is not a new discovery. A model that is intended and billed as detecting TB from a scan should ideally not be using a factor like “the hospital is old and poor” to decide whether a scan shows diseased tissue. That intrinsically means the model is more likely to miss TB in patients at better hospitals while over-diagnosing it in poorer ones, and of course at-risk people can still go to newer hospitals.
A Doctor will take risk factors into consideration, but would also know that just because their hospital got a new machine doesn’t mean that their patients are now less likely to have a potentially fatal disease. This results in worse diagnosis, even if it technically scores better with the training set.
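To make the "scores better on the training set, diagnoses worse" point concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the counts are synthetic and not taken from any study, and `shortcut_model` is a hypothetical stand-in for a model that learned "old machine means disease" instead of looking at the tissue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    old_machine: bool  # proxy feature the model should not rely on
    diseased: bool     # ground truth


def make_scans(old_machine: bool, n_diseased: int, n_healthy: int) -> list[Scan]:
    """Build a synthetic batch of scans from one kind of machine."""
    return ([Scan(old_machine, True)] * n_diseased
            + [Scan(old_machine, False)] * n_healthy)


def shortcut_model(scan: Scan) -> bool:
    """Hypothetical model: predicts disease purely from the machine's age."""
    return scan.old_machine


def accuracy(scans: list[Scan]) -> float:
    """Fraction of scans where the prediction matches the ground truth."""
    return sum(shortcut_model(s) == s.diseased for s in scans) / len(scans)


def sensitivity(scans: list[Scan]) -> float:
    """Fraction of diseased patients the model actually flags."""
    sick = [s for s in scans if s.diseased]
    return sum(shortcut_model(s) for s in sick) / len(sick)


# Synthetic training data: old machines mostly see sick patients, new ones mostly healthy.
training = make_scans(True, 400, 100) + make_scans(False, 100, 400)

# Synthetic deployment: a newly equipped hospital that still serves at-risk patients.
new_hospital = make_scans(False, 50, 50)

print(accuracy(training))         # 0.8, looks respectable
print(sensitivity(new_hospital))  # 0.0, every sick patient is missed
```

The shortcut gets 80% on the data it was built from and then misses every diseased patient at the hospital with the new machine, which is exactly the failure described above.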
A Doctor will take risk factors into consideration
Unfortunately we see that the data doesn’t support this assumption. Poor populations are not given the same attention by doctors. Black populations in particular receive worse healthcare in the US after adjusting for many factors like income and family medical history.
It’s unfortunately not certain that they will take such measures with their patients, even though most try, and indeed ethnic disparities are one of the things likely to be made worse by machine learning, given that there is often little thought or training data devoted to them. But the age of the hospital’s machine is not a good proxy for risk factors. It might be statistically correlated, but it doesn’t determine the actual patient’s risk. Less at-risk people may go to a cheaper hospital, and more at-risk people might live in a city which also has a very up-to-date hospital.
It’s a decent first screen for pattern recognition for sure, but it is fast which is where I see most of its value. It can process information that people would never get to.
Anyone who’s going to copy and paste code that they don’t understand is inherently a security vulnerability.
True
This isn’t even a debate lol…
Stuff like Copilot is awesome at making code that looks right but contains subtly wrong variable names it made up itself, or bad algorithms.
And that’s not the big issue.
The big issue is when you get distracted for 5 mins, come back, and forget that you were still working through that block of AI-generated code (which looks correct). You never check the rest of it, it makes it into the source code, and only during later testing do you realise it’s screwed because it’s AI-generated code.
The other big issue is that it’s only a matter of time until people get fed up and start feeding these systems dodgy data to de-train them, making them worse or planting backdoors.
People are including AI generated code in their projects without fully reading it or understanding how it works.
The same ones that were blindly copying and pasting from StackOverflow previously found a more convenient way to make their code “work”.
My argument is thus:
LLMs are decent at boilerplate. They’re good at rephrasing things so that they’re easier to understand. I had a student who struggled for months to wrap her head around how pointers work. Two hours with GPT and the ability to ask clarifying questions, and now she’s rockin’.
I like being able to plop in a chunk of Python and say, “type annotate this for me and none of your sarcasm this time!”
But if you’re using an LLM as a problem solver and not as an accelerator, you’re going to lack some of the deep understanding of what happens when your code runs.
The thing is that this is NOT what the marketers are selling, they’re not selling this as “Buy access to our service so that your products will be higher quality”, they’re selling this as “this will replace many of your employees”. Which it can’t, it’s very clear by now that it just can’t.
People tend to deify LLMs, because of the vast amounts of knowledge trained into them, but their answers are more like a single “reasoning iteration”.
How many human coders are capable of sitting down, typing a bunch of code at 100 WPM out of the blue, and ending up with zero security flaws or errors? Pretty much none, not even if they get updated requirements, and the same holds for LLMs. Coding is an iterative job, not a “zero shot” one.
Have an LLM iterate several times over the same piece of code (“think” about it), have it explain what it’s doing each time (“reason” about it)… then test run it, fix any compiler errors… run a test suite, fix any non-passing tests… then ask it to take into account a context of best practices and security concerns. Only then can the code be compared to that of a serious human coder.
But that takes running the AI over and over and over with a large context, while AIs are being marketed as “single run, magic bullet”… so we can expect a lot of shit to happen in the near future.
On the bright side, anyone willing to run an LLM a hundred times over every piece of code, like in a CI workflow, in an error seeking mode, could catch flaws that would otherwise take dozens of humans to spot.
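For what it’s worth, the iterate, compile, test, fix loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how any real tool works: `Reviser` is a hypothetical stand-in for whatever LLM call you use, `fake_revise` is a canned response so the example runs offline, and the checks are plain functions playing the role of a test suite.

```python
from typing import Callable, Optional

# Hypothetical stand-in for an LLM call: (current code, feedback) -> revised code.
Reviser = Callable[[str, str], str]
# A check inspects the namespace the code produced and returns a problem, or None.
Check = Callable[[dict], Optional[str]]


def find_problem(code: str, checks: list[Check]) -> Optional[str]:
    """Return the first problem found, or None if the code compiles and passes."""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc.msg}"
    namespace: dict = {}
    try:
        exec(compiled, namespace)  # untrusted code belongs in a sandbox, not here
    except Exception as exc:
        return f"runtime error: {exc!r}"
    for check in checks:
        try:
            problem = check(namespace)
        except Exception as exc:
            problem = f"check crashed: {exc!r}"
        if problem:
            return problem
    return None


def iterate(code: str, revise: Reviser, checks: list[Check],
            max_rounds: int = 5) -> tuple[str, Optional[str]]:
    """Feed each failure back to the reviser until the checks pass or rounds run out."""
    for _ in range(max_rounds):
        problem = find_problem(code, checks)
        if problem is None:
            return code, None
        code = revise(code, problem)
    return code, find_problem(code, checks)


def fake_revise(code: str, feedback: str) -> str:
    """Canned 'model response' so the sketch runs without any API."""
    return "def add(a, b):\n    return a + b\n"


def check_add(namespace: dict) -> Optional[str]:
    return None if namespace["add"](2, 3) == 5 else "add(2, 3) should be 5"


final, problem = iterate("def add(a, b):\n    return a - b\n", fake_revise, [check_add])
print(problem)  # None
```

Passing the checks only means the code passed those checks. A security review step would have to be written as one more check or one more prompt, and a human still has to read the result.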
Excellent points!