When I post something here, I usually have the hope that someone else benefits from it in some small, tiny way. Either because it makes them laugh or because it gives them valuable information.
I don’t understand why anyone would post if they can’t bear the idea that someone else might get a tiny, little something for free.
It occurs to me that a lot of people don’t know the background here. (ETA: I wrote this in response to a different article, so some refs don’t make sense.)
LAION is a German Verein (a club). It’s mainly a German physics/comp sci teacher who does this in his spare time. (German teachers have the equivalent of a Master’s degree.)
He took data collected by an American non-profit called Common Crawl. “Crawl” means that they have a computer program that automatically follows all links on a page, and then all links on those pages, and so on. In this way, Common Crawl basically downloads the internet (or rather the publicly reachable parts of it).
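For anyone who wants to see the idea of a crawl in code, here is a minimal sketch. It is not Common Crawl's actual software. The names `LinkParser`, `crawl` and `TOY_WEB` are made up for illustration, and the "internet" is just a small dictionary so that nothing is really downloaded.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(fetch, start_url, max_pages=100):
    """Follow links breadth-first, starting from start_url.

    `fetch` is a hypothetical function that returns the HTML of a URL,
    or None if the page cannot be reached.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link, nothing to store
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # visit every page only once
                seen.add(link)
                queue.append(link)
    return pages


# A toy stand-in for the web (assumption: made-up URLs and pages).
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">back</a>',
    "http://c.example/": '<a href="http://missing.example/">dead link</a>',
}

pages = crawl(TOY_WEB.get, "http://a.example/")
```

A real crawler would also have to respect robots.txt, resolve relative links, and spread requests out over time, none of which is shown here.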
Search engines, like Google or Microsoft’s Bing, crawl the internet to create the databases that power their search. But these and other for-profit businesses aren’t sharing the data. Common Crawl exists so that independent researchers also have some data to study the internet and its history.
Obviously, these data sets include illegal content. It’s not feasible to detect all of it. Even if you could manually look at all of it, that would be illegal in a lot of jurisdictions. Besides, which standards of illegal content should one apply? If a Chinese researcher downloads some data and learns things about Tiananmen Square in 1989, what should the US do about that?
Well, that data is somehow not the issue here, for some reason. Interesting, no?
The German physics teacher wrote a program that extracted links to images, as well as their accompanying text descriptions, from Common Crawl. These links and descriptions were put into a list - a spreadsheet, basically. The list also contains metadata like the image size. On top of that, he used AI to guess if they are “NSFW” (i.e. porn), and if people would think they are beautiful. This list, with 5 billion entries, is LAION-5B.
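Roughly, the extraction step could look like the sketch below. This is not LAION's actual code. I am assuming that the "text description" is the `alt` attribute of an `<img>` tag and that the size comes from the `width` and `height` attributes; `ImageParser` and `extract_rows` are made-up names. The AI scoring for NSFW and beauty is left out entirely.

```python
from html.parser import HTMLParser


class ImageParser(HTMLParser):
    """Turns every <img> tag that has a description into one list row."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        alt = (attributes.get("alt") or "").strip()
        if src and alt:  # skip images without a link or a description
            self.rows.append({
                "url": src,
                "text": alt,
                "width": attributes.get("width"),
                "height": attributes.get("height"),
            })


def extract_rows(html):
    """Return the link/description rows found in one HTML page."""
    parser = ImageParser()
    parser.feed(html)
    return parser.rows


# Made-up example page (assumption: not real data).
sample = (
    '<p>Hello</p>'
    '<img src="http://a.example/dog.jpg" alt="A wooden dog" width="640" height="480">'
    '<img src="http://a.example/spacer.gif">'
)
rows = extract_rows(sample)
```

Note that only the link and the text end up in the list. The images themselves are not stored.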
Sifting through petabytes of data to do all that is not something you can do on your home computer. The funding that Stability AI provided for this was a few thousand USD for supercomputer time in “the cloud”.
German researchers at the LMU - a government-funded university in Munich - had developed a new image AI, which is especially efficient and can be run on normal gaming PCs. (The main people now work at a start-up in New York.) The AI was trained on that open source data set and named Stable Diffusion in honor of Stability AI, which had provided the several hundred thousand USD needed to pay for the supercomputer time.
These supposed issues are only an issue for free and open source AI. The for-profit AI companies keep their data sets secret. They are fairly safe from accusations.
Maybe one should use PhotoDNA to search for illegal content? The for-profit company PhotoDNA, which so kindly provided its services for free to this study, is owned by Microsoft, which is also behind OpenAI.
Or maybe one should only use data that has been manually checked by humans? That would be outsourced to a low-wage country for pennies, but no need: Luckily, billion-dollar corporations exist that offer just such data sets.
This article solely attacks non-profit endeavors. The only for-profit companies mentioned (PhotoDNA, Getty) stand to gain from these attacks.
I don’t get it. What’s the joke?
Makes one wonder if there is some lobby org behind this. The benefits to major corporate interests are obvious, and it feels a little campaigny.
That’s way overkill. Just Markov chain it.
True, I chose a very bad example there and muddied the waters.
Normally, trademarks aren’t so bad, relatively speaking. As long as there’s no confusion about who is responsible for the product, and there’s no defamation, you should be able to use those pretty freely. When “trademark dilution” comes into play, it can get onerous, though.
you can take a movie still-frame and make an oil painting of it and it’ll be your work.
Maybe, but not usually. This is making a derivative work. Derivative works have their own copyright, but permission of the original owner is required to make them. In US terms, it might be fair use, if the painter wants to, say, make an artistic statement about consumer culture. EG Mickey Mouse has shown up in South Park episodes for the purpose of satire. That’s fine.
OTOH, if there’s nothing deeper behind the painting, then it’s just unlicensed merch. EG, Disney has come down on day care centers for using their IP.
Whether the OP describes infringement is doubtful to me. No one owns the right to make pictures of EG people next to wooden dogs. On its face, there is no infringement.
Not saying that there aren’t people like that, but this ain’t it. This tool specifically targets open source. The intention is to ruin things that aren’t owned and controlled by someone. A big part of AI hate is hyper-capitalist like that, though they know better than to say it openly.
People hoping for a payout get more done than people just being worried or frustrated. So it’s hardly a surprise that they get most of the attention.
It can only target open source, so it wouldn’t bother corpos at all. The people behind this object to not everything being owned and controlled. That’s the whole point.
Not really. It’s like with humans. Without the occasional reality checks it gets weird, but what people choose to upload is a reality check.
The pre-AI web was far from pristine, no matter how you define that. AI may improve matters by increasing the average quality.
If we all work together, we can make sure that none of us can benefit from the other? How does that even make sense?
Look. I am unable to understand why this bothers you. I like feeling I have a positive effect on the world. I like knowing, EG, that my taxes help the less fortunate. What you are saying seems completely absurd to me.