Perplexity AI embroiled in controversy over alleged web scraping abuse

by admin June 30, 2024

Perplexity AI has found itself at the center of a firestorm over its data collection practices.

The company, which is developing an AI-powered “answer engine” that essentially fuses a search engine with generative AI, has been accused on multiple fronts of improperly scraping content from numerous websites, including those that explicitly prohibit it.

The scandal erupted on June 11 when Forbes reported that Perplexity had lifted an entire article from its site, complete with custom illustrations, and repurposed it with only minimal attribution.

Not long after, WIRED conducted an investigation that uncovered evidence of Perplexity scraping content from websites that forbid automated data collection.

A website can request that its content not be scraped by web crawlers through a file called “robots.txt.”

This exclusion protocol is a standard used by websites to communicate with web crawlers and other automated bots. It’s a simple text file placed on a website’s server that specifies which pages or sections of the website should not be accessed or scraped by these automated tools.

The robots.txt file has been a widely respected convention since the early days of the web. It helps website owners maintain control over their content and prevent unauthorized data collection.

Although not legally binding, it has long been considered best practice for web crawlers to follow the instructions outlined in a website’s robots.txt file.
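To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might check robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and chosen only for illustration; they do not represent any publisher's actual file or Perplexity's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and
# keeps every other bot out of the /private/ section only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/private/report"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in the protocol enforces the answer. A crawler that skips this check, or ignores its result, can still request the page, which is why compliance rests on convention rather than on any technical barrier.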

Key points of the ongoing scandal include:

Forbes has accused Perplexity of wholesale lifting one of its articles without proper attribution.
WIRED has found that Perplexity scraped websites that explicitly forbid such practices via robots.txt.
Other publishers are voicing concerns that such unauthorized scraping threatens their intellectual property.

Jason Kint, CEO of Digital Content Next, a trade group representing online publishers, minced no words in his assessment.

“By default, AI companies should assume they have no right to take and reuse publishers’ content without permission,” he said.

“If Perplexity is skirting terms of service or robots.txt, the red alarms should be going off that something improper is going on.”

These revelations have now prompted Amazon Web Services (AWS), which hosts a server implicated in Perplexity’s alleged improper scraping, to launch an investigation.

AWS strictly prohibits customers from engaging in abusive or illegal activities that violate its terms of service.

Perplexity CEO Aravind Srinivas initially brushed off the concerns, asserting they reflected “a deep and fundamental misunderstanding” of the company’s operations and the internet at large.

However, in a subsequent interview with Fast Company, he conceded that Perplexity relied on an unnamed third-party vendor for web crawling and indexing, suggesting the vendor was to blame for any robots.txt violations.

Srinivas declined to identify the vendor, citing a non-disclosure agreement.

For the moment, Perplexity appears determined to weather the storm, with a spokesperson downplaying the AWS probe as “standard procedure” and indicating the company has made no changes to its operations.

However, the startup’s defiant stance may prove untenable as the groundswell of concern over AI’s data practices continues to build.
