Draft:Crawlee
| Developer(s) | Apify |
|---|---|
| Initial release | 13 July 2022 |
| Written in | TypeScript, Python |
| Operating system | Windows, macOS, Linux |
| Type | Web crawler |
| License | Apache License 2.0 |
Crawlee is a free and open-source web-crawling and browser automation library developed by Apify. The TypeScript version was first released under the Crawlee name in 2022, and a Python version followed in 2024.
Crawlee's architecture is built around modular crawlers responsible for extracting data from websites.[1] The library follows a declarative programming approach in which users define crawling logic through a structured set of rules. Crawlee manages requests with queues; for each request, a specific function is executed to extract data or perform further processing.[2]
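The queue-and-handler pattern described above can be illustrated with a minimal sketch. The Python code below is not Crawlee's API: the names `crawl`, `follow_all_links`, and the in-memory `PAGES` dictionary (which stands in for real HTTP fetching) are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Callable, Dict, List

# Hypothetical in-memory "website": URL -> links found on that page.
PAGES: Dict[str, List[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/"],
}


def crawl(start_url: str, handler: Callable[[str, List[str]], List[str]]) -> List[str]:
    """Process URLs from a FIFO queue, calling `handler` once per unique URL."""
    queue = deque([start_url])
    seen = {start_url}
    visited: List[str] = []
    while queue:
        url = queue.popleft()
        links = PAGES.get(url, [])  # stands in for fetching and parsing the page
        visited.append(url)
        # The handler processes the page and returns the URLs to enqueue next.
        for next_url in handler(url, links):
            if next_url not in seen:  # skip requests that were already queued
                seen.add(next_url)
                queue.append(next_url)
    return visited


def follow_all_links(url: str, links: List[str]) -> List[str]:
    """Example handler: follow every link found on the page."""
    return links


visited = crawl("https://example.com/", follow_all_links)
```

In this sketch each URL is handled exactly once, in the order it was queued, even though the sample pages link back to one another.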
Crawlee supports both headless browser sessions (via Playwright and other browser automation software) and plain HTTP request-based scraping.
It also provides web-scraping utilities, such as a sitemap parser[3] and an automatic HTTP proxy manager.
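As a rough illustration of what such utilities do, the following Python sketch extracts URLs from a sitemap and assigns proxies to them in round-robin order. It uses only the standard library and does not reflect Crawlee's own implementation or API; the function names, the sample sitemap, and the proxy addresses are assumptions made for the example.

```python
import itertools
import xml.etree.ElementTree as ET
from typing import Iterator, List

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[str]:
    """Return the URLs listed in the <loc> elements of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def rotate_proxies(proxies: List[str]) -> Iterator[str]:
    """Yield proxies in round-robin order, indefinitely."""
    return itertools.cycle(proxies)


# Hypothetical sample data for the sketch.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""
PROXIES = ["http://proxy-1.example:8000", "http://proxy-2.example:8000"]

urls = parse_sitemap(SAMPLE_SITEMAP)
# Pair each URL with the next proxy; the third URL wraps around to the first proxy.
assignments = list(zip(urls, rotate_proxies(PROXIES)))
```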
Notable projects that use Crawlee include GPT Crawler by Builder.io[4] and various generative AI projects maintained by AWS Labs.[5]
History
The first stable TypeScript version was released in 2021 under the name Apify SDK.[6] This version offered both the open-source crawling framework and the proprietary storage implementation for use on the Apify platform.
In 2022, version 3.0.0 was released,[7] renaming the library to Crawlee. This update made Crawlee independent of the Apify platform, moving most of the Apify-specific features into a separate package (also named Apify SDK).
In 2024, a beta version of Crawlee for Python was released.[8]
References
1. Koekemoer, Jakkie. "Web Scraping with Crawlee: Step-By-Step Tutorial". Bright Data.
2. Nechytailo, Yelyzaveta. "Crawlee Tutorial: Easy Web Scraping and Browser Automation". oxylabs.io.
3. "Release v3.7.0 · apify/crawlee". GitHub. Retrieved 22 September 2024.
4. "BuilderIO/gpt-crawler: Crawl a site to generate knowledge files to create your own custom GPT from a URL". GitHub. Retrieved 21 September 2024.
5. "awslabs/generative-ai-cdk-constructs: AWS Generative AI CDK Constructs are sample implementations of AWS CDK for common generative AI patterns". GitHub. Amazon Web Services - Labs. 20 September 2024. Retrieved 21 September 2024.
6. "Release v1.0.0 · apify/crawlee". GitHub.
7. "Release v3.0.0 · apify/crawlee". GitHub.
8. "Announcing Crawlee for Python: Now you can use Python to build reliable web crawlers | Crawlee · Build reliable crawlers. Fast". crawlee.dev. 5 July 2024.