This project uses Selenium and BeautifulSoup to scrape product details from an e-commerce site. It focuses on a single product type and retrieves each product's Name, Price, Rating, Number of reviews, and URL. The code can be adapted to other websites. After extraction, the data is compiled into a .csv file that users can use for model shortlisting or analytics.
The project centers on DELL Laptops, employing Pandas, Matplotlib, and Seaborn for dataset analysis within a Jupyter Notebook environment. Essential package installations include Selenium and bs4, while browser-specific drivers, like msedgedriver.exe for Microsoft Edge, enable access to website data.
Begin the coding process for the Amazon data scraping function by following these steps:
Import Packages:
To scrape Amazon data, import the packages the project requires, in particular Selenium and BeautifulSoup (bs4).
Web Driver:
Define the path to the downloaded driver executable, such as "location/msedgedriver.exe", so that Selenium can use it. Once the driver is created, the browser launches automatically with an empty page.
Enter Amazon India URL:
Utilize the .get() function to open the site using the URL as an argument. This step initiates access to the specified Amazon India webpage.
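The driver setup and the .get() call can be sketched as follows. This is a minimal sketch assuming Selenium 4 and Microsoft Edge; the function name open_home_page and the AMAZON_INDIA_URL constant are illustrative and not taken from the original code, and the driver path is the placeholder used above.

```python
AMAZON_INDIA_URL = "https://www.amazon.in"  # assumed home-page URL


def open_home_page(driver_path="location/msedgedriver.exe"):
    """Launch Edge through the given driver and open Amazon India."""
    # Imported inside the function so this module still loads
    # on a machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.edge.service import Service

    # Creating the driver opens a browser window on an empty page.
    driver = webdriver.Edge(service=Service(driver_path))
    # .get() takes the URL as its argument and navigates to it.
    driver.get(AMAZON_INDIA_URL)
    return driver
```

Calling open_home_page() returns the driver object, which later steps can reuse to load the search results.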
Generate Search Item URL:
To search, combine the URL with the item's name. Use the search_term variable, which holds the item name, and create a function that inserts this name into the URL dynamically. This way the same code can search for any specified item.
Replace Spaces in Search Term:
Substitute spaces with "+" in the search_term variable. URLs cannot contain literal spaces, so the words of a multi-word input are joined with this symbol. This adjustment ensures the search term is properly formed for use in the URL.
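A small sketch of such a function is shown here. The name get_url and the SEARCH_TEMPLATE constant are assumptions; the query string follows the Amazon India search URL format quoted later in this article.

```python
# Search URL with a placeholder for the item name (assumed template).
SEARCH_TEMPLATE = "https://www.amazon.in/s?k={}&ref=nb_sb_noss_2"


def get_url(search_term):
    """Return the search-results URL for a (possibly multi-word) item name."""
    # Join the words with "+" because a URL cannot contain spaces.
    joined = search_term.strip().replace(" ", "+")
    return SEARCH_TEMPLATE.format(joined)


print(get_url("dell laptops"))
```

The printed URL is the one the browser should open in the next step.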
Now, proceed to open the generated URL in the browser. This action is essential for initiating the Amazon data scraping process and navigating to the specific search results page.
Extract Data:
Retrieve all HTML code from the Page Source. Although manual extraction from the site's page source is possible through right-clicking and selecting "View page source," this process is inefficient. Instead, utilize BeautifulSoup to automate the extraction of HTML code, streamlining the data retrieval process.
Extract Relevant Data:
Focus solely on the results pertinent to the search_term. After analyzing the page source, identify the suitable tag for extraction: <div data-component-type="s-search-result">. Retrieve all data associated with this tag to gather the relevant information for the specified search term.
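The parsing and filtering steps might look like the sketch below. To keep it runnable without a browser, a short hand-written HTML string stands in for the page source (in the real scraper this would come from Selenium's driver.page_source). The sample HTML, the "other-widget" value, and the function name extract_results are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; only two divs carry the result tag.
SAMPLE_HTML = """
<div data-component-type="s-search-result"><h2>Laptop A</h2></div>
<div data-component-type="other-widget"><h2>Not a result</h2></div>
<div data-component-type="s-search-result"><h2>Laptop B</h2></div>
"""


def extract_results(page_source):
    """Parse the HTML and keep only the search-result blocks."""
    soup = BeautifulSoup(page_source, "html.parser")
    return soup.find_all("div", {"data-component-type": "s-search-result"})


data_extracted = extract_results(SAMPLE_HTML)
print(len(data_extracted))  # number of products found on the page
```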
Iterative Data Extraction:
The provided code extracts e-commerce data solely from the first page. To extend this functionality across multiple pages, incorporate a loop in subsequent code segments. The length of the data_extracted variable corresponds to the number of products on the initial page. Be mindful that some products may lack pricing, rating, or review information, which can cause errors in later code sections.
Data Prototype:
Establish a foundational understanding of the tags essential for extracting specific product information. Create a prototype as a reference, outlining the tags for the extraction process. This prototype serves as a guide for identifying and retrieving relevant data about each product on the webpage.
Extract Record Function:
Refine the extraction by creating an extract_record() function. This function focuses on retrieving specific details, such as price and ratings, essential for forming conclusions about each product. This ensures that only the necessary information is extracted from the HTML code, streamlining the data analysis process.
Error Handling:
Implement error handling within the extract_record() function to accommodate cases where variables, such as price or reviews, might not have assigned values. This keeps the code robust and prevents errors when specific product details are unavailable.
Utilize a loop to iterate over each product, appending the data to the records list. This list becomes a collection of tuples, each representing the details of a specific laptop. This structured approach keeps the product information organized for further analysis or export.
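The record extraction, its error handling, and the loop can be sketched together. The tags and class names below (the h2 link, "a-price", "a-offscreen", the i tag, and especially "review-count") are assumptions about the page markup, and the sample HTML is invented; inspect the live page source to confirm the tags before relying on them.

```python
from bs4 import BeautifulSoup

# Invented sample: the second product has no price, rating, or reviews.
SAMPLE_HTML = """
<div data-component-type="s-search-result">
  <h2><a href="/dp/EXAMPLE1">Laptop A</a></h2>
  <span class="a-price"><span class="a-offscreen">₹60,000</span></span>
  <i>4.0 out of 5 stars</i>
  <span class="review-count">12</span>
</div>
<div data-component-type="s-search-result">
  <h2><a href="/dp/EXAMPLE2">Laptop B</a></h2>
</div>
"""


def extract_record(item):
    """Return (name, price, rating, review_count, url) for one result."""
    link = item.h2.a
    name = link.text.strip()
    url = "https://www.amazon.in" + link.get("href")
    try:
        price = item.find("span", "a-price").find("span", "a-offscreen").text
    except AttributeError:  # find() returned None: no price listed
        price = ""
    try:
        rating = item.i.text
        review_count = item.find("span", "review-count").text
    except AttributeError:  # no rating or no review count on the page
        rating, review_count = "", ""
    return (name, price, rating, review_count, url)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for item in soup.find_all("div", {"data-component-type": "s-search-result"}):
    records.append(extract_record(item))
```

Missing fields become empty strings rather than raising an error, so every product still yields a complete tuple.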
Intel Core i7-12650H (10-Core, 24MB, up to 4.70 GHz) // Memory & Storage: 16 GB, 2 x 8 GB, DDR5, 4800 MHz, dual-channel & 512GB SSD
Navigate Through Pages:
Utilize the page query in the URL, such as https://www.amazon.in/gp/browse.html?node=1375424031&ref_=nav_em_sbc_mobcomp_laptops_0_2_8_15, to navigate through pages. Concatenate each query with the URL using "&" to access different pages sequentially. This method systematically explores multiple pages to obtain comprehensive data on the searched item.
Upon executing the preceding function, the query will resemble the following format: https://www.amazon.in/s?k=laptops&ref=nb_sb_noss_2&page={}. In this structure, any page number can be substituted for the "{}" placeholder to navigate through the pages of the search results.
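As a sketch, the page query can be appended to the search URL and filled in with str.format(). The function name get_paged_url is hypothetical, and the sketch assumes the page parameter is written as "page=N".

```python
def get_paged_url(search_term):
    """Return a search URL that ends in a '{}' placeholder for the page number."""
    base = "https://www.amazon.in/s?k={}&ref=nb_sb_noss_2".format(
        search_term.replace(" ", "+")
    )
    # Queries are joined with "&"; the page number is filled in later.
    return base + "&page={}"


url = get_paged_url("laptops")
# Build the URLs for the first three result pages.
page_urls = [url.format(page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

In the scraper, each of these URLs would be opened in turn and its results appended to the same records list.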
Combined Code:
The consolidated code incorporates the functions and assignments in the required order. Copy and run this code on your system, provided you have the necessary packages installed, to initiate the web scraping process efficiently.
The driverFunction() function will generate an "amazon_scrape_data.csv" file, serving as a valuable resource for product selection and future analysis. This CSV file consolidates the extracted data, offering a convenient format for users to explore, evaluate, and utilize the scraped information.
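The CSV-writing step could be sketched as below using Python's built-in csv module. The helper name save_records is hypothetical, and the "Name" and "URL" column headers are assumptions; Price, Ratings, and Review_Count are the column names used in the analysis that follows.

```python
import csv

# Column order matches the tuples built for each product.
HEADER = ["Name", "Price", "Ratings", "Review_Count", "URL"]


def save_records(records, path="amazon_scrape_data.csv"):
    """Write the header row and one row per product tuple."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(records)
    return path
```

The resulting file can be loaded directly with pandas.read_csv() for the analysis.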
Next Step: Analysis of DELL Laptops on Amazon India
With the established data scraping mechanism, we can now delve into the analysis and visual representation of DELL Laptops on Amazon India. Let's explore critical insights, trends, and patterns within the extracted data, providing a comprehensive view for informed decision-making and strategic planning.
Sample Laptop Information:
Brand: Dell
Model Name: G15-5520
Screen Size: 15.6
Colour: Dark Shadow Grey
Hard Disk Size: 512 GB
CPU Model: Core i7
RAM Memory Installed Size: 16 GB
Operating System: Windows 11
Special Feature: Backlit Keyboard
Graphics Card Description
This laptop's name encompasses essential details such as screen size, processor, colour options, hard disk size, and specifications related to graphics, operating system, RAM, and storage.
It's imperative to gain a preliminary understanding of the collected data. It involves extracting key insights, patterns, and trends from our gathered information. This initial analysis will lay the foundation for more in-depth exploration and strategic decision-making based on the available data.
Filtering Unwanted Data:
It's crucial to eliminate laptops from other companies, inadvertently included due to sponsorships or advertisements. Implement a meticulous process to exclude these entries and remove any other extraneous or unwanted data, ensuring the dataset remains focused and relevant to our analysis.
Cleaning the Dataset:
Before delving deeper into the dataset, the initial step involves the removal of laptops not associated with DELL. This cleaning process ensures that only relevant data from DELL, excluding other companies, is retained for subsequent analysis.
To enhance accuracy, eliminate duplicate data entries present in the dataset. This step ensures that each laptop's information is unique, preventing redundancy and providing a more precise representation of the collected data.
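A minimal pandas sketch of both cleaning steps follows. The sample rows and the helper name keep_brand are invented for illustration, and the sketch assumes the brand appears somewhere in the product name.

```python
import pandas as pd

# Invented sample: one other-brand row and one exact duplicate.
df = pd.DataFrame({
    "Name": ["Dell Laptop A", "Other Brand Laptop", "DELL Laptop B", "Dell Laptop A"],
    "Price": ["₹60,000", "₹55,000", "₹80,000", "₹60,000"],
})


def keep_brand(frame, brand="dell"):
    """Keep rows whose name mentions the brand, then drop exact duplicates."""
    is_brand = frame["Name"].str.contains(brand, case=False, na=False)
    return frame[is_brand].drop_duplicates().reset_index(drop=True)


df = keep_brand(df)
print(df)
```

Matching is case-insensitive, so both "Dell" and "DELL" listings are retained.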
Price, Ratings, and Review_Count are currently stored as strings; they will be converted later. Before that conversion, check for null values in these columns to ensure data integrity and completeness:
print("Number of Null values in each column:\n")
Ratings are missing for 24 laptops, so a value of 0 is assigned to indicate no rating. The data type of the Ratings column is then changed to float, which improves consistency and simplifies further analysis.
Now, remove all null values.
Creating Processor Column:
After the removal of null rows, it's imperative to adjust the index values. Ensuring the index correctly aligns with the modified dataset is crucial for streamlined data access and analysis. This correction facilitates a more organized and accurate representation of the data.
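The null check, the rating fill, the removal of null rows, and the index reset can be sketched in pandas as follows. The sample rows are invented, and the sketch assumes rating text of the form "4.0 out of 5 stars", from which the leading number is kept.

```python
import pandas as pd

# Invented sample: one missing price and one missing rating.
df = pd.DataFrame({
    "Name": ["Dell A", "Dell B", "Dell C"],
    "Price": ["₹60,000", None, "₹80,000"],
    "Ratings": ["4.0 out of 5 stars", "3.5 out of 5 stars", None],
})

print("Number of Null values in each column:\n")
print(df.isnull().sum())

# Unrated laptops get 0; the leading number of the text becomes a float.
df["Ratings"] = df["Ratings"].fillna("0").str.split().str[0].astype(float)

# Drop rows that still contain nulls, then renumber the index from 0.
df = df.dropna().reset_index(drop=True)
print(df)
```

Passing drop=True to reset_index() discards the old index instead of keeping it as an extra column.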
A new column specifies the processor name for each laptop. This addition provides a detailed breakdown of the processor information, facilitating more comprehensive analysis and insights into the dataset.
Check that the new processor column has been added to the dataset. This step confirms the column is present and ready for further analysis.
Since some laptops may not specify the processor, implement a solution to handle these instances of missing processor information. This keeps the dataset comprehensive and accurate despite variations in the available details.
Removing Laptops with Missing Processor Information:
Identify and exclude laptops from the dataset that do not provide any information regarding the processor name. This ensures that the dataset only includes entries with relevant processor details, contributing to the accuracy and relevance of the analysis.
Determine the current number of laptops remaining in the dataset after implementing the necessary cleaning and filtering procedures. This count provides valuable insight into the dataset's size and completeness, paving the way for subsequent analyses.
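One possible way to build and clean the processor column is sketched below. It assumes the processor maker (Intel or AMD) appears as a word in the product name; the "Name" and "Processor" column names and the sample rows are illustrative.

```python
import re

import pandas as pd

# Invented sample: the third name does not mention a processor.
df = pd.DataFrame({"Name": [
    "Dell laptop, Intel Core i7, 16 GB RAM",
    "Dell laptop, AMD Ryzen 5, 8 GB RAM",
    "Dell laptop, 15.6 inch, 8 GB RAM",
]})

# Pull the maker out of the name; rows without a match become NaN.
found = df["Name"].str.extract(r"\b(intel|amd)\b", flags=re.IGNORECASE, expand=False)
df["Processor"] = found.str.lower().map({"intel": "Intel", "amd": "AMD"})

print("Processor" in df.columns)  # confirm the new column exists

# Remove laptops with no processor information and renumber the rows.
df = df.dropna(subset=["Processor"]).reset_index(drop=True)
print(len(df))  # laptops remaining after cleaning
```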
Transform the "Price" column into numerical format for a more standardized and analytically useful representation. This conversion enables efficient numerical operations and facilitates meaningful analysis of the pricing information in the dataset.
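A sketch of the conversion, assuming prices are stored as text such as "₹53,990" and review counts may contain thousands separators. The helper name to_number and the sample review counts are illustrative.

```python
import pandas as pd


def to_number(series):
    """Strip the rupee sign and commas, then convert the text to numbers."""
    cleaned = (
        series.str.replace("₹", "", regex=False)
              .str.replace(",", "", regex=False)
              .str.strip()
    )
    return pd.to_numeric(cleaned)


df = pd.DataFrame({
    "Price": ["₹53,990", "₹2,99,999"],
    "Review_Count": ["7", "1,234"],
})
df["Price"] = to_number(df["Price"])
df["Review_Count"] = to_number(df["Review_Count"])
print(df.dtypes)
```

Removing every comma also handles the Indian digit grouping used in prices like 2,99,999.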
Pricing
Reviews
Visualization
Utilize a barplot to visually represent the distribution of laptops with Intel and AMD processors. This graphical representation provides a clear overview of the processor types present in the dataset, facilitating a quick and informative analysis.
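The bar plot could be produced along these lines. This is a sketch assuming a "Processor" column holding "Intel" or "AMD"; the function names, axis labels, and output file name are illustrative. The plotting libraries are imported inside the function so the counting step can run on its own.

```python
import pandas as pd


def processor_counts(frame):
    """Count how many laptops use each processor maker."""
    return frame["Processor"].value_counts()


def plot_processor_counts(frame, out_path="processor_counts.png"):
    """Draw the counts as a bar plot and save the figure to a file."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    counts = processor_counts(frame)
    fig, ax = plt.subplots()
    sns.barplot(x=counts.index, y=counts.values, ax=ax)
    ax.set_xlabel("Processor")
    ax.set_ylabel("Number of laptops")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path


# Invented sample data.
df = pd.DataFrame({"Processor": ["Intel", "Intel", "AMD"]})
counts = processor_counts(df)
print(counts)
```

In a Jupyter Notebook, the figure can be displayed inline instead of being saved.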
Explore the distribution of laptops based on their ratings and prices. This analysis aims to unveil patterns and trends, offering insights into the relationship between a laptop's rating and its corresponding price. The graphical representation, likely a scatter plot or similar visualization, will provide a comprehensive overview of these two crucial factors, aiding in strategic decision-making and product evaluation.
Examining the rating distribution shows that 50% of the laptops fall within the 0–1 star range, 43.8% within the 3–5 star range, and 6.26% within the 1–2 star range. Because unrated laptops were assigned a rating of 0, the 0–1 star group consists largely of laptops with no rating at all. Among laptops that have been rated, most fall within the 3–5 star range, which suggests a significant level of satisfaction among current customers.
Analyzing the price distribution reveals that the majority of laptops, 63.7%, fall into the mid to high price range, exceeding Rs. 70,000. Notably, there are no laptops priced at or below Rs. 50,000 in the dataset. This information provides insights into the prevailing price brackets of the available laptops, guiding potential customers and influencing purchasing decisions.
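Percentages like these can be computed from boolean masks, as in the sketch below. The four sample rows are invented and do not reproduce the figures reported above; percent_share is a hypothetical helper.

```python
import pandas as pd


def percent_share(mask):
    """Percentage of rows for which the condition holds, to one decimal."""
    return round(100 * mask.mean(), 1)


# Invented sample data.
df = pd.DataFrame({
    "Ratings": [0.0, 0.0, 4.2, 3.3],
    "Price": [53990, 75990, 299999, 90000],
})

unrated = percent_share(df["Ratings"] == 0)       # laptops with no rating
above_70k = percent_share(df["Price"] > 70000)    # mid to high price range
print(unrated, above_70k)
```

The mean of a boolean column is the fraction of True values, which is why multiplying by 100 gives the share directly.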
Develop a versatile function that allows users to input a specific price range and receive a list of laptops falling within that range. This functionality enhances user engagement, providing a tailored approach to explore laptops based on individual budget preferences.
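Such a function might be sketched as follows, assuming the Price column has already been converted to numbers. The function name laptops_in_range and the sample rows are illustrative.

```python
import pandas as pd


def laptops_in_range(frame, low, high):
    """Return laptops priced between low and high (inclusive), cheapest first."""
    mask = frame["Price"].between(low, high)
    return frame.loc[mask].sort_values("Price").reset_index(drop=True)


# Invented sample data.
df = pd.DataFrame({
    "Name": ["Laptop A", "Laptop B", "Laptop C"],
    "Price": [53990, 75990, 299999],
})
result = laptops_in_range(df, 50000, 80000)
print(result)
```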
The returned list
Explore the dataset to identify the most expensive laptops based on the "Price" attribute. This information is crucial for users seeking high-end options and contributes to a comprehensive understanding of the price distribution within the available laptops.
Cheapest One
Ratings
Highest Rated
Least
Most Reviewed
Least reviewed
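The lookups listed above can each be sketched with pandas' nlargest() and nsmallest(). The sample rows are invented, and the column names follow those used earlier in the analysis.

```python
import pandas as pd

# Invented sample data with numeric columns.
df = pd.DataFrame({
    "Name": ["Laptop A", "Laptop B", "Laptop C"],
    "Price": [60000, 90000, 75000],
    "Ratings": [3.5, 0.0, 4.2],
    "Review_Count": [7, 0, 53],
})

most_expensive = df.nlargest(1, "Price")
cheapest = df.nsmallest(1, "Price")
highest_rated = df.nlargest(1, "Ratings")
lowest_rated = df.nsmallest(1, "Ratings")
most_reviewed = df.nlargest(1, "Review_Count")
least_reviewed = df.nsmallest(1, "Review_Count")

print(most_expensive)
print(cheapest)
```

Raising the first argument returns the top or bottom n laptops instead of a single row.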
Conclusion: By using the provided code to extract a .csv file from Amazon India, users can create a DataFrame for visualization or specific data analysis. Additional modifications can cater to different product categories. The insights gained in this project show that most MSI laptops fall within the medium to high price range and predominantly feature Intel processors. Notably, 50% of laptops lack ratings or reviews. The least expensive laptop is Rs. 53,990 (3.3 stars, 7 reviews), while the most expensive is Rs. 2,99,999 (0 stars, 0 reviews). The top-reviewed model is the MSI Bravo 15 Ryzen 7 4800H, priced at Rs. 75,990, with a rating of 4.2 stars and 53 reviews.