Amazon identifies every product for sale with a ten-character alphanumeric code called an ASIN (Amazon Standard Identification Number). Whenever you visit a product page on Amazon, this code is somewhere in the URL; however, it's not always in the same place. This makes it a little tricky to pull the ASIN out of an arbitrary URL.
There are many ways to solve this problem. I chose to use tests and Ruby regular expressions. This may not be the best way to solve this problem, but it works for me. If you'd like to learn more about Ruby, I recommend reading The Ruby Programming Language: Everything You Need to Know.
Problem Solving Methodology
First, I went around Amazon clicking on as many links and products as I could to find as many different link formats as possible. I then used those links to build the test below. This solution is part of a Ruby on Rails application, so I show the file location of each code snippet as a comment at the top of the code.
# test/models/amazon_test.rb
require 'test_helper'

class AmazonTest < ActiveSupport::TestCase
  def test_get_asin_from_url
    url = "https://smile.amazon.com/Programmable-Touchscreen-Thermostat-Geofencing-HomeKit/dp/A01LTHM8LG/ref=s9u_simh_gw_i1?_encoding=UTF8&fpl=fresh&pd_rd_i=B01LTHM8LG&pd_rd_r=21CHQY4CPXGGXZ5G3Q71&pd_rd_w=imw1F&pd_rd_wg=CNLFs&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=XDXMZR4E839F0JD45S0F&pf_rd_r=XDXMZR4E839F0JD45S0F&pf_rd_t=36701&pf_rd_p=781f4767-b4d4-466b-8c26-2639359664eb&pf_rd_p=781f4767-b4d4-466b-8c26-2639359664eb&pf_rd_i=desktop"
    asin = "A01LTHM8LG"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "http://www.amazon.com/Kindle-Wireless-Reading-Display-Generation/dp/B0015T963C"
    asin = "B0015T963C"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "http://www.amazon.com/dp/C0015T963C"
    asin = "C0015T963C"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "http://www.amazon.com/gp/product/D0015T963C"
    asin = "D0015T963C"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "http://www.amazon.com/gp/product/glance/E0015T963C"
    asin = "E0015T963C"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "https://smile.amazon.com/gp/offer-listing/F018Y23P7K/ref=dp_olp_all_mbc?ie=UTF8&condition=all"
    asin = "F018Y23P7K"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "https://smile.amazon.com/product-reviews/G018Y23P7K/ref=acr_offerlistingpage_text?ie=UTF8&reviewerType=avp_only_reviews&showViewpoints=1"
    asin = "G018Y23P7K"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "https://smile.amazon.com/forum/-/Tx3FTP6XCFXMJAO/ref=ask_dp_dpmw_al_hza?asin=H018Y23P7K"
    asin = "H018Y23P7K"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "https://smile.amazon.com/gp/customer-reviews/R1VKN59YMEK5PC/ref=cm_cr_arp_d_viewpnt?ie=UTF8&ASIN=I018Y23P7K#R1VKN59YMEK5PC"
    asin = "I018Y23P7K"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    url = "https://smile.amazon.com/Korean-Made-Simple-beginners-learning/dp/1497445825/ref=sr_1_1?ie=UTF8&qid=1493580746&sr=8-1&keywords=korean+made+simple"
    asin = "1497445825"
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin

    # Not an Amazon URL, so no ASIN should come back
    url = "http://stackoverflow.com/questions/1764605/scrape-asin-from-amazon-url-using-javascript"
    asin = false
    calculated_asin = Amazon.get_asin_from_url(url)
    assert_equal asin, calculated_asin
  end
end
Notice that I also included a false test case at the end because I don't want to return an ASIN if the string being processed isn't an Amazon URL. Also, the ASINs in the test have been modified slightly (I changed the first letter of each one) to make it easy to figure out which of the test cases was failing.
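Changing the first letter of each ASIN is one way to spot the failing case. Another option, sketched here in plain Ruby, is to keep the URLs in a table and report the URL itself on a mismatch. The names asin_mismatches, CASES and DEMO_EXTRACTOR are hypothetical and not part of the application; the demo extractor is a deliberately tiny stand-in, not the real Amazon.get_asin_from_url.

```ruby
# Hypothetical helper: runs an extractor over a table of url => expected pairs
# and returns one message per mismatch, so a failure names the offending URL.
def asin_mismatches(cases, extractor)
  cases.each_with_object([]) do |(url, expected), failures|
    actual = extractor.call(url)
    next if actual == expected

    failures << "#{url}: expected #{expected.inspect}, got #{actual.inspect}"
  end
end

# A few of the URLs from the test above, mapped to the value we expect back.
CASES = {
  "http://www.amazon.com/dp/C0015T963C" => "C0015T963C",
  "http://www.amazon.com/gp/product/D0015T963C" => "D0015T963C",
  "http://stackoverflow.com/questions/1764605/scrape-asin-from-amazon-url-using-javascript" => false
}.freeze

# Stand-in extractor for demonstration only: it handles just /dp/ and /gp/product/.
DEMO_EXTRACTOR = lambda do |url|
  match = url.match(%r{/(?:dp|gp/product)/(\w{10})(?:/|\z)})
  match ? match[1] : false
end

# Print any mismatches; nothing is printed when every case passes.
asin_mismatches(CASES, DEMO_EXTRACTOR).each { |message| puts message }
```

Inside a Rails test the same idea fits in a loop over the table, since Minitest's assert_equal accepts an optional message as its third argument and the URL can be passed there.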
Once I had my tests set up, it was pretty easy to just go to town on the regular expressions until I got something that both made sense and passed my tests:
# models/amazon.rb
class Amazon
  # Return the ASIN from a URL copied and pasted by the user
  # Return false if no ASIN is found
  def self.get_asin_from_url(amazon_url)
    if amazon_url.match(/\/dp\/(\w{10})(\/|\Z)/)
      # /dp/B0015T963C
      asin = $1
    elsif amazon_url.match(/\/gp\/\w*?\/(\w{10})(\/|\Z)/)
      # /gp/product/D0015T963C
      asin = $1
    elsif amazon_url.match(/\/gp\/\w*?\/\w*?\/(\w{10})(\/|\Z)/)
      # /gp/product/glance/E0015T963C
      asin = $1
    elsif amazon_url.match(/\/gp\/[\w-]*?\/(\w{10})(\/|\Z)/)
      # /gp/offer-listing/F018Y23P7K
      asin = $1
    elsif amazon_url.match(/\/product-reviews\/(\w{10})(\/|\Z)/)
      # /product-reviews/G018Y23P7K
      asin = $1
    elsif amazon_url.match(/[?&]asin=(\w{10})(&|#|\Z)/i)
      # ?asin=H018Y23P7K
      # &ASIN=H018Y23P7K
      asin = $1
    else
      asin = false
    end
  end
end
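One possible variation, sketched under a few assumptions: split the URL with Ruby's standard URI library first, then match the path and the query string separately against a short list of patterns. The module name AsinExtractor is hypothetical. The character class [A-Z0-9]{10} assumes ASINs are ten uppercase letters or digits (\w would also accept underscores and lowercase), and the host check is only a loose heuristic for "looks like an Amazon domain", not a validated list.

```ruby
require 'uri'

# Hypothetical module name; not part of the original application.
module AsinExtractor
  # Assumption: an ASIN is ten uppercase letters or digits.
  # Path shapes are taken from the test URLs in this post.
  PATH_PATTERNS = [
    %r{/dp/([A-Z0-9]{10})(?:/|\z)},                  # /dp/B0015T963C
    %r{/gp/(?:[\w-]+/){1,2}([A-Z0-9]{10})(?:/|\z)},  # /gp/product/..., /gp/product/glance/...
    %r{/product-reviews/([A-Z0-9]{10})(?:/|\z)}      # /product-reviews/G018Y23P7K
  ].freeze

  # asin=... or ASIN=... as a query parameter; only the key is case-insensitive.
  QUERY_PATTERN = /(?:\A|&)(?i:asin)=([A-Z0-9]{10})(?:&|\z)/

  # Loose heuristic: the host is amazon.<tld> or a subdomain of it.
  AMAZON_HOST = /(?:\A|\.)amazon\.[a-z.]{2,6}\z/i

  # Returns the ASIN as a String, or false when none is found.
  def self.call(url)
    uri = URI.parse(url)
    return false unless uri.host.to_s.match?(AMAZON_HOST)

    PATH_PATTERNS.each do |pattern|
      match = uri.path.match(pattern)
      return match[1] if match
    end

    match = uri.query.to_s.match(QUERY_PATTERN)
    match ? match[1] : false
  rescue URI::InvalidURIError
    false
  end
end

puts AsinExtractor.call("http://www.amazon.com/gp/product/glance/E0015T963C")
# prints "E0015T963C"
```

Because the query string is split off before matching, a URL such as /dp/C0015T963C?tag=x also resolves in this sketch, whereas the patterns above expect the ASIN to be followed by a slash or the end of the string. The trade-off is that a URL that URI.parse rejects returns false instead of being scanned as a raw string.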
If this helped you, or you have any suggestions on how to improve it, let me know in the comments!
Photo by Robert Scoble