python手机号前7位归属地爬虫代码实例

 更新时间:2020年03月31日 10:26:21   作者:wanli001  
这篇文章主要介绍了python手机号前7位归属地爬虫代码实例,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持脚本之家。

相关文章

  • Pytest中skip skipif跳过用例详解

    Pytest中skip skipif跳过用例详解

    今天给大家带来的是关于Python的相关知识,文章围绕着Pytest中skip skipif跳过用例展开,文中有非常详细的介绍及代码示例,需要的朋友可以参考下
    2021-06-06
  • 利用python绘制带有时间线的柱状图

    利用python绘制带有时间线的柱状图

    这篇文章主要为大家详细介绍了如何使用python绘制出带有时间线的柱状图,文中的示例代码讲解的非常详细,具有一定的学习与借鉴价值,需要的可以参考一下
    2023-07-07
  • Python实现绘制凸包的示例代码

    Python实现绘制凸包的示例代码

    凸包(Convex Hull)是一个计算几何(图形学)中的概念。这篇文章主要为大家详细介绍了Python绘制凸包的示例代码,感兴趣的小伙伴可以了解一下
    2023-05-05
  • Python 专题六 局部变量、全局变量global、导入模块变量

    Python 专题六 局部变量、全局变量global、导入模块变量

    本文主要讲述python全局变量、局部变量和导入模块变量的方法。具有很好的参考价值,下面跟着小编一起来看下吧
    2017-03-03
  • python格式的Caffe图片数据均值计算学习

    python格式的Caffe图片数据均值计算学习

    这篇文章主要为大家介绍了python格式的Caffe图片数据均值计算学习示例详解,有需要的朋友可以借鉴参考下,希望能够有所帮助,祝大家多多进步,早日升职加薪
    2022-06-06
  • Python爬虫框架NewSpaper使用详解

    Python爬虫框架NewSpaper使用详解

    这篇文章主要为大家介绍了Python爬虫框架NewSpaper使用详解,有需要的朋友可以借鉴参考下,希望能够有所帮助,祝大家多多进步,早日升职加薪
    2022-08-08
  • pywinauto自动化操作记事本

    pywinauto自动化操作记事本

    这篇文章主要为大家详细介绍了pywinauto自动化操作记事本,具有一定的参考价值,感兴趣的小伙伴们可以参考一下
    2019-08-08
  • Python中Django 后台自定义表单控件

    Python中Django 后台自定义表单控件

    本篇文章主要介绍了Python中Django 后台自定义表单控件,其实 django 已经为我们提供了一些可用的表单控件,比如:多选框、单选按钮等,有兴趣的开业了解一下。
    2017-03-03
  • Pytorch GPU显存充足却显示out of memory的解决方式

    Pytorch GPU显存充足却显示out of memory的解决方式

    今天小编就为大家分享一篇Pytorch GPU显存充足却显示out of memory的解决方式,具有很好的参考价值,希望对大家有所帮助。一起跟随小编过来看看吧
    2020-01-01
  • 在PyCharm中高效使用远程文件编辑功能的实现

    在PyCharm中高效使用远程文件编辑功能的实现

    PyCharm作为业界领先的集成开发环境(IDE),提供了强大的本地和远程开发功能,本文详细介绍了如何在PyCharm中使用远程文件编辑功能,希望能够帮助你提高远程开发的效率和体验
    2024-08-08

最新评论