I have this folder structure:
app.py  # flask app
app/
    datafoo/
        scrapy.cfg
        crawler.py
        blogs/
            pipelines.py
            settings.py
            middlewares.py
            items.py
            spiders/
                allmusic_feed.py
                allmusic_data/
                    delicate_tracks.jl
scrapy.cfg:
[settings]
default = blogs.settings
allmusic_feed.py:
import scrapy
from blogs.items import AllMusicItem  # assumed location of the item class (blogs/items.py)

class AllMusicDelicateTracks(scrapy.Spider):  # one amongst many spiders
    name = "allmusic_delicate_tracks"
    allowed_domains = ["allmusic.com"]
    start_urls = [
        "http://web.archive.org/web/20160813101056/http://www.allmusic.com/mood/delicate-xa0000000972/songs",
    ]

    def parse(self, response):
        # Each table row is treated as one track; missing cells yield None.
        for sel in response.xpath('//tr'):
            item = AllMusicItem()
            item['artist'] = sel.xpath('.//td[@class="performer"]/a/text()').extract_first()
            item['track'] = sel.xpath('.//td[@class="title"]/a/text()').extract_first()
            yield item
crawler.py:
import json

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def blog_crawler(self, mood):
    item, jl = mood  # ITEM = SPIDER
    process = CrawlerProcess(get_project_settings())
    process.crawl(item, domain='allmusic.com')
    process.start()
    allmusic = []
    allmusic_tracks = []
    allmusic_artists = []
    try:
        # jl is file where crawled data is stored
        with open(jl, 'r+') as t:
            for line in t:
                allmusic.append(json.loads(line))
    except Exception as e:
        print(e, 'try another mood')
    for item in allmusic:
        allmusic_artists.append(item['artist'])
        allmusic_tracks.append(item['track'])
    return zip(allmusic_tracks, allmusic_artists)
app.py:
@app.route('/tracks', methods=['GET','POST'])
def tracks(name):
    from app.datafoo import crawler
    c = crawler()
    mood = ['allmusic_delicate_tracks', 'blogs/spiders/allmusic_data/delicate_tracks.jl']
    results = c.blog_crawler(mood)
    return results
If I simply run the application with python app.py, I get the following error:
ValueError: signal only works in main thread
When I run the application with gunicorn -c gconfig.py app:app --log-level=debug --threads 2, it just hangs there:
127.0.0.1 - - [29/Jan/2018:03:40:36 -0200] "GET /tracks HTTP/1.1" 500 291 "http://127.0.0.1:8080/menu" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
Finally, running it with gunicorn -c gconfig.py app:app --log-level=debug --threads 2 --error-logfile server.log, I get:
server.log
[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Current configuration:
proxy_protocol: False
worker_connections: 1000
statsd_host: None
max_requests_jitter: 0
post_fork: <function post_fork at 0x1027da848>
errorlog: server.log
enable_stdio_inheritance: False
worker_class: sync
ssl_version: 2
suppress_ragged_eofs: True
syslog: False
syslog_facility: user
when_ready: <function when_ready at 0x1027da9b0>
pre_fork: <function pre_fork at 0x1027da938>
cert_reqs: 0
preload_app: False
keepalive: 5
accesslog: -
group: 20
graceful_timeout: 30
do_handshake_on_connect: False
spew: False
workers: 16
proc_name: None
sendfile: None
pidfile: None
umask: 0
on_reload: <function on_reload at 0x10285c2a8>
pre_exec: <function pre_exec at 0x1027da8c0>
worker_tmp_dir: None
limit_request_fields: 100
pythonpath: None
on_exit: <function on_exit at 0x102861500>
config: gconfig.py
logconfig: None
check_config: False
statsd_prefix:
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
reload_engine: auto
proxy_allow_ips: ['127.0.0.1']
pre_request: <function pre_request at 0x10285cde8>
post_request: <function post_request at 0x10285ced8>
forwarded_allow_ips: ['127.0.0.1']
worker_int: <function worker_int at 0x1027daa28>
raw_paste_global_conf: []
threads: 2
max_requests: 0
chdir: /Users/me/Documents/Code/Apps/app
daemon: False
user: 501
limit_request_line: 4094
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
certfile: None
on_starting: <function on_starting at 0x10285c140>
post_worker_init: <function post_worker_init at 0x10285c848>
child_exit: <function child_exit at 0x1028610c8>
worker_exit: <function worker_exit at 0x102861230>
paste: None
default_proc_name: app:app
syslog_addr: unix:///var/run/syslog
syslog_prefix: None
ciphers: TLSv1
worker_abort: <function worker_abort at 0x1027daaa0>
loglevel: debug
bind: ['127.0.0.1:8080']
raw_env: []
initgroups: False
capture_output: False
reload: False
limit_request_field_size: 8190
nworkers_changed: <function nworkers_changed at 0x102861398>
timeout: 120
keyfile: None
ca_certs: None
tmp_upload_dir: None
backlog: 2048
logger_class: gunicorn.glogging.Logger
[2018-01-30 13:41:39 -0200] [4580] [INFO] Starting gunicorn 19.7.1
[2018-01-30 13:41:39 -0200] [4580] [DEBUG] Arbiter booted
[2018-01-30 13:41:39 -0200] [4580] [INFO] Listening at: http://127.0.0.1:8080 (4580)
[2018-01-30 13:41:39 -0200] [4580] [INFO] Using worker: threads
[2018-01-30 13:41:39 -0200] [4580] [INFO] Server is ready. Spawning workers
[2018-01-30 13:41:39 -0200] [4583] [INFO] Booting worker with pid: 4583
[2018-01-30 13:41:39 -0200] [4583] [INFO] Worker spawned (pid: 4583)
[2018-01-30 13:41:39 -0200] [4584] [INFO] Booting worker with pid: 4584
[2018-01-30 13:41:39 -0200] [4584] [INFO] Worker spawned (pid: 4584)
[2018-01-30 13:41:39 -0200] [4585] [INFO] Booting worker with pid: 4585
[2018-01-30 13:41:39 -0200] [4585] [INFO] Worker spawned (pid: 4585)
[2018-01-30 13:41:40 -0200] [4586] [INFO] Booting worker with pid: 4586
[2018-01-30 13:41:40 -0200] [4586] [INFO] Worker spawned (pid: 4586)
[2018-01-30 13:41:40 -0200] [4587] [INFO] Booting worker with pid: 4587
[2018-01-30 13:41:40 -0200] [4587] [INFO] Worker spawned (pid: 4587)
[2018-01-30 13:41:40 -0200] [4588] [INFO] Booting worker with pid: 4588
[2018-01-30 13:41:40 -0200] [4588] [INFO] Worker spawned (pid: 4588)
[2018-01-30 13:41:40 -0200] [4589] [INFO] Booting worker with pid: 4589
[2018-01-30 13:41:40 -0200] [4589] [INFO] Worker spawned (pid: 4589)
[2018-01-30 13:41:40 -0200] [4590] [INFO] Booting worker with pid: 4590
[2018-01-30 13:41:40 -0200] [4590] [INFO] Worker spawned (pid: 4590)
[2018-01-30 13:41:40 -0200] [4591] [INFO] Booting worker with pid: 4591
[2018-01-30 13:41:40 -0200] [4591] [INFO] Worker spawned (pid: 4591)
[2018-01-30 13:41:40 -0200] [4592] [INFO] Booting worker with pid: 4592
[2018-01-30 13:41:40 -0200] [4592] [INFO] Worker spawned (pid: 4592)
[2018-01-30 13:41:40 -0200] [4595] [INFO] Booting worker with pid: 4595
[2018-01-30 13:41:40 -0200] [4595] [INFO] Worker spawned (pid: 4595)
[2018-01-30 13:41:40 -0200] [4596] [INFO] Booting worker with pid: 4596
[2018-01-30 13:41:40 -0200] [4596] [INFO] Worker spawned (pid: 4596)
[2018-01-30 13:41:40 -0200] [4597] [INFO] Booting worker with pid: 4597
[2018-01-30 13:41:40 -0200] [4597] [INFO] Worker spawned (pid: 4597)
[2018-01-30 13:41:40 -0200] [4598] [INFO] Booting worker with pid: 4598
[2018-01-30 13:41:40 -0200] [4598] [INFO] Worker spawned (pid: 4598)
[2018-01-30 13:41:40 -0200] [4599] [INFO] Booting worker with pid: 4599
[2018-01-30 13:41:40 -0200] [4599] [INFO] Worker spawned (pid: 4599)
[2018-01-30 13:41:40 -0200] [4600] [INFO] Booting worker with pid: 4600
[2018-01-30 13:41:40 -0200] [4600] [INFO] Worker spawned (pid: 4600)
[2018-01-30 13:41:40 -0200] [4580] [DEBUG] 16 workers
[2018-01-30 13:41:47 -0200] [4583] [DEBUG] GET /menu
[2018-01-30 13:41:54 -0200] [4584] [DEBUG] GET /tracks
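Note:
In this SO answer I learned that, in order to integrate Flask and Scrapy, you can use:
1. Python subprocess
2. Twisted-Klein + Scrapy
3. ScrapyRT
But I had no luck adapting my specific code to any of these solutions.
I think a subprocess would be simpler and would suffice, since the user experience rarely requires a scraping thread, but I'm not sure.
Can anyone point me in the right direction?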
Best answer
Here is a minimal example of how to do it with ScrapyRT.
This is the project structure:
project/
├── scraping
│   ├── example
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── middlewares.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       ├── __init__.py
│   │       └── quotes.py
│   └── scrapy.cfg
└── webapp
    └── example.py
The scraping directory contains the Scrapy project. This project contains one spider, quotes.py, which scrapes some quotes from quotes.toscrape.com:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'text': quote.xpath('normalize-space(./span[@class="text"])').extract_first()
            }
To start ScrapyRT and have it listen for scraping requests, go to the Scrapy project's directory, scraping, and issue the scrapyrt command:
$ cd ./project/scraping
$ scrapyrt
ScrapyRT will now listen on localhost:9080.
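To check the endpoint from Python before wiring it into a web app, a minimal sketch using only the standard library might look like the following. The helper names crawl_url, extract_items and fetch_items are hypothetical, and the "status", "message" and "items" keys reflect my understanding of ScrapyRT's JSON response, so verify them against the version you have installed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def crawl_url(spider_name, base="http://localhost:9080"):
    """Build the crawl.json URL for one spider (hypothetical helper)."""
    query = urlencode({"spider_name": spider_name, "start_requests": "true"})
    return "{}/crawl.json?{}".format(base, query)


def extract_items(payload):
    """Return the scraped items from a decoded response, or raise on failure."""
    # Assumption: a successful response carries "status": "ok" and an "items" list.
    if payload.get("status") != "ok":
        raise RuntimeError("crawl failed: {!r}".format(payload.get("message")))
    return payload.get("items", [])


def fetch_items(spider_name, timeout=60):
    """Run the spider through ScrapyRT; blocks until the crawl has finished."""
    with urlopen(crawl_url(spider_name), timeout=timeout) as resp:
        return extract_items(json.loads(resp.read().decode("utf-8")))
```

With ScrapyRT running, fetch_items('quotes') should return a list of dicts with the 'author' and 'text' keys produced by the spider above.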
The webapp directory contains a simple Flask application that scrapes quotes on demand (using the spider above) and simply displays them to the user:
from __future__ import unicode_literals
import json
import requests
from flask import Flask

app = Flask(__name__)

@app.route('/')
def show_quotes():
    # Ask ScrapyRT to run the 'quotes' spider, starting from its start_urls.
    params = {
        'spider_name': 'quotes',
        'start_requests': True
    }
    response = requests.get('http://localhost:9080/crawl.json', params)
    data = json.loads(response.text)
    # Render each scraped item as one paragraph.
    result = '\n'.join('<p><b>{}</b> - {}</p>'.format(item['author'], item['text'])
                       for item in data['items'])
    return result
Start the application:
$ cd ./project/webapp
$ FLASK_APP=example.py flask run
Now, when you point your browser at localhost:5000, you will get a list of quotes freshly scraped from quotes.toscrape.com.
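For comparison, the subprocess option mentioned in the question can be sketched roughly as follows. Running scrapy crawl in a child process keeps Twisted's reactor and Scrapy's signal handlers out of the Flask worker thread, which appears to be what triggers "signal only works in main thread". The function names, the project_dir argument and the timeout value are assumptions for illustration, not part of the original code.

```python
import json
import os
import subprocess
import tempfile


def crawl_command(spider_name, output_path):
    """Command line that runs one spider and writes its items to output_path."""
    # The .jl extension selects Scrapy's JSON lines feed format.
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def read_json_lines(path):
    """Parse a .jl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def run_spider(spider_name, project_dir, timeout=120):
    """Run the spider in a child process and return (track, artist) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        output_path = os.path.join(tmp, "items.jl")
        # cwd must be the directory that contains scrapy.cfg.
        subprocess.run(crawl_command(spider_name, output_path),
                       cwd=project_dir, check=True, timeout=timeout)
        # Scrapy may not create the file at all when nothing was scraped.
        items = read_json_lines(output_path) if os.path.exists(output_path) else []
    return [(item["track"], item["artist"]) for item in items]
```

In the question's layout, a call such as run_spider('allmusic_delicate_tracks', 'app/datafoo') would return a list of pairs that the view can render into HTML or JSON. The request blocks for the duration of the crawl, so this suits occasional scrapes rather than high-traffic routes.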
Regarding python - Running Scrapy from Flask, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48480009/